
AI-Ready Documents: The Complete Workflow

From pseudonymization to re-identification in 6 steps

Before a document can be safely submitted to an AI tool, it must be made "AI-ready": personal data is replaced with consistent pseudonyms, the AI processes the pseudonymized version, and then the original data is restored. This workflow comprises six clearly defined steps.

This article describes each step in detail — from document import to re-identification — and explains what matters at each stage.

Step 1: Import Documents

PDF, Word, email — over 70 formats are supported

In the first step, the documents to be processed are imported into Docuflair Mask. The software supports over 70 file formats:

  • Office documents: Word (.docx), Excel (.xlsx), PowerPoint (.pptx)
  • PDF: Native PDFs and scanned PDFs (with integrated OCR)
  • Scans and images: TIFF, JPEG, PNG, BMP
  • Email: MSG, EML (including attachments)
  • Others: RTF, TXT, CSV, HTML and many more

Documents can be imported individually or in batches. For scanned documents, OCR text recognition is performed automatically to make the text machine-readable.
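
As a rough illustration of batch import, the sketch below sorts incoming files by the kind of handling they are likely to need. The function name `plan_import` and the extension groups are assumptions derived from the format list above, not part of Docuflair Mask. PDFs cannot be classified by extension alone, since a scanned PDF only reveals itself on inspection, so they land in the general group here.

```python
from pathlib import Path

# Hypothetical extension groups, taken from the format list above.
IMAGE_EXTS = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}
EMAIL_EXTS = {".msg", ".eml"}


def plan_import(paths):
    """Group file names by the handling they probably need."""
    plan = {"ocr": [], "email": [], "text": []}
    for path in map(Path, paths):
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            plan["ocr"].append(path.name)    # image formats need OCR first
        elif ext in EMAIL_EXTS:
            plan["email"].append(path.name)  # body and attachments are unpacked
        else:
            plan["text"].append(path.name)   # Office, PDF, RTF, TXT, CSV, HTML
    return plan


if __name__ == "__main__":
    print(plan_import(["scan.TIFF", "offer.eml", "contract.docx"]))
```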

Important: All documents remain on-premises. Import is performed locally — no document is transferred to external servers for pseudonymization.

Step 2: Detect PII Automatically

Nine categories of personal data are identified automatically

The software analyses the document content and detects personal data automatically, distinguishing nine categories:

Category                   | Detection method              | Examples
Names                      | Dictionary + context analysis | John Smith, Dr Miller
Email addresses            | Pattern recognition           | john@company.co.uk
Addresses                  | Context-based                 | 123 Main Street, London EC1A 1BB
Phone numbers              | Pattern recognition           | +44 20 7946 0958
IBAN numbers               | Check digit validation        | GB29 NWBK 6016 1331 9268 19
Company names              | Dictionary + context analysis | ABC Ltd, XYZ plc
Dates                      | Pattern recognition           | 15/03/2026, 2026-03-15
Tax numbers                | Country-specific formats      | VAT ID, tax reference
National insurance numbers | Country-specific formats      | NI number, SSN
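
The check digit validation listed for IBAN numbers can be illustrated with the standard mod-97 test: move the first four characters to the end, convert letters to numbers (A = 10 through Z = 35) and check that the result leaves a remainder of 1 when divided by 97. The function name `iban_is_valid` is an assumption for this sketch, and country-specific length rules are left out.

```python
def iban_is_valid(iban: str) -> bool:
    """Check an IBAN with the mod-97 test (no per-country length rules)."""
    s = iban.replace(" ", "").upper()
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    rearranged = s[4:] + s[:4]  # country code and check digits go last
    # int(ch, 36) maps 0-9 to themselves and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


if __name__ == "__main__":
    print(iban_is_valid("GB29 NWBK 6016 1331 9268 19"))  # example from the table
```

A check like this filters out random digit strings that merely look like an IBAN, which keeps false positives low compared with pattern matching alone.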

Detection combines rule-based pattern recognition with dictionary matching and context analysis. Dictionaries can be individually maintained — for instance by importing from Active Directory or CSV files.
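
To make the combination of pattern recognition and dictionary matching concrete, here is a minimal sketch. The regular expressions, the `detect` function and the `Person` label are illustrative assumptions and far simpler than what a production detector needs. Context analysis is not modelled at all.

```python
import re

# Illustrative patterns only; they cover the example formats from the table.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Date": re.compile(r"\b(?:\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b"),
    "Phone": re.compile(r"\+\d{1,3}(?: ?\d{2,4}){2,4}"),
}


def detect(text, name_dictionary=()):
    """Return (category, matched text, start offset) tuples sorted by offset."""
    hits = []
    # Rule-based pattern recognition.
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(), match.start()))
    # Dictionary matching, e.g. names imported from a CSV file.
    for name in name_dictionary:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append(("Person", match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    sample = "John Smith (john@company.co.uk) signed on 15/03/2026."
    for hit in detect(sample, ["John Smith"]):
        print(hit)
```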

Step 3: Generate Consistent Pseudonyms

"John Smith" becomes "Person_A" in ALL documents

After detection, personal data is replaced with consistent pseudonyms. The core principle: same person = same pseudonym — across all documents and all processing runs.

Consistency Is Key

If John Smith appears in 50 documents, he becomes Person_A everywhere. If his colleague Jane Brown appears in 30 documents, she becomes Person_B everywhere. This preserves relationships:

  • In document 1: "Person_A signed the contract with Company_A"
  • In document 2: "Person_A received an email from Person_C"
  • In document 3: "The invoice was sent to Address_A of Person_A"

The AI recognises that the same person is involved throughout — without knowing their identity.
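
The consistency principle boils down to a lookup table that hands out a new pseudonym only the first time a value is seen. The sketch below assumes a hypothetical `Pseudonymizer` class and spreadsheet-style suffixes (A to Z, then AA, AB and so on); how Docuflair Mask numbers pseudonyms beyond Z is not stated in this article.

```python
from collections import defaultdict


def label(index: int) -> str:
    """0 -> A, 25 -> Z, 26 -> AA (spreadsheet-style letters)."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


class Pseudonymizer:
    def __init__(self):
        self.table = {}                   # (category, original) -> pseudonym
        self.counters = defaultdict(int)  # next free index per category

    def pseudonym_for(self, category, original):
        """Return the existing pseudonym, or assign the next free one."""
        key = (category, original)
        if key not in self.table:
            self.table[key] = f"{category}_{label(self.counters[category])}"
            self.counters[category] += 1
        return self.table[key]

    def apply(self, text, entities):
        """Replace detected (category, original) pairs, longest values first."""
        for category, original in sorted(entities, key=lambda e: -len(e[1])):
            text = text.replace(original, self.pseudonym_for(category, original))
        return text


if __name__ == "__main__":
    p = Pseudonymizer()
    found = [("Person", "John Smith"), ("Company", "ABC Ltd")]
    print(p.apply("John Smith signed the contract with ABC Ltd", found))
```

Because the table is keyed by category and original value, John Smith maps to the same pseudonym no matter how many documents pass through the same `Pseudonymizer` instance.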

Cross-Batch Pseudonyms

Consistency applies not only within a single processing run but also across batches. If John Smith was pseudonymized as Person_A in batch 1, he will also receive the pseudonym Person_A in batches 2, 3 and all subsequent runs. The replacement table is continuously extended.
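
Cross-batch consistency follows from persisting the replacement table between runs and extending it rather than rebuilding it. The following sketch assumes a plain JSON file and a hypothetical `pseudonymize_batch` function, and it handles a single category only. It stores the table unencrypted purely to keep the example short, unlike the encrypted storage described later in this article.

```python
import json
from pathlib import Path


def letters(n: int) -> str:
    """0 -> A, 25 -> Z, 26 -> AA."""
    out = ""
    n += 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        out = chr(ord("A") + rem) + out
    return out


def pseudonymize_batch(names, table_path, prefix="Person"):
    """Assign pseudonyms for one batch, reusing and extending the stored table."""
    path = Path(table_path)
    # Load the table from earlier batches, if there is one.
    table = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for name in names:
        if name not in table:
            # Single-category sketch: the table size is the next free index.
            table[name] = f"{prefix}_{letters(len(table))}"
    path.write_text(json.dumps(table, indent=2), encoding="utf-8")
    return {name: table[name] for name in names}
```

Calling the function for a second batch that contains John Smith again returns Person_A for him, while names not seen before receive the next free pseudonym.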

Step 4: Export Pseudonymized Version

The AI-ready document is prepared for handover to external tools

After pseudonymization, the document is exported. All detected personal data in it has been replaced with pseudonyms, so it no longer contains real names, addresses or other identifiers. This AI-ready document can be safely submitted to external AI tools:

  • ChatGPT — for analysis, summarisation or text creation
  • DeepL — for translations
  • Copilot — for email summaries and document analysis
  • Claude — for report review and contract analysis
  • Other AI tools — for any processing purpose

Since no directly identifying personal data is transferred, the GDPR risk is substantially reduced, regardless of which AI tool is used and where its servers are located.

Step 5: Receive AI Result

The AI delivers a result based on the pseudonymized data

The AI tool processes the pseudonymized document and delivers a result. This result also contains only pseudonyms:

AI summary (pseudonymized): "The contract between Person_A and Company_A governs the delivery of Product_A. Company_A commits to delivery by Date_B. In case of delay, Person_A is entitled to a contractual penalty of Amount_B per day."

The summary keeps its full substance; only the personal data still appears as pseudonyms. In the next step, the pseudonyms are replaced with the original data.

Step 6: Re-Identification

Pseudonyms are replaced with original data via the replacement table

In the final step, the pseudonyms in the AI result are replaced with the original data. Docuflair Mask uses the encrypted replacement table to turn Person_A back into John Smith, Company_A back into ABC Ltd and Address_A back into 123 Main Street.

Finished result: "The contract between John Smith and ABC Ltd governs the delivery of Product X. ABC Ltd commits to delivery by 30 April 2026. In case of delay, John Smith is entitled to a contractual penalty of EUR 500 per day."
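
Re-identification is the reverse lookup: every pseudonym in the AI result is swapped for its original value. The sketch below assumes a hypothetical `reidentify` function and an already decrypted table mapping pseudonyms to originals. It replaces everything in one pass with word boundaries, so Person_A is never matched inside Person_AA and restored text is not scanned again.

```python
import re


def reidentify(text, replacement_table):
    """Replace pseudonyms with originals; the table maps pseudonym -> original."""
    if not replacement_table:
        return text
    alternatives = sorted(replacement_table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in alternatives) + r")\b"
    )
    # Single pass: each match is looked up once in the table.
    return pattern.sub(lambda m: replacement_table[m.group()], text)


if __name__ == "__main__":
    table = {"Person_A": "John Smith", "Company_A": "ABC Ltd"}
    print(reidentify("Person_A signed the contract with Company_A", table))
```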

Replacement Table Security

  • Encrypted storage: The table is stored with AES encryption
  • Separate storage: The table is stored separately from the pseudonymized document
  • Access control: Only authorised users can view the table and perform re-identification
  • Audit trail: Every access to the table and every re-identification is logged
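
Access control and the audit trail can be pictured as a thin wrapper around the table that checks the user and records every attempt, whether it succeeds or not. The class `GuardedTable`, its fields and the user names are assumptions for illustration only. Encryption at rest is deliberately left out of this sketch.

```python
from datetime import datetime, timezone


class GuardedTable:
    """Replacement table with a user check and a log of every access attempt."""

    def __init__(self, table, authorised_users):
        self._table = dict(table)              # pseudonym -> original
        self._authorised = set(authorised_users)
        self.audit_log = []

    def lookup(self, user, pseudonym):
        allowed = user in self._authorised
        # Log before acting, so denied attempts are recorded as well.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "pseudonym": pseudonym,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not re-identify data")
        return self._table[pseudonym]


if __name__ == "__main__":
    guarded = GuardedTable({"Person_A": "John Smith"}, {"alice"})
    print(guarded.lookup("alice", "Person_A"))
    print(len(guarded.audit_log), "log entry")
```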

Experience the Workflow Live

See in a 15-minute demo how Docuflair Mask automates the entire workflow from pseudonymization to re-identification. On-premises and GDPR-compliant.

Frequently Asked Questions

Answers to the most important questions about the pseudonymization workflow

How many file formats does Docuflair Mask support?

Docuflair Mask supports over 70 file formats including PDF, Word, Excel, PowerPoint, scanned documents (TIFF, JPEG, PNG) and email formats (MSG, EML). The integrated OCR text recognition enables pseudonymization of scanned and image-based documents as well.

Which PII categories does the software detect automatically?

Docuflair Mask automatically detects 9 categories of personal data: names, email addresses, addresses, phone numbers, IBAN numbers, company names, dates, tax numbers and national insurance numbers. The categories are fully configurable and extensible.

How is the replacement table protected?

Replacement tables are stored encrypted and kept separately from the pseudonymized document. Only authorised users have access. All access is logged in the audit trail. The table can additionally be exported and secured in an external vault.

What does cross-batch consistency mean?

Cross-batch consistency means that the same person receives the same pseudonym across multiple processing runs and document sets. If John Smith is pseudonymized as Person_A in batch 1, he will also receive the pseudonym Person_A in batches 2, 3 and all subsequent runs.
