PDFMind

Research·Mar 22, 2026·6 min read

Why PII redaction is harder than you think

Names in multiple languages, handwritten signatures, medical codes, context-dependent identifiers. Here's the taxonomy that powers Smart Redact.

PII is not a regex problem

The first time someone tries to redact PII, they reach for regex. \b\d{3}-\d{2}-\d{4}\b catches US Social Security numbers. [A-Z][a-z]+\s[A-Z][a-z]+ catches names.

Both are catastrophically wrong.

Strings shaped like SSNs also turn up as phone extensions, order numbers, and date fragments
Capitalized word pairs also match product names, cities, and, worst of all, names that appear in the non-PII parts of the doc ("Dr. Smith prescribed...")

Real PII redaction is a 40-category classification problem, and the right answer depends on context.
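To make the failure modes concrete, here is a small sketch that runs the two patterns quoted above over a made-up sentence. The sample text is invented for illustration and is not from any Smart Redact test set.

```python
import re

# The two naive patterns quoted in the text above.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
NAME_RE = re.compile(r"[A-Z][a-z]+\s[A-Z][a-z]+")

# Hypothetical sample text, invented for this example.
text = (
    "Order 123-45-6789 shipped to New York. "
    "Dr. Smith prescribed rest. "
    "Signed, Mary O'Brien, SSN 078-05-1120."
)

ssn_hits = SSN_RE.findall(text)
name_hits = NAME_RE.findall(text)

print(ssn_hits)   # the order number is matched alongside the SSN
print(name_hits)  # a city is matched; "Mary O'Brien" is not
```

On this input the SSN pattern cannot tell the order number from the real identifier, and the name pattern returns only "New York" while missing the one person who should be redacted, because the apostrophe in "O'Brien" breaks the match.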

The Smart Redact taxonomy

We bucket PII into five tiers:

Tier 1 — Unambiguous identifiers

SSN, TIN, PAN, Aadhaar, passport numbers
Credit card + bank account numbers
Email, phone, IP addresses

These are regex-detectable with high precision. Still not perfect (a bank routing number can look like a date), but close.
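One common way to push precision higher for some Tier 1 identifiers is to validate a checksum after the regex matches. The sketch below applies the Luhn check used by payment card numbers. It assumes checksum validation as a post-filter; the post does not say whether Smart Redact does this, and the function name is hypothetical.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    # Payment card numbers are 13 to 19 digits long.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # same number with the last digit changed
```

A 16-digit string that fails the check is far more likely to be an order or tracking number than a card number, so a filter like this removes false positives without touching recall on valid cards.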

Tier 2 — Context-dependent identifiers

Names, addresses, dates of birth
License numbers, employee IDs

Detection requires:

NER (named entity recognition) to find candidates
Context classification to decide whether "John Smith" in paragraph 3 refers to a party to a contract (redact) or a case citation (keep)

Tier 3 — Indirect identifiers

Rare diseases, specific employers, unusual job titles
5-digit ZIP + full date of birth + sex uniquely identify about 87% of the US population (Latanya Sweeney, 2000)

We flag these but don't auto-redact; they need human review.
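The re-identification risk behind this tier can be estimated on any dataset by counting how many records are the only one with their combination of quasi-identifiers. This is a minimal sketch with invented records and hypothetical field names (`zip`, `dob`, `sex`); it illustrates the measurement, not how Smart Redact scores Tier 3.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented records for illustration only.
records = [
    {"zip": "02139", "dob": "1980-01-15", "sex": "F"},
    {"zip": "02139", "dob": "1980-01-15", "sex": "F"},
    {"zip": "02139", "dob": "1975-06-02", "sex": "M"},
    {"zip": "94110", "dob": "1980-01-15", "sex": "F"},
]

print(unique_fraction(records, ("sex",)))
print(unique_fraction(records, ("zip", "dob", "sex")))
```

Each field alone singles out few people, but the combination singles out more, which is why these fields are flagged together rather than judged one at a time.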

Tier 4 — Medical codes

ICD-10 diagnoses, HIPAA identifiers
Medication names, dosages

These get a separate domain-specific detector because false positives are costly here: "aspirin" on its own is not PII, but "Lupron for endometriosis" leaks a diagnosis.
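The distinction between a bare medication name and a medication tied to a condition can be illustrated with a keyword co-occurrence window. This is a deliberately crude stand-in for a trained detector: the lexicons, the function name, and the window size are all hypothetical, and a real system would draw on a clinical vocabulary.

```python
import re

# Hypothetical toy lexicons, invented for this example.
MEDICATIONS = {"aspirin", "lupron", "metformin"}
CONDITIONS = {"endometriosis", "diabetes"}


def leaks_diagnosis(sentence: str, window: int = 3) -> bool:
    """True if a medication appears within `window` tokens of a condition."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok in MEDICATIONS:
            nearby = tokens[max(0, i - window): i + window + 1]
            if any(t in CONDITIONS for t in nearby):
                return True
    return False


print(leaks_diagnosis("Take aspirin with food"))
print(leaks_diagnosis("Lupron for endometriosis"))
```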

Tier 5 — Handwriting & signatures

Detected by layout model; redacted as black boxes
Signatures get a dedicated "signature detector" trained on the TobaccoRaw dataset

How the pipeline runs

Scan every page with the ensemble detector (200 to 500 ms per page on a GPU)
Merge overlapping detections (an entity can match multiple patterns)
Apply confidence thresholds per tier — Tier 1 redacts at 0.7, Tier 3 flags at 0.9
Render redacted boxes onto the PDF with PDFProcessor.redact() — non-reversible black rectangles
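The merge and threshold steps above can be sketched as follows. The `Detection` type, the rule that the highest-scoring detection wins a merge, and the action names are assumptions made for this example. Only the Tier 1 and Tier 3 thresholds come from the post, and the call to PDFProcessor.redact() is left out because its signature is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    tier: int
    score: float


# Thresholds stated in the post; other tiers are not specified there.
THRESHOLDS = {1: 0.7, 3: 0.9}
ACTIONS = {1: "redact", 3: "flag"}


def merge_overlaps(dets):
    """Merge overlapping spans; the higher-scoring detection keeps its label."""
    merged = []
    for d in sorted(dets, key=lambda d: (d.start, d.end)):
        if merged and d.start < merged[-1].end:
            prev = merged[-1]
            best = d if d.score > prev.score else prev
            merged[-1] = Detection(prev.start, max(prev.end, d.end),
                                   best.tier, best.score)
        else:
            merged.append(d)
    return merged


def decide(det):
    """Map a detection to an action using its tier's threshold."""
    if det.tier not in THRESHOLDS:
        return "unspecified"
    return ACTIONS[det.tier] if det.score >= THRESHOLDS[det.tier] else "skip"


dets = [
    Detection(10, 21, 1, 0.95),   # SSN pattern
    Detection(12, 21, 1, 0.80),   # second pattern hitting the same text
    Detection(40, 52, 3, 0.85),   # indirect identifier below threshold
    Detection(60, 70, 3, 0.93),   # indirect identifier above threshold
]
merged = merge_overlaps(dets)
decisions = [decide(d) for d in merged]
print(decisions)
```

Merging before thresholding means an entity matched by several patterns is judged once on its strongest evidence, rather than producing stacked boxes over the same text.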

What we got wrong

Our V1 over-redacted case citations in legal documents. "Smith v. Jones" became "█ v. █". Embarrassing.

The fix: context-aware NER that recognizes courtroom-style citations as non-PII. Cost us three weeks.
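To show the shape of the problem, here is a much simpler stand-in for that fix: a post-filter that drops name candidates falling inside a "Party v. Party" span. The production fix described above is context-aware NER, not this regex, and the pattern below only covers single-word party names. All function names are hypothetical.

```python
import re

# Matches "Smith v. Jones" or "Smith vs. Jones" with one-word party names.
CITATION_RE = re.compile(r"\b[A-Z][a-z]+ vs?\. [A-Z][a-z]+\b")


def drop_citation_names(text, name_spans):
    """Keep only the name spans that are not inside a case citation."""
    citation_spans = [m.span() for m in CITATION_RE.finditer(text)]

    def inside_citation(span):
        s, e = span
        return any(cs <= s and e <= ce for cs, ce in citation_spans)

    return [span for span in name_spans if not inside_citation(span)]


def span_of(text, needle):
    """Span of the first occurrence of `needle` (helper for the example)."""
    i = text.index(needle)
    return (i, i + len(needle))


text = "In Smith v. Jones, the court held for Jones. The tenant, Alice Smith, signed."
# Pretend these came from an NER model.
candidates = [span_of(text, "Smith"), span_of(text, "Jones"),
              span_of(text, "Alice Smith")]

kept = [text[s:e] for s, e in drop_citation_names(text, candidates)]
print(kept)
```

The tenant's name survives the filter and would still be redacted, while both parties in the citation are left readable.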

Ship something imperfect

Our current false-positive rate is ~3%. False negatives are ~0.5%. Both are better than outsourced human reviewers (who average 6% FP, 2% FN according to a 2024 Stanford study), but we're not declaring victory. Every customer gets a pre-redaction review screen.

Try it at /tools/smart-redact.
