PII is not a regex problem
The first time someone tries to redact PII, they reach for regex. \b\d{3}-\d{2}-\d{4}\b catches US Social Security numbers. [A-Z][a-z]+\s[A-Z][a-z]+ catches names.
Both are catastrophically wrong.
- The SSN pattern also matches phone extensions, order numbers, and date fragments
- The name pattern also matches product names, cities, and, worst of all, names that appear in the non-PII parts of the doc ("Dr. Smith prescribed...")
Real PII redaction is a classification problem across roughly 40 categories, and the right answer depends on context.
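To make the failure concrete, here is a minimal sketch that runs the two patterns quoted above over a sentence containing no PII at all. The sample sentence is invented for illustration.

```python
import re

# The two patterns quoted at the top of this post.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
NAME_RE = re.compile(r"[A-Z][a-z]+\s[A-Z][a-z]+")

# Invented sample text: an order reference, a city, and a product name.
text = "Order ref 123-45-6789 shipped to Palo Alto. Acme Widget arrives Monday."

ssn_hits = SSN_RE.findall(text)    # the order reference is flagged as an SSN
name_hits = NAME_RE.findall(text)  # a city and a product are flagged as names

print(ssn_hits)
print(name_hits)
```

Every hit here is a false positive, which is the point: the patterns describe a shape, not a meaning.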
The Smart Redact taxonomy
We bucket PII into five tiers:
Tier 1 — Unambiguous identifiers
- SSN, TIN, PAN, Aadhaar, passport numbers
- Credit card + bank account numbers
- Email, phone, IP addresses
These are regex-detectable with high precision. Still not perfect (a bank routing number can look like a date), but close.
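One common way to push a Tier 1 pattern closer to "perfect" is to pair the regex with a checksum. The sketch below does this for card numbers using the Luhn algorithm. It is an illustration, not a description of the Smart Redact detector: the regex, the function names, and the sample text are all assumptions.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0

def find_cards(text: str) -> list[str]:
    """Keep only regex matches that also pass the checksum."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]

# 4111 1111 1111 1111 is a widely used test card number; the second is made up.
sample = "Paid with 4111 1111 1111 1111; tracking no. 1234 5678 9012 3456."
cards = find_cards(sample)
print(cards)
```

The tracking number has the same shape as a card number but fails the checksum, so it is dropped.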
Tier 2 — Context-dependent identifiers
- Names, addresses, dates of birth
- License numbers, employee IDs
Detection requires:
- NER (named entity recognition) to find candidates
- Context classification to decide whether "John Smith" in paragraph 3 refers to a party to a contract (redact) or a case citation (keep)
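The two-stage shape (find candidates, then classify each one by its surroundings) can be sketched as follows. Everything here is a stand-in: a literal string search plays the role of NER, a hypothetical cue-word list plays the role of the context classifier, and the sample document is invented.

```python
import re
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    start: int
    end: int

# Hypothetical cue words suggesting the name belongs to a contract party.
PARTY_CUES = ("between", "tenant", "landlord", "party")

def find_candidates(doc: str, name: str) -> list[Candidate]:
    """Stand-in for NER: every literal occurrence of `name`."""
    return [Candidate(m.group(), m.start(), m.end())
            for m in re.finditer(re.escape(name), doc)]

def should_redact(doc: str, cand: Candidate, width: int = 30) -> bool:
    """Stand-in for context classification: look for cues near the candidate."""
    window = doc[max(0, cand.start - width): cand.end + width].lower()
    return any(cue in window for cue in PARTY_CUES)

doc = ("This lease is made between John Smith (the Tenant) and Acme LLC. "
       "Rent is due on the first day of each month. "
       "See John Smith v. Ohio for the standard of review.")

decisions = [should_redact(doc, c) for c in find_candidates(doc, "John Smith")]
print(decisions)
```

The same string gets two different answers depending on where it appears, which is exactly what a pattern-only approach cannot do.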
Tier 3 — Indirect identifiers
- Rare diseases, specific employers, unusual job titles
- ZIP code + birthdate + sex together uniquely identify about 87% of the US population (Latanya Sweeney, 2000)
We flag these but don't auto-redact; they need human review.
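To see why combinations matter more than any single field, the sketch below counts how many records in a small table are unique on their quasi-identifiers. The records are invented and the function name is hypothetical; this illustrates the idea behind the statistic, not our flagging logic.

```python
from collections import Counter

# Invented records for illustration only.
records = [
    {"zip": "02138", "dob": "1961-07-14", "sex": "F"},
    {"zip": "02138", "dob": "1961-07-14", "sex": "F"},
    {"zip": "02139", "dob": "1975-01-30", "sex": "M"},
    {"zip": "60614", "dob": "1988-11-02", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")

def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    groups = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if groups[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)

fraction = unique_fraction(records)
print(fraction)
```

None of the three fields is an identifier on its own, yet half of these rows can be singled out once the fields are combined.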
Tier 4 — Medical codes
- ICD-10 diagnoses, HIPAA identifiers
- Medication names, dosages
These get a separate, domain-specific detector because false positives matter here: "aspirin" is not PII, but "Lupron for endometriosis" leaks a diagnosis.
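One way to read that distinction is as a co-occurrence rule: a medication mention becomes sensitive when it appears alongside a condition. The sketch below is a toy version under that assumption. The two word lists contain only the terms from the example above and are hypothetical; a real detector would need clinical vocabularies and a trained model.

```python
import re

# Hypothetical, deliberately tiny lexicons built from the example above.
MEDICATIONS = {"aspirin", "lupron"}
CONDITIONS = {"endometriosis"}

def leaks_diagnosis(sentence: str) -> bool:
    """Flag a sentence only if it names both a medication and a condition."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & MEDICATIONS) and bool(words & CONDITIONS)

plain = leaks_diagnosis("Patient takes aspirin daily.")
paired = leaks_diagnosis("Started Lupron for endometriosis in March.")
print(plain, paired)
```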
Tier 5 — Handwriting & signatures
- Detected by layout model; redacted as black boxes
- Signatures get a dedicated "signature detector" trained on the TobaccoRaw dataset
How the pipeline runs
- Scan every page with the ensemble detector (200-500 ms per page on a GPU)
- Merge overlapping detections (an entity can match multiple patterns)
- Apply confidence thresholds per tier — Tier 1 redacts at 0.7, Tier 3 flags at 0.9
- Render redacted boxes onto the PDF with PDFProcessor.redact() (non-reversible black rectangles)
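The merge and threshold steps can be sketched as below. The Detection type, the function names, and the sample scores are assumptions made for illustration. Only the two thresholds stated above (Tier 1 at 0.7, Tier 3 at 0.9) are included, and the choice to keep the highest-scoring detection's tier when spans overlap is also an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    start: int    # character offset where the span begins
    end: int      # character offset where the span ends (exclusive)
    tier: int
    score: float  # detector confidence in [0, 1]

# Only the thresholds stated in this post; other tiers are left out on purpose.
THRESHOLDS = {1: ("redact", 0.7), 3: ("flag", 0.9)}

def merge_overlapping(dets):
    """Merge overlapping spans, keeping the tier and score of the best detection."""
    merged = []
    for d in sorted(dets, key=lambda d: (d.start, d.end)):
        if merged and d.start < merged[-1].end:
            prev = merged[-1]
            best = max(prev, d, key=lambda x: x.score)
            merged[-1] = Detection(prev.start, max(prev.end, d.end),
                                   best.tier, best.score)
        else:
            merged.append(d)
    return merged

def decide(det):
    """Return the tier's action if the score clears its threshold, else 'skip'."""
    action, threshold = THRESHOLDS[det.tier]
    return action if det.score >= threshold else "skip"

# Invented detections: two patterns hit the same span, plus two Tier 3 hits.
detections = [
    Detection(10, 21, 1, 0.95),
    Detection(12, 21, 1, 0.80),
    Detection(40, 52, 3, 0.85),
    Detection(60, 70, 3, 0.93),
]

merged = merge_overlapping(detections)
actions = [decide(d) for d in merged]
print(actions)
```

Merging before thresholding means one entity produces one decision, even when several patterns fire on it.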
What we got wrong
Our V1 over-redacted case citations in legal documents. "Smith v. Jones" became "█ v. █". Embarrassing.
The fix: context-aware NER that recognizes courtroom-style citations as non-PII. Cost us three weeks.
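The shipped fix lives inside the NER model, but the underlying idea can be shown with a much cruder stand-in: find citation-shaped spans first, then drop any name candidate that falls inside one. Both regexes and the sample text below are assumptions for illustration only.

```python
import re

# Hypothetical pattern for simple "X v. Y" case citations.
CITATION_RE = re.compile(r"\b[A-Z][a-z]+ v\. [A-Z][a-z]+\b")
# Stand-in for NER output: a fixed set of surnames, with an optional first name.
NAME_RE = re.compile(r"\b(?:Jane )?(?:Smith|Jones)\b")

def names_to_redact(text: str) -> list[str]:
    """Return name candidates that do not sit inside a case citation."""
    citations = [m.span() for m in CITATION_RE.finditer(text)]
    kept = []
    for m in NAME_RE.finditer(text):
        inside = any(s <= m.start() and m.end() <= e for s, e in citations)
        if not inside:
            kept.append(m.group())
    return kept

text = "Smith v. Jones controls here. Contact Jane Smith for details."
kept = names_to_redact(text)
print(kept)
```

The citation survives intact while the actual contact name is still caught.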
Ship something imperfect
Our current false-positive rate is ~3%. False negatives are ~0.5%. Both are better than outsourced human reviewers (who average 6% FP, 2% FN according to a 2024 Stanford study), but we're not declaring victory. Every customer gets a pre-redaction review screen.
Try it at /tools/smart-redact.