The naive approach
Throw the whole PDF into the context window. GPT-4 Turbo gives you 128k tokens, which is roughly 300 pages. Beyond that, you truncate and pray.
That approach fails at 500+ pages. It also wastes money: you're paying for tokens that rarely contribute to the answer.
What we do instead
1. Extract + chunk at ingest time
When you upload a PDF, we:
- Parse every page with PyMuPDF
- Split into sentence-aware chunks of ~512 tokens
- Respect page boundaries so citations stay accurate
- Add 25% overlap between chunks for continuity
For a 1,000-page document, that's ~1,500-2,500 chunks.
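As a rough illustration of the chunking step, here is a minimal sketch rather than our production code. It assumes page text has already been extracted (with PyMuPDF that could be `page.get_text()` for each page), counts whitespace-separated words as a stand-in for tokens, and uses a naive punctuation-based sentence splitter. The names `chunk_page` and `chunk_document` are made up for this example.

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def count_tokens(text):
    # Assumption: whitespace words approximate tokens; a real tokenizer will differ.
    return len(text.split())

def chunk_page(text, page_number, max_tokens=512, overlap=0.25):
    """Pack whole sentences into chunks of roughly max_tokens, within one page."""
    chunks, current = [], []
    budget = int(max_tokens * overlap)  # tokens to repeat at the start of the next chunk
    for sentence in split_sentences(text):
        if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
            chunks.append({"page": page_number, "text": " ".join(current)})
            # Carry trailing sentences forward until the overlap budget is used up.
            carried = []
            for prev in reversed(current):
                if count_tokens(" ".join([prev] + carried)) > budget:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append({"page": page_number, "text": " ".join(current)})
    return chunks

def chunk_document(pages):
    """pages: list of page texts, index 0 is page 1. Chunks never span two pages."""
    out = []
    for number, text in enumerate(pages, start=1):
        out.extend(chunk_page(text, number))
    return out
```

Because every chunk carries exactly one page number, a citation can point at a single page without guessing where a chunk started.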
2. Embed + store in pgvector
Each chunk gets a 1,536-dim OpenAI embedding. We store them in Postgres with an HNSW index, and vector similarity search runs in 2-5 ms.
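For readers who have not used pgvector, the storage side might look something like the sketch below. The table name, column names, and index name are hypothetical, cosine distance is an assumption (the post does not say which metric is used), and the `%(name)s` placeholders assume a psycopg-style driver.

```python
EMBEDDING_DIM = 1536

# Hypothetical schema: table, column, and index names are illustrative only.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    page int NOT NULL,
    content text NOT NULL,
    embedding vector({EMBEDDING_DIM}) NOT NULL
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT id, page, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def to_vector_literal(values):
    """Format a list of floats as pgvector's text form, e.g. [0.1,0.2,0.3]."""
    if len(values) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(values)}")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

With a driver such as psycopg, the query would be run with parameters like `{"query": to_vector_literal(embedding), "k": 10}`.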
3. Hierarchical retrieval at query time
When you ask a question:
- Expand your query into 3-5 sub-queries using an LLM (catches paraphrase)
- Retrieve top-10 chunks for each sub-query (30-50 chunks)
- Deduplicate + re-rank by relevance score
- Return top-10 final chunks with page numbers
In our eval, this improves recall by ~15% over naive single-query top-k retrieval.
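The query-time flow can be sketched as follows. This is an illustration under assumptions, not the exact pipeline: `expand` and `search` are hypothetical hooks (an LLM call and a vector search), and re-ranking here simply keeps each chunk's best similarity score, whereas a real re-ranker may be a separate model.

```python
def retrieve(question, expand, search, per_query_k=10, final_k=10):
    """Multi-query retrieval with de-duplication.

    expand(question) -> list of sub-query strings (hypothetical LLM hook).
    search(query, k) -> list of dicts with 'id', 'page', 'text', 'score',
                        where a higher score means more relevant (hypothetical).
    """
    best = {}
    for query in expand(question):
        for hit in search(query, per_query_k):
            kept = best.get(hit["id"])
            # The same chunk can match several sub-queries; keep its best score.
            if kept is None or hit["score"] > kept["score"]:
                best[hit["id"]] = hit
    # Assumption: re-rank by the stored score rather than a dedicated re-ranker.
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:final_k]
```

Passing the two hooks in as arguments keeps the sketch independent of any particular LLM or database client.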
4. Generate with Claude Sonnet
The final answer is generated by Claude with the retrieved context. Our prompt is explicit about citations:
Answer based ONLY on the provided excerpts. Cite every claim as
[Document Name, Page X]. If the answer isn't in the context, say so.
5. Verify citations before rendering
A Python regex extracts [Doc, Page X] markers from Claude's output. For each one, we check that the cited document and page actually appear among the chunks we retrieved. Hallucinated citations are dropped, not rendered.
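A minimal sketch of that check, assuming citations follow the `[Document Name, Page X]` shape from the prompt and that the retrieved context is available as a set of (document name, page) pairs. The pattern and the `verify_citations` name are illustrative, not the production regex.

```python
import re

# Matches [Document Name, Page 12]; the name may itself contain commas.
CITATION_RE = re.compile(r"\[([^\[\]]+?),\s*Page\s+(\d+)\]")

def verify_citations(answer, retrieved):
    """Drop citations whose (document, page) pair was not in the retrieved context.

    retrieved: set of (document_name, page_number) tuples sent to the model.
    """
    def keep_or_drop(match):
        key = (match.group(1).strip(), int(match.group(2)))
        return match.group(0) if key in retrieved else ""

    cleaned = CITATION_RE.sub(keep_or_drop, answer)
    # Tidy the space a dropped citation leaves before punctuation.
    return re.sub(r"[ \t]+([.,;])", r"\1", cleaned)
```

For example, with only `("MSA", 12)` in the retrieved set, a citation of `[MSA, Page 99]` is removed while `[MSA, Page 12]` is kept.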
Persistent memory
Your chat remembers the last 20 messages, capped so context stays fresh. Files attached to the conversation persist across turns: if you uploaded a file at message 1, the agent can still reference "the contract from earlier" at message 15.
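One simple way to model this behavior is a bounded message window next to an unbounded file registry. The sketch below is an assumption about structure, with a hypothetical `Conversation` class, and says nothing about how the state is actually stored.

```python
from collections import deque

class Conversation:
    """Hypothetical in-memory model: capped message window, persistent files."""

    def __init__(self, max_messages=20):
        # deque with maxlen silently discards the oldest message when full.
        self.messages = deque(maxlen=max_messages)
        # Files are not subject to the message cap.
        self.files = {}

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})

    def attach_file(self, file_id, name):
        self.files[file_id] = name

    def context(self):
        return {"messages": list(self.messages), "files": dict(self.files)}
```

A file attached before the window rolls over is still listed in `context()["files"]` even after the message that uploaded it has been dropped.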
The result
- 1,000-page contracts: ~8 seconds to first token, 98% citation accuracy
- Technical books: works end-to-end if the table of contents is extractable
- Scanned-only PDFs: runs OCR first, then the same pipeline
Try it at /chat. Drop a big doc and see.