TL;DR
Parse, classify, evidence-map, extract, validate, score, graph, cross-document-link. The eight-stage pipeline that turns unstructured legal/financial PDFs into structured, validated, graph-linked data.
Document intelligence on long unstructured PDFs is not "give the LLM a PDF and ask for JSON." That works for two-page invoices. It does not work for 50-page private placement memoranda, 100-page subscription agreements, or the cap-table spreadsheet with five tabs of footnotes. The pipeline that handles these is staged — eight discrete stages with their own contracts and failure modes.
Stage 1: parse
PDF in, structured text out. Layout-aware parsing — preserve page numbers, paragraph breaks, table structure. Modern PDF parsers (LayoutParser, Adobe Extract API, Gemini's native PDF intake) handle this. Output: a {page, paragraph, span}-addressable representation of the document.
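A minimal sketch of what that addressable representation can look like (the class and field names are illustrative, not from any particular parser):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    page: int        # 1-indexed page number in the source PDF
    paragraph: int   # paragraph index within the page
    start: int       # character offsets within the paragraph
    end: int

@dataclass
class ParsedDoc:
    doc_id: str
    # Flat list of (span, text) pairs; table cells are serialized
    # row by row so they stay addressable too.
    blocks: list[tuple[Span, str]]

    def text_at(self, span: Span) -> str:
        """Resolve a span back to its text. Every later stage quotes
        evidence through this instead of re-reading the PDF."""
        for s, text in self.blocks:
            if (s.page, s.paragraph) == (span.page, span.paragraph):
                return text[span.start:span.end]
        raise KeyError(span)
```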
Stage 2: classify
What kind of document is this? PPM, SAFE, SPA, cap table, NDA, side letter? Classification picks the schema that subsequent stages will extract against. A classifier on the first 2-3 pages is usually enough. Misclassifying here cascades — the extractor will run a SAFE schema on a SPA and miss everything that matters.
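A sketch of that contract, assuming the current google-genai SDK (the model name and label set here are illustrative):

```python
from google import genai

DOC_TYPES = ["PPM", "SAFE", "SPA", "CAP_TABLE", "NDA",
             "SIDE_LETTER", "SUBSCRIPTION_AGREEMENT"]

def classify(first_pages_text: str, client: genai.Client) -> str:
    """Pick the extraction schema from the first 2-3 pages of parsed text."""
    prompt = (
        "Classify this legal/financial document as exactly one of: "
        + ", ".join(DOC_TYPES) + ". Reply with the label only.\n\n"
        + first_pages_text
    )
    resp = client.models.generate_content(
        model="gemini-2.5-flash",  # a smaller model is enough for this stage
        contents=prompt,
    )
    label = resp.text.strip().upper().replace(" ", "_")
    if label not in DOC_TYPES:
        raise ValueError(f"unrecognized document type: {label!r}")
    return label
```

Failing loudly on an unrecognized label is deliberate: given the cascade risk, an exception beats a silent guess.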
Stage 3: evidence-map
For each field the schema expects, find the page-and-span pointers where the value lives. This is a retrieval step, not an extraction step. The output is a map: {field_name → [{page, span}, ...]}. The extractor in stage 4 sees only those pointers — it cannot extract from unspecified parts of the document. Hallucination is bounded by where evidence has been mapped.
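The output shape matters more than the retrieval method. A naive keyword-based stand-in, reusing Span and ParsedDoc from the parse sketch (production retrieval would be embedding-based, but the contract is the same):

```python
EvidenceMap = dict[str, list[Span]]  # field_name -> candidate spans

# Per-schema hints: surface forms that usually mark each field.
# Purely illustrative; a real system curates or learns these per schema.
FIELD_HINTS = {
    "pre_money_valuation": ["pre-money", "pre money valuation"],
    "post_money_valuation": ["post-money", "post money valuation"],
    "raise_amount": ["aggregate purchase price", "total offering amount"],
}

def map_evidence(doc: ParsedDoc, schema_fields: list[str]) -> EvidenceMap:
    """Collect, for each schema field, the spans whose text mentions a
    hint. Stage 4 will only ever see these spans."""
    evidence: EvidenceMap = {f: [] for f in schema_fields}
    for span, text in doc.blocks:
        lowered = text.lower()
        for field in schema_fields:
            if any(hint in lowered for hint in FIELD_HINTS.get(field, [])):
                evidence[field].append(span)
    return evidence
```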
Stage 4: extract
Run the LLM on each field with its evidence pointers as context. Structured output (JSON schema, Pydantic, or equivalent). Each field has a confidence score from the model. Output: typed structured data with per-field provenance back to {page, span}.
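A sketch of the output types using Pydantic (the field names are illustrative); the JSON schema handed to the model can come straight from `ExtractionResult.model_json_schema()`:

```python
from pydantic import BaseModel, Field

class Provenance(BaseModel):
    page: int
    span: str          # e.g. "paragraph 4, chars 120-162"

class ExtractedField(BaseModel):
    name: str
    value: str | float | None   # None when the evidence was inconclusive
    confidence: float = Field(ge=0.0, le=1.0)  # model's own estimate
    sources: list[Provenance]   # back-references to the stage-3 map

class ExtractionResult(BaseModel):
    doc_id: str
    doc_type: str
    fields: list[ExtractedField]
```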
Stage 5: validate
Cross-field consistency rules. Total preferred shares should match the sum of issued + reserved. Post-money valuation should equal pre-money + raise amount. Date fields should be temporally consistent. Validation failures flag the field for human review without rejecting the whole document.
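The three rules above, as code (a sketch; a real deployment keeps a rule registry per schema):

```python
def validate(fields: dict) -> list[str]:
    """Return the names of fields failing cross-field checks. Failures
    flag fields for review; they never reject the whole document."""
    flagged = []

    total = fields.get("total_preferred_shares")
    issued = fields.get("issued_shares")
    reserved = fields.get("reserved_shares")
    if None not in (total, issued, reserved) and total != issued + reserved:
        flagged.append("total_preferred_shares")

    post = fields.get("post_money_valuation")
    pre = fields.get("pre_money_valuation")
    raised = fields.get("raise_amount")
    # Small tolerance for rounding in the source document.
    if None not in (post, pre, raised) and abs(post - (pre + raised)) > 1.0:
        flagged.append("post_money_valuation")

    signed = fields.get("signing_date")
    closed = fields.get("closing_date")
    if None not in (signed, closed) and closed < signed:
        flagged.append("closing_date")  # closing before signing is suspect

    return flagged
```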
Stage 6: score
Per-field and per-document confidence aggregation. A document where 28 of 30 fields extracted with high confidence and 2 flagged for review gets a single document-level score. Scores below threshold queue for human verification before the document is considered "extracted."
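One way to do the aggregation, reusing the types from the extract sketch (the threshold is illustrative; calibrate it against what reviewers actually find):

```python
REVIEW_THRESHOLD = 0.85  # illustrative; tune against the review queue

def document_score(result: ExtractionResult, flagged: list[str]) -> float:
    """Mean per-field confidence, with validation-flagged fields pinned
    to zero so a couple of bad fields drag the document-level score down."""
    if not result.fields:
        return 0.0
    scores = [0.0 if f.name in flagged else f.confidence
              for f in result.fields]
    return sum(scores) / len(scores)

def route(result: ExtractionResult, flagged: list[str]) -> str:
    """Below-threshold documents queue for human verification."""
    if document_score(result, flagged) < REVIEW_THRESHOLD:
        return "human_review"
    return "extracted"
```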
Stage 7: graph
The validated structured data lands in the knowledge graph (Neo4j in our case). Entities (companies, people, instruments, transactions) become nodes; relationships (issuer-of, party-to, beneficiary-of) become edges. Each node carries a back-reference to its source document and provenance.
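A sketch of the graph write using the official neo4j Python driver (the labels, relationship types, and property names are illustrative):

```python
from neo4j import GraphDatabase

CYPHER = """
MERGE (c:Company {canonical_id: $company_id})
  SET c.name = $name,
      c.source_doc = $doc_id,   // back-reference to the source document
      c.source_span = $span     // provenance: where the value was extracted
MERGE (i:Instrument {instrument_id: $instrument_id})
MERGE (c)-[:ISSUER_OF]->(i)
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    session.run(CYPHER, company_id="acme-de-001", name="Acme Corp",
                doc_id="ppm-2025-q1", span="p12, para 2",
                instrument_id="series-b-preferred")
```

MERGE rather than CREATE keeps re-ingestion idempotent: reprocessing a document updates properties instead of duplicating nodes.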
Stage 8: cross-document link
When a new document's entities are added to the graph, the linker checks: is "Acme Corp" in this PPM the same as "Acme Corporation" in last quarter's cap table? Probably yes. The linker uses Gemini-driven entity disambiguation: name variation + corporate jurisdiction + EIN match → link. Confident matches get auto-merged; uncertain ones queue for human disambiguation.
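The decision rule, sketched as plain heuristics (the thresholds and similarity metric are illustrative; in the pipeline described here, Gemini makes this call rather than a fixed formula):

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip a trailing corporate suffix so 'Acme Corp' and
    'Acme Corporation' both normalize to 'acme'."""
    name = name.lower().strip().rstrip(".")
    return re.sub(r"\b(corp(oration)?|inc|llc|ltd|co)$", "", name).strip()

def link_decision(a: dict, b: dict) -> str:
    """Returns 'auto_merge', 'human_queue', or 'distinct'."""
    if a.get("ein") and a.get("ein") == b.get("ein"):
        return "auto_merge"  # an exact EIN match is decisive on its own
    sim = SequenceMatcher(None, normalize(a["name"]),
                          normalize(b["name"])).ratio()
    same_jur = a.get("jurisdiction") == b.get("jurisdiction")
    if sim > 0.9 and same_jur:
        return "auto_merge"   # confident; the mislink pass below re-checks
    if sim > 0.75:
        return "human_queue"  # plausible but not confident
    return "distinct"
```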
Mislink detection
After auto-merge, a final pass re-checks linkage by comparing all the entity's attributes across the two source documents. If "Acme" in PPM A has post-money valuation $50M and "Acme" in cap table B has post-money $52M, the link is flagged as suspect. Human reviews; either confirms a typo on one side or splits the merge. This pass catches the long tail of bad auto-merges that no single-pass linker catches.
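Sketched with the valuation example from the paragraph above (the attribute set and tolerance are illustrative):

```python
def recheck_merge(attrs_a: dict, attrs_b: dict,
                  tolerance: float = 0.01) -> list[str]:
    """Compare every attribute the two source documents share. Any
    disagreement flags the auto-merge as suspect for human review."""
    suspect = []
    for key in attrs_a.keys() & attrs_b.keys():
        va, vb = attrs_a[key], attrs_b[key]
        if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
            # Numeric attributes: relative disagreement beyond tolerance.
            if abs(va - vb) > tolerance * max(abs(va), abs(vb), 1):
                suspect.append(key)
        elif va != vb:
            suspect.append(key)
    return suspect

# $50M in PPM A vs $52M in cap table B: a 4% gap, so the link is flagged.
recheck_merge({"post_money": 50_000_000}, {"post_money": 52_000_000})
```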
Numbers from production
- 7 document types currently supported (PPM, SAFE, SPA, cap table, side letter, NDA, subscription agreement)
- Per-document P50 latency: 90-180 seconds (mostly stages 1, 4, 8)
- Per-document cost: $0.40-1.20 (Gemini 2.5 Pro for stages 4 + 8, smaller model for 2 + 5)
- Mislink-detection catch rate: ~15% of auto-merges flagged, ~3% turn out to be actual mislinks
- Human review queue: ~5-8% of documents, processed within 24 hours
What this is not
Real-time document Q&A. The pipeline takes minutes per document, not seconds. For real-time use cases (RAG over already-extracted documents), the pipeline runs once at ingestion and the runtime queries hit the graph + vector store. Pipeline at ingestion, retrieval at runtime — never confuse the two.