
The 8-Stage Document Intelligence Pipeline

11 min read | 2,200 words

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR

Parse, classify, evidence-map, extract, validate, score, graph, cross-document-link. The eight-stage pipeline that turns unstructured legal/financial PDFs into validated structured data with mislink detection at the end.
Tags: document AI, structured extraction, Gemini DocInt, evidence-mapping, cross-document linking, entity disambiguation, legal document processing

Document intelligence on long unstructured PDFs is not "give the LLM a PDF and ask for JSON." That works for two-page invoices. It does not work for 50-page private placement memoranda, 100-page subscription agreements, or the cap-table spreadsheet with five tabs of footnotes. The pipeline that handles these is staged — eight discrete stages with their own contracts and failure modes.

Stage 1: parse

PDF in, structured-text out. Layout-aware parsing — preserve page numbers, paragraph breaks, table structure. Modern PDF parsers (LayoutParser, Adobe Extract API, Gemini's native PDF intake) handle this. Output: a {page, paragraph, span} addressable representation of the document.
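A minimal sketch of what "addressable representation" can mean in practice. The class and field names (`Span`, `ParsedDoc`, `text_at`) are illustrative assumptions, not the article's actual types; the point is that every text unit keeps a `{page, paragraph, span}` address that later stages can resolve.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Addressable location: page and paragraph are 1-based; start/end are character offsets."""
    page: int
    paragraph: int
    start: int
    end: int

@dataclass
class ParsedDoc:
    """Layout-aware parse output: every text unit keeps its span."""
    spans: list  # list[tuple[Span, str]]

    def text_at(self, page: int, paragraph: int) -> str:
        """Resolve a {page, paragraph} pointer back to its text."""
        return " ".join(t for s, t in self.spans
                        if s.page == page and s.paragraph == paragraph)

# Usage: downstream stages address the document only through spans.
doc = ParsedDoc(spans=[
    (Span(page=1, paragraph=1, start=0, end=28), "PRIVATE PLACEMENT MEMORANDUM"),
    (Span(page=12, paragraph=3, start=0, end=33), "Post-money valuation: $50,000,000"),
])
```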

Stage 2: classify

What kind of document is this? PPM, SAFE, SPA, cap table, NDA, side letter? Classification picks the schema that subsequent stages will extract against. A classifier on the first 2-3 pages is usually enough. Misclassifying here cascades — the extractor will run a SAFE schema on a SPA and miss everything that matters.
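A sketch of the classification contract, assuming a function shape of my own invention. In production this would be an LLM call over the first 2-3 pages; the keyword stub below only stands in for that call so the interface is concrete.

```python
DOC_TYPES = ["PPM", "SAFE", "SPA", "cap table", "NDA", "side letter", "subscription agreement"]

def classify(first_pages_text: str) -> str:
    """Pick the schema key that stages 3-5 will extract against.
    Keyword matching here is a stand-in for the real classifier (LLM on pages 1-3)."""
    t = first_pages_text.lower()
    if "private placement memorandum" in t:
        return "PPM"
    if "simple agreement for future equity" in t:
        return "SAFE"
    if "share purchase agreement" in t or "stock purchase agreement" in t:
        return "SPA"
    # Unknown types queue for human triage rather than guessing a schema,
    # because a wrong schema here silently breaks every later stage.
    return "unknown"
```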

Stage 3: evidence-map

For each field the schema expects, find the page-and-span pointers where the value lives. This is a retrieval step, not an extraction step. The output is a map: {field_name → [{page, span}, ...]}. The extractor in stage 4 sees only those pointers — it cannot extract from unspecified parts of the document. Hallucination is bounded by where evidence has been mapped.
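The shape of that map can be sketched as follows. The retriever is pluggable (BM25, embeddings, an LLM reranker); the naive substring matcher below is only a stand-in so the field-to-pointer contract is runnable, and the function names are assumptions.

```python
def evidence_map(schema_fields, spans, retrieve):
    """For each schema field, retrieve candidate {page, span} pointers.
    The stage-4 extractor sees ONLY these pointers, which bounds hallucination."""
    return {field: retrieve(field, spans) for field in schema_fields}

# Toy retriever: naive substring match (a stand-in for real retrieval).
def toy_retrieve(field, spans):
    return [{"page": s["page"], "span": s["span"]}
            for s in spans if field in s["text"].lower()]

spans = [
    {"page": 12, "span": 3, "text": "Post-money valuation: $50,000,000"},
    {"page": 14, "span": 1, "text": "The closing date shall be June 1, 2026."},
]
emap = evidence_map(["post-money valuation", "closing date"], spans, toy_retrieve)
```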

Stage 4: extract

Run the LLM on each field with its evidence pointers as context. Structured output (JSON schema, Pydantic, or equivalent). Each field has a confidence score from the model. Output: typed structured data with per-field provenance back to {page, span}.
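A minimal sketch of the per-field extraction contract. The `FieldValue` type and the stub LLM are assumptions standing in for the real structured-output call (JSON schema or Pydantic against Gemini); what matters is that every value carries confidence plus `{page, span}` provenance.

```python
from dataclasses import dataclass

@dataclass
class FieldValue:
    value: object
    confidence: float  # model-reported, 0.0-1.0
    page: int          # provenance back to the source span
    span: int

def extract_field(field, pointers, resolve_text, llm):
    """Run the model on one field with only its evidence spans as context."""
    context = "\n".join(resolve_text(p) for p in pointers)
    value, conf = llm(field, context)  # structured output in production
    p = pointers[0]
    return FieldValue(value=value, confidence=conf, page=p["page"], span=p["span"])

# Stub standing in for the structured-output LLM call.
def stub_llm(field, context):
    if field == "post_money_valuation" and "$50,000,000" in context:
        return 50_000_000, 0.97
    return None, 0.0

fv = extract_field(
    "post_money_valuation",
    [{"page": 12, "span": 3}],
    lambda p: "Post-money valuation: $50,000,000",
    stub_llm,
)
```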

Stage 5: validate

Cross-field consistency rules. Total preferred shares should match the sum of issued + reserved. Post-money valuation should equal pre-money + raise amount. Date fields should be temporally consistent. Validation failures flag the field for human review without rejecting the whole document.
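The three rules named above can be sketched as plain code, assuming hypothetical field names; note that a failure appends a flag instead of raising, so the document survives with one field marked for review.

```python
def validate(f):
    """Cross-field consistency rules; failures flag a field for human review,
    they never reject the whole document."""
    flags = []
    if f["preferred_total"] != f["preferred_issued"] + f["preferred_reserved"]:
        flags.append("preferred_total")
    if f["post_money"] != f["pre_money"] + f["raise_amount"]:
        flags.append("post_money")
    if f["signing_date"] > f["closing_date"]:  # ISO dates compare lexicographically
        flags.append("closing_date")
    return flags  # empty list means every rule passed

doc = {
    "preferred_total": 1_000_000, "preferred_issued": 800_000, "preferred_reserved": 200_000,
    "post_money": 52_000_000, "pre_money": 40_000_000, "raise_amount": 10_000_000,
    "signing_date": "2026-04-01", "closing_date": "2026-06-01",
}
flags = validate(doc)  # post_money rule fails: 40M + 10M != 52M
```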

Stage 6: score

Per-field and per-document confidence aggregation. A document where 28 of 30 fields extracted with high confidence and 2 flagged for review gets a single document-level score. Scores below threshold queue for human verification before the document is considered "extracted."
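One way to sketch that aggregation. The specific rule below (mean confidence, penalised by the flagged fraction) is illustrative, not the production formula; any monotone aggregation with a review threshold fits the stage's contract.

```python
def document_score(fields, field_threshold=0.85):
    """Aggregate per-field confidences into one document-level score
    plus a count of fields that fall below the per-field threshold."""
    confs = list(fields.values())
    flagged = [c for c in confs if c < field_threshold]
    mean = sum(confs) / len(confs)
    penalty = len(flagged) / len(confs)
    return round(mean * (1 - penalty), 3), len(flagged)

# 28 of 30 fields high-confidence, 2 low: the article's review-queue case.
fields = {f"field_{i}": 0.95 for i in range(28)}
fields["field_28"] = 0.60
fields["field_29"] = 0.55
score, n_flagged = document_score(fields)
```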

Stage 7: graph

The validated structured data lands in the knowledge graph (Neo4j in our case). Entities (companies, people, instruments, transactions) become nodes; relationships (issuer-of, party-to, beneficiary-of) become edges. Each node carries a back-reference to its source document and provenance.
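A sketch of the graph-load step as parameterised Cypher `MERGE` statements, built but not executed here (in production they would run in one transaction via the `neo4j` driver). The label and property names are assumptions; the load-bearing detail is the provenance properties on every node.

```python
def to_graph_ops(entity, relationships, source_doc_id):
    """Turn one validated extraction into Neo4j MERGE statements.
    Every node carries a back-reference to its source document."""
    ops = [(
        "MERGE (e:Entity {name: $name}) "
        "SET e.source_doc = $doc, e.page = $page",
        {"name": entity["name"], "doc": source_doc_id, "page": entity["page"]},
    )]
    for rel in relationships:  # e.g. ISSUER_OF, PARTY_TO, BENEFICIARY_OF
        ops.append((
            f"MATCH (a:Entity {{name: $a}}), (b:Entity {{name: $b}}) "
            f"MERGE (a)-[:{rel['type']}]->(b)",
            {"a": entity["name"], "b": rel["target"]},
        ))
    return ops

ops = to_graph_ops(
    {"name": "Acme Corp", "page": 12},
    [{"type": "ISSUER_OF", "target": "Series A Preferred"}],
    "ppm_2026_q2",
)
```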

Stage 8: cross-document link

When a new document's entities are added to the graph, the linker checks: is "Acme Corp" in this PPM the same as "Acme Corporation" in last quarter's cap table? Probably yes. The linker uses Gemini-driven entity disambiguation: name variation + corporate jurisdiction + EIN match → link. Confident matches get auto-merged; uncertain ones queue for human disambiguation.
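The decision logic can be sketched as below. The `llm_same_name` stub stands in for the Gemini name-variation judgment, and the tie-breaking order (EIN decisive, then name + jurisdiction, then queue) mirrors the signals named above; the exact thresholds and stub normalisation are assumptions.

```python
def link_decision(a, b, llm_same_name):
    """Decide whether two entity records refer to the same company."""
    if a.get("ein") and a.get("ein") == b.get("ein"):
        return "auto_merge"                 # EIN match is decisive
    name_match = llm_same_name(a["name"], b["name"])
    same_jurisdiction = a.get("jurisdiction") == b.get("jurisdiction")
    if name_match and same_jurisdiction:
        return "auto_merge"
    if name_match:
        return "human_review"               # uncertain: queue for disambiguation
    return "no_link"

# Stub for the LLM name-variation judgment ("Acme Corp" ~ "Acme Corporation").
def stub_same_name(x, y):
    norm = lambda s: s.lower().replace("corporation", "corp").rstrip(".")
    return norm(x) == norm(y)

decision = link_decision(
    {"name": "Acme Corp", "jurisdiction": "DE"},
    {"name": "Acme Corporation", "jurisdiction": "DE"},
    stub_same_name,
)
```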

Mislink detection

After auto-merge, a final pass re-checks linkage by comparing all the entity's attributes across the two source documents. If "Acme" in PPM A has post-money valuation $50M and "Acme" in cap table B has post-money $52M, the link is flagged as suspect. Human reviews; either confirms a typo on one side or splits the merge. This pass catches the long tail of bad auto-merges that no single-pass linker catches.
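A minimal sketch of that re-check pass, assuming hypothetical attribute names: compare the merged entity's attributes across its two source documents, and flag any disagreement as suspect.

```python
def recheck_merge(attrs_a, attrs_b, tolerance=0.0):
    """Post-merge re-check: any attribute disagreement across the two
    source documents flags the link as suspect."""
    suspect = []
    for key in attrs_a.keys() & attrs_b.keys():
        va, vb = attrs_a[key], attrs_b[key]
        if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
            if abs(va - vb) > tolerance * max(abs(va), abs(vb), 1):
                suspect.append(key)
        elif va != vb:
            suspect.append(key)
    # Non-empty: queue for human review, who confirms a typo or splits the merge.
    return suspect

# The example from the article: $50M vs $52M post-money on the same "Acme".
suspect = recheck_merge({"post_money": 50_000_000}, {"post_money": 52_000_000})
```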

Numbers from production

  • 7 document types currently supported (PPM, SAFE, SPA, cap table, side letter, NDA, subscription agreement)
  • Per-document P50 latency: 90-180 seconds (mostly stages 1, 4, 8)
  • Per-document cost: $0.40-1.20 (Gemini 2.5 Pro for stages 4 + 8, smaller model for 2 + 5)
  • Mislink-detection catch rate: ~15% of auto-merges flagged, ~3% turn out to be actual mislinks
  • Human review queue: ~5-8% of documents, processed within 24 hours

What this is not

Real-time document Q&A. The pipeline takes minutes per document, not seconds. For real-time use cases (RAG over already-extracted documents), the pipeline runs once at ingestion and the runtime queries hit the graph + vector store. Pipeline at ingestion, retrieval at runtime — never confuse the two.



Related Articles

  • Gemini-Driven Entity Disambiguation With Post-Creation Mislink Detection (7 min): Auto-merging "Acme Corp" with "Acme Corporation" is the easy half. The hard half is catching the merges that should not have happened — a re-check pass after creation that flags 3% of merges as suspect.
  • Smart Category Routing for Contract Review (6 min): A focused application of the LLMOps routing pattern to legal contract analysis — the analyst-selection logic that ships fewer clauses to fewer agents and finishes a 3,300-call review in 154 seconds.
  • Zero-Trust Multi-Tenant Firestore: Middleware, Claims, and 60+ Wildcard Permissions (9 min): Hard tenant isolation on Firestore is not a query-pattern choice — it is a middleware layer, an immutable claim source, and a permission model with wildcards. The architecture that makes cross-tenant data leakage structurally impossible.
