TL;DR
Parse, classify, evidence-map, extract, validate, score, graph, cross-document-link. The eight-stage pipeline that turns unstructured legal/financial PDFs into structured, validated, graph-linked data.
Document intelligence on long unstructured PDFs is not "give the LLM a PDF and ask for JSON." That works for two-page invoices. It does not work for 50-page private placement memoranda, 100-page subscription agreements, or the cap-table spreadsheet with five tabs of footnotes. The pipeline that handles these is staged — eight discrete stages with their own contracts and failure modes.
Stage 1: parse
PDF in, structured text out. Layout-aware parsing — preserve page numbers, paragraph breaks, table structure. Modern PDF parsers (LayoutParser, Adobe Extract API, Gemini's native PDF intake) handle this. Output: a {page, paragraph, span}-addressable representation of the document.
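A minimal sketch of what that addressable representation can look like (the class and field names are illustrative, not from any particular parser):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    page: int        # 1-indexed page number in the source PDF
    paragraph: int   # paragraph index within the page
    start: int       # character offsets within the paragraph
    end: int

@dataclass
class ParsedDoc:
    doc_id: str
    # Flat list of (span, text) pairs; table cells are serialized
    # row by row so they stay addressable too.
    blocks: list[tuple[Span, str]]

    def text_at(self, span: Span) -> str:
        """Resolve a span back to its text. Every later stage quotes
        evidence through this instead of re-reading the PDF."""
        for s, text in self.blocks:
            if (s.page, s.paragraph) == (span.page, span.paragraph):
                return text[span.start:span.end]
        raise KeyError(span)
```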
Stage 2: classify
What kind of document is this? PPM, SAFE, SPA, cap table, NDA, side letter? Classification picks the schema that subsequent stages will extract against. A classifier on the first 2-3 pages is usually enough. Misclassifying here cascades — the extractor will run a SAFE schema on a SPA and miss everything that matters.
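A sketch of that contract, assuming the current google-genai SDK (the model name and label set here are illustrative):

```python
from google import genai

DOC_TYPES = ["PPM", "SAFE", "SPA", "CAP_TABLE", "NDA",
             "SIDE_LETTER", "SUBSCRIPTION_AGREEMENT"]

def classify(first_pages_text: str, client: genai.Client) -> str:
    """Pick the extraction schema from the first 2-3 pages of parsed text."""
    prompt = (
        "Classify this legal/financial document as exactly one of: "
        + ", ".join(DOC_TYPES) + ". Reply with the label only.\n\n"
        + first_pages_text
    )
    resp = client.models.generate_content(
        model="gemini-2.5-flash",  # a smaller model is enough for this stage
        contents=prompt,
    )
    label = resp.text.strip().upper().replace(" ", "_")
    if label not in DOC_TYPES:
        raise ValueError(f"unrecognized document type: {label!r}")
    return label
```

Failing loudly on an unrecognized label is deliberate: given the cascade risk, an exception beats a silent guess.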
Stage 3: evidence-map
For each field the schema expects, find the page-and-span pointers where the value lives. This is a retrieval step, not an extraction step. The output is a map: {field_name → [{page, span}, ...]}. The extractor in stage 4 sees only those pointers — it cannot extract from unspecified parts of the document. Hallucination is bounded by where evidence has been mapped.
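The output shape matters more than the retrieval method. A naive keyword-based stand-in, reusing Span and ParsedDoc from the parse sketch (production retrieval would be embedding-based, but the contract is the same):

```python
EvidenceMap = dict[str, list[Span]]  # field_name -> candidate spans

# Per-schema hints: surface forms that usually mark each field.
# Purely illustrative; a real system curates or learns these per schema.
FIELD_HINTS = {
    "pre_money_valuation": ["pre-money", "pre money valuation"],
    "post_money_valuation": ["post-money", "post money valuation"],
    "raise_amount": ["aggregate purchase price", "total offering amount"],
}

def map_evidence(doc: ParsedDoc, schema_fields: list[str]) -> EvidenceMap:
    """Collect, for each schema field, the spans whose text mentions a
    hint. Stage 4 will only ever see these spans."""
    evidence: EvidenceMap = {f: [] for f in schema_fields}
    for span, text in doc.blocks:
        lowered = text.lower()
        for field in schema_fields:
            if any(hint in lowered for hint in FIELD_HINTS.get(field, [])):
                evidence[field].append(span)
    return evidence
```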
Stage 4: extract
Run the LLM on each field with its evidence pointers as context. Structured output (JSON schema, Pydantic, or equivalent). Each field has a confidence score from the model. Output: typed structured data with per-field provenance back to {page, span}.
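A sketch of the output types using Pydantic (the field names are illustrative); the JSON schema handed to the model can come straight from `ExtractionResult.model_json_schema()`:

```python
from pydantic import BaseModel, Field

class Provenance(BaseModel):
    page: int
    span: str          # e.g. "paragraph 4, chars 120-162"

class ExtractedField(BaseModel):
    name: str
    value: str | float | None   # None when the evidence was inconclusive
    confidence: float = Field(ge=0.0, le=1.0)  # model's own estimate
    sources: list[Provenance]   # back-references to the stage-3 map

class ExtractionResult(BaseModel):
    doc_id: str
    doc_type: str
    fields: list[ExtractedField]
```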
Stage 5: validate
Cross-field consistency rules. Total preferred shares should match the sum of issued + reserved. Post-money valuation should equal pre-money + raise amount. Date fields should be temporally consistent. Validation failures flag the field for human review without rejecting the whole document.
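The three rules above, as code (a sketch; a real deployment keeps a rule registry per schema):

```python
def validate(fields: dict) -> list[str]:
    """Return the names of fields failing cross-field checks. Failures
    flag fields for review; they never reject the whole document."""
    flagged = []

    total = fields.get("total_preferred_shares")
    issued = fields.get("issued_shares")
    reserved = fields.get("reserved_shares")
    if None not in (total, issued, reserved) and total != issued + reserved:
        flagged.append("total_preferred_shares")

    post = fields.get("post_money_valuation")
    pre = fields.get("pre_money_valuation")
    raised = fields.get("raise_amount")
    # Small tolerance for rounding in the source document.
    if None not in (post, pre, raised) and abs(post - (pre + raised)) > 1.0:
        flagged.append("post_money_valuation")

    signed = fields.get("signing_date")
    closed = fields.get("closing_date")
    if None not in (signed, closed) and closed < signed:
        flagged.append("closing_date")  # closing before signing is suspect

    return flagged
```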
Stage 6: score
Per-field and per-document confidence aggregation. A document where 28 of 30 fields extracted with high confidence and 2 flagged for review gets a single document-level score. Scores below threshold queue for human verification before the document is considered "extracted."
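One way to do the aggregation, reusing the types from the extract sketch (the threshold is illustrative; calibrate it against what reviewers actually find):

```python
REVIEW_THRESHOLD = 0.85  # illustrative; tune against the review queue

def document_score(result: ExtractionResult, flagged: list[str]) -> float:
    """Mean per-field confidence, with validation-flagged fields pinned
    to zero so a couple of bad fields drag the document-level score down."""
    if not result.fields:
        return 0.0
    scores = [0.0 if f.name in flagged else f.confidence
              for f in result.fields]
    return sum(scores) / len(scores)

def route(result: ExtractionResult, flagged: list[str]) -> str:
    """Below-threshold documents queue for human verification."""
    if document_score(result, flagged) < REVIEW_THRESHOLD:
        return "human_review"
    return "extracted"
```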
Stage 7: graph
The validated structured data lands in the knowledge graph (Neo4j in our case). Entities (companies, people, instruments, transactions) become nodes; relationships (issuer-of, party-to, beneficiary-of) become edges. Each node carries a back-reference to its source document and provenance.
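A sketch of the graph write using the official neo4j Python driver (the labels, relationship types, and property names are illustrative):

```python
from neo4j import GraphDatabase

CYPHER = """
MERGE (c:Company {canonical_id: $company_id})
  SET c.name = $name,
      c.source_doc = $doc_id,   // back-reference to the source document
      c.source_span = $span     // provenance: where the value was extracted
MERGE (i:Instrument {instrument_id: $instrument_id})
MERGE (c)-[:ISSUER_OF]->(i)
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    session.run(CYPHER, company_id="acme-de-001", name="Acme Corp",
                doc_id="ppm-2025-q1", span="p12, para 2",
                instrument_id="series-b-preferred")
```

MERGE rather than CREATE keeps re-ingestion idempotent: reprocessing a document updates properties instead of duplicating nodes.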
Stage 8: cross-document link
When a new document's entities are added to the graph, the linker checks: is "Acme Corp" in this PPM the same as "Acme Corporation" in last quarter's cap table? Probably yes. The linker uses Gemini-driven entity disambiguation: name variation + corporate jurisdiction + EIN match → link. Confident matches get auto-merged; uncertain ones queue for human disambiguation.
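The decision rule, sketched as plain heuristics (the thresholds and similarity metric are illustrative; in the pipeline described here, Gemini makes this call rather than a fixed formula):

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Strip a trailing corporate suffix so 'Acme Corp' and
    'Acme Corporation' both normalize to 'acme'."""
    name = name.lower().strip().rstrip(".")
    return re.sub(r"\b(corp(oration)?|inc|llc|ltd|co)$", "", name).strip()

def link_decision(a: dict, b: dict) -> str:
    """Returns 'auto_merge', 'human_queue', or 'distinct'."""
    if a.get("ein") and a.get("ein") == b.get("ein"):
        return "auto_merge"  # an exact EIN match is decisive on its own
    sim = SequenceMatcher(None, normalize(a["name"]),
                          normalize(b["name"])).ratio()
    same_jur = a.get("jurisdiction") == b.get("jurisdiction")
    if sim > 0.9 and same_jur:
        return "auto_merge"   # confident; the mislink pass below re-checks
    if sim > 0.75:
        return "human_queue"  # plausible but not confident
    return "distinct"
```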
Mislink detection
After auto-merge, a final pass re-checks linkage by comparing all the entity's attributes across the two source documents. If "Acme" in PPM A has post-money valuation $50M and "Acme" in cap table B has post-money $52M, the link is flagged as suspect. Human reviews; either confirms a typo on one side or splits the merge. This pass catches the long tail of bad auto-merges that no single-pass linker catches.
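Sketched with the valuation example from the paragraph above (the attribute set and tolerance are illustrative):

```python
def recheck_merge(attrs_a: dict, attrs_b: dict,
                  tolerance: float = 0.01) -> list[str]:
    """Compare every attribute the two source documents share. Any
    disagreement flags the auto-merge as suspect for human review."""
    suspect = []
    for key in attrs_a.keys() & attrs_b.keys():
        va, vb = attrs_a[key], attrs_b[key]
        if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
            # Numeric attributes: relative disagreement beyond tolerance.
            if abs(va - vb) > tolerance * max(abs(va), abs(vb), 1):
                suspect.append(key)
        elif va != vb:
            suspect.append(key)
    return suspect

# $50M in PPM A vs $52M in cap table B: a 4% gap, so the link is flagged.
recheck_merge({"post_money": 50_000_000}, {"post_money": 52_000_000})
```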
Numbers from production
- 7 document types currently supported (PPM, SAFE, SPA, cap table, side letter, NDA, subscription agreement)
- Per-document P50 latency: 90-180 seconds (mostly stages 1, 4, 8)
- Per-document cost: $0.40-1.20 (Gemini 2.5 Pro for stages 4 + 8, smaller model for 2 + 5)
- Mislink-detection catch rate: ~15% of auto-merges flagged, ~3% turn out to be actual mislinks
- Human review queue: ~5-8% of documents, processed within 24 hours
What this is not
Real-time document Q&A. The pipeline takes minutes per document, not seconds. For real-time use cases (RAG over already-extracted documents), the pipeline runs once at ingestion and the runtime queries hit the graph + vector store. Pipeline at ingestion, retrieval at runtime — never confuse the two.