TL;DR
Auto-merging "Acme Corp" with "Acme Corporation" is the easy half. The hard half is catching the merges that should not have happened, which takes a re-check pass after the auto-merge.
Two PPMs reference "Acme Corp." A cap table references "Acme Corporation." A side letter references "Acme Holdings LLC." Are these the same entity? Probably the first two are. The third is more interesting — it might be the parent company. Getting this right matters because everything downstream (financial roll-ups, ownership tracking, compliance reporting) depends on the entity graph being correct.
Why one-pass linking fails
Conservative linker: merges only when surface form is near-identical. Misses "Acme Corp" / "Acme Corporation" merges that should have happened. Low recall.
Aggressive linker: merges on partial-match heuristics. Merges "Acme Corp" with "Acme Capital" when they are different companies. Low precision.
There is no threshold that gets both. The two-pass approach — aggressive merge with a precision-recovery pass — gets both at the cost of a second pipeline stage.
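To make the threshold problem concrete, here is a minimal sketch that scores the example names above with a plain string-similarity measure. Using Python's `difflib.SequenceMatcher` is an assumption for illustration only; the article does not say which similarity measure a one-pass linker would use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A pair that should merge and a pair that should not.
same_entity = similarity("Acme Corp", "Acme Corporation")
different_entity = similarity("Acme Corp", "Acme Capital")

# The gap between the two scores is the room a single threshold has to work with.
margin = same_entity - different_entity
print(f"same={same_entity:.2f} different={different_entity:.2f} margin={margin:.2f}")
```

On these two pairs the scores land close together, so a cutoff that separates them has very little slack, and other name pairs can easily fall on the wrong side of it.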
Pass 1: Gemini-driven disambiguation
At entity ingestion, the candidate entity is shown to Gemini together with up to 10 graph neighbors whose names match. For each entity, Gemini sees the surface form, jurisdiction, EIN if present, registered address, top relationships, and source document type. Its output is {action: "merge_with_X" | "create_new", confidence}.
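A minimal sketch of how the model input could be assembled, assuming a JSON payload. The `EntityView` class, the `name_matches` heuristic (shared first token), and `build_payload` are hypothetical names introduced here, and the actual Gemini call is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

MAX_NEIGHBORS = 10  # from the article: up to 10 name-matching neighbors


@dataclass
class EntityView:
    """The attributes the model sees for one entity (hypothetical shape)."""
    surface_form: str
    jurisdiction: Optional[str] = None
    ein: Optional[str] = None
    registered_address: Optional[str] = None
    top_relationships: List[str] = field(default_factory=list)
    source_document_type: Optional[str] = None


def name_matches(candidate: str, other: str) -> bool:
    """Assumed heuristic: two names match if they share their first token."""
    a, b = candidate.lower().split(), other.lower().split()
    return bool(a) and bool(b) and a[0] == b[0]


def build_payload(candidate: EntityView, graph: List[EntityView]) -> str:
    """Serialize the candidate plus at most MAX_NEIGHBORS matching neighbors."""
    neighbors = [e for e in graph if name_matches(candidate.surface_form, e.surface_form)]
    return json.dumps(
        {
            "candidate": asdict(candidate),
            "neighbors": [asdict(n) for n in neighbors[:MAX_NEIGHBORS]],
        },
        indent=2,
    )
```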
Confident merges (>0.85) auto-merge. Uncertain ones (0.6-0.85) queue for human review with the model's reasoning attached. Below 0.6, default to create_new.
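The routing rule above can be sketched as a small function. The `Decision` class and `route` function are hypothetical names, and the sketch assumes the confidence bands apply only to merge proposals, with a `create_new` answer passed straight through.

```python
from dataclasses import dataclass

AUTO_MERGE_ABOVE = 0.85  # strictly above this, merge without review
REVIEW_FLOOR = 0.6       # from here up to 0.85, send to human review


@dataclass
class Decision:
    action: str        # "merge_with_<id>" or "create_new"
    confidence: float  # model-reported confidence in [0, 1]


def route(decision: Decision) -> str:
    """Map a model decision to one of: auto_merge, human_review, create_new."""
    if decision.action == "create_new":
        return "create_new"  # assumption: no threshold applied to create_new
    if decision.confidence > AUTO_MERGE_ABOVE:
        return "auto_merge"
    if decision.confidence >= REVIEW_FLOOR:
        return "human_review"
    return "create_new"  # low-confidence merge proposals default to a new entity


print(route(Decision("merge_with_42", 0.91)))
print(route(Decision("merge_with_42", 0.70)))
print(route(Decision("merge_with_42", 0.40)))
```

Treating exactly 0.85 as a review case follows the article's ">0.85" wording for auto-merge.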
Pass 2: post-creation mislink detection
After auto-merge, a check pass compares the merged entity's attributes across all source documents. If "Acme Corp" in document A has a post-money valuation of $50M and "Acme Corp" in document B has $52M, the merge is suspect, because the same entity should have the same valuation as of the same date. Each kind of attribute gets its own comparison rule:
- Numerical fields compared with tolerance (1% on valuations, exact on share counts).
- Legal-entity fields compared exact (jurisdiction, EIN, registered address).
- Relational structure compared loosely (≥80% officer overlap).
Disagreement above tolerance flags the merge. The flag goes to a human review queue with both source documents linked and the conflicting attributes highlighted.
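The comparison rules above can be sketched as a pairwise check between what two source documents say about one merged entity. The `Observation` class and `conflicts` function are hypothetical names. Two details are assumptions the article does not specify: the valuation difference is measured relative to the larger value, and officer overlap is measured against the smaller officer set.

```python
from dataclasses import dataclass, field
from typing import List, Optional

VALUATION_TOLERANCE = 0.01  # 1% on valuations
MIN_OFFICER_OVERLAP = 0.80  # at least 80% officer overlap


@dataclass
class Observation:
    """Attributes one source document asserts about the merged entity."""
    doc_id: str
    valuation: Optional[float] = None
    share_count: Optional[int] = None
    jurisdiction: Optional[str] = None
    ein: Optional[str] = None
    registered_address: Optional[str] = None
    officers: frozenset = field(default_factory=frozenset)


def conflicts(a: Observation, b: Observation) -> List[str]:
    """Return the names of attributes that disagree beyond tolerance."""
    found = []
    # Numerical field with relative tolerance.
    if a.valuation is not None and b.valuation is not None:
        denom = max(abs(a.valuation), abs(b.valuation))
        if denom and abs(a.valuation - b.valuation) / denom > VALUATION_TOLERANCE:
            found.append("valuation")
    # Exact-match fields, compared only when both documents provide a value.
    for name in ("share_count", "jurisdiction", "ein", "registered_address"):
        x, y = getattr(a, name), getattr(b, name)
        if x is not None and y is not None and x != y:
            found.append(name)
    # Loose relational comparison.
    if a.officers and b.officers:
        shared = len(a.officers & b.officers)
        if shared / min(len(a.officers), len(b.officers)) < MIN_OFFICER_OVERLAP:
            found.append("officers")
    return found


doc_a = Observation("A", valuation=50_000_000, ein="12-3456789")
doc_b = Observation("B", valuation=52_000_000, ein="12-3456789")
print(conflicts(doc_a, doc_b))  # ['valuation']
```

A non-empty result is what would send the merge to the review queue, with the returned attribute names available for highlighting.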
What this catches
- ~15% of auto-merges are flagged by the mislink check
- ~3% of auto-merges turn out to be actual mislinks (entities the aggressive linker over-merged)
- ~12% of auto-merges are tolerable inconsistencies (typos, time-shifted valuations), and a human confirms the merge
- Net precision after the mislink check: >99% of merges
Cost
Pass 1 runs at ingestion: ~50 entities/second on cached inputs, ~$0.001 per entity. Pass 2 runs in batch overnight: ~5 entities/second (graph queries are the bottleneck), $0.005 per entity. The mislink check overhead is small relative to the document-extraction cost it backstops.
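As a rough worked example of these figures, the sketch below estimates wall-clock time and spend for a batch. The batch size of 100,000 entities is hypothetical, and the sketch assumes every entity goes through both passes on a single worker at the quoted rates, which the article does not state.

```python
PASS1_RATE, PASS1_COST = 50, 0.001  # entities/second, USD per entity
PASS2_RATE, PASS2_COST = 5, 0.005   # entities/second, USD per entity


def estimate(n_entities: int) -> dict:
    """Estimate hours per pass and total cost for n_entities."""
    return {
        "pass1_hours": n_entities / PASS1_RATE / 3600,
        "pass2_hours": n_entities / PASS2_RATE / 3600,
        "total_cost_usd": n_entities * (PASS1_COST + PASS2_COST),
    }


est = estimate(100_000)  # hypothetical batch size
print({k: round(v, 2) for k, v in est.items()})
```

Under these assumptions Pass 2 takes ten times as long as Pass 1 for the same batch, which is consistent with running it as an overnight job.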
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years
