TL;DR
Extraction gives you names. Entity resolution decides identity. Here is how we taught an $850M family office knowledge graph to tell one company from its eleven aliases.
In the last post I described finding the same portfolio company in a client's knowledge graph under eleven different names, and called that kind of silent decay graph rot. This post is about the fix: entity resolution, the part of the pipeline that decides eleven names are one company.
It is the least glamorous problem in knowledge graph engineering and the one that breaks the most systems. Here is how we handle it on the document-intelligence platform we run for a family office managing $850M in assets.
Why does one company end up as eleven nodes?
Because documents don't agree on names, and extraction copies whatever it reads. “Acme Holdings LLC” in a PPM, “Acme Holdings” in a cap table, “ACME HOLDINGS, L.L.C.” in a K-1, and “Acme” in an email are four strings describing one company. An extraction model reads each document on its own and faithfully creates a node for each spelling. Across six document types (PPMs, SPAs, SAFEs, K-1s, cap tables, and operating agreements), one company can easily pick up a dozen aliases before anyone looks.
The graph isn't wrong about any single document. It is wrong about the world, because it never decided which names point to the same thing.
What is the difference between extraction and entity resolution?
Extraction reads the words. Entity resolution decides who the words are about. They are two separate jobs, and conflating them is the root cause of duplicate-entity rot.
A name is a string. An identity is a decision.
In our pipeline, extraction runs first: Gemini 2.5 Pro pulls structured fields out of each document with a confidence score on every value. That step is good at “this paragraph names a company called X.” It has no opinion on whether company X already exists in the graph. That opinion comes from a dedicated resolution pass that runs after extraction, looking across every document at once instead of one at a time.
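One way to picture that split, as a minimal sketch: extraction emits mentions whose fields carry their own confidence, and the identity slot stays empty until the resolution pass fills it. The class and field names (`ExtractedField`, `Mention`, `entity_id`) and the confidence values are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExtractedField:
    value: str
    confidence: float  # 0.0 to 1.0, reported by the extraction step


@dataclass
class Mention:
    doc_id: str
    doc_type: str
    fields: Dict[str, ExtractedField]
    # Extraction never sets this; only the resolution pass assigns identity.
    entity_id: Optional[str] = None


# Hypothetical extraction output: two documents, two spellings, no identity yet.
mentions = [
    Mention("doc-1", "PPM", {"name": ExtractedField("Acme Holdings LLC", 0.97)}),
    Mention("doc-2", "K-1", {"name": ExtractedField("ACME HOLDINGS, L.L.C.", 0.95)}),
]
unresolved = [m for m in mentions if m.entity_id is None]
```

Keeping `entity_id` out of the extractor's hands is the point: the extractor reports what a document says, and a later pass decides who it is about.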
How do you resolve entities at scale without merging the wrong ones?
You resolve in stages, and you treat merging as a decision that needs evidence, not a string match. Our cross-document linker runs four steps:
- Normalize. Strip legal suffixes, casing, and punctuation so “ACME HOLDINGS, L.L.C.” and “Acme Holdings LLC” reduce to the same comparable form. This alone collapses the easy duplicates.
- Find candidates. For each entity, pull the small set of existing nodes it could plausibly match, rather than comparing it against the whole graph. This keeps the expensive comparisons rare.
- Score the match. Compare candidates on more than the name: shared people, shared addresses, shared identifiers, the documents they appear in. A name match with no supporting signal is weaker than a partial name match that shares a tax ID and three directors.
- Decide by confidence. High-confidence matches merge automatically. Low-confidence matches are flagged for review instead of guessed. The threshold is the whole game, and it is deliberately cautious, because the cost of a wrong merge is higher than the cost of a missed one.
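The first two steps can be sketched in a few lines. This is an illustration under stated assumptions, not our production linker: the suffix list is deliberately short, the blocking key (first normalized token) is the simplest thing that works for a demo, and "Zenith Capital LP" is a made-up name.

```python
import re
from collections import defaultdict

# Assumed, incomplete list of legal suffixes to strip.
LEGAL_SUFFIXES = {"llc", "lp", "llp", "inc", "corp", "ltd", "co"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, then drop trailing legal suffixes."""
    text = name.lower().replace(".", "")  # "l.l.c." becomes "llc"
    tokens = re.findall(r"[a-z0-9]+", text)
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def block_key(name: str) -> str:
    """Cheap grouping key: the first token of the normalized name."""
    return normalize(name).split(" ")[0]


def build_index(names):
    """Map each block key to the set of normalized names that share it."""
    index = defaultdict(set)
    for name in names:
        index[block_key(name)].add(normalize(name))
    return index


def candidates(name: str, index) -> set:
    """Only names in the same block are worth an expensive comparison."""
    return index.get(block_key(name), set())


names = [
    "Acme Holdings LLC",
    "Acme Holdings",
    "ACME HOLDINGS, L.L.C.",
    "Acme",
    "Zenith Capital LP",
]
index = build_index(names)
acme_candidates = candidates("Acme Inc.", index)
```

Three of the four Acme spellings collapse to `acme holdings` through normalization alone. The bare `Acme` does not, which is why it survives as a candidate for the scoring step instead of being merged on the string.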
Industry practitioners put the reliability floor for this work around 85% match accuracy, below which the graph becomes untrustworthy. The way you stay above that floor is not a smarter single model. It is refusing to guess on the ambiguous cases. This is the same discipline we apply across our data engineering and pipeline work: correctness comes from where you draw the confidence line, not from one clever step.
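To make the scoring and threshold steps concrete, here is a hedged sketch. The weights, the two cutoffs, the `Entity` shape, and the placeholder tax ID and director names are all assumptions chosen for illustration; real values would be tuned against labeled pairs. Name similarity uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever comparison a real system uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import FrozenSet, Optional

# Illustrative weights and thresholds, not tuned values.
W_NAME, W_TAX, W_PEOPLE, W_ADDR = 0.3, 0.4, 0.2, 0.1
MERGE_AT, REVIEW_AT = 0.7, 0.25


@dataclass(frozen=True)
class Entity:
    name: str  # assumed to be normalized already
    tax_id: Optional[str] = None
    people: FrozenSet[str] = frozenset()
    addresses: FrozenSet[str] = frozenset()


def overlap(a, b) -> float:
    """Jaccard overlap of two sets; 0.0 when either side is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_score(a: Entity, b: Entity) -> float:
    """Weighted evidence: the name is one signal among several."""
    name_sim = SequenceMatcher(None, a.name, b.name).ratio()
    same_tax = 1.0 if a.tax_id is not None and a.tax_id == b.tax_id else 0.0
    return (W_NAME * name_sim
            + W_TAX * same_tax
            + W_PEOPLE * overlap(a.people, b.people)
            + W_ADDR * overlap(a.addresses, b.addresses))


def decide(score: float) -> str:
    """Merge only on strong evidence; send the middle to review."""
    if score >= MERGE_AT:
        return "merge"
    if score >= REVIEW_AT:
        return "review"
    return "distinct"


directors = frozenset({"Director A", "Director B", "Director C"})  # placeholders

# Exact name match, nothing else in common.
name_only = match_score(Entity("acme holdings"), Entity("acme holdings"))

# Partial name match that shares a tax ID and three directors.
supported = match_score(
    Entity("acme", "TAX-001", directors),
    Entity("acme holdings", "TAX-001", directors),
)
```

With these assumed weights, the exact name match with no supporting signal lands in `review`, while the partial name match backed by a shared tax ID and directors clears the merge threshold.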
When do you let an LLM make the identity call?
Only on the ambiguous middle, and only with the evidence in front of it. Normalization and scoring resolve most pairs cleanly. What's left is the genuinely hard set: two entities with similar names, partial overlap, and no single decisive field.
For those, we hand the LLM the two candidate records plus the source passages they came from and ask it to judge whether they are the same entity, with a reason. The model is not free-associating from a name. It is reading the evidence we already extracted and making a call we can audit later. Confident pairs never reach this step, which keeps the cost down and the reasoning focused on the cases that actually need judgment.
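A sketch of what that hand-off might look like. The prompt wording, the JSON reply shape, the record layout, and the `call_llm` parameter are all assumptions for illustration; `call_llm` stands in for whatever model client is in use and is stubbed here so the example runs on its own.

```python
import json
from typing import Callable, Dict


def build_prompt(a: Dict, b: Dict) -> str:
    """Put both records and their source passages in front of the model."""
    return (
        "Decide whether these two records describe the same real-world entity.\n"
        "Use only the evidence below.\n"
        'Reply with JSON: {"same_entity": true or false, "reason": "..."}\n\n'
        f"Record A: {json.dumps(a['fields'], sort_keys=True)}\n"
        f"Source A: {a['passage']}\n\n"
        f"Record B: {json.dumps(b['fields'], sort_keys=True)}\n"
        f"Source B: {b['passage']}\n"
    )


def adjudicate(a: Dict, b: Dict, call_llm: Callable[[str], str]) -> Dict:
    """Ask for a verdict with a reason, and keep everything for audit."""
    prompt = build_prompt(a, b)
    raw = call_llm(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    ok = isinstance(verdict, dict) and isinstance(verdict.get("same_entity"), bool)
    return {
        # None means "no usable answer": route to human review, never merge.
        "same_entity": verdict["same_entity"] if ok else None,
        "reason": verdict.get("reason", "") if ok else "unparseable model output",
        "prompt": prompt,        # stored so the call can be audited later
        "raw_response": raw,
    }


# Hypothetical records and a stubbed model, for demonstration only.
rec_a = {"fields": {"name": "Daniel Chen", "role": "Director"},
         "passage": "Daniel Chen, Director of Acme Holdings LLC, hereby..."}
rec_b = {"fields": {"name": "Daniel Chen", "role": "Limited Partner"},
         "passage": "Limited Partner: Daniel Chen, residing at..."}


def fake_llm(prompt: str) -> str:
    return '{"same_entity": false, "reason": "Roles and addresses do not overlap."}'


result = adjudicate(rec_a, rec_b, fake_llm)
garbled = adjudicate(rec_a, rec_b, lambda prompt: "I think so?")
```

The design choice worth copying is the failure path: an answer that cannot be parsed is treated as no answer, so a malformed response can never turn into a merge.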
How do you avoid the opposite mistake: merging two things that aren't the same?
You watch for it on purpose, because it is the hardest rot to detect. A silent merge fuses two different entities into one. Two different people named Daniel Chen become a single person with two careers, and the evidence of the mistake is destroyed by the mistake.
We run a reconciler that cross-checks merged entities against the cap table and the source documents, looking for the contradictions a wrong merge produces: a single “person” controlling stakes that don't add up, a company with two conflicting formation dates. After the graph is built, a separate mislink-detection pass looks for edges and merges that the individual steps each thought were fine but that don't survive a second look. These are the same failure modes from the manifesto on the seven ways a knowledge graph rots, caught before an agent ever queries them.
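The kinds of contradiction checks described above can be sketched as follows. This is an assumed shape, not our reconciler: the field names, the cap-table layout (a dict keyed by holder and company), the tolerance, and all sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Fields that one real entity should only ever have one value for (assumed list).
SINGLE_VALUED = ("formation_date", "tax_id")


def conflicting_fields(source_records: Iterable[Dict]) -> Dict[str, List[str]]:
    """Return single-valued fields on which the merged records disagree."""
    records = list(source_records)
    conflicts = {}
    for name in SINGLE_VALUED:
        values = {r[name] for r in records if r.get(name)}
        if len(values) > 1:
            conflicts[name] = sorted(values)
    return conflicts


def stake_mismatches(
    holder: str,
    merged_stakes: Iterable[Tuple[str, float]],
    cap_table: Dict[Tuple[str, str], float],
    tolerance: float = 0.01,
) -> Dict[str, Tuple[float, float]]:
    """Compare a merged holder's summed stakes with the cap table line."""
    totals = defaultdict(float)
    for company, percent in merged_stakes:
        totals[company] += percent
    mismatches = {}
    for company, total in totals.items():
        expected = cap_table.get((holder, company))
        if expected is not None and abs(total - expected) > tolerance:
            mismatches[company] = (total, expected)
    return mismatches


# A merged "company" built from two source records with different formation dates.
date_conflicts = conflicting_fields([
    {"name": "acme holdings", "formation_date": "2015-03-01"},
    {"name": "acme holdings", "formation_date": "2019-11-12"},
])

# A merged "Daniel Chen" whose combined stakes exceed what the cap table shows.
cap_table = {("Daniel Chen", "Acme Holdings"): 10.0}
stake_conflicts = stake_mismatches(
    "Daniel Chen",
    [("Acme Holdings", 10.0), ("Acme Holdings", 15.0)],
    cap_table,
)
```

Either result being non-empty is a reason to reopen the merge, because the contradiction is often the only trace a wrong merge leaves behind.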
How do you know it actually worked?
You measure the graph against the documents, not against itself. Our nodes carry confidence scores from 0.0 to 1.0, so we can ask the graph directly which entities are low-confidence and route those to human review. The test that matters is simple: ask “how many companies do we hold?” and get a number the team trusts. When that number is stable and defensible, resolution is working. When it drifts every time someone uploads a document, it isn't.
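As a minimal sketch of that routing, assuming nodes are plain dicts with `type` and `confidence` keys and an illustrative review cutoff of 0.8 (neither is the platform's actual representation or setting):

```python
from typing import Dict, List

REVIEW_BELOW = 0.8  # illustrative cutoff, not a recommended value


def review_queue(nodes: List[Dict], threshold: float = REVIEW_BELOW) -> List[Dict]:
    """Low-confidence nodes, least certain first, for human review."""
    low = [n for n in nodes if n["confidence"] < threshold]
    return sorted(low, key=lambda n: n["confidence"])


def company_count(nodes: List[Dict]) -> int:
    """The number the team has to be able to trust."""
    return sum(1 for n in nodes if n["type"] == "company")


# Hypothetical graph nodes.
nodes = [
    {"id": "e1", "type": "company", "name": "acme holdings", "confidence": 0.97},
    {"id": "e2", "type": "company", "name": "zenith capital", "confidence": 0.62},
    {"id": "e3", "type": "person", "name": "daniel chen", "confidence": 0.41},
]
queue = review_queue(nodes)
holdings = company_count(nodes)
```

Tracking `company_count` before and after each upload is a cheap way to notice the drift described above.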
We build and fix knowledge graphs for AI systems, including a document-intelligence platform for a family office managing $850M in assets. If your graph is full of duplicates, book a 15-minute call.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
