TL;DR
When does plain vector RAG hit its ceiling, and when do you actually need GraphRAG? Real production thresholds, cost numbers, and migration patterns.
Most retrieval-augmented generation systems are built on vector search. It works — until it doesn't. The cliff is real, and most teams hit it sooner than they expect. This post is about three things: when plain vector RAG stops being enough, what failure modes show up first, and what GraphRAG actually buys you in production.
I have seen this transition twice in 2025 alone, both at organizations with knowledge bases past the 1M-document mark. In both cases, the team had a working RAG demo and a degraded production system on the same architecture. The diagnosis was the same: the architecture had outgrown its retrieval layer.
The retrieval ceiling — what it looks like
Plain vector RAG behaves predictably under three conditions: a corpus small enough to cover most queries within top-K results, queries that map cleanly to a single document, and answers that don't require reasoning across multiple sources. When any of these breaks, the system degrades silently. The metrics that matter — citation accuracy, answer faithfulness, query coverage — all start drifting at roughly the same point.
Three signals you have outgrown vector RAG
- Top-K recall flattens. Increasing K from 5 to 20 stops improving the answer quality. The relevant document is in the index, but vector similarity is no longer surfacing it reliably.
- Multi-hop questions return wrong synthesis. The retriever pulls the right entities, but the LLM hallucinates the relationship between them — because the relationship was never in the retrieved chunks.
- Citation accuracy drops below 90%. The model cites real documents, but the cited claim isn't supported by the cited passage. This shows up under audit, not in eval scores.
These signals don't arrive as a step function. They drift in over weeks as the corpus grows. The team that catches them is the team running citation audits in production — not the team running BLEU.
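The first signal is the easiest to instrument. A minimal sketch of a recall@K audit, assuming you maintain a labeled eval set of `(query, relevant_doc_id)` pairs and some `search(query, k)` function over your index (both are stand-ins for your own pipeline, not a specific library's API):

```python
# Sketch: detect the "recall flattens" signal from a labeled eval set.
# `search(query, k)` is a hypothetical hook into your retriever that
# returns ranked document ids.

def recall_at_k(eval_pairs, search, ks=(5, 10, 20)):
    """Fraction of queries whose relevant doc appears in the top-K."""
    results = {}
    for k in ks:
        hits = sum(
            1 for query, relevant_id in eval_pairs
            if relevant_id in search(query, k)
        )
        results[k] = hits / len(eval_pairs)
    return results

def recall_flattened(results, tolerance=0.02):
    """True when raising K from the smallest to the largest tested
    value buys almost no additional recall."""
    ks = sorted(results)
    return results[ks[-1]] - results[ks[0]] < tolerance
```

Run this weekly against production traffic samples; the drift described above shows up as the gap between recall@5 and recall@20 shrinking toward zero while absolute recall falls.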
Why vector retrieval breaks at scale
The fundamental issue is that embeddings collapse the structural relationships in your knowledge into a single similarity score. For a corpus of a few thousand documents, this is fine — most queries can be answered by retrieving the few semantically closest passages. At 100K documents, the same query has hundreds of plausible matches, most of which are subtly off-topic. By 1M documents, you are gambling.
The semantic-drift problem
Two passages can have a 0.92 cosine similarity and answer different questions. Embeddings encode topical similarity, not factual relevance. As corpus density grows, the gap between "similar" and "answers the question" widens. Reranking helps, but only if the right passage is in the top-K to begin with — and at scale, increasingly often, it isn't.
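To make the 0.92 figure concrete, here is a toy illustration. The two-dimensional vectors below are fabricated stand-ins for real embeddings of two passages that share a topic but answer different questions; only the cosine arithmetic is real:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings for two topically close, factually distinct passages:
p_scope = [1.0, 0.0]       # "SOC2 audit scope" passage (hypothetical)
p_deadline = [0.92, 0.39]  # "SOC2 audit deadline" passage (hypothetical)

print(round(cosine(p_scope, p_deadline), 2))  # → 0.92
```

A similarity of 0.92 tells you the passages live in the same topical neighborhood; it tells you nothing about which one supports the claim the user is asking about. That gap is exactly what a reranker is supposed to close, and exactly what it cannot close when the right passage never made the candidate list.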
The multi-hop problem
Vector RAG retrieves passages independently. If your answer requires traversing a relationship — "which clauses in Vendor A's contract conflict with the SOC2 audit requirements set by the parent organization?" — no single passage contains the answer. The retriever returns the SOC2 passage, the contract passage, and the parent-org policy passage, and asks the LLM to figure out the relationship. The LLM either fabricates one or gives up.
What GraphRAG actually changes
GraphRAG isn't a replacement for vector search. The production pattern is hybrid: a knowledge graph holds entities and the relationships between them, and a vector index holds the textual passages. Retrieval runs both, then merges the outputs.
```python
# Hybrid retrieval — sketch. `extract_entities`, `graph`, `vec`, `embed`,
# `merge`, and `rerank` are stand-ins for your own pipeline components.
async def retrieve(query: str) -> list[Doc]:
    entities = extract_entities(query)                     # entity linking on the query
    graph_paths = await graph.traverse(entities, depth=2)  # 2-hop graph neighborhood
    vector_hits = await vec.search(embed(query), k=12)     # semantic candidates
    return rerank(merge(graph_paths, vector_hits), query)  # fuse and rerank
```

The graph isn't there to replace the embedding — it is there to give the retriever structure. When the user asks a multi-hop question, the graph traversal finds the path; when the user asks a fuzzy semantic question, vector search handles it. Most production queries are a mix of both.
What this fixes
- Multi-hop questions: traversed paths are first-class results, not synthesized hallucinations.
- Citation accuracy: graph-anchored answers preserve the relationship structure during synthesis, so the LLM cites the path it actually used.
- Query coverage: rare entities get found through graph neighborhood, not just embedding proximity.
- Drift control: as the corpus grows, structural relevance scales with the graph, not with corpus density.
When to migrate — and when not to
GraphRAG is more expensive to build and operate. The entity-extraction pipeline that builds the graph runs at ingest time and isn't cheap. The graph database is another piece of infrastructure to monitor. The retrieval logic is more complex. None of this is worth it unless one of three thresholds is true:
- Your corpus is past 100K documents AND queries are increasingly cross-cutting.
- Multi-hop answers are a regular use case (compliance, contract analysis, diagnostic flows).
- Citation accuracy is a hard requirement (regulated industries, legal, healthcare).
Below these thresholds, the right move is usually better chunking, better embeddings, or a reranker — not GraphRAG. We have walked teams off the GraphRAG migration when the real problem was that their chunks were too large and their reranker was untuned. A two-week reranking project saved them a six-month migration.
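The three thresholds above can be encoded as a blunt decision helper. This is illustrative only — the 100K cutoff comes from this post, and the 20% cross-cutting-query share is an assumed placeholder, not a measured constant:

```python
# Illustrative encoding of the three migration thresholds from this post.
# The 0.2 cross-cutting share is a hypothetical cutoff, not a benchmark.

def should_consider_graphrag(
    corpus_docs: int,
    cross_cutting_query_share: float,  # fraction of queries spanning sources
    multihop_is_core: bool,            # compliance, contracts, diagnostics
    citation_accuracy_required: bool,  # regulated, legal, healthcare
) -> bool:
    past_scale = corpus_docs > 100_000 and cross_cutting_query_share > 0.2
    return past_scale or multihop_is_core or citation_accuracy_required
```

If this returns `False`, the two-week reranking project is almost certainly the better investment than the six-month migration.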
Cost in real numbers
For a 4M-document deployment we ran in 2025, the cost shape was approximately: storage 1.8x (Neo4j next to the vector index), ingest compute 1.4x (entity extraction at insert time), query compute roughly equal once entity caching was warm, and total infra cost about 1.6x. The team's citation accuracy went from 87% to 96%, multi-hop answer correctness went from 41% to 78%, and p95 query latency increased by 130ms.
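As a back-of-envelope sanity check, the per-component multipliers compose to roughly the reported total when weighted by each component's share of baseline spend. The shares below are illustrative assumptions, not figures from the deployment:

```python
# Back-of-envelope: compose the per-component cost multipliers above
# into a total. The cost shares (0.55 / 0.35 / 0.10) are hypothetical
# weights chosen for illustration; substitute your own bill breakdown.

def graphrag_monthly_cost(baseline: float,
                          storage_share: float = 0.55,
                          ingest_share: float = 0.35,
                          query_share: float = 0.10) -> float:
    return baseline * (storage_share * 1.8    # graph DB next to vectors
                       + ingest_share * 1.4   # entity extraction at insert
                       + query_share * 1.0)   # warm entity cache

print(round(graphrag_monthly_cost(10_000)))  # effective multiplier ≈ 1.58x
```

The point of the exercise: the total lands near 1.6x only if storage dominates your baseline. If your bill is query-compute-heavy, the effective multiplier drops well below that.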
GraphRAG is the right tool when retrieval correctness is more valuable than retrieval cheapness. For most enterprise knowledge work past 100K documents, that trade is obvious.
What to read next
If you have decided GraphRAG is the right next step, the implementation specifics — entity extraction strategy, graph schema design, query routing — live in the GraphRAG implementation guide. If you are still deciding, the enterprise RAG security article covers a related failure mode (data exfiltration via retrieval) that often forces the migration before scale does.
For teams running this at scale today, the production playbook for agentic AI systems covers how to wire GraphRAG retrieval into a multi-agent system without losing observability.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years experience
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
