TL;DR
A multi-agent architecture is a cost you pay for two things: parallelism on independent subtasks and isolation for specialists. Most teams pay it without getting either. The field is split: Anthropic reported that a multi-agent research system beat a single agent on its own eval while burning roughly fifteen times the tokens of an ordinary chat, with token volume alone explaining about eighty percent of the score, while Cognition argues for keeping writes single-threaded because parallel agents make conflicting decisions. A second agent multiplies your tokens, multiplies coordination failures, and compounds unreliability across every hop. Default to one capable agent with good tools and memory; reach for multiple only when the work truly parallelizes, when it needs walled-off context, or when a specialist beats a generalist, and then share structured state, route instead of fanning out, and keep writes single-threaded.
TL;DR: A multi-agent architecture is a cost you pay for two specific things: parallelism on genuinely independent subtasks, and isolation when a job needs specialists that must not share context. Most teams pay the cost and get neither. The field itself is split on this. Anthropic reported that a multi-agent research system beat a single agent by a wide margin on its own internal eval, and in the same post reported it burned roughly fifteen times the tokens of an ordinary chat, with token volume alone explaining about eighty percent of the score. Cognition, from the opposite corner, argues you should not build parallel multi-agents at all, because agents acting on partial context make conflicting decisions that a later step has to clean up, so you should keep a single continuous thread. Both are right about different problems. A second agent multiplies your token bill, multiplies the ways coordination can fail, and compounds unreliability across every hop, because a ten-step chain at ninety-five percent per step is only about sixty percent reliable end to end. So default to one capable agent with good tools and memory, and reach for multiple only when the work truly parallelizes, when subtasks need walled-off context, or when a specialist genuinely beats a generalist. When you do, follow the pattern the debate is converging on: let extra agents contribute intelligence, keep the writes single-threaded, share structured state instead of passing messages, route instead of fanning out blindly, and put the token bill on the same scorecard as quality.
A team ships a working agent, one model in a loop with a handful of tools and a memory store, and it does the job. Then they read that multi-agent is the frontier, so they split it up: a planner, three parallel researchers, a critic, and a synthesizer. Latency triples. The bill goes up by an order of magnitude. And the three researchers come back with three subtly different answers, because none of them saw what the others found, so now there is a reconciliation step whose entire purpose is to clean up the mess the fan-out created. The demo that worked as one agent is now slower, more expensive, and less consistent, and it is doing exactly the same task. Nobody stopped to ask the only question that matters: what were the extra agents for?
This is the first post in a new series on multi-agent systems in production, and it starts where it has to, at the decision, because the most expensive multi-agent mistake is building one you did not need. The last series covered a single agent's memory, from why agents forget to how to manage a working set inside a finite window to how to evaluate whether any of it actually works. This series is about what happens when one agent becomes many, and the first thing to understand is that many is not an upgrade. It is a trade, and most teams take the wrong side of it.
A second agent does not add intelligence to your system. It adds a coordination problem, a second context window, and a new way for two parts of your own product to disagree with each other. Sometimes the task is worth all three. Usually it is not, and the honest default is one agent that is good at its job.
The field cannot agree, and that tells you something
Start with the fact that the two most credible sources on this question point in opposite directions, because it is the clearest signal that the answer is not architectural fashion.
In one corner, Anthropic published an engineering account of a multi-agent research system, an orchestrator that plans a query, spins up three to five subagents in parallel to explore different threads, then synthesizes their findings with a separate citation pass. They reported it beat a single agent by a wide margin on their own internal research eval. That is the strongest public case for going multi-agent, and it comes with two numbers that matter more than the win. The same post reported the system used on the order of fifteen times the tokens of an ordinary chat interaction, and that token volume by itself explained roughly eighty percent of the variation in how well systems scored. Read those two facts together and the victory reframes itself. The multi-agent system did not win because coordination is magic. It won in large part because it spent an enormous amount of compute exploring in parallel, and much of the measured advantage tracks the spending. This is a vendor reporting its own result, so treat the numbers as directional. It is also being used here partly against the vendor's own interest: the party with the most reason to sell multi-agent is also the party telling you it costs fifteen times more.
In the other corner, Cognition published an argument with the blunt title "Don't Build Multi-Agents." Their position, from shipping a real coding agent, is that the moment you fan work out to parallel subagents, each one acts on a partial view of the task and makes its own implicit decisions about style, edge cases, and interpretation, and those decisions conflict. You then need a step whose job is to reconcile the disagreements your own architecture created. Their prescription is to keep a single continuous thread where context is shared in full, and in a later follow-up they sharpened it into a principle worth memorizing: extra agents are fine when they contribute intelligence, reading and analyzing, but the writes, the actions that change state, should stay single-threaded. One writer, shared context, no conflicting decisions.
Neither of these is wrong, which is the point. Anthropic's task was breadth-first research, where the subtasks really are independent, exploring one source does not depend on exploring another, and the value of the answer justifies the token bill. Cognition's task was writing code, where every decision constrains the next and a second agent working in parallel is a second author who never read the first author's mind. The lesson is not that multi-agent is good or bad. It is that the architecture is a fit to a task shape, and if you cannot say which shape you have, you are not ready to choose. A 2026 line of research adds a sharp footnote: several preprints report that when you hold the total thinking-token budget equal, a single agent matches or beats a multi-agent setup on multi-hop reasoning, which again suggests that a good part of what looks like coordination winning is really just more tokens winning. Treat those as not-yet-peer-reviewed, but they push in the same direction as Anthropic's own eighty-percent number.
What a second agent actually buys you
There are exactly two things multi-agent architecture gives you that a single strong agent cannot, and it is worth being precise about both, because everything else people reach for it to fix is fixed more cheaply another way.
The first is parallelism, but only on genuinely independent subtasks. If your problem decomposes into pieces that can run without waiting on each other, reviewing forty documents, searching ten sources, checking a claim against five databases, then spreading them across agents that run at the same time cuts wall-clock time. The critical qualifier is independent. The moment subtask B needs the output of subtask A, they are not parallel, they are a sequence wearing a costume, and running them as separate agents adds coordination overhead while removing none of the wait. And even genuine parallelism has a ceiling that is older than any of this. Amdahl's law, from 1967, says the speedup you can get is capped by the fraction of the work that has to happen serially. If thirty percent of your task is inescapably sequential, the planning, the final synthesis, the reconciliation, then even with infinite agents you cannot go faster than about three and a third times the single-agent speed. The serial glue is the ceiling, and multi-agent systems tend to have a lot of it.
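The Amdahl ceiling is easy to check by hand. A minimal sketch, assuming the textbook form of the law, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n is the number of parallel workers; the function names are illustrative, not from any library.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Best-case speedup over one worker when part of the job cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_ceiling(serial_fraction: float) -> float:
    """Limit of the speedup as the worker count grows without bound."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be greater than 0 and at most 1")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # 30% serial work, the figure used in the paragraph above.
    for agents in (1, 3, 5, 23, 1000):
        print(f"{agents:>4} agents: {amdahl_speedup(0.30, agents):.2f}x")
    print(f"ceiling: {amdahl_ceiling(0.30):.2f}x")
```

With a thirty percent serial share, going from five agents to twenty-three moves the bound only from about 2.3x to about 3.0x, and no agent count gets past 3.33x. This is a best case: it ignores the extra coordination work that more agents usually add to the serial part.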
The second is isolation, which comes in two useful forms. One is specialization: a narrowly-scoped agent with a tight prompt, a focused toolset, and no unrelated context can sometimes outperform a generalist that is holding the whole world in its head, in the same way a focused working set beat a stuffed context window in the working-memory post. The other is deliberate walling-off, cases where you specifically do not want one agent to see everything. An adversarial critic that reviews an answer should not see the author's private reasoning, or it will rubber-stamp it. Independent research paths should not see each other early, or they converge and you lose the breadth you were paying for. Context isolation is a real and sometimes irreplaceable benefit, and it is the one argument for multi-agent that a bigger single context window does not answer.
That is the entire honest list. Notice what is not on it. Multi-agent does not make a model smarter. It does not fix a bad retriever, a weak prompt, or a missing tool, and reaching for it to paper over those is how teams end up with an expensive distributed system that has all the same bugs, now spread across five processes.
What it charges you
Against those two gains sit three costs, and unlike the gains, you pay all three every time, whether or not the task justifies them.
The first is token multiplication, and it grows faster than head count. An orchestrator plus N workers does not cost N times one agent, it costs more, because the orchestrator has to establish context for each worker, each worker re-establishes the shared background in its own window, and the synthesis step re-reads everything they produced. Anthropic's own fifteen-times figure is the honest headline here, and their finding that token volume explains most of the score variance is the uncomfortable part: a large share of what a multi-agent system buys you, you could have bought with the same tokens spent on one agent thinking longer. Before you attribute a quality gain to coordination, you have to rule out that you simply spent more, and most comparisons never control for that.
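To see where the extra tokens come from, here is a toy accounting model. It is a sketch under loud assumptions: the four cost components and every number in the example are invented for illustration, real systems also re-read context on every turn, and nothing here reproduces Anthropic's measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenModel:
    # All fields are illustrative assumptions, not measured values.
    background: int     # shared context every window has to hold
    work_per_task: int  # tokens one subtask spends reading and reasoning
    brief: int          # tokens the orchestrator spends briefing one worker
    output: int         # tokens each worker hands back for synthesis


def single_agent_tokens(m: TokenModel, subtasks: int) -> int:
    # One window: background is loaded once, subtasks run in sequence.
    return m.background + subtasks * m.work_per_task


def multi_agent_tokens(m: TokenModel, workers: int) -> int:
    # The orchestrator holds the background and writes one brief per worker.
    orchestrator = m.background + workers * m.brief
    # Each worker reads its brief, reloads the background, works, and reports.
    each_worker = m.brief + m.background + m.work_per_task + m.output
    # Synthesis holds the background and re-reads every worker's report.
    synthesis = m.background + workers * m.output
    return orchestrator + workers * each_worker + synthesis


if __name__ == "__main__":
    model = TokenModel(background=2000, work_per_task=3000, brief=300, output=500)
    print("single:", single_agent_tokens(model, subtasks=4))
    print("multi: ", multi_agent_tokens(model, workers=4))
```

With these made-up inputs, four subtasks cost 14,000 tokens in one window and 30,400 across an orchestrator, four workers, and a synthesis pass. The useful work is identical in both cases; the difference is entirely briefs, reloaded background, and re-read outputs.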
The second is coordination and consistency failure, which is Cognition's argument stated as an engineering cost. The instant you have two agents acting in parallel, you have built a distributed system, except the nodes are non-deterministic, the wire protocol is natural language, and no two messages mean exactly the same thing twice. Agents make conflicting implicit decisions, the same document gets summarized two incompatible ways, one agent assumes a definition another one contradicts, and the failure does not announce itself as an error. It shows up as a subtly incoherent final answer, or as a reconciliation step you had to add whose only job is to undo the divergence your fan-out introduced. This is exactly the consistency problem the memory-evaluation post named as a first-class metric, now multiplied across processes: more independent writers means more chances to contradict, and a system that contradicts itself reads as untrustworthy no matter how good any single agent was.
The third is compounding unreliability, and it is the one you can put a number on. Chain agent steps in a sequence and the reliabilities multiply. If each step in a pipeline succeeds ninety-five percent of the time, which is an optimistic assumption for an autonomous LLM step, then a ten-step chain succeeds about sixty percent of the time, and a twenty-step chain about thirty-six percent. Drop the per-step reliability to ninety percent and a ten-step chain is down near thirty-five percent, worse than a coin flip. The arithmetic is unforgiving because it is exponential: 0.95 to the tenth power is roughly 0.60, 0.90 to the tenth is roughly 0.35. Every hop you add is another factor below one, and error recovery across agents is harder than within one agent, because a downstream agent often cannot tell that an upstream one quietly failed, it just receives a confident wrong input and builds on it. A single agent that loops has this problem too, but adding agents adds hops, and hops are where reliability goes to die.
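The compounding arithmetic fits in a few lines. A minimal sketch, assuming each step fails independently and that any single failure sinks the whole chain, which is the simplification the numbers above rely on; the helper names are illustrative.

```python
import math


def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success rate when every step must succeed and failures are independent."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return per_step ** steps


def max_steps_for(per_step: float, target: float) -> int:
    """Largest number of steps that keeps end-to-end reliability at or above target."""
    if not (0.0 < per_step < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("need 0 < per_step < 1 and 0 < target <= 1")
    return math.floor(math.log(target) / math.log(per_step))


if __name__ == "__main__":
    for per_step, steps in ((0.95, 10), (0.95, 20), (0.90, 10)):
        print(f"{per_step:.2f} ^ {steps} = {chain_reliability(per_step, steps):.2f}")
    print("hops allowed at 0.95 per step for an 80% target:", max_steps_for(0.95, 0.80))
```

Turning the question around is often more useful than the forward calculation: at ninety-five percent per step, an eighty percent end-to-end target leaves room for only four hops.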
The decision, as a gate
Put the gains and the costs together and the decision stops being a matter of taste. Before you build a second agent, make it earn its place against three questions, and be strict, because the default should be one agent and the burden of proof is on the second.
Does the task decompose into subtasks that are genuinely independent, able to run without waiting on each other's output? If the pieces form a line where each needs the last, you do not have a parallelism case, you have a pipeline, and a pipeline does not need autonomous agents. Do any subtasks need isolated context, either because a specialist with a narrow view measurably beats a generalist, or because you specifically must prevent one agent from seeing what another knows, like a critic that has to stay independent of the author? If nothing needs walling off, a single agent with the full picture is simpler and more coherent. And does a specialized agent, given the same tools and the same token budget as one strong generalist agent, actually win in an honest head-to-head? That last clause is where most multi-agent designs quietly fail, because when you equalize the token budget, the gap the team attributed to their clever topology often disappears.
If the answers are mostly no, build one agent, give it good tools and a real memory store, and let it loop. It will be cheaper, more consistent, easier to evaluate, and easier to debug, because there is one place where things happen and one trace to read. If the answers are yes, then build the multi-agent system deliberately, knowing exactly which gain you are buying and which costs you are choosing to pay. The asymmetry is the whole strategy: a single agent is the cheaper hypothesis, so test it first, and make the second agent prove it is necessary before it exists, not after it is in production and the bill has already arrived.
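The gate can be written down as a checklist so that the answer is recorded rather than assumed. This is one possible encoding, not a formal rule from the sources above: the field names are hypothetical, and it takes the reading that any single clear yes names the gain a second agent is allowed to buy, while no yes at all means one agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateAnswers:
    independent_subtasks: bool             # pieces can run without each other's output
    needs_isolated_context: bool           # a critic or specialist must not see everything
    specialist_wins_at_equal_budget: bool  # measured with the same tools and token budget


def gains_earned(answers: GateAnswers) -> list[str]:
    """Name the gains a second agent has actually justified."""
    gains = []
    if answers.independent_subtasks:
        gains.append("parallelism")
    if answers.needs_isolated_context:
        gains.append("isolation")
    if answers.specialist_wins_at_equal_budget:
        gains.append("specialization")
    return gains


def recommend(answers: GateAnswers) -> str:
    gains = gains_earned(answers)
    if not gains:
        # The default: the second agent has not carried its burden of proof.
        return "single agent"
    return "multi-agent for: " + ", ".join(gains)


if __name__ == "__main__":
    print(recommend(GateAnswers(False, False, False)))
    print(recommend(GateAnswers(True, False, True)))
```

The value of writing it this way is that the output names which gain is being bought, so a later review can check whether that gain ever showed up.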
If you go multi-agent, pay the tax on purpose
When the gate says yes, the difference between a multi-agent system that works and one that quietly rots is whether you spend on the coordination or fight it. Four disciplines separate the two, and they are the same conclusions the field's own debate keeps landing on.
Let the extra agents contribute intelligence, and keep the writes single-threaded. This is the single-writer principle, and it is the synthesis both corners of the argument can live with. Many agents can read, analyze, score, and propose in parallel, because reading does not conflict. But the action that changes shared state, the decision that gets committed, should flow through one path, so two agents can never commit contradictory writes. Contribute-in-parallel, commit-in-sequence keeps the parallelism benefit while removing most of the consistency risk.
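Contribute-in-parallel, commit-in-sequence can be sketched with a queue and one consumer. This is an illustration using Python's asyncio, not a description of how any system mentioned here is built: the `analyst` coroutine stands in for a model call, the field names are hypothetical, and the "first proposal wins, later disagreements are logged" policy is an assumption chosen to keep the example short.

```python
import asyncio
from typing import NamedTuple


class Proposal(NamedTuple):
    agent: str
    field: str
    value: str


async def analyst(agent: str, field: str, value: str, proposals: asyncio.Queue) -> None:
    # Stand-in for a model call: the agent reads and analyzes, then proposes.
    await asyncio.sleep(0)
    await proposals.put(Proposal(agent, field, value))


async def single_writer(proposals: asyncio.Queue, expected: int):
    # The only code path that mutates shared state.
    state: dict[str, str] = {}
    conflicts: list[Proposal] = []
    for _ in range(expected):
        proposal = await proposals.get()
        if proposal.field in state and state[proposal.field] != proposal.value:
            conflicts.append(proposal)  # surfaced for review, never silently applied
        else:
            state[proposal.field] = proposal.value
    return state, conflicts


async def run(jobs):
    proposals: asyncio.Queue = asyncio.Queue()
    writer = asyncio.create_task(single_writer(proposals, len(jobs)))
    # Analysts run concurrently; none of them touches `state` directly.
    await asyncio.gather(*(analyst(a, f, v, proposals) for a, f, v in jobs))
    return await writer


if __name__ == "__main__":
    jobs = [
        ("scorer_a", "governing_law", "New York"),
        ("scorer_b", "governing_law", "Delaware"),
        ("analyst_c", "term_months", "24"),
    ]
    state, conflicts = asyncio.run(run(jobs))
    print(state)
    print(conflicts)
```

Two agents disagree about the same field, and the disagreement shows up as an explicit conflict record instead of as two contradictory commits. What the writer does with a conflict (first wins, highest confidence wins, escalate to a human) is a policy decision the sketch leaves open.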
Share structured state, not messages. The fastest way to accumulate coordination bugs is to have agents pass long natural-language transcripts to each other, because every hand-off is a fresh chance to misread. A shared, structured store that every agent reads from and writes to, a table of facts, scores, or decisions with defined fields, replaces the game of telephone with a single source of truth. It is the same reason the memory series treated the store, not the message history, as the real memory: structured state is queryable, consistent, and does not drift on each retelling.
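A shared store with defined fields can be as small as one table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the schema, column names, and the 0-to-1 score range are hypothetical choices for illustration, not the schema of any system described in this post.

```python
import sqlite3

SCHEMA = """
CREATE TABLE scores (
    document_id TEXT NOT NULL,
    metric      TEXT NOT NULL,
    value       REAL NOT NULL CHECK (value BETWEEN 0 AND 1),
    agent       TEXT NOT NULL,
    PRIMARY KEY (document_id, metric)
)
"""


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_score(conn: sqlite3.Connection, document_id: str, metric: str,
                 value: float, agent: str) -> bool:
    """Insert one score; return False if the store rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO scores (document_id, metric, value, agent) VALUES (?, ?, ?, ?)",
                (document_id, metric, value, agent),
            )
    except sqlite3.IntegrityError:
        # Duplicate (document, metric) or out-of-range value: a visible rejection.
        return False
    return True


def read_scores(conn: sqlite3.Connection, document_id: str) -> dict[str, float]:
    rows = conn.execute(
        "SELECT metric, value FROM scores WHERE document_id = ? ORDER BY metric",
        (document_id,),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    store = open_store()
    print(record_score(store, "doc-1", "indemnity_risk", 0.7, "scorer_a"))
    print(record_score(store, "doc-1", "indemnity_risk", 0.2, "scorer_b"))
    print(read_scores(store, "doc-1"))
```

The primary key is doing the coordination work here: a second agent trying to score the same metric on the same document gets an explicit rejection, which is the opposite of two transcripts quietly disagreeing.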
Route, do not fan out blindly. A system that spins up every agent on every input is paying full price on every request, most of it wasted. A routing layer that inspects the input and activates only the agents it actually needs turns a fixed heavy cost into a variable one that matches the work. This is not a micro-optimization, it is often the difference between an architecture that is affordable and one that is not.
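A routing layer can start as a table of cheap predicates. This is a deliberately simple sketch: the agent names and keyword rules are hypothetical, and a production router might use a classifier or a small model call instead of substring checks.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]

# Hypothetical registry: agent name -> cheap test for whether the agent is needed.
ROUTES: Dict[str, Callable[[Document], bool]] = {
    "indemnity_scorer": lambda doc: "indemnif" in doc["text"].lower(),
    "termination_analyst": lambda doc: "terminat" in doc["text"].lower(),
    "governing_law_analyst": lambda doc: "governing law" in doc["text"].lower(),
    "summary_scorer": lambda doc: True,  # cheap pass that always runs
}


def route(doc: Document, routes: Dict[str, Callable[[Document], bool]] = ROUTES) -> List[str]:
    """Return only the agents this document actually needs, in registry order."""
    return [name for name, is_needed in routes.items() if is_needed(doc)]


def calls_saved(doc: Document, routes: Dict[str, Callable[[Document], bool]] = ROUTES) -> int:
    """How many agent calls routing avoided compared with firing every agent."""
    return len(routes) - len(route(doc, routes))


if __name__ == "__main__":
    doc = {"text": "Either party may terminate this agreement with 30 days notice."}
    print(route(doc))
    print("calls saved:", calls_saved(doc))
```

The predicates run before any model is called, so the cost of deciding is close to zero and the cost of each request scales with what the input contains rather than with how many agents exist.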
And put tokens and latency on the same scorecard as quality, exactly as the evaluation post argued for memory. A multi-agent system that is more accurate at fifteen times the cost and triple the latency is not automatically better, and if accuracy is the only column you track, you will never see the trade you actually made. Score cost per task next to quality per task, and the marginal agent has to justify its marginal spend or it does not ship.
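One way to keep cost on the scorecard is to compute the marginal trade explicitly. The sketch below is an assumption-heavy illustration: the fields, the example numbers, and the "quality gain per extra baseline's worth of tokens" threshold are all invented, and the right threshold is a business decision, not something the code can supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    name: str
    quality: float    # mean eval score per task, 0 to 1
    tokens: int       # mean tokens per task
    latency_s: float  # mean wall-clock seconds per task


def marginal(baseline: Scorecard, candidate: Scorecard) -> dict:
    """Express the candidate as a trade against the baseline."""
    return {
        "quality_gain": candidate.quality - baseline.quality,
        "token_multiple": candidate.tokens / baseline.tokens,
        "latency_multiple": candidate.latency_s / baseline.latency_s,
    }


def earns_its_place(baseline: Scorecard, candidate: Scorecard,
                    min_gain_per_extra_baseline: float) -> bool:
    """True if quality gained per extra baseline's worth of tokens clears the bar."""
    trade = marginal(baseline, candidate)
    extra = trade["token_multiple"] - 1.0
    if extra <= 0:
        # Cheaper or equal in tokens: ship if it is no worse on quality.
        return trade["quality_gain"] >= 0
    return trade["quality_gain"] / extra >= min_gain_per_extra_baseline


if __name__ == "__main__":
    one_agent = Scorecard("single agent", quality=0.70, tokens=10_000, latency_s=20.0)
    swarm = Scorecard("multi-agent", quality=0.78, tokens=150_000, latency_s=60.0)
    print(marginal(one_agent, swarm))
    print(earns_its_place(one_agent, swarm, min_gain_per_extra_baseline=0.01))
```

In the invented example the multi-agent run is eight points better at fifteen times the tokens and three times the latency, and it fails a bar of one quality point per extra baseline of spend. Tracking accuracy alone would have recorded that run as a win.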
What this looks like when we build it
The abstract version is a gate and four disciplines. Here is the concrete version, from systems we have shipped, anonymized.
Paralegent, our multi-agent legal-analysis system, is a case where the gate says yes, and it says yes for a clear reason: analyzing a legal document decomposes into genuinely independent passes, many separate scores and analyses over the same source, none of which needs to wait on the others. So it runs twenty-three agents, twelve scorers and eleven analysts. But it pays the tax on purpose rather than by accident. The agents do not pass transcripts around, they share a structured scores table and a queue, so the state every agent reads and writes is one consistent source of truth rather than twenty-three drifting retellings. The writes land in that shared table in a disciplined way instead of twenty-three agents taking conflicting actions, which is the single-writer idea in practice. And a routing layer decides which agents a given document actually needs instead of firing all twenty-three every time, the change that cut model calls by roughly seventy-five percent. That routing number is not a performance flex, it is the thing that makes a twenty-three-agent system economically sane, and it exists because we treated cost as a first-class metric, not an afterthought.
The counter-example matters just as much. On a wealth-management platform we built, the document-to-knowledge pipeline runs eight stages, extraction, entity resolution, confidence scoring, and the rest, feeding a knowledge graph in Neo4j with six entity types and five relationship types. It would be easy to call that a multi-agent system and it is not one, deliberately. The stages are a fixed, ordered sequence with no genuine independence and no need for one part to autonomously coordinate with another, so we built it as a deterministic pipeline, not a swarm of agents. Wrapping those stages in autonomous agents would have added token cost, non-determinism, and coordination failure modes in exchange for nothing, because the work was never parallel or ambiguous enough to need them. The skill is not knowing how to build multi-agent systems. It is knowing which of your problems is Paralegent and which is the pipeline, and having the discipline to build each as what it is. It is the same judgment we apply when we grade a graph before trusting it or watch for graph rot: the architecture serves the problem, never the fashion.
How to tell you are paying for multi-agent without getting it
A handful of symptoms point straight at an architecture that took on the costs of multi-agent without earning the gains.
You split one agent into several and you cannot name the independent subtask each one owns. If the honest description is "they work together on the whole thing," you have divided one job across many minds, which is the setup for conflicting decisions, not for parallelism. Your agents pass long natural-language messages to each other, and somewhere in the flow there is a step whose real job is to reconcile the ways they disagreed. Your token bill jumped by an order of magnitude and your quality did not move with it, which means you are paying the fifteen-times multiplier for tokens that a single agent could have spent thinking longer. Your agents do not actually run at the same time, they wait for each other in a line, so what you built is a pipeline with the overhead of a swarm and the benefit of neither. And the tell that settles it: a single strong agent, given the same tools and the same token budget, matches your system in an honest test. When that is true, the extra agents were never doing work, they were spending money.
The fix is not a better framework. It is to collapse back to one agent, get it genuinely good with tools and memory, and reintroduce a second agent only at the exact point where a gate question is a clear yes, and then build that second agent with shared structured state, single-threaded writes, routing, and cost on the scorecard. Fewer agents, each earning its place, beats more agents papering over a design you have not finished.
This is where the series begins. A single agent's memory was the last story, one mind learning to remember across sessions. This one is about many minds trying to act as a system without tripping over each other, and it opens at the decision because the decision is where most of the value and most of the waste live. The posts after this go into the topologies, the frameworks, the way you evaluate a multi-agent system specifically, and the cost engineering that keeps it affordable. But none of that matters if the first question went the wrong way. Build the one agent you can defend, prove you need the second, and you will spend your compute on capability instead of on coordinating a system that never needed coordinating.
Not sure whether your problem needs a multi-agent system or one strong agent doing more? That is one of the first questions we answer on an engagement, before a line of orchestration gets written, because the wrong call there is the most expensive mistake in the build. Book a 15-minute call and we will tell you honestly which of your problems is genuinely parallel, which is a pipeline in disguise, and where a single well-built agent will beat a swarm at a fraction of the cost. We work US business hours.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
