How is evaluating a multi-agent system different from evaluating a single agent?

A single agent has one input, one output, and one trace, so scoring its final answers against a reference gets you most of the way. A multi-agent system fails in the seams between agents, not just inside them, so final-answer scoring misses two whole classes of failure: every component can pass while the wired system fails from compounding, and the system can pass while a component is quietly broken and masked by a lucky recovery downstream. That is why a multi-agent system needs three layers, outcome, component, and trajectory, rather than the one layer a single agent usually needs. The trajectory layer, scoring the path the system took, is the multi-agent-specific addition and the one almost every team skips.

How do you find out which agent caused a failure?

You need two things: a trajectory evaluation and a readable trace. The trace records, for every step, which agent ran, its inputs and outputs, the tools it called, what it wrote to shared state, and its cost, all under one run identifier, so a failed run can be replayed and pointed at. The fastest way to get that trace is to have agents coordinate through a shared structured store rather than by passing messages, because then the coordination substrate is already the audit log. A useful shortcut is to compare predicted reliability, the product of your measured component reliabilities given the topology, against observed system reliability: if the system is worse than its parts predict, the gap is an interaction failure in the hand-offs or routing rather than a broken agent, which tells you where to look before you open the trace.

What metrics should you track for a multi-agent system?

Track a metric at each of the three layers. At the outcome layer, task success rate against ground truth, reported as a rate over many runs with a confidence interval, never as a single pass or fail. At the component layer, a per-agent success rate for each agent on its own eval set, which also gives you the reliability numbers the topology math needs. At the trajectory layer, routing accuracy, tool-call correctness, and hand-off fidelity, plus deterministic checks like valid output and staying within a step and token budget. Underneath all of them, track cost and latency per *successful* task rather than per task, because a cheap wrong answer is not cheap, and a failure taxonomy that records which step broke when a run fails, so attribution accumulates instead of resetting each time.
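As a minimal sketch of the cost-per-successful-task metric, the snippet below divides total spend by the number of usable outputs. The `RunResult` fields and the dollar figures are illustrative assumptions, not values from any real system.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    success: bool     # did the outcome eval pass for this run
    cost_usd: float   # total spend for the run, all agents included (assumed unit)
    latency_s: float  # wall-clock time for the run


def cost_per_task(runs: list[RunResult]) -> float:
    """Naive metric: total spend divided by every run, right or wrong."""
    return sum(r.cost_usd for r in runs) / len(runs)


def cost_per_successful_task(runs: list[RunResult]) -> float:
    """Total spend divided by the number of runs that produced a usable output."""
    successes = sum(1 for r in runs if r.success)
    if successes == 0:
        return float("inf")  # nothing usable was produced at any price
    return sum(r.cost_usd for r in runs) / successes


# Hypothetical data: ten runs at $0.10 each, eight of which succeeded.
runs = [RunResult(success=i < 8, cost_usd=0.10, latency_s=2.0) for i in range(10)]
print(f"per task:            ${cost_per_task(runs):.3f}")
print(f"per successful task: ${cost_per_successful_task(runs):.3f}")
```

The second figure is always at least as large as the first, and the spread between them grows as the failure rate grows.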

Can you use an LLM as a judge to evaluate multi-agent systems?

Yes, for the open-ended outputs where no single answer is correct, but with two guardrails. First, do not use the same model family as your workers to judge them, because a same-model judge shares their blind spots and will approve exactly the errors they make. Second, calibrate the judge against a human-labeled sample and know its agreement rate before you let a judge score gate a release, because model judges are useful enough to lean on and not reliable enough to treat as ground truth. Wherever a correct answer actually exists, prefer a deterministic check, an exact match, a passing test, valid structured output, over any model judge, because it is cheaper, faster, and cannot hallucinate its own verdict.

How many test runs do you need to trust a multi-agent evaluation?

More than one, and usually more than you would guess, because these systems are non-deterministic. A single passing run of an eighty-percent-reliable system happens eighty percent of the time by definition, and two clean runs happen about sixty-four percent of the time, so a manual "it worked" is not a measurement. To read the ballpark reliability of an eighty-percent system you need on the order of a hundred runs, which gives a ninety-five percent interval near plus or minus eight points; to resolve a small change like five points you need on the order of a thousand runs. Hold the eval set fixed, re-run the whole thing on every change, and decide up front what size of improvement is worth detecting, because that decision sets how many runs, and how much token budget, each evaluation costs.

Evaluating Multi-Agent Systems: Which Agent Failed?

TL;DR: A multi-agent system gives you a confident, wrong answer, and unlike a single agent it hands you no stack trace to explain it. That is the real production problem, and single-agent evaluation does not solve it, because the failure usually lives in the seams between agents rather than inside any one of them. You need three layers of evaluation, and most teams run only the first. Outcome evaluation checks the final answer against ground truth and tells you whether the system failed, not where. Component evaluation scores each agent in isolation and gives you the per-agent reliability, the eighty or ninety percent number that the topology math depends on. Trajectory evaluation checks the path the system actually took, the routing, the tool calls, the hand-offs, and it is the multi-agent-specific layer where failure attribution lives. Put those together and the reliability math from the wiring post becomes an instrument: if your components predict about seventy-three percent reliability and the system delivers fifty-five, the eighteen-point gap is an interaction bug the component evals cannot see. And because these systems are non-deterministic, one green run is not a passing grade. An eighty-percent-reliable system passes any single test eighty percent of the time, so you have to measure a rate over many runs on a fixed eval set, roughly a hundred to read the ballpark and a thousand to resolve a five-point change. You cannot evaluate what you cannot see, so the highest-leverage investment is the trace, and the shared structured store from the last post doubles as one.

The last two posts got you to a working system. The first was about whether to use multiple agents at all, and its answer was to default to one and make the second prove it is necessary. The second was about how to wire them once you commit, the four topologies and the substrate underneath. This post is about the question that arrives the moment the system is live and someone forwards you a bad output: does this thing actually work, and when it does not, which part broke? That question is harder for a multi-agent system than for anything else you build, and the tools most teams reach for were designed for a single model answering a single prompt. They do not transfer, and the gap between what they measure and what you need to know is where flaky, expensive, untrustworthy agent systems come from.

A single agent that fails gives you one trace to read. A five-agent system that fails gives you a confident answer and a shrug. The bug could be in any agent, any hand-off, the routing, or the wiring itself, and the final answer alone cannot tell you which. Evaluation is not a scoreboard you check at the end. It is the instrument that turns a mysterious wrong answer into a specific broken step.

Why single-agent evaluation does not transfer

The instinct is to evaluate a multi-agent system the way you evaluate a model: assemble a set of inputs with known-good outputs, run them through, score the final answers, report a number. That number is worth having, but on its own it is close to useless for a system of agents, for two reasons that pull in opposite directions.

The first is that every component can pass and the system can still fail. This is the compounding math from the wiring post, read as a warning about evaluation. Three agents that each score ninety percent in isolation, wired so that all three must succeed, produce a system that is about seventy-three percent reliable, because 0.9 times 0.9 times 0.9 is about 0.73. Nobody's component eval is red. Every agent looks healthy on its own test. And the system fails more than a quarter of the time, in a way that no amount of staring at the individual scores will explain, because the loss is in the chaining, not the parts. Evaluate only the components and you will conclude the system is fine while production tells you it is not.
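The compounding arithmetic can be sketched in a few lines. The function name is hypothetical, and the product only holds under the assumption that component failures are independent and every agent in the chain must succeed.

```python
import math


def chain_reliability(component_rates):
    """Predicted reliability of an all-must-succeed chain.

    Assumes failures are independent; correlated failures break this estimate.
    """
    return math.prod(component_rates)


three_agents = chain_reliability([0.9, 0.9, 0.9])
five_agents = chain_reliability([0.9] * 5)

print(f"three agents at 0.9 each: {three_agents:.3f}")  # 0.729
print(f"five agents at 0.9 each:  {five_agents:.3f}")   # 0.590
```

Every input to the product is a passing component score, and the output still drops quickly as the chain gets longer.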

The second is the mirror image: the system can pass and a component can be quietly broken. A multi-agent system has many paths to a right-looking answer, and some of them are luck. A worker returns a subtly wrong intermediate result, a downstream agent happens to ignore the part that was wrong, and the final answer comes out correct anyway. Your outcome eval is green. The broken worker is still broken, and the next input, the one where the downstream agent does not happen to route around the error, fails in production with no warning. Outcome evaluation cannot see this, because it only looks at the end of the pipe. The failure was in the middle, masked by a lucky recovery.

So the two obvious evaluations, score the parts and score the whole, each miss a whole class of failure, and the classes they miss are exactly the ones that make multi-agent systems infuriating to debug. The information you actually need is not in either the components or the final answer. It is in the path between them.

The three layers you actually need

A multi-agent system needs three layers of evaluation, and they answer three different questions. Run all three or you are guessing.

Outcome evaluation scores the final output against a reference: ground truth where you have it, a rubric or a reference answer where you do not. This is the layer everyone builds, and it is necessary. It is the only layer that answers the question the business actually cares about, did the system produce the right result, and it is your regression backstop when you change anything. Its limit is that it is a single bit of information at the end of a long process. It tells you the system failed. It tells you nothing about where or why.

Component evaluation scores each agent in isolation, on its own inputs and its own outputs, against its own reference. A scorer agent gets a set of documents with known correct scores. A router gets a set of inputs with known correct routing decisions. A summarizer gets sources with reference summaries. This is the layer that gives you the per-agent reliability number, the p in the topology math, and without it that math is fiction, because you are multiplying numbers you never measured. It is also the layer that catches the quietly-broken-but-masked component the outcome eval hides. Its limit is that an agent that is perfect in isolation can still fail in place, because in the system it receives not your clean test input but the possibly-degraded output of the agent before it.
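A component eval can be as small as a function that runs one agent over its own labeled cases. The sketch below uses a toy keyword router as a stand-in for a real agent; the router, its labels, and the cases are all invented for illustration.

```python
def component_success_rate(agent, cases, matches=lambda got, want: got == want):
    """Fraction of (input, expected) cases the agent gets right in isolation."""
    if not cases:
        raise ValueError("component eval set is empty")
    passed = sum(1 for given, want in cases if matches(agent(given), want))
    return passed / len(cases)


def toy_router(text):
    """Stand-in for a routing agent: sends invoice questions to billing."""
    return "billing" if "invoice" in text.lower() else "support"


router_cases = [
    ("Invoice 114 is wrong", "billing"),
    ("Reset my password", "support"),
    ("I was charged twice", "billing"),  # the toy router misses this one
    ("App crashes on login", "support"),
]

router_rate = component_success_rate(toy_router, router_cases)
print(f"router reliability on its own eval set: {router_rate:.2f}")
```

The returned rate is the p that the topology math multiplies. For a stochastic agent, each case would be run several times, since a single pass over four cases is far too small to trust.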

Trajectory evaluation is the one almost nobody builds, and it is the multi-agent-specific layer. It scores the path the system took rather than the answer it landed on: did the orchestrator route to the right agents, did each agent call the right tools with the right arguments, did the hand-offs carry the information they were supposed to, did the run stay inside its step and token budget, did anything loop. This is where failure attribution lives, because a trajectory is a sequence you can point at, and pointing is what you cannot do with a final answer. Two runs can produce the same correct answer, one by a clean path and one by a lucky-wrong path that happened to cancel out, and only the trajectory eval can tell them apart. It is also the layer that lets you catch a regression before it reaches the outcome number: routing accuracy can start slipping while the final answers still look fine, because redundancy elsewhere is covering for it, right up until it is not.
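To make the trajectory questions concrete, here is a rough sketch of three path-level checks over a recorded run. The step dictionary shape (`agent`, `input`, `handoff`) is an assumed trace format, not a standard one.

```python
from collections import Counter


def route_matches(expected_agents, steps):
    """True if the agents ran in exactly the expected order."""
    return [s["agent"] for s in steps] == list(expected_agents)


def handoff_fidelity(steps, required_keys):
    """Fraction of hand-offs whose payload carries every required key."""
    handoffs = [s["handoff"] for s in steps if s.get("handoff") is not None]
    if not handoffs:
        return 1.0  # nothing was handed off, so nothing was dropped
    intact = sum(1 for h in handoffs if all(k in h for k in required_keys))
    return intact / len(handoffs)


def has_loop(steps, max_repeats=2):
    """Flag a run where one agent saw the same input more than max_repeats times."""
    counts = Counter((s["agent"], s["input"]) for s in steps)
    return any(n > max_repeats for n in counts.values())


# Hypothetical run: the scorer drops the "task" field when it hands off.
steps = [
    {"agent": "router", "input": "doc-17", "handoff": {"doc_id": "doc-17", "task": "score"}},
    {"agent": "scorer", "input": "doc-17", "handoff": {"doc_id": "doc-17"}},
    {"agent": "analyst", "input": "doc-17", "handoff": None},
]

print(route_matches(["router", "scorer", "analyst"], steps))
print(handoff_fidelity(steps, ["doc_id", "task"]))
print(has_loop(steps))
```

In this example the route is correct and nothing loops, yet half the hand-offs lost a field. That is the kind of defect a final-answer score would not surface.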

The three layers form a diagnostic hierarchy. Outcome tells you *whether*. Components tell you *which parts*. Trajectory tells you *where in the flow*. A team that runs only outcome evaluation knows its system is failing and has no idea why, which is the single most common state we find production agent systems in.

Figure: A three-layer evaluation stack for a multi-agent system (outcome, component, trajectory), with a gap diagnostic comparing predicted reliability (0.9 times 0.9 times 0.9, about seventy-three percent) against observed reliability (about fifty-five percent), and a band on non-determinism noting that one green run is not a passing grade.

The gap between predicted and observed is the diagnostic

Once you have component numbers and a topology, you can do something the single-agent world never lets you do: predict what the system's reliability should be, then measure what it actually is, and read the difference as a specific kind of bug.

Start from the components. Say you have measured, on their own eval sets, that each of three agents in a required chain succeeds about ninety percent of the time. The topology is all-must-succeed. The predicted system reliability is the product, 0.9 times 0.9 times 0.9, about seventy-three percent. That is the number the parts promise, given how they are wired. Now run the outcome eval on the full system over many inputs and measure what it actually delivers. Three things can happen, and each one means something precise.

If the observed reliability matches the prediction, about seventy-three percent, your model of the system is correct. The system is exactly as reliable as its parts and its wiring say it should be, and if that is not good enough, you know the fix is either better components or a topology with redundancy, because there is no hidden loss to find.

If the observed reliability is *lower* than predicted, say the system delivers fifty-five percent when the parts promised seventy-three, the eighteen-point gap is an interaction failure that none of the component evals can see: a hand-off that is silently dropping or garbling information, errors that are correlated rather than independent so they strike together, or a routing decision that is sending inputs to the wrong agent. The gap does not just tell you something is wrong. It tells you the wrongness is in the seams, not the nodes, which is exactly the information that saves you from tuning agents that were never the problem.

And if the observed reliability is somehow *higher* than the product predicts, that is not luck, it is a sign your topology has redundancy you did not account for, some path is quietly voting or retrying, which is worth knowing because it is also a cost you may be paying without having chosen to.

This is the payoff of doing all three layers instead of one. The component evals give you the prediction. The outcome eval gives you the observation. The trajectory eval tells you which seam the gap is hiding in. A single outcome number can tell you a system is fifty-five percent reliable. Only the three together can tell you that it *should* be seventy-three, and that the missing eighteen points are being lost in one specific hand-off, which is the difference between a week of guessing and an afternoon of fixing. It is the same discipline we apply when we grade a knowledge graph before trusting it: a health number is only useful if it decomposes into the specific defect that produced it.
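The three-way reading of the gap can be written down as a small function. This is a sketch: the `tolerance` parameter is an assumption standing in for the sampling noise of your outcome eval, and in practice it should come from that eval's confidence interval rather than a fixed constant.

```python
import math


def diagnose_gap(component_rates, observed, tolerance=0.05):
    """Compare predicted chain reliability with the observed outcome rate.

    tolerance is a placeholder for measurement noise (an assumption here).
    """
    predicted = math.prod(component_rates)  # all-must-succeed, independent failures
    gap = predicted - observed
    if gap > tolerance:
        verdict = "interaction failure: inspect hand-offs, correlated errors, routing"
    elif gap < -tolerance:
        verdict = "unaccounted redundancy: look for hidden retries or voting"
    else:
        verdict = "matches prediction: improve components or add redundancy"
    return predicted, gap, verdict


predicted, gap, verdict = diagnose_gap([0.9, 0.9, 0.9], observed=0.55)
print(f"predicted {predicted:.2f}, observed 0.55, gap {gap:.2f}")
print(verdict)
```

With the numbers used above, the function reports a gap of roughly eighteen points and points at the seams rather than the components.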

You cannot evaluate what you cannot see

All three layers depend on one thing that has nothing to do with scoring: you have to be able to see what happened. A trajectory eval is impossible if there is no trajectory to read, and this is where the substrate choice from the last post stops being an architecture decision and becomes an evaluation decision.

If your agents coordinate by passing natural-language messages, your trace is a transcript smeared across a dozen context windows, and reconstructing what the system actually did means reading prose that no two agents interpreted the same way. You can bolt tracing on with the observability tools built for this, LangSmith, Langfuse, Arize Phoenix, and Braintrust all exist precisely because agent runs are otherwise opaque, and you should use one. But the deeper fix is structural. If your agents coordinate through a shared structured store, the blackboard from the wiring post, then your coordination substrate *is* your trace. Every fact, score, decision, and write is already recorded in a queryable store with defined fields, tied to a run, in order. The audit log is not something you add for evaluation. It is the same object the system already uses to think. The blackboard architecture was built for a speech system around 1980 to give independent modules one consistent source of truth, and one consistent source of truth is exactly what a trajectory eval reads. This is the same reason the memory series treated the structured store rather than the message history as the real memory: structured state is inspectable, and prose is not.

Concretely, an evaluable trace records, for every step, the agent that ran, its inputs, its outputs, the tools it called and with what arguments, what it wrote to shared state, and the tokens and time it cost, all under one run identifier. With that, a failed run is a thing you can replay and point at. Without it, debugging a multi-agent failure is archaeology, and you will spend more time reconstructing what happened than fixing it. The teams that can answer "which agent broke" in minutes are not smarter. They just built the trace first, and usually they built it by choosing a substrate that produced the trace as a byproduct.
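One possible shape for such a trace record is sketched below. The field names are illustrative assumptions chosen to mirror the list above, not a schema from any particular tracing tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class TraceStep:
    run_id: str          # one identifier shared by every step of a run
    step: int            # position in the run, for replay in order
    agent: str           # which agent ran
    inputs: dict
    outputs: dict
    tool_calls: list[ToolCall] = field(default_factory=list)
    state_writes: dict = field(default_factory=dict)  # what it wrote to shared state
    tokens: int = 0
    seconds: float = 0.0


def steps_for_run(trace, run_id):
    """Return one run's steps in execution order, ready to replay."""
    return sorted((s for s in trace if s.run_id == run_id), key=lambda s: s.step)


trace = [
    TraceStep("run-2", 0, "router", {"doc": "b"}, {"route": "scorer"}),
    TraceStep("run-1", 1, "scorer", {"doc": "a"}, {"score": 0.8},
              tool_calls=[ToolCall("lookup", {"doc": "a"})],
              state_writes={"score": 0.8}, tokens=420, seconds=1.3),
    TraceStep("run-1", 0, "router", {"doc": "a"}, {"route": "scorer"}),
]

for s in steps_for_run(trace, "run-1"):
    print(json.dumps(asdict(s)))
```

Because each step is plain structured data, it can be filtered by run, sorted, serialized, and asserted against without parsing any prose.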

The problem of who grades the graders

For anything with a single correct answer, evaluation is a lookup, and you should lean on it: exact match, a passing test, valid structured output, a numeric tolerance. Deterministic checks are cheap, fast, perfectly reliable, and they never hallucinate their own verdict, so run every check you can as code before you reach for anything cleverer. Did the agent produce parseable output. Did the write to shared state succeed. Did the routing decision land on an agent that exists. Did the run stay inside budget. A surprising amount of trajectory evaluation is just deterministic assertions on the trace, and that part is the easy, trustworthy part.
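The four questions above translate almost directly into assertions on the trace. The sketch below assumes each step is a dictionary with `output`, `routed_to`, `write_ok`, and `tokens` keys; those names, the agent registry, and the budget limits are all hypothetical.

```python
import json

KNOWN_AGENTS = {"router", "scorer", "analyst"}  # hypothetical agent registry


def check_step(step, known_agents=KNOWN_AGENTS):
    """Return a list of deterministic failures for one trace step."""
    failures = []
    try:
        json.loads(step["output"])
    except (TypeError, ValueError):
        failures.append("output is not parseable JSON")
    target = step.get("routed_to")
    if target is not None and target not in known_agents:
        failures.append(f"routed to unknown agent {target!r}")
    if not step.get("write_ok", True):
        failures.append("write to shared state failed")
    return failures


def check_run(steps, max_steps=20, max_tokens=50_000):
    """Collect (step index, failure) pairs plus run-level budget violations."""
    failures = [(i, f) for i, s in enumerate(steps) for f in check_step(s)]
    if len(steps) > max_steps:
        failures.append((None, "step budget exceeded"))
    if sum(s.get("tokens", 0) for s in steps) > max_tokens:
        failures.append((None, "token budget exceeded"))
    return failures


steps = [
    {"output": '{"score": 0.8}', "routed_to": "scorer", "tokens": 100},
    {"output": "not json", "routed_to": "summarizer", "tokens": 100},
]
for index, failure in check_run(steps):
    print(index, failure)
```

Each failure comes back tagged with the step index that produced it, so even the cheapest checks contribute to attribution.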

The hard part is the open-ended outputs, the summary or the analysis or the plan where there is no single right string, and here the field's default is to use another model as the judge. It is a genuinely useful technique and the frameworks for it are mature, Ragas and DeepEval and OpenAI Evals among them, but in a multi-agent setting it carries two risks that are easy to walk into. The first is correlated blindness: if your judge is the same model family as your workers, it shares their blind spots, so it will confidently approve exactly the errors your workers confidently make. This is the independence problem from the wiring post wearing a different hat, and the mitigation is the same, diversify, use a different model to judge than the one that did the work. The second is that judging a trajectory is harder than judging an answer, and a model asked whether a whole multi-step path was reasonable is being asked to do more than a model asked whether one answer is correct, so its verdicts are noisier exactly where you most need signal.

The honest framing is that a model judge is useful enough to lean on and not reliable enough to treat as ground truth. Reported agreement between model judges and human raters is good enough to make them worth running and not good enough to make them the last word, so you calibrate: hand-label a sample of outputs, measure how often the judge agrees with your humans, and know that gap before you trust a judge score to gate a release. Use deterministic checks wherever a right answer exists, a diversified model judge where it does not, and human labels as the anchor that keeps the judge honest. A judge you have never checked against a person is a number you have not earned.
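Calibrating a judge comes down to comparing its verdicts with human labels on the same sample. The sketch below computes the raw agreement rate and also Cohen's kappa, which corrects for agreement expected by chance. Kappa is an addition here, not something prescribed above, and the labels are invented.

```python
def agreement_rate(judge, human):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(1 for j, h in zip(judge, human) if j == h) / len(judge)


def cohens_kappa(judge, human):
    """Agreement corrected for chance, from the two raters' label frequencies."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical hand-labeled sample of ten outputs.
human = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "pass", "fail"]

print(f"agreement: {agreement_rate(judge, human):.2f}")
print(f"kappa:     {cohens_kappa(judge, human):.2f}")
```

A sample of ten only illustrates the calculation. A calibration set large enough to trust is subject to the same sample-size arithmetic as any other measured rate.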

The non-determinism tax nobody budgets for

Here is the part that catches careful teams, because it has nothing to do with prompts or topology and everything to do with statistics. A multi-agent system is non-deterministic. Run the same input twice and you can get two different paths and two different answers. Which means a single test run is almost worthless as evidence, and "it worked when I tried it" is one of the most misleading sentences in this whole domain.

Do the arithmetic. Suppose your system is genuinely eighty percent reliable, meaning one input in five fails. You run it once by hand to check your change, and it works. That proves nothing, because an eighty-percent system passes any single test eighty percent of the time by definition. You run it a second time to be sure, and it works again. Still weak evidence: two clean runs of an eighty-percent system happen 0.8 times 0.8, about sixty-four percent of the time, so a system that fails one input in five will still show you two greens in a row roughly two times in three. The manual "seems fine" is not a measurement. It is a coin you have flipped twice.
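The same arithmetic extends to any number of manual spot-checks. A minimal sketch, assuming each run succeeds independently with the same probability:

```python
def prob_all_pass(reliability, runs):
    """Chance that every one of `runs` independent runs passes."""
    return reliability ** runs


for k in (1, 2, 5, 10):
    print(f"{k:>2} clean runs from an 80% system: {prob_all_pass(0.8, k):.0%}")

# Two greens in a row barely separate a mediocre system from a good one.
print(f"two clean runs from a 95% system: {prob_all_pass(0.95, 2):.0%}")
```

Two clean runs are a common sight for both an eighty-percent and a ninety-five-percent system, which is why they say so little about which one you have.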

To actually measure the reliability of a stochastic system you need a rate over many runs on a fixed eval set, and the number of runs is larger than intuition suggests. The standard error of a measured proportion is roughly the square root of p times one minus p over N, so at eighty percent reliability, a hundred runs gives a standard error near four points and a ninety-five percent interval near plus or minus eight. That is fine for a ballpark, is this system in the eighties or the sixties, and useless for a small improvement, because a change that moves you from eighty to eighty-five is invisible inside a plus-or-minus-eight band. To resolve a five-point change you need the standard error down near one and a quarter points, which takes on the order of a thousand runs. These figures rest on the normal approximation, which gets shakier at the extremes, but the shape of the conclusion is robust: you cannot tell a real improvement from noise on ten runs, and often not on a hundred.
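The sample-size figures can be reproduced with the normal approximation. The helper names below are hypothetical, and the results are rough guides, since the approximation weakens as p approaches zero or one.

```python
import math


def standard_error(p, n):
    """Standard error of a measured proportion (normal approximation)."""
    return math.sqrt(p * (1 - p) / n)


def runs_for_standard_error(p, target_se):
    """Approximate number of runs needed to bring the standard error to target_se."""
    return math.ceil(p * (1 - p) / target_se ** 2)


se_100 = standard_error(0.8, 100)
half_width_100 = 1.96 * se_100  # half-width of a 95% interval
runs_needed = runs_for_standard_error(0.8, 0.0125)

print(f"100 runs: SE {se_100:.3f}, 95% interval about +/- {half_width_100:.3f}")
print(f"runs for an SE of 1.25 points: about {runs_needed}")
```

Because the standard error shrinks with the square root of N, halving the interval costs four times the runs.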

This has three consequences most teams learn the expensive way. Hold your eval set fixed, because a moving target makes run-to-run numbers incomparable and hides real regressions in sampling noise. Re-run the whole eval on every change, because a non-deterministic system can regress from an unrelated edit and a single spot-check will not catch it. And decide up front what size of improvement is worth detecting, because that decision sets how many runs each evaluation costs, and evaluation of a multi-agent system is itself expensive, every run is the full token bill times your sample size. The teams that ship reliable agent systems are not the ones with the cleverest prompts. They are the ones who treat evaluation as a measurement problem with a sample size, the same way they would treat any other experiment, which is the same rigor the memory-evaluation post argued for on a single agent, now with the variance turned up because the system has more moving parts.
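One way to decide whether a before-and-after difference is real is an interval on the difference between the two pass rates. This is a sketch using the normal approximation with unpooled variances; the pass counts are invented.

```python
import math


def compare_rates(pass_a, n_a, pass_b, n_b, z=1.96):
    """Difference in pass rate (b minus a) with an approximate 95% interval."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    low, high = diff - z * se, diff + z * se
    detectable = not (low <= 0 <= high)  # interval excludes zero
    return diff, (low, high), detectable


# The same five-point improvement, measured at two sample sizes.
small = compare_rates(80, 100, 85, 100)
large = compare_rates(800, 1000, 850, 1000)

for label, (diff, (low, high), detectable) in (("100 runs", small), ("1000 runs", large)):
    print(f"{label}: diff {diff:+.3f}, interval ({low:+.3f}, {high:+.3f}), detectable={detectable}")
```

The improvement is identical in both cases; only the larger sample can distinguish it from zero.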

What this looks like when we build it

The abstract version is three layers, one gap diagnostic, a trace, and a sample size. Here is the concrete version, anonymized, from systems we have shipped.

Paralegent, our multi-agent legal-analysis system, is evaluable because of a decision that looks like architecture and turns out to be evaluation. Its twenty-three agents, twelve scorers and eleven analysts, do not pass transcripts to each other. They read and write one shared, structured scores table, which is a blackboard in everything but name, and that table is also the trace. Because every scorer writes a structured score to one place, we can run component evaluation on each scorer against a reference, read the trajectory straight off the table without reconstructing it from prose, and check the routing decision, did we skip an agent whose input would have changed the outcome, as a first-class question rather than a mystery. That last check is what makes the roughly seventy-five percent reduction in model calls from routing trustworthy rather than reckless: routing is only safe when you can measure that the agents you skipped would not have changed the answer, and you can only measure that when the trajectory is inspectable. Routing without an evaluation harness is not optimization, it is hoping, and the shared table is what turns the hope into a number.
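As an illustration of how a skipped-agent check might be framed (this is a generic sketch, not the Paralegent implementation), you can run a sample of eval cases both with every agent and with routing enabled, then measure how often the final answer differs. The function name and the data are hypothetical.

```python
def skip_change_rate(paired_answers):
    """Fraction of cases where routing changed the final answer.

    paired_answers holds (answer_with_all_agents, answer_with_routing) per case.
    """
    if not paired_answers:
        raise ValueError("need at least one paired case")
    changed = sum(1 for full, routed in paired_answers if full != routed)
    return changed / len(paired_answers)


# Hypothetical sample: routing changed the outcome in one case out of twenty.
paired = [("approve", "approve")] * 19 + [("reject", "approve")]
print(f"answers changed by routing: {skip_change_rate(paired):.0%}")
```

On a non-deterministic system some of that difference is ordinary run-to-run noise, so the rate is best read against a baseline of how often two full runs disagree with each other.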

The counter-example is the wealth-management platform, and it is instructive because its evaluation is easy for exactly the reason its architecture is boring. The document-to-knowledge pipeline runs eight fixed, ordered stages feeding a knowledge graph in Neo4j with six entity types and five relationship types, and because we built it as a deterministic pipeline rather than a chain of agents, its evaluation is per-stage and reproducible: the same input gives the same output, so a single run is real evidence and the non-determinism tax mostly does not apply. Better still, the system emits its own evaluation signal, a zero-to-one confidence on every extracted fact, and the evaluation job becomes calibration, checking whether the facts it marks ninety-percent-confident are actually right about ninety percent of the time. The lesson across both systems is the same one the whole evaluation discipline keeps teaching: the hardness of evaluation tracks the non-determinism of the architecture. Deterministic stages are cheap to trust. Autonomous agents are where the sample sizes and the trajectory traces earn their cost, and pretending otherwise is how systems ship unmeasured.
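A calibration check of that kind can be sketched by bucketing facts by stated confidence and comparing each bucket's average confidence with its measured accuracy. The input format and the sample data below are assumptions for illustration.

```python
from collections import defaultdict


def calibration_table(facts, n_bins=10):
    """Group (confidence, is_correct) pairs into bins and compare the two.

    Returns rows of (bin_low, bin_high, count, mean_confidence, accuracy).
    """
    bins = defaultdict(list)
    for confidence, correct in facts:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins[index].append((confidence, correct))
    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((index / n_bins, (index + 1) / n_bins, len(items), mean_conf, accuracy))
    return rows


# Hypothetical labeled sample: well calibrated near 0.9, overconfident near 0.55.
facts = [(0.92, True)] * 9 + [(0.92, False)] + [(0.55, True)] + [(0.55, False)] * 3

for low, high, count, mean_conf, accuracy in calibration_table(facts):
    print(f"[{low:.1f}, {high:.1f}) n={count:>2} stated {mean_conf:.2f} actual {accuracy:.2f}")
```

A well-calibrated system shows stated and actual values close together in every bin; a bin where they diverge marks the range of confidence scores that should not be trusted as-is.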

How to tell your evaluation is lying to you

A few symptoms say the problem is not your system but your measurement of it, and each maps to a specific fix.

Your eval is green and production is not. Your eval set does not match the real input distribution, or you are measuring outcome only and missing trajectory failures that production hits and your test set does not. Fix the set first, then add the trajectory layer.

You cannot say which agent caused the last failure. You have no trajectory eval and probably no usable trace, so you are debugging by re-reading transcripts, and the fix is a structured, replayable trace before anything else.

You changed a prompt and it "seems better." You ran it a handful of times on a stochastic system and you are reading noise as signal, so fix a large eval set and measure a rate with an interval before you believe any improvement.

Your judge is the same model as your workers. It shares their blind spots and is rubber-stamping the errors you most need caught, so diversify the judge and calibrate it against human labels.

And you report cost per task rather than cost per *successful* task. A cheap wrong answer looks efficient and is not, because a wrong answer at any price is a task you still have to do again, so the only cost that means anything is the cost of the outputs you can actually use.

None of these fixes is a new framework or a better model. They are all the same move: measure the system the way you would measure any non-deterministic process, in three layers, over enough runs, off a trace you can actually read. The next post in this series goes into the cost engineering that keeps all of this affordable, the routing and the token economics that decide whether a system you can now trust is a system you can also pay for. But trust comes first, because a fast, cheap multi-agent system you cannot evaluate is not an asset. It is a liability you have not measured yet.

Shipped a multi-agent system you cannot fully trust, and cannot tell which agent is the flaky one? That is one of the first things we untangle on an engagement, because the failure is almost always in the seams and almost always invisible without the right trace. Book a 15-minute call and we will look at how your agents coordinate, tell you honestly which of your evaluation layers is missing, and show you where a readable trace turns a mysterious wrong answer into a specific broken step. We work US business hours.

Muhammad Mudassir

Related Articles

Four Ways to Wire a Multi-Agent System (and When Each One Breaks)

Most Multi-Agent Systems Would Work Better as One Agent

Your Agent's Memory Benchmark Is Measuring the Wrong Thing
