TL;DR
Supervisor + specialist pattern on AgentCore — how to wire it, observe it, and bound its cost in production. Real architecture, real numbers.
Multi-agent systems fail in two ways. The first is obvious — the model picks the wrong tool, or hallucinates a parameter, or returns malformed output. Those failures are loud. You see them in your eval suite. The second mode is the dangerous one: the system silently runs up a 50x cost on a query that should have been a no-op, and you find out when the bill arrives.
This post is about the production-grade orchestration pattern that protects against both. It is what we ship when a client asks for a multi-agent system on AWS, and it is what we have walked teams back to after their bespoke graph-of-agents architecture started losing requests.
The supervisor + specialist pattern
One coordinator (the supervisor) takes the full user request, decomposes it into sub-tasks, and routes each to a specialist agent. Each specialist has a narrow toolset, a narrow system prompt, and visibility only into the sub-task it owns. The supervisor reassembles the results into a final answer.
This is not the most flexible architecture. It is, however, the most reliable. It separates planning from execution. It gives you a single place to enforce budgets and rate limits. It produces traces that humans can actually read.
What goes in the supervisor
- Task decomposition. The supervisor turns the user prompt into a sequence of typed sub-tasks.
- Routing. Each sub-task gets assigned to one specialist by capability match.
- Budget enforcement. Per-call token limits, per-session step limits, and a circuit breaker on tool latency.
- Result aggregation. Specialists return structured outputs; the supervisor assembles the final response.
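To make "typed sub-tasks" and "capability match" concrete, here is a minimal sketch. The `Capability`, `SubTask`, and `Specialist` types and the `route` function are hypothetical names chosen for illustration; they are not part of AgentCore.

```typescript
// Hypothetical types for illustration; none of these names come from AgentCore.
type Capability = "search" | "billing" | "summarize";

interface SubTask {
  id: string;
  capability: Capability; // the single capability this sub-task needs
  input: string;
}

interface Specialist {
  name: string;
  capabilities: Capability[];
}

// Route a sub-task to exactly one specialist; fail loudly when nothing matches
// instead of letting the supervisor improvise.
function route(task: SubTask, specialists: Specialist[]): Specialist {
  const matches = specialists.filter(
    (s) => s.capabilities.indexOf(task.capability) !== -1
  );
  if (matches.length === 0) {
    throw new Error(`no specialist for capability "${task.capability}"`);
  }
  return matches[0];
}

const specialists: Specialist[] = [
  { name: "retrieval", capabilities: ["search"] },
  { name: "accounts", capabilities: ["billing"] },
];

// What a decomposed user request might look like after the planning step.
const plan: SubTask[] = [
  { id: "t1", capability: "search", input: "find the latest invoice policy" },
  { id: "t2", capability: "billing", input: "look up invoice 1042" },
];

const assignments = plan.map((t) => ({
  taskId: t.id,
  specialist: route(t, specialists).name,
}));
```

Throwing on an unroutable sub-task is a design choice in this sketch: a loud failure in the supervisor is easier to catch in evals than a silent fallback to the wrong specialist.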
What goes in the specialists
- A narrow system prompt scoped to the sub-task type.
- A small set of tools — usually 1-4. More than that and you are starting to need another specialist.
- Memory scoped to the sub-task. Specialists don't share session memory directly.
- Per-tool retry and fallback semantics, configured per specialist, not globally.
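One way to keep retry and fallback settings local to each specialist is to store them on the specialist's own config, as sketched below. The `SpecialistConfig` and `RetryPolicy` shapes, the tool names, and the numbers are all assumptions made for this example, not an AgentCore schema.

```typescript
// Hypothetical config shape for illustration; not an AgentCore schema.
interface RetryPolicy {
  maxAttempts: number;   // total attempts, including the first
  baseDelayMs: number;   // delay before the second attempt
  fallbackTool?: string; // tool to try once attempts are exhausted
}

interface SpecialistConfig {
  name: string;
  systemPrompt: string;
  tools: string[];
  retry: { [tool: string]: RetryPolicy | undefined }; // keyed by tool name
}

// Tools without an explicit policy get a single attempt and no fallback.
const DEFAULT_POLICY: RetryPolicy = { maxAttempts: 1, baseDelayMs: 0 };

const billing: SpecialistConfig = {
  name: "billing",
  systemPrompt: "Answer invoice questions using only the billing tools.",
  tools: ["getInvoice", "getInvoiceCached"],
  retry: {
    getInvoice: { maxAttempts: 3, baseDelayMs: 200, fallbackTool: "getInvoiceCached" },
  },
};

// Look up the policy for one tool on one specialist; no global table is consulted.
function policyFor(spec: SpecialistConfig, tool: string): RetryPolicy {
  if (spec.tools.indexOf(tool) === -1) {
    throw new Error(`${spec.name} does not own tool ${tool}`);
  }
  return spec.retry[tool] ?? DEFAULT_POLICY;
}

// Exponential backoff: no wait before the first attempt, then the delay doubles.
function delayBeforeAttempt(policy: RetryPolicy, attempt: number): number {
  return attempt <= 1 ? 0 : policy.baseDelayMs * Math.pow(2, attempt - 2);
}
```

With this shape, changing how the billing specialist retries `getInvoice` cannot affect any other specialist, which is the point of configuring retries per specialist.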
Wiring it on AgentCore
AgentCore gives you primitives for each piece. The supervisor is an agent with a routing tool. Each specialist is a separate agent. AgentCore's gateway handles tool registry, IAM scoping, and request signing; the runtime handles session memory and trace emission.
// Supervisor agent definition (sketch)
const supervisor = await agentCore.createAgent({
  name: 'supervisor',
  instructions: 'Decompose the user task. Route each sub-task to one specialist by capability.',
  tools: [routeToSpecialist, returnFinalAnswer],
  budget: { maxTokensPerCall: 8000, maxStepsPerSession: 12 },
  memory: { strategy: 'session', maxItems: 50 },
})

The budget block is what most teams skip and then regret. Without `maxStepsPerSession` and a hard timeout, a single recursive routing decision can spawn an unbounded chain. We have seen sessions consume $400 of inference before tripping a manual kill switch.
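The sketch below shows what a step limit plus a hard timeout might look like if you enforced them yourself. `createBudgetGuard`, `SessionBudget`, and `Verdict` are hypothetical names, the 60-second timeout is an arbitrary value, and AgentCore's own enforcement may work differently. The clock is passed in as a number so the behavior is deterministic.

```typescript
// Hypothetical guard for illustration; AgentCore's own enforcement may differ.
interface SessionBudget {
  maxSteps: number;
  maxTokensPerCall: number;
  timeoutMs: number; // hard wall-clock limit for the whole session
}

type Verdict =
  | { ok: true }
  | { ok: false; reason: "steps" | "tokens" | "timeout" };

function createBudgetGuard(budget: SessionBudget, startedAtMs: number) {
  let steps = 0;
  return {
    // Call before every routing decision or specialist call.
    check(requestedTokens: number, nowMs: number): Verdict {
      if (nowMs - startedAtMs > budget.timeoutMs) {
        return { ok: false, reason: "timeout" };
      }
      if (steps >= budget.maxSteps) {
        return { ok: false, reason: "steps" };
      }
      if (requestedTokens > budget.maxTokensPerCall) {
        return { ok: false, reason: "tokens" };
      }
      steps += 1; // only approved calls consume a step
      return { ok: true };
    },
    stepsUsed(): number {
      return steps;
    },
  };
}

// Simulate a routing loop that would otherwise never terminate.
const guard = createBudgetGuard(
  { maxSteps: 12, maxTokensPerCall: 8000, timeoutMs: 60000 },
  0
);
let attempts = 0;
let stopReason = "";
while (stopReason === "") {
  const verdict = guard.check(1500, attempts * 1000); // one call per simulated second
  attempts += 1;
  if (verdict.ok === false) {
    stopReason = verdict.reason;
  }
}
// The 13th attempt is refused with reason "steps"; 12 steps were consumed.
```

The guard refuses the call rather than throwing, so the supervisor can return a partial answer or an explicit "budget exhausted" result instead of crashing the session.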
Observability — the part teams cut and regret
A multi-agent system is a distributed system. If you cannot trace a session end-to-end, you cannot debug it. The minimum bar for production is one trace per session, with one span per tool call, LLM hop, and retry. Annotate spans with the supervisor decision, the specialist used, the input/output token counts, and the latency.
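As a rough illustration of the annotations listed above, the sketch below models a span as a plain record and rolls a session's spans up per specialist. The `AgentSpan` shape and its field names are assumptions for this example; they are not an OpenTelemetry or AgentCore attribute convention, and in practice these fields would be set as attributes on real spans.

```typescript
// Hypothetical span record for illustration; field names are not an
// OpenTelemetry or AgentCore convention.
type SpanKind = "llm_hop" | "tool_call" | "retry";

interface AgentSpan {
  sessionId: string;  // one trace per session
  kind: SpanKind;     // one span per LLM hop, tool call, and retry
  specialist: string; // "supervisor" for supervisor steps
  decision?: string;  // supervisor routing decision, when applicable
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

interface SpecialistTotals {
  spans: number;
  tokens: number;
  latencyMs: number;
}

// Roll a session's spans up per specialist so it is obvious where the
// tokens and the time went.
function totalsBySpecialist(spans: AgentSpan[]): Record<string, SpecialistTotals> {
  const totals: Record<string, SpecialistTotals> = {};
  for (const s of spans) {
    let t = totals[s.specialist];
    if (t === undefined) {
      t = { spans: 0, tokens: 0, latencyMs: 0 };
      totals[s.specialist] = t;
    }
    t.spans += 1;
    t.tokens += s.inputTokens + s.outputTokens;
    t.latencyMs += s.latencyMs;
  }
  return totals;
}

const sessionSpans: AgentSpan[] = [
  { sessionId: "s-1", kind: "llm_hop", specialist: "supervisor", decision: "route:billing",
    inputTokens: 1200, outputTokens: 100, latencyMs: 800 },
  { sessionId: "s-1", kind: "tool_call", specialist: "billing",
    inputTokens: 0, outputTokens: 0, latencyMs: 350 },
  { sessionId: "s-1", kind: "retry", specialist: "billing",
    inputTokens: 0, outputTokens: 0, latencyMs: 400 },
  { sessionId: "s-1", kind: "llm_hop", specialist: "billing",
    inputTokens: 900, outputTokens: 250, latencyMs: 1100 },
];

const totals = totalsBySpecialist(sessionSpans);
```

For this sample session the supervisor accounts for one span and 1,300 tokens, and the billing specialist for three spans, 1,150 tokens, and 1,850 ms of latency.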
On AWS the cleanest path is OpenTelemetry → CloudWatch. AgentCore emits spans natively; you bridge them into the same trace context as your application. Within a week of having traces, the team will stop arguing about whether the supervisor or a specialist is making bad decisions — they will see it.
What to alert on
- Sessions exceeding 80% of the step budget. Usually a sign of a routing loop.
- Per-specialist tool error rate above 2%. Usually a sign that the system prompt drifted or the tool contract changed.
- Rising p99 latency on the supervisor decision step. Usually a sign that the supervisor system prompt has gotten too long.
- Cost per session p99. Catches the silent runaways before the bill does.
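The alert rules above can be sketched as a small evaluation function. The 80% and 2% thresholds come from the list; the `$2.00` p99 cost ceiling, the type names, and the alert codes are hypothetical. The percentile uses the nearest-rank method, which is one of several common definitions. A p99 latency check on the supervisor step would follow the same shape as the cost check.

```typescript
// Hypothetical alert evaluation for illustration; names and the cost ceiling
// are assumptions, not values from a real deployment.
interface SessionStats {
  stepsUsed: number;
  costUsd: number;
}

interface AlertThresholds {
  stepBudget: number;         // e.g. maxStepsPerSession
  stepBudgetFraction: number; // alert above this share of the budget
  maxToolErrorRate: number;   // errors / calls for one specialist
  maxCostP99Usd: number;      // assumed ceiling; pick your own
}

// Nearest-rank percentile: the smallest sample with at least p% of samples
// at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function evaluateAlerts(
  sessions: SessionStats[],
  toolCalls: number,
  toolErrors: number,
  t: AlertThresholds
): string[] {
  const alerts: string[] = [];
  const stepLimit = t.stepBudget * t.stepBudgetFraction;
  if (sessions.some((s) => s.stepsUsed > stepLimit)) {
    alerts.push("step_budget");
  }
  if (toolCalls > 0 && toolErrors / toolCalls > t.maxToolErrorRate) {
    alerts.push("tool_error_rate");
  }
  if (sessions.length > 0) {
    const costP99 = percentile(sessions.map((s) => s.costUsd), 99);
    if (costP99 > t.maxCostP99Usd) alerts.push("cost_p99");
  }
  return alerts;
}

const thresholds: AlertThresholds = {
  stepBudget: 12,
  stepBudgetFraction: 0.8,
  maxToolErrorRate: 0.02,
  maxCostP99Usd: 2.0,
};

const sample: SessionStats[] = [
  { stepsUsed: 3, costUsd: 0.1 },
  { stepsUsed: 4, costUsd: 0.12 },
  { stepsUsed: 11, costUsd: 5.0 }, // a runaway session
];

// 3 errors in 200 calls is 1.5%, below the 2% threshold.
const fired = evaluateAlerts(sample, 200, 3, thresholds);
```

In this sample the runaway session trips both the step-budget and cost alerts, while the tool error rate stays quiet.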
What we would do differently
In our first deployment we let specialists call each other directly when the supervisor was "obviously" the wrong layer for a particular hop. We regretted it. The implicit specialist-to-specialist graph was invisible to our traces, our budgets, and our retry logic. When a downstream specialist started timing out, we had no way to tell which upstream specialist was responsible.
Route everything through the supervisor. The 50 ms of extra hop latency is cheap. Losing operational clarity is not.
Where this fits in the broader stack
Multi-agent orchestration is one piece of a production agent system. The retrieval layer (often GraphRAG by the time you are running multi-agent), the observability layer, and the deployment substrate all matter as much as the orchestration pattern. The agentic AI implementation framework covers the full stack; the AgentCore observability and monitoring guide goes deep on the trace + alert side specifically.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
