
Surviving Partial Failure in a 3,300-Call Agent Pipeline

8 min read
Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR


Two-tier retries, atomic DynamoDB chunk claims, and checkpoint-based cancellation — the failure-recovery layer that lets a multi-agent contract review pipeline finish even when 5% of LLM calls fail.
DynamoDB conditional update · SQS visibility timeout · multi-agent retry · checkpoint cancellation · ECS Fargate · exponential backoff · idempotent agents

A contract review pipeline that fans out 1,000 to 3,300 LLM calls per document does not survive on retry-once-and-pray. Even at a 99.5% per-call success rate, a 3,300-call pipeline has roughly a 0.995^3300 ≈ 6.5 × 10⁻⁸ chance of every call succeeding on the first try, which is effectively zero. Failures are not the exception; they are guaranteed.
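The arithmetic is quick to check; a two-line sketch in Python, using the numbers straight from above:

p_every_call_succeeds = 0.995 ** 3300  # ≈ 6.5e-8: essentially never
expected_failed_calls = 3300 * 0.005   # ≈ 16.5 failed calls per document, on average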

The system this writeup describes scores legal contract clauses against twelve categories, runs eleven specialist analyst agents in parallel for each chunk, and ships findings into a Word add-in in real time. The orchestration runs on ECS Fargate; chunks fan out via SQS; per-chunk state lives in DynamoDB. The interesting part is not the orchestration — it is the failure-recovery layer underneath.

The two-tier retry

Tier one is internal exponential backoff inside the worker, with three attempts at 1s, 2s, 4s, capped at 30s total. This catches transient model errors — 429 rate limits, 5xx from Bedrock, network blips. Tier two is the SQS visibility timeout. If the worker crashes, OOMs, or hangs past 5 minutes, the message returns to the queue and a different worker picks it up. After three SQS deliveries the message lands in a dead-letter queue with full context for triage.
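A minimal sketch of the tier-one loop, assuming a Bedrock-style client call; the error names and the call itself are illustrative rather than the production code, but the 1s/2s/4s schedule and the 30-second budget mirror the description above:

import random
import time

from botocore.exceptions import ClientError

# Transient Bedrock-style errors worth retrying in-process (illustrative set).
RETRYABLE = {"ThrottlingException", "ServiceUnavailableException", "InternalServerException"}

def invoke_with_backoff(call, retries=3, base_delay=1.0, budget=30.0):
    start = time.monotonic()
    for attempt in range(retries + 1):  # initial try plus up to three retries
        try:
            return call()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == retries:
                raise  # not recoverable here: let the message fail and tier two take over
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s
            if time.monotonic() - start + delay > budget:
                raise  # stay inside the 30s total retry budget
            time.sleep(delay + random.uniform(0, 0.25))  # small jitter to avoid thundering herds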

The split matters: tier one handles failures the worker can recover from; tier two handles failures the worker cannot. Combining them into one retry layer collapses observability — you cannot tell whether a slow chunk hit a model rate limit or whether the worker fell over.
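Tier two is configuration rather than code. The queue settings described above look roughly like this in boto3; the queue URL and DLQ ARN are placeholders:

import json

import boto3

chunk_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/contract-chunks"  # placeholder
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:contract-chunks-dlq"                    # placeholder

sqs = boto3.client("sqs")
sqs.set_queue_attributes(
    QueueUrl=chunk_queue_url,
    Attributes={
        # A worker that crashes, OOMs, or hangs releases the message after 5 minutes.
        "VisibilityTimeout": "300",
        # The third failed delivery lands in the dead-letter queue for triage.
        "RedrivePolicy": json.dumps({"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "3"}),
    },
)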

Atomic chunk claim via DynamoDB conditional update

When tier two re-delivers a message, two workers may grab it. Without coordination they double-process the chunk and double-bill the customer. The conditional update fixes this with no locks: UpdateItem with ConditionExpression "attribute_not_exists(claimed_by) OR claim_expires_at < :now". Whichever worker writes first wins; the other catches ConditionalCheckFailedException and moves on. The claim has a TTL so a wedged worker does not block recovery indefinitely.
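In boto3 terms the claim looks roughly like this; claimed_by, claim_expires_at, and the condition expression come from the description above, while the table name and key shape are illustrative:

import time

import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

def try_claim_chunk(job_id, chunk_id, worker_id, ttl_seconds=360):
    now = int(time.time())
    try:
        ddb.update_item(
            TableName="chunk_state",  # illustrative
            Key={"job_id": {"S": job_id}, "chunk_id": {"S": chunk_id}},
            UpdateExpression="SET claimed_by = :w, claim_expires_at = :exp",
            ConditionExpression="attribute_not_exists(claimed_by) OR claim_expires_at < :now",
            ExpressionAttributeValues={
                ":w": {"S": worker_id},
                ":exp": {"N": str(now + ttl_seconds)},  # TTL so a wedged worker cannot block recovery
                ":now": {"N": str(now)},
            },
        )
        return True  # this worker owns the chunk
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker claimed it first; skip without error
        raise

Because the condition and the write are a single DynamoDB operation, there is no window in which two workers can both see the chunk as unclaimed.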

Checkpoint-based cancellation

When the customer cancels a 5-minute job at minute 3, you cannot just stop firing LLM calls — you have agents in flight, files being written to S3, downstream tasks queued. The checkpoint table records {job_id, stage, started_at, claimed_by} for every chunk-stage. Cancellation flips a job-level status to "cancelled". Each stage entry-point reads that status before doing work. Half-finished chunks land in the checkpoint table as "cancelled-at-stage-N" so the next operator knows exactly where things stopped.
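A sketch of that entry-point check; the {job_id, stage, started_at, claimed_by} record shape comes from above, while the table names and the work callback are illustrative:

import time

import boto3

dynamodb = boto3.resource("dynamodb")
jobs = dynamodb.Table("jobs")                # illustrative
checkpoints = dynamodb.Table("checkpoints")  # illustrative

def run_stage(job_id, chunk_id, stage, worker_id, work):
    # Every stage entry-point reads the job-level status before doing any work.
    status = jobs.get_item(Key={"job_id": job_id})["Item"]["status"]
    if status == "cancelled":
        # Record exactly where this chunk stopped for the next operator.
        checkpoints.put_item(Item={
            "job_id": job_id,
            "chunk_stage": f"{chunk_id}#{stage}",
            "status": f"cancelled-at-stage-{stage}",
        })
        return None
    checkpoints.put_item(Item={
        "job_id": job_id,
        "chunk_stage": f"{chunk_id}#{stage}",
        "stage": stage,
        "started_at": int(time.time()),
        "claimed_by": worker_id,
    })
    return work()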

What we measured

  • 22 chunks → 116 LLM calls per chunk → 154 seconds end-to-end on the happy path
  • Pre-fix: ~12% of jobs reported "incomplete" with one chunk silently missing
  • Post-fix: <0.5%, and every remaining failure is observable in the checkpoint table
  • DLQ retention: 14 days; alert threshold: 3 messages in 5 minutes (alarm sketch below)
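The DLQ alert maps onto a single CloudWatch alarm on visible queue depth; a sketch with placeholder names:

import boto3

alerts_topic_arn = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"  # placeholder

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="contract-pipeline-dlq-depth",  # placeholder
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "contract-chunks-dlq"}],  # placeholder
    Statistic="Maximum",
    Period=300,               # the 5-minute window from above
    EvaluationPeriods=1,
    Threshold=3,              # 3 messages in 5 minutes pages a human
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[alerts_topic_arn],
)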

When this is overkill

If your agent pipeline is < 50 LLM calls per request, the per-call success rate carries you. If your pipeline is short-running (< 30 seconds), tier-two retries via SQS are slower than just letting the user retry. The pattern earns its complexity at the multi-thousand-call, multi-minute scale where partial failure is the default and full retry is too expensive.


Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years

Founder & CEO of Cognilium AI; 100+ production AI systems shipped; multi-cloud AI architecture (AWS, GCP, Azure); built and operated 4 production AI products.
Agentic AI · RAG → GraphRAG retrieval · Voice AI · Multi-Agent Orchestration
