TL;DR
Two-tier retries, atomic DynamoDB chunk claims, and checkpoint-based cancellation: the failure-recovery layer that lets a multi-agent contract review pipeline survive partial failure at scale.
A contract review pipeline that fans out 1,000 to 3,300 LLM calls per document does not survive on retry-once-and-pray. Even at a 99.5% per-call success rate, the probability that all 3,300 calls succeed on the first try is 0.995^3300 ≈ 7 × 10⁻⁸, effectively zero. Failures are not the exception; they are guaranteed.
The system this writeup describes scores legal contract clauses against twelve categories, runs eleven specialist analyst agents in parallel for each chunk, and ships findings into a Word add-in in real time. The orchestration runs on ECS Fargate; chunks fan out via SQS; per-chunk state lives in DynamoDB. The interesting part is not the orchestration — it is the failure-recovery layer underneath.
The two-tier retry
Tier one is internal exponential backoff inside the worker, with three attempts at 1s, 2s, 4s, capped at 30s total. This catches transient model errors — 429 rate limits, 5xx from Bedrock, network blips. Tier two is the SQS visibility timeout. If the worker crashes, OOMs, or hangs past 5 minutes, the message returns to the queue and a different worker picks it up. After three SQS deliveries the message lands in a dead-letter queue with full context for triage.
The split matters: tier one handles failures the worker can recover from; tier two handles failures the worker cannot. Combining them into one retry layer collapses observability — you cannot tell whether a slow chunk hit a model rate limit or whether the worker fell over.
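A minimal sketch of tier one, assuming a call_model callable that wraps the Bedrock client and raises RetryableError on 429s, 5xx responses, and network blips (both names are illustrative, not the production code):

```python
import random
import time

class RetryableError(Exception):
    """Transient model failure: 429 rate limit, 5xx from Bedrock, network blip."""

MAX_ATTEMPTS = 3      # three attempts, backing off 1s, 2s, 4s
TOTAL_BUDGET_S = 30   # cap on total time spent retrying inside the worker

def invoke_with_backoff(call_model, payload):
    deadline = time.monotonic() + TOTAL_BUDGET_S
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_model(payload)
        except RetryableError:
            last_attempt = attempt == MAX_ATTEMPTS - 1
            delay = 2 ** attempt + random.uniform(0, 0.25)  # 1s, 2s, 4s plus jitter
            if last_attempt or time.monotonic() + delay > deadline:
                raise  # give up; the SQS message returns to the queue (tier two)
            time.sleep(delay)
```

If the final attempt raises, the worker deliberately lets the exception propagate so the SQS visibility timeout, not the worker, decides what happens next.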
Atomic chunk claim via DynamoDB conditional update
When tier two re-delivers a message, two workers may grab it. Without coordination they double-process the chunk and double-bill the customer. The conditional update fixes this with no locks: UpdateItem with ConditionExpression "attribute_not_exists(claimed_by) OR claim_expires_at < :now". Whichever worker writes first wins; the other catches ConditionalCheckFailedException and moves on. The claim has a TTL so a wedged worker does not block recovery indefinitely.
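In boto3 terms the claim looks roughly like this; the chunk_state table name, key shape, and TTL constant are assumptions, while the ConditionExpression is the one quoted above:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

CLAIM_TTL_S = 300  # illustrative: roughly the 5-minute SQS visibility window

def try_claim_chunk(chunk_id: str, worker_id: str) -> bool:
    now = int(time.time())
    try:
        dynamodb.update_item(
            TableName="chunk_state",
            Key={"chunk_id": {"S": chunk_id}},
            UpdateExpression="SET claimed_by = :worker, claim_expires_at = :exp",
            ConditionExpression="attribute_not_exists(claimed_by) OR claim_expires_at < :now",
            ExpressionAttributeValues={
                ":worker": {"S": worker_id},
                ":exp": {"N": str(now + CLAIM_TTL_S)},
                ":now": {"N": str(now)},
            },
        )
        return True  # this worker wrote first and owns the chunk
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker holds a live claim; drop the message
        raise
```

The expired-claim branch of the condition is what keeps a wedged worker from blocking recovery: once claim_expires_at passes, the next delivery can steal the chunk.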
Checkpoint-based cancellation
When the customer cancels a 5-minute job at minute 3, you cannot just stop firing LLM calls — you have agents in flight, files being written to S3, downstream tasks queued. The checkpoint table records {job_id, stage, started_at, claimed_by} for every chunk-stage. Cancellation flips a job-level status to "cancelled". Each stage entry-point reads that status before doing work. Half-finished chunks land in the checkpoint table as "cancelled-at-stage-N" so the next operator knows exactly where things stopped.
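A sketch of the stage entry-point check, assuming a jobs table that holds the job-level status and a checkpoints table for per-chunk-stage records; table and attribute names here are illustrative:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
jobs = dynamodb.Table("jobs")
checkpoints = dynamodb.Table("checkpoints")

def run_stage(job_id: str, chunk_id: str, stage: int, worker_id: str, do_work) -> bool:
    # Read the job-level status before doing any work for this stage.
    job = jobs.get_item(Key={"job_id": job_id}).get("Item", {})
    if job.get("status") == "cancelled":
        # Record exactly where this chunk stopped for the next operator.
        checkpoints.put_item(Item={
            "job_id": job_id,
            "chunk_stage": f"{chunk_id}#stage-{stage}",
            "status": f"cancelled-at-stage-{stage}",
            "claimed_by": worker_id,
        })
        return False  # skip the LLM calls, S3 writes, and downstream enqueues
    do_work()
    return True
```

The check is cheap relative to a stage's worth of LLM calls, so it runs at every stage boundary rather than only at job start.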
What we measured
- 22 chunks → 116 LLM calls per chunk → 154 seconds end-to-end on the happy path
- Pre-fix: ~12% of jobs reported "incomplete" with one chunk silently missing
- Post-fix: <0.5%, and every remaining failure is observable in the checkpoint table
- DLQ retention: 14 days; alert threshold: 3 messages in 5 minutes (alarm sketch below)
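One way to wire that alert threshold is a CloudWatch alarm on the DLQ's visible-message count; the queue name, alarm name, and empty action list below are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="contract-review-dlq-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "contract-review-dlq"}],
    Statistic="Maximum",
    Period=300,               # 5-minute window
    EvaluationPeriods=1,
    Threshold=3,              # 3 dead-lettered messages trips the alarm
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[],          # SNS topic ARN for paging would go here
)
```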
When this is overkill
If your agent pipeline is < 50 LLM calls per request, the per-call success rate carries you. If your pipeline is short-running (< 30 seconds), tier-two retries via SQS are slower than just letting the user retry. The pattern earns its complexity at the multi-thousand-call, multi-minute scale where partial failure is the default and full retry is too expensive.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years experience
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
