TL;DR
Two-tier retries, atomic DynamoDB chunk claims, and checkpoint-based cancellation: the failure-recovery layer that lets a multi-agent contract review pipeline survive partial failure at scale.
A contract review pipeline that fans out 1,000 to 3,300 LLM calls per document does not survive on retry-once-and-pray. Even at a 99.5% per-call success rate, the probability that all 3,300 calls succeed on the first try is 0.995^3300 ≈ 7 × 10⁻⁸, effectively zero. Failures are not the exception; they are guaranteed.
The system this writeup describes scores legal contract clauses against twelve categories, runs eleven specialist analyst agents in parallel for each chunk, and ships findings into a Word add-in in real time. The orchestration runs on ECS Fargate; chunks fan out via SQS; per-chunk state lives in DynamoDB. The interesting part is not the orchestration — it is the failure-recovery layer underneath.
The two-tier retry
Tier one is internal exponential backoff inside the worker, with three attempts at 1s, 2s, 4s, capped at 30s total. This catches transient model errors — 429 rate limits, 5xx from Bedrock, network blips. Tier two is the SQS visibility timeout. If the worker crashes, OOMs, or hangs past 5 minutes, the message returns to the queue and a different worker picks it up. After three SQS deliveries the message lands in a dead-letter queue with full context for triage.
The split matters: tier one handles failures the worker can recover from; tier two handles failures the worker cannot. Combining them into one retry layer collapses observability — you cannot tell whether a slow chunk hit a model rate limit or whether the worker fell over.
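A minimal sketch of tier one, assuming a call_model callable that wraps the Bedrock client and raises RetryableError on 429s, 5xx responses, and network blips (both names are illustrative, not the production code):

```python
import random
import time

class RetryableError(Exception):
    """Transient model failure: 429 rate limit, 5xx from Bedrock, network blip."""

MAX_ATTEMPTS = 3      # three attempts, backing off 1s, 2s, 4s
TOTAL_BUDGET_S = 30   # cap on total time spent retrying inside the worker

def invoke_with_backoff(call_model, payload):
    deadline = time.monotonic() + TOTAL_BUDGET_S
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_model(payload)
        except RetryableError:
            last_attempt = attempt == MAX_ATTEMPTS - 1
            delay = 2 ** attempt + random.uniform(0, 0.25)  # 1s, 2s, 4s plus jitter
            if last_attempt or time.monotonic() + delay > deadline:
                raise  # give up; the SQS message returns to the queue (tier two)
            time.sleep(delay)
```

If the final attempt raises, the worker deliberately lets the exception propagate so the SQS visibility timeout, not the worker, decides what happens next.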
Atomic chunk claim via DynamoDB conditional update
When tier two re-delivers a message, two workers may grab it. Without coordination they double-process the chunk and double-bill the customer. The conditional update fixes this with no locks: UpdateItem with ConditionExpression "attribute_not_exists(claimed_by) OR claim_expires_at < :now". Whichever worker writes first wins; the other catches ConditionalCheckFailedException and moves on. The claim has a TTL so a wedged worker does not block recovery indefinitely.
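In boto3 terms the claim looks roughly like this; the chunk_state table name, key shape, and TTL constant are assumptions, while the ConditionExpression is the one quoted above:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

CLAIM_TTL_S = 300  # illustrative: roughly the 5-minute SQS visibility window

def try_claim_chunk(chunk_id: str, worker_id: str) -> bool:
    now = int(time.time())
    try:
        dynamodb.update_item(
            TableName="chunk_state",
            Key={"chunk_id": {"S": chunk_id}},
            UpdateExpression="SET claimed_by = :worker, claim_expires_at = :exp",
            ConditionExpression="attribute_not_exists(claimed_by) OR claim_expires_at < :now",
            ExpressionAttributeValues={
                ":worker": {"S": worker_id},
                ":exp": {"N": str(now + CLAIM_TTL_S)},
                ":now": {"N": str(now)},
            },
        )
        return True  # this worker wrote first and owns the chunk
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker holds a live claim; drop the message
        raise
```

The expired-claim branch of the condition is what keeps a wedged worker from blocking recovery: once claim_expires_at passes, the next delivery can steal the chunk.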
Checkpoint-based cancellation
When the customer cancels a 5-minute job at minute 3, you cannot just stop firing LLM calls — you have agents in flight, files being written to S3, downstream tasks queued. The checkpoint table records {job_id, stage, started_at, claimed_by} for every chunk-stage. Cancellation flips a job-level status to "cancelled". Each stage entry-point reads that status before doing work. Half-finished chunks land in the checkpoint table as "cancelled-at-stage-N" so the next operator knows exactly where things stopped.
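A sketch of the stage entry-point check, assuming a jobs table that holds the job-level status and a checkpoints table for per-chunk-stage records; table and attribute names here are illustrative:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
jobs = dynamodb.Table("jobs")
checkpoints = dynamodb.Table("checkpoints")

def run_stage(job_id: str, chunk_id: str, stage: int, worker_id: str, do_work) -> bool:
    # Read the job-level status before doing any work for this stage.
    job = jobs.get_item(Key={"job_id": job_id}).get("Item", {})
    if job.get("status") == "cancelled":
        # Record exactly where this chunk stopped for the next operator.
        checkpoints.put_item(Item={
            "job_id": job_id,
            "chunk_stage": f"{chunk_id}#stage-{stage}",
            "status": f"cancelled-at-stage-{stage}",
            "claimed_by": worker_id,
        })
        return False  # skip the LLM calls, S3 writes, and downstream enqueues
    do_work()
    return True
```

The check is cheap relative to a stage's worth of LLM calls, so it runs at every stage boundary rather than only at job start.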
What we measured
- 22 chunks → 116 LLM calls per chunk → 154 seconds end-to-end on the happy path
- Pre-fix: ~12% of jobs reported "incomplete" with one chunk silently missing
- Post-fix: <0.5%, and every remaining failure is observable in the checkpoint table
- DLQ retention: 14 days; alert threshold: 3 messages in 5 minutes (alarm sketch below)
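One way to wire that alert threshold is a CloudWatch alarm on the DLQ's visible-message count; the queue name, alarm name, and empty action list below are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="contract-review-dlq-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "contract-review-dlq"}],
    Statistic="Maximum",
    Period=300,               # 5-minute window
    EvaluationPeriods=1,
    Threshold=3,              # 3 dead-lettered messages trips the alarm
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[],          # SNS topic ARN for paging would go here
)
```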
When this is overkill
If your agent pipeline is < 50 LLM calls per request, the per-call success rate carries you. If your pipeline is short-running (< 30 seconds), tier-two retries via SQS are slower than just letting the user retry. The pattern earns its complexity at the multi-thousand-call, multi-minute scale where partial failure is the default and full retry is too expensive.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years experience
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
