TL;DR
The day-2 ops layer of an LLM product — what to evaluate, what to judge in real time, what to retry, and when to fail closed. The components that turn a prototype into a product.
A working prototype of an LLM product proves the model can do the task. A production deployment proves the system can survive the model — its failures, slowdowns, drift, cost spikes, and the gap between a passing test set and real user inputs. The stack that handles this is roughly the same across products and worth describing as a unit.
Layer 1: offline evals
Evals are fixed test sets that exercise known good and bad behaviors. They run in CI, gate releases, and answer one question: did this prompt change make the system better, worse, or unchanged? Three eval types cover most products.
- Acceptance evals — hand-crafted queries, each paired with an expected behavior. Pass/fail. Runs in <60s; ~50 examples is usually enough to catch regressions (a CI sketch follows this list).
- Adversarial evals — queries designed to break the system. Prompt injections, edge cases, ambiguous inputs. Pass means "the system handled it gracefully," not "the system got it right."
- Drift evals — a sample of real user queries from the previous week, replayed against the new prompt. Looks for behavior changes you did not intend.
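A minimal sketch of the acceptance-eval gate in CI, assuming a hypothetical `generate()` entry point and a JSONL file of `{query, must_contain}` cases. The file path, field names, and substring check are illustrative; the non-zero exit code that blocks the release is the point.

```python
import json
import sys

def generate(query: str) -> str:
    """Placeholder for the real generation entry point (prompt + model call)."""
    raise NotImplementedError

def run_acceptance_evals(path: str) -> int:
    """Run every case, report failures, exit non-zero so CI blocks the release."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    failures = []
    for case in cases:
        output = generate(case["query"])
        # Each case states its expected behavior as a simple check on the output.
        if case["must_contain"].lower() not in output.lower():
            failures.append((case["query"], case["must_contain"]))
    for query, expected in failures:
        print(f"FAIL: {query!r} missing expected behavior {expected!r}")
    print(f"{len(cases) - len(failures)}/{len(cases)} passed")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(run_acceptance_evals("evals/acceptance.jsonl"))
```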
Layer 2: online judges
Judges run per request, in real time, on the actual generation. They produce a 0-100 score; below the threshold the system retries (with a stricter prompt, higher temperature, or different model). The judge is itself a smaller LLM call with a structured rubric — for a coaching app, the rubric covers groundedness, vocabulary fidelity, format compliance, and tone.
The judge is not a replacement for evals. Evals catch regressions across releases; judges catch bad outputs in flight. Without evals you ship regressions; without judges you ship the bad output to a user who notices.
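A sketch of the judge-then-retry loop, assuming a hypothetical `call_model()` wrapper and a rubric returned as JSON. The threshold, model names, and retry escalation are illustrative; the rubric fields follow the coaching example above.

```python
import json

JUDGE_RUBRIC = """Score the reply 0-100 on each criterion and return JSON:
{"groundedness": int, "vocabulary_fidelity": int, "format": int, "tone": int, "overall": int}"""

def call_model(model: str, prompt: str, temperature: float = 0.2) -> str:
    """Placeholder for the real provider SDK call."""
    raise NotImplementedError

def judge(reply: str, context: str) -> int:
    """Smaller-model judge: structured rubric in, overall 0-100 score out."""
    raw = call_model("small-judge-model",
                     f"{JUDGE_RUBRIC}\n\nContext:\n{context}\n\nReply:\n{reply}")
    return json.loads(raw)["overall"]

def generate_with_judge(prompt: str, context: str,
                        threshold: int = 70, max_attempts: int = 3) -> str:
    best_reply, best_score = "", -1
    for attempt in range(max_attempts):
        # Escalate on retry: stricter instructions, a bit more sampling variety.
        strictness = "\n\nFollow the required format exactly." * attempt
        reply = call_model("main-model", prompt + strictness,
                           temperature=0.3 + 0.2 * attempt)
        score = judge(reply, context)
        if score >= threshold:
            return reply
        if score > best_score:
            best_reply, best_score = reply, score
    # All attempts scored below threshold: return the best attempt
    # (or fail closed upstream, depending on the product).
    return best_reply
```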
Layer 3: retries and time budgets
Retries handle transient errors — rate limits, 5xx, timeouts. Two tiers: internal exponential backoff (1s, 2s, 4s) for known-recoverable errors; queue-level redelivery (SQS visibility timeout / Pub/Sub ack deadline) for worker crashes.
Time budgets bound the entire request. A 60-second cap on a coaching-lesson generation means the judge gets the time it needs even after two retries. Budget-aware retry logic checks "do we have time for another attempt?" before each tier.
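A sketch of budget-aware internal retries, assuming a hypothetical `TransientError` raised by the model client. The 1s/2s/4s backoff and the 60-second cap come from the text; queue-level redelivery sits above this and is configured on the queue, not in code.

```python
import time

class TransientError(Exception):
    """Assumed: raised by the model client on rate limits, 5xx, and timeouts."""

BACKOFFS = [1, 2, 4]     # internal exponential backoff, seconds
REQUEST_BUDGET_S = 60    # hard cap on the whole request

def with_retries(fn, deadline: float):
    """Run fn with backoff, but only if the remaining budget allows another attempt."""
    last_error = None
    for backoff in [0] + BACKOFFS:
        # "Do we have time for another attempt?" - skip the retry rather than blow the budget.
        if deadline - time.monotonic() <= backoff:
            break
        if backoff:
            time.sleep(backoff)
        try:
            return fn()
        except TransientError as e:
            last_error = e
    # Retries or budget exhausted: let queue-level redelivery take over.
    raise last_error or TimeoutError("request budget exhausted")

def handle_request(fn):
    deadline = time.monotonic() + REQUEST_BUDGET_S
    return with_retries(fn, deadline)
```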
Layer 4: circuit breakers
When upstream model availability degrades — 3 consecutive timeouts in 30 seconds, or judge fail-rate above 5% in 5 minutes — the circuit breaker opens. New requests get a fast-fail response or queue for later. Retrying through a degraded upstream amplifies the problem; the circuit breaker is the system saying "this is not transient, stop trying."
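A sketch of the breaker state machine using the 3-failures-in-30-seconds trigger from the text; the cool-off interval and the half-open probe behavior are assumptions.

```python
import time

class CircuitBreaker:
    """Opens after consecutive upstream failures; fast-fails while open, probes after a cool-off."""

    def __init__(self, failure_threshold: int = 3, window_s: float = 30, cooloff_s: float = 60):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooloff_s = cooloff_s
        self.failures: list[float] = []    # timestamps of recent consecutive failures
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cool-off, let a probe request through.
        return time.monotonic() - self.opened_at >= self.cooloff_s

    def record_success(self) -> None:
        self.failures.clear()
        self.opened_at = None

    def record_failure(self) -> None:
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t <= self.window_s] + [now]
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now

breaker = CircuitBreaker()

def call_upstream(fn):
    if not breaker.allow():
        raise RuntimeError("circuit open: fast-fail or enqueue for later")
    try:
        result = fn()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```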
Layer 5: observability
- Per-request: request_id, model, prompt version, judge score, retry count, latency, cost
- Per-tenant: daily cost, error rate, generation volume
- Per-prompt-version: judge score distribution over time (drift detection), cost per request (efficiency tracking)
Logs go to a structured store; dashboards aggregate latency to p50/p95 and expose the per-tenant cuts.
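One way to emit the per-request record as a structured log line. The field names follow the list above; the values, logger setup, and emitter are illustrative stand-ins for whatever structured store the team already uses.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm.requests")

def log_request(**fields):
    """Emit one structured record per request; dashboards aggregate the per-tenant cuts."""
    logger.info(json.dumps(fields))

# Example record with the per-request fields from the text (values are illustrative).
log_request(
    request_id="req_abc123",
    tenant_id="tenant_42",
    model="gpt-4o",
    prompt_version="coach-v7",
    judge_score=84,
    retry_count=1,
    latency_ms=3200,
    cost_usd=0.0119,
)
```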
Layer 6: cost guardrails
- Per-tenant daily cost budget with alarm at 80% and hard cap at 100% (see the sketch after this list)
- Per-request token cap (input + output) — most generations should fit; outliers are bugs
- Model selection by task complexity — judge uses GPT-4o-mini, generator uses GPT-4o, summarizer uses Haiku
- Cache hit ratio for prompt prefixes — measure and optimize
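A sketch of the per-tenant budget and token-cap check, assuming a hypothetical in-memory usage store keyed by tenant and day. The 80% alarm and 100% hard cap come from the list above; the budget amounts, token cap, and alerting hook are illustrative.

```python
from datetime import date

DAILY_BUDGET_USD = {"default": 50.0}    # per-tenant budgets, illustrative
MAX_TOKENS_PER_REQUEST = 8_000          # input + output cap; outliers are bugs

# (tenant_id, day) -> spend so far; a real system would use a shared store
usage_usd: dict[tuple[str, str], float] = {}

def check_cost_guardrails(tenant_id: str, estimated_tokens: int, estimated_cost_usd: float) -> None:
    """Reject the request if it would breach the token cap or the tenant's daily budget."""
    if estimated_tokens > MAX_TOKENS_PER_REQUEST:
        raise ValueError(f"token cap exceeded: {estimated_tokens} > {MAX_TOKENS_PER_REQUEST}")
    key = (tenant_id, date.today().isoformat())
    budget = DAILY_BUDGET_USD.get(tenant_id, DAILY_BUDGET_USD["default"])
    projected = usage_usd.get(key, 0.0) + estimated_cost_usd
    if projected >= budget:
        raise RuntimeError(f"hard cap: {tenant_id} would exceed daily budget ${budget:.2f}")
    if projected >= 0.8 * budget:
        # Stand-in for the real alerting hook.
        print(f"ALARM: {tenant_id} at {projected / budget:.0%} of daily budget")

def record_spend(tenant_id: str, cost_usd: float) -> None:
    key = (tenant_id, date.today().isoformat())
    usage_usd[key] = usage_usd.get(key, 0.0) + cost_usd
```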
Smallest viable stack
Not every team needs every layer day one. The minimum viable stack: 50-example acceptance eval in CI, one judge call per generation, two-tier retry on transient errors, per-tenant cost alarms. Add circuit breakers when you see your first cascading failure. Add drift evals when your first prompt regression slips through. Build out as you hit the failure modes, not before.
What we measured across systems
- Judge cost overhead: 15-25% of per-request cost
- Eval gate prevented ~1-2 prompt regressions per month from shipping
- Two-tier retry recovered ~98% of transient failures without user-visible impact
- Circuit breaker fire rate steady-state: <0.1% of requests; usually upstream rate-limit episodes