The Production LLMOps Stack: Evals, Judges, Retries, Circuit Breakers

11 min read · 2,200 words

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR

The day-2 ops layer of an LLM product — what to evaluate, what to judge in real time, what to retry, and when to fail closed. The components that turn a prototype into something operable.

Tags: LLMOps, LLM evals, LLM-as-judge, retry strategies, circuit breakers, observability, production AI, hallucination detection

A working prototype of an LLM product proves the model can do the task. A production deployment proves the system can survive the model — its failures, slowdowns, drift, cost spikes, and the gap between a passing test set and real user inputs. The stack that handles this is roughly the same across products and worth describing as a unit.

Layer 1: offline evals

Evals are fixed test sets that exercise known good and bad behaviors. They run in CI, gate releases, and answer one question: did this prompt change make the system better, worse, or unchanged? Three eval types cover most products.

  • Acceptance evals — N hand-crafted queries with N expected behaviors. Pass/fail. Runs in <60s. ~50 examples is usually enough to catch regressions.
  • Adversarial evals — queries designed to break the system. Prompt injections, edge cases, ambiguous inputs. Pass means "the system handled it gracefully," not "the system got it right."
  • Drift evals — a sample of real user queries from the previous week, replayed against the new prompt. Looks for behavior changes you did not intend.
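As a concrete sketch, an acceptance-eval gate can be a plain script in CI that fails the build on any regression. The `generate` function and the check predicates below are hypothetical stand-ins for the real system:

```python
# Minimal acceptance-eval gate, as it might run in CI.
# `generate` is a placeholder for the real LLM call.

def generate(query: str) -> str:
    # Stand-in; deterministic so the demo is self-contained.
    return "Here is a grounded answer citing source [1]."

ACCEPTANCE_CASES = [
    # (query, check) pairs: each check is a predicate on the output.
    ("Summarize lesson 3", lambda out: len(out) > 0),
    ("What does source [1] say?", lambda out: "[1]" in out),
]

def run_acceptance_evals() -> tuple[int, int]:
    passed = 0
    for query, check in ACCEPTANCE_CASES:
        if check(generate(query)):
            passed += 1
    return passed, len(ACCEPTANCE_CASES)

passed, total = run_acceptance_evals()
assert passed == total, f"eval gate failed: {passed}/{total}"
```

Fifty cases of this shape run comfortably inside the sub-60-second budget, and the final assert is what gates the release.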

Layer 2: online judges

Judges run per request, in real time, on the actual generation. They produce a 0-100 score; below the threshold the system retries (with a stricter prompt, higher temperature, or different model). The judge is itself a smaller LLM call with a structured rubric — for a coaching app, the rubric covers groundedness, vocabulary fidelity, format compliance, and tone.

The judge is not a replacement for evals. Evals catch regressions across releases; judges catch bad outputs in flight. Without evals you ship regressions; without judges you ship the bad output to a user who notices.
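A minimal sketch of the in-flight judge loop. `call_model` and `call_judge` are hypothetical wrappers around your LLM client, and the 85-point threshold is an illustrative choice; the real judge is a small LLM call scoring the rubric dimensions above:

```python
JUDGE_THRESHOLD = 85  # example threshold; tune per product

def call_model(prompt: str, temperature: float) -> str:
    # Stand-in for the generator LLM call.
    return f"generation at T={temperature}"

def call_judge(output: str) -> int:
    # Stand-in for the judge LLM: scores groundedness, vocabulary
    # fidelity, format compliance, and tone on a 0-100 scale.
    return 90

def generate_with_judge(prompt: str, max_attempts: int = 3) -> str:
    temperature = 0.3
    for _ in range(max_attempts):
        output = call_model(prompt, temperature)
        if call_judge(output) >= JUDGE_THRESHOLD:
            return output
        temperature = round(temperature + 0.1, 1)  # escalate on retry
    return output  # best effort after exhausting attempts
```

The escalation step here bumps temperature; swapping in a stricter prompt or a different model is the same loop with a different knob.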

Layer 3: retries and time budgets

Retries handle transient errors — rate limits, 5xx, timeouts. Two tiers: internal exponential backoff (1s, 2s, 4s) for known-recoverable; queue-level redelivery (SQS visibility / Pub/Sub ack-deadline) for worker crashes.

Time budgets bound the entire request. A 60-second cap on a coaching-lesson generation means the judge gets the time it needs even after two retries. Budget-aware retry logic checks "do we have time for another attempt?" before each tier.
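The internal tier can be sketched as budget-aware backoff; `TransientError` and `attempt_fn` are hypothetical names, and the queue-level tier (SQS/Pub/Sub redelivery) sits outside this code entirely:

```python
import time

class TransientError(Exception):
    """Rate limit, 5xx, or timeout from the model API."""

BACKOFFS = [0, 1, 2, 4]  # first attempt immediate, then 1s/2s/4s

def call_with_budget(attempt_fn, budget_seconds: float = 60.0):
    """Retry transient failures only while wall-clock budget remains."""
    deadline = time.monotonic() + budget_seconds
    last_exc = None
    for backoff in BACKOFFS:
        # "Do we have time for another attempt?" before each tier.
        if time.monotonic() + backoff >= deadline:
            break
        time.sleep(backoff)
        try:
            return attempt_fn()
        except TransientError as exc:
            last_exc = exc
    raise TimeoutError("time budget exhausted") from last_exc
```

Non-transient errors are deliberately not caught here: retrying a 400 wastes budget the judge may need.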

Layer 4: circuit breakers

When upstream model availability degrades — 3 consecutive timeouts in 30 seconds, or judge fail-rate above 5% in 5 minutes — the circuit breaker opens. New requests get a fast-fail response or queue for later. Retrying through a degraded upstream amplifies the problem; the circuit breaker is the system saying "this is not transient, stop trying."
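A minimal consecutive-failure breaker, using the 3-failures-in-30-seconds trip condition from above as its defaults (the judge-fail-rate condition would be a second, parallel trigger):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures within `window`
    seconds; any success resets the count. After `cooldown`, one
    probe request is allowed through (half-open)."""

    def __init__(self, threshold=3, window=30.0, cooldown=60.0):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.failures = []      # timestamps of consecutive failures
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let one request probe
            self.failures.clear()
            return True
        return False  # caller fast-fails or enqueues for later

    def record_success(self):
        self.failures.clear()

    def record_failure(self):
        now = time.monotonic()
        self.failures = [t for t in self.failures
                         if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

When `allow()` returns False, the request never touches the degraded upstream — that is the whole point.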

Layer 5: observability

Per-request: request_id, model, prompt version, judge score, retry count, latency p50/p95, cost. Per-tenant: daily cost, error rate, generation volume. Per-prompt-version: judge score distribution over time (drift detection), cost per request (efficiency tracking). Logs go to a structured store; dashboards expose the per-tenant cuts.
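The per-request cut reduces to one structured line per request; the field names below mirror the list above, and the emitter itself is a sketch (in production this would go to your log pipeline rather than stdout):

```python
import json
import time

def log_request(request_id, tenant_id, model, prompt_version,
                judge_score, retry_count, latency_ms, cost_usd):
    """Emit one structured record per request. Per-tenant and
    per-prompt-version cuts are aggregations over these records."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "tenant_id": tenant_id,
        "model": model,
        "prompt_version": prompt_version,
        "judge_score": judge_score,
        "retry_count": retry_count,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))
    return record
```

Keeping `prompt_version` on every record is what makes the judge-score-over-time drift view possible later.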

Layer 6: cost guardrails

  • Per-tenant daily cost budget with alarm at 80% and hard cap at 100%
  • Per-request token cap (input + output) — most generations should fit; outliers are bugs
  • Model selection by task complexity — judge uses GPT-4o-mini, generator uses GPT-4o, summarizer uses Haiku
  • Cache hit ratio for prompt prefixes — measure and optimize
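The first guardrail — 80% alarm, 100% hard cap — is small enough to sketch in full; the alert hook is a hypothetical placeholder:

```python
class TenantBudget:
    """Per-tenant daily cost budget: alarm at 80%, hard cap at 100%."""

    def __init__(self, daily_cap_usd: float):
        self.cap = daily_cap_usd
        self.spent = 0.0
        self.alarmed = False

    def charge(self, cost_usd: float) -> bool:
        """Returns False if the request would exceed the hard cap."""
        if self.spent + cost_usd > self.cap:
            return False  # hard cap: reject before spending
        self.spent += cost_usd
        if not self.alarmed and self.spent >= 0.8 * self.cap:
            self.alarmed = True
            # fire_cost_alert(...)  # hypothetical alerting hook
        return True
```

Checking the cap before spending (rather than after) means a tenant can never overshoot by one expensive request.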

Smallest viable stack

Not every team needs every layer day one. The minimum viable stack: 50-example acceptance eval in CI, one judge call per generation, two-tier retry on transient errors, per-tenant cost alarms. Add circuit breakers when you see your first cascading failure. Add drift evals when your first prompt regression slips through. Build out as you hit the failure modes, not before.

What we measured across systems

  • Judge cost overhead: 15-25% of per-request cost
  • Eval gate prevented ~1-2 prompt regressions per month from shipping
  • Two-tier retry recovered ~98% of transient failures without user-visible impact
  • Circuit breaker fire rate steady-state: <0.1% of requests; usually upstream rate-limit episodes


Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years

100+ production AI systems shipped; multi-cloud AI architecture (AWS, GCP, Azure); built and operated 4 production AI products. Focus areas: Agentic AI, RAG → GraphRAG retrieval, Voice AI, Multi-Agent Orchestration.


Related Articles

  • LLM-as-Judge With Temperature-Escalation Retry Inside a 60-Second Budget (7 min) — Judge scores below 85? Retry with temperature 0.3, 0.4, 0.5 — three attempts inside a 60-second wall-clock budget. The simple loop that hits 99.5% on-spec output without crossing the latency ceiling.
  • Smart Category-Score Routing That Cuts LLM Cost ~75% (7 min) — A pipeline of 12 scorers + 11 analysts does not need to fan out everywhere. A score-driven routing layer sends each chunk only to the analysts that match its category — and saves three quarters of the LLM bill.
  • Bias-Detection Alerts on a 4-Agent Candidate Evaluation Pipeline (7 min) — A hiring evaluation pipeline runs four specialists in parallel — resume, profile, GitHub, voice. Bias drift in any one of them is a legal exposure. Continuous monitoring with alerts at the disparity-impact threshold.
