The Production LLMOps Stack: Evals, Judges, Retries, Circuit Breakers

11 min read · 2,200 words

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR

The day-2 ops layer of an LLM product — what to evaluate, what to judge in real time, what to retry, and when to fail closed. The components that turn a prototype into something operable.

Tags: LLMOps, LLM evals, LLM-as-judge, retry strategies, circuit breakers, observability, production AI, hallucination detection

A working prototype of an LLM product proves the model can do the task. A production deployment proves the system can survive the model — its failures, slowdowns, drift, cost spikes, and the gap between a passing test set and real user inputs. The stack that handles this is roughly the same across products and worth describing as a unit.

Layer 1: offline evals

Evals are fixed test sets that exercise known good and bad behaviors. They run in CI, gate releases, and answer one question: did this prompt change make the system better, worse, or unchanged? Three eval types cover most products.

  • Acceptance evals — N hand-crafted queries with N expected behaviors. Pass/fail. Runs in <60s. ~50 examples is usually enough to catch regressions.
  • Adversarial evals — queries designed to break the system. Prompt injections, edge cases, ambiguous inputs. Pass means "the system handled it gracefully," not "the system got it right."
  • Drift evals — a sample of real user queries from the previous week, replayed against the new prompt. Looks for behavior changes you did not intend.
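As a concrete sketch, an acceptance-eval gate can be a plain script in CI that fails the build on any regression. The `generate` function and the check predicates below are hypothetical stand-ins for the real system:

```python
# Minimal acceptance-eval gate, as it might run in CI.
# `generate` is a placeholder for the real LLM call.

def generate(query: str) -> str:
    # Stand-in; deterministic so the demo is self-contained.
    return "Here is a grounded answer citing source [1]."

ACCEPTANCE_CASES = [
    # (query, check) pairs: each check is a predicate on the output.
    ("Summarize lesson 3", lambda out: len(out) > 0),
    ("What does source [1] say?", lambda out: "[1]" in out),
]

def run_acceptance_evals() -> tuple[int, int]:
    passed = 0
    for query, check in ACCEPTANCE_CASES:
        if check(generate(query)):
            passed += 1
    return passed, len(ACCEPTANCE_CASES)

passed, total = run_acceptance_evals()
assert passed == total, f"eval gate failed: {passed}/{total}"
```

Fifty cases of this shape run comfortably inside the sub-60-second budget, and the final assert is what gates the release.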

Layer 2: online judges

Judges run per request, in real time, on the actual generation. They produce a 0-100 score; below the threshold the system retries (with a stricter prompt, higher temperature, or different model). The judge is itself a smaller LLM call with a structured rubric — for a coaching app, the rubric covers groundedness, vocabulary fidelity, format compliance, and tone.

The judge is not a replacement for evals. Evals catch regressions across releases; judges catch bad outputs in flight. Without evals you ship regressions; without judges you ship the bad output to a user who notices.
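A minimal sketch of the in-flight judge loop. `call_model` and `call_judge` are hypothetical wrappers around your LLM client, and the 85-point threshold is an illustrative choice; the real judge is a small LLM call scoring the rubric dimensions above:

```python
JUDGE_THRESHOLD = 85  # example threshold; tune per product

def call_model(prompt: str, temperature: float) -> str:
    # Stand-in for the generator LLM call.
    return f"generation at T={temperature}"

def call_judge(output: str) -> int:
    # Stand-in for the judge LLM: scores groundedness, vocabulary
    # fidelity, format compliance, and tone on a 0-100 scale.
    return 90

def generate_with_judge(prompt: str, max_attempts: int = 3) -> str:
    temperature = 0.3
    for _ in range(max_attempts):
        output = call_model(prompt, temperature)
        if call_judge(output) >= JUDGE_THRESHOLD:
            return output
        temperature = round(temperature + 0.1, 1)  # escalate on retry
    return output  # best effort after exhausting attempts
```

The escalation step here bumps temperature; swapping in a stricter prompt or a different model is the same loop with a different knob.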

Layer 3: retries and time budgets

Retries handle transient errors — rate limits, 5xx, timeouts. Two tiers: internal exponential backoff (1s, 2s, 4s) for known-recoverable; queue-level redelivery (SQS visibility / Pub/Sub ack-deadline) for worker crashes.

Time budgets bound the entire request. A 60-second cap on a coaching-lesson generation means the judge gets the time it needs even after two retries. Budget-aware retry logic checks "do we have time for another attempt?" before each tier.
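The internal tier can be sketched as budget-aware backoff; `TransientError` and `attempt_fn` are hypothetical names, and the queue-level tier (SQS/Pub/Sub redelivery) sits outside this code entirely:

```python
import time

class TransientError(Exception):
    """Rate limit, 5xx, or timeout from the model API."""

BACKOFFS = [0, 1, 2, 4]  # first attempt immediate, then 1s/2s/4s

def call_with_budget(attempt_fn, budget_seconds: float = 60.0):
    """Retry transient failures only while wall-clock budget remains."""
    deadline = time.monotonic() + budget_seconds
    last_exc = None
    for backoff in BACKOFFS:
        # "Do we have time for another attempt?" before each tier.
        if time.monotonic() + backoff >= deadline:
            break
        time.sleep(backoff)
        try:
            return attempt_fn()
        except TransientError as exc:
            last_exc = exc
    raise TimeoutError("time budget exhausted") from last_exc
```

Non-transient errors are deliberately not caught here: retrying a 400 wastes budget the judge may need.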

Layer 4: circuit breakers

When upstream model availability degrades — 3 consecutive timeouts in 30 seconds, or judge fail-rate above 5% in 5 minutes — the circuit breaker opens. New requests get a fast-fail response or queue for later. Retrying through a degraded upstream amplifies the problem; the circuit breaker is the system saying "this is not transient, stop trying."
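A minimal consecutive-failure breaker, using the 3-failures-in-30-seconds trip condition from above as its defaults (the judge-fail-rate condition would be a second, parallel trigger):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures within `window`
    seconds; any success resets the count. After `cooldown`, one
    probe request is allowed through (half-open)."""

    def __init__(self, threshold=3, window=30.0, cooldown=60.0):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.failures = []      # timestamps of consecutive failures
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let one request probe
            self.failures.clear()
            return True
        return False  # caller fast-fails or enqueues for later

    def record_success(self):
        self.failures.clear()

    def record_failure(self):
        now = time.monotonic()
        self.failures = [t for t in self.failures
                         if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

When `allow()` returns False, the request never touches the degraded upstream — that is the whole point.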

Layer 5: observability

Per-request: request_id, model, prompt version, judge score, retry count, latency p50/p95, cost. Per-tenant: daily cost, error rate, generation volume. Per-prompt-version: judge score distribution over time (drift detection), cost per request (efficiency tracking). Logs go to a structured store; dashboards expose the per-tenant cuts.
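The per-request cut reduces to one structured line per request; the field names below mirror the list above, and the emitter itself is a sketch (in production this would go to your log pipeline rather than stdout):

```python
import json
import time

def log_request(request_id, tenant_id, model, prompt_version,
                judge_score, retry_count, latency_ms, cost_usd):
    """Emit one structured record per request. Per-tenant and
    per-prompt-version cuts are aggregations over these records."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "tenant_id": tenant_id,
        "model": model,
        "prompt_version": prompt_version,
        "judge_score": judge_score,
        "retry_count": retry_count,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))
    return record
```

Keeping `prompt_version` on every record is what makes the judge-score-over-time drift view possible later.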

Layer 6: cost guardrails

  • Per-tenant daily cost budget with alarm at 80% and hard cap at 100%
  • Per-request token cap (input + output) — most generations should fit; outliers are bugs
  • Model selection by task complexity — judge uses GPT-4o-mini, generator uses GPT-4o, summarizer uses Haiku
  • Cache hit ratio for prompt prefixes — measure and optimize
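The first guardrail — 80% alarm, 100% hard cap — is small enough to sketch in full; the alert hook is a hypothetical placeholder:

```python
class TenantBudget:
    """Per-tenant daily cost budget: alarm at 80%, hard cap at 100%."""

    def __init__(self, daily_cap_usd: float):
        self.cap = daily_cap_usd
        self.spent = 0.0
        self.alarmed = False

    def charge(self, cost_usd: float) -> bool:
        """Returns False if the request would exceed the hard cap."""
        if self.spent + cost_usd > self.cap:
            return False  # hard cap: reject before spending
        self.spent += cost_usd
        if not self.alarmed and self.spent >= 0.8 * self.cap:
            self.alarmed = True
            # fire_cost_alert(...)  # hypothetical alerting hook
        return True
```

Checking the cap before spending (rather than after) means a tenant can never overshoot by one expensive request.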

Smallest viable stack

Not every team needs every layer day one. The minimum viable stack: 50-example acceptance eval in CI, one judge call per generation, two-tier retry on transient errors, per-tenant cost alarms. Add circuit breakers when you see your first cascading failure. Add drift evals when your first prompt regression slips through. Build out as you hit the failure modes, not before.

What we measured across systems

  • Judge cost overhead: 15-25% of per-request cost
  • Eval gate prevented ~1-2 prompt regressions per month from shipping
  • Two-tier retry recovered ~98% of transient failures without user-visible impact
  • Circuit breaker fire rate steady-state: <0.1% of requests; usually upstream rate-limit episodes


Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years

100+ production AI systems shipped; multi-cloud AI architecture (AWS, GCP, Azure); built and operated 4 production AI products. Focus areas: Agentic AI, RAG → GraphRAG retrieval, Voice AI, Multi-Agent Orchestration.


Related Articles

  • LLM-as-Judge With Temperature-Escalation Retry Inside a 60-Second Budget (7 min) — Judge scores below 85? Retry with temperature 0.3, 0.4, 0.5 — three attempts inside a 60-second wall-clock budget. The simple loop that hits 99.5% on-spec output without crossing the latency ceiling.
  • Smart Category-Score Routing That Cuts LLM Cost ~75% (7 min) — A pipeline of 12 scorers + 11 analysts does not need to fan out everywhere. A score-driven routing layer sends each chunk only to the analysts that match its category — and saves three quarters of the LLM bill.
  • Bias-Detection Alerts on a 4-Agent Candidate Evaluation Pipeline (7 min) — A hiring evaluation pipeline runs four specialists in parallel — resume, profile, GitHub, voice. Bias drift in any one of them is a legal exposure. Continuous monitoring with alerts at the disparity-impact threshold.
