TL;DR
Judge scores below 85? Retry at temperature 0.3, then 0.4, then 0.5: three attempts inside a 60-second wall-clock budget. The simple loop hits a 99.5% on-spec rate in production.
Structured output generation has a sharp distribution: most generations are clearly good, a long tail is clearly bad, and a small middle band is marginal. The middle band is where retry helps. The question is what to retry with.
The retry loop
The pattern: generate at temperature 0.3, judge it. If the judge returns ≥85, accept. If <85 and time remains in the 60-second budget, retry at temperature 0.4. Then 0.5. Then fall back to the highest-judged attempt and flag it. A code sketch follows the list below.
- Attempt 1: temp 0.3 — most likely to be on-spec, gets ~85% of accepts
- Attempt 2: temp 0.4 — picks up another ~10%
- Attempt 3: temp 0.5 — picks up another ~3%
- Fallback: ~2%, flagged for editor review
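A minimal sketch of the loop in Python. `generate` and `judge` are hypothetical stand-ins for the model call and the judge call, and the return convention (output, score, flagged) is this sketch's, not the post's:

```python
import time

TEMPS = [0.3, 0.4, 0.5]   # temperature ladder from the post
THRESHOLD = 85            # judge score needed to accept
BUDGET_S = 60.0           # per-request wall-clock budget

def generate_with_retries(query, context, generate, judge):
    # `generate` and `judge` are stand-in callables; the names are
    # this sketch's, not the production system's.
    deadline = time.monotonic() + BUDGET_S
    best_score, best_output = -1, None

    for temp in TEMPS:
        output = generate(query, context, temperature=temp)
        score = judge(output, context)
        if score >= THRESHOLD:
            return output, score, False      # accepted, not flagged
        if score > best_score:
            best_score, best_output = score, output
        if time.monotonic() >= deadline:     # no budget left for a retry
            break

    # Fallback: return the highest-judged attempt, flagged for editor review
    return best_output, best_score, True
```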
Why temperature, not prompt
Both work. Temperature is cheaper. A failed generation at 0.3 often means the model is over-constrained on a marginal query — bumping temp gives it one more degree of freedom. Prompt changes at runtime are also possible (a "stricter" prompt variant) but they add a configuration surface; temperature is one knob.
The exception: groundedness failures. If the judge fails on groundedness, the output cited information not in the retrieved context. Higher temperature does not fix that — fix it by re-retrieving with relaxed filters and re-generating with the new context.
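A sketch of that branch, assuming a `retrieve` helper that accepts a relaxed-filters flag; all names here are hypothetical:

```python
def retry_groundedness_failure(query, retrieve, generate, judge):
    # Widen retrieval instead of raising temperature: the failure means
    # the model cited facts outside the original context, so give it a
    # broader context to cite from. `relaxed_filters` is an assumption.
    context = retrieve(query, relaxed_filters=True)
    output = generate(query, context, temperature=0.3)
    return output, judge(output, context)
```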
The 60-second wall-clock budget
Each request gets a deadline at entry. Before each retry, the system checks: elapsed time + estimated next-attempt cost. If that exceeds the deadline, fall back to the best output so far. Without the budget, the retry loop runs into latency outliers (one slow generation eats the user-visible deadline). With it, the system trades quality for predictability on the rare slow case.
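A sketch of the pre-retry check, assuming the loop tracks the latencies of attempts made so far on this request; the helper name and the 1.25 safety factor are assumptions, not production values:

```python
import time

def room_for_retry(deadline, attempt_latencies, safety=1.25):
    # Estimate the next attempt's cost from latencies observed so far
    # on this request, padded by a safety factor, and check whether it
    # still fits before the deadline.
    if not attempt_latencies:
        return True
    est_next = safety * (sum(attempt_latencies) / len(attempt_latencies))
    return time.monotonic() + est_next <= deadline
```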
What the judge is judging
A 4-axis rubric for a coaching-lesson generator:
- Groundedness: every factual claim maps to a retrieved chunk. Hallucinated claims drop the score sharply.
- Vocabulary fidelity: only allowed framework terms used (the runtime grounding pattern from the previous chapter).
- Format compliance: matches expected JSON schema, no extra keys, all required keys present.
- Tone: matches the configured voice (warm, age-appropriate, instructional).
Each axis is scored 0-100; the weighted average is the overall score. The threshold of 85 was calibrated against editor pass/fail labels on ~500 outputs.
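A sketch of the scoring arithmetic. The post names the axes but not the weighting, so the weights below are illustrative only:

```python
# Hypothetical weights; the post does not publish the real ones.
WEIGHTS = {
    "groundedness": 0.40,
    "vocabulary_fidelity": 0.25,
    "format_compliance": 0.20,
    "tone": 0.15,
}

def rubric_score(axis_scores):
    # Weighted average of 0-100 axis scores -> overall 0-100 score.
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

# e.g. rubric_score({"groundedness": 90, "vocabulary_fidelity": 88,
#                    "format_compliance": 100, "tone": 80}) -> 90.0
```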
Numbers from production
- P50 latency: 7-12s (single attempt, judge passes first try)
- P95 latency: 25-35s (two attempts + judge each)
- Final pass rate after retries: 99.5%
- Editor flag rate (fallback path): 0.5%
- Per-generation cost: $0.08-0.15 happy path; $0.18-0.25 with two retries