TL;DR
Judge scores below 85? Retry at temperature 0.3, then 0.4, then 0.5: three attempts inside a 60-second wall-clock budget. The simple loop hits a 99.5% on-spec rate in production.
Structured output generation has a sharp distribution: most generations are clearly good, a long tail is clearly bad, and a small middle band is marginal. The middle band is where retry helps. The question is what to retry with.
The retry loop
The pattern: generate at temperature 0.3, judge it. If the judge returns ≥85, accept. If <85 and time remains in the 60-second budget, retry at temperature 0.4. Then 0.5. Then fall back to the highest-judged attempt and flag it. A code sketch follows the list below.
- Attempt 1: temp 0.3 — most likely to be on-spec, gets ~85% of accepts
- Attempt 2: temp 0.4 — picks up another ~10%
- Attempt 3: temp 0.5 — picks up another ~3%
- Fallback: ~2%, flagged for editor review
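A minimal sketch of the loop in Python. `generate` and `judge` are hypothetical stand-ins for the model call and the judge call, and the return convention (output, score, flagged) is this sketch's, not the post's:

```python
import time

TEMPS = [0.3, 0.4, 0.5]   # temperature ladder from the post
THRESHOLD = 85            # judge score needed to accept
BUDGET_S = 60.0           # per-request wall-clock budget

def generate_with_retries(query, context, generate, judge):
    # `generate` and `judge` are stand-in callables; the names are
    # this sketch's, not the production system's.
    deadline = time.monotonic() + BUDGET_S
    best_score, best_output = -1, None

    for temp in TEMPS:
        output = generate(query, context, temperature=temp)
        score = judge(output, context)
        if score >= THRESHOLD:
            return output, score, False      # accepted, not flagged
        if score > best_score:
            best_score, best_output = score, output
        if time.monotonic() >= deadline:     # no budget left for a retry
            break

    # Fallback: return the highest-judged attempt, flagged for editor review
    return best_output, best_score, True
```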
Why temperature, not prompt
Both work. Temperature is cheaper. A failed generation at 0.3 often means the model is over-constrained on a marginal query — bumping temp gives it one more degree of freedom. Prompt changes at runtime are also possible (a "stricter" prompt variant) but they add a configuration surface; temperature is one knob.
The exception: groundedness failures. If the judge fails on groundedness, the output cited information not in the retrieved context. Higher temperature does not fix that — fix it by re-retrieving with relaxed filters and re-generating with the new context.
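A sketch of that branch, assuming a `retrieve` helper that accepts a relaxed-filters flag; all names here are hypothetical:

```python
def retry_groundedness_failure(query, retrieve, generate, judge):
    # Widen retrieval instead of raising temperature: the failure means
    # the model cited facts outside the original context, so give it a
    # broader context to cite from. `relaxed_filters` is an assumption.
    context = retrieve(query, relaxed_filters=True)
    output = generate(query, context, temperature=0.3)
    return output, judge(output, context)
```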
The 60-second wall-clock budget
Each request gets a deadline at entry. Before each retry, the system checks: elapsed time + estimated next-attempt cost. If that exceeds the deadline, fall back to the best output so far. Without the budget, the retry loop runs into latency outliers (one slow generation eats the user-visible deadline). With it, the system trades quality for predictability on the rare slow case.
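A sketch of the pre-retry check, assuming the loop tracks the latencies of attempts made so far on this request; the helper name and the 1.25 safety factor are assumptions, not production values:

```python
import time

def room_for_retry(deadline, attempt_latencies, safety=1.25):
    # Estimate the next attempt's cost from latencies observed so far
    # on this request, padded by a safety factor, and check whether it
    # still fits before the deadline.
    if not attempt_latencies:
        return True
    est_next = safety * (sum(attempt_latencies) / len(attempt_latencies))
    return time.monotonic() + est_next <= deadline
```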
What the judge is judging
A 4-axis rubric for a coaching-lesson generator:
- Groundedness: every factual claim maps to a retrieved chunk. Hallucinated claims drop the score sharply.
- Vocabulary fidelity: only allowed framework terms used (the runtime grounding pattern from the previous chapter).
- Format compliance: matches expected JSON schema, no extra keys, all required keys present.
- Tone: matches the configured voice (warm, age-appropriate, instructional).
Each axis is scored 0-100; the weighted average is the overall score. The threshold of 85 was calibrated against editor pass/fail labels on ~500 outputs.
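A sketch of the scoring arithmetic. The post names the axes but not the weighting, so the weights below are illustrative only:

```python
# Hypothetical weights; the post does not publish the real ones.
WEIGHTS = {
    "groundedness": 0.40,
    "vocabulary_fidelity": 0.25,
    "format_compliance": 0.20,
    "tone": 0.15,
}

def rubric_score(axis_scores):
    # Weighted average of 0-100 axis scores -> overall 0-100 score.
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

# e.g. rubric_score({"groundedness": 90, "vocabulary_fidelity": 88,
#                    "format_compliance": 100, "tone": 80}) -> 90.0
```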
Numbers from production
- P50 latency: 7-12s (single attempt, judge passes first try)
- P95 latency: 25-35s (two attempts + judge each)
- Final pass rate after retries: 99.5%
- Editor flag rate (fallback path): 0.5%
- Per-generation cost: $0.08-0.15 happy path; $0.18-0.25 with two retries