A pattern that halts requests to an LLM endpoint when error rates or latency exceed thresholds, allowing the system to fail fast rather than degrade.
Borrowed from microservices reliability. Production LLM systems set circuit breakers per provider (Anthropic, OpenAI, Bedrock) and route to fallback providers when a circuit opens.
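A minimal sketch of a per-provider breaker with fallback routing, under stated assumptions: the class `CircuitBreaker`, the helper `call_with_fallback`, and all thresholds are hypothetical names and values chosen for illustration, and the breaker trips on consecutive failures rather than on a measured error rate or latency percentile.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_fallback(prompt: str,
                       providers: List[Tuple[str, Callable[[str], str]]],
                       breakers: Dict[str, CircuitBreaker]) -> Tuple[str, str]:
    """Try providers in order, skipping any whose circuit is open."""
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = call(prompt)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers unavailable")
```

Each provider gets its own breaker so that an outage at one does not block traffic to the others. A production version would likely also treat slow responses as failures, which this sketch omits.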
A workflow where prompt and architecture changes are scored against a golden test set of queries before deployment.
Analogous to test-driven development for traditional software. Golden sets cover 100-300 queries spanning happy path, edge cases, and adversarial inputs. Scoring uses LLM-as-judge or human review.
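The deployment gate described above could look roughly like this. It is a sketch, not a reference implementation: `GoldenCase`, `evaluate`, and `passes_gate` are hypothetical names, and exact-match scoring stands in for the LLM-as-judge or human review a real pipeline would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    query: str
    expected: str
    category: str  # e.g. "happy_path", "edge_case", "adversarial"


def exact_match(output: str, expected: str) -> float:
    """Placeholder scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def evaluate(app: Callable[[str], str], golden: List[GoldenCase],
             scorer: Callable[[str, str], float] = exact_match) -> float:
    """Mean score of `app` over the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    return sum(scorer(app(case.query), case.expected) for case in golden) / len(golden)


def passes_gate(baseline: Callable[[str], str], candidate: Callable[[str], str],
                golden: List[GoldenCase], max_regression: float = 0.0) -> bool:
    """Allow deployment only if the candidate does not score below the
    baseline by more than `max_regression`."""
    return evaluate(candidate, golden) >= evaluate(baseline, golden) - max_regression
```

Here `baseline` and `candidate` are any callables that map a query to an output, for example the current and proposed prompt wrapped around the same model.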
A curated collection of input-output pairs used as the regression suite for LLM applications.
Maintained continuously: new failure modes from production logs get added; stale examples get pruned. Without a golden set, "is this prompt change better?" is unanswerable.
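One way the maintenance loop might be sketched, assuming a golden set held in memory as a dictionary keyed by normalized query. The names `GoldenExample`, `add_failure`, and `prune_stale`, and the 180-day staleness window, are illustrative assumptions rather than an established convention.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict


@dataclass
class GoldenExample:
    query: str
    expected: str
    last_verified: date  # when a human last confirmed the expected output


def _key(query: str) -> str:
    """Normalize case and whitespace so near-identical queries collide."""
    return " ".join(query.lower().split())


def add_failure(golden: Dict[str, GoldenExample], query: str,
                expected: str, today: date) -> bool:
    """Add a production failure as a new example; return False if already present."""
    key = _key(query)
    if key in golden:
        return False
    golden[key] = GoldenExample(query, expected, today)
    return True


def prune_stale(golden: Dict[str, GoldenExample], today: date,
                max_age_days: int = 180) -> int:
    """Remove examples not verified within `max_age_days`; return how many were removed."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [k for k, ex in golden.items() if ex.last_verified < cutoff]
    for k in stale:
        del golden[k]
    return len(stale)
```

In practice, pruning by age alone is crude; a team might instead re-verify old examples and only drop those that no longer reflect intended behavior.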
Using an LLM to score the output of another LLM against a rubric, replacing expensive human evaluation for scalable quality measurement.
Production judges include temperature-escalation retry patterns (retry at a higher temperature if the judge score is low) and ensemble judging (three judges vote). Judges have known failure modes, including verbosity bias and position bias.
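The two patterns can be sketched together as follows. Everything here is an assumption made for illustration: a judge is modeled as a plain callable taking `(output, rubric, temperature)` and returning an integer score, the ensemble takes the median rather than a strict majority vote, and the low-score threshold and temperature schedule are arbitrary.

```python
from statistics import median
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (output, rubric, temperature) -> score, e.g. 1 to 5.
Judge = Callable[[str, str, float], int]


def ensemble_score(output: str, rubric: str, judges: Sequence[Judge],
                   temperature: float = 0.0) -> float:
    """Median of the judges' scores, which damps a single outlier judge."""
    return median(judge(output, rubric, temperature) for judge in judges)


def judge_with_escalation(output: str, rubric: str, judges: Sequence[Judge],
                          low_score: float = 2.0,
                          temperatures: Sequence[float] = (0.0, 0.5, 1.0)
                          ) -> Tuple[float, float]:
    """Re-judge at successively higher temperatures while the score stays at or
    below `low_score`. Returns the final score and the temperature that produced it."""
    if not temperatures:
        raise ValueError("need at least one temperature")
    for temperature in temperatures:
        score = ensemble_score(output, rubric, judges, temperature)
        if score > low_score:
            break
    return score, temperature
```

Neither pattern addresses verbosity or position bias directly; those usually need rubric design or order randomization, which this sketch leaves out.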
The operational discipline of running LLM-powered applications in production: evaluation, observability, retries, cost engineering, prompt versioning.
LLMOps is to LLM applications what MLOps is to traditional ML. Stack components: golden-set evaluation, LLM-as-judge, LangSmith/Langfuse for traces, Helicone for token observability, circuit breakers, prompt versioning.
The ability to inspect and debug LLM application behavior in production through traces, logs, metrics, and cost data.
Standard stack in 2026: LangSmith or Langfuse for traces, Helicone for token-spend metering, Datadog for infrastructure metrics, OpenTelemetry GenAI conventions as the open standard.
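As a rough, library-free illustration of what one trace record might carry, the sketch below wraps a model call and emits a dictionary. The `gen_ai.*` keys follow attribute names from the OpenTelemetry GenAI semantic conventions as understood at the time of writing; those conventions are still evolving, so check the current specification before relying on them. `traced_call`, the `duration_s` key, and the shape of the `call` callable are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical model call: (model, prompt) -> (text, input_tokens, output_tokens)
ModelCall = Callable[[str, str], Tuple[str, int, int]]


def traced_call(model: str, prompt: str, call: ModelCall,
                clock: Callable[[], float] = time.perf_counter
                ) -> Tuple[str, Dict[str, object]]:
    """Run one model call and return its text plus a flat trace record."""
    start = clock()
    text, input_tokens, output_tokens = call(model, prompt)
    record = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "duration_s": clock() - start,  # not an OTel attribute name; illustrative only
    }
    return text, record
```

A real deployment would hand this data to a tracing SDK or vendor client instead of returning a dictionary; the point is only which fields make a request debuggable and billable after the fact.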
The practice of designing the instructions, structure, and examples given to an LLM to elicit a desired behavior.
In 2026, prompt engineering is less brittle than in 2023 because models are more aligned, but still consequential. Production prompts are versioned, tested against golden sets, and A/B tested.
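Versioning and A/B assignment can be sketched with nothing more than hashing. This is one possible approach, not a standard: `prompt_version` derives a version identifier from the template's content, and `ab_bucket` assigns a user to an arm deterministically. Both function names and the 12-character identifier length are assumptions.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Content-addressed version ID: any edit to the template changes the ID."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    The same (experiment, user_id) pair always lands in the same arm, so a
    user sees a consistent prompt for the life of the experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "treatment" if fraction < treatment_share else "control"
```

Logging the version ID alongside each request makes it possible to attribute a quality or cost change to a specific prompt revision.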
The discipline of managing per-request and aggregate cost of LLM applications through routing, caching, batching, and model selection.
Levers: route trivial queries to cheaper models, cache repeated queries with semantic-similarity lookup, batch where latency permits, prefer prompt caching when supported. Cost reductions of around 70% are often achievable without measurable quality loss, though the figure depends on the workload.
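Two of these levers, routing and caching, are sketched below under simplifying assumptions. `CostAwareClient` is a hypothetical class; the cache matches on normalized text only, where a semantic cache would compare embeddings; and the triviality check is supplied by the caller because no single heuristic fits every application.

```python
from typing import Callable, Dict


class CostAwareClient:
    """Illustrative router: cache first, then cheap or strong model by triviality."""

    def __init__(self, cheap: Callable[[str], str], strong: Callable[[str], str],
                 is_trivial: Callable[[str], bool]):
        self.cheap = cheap
        self.strong = strong
        self.is_trivial = is_trivial
        self.cache: Dict[str, str] = {}
        self.calls = {"cheap": 0, "strong": 0, "cache": 0}

    @staticmethod
    def _key(query: str) -> str:
        # Normalized-text key; a semantic cache would use an embedding lookup here.
        return " ".join(query.lower().split())

    def complete(self, query: str) -> str:
        key = self._key(query)
        if key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        tier = "cheap" if self.is_trivial(query) else "strong"
        model = self.cheap if tier == "cheap" else self.strong
        answer = model(query)
        self.calls[tier] += 1
        self.cache[key] = answer
        return answer
```

The `calls` counter gives a crude view of how traffic splits across tiers, which is the number to watch when estimating savings. Any routing rule should be checked against a golden set, since sending a hard query to the cheap model trades cost for quality.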