Smart Category-Score Routing That Cuts LLM Cost ~75%

7 min read

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR

A pipeline of 12 scorers + 11 analysts does not need to fan out everywhere. A score-driven routing layer sends each chunk only to the analysts that match its category — and saves three quarters of the LLM bill.
Tags: LLM cost optimization, smart routing, multi-agent fan-out, per-chunk scoring, dynamic agent selection, LLM ops

A contract review pipeline that runs 11 specialist analyst agents on every chunk makes ~2,400 LLM calls per 100-chunk contract. At Sonnet-class pricing that is real money. Most of those calls are confirmations of "no finding" — the chunk is simply not relevant to that analyst's domain. A routing layer that decides which analysts to run per chunk cuts the bill by ~75% without losing findings.

The routing layer

Two model tiers. Tier one: 12 scorers, one per legal category (compliance, IP, indemnity, termination, payment, etc.). Each scorer runs a cheap model on the chunk and emits a 0-100 score for "is this chunk relevant to my category?" Tier two: 11 analyst agents, each tied to one or more categories. The router runs only the analysts whose category scores clear a threshold; the selection logic is sketched after the cost breakdown below.

  • Tier 1 (scorers): Haiku-class, $0.25/M tokens, runs on every chunk
  • Tier 2 (analysts): Sonnet/4o-class, $3/M tokens, runs only on routed chunks
  • 12 scorers × cheap model on every chunk = small fixed cost
  • ~3 analysts on average × expensive model per chunk = order-of-magnitude reduction
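
A minimal sketch of the routing decision in Python. The names here are illustrative (score_chunk stands in for the Haiku-class scorer call, and the category list is truncated), but the shape is the point: score every category cheaply, then fan out only to the analysts whose categories clear their calibrated thresholds.

    from dataclasses import dataclass

    # Illustrative names; the real pipeline wires these to LLM calls.
    CATEGORIES = ["compliance", "ip", "indemnity", "termination", "payment"]  # 12 in total

    @dataclass
    class Analyst:
        name: str
        categories: list[str]   # an analyst may be tied to more than one category
        threshold: float        # calibrated per analyst (see next section)

    def route(chunk: str, analysts: list[Analyst], score_chunk) -> list[Analyst]:
        # Tier 1: one cheap relevance score (0-100) per category.
        scores = {cat: score_chunk(chunk, cat) for cat in CATEGORIES}
        # Tier 2: keep only the analysts whose best owned-category score clears the bar.
        return [a for a in analysts
                if max(scores[c] for c in a.categories) >= a.threshold]

With ~3 of 11 analysts routed on an average chunk, each chunk costs 12 cheap calls plus ~3 expensive ones instead of 11 expensive ones.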

Threshold calibration

Score the validation set by running every analyst on every chunk. For each analyst, measure the score distribution on (a) chunks where the analyst found something and (b) chunks where it did not. The routing threshold is the 95th percentile of distribution (b). Above it, route the chunk to the analyst — there is enough relevance signal that the analyst is worth its cost. Below it, skip — the analyst would emit "no finding" 95% of the time.
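
The same calibration as a sketch, assuming you logged each chunk's category score and whether the analyst actually produced a finding during the full fan-out run (array names are illustrative):

    import numpy as np

    def calibrate(scores: np.ndarray, found: np.ndarray) -> tuple[float, float]:
        # scores: the analyst's category score (0-100) on each validation chunk
        # found:  bool per chunk; did the analyst emit a finding when run?
        threshold = float(np.percentile(scores[~found], 95))   # p95 of distribution (b)
        # Fraction of distribution (a) below the threshold: the expected
        # false-negative rate this threshold buys.
        miss_rate = float((scores[found] < threshold).mean())
        return threshold, miss_rate

The miss_rate it returns is worth checking before shipping the threshold; it is the routed-away false-negative rate the audit pipeline (next section) should later confirm in production.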

The audit pipeline

Routing is a trade-off: you accept a small false-negative rate (chunks routed away from an analyst that would have found something) in exchange for a large cost cut. You want to know if the trade is going badly. The audit pipeline samples 1% of routed-away chunks and runs the full analyst set on them anyway. If audit findings exceed a threshold, your routing is too aggressive — relax it.
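
A sketch of the audit sampler, assuming a hypothetical run_analyst(analyst, chunk) call that returns the analyst's findings (truthy when it found something):

    import random

    AUDIT_RATE = 0.01     # audit 1% of routed-away (chunk, analyst) pairs
    FN_TOLERANCE = 0.02   # relax thresholds if audited misses exceed 2%

    def audit_miss_rate(routed_away, run_analyst) -> float:
        # routed_away: the (chunk, analyst) pairs the router skipped
        sample = [pair for pair in routed_away if random.random() < AUDIT_RATE]
        if not sample:
            return 0.0
        misses = sum(1 for chunk, analyst in sample
                     if run_analyst(analyst, chunk))
        return misses / len(sample)

If the returned rate exceeds FN_TOLERANCE, lower the offending analysts' thresholds and re-measure.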

The minimum-coverage floor

Even with routing, you keep a configurable floor: the analysts covering at least 6 of the 12 categories run on every chunk, regardless of score. This protects against a class of failure where the score model is itself wrong in a coordinated way (a contract uses unusual legal terminology and most scorers under-rate it). The floor ensures diversity of coverage on unusual chunks.
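
One way to implement the floor, reading it as a minimum number of categories whose analysts always run (a sketch on top of the route function above): keep the threshold-routed set, then top up coverage from the highest-scoring uncovered categories until the configured minimum is met.

    MIN_CATEGORIES = 6   # configurable coverage floor

    def route_with_floor(chunk, analysts, score_chunk):
        scores = {cat: score_chunk(chunk, cat) for cat in CATEGORIES}
        routed = {a.name: a for a in analysts
                  if max(scores[c] for c in a.categories) >= a.threshold}
        covered = {c for a in routed.values() for c in a.categories}
        # Top up: pull in analysts for the best-scoring uncovered categories
        # until at least MIN_CATEGORIES categories have coverage.
        for cat in sorted(CATEGORIES, key=scores.get, reverse=True):
            if len(covered) >= MIN_CATEGORIES:
                break
            if cat not in covered:
                for a in analysts:
                    if cat in a.categories:
                        routed[a.name] = a
                        covered.update(a.categories)
        return list(routed.values())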

What we measured

  • Cost reduction: ~75% vs. naive fan-out (every analyst on every chunk)
  • False-negative rate from audit: <2% — within tolerance
  • Latency: comparable (router adds ~50ms; saves ~10x more by skipping analysts)
  • 154s end-to-end on a 22-chunk sample, 116 LLM calls — vs ~470 calls without routing

Where this fails

Categories that overlap heavily — chunks that score 60-70 across many scorers — collapse to running everyone anyway. If your domain has a small number of broad categories rather than many narrow ones, the routing math is weaker. Tighter category definitions help; sometimes splitting one broad analyst into two narrower ones makes the routing land more cleanly.
