TL;DR
A pipeline of 12 scorers + 11 analysts does not need to fan out everywhere. A score-driven routing layer sends each chunk only to the analysts that match its content. Concretely: a contract review pipeline that runs 11 specialist analyst agents on every chunk makes ~2,400 LLM calls per 100-chunk contract. At Sonnet-class pricing that is real money. Most of those calls are confirmations of "no finding": the chunk is simply not relevant to that analyst's domain. A routing layer that decides which analysts to run per chunk cuts the bill by ~75% without losing findings.
The routing layer
Two model tiers. Tier one: 12 scorers, one per legal category (compliance, IP, indemnity, termination, payment, etc.). Each scorer runs a cheap model on the chunk and emits a 0-100 score for "is this chunk relevant to my category?" Tier two: 11 analyst agents, each tied to one or more categories. The router runs only the analysts whose category score clears a threshold.
- Tier 1 (scorers): Haiku-class, $0.25/M tokens, runs on every chunk
- Tier 2 (analysts): Sonnet/4o-class, $3/M tokens, runs only on routed chunks
- 12 scorers × cheap model on every chunk = small fixed cost
- ~3 analysts × expensive model on the average chunk (down from 11) = most of the expensive calls skipped
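In code, the routing decision itself is small. A minimal sketch of the two-tier flow, where the keyword lambdas stand in for the cheap scoring model and every name (`route_chunk`, the category/analyst tables) is illustrative, not the pipeline's actual API:

```python
from typing import Callable, Dict, List

def route_chunk(
    chunk: str,
    scorers: Dict[str, Callable[[str], float]],   # category -> cheap 0-100 scorer
    analyst_categories: Dict[str, List[str]],     # analyst -> categories it covers
    thresholds: Dict[str, float],                 # category -> calibrated cutoff
) -> List[str]:
    """Return the analysts to run on this chunk.

    Tier 1: every scorer runs (cheap model, fixed cost).
    Tier 2: an analyst runs only if at least one of its
    categories scored at or above that category's threshold.
    """
    scores = {cat: fn(chunk) for cat, fn in scorers.items()}
    return [
        analyst
        for analyst, cats in analyst_categories.items()
        if any(scores[c] >= thresholds[c] for c in cats)
    ]

# Toy demo: keyword matchers standing in for the Haiku-class scorer.
scorers = {
    "payment":   lambda t: 90.0 if "invoice" in t else 5.0,
    "indemnity": lambda t: 85.0 if "indemnify" in t else 5.0,
}
analysts = {"payment_analyst": ["payment"], "indemnity_analyst": ["indemnity"]}
thresholds = {"payment": 60.0, "indemnity": 60.0}

print(route_chunk("The invoice is due in 30 days.", scorers, analysts, thresholds))
# -> ['payment_analyst']
```

The expensive tier never sees the chunk unless a cheap scorer vouched for it; that asymmetry is the whole cost model.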
Threshold calibration
Score the validation set with every analyst on every chunk. For each analyst, measure the score distribution on (a) chunks where the analyst found something and (b) chunks where it did not. The routing threshold is the 95th percentile of distribution (b). Above that threshold, route the chunk to the analyst: there is enough relevance signal that the analyst is worth its cost. Below it, skip: at those scores the analyst would emit "no finding" 95% of the time.
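The calibration step is a few lines. A sketch using the nearest-rank percentile convention (the post doesn't say which estimator the team used, so that choice is an assumption):

```python
import math

def calibrate_threshold(no_finding_scores, percentile=95.0):
    """Routing cutoff for one analyst: the given percentile of its
    category score on validation chunks where it found nothing
    (distribution (b) above). Nearest-rank convention; assumes a
    non-empty score list."""
    ranked = sorted(no_finding_scores)
    k = math.ceil(percentile / 100.0 * len(ranked)) - 1
    return ranked[max(k, 0)]

# 100 no-finding scores spread evenly over 0..99 -> cutoff 94:
# only the top ~5% of no-finding chunks would still reach this analyst.
print(calibrate_threshold(list(range(100))))  # -> 94
```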
The audit pipeline
Routing is a trade-off: you accept a small false-negative rate (chunks routed away from an analyst that would have found something) in exchange for a large cost cut. You want to know if the trade is going badly. The audit pipeline samples 1% of routed-away chunks and runs the full analyst set on them anyway. If audit findings exceed a threshold, the routing is too aggressive: relax it.
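A minimal version of that audit loop, assuming `routed_away` is a list of (chunk, analyst) pairs the router skipped; the 1% sample rate and 2% tolerance come from the post, everything else here is illustrative:

```python
import random

def audit_sample(routed_away, rate=0.01, seed=0):
    """Pick ~1% of skipped (chunk, analyst) pairs; these get the
    full analyst run anyway, to estimate the false-negative rate."""
    rng = random.Random(seed)  # seeded for reproducible audits
    return [pair for pair in routed_away if rng.random() < rate]

def routing_too_aggressive(findings, audited, tolerance=0.02):
    """True if audited chunks produced findings at a rate above
    tolerance, the signal to relax the routing thresholds."""
    return audited > 0 and findings / audited > tolerance
```

The audit is itself cheap: at a 1% sample it adds roughly one full-fan-out chunk per 100-chunk contract, a small tax on the 75% savings.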
The minimum-coverage floor
Even with routing, you keep a configurable floor: at least 6 of 12 scorers run, regardless. This protects against a class of failure where the score model is itself wrong in a coordinated way (a contract uses unusual legal terminology and most scorers under-rate it). The floor ensures diversity of coverage on unusual chunks.
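One way to implement the floor (an interpretation, since the post doesn't show the mechanism): after scoring, top the set of above-threshold categories up to six by highest score, so at least six categories stay in play even when most scorers under-rate the chunk:

```python
def apply_floor(scores, thresholds, floor=6):
    """Categories that pass their threshold, topped up to at least
    `floor` categories by highest score. Protects unusual chunks
    against coordinated under-rating by the cheap scorers."""
    passing = {c for c, s in scores.items() if s >= thresholds[c]}
    if len(passing) >= floor:
        return passing
    for c in sorted(scores, key=scores.get, reverse=True):
        passing.add(c)
        if len(passing) >= floor:
            break
    return passing

# Toy demo: only c7 clears its threshold, but the floor keeps the
# six highest-scoring categories in play.
scores = {f"c{i}": float(i * 10) for i in range(8)}       # c0=0 .. c7=70
print(sorted(apply_floor(scores, {c: 65.0 for c in scores})))
# -> ['c2', 'c3', 'c4', 'c5', 'c6', 'c7']
```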
What we measured
- Cost reduction: ~75% vs. naive fan-out (every analyst on every chunk)
- False-negative rate from audit: <2% — within tolerance
- Latency: comparable (router adds ~50ms; saves ~10x more by skipping analysts)
- 154s end-to-end on a 22-chunk sample, 116 LLM calls — vs ~470 calls without routing
Where this fails
Categories that overlap heavily (chunks that score 60-70 across many scorers) collapse to running everyone anyway. If your domain has a few broad categories rather than many narrow ones, the routing math is weaker. Tighter category definitions help; sometimes splitting one broad analyst into two narrower ones makes routing land cleaner.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
