Smart Category-Score Routing That Cuts LLM Cost ~75%

7 min read

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR

A pipeline of 12 scorers + 11 analysts does not need to fan out everywhere. A score-driven routing layer sends each chunk only to the analysts that match its category — and saves three quarters of the LLM bill.
Tags: LLM cost optimization, smart routing, multi-agent fan-out, per-chunk scoring, dynamic agent selection, LLM ops

A contract review pipeline that runs 11 specialist analyst agents on every chunk makes ~2,400 LLM calls per 100-chunk contract. At Sonnet-class pricing that is real money. Most of those calls are confirmations of "no finding" — the chunk is simply not relevant to that analyst's domain. A routing layer that decides which analysts to run per chunk cuts the bill by ~75% without losing findings.

The routing layer

Two model tiers. Tier one: 12 scorers, one per legal category (compliance, IP, indemnity, termination, payment, etc.). Each scorer runs a cheap model on the chunk and emits a 0-100 score for "is this chunk relevant to my category?" Tier two: 11 analyst agents, each tied to one or more categories. The router runs only the analysts whose category scores clear a threshold; the selection logic is sketched after the cost breakdown below.

  • Tier 1 (scorers): Haiku-class, $0.25/M tokens, runs on every chunk
  • Tier 2 (analysts): Sonnet/4o-class, $3/M tokens, runs only on routed chunks
  • 12 scorers × cheap model on every chunk = small fixed cost
  • ~3 analysts on average × expensive model per chunk = order-of-magnitude reduction
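
A minimal sketch of the routing decision in Python. The names here are illustrative (score_chunk stands in for the Haiku-class scorer call, and the category list is truncated), but the shape is the point: score every category cheaply, then fan out only to the analysts whose categories clear their calibrated thresholds.

    from dataclasses import dataclass

    # Illustrative names; the real pipeline wires these to LLM calls.
    CATEGORIES = ["compliance", "ip", "indemnity", "termination", "payment"]  # 12 in total

    @dataclass
    class Analyst:
        name: str
        categories: list[str]   # an analyst may be tied to more than one category
        threshold: float        # calibrated per analyst (see next section)

    def route(chunk: str, analysts: list[Analyst], score_chunk) -> list[Analyst]:
        # Tier 1: one cheap relevance score (0-100) per category.
        scores = {cat: score_chunk(chunk, cat) for cat in CATEGORIES}
        # Tier 2: keep only the analysts whose best owned-category score clears the bar.
        return [a for a in analysts
                if max(scores[c] for c in a.categories) >= a.threshold]

With ~3 of 11 analysts routed on an average chunk, each chunk costs 12 cheap calls plus ~3 expensive ones instead of 11 expensive ones.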

Threshold calibration

Score the validation set by running every analyst on every chunk. For each analyst, measure the score distribution on (a) chunks where the analyst found something and (b) chunks where it did not. The routing threshold is the 95th percentile of distribution (b). Above it, route the chunk to the analyst — there is enough relevance signal that the analyst is worth its cost. Below it, skip — the analyst would emit "no finding" 95% of the time.
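
The same calibration as a sketch, assuming you logged each chunk's category score and whether the analyst actually produced a finding during the full fan-out run (array names are illustrative):

    import numpy as np

    def calibrate(scores: np.ndarray, found: np.ndarray) -> tuple[float, float]:
        # scores: the analyst's category score (0-100) on each validation chunk
        # found:  bool per chunk; did the analyst emit a finding when run?
        threshold = float(np.percentile(scores[~found], 95))   # p95 of distribution (b)
        # Fraction of distribution (a) below the threshold: the expected
        # false-negative rate this threshold buys.
        miss_rate = float((scores[found] < threshold).mean())
        return threshold, miss_rate

The miss_rate it returns is worth checking before shipping the threshold; it is the routed-away false-negative rate the audit pipeline (next section) should later confirm in production.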

The audit pipeline

Routing is a trade-off: you accept a small false-negative rate (chunks routed away from an analyst that would have found something) in exchange for a large cost cut. You want to know if the trade is going badly. The audit pipeline samples 1% of routed-away chunks and runs the full analyst set on them anyway. If audit findings exceed a threshold, your routing is too aggressive — relax it.
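
A sketch of the audit sampler, assuming a hypothetical run_analyst(analyst, chunk) call that returns the analyst's findings (truthy when it found something):

    import random

    AUDIT_RATE = 0.01     # audit 1% of routed-away (chunk, analyst) pairs
    FN_TOLERANCE = 0.02   # relax thresholds if audited misses exceed 2%

    def audit_miss_rate(routed_away, run_analyst) -> float:
        # routed_away: the (chunk, analyst) pairs the router skipped
        sample = [pair for pair in routed_away if random.random() < AUDIT_RATE]
        if not sample:
            return 0.0
        misses = sum(1 for chunk, analyst in sample
                     if run_analyst(analyst, chunk))
        return misses / len(sample)

If the returned rate exceeds FN_TOLERANCE, lower the offending analysts' thresholds and re-measure.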

The minimum-coverage floor

Even with routing, you keep a configurable floor: the analysts covering at least 6 of the 12 categories run on every chunk, regardless of score. This protects against a class of failure where the score model is itself wrong in a coordinated way (a contract uses unusual legal terminology and most scorers under-rate it). The floor ensures diversity of coverage on unusual chunks.
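
One way to implement the floor, reading it as a minimum number of categories whose analysts always run (a sketch on top of the route function above): keep the threshold-routed set, then top up coverage from the highest-scoring uncovered categories until the configured minimum is met.

    MIN_CATEGORIES = 6   # configurable coverage floor

    def route_with_floor(chunk, analysts, score_chunk):
        scores = {cat: score_chunk(chunk, cat) for cat in CATEGORIES}
        routed = {a.name: a for a in analysts
                  if max(scores[c] for c in a.categories) >= a.threshold}
        covered = {c for a in routed.values() for c in a.categories}
        # Top up: pull in analysts for the best-scoring uncovered categories
        # until at least MIN_CATEGORIES categories have coverage.
        for cat in sorted(CATEGORIES, key=scores.get, reverse=True):
            if len(covered) >= MIN_CATEGORIES:
                break
            if cat not in covered:
                for a in analysts:
                    if cat in a.categories:
                        routed[a.name] = a
                        covered.update(a.categories)
        return list(routed.values())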

What we measured

  • Cost reduction: ~75% vs. naive fan-out (every analyst on every chunk)
  • False-negative rate from audit: <2% — within tolerance
  • Latency: comparable (router adds ~50ms; saves ~10x more by skipping analysts)
  • 154s end-to-end on a 22-chunk sample, 116 LLM calls — vs ~470 calls without routing

Where this fails

Categories that overlap heavily — chunks that score 60-70 across many scorers — collapse to running everyone anyway. If your domain has a small number of broad categories rather than many narrow ones, the routing math is weaker. Tighter category definitions help; sometimes splitting one broad analyst into two narrower ones makes the routing land more cleanly.
