TL;DR
A line-by-line breakdown of the sub-1.5-second p95 latency budget (VAD, streaming STT, first-token LLM, streaming TTS, network) and the optimizations that bring each segment down.
A voice agent has a hard latency ceiling: roughly 1.5 seconds of perceived gap before the user thinks the line dropped. Below that, the conversation feels natural; above it, the user repeats themselves or hangs up. A production budget that hits this is roughly: VAD 150ms, STT 350ms, LLM 500ms, TTS 250ms, network 200ms, about 1.45s in total. Each segment has its own optimization frontier.
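As a sanity check, those segments sum to 1,450ms, just under the ceiling. A back-of-envelope version in Python (the numbers are the estimates above, not measurements from any particular stack):

```python
# Segment estimates from the budget above, in milliseconds.
BUDGET_MS = {
    "vad": 150,      # end-of-utterance detection
    "stt": 350,      # streaming STT final-transcript lag
    "llm": 500,      # first-token latency
    "tts": 250,      # first audio chunk
    "network": 200,  # connection + per-hop + player buffering
}

total = sum(BUDGET_MS.values())
assert total < 1500, f"over the 1.5s ceiling: {total}ms"
print(f"perceived gap budget: {total}ms")  # prints 1450ms
```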
VAD: the cheapest 150ms
Voice Activity Detection decides when the user has stopped speaking. Naive VADs wait for silence (typically 200-500ms). Prosodic VADs use falling intonation and syntactic completeness to detect end-of-utterance ~150ms earlier. The trade-off is occasional false positives (cutting the user off mid-sentence). For a customer support agent that pauses politely on uncertainty, the trade is worth it.
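A minimal sketch of that decision, combining a hard silence timeout with a prosodic early exit; the thresholds and the completeness score are illustrative, not tuned values from a specific model:

```python
# Hypothetical end-of-utterance gate: a prosodic early exit backed by a hard
# silence timeout. All thresholds here are illustrative.

SILENCE_TIMEOUT_MS = 400   # naive fallback: end the turn after this much silence
EARLY_EXIT_MS = 150        # prosodic early exit after this much silence
COMPLETENESS_MIN = 0.85    # how sure the endpoint model must be

def end_of_utterance(silence_ms: float, completeness: float) -> bool:
    """Return True when the agent should stop listening and respond.

    silence_ms   -- milliseconds since the last speech frame (from the VAD)
    completeness -- 0..1 score from a prosodic/syntactic endpoint model
                    (falling intonation plus a complete clause push it up)
    """
    if silence_ms >= SILENCE_TIMEOUT_MS:
        return True  # long silence always ends the turn
    if silence_ms >= EARLY_EXIT_MS and completeness >= COMPLETENESS_MIN:
        return True  # confident prosodic exit, well before the timeout
    return False
```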
STT: streaming buys ~300ms of overlap
Streaming STT emits partial transcripts as the user speaks. The LLM does not have to wait for the final transcript — it starts processing the partial. This buys two things: the LLM's KV cache is pre-warmed, so final-transcript inference is faster; and the response generation can start before the user finishes if confidence is high.
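A sketch of that overlap, assuming an async STT stream that yields partial transcripts and an LLM client with a cache-warming call; both interfaces are hypothetical stand-ins, not a specific SDK:

```python
async def handle_turn(stt_stream, llm):
    """stt_stream yields objects with .text and .is_final; llm exposes
    prefill() and generate(). Both are hypothetical interfaces."""
    final_text = ""
    async for partial in stt_stream:
        if partial.is_final:
            final_text = partial.text
            break
        # Warm the KV cache on the partial transcript; the final request
        # then only pays prefill for the transcript's tail.
        await llm.prefill(prompt=partial.text)
    return await llm.generate(prompt=final_text, stream=True)
```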
Cost: streaming STT is ~3x more expensive per minute than batched. But the latency math works only with streaming.
LLM: first-token is the only metric that matters
"First-token latency" is the time from request to first token returned. For a voice agent, this is the latency you pay — the user hears the first phoneme as soon as TTS sees the first token. Total generation latency is irrelevant to the conversation feel; first-token is everything.
Optimizations:
- Use a smaller model for the first half of the response (a 7B that streams tokens fast), then hand off to the larger model for sustained generation.
- Prefill the KV cache on the partial transcript.
- Use a model with a 64K-token context cache so common system-prompt tokens are not recomputed.
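The first of those is the least obvious, so here is a sketch of the small-to-large handoff; small_llm, large_llm, and tts are hypothetical streaming clients, not a real SDK:

```python
CLAUSE_END = (".", ",", "!", "?", ";")

async def respond(prompt: str, small_llm, large_llm, tts):
    # The small model streams the opener; each token goes straight to TTS,
    # so the user hears audio while the large model is still warming up.
    spoken = []
    async for tok in small_llm.generate(prompt, stream=True, max_tokens=40):
        spoken.append(tok)
        await tts.feed(tok)
        if tok.endswith(CLAUSE_END):
            break  # hand off at a natural clause boundary
    # The larger model continues from what has already been said aloud.
    async for tok in large_llm.generate(prompt + "".join(spoken), stream=True):
        await tts.feed(tok)
```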
TTS: 250ms first-chunk is the floor
Modern neural TTS emits the first audio chunk ~250ms after the first text token arrives. Cutting below that requires either a smaller TTS model (lower quality) or a co-located GPU (lower flexibility). 250ms is a reasonable floor for production.
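The corollary is that playback should start on the first audio chunk rather than waiting for the whole utterance. A sketch with hypothetical tts.stream() and player.play() interfaces:

```python
async def speak(tts, player, text: str):
    # Begin playback on the very first chunk; later chunks queue behind it,
    # so perceived latency is first-chunk time, not total synthesis time.
    async for chunk in tts.stream(text):
        await player.play(chunk)
```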
Network: 200ms hides everywhere
- WebRTC connection establishment: 50ms if pre-warmed, 200ms cold
- Per-hop network latency: 30-80ms depending on geography
- TLS handshake: amortized to 0 on persistent connections
- Audio buffering in the player: 50-100ms inherent
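Most of the avoidable portion is handshake cost, which can be paid once at session setup instead of per turn. A sketch using the websockets package; the URL and keep-alive message are placeholders:

```python
import websockets  # third-party: pip install websockets

async def open_session(url: str = "wss://example.invalid/agent"):
    # Transport and TLS handshakes are paid here, once, before the user speaks.
    ws = await websockets.connect(url)
    await ws.send('{"type": "ping"}')  # keep the path warm between turns
    return ws  # reuse this connection for every turn in the session
```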
What we measured
- P50 perceived latency: 1.1s
- P95: 1.4s
- P99: 2.1s — usually a model-side rate limit or LLM cold start
- Cost per minute of agent time: ~$0.18 at peak (Ultravox + LLM + TTS + VAD)