TL;DR
A line-by-line breakdown of the sub-1.5-second p95 latency budget (VAD, streaming STT, first-token LLM, streaming TTS, network) and the optimizations that bring each segment down.
A voice agent has a hard latency ceiling: roughly 1.5 seconds of perceived gap before the user thinks the line dropped. Below that, the conversation feels natural; above it, the user repeats themselves or hangs up. A production budget that hits this is roughly: VAD 150ms, STT 350ms, LLM 500ms, TTS 250ms, network 200ms, about 1.45s in total. Each segment has its own optimization frontier.
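As a sanity check, those segments sum to 1,450ms, just under the ceiling. A back-of-envelope version in Python (the numbers are the estimates above, not measurements from any particular stack):

```python
# Segment estimates from the budget above, in milliseconds.
BUDGET_MS = {
    "vad": 150,      # end-of-utterance detection
    "stt": 350,      # streaming STT final-transcript lag
    "llm": 500,      # first-token latency
    "tts": 250,      # first audio chunk
    "network": 200,  # connection + per-hop + player buffering
}

total = sum(BUDGET_MS.values())
assert total < 1500, f"over the 1.5s ceiling: {total}ms"
print(f"perceived gap budget: {total}ms")  # prints 1450ms
```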
VAD: the cheapest 150ms
Voice Activity Detection decides when the user has stopped speaking. Naive VADs wait for silence (typically 200-500ms). Prosodic VADs use falling intonation and syntactic completeness to detect end-of-utterance ~150ms earlier. The trade-off is occasional false positives (cutting the user off mid-sentence). For a customer support agent that pauses politely on uncertainty, the trade is worth it.
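A minimal sketch of that decision, combining a hard silence timeout with a prosodic early exit; the thresholds and the completeness score are illustrative, not tuned values from a specific model:

```python
# Hypothetical end-of-utterance gate: a prosodic early exit backed by a hard
# silence timeout. All thresholds here are illustrative.

SILENCE_TIMEOUT_MS = 400   # naive fallback: end the turn after this much silence
EARLY_EXIT_MS = 150        # prosodic early exit after this much silence
COMPLETENESS_MIN = 0.85    # how sure the endpoint model must be

def end_of_utterance(silence_ms: float, completeness: float) -> bool:
    """Return True when the agent should stop listening and respond.

    silence_ms   -- milliseconds since the last speech frame (from the VAD)
    completeness -- 0..1 score from a prosodic/syntactic endpoint model
                    (falling intonation plus a complete clause push it up)
    """
    if silence_ms >= SILENCE_TIMEOUT_MS:
        return True  # long silence always ends the turn
    if silence_ms >= EARLY_EXIT_MS and completeness >= COMPLETENESS_MIN:
        return True  # confident prosodic exit, well before the timeout
    return False
```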
STT: streaming buys ~300ms of overlap
Streaming STT emits partial transcripts as the user speaks. The LLM does not have to wait for the final transcript — it starts processing the partial. This buys two things: the LLM's KV cache is pre-warmed, so final-transcript inference is faster; and the response generation can start before the user finishes if confidence is high.
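A sketch of that overlap, assuming an async STT stream that yields partial transcripts and an LLM client with a cache-warming call; both interfaces are hypothetical stand-ins, not a specific SDK:

```python
async def handle_turn(stt_stream, llm):
    """stt_stream yields objects with .text and .is_final; llm exposes
    prefill() and generate(). Both are hypothetical interfaces."""
    final_text = ""
    async for partial in stt_stream:
        if partial.is_final:
            final_text = partial.text
            break
        # Warm the KV cache on the partial transcript; the final request
        # then only pays prefill for the transcript's tail.
        await llm.prefill(prompt=partial.text)
    return await llm.generate(prompt=final_text, stream=True)
```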
Cost: streaming STT is ~3x more expensive per minute than batched. But the latency math works only with streaming.
LLM: first-token is the only metric that matters
"First-token latency" is the time from request to first token returned. For a voice agent, this is the latency you pay — the user hears the first phoneme as soon as TTS sees the first token. Total generation latency is irrelevant to the conversation feel; first-token is everything.
Optimizations:
- Use a smaller model for the first half of the response (a 7B that streams tokens fast), then hand off to the larger model for sustained generation.
- Prefill the KV cache on the partial transcript.
- Use a model with a 64K-token context cache so common system-prompt tokens are not recomputed.
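The first of those is the least obvious, so here is a sketch of the small-to-large handoff; small_llm, large_llm, and tts are hypothetical streaming clients, not a real SDK:

```python
CLAUSE_END = (".", ",", "!", "?", ";")

async def respond(prompt: str, small_llm, large_llm, tts):
    # The small model streams the opener; each token goes straight to TTS,
    # so the user hears audio while the large model is still warming up.
    spoken = []
    async for tok in small_llm.generate(prompt, stream=True, max_tokens=40):
        spoken.append(tok)
        await tts.feed(tok)
        if tok.endswith(CLAUSE_END):
            break  # hand off at a natural clause boundary
    # The larger model continues from what has already been said aloud.
    async for tok in large_llm.generate(prompt + "".join(spoken), stream=True):
        await tts.feed(tok)
```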
TTS: 250ms first-chunk is the floor
Modern neural TTS emits the first audio chunk ~250ms after the first text token arrives. Cutting below that requires either a smaller TTS model (lower quality) or a co-located GPU (lower flexibility). 250ms is a reasonable floor for production.
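The corollary is that playback should start on the first audio chunk rather than waiting for the whole utterance. A sketch with hypothetical tts.stream() and player.play() interfaces:

```python
async def speak(tts, player, text: str):
    # Begin playback on the very first chunk; later chunks queue behind it,
    # so perceived latency is first-chunk time, not total synthesis time.
    async for chunk in tts.stream(text):
        await player.play(chunk)
```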
Network: 200ms hides everywhere
- WebRTC connection establishment: 50ms if pre-warmed, 200ms cold
- Per-hop network latency: 30-80ms depending on geography
- TLS handshake: amortized to 0 on persistent connections
- Audio buffering in the player: 50-100ms inherent
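Most of the avoidable portion is handshake cost, which can be paid once at session setup instead of per turn. A sketch using the websockets package; the URL and keep-alive message are placeholders:

```python
import websockets  # third-party: pip install websockets

async def open_session(url: str = "wss://example.invalid/agent"):
    # Transport and TLS handshakes are paid here, once, before the user speaks.
    ws = await websockets.connect(url)
    await ws.send('{"type": "ping"}')  # keep the path warm between turns
    return ws  # reuse this connection for every turn in the session
```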
What we measured
- P50 perceived latency: 1.1s
- P95: 1.4s
- P99: 2.1s — usually a model-side rate limit or LLM cold start
- Cost per minute of agent time: ~$0.18 at peak (Ultravox + LLM + TTS + VAD)