
Voice AI Latency Budget Deep Dive: Where the 1.5 Seconds Goes

7 min read
1,500 words

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR

A line-by-line breakdown of the sub-1.5-second p95 latency budget — VAD, streaming STT, first-token LLM, streaming TTS, network — and the optimizations that buy each milestone.
Tags: voice latency, streaming STT, streaming TTS, first-token latency, VAD, end-of-speech detection, perceived latency

A voice agent has a hard latency ceiling: roughly 1.5 seconds of perceived gap before the user thinks the line dropped. Below that, the conversation feels natural; above it, the user repeats themselves or hangs up. A production budget that hits this is roughly: VAD 150ms, STT 350ms, LLM 500ms, TTS 250ms, network 200ms, for a total of ~1.45s. Each segment has its own optimization frontier.
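As a quick sanity check, here is the budget as data (the values are the ones above; the ceiling is the 1.5s threshold):

```python
# The budget from the paragraph above, in milliseconds.
BUDGET_MS = {
    "vad_end_of_speech": 150,
    "stt_final_transcript": 350,
    "llm_first_token": 500,
    "tts_first_chunk": 250,
    "network": 200,
}

total = sum(BUDGET_MS.values())
print(f"total: {total} ms")  # 1450 ms
assert total <= 1500, "budget exceeds the perceived-latency ceiling"
```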

VAD: the cheapest 150ms

Voice Activity Detection decides when the user has stopped speaking. Naive VADs wait for silence (typically 200-500ms). Prosodic VADs use falling intonation and syntactic completeness to detect end-of-utterance ~150ms earlier. The trade is occasional false positives (cutting the user off mid-sentence). For a customer support agent that pauses politely on uncertainty, the trade is worth it.
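A minimal sketch of the naive, silence-based endpointer, using the open-source webrtcvad package (the frame and sample-rate constraints in the comments are its actual requirements). A prosodic endpointer would replace the fixed hangover with intonation and completeness cues, which this sketch does not attempt:

```python
# Naive, silence-based end-of-speech detection with a fixed hangover.
# Requires: pip install webrtcvad
import webrtcvad

SAMPLE_RATE = 16000   # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30         # webrtcvad accepts 10/20/30 ms frames
HANGOVER_MS = 300     # silence required before declaring end-of-utterance

def wait_for_end_of_utterance(frames):
    """frames: iterable of 30 ms chunks of raw PCM bytes.

    Returns once speech has been heard and HANGOVER_MS of continuous
    silence follows it. Lowering HANGOVER_MS buys latency at the cost
    of occasionally cutting the user off, exactly the trade in the text.
    """
    vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (aggressive)
    silence_ms, heard_speech = 0, False
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            heard_speech, silence_ms = True, 0
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= HANGOVER_MS:
                return
```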

STT: streaming buys ~300ms of overlap

Streaming STT emits partial transcripts as the user speaks. The LLM does not have to wait for the final transcript — it starts processing the partial. This buys two things: the LLM's KV cache is pre-warmed, so final-transcript inference is faster; and the response generation can start before the user finishes if confidence is high.
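A sketch of the overlap, assuming hypothetical stt_stream and llm clients (the event fields and the llm.prefill call are stand-ins for whatever your providers expose; the shape of the loop is the point):

```python
# Overlapping STT and LLM work on partial transcripts. `stt_stream` and
# `llm` are hypothetical clients; the loop structure is the point.
import asyncio

async def handle_turn(stt_stream, llm, system_prompt):
    last_partial = ""
    async for event in stt_stream:  # emits partial and final transcripts
        if event.is_final:
            # The speculative prefills below have already warmed the KV
            # cache for this prefix, so first-token arrives sooner here.
            return await llm.generate(system_prompt, event.text)
        if event.text != last_partial and event.confidence > 0.8:
            last_partial = event.text
            # Fire-and-forget prefill on the partial; the output is
            # discarded, but the prefix stays hot in the provider cache.
            asyncio.ensure_future(llm.prefill(system_prompt, last_partial))
```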

Cost: streaming STT is ~3x more expensive per minute than batched. But the latency math works only with streaming.

LLM: first-token is the only metric that matters

"First-token latency" is the time from request to first token returned. For a voice agent, this is the latency you pay — the user hears the first phoneme as soon as TTS sees the first token. Total generation latency is irrelevant to the conversation feel; first-token is everything.

Optimizations:

  • Use a smaller model for the first half of the response (a 7B that streams tokens fast), then hand off to the larger model for sustained generation.
  • Prefill the KV cache on the partial transcript.
  • Use a model with a 64K-token context cache so common system-prompt tokens are not recomputed.
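The first of those might look like the sketch below; small and large are hypothetical streaming clients, and the prefix continuation parameter is an assumption about your serving stack:

```python
# Draft-then-handoff: stream the opening sentence from a small, fast
# model, then continue the reply on the larger model. `small` and
# `large` are hypothetical streaming clients; `prefix=` assumes your
# serving stack supports continuation from a given prefix.
async def respond(small, large, prompt):
    opener = ""
    async for token in small.stream(prompt):
        opener += token
        yield token  # tokens reach TTS immediately
        if token.rstrip().endswith((".", "!", "?")):
            break  # stop the small model at the first sentence boundary
    # Hand off: the large model continues from the opener.
    async for token in large.stream(prompt, prefix=opener):
        yield token
```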

TTS: 250ms first-chunk is the floor

Modern neural TTS emits the first audio chunk ~250ms after the first text token arrives. Cutting below that requires either a smaller TTS model (lower quality) or a co-located GPU (lower flexibility). 250ms is a reasonable floor for production.
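What matters on the client side is feeding TTS at clause boundaries instead of waiting for the full reply. A sketch with hypothetical tts and player clients:

```python
# Feed TTS at clause boundaries so the first audio chunk is not gated
# on full-reply generation. `tts.stream` and `player.play` are
# hypothetical stand-ins for your TTS provider and audio sink.
import time

async def speak(llm_tokens, tts, player):
    started, first_chunk_at, buffer = time.perf_counter(), None, ""
    async for token in llm_tokens:
        buffer += token
        if buffer.rstrip().endswith((",", ".", "!", "?")):  # clause boundary
            async for audio in tts.stream(buffer):
                if first_chunk_at is None:
                    first_chunk_at = time.perf_counter() - started
                    print(f"first audio after {first_chunk_at * 1000:.0f} ms")
                await player.play(audio)
            buffer = ""
    if buffer:  # flush whatever is left after the final token
        async for audio in tts.stream(buffer):
            await player.play(audio)
```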

Network: 200ms hides everywhere

  • WebRTC connection establishment: 50ms if pre-warmed, 200ms cold (see the pre-warm sketch after this list)
  • Per-hop network latency: 30-80ms depending on geography
  • TLS handshake: amortized to 0 on persistent connections
  • Audio buffering in the player: 50-100ms inherent
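A sketch of the pre-warm idea: open and keep alive the provider socket during call setup, so the handshake never lands inside a turn. The endpoint URL is a placeholder; the websockets package and its ping_interval keepalive are real:

```python
# Pre-warm a persistent socket at call setup so TCP + TLS handshakes
# happen before the caller speaks. The URL is a placeholder.
# Requires: pip install websockets
import websockets

TTS_WS_URL = "wss://tts.example.com/v1/stream"  # placeholder endpoint

async def prewarm():
    # Paying the handshake cost here keeps it out of every turn; the
    # built-in ping/pong keepalive holds the connection open between turns.
    ws = await websockets.connect(TTS_WS_URL, ping_interval=20)
    return ws  # reuse this connection for every turn of the call
```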

What we measured

  • P50 perceived latency: 1.1s
  • P95: 1.4s
  • P99: 2.1s — usually a model-side rate limit or LLM cold start
  • Cost per minute of agent time: ~$0.18 at peak (Ultravox + LLM + TTS + VAD)


Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years

Founder & CEO of Cognilium AI; 100+ production AI systems shipped; multi-cloud AI architecture (AWS, GCP, Azure); built and operated 4 production AI products.
Agentic AI, RAG → GraphRAG retrieval, Voice AI, Multi-Agent Orchestration


Related Articles

Enterprise Voice AI: Real Latency, Real Compliance, Real Money (11 min read)
Sub-1.5s p95 voice AI on Twilio + ElevenLabs + Whisper, designed for HIPAA and SOC2. The decisions that mattered, and the ones we got wrong twice.

Designing a Non-Scripted Voice Interview Agent on Ultravox (8 min read)
Voice screening that adapts to the candidate instead of reading from a list — follow-up question generation, multi-language handling, and the prompt structure that keeps the agent on-task without sounding like a robot.

Sentiment-Driven Escalation in a 22-Language Voice Support Agent (7 min read)
Real-time sentiment scoring drives the handoff decision; full conversation context, transcript, and detected intent travel with it. The escalation that does not start the human conversation from scratch.
