TL;DR
Sub-1.5s voice AI on Twilio + ElevenLabs + Whisper for enterprise. Latency budget, compliance design, and the architecture that survives real call volume.
A voice AI conversation is a real-time system. The user does not see a spinner. They feel the latency directly — and the moment it crosses about 1.5 seconds, they decide they are talking to a bad robot and disengage. The hard part of building production voice AI is not the demo. It is keeping the latency, compliance, and cost numbers all green at the same time.
This is the architecture we have shipped multiple times in 2025-2026 and the decisions that determined whether it worked. No vendor names that don't deserve to be there; no client names. Just the engineering.
The latency budget — the only number that matters end-to-end
Target: 1.4-1.5s p95 from end-of-user-speech to first audible audio of the response. Below this, the conversation feels natural. Above 1.8s, retention drops sharply. Below 800ms is achievable with full streaming pipelines but costs roughly 2x — most use cases do not need it.
Stage-by-stage breakdown
Decomposing 1400ms p95 across the pipeline gives you the per-stage levers. The shape we have shipped:
- Twilio call leg → media stream into our SBC: ~200ms p95.
- Streaming STT (Whisper or AWS Transcribe streaming): ~350ms from the last audio frame to the final transcript.
- LLM first-token latency: ~400ms with a regional Bedrock or Azure deployment, longer cross-region.
- TTS first audible chunk (ElevenLabs Streaming or AWS Polly): ~250ms.
- Network return path through the SBC back to Twilio: ~200ms.
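Pinned down as code, the same budget can drive per-stage alerting. A minimal sketch, with stage names and structure that are ours rather than any vendor's alerting API:

```python
# Per-stage p95 budget (ms), mirroring the breakdown above. The dict and
# helper are an illustrative sketch, not any vendor's alerting API.
STAGE_BUDGET_MS = {
    "twilio_inbound_leg": 200,
    "streaming_stt_final_transcript": 350,
    "llm_first_token": 400,
    "tts_first_audible_chunk": 250,
    "return_path_to_twilio": 200,
}
assert sum(STAGE_BUDGET_MS.values()) == 1400  # the end-to-end p95 target

def drifting_stages(observed_p95_ms: dict[str, float],
                    drift: float = 0.20) -> list[str]:
    """Stages whose observed p95 sits more than `drift` above their budget."""
    return [stage for stage, target in STAGE_BUDGET_MS.items()
            if observed_p95_ms.get(stage, 0.0) > target * (1 + drift)]
```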
The trick is that these are p95 numbers, and p95s do not add: all five stages rarely hit their tails on the same call, so the naive 1400ms sum sits at a noticeably higher percentile of the end-to-end distribution, closer to the pipeline's p99 than its p95. We monitor each stage with its own SLO and alert when any single stage drifts more than 20% above target.
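A quick simulation makes the point concrete. Assuming independent, lognormal-ish stage latencies, each calibrated so its own p95 matches the budget above (both assumptions are rough), the 1400ms sum lands well beyond the pipeline's p95:

```python
import math
import random

random.seed(0)
SIGMA = 0.5  # assumed lognormal spread per stage (illustrative)

def sample_stage(p95_ms: float) -> float:
    # Calibrate a lognormal so that its own p95 equals the stage budget:
    # p95 = median * exp(1.645 * sigma).
    median = p95_ms / math.exp(1.645 * SIGMA)
    return random.lognormvariate(math.log(median), SIGMA)

stage_p95s = [200, 350, 400, 250, 200]   # the breakdown above
budget = sum(stage_p95s)                 # naive sum: 1400 ms
N = 100_000
totals = sorted(sum(sample_stage(p) for p in stage_p95s) for _ in range(N))

print(f"sum of stage p95s: {budget} ms")
print(f"pipeline p95:      {totals[int(0.95 * N)]:.0f} ms")  # well under 1400
rank = sum(t <= budget for t in totals) / N
print(f"{budget} ms is roughly the pipeline p{100 * rank:.1f}")
```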
Where to buy time, and where to spend it
Most teams over-invest in STT optimization. A 50ms improvement in STT rarely matters; a 200ms improvement in TTS first-chunk almost always does, because the user perceives TTS latency as silence. The optimization order, by priority:
- TTS streaming with first-chunk under 250ms. Use a streaming-capable provider; do not generate the full audio before playing.
- LLM streaming with first-token under 500ms. Use the smallest model that meets quality, not the biggest.
- STT in streaming mode with partial transcripts. Frame the partials into the LLM context as they arrive.
- Region-locality. STT, LLM, TTS in the same AWS region. Cross-region adds 80-150ms per hop.
Latency comes from the pipeline architecture, not from any single component. Streaming-end-to-end is not optional for production voice AI.
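To make "streaming end-to-end" concrete, here is a toy asyncio pipeline with simulated stage delays; real STT, LLM, and TTS clients slot in behind the same async-generator shape. The caller hears audio after the first TTS chunk, long before the LLM has finished generating:

```python
import asyncio
import time

async def llm_stream(prompt: str):
    # Simulated LLM: ~400 ms to first token, then a steady token stream.
    await asyncio.sleep(0.40)
    for token in "Sure, I can help with that account question.".split():
        yield token + " "
        await asyncio.sleep(0.05)

async def tts_stream(text_chunks):
    # Simulated streaming TTS: ~250 ms to the first audible chunk,
    # then cheap incremental synthesis for later chunks.
    first = True
    async for _text in text_chunks:
        await asyncio.sleep(0.25 if first else 0.03)
        first = False
        yield b"<pcm-audio-frame>"  # stand-in for a real audio frame

async def main():
    start = time.monotonic()
    async for _frame in tts_stream(llm_stream("caller utterance")):
        print(f"first audible audio after {time.monotonic() - start:.2f}s")
        break  # the demo only cares about time-to-first-audio
    # Batch-mode TTS would wait for the full token stream first:
    # ~0.40s + 8 tokens * 0.05s before synthesis even starts.

asyncio.run(main())
```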
Compliance — the work that isn't in the demo
For regulated industries (healthcare, finance, legal) the compliance design is at least 30% of the build. Vendor selection is the easy part: Twilio, AWS Transcribe, AWS Bedrock, and ElevenLabs Enterprise will all sign BAAs and provide SOC 2 reports. The actual work is:
- Consent capture at the start of the call, recorded and timestamped to a separate audit log.
- Encrypted recording storage with key rotation and a documented retention policy (typically 90 days for the audio, indefinite for the transcript).
- PHI redaction in transcripts before they are persisted or sent to the LLM. We use AWS Comprehend Medical for healthcare and a custom NER for finance; a sketch of the healthcare path follows this list.
- Caller-identity verification before any account-bound action. The voice itself is not enough; pair it with an OTP or knowledge-based check.
- A documented incident-response runbook for voice-clone attacks. They are real and they target IVR systems.
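For the healthcare path, the redaction core can be small. A minimal sketch using Comprehend Medical's DetectPHI, which returns character offsets per entity; spans are replaced right-to-left so earlier offsets stay valid. Chunking to the API's text-size limit, retries, and the finance-side NER are omitted:

```python
import boto3

comprehend = boto3.client("comprehendmedical", region_name="us-east-1")

def redact_phi(transcript: str) -> str:
    """Replace detected PHI spans with their entity type, e.g. [NAME]."""
    entities = comprehend.detect_phi(Text=transcript)["Entities"]
    # Walk spans right-to-left so earlier BeginOffsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        transcript = (transcript[:ent["BeginOffset"]]
                      + f"[{ent['Type']}]"
                      + transcript[ent["EndOffset"]:])
    return transcript

# e.g. "John Smith called about his March 3rd visit"
#   -> "[NAME] called about his [DATE] visit"
```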
None of this is glamorous. All of it is what stands between a working demo and a system that survives an enterprise audit.
The economics — what an hour of voice AI actually costs
A representative cost breakdown for a 1-minute production conversation in 2026, mid-range stack:
- TTS (natural voice, streaming): ~$0.07.
- LLM inference (mid-tier model, ~600 tokens out): ~$0.03.
- STT (streaming Whisper-class): ~$0.005.
- Twilio carrier: ~$0.012 (US domestic).
- Other (CloudWatch, observability, SBC compute): ~$0.005.
- Total: ~$0.12 per minute of conversation.
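Rolled into a back-of-envelope model, the per-minute unit costs scale linearly with volume and compare directly against a vendor platform's per-minute quote (the monthly volume below is illustrative):

```python
# Per-minute unit costs from the breakdown above (USD, 2026 mid-range stack).
PER_MINUTE_USD = {
    "tts_streaming": 0.070,
    "llm_inference": 0.030,
    "stt_streaming": 0.005,
    "twilio_carrier": 0.012,
    "observability_and_compute": 0.005,
}

per_minute = sum(PER_MINUTE_USD.values())   # ~= 0.122
monthly_minutes = 50_000                    # illustrative volume
print(f"${per_minute:.3f}/min -> ${per_minute * monthly_minutes:,.0f}/month")
# -> $0.122/min -> $6,100/month
```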
TTS dominates because natural voice quality is non-negotiable for enterprise; the gap between a great voice and an acceptable voice is the gap between users staying on the call and hanging up. The LLM is the second cost driver and the easiest to optimize — most production voice AI deployments use a smaller model than people expect.
What we got wrong twice
In our first deployment we batched TTS: we generated the full response before playing it, and missed the latency target by 1.5s. The fix was streaming TTS, but it took us two weeks to get the buffering right; commit the first chunk too early and you voice text the LLM is still revising, wait too long and you give back the latency win.
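One workable shape for that compromise is sentence-boundary flushing: hold streamed tokens until a sentence ends, then hand the complete sentence to TTS. A minimal sketch; a production version also needs abbreviation handling and a maximum-hold timeout so a long sentence cannot stall the audio:

```python
import re
from typing import Iterable, Iterator

# Naive sentence-end test; "Dr.", "1.5s", etc. need a smarter boundary check.
SENTENCE_END = re.compile(r"[.!?]['\")\]]*\s*$")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and flush one complete sentence at a time."""
    buf = ""
    for token in tokens:
        buf += token
        if SENTENCE_END.search(buf):
            yield buf.strip()   # a full sentence: safe to hand to TTS now
            buf = ""
    if buf.strip():
        yield buf.strip()       # flush any trailing partial sentence

# e.g. list(sentence_chunks(["Hi! ", "Your ", "balance ", "is ", "$40."]))
#   -> ["Hi!", "Your balance is $40."]
```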
In our second deployment we ran the LLM in a different region from the rest of the stack to reuse an existing model deployment. We paid 110ms p95 for that decision. We moved the LLM into the same region as STT and TTS the next sprint and the curve flattened.
When voice AI is the wrong answer
Voice AI is the right answer when latency, customization, or compliance dominate the requirements. It is the wrong answer when call volume is below ~10K minutes/month (build is more expensive than buying a vendor agent platform), when the use case is fully scripted (a no-code IVR builder will ship faster), or when the call complexity is low enough that touch-tone IVR plus a transcript handoff to a human agent works.
For teams who have decided voice AI is the right tool, the next two pieces of the puzzle are voice AI compliance design and the supervisor-pattern multi-agent architecture that often sits behind the LLM step. Both are linked from this article — the architecture as a whole is the sum of all three.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years experience
Mudassir Marwat is the Founder & CEO of Cognilium AI. He has shipped 100+ production AI systems acro...
