Twilio handles the phone call. ElevenLabs makes it sound human. Together, they're the foundation of modern voice AI. But connecting them isn't plug-and-play—you need streaming audio, proper encoding, and latency optimization. This guide covers the complete integration with production-ready code.
Why Twilio + ElevenLabs?
Twilio provides programmable telephony—the ability to make and receive phone calls via API. ElevenLabs provides the most natural-sounding text-to-speech available. Combined with an LLM for conversation and STT for speech recognition, they form the voice AI stack that powers production systems handling millions of calls.
1. Architecture Overview
The pipeline looks like this: a caller dials your Twilio number, Twilio opens a Media Streams WebSocket to your server carrying mulaw 8kHz audio, Deepgram transcribes the stream, Claude generates a reply, and ElevenLabs synthesizes it back into mulaw audio that you push down the same WebSocket to the caller.
2. Prerequisites
# Python 3.9+
pip install twilio fastapi uvicorn websockets httpx deepgram-sdk anthropic
# Accounts needed:
# - Twilio: twilio.com (phone number + Media Streams)
# - ElevenLabs: elevenlabs.io (API key + voice ID)
# - Deepgram: deepgram.com (API key)
# - Anthropic: anthropic.com (API key)
Environment Variables
export TWILIO_ACCOUNT_SID="your_account_sid"
export TWILIO_AUTH_TOKEN="your_auth_token"
export TWILIO_PHONE_NUMBER="+1234567890"
export DEEPGRAM_API_KEY="your_deepgram_key"
export ELEVENLABS_API_KEY="your_elevenlabs_key"
export ELEVENLABS_VOICE_ID="your_voice_id"
export ANTHROPIC_API_KEY="your_anthropic_key"
3. Step 1: Twilio Setup
Buy a Phone Number
import os
from twilio.rest import Client

TWILIO_ACCOUNT_SID = os.environ["TWILIO_ACCOUNT_SID"]
TWILIO_AUTH_TOKEN = os.environ["TWILIO_AUTH_TOKEN"]

client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

# Purchase the number and point its voice webhook at your server
number = client.incoming_phone_numbers.create(
    phone_number="+1234567890",
    voice_url="https://your-server.com/voice",
    voice_method="POST",
)
print(f"Phone number: {number.phone_number}")
Configure Webhook
In Twilio Console:
- Go to Phone Numbers → Manage → Active Numbers
- Click your number
- Set Voice Configuration: Webhook → https://your-server.com/voice
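If the number already exists, you can set the same webhook from code instead of the Console, using the IncomingPhoneNumbers update API (the sid comes from the purchase step above):
# Update the voice webhook on an existing number
client.incoming_phone_numbers(number.sid).update(
    voice_url="https://your-server.com/voice",
    voice_method="POST",
)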
4. Step 2: ElevenLabs Configuration
Select a Voice
import os
import httpx

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]

async def list_voices():
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.elevenlabs.io/v1/voices",
            headers={"xi-api-key": ELEVENLABS_API_KEY},
        )
        voices = response.json()["voices"]
        for voice in voices:
            print(f"{voice['voice_id']}: {voice['name']}")

# Recommended for phone calls:
# - "Rachel" (professional, clear)
# - "Josh" (conversational, warm)
Test TTS
ELEVENLABS_VOICE_ID = os.environ["ELEVENLABS_VOICE_ID"]

async def test_tts(text: str) -> bytes:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVENLABS_VOICE_ID}",
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "text": text,
                "model_id": "eleven_turbo_v2",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.75,
                },
            },
        )
        return response.content
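To sanity-check the voice before wiring up telephony, run the helper and save the result to disk (ElevenLabs returns MP3 audio by default):
import asyncio

# Save a test utterance and play it in any audio player
audio = asyncio.run(test_tts("Hello! This is a test of the voice pipeline."))
with open("test.mp3", "wb") as f:
    f.write(audio)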
5. Step 3: Webhook Server
from fastapi import FastAPI, Request, Response
from twilio.twiml.voice_response import VoiceResponse, Connect, Stream

app = FastAPI()

@app.post("/voice")
async def handle_incoming_call(request: Request):
    form = await request.form()  # Twilio posts call details as form data
    response = VoiceResponse()
    response.say("Hello! I'm connecting you now.", voice="alice")
    connect = Connect()
    stream = Stream(url="wss://your-server.com/stream")
    stream.parameter(name="caller_id", value=form.get("From", "unknown"))
    connect.append(stream)
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")
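For reference, the handler above returns TwiML along these lines (the caller_id value here is a hypothetical example; it varies per call):
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="alice">Hello! I'm connecting you now.</Say>
  <Connect>
    <Stream url="wss://your-server.com/stream">
      <Parameter name="caller_id" value="+15551234567" />
    </Stream>
  </Connect>
</Response>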
6. Step 4: WebSocket Handler
import asyncio
import base64
import json
import os

from anthropic import AsyncAnthropic
from deepgram import Deepgram

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]

# Note: this targets the deepgram-sdk v2 interface (pin deepgram-sdk<3)
deepgram = Deepgram(DEEPGRAM_API_KEY)
anthropic_client = AsyncAnthropic(api_key=ANTHROPIC_API_KEY)

class CallHandler:
    def __init__(self, websocket):
        self.websocket = websocket
        self.stream_sid = None
        self.conversation_history = []

    async def handle(self):
        # Open a live transcription socket matching Twilio's audio format
        dg_connection = await deepgram.transcription.live({
            "encoding": "mulaw",
            "sample_rate": 8000,
            "channels": 1,
            "model": "nova-2",
            "punctuate": True,
            "interim_results": True,
        })
        dg_connection.registerHandler(
            dg_connection.event.TRANSCRIPT_RECEIVED,
            self.handle_transcript,
        )
        try:
            async for message in self.websocket:
                data = json.loads(message)
                if data["event"] == "start":
                    self.stream_sid = data["start"]["streamSid"]
                elif data["event"] == "media":
                    # Twilio sends base64-encoded mulaw frames
                    audio = base64.b64decode(data["media"]["payload"])
                    dg_connection.send(audio)
                elif data["event"] == "stop":
                    break
        finally:
            await dg_connection.finish()

    async def handle_transcript(self, transcript):
        if transcript.get("is_final"):
            text = transcript["channel"]["alternatives"][0]["transcript"]
            if text.strip():
                response_text = await self.generate_response(text)
                await self.speak(response_text)  # speak() wraps stream_tts from Step 5

    async def generate_response(self, user_input: str) -> str:
        self.conversation_history.append({"role": "user", "content": user_input})
        response = await anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            system="You are a helpful voice assistant. Keep responses brief, 1-3 sentences.",
            messages=self.conversation_history,
        )
        assistant_message = response.content[0].text
        self.conversation_history.append({"role": "assistant", "content": assistant_message})
        return assistant_message
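One piece the handler needs is a server to run on. A minimal sketch, assuming the CallHandler above, that serves it with the websockets library on port 8765 (your reverse proxy would route wss://your-server.com/stream to it):
import asyncio
import websockets

async def stream_endpoint(websocket, path=None):
    # `path` is passed by older websockets versions; newer ones omit it
    handler = CallHandler(websocket)
    await handler.handle()

async def main():
    async with websockets.serve(stream_endpoint, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())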
7. Step 5: Streaming TTS Integration
For lower latency, stream TTS as it generates:
async def stream_tts(self, text: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVENLABS_VOICE_ID}/stream",
            params={"output_format": "ulaw_8000"},  # direct mulaw output, no conversion
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "text": text,
                "model_id": "eleven_turbo_v2",
            },
        ) as response:
            # 160 bytes of 8kHz mulaw is 20ms of audio, one Twilio media frame
            async for chunk in response.aiter_bytes(chunk_size=160):
                message = {
                    "event": "media",
                    "streamSid": self.stream_sid,
                    "media": {
                        "payload": base64.b64encode(chunk).decode(),
                    },
                }
                await self.websocket.send(json.dumps(message))
Key optimization: ElevenLabs supports ulaw_8000 output (passed as a query parameter on the stream endpoint), so the audio needs no conversion before it goes back to Twilio.
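To tie this into the interruption handling in Step 6 (next section), a speak() wrapper can set the speaking flag and append a Twilio mark message, which Twilio echoes back once playback reaches that point. A sketch (the mark name is arbitrary):
async def speak(self, text: str):
    self.is_speaking = True
    self.interrupt_requested = False
    await self.stream_tts(text)
    # Ask Twilio to notify us when the buffered audio has finished playing
    mark = {
        "event": "mark",
        "streamSid": self.stream_sid,
        "mark": {"name": "end_of_response"},
    }
    await self.websocket.send(json.dumps(mark))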
8. Step 6: Complete Call Flow
class ProductionCallHandler:
    def __init__(self, websocket):
        self.websocket = websocket
        self.stream_sid = None
        self.is_speaking = False
        self.interrupt_requested = False

    async def handle_transcript(self, transcript):
        if transcript.get("is_final"):
            text = transcript["channel"]["alternatives"][0]["transcript"].strip()
            if not text:
                return
            # Barge-in: if the caller talks over us, cut our audio off
            if self.is_speaking:
                self.interrupt_requested = True
                await self.stop_audio()
            await self.process_and_respond(text)

    async def stop_audio(self):
        # Twilio's "clear" event flushes any audio still buffered for playback
        clear_message = {
            "event": "clear",
            "streamSid": self.stream_sid,
        }
        await self.websocket.send(json.dumps(clear_message))
        self.is_speaking = False
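One piece this flow still needs is resetting is_speaking when playback actually finishes. Twilio echoes each mark you send once the audio queued before it has played, so the media loop from Step 4 can watch for it. A sketch of that extra branch as a method:
async def handle_mark(self, data: dict):
    # Called from the media loop when data["event"] == "mark";
    # Twilio echoes our "end_of_response" mark once playback completes
    if data["mark"]["name"] == "end_of_response":
        self.is_speaking = False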
9. Latency Optimization
Optimization Checklist
| Technique | Latency Saved | Implementation |
|---|---|---|
| Use eleven_turbo_v2 | 100-200ms | Model selection |
| Use ulaw_8000 output | 50-100ms | No conversion |
| Stream TTS | 200-400ms | Async streaming |
| Use Deepgram Nova-2 | 50-100ms | Faster STT |
| Claude Haiku | 100-200ms | Faster LLM |
| Regional endpoints | 20-50ms | Closest region |
Latency Monitoring
import time

class LatencyTracker:
    def __init__(self):
        self.metrics = []

    async def timed_operation(self, name: str, coro):
        start = time.perf_counter()
        result = await coro
        elapsed = (time.perf_counter() - start) * 1000
        self.metrics.append({"operation": name, "latency_ms": elapsed})
        print(f"{name}: {elapsed:.0f}ms")
        return result
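Usage is a matter of wrapping each awaited stage, for example around the handlers from Steps 4 and 5:
tracker = LatencyTracker()

async def respond(handler, text: str):
    # Time each stage of the pipeline independently
    reply = await tracker.timed_operation("llm", handler.generate_response(text))
    await tracker.timed_operation("tts", handler.stream_tts(reply))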
10. Production Deployment
Dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Deploy with Docker Compose (e.g., on AWS)
version: "3.8"
services:
voice-ai:
build: .
ports:
- "8000:8000"
environment:
- TWILIO_ACCOUNT_SID
- TWILIO_AUTH_TOKEN
- DEEPGRAM_API_KEY
- ELEVENLABS_API_KEY
- ELEVENLABS_VOICE_ID
- ANTHROPIC_API_KEY
deploy:
replicas: 3
Next Steps
- Enterprise Voice AI Guide → Complete architecture patterns
- Voice AI for Sales → Outbound call automation
- Voice AI Compliance → Recording and consent
Need help with Twilio + ElevenLabs integration?
At Cognilium, we've built production voice systems handling thousands of concurrent calls. Let's discuss your project →
Muhammad Mudassir
Founder & CEO, Cognilium AI