Twilio handles the phone call. ElevenLabs makes it sound human. Together, they're the foundation of modern voice AI. But connecting them isn't plug-and-play—you need streaming audio, proper encoding, and latency optimization. This guide covers the complete integration with production-ready code.
Why Twilio + ElevenLabs?
Twilio provides programmable telephony—the ability to make and receive phone calls via API. ElevenLabs provides the most natural-sounding text-to-speech available. Combined with an LLM for conversation and STT for speech recognition, they form the voice AI stack that powers production systems handling millions of calls.
1. Architecture Overview
The pipeline looks like this: a caller dials your Twilio number, Twilio opens a Media Streams WebSocket to your server carrying mulaw 8kHz audio, Deepgram transcribes the stream, Claude generates a reply, and ElevenLabs synthesizes it back into mulaw audio that you push down the same WebSocket to the caller.
2. Prerequisites
# Python 3.9+
pip install twilio fastapi uvicorn websockets httpx deepgram-sdk anthropic
# Accounts needed:
# - Twilio: twilio.com (phone number + Media Streams)
# - ElevenLabs: elevenlabs.io (API key + voice ID)
# - Deepgram: deepgram.com (API key)
# - Anthropic: anthropic.com (API key)
Environment Variables
export TWILIO_ACCOUNT_SID="your_account_sid"
export TWILIO_AUTH_TOKEN="your_auth_token"
export TWILIO_PHONE_NUMBER="+1234567890"
export DEEPGRAM_API_KEY="your_deepgram_key"
export ELEVENLABS_API_KEY="your_elevenlabs_key"
export ELEVENLABS_VOICE_ID="your_voice_id"
export ANTHROPIC_API_KEY="your_anthropic_key"
3. Step 1: Twilio Setup
Buy a Phone Number
import os
from twilio.rest import Client

TWILIO_ACCOUNT_SID = os.environ["TWILIO_ACCOUNT_SID"]
TWILIO_AUTH_TOKEN = os.environ["TWILIO_AUTH_TOKEN"]

client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

# Purchase the number and point its voice webhook at your server
number = client.incoming_phone_numbers.create(
    phone_number="+1234567890",
    voice_url="https://your-server.com/voice",
    voice_method="POST",
)
print(f"Phone number: {number.phone_number}")
Configure Webhook
In Twilio Console:
- Go to Phone Numbers → Manage → Active Numbers
- Click your number
- Set Voice Configuration: Webhook → https://your-server.com/voice
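If the number already exists, you can set the same webhook from code instead of the Console, using the IncomingPhoneNumbers update API (the sid comes from the purchase step above):
# Update the voice webhook on an existing number
client.incoming_phone_numbers(number.sid).update(
    voice_url="https://your-server.com/voice",
    voice_method="POST",
)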
4. Step 2: ElevenLabs Configuration
Select a Voice
import os
import httpx

ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]

async def list_voices():
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.elevenlabs.io/v1/voices",
            headers={"xi-api-key": ELEVENLABS_API_KEY},
        )
        voices = response.json()["voices"]
        for voice in voices:
            print(f"{voice['voice_id']}: {voice['name']}")

# Recommended for phone calls:
# - "Rachel" (professional, clear)
# - "Josh" (conversational, warm)
Test TTS
ELEVENLABS_VOICE_ID = os.environ["ELEVENLABS_VOICE_ID"]

async def test_tts(text: str) -> bytes:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVENLABS_VOICE_ID}",
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "text": text,
                "model_id": "eleven_turbo_v2",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.75,
                },
            },
        )
        return response.content
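To sanity-check the voice before wiring up telephony, run the helper and save the result to disk (ElevenLabs returns MP3 audio by default):
import asyncio

# Save a test utterance and play it in any audio player
audio = asyncio.run(test_tts("Hello! This is a test of the voice pipeline."))
with open("test.mp3", "wb") as f:
    f.write(audio)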
5. Step 3: Webhook Server
from fastapi import FastAPI, Request, Response
from twilio.twiml.voice_response import VoiceResponse, Connect, Stream

app = FastAPI()

@app.post("/voice")
async def handle_incoming_call(request: Request):
    form = await request.form()  # Twilio posts call details as form data
    response = VoiceResponse()
    response.say("Hello! I'm connecting you now.", voice="alice")
    connect = Connect()
    stream = Stream(url="wss://your-server.com/stream")
    stream.parameter(name="caller_id", value=form.get("From", "unknown"))
    connect.append(stream)
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")
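For reference, the handler above returns TwiML along these lines (the caller_id value here is a hypothetical example; it varies per call):
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="alice">Hello! I'm connecting you now.</Say>
  <Connect>
    <Stream url="wss://your-server.com/stream">
      <Parameter name="caller_id" value="+15551234567" />
    </Stream>
  </Connect>
</Response>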
6. Step 4: WebSocket Handler
import asyncio
import base64
import json
import os

from anthropic import AsyncAnthropic
from deepgram import Deepgram

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]

# Note: this targets the deepgram-sdk v2 interface (pin deepgram-sdk<3)
deepgram = Deepgram(DEEPGRAM_API_KEY)
anthropic_client = AsyncAnthropic(api_key=ANTHROPIC_API_KEY)

class CallHandler:
    def __init__(self, websocket):
        self.websocket = websocket
        self.stream_sid = None
        self.conversation_history = []

    async def handle(self):
        # Open a live transcription socket matching Twilio's audio format
        dg_connection = await deepgram.transcription.live({
            "encoding": "mulaw",
            "sample_rate": 8000,
            "channels": 1,
            "model": "nova-2",
            "punctuate": True,
            "interim_results": True,
        })
        dg_connection.registerHandler(
            dg_connection.event.TRANSCRIPT_RECEIVED,
            self.handle_transcript,
        )
        try:
            async for message in self.websocket:
                data = json.loads(message)
                if data["event"] == "start":
                    self.stream_sid = data["start"]["streamSid"]
                elif data["event"] == "media":
                    # Twilio sends base64-encoded mulaw frames
                    audio = base64.b64decode(data["media"]["payload"])
                    dg_connection.send(audio)
                elif data["event"] == "stop":
                    break
        finally:
            await dg_connection.finish()

    async def handle_transcript(self, transcript):
        if transcript.get("is_final"):
            text = transcript["channel"]["alternatives"][0]["transcript"]
            if text.strip():
                response_text = await self.generate_response(text)
                await self.speak(response_text)  # speak() wraps stream_tts from Step 5

    async def generate_response(self, user_input: str) -> str:
        self.conversation_history.append({"role": "user", "content": user_input})
        response = await anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            system="You are a helpful voice assistant. Keep responses brief, 1-3 sentences.",
            messages=self.conversation_history,
        )
        assistant_message = response.content[0].text
        self.conversation_history.append({"role": "assistant", "content": assistant_message})
        return assistant_message
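One piece the handler needs is a server to run on. A minimal sketch, assuming the CallHandler above, that serves it with the websockets library on port 8765 (your reverse proxy would route wss://your-server.com/stream to it):
import asyncio
import websockets

async def stream_endpoint(websocket, path=None):
    # `path` is passed by older websockets versions; newer ones omit it
    handler = CallHandler(websocket)
    await handler.handle()

async def main():
    async with websockets.serve(stream_endpoint, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())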
7. Step 5: Streaming TTS Integration
For lower latency, stream TTS as it generates:
async def stream_tts(self, text: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVENLABS_VOICE_ID}/stream",
            params={"output_format": "ulaw_8000"},  # direct mulaw output, no conversion
            headers={
                "xi-api-key": ELEVENLABS_API_KEY,
                "Content-Type": "application/json",
            },
            json={
                "text": text,
                "model_id": "eleven_turbo_v2",
            },
        ) as response:
            # 160 bytes of 8kHz mulaw is 20ms of audio, one Twilio media frame
            async for chunk in response.aiter_bytes(chunk_size=160):
                message = {
                    "event": "media",
                    "streamSid": self.stream_sid,
                    "media": {
                        "payload": base64.b64encode(chunk).decode(),
                    },
                }
                await self.websocket.send(json.dumps(message))
Key optimization: ElevenLabs supports ulaw_8000 output (passed as a query parameter on the stream endpoint), so the audio needs no conversion before it goes back to Twilio.
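To tie this into the interruption handling in Step 6 (next section), a speak() wrapper can set the speaking flag and append a Twilio mark message, which Twilio echoes back once playback reaches that point. A sketch (the mark name is arbitrary):
async def speak(self, text: str):
    self.is_speaking = True
    self.interrupt_requested = False
    await self.stream_tts(text)
    # Ask Twilio to notify us when the buffered audio has finished playing
    mark = {
        "event": "mark",
        "streamSid": self.stream_sid,
        "mark": {"name": "end_of_response"},
    }
    await self.websocket.send(json.dumps(mark))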
8. Step 6: Complete Call Flow
class ProductionCallHandler:
    def __init__(self, websocket):
        self.websocket = websocket
        self.stream_sid = None
        self.is_speaking = False
        self.interrupt_requested = False

    async def handle_transcript(self, transcript):
        if transcript.get("is_final"):
            text = transcript["channel"]["alternatives"][0]["transcript"].strip()
            if not text:
                return
            # Barge-in: if the caller talks over us, cut our audio off
            if self.is_speaking:
                self.interrupt_requested = True
                await self.stop_audio()
            await self.process_and_respond(text)

    async def stop_audio(self):
        # Twilio's "clear" event flushes any audio still buffered for playback
        clear_message = {
            "event": "clear",
            "streamSid": self.stream_sid,
        }
        await self.websocket.send(json.dumps(clear_message))
        self.is_speaking = False
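One piece this flow still needs is resetting is_speaking when playback actually finishes. Twilio echoes each mark you send once the audio queued before it has played, so the media loop from Step 4 can watch for it. A sketch of that extra branch as a method:
async def handle_mark(self, data: dict):
    # Called from the media loop when data["event"] == "mark";
    # Twilio echoes our "end_of_response" mark once playback completes
    if data["mark"]["name"] == "end_of_response":
        self.is_speaking = False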
9. Latency Optimization
Optimization Checklist
| Technique | Latency Saved | Implementation |
|---|---|---|
| Use eleven_turbo_v2 | 100-200ms | Model selection |
| Use ulaw_8000 output | 50-100ms | No conversion |
| Stream TTS | 200-400ms | Async streaming |
| Use Deepgram Nova-2 | 50-100ms | Faster STT |
| Claude Haiku | 100-200ms | Faster LLM |
| Regional endpoints | 20-50ms | Closest region |
Latency Monitoring
import time

class LatencyTracker:
    def __init__(self):
        self.metrics = []

    async def timed_operation(self, name: str, coro):
        start = time.perf_counter()
        result = await coro
        elapsed = (time.perf_counter() - start) * 1000
        self.metrics.append({"operation": name, "latency_ms": elapsed})
        print(f"{name}: {elapsed:.0f}ms")
        return result
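Usage is a matter of wrapping each awaited stage, for example around the handlers from Steps 4 and 5:
tracker = LatencyTracker()

async def respond(handler, text: str):
    # Time each stage of the pipeline independently
    reply = await tracker.timed_operation("llm", handler.generate_response(text))
    await tracker.timed_operation("tts", handler.stream_tts(reply))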
10. Production Deployment
Dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Deploy with Docker Compose (e.g., on AWS)
version: "3.8"
services:
voice-ai:
build: .
ports:
- "8000:8000"
environment:
- TWILIO_ACCOUNT_SID
- TWILIO_AUTH_TOKEN
- DEEPGRAM_API_KEY
- ELEVENLABS_API_KEY
- ELEVENLABS_VOICE_ID
- ANTHROPIC_API_KEY
deploy:
replicas: 3
Next Steps
- Enterprise Voice AI Guide → Complete architecture patterns
- Voice AI for Sales → Outbound call automation
- Voice AI Compliance → Recording and consent
Need help with Twilio + ElevenLabs integration?
At Cognilium, we've built production voice systems handling thousands of concurrent calls. Let's discuss your project →
Muhammad Mudassir
Founder & CEO, Cognilium AI