Designing a Non-Scripted Voice Interview Agent on Ultravox

8 min read
1,500 words
Muhammad Mudassir

Founder & CEO, Cognilium AI


TL;DR

Voice screening that adapts to the candidate instead of reading from a list — follow-up question generation, multi-language handling, and the prompt structure that keeps the agent on-task without sounding like a robot.
Tags: Ultravox AI, voice interview, candidate screening, non-scripted agents, conversational AI, real-time STT, voice agents, recruitment automation

A scripted voice screen reads from a list. A non-scripted one adapts to what the candidate just said. The first feels like a robocall; the second feels like a person who came prepared. Building the second on a voice agent platform requires structuring the prompt around goals rather than turns.

Goals, not scripts

The system prompt for an interview agent has three things: a role description, a list of goals to cover before ending the call, and a small library of soft-redirect phrases. There is no question script. The agent picks the next question based on (a) which goals are still uncovered, and (b) what the candidate just said.

Concretely, the goals list for a senior backend role looks like: "Confirm 5+ years of production backend experience. Get a specific story about a system they shipped at scale. Probe for distributed systems knowledge. Ask about a failure they handled. Confirm interest and availability." The agent covers these in any order the conversation suggests, with one follow-up allowed per goal.
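The role-plus-goals-plus-redirects structure can be sketched roughly like this. It is a minimal illustration, not Ultravox's actual API: the names (`ROLE`, `GOALS`, `REDIRECTS`, `build_system_prompt`) are all hypothetical, and the goals are the ones quoted above.

```python
# Hypothetical sketch of the goal-based system prompt described above.
# All names here are illustrative; only the goals text comes from the article.

ROLE = (
    "You are a friendly technical screener for a senior backend role. "
    "Hold a natural conversation; never read questions from a list."
)

GOALS = [
    "Confirm 5+ years of production backend experience.",
    "Get a specific story about a system they shipped at scale.",
    "Probe for distributed systems knowledge.",
    "Ask about a failure they handled.",
    "Confirm interest and availability.",
]

REDIRECTS = [
    "That's useful context -- coming back to your backend work for a moment...",
    "Good to know. I'd love to hear more about the technical side of that.",
]

def build_system_prompt(covered: set[int]) -> str:
    """Assemble the prompt from the role, the still-open goals, and the
    soft-redirect library. There is no question script anywhere in it."""
    open_goals = [g for i, g in enumerate(GOALS) if i not in covered]
    return "\n\n".join([
        ROLE,
        "Goals still to cover (any order, pick based on what the candidate "
        "just said; at most one follow-up per goal):\n"
        + "\n".join(f"- {g}" for g in open_goals),
        "If the candidate drifts off-topic, soft-redirect with phrasing like:\n"
        + "\n".join(f"- {r}" for r in REDIRECTS),
        "End the call politely once every goal is covered.",
    ])
```

Rebuilding the prompt per turn from the covered-goals set is what lets the agent pick the next question dynamically instead of walking a fixed order.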

The follow-up classifier

After every candidate response, a small classifier (a 3-class fine-tune on a few hundred labeled responses) predicts: {complete, vague, off-topic}. Complete → mark goal covered, move to next goal. Vague → emit one follow-up at most ("can you walk me through that decision in more detail?"). Off-topic → use a soft-redirect from the library back to the active goal.

The hard cap of one follow-up per goal is critical. Without it, the agent rabbit-holes on edge cases and runs out of time before covering the role basics. Coverage > depth, on a screening call.
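The dispatch on the three-class verdict, including the one-follow-up cap, fits in a few lines. This is a sketch of the control flow only: the fine-tuned classifier itself is outside the snippet, and `GoalTracker` and its method names are assumptions, not the article's actual code.

```python
from enum import Enum

class Verdict(Enum):
    """The three classes the fine-tuned response classifier predicts."""
    COMPLETE = "complete"
    VAGUE = "vague"
    OFF_TOPIC = "off-topic"

class GoalTracker:
    """Illustrative dispatch on the classifier verdict, with the hard cap
    of one follow-up per goal enforced in code rather than in the prompt."""

    def __init__(self) -> None:
        self.covered: set[int] = set()
        self.followup_used: set[int] = set()

    def next_action(self, goal: int, verdict: Verdict) -> str:
        if verdict is Verdict.COMPLETE:
            self.covered.add(goal)
            return "advance"            # goal covered, move to the next one
        if verdict is Verdict.OFF_TOPIC:
            return "soft_redirect"      # pull back toward the active goal
        # verdict is VAGUE
        if goal not in self.followup_used:
            self.followup_used.add(goal)
            return "follow_up"          # the one allowed follow-up
        self.covered.add(goal)          # still vague after the follow-up:
        return "advance"                # take what we have, keep moving
```

Note the last branch: a second vague answer marks the goal covered anyway, which is the "coverage over depth" trade-off made explicit.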

Latency on the voice path

  • VAD end-of-speech detection: ~150ms
  • Streaming STT: partial transcripts available before speaker stops; final ~300ms after
  • LLM (the goal-tracker + next-question generator): ~600ms first-token, runs concurrently with the rest
  • TTS first chunk: ~250ms — speaker hears the first phoneme this fast
  • Total perceived gap: ~1.2-1.5 sec from "candidate stops" to "agent starts speaking"
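As a back-of-envelope check, the components above sum as follows. The figures are the article's estimates; this snippet just does the arithmetic.

```python
# Latency budget components from the list above (all milliseconds).
budget = {
    "vad_end_of_speech": 150,   # VAD detects the candidate stopped
    "stt_finalization": 300,    # final transcript lands after they stop
    "llm_first_token": 600,     # goal-tracker + next-question generator
    "tts_first_chunk": 250,     # first audible phoneme
}

total_ms = sum(budget.values())
print(f"fully serial path: {total_ms} ms")  # 1300 ms

# The LLM overlaps the STT stage (it starts on partial transcripts), which
# claws back part of its 600 ms; network jitter adds some back. Hence the
# observed ~1.2-1.5 s band rather than one fixed number.
```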

Multi-language without forking the role definition

Ultravox handles language detection at the audio layer. The system prompt asks the model to respond in the candidate's language. The goals list is stored once in English and translated server-side per language at session start (cached). Adding a new language is a translation task, not a prompt-engineering one.
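The translate-once-per-language, cache-forever pattern looks roughly like this. It is a sketch: `translate()` is a placeholder for whatever MT service is actually used, and the function names are assumptions rather than Ultravox API calls.

```python
from functools import lru_cache

# Canonical goals list, stored once in English (abridged from the article).
GOALS_EN = [
    "Confirm 5+ years of production backend experience.",
    "Ask about a failure they handled.",
    "Confirm interest and availability.",
]

def translate(text: str, lang: str) -> str:
    """Placeholder for a real machine-translation call."""
    return text if lang == "en" else f"[{lang}] {text}"

@lru_cache(maxsize=None)
def goals_for_language(lang: str) -> tuple[str, ...]:
    """Translated server-side the first time a language appears;
    every later session for that language is a cache hit."""
    return tuple(translate(g, lang) for g in GOALS_EN)

def start_session(detected_lang: str) -> list[str]:
    # Called once per session, after the audio-layer language detection.
    return list(goals_for_language(detected_lang))
```

Because the role definition and goal semantics never fork, adding a language means reviewing one translated list, not maintaining a parallel prompt.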

What we measured

  • Coverage rate (all goals hit per call): 96.8% in production
  • Average call length: 8-12 minutes (vs. 30-45 minutes for a typical human screen)
  • Candidate sentiment: 4.4/5 average across 12 languages
  • Cost per screen: ~$15 (Ultravox + LLM) vs. ~$850 for an average human screen at the orgs we benchmarked
