TL;DR
Voice screening that adapts to the candidate instead of reading from a list: how we handle follow-up question generation, multi-language support, and the prompt structure that makes it work.
A scripted voice screen reads from a list. A non-scripted one adapts to what the candidate just said. The first feels like a robocall; the second feels like a person who came prepared. Building the second on a voice agent platform requires structuring the prompt around goals rather than turns.
Goals, not scripts
The system prompt for an interview agent has three things: a role description, a list of goals to cover before ending the call, and a small library of soft-redirect phrases. There is no question script. The agent picks the next question based on (a) which goals are still uncovered, and (b) what the candidate just said.
Concretely, the goals list for a senior backend role looks like: "Confirm 5+ years of production backend experience. Get a specific story about a system they shipped at scale. Probe for distributed systems knowledge. Ask about a failure they handled. Confirm interest and availability." The agent covers these in any order the conversation suggests, with one follow-up allowed per goal.
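Here is a minimal sketch of how that structure might be assembled into a system prompt. The Goal dataclass, the build_system_prompt helper, and the prompt wording are illustrative stand-ins, not the production prompt.

```python
# Illustrative only: the dataclass, helper names, and prompt wording are
# stand-ins for the real thing.
from dataclasses import dataclass

@dataclass
class Goal:
    id: str
    description: str
    covered: bool = False
    followups_used: int = 0

GOALS = [
    Goal("experience", "Confirm 5+ years of production backend experience."),
    Goal("scale_story", "Get a specific story about a system they shipped at scale."),
    Goal("distributed", "Probe for distributed systems knowledge."),
    Goal("failure", "Ask about a failure they handled."),
    Goal("logistics", "Confirm interest and availability."),
]

SOFT_REDIRECTS = [
    "Good to know. Coming back to {topic}, though...",
    "Before we run out of time, I'd love to hear about {topic}.",
]

def build_system_prompt(goals: list[Goal], redirects: list[str]) -> str:
    """Role description + open goals + redirect library. No question script."""
    open_goals = "\n".join(f"- {g.description}" for g in goals if not g.covered)
    redirect_lines = "\n".join(f"- {r}" for r in redirects)
    return (
        "You are a recruiter running a screening call for a senior backend role.\n"
        "Cover the goals below in whatever order the conversation suggests,\n"
        "asking at most one follow-up per goal.\n\n"
        f"Goals still to cover:\n{open_goals}\n\n"
        f"If the candidate drifts off-topic, use one of these redirects:\n{redirect_lines}"
    )
```

Because the prompt is rebuilt between turns, goals marked as covered simply drop out of the list the model sees.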
The follow-up classifier
After every candidate response, a small classifier (a 3-class fine-tune on a few hundred labeled responses) predicts: {complete, vague, off-topic}. Complete → mark goal covered, move to next goal. Vague → emit one follow-up at most ("can you walk me through that decision in more detail?"). Off-topic → use a soft-redirect from the library back to the active goal.
The hard cap of one follow-up per goal is critical. Without it, the agent rabbit-holes on edge cases and runs out of time before covering the role basics. Coverage > depth, on a screening call.
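A sketch of that dispatch logic, assuming the classifier is already wrapped in some classify() call that returns one of the three labels. The names here (GoalState, next_move, NEXT_GOAL) are illustrative, not the production code.

```python
# Sketch of the per-response dispatch, assuming some classify(response) call
# that wraps the 3-class fine-tune. All names here are illustrative.
from dataclasses import dataclass
from enum import Enum
import random

class Label(Enum):
    COMPLETE = "complete"
    VAGUE = "vague"
    OFF_TOPIC = "off_topic"

@dataclass
class GoalState:
    topic: str                  # e.g. "a failure they handled"
    covered: bool = False
    followups_used: int = 0

NEXT_GOAL = object()            # sentinel: pick the next uncovered goal

def next_move(goal: GoalState, label: Label, redirects: list[str]):
    """Decide what the agent does after each candidate response."""
    if label is Label.COMPLETE:
        goal.covered = True
        return NEXT_GOAL
    if label is Label.VAGUE and goal.followups_used == 0:
        goal.followups_used += 1        # hard cap: one follow-up per goal
        return "Can you walk me through that decision in more detail?"
    if label is Label.OFF_TOPIC:
        return random.choice(redirects).format(topic=goal.topic)
    # Vague, but the follow-up is already spent: take what we have and move on.
    goal.covered = True
    return NEXT_GOAL
```

The final branch is what enforces the cap: a second vague answer is accepted as-is rather than probed again.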
Latency on the voice path
- VAD end-of-speech detection: ~150ms
- Streaming STT: partial transcripts available before speaker stops; final ~300ms after
- LLM (the goal-tracker + next-question generator): ~600ms to first token, running concurrently with the rest (see the sketch after this list)
- TTS first chunk: ~250ms; the candidate hears the first phoneme that quickly
- Total perceived gap: ~1.2-1.5 sec from "candidate stops" to "agent starts speaking"
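The numbers above only add up to ~1.2-1.5 seconds because the stages overlap. A rough sketch of that overlap, with hypothetical streaming clients (stt, llm, tts, and play are stand-ins, not a real SDK):

```python
# The key move: TTS starts on the first complete sentence from the LLM,
# not after the full response. All client objects here are hypothetical.
async def respond(audio_turn, stt, llm, tts, play):
    transcript = await stt.finalize(audio_turn)        # final text ~300ms after end of speech
    buffer = ""
    async for token in llm.stream(transcript):         # ~600ms to first token
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):  # first sentence is ready
            await play(await tts.synthesize(buffer))   # ~250ms to first audio chunk
            buffer = ""
    if buffer.strip():                                 # flush any trailing partial sentence
        await play(await tts.synthesize(buffer))
```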
Multi-language without forking the role definition
Ultravox handles language detection at the audio layer. The system prompt asks the model to respond in the candidate's language. The goals list is stored once in English and translated server-side per language at session start (cached). Adding a new language is a translation task, not a prompt-engineering one.
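A sketch of that per-language cache. The translate() helper is a stand-in for whatever MT service or LLM call actually does the translation; the caching pattern is the point.

```python
# Goals live once, in English. Each new language costs one translation pass,
# cached for the lifetime of the process. translate() is a hypothetical stand-in.
from functools import lru_cache

GOALS_EN = (
    "Confirm 5+ years of production backend experience.",
    "Get a specific story about a system they shipped at scale.",
    "Probe for distributed systems knowledge.",
    "Ask about a failure they handled.",
    "Confirm interest and availability.",
)

def translate(text: str, target: str) -> str:
    """Stand-in for the real translation call (MT API or an LLM)."""
    raise NotImplementedError

@lru_cache(maxsize=64)                     # one cached entry per language code
def goals_for_language(lang: str) -> tuple[str, ...]:
    if lang == "en":
        return GOALS_EN
    return tuple(translate(text, target=lang) for text in GOALS_EN)
```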
What we measured
- Coverage rate (all goals hit per call): 96.8% in production
- Average call length: 8-12 minutes (vs. 30-45 minutes for a human screen)
- Candidate sentiment: 4.4/5 average across 12 languages
- Cost per screen: ~$15 (Ultravox + LLM) vs. an $850 average for a human screen at the orgs we benchmarked