TL;DR
Voice screening that adapts to the candidate instead of reading from a list: how we handle follow-up question generation, multi-language support, and the prompt structure that makes it work.
A scripted voice screen reads from a list. A non-scripted one adapts to what the candidate just said. The first feels like a robocall; the second feels like a person who came prepared. Building the second on a voice agent platform requires structuring the prompt around goals rather than turns.
Goals, not scripts
The system prompt for an interview agent has three things: a role description, a list of goals to cover before ending the call, and a small library of soft-redirect phrases. There is no question script. The agent picks the next question based on (a) which goals are still uncovered, and (b) what the candidate just said.
Concretely, the goals list for a senior backend role looks like: "Confirm 5+ years of production backend experience. Get a specific story about a system they shipped at scale. Probe for distributed systems knowledge. Ask about a failure they handled. Confirm interest and availability." The agent covers these in any order the conversation suggests, with one follow-up allowed per goal.
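Here is a minimal sketch of how that structure might be assembled into a system prompt. The Goal dataclass, the build_system_prompt helper, and the prompt wording are illustrative stand-ins, not the production prompt.

```python
# Illustrative only: the dataclass, helper names, and prompt wording are
# stand-ins for the real thing.
from dataclasses import dataclass

@dataclass
class Goal:
    id: str
    description: str
    covered: bool = False
    followups_used: int = 0

GOALS = [
    Goal("experience", "Confirm 5+ years of production backend experience."),
    Goal("scale_story", "Get a specific story about a system they shipped at scale."),
    Goal("distributed", "Probe for distributed systems knowledge."),
    Goal("failure", "Ask about a failure they handled."),
    Goal("logistics", "Confirm interest and availability."),
]

SOFT_REDIRECTS = [
    "Good to know. Coming back to {topic}, though...",
    "Before we run out of time, I'd love to hear about {topic}.",
]

def build_system_prompt(goals: list[Goal], redirects: list[str]) -> str:
    """Role description + open goals + redirect library. No question script."""
    open_goals = "\n".join(f"- {g.description}" for g in goals if not g.covered)
    redirect_lines = "\n".join(f"- {r}" for r in redirects)
    return (
        "You are a recruiter running a screening call for a senior backend role.\n"
        "Cover the goals below in whatever order the conversation suggests,\n"
        "asking at most one follow-up per goal.\n\n"
        f"Goals still to cover:\n{open_goals}\n\n"
        f"If the candidate drifts off-topic, use one of these redirects:\n{redirect_lines}"
    )
```

Because the prompt is rebuilt between turns, goals marked as covered simply drop out of the list the model sees.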
The follow-up classifier
After every candidate response, a small classifier (a 3-class fine-tune on a few hundred labeled responses) predicts: {complete, vague, off-topic}. Complete → mark goal covered, move to next goal. Vague → emit one follow-up at most ("can you walk me through that decision in more detail?"). Off-topic → use a soft-redirect from the library back to the active goal.
The hard cap of one follow-up per goal is critical. Without it, the agent rabbit-holes on edge cases and runs out of time before covering the role basics. Coverage > depth, on a screening call.
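A sketch of that dispatch logic, assuming the classifier is already wrapped in some classify() call that returns one of the three labels. The names here (GoalState, next_move, NEXT_GOAL) are illustrative, not the production code.

```python
# Sketch of the per-response dispatch, assuming some classify(response) call
# that wraps the 3-class fine-tune. All names here are illustrative.
from dataclasses import dataclass
from enum import Enum
import random

class Label(Enum):
    COMPLETE = "complete"
    VAGUE = "vague"
    OFF_TOPIC = "off_topic"

@dataclass
class GoalState:
    topic: str                  # e.g. "a failure they handled"
    covered: bool = False
    followups_used: int = 0

NEXT_GOAL = object()            # sentinel: pick the next uncovered goal

def next_move(goal: GoalState, label: Label, redirects: list[str]):
    """Decide what the agent does after each candidate response."""
    if label is Label.COMPLETE:
        goal.covered = True
        return NEXT_GOAL
    if label is Label.VAGUE and goal.followups_used == 0:
        goal.followups_used += 1        # hard cap: one follow-up per goal
        return "Can you walk me through that decision in more detail?"
    if label is Label.OFF_TOPIC:
        return random.choice(redirects).format(topic=goal.topic)
    # Vague, but the follow-up is already spent: take what we have and move on.
    goal.covered = True
    return NEXT_GOAL
```

The final branch is what enforces the cap: a second vague answer is accepted as-is rather than probed again.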
Latency on the voice path
- VAD end-of-speech detection: ~150ms
- Streaming STT: partial transcripts available before speaker stops; final ~300ms after
- LLM (the goal-tracker + next-question generator): ~600ms to first token, running concurrently with the rest (see the sketch after this list)
- TTS first chunk: ~250ms; the candidate hears the first phoneme that quickly
- Total perceived gap: ~1.2-1.5 sec from "candidate stops" to "agent starts speaking"
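The numbers above only add up to ~1.2-1.5 seconds because the stages overlap. A rough sketch of that overlap, with hypothetical streaming clients (stt, llm, tts, and play are stand-ins, not a real SDK):

```python
# The key move: TTS starts on the first complete sentence from the LLM,
# not after the full response. All client objects here are hypothetical.
async def respond(audio_turn, stt, llm, tts, play):
    transcript = await stt.finalize(audio_turn)        # final text ~300ms after end of speech
    buffer = ""
    async for token in llm.stream(transcript):         # ~600ms to first token
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):  # first sentence is ready
            await play(await tts.synthesize(buffer))   # ~250ms to first audio chunk
            buffer = ""
    if buffer.strip():                                 # flush any trailing partial sentence
        await play(await tts.synthesize(buffer))
```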
Multi-language without forking the role definition
Ultravox handles language detection at the audio layer. The system prompt asks the model to respond in the candidate's language. The goals list is stored once in English and translated server-side per language at session start (cached). Adding a new language is a translation task, not a prompt-engineering one.
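A sketch of that per-language cache. The translate() helper is a stand-in for whatever MT service or LLM call actually does the translation; the caching pattern is the point.

```python
# Goals live once, in English. Each new language costs one translation pass,
# cached for the lifetime of the process. translate() is a hypothetical stand-in.
from functools import lru_cache

GOALS_EN = (
    "Confirm 5+ years of production backend experience.",
    "Get a specific story about a system they shipped at scale.",
    "Probe for distributed systems knowledge.",
    "Ask about a failure they handled.",
    "Confirm interest and availability.",
)

def translate(text: str, target: str) -> str:
    """Stand-in for the real translation call (MT API or an LLM)."""
    raise NotImplementedError

@lru_cache(maxsize=64)                     # one cached entry per language code
def goals_for_language(lang: str) -> tuple[str, ...]:
    if lang == "en":
        return GOALS_EN
    return tuple(translate(text, target=lang) for text in GOALS_EN)
```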
What we measured
- Coverage rate (all goals hit per call): 96.8% in production
- Average call length: 8-12 minutes (vs. 30-45 minutes for a human screen)
- Candidate sentiment: 4.4/5 average across 12 languages
- Cost per screen: ~$15 (Ultravox + LLM) vs. an $850 average for a human screen at the orgs we benchmarked