Series: Self-Hosted AI · Part 4 of 4
Voice Activity Detection: The Hidden Make-or-Break of Voice AI
VAD decides when the user is done speaking. Get it wrong and the agent interrupts or hangs. A deep dive into Silero VAD, energy thresholds, end-of-turn detection, and barge-in handling.
Why VAD is Where Voice AI Lives or Dies
Latency is what users feel, but VAD is what makes the conversation feel natural. End the turn too early and the bot interrupts; wait too long and it feels sluggish.
Out of 50 voice AI improvements we shipped at Hureka, 12 were VAD tuning.
Silero VAD: The Production Standard
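```python
import numpy as np
import torch

# Downloads the model on first call, then loads it from the local torch.hub cache.
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False,
)
(get_speech_timestamps, _, _, _, _) = utils


def is_speech(audio_chunk: np.ndarray, sample_rate: int = 16000) -> float:
    """Return probability that this ~30ms chunk contains speech (0-1)."""
    # Expects mono float samples in [-1, 1]. Check the chunk length your
    # Silero release accepts: recent versions require exactly 512 samples
    # (32 ms) at 16 kHz.
    tensor = torch.from_numpy(audio_chunk).float()
    with torch.no_grad():
        return model(tensor, sample_rate).item()
```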
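`is_speech` scores one fixed-size chunk at a time, so incoming audio has to be cut into frames first. A minimal sketch, assuming the input is raw 16-bit little-endian mono PCM bytes; `pcm16_frames` is a hypothetical helper, not part of Silero:

```python
import sys
from array import array
from typing import Iterator, List


def pcm16_frames(pcm: bytes, sample_rate: int = 16000,
                 frame_ms: int = 30) -> Iterator[List[float]]:
    """Yield fixed-size frames of floats in [-1, 1] from 16-bit mono PCM."""
    frame_samples = sample_rate * frame_ms // 1000    # 480 samples for 30 ms at 16 kHz
    samples = array("h")                               # signed 16-bit integers
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # drop a dangling odd byte
    if sys.byteorder == "big":
        samples.byteswap()                             # input is assumed little-endian
    # Trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_samples + 1, frame_samples):
        yield [s / 32768.0 for s in samples[start:start + frame_samples]]
```

Each frame can be wrapped with `np.asarray(frame, dtype=np.float32)` before it is passed to `is_speech`. In a live stream you would carry the leftover samples over to the next buffer instead of dropping them.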
End-of-Turn Detection State Machine
```python
class TurnDetector:
    def __init__(self, sample_rate=16000, frame_ms=30):
        self.sr = sample_rate
        self.frame_ms = frame_ms
        self.frame_size = int(sample_rate * frame_ms / 1000)
        self.speech_buffer = []
        self.silence_ms = 0
        self.speech_ms = 0
        self.state = "idle"

    def feed(self, audio_frame: np.ndarray) -> dict:
        p = is_speech(audio_frame, self.sr)
        is_voiced = p > 0.5

        if self.state == "idle":
            if is_voiced:
                self.state = "speaking"
                self.speech_ms = self.frame_ms
                self.silence_ms = 0  # clear silence left over from the previous turn
                return {"event": "speech_started"}

        elif self.state == "speaking":
            if is_voiced:
                self.speech_ms += self.frame_ms
                self.silence_ms = 0
            else:
                self.silence_ms += self.frame_ms
                if self.silence_ms >= 500:  # 500ms silence = turn end
                    self.state = "idle"
                    duration = self.speech_ms
                    self.speech_ms = 0
                    self.silence_ms = 0
                    return {"event": "turn_end", "duration_ms": duration}

        return {"event": None}
```
The Three Magic Numbers
| Parameter | Recommended | Why |
|---|---|---|
| speech_threshold | 0.5 | Higher = miss soft speakers; lower = false triggers |
| min_speech_ms | 250 | Filters single sharp noises (door, cough) |
| end_silence_ms | 500 | Below 400 ms cuts people off; above 700 ms feels sluggish |
These vary by language and accent. Tune on real users in your target market.
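To see how the three parameters interact, here is a minimal sketch of a detector that consumes per-frame speech probabilities directly (for example, the output of `is_speech`). The class name `ProbTurnDetector`, the `pending` state, and the choice to drop bursts shorter than `min_speech_ms` before any event fires are assumptions for illustration, not the production code described in this article:

```python
from typing import Optional


class ProbTurnDetector:
    """Turn detector driven by per-frame speech probabilities."""

    def __init__(self, speech_threshold: float = 0.5, min_speech_ms: int = 250,
                 end_silence_ms: int = 500, frame_ms: int = 30):
        self.speech_threshold = speech_threshold
        self.min_speech_ms = min_speech_ms
        self.end_silence_ms = end_silence_ms
        self.frame_ms = frame_ms
        self.state = "idle"  # idle -> pending -> speaking -> idle
        self.speech_ms = 0
        self.silence_ms = 0

    def feed(self, prob: float) -> Optional[dict]:
        voiced = prob > self.speech_threshold

        if self.state in ("idle", "pending"):
            if not voiced:
                # A burst shorter than min_speech_ms (door, cough) is discarded.
                self.state, self.speech_ms = "idle", 0
                return None
            self.state = "pending"
            self.speech_ms += self.frame_ms
            if self.speech_ms >= self.min_speech_ms:
                self.state, self.silence_ms = "speaking", 0
                return {"event": "speech_started"}
            return None

        # state == "speaking"
        if voiced:
            self.speech_ms += self.frame_ms
            self.silence_ms = 0
            return None
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.end_silence_ms:
            duration = self.speech_ms
            self.state, self.speech_ms, self.silence_ms = "idle", 0, 0
            return {"event": "turn_end", "duration_ms": duration}
        return None


def run(probs, **params):
    """Feed a sequence of probabilities and collect the non-empty events."""
    detector = ProbTurnDetector(**params)
    return [e for e in (detector.feed(p) for p in probs) if e]
```

Replaying the same recorded probability trace through `run` with different `end_silence_ms` values is a cheap way to see the trade-off: a 360 ms mid-sentence pause stays inside one turn at 500 ms but splits the utterance in two at 300 ms.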
Barge-In: Letting the User Interrupt
When TTS is playing, the user might interrupt. You MUST detect it and stop TTS instantly:
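```python
# tts_is_speaking, energy(), BARGE_ENERGY_THRESHOLD, tts_engine and llm_pipeline
# are provided by the surrounding application.
async def on_user_audio(audio_chunk):
    if tts_is_speaking:
        if is_speech(audio_chunk) > 0.7 and energy(audio_chunk) > BARGE_ENERGY_THRESHOLD:
            await tts_engine.interrupt()
            await llm_pipeline.cancel_current_generation()
```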
The energy gate prevents echo from your own TTS triggering false barge-ins.
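The barge-in handler leaves `energy` and `BARGE_ENERGY_THRESHOLD` undefined. One common choice is the RMS level of the chunk in dBFS. A minimal sketch, assuming float samples in [-1, 1]; the -35 dBFS threshold is a placeholder assumption to be calibrated against your own echo level, not a recommended value:

```python
import math
from typing import Sequence

BARGE_ENERGY_THRESHOLD = -35.0  # dBFS; placeholder, calibrate per device


def energy(audio_chunk: Sequence[float]) -> float:
    """Return the RMS level of a chunk in dB relative to full scale (RMS 1.0 = 0 dB)."""
    n = len(audio_chunk)
    if n == 0:
        return -math.inf
    mean_square = sum(float(x) * float(x) for x in audio_chunk) / n
    if mean_square <= 0.0:
        return -math.inf  # digital silence
    return 10.0 * math.log10(mean_square)  # equals 20 * log10(rms)
```

The function iterates over its input, so it also accepts a NumPy array, although a vectorized `np.sqrt(np.mean(chunk ** 2))` would be faster on a hot path. One way to pick the threshold is to record the microphone while only TTS is playing and set the gate a few dB above the loudest echo you observe.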
Common VAD Failures
1. Background TV/radio: continuous human voice, so VAD never sees silence. Mitigate with active noise suppression upstream (RNNoise, WebRTC NS).
2. Slow speakers / elderly users: they pause mid-sentence. Increase end_silence_ms to 800 ms for that user segment.
3. Code-switching speakers: Silero can perform worse on non-English and mixed-language speech. Test on the languages your users actually speak.
4. High-latency mic: anything over 100 ms of mic latency feels broken. Measure end-to-end.
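The slow-speaker mitigation can be expressed as per-segment overrides of the defaults from the parameter table. A minimal sketch; the segment name and the `params_for` helper are hypothetical, and only the default values and the 800 ms slow-speaker value come from the text above:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VadParams:
    speech_threshold: float = 0.5
    min_speech_ms: int = 250
    end_silence_ms: int = 500


DEFAULT_PARAMS = VadParams()

# Hypothetical user segments; each entry overrides only what differs.
SEGMENT_OVERRIDES = {
    "slow_speaker": {"end_silence_ms": 800},
}


def params_for(segment: str) -> VadParams:
    """Return VAD parameters for a user segment, falling back to the defaults."""
    return replace(DEFAULT_PARAMS, **SEGMENT_OVERRIDES.get(segment, {}))
```

Keeping the overrides as data rather than branching logic makes it easy to add a segment after a round of tuning on real users.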