Voice AI · Advanced · 2026-04-12 · 10 min read

Voice Activity Detection: The Hidden Make-or-Break of Voice AI

VAD decides when the user is done speaking. Get it wrong and the agent interrupts or hangs. A deep dive into Silero VAD, energy thresholds, end-of-turn detection, and barge-in handling.

Why VAD is Where Voice AI Lives or Dies

Latency is what users feel, but VAD is what makes the conversation feel natural. Get end-of-turn detection wrong and the bot interrupts. Wait too long and it feels sluggish.

Out of 50 voice AI improvements we shipped at Hureka, 12 were VAD tuning.

Silero VAD: The Production Standard

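```python
import numpy as np
import torch

# Downloads the model on first use, then loads it from the local torch.hub cache.
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False,
)
(get_speech_timestamps, _, _, _, _) = utils


def is_speech(audio_chunk: np.ndarray, sample_rate: int = 16000) -> float:
    """Return probability that this short (about 30ms) chunk contains speech (0-1)."""
    # Expects mono float32 samples in [-1, 1]. Recent Silero VAD releases accept
    # only fixed 512-sample chunks at 16 kHz (about 32 ms), so match the frame
    # size to the model version you load.
    tensor = torch.from_numpy(audio_chunk).float()
    with torch.no_grad():  # inference only, no gradients needed
        return float(model(tensor, sample_rate))
```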
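Silero scores one fixed-size chunk at a time, so a streaming pipeline has to slice incoming audio into frames before calling the model. A minimal sketch of that slicing, assuming 16 kHz mono float32 audio; `frame_generator` is a hypothetical helper, not part of Silero:

```python
import numpy as np


def frame_generator(audio: np.ndarray, sample_rate: int = 16000, frame_ms: int = 30):
    """Yield consecutive fixed-size frames; the trailing partial frame is dropped."""
    frame_size = int(sample_rate * frame_ms / 1000)  # 480 samples at 16 kHz / 30 ms
    for start in range(0, len(audio) - frame_size + 1, frame_size):
        yield audio[start:start + frame_size]


one_second = np.zeros(16000, dtype=np.float32)
frames = list(frame_generator(one_second))
print(len(frames), len(frames[0]))  # 33 480
```

Dropping the partial frame keeps every chunk the same length, which is what a fixed-window model expects. In a live stream you would carry the leftover samples into the next buffer instead of discarding them.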

End-of-Turn Detection State Machine

```python
class TurnDetector:
    def __init__(self, sample_rate=16000, frame_ms=30):
        self.sr = sample_rate
        self.frame_ms = frame_ms
        self.frame_size = int(sample_rate * frame_ms / 1000)
        self.speech_buffer = []
        self.silence_ms = 0
        self.speech_ms = 0
        self.state = "idle"

    def feed(self, audio_frame: np.ndarray) -> dict:
        """Consume one frame and return the event it triggers, if any."""
        p = is_speech(audio_frame, self.sr)
        is_voiced = p > 0.5

        if self.state == "idle":
            if is_voiced:
                self.state = "speaking"
                self.speech_ms = self.frame_ms
                self.silence_ms = 0  # do not carry silence over from the last turn
                return {"event": "speech_started"}

        elif self.state == "speaking":
            if is_voiced:
                self.speech_ms += self.frame_ms
                self.silence_ms = 0
            else:
                self.silence_ms += self.frame_ms
                if self.silence_ms >= 500:  # 500ms silence = turn end
                    self.state = "idle"
                    duration = self.speech_ms
                    self.speech_ms = 0
                    self.silence_ms = 0
                    return {"event": "turn_end", "duration_ms": duration}

        return {"event": None}
```

The Three Magic Numbers

| Parameter | Recommended | Why |
| --- | --- | --- |
| speech_threshold | 0.5 | Higher = miss soft speakers; lower = false triggers |
| min_speech_ms | 250 | Filters single sharp noises (door, cough) |
| end_silence_ms | 500 | < 400 cuts people off; > 700 feels sluggish |

These vary by language and accent. Tune on real users in your target market.
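To see how the three numbers interact, it helps to run them over a sequence of per-frame probabilities without loading a model. The sketch below is one possible way to wire them together, not the tuning harness used at Hureka; `VadParams` and `detect_turns` are hypothetical names, and the probabilities are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadParams:
    speech_threshold: float = 0.5
    min_speech_ms: int = 250
    end_silence_ms: int = 500
    frame_ms: int = 30


def detect_turns(probs, params=VadParams()):
    """Return the speech duration (ms) of each turn found in per-frame probabilities."""
    turns = []
    speech_ms = 0
    silence_ms = 0
    for p in probs:
        if p > params.speech_threshold:
            speech_ms += params.frame_ms
            silence_ms = 0
        elif speech_ms > 0:
            silence_ms += params.frame_ms
            if silence_ms >= params.end_silence_ms:
                # Turn is over; keep it only if it was long enough to be speech.
                if speech_ms >= params.min_speech_ms:
                    turns.append(speech_ms)
                speech_ms = 0
                silence_ms = 0
    return turns


# A 60 ms noise burst, a pause, then 600 ms of speech followed by silence.
probs = [0.9] * 2 + [0.1] * 20 + [0.9] * 20 + [0.1] * 20
print(detect_turns(probs))                              # [600]
print(detect_turns(probs, VadParams(min_speech_ms=0)))  # [60, 600]
```

With the default `min_speech_ms` the 60 ms burst is discarded; setting it to 0 lets the burst through as a turn of its own, which is the false trigger the parameter exists to filter.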

Barge-In: Letting the User Interrupt

When TTS is playing, the user might interrupt. You MUST detect it and stop TTS instantly:

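```python
async def on_user_audio(audio_chunk):
    # tts_is_speaking, tts_engine, llm_pipeline, energy() and
    # BARGE_ENERGY_THRESHOLD are provided by the surrounding application.
    if tts_is_speaking:
        # Require both a confident VAD score and enough signal energy.
        if is_speech(audio_chunk) > 0.7 and energy(audio_chunk) > BARGE_ENERGY_THRESHOLD:
            await tts_engine.interrupt()
            await llm_pipeline.cancel_current_generation()
```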

The energy gate prevents echo from your own TTS triggering false barge-ins.
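The barge-in handler leaves `energy` and `BARGE_ENERGY_THRESHOLD` undefined. One common choice is the RMS level in dBFS, sketched below under the assumption that chunks are float samples in [-1, 1]. The -30 dBFS threshold is a placeholder, not a recommendation from this post; calibrate it against the echo level you actually measure on your devices.

```python
import numpy as np

BARGE_ENERGY_THRESHOLD = -30.0  # dBFS; assumed placeholder, calibrate per device


def energy(audio_chunk: np.ndarray) -> float:
    """RMS level of a float chunk in [-1, 1], in dB relative to full scale."""
    rms = np.sqrt(np.mean(np.square(audio_chunk, dtype=np.float64)))
    return float(20 * np.log10(max(rms, 1e-10)))  # floor avoids log10(0) on silence


t = np.arange(480) / 16000                    # one 30 ms frame at 16 kHz
near_speech = 0.5 * np.sin(2 * np.pi * 440 * t)   # loud, close to the mic
faint_echo = 0.01 * np.sin(2 * np.pi * 440 * t)   # quiet residual echo

print(energy(near_speech) > BARGE_ENERGY_THRESHOLD)  # True
print(energy(faint_echo) > BARGE_ENERGY_THRESHOLD)   # False
```

An energy gate only helps when the echo is clearly quieter than the user. On speakerphone-style setups where that does not hold, acoustic echo cancellation upstream is the usual complement.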

Common VAD Failures

1. Background TV/radio: Continuous human voice; VAD never sees silence. Mitigate with active noise suppression upstream (RNNoise, WebRTC NS).
2. Slow speakers / elderly users: They pause mid-sentence. Increase end_silence_ms to 800ms for that segment.
3. Code-switching speakers: Silero performs worse on non-English. Test specifically.
4. High-latency mic: Anything > 100ms of mic latency feels broken. Measure end-to-end.
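The per-segment override from item 2 can be as simple as a lookup table. A minimal sketch: the segment names and `end_silence_for` are hypothetical, and only the 500 ms and 800 ms values come from this post.

```python
# Hypothetical user segments mapped to an end-of-turn silence window (ms).
END_SILENCE_MS_BY_SEGMENT = {
    "default": 500,
    "slow_speaker": 800,
}


def end_silence_for(segment: str) -> int:
    """Return the silence window for a segment, falling back to the default."""
    return END_SILENCE_MS_BY_SEGMENT.get(segment, END_SILENCE_MS_BY_SEGMENT["default"])


print(end_silence_for("slow_speaker"))  # 800
print(end_silence_for("unknown"))       # 500
```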
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.