Voice AI · Advanced · 2025-04-22 · 15 min read

Self-Hosted Voice AI: The Complete Pipecat + LiveKit + Ollama Stack

How to build a fully self-hosted voice AI agent with zero cloud dependencies. Pipecat orchestrates Faster-Whisper STT, Ollama LLM, and pyttsx3 TTS over LiveKit WebRTC.

Why Self-Host Your Voice AI?

Cloud voice AI services (Google Dialogflow, AWS Lex, Azure Bot Service) charge per minute, send call audio to third-party servers, and add latency from round-trips to remote data centers. For enterprise clients, especially in healthcare, legal, or finance, a fully self-hosted stack is non-negotiable.

The Full Stack

Our self-hosted voice AI pipeline:

  1. LiveKit: WebRTC server for real-time audio/video transport (self-hosted)
  2. Pipecat: AI voice pipeline orchestration framework
  3. Faster-Whisper: optimized Whisper STT (CTranslate2 backend, up to 4× faster than the original implementation)
  4. Ollama: local LLM inference server (LLaMA 3, Mistral, Phi-3)
  5. pyttsx3: Python TTS engine (offline, no API calls)
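
To make the Ollama item more concrete, here is a minimal sketch of talking to a local Ollama server directly, without Pipecat in between. It assumes Ollama's default port (11434) and its `/api/generate` endpoint, which streams newline-delimited JSON objects carrying a `response` text fragment and a `done` flag. The helper names (`build_generate_request`, `iter_response_text`, `stream_completion`) are illustrative and not part of any library.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> bytes:
    """Encode the JSON body for a streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the server marks the final chunk with done=true


def stream_completion(model: str, prompt: str) -> Iterator[str]:
    """Call a running Ollama server and yield text as it is generated."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    # The HTTP response object can be iterated line by line.
    with urllib.request.urlopen(request) as response:
        yield from iter_response_text(response)
```

With a server running, `for piece in stream_completion("llama3:8b", "Say hello"): print(piece, end="")` would print the reply as it is generated. Keeping the parser separate from the network call makes it easy to test without a live model.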

Architecture

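```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.services.ollama import OLLamaLLMService
from pipecat.services.whisper import WhisperSTTService
from pipecat.transports.services.livekit import LiveKitParams, LiveKitTransport


async def run_voice_agent(room_url: str, token: str):
    # LiveKit carries audio into and out of the room over WebRTC.
    transport = LiveKitTransport(
        url=room_url,
        token=token,
        params=LiveKitParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # Local speech-to-text (Faster-Whisper) and local LLM (Ollama).
    stt = WhisperSTTService(model="large-v3")
    # Pipecat's Ollama service uses Ollama's OpenAI-compatible API, served under /v1.
    llm = OLLamaLLMService(model="llama3:8b", base_url="http://localhost:11434/v1")

    # Frames flow top to bottom: audio in -> transcript -> LLM reply -> output.
    # The TTS stage (pyttsx3 in this stack) sits between llm and transport.output().
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        transport.output(),
    ])

    task = PipelineTask(pipeline)
    await task.run()
```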
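
The agent function takes a LiveKit access token as an argument. LiveKit tokens are JWTs signed with the server's API secret, and the official LiveKit server SDK is the usual way to create one. As a rough standard-library sketch of what such a token contains, assuming the documented claim layout (`iss` for the API key, `sub` for the participant identity, and a `video` grant naming the room), something like the following could work. The function name `make_livekit_token` and its parameters are illustrative.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_livekit_token(
    api_key: str,
    api_secret: str,
    identity: str,
    room: str,
    ttl_seconds: int = 3600,
    now: Optional[int] = None,
) -> str:
    """Build an HS256-signed JWT granting `identity` permission to join `room`."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,               # API key identifies the signer
        "sub": identity,              # participant identity
        "nbf": issued,                # not valid before
        "exp": issued + ttl_seconds,  # expiry
        "video": {"room": room, "roomJoin": True},
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        api_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The API key and secret are the pair configured on the self-hosted LiveKit server. A short `ttl_seconds` limits the damage if a token leaks.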

Latency Optimization

The biggest challenge in voice AI is latency. In natural conversation, a response gap much beyond ~300ms starts to feel like lag. Our production stack achieves 250–400ms end-to-end by:

  1. Faster-Whisper with int8 quantization: STT in ~80ms for 1–2 second utterances
  2. Streaming LLM responses: start TTS as the first tokens arrive rather than waiting for the full response
  3. Turn detection: Whisper's VAD detects end-of-speech without waiting for a silence timeout
  4. GPU acceleration: run Whisper on GPU for a 10× speedup over CPU
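
The streaming step is easiest to see in code. The sketch below is one possible way to cut a token stream into sentences so that each sentence can be handed to TTS as soon as it is complete. It is a simplified illustration, not the buffering Pipecat itself performs, and the name `sentences_from_tokens` is hypothetical. A sentence is treated as finished once terminal punctuation is followed by whitespace, which avoids splitting decimals such as "3.5".

```python
import re
from typing import Iterable, Iterator

# Terminal punctuation followed by whitespace marks a finished sentence.
_SENTENCE_BOUNDARY = re.compile(r"[.!?]+\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # One token may complete several sentences, so keep cutting.
        while True:
            match = _SENTENCE_BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hel", "lo the", "re. How", " are", " you? I am", " fine"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)  # each line could be sent to the TTS engine here
```

Because the function is a generator, the first sentence is available while the model is still producing the rest of the reply, which is where the latency saving comes from.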

Twilio Integration

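For telephone access, Twilio bridges calls into our LiveKit room via SIP. The helper below places an outbound call and points Twilio at a TwiML endpoint that connects the call to the room: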

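```python
import os

from twilio.rest import Client

# Credentials and URLs come from the environment; these variable names are illustrative.
TWILIO_SID = os.environ["TWILIO_SID"]
TWILIO_TOKEN = os.environ["TWILIO_TOKEN"]
TWILIO_NUMBER = os.environ["TWILIO_NUMBER"]
APP_URL = os.environ["APP_URL"]


def create_call_to_livekit(to_number: str, livekit_room: str) -> str:
    """Place an outbound call whose TwiML connects it to the given LiveKit room."""
    client = Client(TWILIO_SID, TWILIO_TOKEN)
    call = client.calls.create(
        to=to_number,
        from_=TWILIO_NUMBER,
        # Twilio fetches TwiML from this URL once the call is answered.
        url=f"{APP_URL}/twiml/connect/{livekit_room}",
    )
    return call.sid
```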
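
The `/twiml/connect/{livekit_room}` URL has to return TwiML that tells Twilio where to bridge the call. A minimal sketch of that response body, using TwiML's `<Dial>` and `<Sip>` verbs, might look like the following. The function name `build_connect_twiml`, the `sip_host` parameter, and the choice to put the room name in the user part of the SIP URI are all assumptions for illustration; the real URI depends on how the LiveKit SIP endpoint and its routing rules are configured.

```python
from xml.etree import ElementTree as ET


def build_connect_twiml(livekit_room: str, sip_host: str) -> str:
    """Return TwiML that dials a SIP URI for the given room (hypothetical URI scheme)."""
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    sip = ET.SubElement(dial, "Sip")
    # Assumption: the SIP endpoint routes on the user part of the URI.
    sip.text = f"sip:{livekit_room}@{sip_host}"
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


if __name__ == "__main__":
    print(build_connect_twiml("support-room", "sip.example.com"))
```

Whatever web framework serves this endpoint should return the string with an XML content type so Twilio parses it as TwiML.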

Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.