Series: Self-Hosted AI · Part 1 of 4
Self-Hosted Voice AI: The Complete Pipecat + LiveKit + Ollama Stack
How to build a fully self-hosted voice AI agent with zero cloud dependencies. Pipecat orchestrates Faster-Whisper STT, Ollama LLM, and pyttsx3 TTS over LiveKit WebRTC.
Why Self-Host Your Voice AI?
Cloud voice AI services (Google Dialogflow, AWS Lex, Azure Bot Service) charge per minute, have data privacy implications, and add latency from round-trips to remote servers. For enterprise clients — especially in healthcare, legal, or finance — a fully self-hosted stack is non-negotiable.
The Full Stack
Our self-hosted voice AI pipeline:
- 1LiveKit — WebRTC server for real-time audio/video transport (self-hosted)
- 2Pipecat — AI voice pipeline orchestration framework
- 3Faster-Whisper — Optimized Whisper STT (CTranslate2 backend, 4× faster than original)
- 4Ollama — Local LLM inference server (LLaMA 3, Mistral, Phi-3)
- 5pyttsx3 — Python TTS engine (offline, no API calls)
Architecture
import asyncio
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.services.whisper import WhisperSTTService
from pipecat.services.ollama import OLlamaLLMService
from pipecat.transports.services.livekit import LiveKitTransportasync def run_voice_agent(room_url: str, token: str): transport = LiveKitTransport( url=room_url, token=token, params=LiveKitParams(audio_in_enabled=True, audio_out_enabled=True) )
stt = WhisperSTTService(model="large-v3") llm = OLlamaLLMService(model="llama3:8b", base_url="http://localhost:11434")
pipeline = Pipeline([ transport.input(), stt, llm, transport.output(), ])
task = PipelineTask(pipeline) await task.run() ```
Latency Optimization
The biggest challenge in voice AI is latency. Human conversation tolerance is ~300ms. Our production stack achieves 250–400ms end-to-end by:
- 1Faster-Whisper with int8 quantization — STT in ~80ms for 1–2 second utterances
- 2Streaming LLM responses — Start TTS as first tokens arrive, don't wait for full response
- 3Turn detection — Whisper's VAD detects end-of-speech without waiting for silence timeout
- 4GPU acceleration — Run Whisper on GPU for 10× speedup over CPU
Twilio Integration
For telephone access, Twilio forwards calls to our LiveKit room via SIP:
from twilio.rest import Clientdef create_call_to_livekit(to_number: str, livekit_room: str): client = Client(TWILIO_SID, TWILIO_TOKEN) call = client.calls.create( to=to_number, from_=TWILIO_NUMBER, url=f"{APP_URL}/twiml/connect/{livekit_room}" ) return call.sid ```