Series: Self-Hosted AI · Part 3 of 4
Ollama in Production: GPU Sizing, Concurrent Requests & Model Management
A complete guide to running Ollama in production. GPU selection, concurrent request handling, model warmup, quantization choices, and the gotchas that take down hobby setups when real traffic hits.
When Ollama Is the Right Choice
- You need data residency (healthcare, finance, defense)
- Your workload is < 50 req/sec per model
- You're OK with a slightly slower stack than vLLM in exchange for operational simplicity
For higher throughput, use vLLM or TGI. For most enterprise self-hosting workloads that stay within these limits, Ollama is the simpler fit.
GPU Sizing Cheat Sheet
| Model | Quant | VRAM | Tokens/s (RTX 4090) |
|---|---|---|---|
| Llama-3-8B | Q4 | 6GB | 90 |
| Llama-3-8B | Q8 | 10GB | 60 |
| Llama-3-70B | Q4 | 42GB | 18 |
| Phi-4-14B | Q5 | 12GB | 55 |
| Qwen-2.5-32B | Q4 | 20GB | 32 |
Rule of thumb: model_size_in_GB × 1.4 = required VRAM (with KV cache overhead).
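The rule of thumb is easy to turn into a quick capacity check. The sketch below assumes `model_size_gb` means the size of the quantized weights; the function names and the 5 GB example model are hypothetical, and the 1.4 factor is simply the one quoted above, not a measured value.

```python
def estimate_vram_gb(model_size_gb: float, overhead: float = 1.4) -> float:
    """Apply the rule of thumb: weights size times an overhead factor."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return round(model_size_gb * overhead, 1)


def fits_on_gpu(model_size_gb: float, gpu_vram_gb: float) -> bool:
    """True when the estimated requirement is within the card's VRAM."""
    return estimate_vram_gb(model_size_gb) <= gpu_vram_gb


# Hypothetical 5 GB quantized model on a 24 GB card.
print(estimate_vram_gb(5.0))   # 7.0
print(fits_on_gpu(5.0, 24.0))  # True
```

Treat the result as a starting point only: long contexts and parallel requests push the real requirement higher.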
Concurrent Request Tuning
Depending on the version and available memory, Ollama may process only one request at a time per model by default. For production, set the concurrency explicitly:
# /etc/systemd/system/ollama.service.d/override.conf
# systemd does not support trailing comments, so each comment gets its own line.
[Service]
# 4 concurrent generations per model
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep 2 models hot
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Don't unload after idle
Environment="OLLAMA_KEEP_ALIVE=24h"
# Flash attention: roughly 30% throughput boost on supported GPUs
Environment="OLLAMA_FLASH_ATTENTION=1"
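Raising OLLAMA_NUM_PARALLEL is not free, because every parallel slot needs its own KV cache. A rough way to budget for that is sketched below. It assumes an unquantized fp16 cache (2 bytes per value) and a full context allocation per slot. The function name is hypothetical, and the layer and head counts in the example are those commonly published for Llama-3-8B, so check them against your model's own config.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    n_parallel: int = 1,
    bytes_per_value: int = 2,  # assumption: fp16 cache
) -> int:
    """Estimate KV cache size: one K and one V tensor per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * n_parallel


# Assumed Llama-3-8B shape: 32 layers, 8 KV heads, head dimension 128.
one_slot = kv_cache_bytes(32, 8, 128, context_len=8192)
four_slots = kv_cache_bytes(32, 8, 128, context_len=8192, n_parallel=4)
print(one_slot / 2**30)    # 1.0 (GiB)
print(four_slots / 2**30)  # 4.0 (GiB)
```

Under these assumptions, going from 1 to 4 parallel slots at an 8K context adds about 3 GiB on top of the weights, which is worth checking against the sizing table before raising the setting.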
Docker Compose for Production
services:
  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      OLLAMA_NUM_PARALLEL: 4
      OLLAMA_MAX_LOADED_MODELS: 2
      OLLAMA_KEEP_ALIVE: 24h
      OLLAMA_FLASH_ATTENTION: 1
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
volumes:
  ollama_data:
Model Warmup
The first request after a restart is slow because the model has to load into VRAM. Warm up on boot:
#!/bin/bash
# warmup.sh: run from systemd ExecStartPost
for model in llama3:8b phi4:14b; do
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"warmup\",\"stream\":false}" \
    > /dev/null
done
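A warmup hook can fire before the server is accepting connections, in which case the requests fail silently. One way to guard against that is to poll for readiness first, sketched here with only the Python standard library. The function names and timeouts are hypothetical. The sketch relies on two documented Ollama behaviours: the server root answers a plain GET once it is up, and a `/api/generate` request without a prompt loads the model without generating anything.

```python
import json
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def wait_until_ready(base_url: str = OLLAMA_URL, timeout_s: float = 60.0) -> bool:
    """Poll the server root until it answers or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=2):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)
    return False


def warmup_payload(model: str, keep_alive: str = "24h") -> bytes:
    """Build a load-only request body: no prompt, so nothing is generated."""
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode()


def warm(models, base_url: str = OLLAMA_URL) -> None:
    """Load each model into memory, one after another."""
    for model in models:
        req = urllib.request.Request(
            f"{base_url}/api/generate",
            data=warmup_payload(model),
            headers={"Content-Type": "application/json"},
        )
        # Loading a large model can take a while, hence the long timeout.
        with urllib.request.urlopen(req, timeout=600) as resp:
            resp.read()
```

A boot script would call `wait_until_ready()` and, if it returns `True`, `warm(["llama3:8b", "phi4:14b"])`.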
Streaming from FastAPI
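```python
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# "ollama" resolves to the Compose service name; adjust for your network.
OLLAMA_URL = "http://ollama:11434/api/generate"


async def ollama_stream(prompt: str, model: str = "llama3:8b"):
    """Relay Ollama's NDJSON stream to the caller, one JSON object per line."""
    async with httpx.AsyncClient(timeout=300) as client:
        async with client.stream(
            "POST", OLLAMA_URL, json={"model": model, "prompt": prompt}
        ) as r:
            async for line in r.aiter_lines():
                if line:
                    yield line + "\n"


@app.post("/generate")
async def generate(prompt: str):
    # A bare str parameter is read from the query string (?prompt=...).
    return StreamingResponse(ollama_stream(prompt), media_type="application/x-ndjson")
```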
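On the consuming side, each line of the NDJSON stream is a JSON object carrying a `response` fragment and a `done` flag. The sketch below shows how a client might reassemble the text. `collect_text` is a hypothetical helper, and the sample lines are hand-written to mimic the stream shape rather than captured from a real run.

```python
import json
from typing import Iterable


def collect_text(lines: Iterable[str]) -> str:
    """Join the "response" fragments from an Ollama NDJSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        # The final object sets done=true and carries no more text.
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample lines in the shape of the stream.
sample = [
    '{"model":"llama3:8b","response":"Hel","done":false}\n',
    '{"model":"llama3:8b","response":"lo","done":false}\n',
    '{"model":"llama3:8b","response":"","done":true}\n',
]
print(collect_text(sample))  # Hello
```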
Common Pitfalls
- OOM under load: Set OLLAMA_NUM_PARALLEL conservatively; each parallel request claims its own KV cache.
- Cold start latency: Keep models warm with OLLAMA_KEEP_ALIVE.
- No durable request queue: Ollama's built-in queue is in-memory and bounded (OLLAMA_MAX_QUEUE); once it fills, further requests are rejected with a 503. Add an external queue (Redis) for graceful overflow.
- No request timeout in client: Hung clients keep slots occupied. Always set a request timeout.
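The last two pitfalls can be softened on the application side before reaching for Redis. The sketch below is an in-process stand-in for an external queue, not a replacement for one: an `asyncio.Semaphore` caps how many calls are in flight, and `asyncio.wait_for` enforces a per-request timeout. `SlotGate`, `demo`, and the timings are hypothetical, and the `fast` and `hung` coroutines stand in for real Ollama calls.

```python
import asyncio


class SlotGate:
    """Cap in-flight requests and bound how long each one may run."""

    def __init__(self, max_in_flight: int, timeout_s: float):
        self._sem = asyncio.Semaphore(max_in_flight)
        self._timeout_s = timeout_s

    async def run(self, make_call):
        # Extra callers wait here instead of piling onto the server.
        async with self._sem:
            # Cancels the call and raises TimeoutError if it hangs.
            return await asyncio.wait_for(make_call(), self._timeout_s)


async def demo():
    gate = SlotGate(max_in_flight=2, timeout_s=0.5)

    async def fast():
        await asyncio.sleep(0.01)
        return "ok"

    async def hung():
        await asyncio.sleep(10)

    # Five callers share two slots; all finish, two at a time.
    results = await asyncio.gather(*(gate.run(fast) for _ in range(5)))
    try:
        await gate.run(hung)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return results, timed_out
```

Setting `max_in_flight` to match OLLAMA_NUM_PARALLEL keeps waiting requests in the application, where they can be timed out or shed. A gate like this only covers a single process, so multiple API replicas still need a shared queue.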