Infrastructure · Intermediate · 2026-05-05 · 9 min read

Ollama in Production: GPU Sizing, Concurrent Requests & Model Management

A guide to running Ollama in production: GPU selection, concurrent request handling, model warmup, quantization choices, and the gotchas that take down hobby setups when real traffic hits.

When Ollama Is the Right Choice

  • You need data residency (healthcare, finance, defense)
  • Your workload is < 50 req/sec per model
  • You're OK with a slightly slower stack than vLLM in exchange for operational simplicity

For higher throughput, use vLLM or TGI. But for roughly 90% of enterprise self-hosting needs, Ollama is the right choice.

GPU Sizing Cheat Sheet

Model          Quant   VRAM    Tokens/s (RTX 4090)
Llama-3-8B     Q4      6 GB    90
Llama-3-8B     Q8      10 GB   60
Llama-3-70B    Q4      42 GB   18
Phi-4-14B      Q5      12 GB   55
Qwen-2.5-32B   Q4      20 GB   32

Rule of thumb: model_size_in_GB × 1.4 = required VRAM (with KV cache overhead).
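
The rule of thumb above is easy to turn into a quick capacity check. The following is a minimal sketch: the function names and the sample weight sizes are illustrative assumptions, and only the 1.4 multiplier comes from the rule itself.

```python
def required_vram_gb(model_size_gb: float, overhead: float = 1.4) -> float:
    """Estimate VRAM from the size of the quantized weights (rule of thumb)."""
    return model_size_gb * overhead


def fits(model_size_gb: float, gpu_vram_gb: float) -> bool:
    """True if the estimated requirement fits within the card's VRAM."""
    return required_vram_gb(model_size_gb) <= gpu_vram_gb


if __name__ == "__main__":
    # Hypothetical weight sizes, chosen only to illustrate the arithmetic.
    for size_gb in (5.0, 20.0, 40.0):
        need = required_vram_gb(size_gb)
        print(f"{size_gb:5.1f} GB weights -> ~{need:.1f} GB VRAM, "
              f"fits in 24 GB: {fits(size_gb, 24.0)}")
```

Treat the result as a starting estimate; longer contexts and more parallel slots increase KV cache usage beyond what a fixed multiplier captures.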

Concurrent Request Tuning

Depending on the version and available memory, Ollama may serve only one request at a time per model. For production, set the concurrency limits explicitly:

ini
# /etc/systemd/system/ollama.service.d/override.conf
# systemd does not support trailing comments, so each note sits on its own line.
[Service]
# 4 concurrent generations per model
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep 2 models hot
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Don't unload after idle
Environment="OLLAMA_KEEP_ALIVE=24h"
# Flash attention: roughly 30% throughput boost
Environment="OLLAMA_FLASH_ATTENTION=1"
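
On the client side, it can help to cap in-flight calls at the same number the server is configured for, so extra requests wait in your application rather than inside Ollama. Below is a minimal sketch using `asyncio.Semaphore`. `run_limited`, `MAX_IN_FLIGHT`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real HTTP call.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: mirrors OLLAMA_NUM_PARALLEL=4 from the server configuration.
MAX_IN_FLIGHT = 4


async def run_limited(
    prompts: list[str],
    call_model: Callable[[str], Awaitable[str]],
    limit: int = MAX_IN_FLIGHT,
) -> list[str]:
    """Run call_model over prompts with at most `limit` calls in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(prompt: str) -> str:
        async with gate:  # waits here while `limit` calls are already running
            return await call_model(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_model(prompt: str) -> str:
    """Stand-in for a real request to Ollama (hypothetical)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(run_limited(["a", "b", "c"], fake_model)))
```

Keeping the limit in one constant makes it harder for the client and server settings to drift apart.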

Docker Compose for Production

yaml
services:
  ollama:
    image: ollama/ollama:latest  # consider pinning a specific version tag for reproducible deploys
    ports: ["11434:11434"]
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
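    environment:
      # Quoted so every value reaches the container as a string
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_FLASH_ATTENTION: "1"
    healthcheck:
      # Succeeds only once the server answers API calls
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3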
volumes:
  ollama_data:

Model Warmup

The first request after a restart is slow because the model has to load into VRAM. Warm up on boot:

bash
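#!/bin/bash
# warmup.sh: run from systemd ExecStartPost
# ExecStartPost can fire before the server is listening, so wait for the API first.
until curl -sf http://localhost:11434/api/tags > /dev/null; do sleep 1; done

# One tiny non-streaming request per model forces its weights into VRAM.
for model in llama3:8b phi4:14b; do
    curl -sf http://localhost:11434/api/generate \
      -d "{\"model\":\"$model\",\"prompt\":\"warmup\",\"stream\":false}" \
      > /dev/null
done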
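
If the warmup runs from application code instead of a shell script, the same request can be built in Python. This is a sketch under assumptions: `warmup_payload` and `warm` are hypothetical helper names, and the per-request `keep_alive` field is included so the idle timeout travels with the request rather than relying only on the server-wide setting.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint the script calls


def warmup_payload(model: str, keep_alive: str = "24h") -> bytes:
    """Build a non-streaming request body that loads `model` and sets its idle timeout."""
    body = {"model": model, "prompt": "warmup", "stream": False, "keep_alive": keep_alive}
    return json.dumps(body).encode("utf-8")


def warm(model: str, timeout_s: float = 300.0) -> int:
    """POST the warmup request and return the HTTP status code."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=warmup_payload(model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


if __name__ == "__main__":
    # Prints the request body only; call warm(...) once the server is reachable.
    print(warmup_payload("llama3:8b").decode("utf-8"))
```

Calling `warm("llama3:8b")` and `warm("phi4:14b")` after the server is up would mirror the loop in the shell script.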

Streaming from FastAPI

python
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def ollama_stream(prompt: str, model: str = "llama3:8b"):
    # Relay Ollama's newline-delimited JSON chunks to the caller as they arrive.
    async with httpx.AsyncClient(timeout=300) as client:
        async with client.stream(
            "POST",
            "http://ollama:11434/api/generate",
            json={"model": model, "prompt": prompt},
        ) as r:
            async for line in r.aiter_lines():
                if line:
                    yield line + "\n"


@app.post("/generate")
async def generate(prompt: str):
    # A bare str parameter is read from the query string.
    return StreamingResponse(ollama_stream(prompt), media_type="application/x-ndjson")
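
A consumer of this endpoint receives one JSON object per line, each carrying a `response` text fragment and a `done` flag. The sketch below shows one way to reassemble the full text. `collect_text` is a hypothetical helper, and the sample lines are hand-written for illustration, not captured output.

```python
import json
from typing import Iterable


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the `response` fragments from an NDJSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # stop at the chunk flagged as final
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        '{"model":"llama3:8b","response":"Hel","done":false}',
        '{"model":"llama3:8b","response":"lo","done":false}',
        '{"model":"llama3:8b","response":"","done":true}',
    ]
    print(collect_text(sample))  # prints "Hello"
```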

Common Pitfalls

  1. OOM under load: set OLLAMA_NUM_PARALLEL conservatively; each parallel request claims its own KV cache.
  2. Cold start latency: keep models warm with OLLAMA_KEEP_ALIVE.
  3. No overflow handling: Ollama's built-in queue is bounded (OLLAMA_MAX_QUEUE) and rejects requests once it is full; add an external queue (Redis) for graceful overflow.
  4. No request timeout in client: hung clients keep slots occupied. Always set a request timeout.
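
Pitfalls 3 and 4 can be illustrated together with a small admission gate: reject work once the backlog is full, and cancel any call that exceeds its deadline so the slot is freed. This is an in-process sketch, not the Redis-backed queue suggested above, and `Admission` and `Overloaded` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


class Overloaded(Exception):
    """Raised when the backlog is full and the request is rejected up front."""


class Admission:
    """Cap running plus waiting requests, and bound each call's run time."""

    def __init__(self, slots: int, max_waiting: int, timeout_s: float) -> None:
        self._gate = asyncio.Semaphore(slots)
        self._capacity = slots + max_waiting
        self._admitted = 0
        self._timeout_s = timeout_s

    async def run(self, call: Callable[[], Awaitable[str]]) -> str:
        if self._admitted >= self._capacity:
            raise Overloaded("backlog full")  # shed load instead of piling up
        self._admitted += 1
        try:
            async with self._gate:
                # A hung call is cancelled at the deadline, which frees the slot.
                return await asyncio.wait_for(call(), timeout=self._timeout_s)
        finally:
            self._admitted -= 1


async def demo() -> None:
    # Created inside the running event loop.
    adm = Admission(slots=1, max_waiting=0, timeout_s=0.05)

    async def hung() -> str:
        await asyncio.sleep(10)  # simulates a backend that never answers
        return "never"

    try:
        await adm.run(hung)
    except asyncio.TimeoutError:
        print("timed out, slot released")


if __name__ == "__main__":
    asyncio.run(demo())
```

In a web handler, `Overloaded` would typically map to an HTTP 503 and the timeout to a 504, so callers can tell the two failure modes apart.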
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.