Series: Self-Hosted AI · Part 3 of 4

1. Self-Hosted Voice AI 2. FastAPI Production Patterns 3. Ollama in Production 4. Voice Activity Detection

Infrastructure · Intermediate · 2026-05-05 · 9 min read

Ollama in Production: GPU Sizing, Concurrent Requests & Model Management

A complete guide to running Ollama in production: GPU selection, concurrent request handling, model warmup, quantization choices, and the gotchas that take down hobby setups when real traffic hits.

Ollama LLM Self-Hosted GPU DevOps Inference

When Ollama Is the Right Choice

You need data residency (healthcare, finance, defense)
Your workload is < 50 req/sec per model
You're OK with a slightly slower stack than vLLM in exchange for operational simplicity

For higher throughput, use vLLM or TGI. For roughly 90% of enterprise self-hosting needs, though, Ollama is the right fit.

GPU Sizing Cheat Sheet

Model	Quant	VRAM	Tokens/s (RTX 4090)
Llama-3-8B	Q4	6GB	90
Llama-3-8B	Q8	10GB	60
Llama-3-70B	Q4	42GB	18
Phi-4-14B	Q5	12GB	55
Qwen-2.5-32B	Q4	20GB	32

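Rule of thumb: required VRAM ≈ model_size_in_GB × 1.4, where model_size_in_GB is the size of the quantized weights and the extra 40% covers KV cache overhead. Note that the 42GB needed for Llama-3-70B at Q4 exceeds the 24GB on a single RTX 4090, so that configuration needs multiple GPUs or partial CPU offload.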
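As a quick sanity check before buying or renting a GPU, the rule of thumb above can be sketched in a few lines of Python. The function names and the 30 GB example file size are illustrative assumptions, not values from Ollama itself, and the 1.4 factor is only the rough multiplier quoted above.

```python
def estimate_vram_gb(model_file_gb: float, overhead_factor: float = 1.4) -> float:
    """Apply the rule of thumb: weights on disk times an overhead factor."""
    return model_file_gb * overhead_factor


def fits_on_gpu(model_file_gb: float, gpu_vram_gb: float,
                overhead_factor: float = 1.4) -> bool:
    """True if the estimated footprint fits in the given VRAM."""
    return estimate_vram_gb(model_file_gb, overhead_factor) <= gpu_vram_gb


if __name__ == "__main__":
    # Hypothetical 30 GB quantized weight file against a 24 GB card.
    print(round(estimate_vram_gb(30.0), 1))
    print(fits_on_gpu(30.0, gpu_vram_gb=24.0))
```

Treat the result as a first filter only: long contexts and several parallel requests can push real usage above the estimate.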

Concurrent Request Tuning

Depending on the version and available memory, Ollama may serve only one request at a time per model by default. For production, set the limits explicitly:

ini

# /etc/systemd/system/ollama.service.d/override.conf
# systemd does not allow trailing comments, so each note sits on its own line.
[Service]
# 4 concurrent generations per model
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep 2 models hot
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Don't unload after idle
Environment="OLLAMA_KEEP_ALIVE=24h"
# 30% throughput boost
Environment="OLLAMA_FLASH_ATTENTION=1"
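Raising OLLAMA_NUM_PARALLEL is not free, because each parallel slot needs its own KV cache. The sketch below uses the standard transformer KV cache formula to show how that memory scales. It assumes an fp16 cache, an 8192-token context, and the published Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128). What Ollama actually allocates depends on its context length and cache settings, so read this as an order-of-magnitude estimate.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one request slot (factor 2 = keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def total_kv_cache_gib(per_slot_bytes: int, num_parallel: int) -> float:
    """Total KV cache across all parallel slots, in GiB."""
    return per_slot_bytes * num_parallel / 2**30


if __name__ == "__main__":
    # Assumed Llama-3-8B shape with an 8192-token context and fp16 cache.
    per_slot = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              context_tokens=8192)
    print(per_slot / 2**30)                 # 1.0 GiB per slot
    print(total_kv_cache_gib(per_slot, 4))  # 4.0 GiB for 4 parallel slots
```

Under these assumptions, moving from 1 to 4 parallel requests adds about 3 GiB on top of the weights, which is why the parallelism setting and the VRAM budget need to be chosen together.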

Docker Compose for Production

yaml

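services:
  ollama:
    # Pin a specific tag instead of latest for reproducible deploys.
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      # Persist downloaded models across container restarts.
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            # Reserve one NVIDIA GPU for the container.
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      # Quoted so every value reaches the container as a string.
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_FLASH_ATTENTION: "1"
    healthcheck:
      # Succeeds once the server answers the CLI.
      test: ["CMD", "ollama", "list"]
      interval: 30s
volumes:
  ollama_data: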

Model Warmup

The first request after a restart is slow because the model has to load into VRAM. Warm up on boot:

bash

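#!/bin/bash
# warmup.sh: run from systemd ExecStartPost
# ExecStartPost can fire before the API is listening, so wait for it first.
until curl -sf http://localhost:11434/api/tags > /dev/null; do
    sleep 1
done
# One short non-streaming generation per model forces its weights into VRAM.
for model in llama3:8b phi4:14b; do
    curl -sf http://localhost:11434/api/generate \
      -d "{\"model\":\"$model\",\"prompt\":\"warmup\",\"stream\":false}" \
      > /dev/null
done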
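If the warmup needs to live inside a Python service rather than a shell script, the same idea can be sketched with the standard library only. The helper names (`warmup_payload`, `wait_until_ready`, `api_is_up`, `warm`) are hypothetical. The sketch assumes the server listens on localhost:11434 as in the script above, and it uses Ollama's `/api/tags` endpoint as a readiness probe.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: same address as warmup.sh


def warmup_payload(model: str) -> bytes:
    """Request body for one short, non-streaming generation."""
    return json.dumps({"model": model, "prompt": "warmup", "stream": False}).encode()


def wait_until_ready(probe, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() until it returns True or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


def api_is_up() -> bool:
    """True once the Ollama API answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused and HTTP errors
        return False


def warm(models) -> None:
    """Wait for the API, then load each model into VRAM with one request."""
    if not wait_until_ready(api_is_up):
        raise RuntimeError("Ollama API did not come up in time")
    for model in models:
        req = urllib.request.Request(
            f"{OLLAMA_URL}/api/generate",
            data=warmup_payload(model),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            resp.read()  # drain the response so the generation completes
```

Calling `warm(["llama3:8b", "phi4:14b"])` at service startup would mirror the shell loop. The generous 300-second timeout is an assumption meant to cover a slow first load of a large model.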

Streaming from FastAPI

python

import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def ollama_stream(prompt: str, model: str = "llama3:8b"):
    # Relay Ollama's newline-delimited JSON chunks as they arrive.
    async with httpx.AsyncClient(timeout=300) as client:
        async with client.stream(
            "POST",
            "http://ollama:11434/api/generate",
            json={"model": model, "prompt": prompt},
        ) as r:
            async for line in r.aiter_lines():
                if line:
                    yield line + "\n"

@app.post("/generate")
async def generate(prompt: str):
    # prompt is read from the query string because it is a bare str parameter.
    return StreamingResponse(ollama_stream(prompt), media_type="application/x-ndjson")
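On the consuming side, each streamed line is a JSON object carrying a `response` text fragment and a `done` flag, which is the shape Ollama's `/api/generate` emits when streaming. A minimal sketch of reassembling the full text from those lines follows. The function name `collect_response` is hypothetical, and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def collect_response(ndjson_lines: Iterable[str]) -> str:
    """Join the text fragments from a stream of NDJSON chunks."""
    parts = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk: stop reading
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        '{"response": "Hel", "done": false}\n',
        '{"response": "lo", "done": false}\n',
        '{"response": "", "done": true}\n',
    ]
    print(collect_response(sample))  # Hello
```

A UI would normally render each fragment as it arrives instead of joining at the end. The joined form is mainly useful for logging and tests.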

Common Pitfalls

1. OOM under load: Set OLLAMA_NUM_PARALLEL conservatively; each parallel request claims its own KV cache.
2. Cold start latency: Keep models warm with OLLAMA_KEEP_ALIVE.
3. No request queue: Add an external queue (Redis) for graceful overflow.
4. No request timeout in client: Hung clients keep slots occupied. Always set a request timeout.

Dilip Singh

Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.
