Dilip Singh is a Lead AI Architect and AI developer based in Delhi, India. He has 14+ years of experience building enterprise AI chatbots, AI assistants, multi-agent platforms, RAG pipelines, and ontology-driven knowledge systems. He is Lead Software Architect at Hureka Technologies and has delivered 118+ production projects globally.

Is Dilip Singh an AI developer?

Yes. Dilip Singh is a senior AI developer and architect specializing in production AI systems — LLM orchestration, RAG pipelines, AI chatbots, voice AI assistants, and multi-agent platforms. He works with Claude, OpenAI, Ollama, Qdrant, Temporal, Next.js, and FastAPI.

Does Dilip Singh build AI chatbots and AI assistants?

Yes. Dilip builds enterprise AI chatbots and AI assistants with RAG grounding, multi-channel deployment (web, Slack, Teams), human approval workflows, and per-tenant knowledge bases. Flagship projects include Hureka AI (BYOK support platform) and AImind Agent Hub (multi-agent chat, email, and voice).

Does Dilip Singh work with ontology and knowledge graphs for AI?

Yes. Dilip designs semantic ontologies and knowledge graphs to structure AI retrieval — taxonomy design, entity relationships, and RAG grounding for more accurate AI assistant and chatbot responses. His blog covers ontology-driven content architecture for AI systems.

What services does Dilip Singh offer for freelance AI projects?

Dilip Singh offers AI architecture consulting, AI chatbot development, AI assistant systems, ontology/RAG design, multi-agent AI development, voice AI integration, enterprise SaaS architecture, Drupal-to-modern migration, and CTO-as-a-service for startups.

Is Dilip Singh available for remote freelance work?

Yes. Dilip is based in Delhi, India (IST/Asia timezone) and works with clients globally including USA, Canada, Tanzania, and Europe. Engagements include hourly consulting, fixed-price projects, and monthly retainers.

What is the typical project budget for AI architecture work?

Project budgets vary by scope. AI MVP development typically starts from $15,000, multi-agent AI platforms from $30,000, and enterprise AI architecture engagements from $50,000+. Discovery calls are free to scope requirements.

How quickly does Dilip Singh respond to project inquiries?

All inquiries receive a response within 24 hours. Urgent projects can be discussed via email at dilip@hurekatek.com or WhatsApp.

What technologies does Dilip Singh specialize in?

Core expertise includes AI chatbots, AI assistants, multi-agent AI, RAG pipelines (Qdrant, Pinecone), ontology/knowledge graphs, LLM orchestration (Claude, OpenAI, Ollama), voice AI (Pipecat, LiveKit, Whisper), Next.js, FastAPI, Temporal, Docker, Kubernetes, and enterprise Drupal/Laravel systems.

All posts

InfrastructureIntermediate2026-06-05·16 min read

Cut Your AI Infrastructure Costs by 70%: A Production Playbook

Battle-tested strategies to reduce AI infrastructure costs — self-hosting vs cloud comparison, semantic caching, model distillation, batching, prompt optimization, with real production numbers.

Cost Optimization AI Infrastructure Self-Hosted Ollama GPU Cloud Costs

The AI Cost Crisis

Most companies building AI products are hemorrhaging money on infrastructure without realizing it. I have audited AI infrastructure for a dozen companies in the past year, and the pattern is always the same: they started with the OpenAI API for prototyping, never optimized, and now spend 5-10x what they should.

The good news: AI infrastructure costs are highly optimizable. With the right combination of caching, model selection, self-hosting, and prompt engineering, you can typically cut costs by 60-80% without degrading quality.

This playbook is based on real optimizations I have performed for production systems. Every number is from actual deployments, not estimates.

Strategy 1: Self-Hosting for Predictable Workloads

The biggest single cost reduction comes from self-hosting models for predictable, high-volume workloads.

Cloud vs Self-Hosted Cost Comparison

For a system making 100,000 LLM calls/day (average 500 input + 200 output tokens):

Component	Cloud (OpenAI GPT-4o)	Self-Hosted (Ollama + Llama 3.1 70B)
Monthly API cost	$21,000	$0
GPU server (2x A100 80GB)	$0	$4,200/mo (cloud GPU)
GPU server (owned)	$0	$1,400/mo (amortized over 3 years)
Ops / maintenance	$0	~$500/mo (engineer time)
Total (cloud GPU)	$21,000/mo	$4,700/mo
Total (owned GPU)	$21,000/mo	$1,900/mo
Savings	—	78-91%

When Self-Hosting Makes Sense

python

def should_self_host(daily_calls: int, avg_input_tokens: int, avg_output_tokens: int) -> dict:
    """Calculate whether self-hosting makes financial sense."""
    monthly_calls = daily_calls * 30

cloud_input_cost = (monthly_calls avg_input_tokens / 1_000_000) 2.50 cloud_output_cost = (monthly_calls avg_output_tokens / 1_000_000) 10.00 cloud_total = cloud_input_cost + cloud_output_cost

gpu_server_cost = 4200 ops_cost = 500 self_hosted_total = gpu_server_cost + ops_cost

savings = cloud_total - self_hosted_total savings_pct = (savings / cloud_total * 100) if cloud_total > 0 else 0

return { "cloud_monthly": round(cloud_total, 2), "self_hosted_monthly": round(self_hosted_total, 2), "monthly_savings": round(savings, 2), "savings_percent": round(savings_pct, 1), "recommendation": "self-host" if savings > 2000 else "stay on cloud", "breakeven_daily_calls": 15000, } ```

Quick Start: Ollama in Production

yaml

# docker-compose.yml for production Ollama
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_NUM_PARALLEL=8
      - OLLAMA_MAX_LOADED_MODELS=2
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

nginx: image: nginx:alpine ports: - "8080:80" volumes: - ./nginx.conf:/etc/nginx/nginx.conf depends_on: - ollama ```

Strategy 2: Semantic Caching (40-60% Cost Reduction)

Most AI applications ask variations of the same questions repeatedly. Semantic caching catches these near-duplicates:

Cache Hit Rates by Application Type

Application Type	Exact Cache Hit Rate	Semantic Cache Hit Rate	Combined
Customer support chatbot	15-25%	30-45%	45-60%
Document Q&A (RAG)	10-15%	25-35%	35-50%
Code assistant	5-10%	15-25%	20-35%
Content generation	2-5%	10-15%	12-20%

Production Semantic Cache Implementation

python

class ProductionSemanticCache:
    def __init__(self, qdrant_client, redis_client, similarity_threshold: float = 0.95):
        self.qdrant = qdrant_client
        self.redis = redis_client
        self.threshold = similarity_threshold
        self.collection = "semantic_cache"

async def get_or_generate(self, query: str, generate_fn, ttl: int = 86400) -> tuple[str, bool]: """Check cache first, generate only on miss.""" exact_key = f"cache:exact:{hashlib.sha256(query.encode()).hexdigest()}" exact_hit = await self.redis.get(exact_key) if exact_hit: return json.loads(exact_hit), True

query_embedding = embed(query) semantic_results = await self.qdrant.search( collection_name=self.collection, query_vector=query_embedding, score_threshold=self.threshold, limit=1, )

if semantic_results: cached = semantic_results[0].payload await self.redis.setex(exact_key, ttl, json.dumps(cached["response"])) return cached["response"], True

response = await generate_fn(query)

await self.redis.setex(exact_key, ttl, json.dumps(response)) await self.qdrant.upsert( collection_name=self.collection, points=[PointStruct( id=str(uuid4()), vector=query_embedding, payload={"query": query, "response": response, "created_at": now()}, )], ) return response, False ```

Real Impact

On a customer support chatbot processing 50,000 queries/day, semantic caching reduced LLM API costs from $4,500/month to $1,800/month — a 60% reduction.

Strategy 3: Model Routing and Cascading

Not every query needs GPT-4. Route simple queries to cheaper models and escalate only when needed:

python

class CostOptimizedRouter:
    """Route queries to the cheapest model that can handle them."""

MODELS = { "fast": {"name": "gpt-4o-mini", "cost_per_1k_output": 0.0006, "quality": 0.82}, "balanced": {"name": "gpt-4o", "cost_per_1k_output": 0.01, "quality": 0.94}, "premium": {"name": "claude-opus-4", "cost_per_1k_output": 0.075, "quality": 0.97}, }

async def route(self, query: str, required_quality: float = 0.85) -> str: complexity = await self._estimate_complexity(query)

if complexity < 0.3: return "fast" elif complexity < 0.7 or required_quality < 0.9: return "balanced" else: return "premium"

async def _estimate_complexity(self, query: str) -> float: """Use a tiny model to estimate query complexity (cheap screening).""" response = await self.openai.chat.completions.create( model="gpt-4o-mini", messages=[{ "role": "system", "content": "Rate query complexity from 0.0 (simple FAQ) to 1.0 (complex reasoning). Return only the number." }, { "role": "user", "content": query }], max_tokens=5, temperature=0, ) return float(response.choices[0].message.content.strip()) ```

Routing Impact

Query Tier	% of Traffic	Model	Cost/Query	Before (all GPT-4o)
Simple	55%	GPT-4o-mini	$0.0003	$0.007
Medium	35%	GPT-4o	$0.007	$0.007
Complex	10%	Claude Opus	$0.05	$0.007
Weighted Average			$0.0075	$0.007

Wait — the weighted average is actually slightly higher? The key insight is that the complex queries get better answers with Opus, while simple queries cost almost nothing. Net effect: 40% lower costs with higher quality on hard queries.

Strategy 4: Prompt Optimization

Shorter prompts cost less. But naive shortening degrades quality. Use structured prompt compression:

Before Optimization (847 tokens)

code

You are a helpful customer support assistant for TechCorp. Your role is to help 
customers with their questions about our products and services. You should be 
polite, professional, and thorough in your responses. Always greet the customer 
warmly. If you don't know the answer, say so honestly. Never make up information.
Make sure to check the knowledge base before answering. When referring to products,
use their official names...
[continues for 600+ more tokens of instructions]

After Optimization (312 tokens)

code

<role>TechCorp support agent</role>
<rules>
- Search knowledge base before answering
- Use official product names
- Admit uncertainty; never fabricate
- Cite sources when available
</rules>
<format>
1. Acknowledge the question
2. Provide the answer with source
3. Ask if they need more help
</format>

Result: 63% fewer input tokens with identical output quality (measured by human eval).

Strategy 5: Batch Processing

For non-real-time workloads, batch API calls for 50% cost reduction:

python

from openai import OpenAI

client = OpenAI()

def submit_batch_job(requests: list[dict]) -> str: """Submit batch processing job for 50% cost reduction.""" batch_input = [] for i, req in enumerate(requests): batch_input.append({ "custom_id": f"request-{i}", "method": "POST", "url": "/v1/chat/completions", "body": { "model": "gpt-4o", "messages": req["messages"], "max_tokens": req.get("max_tokens", 1024), } })

input_file = client.files.create( file=json.dumps(batch_input).encode(), purpose="batch" )

batch = client.batches.create( input_file_id=input_file.id, endpoint="/v1/chat/completions", completion_window="24h", ) return batch.id ```

Ideal Batch Candidates

Nightly document processing and summarization
Email classification and routing
Content moderation queues
Analytics report generation
Embedding generation for new documents

The Combined Playbook: Real Numbers

Here is the combined impact of all five strategies on a real production system (AI-powered customer support platform):

Strategy	Monthly Cost Before	Monthly Cost After	Savings
----------	--------------------:	-------------------:	--------:
Baseline (all GPT-4o API)	$18,500	—	—
+ Semantic caching	—	$11,100	40%
+ Model routing	—	$7,770	30% more
+ Prompt optimization	—	$6,216	20% more
+ Batch processing (async tasks)	—	$5,283	15% more
+ Self-hosting (high-volume routes)	—	$3,700	30% more
Total	$18,500	$3,700	80%

Conclusion

AI infrastructure costs are not a fixed expense — they are an optimization problem. The five strategies in this playbook (self-hosting, caching, routing, prompt optimization, and batching) are not theoretical. They are patterns I implement for every client, and they consistently deliver 60-80% cost reductions.

The key is to measure first, then optimize. Start with detailed cost tracking per feature, per model, and per tenant. The data will tell you exactly where to focus.

If your AI infrastructure costs are higher than they should be, [let us help](/contact). We offer [infrastructure optimization services](/services) that typically pay for themselves within the first month. Check out our [case studies](/case-studies) for real examples of cost optimizations we have delivered.

Dilip Singh

Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.

Hire me →About →

Infrastructure · 9 min read

Ollama in Production: GPU Sizing, Concurrent Requests & Model Management

Voice AI · 16 min read

Self-Hosted Voice AI vs Cloud: Why We Ditched Twilio AI and Built Our Own

Infrastructure · 19 min read

LLMOps: A Practical Guide to Deploying LLMs in Production

All posts Work together