AI Architecture · Intermediate · 2026-05-20 · 11 min read

Cutting LLM Costs by 70%: 8 Strategies That Actually Work

How I reduced LLM costs for production AI products from $42K/month to $12K/month without sacrificing quality. Caching, routing, distillation, prompt compression, and more.

The Bill Shock

In Q4 2025 our LLM bill at one client hit $42K/month. Their MRR was $80K. That math doesn't survive. Six months later we'd cut it to $12K with no measurable quality drop.

Here is the playbook.

1. Semantic Caching

Cache by embedding similarity, not exact match. 30-40% of production queries are near-duplicates:

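```python
import time
from uuid import uuid4

from qdrant_client.models import PointStruct

# Assumes `cache_qdrant` (a QdrantClient) and `embed` (text -> vector) are defined elsewhere.


def get_cached(query: str) -> str | None:
    """Return the cached response for the closest stored query, if similarity >= 0.95."""
    vec = embed(query)
    results = cache_qdrant.search("response_cache", vec, limit=1, score_threshold=0.95)
    return results[0].payload["response"] if results else None


def set_cached(query: str, response: str) -> None:
    """Store the response under the query's embedding, with a timestamp for later expiry."""
    cache_qdrant.upsert("response_cache", [
        PointStruct(
            id=uuid4().hex,
            vector=embed(query),
            payload={"query": query, "response": response, "ts": time.time()},
        )
    ])
```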

2. Model Routing

Not every request needs Claude Opus. Route by intent:

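```python
ROUTING = {
    "classify": "claude-haiku-4-7",      # $0.25/1M
    "extract":  "gpt-5.0-mini",          # $0.15/1M
    "rewrite":  "claude-sonnet-4-6",     # $3/1M
    "reason":   "claude-opus-4-8",       # $15/1M
}
# Fallback for intents the classifier was not expected to emit (the choice of model is an assumption).
DEFAULT_MODEL = ROUTING["rewrite"]

intent = classify_intent(user_msg)  # assumed to return one of the ROUTING keys
model = ROUTING.get(intent, DEFAULT_MODEL)
```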

A 70/20/10 split (mini/sonnet/opus) cut average cost per call by 6×.
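The arithmetic behind that figure can be checked against the list prices quoted in the routing table. This is a back-of-the-envelope sketch that assumes every call uses the same number of tokens and that the baseline sends everything to Opus; neither assumption is stated explicitly in the numbers above.

```python
# Prices in USD per 1M tokens, taken from the routing table comments.
PRICE_PER_1M = {"mini": 0.15, "sonnet": 3.00, "opus": 15.00}
# Share of calls handled by each tier.
TRAFFIC_SHARE = {"mini": 0.70, "sonnet": 0.20, "opus": 0.10}


def blended_price(prices: dict[str, float], shares: dict[str, float]) -> float:
    """Traffic-weighted average price per 1M tokens."""
    return sum(prices[tier] * shares[tier] for tier in shares)


blended = blended_price(PRICE_PER_1M, TRAFFIC_SHARE)
savings_factor = PRICE_PER_1M["opus"] / blended
print(f"blended ${blended:.3f}/1M tokens, {savings_factor:.1f}x cheaper than all-Opus")
```

Under those assumptions the blend comes out at about $2.21 per 1M tokens, roughly 6.8× cheaper than an all-Opus baseline, which is in line with the 6× observed. Note that the Opus tier still accounts for most of the blended cost despite handling only 10% of calls.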

3. Prompt Compression

LLMLingua compresses prompts 2-4× with minimal quality loss:

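```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # needed when loading an LLMLingua-2 checkpoint
)
result = compressor.compress_prompt(
    long_context,
    instruction="Summarize",
    question="What was decided?",
    target_token=512,
)
compressed = result["compressed_prompt"]  # compress_prompt returns a dict
```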
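Compression adds its own latency, so one option is to apply it only to prompts that are long enough to be worth it. The sketch below is an assumption layered on top of the approach above, not part of it: `maybe_compress`, `min_tokens`, and the injected `compress` and `count_tokens` callables are all hypothetical names.

```python
from typing import Callable


def maybe_compress(
    context: str,
    compress: Callable[[str], str],
    count_tokens: Callable[[str], int],
    min_tokens: int = 2000,
) -> str:
    """Return short contexts unchanged; run the compressor only on long ones."""
    if count_tokens(context) < min_tokens:
        return context
    return compress(context)
```

With LLMLingua, `compress` could be something like `lambda text: compressor.compress_prompt(text, target_token=512)["compressed_prompt"]`, and `count_tokens` whatever tokenizer-based counter the application already uses.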

4. Cache Anthropic Prompt Prefixes

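Anthropic prompt caching bills cache hits at about 10% of the normal input price, a 90% discount on repeated prefixes (system prompts, RAG context). The request that writes the cache pays a small premium, so it only pays off when the same prefix is reused before the cache entry expires: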

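```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API; the value here is illustrative
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # marks the end of the cacheable prefix
    }],
    messages=[...],
)
```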
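To see what the discount means per request, the input-side cost can be estimated from the three token counts the API reports in `response.usage` (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`). The helper below is a rough sketch: `input_cost_usd` is a hypothetical name, and the 1.25× write premium is an assumption based on the default five-minute cache, so check current pricing before relying on it.

```python
def input_cost_usd(
    uncached_tokens: int,
    cache_write_tokens: int,
    cache_read_tokens: int,
    base_price_per_1m: float,
    write_multiplier: float = 1.25,  # assumption: premium for writing a cache entry
    read_multiplier: float = 0.10,   # the 90% discount on cache hits
) -> float:
    """Estimate the input-token cost of one request in USD."""
    weighted_tokens = (
        uncached_tokens
        + cache_write_tokens * write_multiplier
        + cache_read_tokens * read_multiplier
    )
    return weighted_tokens * base_price_per_1m / 1_000_000


# Example: 10,000-token system prompt, 200-token user message, $3 per 1M input tokens.
no_cache = input_cost_usd(10_200, 0, 0, 3.0)
first_call = input_cost_usd(200, 10_000, 0, 3.0)
later_call = input_cost_usd(200, 0, 10_000, 3.0)
```

In this example the uncached request costs about $0.0306, the cache-writing request about $0.0381, and every subsequent hit about $0.0036, so the write premium is recovered on the first reuse.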

5. Distill to Small Models

For high-volume specialized tasks (classification, extraction), fine-tune Phi-4 or Llama-3-8B on Claude outputs. Inference cost drops from $3 to roughly $0.10 per 1M tokens when self-hosted.

6. Stream and Truncate Early

Always stream. Cancel generation when the client disconnects or the answer is complete:

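```python
buffer = ""
async for chunk in stream:  # assumes the stream yields text chunks
    buffer += chunk
    # len(buffer) counts characters, so MAX_CHARS is a character budget, not a token count
    if "STOP" in buffer or len(buffer) > MAX_CHARS:
        await stream.aclose()
        break
```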
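A self-contained version of this pattern, using a fake async generator in place of a real provider stream, might look like the sketch below. `fake_stream` and `read_until` are hypothetical names; the stop marker is checked against the accumulated buffer so that a marker split across two chunks is still caught, and the stream is closed in a `finally` block so it is released on every exit path.

```python
import asyncio


async def fake_stream(chunks, closed):
    """Hypothetical stand-in for a provider stream; records when it gets closed."""
    try:
        for chunk in chunks:
            await asyncio.sleep(0)  # yield control, as a network read would
            yield chunk
    finally:
        closed.append(True)


async def read_until(stream, stop_marker: str, max_chars: int) -> str:
    """Consume text chunks until the marker appears or the character budget is spent."""
    buffer = ""
    try:
        async for chunk in stream:
            buffer += chunk
            if stop_marker in buffer or len(buffer) >= max_chars:
                break
    finally:
        await stream.aclose()  # stop generation instead of draining the stream
    return buffer


closed = []
chunks = ["Hel", "lo ST", "OP ignored", " never sent"]
text = asyncio.run(read_until(fake_stream(chunks, closed), "STOP", 1000))
```

Here the marker arrives split across the second and third chunks; the loop stops after the third chunk and the fourth is never requested. With a real SDK the equivalent of `aclose()` is whatever closes the underlying HTTP response, which is what actually stops the billing for further output tokens.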

7. Batch Embedding Generation

Don't embed one document at a time. Batch 100+ per request: the token cost is the same, but with far fewer round trips throughput improves by about 50×.
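A batching wrapper is only a few lines. The sketch below assumes an `embed_batch` callable that takes a list of texts and returns one vector per text, which is how most embedding APIs behave; `batched` and `embed_all` are illustrative names.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed all texts with one API call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(list(batch)))
    return vectors
```

For 250 documents and a batch size of 100 this makes three calls instead of 250. Providers cap the number of inputs and total tokens per request, so the batch size should stay under whatever limit applies.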

8. Monitor Every Token

Use Langfuse plus cost-per-tenant dashboards. If you can't see who is spending what, you can't optimize.
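Whatever tool collects the traces, the underlying aggregation is simple. The sketch below assumes each LLM call is logged with a tenant ID, a model name, and token counts; the `Usage` record, the `PRICES` table, and its values are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per 1M tokens, as (input, output).
PRICES = {"small": (0.25, 1.25), "large": (3.00, 15.00)}


def cost_per_tenant(events: list[Usage], prices: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Sum the USD cost of all logged calls, grouped by tenant."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        input_price, output_price = prices[event.model]
        totals[event.tenant] += (
            event.input_tokens * input_price + event.output_tokens * output_price
        ) / 1_000_000
    return dict(totals)
```

Sorting the result by cost usually shows that a handful of tenants or features account for most of the bill, which tells you where routing, caching, or distillation will pay off first.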

| Strategy | Cost Reduction | Implementation Effort |
| --- | --- | --- |
| Semantic caching | 30% | Medium |
| Model routing | 40% | Low |
| Prompt caching | 25% | Low |
| Prompt compression | 15% | Medium |
| Distillation | 60% (for distilled tasks) | High |

Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.