Cutting LLM Costs by 70%: 8 Strategies That Actually Work
How I reduced LLM costs for production AI products from $42K/month to $12K/month without sacrificing quality. Caching, routing, distillation, prompt compression, and more.
The Bill Shock
In Q4 2025 our LLM bill at one client hit $42K/month. Their MRR was $80K. That math doesn't survive. Six months later we'd cut it to $12K with no measurable quality drop.
Here is the playbook.
1. Semantic Caching
Cache by embedding similarity, not exact match. 30-40% of production queries are near-duplicates:
def get_cached(query: str) -> str | None:
vec = embed(query)
results = cache_qdrant.search("response_cache", vec, limit=1, score_threshold=0.95)
return results[0].payload["response"] if results else Nonedef set_cached(query: str, response: str): cache_qdrant.upsert("response_cache", [{ "id": uuid4().hex, "vector": embed(query), "payload": {"query": query, "response": response, "ts": time.time()}, }]) ```
2. Model Routing
Not every request needs Claude Opus. Route by intent:
ROUTING = {
"classify": "claude-haiku-4-7", # $0.25/1M
"extract": "gpt-5.0-mini", # $0.15/1M
"rewrite": "claude-sonnet-4-6", # $3/1M
"reason": "claude-opus-4-8", # $15/1M
}
intent = classify_intent(user_msg)
model = ROUTING[intent]
A 70/20/10 split (mini/sonnet/opus) cut average cost per call by 6×.
3. Prompt Compression
LLMLingua compresses prompts 2-4× with minimal quality loss:
from llmlingua import PromptCompressorcompressor = PromptCompressor(model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank") result = compressor.compress_prompt( long_context, instruction="Summarize", question="What was decided?", target_token=512, ) ```
4. Cache Anthropic Prompt Prefixes
Anthropic prompt caching gives a 90% discount on repeated prefixes (system prompts, RAG context):
response = anthropic.messages.create(
model="claude-sonnet-4-6",
system=[{
"type": "text",
"text": LARGE_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
}],
messages=[...],
)
5. Distill to Small Models
For high-volume specialized tasks (classification, extraction), fine-tune Phi-4 or Llama-3-8B on Claude outputs. Inference cost drops from $3/1M to ~$0.10/1M when self-hosted.
6. Stream and Truncate Early
Always stream. Cancel generation when the client disconnects or the answer is complete:
async for chunk in stream:
if "STOP" in chunk or len(buffer) > MAX_TOKENS:
await stream.aclose()
break
7. Batch Embedding Generation
Don't embed one document at a time. Batch 100+ at once — same API cost, 50× throughput.
8. Monitor Every Token
LangFuse + cost-per-tenant dashboards. If you can't see who is spending what, you can't optimize.
| Strategy | Cost Reduction | Implementation Effort |
|---|---|---|
| Semantic caching | 30% | Medium |
| Model routing | 40% | Low |
| Prompt caching | 25% | Low |
| Prompt compression | 15% | Medium |
| Distillation | 60% (for distilled tasks) | High |