LLM Observability with LangFuse: Traces, Costs & Quality at Scale
How to instrument production LLM applications with LangFuse. Traces, scoring, cost attribution, prompt management, and the dashboards that let you ship AI confidently.
You Can't Ship What You Can't See
Most production LLM problems (hallucinations, cost and latency spikes, prompt regressions) are invisible without observability. LangFuse is a widely used open-source tool for this, and it's what I use across all Hureka projects.
Self-Host or Cloud
For most teams, self-hosted LangFuse is fine:
services:
  langfuse:
    # Pinned to the v2 image: this Postgres-only setup matches v2, while v3 needs additional services.
    image: langfuse/langfuse:2
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://postgres:secret@db:5432/langfuse
      NEXTAUTH_SECRET: <generate>
      SALT: <generate>
      NEXTAUTH_URL: http://localhost:3000
    depends_on: [db]
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: langfuse
    volumes: [pg_data:/var/lib/postgresql/data]
volumes:
  pg_data:
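The `<generate>` placeholders need random values before the stack is started. One way to produce them, as a minimal sketch using only the Python standard library (the `generate_secret` helper and the 32-byte length are assumptions for illustration, not LangFuse requirements):

```python
import secrets


def generate_secret(num_bytes: int = 32) -> str:
    """Return a random hex string (two hex characters per byte)."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste these values into the compose file in place of <generate>.
    print(f"NEXTAUTH_SECRET: {generate_secret()}")
    print(f"SALT: {generate_secret()}")
```

Generate the values once and keep them stable across restarts; store them in a secrets manager or an untracked env file rather than committing them.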
Instrumenting Every Call
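```python
import os

from langfuse import Langfuse

# Uses the low-level trace/span/generation client of the v2 Python SDK.
langfuse = Langfuse(
    public_key=os.environ["LF_PUBLIC"],
    secret_key=os.environ["LF_SECRET"],
    host="http://localhost:3000",
)


async def chat(user_id: str, session_id: str, tenant_id: str, message: str):
    trace = langfuse.trace(
        name="chat",
        user_id=user_id,
        session_id=session_id,
        metadata={"tenant_id": tenant_id},
    )

    # Span around retrieval; rag_retrieve is the app's own retriever.
    retrieval = trace.span(name="retrieve")
    chunks = await rag_retrieve(message)
    retrieval.end(output={"chunk_count": len(chunks)})

    # The generation records model, prompt, output, and token usage.
    gen = trace.generation(
        name="answer",
        model="claude-sonnet-4-6",
        input=build_prompt(message, chunks),
        metadata={"prompt_version": "v3"},
    )
    # `anthropic` is assumed to be an AsyncAnthropic client instance.
    response = await anthropic.messages.create(...)
    gen.end(
        output=response.content[0].text,
        # Map Anthropic's token counts onto the input/output usage keys.
        usage={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        },
    )

    return response.content[0].text
```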
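In the code above, a span is only ended on the happy path, so an exception in retrieval leaves it open. A small context manager can close any span-like object even when the body raises. This is a sketch under assumptions: `ended` is a hypothetical helper, and it assumes the object's `end()` accepts `output`, `level`, and `status_message` keyword arguments.

```python
from contextlib import contextmanager


@contextmanager
def ended(observation):
    """Yield a dict of fields, then call observation.end(**fields), even on error."""
    fields = {}
    try:
        yield fields
    except Exception as exc:
        # Record the failure so the trace shows why this step stopped.
        fields.setdefault("level", "ERROR")
        fields.setdefault("status_message", str(exc))
        raise
    finally:
        observation.end(**fields)


# Usage inside chat():
#   with ended(trace.span(name="retrieve")) as out:
#       chunks = await rag_retrieve(message)
#       out["output"] = {"chunk_count": len(chunks)}
```

The exception is re-raised after the span is closed, so error handling elsewhere in the application is unaffected.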
Scoring Quality
trace.score(name="helpfulness", value=0.85, comment="auto-eval LLM rating")
trace.score(name="faithfulness", value=0.92)
trace.score(name="user_thumbs", value=1) # From UI feedback
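Scores are easier to compare over time when every source maps onto the same scale. The sketch below shows one way to normalize raw signals before calling `trace.score(...)`. The helper names and the convention that thumbs-down maps to 0 are assumptions for illustration.

```python
from typing import Optional


def thumbs_to_score(thumbs_up: bool) -> int:
    """Map UI thumbs feedback to 1 (up) or 0 (down). The 0 is an assumed convention."""
    return 1 if thumbs_up else 0


def clamp_unit(value: float) -> float:
    """Clamp an auto-eval rating into the 0.0 to 1.0 range."""
    return max(0.0, min(1.0, value))


def build_score(name: str, value: float, comment: Optional[str] = None) -> dict:
    """Build keyword arguments for a trace.score(...) call."""
    payload = {"name": name, "value": value}
    if comment is not None:
        payload["comment"] = comment
    return payload


# Usage: trace.score(**build_score("user_thumbs", thumbs_to_score(True)))
```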
Cost Attribution Per Tenant
LangFuse calculates cost from token usage for models whose pricing it knows. To attribute that cost, group by the tenant_id stored in trace metadata:
-- In LangFuse dashboard analytics
SELECT
metadata->>'tenant_id' AS tenant,
SUM(total_cost) AS cost_usd,
COUNT(*) AS calls
FROM traces
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY tenant
ORDER BY cost_usd DESC;
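The same aggregation can be sketched in plain Python to make the arithmetic explicit: cost is tokens times a per-token price, summed per tenant. The prices and the `example-model` name below are hypothetical placeholders, not real provider rates.

```python
from collections import defaultdict

# HYPOTHETICAL prices in USD per million tokens; substitute your provider's real rates.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD, priced separately for input and output tokens."""
    price = PRICES_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def cost_by_tenant(calls: list) -> dict:
    """Sum call costs per tenant_id, most expensive tenant first."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["tenant_id"]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Each call record here is a dict with `tenant_id`, `model`, `input_tokens`, and `output_tokens` keys, which mirrors what the instrumented traces carry.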
Alerting on Regressions
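```python
# Daily cron. avg_score and alert_slack are your own helpers.
yesterday_avg = avg_score(metric="faithfulness", days=1)
baseline = avg_score(metric="faithfulness", days=30, offset_days=1)

# Alert when yesterday falls more than 5% below the 30-day baseline.
if yesterday_avg < baseline * 0.95:
    alert_slack(f"Faithfulness dropped: {yesterday_avg:.2f} vs {baseline:.2f}")
```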
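The comparison itself is simple enough to isolate and unit test. A minimal sketch, assuming the scores have already been fetched into plain lists (`is_regression` and its `tolerance` parameter are hypothetical names):

```python
from statistics import mean


def is_regression(recent: list, baseline: list, tolerance: float = 0.05) -> bool:
    """Return True if the recent mean is more than `tolerance` below the baseline mean."""
    if not recent or not baseline:
        # Not enough data to decide; stay quiet rather than raise a false alert.
        return False
    return mean(recent) < mean(baseline) * (1 - tolerance)
```

Keeping the threshold logic separate from data fetching and Slack delivery means the alert rule can be tested without network access.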
Prompt Management
LangFuse stores versioned prompts and uses labels such as production to control which version is rolled out:
```python
prompt = langfuse.get_prompt("classify_intent", label="production")
rendered = prompt.compile(domain="healthcare", message=user_text)

# llm is the app's own client wrapper.
response = await llm.complete(rendered, langfuse_prompt=prompt)
# Now every generation links to the exact prompt version used
```
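Conceptually, compiling a prompt means substituting named variables into a stored template. The sketch below illustrates that idea with `{{variable}}` placeholders. It is an illustration, not the SDK's implementation, and `compile_template` is a hypothetical name.

```python
import re

# Matches {{ name }} placeholders, allowing optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, **variables: str) -> str:
    """Replace each {{name}} with the matching keyword argument."""

    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched so missing variables are easy to spot.
        return str(variables.get(name, match.group(0)))

    return _PLACEHOLDER.sub(substitute, template)
```

Leaving unresolved placeholders visible in the output is a deliberate choice here: a stray `{{message}}` in a rendered prompt is easier to catch in a trace than a silently empty string.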
The Four Dashboards I Live In
1. Cost per tenant per day: catches runaway loops within hours
2. Latency p50/p95/p99: detects degradation before user complaints
3. Quality scores trend: faithfulness, relevance, user thumbs
4. Error rate by model: when OpenAI degrades, you see it first
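For the latency dashboard, the percentiles can be reproduced from raw durations with the standard library. A minimal sketch, assuming latencies are collected as a list of milliseconds (`latency_percentiles` is a hypothetical helper):

```python
from statistics import quantiles


def latency_percentiles(latencies_ms: list) -> dict:
    """Return p50, p95, and p99 for a list of latencies (needs at least 2 samples)."""
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking p95 and p99 alongside p50 matters because a slow tail can hide behind a healthy median.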