Infrastructure · Intermediate · 2026-02-17 · 9 min read

LLM Observability with LangFuse: Traces, Costs & Quality at Scale

How to instrument production LLM applications with LangFuse. Traces, scoring, cost attribution, prompt management, and the dashboards that let you ship AI confidently.

You Can't Ship What You Can't See

Most production LLM problems (hallucinations, cost spikes, latency spikes, prompt regressions) are invisible without observability. LangFuse is an open-source observability platform built for exactly this, and it is what I use across all Hureka projects.

Self-Host or Cloud

For most teams, self-hosted LangFuse is fine:

yaml
services:
  langfuse:
    image: langfuse/langfuse:2  # v2 runs on Postgres alone; v3 and later also need ClickHouse, Redis, and blob storage
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://postgres:secret@db:5432/langfuse
      NEXTAUTH_SECRET: <generate>
      SALT: <generate>
      NEXTAUTH_URL: http://localhost:3000
    depends_on: [db]
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: langfuse
    volumes: [pg_data:/var/lib/postgresql/data]
volumes:
  pg_data:
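
The two placeholder values in the compose file (`NEXTAUTH_SECRET` and `SALT`) need random strings. A minimal sketch of producing them in Python, assuming a base64-encoded 32-byte value is acceptable for both; `generate_secret` is a hypothetical helper, not part of LangFuse:

```python
import base64
import secrets


def generate_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Paste these into the compose file, or better, an .env file kept out of git.
    print(f"NEXTAUTH_SECRET: {generate_secret()}")
    print(f"SALT: {generate_secret()}")
```

The same applies to the Postgres password: `secret` is fine on a laptop, but anything reachable from a network should get a generated value too.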

Instrumenting Every Call

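python
import os

from anthropic import AsyncAnthropic
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.environ["LF_PUBLIC"],
    secret_key=os.environ["LF_SECRET"],
    host="http://localhost:3000",
)
anthropic = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# rag_retrieve() and build_prompt() are application helpers defined elsewhere.
async def chat(user_id: str, session_id: str, tenant_id: str, message: str) -> str:
    # One trace per request; tenant_id in metadata enables per-tenant grouping later.
    trace = langfuse.trace(
        name="chat",
        user_id=user_id,
        session_id=session_id,
        metadata={"tenant_id": tenant_id},
    )

    # Span around retrieval
    retrieval = trace.span(name="retrieve")
    chunks = await rag_retrieve(message)
    retrieval.end(output={"chunk_count": len(chunks)})

    # Generation around the model call
    gen = trace.generation(
        name="answer",
        model="claude-sonnet-4-6",
        input=build_prompt(message, chunks),
        metadata={"prompt_version": "v3"},
    )
    response = await anthropic.messages.create(...)
    answer = response.content[0].text
    gen.end(
        output=answer,
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
    )
    return answer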
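
One thing the example leaves open is what happens if `rag_retrieve` raises: the span would never be ended. A hedged sketch of a context manager that always closes the span, relying only on the `span(name=...)` and `end(output=...)` calls used in the example. `traced_span`, `FakeSpan`, and `FakeTrace` are hypothetical names; the two fakes are stand-ins so the snippet runs without a LangFuse server.

```python
import time
from contextlib import contextmanager


@contextmanager
def traced_span(trace, name: str):
    """Open a span on `trace`, yield a dict for its output, and always end the span."""
    span = trace.span(name=name)
    output: dict = {}
    started = time.perf_counter()
    try:
        yield output
    except Exception as exc:
        output["error"] = repr(exc)  # record the failure, then let it propagate
        raise
    finally:
        output["duration_ms"] = round((time.perf_counter() - started) * 1000, 2)
        span.end(output=output)


class FakeSpan:
    """Stand-in for a LangFuse span; records what end() received."""
    def __init__(self, name):
        self.name, self.ended_with = name, None

    def end(self, output=None):
        self.ended_with = output


class FakeTrace:
    """Stand-in for a LangFuse trace; only implements span()."""
    def __init__(self):
        self.spans = []

    def span(self, name):
        span = FakeSpan(name)
        self.spans.append(span)
        return span


trace = FakeTrace()
with traced_span(trace, "retrieve") as out:
    out["chunk_count"] = 3

try:
    with traced_span(trace, "retrieve") as out:
        raise TimeoutError("vector store unavailable")
except TimeoutError:
    pass  # the span was still ended, with the error recorded in its output
```

Inside `chat`, the retrieval step would become `with traced_span(trace, "retrieve") as out:` followed by the `await rag_retrieve(message)` call and `out["chunk_count"] = len(chunks)`. A synchronous context manager is fine here because it only wraps the awaited call and does not await anything itself.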

Scoring Quality

python
trace.score(name="helpfulness", value=0.85, comment="auto-eval LLM rating")
trace.score(name="faithfulness", value=0.92)
trace.score(name="user_thumbs", value=1)  # From UI feedback
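
The three scores come from different sources and, in practice, different scales. A small sketch of normalizing them before calling `trace.score`, assuming auto-eval ratings arrive on a 1 to 5 scale and thumbs feedback arrives as a boolean. Both are assumptions, and `rating_to_score` and `thumbs_to_score` are hypothetical helpers:

```python
def rating_to_score(rating: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a rating on [low, high] to [0, 1], clamping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(rating, low), high)
    return round((clamped - low) / (high - low), 4)


def thumbs_to_score(thumbs_up: bool) -> int:
    """Map UI feedback to 1 (thumbs up) or 0 (thumbs down)."""
    return 1 if thumbs_up else 0


# Keyword arguments ready to pass as trace.score(**payload)
payloads = [
    {"name": "helpfulness", "value": rating_to_score(4.4), "comment": "auto-eval LLM rating"},
    {"name": "user_thumbs", "value": thumbs_to_score(True)},
]
```

Keeping every metric on the same 0 to 1 range makes trends comparable across metrics and keeps relative thresholds meaningful.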

Cost Attribution Per Tenant

LangFuse automatically calculates cost from token usage. Group by tenant_id metadata:

sql
-- In LangFuse dashboard analytics
SELECT
  metadata->>'tenant_id' AS tenant,
  SUM(total_cost) AS cost_usd,
  COUNT(*) AS calls
FROM traces
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY tenant
ORDER BY cost_usd DESC;
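
For anyone working from exported traces rather than a query, the same grouping can be sketched in plain Python. The record shape below (a `metadata` dict and a `total_cost` field) mirrors the query but is an assumption about the export format, and `cost_by_tenant` is a hypothetical helper:

```python
from collections import defaultdict


def cost_by_tenant(traces: list[dict]) -> list[tuple[str, float, int]]:
    """Return (tenant, cost_usd, calls) rows sorted by cost, highest first."""
    cost: dict[str, float] = defaultdict(float)
    calls: dict[str, int] = defaultdict(int)
    for t in traces:
        # Traces without a tenant_id are grouped together rather than dropped.
        tenant = (t.get("metadata") or {}).get("tenant_id", "unknown")
        cost[tenant] += t.get("total_cost") or 0.0
        calls[tenant] += 1
    rows = [(tenant, round(cost[tenant], 6), calls[tenant]) for tenant in cost]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data for illustration
sample = [
    {"metadata": {"tenant_id": "acme"}, "total_cost": 0.25},
    {"metadata": {"tenant_id": "globex"}, "total_cost": 1.5},
    {"metadata": {"tenant_id": "acme"}, "total_cost": 0.5},
    {"metadata": None, "total_cost": None},
]
report = cost_by_tenant(sample)
```

The `unknown` bucket is worth watching: if it grows, some code path is creating traces without the `tenant_id` metadata.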

Alerting on Regressions

python
# Daily cron. avg_score() and alert_slack() are helpers defined elsewhere.
yesterday_avg = avg_score(metric="faithfulness", days=1)
baseline = avg_score(metric="faithfulness", days=30, offset_days=1)

# Alert on a relative drop of more than 5%
if yesterday_avg < baseline * 0.95:
    alert_slack(f"Faithfulness dropped: {yesterday_avg:.2f} vs {baseline:.2f}")
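
`avg_score` and `alert_slack` are not defined in the snippet. One possible shape for the check is sketched here over an in-memory list of `(day, value)` pairs so it runs without a LangFuse server; in practice the values would come from wherever scores are exported. The signature differs from the call in the snippet (it takes the scores and the current date explicitly), the window semantics are my reading of `days` and `offset_days`, and the data is made up for illustration.

```python
from datetime import date, timedelta
from statistics import mean


def avg_score(scores, today, days, offset_days=0):
    """Mean over a `days`-day window whose newest day is yesterday minus `offset_days`.

    `scores` is an iterable of (date, float). Returns None if the window is empty.
    """
    end = today - timedelta(days=1 + offset_days)  # newest day included
    start = end - timedelta(days=days - 1)         # oldest day included
    window = [value for day, value in scores if start <= day <= end]
    return mean(window) if window else None


def has_regressed(recent, baseline, tolerance=0.05):
    """True when `recent` is more than `tolerance` (relative) below `baseline`."""
    if recent is None or baseline is None:
        return False  # not enough data to judge
    return recent < baseline * (1 - tolerance)


today = date(2026, 2, 17)
# Illustrative data: 30 steady days at 0.90, then a drop to 0.80 yesterday.
scores = [(today - timedelta(days=d), 0.90) for d in range(2, 32)]
scores.append((today - timedelta(days=1), 0.80))

yesterday_avg = avg_score(scores, today, days=1)
baseline = avg_score(scores, today, days=30, offset_days=1)
regressed = has_regressed(yesterday_avg, baseline)
```

Returning `None` for an empty window matters on low-traffic days: without it, a day with no scored traces would either crash the cron job or trigger a false alert.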

Prompt Management

LangFuse stores prompts as numbered versions, and labels such as production control which version is served:

python
prompt = langfuse.get_prompt("classify_intent", label="production")
rendered = prompt.compile(domain="healthcare", message=user_text)

response = await llm.complete(rendered, langfuse_prompt=prompt)
# Now every generation links to the exact prompt version used
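
To make the `compile` step concrete: LangFuse prompt templates use `{{variable}}` placeholders, and compiling substitutes values into them. The function below is a simplified stand-in written for illustration, not LangFuse's implementation, and the template text is made up:

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, **variables: str) -> str:
    """Replace {{name}} placeholders; leave unknown placeholders untouched."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)
    return _PLACEHOLDER.sub(substitute, template)


template = "Classify the intent of this {{domain}} message: {{message}}"
rendered = compile_template(template, domain="healthcare", message="I need a refill")
```

Because the template lives in LangFuse rather than in the codebase, changing the wording becomes a matter of moving the `production` label to a new version instead of a deploy.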

The Four Dashboards I Live In

  1. Cost per tenant per day: catches runaway loops within hours
  2. Latency p50/p95/p99: detects degradation before users complain
  3. Quality score trends: faithfulness, relevance, user thumbs
  4. Error rate by model: when OpenAI degrades, you see it first
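
As an illustration of the latency dashboard, here is a sketch of computing p50/p95/p99 from raw request latencies with the nearest-rank method. This is one common percentile definition, not necessarily the one LangFuse's dashboards use, and `percentile` and `latency_percentiles` are hypothetical helpers:

```python
def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not sorted_values:
        raise ValueError("no data")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    # Integer ceiling of p * n / 100, avoiding floating-point rounding surprises
    rank = -(-p * len(sorted_values) // 100)
    return sorted_values[rank - 1]


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return the p50, p95, and p99 latency for a batch of requests."""
    ordered = sorted(latencies_ms)
    return {f"p{p}": percentile(ordered, p) for p in (50, 95, 99)}


# 100 requests with latencies of 1 ms, 2 ms, ..., 100 ms
stats = latency_percentiles(list(range(1, 101)))
```

The gap between p50 and p99 is usually the interesting number: a stable median with a climbing p99 points at a slow subset of requests, such as long prompts or one tenant's retrieval index, rather than a general slowdown.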
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.