RAG Systems · Intermediate · 2026-04-04 · 9 min read

Evaluating RAG Systems: Beyond "Looks Good" with Ragas

How to evaluate RAG quality rigorously. Faithfulness, answer relevance, context precision, context recall — using Ragas to catch regressions before users do.

"Looks Good" Is Not a Test Strategy

Most teams ship RAG and pray. Then a user finds a hallucination. Then the team panics and tweaks something. Then quality regresses elsewhere. Sound familiar?

You can't improve what you don't measure. Ragas gives you four metrics that catch 90% of RAG quality issues.

The Four Core Metrics

| Metric | What it measures |
|--------|------------------|
| Faithfulness | Does the answer only use facts from retrieved context? |
| Answer Relevance | Does the answer actually address the question? |
| Context Precision | Are the retrieved chunks relevant (not noise)? |
| Context Recall | Did we retrieve all needed information? |
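To make the table concrete, here is a toy sketch of the ratios behind three of the metrics. Ragas derives the per-claim and per-chunk verdicts with an LLM judge; the hand-labelled booleans and the function names below are illustrative assumptions, not the Ragas API. Answer relevance is left out because it relies on embedding similarity rather than a simple count.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Share of items judged True; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def rank_weighted_precision(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Faithfulness: 3 of the 4 claims in the answer are backed by the context.
faithfulness_toy = supported_fraction([True, True, False, True])

# Context recall: 1 of the 2 ground-truth claims is found in the context.
context_recall_toy = supported_fraction([True, False])

# Context precision: relevant chunks at ranks 1 and 3, noise at rank 2.
context_precision_toy = rank_weighted_precision([True, False, True])

print(faithfulness_toy, context_recall_toy, round(context_precision_toy, 3))
```

Note how the noisy chunk at rank 2 pulls context precision down even though two of the three chunks are relevant: ordering matters, not just the count.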

Setting Up Ragas

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)
from datasets import Dataset

# rag_response is the output of your own RAG pipeline for this question.
eval_set = Dataset.from_list([
    {
        "question": "What is our return policy?",
        "answer": rag_response.answer,
        "contexts": rag_response.retrieved_chunks,  # list of chunk strings
        "ground_truth": "30 day return window with receipt",
    },
    # ... 100 more
])

result = evaluate(
    eval_set,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```
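The setup code assumes a `rag_response` object already exists for each question. One way to produce the rows in bulk is sketched below. `RagResponse`, `build_eval_rows`, and `fake_rag` are hypothetical names standing in for your own pipeline; only the four column names come from the setup code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RagResponse:
    """Hypothetical shape of a pipeline result (not a Ragas type)."""
    answer: str
    retrieved_chunks: list[str]


def build_eval_rows(
    cases: list[dict],
    run_rag: Callable[[str], RagResponse],
) -> list[dict]:
    """Run the pipeline once per case and collect rows in the eval-set schema."""
    rows = []
    for case in cases:
        response = run_rag(case["question"])
        rows.append({
            "question": case["question"],
            "answer": response.answer,
            "contexts": list(response.retrieved_chunks),
            "ground_truth": case["ground_truth"],
        })
    return rows


def fake_rag(question: str) -> RagResponse:
    """Stand-in pipeline so the sketch runs without a model or vector store."""
    return RagResponse(
        answer="You can return items within 30 days with a receipt.",
        retrieved_chunks=["Returns are accepted within 30 days with receipt."],
    )


cases = [{
    "question": "What is our return policy?",
    "ground_truth": "30 day return window with receipt",
}]
rows = build_eval_rows(cases, fake_rag)
print(rows[0]["contexts"])
```

The resulting list can be passed to `Dataset.from_list(rows)` in place of the hand-written literal.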

Building Your Eval Set

Two sources:

1. Synthetic from your docs:

   ```python
   from ragas.testset import TestsetGenerator

   generator = TestsetGenerator.from_langchain(...)
   testset = generator.generate_with_langchain_docs(docs, test_size=100)
   ```

2. Real user queries with manually-annotated ground truth (this is the gold standard).
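For the second source, here is a minimal sketch of loading annotated queries, assuming they are stored one JSON object per line with `question` and `ground_truth` fields. The file layout and the `load_annotated_queries` helper are assumptions for illustration. Rows without a ground truth are skipped so unannotated queries can't silently dilute the eval set.

```python
import io
import json
from typing import TextIO


def load_annotated_queries(lines: TextIO) -> list[dict]:
    """Keep only records that have both a question and an annotated ground truth."""
    cases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log export
        record = json.loads(line)
        question = (record.get("question") or "").strip()
        ground_truth = (record.get("ground_truth") or "").strip()
        if question and ground_truth:
            cases.append({"question": question, "ground_truth": ground_truth})
    return cases


# In-memory stand-in for a JSONL file; the second query has no annotation yet.
sample = io.StringIO(
    '{"question": "What is our return policy?", '
    '"ground_truth": "30 day return window with receipt"}\n'
    '{"question": "Do you ship overseas?", "ground_truth": ""}\n'
)
annotated = load_annotated_queries(sample)
print(len(annotated))
```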

CI Integration

Every PR runs the eval set. Fail the build if any metric drops by more than 3%:

```yaml
# .github/workflows/rag-eval.yml
- name: Run RAG evaluation
  run: |
    python scripts/eval_rag.py --threshold-file thresholds.json
- name: Comment PR with results
  uses: marocchino/sticky-pull-request-comment@v2
  with:
    message: |
      ## RAG Eval Results
      | Metric | Score | Δ vs main |
      |--------|-------|-----------|
      | Faithfulness | ${{ env.FAITHFULNESS }} | ${{ env.FAITH_DELTA }} |
      | Answer Relevance | ${{ env.ANSWER_REL }} | ${{ env.ANSWER_DELTA }} |
```
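The workflow leaves the contents of `scripts/eval_rag.py` open. A minimal sketch of its regression check might look like the following. It assumes "drops > 3%" means an absolute drop of 0.03 against the scores recorded on main; `find_regressions` and `env_lines` are illustrative names, while `FAITHFULNESS` and `FAITH_DELTA` are the variables the PR comment step reads.

```python
MAX_DROP = 0.03  # assumption: absolute score drop, not a relative percentage


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    max_drop: float = MAX_DROP,
) -> dict[str, float]:
    """Return {metric: delta} for every metric that fell by more than max_drop."""
    regressions = {}
    for metric, base_score in baseline.items():
        delta = current[metric] - base_score
        if delta < -max_drop:
            regressions[metric] = round(delta, 4)
    return regressions


def env_lines(score_var: str, delta_var: str, score: float, delta: float) -> list[str]:
    """NAME=value lines; appending them to the file at $GITHUB_ENV exposes
    them as env.* values to later workflow steps."""
    return [f"{score_var}={score:.3f}", f"{delta_var}={delta:+.3f}"]


# Illustrative numbers, not real results.
baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
current = {"faithfulness": 0.86, "answer_relevancy": 0.84}

regressions = find_regressions(current, baseline)
print(regressions)
print(env_lines("FAITHFULNESS", "FAITH_DELTA", 0.86, -0.04))
```

A real script would exit with a non-zero status when `regressions` is non-empty, which is what makes the CI step fail.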

What Each Score Tells You

  • Faithfulness < 0.85 → Hallucination problem. The LLM is making things up. Lower temperature, add "if you don't know, say so" instructions.
  • Answer Relevance < 0.80 → Answer drifts off-topic. Tighten your generation prompt.
  • Context Precision < 0.70 → Retrieval brings noise. Add reranking, raise score threshold.
  • Context Recall < 0.80 → Missing relevant chunks. Increase k, improve chunking, fine-tune embeddings.
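These rules translate directly into a lookup table. The sketch below encodes them so an eval script can print the suggested fix next to each low score. The `diagnose` helper and the metric keys are illustrative names; the cutoffs and advice are taken from the list.

```python
# metric -> (cutoff, advice when the score falls below the cutoff)
DIAGNOSTICS = {
    "faithfulness": (
        0.85,
        'Hallucination problem. Lower temperature, add "if you don\'t know, say so" instructions.',
    ),
    "answer_relevancy": (
        0.80,
        "Answer drifts off-topic. Tighten your generation prompt.",
    ),
    "context_precision": (
        0.70,
        "Retrieval brings noise. Add reranking, raise score threshold.",
    ),
    "context_recall": (
        0.80,
        "Missing relevant chunks. Increase k, improve chunking, fine-tune embeddings.",
    ),
}


def diagnose(scores: dict[str, float]) -> list[str]:
    """Return one finding per metric that is below its diagnostic cutoff."""
    findings = []
    for metric, (cutoff, advice) in DIAGNOSTICS.items():
        score = scores.get(metric)
        if score is not None and score < cutoff:
            findings.append(f"{metric} = {score:.2f} (< {cutoff:.2f}): {advice}")
    return findings


# Illustrative scores: only retrieval precision is in trouble.
example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.88,
    "context_precision": 0.62,
    "context_recall": 0.83,
}
findings = diagnose(example_scores)
for finding in findings:
    print(finding)
```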

My Production Targets

| Metric | Threshold |
|--------|-----------|
| Faithfulness | ≥ 0.90 |
| Answer Relevance | ≥ 0.85 |
| Context Precision | ≥ 0.75 |
| Context Recall | ≥ 0.80 |

Below those, the system is unsafe to ship.
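These targets are a natural fit for the `thresholds.json` file passed to the eval script in the CI step. The format of that file is not specified in this post, so the flat metric-to-minimum mapping and the `below_target` helper below are assumptions.

```python
import json

# Hypothetical contents of thresholds.json, mirroring the table of targets.
THRESHOLDS_JSON = """
{
  "faithfulness": 0.90,
  "answer_relevancy": 0.85,
  "context_precision": 0.75,
  "context_recall": 0.80
}
"""


def below_target(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, float]:
    """Return the metrics scoring under their minimum; a missing metric counts as 0.0."""
    return {
        metric: scores.get(metric, 0.0)
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    }


thresholds = json.loads(THRESHOLDS_JSON)

# Illustrative scores: context recall sits exactly on its target and passes.
scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.87,
    "context_precision": 0.71,
    "context_recall": 0.80,
}
failing = below_target(scores, thresholds)
print(failing)
```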

Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.