Evaluating RAG Systems: Beyond "Looks Good" with Ragas
How to evaluate RAG quality rigorously. Faithfulness, answer relevance, context precision, context recall — using Ragas to catch regressions before users do.
"Looks Good" Is Not a Test Strategy
Most teams ship RAG and pray. Then a user finds a hallucination. Then the team panics and tweaks something. Then quality regresses elsewhere. Sound familiar?
You can't improve what you don't measure. Ragas gives you four metrics that catch 90% of RAG quality issues.
The Four Core Metrics
| Metric | What it measures |
|---|---|
| Faithfulness | Does the answer only use facts from retrieved context? |
| Answer Relevance | Does the answer actually address the question? |
| Context Precision | Are the retrieved chunks relevant (not noise)? |
| Context Recall | Did we retrieve all needed information? |
Setting Up Ragas
from ragas import evaluate
from ragas.metrics import (
faithfulness, answer_relevancy,
context_precision, context_recall,
)
from datasets import Dataseteval_set = Dataset.from_list([ { "question": "What is our return policy?", "answer": rag_response.answer, "contexts": rag_response.retrieved_chunks, "ground_truth": "30 day return window with receipt", }, # ... 100 more ])
result = evaluate( eval_set, metrics=[faithfulness, answer_relevancy, context_precision, context_recall], ) print(result) ```
Building Your Eval Set
Two sources:
1. Synthetic from your docs:
``python
from ragas.testset import TestsetGenerator
generator = TestsetGenerator.from_langchain(...)
testset = generator.generate_with_langchain_docs(docs, test_size=100)
``
2. Real user queries with manually-annotated ground truth (this is the gold standard).
CI Integration
Every PR runs the eval set. Fail if any metric drops > 3%:
# .github/workflows/rag-eval.yml
- name: Run RAG evaluation
run: |
python scripts/eval_rag.py --threshold-file thresholds.json
- name: Comment PR with results
uses: marocchino/sticky-pull-request-comment@v2
with:
message: |
## RAG Eval Results
| Metric | Score | Δ vs main |
|--------|-------|-----------|
| Faithfulness | ${{ env.FAITHFULNESS }} | ${{ env.FAITH_DELTA }} |
| Answer Relevance | ${{ env.ANSWER_REL }} | ${{ env.ANSWER_DELTA }} |
What Each Score Tells You
- Faithfulness < 0.85 → Hallucination problem. The LLM is making things up. Lower temperature, add "if you don't know, say so" instructions.
- Answer Relevance < 0.80 → Answer drifts off-topic. Tighten your generation prompt.
- Context Precision < 0.70 → Retrieval brings noise. Add reranking, raise score threshold.
- Context Recall < 0.80 → Missing relevant chunks. Increase k, improve chunking, fine-tune embeddings.
My Production Targets
| Metric | Threshold |
|---|---|
| Faithfulness | ≥ 0.90 |
| Answer Relevance | ≥ 0.85 |
| Context Precision | ≥ 0.75 |
| Context Recall | ≥ 0.80 |
Below those, the system is unsafe to ship.