RAG Evaluation & Testing
RAG Evaluation Metrics
Evaluating RAG systems requires metrics beyond traditional ML accuracy. You need to measure both retrieval quality and generation faithfulness.
The RAG Evaluation Challenge
Traditional text-overlap metrics such as BLEU reward shared wording rather than factual correctness, so they don't capture RAG-specific failures:
# Traditional metrics miss critical issues (BLEU values below are illustrative):

# Scenario 1: High BLEU score, but hallucinated facts
generated = "The company was founded in 2015 by John Smith"
reference = "The company was founded in 2015 by Jane Smith"
# BLEU: ~0.85 (looks good!)
# Reality: Wrong founder name (critical error)

# Scenario 2: Low BLEU score, but factually correct
generated = "Jane Smith established the business in 2015"
reference = "The company was founded in 2015 by Jane Smith"
# BLEU: ~0.42 (looks bad!)
# Reality: Same facts, different wording (acceptable)
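The effect in the two scenarios can be reproduced with a much simpler overlap measure. The sketch below is not BLEU itself; it uses clipped unigram precision (a hypothetical helper named `unigram_precision`) as a stand-in, on the assumption that whitespace tokenization is good enough for illustration. The sentence with the wrong founder still outscores the correct paraphrase.

```python
from collections import Counter


def unigram_precision(generated: str, reference: str) -> float:
    """Clipped unigram precision: share of generated tokens also found in the reference."""
    gen_tokens = generated.lower().split()
    if not gen_tokens:
        return 0.0
    gen_counts = Counter(gen_tokens)
    ref_counts = Counter(reference.lower().split())
    # Clip each token's count by how often it appears in the reference
    matched = sum(min(count, ref_counts[token]) for token, count in gen_counts.items())
    return matched / len(gen_tokens)


reference = "The company was founded in 2015 by Jane Smith"
wrong_fact = "The company was founded in 2015 by John Smith"
paraphrase = "Jane Smith established the business in 2015"

wrong_score = unigram_precision(wrong_fact, reference)       # 8 of 9 tokens match
paraphrase_score = unigram_precision(paraphrase, reference)  # 5 of 7 tokens match
print(f"{wrong_score:.2f} vs {paraphrase_score:.2f}")  # 0.89 vs 0.71
```

Word overlap ranks the factually wrong answer higher, which is why the rest of this lesson measures grounding and relevance directly.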
Component-Based Evaluation
RAG systems have three components to evaluate:
┌─────────────────────────────────────────────────────────────┐
│                  RAG Evaluation Framework                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  RETRIEVAL  │───▶│   CONTEXT   │───▶│  GENERATION │      │
│  │   QUALITY   │    │   QUALITY   │    │   QUALITY   │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                  │                  │             │
│         ▼                  ▼                  ▼             │
│  • Context Recall    • Relevance       • Faithfulness       │
│  • Context Precision • Noise Ratio     • Answer Relevancy   │
│  • MRR, NDCG         • Coverage        • Correctness        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Retrieval Metrics
Context Precision
Measures if retrieved documents are relevant:
def context_precision(retrieved_contexts: list, relevant_contexts: list) -> float:
    """
    What proportion of retrieved contexts are actually relevant?
    High precision = Few irrelevant documents retrieved
    Low precision = Many irrelevant documents polluting context
    """
    if not retrieved_contexts:
        return 0.0
    # Deduplicate so a repeated document is not counted twice in the denominator
    retrieved = set(retrieved_contexts)
    relevant_retrieved = retrieved & set(relevant_contexts)
    return len(relevant_retrieved) / len(retrieved)

# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]
precision = context_precision(retrieved, relevant)
# Result: 2/5 = 0.4 (Only 2 of 5 retrieved docs are relevant)
Context Recall
Measures if all relevant information was retrieved:
def context_recall(retrieved_contexts: list, relevant_contexts: list) -> float:
    """
    What proportion of relevant contexts were retrieved?
    High recall = All relevant information found
    Low recall = Missing important context
    """
    if not relevant_contexts:
        # Nothing to find, so nothing was missed
        return 1.0
    relevant_retrieved = set(retrieved_contexts) & set(relevant_contexts)
    return len(relevant_retrieved) / len(relevant_contexts)

# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]
recall = context_recall(retrieved, relevant)
# Result: 2/3 = 0.67 (Retrieved 2 of 3 relevant docs, missed doc7)
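Precision and recall pull in opposite directions: retrieving more documents tends to raise recall and lower precision. One common way to track both in a single number is the F1 score, the harmonic mean of the two. The sketch below is a minimal illustration on the same example data; `retrieval_f1` is a hypothetical helper, not part of any library.

```python
def retrieval_f1(retrieved: list[str], relevant: list[str]) -> dict:
    """Precision, recall, and their harmonic mean (F1) for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 1.0
    total = precision + recall
    # F1 is defined as 0 when both precision and recall are 0
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = retrieval_f1(
    ["doc1", "doc2", "doc3", "doc4", "doc5"],
    ["doc1", "doc3", "doc7"],
)
print({name: round(value, 2) for name, value in scores.items()})
# {'precision': 0.4, 'recall': 0.67, 'f1': 0.5}
```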
Mean Reciprocal Rank (MRR)
Measures how high the first relevant result ranks:
def mean_reciprocal_rank(queries_results: list[list], relevant_docs: list[set]) -> float:
    """
    Average of 1/rank for first relevant result per query.
    MRR = 1.0 means first result is always relevant
    MRR = 0.5 means, for example, the first relevant result is at rank 2 for every query
    """
    reciprocal_ranks = []
    for results, relevant in zip(queries_results, relevant_docs):
        for rank, doc in enumerate(results, 1):
            if doc in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            # The loop finished without finding a relevant document
            reciprocal_ranks.append(0.0)
    if not reciprocal_ranks:
        # No queries to score; avoid dividing by zero
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example
query_results = [
    ["doc2", "doc1", "doc3"],  # Query 1: relevant doc1 at rank 2
    ["doc5", "doc6", "doc7"],  # Query 2: no relevant docs
    ["doc8", "doc9", "doc4"],  # Query 3: relevant doc4 at rank 3
]
relevant = [{"doc1"}, {"doc10"}, {"doc4"}]
mrr = mean_reciprocal_rank(query_results, relevant)
# Result: (1/2 + 0 + 1/3) / 3 = 0.278
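The framework diagram also lists NDCG next to MRR. MRR only looks at the first relevant hit, while NDCG rewards every relevant document and discounts it by its rank. The sketch below assumes binary relevance (a document is either relevant or not) and unique document IDs per result list; `ndcg_at_k` is a hypothetical helper written for this lesson.

```python
import math


def ndcg_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """NDCG@k with binary relevance (gain 1 for a relevant doc, 0 otherwise)."""
    # DCG: each relevant hit is discounted by log2(rank + 1)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(results[:k], start=1)
        if doc in relevant
    )
    # Ideal DCG: all relevant docs (at most k) placed at the top ranks
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Same ranking as Query 1 in the MRR example: the relevant doc sits at rank 2
score = ndcg_at_k(["doc2", "doc1", "doc3"], {"doc1"}, k=3)
print(round(score, 3))  # 1 / log2(3) ≈ 0.631
```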
Generation Metrics
Faithfulness
Measures if the answer is grounded in retrieved context:
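def assess_faithfulness(answer: str, context: str) -> dict:
    """
    Faithfulness checks if every claim in the answer
    can be verified from the retrieved context.
    Uses LLM-as-judge approach.
    """
    # NOTE: call_llm(prompt) -> str is an assumed helper that sends a prompt
    # to your LLM and returns its text response; it is not defined here.

    # Step 1: Extract claims from the answer
    claims_prompt = f"""
    Extract all factual claims from this answer:
    Answer: {answer}
    List each claim on a new line.
    """
    claims = [line.strip() for line in call_llm(claims_prompt).splitlines() if line.strip()]
    if not claims:
        # No factual claims, so nothing can be unsupported
        return {"score": 1.0, "unsupported_claims": []}

    # Step 2: Verify each claim against context
    claims_text = "\n".join(claims)
    verify_prompt = f"""
    For each claim, determine if it can be verified from the context.
    Context: {context}
    Claims:
    {claims_text}
    Respond with one label per line, in the same order as the claims:
    - SUPPORTED: Claim is directly supported by context
    - NOT_SUPPORTED: Claim cannot be verified from context
    """
    labels = [line.strip() for line in call_llm(verify_prompt).splitlines() if line.strip()]

    # Step 3: Calculate faithfulness score
    # Faithfulness = supported_claims / total_claims
    # A claim with a missing or unrecognized label counts as unsupported
    unsupported_list = [
        claim for i, claim in enumerate(claims)
        if i >= len(labels) or labels[i].upper() != "SUPPORTED"
    ]
    supported_claims = len(claims) - len(unsupported_list)
    return {
        "score": supported_claims / len(claims),
        "unsupported_claims": unsupported_list
    }

# Example output
# {
#     "score": 0.75,  # 3 of 4 claims supported
#     "unsupported_claims": ["The company has 500 employees"]
# }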
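The scoring step (Step 3) does not need an LLM and can be tested on its own. The sketch below assumes the judge's output has already been parsed into (claim, label) pairs; `score_verdicts` and the sample claims are hypothetical, chosen to reproduce the example output above.

```python
def score_verdicts(verdicts: list[tuple[str, str]]) -> dict:
    """Turn (claim, label) pairs from the judge into a faithfulness result."""
    if not verdicts:
        # No claims were extracted, so nothing can be unsupported
        return {"score": 1.0, "unsupported_claims": []}
    # Anything other than an exact SUPPORTED label counts as unsupported
    unsupported = [
        claim for claim, label in verdicts
        if label.strip().upper() != "SUPPORTED"
    ]
    score = (len(verdicts) - len(unsupported)) / len(verdicts)
    return {"score": score, "unsupported_claims": unsupported}


# Hypothetical judge output for a four-claim answer
verdicts = [
    ("The company was founded in 2015", "SUPPORTED"),
    ("Jane Smith founded the company", "SUPPORTED"),
    ("The company sells software", "SUPPORTED"),
    ("The company has 500 employees", "NOT_SUPPORTED"),
]
result = score_verdicts(verdicts)
print(result)
# {'score': 0.75, 'unsupported_claims': ['The company has 500 employees']}
```

Treating unrecognized labels as unsupported is a conservative design choice: a malformed judge response lowers the score instead of silently inflating it.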
Answer Relevancy
Measures if the answer addresses the question:
def assess_answer_relevancy(question: str, answer: str) -> float:
    """
    Answer relevancy checks if the answer actually
    addresses what was asked.
    Approach: Generate questions from the answer,
    compare semantic similarity to original question.
    """
    # NOTE: generate_questions_from_answer, embed, and cosine_similarity are
    # assumed helpers (an LLM call, an embedding model, and a vector
    # similarity function); they are not defined here.

    # Generate questions that the answer would address
    generated_questions = generate_questions_from_answer(answer, n=3)
    if not generated_questions:
        return 0.0
    # Embed the original question once and reuse it
    question_embedding = embed(question)
    # Compare each generated question to original
    similarities = []
    for gen_q in generated_questions:
        sim = cosine_similarity(question_embedding, embed(gen_q))
        similarities.append(sim)
    return sum(similarities) / len(similarities)

# Example
question = "What is the capital of France?"
answer = "Paris is the capital of France, located on the Seine River."
# Generated questions from answer:
# - "What is the capital of France?"
# - "Where is Paris located?"
# - "Which river runs through Paris?"
# Similarity to original (illustrative values): [0.95, 0.3, 0.2]
# Relevancy score: (0.95 + 0.3 + 0.2) / 3 = 0.48
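To see the averaging step run end to end without an embedding model, the sketch below swaps in a toy bag-of-words vector for `embed` and a plain cosine similarity. Both are stand-ins written for illustration; a real system would use a learned embedding model, and the scores here will not match the illustrative values above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity(question: str, generated_questions: list[str]) -> float:
    """Average similarity between the question and each generated question."""
    if not generated_questions:
        return 0.0
    question_vec = embed(question)
    sims = [cosine_similarity(question_vec, embed(q)) for q in generated_questions]
    return sum(sims) / len(sims)


question = "What is the capital of France?"
generated = [
    "What is the capital of France?",
    "Where is Paris located?",
    "Which river runs through Paris?",
]
score = mean_similarity(question, generated)
print(round(score, 2))
```

Only the first generated question matches the original closely, so the extra detail about the Seine pulls the average down, the same effect as in the example above.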
Metric Selection Guide
| Metric | Measures | Use When |
|---|---|---|
| Context Precision | Retrieval accuracy | You have ground truth labels |
| Context Recall | Retrieval coverage | Missing info causes failures |
| MRR | Ranking quality | Top results matter most |
| Faithfulness | Grounding in retrieved context | Accuracy is critical |
| Answer Relevancy | Response quality | Answers seem off-topic |
Combined Scoring
from typing import Optional

def rag_quality_score(
    context_precision: float,
    context_recall: float,
    faithfulness: float,
    answer_relevancy: float,
    weights: Optional[dict] = None
) -> float:
    """
    Weighted combination of RAG metrics.
    Adjust weights based on your priorities.
    Weights should sum to 1.0 so the score stays in the 0-1 range.
    """
    weights = weights or {
        "context_precision": 0.2,
        "context_recall": 0.2,
        "faithfulness": 0.4,  # Usually most important
        "answer_relevancy": 0.2
    }
    score = (
        weights["context_precision"] * context_precision +
        weights["context_recall"] * context_recall +
        weights["faithfulness"] * faithfulness +
        weights["answer_relevancy"] * answer_relevancy
    )
    return score

# Example
quality = rag_quality_score(
    context_precision=0.8,
    context_recall=0.7,
    faithfulness=0.9,
    answer_relevancy=0.85
)
# Result: 0.2*0.8 + 0.2*0.7 + 0.4*0.9 + 0.2*0.85 = 0.83
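Custom weights are easier to reason about as relative priorities ("faithfulness counts double") than as fractions that must add up to 1. The sketch below is one possible variant that rescales the weights before combining; `weighted_score` is a hypothetical helper, and the metric values are the same as in the example above.

```python
def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric values; weights are rescaled to sum to 1."""
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"Missing metric values: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Weights must sum to a positive number")
    return sum(weights[name] * metrics[name] for name in weights) / total


metrics = {
    "context_precision": 0.8,
    "context_recall": 0.7,
    "faithfulness": 0.9,
    "answer_relevancy": 0.85,
}
# Relative priorities: faithfulness counts twice as much as the others
weights = {
    "context_precision": 1,
    "context_recall": 1,
    "faithfulness": 2,
    "answer_relevancy": 1,
}
score = weighted_score(metrics, weights)
print(round(score, 2))  # (0.8 + 0.7 + 2*0.9 + 0.85) / 5 = 0.83
```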
Key Insight: Faithfulness is typically the most critical metric for production RAG systems. Users can tolerate slightly off-topic answers, but hallucinated facts destroy trust.
Next, let's implement these metrics using the RAGAS framework.