RAG Evaluation Metrics

Evaluating RAG systems requires metrics beyond traditional ML accuracy. You need to measure both retrieval quality and generation faithfulness.

The RAG Evaluation Challenge

Traditional text-overlap metrics such as BLEU score surface wording rather than factual correctness, so they miss RAG-specific failures:

# Traditional metrics miss critical issues (BLEU values below are illustrative):

# Scenario 1: High BLEU score, but hallucinated facts
generated = "The company was founded in 2015 by John Smith"
reference = "The company was founded in 2015 by Jane Smith"
# BLEU: 0.85 (looks good!)
# Reality: Wrong founder name (critical error)

# Scenario 2: Low BLEU score, but factually correct
generated = "Jane Smith established the business in 2015"
reference = "The company was founded in 2015 by Jane Smith"
# BLEU: 0.42 (looks bad!)
# Reality: Same facts, different wording (acceptable)
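
To make the two scenarios concrete, here is a minimal sketch of clipped unigram precision, a simplified stand-in for BLEU (real BLEU also combines higher-order n-grams and a brevity penalty). The helper name `unigram_precision` is hypothetical, and the numbers it produces will differ from the BLEU values quoted above; the point is only the ordering of the two scores.

```python
import re
from collections import Counter


def unigram_precision(generated: str, reference: str) -> float:
    """Fraction of generated tokens that also appear in the reference (clipped counts)."""
    gen_tokens = re.findall(r"[a-z0-9]+", generated.lower())
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    if not gen_tokens:
        return 0.0
    gen_counts = Counter(gen_tokens)
    # Clip each token's count by how often it occurs in the reference.
    overlap = sum(min(count, ref_counts[token]) for token, count in gen_counts.items())
    return overlap / len(gen_tokens)


reference = "The company was founded in 2015 by Jane Smith"
wrong_founder = unigram_precision("The company was founded in 2015 by John Smith", reference)
paraphrase = unigram_precision("Jane Smith established the business in 2015", reference)

# The factually wrong sentence overlaps more with the reference
# than the factually correct paraphrase does.
print(f"wrong founder: {wrong_founder:.2f}, paraphrase: {paraphrase:.2f}")
```

Only one token separates the wrong answer from the reference, so any overlap-based score rewards it, while the correct paraphrase is penalized for its different wording.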

Component-Based Evaluation

RAG systems have three components to evaluate:

┌─────────────────────────────────────────────────────────────┐
│                    RAG Evaluation Framework                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │  RETRIEVAL  │───▶│   CONTEXT   │───▶│ GENERATION  │     │
│  │   QUALITY   │    │   QUALITY   │    │   QUALITY   │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│        │                  │                  │              │
│        ▼                  ▼                  ▼              │
│  • Context Recall    • Relevance       • Faithfulness      │
│  • Context Precision • Noise Ratio     • Answer Relevancy  │
│  • MRR, NDCG        • Coverage         • Correctness       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Retrieval Metrics

Context Precision

Measures if retrieved documents are relevant:

def context_precision(retrieved_contexts: list, relevant_contexts: list) -> float:
    """
    What proportion of retrieved contexts are actually relevant?

    High precision = Few irrelevant documents retrieved
    Low precision = Many irrelevant documents polluting context
    """
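    if not retrieved_contexts:
        return 0.0

    relevant_set = set(relevant_contexts)
    # Count hits per retrieved item so that duplicates in the retrieved
    # list are treated consistently in numerator and denominator.
    hits = sum(1 for ctx in retrieved_contexts if ctx in relevant_set)
    return hits / len(retrieved_contexts)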

# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]

precision = context_precision(retrieved, relevant)
# Result: 2/5 = 0.4 (Only 2 of 5 retrieved docs are relevant)
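
Because only the top few contexts usually reach the prompt, it can help to look at precision over a cutoff as well. The following is a minimal sketch of precision@k; the function name `precision_at_k` and the choice to divide by the number of contexts actually returned (rather than by k) are assumptions made for this illustration.

```python
def precision_at_k(retrieved_contexts: list, relevant_contexts: list, k: int) -> float:
    """Precision computed over only the first k retrieved contexts."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_contexts[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant_contexts)
    hits = sum(1 for ctx in top_k if ctx in relevant_set)
    # Assumption: divide by the contexts actually returned, which may be fewer than k.
    return hits / len(top_k)


retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]

for k in (1, 3, 5):
    print(k, precision_at_k(retrieved, relevant, k))
```

With the same data as the example above, precision falls as k grows because the later results (doc4, doc5) are irrelevant.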

Context Recall

Measures if all relevant information was retrieved:

def context_recall(retrieved_contexts: list, relevant_contexts: list) -> float:
    """
    What proportion of relevant contexts were retrieved?

    High recall = All relevant information found
    Low recall = Missing important context
    """
    relevant_set = set(relevant_contexts)  # de-duplicate the ground truth

    if not relevant_set:
        # Convention: with nothing to find, recall counts as perfect.
        return 1.0

    return len(set(retrieved_contexts) & relevant_set) / len(relevant_set)

# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc7"]

recall = context_recall(retrieved, relevant)
# Result: 2/3 = 0.67 (Retrieved 2 of 3 relevant docs, missed doc7)
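
Precision and recall pull in opposite directions: retrieving more documents tends to raise recall and lower precision. If a single retrieval number is needed, one common option is the harmonic mean (F1). The sketch below is an addition to the lesson's metric set, and the name `context_f1` is hypothetical.

```python
def context_f1(precision: float, recall: float) -> float:
    """Harmonic mean of context precision and context recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the examples above: precision = 2/5, recall = 2/3.
f1 = context_f1(2 / 5, 2 / 3)
print(round(f1, 2))
```

The harmonic mean stays low unless both inputs are reasonably high, so a retriever cannot score well by maximizing only one of them.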

Mean Reciprocal Rank (MRR)

Measures how high the first relevant result ranks:

def mean_reciprocal_rank(queries_results: list[list], relevant_docs: list[set]) -> float:
    """
    Average of 1/rank for first relevant result per query.

    MRR = 1.0 means the first result is relevant for every query
    MRR = 0.5 means, for example, the first relevant result sits at rank 2
    for every query (a mix of rank-1 hits and misses can average to 0.5 too)
    """
    reciprocal_ranks = []

    for results, relevant in zip(queries_results, relevant_docs):
        for rank, doc in enumerate(results, 1):
            if doc in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)

    if not reciprocal_ranks:
        return 0.0  # no queries to score; avoid division by zero

    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example
query_results = [
    ["doc2", "doc1", "doc3"],  # Query 1: relevant doc1 at rank 2
    ["doc5", "doc6", "doc7"],  # Query 2: no relevant docs
    ["doc8", "doc9", "doc4"],  # Query 3: relevant doc4 at rank 3
]
relevant = [{"doc1"}, {"doc10"}, {"doc4"}]

mrr = mean_reciprocal_rank(query_results, relevant)
# Result: (1/2 + 0 + 1/3) / 3 = 0.278
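
The framework diagram also lists NDCG next to MRR. MRR looks only at the first relevant hit, whereas NDCG rewards every relevant document according to its rank. Below is a minimal sketch for binary relevance (a document is either relevant or not); the function name `ndcg_at_k` is hypothetical, and the code assumes each result list contains no duplicate documents.

```python
import math


def ndcg_at_k(results: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG over the top k results."""
    # DCG: each relevant hit at 1-based rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(results[:k], 1)
        if doc in relevant
    )
    # Ideal DCG: all relevant docs (up to k of them) packed into the top ranks.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Same three queries as the MRR example above.
scores = [
    ndcg_at_k(["doc2", "doc1", "doc3"], {"doc1"}, k=3),
    ndcg_at_k(["doc5", "doc6", "doc7"], {"doc10"}, k=3),
    ndcg_at_k(["doc8", "doc9", "doc4"], {"doc4"}, k=3),
]
print([round(s, 3) for s in scores])
```

With one relevant document per query, NDCG and reciprocal rank agree on the ordering of queries but use a logarithmic rather than a reciprocal discount.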

Generation Metrics

Faithfulness

Measures if the answer is grounded in retrieved context:

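from typing import Callable


def assess_faithfulness(answer: str, context: str, llm: Callable[[str], str]) -> dict:
    """
    Faithfulness checks if every claim in the answer
    can be verified from the retrieved context.

    Uses LLM-as-judge approach. `llm` is an assumed callable that takes
    a prompt string and returns the model's text response; plug in
    whichever client you use.
    """
    # Step 1: Extract claims from the answer
    claims_prompt = f"""
    Extract all factual claims from this answer:
    Answer: {answer}

    List each claim on a new line.
    """
    claims = [line.strip() for line in llm(claims_prompt).splitlines() if line.strip()]

    # Step 2: Verify each claim against context
    claims_text = "\n".join(claims)
    verify_prompt = f"""
    For each claim, determine if it can be verified from the context.

    Context: {context}
    Claims:
    {claims_text}

    Respond with exactly one verdict per line, in claim order:
    - SUPPORTED: Claim is directly supported by context
    - NOT_SUPPORTED: Claim cannot be verified from context
    """
    verdicts = [v.strip().upper() for v in llm(verify_prompt).splitlines() if v.strip()]

    # Step 3: Calculate faithfulness score
    # Faithfulness = supported_claims / total_claims
    # A claim with a missing or unrecognized verdict counts as unsupported.
    unsupported_list = [
        claim for i, claim in enumerate(claims)
        if i >= len(verdicts)
        or "NOT_SUPPORTED" in verdicts[i]
        or "SUPPORTED" not in verdicts[i]
    ]
    total_claims = len(claims)
    supported_claims = total_claims - len(unsupported_list)

    return {
        # The score is undefined with no claims; 0.0 is a conservative placeholder.
        "score": supported_claims / total_claims if total_claims else 0.0,
        "unsupported_claims": unsupported_list
    }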

# Example output
# {
#     "score": 0.75,  # 3 of 4 claims supported
#     "unsupported_claims": ["The company has 500 employees"]
# }
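
The scoring step can be exercised without calling a model. The sketch below parses a hard-coded judge response and computes the score; the `claim | VERDICT` line format, the function name `score_verdicts`, and the sample claims are all assumptions made for illustration, not the output format of any particular judge model.

```python
def score_verdicts(judge_output: str) -> dict:
    """Turn judge lines of the form 'claim | VERDICT' into a faithfulness result."""
    supported, unsupported = [], []
    for line in judge_output.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        claim, verdict = (part.strip() for part in line.rsplit("|", 1))
        if verdict.upper() == "SUPPORTED":
            supported.append(claim)
        else:
            unsupported.append(claim)
    total = len(supported) + len(unsupported)
    return {
        # Undefined with no claims; 0.0 is used as a conservative placeholder.
        "score": len(supported) / total if total else 0.0,
        "unsupported_claims": unsupported,
    }


# Hard-coded stand-in for a judge model's response (hypothetical claims).
judge_output = """
The company was founded in 2015 | SUPPORTED
Jane Smith founded the company | SUPPORTED
The company is based in Austin | SUPPORTED
The company has 500 employees | NOT_SUPPORTED
"""
result = score_verdicts(judge_output)
print(result)
```

Keeping the parsing separate from the model call makes the scoring logic unit-testable, which matters because judge output formats tend to drift.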

Answer Relevancy

Measures if the answer addresses the question:

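import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assess_answer_relevancy(
    question: str,
    answer: str,
    generate_questions_from_answer: Callable[..., list],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """
    Answer relevancy checks if the answer actually
    addresses what was asked.

    Approach: Generate questions from the answer,
    compare semantic similarity to original question.

    `generate_questions_from_answer` (an LLM call) and `embed`
    (an embedding model) are assumed helpers that you supply.
    """
    # Generate questions that the answer would address
    generated_questions = generate_questions_from_answer(answer, n=3)
    if not generated_questions:
        return 0.0

    # Embed the original question once, then compare each generated question
    question_vec = embed(question)
    similarities = [
        cosine_similarity(question_vec, embed(gen_q))
        for gen_q in generated_questions
    ]

    return sum(similarities) / len(similarities)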

# Example
question = "What is the capital of France?"
answer = "Paris is the capital of France, located on the Seine River."

# Generated questions from answer:
# - "What is the capital of France?"
# - "Where is Paris located?"
# - "Which river runs through Paris?"

# Similarity to original (illustrative): [1.0, 0.3, 0.2]
# The first generated question is identical to the original, so it scores 1.0.
# Relevancy score: 0.50
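
To run the comparison step end to end without an embedding model, the sketch below substitutes a toy bag-of-words vector for `embed`. This is an assumption made purely for illustration: word counts capture lexical overlap rather than meaning, so the similarities will not match those of a real embedding model. The three generated questions are copied from the example above instead of being produced by an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (token -> count); a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


question = "What is the capital of France?"
# Stand-in for LLM-generated questions.
generated_questions = [
    "What is the capital of France?",
    "Where is Paris located?",
    "Which river runs through Paris?",
]

question_vec = embed(question)
similarities = [cosine_similarity(question_vec, embed(q)) for q in generated_questions]
relevancy = sum(similarities) / len(similarities)
print([round(s, 2) for s in similarities], round(relevancy, 2))
```

The extra detail about the Seine pulls the average down even though the answer fully addresses the question, which is worth keeping in mind when setting a relevancy threshold.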

Metric Selection Guide

Metric             Measures                  Use When
Context Precision  Retrieval accuracy        You have ground truth labels
Context Recall     Retrieval coverage        Missing info causes failures
MRR                Ranking quality           Top results matter most
Faithfulness       Hallucination prevention  Accuracy is critical
Answer Relevancy   Response quality          Answers seem off-topic

Combined Scoring

def rag_quality_score(
    context_precision: float,
    context_recall: float,
    faithfulness: float,
    answer_relevancy: float,
    weights: dict = None
) -> float:
    """
    Weighted combination of RAG metrics.
    Adjust weights based on your priorities.
    """
    weights = weights or {
        "context_precision": 0.2,
        "context_recall": 0.2,
        "faithfulness": 0.4,  # Usually most important
        "answer_relevancy": 0.2
    }

    score = (
        weights["context_precision"] * context_precision +
        weights["context_recall"] * context_recall +
        weights["faithfulness"] * faithfulness +
        weights["answer_relevancy"] * answer_relevancy
    )

    return score

# Example
quality = rag_quality_score(
    context_precision=0.8,
    context_recall=0.7,
    faithfulness=0.9,
    answer_relevancy=0.85
)
# Result: 0.2*0.8 + 0.2*0.7 + 0.4*0.9 + 0.2*0.85 = 0.83

Key Insight: Faithfulness is typically the most critical metric for production RAG systems. Users can tolerate slightly off-topic answers, but hallucinated facts destroy trust.
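
One consequence of this insight is that a weighted average can hide a faithfulness problem behind strong retrieval scores. A possible safeguard, offered here as an assumption rather than as part of the scoring function above, is to require a faithfulness floor in addition to a minimum combined score. The names `weighted_score` and `passes_quality_gate` and both threshold values are hypothetical.

```python
DEFAULT_WEIGHTS = {
    "context_precision": 0.2,
    "context_recall": 0.2,
    "faithfulness": 0.4,
    "answer_relevancy": 0.2,
}


def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of the metrics named in `weights`."""
    return sum(weights[name] * metrics[name] for name in weights)


def passes_quality_gate(
    metrics: dict,
    weights: dict,
    min_faithfulness: float = 0.8,  # hypothetical threshold
    min_combined: float = 0.7,      # hypothetical threshold
) -> bool:
    """Require both a faithfulness floor and a minimum combined score."""
    if metrics["faithfulness"] < min_faithfulness:
        return False
    return weighted_score(metrics, weights) >= min_combined


healthy = {"context_precision": 0.8, "context_recall": 0.7,
           "faithfulness": 0.9, "answer_relevancy": 0.85}
hallucinating = {"context_precision": 1.0, "context_recall": 1.0,
                 "faithfulness": 0.5, "answer_relevancy": 1.0}

print(passes_quality_gate(healthy, DEFAULT_WEIGHTS))
print(passes_quality_gate(hallucinating, DEFAULT_WEIGHTS))
```

The second sample reaches a combined score of about 0.8 despite half of its claims being unsupported, so only the separate faithfulness floor rejects it.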

Next, let's implement these metrics using the RAGAS framework.