Evaluating RAG systems is complex because quality depends on two distinct components: retrieval and generation. Even with perfect retrieval, the generator can still hallucinate, and a perfect generator is useless when retrieval is poor.
Here are the 10 metrics that matter most — and how to automate them with SENTINEL-X.
1. Retrieval Precision — What fraction of retrieved chunks are actually relevant to the query? Target: >85%.
2. Retrieval Recall — What fraction of all relevant chunks were retrieved? Target: >90% for critical applications.
3. Context Relevance — How relevant is the retrieved context to the specific question? SENTINEL-X scores this on a 0-1 scale using an LLM judge.
4. Answer Faithfulness — Does the generated answer contain only information supported by the retrieved context? This is the most important metric for catching hallucinations.
5. Answer Relevance — Does the answer actually address the question that was asked? Catches responses that are factually grounded but miss the point.
6. Semantic Similarity — How similar is the answer to the ground truth? Useful for closed-domain QA where ground truth exists.
7. Citation Accuracy — If the model cites sources, are those citations correct? Critical for legal and medical applications.
8. Latency P95 — What is the 95th percentile end-to-end latency? User experience depends on this.
9. Token Efficiency — How many tokens does retrieval add to the context? Directly impacts cost.
10. Consistency — Given the same question twice, does the system give the same answer? Inconsistency often indicates unstable retrieval, though a nonzero sampling temperature in the generator can also cause it.
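To make metrics 1 and 2 concrete, here is a minimal sketch of how retrieval precision and recall could be computed by hand for a single query. It assumes you have a labeled set of relevant chunk IDs for that query. The function name and the chunk IDs are hypothetical and are not part of SENTINEL-X.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query.

    retrieved_ids: chunk IDs returned by the retriever (hypothetical IDs).
    relevant_ids:  chunk IDs a human labeled as relevant for the query.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant  # chunks that are both retrieved and relevant

    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Example: 4 chunks retrieved, 3 labeled relevant, 2 in common.
precision, recall = retrieval_precision_recall(
    retrieved_ids=["c1", "c2", "c3", "c4"],
    relevant_ids=["c1", "c2", "c5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example precision is 2/4 and recall is 2/3, so the query would miss both targets listed above. In practice you would average these values over a labeled query set.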
SENTINEL-X automatically computes all 10 metrics on every evaluation run, giving you a single quality score and detailed breakdowns for debugging.
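If you want to sanity-check metrics 8 and 10 outside of any tool, the sketch below shows one possible way to do it. It assumes the nearest-rank definition of a percentile and treats two answers as "the same" when they match after lowercasing and whitespace normalization, which is a deliberately crude stand-in for a semantic comparison. Both function names are hypothetical and do not reflect how SENTINEL-X computes these metrics.

```python
import math
from collections import Counter


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for P95 latency."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based: the smallest value such that pct% of samples are <= it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def consistency_rate(answers):
    """Fraction of repeated answers that agree with the most common answer.

    Assumption: exact match after lowercasing and collapsing whitespace.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    normalized = [" ".join(a.lower().split()) for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Example: 20 end-to-end latencies in seconds (1.0, 2.0, ..., 20.0).
latencies = [float(s) for s in range(1, 21)]
p95 = percentile_nearest_rank(latencies, 95)

# Example: the same question asked three times.
rate = consistency_rate(["Paris", "paris ", "Lyon"])
print(f"p95={p95}s consistency={rate:.2f}")
```

Exact matching will undercount consistency for free-form answers that are phrased differently but mean the same thing, so treat the rate as a lower bound.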