Background
While using DeepEval to evaluate a GraphRAG system in a no-reference setting, we discovered that FaithfulnessMetric can produce misleading perfect scores under certain conditions.
Observed Behavior
We asked GraphRAG a complex question about the 5GC PDU Session establishment procedure. The system returned a detailed technical answer (covering specific responsibilities of AMF, SMF, UPF, PCF, etc.), but the retrieved context contained only the table of contents from 3GPP documents, such as:
The document contains a section '5.6 Session Management' with several sub-subsections.
The document contains a section '5.2 Network Access Control' with several sub-subsections.
The context contained no substantive technical content, yet the Faithfulness score was 1.00 (perfect).
Root Cause Analysis
The Faithfulness metric evaluation consists of 4 steps:
| Step | Purpose |
|---|---|
| 1. Truths extraction | Extract factual statements from retrieval_context |
| 2. Claims extraction | Extract claims from actual_output |
| 3. Verdicts | Compare each claim against context, assign yes/no/idk |
| 4. Score calculation | Compute final score from verdicts |
The key lies in Step 3's verdict rules:
- yes: claim is consistent with context
- no: claim directly contradicts context
- idk: context contains no relevant information to judge
And Step 4's default scoring formula:
score = (total - no_count) / total
idk does not count as a penalty. Only explicit contradictions (no) reduce the score.
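As a minimal sketch of the default formula above (this mirrors the arithmetic described in this post, not DeepEval's actual source code; the function name `default_faithfulness_score` is hypothetical, and rejecting an empty verdict list is an assumption of this sketch):

```python
from collections import Counter


def default_faithfulness_score(verdicts):
    """Apply score = (total - no_count) / total to a list of verdict strings."""
    if not verdicts:
        # Assumption of this sketch: refuse to score an empty list.
        raise ValueError("need at least one verdict")
    counts = Counter(v.strip().lower() for v in verdicts)
    total = len(verdicts)
    # Only "no" verdicts are subtracted; "idk" is treated like "yes".
    return (total - counts["no"]) / total


mixed = ["yes", "yes", "yes", "no", "idk"]
print(default_faithfulness_score(mixed))  # 0.8
```

Note that the single idk in the mixed example costs nothing: only the one no verdict lowers the score from 1.0 to 0.8.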
Real-World Example
In our evaluation, the LLM judge (after switching to a stricter model) assigned idk to all 20 claims:
{
  "verdicts": [
    {"verdict": "idk"},
    {"verdict": "idk"},
    ... // 20 total, all idk
  ]
}
Score calculation: score = (20 - 0) / 20 = 1.00
The final reason output:
"The score is 1.00 because there are no contradictions; the actual output fully aligns with the retrieval context."
This is clearly misleading — none of the claims in the answer are supported by the context, but since none are "contradicted" either, the score is perfect.
The Fundamental Issue
Faithfulness measures "is there a contradiction with the context", not "is the answer supported by the context".
These are entirely different dimensions:
| Scenario | Faithfulness | Groundedness |
|---|---|---|
| Answer fully based on context | High | High |
| Answer correct but context irrelevant | High (no contradiction) | Low (no support) |
| Answer contradicts context | Low | Low |
When retrieval context contains only table-of-contents or summary-level information, it's nearly impossible for any specific claim to "directly contradict" it, so under the default scoring Faithfulness will almost always come out perfect.
Solutions
Solution 1: Enable penalize_ambiguous_claims
DeepEval provides a built-in parameter:
FaithfulnessMetric(model=model, threshold=0.5, penalize_ambiguous_claims=True)
With this enabled, the scoring formula becomes:
score = (total - no_count - idk_count) / total
Now 20 claims all judged idk yields: (20 - 0 - 20) / 20 = 0.00, which more accurately reflects how well the context supports the answer.
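The two formulas can be compared side by side with a small sketch. This is plain Python that reproduces the arithmetic in this post rather than DeepEval's implementation; the function `faithfulness_score` is hypothetical, and only its flag name is borrowed from the real parameter:

```python
def faithfulness_score(verdicts, penalize_ambiguous_claims=False):
    """Score a list of 'yes' / 'no' / 'idk' verdicts with either formula."""
    total = len(verdicts)
    if total == 0:
        # Assumption of this sketch: refuse to score an empty list.
        raise ValueError("need at least one verdict")
    no_count = sum(1 for v in verdicts if v == "no")
    idk_count = sum(1 for v in verdicts if v == "idk")
    # By default only contradictions count; with the flag, idk counts too.
    penalty = no_count + (idk_count if penalize_ambiguous_claims else 0)
    return (total - penalty) / total


all_idk = ["idk"] * 20
print(faithfulness_score(all_idk))                                  # 1.0
print(faithfulness_score(all_idk, penalize_ambiguous_claims=True))  # 0.0
```

With the flag enabled, the score is effectively the share of yes verdicts, which is why it tracks support rather than mere absence of contradiction.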
Solution 2: Add a Groundedness Metric
Use GEval to define a custom Groundedness metric that directly evaluates whether the answer is supported by context:
GEval(
    name="Groundedness",
    criteria="Determine whether the actual output is fully supported and grounded by the retrieval context. "
             "Penalize claims in the output that cannot be traced back to specific information in the retrieval context.",
    evaluation_params=[SingleTurnParams.INPUT, SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.RETRIEVAL_CONTEXT],
    model=model,
    threshold=0.5,
)
Recommendation
Use both solutions together:
- Keep Faithfulness (with penalize_ambiguous_claims enabled) to detect contradictions and unsupported claims
- Add Groundedness to positively evaluate support coverage
- Note Faithfulness limitations in reports to avoid misinterpretation
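One way to note the limitation in a report is to publish the verdict breakdown next to the score. The sketch below assumes you have already collected the verdict strings from the judge; the helper names and the 0.5 warning ratio are hypothetical choices for illustration, not part of DeepEval:

```python
from collections import Counter


def verdict_breakdown(verdicts):
    """Return the share of yes / no / idk verdicts (hypothetical helper)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(v.strip().lower() for v in verdicts)
    total = len(verdicts)
    return {label: counts[label] / total for label in ("yes", "no", "idk")}


def faithfulness_caveat(verdicts, idk_warning_ratio=0.5):
    """Return a caveat string when the idk share is high, else None."""
    shares = verdict_breakdown(verdicts)
    # The 0.5 default is an arbitrary illustrative threshold.
    if shares["idk"] >= idk_warning_ratio:
        return (
            f"{shares['idk']:.0%} of claims were judged idk: "
            "the context may not cover the answer."
        )
    return None


print(verdict_breakdown(["idk"] * 20))  # {'yes': 0.0, 'no': 0.0, 'idk': 1.0}
print(faithfulness_caveat(["idk"] * 20))
```

A reader who sees a perfect Faithfulness score alongside a 100% idk share can tell immediately that the score reflects missing context rather than strong support.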
Additional Pitfall: Summary Claims Misjudged as "idk"
Even when the context contains specific detailed information, if the actual output summarizes those details, the judge may still assign idk.
Real-World Example
The context contained specific procedural details about PDU Session establishment (AMF handling registration, SMF selecting UPF, N4 session setup, etc.), while the actual output included a summary claim:
"From the UE attempting to access a specific DNN to achieving effective user plane forwarding, the entire process involves close cooperation among multiple core network elements, each playing an indispensable role."
The judge's verdict:
{
  "verdict": "idk",
  "reason": "The claim is a summary statement; the context provides specific procedural details but does not directly confirm this overall description."
}
Cause
The Faithfulness prompt imposes strict constraints on the judge:
"Only use 'no' if retrieval context DIRECTLY CONTRADICTS the claim — never use prior knowledge."
"Use 'idk' for claims not backed up by context — do not assume your knowledge."
The judge is required to perform literal-level matching, not semantic-level reasoning. Even though the context details fully support the summary through logical inference, since the context doesn't "directly confirm" the statement, the judge can only assign idk.
Impact
For RAG systems, answers are expected to synthesize and summarize context — this is normal and desired behavior. However, Faithfulness's literal-level matching treats such reasonable summaries as "unsupported," causing scores to drop when penalize_ambiguous_claims is enabled.
Possible Improvement
DeepEval's FaithfulnessMetric supports an evaluation_template parameter. You can inherit from FaithfulnessTemplate and modify the verdict guidelines to include "summaries that can be reasonably inferred from context details" in the yes category. However, this changes the semantics of the evaluation criteria and should be used cautiously.
Conclusion
The Faithfulness metric was designed to detect hallucination — whether the model fabricates information that contradicts the context. However, it has limitations on two levels:
- "idk" doesn't penalize by default: always perfect when context is irrelevant (solved with penalize_ambiguous_claims=True)
- Literal-level matching is too strict: reasonable summaries are judged as unsupported (requires custom templates or supplementary Groundedness metrics)
When evaluating RAG systems, both Faithfulness and Groundedness dimensions must be considered to comprehensively assess answer quality.