Anindya Obi
# Metrics Map for LLM Evaluation: Groundedness, Structure, Correctness

Evaluating LLM outputs can become overwhelming when you try to track too many metrics. In practice, most failure modes fall into three simple categories: Groundedness, Structure, and Correctness. These three metrics explain the majority of issues in RAG, reasoning, and agent workflows.


## 1. Groundedness

Groundedness measures whether the model stayed inside the information it was given. For RAG workflows, this means the answer must come from the retrieved context. For agent workflows, this means the model should rely on the correct tool or evidence.

A groundedness failure looks like:

- Adding facts that were not present
- Hallucinating details
- Mixing in unrelated content
- Drawing conclusions that the context does not support

Example: the context says the user placed one order, but the model says the user placed three orders. This is a groundedness failure even if the answer is well formatted.

Groundedness ties directly back to Day 8: if the evaluation dataset is not clear, groundedness becomes impossible to measure.
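A first-pass groundedness check can be sketched as a token-overlap filter: flag any answer sentence whose content words mostly do not appear in the retrieved context. This is a deliberate simplification (the threshold and stopword list are assumptions, and production systems typically use an NLI model or an LLM judge instead), but it catches the "three orders vs. one order" class of failure cheaply.

```python
# Naive groundedness check: flag answer sentences whose content words
# mostly do not appear in the retrieved context. A real system would use
# an NLI model or an LLM judge; this is only a cheap first-pass filter.
import re

STOPWORDS = {"the", "a", "an", "is", "was", "that", "this", "of", "in", "and"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def ungrounded_sentences(answer: str, context: str, threshold: float = 0.6) -> list[str]:
    """Return answer sentences with less than `threshold` word overlap with context."""
    ctx_words = content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & ctx_words) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

context = "The user placed one order on March 3."
print(ungrounded_sentences("The user placed three orders.", context))
# ['The user placed three orders.']
print(ungrounded_sentences("The user placed one order.", context))
# []
```

The threshold is the weak point of any overlap heuristic: too low and hallucinated details slip through, too high and paraphrases get flagged, which is why this only works as a triage step before a stronger judge.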


## 2. Structure

Structure measures whether the model followed the required format. This includes JSON validity, field names, field order, and data types.

Structure failures include:

- Missing fields
- Renamed fields
- Lists that appear as single values
- Nested blocks that appear inconsistently
- Outputs that do not match the expected schema

This is the exact failure category described in Day 9. Without structure validation, scoring becomes meaningless.

Example: you expect answers inside `answer` and reasoning inside `steps`, but the model puts everything inside `result`. The reasoning may be correct, yet the evaluation fails because the structure is unstable.

Structure checks protect the scoring process from collapsing.
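The `answer`/`steps` schema from the example can be validated with a few lines of stdlib Python before any scoring happens. The field names here are taken from the example above; swap in your own schema (or a library like `jsonschema` for anything nested).

```python
# Minimal structural check for the schema in the example above:
# a JSON object with an "answer" string and a "steps" list.
import json

def check_structure(raw: str) -> list[str]:
    """Return a list of structure errors; an empty list means the output is scorable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    if not isinstance(data.get("answer"), str):
        errors.append('missing or non-string "answer" field')
    if not isinstance(data.get("steps"), list):
        errors.append('missing or non-list "steps" field')
    return errors

print(check_structure('{"result": "Tokyo"}'))   # both fields missing: two errors
print(check_structure('{"answer": "Tokyo", "steps": ["lookup capital"]}'))
# []
```

Running this gate first means every downstream scorer can assume the fields exist and have the right types, instead of defending against malformed output in every metric.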


## 3. Correctness

Correctness is the simplest metric. It answers the question: was the output right?

Correctness can be measured by:

- String match
- Semantic similarity
- Exact answer checks
- Multiple-choice accuracy
- Tool-use accuracy
- Step-by-step scoring

Example: you ask for the capital of Japan and the model says Osaka. The structure may be fine and the answer may be grounded, but correctness is still zero.

Correctness becomes meaningful only after structure has been validated. Otherwise the evaluation might score the wrong field or skip the sample entirely.


## 4. Why These Three Metrics Are Enough

Most evaluation noise comes from unclear dataset examples, unstable JSON, or ambiguous scoring rules. Groundedness addresses the first, structure the second, and correctness the third. Together they form a stable foundation for evaluation without needing complex metric stacks.

When these three are tracked cleanly, the evaluation trend lines become predictable and you can diagnose failures without guesswork.
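The three metrics compose naturally into a single gated evaluation pass: structure is checked first, and only scorable outputs get groundedness and correctness scores. The helper checks inside are deliberately trivial stand-ins (a substring test and a normalized string compare, both assumptions for illustration); the point is the gating shape, not the scorers.

```python
# Sketch of one evaluation pass combining the three metrics. Structure
# gates first: if the JSON is unusable, nothing else is scored. The
# groundedness and correctness checks are trivial stand-ins; swap in
# real scorers per metric.
import json

def evaluate(raw_output: str, context: str, reference: str) -> dict:
    record = {"structure": 0.0, "groundedness": 0.0, "correctness": 0.0}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return record  # structure failed: skip the other metrics
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return record  # wrong shape also counts as a structure failure
    record["structure"] = 1.0
    answer = data["answer"]
    record["groundedness"] = 1.0 if answer.lower() in context.lower() else 0.0
    record["correctness"] = 1.0 if answer.strip().lower() == reference.lower() else 0.0
    return record

print(evaluate('{"answer": "Tokyo"}', "Tokyo is the capital of Japan.", "Tokyo"))
# {'structure': 1.0, 'groundedness': 1.0, 'correctness': 1.0}
```

Keeping all three scores in one record per sample is what makes the trend lines diagnosable: a drop in the aggregate immediately decomposes into which of the three moved.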


## 5. Closing Note

You can always add more metrics later, but if your early-stage evaluation is unstable, start with these three: Groundedness, Structure, Correctness. They explain most issues long before advanced metrics do.
