Mudassir Marwat for Cognilium AI

Posted on • Originally published at cognilium.ai

How We Score a Knowledge Graph Before We Trust It

Most teams ship a knowledge graph the moment it looks done. The documents are loaded, the nodes are there, the queries return something. It looks finished, so they wire the agents up and move on.

That is the mistake. “Looks done” is a feeling, not a measurement. A graph can look complete and still be full of duplicate entities, mislinks, and stale facts, and an agent querying it has no way to tell. The only way to know whether a graph is safe to trust is to score it, against a bar it could have failed, before anything is allowed to query it.

This is the fourth post in the series on graph rot. We have covered the seven ways a knowledge graph rots, how to decide that eleven names are one company, and how to catch the edges that should not exist. This post is about the step that comes after all of that: deciding, with a number, whether the graph is trustworthy enough to put in front of an AI agent.

You do not trust a graph because it looks finished. You trust it because it passed a score it could have failed.

What does it mean to “score” a graph?

It means measuring whether the graph tells the truth, not whether it is large or well-connected.

This is the distinction that trips people up. The metrics built into graph tools, like node count, edge count, and density, measure size and shape. None of them measure correctness. A graph can have a million nodes and a beautiful density score and still claim a director sits on a board he never joined. Size is not truth.
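To make the size-versus-truth gap concrete, here is a minimal sketch with two toy graphs that have identical node count, edge count, and density, where one contains a false board seat. The names, the relation label, and the helper functions are all hypothetical and exist only for illustration.

```python
# Toy graphs as sets of (subject, relation, object) triples.
# All names and relations here are hypothetical.
GROUND_TRUTH = {
    ("Alice", "SITS_ON_BOARD_OF", "Acme"),
    ("Bob", "SITS_ON_BOARD_OF", "Globex"),
    ("Carol", "SITS_ON_BOARD_OF", "Globex"),
}

# Same size and shape, but Bob is linked to a board he never joined.
BUILT_GRAPH = {
    ("Alice", "SITS_ON_BOARD_OF", "Acme"),
    ("Bob", "SITS_ON_BOARD_OF", "Acme"),
    ("Carol", "SITS_ON_BOARD_OF", "Globex"),
}


def shape_metrics(edges):
    """Return (node_count, edge_count, density) for a directed graph."""
    nodes = {s for s, _, _ in edges} | {o for _, _, o in edges}
    n, m = len(nodes), len(edges)
    density = m / (n * (n - 1)) if n > 1 else 0.0
    return n, m, density


def edge_accuracy(edges, truth):
    """Fraction of edges in the graph that also appear in the ground truth."""
    return len(edges & truth) / len(edges) if edges else 0.0


# The shape metrics cannot tell the two graphs apart.
print(shape_metrics(GROUND_TRUTH) == shape_metrics(BUILT_GRAPH))  # True
# Only a comparison against ground truth exposes the false edge.
print(round(edge_accuracy(BUILT_GRAPH, GROUND_TRUTH), 2))  # 0.67
```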

Scoring a graph means asking a different set of questions. Do the extracted facts match what the documents actually say? Did entity resolution merge the right things and only the right things? Do the edges point where the evidence points? And, most importantly, can an agent use this graph to answer the questions it will actually be asked, correctly? Those are correctness questions, and they need a grading process, not a dashboard.

Why is “it looks done” the wrong bar?

Because the failures that matter are invisible to the eye and only show up under a score.

The dangerous problems in a graph are silent. A duplicate company does not announce itself. A mislink looks exactly like a real edge. A stale valuation looks like a current one. None of them throw an error, and none of them show up when you glance at the graph and see that it has data in it. They show up only when you systematically compare the graph against ground truth and count how often it is wrong.

So “it looks done” optimizes for the wrong thing. It rewards a graph that is full, not a graph that is right. The teams that get burned are the ones who treat the presence of data as evidence of quality. The presence of data is evidence of nothing. The score is the evidence.

What do you actually score?

You score the things that break, on a fixed rubric, so the result is a number you can compare over time.

We grade AI outputs against a 100-point rubric across five dimensions. Applied to a knowledge graph, those dimensions map cleanly onto the failure modes from this series: extraction accuracy (did we pull the right fields off the page), grounding (can every fact cite the sentence it came from), identity correctness (entity resolution neither split nor over-merged), relationship correctness (no mislinks), and answer quality (can the graph actually serve a correct answer to a real query). A fixed rubric matters more than the exact points. Because it is fixed, the score means the same thing this week as it did last week, so you can tell whether the graph got better or worse after the last ingestion.
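A fixed rubric like this could be written down as plain data, so the weights cannot drift between runs. The sketch below assumes an equal 20-point split across the five dimensions. The post does not give the actual split, so the weights, the dimension keys, and the `rubric_score` helper are illustrative only.

```python
# Assumption: 20 points per dimension. The real split is not stated in the post.
RUBRIC = {
    "extraction_accuracy": 20,
    "grounding": 20,
    "identity_correctness": 20,
    "relationship_correctness": 20,
    "answer_quality": 20,
}
assert sum(RUBRIC.values()) == 100  # the rubric must always total 100 points


def rubric_score(fractions):
    """Turn per-dimension fractions (0.0 to 1.0) into a score out of 100."""
    missing = set(RUBRIC) - set(fractions)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name, value in fractions.items():
        if name not in RUBRIC:
            raise ValueError(f"unknown dimension: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(RUBRIC[name] * fractions[name] for name in RUBRIC)


# Hypothetical grading result: strong everywhere except answer quality.
example = {
    "extraction_accuracy": 1.0,
    "grounding": 1.0,
    "identity_correctness": 1.0,
    "relationship_correctness": 1.0,
    "answer_quality": 0.7,
}
print(round(rubric_score(example), 1))
```

Rejecting missing or unknown dimensions is a deliberate choice in this sketch: a score computed from four of the five dimensions would not be comparable with last week's.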

The rubric also forces honesty about partial credit. A graph is rarely all right or all wrong. It is usually 94 percent right in a way that hides the 6 percent that will produce a confident, false answer. A rubric makes you count the 6 percent instead of rounding it away.

How do you score a graph without grading every node by hand?

You use a model as the judge, pointed at a sampled, structured set of cases, not at the whole graph.

Flow: a built graph is graded by an LLM-as-judge on a 100-point rubric across five dimensions, producing an auditable score; if it clears 16 acceptance thresholds the graph is promoted to agents, otherwise it goes back to review.

Grading a real graph by hand does not scale, and grading every node by model is slow and expensive. So we do what we do for retrieval systems: run an LLM-as-judge over a curated evaluation suite. Ours runs 61 evaluation cases, built in two sets of 24 and 37, each one a question with a known-good answer the graph is expected to support. The judge reads the graph's answer, compares it against the rubric, and assigns a score with a written reason, so every grade can be audited rather than taken on faith.
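The shape of such a suite can be sketched in a few lines. This is a minimal illustration, not our production harness: `EvalCase`, `Grade`, `run_suite`, and the sample questions are hypothetical, and `stub_judge` uses exact matching as a stand-in for a real LLM call so the example runs on its own. The real suite has 61 cases; the sketch uses two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    expected: str  # known-good answer the graph is expected to support


@dataclass(frozen=True)
class Grade:
    case_id: str
    score: int   # 0 to 100
    reason: str  # written justification, kept so the grade can be audited


def stub_judge(case, graph_answer):
    """Placeholder for an LLM judge. Assumption: exact match stands in for the model."""
    if graph_answer.strip().lower() == case.expected.strip().lower():
        return Grade(case.case_id, 100, "Answer matches the known-good answer.")
    return Grade(
        case.case_id,
        0,
        f"Expected {case.expected!r} but the graph answered {graph_answer!r}.",
    )


def run_suite(cases, answer_fn, judge):
    """Ask the graph each question and have the judge grade the answer."""
    return [judge(case, answer_fn(case.question)) for case in cases]


# Hypothetical cases, with a dict standing in for the graph's query layer.
CASES = [
    EvalCase("c1", "Who chairs Acme?", "Alice"),
    EvalCase("c2", "Which board does Bob sit on?", "Globex"),
]
FAKE_GRAPH_ANSWERS = {
    "Who chairs Acme?": "Alice",
    "Which board does Bob sit on?": "Acme",  # a mislink surfaces as a wrong answer
}

grades = run_suite(CASES, lambda q: FAKE_GRAPH_ANSWERS.get(q, ""), stub_judge)
for grade in grades:
    print(grade.case_id, grade.score, grade.reason)
```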

Two engineering details make this reliable rather than theatrical. First, we split the models: a cheaper model does the generation, and a separate, stronger model does the judging, so the grader is not marking its own homework. Second, when the judge is uncertain or its score lands near a threshold, we retry with a temperature escalation, stepping from 0.3 to 0.4 to 0.5, to see whether the verdict is stable or a coin flip. A grade that changes when you nudge the temperature is not a grade you can trust, and it gets flagged for a human.
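The retry logic can be sketched as follows. The temperatures 0.3, 0.4, and 0.5 come from the description above. Everything else is an assumption made for illustration: that the first pass runs at 0.3, that "near a threshold" means within a 5-point `margin`, and that "stable" means every run lands on the same side of the threshold. The `judge_at` callable is a hypothetical wrapper around whatever judge model is in use.

```python
from dataclasses import dataclass
from typing import List

ESCALATION_TEMPERATURES = (0.3, 0.4, 0.5)  # steps named in the post


@dataclass(frozen=True)
class Verdict:
    scores: List[float]  # one score per judge run, in temperature order
    passed: bool
    needs_human: bool


def grade_with_escalation(judge_at, threshold, margin=5.0):
    """Grade once; if the score is near the threshold, re-grade at higher temperatures.

    judge_at(temperature) -> score is a hypothetical wrapper around the judge model.
    margin is an assumed definition of "near the threshold".
    """
    first = judge_at(ESCALATION_TEMPERATURES[0])
    if abs(first - threshold) > margin:
        # Clearly above or below the bar: a single run is enough.
        return Verdict([first], first >= threshold, False)

    scores = [first] + [judge_at(t) for t in ESCALATION_TEMPERATURES[1:]]
    outcomes = {score >= threshold for score in scores}
    stable = len(outcomes) == 1  # every run agrees on pass or fail
    return Verdict(scores, stable and first >= threshold, not stable)


# Hypothetical judge whose verdict flips as the temperature rises.
flaky_scores = {0.3: 81, 0.4: 78, 0.5: 83}
print(grade_with_escalation(lambda t: flaky_scores[t], threshold=80))
```

In this sketch an unstable verdict never counts as a pass. It is returned with `needs_human=True` so a person makes the call.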

Can you trust a model to grade a model?

Only if you constrain it the same way you constrain the graph: ground it, and never let it free-associate.

The obvious objection to LLM-as-judge is that you are using one fallible model to grade another. It is a fair worry, and the answer is the same discipline that runs through this whole series. The judge is not asked for an opinion from memory. It is handed the graph's answer, the specific evidence the answer rests on, and an explicit rubric, and asked to grade against that evidence. We also run grounding checks against known term sets, so an answer that invents a term that appears nowhere in the source corpus fails on contact, regardless of how confident the judge feels.
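A grounding check against a known term set can be very small. The sketch below assumes the answer's terms have already been extracted into a list, which is a separate step not shown here. The term set, the normalization rule, and the function names are hypothetical.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(term.lower().split())


def ungrounded_terms(answer_terms, known_terms):
    """Return the terms in an answer that appear nowhere in the known term set."""
    known = {normalize(term) for term in known_terms}
    return [term for term in answer_terms if normalize(term) not in known]


def passes_grounding(answer_terms, known_terms):
    """An answer passes only if every term it uses is in the known set."""
    return not ungrounded_terms(answer_terms, known_terms)


# Hypothetical term set drawn from the source corpus.
KNOWN_TERMS = {"Acme Corp", "Globex", "Series B"}

print(ungrounded_terms(["acme  corp", "Series B"], KNOWN_TERMS))  # []
print(ungrounded_terms(["Acme Corp", "Series Q"], KNOWN_TERMS))   # ['Series Q']
```

Because this check is deterministic, it can veto an answer no matter what score the judge gave it.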

The judge is a measurement instrument, and like any instrument it needs calibration. The model split, the temperature-escalation retry, the written reasons, and the grounding checks are the calibration. Without them, an LLM judge is a vibe with a number attached. With them, it is a repeatable measurement you can defend.

What is the acceptance bar?

A fixed set of queries with score thresholds, and the graph does not reach an agent until it clears them.

A score on its own is just information. It becomes a gate when you attach a threshold. We hold a set of 16 acceptance test queries, each with a minimum score the graph has to hit. These are the questions the graph absolutely must get right, the ones where a wrong answer would do real damage. If the graph cannot clear the threshold on those queries, it does not get promoted to the agents, no matter how good it looks otherwise.
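As a rough sketch, the gate is a comparison of per-query scores against per-query minimums that returns a hard yes or no. The query IDs, thresholds, and scores below are made up for illustration, and the example uses three queries where the real set has 16.

```python
def acceptance_gate(scores, thresholds):
    """Return (promoted, failures).

    thresholds maps each acceptance query ID to its minimum score.
    A query with no recorded score counts as a failure, not as a pass.
    """
    failures = {}
    for query_id, minimum in thresholds.items():
        score = scores.get(query_id)
        if score is None or score < minimum:
            failures[query_id] = (score, minimum)
    return (not failures), failures


# Hypothetical thresholds and scores.
THRESHOLDS = {"q01": 90, "q02": 85, "q03": 95}
SCORES = {"q01": 93, "q02": 84, "q03": 97}

promoted, failures = acceptance_gate(SCORES, THRESHOLDS)
print(promoted)   # False
print(failures)   # {'q02': (84, 85)}
```

Treating a missing score as a failure is a deliberate choice in this sketch: a query that was never run should block promotion rather than slip through.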

This is the part most pipelines skip, and it is the part that turns scoring from a report into a safeguard. A report tells you the graph scored 91 percent. An acceptance gate refuses to ship the graph until the part of the missing 9 percent that matters most is fixed. The gate is what makes the score load-bearing. Without it, the score is just a number someone glances at on the way to shipping anyway.

How do you keep it honest after launch?

You re-score continuously, because a graph that passed once is not a graph that passes forever.

A knowledge graph that takes in new documents is a moving target. Every ingestion is a chance to introduce a fresh duplicate, a new mislink, or a stale fact. So the score is not a launch gate you pass once. It is a recurring check. Every node and edge carries a confidence score, which lets us ask the graph directly for its weakest parts and route them to review, and the acceptance queries run again after each significant ingestion. A graph that was clean at launch and never re-scored is just a graph that is rotting more slowly than an unscored one. Continuous re-scoring is the same discipline we bring to keeping any production AI system trustworthy after it ships, not just on the day it launches.
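Asking the graph for its weakest parts can be as simple as selecting the lowest-confidence elements. The sketch below is an in-memory illustration under assumed names: `GraphElement`, the `below` cutoff, and the sample confidences are hypothetical, and a real graph store would more likely do this with a query.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GraphElement:
    kind: str          # "node" or "edge"
    ident: str
    confidence: float  # 0.0 to 1.0


def weakest(elements, k=10, below: Optional[float] = None):
    """Return up to k elements with the lowest confidence, weakest first.

    If below is given, only elements under that confidence are considered.
    """
    pool = [e for e in elements if below is None or e.confidence < below]
    return heapq.nsmallest(k, pool, key=lambda e: e.confidence)


# Hypothetical elements after an ingestion.
ELEMENTS = [
    GraphElement("node", "company:acme", 0.98),
    GraphElement("edge", "bob-SITS_ON_BOARD_OF-acme", 0.41),
    GraphElement("node", "company:acme-inc", 0.62),
    GraphElement("edge", "alice-SITS_ON_BOARD_OF-acme", 0.93),
]

for element in weakest(ELEMENTS, k=2):
    print(element.kind, element.ident, element.confidence)
```

The `k` cap keeps the review queue bounded even after a large ingestion.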

What did building this teach us?

The first lesson is that the rubric matters more than the model doing the grading. Teams obsess over which model should be the judge. What actually moved the needle was having a fixed, written rubric and a fixed set of acceptance queries, so the score meant the same thing every time. Swap the judge model and a good rubric still produces a comparable grade. Keep a vague rubric and the best judge in the world produces noise.

The second lesson is that the acceptance gate is the whole point. Scoring without a gate is a measurement nobody acts on. We have watched scores get generated, noted, and then ignored as the graph shipped anyway under deadline pressure. The threshold is what removes the human's ability to wave the graph through, and that is exactly why it works. The score has to be able to say no, and the team has to have agreed in advance to listen.

A graph that an agent can trust is a graph that earned that trust against a bar it could have failed. Everything before the score is hope. The score is where hope turns into a number, and the gate is where the number turns into a decision.

We build and fix knowledge graphs for AI systems, and the agent and retrieval layers that sit on top of them. If you are about to trust an agent to a graph you have never scored, *[book a 15-minute call](https://cognilium.ai/contact).*
