Keeping a Knowledge Graph Fresh Without Rebuilding It

A knowledge graph is never finished. It is only current as of its last document. The day after you build it, a new filing arrives, a valuation changes, a company is sold, and the graph that was correct yesterday is quietly wrong today. On the platform we run for a family office, documents arrive continuously through a Drive webhook sync, so “the graph is done” was never a state the system reached. It is always one document behind reality, and the job is to keep that gap small.

That leaves you with a hard operational choice every time new information lands. Do you rebuild the whole graph from scratch, or do you add the new facts to the graph you already have? Most teams pick one of those two answers, and both are wrong on their own. This post is about the third option, which is the only one that holds up: incremental updates that keep the graph fresh without rebuilding it and without letting it rot.

This is the fifth post in the series on graph rot. We have covered the seven failure modes, resolving duplicate entities, catching wrong edges, and scoring a graph before agents trust it. This one is about the operational reality underneath all of them: the graph keeps changing, and the changes are where rot creeps back in.

Why not just rebuild the graph every time?

Because rebuilding is expensive, it throws away history, and the new build is not guaranteed to be better than the old one.

The instinct is tempting. A full rebuild feels clean: take all the documents, run the whole pipeline, get a fresh graph. But it does not survive contact with a system that ingests documents continuously. You cannot re-extract and re-resolve an entire corpus every time one new document lands, not on cost and not on time. A graph that takes hours to build cannot be rebuilt on every Drive sync.

The deeper problem is that a rebuild is non-deterministic in the ways that matter. Extraction models drift, prompts change, and a fresh run can resolve an entity differently than the last one did, or invent a new mislink the previous build did not have. So a rebuild is not a safe refresh. It is a new graph with its own new errors, and you have thrown away the corrections, the human review decisions, and the provenance that the old graph had accumulated. You do not want to relitigate the entire graph because one document arrived.

Why is appending the new facts blindly worse?

Because blind appends are exactly how the rot from this whole series gets in.

If a full rebuild is too heavy, the lazy alternative is to just add the new document's extractions to the existing graph. Run extraction on the new file, write its nodes and edges in, move on. This scales fine and it is fast, and it is also the single most reliable way to rot a graph over time.

Every blind append is a fresh chance to create the failures the earlier posts described. The new document names a company that already exists in the graph, and if you do not resolve it, you get a duplicate entity. It asserts a relationship, and if you do not check it, you get a mislink. Worst of all, it can carry a fact that contradicts one already in the graph, such as a new valuation against an old one. A blind append leaves both sitting there, so the graph now holds two answers to the same question and an agent can retrieve either. Appending without resolving, checking, and superseding is not keeping the graph fresh. It is layering new rot on top of old.

What does keeping it fresh actually require?

It requires treating each new document as a small, careful merge into the existing graph, not a rebuild and not a dump.

The discipline has four moves, and they run on the new document and the part of the graph it touches, never on the whole thing:

**Resolve before you write.** Run the new document's entities through the same cross-document resolution the rest of the graph used, against the entities already in the graph. A company in the new filing either matches one that exists, and merges into it, or it is genuinely new. This is what stops every ingestion from minting duplicates.

**Check the new edges.** Every relationship the new document introduces goes through the same grounding and validation the build used. A new edge that cannot cite its source, or that creates a structurally impossible shape, gets flagged before it is trusted, not after an agent has walked through it.

**Supersede, do not just add.** When a new fact contradicts an existing one, the graph has to decide which is current rather than keep both. A newer cap table supersedes an older one; a current valuation replaces a stale one. The old fact is not necessarily deleted, but it stops being the answer the graph returns. This is the move that blind appends skip, and it is the cure for stale facts.

**Revalidate the touched region.** You re-run the quality checks, but only on the subgraph the new document affected, not the entire graph. The reconciler re-checks the cap table that changed; mislink detection re-examines the edges around the updated entities. The cost of keeping the graph honest scales with the size of the change, not the size of the graph.
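
Of the four moves, the edge check is the easiest to show in isolation. The sketch below is illustrative only: the `Edge` fields, the `check_edge` name, and the "an entity cannot own itself" rule are assumptions chosen for the example, not the platform's actual schema or rule set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str                 # entity id of the subject (e.g. the owner)
    target: str                 # entity id of the object (e.g. the company)
    relation: str               # e.g. "OWNS"
    evidence: Optional[str]     # quoted span from the document, if any
    doc_id: Optional[str]       # document the edge was extracted from


def check_edge(edge: Edge) -> list[str]:
    """Return the reasons an edge should be flagged; empty means it passes."""
    problems = []
    # Grounding: the edge must be able to cite where it came from.
    if not edge.doc_id or not edge.evidence:
        problems.append("ungrounded: no source citation")
    # Structure: one example of a shape this sketch treats as impossible.
    if edge.relation == "OWNS" and edge.source == edge.target:
        problems.append("impossible shape: entity owns itself")
    return problems
```

An edge that returns a non-empty list would be held for review instead of being written as trusted.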

What does one incremental update actually look like?

Walk a single document through it. A new cap table for a portfolio company lands in the connected drive, and the Drive webhook picks it up.

Extraction pulls the ownership rows and the company name. Resolution runs first: the company in the new cap table is matched against the graph and merges into the existing company node rather than creating a second one. Its owners are resolved the same way, so an investor who already appears under a slightly different spelling is recognized, not duplicated.
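
A deliberately small sketch of "resolve before you write" follows. It matches on a normalized name only, which is a simplification: a production resolver would weigh far more signals than spelling. `LEGAL_SUFFIXES`, `normalize_name`, and `resolve` are hypothetical names introduced for this example.

```python
import re
import unicodedata

# Hypothetical suffix list, kept short for the example.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "lp", "corp", "co", "gmbh"}


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop trailing legal suffixes."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    text = ascii_name.lower().replace(".", "")          # "L.L.C." -> "llc"
    tokens = re.sub(r"[^a-z0-9 ]", " ", text).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve(name: str, index: dict[str, str], next_id: str) -> tuple[str, bool]:
    """Return (entity_id, created). `index` maps normalized name -> entity id."""
    key = normalize_name(name)
    if key in index:
        return index[key], False        # merge into the existing node
    index[key] = next_id                # genuinely new entity
    return next_id, True
```

With this in place, "Acme Holdings, L.L.C." and "ACME Holdings LLC" land on the same node instead of minting a second one.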

Then superseding. The graph already holds an older cap table for this company, with older percentages. The new one does not sit beside it as a second opinion. It becomes the current ownership, and the previous cap table drops out of the answers the graph returns, while staying in the history for audit. A query for “who owns this company” now returns the new structure, not a blend of two.
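
One way to model that step is to keep every version and flag exactly one as current. This is a sketch under assumptions: the `CapTable` and `CompanyOwnership` classes and the `is_current` flag are invented for illustration, and currency is decided here by the `as_of` date alone.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CapTable:
    company_id: str
    as_of: date
    holdings: dict[str, float]          # owner id -> percent
    is_current: bool = True


@dataclass
class CompanyOwnership:
    versions: list[CapTable] = field(default_factory=list)

    def supersede(self, new: CapTable) -> None:
        """Add a cap table; only the latest by as_of date stays current."""
        self.versions.append(new)
        latest = max(self.versions, key=lambda v: v.as_of)
        for v in self.versions:
            v.is_current = v is latest

    def current(self) -> CapTable:
        """The single version the graph should answer with."""
        return next(v for v in self.versions if v.is_current)
```

Because currency is recomputed from the dates rather than assumed from arrival order, an older cap table that shows up late is kept in the history without displacing the newer one.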

Then revalidation, scoped to what moved. The reconciler re-checks just this company's cap table and confirms the new percentages still sum to 100%; if they do not, the update is flagged rather than trusted. Mislink detection re-examines the ownership edges around the updated owners. The confidence scores on the touched nodes are recomputed, and the acceptance queries that involve this company are re-run.
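
The reconciler's core check is simple arithmetic, and a minimal version might look like the following. The function name and the tolerance of 0.01 percentage points are assumptions for the sketch; the right tolerance depends on how the source documents round.

```python
import math


def reconcile_cap_table(holdings: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """Return problems found; an empty list means the table reconciles."""
    problems = []
    for owner, pct in holdings.items():
        if pct < 0 or pct > 100:
            problems.append(f"{owner}: {pct} is outside 0-100")
    # fsum avoids accumulating float error across many small holdings.
    total = math.fsum(holdings.values())
    if abs(total - 100.0) > tolerance:
        problems.append(f"total is {total:.2f}, expected 100.00")
    return problems
```

A table of 50, 30, and 20 passes; a table of 60 and 50 comes back with one problem and the update is flagged.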

None of that touched the rest of the graph. One document changed one company's ownership, and the cost of keeping the graph correct was the cost of re-checking one company, not the cost of rebuilding everything. That is what incremental looks like in practice: a small, bounded, self-verifying change.

How do new documents get into the graph in the first place?

Continuously and automatically, which is exactly why the discipline above has to be automatic too.

On the family office platform, new documents arrive through a Drive webhook sync, and recurring scheduled jobs handle the steady cadence of re-checks. Nobody sits and presses “ingest.” The moment a new filing lands in the connected drive, the pipeline picks it up, extracts it with Gemini 2.5 Pro, and runs it through the resolve-check-supersede-revalidate sequence before its facts become queryable in the Neo4j graph.

This matters because the freshness problem is not a once-a-quarter migration. It is a constant trickle. A graph attached to a live document source is being updated all the time, which means the merge discipline cannot be a manual cleanup you do occasionally. It has to be the default path every single document takes on the way in. If keeping the graph fresh depends on someone remembering to clean it up, it will rot, because documents arrive faster than anyone remembers.

How do you handle a fact that simply changed?

You decide currency by recency and source, and you make the graph return the current answer, not all the answers it has ever held.

Stale facts are the quietest rot because nothing about them looks broken. The old valuation is a real number that was once correct. The departed officer really did hold the role. The graph is not wrong about the past; it is wrong about now, and an agent asking “what is this worth” or “who runs this” has no way to know it is being handed last year's truth.

The fix is to treat facts as having currency, not just existence. When a new document carries a fact that updates an old one, the newer fact, from the more authoritative or more recent source, becomes the one the graph serves. The cap table reconciliation is the clearest case: a new cap table does not sit beside the old one as a second opinion, it supersedes it, and the reconciler confirms the new totals still add up. The graph keeps the history if you need an audit trail, but it answers with the present.
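
As a rough sketch of deciding currency, the snippet below picks the most recent fact and uses source authority to break ties. Both the ordering (recency first, authority second) and the `SOURCE_RANK` table are assumptions made for the example; a real policy might weigh them differently.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical authority ranking; higher wins when dates tie.
SOURCE_RANK = {"signed_filing": 3, "board_deck": 2, "email": 1}


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: float
    as_of: date
    source_type: str


def current_fact(facts: list[Fact]) -> Fact:
    """Pick the fact to serve: most recent first, source authority as tiebreak."""
    return max(facts, key=lambda f: (f.as_of, SOURCE_RANK.get(f.source_type, 0)))
```

The facts that lose are not deleted. They simply stop being what a query for the current value returns.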

How do you re-check quality without re-checking everything?

By scoping the checks to the change, so validation is incremental too.

This is the piece that makes continuous freshness affordable. If every new document forced a full re-score and a full mislink sweep across the whole graph, you would be back to the rebuild cost you were trying to avoid. So the checks are scoped. When a document updates a handful of entities, only those entities and their immediate edges are re-resolved and re-validated. The confidence scores on the affected nodes are recomputed, and anything that drops below threshold is surfaced for review. The acceptance queries that touch the changed region are re-run to confirm the graph still answers them correctly.
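
Scoping the checks comes down to computing the touched region first. A minimal sketch, assuming the graph's edges are available as `(source, target)` pairs and that "immediate edges" means edges with at least one changed endpoint; `touched_region` is a hypothetical helper name.

```python
def touched_region(
    edges: list[tuple[str, str]], changed: set[str]
) -> tuple[set[str], list[tuple[str, str]]]:
    """Return the changed nodes plus their direct neighbours, and the
    edges that have at least one changed endpoint."""
    region_edges = [(a, b) for a, b in edges if a in changed or b in changed]
    region_nodes = set(changed)
    for a, b in region_edges:
        region_nodes.update((a, b))
    return region_nodes, region_edges
```

Validation then runs over `region_nodes` and `region_edges` only, so an update to one company never pays for checks on an unrelated one.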

The result is a graph where the cost of staying correct is proportional to how much changed, not to how big the graph is. That is the only way a knowledge graph can grow for years and stay trustworthy, because the alternative, paying full-graph validation cost on every document, stops being affordable long before the graph is large enough to be useful. This is the same data engineering discipline that separates a pipeline that survives years of production from one that quietly degrades the moment the team stops watching it.

How do you know what changed, and whether to trust it?

You make every change visible through confidence scores and re-run the acceptance gate, so freshness never becomes a blind spot.

A graph that updates silently is a graph you cannot trust, because you have no way to tell whether the last ingestion improved it or quietly broke something. Every node and edge carries a confidence score, so after an update we can ask the graph directly which of the newly touched parts are weakest and route them to a human. And the scoring discipline from the last post does not retire after launch. The acceptance queries run again after each significant ingestion, so a document that would have introduced a regression gets caught by the same gate that vetted the original graph. Freshness and trust are the same problem: a fresh graph is only an asset if you can still prove it is right.
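
Routing the weakest touched parts to a human can be as small as a filter and a sort. In this sketch the score dictionary, the `review_queue` name, and the 0.8 threshold are all assumptions for illustration.

```python
def review_queue(
    scores: dict[str, float], touched: set[str], threshold: float = 0.8
) -> list[str]:
    """Ids from the touched set that score below threshold, weakest first."""
    weak = [(scores[i], i) for i in touched if i in scores and scores[i] < threshold]
    return [i for _, i in sorted(weak)]
```

Low-scoring nodes outside the touched set are ignored here on purpose: they belong to whichever update last touched them, not to this one.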

What did building this teach us?

The first lesson is that ingestion has to be idempotent. The same document processed twice should not create two copies of anything. Drive syncs fire more than once, jobs retry, and a pipeline that is not idempotent will duplicate its way into rot even with perfect resolution logic. Making every write safe to repeat removed an entire category of freshness bugs before they could start.
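
One common way to get that property is to key each ingestion on the file and a hash of its content, and skip anything already seen. The sketch below keeps the log in memory and uses a hypothetical `IngestLog` class; a real pipeline would persist the log and record a key only after the write succeeds.

```python
import hashlib


class IngestLog:
    """Remembers which document contents have already been merged."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_process(self, file_id: str, content: bytes) -> bool:
        """True the first time a (file, content) pair is seen, False on repeats."""
        key = f"{file_id}:{hashlib.sha256(content).hexdigest()}"
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A duplicate webhook or a retried job hits the same key and is skipped, while a genuinely edited file produces a new hash and goes through the merge again. In Neo4j, writing with `MERGE` on a stable key rather than `CREATE` is a natural complement, since it makes the write itself safe to repeat.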

The second lesson is that superseding is harder, and more important, than adding. Adding a new fact is easy. Deciding that a new fact retires an old one, and being right about which is current, is where the real judgment lives, and it is the move that separates a graph that gets more accurate over time from one that just gets bigger and more contradictory. A graph that only ever adds is a graph that slowly fills with its own outdated answers.

A knowledge graph attached to a live source of documents is a living system, and living systems either get maintained or they rot. Rebuilding is the bulldozer and appending is the leak. Keeping it fresh is neither. It is updating like a surgeon, on the part that changed, with the checks that prove the change was safe.

We build and fix knowledge graphs for AI systems, including the continuous ingestion pipelines that keep them current. If your graph is drifting out of date faster than you can clean it, [book a 15-minute call](https://cognilium.ai/contact).