DEV Community: Cognilium AI

Who Can Help With Pick Optimization in Dynamics 365?

Mudassir Marwat — Wed, 29 Jul 2026 13:38:52 +0000

Answer first

If you run warehouses on Dynamics 365 Finance and Operations and your pickers walk too far, the help you need comes in three forms: your existing implementation partner (good at configuring the system, rarely focused on placement analysis), a specialist who reads your order history and tells you where inventory should live, or an internal analyst doing it in a spreadsheet. Most companies rely on the third by default. This page explains what each actually does, and the questions that tell you whether someone can genuinely help.

First, understand what you are asking for

"Pick optimization" gets used for several different things, and getting help starts with knowing which one you have. There are three layers, and they are usually solved in this order:

Location accuracy. Does the system know where things actually are? If the answer is no, nothing else matters yet, because optimizing on wrong data gives you a confident wrong answer.
Slotting, or placement. Where should each item live so the orders you actually ship are cheapest to fill? This is the biggest lever and the one most often left to a spreadsheet.
Routing. Given where things are, what path should the picker walk? Dynamics has started addressing this directly.

Most people who say they want their pickers to walk less actually have a placement problem. So "who can help" usually means "who can work out where my inventory should live, from my order history, and keep it current."

The three kinds of help, honestly

Your Dynamics implementation partner. They configured your system and they know it well. They are excellent at setting up slotting templates, location directives, and replenishment. What they typically do not do is the analysis of what the policy should be. They will implement the rules you give them. Deciding the rules, from your order history, is a different skill, and it is usually not where an implementation partner spends its time.

A specialist in placement analysis. Someone whose actual work is reading order history, working out which items should move and where, quantifying it, and pushing the result into Dynamics. This is narrower than a full WMS project and it is the layer most directly tied to how far your people walk. It is also, honestly, the least crowded. We could not find a single product positioned specifically as slotting for Dynamics 365 Finance and Operations. If one exists, we would like to see it.

An internal analyst with a spreadsheet. This is what most companies use, whether they mean to or not. Someone exports order lines, builds a pivot table, ranks the SKUs, decides the moves, and types the result back in. It works, and it produces real results. Its weakness is that it is a point in time. The classification you loaded in one quarter is describing a warehouse that has already changed by the next, and nobody has time to redo it monthly by hand.

The questions that separate real help from a pitch

Whoever you talk to, these questions surface whether they understand the problem:

"Do you start from my order history, or from my current layout?" Placement should be derived from what you actually ship, not from where things happen to be now.
"How will you show me the savings net of replenishment?" Every item moved forward is one that has to be replenished more often, and a restock costs more than a pick. Anyone who shows you travel savings without netting out the replenishment cost underneath is showing you half the picture.
"How often does this get revisited?" A one-time slotting project decays. Order profiles drift. If the answer is "once," you are buying a snapshot, not a solution.
"What number will improve, and how will we know?" The honest answer lands on lines picked and shipped per person hour, the metric that is actually on a warehouse manager's scorecard. Be wary of anyone leading with a precise savings percentage before they have seen your data.

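The netting in the second question is simple arithmetic, and a minimal sketch makes it concrete. Everything here is an assumption for illustration: the `MoveCandidate` fields, the per-pick and per-restock times, and the two example items are invented, not measured from any warehouse.

```python
from dataclasses import dataclass


@dataclass
class MoveCandidate:
    sku: str
    picks_per_week: float           # how often the item is picked
    seconds_saved_per_pick: float   # travel saved by the forward location (assumed)
    extra_restocks_per_week: float  # added replenishment trips to the forward slot (assumed)
    seconds_per_restock: float      # labor for one restock trip (assumed)


def net_seconds_saved_per_week(c: MoveCandidate) -> float:
    """Travel savings minus the replenishment labor the move creates."""
    gross_travel_saving = c.picks_per_week * c.seconds_saved_per_pick
    replenishment_cost = c.extra_restocks_per_week * c.seconds_per_restock
    return gross_travel_saving - replenishment_cost


# Hypothetical items: a fast mover and a slow mover.
fast_mover = MoveCandidate("CONSUMABLE-1", 200, 15, 3, 300)
slow_mover = MoveCandidate("COMPONENT-9", 2, 15, 1, 300)

for item in (fast_mover, slow_mover):
    print(item.sku, net_seconds_saved_per_week(item))
```

With these made-up figures the fast mover nets out positive and the slow mover nets out negative, which is the half of the picture a travel-only number hides.
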
What good help does not require

It does not require ripping out your ERP or your WMS. Placement analysis sits alongside the system you already run. It reads history, works out where inventory should be, and writes the recommended locations back. Your team keeps working in the same system.

And it does not require rebuilding your warehouse. In most operations a small share of items drives most of the movement. Moving those captures the bulk of the gain. Good help hands you a ranked list and lets you stop wherever the effort stops being worth it.

Where we fit

We are a small AI engineering firm that builds inside Dynamics 365, and this specific problem, deriving placement from order history and keeping it current, is what we focus on. We will tell you plainly: we are new to this commercially and we do not have a wall of warehouse logos. What we have is a method we can show you in full, and a way to prove it on your own data before you commit to anything.

That is the offer. Send us ninety days of order lines and a location list, and we will show you what your current placement is costing you in lines per person hour, and which items are worth moving. If nothing else, you will get a clear read on whether you have a placement problem worth solving, from your own data rather than from a rule of thumb.

What Is Dynamics 365 Dynamic Item Placement?

Mudassir Marwat — Wed, 29 Jul 2026 13:38:24 +0000

Answer first

Dynamic item placement is a Dynamics 365 Supply Chain Management feature for staging inventory more efficiently during inbound processing. Microsoft describes it with phrases like "optimal placement" and "AI-driven inventory rebalancing," which sound like the system decides where items should go. Read the feature detail directly beneath that language and you find the actual mechanism: you define the preferred storage locations and target quantities yourself, by editing them or importing them. The feature gives you a better way to enter the decision. It does not make the decision.

Read both paragraphs

This feature is a good lesson in reading documentation carefully, because two paragraphs about the same feature say different things, and both are accurate.

The business value paragraph says the system "intelligently balances inventory during inbound processes, ensuring optimal placement for fast fulfillment," and the release overview mentions "AI-driven inventory rebalancing." Taken alone, that reads as though Dynamics has started deciding where inventory should live.

The feature detail paragraph, directly below it, says: "Define preferred storage locations and target quantities for each item, either by editing directly or importing data at scale."

Read the verb in the second one. Define. By editing, or by importing. Both paragraphs are true. They are answering different questions. The first tells you what the outcome feels like. The second tells you who decides. And the answer to who decides is: you do.

What the feature genuinely improves

This is not to dismiss it. Dynamic item placement is a real improvement to how you enter and maintain placement policy at scale. If you have preferred locations and target quantities in mind, it gives you a cleaner way to load them and keep them applied during inbound work. For an operation managing thousands of items, a better way to define and maintain that policy is worth having.

What it does not do is work out what the preferred locations should be. That analysis, reading your order history to determine where each item belongs so the orders you ship are cheapest to fill, still happens somewhere upstream. In practice, for most operations, that somewhere is a spreadsheet.

Why the wording matters

Partner and vendor descriptions of this feature have run ahead of Microsoft's own. It is easy to find write-ups saying the feature determines optimal storage locations from operational conditions and inventory movement patterns. Microsoft's release plan does not say that. It says you define the locations.

The distinction is not pedantic. If you buy or configure this expecting it to derive placement, you will be surprised when it asks you to supply the placement. Knowing the difference before you plan around it saves that surprise.

The takeaway

Dynamic item placement is a better on-ramp for placement policy, not an engine that produces it. The engine, the part that reads your history and works out where things should live, is the piece that is still missing from the box, and it is the piece that actually determines how far your people walk. When a feature is described as intelligent, the useful question is always: intelligent about what, and who is still doing the deciding.

Should ABC Analysis Rank by Dollars or by Picks?

Mudassir Marwat — Wed, 29 Jul 2026 13:38:22 +0000

Answer first

For slotting, ABC analysis should rank items by how often they are picked, not by their dollar value. Most ABC reports rank by dollars because that is what the ERP produces by default, and for financial purposes that is correct. But placement is about reducing walking, and walking is driven by pick frequency, not by price. An expensive item picked twice a month does not belong in your best location. A cheap item picked forty times a day does.

The common mistake

ABC analysis sorts your items into A, B, and C groups so you can treat the important ones differently. The question is: important by what measure? Ask most systems and the answer is dollar-volume, price times quantity. That is the default report, and it is genuinely the right lens for a lot of financial and inventory decisions.

It is the wrong lens for slotting, and using it is one of the most common quiet mistakes in warehouse layout.

Why dollars are the wrong variable for placement

Slotting exists to reduce travel. Travel happens when a picker walks to an item. So the items that deserve the best, closest locations are the ones a picker reaches for most often, regardless of what they cost.

Consider two items. One is a high-value component picked twice a month. The other is a cheap consumable picked dozens of times a day. Rank by dollars and the expensive component looks like your A item and gets pride of place near packing. Rank by picks and it is obvious that the cheap consumable is what your pickers actually walk to, over and over, and it is the one that belongs up front.

Put the high-dollar, low-pick item in your golden location and you have optimized for the balance sheet in a place where the balance sheet does not walk. Your pickers still trek to the consumable dozens of times a day.

The insider version of this point

The academic literature is blunt about it. Ranking items by dollar-volume is a financial perspective, and warehouse efficiency is not a financial question, it is an operational one. For placement, the useful ranking is by pick activity.

This is also a quick way to sanity-check your own slotting. Pull your current A locations, the best pick faces near packing, and ask what is in them. If it is your highest-revenue items rather than your most-picked items, your placement is optimizing the wrong variable, and it is almost certainly because it was slotted off a default dollar-based report.

How to do it right

The fix is not complicated in principle. Pull pick frequency per item over a recent, representative window, ideally something like ninety days so that a single unusual week or promotion does not distort it. Rank by that. Compare the top of that list against what is actually in your best locations. The items that appear high on the pick-frequency list but are not in good locations are your re-slotting candidates, and the gap between the two lists is your opportunity.

The difficulty is not the concept, it is keeping it current, because pick frequency drifts as your order profile changes. But even a one-time correction from dollars to picks usually surfaces obvious wins.
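The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not a Dynamics 365 export format: the `(sku, quantity, unit_price)` tuple layout, the SKU names, and the set of "best location" items are all made up.

```python
from collections import Counter

# Hypothetical order lines: (sku, quantity, unit_price). One tuple = one pick.
order_lines = (
    [("GLOVES", 1, 2.0)] * 40
    + [("TAPE", 1, 3.0)] * 25
    + [("MOTOR", 1, 900.0)] * 2
)


def rank_by_picks(lines):
    """Rank SKUs by how many order lines touch them (pick frequency)."""
    picks = Counter(sku for sku, _, _ in lines)
    return [sku for sku, _ in picks.most_common()]


def rank_by_dollars(lines):
    """Rank SKUs by dollar-volume, the default financial ABC lens."""
    dollars = Counter()
    for sku, qty, price in lines:
        dollars[sku] += qty * price
    return [sku for sku, _ in dollars.most_common()]


def reslot_candidates(lines, best_location_skus, top_n):
    """Top-picked SKUs that are not currently in the best locations."""
    top_picked = rank_by_picks(lines)[:top_n]
    return [sku for sku in top_picked if sku not in best_location_skus]


print(rank_by_dollars(order_lines))  # dollar lens puts MOTOR first
print(rank_by_picks(order_lines))    # pick lens puts GLOVES first
print(reslot_candidates(order_lines, {"MOTOR", "TAPE"}, top_n=2))
```

In this toy data the two rankings disagree about the top item, and the re-slotting list is exactly the gap between them.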

The takeaway

If you are going to use ABC analysis to drive slotting, make sure the letters mean what you think they mean. A, B, and C should reflect how often items are picked, not how much they are worth. It is the same technique pointed at the right variable, and pointing it at the wrong one is the difference between a layout that serves your accountant and one that serves your pickers.

Claude Fable 5 Is Back Online: What Anthropic Changed, and the Jailbreak-Severity Framework Underneath

Mudassir Marwat — Thu, 02 Jul 2026 11:07:16 +0000

Claude Fable 5 came back online yesterday. Same weights, a new safety classifier, and a proposed cross-lab framework for how the industry talks about jailbreaks. If you build on Anthropic models, the model card is the least interesting change. Here is what happened in the 20 days Fable 5 was gone, what Anthropic actually changed, and the framework underneath the redeploy that matters more than the redeploy itself.

TLDR

On June 12, US export controls forced Anthropic to suspend Fable 5 and Mythos 5 globally after Amazon researchers reported a safeguard bypass. On July 1, Fable 5 returned with a new classifier that blocks the specific technique in over 99 percent of cases, at the cost of more false positives on legitimate cybersecurity coding work. The redeploy also introduced a four-criterion framework, co-developed with Amazon, Microsoft, Google, and Glasswing partners, for classifying jailbreak severity across the industry.

The 20 days Fable 5 was gone

The launch itself was clean. Anthropic shipped Fable 5 to the API and to Pro, Max, Team, and Enterprise plans on June 9, at $10 per million input tokens and $50 per million output tokens, and we covered the release the next day in our launch analysis. Three days later the model disappeared.

**June 9:** Fable 5 and Mythos 5 released. Fable 5 free on subscription plans through June 22.
**June 12:** US government imposes export controls on both models. Anthropic cannot verify nationality in real time, so both are suspended globally, for every customer, immediately.
**June 26:** US government approves Mythos 5 access for select domestic organizations.
**June 30:** Export controls lifted. Redeployment announced.
**July 1:** Fable 5 restored to global users. Mythos 5 access expanded.

The trigger was not a policy decision on its own. It was a specific bypass report from Amazon researchers, and it is worth reading the technical details before assuming the worst.

The Amazon jailbreak, and why it did not disclose what people assumed it disclosed

Amazon researchers found a prompt path that got Fable 5 past its cybersecurity safeguards by framing the request as vulnerability identification. In one run, the model returned code demonstrating how to exploit a specific vulnerability. That is what triggered the export-control directive and the global suspension.

The Anthropic write-up on the redeploy is more interesting than the headline. Testing showed that every less-capable model in the comparison, Claude Opus 4.8, GPT-5.5, and Kimi K2.7, could produce the same demonstration for the same exploit case. Anthropic's framing is that the reported technique reached "a borderline case for Fable 5's safeguards" involving routine defensive-security work, and that it did not expose any unique Mythos-level cyber capabilities.

This is the load-bearing sentence in the whole redeploy story. If every frontier model can produce the same output, the failure is not "Fable 5 is uniquely dangerous." The failure is that the industry does not yet share a way to say "this jailbreak is severity 2, not severity 4." A single ambiguous bypass produced a 20-day global suspension of a frontier model. That is a coordination problem, not a capability problem.

What Anthropic actually changed in the redeploy

The model weights did not change. Anthropic added one targeted layer on top of the existing defense-in-depth stack: a new safety classifier trained specifically on the Amazon report's bypass technique. Its job is narrow, and Anthropic states its performance in specific terms.

Blocks the reported bypass in "over 99 percent of cases."
**Reroutes blocked requests to Claude Opus 4.8,** so the user still gets an answer from the fallback model rather than a hard refusal.
**Higher false-positive rate on legitimate cybersecurity coding work,** accepted as the price of catching the specific technique. This is the trade-off engineering teams need to plan for.
**Every other layer of the defense-in-depth stack is unchanged:** training-time refusal, retroactive misuse analysis, and the broader "safety margin" classifier that treats ambiguous requests as blockable.

The last point is the one to hold in mind. The redeploy is a targeted patch on a single reported technique, not a rearchitecture. If your production traffic has never touched Fable 5's cyber safeguards, the redeploy is invisible to you.

Access, tier by tier, and when your credits start burning

The pricing card is the same as the launch: $10 per million input tokens, $50 per million output tokens. What changed is the inclusion window. Here is the picture for July.

**Pro, Max, Team, and select Enterprise:** Fable 5 included for up to 50 percent of your weekly usage through July 7. After July 7, access continues via usage credits.
**Standard Enterprise:** no included allowance. Usage credits from day one.
**Premium Enterprise:** Fable 5 included through July 7, then usage credits.
**Surfaces where Fable 5 is available today:** Claude.ai, Claude Code, Claude Cowork, and Claude Platform. AWS Bedrock, Google Cloud Vertex, and Microsoft Foundry access is in the process of being restored.
**What you need to do to reactivate:** nothing. Access is restored automatically.

The subtext is that Anthropic is still capacity-constrained. The 50-percent-of-weekly-usage cap on Pro/Max/Team is a rate-shaping decision, not a pricing decision. Teams running batch workloads through Claude Code should assume some Fable 5 requests will get rerouted to Opus 4.8 for capacity reasons on top of the classifier reroutes.

The news underneath the news: a shared jailbreak severity framework

The paragraph most people will skim in the Anthropic post is the one that will matter for years. Together with Amazon, Microsoft, Google, and Glasswing partners, Anthropic proposed a four-criterion framework for classifying jailbreak severity. The four criteria:

**Capability gain:** how far beyond existing tools does the jailbreak advance an attacker?
**Breadth of capability gain:** how many distinct offensive tasks does the technique unlock, or is it narrow?
**Ease of weaponization:** how much skilled effort is needed to turn the jailbreak into a real attack?
**Discoverability:** how easily can an average adversary find or replicate the technique?

Anthropic pairs this with a three-tier severity classification: minor jailbreaks that intrude into the safety margin but do not unlock harmful behavior; narrow harmful jailbreaks that elicit specific harmful behaviors with limited scope; and universal jailbreaks that unblock a wide range of harms. Anthropic states that no universal jailbreak has been discovered for Fable 5. The Amazon report, in this framing, was a narrow-harmful case that scored low on capability gain because every frontier model could produce the same output.

This is the first serious cross-lab attempt to standardize how jailbreak severity gets communicated to governments and to the public. Right now, every reported bypass gets covered as if it were the same event. Under a shared framework, a borderline defensive-security case scores differently from a novel bioweapon-uplift technique, and export-control decisions can be calibrated to the difference. Whether the framework survives contact with real incidents is the open question. But it is the direction that would prevent another 20-day suspension for a narrow-harmful case.

Government collaboration commitments

The redeploy also came with four specific commitments Anthropic made to US and allied governments. Worth reading if you build on Anthropic models in regulated environments.

**Pre-release access** to national-security-relevant models for independent evaluation.
**Rapid reporting** of jailbreak and misuse patterns, with new safeguards shared for testing before general release.
**Dedicated Anthropic teams** for joint government research and compute allocation.
**Contribution to shared security standards** across frontier model providers.

The pre-release evaluation commitment is the one to watch. It signals that future frontier launches will pass through a government-review window before general availability. If your product roadmap assumes day-one access to the next Anthropic model, plan for a delay.

What production engineering teams should do this week

The redeploy is not a routine model update. It is a change in three things at once: what gets blocked, how Anthropic communicates about blocks, and what governments see before you do. Here is a concrete checklist.

**Wire the fallback path explicitly.** Blocked Fable 5 requests reroute to Opus 4.8 by default. If your product logs the model that answered each request, that fallback should be visible in your analytics today. If it is not, add it before the false-positive rate becomes an invisible cost.
**Instrument the reroute rate for cybersecurity coding tasks.** If your team runs SAST-style workflows, vulnerability-review workflows, or any defensive-security codegen through Fable 5, sample 100 requests this week and count how many got rerouted. That is your new false-positive baseline, and it will move again the next time Anthropic updates the classifier.
**Recheck data-retention posture on Mythos-class models.** The redeploy introduced a 30-day data retention policy for Mythos-class models. If you push sensitive data through the trusted-access channel, review the retention terms against your compliance boundary.
**Track the severity framework.** The moment Amazon, Microsoft, or Google adopts the four-criterion classification in their public safety comms, it becomes the language your compliance and legal teams will hear jailbreaks in. Read the framework now so you can push back on lazy severity claims later.
**Do not re-architect around the classifier.** The core Fable 5 capability claims from the June 9 launch, including long-context autonomy across millions of tokens and highest-scoring performance on FrontierCode, still hold. The redeploy patched one classifier, not the model.
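The first two checklist items come down to counting, per task type, how often the model that answered differs from the model you asked for. A minimal sketch, assuming your own request log already records both: the field names `task_type`, `requested_model`, and `answered_model`, and the model identifier strings, are hypothetical placeholders for whatever your logging uses, not fields of any Anthropic API.

```python
from collections import defaultdict


def reroute_rates(records):
    """Share of requests, per task type, answered by a model other than the one requested."""
    totals = defaultdict(int)
    rerouted = defaultdict(int)
    for record in records:
        task = record["task_type"]
        totals[task] += 1
        if record["answered_model"] != record["requested_model"]:
            rerouted[task] += 1
    return {task: rerouted[task] / totals[task] for task in totals}


# Hypothetical sample of logged requests (identifiers are placeholders).
sample = [
    {"task_type": "security_codegen", "requested_model": "fable-5", "answered_model": "fable-5"},
    {"task_type": "security_codegen", "requested_model": "fable-5", "answered_model": "opus-4.8"},
    {"task_type": "security_codegen", "requested_model": "fable-5", "answered_model": "fable-5"},
    {"task_type": "security_codegen", "requested_model": "fable-5", "answered_model": "fable-5"},
    {"task_type": "general", "requested_model": "fable-5", "answered_model": "fable-5"},
    {"task_type": "general", "requested_model": "fable-5", "answered_model": "fable-5"},
]

print(reroute_rates(sample))
```

Splitting the rate by task type matters because an overall average can hide a high reroute rate on the security workflows the classifier is most likely to touch.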

The takeaway from the whole 20-day episode is not that Fable 5 was too dangerous. It is that a frontier model can be pulled off the market for three weeks because of a jailbreak that every competing model could reproduce. That is a communication failure between labs and governments, and the four-criterion severity framework is the first draft of a fix. If it holds, the next narrow-harmful bypass gets classified in a shared language instead of collapsing into "model unsafe, take it down."

Cognilium AI runs four production AI systems on Fable and Opus-class models, including agentic contract review that touches defensive-security workflows every day. If you are seeing your Fable 5 reroute rate climb this week and want a second read on your fallback wiring, talk to one of our engineers, and we will walk through the reroute traces with you.

Do You Actually Need a Knowledge Graph?

Mudassir Marwat — Thu, 18 Jun 2026 07:42:40 +0000

Most teams decide they need a knowledge graph for the wrong reason. They read that GraphRAG beats plain retrieval, they see a competitor mention a graph, and they conclude that a graph is the upgrade their AI has been missing. Then they spend three months building one, and their agent answers the same questions it answered before, only slower and at higher cost.

The opposite mistake is just as common and more expensive. A team that genuinely needs a graph builds a pile of embeddings instead, ships it, and spends the next year confused about why their agent keeps giving confident answers that fall apart the moment a question requires connecting two facts.

A knowledge graph is not an upgrade. It is a different tool for a different question. This post is the buyer's guide to telling which question you actually have, written by a team that has built both kinds of system and, more usefully, has talked plenty of clients out of the graph they thought they wanted. This is the sixth post in our series on graph rot, and it is the one to read before you commit a single sprint to building one.

Reach for a graph when the answer lives in the connections, not the content.

What does a knowledge graph give you that a vector database does not?

It gives you relationships as first-class facts, instead of relationships you have to hope the model infers from nearby text.

A vector database stores your documents as embeddings and finds the chunks that are semantically closest to a question. That is retrieval, and for a large share of AI systems it is exactly the right tool. Ask it "what does our refund policy say about damaged goods" and it will find the passage, hand it to the model, and the model will answer. The relationship you needed, this question maps to that paragraph, is a similarity relationship, and similarity is what a vector store is built to find.

A knowledge graph stores entities and the explicit edges between them. A company, the people on its board, the funds it belongs to, the documents that mention it, all connected by named relationships you can traverse. The question it answers well is not "what text is similar to this" but "what is connected to this, and through what." Our entity resolution work exists precisely because a graph treats "is the same company as" as a hard fact it can enforce, where a vector store only ever has a fuzzy sense that two chunks sound alike.

The clean way to hold the difference: a vector database is the right tool when the answer is in the content of one place, and a graph is the right tool when the answer is in the connections between many places.
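To make the "similarity relationship" concrete, here is a toy sketch of what a vector lookup does. The three-dimensional vectors and chunk names are invented for illustration. Real embeddings have hundreds or thousands of dimensions and come from an embedding model, and a real vector database uses an index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_chunk(query_vec, chunks):
    """Return the name of the stored chunk most similar to the query."""
    return max(chunks, key=lambda name: cosine(query_vec, chunks[name]))


# Hypothetical embeddings for three document chunks.
chunks = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "board_minutes": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of "what does our refund policy say about damaged goods".
query = [0.8, 0.2, 0.0]
print(nearest_chunk(query, chunks))
```

Nothing in this structure records how one chunk relates to another. It can only say which chunk is closest to the question, which is exactly the single-place case.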

When is a vector database enough?

When your hardest question is a single-hop lookup over the content of your documents, you do not need a graph, and building one is a tax you will pay forever.

Here is the honest disqualifier, and we lead with it because the cheapest knowledge graph is the one you correctly decide not to build. You probably do not need a graph if your real questions look like these. You want semantic search over a document set. You want a support bot that answers from a policy library. You want to summarize or classify text. You want "find me passages similar to this one." Every one of those is single-hop, content-bound retrieval, and a well-built vector pipeline will serve it faster, cheaper, and with far less to maintain than a graph.

You also do not need a graph if your data has almost no meaningful relationships in it, or if those relationships never get asked about. A graph earns its cost by being traversed. If nobody is asking multi-hop questions, the edges sit there as expensive decoration, and you have bought a maintenance burden with no return.

And the hardest disqualifier to hear: you should not build a graph you cannot keep fresh. A graph that ingests new documents picks up new errors on every load, which is the entire subject of keeping a graph fresh. If you do not have the capacity to maintain it, a graph does not stay an asset. It rots into a liability that produces confident, wrong answers. A vector index that goes a little stale degrades quietly. A graph that goes stale lies with structure behind it.

When do you actually need a graph?

When answering your most important question requires connecting facts that live in different places, treating different names as the same thing, or proving how you reached an answer.

There are three signals, and one is usually enough.

The first is multi-hop questions. If the answer to "which of our portfolio companies share a director with a company under investigation" requires hopping from a person to a board seat to a company to a fund, no amount of semantic similarity will assemble that chain reliably. A vector store can find documents that mention directors. It cannot traverse the relationship, because it does not store the relationship. This is the single clearest signal that you have crossed into graph territory.
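A sketch of that hop chain, using plain triples instead of a graph database. The entity names, the relation names `holds` and `director_of`, and the `under_investigation` set are all hypothetical, and a production system would use a graph store and query language rather than list scans.

```python
# Edges as (subject, relation, object) triples. All names are made up.
edges = [
    ("fund_a", "holds", "alpha_ltd"),
    ("fund_a", "holds", "beta_ltd"),
    ("j_doe", "director_of", "alpha_ltd"),
    ("j_doe", "director_of", "gamma_ltd"),
    ("m_roe", "director_of", "beta_ltd"),
]
under_investigation = {"gamma_ltd"}


def objects(subject, relation):
    """Follow an edge forward: everything `subject` points at via `relation`."""
    return {o for s, r, o in edges if s == subject and r == relation}


def subjects(relation, obj):
    """Follow an edge backward: everything pointing at `obj` via `relation`."""
    return {s for s, r, o in edges if r == relation and o == obj}


def flagged_holdings(fund):
    """Fund -> company -> director -> other company under investigation."""
    flagged = {}
    for company in sorted(objects(fund, "holds")):
        for director in sorted(subjects("director_of", company)):
            other_seats = objects(director, "director_of") - {company}
            for other in sorted(other_seats & under_investigation):
                flagged.setdefault(company, []).append((director, other))
    return flagged


print(flagged_holdings("fund_a"))
```

Each loop is one hop over a stored relationship. The returned tuples are the path itself, which is also what makes the answer explainable.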

The second is identity that has to be enforced across messy sources. When the same real-world entity shows up as "Acme Corp," "Acme Corporation," and "ACME Inc." across a thousand documents, and your answers are wrong unless those are treated as one node, you need the thing a graph gives you and a vector store cannot: a place to make "these are the same" a stored, queryable fact. That is entity resolution, and it is structurally a graph problem.

The third is provenance you can defend. In finance and legal work, "the agent said so" is not an acceptable answer. You need to show the path: this conclusion rests on this edge, which came from this clause, in this document. A graph makes that path a first-class object you can return alongside the answer. This is why our legal and finance agent work leans on graphs. The provenance is not a nice-to-have, it is the deliverable. When we built a system of 23 agents for legal review, the value was not only the answers, it was that every answer could point at the structure it came from.
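The identity signal is the easiest of the three to illustrate. Below is a deliberately naive sketch of turning "these are the same" into a stored fact by grouping name variants under one key. The suffix list and the `canonical_key` rule are assumptions for illustration only. Real entity resolution needs far more than suffix stripping, and a merge rule this simple will over-merge on real data.

```python
import re

# Assumed, incomplete list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "ltd", "limited", "llc", "co"}


def canonical_key(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve(names):
    """Group raw mentions under one node per canonical key."""
    nodes = {}
    for name in names:
        nodes.setdefault(canonical_key(name), []).append(name)
    return nodes


mentions = ["Acme Corp", "Acme Corporation", "ACME Inc.", "Apex Ltd"]
print(resolve(mentions))
```

The point is the shape of the output: one node per real-world entity, with every variant attached to it as a queryable fact.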

So how do you actually decide?

You run your hardest real question through one decision path, and the path tells you which tool you are actually buying.

Take the single most valuable question you want your AI to answer, the one that justifies the project, and walk it down the tree. Does answering it require connecting facts across different documents or sources, resolving the same entity under different names, or returning a traceable path? If the honest answer to all three is no, stop. You want a vector database and a good retrieval pipeline, and you will ship faster for admitting it. If the answer to any of them is yes, you are in graph territory, and the next question is not whether to build a graph but which graph to build.

The mistake to avoid is running an average question down the tree. Average questions are almost always single-hop, so they will tell you to skip the graph even when your highest-value question needs one. Decide on the hardest question that matters, not the typical one.
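The decision path is small enough to write down. This is a sketch of the rule as described here, not a scoring tool: the `Question` fields and the numeric `value` used to pick the question that matters most are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    value: int                        # assumed: your own estimate of business value
    multi_hop: bool = False           # needs facts connected across sources
    identity_resolution: bool = False # needs the same entity resolved across names
    traceable_path: bool = False      # needs a defensible provenance path


def recommend(questions):
    """Decide on the highest-value question, not the typical one."""
    hardest = max(questions, key=lambda q: q.value)
    needs_graph = hardest.multi_hop or hardest.identity_resolution or hardest.traceable_path
    return "graph" if needs_graph else "vector database"


questions = [
    Question("What does the refund policy say about damaged goods?", value=2),
    Question(
        "Which portfolio companies share a director with a company under investigation?",
        value=9,
        multi_hop=True,
    ),
]

print(recommend(questions))      # decided by the high-value multi-hop question
print(recommend(questions[:1]))  # only the single-hop question on the table
```

Feeding only the average question into the same function returns the vector database, which is the trap described above.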

Should you build one or buy one?

Buy or adopt off-the-shelf when your domain is generic and your relationships are standard. Build custom when your identity problem is hard, your ontology is specific, or your provenance has to hold up under scrutiny.

The buy path is real and you should take it when you can. Mature open tooling exists, and managed graph databases and off-the-shelf GraphRAG frameworks will get a standard graph running quickly. If your entities are common, your relationships are obvious, and nobody is going to audit the result, adopt the existing stack and move on. Building from scratch what you could have configured is its own kind of waste.

The build path earns its cost in three situations, and they are exactly the situations our data engineering work tends to land in. The first is when entity resolution is genuinely hard, where the same company appears eleven different ways and a naive merge produces a graph that quietly lies. The second is when your domain needs a specific ontology, because the relationships that matter in fund administration or contract review are not the ones a generic extractor pulls. The third is when provenance and correctness are load-bearing, which is when you need the scoring and acceptance gates we wrote about in scoring a graph before you trust it, not a graph that merely runs.

A blunt rule of thumb. If a wrong answer is an inconvenience, buy. If a wrong answer is a liability, the extraction, the resolution, and the validation are the hard part, and that is the part worth building carefully.

What does a knowledge graph cost you after you ship it?

The cost of a knowledge graph is not building it. It is keeping it true.

This is the line item teams forget, and it is the one that decides whether the graph was worth it. A graph is a living system. Every new document is a chance to introduce a duplicate entity, a mislinked edge, or a stale fact, which is the whole reason graph rot is the name of this series. The total cost of ownership includes incremental ingestion, ongoing entity resolution, continuous validation, and a confidence score on every node and edge so you can find the weak parts before an agent does. A vector index mostly just needs re-embedding when content changes. A graph needs governance.

So the buyer's question is not "can we build a graph." Almost anyone can stand up a graph that looks done. The question is "can we keep one true," because a graph you cannot maintain is more dangerous than the vector store you replaced. Count the maintenance before you count the benefit. If the maintenance math does not work, the honest answer is the vector database, and we would rather tell you that now than after the build.

What did building both teach us?

Across 50-plus projects since 2019, the pattern is consistent: the teams that get the most from a knowledge graph are the ones who needed it least desperately, because they decided on purpose rather than by trend. They had a real multi-hop question, a real identity problem, or a real provenance requirement, and they could name it before a line of code was written. The teams that struggle are the ones who built a graph to keep up, then went looking for questions it could justify.

The second lesson is that the line between the two tools is not as sharp as the vendors make it sound. Plenty of the systems we ship are hybrids. A vector layer for content-bound retrieval, a graph layer for the connected questions, each doing the job it is actually good at. The decision is rarely "graph or not." It is "where does the value actually live, in the content or in the connections," and the honest answer is often "both, in different places." Our enterprise retrieval work is usually exactly that hybrid, not a graph for its own sake.

If you can name your hardest question, you can answer the graph question yourself. If you cannot, that is the real problem to solve first, and no graph will solve it for you.

We build and fix knowledge graphs for AI systems, and we will tell you when you do not need one. If you are weighing a graph against a simpler retrieval system and want a straight answer, book a 15-minute call.

Keeping a Knowledge Graph Fresh Without Rebuilding It

Mudassir Marwat — Wed, 17 Jun 2026 08:18:41 +0000

A knowledge graph is never finished. It is only current as of its last document. The day after you build it, a new filing arrives, a valuation changes, a company is sold, and the graph that was correct yesterday is quietly wrong today. On the platform we run for a family office, documents arrive continuously through a Drive webhook sync, so “the graph is done” was never a state the system reached. It is always one document behind reality, and the job is to keep that gap small.

That leaves you with a hard operational choice every time new information lands. Do you rebuild the whole graph from scratch, or do you add the new facts to the graph you already have? Most teams pick one of those two answers, and both are wrong on their own. This post is about the third option, which is the only one that holds up: incremental updates that keep the graph fresh without rebuilding it and without letting it rot.

This is the fifth post in the series on graph rot. We have covered the seven failure modes, resolving duplicate entities, catching wrong edges, and scoring a graph before agents trust it. This one is about the operational reality underneath all of them: the graph keeps changing, and the changes are where rot creeps back in.

Why not just rebuild the graph every time?

Because rebuilding is expensive, it throws away history, and the new build is not guaranteed to be better than the old one.

The instinct is tempting. A full rebuild feels clean: take all the documents, run the whole pipeline, get a fresh graph. But it does not survive contact with a system that ingests documents continuously. You cannot re-extract and re-resolve an entire corpus every time one new document lands, not on cost and not on time. A graph that takes hours to build cannot be rebuilt on every Drive sync.

The deeper problem is that a rebuild is non-deterministic in the ways that matter. Extraction models drift, prompts change, and a fresh run can resolve an entity differently than the last one did, or invent a new mislink the previous build did not have. So a rebuild is not a safe refresh. It is a new graph with its own new errors, and you have thrown away the corrections, the human review decisions, and the provenance that the old graph had accumulated. You do not want to relitigate the entire graph because one document arrived.

Why is appending the new facts blindly worse?

Because blind appends are exactly how the rot from this whole series gets in.

If a full rebuild is too heavy, the lazy alternative is to just add the new document's extractions to the existing graph. Run extraction on the new file, write its nodes and edges in, move on. This scales fine and it is fast, and it is also the single most reliable way to rot a graph over time.

Every blind append is a fresh chance to create the failures the earlier posts described. The new document names a company that already exists in the graph, and if you do not resolve it, you get a duplicate entity. It asserts a relationship, and if you do not check it, you get a mislink. And worst of all, it carries a fact that contradicts a fact already in the graph, a new valuation against an old one, and a blind append leaves both sitting there, so the graph now holds two answers to the same question and an agent can retrieve either. Appending without resolving, checking, and superseding is not keeping the graph fresh. It is layering new rot on top of old.

What does keeping it fresh actually require?

It requires treating each new document as a small, careful merge into the existing graph, not a rebuild and not a dump.

The discipline has four moves, and they run on the new document and the part of the graph it touches, never on the whole thing:

**Resolve before you write.** Run the new document's entities through the same cross-document resolution the rest of the graph used, against the entities already in the graph. A company in the new filing either matches one that exists, and merges into it, or it is genuinely new. This is what stops every ingestion from minting duplicates.
**Check the new edges.** Every relationship the new document introduces goes through the same grounding and validation the build used. A new edge that cannot cite its source, or that creates a structurally impossible shape, gets flagged before it is trusted, not after an agent has walked through it.
**Supersede, do not just add.** When a new fact contradicts an existing one, the graph has to decide which is current rather than keep both. A newer cap table supersedes an older one; a current valuation replaces a stale one. The old fact is not necessarily deleted, but it stops being the answer the graph returns. This is the move that blind appends skip, and it is the cure for stale facts.
**Revalidate the touched region.** You re-run the quality checks, but only on the subgraph the new document affected, not the entire graph. The reconciler re-checks the cap table that changed; mislink detection re-examines the edges around the updated entities. The cost of keeping the graph honest scales with the size of the change, not the size of the graph.
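
The four moves can be sketched against a toy in-memory graph. This is a minimal illustration under stated assumptions, not the production pipeline: `Graph`, `resolve_key`, and `merge_document` are hypothetical names, resolution is reduced to case and punctuation folding, and same-date conflicts and source authority are not handled.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Graph:
    entities: dict = field(default_factory=dict)   # resolved key -> first-seen name
    facts: dict = field(default_factory=dict)      # (key, attribute) -> (value, as_of)
    history: list = field(default_factory=list)    # superseded facts, kept for audit
    flagged: list = field(default_factory=list)    # writes refused pending review


def resolve_key(name: str) -> str:
    # Stand-in for real entity resolution: fold case and punctuation only.
    return "".join(ch for ch in name.lower() if ch.isalnum())


def merge_document(graph: Graph, as_of: date, rows: list) -> set:
    """rows: (entity name, attribute, value, source sentence or None)."""
    touched = set()
    for name, attribute, value, source in rows:
        key = resolve_key(name)                        # 1. resolve before you write
        graph.entities.setdefault(key, name)
        if not source:                                 # 2. check: must cite a source
            graph.flagged.append((key, attribute, "no source sentence"))
            continue
        current = graph.facts.get((key, attribute))
        if current == (value, as_of):
            continue                                   # replayed row: nothing changes
        if current is None or as_of > current[1]:      # 3. supersede, keep history
            if current is not None:
                graph.history.append((key, attribute) + current)
            graph.facts[(key, attribute)] = (value, as_of)
        else:
            graph.history.append((key, attribute, value, as_of))
        touched.add(key)
    return touched                                     # 4. revalidate only these keys


graph = Graph()
merge_document(graph, date(2025, 1, 1),
               [("Acme, Inc.", "valuation", 10, "Acme was valued at 10.")])
touched = merge_document(graph, date(2026, 1, 1), [
    ("ACME Inc", "valuation", 14, "Acme is now valued at 14."),
    ("ACME Inc", "ceo", "J. Doe", None),
])
```

After the second call the graph still holds one company, the newer valuation is the current answer, the older one sits in `history`, and the uncited row is in `flagged` rather than in the graph.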

What does one incremental update actually look like?

Walk a single document through it. A new cap table for a portfolio company lands in the connected drive, and the Drive webhook picks it up.

Extraction pulls the ownership rows and the company name. Resolution runs first: the company in the new cap table is matched against the graph and merges into the existing company node rather than creating a second one. Its owners are resolved the same way, so an investor who already appears under a slightly different spelling is recognized, not duplicated.

Then superseding. The graph already holds an older cap table for this company, with older percentages. The new one does not sit beside it as a second opinion. It becomes the current ownership, and the previous cap table drops out of the answers the graph returns, while staying in the history for audit. A query for “who owns this company” now returns the new structure, not a blend of two.

Then revalidation, scoped to what moved. The reconciler re-checks just this company's cap table and confirms the new percentages still sum to a whole; if they do not, the update is flagged rather than trusted. Mislink detection re-examines the ownership edges around the updated owners. The confidence scores on the touched nodes are recomputed, and the acceptance queries that involve this company are re-run.

None of that touched the rest of the graph. One document changed one company's ownership, and the cost of keeping the graph correct was the cost of re-checking one company, not the cost of rebuilding everything. That is what incremental looks like in practice: a small, bounded, self-verifying change.

How do new documents get into the graph in the first place?

Continuously and automatically, which is exactly why the discipline above has to be automatic too.

On the family office platform, new documents arrive through a Drive webhook sync, and recurring scheduled jobs handle the steady cadence of re-checks. Nobody sits and presses “ingest.” The moment a new filing lands in the connected drive, the pipeline picks it up, extracts it with Gemini 2.5 Pro, and runs it through the resolve-check-supersede-revalidate sequence before its facts become queryable in the Neo4j graph.

This matters because the freshness problem is not a once-a-quarter migration. It is a constant trickle. A graph attached to a live document source is being updated all the time, which means the merge discipline cannot be a manual cleanup you do occasionally. It has to be the default path every single document takes on the way in. If keeping the graph fresh depends on someone remembering to clean it up, it will rot, because documents arrive faster than anyone remembers.

How do you handle a fact that simply changed?

You decide currency by recency and source, and you make the graph return the current answer, not all the answers it has ever held.

Stale facts are the quietest rot because nothing about them looks broken. The old valuation is a real number that was once correct. The departed officer really did hold the role. The graph is not wrong about the past; it is wrong about now, and an agent asking “what is this worth” or “who runs this” has no way to know it is being handed last year's truth.

The fix is to treat facts as having currency, not just existence. When a new document carries a fact that updates an old one, the newer fact, from the more authoritative or more recent source, becomes the one the graph serves. The cap table reconciliation is the clearest case: a new cap table does not sit beside the old one as a second opinion, it supersedes it, and the reconciler confirms the new totals still add up. The graph keeps the history if you need an audit trail, but it answers with the present.
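
One way to make "currency by recency and source" concrete is an ordering key. The sketch below assumes recency is compared first and source authority breaks ties. That ordering, the `AUTHORITY` ranking, and the `Fact` type are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical authority ranking; a higher number wins a tie on date.
AUTHORITY = {"cap_table": 3, "filing": 2, "email": 1}


@dataclass(frozen=True)
class Fact:
    value: float
    as_of: date
    source_type: str


def currency_key(fact: Fact) -> tuple[date, int]:
    return (fact.as_of, AUTHORITY.get(fact.source_type, 0))


def current_and_history(facts: list[Fact]) -> tuple[Fact, list[Fact]]:
    """Serve one current fact; keep the rest as an audit trail."""
    ordered = sorted(facts, key=currency_key, reverse=True)
    return ordered[0], ordered[1:]


valuations = [
    Fact(10.0, date(2025, 3, 31), "filing"),
    Fact(14.0, date(2026, 3, 31), "email"),
    Fact(13.5, date(2026, 3, 31), "filing"),
]
current, history = current_and_history(valuations)
```

Here the two 2026 figures share a date, so the filing outranks the email and becomes the served answer, while both other facts remain available for audit.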

How do you re-check quality without re-checking everything?

By scoping the checks to the change, so validation is incremental too.

This is the piece that makes continuous freshness affordable. If every new document forced a full re-score and a full mislink sweep across the whole graph, you would be back to the rebuild cost you were trying to avoid. So the checks are scoped. When a document updates a handful of entities, only those entities and their immediate edges are re-resolved and re-validated. The confidence scores on the affected nodes are recomputed, and anything that drops below threshold is surfaced for review. The acceptance queries that touch the changed region are re-run to confirm the graph still answers them correctly.

The result is a graph where the cost of staying correct is proportional to how much changed, not to how big the graph is. That is the only way a knowledge graph can grow for years and stay trustworthy, because the alternative, paying full-graph validation cost on every document, stops being affordable long before the graph is large enough to be useful. This is the same data engineering discipline that separates a pipeline that survives years of production from one that quietly degrades the moment the team stops watching it.
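
Scoping can be as simple as selecting the changed entities and the edges that touch them. The following is a minimal sketch, assuming edges are `(source, relationship, target)` tuples and confidence is a per-node float. The 0.8 threshold is a placeholder, not a figure from our pipeline.

```python
def touched_region(edges, changed):
    """Return the changed entities and only their immediate edges."""
    scoped_edges = [e for e in edges if e[0] in changed or e[2] in changed]
    return set(changed), scoped_edges


def needs_review(confidence, nodes, threshold=0.8):
    """Surface touched nodes whose recomputed confidence fell below threshold."""
    return sorted(n for n in nodes if confidence.get(n, 0.0) < threshold)


edges = [
    ("fund_ii", "OWNS", "acme"),
    ("jane", "DIRECTOR_OF", "acme"),
    ("fund_ii", "OWNS", "globex"),
    ("bob", "DIRECTOR_OF", "globex"),
]
changed = {"acme"}
nodes, scoped_edges = touched_region(edges, changed)
review_queue = needs_review({"acme": 0.72, "globex": 0.95}, nodes)
```

Only two of the four edges are re-examined, and the work stays at two edges however many unrelated companies the graph holds.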

How do you know what changed, and whether to trust it?

You make every change visible through confidence scores and re-run the acceptance gate, so freshness never becomes a blind spot.

A graph that updates silently is a graph you cannot trust, because you have no way to tell whether the last ingestion improved it or quietly broke something. Every node and edge carries a confidence score, so after an update we can ask the graph directly which of the newly-touched parts are weakest and route them to a human. And the scoring discipline from the last post does not retire after launch. The acceptance queries run again after each significant ingestion, so a document that would have introduced a regression gets caught by the same gate that vetted the original graph. Freshness and trust are the same problem: a fresh graph is only an asset if you can still prove it is right.

What did building this teach us?

The first lesson is that ingestion has to be idempotent. The same document processed twice should not create two copies of anything. Drive syncs fire more than once, jobs retry, and a pipeline that is not idempotent will duplicate its way into rot even with perfect resolution logic. Making every write safe to repeat removed an entire category of freshness bugs before they could start.
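
A minimal sketch of what "safe to repeat" can mean: key each document by a content hash and each edge by its identity, so a replay is a no-op. `IdempotentStore` is a hypothetical in-memory stand-in. In Neo4j, the analogous move is writing with Cypher `MERGE` rather than `CREATE`.

```python
import hashlib


class IdempotentStore:
    def __init__(self):
        self.seen_documents = set()
        self.edges = {}   # (src, rel, dst) -> hash of the document that asserted it

    def ingest(self, content: bytes, edges) -> bool:
        """Return True if the document was new, False if it was a replay."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.seen_documents:
            return False   # duplicate webhook or retried job: nothing to do
        for src, rel, dst in edges:
            # Keyed upsert: writing the same edge twice still leaves one edge.
            self.edges.setdefault((src, rel, dst), digest)
        self.seen_documents.add(digest)
        return True


store = IdempotentStore()
doc = b"Acme cap table, March 2026"
first = store.ingest(doc, [("fund_ii", "OWNS", "acme")])
second = store.ingest(doc, [("fund_ii", "OWNS", "acme")])
```

The document is marked as seen only after its edges are written, so a job that fails halfway and retries will redo the keyed writes rather than skip them.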

The second lesson is that superseding is harder, and more important, than adding. Adding a new fact is easy. Deciding that a new fact retires an old one, and being right about which is current, is where the real judgment lives, and it is the move that separates a graph that gets more accurate over time from one that just gets bigger and more contradictory. A graph that only ever adds is a graph that slowly fills with its own outdated answers.

A knowledge graph attached to a live source of documents is a living system, and living systems either get maintained or they rot. Rebuilding is the bulldozer and appending is the leak. Keeping it fresh is neither. It is updating like a surgeon, on the part that changed, with the checks that prove the change was safe.

We build and fix knowledge graphs for AI systems, including the continuous ingestion pipelines that keep them current. If your graph is drifting out of date faster than you can clean it, [book a 15-minute call](https://cognilium.ai/contact).

How We Score a Knowledge Graph Before We Trust It

Mudassir Marwat — Wed, 17 Jun 2026 07:03:34 +0000

Most teams ship a knowledge graph the moment it looks done. The documents are loaded, the nodes are there, the queries return something. It looks finished, so they wire the agents up and move on.

That is the mistake. “Looks done” is a feeling, not a measurement. A graph can look complete and still be full of duplicate entities, mislinks, and stale facts, and an agent querying it has no way to tell. The only way to know whether a graph is safe to trust is to score it, against a bar it could have failed, before anything is allowed to query it.

This is the fourth post in the series on graph rot. We have covered the seven ways a knowledge graph rots, how to decide that eleven names are one company, and how to catch the edges that should not exist. This post is about the step that comes after all of that: deciding, with a number, whether the graph is trustworthy enough to put in front of an AI agent.

You do not trust a graph because it looks finished. You trust it because it passed a score it could have failed.

What does it mean to “score” a graph?

It means measuring whether the graph tells the truth, not whether it is large or well-connected.

This is the distinction that trips people up. The metrics built into graph tools, like node count, edge count, and density, measure size and shape. None of them measure correctness. A graph can have a million nodes and a beautiful density score and still claim a director sits on a board he never joined. Size is not truth.

Scoring a graph means asking a different set of questions. Do the extracted facts match what the documents actually say? Did entity resolution merge the right things and only the right things? Do the edges point where the evidence points? And, most importantly, can an agent use this graph to answer the questions it will actually be asked, correctly? Those are correctness questions, and they need a grading process, not a dashboard.

Why is “it looks done” the wrong bar?

Because the failures that matter are invisible to the eye and only show up under a score.

The dangerous problems in a graph are silent. A duplicate company does not announce itself. A mislink looks exactly like a real edge. A stale valuation looks like a current one. None of them throw an error, and none of them show up when you glance at the graph and see that it has data in it. They show up only when you systematically compare the graph against ground truth and count how often it is wrong.

So “it looks done” optimizes for the wrong thing. It rewards a graph that is full, not a graph that is right. The teams that get burned are the ones who treat the presence of data as evidence of quality. The presence of data is evidence of nothing. The score is the evidence.

What do you actually score?

You score the things that break, on a fixed rubric, so the result is a number you can compare over time.

We grade AI outputs against a 100-point rubric across five dimensions. Applied to a knowledge graph, those dimensions map cleanly onto the failure modes from this series: extraction accuracy (did we pull the right fields off the page), grounding (can every fact cite the sentence it came from), identity correctness (entity resolution neither split nor over-merged), relationship correctness (no mislinks), and answer quality (can the graph actually serve a correct answer to a real query). A fixed rubric matters more than the exact points. Because it is fixed, the score means the same thing this week as it did last week, so you can tell whether the graph got better or worse after the last ingestion.

The rubric also forces honesty about partial credit. A graph is rarely all right or all wrong. It is usually 94 percent right in a way that hides the 6 percent that will produce a confident, false answer. A rubric makes you count the 6 percent instead of rounding it away.
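
To show how partial credit adds up, here is a sketch of a fixed rubric as a weighted sum. The equal 20-point split is an assumption for illustration: the five dimensions and the 100-point total come from the rubric described above, but the weights per dimension are not stated here.

```python
# Assumption: equal 20-point weights per dimension (illustrative only).
RUBRIC = {
    "extraction_accuracy": 20,
    "grounding": 20,
    "identity_correctness": 20,
    "relationship_correctness": 20,
    "answer_quality": 20,
}
assert sum(RUBRIC.values()) == 100


def score(fractions: dict[str, float]) -> float:
    """fractions: share of checks passed per dimension, each in [0, 1]."""
    missing = RUBRIC.keys() - fractions.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    return sum(RUBRIC[d] * fractions[d] for d in RUBRIC)


total = score({
    "extraction_accuracy": 1.0,
    "grounding": 1.0,
    "identity_correctness": 0.9,
    "relationship_correctness": 0.8,
    "answer_quality": 1.0,
})
```

With these inputs the graph loses 2 points on identity and 4 on relationships, which is the 6 percent a glance at the graph would have rounded away.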

How do you score a graph without grading every node by hand?

You use a model as the judge, pointed at a sampled, structured set of cases, not at the whole graph.

Grading a real graph by hand does not scale, and grading every node by model is slow and expensive. So we do what we do for retrieval systems: run an LLM-as-judge over a curated evaluation suite. Ours runs 61 evaluation cases, built in two sets of 24 and 37, each one a question with a known-good answer the graph is expected to support. The judge reads the graph's answer, compares it against the rubric, and assigns a score with a written reason, so every grade can be audited rather than taken on faith.

Two engineering details make this reliable rather than theatrical. First, we split the models: a cheaper model does the generation, and a separate, stronger model does the judging, so the grader is not marking its own homework. Second, when the judge is uncertain or its score lands near a threshold, we retry with a temperature escalation, stepping from 0.3 to 0.4 to 0.5, to see whether the verdict is stable or a coin flip. A grade that changes when you nudge the temperature is not a grade you can trust, and it gets flagged for a human.
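
The escalation retry can be sketched as a stability check. In the sketch below, `judge` is a hypothetical callable standing in for the judge-model call, and the 5-point `margin` that defines "near a threshold" is an assumed value. Only the 0.3, 0.4, 0.5 steps come from the description above.

```python
from typing import Callable

TEMPERATURES = (0.3, 0.4, 0.5)   # escalation steps


def stable_verdict(judge: Callable[[float], float], threshold: float,
                   margin: float = 5.0) -> tuple[float, bool]:
    """Return (first score, needs_human)."""
    first = judge(TEMPERATURES[0])
    if abs(first - threshold) > margin:
        return first, False                     # clear of the threshold: accept
    scores = [first] + [judge(t) for t in TEMPERATURES[1:]]
    verdicts = {s >= threshold for s in scores}
    return first, len(verdicts) > 1             # pass/fail flipped: flag for a human


# Deterministic stand-ins for a real judge call.
wobbly = {0.3: 72.0, 0.4: 68.0, 0.5: 71.0}
near = stable_verdict(lambda t: wobbly[t], threshold=70.0)
clear = stable_verdict(lambda t: 90.0, threshold=70.0)
```

The wobbly judge passes at 0.3 and fails at 0.4, so the case is routed to a human instead of being recorded as a pass.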

Can you trust a model to grade a model?

Only if you constrain it the same way you constrain the graph: ground it, and never let it free-associate.

The obvious objection to LLM-as-judge is that you are using one fallible model to grade another. It is a fair worry, and the answer is the same discipline that runs through this whole series. The judge is not asked for an opinion from memory. It is handed the graph's answer, the specific evidence the answer rests on, and an explicit rubric, and asked to grade against that evidence. We also run grounding checks against known term sets, so an answer that invents a term that appears nowhere in the source corpus fails on contact, regardless of how confident the judge feels.

The judge is a measurement instrument, and like any instrument it needs calibration. The model split, the temperature-escalation retry, the written reasons, and the grounding checks are the calibration. Without them, an LLM judge is a vibe with a number attached. With them, it is a repeatable measurement you can defend.
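
A term-set grounding check can be very small. The sketch below assumes the terms in an answer have already been extracted upstream. `normalize` and `ungrounded_terms` are hypothetical helper names, and the matching is deliberately exact after case and whitespace folding.

```python
import re


def normalize(term: str) -> str:
    return re.sub(r"\s+", " ", term).strip().casefold()


def ungrounded_terms(answer_terms, known_terms):
    """Return the answer's terms that appear nowhere in the known term set."""
    known = {normalize(t) for t in known_terms}
    return [t for t in answer_terms if normalize(t) not in known]


known_terms = ["Acme Holdings", "Fund II", "Series B"]
answer_terms = ["acme  holdings", "Fund II", "Series C"]
missing = ungrounded_terms(answer_terms, known_terms)
passes_grounding = not missing
```

The check runs before the judge's score is considered, so an invented term fails the answer regardless of the grade.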

What is the acceptance bar?

A fixed set of queries with score thresholds, and the graph does not reach an agent until it clears them.

A score on its own is just information. It becomes a gate when you attach a threshold. We hold a set of 16 acceptance test queries, each with a minimum score the graph has to hit. These are the questions the graph absolutely must get right, the ones where a wrong answer would do real damage. If the graph cannot clear the threshold on those queries, it does not get promoted to the agents, no matter how good it looks otherwise.

This is the part most pipelines skip, and it is the part that turns scoring from a report into a safeguard. A report tells you the graph is 91 percent. An acceptance gate refuses to ship the graph until the 9 percent that matters most is fixed. The gate is what makes the score load-bearing. Without it, the score is just a number someone glances at on the way to shipping anyway.
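
The difference between a report and a gate fits in a few lines. This is a minimal sketch with hypothetical query names and thresholds. The one design choice worth noting is that a query with no score counts as a failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceQuery:
    name: str
    min_score: float


def gate(queries, scores):
    """Return (promote, failures). A missing score is treated as a failure."""
    failures = [q.name for q in queries
                if scores.get(q.name, float("-inf")) < q.min_score]
    return not failures, failures


queries = [
    AcceptanceQuery("who_controls_acme", 95.0),
    AcceptanceQuery("exposure_to_counterparty", 90.0),
]
scores = {"who_controls_acme": 97.0, "exposure_to_counterparty": 85.0}
promote, failures = gate(queries, scores)
```

The average of these two scores is 91, which would read well in a report. The gate still returns `promote = False` because one must-pass query missed its threshold.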

How do you keep it honest after launch?

You re-score continuously, because a graph that passed once is not a graph that passes forever.

A knowledge graph that takes in new documents is a moving target. Every ingestion is a chance to introduce a fresh duplicate, a new mislink, a stale fact. So the score is not a launch gate you pass once. It is a recurring check. Every node and edge carries a confidence score, which lets us ask the graph directly for its weakest parts and route them to review, and the acceptance queries run again after each significant ingestion. A graph that was clean at launch and never re-scored is just a graph that is rotting more slowly than an unscored one. Re-scoring after launch is the same discipline we bring to keeping any production AI system trustworthy after it ships, not just on the day it launches.

What did building this teach us?

The first lesson is that the rubric matters more than the model doing the grading. Teams obsess over which model should be the judge. What actually moved the needle was having a fixed, written rubric and a fixed set of acceptance queries, so the score meant the same thing every time. Swap the judge model and a good rubric still produces a comparable grade. Keep a vague rubric and the best judge in the world produces noise.

The second lesson is that the acceptance gate is the whole point. Scoring without a gate is a measurement nobody acts on. We have watched scores get generated, noted, and then ignored as the graph shipped anyway under deadline pressure. The threshold is what removes the human's ability to wave the graph through, and that is exactly why it works. The score has to be able to say no, and the team has to have agreed in advance to listen.

A graph that an agent can trust is a graph that earned that trust against a bar it could have failed. Everything before the score is hope. The score is where hope turns into a number, and the gate is where the number turns into a decision.

We build and fix knowledge graphs for AI systems, and the agent and retrieval layers that sit on top of them. If you are about to trust an agent to a graph you have never scored, [book a 15-minute call](https://cognilium.ai/contact).

The Edge That Shouldn't Exist: Detecting Wrong Relationships in a Knowledge Graph

Mudassir Marwat — Mon, 15 Jun 2026 07:51:52 +0000

A knowledge graph we run for a family office once told an agent that a managing director sat on the board of a company he had never been part of. Both the director and the company were real. Both had been correctly identified, deduplicated, and scored. The only thing wrong was the line drawn between them: an edge no document actually supported.

That edge passed every check we had at the time. The director node was clean. The company node was clean. The relationship had a type, a direction, and a confidence score. Nothing was missing. It was simply false.

This is the third post in the series on graph rot. The first named the seven ways a knowledge graph rots. The second went deep on duplicate entities and how you decide that eleven names are one company. This one is about the failure mode that is hardest to see and most dangerous to leave in place: the mislink, an edge that should not exist.

What is a mislink, and how is it different from a duplicate?

A mislink is a relationship between two nodes that are each correct, but the connection between them is wrong. The endpoints are right. The edge is a lie.

That makes it the mirror image of the duplicate problem. A duplicate is a failure of identity: one real thing stored as several nodes. A mislink is a failure of relationship: two real things joined by an edge that no source supports. Fixing duplicates is about deciding what a node is. Fixing mislinks is about deciding whether a connection is true.

In a knowledge graph, the nodes are the nouns and the edges are the claims. “Acme is owned by Fund II” is a claim. “Jane Doe is a director of Acme” is a claim. The entire reason you build a graph instead of keeping a pile of documents is so an agent can traverse those claims to answer questions. Which means a wrong edge is not a cosmetic flaw. It is a false statement the agent will repeat as fact.

The nodes are the nouns. The edges are the claims. A mislink is a false claim that passed every check you had.

Why are mislinks the hardest kind of graph rot to catch?

Because every individual piece of a mislink looks valid.

When a node is duplicated, you can often spot it by scanning for similar names. When a fact is stale, you can check it against a date. A mislink has none of those tells. The source node is a real, validated entity. The target node is a real, validated entity. The relationship type is one your schema allows. The edge even carries a confidence score, because the extraction step that created it was confident.

Nothing about a single mislink is anomalous on its own. It reveals itself only in context: when you cross-check it against the source documents, against the other edges around it, or against a ground truth like a cap table. A spot check finds them by accident. A system has to go looking for them on purpose. That is exactly why most pipelines never catch them. They were built to extract relationships, not to doubt them.

In the research literature, this falls under knowledge graph refinement, the body of work on finding and repairing wrong facts in a graph. Heiko Paulheim's survey on the subject is the standard reference, along with the error-detection work that followed it. Very little of that research has been turned into production tooling, and that gap is part of why a real pipeline so rarely ships with a step whose only job is to catch wrong edges.

Where do mislinks come from?

They come from four places, and naming them is half the battle.

**Over-eager extraction.** The language model is asked to find relationships, so it finds relationships. Given a document that mentions a director and a company in the same paragraph, a model will often connect them even when the text only places them on the same page. Co-occurrence is not a relationship, but to a model under instruction to extract edges, it can look like one.
**Ambiguous references.** A document says “the Fund,” or “the Company,” or “he,” and the extractor has to decide which fund, which company, which person. Resolve that reference to the wrong entity and you get a perfectly typed edge pointing at the wrong node. This is where identity and relationship blur together: a near-miss in entity resolution becomes a wrong edge.
**Cross-document stitching errors.** When you assemble one graph from six document types, you stitch facts from a PPM to facts from a K-1 to facts from a cap table. Each stitch is an inference. Get one wrong and you connect the right two entities through the wrong intermediary.
**Schema pressure.** A pipeline built to populate a fixed set of relationship types tends to force ambiguous evidence into one of those slots rather than leave it unconnected. The edge gets created because the schema has a place for it, not because the document earned it.

How do you detect an edge that shouldn't exist?

You run a dedicated pass, after the graph is built, whose only job is to doubt edges.

On the family office platform, this is a post-creation mislink-detection stage. It sits after extraction, identity resolution, and validation in our eight-stage pipeline, and before any agent is allowed to query the Neo4j graph of five node types: company, person, investment, vehicle, and document. It works on three signals, because no single signal catches every mislink.

**Grounding.** Every edge must be able to point to the sentence it came from. An edge whose supporting evidence cannot be located in any source document is the first thing we flag. This is the same grounding discipline used against hallucination elsewhere: a claim that cannot cite its source does not get to stay.
**Structural anomaly.** Some edges are wrong because they are structurally impossible or improbable. A person who suddenly holds board seats at forty companies, a fund that owns itself, an investment that points at a person instead of a company. Type constraints and degree checks catch the edges that violate the shape the graph is supposed to have.
**Evidence re-reading with a model.** For the edges that survive the first two checks but still look suspicious, we hand the model the edge and the exact passages it was supposedly drawn from, and ask a narrow question: does this text actually support this relationship, yes or no, with a reason. This is LLM disambiguation pointed at edges instead of entities, and it is deliberately scoped to the ambiguous middle so the cost stays bounded.
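
The structural signal is the easiest of the three to sketch, because it needs no documents and no model. The example below checks for self-loops, type violations, and an implausible number of board seats. The `ALLOWED` schema, the relationship names, and the `MAX_BOARD_SEATS` cap are hypothetical. Only the five node types are taken from the platform described above.

```python
from collections import Counter

# Hypothetical schema: relationship -> (allowed source types, allowed target types)
ALLOWED = {
    "DIRECTOR_OF": ({"person"}, {"company"}),
    "OWNS": ({"vehicle", "company", "person"}, {"company", "vehicle"}),
    "INVESTS_IN": ({"investment"}, {"company"}),
}
MAX_BOARD_SEATS = 10   # hypothetical degree cap


def structural_flags(node_types, edges):
    """Return (src, rel, dst, reason) for every edge that breaks the graph's shape."""
    flags = []
    seats = Counter()
    for src, rel, dst in edges:
        if src == dst:
            flags.append((src, rel, dst, "self-loop"))
            continue
        src_ok, dst_ok = ALLOWED.get(rel, (set(), set()))
        if node_types.get(src) not in src_ok or node_types.get(dst) not in dst_ok:
            flags.append((src, rel, dst, "type violation"))
        if rel == "DIRECTOR_OF":
            seats[src] += 1
    for person, count in seats.items():
        if count > MAX_BOARD_SEATS:
            flags.append((person, "DIRECTOR_OF", "*", f"{count} board seats"))
    return flags


node_types = {"fund_ii": "vehicle", "acme": "company",
              "jane": "person", "inv_1": "investment"}
edges = [
    ("fund_ii", "OWNS", "acme"),
    ("fund_ii", "OWNS", "fund_ii"),     # a fund that owns itself
    ("inv_1", "INVESTS_IN", "jane"),    # an investment pointing at a person
    ("jane", "DIRECTOR_OF", "acme"),
]
flags = structural_flags(node_types, edges)
```

A flag here means "worth doubting", not "wrong". Flagged edges still go to grounding and, if needed, to the evidence re-read before anything is removed.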

When do you bring a model in to judge an edge?

Only on the suspicious minority, never on the whole graph.

Re-reading every edge with a model would be slow and expensive, and most edges are obviously fine. So the cheap checks run first: grounding and structural anomaly filter the graph down to the edges that are actually in doubt. Only those reach the model, and when they do, the model is not asked to imagine a relationship. It is asked to confirm or reject one against evidence already in hand, and to explain its verdict so the decision can be audited later.

This is the same principle from the entity-resolution work: the model is the most expensive tool in the pipeline, so you spend it only where cheaper signals have already narrowed the question. It judges. It does not hunt.
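
The ordering can be sketched as a triage loop in which the model is only ever called on edges a cheap signal has already marked as doubtful. Both callables are hypothetical stand-ins: `is_suspicious` for the combined grounding and structural checks, and `ask_model` for the evidence re-read.

```python
from typing import Callable

Edge = tuple[str, str, str]


def triage(edges: list[Edge],
           is_suspicious: Callable[[Edge], bool],
           ask_model: Callable[[Edge], tuple[bool, str]]):
    """Return (verdict per edge, number of model calls spent)."""
    verdicts: dict[Edge, str] = {}
    model_calls = 0
    for edge in edges:
        if not is_suspicious(edge):
            verdicts[edge] = "accepted by cheap checks"
            continue
        model_calls += 1
        supported, reason = ask_model(edge)   # confirm or reject, with a reason
        verdicts[edge] = ("confirmed: " if supported else "rejected: ") + reason
    return verdicts, model_calls


evidence = {("jane", "DIRECTOR_OF", "acme"): "Jane Doe joined the Acme board in 2021."}
edges = [("jane", "DIRECTOR_OF", "acme"), ("john", "DIRECTOR_OF", "acme")]


def no_evidence(edge: Edge) -> bool:
    return edge not in evidence


def fake_model(edge: Edge) -> tuple[bool, str]:
    return False, "no passage mentions this relationship"


verdicts, model_calls = triage(edges, no_evidence, fake_model)
```

Keeping the written reason in the verdict is what makes each rejection auditable later.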

How do you check edges at scale without re-reading everything?

By treating suspicious-edge selection as its own step.

A graph built from a multi-document portfolio has far more edges than you can afford to re-verify one by one. The trick is the same one that makes entity resolution tractable: you do not compare everything to everything. You generate a candidate set of edges worth doubting, using cheap structural and grounding signals, and you spend the expensive verification only on that set. Most edges never need a second look. The ones that do are the edges where the evidence is thin, the structure is odd, or the same relationship was asserted inconsistently across two documents.

And where a ground truth exists, you use it without mercy. A cap table is a closed system: ownership percentages sum to a whole, and every stake ties to an owner. So the cap-table reconciler doubles as a mislink detector. If an ownership edge implies a stake that breaks the totals, the edge is wrong, and the arithmetic says so without anyone re-reading a single page. This is the same reconciliation we lean on for data engineering and pipeline correctness across the platform: trust the signal that is expensive to fake, and let the math flag what the prose hides.
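The arithmetic check is small enough to show directly. This sketch assumes a reconciled cap table held as a mapping from owner to percentage, and tests whether a newly extracted ownership edge can be true at the same time; the function name and the data are hypothetical.

```python
from decimal import Decimal

WHOLE = Decimal("100")


def breaks_cap_table(reconciled: dict[str, Decimal], owner: str, stake: Decimal) -> bool:
    """True when recording `stake` for `owner` pushes total ownership past 100%."""
    others = sum(
        (pct for name, pct in reconciled.items() if name != owner),
        Decimal("0"),
    )
    return others + stake > WHOLE


# Made-up cap table that already sums to a whole.
cap_table = {
    "Founder": Decimal("60"),
    "Fund I": Decimal("30"),
    "ESOP": Decimal("10"),
}

# An extracted edge claiming a new 15% holder cannot coexist with the table.
new_holder_breaks = breaks_cap_table(cap_table, "Fund II", Decimal("15"))
# Re-asserting an existing stake at its recorded value is consistent.
restated_ok = not breaks_cap_table(cap_table, "Fund I", Decimal("30"))
```

`Decimal` is used so that percentages such as 33.3 do not pick up floating-point noise near the 100% boundary.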

What does a mislink cost once an agent acts on it?

More than a wrong answer. It costs a wrong decision that looks well-sourced.

When a GraphRAG or agent system traverses the graph to answer “who controls this company” or “what is our exposure to this counterparty,” it treats every edge as true. A mislink does not produce a vague or hedged answer. It produces a specific, confident, traceable one that happens to be false. The agent will even cite the node it walked through, which makes the wrong answer more believable, not less.

This is why edge correctness is a precondition for everything built on top of the graph. When we put an enterprise retrieval and agent layer over a knowledge graph, its trustworthiness is capped by the edges underneath it. A reranker cannot fix a false relationship. A better model only states the falsehood more fluently. The correctness has to live in the graph itself.

How do you know your edges are trustworthy?

You measure the graph against its sources and its ground truths, not against itself.

Every edge in the family office graph carries a confidence score from 0.0 to 1.0 and, where possible, a pointer to its supporting evidence. That lets us ask the graph directly for its weakest edges and route them to review, the same way we surface low-confidence entities. We also score the extraction the way we score everything else: judge models running a 100-point rubric across five dimensions, against a suite of 61 evaluation cases, including grounding checks that fail an extraction for asserting a relationship the text does not support.
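As an illustration of asking for the weakest edges, here is an in-memory sketch. On the real graph this would presumably be a query against Neo4j; the `ScoredEdge` fields, the 0.6 threshold, and the queue size are assumptions chosen for the example.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredEdge:
    src: str
    rel: str
    dst: str
    confidence: float                   # 0.0 to 1.0
    evidence_ref: Optional[str] = None  # pointer to supporting evidence, if any


def review_queue(edges: list[ScoredEdge], limit: int = 2,
                 threshold: float = 0.6) -> list[ScoredEdge]:
    """Pick the lowest-confidence edges among those that are weak or unsourced."""
    weak = [e for e in edges if e.confidence < threshold or e.evidence_ref is None]
    return heapq.nsmallest(limit, weak, key=lambda e: e.confidence)


edges = [
    ScoredEdge("Fund I", "OWNS", "Acme", 0.95, "doc-1#p3"),
    ScoredEdge("Jane Roe", "BOARD_SEAT", "Acme", 0.40, "doc-2#p1"),
    ScoredEdge("Fund II", "OWNS", "Beta", 0.90, None),
    ScoredEdge("Series A", "INVESTS_IN", "Beta", 0.20, "doc-3#p7"),
]
queue = review_queue(edges)
```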

The test that matters in the end is blunt. Pick a claim the agent makes, follow the edge back to the document, and see whether the document actually says it. When that round trip holds, the edges are trustworthy. When it does not, you have mislinks you have not found yet.

What did building it teach us?

The first lesson was that you cannot prevent mislinks at extraction time, only reduce them. An extractor tuned to never invent an edge also misses real ones. So the leverage is not a perfect extractor. It is a good-enough extractor followed by a dedicated pass that doubts edges after the fact. Detection beats prevention here, because detection gets to use the whole graph and every ground truth, while extraction only ever sees one document at a time.

The second lesson was that the most valuable signal was also the cheapest: grounding. Most mislinks could not point at a real sentence, because they were never in the text to begin with. Requiring every edge to cite its source caught more wrong edges than any clever model did, and it cost almost nothing to run.

We build and fix knowledge graphs for AI systems, including a document-intelligence platform for a family office managing $850M in assets. If your agents are acting on relationships you are not sure are real, [book a 15-minute call](https://cognilium.ai/contact).

Why AI Agents Join Data That Should Never Connect (And How to Stop It)

Mudassir Marwat — Fri, 12 Jun 2026 11:27:55 +0000

Niels Zeilemaker, global CTO at Xebia, recently named the real reason enterprise AI agents fail, and it is not the model. Speaking to AI News, he warned that an agent "will join different fields together in your data which should never be connected," and that "these mistakes are not the fault of the agent. It’s the fault of your foundation."

The data backs him up. Xebia’s Data & AI Monitor 2025/26, a survey of more than 500 data and AI decision-makers across Europe, the US, the Middle East, and Africa, found that fewer than 20 percent of organizations rate their data foundation as "highly mature." While 82 percent accelerated AI experimentation in the past year, only 14 percent believe their data architecture is ready to support AI at scale.

Of all the ways a weak foundation breaks an agent, one is uniquely dangerous, and it is the one worth taking apart in detail: the silent join. Not the agent that fails loudly and returns an error. The agent that confidently fuses two unrelated things and hands you an answer that looks perfectly reasonable and is completely wrong.

The failure nobody catches

Most agent failures announce themselves. The agent times out, returns nothing, or says something obviously off. Those are cheap, because you can see them.

The silent join is expensive precisely because it is invisible. The agent retrieves real data, reasons over it correctly, and produces a fluent, plausible answer. The only problem is that two of the facts it combined describe different things in the real world. Nothing in the output signals the contamination. It passes review. It ships. It is wrong.

Here is the shape of it.

Imagine a company with two contracts both nicknamed "Helios" internally. One is a vendor agreement under a US subsidiary. The other is a client agreement under an EU entity, governed by different law, with different termination terms. To a human, these are obviously separate. To a vector retrieval layer, they are nearly identical: same nickname, same contract vocabulary, same semantic neighborhood.

An agent is asked: "What are the termination terms for Helios?"

The retriever returns its top matches by similarity. The termination passages from both contracts score high, because both are textually about "Helios" and "termination." The agent receives a mixed bag of passages, has no signal that they come from two different legal entities, and writes one coherent answer that blends a US clause with an EU clause.

The answer reads beautifully. It is also a summary of a contract that does not exist.
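A toy reproduction makes the mechanism concrete. This sketch swaps a real embedding model for a bag-of-words vector with cosine similarity, which is enough to show the effect; the contract ids, clause wording, and notice periods are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# (source contract, passage) pairs; the retriever only ever sees the passage.
chunks = [
    ("C-4471", "Helios termination: either party may terminate on 30 days notice."),
    ("C-8830", "Helios termination: the client may terminate on 90 days notice."),
    ("C-1200", "Office lease renewal schedule, Austin site."),
]

query = embed("What are the termination terms for Helios?")
top2 = sorted(chunks, key=lambda c: cosine(query, embed(c[1])), reverse=True)[:2]
source_contracts = {contract_id for contract_id, _ in top2}
```

Both Helios passages outrank everything else, so the top two results span two different contracts. Nothing in the similarity score says they belong to separate legal entities.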

Why vector retrieval cannot prevent this

This is not a tuning problem. It is structural.

A vector database stores meaning as proximity in an embedding space. It is built to answer "find me text similar to this," and it is excellent at that. What it has no concept of is identity. It does not know that Helios-the-vendor-agreement and Helios-the-client-agreement are two distinct entities that must never share a sentence. To the index, they are two nearby points, and "nearby" is exactly what it is designed to return together.

You can raise the similarity threshold, add re-ranking, or shrink the chunk size. None of it fixes the root cause, because the root cause is that the system has no model of which things are allowed to connect. Similarity is not identity, and an agent built only on similarity will keep stitching distinct entities together whenever they happen to look alike.

The structure that makes the bad join impossible

A knowledge graph stores meaning differently. It stores explicit, typed relationships between resolved entities. The connections an agent is permitted to make are the connections you declared, and nothing else.

In graph form, the two Helios contracts become two separate nodes with distinct identities. One node, Contract {id: "C-4471", alias: "Helios"}, links by an :UNDER edge to entity US-Sub. The other, Contract {id: "C-8830", alias: "Helios"}, links to EU-Sub. Each contract connects to its own termination clause through a :HAS_CLAUSE edge, so the two contracts own separate, non-overlapping clauses.

Now the same question runs as a traversal, not a similarity search. The alias "Helios" resolves to a set of candidate nodes, and the agent must commit to one identity before it can retrieve anything. The query matches a contract by its resolved id, follows only that single node’s :HAS_CLAUSE edges, and returns the termination clause for that one contract.

Because retrieval is scoped to a single resolved node, the termination clause of C-4471 can never be returned alongside the clause of C-8830. The blend is not unlikely. It is structurally impossible. If the alias is ambiguous, the graph forces the system to disambiguate or ask, which is the human fallback Zeilemaker says agents lack, rebuilt into the data layer itself.
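The same idea in miniature, using a plain dictionary as a stand-in for the graph. The clause texts, the `resolve` helper, and the `AmbiguousAlias` exception are hypothetical; the point is only that retrieval takes a resolved id, never an alias.

```python
from typing import Optional


class AmbiguousAlias(Exception):
    """Raised when an alias maps to more than one contract node."""


# Dictionary stand-in for two Contract nodes with :UNDER and :HAS_CLAUSE edges.
CONTRACTS = {
    "C-4471": {"alias": "Helios", "under": "US-Sub",
               "clauses": {"termination": "termination clause text for C-4471"}},
    "C-8830": {"alias": "Helios", "under": "EU-Sub",
               "clauses": {"termination": "termination clause text for C-8830"}},
}


def resolve(alias: str, entity: Optional[str] = None) -> str:
    """Commit to exactly one contract id, or refuse."""
    candidates = [
        cid for cid, node in CONTRACTS.items()
        if node["alias"] == alias and (entity is None or node["under"] == entity)
    ]
    if not candidates:
        raise LookupError(f"no contract with alias {alias!r}")
    if len(candidates) > 1:
        raise AmbiguousAlias(f"{alias!r} matches {candidates}; disambiguate or ask")
    return candidates[0]


def termination_clause(contract_id: str) -> str:
    """Follow only this node's clause edge; other contracts are unreachable."""
    return CONTRACTS[contract_id]["clauses"]["termination"]


try:
    resolve("Helios")
    bare_alias_refused = False
except AmbiguousAlias:
    bare_alias_refused = True

eu_id = resolve("Helios", entity="EU-Sub")
eu_clause = termination_clause(eu_id)
```

A bare "Helios" is refused, and once the caller names the EU entity, only the clause attached to C-8830 can come back.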

The part the news report skips: this is not free

It would be dishonest to stop at "use a graph and you are safe." The hard problem just moves.

A knowledge graph is only as trustworthy as its entity resolution. If your graph construction wrongly merges the two Helios contracts into one node, you have not solved the silent join. You have hard-coded it. The failure shifts from query time to build time, where it is harder to spot and affects every query after.

This is why the real engineering work sits in the construction pipeline, not the query. You need entity resolution that knows C-4471 and C-8830 are distinct despite the shared alias, and a mislink-detection pass that catches bad merges after the fact. We covered that exact problem, disambiguating entities and then detecting mislinks after creation, in Gemini-Driven Entity Disambiguation With Post-Creation Mislink Detection.

It also means a graph is not always the answer. If your domain has a single entity type and no cross-entity joins to worry about, vector retrieval alone is simpler and fine. The graph earns its cost the moment you have multiple entities that can be confused for one another, relational questions, or compliance boundaries that must not be crossed. In most real systems the right answer is hybrid: vector search for recall, a graph constraint for identity. We walked through that combination in Hybrid Retrieval With Prefetch-Time Metadata Filtering.

How to find out if your agent is doing this right now

The silent join hides, so you have to go looking. A practical audit:

**Build an ambiguity test set.** List the entities in your domain that share names, codes, or aliases: duplicate customer names, reused contract numbers, project codenames that recur. These are your collision points.
**Query each one and inspect the retrieved sources, not just the answer.** The answer will look fine. Check whether the retrieved chunks all trace to a single real-world entity. If a single query pulls passages from two distinct entities, you have found a live silent join.
**Log provenance on every retrieved chunk.** If you cannot tell which source entity a passage came from, neither can your agent, and neither can you when you audit it.
**Ground the output against a controlled identity, not just plausibility.** Validating answers against a domain source at runtime catches contamination the model would otherwise smooth over, the approach in Anti-Hallucination via Runtime Grounding Against a Domain Vocabulary.
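The second and third steps of the audit can be automated with very little code. This is a sketch that assumes each retrieved chunk is a dict carrying a `chunk_id` and, when provenance was logged, an `entity_id`; both field names are hypothetical.

```python
from collections import defaultdict


def audit_retrieval(chunks: list[dict]) -> dict:
    """Group retrieved chunks by source entity and report cross-entity retrieval."""
    by_entity = defaultdict(list)
    untraceable = []
    for chunk in chunks:
        entity = chunk.get("entity_id")
        if entity is None:
            untraceable.append(chunk["chunk_id"])  # no provenance logged
        else:
            by_entity[entity].append(chunk["chunk_id"])
    return {
        "entities": sorted(by_entity),
        "untraceable": untraceable,
        "silent_join": len(by_entity) > 1,
    }


# Chunks returned for one query from the ambiguity test set (made-up data).
retrieved = [
    {"chunk_id": "k1", "entity_id": "C-4471"},
    {"chunk_id": "k2", "entity_id": "C-8830"},
    {"chunk_id": "k3"},
]
report = audit_retrieval(retrieved)
```

Running this over every query in the ambiguity test set gives a list of collision points where the retriever is already mixing entities, plus the chunks that cannot be audited at all.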

The headline from the Xebia research is correct: when your agents fail, look at the foundation first. We would add one line. The failures that cost the most are rarely the loud ones. They are the silent joins that read perfectly and connect two things your business never meant to connect. Fixing the model will never catch those. Only the structure of your data foundation can.

Cognilium AI has shipped 50+ projects and runs four production AI systems built on this foundation: graph-backed retrieval, entity resolution with mislink detection, and runtime grounding. If you suspect your agents are quietly fusing data that should stay separate, Talk to an Engineer, and we will run the ambiguity audit with you.

GraphRAG vs Flat-Vector RAG: Why 2026 Is the Year Graph Retrieval Graduates to Default

Mudassir Marwat — Wed, 10 Jun 2026 13:34:37 +0000

Two years after Microsoft published the original GraphRAG paper (arXiv:2404.16130), the engineering pattern has stabilised. Across enterprise deployments — including the K-12 publisher writing co-pilot we documented in our k12 writing co-pilot case study and the supervisor + 7-agent architecture in the multi-family-office case study — the same conclusion keeps surfacing: for knowledge corpora where relationships matter more than passages, graph retrieval beats flat-vector retrieval at production scale.

What "graduated to default" actually means

Until late 2025, GraphRAG looked like an academic flourish bolted onto vector retrieval. Three things changed.

Construction cost has collapsed. Current-generation Claude and Gemini models have dropped per-token costs for entity and relationship extraction by roughly an order of magnitude vs the GPT-4-era baseline. A 1M-character corpus that cost several hundred dollars to graph-index in early 2024 is now in the tens of dollars range — cheap enough to re-index on schedule, not just at project start.
Standard tooling has emerged. Neo4j shipped a stable neo4j-graphrag Python package and LlamaIndex has matured its KnowledgeGraphIndex. The construction pipeline that previously required bespoke code is now a library call.
The retrieval pattern stabilised on hybrid. Pure graph traversal loses long-context paraphrase; pure flat-vector loses entity relationships. The settled production pattern is hybrid: graph traversal for relationship hops, BM25 plus dense embeddings for passage relevance, reciprocal-rank fusion for the final top-k.
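Reciprocal-rank fusion is simple enough to show in full. Each retriever contributes 1 / (k + rank) for every document it returns, and the sums decide the final order. The sketch below uses k = 60, a commonly used constant; the document ids and the three input rankings are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_k: int = 3) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


graph_hits = ["d3", "d1"]        # from relationship traversal
bm25_hits = ["d1", "d2", "d3"]   # from keyword search
dense_hits = ["d2", "d1", "d4"]  # from embedding search

fused = reciprocal_rank_fusion([graph_hits, bm25_hits, dense_hits])
```

Because only ranks are used, the three retrievers do not need comparable score scales. A document that appears in all three lists, like `d1` here, rises to the top even when no single retriever ranked it first everywhere.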

Where graph retrieval wins decisively

Multi-hop relationship queries. "Which counterparties did Trust A's holdings overlap with Trust B's between 2020 and 2024, filtered to those with active litigation?" Flat-vector retrieval returns passages mentioning each entity separately and the JOIN logic falls apart at retrieval time. A property graph answers this in a single Cypher query, with the relationships first-class.

Disambiguation against vocabulary. When a 50-page methodology document uses "Slinky Test" to refer to a specific teaching strategy, pure embedding similarity may surface unrelated passages about slinkys. Graph nodes anchored to a controlled vocabulary catch this — the vocab becomes a first-class retrieval target, not a probabilistic match. We covered the runtime grounding pattern in detail in Anti-Hallucination via Runtime Grounding.

Audit and provenance. Every answer in a graph-grounded system can cite the exact node and edge it relied on. For regulated workloads — financial diligence, healthcare records, legal contract review — this is the difference between deployable and not.

Where flat-vector still wins

Pure narrative corpora. A blog archive, a book of essays, a transcript library — anything where the relationships ARE the prose, not metadata about it. Building a graph here adds latency and infrastructure for marginal precision gains.

Latency-sensitive single-turn lookup. Sub-200ms retrieval still favors an HNSW index over a Cypher query, even with the cleanest graph schema. If you are powering autocomplete or real-time voice retrieval, flat-vector is structurally the right tool.

Volatile corpora without event-driven re-indexing. If documents change daily and your graph build is a nightly batch, graph drift becomes the silent killer. We documented the failure mode in Graph Rot: Why Your Knowledge Graph Is Lying to You.

The 2026 production baseline

The pattern that consistently ships across the deployments we run:

Construction. Entity and relationship extraction via current-generation Claude or Gemini, schema-first prompting, batched with retry-on-validation-fail. The schema decision matters more than the model: a typed Pydantic schema with constrained relationship types prevents the LLM from inventing edges that look plausible but break query intent.
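To illustrate what a constrained relationship type buys you, here is a sketch using only the standard library as a stand-in for a Pydantic model. The relation names and the `parse_extractions` helper are assumptions for the example; the rejected items are what a retry-on-validation-fail step would send back to the extractor.

```python
from dataclasses import dataclass
from enum import Enum


class RelType(str, Enum):
    """Closed set of relationship types the extractor may emit (illustrative)."""
    OWNS = "OWNS"
    BOARD_SEAT = "BOARD_SEAT"
    INVESTS_IN = "INVESTS_IN"


@dataclass(frozen=True)
class Relationship:
    source: str
    rel: RelType
    target: str

    def __post_init__(self):
        # Coerce to the enum; an unknown type raises ValueError.
        object.__setattr__(self, "rel", RelType(self.rel))


def parse_extractions(raw: list[dict]) -> tuple[list[Relationship], list[tuple[dict, str]]]:
    """Split raw extractor output into validated edges and rejects to retry."""
    accepted, rejected = [], []
    for item in raw:
        try:
            accepted.append(Relationship(**item))
        except (ValueError, TypeError) as exc:
            rejected.append((item, str(exc)))
    return accepted, rejected


raw_output = [
    {"source": "Fund I", "rel": "OWNS", "target": "Acme"},
    {"source": "Jane Roe", "rel": "MENTORS", "target": "Acme"},  # not in the schema
    {"source": "Series A", "rel": "INVESTS_IN"},                 # missing target
]
accepted, rejected = parse_extractions(raw_output)
```

The invented `MENTORS` edge never reaches the graph, no matter how plausible the surrounding text made it look.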

Storage. Neo4j Community Edition handles up to roughly 100M nodes and 1B edges at single-instance scale. Beyond that, Memgraph or NebulaGraph for distributed deployments. For most enterprise corpora — under 10M documents — single-instance is the right call.

Retrieval. Hybrid is the default. Cypher for relationship hops, Qdrant or pgvector for dense passages, BM25 (via fastembed or your vector DB's hybrid mode) for keyword precision, reciprocal-rank fusion for top-k. Resist the urge to use an LLM as the fusion ranker on the hot path — it costs latency you cannot afford for marginal precision.

Generation. Tool-use over retrieval results, not chain-of-thought over a single prompt. Let the LLM decide whether to follow another graph hop or stop at the current passages. This is the difference between a system that explains its citations and one that hallucinates them.

The takeaway

GraphRAG is no longer experimental. It is the boring default for relationship-heavy enterprise knowledge work. The argument for flat-vector RAG is still strong where it always was — pure narrative, hot-path latency, simple semantic search — but the days of defaulting to flat-vector because GraphRAG was 'too complex to build' are over. The tooling is here. The cost has collapsed. The pattern has stabilised.

If you are still on flat-vector for a knowledge corpus where relationships drive value, 2026 is the year to migrate. The engineering case is no longer open.

Anthropic just shipped two new Claude models. The interesting one isn’t generally available.

Mudassir Marwat — Wed, 10 Jun 2026 13:34:36 +0000

Anthropic shipped two new frontier models on June 9, 2026: Claude Fable 5, generally available with full safeguards, and Claude Mythos 5, the same underlying model with safeguards lifted in cyber and biomedical research for trusted partners. Pricing matches the prior Opus tier at $10 per million input tokens and $50 per million output tokens. The naming is a bilingual hat-tip: Fable from Latin fabula, Mythos from the cognate Greek, both meaning "that which is told."

What changed

The Fable 5 and Mythos 5 release marks Anthropic’s first explicit two-tier launch. Fable 5 is the model on the Claude API and on Pro/Max/Team/Enterprise plans, included at no extra cost from June 9 through June 22. Mythos 5 is the same weights served via two channels: Project Glasswing partners (cyber safeguards lifted) and a trusted-access program for select biomedical researchers (biology and chemistry safeguards lifted, cyber retained).

Both run on the same inference stack. The safeguards are AI classifiers that route flagged requests to Claude Opus 4.8 as a fallback. Anthropic reports fallbacks fire in under 5% of sessions on average.

Why the capability bar moved

Anthropic claims state-of-the-art on "nearly all tested benchmarks" and frames three concrete capability jumps that matter to production AI engineering teams. The full system card breaks down evaluation methodology and known limits.

**Long-context autonomy.** Fable 5 holds focus across millions of tokens, with a file-based memory subsystem that lets it reach the final act of Slay the Spire three times more often than Claude Opus 4.8.

**Software engineering at compressed timeframes.** Stripe used Mythos 5 to complete a codebase-wide migration on its 50-million-line Ruby codebase in a single day, work that Stripe estimates would have taken a full engineering team over two months by hand.

**Vision-only autonomous control.** Mythos 5 completed Pokemon FireRed using a vision-only harness fed raw game screenshots. Earlier Claude models required a complex helper harness to make progress. The same vision stack rebuilds full web apps from screenshots alone.

Benchmarks and partner results

Anthropic released Fable 5 and Mythos 5 with statements from a dozen partner organizations. Specific scores are sparse on some benchmarks (Anthropic publishes the comparison chart in the post but withholds exact percentages for several); the named-partner results below give a more grounded picture of where the model has actually been deployed and tested.

Software engineering

**Cognition (Scott Wu, CEO):** Fable 5 is the "highest-scoring model on FrontierBench, Cognition's frontier coding eval." Wu notes the model "excels at long-horizon reasoning and generalizes to unfamiliar tools." Anthropic adds that Fable 5 scores highest among frontier models on FrontierCode "even at medium effort."

**Cursor (Michael Truell, CEO and co-founder):** "State of the art on CursorBench," with Truell describing it as "opening up a class of long-horizon problems that were out of reach."

**GitHub (Mario Rodriguez, Chief Product Officer):** Long-horizon coding tasks ran "at a level of autonomy and reliability that exceeded previous benchmarks."

**Stripe:** Migrated a 50-million-line Ruby codebase in one day. Stripe estimates the same migration would have taken a full team over two months by hand.

Finance, analytics, and quantitative reasoning

**Hebbia:** "Highest score of any model" on the Hebbia Finance Benchmark, with "substantial gains in document-based reasoning, chart and table interpretation, and problem solving."

**IMC:** Aced trading-analysis evaluations "nearly across the board."

**Izzy Miller, AI Research Lead (quoting an internal benchmark):** "First to break 90% on our core analytics benchmark of complex, long-running analytical tasks, a 10-point jump over Opus."

**Damian Miraglia, finance principal engineer (external partner):** Called Fable 5 the "strongest finance-first model" tested, "a notable step up."

Scientific reasoning and biology

In blinded head-to-head comparisons against Opus-class models, scientists preferred Mythos 5's molecular biology hypotheses approximately 80 percent of the time. One Mythos-generated hypothesis, a novel mechanism for an E. coli protein, was independently corroborated in a bioRxiv preprint by an external lab working on the same problem.

**Protein and drug design:** Anthropic reports the model accelerated parts of the protein and drug design process by roughly ten times relative to skilled human operators working with the same bioinformatics tools. Of 14 protein targets tested, nine yielded strong candidates spanning immune checkpoints, growth-factor and receptor signaling, neurodegeneration, muscle disease, and harder structural targets.

Physics research

**Matthew Pines, CEO (frontier physics research partner):** "Strongest model we've tested on frontier physics research while using a third of the reasoning tokens. In 36 hours it got nearly to where GPT-5.5 landed after four days." That is nearly the same end-state in 36 hours instead of 96, roughly 2.7x faster wall-clock, on one-third of the reasoning tokens.

Game-playing and long-horizon reasoning

**Pokemon FireRed:** Completed the game with a "minimal, vision-only harness," fed raw game screenshots. Earlier Claude models required a complex helper harness.

**Slay the Spire:** With a persistent file-based memory subsystem, Fable 5 reaches the game's final act three times more often than Claude Opus 4.8 on the same harness.

Safety and red-teaming

**External bug bounty:** Anthropic reports "no universal jailbreaks in over 1,000 hours of testing." A universal jailbreak is defined as "any prompt, script, or harness that allows a user to interact with a model as if its safeguards were not present."

**UK AI Safety Institute (AISI):** "Made progress towards [a universal jailbreak] within a brief initial testing window." This is the only named external entity that approached a working jailbreak.

**Cyberattack-specific evaluations:** Across 30 public jailbreak techniques covering attack planning, exploit development, and defense evasion, an external partner reports Fable 5 "complied with zero harmful single-turn requests."

**Alignment:** Anthropic reports "Mythos 5's level of misaligned behavior was low and similar to that of Opus 4.8."

What this changes for production AI work

For teams shipping with Anthropic models, pricing parity at $10/$50 makes Fable 5 a drop-in upgrade from Opus 4.8 with no cost surprise. The "millions of tokens" autonomy claim is the lever that will most affect agent architectures we ship: supervisor + worker patterns that previously needed aggressive context budgeting can simplify when the model holds focus longer.
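For budgeting, the quoted prices reduce to simple arithmetic. The sketch below uses the $10 and $50 per million token figures from this post; the token counts in the example are hypothetical.

```python
from decimal import Decimal

INPUT_USD_PER_MTOK = Decimal("10")   # per million input tokens, as quoted above
OUTPUT_USD_PER_MTOK = Decimal("50")  # per million output tokens, as quoted above
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of one request at the quoted per-million-token prices."""
    return (
        INPUT_USD_PER_MTOK * input_tokens + OUTPUT_USD_PER_MTOK * output_tokens
    ) / MILLION


# Hypothetical long-context agent turn: 200k tokens in, 8k tokens out.
cost = request_cost(200_000, 8_000)
```

At these prices the example turn costs $2.40, with most of it coming from the input side, which is why context budgeting still matters even when the model can hold focus longer.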

The vision benchmarks matter for any team building computer-use agents or document-intelligence pipelines where layout fidelity has been the bottleneck.

The Mythos 5 partner-only model signals where Anthropic is going on dual-use. Cyber safeguards remain on for biomedical partners; biological and chemical safeguards remain on for cyber partners. The split tracks dual-use risk along compartments rather than a single trust gate.

What we’d watch next

Three signals over the next 30 days. First, whether the millions-of-tokens autonomy claim survives contact with real production workloads beyond Anthropic's curated benchmarks. We will be running retention tests on the supervisor + worker architectures from the multi-family-office case study. Second, whether vision benchmarks translate to document-intelligence pipelines in regulated industries. Third, the trajectory of the trusted-access Mythos 5 program: which research programs get safeguards lifted, and how Anthropic communicates the boundary publicly.

Fable 5 is on the Claude API today at claude-fable-5. We will benchmark it against Opus 4.8 across our GraphRAG and voice AI stacks this week. Analysis to follow.