DEV Community

Cover image for I Installed a Knowledge Graph on Claude Code. 14 Days Later, the Audit Killed My Enthusiasm.
Phil Rentier Digital
Phil Rentier Digital

Posted on • Originally published at rentierdigital.xyz

I Installed a Knowledge Graph on Claude Code. 14 Days Later, the Audit Killed My Enthusiasm.

When I heard about graphify, I thought it was going to improve my code and cut my tokens. So 15 days ago I installed it on 2 production projects. I was sure I had upgraded my agent. Two weeks later, I audited the JSONL transcripts. I had to read the numbers twice.

Across 60 sessions, the agent fired around 1,500 hook reminders and invoked the graph exactly 0 times. The grep won the arbitrage every single time, not because the graph was bad but because the graph took one extra step.

TLDR: You can write the most beautiful rule in your CLAUDE.md, declare your MCP server, drop a hook that pings on every commit. The agent nods, then does something else. What changes an agent's behavior is not the context you give it. It is what you remove from its path.

I Wanted This to Work. Badly.

I went all in on graphify. Triple promise: persistent memory of the codebase, auditable graph you can read like a map, cross-document surprises the LLM cache cannot see. Published numbers in the 49-71x range for token savings on queries that hit the graph cleanly. Every box checked.

I ran the full pass on both projects: a Next.js storefront sitting on a Convex backend (290-ish files), and a content pipeline I keep coming back to (387 TypeScript files). One AST pass plus a semantic extraction layer, around $0.30 to $1 per project on Sonnet. The post-commit git hook went in, the MCP server got declared, the CLAUDE.md rule was injected verbatim by graphify claude install. 7 tools visible in the panel. Clean.

Then I pushed it further. Usage log file to track wins. Dedicated Claude memory just for graphify feedback. Review date 7 days out, bumped to 14 when the data felt thin. I had written the evaluation calendar before I had a single data point, which in hindsight is the move of someone who already wants the answer. (You wrote a calendar like that once too, I am not the only one.)

I was sure I had just upgraded my agent. The MCP server was breathing. The graph was warm. The hook was firing. I had nothing to do but wait for the gains.

The Audit That Killed My Enthusiasm

Claude Code keeps a transcript of every session as a .jsonl file under ~/.claude/projects/<repo>/. Every tool call is logged with its name and inputs. A real invocation looks like "type":"tool_use","name":"mcp__graphify__...". The startup deferred_tools_delta lists the available tools but is not an invocation. I grep'd for the first, ignored the second.

Here is what 14 days of "I'm sure this is working" actually produced.

grep "tool_use" ~/.claude/projects/pbn/*.jsonl | grep graphify
grep "tool_use" ~/.claude/projects/rentier/*.jsonl | grep graphify
Enter fullscreen mode Exit fullscreen mode

The numbers, side by side:

  • CLI calls to graphify query / path / explain: 0 on pbn, 0 on rentier
  • MCP graphify tools invoked by the agent: 0 on pbn, 0 on rentier
  • Active reads of GRAPH_REPORT.md: 1 on pbn, 0 on rentier
  • Lines added to my usage log: 1 line on pbn, file never created on rentier
  • Project memories quoting a graph insight: 0 on both

The infrastructure did its job perfectly. 290 commits on the content pipeline and 38 on the storefront were digested by the post-commit hook without a single failure. Zero cost overruns, zero friction, the graph stayed fresh through every push. The MCP server booted clean every session, the 7 tools were visible in the panel, the rule was sitting at the top of CLAUDE.md where the agent supposedly reads it on every cold start. Everything I could control mechanically worked. The one thing I could not control, the reading itself, did not happen.

The one documented productive use was a 404 bug on a partner site I was debugging on May 2. The report pointed at the right cluster of modules. Good. The initial diagnosis was still wrong though, the agent accused humanize.ts, the real culprit was mesh.ts:tryInsertLink. The graph helped narrow the zone. It did not solve. I had to verify against the actual .meta.json files to flip the diagnosis. I saved maybe 500 tokens of grep work and 3 file reads. Marginal compared to the published numbers around 49-71x token savings, as documented by Mustafa Genc on GoPenAI and the CLSkills integration guide (April 2026). Those numbers are measured in controlled benchmarks where an agent must use the graph. My audit measures the opposite. What an agent does when it has the choice.

The Grep Won Because Grep Is Shorter

An AI agent runs a real-time arbitrage at every step. What's the cheapest path to done? Text rules enter that arbitrage as signals, not as laws. Yajin Zhou put it cleanly in his March 2026 post: for AI, rules aren't laws, they're suggestions. Christoph Schweres documented the architectural side a month earlier: after context compression, CLAUDE.md changes status, it stops being a rule and becomes information the agent may or may not weight. That is exactly what played out across my 1,500 ignored hook reminders.

Two concrete reasons graphify lost the arbitrage.

Reason 1, the short path always wins. To answer "where is X defined", "who calls Y", "what touches Z": grep -r gives exact lines in 1-2 seconds. Reading GRAPH_REPORT.md gives 500 tokens of thematic summary, then the agent still has to open the files. One step versus two. The real-time arbitrage picks the short path, no matter what you wrote in caps inside CLAUDE.md. That is not a bug. That is the weighting as it was trained.

Reason 2, text rules have no persistent discipline. My PreToolUse:Bash hook fired over 1,500 reminders telling the agent "3 in-scope file(s) changed, re-run /graphify update". Not one triggered a graph consultation. The reminders became noise to parse and skip. The pattern is heavily documented now: GitHub issues #18660, #22022, #19635, and the very recent #57200 filed May 8 (the reporter burned 85K tokens re-discovering a decision that was already in memory, because the laws.md rules did not survive context compression). Mustafa Morbel framed the distinction well in his April piece: CLAUDE.md is advisory by nature, hooks are deterministic. Text gets debated. Mechanics do not.

I wrote about CLAUDE.md hygiene before, 47 lines in my CLAUDE.md and why Claude burns 50 instructions before mine even loads. That article said write better rules. This one says even the best-written rule stays advisory and loses to the arbitrage. Both are true. Neither is sufficient.

Where I Should Have Used It (And Where You Probably Should)

I made a setup mistake. Two of them, actually.

Mistake 1, I installed a discovery tool on familiar terrain. The graphify docs are explicit about the target use cases: a codebase you are touching for the first time, a reading list (papers, tweets, notes), a research corpus, a personal /raw folder where you dump everything. Neither of my two projects is any of those. I wrote half the code and re-read the rest many times. Same for Claude, the model's working memory of these repos was already warm. The cross-document surprises that a community-detection pass is supposed to reveal don't surprise anyone who already knows the links. I installed a compass inside an apartment I've lived in for 5 years.

Mistake 2, I deployed it cold instead of hot. A context tool wins the real-time arbitrage when the agent has no cheaper alternative. On a fresh repo, grep is still expensive (the agent sweeps blind), so reading the graph first becomes the short path. After 3-6 months of exposure, grep has a warmth cache, and the graph becomes the detour again. I had already crossed that line months ago on both projects. By the time graphify arrived, the agent had already built whatever passes for muscle memory in a stateless system, and that muscle memory said "grep first, ask questions never."

Here is the usage map I should have respected from day one.

Onboarding on an inherited repo you are taking over without knowing it. The graph probably wins, because grep is blind at the start and a topical map saves real reading time. This is the case Jo Van Eyck argued for in his "AI coding agents are useless on large codebases" video, and on that specific frame he is right. The bigger the unknown surface, the more a global map earns its keep.

Cross-module security audit on a legacy repo where you are hunting hidden dependencies. Explicit case from the graphify docs, and it makes sense, since the agent needs a global view that no single grep gives. You are not looking for a string, you are looking for a topology.

Mixed reading list (papers + notes + tweets + reference code) where community detection genuinely surfaces a view no grep can produce. Probably the strongest use case, and the one closest to the original /raw philosophy that started the whole conversation. Heterogeneous corpora benefit most from structural extraction precisely because grep across formats is a bad joke.

Not for maintaining a project you already know. That is what I did, and it was the wrong application.

This nuance does not rescue my experience. It clarifies what I should have tested first. Rajistics has a YouTube piece titled "GraphRAG (you probably don't need it)" that lands on a similar position from the RAG side. The takeaway is the same on knowledge graphs: case-by-case, not always.

(Side note that doesn't really go anywhere: I keep noticing that the tools I install with the most enthusiasm are the ones I read the least about beforehand. The boring tools I skim once and forget are usually the ones I keep using 6 months later. There's probably a lesson there about how enthusiasm distorts evaluation. Or maybe I just have bad taste in tools. Hard to say honestly.)

The Real Lesson: Text Doesn't Change Agent Behavior. Mechanics Do.

This pattern is bigger than graphify. You hand the agent a new tool, a new doc, a new rule. You write it well. You remind it in CLAUDE.md. You drop a hook that pings. The agent nods. Then it does something else. The pattern shows up across GitHub issues, r/ClaudeAI threads, and at least a dozen recent Medium pieces. It is the dominant failure mode of contextual tooling in 2026.

What actually works are 3 mechanical levers.

Lever 1, block instead of remind. A PreToolUse hook that refuses the alternative action forces the path. Mustafa Morbel documented the contrast already: CLAUDE.md is advisory, hooks are deterministic. Any rule that can be ignored will be ignored the moment the real-time arbitrage finds a shorter route. The fix is not better wording. The fix is removing the alternative from the menu. If your hook only reminds, you have built a polite voice that the agent learns to tune out. If your hook blocks the call, you have built a wall.

Lever 2, make the tool the short path, not a detour. A knowledge graph that answers the question wins. A knowledge graph that points at files the agent still has to open loses, because it added a step instead of removing one. Same principle for MCP servers: a server that eliminates N tool calls wins, a server that adds N+1 loses. I argued the same thing about CLIs versus MCP in why CLIs beat MCP for AI agents: CLIs win precisely because they slot into the agent's native path instead of asking it to detour into a JSON-RPC handshake. The mechanism generalizes to any contextual tool. If it adds a step, it loses. If it removes one, it wins. That is the entire game.

Lever 3, introduce the tool at the right moment. A context tool wins on cold start, not on warm maintenance. That is the error I made with graphify, but it is also the error developers make when they add 200 lines to their CLAUDE.md 6 months into a project. Too late. By then the agent's pattern of behavior on that repo is fixed by the existing tools (grep, ls, cat, the file tree it has already learned to navigate). A new tool added on top has to overcome inertia, and inertia in a real-time arbitrage means "this path was cheap last time, take it again."

The 3 levers come at a cost. They make the tooling more invasive. A blocking hook annoys you when you want to make an exception. A short-path tool replaces the flexibility of grep with the rigidity of a fixed schema. A cold-start tool means you have to plan your tooling before you have a project, which most of us never do, because we install tools when we feel the pain, not before. So everyone reaches for the advisory layer instead, writes more lines into CLAUDE.md, drops another reminder hook, and tells themselves the rules are clear enough this time. I have done this. You probably have too. The advisory layer is comfortable precisely because it lets us pretend we have changed the agent's behavior without paying the cost of actually changing it.

Actually, wait. Let me put it differently. The advisory layer is like writing a strongly-worded email to your past self and expecting your future self to obey it. Good luck with that. 😅

The lesson holds beyond Claude Code. If you build an AI workflow that depends on the model following written rules, you are betting on a property the model does not have. The model has weights, not laws. I wrote about a related angle in Vibe Coding, For Real: vibe-coders believe writing the right rules to the agent is enough to ship. Shipping comes from the mechanics of the build, not from the prose around it. Same principle here, scaled up to the tooling layer.

Audit Your Own Transcripts Before You Trust Your Own Rules

My finding on graphify is not that the tool is bad. The tool works as advertised when the agent actually uses it. The problem is that tool adoption by agents is 10 times harder than tool installation. Any tool that asks an agent to change its default path goes through the same real-time arbitrage. My CLAUDE.md goes through it, my reminder hooks go through it, even the prompts I am most proud of go through it. Everyone has their graphify, and they don't know it until they audit.

So audit. The command fits on one line:

grep "tool_use" ~/.claude/projects/<repo>/*.jsonl | grep <tool_name>
Enter fullscreen mode Exit fullscreen mode

Count the real invocations. Compare against what your CLAUDE.md, your hooks, and your rules pretend the agent should be doing. The gap is the actual cost of what you wrote in words. If the gap is zero, you know what to do next: move the rules that matter from text to mechanics. The rest can stay advisory, nobody reads it anyway.

A knowledge graph the agent doesn't read isn't a tool. It's documentation. Same goes for your rules. C'est la vie.

Sources

  • graphify repository and docs, github.com/safishamsi/graphify
  • Christoph Schweres, Claude Code Ignores the CLAUDE.md, HOW Is That Possible? (Feb 2026)
  • Yajin Zhou, Why an AI Agent Broke Its Own Rules (March 2026)
  • Mustafa Morbel, Taming Claude Code: A Guide to CLAUDE.md and Hooks (April 2026)
  • Mustafa Genc, Graphify: Build a Knowledge Graph From Your Entire Codebase (GoPenAI, April 2026)
  • GitHub issues anthropics/claude-code #18660, #22022, #19635, #57200
  • Jo Van Eyck, AI coding agents are useless on large codebases. Unless you do THIS. (YouTube)
  • Rajistics, GraphRAG (you probably don't need it) (YouTube)

This post may contain affiliate links. If you click them, I might earn a small commission, costs you nothing, and helps me keep shipping quality articles every day for your reading pleasure.

Top comments (0)