The daemon had been watching the vault for two weeks. Five hundred and forty-five files — conversation exports, session notes, design documents, research captures — sitting in an Obsidian vault with no analytical metadata. No concept tags. No framework annotations. No discipline classification. Each file arrived raw and stayed raw unless I manually enriched it during import.
I'd built a digest system — a pipeline designed to scan recent imports against the knowledge graph and surface connections between concepts. It was supposed to find where new material touched old material in ways I hadn't noticed. In practice, it spent its entire analytical budget on a prerequisite step: figuring out what each file was about before it could ask how files related to each other. The system designed to find connections was stuck classifying.
One session, working on why the digest felt so slow, a question surfaced that reframed the problem: why is tagging manual? The daemon already watched for new files, indexed frontmatter, tracked changes. It was doing infrastructure work — the passive kind. Files should arrive pre-tagged the way they arrive pre-indexed. By the time the digest runs, the vault should already be enriched. That reframing took five minutes. The implementation took two hours and nine bugs.
Part 9 of Building at the Edges of LLM Tooling. If you're building systems where LLM infrastructure needs to do cognitive work in the background — classifying, enriching, preparing material before a session starts — the automation boundary isn't where you expect it. Start here.
Why It Breaks
Post 8 built the persistence layer — a graph database with typed relationships, a vector store for discovery, and an MCP server that makes it all queryable from inside any LLM session. The session bridge works. New sessions open with structural context instead of compressed summaries.
But the persistence layer stores what's been explicitly extracted. Each concept in the graph got there because a session produced it, the capture protocol flagged it, and I reviewed and approved it. That process works for concepts — ideas with analytical substance worth naming and connecting. It doesn't work for the raw material that concepts emerge from.
Five hundred and forty-five conversation files in the vault. Each one contains concepts, frameworks, references to thinkers, discipline-specific vocabulary — but none of that is tagged. The graph knows about 130 named concepts with typed relationships. The vault holds the raw sessions where those concepts were developed, argued about, refined, and sometimes abandoned — and the graph can't see any of it because the files have no structured metadata connecting them to the concepts they contain.
The digest system was designed to bridge this gap. Scan recent imports. Compare against existing concepts. Surface convergence signals — moments where new material touches old material. But without pre-existing tags on the files, the digest's first step was always classification: read the file, determine what it's about, then check for connections. Every analytical pass started with the same prerequisite work. The digest was doing the operator's job before it could do its own.
Manual enrichment doesn't scale. Not because the tagging itself is hard — reading a file and identifying its concepts takes minutes. Because the cognitive cost is in the wrong place. Every minute I spend classifying a file is a minute not spent evaluating what the classification reveals. The bottleneck isn't the work itself. It's that the work requires human attention for a task that doesn't require human judgment.
What I Tried
The reframe: classification is infrastructure. It should happen in the background, the way file indexing happens in the background. The daemon already existed — a file watcher running as a macOS LaunchAgent, monitoring the vault for new arrivals. It tracked filesystem events, logged imports, maintained a queue. Passive infrastructure. The change was making it active: when a file arrives, don't just log it. Read it. Extract structured metadata. Write the tags back into frontmatter. Move on.
The architecture was straightforward. File arrives. Watcher queues it. Worker calls the Claude API with a focused extraction prompt — not open-ended conversation, a specific ask: identify concepts, frameworks, disciplines, and named references in this file. Structured JSON response writes back into YAML frontmatter. Watcher re-indexes the enriched file. The vault file that arrived raw now carries analytical metadata without the operator knowing it happened. Event-driven. The same watcher-queue-worker pattern from production systems, applied to a knowledge vault.
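That loop can be sketched with stdlib pieces alone. This is a minimal sketch, not the daemon's actual code: a stub stands in for the Claude API call, and the names (`extract_metadata`, `enrich_file`, `work_queue`) are illustrative.

```python
# Watcher -> queue -> worker enrichment loop, sketched with the stdlib.
# extract_metadata() is a stub for the real API call; the actual worker
# would send the file text to Claude and parse a structured JSON reply.
import json
import queue
from pathlib import Path

work_queue: "queue.Queue[Path]" = queue.Queue()

def extract_metadata(text: str) -> dict:
    """Stand-in for the focused extraction prompt."""
    return {"concepts": [], "frameworks": [], "disciplines": []}

def enrich_file(path: Path) -> None:
    """Write extracted tags back into the file's YAML frontmatter."""
    text = path.read_text(encoding="utf-8")
    meta = extract_metadata(text)
    yaml_lines = "\n".join(f"{k}: {json.dumps(v)}" for k, v in meta.items())
    if text.startswith("---\n"):
        # Merge into existing frontmatter, just before the closing fence.
        head, _, body = text[4:].partition("\n---\n")
        text = f"---\n{head}\n{yaml_lines}\n---\n{body}"
    else:
        text = f"---\n{yaml_lines}\n---\n{text}"
    path.write_text(text, encoding="utf-8")

def worker_loop() -> None:
    while True:
        path = work_queue.get()   # blocks until the watcher queues a file
        enrich_file(path)         # tag it; the watcher then re-indexes
        work_queue.task_done()
```

The watcher side (filesystem events feeding `work_queue.put`) is whatever the daemon already had; the only new moving part is the worker that reads, extracts, and writes back.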
Then the debugging arc.
Nine bugs across five system layers. Each fix revealed the next failure in a different layer. Two hours, nine restarts.
The first bug was a duplicate function declaration — two versions of the same processing function in the daemon script, left over from iterating on the extraction prompt. Python doesn't warn about this; the second definition silently shadows the first. The daemon crashed on startup. Fix: delete the duplicate. Restart.
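The failure mode is easy to reproduce. Python accepts two definitions of the same name in one module and the later one silently replaces the earlier one at import time (linters flag this, the interpreter does not):

```python
# Minimal reproduction of bug #1: duplicate function definitions.
def process(path):
    return "old extraction logic"

def process(path):   # no warning; this definition shadows the first
    return "new extraction logic"

assert process("x") == "new extraction logic"
```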
The second bug was macOS LaunchAgent throttling. The daemon crashed enough times during development that launchd started throttle-limiting restarts — a 10-second backoff that made the daemon appear unresponsive when it was actually waiting for permission to start. Fix: unload and reload the agent. Restart.
The third bug was vault directory coverage. The watcher was monitoring the top-level vault directory but not subdirectories where most conversation files lived. Five hundred and forty-five files invisible to the daemon because they were one level too deep. Fix: recursive directory watching. Restart.
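The same coverage gap shows up in a one-shot scan, which is the easiest way to see it: a non-recursive glob reads only the top level, while a recursive one descends. (If the watcher is built on a library like watchdog, the fix is typically its recursive scheduling flag; that library choice is an assumption, not stated in the post.)

```python
# Bug #3 in miniature: top-level-only scanning misses nested files.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "top.md").write_text("x")
nested = root / "conversations" / "2024"
nested.mkdir(parents=True)
(nested / "deep.md").write_text("x")

flat = sorted(p.name for p in root.glob("*.md"))    # top level only
deep = sorted(p.name for p in root.rglob("*.md"))   # all levels

assert flat == ["top.md"]
assert deep == ["deep.md", "top.md"]
```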
The fourth bug was SQL. The backfill query — designed to find all files not yet enriched — used a WHERE clause that excluded NULL values. In SQLite, column != 'enriched' doesn't match rows where the column is NULL. Five hundred and forty-five files had NULL enrichment status. The query returned zero results. Fix: add OR column IS NULL. Restart.
The fifth bug was also SQL. Double-quote characters in file paths breaking query string interpolation. Fix: parameterized queries. Restart.
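Both SQL bugs reproduce in a few lines of sqlite3. This is a sketch against a toy schema, not the daemon's actual tables: NULL compared with != evaluates to NULL under three-valued logic, so the row is dropped, and bound parameters make quote characters in paths a non-issue.

```python
# Bugs #4 and #5 in miniature: NULL semantics and query parameterization.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT, status TEXT)")
con.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("notes/a.md", None), ("notes/b.md", None), ("done/c.md", "enriched")],
)

# Bug #4: NULL != 'enriched' is NULL, not TRUE, so no rows match.
broken = con.execute(
    "SELECT path FROM files WHERE status != 'enriched'").fetchall()
assert broken == []

# Fix: handle NULL explicitly.
fixed = con.execute(
    "SELECT path FROM files WHERE status != 'enriched' OR status IS NULL"
).fetchall()
assert len(fixed) == 2

# Bug #5's fix: a path full of quotes is safe as a bound parameter.
weird = 'notes/he said "hi".md'
con.execute("INSERT INTO files VALUES (?, NULL)", (weird,))
row = con.execute("SELECT path FROM files WHERE path = ?", (weird,)).fetchone()
assert row[0] == weird
```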
The sixth bug was a stale absolute path from an earlier vault location, hardcoded in the daemon's configuration rather than derived from the current environment. The daemon was looking for files in a directory that hadn't existed for weeks. Fix: environment-relative paths. Restart.
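The fix pattern is small: resolve the root from the environment at startup instead of baking in an absolute path. The variable name VAULT_DIR here is an assumption for illustration.

```python
# Bug #6's fix pattern: derive the vault root from the environment.
import os
from pathlib import Path

def vault_root() -> Path:
    # Fall back to a conventional location under the current home directory.
    return Path(os.environ.get("VAULT_DIR", str(Path.home() / "vault")))

os.environ["VAULT_DIR"] = "/tmp/test-vault"
assert vault_root() == Path("/tmp/test-vault")
```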
The seventh, eighth, and ninth bugs were all related to the API key. An invisible character appended during clipboard paste. Then a doubled prefix — sk-ant-sk-ant- — from pasting the key into a field that auto-prepended the provider prefix. Then, after both were fixed, the key had been exposed in a terminal command and needed regeneration. Three bugs, one root cause: the boundary between the operator's clipboard and the system's credential store had no validation.
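A validator at that boundary would have caught all three. This is a sketch of what such a check might look like, mirroring the failures described above: strip non-printable paste debris, collapse a doubled provider prefix, and shape-check before storing. (The sk-ant- prefix is taken from the post; the function name is illustrative.)

```python
# Root-cause fix for bugs #7-#9: validate the key at the clipboard boundary.
def clean_api_key(raw: str) -> str:
    # Strip whitespace and invisible characters picked up during paste.
    key = "".join(ch for ch in raw.strip() if ch.isprintable())
    # Collapse a doubled prefix from pasting into an auto-prefixing field.
    while key.startswith("sk-ant-sk-ant-"):
        key = key[len("sk-ant-"):]
    if not key.startswith("sk-ant-"):
        raise ValueError("key does not look like an Anthropic API key")
    return key

# A pasted key with a zero-width space, trailing newline, and doubled prefix:
assert clean_api_key(" sk-ant-sk-ant-abc123\u200b\n") == "sk-ant-abc123"
```

It can't fix the third bug after the fact, since an exposed key still needs regeneration, but it stops the first two from ever reaching the credential store.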
Nine bugs. Five system layers — application code, operating system services, database queries, filesystem paths, API authentication. Two hours. Each bug diagnosed through conversational problem-solving with the IDE agent. The LLM identified each individual bug correctly. What it couldn't do was see the cascade — the way fixing the LaunchAgent throttle revealed the directory coverage gap, which revealed the SQL NULL semantics, which revealed the path staleness. Each fix changed the system state enough that the next failure became visible. The human operator was the thread connecting diagnoses across system boundaries that no single diagnostic frame could hold simultaneously.
After two hours: the daemon ran. A file arrived. Twenty seconds later, its frontmatter contained structured concept tags, framework annotations, and discipline classifications. No operator attention required. The vault had a new property: files that enriched themselves.
What It Revealed
The debugging arc is what agentic development actually looks like. Not a model generating clean code in one pass. Iterative diagnosis across system boundaries, where each fix reveals the next failure in a different layer, and the human is the thread connecting layers that can't see each other. The database doesn't know about the LaunchAgent throttle. The API authentication doesn't know about the SQL NULL semantics. The filesystem paths don't know about the extraction prompt. Each layer is fine on its own. The interactions between layers are where everything breaks — and the interactions are only visible from the position of someone who can see all the layers at once.
That's a finding about agentic development specifically, not software development generally. In conventional development, the layers are well-mapped: you know the database talks to the API which talks to the frontend. In agentic development, the layers include the LLM's extraction capability, the operating system's service management, the database's query semantics, and the human operator's clipboard hygiene — in the same debugging session. The system boundary diagram doesn't exist because the boundaries are between categories that don't traditionally share a diagram.
But the debugging arc, as concrete as it was, pointed at something I didn't see until later. The human threading across system layers during debugging is a specific case of a more general structural role — the position that sees across boundaries the system's own components can't cross. The debugging arc showed that role operating between technical layers. What happened next showed it operating between the system and its outside.
The auto-tagger works. Files get classified. The daemon reads a conversation export and produces: concepts: [session persistence, multi-environment coordination, MCP architecture]. frameworks: [event-driven architecture, knowledge engineering]. disciplines: [systems design, AI orchestration]. Structured, queryable, useful. The digest system can now skip the classification prerequisite and go straight to connection detection. The bottleneck I started with — manual enrichment — is solved.
What the auto-tagger produces is the "what" layer. What's in this file. What concepts it touches. What frameworks it references. That's classification — and classification is exactly the kind of cognitive work that automation handles well: a focused extraction prompt, a structured output format, the same pass applied consistently across hundreds of files.
What the auto-tagger doesn't produce is the "why" layer. Why this file matters. How it connects to a decision made three months ago. What it changes about an existing concept's scope. Whether the operational evidence in this conversation — the specific way a design pattern was stress-tested against a real problem — is material for something communicable.
I ran into this concretely. Over the course of a week, I'd posted design contributions to four open-source projects — specific, grounded comments on session persistence architecture, knowledge graph design patterns, multi-environment coordination protocols. Each contribution drew directly from infrastructure I'd built and documented. Each one was a real exchange with a maintainer about a real design problem. The auto-tagger could classify the session transcripts from those contributions. It could tag them as containing concepts about session persistence and MCP architecture. What it couldn't do was recognize that those four exchanges, taken together, constituted a pattern — operational evidence of a specific kind of practice being tested in forming spaces. The tags said what the files contained. They didn't say what the files meant.
That gap has a shape. The system can read its own files — literally, the daemon processes them autonomously. But there are two kinds of reading. Reading from inside: what does this file contain? Concepts, frameworks, references. The auto-tagger does this. Reading from outside: what does this file mean in the context of everything else? How does it change the trajectory? What does it reveal about the gap between what the system can see internally and what anyone outside the system can evaluate?
The second kind of reading is what I do when I look at the topology health report and realize that the four engagement sessions aren't just four separate interactions — they're evidence of a pattern. Or when I notice that the publishing cadence lapsed not because I stopped writing, but because the primary signal channel shifted from articles to direct contributions, and the system's health metrics treat those as separate tracks instead of recognizing them as the same function expressed differently. Or when I see that the infrastructure I've built to help me get hired for context engineering work is context engineering work — and nobody outside the system can see that yet.
Classification reads the file. The operator reads the system. They're different operations. The first is automatable. The second is where judgment lives — and it's the layer where intelligence becomes legible.
I haven't solved this. The auto-tagger handles classification. The persistence layer handles storage and retrieval. The digest system handles connection detection. What doesn't exist yet is the layer that turns operational evidence into communicable artifacts — field reports, case material, positioning pieces that translate what the system knows internally into something someone outside it can evaluate. I've been calling this the narration layer, and it's the biggest gap in the architecture. Not because it's technically hard to build — the editorial infrastructure exists, the voice contract exists, the publishing pipeline exists. Because it requires a judgment the system hasn't been able to make: which of the things that happened is significant enough to narrate, and for whom?
The auto-tagger was supposed to be the post about infrastructure doing cognitive work autonomously. It is that. But it also drew the line between the work infrastructure can do without the operator and the work it can't. Classification is on one side. Narration is on the other. The daemon reads files. The operator reads the system.
The Reusable Rule
If your LLM infrastructure is spending human attention on classification — manually tagging files, categorizing inputs, sorting material before analytical work can begin — check whether that classification can run in the background. The diagnostic: when your analytical pipeline's first step is always "figure out what this is about," the pipeline is doing prerequisite work that should have happened at import time. Automate it. Event-driven enrichment. Let files arrive pre-classified the way they arrive pre-indexed.
But there's a harder version of this problem. Once classification is automated, you'll notice the layer above it — the one that determines what the classified material means. Not what concepts a file contains, but why those concepts matter right now. Not what frameworks a conversation referenced, but what the conversation reveals about a pattern you hadn't named yet. That layer — call it narration, call it framing, call it the intelligence-to-legibility bridge — resists automation for a structural reason: it depends on context the system doesn't currently hold. Who needs to see this? What would they recognize as valuable? What's the right translation between what actually happened and what becomes communicable?
In my system, the auto-tagger classifies 545 files without my attention. The daemon runs, the tags appear, the digest system can now do connection work instead of classification work. That's real — the bottleneck I started with is gone. But the four design contributions I posted to open-source projects last week are still just tagged files in the vault. The auto-tagger knows they contain concepts about session persistence and MCP architecture. It doesn't know they're evidence of a practice being tested in forming spaces, or that taken together they constitute a positioning pattern that no single tag captures.
Automate classification. Reserve human attention for narration. The system reads the files. Deciding what they mean — and for whom — is still yours.