<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: supreet singh</title>
    <description>The latest articles on DEV Community by supreet singh (@supreet_s).</description>
    <link>https://dev.to/supreet_s</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1208051%2Fc8c6141a-71df-4a21-aed7-0dbbde368977.jpg</url>
      <title>DEV Community: supreet singh</title>
      <link>https://dev.to/supreet_s</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/supreet_s"/>
    <language>en</language>
    <item>
      <title>Built (almost) a structured Lobster pipeline on OpenClaw to solve AI non-determinism</title>
      <dc:creator>supreet singh</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:23:03 +0000</pubDate>
      <link>https://dev.to/supreet_s/built-almost-a-structured-lobster-pipeline-on-openclaw-to-solve-ai-non-determinism-3c5j</link>
      <guid>https://dev.to/supreet_s/built-almost-a-structured-lobster-pipeline-on-openclaw-to-solve-ai-non-determinism-3c5j</guid>
      <description>&lt;p&gt;&lt;em&gt;If you've been following &lt;a href="https://dev.to/supreet_s/building-memspren-a-claude-skill-as-a-second-brain-v1-architecture-deep-dive-2p7o"&gt;the MemSpren series&lt;/a&gt;, you know the core thesis: AI agents don't execute reliably because we ask them to do too much at once. For months, I've argued that the solution to non-determinism is moving orchestration out of the LLM's "vibes" and into a structured code pipeline.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is the story of what happened when I actually tried to do that, and why it convinced me that the "AI will replace humans" narrative is fundamentally hollow.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack: OpenClaw + Lobster
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; is the AI agent platform I'm building MemSpren on. &lt;a href="https://github.com/openclaw/lobster" rel="noopener noreferrer"&gt;Lobster&lt;/a&gt; is its workflow engine. The pitch is elegantly simple: define your steps in YAML, pass data between them as JSON, and let the pipeline handle the sequencing.&lt;/p&gt;
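&lt;p&gt;&lt;em&gt;To make that concrete, here is a minimal sketch of what such a pipeline definition could look like. This is illustrative only: the step types and field names are my assumptions, not Lobster's documented schema.&lt;/em&gt;&lt;/p&gt;

```yaml
# Hypothetical Lobster-style pipeline: step types and field names
# are illustrative assumptions, not the engine's actual schema.
name: memspren-capture
steps:
  - id: extract
    type: llm-task
    prompt: "Extract entities from the input note as structured JSON"
  - id: transform
    type: llm-task
    input: "{{ extract.output }}"
    prompt: "Normalize the extracted entities"
  - id: write
    type: exec
    command: "vault-write --from-stdin"
    input: "{{ transform.output }}"
```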

&lt;p&gt;The division of labor is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM does what LLMs are good at: generating, analyzing, and transforming text.&lt;/li&gt;
&lt;li&gt;Lobster does what code is good at: sequencing, retrying, and routing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In theory, this eliminates the "hallucinated loop" where an agent gets stuck in a logic trap. But my use case is more than a three-step "Hello World." It involves multi-stage extraction, transformation, and writing, with structured JSON outputs at every turn. That complexity is where the friction lived.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Myth of the "Easy" Integration
&lt;/h2&gt;

&lt;p&gt;There is a massive amount of FOMO right now, especially among non-technical people, to integrate AI into every facet of their workflow. Big AI companies would have you believe this is a "plug and play" evolution.&lt;/p&gt;

&lt;p&gt;My experience over the last week suggests the opposite. While standard, linear workflows are becoming easier to automate, the moment you attempt something creative or non-standard, the difficulty spikes exponentially. If you aren't prepared to spend hours debugging silent failures and observability gaps, you won't get a "seamless" integration; you'll get a broken system.&lt;/p&gt;

&lt;h2&gt;
  
  
  When "Explicit Trust" Becomes a Wall
&lt;/h2&gt;

&lt;p&gt;Throughout March 2026, OpenClaw tightened its permissions model. The releases around v2026.3.31 and v2026.4.1 hit me hard.&lt;/p&gt;

&lt;p&gt;The update moved from implicit trust to explicit permissioning across exec, gateway auth, and node permissions. Even after I granted "Full Power" in the config, the system kept prompting for manual approval on every tool invocation. For a pipeline designed to run autonomously, this was a death sentence.&lt;/p&gt;

&lt;p&gt;I spent hours diving through release notes and toggling flags. Eventually, I made a choice that every developer recognizes: I rolled back. I needed to execute, not audit. But that rollback triggered a cascade of secondary failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lobster Installation Workflow
&lt;/h2&gt;

&lt;p&gt;Installing Lobster is quite a workflow in itself: you install it as a plugin, you enable it, and you also need to install it globally for it to work. Miss any of these steps, or leave it disconnected from your main agent, and it simply will not work.&lt;/p&gt;

&lt;p&gt;I initially missed some of these instructions in the documentation, and it took me a significant amount of time to figure out why Lobster could not execute as a subprocess from within OpenClaw even though the entire workflow ran fine from the terminal. When OpenClaw invokes Lobster, it launches the CLI in "tool mode" and expects to parse a JSON envelope from stdout. If the binary is not on the global PATH, the gateway fails silently. There are no logs in the UI and no error message in the console, just a "Step Failed" status and a void where the data should be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Character Limits: The Silent Killer
&lt;/h2&gt;

&lt;p&gt;This was the most painful bug to squash. I use &lt;code&gt;llm-task&lt;/code&gt; directives to force the LLM into returning structured JSON. I had set a &lt;code&gt;maxLength&lt;/code&gt; of 3,000 characters on the &lt;code&gt;content&lt;/code&gt; field in my schema, and I had reinforced this as a critical requirement within the prompt itself.&lt;/p&gt;
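&lt;p&gt;&lt;em&gt;For context, the constraint lived in an ordinary JSON Schema fragment along these lines (simplified from the real schema; &lt;code&gt;maxLength&lt;/code&gt; is the standard JSON Schema keyword):&lt;/em&gt;&lt;/p&gt;

```json
{
  "type": "object",
  "properties": {
    "content": {
      "type": "string",
      "maxLength": 3000
    }
  },
  "required": ["content"]
}
```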

&lt;p&gt;The LLM, as LLMs do, ignored both the schema constraint and the explicit prompt instruction, generating 3,200 characters.&lt;/p&gt;

&lt;p&gt;The result? A generic 500 Internal Server Error. No partial content was returned, and there was no UI indication that the failure was a schema mismatch. I eventually found the culprit by tailing the gateway log at &lt;code&gt;/tmp&lt;/code&gt;, filtering through the 1,440 "heartbeat" lines the gateway generates every day. Tucked away in the noise was the smoking gun:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LLM JSON did not match schema: /content must NOT have more than 3000 characters&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anthropic Squeeze
&lt;/h2&gt;

&lt;p&gt;Today, April 4, 2026, Anthropic officially ended Claude subscription support for third-party tools like OpenClaw.&lt;/p&gt;

&lt;p&gt;For months, I've been running Opus and Sonnet through my flat-rate subscription. Now, it's pay-as-you-go or bust. I'm currently experimenting with KIMI K2.5 and OpenAI Codex. My concern isn't just price; it's reliability. Claude has been the gold standard for following complex JSON schemas. If the model layer gets flakier while I'm still fighting for determinism at the orchestration layer, I'm fighting a war on two fronts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Frontier: Why AI Won't Replace You (Yet)
&lt;/h2&gt;

&lt;p&gt;The pipeline finally runs. But reaching this point required a level of patience and technical forensic work that most people simply don't have the time or interest to perform.&lt;/p&gt;

&lt;p&gt;This leads me to a few firm conclusions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI isn't autonomous: it's high-maintenance.&lt;/strong&gt; The "autonomy" of these agents is an illusion that shatters the moment you step off the well-trodden path of documented examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Replacement" Narrative is Wrong.&lt;/strong&gt; Anyone who says AI is going to replace humans has no idea what is coming at them. The complexity of making these systems actually work, not just demo well, is staggering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Rise of the AI Specialist.&lt;/strong&gt; I believe we are about to see a massive surge in AI-related jobs. These are not "prompt engineers," but people who can resolve the silent failures, navigate the learning cycles, and bridge the gap between creative intent and deterministic execution.&lt;/p&gt;

&lt;p&gt;People want to do what they are good at. They don't want to tail &lt;code&gt;/tmp&lt;/code&gt; logs for four hours to find a schema mismatch. As long as AI remains this "tricky," there will be a massive market for humans who can tame it.&lt;/p&gt;

&lt;p&gt;The core thesis of the MemSpren series still holds: tighter scopes and more governance are the only way forward. But I've learned that governance is a human job.&lt;/p&gt;

&lt;p&gt;The fight for reliability continues.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want the personal and philosophical context behind MemSpren, I write about that on &lt;a href="https://innerodyssey.substack.com" rel="noopener noreferrer"&gt;Odyssey (Substack)&lt;/a&gt;. For the technical deep dives, &lt;a href="https://dev.to/supreet_s"&gt;follow me on dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloudskills</category>
      <category>ai</category>
      <category>secondbrain</category>
    </item>
    <item>
      <title>I Wrote the Rules. My Agent Read Them. Then Did Whatever It Wanted.</title>
      <dc:creator>supreet singh</dc:creator>
      <pubDate>Tue, 24 Mar 2026 11:58:23 +0000</pubDate>
      <link>https://dev.to/supreet_s/i-wrote-the-rules-my-agent-read-them-then-did-whatever-it-wanted-1mc4</link>
      <guid>https://dev.to/supreet_s/i-wrote-the-rules-my-agent-read-them-then-did-whatever-it-wanted-1mc4</guid>
      <description>&lt;p&gt;&lt;em&gt;If you've been following the &lt;a href="https://memspren.io/" rel="noopener noreferrer"&gt;MemSpren project&lt;/a&gt;, you know I've been building an agentic second brain that runs on Obsidian. This article is about a problem that's followed me across every interface I've tried to run it through, and the governance experiments I've been running to fight it.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happened This Morning
&lt;/h2&gt;

&lt;p&gt;I spent today integrating Telegram with Claude Code so that &lt;a href="https://memspren.io/" rel="noopener noreferrer"&gt;MemSpren&lt;/a&gt; can run through a Telegram bot interface natively, the same daily-driver flow I've had working with my existing OpenClaw setup for months. End-to-end, the loop works now. Telegram message, Claude Code, MemSpren skill, vault write. But getting there produced a friction log worth documenting, because the real problem showed up after setup was done, not during it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup Friction
&lt;/h2&gt;

&lt;p&gt;Most of what went wrong today came down to one thing: Bun wasn't installed. Bun is the runtime for the Telegram MCP server, and without it the MCP server silently fails. Claude Code reported "MCP failing" with no useful error message pointing at Bun specifically. I spent time ruling out other things before landing there.&lt;/p&gt;

&lt;p&gt;There were two other genuine friction points worth knowing about:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Command naming mismatch.&lt;/strong&gt; The docs say &lt;code&gt;/telegram:configure&lt;/code&gt;. The actual installed command is &lt;code&gt;/configure&lt;/code&gt;. No error, just silence. I found the real command by inspection. Docs and reality diverged somewhere, and the only way to find out was trial and error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;--channels&lt;/code&gt; flag isn't surfaced by Claude Code.&lt;/strong&gt; The bot never responded to DMs because I didn't launch Claude Code with &lt;code&gt;--channels plugin:telegram@claude-plugins-official&lt;/code&gt;. The flag is documented, but I was following Claude Code's own setup guidance, not the README, and Claude Code didn't mention it. The plugin looks loaded, token saved, skills visible, and nothing tells you the MCP server that actually listens for incoming messages only starts when you pass that flag at launch. That's the kind of thing that costs an afternoon.&lt;/p&gt;

&lt;p&gt;One more thing that isn't a setup issue but is genuinely annoying: Claude Code asks for permission on every vault write. In a daily check-in flow with five or ten writes per session, that's not a workflow. The fix is launching with &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt;, or adding the vault path to the trusted paths in &lt;code&gt;~/.claude/settings.json&lt;/code&gt;.&lt;/p&gt;
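&lt;p&gt;&lt;em&gt;For the settings route, the shape is roughly this; check the current Claude Code docs for the exact rule syntax, and the vault path here is a placeholder:&lt;/em&gt;&lt;/p&gt;

```json
{
  "permissions": {
    "allow": [
      "Write(~/vault/**)",
      "Edit(~/vault/**)"
    ]
  }
}
```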

&lt;p&gt;All of that is solvable. That's not the hard problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hard Problem
&lt;/h2&gt;

&lt;p&gt;After setup was done, I gave the skill a small update to process. It responded conversationally. It didn't create the vault entry. The sync buffer protocol, explicit, written, right there in the skill file, wasn't followed until I told it to follow it explicitly.&lt;/p&gt;

&lt;p&gt;This is the failure pattern I've been trying to solve since I started building MemSpren in OpenClaw, and now it's showing up in Claude Code on day one. The skill reads the protocol correctly when invoked directly. It doesn't apply it proactively on every trigger. The methodology is sound. The execution is inconsistent.&lt;/p&gt;

&lt;p&gt;That inconsistency is what I mean by non-determinism. Not random hallucination, not broken code. An instruction set the agent acknowledges understanding but doesn't always execute.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I've Tried in OpenClaw
&lt;/h2&gt;

&lt;p&gt;I've been running MemSpren through OpenClaw long enough to have a proper history with this problem. The governance stack I've built up over time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron-based synchronization.&lt;/strong&gt; Regular scheduled syncs that run regardless of what the agent did or didn't do in the session. This catches drift. If the agent missed a write, the cron job surfaces the gap.&lt;/p&gt;
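&lt;p&gt;&lt;em&gt;Mechanically this is just cron; the script name and paths are placeholders for my actual sync check:&lt;/em&gt;&lt;/p&gt;

```shell
# Hypothetical crontab entry: surface vault drift every 30 minutes,
# independent of whatever the agent did or didn't do in the session.
*/30 * * * * $HOME/bin/memspren-sync-check >> $HOME/memspren/sync.log
```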

&lt;p&gt;&lt;strong&gt;Redundant state comparison.&lt;/strong&gt; Maintaining state in two places so that discrepancies can be detected. If what's in the vault doesn't match what should be in the vault based on the session log, something was missed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Archive-on-edit, never-delete invariant.&lt;/strong&gt; Every vault file edit goes through a process that archives the prior version and commits to git before the new version lands. This is enforced as a CRITICAL statement in the skill file. The agent cannot delete, it can only archive. The git history is the audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End-of-day commit review.&lt;/strong&gt; A separate process that runs at day's end and validates that everything committed was committed correctly. No truncated writes, no orphaned files, no missing frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protocol-level redundancy.&lt;/strong&gt; Checks on checks. If a write protocol is supposed to run, there's a verification step that runs after it to confirm it ran. Not foolproof, since the verification step itself can be skipped, but it reduces the blast radius.&lt;/p&gt;

&lt;p&gt;All of this has helped. None of it has solved the problem. The agent still occasionally skips protocol steps that are clearly documented. The cron scripts catch the misses, but they're reactive, not preventive. I'm managing the consequences of non-determinism rather than eliminating it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Others Are Doing
&lt;/h2&gt;

&lt;p&gt;This isn't a MemSpren-specific problem. Brian Scanlan at Intercom shared a &lt;a href="https://x.com/brian_scanlan/status/2033978300003987527" rel="noopener noreferrer"&gt;Twitter thread&lt;/a&gt; on how they approached non-determinism in their AI workflows at production scale. The core of their approach: strict governance layers that wrap every agent action, not relying on the model to follow instructions correctly, but enforcing correct behaviour structurally through the system around the model.&lt;/p&gt;

&lt;p&gt;That framing is more honest than most of what gets written about agents. The model is unreliable by nature, so reliability has to come from the architecture. You can't prompt your way to determinism.&lt;/p&gt;

&lt;p&gt;I haven't built that. My governance stack is additive and external. What I'm thinking about is whether micro skills (small, single-purpose skill files that each govern one specific behaviour) could be a better model than one large SKILL.md that the agent has to follow end to end. Narrower scope, less instruction for the model to hold in working context, more targeted verification. I haven't tested this yet. It's the experiment I'm setting up now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MemSpren is an open-source AI second brain skill built on Obsidian and Claude. You can find the project at &lt;a href="https://memspren.io" rel="noopener noreferrer"&gt;memspren.io&lt;/a&gt;. The architecture is documented in &lt;a href="https://dev.to/supreet_s/building-memspren-a-claude-skill-as-a-second-brain-v1-architecture-deep-dive-2p7o"&gt;Building MemSpren: A Claude Skill as a Second Brain — V1 Architecture Deep Dive&lt;/a&gt;. This article continues the evolution started in &lt;a href="https://dev.to/supreet_s/teaching-an-ai-skill-to-learn-from-its-own-mistakes-2aij"&gt;Teaching an AI Skill to Learn from Its Own Mistakes&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="https://dev.to/supreet_s"&gt;dev.to&lt;/a&gt; for technical updates, or on &lt;a href="https://innerodyssey.substack.com" rel="noopener noreferrer"&gt;Substack&lt;/a&gt; for the thinking behind the building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloudskills</category>
      <category>ai</category>
      <category>secondbrain</category>
    </item>
    <item>
      <title>Teaching an AI Skill to Learn from Its Own Mistakes</title>
      <dc:creator>supreet singh</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:30:30 +0000</pubDate>
      <link>https://dev.to/supreet_s/teaching-an-ai-skill-to-learn-from-its-own-mistakes-2aij</link>
      <guid>https://dev.to/supreet_s/teaching-an-ai-skill-to-learn-from-its-own-mistakes-2aij</guid>
      <description>&lt;p&gt;&lt;em&gt;Follow-up to &lt;a href="https://dev.to/supreet_s/beyond-the-prompt-building-production-grade-ai-skills-a-case-study-17f"&gt;Beyond the Prompt: Building Production-Grade AI Skills&lt;/a&gt;, which covers the original architecture and reasoning behind this Cypress E2E testing skill.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I have a Claude skill that generates Cypress E2E tests from Jira Acceptance Criteria. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads the ticket&lt;/li&gt;
&lt;li&gt;discovers the relevant components&lt;/li&gt;
&lt;li&gt;instruments them&lt;/li&gt;
&lt;li&gt;generates the tests&lt;/li&gt;
&lt;li&gt;validates that everything compiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The original version did all of that well, &lt;strong&gt;but what it didn't do was actually run the tests to see if they passed&lt;/strong&gt;. Token budgets were tighter at the time, and adding a verification loop meant longer sessions, more context, and more cost, so I left it out of the initial version. &lt;br&gt;
What I didn't anticipate was that running the tests myself and fixing whatever broke would cost me three to four days of manual debugging on the first real run. Skipping verification did save considerable tokens and kept the process cheaper, but it turned out to be a far more painful decision than I expected.&lt;/p&gt;

&lt;p&gt;Since then, &lt;strong&gt;Opus 4.6 and Sonnet 4.6 have moved to 1,000,000-token context windows&lt;/strong&gt;, and Claude Code's &lt;strong&gt;&lt;em&gt;prompt caching&lt;/em&gt;&lt;/strong&gt; keeps the cost of longer sessions reasonable. With that constraint gone, I went back and made &lt;em&gt;three changes&lt;/em&gt; that have made a massive difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Human Review Gate on the Test Matrix
&lt;/h2&gt;

&lt;p&gt;The skill already had human review gates in a few places, including reviewing the instrumentation before it gets applied. The new addition is that the human review gate now also prompts you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the test matrix plan&lt;/li&gt;
&lt;li&gt;Verify the acceptance criteria mapping&lt;/li&gt;
&lt;li&gt;Approve before the tests are generated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reason I added this is that I noticed on complex tickets with 15+ acceptance criteria, the matrix was occasionally miscategorizing tests or missing edge cases. &lt;br&gt;
Part of that comes from a limitation that still exists: the skill currently doesn't extract acceptance criteria from screenshots attached to the ticket. &lt;br&gt;
If an AC only exists in an image, it gets missed. That's a future improvement, but for now the human review gate catches those gaps before they turn into missing tests.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;gate is a configuration parameter.&lt;/em&gt; If you don't want the back-and-forth of reviewing every matrix, you can turn it off and the skill will proceed straight through to instrumentation and test creation without involving a human in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Verify-and-Fix Loop
&lt;/h2&gt;

&lt;p&gt;This was the biggest change. The skill now actually runs each test before considering it done, and if it fails, it classifies why it failed because not every failure is something the skill should be fixing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logic Error:&lt;/strong&gt; An actual bug in the application code. The skill stops and tells you about it. Fixing business logic is not what this skill is for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Error:&lt;/strong&gt; Something environmental is wrong, like the dev server not running or a dependency missing. The skill flags it and waits for you to sort it out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrumentation Error:&lt;/strong&gt; Something wrong with how the test was written or how the component was set up for testing. That's the skill's territory and it will fix it, re-run, and loop until it passes or hits a retry limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A real example that came up on my first run: a component had the correct &lt;code&gt;data-testid&lt;/code&gt; on it, but there was a transparent overlay element in the DOM sitting on top of it. &lt;br&gt;
Cypress could find the element but couldn't click it because the overlay was intercepting the interaction. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The skill identified what was blocking it,&lt;/li&gt;
&lt;li&gt;adjusted the test to handle the overlay, &lt;/li&gt;
&lt;li&gt;and it passed on retry. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the kind of issue I would have spent a long time staring at Cypress output trying to figure out, and the old generate-only workflow had no way of catching it at all.&lt;/p&gt;
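&lt;p&gt;&lt;em&gt;The classify-then-retry flow can be sketched in a few lines of Python. This is a toy sketch under my own naming, not the skill's actual implementation; in the real skill the classification is done by the model, not by string matching.&lt;/em&gt;&lt;/p&gt;

```python
# Toy sketch of the verify-and-fix loop: run the test, classify the
# failure, and only auto-fix failures in the skill's own territory.
# All names here are illustrative, not the skill's real API.
RETRY_LIMIT = 3

def classify(failure_log: str) -> str:
    # Stand-in classifier: the real skill classifies failures with the LLM.
    if "ECONNREFUSED" in failure_log or "server not running" in failure_log:
        return "infrastructure"
    if "assertion failed" in failure_log:
        return "logic"
    return "instrumentation"

def verify_and_fix(run_test, fix_test):
    for attempt in range(1, RETRY_LIMIT + 1):
        passed, log = run_test()
        if passed:
            return "passed"
        kind = classify(log)
        if kind != "instrumentation":
            # Logic and infrastructure errors are surfaced, not auto-fixed.
            return "halted: " + kind + " error"
        fix_test(log)  # e.g. adjust the test to handle an intercepting overlay
    return "retry limit reached"
```

&lt;p&gt;&lt;em&gt;The important design choice is the early return on logic and infrastructure errors: the loop only spends retries on failures it is allowed to fix.&lt;/em&gt;&lt;/p&gt;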

&lt;h2&gt;
  
  
  3. Pattern Persistence Across Runs
&lt;/h2&gt;

&lt;p&gt;That transparent overlay wasn't unique to one component. The same pattern existed across a dozen components in the app, which meant every test touching those components would hit the same failure for the same reason. &lt;br&gt;
In the old version each run was completely stateless, so the skill would discover the overlay problem, fix it for that one test, and then have no memory of it on the next run. &lt;br&gt;
Same discovery, same fix, same wasted tokens, every single time.&lt;/p&gt;

&lt;p&gt;Now every time the skill discovers a failure pattern during the verify-and-fix loop, it persists that pattern to a dedicated file with metadata:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The error message&lt;/li&gt;
&lt;li&gt;The root cause&lt;/li&gt;
&lt;li&gt;How it was resolved&lt;/li&gt;
&lt;li&gt;Which component it came from&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This patterns file is read at the start of every new run, before any tests are generated, so the skill consults its own history of failures before writing a single test.&lt;/p&gt;

&lt;p&gt;The effect compounds over time. The first run discovers patterns reactively and fixes them as they come up. The second run already knows those patterns and generates tests that account for them from the start. Each iteration produces fewer first-attempt failures and uses fewer tokens because less of the session is spent rediscovering things it already learned.&lt;/p&gt;
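
&lt;p&gt;A minimal sketch of what that persistence could look like, assuming a JSON file as the store. The filename, the record fields, and the substring matching are illustrative, not the skill's actual format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of pattern persistence, assuming a JSON file as the store.
# The filename, record fields, and matching are illustrative.
import json, os

PATTERNS_FILE = "failure-patterns.json"

def load_patterns():
    """Read at the start of every run, before any test is generated."""
    if not os.path.exists(PATTERNS_FILE):
        return []
    with open(PATTERNS_FILE) as f:
        return json.load(f)

def record_pattern(error, root_cause, resolution, component):
    """Persist a discovered failure pattern with its metadata."""
    patterns = load_patterns()
    patterns.append({
        "error": error,
        "root_cause": root_cause,
        "resolution": resolution,
        "component": component,
    })
    with open(PATTERNS_FILE, "w") as f:
        json.dump(patterns, f, indent=2)

def known_fixes(error):
    """Match a fresh failure against previously recorded patterns."""
    return [p["resolution"] for p in load_patterns() if p["error"] in error]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;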




&lt;p&gt;&lt;em&gt;The original 8-phase lifecycle and three-pillar architecture are documented in &lt;a href="https://dev.to/supreet_s/beyond-the-prompt-building-production-grade-ai-skills-a-case-study-17f"&gt;Beyond the Prompt: Building Production-Grade AI Skills&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow the build on &lt;a href="https://dev.to/supreet_s"&gt;dev.to&lt;/a&gt; and &lt;a href="https://innerodyssey.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Building MemSpren: A Claude Skill as a Second Brain — V1 Architecture Deep Dive</title>
      <dc:creator>supreet singh</dc:creator>
      <pubDate>Tue, 03 Mar 2026 14:05:22 +0000</pubDate>
      <link>https://dev.to/supreet_s/building-memspren-a-claude-skill-as-a-second-brain-v1-architecture-deep-dive-2p7o</link>
      <guid>https://dev.to/supreet_s/building-memspren-a-claude-skill-as-a-second-brain-v1-architecture-deep-dive-2p7o</guid>
      <description>&lt;p&gt;&lt;em&gt;How I architected a Zettelkasten-powered knowledge graph as a Claude Skill — memory management, node creation, session flow, and the decisions behind each.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been building personal knowledge management systems for years, trying to solve a specific problem: losing context, losing ideas, and losing the thread of my own thinking across tools and time. MemSpren is my attempt to solve that properly — using Claude as the intelligence layer and Obsidian as the storage layer.&lt;/p&gt;

&lt;p&gt;The name is a nod to Brandon Sanderson's &lt;em&gt;Stormlight Archive&lt;/em&gt;, where spren are spirits of concepts that have come to life and bond with a human to guide them. That felt right.&lt;/p&gt;

&lt;p&gt;If you want to understand the deeper reasoning behind why I built this — the mindset, the frustrations, the years of failed PKM experiments, and the philosophy underneath the architecture — I've written that on &lt;a href="https://innerodyssey.substack.com/p/i-open-sourced-my-second-brain-heres" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;. That's the "why." &lt;strong&gt;This article is the "how."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What follows is a technical breakdown of how the V1 system is structured: how it manages memory across sessions, how it creates and links nodes, how the session startup sequence works, and what architectural decisions I made and why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MemSpren requires Claude Cowork&lt;/strong&gt;, not Claude.ai chat or Claude Code. Cowork mounts a real folder from your local filesystem into the sandboxed environment. When the skill asks for your Obsidian vault path during setup, that folder must already be mounted in Cowork. Without this, the skill cannot read or write to your vault.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The implementation is file-based — no external databases, no cloud infrastructure beyond Cowork itself. That's a deliberate choice: auditable in a text editor, runnable without extra accounts, forkable by anyone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Native Claude Memory Isn't Enough
&lt;/h2&gt;

&lt;p&gt;Claude now has a built-in memory feature. As of early 2026, it's available to all users — free and paid. It generates summaries from your conversation history, updates every 24 hours, and gives Claude some continuity between sessions. There's also chat search, which uses RAG to pull relevant context from past conversations when you ask for it.&lt;/p&gt;

&lt;p&gt;That's genuinely useful. But it's not what MemSpren is doing, and the difference matters.&lt;/p&gt;

&lt;p&gt;Claude's native memory is &lt;strong&gt;generalized and unstructured&lt;/strong&gt;. It summarizes your conversations into a broad synthesis — helpful for not re-explaining your general context, but it has no concept of typed entities. It doesn't know the difference between a project and a person. It can't tell you which of your ideas are linked to which projects. It can't load a specific protocol when you say you want to do a check-in. It can't track that a particular task has been idle for 10 days. It doesn't maintain a graph.&lt;/p&gt;

&lt;p&gt;What MemSpren builds is &lt;strong&gt;structured, typed, and programmable memory&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot memory with a hard token cap that forces prioritization of what matters &lt;em&gt;right now&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;A graph index that tracks every node and its connections without loading every file&lt;/li&gt;
&lt;li&gt;Protocol files that define exactly how Claude should behave for specific intents&lt;/li&gt;
&lt;li&gt;Entity files with typed frontmatter that make the vault queryable from metadata alone&lt;/li&gt;
&lt;li&gt;A session startup sequence that loads different context depending on what you're trying to do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Native memory solves "I have to re-explain myself." MemSpren solves something harder: &lt;em&gt;how do you build a knowledge graph that gets smarter over time, with surgical memory loading, typed nodes, and evolvable behavioral protocols?&lt;/em&gt; That's a different problem. The architecture below is the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Separate File Locations
&lt;/h2&gt;

&lt;p&gt;The system uses two distinct locations. Mixing them up breaks things.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Location 1: Skill operational files
# Dot-prefix hides this from Obsidian's graph view
.second-brain/
├── config.md                    ← static system settings
└── Memory/
    ├── hot-memory.md            ← always loaded each session
    └── system-state.md          ← active protocols, flags, graph index

# Location 2: Vault content
# Lives at vault_path defined in config.md
# This is what Obsidian renders
vault/
├── Vision/
├── Strategy/
├── Work/
│   ├── Projects/
│   └── Ideas/
├── People/
├── Tasks/
│   └── tasks-inbox.md
├── Log/
│   ├── Daily/
│   └── Weekly/
├── Notes/
│   ├── Learnings/
│   └── Resources/
├── Archive/
└── Inbox/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.second-brain/&lt;/code&gt; is Claude's operational layer — config, memory files, protocol files. These should never appear in the Obsidian graph. The dot-prefix keeps them hidden from Obsidian by default.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vault_path&lt;/code&gt; is where all content lives — everything Obsidian renders and displays. This is what grows over time. MemSpren writes every note, project file, and daily log into this location.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;vault_path&lt;/code&gt; points to the same root as the mounted folder, the two locations coexist in the same directory. That's fine — &lt;code&gt;.second-brain/&lt;/code&gt; is still separate and still hidden.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Memory Architecture
&lt;/h2&gt;

&lt;p&gt;This is the core architectural problem in any AI-native knowledge system: &lt;strong&gt;you cannot load everything into every session, but you need Claude to know enough to be useful from the moment the conversation starts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The solution is a three-tier loading strategy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5vm0iqv8m4nvn70rnfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5vm0iqv8m4nvn70rnfw.png" alt="loading strategy" width="752" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier 1: Hot Memory
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;hot-memory.md&lt;/code&gt; is loaded every single session without exception. It is capped at approximately 800 tokens — a deliberate hard constraint. The cap forces prioritization: only what matters &lt;em&gt;right now&lt;/em&gt; belongs here. What falls outside the cap belongs in long-term storage.&lt;/p&gt;

&lt;p&gt;What lives in hot memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Currently active projects (name, status, immediate next action)&lt;/li&gt;
&lt;li&gt;This week's tasks and commitments&lt;/li&gt;
&lt;li&gt;Friction points or blockers flagged in recent check-ins&lt;/li&gt;
&lt;li&gt;Any patterns Claude has been tracking&lt;/li&gt;
&lt;li&gt;Current journaling context summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The daily sync refreshes hot memory at the end of each check-in, based on what was captured during the session.&lt;/p&gt;
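
&lt;p&gt;The cap is easy to picture as a budget check. The sketch below assumes roughly four characters per token; the estimator and the trimming strategy are illustrative, only the ~800-token ceiling comes from the design:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the hot-memory cap. The token estimator and trimming
# strategy are assumptions; only the cap itself is from the design.

HOT_MEMORY_CAP = 800  # approximate hard ceiling, in tokens

def estimate_tokens(text):
    return len(text) // 4  # crude heuristic, good enough for budgeting

def trim_to_cap(sections):
    """Keep the highest-priority sections that fit the budget.

    sections is a list of (priority, text) pairs, most important
    first after sorting; whatever does not fit belongs in long-term
    storage, not in hot memory.
    """
    kept, used = [], 0
    for _, text in sorted(sections):
        cost = estimate_tokens(text)
        if max(used + cost - HOT_MEMORY_CAP, 0) == 0:  # still in budget
            kept.append(text)
            used += cost
    return kept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;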

&lt;h3&gt;
  
  
  Tier 1: System State
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;system-state.md&lt;/code&gt; is the operational counterpart to hot memory. Where hot memory is &lt;em&gt;contextual&lt;/em&gt; (what are we working on), system state is &lt;em&gt;operational&lt;/em&gt; (how is the skill configured to behave right now).&lt;/p&gt;

&lt;p&gt;What lives in system state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which protocol files are active (lifestyle tracking on/off, journaling enabled, etc.)&lt;/li&gt;
&lt;li&gt;A graph index — a compact summary of what nodes exist and how they connect, without the full file content&lt;/li&gt;
&lt;li&gt;Inbox sorting rules&lt;/li&gt;
&lt;li&gt;Any behavioral flags set by the user in previous sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The graph index in system state is what lets Claude answer "what do I have on X?" without loading every vault file. It's a lightweight map of the graph, not the graph itself.&lt;/p&gt;
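
&lt;p&gt;As a rough sketch, that index can be built from frontmatter alone. The parsing below is illustrative (simple key-value lines plus list items), and the field names follow the frontmatter example shown later in this article:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of building the compact graph index from frontmatter alone.
# The naive YAML parsing here is illustrative, not a full parser.

def parse_frontmatter(text):
    """Read only the YAML block between the opening and closing ---."""
    meta, inside, current = {}, False, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "---":
            if inside:
                break          # closing fence: stop before the body
            inside = True
            continue
        if not inside:
            continue
        if current and stripped.startswith("- "):
            meta.setdefault(current, []).append(stripped[2:].strip('"'))
            continue
        if ":" in stripped:
            key, _, value = stripped.partition(":")
            current = key.strip()
            if value.strip():
                meta[current] = value.strip()
    return meta

def build_graph_index(files):
    """files maps path to raw note text; returns a lightweight map."""
    index = {}
    for path, text in files.items():
        meta = parse_frontmatter(text)
        index[path] = {
            "type": meta.get("node_type", "unknown"),
            "status": meta.get("status", ""),
            "connected": meta.get("connected", []),
        }
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The note bodies are never touched; the index stays cheap no matter how large the vault grows.&lt;/p&gt;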

&lt;h3&gt;
  
  
  Tier 2: On-Demand Loading
&lt;/h3&gt;

&lt;p&gt;Protocol files and entity files are only loaded when the current conversation intent requires them. The protocol files define &lt;em&gt;how&lt;/em&gt; Claude behaves for specific actions (check-ins, entity creation, linking). Entity files contain the actual content of a specific node.&lt;/p&gt;

&lt;p&gt;A key rule in &lt;code&gt;SKILL.md&lt;/code&gt;: when loading an entity file is warranted, Claude reads the YAML frontmatter first. Only if the conversation requires deeper context does it read the full body. This matters at scale — a vault with hundreds of files should not require reading hundreds of full file bodies to answer a simple question.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier 3: The Full Vault
&lt;/h3&gt;

&lt;p&gt;Historical log entries, archived notes, completed project files — these exist in the vault but are never loaded in bulk. They're accessible if specifically needed, but the system is designed so that the vast majority of sessions never need to touch them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Session Startup Sequence
&lt;/h2&gt;

&lt;p&gt;Every session follows the same startup sequence, defined in &lt;code&gt;SKILL.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22yeqqw3888te039d2ku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22yeqqw3888te039d2ku.png" alt="startup sequence" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few things worth noting in this flow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config is always read first.&lt;/strong&gt; &lt;code&gt;config.md&lt;/code&gt; contains the vault path and the &lt;code&gt;setup_complete&lt;/code&gt; flag. Without the vault path, the skill doesn't know where to write anything. Without the setup flag, it doesn't know whether to run the setup flow or skip to normal operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent detection determines what gets loaded next.&lt;/strong&gt; MemSpren doesn't load everything upfront and then figure out what to do. It loads the minimum (config + hot memory + system state), detects intent, and only then loads the protocol files relevant to that intent. A session where the user just asks "what am I working on" never loads a single protocol file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protocol files are loaded by section, not in full.&lt;/strong&gt; When creating a project, only the Project section of &lt;code&gt;entity-protocol.md&lt;/code&gt; gets read. The Idea and Person sections don't load. This is enforced in the intent detection table in &lt;code&gt;SKILL.md&lt;/code&gt;.&lt;/p&gt;
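
&lt;p&gt;The intent table itself can be pictured as a small mapping. The trigger phrases and the &lt;code&gt;checkin-protocol.md&lt;/code&gt; filename below are assumptions; the real table in &lt;code&gt;SKILL.md&lt;/code&gt; is richer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the intent detection table, assuming simple keyword
# triggers. checkin-protocol.md is an assumed filename.

INTENT_TABLE = {
    "check-in":    ("checkin-protocol.md", "all"),
    "new project": ("entity-protocol.md", "Project"),
    "new idea":    ("entity-protocol.md", "Idea"),
    "met with":    ("entity-protocol.md", "Person"),
}

def detect_intent(message):
    """Return (protocol_file, section) to load, or None for questions."""
    lowered = message.lower()
    for trigger, protocol in INTENT_TABLE.items():
        if trigger in lowered:
            return protocol
    return None   # "what am I working on" loads no protocol at all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;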




&lt;h2&gt;
  
  
  The Node System
&lt;/h2&gt;

&lt;p&gt;Every piece of information in the vault is a node. Projects, ideas, people, tasks, daily logs, learnings — all of them follow the same structural rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node Types and Locations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node Type&lt;/th&gt;
&lt;th&gt;Folder&lt;/th&gt;
&lt;th&gt;Filename Pattern&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Project&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Work/Projects/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;project-name.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New file per project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idea&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Work/Ideas/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;idea-name.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New file per idea&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Person&lt;/td&gt;
&lt;td&gt;&lt;code&gt;People/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;firstname-lastname.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New file per person&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Tasks/tasks-inbox.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Appended, not a new file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Notes/Learnings/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;learning-topic.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New file per learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily Log&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Log/Daily/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;YYYY-MM-DD.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One per day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tasks are the exception — they don't get their own files. Everything goes into &lt;code&gt;tasks-inbox.md&lt;/code&gt; as an appended entry using Obsidian Tasks plugin syntax. This keeps the task layer flat and queryable without spawning hundreds of individual files.&lt;/p&gt;
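
&lt;p&gt;A sketch of that capture path, assuming the Obsidian Tasks checkbox syntax with its due-date marker; the helper itself is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of task capture, assuming the Obsidian Tasks checkbox
# syntax; the helper itself is illustrative.
from datetime import date

def append_task(inbox_path, description, due=None):
    """Append one task line to tasks-inbox.md; never a new file."""
    line = f"- [ ] {description}"
    if due:
        line = f"{line} \U0001F4C5 {due}"  # Tasks plugin due-date marker
    with open(inbox_path, "a") as f:
        f.write(line + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;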

&lt;h3&gt;
  
  
  YAML Frontmatter — Non-Negotiable on Every File
&lt;/h3&gt;

&lt;p&gt;Every node gets YAML frontmatter at creation. No exceptions. The reason: retrofitting metadata onto hundreds of notes later is painful. Starting from day one means the graph is always queryable from metadata alone, even as it scales.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;node_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;project&lt;/span&gt;          &lt;span class="c1"&gt;# project | idea | person | task | learning | log&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;active&lt;/span&gt;              &lt;span class="c1"&gt;# active | idle | archived | completed&lt;/span&gt;
&lt;span class="na"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-03-01&lt;/span&gt;
&lt;span class="na"&gt;updated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-03-01&lt;/span&gt;
&lt;span class="na"&gt;connected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[[Work/Ideas/progressive-disclosure-idea]]"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[[People/niklas-luhmann]]"&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;second-brain&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;architecture&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;claude-skills&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;connected&lt;/code&gt; field mirrors the &lt;code&gt;[[links]]&lt;/code&gt; written in the body. This redundancy is intentional. System state's graph index is built from frontmatter, not from parsing body content. Claude can understand the shape of your entire graph by scanning frontmatter alone — without reading thousands of full note bodies.&lt;/p&gt;

&lt;h3&gt;
  
  
  The No-Orphan-Notes Rule
&lt;/h3&gt;

&lt;p&gt;Every new note must contain at least one &lt;code&gt;[[link]]&lt;/code&gt; to an existing node before it gets saved. If there is no existing node to link to, Claude creates the minimal stub for that linked entity first, then saves the original note with a link to it.&lt;/p&gt;

&lt;p&gt;This is enforced in &lt;code&gt;linking-protocol.md&lt;/code&gt; and is referenced as a hard invariant in &lt;code&gt;SKILL.md&lt;/code&gt;. The rule exists because the value of the system is in the connections — a vault full of disconnected notes is just a graveyard. The no-orphan rule is what ensures the graph actually builds over time.&lt;/p&gt;
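
&lt;p&gt;A minimal sketch of the invariant, assuming a regex for &lt;code&gt;[[wiki-links]]&lt;/code&gt; and a stub-creation callback; both are illustrative, not the actual wording of &lt;code&gt;linking-protocol.md&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the no-orphan invariant. The regex and the create_stub
# callback are illustrative assumptions.
import re

LINK_PATTERN = re.compile(r"\[\[([^\]]+)\]\]")

def enforce_no_orphans(body, existing_nodes, create_stub):
    """Refuse to save a linkless note; stub any missing targets first."""
    links = LINK_PATTERN.findall(body)
    if not links:
        raise ValueError("note has no [[links]], refusing to save")
    for target in links:
        if target not in existing_nodes:
            create_stub(target)       # minimal stub created before saving
            existing_nodes.add(target)
    return links
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;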

&lt;h3&gt;
  
  
  Atomization — One Idea Per Note
&lt;/h3&gt;

&lt;p&gt;When a brain dump contains multiple ideas, Claude doesn't create one note with everything in it. It creates multiple atomic notes, one per idea, each linked to the others and to relevant existing nodes.&lt;/p&gt;

&lt;p&gt;This is the Zettelkasten principle applied directly: small, discrete, highly connected notes are more valuable than large, comprehensive notes that try to contain everything. A note with five &lt;code&gt;[[links]]&lt;/code&gt; to existing nodes is more useful than a perfectly organized note with no links.&lt;/p&gt;

&lt;p&gt;The atomization happens during the check-in. The user talks in streams. Claude writes in atoms.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Check-In Flow
&lt;/h2&gt;

&lt;p&gt;The check-in is the primary interface of the system. Everything else is infrastructure in service of this interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y1ekztuz9btvp57oyoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y1ekztuz9btvp57oyoa.png" alt="Check in flow" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The check-in protocol defines exactly what Claude should extract from the brain dump: which entities to create, which existing nodes to update, what goes into tasks-inbox, and what belongs in the daily log vs a separate entity file.&lt;/p&gt;

&lt;p&gt;Claude follows up only on gaps that weren't naturally covered — not a read-out-loud checklist. The distinction matters for sustainability. A conversational check-in takes 5–10 minutes and is easy to maintain daily. A structured form with required fields is homework.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evolvable Protocol Files
&lt;/h2&gt;

&lt;p&gt;Behavioral settings don't live in &lt;code&gt;config.md&lt;/code&gt;. They live in protocol files that Claude creates and rewrites over time based on instructions you give it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frew3n528wghlgyriu1cb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frew3n528wghlgyriu1cb.png" alt="protocol evolution" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The principle: &lt;code&gt;config.md&lt;/code&gt; holds static settings that almost never change (vault path, check-in time, idle thresholds). Protocol files hold behavioral settings that should evolve as you learn what works for you.&lt;/p&gt;

&lt;p&gt;When you give Claude an instruction, it gets persisted. The relevant protocol file gets updated. System state gets flagged. The next session picks it up without you having to repeat yourself. This is how the system learns your preferences over time without you managing configuration manually.&lt;/p&gt;

&lt;p&gt;The distinction between static config and evolvable protocol files is one of the more important architectural decisions in the system. Anything that might change based on user behavior or preference is a protocol file, not a config value.&lt;/p&gt;




&lt;h2&gt;
  
  
  SKILL.md: Executable Logic in Natural Language
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;SKILL.md&lt;/code&gt; is the operating system of the skill. It is not documentation — it's a structured instruction set that Claude loads and executes at the start of every session.&lt;/p&gt;

&lt;p&gt;What &lt;code&gt;SKILL.md&lt;/code&gt; defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session startup sequence&lt;/strong&gt; — what to load, in what order, under what conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent detection table&lt;/strong&gt; — maps what the user says to which protocol gets loaded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity creation rules&lt;/strong&gt; — which type goes where, what filename pattern, what frontmatter fields are required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory loading rules&lt;/strong&gt; — what loads first, when to stop loading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard invariants&lt;/strong&gt; — what never happens (no orphan notes, no deletion, no Windows-style paths, no loading all files in bulk)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The design philosophy is progressive disclosure applied to the skill itself. &lt;code&gt;SKILL.md&lt;/code&gt; is the thin always-loaded layer. Protocol files are detail that loads on demand. A monolithic &lt;code&gt;SKILL.md&lt;/code&gt; that tries to define every template and every edge case is slow to load, expensive in tokens, and fragile to maintain. Keep it lean. Push detail into the protocol files it references.&lt;/p&gt;

&lt;p&gt;The hard invariants section of &lt;code&gt;SKILL.md&lt;/code&gt; is worth calling out specifically — it's a list of things that categorically never happen, regardless of what the user asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No note is saved without YAML frontmatter&lt;/li&gt;
&lt;li&gt;No note is saved without at least one &lt;code&gt;[[link]]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;No vault file is ever deleted — archived instead&lt;/li&gt;
&lt;li&gt;Protocol files and memory files are never written into vault content folders&lt;/li&gt;
&lt;li&gt;All file paths use forward slashes&lt;/li&gt;
&lt;li&gt;Entity files are never loaded all at once in bulk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't guidelines. They're hard stops baked into the system's behavior from day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Everything you need to run MemSpren V1 is in the GitHub repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/supre/memspren" rel="noopener noreferrer"&gt;github.com/supre/memspren&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install Claude Cowork&lt;/strong&gt; — MemSpren requires Cowork's local filesystem access. It will not work in Claude.ai chat or Claude Code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mount your Obsidian vault&lt;/strong&gt; — In Cowork, mount the folder that contains (or will contain) your Obsidian vault. Note the full path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repo and copy &lt;code&gt;SKILL.md&lt;/code&gt;&lt;/strong&gt; — Drop the skill file into your Cowork environment following the instructions in the README.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the setup flow&lt;/strong&gt; — Open a Cowork session and say &lt;em&gt;"Run the MemSpren skill."&lt;/em&gt; It will detect that setup hasn't run, ask for your vault path, and scaffold the full directory structure automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do your first check-in&lt;/strong&gt; — Say &lt;em&gt;"Let's do a check-in."&lt;/em&gt; The rest follows from there.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The README covers troubleshooting, vault path formatting, and how to verify the skill is reading and writing correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decisions Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No external database&lt;/td&gt;
&lt;td&gt;File-based only&lt;/td&gt;
&lt;td&gt;Auditable, no cloud dependency, runs anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deletion policy&lt;/td&gt;
&lt;td&gt;Hard no-deletion, archive forever&lt;/td&gt;
&lt;td&gt;Past is data. Patterns require historical nodes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config scope&lt;/td&gt;
&lt;td&gt;Static settings only&lt;/td&gt;
&lt;td&gt;Behavioral = protocol files. Evolving config is state, not config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Always-loaded files&lt;/td&gt;
&lt;td&gt;hot-memory + system-state only&lt;/td&gt;
&lt;td&gt;Everything else on demand. Token budget discipline.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hot memory cap&lt;/td&gt;
&lt;td&gt;~800 tokens&lt;/td&gt;
&lt;td&gt;Forces prioritization. No free lunch on context.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML on every file&lt;/td&gt;
&lt;td&gt;Required at creation&lt;/td&gt;
&lt;td&gt;Retrofitting metadata later is painful. Graph index needs it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;connected&lt;/code&gt; field + &lt;code&gt;[[links]]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Both, always&lt;/td&gt;
&lt;td&gt;Dual representation — navigable in Obsidian, queryable from metadata.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks as append-only inbox&lt;/td&gt;
&lt;td&gt;Single file, not per-task files&lt;/td&gt;
&lt;td&gt;Flat, queryable, no file sprawl.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomicity rule&lt;/td&gt;
&lt;td&gt;One idea per note&lt;/td&gt;
&lt;td&gt;Connection density over comprehensiveness. Small notes link more.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No orphan notes&lt;/td&gt;
&lt;td&gt;Hard invariant&lt;/td&gt;
&lt;td&gt;Graph only works if everything connects.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MOC creation&lt;/td&gt;
&lt;td&gt;Emergent, never upfront&lt;/td&gt;
&lt;td&gt;Synthesized when link density warrants it, not pre-planned.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SKILL.md design&lt;/td&gt;
&lt;td&gt;Thin always-loaded layer&lt;/td&gt;
&lt;td&gt;Detail lives in protocol files. Lean core, deep protocols.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What's Not in V1
&lt;/h2&gt;

&lt;p&gt;V1 has a deliberately tight scope. The goal was one working loop: open Cowork, run the skill, do a check-in, see files appear in your Obsidian vault with frontmatter and links. Everything below is deferred until that loop is proven reliable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No automated cron jobs — check-in is manually triggered&lt;/li&gt;
&lt;li&gt;No weekly or quarterly planning flows&lt;/li&gt;
&lt;li&gt;MOC detection exists in system state, but MOC creation is suggestion-only&lt;/li&gt;
&lt;li&gt;No pattern correlation analysis&lt;/li&gt;
&lt;li&gt;No graph visualization beyond Obsidian native&lt;/li&gt;
&lt;li&gt;No financial tracking, calendar integration, or script execution&lt;/li&gt;
&lt;li&gt;No Telegram or WhatsApp integration (planned for a future version)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The non-technical story behind MemSpren — the years of failed PKM experiments, the philosophy behind the architecture, and what it actually feels like to use it daily — is on &lt;a href="https://innerodyssey.substack.com/p/i-open-sourced-my-second-brain-heres" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;. Start there if you want the "why" before the "how."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow the build on &lt;a href="https://dev.to/supreet_s"&gt;dev.to&lt;/a&gt; and &lt;a href="https://innerodyssey.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supreet is a Senior Software Engineer building AI-native tools in public.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Beyond the Prompt: Building Production-Grade AI Skills — A Case Study</title>
      <dc:creator>supreet singh</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:35:12 +0000</pubDate>
      <link>https://dev.to/supreet_s/beyond-the-prompt-building-production-grade-ai-skills-a-case-study-17f</link>
      <guid>https://dev.to/supreet_s/beyond-the-prompt-building-production-grade-ai-skills-a-case-study-17f</guid>
      <description>&lt;p&gt;In Part 1, we examined the theory of moving from linear prompting to Skill-based orchestration—a workflow abstraction layer that uses progressive disclosure to keep the model focused and efficient.&lt;/p&gt;

&lt;p&gt;Every sprint, our QA team was catching the same category of bug: a developer had shipped a feature that passed code review but missed a subtle Acceptance Criterion buried in the Jira ticket. By the time the miss was found, the developer had already moved on. I built a Skill to close that gap—a system that automatically bridges Jira ACs to executable Cypress E2E tests. This article is a technical walkthrough of how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Requirement Disconnect and the Structural Gap
&lt;/h2&gt;

&lt;p&gt;In my current company, we manage multiple product lines simultaneously, ranging from fragile legacy systems to modern React builds. We have a small, highly capable QA team, but they are consistently outpaced by the volume of features shipped across these tracks.&lt;/p&gt;

&lt;p&gt;A typical failure looked like this: a Product Owner writes an AC that says "the modal must close when the user clicks outside it." The developer builds the modal, ensures the internal logic works, and ships it. However, they neglect the "click outside" requirement—a small but critical UX detail. QA catches the miss four days later. The developer has to context-switch back into code they've mentally closed. That single miss costs an hour of velocity—multiplied across a dozen stories per sprint.&lt;/p&gt;

&lt;p&gt;This created a structural gap in our process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missed Requirements:&lt;/strong&gt; Developers, naturally focused on technical complexity, occasionally overlook smaller, subtle ACs buried in a ticket description.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Latency:&lt;/strong&gt; The cost of "re-learning" a feature to fix a basic AC failure after days of drift is a massive drain on velocity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Testing Bottleneck:&lt;/strong&gt; Lacking automated E2E coverage for new features forces QA to burn limited bandwidth on repetitive manual verification instead of hunting for deep edge cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built this Skill to serve as a checkpoint. It acts as a forcing function for requirement alignment, allowing developers to verify their work against the source of truth before shipping.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 1: Layered Context Management (The Discovery Layer)
&lt;/h2&gt;

&lt;p&gt;The primary challenge I wanted to avoid was Context Rot. If I had loaded 5,000 lines of documentation into a single prompt, the model's reasoning performance would have dropped significantly. To solve this, I implemented a three-tier hierarchy to ensure only the context strictly necessary for the task is loaded at each stage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzxp616z0a6eff4l4ypp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzxp616z0a6eff4l4ypp.png" alt=" " width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Discovery Hierarchy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Front Matter (The Gateway):&lt;/strong&gt; This YAML block acts as the entry point. It tells Claude when this skill is relevant. By using specific trigger terms like "Cypress" or "Jira ID," we ensure the heavy logic of this skill is only active when needed, preserving the global context window.&lt;/p&gt;
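
&lt;p&gt;As a rough sketch, the gateway is only a few lines of YAML. The fields below illustrate the format; the name and trigger wording are invented for this example, not the production Skill's actual front matter:&lt;/p&gt;

```yaml
---
name: jira-ac-to-cypress
description: >
  Bridges Jira Acceptance Criteria to Cypress E2E tests. Trigger when the
  user mentions a Jira ID, Acceptance Criteria coverage, or Cypress test
  generation for a ticket.
---
```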

&lt;p&gt;&lt;strong&gt;The Main Body (The Router):&lt;/strong&gt; This acts as the "map" for the 8-phase workflow. It manages the sequence and the gates between them. It knows that Phase 3 follows Phase 2, but it doesn't need to know the specific syntax for a Cypress intercept yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Files (The Specialists):&lt;/strong&gt; These hold the deep, domain-specific logic. We separate these files so that Claude only "reads" the Jira API documentation while fetching tickets. Once it moves to writing tests, it drops that context and loads the &lt;code&gt;test-generation.md&lt;/code&gt; context. This selective loading keeps the "cognitive scope" narrow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 2: Persistent State and Hashing (The Continuity Layer)
&lt;/h2&gt;

&lt;p&gt;Standard AI interactions are stateless, which doesn't work for multi-stage engineering tasks. I built a JSON-based memory system so the Skill can track its progress and detect requirement changes over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Management and Session Resumption
&lt;/h3&gt;

&lt;p&gt;The Skill utilizes a dedicated JSON memory file that acts as its persistent state. This file is updated at the conclusion of every phase, ensuring the system never "forgets" the context of its work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0s49tns1nrqy6htpxgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0s49tns1nrqy6htpxgt.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The memory file specifically tracks ticket processing, code instrumentation maps, and the Integrity Hash. I use an MD5 hash of the raw Acceptance Criteria text field from the Jira API response.&lt;/p&gt;

&lt;p&gt;By comparing the current AC hash against the stored hash in &lt;code&gt;memory.json&lt;/code&gt;, the Skill detects "staleness." In practice, this means a developer can re-run the Skill on the same ticket after a code review without re-generating 40 test files that haven't changed. Only the deltas are ever touched, saving significant token costs and preventing unnecessary file writes.&lt;/p&gt;

&lt;p&gt;This architecture also enables &lt;strong&gt;Session Resumption&lt;/strong&gt;. If the process is interrupted, the Skill identifies the incomplete session from the memory file and prompts the user to resume from the last successful phase.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pillar 3: Review Gates and Governance (The Control Layer)
&lt;/h2&gt;

&lt;p&gt;Even with robust discovery and memory, technical reliability isn't enough. When a system can modify production code, it needs human accountability. I wanted to avoid a "black box" scenario where component instrumentation happened in unexpected places.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Configuration Layer
&lt;/h3&gt;

&lt;p&gt;I implemented a Configuration Layer that allows the user to toggle "Human Review Gates." This flexibility ensures the architecture adapts to the developer's specific need for control—whether they are shipping a critical billing flow or iterating quickly on a new UI component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategic Review Gates
&lt;/h3&gt;

&lt;p&gt;When these gates are enabled, the workflow pauses at critical decision points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Component Discovery Gate (Phase 2):&lt;/strong&gt; The Skill presents a list of exactly which React components it intends to instrument. This prevents unexpected changes in sensitive areas of the codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Matrix Gate (Phase 5):&lt;/strong&gt; Before generating Cypress files, the Skill displays a comprehensive plan categorizing tests into Happy Path, Negative, Edge Case, and Error scenarios.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Agentic Workflow
&lt;/h2&gt;

&lt;p&gt;When triggered, the Skill follows a rigorous 8-phase lifecycle, where each phase serves as a technical checkpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrzyj7zkmymjnbxhoj4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrzyj7zkmymjnbxhoj4s.png" alt=" " width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking Down the Lifecycle
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Preflight&lt;/td&gt;
&lt;td&gt;Verifies Jira connectivity via the Atlassian MCP and ensures the workspace root is correct.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context &amp;amp; Hashing&lt;/td&gt;
&lt;td&gt;Fetches the Jira AC and compares the MD5 hash against the memory file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discovery&lt;/td&gt;
&lt;td&gt;Scans the codebase to map Jira stories to specific React components.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dependency Check&lt;/td&gt;
&lt;td&gt;Verifies that required testing libraries (Cypress, custom commands, the TestId enum file) are present and importable. This prevents instrumenting components with identifiers that don't yet exist.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instrumentation&lt;/td&gt;
&lt;td&gt;Performs the "surgery" on production code, injecting unique test identifiers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Test Matrix Approval&lt;/td&gt;
&lt;td&gt;The Skill presents the full test plan to the developer before writing a single file. This is the last human checkpoint before code is generated.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Test Generation&lt;/td&gt;
&lt;td&gt;Generates the Page Objects and the actual Spec files based on the approved matrix.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;Executes a validation script that compiles the new code and checks for lint errors.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
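
&lt;p&gt;The lifecycle above can be pictured as a gated pipeline. The phase names below come from the table, but the state shape and the &lt;code&gt;confirm&lt;/code&gt; callback (standing in for the human review gate) are illustrative:&lt;/p&gt;

```javascript
// Gated pipeline sketch: run phases in order, pausing at any enabled gate.
function runPipeline(phases, initialState, confirm) {
  let state = initialState;
  for (const phase of phases) {
    if (phase.gate) {
      if (!confirm(phase.name)) {
        // A declined gate pauses the run; memory.json would let it resume here.
        return { status: 'paused', at: phase.name, state: state };
      }
    }
    state = phase.run(state);
    state.lastCompleted = phase.name; // persisted after every phase
  }
  return { status: 'done', state: state };
}

// Usage with two toy phases mirroring Discovery and Instrumentation
const phases = [
  { name: 'Discovery', gate: true, run: (s) => ({ ...s, components: ['SearchModal'] }) },
  { name: 'Instrumentation', gate: false, run: (s) => ({ ...s, instrumented: true }) }
];
const approved = runPipeline(phases, {}, () => true);
console.log(approved.status); // done
```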




&lt;h2&gt;
  
  
  Technical Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-Healing Validation
&lt;/h3&gt;

&lt;p&gt;If the TypeScript compiler (&lt;code&gt;tsc&lt;/code&gt;) or linter finds an issue, the Skill enters an iterative "Self-Healing" loop (up to three times). It reads the error log, identifies the file and line number, and attempts a fix. If it fails after three attempts, it exits the loop, preserves the state in &lt;code&gt;memory.json&lt;/code&gt;, and surfaces the raw compiler output. The developer can patch it manually and resume from that phase rather than starting over.&lt;/p&gt;
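
&lt;p&gt;Stripped of the model specifics, the loop looks like this. Here &lt;code&gt;validate&lt;/code&gt; and &lt;code&gt;fix&lt;/code&gt; are stand-ins for the real &lt;code&gt;tsc&lt;/code&gt;/lint run and the model's patch step:&lt;/p&gt;

```javascript
// Self-healing loop sketch: validate, patch, re-validate, give up after N fixes.
function selfHeal(validate, fix, maxFixes) {
  let fixes = 0;
  let errors = validate();
  while (errors.length > 0) {
    if (fixes === maxFixes) {
      // Exhausted: the real Skill persists state to memory.json here and
      // surfaces the raw compiler output for a manual patch.
      return { ok: false, fixes: fixes, log: errors };
    }
    fix(errors); // read the error log, locate file and line, attempt a fix
    fixes = fixes + 1;
    errors = validate();
  }
  return { ok: true, fixes: fixes };
}

// Usage: a mock validator that passes after two fixes
let remaining = 2;
const validate = () => (remaining === 0 ? [] : ['TS2304: Cannot find name.']);
const fix = () => { remaining = remaining - 1; };
console.log(selfHeal(validate, fix, 3)); // { ok: true, fixes: 2 }
```

&lt;p&gt;The key design choice is that failure is a first-class result rather than a crash: the preserved state is what makes "resume from that phase" possible.&lt;/p&gt;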

&lt;h3&gt;
  
  
  MUI-Aware Instrumentation
&lt;/h3&gt;

&lt;p&gt;AI-generated code often struggles with Material UI (MUI) components, which typically hide the actual HTML input inside a wrapper element. A naive implementation would instrument an MUI TextField like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TextField&lt;/span&gt; &lt;span class="na"&gt;data-testid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;TestId&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SEARCH_INPUT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cypress can't interact with that because the &lt;code&gt;data-testid&lt;/code&gt; lands on the wrapper &lt;code&gt;div&lt;/code&gt;, not the native input. The Skill knows to use the &lt;code&gt;inputProps&lt;/code&gt; prop instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TextField&lt;/span&gt; &lt;span class="na"&gt;inputProps&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data-testid&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TestId&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SEARCH_INPUT&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Encoding this library-specific knowledge into a reference file ensures any developer on the team gets it right the first time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sharing the Vision, Not the Source
&lt;/h2&gt;

&lt;p&gt;Because this Skill was built for production use at my company, the source code stays internal—but the architectural patterns here (progressive disclosure, persistent state, human-in-the-loop governance) are entirely portable. I'm currently formalizing these same patterns into an open-source "Second Brain" project, and I look forward to sharing its structure with the community soon.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts: Engineering with Confidence
&lt;/h2&gt;

&lt;p&gt;What this system ultimately gave me is confidence at the moment of merge. Before, shipping a feature meant hoping that QA would catch what I'd missed. Now, before a PR is opened, I have programmatic proof that every Acceptance Criterion has a corresponding test.&lt;/p&gt;

&lt;p&gt;The regression library builds itself sprint by sprint, and the QA team's attention is freed for the complex, judgment-heavy testing that automation can't replace. That shift—from reactive bug-catching to proactive verification—is what I set out to build when I started exploring Skills. It turns out the architecture was always capable of it; we just needed a system to unlock it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Beyond the Prompt: An Explorer’s Guide to Claude Skills (Part 1)</title>
      <dc:creator>supreet singh</dc:creator>
      <pubDate>Tue, 17 Feb 2026 10:39:50 +0000</pubDate>
      <link>https://dev.to/supreet_s/beyond-the-prompt-an-explorers-guide-to-claude-skills-part-1-gon</link>
      <guid>https://dev.to/supreet_s/beyond-the-prompt-an-explorers-guide-to-claude-skills-part-1-gon</guid>
      <description>&lt;p&gt;I’ve been spending a lot of time lately trying to figure out where "prompting" ends and "systems" begin.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Chat Box Ceiling
&lt;/h3&gt;

&lt;p&gt;Most of us start our AI journey in a chat box. We learn to ask better questions and provide more context, eventually building massive, complex prompts. Yet, even the best prompts often miss the mark.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Claude Code Transition
&lt;/h3&gt;

&lt;p&gt;I expected the friction to disappear when I moved to Claude Code because the tool lives in my local environment. It didn't work out that way. The overhead remains surprisingly high. Even with local access, if you want Claude to plan and execute a complex task, you still end up defining a massive instruction set for every single run. It becomes a tedious ritual of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specifying every parameter for the task.&lt;/li&gt;
&lt;li&gt;Reminding the agent what to watch for.&lt;/li&gt;
&lt;li&gt;Defining exactly how to take specific actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It feels less like doing the actual work and more like managing a micromanager.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Legacy System Hurdle
&lt;/h3&gt;

&lt;p&gt;When I started using Claude Code last year, "Skills" weren't really on the radar yet. My workaround was building a library of SOPs (Standard Operating Procedures). This was a necessity because I work with legacy systems—specifically ancient Linux builds and fragile ecosystems that are decades old.&lt;/p&gt;

&lt;p&gt;Before every task, I had to manually point Claude toward that world. I would identify which SOPs were relevant, tell Claude Code to read them, and then wait for it to "load" that logic before it could finally act.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Discovery of Skills
&lt;/h3&gt;

&lt;p&gt;Recently, I stumbled upon Claude Skills. I’ve found them to be much more effective at handling this orchestration without the manual baggage of the SOP approach.&lt;/p&gt;

&lt;p&gt;I’m not writing this as an expert who has mastered the architecture. I’m an explorer in the middle of an "Aha!" moment. I’m identifying patterns, figuring out the logic, and trying to build something that actually scales. Here is what I’m learning in real time.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Conceptual Shift: Instruction vs. Orchestration
&lt;/h4&gt;

&lt;p&gt;The first thing I had to unlearn is the idea that a Skill is just a "saved prompt." It isn’t.&lt;/p&gt;

&lt;p&gt;If a prompt is a recipe, a Skill is the entire kitchen setup. A normal prompt tells Claude what to do once. A Skill defines how Claude should operate across an entire category of scenarios. It is a workflow abstraction layer. Instead of a linear prompt-response cycle, you are encoding "agentic behavior" into a reusable unit. This includes conditional logic, tool choices, and specific reasoning stages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt-level:&lt;/strong&gt; "Summarize this into five bullet points."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill-level:&lt;/strong&gt; "When I give you a financial document, identify the sender, select the right parsing method, and decide if an Excel file is needed based on the data complexity."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recent research into "Divide and Conquer" frameworks for LLMs supports this. Studies show that single-shot, linear instructions suffer from "superlinear noise growth." In plain English, the longer and more complex your instructions get, the more the model’s focus degrades. Moving to an orchestration model keeps the "noise" low by keeping the immediate task context tiny.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. The Anatomy: How It’s Actually Built
&lt;/h4&gt;

&lt;p&gt;Technically, a Skill is just a folder. The structure of that folder is where the magic happens. At the root, you have a file called &lt;code&gt;SKILL.md&lt;/code&gt;. This isn’t just a readme; it is the brain of the operation. It generally has three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Front Matter (Identity):&lt;/strong&gt; A brief YAML block that tells the system what the skill is. This is how Claude "discovers" the skill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Main Body (The Router):&lt;/strong&gt; This is the orchestration layer. It doesn't hold all the instructions. Instead, it acts as a traffic controller. It tells Claude: "If Task A happens, look at File X. If Task B happens, use Tool Y."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Files (The Specialists):&lt;/strong&gt; These are separate files that hold the deep, domain-specific logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is the core of &lt;strong&gt;Progressive Disclosure.&lt;/strong&gt; Instead of stuffing 5,000 words of instructions into every chat, the Skill selectively loads only what is relevant. If you are summarizing a document, it doesn't load the code-generation logic. It keeps the "cognitive scope" narrow and precise.&lt;/p&gt;
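
&lt;p&gt;The router role reduces to a small dispatch table. The file names below echo the reference-file pattern described above, but the map itself is illustrative:&lt;/p&gt;

```javascript
// Router sketch: SKILL.md's body maps a task to the one reference file to load.
const referenceFiles = {
  'fetch-ticket': 'references/jira-api.md',
  'generate-tests': 'references/test-generation.md',
  'summarize-document': 'references/summary-style.md'
};

function route(taskType) {
  const file = referenceFiles[taskType];
  if (!file) {
    throw new Error('No reference context registered for task: ' + taskType);
  }
  return file; // only this file's contents enter the context window
}

console.log(route('generate-tests')); // references/test-generation.md
```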

&lt;h4&gt;
  
  
  3. The "Token Economy" Hypothesis
&lt;/h4&gt;

&lt;p&gt;One of the most exciting things about Skills is that they can significantly reduce token usage, sometimes by 60% to 70%.&lt;/p&gt;

&lt;p&gt;This efficiency comes from that Progressive Disclosure architecture. In a typical session, Claude might recognize twenty different skills but only loads a tiny metadata snippet for each one. The heavy instructions and reference files stay out of the context window until Claude decides they are actually needed.&lt;/p&gt;

&lt;p&gt;Loading less context uses fewer tokens, but it also leads to better results. This isn't just a hunch; 2025 research on "Context Rot" shows that even if a model is technically within its context window, irrelevant information causes reasoning performance to drop by anywhere from 13% to a staggering 85%. By limiting what the model sees, we aren't just saving money; we are dramatically improving the quality of the reasoning.&lt;/p&gt;
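
&lt;p&gt;A back-of-envelope calculation makes the mechanism concrete. Every token count below is an invented round number, not a measurement, and the actual savings depend entirely on how large your skill bodies are relative to their metadata:&lt;/p&gt;

```javascript
// Progressive disclosure, illustrated with made-up token counts.
const skillCount = 20;
const metadataTokens = 100;   // front-matter snippet loaded per skill
const fullBodyTokens = 5000;  // full instructions plus references per skill

const eagerLoad = skillCount * fullBodyTokens;                  // everything up front
const lazyLoad = skillCount * metadataTokens + fullBodyTokens;  // snippets, one active body

const saved = 1 - lazyLoad / eagerLoad;
console.log(eagerLoad, lazyLoad, saved.toFixed(2)); // 100000 7000 0.93
```

&lt;p&gt;With these particular numbers the saving comes out even higher than the 60% to 70% figure; in practice it shrinks as more skills activate per session.&lt;/p&gt;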

&lt;h4&gt;
  
  
  4. When Does a Skill Actually Make Sense?
&lt;/h4&gt;

&lt;p&gt;If you just want your AI to always use a specific tone or return JSON, a simple template is fine. You are over-engineering if you build a Skill for a trivial task. The value of a Skill explodes when you deal with &lt;strong&gt;conditional complexity.&lt;/strong&gt; Think about a financial workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; A monthly report.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Parse expenses and analyze patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Condition:&lt;/strong&gt; If spending is over budget, load the "Planning Recommendation" context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Generate an Excel file and trigger a tool to email the summary to an accountant.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, you have a system that can orchestrate the different branches of your workflow automatically.&lt;/p&gt;
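
&lt;p&gt;The four-step flow above reduces to plain control flow. All names here are invented for illustration:&lt;/p&gt;

```javascript
// Conditional workflow sketch: the plan depends on the data, not the prompt.
function planWorkflow(report, monthlyBudget) {
  const steps = ['parse-expenses', 'analyze-patterns'];  // Action
  if (report.totalSpend > monthlyBudget) {               // Condition
    steps.push('load-context:planning-recommendation');
  }
  steps.push('generate-excel', 'tool:email-accountant'); // Output
  return steps;
}

console.log(planWorkflow({ totalSpend: 4200 }, 3000));
```

&lt;p&gt;The Skill's job is exactly this branching: deciding which context to load and which tool to trigger based on what the input actually contains.&lt;/p&gt;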

&lt;h4&gt;
  
  
  5. The "Unanswered Questions" Log
&lt;/h4&gt;

&lt;p&gt;This is what I’m currently working through and what I'll cover in the next few articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; How do we actually see the Skill flow? I want to know exactly when a specific context is triggered and how to measure if the output is truly optimized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning:&lt;/strong&gt; How do we update a Skill without breaking the workflows that rely on it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Tool Selection:&lt;/strong&gt; I’ve discovered that Claude uses a search pattern to manage massive libraries. Instead of loading every tool definition, it uses a lightweight index to find tools on demand. I’m exploring how to implement custom search strategies in my own skills.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What’s Next?
&lt;/h3&gt;

&lt;p&gt;This is just the beginning. I’m currently building custom skills for my day-to-day work. While I can’t share the proprietary details, I’ll be documenting the broader architectural lessons I learn along the way.&lt;/p&gt;

&lt;p&gt;I’ll also be publishing public skills for OpenClaw. It uses a similar system, but I’ve noticed it tends to be context-heavy, which leads to significant token waste. I’m looking at how to port Claude’s efficient patterns over to OpenClaw to lean out that workflow.&lt;/p&gt;

&lt;p&gt;I’m curious about what you’re building. What messy, repetitive workflows are you trying to turn into "systems" right now? If you've managed to incorporate skills into your day-to-day, I’d love to hear how it’s changing your approach.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>claudeskills</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
