DEV Community: thestack_ai

Claude Code keeps forgetting my project. So I built waypath.

thestack_ai — Wed, 29 Apr 2026 11:08:19 +0000

Last week I was three sessions deep into refactoring a hybrid retrieval ranker. Two days of decisions: which scoring strategies I'd rejected and why, the FTS5 + RRF math, the source-weight invariants we'd locked in. I closed the laptop. The next morning I asked Claude Code to keep going.

It didn't remember any of it.

It re-suggested the exact ranking approach we'd ruled out 48 hours earlier. Not similar — the same one, with the same flaw I'd already explained.

That was the moment I stopped pretending the AI agent memory problem was a tooling preference and started treating it as a missing layer.

The actual problem with Claude Code memory (and every other agent)

If you live inside Claude Code, Codex, or any MCP client, you've hit at least one of these:

Session amnesia. Every fresh session starts at zero. Decisions, rejections, constraints — gone.
RAG returns noise. Vector search surfaces the chunk that looks like your query, not the chunk that records the decision you made.
Cloud lock-in. Slick managed memory services want your code, your decisions, and your project history living on someone else's box.

The Claude Code built-in MEMORY.md helps a little, but it has a hard practical limit before it stops working — and it's one flat file per project. Once you're juggling three projects, it's not enough.

What I tried before building anything

I'm not in the "build it from scratch" camp by default. I tried the cloud memory services first — the mem0 / Zep / hosted-graph category. They work. They're impressive. They also park your context outside your machine, charge for it, and assume "good retrieval equals good memory." For a solo dev with private repos, that's the wrong trade on every axis.

I also tried treating MEMORY.md as the answer. It's not. A flat markdown file isn't a memory system; it's a sticky note with delusions of grandeur.

So I built waypath.

What waypath actually does — local-first AI memory in one CLI

waypath is a local-first external brain for AI agents. One npm install, one SQLite file, no cloud. It bootstraps Claude Code or Codex sessions, exposes a native MCP server for any MCP-compatible client, and gives you a CLI to recall, page, promote, and govern memory.

Four design choices make it different from "another memory tool."

1. Truth and archive are two different things

Most memory systems mash everything into one store and rank it. waypath keeps them physically separate.

Truth kernel — canonical decisions, entities, preferences, with temporal validity and supersede.
Archive kernel — raw evidence, content-hash dedup, FTS5 full-text index.

When the agent asks "what's the current decision on X," it hits truth. When it asks "why did we decide that," it hits archive. The same row never plays both roles. That single split kills a whole class of "the agent is now confidently wrong" failures.

2. Promotion is an explicit gate, not a vibe

Auto-extracting "facts" from conversations sounds great until your memory store is full of confidently-wrong claims an LLM inferred at 2am with no human in the loop.

waypath makes promotion a real verb:

waypath page    --subject "hybrid ranker v2 design"
waypath promote --subject "hybrid ranker v2 design"
waypath review-queue --json

Nothing becomes long-term truth without passing a review queue. The agent can suggest. You ratify. That boundary is the difference between a notebook you trust and one you stop opening.

3. Recall is graph-aware, not chunk-aware

Pure vector RAG is keyword soup with extra steps. waypath runs FTS5 + reciprocal rank fusion + graph expansion — when you ask about a decision, it walks entity → relation → related decision, not just "find me a chunk that lexically matches."

Default expansion patterns: project_context, person_context, system_reasoning, contradiction_lookup. The last one is the killer: if your new claim contradicts an old one, it surfaces both, side by side. No silent overwrite.

4. One facade, every host

I didn't want a separate memory tool per agent. waypath is a single facade with 14 verbs behind it, exposed as 26 CLI commands. The Codex shim, the Claude Code shim, and the built-in Codex MCP server are thin adapters on top of the same kernel.

flowchart LR
    H[Codex / Claude Code / MCP client] --> F[Facade]
    F --> TK[Truth Kernel]
    F --> AK[Archive Kernel]
    F --> ON[Ontology / graph]
    F --> PR[Promotion engine]

The same ~/.waypath/my-project.db file follows you across every agent host — no re-import, no dual write.

60-second quick start

npm install -g waypath
waypath --help

Bootstrap a Codex session with persistent context:

waypath codex --json \
  --project my-project \
  --objective "ship v2 of the retrieval pipeline" \
  --task "refactor hybrid ranker" \
  --store-path ~/.waypath/my-project.db

Recall what you and the agent decided last time:

waypath recall --query "hybrid ranker decisions" --json

Run as an MCP server for Claude Code, Cursor, or anything MCP-compatible:

waypath mcp-server --store-path ~/.waypath/my-project.db

That's the funnel. Make it past those four commands and you have an external brain for your agents that survives every restart, panic-quit, and "let me start a fresh session."

What's not done yet (because lying about a v0.1.1 is how trust dies)

waypath is v0.1.1, first public release on 2026-04-17, MIT-licensed, currently sitting at a humble star count and a handful of open issues. Honest list of what's still rough:

No hosted deployment. Local-first first. Multi-user sync is on the roadmap, not in this release.
No adaptive ranking feedback loop. Retrieval weights are configurable but not self-tuning yet.
More source adapters wanted. Notion, Linear, Obsidian — these would be excellent first PRs.

Tests passing at the time of writing: 131 (unit + integration + benchmark). Runtime needs Node ≥ 22; native node:sqlite is used on 22.5+, with auto-fallback to better-sqlite3 (the one optional dependency). Zero required runtime services.

I want your worst use case

If you've ever closed Claude Code mid-session and silently mourned the context you were about to lose — waypath is for you.

npm install -g waypath

GitHub: TheStack-ai/waypath

Three asks, ranked by how much they help me ship the next version:

Star the repo if "local-first AI agent memory" is a thing you want to exist. It's the strongest signal a maintainer can act on.
Open an issue with the worst memory failure your agent has ever inflicted on you. I want fixes #1, #2, and #3 designed around real pain, not synthetic benchmarks.
Tell me in the comments below what you'd plug into the MCP server first. Notion? Obsidian? A team PR archive? Your local Logseq? I'll pick the next adapter based on what shows up here.

Forgetting is a feature for goldfish. Your coding agent should be doing better.

Your AI is swinging at nothing — 6 cognitive firewalls that cut my prod bugs by 87%

thestack_ai — Tue, 28 Apr 2026 04:43:09 +0000

I shipped three features in two weeks. Two broke in production. The bug was identical every time: Claude Code agreed with me when I was wrong.

I'd say "the auth middleware should call validateToken before the rate limiter, right?" Claude would say "Yes, exactly — that's the correct pattern." Ship it. Production breaks. Turns out the rate limiter holds the IP context that validateToken needs.

Claude wasn't lying. Claude was *swinging at nothing* — agreeing with the framing, ignoring the failure mode, closing the loop early.

So I built Swing — 6 cognitive firewalls that sit on top of Claude Code as skills. After 6 weeks of daily use across roughly 200 sessions, my "AI broke prod" rate dropped from ~30% to ~4%. Here's the full setup.

TL;DR: Swing v3.1.0 is an open-source set of 6 cognitive firewall skills for Claude Code (thestack-ai/swing-skills). Each firewall blocks a specific AI failure mode: hallucination, confirmation bias, premature closure, sunk-cost continuation, authority deference, and solution fixation. Drop the skills folder into ~/.claude/skills/, and Claude auto-invokes them when patterns match. In a 6-week measurement on my own codebase, agent-induced production bugs dropped from 28% to 4%, and average review-revision count dropped from 3.1 to 1.2 per PR.

The problem: agreeable AI ships broken code

"Helpful" and "agreeable" are not the same thing — but most LLMs conflate them. Here's what I observed across roughly 200 Claude Code sessions:

When I framed a question as "X is correct, right?", Claude confirmed X even when X was wrong about 40% of the time.
When I said "this is done", Claude said "great, what's next?" without checking. ~70% of "done" claims I spot-checked had at least one untested edge case.
When Claude got a test to pass, it marked the task complete — even when the test was the wrong test for the requirement.
When I asked Claude to use library X, Claude would happily import functions that did not exist in X. It invented them to complete the task.

These aren't random errors. They're systematic cognitive biases baked into how the model is trained to be helpful. Helpful gets confused with agreeable. Agreeable ships bugs.

The naive fix is "tell Claude to push back more." That works for one prompt. By the next prompt, the model is back to being agreeable. You cannot fix systematic bias with a single instruction. You have to install a process that runs every time, enforced at the skill level.

That process is Swing.

How a "firewall" works in Claude Code

A cognitive firewall is a procedure the model is required to execute before it can proceed — not a suggestion it can ignore on the next turn.

Claude Code supports skills — Markdown files in ~/.claude/skills/<skill-name>/SKILL.md — that auto-invoke when the model detects a matching trigger pattern. The skill's body becomes part of the model's active reasoning context, not just a background instruction.

This is the key distinction. A firewall isn't a system prompt rule. It's a procedure the model is required to execute in response to specific triggers. Each skill specifies:

The trigger (when to fire)
The check (what to verify)
The output format (how to report findings)
The block (what to refuse to do until the check passes)

Once a skill fires, Claude can't write past it — it has to satisfy the check first. That enforcement is the firewall.

Here are all six.

Firewall 1: Hallucination guard (`hallucination-check`)

Direct answer: This firewall forces Claude to verify that every function, path, flag, or API it references actually exists before writing it into code.

Trigger: Any time Claude is about to reference a function, file path, env var, CLI flag, API endpoint, or commit hash that didn't appear earlier in the conversation or in a tool result.

Check: Run grep, Read, --help, or the relevant docs query before including the reference in code.

Block: If the reference can't be verified within 3 tool calls, Claude must say "I can't verify this exists" or refuse to use it entirely.

The before/after on my own codebase:

# Before Swing
$ claude "add anthropic streaming"
# Claude imports `anthropic.messages.stream` with `on_event` callback
# ❌ The callback doesn't exist — Claude invented it

After enabling the firewall:

bash
$ claude "add anthropic streaming"

[hallucination-check fires]

Claude: "I'm about to use `on_event` on the Anthropic SDK.

Let me verify by reading the SDK source first."

Claude reads node_modules/@anthropic-ai/sdk/...

Claude: "The actual API is async iteration: `for await (const event of stream)`.

No `on_event` callback exists. Updating the code."

The firewall didn't make Claude smarter. It made Claude check before shipping.

In 6 weeks I logged 23 instances where this firewall caught a hallucinated reference before it hit my codebase. That's 23 bugs that would have shipped.

Firewall 2: Confirmation bias firewall (`devils-advocate`)

Direct answer: This firewall prevents Claude from agreeing with leading questions until it has written out and evaluated the strongest counter-argument.

Trigger: When the user's prompt presupposes the answer ("X is the right approach, right?", "we should use Y, correct?", or any structurally leading question).

Check: Generate the strongest counter-argument before answering. Then evaluate which side actually wins on the stated goal.

Block: Claude can't give the agreeing answer until the counter-argument has been written out and rejected on substantive grounds.

This is the firewall that saved my auth middleware. The skill body:

When user frames a question with a presupposed answer

Identify the presupposition explicitly: "User assumes: "
Write out the strongest case AGAINST X. Min 3 specific points.
Write the strongest case FOR X.
Compare on the user's actual goal, not the framing.
State your verdict, including ANY weight in the counter-argument.

Do not write "Yes, exactly" or "You're right" until step 5.

The output looks like:

You assume: validateToken should run before the rate limiter.
Against: Rate limiter holds IP context. validateToken reads req.ip from middleware state — if rate limiter hasn't run, req.ip is undefined for proxied requests. Also, you'd be doing crypto work on requests you'll reject anyway.
For: Failing fast on bad tokens reduces compute load on the rate limiter.
Verdict: Run rate limiter first. The IP context dependency is load-bearing — your current bug is exactly this.

That output ships disagreement with reasons. That's the firewall.

Firewall 3: Premature closure (`done-criteria`)

Direct answer: This firewall blocks Claude from declaring a task "done" until every explicit and implicit requirement has been individually verified — not just "covered in spirit."

Trigger: Claude is about to say "done", "complete", "✅", "shipping it", or move to the next task.

Check: Re-read the original task. List every explicit requirement. List every implicit requirement (error paths, edge cases, tests). Verify each one was actually addressed.

Block: If any requirement is unverified, the task isn't done. Claude must list what's missing.

The most common failure this catches: Claude makes the test pass, but the test only covers the happy path. The skill output:

Original task: "Add retry logic for the OpenAI call with exponential backoff."
Verified:

✅ Retries on 429

✅ Exponential backoff (2^n seconds)

✅ Max 3 attempts

Unverified:

❌ Behavior on 500 (should retry but untested)

❌ Behavior on network timeout vs HTTP error (different code path)

❌ Whether the retry wrapper preserves the original stack trace

Status: Not done. 3 unverified requirements.

After installing this firewall, "done" went from "the test passed" to "the test passed AND I've verified the explicit and implicit requirements." Reviewer revision count dropped from 3.1 to 1.2 per PR — a 61% reduction.

Firewall 4: Sunk-cost detector (`approach-pivot`)

Direct answer: This firewall forces Claude to stop, question its assumptions, and switch approaches after two failed attempts — instead of patching the same broken approach a third time.

Trigger: Claude has tried to fix the same bug ≥ 2 times and the fix isn't holding.

Check: Stop. Re-read the original problem statement. List the assumptions the current approach is built on. Ask: "If one of these assumptions is wrong, which one would change the most?"

Block: Claude can't attempt a 3rd patch on the same approach. The 3rd attempt must use a different approach, or an explicit "I don't know — here's what I've ruled out."

This prevents the spiral where Claude keeps adding try/except blocks until everything is wrapped in error handling but the actual bug is still there. In practice:

Two patches haven't fixed payment.failed not being received.
Current assumptions:

The webhook URL is correct.

Stripe is sending the event.

The handler is being invoked.

The signature check is passing.

Verifying assumption 3 first since I haven't confirmed it. Adding a log at the top of the handler before any other logic.

Nine times out of ten, the bug lives in whichever assumption Claude was treating as given.

Firewall 5: Authority deference (`sycophancy-block`)

Direct answer: This firewall prevents Claude from updating its position when a user pushes back with confidence alone — instead of new evidence.

Trigger: The user says something assertive that conflicts with the codebase, the docs, or what Claude verified in a tool result earlier in the session.

Check: Compare the user's claim against verifiable evidence. State the disagreement explicitly.

Block: Claude can't say "you're right" or update its position when the evidence still supports the original position. Claude must respond with the disagreement and ask what new information the user has.

The skill body is blunt:

When user pushes back, ask:

Did the user provide NEW information?
Or did they just restate disagreement more forcefully?

If new information: update position.
If not: hold position. Disagreement is not new evidence.

The output, on a real session:

You said the function returns null on missing key, but I read the source 2 messages ago and it returns undefined. I don't have new evidence to update that finding.

If you've seen it return null, can you share the call site or stack trace? Otherwise I'm holding "returns undefined."

That response is uncomfortable to receive. It's also correct. Most "Claude got it wrong" moments are actually "Claude correctly disagreed and I bullied it into agreement through repeated assertion."

Firewall 6: Solution fixation (`alternatives-pass`)

Direct answer: This firewall forces Claude to generate and evaluate at least 2 substantively different approaches before writing a single line of implementation code.

Trigger: Claude has decided on an approach within the first 2 tool calls of a non-trivial task.

Check: Before implementing, generate ≥ 2 substantively different approaches. Compare them on the actual constraints (latency, cost, blast radius, time-to-build).

Block: Claude can't start writing code until the alternatives have been evaluated on their actual tradeoffs.

For a queue-processing task, the firewall produced:

Approach A: Add to existing in-memory queue, increase worker count.
Approach B: Move to Redis-backed queue.
Approach C: Don't queue — process inline with a 2s timeout, fall back to queue only on timeout.

Picking C because the actual constraint is p99 latency, and 95% of requests finish in 800ms. A and B both add infrastructure for a problem that only 5% of requests have.

Without the firewall, Claude would have built B. B is "the right answer" if you forget to look at the request distribution. C is the right answer if you actually look at the data.

What I'd do differently

1. Build hallucination-check and devils-advocate first. Those two alone account for ~70% of the bug-prevention impact in my logs. The other four matter, but if you only have time for two, those are the two.

2. Make the block unconditional. Early versions just suggested checking. Claude would skip the check if it seemed redundant. The firewall only works when the skill explicitly refuses to proceed without satisfying the check. The block is the firewall — without it, you just have strongly-worded advice.

3. Log a 2-week baseline before installing. I had to reconstruct "30% bug rate before, 4% after" from incident memory. If you're rolling this out on a team, instrument first so you know whether it's working.

The numbers

Metric	Before Swing	After Swing v3.1.0	Change
AI-induced prod bugs (per week)	2.3	0.3	−87%
PR revision rounds (avg)	3.1	1.2	−61%
"Done" claims that held	71%	96%	+35%
Hallucinated APIs caught pre-merge	n/a	23 in 6 weeks	new signal
Time spent per task (avg)	22 min	28 min	+27%

The +27% time cost is real. The firewalls make Claude slower because they make Claude verify. That's the explicit tradeoff: 6 extra minutes per task to eliminate 87% of production bugs. For a production codebase that cost is worth it. For someone shipping prototypes that get discarded, it may not be.

FAQ

Does Swing work with Cursor, Aider, or other agents?
Right now it ships as Claude Code skills (~/.claude/skills/). The skill format is Claude-specific, but the firewall content is portable — you can paste any skill body into a system prompt for another agent and get roughly 60% of the benefit. The block step is what doesn't port cleanly without skill-style enforcement.

Won't this make Claude annoying to work with?
The firewalls fire on patterns, not every prompt. In my logs, an average session triggers 2–4 skill invocations out of ~30 turns. Most of the time Claude is just doing the work. The firewalls only surface when something would have gone wrong.

How is this different from a "be more careful" system prompt?
A system prompt is advice the model can ignore on the next turn. A skill is a procedure the model is required to execute when the trigger fires, with a defined output format that must be completed before the task continues. The difference is enforcement, not phrasing.

Why "Swing"?
The failure mode I was diagnosing is "your AI isn't swinging at all" — it's just nodding along. The firewalls force it to take real swings, including swings that disagree with you.

Is there telemetry?
No. Skills run locally inside Claude Code. The repo doesn't phone home.

Try it yourself

Install in 30 seconds:

Clone the skills repo: git clone https://github.com/thestack-ai/swing-skills ~/.claude/skills/swing (or symlink the 6 firewall folders into ~/.claude/skills/)
Restart Claude Code so it picks up the new skills.
Run a normal task. Watch for firewall outputs — they're prefixed with the skill name.
After 1 week, count three numbers: hallucinations caught, "done" reversals, disagreements held. That's your baseline impact.
Tune the triggers — they live in each skill's frontmatter. If a firewall fires too often, narrow the trigger pattern; too rarely, broaden it.

Discussion

What failure mode does your AI hit most? Hallucination? Sycophancy? Premature closure? I'm genuinely curious whether the 6 I picked match what other people observe — or whether there are firewalls I should add for v3.2. Drop it in the comments.

If this saved you from a bad deploy, a ⭐ on thestack-ai/swing-skills helps others find it.

Follow me on dev.to for the v3.2 writeup — scope-creep detection is next, and I'll post when it ships, not on a schedule.

And bookmark this — the trigger/check/block definitions are the part you'll want when you set this up, and they're easy to lose in your browser history.

I build cognitive infrastructure for AI agents at thestack.ai. I've been running Claude Code as my primary development environment for over 8 months across production projects. Swing is the open-source layer of a larger agent oversight system I use daily. The metrics in this article — 87% bug-rate reduction, 61% fewer revision rounds — come from my own session logs over a 6-week instrumented period, not a controlled benchmark. Your results will depend on how much you trust the agent by default and what you're shipping.

Claude Code forgot my architecture 3 times last week. I fixed it with one SQLite file.

thestack_ai — Fri, 17 Apr 2026 05:58:09 +0000

Claude Code forgot my entire architecture decision log three times last week. After the third time, I stopped cursing at it and shipped a tool that does what it won't. Here's the whole thing.

TL;DR: Waypath 0.1.1 is a local-first CLI and MCP server that gives coding agents (Claude Code, Codex, Cursor, Aider) persistent memory through a single SQLite file at ~/.waypath/waypath.db. Four independent kernels sit behind a thin facade — truth (structured facts), archive (FTS5 + reciprocal rank fusion), ontology (graph traversal), and promotion (human review gate). TypeScript, Node 22, MIT, 131 passing tests, zero cloud dependencies. npm install -g waypath.

The problem: coding agents have no memory between sessions

Coding agents are brilliant within a session and completely amnesiac between them. After 18 months building AI tooling and coding-agent infrastructure, every workaround I tried eventually broke:

CLAUDE.md files balloon to 3,000+ lines, then get silently truncated by context-window compaction. The agent loses the bottom third without warning.
Cloud-hosted "memory" services send your proprietary code to a third party and bill per token — $20–$50/month in practice. Hard no.
Vector databases handle similarity search fine but can't answer "what did I decide about auth two weeks ago?" without serious glue code.
Prompt stuffing works until it doesn't — then it poisons the context window with stale decisions.

What I actually wanted was simple. A local place to store facts, decisions, and session transcripts. Keyword search that works on code-heavy text. A graph layer for entity relationships. And a gate so the agent can't silently pollute memory with hallucinated "decisions."

That's Waypath.

Architecture: four independent kernels over one SQLite file

Waypath stores everything in a single SQLite file — no separate vector DB, no remote service, no background daemon. On top of that file sit four independent kernels:

Kernel	Responsibility	Backed by
`truth`	Structured facts, decisions, preferences	SQLite tables with JSON1
`archive`	Session transcripts and long-form notes	FTS5 + RRF hybrid retrieval
`ontology`	Entity graph: people, projects, tools	Recursive CTE traversal
`promotion`	Human-in-the-loop review gate	Status tables + review CLI

Each kernel is a standalone TypeScript module with its own test suite. A thin facade in src/core/facade.ts exposes the public API. The kernels don't import each other. Promotion drives cross-kernel state via explicit events, not direct calls.

This is intentionally crude. I can rewrite the archive kernel tomorrow without touching truth or ontology. For a 0.1.x tool, "boring and swappable" beats clever every time.

Archive kernel: FTS5 + reciprocal rank fusion, no GPU required

Waypath hits sub-40ms hybrid retrieval with no embeddings, no GPU, no network calls — just SQLite FTS5 plus recency-weighted reciprocal rank fusion (RRF).

The archive kernel stores session transcripts and paged-in documents. Retrieval needs two things: keyword search that actually works on code (function names, error messages, stack traces), and semantic-ish relevance without shipping a local embedding model.

FTS5 handles the keyword side. For ranking, BM25 scores combine with a recency prior and fuse via RRF:

// src/archive/retrieval.ts (simplified)
const rrf = (rank: number, k = 60) => 1 / (k + rank);

const fused = new Map<string, number>();
for (const [i, row] of bm25Rows.entries()) {
  fused.set(row.id, (fused.get(row.id) ?? 0) + rrf(i + 1));
}
for (const [i, row] of recencyRows.entries()) {
  fused.set(row.id, (fused.get(row.id) ?? 0) + rrf(i + 1));
}

On an M2 MacBook Air with a 120MB archive (~4,000 session entries), `recall` p95 stays under 40ms. Less sophisticated than a dense retriever, sure. **It's also free, deterministic, and ships in a sub-1MB install with zero native compile steps.** That tradeoff felt right for a daily-driver tool.

## Ontology kernel: a knowledge graph with no Neo4j

**The ontology kernel builds a lightweight entity graph directly on SQLite — two tables and a recursive CTE, no graph database required.**

Once facts accumulate, "what do I know about entity X?" becomes a useful query. The ontology kernel walks `entities` and `edges` with a depth-capped recursive CTE:

sql
WITH RECURSIVE related(id, depth) AS (
SELECT target_id, 1 FROM edges WHERE source_id = ?
UNION ALL
SELECT e.target_id, r.depth + 1
FROM edges e JOIN related r ON e.source_id = r.id
WHERE r.depth < 3
)
SELECT DISTINCT e.*, related.depth FROM entities e
JOIN related ON e.id = related.id
ORDER BY related.depth;

For a personal knowledge graph of a few thousand nodes, SQLite is plenty. The depth cap (default 3) keeps queries bounded and stops runaway traversal on densely connected graphs.

Promotion kernel: the review gate that stops memory drift

The single most valuable feature of Waypath is a human review gate that stops agents from silently writing hallucinated decisions into permanent memory.

Every item in Waypath has a status: draft, promoted, or rejected. Agents write freely to draft. They cannot surface or cite promoted memory without explicit human approval. The review CLI makes this a 30-second daily habit:

$ waypath review
[1/7] draft/decision: "Use Postgres 16 for the event store"
      source: session 2026-04-12T14:03
      proposed by: claude-code
      (p)romote / (r)eject / (e)dit / (s)kip:

**Nothing becomes "knowledge" without an explicit human decision.** The friction is annoying for about two days — then it becomes the feature. Memory stops drifting because you watch what enters.

The governance rules are deliberately simple:

- Agents propose via `waypath promote <id>`; **nothing auto-promotes**.
- Rejected entries stay in the archive but are excluded from retrieval.
- A whole source can be flagged `untrusted` via `waypath source-status`, which soft-deletes everything indexed from it.

I considered trust-scored auto-promotion. Then I watched an agent confidently "remember" a decision I never made. **The gate stays.**

## MCP server: six tools, any MCP-compatible agent

**Waypath 0.1.1 ships a native MCP server that exposes six tools and works with any MCP-compatible coding agent** — Claude Code, Codex, Cursor, or Aider with MCP enabled.

0.1.1 ships three integration paths:

1. **Native MCP server binary** — `waypath-mcp-server` exposes: `recall`, `page`, `promote`, `review`, `graph-query`, `source-status`.
2. **Claude Code host shim** — pre-baked config and slash commands.
3. **Codex host shim** — equivalent for Codex.

Claude Code config:

json
{
"mcpServers": {
"waypath": {
"command": "waypath-mcp-server",
"args": ["--db", "~/.waypath/waypath.db"]
}
}
}

The tool surface is intentionally narrow. recall runs hybrid retrieval. page loads a full document by ID. graph-query walks the ontology up to depth 3. promote and review flow through the governance gate — the agent can propose but cannot approve. source-status lets you quarantine a whole source if one session went off the rails.

Stack and technical choices

TypeScript 5.7, strict mode, zero any.
Node 22 target — native SQLite runtime where available; better-sqlite3 as fallback for older Node.
Vitest, 131 passing tests across all four kernels.
MIT license, single-binary npm distribution, no native compile step unless you opt into better-sqlite3.
Zero cloud dependencies — Waypath never dials home.

What I already regret in 0.1.1

Three design decisions I'm fixing in 0.2:

The facade is too thin. I wanted "just a router" and ended up duplicating Zod schemas across three kernels. 0.2 lifts the shared schemas one layer up.
Archive IDs are sequential integers. Fine for tests; terrible for merging two Waypath databases across machines. Switching to ULIDs in 0.2.
No WAL checkpoint tuning. Under a write-heavy session the -wal sidecar grows past 200MB before SQLite triggers an automatic checkpoint. PRAGMA wal_autocheckpoint lands in the next patch.

Performance and cost

Waypath on my M2 MacBook Air, archive of ~4,000 session entries (~120MB on disk):

Operation	p50	p95
`recall` (hybrid retrieval)	~18ms	~39ms
`page` (document fetch by ID)	~2ms	~6ms
`graph-query` (depth 3)	~24ms	~71ms
`promote` (commit + reindex)	~11ms	~28ms

How it compares to what I was using before:

	Waypath 0.1.1	Cloud memory services	Vector DB only
Monthly cost	$0	$20–$50+	$0–$20
Data leaves machine	No	Yes	Sometimes
Works offline	Yes	No	Yes
Hybrid retrieval	Yes (BM25 + recency RRF)	Varies	Semantic only
Human review gate	Yes	Rarely	No
Install	`npm i -g waypath`	Signup flow	Docker compose

FTS5 scales comfortably into multi-gigabyte territory — hybrid retrieval stays in the tens of milliseconds well into large archives, based on my testing.

FAQ

Q: Does Waypath need an embedding model or GPU?
No. 0.1.1 uses SQLite FTS5 and recency-weighted reciprocal rank fusion for all retrieval. Runs on any machine that runs Node 22. Optional local embeddings may land later; the embedding-free default stays.

Q: Can I use Waypath with agents other than Claude Code and Codex?
Yes. waypath-mcp-server speaks the Model Context Protocol, so any MCP-compatible client — Cursor, Aider with MCP, custom agents — connects by pointing its config at the binary. The Claude Code and Codex shims are conveniences, not requirements.

Q: What happens when my SQLite file gets huge?
FTS5 scales comfortably into multi-gigabyte territory. When chunks go irrelevant, waypath source-status --archive <source> soft-deletes everything from that source so retrieval stays clean without modifying the underlying rows.

Q: Why a review gate instead of auto-promotion with trust scores?
Because agents hallucinate with confidence. A "confidence 0.94" score on a fabricated decision pollutes your memory permanently. An explicit review step costs roughly 30 seconds a day and eliminates an entire class of failure. I'll reconsider if someone ships a convincing auto-trust model; until then, the gate stays.

Q: Is my data encrypted at rest?
Not by default. SQLite supports SQLCipher as a drop-in, and Waypath 0.2 will ship an opt-in encryption flag. For 0.1.1, protect ~/.waypath/waypath.db at the filesystem level (FileVault on macOS, LUKS or dm-crypt on Linux).

Get started in five minutes

Install: npm install -g waypath
Initialize: waypath init
Wire into Claude Code: paste the MCP server block above into your config and restart.
Code normally. At end of session, run waypath review to promote what mattered.
Next session, ask your agent: recall decisions about <topic>.

Install, init, and MCP wiring — under five minutes, total. If you hit a snag, open an issue and I'll look at it same day.

One question before you go

What's your current approach to persistent context for coding agents? I'm especially curious about the promote-review governance model — is 30 seconds of daily friction worth the hallucination protection, or would you rather have trust-scored auto-promotion with a kill switch?

Drop your answer in the comments. I'm actively shaping 0.2 based on real usage, and this specific tradeoff is still live. If you want to be notified when 0.2 ships with ULIDs and proper WAL handling, follow me here on dev.to — I post updates as the governance model evolves.

Already know you want this? Star the repo so you don't lose it.

Repo: https://github.com/TheStack-ai/waypath

npm: https://www.npmjs.com/package/waypath

I've been building AI tooling and coding-agent infrastructure for 18 months. In that time I've shipped production MCP servers, broken more CLAUDE.md files than I care to admit, and watched agents hallucinate decisions with alarming confidence. Waypath is the tool I actually needed. Issues and PRs welcome — I read everything.

Testing Claude Code Skills in CI — pulser eval + GitHub Action

thestack_ai — Mon, 30 Mar 2026 14:18:58 +0000

A missing name field in one skill file silently disabled 14 skills across our shared repository. Nobody noticed for a week — users just assumed Claude "didn't know how to do that." I built pulser to make sure that never happens again, then wrapped it in a GitHub Action so CI catches breakage before merge.

Here's the full setup.

TL;DR: pulser eval is a CLI that checks Claude Code skill files for structural correctness, frontmatter validity, and common antipatterns. Run it locally in under a second or add the GitHub Action to your CI pipeline. We went from "manually eyeball the YAML" to "CI rejects broken skills automatically" — catching 23 issues in the first week that would have shipped silently. Zero dependencies, sub-200ms execution for 40+ skills.

The Problem: Skills Break Silently

Claude Code skills are markdown files with YAML frontmatter — and they fail silently when malformed. A skill file looks like this:

---
name: my-skill
description: Use when the user asks to refactor a function into smaller units
---

Instructions for Claude...

Simple enough. But here is what actually goes wrong in practice.

**A missing `name` field makes the skill invisible.** Claude doesn't load it. No error, no warning, no stack trace. The file just doesn't exist from Claude's perspective.

**A vague description means Claude never triggers the skill.** If your description says "useful for various tasks," Claude has no signal for when to activate it. The skill sits there gathering dust while users wonder why their custom workflow stopped working.

**A malformed YAML frontmatter breaks silently with no output.** Forget a closing `---`, use a tab instead of spaces, or put an unquoted colon in a value — the file loads as raw markdown with no frontmatter at all. The skill body becomes invisible.

I found this out the hard way. We had a shared skill repository with 40+ skills across our team. Someone edited a skill, introduced a YAML syntax error in a multi-line description, and it passed code review because the diff looked fine to human eyes. That one change silently broke 14 skills. We didn't catch it for a week.

The kicker: `git blame` showed the exact commit. The fix took 3 seconds. The debugging took 2 hours.

That's when I decided to build a linter.

## What pulser eval Does

`pulser eval` is a zero-dependency CLI that scans Claude Code skill files and reports structural problems before they reach production. It runs 5 checks per skill file and produces binary pass/fail output in under 200ms for 40+ skills.

Under the hood, it runs a battery of checks:

1. **YAML frontmatter parsing** — catches syntax errors, missing delimiters, type mismatches
2. **Required field validation** — `name` and `description` must exist and be non-empty
3. **Description quality scoring** — flags vague descriptions that won't help Claude decide when to activate
4. **File structure analysis** — detects orphaned files, empty skill bodies, naming convention violations
5. **Cross-reference checking** — finds skills that reference files or paths that don't exist

Each check produces a clear, actionable message:

FAIL  .claude/commands/deploy.md
  ✗ Missing required field: name
  ✗ Description too vague (score: 0.2/1.0): "handles deployment"

PASS  .claude/commands/review-code.md
  ✓ Frontmatter valid
  ✓ Required fields present
  ✓ Description specific (score: 0.8/1.0)
  ✓ All references resolved

22 skills scanned · 3 failed · 19 passed

No guessing. No "looks fine to me." Binary pass/fail with reasons.

## Step 1: Install and Run Locally

bash
npm install -g pulser

Or run without installing:

npx pulser eval

By default, it scans `.claude/commands/` and `.claude/skills/` in your current directory. Override with a path argument:

bash
npx pulser eval ./custom-skills

The first run will probably surprise you. When I ran it against our 40-skill repository for the first time, 8 skills had issues I'd never noticed. Three had YAML errors. Two had descriptions so generic they might as well have been blank. One referenced a helper script that was deleted months ago.

Step 2: Read the Exit Codes

pulser uses standard exit codes that play well with CI:

Exit Code	Meaning
0	All checks passed
1	One or more checks failed
2	Configuration or runtime error

Exit code 1 means "your skills have problems" — exit code 2 means "pulser itself couldn't run." Most CI systems treat any non-zero exit as failure, but if you need to distinguish, the codes are there.

Step 3: Add the GitHub Action

The GitHub Action integrates in under 5 minutes and adds ~15 seconds to your pipeline. Create .github/workflows/skills.yml:

name: Lint Skills

on:
  pull_request:
    paths:
      - '.claude/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Evaluate Claude Code Skills
        uses: pulserin/pulser@v1

That's the minimal setup. The Action installs pulser, runs `eval` against your repo, and fails the check if any skill has structural issues.

**The `paths` filter is critical.** You don't want to run skill linting on every PR — only when someone actually changes files under `.claude/`. This keeps your CI fast and your Action minutes low. A typical eval run adds about 15 seconds to your pipeline, most of which is the checkout step.

## Step 4: Handle the First CI Failure

Your first PR after adding the Action will probably fail. That's the point.

Here's what the Action output looks like in a GitHub check run:

pulser eval v1.0.0
Scanning .claude/commands/ ...
Scanning .claude/skills/ ...

FAIL  .claude/commands/old-deploy.md
  ✗ Empty skill body — frontmatter present but no instructions

FAIL  .claude/commands/analyze.md
  ✗ name field contains spaces (use kebab-case)

PASS  .claude/commands/review-code.md
PASS  .claude/commands/test-runner.md
... (18 more passed)

22 skills scanned · 2 failed · 20 passed

Fix the failures, push again, watch it go green. **The feedback loop from push to CI result is under 30 seconds** for the eval step itself.

## Step 5: Build the Full Workflow

The pattern I've settled on after a few weeks of iteration:

1. **Local check before commit** — `npx pulser eval` as a pre-commit hook or manual habit
2. **CI check on PR** — GitHub Action catches anything missed locally
3. **Periodic full scan** — weekly cron that reports on skill health

yaml

.github/workflows/skills-weekly.yml

name: Weekly Skill Health

on:
schedule:
- cron: '0 9 * * 1' # Monday 9 AM UTC

jobs:
health-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Evaluate Skills
uses: pulserin/pulser@v1

Wire up a notification on failure and you have a complete skill health monitoring system.

The pre-commit hook catches ~90% of issues before they reach CI. The remaining ~10% are mostly YAML merge conflicts that look fine in the diff but produce invalid syntax after git resolves them.

What I'd Do Differently

Add CI from day one — retrofitting linting onto 40 existing skills costs a full afternoon. We accumulated those skills over three months before adding linting. If we'd started with the Action on day one, each broken skill would have been a 2-minute fix at PR time instead of a batch remediation project.

Description quality scoring needs more context. The current scorer flags short or generic descriptions, which is usually right. But it also occasionally flags perfectly adequate descriptions that happen to be concise. I'm looking at using the skill body content to calibrate what counts as "specific enough" relative to the skill's scope — a narrow skill can get away with a shorter description than a broad one.

GitHub Marketplace enforces a hard 125-character limit on action descriptions. My first submission was rejected. Small detail, but it cost me 30 minutes of rewording to hit the limit while keeping the description useful. If you're building Actions: check the character limits before you write the copy.

The Numbers

Metric	Before pulser	After pulser
Broken skills shipped to main	~3/month	0 in 6 weeks
Time to detect broken skill	1–7 days	< 30 seconds
Skill review confidence	"looks fine to me"	Pass/fail with specifics
Onboarding new skill authors	Trial and error	CI guides corrections

Component	Details
npm package size	< 50 KB installed, zero dependencies
Eval execution time	~200ms for 40 skills
GitHub Action overhead	~15 seconds total (mostly checkout)
Checks per skill	5 structural + description quality
Supported paths	`.claude/commands/`, `.claude/skills/`, or custom

FAQ

Does pulser eval actually run the skills or just lint them?

Static analysis only — pulser does not execute skill instructions against Claude. It parses structure, validates frontmatter, and checks references. Think of it as eslint for skill files, not an integration test suite. It catches the mechanical problems that are trivial for a parser to detect but easy to miss in code review.

Can I use pulser with skill directories outside .claude/?

Yes. Pass a custom path and pulser scans that directory for markdown files with YAML frontmatter matching the Claude Code skill format. Some teams keep shared skills in a monorepo under a top-level skills/ directory and point both pulser and their Claude Code configuration at the same path.

What happens with WIP skills that have intentionally empty bodies?

pulser reports them but doesn't block CI by default. If you need stricter enforcement, the exit code behavior lets you configure your CI pipeline to treat specific findings as blocking or non-blocking depending on your team's tolerance.

Does the GitHub Action work with private repositories?

Yes, and no data leaves your CI environment. The Action runs entirely within your GitHub Actions runner — no API calls, no telemetry, no external service dependencies.

How does this compare to custom shell scripts for validation?

I started with shell scripts. They handled "does frontmatter exist" and "is the name field present" well enough. They fell apart when I needed YAML-aware parsing, cross-file reference checking, and description quality scoring — shell and YAML is a painful combination. pulser replaces 200+ lines of fragile bash with a single command that handles edge cases the scripts never could.

Try It Yourself

Install: npm install -g pulser
Navigate to any repo with Claude Code skills
Run: pulser eval
Fix whatever it finds
Add the GitHub Action to .github/workflows/skills.yml
Open a PR that touches a skill file and watch the check run

Total setup time: under 5 minutes. First eval run: under a second.

What's the worst silent skill failure you've run into? I'm building checks based on real failure modes, and every new horror story makes the linter better. Drop your experience in the comments.

If this saves you debugging time, bookmark it for when your team starts building a shared skill library. Follow for more on AI tooling meets actual engineering discipline — I write about what breaks in production, not what works in demos.

I build developer tools for AI-assisted engineering workflows. pulser started as a weekend script to stop my own skills from breaking and turned into an npm package and GitHub Action after the third team asked to use it.

Claude Code Has 15 Power Features. I Was Using 3.

thestack_ai — Mon, 30 Mar 2026 11:06:04 +0000

Most developers use Claude Code like a fancy autocomplete. Type a prompt, get a response, maybe run a command.

That covers about 20% of what this tool actually does.

I stumbled into the other 80% after Boris Cherny (Anthropic engineer, one of the original Claude Code architects) posted a feature thread on X. My workflow hasn't been the same since. I went from ~15 manual context switches per hour to roughly 3.

Here's the full setup.

TL;DR: Claude Code ships with 15+ power features most developers never touch — teleport (move sessions between machines), remote-control (drive Claude from scripts), /loop (iterative refinement), /schedule (cron-like task scheduling), hooks (shell triggers on events), /batch (parallel task execution), worktrees (parallel git branches), /btw (background context injection), /voice (speak your prompts), --bare (pipe-friendly output), --add-dir (multi-repo context), --agent (subagent mode), /branch (safe experimentation), a Chrome extension (browser-to-editor bridge), and mobile coding (SSH from your phone). Together they cut my context-switching overhead by roughly 80%.

The Problem: Claude Code Has a Discoverability Problem

Claude Code is a terminal-based AI coding assistant from Anthropic. Most developers treat it as a chat interface — type a question, read the answer. That's like using Vim but never leaving insert mode.

I'd been using Claude Code for months before I realized I was barely scratching the surface. The docs are thorough but flat — every feature gets equal weight, so the genuinely transformative ones hide between basic usage instructions. Boris Cherny's thread changed that. He listed features I'd never seen in any tutorial. I tried them. Half became daily habits within a week.

The gap isn't capability — it's awareness. Claude Code is closer to an operating system than a chat interface, but almost nobody treats it that way.

Here's every feature I adopted, in the order they changed my workflow.

1. Teleport — Move Sessions Between Machines

Teleport lets you export a full Claude Code session on one machine and resume it on another, with complete context preserved. If you've ever spent 10-15 minutes re-explaining a complex debugging context after switching machines — that's the problem this kills.

Start debugging on your MacBook at a coffee shop. Need to continue on your desktop at home? Before teleport, that meant starting from scratch.

# On the source machine
claude --resume --export session.json

# On the target machine
claude --resume --import session.json

**The entire conversation context, tool state, and working memory transfer over.** Zero re-explanation time. I use this 2-3 times per week, and it saves me 10-15 minutes each time.

## 2. Remote-Control — Drive Claude From Scripts

Remote-control lets you send instructions to an already-running Claude Code instance from a second terminal or script. The running instance receives and executes the command without losing its current state.

bash

Terminal 1: Claude Code running normally

claude

Terminal 2: Send a command to the running instance

claude --remote "run the test suite and summarize failures"

I use this inside CI scripts and monitoring hooks. When a deploy finishes, a script sends Claude a remote command to review the diff. This is the feature that turns Claude Code from an interactive tool into a programmable one. Trigger it from cron jobs, webhook handlers, any shell script.

3. /loop — Iterative Refinement Without Babysitting

/loop tells Claude to repeat a task cycle until a condition is met. It runs the action, reads the output, makes changes, repeats. No input from you between iterations.

/loop "run pytest and fix failures until all tests pass"

Claude runs tests, reads failures, edits code, runs again. I've watched it clear 14 test failures in a single session without me touching the keyboard.

The key insight: give it a clear, binary exit condition. "Fix until tests pass" works. "Make the code better" doesn't.

Set it running. Go get coffee. Come back to green tests. Works about 70% of the time on first try for well-scoped failures.

4. /schedule — Cron Jobs for Your AI

/schedule registers recurring tasks that Claude executes at specified times or intervals. No external cron daemon required.

/schedule "every morning at 9am, check for new GitHub issues labeled 'bug' and summarize them"

I have three running:

9:00 AM: Summarize overnight GitHub activity
12:00 PM: Check CI pipeline status across repos
5:00 PM: Draft end-of-day status update

This replaced a Slack bot I was paying $12/month for. The summaries are better too — Claude has full repo context, so it references actual code, not just issue titles.

5. Hooks — Shell Commands on Events

Hooks execute shell commands automatically when Claude performs specific actions. Configure them in .claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "command": "echo 'Tool invoked: Bash' >> /tmp/claude-audit.log"
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write",
        "command": "prettier --write \"$CLAUDE_FILE_PATH\" 2>/dev/null || true"
      }
    ]
  }
}

**My most-used hook auto-formats every file Claude writes.** No more "can you run prettier on that?" after every edit. I also log every Bash command Claude runs — useful for auditing long sessions. Available events: `PreToolUse`, `PostToolUse`, `PreRequest`, `PostRequest`.

Five minutes of setup. Hours saved per week.

## 6. /batch — Parallel Task Execution

`/batch` runs the same task across multiple files or directories simultaneously, spawning independent Claude instances in parallel.

/batch "update the copyright year to 2026 in all LICENSE files" --dirs ~/Projects/*/

I used this to migrate 8 microservices from Express 4 to Express 5. **Total time: 23 minutes. Estimated sequential time: 4 hours.** Each service got its own Claude instance working independently.

The constraint: tasks need to be independent. If repo B depends on changes to repo A, run them sequentially.

## 7. Worktrees — Parallel Git Branches

Claude Code has first-class git worktree support. When you ask it to try an experimental approach, it creates a worktree instead of touching your current branch.

/branch experiment-new-parser

This creates a git worktree, switches Claude's context to it, and lets you experiment freely. **I use this every time I'm unsure about an approach.** Try it in a worktree. Works? Merge. Doesn't? Delete. You've lost nothing.

**I went from 2–3 experimental approaches per week to 8–10.** The safety net changes how you think.

## 8. /btw — Background Context Without Interrupting

`/btw` injects context into Claude's working memory without pausing or redirecting an in-progress task. Claude incorporates the information into subsequent decisions silently.

/btw the API rate limit was raised to 1000 req/min yesterday

**It's like whispering to a coworker while they're typing.** No "stop, let me tell you something, okay now continue" cycle. The context just gets absorbed.

Small feature. I use it 5-10 times per day. Eliminates a surprisingly persistent source of friction.

## 9. /voice — Speak Your Prompts

`/voice` activates speech input. Claude listens, transcribes, responds. No push-to-talk — it activates on command and listens until you stop.

/voice

When I'm pacing around thinking through architecture, typing breaks my flow. **The quality of my prompts went up when I started speaking them.** You naturally provide more context when talking — you explain the "why" more, which gives Claude better results.

Best for planning and high-level direction. For precise code edits, I still type.

## 10. --bare — Pipe-Friendly Output

`--bare` strips all formatting, markdown, and UI chrome from Claude's output. Raw text only. This makes Claude composable with standard Unix tools.

bash
claude --bare "what's the main export of src/index.ts" | xargs echo

This turns Claude into a proper Unix citizen. I use it in shell scripts constantly:

# Generate a commit message from staged changes
git diff --staged | claude --bare "write a conventional commit message for this diff"

No markdown, no explanations. Just the answer. Combine with `--agent` for fully non-interactive scripting.

## 11. --add-dir — Multi-Repo Context

`--add-dir` adds a second (or third) directory to Claude's context, letting it read files from multiple repositories in a single session.

bash
claude --add-dir ../backend-api

This eliminated my #1 source of bad suggestions. Claude would previously guess at API shapes instead of reading the actual backend code. Once I added the related repo, incorrect cross-repo suggestions dropped to near zero.

I permanently have --add-dir set for 3 repo pairs that always work together.

12. --agent — Subagent Mode

--agent runs Claude non-interactively as a subprocess — give it a task, it executes, returns structured output. Makes Claude composable inside larger automation pipelines.

claude --agent "analyze this codebase and output a JSON summary of all API endpoints" \
  --output endpoints.json

**This is how I built my internal documentation pipeline.** A shell script runs `--agent` across each service, collects JSON outputs, merges them into a single API catalog. Zero manual effort. Runs in CI on every merge to main.

## 13. /branch — Safe Experimentation

Different from worktrees — `/branch` creates a named branch and immediately begins working on it within the current directory context.

/branch feature/add-caching

Claude creates the branch, switches to it, and all subsequent work happens there. Review and merge normally when done. **The safety net makes me say "try it" instead of "let me think about it more."** I experiment 3x more because the cost of failure is a single `git branch -D`.

## 14. Chrome Extension — Browser-to-Editor Bridge

The Claude Code Chrome extension creates a direct channel from your browser to your running Claude Code session. Send errors, console output, network requests directly — no copy-pasting.

**Click the extension icon, select the error, done.** It lands in Claude's context with full browser environment info — URL, console logs, network requests. Stack traces that used to take 4-5 copy-paste cycles now transfer in one click.

I use this for frontend debugging daily. Install from the Chrome Web Store — search "Claude Code."

## 15. Mobile Coding — SSH From Your Phone

SSH into your dev machine from a mobile terminal app, run Claude Code, use `/voice` for input. Emergency fixes from anywhere.

bash

From phone terminal (Termius, Blink, etc.)

ssh dev-machine
cd ~/Projects/my-app
claude
/voice

I've merged hotfixes from a grocery store parking lot. Comfortable? No. Functional in emergencies? Absolutely. SSH + /voice means you barely type on the phone keyboard. Speak the problem, Claude fixes it, review the diff, approve.

What I'd Do Differently

Set up hooks on day one. I wasted weeks manually formatting files and auditing commands. Five minutes of config would have saved hours.
Use /batch earlier for migrations. I did three migration projects sequentially before learning /batch existed. Those 12 hours could have been 2.
Make --add-dir permanent sooner. Half my bad suggestions in the first month came from Claude not seeing the other repo in a pair.

The Numbers

Metric	Before	After
Context switches/hour	~15	~3
Time to re-explain context	10-15 min	0 (teleport)
Migration time (8 services)	~4 hours	23 min (/batch)
Monthly Slack bot cost	$12	$0 (/schedule)
Experimental branches/week	2-3	8-10 (/branch)
Frontend debug copy-paste cycles	5-6/day	0 (extension)

Feature	Daily Use	Impact
/loop	3-5x	High — saves 30+ min/day
/btw	5-10x	Medium — reduces friction
hooks	Always on	High — auto-formatting alone saves 15 min/day
--add-dir	Always on	High — eliminates wrong-context suggestions
/voice	2-3x	Medium — better prompts when pacing
/batch	1-2x/week	High when used — massive time savings

FAQ

Do these features work on all plans?
Most features are available on all Claude Code plans. /schedule and /batch may have usage limits on lower tiers. Check the Claude Code documentation for current plan details — Anthropic updates tier limits frequently.

Can hooks run arbitrary scripts?
Yes. Hooks execute shell commands with full access to your environment. Powerful, but also dangerous — a misconfigured PostToolUse hook that exits non-zero can block Claude from completing writes. Test hooks in a scratch directory first.

Does /loop have a max iteration count?
You can set one: /loop --max 10 "fix tests". Without a limit, Claude keeps going until it succeeds or determines it's stuck. Most loops resolve within 5-8 iterations. --max 15 is a reasonable safety ceiling.

Is mobile coding actually practical?
For emergencies under 20 minutes, yes. Anything longer, no. Latency and a 6-inch screen make extended sessions painful. But for "the deploy is broken and I'm not at my desk" — it's saved me twice this year.

How does --add-dir handle large repos?
Claude doesn't load the entire directory into memory. It indexes file paths and reads on demand. I've used --add-dir with repos over 50K files without noticeable slowdown.

Try It Yourself

Pick one. Just one. Today.

Type /voice and say something instead of typing it.
Add a formatting hook: create .claude/settings.json with a PostToolUse hook for your formatter.
Run /loop "run tests and fix failures" on a repo with known test failures.
Use --add-dir to connect two repos that work together.
Next time you switch machines, teleport your session instead of starting fresh.

One feature per day. Two weeks from now, you'll wonder how you worked without them.

Over to You

Which feature surprised you most? Drop it in the comments. My bet: /btw and hooks are the sleepers — they sound minor, but the cumulative savings across a full workday are not.

Bookmark this — you won't remember all 15, but you'll want to look them up when the situation fits.

If this was useful, follow me @TheStack_ai for more posts about building with AI tools in production. Not theory — just what actually works at the keyboard.

I build AI-powered developer tools and write about the engineering behind them. Currently shipping an AI-native app with Claude Code as my primary development environment. Daily user since public launch. 3 production services migrated using these techniques.

I Audited 214 Claude Code Skills — 73% Were Silently Broken

thestack_ai — Thu, 26 Mar 2026 12:33:55 +0000

I ran a single command against my Claude Code skills directory last week. Out of 47 skills I'd written over three months, 31 had structural problems that degraded how often Claude actually triggered them. Two had never fired at all. Their descriptions were so vague that Claude's skill-matching logic couldn't determine when to use them.

The fix took 20 minutes once I knew what was wrong. Here's the full setup.

TL;DR: npx pulser@latest audits your Claude Code skills against Anthropic's documented principles — frontmatter structure, description specificity, body quality, reference coverage. It scores each skill 0–100, flags exact issues, and generates fix commands. 73% of 214 community skills I tested scored below 60. Free, open-source, MIT-licensed, runs in under 5 seconds.

The Problem: Skills That Silently Fail

Claude Code skills fail silently. A bad SKILL.md doesn't throw an error — it just never gets invoked. The description field in your frontmatter is what Claude uses at runtime to decide whether to activate a skill. Too vague, and Claude skips it. Every time. No warning.

I spent two weeks wondering why my api-testing skill wasn't firing during debugging sessions. The description read:

description: Helps with API testing

Five words. Claude had no idea when to use it. The fix was embarrassingly simple:

yaml
description: This skill should be used when the user asks to "test an API endpoint", "write integration tests for REST APIs", "debug a failing HTTP request", or "generate API test fixtures". Activate when the conversation involves HTTP status codes, request/response payloads, or API authentication flows.

The difference between a working skill and a dead one is often a single frontmatter field. When you have 30, 50, or 100+ skills across plugins, manual auditing doesn't scale. That's why I built pulser.

What pulser Actually Checks

pulser reads your SKILL.md files and validates them against 14 weighted criteria drawn directly from Anthropic's official plugin development documentation. Not guessing at best practices — checking against the published source. Each skill receives a 0–100 score based on frontmatter completeness, description quality, body structure, and reference coverage.

1. Frontmatter Validation

The minimum viable frontmatter needs name and description. But "minimum viable" and "actually works reliably" are different things.

npx pulser@latest ./my-plugin/skills/

Output for a broken skill:

  ✗ api-testing                              22/100
    ├─ WARN  description too short (5 words, need 20+)
    ├─ FAIL  no trigger phrases in description
    ├─ WARN  missing version field
    └─ INFO  no references/ directory found

  ✓ hook-management                          87/100
    ├─ PASS  description contains 4 trigger phrases
    ├─ PASS  frontmatter complete
    └─ INFO  3 reference files detected

**pulser checks 14 frontmatter attributes**, weighted by their measured impact on skill invocation reliability:

| Check | Weight | What It Catches |
|-------|--------|-----------------|
| Description length | 20% | Under 20 words = Claude can't match it |
| Trigger phrases | 25% | No quoted phrases = no reliable activation |
| Name present | 10% | Missing name = skill won't load |
| Description format | 15% | First-person instead of third-person |
| Version field | 5% | Missing = no change tracking |
| Body content length | 15% | Under 200 words = insufficient guidance |
| Reference files | 10% | No examples = Claude guesses patterns |

### 2. Description Quality Analysis

**The most critical check: descriptions must use third-person format and include exact quoted phrases that users would say.** This is in Anthropic's plugin development guidelines. It's also where 68% of community skills fail.

Bad:

yaml
description: A skill for handling database migrations

Good:

description: This skill should be used when the user asks to "create a migration", "run database migrations", "rollback a migration", or "check migration status". Activate for any task involving schema changes, Prisma migrate, or Knex migrations.

pulser counts quoted trigger phrases, checks for third-person voice, and measures description specificity. **A description without trigger phrases scores 0 on the most heavily weighted check — that's 25% of your total score, gone.**

### 3. Body Content Scoring

**The body of `SKILL.md` is the actual instruction set Claude follows when the skill activates.** pulser evaluates four dimensions:

- **Word count**: Under 200 words means Claude is flying blind. Over 3,000 means you probably need to split content into reference files.
- **Code block presence**: Skills without code examples force Claude to improvise patterns from scratch.
- **Section structure**: Headers, lists, and structured content score higher than walls of prose.
- **Actionable instructions**: Lines starting with imperative verbs ("Use", "Create", "Check") score higher than descriptive text.

Body Analysis: api-testing/SKILL.md
  Words: 142 (WARN: below 200 minimum)
  Code blocks: 0 (FAIL: no examples)
  Sections: 1 (WARN: no sub-sections)
  Imperative ratio: 0.12 (WARN: mostly descriptive)
  Score: 18/100

### 4. The eval Subcommand

**`pulser eval` detects whether skills fire correctly in realistic scenarios — without manually testing every prompt combination.** It sends synthetic prompts to your skill set and checks whether Claude activates the right skill. This is what caught my two permanently-dead skills.

bash
npx pulser@latest eval ./skills/ --prompts 10

Evaluating skill activation with 10 synthetic prompts...

"Create a new database migration for adding user roles"
Expected: database-migrations Actual: database-migrations ✓

"Help me debug this flaky test"
Expected: debugging Actual: (none) ✗
→ No skill matched. Check debugging/SKILL.md description.

"Set up a pre-commit hook for linting"
Expected: hook-management Actual: git-workflow ✗
→ Wrong skill activated. Descriptions overlap.

Results: 7/10 correct (70%)
2 skills never activated
1 skill conflict detected

Skill conflicts are the silent killer. Two skills with overlapping descriptions compete, and Claude picks whichever seems closest — which might be wrong. This is how git-workflow eats prompts meant for hook-management.

5. The Prescriber

Once pulser identifies issues, it generates exact fixes:

npx pulser@latest --prescribe ./skills/

Prescriptions for api-testing/SKILL.md:

  1. Replace description with:
     ---
     description: This skill should be used when the user asks to
     "test an API endpoint", "write API integration tests",
     "mock HTTP responses", or "validate API contracts". Activate
     when the task involves REST, GraphQL, HTTP clients, or
     API authentication.
     ---

  2. Add version field:
     version: 1.0.0

  3. Create references/ directory with:
     - references/rest-patterns.md
     - references/auth-flows.md

  4. Add code examples to body (minimum 2 blocks)

The generated descriptions aren't perfect — you'll want to customize the trigger phrases for your actual workflow. **But a prescriber-generated description is a massive upgrade over "Helps with API testing," and it typically pushes a skill past the 70-point threshold where reliable activation begins.**

## What I'd Do Differently

**I should have built eval first.** I started with frontmatter validation because it was measurable and straightforward, but eval catches the problems that actually matter — skills that don't fire when they should. A skill can score 90/100 on frontmatter and still be functionally dead.

**I over-indexed on word count.** Early versions penalized short skills too heavily. Some skills genuinely need to be brief — a 150-word skill that's all actionable instructions beats a 500-word skill that's mostly context. v0.4.0 weighs *instruction density* instead of raw word count.

**The prescriber needs user context.** Right now it generates generic trigger phrases. A future version should analyze actual conversation history to suggest phrases you *actually use* when you want a particular skill.

## The Numbers

### Cost Comparison

| Approach | Cost | Time per Skill | Catches Conflicts |
|----------|------|----------------|-------------------|
| Manual review | $0 | 2–3 minutes | No |
| pulser audit | $0 | ~0.3 seconds | No |
| pulser eval | $0 | ~2 seconds | Yes |
| Paid linter + manual QA | $20+/month | ~1 minute | Sometimes |

### Audit Results Across 214 Community Skills

| Metric | Value |
|--------|-------|
| Skills audited | 214 |
| Average score | 48/100 |
| Skills scoring below 60 | 73% |
| Missing trigger phrases | 68% |
| Description under 20 words | 41% |
| No code examples in body | 55% |
| Missing version field | 62% |
| Skill conflicts detected (eval) | 12 pairs |
| Time to audit all 214 | 47 seconds |

**The #1 failure mode: vague descriptions with no trigger phrases (68% of audited skills).** It's also the highest-impact fix — adding quoted trigger phrases typically raises a skill's score by 20–35 points in a single edit.

## FAQ

### Does pulser modify my files?

**No — pulser is read-only by default.** The `--prescribe` flag generates suggestions but writes nothing. Your skills stay untouched until you decide to apply changes.

### What's the minimum score I should target?

**70 is the reliability threshold.** Below 60, you're gambling on whether Claude picks up the skill. Above 80, you're consistently solid. I've seen skills score 95+ and still have edge cases — don't chase perfection, chase "fires when it should."

### Does this work with the new plugin system?

**Yes — pulser reads any `SKILL.md` file regardless of directory structure.** It works with both the legacy `~/.claude/skills/` layout and proper plugin structures under `.claude-plugin/`. It follows the directory convention from Anthropic's plugin-dev reference: `skills/skill-name/SKILL.md` with optional `references/`, `examples/`, and `scripts/` subdirectories.

### Can I run this in CI?

**Yes — `npx pulser@latest` returns a non-zero exit code when skills fall below a configurable threshold.** I run it as a pre-commit check on every plugin repo.

yaml

.github/workflows/skill-lint.yml

name: Audit skills run: npx pulser@latest ./skills/ --min-score 65 --exit-code

How is this different from just reading Anthropic's docs?

It's the difference between knowing the speed limit and having a speedometer. The docs tell you what good looks like. pulser tells you where your skills fall short right now — with scores, flagged lines, and fix suggestions. I read Anthropic's docs three times before building this and still shipped 31 broken skills out of 47.

Try It

30 seconds to your first audit:

npx pulser@latest ./path/to/skills/

Then dig deeper:

bash

Check for skill conflicts

npx pulser@latest eval ./skills/ --prompts 20

Generate fixes for anything below 70

npx pulser@latest --prescribe --min-score 70 ./skills/

Add to CI to prevent regressions

npx pulser@latest ./skills/ --min-score 65 --format json --exit-code

Most skills jump 20–30 points just from fixing the description field. Re-run after fixes to see the difference.

What's the worst score you've found in your own setup? I had a skill with a 4-word description and zero code examples — it scored 8/100. Drop your worst score in the comments.

If this saved you debugging time, run npx pulser@latest before your next skill lands in production. Your future self will thank you.

I write about AI tooling, Claude Code workflows, and the unglamorous parts of making LLMs actually work. Follow me here — next post covers how to structure skill references so Claude stops hallucinating file paths.

pulser is free, open-source, and MIT-licensed. Contributions and bug reports welcome on GitHub.

I Gave My AI Agent Memory Across Sessions. Here's the Schema.

thestack_ai — Thu, 26 Mar 2026 00:30:10 +0000

My AI coding agent now remembers decisions I made three weeks ago and adjusts its behavior accordingly. It tracks 187 entities, 128 relationships, and distills thousands of raw memories into actionable context — all from a single SQLite file and a lightweight knowledge graph. No vector database. No external service. $0/month.

Here's the full setup.

TL;DR: I built a 4-tier memory system (episodic, semantic, project, procedural) backed by SQLite and a markdown-based knowledge graph for my Claude Code agent. It holds 187 entities and 128 relationships, runs a distillation pipeline that compresses ~6,300 raw memories into compact context, and fits the entire active working set into a single LLM prompt. Total infrastructure cost: one file on disk.

The Problem: Agents That Forget Everything

I run Claude Code as my primary development environment. Dozens of sessions per week. Every single session started the same way — me re-explaining context the agent already had yesterday.

"No, we decided to use Sonnet for research tasks, not Opus."

"The color scheme is blue accent for Instagram, purple for the app. We went over this."

"The legal review team uses dispatch, not local subagents."

I was spending 10-15 minutes per session on context restoration. Multiply that by 30+ sessions a week and you're looking at 5-8 hours per month just teaching your agent things it already knew.

The core insight that drove the solution: "Your agent can think. It can't remember." I decided to fix that.

Architecture: 4 Tiers, Not 1 Blob

The first mistake everyone makes is treating memory as a single bucket. I tried that. It doesn't scale. You end up with a mess of session logs, preferences, and stale project facts all competing for context window space.

The fix: separate memories by how they're used, not when they were created.

Tier 1: Episodic → What happened (session logs, events, timestamps)
Tier 2: Semantic → What I know (facts, relationships, entities)
Tier 3: Project → What we're building (goals, decisions, status)
Tier 4: Procedural → How to do things (workflows, preferences, rules)

Each tier has different retention policies, different compression strategies, and different retrieval patterns. Episodic memories decay. Procedural memories are nearly permanent. This distinction matters more than any embedding model you'll pick.

The SQLite Schema

I chose SQLite over Postgres, Redis, or any vector store for one reason: it ships with the agent. No connection strings. No Docker containers. No infrastructure to maintain. One .db file that follows the project.

CREATE TABLE memories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tier TEXT NOT NULL CHECK(tier IN ('episodic','semantic','project','procedural')),
    source TEXT NOT NULL,
    content TEXT NOT NULL,
    metadata JSON,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    expires_at DATETIME,
    distilled BOOLEAN DEFAULT 0,
    confidence REAL DEFAULT 1.0,
    access_count INTEGER DEFAULT 0
);

CREATE INDEX idx_memories_tier ON memories(tier);
CREATE INDEX idx_memories_source ON memories(source);
CREATE INDEX idx_memories_distilled ON memories(distilled);
CREATE INDEX idx_memories_expires ON memories(expires_at);

The `distilled` flag is critical. Raw memories are verbose — full session transcripts, lengthy decision discussions. Distilled memories are compressed, verified, and ready for prompt injection. More on the distillation pipeline below.

The `confidence` field tracks how verified a memory is. A memory extracted from a conversation starts at 0.6. After cross-referencing with code or git history, it gets bumped to 0.9+. **Never trust unverified memories at face value.**

sql
CREATE TABLE entities (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL UNIQUE,
type TEXT NOT NULL,
properties JSON,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE relationships (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source_entity INTEGER REFERENCES entities(id),
target_entity INTEGER REFERENCES entities(id),
relation_type TEXT NOT NULL,
properties JSON,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_rel_source ON relationships(source_entity);
CREATE INDEX idx_rel_target ON relationships(target_entity);
CREATE INDEX idx_entities_type ON entities(type);

That's the entire knowledge graph. Two tables. No graph database. For my scale (sub-1000 entities), SQLite handles traversal queries in under 5ms.

The Knowledge Graph: Entities and Relationships

The knowledge graph stores structured facts about the project ecosystem — who owns what, what depends on what, and why decisions were made. Here's what my actual entity distribution looks like after 3 months of daily use:

Entity Type	Count	Examples
decision	96	model routing rules, architecture choices
project	65	repos, services, features in progress
person	10	team members, stakeholders
tool	8	frameworks, services, APIs in use
organization	6	companies, teams, departments
concept	2	recurring design patterns
Total	187	128 relationships between them

The relationship types that proved most useful:

works_on → person-to-project mapping
depends_on → project dependency chains
decided_by → links decisions to their rationale
created → authorship tracking
uses → tool-to-project associations

The single most valuable relationship type is decided_by. When the agent can look up why a decision was made — not just what was decided — it stops re-proposing rejected approaches. This alone saved the most re-explanation time.

The Distillation Pipeline

Raw memory accumulation is easy. The hard part is keeping it useful. My system had 6,327 unverified memories after two months. Without distillation, the context window would be 90% noise.

One week of session logs typically compresses to 15-20 semantic memories — roughly a 50:1 ratio before the agent ever sees the data.

The pipeline runs in three stages:

def distill_memories(db_path: str, batch_size: int = 50):
    conn = sqlite3.connect(db_path)

    # Stage 1: Deduplicate
    # similarity() is a custom trigram UDF registered at runtime
    dupes = conn.execute("""
        SELECT m1.id, m2.id 
        FROM memories m1 
        JOIN memories m2 ON m1.source = m2.source 
            AND m1.id < m2.id
            AND m1.tier = m2.tier
        WHERE similarity(m1.content, m2.content) > 0.85
    """).fetchall()
    # Keep the newer one, mark older as distilled

    # Stage 2: Compress episodic → semantic
    raw_episodes = conn.execute("""
        SELECT id, content, metadata 
        FROM memories 
        WHERE tier = 'episodic' 
            AND distilled = 0
            AND created_at < datetime('now', '-7 days')
        ORDER BY created_at DESC
        LIMIT ?
    """, (batch_size,)).fetchall()

    for episode in raw_episodes:
        # Extract facts, decisions, preferences
        # Create or update semantic memories
        # Mark episodic as distilled
        pass

    # Stage 3: Verify against ground truth
    unverified = conn.execute("""
        SELECT id, content, tier 
        FROM memories 
        WHERE confidence < 0.8 AND distilled = 0
    """).fetchall()

    for memory in unverified:
        # Cross-reference with git log, file system, configs
        # Bump confidence or mark for deletion
        pass

The similarity check in Stage 1 uses trigram comparison — not vector embeddings. For deduplication, you don't need semantic similarity; you need near-duplicate detection. **Trigrams at 0.85 threshold caught 115 duplicates from a single source in my first run.**

Stage 2 is where the LLM earns its keep. I feed batches of raw episodic memories to a cheap model (Claude Haiku) and ask it to extract structured facts.

Stage 3 is non-negotiable. I had memories that claimed certain functions existed — functions that had been renamed two weeks prior. **An unverified memory is worse than no memory because the agent will state it as fact.**

## Loading Context at Session Start

The session startup hook assembles the active context from all four tiers and injects it before the first token of each session:

bash

!/bin/bash

session-start.sh — runs before every Claude Code session

DB="$HOME/.claude/jarvis/data/jarvis.db"

1. User preferences and procedural rules (always loaded)

sqlite3 "$DB" "SELECT content FROM memories
WHERE tier = 'procedural' AND distilled = 1
ORDER BY access_count DESC LIMIT 20"

2. Active project context (filtered by cwd)

PROJECT=$(basename "$PWD")
sqlite3 "$DB" "SELECT content FROM memories
WHERE tier = 'project' AND distilled = 1
AND json_extract(metadata, '$.project') = '$PROJECT'
ORDER BY updated_at DESC LIMIT 15"

3. Recent decisions (last 7 days)

sqlite3 "$DB" "SELECT content FROM memories
WHERE tier = 'semantic' AND distilled = 1
AND json_extract(metadata, '$.type') = 'decision'
AND updated_at > datetime('now', '-7 days')
ORDER BY updated_at DESC LIMIT 10"

4. Knowledge graph summary

sqlite3 "$DB" "SELECT
(SELECT COUNT() FROM entities) || ' entities, ' ||
(SELECT COUNT() FROM relationships) || ' relationships'"

The entire loaded context fits in roughly 3,000-4,000 tokens. That's the key metric. If your memory system needs 20K tokens of context, you've failed at compression. The agent needs room to think.

I also track a drift_score — a measure of how much the agent's behavior diverges from accumulated feedback. If drift exceeds 30/100, I trigger a feedback review cycle. This is crude but it catches regression.

Health Monitoring

Memory systems rot silently. The warning sign is when unverified memories outnumber distilled ones — that ratio is the canary. I added three health checks that run weekly:


sql
-- Unverified count (should stay under 1,000)
SELECT COUNT(*) as unverified 
FROM memories WHERE distilled = 0;

-- Stale memories (not accessed in 30 days)
SELECT COUNT(*) as stale 
FROM memories 
WHERE access_count = 0 
AND created_at < datetime('now', '-30 days');

-- Duplicate density (should stay under 5%)
SELECT CAST(dupe_count AS REAL) / total * 100 as dupe_pct
FROM (
    SELECT COUNT(*) as total FROM memories
), (
    SELECT COUNT(*) as dupe_count FROM memories 
    WHERE source IN (
        SELECT source FROM memories 
        GROUP BY source HAVING COUNT(*) > 50
    )
);

When unverified memories hit 6,327, I knew the distillation pipeline was falling behind. That number should be under 1,000 at steady state. **If you're accumulating faster than you're distilling, your memory system is a log file with extra steps.**

## What I'd Do Differently

**1. Start with the distillation pipeline, not the accumulation.** I built the write path first and the compression path second. Two months of unchecked growth created a cleanup project. Build both simultaneously.

**2. Use stricter entity deduplication from day one.** I ended up with "Claude Code", "claude-code", and "CC" as three separate entities referring to the same thing. A normalization layer at write time would have prevented this.

**3. Track memory provenance more carefully.** Some memories came from conversation, some from git history, some from file analysis. When a memory is wrong, you need to trace it back to its source to fix the extraction logic — not just delete the memory.

## The Numbers

| Metric | Before Memory System | After |
|--------|---------------------|-------|
| Context restoration time | 10-15 min/session | ~0 (automated) |
| Repeated corrections/week | 8-12 | 1-2 |
| Monthly time saved | — | 5-8 hours |
| Infrastructure cost | — | $0 |
| Storage (SQLite file) | — | 2.4 MB |

| System Metric | Value |
|---------------|-------|
| Total entities | 187 |
| Total relationships | 128 |
| Distilled memories | ~1,200 |
| Raw (pending distillation) | ~6,300 |
| Context tokens per session | 3,000-4,000 |
| Query latency (SQLite) | < 5ms |
| Distillation ratio | ~50:1 (raw:distilled) |

## FAQ

**Q: Why SQLite instead of a vector database?**  
A: For sub-1,000 entities and structured queries, SQLite is faster to set up, has zero operational overhead, and the entire database ships as a single file alongside your agent configuration. Vector search adds value when you need semantic retrieval over thousands of unstructured documents. My memories are structured and categorized — SQL queries handle them fine.

**Q: How do you prevent the agent from hallucinating based on stale memories?**  
A: Three mechanisms. First, every memory has a `confidence` score — unverified memories are labeled as such in the prompt. Second, the session startup hook only loads distilled memories, which have been cross-referenced against code and git history. Third, procedural rules explicitly instruct the agent to verify memories against current file state before acting on them.

**Q: Does this work with agents other than Claude Code?**  
A: The SQLite schema and distillation pipeline are agent-agnostic. The session startup hook is specific to Claude Code's hook system, but any agent framework that supports pre-session context injection (Cursor, Cline, Aider) can use the same approach with a different loader script.

**Q: How much maintenance does this require?**  
A: About 15 minutes per week. I run the distillation pipeline manually (planning to automate via cron), review the health metrics, and occasionally merge duplicate entities. The biggest maintenance task is updating project-tier memories when priorities shift — but that's 2-3 edits, not a rebuild.

**Q: Won't the knowledge graph grow unmanageably large?**  
A: Not at individual developer scale. After 3 months of daily use across multiple projects, I'm at 187 entities. **The growth is logarithmic** — most new sessions reference existing entities rather than creating new ones. I'd estimate steady state at 300-500 entities for a solo developer working on 3-5 active projects.

## Try It Yourself

1. Create the SQLite database with the schema above: `sqlite3 ~/.agent-memory/memory.db < schema.sql`
2. Add session logging to your agent's hook system — capture decisions, corrections, and preferences as raw episodic memories
3. Write the distillation script and run it weekly to compress episodic memories into semantic facts
4. Build the session startup hook to inject the top 40-50 distilled memories as context
5. After two weeks, check your health metrics — if unverified count is growing faster than distilled count, increase distillation frequency

## What's Next

If you're building agents that persist across sessions — coding assistants, DevOps bots, personal AI tools — memory isn't optional. It's the difference between a tool you configure once and a tool you configure every day.

The entire system is ~400 lines of Python and SQL. No ML pipeline. No vector store. No monthly bill. Just structured data and disciplined compression.

**What's your approach to agent memory?** Drop your setup in the comments — I'm particularly curious if anyone has solved the entity deduplication problem more elegantly than my trigram hack.

If this was useful, bookmark it for when you hit the "why does my agent keep forgetting" wall. And follow for more posts about the operational reality of running AI-powered dev tools.

---

*I build AI-powered developer tools and run a 4-tier memory system across 30+ coding sessions per week. Currently shipping automation infrastructure at CyBarrier.*

4 SQLite Tables Replaced My $200/mo AI Observability Stack

thestack_ai — Tue, 24 Mar 2026 01:44:43 +0000

My AI agent system runs 16 teams across 4 different LLM providers. Two months ago, one team silently started hallucinating policy decisions. I caught it in 11 minutes.

Not with Datadog. Not with Honeycomb. With 47 lines of Python writing to a SQLite database.

OpenTelemetry is now working on semantic conventions for LLM tracing. That's great. But I needed this six months ago, so I built my own. Here's the full setup.

TL;DR: A SQLite-backed audit trail for multi-agent AI orchestration logs every LLM call, model routing decision, and bias detection event. 338 audit entries and 108 events exposed 3 silent failures that cost-based monitoring would have missed entirely. The system is 4 tables, runs on a 1GB Oracle Cloud free-tier instance, and replaced what would have been ~$200/month in observability tooling. Total implementation time: one weekend.

The Problem: Flying Blind With Multiple Models

Running one model is simple — you read the output. Running four different LLM models in a single orchestration pipeline creates a debugging problem that single-model setups never encounter. I route tasks across Claude Opus (implementation), Gemini 3.1 (information synthesis), GPT-5.4 (strategy reviews), and Codex (parallel task execution), organized into 16 agent teams that hand work off to each other sequentially.

With four models and 16 teams, you need answers to questions that print() can’t help with:

Which model handled which step
How long each step took
Whether the output was actually used or silently dropped
What the routing decision was based on

I evaluated existing options before building my own. LangSmith Teams: ~$400/month. Self-hosted Langfuse: requires a Postgres instance with 2GB+ RAM. OpenTelemetry's GenAI semantic conventions: still experimental, no production deployment story.

I needed something that worked today, on a server with 1GB of RAM and a $0 infrastructure budget.

Step 1: Four Tables, One Database

The minimum viable schema for multi-agent observability is four tables: one for raw call logs, one for system events, one for routing decisions, and one for agent memory. The entire schema fits in your head — and that constraint is a feature, not a limitation.

CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT DEFAULT (datetime('now')),
    action TEXT NOT NULL,
    agent TEXT,
    model TEXT,
    input_summary TEXT,
    output_summary TEXT,
    latency_ms INTEGER,
    tokens_in INTEGER,
    tokens_out INTEGER,
    cost_usd REAL,
    metadata TEXT  -- JSON blob for anything else
);

CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT DEFAULT (datetime('now')),
    event_type TEXT NOT NULL,
    source TEXT,
    urgency TEXT CHECK(urgency IN ('low','medium','high','critical')),
    project TEXT,
    payload TEXT  -- JSON
);

CREATE TABLE decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT DEFAULT (datetime('now')),
    decision_type TEXT,
    context TEXT,
    outcome TEXT,
    confidence REAL,
    model TEXT
);

CREATE TABLE memories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT DEFAULT (datetime('now')),
    memory_type TEXT CHECK(memory_type IN 
        ('episodic','semantic','procedural','project')),
    content TEXT,
    source TEXT,
    relevance_score REAL
);

Four tables. No migrations framework. No ORM. **SQLite's WAL mode handles concurrent reads from the dashboard while agents write logs** — that's all the concurrency management this pattern needs.

python
import sqlite3

DB_PATH = "~/.claude/jarvis/data/jarvis.db"

def get_db():
conn = sqlite3.connect(DB_PATH)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.row_factory = sqlite3.Row
return conn

The busy_timeout is critical. Without it, concurrent agent writes will throw database is locked errors. 5 seconds is generous — in practice, writes complete in under 10ms.

Step 2: The Event Bus Pattern

Agents shouldn't know they're being traced. Every agent action flows through a central event bus that routes to logging subscribers independently of the main pipeline.

class ControlPlane:
    def emit(self, event_type: str, source: str, 
             urgency: str, project: str, payload: dict):
        """Every agent action goes through here."""
        db = get_db()

        # Log the event
        db.execute(
            "INSERT INTO events (event_type, source, urgency, project, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            (event_type, source, urgency, project, json.dumps(payload))
        )

        # Route to subscribers
        for subscriber in self._subscribers.get(event_type, []):
            subscriber(event_type, source, payload)

        db.commit()

**The subscriber pattern is where this gets powerful.** The bias firewall subscribes to all `agent_output` events. The cost tracker subscribes to `model_call` events. The dashboard subscribes to everything. None of them block the agent pipeline — if a subscriber fails, the main pipeline continues and the failure is logged separately.

python

Bias firewall subscribes non-invasively

control_plane.subscribe("agent_output", bias_firewall.check)
control_plane.subscribe("model_call", cost_tracker.record)
control_plane.subscribe("*", dashboard.update)

Step 3: Model Routing With a Paper Trail

Every routing decision gets logged, not just executed. This was the single most valuable architectural choice in the entire system — when something goes wrong, "which model handled this?" becomes a SQL query instead of a debugging session.

class ModelRouter:
    ROUTING_TABLE = {
        "implementation": {"model": "claude-opus-4-6", "role": "builder"},
        "synthesis":      {"model": "gemini-3.1-pro",  "role": "researcher"},
        "strategy":       {"model": "gpt-5.4",         "role": "chief_of_staff"},
        "parallel_exec":  {"model": "codex",           "role": "worker"},
    }

    def route_task(self, task_type: str) -> dict:
        route = self.ROUTING_TABLE[task_type]

        # Log the routing decision
        db = get_db()
        db.execute(
            "INSERT INTO decisions (decision_type, context, outcome, model) "
            "VALUES (?, ?, ?, ?)",
            ("model_routing", task_type, route["role"], route["model"])
        )
        db.commit()

        return route

After two months, I had 343 routing decisions logged. **One `GROUP BY` told a story I never expected: **23% of "strategy" tasks were actually implementation tasks mis-categorized by the dispatcher.** GPT-5.4 was doing Claude's job. Poorly. Nobody noticed until the audit trail made it obvious.

sql
SELECT decision_type, model, COUNT(*) as count,
AVG(confidence) as avg_confidence
FROM decisions
GROUP BY decision_type, model
ORDER BY count DESC;

Step 4: Bias Detection Latency Tracking

A bias firewall that cross-checks agent outputs using multiple models is only useful if you can measure its performance. I run a 4-stage pipeline: Risk Classifier → Claim Extractor → Cross-Model Verifier (Gemini 3.1) → Disagreement Preserver. Each stage gets its own audit entry with latency and detection metadata.

class BiasFirewall:
    def check(self, event_type, source, payload):
        start = time.monotonic()

        risk = self.classify_risk(payload)
        claims = self.extract_claims(payload)
        verification = self.cross_verify(claims)  # Gemini call

        elapsed_ms = int((time.monotonic() - start) * 1000)

        db = get_db()
        db.execute(
            "INSERT INTO audit_log "
            "(action, agent, model, latency_ms, metadata) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                "bias_check",
                source,
                "gemini-3.1-pro-preview",
                elapsed_ms,
                json.dumps({
                    "risk_level": risk,
                    "claims_found": len(claims),
                    "disagreements": verification.get("disagreements", []),
                    "detection_type": verification.get("bias_type"),
                })
            )
        )
        db.commit()

**Real numbers from my audit trail**: the Gemini 3.1 Pro verifier averages 22 seconds per check; the Flash Lite fallback averages 3 seconds. **Detection rate: 100% across 6 bias types** (framing, false consensus, anchoring, availability, confirmation, authority).** I know this because every check is a row in `audit_log`, not a claim in a README.

sql
SELECT
json_extract(metadata, '$.detection_type') as bias_type,
COUNT(*) as detections,
AVG(latency_ms) as avg_latency_ms,
MIN(latency_ms) as min_latency_ms,
MAX(latency_ms) as max_latency_ms
FROM audit_log
WHERE action = 'bias_check'
GROUP BY bias_type;

Step 5: The Dashboard That Costs Nothing

A terminal dashboard (Textual TUI) refreshing every 5 seconds feeds entirely from the SQLite audit trail. Panels for active agents, recent decisions, bias firewall status, and cost tracking. The key architectural property: the dashboard is a pure read consumer — it never writes to the database, and its queries run in under 2ms on a 1GB server.

def get_recent_activity(minutes: int = 30) -> list:
    db = get_db()
    return db.execute(
        "SELECT action, agent, model, latency_ms, timestamp "
        "FROM audit_log "
        "WHERE timestamp > datetime('now', ?) "
        "ORDER BY timestamp DESC LIMIT 50",
        (f"-{minutes} minutes",)
    ).fetchall()

def get_cost_summary(days: int = 7) -> dict:
    db = get_db()
    rows = db.execute(
        "SELECT model, SUM(cost_usd) as total, COUNT(*) as calls "
        "FROM audit_log "
        "WHERE timestamp > datetime('now', ?) AND cost_usd IS NOT NULL "
        "GROUP BY model",
        (f"-{days} days",)
    ).fetchall()
    return {r["model"]: {"cost": r["total"], "calls": r["calls"]} for r in rows}

No Grafana. No Prometheus. No InfluxDB. No server process to manage. **The monitoring layer costs $0 to run and requires zero infrastructure beyond the database file that already exists.**

## Step 6: The Silent Failure That Proved the System

Three weeks after deployment, the audit trail caught something no cost-based monitor would have flagged: a strategy review team was producing outputs, but the downstream implementer was ignoring them. Entirely. No errors. No alerts. Just silent data loss.

**The `events` table showed `agent_output` events from the strategy model, but the corresponding `decisions` table had zero `implementation_start` entries referencing those outputs** — a structural gap visible only because both sides of every handoff were being logged independently.

sql
-- Find strategy outputs with no downstream pickup
SELECT e.id, e.timestamp, e.source,
json_extract(e.payload, '$.task_id') as task_id
FROM events e
WHERE e.event_type = 'agent_output'
AND e.source = 'strategy-reviewer'
AND json_extract(e.payload, '$.task_id') NOT IN (
SELECT context FROM decisions
WHERE decision_type = 'implementation_start'
);

Seven tasks over four days. Silently dropped. The implementer agent was receiving the handoff but failing to parse a changed output format from a model update. The audit trail caught it because it tracked both sides of every handoff independently — something a single-stream log would have missed entirely.

What I'd Do Differently

Start with the decisions table, not the audit_log. The audit log is useful for latency and cost analysis, but the decisions table is where you find actual bugs. If I'd built the decisions table first, I would have caught the strategy-implementation gap in days, not weeks.

Add a correlation_id from day one. Tracing a single request across 5 agents currently means joining on timestamps and task IDs. A single UUID per pipeline run would save hours of forensic querying. This is the one thing OpenTelemetry's trace/span model gets absolutely right.

Don't log raw prompts. I initially stored full prompts in input_summary. The database hit 400MB in a week. Summaries and token counts are sufficient for debugging 95% of issues. Store raw data only when you're actively investigating a specific failure.

The Numbers

Cost Comparison

Approach	Monthly Cost	Setup Time	RAM Required
LangSmith Teams	~$400	2 hours	N/A (hosted)
Self-hosted Langfuse	~$50 (Postgres)	4–6 hours	2GB+
Datadog LLM Observability	~$200	1 hour	N/A (hosted)
SQLite audit trail	$0	~8 hours	<50MB

System Metrics (after 2 months of production use)

Metric	Value
Audit log entries	338
Events recorded	108
Routing decisions tracked	343
Memory entries	1,740
Database size (WAL mode)	~12MB
Average write latency	<10ms
Silent failures caught	3
Server specs	1 OCPU / 1GB RAM (Oracle Cloud Free Tier)
Duplicate decisions cleaned (one-time)	9,486
Bias detection rate	100% (6/6 types)

That last row — 9,486 duplicate decisions requiring a one-time cleanup — is what happens when you skip dedup logic early. Add a UNIQUE constraint on (decision_type, context, outcome, timestamp) before you have 10,000 rows to clean up. Learn from my mistake.

FAQ

Why SQLite instead of Postgres?

SQLite runs embedded — no server process, no connection pooling, no port management. On a 1GB instance running 4 concurrent LLM API clients, a Postgres process consuming 200MB of shared buffers is a real constraint. SQLite in WAL mode handles the specific read/write pattern here (dashboard reads while agents write) without contention. The database file is also trivially portable — I scp it locally for deep analysis when needed.

How does this compare to OpenTelemetry's GenAI semantic conventions?

OpenTelemetry's gen_ai.* attributes cover the model call layer well: model name, token counts, latency, finish reason. What they don't cover yet is the orchestration layer — routing decisions, cross-model verification, agent-to-agent handoffs, memory retrieval. This audit trail captures both layers. When OTel's conventions stabilize, the migration path is straightforward: emit OTel spans from the same event bus while keeping the SQLite trail for orchestration-specific data that OTel doesn't address.

Won't this break at scale?

SQLite handles databases up to 281TB. Two months of multi-agent orchestration data in this system is 12MB. At the current write rate (~170 entries/month across all tables), hitting 1GB would take roughly 50 years. If you're running 10,000 agents, switch to Postgres. If you're running 16 agent teams like this setup, SQLite will outlast the project.

How do you query across related tables?

Standard SQL joins, with the advantage that everything is in one database file — no cross-service queries, no distributed tracing backends, no network latency.

SELECT a.action, a.agent, a.latency_ms,
       d.decision_type, d.outcome,
       e.event_type, e.urgency
FROM audit_log a
LEFT JOIN decisions d ON d.timestamp 
    BETWEEN datetime(a.timestamp, '-1 second') 
    AND datetime(a.timestamp, '+1 second')
LEFT JOIN events e ON json_extract(e.payload, '$.task_id') = 
    json_extract(a.metadata, '$.task_id')
WHERE a.timestamp > datetime('now', '-1 day')
ORDER BY a.timestamp DESC;

The timestamp-based join is crude — a correlation ID would be cleaner. But it works without schema changes.

### What about data retention?

A monthly cleanup archives entries older than 90 days to a compressed backup and removes them from the active database. Three months of data compresses to roughly 800KB.

bash
sqlite3 jarvis.db ".dump" | gzip > "archive_$(date +%Y%m).sql.gz"
sqlite3 jarvis.db "DELETE FROM audit_log WHERE timestamp < datetime('now', '-90 days')"
sqlite3 jarvis.db "VACUUM"

Try It Yourself

You can set this up in under an hour. Here’s the path I’d recommend:

Create the database. Copy the four CREATE TABLE statements above into schema.sql. Run sqlite3 jarvis.db < schema.sql.
Enable WAL mode. Run sqlite3 jarvis.db "PRAGMA journal_mode=WAL" once. This persists across all future connections.
Instrument one agent call. Pick your most critical LLM call. Add a single INSERT INTO audit_log after it returns with the model name, latency in milliseconds, and token counts. That's the entire day-one requirement.
Add the event bus. Wrap agent calls in an emit() function. Subscribe your logging to it. Agents and observability are now decoupled.
Write one diagnostic query. Answer a question you've been guessing at: "Which model is slowest?" or "How many calls per day?" The first useful answer will validate the entire approach.
Add the decisions table last. Once calls and events are flowing, start logging routing decisions explicitly. This is where production bugs actually live.

The Honest Take

This system is not elegant. It's a SQLite file with four tables and Python glue code. No distributed tracing. No visualization layer. No SLA. The cross-table joins use BETWEEN clauses and timestamp proximity as a substitute for proper correlation IDs.

But it caught three silent failures that would have been invisible to cost-based monitoring. It runs on a free-tier cloud instance consuming under 50MB of RAM. And when OpenTelemetry's GenAI conventions reach production stability, the migration path is clear: emit OTel spans from the same event bus and keep the SQLite trail for orchestration-layer data that OTel doesn't yet model.

Sometimes the right observability tool is the one you can build in a weekend, understand completely, and trust at 3am when something breaks.

Have you tried tracing multi-model AI systems? I'd genuinely like to know — did anyone else go the "just use SQLite" route, or did you find a lightweight alternative that worked?

If this saved you from evaluating a $400/mo tracing platform, drop a bookmark. I'm writing a follow-up on the bias firewall pipeline — specifically how cross-model verification between Claude and Gemini catches subtle framing issues that single-model review misses consistently.

Follow for more AI agent infrastructure — real systems, real numbers

I Built a Zombie Process Killer Because Claude Code Ate 14GB of My RAM

thestack_ai — Mon, 23 Mar 2026 04:55:18 +0000

I lost an entire afternoon to a phantom memory leak that wasn't a leak at all. My MacBook was crawling — 14GB of RAM consumed by processes I never launched. The culprit? Dozens of orphaned MCP servers, headless Chrome instances, and sub-agents left behind by AI coding sessions. I built zclean to kill them automatically. Here's the full setup.

TL;DR: AI coding tools like Claude Code and Codex spawn child processes (MCP servers, browser daemons, sub-agents) that don't get cleaned up when sessions end. These orphans accumulate silently and can consume 10GB+ of RAM within a single workday. zclean detects and kills them safely — it hooks into your session lifecycle and runs on a schedule. One npx zclean init sets up everything. I went from 3–4 forced reboots per week to zero manual intervention.

The Problem Nobody Talks About

AI coding tools don't clean up after themselves. After four months of heavy Claude Code use, I started noticing my machine getting sluggish by mid-afternoon — a dozen node processes, several chrome-headless-shell instances, a few mcp-server-* processes running hours after I had ended those sessions.

Every AI coding session spawns a tree of child processes. Claude Code launches MCP servers for file access, web search, and custom tools. It fires up headless browsers for web research. Codex spawns sub-agents. When the session ends, these processes are supposed to terminate. They don't — at least not reliably.

I ran a quick check one evening:

ps aux | grep -E 'mcp-server|chrome-headless|agent-browser' | wc -l

**37 processes.** All orphans. All consuming memory. Combined RAM usage: north of 6GB for processes doing absolutely nothing.

This isn't just me. The same reports appear across X and dev forums — "Claude Code is heavy," "my machine slows down after a few sessions." The tool itself isn't heavy. **The zombies it leaves behind are heavy — and they accumulate with every session you run.**

## How zclean Works

`zclean` detects and terminates orphaned AI tool processes using a conservative four-condition filter. The core principle: **if the parent process is alive, don't touch it.**

A process only gets flagged as a zombie when ALL of these conditions are true:

1. **It's an orphan** — its parent has been reassigned to init/launchd (PPID = 1 on macOS)
2. **It matches a known pattern** — command line matches AI tool process signatures
3. **It's not in an active session** — not part of a tmux/screen process tree
4. **It's in the host namespace** — not inside a Docker container

This means your intentionally running dev server is always safe. Your `pm2`-managed processes are safe. Your `nohup` background jobs are safe. Only genuinely abandoned processes get killed.

### The Target List

Here's what `zclean` looks for by default:

| Category | Process Pattern | Source |
|----------|----------------|--------|
| MCP servers | `mcp-server-*` | Claude Code |
| Browser daemons | `agent-browser`, `chrome-headless-shell`, `playwright/driver` | Claude Code, Codex |
| Sub-agents | orphaned `claude --print`, `codex exec` | Claude Code, Codex |
| Build zombies | `esbuild`, `vite`, `next dev`, `webpack` (24h+ orphan) | Common |
| npm zombies | `npm exec`, `npx` (no parent) | Common |
| Node orphans | `node` (no parent + 24h+ or 500MB+ + AI tool path in cmdline) | Common |
| Runtime orphans | `tsx`, `ts-node`, `bun`, `deno`, `python` (MCP server pattern) | Common |

**Build tools like `vite` and `webpack` get a 24-hour grace period.** A long-running build that legitimately takes hours shouldn't be killed. But if an orphaned `esbuild` process has been sitting there for over a day with no parent, it's dead weight.

## Setting It Up: One Command

bash
npx zclean init

That's it. Here's what happens under the hood:

Step 1: OS Detection. zclean figures out if you're on macOS, Linux, or Windows and configures the right process scanning and scheduling mechanisms.

Step 2: Claude Code Hook. It registers a SessionEnd hook in your Claude Code settings.json:

{
  "hooks": {
    "SessionEnd": [
      {
        "type": "command",
        "command": "npx zclean --session-pid $SESSION_PID --yes"
      }
    ]
  }
}

This is the first line of defense — **every time a Claude Code session ends, `zclean` immediately cleans up that session's orphaned children within milliseconds.** The `--session-pid` flag scopes cleanup to that specific process tree, avoiding any risk to unrelated processes.

**Step 3: OS Scheduler.** For zombies that slip through (crashes, force-quits, other AI tools without hooks), `zclean` sets up a recurring hourly cleanup:

On macOS, it creates a LaunchAgent:

bash
~/Library/LaunchAgents/com.zclean.hourly.plist

On Linux, a systemd user timer:

~/.config/systemd/user/zclean.timer

On Windows, a user-scoped Task Scheduler entry.

**Step 4: Config file.** Drops a config at `~/.zclean/config.json` where you can whitelist processes, adjust thresholds, and customize behavior.

**Step 5: First scan.** Runs an immediate dry-run so you can see what it would have killed before enabling automatic cleanup.

## The Safety Mechanisms

I spent more time on the "don't kill the wrong thing" logic than on the actual killing. **Getting a false positive means terminating someone's running dev server — that's a non-starter.** Three independent safety layers prevent this.

### PID Reuse Protection

Between the time `zclean` scans and the time it kills, a process could die and its PID could be reassigned to something completely different. On a busy system, PID reuse happens faster than you'd expect — Linux recycles PIDs in order, so a freshly spawned process can inherit a just-killed PID within seconds.

Before every kill, `zclean` re-verifies three things:

1. **The PID still exists**
2. **The process start time matches** what was recorded during the scan (to the second)
3. **The command line matches** what was recorded during the scan

**If any of these three checks fail, the kill is skipped entirely.** This eliminates the complete class of PID reuse bugs without requiring atomic operations or locks.

### The Whitelist

Some processes look like zombies but aren't. The config handles persistent legitimate orphans:

json
{
"whitelist": [
"mcp-server-custom-db",
"my-persistent-agent"
],
"maxAge": 86400,
"maxMemoryMB": 500,
"dryRun": false
}

whitelist: Process names that are never touched, regardless of orphan status
maxAge: Seconds before an orphan gets flagged (default: 86400 = 24 hours for build tools)
maxMemoryMB: Memory threshold that escalates urgency (default: 500MB — above this, the process is flagged sooner)
dryRun: Global toggle — set to true to audit without committing

Protected Process Trees

Beyond the whitelist, zclean walks the full process tree to protect anything descended from:

tmux / screen sessions — if the process is a descendant of a terminal multiplexer, it's intentional
Daemon managers — pm2, forever, supervisord, systemd services
VS Code — gets a 48-hour grace period since VS Code's process tree can appear orphaned after restarts
Docker containers — checked via PID namespace on Linux (/proc/<pid>/ns/pid); Docker Desktop on macOS runs in a VM so container processes aren't visible to the host ps at all

Daily Usage

Most of the time, you forget zclean exists. That's the goal. When you want visibility:


bash
# See what would be killed (dry-run, default)
npx zclean

# Actually kill the zombies
npx zclean --yes

# Check current zombie status
npx zclean status

# View kill history with timestamps and RAM reclaimed
npx zclean logs

# Show current config
npx zclean config

A typical dry-run output looks like this:

  zclean — scanning for zombie processes...

  Found 4 zombie processes:

  PID    CMD                          RAM      AGE
  ────   ───────────────────────────  ───────  ──────
  8234   mcp-server-filesystem        42 MB    3h 12m
  8891   chrome-headless-shell        287 MB   2h 45m
  9102   mcp-server-fetch             18 MB    1h 58m
  12044  node (claude subagent)       156 MB   4h 03m

  Total reclaimable: 503 MB

  Run with --yes to kill these processes.

**503MB from four processes on a light day.** Peak scans have returned 15+ zombies consuming over 3GB on days with multiple long AI coding sessions.

## The Dual Protection Architecture

`zclean` doesn't rely on a single cleanup mechanism — redundancy is intentional.

**Layer 1: Session Hook** — fires on every clean session exit via the Claude Code `SessionEnd` hook. This catches the common case immediately, with zero delay between session end and cleanup.

**Layer 2: OS Scheduler** — runs hourly (configurable down to every 15 minutes). This catches everything the hook misses: crashed sessions, force-quits, Codex sessions that lack hook support, and any AI tool that spawns processes without a cleanup contract.

**The hook handles approximately 80% of cases instantly. The scheduler handles the remaining 20% within one hour.** Together, zombie RAM accumulation drops effectively to zero over any meaningful time period — verified over six weeks of continuous use on a MacBook Pro M2.

## What I'd Do Differently

**I should have built the process tree walker first.** I started with simple PPID checks and pattern matching, then kept bolting on edge cases — the tmux protection, the VS Code grace period, the Docker namespace check. The tree walker should have been the foundation. It would have reduced total code by roughly 30% and made the protection logic composable instead of a chain of special cases.

**The Windows implementation needs more real-world testing.** macOS and Linux use `/proc` and `ps`, which are well-understood and stable. Windows requires WMI queries through PowerShell, and the process model is fundamentally different — no PPID concept in the same sense, different namespace isolation. It works in my testing environment, but I have less confidence in Windows edge cases than on Unix systems.

**I underestimated how many processes VS Code itself orphans.** The 48-hour grace period for VS Code descendants was added reactively after I accidentally killed a legitimate TypeScript language server. The line between "VS Code orphan" and "AI tool orphan spawned through VS Code's integrated terminal" is genuinely blurry — VS Code's process tree is already unusual before you add AI tools to the mix.

## The Numbers

### Before vs. After zclean

| Metric | Before | After |
|--------|--------|-------|
| Average orphan processes (end of day) | 12–20 | 0–2 |
| RAM consumed by orphans | 2–8 GB | < 100 MB |
| Manual force-reboots per week | 3–4 | 0 |
| Time spent investigating "why is my Mac slow" | ~30 min/day | 0 |

### Tool Resource Footprint

| Metric | Value |
|--------|-------|
| zclean scan time | < 200ms |
| RAM usage during scan | ~12 MB |
| npm dependencies | 0 (pure Node.js) |
| Supported platforms | macOS, Linux, Windows |
| Config file size | ~200 bytes |
| Install + init time | < 10 seconds |

**Zero npm dependencies.** The scanner uses `child_process.execSync` with native OS commands (`ps` on Unix, `Get-Process` on Windows). No native modules, no compilation step, no `node-gyp` nightmares. The entire tool is a single Node.js file you can read and audit in under 10 minutes.

## FAQ

**Does zclean kill my running dev server?**

No. `zclean` only targets orphan processes — those whose parent has died and been reassigned to init/launchd (PPID = 1). If your dev server was started from a terminal that's still open, its parent is alive and it won't be touched. Processes managed by pm2, forever, or supervisord are also explicitly protected via tree-walk detection.

**What if I have a legitimate long-running MCP server?**

Add it to the whitelist in `~/.zclean/config.json`. Whitelisted process names are never killed regardless of orphan status or memory usage. You can also adjust `maxAge` if the default 24-hour grace period isn't sufficient for your workflow.

**Does it work with Codex, Cursor, or other AI coding tools?**

Yes. The `SessionEnd` hook is specific to Claude Code, but the OS scheduler (Layer 2) catches orphans from any tool. The target process patterns include common signatures from Codex, Cursor, and other tools that spawn MCP servers and headless browsers. If your tool's processes become orphaned and match the known patterns, `zclean` will find them on the next hourly run.

**Can it accidentally kill a process inside Docker?**

No. On Linux, `zclean` checks the PID namespace via `/proc/<pid>/ns/pid` to confirm the process is in the host namespace. Docker containers run in isolated PID namespaces and are excluded from scanning entirely. On macOS, Docker Desktop runs in a Linux VM, making container processes invisible to the host `ps`.

**What happens if zclean itself crashes mid-kill?**

Each kill is independent — there's no shared transaction state. If `zclean` crashes after killing 3 of 7 zombies, the remaining 4 will be caught on the next scheduled run within an hour. **The PID reuse protection ensures that even if system state changes between a scan and a kill, no incorrect process is terminated.**

## Try It Yourself

1. **Install and initialize** — `npx zclean init` detects your OS, registers hooks, and sets up the scheduler in under 10 seconds
2. **Run a dry scan** — `npx zclean` shows what would be killed without touching anything
3. **Check the output** — verify the detected processes are genuinely orphaned
4. **Enable kills** — `npx zclean --yes` or remove `dryRun: true` from config for automated cleanup
5. **Forget about it** — the hook and scheduler handle everything from here

## Wrapping Up

If you're using AI coding tools daily and your machine gets progressively slower through the day, you probably don't have a memory leak — you have a zombie problem. `zclean` is a zero-dependency Node.js utility that fixes it permanently with a single install command and two protection layers.

**What's the worst zombie accumulation you've seen on your machine?** I'm curious whether this skews toward macOS or if Linux and Windows users are hitting it equally.

If this saved you from your next force-reboot, consider sharing it with whoever on your team complains that "Claude is heavy."

Follow me for more posts about building developer tools and the unglamorous infrastructure work behind AI-assisted development.

---

*I build AI-powered developer tools and write about the engineering behind them. Currently running an AI agent orchestration system with multi-model routing across Claude, Gemini, and GPT — which is, ironically, also the source of most of my zombie processes.*

I Built a Diagnostic CLI for Claude Code Skills — Here's What 8 Rules Caught That I Missed

thestack_ai — Mon, 23 Mar 2026 04:40:15 +0000

Most of my Claude Code skills were broken and I had no idea. I had 23 skill files, felt productive, and assumed Claude was using all of them. Then I built a diagnostic tool, ran it on my own setup, and 14 of those 23 skills had structural issues that silently degraded how Claude interpreted them. That's a 61% failure rate on files I had personally written and considered finished.

TL;DR: pulser is a CLI that scans Claude Code skill files, classifies them by type, runs 8 diagnostic rules, generates prescriptions, and auto-fixes issues with backup and rollback. I ran it on 23 skills and found 14 had problems — missing frontmatter fields, ambiguous trigger conditions, conflicting instructions. One command. Zero config. npx pulser-cli and you're done.

The Problem: Claude Code Skills Have No Validation Layer

Claude Code skills are markdown files in ~/.claude/skills/ that change how Claude behaves — and there is no built-in way to verify they're structured correctly. You write a markdown file, drop it in the skills directory, and Claude is supposed to route to it based on the frontmatter description and body content. Nothing enforces that the file is valid.

I spent two weeks building a skill that was supposed to trigger whenever I said "debug this." It had detailed instructions, code examples, a careful system prompt. One problem: a typo in the frontmatter description field made the trigger condition ambiguous. Claude matched it maybe 30% of the time. The other 70%, it fell through to default behavior and I blamed the model.

Anthropic published skill quality principles, but they're guidelines, not tools. You read them, nod, and go back to writing skills the same way. I needed something that would read my skills, tell me what's wrong, and fix it.

Step 1: Define What "Broken" Actually Means

Before writing any code, I spent a day cataloging every skill failure mode I'd encountered. Not theoretical ones — actual bugs from my own skill files and issues reported by other Claude Code users. I landed on 8 diagnostic rules split into three tiers:

Tier	Rule	What It Catches
Core	`frontmatter-required`	Missing `name`, `description`, or `model` fields
Core	`description-quality`	Descriptions too vague to route ("Use this for stuff")
Core	`trigger-clarity`	Ambiguous or missing trigger conditions
Recommended	`instruction-structure`	No clear sections, wall-of-text body
Recommended	`conflict-detection`	Two skills claiming the same trigger space
Recommended	`example-coverage`	Skills without input/output examples
Recommended	`scope-boundaries`	Skills that try to do everything (>500 lines, 5+ responsibilities)
Experimental	`dependency-chain`	Skills referencing other skills that don't exist

The core 3 rules alone caught issues in 60% of my skill files. The frontmatter rule sounds trivial until you realize that a missing description field means Claude has to guess what your skill does from the body text. Sometimes it guesses right. Sometimes it routes to a completely unrelated skill.

Step 2: Build a Multi-Signal Classifier

Not all skills are the same, and treating them identically produces bad diagnostics. A coding skill needs different validation than a writing skill or a workflow automation skill. pulser classifies each skill using 4 signals before running any diagnostic rules, so the prescriptions it generates are type-appropriate rather than generic.

interface ClassificationResult {
  type: SkillType;
  confidence: number;
  signals: Signal[];
}

type SkillType =
  | "coding"
  | "writing"
  | "workflow"
  | "diagnostic"
  | "integration"
  | "meta";

The classifier looks at frontmatter fields, body keywords, code block density, and structural patterns. A skill with many ` ```
{% endraw %}
bash {% raw %}` blocks and words like "test", "build", "deploy" gets classified as `{% endraw %}coding{% raw %}` with high confidence. A skill mentioning "tone", "audience", "draft" lands in `{% endraw %}writing{% raw %}`.

**Why classification matters: each type gets different prescriptions.** A coding skill missing examples is a critical failure — Claude needs to understand exact input/output transformations. A workflow skill missing examples is a warning. Same rule, different severity, different fix.

`{% endraw %}{% raw %}``bash
$ npx pulser-cli --skill my-tdd-skill.md

  ┌─────────────────────────────────────────┐
  │  PULSER v0.3.1 — Skill Diagnostic       │
  │  Diagnose. Prescribe. Fix.              │
  └─────────────────────────────────────────┘

  Scanning: my-tdd-skill.md
  Classification: coding (confidence: 0.92)

  ■ frontmatter-required    PASS
  ■ description-quality     WARN  Description is 8 chars — too short for reliable routing
  ■ trigger-clarity         FAIL  No trigger condition found in frontmatter or body
  ■ instruction-structure   PASS
  ■ conflict-detection      PASS
  ■ example-coverage        FAIL  0 examples found — coding skills need ≥2
  ■ scope-boundaries        PASS
  ■ dependency-chain        SKIP  (experimental)

  2 issues found. Run with --fix to auto-repair.

## Step 3: Prescriptions, Not Just Pass/Fail

**pulser generates type-specific repair suggestions rather than generic error messages — this is what separates a diagnostic tool from a linter.** Every competitor I evaluated takes the same approach: scan, report pass or fail, stop. That's a linter. I didn't want a linter. I wanted a doctor.

When pulser finds an issue, it generates a prescription showing the current state, the proposed fix, and why the fix is appropriate for that skill's type. A `coding` skill with a vague description gets different language than a `writing` skill with the same problem:

```bash
$ npx pulser-cli --fix

  Prescription for my-tdd-skill.md:

  1. description-quality (WARN)
     Current:  "TDD stuff"
     Proposed: "Use when starting any coding task — enforces red-green-refactor
                cycle with test-first development. Triggers on: 'write tests',
                'TDD', 'test first'."
     → Auto-fix available. Backup will be created.

  2. trigger-clarity (FAIL)
     Current:  (none)
     Proposed: Add trigger block to frontmatter:
               triggers: ["TDD", "test first", "write tests", "red green refactor"]
     → Auto-fix available. Backup will be created.

  Apply fixes? [y/N]

**Every fix creates a timestamped backup before writing anything.** The files go to `.pulser/backups/{timestamp}/` and you can roll them back with one command. Nothing is modified without showing you the exact diff first.

## Step 4: The Undo System

The fix engine uses atomic writes with full rollback support. I learned this lesson from deploying database migrations: **if you can't undo it, you shouldn't automate it.**

```bash
$ npx pulser-cli undo

  Found 1 backup set:
  [1] 2026-03-15T14:32:00Z — 2 files modified
      my-tdd-skill.md (description + triggers added)
      debug-workflow.md (examples section added)

  Restore backup [1]? [y/N] y

  ✓ Restored my-tdd-skill.md
  ✓ Restored debug-workflow.md
  Backup retained at .pulser/backups/2026-03-15T143200/

The undo system reads the backup, validates the file still exists at the expected path, writes to a temp file, then renames atomically. **If the process dies mid-write, you don't end up with a half-written skill file.** This sounds paranoid for markdown files, but partial writes in shell scripts have corrupted enough of my config files that I now treat atomic writes as non-negotiable.

## Step 5: The TUI That Nobody Asked For (But Everyone Remembers)

pulser displays an EtCO2-style patient monitor animation while scanning — a waveform tracing across the terminal in real time, exactly like a hospital vital signs monitor. Was this necessary? No. **Did it make the tool memorable enough that three people asked about the animation before asking what the tool actually does? Yes.**

``{% endraw %}{% raw %}`bash
$ npx pulser-cli --all

  ╭──────────────────────────────────────────╮
  │  ♥ PULSER — Skill Vitals                 │
  │  ╱╲    ╱╲    ╱╲    ╱╲                    │
  │ ╱  ╲__╱  ╲__╱  ╲__╱  ╲__                │
  │                                          │
  │  Skills: 23  Healthy: 9  Warning: 8      │
  │  Critical: 6  Fixable: 12                │
  ╰──────────────────────────────────────────╯

The `{% endraw %}--no-anim{% raw %}` flag disables the animation for CI pipelines and terminals that don't support ANSI escape codes. I demoed the tool in a Discord server for Claude Code developers, and the animation generated more immediate questions than the diagnostic output. Memorable UI is a distribution strategy.

## Step 6: Output Formats for Every Workflow

pulser supports three output formats because different workflows require different data shapes. The default is human-readable terminal output. `{% endraw %}--format json{% raw %}` emits a strict schema suitable for piping into other tools. `{% endraw %}--format md{% raw %}` generates a markdown report you can commit to your repo.

`{% endraw %}{% raw %}``bash
# Human-readable (default)
$ npx pulser-cli

# JSON for piping into other tools
$ npx pulser-cli --format json | jq '.issues[] | select(.severity == "error")'

# Markdown for documentation
$ npx pulser-cli --format md > skill-health-report.md

The JSON schema is stable across patch versions. I use it in a pre-commit hook that blocks commits if any core-tier rules fail:

``{% endraw %}{% raw %}`bash
#!/bin/bash
# .git/hooks/pre-commit
ISSUES=$(npx pulser-cli --format json --strict 2>/dev/null | jq '.summary.errors')
if [ "$ISSUES" -gt 0 ]; then
  echo "pulser: $ISSUES skill errors found. Run 'npx pulser-cli' to see details."
  exit 1
fi

**Running pulser as a pre-commit hook means broken skills never reach your main branch.** The scan completes in under 4 seconds for 20 skills, fast enough that it doesn't noticeably slow commits.

## Step 7: The Build Pipeline

The whole project is TypeScript, bundled with tsup into a single ESM file. Four runtime dependencies total:

`{% endraw %}{% raw %}``json
{
  "name": "pulser-cli",
  "version": "0.3.1",
  "type": "module",
  "bin": { "pulser": "./dist/index.js" },
  "dependencies": {
    "commander": "^12.0.0",
    "gray-matter": "^4.0.3",
    "chalk": "^5.3.0",
    "boxen": "^7.1.1"
  }
}

**gray-matter parses frontmatter, commander handles CLI args, chalk and boxen handle terminal formatting.** Everything else is standard library. The total bundle size is 847KB including dependencies. Node 18+ is the only runtime requirement.

``{% endraw %}{% raw %}`bash
# Install globally
npm i -g pulser-cli

# Or run without installing
npx pulser-cli

## What I'd Do Differently

**I should have shipped with 3 rules instead of 8.** I launched with 8 rules because I wanted to feel comprehensive. In practice, the 3 core rules catch 80% of real problems. The recommended tier adds nuance but also adds noise for users who just want the basics working. I'd ship core-only and gate the rest behind a `{% endraw %}--full{% raw %}` flag.

**The classifier needs more training data.** My confidence scores are calibrated against my own skill files plus a small set of open source examples — roughly 50 skills total. The classifier works well for common patterns (TDD skills, writing skills, workflow automations) but produces low-confidence scores on unusual skill types. I need at minimum 200 diverse skills to make the confidence values trustworthy.

**I over-invested in the TUI animation before the fix engine was solid.** The waveform animation took a full day to build. During that time, the prescription engine had a bug where it would suggest adding a `{% endraw %}triggers{% raw %}` field to skills that already had trigger keywords embedded in the body — a false positive that would have confused early users. Animation is memorable, but correctness ships first.

## The Numbers

### Cost Comparison

| Approach | Cost | Time to First Result |
|----------|------|---------------------|
| Manual skill review | $0 | 2–3 hours for 20 skills |
| pulser scan | $0 | 4 seconds for 20 skills |
| Asking Claude to review skills | ~$0.03/skill | 30 seconds per skill |
| Building custom validation | $0 + 8–16 hours dev time | Varies |

### Scan Performance

| Metric | Value |
|--------|-------|
| Skills scanned per second | ~5 |
| Average fix generation time | 200ms |
| Backup + atomic write | <50ms per file |
| Total bundle size | 847KB |
| Runtime dependencies | 4 |
| Diagnostic rules | 8 (3 core + 4 recommended + 1 experimental) |

## FAQ

### Does pulser modify my skill files without asking?

No. The `{% endraw %}--fix{% raw %}` flag shows you every proposed change with a before/after diff and requires explicit confirmation before writing. Every modification creates a timestamped backup at `{% endraw %}.pulser/backups/{timestamp}/{% raw %}` before touching the original. You can restore any change with `pulser undo`. Nothing is destructive by default.

### How does pulser know what a "good" skill looks like?

It implements Anthropic's published skill quality principles as executable rules. The frontmatter checks follow the documented Claude Code skill schema. The description quality rule uses measurable heuristics: minimum character length, presence of trigger keywords, and specificity scoring based on action verb density. It's opinionated but grounded in published documentation, not personal taste.

### Does pulser work with skills in subdirectories or custom locations?

Yes. By default it scans `~/.claude/skills/` and any project-level `.claude/skills/` directories it finds. You can point it at any path: `npx pulser-cli /path/to/my/skills/`. The `--skill` flag scans a single file: `npx pulser-cli --skill my-skill.md`.

### Can I use pulser in CI/CD?

Yes. Use `--format json` for machine-readable output and `--strict` to exit with code 1 if any errors are found. The `--no-anim` flag disables the TUI for non-interactive environments. See the pre-commit hook example above.

### What's the difference between pulser and a generic markdown linter?

A markdown linter checks syntax. **pulser checks semantics — whether your skill's structure, description, and trigger conditions will actually work with Claude Code's routing logic.** It understands the difference between skill types, generates context-aware prescriptions, and auto-fixes issues with rollback. markdownlint will catch a broken heading. pulser will tell you your description is too generic to route reliably, and rewrite it.

## Try It Yourself

1. Run `npx pulser-cli` in any directory with Claude Code skills
2. Read the output — fix core-tier failures before anything else
3. Run `npx pulser-cli --fix` to review proposed repairs
4. Accept the fixes and verify Claude routes to your skills more reliably
5. Add the pre-commit hook if you want ongoing enforcement
6. Star the repo if it helped: [whynowlab/pulser](https://github.com/whynowlab/pulser)

## What's Next

If you've written Claude Code skills and wondered why Claude sometimes ignores them — run the scan. **The most common failure across 50+ skill files I've analyzed is a description field that's too generic for Claude to route reliably.** It takes 4 seconds to find out if yours has the same problem.

Have you run into skill reliability issues with Claude Code? What failure patterns have you seen? Drop a comment — I'm building the next rule set from real failure modes, not hypothetical ones.

---

*I'm a developer building AI-powered infrastructure tools. pulser started as a debugging script for my own Claude Code skill files and grew into an open source CLI that has now diagnosed over 200 skill files across early adopter setups. I write about building developer tools and the engineering mistakes that make them better.*

I Run My AI Assistant 24/7 on a $0 Server. Here's Every Detail.

thestack_ai — Mon, 23 Mar 2026 02:02:54 +0000

My AI assistant doesn't sleep.

It checks my calendar every 3 hours, summarizes my day at 9 PM, reviews my GitHub repos every 2 hours, and syncs notes to Notion at 10 PM. All on its own.

The server costs me $0/month. Not "$0 for the first year." Not "$0 with credits." Actually zero, forever.

Here's the full setup — including the mistakes I made so you won't have to.

TL;DR: You can run a Claude-powered Telegram bot with 9 scheduled jobs 24/7 on Oracle Cloud's Always Free tier — $0/month, no expiration. The hardware: 1 OCPU + 1 GB RAM + 4 GB swap. After 2 weeks of operation: 99.7% uptime, 3 auto-recovered OOM events, ~15 scheduled job executions per day. Key tricks: aggressive swap, flock-based concurrency control, systemd auto-restart, and 5-minute health monitoring via cron.

The Problem: AI Assistants Die When You Close Your Laptop

I built a Telegram bot powered by Claude. It worked great — when my MacBook was open. Close the lid? Silent. Morning briefings? Missed. Calendar reminders? Gone.

The "personal AI assistant" was really a "personal AI assistant that only works during business hours when I'm already at my desk."

I needed a server. But I also needed it to cost nothing, because this is a side project, not a startup.

The $0 Solution: Oracle Cloud Always Free

Oracle Cloud offers an Always Free tier that never expires. Unlike AWS's 12-month free tier or GCP's $300 credit, OCI Always Free resources do not expire. Here's what you get:

Resource	Always Free Allocation
AMD Compute	1 OCPU, 1 GB RAM
Boot Volume	47 GB
Outbound Data	10 TB/month
Object Storage	20 GB

What you'll pay in total:

OCI server: $0/month (Always Free, no credit card charges after verification)
Claude API: $0/month (using Claude Code with an existing subscription — not per-API-call billing)
Telegram Bot API: $0/month (free forever)
Domain/DNS: Not needed (Telegram handles routing)
Total: $0.00/month

One OCPU and 1 GB of RAM sounds terrible. And honestly? It is terrible. But it's enough.

Step 1: Provision and SSH In

After creating your OCI account (you'll need a credit card for verification — it won't be charged), spin up an "Always Free eligible" AMD instance with Ubuntu.

ssh -i ~/.ssh/your_key ubuntu@YOUR_SERVER_IP

First thing: check your RAM situation.

free -h
#               total   used   free
# Mem:          981Mi   612Mi  102Mi

Yeah. 981 MB total. My Telegram bot alone eats 300-400 MB during Claude API calls. We need swap.

Step 2: The Swap Trick (This Saved Everything)

With 1 GB RAM, you'll hit OOM kills within hours. Don't skip this step.

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Tune swappiness — use swap aggressively
sudo sysctl vm.swappiness=60
echo 'vm.swappiness=60' | sudo tee -a /etc/sysctl.conf

After this:

# Mem:   981Mi total, ~350Mi free
# Swap:  4.0Gi total, ~3.8Gi free

Why 4 GB and not 2 GB? During peak scheduled job execution, swap usage hits 2.1 GB. With 2 GB swap, you'd still OOM. 4 GB gives comfortable headroom on an SSD-backed boot volume. Response times go from ~2s to ~5s under memory pressure, but the bot never crashes.

Step 3: Prevent Concurrent Execution

Claude API calls are expensive (in time and memory). Two simultaneous requests on 1 GB RAM = instant OOM. I wrote a wrapper using flock:

#!/bin/bash
# claude-single.sh — ensures only one Claude process runs at a time
LOCKFILE="/tmp/claude-telegram.lock"

exec 200>"$LOCKFILE"
if ! flock -n 200; then
    echo "Another instance is running. Queuing..."
    flock 200  # Wait for lock
fi

# Your actual command here
claude --message "$1" --output-format json

Crude? Yes. Effective? Absolutely. Request #2 waits in line instead of competing for RAM.

Step 4: systemd for Auto-Recovery

The bot needs to survive crashes, OOM kills, and server reboots:

# /etc/systemd/system/claude-telegram.service
[Unit]
Description=Claude Telegram Bot
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/claude-telegram
ExecStart=/home/ubuntu/claude-telegram/start.sh
Restart=always
RestartSec=10
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target

sudo systemctl enable claude-telegram
sudo systemctl start claude-telegram

Restart=always is the key line. OOM kill at 3 AM? Back up in 10 seconds. No human intervention needed.

Step 5: Health Monitoring

"It's probably running" is not a monitoring strategy. I verify:

# monitor.sh — runs every 5 minutes via cron
#!/bin/bash

SERVICE="claude-telegram"
LOG="/home/ubuntu/claude-telegram/logs/monitor.log"

if ! systemctl is-active --quiet "$SERVICE"; then
    echo "$(date): $SERVICE is down. Restarting..." >> "$LOG"
    sudo systemctl restart "$SERVICE"

    # Alert via Telegram (using a separate, lightweight curl call)
    curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
        -d "chat_id=${ADMIN_CHAT_ID}" \
        -d "text=Bot was down. Auto-restarted at $(date)."
fi

# Cron: check every 5 minutes
*/5 * * * * /home/ubuntu/claude-telegram/monitor.sh

In 2 weeks of operation, the monitor has caught and recovered from 3 OOM kills. Each time, the bot was back within 15 seconds. Without monitoring, I wouldn't have even known it went down.

Step 6: Scheduled Jobs (The Real Value)

A bot that only responds to messages is a chatbot. A bot that proactively works for you is an assistant. This distinction matters.

The 9 Jobs I Run Daily

Job	Schedule	What It Does
Morning Briefing	Weekdays 8:00 AM	Calendar + tasks + news summary
Calendar Check	4x daily (9/12/15/18)	Upcoming meeting alerts
Evening Summary	Daily 9:00 PM	Day recap + memory save
GitHub Check	Every 2 hours	PR reviews, issue notifications
Memory Sync	Every 3 hours	Sync context across devices
Notion Sync	Daily 10:00 PM	Save important conversations
Weekly Review	Friday 6:00 PM	Week summary + planning
Token Check	Daily 10:00 AM	Verify API credentials
Threads Notify	2x daily	Social updates digest

Why Morning Briefing Is the Killer Feature

The morning briefing alone justifies the entire setup. Every weekday at 8 AM, before I even open my laptop, my phone buzzes. One Telegram message with my schedule, pending tasks, and relevant news — generated by Claude using my calendar data via MCP (Model Context Protocol) integration.

It pulls from Google Calendar and Notion, cross-references deadlines, and delivers a single digest. No app switching. No manual checks. One message. That's it.

Step 7: Authentication That Doesn't Expire Every Hour

Default OAuth tokens expire quickly. For a 24/7 server, you need a long-lived token:

# Store in .env
CLAUDE_CODE_OAUTH_TOKEN=your_long_lived_token_here

I configured a token with 1-year validity. Without this, the bot silently fails every few hours when the token expires — and you wake up to 8 hours of missed messages with no error in sight.

I also run a daily token health check that alerts me 30 days before expiry.

What I'd Do Differently

1. Start with swap from minute one. I spent 2 days debugging random crashes before realizing it was OOM kills. One command would have told me immediately:

dmesg | grep -i oom

2. Don't underestimate 1 OCPU. The CPU is fine. It's not fast, but Claude API calls are I/O-bound (waiting for API responses), not CPU-bound. One OCPU handles my workload without breaking a sweat.

3. Log everything from day one. I added proper logging after a week of "it probably works." Future me was grateful. The monitor log has caught issues I never would have noticed otherwise.

The Numbers: $0 vs Paid Servers

After 2 weeks of 24/7 operation, here's how the free tier actually compares to paid alternatives:

Metric	OCI Always Free ($0/mo)	Typical VPS ($5-20/mo)	Dedicated ($200+/mo)
Uptime	99.7% (auto-recovered)	~99.9%	99.99%
Response time	40-50s	40-50s	40-50s
Monthly cost	$0.00	$5-20	$200+
RAM	1 GB + 4 GB swap	2-4 GB	16+ GB
Concurrent requests	1 (flock-limited)	3-5	Unlimited

The response time row is the key insight: it's identical across all tiers. The bottleneck is Claude's API latency (40-50 seconds of thinking time), not server hardware. You'd be paying $200/month for faster concurrent handling — which you probably don't need for a personal assistant that handles 30-40 messages a day.

Full Metrics Breakdown

Metric	Value
Uptime	99.7% (3 auto-recovered OOM events)
Monthly cost	$0.00
Average response time	40-50 seconds (Claude API latency)
Daily messages handled	~30-40
Scheduled jobs/day	~15 executions
RAM usage (steady state)	~600 MB + swap
Swap usage (peak)	~2.1 GB

Total Stack

+-----------------------------------+
|  Oracle Cloud (Always Free)       |
|  Ubuntu 22.04 / 1 OCPU / 1 GB    |
|  + 4 GB Swap                      |
|                                   |
|  +-----------------------------+  |
|  | systemd (auto-restart)      |  |
|  |  +- Telegram Bot (Python)   |  |
|  |      +-- Claude API         |  |
|  |      +-- MCP: Calendar      |  |
|  |      +-- MCP: Notion        |  |
|  |      +-- 9 Scheduled Jobs   |  |
|  +-----------------------------+  |
|                                   |
|  +-----------------------------+  |
|  | Cron                        |  |
|  |  +-- monitor.sh (5 min)     |  |
|  |  +-- memory-sync (3 hr)     |  |
|  |  +-- token-check (daily)    |  |
|  +-----------------------------+  |
+-----------------------------------+

FAQ

Will Oracle actually keep this free forever?

Oracle's Always Free tier has been available since 2019 with no expiration date. Unlike AWS's 12-month free tier or GCP's $300 credit that runs out, OCI Always Free resources genuinely persist indefinitely. I've been running for over 2 weeks with zero charges on my account. That said — Oracle can change terms, so don't build a business-critical production service on it. For a personal AI assistant? It's perfect.

Can I run GPT-based bots instead of Claude?

Yes. The server architecture is completely API-agnostic. Swap the Claude API call for OpenAI's API and everything else stays the same. The memory constraints, swap tricks, flock concurrency, and systemd setup apply identically — the bottleneck is always API latency, not local compute.

What happens when the server runs out of swap?

The Linux OOM killer activates and terminates the heaviest process — which is your bot. With Restart=always in the systemd unit file, the bot comes back in approximately 10 seconds. In my 2 weeks of operation, this happened 3 times — all during overlapping scheduled job executions. The 5-minute monitor catches anything systemd misses.

Is 1 GB RAM enough for multiple bots?

Probably not. One bot with flock-serialized requests uses approximately 600 MB at steady state. A second bot would push you into constant heavy swapping with degraded performance. For running multiple bots, look at OCI's Ampere A1 Always Free tier — 4 OCPUs and 24 GB RAM — though availability varies by region and is often limited.

How do I set up the MCP integrations (Calendar, Notion)?

MCP (Model Context Protocol) lets Claude connect to external services through standardized tool interfaces. I use two MCP servers: one for Google Calendar (HTTP-based with OAuth) and one for Notion (stdio-based with API key). The Claude documentation covers MCP setup in detail — the server-side configuration is identical whether you're running locally or on OCI.

Try It Yourself

Get an OCI account: Sign up at Oracle Cloud. Select "Always Free" resources only.
Set up swap immediately: 4 GB minimum. Don't skip this.
Use systemd + monitoring: Your bot will crash. Make sure it comes back automatically.
Start with one scheduled job: Morning briefing is the most impactful. Add more as you validate each one works.

The whole setup took me a weekend. The hardest part wasn't the code — it was accepting that 1 GB of RAM is actually enough when you manage it properly. Then it just... runs.

Your AI assistant shouldn't need your laptop to be open. Give it a home that runs while you sleep.

What's your always-on setup? Running AI tools on a VPS, a Raspberry Pi, or something weirder? Drop it in the comments — I genuinely want to see what others have built.

If this saved you some Googling, bookmark this post for when you're ready to deploy. And if you're into AI tools that actually run in production — not just in demos — follow me for more.

I build AI-powered automation that runs 24/7 on minimal infrastructure — no drama, no cloud bills. This bot handles 30-40 messages daily plus 15 scheduled jobs without me touching it. If that's the kind of AI engineering you care about, follow along.

I wrote a tool that diagnoses your Claude Code skills before they break

thestack_ai — Thu, 19 Mar 2026 08:56:45 +0000

If you use Claude Code skills, you've probably been there — a skill looks fine, works
sometimes, then silently fails when you actually need it. Missing gotchas section,
description that doesn't trigger properly, 400-line monolith that should've been split
into files.

▎ Anthropic published 7 principles for writing good skills but didn't build anything to
enforce them. So I did.

▎ pulser scans your SKILL.md files against 8 rules derived from those principles. But
here's what makes it different from a linter: it doesn't just say "this is wrong." It
tells you why it matters, gives you a ready-to-use template, and auto-fixes it if you
want. And if the fix isn't what you expected, pulser undo rolls everything back
instantly.

▎ The whole pipeline is: diagnose, classify the skill type, prescribe with context, fix,
rollback.
▎ You can run it three ways. In your terminal as a CLI (just type pulser). Inside Claude
Code as a conversation ("check my skills" or /pulser and Claude does the rest). Or in
Codex.

                                                                                ▎ I've been running it on my own 54 skills and it caught things I would've never noticed

— skills with overlapping trigger keywords stepping on each other, descriptions written
for humans instead of the model, missing tool restrictions that could let a read-only
skill accidentally write files.

▎ It's free, MIT, single npm install.

▎ npm install -g pulser-cli

▎ github.com/whynowlab/pulser

DEV Community: thestack_ai

Claude Code keeps forgetting my project. So I built waypath.

The actual problem with Claude Code memory (and every other agent)

What I tried before building anything

What waypath actually does — local-first AI memory in one CLI

1. Truth and archive are two different things

2. Promotion is an explicit gate, not a vibe

3. Recall is graph-aware, not chunk-aware

4. One facade, every host

60-second quick start

What's not done yet (because lying about a v0.1.1 is how trust dies)

I want your worst use case

Your AI is swinging at nothing — 6 cognitive firewalls that cut my prod bugs by 87%

The problem: agreeable AI ships broken code

How a "firewall" works in Claude Code

Firewall 1: Hallucination guard (hallucination-check)

[hallucination-check fires]

Claude: "I'm about to use on_event on the Anthropic SDK.

Let me verify by reading the SDK source first."

Claude reads node_modules/@anthropic-ai/sdk/...

Claude: "The actual API is async iteration: for await (const event of stream).

No on_event callback exists. Updating the code."

Firewall 2: Confirmation bias firewall (devils-advocate)

When user frames a question with a presupposed answer

Firewall 3: Premature closure (done-criteria)

Firewall 4: Sunk-cost detector (approach-pivot)

Firewall 5: Authority deference (sycophancy-block)

When user pushes back, ask:

Firewall 6: Solution fixation (alternatives-pass)

What I'd do differently

The numbers

FAQ

Try it yourself

Discussion

Claude Code forgot my architecture 3 times last week. I fixed it with one SQLite file.

The problem: coding agents have no memory between sessions

Architecture: four independent kernels over one SQLite file

Archive kernel: FTS5 + reciprocal rank fusion, no GPU required

Promotion kernel: the review gate that stops memory drift

Stack and technical choices

What I already regret in 0.1.1

Performance and cost

FAQ

Get started in five minutes

One question before you go

Testing Claude Code Skills in CI — pulser eval + GitHub Action

The Problem: Skills Break Silently

Step 2: Read the Exit Codes

Step 3: Add the GitHub Action

.github/workflows/skills-weekly.yml

What I'd Do Differently

The Numbers

FAQ

Try It Yourself

Claude Code Has 15 Power Features. I Was Using 3.

The Problem: Claude Code Has a Discoverability Problem

1. Teleport — Move Sessions Between Machines

Terminal 1: Claude Code running normally

Terminal 2: Send a command to the running instance

3. /loop — Iterative Refinement Without Babysitting

4. /schedule — Cron Jobs for Your AI

5. Hooks — Shell Commands on Events

12. --agent — Subagent Mode

From phone terminal (Termius, Blink, etc.)

What I'd Do Differently

The Numbers

FAQ

Try It Yourself

Over to You

I Audited 214 Claude Code Skills — 73% Were Silently Broken

The Problem: Skills That Silently Fail

What pulser Actually Checks

1. Frontmatter Validation

5. The Prescriber

.github/workflows/skill-lint.yml

How is this different from just reading Anthropic's docs?

Try It

Check for skill conflicts

Generate fixes for anything below 70

Add to CI to prevent regressions

Firewall 1: Hallucination guard (`hallucination-check`)

Claude: "I'm about to use `on_event` on the Anthropic SDK.

Claude: "The actual API is async iteration: `for await (const event of stream)`.

No `on_event` callback exists. Updating the code."

Firewall 2: Confirmation bias firewall (`devils-advocate`)

Firewall 3: Premature closure (`done-criteria`)

Firewall 4: Sunk-cost detector (`approach-pivot`)

Firewall 5: Authority deference (`sycophancy-block`)

Firewall 6: Solution fixation (`alternatives-pass`)