DEV Community: Vitaly D.

Signum Can Now Be Installed in Codex App as a Plugin

Vitaly D. — Fri, 08 May 2026 15:59:15 +0000

AI-agent workflows have a boring but important problem: a good process often lives inside one runtime.

Signum started as a Claude Code plugin. That was the right first layer: slash command, local artifacts, contract-first pipeline, proofpack. But the principle is not Claude-specific. If an agent writes code in Codex App, it needs the same context: contract before implementation, verification after implementation, proof in files instead of "looks done" in chat.

The latest Signum changes close that distribution gap: Signum can now be added to Codex App through the Plugins UI.

What changed

Signum now ships a complete Codex App install surface:

.codex-plugin/plugin.json
.agents/plugins/marketplace.json
platforms/codex/.codex-plugin/plugin.json
platforms/codex/SKILL.md
platforms/codex/assets/signum-icon.png

This is not a README instruction that says "copy this prompt". Codex gets a plugin manifest, marketplace metadata, a skill entrypoint, and UI metadata for the plugin card.

The install path is:

Plugins -> Add marketplace

Source: heurema/signum
Git ref: main
Sparse paths: leave blank

After that, Heurema appears in the Plugins source dropdown, and Signum is available inside it.

For pinned installs, the docs point at v4.21.5 or a newer release tag. For sparse checkout, include .agents/plugins and platforms/codex; include .codex-plugin only for direct local-folder installs.

Why this is not just a port

The weak way to move an agent workflow between runtimes is to copy a long system prompt and hope the model remembers how to use it.

Signum does something else. The Codex skill describes a workflow contract, not an answer style:

CONTRACT -> EXECUTE -> AUDIT -> PACK

The core rule stays the same: define correctness before code is written.

In this mode, Codex becomes the orchestrator. It should:

stop if the contract is too weak;
keep artifacts under .signum/contracts/<contractId>/;
run deterministic checks before model review;
treat external reviewer tools as optional evidence, not trust anchors;
say explicitly when a claim could not be verified.

For the user, this matters more than "the plugin exists in the UI". Installation reduces adoption cost, but the value comes from a different layer: the workflow now sits next to the agent that actually executes the task.

Metadata is now part of the release surface

Signum added a dedicated guardrail:

bash tests/test-codex-plugin-metadata.sh

On the current checkout, it passes 37 out of 37 checks.

The test does not only check that files exist. It validates that:

the direct manifest is named signum;
the version matches release metadata;
the direct plugin points at ./platforms/codex/;
the marketplace plugin points at non-empty ./platforms/codex;
the marketplace source is named heurema, with display name Heurema;
install policy is AVAILABLE;
authentication policy is ON_INSTALL;
icon paths resolve;
the Codex skill actually declares the contract-first pipeline and canonical artifact root.

This is the right boundary for plugin distribution. If marketplace metadata breaks, the user never reaches the workflow. That means metadata needs tests the same way code does.

What this changes for context engineering

Context engineering is often reduced to prompt design. This is a useful counterexample.

Context for an agent is not only instruction text. It is the install path, manifest, marketplace entry, skill boundary, artifact layout, validation script, and docs saying the same thing.

When the same workflow is available in Claude Code and Codex App, the idea becomes easier to test honestly. If Signum only works where its author manually configured the environment, it is a local habit. If it can be installed through a plugin surface and checked with a metadata test, it is closer to a reusable capability.

The difference is small, but practical:

less manual setup;
less hidden knowledge in the author's head;
less risk that Codex receives stale or partial instructions;
more chance that the proofpack workflow stays reproducible.

Insight

An agent workflow becomes a product not when it has a polished explanation.

It becomes a product when it has an install boundary, testable metadata, and a clear runtime contract.

Signum in Codex App is that kind of step. Not a new model trick, not a new reviewer, not another layer of confidence. Just a less fragile way to deliver contract-first verification to the place where the agent already works.

Sources

Stop Writing CLAUDE.md From Scratch

Vitaly D. — Mon, 27 Apr 2026 15:31:03 +0000

A month ago I wrote about spec drift -- what happens when an agent reads a stale project.intent.md and treats wrong facts as ground truth. The fix was a reverse diff: derive what the spec should say from the code, compare it to what it says, classify drift section by section.

That post answers: the spec exists and rotted, what now?

The question that bugged me afterwards: what about projects where the spec doesn't exist yet?

Open any new repo started with Claude Code in the last six months and you will find the same artifact: a CLAUDE.md (or AGENTS.md, or .cursor/rules, or .github/copilot-instructions.md) with three lines that read like a placeholder. "This is a Python project. Use type hints. Run pytest." The agent dutifully reads it, fills the gaps with assumptions, and ships a feature based on what a generic Python project usually looks like -- not what this project actually is.

This is not spec drift. This is spec absence. And it shapes the agent's first 50 commits.

Why one file isn't enough

When you only have CLAUDE.md, every question the agent asks ends up there. What's the architecture? What can break? What's the merge bar? Who owns what? The file becomes either a 4000-word kitchen sink that no one updates, or a 200-word stub that answers none of them.

I watched this pattern in five of my own projects. The CLAUDE.md said "the architecture is event-driven" and the agent confidently designed a new feature around an event bus. There was no event bus. The phrase had been copy-pasted from a template six months earlier.

The deeper issue: agents need typed inputs. A reviewer agent asking "is this safe to merge?" needs different context than a contractor agent asking "what's in scope?" Cramming both into one file means each ends up with a partial answer. The agent fills the rest from training-data priors.

The six-file harness

/signum init --harness generates six harness documents, each answering one specific question that an agent will ask:

File	Question it answers
`AGENTS.md`	Where do I read first? What rules override defaults?
`ARCHITECTURE.md`	What is this codebase actually shaped like?
`docs/PLANS.md`	What's in flight right now?
`docs/RELIABILITY.md`	What must keep working? How do I know it broke?
`docs/SECURITY.md`	What's the threat model? What can't I touch?
`docs/QUALITY_SCORE.md`	What's the bar for "done"?

Six files, each with frontmatter (status: draft, owner: TODO, last_reviewed: <date>, review_cadence: quarterly), each with a fixed set of sections that map to a real agent decision.

The frontmatter matters. last_reviewed is the freshness signal: when an agent reads RELIABILITY.md two quarters after creation and sees last_reviewed: 2026-01-15, it knows the doc is past cadence and should be sanity-checked rather than trusted blindly. This is the same insight from the spec-drift post -- verified-on date beats edited-on date -- applied at bootstrap.

What deterministic actually means here

The scaffold is generated by a bash script (lib/init-harness-scaffold.sh) reading from checked-in templates (lib/templates/init-harness/*.tmpl) -- not by a model. Same project root, same date, same output. No LLM creativity in the scaffold step.

This is a deliberate split. The structure (sections, frontmatter, table headers) is deterministic and template-driven. The content (what goes in each section) is LLM-assisted in a follow-up step or written by hand. By separating them, the scaffold itself never drifts: every project gets identical headings, identical section order, identical evidence pattern. Diffs across projects become comparable.

In v4.21 we extracted the templates from inline heredocs into checked-in .tmpl files. Boring change, but it means the templates are code-reviewable as their own surface -- golden tests assert byte-for-byte parity.

Brownfield-safe by default

The fear with any "init" command is that it stomps on real work. init --harness doesn't:

If project.intent.md exists, it is preserved.
If project.glossary.json exists, it is preserved.
If any of the six harness files already exist, they are preserved.
Only files actually missing get scaffolded.
Only --force bypasses these checks.

The behavior is brownfield-first. If you ran init --harness on a four-year-old codebase that already has half the docs, you get drafts only for the missing half. The existing docs stay untouched, even if they are stale or wrong. Spec drift is init --actualize's job, not init --harness's.

What the scaffold gives you (and what it doesn't)

A scaffolded RELIABILITY.md looks like this:

---
status: draft
owner: TODO
last_reviewed: 2026-04-27
review_cadence: quarterly
---

# myproject -- Reliability

## Critical User Journeys
- TODO: List the flows that must keep working for the product to be viable.
- TODO: Link tests, smoke checks, or dashboards that cover each flow.

## Service Levels
- TODO: Define availability / latency / throughput goals if they exist.
- TODO: Record error budgets or acceptable degradation rules.

## Failure Modes and Recovery
- TODO: Top known failure modes.
- TODO: Detection signal for each failure mode.
- TODO: Recovery / rollback procedure for each failure mode.

It is not finished documentation. It is a typed prompt for the human (or a follow-up agent) to fill in. The TODOs are pinpointed: not "describe reliability" but "list the flows that must keep working." Every prompt forces you to put a fact, not a paragraph of mood.

The structure also makes the gap visible. If you scaffold RELIABILITY.md and three months later the section "Failure Modes and Recovery" is still all TODOs, that is a real signal -- your team has no documented failure modes, and an agent shipping changes to that surface is doing it without context.

Why six and not three (or twelve)

Six is the smallest set that doesn't collapse two unrelated agent questions into one file. I tried smaller. Combining RELIABILITY.md and SECURITY.md into a single OPS.md pushed the doc past the point where an agent could find a specific failure mode without scanning the whole file. Combining PLANS.md into AGENTS.md made the latter rot weekly.

I also tried larger. Splitting ARCHITECTURE.md into DATAFLOW.md + MODULES.md + BOUNDARIES.md produced three half-empty files. Most projects don't have enough architecture to justify three docs, but they do have enough to justify one.

The split that survived: one file per agent decision, never one file per topic. That is the design contract.

What it doesn't replace

init --harness does not generate project.intent.md. That file is the contract surface for Signum's pipeline -- used by the contractor agent to derive scope and acceptance criteria -- and it has its own bootstrap path (signum init plain) that uses the code scanner, not the harness templates.

Harness docs are human-and-reviewer surfaces: they shape how a reviewer agent thinks about merge gates, how a contractor frames non-goals, how an engineer agent decides whether a change crosses a reliability boundary. They are not the contract itself.

The two interlock. project.intent.md says "what this project does." The harness says "how this project is run."

Try it

claude plugin install signum@emporium
cd your-project/
claude "/signum init --harness"

You'll see six draft files. Review them, fill the TODOs, commit. If you already have one of them, it is preserved -- only missing files get scaffolded.

If your existing harness has drifted, that's init --actualize's job from the earlier post. Different surface, same underlying logic: derive what the doc should say from the code, classify the gap, write only what you accept.

Most agentic projects fail their first month not because the agent makes bad code but because the agent never had typed context to begin with. The harness is six files because that is the smallest set of typed questions an agent will ask before its second commit. Skip them and you get hallucinated architecture. Stub them with one file and you get partial answers to all six.

Source: github.com/heurema/signum (v4.21.0)

AI Agents Need Permission Boundaries, Not Personalities

Vitaly D. — Wed, 08 Apr 2026 16:50:39 +0000

Most agent tooling mistakes coordination for reliability. It gives you more
roles, more agents, more orchestration, and more shell theater. The demo gets
more impressive. The system does not necessarily get easier to trust.

That tradeoff used to be tolerable when humans still carried the real model of
the work in their heads. A messy runtime could end in a decent result because a
human operator could reconstruct intent, inspect the diff, and override weak
process with judgment. That stops scaling once generation becomes cheap, fast,
and constant. The bottleneck is no longer code generation. It is trust.

That is why the most interesting agent systems are not the ones with the most
personalities. They are the ones that make planning, execution, and
verification legible as different kinds of authority.

That is the bet behind specpunk, now
being reset into punk. The project is explicit about the reset. It is not
polishing a launched product. It is rebuilding around a stricter shape: one
CLI, one vocabulary, one runtime, and three hard modes - plot, cut, and
gate. That matters because those modes are not style presets. They are
permission boundaries.

The coordination trap

A lot of agent tooling still assumes that better software delivery comes from
adding more orchestration surfaces. The pattern usually looks familiar:

one agent plans
another agent implements
another agent reviews
a shell coordinates them
a chat transcript becomes the history of what happened

This can produce useful work. It can also produce confidence theater.
Coordination is not the same thing as ground truth.

If the runtime cannot answer four basic questions, it is not a trust system
yet:

What exactly was approved?
What actually ran?
What state is authoritative now?
What proof exists for the final decision?

More agents do not answer those questions. More roles do not answer those
questions. A fancier shell does not answer those questions. At best, those
things improve throughput or ergonomics. At worst, they multiply ambiguity.
That is the trap: agent runtimes optimize for visible activity instead of
enforceable structure.

What trustworthy agent work actually needs

If you strip away the theater, trustworthy agent work needs a smaller set of
primitives than most tools expose:

approved intent
bounded execution
durable work state
a clear decision surface
proof-bearing artifacts

That list is more important than any model roster. An agent can be brilliant
and still untrustworthy if it is allowed to plan, mutate, and self-validate
inside one fuzzy surface. The failure mode is not only bad code. It is
unfalsifiable process.

A human operator should not have to reconstruct the truth by reading prompts,
shell chatter, and commit residue. The runtime should already have a durable
answer. This is where punk starts from a stronger premise than most agent
systems:

the shape of the runtime matters more than the number of agents inside it.

Why `punk` resets the shape

The specpunk docs are unusually clear about what is being built. punk is
becoming a local-first engineering runtime with one CLI, one vocabulary, one
artifact chain, and one state truth.

The canonical object chain in the docs is:

Project
  -> Goal
    -> Feature
      -> Contract
        -> Task
          -> Run
            -> Receipt
            -> DecisionObject
            -> Proofpack

That is already a different product claim from "we coordinate a bunch of coding
agents for you." The center is not the agent. The center is the artifact
chain.

That choice has consequences. It means the runtime is trying to preserve
continuity across attempts, retries, verification steps, and future
inspection. A Feature survives beyond one implementation pass. A Contract
is explicit. A Run is one concrete attempt. A DecisionObject is written
only by gate. A Proofpack is the final audit bundle. This is a reliability
architecture, not a chat architecture.

Substrate first, shell second

The strongest idea in the current punk design is the split between two
layers:

a correctness substrate
an operator shell

The substrate owns durable truth:

project identity
goal intake
contract
scope
workspace isolation
run state
decision objects
proof artifacts
the ledger

The shell owns ergonomics:

punk init
punk start
punk go --fallback-staged
summaries
blocked and recovery UX
generated repo-local guidance

This may sound obvious, but most agent systems blur these two layers almost
immediately. The shell becomes a hidden policy engine. Safety semantics leak
into prompts. Output formatting starts pretending to be state. Eventually
nobody can tell whether a behavior is enforced by the runtime or merely
suggested by the interface.

punk is trying to stop that drift early. The rule in the architecture docs is
simple and important: the shell may compose substrate operations, but it must
not become a second source of truth. That is the kind of rule that keeps a tool
honest as it grows.

`plot`, `cut`, and `gate` are not vibes

The three canonical modes in punk are easy to misunderstand if you have seen
too many agent UIs. plot, cut, and gate are not there to make the tool
feel cinematic. They exist to separate authority.

plot shapes work, inspects the repo, drafts and refines contracts
cut executes bounded changes in an isolated VCS context
gate verifies results, writes the final decision, and emits proof

The docs explicitly say these are hard permission boundaries, not tone
presets. That is a serious design choice.

A lot of agent failures come from collapsing these phases into a single
conversational loop. The same surface interprets intent, changes code, judges
its own result, and narrates success. Even when the final answer sounds
careful, the trust boundary is weak because the roles are merged at the runtime
level.

punk moves in the opposite direction. Only gate writes the final
DecisionObject. Only approved contracts should reach cut. The event log and
derived views hold runtime truth, not the shell summary. That is what
permission boundaries look like in practice: not "agent A is the planner" and
"agent B is the reviewer," but real authority boundaries and real artifact
ownership.

Durable work state matters more than chat history

Another strong thread in the design is the work ledger idea. Most agent
sessions leave behind a bad form of memory:

shell logs
chat transcripts
commits
maybe a branch name
maybe a PR

That is enough until something blocks, fails, gets superseded, or needs to
continue later. Then everybody starts asking the same questions: what is the
active contract, what was the latest run, did verification block or escalate,
and what should happen next?

If the only answer is "read the last few screens of terminal output," the
runtime is weak. The punk docs push toward a WorkLedgerView that can answer
those questions directly. That is the right instinct. Agents do not only need
context to act. Operators need durable work state to continue. Again, the move
is the same: replace inference with explicit structure.

Why the one-face shell still matters

None of this means the UX should be ugly. In fact, the punk docs make another
smart move: they argue for a one-face operator shell. The normal user should be
able to give a plain goal and get back one concise progress or blocker summary
plus one obvious next step.

That is good design, but the key is what comes underneath it. A clean shell is
valuable only if it sits on top of a substrate that already knows what is
authoritative. Otherwise one-face UX becomes a prettier way to hide ambiguity.
That is why the substrate-versus-shell split matters so much. punk is not
rejecting ergonomics. It is refusing to let ergonomics pretend to be truth.

A better shape for agent engineering

The most interesting thing about punk is not that it might someday
orchestrate multiple models, run councils, or improve skills through eval
loops. The interesting thing is the order of operations.

Before any higher-level feature, the project is trying to get the runtime shape
right:

one vocabulary
explicit artifact chain
bounded modes
durable state
proof before acceptance

That is the right order. If you get the shape wrong, every later feature
inherits ambiguity. Councils become opinion aggregators instead of structured
advisory mechanisms. Skills become prompt folklore instead of evidence-backed
overlays. Shell UX becomes theater instead of control.

If you get the shape right, those later layers have something solid to attach
to. That is why punk is worth paying attention to even in a rebuild phase. It
is making an architectural claim that more agent tooling should make
explicitly:

reliability does not come from adding more agent personalities. It comes
from enforcing boundaries between intent, execution, verification, and
proof.

More agents do not create ground truth. More roles do not create safety. If
planning, execution, and verification are not separated by hard boundaries, the
runtime scales ambiguity, not trust. That is the real reason this design reset
matters.

Sources

My AI Agent Said 'Done.' It Skipped an Entire Acceptance Criterion.

Vitaly D. — Mon, 23 Mar 2026 08:32:54 +0000

Last week, our pipeline produced a proofpack with decision: HUMAN_REVIEW. The contract had 10 acceptance criteria. The engineer agent created all the new files, build passed, tests passed, three independent reviewers ran. Everything looked correct — except AC18.3, which required rewriting an existing endpoint's response schema. The engineer never touched health.go. The pipeline said SUCCESS.

That should have been impossible.

Why "SUCCESS" was wrong

The pipeline had four verification layers: mechanic checks (lint, typecheck, tests), holdout scenarios (blind tests the engineer never sees), multi-model code review (Claude + Codex + Gemini), and a synthesizer that combines everything into a final verdict.

The engineer agent has its own repair loop — three attempts to make all acceptance criteria pass. It runs verify commands for each AC, fixes failures, retries. After three attempts, it reports status to the orchestrator.

Here is the gap: the orchestrator checked execute_log.json for status: SUCCESS and moved on. It trusted the engineer's self-reported status. The engineer reported success because its verify command for AC18.3 was grep freshness_ms health.go — a presence check, not a behavioral check. The string did not exist, the grep failed silently, and the engineer moved on without implementing the criterion.

We had review. We had iteration. We had a proof artifact. What we did not have was independent boundary verification.

The missing trust boundary

The pattern is familiar. A developer says "tests pass" and pushes to main. CI runs the same tests — independently. The developer's claim and the verification live in different trust domains. If CI only checked the developer's test output file instead of running tests itself, nobody would trust it.

Our pipeline had exactly this flaw. The engineer agent both implemented the code and reported whether it succeeded. The orchestrator consumed that report without re-running the checks. The implementer was grading its own homework.

This is not specific to Signum. If your workflow lets a coding agent say "done" and your pipeline checks artifacts emitted by that same agent without independent re-execution, you have the same trust problem. It does not matter how many reviewers you add downstream — reviewers audit the code that exists, not the code that should exist.

What we changed: boundary verification

The fix was not another reviewer or another retry. It was a trust boundary — a deterministic verifier that runs after the engineer finishes and before the audit begins. The verifier:

Captures a cryptographic snapshot of the workspace before execution starts
After the engineer finishes, independently re-runs every acceptance criterion's verify command via a sandboxed DSL runner
Checks scope integrity: are all promised files present? Are there out-of-scope modifications?
Writes an append-only receipt with per-AC evidence, artifact hashes, and a chain linking back to the pre-execution snapshot
Gates the transition to audit: if any visible AC lacks independent evidence, the pipeline blocks

The orchestrator no longer reads the engineer's execute_log.json to decide whether ACs passed. It reads the receipt. The receipt is written by a verifier that shares no state with the engineer. The engineer cannot influence what the receipt contains.

Each repair iteration in the audit loop also runs boundary verification before the candidate proceeds to review. The receipt chain is append-only — iteration 2's receipt references iteration 1's hash, making the full sequence tamper-evident.

What it catches

Three failure modes that previously passed silently:

Skipped criteria. The engineer claims success but never touched the relevant file. The verifier runs the AC's check and finds no evidence.
Vacuous verification. The verify command is too weak (a grep for a string that could appear anywhere). On medium and high risk contracts, the verifier classifies the evidence strength and blocks on exit-only checks.
Scope drift. The engineer modifies files outside the contract's scope, or promises a new file but never creates it. The snapshot diff catches both.

What it does not prove

Boundary verification is not semantic verification. It confirms that each acceptance criterion's check command returned zero. It does not confirm the implementation is correct in a deeper sense.

Some limitations are fundamental:

A well-crafted but subtly wrong implementation can still pass all verify commands. The receipt proves the check ran and passed, not that the check was sufficient.
Manual acceptance criteria (where no automated check exists) skip the verifier entirely. The receipt marks them as unverified — the synthesizer cannot issue AUTO_OK if manual ACs exist.
Stricter verification means more false blocks. A flaky verify command will halt the pipeline even when the implementation is correct.

The receipt chain closes the trust gap between claiming and proving. It does not close the gap between proving and being right.

The broader question

Every AI coding workflow has a version of this problem. The agent generates code, runs checks, reports results. At some point, a human or a system must decide: is this done?

The answer depends on what evidence you require. Self-reported status is the weakest. Test results are stronger but can be gamed by weak tests. Independent re-execution against a pre-declared contract is stronger still — but only as strong as the contract itself.

Two questions worth asking about your own workflow:

Who verifies your acceptance criteria — the same agent that implemented them, or an independent process?
Would you accept more false blocks in exchange for fewer false successes?

We chose more false blocks. The alternative was worse.

Signum is an open-source Claude Code plugin for contract-first AI development. The receipt chain shipped in v4.15.1. The bug described here is issue #10.

Your AI Spec Is Already Stale

Vitaly D. — Fri, 20 Mar 2026 08:39:23 +0000

I maintain 12 Claude Code plugins. Each has a project.intent.md -- a structured spec that tells the agent what the project does, what it doesn't do, and who it's for. The agent reads it at the start of every task.

Last week I ran a reverse diff -- code signals vs. existing spec -- on two projects. Both had drift. One had been wrong for three versions.

The problem with specs in AI-assisted codebases

Traditional docs debt is annoying but survivable. A stale README means a developer spends 10 extra minutes figuring things out. They have context, judgment, and access to git log.

An AI agent reading a stale spec has none of that. It treats the spec as ground truth. If project.intent.md says the scoring formula is source_weight + keyword_density * 0.2 + release_boost, the agent will write code that assumes those variables exist. Even if the actual implementation changed to source_weight + min(points/500, 3.0) two versions ago.

This isn't docs debt. It's an execution bug hiding in plain text.

What I found

Herald: v1 ghosts in a v2 codebase

Herald is a news digest plugin. It went through a major rewrite from v1 (JSONL pipeline) to v2 (SQLite pipeline). The project.intent.md was generated from v1 docs and never updated.

I ran the code scanner against the existing spec. The diff:

Scoring formula -- wrong since v2.0:

EXISTING (glossary):
  source_weight + keyword_density * 0.2 + release_boost

ACTUAL CODE (herald/scoring.py):
  source_weight + min(points/500, 3.0)
  Story-level: max_article_score + coverage + momentum

The keyword_density variable doesn't exist in v2. An agent writing a scoring-related feature would reference a ghost API.

Deduplication -- wrong threshold:

EXISTING (glossary):
  Jaccard trigram title similarity at threshold 0.85

ACTUAL CODE (herald/cluster.py):
  SequenceMatcher with threshold 0.65

Not just a different number -- a different algorithm. Jaccard trigrams vs. Python's SequenceMatcher. An agent tuning dedup behavior would look for the wrong function.

Pipeline orchestration -- dead reference:

EXISTING (glossary):
  "run.sh orchestrator acquires POSIX lockfile, calls collect.py then analyze.py"

ACTUAL CODE (herald/cli.py):
  herald.cli run → pipeline.py → collect → ingest → cluster → project
  No run.sh. No lockfile. No analyze.py.

Three of three architectural facts in the glossary were stale. The agent would reference files that don't exist.

Delve: missing shipped features

Delve is a deep research orchestrator. Its project.intent.md was 3 days old -- written at v0.7, now at v0.8.1. Two shipped features were missing:

ADDED to Core Capabilities:
  + Token-efficient pipeline with trafilatura-based content extraction
    (45-60% input token reduction)
  + Stage 0.5 CONTEXTUALIZE: local context enrichment before web SCAN

Plus two entire sections were absent:

ADDED sections:
  + Success Criteria (derived from quality thresholds in reference.md)
  + Personas (3 user types inferred from README usage patterns)

An agent scoping a new feature for Delve wouldn't know about CONTEXTUALIZE. It might re-implement local context enrichment from scratch, duplicating a shipped capability.

Why AI makes this worse

In a human-only workflow, specs rot slowly. Developers write code, docs lag behind, someone eventually updates the README. The feedback loop is months.

With AI agents, the loop compresses. An agent can ship 5 features in a day. Each feature may add capabilities, change interfaces, or remove dead code. The spec was accurate at 9 AM and wrong by 5 PM.

The multiplier effect: agents don't just read stale specs -- they write code that assumes the stale spec is correct, which then gets reviewed by another agent that also reads the stale spec. Confirmation bias at machine speed.

The reverse diff

The fix isn't "update your docs more often." That's aspirational advice that doesn't scale. The fix is a machine-readable reverse diff: scan the code, derive what the spec should say, compare it to what it actually says.

Here's what the diff output looks like:

SECTION: Goal
STATUS: UNCHANGED
REASON: Fresh derivation semantically matches existing content.

SECTION: Core Capabilities
STATUS: UPDATED
EXISTING: 6 items
PROPOSED: 8 items (+2 new capabilities from shipped commits)
EVIDENCE: git log afe0e42, 90e5ded
CONFIDENCE: high

SECTION: Non-Goals
STATUS: UNCHANGED
REASON: All 5 items still supported by docs/how-it-works.md Limitations.

SECTION: Success Criteria
STATUS: ADDED
REASON: Section absent from existing intent; quality thresholds in reference.md.

SECTION: Personas
STATUS: ADDED
REASON: Section absent; 3 user types inferred from README usage patterns.

Each section is classified independently: UNCHANGED, UPDATED, ADDED, or REMOVED. The developer reviews per-section, not per-file. Unchanged sections are auto-accepted -- you only see what actually drifted.

The key design decision: when in doubt, UNCHANGED beats UPDATED. If the existing content contains facts not derivable from code signals -- manual edits, domain knowledge, judgment calls -- the system preserves them. It only flags drift it can prove from code.

What this means for your projects

If you use CLAUDE.md, project.intent.md, agent instructions, or any structured context that an AI reads as ground truth:

Treat spec accuracy as a correctness property, not a hygiene task. A wrong spec is a wrong input to every agent session.
Automate the reverse direction. You probably have CI that checks code against specs (tests, linting, contracts). You probably don't have anything that checks specs against code. That's the gap.
Diff semantically, not textually. A cosmetic reword shouldn't trigger a review. A missing capability should. The scanner needs to understand what matters.
Run it after shipping, not before. The spec drifts after the code ships, not before. Check intent freshness as a post-deploy step, not a pre-commit hook.

The implementation

I built this as --actualize mode in Signum's /signum init command. It reuses the same scanner that bootstraps new projects -- same signal hierarchy, same evidence tracking -- but produces a diff instead of a full rewrite.

The scanner reads authoritative docs, README, package manifests, git history, and entrypoints. The synthesizer compares each section against existing intent and classifies it. The command presents changes one section at a time and writes only what you accept.

/signum init --actualize

It's a Claude Code plugin. The scanner is deterministic (bash, no LLM). The diff is LLM-produced (semantic comparison, not byte-level). The write is human-confirmed (no auto-apply).

I caught 6 factual errors in Herald's glossary and 2 missing capabilities in Delve's intent. Both had been accurate when written. Both drifted within days.

If your agents read structured context, check when it was last verified -- not when it was last edited.

Source: github.com/heurema/signum (v4.11.0)

What a Formal Verification Agent Taught Me About Code Audit

Vitaly D. — Tue, 17 Mar 2026 11:52:15 +0000

The morning digest surfaced Leanstral -- Mistral's open-source agent for formal verification in Lean 4. A mixture-of-experts model (119B total, 6.5B active per token) that scores within 80% of Claude Opus on the FLTEval theorem-proving benchmark at a fraction of the cost.

I don't need Lean 4. But the agent's architecture proved useful: multi-attempt proof search, diagnostic feedback loops, structured verification. Three of these patterns transferred well to my code audit pipeline. The other three improvements came from the same design session.

A word on Signum. Signum is a plugin for Claude Code that turns feature requests into verifiable artifacts. It works in four phases: a Contractor agent writes a contract (spec + acceptance criteria), an Engineer agent implements it, three independent AI models audit the result (Claude for semantics, Codex for security, Gemini for performance), and a Synthesizer produces a final verdict. The pipeline iterates: if the audit finds issues, the Engineer gets a repair brief and tries again. The output is a proofpack -- a self-contained bundle of contract, diff, review findings, and decision.

Three Patterns from Leanstral

Leanstral works through an MCP server that connects the agent to Lean 4's Language Server Protocol. Its five-phase loop -- discover proof gaps, analyze subgoals, search the library for relevant lemmas, synthesize a tactic, check diagnostics -- is a structured generate-verify cycle. Three elements mapped to Signum:

1. Verification before review. Lean doesn't just check "does the proof compile" -- it verifies that the proof actually type-checks under the kernel. In Signum, the analogue became a policy scanner: a deterministic grep on the diff that runs before the LLM reviewers, catching security and unsafe patterns at zero token cost.

2. Parallel attempts. Lean's multi_attempt tool substitutes several tactics at one position and compares the resulting goal states. In Signum, this became parallel repair lanes -- two Engineer agents working in isolated git worktrees with different fix strategies.

3. Typed diagnostics. Lean LSP returns structured error objects (file, line, message, severity), not raw text. In Signum, the mechanic phase now returns a hybrid format with typed findings instead of a flat "regressions: true/false" boolean.

Policy Scanner

The cheapest improvement. Between the mechanic step (lint, typecheck, tests) and the multi-model code review, a new bash script scans the unified diff for known dangerous patterns. 195 lines, zero LLM cost.

It scans only addition lines. 12 patterns in three categories:

Security (blocks the pipeline): eval, subprocess with shell=True, innerHTML, SQL string concatenation, weak crypto
Unsafe (flagged for review): TODO/FIXME/HACK markers, debug statements
Dependency (flagged for review): new entries in package managers -- but only when the file is actually a manifest (package.json, Cargo.toml, etc.), not a README or test fixture

Three design decisions came from asking Codex (GPT) and Gemini independently, then comparing answers -- a process I call an "arbiter panel" (all three models -- Claude, Codex, Gemini -- agreed on each):

Fail-closed on missing input. If the diff file doesn't exist, the scanner exits with an error rather than silently producing zero findings.
Manifest-only filtering for dependency patterns. Without it, any JSON key-value pair in any file triggers a false positive.
Curated sinks over broad regexes. A short list of known-dangerous calls (subprocess.call, child_process.spawn) beats a generic pattern that matches harmless calls like db.query().

Typed Diagnostics

Before this change, the mechanic report was a flat summary: lint passed or failed, tests passed or failed, regressions yes or no. The Engineer agent in repair mode received this as a blob and had to guess which file and line to fix.

Now the mechanic produces a hybrid format: always a summary per check, plus per-file findings when the runner supports structured output. Each finding carries an origin field -- "structured" for JSON output (ruff, eslint), "stable_text" for parseable text (tsc, mypy), or "none" for summary only. The pipeline gates on the summary; findings are hints, not source of truth.

An aside on catching bugs with the pipeline itself: Claude Opus found a critical issue on the very first review of this feature. A || true after a command substitution silently masked the exit code, making the return value always zero. Regression detection was dead for all eight supported runners. One token. The iterative repair loop fixed it in a single pass -- exactly the kind of convergence the system is built for.

Parallel Repair Lanes

The most complex change. Previously, the repair loop was sequential: one Engineer attempt, audit, next attempt. Now it spawns two Engineers in parallel, each in an isolated git worktree:

Lane A: "Fix with minimal targeted changes. Patch only the flagged lines."
Lane B: "Fix by addressing the root cause. May touch more files."

Both receive the same repair brief. After both complete, the pipeline runs lightweight checks (lint, typecheck, tests, hidden validation scenarios) on each lane, scores them, and sends only the winner through the full three-model review. If the winner still has serious findings, the runner-up also gets reviewed before the iteration is declared failed.

Same principle as Lean's multi_attempt: explore the solution space in parallel, select the best candidate, verify once.

Three More Changes

These came from the same design session but aren't directly Leanstral-inspired:

Dynamic strategy injection. The Contractor agent now classifies the task type (bugfix, feature, refactor, security) via keyword scan and generates a strategy hint in the contract. The Engineer reads it as a process guide -- "reproduce bug with a test first" for bugfixes, "find all occurrences of the pattern, not just the reported one" for security fixes. Informational only; it doesn't block the pipeline.

Context retrieval for reviewers. A new pre-review step gathers git history (last commit per modified file), issue references (parsed from the goal text), and the project's intent document. This context is injected only into the Claude reviewer -- Codex and Gemini remain isolated (goal + diff only), preserving their value as independent validators. The intent is to reduce false positives by giving the semantic reviewer context about why the code looks the way it does.

Approval UX. A small fix: the contract approval display now uses markdown formatting instead of fragmented bash output. The goal text is never truncated, the summary is a compact table, and warnings are grouped.

What I Learned

Each feature went through the full pipeline: design panel, contract, implementation, three-model audit, iterative repair. Of six runs, only one passed on the first attempt (the simplest change). The rest required two to three iterations.

The pattern that emerged: the Engineer's first pass satisfies all acceptance criteria, but code review surfaces real bugs -- exit code masking, race conditions on shared file paths, missing field mappings. The iterative loop fixes them in one or two passes. In this sample of six changes, the system behaved as intended: not a gatekeeping checkpoint, but a convergence loop.

The full session: 7 commits, roughly 1,900 lines of changes, 5 design panels, over 15 multi-model review rounds. It started from one line in a morning news digest.

Switching AI CLIs Without Losing 32 Skills: Why I Built nex

Vitaly D. — Tue, 17 Mar 2026 10:46:14 +0000

I use three AI coding CLIs daily: Claude Code, Codex CLI, and Gemini CLI. Each has plugins, skills, and custom workflows I've built over months. When I wanted to try Codex as my primary tool for a week, the migration looked like this:

Manually recreate 32 symlinks
Adapt plugin layouts to each CLI's format
Track which version is installed where
Hope nothing drifts while I'm not looking

I built nex to make this a one-liner.

The problem: skills are trapped

AI coding agents are converging on similar concepts - skills (reusable instructions), plugins (tools + hooks), and agents (autonomous workers). But each CLI implements them differently:

Concept	Claude Code	Codex CLI	Gemini CLI
Skills	`.claude/skills/SKILL.md`	`.agents/skills/SKILL.md`	`.gemini/skills/SKILL.md`
Plugins	marketplace + `.claude-plugin/`	`AGENTS.md` tree-walk	`gemini-extension.json`
Config	`settings.json` per profile	`config.toml` with profiles	`settings.json`
Discovery	marketplace clone + cache	directory scan	`context.fileName`

The skill format (SKILL.md with YAML frontmatter) is actually the same across all three - thanks to the Agent Skills open standard. But the discovery and installation mechanics are completely different.

If you've built 12 custom plugins with skills, agents, and hooks, switching your primary CLI means rebuilding your entire setup. That's vendor lock-in through installation friction, not through format incompatibility.

What nex does

nex is a Rust CLI (~5000 LOC) that manages the installation layer. It doesn't change how skills work - it handles where they live.

$ nex install signum
  Installing signum v4.8.0...
  [OK] Claude Code  ~/.claude/plugins/signum
  [OK] Codex        ~/.agents/skills/signum
  [OK] Gemini       ~/.agents/skills/signum
  Installed for 3 platforms.

One command, all three CLIs get the skill.

Seeing everything at once

Before nex, I had no single view of what was installed where:

$ nex list
PLUGIN           VERSION    EMPORIUM   CC     CODEX  DEV
────────────────────────────────────────────────────────
signum           4.8.0      v4.8.0     ✓      ✓       -
herald           2.1.0      v2.1.0     ✓       -      dev→~/...
delve            0.8.1      v0.8.1     ✓       -      dev→~/...
arbiter          0.3.0      v0.3.0     ✓       -      dev→~/...
...

32 plugins visible across all platforms. Previously nex could only see 1.

Drift detection

Version drift between platforms is silent and dangerous. nex check catches it:

$ nex check
PLUGIN           EMPORIUM     CC CACHE     CODEX      STATUS
──────────────────────────────────────────────────────────────
herald           v2.1.0       v2.0.0        -          UPDATE ↑
signum           v4.8.0       v4.8.0       linked     OK
delve            v0.8.1       v0.8.1        -          OK (dev override)

Profiles as desired state

The killer feature: TOML profiles that declare which skills should be active per CLI.

# ~/.nex/profiles/work.toml
[plugins]
enable = ["signum", "herald", "delve", "arbiter", "content-ops",
          "anvil", "forge", "genesis", "glyph", "reporter", "sentinel"]

[dev]
herald = "~/personal/skill7/devtools/herald"
delve = "~/personal/skill7/devtools/delve"

[platforms]
claude-code = true
codex = true
gemini = true

$ nex profile apply work
  [OK] signum  - Codex/Gemini symlink exists
  [NEW] herald  - symlink created
  [NEW] delve  - symlink created
  ...
  Profile 'work' applied and set as active.

Switching from Claude Code to Codex as primary? nex profile apply work ensures all your skills are there.

Architecture: layered source of truth

The hardest design decision was ownership. Who owns the plugin state?

Each platform already tracks its own installations internally. Claude Code has installed_plugins.json. Codex discovers skills by scanning directories. Gemini reads extension configs. If nex tried to be the single source of truth for everything, it would constantly fight with the platforms' own state management.

Instead, nex uses a layered model:

┌─────────────────────────────────┐
│            nex CLI              │
│  Catalog │ Profiles │ Adapters  │
└────┬─────┴────┬─────┴──┬──┬──┬─┘
     │          │        │  │  │
     ▼          ▼        ▼  ▼  ▼
  emporium   ~/.nex/    CC Cdx Gem
  (catalog)  profiles  (ro)(rw)(rw)

Layer 1: Catalog - nex owns the emporium marketplace (our plugin registry). It's the source of truth for what versions exist.
Layer 2: Platform runtime - each CLI owns its own state. nex reads Claude Code state (read-only, never writes to CC files), and manages Codex/Gemini symlinks directly.
Layer 3: Profiles - nex owns the desired state. Profiles declare intent; nex profile apply reconciles reality.

This design came from an arbiter panel where I asked Codex and Gemini for their recommendation. Both proposed "layered SSoT" independently. The key insight from Codex: "Don't pick one global source of truth. Pick a source of truth per layer."

Release automation

When I release a new version of a plugin, I don't want to manually update versions, changelogs, tags, and marketplace refs. nex release handles the full pipeline:

$ nex release patch --execute
  [OK] BUMP        .claude-plugin/plugin.json
  [OK] CHANGELOG   inserted [2.1.0] section
  [OK] DOCS        README.md version updated
  [OK] COMMIT      "release: v2.1.0"
  [OK] TAG         v2.1.0
  [OK] PUSH        origin/main
  [OK] PROPAGATE   emporium marketplace ref → v2.1.0

Nine stages, dry-run by default. The DOCS stage (new in v0.9.0) auto-generates changelog entries from git log and syncs SKILL.md descriptions with plugin.json.

Health monitoring

nex doctor runs 11 checks across all platforms:

$ nex doctor
  [OK]   signum
  [WARN] herald  emporium_drift: emporium=v2.1.0 but CC cache=v2.0.0
  [WARN] signum  duplicate: found in 3 locations: dev symlink, emporium cache, nex-devtools

It catches duplicates, stale symlinks, orphan caches, version drift, and deprecated marketplace artifacts.

What I learned

Read-only integration is the right default. My first instinct was to have nex write directly to Claude Code's installed_plugins.json. An arbiter panel (Codex + Gemini) convinced me otherwise: writing to another tool's internal state creates race conditions and breaks on format changes. Read-only + filesystem discovery is more resilient.

Profiles are more useful than auto-sync. I initially wanted nex to automatically keep all platforms in sync. But different contexts need different skill sets. My work profile has 11 plugins; personal has 2. Explicit profiles are better than implicit mirroring.

The skill format war is already over. SKILL.md with YAML frontmatter works in Claude Code, Codex, Gemini, and 20+ other tools. The installation layer is the real fragmentation - and that's what nex fixes.

Try it

nex is open source (MIT), written in Rust, and works on macOS and Linux.

# Install
cargo install --git https://github.com/heurema/nex

# Or download binary
curl -L https://github.com/heurema/nex/releases/latest/download/nex-$(uname -m)-apple-darwin -o nex
chmod +x nex && mv nex ~/.local/bin/

# Get started
nex list              # see all plugins across platforms
nex check             # detect version drift
nex doctor            # health check
nex status            # cross-platform overview

GitHub: heurema/nex

One Pass Isn't Enough: How Signum Learned to Fix Its Own Code

Vitaly D. — Sun, 15 Mar 2026 18:04:12 +0000

The first version of Signum ran in a single pass: CONTRACT → EXECUTE → AUDIT → PACK. If the audit found a problem — block. Human deals with it.

An honest process, but a limited one. Imagine code review where the reviewer can only comment and the author can't fix anything. The finding goes back to the queue, context is lost, the cycle restarts from scratch. Signum v4.6 closes this gap: the pipeline now loops at three levels — code, contract, and project context — before producing the final proofpack.

The problem: one-shot verification

In previous posts I covered the contract as ground truth and proofpack as a verification artifact. The architecture worked: spec → blinded implementation → multi-model audit → proof artifact. But production use revealed a pattern.

Most audit findings in our early runs weren't architectural issues. They were a missed edge case in error handling. A forgotten null check. A test that doesn't cover one of the acceptance criteria. Things the engineer agent could fix in seconds — if it got the chance.

Instead, Signum issued AUTO_BLOCK, a human looked at the finding, restarted the pipeline. Full contract rebuild, full implementation, full audit — for a bug that's a one-line fix.

Loop 1: code — iterative audit

Signum v4.6 adds a repair loop that bridges AUDIT and EXECUTE. When the audit finds MAJOR or CRITICAL findings, instead of blocking, it sends the engineer back to fix:

AUDIT → findings → re-enter EXECUTE (repair) → AUDIT → ... → PACK

After the first audit pass (mechanic + Claude + Codex + Gemini), if there are actionable findings, the engineer agent receives repair_brief.json:

{
  "iteration": 1,
  "findings": [
    {
      "id": "F-1",
      "severity": "MAJOR",
      "file": "src/api/tokens.py",
      "line": 42,
      "description": "Missing error response when rate limit storage is unavailable",
      "source": "codex"
    }
  ]
}

Important: repair_brief.json contains only observed defect symptoms from visible criteria and deterministic checks. Holdout failures are reported as behavioral observations ("function returns 200 when 429 expected") without revealing the hidden acceptance criteria. The data-level blinding from the original contract remains intact — the engineer never sees raw holdout text.

The engineer fixes. The full audit reruns — not on the repair diff, but on the entire implementation from baseline. Then PACK produces the final proofpack as before.

Key decisions:

Best-of-N, not last-of-N. The pipeline stores each iteration's artifacts in .signum/iterations/NN/. If iteration 3 is worse than iteration 2 (the repair broke something else), Signum rolls back to the best candidate. No blind faith that the latest fix is the best one.

Diff progression. On the first pass, reviewers see the full patch. On pass 2+, they see the full patch plus the iteration delta with an instruction to focus on what changed in the repair. This saves tokens and reduces noise. If the delta exceeds 80% of the full patch — fallback to full-diff-only (the repair is too large to review incrementally).

Early stop. If two consecutive iterations show no improvement — stop. Maximum 20 iterations (configurable via SIGNUM_AUDIT_MAX_ITERATIONS). In practice, convergence happens in 2-3 passes.

Finding fingerprints. Each finding gets a fingerprint based on file, line range, and issue type. Between iterations, Signum classifies every finding as resolved, persisting, or new. The synthesizer uses this to evaluate actual progress — not just "fewer findings" but "which specific issues were fixed and which appeared."

Hallucination filtering. If a reviewer cites a line that doesn't exist in the diff or references a file outside scope, the finding is discarded. This is the same mechanism described in the ecosystem post: every AI finding is validated against the actual diff before it enters the repair loop.

Loop 2: contract — self-critique

The code loop fixes implementation. But what if the problem is upstream — in the contract itself? A perfect implementation of a flawed spec is still a failure.

For medium and high-risk tasks, the contractor now runs a 4-pass self-critique before showing the contract to the human:

Ambiguity review — scans goal, acceptance criteria, and scope for ambiguous phrasing
Missing-input review — checks for missing preconditions, records clarification decisions
Contradiction review — detects contradictions between goal, scope, and risk level
Coverage review — reconstructs the goal from acceptance criteria, checks coverage, documents assumption provenance

Maximum 2 auto-revision rounds. If the verdict remains "no-go" after round 2 — escalation to human. Low-risk tasks skip all 4 passes entirely.

The result is written to the contract:

{
  "readinessForPlanning": {
    "verdict": "go",
    "summary": "All ambiguities resolved. AC3 coverage gap closed in round 1."
  },
  "ambiguityCandidates": [...],
  "contradictionsFound": [],
  "clarificationDecisions": [...]
}

The human sees both the verdict and the full path to it. Not "the contractor decided the contract is good" — but what problems were found and how they were resolved.

Loop 3: project — shared context across contracts

The code loop iterates within a single task. The contract loop iterates within a single spec. But previous Signum versions treated contracts in isolation. Each task — a separate universe. In a real project, tasks are connected: they touch the same files, depend on the same decisions, use the same terminology.

Three new layers:

Project intent. A project.intent.md file at the project root — goal, capabilities, non-goals, personas. The contractor reads it before generating a contract. Project non-goals become scope constraints on the contract. For medium and high-risk tasks, missing intent is a blocking question.

Glossary. project.glossary.json defines canonical terms and forbidden synonyms. glossary_check scans contracts for alias usage, terminology_consistency_check catches synonym proliferation across active contracts. Both are WARN, not block.

Cross-contract coherence. overlap_check detects inScope overlap between active contracts (two contracts touching the same file — conflict?). assumption_check flags contradictions in assumptions across related contracts. adr_check warns when relevant ADRs exist but aren't referenced in the contract.

Plus upstream staleness detection: the contractor hashes the contents of project.intent.md and the glossary at contract creation. If upstream files change by execution time — warning (or block, if stalenessPolicy: "block" is configured).

Architecture v4.6.1: checks as standalone scripts

A bonus from the latest refactoring: 6 inline checks that lived inside the orchestrator are now extracted into standalone testable scripts in lib/:

lib/glossary-check.sh      — forbidden synonym scan
lib/terminology-check.sh   — cross-contract term proliferation
lib/overlap-check.sh       — inScope overlap detection
lib/assumption-check.sh    — assumption contradiction detection
lib/adr-check.sh           — ADR relevance check
lib/staleness-check.sh     — upstream artifact staleness

All scripts: JSON stdout, stderr for diagnostics, exit 0 for any check result (non-zero only for infrastructure errors). The orchestrator calls scripts and decides whether to block or warn. Separation of concerns: the script checks, the orchestrator decides.

What this changes

Signum v3 answered "is this correct?" with a binary yes/no. v4.6 answers "can this be made correct?" — and if yes, does it.

In our early runs, a significant share of tasks that v3 blocked with AUTO_BLOCK, v4.6 brings to AUTO_OK in 2-3 iterations without human involvement. The tasks that still block tend to be real spec or architecture problems — exactly what should escalate to a human.

Verification isn't a gate at the end of the pipeline. It's a loop. The same principle as human code review: finding → fix → re-check. The difference is that AI can run this loop in seconds, not days.

Sources

signum on GitHub
The Contract Is the Context — first post in series
AI Writes Code. Where's the Proof? — second post in series
skill7.dev/development/signum
emporium — plugin marketplace

Skillpulse: Your AI Skills Are Flying Blind Without Telemetry

Vitaly D. — Wed, 11 Mar 2026 11:48:22 +0000

You install 16 skills. You see them fire. But here's the question nobody asks: did the model actually follow them?

I reviewed every telemetry tool in the Claude Code ecosystem -- built-in OTel, claude_telemetry, claude-code-otel, even the skills.sh platform metrics. None of them track skill adherence. They tell you a skill was loaded, but not whether the model executed its instructions.

So I built skillpulse.

The gap

Claude Code's built-in OpenTelemetry (via CLAUDE_CODE_ENABLE_TELEMETRY=1) captures general metrics: session duration, tokens, cost, tool calls. With OTEL_LOG_TOOL_DETAILS=1, it even records skill_name in tool result events. But that's a load signal, not a follow signal.

The difference matters. A skill can load successfully and be completely ignored by the model. Without tracking adherence, you're optimizing in the dark.

Tool	Tracks loading	Tracks following
Built-in OTel	Yes	No
claude_telemetry	No	No
claude-code-otel	No	No
skills.sh	Install count only	No
skillpulse	Yes	Yes (planned)

How it works

Skillpulse is a PostToolUse hook that fires on every Skill tool call. It writes one JSONL line per activation:

{
  "skill_id": "signum:signum",
  "timestamp": "2026-03-11T08:48:18Z",
  "session_id": "20260311_114818_93074",
  "loaded": true,
  "followed": null,
  "plugin_name": "skillpulse"
}

The implementation is 60 lines of bash. Some design choices:

2-second watchdog. PostToolUse hooks run synchronously -- a hung hook blocks the entire session. Skillpulse spawns a self-kill watchdog that sends SIGKILL after 2 seconds. In practice, the hook finishes in <50ms.

Skill-only filter. PostToolUse fires for every tool call -- Read, Write, Bash, everything. Skillpulse checks tool_name == "Skill" and exits immediately for anything else. Zero overhead on non-skill calls.

Append-only JSONL. No database, no rotation, no config. One file at ~/.local/share/emporium/activation.jsonl. Survives crashes, easy to inspect, trivial to back up.

What I learned from 4 entries

Yes, four. Skillpulse was created on March 4th and then... not installed. Classic. I fixed that today.

But even 4 entries told me something useful:

Skill                       Acts  Sess  Load%  Age
-----------------------------------------------------
herald:news-digest             2     2  100%   7d
arbiter                        1     1  100%   7d
signum:signum                  1     1  100%   7d

Three skills account for all activations. I have 16 installed. That's an 81% dormancy rate. Most of my skills are dead weight consuming context tokens.

The aggregator

A 90-line Python script reads the JSONL and produces per-skill stats:

python3 scripts/aggregate.py          # table output
python3 scripts/aggregate.py --json   # for pipelines

No dependencies. Reads timestamps, groups by skill_id, computes frequency, unique sessions, loaded rate, and days since last activation.

Where this is going

Skillpulse is Phase 1 of a larger pipeline I'm calling EvoSkill -- skills that improve themselves based on usage data.

The pipeline:

skillpulse (log)
    |
aggregator (stats)
    |
bench (test against tasks)
    |
evolver (propose improvements)

The followed field is currently null -- it needs tool-pattern fingerprinting to determine if the model executed a skill's expected behavior. That's the hard part, and I'm deliberately deferring it until I have enough activation data to validate the approach.

Some things I'm explicitly skipping at my current scale (<20 sessions/week, 16 skills):

Hotelling T-squared drift detection -- need 50+ trajectories per skill
Bayesian calibration -- need labeled outcome data
Hierarchical routing -- relevant at 50+ skills
Automated gates -- human review is my gate for now

The research backing this is in EvoSkill, AutoSkill, and the ASI drift framework.

Try it

claude mcp add-json skillpulse '{"source": {"source": "url", "url": "https://github.com/heurema/skillpulse.git"}}'

Source: github.com/heurema/skillpulse

All data stays local. MIT licensed. Zero dependencies beyond bash and jq.

Environment is context: security auditing for AI agent workstations

Vitaly D. — Wed, 11 Mar 2026 07:25:37 +0000

We talk a lot about prompts, tools, and evals. But almost nobody audits the environment where the AI agent actually runs.

The agent sees your .env files. Your .mcp.json with hardcoded tokens. Your settings.json with "permissions": "allow". Your plugins, hooks, configs. All of this is operational context, and it directly determines what the agent can do. If an API key sits in plaintext - the agent will read it. If no PreToolUse hook is configured - any Bash command runs unfiltered. If .claudeignore is missing - the agent reads every file in the project.

These are not hypothetical risks. This is the default configuration.

The attack surface nobody measures

Run a mental audit of your workstation:

Secrets. How many .env files do your projects have? Are they in .gitignore? Any secrets in git history? When you launch Claude Code, the shell already contains ANTHROPIC_API_KEY, AWS_SECRET_ACCESS_KEY, GITHUB_TOKEN - the agent can run printenv and see everything.

MCP servers. Open .mcp.json. Tokens right there in JSON? Server versions unpinned? No allowedTools to restrict available tools? Every MCP server is a child process that inherits all environment variables.

Hooks. Is there a PreToolUse hook filtering dangerous Bash commands? What about subagents? Claude Code doesn't inherit parent hooks in subagents - that's a documented bug, not a feature.

Trust boundaries. Do you have .claudeignore? Is permission mode default or acceptEdits? How many plugins are installed and which ones have hooks?

Each of these questions is binary: yes or no, safe or not. They can be checked deterministically, without an LLM, without interpretation.

Environment as context

In context engineering, we turn the implicit into the explicit. Prompts, instructions, tools - everything becomes structured context that shapes agent behavior.

But the runtime environment is also context. When an agent launches in a shell with direnv-loaded secrets, it gets access not because you designed it that way, but because nobody checked. When an MCP server starts without allowedTools, the agent gets access to every tool - not because it's needed, but because that's the default.

Workstation security posture is implicit context. And as long as it's implicit, you can't manage it.

Sentinel: deterministic audit

Sentinel is a Claude Code plugin that runs 18 checks across six categories:

Category	Checks	What it looks for
secrets	5	Plaintext keys, `.env` without `.gitignore`, secrets in git history, runtime env vars, dotfiles
mcp	3	Tokens in `.mcp.json`, missing `allowedTools`, unpinned server versions
plugins	3	Registry drift, scope leakage, unverified plugins
hooks	2	Missing `PreToolUse` guard, subagent hook gap
trust	3	No `.claudeignore`, broad permissions, injection surface
config	2	Insecure defaults, stale sessions

Each check is a standalone POSIX sh script outputting JSON. No LLM. No heuristics. grep finds a plaintext token in .mcp.json - or it doesn't. stat checks file permissions - 600 or not. Results are reproducible.

LOAD > VALIDATE > PLAN > RUN > NORMALIZE > ASSESS > PERSIST > RENDER

Output is a JSON report and a terminal scorecard:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  sentinel audit - run_20260311T120000Z
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Category     Score   Checks
  secrets       40/100  ██░░░░░░░░  2/5 pass
  mcp           67/100  ██████░░░░  2/3 pass
  plugins      100/100  ██████████  3/3 pass
  hooks          0/100  ░░░░░░░░░░  0/2 pass
  trust         60/100  ██████░░░░  2/3 pass
  config        50/100  █████░░░░░  1/2 pass

  Total: 47/100    Verdict: FAIL
  Reliability: 1.00 (18/18 checks ran)

Two independent metrics: score (0-100) for security posture, reliability (0.0-1.0) for how much of the audit actually ran. A score of 95 with reliability 0.4 is not trustworthy.

What I found

The first sentinel run on my workstation scored 47 out of 100. Real findings:

8 plaintext .env files with API keys across 4 work contexts
ANTHROPIC_API_KEY, OPENAI_API_KEY, and 12 more secrets accessible via printenv in the current shell
3 MCP servers with tokens in .mcp.json
Zero PreToolUse hooks - any Bash command runs without filtering
Missing .claudeignore in several projects

None of these were accidental. This is the result of standard setup: install Claude Code, add MCP servers, start working. Environment security is not what you think about during installation.

Remediation: not just finding, but fixing

/sentinel-fix <run_id> walks through each FAIL/WARN finding and shows:

What - problem description with redacted evidence
Why - risk explanation
How - specific command to fix

Commands come from the check registry, not generated by an LLM. sentinel-fix never auto-executes - it only suggests. Risk badge for each action: safe, caution, dangerous.

After fixing, run /sentinel-diff to compare reports. Each finding has a stable finding_id (SHA-256 of check_id|category|evidence_paths), enabling tracking: new issues, resolved issues, status changes.

What this means for context engineering

We spend effort ensuring the agent gets the right system prompt, the right tools, the right documentation. But the runtime environment is context too - just unmanaged. A plaintext secret in .env is not a security problem in a vacuum. It's implicit context that determines what the agent can do, beyond what it's supposed to do.

A security audit for the AI workstation is not paranoia. It's the same practice as dependency checking, linting, CI pipelines. There just wasn't a tool for this new class of risks.

Install

claude plugin add heurema/sentinel

Then:

/sentinel              # full audit
/sentinel --deep       # audit + LLM risk explanation
/sentinel-fix <run_id> # guided remediation
/sentinel-diff         # compare with previous audit

Research Agents Lie. The Fix Is Adversarial Verification.

Vitaly D. — Tue, 10 Mar 2026 14:03:44 +0000

You asked an AI research assistant a detailed question and got a confident multi-page answer with citations. Some of those citations don't exist. Several facts contradict each other. The synthesis reads well — it's structured, well-argued, fluent. It's also built on claims no one verified.

This is not an edge case. It's the default behavior of every research agent I've looked at.

The problem

Research agents optimize for coherence, not correctness. The workflow is always some variation of: gather sources → read and chunk → synthesize. The final output is shaped by what reads well together, not what's actually true.

The failure mode is subtle. You get a report that passes casual inspection. No obvious hallucinations, reasonable citations, plausible numbers. But if you trace the actual claims — "X was released in 2023", "Y's accuracy is 94%", "Z approach outperforms alternatives by 40%" — a significant fraction are wrong, unverifiable, or sourced from a single origin that all the other citations are copying.

This is worse than no research. It produces false confidence. You walk away with a mental model that has errors baked in at the foundation level.

The landscape

I went through seven OSS research frameworks to understand what they actually do: node-deepresearch, deep-research (dzhng), GPT-Researcher (assafelovic), STORM (Stanford), plus the commercial systems — OpenAI Deep Research, Gemini Deep Research, Perplexity.

Every one of them follows the same pattern: decompose topic → search and fetch → synthesize. Some have multi-step retrieval, some have recursive query expansion, some have beautiful citation formatting. None verify claims adversarially after synthesis.

Perplexity leads on speed and gets 93.9% on SimpleQA. OpenAI and Gemini lead on depth. GPT-Researcher won CMU's DeepResearchGym benchmark. These are real achievements. But the benchmark question is "did the final report contain the right answer?" — not "what percentage of atomic claims in the report are independently verified?"

That's the gap.

Delve

Delve is a Claude Code plugin built as a pure SKILL.md file. No binaries, no scripts — just orchestration logic and reference prompts. Five stages:

SCAN → DECOMPOSE → DIVE → VERIFY → SYNTHESIZE

The first three stages are table stakes: scan existing sources and memory, decompose the topic into 2-6 independent sub-questions, dispatch parallel research subagents (2-6 depending on --depth) to investigate each one. Standard research pipeline, done well.

The fourth stage is where it differs.

VERIFY: adversarial claim-level checking

After DIVE completes, before synthesis touches anything, VERIFY runs independently:

Step 1: Claim extraction. All dive outputs get decomposed into atomic claims. Not summaries — individual assertions. "X library achieves 94% accuracy on benchmark Y." "Project Z was last updated in 2024." "The approach outperforms alternatives by a factor of 3." Each claim gets a c_<hash> identifier and a classification: factual, quantitative, time-sensitive, methodology, opinion.

Step 2: Adversarial verification. Independent subagents receive batches of claims. The prompt framing is explicit: find flaws, don't confirm. Crucially, these agents do not see the original research context — no anchoring to the synthesis they're checking. They go look for independent evidence.

verdict: "verified" | "contested" | "rejected" | "uncertain"

Each verdict includes evidence and sources. Source independence is checked: three blog posts copying the same press release don't count as three confirmations.

Step 3: Synthesis with explicit provenance. Contested claims get both sides presented with evidence. Rejected claims are excluded or flagged. If more than 30% of claims are contested, the output is labeled draft. The report includes a Methodology section showing which stages ran, how many agents, timing, and the quality verdict.

Quality model

The output carries two orthogonal labels:

Verification status: verified (≥80% claims checked, 0 failed P0) / partially-verified / unverified
Completion status: complete / incomplete / draft / synthesis_only / no_evidence

This is honest accounting. If verification was skipped because sources were unavailable, you know. If the report is flagged draft because the landscape is contested, you know that too.

Pipeline diagram

/delve "autoresearch landscape" --depth medium

SCAN     [~25s]   → 12 sources found, decision: full-run
DECOMPOSE [~5s]   → 4 sub-questions decomposed
         ↓ HITL checkpoint: approve/edit sub-questions
DIVE     [~4min]  → 4 agents in parallel (background)
         ↓ all P0 completed, coverage 1.0
VERIFY   [~90s]   → 47 claims extracted, 3 agents
         ↓ 42 verified, 4 uncertain, 1 rejected
SYNTHESIZE[~15s]  → report written
         ↓
docs/research/2026-03-10-autoresearch-landscape-a1b2.md
quality: verified / complete

Usage

/delve "autoresearch landscape" --depth medium
/delve "WASM runtimes for edge" --quick          # scan + synthesize only, ~90s
/delve "security audit approach X" --providers claude  # single-model, sensitive topic
/delve resume                                    # resume interrupted run
/delve status                                    # list recent runs

Resume support is file-based with events.jsonl as the canonical log. If the orchestrator crashes mid-DIVE, /delve resume reuses completed worker outputs and continues from where it stopped.

The design insight

The standard framing is "research is a retrieval problem." Add more sources, better chunking, smarter query expansion. This produces marginal improvements on the coherence metric while leaving the correctness problem untouched.

Delve treats research as a verification problem. The VERIFY stage adds 40-60% to total run time. The tradeoff is explicit: you get a report where the trust model is different. Not "the AI synthesized this confidently" but "these claims were checked by agents with adversarial prompting and independent access."

That said — an honest admission. Verification quality depends on what's available. Some domains have sparse or low-quality web coverage. Time-sensitive facts from internal systems or paywalled sources may come back uncertain regardless of how many agents look. The quality model makes this explicit rather than hiding it behind confident prose.

The --providers claude flag handles sensitive topics: single-model mode where external subagent dispatch is blocked. Maximum verification label in that mode is partially-verified — same-model verification isn't structurally independent, and the report says so.

Install

claude plugin marketplace add heurema/emporium
claude plugin install delve@emporium

GitHub: github.com/heurema/delve
Plugin page: skill7.dev/plugins/delve

From Plugin to Product: How Herald Became Sift and Why the Data Model Changed Everything

Vitaly D. — Tue, 10 Mar 2026 11:13:15 +0000

Herald was a Python plugin that collected RSS feeds and Hacker News, clustered articles by title similarity, and generated Markdown briefs. It ran locally, required zero API keys, and did exactly what it was supposed to do.

Then we tried to make it useful for real work.

What Broke

Herald's core assumption was that articles are the primary unit. You collect articles, deduplicate by URL, cluster by title similarity, score by source weight and recency, and project a Markdown brief. This works for a developer reading morning news.

It doesn't work when:

An agent needs to know whether "Coinbase lists TOKEN" and "TOKEN now available on Coinbase" are the same real-world fact
You need confidence levels, not just scores - how many independent sources confirm this event?
The system must update when new evidence arrives, not just when the next cron runs
A downstream automation needs typed fields (assets: ["BTC"], event_type: "listing") instead of parsing Markdown

The fundamental problem: Herald modeled content. The world it was trying to represent contained events.

Articles vs Events

In Herald, the main object was a Story - a cluster of articles with similar titles:

Story: "Python 3.14 Released"
  - Article from HN (score: 342)
  - Article from Simon Willison's blog
  - Article from Python.org

The cluster was the output. The articles were the atoms.

In Sift, the main object is an Event - a structured fact pattern with provenance:

{
  "event_id": "evt_2026030801",
  "title": "Bitcoin ETF daily inflow hits $1.2B record",
  "event_type": "market_milestone",
  "assets": ["BTC"],
  "topics": ["etf", "institutional"],
  "importance_score": 0.87,
  "confidence_score": 0.93,
  "source_cluster_size": 7,
  "published_at": "2026-03-08T14:22:00Z"
}

The event is the truth. The articles that support it are evidence. This distinction matters because:

Events can be updated. When a new article confirms or contradicts an event, the confidence score changes. Herald's stories were frozen after clustering.
Events have typed metadata. assets, topics, event_type are queryable fields, not bag-of-words extracted from titles.
Events separate importance from confidence. A rumor about a Bitcoin ETF approval is high-importance but low-confidence. Herald couldn't express this - a story was either in the brief or it wasn't.

JSON as Truth, Markdown as Projection

Herald's output was a Markdown file. That was the product. Agents read Markdown, humans read Markdown, done.

Sift inverts this. The canonical record is a typed JSON event. Everything else is a projection:

The human digest? A Markdown rendering of the top events in a time window.
The agent context? The same JSON, filtered by asset and topic.
The WebSocket stream? Push notifications when an event is upserted.
The llms.txt? A static slice for LLM-friendly discovery.

This isn't theoretical purity. It's operational: when the API returns an event, the browser workspace and the CLI both render from the same record. There's no "browser version" and "agent version" of the truth.

Python to Go

Herald was ~1,200 lines of Python. Sift is ~7,000 lines of Go. The rewrite wasn't about performance benchmarks.

Three things drove the language change:

Single binary deployment. Sift Pro is a hosted service running on a Linux node. go build produces one binary. No virtualenv, no pip, no runtime. The systemd unit file is trivial.
Shared pipeline. The same Go packages (internal/pipeline, internal/event, internal/ingest) power both the local sift CLI and the hosted siftd server. In Python, sharing code between a CLI tool and an async web server meant fighting import paths and event loops.
Concurrency for real-time. Sift's hosted mode runs a scheduler, HTTP API, and WebSocket broadcaster in one process. Go's goroutines and channels made this straightforward. Python's asyncio could do it, but the cognitive overhead was higher for a small team.

The trade-off: Go's type system caught things earlier but made rapid prototyping slower. Herald's first version was built in a day. Sift's v0 took a week.

Local Free + Hosted Pro

Herald was local-only by design. Sift keeps a local tier and adds a hosted one.

Sift Free (local CLI):

SQLite storage under ~/.sift/
User-controlled sync schedule
Same event model, same digest projections
Full ownership of your data

Sift Pro ($5/mo):

Hosted Postgres store with 30-day retention
Autonomous sync every 5 minutes
Authenticated REST API (/v1/events, /v1/digests)
WebSocket stream for real-time updates
Zitadel-backed accounts

The split matters because the free tier is a real product, not a crippled teaser. A developer who wants local crypto news intelligence gets it. A developer who wants always-on event delivery for their agents pays for the hosted runtime.

What Agents Actually Need

The deeper lesson from Herald to Sift is about what agents need from a news system.

Herald gave agents Markdown. It was human-readable, which seemed like a feature. But agents don't need prose. They need:

Typed records they can filter without parsing natural language
Confidence signals so they can decide whether to act on a report
Stable IDs so they can reference events across sessions
Push delivery so they don't poll for updates
Provenance so they can trace a claim to its sources

This is a context engineering problem. The question isn't "what text do I feed the model." It's "what structured context does the agent need to make a decision."

Herald's Markdown brief was a human projection pretending to be agent context. Sift's JSON events are agent context that happens to have a human projection.

The Provenance Rule

One principle from Sift's manifesto drove more design decisions than any other: no claim without provenance.

Every event tracks which sources contributed to it. The source_cluster_size field tells you how many independent sources confirmed the event. The confidence_score is derived from source agreement, not from a language model's guess.

This means Sift can honestly say: "7 sources reported this ETF milestone, confidence 0.93" vs "1 blog mentioned this rumor, confidence 0.41." Herald couldn't distinguish these - both would appear as stories with different scores, but the scoring didn't separate importance from evidence quality.

The practical impact: downstream agents can set thresholds. "Only act on events with confidence > 0.8 and source_cluster_size > 3." That's a policy an automation can enforce. "Only act on stories with score > 50" is a guess.

What Stayed the Same

Not everything changed. The core insight from Herald survived intact: clustering related reports into a single unit is the most valuable transformation in a news pipeline. Whether you call it a story or an event, deduplication-by-meaning is what turns 45 articles into 27 actionable items.

The scoring formula changed, but the principle didn't: source weight matters, recency matters, cross-source confirmation matters.

And the local-first instinct survived. Sift Pro exists because some users need it, not because local-first was wrong. The free CLI proves the data model works without a cloud dependency.

Try It

Sift is live at skill7.dev/sift. The local CLI is open source.

Herald remains available as a Claude Code plugin for developers who want configurable, multi-topic news intelligence without accounts or subscriptions.

DEV Community: Vitaly D.

Signum Can Now Be Installed in Codex App as a Plugin

What changed

Why this is not just a port

Metadata is now part of the release surface

What this changes for context engineering

Insight

Sources

Stop Writing CLAUDE.md From Scratch

Why one file isn't enough

The six-file harness

What deterministic actually means here

Brownfield-safe by default

What the scaffold gives you (and what it doesn't)

Why six and not three (or twelve)

What it doesn't replace

Try it

AI Agents Need Permission Boundaries, Not Personalities

The coordination trap

What trustworthy agent work actually needs

Why punk resets the shape

Substrate first, shell second

plot, cut, and gate are not vibes

Durable work state matters more than chat history

Why the one-face shell still matters

A better shape for agent engineering

Sources

My AI Agent Said 'Done.' It Skipped an Entire Acceptance Criterion.

Why "SUCCESS" was wrong

The missing trust boundary

What we changed: boundary verification

What it catches

What it does not prove

The broader question

Your AI Spec Is Already Stale

The problem with specs in AI-assisted codebases

What I found

Herald: v1 ghosts in a v2 codebase

Delve: missing shipped features

Why AI makes this worse

The reverse diff

What this means for your projects

The implementation

What a Formal Verification Agent Taught Me About Code Audit

Three Patterns from Leanstral

Policy Scanner

Typed Diagnostics

Parallel Repair Lanes

Three More Changes

What I Learned

Switching AI CLIs Without Losing 32 Skills: Why I Built nex

The problem: skills are trapped

What nex does

Seeing everything at once

Drift detection

Profiles as desired state

Architecture: layered source of truth

Release automation

Health monitoring

What I learned

Try it

One Pass Isn't Enough: How Signum Learned to Fix Its Own Code

The problem: one-shot verification

Loop 1: code — iterative audit

Loop 2: contract — self-critique

Loop 3: project — shared context across contracts

Architecture v4.6.1: checks as standalone scripts

What this changes

Sources

Skillpulse: Your AI Skills Are Flying Blind Without Telemetry

The gap

How it works

What I learned from 4 entries

The aggregator

Where this is going

Try it

Environment is context: security auditing for AI agent workstations

The attack surface nobody measures

Environment as context

Sentinel: deterministic audit

Why `punk` resets the shape

`plot`, `cut`, and `gate` are not vibes