DEV Community: Jaewon Jang

I built a milestone-tracking dashboard for Claude Code – here's how it works

Jaewon Jang — Fri, 29 May 2026 05:30:43 +0000

When I first started running Claude Code as a persistent agent, I thought the hard part was going to be the AI. It wasn't. The hard part was knowing what the AI was doing.

I'd queue up a dozen tasks before bed — fix this bug, draft that document, analyze this dataset — and by morning Claude had touched maybe four of them. The rest were silently waiting, or had been blocked on a clarification it couldn't get, or had finished but I had no idea how it finished. There was no dashboard. No task list. No way to see the queue.

So I built one.

What is claude-ns-hub?

It's a FastAPI server that runs locally on your machine (or home server) and provides a single-page dashboard for managing milestones across multiple projects. I call each task a "stone" — a unit of work with a simple lifecycle:

queued → (agent works on it) → pending_confirmation → done

When a stone is queued, you can click Execute on the dashboard. This spawns a tmux session running claude, injects a prompt listing all queued stones, and the agent starts working. When it finishes a stone, it PATCHes the status to pending_confirmation and posts a 1-3 line completion comment. You review it, confirm, and it's done.
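The lifecycle above is a small state machine. Here is a minimal sketch of it, assuming only the three statuses named in this post; `StoneStatus`, `can_advance`, and `advance` are illustrative names, not the hub's actual code.

```python
from enum import Enum


class StoneStatus(str, Enum):
    QUEUED = "queued"
    PENDING_CONFIRMATION = "pending_confirmation"
    DONE = "done"


# Forward transitions taken from the lifecycle line above.
ALLOWED = {
    StoneStatus.QUEUED: {StoneStatus.PENDING_CONFIRMATION},
    StoneStatus.PENDING_CONFIRMATION: {StoneStatus.DONE},
    StoneStatus.DONE: set(),
}


def can_advance(current: StoneStatus, target: StoneStatus) -> bool:
    """True if moving from current to target follows the lifecycle."""
    return target in ALLOWED[current]


def advance(current: StoneStatus, target: StoneStatus) -> StoneStatus:
    """Return the new status, or raise if the move skips a step."""
    if not can_advance(current, target):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the allowed transitions in one table means the agent's PATCH and my confirmation click can both be validated by the same check.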

The parts that actually matter

The stop-hook. Each time a Claude Code session ends, a Python hook re-checks for remaining queued stones. If any exist, it re-injects the dispatch prompt and Claude keeps going, for up to 5 continuation loops per session. This means I can queue 20 stones, start the agent, and come back later.
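The decision the stop-hook makes can be sketched as a pure function. This is a simplified illustration, assuming the hook already knows the queued stone titles and how many loops have run; `next_action` and `build_dispatch_prompt` are hypothetical names, and how the prompt gets handed back to Claude Code is left out.

```python
MAX_CONTINUATIONS = 5  # the per-session cap mentioned above


def build_dispatch_prompt(queued_titles: list[str]) -> str:
    """Build a prompt that lists every stone still waiting in the queue."""
    lines = ["Work through these queued stones in order:"]
    lines += [f"{i}. {title}" for i, title in enumerate(queued_titles, start=1)]
    return "\n".join(lines)


def next_action(queued_titles: list[str], loops_used: int) -> dict:
    """Decide whether the agent should keep going after a session ends."""
    if not queued_titles:
        return {"continue": False, "reason": "queue empty"}
    if loops_used >= MAX_CONTINUATIONS:
        return {"continue": False, "reason": "continuation cap reached"}
    return {
        "continue": True,
        "prompt": build_dispatch_prompt(queued_titles),
        "loops_used": loops_used + 1,
    }
```

The cap is what keeps a stone that can never finish from looping the agent forever.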

BM25 context injection. When spawning an agent, the server queries a local BM25 index of past milestone decisions. This gives Claude relevant history without me having to hand-craft a context document every time.
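For readers who haven't met BM25, here is a minimal, dependency-free sketch of Okapi BM25 ranking over short decision notes. The whitespace tokenizer, the `k1` and `b` defaults, and the sample notes are assumptions for illustration; the hub's real index may be built differently.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Whitespace tokenizer; a real index would also strip punctuation.
    return text.lower().split()


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs, highest score first."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0
    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for idx, toks in enumerate(tokenized):
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


decisions = [
    "switched milestone storage to sqlite for faster upserts",
    "added ntfy push notifications when the agent goes idle",
    "tweaked dashboard css",
]
ranked = bm25_rank("sqlite upserts", decisions)
```

The top few entries of `ranked` are what would be pasted into the agent's prompt as history.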

SQLite as the primary store. Everything is written to ns-events.db. A per-row UPSERT takes ~1ms vs ~350ms for a full YAML rewrite. YAML files are written asynchronously as a backup.
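A per-row UPSERT looks roughly like this with Python's built-in `sqlite3` module. The `stones` table and its columns are assumptions for the example (the real schema in ns-events.db is not shown in this post), and an in-memory database stands in for the file.

```python
import sqlite3

# In-memory database for the example; the post's store is ns-events.db on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stones (id TEXT PRIMARY KEY, title TEXT NOT NULL, status TEXT NOT NULL)"
)


def upsert_stone(conn: sqlite3.Connection, stone_id: str, title: str, status: str) -> None:
    """Insert the stone, or update it in place if the id already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO stones (id, title, status) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                title = excluded.title,
                status = excluded.status
            """,
            (stone_id, title, status),
        )


upsert_stone(conn, "s1", "Fix login bug", "queued")
upsert_stone(conn, "s1", "Fix login bug", "pending_confirmation")
rows = conn.execute("SELECT id, status FROM stones").fetchall()
```

Only the one changed row is written, which is why this is so much cheaper than rewriting a whole YAML file. The `ON CONFLICT ... DO UPDATE` syntax needs SQLite 3.24 or newer.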

Push notifications via ntfy.sh. When the agent goes idle, the server POSTs to ntfy.sh and I get a notification on my phone. No custom notification server needed.
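Publishing to ntfy.sh is a plain HTTP POST of the message body to a topic URL. Below is a minimal sketch using only the standard library; the topic name, the default title, and both function names are made up for the example.

```python
import urllib.request


def build_ntfy_request(topic: str, message: str, title: str = "claude-ns-hub") -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url=f"https://ntfy.sh/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title},  # optional notification title
        method="POST",
    )


def notify(topic: str, message: str) -> int:
    """Send the notification and return the HTTP status code."""
    with urllib.request.urlopen(build_ntfy_request(topic, message), timeout=10) as resp:
        return resp.status


request = build_ntfy_request("my-hub-topic", "Agent is idle: 3 stones pending confirmation")
```

Anyone who knows the topic name can subscribe to it, so pick something hard to guess.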

How to try it

pip install claude-ns-hub
hub

The dashboard opens at http://localhost:9001. Add a project, create some stones, point it at your Claude Code project directory, and click Execute.

Why I'm sharing this

I've been building this in the open since early 2026 and I've hit a point where I'm using it daily but it's still rough around the edges. The core architecture is solid. What it needs now is people who run into different problems than I do.

If you use Claude Code and wish you had better task visibility, I'd love for you to try it.

GitHub: https://github.com/jaytoone/claude-ns-hub

Contributions welcome — bug reports, fixes, or a "have you considered X?" issue. The contribution guide is in CONTRIBUTING.md.

CTX: I gave Claude Code a memory that actually works

Jaewon Jang — Sun, 03 May 2026 10:19:23 +0000

The problem

Claude Code resets every session. There is no built-in memory. You open a new terminal, start coding, and the model has no idea what you decided yesterday, what architecture you settled on, or which files matter. You explain it again. Every time.

I spent three months building something to fix this.

What CTX does

CTX hooks into Claude Code's UserPromptSubmit event. Before every prompt, three things happen — in under 1ms:

G1 — Decision memory
Parses your git log and surfaces the most relevant past decisions. "Why did we switch to BM25?" "What was the reasoning behind this architecture?" CTX pulls those commit messages and injects them before you even ask.

G2 — Code and doc search
BM25 search across your entire codebase and markdown docs. When you ask about a function, the right files are already in context. No more "I can't find that file" hallucinations.

CM — Chat memory vault
A local SQLite database of past conversations, hybrid-searched (BM25 + optional vector). The things you explained once, you should only have to explain once.

The numbers

I ran rigorous benchmarks — not synthetic toy tests.

Memory recall (MAB, N=50)

System	Recall	Wilson CI 95%
None (baseline)	0.00	[0.00, 0.07]
CTX	0.40	[0.28, 0.54]
CTX v2	0.58	[0.44, 0.71]
CTX v3	0.88	[0.762, 0.944]

CTX v3 vs baseline: McNemar p < 0.001. Statistically significant.
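The intervals in the table can be reproduced from the recall values and N=50. The sketch below computes the Wilson score interval and an exact two-sided McNemar p-value. It assumes recall × 50 gives whole hit counts (20, 29, 44), and that because baseline recall is 0.00, all 44 CTX v3 hits are discordant pairs in v3's favor. Whether the original analysis used the exact or the chi-square form of McNemar's test is not stated, so the exact form here is my choice.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


ctx_v1 = wilson_interval(20, 50)   # recall 0.40
ctx_v3 = wilson_interval(44, 50)   # recall 0.88
p_value = mcnemar_exact_p(0, 44)   # baseline-only hits: 0, v3-only hits: 44
```

With 44 discordant pairs all pointing the same way, the exact p-value is 2 × 0.5^44, far below 0.001.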

Real-world telemetry (10,000+ turns)

Overall utility rate: 39.6% (items injected that Claude actually cited)
CM block: 52.6% utility rate (highest — chat memory is the most cited)
G1 block: 39.6%
G2 docs: 27.8%

A 26 percentage point gap between KEYWORD (16%) and SEMANTIC (42%) queries confirms that retrieval method selection matters, and CTX routes the two query types differently.

How it installs

Option A — Native plugin (recommended, one step):

/plugin install ctx@pluto2060

Claude Code handles everything — venv, daemons, hooks. No terminal needed.

Option B — PyPI:

pip install ctx-retriever && ctx-install

This copies hooks to ~/.claude/hooks/ and patches settings.json atomically. The install was validated in a clean Docker container (ubuntu:22.04).
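"Atomically" here usually means: write the new file next to the old one, then swap it in with a single rename, so a crash never leaves a half-written settings.json. Below is a minimal sketch of that pattern. The function name and the shallow merge are assumptions, not the installer's actual code; a real installer would merge hook entries rather than overwrite top-level keys.

```python
import json
import os
import tempfile


def patch_json_atomically(path: str, updates: dict) -> dict:
    """Merge updates into a JSON file and replace it in one rename."""
    try:
        with open(path, encoding="utf-8") as fh:
            settings = json.load(fh)
    except FileNotFoundError:
        settings = {}
    settings.update(updates)  # shallow merge, for illustration only

    # The temp file must live in the same directory so the rename stays atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(settings, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return settings
```

If anything fails before `os.replace`, the original file is untouched and the temp file is cleaned up.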

Latest: v0.3.13 — vec-daemon isolated venv (no numpy/ABI conflicts). BGE reranker opt-in: CTX_BGE_ENABLE=1.

What it does not do

No cloud sync. Everything stays local.
No LLM calls. Retrieval is BM25 + SQLite by default; vector search and the BGE reranker are opt-in.
No mandatory telemetry. Opt-in only.
Does not replace Claude's context window — it fills it intelligently before you ask.

Demo video

A 39-second live demo of the dashboard (ctx-dashboard-demo.mp4, hosted on Google Drive) shows System Health, Knowledge Graph node interactions, and real-time events.

LLM agents don't degrade gradually — they cliff-edge. I built HarnessOS to survive it

Jaewon Jang — Wed, 01 Apr 2026 15:04:10 +0000

There's a concept gaining traction in AI systems engineering: Harness Engineering.

Not the testing tool. The idea: raw LLM capability is like raw power — high voltage,
hard to control, dangerous to run indefinitely. Harness Engineering is the discipline of
building the control structures that make that power usable at scale.
Context managers. Evaluation loops. Failure classifiers. Goal trackers. Memory tiers.

I think it's going to be one of the defining disciplines of serious AI systems work.
And I've been building a platform around it.

What I Built

HarnessOS is a scaffold/middleware system for running infinite autonomous tasks.

The key word is infinite. Not one task. Not one session. An agent that:

Runs continuously, across context window rotations
Evolves its own goals when it succeeds at the current one
Persists state across sessions without losing context
Classifies its own failures and routes them appropriately

This is the architecture:

HarnessOS
├── CTX                      ← context precision layer
│   └── LLM-free retrieval, 5.2% token budget, R@5=1.0 dependency recall
├── omc-live                 ← finite outer loop
│   └── 2-Wave strategy + self-evolving goals + episode memory
├── omc-live-infinite        ← infinite outer loop
│   └── context rotation, world model, no iteration cap
├── HalluMaze                ← hallucination management (in development)
└── [future layers]
    ├── Evaluation Layer
    ├── Safety Layer
    └── Memory Tier System

The Problem with Current Agent Frameworks

Most agent frameworks are built for tasks that complete in one session.

Spin up → run → done.

That's fine for demos. It breaks for real autonomous work:

Context exhaustion: At ~70% context capacity, agents start losing earlier decisions. They do not degrade gracefully. They cliff-edge: sudden degradation, not a gradual fade.
No goal evolution: An agent that succeeds at "write tests" has no mechanism to ask "what's the next improvement?" It just stops.
Failure is terminal: Most frameworks catch exceptions. Few classify them as transient, persistent, or a fundamental goal mismatch.

HarnessOS is built specifically to address all three.

What I Measured (The Empirical Foundation)

Before building anything, I ran controlled experiments on questions I couldn't find
good empirical answers to anywhere else.

Q1: How should autonomous agents reason about problems?

I compared hypothesis-driven debugging (observe → hypothesize → verify) against an engineering-only approach (pattern match → retry) on 12 bug scenarios.

Bug type	Engineering-only (attempts)	Hypothesis-driven (attempts)	Change
Simple	1.0	1.0	0%
Causal	1.75	1.0	-43%
Assumption	2.0	1.0	-50%

First-hypothesis accuracy: 100%. This is now the default reasoning strategy in omc-live.

Q2: Where do context limits actually hit?

I measured lost-in-the-middle degradation across 1K, 10K, 50K, and 100K token contexts.

Key finding: degradation is threshold-based, not gradual.

Agents don't slowly forget. They cliff-edge at a specific token length and fail silently.
This changed how omc-live-infinite handles context — it monitors budget and triggers
a safe rotation handoff at 70%, before the cliff.

Q3: Where do autonomous agents actually fail?

I ran OpenHands on 20-step coding tasks. The failures clustered into three groups:

Wrong task decomposition (incorrect sub-goals from the start)
Role non-compliance (agent exceeds defined scope)
Boundary violations (unexpected state mutations)

Predictable = preventable. The omc-failure-router classifies failures into these categories and routes each one to a matching recovery path instead of applying a generic retry.
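To make "classify, then route" concrete, here is a toy sketch built around the three failure clusters above. It is not the omc-failure-router itself: the `FailureReport` fields, the detection rules, and the route names are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    DECOMPOSITION = "wrong task decomposition"
    ROLE = "role non-compliance"
    BOUNDARY = "boundary violation"
    UNKNOWN = "unclassified"


@dataclass
class FailureReport:
    touched_paths: list[str]   # files the agent modified
    allowed_paths: list[str]   # path prefixes it was allowed to modify
    used_tools: list[str]      # tools the agent invoked
    allowed_tools: list[str]   # tools its role permits
    subgoals_done: int         # sub-goals it completed
    goal_met: bool             # did the top-level goal pass?


def classify(report: FailureReport) -> FailureKind:
    """Apply the checks from most to least specific."""
    prefixes = tuple(report.allowed_paths)
    if any(not path.startswith(prefixes) for path in report.touched_paths):
        return FailureKind.BOUNDARY
    if any(tool not in report.allowed_tools for tool in report.used_tools):
        return FailureKind.ROLE
    if report.subgoals_done > 0 and not report.goal_met:
        # Sub-goals finished but the goal still failed: the plan was wrong.
        return FailureKind.DECOMPOSITION
    return FailureKind.UNKNOWN


# Hypothetical recovery paths, one per category.
ROUTES = {
    FailureKind.BOUNDARY: "rollback_and_restrict",
    FailureKind.ROLE: "restate_role_and_retry",
    FailureKind.DECOMPOSITION: "replan_subgoals",
    FailureKind.UNKNOWN: "generic_retry",
}


def route(report: FailureReport) -> str:
    return ROUTES[classify(report)]
```

The generic retry is still there, but only as the fallback for failures that match none of the known patterns.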

The Architecture in Practice

omc-live: Finite Self-Evolving Loop

Wave 1: Strategy consultation (specialist agents, runs once)
   ↓
Wave 2: Execution loop
   ↓
Judgment: Goal achieved?
   ├── NO  → update goal tree, retry
   └── YES → Score (5 dimensions)
                ├── delta ≥ epsilon → EVOLVE goal, continue
                └── plateau × 3    → CONVERGED, stop

When the system succeeds, it scores the output, finds the weakest dimension,
generates an elevated goal, and continues — until quality plateaus.
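The evolve-or-converge rule in the diagram can be sketched in a few lines. This assumes each successful round produces one overall score and that a "plateau" means an improvement smaller than epsilon; the epsilon value, the dimension names, and both function names are illustrative.

```python
def weakest_dimension(dimension_scores: dict[str, float]) -> str:
    """Pick the lowest-scoring dimension as the target for the next goal."""
    return min(dimension_scores, key=dimension_scores.get)


def run_evolution(round_scores: list[float], epsilon: float = 0.01, patience: int = 3) -> tuple[int, str]:
    """Walk through per-round scores and report where the loop would stop.

    Returns (index of the last round processed, "CONVERGED" or "EVOLVING").
    """
    best = round_scores[0]
    plateau = 0
    for index, score in enumerate(round_scores[1:], start=1):
        if score - best >= epsilon:
            best = score   # meaningful gain: evolve the goal and keep going
            plateau = 0
        else:
            plateau += 1   # no meaningful gain this round
            if plateau >= patience:
                return index, "CONVERGED"
    return len(round_scores) - 1, "EVOLVING"


target = weakest_dimension(
    {"correctness": 0.9, "coverage": 0.4, "clarity": 0.7, "performance": 0.8, "safety": 0.85}
)
outcome = run_evolution([0.60, 0.70, 0.705, 0.70, 0.702])
```

Here the second round is a real gain, the next three are all within epsilon of the best score, and the loop stops after the third plateau.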

omc-live-infinite: No Iteration Cap

New mechanisms beyond the finite version:

Context rotation: at 70% budget → save state → fresh session → resume
World model: epistemic state layer that persists across rotations
Co-evolution feedback: strategy outcomes feed back into Wave 1 planning

This enables agents to work on complex goals for hours, not seconds.
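A rough sketch of the rotation handoff, assuming the harness can see how many tokens have been used. The 70% threshold comes from the list above; the state fields, the JSON format, and the function names are hypothetical.

```python
import json

ROTATION_THRESHOLD = 0.70  # rotate at 70% of the context budget


def should_rotate(tokens_used: int, context_limit: int, threshold: float = ROTATION_THRESHOLD) -> bool:
    """True once the session has consumed the threshold share of its context."""
    return tokens_used / context_limit >= threshold


def save_state(goal: str, decisions: list[str], open_items: list[str]) -> str:
    """Serialize what the next session needs in order to resume."""
    return json.dumps(
        {"goal": goal, "decisions": decisions, "open_items": open_items},
        indent=2,
    )


def resume(handoff: str) -> dict:
    """Load the saved state at the start of a fresh session."""
    return json.loads(handoff)


handoff = save_state(
    goal="raise test coverage",
    decisions=["use pytest", "mock the network layer"],
    open_items=["cover retry logic"],
)
state = resume(handoff)
```

The point of rotating early is that the handoff is written while the agent still remembers its earlier decisions.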

CTX: Precision Context Loading

Query classification → retrieval strategy selection:

EXPLICIT_SYMBOL → direct lookup
SEMANTIC_FUNCTIONALITY → embedding search
STRUCTURAL_RELATIONSHIP → dependency graph
RECENT_CHANGE → git recency

Result: 5.2% average token budget, R@5=1.0. No LLM calls for retrieval.
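The classify-then-route step can be illustrated with a small rule-based classifier. The four query types and their strategies come from the list above; the regular expressions are simple guesses for demonstration and are not CTX's actual rules.

```python
import re
from enum import Enum


class QueryType(Enum):
    EXPLICIT_SYMBOL = "direct lookup"
    SEMANTIC_FUNCTIONALITY = "embedding search"
    STRUCTURAL_RELATIONSHIP = "dependency graph"
    RECENT_CHANGE = "git recency"


# Hypothetical keyword patterns, checked from most to least specific.
RECENT = re.compile(r"\b(recent|recently|latest|last commit|changed)\b")
STRUCTURAL = re.compile(r"\b(imports?|depends?|calls?|callers?|inherits?|uses)\b")
SYMBOL = re.compile(r"`[^`]+`|\b\w+\(\)|\b[a-z]+_[a-z_]+\b|\b[a-z]+[A-Z]\w*\b")


def classify_query(query: str) -> QueryType:
    lowered = query.lower()
    if RECENT.search(lowered):
        return QueryType.RECENT_CHANGE
    if STRUCTURAL.search(lowered):
        return QueryType.STRUCTURAL_RELATIONSHIP
    if SYMBOL.search(query):  # case-sensitive so camelCase still matches
        return QueryType.EXPLICIT_SYMBOL
    return QueryType.SEMANTIC_FUNCTIONALITY


def retrieval_strategy(query: str) -> str:
    """Return the name of the retrieval strategy for this query."""
    return classify_query(query).value
```

For example, `retrieval_strategy("where is parse_git_log() defined?")` picks direct lookup, while a question with no symbol or keyword cue falls through to embedding search.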

Why "Harness Engineering" Is the Right Frame

A harness doesn't constrain power — it channels it.

LLMs have enormous capability. Without control structure, that capability is:
context-unaware, goal-unstable, failure-opaque, session-local.

HarnessOS adds the control structure. Not to limit the model — to make it usable
for work that spans hours, not seconds.

Current State & Quick Start

214 tests, 100% coverage. CTX and omc-live/infinite are stable and used daily.

git clone https://github.com/jaytoone/HarnessOS
cd HarnessOS
python3 analyze.py --run

No pip install. No required API keys for base experiments.

GitHub: https://github.com/jaytoone/HarnessOS

If you're building autonomous agents and thinking about long-run reliability — happy to compare notes.

HarnessOS: scaffold/middleware for infinite autonomous tasks — built on Harness Engineering

Jaewon Jang — Wed, 01 Apr 2026 08:27:45 +0000

