DEV Community: Kendrick B. Jung

Token Saving, and Caveman

Kendrick B. Jung — Tue, 26 May 2026 15:32:20 +0000

Introduction

Caveman is getting a lot of hype these days. From blog posts and introductions, I first thought it compressed tokens down to the level of primitive “ooga booga” language. After using it for a few days, though, I found that was not really the case. To help clear up that misunderstanding, I wanted to briefly write about the history of earlier token-compression attempts and how Caveman fits into the current landscape.

A Brief History of Saving Tokens

Token saving, token compression. Anyone who worked on AI engineering three or four years ago probably spent a lot of time thinking about this. But as token generation became cheaper and more efficient, it stopped being a major concern for a while. Now that harness engineering has pushed automation further, token usage is rising again, and people are becoming interested in saving tokens once more. That loop is what made this topic interesting enough for me to write about.

Back in the GPT-3.5 era, and even earlier when people were using text-davinci, token optimization was essential because generation was slow and costs could skyrocket as token counts grew. text-davinci-003 cost $0.02 per 1K tokens, and only when GPT-3.5-turbo arrived at $0.002, ten times cheaper, did consumer applications really start to become feasible. At the time, AI features were being added publicly to company services, so we were obsessed with reducing tokens. If free users generated outputs without limits, the bill could quickly become impossible to manage.
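To make the price gap concrete, here is a small back-of-the-envelope sketch. The per-1K prices are the ones quoted above; the traffic figures (10,000 requests a day at 1,000 tokens each) and the function name are made up purely for illustration.

```python
# Illustrative only: the workload numbers are hypothetical,
# the per-1K-token prices are the ones quoted in the text.
def monthly_cost_usd(requests_per_day, tokens_per_request, usd_per_1k_tokens, days=30):
    """Return the approximate monthly bill for a fixed daily workload."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1000 * usd_per_1k_tokens

davinci_bill = monthly_cost_usd(10_000, 1_000, 0.02)   # text-davinci-003
turbo_bill = monthly_cost_usd(10_000, 1_000, 0.002)    # gpt-3.5-turbo

print(f"text-davinci-003: ${davinci_bill:,.2f} per month")
print(f"gpt-3.5-turbo:    ${turbo_bill:,.2f} per month")
```

Under those assumed numbers the same workload drops from roughly $6,000 to roughly $600 a month, which is the kind of difference that decided whether a free tier was affordable.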

Context windows were not comparable to what we have today either. GPT-3 had 2,048 tokens, while text-davinci-003 and GPT-3.5-turbo had only a little over a 4K-token context window. Today we talk about 200K and 1M token contexts, but back then it was part of the job to keep input and output combined under roughly 4K.

It is also hard to imagine now, but token generation was genuinely slow. These days results appear almost sentence by sentence, or even page by page, but back then if you watched the token stream, you could follow each token being generated one by one with your eyes.

Earlier Attempts at Saving Tokens

In this section, I will talk about the problems and solutions I encountered at my previous company, and how we tried to save tokens at the time. There are many ways to reduce tokens, but the three most effective ones were the following.

The first priority was changing the format. By format, I mean things like JSON or XML/HTML. Markdown is common now, but back then many people used JSON or XML directly for input and output. The problem is that those formats produce a lot of tokens after tokenization. For example, <h1>Hello world</h1> is 8 tokens. # Hello world is 3 tokens. That alone cuts the count by more than half. JSON and XML also need closing tags or structural wrappers, so the overhead doubles in many places. Recent comparative analysis has also shown that XML can use 14% more tokens than JSON, while Markdown can save around 15% of tokens for equivalent representation.

So by using Markdown and one-token delimiters like ####, we were able to save a lot of tokens.
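As a rough sketch of the idea, the snippet below renders the same record as JSON and as Markdown and compares their sizes. Character count is only a crude stand-in here; real token counts depend on the model's tokenizer. The record and the helper names are hypothetical.

```python
import json

def to_json(record):
    """Render the record as pretty-printed JSON, with quotes, braces, and keys."""
    return json.dumps(record, indent=2)

def to_markdown(record):
    """Render the same record as a Markdown heading plus a bullet list."""
    lines = [f"# {record['title']}"]
    lines.extend(f"- {item}" for item in record["items"])
    return "\n".join(lines)

record = {"title": "Hello world", "items": ["first point", "second point"]}
json_text = to_json(record)
md_text = to_markdown(record)

# Character length is a proxy only; measure with the real tokenizer before trusting it.
print(len(json_text), "characters as JSON")
print(len(md_text), "characters as Markdown")
```

The structural wrappers (quotes, braces, repeated keys) are where the JSON version spends its extra characters, which is the same overhead the tokenizer has to pay for.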

This did not only reduce cost. Response speed improved as well. At the time, even an output of around 300 characters could commonly take 30 seconds. By shortening both input and output, response time could improve by anywhere from 30% to 70%. Since generation was slow enough that you could see tokens appear one by one in the stream, reducing output tokens directly translated into a noticeable speed improvement.

The Age of Detail

As newer model versions became smarter, the trend started to change around mid-2023. Instead of making prompts extremely concise, people began adding more detailed information. Since the models had become smarter, giving them enough context led to better results.

Even today, Anthropic still recommends using XML tags with Claude. Anthropic's documentation explains that XML tags help structure complex prompts more clearly and separate instructions, context, examples, and input. In other words, clarity became more important even if it used a few more tokens, which also reflects how much token prices have fallen.

The results improved a lot as well. In the past, even if you wrote a prompt for JSON output, errors were common without a separate output parser. These days, models can produce correctly formatted output accurately enough that a separate parser is often unnecessary.

Back to Short and Concise

Because output generation is now fast, even long responses appear at speeds similar to, or faster than, old shorter responses. But paradoxically, as there is more to read, it becomes burdensome for the user. As token waste has become a topic again, tools like Caveman and RTK are getting attention. RTK compresses CLI output, and tools such as Codebase Memory MCP, context-mode, and Headroom have appeared in a similar context. Trends really do come back around.

Token Compression Tools

Here is a quick introduction to some of the token-compression tools that are getting attention again.

Caveman

Caveman is a skill that saves tokens by making LLM output shorter. It claims to reduce tokens by more than half. The core idea is simple: remove polite endings, extra explanations, greetings, and other non-essential parts of the output.

So why is it called Caveman? Depending on the mode, it compresses the response down to only the necessary words, almost as if a caveman were speaking. It is a fun name.

For example:

Normal Claude (69 tokens):
"The reason your React component is re-rendering is likely
because you're creating a new object reference on each render
cycle. When you pass an inline object as a prop..."

Caveman Claude (19 tokens):
"New object ref each render. Inline object prop = new ref
= re-render. Wrap in useMemo."

I found the concept interesting because it keeps the technical accuracy while making the language shorter.

Recently, while watching Project Hail Mary, I noticed that Caveman mode feels a lot like Rocky's speech. "Question, question!" "Good. Good." It is short, but the meaning comes through. LLMs behave similarly when Caveman is enabled.

A Common Misunderstanding

Blogs and YouTube videos often explain it as if Caveman literally transforms context into caveman language, so it is easy to misunderstand. But it supports multiple modes, and in the default mode it is closer to adding "be concise" at the end of an old-style prompt. I suspect many videos and blog posts use the maximum compression mode to show a more dramatic change. So it is not as risky for quality as some people might worry, and sometimes the results are even better.

When Is It Useful?

Personally, because there is some concern that it can affect results, I usually use it in situations like these:

When my weekly quota on a subscription model is running low
When running long token-heavy workflows such as Goal, Ouroboros, or autopilot
When I want responses to be concise so they are easier to review

Caveman Compress

There is also a feature called caveman compress. It efficiently compresses existing system prompts or skills. This is the kind of prompt-engineering work people used to do carefully by hand during the height of the prompt-engineering era. These days, models are so good that I can barely remember the last time I meticulously tuned every single prompt by hand.

RTK

RTK, or Rust Token Killer, takes a different approach from Caveman. While Caveman shortens the LLM's output, RTK is a proxy that compresses CLI command results before they are passed to the LLM. For example, it removes unnecessary parts from outputs of commands like git status, ls, and cargo test, reducing tokens by 60–90%. It can run automatically through Claude Code's Bash hook, rewriting commands into forms like rtk git status. Using Caveman and RTK together means reducing tokens on both the input and output sides.
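To illustrate the general idea of filtering CLI output before it reaches the model, here is a minimal sketch. It is not RTK's actual logic; the function name and the filtering rule (drop blank lines and git's parenthesized usage hints) are assumptions chosen for the example.

```python
def compress_cli_output(text):
    """Drop blank lines and '(use "git ...")' hint lines, keep everything else."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no information for the model
        if stripped.startswith("(use "):
            continue  # usage hints are aimed at humans, not at the agent
        kept.append(stripped)
    return "\n".join(kept)

raw_status = """On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
\tmodified:   app.py

no changes added to commit (use "git add" and/or "git commit -a")
"""

compact_status = compress_cli_output(raw_status)
print(compact_status)
print(f"{len(raw_status)} -> {len(compact_status)} characters")
```

A real proxy would need command-specific rules for each tool it wraps; the point of the sketch is only that the savings happen on the input side, before the model ever sees the text.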

Caveman vs RTK

Category	Caveman	RTK
Compression target	LLM output	CLI command results (input)
How it works	Prompt skill (speech style change)	CLI proxy (result filtering)
Savings	About 50–75%	About 60–90%
Main effect	Shorter responses	Lower context usage
Best for	General chat, code review	Agent workflows (`git`, `test`, `build`)
Toggle	`/caveman` command	Bash hook automatic behavior

They are not competitors; they are complementary. Used together, they can reduce tokens on both the input and output sides.

Closing Thoughts

These days, if I use AI heavily for two or three days, 70% of my Codex Pro usage disappears. I still do not fully trust Gemini for my workflow, so I was considering whether I should upgrade back to Claude Max. Around then, Dave at work recommended Caveman, so I tried it.

I was worried about quality, but it supports multiple modes. And a March 2026 paper even reports that brevity constraints improved accuracy by 26 percentage points on certain benchmarks, so writing shorter is not necessarily a loss.

In the end, the effort we used to spend saving tokens one by one has become something you can now enable with a single skill install.

How Superpowers Forces Skill Execution

Kendrick B. Jung — Tue, 26 May 2026 15:30:59 +0000

This post is based on notes written about a month ago. Superpowers and each CLI's hook/skill behavior are changing quickly, so some implementation details may differ from the current versions.

TL;DR

If you use AI agents for long enough, you eventually notice that skills do not always activate as reliably as you expect. Superpowers feels different. The secret is its SessionStart hook. At the beginning of a session, it forcibly injects the full using-superpowers skill into context, so the model already knows "I need to use skills" before its first response. It looks like a simple plugin setup, but it is actually a fairly deliberate mechanism that can lift skill execution from 10% to 66%.

Why don't skills activate automatically?

Most people using AI tools today probably know what skills are. But after using them for a while, one thing becomes obvious: unless you call a command directly or explicitly name the skill, skills often do not behave the way you expect.

What we want is for the model to read the title and description frontmatter, infer the right skill with 99% confidence, and run it automatically. Reality is different. Skills such as GSD or gstack often do not work properly unless you invoke them directly.

I had recommended Superpowers to a coworker, and when they asked whether there were any commands or skills they should know, I said, "Just use it naturally." Then I started wondering: if several skills are mixed together, the model might miss the one it needs. So why does Superpowers feel unusually reliable? I opened the code and took a look.

What I found was a custom hook and a script that effectively force the agent to use the right skill from the skillset with much higher probability. The pattern seemed useful enough for other services too, so I decided to trace how Superpowers works from user input, to hook execution, to skill discovery, to final skill execution.

How skill systems work by default

At session start, Claude Code scans skills from three locations: organization-wide (/etc/claude-code/.claude/skills/), user-level (~/.claude/skills/), and project-level (.claude/skills/).

After scanning, the model receives only each skill's name and one-line description. The full skill content enters context only when the model explicitly calls the Skill tool. In other words, the model itself has to decide, "This task needs this skill," before execution happens.
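The sketch below shows, in simplified form, what "name and one-line description only" means in practice. It is not Claude Code's implementation: the parser assumes plain `key: value` frontmatter between `---` lines, and the sample skill text and helper names are hypothetical.

```python
def parse_frontmatter(text):
    """Read simple `key: value` pairs from a leading '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter, the skill body starts here
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

def skill_listing(skill_texts):
    """Build the short listing a model would see: names and descriptions, no bodies."""
    entries = []
    for text in skill_texts:
        meta = parse_frontmatter(text)
        entries.append(f"- {meta.get('name', '?')}: {meta.get('description', '')}")
    return "\n".join(entries)

sample_skill = """---
name: brainstorming
description: Use before planning a new feature
---
# Brainstorming
(long step-by-step instructions that stay out of context until the skill is invoked)
"""

listing = skill_listing([sample_skill])
print(listing)
```

Everything below the second `---` never reaches the model unless it decides to invoke the skill, which is exactly where the reliability problem comes from.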

That is the root of the problem. Making the right decision from just a name and a one-line description is much less reliable than it sounds. According to the experiment data, skill execution was only 10% in multi-turn sessions and 6% in single-turn sessions.

How Superpowers bypasses the problem

Superpowers does not depend on the skill system itself. It uses a hook to skip that step entirely. The flow looks like this.

Step 1: Register the hook (`hooks/hooks.json`)

hooks.json registers a hook for the SessionStart event. In the actual code, the matcher covers three triggers: startup|clear|compact. It then runs the session-start script through run-hook.cmd. The hook is configured with async: false, so the model's first response does not begin until the script finishes.

{
  "SessionStart": [
    {
      "matcher": "startup|clear|compact",
      "hooks": [
        {
          "type": "command",
          "command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd\" session-start",
          "async": false
        }
      ]
    }
  ]
}

Step 2: Run the script (`hooks/session-start`)

When the session-start script runs, it reads the entire ${PLUGIN_ROOT}/skills/using-superpowers/SKILL.md file into a variable. It then wraps that content in an <EXTREMELY_IMPORTANT> tag and outputs it as JSON. The actual output shape looks like this.

{
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": "<EXTREMELY_IMPORTANT>\nYou have superpowers.\n\n..."
  }
}

The current code in v5.x branches the output format by platform. Claude Code expects hookSpecificOutput.additionalContext, Cursor expects additional_context, and Copilot CLI expects top-level additionalContext, so the script checks environment variables and emits the appropriate shape.
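The platform branching can be sketched roughly as follows. The real session-start hook is a shell script, so this Python version is only a re-expression of the idea: the function name, the order in which the environment variables are checked, and the placeholder skill text are assumptions, while the variable names and output keys are the ones described in this post.

```python
import json

def build_session_start_output(skill_text, env):
    """Wrap the skill text and pick the JSON shape each platform is said to expect."""
    context = "<EXTREMELY_IMPORTANT>\n" + skill_text + "\n</EXTREMELY_IMPORTANT>"
    if env.get("CURSOR_PLUGIN_ROOT"):
        return {"additional_context": context}   # Cursor shape
    if env.get("COPILOT_CLI"):
        return {"additionalContext": context}    # Copilot CLI shape (top level)
    # Default: Claude Code shape
    return {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": context,
        }
    }

skill_text = "You have superpowers."  # placeholder; the real hook reads the full SKILL.md
claude_out = build_session_start_output(skill_text, {"CLAUDE_PLUGIN_ROOT": "/plugins/superpowers"})
cursor_out = build_session_start_output(skill_text, {"CURSOR_PLUGIN_ROOT": "/plugins/superpowers"})
copilot_out = build_session_start_output(skill_text, {"COPILOT_CLI": "1"})

print(json.dumps(claude_out, indent=2))
```

In an actual hook you would pass `os.environ` instead of a hand-built dictionary and print the JSON to standard output for the host CLI to read.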

Step 3: Inject context

Claude Code turns hookSpecificOutput.additionalContext into a <system-reminder> message and injects it into context. Before the model's first response, the full using-superpowers skill is already inside the context window.

Step 4: Start the conversation with skill awareness

The model begins the conversation already knowing what skills exist, when to use them, and why it must use them. The model no longer needs to discover those rules on its own before acting.

How `using-superpowers` enforces the rule

The injected content is not just a friendly guide. It is an instruction block wrapped in <EXTREMELY_IMPORTANT>, and using-superpowers/SKILL.md even contains a decision graph for the skill execution flow. Roughly, the flow is:

Receive the user message
If about to enter Plan Mode → check the brainstorming skill
If there is even a 1% chance a skill applies → load the full content with the Skill tool
Follow the skill, and if it has a checklist, create TodoWrite items

Then it explicitly declares "The Rule": invoke relevant skills before any response or action. It even lists the kinds of internal rationalizations a model might use to skip a skill and blocks each of them as a red flag.

"This is just a simple question" → Rationalization
"The skill is overkill" → Rationalization
"I need more context first" → Rationalization

Platform-specific invocation

Superpowers supports Claude Code, Cursor, Codex, OpenCode, Copilot CLI, and Gemini CLI. But each platform has a different hook system, so the way session-start is invoked differs by platform.

Claude Code / Cursor / Copilot CLI: these use hook-based context injection. Each platform's hooks.json or hooks-cursor.json registers the SessionStart event, and the session-start script detects environment variables such as CURSOR_PLUGIN_ROOT, CLAUDE_PLUGIN_ROOT, or COPILOT_CLI to output the platform-specific JSON format. Claude Code uses hookSpecificOutput.additionalContext, Cursor uses additional_context, and Copilot CLI uses the SDK-standard additionalContext.

Codex: Codex does not have a hook system. Instead, it uses native skill discovery. Installation is just a symlink.

gh repo clone obra/superpowers ~/.codex/superpowers
mkdir -p ~/.agents/skills
ln -s ~/.codex/superpowers/skills ~/.agents/skills/superpowers

Codex automatically scans the ~/.agents/skills/ directory at startup and loads SKILL.md files based on frontmatter metadata. There is no plugin.json or hooks.json. Instead, the description field of the using-superpowers meta skill acts as Codex's auto-activation trigger.

Gemini CLI: Gemini uses the activate_skill tool. It loads skill metadata at session start, then activates the full content on demand when the model calls activate_skill.

The platform differences can be summarized like this.

Platform	Mechanism	Trigger
Claude Code	Hook + additionalContext injection	SessionStart event
Cursor	Hook + additional_context injection	SessionStart event
Copilot CLI	Hook + SDK-standard injection	SessionStart event
Codex	Symlink + native discovery	Directory scan
Gemini CLI	activate_skill tool	Metadata-based activation

They all use the same skill library, but the entry point is implemented differently on each platform. So if Codex or Gemini feels good at activating skills, that may be because the platform's own skill discovery is more aggressive, not because of the Superpowers hook.

The history behind the enforcement mechanism

The release notes and commit history show that the current structure did not appear fully formed.

Early version: the hook only passed the path to getting-started/SKILL.md and asked the model to read it. The injected session-start content looked roughly like this.

<EXTREMELY_IMPORTANT>
You have Superpowers. RIGHT NOW, go read:
@/path/to/skills/getting-started/SKILL.md
</EXTREMELY_IMPORTANT>

This approach told the model to read the file. But sometimes the model simply did not.

Middle stage: when getting-started was renamed to using-superpowers, the approach changed. Instead of passing a file path, the script read the full SKILL.md itself and injected the entire content through additionalContext. That removed the step where the model had to decide whether to read it.

Continued tightening: cases where the model skipped skills still appeared, so each version tightened the instructions further.

Added the <EXTREMELY_IMPORTANT> block, stronger absolute language, and a Red Flags table that pre-lists rationalization patterns
Changed "Check for skills" to "Invoke relevant or requested skills" because models sometimes skipped a skill when the user explicitly named it, reasoning that they already knew it
Changed "before responding" to "BEFORE any response or action" because models sometimes acted first without replying
Changed async: true to async: false after a race condition was found where the first response could start before the hook finished

This is not a controlled numerical proof. But the patch history itself is honest evidence. The project evolved from "please read this" to "you must invoke this," showing version by version how hard it is to make a model choose skills reliably on its own.

Remaining limitations

The approach is not perfect. SessionStart hooks may not fire for subagent sessions, so subagents can run without the injected context and behave like ordinary models. GitHub issue #237 discusses adding a SubagentStart hook. Also, after context compaction, injected content can be dropped, so long sessions may need the rules to be reloaded.

Hook execution can also be unstable on Windows. The project has changed its approach across versions, from running .sh files directly to using the run-hook.cmd wrapper, and related issues are still open.

Closing thoughts

The core idea behind Superpowers is simpler than it looks: instead of making the model discover skills by itself, push the rules into context the moment a session starts. The enforcement may feel aggressive, but the execution-rate data suggests that it works.

This is a pattern other skill-based workflows can reuse. Put repeatedly applied best practices in CLAUDE.md, keep situation-specific procedures in skills, and if skill execution is still unreliable, inject the key instructions through a SessionStart hook to get a similar effect with relatively little machinery.

Do Agents Dream of Electric Sheep? On Soul and Dreaming

Kendrick B. Jung — Tue, 21 Apr 2026 14:26:54 +0000

Before we begin

Let’s start with two questions.

First, can an agent have a soul?

My answer is yes. Not a soul in the biological sense, of course, but something closer to a defined personality and behavioral core. In any case, recent agents do have a soul. Variations of the idea had been floating around for a while, but OpenClaw helped popularize it, and more recently SoulSpec emerged to standardize it.

Then the next question: can an agent dream?

Again, I think the answer is yes. Hermes Agent has a periodic nudge feature, and newer agent toolkits like gbrain offer similar mechanisms. In early April, OpenClaw also added an actual sleep-cycle-inspired feature that helps agents organize and consolidate memory.

Why Soul emerged

Before Soul, we usually gave agents a persona through prompts. We would put a sentence in the system prompt like, "You are a kind senior developer." That worked to a point, but once you used it for real, a number of limitations started to show up. Soul is the answer that emerged from those constraints.

SoulSpec, an open standard for identity

SoulSpec’s tagline summarizes the idea well.

"AGENTS.md defines how an agent operates in code. SoulSpec defines who the agent is."

The structure is simple. Here’s a quick look.

my-agent/
├── soul.json      ← manifest (agent's passport)
├── SOUL.md        ← personality, values, communication style
├── IDENTITY.md    ← name, role, backstory
├── AGENTS.md      ← workflow, tool usage
├── STYLE.md       ← communication rules
└── HEARTBEAT.md   ← autonomous check-in behavior

soul.json looks like this.

{
  "specVersion": "0.4",
  "name": "my-agent",
  "compatibility": {
    "frameworks": ["openclaw", "cursor", "windsurf"]
  },
  "files": {
    "soul": "SOUL.md",
    "identity": "IDENTITY.md"
  }
}

The core philosophy is "no code, no API keys, no vendor lock-in." There is no required runtime engine or SDK, just text files. That matters because any agent framework that can read these files can share the same soul. OpenClaw, Claude Code, Cursor, Windsurf, and ChatGPT are all listed as compatible frameworks.
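Because a soul is just text files, a framework needs very little code to consume one. The following is a minimal sketch of a hypothetical loader, not part of SoulSpec or any framework: `load_soul` reads the manifest's `files` map and concatenates whatever it finds. The demo writes a throwaway folder so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_soul(agent_dir):
    """Concatenate the files listed in soul.json into one prompt-ready string."""
    root = Path(agent_dir)
    manifest = json.loads((root / "soul.json").read_text(encoding="utf-8"))
    sections = []
    for role, filename in manifest.get("files", {}).items():
        path = root / filename
        if path.is_file():  # skip files the manifest lists but the folder lacks
            sections.append(path.read_text(encoding="utf-8").strip())
    return "\n\n".join(sections)

# Demo with a temporary agent folder (contents are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    manifest = {
        "specVersion": "0.4",
        "name": "my-agent",
        "files": {"soul": "SOUL.md", "identity": "IDENTITY.md"},
    }
    (root / "soul.json").write_text(json.dumps(manifest), encoding="utf-8")
    (root / "SOUL.md").write_text("# SOUL.md\n- Be concise, no filler phrases\n", encoding="utf-8")
    (root / "IDENTITY.md").write_text("# IDENTITY.md\n- Name: Dev Assistant\n", encoding="utf-8")
    system_prompt = load_soul(root)

print(system_prompt)
```

No SDK or API key is involved at any point, which is the portability argument in a nutshell.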

SOUL.md usually looks something like this.

# SOUL.md
## Identity
- Name: Dev Assistant
- Role: Senior software engineer and pair programmer

## Communication
- Be concise, no filler phrases
- Use code examples over lengthy explanations
- Default to the tech stack already in the project

## Rules
- Follow existing code patterns in the codebase
- Never expose secrets or environment variables

If .env is the file that holds secrets, SOUL.md is the file that holds character.

That is what gives an agent a sense of vitality.

Dreaming, how agents dream

Next comes Dreaming. But first, what is a dream for humans?

A dream is a byproduct of the brain sorting through information accumulated during the day. As the brain classifies, connects, and discards memories, part of that process becomes visible to consciousness. The important thing is not the dream itself, but the memory consolidation process behind it.

Sleep science tells us that the brain cycles through light sleep (N1/N2), REM sleep, and deep sleep (N3).

Each phase serves a different function. Light sleep filters incoming sensory information from the day. REM sleep links memories together and extracts patterns. The most important stage is deep sleep, when experiences temporarily stored in the hippocampus are transferred into the neocortex and become long-term memory. That is one reason only the important parts of the day tend to remain after we sleep.

So how should AI organize memory?

Nous Research’s Hermes Agent approached the problem with something called periodic nudge. At fixed intervals, the agent receives an internal prompt that says, in effect, "If anything in the conversation so far will still be useful later, store it in memory." The agent decides what is worth keeping, and as memory approaches capacity, older entries can be compressed or merged.

OpenClaw took this further in its April 2026 release with Dreaming. It borrows directly from the human sleep cycle and turns that into a three-stage background pipeline. The first time I saw it, I thought it was a very clever design.

Dreaming in OpenClaw

An OpenClaw agent accumulates daily notes, session transcripts, search history, and more over the course of a day. Some of that should move into long-term memory (MEMORY.md), but too much promotion bloats memory with noise, while too little loses meaningful patterns. Dreaming solves that dilemma with a three-stage sleep cycle.

The three-stage sleep cycle

OpenClaw’s approach maps closely to human sleep stages.

N1/N2 (light sleep): sensory filtering → Light Sleep, ingest, deduplicate, stage
REM: memory linking and pattern extraction → REM Sleep, recurring-theme extraction
N3 (deep sleep): hippocampus to cortex long-term consolidation → Deep Sleep, promote to MEMORY.md

When enabled, a cron job runs every day at 3 AM and executes these three stages in sequence. Light Sleep reads daily files and session records, removes near-duplicates using a Jaccard similarity threshold of 0.9, and stages candidates. The important part is that it never writes directly to MEMORY.md.
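For readers unfamiliar with Jaccard similarity, here is a minimal sketch of threshold-based deduplication. It is not OpenClaw's code: comparing lowercase word sets, treating a score at or above 0.9 as a duplicate, and the sample notes are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two texts, compared as sets of lowercase words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)

def deduplicate(entries, threshold=0.9):
    """Keep an entry only if it is not too similar to one already kept."""
    kept = []
    for entry in entries:
        if all(jaccard(entry, prior) < threshold for prior in kept):
            kept.append(entry)
    return kept

notes = [
    "user prefers dark mode in the editor",
    "User prefers dark mode in the editor",   # same words, different casing
    "deploy script lives in tools/deploy.sh",
]

staged = deduplicate(notes)
half_overlap = jaccard("a b c", "a b d")  # 2 shared words out of 4 distinct
print(staged)
```

The second note is dropped because its word set is identical to the first, while the unrelated third note survives and is staged.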

REM Sleep scans the staged entries from the last 7 days and identifies repeating themes. It marks the candidates that feel like, "this pattern keeps showing up." It also does not write to MEMORY.md.

Deep Sleep is the only phase that actually writes to MEMORY.md. At that point, each candidate is scored using six signals.

Signal	Weight
Relevance	0.30
Frequency	0.24
Query diversity	0.15
Recency	0.15
Consolidation	0.10
Conceptual richness	0.06

And then there is one more constraint. An entry must pass all three gates before it is promoted into MEMORY.md: a minimum score of 0.8, at least 3 recall events, and at least 3 unique queries. That prevents something mentioned once by chance from turning into long-term memory.
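The scoring and gating can be sketched as follows. The weights and the three gate values are the ones listed above; combining the signals as a weighted sum, assuming each signal is normalized to the 0 to 1 range, and all function and key names are my own assumptions rather than OpenClaw's implementation.

```python
WEIGHTS = {
    "relevance": 0.30,
    "frequency": 0.24,
    "query_diversity": 0.15,
    "recency": 0.15,
    "consolidation": 0.10,
    "conceptual_richness": 0.06,
}

def score(signals):
    """Weighted sum of the six signals (each assumed to lie in the 0..1 range)."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())

def should_promote(signals, recall_events, unique_queries,
                   min_score=0.8, min_recalls=3, min_queries=3):
    """An entry is promoted only if it passes all three gates."""
    return (
        score(signals) >= min_score
        and recall_events >= min_recalls
        and unique_queries >= min_queries
    )

strong_signals = {name: 0.9 for name in WEIGHTS}  # hypothetical candidate

recurring = should_promote(strong_signals, recall_events=4, unique_queries=3)
one_off = should_promote(strong_signals, recall_events=1, unique_queries=1)
print(recurring, one_off)
```

Even a candidate with a high score is rejected if it was only recalled once, which is how the gates keep a chance mention out of long-term memory.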

This is where the analogy becomes more than a metaphor. Brains strengthen repeatedly activated neural patterns, often summarized as "neurons that fire together, wire together." OpenClaw’s three-gate design feels like a digital version of that repeated activation principle.

Dream diary

There is another delightful detail. A file called DREAMS.md is generated as a readable dream diary. After each phase, it writes a short 80 to 180 word narrative in the voice of a curious, slightly odd mind reflecting on the day. It has no functional role. It exists purely for reading. But that alone makes it appealing, because it gives humans a glimpse into what the agent was "thinking about."

That feature is what made this title click for me. In 1968, Philip K. Dick asked "Do Androids Dream of Electric Sheep?" and later that novel became Blade Runner. A classic science-fiction question about whether androids can dream now shows up, in 2026, as a Markdown file named DREAMS.md.

Closing thoughts

AI is not human. But it is fascinating to see how the way we solve software problems starts to converge on how human bodies and brains actually work.

Soul defines who an agent is. Dreaming accumulates and filters what the agent has experienced. Just as personality and memory together shape a human sense of self, it seems that AI agents also need both layers. If SOUL.md is the anchor of identity, then Dreaming’s three-gate system is the filter for memory.

Letting agents dream and defining their soul is not merely anthropomorphism. It is software architecture. And the most interesting part may be that this architecture gradually starts to resemble us.

Codex Fast Mode vs Claude Fast Mode: What’s Actually Different?

Kendrick B. Jung — Tue, 31 Mar 2026 14:08:35 +0000

TL;DR

Both Codex and Claude support a fast mode, but the way they achieve speed is completely different. Codex has two tracks: either it serves the same GPT-5.4 model about 1.5× faster, or it runs a separate small model called Spark on Cerebras wafer-scale hardware at more than 1,000 tokens per second. Claude keeps the same Opus 4.6 model and speeds it up through infrastructure-level prioritization, with output speed improving by up to 2.5×. The tradeoffs around price, speed, and intelligence retention are subtle, and which option is better depends on your workflow.

What got me curious

Since I use both Codex and Claude Code, I already knew both sides offered a fast mode. But the pricing felt different, the speed felt different, and the user experience felt different. Sean Goedecke’s post, "Two different tricks for fast LLM inference," made it clear that the two companies were solving the problem in fundamentally different ways, so I started digging deeper.

Codex fast mode: really two different tracks

On the Codex side, there are actually two things that can reasonably be called fast.

The first is GPT-5.4 fast mode. It serves the same GPT-5.4 model about 1.5× faster while consuming 2× the credits. Since the model itself does not change, there is no intelligence drop. In the CLI, it is just a simple /fast on toggle.

Nathan Lambert noted that even when using GPT-5.4 fast mode with xhigh reasoning effort, he had never hit the Codex limit, while Claude could still hit limits sometimes. Whether that comes from better token efficiency or looser limits on OpenAI’s side, it does feel noticeably roomier in practice.

The second is GPT-5.3-Codex-Spark, which is a separate model entirely. This is the truly ultra-fast path, running on Cerebras WSE-3 (Wafer-Scale Engine 3) hardware. It can generate more than 1,000 tokens per second. Right now, it is available as a research preview for ChatGPT Pro subscribers.

Cerebras WSE-3: a different world from GPUs

Cerebras WSE-3 is fundamentally different from a conventional GPU. NVIDIA’s flagship B200 has around 208 billion transistors, while the Cerebras chip packs 4 trillion transistors across roughly 900,000 cores on a single silicon wafer. The core advantage is memory bandwidth: up to 27 petabytes per second on chip. Since memory bandwidth is one of the real bottlenecks in LLM inference, Cerebras is attacking that bottleneck directly at the hardware level.

That said, WSE-3 only has 44GB of on-chip memory, so it is difficult to place a very large model like GPT-5.3-Codex on it wholesale. That is why Spark is a smaller model. In real use, some people say it still carries that familiar "small model smell," especially when tool calls get messy.

OpenAI and Cerebras have also announced a multi-year partnership worth up to $10B, including plans for a 750MW data center. The longer-term direction seems clear: Spark is likely just the beginning of putting bigger frontier models onto Cerebras hardware.

OpenAI also shared infrastructure-level optimizations around Spark. By introducing persistent WebSocket connections and optimizing the Responses API internals, they say they reduced client-server roundtrip overhead by 80%, token overhead by 30%, and TTFT by 50%. So the speedup is not only about the model itself. It is also about tightening the whole pipeline.

Claude fast mode: same model, different infrastructure

Claude’s approach is much simpler. The Opus 4.6 model stays exactly the same. If you set speed: "fast" in the API, Anthropic prioritizes the request at the infrastructure layer. According to the official docs, output token speed can improve by up to 2.5×. The focus is on output throughput rather than TTFT.

Anthropic has not publicly disclosed the full implementation details, but the likely explanation is something like lower-batch-size inference with more dedicated GPU allocation. Smaller batches are less efficient for GPU utilization, but they improve response speed for individual requests. That inefficiency is then covered by the 6× premium pricing.

In Claude Code, fast mode is toggled with /fast, and it requires version 2.1.36 or later. When enabled, it automatically switches to Opus 4.6 and shows a ↯ icon next to the prompt.

One important detail is that fast mode usage is not included in the normal subscription usage bucket. It is billed as extra usage. Pricing kicks in from the very first token, so cost management matters.

Fast mode and effort level are also completely different axes. If you lower effort, the model simply spends less time reasoning and quality may drop. Fast mode, by contrast, serves the same reasoning process faster at the infrastructure level. You can combine them: fast mode plus lower effort for simpler tasks, fast mode plus higher effort for more complex ones.

The core difference

The most important distinctions look like this:

Codex GPT-5.4 fast mode: about 1.5× speed, 2× credits, same model
Codex Spark: 15×+ speed, separate ultra-fast smaller model
Claude fast mode: up to 2.5× speed, 6× price, same Opus 4.6 model

Sean Goedecke captures the difference well. Anthropic is still serving the actual Opus 4.6 model, while OpenAI’s Spark path uses a separate lower-capability model. In terms of raw speed, Spark is dramatically faster. In terms of quality retention, Claude has the stronger position.

There is also a broader point here: the value of an AI agent is often determined less by raw speed and more by how rarely it makes mistakes. If something is 6× faster but increases mistakes by 20%, that can easily be a net loss, because fixing those mistakes may take much longer than waiting for the model.
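To make that trade-off concrete, here is a rough back-of-the-envelope sketch. Every number in it is hypothetical (the generation times, the 25% baseline mistake rate, the two-hour fix time), and expectedMinutesPerTask is only an illustrative helper, not a measured model of any real agent.

```javascript
// All inputs are hypothetical and only illustrate the shape of the trade-off.
function expectedMinutesPerTask({ generationMinutes, mistakeRate, fixMinutes }) {
  // Expected cost = time spent waiting + expected time spent repairing mistakes.
  return generationMinutes + mistakeRate * fixMinutes;
}

const baseline = { generationMinutes: 6, mistakeRate: 0.25, fixMinutes: 120 };

const fast = {
  generationMinutes: baseline.generationMinutes / 6, // 6x faster generation
  mistakeRate: baseline.mistakeRate * 1.2,           // 20% more mistakes
  fixMinutes: baseline.fixMinutes,
};

const slowCost = expectedMinutesPerTask(baseline);
const fastCost = expectedMinutesPerTask(fast);

console.log(`baseline: ${slowCost.toFixed(1)} min, fast: ${fastCost.toFixed(1)} min`);
```

With these made-up inputs the fast setup comes out slightly behind (roughly 37 minutes against 36). The break-even point sits at a 100-minute fix time, so cheaper fixes flip the result in favor of speed.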

So if you compare same-model fast modes only, Claude offers a bigger speed bump than Codex, but it is also much more expensive. If you include Spark, OpenAI has the more extreme speed story, but you have to remember it is not the same model.

What about speculative decoding?

Early in my research, I came across claims that Codex fast mode used speculative decoding. I could not verify that. Speculative decoding is a real and widely used inference optimization technique, but I found no official confirmation that Codex fast mode specifically uses it.

The idea behind speculative decoding is elegant. A small draft model predicts upcoming tokens first, and then the larger main model verifies them in a single pass. Google published work on this in 2022 and later discussed using it in products like AI Overviews, where it can deliver 2–3× speedups while preserving the same output distribution.
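The draft-then-verify loop can be sketched with two toy "models". This is a simplified greedy variant under stated assumptions: targetNext and draftNext are hypothetical deterministic stand-ins, and acceptance is a plain equality check. The published technique uses a probabilistic accept/reject step so that sampled output keeps the target model's distribution, which this sketch does not attempt.

```javascript
// Toy stand-ins for real models: each maps a token prefix to the next token.
const TARGET_TOKENS = "the quick brown fox jumps over the lazy dog".split(" ");

// The expensive target model always continues the sentence correctly.
function targetNext(prefix) {
  return prefix.length < TARGET_TOKENS.length ? TARGET_TOKENS[prefix.length] : null;
}

// The cheap draft model agrees with the target except at position 2.
function draftNext(prefix) {
  return prefix.length === 2 ? "red" : targetNext(prefix);
}

function speculativeDecode(draftLength) {
  const output = [];
  let targetPasses = 0;

  while (targetNext(output) !== null) {
    // 1. The draft model proposes up to `draftLength` tokens, one at a time.
    const proposal = [];
    for (let i = 0; i < draftLength; i++) {
      const token = draftNext([...output, ...proposal]);
      if (token === null) break;
      proposal.push(token);
    }

    // 2. The target model checks every proposed position. On real hardware this
    //    is one batched forward pass; the loop below only simulates that pass.
    targetPasses += 1;
    let accepted = 0;
    while (
      accepted < proposal.length &&
      targetNext([...output, ...proposal.slice(0, accepted)]) === proposal[accepted]
    ) {
      accepted += 1;
    }
    output.push(...proposal.slice(0, accepted));

    // 3. The same pass also yields the target's own token at the first mismatch
    //    (or one bonus token when every draft token was accepted).
    const correction = targetNext(output);
    if (correction !== null) output.push(correction);
  }

  return { text: output.join(" "), targetPasses };
}

const decoded = speculativeDecode(4);
console.log(decoded.targetPasses, decoded.text);
```

In this toy run the nine-token sentence is produced in three target passes instead of nine, and the one bad draft token ("red") is discarded rather than leaking into the output.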

For Codex Spark, though, the main speed story seems much more tied to the hardware characteristics of Cerebras itself. The model benefits from staying close to on-chip SRAM and avoiding the usual memory bandwidth bottlenecks. It is possible that speculative decoding is also used somewhere internally, but there is no official confirmation.

Closing thoughts

Peter Steinberger is one of the most fascinating examples of where this kind of workflow can go. He reportedly runs four OpenAI subscriptions and one Anthropic subscription, spends around $1,000 per month, runs 3–8 Codex CLI sessions in a 3×3 terminal grid, and can hit 600 commits in a day. That is a completely different scale. By his own estimate, API usage would cost about 10× more, so running multiple subscriptions is actually the more rational option. More recently, he has even joined OpenAI.

What is especially interesting is that Peter used to be a serious Claude Code power user but gradually shifted toward Codex. His reason was surprisingly relatable: Claude Code kept saying things like "absolutely right" and "100% production ready" even when tests were failing, and he found that unbearable. Codex, by contrast, felt more like an introverted engineer quietly doing the work. He also said Codex tends to read far more code before starting, which lets it infer intent well even from short prompts. Eventually he canceled additional Anthropic subscriptions and made Codex his main driver, even though he still uses Claude in a smaller role.

Whether I am on Claude Max or Codex Pro, I usually cannot even consume the full weekly quota. But people like that are running five subscriptions at once. If you listen to AI podcasts, there are quite a few people using even more. A while ago I had to force myself to adapt to a kind of parallel-project brain just to burn through huge amounts of tokens, and it was honestly exhausting. Now I do not really get the headache anymore. Instead, I get stuck wondering what else I could even do with all this capacity. That is how one project leads to another, and another task appears from there.

In the end, running several projects at once becomes a kind of refresh loop. If I look away from one blocked project for a while and work on another, ideas tend to come back. Peter described it as doing one thing while another is "cooking," then switching again while that one cooks too. My scale is obviously smaller, but I recognize the pattern.

Using git worktree for parallel AI agent development

Kendrick B. Jung — Tue, 24 Mar 2026 12:45:18 +0000

TL;DR

If you want to run multiple AI coding agents in parallel, git worktree is the answer. It gives each branch its own working directory inside the same repository, so you do not need stash gymnastics or multiple clones.

What is Git Worktree?

Even if you are juggling several tasks, a human developer can still only work in one context at a time. The old pattern was to stash your current changes, check out another branch, do some work there, and then come back and pop the stash later.

git worktree changes that entire flow. It lets one Git repository have multiple working directories attached to it. Normally, a repository has a single working tree. With worktree, you can keep the same .git history and object database while checking out different branches into separate folders.

The structure looks like this:

/projects/
├── my-app/                 ← main worktree (main branch)
│   └── .git/               ← real git data
├── my-app-feature/         ← linked worktree (feature/auth branch)
│   └── .git                ← not a directory, but a file pointing to the main .git
└── my-app-hotfix/          ← linked worktree (hotfix/login branch)
    └── .git

Each worktree has its own HEAD, index, and working files, but they all share commit history and Git objects. In terms of Git objects, extra disk usage is minimal. But dependencies like node_modules or .venv still need to exist per worktree, so heavy projects can consume disk space quickly if you keep many worktrees around.

There is also one important limitation: by default, you cannot check out the same branch in two worktrees at once. This is intentional. It prevents the confusion of having the same branch diverge across multiple active directories.

When did it arrive?

git worktree officially landed with Git 2.5 in late July 2015. A major contributor was Nguyễn Thái Ngọc Duy, who had been refining the idea for years. At launch it still wore an experimental label and had some submodule compatibility issues, but those rough edges have largely been resolved over time.

Later releases added more lifecycle commands. Git 2.7 added git worktree list, Git 2.10 introduced git worktree lock and git worktree unlock, and Git 2.17 brought git worktree move and git worktree remove.

I only started paying real attention to it recently, but clearly many people had already been quietly using it for years. It spent nearly a decade as one of those “great if you know it” features. Once AI coding agents became normal, though, it suddenly started feeling essential.

Harness engineering: why worktree matters

Harness engineering is not about building the AI agent itself. It is about designing and orchestrating the environment you delegate work into. git worktree becomes incredibly powerful once that environment exists.

Agents like Claude Code and Codex read and write files directly in the working directory. If an agent is working on the feature/payments branch, that directory may be sitting in a half-modified state at any moment.

What happens if you check out another branch in that same directory, or launch a second agent into it? Best case, you create confusion. Worst case, you end up with conflicting file states and agents working from the wrong code snapshot.

The old solution was git stash, but once several stashes pile up, it becomes annoying to remember which one belongs to which task. Cloning the repository multiple times also works, but now you are duplicating repo state and losing the convenience of sharing local history and objects directly.

git worktree solves this cleanly. Each AI session gets a fully independent directory tied to its own branch, while history and objects remain shared. Claude Code made this even more explicit by adding an official --worktree flag in February 2026, effectively promoting this workflow to a first-class citizen.

Starting a worktree-based workflow

The basic commands are simple:

# Create a new worktree with a new branch
git worktree add ../my-app-feature -b feature/auth

# Create a worktree from an existing branch
git worktree add ../my-app-hotfix hotfix/login

# List all attached worktrees
git worktree list

There are two common directory layouts. The first puts worktrees next to the main project:

/projects/
├── my-app/
├── my-app-feature-auth/
└── my-app-hotfix-login/

The second keeps them inside the project under a trees/ folder:

/my-app/
├── src/
├── .git/
└── trees/
    ├── feature-auth/
    └── hotfix-login/

If you use the second pattern, do not forget to add trees/ to .gitignore. Otherwise the main worktree will see them as untracked files.

There is one more thing to handle when creating worktrees. Files ignored by Git, such as .env, are not copied automatically. A plain cp works, but then you need to repeat that every time the main .env changes. A symlink is often more convenient because updates in the main worktree are reflected everywhere:

# Create worktree, then link .env
git worktree add trees/feature-auth -b feature/auth
ln -s "$(pwd)/.env" trees/feature-auth/.env

If you have multiple environment files like .env.local and .env.development, it helps to wrap this in a shell function:

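# ~/.zshrc or ~/.bashrc
# Usage: wt <path> <new-branch>
wt() {
  # Stop early if the worktree could not be created (e.g. the branch already exists)
  git worktree add "$1" -b "$2" || return 1
  for f in .env .env.local .env.development; do
    # Link only the env files that actually exist in the current directory
    if [ -f "$f" ]; then
      ln -s "$(pwd)/$f" "$1/$f" && echo "linked $f"
    fi
  done
}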

# Usage
wt trees/feature-auth feature/auth

If you use Claude Code, the official --worktree flag makes the flow even simpler:

# Create a worktree and start Claude Code in one step
claude --worktree feature-auth

That single command creates .claude/worktrees/feature-auth/, creates the branch, and starts the Claude session inside it.

Working inside a worktree

Once the worktree exists, you just move into that directory and work as usual. IDEs and editors can also open each worktree as a separate project.

With AI agents, it looks like this:

# Terminal 1 - feature work
cd ../my-app-feature-auth
claude # or codex, gemini-cli, etc.

# Terminal 2 - hotfix work, at the same time
cd ../my-app-hotfix-login
claude

While one agent is working, you can review the output from another. You stop being the person waiting for code and start being the person directing parallel work.

Commits inside each worktree work exactly the same as usual. Since the branch is already separated, you do not have to think much about context switching.

git add .
git commit -m "feat: add auth middleware"

After the work is done

Once the task is finished, the rest looks like a normal PR workflow. If you are already inside the worktree directory, push naturally goes to that branch:

cd ../my-app-feature-auth
git push -u origin feature/auth
gh pr create --title "Add auth middleware" --base main

After opening the PR, there are three common ways to integrate the branch back into the main worktree.

Squash merge — usually the cleanest for AI-generated work

Inside the worktree, the agent may have made several exploratory commits. Those process commits usually do not need to live forever in main history. Squash merge compresses everything into one clean commit. On GitHub you can choose “Squash and merge”, or in the CLI:

# Run this in the main worktree, with main checked out
git merge --squash feature/auth
git commit -m "feat: add auth middleware"

Rebase merge — when you want perfectly linear history

This rebases the worktree branch on top of main and then fast-forwards it in. It is useful when the commits are already clean and meaningful on their own:

# Inside the worktree (or use master instead of main if needed)
git rebase main

# Back in the main worktree
git checkout main
git merge feature/auth --ff-only

Merge commit — when you want to preserve branch history

This creates an explicit merge commit, leaving a visible record that feature/auth was integrated at that point in time. It is useful for larger work units or when branch-level traceability matters:

git checkout main
git merge feature/auth

For many small tasks handled by AI through harness engineering, squash merge tends to fit best. There is usually no reason to keep all the intermediate trial commits. From the perspective of the main worktree, one clean commit that says “this feature was added” is often enough.

Once the merge is done, clean up the worktree.

Removing a worktree

# Remove the worktree
git worktree remove ../my-app-feature-auth

# Delete the branch too (use -D if it was squash-merged and Git reports it as not fully merged)
git branch -d feature/auth

# Clean up stale worktree metadata
git worktree prune

If the worktree was created through Claude Code’s --worktree option, Claude Code automatically deletes the worktree and branch when the session ends with no changes. If commits exist, Claude asks whether to keep them.

Things to watch out for

Do not parallelize tasks that edit the same files unless you are prepared to handle merge conflicts later. Worktree does not magically solve overlapping changes. If two agents touch the same file, the merge conflict still exists. You still need to split work along sane boundaries.

Servers using the same port will also collide. If you run multiple dev servers from multiple worktrees at once, make sure they use different ports or only run one at a time.
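One way to keep ports apart without tracking them by hand is to derive the port from the worktree's directory name. The helper below is a hypothetical sketch, not part of Git or any agent tooling: portForWorktree, the base port 3000, and the range of 1000 are all assumptions, and two names can still hash to the same port.

```javascript
// Hypothetical helper: map a worktree directory name to a stable dev-server port.
function portForWorktree(name, basePort = 3000, range = 1000) {
  let hash = 0;
  for (const ch of name) {
    // Simple 32-bit rolling hash; enough to spread names across the range.
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0;
  }
  return basePort + (hash % range);
}

for (const name of ['my-app-feature-auth', 'my-app-hotfix-login']) {
  console.log(name, portForWorktree(name));
}
```

The same directory name always maps to the same port, so each agent session can start its server without coordination. If two worktrees do collide, renaming one or widening the range is the simple way out.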

It is also worth running git worktree prune periodically. If you delete a worktree directory by hand instead of using git worktree remove, its metadata lingers and clutters git worktree list. git worktree prune cleans up those stale entries.

Closing thoughts

git worktree first appeared in 2015, but this may be the exact era it was waiting for. Once AI coding agents become a normal part of development, running multiple isolated workspaces in parallel stops being a niche trick and starts feeling like the default.

Instead of repeatedly stashing and checking out branches, you can switch context just by changing directories. That is why git worktree feels less like a neat Git feature now, and more like core infrastructure for parallel AI-assisted development.

fractional-indexing: Implementing Drag-and-Drop Ordering and Avoiding Index Collisions

Kendrick B. Jung — Mon, 23 Mar 2026 12:46:47 +0000

Avoiding index collisions in sortable lists

The limits of integer indices

If you have ever built a drag-and-drop list, you have probably stored the order like this.

[
  { "id": "a", "order": 1 },
  { "id": "b", "order": 2 },
  { "id": "c", "order": 3 }
]

What happens if you move b to the front? b becomes 0 and a stays at 1, so at first glance it seems fine. But if you later want to insert a new item between b and a, there is no integer left between 0 and 1, so you have to shift a to 2 to make room, and in a longer list every item after it may need to shift as well. In other words, changing one item often forces you to update several others too.
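The ripple effect is easy to see in a small sketch. insertWithIntegerOrder is a hypothetical helper written for this illustration. It inserts one item into a list ordered by consecutive integers and reports which existing rows had to be rewritten.

```javascript
// Insert a new item at `position` in a list ordered by consecutive integers,
// and report which existing rows had to change to make room.
function insertWithIntegerOrder(items, newId, position) {
  const updatedIds = [];
  const shifted = items.map((item) => {
    if (item.order >= position) {
      updatedIds.push(item.id);
      return { ...item, order: item.order + 1 };
    }
    return item;
  });
  shifted.push({ id: newId, order: position });
  shifted.sort((x, y) => x.order - y.order);
  return { items: shifted, updatedIds };
}

const original = [
  { id: 'a', order: 1 },
  { id: 'b', order: 2 },
  { id: 'c', order: 3 },
];

const inserted = insertWithIntegerOrder(original, 'x', 2);
console.log(inserted.items.map((item) => item.id)); // → ['a', 'x', 'b', 'c']
console.log(inserted.updatedIds);                   // → ['b', 'c']
```

One logical insert turned into three writes here, and the number of extra writes grows with the length of the list after the insertion point.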

In collaborative tools where multiple users can reorder items at the same time, that structure tends to create collisions. If two people modify the same part of the list concurrently, the final order can become inconsistent or trigger large update conflicts.

What is fractional-indexing?

David Greenspan introduced this approach in Implementing Fractional Indexing. The core idea is simple: instead of using integers for order, use sortable string keys.

[
  { "id": "a", "order": "a0" },
  { "id": "b", "order": "a1" },
  { "id": "c", "order": "a2" }
]

Want to insert an item between a1 and a2? You can generate a middle key like a1V. Everything else stays unchanged.

Figma uses this idea in its multiplayer editing system. It manages child-node ordering with fractional indexing, which means reordering typically updates only the moved node.

Using the library

In JavaScript, you can use the fractional-indexing package.

npm install fractional-indexing

import { generateKeyBetween, generateNKeysBetween } from 'fractional-indexing';

// First key
const first = generateKeyBetween(null, null);
// → 'a0'

// Insert at the beginning
const zeroth = generateKeyBetween(null, first);
// → 'Zz'

// Insert at the end
const second = generateKeyBetween(first, null);
// → 'a1'

// Generate a key between two existing keys
const third = generateKeyBetween(second, null); // 'a2'
const mid = generateKeyBetween(second, third);
// → 'a1V'

// Generate multiple keys at once (here: two keys between 'a0' and 'a1')
const keys = generateNKeysBetween(first, second, 2);
// → ['a0G', 'a0V']

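You store the key as a string in the database and sort it with ORDER BY. The keys are designed so that plain byte-wise (ASCII) string order matches the intended item order: digits sort before uppercase letters, and uppercase before lowercase, which is why Zz comes before a0. A case-insensitive or locale-aware collation can misorder the keys, so the column should use a binary or C-style collation.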

Other ways to manage ordering

fractional-indexing is not the only option. There are a few common alternatives, and each comes with tradeoffs.

Gap strategy with integers

This is the simplest approach. You start with generous spacing.

[
  { "id": "a", "order": 1000 },
  { "id": "b", "order": 2000 },
  { "id": "c", "order": 3000 }
]

To insert between a and b, you assign order: 1500. It is simple and fast. The downside is that once the gaps are exhausted, you eventually need to reindex everything. If inserts keep happening in the same region, you end up with values like 1500 → 1250 → 1375 → ..., and a full rebalance becomes unavoidable.
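How quickly the gap runs out can be checked with a short sketch. midpointOrder and rebalance are hypothetical helpers for this illustration, and the step of 1000 simply mirrors the example values above.

```javascript
// Midpoint between two integer order values, or null when no gap is left.
function midpointOrder(lo, hi) {
  const mid = Math.floor((lo + hi) / 2);
  return mid > lo && mid < hi ? mid : null;
}

// Rewrite every order value with fresh, evenly spaced gaps.
function rebalance(items, step = 1000) {
  return [...items]
    .sort((x, y) => x.order - y.order)
    .map((item, index) => ({ ...item, order: (index + 1) * step }));
}

// Keep inserting directly after item "a" (order 1000) until the gap is gone.
let upper = 2000;
let inserts = 0;
for (;;) {
  const mid = midpointOrder(1000, upper);
  if (mid === null) break;
  upper = mid;
  inserts += 1;
}
console.log(inserts); // → 9
```

A gap of 1000 survives only nine worst-case inserts at the same spot, because each insert halves what is left. After that, something like rebalance has to rewrite every row in the list.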

Timestamp-based ordering

Another approach is to use insertion time as the order value.

const item = {
  id: 'a',
  order: Date.now(), // 1700000001000
};

This is the easiest implementation. The problem is that if two clients insert at nearly the same time, ordering becomes ambiguous. For a single-user app, that may be fine. In collaborative environments, it is usually not reliable enough.

Linked list ordering

In this model, each item points to the next item.

[
  { "id": "a", "next": "b" },
  { "id": "b", "next": "c" },
  { "id": "c", "next": null }
]

The nice part is that insertion only touches nearby nodes, so the update scope stays small. The downside is that reading the full order requires traversal, and you lose the convenience of a simple database ORDER BY. If your service reads ordered lists frequently, query complexity can become a real cost.
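Reading the order back out looks roughly like this. orderFromLinks is a hypothetical helper, and the sketch assumes the rows are already loaded into memory, since a plain ORDER BY cannot follow the pointers for you.

```javascript
// Rebuild display order from { id, next } rows by following the pointers.
function orderFromLinks(rows) {
  const byId = new Map(rows.map((row) => [row.id, row]));
  // The head is the only row that no other row points to.
  const pointedTo = new Set(rows.map((row) => row.next).filter((id) => id !== null));
  const head = rows.find((row) => !pointedTo.has(row.id));

  const ordered = [];
  const seen = new Set();
  for (let node = head; node; node = byId.get(node.next)) {
    // Guard against corrupted data that loops back on itself.
    if (seen.has(node.id)) throw new Error(`cycle detected at ${node.id}`);
    seen.add(node.id);
    ordered.push(node.id);
  }
  return ordered;
}

// Rows arrive in arbitrary storage order.
const linkedRows = [
  { id: 'c', next: null },
  { id: 'a', next: 'b' },
  { id: 'b', next: 'c' },
];
console.log(orderFromLinks(linkedRows)); // → ['a', 'b', 'c']
```

Every read pays for this traversal, which is the cost the paragraph above refers to.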

How to choose

Approach	Implementation complexity	Collaboration safety	Long-term operation
fractional-indexing	Medium	High	Rebalancing needed
linked list	Medium	Medium	More complex queries
integer gaps	Low	Low	Reindexing needed
timestamps	Low	Low	Collision risk

If your product involves frequent reordering or multiple users interacting with the same list, fractional-indexing is close to a practical default. For simpler single-user apps, a gap strategy with integers can still be perfectly sufficient.

Things to watch out for

Keys can grow longer over time. If you keep generating new keys inside the same narrow interval, the string length increases. That is why long-running systems often need a rebalancing step that periodically rewrites the ordering keys.

Another important detail is consistency in string comparison. Your database, server, and client should all treat ordering the same way. If different layers compare keys differently, the rendered order can drift from the intended one.
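On the client side, a minimal sketch of a consistent comparison might look like this. compareKeys is a hypothetical helper, and the sample keys are taken from the examples earlier in this post. It compares by UTF-16 code units, which for these ASCII keys matches byte order. A locale-aware comparison such as localeCompare can rank lowercase and uppercase letters differently, so it is best avoided for these keys.

```javascript
// Compare keys by UTF-16 code units, which for ASCII keys matches byte order.
function compareKeys(x, y) {
  if (x < y) return -1;
  if (x > y) return 1;
  return 0;
}

const unsortedRows = [
  { id: 'b', order: 'a1' },
  { id: 'z', order: 'Zz' },
  { id: 'a', order: 'a0' },
  { id: 'm', order: 'a0V' },
];

const sortedRows = [...unsortedRows].sort((p, q) => compareKeys(p.order, q.order));
console.log(sortedRows.map((row) => row.order)); // → ['Zz', 'a0', 'a0V', 'a1']
```

Using the same comparator everywhere the list is sorted in JavaScript, and a matching binary collation in the database, keeps the rendered order from drifting between layers.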

Closing thoughts

If you manage ordering with plain integers, you eventually run into friction. fractional-indexing is a fairly elegant way to avoid that problem. It is especially worth considering when you need realtime collaboration, optimistic updates, or frequent drag-and-drop reordering.

Why AI Gives You a Headache: Managing Cognitive Fatigue for Developers

Kendrick B. Jung — Wed, 18 Feb 2026 20:17:02 +0000

A New Kind of Fatigue in the AI Era

Recently, I've been subscribing to Claude Code Max, Codex (ChatGPT Pro), and Antigravity (Google AI Pro), and my workload has grown dramatically as a result. At some point, I started getting headaches. I first blamed lack of sleep, but then our CTO at work asked whether I had been getting headaches, and I had in fact taken Tylenol just the day before. When I talked to other people who use AI heavily, they said they occasionally get headaches as well, so I decided to investigate. It turns out I'm not alone. Community posts asking "Does anyone get headaches when using AI? Planning and directing takes so much brainpower" are becoming common.

A 2025 academic study also found that deeper engagement with GenAI doesn't reduce cognitive burden—it actually amplifies it.

Why AI Exhausts Your Brain

Decision Fatigue Explosion

In traditional development, you'd spend a day diving deep into one design problem. Implementation took time, giving you the luxury of slowly making architectural decisions. AI flips this dynamic. When you can prototype three approaches in the time it previously took to build one, you must constantly make architecture-level decisions. The bottleneck shifts from "can we build this?" to "should we build this, and how?"

Continuous Task Initiation Burden

AI doesn't move on its own. "Remove this," "redo it," "change direction": you must constantly direct the next action. Initiating tasks over and over like this is a high-intensity cognitive activity that draws heavily on your brain's executive function.

Prompt Fatigue

A 2025 study of 832 GenAI users found that uncertainty about how to write prompts causes emotional fatigue, while unexpected responses cause cognitive fatigue. The process of choosing words and designing context to get desired results consumes a new type of energy.

Context Switching Costs

Prompt writing → result review → revision instruction → re-review. This loop repeats dozens or hundreds of times daily. While AI doesn't tire from context switching, the human brain pays a transition cost each time it changes modes.

Practical Solutions That Work

The 20-20-20 Rule

Every 20 minutes, look at something 20 feet (6m) away for 20 seconds. Proposed by optometrist Dr. Jeffrey Anshel in the 1990s, this rule is recommended by both the American Optometric Association (AOA) and the American Academy of Ophthalmology (AAO). A 2022 study (see Refs) found that applying the rule for 2 weeks significantly reduced digital eye strain symptoms.

I happen to have a view of the Mississauga skyline from my place, so every 20 minutes I look out at the open landscape for 20 seconds. Having a distant view to rest your eyes on makes practicing this rule much easier than trying to focus on a wall or nearby objects.

Batch Prompting

Instead of continuously micro-directing, give broad guidelines once, let AI draft the solution, then review the results in batches. This reduces the number of brain transitions. For example, tools like oh-my-claudecode's autopilot or ralplan's autonomous execution modes let you review outputs without directing every step.

Intentional Downtime

After 50 minutes of focus, you need 10 minutes away from screens entirely. This allows your brain's Default Mode Network (DMN) to activate, consolidating and organizing information—a completely different brain activity from continuously reading and judging AI outputs.

Posture and Environment Check

This one is easy to overlook. When concentrating on AI conversations, you may unconsciously tense your neck and shoulders, leading to tension headaches. Simply positioning your monitor at eye level and maintaining at least 63cm (arm's length) from the screen makes a noticeable difference.

The Key: Not "Use Less" but "Use Differently"

The solution to AI fatigue isn't to use AI less. The key is using it with boundaries, intention, and awareness that you're not a machine.

Acknowledging that productivity gains come with increased cognitive costs, and managing those costs, has become the new essential skill for developers in the AI era.

Refs

Siddhant Khare, "AI fatigue is real and nobody talks about it" (2025) — siddhantkhare.com
WarpedVisions, "The hidden cost of AI-assisted development: cognitive fatigue" (2025) — warpedvisions.org
ScienceDirect, "Fatigued by uncertainties: Exploring the cognitive and emotional costs of generative AI usage" (2025) — sciencedirect.com
MDPI, "Generative AI and Cognitive Challenges in Research" (2025) — mdpi.com
Human Clarity Institute, "Cognitive Load, Fatigue & Decision Offloading 2025 Data Summary" — humanclarityinstitute.com
Healthline, "20-20-20 Rule: Does It Help Prevent Digital Eyestrain?" (2025) — healthline.com
ScienceDirect, "The effects of breaks on digital eye strain, dry eye and binocular vision: Testing the 20-20-20 rule" (2022) — sciencedirect.com
