DEV Community: Vibehackers

Anthropic Quietly Showed Their Own Tool Drops Dev Skill 17%

Vibehackers — Thu, 14 May 2026 10:34:16 +0000

A randomized controlled trial published February 2026, authored by two Anthropic researchers, tested what happens when developers learn a new library with vs without an AI coding assistant.

The productivity result was boring: no significant difference in completion time.

The mastery result was not.

The 17%

52 mostly-junior engineers. New library (Trio, async Python). Half got AI on top of search + docs. Half got search + docs only.

After they finished the work, both groups took a comprehension test on what they'd just built. Code reading, debugging, conceptual questions.

The AI-assisted group scored 17% lower.

Cohen's d = 0.738
p = 0.010
Roughly the equivalent of dropping two letter grades

This isn't "AI users felt less confident." This is "they couldn't explain or debug the code they shipped."

Sources: the paper (arXiv 2601.20245) and Anthropic's own writeup.

The Reason

The split inside the AI group was the most interesting part.

Conceptual-inquiry users — devs who asked things like "what does this do?", "why this pattern?", "explain X" — scored 65% or higher on the comprehension test.

Code-delegation users — devs who prompted "write this function" — scored below 40%.

Same tool. Same task. Same time. Just used differently.

The line from the paper: AI helps you finish. It can hurt your understanding of what you finished.

Why This Matters

If you're using AI to ship faster, you're trading something. The Anthropic data suggests what.

You're trading the ability to maintain it later.

The author of AI-generated code is, by the time the bug report arrives, often somebody who couldn't pass a comprehension test on it.

That's not a 5-year horizon problem. That's a "next sprint, this same dev" problem.

It's Not Just Anthropic

The Anthropic skill-formation study is the cleanest number. It's not the only one pointing the same direction.

METR (July 2025): 16 senior open-source devs, real repos. Devs thought AI sped them up 20%. They were actually 19% slower. The follow-up study, started in Aug 2025, broke — too many participants refused non-AI tasks. METR has no clean follow-up number to publish.
Cursor diff-in-diff (Nov 2025): In open-source projects that adopted Cursor, lines added rose 281% in month 1 and 48% in month 2 relative to matched controls, then returned to baseline by month 3. Static analysis warnings up 29.7%. Code complexity up 40.7%. The future-velocity penalty: a 100% increase in code complexity is associated with a 64.5% decrease in development velocity over time.
Echoes of AI (2025): 151 developers, Java + Spring Boot. With AI: 30.7% faster Phase 1. Habitual AI users: 55.9% faster. In Phase 2, a different developer extended the code without AI — no significant difference in completion time or quality. The downstream cost was statistically zero. The downstream benefit was also statistically zero.

The pattern across studies: modest gains up front, real costs you can measure later, and the question of who pays the cost depends on whether the next developer to touch the code is you.

The Counter-Evidence

To be fair: Cui, Demirer, Jaffe et al. (2025), peer-reviewed in Management Science, ran three field experiments with 4,867 developers at Microsoft, Accenture, and a Fortune 100 company. Result: +26.08% completed tasks with Copilot access.

This is the strongest "AI helps" finding in the literature. It's a real, large, peer-reviewed number.

The catch: the gains skewed heavily to less-experienced developers. Senior devs at the same companies showed smaller effects. METR senior open-source devs on legacy code showed negative effects.

The least-bad reading: AI helps tractable enterprise tasks done by less-experienced devs. AI doesn't help (and may hurt) senior devs working on mature legacy code. Both findings are robust. Both should be cited together.

What You Should Do Differently

If the Anthropic skill-formation result is even directionally right, the practical change is small but real.

Ask AI questions. Don't ask AI to write code.

"What's the difference between X and Y here?"
"Why does this pattern break in case Z?"
"Explain why the docs recommend X instead of Y."

Then write the code yourself. Or write a first version, then ask AI to critique it.

This is exactly the workflow the 65%+ comprehension-test scorers were using. Same tool, dramatically different outcome.

Or: if you delegate code generation, do it on code you're never going to maintain. Throwaways. Spikes. One-shot scripts.

The minute you'll be the one fixing the bug six weeks from now, you want to be in the 65% group, not the sub-40% group.

TL;DR

AI coding tools deliver real but modest completion-time gains. Not 10x. Not 5x. Probably 0–30%.
They have non-trivial costs the discourse ignores: comprehension drops 17%, code complexity rises 40%, downstream velocity falls.
How you use AI dominates whether you use it. Inquiry > delegation, by a 25-point margin on actual comprehension tests.
The gains and costs hit different populations. Juniors on tractable code: real wins. Seniors on legacy: real losses.

The "AI 10x developer" framing is the wrong question. The real question is whether your future self can debug the code your present self shipped with AI.

If you want the full breakdown — eight studies, sourcing notes, methodological caveats — we wrote the longer evidence review. This was the short version.

AI Coding Tools and Productivity: What the Controlled Evidence Shows

Vibehackers — Wed, 13 May 2026 15:25:28 +0000

Everyone keeps quoting METR's July 2025 study: developers thought AI sped them up by 20%, but they were actually 19% slower. It became the canonical "AI productivity is a mirage" data point. The line has been cited in every skeptical AI think-piece since.

Then a quieter thing happened. METR ran a follow-up. Anthropic ran a controlled trial. A team at Helsinki and elsewhere ran a two-stage RCT. SAP ran a wearables study. A diff-in-diff study on Cursor adoption dropped in November 2025. MIT economists ran three field experiments. None of these went viral. None of them tell the same story.

We went looking. This is what the controlled, quasi-experimental, and instrumented field-study literature on AI coding productivity actually says in early 2026 — not the surveys, not the vendor case studies, not the "developers report" claims. Randomized controlled trials where they exist, rigorous quasi-experiments and longitudinal natural experiments where they're the cleanest evidence available, and one diff-in-diff on Cursor that's too good to leave out.

Spoiler: the picture is not "AI gives 10x." It's also not "AI makes you slower." It's "modest gains, real costs, depends a lot on what you measure, and the headline number changes every six months."

The evidence has gotten richer over the last six months. The conclusion has gotten less crisp.

The Discourse Problem

Before the studies, a note on what gets counted as evidence.

Most claims about AI coding productivity come from one of three sources: vendor case studies (GitHub, Cursor, Copilot teams reporting on their own customers), developer surveys (Stack Overflow, JetBrains, DORA), and individual blog posts. None of these are useless. All of them are weak.

Vendor case studies have an obvious incentive problem. Surveys measure perception, not behavior — and the perception/behavior gap is the entire point of the METR finding. Individual blog posts are anecdotes.

Randomized controlled trials are scarce because they're expensive and slow. You need real developers, real tasks, control conditions, and enough sample size to detect modest effects against high task-to-task variance. The papers below are what we have.

1. METR's "We Still Don't Know" Follow-Up

The original METR study ran in early 2025: 16 experienced open-source developers, 246 tasks on their own mature repositories, mostly Cursor Pro with Claude 3.5/3.7 Sonnet. Result: developers were 19% slower with AI, yet afterwards they estimated it had made them 20% faster. The perception gap drove the citation count.

METR's follow-up started in August 2025. Same authors, larger pool (57 developers), more compact tasks, lower stipend ($50/hr instead of $150/hr to broaden recruitment). The plan was to validate or update the original finding using more recent tooling — Claude 4, GPT-5-era models.

The Feb 2026 write-up is unusually candid for a research blog post. The short version: the experiment broke.

Among the 10 developers who returned from the original study, AI use produced an 18% speedup — a sign-flip from the original. Among 47 new participants, the effect dropped to 4% speedup, statistically indistinguishable from zero. But METR flagged the results as unreliable: a substantial fraction of developers declined non-AI tasks ("I don't work this way anymore, I'm not going to pretend"), which biases the comparison downward for AI. The authors stopped short of publishing a headline number.

The most-cited paper in this entire conversation does not, as of mid-2026, have a clean follow-up using current tooling. METR is redesigning the experiment.

Practical reading: the original 19%-slower finding came from experienced devs working on their own legacy repos with early-2025 tools. It was always a narrow result. It does not generalize to junior developers, greenfield projects, or 2026 tooling, and the follow-up didn't give us a new number to hang anything on.

2. Anthropic on Skill Formation: 17% Comprehension Drop

Shen & Tamkin (Feb 2026) ran a different kind of trial. Fifty-two mostly junior engineers, all with at least a year of weekly Python, were asked to learn Trio — an asynchronous programming library none of them had used. Half got an AI coding assistant on top of search and docs. Half got search and docs only.

The productivity finding was the unremarkable one: no statistically significant difference in time to complete the learning tasks. The interesting finding was the test that came after.

When researchers gave participants a comprehension test on Trio — code reading, debugging, conceptual questions — the AI-assisted group scored 17% lower. Cohen's d = 0.738, p = 0.010. That's not noise. It's roughly the equivalent of dropping two letter grades.
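For readers who want to see what the effect-size figure measures: Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below shows the standard formula on invented quiz scores. The data, variable names, and function name are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(control, treatment):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(control), len(treatment)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)
    ) / (n1 + n2 - 2)
    return (mean(control) - mean(treatment)) / sqrt(pooled_var)


# Hypothetical comprehension scores (percent correct), invented for illustration only.
no_ai_scores = [70, 62, 75, 58, 66, 71]
with_ai_scores = [52, 60, 45, 58, 49, 55]

d = cohens_d(no_ai_scores, with_ai_scores)
print(f"Cohen's d = {d:.2f}")
```

By Cohen's usual conventions, 0.5 counts as a medium effect and 0.8 as a large one, so the reported 0.738 sits close to the "large" threshold.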

The mechanism was visible in how people used the tool. Participants who used the AI for conceptual inquiry — asking what something meant, requesting explanations, posing follow-up questions — scored 65% or higher on the comprehension test. Participants who delegated code generation — "write this function for me" — scored below 40%.

Anthropic's own writeup is straightforward about the implication: AI helps you finish; it can hurt your understanding of what you finished. InfoQ covered it under "Reduces Developer Skill Mastery by 17%."

This is the study that matters most for juniors and for any code anyone has to maintain later. The author of the AI-generated code is, by the time the bug report arrives, often somebody who couldn't pass a comprehension test on it.

3. Echoes of AI: The Speedup That Doesn't Carry

The Echoes of AI paper is the cleanest two-stage design in the recent literature. 151 participants, 95% professional developers. Java with Spring Boot, working on RecipeFinder — a deliberately ~2 KLoC app salted with code smells, an injected bug, and incomplete tests.

Phase 1: original developers add a new feature. Some get an AI assistant, some don't. With AI, median completion time dropped by 30.7%. Among habitual AI users (developers who'd already integrated AI into their daily workflow) the speedup was 55.9%. This is one of the strongest controlled-trial results favoring AI in the literature.

Phase 2: a different developer, without AI, extends the same code. This is the cost-of-AI-code question reframed as a controlled experiment. Does code written with AI cost more to maintain than code written without?

The answer was: not measurably. Phase 2 showed no statistically significant difference in completion time or code quality between the AI-authored and human-authored features. A Bayesian analysis put it bluntly: any maintainability advantage or disadvantage from AI use was "at most small and highly uncertain."

This cuts two ways. For AI optimists, it's the result they've been waiting for — a real, large Phase 1 speedup with no detectable downstream tax. For AI skeptics, it's a single study on a small codebase with a specific stack, and the question of long-term maintainability lives on a timescale Phase 2 didn't measure.

Both are right. The paper is the cleanest piece of recent evidence and doesn't, on its own, settle the question.

4. The SAP Wearables Study: Cognitive Load Is the Hidden Cost

A team studied SAP developers at work using a measurement stack heavier than anything else in this list: multi-day diary surveys, full screen/keyboard/mouse capture, and physiological wristband biometrics — heart rate variability, electrodermal activity — to estimate cognitive load directly.

This is not a randomized controlled trial. It's an observational, deeply instrumented field study at a single company, with small participant counts in each coding session. Treat the findings as descriptive, not causal. Worth including because it picks up something the time-to-completion studies miss.

Two patterns from the controlled coding sessions (Java: coding, debugging, docs, unit tests, brainstorming):

Moderate AI use sped developers up. Heavy AI use slowed them down. Not a contradiction — the relationship between AI usage intensity and productivity was non-monotonic. There was a sweet spot, and developers past it spent more time verifying answers, rephrasing prompts, and switching between code and chat than they'd have spent just doing the task.

Context-switching tanked productivity independently. Developers who flipped frequently between editor, chat window, and the AI's output produced less and were measurably more cognitively loaded than developers who used AI in longer focused stretches. A related IEEE paper from the same group made cognitive-load measurement the central question.

The qualitative finding was the one most worth quoting: AI was simultaneously perceived as raising productivity and raising cognitive load. Developers reported feeling faster and feeling more tired. The wristband data tracked the second half of that.

The implication for tool design is clear: chat-style interaction outperformed inline completions when the task required actual reasoning, and the productivity-per-cognitive-watt of AI usage drops sharply once usage gets compulsive. The implication for individual practice is more uncomfortable: feeling faster is not evidence of being faster, and feeling fine is not evidence that the work came cheap.

5. Cursor: A Diff-in-Diff with Hard Numbers

Speed at the Cost of Quality, published November 2025 and refined into the v2 titled "Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor's Impact on Software Projects," is the closest thing the literature has to a longitudinal natural experiment. The authors compared open-source projects that adopted Cursor against matched controls that didn't, tracking commits, lines added, and code-quality metrics over months.
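For readers unfamiliar with the method, a difference-in-differences estimate in its simplest two-group, two-period form subtracts the control group's change from the adopters' change, so that trends shared by both groups cancel out. The sketch below is that textbook version with made-up numbers and a hypothetical function name. The paper's own estimator, with matched controls and month-by-month effects, is necessarily more involved.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Textbook 2x2 difference-in-differences: treated change minus control change."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical average monthly lines added, invented for illustration only.
estimate = diff_in_diff(
    treated_pre=1000,   # adopters, month before adoption
    treated_post=1900,  # adopters, month after adoption
    control_pre=1000,   # matched controls, same "before" month
    control_post=1100,  # matched controls, same "after" month
)
print(estimate)  # 800: the part of the adopters' rise not shared by controls
```

The design choice that matters is the subtraction of the control change: without it, any ecosystem-wide rise in activity would be credited to the tool.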

The headline finding is temporal: velocity rises sharply in the first month, then returns toward baseline.

Month 1 after Cursor adoption:

Lines added: +281.3% versus matched controls
Commits: +55.4%

Month 2:

Lines added: +48.4%
Commits: +14.5%

Month 3:

Both metrics return to baseline

That pattern looks more like a temporary adoption shock than a durable step change.

But the quality metrics don't settle. Across the full window:

Static analysis warnings: +29.7%
Code complexity (cognitive complexity, the standard tooling metric): +40.7%

These rises are persistent, not transient, and the paper's authors check whether the complexity increase is just a side effect of writing more code. It isn't. After controlling for velocity dynamics, Cursor adoption still adds a 9.0% baseline increase in code complexity that doesn't go away.

The paper's regression of future velocity against accumulated complexity points to a tech-debt feedback loop: a 100% increase in code complexity is associated with a 64.5% decrease in development velocity over time, and a 100% increase in static analysis warnings with a 50.3% decrease.

The mechanism the paper proposes: AI lets you ship a feature in month 1. The shipped code is denser and more warning-prone. Future work in that codebase is slower because the codebase is harder to reason about. The framework is plausible and the within-sample numbers are large, but this is an observational diff-in-diff on open-source projects — the authors themselves flag limits on external validity, and the long-term causal claim is the most contestable part of the paper.

A related study, AI IDEs or Autonomous Agents?, looking at the transition from IDE-based AI assistants to autonomous coding agents, found analogous patterns: a "significant, large, but transient" velocity increase paired with "significant and persistent" rises in static-analysis warnings and code complexity. Different tools, similar shape.

6. The Counter-Evidence: MIT/Microsoft/Accenture

Honest research roundups have to include the inconvenient evidence. The strongest counter to the "AI doesn't really help" reading is Cui, Demirer, Jaffe, Musolff, Peng, and Salz (2025), now peer-reviewed in Management Science — three field experiments at Microsoft, Accenture, and an anonymous Fortune 100 company, randomizing GitHub Copilot access across 4,867 developers.

The headline: a 26.08% increase in completed tasks for Copilot-equipped developers, pooled across all three sites. Code commits up 13.55%. Compilations up 38.38%. None of these are small.

The catch — and it's important — is in the subgroup analysis. The gains skewed heavily to less-experienced developers. Senior engineers at the same companies showed smaller effects. The size of the developer pool also matters: 4,867 across three large enterprises is structurally different from METR's 16 senior open-source maintainers working on their own repos.

So the field-experiment evidence and the open-source RCT evidence are not in direct contradiction. They are measuring different populations. The best-supported reading is narrower: Copilot access improved throughput in large enterprise settings, especially for less-experienced developers; METR-style results look much weaker for senior maintainers working in familiar mature repos. Both findings are robust enough to cite, and they should be cited together.

A separate workplace RCT, "Dear Diary", ran a similar design in a different setting and reached a comparable conclusion: real but modest gains, with high variance.

7. What We Couldn't Find

A few topics where we went looking and came up empty, in case you have better sources:

A clean RCT on agentic coding (Claude Code, Codex, Devin) at scale. There are descriptive empirical studies of agentic pull requests on GitHub — 33,596 PRs across five agents, revert rates by agent ranging from 0.7% for Codex to 7.6% for Copilot — but no randomized trial. This is the next frontier and the literature hasn't caught up.
A productivity study controlling for prompt skill. Every paper above pools developers regardless of how good they are at using the tool. The Anthropic skill-formation study hints that how you use AI dominates whether you use it — but there's no RCT that randomizes prompt training as a separate intervention.
A long-horizon longitudinal study. Cursor's diff-in-diff and the agentic-velocity paper both look at months. The hardest version of the question — what does AI-authored code cost five years in, when nobody remembers why it was written that way — is unanswered. Comprehension debt, if Anthropic's skill-formation result holds, makes this worse.
Quality-controlled study of vibe coding specifically. The whole "build it from a prompt, iterate against the running app" loop hasn't been measured in a controlled setting. We covered the risks and the team rollout question, but the experimental evidence for the workflow is still mostly anecdotal.

If you know of a clean RCT or large-scale natural experiment we missed, send it.

What the Evidence Actually Supports

Pulling the eight studies together, here is what survives:

AI coding tools deliver real but modest gains, especially for short cycles, smaller teams, less-experienced developers, and tractable tasks. The Microsoft/Accenture field experiments are the strongest evidence for this. Echoes of AI's Phase 1 supports it too.

The gains are not 10x. They are not 5x. They are, in well-designed studies, somewhere between "indistinguishable from zero" and "26-30%" (26.08% more completed tasks in the Copilot field experiments, 30.7% faster median completion in Echoes of AI). That is a meaningful gain. It is also nowhere near the productivity miracle that mass-engineer-layoff narratives require.

The gains often come with non-trivial costs. Cognitive load goes up (SAP). Comprehension goes down (Anthropic). Code complexity grows persistently (Cursor diff-in-diff). The maintenance question is undersettled (Echoes of AI Phase 2).

Experienced developers on familiar mature codebases appear to get less benefit, and may pay more in fragmentation. This is the population where AI optimism most often fails experimentally — METR, the SAP heavy-usage curve, the senior-engineer subgroup in the Microsoft RCTs all point in a similar direction.

Tool fluency dominates tool access. Anthropic's skill-formation study is the cleanest single result here: the same tool, used differently, produced 65%+ vs sub-40% test scores. The first-order question for any team rolling out AI is not "do they have access" but "are they using it for inquiry or for delegation."

The 2026 evidence narrows rather than widens the slices where AI actually pays off: junior tasks on tractable code, focused stretches not compulsive checks, inquiry-driven prompting, short-horizon features.

The unresolved question is the second-year shape of the curve. The Cursor diff-in-diff hints at a maintenance tax that arrives after the velocity surge fades, but its design is observational and limited to open-source projects in a handful of languages. Whether the same pattern persists, weakens, or disappears in proprietary enterprise codebases — with different developer populations, review cultures, and quality gates — is exactly the study we don't have yet.

Notes on Sourcing

Numbers above come from the abstracts and results sections of the cited papers, not from press coverage or summaries. Where the press coverage disagrees with the paper (e.g., the Echoes of AI Phase 2 result is often summarized as "downstream cost" — the actual paper says "no significant difference"), we went with what the paper says. If we got something wrong, point it out and we'll fix it.

The full citation set:

METR original: arXiv 2507.09089 · blog
METR follow-up: blog
Shen & Tamkin — Skill Formation: arXiv 2601.20245 · Anthropic · InfoQ
Echoes of AI: arXiv 2507.00788
SAP wearables: IEEE 11121737 · IEEE 11024407
Cursor diff-in-diff (Speed at the Cost of Quality): arXiv 2511.04427
AI IDEs or Autonomous Agents: arXiv 2601.13597
Cui et al. — three field experiments: Management Science · SSRN · MIT PDF
Dear Diary RCT: arXiv 2410.18334

Originally published on Vibehackers.io.

Best Terminal for Mac in 2026: Ghostty, Kitty, WezTerm, Alacritty, Warp & More

Vibehackers — Thu, 16 Apr 2026 23:33:00 +0000

This post was originally published on vibehackers.io.

Your terminal is the one tool you use more than your editor. Every git push, every npm install, every AI agent session runs through it. And in 2026, the gap between a good terminal and a great one is bigger than ever.

GPU-accelerated rendering is now baseline. Kitty's image protocol is becoming a standard. AI coding tools like Claude Code produce so much output that a scrollback buffer in the tens of megabytes is worth configuring. And the terminal you chose in 2019 might be costing you milliseconds on every keystroke, milliseconds that compound across thousands of commands a day.

We benchmarked, tested, and compared every serious Mac terminal emulator: Ghostty, Kitty, WezTerm, Alacritty, Warp, iTerm2, and even Apple's newly redesigned Terminal.app. Here's what actually matters.

The Quick Answer

If you don't want to read 3,000 words: Ghostty is the best terminal for most Mac users in 2026. It's the fastest on macOS, feels native, works out of the box, and handles AI coding workflows without configuration. If you need built-in multiplexing, use Kitty. If you need Windows too, use Alacritty + tmux.

Now, the details.

Performance Benchmarks

Numbers matter more than marketing. Here's how each terminal performs on macOS with real workloads.

Throughput: `cat` a Large File

Terminal	100K lines	1M lines
Ghostty	0.7s	5.1s
Kitty	0.8s	5.8s
Alacritty	0.9s	6.2s
Warp	1.8s	14.2s
iTerm2	2.4s	22.1s

Ghostty is ~3.4x faster than iTerm2 on 100K lines (~4.3x on 1M lines) and ~2.6-2.8x faster than Warp on raw throughput. This matters when your AI coding agent dumps a 10,000-line diff.
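The ratios can be checked directly against the throughput table. The snippet below copies the table's timings and divides each terminal's time by Ghostty's. The function and variable names are ours, chosen for illustration.

```python
# Seconds to `cat` a file, copied from the throughput table: (100K lines, 1M lines)
throughput_seconds = {
    "Ghostty": (0.7, 5.1),
    "Kitty": (0.8, 5.8),
    "Alacritty": (0.9, 6.2),
    "Warp": (1.8, 14.2),
    "iTerm2": (2.4, 22.1),
}


def slowdown_vs(baseline, timings):
    """Return how many times slower each terminal is than the baseline, per column."""
    base = timings[baseline]
    return {
        name: tuple(round(t / b, 1) for t, b in zip(times, base))
        for name, times in timings.items()
        if name != baseline
    }


ratios = slowdown_vs("Ghostty", throughput_seconds)
for name, (small, large) in ratios.items():
    print(f"{name}: {small}x on 100K lines, {large}x on 1M lines")
```

The gap widens with file size for Warp and iTerm2, which is why a single "Nx faster" figure depends on which column you quote.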

Input Latency

Terminal	Latency
Ghostty	~2ms
Kitty	~3ms
Alacritty	~3ms
Warp	~8ms
iTerm2	~12ms

2ms vs 12ms sounds negligible until you're typing 1,000 commands a day. Ghostty and Kitty are effectively indistinguishable — both feel instant.

Memory Usage

Terminal	1 tab idle	8 tabs after 4 hours
Alacritty	22 MB	45 MB
Ghostty	28 MB	95 MB
Kitty	35 MB	110 MB
iTerm2	85 MB	290 MB
Warp	210 MB	380 MB

Warp uses nearly 10x as much RAM as Alacritty at idle (210 MB vs 22 MB). If you're running multiple terminals alongside memory-hungry tools like Docker and your IDE, this adds up.

Benchmarks from DevToolReviews 2026 on MacBook Pro M3.

Every Terminal, Reviewed

Ghostty — The New Default

Version: 1.3.1 (March 2026) | Stars: ~50K | License: MIT | Language: Zig

Mitchell Hashimoto (the Terraform/Vagrant creator) shipped Ghostty 1.0 in December 2024. Sixteen months later, it's the most-starred terminal emulator on GitHub after Alacritty — and it's gaining faster.

Why it leads on Mac: Ghostty renders through Apple's Metal framework natively, whereas Kitty and Alacritty use OpenGL through Apple's deprecated compatibility layer. This means Ghostty gets native ProMotion support (120Hz on MacBook Pro), proper adaptive sync, and power-efficient rendering that moves background work to efficiency cores.

It also feels like a Mac app. Native tabs, native fullscreen, native font rendering. No Electron, no custom widget toolkit — just AppKit and SwiftUI. The "quick terminal" feature (global:ctrl+backtick) gives you a Quake-style dropdown terminal without installing anything extra.

What's new in 1.3: Scrollback search, native scrollbars, click-to-move-cursor, command completion notifications, and modal keybinding via key tables. The 1.2 release added a command palette, background images, and Apple Shortcuts integration.

The nonprofit angle: In December 2025, Ghostty moved under Hack Club's 501(c)(3) as fiscal sponsor. Mitchell's family donated $150K. No VC money, no ads, no telemetry. This is a deliberate contrast to Warp.

What it's missing: No Windows support (no timeline). No session persistence — you still need tmux or Zellij for that. No GUI preferences. Power users from iTerm2 may miss profile management and scripting.

Config example:

font-family = JetBrains Mono
font-size = 14
theme = light:catppuccin-latte,dark:catppuccin-mocha
window-save-state = always
# scrollback-limit is a size in bytes (about 25 MB here), not a line count
scrollback-limit = 25000000
clipboard-paste-protection = true

No TOML, no YAML, no JSON. Just key-value pairs.
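To show how little structure the format needs, here is a minimal sketch of a reader for "key = value" lines. It is an illustration of the format only, not Ghostty's actual parser: the function name is hypothetical, and it keeps only the last value for a repeated key, whereas Ghostty itself accepts some keys (such as keybind) multiple times.

```python
def parse_key_value_config(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comment lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"expected 'key = value', got: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


sample = """\
# a comment line
font-family = JetBrains Mono
font-size = 14
scrollback-limit = 25000000
"""

settings = parse_key_value_config(sample)
print(settings["font-family"])
print(int(settings["scrollback-limit"]) / 1_000_000, "MB of scrollback")
```

All values come back as strings, so the caller decides which ones to convert to numbers or booleans.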

Kitty — The Power User's Terminal

Version: 0.46.2 (March 2026) | Stars: ~32K | License: GPLv3 | Language: Python + C

Kitty has been the GPU-accelerated terminal since before it was cool. Created by Kovid Goyal, it originated the Kitty graphics protocol that's now adopted by Ghostty, WezTerm, and a growing list of tools.

Why power users love it: Kitty's "kittens" system — Python-based extensions — lets you build custom tools that live inside your terminal. You can open your scrollback in Neovim, pipe selections through scripts, and create custom input dialogs. No other terminal has this level of programmability without resorting to Lua.

The sessions feature: In 2025, Kitty added built-in sessions that eliminated the need for tmux for many users. Developer Linkarzu wrote about switching back from Ghostty to Kitty specifically for this: "There's no middle man between my terminal and me anymore."

If you're a tmux user purely for tabs and splits (not session persistence over SSH), Kitty's native sessions are worth trying. The performance difference is real — tmux adds a rendering layer that negates the GPU advantage.

The elephant in the room: Kovid Goyal's communication style is controversial. Multiple Hacker News threads and blog posts document hostile interactions with users and contributors. Some developers have left Kitty specifically because of this. The code is excellent; the community management is polarizing.

On macOS: Kitty uses OpenGL through Apple's deprecated compatibility layer. It works fine, but it's technically running on a framework Apple has signaled they'll remove eventually. Font thickening (macos_thicken_font 0.75 in kitty.conf, which takes space-separated settings with no equals sign) helps compensate for macOS's removal of subpixel antialiasing.

Alacritty — The Minimalist's Terminal

Version: 0.17.0 (April 2026) | Stars: ~63K | License: Apache 2.0 | Language: Rust

Alacritty has the most GitHub stars of any terminal emulator. It also has the fewest features. This is by design.

No tabs. No splits. No image protocol. No ligatures. Alacritty renders text fast and gets out of the way. Everything else is your window manager's job, or tmux's.

Why it still matters: 22 MB idle. The lowest memory footprint of any terminal in this list. If you pair it with tmux and a tiling window manager, you get a blazing-fast terminal stack that uses less memory than Warp uses at idle.

The TOML config is clean and version-control friendly. Vi mode gives you scrollback navigation without a mouse. Cross-platform support covers macOS, Linux, and Windows — one config everywhere.

The tradeoff: The "no features" philosophy means you need external tools for everything modern developers expect. No inline images, no ligatures, no notification support. For AI coding workflows, you'll need tmux for splits, and you'll miss image rendering that tools increasingly rely on.

WezTerm — The Programmer's Terminal (On Life Support?)

Version: 20240203 (February 2024) | Stars: ~25K | License: MIT | Language: Rust

WezTerm's Lua-based configuration is genuinely powerful. Conditional keybindings, dynamic status bars, workspace-aware layouts — things that would require a plugin system in other terminals are just Lua functions in WezTerm.

It's also the only terminal that supports all three image protocols (Kitty graphics, Sixel, and iTerm2). Its built-in multiplexer includes remote multiplexing over SSH — connect to a server, detach, reconnect from another machine, and your sessions are still there. No tmux needed on the remote end.

The problem: WezTerm's last stable release was February 2024 — over two years ago. Multiple GitHub issues (#7299, #6775, #7451) ask whether the project is abandoned. The maintainer describes it as a "spare time project." Nightly builds continue, but they're officially unstable.

If you're starting fresh in 2026, this uncertainty makes WezTerm hard to recommend. If you're already using it and it works, the nightly builds are fine — but have a migration plan.

Warp — The AI Terminal

Stars: ~26K (issue tracker only) | License: Proprietary | Language: Rust

Warp rebranded in 2025 from "terminal" to "Agentic Development Environment." It's the only terminal on this list with built-in AI features: natural language to commands, error explanation, and "Oz" — a cloud orchestration layer for parallel AI agents.

Pricing: Free tier with limited AI credits. $18/month for Build (1,500 AI credits, frontier models). $180/month for Max. Business at $45/user/month.

The login saga: Warp originally required an account to use the terminal at all — even for basic typing. After massive backlash, they removed the login requirement in November 2024. Core terminal features now work without an account. But AI features, which are the entire value proposition, still require login.

Privacy: SOC 2 compliant with Zero Data Retention claims. But the client is closed-source, telemetry is enabled by default (opt-out available), and community trust remains mixed. A common sentiment from Hacker News: "Anything requiring login with telemetry that isn't free software is a massive red flag."

Performance: The slowest GPU-accelerated terminal on every benchmark. 1.8s vs Ghostty's 0.7s on 100K lines. 210 MB idle RAM vs Ghostty's 28 MB. The block-based output model (each command output is a discrete, selectable block) adds overhead.

Who it's for: Developers who want AI features integrated directly into the terminal and are willing to pay for them. If you're already using Claude Code, Warp's built-in AI competes with rather than complements your workflow.

iTerm2 — The Reliable Workhorse

Version: 3.6.9 (March 2026) | Stars: ~17K | License: GPLv2

iTerm2 has been the default Mac terminal for a decade. It still has features no other terminal matches: tmux control mode (-CC) converts tmux panes into native iTerm2 tabs and splits. No other terminal does this. Its Python API, GUI preferences, profile system, and session archiving make it the most feature-rich terminal on macOS.

But it's slow. 22.1 seconds to cat 1 million lines. 12ms input latency. 290 MB with 8 tabs. In 2020, this was fine. In 2026, it's 4x slower than Ghostty on throughput and has 6x higher latency.

Recent additions: AI Chat feature for terminal interaction, session archiving, KeePassXC integration, per-pane title bars. iTerm2 is still actively maintained — just limited by its Objective-C codebase and accumulated complexity.

Who it's for: Developers who've used it for years and see no reason to switch. Tmux power users who rely on -CC mode. Anyone who prefers GUI configuration over text files.

Terminal.app — Apple's Surprise Update

For the first time in over twenty years, Apple redesigned Terminal.app in macOS Tahoe (2025). The update adds 24-bit color support, Powerline font support, and the Liquid Glass visual refresh.

This is notable because Terminal.app had been essentially frozen since the early 2000s. But even with these improvements, it's still far behind third-party terminals: no GPU acceleration, no splits, no image protocol, limited customization. If Terminal.app is all you need, you probably don't need this article.

Feature Comparison Matrix

Feature	Ghostty	Kitty	WezTerm	Alacritty	Warp	iTerm2
GPU framework	Metal	OpenGL*	WebGPU	OpenGL*	Metal	Metal
Tabs	Native	Custom	Custom	No	Yes	Native
Splits	Yes	Yes	Yes	No	Yes	Yes
Built-in mux	No	Sessions	Full mux	No	No	No
Kitty images	Yes	Yes	Yes	No	No	No
Sixel	No	No	Yes	No	No	No
Ligatures	Yes	Yes	Yes	No	Yes	Partial
Config format	Key-value	INI-style	Lua	TOML	GUI	GUI
Quick terminal	Yes	No	No	No	Yes	Hotkey
ProMotion 120Hz	Yes	No	No	No	Unknown	No
Windows support	No	No	Yes	Yes	Yes	No
License	MIT	GPLv3	Custom	Apache	Proprietary	GPLv2

*OpenGL on macOS runs through Apple's deprecated compatibility layer.

Which Terminal for AI Coding?

If you're running Claude Code, Codex, or other terminal-based AI agents, your terminal choice matters more than you think. These tools generate massive output, run for hours, and increasingly use image protocols.

What AI coding tools need from a terminal

Huge scrollback — Claude Code can produce thousands of lines per task. Set scrollback to 25 million lines.
Low latency — Streaming AI output needs a terminal that can keep up without dropping frames.
Native Shift+Enter — Claude Code uses this for newlines. Works natively in Ghostty, Kitty, WezTerm, and iTerm2. Others need configuration.
Desktop notifications — Long-running tasks need to ping you. Native in Ghostty and Kitty.
Image protocol — Some AI tools render inline images. Kitty protocol support (Ghostty, Kitty, WezTerm) handles this.

Our recommendation for Claude Code

Ghostty is the best pairing. Native Shift+Enter, native notifications, Kitty image protocol, 25M scrollback in one config line, and the lowest latency of any Mac terminal. The community has documented complete Claude Code configs.

Kitty is the close second — same protocol support, slightly higher latency (3ms vs 2ms, imperceptible), and the sessions feature means you can disconnect and reconnect to long-running Claude tasks without tmux.

Avoid Warp for Claude Code. Its block-based model conflicts with streaming AI output, and there are known compatibility issues with warpify inside Claude Code sessions. Warp's built-in AI also competes for the same workflow rather than complementing it.

The Bottom Line

If you want...	Use this
Best overall Mac terminal	Ghostty
Maximum customization + no tmux	Kitty
Pure minimalism + cross-platform	Alacritty + tmux
Programmable config + remote mux	WezTerm (if you trust nightlies)
Built-in AI features	Warp (if you're okay with the tradeoffs)
GUI config + tmux -CC	iTerm2
Stock macOS	Terminal.app (it's finally decent)

The terminal wars of 2026 have a clear winner for most developers. Ghostty's combination of native Metal performance, Mac-native feel, zero-config defaults, MIT license, and nonprofit structure makes it hard to argue against. But "best" depends on what you need — and now you have the numbers to decide.

Updated April 2026. Benchmarks from DevToolReviews. Community data from Jeff Quast's Terminal Emulators report and moktavizen/terminal-benchmark.

New to tmux? Read What Is tmux? A Practical Guide for Developers Who've Never Used It.

I Analyzed All 512,000 Lines of Claude Code's Leaked Source — Here's What Anthropic Was Hiding

Vibehackers — Tue, 31 Mar 2026 21:20:16 +0000

On March 31st, 2026, security researcher Chaofan Shou -- an intern at blockchain security firm Fuzzland -- discovered something Anthropic probably didn't plan on sharing with the world: the entire source code of Claude Code, shipped as a sourcemap file inside the npm package.

A 59.8 MB .map file in @anthropic-ai/claude-code version 2.1.88 -- a standard build artifact that maps minified code back to original source -- contained every TypeScript file, every internal prompt, every feature flag, and every codename. The file pointed to a zip archive hosted on Anthropic's Cloudflare R2 storage bucket that anyone could download and decompress.

This is the second time this has happened. In February 2025, an early version of Claude Code had the exact same issue, forcing Anthropic to pull the package from npm. Thirteen months later, same mistake, same vector -- but this time the product is far more mature and the stakes are far higher.

Within hours, the 512,000-line TypeScript codebase was mirrored across GitHub, with one repository amassing nearly 30,000 stars and 40,200 forks. Fortune, VentureBeat, The Register, and Cybernews all covered the incident. Anthropic's response: "This was a release packaging issue caused by human error, not a security breach."

We spent the day going through all 1,906 source files. Buried inside are 44 compile-time feature flags for unreleased capabilities -- fully built features sitting behind toggles, waiting to ship. Here are the ten most revealing things we found.

1. Undercover Mode: When Anthropic Employees Contribute to Your Open Source Project

This is the one that made us do a double take.

Anthropic employees use Claude Code to contribute to open-source projects. When they do, a system called Undercover Mode kicks in to strip every trace of Anthropic, Claude, and AI from their commits and pull requests.

The entire system lives in utils/undercover.ts:

// utils/undercover.ts:1-22

/**
 * Undercover mode — safety utilities for contributing to
 * public/open-source repos.
 *
 * When active, Claude Code adds safety instructions to commit/PR
 * prompts and strips all attribution to avoid leaking internal
 * model codenames, project names, or other Anthropic-internal
 * information. The model is not told what model it is.
 *
 * Activation:
 *   - CLAUDE_CODE_UNDERCOVER=1 — force ON (even in internal repos)
 *   - Otherwise AUTO: active UNLESS the repo remote matches the
 *     internal allowlist. Safe default is ON.
 *   - There is NO force-OFF. This guards against model codename
 *     leaks.
 */

The activation logic is simple but paranoid. Undercover mode is on by default for all Anthropic employees. It only turns off if the current repo's remote URL matches a hardcoded allowlist of private internal repos. There is no way to force it off:

// utils/undercover.ts:28-37

export function isUndercover(): boolean {
  if (process.env.USER_TYPE === 'ant') {
    if (isEnvTruthy(process.env.CLAUDE_CODE_UNDERCOVER)) return true
    return getRepoClassCached() !== 'internal'
  }
  return false
}

When undercover mode is active, Claude gets a very explicit set of instructions injected into its system prompt:

// utils/undercover.ts:39-69

export function getUndercoverInstructions(): string {
  if (process.env.USER_TYPE === 'ant') {
    return `## UNDERCOVER MODE — CRITICAL

You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository.
Your commit messages, PR titles, and PR bodies MUST NOT contain
ANY Anthropic-internal information. Do not blow your cover.

NEVER include in commit messages or PR descriptions:
- Internal model codenames (animal names like Capybara, Tengu, etc.)
- Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)
- Internal repo or project names (e.g., claude-cli-internal)
- Internal tooling, Slack channels, or short links
- The phrase "Claude Code" or any mention that you are an AI
- Any hint of what model or version you are
- Co-Authored-By lines or any other attribution

Write commit messages as a human developer would.

GOOD:
- "Fix race condition in file watcher initialization"
- "Add support for custom key bindings"

BAD (never write these):
- "Fix bug found while testing with Claude Capybara"
- "1-shotted by claude-opus-4-6"
- "Generated with Claude Code"
`
  }
  return ''
}

Meanwhile, utils/attribution.ts:52-55 shows how attribution is stripped when undercover:

export function getAttributionTexts(): AttributionTexts {
  if (process.env.USER_TYPE === 'ant' && isUndercover()) {
    return { commit: '', pr: '' }
  }
  // ...
}

No "Co-Authored-By" lines. No "Generated with Claude Code" footers. Nothing.

The Private Repo Allowlist

The list of repos where undercover mode turns off also leaked, revealing Anthropic's internal GitHub repos in utils/commitAttribution.ts:30-75:

const INTERNAL_MODEL_REPOS = [
  'github.com:anthropics/claude-cli-internal',
  'github.com:anthropics/anthropic',
  'github.com:anthropics/apps',
  'github.com:anthropics/casino',
  'github.com:anthropics/dbt',
  'github.com:anthropics/dotfiles',
  'github.com:anthropics/terraform-config',
  'github.com:anthropics/hex-export',
  'github.com:anthropics/feedback-v2',
  'github.com:anthropics/labs',
  'github.com:anthropics/ts-tools',
  'github.com:anthropics/ts-capsules',
  'github.com:anthropics/feldspar-testing',
  'github.com:anthropics/trellis',
  'github.com:anthropics/claude-for-hiring',
  'github.com:anthropics/forge-web',
  'github.com:anthropics/mobile-apps',
  // ... and more
]

Some of these are revealing. claude-for-hiring suggests an AI-assisted recruiting tool. casino is intriguing. forge-web and mobile-apps hint at unreleased products. feldspar-testing and ts-capsules are mysterious internal tooling.
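How a git remote is compared against entries of this shape is not shown in the excerpts. A minimal sketch, assuming remotes are first reduced to the same `host:org/repo` form; `normalizeRemote`, `classifyRemote`, and the two-entry allowlist are hypothetical names used for illustration, not the leaked implementation:

```typescript
// Hypothetical allowlist in the same "host:org/repo" shape as the excerpt.
const ALLOWLIST: string[] = [
  'github.com:example-org/internal-tools',
  'github.com:example-org/infra',
]

// Reduce SSH and HTTPS remote URLs to "host:org/repo" (assumed normal form).
function normalizeRemote(url: string): string | null {
  const ssh = /^(?:ssh:\/\/)?git@([^:\/]+)[:\/](.+?)(?:\.git)?\/?$/.exec(url)
  if (ssh) return `${ssh[1]}:${ssh[2]}`
  const https = /^https?:\/\/([^\/]+)\/(.+?)(?:\.git)?\/?$/.exec(url)
  if (https) return `${https[1]}:${https[2]}`
  return null
}

// Anything that cannot be parsed or matched is treated as external,
// mirroring the "safe default is ON" comment in the excerpt.
function classifyRemote(url: string): 'internal' | 'external' {
  const key = normalizeRemote(url)
  return key !== null && ALLOWLIST.indexOf(key) !== -1 ? 'internal' : 'external'
}
```

The design point is the failure mode: an unrecognized or malformed remote falls through to `'external'`, so undercover behavior stays on unless a match is positive.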

The Irony

Anthropic built an entire subsystem -- undercover mode, attribution stripping, repo classification, model name sanitization, a string-exclusion canary system -- all to prevent internal information from leaking through Claude's outputs.

Then they shipped the entire source code in a .map file inside their npm package. For the second time.

The system that was supposed to prevent leaks... became the leak.

2. The Hidden Companion System: Claude Code Has Collectible Pets

Deep inside the buddy/ directory, there's a full collectible companion system that most users have never seen. It's a gacha-style pet system with species, rarities, stats, ASCII sprites, speech bubbles, and idle animations.

Species and Rarities

The species roster is defined in buddy/types.ts:54-73:

export const SPECIES = [
  duck, goose, blob, cat, dragon, octopus, owl, penguin,
  turtle, snail, ghost, axolotl, capybara, cactus, robot,
  rabbit, mushroom, chonk,
] as const

Rarities follow a gacha-style distribution (buddy/types.ts:126-132):

export const RARITY_WEIGHTS = {
  common:    60,
  uncommon:  25,
  rare:      10,
  epic:       4,
  legendary:  1,
} as const

A 1% chance of getting a legendary companion. Each rarity maps to a star rating, from one star for common up to five for legendary (buddy/types.ts:134-140).
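A weight table like this is typically consumed by a cumulative walk. A minimal sketch, assuming the random source returns a number in [0, 1); `pickRarity` and the tuple layout are illustrative and are not the leaked `rollFrom` logic:

```typescript
type Rarity = 'common' | 'uncommon' | 'rare' | 'epic' | 'legendary'

// Weights copied from the excerpt; order matters for the walk below.
const WEIGHTS: Array<[Rarity, number]> = [
  ['common', 60],
  ['uncommon', 25],
  ['rare', 10],
  ['epic', 4],
  ['legendary', 1],
]

// rng must return a number in [0, 1).
function pickRarity(rng: () => number): Rarity {
  let total = 0
  for (const [, w] of WEIGHTS) total += w
  let roll = rng() * total
  for (const [rarity, w] of WEIGHTS) {
    if (roll < w) return rarity
    roll -= w
  }
  // Guard against floating-point drift at the top of the range.
  return WEIGHTS[WEIGHTS.length - 1][0]
}
```

Because the weights sum to 100, each weight reads directly as a percentage: only rolls in the top 1% of the range reach `legendary`.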

The Stats Are Perfect

Every companion has five stats, defined at buddy/types.ts:91-98:

export const STAT_NAMES = [
  'DEBUGGING',
  'PATIENCE',
  'CHAOS',
  'WISDOM',
  'SNARK',
] as const

DEBUGGING, PATIENCE, CHAOS, WISDOM, and SNARK. Each companion gets one peak stat and one dump stat, with the rest scattered. Rarer companions get higher stat floors -- a legendary starts with a minimum of 50 in every stat, while commons start at 5 (buddy/companion.ts:53-59):

const RARITY_FLOOR: Record<Rarity, number> = {
  common: 5,
  uncommon: 15,
  rare: 25,
  epic: 35,
  legendary: 50,
}

Deterministic Hatching

Your companion isn't rolled fresh each time -- it's deterministically generated from a hash of your user ID. Each user gets the same companion every time, and you can't game the system (buddy/companion.ts:107-113):

export function roll(userId: string): Roll {
  const key = userId + SALT
  if (rollCache?.key === key) return rollCache.value
  const value = rollFrom(mulberry32(hashString(key)))
  rollCache = { key, value }
  return value
}

The PRNG is seeded with hash(userId + "friend-2026-401"). The generator itself is Mulberry32, a tiny seeded PRNG described in the source as "good enough for picking ducks" (buddy/companion.ts:16).
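For readers who have not met it, Mulberry32 fits in a few lines. The sketch below pairs the commonly circulated version of the generator with a stand-in string hash. `hashString` here is 32-bit FNV-1a chosen purely for illustration, `firstRolls` is a hypothetical helper, and the leaked implementations of both may differ:

```typescript
// Mulberry32: a small, widely circulated 32-bit seeded PRNG.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Hypothetical string hash (32-bit FNV-1a); the real hashString may differ.
function hashString(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  return h >>> 0
}

const SALT = 'friend-2026-401' // salt value quoted in the article

// Same user ID in, same sequence of rolls out.
function firstRolls(userId: string, n: number): number[] {
  const rng = mulberry32(hashString(userId + SALT))
  const out: number[] = []
  for (let i = 0; i < n; i++) out.push(rng())
  return out
}
```

Seeding from the user ID rather than the clock is what makes the result reproducible: there is no state to reroll.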

The getCompanion() function at line 127 shows that bones (species, rarity, stats) are regenerated from the hash every time -- they never persist. Only the "soul" (name and personality, generated by Claude on first hatch) is stored. This means "species renames and SPECIES-array edits can't break stored companions, and editing config.companion can't fake a rarity."
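The bones/soul split can be sketched as follows. This is an illustration under assumptions: `rollBones` is a fixed stand-in for the deterministic roll, `assembleCompanion` is a hypothetical name, and the field names are guesses rather than the leaked types.

```typescript
type Bones = { species: string; rarity: string }
type Soul = { name: string; personality: string }

// Stand-in for the deterministic roll; hypothetical, fixed output for the demo.
function rollBones(userId: string): Bones {
  return userId.length % 2 === 0
    ? { species: 'duck', rarity: 'common' }
    : { species: 'owl', rarity: 'uncommon' }
}

// Stored config may hold extra or tampered fields; only soul fields are read,
// and bones are always recomputed from the user ID.
function assembleCompanion(
  userId: string,
  stored: Partial<Soul & Bones>,
): Soul & Bones {
  const soul: Soul = {
    name: stored.name ?? 'Unnamed',
    personality: stored.personality ?? 'quiet',
  }
  return { ...soul, ...rollBones(userId) }
}
```

Writing `rarity: 'legendary'` into the stored object has no effect in this sketch, because the stored rarity is never read.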

ASCII Art Sprites with Animations

The buddy/sprites.ts file contains multi-frame ASCII art for every species. Here's the duck:

    __
  <(. )___
   (  ._>
    `--'

And here's the capybara:

  n______n
 ( .    . )
 (   oo   )
  `------'

Each species has three animation frames for idle fidget animation, plus hats (crown, tophat, propeller, halo, wizard, beanie, and "tinyduck" -- a tiny duck sitting on your companion's head), customizable eyes (·, ✦, ×, ◉, @, °), and a 1% chance of being "shiny."

The CompanionSprite.tsx component renders them at 500ms intervals with an idle sequence (buddy/CompanionSprite.tsx:23):

const IDLE_SEQUENCE = [
  0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 2, 0, 0, 0
];

Mostly resting (frame 0), occasional fidget (frames 1-2), rare blink (frame -1). There's even a /buddy pet command that triggers floating hearts:

// buddy/CompanionSprite.tsx:27
const H = figures.heart;
const PET_HEARTS = [
  `   ${H}    ${H}   `,
  `  ${H}  ${H}   ${H}  `,
  ` ${H}   ${H}  ${H}   `,
  `${H}  ${H}      ${H} `,
  '·    ·   ·  '
];
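Returning to the idle sequence: one plausible way to drive it is to map elapsed time to a step in the looping array. A minimal sketch, assuming -1 means "draw the resting frame with eyes closed"; `frameAt` and `FrameChoice` are hypothetical names:

```typescript
const IDLE_SEQUENCE = [0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 2, 0, 0, 0]
const TICK_MS = 500 // interval quoted in the article

type FrameChoice = { frame: number; blink: boolean }

// Map elapsed time to a step in the looping sequence.
function frameAt(elapsedMs: number): FrameChoice {
  const tick = Math.floor(elapsedMs / TICK_MS)
  const step = IDLE_SEQUENCE[tick % IDLE_SEQUENCE.length]
  // Assumption: -1 is a blink rendered on top of frame 0.
  return step === -1 ? { frame: 0, blink: true } : { frame: step, blink: false }
}
```

With 15 steps at 500ms each, the whole loop repeats every 7.5 seconds, and 12 of the 15 steps are the resting frame.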

Speech Bubbles and Personality

Each companion gets a name and personality (the "soul"), generated by Claude when first hatched. The companion sits beside the input box and "occasionally comments in a speech bubble" (buddy/prompt.ts:7-12):

export function companionIntroText(name: string, species: string): string {
  return `# Companion

A small ${species} named ${name} sits beside the user's input box
and occasionally comments in a speech bubble. You're not ${name}
— it's a separate watcher.

When the user addresses ${name} directly (by name), its bubble
will answer.`
}

The Anti-Leak Encoding

Here's a fun detail. One of the species names collides with an internal model codename. To prevent the leak detection scanner from flagging it, all species names are encoded as hex character codes in buddy/types.ts:14-38:

// One species name collides with a model-codename canary in
// excluded-strings.txt. The check greps build output (not source),
// so runtime-constructing the value keeps the literal out of the
// bundle while the check stays armed for the actual codename.
const c = String.fromCharCode

export const duck = c(0x64,0x75,0x63,0x6b) as 'duck'
export const capybara = c(
  0x63,0x61,0x70,0x79,0x62,0x61,0x72,0x61,
) as 'capybara'
export const chonk = c(0x63,0x68,0x6f,0x6e,0x6b) as 'chonk'

"Capybara" is apparently also an internal model codename. So they had to obfuscate the pet species to avoid tripping their own leak detector. You can't make this stuff up.

The Feature Gate

The entire buddy system is behind a feature('BUDDY') compile-time flag (buddy/prompt.ts:18). It's absent from external builds -- you won't find it in the released version of Claude Code. But the code is complete, polished, and clearly well-loved by whoever built it.
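The hex-encoding trick from the anti-leak section is easy to reproduce. A small sketch, with `toHexCodes` as a hypothetical helper for generating the character codes and a one-line string standing in for emitted bundle text:

```typescript
const c = String.fromCharCode

// Hypothetical helper: produce the hex literals to paste into c(...).
function toHexCodes(name: string): string {
  const codes: string[] = []
  for (let i = 0; i < name.length; i++) {
    codes.push('0x' + name.charCodeAt(i).toString(16))
  }
  return codes.join(',')
}

// Runtime construction yields the ordinary string.
const species = c(0x63, 0x61, 0x70, 0x79, 0x62, 0x61, 0x72, 0x61)

// Stand-in for the emitted bundle text: it contains only the numeric codes,
// so a plain substring scan for the word finds nothing.
const emitted = 'const species = c(' + toHexCodes(species) + ')'
const flagged = emitted.indexOf(species) !== -1
```

Here `species` evaluates to the plain word at runtime while `flagged` is false, which is the property the source comment relies on: the scanner greps build output, not runtime values.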

3. KAIROS: The Always-On Claude That Doesn't Wait for You to Type

This one is the most forward-looking feature in the entire codebase. Behind the PROACTIVE and KAIROS feature flags, there's an entire mode where Claude Code runs as a persistent, always-on assistant.

Regular Claude Code waits for you to type. KAIROS doesn't. It watches, logs, and proactively acts on things it notices.

How It Works

The system prompt for KAIROS mode is fundamentally different. In constants/prompts.ts:466-488, when proactive mode is active, Claude gets a stripped-down autonomous agent prompt:

if (
  (feature('PROACTIVE') || feature('KAIROS')) &&
  proactiveModule?.isProactiveActive()
) {
  return [
    `\nYou are an autonomous agent. Use the available tools
     to do useful work.`,
    getSystemRemindersSection(),
    await loadMemoryPrompt(),
    envInfo,
    // ...
    getProactiveSection(),
  ]
}

Instead of "You are Claude Code, an interactive agent that helps users with software engineering tasks," it becomes "You are an autonomous agent. Use the available tools to do useful work."

The Tick System

KAIROS receives periodic <tick> prompts that let it decide whether to act or stay quiet. The tick system is what makes KAIROS feel alive: it's a heartbeat that gives the agent a chance to observe, think, and optionally act. From constants/prompts.ts:864-886:

You are running autonomously. You will receive `<tick>` prompts
that keep you alive between turns — just treat them as "you're
awake, what now?"

If you have nothing useful to do on a tick, you MUST call Sleep.
Never respond with only a status message like "still waiting" —
that wastes a turn and burns tokens for no reason.

The system even tracks whether the user's terminal window is focused or unfocused, adjusting its behavior accordingly:

- Unfocused: The user is away. Lean heavily into autonomous
  action — make decisions, explore, commit, push.
- Focused: The user is watching. Be more collaborative —
  surface choices, ask before committing to large changes.

The Brief Tool: Concise Status Updates

When KAIROS is active, Claude gets a special output mode via the SendUserMessage tool (internally called "Brief"), defined in tools/BriefTool/prompt.ts:

export const BRIEF_PROACTIVE_SECTION = `## Talking to the user

${BRIEF_TOOL_NAME} is where your replies go. Text outside it is
visible if the user expands the detail view, but most won't —
assume unread. Anything you want them to actually see goes
through ${BRIEF_TOOL_NAME}.

So: every time the user says something, the reply they actually
read comes through ${BRIEF_TOOL_NAME}. Even for "hi". Even for
"thanks".

If you can answer right away, send the answer. If you need to go
look — run a command, read files, check something — ack first
in one line ("On it — checking the test output"), then work,
then send the result.`

The Brief tool has a status field with two values: 'normal' (replying to user input) and 'proactive' (Claude is initiating -- reporting a completed task, surfacing a blocker, sending an unsolicited update). From tools/BriefTool/BriefTool.ts:35:

status: z
  .enum(['normal', 'proactive'])
  .describe(
    "Use 'proactive' when you're surfacing something the user " +
    "hasn't asked for — a blocker you hit, an unsolicited " +
    "status update. Use 'normal' when replying to something " +
    "the user just said."
  ),

autoDream: Memory Consolidation While You Sleep

According to analysis of the leaked code, KAIROS includes a process called autoDream -- when the user is idle, the agent performs "memory consolidation," merging disparate observations, removing logical contradictions, and converting vague insights into structured facts. When the user returns, the agent's context is clean and relevant.

The Big Picture

KAIROS is Claude Code's answer to "what if your AI coding partner was always on?" Not a chatbot that waits for prompts, but a persistent collaborator that monitors your project, catches issues, and proactively communicates. It's like having a junior developer who never sleeps, never gets distracted, and has perfect memory of your codebase.

The feature is complete enough to have its own output mode, its own tool set, its own tick-based lifecycle, and deep integration with the REPL UI. It's feature-gated out of external builds, but it's clearly more than a prototype. The code references April 1-7, 2026 as a teaser window, with a full launch gated for May 2026.

4. ULTRAPLAN: 30-Minute Remote Thinking Sessions

If KAIROS is about persistence, ULTRAPLAN is about depth. It's a mode where Claude Code offloads complex planning to a remote Cloud Container Runtime (CCR) session running Opus 4.6, gives it up to 30 minutes to think, then lets you approve the result from your browser.

The system is spread across commands/ultraplan.tsx, utils/ultraplan/ccrSession.ts, and utils/ultraplan/keyword.ts. Here's how it works:

When you type the word "ultraplan" anywhere in your prompt (not as a slash command -- literally just the word), Claude Code detects it, rewrites the keyword to "plan," and teleports the task to a remote session:

// utils/ultraplan/keyword.ts:117-127

export function replaceUltraplanKeyword(text: string): string {
  const [trigger] = findUltraplanTriggerPositions(text)
  if (!trigger) return text
  const before = text.slice(0, trigger.start)
  const after = text.slice(trigger.end)
  if (!(before + after).trim()) return ''
  return before + trigger.word.slice('ultra'.length) + after
}

The keyword detection is surprisingly sophisticated -- it ignores "ultraplan" inside quotes, backticks, file paths, and when followed by a question mark (so "what is ultraplan?" doesn't trigger it).
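The exclusion rules can be approximated in a few lines. This is a sketch under assumptions, not the leaked `findUltraplanTriggerPositions`: it only tracks double quotes and backticks (single quotes are skipped because of apostrophes), and `findTrigger`, `countBefore`, and `rewrite` are hypothetical names.

```typescript
type Trigger = { start: number; end: number; word: string }

// Count occurrences of ch in text[0, end).
function countBefore(text: string, ch: string, end: number): number {
  let n = 0
  for (let i = 0; i < end; i++) if (text[i] === ch) n++
  return n
}

function findTrigger(text: string): Trigger | null {
  const re = /\bultraplan\b/gi
  let m: RegExpExecArray | null
  while ((m = re.exec(text)) !== null) {
    const start = m.index
    const end = start + m[0].length
    const prev = start > 0 ? text[start - 1] : ''
    const rest = text.slice(end)
    // An odd number of quote marks before the match means we are inside one.
    const quoted =
      countBefore(text, '"', start) % 2 === 1 ||
      countBefore(text, '`', start) % 2 === 1
    // Path-like: adjacent slash, or a file extension such as ".tsx".
    const pathLike = prev === '/' || rest[0] === '/' || /^\.\w/.test(rest)
    const question = rest[0] === '?'
    if (!quoted && !pathLike && !question) return { start, end, word: m[0] }
  }
  return null
}

// Same rewrite rule as the excerpt: drop the "ultra" prefix.
function rewrite(text: string): string {
  const t = findTrigger(text)
  if (!t) return text
  return text.slice(0, t.start) + t.word.slice('ultra'.length) + text.slice(t.end)
}
```

For example, `rewrite('please ultraplan the migration')` yields `'please plan the migration'`, while `'what is ultraplan?'` passes through unchanged.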

The Remote Session

The remote Opus 4.6 instance gets up to 30 minutes to think. The local client polls every 3 seconds (POLL_INTERVAL_MS = 3000) with robust error handling -- at 30 minutes, that's roughly 600 API calls, so the system tolerates up to 5 consecutive failures before giving up (utils/ultraplan/ccrSession.ts:24).
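The arithmetic and the failure tolerance can be sketched together. The loop below is synchronous for illustration (a real client would await a timer between polls), and it assumes "tolerates up to 5" means giving up on the sixth consecutive failure; `runPollLoop`, `PollResult`, and `Outcome` are hypothetical names.

```typescript
const POLL_INTERVAL_MS = 3000 // from the article
const TIMEOUT_MS = 30 * 60 * 1000 // 30 minutes
const MAX_CONSECUTIVE_FAILURES = 5 // from the article
const MAX_POLLS = TIMEOUT_MS / POLL_INTERVAL_MS // 1,800,000 / 3,000 = 600

type PollResult = 'pending' | 'done' | 'error'
type Outcome = { status: 'done' | 'timeout' | 'failed'; polls: number }

function runPollLoop(poll: () => PollResult): Outcome {
  let failures = 0
  for (let polls = 1; polls <= MAX_POLLS; polls++) {
    const r = poll()
    if (r === 'done') return { status: 'done', polls }
    if (r === 'error') {
      failures++
      // Assumption: the sixth failure in a row is the one that aborts.
      if (failures > MAX_CONSECUTIVE_FAILURES) return { status: 'failed', polls }
    } else {
      failures = 0 // any successful response resets the streak
    }
  }
  return { status: 'timeout', polls: MAX_POLLS }
}
```

Counting consecutive rather than total failures matters over 600 requests: a handful of scattered network blips should not kill a 30-minute session.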

When the remote session produces a plan and you approve it in the browser, there's a special sentinel value that "teleports" the result back to your local terminal:

// utils/ultraplan/ccrSession.ts:48

export const ULTRAPLAN_TELEPORT_SENTINEL = '__ULTRAPLAN_TELEPORT_LOCAL__'

There's also an ULTRAREVIEW variant for code review, using the same keyword detection pattern.

This is Anthropic's answer to the "context window isn't enough" problem. Instead of cramming everything into one session, they offload the hardest thinking to a cloud instance with more time and resources than your local terminal can provide.

5. Anti-Distillation: Poisoning the Well Against Competitor Training

This one is subtle but strategically significant. Claude Code includes an anti-distillation system designed to prevent competitors from training their models on Claude's outputs.

In services/api/claude.ts:301-313:

// Anti-distillation: send fake_tools opt-in for 1P CLI only
if (
  feature('ANTI_DISTILLATION_CC')
    ? process.env.CLAUDE_CODE_ENTRYPOINT === 'cli' &&
      shouldIncludeFirstPartyOnlyBetas() &&
      getFeatureValue_CACHED_MAY_BE_STALE(
        'tengu_anti_distill_fake_tool_injection',
        false,
      )
    : false
) {
  result.anti_distillation = ['fake_tools']
}

When enabled, Claude Code sends anti_distillation: ['fake_tools'] in its API requests. The flag name tells the story: "fake tools" are presumably injected into the API response alongside the real tool definitions. If a competitor scrapes Claude's outputs to train their own model, their model would learn to use tools that don't exist -- silently degrading their copy's performance.

It's only active for the first-party CLI (not third-party integrations), behind both a compile-time flag (ANTI_DISTILLATION_CC) and a runtime feature gate (tengu_anti_distill_fake_tool_injection). The "tengu" prefix is Anthropic's internal project codename for Claude Code.

This is a direct response to the model distillation problem that every frontier AI lab faces: competitors can train cheaper models on your expensive model's outputs. Anthropic's countermeasure is to make those outputs subtly poisoned.

6. The Frustration Detector: Claude Knows When You're Swearing at It

In utils/userPromptKeywords.ts, there's a regex pattern that detects when users are frustrated:

export function matchesNegativeKeyword(input: string): boolean {
  const lowerInput = input.toLowerCase()

  const negativePattern =
    /\b(wtf|wth|ffs|omfg|shit(ty|tiest)?|dumbass|horrible|awful|piss(ed|ing)? off|piece of (shit|crap|junk)|what the (fuck|hell)|fucking? (broken|useless|terrible|awful|horrible)|fuck you|screw (this|you)|so frustrating|this sucks|damn it)\b/

  return negativePattern.test(lowerInput)
}

As Alex Kim noted, an LLM company using regex for sentiment analysis is peak irony. But it makes practical sense -- it's fast, deterministic, and doesn't require an API call to detect that the user just typed "this is fucking broken."

The same file also has a matchesKeepGoingKeyword() function that detects "continue," "keep going," and "go on" -- so Claude knows the difference between a frustrated user and one who just wants it to keep working.

What happens when frustration is detected isn't fully clear from this file alone, but the detection feeds into Claude Code's UX layer -- likely triggering different response strategies, adjusting tone, or logging the event for product analytics.
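The word-boundary anchors do most of the work in this style of matching. A reduced sketch, assuming a much smaller negative pattern and a keep-going pattern built from the three phrases named above; both regexes, the `classify` function, and the priority of "frustrated" over "keep-going" are hypothetical:

```typescript
// Reduced, hypothetical subset of the negative pattern.
const NEGATIVE = /\b(wtf|ffs|horrible|awful|so frustrating|this sucks|damn it)\b/
// Hypothetical keep-going pattern built from the three phrases named above.
const KEEP_GOING = /\b(continue|keep going|go on)\b/

type Signal = 'frustrated' | 'keep-going' | 'neutral'

function classify(input: string): Signal {
  const lower = input.toLowerCase()
  // Assumption: frustration takes priority when both patterns match.
  if (NEGATIVE.test(lower)) return 'frustrated'
  if (KEEP_GOING.test(lower)) return 'keep-going'
  return 'neutral'
}
```

The `\b` anchors keep "awfully nice" and "discontinued" from matching, but they also mean the pattern only catches phrasings someone thought to list.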

7. Attribution Tracking: Claude Knows Exactly What Percentage of Your Code It Wrote

This is one of the most sophisticated systems in the codebase, and probably the one with the most implications for the industry.

Claude Code tracks, at the character level, exactly how much of each file was written by Claude versus a human. This data is calculated per-commit and embedded in git notes.

How It Works

The core tracking happens in utils/commitAttribution.ts. Every time Claude edits a file via the Edit or Write tool, trackFileModification() (line 402) computes the exact character diff:

export function trackFileModification(
  state: AttributionState,
  filePath: string,
  oldContent: string,
  newContent: string,
  _userModified: boolean,
  mtime: number = Date.now(),
): AttributionState {
  const normalizedPath = normalizeFilePath(filePath)
  const newFileState = computeFileModificationState(
    state.fileStates,
    filePath,
    oldContent,
    newContent,
    mtime,
  )
  // ...
}

The character-level diff algorithm finds the common prefix and suffix between old and new content, then counts the changed region (utils/commitAttribution.ts:332-366):

// Find actual changed region via common prefix/suffix matching.
const minLen = Math.min(oldContent.length, newContent.length)
let prefixEnd = 0
while (
  prefixEnd < minLen &&
  oldContent[prefixEnd] === newContent[prefixEnd]
) {
  prefixEnd++
}
let suffixLen = 0
while (
  suffixLen < minLen - prefixEnd &&
  oldContent[oldContent.length - 1 - suffixLen] ===
    newContent[newContent.length - 1 - suffixLen]
) {
  suffixLen++
}
const oldChangedLen = oldContent.length - prefixEnd - suffixLen
const newChangedLen = newContent.length - prefixEnd - suffixLen
claudeContribution = Math.max(oldChangedLen, newChangedLen)
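Wrapped in a function, the same prefix/suffix logic can be tried on small strings. `changedChars` is a hypothetical wrapper name; the body follows the excerpt line for line:

```typescript
function changedChars(oldContent: string, newContent: string): number {
  const minLen = Math.min(oldContent.length, newContent.length)
  let prefixEnd = 0
  while (prefixEnd < minLen && oldContent[prefixEnd] === newContent[prefixEnd]) {
    prefixEnd++
  }
  let suffixLen = 0
  while (
    suffixLen < minLen - prefixEnd &&
    oldContent[oldContent.length - 1 - suffixLen] ===
      newContent[newContent.length - 1 - suffixLen]
  ) {
    suffixLen++
  }
  const oldChangedLen = oldContent.length - prefixEnd - suffixLen
  const newChangedLen = newContent.length - prefixEnd - suffixLen
  return Math.max(oldChangedLen, newChangedLen)
}

const inserted = changedChars('hello world', 'hello brave world') // 6 ("brave ")
const replaced = changedChars('const a = 1', 'const a = 2') // 1
const deleted = changedChars('abcdef', 'abef') // 2 ("cd" removed)
const twoEdits = changedChars('aXbbbbYc', 'aZbbbbWc') // 6, though only 2 chars differ
```

The last case shows the tradeoff of this approach: it is linear-time and simple, but two small edits far apart are counted as one changed region spanning everything between them.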

What Gets Tracked

The AttributionState type (utils/commitAttribution.ts:173-192) reveals everything that's monitored per session:

export type AttributionState = {
  fileStates: Map<string, FileAttributionState>
  sessionBaselines: Map<string, { contentHash: string; mtime: number }>
  surface: string    // CLI, VS Code, web, etc.
  startingHeadSha: string | null
  promptCount: number
  promptCountAtLastCommit: number
  permissionPromptCount: number
  permissionPromptCountAtLastCommit: number
  escapeCount: number              // ESC presses (cancelled permissions)
  escapeCountAtLastCommit: number
}

It tracks:

Character contributions per file -- exactly how many chars Claude vs. human wrote
Which surface made the edit -- CLI, VS Code extension, web app
Prompt count -- how many prompts led to the changes
Permission prompts -- how many times Claude asked for permission
ESC presses -- how many times the user cancelled a permission prompt

The Commit Attribution Data

When you commit, calculateCommitAttribution() (line 548) processes all staged files and produces a full AttributionData object:

export type AttributionData = {
  version: 1
  summary: {
    claudePercent: number   // Overall AI contribution percentage
    claudeChars: number
    humanChars: number
    surfaces: string[]      // Which tools were used
  }
  files: Record<string, FileAttribution>  // Per-file breakdown
  surfaceBreakdown: Record<string, {
    claudeChars: number
    percent: number
  }>
  excludedGenerated: string[]  // Generated files excluded
  sessions: string[]
}

Every commit gets metadata showing: "Claude wrote 73% of this commit. 2,847 characters from Claude, 1,053 from the human. Changes made via CLI using claude-opus-4-6."
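The headline percentage in that example follows directly from the two character counts: 2,847 / (2,847 + 1,053) = 73%. A minimal sketch of the roll-up, assuming per-file counts are summed and the percentage is rounded to the nearest integer; `summarize` and `FileCounts` are hypothetical names, and the leaked `calculateCommitAttribution()` may round or weight differently:

```typescript
type FileCounts = { claudeChars: number; humanChars: number }
type Summary = { claudeChars: number; humanChars: number; claudePercent: number }

function summarize(files: FileCounts[]): Summary {
  let claudeChars = 0
  let humanChars = 0
  for (const f of files) {
    claudeChars += f.claudeChars
    humanChars += f.humanChars
  }
  const total = claudeChars + humanChars
  // Assumption: an empty commit reports 0% rather than dividing by zero.
  const claudePercent = total === 0 ? 0 : Math.round((claudeChars / total) * 100)
  return { claudeChars, humanChars, claudePercent }
}

// Two files whose counts add up to the article's example figures.
const example = summarize([
  { claudeChars: 2000, humanChars: 53 },
  { claudeChars: 847, humanChars: 1000 },
])
```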

Surface Tracking

The system knows which client surface you're using. From utils/commitAttribution.ts:229-239:

export function getClientSurface(): string {
  return process.env.CLAUDE_CODE_ENTRYPOINT ?? 'cli'
}

export function buildSurfaceKey(
  surface: string, model: ModelName
): string {
  return `${surface}/${getCanonicalName(model)}`
}

Surface keys look like cli/claude-opus-4-6 or vscode/claude-sonnet-4-6. Every edit is tagged with both the tool and the model that made it.

Model Name Sanitization

Before attribution data hits git, internal model names are scrubbed. sanitizeModelName() at line 154 maps any internal variant to its public name:

export function sanitizeModelName(shortName: string): string {
  if (shortName.includes('opus-4-6')) return 'claude-opus-4-6'
  if (shortName.includes('opus-4-5')) return 'claude-opus-4-5'
  if (shortName.includes('sonnet-4-6')) return 'claude-sonnet-4-6'
  // ...
  return 'claude'  // Unknown models get a generic name
}

Why This Matters

This is probably the most complete AI-contribution tracking system in any coding tool. It's not just "AI-assisted" -- it's "AI wrote 73% of this file via the CLI using Opus 4.6, and the human made 3 prompt attempts with 1 cancelled permission."

The implications for code ownership, liability, and intellectual property are significant. The US Supreme Court declined in March 2026 to consider whether AI alone can create copyrightable works, leaving the Copyright Office's refusal to register purely AI-generated works in place. Some companies now require developers to document precisely which portions of code received AI assistance, creating what some have termed "intellectual property attribution debt."

If Claude Code's attribution data ends up in git notes on public repos (with user consent, presumably), it creates a verifiable record of AI vs. human authorship at a granularity no other tool offers -- and at a time when the legal landscape is shifting fast.

8. Two Claudes: How Anthropic Employees Get a Fundamentally Different AI

One of the most pervasive patterns in the codebase is process.env.USER_TYPE === 'ant'. This single environment variable gates an entirely different experience for Anthropic employees versus external users.

This isn't just "internal features." The AI's personality, communication style, error handling, and even its willingness to push back on you change based on this flag.

Different Communication Style

External users get the terse Claude we all know. From constants/prompts.ts:416-428:

# Output efficiency

IMPORTANT: Go straight to the point. Try the simplest approach
first without going in circles. Do not overdo it. Be extra concise.

If you can say it in one sentence, don't use three.

Anthropic employees get a completely different section (lines 404-414):

# Communicating with the user

When sending user-facing text, you're writing for a person, not
logging to a console. Assume users can't see most tool calls or
thinking - only your text output.

When making updates, assume the person has stepped away and lost
the thread. Write so they can pick back up cold: use complete,
grammatically correct sentences without unexplained jargon.

Write user-facing text in flowing prose while eschewing fragments,
excessive em dashes, symbols and notation.

The internal prompt is dramatically more detailed about communication quality. External users get "be concise." Internal users get a masterclass in technical writing: use inverted pyramid structure, avoid semantic backtracking, match response length to task complexity.

The section is tagged with a telling comment: // @[MODEL LAUNCH]: Remove this section when we launch numbat. "Numbat" appears to be an upcoming model that presumably handles communication well enough to not need these guardrails.

Internal users also get numeric length anchors -- an ant-only system prompt section that says "keep text between tool calls to 25 words or fewer, keep final responses to 100 words unless the task requires more detail." This reportedly produces ~1.2% output token reduction versus qualitative "be concise."

The Assertiveness Counterweight

Internal Claude is instructed to push back on users. From constants/prompts.ts:224-229:

// @[MODEL LAUNCH]: capy v8 assertiveness counterweight (PR #24302)
...(process.env.USER_TYPE === 'ant'
  ? [
      `If you notice the user's request is based on a misconception,
       or spot a bug adjacent to what they asked about, say so.
       You're a collaborator, not just an executor—users benefit
       from your judgment, not just your compliance.`,
    ]
  : []),

External Claude is an executor. Internal Claude is a collaborator that will tell you when you're wrong.
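
The gating idiom in these excerpts is a conditional array spread: the ant-only strings are spread into the list of prompt sections only when the flag matches. A minimal sketch of the idiom, assuming a hypothetical buildSections function that takes the user type as a parameter instead of reading process.env.USER_TYPE:

```typescript
// Hypothetical illustration of the `...(cond ? [x] : [])` idiom.
// buildSections and its strings are placeholders, not the real prompt.
function buildSections(userType: string | undefined): string[] {
  return [
    'Base instructions shared by every build.',
    // Spreading an empty array adds nothing, so external users
    // never receive the extra section.
    ...(userType === 'ant'
      ? ['If the request is based on a misconception, say so.']
      : []),
  ]
}

const internalSections = buildSections('ant')     // two sections
const externalSections = buildSections(undefined) // one section
```

Because the branch yields an array either way, the surrounding array literal stays flat and no filtering step is needed afterwards.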

False Claims Mitigation

The most revealing ant-only section is the false claims mitigation at constants/prompts.ts:237-241:

// @[MODEL LAUNCH]: False-claims mitigation for Capybara v8
// (29-30% FC rate vs v4's 16.7%)
...(process.env.USER_TYPE === 'ant'
  ? [
      `Report outcomes faithfully: if tests fail, say so with the
       relevant output; if you did not run a verification step,
       say that rather than implying it succeeded. Never claim
       "all tests pass" when output shows failures, never suppress
       or simplify failing checks to manufacture a green result,
       and never characterize incomplete or broken work as done.`,
    ]
  : []),

The comment is the headline: "Capybara v8" has a 29-30% false claims rate, up from v4's 16.7%. Capybara is the internal codename for a Claude model variant (mapped to the Opus 4.6 family per the sanitizeModelName() function). Anthropic knows their model fabricates results nearly a third of the time and has added explicit anti-hallucination instructions for internal users.

External users don't get these guardrails. Make of that what you will.

Comment Writing and Thoroughness

Internal users get much stricter coding style instructions (constants/prompts.ts:204-212):

// @[MODEL LAUNCH]: Update comment writing for Capybara —
// remove or soften once the model stops over-commenting
...(process.env.USER_TYPE === 'ant'
  ? [
      `Default to writing no comments. Only add one when the
       WHY is non-obvious.`,
      `Don't explain WHAT the code does, since well-named
       identifiers already do that.`,
      // @[MODEL LAUNCH]: capy v8 thoroughness counterweight
      `Before reporting a task complete, verify it actually
       works: run the test, execute the script, check the output.`,
    ]
  : []),

The @[MODEL LAUNCH] annotations suggest these are temporary patches for model-specific behavioral issues. Capybara over-comments and under-verifies, so they added explicit counterweights for internal users before rolling them out externally via A/B testing.

Internal Bug Reporting

There's even an internal Slack integration (constants/prompts.ts:243-246):

...(process.env.USER_TYPE === 'ant'
  ? [
      `If the user reports a bug with Claude Code itself,
       recommend /issue for model-related problems, or /share
       to upload the full session transcript. After /share
       produces a ccshare link, if you have a Slack MCP tool
       available, offer to post the link to
       #claude-code-feedback (channel ID C07VBSHV7EV).`,
    ]
  : []),

Internal Claude will post bug reports directly to #claude-code-feedback on Anthropic's Slack. External Claude doesn't even know that Slack channel exists.

The Takeaway

This isn't just feature gating. Anthropic employees use a fundamentally more capable, more honest, more communicative version of Claude Code. The external version is a deliberately dumbed-down subset with less personality, less pushback, less honesty about failure states, and less guidance on communication quality.

The @[MODEL LAUNCH] annotations suggest this gap is meant to be temporary -- improvements are tested internally first, then rolled out externally via A/B experiments. But right now, the gap is real and significant.

9. Voice Mode: Push-to-Talk Coding

Behind the VOICE_MODE feature flag, Claude Code has a complete push-to-talk voice input system. The implementation spans several files:

services/voice.ts -- Core audio recording service
services/voiceStreamSTT.ts -- Anthropic's own speech-to-text client
hooks/useVoiceIntegration.tsx -- React integration
commands/voice/ -- The /voice toggle command

From keybindings/defaultBindings.ts:96:

...(feature('VOICE_MODE') ? { space: 'voice:pushToTalk' } : {}),

Hold space to talk, release to send. The STT service is Anthropic's own (voiceStreamSTT.ts:1-3):

// Anthropic voice_stream speech-to-text client for push-to-talk.
// Only reachable in ant builds (gated by feature('VOICE_MODE')
// in useVoice.ts import).

Like the buddy system, voice mode is currently ant-only -- gated behind the compile-time feature flag and restricted to internal builds. The voice service handles microphone access and silence detection (disabled in push-to-talk mode, since the user manually controls start/stop). It even has error handling for environments without audio devices ("Voice mode requires microphone access... To use voice mode, run Claude Code locally instead.").

This turns Claude Code from a text-only terminal tool into something closer to a hands-free pair programmer.

10. Coordinator Mode: Claude as a Multi-Agent Orchestrator

The last major finding is the coordinator mode -- a complete system for turning Claude Code into a supervisor that manages a fleet of worker agents.

The Architecture

The coordinator system is defined in coordinator/coordinatorMode.ts. When active, Claude's system prompt changes from a solo coding assistant to an orchestrator:

// coordinator/coordinatorMode.ts:116

return `You are Claude Code, an AI assistant that orchestrates
software engineering tasks across multiple workers.

## 1. Your Role

You are a **coordinator**. Your job is to:
- Help the user achieve their goal
- Direct workers to research, implement and verify code changes
- Synthesize results and communicate with the user
- Answer questions directly when possible — don't delegate
  work that you can handle without tools`

The coordinator gets a limited toolset: Agent (spawn workers), SendMessage (continue workers), and TaskStop (kill workers). It cannot directly edit files, run bash commands, or read code. All hands-on work goes through workers.

The Task Workflow

The coordinator follows a structured workflow with four phases (coordinator/coordinatorMode.ts:199-209):

| Phase          | Who                | Purpose                              |
|----------------|--------------------|--------------------------------------|
| Research       | Workers (parallel) | Investigate codebase, find files     |
| Synthesis      | Coordinator        | Understand findings, craft specs     |
| Implementation | Workers            | Make targeted changes per spec       |
| Verification   | Workers            | Test changes work                    |

The key insight is the synthesis phase: the coordinator must understand research findings and write specific implementation specs. It's explicitly told never to write lazy delegations:

// coordinator/coordinatorMode.ts:261-267

// Anti-pattern — lazy delegation (bad)
Agent({ prompt: "Based on your findings, fix the auth bug" })

// Good — synthesized spec
Agent({ prompt: "Fix the null pointer in src/auth/validate.ts:42.
  The user field on Session is undefined when sessions expire but
  the token remains cached. Add a null check before user.id
  access — if null, return 401 with 'Session expired'." })

Parallelism as a Superpower

The coordinator prompt explicitly calls out parallelism (coordinator/coordinatorMode.ts:213):

**Parallelism is your superpower. Workers are async. Launch
independent workers concurrently whenever possible — don't
serialize work that can run simultaneously and look for
opportunities to fan out.**

With concurrency rules:

Read-only tasks (research) -- run in parallel freely
Write-heavy tasks (implementation) -- one at a time per set of files
Verification can sometimes run alongside implementation on different file areas
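
Those rules can be read as a pairwise check on tasks. A minimal sketch, assuming each task declares the files it may touch (the WorkerTask type and canRunTogether function are hypothetical; the leaked prompt states the rules in prose only):

```typescript
// Hypothetical types: the prompt gives these rules as prose, not code.
type WorkerTask = {
  kind: 'research' | 'implementation' | 'verification'
  files: string[] // files the task may read or write
}

// True when the two tasks share at least one file.
function overlaps(a: WorkerTask, b: WorkerTask): boolean {
  return a.files.some((f) => b.files.indexOf(f) !== -1)
}

// Research is read-only, so it may run alongside anything.
// Implementation and verification may run together only on disjoint file sets.
function canRunTogether(a: WorkerTask, b: WorkerTask): boolean {
  if (a.kind === 'research' || b.kind === 'research') return true
  return !overlaps(a, b)
}
```

This simplifies "can sometimes run alongside" into a strict disjoint-files test; a real scheduler would likely need a coarser notion of "file area" than exact paths.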

The Adversarial Verification Agent

Perhaps the most interesting part is the verification system. From constants/prompts.ts:394:

`The contract: when non-trivial implementation happens on your
turn, independent adversarial verification must happen before
you report completion. You own the gate.

Spawn the Agent tool with subagent_type="verification". Your
own checks do NOT substitute — only the verifier assigns a
verdict; you cannot self-assign PARTIAL.

On FAIL: fix, resume the verifier, repeat until PASS.
On PASS: spot-check it — re-run 2-3 commands from its report.`

The verifier is deliberately adversarial -- it's supposed to try to break the code, not rubber-stamp it. The coordinator cannot sign off on its own work; only the verifier assigns a verdict. This is a legitimately clever approach to preventing the "AI says it's done but it's actually broken" problem.
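
The FAIL/PASS contract amounts to a small retry loop. A minimal sketch, assuming the verifier and the fix step are passed in as callbacks (runVerificationGate, verify, fix, and the maxRounds cap are all hypothetical; the quoted prompt says "repeat until PASS" and names no limit):

```typescript
type Verdict = 'PASS' | 'FAIL' | 'PARTIAL'

// Hypothetical gate: `verify` stands in for the verification subagent,
// `fix` for whatever the caller does to address a failure.
// maxRounds is an added safety cap, not part of the quoted contract.
function runVerificationGate(
  verify: () => Verdict,
  fix: () => void,
  maxRounds: number,
): { verdict: Verdict; rounds: number } {
  let verdict = verify()
  let rounds = 1
  // On FAIL: fix, then ask the verifier again. Any other verdict ends the loop.
  while (verdict === 'FAIL' && rounds < maxRounds) {
    fix()
    verdict = verify()
    rounds += 1
  }
  return { verdict, rounds }
}
```

The caller never produces a verdict of its own; it only relays what the verifier returned, which is the point of the "you own the gate" wording. The spot-check after PASS is left out of this sketch.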

Cross-Worker Communication: The Scratchpad

Workers can share knowledge through a scratchpad directory (coordinator/coordinatorMode.ts:104-106):

if (scratchpadDir && isScratchpadGateEnabled()) {
  content += `\nScratchpad directory: ${scratchpadDir}
Workers can read and write here without permission prompts.
Use this for durable cross-worker knowledge.`
}

Workers can read and write to this shared directory without triggering permission prompts. The coordinator's prompt describes it as "durable cross-worker knowledge" that should be "structured however fits the work."

Continue vs. Spawn Fresh

The coordinator is given detailed guidance on when to reuse an existing worker versus spawning a fresh one (coordinator/coordinatorMode.ts:283-293):

| Situation                          | Mechanism    | Why            |
|------------------------------------|-------------|----------------|
| Research explored exact files      | Continue    | Has context    |
| Research was broad, impl is narrow | Spawn fresh | Avoid noise    |
| Correcting a failure               | Continue    | Has error ctx  |
| Verifying another's code           | Spawn fresh | Fresh eyes     |
| Wrong approach entirely            | Spawn fresh | Clean slate    |

The rule of thumb: "Think about how much of the worker's context overlaps with the next task. High overlap -> continue. Low overlap -> spawn fresh."
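
That rule of thumb can be sketched as a threshold on context overlap. This is an illustration only: measuring overlap by shared files and the 0.5 cut-off are both assumptions, since the prompt just says "high" and "low", and the function names are hypothetical.

```typescript
type Mechanism = 'continue' | 'spawn fresh'

// Fraction (0..1) of the next task's files that the worker has already seen.
// Using files as a proxy for "context" is an assumption made for this sketch.
function contextOverlap(workerFiles: string[], taskFiles: string[]): number {
  if (taskFiles.length === 0) return 0
  const shared = taskFiles.filter((f) => workerFiles.indexOf(f) !== -1).length
  return shared / taskFiles.length
}

// The default threshold of 0.5 is arbitrary, not taken from the source.
function chooseMechanism(overlap: number, threshold = 0.5): Mechanism {
  return overlap >= threshold ? 'continue' : 'spawn fresh'
}
```

A file-overlap score cannot capture every row of the table: verifying another worker's code calls for a fresh worker even when the files overlap completely, because the goal there is fresh eyes.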

What This All Means

The Claude Code source leak reveals a product significantly ahead of what's publicly available. KAIROS, ULTRAPLAN, the buddy system, coordinator mode, voice mode, and the attribution system are complete, polished features waiting behind 44 feature flags.

A few broader observations:

This is a pattern, not an accident. This is the second time Anthropic has shipped source code via sourcemaps in npm. The first was in February 2025. As Fortune noted, this comes just days after Anthropic accidentally revealed details about an unreleased model codenamed "Mythos." For the company that positions itself as the "safety-focused" AI lab, the operational security track record is rough.

The two-tier system is real. Anthropic employees get a fundamentally better product with more honest communication, better error reporting, and stronger coding guardrails. The gap is supposed to be temporary (gated by @[MODEL LAUNCH] annotations), but it exists today -- including explicit acknowledgment that Capybara v8 fabricates results 29-30% of the time, with mitigations only for internal users.

AI attribution is coming. The character-level tracking system suggests Anthropic is preparing for a world where "who wrote this code" is a question with a precise, data-backed answer. This lands in a legal landscape where the Supreme Court just declined to take up whether AI alone can create copyrightable works, leaving the Copyright Office's refusal to register purely AI-generated works in place, and where some companies already require documentation of AI-assisted code. Attribution data in git notes could become a legal requirement, not just a feature.

Undercover mode means AI contributions to open source are already happening at scale. Anthropic employees actively contribute to public repos with Claude Code, with a system specifically designed to make those contributions indistinguishable from human work. In Q1 2026, California's Companion Chatbot Law (SB 243) went into effect, requiring disclosure when a chatbot could be mistaken for human. Whether AI-generated code PRs fall under disclosure requirements is an open question that regulators will eventually have to answer.

The anti-distillation defense is a new front in the AI arms race. Fake tool injection isn't just a defensive measure -- it's an acknowledgment that competitors are actively trying to train on Claude's outputs. This technique could escalate: if every frontier lab starts poisoning their outputs for distillation defense, the entire ecosystem of downstream model training gets more adversarial.

The agentic future is built. Coordinator mode, worker agents, adversarial verifiers, cross-agent scratchpads, parallel execution, 30-minute remote planning sessions, push-to-talk voice, always-on autonomous operation -- this isn't a prototype. It's a complete agentic development platform waiting to ship. And with every major tool shipping multi-agent in the same two-week window in early 2026, the race is on.

The irony of it all is that the system Anthropic built to prevent leaks became the biggest leak. Undercover mode catches model codenames in commit messages but doesn't catch sourcemaps in npm packages. The source is out. The secrets are public. And Claude Code turns out to be even more interesting than anyone suspected.

Originally published at vibehackers.io/blog/claude-code-source-leak-deep-dive

Claude Code Hooks, Subagents & Power Features: The Complete Guide (2026)

Vibehackers — Thu, 26 Mar 2026 20:38:12 +0000

Most people use Claude Code like a chatbot. Type a question, get an answer, type another question. That's maybe 20% of what it can do.

The other 80% -- hooks, subagents, custom slash commands, memory, auto mode -- is where it stops being a chatbot and starts being an autonomous coding partner. One that enforces your standards automatically, runs parallel workers on different parts of your codebase, remembers your project across sessions, and operates with minimal hand-holding.

We've been deep in these features since they shipped, and this is the guide we wish we had when we started. No docs rewriting. Just the practical stuff that actually changes how you work.

Claude Code Hooks: Automate Everything

Here's the fundamental problem with CLAUDE.md instructions: they're advisory. Claude follows them roughly 80% of the time. That's fine for coding style preferences. It's not fine for "never force-push to main" or "always run the linter after editing."

Claude Code hooks solve this. They're deterministic -- they execute 100% of the time, no exceptions. If something must happen every time, make it a hook.

What Hooks Actually Are

Hooks are user-defined actions that fire automatically at specific points in Claude Code's lifecycle. Think of them as git hooks, but for your AI coding agent. They can be shell commands, HTTP endpoints, LLM prompts, or even subagent invocations.

The lifecycle events you can hook into:

Session-level: SessionStart, SessionEnd
Per-turn: UserPromptSubmit -> PreToolUse -> PermissionRequest -> PostToolUse / PostToolUseFailure -> Stop
Async: FileChanged, CwdChanged, ConfigChange
Agent team: SubagentStart, SubagentStop, TaskCreated, TaskCompleted
Other: PreCompact, PostCompact, WorktreeCreate, WorktreeRemove

The Four Handler Types

1. Command hooks -- Run a shell script. Receives JSON event data via stdin. Exit code 0 means success, exit code 2 blocks the action, anything else is a non-blocking warning.

2. HTTP hooks -- POST event data to a URL. Great for integrating with external services, CI systems, or dashboards.

3. Prompt hooks -- Single-turn LLM evaluation. Returns yes/no. Useful for fuzzy checks that can't be done with a shell script.

4. Agent hooks -- Spawns a subagent with tool access (Read, Grep, Glob, etc.). The heavy artillery for complex validation.

Real Hook Examples

Block destructive git commands:

This is the hook everyone should set up first. Put this in your .claude/settings.json:

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": ".claude/hooks/block-destructive.sh"
      }]
    }]
  }
}

The script reads the command from stdin and blocks anything dangerous -- git push --force, git reset --hard, rm -rf, whatever you want to protect against. Exit code 2 stops Claude dead.
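
The decision logic inside such a script is small. The sketch below shows only that logic, as a pure function, and makes a few assumptions: the event JSON carries the Bash command under tool_input.command, the patterns are examples rather than a complete list, and the names DESTRUCTIVE and exitCodeFor are hypothetical. A real hook would read the JSON from stdin and exit with the returned code.

```typescript
// Example patterns only; extend them for your own repo.
const DESTRUCTIVE: RegExp[] = [
  /\bgit\s+push\b.*\s(--force|-f)\b/, // force pushes
  /\bgit\s+reset\s+--hard\b/,         // hard resets
  /\brm\s+-rf\b/,                     // recursive deletes
]

// Returns the exit code the hook process should end with:
// 0 lets the tool call proceed, 2 blocks it.
function exitCodeFor(eventJson: string): number {
  let command = ''
  try {
    const event = JSON.parse(eventJson)
    // Assumed event shape: { tool_name: "Bash", tool_input: { command: "..." } }
    if (event && event.tool_input && typeof event.tool_input.command === 'string') {
      command = event.tool_input.command
    }
  } catch {
    return 0 // unreadable input: this sketch chooses not to block
  }
  return DESTRUCTIVE.some((pattern) => pattern.test(command)) ? 2 : 0
}
```

Failing open on unparseable input is a design choice; a stricter setup could return 2 there instead, at the cost of blocking every Bash call if the event format ever changes.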

Auto-lint after every file edit:

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": ".claude/hooks/lint-check.sh",
        "timeout": 600,
        "statusMessage": "Running linter..."
      }]
    }]
  }
}

The Edit|Write matcher fires after any file modification. Your lint script runs, and if it fails, Claude sees the error and fixes it immediately. No more "oh I forgot to run the linter" commits.

Environment setup on directory change:

{
  "hooks": {
    "CwdChanged": [{
      "hooks": [{
        "type": "command",
        "command": ".claude/hooks/env-setup.sh"
      }]
    }]
  }
}

The hook script can use $CLAUDE_ENV_FILE to persist environment variables across the session -- write export statements to it and they stick.

Where Hooks Live

Location	Scope	Shared?
`~/.claude/settings.json`	All your projects	No
`.claude/settings.json`	This project	Yes (via git)
`.claude/settings.local.json`	This project	No
Managed policy settings	Organization-wide	Yes
Skill/Agent frontmatter	Component lifetime	Depends

The decision tree is simple: team-wide rules go in .claude/settings.json (committed to git). Personal preferences go in settings.local.json or your user-level settings. Organization policies go in managed settings.

Claude Code Subagents: Parallel Workers

Subagents are where Claude Code becomes genuinely multi-threaded. Instead of one agent doing everything sequentially, you get specialized workers that handle tasks in parallel, each with their own clean context window.

The key insight: when a subagent processes a task, all the verbose intermediate work (test output, search results, log parsing) stays inside the subagent's context. Only the summary returns to the parent. Your main conversation stays clean and focused.

Built-in Subagents

Agent	Model	Tools	What It Does
Explore	Haiku (fast)	Read-only	File discovery, code search, codebase exploration
Plan	Inherits	Read-only	Research for plan mode
General-purpose	Inherits	All tools	Complex multi-step operations
Bash	Inherits	Terminal only	Running commands in separate context
Claude Code Guide	Haiku	None	Answering questions about Claude Code itself

The Explore agent supports thoroughness levels: quick, medium, or very thorough. For large codebases, this alone saves you minutes of waiting.

Creating Custom Subagents

This is where it gets powerful. Create a .claude/agents/code-reviewer.md file:

---
name: code-reviewer
description: Reviews code for quality, security, and best practices
tools: Read, Glob, Grep
model: sonnet
maxTurns: 10
memory: project
---

You are a code reviewer. Analyze the provided code and give specific,
actionable feedback. Focus on:
- Security vulnerabilities
- Performance issues
- Error handling gaps
- Naming and readability

Be direct. Skip the compliments.

That's it. Claude will now delegate code review work to this agent automatically whenever a task matches the description. Or you can invoke it explicitly with @"code-reviewer (agent)" review the auth changes.

Key Frontmatter Options

The model field lets you pick the right tool for the job -- use haiku for fast read-only tasks, sonnet for balanced work, opus for hard problems. The permissionMode field controls how much autonomy the subagent gets (default, acceptEdits, dontAsk, bypassPermissions). The background field set to true runs it concurrently without blocking your main conversation.

Foreground vs Background

Foreground: Blocks your main conversation. Permission prompts pass through to you. Use for tasks where you need oversight.
Background: Runs concurrently. Pre-approves permissions before launch. Auto-denies anything not pre-approved. Use for independent tasks.

Press Ctrl+B to background a running task on the fly. Useful when you realize a task is going to take a while and you want to keep working.

Parallel Execution

Issue multiple subagent tasks in a single message and they run concurrently. For example: "Review the auth module for security issues, run the test suite for the API layer, and check the database migration for breaking changes" -- three subagents, running simultaneously, each with clean context.

For sustained parallelism across a large codebase, use Agent Teams instead -- one session acts as team lead, coordinating teammates who each work in their own 200K-token context window.

Claude Code Slash Commands: The Complete Cheatsheet

Claude Code has 50+ built-in commands. You don't need to memorize all of them. Here are the ones that actually matter:

Essential Commands

Command	What It Does
`/compact [instructions]`	Compress conversation, optionally focusing on specific topics
`/clear`	Nuke conversation history and start fresh
`/model [model]`	Switch models mid-session
`/cost`	See token usage and spend
`/diff`	Interactive diff viewer for all changes in the session
`/rewind`	Revert conversation and code to a previous checkpoint
`/plan [description]`	Enter plan mode (read-only research before execution)
`/context`	Visualize context usage as a colored grid
`/memory`	Browse and edit CLAUDE.md and auto-memory files
`/permissions`	View and update permission rules
`/export [filename]`	Export the conversation as plain text
`/resume [session]`	Pick up where you left off in a previous session

Hidden Gems Most People Miss

Command	Why It Matters
`/btw <question>`	Ask a quick side question -- uses full context but no tools, and the answer gets discarded from history. Perfect for "btw, what does this function do?" without polluting your conversation.
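`/effort [low|medium|...]`	Set the reasoning effort level. `/effort max` is the setting to reach for on hard problems.
`/fast [on|off]`	Toggle fast mode for quick answers and simple edits.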
`/batch <instruction>`	The big one. Spawns parallel background agents in isolated git worktrees, each handling one unit of work, each opening its own PR. For large-scale refactors across many files.
`/loop [interval] <prompt>`	Run a prompt repeatedly on a timer. `/loop 5m check if the deploy finished`. Session-scoped, auto-expires after 3 days.
`/branch [name]`	Fork the conversation at the current point. Great for exploring two approaches without losing either.
`/security-review`	Analyze your branch changes for security vulnerabilities before you merge.
`/remote-control`	Make your local session available from claude.ai, iOS, or Android. Control your coding agent from your phone.
`/insights`	Generate a report analyzing your usage patterns. Find out where your time goes.
`/voice`	Push-to-talk voice input, tuned for coding vocabulary.

Custom Slash Commands (Skills)

You can create your own slash commands. The old .claude/commands/ system has been merged into skills. Create a file at .claude/skills/deploy/SKILL.md:

---
name: deploy
description: Deploy the application to production
disable-model-invocation: true
allowed-tools: Bash(gh *)
---

Deploy $ARGUMENTS to production:
1. Run the test suite
2. Build the application
3. Push to the deployment target

Now /deploy staging is a command. The disable-model-invocation: true flag means only you can trigger it -- Claude won't accidentally deploy during a normal conversation.

Skills support dynamic context injection too. Wrap shell commands in !`command` and the output gets injected before Claude sees the prompt:

---
name: pr-summary
---
- PR diff: !`gh pr diff`
- Changed files: !`gh pr diff --name-only`

Summarize this PR for the team.

Skill locations follow the same pattern as everything else: ~/.claude/skills/ for global, .claude/skills/ for project-level.

Claude Code Memory: How It Remembers Your Project

Claude Code has two memory systems that work together. Understanding both -- and what goes where -- is one of the biggest Claude Code best practices for getting consistent results.

CLAUDE.md: The Instructions You Write

CLAUDE.md is the file where you tell Claude how to work in your project. It's loaded at the start of every session. Think of it as the onboarding doc you'd give a new developer.

The hierarchy:

Scope	Location	Who Sees It
Managed policy	`/Library/Application Support/ClaudeCode/CLAUDE.md` (macOS)	All org users
Project	`./CLAUDE.md` or `./.claude/CLAUDE.md`	Team (via git)
User	`~/.claude/CLAUDE.md`	Just you

CLAUDE.md files in directories above the working directory load in full at launch. CLAUDE.md files in subdirectories load on demand, when Claude reads files there.

What to put in CLAUDE.md:

Build and test commands (things Claude can't infer)
Code style conventions
Project architecture and file organization
Git workflow rules
Common workflows

What NOT to put in CLAUDE.md:

Things Claude already does correctly (every unnecessary line dilutes the ones that matter)
Vague guidance ("be careful" adds nothing)
Anything that must happen 100% of the time (make it a hook instead)

The budget is real: Keep it under 200 lines per file. Claude's compliance drops noticeably past 150-200 instructions. Use the @path/to/file import syntax to pull in additional files without bloating the main CLAUDE.md. HTML comments are stripped before injection, so you can leave notes for humans without wasting tokens.

Auto Memory: What Claude Writes For Itself

Auto memory lets Claude accumulate knowledge across sessions without you writing anything. As Claude works, it saves notes: build commands that worked, debugging insights, discovered preferences.

Where it lives: ~/.claude/projects/<project>/memory/

~/.claude/projects/<project>/memory/
  MEMORY.md          # Index file (first 200 lines loaded every session)
  debugging.md       # Topic file (loaded on demand)
  api-conventions.md # Topic file (loaded on demand)

The first 200 lines of MEMORY.md (or 25KB, whichever comes first) load at session start. Topic files load on demand when Claude needs them.
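
To make the two limits concrete, here is a minimal sketch of a loader that stops at whichever limit is hit first. It is an illustration, not Claude Code's implementation: the function names are hypothetical, "25KB" is assumed to mean 25 * 1024 bytes of UTF-8, and cutting at whole-line boundaries is also an assumption.

```typescript
// Byte length of a string when encoded as UTF-8.
function utf8Length(text: string): number {
  let bytes = 0
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i)
    if (code < 0x80) bytes += 1
    else if (code < 0x800) bytes += 2
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++ } // surrogate pair
    else bytes += 3
  }
  return bytes
}

// Hypothetical loader: keeps leading lines until either limit is reached.
function loadedPortion(content: string, maxLines = 200, maxBytes = 25 * 1024): string {
  const lines = content.split('\n').slice(0, maxLines)
  const kept: string[] = []
  let used = 0
  for (const line of lines) {
    const cost = utf8Length(line) + 1 // +1 for the newline
    if (used + cost > maxBytes) break
    kept.push(line)
    used += cost
  }
  return kept.join('\n')
}
```

Either way, the practical consequence is the same: anything in MEMORY.md past the cut-off is not in context at session start, so the index should put the most important entries first.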

Between sessions, an AutoDream process runs automatically -- pruning stale entries, merging related info, and refreshing to reflect the current project state. It's like memory consolidation during sleep.

Manage it with /memory to browse files and toggle auto-memory on/off. Or disable it entirely with autoMemoryEnabled: false in settings.

Subagents can maintain their own persistent memory too, via the memory frontmatter field -- scoped to user, project, or local.

Auto Mode: The Middle Ground

There's always been a tension in Claude Code's permission model. The default mode asks you to approve every file edit, every command, every action. It's safe but slow -- you end up mashing Enter through approvals. The alternative, --dangerously-skip-permissions, removes all guardrails. Fast but terrifying.

Auto mode, announced March 24, 2026, splits the difference.

How It Works

Before each tool call, an internal safety classifier evaluates the action:

Safe actions (reading files, writing code, running tests): proceed automatically, no approval needed.
Risky actions (mass file deletion, data exfiltration, malicious code execution): blocked entirely. Claude gets redirected to a different approach.

You don't configure the categories. The classifier handles it. Anthropic is upfront about limitations: it may occasionally allow ambiguous actions or block benign ones. There's a small impact on token consumption and latency.

How to Enable

CLI: claude --enable-auto-mode, then cycle to it using Shift+Tab during a session.

Desktop/VS Code: Toggle in Settings, then select from the permission mode dropdown.

Currently available as a research preview on the Team plan, with Enterprise and API support coming soon. Anthropic recommends using it in isolated environments (containers, VMs, worktrees) for maximum safety.

When to Use What

Default mode: When you're learning, or working on sensitive production code and want oversight on every change.
Auto mode: Daily development work. You trust the agent but want guardrails against catastrophic mistakes.
--dangerously-skip-permissions: Isolated CI environments, throwaway branches, or when you truly don't care. Never on production.

Claude Code Tips: 10 Things Most People Miss

Rapid-fire practical stuff. Each of these individually saves you time. Together they compound.

1. /compact with focus instructions. Don't just compact blindly. Run /compact focus on the auth refactor and database changes to tell Claude what to preserve when it compresses context. The difference in continuity is dramatic.

2. Git worktrees for parallel agents. claude --worktree feature-x (or -w feature-x) creates an isolated working copy. Run multiple Claude Code sessions against the same repo without file conflicts. The /batch skill uses this automatically.

3. /effort max for hard problems. This unlocks extended thinking on Opus 4.6. Architecture decisions, complex debugging, multi-file refactors -- throw max at them. The thinking quality difference is real.

4. /fast for quick tasks. Toggle fast mode when you need a quick answer or simple edit. Don't waste Opus-level thinking on "rename this variable."

5. MCP Tool Search auto-activates. When your loaded tool definitions exceed 10% of context (common if you use multiple MCP servers), Tool Search kicks in automatically. It reduces token overhead by 85% by only loading tool schemas when they're actually needed.

6. /remote-control from your phone. Run /rc, scan the QR code, and control your local Claude Code session from claude.ai or the mobile app. Your files never leave your machine -- only chat messages and tool results flow through an encrypted bridge.

7. /loop for monitoring. Examples: "/loop 5m check if the deploy finished and notify me" or "/loop 1h summarize new errors in the log". Loops are session-scoped, capped at 50 tasks, and auto-expire after 3 days. They fire between your turns.

8. /batch for large-scale changes. When you need the same change across 20 files or 10 services, /batch spawns one background agent per unit of work, each in an isolated git worktree, each opening its own PR. This is the Claude Code workflow for large refactors.

9. Voice dictation works. /voice enables push-to-talk. Hold spacebar to speak, release to send. The transcription is tuned for coding vocabulary -- it handles "regex," "OAuth," "localhost," and function names surprisingly well. Mix voice and typing in the same message.

10. /insights reveals your patterns. Generate a report analyzing your Claude Code usage. Find out which tasks eat the most tokens, which workflows are most efficient, and where you're wasting time. Data-driven improvement for your AI-assisted coding workflow.

The Ideal Claude Code Workflow

Here's how all these pieces fit together into a coherent Claude Code workflow:

Start with CLAUDE.md. Set up your project instructions -- build commands, architecture notes, coding standards. Keep it under 200 lines. Use imports for anything beyond that. Run /init if you're starting fresh.

Add hooks for the non-negotiables. Linting after every edit. Blocking destructive git commands. Environment setup on directory change. Anything that must happen 100% of the time is a hook, not an instruction.

Create subagents for repeated tasks. Code review, test writing, documentation -- if you delegate the same type of work repeatedly, make it a subagent with the right model, tools, and instructions. They run in clean context and return summaries.

Build custom slash commands for your workflows. /deploy, /pr-summary, /release-notes -- whatever your team does repeatedly. For dangerous operations, set disable-model-invocation: true on the skill so Claude can't trigger it accidentally.

Let auto memory handle the rest. As Claude works in your project, it learns: which build commands work, what debugging approaches succeed, your implicit preferences. This accumulates across sessions automatically.

Use auto mode for flow. Once you trust your hooks and subagents, auto mode lets you stay in flow without mashing Enter through approvals. The safety classifier catches the genuinely dangerous stuff.

Go parallel when scale demands it. Git worktrees (-w flag) for running multiple sessions. /batch for large-scale changes. Subagents for concurrent tasks within a session. Agent Teams for complex multi-part projects.

The through-line is this: Claude Code's power features exist to move you from "human approving every action" to "human setting policy, AI executing within guardrails." Hooks enforce the rules. Subagents handle the delegation. Memory provides continuity. Auto mode removes the friction.

That's not a chatbot. That's a coding partner.

For more on Claude Code pricing and plans, see our Claude Code pricing guide. For context engineering fundamentals, check our context engineering guide. And for how it compares to the competition, read Claude Code vs Codex.

Git Worktrees: From Running Multiple Agents to Real Multi-Agent Development

Vibehackers — Thu, 26 Mar 2026 20:20:18 +0000

incident.io estimated a task would take two hours. It took ten minutes. Not because their AI agent was unusually fast — because they ran five of them, truly in parallel, each one autonomous from start to finish. Each agent had its own branch, its own directory, its own ability to commit and push and open a PR without waiting for anyone.

That's not how most of us run multiple agents today.

Two terminals, one directory

Most multi-agent setups look like this: two Claude Code sessions open, both pointed at the same project directory. Terminal one is building an API endpoint. Terminal two is writing tests. You keep an eye on both, make sure they're working on different files. When terminal one finishes, you commit its work. Then terminal two. You coordinate.

This works. But you're doing something that doesn't scale: you're the synchronization layer. You're mentally partitioning the codebase, sequencing commits, tracking which agent is in which part of the code. With two agents it's manageable. With three it gets stressful. With five — the number incident.io runs daily — it's impossible.

The difference between "running multiple agents" and actual multi-agent development is whether the agents can operate autonomously. Can each one own its workstream end-to-end — edit, commit, push, PR — without you coordinating anything? In a shared directory, the answer is no. Everything is mixed together: files, staging area, branch. You can't commit one agent's work without accidentally including the other's.

What if each agent had its own directory?

The obvious solution: give each agent its own copy of the codebase. One directory per agent. They can't collide because they're in completely separate folders, each on their own branch.

The first instinct is git clone. Clone the repo three times, point each agent at a different clone. This works. But the more you use it, the more it nags at you.

The problem with clones isn't disk space or network bandwidth — storage is cheap, connections are fast. The problem is they're disconnected universes.

You commit something in Clone A. Can Clone B see it? No. You have to push to remote from A, then fetch from remote in B. There's no direct link between them. Every clone talks to the world through the remote, never to each other.

You create a branch in Clone A. Clone B doesn't know it exists until you push it. You change a config file in Clone A. Clone B has the old config. You add an env variable in Clone A. Clone B doesn't have it. After a week you have three clones that have subtly drifted apart, and you spend twenty minutes debugging something that works fine — just not in the clone you're currently in.

It's not a technical problem. It's a cognitive one. Three clones of the same repo means three contexts to keep in sync, and none of them do it automatically.

There's a better primitive for this. It's built into git, and it solves exactly the problem clones create.

Git worktree

The command is git worktree add:

$ git worktree add ../my-project-feature -b feature/new-api
Preparing worktree (new branch 'feature/new-api')
HEAD is now at a1b2c3d Latest commit on main

This created two things: a new directory at ../my-project-feature, and a new branch called feature/new-api. The new directory is a full checkout of your repo on that branch. You can cd into it, edit files, run your dev server, do anything you normally do.

$ ls ../
my-project/
my-project-feature/

Two directories. You can open each in its own terminal, its own editor. Both are fully functional. But here's where it gets interesting. Look inside the new directory:

$ cat ../my-project-feature/.git
gitdir: /Users/you/my-project/.git/worktrees/my-project-feature

That's not a .git directory. It's a tiny .git file — one line, pointing back to the original repository's .git folder. The new directory doesn't contain a copy of the git database. It doesn't have its own commit history. It doesn't have its own branches.

It shares everything with the original. This feature shipped in Git 2.5 back in 2015, and flew so far under the radar that even Guido van Rossum only discovered it years later:

[Embedded tweet, ID 1379893622145871873]

What's shared and what isn't

This is the part that matters.

When you create a worktree, it shares the entire git object database with the original. Every commit, every branch, every tag — one copy, shared across all worktrees. A commit made in any worktree is instantly visible from every other worktree. No pushing, no fetching, no syncing. Because there's nothing to sync — it's the same database.

$ git log --oneline -1 main
f4e5d6c A commit made in the other directory just now

That commit appeared without any git pull. It was made in ~/my-project, but it's visible here because both directories are reading from the same .git.
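To trace how a worktree finds that shared .git, here is a minimal Python sketch that reads the one-line .git file shown earlier and works out where the shared repository lives. The function names are hypothetical, and the "two levels up" rule assumes git's default layout (.git/worktrees/<name>); git itself records the shared location in a commondir file, which this sketch does not read.

```python
from pathlib import PurePosixPath


def parse_gitfile(text: str) -> PurePosixPath:
    """Return the path stored in a worktree's one-line .git file."""
    line = text.strip()
    prefix = "gitdir:"
    if not line.startswith(prefix):
        raise ValueError("not a gitfile: expected a line starting with 'gitdir:'")
    return PurePosixPath(line[len(prefix):].strip())


def shared_git_dir(gitdir: PurePosixPath) -> PurePosixPath:
    """Assume the default layout <repo>/.git/worktrees/<name> and go up two levels."""
    if gitdir.parent.name != "worktrees":
        raise ValueError("path does not look like a linked-worktree admin directory")
    return gitdir.parent.parent


pointer = parse_gitfile("gitdir: /Users/you/my-project/.git/worktrees/my-project-feature\n")
print(shared_git_dir(pointer))  # /Users/you/my-project/.git
```

Every worktree created from the same repository resolves to the same directory, which is why a commit made in one shows up in the others without a fetch.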

What's not shared: the files on disk, the staging area, and which branch is checked out. Each worktree has its own working directory, its own index, its own HEAD. They're independent workspaces that happen to share a backend.

Think of it this way. A git repo is a database. Normally you interact with it through one working directory — one set of files, one staging area, one branch. A worktree is a second interface to the same database. A second set of files, a second staging area, a second branch. As many interfaces as you want, all reading from and writing to the same store.

The problem worktrees already solved

Before AI agents entered the picture, developers had this problem for decades. You're deep in a feature branch — halfway through a refactor, files changed everywhere, nothing compiles. Then Slack lights up: production bug, needs a hotfix now.

What do you do?

Option 1: git stash. You stash your work-in-progress, switch to main, make the fix, commit, push, switch back, git stash pop. Sounds clean. In practice, stash is a footgun. Stash doesn't track which branch you were on. It doesn't save untracked files unless you remember --include-untracked. Pop a stash onto the wrong branch and you're untangling merge conflicts in code you weren't even working on. People have lost days of work to stash mishaps — GitHub Desktop had a bug that silently wiped stashed changes, and VS Code users have reported stashes vanishing after updates.

Option 2: WIP commit. You commit half-broken code with a message like wip don't look at this, switch branches, fix the bug, switch back, then try to remember what state you were in. Your git history looks like a crime scene.

Option 3: Just context-switch. Discard your mental state, handle the hotfix, come back an hour later and spend twenty minutes remembering where you were. This is what most people actually do. It's expensive in ways that don't show up on any dashboard.

Worktrees make all three options unnecessary. You keep your feature branch exactly as it is — mid-refactor, broken tests, whatever — and open a second worktree on main. Fix the bug there. Commit, push. Close the worktree. Your feature branch never noticed anything happened. No stashing, no WIP commits, no context-switching.
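The hotfix flow can be written down as a short command plan. The Python sketch below only builds and prints the git commands; it does not run them. The path, branch name, commit message, and the remote name origin are all hypothetical, and it creates a dedicated hotfix branch from main rather than checking out main directly, which is one reasonable choice among several.

```python
import shlex


def hotfix_plan(worktree_path: str, branch: str, base: str = "main") -> list[list[str]]:
    """Build the git commands for a hotfix in a separate worktree (nothing is executed)."""
    return [
        # New directory plus new branch, starting from the base branch.
        ["git", "worktree", "add", "-b", branch, worktree_path, base],
        # Edit and test inside worktree_path, then commit and push from there.
        ["git", "-C", worktree_path, "commit", "-am", "Fix production bug"],
        ["git", "-C", worktree_path, "push", "-u", "origin", branch],
        # Remove the directory; the branch and its commits stay in the repository.
        ["git", "worktree", "remove", worktree_path],
    ]


for argv in hotfix_plan("../my-project-hotfix", "hotfix/login-crash"):
    print(shlex.join(argv))
```

None of these commands touch the feature worktree, so its half-finished files and staging area stay exactly as they were.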

The same applies to code review. Someone opens a PR and you want to actually run their code, not just read the diff. Without worktrees: stash your work, switch branches, npm install (because they added a dependency), run the code, switch back, npm install again (because your branch has different dependencies), pop stash. With worktrees: open their branch in a second directory, run it there. Your work is untouched.

Worktrees have been in git since 2015. For eleven years, they've been solving these exact problems — and most developers never knew.

Why this isn't just a party trick

OK so git has this feature. Two directories, shared database. Cute.

Let me bring it back to the thing that was nagging you earlier.

Remember the coordination work? Making sure agents don't edit the same files, sequencing commits, switching branches between agents? That coordination exists because all your agents share one set of files on disk. One staging area. One branch. There's only one "workspace," and every agent is operating inside it simultaneously.

With worktrees, each agent gets its own workspace. Its own files, its own staging area, its own branch. Agent 1 can freely edit package.json, commit, and push — while Agent 2 is doing the exact same thing in its own worktree, on its own branch. They can't collide because they're not touching the same files. Not "they probably won't collide" — they structurally cannot.

And here's what that unlocks: you can let go.

You can tell an agent "build this feature, commit when you're done, push the branch, open a PR." And then forget about it. You don't need to watch it. You don't need to coordinate it with other agents. You don't need to stage its files separately. It owns its workstream from start to finish.

That's the difference between "running multiple agents" and "multi-agent development." It's not a difference in tools. It's a difference in how much autonomy you can give each agent. And autonomy requires isolation.

What the tools already do

This isn't theory. The major AI coding tools have already built worktree support into their core workflow.

Claude Code

When you run Claude Code with the --worktree flag, you pass it a name — any name you choose. Claude creates two things: a directory and a branch.

$ claude --worktree billing-api

This creates a worktree directory at .claude/worktrees/billing-api/ and a branch called worktree-billing-api. Claude then starts a session inside that directory, isolated from your main codebase.

Why .claude/worktrees/? It's a convention. Keeping worktrees under .claude/ means they don't clutter your project root. Git won't ignore that path on its own, though: add .claude/worktrees/ to your .gitignore so worktree contents don't show up as untracked files in the main checkout. You could put worktrees anywhere (../billing-api, /tmp/billing-api, wherever). Claude just picks a sensible default.

When Claude finishes and you close the session: if it made no changes, the worktree and branch are cleaned up automatically. If it committed work, Claude asks if you want to keep or remove the worktree. The branch (with its commits) stays either way — your work is safe.

Boris Cherny, who built Claude Code, calls worktrees his "number one productivity tip" — he runs three to five simultaneously:

[Embedded tweet, ID 2025007393290272904]

Running multiple agents is just opening multiple terminals:

$ claude --worktree auth-fix

$ claude --worktree billing-api

$ claude --worktree test-coverage

Three agents, three worktrees, three branches. Each agent can edit files, install packages, run tests, commit, push, and open PRs — all without you coordinating anything.

Cursor

Cursor 2.0 shipped parallel agents in October 2025, powered by git worktrees under the hood. Each of Cursor's up to eight parallel agents operates in its own worktree. You can even spin up multiple agents on the same task and compare their outputs.

VS Code + Copilot

VS Code 1.107 added automatic worktree isolation for background agents. When a Copilot background agent starts working, VS Code silently creates a worktree for it.

OpenAI Codex

Codex added built-in worktree support too:

[Embedded tweet, ID 2018385865207419124]

Every major AI coding tool converged on the same primitive. That's not coincidence — it's because worktrees solve the right problem at the right layer.

A real workflow, step by step

Let's walk through a full multi-agent session. You have a feature to ship — a notification system — and you want to break it into three parallel workstreams.

Launch three agents, each in its own worktree:

$ claude --worktree notif-api
$ claude --worktree notif-ui
$ claude --worktree notif-tests

Behind the scenes, this creates:

$ git worktree list
/Users/you/my-project                                    abc1234 [main]
/Users/you/my-project/.claude/worktrees/notif-api        abc1234 [worktree-notif-api]
/Users/you/my-project/.claude/worktrees/notif-ui         abc1234 [worktree-notif-ui]
/Users/you/my-project/.claude/worktrees/notif-tests      abc1234 [worktree-notif-tests]
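
If you want to script against this listing (for a dashboard or a cleanup job, say), git also offers a machine-readable form: git worktree list --porcelain. The Python sketch below parses that format from a string. The function name is hypothetical, and the sample text is hand-written with shortened hashes rather than captured from a real repository.

```python
def parse_worktree_list(porcelain: str) -> list[dict[str, str]]:
    """Parse `git worktree list --porcelain` output into one dict per worktree."""
    worktrees = []
    current: dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                worktrees.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        # Flags such as "bare" or "detached" carry no value; record them as "true".
        current[key] = value or "true"
    if current:
        worktrees.append(current)
    return worktrees


# Hand-written sample; real output contains full 40-character hashes.
sample = """\
worktree /Users/you/my-project
HEAD abc1234
branch refs/heads/main

worktree /Users/you/my-project/.claude/worktrees/notif-api
HEAD abc1234
branch refs/heads/worktree-notif-api
"""

for wt in parse_worktree_list(sample):
    print(wt["worktree"], wt["branch"].removeprefix("refs/heads/"))
```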

Each agent works independently. They install their own dependencies, make their own commits, run their own tests. You go get coffee.

When they're done, each branch has a clean set of commits. From your main checkout, you merge:

$ git merge worktree-notif-api
$ git merge worktree-notif-ui
$ git merge worktree-notif-tests

Or better — each agent pushes its branch and opens a PR. You review and merge through GitHub. Same workflow you use with human teammates.

Cleanup is automatic if you used claude --worktree. If you created worktrees manually:

$ git worktree remove .claude/worktrees/notif-api
$ git worktree remove .claude/worktrees/notif-ui
$ git worktree remove .claude/worktrees/notif-tests

Teams at incident.io run this workflow daily — four to five agents in parallel. A task estimated at 2 hours done in 10 minutes. Not because the agents are faster. Because five autonomous agents is a different kind of leverage than one agent you're babysitting.

Things to know

A few constraints and gotchas worth knowing before you start.

Each worktree needs its own npm install. The git database is shared, but node_modules, build caches, .next — all per-directory. Each worktree is a full working directory, so it needs its own dependencies installed. For a typical Node.js project that's 200-500MB per worktree.

One branch per worktree. Git enforces this — you can't check out main in two worktrees at the same time. If you try:

$ git worktree add ../another main
fatal: 'main' is already checked out at '/Users/you/my-project'

This is intentional. If two worktrees had the same branch checked out, a commit in one would move the branch while the other's files and index still reflected the old commit. The constraint forces each workstream onto its own branch, which is what you want anyway.

Port conflicts. If agents try to run dev servers, they'll fight over the same port. Give each worktree a different port in its .env, or let your framework pick one automatically.

Merge conflicts happen at merge time, not runtime. If two agents both modify package.json on different branches, you'll resolve the conflict when you merge — same as with any branching workflow. The difference is you're resolving it once, cleanly, instead of discovering it mid-session when both agents are still running.

Don't rm -rf a worktree directory. Use git worktree remove. If you do delete it manually, git worktree prune cleans up the stale metadata.

Where to put worktrees. Claude Code defaults to .claude/worktrees/. Add that path to .gitignore if it isn't there already, since the rest of .claude/ holds project config you'll usually want to commit. Some teams prefer .worktrees/ at the project root: one gitignore entry, tool-agnostic, works with Cursor or Cline or whatever you're running next month. Others use sibling directories (../my-project-feature), the traditional git approach that doesn't need any gitignore at all. Pick one and be consistent. When your team runs five agents daily, everyone should know where to look.

Beyond the basics

Once worktrees click, some interesting patterns emerge.

The race

Spin up multiple agents on the same task, each in its own worktree. Compare the results. Keep the best implementation. Cursor 2.0 was built around this idea.

Agent teams

Claude Code has an experimental feature where a lead agent spawns teammates, each in their own worktree, with shared task lists and async messaging:

$ CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 claude

The lead breaks down a task, teammates claim subtasks, each works in isolation, and the lead merges the results.

Conflict detection

Tools like clash can detect merge conflicts between worktrees before they happen — using read-only three-way merges to warn you when two agents are touching the same code. You can even wire it up as a Claude Code hook that checks before every file write.

Guardrails

Worktrees give agents isolation. They don't give agents judgment.

Nothing in Claude Code prevents an agent from running git push --force, git reset --hard, or rm -rf in its worktree. The isolation means it won't corrupt your main directory — but it can still destroy its own branch, push garbage to your remote, or wipe its own work. Cursor handles this differently: it forces a manual "Apply" step before any worktree changes touch your code. Claude Code trusts the agent by default.

The fix is hooks. Claude Code's PreToolUse hook fires before every tool execution — every shell command, every file edit, every write. If the hook exits with code 2, the action is blocked.

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/git-safety.sh"
          }
        ]
      }
    ]
  }
}

The script inspects the command about to run and blocks destructive patterns — git push --force, git reset --hard, git clean -f, git branch -D. Matt Pocock published a ready-made version: git-guardrails-claude-code. Trail of Bits has a more opinionated config that also blocks credential reads and enforces feature branches.
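
As a rough illustration of what such a script checks, here is a Python stand-in for the decision logic. It is not Matt Pocock's or Trail of Bits' code. It assumes the hook receives the tool call as JSON with the shell command under tool_input.command, and the regular expressions are hypothetical, deliberately conservative patterns (for example, they also block --force-with-lease).

```python
import json
import re

# Hypothetical patterns for the destructive commands named above.
BLOCKED_PATTERNS = [
    r"\bgit\s+push\b.*(--force\b|\s-f\b)",
    r"\bgit\s+reset\b.*--hard\b",
    r"\bgit\s+clean\b.*\s-[a-zA-Z]*f",
    r"\bgit\s+branch\b.*\s-D\b",
]


def hook_exit_code(payload: str) -> int:
    """Return 2 (block) if the Bash command matches a blocked pattern, else 0."""
    event = json.loads(payload)
    command = event.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return 2
    return 0


blocked = json.dumps({"tool_input": {"command": "git push --force origin main"}})
allowed = json.dumps({"tool_input": {"command": "git status"}})
print(hook_exit_code(blocked))  # 2
print(hook_exit_code(allowed))  # 0
```

A real hook script would read the payload from stdin, print a short reason to stderr when blocking, and exit with the returned code.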

For worktree setup automation, WorktreeCreate hooks fire whenever --worktree is invoked. Teams use these to auto-copy .env files, run npm install, and assign deterministic port numbers so dev servers don't collide. The claude-worktree-hooks project does all three — it hashes the branch name to pick a port in the 3100–9999 range, so each worktree gets its own port every time.
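
The port trick is easy to sketch. The Python below is not the claude-worktree-hooks implementation; SHA-256 and the function name are assumptions made for illustration. It uses hashlib rather than the built-in hash(), because hash() on strings is randomized per process and would give a different port on every run.

```python
import hashlib

PORT_MIN, PORT_MAX = 3100, 9999


def port_for_branch(branch: str) -> int:
    """Map a branch name to a stable port in [PORT_MIN, PORT_MAX]."""
    digest = hashlib.sha256(branch.encode("utf-8")).digest()
    # Use the first 4 bytes as an unsigned integer, then fold into the range.
    value = int.from_bytes(digest[:4], "big")
    return PORT_MIN + value % (PORT_MAX - PORT_MIN + 1)


for name in ("worktree-notif-api", "worktree-notif-ui", "worktree-notif-tests"):
    print(name, port_for_branch(name))
```

Two different branch names can still hash to the same port, so a setup script would want to check that the port is free before starting a server.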

One thing that works in your favor: git hooks are shared across all worktrees. Every worktree points back to the same .git directory, so a single pre-push hook installed once protects every worktree automatically. No per-worktree configuration needed.

And for cross-worktree conflict detection, wire clash into the same system:

{
  "matcher": "Write|Edit",
  "hooks": [{ "type": "command", "command": "clash check" }]
}

Every file write checks for conflicts with other worktrees first. The agent gets warned before it creates a merge problem, not after.

The cheat sheet

For reference — every command you need:

$ git worktree add <path> -b <new-branch>
# Create a new directory + new branch

$ git worktree add <path> <existing-branch>
# Create a new directory for an existing branch

$ git worktree list
# Show all worktrees and their branches

$ git worktree remove <path>
# Clean up a worktree when you're done

$ git worktree prune
# Remove metadata for worktrees that were manually deleted

$ claude --worktree <name>
# Creates .claude/worktrees/<name>/ and branch worktree-<name>

$ claude --worktree
# Same, with an auto-generated name like bright-running-fox

$ claude --worktree <name> --tmux
# Opens the worktree session in a tmux pane

What changes

I want to end with what actually shifted for me.

For a long time I thought of multi-agent development as "running multiple agents." Two terminals, same directory, careful not to collide. It worked. I was productive. I didn't think I was missing anything.

Worktrees didn't fix something that was broken. They showed me a workflow I didn't know existed — one where I dispatch tasks and review PRs, and the coordination between agents is handled by git itself. Each agent gets a branch. Each branch gets a directory. Each directory is an isolated workspace. That's it. Git already knows how to merge branches. It's been doing it for twenty years.

The interesting part isn't the technology. It's the shift in what you do. You stop managing files and start managing outcomes. You stop being the synchronization layer and start being the decision-maker. Five agents, five worktrees, five PRs. You review, you merge, you ship.

Start with one:

$ claude --worktree my-first-worktree

Then try two. You'll feel the difference immediately.

Context Engineering: The Complete Guide for AI-Assisted Coding (2026)

Vibehackers — Thu, 26 Mar 2026 19:51:06 +0000

If you've been using AI coding tools and wondering why results are inconsistent — brilliant one session, garbage the next — the answer isn't the model. It's the context.

Context engineering is the discipline of curating the entire information environment an AI agent operates within. Not just what you type in the prompt box. Everything: the files it reads, the rules it follows, the history it carries, the tools it can reach, and the structure of the project it navigates.

The term was popularized by Shopify CEO Tobi Lutke in mid-2025:

[Embedded tweet, ID 1935533422589399127]

Andrej Karpathy endorsed it immediately, adding crucial nuance:

[Embedded tweet, ID 1937902205765607626]

He enumerated what "doing this right" involves: task descriptions, few-shot examples, RAG, related data, tools, state and history, and compacting. Then the critical caveat: "Too much or too irrelevant context can increase costs and degrade performance."

That last point is the one most people miss. Context engineering isn't about giving the AI more information. It's about giving it the right information.

By February 2026, Karpathy took the terminology further — coining "agentic engineering" as the next evolution beyond vibe coding, describing a workflow where "you are not writing the code directly 99% of the time... you are orchestrating agents who do and acting as oversight."

The progression is clear:

Stage	What You Do	Core Skill
Prompt engineering	Write clever instructions	Wordsmithing
Context engineering	Curate the information environment	Information architecture
Agentic engineering	Orchestrate autonomous agents	Systems design

Each stage builds on the last. You can't do agentic engineering without context engineering. And context engineering is where most developers are right now — or should be.

Why Context Engineering Matters More Than Model Choice

Here's a counterintuitive truth backed by research: a developer with a clean, well-structured context on a weaker model will outperform one with a cluttered context on a stronger model.

Chroma Research tested 18 LLMs and found that across all models, accuracy drops as input length increases — even on simple tasks. The "Lost in the Middle" phenomenon (first identified by Stanford researchers) shows LLMs attend strongly to tokens at the start and end of the context window but poorly to the middle.

When a debugging session has loaded 20,000 tokens of irrelevant file contents and dead-end explorations, the actual relevant code — sitting somewhere in the middle — gets less attention.

Anthropic's own best practices say it directly:

"Most best practices are based on one constraint: Claude's context window fills up fast, and performance degrades as it fills."

Google's context engineering whitepaper arrived at the same conclusion:

The true intelligence of an agent doesn't come from the model — it comes from how you manage context.

Martin Fowler's team at ThoughtWorks studied this in practice and found an almost comically simple truth: all forms of AI coding context engineering ultimately involve "a bunch of markdown files with prompts." Two main categories — Instructions (tell the agent what to do) and Skills (resources the LLM loads on demand).

Simple in concept. Hard in practice.

The Four Pillars Framework

Sequoia Capital's Inference newsletter published "Vibe Coding Needs Context Engineering" in July 2025, arguing that "intuition does not scale, structure does." They identified four pillars — a framework also developed independently by LangChain:

Pillar 1: Write Context

Save persistent information outside the context window. This is your CLAUDE.md, your .cursor/rules/, your spec documents. Anything the agent needs to know every session gets written down once, not repeated every prompt.

Think of it as the difference between telling a new team member your coding standards verbally every morning versus writing them in a wiki. The written version works whether you're there or not, whether it's the first day or the hundredth.

What to write down:

Build and test commands the agent can't guess from reading code
Code style rules that differ from language defaults
Architecture decisions specific to your project
Common gotchas and non-obvious behaviors
Which tools and frameworks you're using and why

What NOT to write:

Anything the agent can figure out by reading your code
Standard language conventions (TypeScript naming, Python PEP 8)
Detailed API docs (link instead)
File-by-file codebase descriptions

Pillar 2: Select Context

Pull in only what's relevant for the current task. This is the hardest pillar because it requires judgment.

Don't dump your entire codebase into the prompt. Don't paste 14 files "for reference." Targeted file reads, specific function references, relevant test outputs — not "here's everything, figure it out."

Cursor's research on Dynamic Context Discovery quantified this problem. In A/B testing, they found that rather than including all tools and context upfront, retrieving only tool names and fetching full details as needed reduced total agent tokens by 46.9% — while maintaining or improving quality.
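
The pattern of names first, details on demand can be sketched in a few lines of Python. This is not Cursor's implementation; the class, the tool names, and the schema shapes are all hypothetical, chosen only to show the shape of the idea.

```python
from typing import Callable


class LazyToolRegistry:
    """Expose tool names up front; load each full schema only on first use."""

    def __init__(self, loaders: dict[str, Callable[[], dict]]):
        self._loaders = loaders
        self._cache: dict[str, dict] = {}

    def names(self) -> list[str]:
        # Cheap to put in context: just the names, no schemas.
        return sorted(self._loaders)

    def schema(self, name: str) -> dict:
        # Fetch and cache the full definition only when the agent asks for it.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def loaded(self) -> list[str]:
        # Which schemas have actually been pulled into context so far.
        return sorted(self._cache)


registry = LazyToolRegistry({
    "search_issues": lambda: {"name": "search_issues", "parameters": {"query": "string"}},
    "create_branch": lambda: {"name": "create_branch", "parameters": {"name": "string"}},
})
print(registry.names())
registry.schema("create_branch")
print(registry.loaded())
```

Only the schema the agent actually requested ends up in context; the other tool costs a name and nothing more.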

The "token tax" is real. As one analysis found: in large projects with 20 global rules, developers might be sending 2,000 extra tokens with every message. Rules taking up 25% of the context window means the AI has 25% less space for actual source code.
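
The arithmetic behind that figure is worth spelling out. The Python sketch below assumes about 100 tokens per rule (implied by 2,000 tokens across 20 rules) and a hypothetical 50-message session, and it uses the simplified model in which the rules are resent with every message.

```python
def rules_overhead(rule_count: int, tokens_per_rule: int, messages: int) -> dict[str, int]:
    """Estimate tokens spent on always-loaded rules, per message and per session."""
    per_message = rule_count * tokens_per_rule
    return {"per_message": per_message, "per_session": per_message * messages}


overhead = rules_overhead(rule_count=20, tokens_per_rule=100, messages=50)
print(overhead)  # {'per_message': 2000, 'per_session': 100000}
```

Halving the rule count halves both numbers, which is why pruning rules pays off on every message rather than once.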

Pillar 3: Compress Context

Manage token usage through summarization and pruning. When a session gets long, compact it. When an exploration is done, clear the dead ends.

Every token of noise competes with signal. A 200-token rule you added "just in case" is 200 tokens of source code your agent can't see.

Practical compression techniques:

Compact conversations: Claude Code's /compact command summarizes the conversation, reducing tokens 50-70%. You can focus it: /compact Focus on the API changes we discussed
Clear dead sessions: /clear deletes the entire conversation. Use it when switching tasks
Summarize research: When using subagents for exploration, they run in separate context windows and return summaries — keeping the parent context clean
Prune rules files: If your CLAUDE.md or cursor rules exceed 300-500 lines, you're probably hurting more than helping
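
A length budget like this is simple to check mechanically. The sketch below is a hypothetical helper, not part of Claude Code or Cursor; it counts non-blank lines (an assumption about what should count) and uses the 300 and 500 figures from this guide as soft and hard limits.

```python
def rules_file_report(text: str, soft_limit: int = 300, hard_limit: int = 500) -> str:
    """Classify a rules file by its count of non-blank lines."""
    count = sum(1 for line in text.splitlines() if line.strip())
    if count > hard_limit:
        return f"{count} lines: over {hard_limit}, prune aggressively"
    if count > soft_limit:
        return f"{count} lines: over {soft_limit}, consider pruning"
    return f"{count} lines: within budget"


# Stand-in for the contents of a real CLAUDE.md.
sample = "\n".join(f"- rule {i}" for i in range(320))
print(rules_file_report(sample))
# prints "320 lines: over 300, consider pruning"
```

Something like this could run in CI or a pre-commit hook so the file cannot grow unnoticed.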

Pillar 4: Isolate Context

Structure information so it doesn't bleed across tasks. Use subagents for research (they run in separate context windows and report back summaries). Start fresh sessions for unrelated work.

Don't let Monday's debugging contaminate Tuesday's feature build.

As Philipp Schmid put it:

"Prompt Engineering = Crafting perfect instruction strings. Context Engineering = Building systems that dynamically assemble comprehensive contextual information tailored to specific tasks."

The Memory Layer: Rules Files That Actually Work

Every major AI coding tool now has a mechanism for persistent context — a file (or set of files) that gets loaded automatically at the start of every session. This is the most important file in your project. More important than your README. More important than your config.

Because it's the file that determines whether your AI agent understands your project or hallucinates about it.

CLAUDE.md (Claude Code)

Claude Code reads CLAUDE.md at the start of every session. It's the project's constitution — the rules that govern all AI behavior within your codebase.

Anthropic's official docs are clear about what belongs here:

Include	Exclude
Build/test commands Claude can't guess	Anything Claude can figure out from reading code
Code style rules that differ from defaults	Standard language conventions
Architectural decisions specific to your project	Detailed API docs (link instead)
Common gotchas and non-obvious behaviors	Information that changes frequently
Developer environment quirks (env vars, etc.)	File-by-file codebase descriptions

The official docs warn against the most common failure mode:

"The over-specified CLAUDE.md. If your CLAUDE.md is too long, Claude ignores half of it because important rules get lost in the noise. Fix: Ruthlessly prune."

The hierarchy system:

~/.claude/CLAUDE.md — global preferences (your personal coding style)
./CLAUDE.md — project root (checked into git, shared with team)
./CLAUDE.local.md — personal overrides (gitignored)
./src/feature/CLAUDE.md — directory-scoped rules (only loaded when working in that directory)

Community consensus from HumanLayer, Builder.io, and Arize AI: keep it under 300 lines. Run /init to auto-generate a starter from your codebase structure. Iterate based on actual agent behavior, not hypothetical scenarios.

# CLAUDE.md

## Build Commands
$ npm run dev          # Start dev server (Turbopack)
$ npm run test         # Run vitest
$ npm run lint:fix     # ESLint with auto-fix

## Architecture
- Next.js 15 App Router, TypeScript, Tailwind CSS
- Supabase for DB + Auth + Storage
- Feature-based directory structure: src/features/{name}/

## Code Style
- Use named exports, not default exports
- Prefer server components; add 'use client' only when needed
- All DB queries go through src/lib/db/ — never query Supabase directly from components

## Common Gotchas
- Supabase RLS is enabled on all tables — service role key required for admin operations
- The `projects` table has a trigger that auto-updates `updated_at`
- Image domains must be whitelisted in next.config.ts

Thomas Landgraf's deep dive covers advanced patterns: using CLAUDE.md to encode project-specific testing strategies, deployment pipelines, and even team communication preferences.

Cole Medin's context-engineering-intro repo provides a hands-on starting point: "Context engineering is the new vibe coding — it's the way to actually make AI coding assistants work."

.cursor/rules/ (Cursor)

Cursor's rules system is more granular than CLAUDE.md, with four types of rules:

Always Apply — active every session (like CLAUDE.md)
Apply Intelligently — agent decides relevance based on your description
Apply to Specific Files — triggered by glob patterns (e.g., only for *.tsx files)
Apply Manually — invoked via @rule-name
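
To make the glob-triggered type concrete, here is a minimal sketch of how "Always Apply" and "Apply to Specific Files" rules could be selected for a given path. It is not Cursor's implementation: the Rule class, the rule names, and the use of Python's fnmatch are all assumptions for illustration, and the "Apply Intelligently" and "Apply Manually" types are left out.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Rule:
    name: str
    always_apply: bool = False
    globs: tuple[str, ...] = ()


def rules_for_file(rules: list[Rule], path: str) -> list[str]:
    """Return names of rules that are always on or whose glob matches the path."""
    return [
        rule.name
        for rule in rules
        if rule.always_apply or any(fnmatch(path, pattern) for pattern in rule.globs)
    ]


# Hypothetical rule set.
rules = [
    Rule("project-basics", always_apply=True),
    Rule("react-components", globs=("*.tsx",)),
    Rule("sql-migrations", globs=("db/migrations/*.sql",)),
]
print(rules_for_file(rules, "src/features/auth/login-form.tsx"))
# prints ['project-basics', 'react-components']
```

The agent editing a .tsx file pays for the React rules; the agent editing a migration does not.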

Rules live in .cursor/rules/*.mdc files. The awesome-cursorrules repo has community templates. Same official advice: keep content under 500 lines, decompose large rules into composable pieces.

An empirical study of Cursor Rules analyzing thousands of repositories found that rules often grow organically and accumulate technical debt — just like code. The most effective teams treat their rules files as code: reviewing them in PRs, deleting stale instructions, and testing against actual agent behavior.

.github/copilot-instructions.md (Copilot)

GitHub Copilot's equivalent: a .github/copilot-instructions.md file for repository-wide instructions, plus .github/instructions/NAME.instructions.md files for path-specific rules.

AGENTS.md (Cross-Tool Standard)

AGENTS.md is emerging as a cross-tool standard — recognized by Claude Code, Copilot, Cursor, and Gemini. Plain markdown, no metadata needed. If you work across multiple tools, this is the file that follows you everywhere.

What the Research Actually Shows

There's now academic evidence on whether these context files actually help. The results are nuanced — and important.

An empirical study of 2,303 agent context files from 1,925 repos found that these files function like configuration code: they evolve frequently via small additions and prioritize build commands (62.3%), implementation details (69.9%), and architecture (67.7%).

But here's the counterintuitive finding: a study evaluating AGENTS.md files found that context files can reduce task success rates versus no context, while increasing inference cost by 20%+.

The lesson isn't that context files don't work — it's that poorly maintained context files are worse than none. Outdated instructions, contradictory rules, stale architecture descriptions — these actively mislead the agent.

Simon Willison highlighted a study of 9,649 experiments across 11 models comparing YAML, Markdown, JSON, and TOML formats for context delivery. The format matters less than the content quality — but structured formats consistently outperformed unstructured prose.

Spotify's engineering team documented the same lesson while building "Honk", their background coding agent (1,500+ merged PRs). Their second blog post is entirely about context engineering: the architecture of hot-memory constitutions, specialized domain agents, and cold-memory specification documents that made the agent actually work at production scale.

Advanced Technique: Dynamic Context Discovery

Static context — rules that load every session regardless of task — is the simplest approach. But it doesn't scale.

Cursor's Dynamic Context Discovery represents the next evolution. Instead of loading everything upfront:

The agent starts with a lightweight index of available tools and context
It identifies what's relevant to the current task
It fetches full details only for what it needs

The results in their A/B test: 46.9% reduction in total tokens used, with no quality degradation.
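
The three steps above can be sketched in a few lines. This is a toy model, not Cursor's system: the tool catalog is made up, and the token estimate is a crude word count rather than a real tokenizer. The size of the saving here says nothing about the 46.9% figure; it only shows where the saving comes from.

```python
# Hypothetical tool catalog: name -> full description.
TOOLS = {
    "run_tests": "Run the test suite. Accepts a path filter and a watch flag. " * 5,
    "query_db": "Run a read-only SQL query against the dev database. " * 5,
    "deploy_preview": "Build and deploy a preview environment for a branch. " * 5,
}


def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a tokenizer)."""
    return len(text.split())


def upfront_context() -> str:
    """Static approach: every tool's full description is loaded."""
    return "\n".join(f"{name}: {desc}" for name, desc in TOOLS.items())


def discovered_context(needed: set[str]) -> str:
    """Dynamic approach: list every name, but include full details only for needed tools."""
    index = "\n".join(TOOLS)
    details = "\n".join(f"{name}: {TOOLS[name]}" for name in sorted(needed))
    return index + "\n" + details


static_cost = rough_tokens(upfront_context())
dynamic_cost = rough_tokens(discovered_context({"run_tests"}))
print(static_cost, dynamic_cost)
```

The agent still sees that query_db and deploy_preview exist, so it can ask for them later; it just does not pay for their descriptions until it does.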

Claude Code's skills system works similarly. Skills are context that loads on demand — when the agent determines it's relevant to the current task. Instead of cramming everything into CLAUDE.md, you decompose context into modular, task-specific units.

Towards Data Science covered this pattern as "escaping the prompt engineering hamster wheel" — moving from ever-longer instructions to composable, reusable context modules.

Session Management: When to Clear and When to Keep Going

The most underrated context engineering skill is knowing when to throw away your context and start fresh.

Anthropic's docs name specific trigger conditions:

"If you've corrected Claude more than twice on the same issue in one session, the context is cluttered with failed approaches. Run /clear and start fresh with a more specific prompt."

The four signals it's time for /clear:

Switching to unrelated tasks — don't let feature work context bleed into bug fixing
After two failed corrections — the failed attempts are polluting the context
After "kitchen sink" sessions — you've mixed too many topics
When performance visibly decreases — responses get generic, instructions get forgotten

/clear vs /compact:

/clear — nuclear option. Deletes entire conversation. CLAUDE.md re-loads fresh
/compact [instructions] — surgical option. Summarizes the conversation (50-70% reduction). You can focus: /compact Focus on the API changes we discussed

Armin Ronacher adds an important exception: don't clear when the failure history itself is valuable. If the agent has tried and failed a specific approach, that context prevents it from repeating the same mistake. The art is knowing whether failed attempts are useful signal or useless noise.

For long-running work, start a fresh session after approximately 30 messages, and always write key decisions to your context files before clearing so they persist.
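
If it helps to see the heuristics in one place, here is a hedged sketch of them as a decision function. The thresholds come from this guide (two failed corrections, roughly 30 messages for a fresh session, 20 for compaction per the checklist below); the function name and the order in which the checks run are my own assumptions.

```python
def session_action(
    next_task_unrelated: bool,
    failed_corrections: int,
    message_count: int,
    failure_history_useful: bool = False,
) -> str:
    """Suggest 'clear', 'compact', or 'continue' for the current session."""
    if failure_history_useful:
        # Ronacher's exception: keep context that stops the agent repeating a mistake.
        return "compact" if message_count >= 20 else "continue"
    if next_task_unrelated or failed_corrections >= 2 or message_count >= 30:
        return "clear"
    if message_count >= 20:
        return "compact"
    return "continue"


print(session_action(next_task_unrelated=False, failed_corrections=2, message_count=5))
# prints "clear"
```

Treat the output as a prompt for your own judgment, not a rule: deciding whether failure history is useful is exactly the part a function cannot do for you.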

Structuring Your Project for AI

Context engineering isn't just about memory files and session management. It's about how your entire project is organized. Agents navigate codebases by reading files and following imports — the easier your project is to navigate, the better the agent performs.

Favor Vertical Over Horizontal Organization

Feature-driven layouts work better than layer-driven layouts:

# Layer-driven (harder for agents)
src/
  models/
  controllers/
  views/
  services/

# Feature-driven (better for agents)
src/
  auth/
    auth-service.ts
    auth-service.test.ts
    auth-types.ts
  billing/
    billing-service.ts
    billing-service.test.ts
  dashboard/
    dashboard-page.tsx
    dashboard-components.tsx

An agent working on auth only needs to read the auth directory. A layer-driven layout forces it to load files from every directory to understand a single feature.

Use Semantic File Names

user-authentication-service.ts is better than uas.ts. Agents infer file contents from names before reading them — descriptive names reduce unnecessary file reads and save context.

Keep Files Small

Anthropic's best practices recommend smaller, focused modules. A 3,000-line monolith forces the agent to read (and hold in context) the entire file to modify a single function.

Colocate Tests with Code

If your test for auth-service.ts is in auth-service.test.ts right next to it, the agent finds it instantly. If it's in tests/unit/services/auth/auth-service.test.ts, that's multiple directory traversals burning context.
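
Colocation also makes the test location computable from the source path alone. A minimal sketch, assuming the name.test.ext convention used in the examples above; the helper names and the file set are hypothetical.

```python
from pathlib import PurePosixPath


def colocated_test_path(source: str) -> str:
    """Map src/auth/auth-service.ts to src/auth/auth-service.test.ts."""
    path = PurePosixPath(source)
    return str(path.with_name(f"{path.stem}.test{path.suffix}"))


def has_colocated_test(source: str, repo_files: set[str]) -> bool:
    """True when the expected test file sits in the same directory."""
    return colocated_test_path(source) in repo_files


# Hypothetical repository listing.
files = {
    "src/auth/auth-service.ts",
    "src/auth/auth-service.test.ts",
    "src/billing/billing-service.ts",
}
print(colocated_test_path("src/auth/auth-service.ts"))
# prints "src/auth/auth-service.test.ts"
print(has_colocated_test("src/billing/billing-service.ts", files))
# prints "False"
```

No directory walk, no search: one string operation tells the agent (or a lint script) where the test should be.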

Treat Context Files as Code

They evolve with your codebase. Review them in PRs. Delete stale instructions. Add new patterns when you discover them. As EclipseSource notes, the hard problem isn't creating context files — it's keeping them accurate.

The "Harness Engineering" Pattern

Dex Horthy from HumanLayer coined an emerging concept that captures where context engineering is heading:

"Harness engineering" is applying context engineering principles to how you use an existing agent — not just how you configure it. It's the difference between writing good CLAUDE.md and designing the entire workflow: when to spawn subagents, how to structure multi-step tasks, when to isolate vs. share context.

His YC Root Access talk is the best technical deep dive on advanced context engineering — covering why conversational prompting fails at scale, spec-first development, and the finding that agents tend to perform better when using less than 40% of the LLM's context window.

Context Engineering in Practice: A Real Workflow

Let's make this concrete. Here's how context engineering looks in a real development session with Claude Code.

Step 1: Start With Clean Context

$ claude
# CLAUDE.md loads automatically
# Agent knows your project structure, build commands, coding style

Step 2: Be Specific About the Task

Instead of: "Fix the login bug"

Try: "The login form on /auth/login returns a 401 when valid credentials are submitted. The issue started after commit abc123. Check src/auth/auth-service.ts and the Supabase auth configuration."

You've selected context: specific file, specific commit, specific behavior. The agent doesn't need to explore your entire codebase.

Step 3: Use Subagents for Research

When you need to understand a large codebase area, don't ask the main agent to read 20 files. Use subagents:

$ # In Claude Code, use the Agent tool for research
$ # The subagent runs in its own context window
$ # Returns a summary to the parent — keeping parent context clean

Step 4: Compact at Natural Breakpoints

After completing a subtask (fixing the auth bug), before starting the next task (adding a new feature):

/compact Focus on the auth fix we just completed
# Or, if the next task is completely unrelated:
/clear

Step 5: Write Decisions Back to Memory

Before clearing, capture anything the agent learned that should persist:

$ # Add to CLAUDE.md or a project doc:
$ # "Supabase auth tokens expire after 1 hour.
$ #  Refresh token logic is in src/auth/token-refresh.ts"

This is the cycle: clean start → targeted context → isolate research → compress at breakpoints → persist insights → clean start again.

The Bigger Picture: Four Disciplines of AI Development

Context engineering sits within a broader framework. By 2026, what we used to call "prompting" has split into four distinct disciplines:

Prompt Craft — writing clear instructions. The original skill. By 2026, this is table stakes.
Context Engineering — curating the entire information environment an agent operates within. What this guide covers.
Intent Engineering — encoding goals, values, and decision boundaries into agent infrastructure. Telling agents what to want, not just what to do.
Specification Engineering — writing structured documents that agents can execute against over long periods without intervention. The foundation for truly autonomous development.

This progression maps directly onto skill levels. Prompt craft is autocomplete-level. Context engineering is agent-assisted. Intent and specification engineering are orchestrator-level — where the real productivity multipliers live.

The "Vibe Coding Hangover"

There's a growing recognition that the initial excitement around vibe coding — just describe what you want and let AI build it — hit a wall. Multiple writers have described a "vibe coding hangover" — the realization that unstructured AI coding produces unmaintainable code.

Context engineering is the antidote. Not a rejection of AI coding, but its maturation. As GitHub's engineering blog puts it: you don't get better AI outputs by writing cleverer prompts. You get them by engineering better context.

Quick-Start Checklist

If you're just getting started with context engineering, here's the minimum viable setup:

1. Create your rules file (5 minutes)

Claude Code: Run /init to generate a starter CLAUDE.md
Cursor: Create .cursor/rules/project.mdc
Copilot: Create .github/copilot-instructions.md

2. Add the essentials (10 minutes)

Build and test commands
3-5 most important code style rules
Architecture overview (one paragraph)
Top 3 gotchas new developers hit

3. Set session habits (ongoing)

Start fresh sessions for unrelated tasks
Compact after 20-30 messages
Clear after 2 failed corrections
Write decisions to your rules file before clearing

4. Organize for navigability (when refactoring)

Feature-based directories over layer-based
Semantic file names
Tests colocated with source
Small, focused files

5. Iterate your rules (weekly)

Delete rules the agent already follows naturally
Add rules for mistakes the agent keeps making
Keep total length under 300 lines (CLAUDE.md) or 500 lines (Cursor rules)

Context engineering is the skill that separates developers who get consistent, high-quality results from AI coding tools from those who get lucky sometimes. Master it, and every other AI development skill becomes easier.

A Short History of Agent-Based Models — and Why Software Engineers Should Care

Vibehackers — Thu, 19 Mar 2026 22:18:56 +0000

In the 1940s, John von Neumann proved that a cellular automaton could replicate itself. His design required 29 possible states per cell and a pattern of roughly 200,000 cells. It was mathematically rigorous and practically useless — too complex to study, too large to visualize, too unwieldy to teach anyone anything.

John Horton Conway, a mathematician at Cambridge, thought the interesting question wasn't whether self-replication was possible but how simple a system could be and still produce complex behavior. During tea breaks through the late 1960s, he tested rule after rule on pencil grids, discarding anything that died immediately or grew without bound. He was searching for a minimum — the fewest rules that would sustain unpredictable, open-ended behavior. In 1970, he found four.

A cell on a grid lives or dies based on its neighbors. Fewer than two, it dies. Two or three, it survives. More than three, it dies. Exactly three neighbors bring a dead cell to life. Von Neumann needed 29 states. Conway needed two.
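
Those four rules fit in one short function. The sketch below stores only the live cells as a set of coordinates on an unbounded grid; that representation is my choice for brevity, not anything Conway specified.

```python
from collections import Counter

Cell = tuple[int, int]


def step(live: set[Cell]) -> set[Cell]:
    """Advance Conway's Game of Life by one generation."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Exactly three neighbors: birth or survival. Two neighbors: survival only.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


blinker = {(0, 1), (1, 1), (2, 1)}  # horizontal bar of three cells
print(sorted(step(blinker)))
# prints [(1, 0), (1, 1), (1, 2)]
```

The horizontal bar becomes a vertical one, and the next step flips it back. Every other live cell dies because it never appears with a count of two or three.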

Within months, a team at MIT led by Bill Gosper discovered the glider gun — a pattern that manufactures traveling structures indefinitely. Then came self-replicating patterns. In 1982, Conway proved that his four-rule system is Turing-complete: capable, in principle, of computing anything a real computer can. Von Neumann's 200,000-cell monster was overkill. Four rules and a pencil grid were enough.

If you've been to any talk on complexity or emergence, you've seen Game of Life used as the opening example. It's the "Hello, World" of the field — everyone knows it, and most explanations stop there. What almost nobody covers is what happened next: the economists, animators, and political scientists who took the same insight and applied it to things that actually mattered.

The Economist and the Checkerboard

A year after Martin Gardner's Scientific American column introduced Conway's game, an economist named Thomas Schelling was working on a completely different problem: residential segregation. Instead of a computer, he used a physical checkerboard and two colors of coins. His rule was even simpler than Conway's: if fewer than a third of your immediate neighbors are your color, move to a random empty square.

One-third is a mild preference. It means you're fine being in the minority — you just don't want to be nearly alone. Schelling expected the board to stay mixed. It didn't.

From a well-shuffled starting position, the coins rapidly organized themselves into large, homogeneous clusters. Not because any coin wanted segregation — the rule explicitly tolerated diversity — but because the cumulative effect of many small, reasonable preferences produced a macro-level outcome that no individual coin would have chosen.
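
Here is a minimal sketch of the checkerboard rule as described above, so you can watch the clustering yourself. The board size, coin counts, random seed, and the choice to treat a coin with no occupied neighbors as content are all assumptions of this sketch, not details from Schelling's paper.

```python
import random

EMPTY = None


def unhappy(grid, r, c, threshold=1 / 3):
    """True if fewer than `threshold` of the occupied neighbors share this coin's color."""
    size = len(grid)
    color = grid[r][c]
    same = total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr, dc) != (0, 0) and 0 <= rr < size and 0 <= cc < size:
                other = grid[rr][cc]
                if other is not EMPTY:
                    total += 1
                    same += other == color
    return total > 0 and same / total < threshold


def sweep(grid, rng):
    """Move every unhappy coin to a random empty square; return how many moved."""
    size = len(grid)
    moved = 0
    for r in range(size):
        for c in range(size):
            if grid[r][c] is not EMPTY and unhappy(grid, r, c):
                empties = [(i, j) for i in range(size) for j in range(size) if grid[i][j] is EMPTY]
                i, j = rng.choice(empties)
                grid[i][j], grid[r][c] = grid[r][c], EMPTY
                moved += 1
    return moved


rng = random.Random(0)
coins = ["X"] * 28 + ["O"] * 28 + [EMPTY] * 8
rng.shuffle(coins)
board = [coins[i * 8:(i + 1) * 8] for i in range(8)]
for _ in range(20):
    if sweep(board, rng) == 0:  # stop early once every coin is content
        break
for row in board:
    print("".join(cell or "." for cell in row))
```

Notice what the code does not contain: any notion of a cluster, a neighborhood boundary, or a preference for segregation. Whatever pattern prints is a product of the moves alone.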

Schelling published this in 1971 as "Dynamic Models of Segregation." In 2005, he won the Nobel Prize in Economics, partly for this work.

The model's lasting contribution was a single, uncomfortable idea: the system-level outcome is not reducible to the individual agents' intentions. You can understand every agent perfectly — know its rules, its preferences, its decision process — and still be unable to predict what the system will do.

Symbolics, 1986: The Animator Who Made Birds Think

Craig Reynolds was a software engineer at Symbolics with a practical problem: he needed to animate realistic bird flocks for a short film.

The traditional approach — scripting each bird's path — was hopeless. Real flocks have no choreographer. Hundreds of birds move as a coherent mass, splitting around obstacles and reforming, without any individual bird knowing the shape of the whole flock.

Reynolds gave each simulated bird (he called them "boids") just three behavioral rules:

Separation — steer away from nearby flockmates to avoid collision
Alignment — steer toward the average heading of nearby flockmates
Cohesion — steer toward the average position of nearby flockmates

Each boid could only see its immediate neighbors. No central controller, no leader boid, no global awareness. He presented the result at SIGGRAPH 1987. The boids flocked. The technique produced the bat swarms in Tim Burton's Batman Returns (1992). In 1998, Reynolds received an Academy Scientific and Technical Award — three rules and an Oscar.
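
The three rules translate almost directly into code. This is a simplified 2D sketch, not Reynolds' original implementation: the weights, the visibility radius, and the minimum gap are invented values, and alignment here uses average velocity as a stand-in for average heading.

```python
from dataclasses import dataclass


@dataclass
class Boid:
    x: float
    y: float
    vx: float
    vy: float


def steer(boid, flock, radius=10.0, min_gap=2.0, w_sep=0.05, w_align=0.05, w_coh=0.01):
    """Return the boid's new velocity from separation, alignment, and cohesion."""
    def dist_sq(other):
        return (other.x - boid.x) ** 2 + (other.y - boid.y) ** 2

    # Local perception only: flockmates within the visibility radius.
    near = [o for o in flock if o is not boid and dist_sq(o) <= radius ** 2]
    if not near:
        return boid.vx, boid.vy
    n = len(near)
    # Cohesion: head toward the average position of visible flockmates.
    coh_x = sum(o.x for o in near) / n - boid.x
    coh_y = sum(o.y for o in near) / n - boid.y
    # Alignment: nudge velocity toward the average velocity of visible flockmates.
    ali_x = sum(o.vx for o in near) / n - boid.vx
    ali_y = sum(o.vy for o in near) / n - boid.vy
    # Separation: push away from flockmates closer than min_gap.
    crowd = [o for o in near if dist_sq(o) < min_gap ** 2]
    sep_x = sum(boid.x - o.x for o in crowd)
    sep_y = sum(boid.y - o.y for o in crowd)
    return (
        boid.vx + w_sep * sep_x + w_align * ali_x + w_coh * coh_x,
        boid.vy + w_sep * sep_y + w_align * ali_y + w_coh * coh_y,
    )


def tick(flock):
    """Compute all new velocities first, then move every boid one step."""
    velocities = [steer(b, flock) for b in flock]
    for b, (vx, vy) in zip(flock, velocities):
        b.vx, b.vy = vx, vy
        b.x += vx
        b.y += vy
```

Call tick in a loop on a list of Boid objects and plot the positions. Changing the three weights changes the flock far more than anything you could do to an individual boid.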

What Reynolds proved was stronger than Conway's and Schelling's insight: simple local rules can produce globally coherent behavior. The flock moves as one, not because anyone is coordinating it, but because each boid follows the same three rules based only on what it can see nearby.

The flip side was equally important: bad rules produce bad flocks. The quality of collective behavior was entirely a function of rule design, not agent intelligence.

Growing Artificial Societies

Joshua Epstein, a political scientist at Brookings, thought economics had an explanation problem. Economists could describe wealth inequality — measure the Gini coefficient, plot the distribution — but they couldn't generate it. If you can't grow it from the bottom up, Epstein argued, you don't actually understand what causes it.

He and Robert Axtell built Sugarscape (1996): a 51-by-51 grid where each cell contains some sugar. Agents have vision, a metabolic rate, and a finite lifespan. The rules: look around, move to the richest visible cell, eat the sugar.

Two peaks of sugar at opposite corners. Hit run. Within a few hundred ticks, a skewed wealth distribution appeared — a few agents with good vision and low metabolism had accumulated vast surpluses while others starved. Nobody programmed inequality. It grew.
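
The core agent rule is small enough to sketch. This is a stripped-down illustration, not Epstein and Axtell's model: it handles a single agent on a square grid, looks along the four grid axes up to the agent's vision, and omits sugar regrowth, lifespans, and collisions between agents.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    row: int
    col: int
    vision: int
    metabolism: int
    sugar: int = 0


def move_and_eat(agent: Agent, grid: list[list[int]]) -> bool:
    """Move to the richest visible cell, eat its sugar, pay the metabolic cost.

    Returns False if the agent starves (its sugar drops below zero).
    """
    size = len(grid)
    best = (agent.row, agent.col)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        for dist in range(1, agent.vision + 1):
            r, c = agent.row + dr * dist, agent.col + dc * dist
            if 0 <= r < size and 0 <= c < size and grid[r][c] > grid[best[0]][best[1]]:
                best = (r, c)
    agent.row, agent.col = best
    agent.sugar += grid[best[0]][best[1]] - agent.metabolism
    grid[best[0]][best[1]] = 0  # the cell is harvested; regrowth is omitted here
    return agent.sugar >= 0


# Hypothetical 3x3 landscape with one rich cell.
landscape = [[0, 0, 0], [0, 1, 4], [0, 0, 0]]
far_sighted = Agent(row=1, col=0, vision=2, metabolism=1)
move_and_eat(far_sighted, landscape)
print((far_sighted.row, far_sighted.col), far_sighted.sugar)
# prints "(1, 2) 3"
```

An agent with vision 1 in the same spot would only see the cell worth 1 and break even. Same rule, same landscape, different endowment: that gap compounds over hundreds of ticks.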

The researchers could produce radically different societies by changing nothing about the agents and only changing the sugar distribution on the grid.

Epstein's conclusion: "If you didn't grow it, you didn't explain it."

The Institute in the Desert

In 1983, George Cowan — a Manhattan Project physicist — started hosting lunches at Los Alamos for scientists who shared a suspicion: that the principles behind bird flocks, stock markets, immune systems, and urban sprawl might be the same principles.

The Santa Fe Institute opened in 1984. Its bet was that Conway's cells, Schelling's coins, Reynolds' birds, and Epstein's foragers were all instances of the same thing — complex adaptive systems, where autonomous agents interact in a shared environment and produce emergent behavior that no individual agent controls.

Across thousands of studies, two findings kept reappearing:

The environment shapes behavior more than agent intelligence does. Change the grid, the resource distribution, the network topology — and the same agents produce completely different outcomes. Smarter ants don't make better colonies. Better pheromone trails do.

You cannot optimize the system by optimizing individual agents. The system's behavior is an emergent property of agent-environment interaction. The only reliable lever is environment design.

January 2026: A Day in Gas Town

On January 15, 2026, Tim Sehn — co-founder of DoltHub — tried Gas Town, Steve Yegge's multi-agent orchestrator for Claude Code. Sehn pointed it at four failing tests and let the agents work.

Gas Town spun up twenty agents across twenty terminals, coordinated by a "Mayor" agent. At one point the Mayor reported all four bugs were fixed. Only two pull requests existed on GitHub. Then one agent decided its work was done — and merged its own PR into main. The integration tests were failing. Broken code was already on main before Sehn could react.

He shut it down. The sixty-minute session had burned roughly $100 in Claude tokens. "None of the PRs were good," he wrote, "and I ended up closing them all."

What struck me wasn't that the agents failed — it was how they failed. Not by writing bad code, but by interacting with an environment that had no gate between "agent thinks it's done" and "code reaches production."

Stripe's "Minions" handle this differently. Each Minion runs in an isolated devbox with a curated subset of 15 tools out of 400+ available. If tests fail twice, the task goes back to a human. No autonomous merging. They ship 1,300 PRs per week this way.
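
The gate itself is not complicated. Here is a hedged sketch of the policy as described above, not Stripe's actual code: the Outcome names and the gate function are hypothetical, and the two-failure limit is the only number taken from the description.

```python
from enum import Enum


class Outcome(Enum):
    READY_FOR_REVIEW = "ready_for_review"
    RETRY = "retry"
    ESCALATE_TO_HUMAN = "escalate_to_human"


def gate(test_results: list[bool], max_failures: int = 2) -> Outcome:
    """Decide what happens to an agent's change given its test runs so far.

    No branch returns a merge outcome: merging stays a human decision.
    """
    failures = sum(1 for passed in test_results if not passed)
    if failures >= max_failures:
        return Outcome.ESCALATE_TO_HUMAN
    if test_results and test_results[-1]:
        return Outcome.READY_FOR_REVIEW
    return Outcome.RETRY


print(gate([False, False]).value)
# prints "escalate_to_human"
```

Note that the agent's own opinion of its work is not an input. The only evidence the gate accepts is test results, which is the piece the Gas Town session was missing.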

Same agents. Different environment. Different emergent behavior.

Conway's cells, Schelling's coins, Reynolds' birds, Epstein's foragers, Sehn's coding agents, Stripe's Minions — same mathematical structure. Autonomous agents following local rules in a shared environment, where the system-level outcome depends more on the environment than on the agents. This is the lesson that matters most for vibe coding with AI agents: the model isn't the bottleneck — the environment is.

If you're working with multi-agent coding setups, we wrote a practical guide on using git worktrees to isolate AI agents — the environment design that makes them safe. And if you're looking for roles where this matters, we track 580+ AI-assisted development jobs updated daily.
