<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Itay Maman</title>
    <description>The latest articles on DEV Community by Itay Maman (@itay-maman).</description>
    <link>https://dev.to/itay-maman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3735957%2Fb6012d34-30ba-4ec6-80e8-d7d4455a5092.jpg</url>
      <title>DEV Community: Itay Maman</title>
      <link>https://dev.to/itay-maman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/itay-maman"/>
    <language>en</language>
    <item>
      <title>20 lines of markdown replaced my code review bot</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Sun, 05 Apr 2026 15:35:40 +0000</pubDate>
      <link>https://dev.to/itay-maman/20-lines-of-markdown-replaced-my-code-review-bot-26h2</link>
      <guid>https://dev.to/itay-maman/20-lines-of-markdown-replaced-my-code-review-bot-26h2</guid>
      <description>&lt;p&gt;I stopped waiting for Greptile and co.&lt;/p&gt;

&lt;p&gt;Instead I run a skill I call &lt;code&gt;regression-dog&lt;/code&gt;. It's open source, simple (about 20 lines of markdown), and integrates right into my terminal/coding agent.&lt;/p&gt;

&lt;p&gt;Initially, I imagined I'd do a few quick passes locally and then still lean on the dedicated review bots for the real deep analysis. To my surprise, it's usually just as thorough as those bots, and sometimes better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;It reads the diff of your branch (or last commit, or last N commits - you choose the scope) and lists every behavioral change it can find. Not style nits, not "consider renaming this variable" — just behavioral deltas. This used to do X, now it does Y.&lt;/p&gt;
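
&lt;p&gt;The scope is part of the request itself; the wording is free-form. A few hypothetical invocations (phrasing is mine, not prescribed by the skill):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flag all regressions in this branch
flag regressions in the last commit
flag regressions in the last 4 commits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;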

&lt;p&gt;The output is split into two sections. First, numbered regressions with severity ratings — things like "the retry loop now exits after 3 attempts instead of 5" or "the error response no longer includes the request ID." Then a "Cleared" section listing every change it reviewed and found safe. That second section matters more than it sounds — I'll explain why below.&lt;/p&gt;

&lt;p&gt;A typical output looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycco56eb4oy8ouxgur0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycco56eb4oy8ouxgur0z.png" alt="Typical output of regression-dog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's unpack the magic
&lt;/h2&gt;

&lt;p&gt;The whole prompt is meticulously crafted to keep the LLM focused on a well-grounded task. Every line is there for a reason:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do NOT run tests, typechecks, linters, or build commands.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Protects the agent's context window so the underlying LLM can invest its reasoning where it matters.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Enumerate behavioral differences: this used to do X, now it does Y.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It poses a specific, concrete question: what are all the behavioral changes in this diff? This is a grounded task, unlike open-ended "review my code" prompts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not judge whether the old or new behavior is correct - just surface the delta.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picking apart code is an LLM's happy place. Judging intent isn't. This keeps it where it's strongest.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not flag pre-existing issues or suggest improvements&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shuts down a rabbit hole agents love going down.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Add a "Cleared" section listing items that were reviewed and found to have no issues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This sounds cosmetic but it's load-bearing. When the agent inspects a code change, it has two outlets: either the change is safe (goes to "Cleared") or it's a behavioral difference (goes to "Regressions"). This symmetry forces an explicit decision on every change instead of quietly skipping things it's unsure about. It improved recall noticeably when I added it.&lt;/p&gt;
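
&lt;p&gt;Stitching the quoted lines together gives the skeleton of the skill. This is a sketch reconstructed from the excerpts above, not the actual file (that one lives in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read the diff for the requested scope (branch, last commit, or last N commits).
Do NOT run tests, typechecks, linters, or build commands.
Enumerate behavioral differences: this used to do X, now it does Y.
Do not judge whether the old or new behavior is correct - just surface the delta.
Do not flag pre-existing issues or suggest improvements.
Report numbered regressions with severity ratings, then add a "Cleared" section
listing items that were reviewed and found to have no issues.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;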

&lt;h2&gt;
  
  
  How I actually use it
&lt;/h2&gt;

&lt;p&gt;I work on something. When I think it's ready, I open a fresh Claude Code session and type something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flag all regressions in this branch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A fresh session matters — it keeps the review unbiased by the conversation that produced the code.&lt;/p&gt;

&lt;p&gt;It comes back fast - often under a minute, depending on scope. Even with bigger changes it's significantly faster than the push-to-GitHub-wait-for-bot loop. And it's right there in my terminal, no tab-switching.&lt;/p&gt;

&lt;p&gt;Then I fix things in that same session. It's already loaded with high quality context about the branch, so fixes are fast. Some "regressions" are intentional, so I just move on. Then I open yet another clean session and run it again. Repeat until it comes back clean, or until every flagged item is something I'm deliberately changing.&lt;/p&gt;

&lt;p&gt;This works just as well on feature branches as on pure refactors. Turns out "enumerate what changed" is a useful lens even when you expect changes - it catches the unintended changes hiding next to the intentional ones. You meant to change the retry logic, but you also accidentally changed the error message format - that kind of thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install it
&lt;/h2&gt;

&lt;p&gt;It's on &lt;a href="https://github.com/imaman/skills" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. To install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add https://github.com/imaman/skills &lt;span class="nt"&gt;--skill&lt;/span&gt; regression-dog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installing, open a Claude Code session in your repo and ask it to review your branch for regressions. It'll pick up the skill automatically.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Why I Wouldn't Act on SkillsBench</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Wed, 25 Feb 2026 10:37:02 +0000</pubDate>
      <link>https://dev.to/itay-maman/why-i-wouldnt-act-on-skillsbench-1h61</link>
      <guid>https://dev.to/itay-maman/why-i-wouldnt-act-on-skillsbench-1h61</guid>
      <description>&lt;p&gt;I came across &lt;a href="https://www.skillsbench.ai/" rel="noopener noreferrer"&gt;SkillsBench&lt;/a&gt; (&lt;a href="https://arxiv.org/pdf/2602.12670" rel="noopener noreferrer"&gt;paper&lt;/a&gt;, Feb 2026) while watching &lt;a href="https://www.youtube.com/watch?v=GcNu6wrLTJc" rel="noopener noreferrer"&gt;Theo&lt;/a&gt;, and was genuinely excited. It asks two critical questions: do curated procedural documents ("Skills") actually help coding agents, and which coding agent utilizes them best? The headline number, +16.2pp from curated Skills, felt immediately actionable.&lt;/p&gt;

&lt;p&gt;Then I started pulling at the methodology, and things unraveled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;SkillsBench is ambitious in scope: 84 tasks, 11 domains, 7 coding agents, 7,308 trajectories. It evaluates tasks under three conditions: no Skills, curated (expert-written) Skills, and self-generated Skills. Each task ships with a fixed Skill package (markdown instructions, sometimes with scripts or templates) provided to the agent alongside the task.&lt;/p&gt;




&lt;h2&gt;
  
  
  The leaderboard
&lt;/h2&gt;

&lt;p&gt;In every benchmark the central outcome is the leaderboard. Here, that's Finding 2 (§4.1.1), which crowns Gemini CLI + Flash as best raw performer (48.7%) and Claude Code + Opus 4.5 as largest uplift (+23.3pp). This is a legitimate result — though Flash beating Opus 4.5/4.6 is a bit surprising.&lt;/p&gt;

&lt;p&gt;The more interesting question is what the leaderboard actually measures. To answer that, let's consider the actual mechanism behind Skills: they are prompt pieces loaded into the context on demand. So the pass rate shown in the leaderboard doesn't tell us which coding agent &lt;em&gt;uses Skills best&lt;/em&gt;. It tells us which agent performed the best — but it doesn't tell us whether the Skills &lt;em&gt;mechanism&lt;/em&gt; made any difference, or whether the same result would have been achieved by placing the content directly in the prompt.&lt;/p&gt;

&lt;p&gt;The experiment that would settle this: inject the same Skill content directly into the prompt (baseline) vs. let the harness load Skills through its native discovery mechanism. That experiment isn't in the paper — and it's the one that would justify a benchmark titled "SkillsBench."&lt;/p&gt;

&lt;p&gt;The leaderboard aside, the paper makes several claims about &lt;em&gt;how&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt; Skills help.&lt;/p&gt;




&lt;h2&gt;
  
  
  Here things get complicated
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skill design findings are confounded by task identity
&lt;/h3&gt;

&lt;p&gt;Two of the paper's design-oriented findings sound practical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;2–3 Skills are optimal&lt;/em&gt; (+18.6pp); 4+ Skills show diminishing returns (+5.9pp). (Finding 5, §4.2.1)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Moderate-length Skills outperform comprehensive ones&lt;/em&gt; — detailed (+18.8pp) and compact (+17.1pp) beat comprehensive (–2.9pp). (Finding 6, §4.2.2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: in the experiment design, each task ships with a fixed Skill package, so Skill count and Skill complexity are properties of the task. The experiment therefore cannot isolate the effect of "number of Skills" from the effect of "which task this is." A task that happens to need 4+ Skills is a different task than one that needs 1. The paper stratifies post-hoc by Skill count and applies causal language ("optimal," "diminishing returns"), but the design doesn't support that inference.&lt;/p&gt;

&lt;p&gt;The same applies to complexity. The N=140 "comprehensive" bucket that shows –2.9pp could simply contain harder tasks. Without controlling for task difficulty — or better, varying Skill count/complexity &lt;em&gt;within&lt;/em&gt; a task — these are correlational observations dressed as design guidelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  The domain-level claims rest on tiny sample sizes
&lt;/h3&gt;

&lt;p&gt;The paper's most striking result is the domain breakdown (Table 4): Healthcare leads at +51.9pp, Manufacturing at +41.9pp. These numbers anchor the paper's claim that domains with knowledge "underrepresented in model pretraining" benefit most from Skills (Finding 4, §4.1.3).&lt;/p&gt;

&lt;p&gt;But Healthcare has &lt;strong&gt;2 tasks&lt;/strong&gt; and Manufacturing has &lt;strong&gt;3&lt;/strong&gt;. A single outlier task — and several individual tasks swing by 70–85pp — can dominate an entire domain's aggregate. With N=2, you're not measuring a domain effect; you're measuring two tasks. The paper reports these figures without confidence intervals at the domain level and without flagging the sample size issue.&lt;/p&gt;

&lt;p&gt;For comparison, Software Engineering (N=16) shows +4.5pp — a much more defensible estimate, but also a much less exciting one.&lt;/p&gt;

&lt;h3&gt;
  
  
  The other findings restate what we already know about prompting
&lt;/h3&gt;

&lt;p&gt;We noted that Skills are lazily loaded prompt pieces. With that in mind, try the thought experiment of replacing "Skills" with "prompt" in the remaining findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding 1 (§4.1.1): curated Skills improve performance → curated, expert-written &lt;strong&gt;prompts&lt;/strong&gt; improve performance.&lt;/li&gt;
&lt;li&gt;Finding 7 (§4.2.3): smaller model + Skills can exceed larger model without Skills → a smaller model with a good &lt;strong&gt;prompt&lt;/strong&gt; can outperform a larger model with a mediocre &lt;strong&gt;prompt&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither of these is surprising. The prompting literature has established both points.&lt;/p&gt;

&lt;p&gt;Finding 3 (§4.1.1) — self-generated Skills provide no benefit — is slightly more interesting. Meta-prompting (using a model to generate its own prompts) is a real technique that works in some settings, so this finding could have been novel.&lt;/p&gt;

&lt;p&gt;But the likely dynamic here is more mundane: for tasks where the model lacks domain knowledge, it can't write effective Skills &lt;em&gt;because it lacks the knowledge&lt;/em&gt;. For tasks where the model already has the domain knowledge, the marginal contribution of a Skill is minimal. Either way, performance doesn't improve when the model writes its own Skills. Do the same substitution exercise again and you get "performance doesn't improve when the model provides its own context" — which is not surprising.&lt;/p&gt;




&lt;h2&gt;
  
  
  What would make this credible
&lt;/h2&gt;

&lt;p&gt;The paper asks the right questions but doesn't yet have the experiments to answer them. Some of these came up above; here they are in one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolate the mechanism.&lt;/strong&gt; A benchmark called "SkillsBench" should measure whether the &lt;em&gt;Skills machinery&lt;/em&gt; matters — not just whether the Skills &lt;em&gt;content&lt;/em&gt; helps. The cleanest test: take the same Skill content and inject it directly into the prompt (baseline) vs. let the harness load it through its native discovery mechanism. If native loading wins, the Skills architecture is doing real work. If the results are equivalent, Skills are just a packaging format for prompt content — useful, but not what the paper claims to measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolate the content.&lt;/strong&gt; A harder but complementary experiment would inject the same token count of topically relevant non-procedural text (API docs, reference material) to test whether procedural &lt;em&gt;structure&lt;/em&gt; specifically drives the gains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vary Skills within tasks, not across them.&lt;/strong&gt; The Skill design findings (count, complexity) currently can't be separated from task identity. Run the same task multiple times, each time with a different number of Skills, and measure the delta within each task. Same goes for complexity — give the agent a compact Skill vs. an exhaustive one for the same task, and see what happens. This turns correlational observations into actual design guidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test with a fixed Skill library.&lt;/strong&gt; In the current setup each task gets its own hand-picked Skill package — the agent always has exactly the right Skills for the job. In practice, you write a set of Skills once and they sit there for every task. The interesting experiment is: give the agent a fixed library of, say, 20–30 Skills across all tasks and see if it can discover and apply the right ones. That tests Skill &lt;em&gt;selection&lt;/em&gt;, not just Skill &lt;em&gt;consumption&lt;/em&gt; — which is the harder and more realistic problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;My recommendation: don't act on this paper in its current form. If you're investing in Skills for your agents today, calibrate that investment based on your own trial and error, not on this study's findings.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmarks</category>
      <category>codingagents</category>
    </item>
    <item>
      <title>How a 1982 Atari BASIC Program Captures 2026 Agentic Coding</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Wed, 11 Feb 2026 07:27:34 +0000</pubDate>
      <link>https://dev.to/itay-maman/how-a-1982-atari-basic-program-captures-2025-agentic-coding-3140</link>
      <guid>https://dev.to/itay-maman/how-a-1982-atari-basic-program-captures-2025-agentic-coding-3140</guid>
      <description>&lt;p&gt;Back in the early '80s, there was this little Atari BASIC &lt;a href="https://gist.github.com/imaman/8fcde0272ed13c7f0fc78a1092bc21fe" rel="noopener noreferrer"&gt;program&lt;/a&gt; used to teach graphics, random numbers, and loops. On one side, a "hare" jumps around randomly, drawing lines in arbitrary directions. On the other, a "tortoise" methodically works through the space column by column, top to bottom.&lt;/p&gt;
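
&lt;p&gt;The two strategies are a couple of loops each. A sketch in JavaScript (grid size and names are mine, not the original program's; it tracks coverage instead of drawing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const W = 40, H = 30;

// Tortoise: visit cells column by column, top to bottom. Deterministic,
// so after W * H steps every cell has been visited exactly once.
function tortoisePath() {
  const cells = [];
  for (let x = 0; x !== W; x += 1) {
    for (let y = 0; y !== H; y += 1) {
      cells.push(x + "," + y);
    }
  }
  return cells;
}

// Hare: jump to a random cell, marking every cell on the straight line
// from the current position (a coarse line walk stands in for the
// original's line drawing).
function hareJump(visited, from) {
  const to = { x: Math.floor(Math.random() * W), y: Math.floor(Math.random() * H) };
  const steps = Math.max(Math.abs(to.x - from.x), Math.abs(to.y - from.y), 1);
  for (let i = 0; i !== steps + 1; i += 1) {
    const x = Math.round(from.x + ((to.x - from.x) * i) / steps);
    const y = Math.round(from.y + ((to.y - from.y) * i) / steps);
    visited.add(x + "," + y);
  }
  return to;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tortoise's coverage is exactly its step count; the hare keeps re-marking cells it has already visited, which is why its progress stalls near the end.&lt;/p&gt;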

&lt;p&gt;The hare's strides give it a huge early lead—you think it's going to win. Yet the tortoise's systematic approach always prevails. For some unknown nerdy reason, I recreated it recently, in &lt;a href="https://gist.github.com/imaman/cb29075f8f511f4a85b1e5070ddc7c20" rel="noopener noreferrer"&gt;HTML/JavaScript&lt;/a&gt;, with a few tweaks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqaecw4edhntrdeml3et.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqaecw4edhntrdeml3et.gif" alt="Race where hare leads early with chaotic patterns but tortoise's methodical column-by-column filling wins (100% vs 97%)." width="640" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watching it run felt uncomfortably familiar. The parallel to AI coding agents writes itself: in hare mode I let the agent take the wheel, whereas in tortoise mode I am pair-programming with it - small tasks, reviewing, staying hands-on.&lt;/p&gt;

&lt;p&gt;The hare gets to 97% astonishingly fast. It does everything in quick wide strokes: boilerplate, features, and integrations. Work that would take a skilled developer days, done in minutes. But it leaves gaps that are excruciatingly hard to close.&lt;/p&gt;

&lt;p&gt;The tortoise arrives at 97% much later. But when it gets there, it keeps walking right through to 100%. No invisible wall, just a steady, relentless pace.&lt;/p&gt;

&lt;p&gt;I've been both animals these past few months. The hare is more fun. But more than once, struggling to close that last 3%, I've thought: would I be better off starting from scratch in tortoise mode?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The original "hare and tortoise" program appeared in &lt;a href="https://en.wikipedia.org/wiki/Atari_Games_%26_Recreations" rel="noopener noreferrer"&gt;ATARI® Games and Recreations&lt;/a&gt; by &lt;a href="https://en.wikipedia.org/wiki/Herbert_R._Kohl" rel="noopener noreferrer"&gt;Herb Kohl&lt;/a&gt;, Ted Kahn, Len Lindsay, and Pat Cleland. h/t &lt;a href="https://twitter.com/yonatanm" rel="noopener noreferrer"&gt;@yonatanm&lt;/a&gt; for digging it up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>coding</category>
      <category>atari</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Tortoise and The Hare, Revisited</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Tue, 10 Feb 2026 20:59:34 +0000</pubDate>
      <link>https://dev.to/itay-maman/the-tortoise-and-the-hare-revisited-4jc</link>
      <guid>https://dev.to/itay-maman/the-tortoise-and-the-hare-revisited-4jc</guid>
      <description>&lt;p&gt;This post now lives under a different dev.to article: &lt;a href="https://dev.to/itay-maman/how-a-1982-atari-basic-program-captures-2025-agentic-coding-3140"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>coding</category>
      <category>atari</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>No Habits to Break</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Tue, 10 Feb 2026 11:29:14 +0000</pubDate>
      <link>https://dev.to/itay-maman/no-habits-to-break-2g80</link>
      <guid>https://dev.to/itay-maman/no-habits-to-break-2g80</guid>
      <description>&lt;p&gt;&lt;strong&gt;I switched from Claude Code to Codex this week&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Codex 5.3 dropped the same day as Opus 4.6, and early results &lt;a href="https://www.youtube.com/watch?v=RYWrK2hsIB8" rel="noopener noreferrer"&gt;favored Codex&lt;/a&gt; — so after six-plus months as a die-hard Claude Code user, I switched. Took me minutes. Literally, minutes. I'll probably switch back the moment Opus pulls ahead again.&lt;/p&gt;

&lt;p&gt;Switching IDEs used to be nothing like this. Moving from IntelliJ to VS Code was a genuine project: weeks of hunting for equivalent plugins, remapping keybindings etched into muscle memory, recreating snippets and templates. Real lock-in, built from years of accumulated customization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why the difference?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's the text interface. It's a far less rigid medium than whatever surface area we used to rely on — dropdown menus, key combinations, or configuration files. With a text interface, the prompts I wrote for Claude Code work just as well in Codex. There's almost nothing to learn.&lt;/p&gt;

&lt;p&gt;In tech circles, the question of whether apps built on LLMs have moats gets a lot of attention. But as this anecdote shows, the AI labs themselves might have even less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If you're behind, catching up is purely a technical problem.&lt;/strong&gt; Google with Gemini doesn't need to build a better model &lt;em&gt;and&lt;/em&gt; break user habits. They "just" need to build a better model. That's hard, but it's one hard thing — and arguably easier than changing user behavior. The flip side: if you're ahead, you're only ahead until the next benchmark. Today's lead means nothing when switching takes minutes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Benchmarks matter and will keep mattering.&lt;/strong&gt; Low switching friction keeps competition healthy and value-based — which is exactly what we're seeing. That's why there's so much discussion about LLM benchmarks. Nobody obsessed over IDE benchmarks. There was no point — you weren't going to switch anyway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The no-moat effect extends beyond coding agents.&lt;/strong&gt; It applies to most products where the primary interface is chat. Anywhere the interaction is primarily text, switching friction drops. Though once a product accumulates your data, or you've built workflows around it, switching costs return.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For decades, "incumbent advantage" meant something. Users accumulated habits, configurations, workflows — and that accumulation was a moat. We're used to thinking this way. But in the era of LLM-based products, the only thing that matters is whether your product is better &lt;em&gt;right now&lt;/em&gt;. Yesterday's gone.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>product</category>
      <category>coding</category>
    </item>
    <item>
      <title>Codex 5.3 vs. Opus 4.6: who wins on a real coding task?</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Sun, 08 Feb 2026 10:54:26 +0000</pubDate>
      <link>https://dev.to/itay-maman/codex-53-vs-opus-46-who-wins-on-a-real-coding-task-476o</link>
      <guid>https://dev.to/itay-maman/codex-53-vs-opus-46-who-wins-on-a-real-coding-task-476o</guid>
      <description>&lt;p&gt;OpenAI and Anthropic dropped their latest coding models practically at the same time: &lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="noopener noreferrer"&gt;Codex 5.3&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-6" rel="noopener noreferrer"&gt;Opus 4.6&lt;/a&gt;. So I did the obvious thing: made them fight.&lt;/p&gt;

&lt;p&gt;This is how it went down: I pulled a few key sections from a real npm package's README (~1,500 chars) and used them as a spec. Each agent got the same prompt: implement this spec as a complete, publishable TypeScript repo. The spec describes &lt;a href="https://github.com/moojo-tech/monocrate" rel="noopener noreferrer"&gt;monocrate&lt;/a&gt;, a monorepo publishing CLI we recently open-sourced.&lt;/p&gt;

&lt;p&gt;I then fed the implementations produced by each agent — along with the existing monocrate codebase as a baseline — into a judging process. Seven LLMs judged every pairwise matchup, each evaluated twice with order swapped to reduce bias. The question was deliberately simple: "which repo is a better starting point?" — not "does it work?" A win means a judge thought the code was a stronger foundation. This keeps comparisons clean. &lt;a href="https://gist.github.com/imaman/da4957f151208806953133f2f08440c2" rel="noopener noreferrer"&gt;Full methodology here&lt;/a&gt;.&lt;/p&gt;
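
&lt;p&gt;For concreteness, the scoring rule can be sketched in a few lines of JavaScript (hypothetical code, not the actual judging harness):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Each judgment records the two repos compared and the judge's pick.
// An agent's win rate is its wins over all judgments it appears in.
function winRate(judgments, agent) {
  const involved = judgments.filter(function (j) {
    return j.a === agent || j.b === agent;
  });
  const wins = involved.filter(function (j) {
    return j.winner === agent;
  });
  return wins.length / involved.length;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With five repos in play, each one appears in 4 matchups of 14 judgments apiece (7 judges, 2 orders), i.e. 56 judgments; 35 wins out of 56 is how Codex 5.3 gets its 63%.&lt;/p&gt;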

&lt;p&gt;Although simple, this setting - the task and its evaluation scheme - is a reasonable yardstick for the overall coding capabilities of the participating models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leaderboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Win %&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Baseline (human + Opus 4.5, iterative)&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Codex 5.3&lt;/td&gt;
&lt;td&gt;63%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Codex 5.2&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Opus 4.5&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Codex 5.3 takes it.&lt;/strong&gt; It was declared winner in 35 out of 56 judgments — more than any other competitor besides the baseline. In the direct matchup against Opus 4.6, it won 10-4. And it wasn't just Opus 4.6 — Codex 5.3 beat &lt;em&gt;every&lt;/em&gt; competitor head-to-head. A clear winner. Here's the full head-to-head breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Matchup&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Codex 5.3 vs. Opus 4.6&lt;/td&gt;
&lt;td&gt;Codex 5.3 wins 10-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex 5.3 vs. Codex 5.2&lt;/td&gt;
&lt;td&gt;Codex 5.3 wins 9-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex 5.3 vs. Opus 4.5&lt;/td&gt;
&lt;td&gt;Codex 5.3 wins 12-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6 vs. Codex 5.2&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Tie 7-7&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6 vs. Opus 4.5&lt;/td&gt;
&lt;td&gt;Opus 4.6 wins 9-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex 5.2 vs. Opus 4.5&lt;/td&gt;
&lt;td&gt;Codex 5.2 wins 8-6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.6 and Codex 5.2 are practically tied.&lt;/strong&gt; Opus 4.6's overall win rate is 2 percentage points higher (43% vs. 41%), but in their direct matchup they split 7-7. Anthropic's latest model landed exactly even with OpenAI's &lt;em&gt;previous&lt;/em&gt; generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Within each vendor, the generational jump is clear.&lt;/strong&gt; Codex 5.3 beat 5.2 (9-5), Opus 4.6 beat 4.5 (9-5).&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;So we have a winner. And it's Codex 5.3. Opus 4.6 trails well behind.&lt;/p&gt;

&lt;p&gt;And while this looks like a one-shot benchmark, the judging scheme — "which repo is a better starting point?" — means the results apply more broadly. We're measuring the quality of the headstart you get, which matters whether you're shipping on the first try or settling in for a many-shot session.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/moojo-tech/monocrate" rel="noopener noreferrer"&gt;monocrate&lt;/a&gt; is MIT licensed. Judged by GPT-5 Mini, Claude Sonnet 4.5, DeepSeek v3.2, Gemini 2.5 Flash, Devstral 2512, Sonar Pro, and Qwen3 Coder 30B.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>coding</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>The Elephant in the Room: Systems Thinking Meets Coding Agents</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Fri, 06 Feb 2026 09:43:58 +0000</pubDate>
      <link>https://dev.to/itay-maman/the-elephant-in-the-room-systems-thinking-meets-coding-agents-1ko3</link>
      <guid>https://dev.to/itay-maman/the-elephant-in-the-room-systems-thinking-meets-coding-agents-1ko3</guid>
      <description>&lt;p&gt;I just read Paul Homer's latest post, &lt;a href="https://theprogrammersparadox.blogspot.com/2026/02/systems-thinking.html" rel="noopener noreferrer"&gt;Systems Thinking&lt;/a&gt;, on The Programmer's Paradox. It's a well written, thought-provoking piece about the two schools of thought in building complex software: laying out a full specification that accounts for all the dependencies upfront, or evolving the system incrementally over time. Evolution versus Engineering, as he puts it.&lt;/p&gt;

&lt;p&gt;I found myself nodding along, especially at his observation about the company with 3000+ active systems that had evolved over fifty years into a shaky house of cards. But as I kept reading, it hit me: this is &lt;em&gt;exactly&lt;/em&gt; the tension we're navigating right now with coding agents.&lt;/p&gt;

&lt;p&gt;The two approaches the post describes map almost perfectly onto the two modes of working with AI coding tools. You can write a comprehensive spec and hand it to the agent all at once, or you can work in small, focused chunks.&lt;/p&gt;

&lt;p&gt;The irony is that the comprehensive-spec approach is hard for exactly the same reasons the post lays out: real systems have deep, tangled dependencies, and no one — no matter how experienced — can fully work through all of them in advance. If we were good at writing complete specs, we wouldn't have 3000 systems in the first place.&lt;/p&gt;

&lt;p&gt;But the incremental approach has its own failure mode with agents. Each chunk gets better individual focus, and you can course-correct as you go — but the agent loses the holistic view. You end up with exactly the kind of inconsistency described in evolved systems, just at a faster pace — and this happens whether you're working in one long session or starting fresh each time.&lt;/p&gt;

&lt;p&gt;The "balanced path in the middle" Homer says he hasn't found in decades of practice might actually be the key challenge in AI-assisted development right now: how do you give an agent enough context about the whole to keep each small step coherent, without needing to solve the impossible problem of specifying everything in advance?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>coding</category>
      <category>software</category>
    </item>
    <item>
      <title>Just When I Thought I Was Out, The Code Pulls Me Back In</title>
      <dc:creator>Itay Maman</dc:creator>
      <pubDate>Thu, 05 Feb 2026 15:38:51 +0000</pubDate>
      <link>https://dev.to/itay-maman/just-when-i-thought-i-was-out-the-code-pulls-me-back-in-101l</link>
      <guid>https://dev.to/itay-maman/just-when-i-thought-i-was-out-the-code-pulls-me-back-in-101l</guid>
      <description>&lt;p&gt;Code is unforgiving. It either works or it doesn't. No hand-waving, no "you know what I mean," no close enough. This is what makes programming frustrating—and what makes programmers valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap between "works" and "works correctly"
&lt;/h2&gt;

&lt;p&gt;Storing passwords in plaintext works. Validating user input only on the frontend works. Hardcoding API keys in your public endpoint works. Ship any of these and your app will function—until it doesn't, spectacularly.&lt;/p&gt;

&lt;p&gt;Out of all the ways something appears to work, only a narrow subset actually works correctly. Finding your way to that subset is the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglr6t8z5se8zrsjyd6lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglr6t8z5se8zrsjyd6lf.png" alt="Corn maze: tiny " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A manager can say "make it work." A PM can wave their hands at a spec. But the developer can't—the programming language won't let them. They have to fight until the code actually works—not just appears to.&lt;/p&gt;

&lt;p&gt;For decades, this asymmetry was just how software got built. Then came agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents: finally, relief
&lt;/h2&gt;

&lt;p&gt;AI coding agents flip this dynamic. Suddenly, &lt;em&gt;you&lt;/em&gt; can hand-wave. You can say "make it work" and often... it does. Write me a function that does X. Hook up this API. Fix this bug.&lt;/p&gt;

&lt;p&gt;This is a real relief. No more battling a machine that only understands formal syntax. You can finally speak in human terms. The friction is gone.&lt;/p&gt;

&lt;p&gt;But when the agent produces code that runs, you can't tell from the output whether it's correct or just appears to work. That distinction lives in the code itself—how the API key is stored, whether validation happens server-side, how errors are handled.&lt;/p&gt;

&lt;p&gt;I ran into this recently. I asked an agent to migrate our hand-rolled API call logic to react-query. We had a mix—some places already used react-query, others had hand-crafted code that was essentially a poor imitation of it. The agent did a great job, except in one place where it chose &lt;a href="https://tanstack.com/query/v5/docs/framework/react/guides/mutations" rel="noopener noreferrer"&gt;useMutation&lt;/a&gt; instead of &lt;a href="https://tanstack.com/query/v5/docs/framework/react/guides/queries" rel="noopener noreferrer"&gt;useQuery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In its defense: it was a borderline case. There was a minor side effect on the server. But the main point of the call was fetching a token for later use, and it happened on mount—not in response to user action, which is &lt;code&gt;useMutation&lt;/code&gt;'s typical pattern.&lt;/p&gt;

&lt;p&gt;Weird things started happening in dev. React's StrictMode double-mounts components, and the mutation fired twice—but the backend refused to generate a second token, so we'd get errors. A dev-only problem, sure—but it seriously slowed us down while we hunted for the cause.&lt;/p&gt;

&lt;p&gt;We asked agents to debug it. They spotted the StrictMode double-mount but kept trying to make &lt;code&gt;useMutation&lt;/code&gt; work—patching the symptom, not questioning the choice. Only by reading the code did I see the real fix: this should have been &lt;code&gt;useQuery&lt;/code&gt; all along. Once I switched it, the problems vanished.&lt;/p&gt;
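&lt;p&gt;The fix makes sense once you model the two hooks' semantics. Here's a minimal sketch in plain JavaScript — &lt;em&gt;not&lt;/em&gt; react-query itself, and every name in it (&lt;code&gt;fetchToken&lt;/code&gt;, &lt;code&gt;runQuery&lt;/code&gt;, the &lt;code&gt;"auth-token"&lt;/code&gt; key) is illustrative — of why a query survives a double-mount while a mutation fires twice:&lt;/p&gt;

```javascript
// A sketch, not react-query: modeling query vs. mutation semantics
// under a StrictMode double-mount. All names are illustrative.

const queryCache = new Map(); // query key -> shared in-flight promise
let tokenRequests = 0;        // calls that actually reach the "backend"

function fetchToken() {
  tokenRequests += 1;
  if (tokenRequests > 1) {
    // The backend in the story refused to generate a second token.
    return Promise.reject(new Error("token already issued"));
  }
  return Promise.resolve("token-abc");
}

// Query semantics: calls that share a key share a single request.
function runQuery(key, fetcher) {
  if (!queryCache.has(key)) queryCache.set(key, fetcher());
  return queryCache.get(key);
}

// Mutation semantics: every call fires the request again.
function runMutation(fetcher) {
  return fetcher();
}

// StrictMode mounts the component twice, so the effect runs twice.
async function doubleMount(run) {
  return [await run(), await run()];
}
```

&lt;p&gt;With query semantics, the second mount joins the first request and both mounts see the same token. With mutation semantics, the second mount triggers a second backend call, which gets rejected — the same shape as the dev-only errors above.&lt;/p&gt;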

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh34yxjtk5yf4wpk2u4kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh34yxjtk5yf4wpk2u4kr.png" alt="Code diff showing useMutation replaced with useQuery—the one fix that solved the problem" width="706" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent had done exactly what I asked across dozens of call sites. It got one wrong—in a way that was hard to see and hard to debug, because the choice wasn't crazy. It was just incorrect. Catching it required stepping into the code and questioning the structures in it, the kind of critical thinking that, for now, remains human.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pulled back in
&lt;/h2&gt;

&lt;p&gt;So we're not free from the code. Disengaging from it lands us right in "appears to work" territory.&lt;/p&gt;

&lt;p&gt;And now it's arguably harder: reading code is often harder than writing it. Especially code you didn't write, shaped by patterns you didn't choose, solving the problem in ways you didn't anticipate.&lt;/p&gt;

&lt;p&gt;Agents are often hailed as a way out of the messy coding layer. But &lt;a href="https://www.youtube.com/watch?v=UPw-3e_pzqU" rel="noopener noreferrer"&gt;just when you think you're out&lt;/a&gt;, the code pulls you back in. Turns out, they merely shift your work from writing to reviewing.&lt;/p&gt;

&lt;p&gt;And reviewing at the pace agents write takes practice. There's a craft to it—one worth developing.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>webdev</category>
      <category>programming</category>
      <category>security</category>
    </item>
  </channel>
</rss>
