<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Peter Tamas</title>
    <description>The latest articles on DEV Community by Peter Tamas (@kondvik).</description>
    <link>https://dev.to/kondvik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865877%2F6bf2c760-4e66-4d57-99bf-a118885f93d5.jpeg</url>
      <title>DEV Community: Peter Tamas</title>
      <link>https://dev.to/kondvik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kondvik"/>
    <language>en</language>
    <item>
      <title>Yesterday's "not worth it" is today's quick win</title>
      <dc:creator>Peter Tamas</dc:creator>
      <pubDate>Tue, 28 Apr 2026 09:41:14 +0000</pubDate>
      <link>https://dev.to/kondvik/yesterdays-not-worth-it-is-todays-quick-win-bm0</link>
      <guid>https://dev.to/kondvik/yesterdays-not-worth-it-is-todays-quick-win-bm0</guid>
      <description>&lt;p&gt;Every team has a list of these. The repetitive jobs nobody wants, that everyone agrees should be automated, but never quite make it onto the roadmap. The ones where the right thing to do is build a proper UI, a proper pipeline, a proper anything. But the right thing costs three weeks, and the wrong thing (a human doing it 40 times a month) costs less. So we hire a student. We delegate it to an intern. We rotate it around the team. We grumble.&lt;/p&gt;

&lt;p&gt;I don't think we noticed the moment the math changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The configuration drawer
&lt;/h2&gt;

&lt;p&gt;We have one of these on a platform we work on at Bobcats Coding. The team needs to produce a configuration for the application based on client input, every time a campaign goes live. The proper solution would be a configuration UI on the admin site, where the client could create the configs themselves through a structured form. We have wanted to build it for a long time. But you know how it goes: the client wants the configs immediately, and other features always end up higher on the list than automating the day-to-day workflow.&lt;/p&gt;

&lt;p&gt;So creating each configuration takes about 15 minutes. Read the client input, figure out the not-quite-trivial mappings to values in our DB, deal with the exceptions (every config has at least one), produce the JSON seed. Fast for an experienced dev. Painful when there are 30 of them in a queue. Familiar?&lt;/p&gt;

&lt;p&gt;We hired a student to do it. Cheapest solution. And for years, that math made sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math changed and we almost missed it
&lt;/h2&gt;

&lt;p&gt;Here is what is different now. The cost of "automating" this kind of task collapsed. Not by 10%, not by half. It went from "three weeks of dev time we don't have" to "one focused afternoon with Claude in plan mode."&lt;/p&gt;

&lt;p&gt;That is a different threshold. And once the threshold moves, the list of things worth automating expands. Suddenly the boring 15-minute task is on the table. Suddenly the messy one-off, the script you would never have committed to, the thing you used to delegate to the cheapest pair of hands you could find. All of it is a candidate.&lt;/p&gt;

&lt;p&gt;This is the part I keep coming back to. The mindshift is not "AI writes code now." Plenty of people have written that already. The mindshift is: the set of problems where automation is worth it just got dramatically bigger.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually looks like
&lt;/h2&gt;

&lt;p&gt;For our configuration task, we did this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wrote down the onboarding material. The same thing we would write for a new student joining the team.&lt;/li&gt;
&lt;li&gt;Opened Claude Code in plan mode. Threw in the onboarding doc, a few previous client inputs paired with the hand-made outputs, and the bits of code and documentation that the task touches.&lt;/li&gt;
&lt;li&gt;Asked Claude to turn all of it into a slash command. Asked it to ask back if anything was unclear. (It always has questions. Usually good ones.)&lt;/li&gt;
&lt;li&gt;Specified the input arguments for the command, including a final freeform argument for "any custom instructions for this specific config."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What came out the other side is a &lt;code&gt;/create-configuration&lt;/code&gt; command we can call like a function:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/create-configuration &amp;lt;client input spreadsheet url&amp;gt; &amp;lt;optional additional instructions&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth noting. First, Claude wrote a better description of the task than we ever had on paper. The command file is, in effect, a human-readable functional spec for a process that lived in our heads for years. Second, because the output is a JSON object, the command also writes a test that checks the output values against the input spreadsheet. So we got testable automation, not a nice-looking script that fails silently.&lt;/p&gt;
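
&lt;p&gt;For context, a slash command in Claude Code is just a Markdown file under &lt;code&gt;.claude/commands/&lt;/code&gt;, with the process description as its body. Ours is project-specific, but the shape is roughly this (an illustrative sketch, not our actual command file):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
description: Produce a campaign configuration JSON from client input
argument-hint: &amp;lt;client input spreadsheet url&amp;gt; [additional instructions]
---

Read the client input from $1.
Map its fields to values in our DB following the onboarding notes below.
Apply any custom instructions passed in $2.
Produce the JSON seed, then write a test that checks the output values
against the input spreadsheet.

[onboarding material and example input/output pairs follow]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;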

&lt;p&gt;Time per config: from 15 minutes to about 2. Across the volume we run, the command now covers roughly 70% of the work the student was doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bit I did not expect
&lt;/h2&gt;

&lt;p&gt;I asked the student to automate the remaining 30%. He did. We have another command now, for another use case, written by the person who actually understood the corners and the exceptions.&lt;/p&gt;

&lt;p&gt;He still works with us. He runs the commands every day. But the work he does on top of them is the interesting part now: the cases the commands cannot handle yet, the edges, the new clients, the things that need judgment. He likes his job more. And we get more value out of his time.&lt;/p&gt;

&lt;p&gt;This is the part that does not show up in the AI-replaces-jobs takes. When the threshold moves, the people who used to do the repetitive 70% do not disappear. They move up the stack. The work they do becomes the work that actually needs a human. Which, it turns out, is more interesting work.&lt;/p&gt;

&lt;h2&gt;
  
  
  A mental model
&lt;/h2&gt;

&lt;p&gt;I have started running a quick check on tasks I used to skip past. Roughly:&lt;/p&gt;

&lt;p&gt;If I can onboard a new colleague to a task that they could solve in front of a computer, I can probably onboard Claude to it. Without the forgetting, the typos, the inconsistent output, or the bad day on Tuesday.&lt;/p&gt;

&lt;p&gt;Not just code. Anything that fits the shape of "structured input, defined process, evaluable output." Configurations. Reports. Data cleanup. Migrations. The boring layer of work between the interesting tickets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your homework
&lt;/h2&gt;

&lt;p&gt;Pick one task on your team that nobody wanted to automate because it was not worth it. Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Imagine onboarding a new colleague to it. Write that down.&lt;/li&gt;
&lt;li&gt;Collect a few previous executions and any related docs.&lt;/li&gt;
&lt;li&gt;Decide how you would evaluate the output.&lt;/li&gt;
&lt;li&gt;Open Claude in plan mode, throw all of it in, and ask for a command.&lt;/li&gt;
&lt;li&gt;Try it. Iterate. Ship it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you have done this already, I would love to hear what you put on your list. We are running our own at Bobcats Coding, and I suspect we are not done finding things that used to be "not worth it."&lt;/p&gt;

</description>
      <category>automation</category>
      <category>management</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>AI Field Notes #004 | Typing is no longer the bottleneck. Thinking is.</title>
      <dc:creator>Peter Tamas</dc:creator>
      <pubDate>Sat, 25 Apr 2026 12:27:10 +0000</pubDate>
      <link>https://dev.to/kondvik/ai-field-notes-004-typing-is-no-longer-the-bottleneck-thinking-is-50fn</link>
      <guid>https://dev.to/kondvik/ai-field-notes-004-typing-is-no-longer-the-bottleneck-thinking-is-50fn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Software engineer behind this case study: Mark Kővári &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Goal: Test whether agentic coding workflows can produce production-grade, architecturally complex software from a phone. The project: &lt;a href="https://github.com/markkovari/clutter" rel="noopener noreferrer"&gt;Clutter&lt;/a&gt;, a polyglot multi-agent orchestration system (Rust, TypeScript, K8s, NATS, SurrealDB).&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights
&lt;/h2&gt;

&lt;p&gt;The act of writing code is increasingly something an agent can do for you. What it &lt;em&gt;can't&lt;/em&gt; do is decide what to build, how to structure it, when to test, and where to draw boundaries. As more of the typing gets delegated, everything else carries proportionally more weight: specification, architecture, review, testing strategy, and the judgment calls that hold a system together. This experiment is about that shift as much as it is about walking.&lt;/p&gt;

&lt;p&gt;Every AI coding tool that shipped in the past few months converges on the same interaction surface: prompt, forms, tool-use approval, text output. That's the entire human-in-the-loop contract. None of it requires a desktop.&lt;/p&gt;

&lt;p&gt;So I tested a hypothesis: can someone run a full agentic development workflow from a phone while walking the dog?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaeuv21ct6azs82jinzg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaeuv21ct6azs82jinzg.webp" alt="Map of the dog walking" width="800" height="1005"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bulk of the work happened across a few sessions in late March (March 26-29, 110 commits), with a follow-up on April 9. The walk shown above was a single 18 km, 4-hour session through Budapest, responsible for the biggest commit spike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Remote execution environment&lt;/td&gt;
&lt;td&gt;Claude Code on a Mac Mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thin client&lt;/td&gt;
&lt;td&gt;Claude iOS app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input method&lt;/td&gt;
&lt;td&gt;iOS voice dictation + text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nice weather&lt;/td&gt;
&lt;td&gt;Recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power bank&lt;/td&gt;
&lt;td&gt;Recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dog and a nice view&lt;/td&gt;
&lt;td&gt;Optional but highly recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude Code already supports the full agentic loop: file reads, edits, shell commands, tool-use approvals. The process is identical to terminal usage, just accessed via a &lt;a href="https://code.claude.com/docs/en/remote-control" rel="noopener noreferrer"&gt;remote session&lt;/a&gt; on iOS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built (and why)
&lt;/h2&gt;

&lt;p&gt;The walking experiment created its own problem quickly. I was running multiple concurrent Claude Code sessions, each on a different task, and switching between remote sessions on a phone was painful. I kept losing track of which session was working on what. The friction wasn't the coding, it was the context-switching.&lt;/p&gt;

&lt;p&gt;So I started building &lt;a href="https://github.com/markkovari/clutter" rel="noopener noreferrer"&gt;Clutter&lt;/a&gt;: a multi-agent orchestration system that manages one-shot agents. Describe a task, fire off an isolated agent, get results back through NATS events. About 80% of it was built while walking, and all of it was written AI-native.&lt;/p&gt;

&lt;p&gt;It also exposes an MCP server, so I can create projects and tasks in Clutter directly from a Claude conversation. AI-assisted development building the tool that manages AI-assisted development. During development I was spoon-feeding Clutter its own tasks, and it was creating PRs on its own repo automatically (e.g. &lt;a href="https://github.com/markkovari/clutter/pull/51" rel="noopener noreferrer"&gt;PR #51&lt;/a&gt;, branch &lt;code&gt;agent/picur-agent-1&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Funny enough, for a while that's all I was doing: developing Clutter &lt;em&gt;with&lt;/em&gt; itself but never using it on other projects. Someone called that out in a meeting. So I created a project targeting an external repo, added a task, and it completed first try. Sometimes you need someone to point out you've been sharpening the knife without cutting anything.&lt;/p&gt;

&lt;p&gt;The best way to judge whether this workflow produces real output is to look at what it produced:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Commits&lt;/td&gt;
&lt;td&gt;112 (commit activity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Walking sessions&lt;/td&gt;
&lt;td&gt;Main session: ~18 km, ~4 hours (biggest commit day). Overall: March 26-29 + April 9 follow-up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages&lt;/td&gt;
&lt;td&gt;Rust (76%), TypeScript (20%), Gherkin, Dockerfile, Helm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust crates&lt;/td&gt;
&lt;td&gt;6 (control-plane, agent-runner, core, embedder, agent-mcp, mcp-server)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TS/Node packages&lt;/td&gt;
&lt;td&gt;3 (dashboard, shared types, shared UI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Docker, Kubernetes, Helm, NATS JetStream, SurrealDB, GitHub Actions CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation&lt;/td&gt;
&lt;td&gt;Architecture docs, ADRs, glossary, orchestration spec, conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;BDD feature specs (Gherkin), unit tests, integration tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The system is a Rust/Axum control plane with K8s agent isolation, a SurrealDB task queue with atomic claiming, NATS event streaming, a React/Vite real-time dashboard, and a vector embedder for semantic search across agent history. Agents run in isolated namespaces because multiple instances on the same machine would otherwise fight for ports. The point isn't the architecture itself. It's that this level of complexity came out of a phone screen and a pair of walking shoes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🐈 A group of bobcats is called a "clutter." That's where the name comes from.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The process
&lt;/h2&gt;

&lt;p&gt;Typical cycle: dictate a feature description while walking, review prompt on screen, send. Claude scaffolds the module, pauses for tool-use approval. Review proposed changes, approve or redirect, ask for tests. Another approval round. One feature, three to four approvals, ten minutes of walking.&lt;/p&gt;

&lt;p&gt;What changes without a desk is everything around the core loop. No cmd+click to jump to a definition. No split-screen diff. No grep. All of that gets delegated to the agent: "show me the swarm worker interface," "what imports the NATS subscriber," "run &lt;code&gt;cargo test&lt;/code&gt; and show me failures." The agent becomes the IDE.&lt;/p&gt;

&lt;p&gt;This forces you to work at a higher abstraction level. Instead of navigating files, you describe what you want to see. Instead of reading trait implementations line by line, you ask the agent to summarize. You stay in the intent layer, the agent handles navigation. For a project with this many moving parts, that's arguably the right level anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ambient development is real.&lt;/strong&gt; The cognitive overhead of the approval loop is low enough that walking actively helps architectural thinking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice-first input&lt;/strong&gt; works for prompt composition. Not perfect, but sufficient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone as thin client.&lt;/strong&gt; Functionally equivalent to a laptop for the human-in-the-loop surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive offloading.&lt;/strong&gt; Moving forces you to reason about structure rather than grep through files. Helps with modularity.&lt;/li&gt;
&lt;li&gt;Genuinely fun.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What didn't work
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice-to-text accuracy.&lt;/strong&gt; Mishears technical terms and identifiers. Not continuous like Gemini's voice mode either: dictate, review, send.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No push notifications for agent state.&lt;/strong&gt; Had to keep checking whether Claude was waiting for an approval. A notification on agent yield would change this significantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No code navigation without the agent.&lt;/strong&gt; Every file lookup costs context window tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session stability.&lt;/strong&gt; Occasional remote session hiccups requiring reopen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI collision.&lt;/strong&gt; Tool-use approval buttons appearing during typing cause misclicks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Side effects: what building this way does to you
&lt;/h2&gt;

&lt;p&gt;There's something that doesn't show up in the commit count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sharper instincts for code boundaries.&lt;/strong&gt; When you can't scroll through files, module boundaries need to be explicit and self-explanatory. You notice when an interface is too wide or a module's responsibility is unclear, because those are the moments you need three follow-up questions instead of one. This trains you to read and reason about code structure faster, whether you're on a phone or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code organization follows your mental model.&lt;/strong&gt; When navigation is by description ("show me the task lifecycle," "what handles NATS events"), the codebase starts reflecting how you reason about it. Modules get named for what they do, not where they sit in a directory tree. Interfaces get narrower because you want to ask for one thing and get one thing back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional guardrails harden continuously.&lt;/strong&gt; When an agent writes the code, you stop trusting that things are correct just because they compile. More tests, stricter type boundaries, better CI, more explicit conventions. The BDD specs, the CONVENTIONS.md, the orchestration spec all exist because this workflow surfaces the cost of ambiguity immediately. And this is where the human in the loop actually matters most: every intermediate artifact becomes a quality gate. A PR review isn't just a formality, it's the moment you catch what the agent missed. A rendered UI isn't just a preview, it's verification that intent survived translation. Every checkpoint (a green CI run, a visual diff, a passing smoke test) carries more weight now because the thing that produced the code between checkpoints isn't reasoning the way you would. The quality assurance surface doesn't shrink when agents write code. It grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I built a polyglot multi-agent orchestration system with 112 commits, BDD specs, architecture docs, and a real-time dashboard, mostly from a phone while walking. The project itself was born from the friction of doing exactly that.&lt;/p&gt;

&lt;p&gt;The limiting factor isn't the device. It's how well you know your architecture and how clearly you can describe intent to the agent.&lt;/p&gt;

&lt;p&gt;As AI coding tools converge on the same agentic loop, the interface becomes thinner. The logical endpoint: the "IDE" is just a notification that your agent needs a decision. Everything else happens in the background.&lt;/p&gt;

&lt;p&gt;If you want to see what I made, check out the repository: &lt;a href="https://github.com/markkovari/clutter" rel="noopener noreferrer"&gt;https://github.com/markkovari/clutter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Highly recommend experimenting with it. Worst case, you go for a nice walk.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Field Notes #003 | When AI Reads Too Much: The Real Price of Complexity</title>
      <dc:creator>Peter Tamas</dc:creator>
      <pubDate>Tue, 14 Apr 2026 16:15:42 +0000</pubDate>
      <link>https://dev.to/kondvik/ai-field-notes-003-when-ai-reads-too-much-the-real-price-of-complexity-2j35</link>
      <guid>https://dev.to/kondvik/ai-field-notes-003-when-ai-reads-too-much-the-real-price-of-complexity-2j35</guid>
      <description>&lt;p&gt;Let’s be honest: reading code is not always as straightforward as we would like. Even experienced developers know that some codebases take more effort to navigate than others. And now, AI has joined the same reality.&lt;/p&gt;

&lt;p&gt;Turns out, when an AI agent walks through a messy codebase, it does not get tired. It gets expensive. Not in time, but in tokens. The more tangled the logic, the more it costs to figure out what is going on. Same confusion, different billing model.&lt;/p&gt;

&lt;p&gt;That is where this tool comes in. Instead of letting packages pile up like an overambitious Jenga tower, it restructures them into a more balanced, layered system. The goal is simple: make the codebase easier to navigate, not just for developers, but for AI agents too.&lt;/p&gt;

&lt;p&gt;Whether you are human or silicon, nobody enjoys digging through chaos. And if we can make code more readable for both, that is not just optimization. It is survival.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://bobcats-coding.notion.site/ai-field-notes-by-bobcats-coding" rel="noopener noreferrer"&gt;https://bobcats-coding.notion.site/ai-field-notes-by-bobcats-coding&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Context awareness for node packages
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Structure node packages so AI agents read less and understand more.&lt;/p&gt;

&lt;p&gt;Specifically: measure how TypeScript monorepo structure affects context window consumption, and build a tool that quantifies the waste and fixes it.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/markkovari/context-pnpm" rel="noopener noreferrer"&gt;markkovari/context-pnpm&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Before/After Highlights
&lt;/h2&gt;

&lt;p&gt;When I work on different parts of a codebase with AI assistants, the context window fills up fast. Every file the assistant reads to understand a dependency is loaded in full, including implementation details it will never touch. For a busy utility module, that's thousands of tokens of waste, on every session, across every file that imports it. I kept hitting conversation compacting earlier than expected, and it was slowing me down.&lt;/p&gt;

&lt;p&gt;My theory was that the shape of your modules, how many packages you have, how big they are, how nested, directly influences how many tokens get burned just loading context. But I didn't have numbers. I didn't know the threshold where splitting a module actually pays off versus adding maintenance overhead for no gain.&lt;/p&gt;

&lt;p&gt;So I built a tool to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The approach
&lt;/h2&gt;

&lt;p&gt;I wanted to answer a simple question: given a TypeScript codebase, which files are costing you the most tokens per AI session, and is it worth restructuring them?&lt;/p&gt;

&lt;p&gt;The core insight is that file size alone doesn't predict waste. What matters is how much of a file is implementation versus exported API, multiplied by how many files import it. A 10,000-token type declaration file with 98% exports barely registers. A 700-token utility module with a large implementation body, imported by 18 files, costs more than almost anything else.&lt;/p&gt;

&lt;p&gt;I landed on this scoring formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score = (total_tokens − surface_tokens) × importer_count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;total_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full file token count (tiktoken cl100k_base)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;surface_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only the exported declarations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;importer_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of files that import this one&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 If the score is above 60 (the overhead of a &lt;code&gt;package.json&lt;/code&gt; + &lt;code&gt;index.ts&lt;/code&gt; boilerplate), extraction into a separate workspace package is worth it. Below that, leave it alone.&lt;/p&gt;
&lt;/blockquote&gt;
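
&lt;p&gt;To make the rule concrete, here is a minimal TypeScript sketch of the scoring and threshold logic (the &lt;code&gt;FileStats&lt;/code&gt; shape and function names are illustrative, not the tool's actual API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative sketch of the scoring rule, not the tool's real API.
const EXTRACTION_THRESHOLD = 60; // approx. package.json + index.ts boilerplate

interface FileStats {
  path: string;
  totalTokens: number;   // full file token count (tiktoken cl100k_base)
  surfaceTokens: number; // exported declarations only
  importerCount: number; // number of files that import this one
}

function extractionScore(file: FileStats): number {
  const hiddenTokens = file.totalTokens - file.surfaceTokens;
  return hiddenTokens * file.importerCount;
}

function isExtractionCandidate(file: FileStats): boolean {
  return extractionScore(file) &amp;gt; EXTRACTION_THRESHOLD;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;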

&lt;h2&gt;
  
  
  The toolchain
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;External packages&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://github.com/openai/tiktoken" rel="noopener noreferrer"&gt;tiktoken&lt;/a&gt; (OpenAI)&lt;/td&gt;
&lt;td&gt;Accurate token counting with cl100k_base encoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://typescript-eslint.io/packages/typescript-estree/" rel="noopener noreferrer"&gt;typescript-estree&lt;/a&gt; (typescript-eslint)&lt;/td&gt;
&lt;td&gt;ESTree-compatible AST parser to distinguish exported surface from implementation body&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Internal packages&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyzer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reads folders via glob pattern, returns total tokens, surface tokens, and importer counts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;estimator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Projects token savings per AI session from analyzer output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cli&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User-facing tool: &lt;code&gt;analyze&lt;/code&gt;, &lt;code&gt;estimate&lt;/code&gt;, &lt;code&gt;scaffold&lt;/code&gt;, &lt;code&gt;verify&lt;/code&gt;, &lt;code&gt;rebalance&lt;/code&gt;. Dry-run by default; nothing written without &lt;code&gt;--apply&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scaffolder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rewires imports/exports, registers new pnpm workspace packages, generates minimal &lt;code&gt;index.ts&lt;/code&gt; re-export surfaces&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The process: traverse the module tree, tokenize each file, separate surface from implementation via AST analysis, count importers, score everything, and rank by extraction value. The CLI can then scaffold the actual package extraction if the numbers justify it.&lt;/p&gt;
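
&lt;p&gt;As a rough sketch of the first two steps, the two external packages above are enough to get total and surface token counts for a file. This is deliberately simplified: it counts whole top-level export statements as surface, while the real analyzer separates exported signatures from their implementation bodies.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Simplified sketch: whole export statements count as surface here.
import { readFileSync } from "node:fs";
import { get_encoding } from "tiktoken";
import { parse } from "@typescript-eslint/typescript-estree";

const enc = get_encoding("cl100k_base");

function countTokens(text: string): number {
  return enc.encode(text).length;
}

function measureFile(path: string): { totalTokens: number; surfaceTokens: number } {
  const source = readFileSync(path, "utf8");
  const ast = parse(source, { range: true });

  const exportedChunks: string[] = [];
  for (const node of ast.body) {
    if (
      node.type === "ExportNamedDeclaration" ||
      node.type === "ExportDefaultDeclaration" ||
      node.type === "ExportAllDeclaration"
    ) {
      exportedChunks.push(source.slice(node.range[0], node.range[1]));
    }
  }

  return {
    totalTokens: countTokens(source),
    surfaceTokens: countTokens(exportedChunks.join("\n")),
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;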

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;During development, I spoon-fed the tool its own internal packages as test cases and added synthetic fixtures for both extremes: a "symmetric" already-optimized codebase and an "asymmetric" monolith with classic shared-utility anti-patterns.&lt;/p&gt;

&lt;p&gt;But the interesting part was running it against real-world open-source monorepos.&lt;/p&gt;

&lt;h3&gt;
  
  
  External package benchmarks
&lt;/h3&gt;

&lt;p&gt;I ran dry-run estimations against three popular TypeScript repositories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Codebase&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Candidates&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;th&gt;Tokens saved / session&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://github.com/trpc/trpc" rel="noopener noreferrer"&gt;tRPC&lt;/a&gt; &lt;code&gt;packages/server/src&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;63%&lt;/td&gt;
&lt;td&gt;68,572&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://github.com/tanstack/query" rel="noopener noreferrer"&gt;TanStack Query&lt;/a&gt; &lt;code&gt;packages/query-core/src&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;34,155&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://github.com/radix-ui/primitives" rel="noopener noreferrer"&gt;Radix UI&lt;/a&gt; &lt;code&gt;packages/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;131&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;1,591&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Pricing reference: Claude Sonnet input at $3/1M tokens. The tRPC result means ~$0.21 in unnecessary tokens per session, which adds up across a team over weeks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  tRPC: the "deep internals" anti-pattern
&lt;/h4&gt;

&lt;p&gt;tRPC's &lt;code&gt;unstable-core-do-not-import/&lt;/code&gt; is a textbook case. 56 files fan out to 2-18 consumers each. Every adapter file that an AI session reads drags in the full internals of the procedure builder, router, and streaming infrastructure, even when it only needs one or two types. The top offender, &lt;code&gt;procedureBuilder.ts&lt;/code&gt;, scores 8,260: 4,386 tokens of implementation consumed by 5 importers. After extraction, each consumer would read only a ~200-token surface.&lt;/p&gt;

&lt;h4&gt;
  
  
  TanStack Query: tight coupling in a small graph
&lt;/h4&gt;

&lt;p&gt;31 files, 20 heavily cross-imported. &lt;code&gt;utils.ts&lt;/code&gt; is imported by 17 files and &lt;code&gt;queryClient.ts&lt;/code&gt; by 13. The interesting finding here: &lt;code&gt;types.ts&lt;/code&gt; is the largest file (10,521 tokens) but scores only fifth because 98% of it is surface. &lt;code&gt;utils.ts&lt;/code&gt; scores second despite being half the size, because its implementation body is large relative to what callers use. File size is a bad proxy for waste.&lt;/p&gt;

&lt;h4&gt;
  
  
  Radix UI: the correct negative result
&lt;/h4&gt;

&lt;p&gt;Only 5 candidates from 131 files, all with a single importer. Radix is already decomposed into ~30 packages with 1-5 files each and minimal internal coupling. The tool correctly says "nothing to do." This was an important validation: I needed to confirm it doesn't generate false positives on well-structured code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synthetic fixtures
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fixture&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Candidates&lt;/th&gt;
&lt;th&gt;Tokens saved&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;monolith-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;9,936&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;db.ts&lt;/code&gt;, &lt;code&gt;logger.ts&lt;/code&gt;, &lt;code&gt;config.ts&lt;/code&gt; each imported by every other module. Most common anti-pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;decomposed-app&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Small focused files, 1-2 consumers each. Correct negative.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The biggest finding: &lt;strong&gt;file size doesn't predict waste&lt;/strong&gt; (R² ≈ 0.15). Importer count alone is equally weak. The strongest predictor is hidden tokens (implementation body), but the score is multiplicative (&lt;code&gt;hidden × importers&lt;/code&gt;), so both dimensions matter.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means you can't eyeball your way to the answer. A module that looks "big" might be mostly type exports and perfectly fine. A module that looks "small" might be silently burning thousands of tokens because it's imported everywhere and its public API is tiny compared to its internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked
&lt;/h2&gt;

&lt;p&gt;The Claude Code hook integration turned out to be the most practical outcome. Wire &lt;code&gt;estimate&lt;/code&gt; as a &lt;code&gt;SessionStart&lt;/code&gt; hook and it automatically surfaces context bloat whenever you open a session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionStart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx context-pnpm estimate . 2&amp;gt;/dev/null | grep -E 'Total|No extraction'"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the codebase is clean, you see &lt;code&gt;No extraction candidates&lt;/code&gt;. If it has drifted, you see the token savings waiting to be unlocked, before you've written a single line of code. This feedback loop keeps the team aware of structural drift without adding process overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What didn't work (yet)
&lt;/h2&gt;

&lt;p&gt;The scaffolder, while functional, is the least mature piece. It handles straightforward cases well: generating workspace packages with minimal re-export surfaces and rewriting import paths. But in codebases with circular dependencies or complex re-export chains, the rewiring logic still needs manual intervention. I'm treating this as a "preview" feature while I iterate on edge cases.&lt;/p&gt;

&lt;p&gt;I also initially assumed I could use a simpler heuristic (just file size times importer count) and skip the AST-based surface detection entirely. The Radix UI and TanStack &lt;code&gt;types.ts&lt;/code&gt; results proved that assumption wrong. Without distinguishing surface from implementation, the scoring would have flagged &lt;code&gt;types.ts&lt;/code&gt; as one of the top offenders when it's actually fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current status and next steps
&lt;/h2&gt;

&lt;p&gt;The tool is open source and usable today for the read-only commands (&lt;code&gt;analyze&lt;/code&gt;, &lt;code&gt;estimate&lt;/code&gt;). The mutation commands (&lt;code&gt;scaffold&lt;/code&gt;, &lt;code&gt;rebalance&lt;/code&gt;) work but should be used with review.&lt;/p&gt;

&lt;p&gt;Next steps I'm considering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding support for JavaScript/JSX alongside TypeScript (partially done)&lt;/li&gt;
&lt;li&gt;Making the scoring engine language-agnostic via &lt;a href="https://tree-sitter.github.io/tree-sitter/" rel="noopener noreferrer"&gt;Tree-sitter&lt;/a&gt;. The core formula is language-independent; I'd only need per-language definitions of "what counts as surface" (Python: &lt;code&gt;__all__&lt;/code&gt;; Go: capitalized identifiers; Rust: &lt;code&gt;pub&lt;/code&gt; items). The &lt;a href="https://github.com/kreuzberg-dev/tree-sitter-language-pack" rel="noopener noreferrer"&gt;tree-sitter-language-pack&lt;/a&gt; bundles 248+ grammars with Rust/Node.js/Python bindings, so the plumbing is there.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;rebalance&lt;/code&gt; command that identifies merge/split/inline opportunities on existing workspace packages, not just extraction from monoliths&lt;/li&gt;
&lt;li&gt;Better heuristics for the "when to extract" decision, incorporating churn rate from git history alongside the static score&lt;/li&gt;
&lt;li&gt;Integration with CI pipelines so teams get warned when a PR pushes a module past the extraction threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don't think AI coding assistants will solve the context window problem on their own. Models will get bigger windows, but tokens are never free, and the cost curve is multiplicative with team size. Structuring your code so that AI can read less and understand more is a lever that keeps compounding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing code for two audiences
&lt;/h2&gt;

&lt;p&gt;For decades, "clean code" meant code that humans can read and maintain. That's still true, but AI agents are now a second consumer of your codebase. They read your modules, trace your imports, and parse your exports on every session, from scratch, burning tokens the whole time.&lt;/p&gt;

&lt;p&gt;The practices that help humans (small functions, clear separation of concerns) mostly overlap with what helps agents, but not entirely. An agent doesn't care about naming aesthetics. It cares about how many tokens it has to ingest before it can do useful work. A module with a 50-line public API and 2,000 lines of implementation behind it is perfectly clean by human standards, but it's wasteful for an agent that only needs the API.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;context-pnpm&lt;/code&gt; is built around treating AI readability as a first-class design constraint alongside human readability. The two rarely conflict: narrow interfaces, minimal public surface, and well-decomposed modules are good for both. The difference is that now there's a measurable cost when you get it wrong: tokens per session, dollars per month. I think this will quietly become a standard part of how teams think about code architecture, not as a buzzword, but as a practical recognition that your codebase has two kinds of readers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Old principles, new payoff: why SOLID matters for AI readability
&lt;/h2&gt;

&lt;p&gt;Most of what makes code AI-readable isn't new. The Interface Segregation Principle (the "I" in SOLID) says no consumer should depend on methods it doesn't use; that's literally what the scoring formula measures. The Dependency Inversion Principle says depend on abstractions, not implementations; that's what extraction into a minimal re-export surface achieves. Interface-Driven Development (IDD) formalizes this into "design the interface before the implementation." The difference now is that these principles have a measurable second payoff: every unnecessary token you hide behind an interface is a token the agent doesn't burn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic rebalancing
&lt;/h2&gt;

&lt;p&gt;Extraction is a one-time event, but codebases drift. The &lt;code&gt;rebalance&lt;/code&gt; command (in preview) treats the module tree like a self-balancing tree: merge, split, inline, or extract packages as import patterns change. The missing signal is git churn rate, which I'm exploring to avoid suggesting extraction on modules being actively rewritten.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternatives and similar tools
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Difference from context-pnpm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://github.com/tach-org/tach" rel="noopener noreferrer"&gt;Tach&lt;/a&gt; (Gauge)&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Module boundaries, dependency enforcement, strict public interfaces. Written in Rust.&lt;/td&gt;
&lt;td&gt;No token-based scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/abs/2603.27277" rel="noopener noreferrer"&gt;Codebase-Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;66 languages&lt;/td&gt;
&lt;td&gt;Tree-sitter knowledge graph, 10x fewer tokens via MCP&lt;/td&gt;
&lt;td&gt;Optimizes retrieval, not structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/multilang-depends/depends" rel="noopener noreferrer"&gt;Depends&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Java, C/C++, Ruby&lt;/td&gt;
&lt;td&gt;Language-agnostic dependency extraction&lt;/td&gt;
&lt;td&gt;Raw data, no scoring or restructuring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Interface-Driven Development&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://milan.milanovic.org/post/interface-driven-development/" rel="noopener noreferrer"&gt;IDD overview&lt;/a&gt; (Milanovic, 2022)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spec-Driven Development&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.infoq.com/articles/spec-driven-development/" rel="noopener noreferrer"&gt;Spec Driven Development&lt;/a&gt; (InfoQ, 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface-based programming&lt;/td&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/Interface-based_programming" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency graphs at scale&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.hudsonrivertrading.com/hrtbeat/dependency-graph-python-codebase/" rel="noopener noreferrer"&gt;Building a Dependency Graph&lt;/a&gt; (HRT, 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency graph management&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.tweag.io/blog/2025-09-18-managing-dependency-graph/" rel="noopener noreferrer"&gt;Managing dependency graph in a large codebase&lt;/a&gt; (Tweag, 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context engineering&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html" rel="noopener noreferrer"&gt;Context Engineering for Coding Agents&lt;/a&gt; (Fowler, 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token optimization research&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://arxiv.org/abs/2603.27277" rel="noopener noreferrer"&gt;Codebase-Memory&lt;/a&gt; (arXiv, 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context strategies&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.faros.ai/blog/context-engineering-for-developers" rel="noopener noreferrer"&gt;Context Engineering for Developers&lt;/a&gt; (Faros AI, 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>codequality</category>
    </item>
    <item>
      <title>AI FIELD NOTES #002 – Weekly memos for Engineering Leaders</title>
      <dc:creator>Peter Tamas</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:12:03 +0000</pubDate>
      <link>https://dev.to/kondvik/ai-field-notes-002-weekly-memos-for-engineering-leaders-1o52</link>
      <guid>https://dev.to/kondvik/ai-field-notes-002-weekly-memos-for-engineering-leaders-1o52</guid>
      <description>&lt;p&gt;You are a software engineer, so you know that feeling. You are deep in dependency hell, reading library docs, digging through version histories, staring at compatibility matrices. Running tools to detect transitive version conflicts.&lt;/p&gt;

&lt;p&gt;I got tired of it, too. So I asked Opus 4.6 to build a version checker hook and let AI deal with this problem instead of me. It turned out to be one of the most quietly impactful things I've added to my workflow.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.bobcatscoding.com/ai-field-notes" rel="noopener noreferrer"&gt;Bobcats Coding&lt;/a&gt;, we've deliberately built AI into our delivery system: how we specify, how we build, how we test, and how we learn. Of course, if you are in the middle of a large legacy codebase with years of untouched dependencies, your mileage may vary, but that's a story for another time. :)&lt;/p&gt;

&lt;h2&gt;
  
  
  The core issue
&lt;/h2&gt;

&lt;p&gt;One of my recurring frustrations with AI-generated code was debugging problems caused by poorly selected dependency versions. When you let AI add a dependency to a project, it usually does &lt;strong&gt;not&lt;/strong&gt; install the latest version of the library. Instead, it often selects an artifact that is &lt;strong&gt;a few major versions behind&lt;/strong&gt; the latest release. This behavior caused two types of issues for me repeatedly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Usage of already deprecated functions&lt;/li&gt;
&lt;li&gt;Version mismatch bugs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you don’t have proper &lt;strong&gt;E2E tests&lt;/strong&gt; in the project, version mismatch bugs can be very difficult to detect. The feature simply doesn’t behave as expected, usually without any error messages, which makes debugging extremely frustrating.&lt;/p&gt;

&lt;p&gt;After running into this situation multiple times across several projects, I asked &lt;strong&gt;Opus 4.6&lt;/strong&gt; to create a &lt;a href="https://www.notion.so/Version-checker-hook-Claude-Code-31b1c06aab6e809496c2de877f9b77ab?pvs=21" rel="noopener noreferrer"&gt;version-checking script&lt;/a&gt; and &lt;a href="https://www.notion.so/Version-checker-hook-Claude-Code-31b1c06aab6e809496c2de877f9b77ab?pvs=21" rel="noopener noreferrer"&gt;integrated it into a PostToolUse hook&lt;/a&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;checks whether all project dependencies are up to date (similar to Dependabot)&lt;/li&gt;
&lt;li&gt;detects version mismatch issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Since I started using this hook, I’ve never had to struggle with dependency versions again.
&lt;/h3&gt;

&lt;p&gt;When AI adds a new dependency for a feature, even if I miss an E2E test for a particular scenario, this hook saves me a lot of time by catching hidden issues caused by version mismatches.&lt;/p&gt;

&lt;p&gt;After every change in my dependencies, it runs version checks and &lt;strong&gt;automatically iterates on dependency versions until they align and all tests pass&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Ask AI to create a &lt;strong&gt;version-checker PostToolUse&lt;/strong&gt; hook in your projects, one that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shows a warning when it detects a newer version of a dependency&lt;/li&gt;
&lt;li&gt;checks for version mismatch issues and resolves them when found&lt;/li&gt;
&lt;li&gt;detects transitive dependency conflicts and resolves them automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update dependencies in a &lt;strong&gt;separate commit or PR&lt;/strong&gt; when warnings appear to keep the project up to date.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;TDD with BDD-style E2E tests&lt;/strong&gt; (only partially related, but still helpful for catching subtle runtime issues).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resulting Workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;AI adds dependencies during feature implementation&lt;/li&gt;
&lt;li&gt;The PostToolUse hook detects version issues&lt;/li&gt;
&lt;li&gt;The hook automatically aligns dependency versions&lt;/li&gt;
&lt;li&gt;Lint, type checks, and tests verify correctness&lt;/li&gt;
&lt;li&gt;Only stable commits enter the repository&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated dependency version assurance&lt;/li&gt;
&lt;li&gt;Reduced debugging time&lt;/li&gt;
&lt;li&gt;Safer AI-generated code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;In my full-stack TypeScript projects, the hook works surprisingly well, even when updated major versions were not yet compatible with each other. The project's feedback loops recognized this, automatically iterated through minor versions while reading issues and forums, and even came up with a small, effective temporary patch that resolved the incompatibility.&lt;/p&gt;

&lt;p&gt;The bottom line: connecting a version-checker pre-commit hook to all your projects isn't optional when you're working with AI-generated code. It's a mandatory feedback loop for maintaining developer productivity, and it has saved me a lot of nerves. :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to try it? Here is my implementation example
&lt;/h2&gt;

&lt;p&gt;In my TypeScript projects, I implemented this workflow with a small script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.notion.so/bobcats-coding/Version-checker-hook-Claude-Code-31b1c06aab6e809496c2de877f9b77ab?source=copy_link#31c1c06aab6e804e9c2efe8509462cee" rel="noopener noreferrer"&gt;check-versions.ts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It performs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dependency outdated checks&lt;/li&gt;
&lt;li&gt;peer dependency mismatch detection&lt;/li&gt;
&lt;li&gt;transitive conflict detection&lt;/li&gt;
&lt;li&gt;optional automatic resolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;bun scripts/check-versions.ts              #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;All checks &lt;span class="o"&gt;(&lt;/span&gt;pre-commit&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;bun scripts/check-versions.ts --mismatch   #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Mismatch only &lt;span class="o"&gt;(&lt;/span&gt;fast&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;bun scripts/check-versions.ts --fix        #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Auto-resolve mismatches + transitive conflicts
&lt;span class="gp"&gt;bun scripts/check-versions.ts --json       #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;JSON output &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;Claude hook&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows different checks depending on the stage of the workflow.&lt;/p&gt;

&lt;p&gt;For example, the &lt;strong&gt;fast mismatch-only check&lt;/strong&gt; is ideal for pre-commit hooks.&lt;/p&gt;
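
&lt;p&gt;Wiring that into git is a one-liner (a sketch; it assumes the same script path used above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
# .git/hooks/pre-commit (sketch): block the commit on version mismatches
bun scripts/check-versions.ts --mismatch || exit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;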

&lt;p&gt;An example output of the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bun scripts/check-versions.ts 2&amp;gt;&amp;amp;1)
  ⎿  ⚠ OUTDATED PACKAGES:
       @vitejs/plugin-react           4.7.0 → 5.1.4        ▲ major
       vite                           6.4.1 → 7.3.1        ▲ major
       @colyseus/schema               2.0.37 → 4.0.17      ▲ major
       colyseus.js                    0.15.28 → 0.16.22    ▲ minor
       @colyseus/ws-transport         0.15.3 → 0.17.9      ▲ minor
       colyseus                       0.15.57 → 0.17.8     ▲ minor
       @colyseus/testing              0.15.4 → 0.17.11     ▲ minor
       @biomejs/biome                 2.4.4 → 2.4.6        ▲ patch
       @storybook/react               10.2.13 → 10.2.16    ▲ patch
       @storybook/react-vite          10.2.13 → 10.2.16    ▲ patch
       storybook                      10.2.13 → 10.2.16    ▲ patch

     ✓ No version mismatches or transitive conflicts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The &lt;code&gt;check-versions.sh&lt;/code&gt; hook
&lt;/h2&gt;

&lt;p&gt;In the &lt;code&gt;.claude/hooks&lt;/code&gt; folder, I created the &lt;code&gt;check-versions.sh&lt;/code&gt; script, where I run the &lt;code&gt;check-versions.ts&lt;/code&gt; script only when a package.json file is modified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# Version checker hook — runs after PostToolUse on Edit/Write.&lt;/span&gt;
&lt;span class="c"&gt;# Only fires when the modified file is a package.json.&lt;/span&gt;
&lt;span class="c"&gt;# Runs mismatch check and injects result into Claude's context.&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;INPUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;ROOT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/Users/kond/kondfox/isuperhero-claude"&lt;/span&gt;

&lt;span class="c"&gt;# Extract the file path from the hook payload&lt;/span&gt;
&lt;span class="nv"&gt;TOOL_INPUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$INPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.tool_input // empty'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;FILE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TOOL_INPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.file_path // empty'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Only run when a package.json was modified&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$FILE_PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$FILE_PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"package.json"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Run mismatch check only (fast, no network call)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ROOT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;~/.bun/bin/bun scripts/check-versions.ts &lt;span class="nt"&gt;--mismatch&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt; 2&amp;gt;&amp;amp;1 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
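

&lt;p&gt;For reference, this is roughly the shape of the JSON payload the hook reads from stdin. The exact payload carries more fields; the script only relies on &lt;code&gt;tool_input.file_path&lt;/code&gt;, so treat everything else here as illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Approximate shape of the PostToolUse stdin payload (illustrative).
// The hook above only depends on tool_input.file_path.
type PostToolUsePayload = {
  tool_name: string;   // e.g. "Edit" or "Write"
  tool_input: {
    file_path: string; // the file the tool just modified
    // ...other tool-specific fields
  };
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;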



&lt;h2&gt;
  
  
  &lt;code&gt;PostToolUse&lt;/code&gt; hook integration
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;.claude/settings.json&lt;/code&gt;, I defined the &lt;code&gt;PostToolUse&lt;/code&gt; hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PostToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edit|Write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/kond/kondfox/isuperhero-claude/.claude/hooks/check-versions.sh"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pre-commit Integration (Husky)
&lt;/h2&gt;

&lt;p&gt;The script runs automatically before every commit via Husky.&lt;/p&gt;

&lt;p&gt;Example pre-commit hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env sh&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git rev-parse &lt;span class="nt"&gt;--show-toplevel&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

~/.bun/bin/bun scripts/check-versions.ts &lt;span class="nt"&gt;--fix&lt;/span&gt;
~/.bun/bin/bun run lint:fix
~/.bun/bin/bun run typecheck
~/.bun/bin/bun run &lt;span class="nb"&gt;test&lt;/span&gt;
~/.bun/bin/bun run &lt;span class="nb"&gt;test&lt;/span&gt;:e2e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that, before any commit enters the repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dependency versions are aligned&lt;/li&gt;
&lt;li&gt;lint errors are fixed&lt;/li&gt;
&lt;li&gt;types compile&lt;/li&gt;
&lt;li&gt;unit tests pass&lt;/li&gt;
&lt;li&gt;E2E tests pass&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This creates a &lt;strong&gt;tight feedback loop for AI-generated code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can see a full working example here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kondfox/isuperhero-claude" rel="noopener noreferrer"&gt;https://github.com/kondfox/isuperhero-claude&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repository contains the full &lt;code&gt;check-versions.ts&lt;/code&gt; implementation and the Husky integration used in production.&lt;/p&gt;
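
&lt;p&gt;If you only want the core idea, here is a minimal sketch of what a mismatch check can look like. This is a simplified, hypothetical version, not the script from the repo; the real one also implements the &lt;code&gt;--fix&lt;/code&gt;, &lt;code&gt;--mismatch&lt;/code&gt;, and &lt;code&gt;--json&lt;/code&gt; flags you saw above, plus the registry lookup for newer versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical, simplified mismatch check (Bun + TypeScript).
// Collect every declared version of every dependency across the
// workspace's package.json files, then report dependencies that are
// declared with more than one version.
const glob = new Bun.Glob("**/package.json");
const seen = new Map(); // dependency name -&amp;gt; Map(version -&amp;gt; files)

for (const file of glob.scanSync(".")) {
  if (file.includes("node_modules")) continue;
  const pkg = await Bun.file(file).json();
  for (const deps of [pkg.dependencies, pkg.devDependencies]) {
    for (const [name, version] of Object.entries(deps ?? {})) {
      const byVersion = seen.get(name) ?? new Map();
      byVersion.set(version, [...(byVersion.get(version) ?? []), file]);
      seen.set(name, byVersion);
    }
  }
}

let mismatches = 0;
for (const [name, byVersion] of seen) {
  if (byVersion.size &amp;gt; 1) {
    mismatches += 1;
    console.log(`✗ ${name}: ${[...byVersion.keys()].join(" vs ")}`);
  }
}
process.exit(mismatches &amp;gt; 0 ? 1 : 0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
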

&lt;h2&gt;
  
  
  Legacy project?
&lt;/h2&gt;

&lt;p&gt;I have doubts about how good an idea it is to integrate the version-checker feedback loop into a legacy project where the dependencies are far behind the current versions. But I’m going to give it a try in such a project shortly and share the results with you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>AI Field Notes #001 | Is AI frontend development finally getting good? Our Opus 4.6 test says yes. (And no.)</title>
      <dc:creator>Peter Tamas</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/kondvik/ai-field-notes-001-is-ai-frontend-development-finally-getting-good-our-opus-46-test-says-yes-5c0a</link>
      <guid>https://dev.to/kondvik/ai-field-notes-001-is-ai-frontend-development-finally-getting-good-our-opus-46-test-says-yes-5c0a</guid>
      <description>&lt;p&gt;In December 2025, &lt;a href="https://www.notion.so/bobcats-coding/UI-component-development-with-Windsurf-Gemini-3-Pro-Figma-MCP-Playwright-MCP-2c31c06aab6e80b8b570ecec267b2499" rel="noopener noreferrer"&gt;I wrote about trying to build a full page with AI&lt;/a&gt; with a much smaller scope, and it didn’t go well.&lt;/p&gt;

&lt;p&gt;At that time, my conclusion was that while implementing a simple, small UI component with AI and Figma MCP worked quite well, it was surprising how badly it handled the implementation of a full page. The small UI component generation wasn't perfect either: I could get a ~90% "close enough" output that I could quickly align to the requirements by hand. But when I asked AI to implement a simple login page that contained only already-existing components, even with Figma MCP, the result was disappointing. The layout was far from the design, and it hallucinated elements that weren't in the design at all. No matter how I prompted, it just produced different hallucinations. I really don't understand this, because Figma MCP provides a structured description of the design. In the end, I spent much more time experimenting with AI than the few minutes it would have taken to puzzle the components into place myself.&lt;/p&gt;

&lt;p&gt;My current experience is still not flawless, but I'm amazed by the improvement in this area over the past 3 months. &lt;strong&gt;I managed to implement a whole complex page, with existing and new components, that I had estimated at 48 hours, in just 8 hours.&lt;/strong&gt; Not in one iteration, not 100% AI-generated, not without refactoring and human code reviews, but the velocity is impressive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Context on the Comparison
&lt;/h2&gt;

&lt;p&gt;After having satisfying experiences with Opus 4.6 UI component implementation, I was eager to retry a full-page AI implementation experiment. When you don't have a strict specification, it's easy to vibe-code a fair-looking result, but it's hard to evaluate how well the output matches the client's needs. That's why I chose a project where we had clear requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Figma designs that we need to implement&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;An OpenAPI specification of the backend API&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are strict, structural anchors that provide clear and easy verification of the result.&lt;/p&gt;

&lt;p&gt;The state of the project when I ran my experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The project was already "Claude-ready". It had a well-set-up, project-specific &lt;code&gt;CLAUDE.md&lt;/code&gt; that my colleagues had been using for months.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;We already had an API client, but none of the endpoints that this page uses were defined in it yet.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The site layout, design system, and some of the UI components that the page needed were already in place, but the design also contained new, complex UI components.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unfortunately, this was a client project we built at &lt;a href="https://www.bobcatscoding.com/ai-field-notes" rel="noopener noreferrer"&gt;Bobcats Coding&lt;/a&gt;, so screenshots, product details, and the repository stay private. But I'm going to write about everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first iteration
&lt;/h2&gt;

&lt;p&gt;The better you specify, the better the outcome you can expect. This isn't a new idea; it was true before AI coding as well. But with agentic engineering, specification is the new code.&lt;/p&gt;

&lt;p&gt;So I spent ~1 hour specifying the task as my initial prompt. I gave general context about the page we were building, and linked the Figma design of the whole page and of each component, one by one. I gave a clear specification for each element of the page: which API endpoint it gets its data from, what it represents, how it should work. I specified all the page actions as well: what should happen when a button is clicked, when a dropdown element is selected, and so on. I also instructed Claude to generate every new UI component in a reusable way within our UI library, test them, and provide Storybook stories and documentation.&lt;/p&gt;
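
&lt;p&gt;Reconstructed from memory (the real prompt is client-specific and stays private), the structure of that initial prompt looked roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: what the page is for and where it lives in the app
Design: Figma link to the full page + a link per component
For each element of the page:
  - data source (which API endpoint, per the OpenAPI spec)
  - what it represents
  - behavior (clicks, selections, loading and error states)
Page actions: what each button and dropdown does
Constraints: new UI components go into the UI library, reusable,
             with tests, Storybook stories, and documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
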

&lt;p&gt;I asked AI to create an implementation plan that multiple agents could work on in parallel (because I was curious how this would work). I required a contract-first approach so that the results of the asynchronously working agents could be integrated at the end.&lt;/p&gt;
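
&lt;p&gt;"Contract-first" here means the shared types and interfaces get written down before the agents start, so each agent codes against the same surface. A minimal sketch of such a contract (the names are hypothetical, since the project is private):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical contract between agents: the agent building the API layer
// implements this interface; the agent composing the page only imports
// and calls it, so both can work in parallel and integrate cleanly.
export type CampaignSummary = {
  id: string;
  name: string;
  status: "draft" | "live" | "closed";
};

export interface CampaignApi {
  listCampaigns(): Promise&amp;lt;CampaignSummary[]&amp;gt;;
  getCampaign(id: string): Promise&amp;lt;CampaignSummary&amp;gt;;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
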

&lt;p&gt;Opus 4.6 worked for 9 minutes to create the plan. It correctly found all the files in the project that it needed to modify and the workspaces and folders where it should create the new files.&lt;/p&gt;

&lt;p&gt;It separated the work into 4 agents with clear responsibilities, tasks, and restrictions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent 1: API Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Update the API schema&lt;/li&gt;
&lt;li&gt;Generate API types from the schema&lt;/li&gt;
&lt;li&gt;Create and export DTO types&lt;/li&gt;
&lt;li&gt;Add mapping functions (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Implement new API client methods based on the given interface&lt;/li&gt;
&lt;li&gt;Add unit tests&lt;/li&gt;
&lt;/ul&gt;
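
&lt;p&gt;The "DTO types" and "mapping functions" items translate to a pattern like this. Again, the names are hypothetical and the generated types come from the client's OpenAPI spec, so read it as a sketch of the shape, not the project's code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical DTO + mapper; the real wire types are generated from the
// client's OpenAPI spec (e.g. with openapi-typescript), so every name
// here is a placeholder.
import type { components } from "./api-types"; // generated file, assumed

type CampaignDto = components["schemas"]["Campaign"];

export type Campaign = {
  id: string;
  name: string;
  startsAt: Date;
};

// Mapping function: converts the wire format into a UI-friendly shape.
export const toCampaign = (dto: CampaignDto): Campaign =&amp;gt; ({
  id: dto.id,
  name: dto.name,
  startsAt: new Date(dto.start_date),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
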

&lt;h3&gt;
  
  
  Agent 2: UI Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implement the discovered new UI components (the plan listed their names, dependencies, and functional descriptions)&lt;/li&gt;
&lt;li&gt;Add unit tests&lt;/li&gt;
&lt;li&gt;Create Storybook stories&lt;/li&gt;
&lt;li&gt;Export the components from the UI library&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Agent 3: Page Composition
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Replace the current placeholder component on the page (server component)&lt;/li&gt;
&lt;li&gt;Call the proper API client methods for data&lt;/li&gt;
&lt;li&gt;Feed the data to the created client component&lt;/li&gt;
&lt;li&gt;Implement the layout and state management of the client component&lt;/li&gt;
&lt;li&gt;Place the required UI components on the page&lt;/li&gt;
&lt;li&gt;Implement the page actions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Agent 4: E2E Tests
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Write the necessary BDD-style E2E tests for the page (the BDD features were also included in the plan for quick human verification)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It created an execution order as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1 (parallel): Agent 1 (API) + Agent 2 (UI Components)
Phase 2 (after Phase 1): Agent 3 (Page Composition)
Phase 3 (after Phase 2): Agent 4 (E2E Tests)

Agents 1 and 2 have zero dependencies and run fully in parallel.
Agent 3 depends on both but can start skeleton code immediately.
Agent 4 runs last as it needs rendered DOM.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The execution of the plan took &lt;strong&gt;29 minutes and 25 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The result at first glance was a bit odd. It clearly contained all the required elements and showed the data correctly from the API, but the layout was broken, and the component designs were only ~80–90% faithful to the Figma designs. No hallucinations, though!&lt;/p&gt;

&lt;p&gt;All in all, it was not great, not terrible for a first iteration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Refining the design
&lt;/h2&gt;

&lt;p&gt;I asked Claude to use Playwright MCP to verify its result: find the differences from the Figma design and fix them.&lt;/p&gt;

&lt;p&gt;Using Playwright MCP as a feedback loop in frontend development works surprisingly well. Claude opens the page in a browser, takes screenshots, analyzes them, finds the problems, fixes them, verifies the fix with Playwright again, and iterates until it's solved.&lt;/p&gt;
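
&lt;p&gt;Under the hood, this is nothing you couldn't script yourself. A minimal manual equivalent of the screenshot step, with a placeholder URL and output path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Manual equivalent of the screenshot step in the loop above.
// The URL and the output path are placeholders.
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage({ viewport: { width: 1440, height: 900 } });
await page.goto("http://localhost:3000");
await page.screenshot({ path: "implementation.png", fullPage: true });
await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
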

&lt;p&gt;However, my prompt was too vague, and the use of Figma MCP is still far from perfect, so the result was also disappointing. What worked much better was creating screenshots of both the UI implementation and the expected design, then describing the problems. Most of the design issues could be solved this way.&lt;/p&gt;

&lt;p&gt;Creating a 100% pixel-perfect design is still not something an LLM is capable of.&lt;/p&gt;

&lt;p&gt;You need to recognize the point when the agent gets stuck in a loop, when every iteration just makes the problem different, but you don't get any closer to the solution. That's the point when you need to take the keyboard and finish the job yourself.&lt;/p&gt;

&lt;p&gt;In the case of pixel-perfect design implementation, in my experience with current models and tools, you can usually reach a ~90–95% state with AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Refining the code
&lt;/h2&gt;

&lt;p&gt;Opus 4.6 generates relatively decent code, but most of the time it needs some refactoring. In the case of this experiment, here's what I found during code review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It didn't create components for all the UI elements it should have. This led to unnecessary duplication that would have been difficult to maintain.&lt;/li&gt;
&lt;li&gt;It didn't always use the tokens from the design system or our SASS mixins (e.g., for typography).&lt;/li&gt;
&lt;li&gt;I found some overcomplicated, mutating logic that could have been written more simply.&lt;/li&gt;
&lt;li&gt;It didn't create some of the components as reusable as I expected.&lt;/li&gt;
&lt;li&gt;It hardcoded some constants that shouldn't have been hardcoded.&lt;/li&gt;
&lt;li&gt;It wasn't forward-thinking enough to extract functionality we could reuse later into a hook or utility function.&lt;/li&gt;
&lt;li&gt;It used far more &lt;code&gt;useMemo&lt;/code&gt; than necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I pointed these issues out to Claude, it could fix them much faster than I would have. With proper permissions, it can even read PR comments from GitHub, so you don't necessarily need to prompt the fixes manually: you can review on GitHub, then give a short "fix my reviews on the PR" instruction.&lt;/p&gt;
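
&lt;p&gt;To make the &lt;code&gt;useMemo&lt;/code&gt; point concrete, this is the kind of pattern I mean (a reconstructed illustration, not code from the project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;import { useMemo } from "react";

type Props = { firstName: string; lastName: string };

// Before: memoizing a cheap string concatenation buys nothing.
function UserBadgeBefore({ firstName, lastName }: Props) {
  const label = useMemo(() =&amp;gt; `${firstName} ${lastName}`, [firstName, lastName]);
  return &amp;lt;span&amp;gt;{label}&amp;lt;/span&amp;gt;;
}

// After: deriving the value inline is simpler and just as fast here.
function UserBadgeAfter({ firstName, lastName }: Props) {
  return &amp;lt;span&amp;gt;{`${firstName} ${lastName}`}&amp;lt;/span&amp;gt;;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
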

&lt;h2&gt;
  
  
  What didn’t work
&lt;/h2&gt;

&lt;p&gt;Figma MCP still surprisingly underperforms compared to a simple screenshot.&lt;/p&gt;

&lt;p&gt;The multi-agent implementation was fun to try, but resulted in a 3,000+ line PR, which is far from optimal. Next time, after I have the multi-agent implementation plan and the contracts (types, interfaces) in the code, I'd try to solve the task on separate branches using worktrees.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked
&lt;/h2&gt;

&lt;p&gt;With all the refinements, I could ship the module in ~8 hours instead of the estimated 48 hours. Fully tested and documented. Two things stood out from the start:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Playwright MCP is a MUST in frontend development.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. Creating the API schema mapping, types, and API client based on the given OpenAPI specification worked perfectly, even on the first iteration.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I don't think AI frontend development is solved, but for the first time, the velocity feels real. One thing's for sure: I'll keep testing, and I'll keep writing when something interesting happens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>frontend</category>
      <category>ui</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
