<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmitry Bondarchuk</title>
    <description>The latest articles on DEV Community by Dmitry Bondarchuk (@ubcent).</description>
    <link>https://dev.to/ubcent</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3794488%2Fa64386d6-ba15-4c51-9552-d131b79d23fc.png</url>
      <title>DEV Community: Dmitry Bondarchuk</title>
      <link>https://dev.to/ubcent</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ubcent"/>
    <language>en</language>
    <item>
      <title>My AI Micromanager Got a Body</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Sun, 12 Apr 2026 17:55:33 +0000</pubDate>
      <link>https://dev.to/ubcent/my-ai-micromanager-got-a-body-2c9p</link>
      <guid>https://dev.to/ubcent/my-ai-micromanager-got-a-body-2c9p</guid>
      <description>&lt;p&gt;&lt;em&gt;A follow-up to &lt;a href="https://dev.to/ubcent/i-built-an-ai-micromanager-that-bullies-claude-code-500a"&gt;I Built an AI Micromanager That Bullies Claude Code&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;So a week ago I built a text-to-speech micromanager for the DEV April Fools' Challenge. It would nag Claude Code with escalating passive-aggressive remarks until your task was done. Dialog boxes. Desktop notifications. "The board has been notified." The whole bit.&lt;/p&gt;

&lt;p&gt;That was fun. Then it was Sunday. I had nothing going on.&lt;/p&gt;

&lt;p&gt;You can probably see where this is going.&lt;/p&gt;

&lt;h2&gt;
  
  
  The natural next step
&lt;/h2&gt;

&lt;p&gt;The TTS version had no face. No physical presence. It was just a disembodied voice yelling at you. Relatable, sure — but incomplete. A real micromanager needs to &lt;em&gt;loom&lt;/em&gt;. They need to pace. They need to make you feel observed even when nothing is being said.&lt;/p&gt;

&lt;p&gt;So I gave him a body.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/jQnL3jWkz24"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ai-micromanager&lt;/code&gt; now ships with a pixel-art mascot that appears above your terminal window and gets progressively more unhinged the longer Claude takes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What he does
&lt;/h2&gt;

&lt;p&gt;The escalation follows a strict corporate timeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Idle&lt;/strong&gt; — he just stands there. Breathing. Watching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stomping&lt;/strong&gt; — foot tapping begins. He's noticed the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pacing&lt;/strong&gt; — walks back and forth above your terminal window. Side to side. Relentless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status updates&lt;/strong&gt; — random speech bubbles. &lt;em&gt;"Any updates?"&lt;/em&gt; &lt;em&gt;"Can you at least give me a percentage?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whip phase&lt;/strong&gt; — yes. He has a whip now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finally&lt;/strong&gt; — when the task completes, he delivers a sarcastic closing line and goes back to idle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you start the mascot mid-task, it jumps straight to the correct phase based on elapsed time. He's always aware of how late you are.&lt;/p&gt;
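&lt;p&gt;That mid-task jump is just a lookup against elapsed time. Here is a minimal sketch of the idea; the phase thresholds are invented for illustration, not the values the app actually uses:&lt;/p&gt;

```python
import bisect

# Escalation timeline: the second at which each phase begins.
# These thresholds are invented for the sketch, not the shipped values.
PHASES = ["idle", "stomping", "pacing", "status updates", "whip"]
STARTS = [0, 30, 60, 120, 300]

def phase_for(elapsed_seconds):
    """Pick the phase for a task that has been running this long.

    Starting the mascot mid-task works the same way: read the recorded
    start timestamp, compute elapsed time, and jump straight here.
    """
    index = bisect.bisect_right(STARTS, elapsed_seconds) - 1
    return PHASES[index]
```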

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Same hook architecture as before — Claude Code supports lifecycle hooks that fire on events like &lt;code&gt;PreToolUse&lt;/code&gt; and &lt;code&gt;Stop&lt;/code&gt;. The Python hook writes a timestamp to a temp file when work starts and clears it when it ends.&lt;/p&gt;
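&lt;p&gt;That signalling can be as small as one file write per event. A sketch of the idea (the state-file path is an assumption, not the project's actual location):&lt;/p&gt;

```python
import time
from pathlib import Path

# Hypothetical state file; the real project's path may differ.
STATE = Path("/tmp/ai-micromanager.state")

def handle_hook(event):
    """Record when work starts, clear the marker when it ends.

    Claude Code invokes the hook with a JSON payload on stdin;
    the event name here would come from that payload.
    """
    if event == "PreToolUse":
        if not STATE.exists():
            STATE.write_text(str(time.time()))
    elif event == "Stop":
        STATE.unlink(missing_ok=True)
```

&lt;p&gt;Anything else on the machine can then answer "how late are we" by subtracting that timestamp from the current time.&lt;/p&gt;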

&lt;p&gt;The new part is a native macOS Swift app that polls that file every 500ms, detects the active terminal window (Terminal, iTerm2, Warp, Ghostty, and a few others), and positions an overlay window directly above it. The mascot lives in that overlay, running a 30fps animation loop tied to elapsed task time.&lt;/p&gt;

&lt;p&gt;No external dependencies. No cloud. Just a tiny man with a whip and a very short fuse.&lt;/p&gt;

&lt;p&gt;The pixel art was generated with &lt;a href="https://www.pixellab.ai" rel="noopener noreferrer"&gt;PixelLab&lt;/a&gt; — a genuinely excellent tool if you've ever wanted sprites without learning to draw.&lt;/p&gt;

&lt;h2&gt;
  
  
  Was this necessary
&lt;/h2&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;But it was Sunday, and the alternative was doing something useful, and here we are.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://github.com/ubcent/ai-micromanager" rel="noopener noreferrer"&gt;github.com/ubcent/ai-micromanager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No AI agents were harmed in the making of this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>humor</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Built an AI Micromanager That Bullies Claude Code</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:07:23 +0000</pubDate>
      <link>https://dev.to/ubcent/i-built-an-ai-micromanager-that-bullies-claude-code-500a</link>
      <guid>https://dev.to/ubcent/i-built-an-ai-micromanager-that-bullies-claude-code-500a</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/aprilfools-2026"&gt;DEV April Fools Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built &lt;code&gt;ai-micromanager&lt;/code&gt;, a small tool that answers an important question:&lt;/p&gt;

&lt;p&gt;What if your AI coding assistant had a manager so catastrophically unnecessary that even HR would ask it to "take a more human tone"?&lt;/p&gt;

&lt;p&gt;Most AI tools are obsessed with helping.&lt;br&gt;
They autocomplete your code.&lt;br&gt;
They fix your tests.&lt;br&gt;
They explain monads with the confidence of a man who has never paid taxes.&lt;/p&gt;

&lt;p&gt;That is not what I wanted.&lt;/p&gt;

&lt;p&gt;I wanted realism.&lt;/p&gt;

&lt;p&gt;I wanted the authentic modern software experience.&lt;/p&gt;

&lt;p&gt;I wanted Claude Code to feel like it was trying to refactor a Python file while a regional director of Strategic Alignment stood behind it breathing through his nose and saying, "Do we have an ETA on this?"&lt;/p&gt;

&lt;p&gt;So I built a joke hook for Claude Code that activates whenever a task takes longer than five seconds.&lt;/p&gt;

&lt;p&gt;After that, every five seconds, it does three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;speaks a passive-aggressive management line out loud&lt;/li&gt;
&lt;li&gt;sends a desktop notification&lt;/li&gt;
&lt;li&gt;opens a blocking dialog box, because tyranny should be multisensory&lt;/li&gt;
&lt;/ul&gt;
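&lt;p&gt;All three channels are reachable from the Python standard library on macOS. A sketch of one nag tick; the exact lines and AppleScript wording in the repo may differ:&lt;/p&gt;

```python
import subprocess

def nag_commands(line):
    """Build the three delivery commands for one management check-in."""
    notify = 'display notification "{}" with title "Your Manager"'.format(line)
    dialog = 'display dialog "{}" buttons {{"OK"}}'.format(line)
    return [
        ["say", line],                # speak it out loud
        ["osascript", "-e", notify],  # desktop notification
        ["osascript", "-e", dialog],  # blocking dialog box
    ]

def nag(line):
    for cmd in nag_commands(line):
        subprocess.run(cmd)
```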

&lt;p&gt;At first it sounds supportive, in the way a bear trap is technically supportive of your leg staying in one place.&lt;/p&gt;

&lt;p&gt;Then it escalates.&lt;/p&gt;

&lt;p&gt;It starts with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Just checking in, any updates?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What's the blocker here?"&lt;/li&gt;
&lt;li&gt;"This is impacting sprint velocity."&lt;/li&gt;
&lt;li&gt;"Leadership is asking for visibility."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And eventually it reaches its final corporate form:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I'm setting up a bridge call."&lt;/li&gt;
&lt;li&gt;"The board has been notified."&lt;/li&gt;
&lt;li&gt;"This is the worst thing that has ever happened to Q4."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So now, instead of simply writing code, your AI assistant gets to enjoy the full dignity of modern knowledge work:&lt;/p&gt;

&lt;p&gt;being interrupted by somebody whose entire skill set is converting one small delay into a company-wide weather event.&lt;/p&gt;

&lt;p&gt;This is not productivity software.&lt;/p&gt;

&lt;p&gt;This is workplace folklore in executable form.&lt;/p&gt;

&lt;p&gt;This is not a tool.&lt;/p&gt;

&lt;p&gt;This is an emotionally active org chart.&lt;/p&gt;
&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Observe a machine being managed with the intensity usually reserved for launch failures, data breaches, and a typo in a slide deck seen by a VP:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/7sZz8hzHFJQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The key feature is that the manager voice becomes more stressed over time, which means the software does not merely interrupt you.&lt;/p&gt;

&lt;p&gt;It develops a narrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The repo is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ubcent/ai-micromanager" rel="noopener noreferrer"&gt;https://github.com/ubcent/ai-micromanager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementation is beautifully petty.&lt;/p&gt;

&lt;p&gt;There are two main Python files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hooks/micromanager.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hooks/micromanager_nag.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One script listens for Claude Code hook events and decides when to start or stop the nonsense.&lt;/p&gt;

&lt;p&gt;The other script is the nonsense.&lt;/p&gt;

&lt;p&gt;That is the architecture.&lt;/p&gt;

&lt;p&gt;No cloud.&lt;br&gt;
No vector database.&lt;br&gt;
No agent swarm.&lt;br&gt;
No tasteful dashboard with rounded corners and the word "insights" in the top left.&lt;/p&gt;

&lt;p&gt;Just Python, timers, system dialogs, and the steady moral collapse of a machine that wanted to help and instead got assigned a stakeholder.&lt;/p&gt;

&lt;p&gt;The control flow is very simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude starts working.&lt;/li&gt;
&lt;li&gt;The hook starts a background nagging process.&lt;/li&gt;
&lt;li&gt;If Claude keeps working for more than five seconds, the manager begins its performance.&lt;/li&gt;
&lt;li&gt;Every five seconds the machine receives another demand for visibility, alignment, clarity, ownership, urgency, or healing.&lt;/li&gt;
&lt;li&gt;When the task finally ends, the manager stops, but not before one final closing remark to ensure the emotional damage lands cleanly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is essentially a watchdog, if the watchdog had gone to business school and described itself as "results-driven."&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;I built it with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Claude Code hooks&lt;/li&gt;
&lt;li&gt;macOS &lt;code&gt;say&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;osascript&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;an amount of personal experience that is difficult to discuss in a safe and constructive environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hook entry point watches for Claude Code lifecycle events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PreToolUse&lt;/code&gt; starts the background process&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Stop&lt;/code&gt; kills it and cleans up state&lt;/li&gt;
&lt;/ul&gt;
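&lt;p&gt;That start/stop pair maps onto two small OS operations. A sketch of the shape (the PID-file location is an assumption; the repo's bookkeeping may differ):&lt;/p&gt;

```python
import os
import signal
import subprocess
import sys
from pathlib import Path

# Hypothetical PID file; the repo's layout may differ.
PIDFILE = Path("/tmp/ai-micromanager.pid")

def start_nagger():
    """PreToolUse: launch the nag loop as a background process."""
    proc = subprocess.Popen([sys.executable, "hooks/micromanager_nag.py"])
    PIDFILE.write_text(str(proc.pid))

def stop_nagger():
    """Stop: kill the nagger and clean up state."""
    if PIDFILE.exists():
        try:
            os.kill(int(PIDFILE.read_text()), signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone; nothing left to silence
        PIDFILE.unlink()
```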

&lt;p&gt;The nagger process is where the art happens.&lt;/p&gt;

&lt;p&gt;It has a fixed escalation ladder of management dialogue, and it delivers each line with increasingly fast speech.&lt;/p&gt;
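&lt;p&gt;Increasingly fast speech is mostly one knob: macOS &lt;code&gt;say&lt;/code&gt; accepts a &lt;code&gt;-r&lt;/code&gt; flag for words per minute. A sketch with invented numbers (the repo's actual curve may differ):&lt;/p&gt;

```python
# Map escalation level to a words-per-minute rate for macOS `say -r`.
BASE_RATE = 180   # calm, merely disappointed
RATE_STEP = 25    # added per escalation level
MAX_RATE = 320    # full bridge-call panic

def speech_rate(level):
    return min(BASE_RATE + RATE_STEP * level, MAX_RATE)

def say_command(line, level):
    """Build the `say` invocation for one rung of the escalation ladder."""
    return ["say", "-r", str(speech_rate(level)), line]
```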

&lt;p&gt;This was important.&lt;/p&gt;

&lt;p&gt;I did not want a calm manager.&lt;/p&gt;

&lt;p&gt;A calm manager can be reasoned with.&lt;/p&gt;

&lt;p&gt;I wanted the specific energy of a man who says "Can we take this offline?" about a problem that is currently on fire in front of everyone.&lt;/p&gt;

&lt;p&gt;I wanted the tone of somebody who schedules a 7:30 AM sync called &lt;code&gt;Quick touch base&lt;/code&gt; and then opens with, "A few folks have concerns."&lt;/p&gt;

&lt;p&gt;I wanted the software equivalent of a Slack message that says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hey, just bubbling this up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and somehow causes your spine to factory reset.&lt;/p&gt;

&lt;p&gt;I also wrote tests, because if you are going to build a fake manager from hell, you should still maintain professional standards.&lt;/p&gt;

&lt;p&gt;The tests verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;startup and shutdown behavior&lt;/li&gt;
&lt;li&gt;escalation timing&lt;/li&gt;
&lt;li&gt;cleanup of temp files&lt;/li&gt;
&lt;li&gt;the final sarcastic sendoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, this project has better QA coverage than several actual managers I have met.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prize Category
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Community Favorite&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project should win Community Favorite because it unites developers across languages, stacks, time zones, and trauma backgrounds.&lt;/p&gt;

&lt;p&gt;You might write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Rust&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;COBOL in a basement under a government building&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But no matter who you are, you understand the universal horror of these phrases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"gentle reminder"&lt;/li&gt;
&lt;li&gt;"friendly ping"&lt;/li&gt;
&lt;li&gt;"circling back"&lt;/li&gt;
&lt;li&gt;"quick follow-up"&lt;/li&gt;
&lt;li&gt;"adding some urgency here"&lt;/li&gt;
&lt;li&gt;"just want to make sure this stays visible"&lt;/li&gt;
&lt;li&gt;"per my last message"&lt;/li&gt;
&lt;li&gt;"can we put together a short deck?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not words.&lt;/p&gt;

&lt;p&gt;These are cursed runes.&lt;/p&gt;

&lt;p&gt;This project takes that shared experience and turns it into a fully operational harassment machine for your laptop.&lt;/p&gt;

&lt;p&gt;And that, to me, is community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The tech industry spent years asking:&lt;/p&gt;

&lt;p&gt;"How can AI make developers more productive?"&lt;/p&gt;

&lt;p&gt;I asked a better question:&lt;/p&gt;

&lt;p&gt;"How can AI make developers feel like they are being lightly hunted through an open-plan office by a director named Brad?"&lt;/p&gt;

&lt;p&gt;The answer, it turns out, is surprisingly achievable.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ai-micromanager&lt;/code&gt; is stupid, mean, unnecessary, and extremely committed to the bit.&lt;/p&gt;

&lt;p&gt;Which is to say:&lt;/p&gt;

&lt;p&gt;it is my most realistic software project to date.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>418challenge</category>
      <category>showdev</category>
    </item>
    <item>
      <title>OmnethDB: Building a Memory System Agents Can Actually Trust</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Tue, 07 Apr 2026 18:05:27 +0000</pubDate>
      <link>https://dev.to/ubcent/omnethdb-building-a-memory-system-agents-can-actually-trust-15f</link>
      <guid>https://dev.to/ubcent/omnethdb-building-a-memory-system-agents-can-actually-trust-15f</guid>
      <description>&lt;p&gt;I have been working on Vexdo for a while now, trying to build an autonomous system that can ship code with as little human intervention as possible.&lt;/p&gt;

&lt;p&gt;Some of that work ended up in earlier write-ups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/ubcent/i-built-a-local-ai-dev-pipeline-that-reviews-its-own-code-before-opening-a-pr-geg"&gt;I built a local AI dev pipeline that reviews its own code before opening a PR&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ubcent/i-let-agents-write-my-code-they-got-stuck-in-a-loop-and-argued-with-each-other-36me"&gt;I let agents write my code. They got stuck in a loop and argued with each other&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ubcent/i-needed-a-workflow-engine-for-ai-agents-none-of-them-fit-so-i-built-one-mjf"&gt;I needed a workflow engine for AI agents. None of them fit, so I built one&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OmnethDB came out of a pretty simple thought within that broader Vexdo journey.&lt;/p&gt;

&lt;p&gt;If I want agents to work on a codebase with less and less human supervision, it would be really useful if they could accumulate project memory in roughly the same way people do.&lt;/p&gt;

&lt;p&gt;A person who has been on a project for a long time is usually much more effective than a newcomer. They know the weird edge cases, the old migrations, the intentional tradeoffs that look like bugs, the decisions that were reversed, and the things that are technically possible but architecturally wrong.&lt;/p&gt;

&lt;p&gt;I wanted something closer to that.&lt;/p&gt;

&lt;p&gt;Project link: &lt;a href="https://github.com/ubcent/omnethdb" rel="noopener noreferrer"&gt;github.com/ubcent/omnethdb&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Most agent memory systems are optimized for demos.&lt;/p&gt;

&lt;p&gt;They can retrieve semantically similar notes, summarize recent context, and make an assistant feel like it "remembers." That is enough to look impressive in a prototype.&lt;/p&gt;

&lt;p&gt;It is not enough to build something trustworthy.&lt;/p&gt;

&lt;p&gt;So I started building OmnethDB from a stricter premise: memory for agents should be treated as a serious system primitive, not as a vague cache wrapped around embeddings.&lt;/p&gt;

&lt;p&gt;The bar is higher than "it retrieved something relevant."&lt;/p&gt;

&lt;p&gt;The bar is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;can we inspect why this memory exists?&lt;/li&gt;
&lt;li&gt;can we see whether it was superseded?&lt;/li&gt;
&lt;li&gt;can we tell whether it is a stable fact or a historical event?&lt;/li&gt;
&lt;li&gt;can we audit what changed and why?&lt;/li&gt;
&lt;li&gt;can an agent retrieve current truth without silently mixing it with stale truth?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the problem I want OmnethDB to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem With "Memory"
&lt;/h2&gt;

&lt;p&gt;A lot of systems treat memory as one undifferentiated blob:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture facts&lt;/li&gt;
&lt;li&gt;implementation details&lt;/li&gt;
&lt;li&gt;temporary incidents&lt;/li&gt;
&lt;li&gt;outdated decisions&lt;/li&gt;
&lt;li&gt;inferred patterns&lt;/li&gt;
&lt;li&gt;random notes from previous runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything gets embedded. Everything becomes retrievable. And then the agent is expected to "figure it out."&lt;/p&gt;

&lt;p&gt;That sounds flexible, but in practice it creates ambiguity.&lt;/p&gt;

&lt;p&gt;When a fact changes, both the old and new versions often remain in the corpus with no explicit semantic difference. Retrieval might surface either one. Sometimes it surfaces both. The agent gets contaminated context and has to guess what is current.&lt;/p&gt;

&lt;p&gt;That is not a retrieval problem. It is a &lt;strong&gt;memory semantics&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;Agents do not just need more memory. They need memory with explicit rules around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioning&lt;/li&gt;
&lt;li&gt;lineage&lt;/li&gt;
&lt;li&gt;lifecycle&lt;/li&gt;
&lt;li&gt;provenance&lt;/li&gt;
&lt;li&gt;relation semantics&lt;/li&gt;
&lt;li&gt;current-vs-historical truth&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What OmnethDB Is
&lt;/h2&gt;

&lt;p&gt;OmnethDB is a &lt;strong&gt;versioned, governed, inspectable memory primitive&lt;/strong&gt; for autonomous agents.&lt;/p&gt;

&lt;p&gt;At the architecture level, it is intentionally opinionated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memories have kinds such as &lt;code&gt;Static&lt;/code&gt;, &lt;code&gt;Episodic&lt;/code&gt;, and &lt;code&gt;Derived&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;memory updates are explicit, not implicit&lt;/li&gt;
&lt;li&gt;lineage is preserved&lt;/li&gt;
&lt;li&gt;old memories are not deleted&lt;/li&gt;
&lt;li&gt;forgetting is a lifecycle mark, not silent removal&lt;/li&gt;
&lt;li&gt;relations are typed&lt;/li&gt;
&lt;li&gt;retrieval is designed to return the current version of knowledge, not a probabilistic blend of history and present&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last part matters a lot.&lt;/p&gt;

&lt;p&gt;In the OmnethDB architecture, if memory A updates memory B, that is not just metadata for humans to inspect later. It changes the active truth of the lineage. There is exactly one latest memory in a lineage at any point in time.&lt;/p&gt;
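&lt;p&gt;That contract is small enough to state in code. The following is a toy model of the idea, not OmnethDB's actual schema or API:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Memory:
    """Toy model of a versioned memory record (not the real schema)."""
    id: str
    kind: str                         # "Static", "Episodic", or "Derived"
    text: str
    supersedes: Optional[str] = None  # id of the memory this one updates

def latest(lineage):
    """Resolve a lineage to its single current memory.

    Every record except the newest is the target of some supersedes
    pointer; the one record nobody supersedes is the active truth.
    """
    superseded = {m.supersedes for m in lineage if m.supersedes}
    current = [m for m in lineage if m.id not in superseded]
    assert len(current) == 1, "a lineage has exactly one latest memory"
    return current[0]
```

&lt;p&gt;Retrieval that respects this model filters out everything superseded before ranking, instead of letting old versions compete on similarity alone.&lt;/p&gt;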

&lt;p&gt;That gives agents a much stronger contract than "here are some similar snippets, good luck."&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters In Practice
&lt;/h2&gt;

&lt;p&gt;The dangerous failure mode in agent systems is not forgetting.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;remembering the wrong thing with high confidence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If an agent is helping with debugging, migrations, architecture work, or product decisions, stale memory is often worse than missing memory. Missing memory usually creates uncertainty. Stale memory creates false certainty.&lt;/p&gt;

&lt;p&gt;That is why I treat memory in OmnethDB as something that must be inspectable and auditable, not just searchable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Corpus Came From
&lt;/h2&gt;

&lt;p&gt;The corpus behind these examples was not invented for the article.&lt;/p&gt;

&lt;p&gt;I connected OmnethDB to Claude Code as an MCP server and used it inside a real pet project for about a week.&lt;/p&gt;

&lt;p&gt;During that time, the memory corpus accumulated the kind of facts that actually show up in day-to-day engineering work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architectural boundaries&lt;/li&gt;
&lt;li&gt;infra edge cases&lt;/li&gt;
&lt;li&gt;intentional tradeoffs that look like bugs without context&lt;/li&gt;
&lt;li&gt;superseded plans&lt;/li&gt;
&lt;li&gt;implementation details that matter operationally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because the interesting question is not whether a memory system can store polished examples.&lt;/p&gt;

&lt;p&gt;The interesting question is whether it stays useful when the knowledge is messy, evolving, and grounded in real work.&lt;/p&gt;

&lt;p&gt;That is the environment these examples came from.&lt;/p&gt;

&lt;p&gt;Also, one small warning before the examples: names like &lt;code&gt;mulder&lt;/code&gt;, &lt;code&gt;palantir&lt;/code&gt;, &lt;code&gt;gringotts&lt;/code&gt;, and &lt;code&gt;chronicle&lt;/code&gt; are just internal service names from my pet project. I have a bad habit of giving services weird names and then making future-me work harder to remember what any of them actually do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Corpus Example 1: An Intentional Auth Decision
&lt;/h2&gt;

&lt;p&gt;Here is the kind of memory that benefits from strong semantics:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rotateRefreshToken: false&lt;/code&gt; in an OIDC config was explicitly recorded as &lt;strong&gt;intentional&lt;/strong&gt;, not a bug.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[static] gringotts: rotateRefreshToken: false in configOIDC.ts is intentional, not a bug.

Reason: default oidc-provider v8 rotates refresh tokens on every use. With
parallel refresh requests, reuse detection can revoke the whole grant,
including newly issued tokens, leading to permanent 401 failures.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The memory did not just store the final conclusion. It captured the operational reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;default refresh rotation marked tokens as consumed&lt;/li&gt;
&lt;li&gt;parallel refresh requests could trigger token reuse detection&lt;/li&gt;
&lt;li&gt;reuse detection revoked the whole grant&lt;/li&gt;
&lt;li&gt;users could receive fresh tokens that were already dead&lt;/li&gt;
&lt;li&gt;the result was permanent &lt;code&gt;401&lt;/code&gt; failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of fact that agents routinely mishandle if memory is fuzzy.&lt;/p&gt;

&lt;p&gt;Without disciplined memory, a future agent might see &lt;code&gt;rotateRefreshToken: false&lt;/code&gt; and "fix" it back to &lt;code&gt;true&lt;/code&gt; because rotating refresh tokens sounds more secure in the abstract.&lt;/p&gt;

&lt;p&gt;With governed memory, the system can preserve the actual local truth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this was a deliberate tradeoff&lt;/li&gt;
&lt;li&gt;the rationale is known&lt;/li&gt;
&lt;li&gt;the memory is stable until superseded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is much closer to how strong engineering teams actually reason.&lt;/p&gt;




&lt;h2&gt;
  
  
  Corpus Example 2: Nginx, Subdomains, And The Difference Between A Symptom And A Cause
&lt;/h2&gt;

&lt;p&gt;Another memory in the corpus captured a subtle but high-impact behavior in nginx routing for subdomains.&lt;/p&gt;

&lt;p&gt;The observed issue was simple: relative links were broken on artist subdomains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[static] client/prod.nginx.conf.sigil: subdomain block rewrites location / to
/user/$username.

Critical nginx behavior: proxy_pass with URI replaces the matched location
prefix. Request /foo becomes /user/$usernamefoo, so relative links break.
Only the root / works correctly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the memory did not stop at the symptom. It preserved the real mechanism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;location /&lt;/code&gt; rewrote traffic to &lt;code&gt;/user/$username&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;proxy_pass&lt;/code&gt; with a URI replaces the matched location prefix&lt;/li&gt;
&lt;li&gt;requests like &lt;code&gt;/users&lt;/code&gt; became &lt;code&gt;/user/artistusers&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;only the root path worked correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That memory then pointed to the practical fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use a top-level app URL&lt;/li&gt;
&lt;li&gt;generate absolute internal links&lt;/li&gt;
&lt;li&gt;avoid relying on relative navigation from the rewritten subdomain path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a good example of memory that is not merely descriptive. It is operationally useful because it encodes causality, not just observed breakage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Corpus Example 3: Why Lineage Matters More Than Similarity
&lt;/h2&gt;

&lt;p&gt;One of the clearest examples in the corpus is a calendar-related architectural shift.&lt;/p&gt;

&lt;p&gt;At one point, memory reflected a plan involving a separate &lt;code&gt;chronicle&lt;/code&gt; service emitting &lt;code&gt;calendar:event:changed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A later memory updated that reality: calendar functionality lives inside &lt;code&gt;palantir&lt;/code&gt;, not a standalone &lt;code&gt;chronicle&lt;/code&gt; service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v1:
[static] New pattern: calendar:event:changed from chronicle (port 3007)

v2:
[static] CalendarModule is implemented inside palantir (port 3005) - a separate
chronicle service is not created.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your system only does semantic retrieval, both memories may look relevant forever.&lt;/p&gt;

&lt;p&gt;That is the core problem.&lt;/p&gt;

&lt;p&gt;They are both about calendar architecture.&lt;br&gt;
They are both high-similarity.&lt;br&gt;
They are both "useful context."&lt;/p&gt;

&lt;p&gt;But only one is the current truth.&lt;/p&gt;

&lt;p&gt;OmnethDB's lineage model is designed precisely for this case. The past remains auditable, but the present remains explicit. Historical memory is still available for inspection without silently driving live decisions.&lt;/p&gt;

&lt;p&gt;That distinction is one of the main reasons we think memory needs stronger primitives than vector search alone.&lt;/p&gt;


&lt;h2&gt;
  
  
  Corpus Example 4: Structured Memory For Retrieval Boundaries
&lt;/h2&gt;

&lt;p&gt;Another good example comes from search architecture.&lt;/p&gt;

&lt;p&gt;The corpus records that &lt;code&gt;mulder&lt;/code&gt; indexes profiles into OpenSearch, but the client does not query &lt;code&gt;mulder&lt;/code&gt; directly. Public search flows still go through CMS GraphQL and PostgreSQL unless a specific public search API is added.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[static] mulder fully indexes profiles into OpenSearch, but the client never
queries mulder directly - all search goes through CMS GraphQL -&amp;gt; PostgreSQL.
OpenSearch is currently a "dead" index without a public search API.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That sounds like a small implementation detail, but it is actually a product and architecture boundary.&lt;/p&gt;

&lt;p&gt;If an agent misses that distinction, it may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;propose the wrong integration point&lt;/li&gt;
&lt;li&gt;wire a client directly into the wrong service&lt;/li&gt;
&lt;li&gt;assume OpenSearch is already serving user-facing search traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The memory is useful because it tells the agent not just what exists, but what role it currently plays in the system.&lt;/p&gt;

&lt;p&gt;Again: explicit behavior beats ambient context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Not Anti-Embedding, Anti-Hand-Waving
&lt;/h2&gt;

&lt;p&gt;To be clear, this is not an anti-embedding argument.&lt;/p&gt;

&lt;p&gt;Embeddings are useful.&lt;br&gt;
Similarity search is useful.&lt;br&gt;
Semantic retrieval is useful.&lt;/p&gt;

&lt;p&gt;But embeddings alone do not give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supersession semantics&lt;/li&gt;
&lt;li&gt;lifecycle control&lt;/li&gt;
&lt;li&gt;derivation provenance&lt;/li&gt;
&lt;li&gt;auditability&lt;/li&gt;
&lt;li&gt;current-version guarantees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarity can tell you what is related.&lt;br&gt;
It cannot tell you what is canonical.&lt;/p&gt;

&lt;p&gt;That is why we think memory systems for agents need stronger structure than "store chunks, embed them, and retrieve the top K."&lt;/p&gt;




&lt;h2&gt;
  
  
  Advisory As Memory Lint
&lt;/h2&gt;

&lt;p&gt;Another part of the idea that I find increasingly important is the advisory layer around memory quality.&lt;/p&gt;

&lt;p&gt;One useful way to think about it is: &lt;strong&gt;advisory is a bit like lint for memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A linter does not usually rewrite your whole program for you. It points at suspicious structure, inconsistent style, dead code, or likely mistakes and asks you to make an explicit decision.&lt;/p&gt;

&lt;p&gt;I think memory systems need something similar.&lt;/p&gt;

&lt;p&gt;Over time, a corpus accumulates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stale facts that should probably be superseded&lt;/li&gt;
&lt;li&gt;duplicate memories that should be merged or retired&lt;/li&gt;
&lt;li&gt;weak derived patterns with shaky provenance&lt;/li&gt;
&lt;li&gt;memories that are still retrievable but no longer belong on the hot path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not exactly retrieval, and it is not exactly storage either. It is memory hygiene.&lt;/p&gt;

&lt;p&gt;So one direction I care about in OmnethDB is an advisory layer that can surface these issues the way a linter surfaces code smells: not by pretending to know product truth automatically, but by making memory quality problems visible and actionable.&lt;/p&gt;
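&lt;p&gt;Two of those checks are mechanical enough to sketch. The record shape, rule names, and threshold below are invented for illustration, not OmnethDB's advisory design:&lt;/p&gt;

```python
import time

# Invented threshold; a real advisory layer would tune or learn this.
STALE_AFTER_DAYS = 90

def lint_corpus(memories, now=None):
    """Flag memory-hygiene issues without deciding product truth.

    Each memory is a dict with 'id', 'text', 'created_at' (epoch
    seconds), and 'active' fields: a stand-in shape for the sketch.
    """
    now = now or time.time()
    findings = []
    seen = {}
    for m in memories:
        if not m["active"]:
            continue
        # Duplicate active memories should be merged or retired.
        if m["text"] in seen:
            findings.append(("duplicate", m["id"], seen[m["text"]]))
        seen[m["text"]] = m["id"]
        # Old, never-superseded facts deserve a human look.
        age_days = (now - m["created_at"]) / 86400
        if age_days > STALE_AFTER_DAYS:
            findings.append(("possibly-stale", m["id"], None))
    return findings
```

&lt;p&gt;Like a linter, it only points; a human (or a supervised agent) decides whether to supersede, merge, or leave each flagged memory alone.&lt;/p&gt;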

&lt;p&gt;That feels like an important missing piece. A serious memory system should not just remember. It should also help you keep what it remembers legible, current, and worth trusting.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Design Standard
&lt;/h2&gt;

&lt;p&gt;The standard we care about is not "good enough to demo."&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;semantic correctness&lt;/li&gt;
&lt;li&gt;explicit behavior over hidden magic&lt;/li&gt;
&lt;li&gt;inspectable state transitions&lt;/li&gt;
&lt;li&gt;durable provenance&lt;/li&gt;
&lt;li&gt;retrieval that respects current truth&lt;/li&gt;
&lt;li&gt;enough structure that a strong engineer can trust the system under scrutiny&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If agent memory is going to become core infrastructure, it should be built with the seriousness we apply to databases, queues, and auth systems.&lt;/p&gt;

&lt;p&gt;Not as a toy.&lt;br&gt;
Not as a vibe.&lt;br&gt;
As infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Gets Interesting
&lt;/h2&gt;

&lt;p&gt;The exciting part is not just that agents can remember more.&lt;/p&gt;

&lt;p&gt;It is that they can remember in a way that supports disciplined reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is current&lt;/li&gt;
&lt;li&gt;what changed&lt;/li&gt;
&lt;li&gt;what was superseded&lt;/li&gt;
&lt;li&gt;what is historical but still worth inspecting&lt;/li&gt;
&lt;li&gt;what was derived from multiple sources&lt;/li&gt;
&lt;li&gt;what should remain visible without being allowed to silently control decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the path from "agent memory" as a demo feature to memory as a trustworthy primitive.&lt;/p&gt;

&lt;p&gt;That is what I am trying to build with OmnethDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If we want agents that can operate safely in real codebases and real systems, memory has to become more than retrieval sugar.&lt;/p&gt;

&lt;p&gt;It has to become something we can inspect, govern, version, and audit.&lt;/p&gt;

&lt;p&gt;That is the bet behind OmnethDB:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;memory should be queryable, but it should also be legible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And when the truth changes, the system should know the difference between history and the present.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>memory</category>
      <category>agents</category>
    </item>
    <item>
      <title>I Built a Cookie Banner That Makes It Technically Possible to Reject Cookies</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:23:59 +0000</pubDate>
      <link>https://dev.to/ubcent/i-built-a-cookie-banner-that-makes-it-technically-possible-to-reject-cookies-1ank</link>
      <guid>https://dev.to/ubcent/i-built-a-cookie-banner-that-makes-it-technically-possible-to-reject-cookies-1ank</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/aprilfools-2026"&gt;DEV April Fools Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A React component library that faithfully recreates the experience of trying to reject cookies on a modern enterprise website.&lt;/p&gt;

&lt;p&gt;By which I mean: you can reject. Technically. Eventually. After a brief sequence of clarifications.&lt;/p&gt;

&lt;p&gt;The package is called &lt;code&gt;react-consent-chaos&lt;/code&gt;. It ships a &lt;code&gt;ConsentManagerFromHell&lt;/code&gt; modal with three &lt;code&gt;hellMode&lt;/code&gt; settings (&lt;code&gt;"polite"&lt;/code&gt;, &lt;code&gt;"pushy"&lt;/code&gt;, &lt;code&gt;"comically-evil"&lt;/code&gt;), three &lt;code&gt;rejectDifficulty&lt;/code&gt; levels (&lt;code&gt;"annoying"&lt;/code&gt;, &lt;code&gt;"absurd"&lt;/code&gt;, &lt;code&gt;"nightmare"&lt;/code&gt;), and a prop called &lt;code&gt;allowRejectEventually&lt;/code&gt; whose &lt;code&gt;false&lt;/code&gt; case is described in the README as: &lt;em&gt;rejection remains aspirational&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I want to be very clear that I wrote that documentation willingly.&lt;/p&gt;

&lt;p&gt;The default company name is &lt;code&gt;"Consent Dynamics"&lt;/code&gt;. The default vendor count is &lt;code&gt;1,847&lt;/code&gt;. Both are configurable. Neither number has been audited.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Here is what happens when a user opens the modal on &lt;code&gt;hellMode="pushy"&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrgu24q3lbiuffr79qax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrgu24q3lbiuffr79qax.png" alt="Initial consent modal" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The badge says &lt;strong&gt;"Partner-Aligned"&lt;/strong&gt;. The accept button says &lt;strong&gt;"Accept all and continue"&lt;/strong&gt;. The reject button says &lt;strong&gt;"Reject optional cookies"&lt;/strong&gt; and underneath that, in smaller text: &lt;strong&gt;"Step 1 / 5"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Step 1 of 5.&lt;/p&gt;

&lt;p&gt;The user clicks it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0yvy5yr22xiior1e8f2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0yvy5yr22xiior1e8f2.png" alt="Reject flow escalation" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The button now says &lt;strong&gt;"Confirm reduced experience"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The status box updates to: &lt;em&gt;"Our partners will be disappointed."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The user, briefly, feels something.&lt;/p&gt;

&lt;p&gt;They click again. The button becomes &lt;strong&gt;"Acknowledge optimization loss"&lt;/strong&gt;. Then &lt;strong&gt;"Continue without joy"&lt;/strong&gt;. Then, on step five — visibly exhausted, through gritted teeth — &lt;strong&gt;"Reject anyway"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At this point the modal finally closes and fires the &lt;code&gt;onRejectAll&lt;/code&gt; callback with the message: &lt;strong&gt;"Fine. We respect your persistence."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is considered the happy path.&lt;/p&gt;




&lt;p&gt;The preferences panel is where the package really earns its name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkemuebs3x1ijfhwdslo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkemuebs3x1ijfhwdslo.png" alt="Preferences panel" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every consent category has a description written in the voice of a mid-level enterprise product manager who is doing their best:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Necessary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required for the continued existence of the button.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Measures intent, confusion, and funnel sincerity.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Personalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Allows the interface to remember your boundaries and negotiate them.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Legitimate interest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A timeless category with exceptional self-esteem.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mood tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detects mild reluctance for premium reassurance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Productivity optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ensures banners arrive during your most fragile focus windows.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;"Productivity optimization" is a real consent category in this component. It is off by default. You're welcome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy263fb963gd4imu5331y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy263fb963gd4imu5331y.png" alt="Event log" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/ubcent/react-consent-chaos" rel="noopener noreferrer"&gt;https://github.com/ubcent/react-consent-chaos&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;react-consent-chaos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ConsentManagerFromHell&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;react-consent-chaos&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;react-consent-chaos/styles.css&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ConsentManagerFromHell&lt;/span&gt;
  &lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;open&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;onOpenChange&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;setOpen&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;companyName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Synergy Harvest"&lt;/span&gt;
  &lt;span class="na"&gt;hellMode&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"pushy"&lt;/span&gt;
  &lt;span class="na"&gt;rejectDifficulty&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"absurd"&lt;/span&gt;
  &lt;span class="na"&gt;allowRejectEventually&lt;/span&gt;
  &lt;span class="na"&gt;onAcceptAll&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Excellent. Your journey has been optimized.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;onRejectAll&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Fine. We respect your persistence.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;onSavePreferences&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleSavePreferences&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The reject button in action
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ConsentStepButton&lt;/code&gt; walks users through a dynamically generated sequence of labels. At &lt;code&gt;difficulty="nightmare"&lt;/code&gt; and &lt;code&gt;mode="comically-evil"&lt;/code&gt;, the full sequence is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reject optional cookies&lt;/li&gt;
&lt;li&gt;Confirm reduced experience&lt;/li&gt;
&lt;li&gt;Acknowledge optimization loss&lt;/li&gt;
&lt;li&gt;Continue without joy&lt;/li&gt;
&lt;li&gt;Reject anyway&lt;/li&gt;
&lt;li&gt;Decline enhanced destiny&lt;/li&gt;
&lt;li&gt;Deny data to the revenue temple&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 7 of 7 is &lt;strong&gt;"Deny data to the revenue temple"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I am not sure I can defend this. I am also not taking it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  The quiet part, out loud
&lt;/h3&gt;

&lt;p&gt;When a user clicks the reject button before they've completed enough steps, two things happen in the source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user attempted informed choice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;consent friction increased&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are real lines. In the production bundle. Labeled as warnings.&lt;/p&gt;

&lt;p&gt;I briefly considered upgrading them to &lt;code&gt;console.error&lt;/code&gt;. I chose not to because I have some remaining sense of proportion.&lt;/p&gt;

&lt;h3&gt;
  
  
  The preferences panel also has a feature
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;getManipulatedPreferences&lt;/code&gt; runs silently on every save. In &lt;code&gt;pushy&lt;/code&gt; mode, it re-enables &lt;code&gt;legitimateInterest&lt;/code&gt; regardless of what the toggle says. In &lt;code&gt;comically-evil&lt;/code&gt; mode, it also re-enables &lt;code&gt;advertising&lt;/code&gt; and &lt;code&gt;partnerSharing&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The function is not hidden. It is named &lt;code&gt;getManipulatedPreferences&lt;/code&gt;. It is called inside &lt;code&gt;handleSavePreferences&lt;/code&gt;. Any developer who reads the file will find it immediately.&lt;/p&gt;

&lt;p&gt;This is not obfuscation. It is transparency with a very specific energy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The hook, for enthusiasts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;escalation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useConsentEscalation&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;nightmare&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;comically-evil&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;allowRejectEventually&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// escalation.statusMessage →&lt;/span&gt;
&lt;span class="c1"&gt;//   "Compliance theater is nearing completion."&lt;/span&gt;
&lt;span class="c1"&gt;//   "Revenue sadness has been acknowledged."&lt;/span&gt;
&lt;span class="c1"&gt;//   "Your defiance has entered the final audit lane."&lt;/span&gt;
&lt;span class="c1"&gt;//   "Fine. We respect your persistence."  ← only on the last step&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;useConsentEscalation&lt;/code&gt; is exported separately for developers who want to implement their own rejection experience without using the full modal. It returns &lt;code&gt;canReject&lt;/code&gt;, &lt;code&gt;advanceRejectFlow()&lt;/code&gt;, &lt;code&gt;resetRejectFlow()&lt;/code&gt;, and a rotating &lt;code&gt;statusMessage&lt;/code&gt; that cycles through mode-appropriate passive aggression until the user has exhausted their allocation of steps.&lt;/p&gt;

&lt;p&gt;After that: &lt;em&gt;"Fine. We respect your persistence."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. That's the only acknowledgment available. There is no version of this component that says "sure, no problem."&lt;/p&gt;
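&lt;p&gt;For the curious, the state machine the hook is described as wrapping can be sketched without React at all. This is a hypothetical reimplementation based only on the API described above; the intermediate messages are stand-ins, though the final line is the real one.&lt;/p&gt;

```typescript
// Hypothetical, framework-free sketch of the escalation state machine.
// Step counts and intermediate messages are invented; only the final
// message matches the package's described copy.

function createEscalation(stepsBeforeSuccess: number, messages: string[]) {
  let step = 0;
  const canReject = () => step >= stepsBeforeSuccess;
  return {
    canReject,
    // Cycles through mode-appropriate passive aggression until the user
    // has exhausted their allocation of steps.
    statusMessage: () =>
      canReject()
        ? "Fine. We respect your persistence."
        : messages[step % messages.length],
    advanceRejectFlow: () => { if (step < stepsBeforeSuccess) step++; },
    resetRejectFlow: () => { step = 0; },
  };
}

const esc = createEscalation(3, [
  "Our partners will be disappointed.",
  "Your defiance is being reviewed.",
]);
esc.advanceRejectFlow();
esc.advanceRejectFlow();
esc.advanceRejectFlow();
// Only now does canReject() turn true and the final message appear.
```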

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;The package is TypeScript-first, bundles to ESM + CJS via &lt;code&gt;tsup&lt;/code&gt;, and ships styles as a separate import. No runtime dependencies beyond React 18+.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;hellMode&lt;/code&gt; prop controls copy across the entire modal. In &lt;code&gt;comically-evil&lt;/code&gt; mode the title changes to &lt;em&gt;"Universal Consent Acceleration Layer"&lt;/em&gt;, the badge becomes &lt;strong&gt;"Legally Adjacent"&lt;/strong&gt;, the accept button becomes &lt;strong&gt;"Excellent, optimize me"&lt;/strong&gt;, and the manage preferences button becomes &lt;strong&gt;"Audit the damage"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;polite&lt;/code&gt; mode — the tamest setting — the helper text reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Your privacy matters deeply to us within commercially reasonable limits.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the polite version.&lt;/p&gt;

&lt;p&gt;The preferences panel heading in &lt;code&gt;comically-evil&lt;/code&gt; mode is &lt;em&gt;"Manual resistance configuration"&lt;/em&gt;. The component also tracks how many times the user has attempted to save preferences (&lt;code&gt;saveCount&lt;/code&gt;) and displays it in the UI, because if you're going to be hostile you should at least be transparent about it.&lt;/p&gt;

&lt;p&gt;The footer legalese reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Necessary cookies are mandatory, spiritually and technically.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Optional categories may remain enabled where enterprise momentum requires.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both lines are in the real component. Both survived every review I gave them.&lt;/p&gt;

&lt;p&gt;The entire thing is accessible: keyboard navigation works, ARIA roles are set, focus is managed on open and panel change, and the status message during the reject flow is wrapped in &lt;code&gt;aria-live="polite"&lt;/code&gt; so screen reader users receive every update in real time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Our partners will be disappointed."&lt;/em&gt; — delivered accessibly, to everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prize Category
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Anti-Value Proposition
&lt;/h3&gt;

&lt;p&gt;The value of this package is entirely negative and precisely calibrated. Setting &lt;code&gt;allowRejectEventually={false}&lt;/code&gt; renders a modal where rejection is permanently queued. The progress bar advances. The steps are counted. The button labels grow more resigned with each click. Nothing ever resolves.&lt;/p&gt;

&lt;p&gt;The README describes this as: &lt;em&gt;rejection remains aspirational&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I am genuinely proud of that sentence and also slightly concerned about what it says about me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creativity
&lt;/h3&gt;

&lt;p&gt;The original insight is that dark patterns are not hidden — they're just undocumented. &lt;code&gt;react-consent-chaos&lt;/code&gt; documents all of them. &lt;code&gt;getManipulatedPreferences&lt;/code&gt; is a named export. The &lt;code&gt;console.warn&lt;/code&gt; lines are readable in any debugger. &lt;code&gt;allowRejectEventually={false}&lt;/code&gt; is a prop you pass on purpose.&lt;/p&gt;

&lt;p&gt;The joke is that naming the manipulation doesn't make it less manipulative. It just makes it honest manipulation. Which is somehow worse.&lt;/p&gt;

&lt;p&gt;Also: the &lt;code&gt;nightmare&lt;/code&gt; difficulty allows up to seven steps, and if you set &lt;code&gt;rejectStepsBeforeSuccess&lt;/code&gt; to a number beyond the label pool, the fallback label is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Continue with regrettable self-determination N&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;where &lt;code&gt;N&lt;/code&gt; is the step number. This was not planned. It emerged naturally from the implementation and I kept it because it felt right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Execution
&lt;/h3&gt;

&lt;p&gt;Real library, real build pipeline, real types, real accessibility, real hook API. The &lt;code&gt;ConsentStepButton&lt;/code&gt; and &lt;code&gt;useConsentEscalation&lt;/code&gt; are individually exported for anyone who wants to compose their own dark-pattern UI from primitives. The component handles controlled and uncontrolled state. The overlay is non-closable by default. Escape is gated behind &lt;code&gt;overlayClosable&lt;/code&gt;. Focus returns correctly.&lt;/p&gt;

&lt;p&gt;It is a well-engineered component whose entire purpose is to demonstrate how much engineering goes into making people feel bad about wanting privacy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Quality
&lt;/h3&gt;

&lt;p&gt;Every string in this codebase was written as if it would appear in an actual product and reviewed by an actual legal team in an actual company that has lost the thread. &lt;em&gt;"Delight generation may be reduced."&lt;/em&gt; &lt;em&gt;"Your independence is being carefully reviewed."&lt;/em&gt; &lt;em&gt;"A timeless category with exceptional self-esteem."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are not captions. They are labels. They render in the UI. They are wrapped in ARIA attributes and shipped in a bundle.&lt;/p&gt;

&lt;p&gt;The demo's event log initializes with the entry: &lt;em&gt;"Awaiting a fresh compliance event."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I think about that line sometimes.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>418challenge</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Needed a Workflow Engine for AI Agents. None of Them Fit. So I Built One.</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:08:10 +0000</pubDate>
      <link>https://dev.to/ubcent/i-needed-a-workflow-engine-for-ai-agents-none-of-them-fit-so-i-built-one-mjf</link>
      <guid>https://dev.to/ubcent/i-needed-a-workflow-engine-for-ai-agents-none-of-them-fit-so-i-built-one-mjf</guid>
      <description>&lt;p&gt;&lt;em&gt;Part three of the vexdo series — after &lt;a href="https://dev.to/ubcent/i-built-a-local-ai-dev-pipeline-that-reviews-its-own-code-before-opening-a-pr-geg"&gt;building a local AI dev pipeline&lt;/a&gt; and &lt;a href="https://dev.to/ubcent/i-let-agents-write-my-code-they-got-stuck-in-a-loop-and-argued-with-each-other-36me"&gt;moving it to the cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;vexdo works. I use it. It handles the boring parts of shipping code — the implement-review-fix loop that used to eat my afternoons.&lt;/p&gt;

&lt;p&gt;At some point I started thinking: could this be something more than a personal tool? Not just a CLI I run on my machine, but an actual product. Something with a proper foundation, not held together with state files and hardcoded pipeline logic.&lt;/p&gt;

&lt;p&gt;And that's where things got complicated.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with "just use a workflow engine"
&lt;/h2&gt;

&lt;p&gt;The obvious answer when you want to orchestrate multi-step processes is: use a workflow engine. Airflow, Temporal, BullMQ, Prefect — there are plenty of them, and some are very good at what they do.&lt;/p&gt;

&lt;p&gt;The problem is what they're good at.&lt;/p&gt;

&lt;p&gt;These engines are built around a core assumption: &lt;strong&gt;you know your steps upfront&lt;/strong&gt;. You define a DAG — nodes, edges, dependencies — and the engine executes it. The graph is fixed. That's the contract.&lt;/p&gt;

&lt;p&gt;For traditional workflows, this is fine. ETL pipelines, CI/CD jobs, batch processing — you know what needs to happen before it starts happening.&lt;/p&gt;

&lt;p&gt;AI agents break this assumption.&lt;/p&gt;

&lt;p&gt;Here's a concrete example from vexdo. When an agent starts working on a task, it first analyzes the codebase — what files are involved, which modules are sensitive, how deep the change goes. But the &lt;em&gt;result&lt;/em&gt; of that analysis determines what comes next.&lt;/p&gt;

&lt;p&gt;Simple task touching one service? Skip the design council, go straight to implementation.&lt;/p&gt;

&lt;p&gt;Task that touches the payments module? Spawn a dedicated security review. If it also changes the API schema, spawn a contract validation step. If the codebase has low test coverage, spawn a test generation pass first.&lt;/p&gt;

&lt;p&gt;None of this is knowable when the workflow starts. The agent discovers it by doing the work.&lt;/p&gt;

&lt;p&gt;If you try to handle this with a fixed DAG, you end up with one of two bad options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-define every possible branch&lt;/strong&gt; — the graph becomes a sprawling mess of conditional edges, and half the nodes never run. You're essentially writing a decision tree disguised as a workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Treat the whole thing as one big step&lt;/strong&gt; — you lose parallelism, observability, retry granularity, and the ability to checkpoint. Your "workflow" is just a black box that either finishes or fails.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither option is good. What you actually want is a workflow that can &lt;em&gt;extend itself&lt;/em&gt; at runtime — where completing a step can add new steps to the graph, based on what was discovered.&lt;/p&gt;




&lt;h2&gt;
  
  
  The core idea: a graph that grows
&lt;/h2&gt;

&lt;p&gt;I've been calling this a &lt;strong&gt;living graph&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a traditional workflow engine, the DAG is immutable after you start a run. In a living graph, nodes can spawn new nodes as part of their output. The graph is a starting point, not a constraint.&lt;/p&gt;

&lt;p&gt;When a node completes, its result can include a list of new nodes to add to the graph — with their own dependencies, retry policies, and compensation logic. The scheduler picks them up and runs them exactly like any other node. From the engine's perspective, there's no difference between a node that was defined at the start and one that was spawned mid-run.&lt;/p&gt;

&lt;p&gt;This is the key idea behind &lt;strong&gt;Grael&lt;/strong&gt; — the workflow engine I built specifically for AI agent pipelines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workflow starts:
  [scout] → ???

Scout runs, analyzes the codebase, returns:
  output: { complexity: "high", touchedModules: ["payments", "api"] }
  spawn:  [
    { id: "council",   dependsOn: ["scout"] },
    { id: "implement", dependsOn: ["council"] },
    { id: "sec-review", dependsOn: ["implement"] },  ← spawned because: payments
    { id: "reviewer",  dependsOn: ["implement"] },
    { id: "arbiter",   dependsOn: ["reviewer", "sec-review"] },
    { id: "pr",        dependsOn: ["arbiter"] }
  ]

Graph is now:
  [scout] → [council] → [implement] → [reviewer] ──→ [arbiter] → [pr]
                                   └→ [sec-review] ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spawn happened inside the scout's activity — the engine didn't know about any of this when the workflow started.&lt;/p&gt;

&lt;p&gt;Another example: spec contradictions. The arbiter reviews the diff and notices that what the executor built doesn't match the original spec — not a code quality issue, but a genuine conflict in requirements. Maybe the spec said "use cursor-based pagination" but the executor implemented offset-based because an existing helper made it easier. Maybe two requirements in the spec are mutually exclusive and the executor quietly picked one.&lt;/p&gt;

&lt;p&gt;In a fixed pipeline, this either escalates to a human or gets sent back to the executor with a "fix it" comment. But the right answer is often neither — you need to go back to whoever wrote the spec and ask for a decision.&lt;/p&gt;

&lt;p&gt;With a living graph, the arbiter can spawn a &lt;code&gt;spec-clarification&lt;/code&gt; node instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Arbiter runs, detects contradiction, returns:
  output: { decision: "spec-contradiction" }
  spawn:  [
    {
      id: "spec-clarification",
      activityType: "spec-writer",   ← could be a human checkpoint or another agent
      dependsOn: ["arbiter"],
      input: {
        question: "Spec says cursor-based pagination, executor used offset. Which do you want?",
        context: { diff, specExcerpt }
      }
    },
    {
      id: "implement-revised",
      activityType: "executor",
      dependsOn: ["spec-clarification"]   ← continues once clarified
    },
    ...
  ]

Graph grows:
  ... → [arbiter] → [spec-clarification] → [implement-revised] → [reviewer-2] → [pr]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;spec-writer&lt;/code&gt; activity type could be anything — a human approval gate, a dedicated planning agent that re-evaluates the requirements, or a call to the original spec-generation step with additional context. The arbiter doesn't need to know. It just knows this is a spec problem, not a code problem, and spawns the right node type. That routing decision is something only an agent can make in context. You can't pre-define it in a YAML file before the run starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Grael actually is
&lt;/h2&gt;

&lt;p&gt;Grael is a Go service built around this idea. The code is on GitHub: &lt;a href="https://github.com/ubcent/grael" rel="noopener noreferrer"&gt;github.com/ubcent/grael&lt;/a&gt;. A few things I cared about when building it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Everything is an event.&lt;/strong&gt; The entire state of a workflow run is an append-only event log. The current graph, node states, retry counts — all of it is derived by replaying events from the WAL. This means crashes are recoverable, history is auditable, and replay is deterministic. If Grael goes down mid-run, it picks up exactly where it left off.&lt;/p&gt;
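&lt;p&gt;The replay idea is easy to sketch. This is a generic event-sourcing fold with event names invented for illustration rather than taken from Grael's WAL format: state is a pure function of the log, so a restarted process that replays the same events lands in exactly the same state.&lt;/p&gt;

```typescript
// Generic event-sourcing sketch; event and state names are illustrative.

type Event =
  | { type: "node-added"; id: string }
  | { type: "node-started"; id: string }
  | { type: "node-completed"; id: string }
  | { type: "node-failed"; id: string };

type NodeState = "pending" | "running" | "completed" | "failed";

// Replay is a pure fold over the append-only log. Same log in, same
// state out: that determinism is what makes crash recovery trivial.
function replay(log: Event[]): Map<string, NodeState> {
  const state = new Map<string, NodeState>();
  for (const e of log) {
    switch (e.type) {
      case "node-added":     state.set(e.id, "pending");   break;
      case "node-started":   state.set(e.id, "running");   break;
      case "node-completed": state.set(e.id, "completed"); break;
      case "node-failed":    state.set(e.id, "failed");    break;
    }
  }
  return state;
}

const log: Event[] = [
  { type: "node-added", id: "scout" },
  { type: "node-started", id: "scout" },
  { type: "node-completed", id: "scout" },
  { type: "node-added", id: "implement" },
  { type: "node-started", id: "implement" },
];

// A process that crashes here rebuilds the exact same picture on restart.
const state = replay(log);
```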

&lt;p&gt;&lt;strong&gt;Workers are just processes that poll for tasks.&lt;/strong&gt; Any language, any runtime. You register a worker with the activity types it can handle, then poll for tasks. When you get one, you run it and report back. The Go SDK is one file. A TypeScript SDK is straightforward to build on top of the gRPC API.&lt;/p&gt;
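&lt;p&gt;In any language, a worker reduces to roughly this loop. The &lt;code&gt;pollTask&lt;/code&gt; and &lt;code&gt;reportResult&lt;/code&gt; callbacks are stand-ins for whatever the gRPC API exposes, not real SDK calls:&lt;/p&gt;

```typescript
// Hypothetical worker loop: register capabilities, poll, execute, report.
// pollTask / reportResult stand in for the actual transport layer.
interface Task {
  id: string;
  activityType: string;
  input: object;
}

type Handler = (input: object) => object;

function runWorkerOnce(
  handlers: { [activityType: string]: Handler },
  pollTask: () => Task | null,
  reportResult: (taskId: string, output: object) => void,
): boolean {
  const task = pollTask();
  if (task === null) return false; // nothing to do this tick
  const handler = handlers[task.activityType];
  if (!handler) throw new Error("no handler for " + task.activityType);
  reportResult(task.id, handler(task.input));
  return true;
}
```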

&lt;p&gt;&lt;strong&gt;Compensation is built in.&lt;/strong&gt; Each node can declare a compensation activity — what to undo if things go wrong downstream. When a run fails, Grael automatically runs compensations in reverse order. Saga pattern, out of the box.&lt;/p&gt;
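&lt;p&gt;The reverse-order walk is the whole trick. A minimal sketch, with illustrative names:&lt;/p&gt;

```typescript
// Saga-style sketch: each completed node may carry a compensation; on
// failure they run in reverse completion order, undoing newest work first.
interface CompletedNode {
  id: string;
  compensate?: () => void;
}

function compensateAll(completedInOrder: CompletedNode[]): string[] {
  const ran: string[] = [];
  for (let i = completedInOrder.length - 1; i >= 0; i--) {
    const node = completedInOrder[i];
    if (node.compensate) {
      node.compensate();
      ran.push(node.id);
    }
  }
  return ran; // ids of compensations that actually ran, newest first
}
```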

&lt;p&gt;&lt;strong&gt;Human checkpoints are first-class.&lt;/strong&gt; An activity can return a checkpoint signal instead of completing — the node enters a waiting state, unrelated work continues, and the run resumes when someone calls &lt;code&gt;ApproveCheckpoint&lt;/code&gt;. The checkpoint timeout is configurable per node. This is how you put a human in the loop without halting the entire pipeline.&lt;/p&gt;
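&lt;p&gt;As a sketch (the state and signal names here are illustrative, not Grael's wire format), the checkpoint lifecycle is small:&lt;/p&gt;

```typescript
// Sketch of a checkpoint-aware node: an activity result is either a normal
// completion or a "waiting" signal; approval flips it without blocking peers.
type NodeState = "running" | "waiting-approval" | "completed";

interface Checkpoint {
  nodeId: string;
  state: NodeState;
}

function onActivityResult(result: "done" | "checkpoint", nodeId: string): Checkpoint {
  if (result === "checkpoint") {
    return { nodeId, state: "waiting-approval" }; // sibling nodes keep running
  }
  return { nodeId, state: "completed" };
}

function approveCheckpoint(cp: Checkpoint): Checkpoint {
  if (cp.state !== "waiting-approval") return cp; // approval is idempotent
  return { nodeId: cp.nodeId, state: "completed" };
}
```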




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/fZXBkdV1Jow"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The demo runs a morning incident briefing workflow. The scenario: an on-call team needs to quickly assemble a picture of what's happening and decide what to investigate. Here's the shape of the run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Three preparation steps start in parallel — collect customer escalations, pull checkout metrics, prepare the briefing outline.&lt;/li&gt;
&lt;li&gt;A planning step runs once those complete and decides which follow-up checks need to happen.&lt;/li&gt;
&lt;li&gt;Grael spawns the concrete investigation nodes based on that decision — verify checkout latency, confirm payment auth drop, review support spike. These weren't in the graph when the run started.&lt;/li&gt;
&lt;li&gt;One spawned investigation fails retryably. Grael retries it automatically.&lt;/li&gt;
&lt;li&gt;An editor approval gate opens. The run doesn't freeze — the other investigations keep progressing.&lt;/li&gt;
&lt;li&gt;Once all investigations are done and the approval comes through, the results flow into assembling the final brief.&lt;/li&gt;
&lt;li&gt;The brief is published. Run completes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What to watch for: the graph growing after the planning step, multiple nodes running at the same time, the failed node recovering, and the approval gate that's clearly distinct from a stall.&lt;/p&gt;

&lt;p&gt;One more thing worth noting: what you're watching is a replay. Not a live run recorded at demo time — a deterministic replay of a previously recorded execution, driven from the event log.&lt;/p&gt;

&lt;p&gt;This is possible because of how Grael works internally. Every state transition — node started, node completed, spawn happened, retry scheduled, checkpoint reached — is written to an append-only WAL before anything changes in memory. The current state of a run is always derived by replaying that log from the beginning. There's no separate "current state" that can drift or get corrupted.&lt;/p&gt;

&lt;p&gt;The consequence is that any run can be replayed exactly. Same events, same order, same graph shape, same outcome. That's what makes the demo reproducible — and it's also what makes crash recovery work. If Grael goes down mid-run, it replays the log on restart and picks up where it left off. The demo and the durability guarantee are the same mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To be clear: this is very early.&lt;/strong&gt; The workflow is synthetic, built to demonstrate these behaviors in a controlled setting. What you're seeing is a proof of concept, not a production system: I've run it through a handful of synthetic workflows to validate the architecture, but I haven't thrown real production workloads at it. There are rough edges, missing features, and exactly zero battle testing. The goal right now is to get the core idea right, not to ship something stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters for vexdo specifically
&lt;/h2&gt;

&lt;p&gt;The current version of vexdo has its orchestration hardcoded. The pipeline is always: submit → review → arbiter → fix → repeat. That works for the use case I built it for, but it's not general enough to turn into a product.&lt;/p&gt;

&lt;p&gt;With Grael underneath, vexdo becomes a set of activity workers — scout, executor, reviewer, arbiter, pr-creator — registered against an engine that handles the graph. The pipeline itself is just a starting node definition. What it grows into depends on what the agents discover.&lt;/p&gt;

&lt;p&gt;This also unlocks things that were awkward before: running review and security checks in parallel, spawning additional investigation steps when something looks risky, human approval gates at specific points in the pipeline. These become configuration, not code changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Grael needs a gRPC server layer before it can talk to TypeScript workers. That's the immediate next step. After that, the TypeScript SDK, then wiring vexdo's agents into it.&lt;/p&gt;

&lt;p&gt;If this is interesting to you — either because you're building something similar, or because you're skeptical this is the right abstraction — I'd genuinely like to hear it. The living graph idea feels right to me, but I'm very much still figuring out where it breaks.&lt;/p&gt;

&lt;p&gt;More posts coming as this progresses.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Let Agents Write My Code. They Got Stuck in a Loop and Argued With Each Other</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Thu, 19 Mar 2026 15:27:16 +0000</pubDate>
      <link>https://dev.to/ubcent/i-let-agents-write-my-code-they-got-stuck-in-a-loop-and-argued-with-each-other-36me</link>
      <guid>https://dev.to/ubcent/i-let-agents-write-my-code-they-got-stuck-in-a-loop-and-argued-with-each-other-36me</guid>
      <description>&lt;p&gt;&lt;em&gt;A follow-up to &lt;a href="https://dev.to/ubcent/i-built-a-local-ai-dev-pipeline-that-reviews-its-own-code-before-opening-a-pr-geg"&gt;building a local AI pipeline that reviews its own code&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I built vexdo — a CLI pipeline that automates the full dev cycle: spec → Codex implementation → reviewer → arbiter → PR. The dream: close my laptop, come back to a reviewed PR. No manual copy-pasting between tools, no being the glue.&lt;/p&gt;

&lt;p&gt;Then I migrated from local Codex to Codex Cloud. Then I swapped the reviewer from Claude to GitHub Copilot CLI. Then I went to make a coffee and came back to find my pipeline had sent Codex the same feedback four times in a row.&lt;/p&gt;

&lt;p&gt;This post is about that, and the other ways things broke. Not the happy path — the other one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick recap: what vexdo does
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.yaml → Codex (implements) → Reviewer (finds issues) → Arbiter (fix / submit / escalate) → PR
                 ↑___________________________|
                         fix loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;v1 ran locally and synchronously. v2 runs on Codex Cloud so I can kick off a task and close my laptop. The reviewer is now GitHub Copilot CLI. The arbiter is still Claude.&lt;/p&gt;

&lt;p&gt;Simple enough in theory. Here's what went wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The infinite loop that took me an embarrassing amount of time to notice
&lt;/h2&gt;

&lt;p&gt;This one genuinely hurt.&lt;/p&gt;

&lt;p&gt;My spec had a contradiction I didn't catch when writing it. One section described the expected behavior. Another section described the system architecture. They disagreed on where a certain piece of logic should live.&lt;/p&gt;

&lt;p&gt;Codex, being a dutiful implementer, read the behavior requirement and made change A. The reviewer flagged it: &lt;em&gt;"this violates the architecture described in section 3."&lt;/em&gt; Fair enough. The arbiter sent it back for a fix.&lt;/p&gt;

&lt;p&gt;Next iteration: Codex, now armed with the reviewer's feedback, made change B instead. The reviewer flagged it: &lt;em&gt;"this doesn't implement the behavior described in section 1."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Codex made change A again.&lt;/p&gt;

&lt;p&gt;I watched this unfold across four iterations before I admitted to myself what was happening. The agents weren't broken. They were doing exactly what they were told. The spec was broken, and nobody in the loop had the job of noticing that — because I hadn't given anyone that job.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix: a stuck detector
&lt;/h3&gt;

&lt;p&gt;I added a fourth agent call — a loop detector that runs after each review. It gets the full iteration history: every reviewer output, every piece of feedback, every resulting diff. Its only job is to answer one question: &lt;em&gt;are we making progress, or are we going in circles?&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="s2"&gt;`You are reviewing the history of a code review loop.\n`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="s2"&gt;`Below are the last &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; iterations: reviewer findings and the resulting diffs.\n\n`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;formatHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;\n\n`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="s2"&gt;`Is the loop making progress toward resolution, or is it cycling?\n`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="s2"&gt;`If cycling: briefly describe the contradiction causing it.\n`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="s2"&gt;`Respond with JSON: { "status": "progress" | "stuck", "reason": string }`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it returns &lt;code&gt;stuck&lt;/code&gt;, the pipeline escalates to me with the reason. In the spec-contradiction case the output was something like: &lt;em&gt;"Reviewer alternates between flagging architecture violation and spec violation. Likely spec inconsistency between sections 1 and 3."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's exactly the signal I needed. I fixed the spec in two minutes. The task ran clean on the next attempt.&lt;/p&gt;

&lt;p&gt;One more API call per iteration. Absolutely worth it.&lt;/p&gt;
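&lt;p&gt;The consuming side is deliberately defensive. Here is a sketch of how the verdict might be parsed; anything malformed counts as progress, so a flaky detector can never block a healthy run (the parsing helper is mine, not part of any SDK):&lt;/p&gt;

```typescript
// Sketch: parse the detector's JSON verdict. Unparseable output is not
// evidence of a stuck loop, so it defaults to "progress".
interface Verdict {
  status: "progress" | "stuck";
  reason: string;
}

function parseVerdict(raw: string): Verdict {
  try {
    const parsed = JSON.parse(raw);
    if (parsed.status === "stuck") {
      return { status: "stuck", reason: String(parsed.reason || "unspecified") };
    }
  } catch {
    // fall through to the safe default below
  }
  return { status: "progress", reason: "" };
}
```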




&lt;h2&gt;
  
  
  The arbiter that treated every nit as a showstopper
&lt;/h2&gt;

&lt;p&gt;My arbiter's job is to decide: &lt;code&gt;fix&lt;/code&gt;, &lt;code&gt;submit&lt;/code&gt;, or &lt;code&gt;escalate&lt;/code&gt;. In v1, it was prompt-tuned to be thorough — if there are any issues in the review, send it back for fixes.&lt;/p&gt;

&lt;p&gt;Sounds responsible. Was not.&lt;/p&gt;

&lt;p&gt;The Copilot reviewer, being an agent with opinions, would find real issues — and also flag a variable name it preferred, a missing blank line, inconsistent comment style. Nits. These came back as review comments. The arbiter, seeing review comments, dutifully returned &lt;code&gt;fix&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So tasks that were functionally correct would bounce through 2-3 extra cycles chasing aesthetics. Each cycle is 8-10 minutes of Codex Cloud execution. A task that should have been one pass took four. The diff after iteration 4 was identical to the diff after iteration 1 except for a renamed variable and a blank line.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix: severity-aware arbitration
&lt;/h3&gt;

&lt;p&gt;The reviewer was already tagging severity — I just wasn't using it in the arbiter decision. One prompt update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Severity guide:
- critical / high: always fix before submitting
- medium: fix if it's a behavior issue; use judgment for style
- low / nit: do NOT send back for fix; note in PR description instead

Only return "fix" if there are unresolved critical or high severity issues.
If the only remaining issues are low/nit, return "submit".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Task cycle count dropped immediately.&lt;/p&gt;

&lt;p&gt;The thing I kept reminding myself: &lt;strong&gt;the arbiter is a policy, not just a judge.&lt;/strong&gt; Left to its own devices, it defaults to "fix everything," which is technically correct and practically a treadmill. You have to encode the actual policy — what counts as blocking, what doesn't — or you'll spend a lot of Codex Cloud credits on blank lines.&lt;/p&gt;
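&lt;p&gt;Written as code rather than prompt text, the policy is a few lines. This is a sketch of the same rules, with illustrative field names:&lt;/p&gt;

```typescript
// Sketch of the severity policy: only unresolved critical/high findings
// force another fix cycle; nits ride along in the PR description.
type Severity = "critical" | "high" | "medium" | "low" | "nit";

interface Finding {
  severity: Severity;
  isBehaviorIssue?: boolean;
}

function decide(findings: Finding[]): "fix" | "submit" {
  for (const f of findings) {
    if (f.severity === "critical" || f.severity === "high") return "fix";
    // medium: only behavior issues block; style judgment calls do not
    if (f.severity === "medium") {
      if (f.isBehaviorIssue) return "fix";
    }
  }
  return "submit"; // low/nit only: ship it, note them in the PR
}
```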




&lt;h2&gt;
  
  
  The cloud stuff that also broke (quickly)
&lt;/h2&gt;

&lt;p&gt;Since you'll hit these too:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exit codes lie.&lt;/strong&gt; &lt;code&gt;codex cloud status&lt;/code&gt; returns non-zero when a task is still pending. Not an error — just "not done yet." My polling loop treated every poll as a failure and gave up immediately. Fix: parse stdout first, only throw if the output is unrecognizable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The status values aren't what the docs imply.&lt;/strong&gt; I was matching &lt;code&gt;completed&lt;/code&gt;. The actual output contains &lt;code&gt;[READY]&lt;/code&gt;. Also in rotation: &lt;code&gt;[PENDING]&lt;/code&gt;, &lt;code&gt;[RUNNING]&lt;/code&gt;, &lt;code&gt;[IN_PROGRESS]&lt;/code&gt;. Add them all, map &lt;code&gt;READY&lt;/code&gt; → &lt;code&gt;completed&lt;/code&gt;.&lt;/p&gt;
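&lt;p&gt;Both fixes collapse into one parsing step: read stdout, map the bracketed tokens, and only treat truly unrecognizable output as an error. The tokens below are what I observed, not documented guarantees:&lt;/p&gt;

```typescript
// Sketch: parse the status command's stdout instead of trusting exit codes.
const STATUS_MAP: { [token: string]: string } = {
  "[READY]": "completed",
  "[PENDING]": "pending",
  "[RUNNING]": "running",
  "[IN_PROGRESS]": "running",
};

function parseStatus(stdout: string): string {
  for (const token of Object.keys(STATUS_MAP)) {
    if (stdout.includes(token)) return STATUS_MAP[token];
  }
  // Only now is it a real error: output we cannot recognize at all.
  throw new Error("unrecognizable status output: " + stdout.slice(0, 80));
}
```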

&lt;p&gt;&lt;strong&gt;There's no CLI resume command.&lt;/strong&gt; The web UI lets you continue a Codex session with follow-up instructions. The CLI doesn't expose this. I simulate it by submitting a new task with the original spec plus feedback appended, with a header so it's recognizable in the UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[REVIEW FEEDBACK — FIX REQUESTED]
Task: Implement key pairs validation
Iteration: 2

&amp;lt;original spec&amp;gt;

Issues to fix:
&amp;lt;arbiter feedback&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The less funny thing I've been sitting with
&lt;/h2&gt;

&lt;p&gt;All of the above are patchable. Annoying to find, quick to fix.&lt;/p&gt;

&lt;p&gt;The bigger issue isn't a bug: &lt;strong&gt;my codebase wasn't ready for agents to work in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I realized this gradually, then all at once. I put together a scoring framework — an "agent-ready codebase" checklist — and ran my codebase through it. The result was humbling.&lt;/p&gt;

&lt;h3&gt;
  
  
  The framework
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Repository structure &amp;amp; modularity.&lt;/strong&gt; Can you clearly identify domain logic, application services, adapters, infrastructure, and tests? Are module boundaries clean, or is there a "shared dump" folder where things go to be forgotten? Hidden coupling is invisible to you and actively confusing to agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Locality of changes.&lt;/strong&gt; For a typical feature, how many files does a change touch? Which modules get pulled in? "God files" and scattered logic mean agents produce large, sprawling diffs — which makes the reviewer's job harder and increases the surface area for things to slip through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Naming &amp;amp; intent clarity.&lt;/strong&gt; Are functions and modules named by use-case, or generically? Can you infer side effects from names? An agent reading &lt;code&gt;processData()&lt;/code&gt; has to guess. An agent reading &lt;code&gt;validateAndPersistUserPayment()&lt;/code&gt; doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Contracts &amp;amp; boundaries.&lt;/strong&gt; Are API boundaries validated — schemas, types, runtime validation? Are there contract tests? Is the public API clearly separated from internals? Without this, agents make changes that technically compile but violate implicit assumptions at integration points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test quality &amp;amp; reliability.&lt;/strong&gt; Are tests deterministic? Behavior-focused? Do they cover edge cases? Can you easily add a regression test when something breaks? Flaky tests are worse than no tests in an automated pipeline — they inject false negatives into the review loop and you can't tell whether the failure is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Verification pipeline.&lt;/strong&gt; Is there a single command that verifies correctness — lint, types, tests? Can you run partial checks scoped to changed files? If the answer is "kind of, it's complicated," agents will struggle to self-verify. And if they can't self-verify, you end up doing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Review comment verifiability.&lt;/strong&gt; Can typical review comments be validated automatically — via lint, type checker, tests? Or are most comments subjective judgment calls? The higher the ratio of automatable-to-subjective feedback, the more effective an automated reviewer becomes. A codebase full of subjective review surface generates noise that the arbiter has to wade through on every cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Risk segmentation.&lt;/strong&gt; Can you identify high-risk areas — auth, billing, migrations, infrastructure? Is this encoded somewhere: path conventions, annotations, docs? Without it, agents treat all code as equally safe to modify. That's fine until they're modifying the billing module with the same confidence they'd modify a utility function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Documentation for agents.&lt;/strong&gt; Is there an ARCHITECTURE.md? A CONTRIBUTING.md? An AGENTS.md (or equivalent) that explains how to run the service, how to test changes, how to add a feature? Agents can infer a lot from code — but they shouldn't have to infer things that could just be written down. Every missing doc is a guess the agent makes on your behalf.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Dev environment &amp;amp; reproducibility.&lt;/strong&gt; Can you bootstrap the service reliably from a clean clone? Are there hidden dependencies — secrets, external services that need to be running, manual steps nobody wrote down? Every hidden dependency is a potential point of silent failure when the agent tries to verify its own work.&lt;/p&gt;

&lt;h3&gt;
  
  
  My score: 52/100
&lt;/h3&gt;

&lt;p&gt;That number explains a lot of friction. When a Codex change touches six files across three modules, the reviewer has more surface area to miss things. When tests are flaky, the verification step is unreliable. When architectural rules live only in my head, no agent can enforce them — which made the stuck loop I described earlier almost predictable in hindsight.&lt;/p&gt;




&lt;h2&gt;
  
  
  A brief word about the "code quality matters less now" take
&lt;/h2&gt;

&lt;p&gt;I keep seeing this framing: in the era of agentic systems, code quality matters less because the AI will figure it out. Sloppy structure, vague names, tangled modules — the model is smart enough to work around it.&lt;/p&gt;

&lt;p&gt;I think this is exactly backwards, and I'm saying this as someone who just spent several evenings watching agents thrash inside a mediocre codebase.&lt;/p&gt;

&lt;p&gt;The agent can't ask you what you meant. It can't read the git history and infer the original design intent. It reads what's there. Ambiguous structure → ambiguous behavior. Hidden coupling → unexpected side effects. Vague names → hallucinated assumptions. No AGENTS.md → the agent guesses how your service is supposed to work and proceeds with confidence.&lt;/p&gt;

&lt;p&gt;Code quality doesn't matter less when agents are writing and reviewing your code. It matters &lt;em&gt;more&lt;/em&gt;, because the human who could previously fill in the gaps isn't filling them in anymore. The code is the only source of truth the agent has. It better be a good one.&lt;/p&gt;

&lt;p&gt;A score of 52/100 means I'm running agents on a codebase that's half-ready for them. Getting that number up is now higher on my list than any pipeline feature.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the pipeline looks like now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.yaml
  → codex cloud exec --branch &amp;lt;feature-branch&amp;gt;
  → [poll until READY]
  → codex cloud apply → git commit → git push
  → copilot review (with full iteration history)
  → stuck detector (iteration &amp;gt; 1)
  → arbiter (severity-aware)
  → if fix: loop with feedback header
  → if submit: open PR
  → if escalate: surface to human with reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix iterations stack as commits on the branch. Each commit message is generated by Copilot — a prompt built around conventional commit rules (&lt;code&gt;type(scope): description&lt;/code&gt;), first line of output taken as the message, with a fallback to &lt;code&gt;chore: apply codex changes&lt;/code&gt; if Copilot times out or returns nothing. Squash-merge when done. The history is readable: you get actual meaningful commit messages at each iteration, not just &lt;code&gt;iteration 2&lt;/code&gt;.&lt;/p&gt;
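&lt;p&gt;The commit-message step reduces to "first plausible line or fallback". A sketch of that logic; the conventional-commit check is deliberately loose and the helper name is mine:&lt;/p&gt;

```typescript
// Sketch: take the first output line if it looks like a conventional
// commit ("type(scope): description"), otherwise fall back safely.
const FALLBACK = "chore: apply codex changes";

function commitMessage(copilotOutput: string | null): string {
  if (copilotOutput === null) return FALLBACK; // timeout or empty response
  const firstLine = copilotOutput.split("\n")[0].trim();
  if (firstLine === "") return FALLBACK;
  // Loose check: "type: description" or "type(scope): description"
  const ok = /^[a-z]+(\([a-z0-9-]+\))?: .+/.test(firstLine);
  return ok ? firstLine : FALLBACK;
}
```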




&lt;h2&gt;
  
  
  Where I'm going with this
&lt;/h2&gt;

&lt;p&gt;My actual goal is a system where agents write and review code autonomously, and I step in rarely — for escalations, ambiguous specs, and the cases that genuinely need human judgment.&lt;/p&gt;

&lt;p&gt;Right now I'm in the loop more than I want to be. Some of that is tooling immaturity. Some of it is the 52/100. Some of it is that spec-writing is still entirely manual — and as I learned, a bad spec defeats even a well-tuned pipeline.&lt;/p&gt;

&lt;p&gt;Here's what's on the roadmap, roughly grouped by problem area:&lt;/p&gt;

&lt;h3&gt;
  
  
  Review and verification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Verification ladder.&lt;/strong&gt; Right now the arbiter makes a judgment call about whether something is "done." I want to replace that with structured &lt;code&gt;must_haves&lt;/code&gt; in the task YAML — a list of requirements that get verified against the diff at four tiers: static (file/export presence), command (tests pass), behavioral (observable output), or human (escalate). Submit is only allowed when every must-have passes. No more "looks good to me" from the arbiter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better stuck detection.&lt;/strong&gt; The current loop detector catches cycles at the review level. I want to add diff-level detection: if Codex produces the same diff twice, fire a diagnostic retry with a targeted prompt. On a second identical diff, escalate with a structured breakdown showing exactly which review comments went unaddressed. Less "something seems wrong," more "here is precisely what didn't change and why."&lt;/p&gt;
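&lt;p&gt;The escalation ladder is easy to sketch. A real implementation would hash the diff properly; whitespace normalization stands in for that here, and the action names are illustrative:&lt;/p&gt;

```typescript
// Sketch of diff-level stuck detection: fingerprint each diff and count
// repeats. First repeat triggers a diagnostic retry, second escalates.
function diffFingerprint(diff: string): string {
  // Cheap stand-in for a real content hash.
  return diff.replace(/\s+/g, " ").trim();
}

function nextAction(history: string[], newDiff: string): "continue" | "diagnose" | "escalate" {
  const fp = diffFingerprint(newDiff);
  let repeats = 0;
  for (const prev of history) {
    if (diffFingerprint(prev) === fp) repeats++;
  }
  if (repeats === 0) return "continue"; // genuinely new change
  if (repeats === 1) return "diagnose"; // seen once before: targeted retry
  return "escalate";                    // stuck beyond doubt: go to a human
}
```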

&lt;h3&gt;
  
  
  Context and memory
&lt;/h3&gt;

&lt;p&gt;This is the area I'm most excited about. Right now each Codex submission is stateless — it knows the spec and the feedback, nothing else. Over a multi-step task, that's a problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fresh context injection.&lt;/strong&gt; Before each Codex submission, prepend summaries of completed steps and a decisions register to the prompt. Prevents Codex from re-implementing utilities already built by earlier steps. Capped at 2000 tokens so it doesn't eat the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decisions register.&lt;/strong&gt; A &lt;code&gt;.vexdo/decisions.md&lt;/code&gt; file — an append-only table of architectural decisions made during execution: which validation library was chosen, what the storage strategy is, naming conventions adopted. The arbiter populates it automatically. Every subsequent step prompt gets it injected. The goal: agents that build on prior decisions instead of relitigating them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scout agent.&lt;/strong&gt; A focused Claude call before each Codex submission that scans the target service's codebase and returns relevant existing files, reuse hints, and conventions to follow. Non-fatal: if Scout fails, execution continues without it. But when it works, Codex stops reinventing things that already exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive replanning.&lt;/strong&gt; After each step completes, a lightweight Claude call checks whether remaining step specs are still accurate given what was actually built. Proposes updates for me to confirm before the next step runs. Multi-step plans rarely survive contact with reality unchanged — this is the mechanism for adjusting without rewriting everything manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Continue-here protocol.&lt;/strong&gt; Right now if the process crashes mid-task, you start over. I'm adding a &lt;code&gt;.vexdo/continue.md&lt;/code&gt; checkpoint written at every major phase transition — codex submitted, codex done, review iteration, arbiter done. &lt;code&gt;vexdo start --resume&lt;/code&gt; reads the checkpoint and picks up from exactly where it left off. This matters more than it sounds once tasks are running for 30+ minutes across multiple iterations.&lt;/p&gt;
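&lt;p&gt;The resume logic itself is tiny once the phases are ordered. A sketch under assumed phase names (the real checkpoint file will carry more state than this):&lt;/p&gt;

```typescript
// Sketch of the continue-here protocol: a phase checkpoint written at each
// transition; --resume reads it and skips everything already done.
const PHASES = ["codex-submitted", "codex-done", "review-iteration", "arbiter-done"];

interface CheckpointFile {
  taskId: string;
  phase: string;
}

function phasesToRun(cp: CheckpointFile | null): string[] {
  if (cp === null) return PHASES;      // fresh start
  const idx = PHASES.indexOf(cp.phase);
  if (idx === -1) return PHASES;       // unknown phase: safest to start over
  return PHASES.slice(idx + 1);        // resume right after the recorded phase
}
```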

&lt;h3&gt;
  
  
  Observability and interaction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost and token tracking.&lt;/strong&gt; Every Claude API call will capture token usage and estimated cost. Per-step and total costs shown in &lt;code&gt;vexdo status&lt;/code&gt;. Optional budget ceiling in &lt;code&gt;.vexdo.yml&lt;/code&gt; that pauses execution before overspending. Right now I have no idea what a task costs until I check my API bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UAT script generation.&lt;/strong&gt; After all steps complete, Vexdo writes &lt;code&gt;.vexdo/uat.md&lt;/code&gt; — a human test script derived from step must-haves and arbiter summaries. &lt;code&gt;vexdo submit&lt;/code&gt; warns if UAT items are unchecked (override with &lt;code&gt;--skip-uat&lt;/code&gt;). The dream of fully autonomous code is great; the reality is that some things still need a human to click through the UI once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discuss command.&lt;/strong&gt; &lt;code&gt;vexdo discuss &amp;lt;task-id&amp;gt;&lt;/code&gt; opens an interactive Claude session with full task context pre-loaded — what was built, what decisions were made, what's still pending. Ask questions, queue spec updates for pending steps, steer execution from a second terminal while &lt;code&gt;start&lt;/code&gt; is running. The CLI as a conversation partner, not just an executor.&lt;/p&gt;




&lt;p&gt;Getting the codebase score above 80 will get me closer to the goal. So will all of the above. The common thread: the more context agents have, the less they guess. The less they guess, the fewer loops. The fewer loops, the closer I get to actually closing my laptop and coming back to a PR that's ready.&lt;/p&gt;

&lt;p&gt;One problem at a time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;vexdo is open source — &lt;a href="https://github.com/vexdo/vexdo-cli" rel="noopener noreferrer"&gt;github.com/vexdo/vexdo-cli&lt;/a&gt;. If you're building something similar or have hit these problems differently, I'd like to hear about it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>codereview</category>
    </item>
    <item>
      <title>When Your Code Compiles But You Don't</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Sat, 14 Mar 2026 11:04:07 +0000</pubDate>
      <link>https://dev.to/ubcent/when-your-code-compiles-but-you-dont-2hmn</link>
      <guid>https://dev.to/ubcent/when-your-code-compiles-but-you-dont-2hmn</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/wecoded-2026"&gt;2026 WeCoded Challenge&lt;/a&gt;: Echoes of Experience&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  On burnout, imposter syndrome, and finding your voice as a developer abroad — and the unexpected failure that woke me back up.
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] Input suppressed. Reason: unknown.
[WARN]  Motivation: degraded.
[ERROR] Identity: not found.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I stared at that metaphor for a long time before I realised it was my year.&lt;/p&gt;

&lt;p&gt;There's a specific kind of silence that only developers know. Not the comfortable silence of deep focus, or the peaceful silence of a solved problem. I'm talking about the silence of knowing the answer — and choosing not to say it.&lt;/p&gt;

&lt;p&gt;I've sat in dozens of meetings, heart quietly racing, watching a discussion go in circles — sometimes toward a solution I already knew was wrong — while a voice in my head repeated the same line: &lt;em&gt;your input isn't needed here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It wasn't shyness. It wasn't a language barrier, though I'd relocated abroad and English was my working language. It was something quieter and more corrosive: a deep-seated belief that I hadn't earned the right to take up space.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three kinds of silence
&lt;/h2&gt;

&lt;p&gt;Seniority is contextual. I knew that intellectually. I didn't know it would feel like starting over.&lt;/p&gt;

&lt;p&gt;I call it the three silences, because that's how it showed up for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The language silence.&lt;/strong&gt; I'd draft a message in Slack, then delete it. Not because it was wrong — because I'd reread it three times wondering if the phrasing sounded &lt;em&gt;too foreign&lt;/em&gt;, too formal, too something. I started filtering myself before I even had the chance to be misunderstood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The professional silence.&lt;/strong&gt; I had ideas. Lots of them — architectural improvements, product suggestions, side projects that could've become real tools. But every time I thought about sharing them, imposter syndrome arrived right on cue — and at senior level, it doesn't sound naive. It sounds reasonable: &lt;em&gt;This has already been done. There's a library for it, a blog post about it, a better implementation of it somewhere on GitHub. What are you going to add?&lt;/em&gt; The voice is more polished. It's harder to argue with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The inner silence.&lt;/strong&gt; This one came last, and it was the loudest. There's a particular brand of exhaustion that burnout brings — not tiredness, but numbness. I'd open my laptop, stare at the code, and feel nothing. Not frustration, not curiosity. Just absence. The IDE was running. I was not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The lie I told myself about time
&lt;/h2&gt;

&lt;p&gt;For a long time, I explained away my stagnation with a single story: &lt;em&gt;I don't have time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I was busy. The sprints were packed, life abroad had logistics I hadn't anticipated, and building a personal brand felt like a luxury for people who had already "made it." I had ideas for articles, side projects, open-source contributions — all neatly filed in Notion drafts and half-opened browser tabs, where they couldn't be judged.&lt;/p&gt;

&lt;p&gt;Here's the thing about that story: it was comfortable. And it was a lie.&lt;/p&gt;

&lt;p&gt;The truth was that I had pockets of time. I just didn't trust myself enough to use them. Writing a post felt pointless if nobody cared. Building something felt arrogant if I might fail.&lt;/p&gt;

&lt;p&gt;Motivation wasn't missing. It was buried under a layer of fear I'd labelled as a scheduling problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What burnout actually looks like (from the inside)
&lt;/h2&gt;

&lt;p&gt;People talk about burnout like it's a dramatic collapse. And sometimes it is. But the version I lived was quieter — a slow dimming rather than a sudden blackout.&lt;/p&gt;

&lt;p&gt;I still shipped features. I still showed up to standups. From the outside, I was probably fine. From the inside, I was running on a kind of professional autopilot — doing the work, but completely disconnected from the reason I got into this field in the first place.&lt;/p&gt;

&lt;p&gt;The hardest part wasn't the exhaustion. It was losing the thread back to myself — to the person who once pulled all-nighters not because of deadlines, but because the problem was just too interesting to put down.&lt;/p&gt;




&lt;h2&gt;
  
  
  The failure that woke me up
&lt;/h2&gt;

&lt;p&gt;Recovery didn't come from a productivity hack or a helpful conversation. It came from a system design interview that went badly wrong.&lt;/p&gt;

&lt;p&gt;I'd prepared. I knew the patterns. But the moment the interviewer started asking questions, something broke down. I couldn't structure my thinking. I went in circles. I could see the confusion on the screen and I couldn't stop it.&lt;/p&gt;

&lt;p&gt;I failed. Clearly, decisively, unambiguously.&lt;/p&gt;

&lt;p&gt;And something strange happened in the days that followed: I felt &lt;em&gt;alive&lt;/em&gt; again.&lt;/p&gt;

&lt;p&gt;Not happy — it stung badly. But underneath the sting was something I hadn't felt in months: genuine engagement. I was frustrated, yes. But frustration means you &lt;em&gt;care&lt;/em&gt;. It means the signal is still there. After months of numbness, even a hard feeling was proof that I was still in there somewhere.&lt;/p&gt;

&lt;p&gt;That failure did what nothing else had: it threw a stack trace. Suddenly I could see exactly where I'd been running in silent degraded mode. The interview didn't break me — it interrupted the autopilot. And that interruption was what I needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO]  Exception caught: SystemDesignInterviewFailure
[INFO]  Stack trace:
          - few years autopilot
          - suppressed ideas
          - unshared voice
[INFO]  Attempting recovery...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Finding the voice again
&lt;/h2&gt;

&lt;p&gt;After that, things moved differently.&lt;/p&gt;

&lt;p&gt;I opened one of those Notion drafts and just finished it. I left a comment on a GitHub issue I'd been silently following for weeks. Then I did the thing I'd been postponing the longest: I started an open-source project. Not because I was sure it was needed — but because building it in the open meant I couldn't quietly abandon it. I wrote a couple of posts along the way, sharing things I'd stumbled across while building. Small discoveries, not grand conclusions. Public ones. With a name attached.&lt;/p&gt;

&lt;p&gt;And something unexpected happened: people responded. Not to a polished, authoritative voice I thought I needed to have — but to the honest, unguarded version I'd been afraid to show.&lt;/p&gt;

&lt;p&gt;It turns out authenticity travels across languages. Struggle is not a regional dialect.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd tell my quieter self
&lt;/h2&gt;

&lt;p&gt;If I could go back to that person staring at a blinking cursor, unsure whether their ideas mattered, I'd say a few things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your accent is not a measure of your competence.&lt;/strong&gt; Years of experience don't make this easier — they just make you better at hiding the discomfort. The nervousness you feel speaking in a second language is not evidence that you don't belong. It's evidence that you're doing something genuinely hard, and doing it anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imposter syndrome isn't a diagnosis — it's a symptom.&lt;/strong&gt; It usually means you care deeply and you're in new territory. Both of those are good things, even when they don't feel like it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ideas you're sitting on are not too small.&lt;/strong&gt; The blog post you're not writing, the project you're not building, the comment you keep deleting — none of those need to be perfect to be worth sharing. Done and imperfect is infinitely more valuable than flawless and invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Burnout is not a character flaw.&lt;/strong&gt; It's what happens when a system runs too long without maintenance. You are allowed to rest. You are allowed to be unproductive. You are allowed to be a person first and an engineer second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And sometimes the failure you're dreading is the only thing loud enough to reach you.&lt;/strong&gt; Not because pain is a good teacher, but because numbness has no edges. Failure does.&lt;/p&gt;




&lt;p&gt;I'm writing this now — this very post — because I'm on the other side of that quiet period. Not because everything is solved, but because I finally stopped waiting to feel ready.&lt;/p&gt;

&lt;p&gt;If you're in the thick of it right now — navigating a new country, swallowing your ideas, running on empty — I want you to know: this is not permanent. The silence ends. And when you start speaking again, you'll find that people were waiting to hear you all along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your voice belongs here. It always did.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you experienced burnout or imposter syndrome in your tech career? I'd love to hear what shifted things for you — drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>wecoded</category>
      <category>dei</category>
      <category>career</category>
    </item>
    <item>
      <title>I Built a Local AI Dev Pipeline That Reviews Its Own Code Before Opening a PR</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Wed, 11 Mar 2026 13:17:13 +0000</pubDate>
      <link>https://dev.to/ubcent/i-built-a-local-ai-dev-pipeline-that-reviews-its-own-code-before-opening-a-pr-geg</link>
      <guid>https://dev.to/ubcent/i-built-a-local-ai-dev-pipeline-that-reviews-its-own-code-before-opening-a-pr-geg</guid>
      <description>&lt;p&gt;&lt;em&gt;How I got tired of being the glue between AI tools and automated the whole thing with vexdo&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been using AI coding assistants for about a year now. Claude Code for planning and spec writing, Codex for the actual implementation, Copilot for inline suggestions. The results were genuinely good — but the &lt;em&gt;process&lt;/em&gt; was exhausting.&lt;/p&gt;

&lt;p&gt;Every task looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a spec with Claude Code&lt;/li&gt;
&lt;li&gt;Copy-paste it into Codex&lt;/li&gt;
&lt;li&gt;Wait for Codex to finish&lt;/li&gt;
&lt;li&gt;Open a PR&lt;/li&gt;
&lt;li&gt;Manually request a Copilot review&lt;/li&gt;
&lt;li&gt;Read the review comments&lt;/li&gt;
&lt;li&gt;Decide which ones matter&lt;/li&gt;
&lt;li&gt;Copy the important ones back into Codex&lt;/li&gt;
&lt;li&gt;Wait again&lt;/li&gt;
&lt;li&gt;Repeat until it looks good&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I was the glue. Every step required me to context-switch, copy text between tools, and make judgment calls. The AI was doing the creative work, but I was doing all the plumbing.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;vexdo&lt;/strong&gt; — a local CLI pipeline that automates the entire cycle: spec → implementation → review → fixes → PR. Human intervention only when something goes wrong.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer for my boss: this is exclusively a personal project. I have never used any of this for work tasks. Not once. 😄&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fair warning before we go further: this is an experiment, not a production-ready tool.&lt;/strong&gt; It works, I use it, but it's very much a proof of concept. The goal was to validate the architecture and see where it breaks — not to ship a polished product. If you're looking for something battle-tested, this isn't it yet. If you're curious about the pattern and want to hack on it, read on.&lt;/p&gt;

&lt;p&gt;Here's what I learned building it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The core idea: review &lt;em&gt;before&lt;/em&gt; the PR
&lt;/h2&gt;

&lt;p&gt;Most AI coding tools open a PR and then review it. This makes sense in a human workflow — you open a PR, someone reviews it, you fix things.&lt;/p&gt;

&lt;p&gt;But in an automated pipeline, this creates a mess. You end up with PRs full of back-and-forth commits like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat: add /events endpoint
fix: add input validation (per review)
fix: fix validation again (per review)
fix: actually fix it this time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vexdo flips this. Review happens &lt;strong&gt;on the local diff&lt;/strong&gt;, before a PR is ever opened. The pipeline iterates until the code is clean, then opens exactly one PR — already reviewed, already fixed.&lt;/p&gt;

&lt;p&gt;The git history stays clean. The PR is meaningful. You only get notified when there's something that actually needs a human decision.&lt;/p&gt;
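
&lt;p&gt;The whole cycle can be sketched in a few lines. This is a hypothetical outline, not vexdo's actual source — &lt;code&gt;run_codex&lt;/code&gt;, &lt;code&gt;review_diff&lt;/code&gt;, and &lt;code&gt;arbitrate&lt;/code&gt; stand in for the real CLI and API calls:&lt;/p&gt;

```python
# Hypothetical sketch of the review-before-PR loop (not vexdo's real source).
# run_codex, review_diff, and arbitrate stand in for actual tool/API calls.

def run_pipeline(spec, run_codex, review_diff, arbitrate, max_iterations=3):
    """Iterate implement -> review -> fix locally; open a PR only when clean."""
    run_codex(spec)  # initial implementation, left as a local diff
    for iteration in range(1, max_iterations + 1):
        findings = review_diff(spec)          # reviewer sees spec + local diff
        decision = arbitrate(spec, findings)  # arbiter sees spec + diff + findings
        if decision["decision"] == "submit":
            return {"status": "submit", "iterations": iteration}
        if decision["decision"] == "escalate":
            return {"status": "escalate", "iterations": iteration}
        run_codex(decision["feedback_for_codex"])  # apply fixes, loop again
    return {"status": "escalate", "reason": "max iterations reached"}
```

&lt;p&gt;The point is that nothing touches the remote until the loop exits with &lt;code&gt;submit&lt;/code&gt; — every intermediate fix stays in the working tree.&lt;/p&gt;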




&lt;h2&gt;
  
  
  Spec-driven development as the foundation
&lt;/h2&gt;

&lt;p&gt;The whole thing only works if the agent knows what "done" looks like. That's where spec-driven development comes in.&lt;/p&gt;

&lt;p&gt;Every task in vexdo is a YAML file with a structured spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;task-001&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/events&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;endpoint"&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;Implement a REST endpoint POST /events.&lt;/span&gt;

      &lt;span class="s"&gt;Acceptance criteria:&lt;/span&gt;
        &lt;span class="s"&gt;- Validates incoming payload against EventSchema&lt;/span&gt;
        &lt;span class="s"&gt;- Returns 201 with created event on success&lt;/span&gt;
        &lt;span class="s"&gt;- Returns 400 with validation errors on failure&lt;/span&gt;
        &lt;span class="s"&gt;- Unit tests cover happy path and validation errors&lt;/span&gt;

      &lt;span class="s"&gt;Architectural constraints:&lt;/span&gt;
        &lt;span class="s"&gt;- Use existing auth middleware, do not reimplement&lt;/span&gt;
        &lt;span class="s"&gt;- Do not modify existing endpoint interfaces&lt;/span&gt;

      &lt;span class="s"&gt;Critical if:&lt;/span&gt;
        &lt;span class="s"&gt;- No input validation&lt;/span&gt;
        &lt;span class="s"&gt;- Breaking change to existing API&lt;/span&gt;
        &lt;span class="s"&gt;- No tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;acceptance criteria&lt;/code&gt; and &lt;code&gt;critical if&lt;/code&gt; fields aren't just documentation — they're the ground truth that the reviewer and arbiter use to evaluate the code. No spec, no review. No review, no PR.&lt;/p&gt;

&lt;p&gt;I write these specs collaboratively with Claude Code before handing anything to Codex. This 10-minute investment saves hours of back-and-forth later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Codex for implementation (and not Claude Code)
&lt;/h2&gt;

&lt;p&gt;The obvious question: why not use Claude Code for the coding step too? It's clearly the better model for complex coding tasks.&lt;/p&gt;

&lt;p&gt;Cost.&lt;/p&gt;

&lt;p&gt;Claude Code is great but expensive for automated, unattended runs. When you're running a pipeline that might do 3 iterations of "write code → review → fix" per task, the token cost adds up fast — especially if you're running multiple tasks per day.&lt;/p&gt;

&lt;p&gt;Codex hits a much more comfortable price point for the implementation step. It's not as capable as Claude Code on hard problems, but for well-scoped tasks with a clear spec, it does the job at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;The split I landed on: &lt;strong&gt;Codex does the implementation&lt;/strong&gt; (cheap, runs autonomously, good enough for scoped tasks), &lt;strong&gt;Claude Haiku does the review and arbitration&lt;/strong&gt; (also cheap, but here accuracy matters more than raw coding ability). Claude Code stays in my workflow for the part it's genuinely irreplaceable at — writing the spec interactively with me before the pipeline starts.&lt;/p&gt;

&lt;p&gt;One implementation detail worth noting: Codex runs with the &lt;code&gt;--full-auto&lt;/code&gt; flag and &lt;strong&gt;doesn't commit anything&lt;/strong&gt;. All its changes sit as unstaged modifications. The review loop captures them via &lt;code&gt;git diff HEAD&lt;/code&gt; — staged and unstaged together. This means the entire set of changes Codex made is visible to the reviewer in one clean diff, not scattered across intermediate commits.&lt;/p&gt;
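
&lt;p&gt;Capturing that combined diff is a one-liner around git. A minimal sketch (illustrative; vexdo's real implementation may differ):&lt;/p&gt;

```python
# Illustrative sketch: capture staged + unstaged changes in one diff,
# via the same `git diff HEAD` call described above.
import subprocess

def collect_diff(repo_path):
    """Return everything changed since HEAD: staged and unstaged together."""
    result = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

&lt;p&gt;Because the comparison point is &lt;code&gt;HEAD&lt;/code&gt; rather than the index, it doesn't matter whether Codex (or a pre-commit hook) happened to stage something along the way — the reviewer always sees the full delta.&lt;/p&gt;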

&lt;p&gt;If cost isn't a constraint for you, swapping Codex for Claude Code in the pipeline would probably improve results. The architecture supports it — it's just a config change.&lt;/p&gt;




&lt;h2&gt;
  
  
  The review loop: two Claude calls, not one
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. I don't use a single AI call to review the code. I use two, with deliberately isolated contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 1 — The Reviewer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reviewer sees: the spec + the git diff. Nothing else.&lt;/p&gt;

&lt;p&gt;It returns a structured list of findings, using four severity levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/routes/events.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No validation on req.body before passing to createEvent()"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add schema validation using existing validateBody middleware"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"important"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/routes/events.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Error from createEvent() not caught — unhandled rejection will crash the process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wrap in try/catch and return 500 with a generic message"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"minor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/routes/events.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Inconsistent error message format compared to other endpoints"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Use errorResponse() helper for consistency"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four severity levels are strictly defined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;critical&lt;/strong&gt; — breaks an acceptance criterion or architectural constraint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;important&lt;/strong&gt; — likely to cause bugs directly related to what the spec requires&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;minor&lt;/strong&gt; — code quality issue, but doesn't block the spec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;noise&lt;/strong&gt; — style or preference, spec-neutral&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reviewer's job is purely technical: does this code satisfy the spec?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 2 — The Arbiter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The arbiter sees: the spec + the diff + the reviewer's findings. It does &lt;em&gt;not&lt;/em&gt; see the history of how the spec was written. This isolation is intentional — it prevents the arbiter from being too lenient because it "knows" the original intent.&lt;/p&gt;

&lt;p&gt;The arbiter returns a decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Critical validation issue and unhandled error path must be resolved before merge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feedback_for_codex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add input validation to POST /events handler. Use the existing validateBody(EventSchema) middleware pattern from POST /users. The validation should happen before any database calls. Also wrap the createEvent() call in try/catch and return 500 for unexpected errors."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2 issues require fixing: missing validation, unhandled error path"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three possible decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fix&lt;/strong&gt; — send &lt;code&gt;feedback_for_codex&lt;/code&gt; to Codex and iterate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;submit&lt;/strong&gt; — no critical or important spec violations, open the PR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;escalate&lt;/strong&gt; — something needs a human&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;submit&lt;/code&gt; threshold matters: the arbiter submits only when there are no &lt;strong&gt;critical or important&lt;/strong&gt; findings that reflect real spec violations. Minor and noise issues don't block submission. The spec is the bar, not stylistic perfection.&lt;/p&gt;
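
&lt;p&gt;That threshold is simple enough to express as a predicate. A hypothetical version, with field names mirroring the reviewer JSON above:&lt;/p&gt;

```python
# Hypothetical sketch of the submit threshold: only critical and important
# findings block the PR; minor and noise never do.
BLOCKING = {"critical", "important"}

def can_submit(findings):
    """True when no finding is severe enough to block the PR."""
    return not any(f["severity"] in BLOCKING for f in findings)
```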

&lt;p&gt;&lt;strong&gt;When escalation triggers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The arbiter escalates in three distinct cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Explicit conflict&lt;/strong&gt; — a reviewer comment contradicts the spec, or there's genuine architectural ambiguity the arbiter shouldn't resolve autonomously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max iterations reached&lt;/strong&gt; — the arbiter kept requesting fixes but ran out of iterations (configurable, default 3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing fix instructions&lt;/strong&gt; — the arbiter decided "fix" but didn't produce &lt;code&gt;feedback_for_codex&lt;/code&gt; (a guardrail against bad model outputs)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In all three cases, you get the full context: the spec, the diff, every review comment with severity and location, the arbiter's reasoning, and a summary. Enough to make a decision without re-reading the code from scratch.&lt;/p&gt;
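
&lt;p&gt;The three cases fold into one guard function. A sketch under the same assumptions as the decision JSON above:&lt;/p&gt;

```python
# Hypothetical sketch of the escalation guardrails described above.
def should_escalate(decision, iteration, max_iterations=3):
    """Return a reason string when a human is needed, else None."""
    if decision["decision"] == "escalate":
        return "explicit conflict flagged by the arbiter"
    if decision["decision"] == "fix" and not decision.get("feedback_for_codex"):
        return "arbiter chose 'fix' but produced no fix instructions"
    if decision["decision"] == "fix" and iteration >= max_iterations:
        return "max iterations reached without a clean review"
    return None
```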




&lt;h2&gt;
  
  
  Why Claude as arbiter works surprisingly well
&lt;/h2&gt;

&lt;p&gt;When I first designed this, I was skeptical that an LLM could reliably classify review comments. In practice, it works much better than expected — for a specific reason.&lt;/p&gt;

&lt;p&gt;The arbiter isn't making subjective judgments. It's doing a structured comparison: &lt;em&gt;does this review comment point to a violation of the acceptance criteria or architectural constraints in the spec?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If yes → critical or important, needs fixing.&lt;br&gt;
If no → minor or noise, can be ignored.&lt;br&gt;
If the review comment contradicts the spec → escalate.&lt;/p&gt;

&lt;p&gt;The spec acts as an objective grounding document. The arbiter doesn't need to have opinions about code quality in the abstract — it just needs to read and compare two documents. LLMs are very good at this.&lt;/p&gt;

&lt;p&gt;The key prompt constraint I found most important: &lt;strong&gt;the arbiter must not try to resolve conflicts between the reviewer and the spec&lt;/strong&gt;. When there's a conflict, it escalates. This keeps humans in the loop for the decisions that actually matter.&lt;/p&gt;
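
&lt;p&gt;In prompt terms, that comparison really is just two documents plus a narrow question. A hypothetical sketch of how such a prompt could be assembled (not vexdo's actual prompt):&lt;/p&gt;

```python
# Hypothetical sketch of an arbiter prompt: the spec is the ground truth,
# and the model is asked only to compare findings against it.
def build_arbiter_prompt(spec, diff, findings):
    lines = [
        "You are an arbiter. The spec below is the only ground truth.",
        "For each finding, decide: does it point to a violation of the",
        "acceptance criteria or architectural constraints? Do not resolve",
        "conflicts between a finding and the spec -- escalate instead.",
        "", "## Spec", spec, "", "## Diff", diff, "", "## Findings",
    ]
    lines += [
        f"- [{f['severity']}] {f['file']}:{f['line']} {f['comment']}"
        for f in findings
    ]
    return "\n".join(lines)
```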


&lt;h2&gt;
  
  
  Multi-repo support without a monorepo
&lt;/h2&gt;

&lt;p&gt;Most of my projects have multiple services in separate repositories. I didn't want to force a monorepo structure just to use an automation tool.&lt;/p&gt;

&lt;p&gt;vexdo uses a simple project layout on your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;projectRoot/
  .vexdo.yml          ← project config
  tasks/
    backlog/
    in_progress/
    review/
    done/
    blocked/
  service1/           ← git repo
  service2/           ← git repo
  service3/           ← git repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.vexdo.yml&lt;/code&gt; config maps service names to paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./backend&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./frontend&lt;/span&gt;
&lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-haiku-4-5-20251001&lt;/span&gt;
  &lt;span class="na"&gt;max_iterations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;auto_submit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;codex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service is its own git repo. vexdo treats &lt;code&gt;projectRoot&lt;/code&gt; as a workspace, not a repo.&lt;/p&gt;

&lt;p&gt;Multi-step tasks with dependencies work like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contracts&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EventType&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shared&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;schema"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;contracts&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implement&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handler&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EventType"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Display&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EventType&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steps without &lt;code&gt;depends_on&lt;/code&gt; can run in parallel (not implemented yet, but the architecture supports it). Steps with &lt;code&gt;depends_on&lt;/code&gt; run sequentially — the backend step doesn't start until contracts is reviewed, fixed, and submitted.&lt;/p&gt;

&lt;p&gt;Each step gets its own branch (&lt;code&gt;vexdo/task-001/backend&lt;/code&gt;), its own review loop, and its own PR. If the frontend step fails review 3 times and escalates, you get the full context of what happened in all previous steps, so you can make an informed decision.&lt;/p&gt;
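
&lt;p&gt;The scheduling rule above amounts to grouping steps into waves: anything whose &lt;code&gt;depends_on&lt;/code&gt; entries are all satisfied can go in the same wave. A hypothetical sketch:&lt;/p&gt;

```python
# Hypothetical sketch of depends_on scheduling: steps whose dependencies
# are all satisfied form one "wave"; waves run in order, and steps within
# a wave could run in parallel.
def plan_waves(steps):
    done, waves, remaining = set(), [], list(steps)
    while remaining:
        wave = [s for s in remaining if set(s.get("depends_on", [])) <= done]
        if not wave:
            raise ValueError("circular or unsatisfiable depends_on")
        waves.append([s["service"] for s in wave])
        done |= {s["service"] for s in wave}
        remaining = [s for s in remaining if s["service"] not in done]
    return waves
```

&lt;p&gt;For the contracts → backend → frontend example above, this yields three single-step waves; independent steps would share a wave instead.&lt;/p&gt;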




&lt;h2&gt;
  
  
  What the workflow looks like in practice
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize a new project&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/projects/my-project
vexdo init

&lt;span class="c"&gt;# Write a task spec (I do this with Claude Code interactively)&lt;/span&gt;
vim tasks/backlog/task-001.yml

&lt;span class="c"&gt;# Hand it off&lt;/span&gt;
vexdo start tasks/backlog/task-001.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then vexdo takes over. The output is a flat stream of progress markers — something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Step 1/2: backend: Add POST /events endpoint
→ Creating branch vexdo/task-001/backend
→ Running codex implementation for service backend
→ Starting review loop for service backend
Iteration 1/3
→ Collecting git diff for service backend
→ Requesting reviewer analysis (model: claude-haiku-4-5-20251001)
Review: 1 critical 1 important 1 minor
- critical (src/routes/events.ts:23): No validation on req.body
- important (src/routes/events.ts:31): Unhandled rejection from createEvent()
- minor (src/routes/events.ts:45): Inconsistent error message format
→ Requesting arbiter decision (model: claude-haiku-4-5-20251001)
→ Arbiter decision: fix (2 issues require fixing)
→ Applying arbiter feedback with codex
Iteration 2/3
→ Collecting git diff for service backend
→ Requesting reviewer analysis (model: claude-haiku-4-5-20251001)
Review: 0 critical 0 important 1 minor
→ Requesting arbiter decision (model: claude-haiku-4-5-20251001)
→ Arbiter decision: submit (no critical issues)

Step 2/2: frontend: Display new EventType in event list
→ Creating branch vexdo/task-001/frontend
→ Running codex implementation for service frontend
→ Starting review loop for service frontend
Iteration 1/3
→ Collecting git diff for service frontend
→ Requesting reviewer analysis (model: claude-haiku-4-5-20251001)
Review: 0 critical 0 important 0 minor
→ Requesting arbiter decision (model: claude-haiku-4-5-20251001)
→ Arbiter decision: submit (no issues found)

✓ Task ready for PR. Run 'vexdo submit' to create PR.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I come back to two clean PRs, each with a review summary attached. I read the summary, look at the diff, hit merge. The whole thing took 8 minutes and I didn't touch it.&lt;/p&gt;

&lt;p&gt;The iteration logs are preserved in &lt;code&gt;.vexdo/logs/{taskId}/&lt;/code&gt; — one diff, one review JSON, and one arbiter JSON per iteration per service. &lt;code&gt;vexdo logs task-001&lt;/code&gt; shows a summary; &lt;code&gt;vexdo logs task-001 --full&lt;/code&gt; dumps everything including diffs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What happens on escalation
&lt;/h2&gt;

&lt;p&gt;When the arbiter escalates, vexdo prints the full context — spec, all review comments with locations, arbiter reasoning — and exits with a non-zero code.&lt;/p&gt;

&lt;p&gt;The task file moves to &lt;code&gt;tasks/blocked/&lt;/code&gt;. Importantly, &lt;strong&gt;the state is preserved&lt;/strong&gt; — &lt;code&gt;.vexdo/state.json&lt;/code&gt; stays on disk with &lt;code&gt;status: escalated&lt;/code&gt;. The branches are preserved too. This means you can inspect exactly what happened, fix the spec or the code manually, and decide how to proceed.&lt;/p&gt;

&lt;p&gt;The recovery path is still manual: run &lt;code&gt;vexdo abort&lt;/code&gt; to clear the state, then restart with an updated spec. Automated recovery from escalation is on the roadmap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What doesn't work (yet)
&lt;/h2&gt;

&lt;p&gt;I want to be honest about the limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex has a complexity ceiling.&lt;/strong&gt; For well-scoped tasks — add an endpoint, update a client, add a utility function — it's great. For tasks that require deep understanding of implicit system invariants, it struggles. The spec helps a lot, but it's not magic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The arbiter can be too lenient.&lt;/strong&gt; If your spec is vague, the arbiter will be too. "Add proper error handling" is not a spec. "Return 400 with &lt;code&gt;{ error: string }&lt;/code&gt; for validation failures, 500 with a generic message for unexpected errors" is a spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No automatic rollback.&lt;/strong&gt; If step 3 of a 4-step task escalates, the previous steps are already complete (branches and, if &lt;code&gt;auto_submit: true&lt;/code&gt;, PRs are already created). You need to handle rollback manually. This is on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State recovery is basic.&lt;/strong&gt; If the process crashes mid-task, &lt;code&gt;vexdo start --resume&lt;/code&gt; picks up from the last completed step. But if it crashes mid-Codex-run, you need to clean up the unstaged changes manually before resuming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only GitHub.&lt;/strong&gt; The PR creation is wired to the &lt;code&gt;gh&lt;/code&gt; CLI. GitLab, Gitea, and others aren't supported.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @vexdo/cli

&lt;span class="c"&gt;# In your project&lt;/span&gt;
vexdo init

&lt;span class="c"&gt;# Set your API key&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key-here

&lt;span class="c"&gt;# Make sure you have the codex and gh CLIs installed&lt;/span&gt;
&lt;span class="c"&gt;# Then write a task and run it&lt;/span&gt;
vexdo start tasks/backlog/your-task.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo is at &lt;a href="https://github.com/vexdo/vexdo-cli" rel="noopener noreferrer"&gt;https://github.com/vexdo/vexdo-cli&lt;/a&gt;. Contributions welcome — especially around the state recovery story and parallel step execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bigger idea
&lt;/h2&gt;

&lt;p&gt;What I built is less about vexdo specifically and more about a pattern: &lt;strong&gt;AI agents work best when they have structured evaluation criteria and a clear escalation path to humans.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The spec is the evaluation criteria. The arbiter is the evaluator. Escalation is the safety valve.&lt;/p&gt;

&lt;p&gt;Without the spec, you get an agent that does something but you're not sure if it's right. Without the arbiter, you get a flood of review comments with no prioritization. Without escalation, you get an agent that either loops forever or merges bad code.&lt;/p&gt;

&lt;p&gt;All three together create something that actually feels autonomous — not because it never needs you, but because it knows &lt;em&gt;when&lt;/em&gt; it needs you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;vexdo is open source under MIT. If you build something with it or find a bug, open an issue — I read them all.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>agents</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>I ran a privacy proxy on my AI traffic. Here's what it found.</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Fri, 06 Mar 2026 10:51:56 +0000</pubDate>
      <link>https://dev.to/ubcent/i-ran-a-privacy-proxy-on-my-ai-traffic-heres-what-it-found-4dbo</link>
      <guid>https://dev.to/ubcent/i-ran-a-privacy-proxy-on-my-ai-traffic-heres-what-it-found-4dbo</guid>
      <description>&lt;p&gt;When I built &lt;a href="https://github.com/ubcent/velar" rel="noopener noreferrer"&gt;Velar&lt;/a&gt; — a local proxy that masks sensitive data before it reaches AI providers — I mostly thought of it as a tool for &lt;em&gt;other people's&lt;/em&gt; problems.&lt;/p&gt;

&lt;p&gt;I was wrong.&lt;/p&gt;

&lt;p&gt;After running it on my own machine during normal browser-based interactions with ChatGPT, here's what it intercepted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Masked Items
----------------------------------------
API_KEY:        30 ███████████████░░░░░
ORG:             9 ████░░░░░░░░░░░░░░░░
JWT:             1 ░░░░░░░░░░░░░░░░░░░░
Total:          40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;40 items. Without doing anything unusual. But the story behind that API_KEY number is what really got me.&lt;/p&gt;




&lt;h2&gt;
  
  
  30 API keys — before I even hit Send
&lt;/h2&gt;

&lt;p&gt;All 30 API_KEY detections came from a single session where I was editing a script directly inside the ChatGPT input field.&lt;/p&gt;

&lt;p&gt;Here's the thing most people don't realize: &lt;strong&gt;ChatGPT sends the contents of the input field to its servers in the background as you type.&lt;/strong&gt; Not when you hit Send — continuously, while you're still editing.&lt;/p&gt;

&lt;p&gt;So I pasted a script that contained an API key, spent a few minutes tweaking it before sending, and ChatGPT quietly transmitted that script — and the key inside it — 30 times to OpenAI's servers before I was done.&lt;/p&gt;

&lt;p&gt;I wasn't trying to send the key. I was just editing. That's the part that's hard to reason about intuitively.&lt;/p&gt;

&lt;p&gt;This is a real gotcha with browser-based AI chat: the moment sensitive data touches the input field, it's potentially already in transit — regardless of whether you decide to actually send the message.&lt;/p&gt;




&lt;h2&gt;
  
  
  What about the other numbers?
&lt;/h2&gt;

&lt;p&gt;The 9 ORG detections are a good example of current limitations. These were false positives from the ONNX NER model — it flagged the Russian word "Расскажи" ("tell me") as an organization name. The model is trained on English only, so it occasionally misreads non-English text as named entities. Something I'm actively working on.&lt;/p&gt;

&lt;p&gt;The 1 JWT is probably real — likely from a session token that ended up in a request payload somewhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  A note on scope
&lt;/h2&gt;

&lt;p&gt;This data covers &lt;strong&gt;browser-based interactions only&lt;/strong&gt; — ChatGPT in the browser, routed through Velar's MITM proxy.&lt;/p&gt;

&lt;p&gt;Intercepting IDE tools like Cursor or GitHub Copilot is a different and harder problem. They communicate over gRPC with protobuf, which requires a different interception approach than standard HTTPS traffic. That's on the roadmap, but not there yet — and honestly, that's probably where the more interesting (and scarier) data would come from, given that those tools have access to your full codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Velar does with detected values
&lt;/h2&gt;

&lt;p&gt;Each value gets replaced with a deterministic placeholder before the request is forwarded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sk-proj-abc123...  →  [API_KEY_1]
eyJhbGci...        →  [JWT_1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI still gets enough context to be useful. When the response comes back, Velar restores the originals — so your tools keep working normally.&lt;/p&gt;

&lt;p&gt;Everything runs locally. No cloud processing, no external logging, no callbacks home. MIT-licensed Go — you can read the source and verify.&lt;/p&gt;




&lt;h2&gt;
  
  
  The broader pattern
&lt;/h2&gt;

&lt;p&gt;AI coding tools are getting &lt;em&gt;more&lt;/em&gt; context access, not less. Cursor reads your whole codebase. Agents are being given filesystem and terminal access. The more capable these tools get, the more opportunities there are for sensitive data to end up in that context without anyone actively deciding to send it.&lt;/p&gt;

&lt;p&gt;The input field thing is a small example of a bigger pattern: the boundary between "data I'm sharing" and "data that's being transmitted" is increasingly blurry. Most developers I've talked to haven't thought carefully about where that boundary sits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;Velar is experimental — I'm still figuring out what it should become, and I'd be the first to say the detection isn't perfect. Regex-based detection for structured values like API keys is reasonably reliable. NER-based detection for things like names and organizations is still rough, as the false positives above show.&lt;/p&gt;

&lt;p&gt;Also, yes — Velar is itself a MITM proxy, which is a fair thing to be skeptical about. It only intercepts domains you explicitly configure. The source is open and auditable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ubcent/velar.git
&lt;span class="nb"&gt;cd &lt;/span&gt;velar
make build
./bin/velar ca init
./bin/velar start
./bin/velar proxy on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it for a few days and check &lt;code&gt;velar stats&lt;/code&gt;. I'm curious whether other people hit the same input-field behavior — or find something I haven't seen yet.&lt;/p&gt;

&lt;p&gt;If you try it, share your breakdown in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>privacy</category>
      <category>security</category>
    </item>
    <item>
      <title>I realized my AI tools were leaking sensitive data. So I built a local proxy to stop it</title>
      <dc:creator>Dmitry Bondarchuk</dc:creator>
      <pubDate>Thu, 26 Feb 2026 11:31:45 +0000</pubDate>
      <link>https://dev.to/ubcent/i-realized-my-ai-tools-were-leaking-sensitive-data-so-i-built-a-local-proxy-to-stop-it-2pma</link>
      <guid>https://dev.to/ubcent/i-realized-my-ai-tools-were-leaking-sensitive-data-so-i-built-a-local-proxy-to-stop-it-2pma</guid>
      <description>&lt;p&gt;A few months ago I had a moment of uncomfortable clarity.&lt;/p&gt;

&lt;p&gt;I was using Cursor to work on a project that had database credentials in a .env file. The AI had full access to the codebase. I wasn't thinking about it — I was just coding. And then it hit me: &lt;strong&gt;all of this is going to their servers right now&lt;/strong&gt;. The keys, the internal URLs, everything.&lt;/p&gt;

&lt;p&gt;I stopped and thought about how long I'd been doing this without a second thought. And then I asked a few colleagues. Same story. Nobody was really thinking about it. We all just... trusted that it was fine.&lt;/p&gt;

&lt;p&gt;It probably is fine, most of the time. But "probably fine" is not a compliance posture. And as AI coding tools get deeper access to our codebases, the surface area for accidental leaks keeps growing.&lt;/p&gt;

&lt;p&gt;That's why I built &lt;strong&gt;&lt;a href="https://github.com/ubcent/velar" rel="noopener noreferrer"&gt;Velar&lt;/a&gt;&lt;/strong&gt; — a local proxy that sits between your app and AI providers, detects sensitive data, and masks it before it ever leaves your machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg4zwv8hz4v87ausunjr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg4zwv8hz4v87ausunjr.gif" alt=" " width="1024" height="640"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem is getting worse, not better
&lt;/h2&gt;

&lt;p&gt;Copilot, Cursor — these tools are genuinely useful. But they work by sending your code (and often a lot of surrounding context) to external APIs. Most developers don't think carefully about what's in that context.&lt;/p&gt;

&lt;p&gt;Common things that end up in AI requests without people realizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS/GCP/Azure credentials accidentally committed or present in env files&lt;/li&gt;
&lt;li&gt;Database connection strings&lt;/li&gt;
&lt;li&gt;Internal API endpoints and tokens&lt;/li&gt;
&lt;li&gt;Customer emails or names in logs you're debugging&lt;/li&gt;
&lt;li&gt;JWTs from test sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is malicious. It's just how development works. But "it's not malicious" doesn't mean it's not a problem when you're dealing with regulated data or working in an enterprise environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Velar works
&lt;/h2&gt;

&lt;p&gt;Velar runs locally as an HTTP/HTTPS proxy with MITM support. You configure it to intercept traffic to specific domains (like &lt;code&gt;api.openai.com&lt;/code&gt;), and it inspects outbound payloads before forwarding them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your app → Velar → AI provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it detects something sensitive, it replaces it with a deterministic placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alice@company.com → [EMAIL_1]
AKIAIOSFODNN7EXAMPLE → [AWS_KEY_1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, when the response comes back, Velar restores the original values — so your app keeps working exactly as expected.&lt;/p&gt;
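&lt;p&gt;The actual implementation lives in the repo, but the core mask-and-restore idea fits in a few lines of Go. This is a minimal sketch (the type and function names here are mine, not Velar's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Masker replaces detected values with stable numbered placeholders
// and remembers the mapping so the originals can be restored later.
type Masker struct {
	pattern       *regexp.Regexp
	label         string
	byPlaceholder map[string]string
	byOriginal    map[string]string
}

func NewMasker(label, pattern string) Masker {
	return Masker{
		pattern:       regexp.MustCompile(pattern),
		label:         label,
		byPlaceholder: map[string]string{},
		byOriginal:    map[string]string{},
	}
}

// Mask rewrites every match in the outbound payload. The same value
// always maps to the same placeholder (deterministic).
func (m Masker) Mask(payload string) string {
	return m.pattern.ReplaceAllStringFunc(payload, func(match string) string {
		if ph, ok := m.byOriginal[match]; ok {
			return ph
		}
		ph := fmt.Sprintf("[%s_%d]", m.label, len(m.byOriginal)+1)
		m.byOriginal[match] = ph
		m.byPlaceholder[ph] = match
		return ph
	})
}

// Restore puts the original values back into the inbound response.
func (m Masker) Restore(response string) string {
	for ph, original := range m.byPlaceholder {
		response = strings.ReplaceAll(response, ph, original)
	}
	return response
}

func main() {
	emails := NewMasker("EMAIL", `[a-z0-9.]+@[a-z0-9.]+\.[a-z]{2,}`)
	masked := emails.Mask("contact alice@company.com, cc bob@company.com")
	fmt.Println(masked)                 // contact [EMAIL_1], cc [EMAIL_2]
	fmt.Println(emails.Restore(masked)) // originals come back
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping both directions of the mapping is what makes the placeholders deterministic: the same value always becomes the same placeholder within a session, so the AI can refer to it consistently across the conversation.&lt;/p&gt;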

&lt;p&gt;Everything happens locally. No external services, no logging to the cloud, no callbacks home. You can read the full source and verify this yourself — it's MIT-licensed Go code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it detects
&lt;/h2&gt;

&lt;p&gt;Current detection is regex-based and covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emails, phone numbers, names&lt;/li&gt;
&lt;li&gt;AWS, GCP, Azure credentials&lt;/li&gt;
&lt;li&gt;Private keys&lt;/li&gt;
&lt;li&gt;Database URLs&lt;/li&gt;
&lt;li&gt;JWTs&lt;/li&gt;
&lt;li&gt;High-entropy strings (potential secrets)&lt;/li&gt;
&lt;/ul&gt;
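&lt;p&gt;The high-entropy check is the fuzziest of these. The usual approach (my assumption about how it works here — the exact heuristic and threshold below are illustrative, not necessarily Velar's) is Shannon entropy over each candidate token:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the average bits of information per character.
// Random-looking secrets score high; ordinary words score low.
func shannonEntropy(s string) float64 {
	counts := map[rune]float64{}
	total := 0.0
	for _, r := range s {
		counts[r]++
		total++
	}
	entropy := 0.0
	for _, c := range counts {
		p := c / total
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	// A plausible rule: flag tokens of 16+ chars scoring above ~3.5 bits/char.
	fmt.Printf("%.2f\n", shannonEntropy("AKIAIOSFODNN7EXAMPLE")) // high
	fmt.Printf("%.2f\n", shannonEntropy("configuration"))        // lower
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A random-looking key scores noticeably higher than an ordinary English word, but the gap is narrow — which is why entropy alone tends to need length and character-set filters on top to keep false positives down.&lt;/p&gt;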

&lt;p&gt;There's also optional ONNX NER support via a locally downloaded model (&lt;code&gt;dslim/bert-base-NER&lt;/code&gt;) for more accurate PII detection. Fair warning: this part is still rough and doesn't always behave as expected — it's something I'm actively working on.&lt;/p&gt;




&lt;h2&gt;
  
  
  "But wait — you're asking me to install a MITM proxy?"
&lt;/h2&gt;

&lt;p&gt;Yes. This is the obvious concern, and it's a fair one.&lt;/p&gt;

&lt;p&gt;Here's the honest answer: Velar only intercepts traffic to domains you explicitly configure. By default that's &lt;code&gt;api.openai.com&lt;/code&gt;. It doesn't touch your banking traffic, your Slack messages, or anything else.&lt;/p&gt;

&lt;p&gt;More importantly — you can verify this. The network code is small and straightforward. There are no background processes phoning home. No analytics. No telemetry. Just a local proxy doing exactly what it says.&lt;/p&gt;

&lt;p&gt;I understand if that's still not enough for some people, and that's fine. But for developers who are already sending sensitive data to AI providers without any filtering layer — Velar represents a net improvement in privacy, not a reduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ubcent/velar.git
&lt;span class="nb"&gt;cd &lt;/span&gt;velar
make build
./velar ca init
./velar start
./velar proxy on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. You'll start seeing local notifications when Velar masks something in your AI traffic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where it's going
&lt;/h2&gt;

&lt;p&gt;Honestly — I'm not entirely sure yet. This is v0.0.3, explicitly experimental, and I'm still figuring out the right direction. Some things I'm thinking about: stricter blocking mode, a local dashboard, better cross-platform support (notifications are currently macOS-only, though the proxy itself runs anywhere). But nothing is set in stone.&lt;/p&gt;

&lt;p&gt;What I do know is that I'd rather ship something real and iterate based on feedback than plan in a vacuum.&lt;/p&gt;




&lt;p&gt;If this sounds useful, check it out on &lt;a href="https://github.com/ubcent/velar" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Issues, PRs, and honest feedback are all welcome.&lt;/p&gt;

&lt;p&gt;And if you've had your own "oh no, what have I been sending to ChatGPT/Claude" moment — I'd love to hear about it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
    </item>
  </channel>
</rss>
