DEV Community: Mario Hayashi

The Factory Must Grow (Part III): Stopping the AI Agent Production Line Toyota-style

Mario Hayashi — Mon, 04 May 2026 08:56:20 +0000

Welcome back. Thanks for reading this blog series. I got insightful questions and comments in the past weeks — please keep them coming. But, yes, let’s get back to business: the factory must grow.

The Factory Broke

My orchestrator reported a clean healthcheck: the PRD issue moved to Done, workers exited without errors. There was just one problem. A feature I was keen on didn’t ship. No branch, no commit, no PR. The orchestrator closed what appeared to be a successful issue even though the agent had done zilch.

Part II of this series concluded along these lines: the right architecture helps debug tricky issues; loud failures help us learn to grow the factory. This post is about the operating discipline that goes into growing AI agent orchestration. For full context on how the factory came to be, see Part I — The Factory Must Grow: I Replaced Myself with AI. Now What?, where it all started.

Over two weeks, the system silently dropped a dozen or so PRD issues and closed them as if they shipped. The stalled agents kept retrying and spun endlessly in cycles before quitting. Nothing committed.

The hardest bugs to fix are the silent ones. A crash gives you a stack trace. A silent drop gives you false confidence. In my case, the orchestrator reported success, closed the issue and moved on when it shouldn’t have. I had some choice words with Claude but it wasn’t Claude’s fault.

The fix wasn’t a plaster. It was the realisation that I needed the factory to stop completely, while I investigated the failures one by one. Dead code beats limping code. I can recover from a halted run but not as easily from corrupted state. The discipline I needed to get on top of my mounting issues came from a Japanese car maker.

Why Toyota Production System - Does It Apply To Agents?

The Toyota Production System (TPS) is the set of principles Toyota’s factories run on: build quality in at the source, make the wrong thing impossible, let anyone stop the line, walk symptoms back to root policies and treat waste as a defect. In TPS vocabulary: jidoka, poka yoke, the andon cord, the five whys and muda.

Why did I settle on TPS specifically? There are a whole bunch of DevOps principles you could apply to agent orchestration, I’m sure. But I’ve been fascinated by Taiichi Ohno’s TPS recently and thought it’d be apt to use it for my factory and agent management.

A worker corrupting the board. Or a worker deciding to call it quits and going AWOL. AI agent failures can look like any of the following:

Agents can hallucinate about their own output. An AI worker can write a long essay claiming it built a feature while having written zero lines of code.
Agents are non-deterministic. You can’t necessarily replay a run with the same input and expect the exact same output. This can be a real issue, as we may want determinism at critical decision points.
Agents can loop out of control. The worst offender. A misfiring agent can burn millions of tokens.

How much of TPS can we apply to managing agents? That’s the rest of this post.

Jidoka: The Silent Drops

In retrospect, the biggest failures were the silent ones. A dozen PRD issues were closed as if they had shipped, but none of them had. A worker reported success, the orchestrator transitioned the issue to “terminal” and the PRD was gone.

Jidoka is Sakichi Toyoda’s principle of designing equipment to stop automatically the moment an abnormality appears. On a car line it means the machine stops the moment a defect is observed: it does not wait for the quality team at the very end of the line. For agents, this means every worker writes a “verdict” file at the end of its run. The orchestrator doesn’t look at whether the agent process ran successfully (which can create false positives) but reads this verdict instead. If the verdict is present and positive, we ship. If it’s missing or failed, we investigate.

While it’s fairly obvious whether a car has been built or not, an AI worker can answer “yes” to anything. So to achieve jidoka for agents, we have to verify worker outputs that aren’t self-reported: commits, diffs and a verdict file written deterministically at a known path.

Taiichi Ohno, who built TPS, said that stopping the machine when there is an issue forces awareness on everyone. When the problem is clearly understood, improvement becomes possible.

Replace “machine” with “AI agent” and it’s still true. In this particular case, “success” is not when the worker finished without error but is when the worker has output a deterministic result. (Of course, a malicious worker might game the system and lie but that’s out of scope for this discussion.)
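A minimal sketch of that verdict check, under stated assumptions: the `RunEvidence` shape, the `judgeRun` name and the commit-count input are hypothetical, and the verdict values DONE and FAILED are borrowed from the workflow spec shown in Part II. The point it illustrates is that the exit code never decides the outcome.

```typescript
// Hypothetical shapes: the post does not show the real verdict format.
type Verdict = "DONE" | "FAILED";

interface RunEvidence {
  verdictFile: string | null; // raw contents of the verdict file, null if it is missing
  commitCount: number;        // commits counted by the orchestrator, not reported by the worker
  exitCode: number;           // recorded for the logs, deliberately not used to decide success
}

type Decision =
  | { kind: "ship" }
  | { kind: "investigate"; reason: string };

function parseVerdict(raw: string | null): Verdict | null {
  const value = raw?.trim();
  return value === "DONE" || value === "FAILED" ? value : null;
}

function judgeRun(evidence: RunEvidence): Decision {
  const verdict = parseVerdict(evidence.verdictFile);
  if (verdict === null) return { kind: "investigate", reason: "verdict missing or unreadable" };
  if (verdict === "FAILED") return { kind: "investigate", reason: "worker reported failure" };
  // A DONE verdict still has to be backed by evidence the worker cannot talk its way around.
  if (evidence.commitCount === 0) return { kind: "investigate", reason: "DONE verdict but no commits" };
  return { kind: "ship" };
}
```

A clean exit with no verdict file lands in "investigate", which is exactly the case that used to be closed as shipped.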

Poka yoke: The Retry Storm

The second type of failure I encountered looked different. Some agents would keep retrying an issue, making zero progress. The retry logic was supposed to have a cap but it didn’t. The switch statement to decide the next step was missing some logic and fell back to retrying.

Shigeo Shingo’s “poka yoke” is about making the wrong thing impossible as part of the machinery, not as a matter of operational discipline. A poka yoke fixture prevents a part from being installed at all unless it’s oriented correctly. A prompt saying “remember to handle XYZ” is not poka yoke.

Poka yoke needs to live at the orchestrator’s interface boundary and be defined in the form of accepted dispatch outcomes and state transitions. The fix was to make sure every possible dispatch outcome is handled explicitly, enforced at compile time. A second poka yoke was placed at the state-transition layer. The orchestrator errors if asked to close an issue that still has the “waiting for human answers” label, because a paused issue should never be silently closed.
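Both guards can be sketched in a few lines of TypeScript. This is an illustration rather than the orchestrator's actual code: the outcome names, `MAX_ATTEMPTS` and the function names are assumptions. The first guard uses an exhaustive switch with a `never` check, the second throws at the state-transition layer.

```typescript
// Hypothetical outcome names; the real orchestrator's set is not shown in the post.
type DispatchOutcome =
  | { kind: "succeeded" }
  | { kind: "failed"; attempt: number }
  | { kind: "stalled" }
  | { kind: "halted" };

type NextStep = "finalise" | "retry" | "abandon" | "stop-line";

const MAX_ATTEMPTS = 3; // assumed retry cap

function assertNever(value: never): never {
  throw new Error(`Unhandled dispatch outcome: ${JSON.stringify(value)}`);
}

function nextStep(outcome: DispatchOutcome): NextStep {
  switch (outcome.kind) {
    case "succeeded": return "finalise";
    case "failed": return outcome.attempt < MAX_ATTEMPTS ? "retry" : "abandon";
    case "stalled": return "abandon";
    case "halted": return "stop-line";
    // If a new outcome kind is added but not handled, this line stops compiling.
    default: return assertNever(outcome);
  }
}

interface Issue { id: number; labels: string[]; }

const AWAITING_LABEL = "waiting for human answers";

// Second guard: a paused issue can never be closed, whoever asks.
function closeIssue(issue: Issue): Issue & { closed: true } {
  if (issue.labels.includes(AWAITING_LABEL)) {
    throw new Error(`Refusing to close paused issue #${issue.id}`);
  }
  return { ...issue, closed: true };
}
```

There is no default branch that retries, so an unknown outcome cannot quietly turn into a retry storm.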

Andon Cord: Hook Timeouts

The third failure didn’t burn tokens but killed work. A hook inside the worker loop occasionally timed out and took the whole run down with it. The issue was marked “exhausted” even though the agent was fine. No commit, no signal.

The andon cord is the rope anyone on a Toyota line can pull to stop production (these days, it’s an electronic button). Production halts, the line manager looks, the problem is fixed at the station where it surfaced. The principle is that it is always cheaper to stop the line than to ship a defect downstream.

I built the software version of the andon cord. Any worker, on detecting a potential systemic risk, can write a “halt” verdict to a file and exit. The orchestrator catches the halt signal and engages a workflow-scoped lock. It then stops dispatching that workflow until a human jumps in and clears the lock.
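As a rough sketch of the cord, assuming an in-memory lock keyed by workflow name and a verdict string of "HALT" (both hypothetical; a real lock would need to survive orchestrator restarts):

```typescript
// Hypothetical in-memory andon lock, scoped per workflow.
class AndonLock {
  private haltedBy = new Map<string, string>(); // workflow name -> reason

  pull(workflow: string, reason: string): void {
    // Keep the first reason: it is the one closest to the original problem.
    if (!this.haltedBy.has(workflow)) this.haltedBy.set(workflow, reason);
  }

  isHalted(workflow: string): boolean {
    return this.haltedBy.has(workflow);
  }

  reasonFor(workflow: string): string | undefined {
    return this.haltedBy.get(workflow);
  }

  // Meant to be called by a human operator after investigating, never by the orchestrator itself.
  clear(workflow: string): void {
    this.haltedBy.delete(workflow);
  }
}

// Called when a worker exits; verdict is the raw verdict file contents, null if missing.
function onWorkerExit(lock: AndonLock, workflow: string, verdict: string | null): void {
  if (verdict?.trim() === "HALT") lock.pull(workflow, "worker wrote a halt verdict");
}

// Checked before every dispatch.
function canDispatch(lock: AndonLock, workflow: string): boolean {
  return !lock.isHalted(workflow);
}
```

The design choice worth noting is that nothing in the code path clears the lock automatically; resuming is a deliberate human action.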

Andon is doubly useful for non-deterministic AI workers. With agents, you can’t replay a halted run cheaply. Re-running the same prompt against the same state can yield a different result, possibly a different failure. A missed stop, however, could mean a corrupted board and a problem you can no longer debug, because the original issue is buried deep in the logs. A false-positive stop is the cheaper mistake: it costs about five minutes and a re-dispatch.

Agents pulled the cord several times in the first week after the cord’s introduction. Each time, I investigated the root cause with Claude and the system got better.

Knowing the line would stop itself the moment it sensed something off meant I could look at each edge case properly, instead of adding a “plaster” fix with retries that would have papered over the root cause.

Muda: The Budget Burn

The fourth failure cost lots of tokens. An agent picked up an issue, ran, stalled, retried, ran, stalled, retried in cycles, burning tokens.

Muda means waste and, in TPS, waste is a defect.

A stamping press wasting time on a car line might waste steel and electricity for that one car. An agent wasting time might burn tokens all the way up to the weekly token limit, unless you stop it first. Muda on a car line is bad, but muda in an agent fleet compounds with every additional agent running.

The no-progress guard is simple. After every reported success and every budget breach, the orchestrator checks the worker’s workspace for a diff. If the diff is empty, the issue is abandoned regardless of what the agent claimed. No commit means no progress and that means the run should be halted.

Muda in practice means not just asking “how much should I let an agent spend before I cut it off?” but asking “how much should I let an agent spend before it has produced the first commit?” The latter spend should be much smaller.
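A small sketch of the guard with both budgets, assuming token-denominated limits and hypothetical field names. The only inputs are things the orchestrator measures itself; the agent's own claim of success merely triggers the check.

```typescript
// Hypothetical shapes; limits are expressed in tokens for illustration.
interface RunSnapshot {
  spentTokens: number;
  commits: number;          // commits produced so far, counted by the orchestrator
  diffIsEmpty: boolean;     // result of inspecting the worker's workspace
  reportedSuccess: boolean; // what the agent claims
}

interface Budget {
  totalTokens: number;
  beforeFirstCommitTokens: number; // should be much smaller than totalTokens
}

type GuardResult = "pass" | "abandon";

function noProgressGuard(run: RunSnapshot, budget: Budget): GuardResult {
  // Early cut-off: nothing committed yet and the small pre-commit allowance is used up.
  if (run.commits === 0 && run.spentTokens >= budget.beforeFirstCommitTokens) return "abandon";

  // Checkpoint from the post: after a reported success or a budget breach, inspect the diff.
  const atCheckpoint = run.reportedSuccess || run.spentTokens >= budget.totalTokens;
  if (atCheckpoint && run.diffIsEmpty) return "abandon";

  return "pass";
}
```

What happens on a budget breach with a non-empty diff is left open here, since the post only specifies the empty-diff case.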

Five Whys

You’ll probably be familiar with the five whys. You ask the questions that surface the real issue. The root cause might be hiding in the worker logs or the orchestrator dispatch code. The TPS principle is to keep asking “why” until you hit a root policy. The five whys chain from the PRD-drop incident goes something like this:

Why was the issue closed? The orchestrator closed an issue on a successful run.
Why did that path fire? The dispatch handler treated success as terminal and then routed the issue to the state listed first in the workflow’s terminal-states list.
Why did that move the issue to a paused state? The PRD workflow’s terminal-states list had “Awaiting Answers” first.
Why wasn’t this caught by config validation? No field required the workflow to declare which state actually meant “done”, separate from “terminal”.
Why was there no config field? The code had a hard-coded fallback to “Done” that quietly substituted whenever config left it undefined, so the absence was never validated as missing.

The fix that came out of the final why: every workflow config must declare its done state explicitly, and the schema rejects any config that doesn’t. This needed a change to the schema and a migration but that’s better than a bug lurking under code smell.
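The validation rule can be sketched as follows. The field name `done_state` and the function name are assumptions made for illustration; the real schema is not shown in this post.

```typescript
// Hypothetical config shape; only the fields needed for the check are modelled.
interface WorkflowConfig {
  name: string;
  terminal_states: string[];
  done_state?: string; // assumed field name for the explicitly declared "done" state
}

// Returns a list of validation errors; an empty list means the config is accepted.
function validateWorkflowConfig(config: WorkflowConfig): string[] {
  const errors: string[] = [];
  if (config.done_state === undefined) {
    // No fallback to "Done": absence is an error, not something to paper over.
    errors.push(`${config.name}: done_state must be declared explicitly`);
  } else if (!config.terminal_states.includes(config.done_state)) {
    errors.push(`${config.name}: done_state "${config.done_state}" is not listed in terminal_states`);
  }
  return errors;
}
```

With this in place, a PRD workflow whose terminal states are listed as "Awaiting Answers" and then "Done" fails validation until it names which one means done, so list order stops mattering.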

Intermission: How You Can Get Started

I’ve been asked where to get started with agent orchestration. If you’re building your own agent factory, perhaps the fastest place to start is the Symphony spec. It’s a language- and tracker-agnostic orchestrator spec, written by OpenAI. I suggest you read the spec, pick a language and pick an issue tracker, and you have a starting point.

What would I add to it? I made a fork, so you can have a look at changes I’d make to the spec based on my own preferences and experiences. It contains “Recommended Extensions” that e.g. account for worker hallucinations, hook timeouts or prevent tokens from being burned.

Your mileage may vary.

Work In Progress

The system still breaks. It’s a work in progress and I expect it to stay that way for a while. Every new failure is now spotted earlier, thanks to jidoka, poka yoke, andon cord, muda and five whys. What’s changed with TPS is that the factory is no longer allowed to limp. If an agent moves an issue to “Done” but produces nothing, the line stops. If a hook times out, the line stops. If a verdict is missing, the line stops. I’d rather have dead code than limping code, because limping code can create real damage and dead code can be investigated.

What TPS teaches us for an agent factory is the discipline about _where_ to enforce principles. The principles aren’t optional gates on the way to autonomy (in my opinion). They make autonomy possible at all. You can’t trust a worker to run unattended until you’ve watched it fail in every shape it knows how to fail in. The only way to safely have workers fail is to stop the line every time and walk the problem to a root policy.

My factory stopped dropping PRDs. The finish line isn’t necessarily a system that never stops. It’s a system that stops for the right reasons and that I trust enough to leave running when it doesn’t. (And won’t burn millions of tokens!)

If you’re experimenting with AI in your workflow, I’d love to hear from you! I write more like this at blog.mariohayashi.com, and feel free to follow me on X: @logicalicy.

The Factory Must Grow (Part II): From Spaghetti AI Agent Orchestrator to a Main Bus

Mario Hayashi — Thu, 30 Apr 2026 23:12:11 +0000

This post is about The Big Rejig and mistakes that burned millions of Claude tokens. Part I post here.

AI agent orchestration: From spaghetti architecture to clean, predictable Elm architecture

The First Factory Worked

For those who’ve played the game, your first Factorio factory works. Iron moves on the conveyor belts, copper is delivered to assemblers and circuit boards come out on the other side. You added one thing at a time, solved one problem at a time and the factory hums.

Then you decide to scale it.

You try to add a new conveyor belt and realise the belt you need runs straight through the middle of three other assemblers. You try to decouple and reroute the belt. Then that reroute cuts off a different production line. You try to fix that. The fix introduces a bottleneck for another belt. You fix that. Hours later, your factory looks like patchwork.

The foundation works. But beware, improving it is like open chest surgery.

Every Factorio player hits this wall at some point. It’s spaghetti. Kabelsalat. Belts criss-crossing, inserter arms reaching across each other, production lines skirting the walls of multiple factories. The factory grew and the spaghetti grew with it.

The same problem awaits when you add AI agents. The orchestration system I created runs agents like workers on a production line. They pick up issues from a “queue”, write code, open PRs and address review comments, while a central orchestrator decides who runs when and what state every issue is in. It started small (dev). Then I added a second workflow (PR reviews). Then a third (PRDs). Each one added an if-statement case here and a fallback there. The factory worked but touching it became open chest surgery (again).

The fix is the “main bus”. Take out the spaghetti and replace it with a centralised spine of parallel belts, each carrying resources in one direction. New production lines branch off the bus. They get what they need, return their outputs and never cut through the bus. The bus itself doesn’t know what the branches are doing. After my refactor, state and business logic are clearly decoupled!

The Spaghetti Era

Spaghetti is what happens when you don’t have state-machine discipline. Each bug that surfaces is another way imperative dispatching with fallbacks rots over time. When I first described the orchestrator in Part I, it was polling a board, picking up issues, spawning workers. The pipeline was understandable. But there were more than a dozen ways it could fail, and that was not sustainable.

The cost of spaghetti architecture

State transitions lived inside imperative dispatching code. The orchestrator would pick up an issue, check conditions, mutate the issue’s state, start a worker, check more conditions, mutate the state again. All of it happened in imperative code. There was no single place where “what should happen next” was described. State transitions happened gradually, across several functions that called each other in ways that were hard to trace.

The silent fallbacks were the worst offenders. A switch statement that handled five states but left out two. A default fallback that logged a warning and just moved on.

There were also silently dropped issues. The next tick would check the issue, perhaps find partial state and move on to the next candidate. Only a manual audit surfaced those dropped issues. Imagine an employee threw your work briefs in the bin. That’s a pretty bad employee.

Stuck states at least were visible. The worker may have finished the work but the issue hadn’t moved forward. The orchestrator had seen the worker finish. But a condition somewhere hadn’t been met, so the state transition never completed. The issue was stuck in purgatory.

Then, there were workers that silently dropped state during retries. A continuation run (an agent resuming after its turn limit, put in place to avoid runaway infinite work) would pick up mid-task, make progress and then write back state that was missing fields an earlier run had populated. Oops. As far as it was concerned, it did its job and there was no error and no warning.

Finally, there was “PR ping-pong”, where an issue would cycle between “needs review” and “needs fix” infinitely because the “fix” and the “review” workflows had slightly different conditions about what a resolved review comment looked like. Each thought the other was wrong. So they just kept handing the issue back and forth and burning precious Claude tokens.

Most of these bugs were caused by a default fallback in some switch statement that was not anticipated. The factory grew faster than the imperative switch statements.

The Main Bus

In a spaghetti architecture, the “what should happen” and the “how it’s done” are tangled together. In a main-bus architecture they’re decoupled.

What inspired me to do a big refactor and bring order to my AI agent orchestration was the Elm architecture (the pattern that also inspired Redux, the state library widely used with React). A pure function (a “reducer”, which I also call “the bus” throughout this post) decides what should happen. An interpreter does the work. This is the new, simple architecture.

Moving to the main bus architecture (i.e. Elm architecture)

I spent a couple of weeks thinking about and rewriting the architecture. The spaghetti factory’s problems were: outputs feeding back into inputs, state changes happening in many different places, no single place where “what is the factory doing” is clear.

The main bus architecture (inspired by the Elm architecture) solves this by separating two things that spaghetti factories mix: the decision and the action (deciding what to do vs doing it).

Deciding what to do is done by a pure function called a “reducer”. In the codebase it is literally a function named “decide()”. This is the bus. It takes the current state of an issue and an incoming event, then it returns the next state and a list of things to perform (“side effects”). It does not perform the side effects itself, as that’s not the bus’s job. It does not check the clock or call any external system. It produces a description of what should happen next, as data (JSON values that the codebase calls “effects”).

The interpreter performs the side effects. It takes the list of side effects, goes through them and does the actual work. It writes to GitHub, spawns a worker AI agent and sends a notification. The interpreter does not decide anything. It just performs what the bus told it to do.

The original orchestrator I built mixed both of these but now they are separated. The bus decides what should be performed and the interpreter acts. Responsibilities are clearly separated. The new flow is:

state + event → reducer → [next state, side effects] → interpreter → actions

The immediate benefit was explainability. When something went wrong, there were only a couple of places to look. If a decision is wrong, that’s a bus bug. If the execution is wrong, that’s an interpreter bug. In the spaghetti factory I had no idea where to look. The main bus has exactly two places to look. Implicitly, this saves lots of tokens too.
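To make the split concrete, here is a deliberately tiny sketch of the pattern. Only the name `decide` comes from the post; the state, event and effect shapes and the `interpret` function are hypothetical and much simpler than a real orchestrator's. For brevity the reducer also ignores whether the incoming state makes the event valid.

```typescript
type IssueState = "todo" | "in_progress" | "done" | "abandoned";

type BusEvent =
  | { type: "worker_started" }
  | { type: "verdict_received"; verdict: "DONE" | "FAILED" };

// Effects are plain data describing what should happen; nothing is executed here.
type Effect =
  | { type: "set_board_status"; status: IssueState }
  | { type: "notify"; message: string };

// The bus: pure, no clock, no I/O. Same inputs always give the same outputs.
function decide(state: IssueState, event: BusEvent): [IssueState, Effect[]] {
  switch (event.type) {
    case "worker_started":
      return ["in_progress", [{ type: "set_board_status", status: "in_progress" }]];
    case "verdict_received": {
      const next: IssueState = event.verdict === "DONE" ? "done" : "abandoned";
      return [next, [
        { type: "set_board_status", status: next },
        { type: "notify", message: `Issue moved from ${state} to ${next}` },
      ]];
    }
  }
}

// The interpreter: performs effects and decides nothing.
// Handlers are injected so this sketch does not touch GitHub or any real system.
interface Handlers {
  setBoardStatus(status: IssueState): void;
  notify(message: string): void;
}

function interpret(effects: Effect[], handlers: Handlers): void {
  for (const effect of effects) {
    switch (effect.type) {
      case "set_board_status": handlers.setBoardStatus(effect.status); break;
      case "notify": handlers.notify(effect.message); break;
    }
  }
}
```

A wrong next state points at `decide`; a correct effect that was carried out badly points at a handler.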

Deterministic

Especially with AI, I need the same inputs to always produce the same outputs.

The reducer has no clock and no I/O. It’s a pure function that doesn’t perform actions or run complex logic. It cannot call outside APIs. Given the same state and the same event, it will always produce the same next state and the same side effects list.

In the spaghetti era, to debug “Issue #45 got stuck”, you’d have to look at the logs, try to reconstruct the sequence of events and wonder whether the retry happened before or after the state was written. Even with good logs, this is very hard. State may or may not have changed when the bug occurred. You had to reconstruct it, if you were lucky enough to have the logs.

With a deterministic bus, debugging looks different. The event log is append-only, one record per decision, immutable, nothing ever overwritten. To understand why Issue #45 bugged out, you just replay the event log against the initial state. If something went wrong, you can see exactly which event triggered it, exactly what state the bus saw and what it decided. This is called “event sourcing”. The log is the source of truth and you can reconstruct the state of the world with it.

The log is also a test harness in the sense that you can verify that the bus behaves correctly with it. Use the events to assert the next state. A test for a new state transition is simple: initial state, input event, expected next state, then call the function.
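Replaying a log is just a fold over the events. The sketch below assumes a generic `replay` helper and a toy reducer with made-up state and event names; it is meant to show the mechanics, not the orchestrator's actual log format.

```typescript
// Any pure reducer that returns [nextState, effects] can be replayed.
type Reducer<S, E> = (state: S, event: E) => [S, unknown[]];

// Returns every intermediate state, so you can see exactly where things went wrong.
function replay<S, E>(reducer: Reducer<S, E>, initial: S, log: readonly E[]): S[] {
  const trail: S[] = [initial];
  let state = initial;
  for (const event of log) {
    const [next] = reducer(state, event); // effects are ignored: replay never performs I/O
    state = next;
    trail.push(state);
  }
  return trail;
}

// Toy reducer for illustration only.
type ToyState = { status: "todo" | "in_progress" | "done"; attempts: number };
type ToyEvent = { type: "dispatched" } | { type: "failed" } | { type: "succeeded" };

const toyDecide: Reducer<ToyState, ToyEvent> = (state, event) => {
  switch (event.type) {
    case "dispatched": return [{ status: "in_progress", attempts: state.attempts + 1 }, []];
    case "failed": return [{ ...state, status: "todo" }, []];
    case "succeeded": return [{ ...state, status: "done" }, []];
  }
};
```

A transition test then reads as data: an initial state, a short event list and the expected final entry of the returned trail.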

Deterministic flows give you consistency. You also gain “totality”, with defined output for every possible input. The spaghetti era was full of partial functions, such as switches that handled the common cases and defaults that weren’t handled. Every partial function is a potential bug if there is an input that wasn’t anticipated.

When the reducer handles an event, it must handle every event type. The TypeScript compiler enforces this, as the build fails if you add a new event type and don’t add a handler for it. This cuts down on runtime errors massively. The “missing branch” bug that caused dropped issues is now fixed, because the compiler won’t compile without the case being handled.

In addition to event types, we have totality of state types too. Not every combination of fields is a valid state. A GitHub issue cannot be both “in progress” and “waiting for worker” at the same time. In the spaghetti era, invalid states were possible because state was mutated incrementally and mutations could produce a partial result. In the main bus architecture, the state types make such combinations impossible to represent, which removes that class of bug.
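One common way to get this in TypeScript is a discriminated union, sketched below with hypothetical field names. Each status carries exactly the fields that make sense for it, so "in progress without a worker" or "in progress and waiting at once" cannot be written down.

```typescript
// Hypothetical state shape: one variant per status, each with its own required fields.
type IssueState =
  | { status: "todo" }
  | { status: "waiting_for_worker"; queuedAt: string }
  | { status: "in_progress"; workerId: string; startedAt: string }
  | { status: "done"; prUrl: string };

function summarise(state: IssueState): string {
  switch (state.status) {
    case "todo": return "not started";
    case "waiting_for_worker": return `queued at ${state.queuedAt}`;
    // workerId only exists on this variant, so it can be read without a null check.
    case "in_progress": return `worker ${state.workerId} running since ${state.startedAt}`;
    case "done": return `shipped in ${state.prUrl}`;
  }
}

// Building the next state means constructing a whole new variant, never patching fields one by one.
function startWork(state: IssueState, workerId: string, now: string): IssueState {
  if (state.status !== "waiting_for_worker") return state;
  return { status: "in_progress", workerId, startedAt: now };
}
```

An object literal such as `{ status: "in_progress" }` with no `workerId` is rejected by the compiler, which is the structural version of "that state does not exist".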

Declarative

People who know me know I like declarative programming.

Each workflow in my orchestration system has its own spec file (a Markdown file with a YAML header). The spec declares the states the workflow can be in, the events that trigger transitions and the conditions that gate each transition. This file (e.g. for the dev workflow) is the contract between the orchestrator and the worker.

The state machine’s type is generated from the spec. The code does not define which states are valid, as that’s the spec’s job. This decoupling again is intentional. It means you can’t write a transition to a bogus state the spec doesn’t declare because the state’s type doesn’t exist in the built code.

The spec file is the first thing you edit when a workflow needs to change. You add a new state to the spec, regenerate the types and then fix the compile errors that tell you everywhere the new state needs to be handled. The spec change propagates through the codebase structurally. The spec is the documentation and the implementation is derived from it. The implementation won’t compile if it drifts from the spec.

An example: suppose I add a new GitHub Issue state called “needs-human” to a workflow’s spec and I save the file. The next compile breaks in a couple of places: the function that picks an event handler doesn’t recognise “needs-human” and the “verdict” table doesn’t list it. I work through the couple of errors and the new state is wired up end to end. Without spec-driven types, “needs-human” would have been a string in one switch statement that quietly fell through everywhere.
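The mechanism can be approximated without a code generator. In the sketch below, `devSpec` stands in for whatever the generator would emit from the spec file (an assumption; the real generated output is not shown), and a `Record` keyed by the derived type plays the role of a table that must list every state.

```typescript
// Stand-in for generated output: the states as declared in a workflow spec.
const devSpec = {
  states: ["Todo", "In Progress", "Done", "Abandoned"],
  terminalStates: ["Done", "Abandoned"],
} as const;

// The state type is derived from the spec, never written by hand.
type DevState = (typeof devSpec.states)[number];

// A Record keyed by DevState must mention every state.
// Adding "needs-human" to devSpec.states makes this object fail to compile until it is handled.
const nextAction: Record<DevState, "dispatch" | "wait" | "none"> = {
  "Todo": "dispatch",
  "In Progress": "wait",
  "Done": "none",
  "Abandoned": "none",
};

function isTerminal(state: DevState): boolean {
  return (devSpec.terminalStates as readonly DevState[]).includes(state);
}
```

Calling `isTerminal("Shipped")` does not type-check either, because "Shipped" is not a state the spec declares.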

At its most basic, a workflow spec looks something like this:

tracker:
  kind: github
  status_field: Status
  active_states: [Todo, In Progress]
  terminal_states: [Done, Abandoned]
verdict_map:
  DONE: Done
  FAILED: Abandoned
agent:
  max_concurrent_agents: 3
  max_turns: 20
---
You are working on issue {{ issue.id }}: {{ issue.title }}.

Steps:
1. Read the description above.
2. Plan first: write plan.md before any code.
3. Commit small, push often.
4. Write the result of the run to .verdict (DONE or FAILED).

The spec is declarative with its state types, enum state lists, etc. The YAML on top declares the state machine: which board states are live, which are terminal and what each worker verdict should transition to. The Markdown below it is the set of instructions that the worker follows.

The hot-reload behaviour in the orchestration system applies to the workflow spec file as well. Changes to the spec regenerate types. Changes to the types force handler updates and bad updates will not compile.

Predictability gained from ditching spaghetti architecture

What I Left Out

The new main bus architecture makes the orchestration (PRD drafting, dev, PR reviews, marketing) more trustworthy in production. I plan to write more about this AI agent orchestrator. In Part III, I’ll discuss how I brought Toyota production principles to my AI agent production line. The architecture in this post gives you the structure. The next posts will be about what happens when the structure gets stress-tested.

The Factory Caught Itself

Last week, a worker hit a turn limit and stopped. The reducer finalised the issue to “abandoned”. A few seconds later a retry timer fired and asked the dispatcher to start a fresh worker on that same issue. The dispatch guard checked the state, saw “abandoned” and refused.

The refusal was the right decision. The architecture said: “I will not dispatch a worker on an issue I have already given up on”. But the refusal tripped a global lock and halted all workflows. I had to clear the lock by hand to bring the factory back.

To fix this, I cancelled pending retries on every terminal transition. I future-proofed the system against new terminal states forgetting to clear stale timers. The architecture caught the bad dispatch. Failing loud beats corrupting state any day.
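One way to make that fix hard to forget is to route every transition through a single function that knows which states are terminal. The sketch below is an assumption about how this could look, with hypothetical names; it is not taken from the actual codebase.

```typescript
type Status = "todo" | "in_progress" | "done" | "abandoned";

// The single source of truth for what counts as terminal.
const TERMINAL: ReadonlySet<Status> = new Set<Status>(["done", "abandoned"]);

type Effect =
  | { type: "set_status"; issueId: number; status: Status }
  | { type: "cancel_retries"; issueId: number };

// Every transition passes through here, so a state added to TERMINAL later
// gets the retry cancellation without anyone having to remember it.
function transition(issueId: number, next: Status): Effect[] {
  const effects: Effect[] = [{ type: "set_status", issueId, status: next }];
  if (TERMINAL.has(next)) effects.push({ type: "cancel_retries", issueId });
  return effects;
}

// The dispatch guard from the story: never start a worker on an issue already given up on.
function mayDispatch(current: Status): boolean {
  return !TERMINAL.has(current);
}
```

The guard stays in place as a second line of defence; the cancellation simply means a stale timer should no longer reach it.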

A cleaner architecture doesn’t mean fewer bugs. It means bugs that are easier to debug. Dropped issues became impossible by design. Silent state drops became compile errors. When something breaks there are two places to look and the event log says which.

The factory grew since I started it. The architecture has grown now too. There’ll be more things I’ll need to improve and more Claude tokens I’ll end up burning. But the next bugs will be loud and findable and the logs will help me debug them. That’s the next generation of factory I need.

I write about building with AI at blog.mariohayashi.com. Follow along if this is useful to you. If you’re working through similar problems I’d love to hear from you! Feel free to follow me on Twitter: @logicalicy.

The Factory Must Grow: I Replaced Myself With AI. Now What?

Mario Hayashi — Wed, 15 Apr 2026 14:01:41 +0000

tl;dr -- I made an orchestration system that creates PRDs, writes code, opens PRs and handles review feedback. And then I realised I'd automated myself out of the parts I once called my job.

The PR That Changed Things

My orchestration system opened a PR. Tests passing and commit messages better than any I could write. The code was clean. I left my feedback in the PR but the work was solid. After this first PR, I started feeding the task pipeline with more ideas while it churned out PRDs, implemented them and addressed feedback, without much of my attention.

That was the moment it clicked for me: I am just feeding product ideas to the machine. The system handles everything else. Ideas really are cheap now.

If 2025 was the year of agentic AI, 2026 is the year agents will be operationalised at scale. It's both exciting and scary to see half of what I used to call a job automated away.

Step #1: It starts with just a thought. A note in GitHub

My Slow Start

Most engineers will have been following the rise of agentic AI very closely in 2025. Not me. I was changing nappies. I have a young son and didn't have the bandwidth to follow what was happening. I leaned on Cursor's tooling, had AI help me generate code, but never graduated beyond one-off agentic work. I was comfortable under my rock.

"Just Try Building One Yourself"

An old friend, a software engineer I worked with several times over the years, changed my mind. He had kept a close watch on AI while I was busy being tired from parenting. His advice was simple: build an agent yourself.

I let it sit at the back of my mind for a few weeks. Then I got curious and decided to build it. The gap between reading about agents and watching one work on your own codebase is the gap between reading a recipe and tasting the food. The understanding only clicks when you see an AI agent produce PRs in your codebase, with your conventions. Cheaply, too. I'm on Claude's Max plan, so I pay a flat subscription. Over the last week my orchestrator ran 300 worker runs across four workflows and burned through at least $240 of API-equivalent tokens. I say "at least" because crashed runs (of which there were many) never emit a final cost event, so the real number is higher.

At that pace it would be north of $1,000 a month without the subscription. Instead I pay the flat Max fee and the system keeps going. A full pipeline -- PRD, code, review, fix -- averages around $4 in API-equivalent per shipped PR. I hope subscriptions remain this affordable but I know pricing could change any time. I'll park that thought for now.

From Bash Scripts to an Orchestration System

Before the orchestration system, I built basic agents.

I wrote about that first version in An autonomous dev pipeline for one: bash scripts, cron, tmux and Claude glued together until the behaviour was reliable enough to trust. It picked up tasks, wrote code, opened pull requests and handled review feedback. "Beginner" agentic, held together with glue and hope.

It worked well enough but this time I rebuilt the whole thing in TypeScript. Proper architecture and state management. Where bash was duct tape and slightly chaotic, TypeScript is steel: typed interfaces, phase boundaries and methodical error handling.

Managing Many Agents

One agent was manageable. Then I wanted a specialised one for planning, another one for code review, another for fixing PR review feedback. Managing the agents became tricky. Capability is not the bottleneck, orchestration is. I needed a conductor for my orchestra.

Put another way, the factory must grow.

If you recognise that phrase, you already understand how I felt the need to get this right. What started as "let me just automate this one thing" became a full orchestration system: idea goes in, PRD comes out, issues get created, workers pick them up, code gets written, PRs get opened, reviews get addressed, fixes get pushed. I ideate and check in. The system handles the rest.

Factorio: The Factory Must Grow. Source: https://commons.wikimedia.org/wiki/File:Factorio_Space_Age_Gleba_Screenshot.jpg

The pipeline looks something like this:

idea > PRD > issues > code > tests > PR > review > fixes > merge

I'm still figuring this out, but the pattern is clear. Each step is a phase with defined inputs, outputs and failure modes. The system retries, backs off, stops to ask for help as needed.

Under the Hood

The architecture is simpler than you would expect. GitHub is my system of record, my state machine. Each column in the Project board is a state. The orchestrator polls the board, picks up eligible issues and spawns workers.

Step #2: After your idea/note is processed, agent creates a PRD after clarifying ambiguities with me

Step #3: Agent creates a PR

Step #4: Agent reviews its own code and I review it only thereafter

Step #5: Approve and merge the work

Each worker gets its own fresh workspace directory, where it can read, write and commit without stepping on other agents' toes. The worker is a Claude CLI subprocess streaming JSON events and the orchestrator watches that stream for completion.

The part that took the longest to get right was retries. The interesting failures are the ones that look like success. An agent finishes its work, reports success, the orchestrator changes the issue to Done, and nothing has actually been pushed. A pre-commit hook silently rejected the push and the work is sitting orphaned in the workspace directory. Or the agent gets close to finishing, pauses on a turn limit, resumes, gets close again, pauses again, repeat ad infinitum. Or the agent burns through its cost cap without ever committing. Oops.

There are two kinds of retry. A continuation is the agent pausing mid-task because it's hit its turn limit. It resumes with its full conversation history and picks up where it left off. A failure retry is the agent crashing. It retries with a fresh start, no memory, backoff before trying again. Continuation and failure retries both have per-issue caps now. The orchestrator also checks git status after every reported success and abandons any issue that breaches the (configurable) dollar budget. Most of the retry logic exists because one of these went wrong.
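A minimal sketch of how the two retry kinds, the per-issue caps and the dollar budget could fit together. The names (`RetryPolicy`, `IssueState`, `next_action`), the default values and the exponential backoff are all assumptions made for illustration; the post only says that failures back off and that caps and budget are configurable.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    # All defaults are hypothetical; the real values live in the workflow file.
    max_continuations: int = 5
    max_failure_retries: int = 3
    budget_usd: float = 10.0
    base_backoff_s: float = 30.0


@dataclass
class IssueState:
    continuations: int = 0
    failure_retries: int = 0
    spent_usd: float = 0.0


def next_action(event: str, state: IssueState, policy: RetryPolicy) -> tuple[str, float]:
    """Return (action, delay_seconds) for a worker event.

    event is "paused" (turn limit hit) or "crashed".
    """
    if state.spent_usd > policy.budget_usd:
        return ("abandon", 0.0)  # budget breached: stop spending
    if event == "paused":
        if state.continuations >= policy.max_continuations:
            return ("abandon", 0.0)
        state.continuations += 1
        return ("resume_with_history", 0.0)  # continuation keeps the conversation
    if event == "crashed":
        if state.failure_retries >= policy.max_failure_retries:
            return ("abandon", 0.0)
        state.failure_retries += 1
        # Assumed exponential backoff: base, 2x base, 4x base, ...
        delay = policy.base_backoff_s * 2 ** (state.failure_retries - 1)
        return ("restart_fresh", delay)  # failure retry starts with no memory
    raise ValueError(f"unknown event: {event}")
```

Keeping the decision in one pure function makes the "pause, resume, pause again" loop easy to test without running an agent.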

Each workflow (PRD, dev, review) keeps its entire configuration in a single file: states, cost limits and prompt templates. When you change a file, the orchestrator hot-reloads it. I spent a good amount of time adjusting these workflow files but it's starting to work.

                    ┌─────────────────────────────┐
                    │      ORCHESTRATOR TICK      │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼────────────────┐
                    │  1. RECONCILE running workers │
                    └──────────────┬────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │  2. GATE on rate limits     │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │  3. FETCH candidate issues  │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │  4. DISPATCH workers        │
                    └──────────────┬──────────────┘
                                   │
                    └───── schedule next tick ──────┘
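The four steps of a tick can be sketched as one function. This is an illustrative outline only: the callables passed in and the `max_workers` cap are hypothetical stand-ins, not the orchestrator's actual interface.

```python
from typing import Callable, Iterable


def tick(
    reconcile: Callable[[], int],
    rate_limited: Callable[[], bool],
    fetch_candidates: Callable[[], Iterable[str]],
    dispatch: Callable[[str], None],
    max_workers: int,
) -> list[str]:
    """Run one orchestrator tick and return the issue ids dispatched."""
    running = reconcile()             # 1. RECONCILE: count workers still alive
    if rate_limited():                # 2. GATE: no dispatch while rate-limited
        return []
    dispatched: list[str] = []
    for issue in fetch_candidates():  # 3. FETCH candidate issues
        if running + len(dispatched) >= max_workers:
            break                     # assumed concurrency cap
        dispatch(issue)               # 4. DISPATCH a worker for this issue
        dispatched.append(issue)
    return dispatched
```

The caller would schedule the next tick after this returns, for example from a timer loop.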

The Thinking Is Still Mine (For Now)

I've replaced (most of) myself. But I can still decide what to build. And what not to build. I can still judge whether code is correct or there's code smell. I can feed the pipeline with ideas that are worth pursuing and axe the ones that are not. I can judge whether the output matches the intent. Being less in the weeds, I have more time to think about strategy.

The orchestration system handles the execution. The thinking, taste and judgement are still mine, for now.

On the flip side, every layer of abstraction creates demand for someone who understands the layer below it. The more software we automate, the more we need people who can fix the pipes when they burst.

We've all been replaced in small ways before. I started my career writing vanilla JS and jQuery code that has been replaced by higher level libraries and frameworks powering today's web apps. Abstractions make yesterday's hard problems trivial. Each time, the work shifts slightly. The only difference this time is that the shift is... vast.

I am the product and tech strategy machine now. I will feed the machine with ideas until I am replaced again and then I'll have to carve out another higher level role. The factory must grow.

If you're experimenting with AI in your workflow, I'd love to hear from you! I write more like this at blog.mariohayashi.com, and feel free to follow me on X: @logicalicy.

An autonomous dev pipeline for one

Mario Hayashi — Mon, 06 Apr 2026 13:48:03 +0000

If you’re a solo engineer or a technical founder wearing every hat, the gap between planning and implementation is shrinking. One person can own a product while a harness runs the loop. There’s no magic in the stack: bash, cron, tmux and Claude glued together until the behaviour is reliable enough to trust.

The question I keep coming back to is how much implementation work can be delegated in a way that fails safely and produces code that can go from good to great with human review. Worker agents pick up tasks, validate code, open pull requests and address review feedback. I decide what to build and whether to merge. Everything in between is what I’m trying to automate.

The scripts, prompts and guardrails are all still in flux. I update them whenever something breaks. This post is a snapshot of where the setup is today.

Agent workers waiting for an issue

Ralph loop

The foundation is Geoffrey Huntley’s Ralph loop: an autonomous coding cycle that runs task, implementation, testing, PR creation and context reset on repeat. Each iteration gets a fresh context window by design. Memory lives in git and structured files, not in the model.

I spend a lot of energy on isolating phases: plan, build, test, verify. Each phase is clearly defined, can fail on its own and can be retried without affecting others. That matters because agents tend to do well on short focused work and degrade when scope increases. Structure is the differentiator, not a bigger model.

The same principles I applied to a Xero expense auditing CLI apply here: code and framework first, model second. Rules and framework handle the main flow. AI steps in for parts that need some judgement, including issue creation, implementation and PR summaries.

Under the hood

There is no special runtime per se. Cron schedules jobs, tmux manages parallel worker shells, shell scripts handle state transitions and Claude runs with streamed JSON so that results can be parsed. GitHub Issues serve as the queue, system of record and state machine. If that sounds fragile, it is! This is why validating the output of each phase matters.

Architecture overview

      HUMAN                      AUTOMATION                            GITHUB
      ─────                      ──────────                            ──────

    ┌───────┐    manage.sh      ┌────────────┐
    │ Ideas │ ───────────────── │ PRD        │
    │ .md   │                   │ draft .md  │
    └───────┘                   └─────┬──────┘
                                      │
                                prd.sh (Claude)
                                Refine, approve
                                      │
                                      ▼
                                ┌────────────┐      plan.sh       ┌──────────────┐
                                │ PRD        │ ──────────────────▶│GitHub Issues │
                                └────────────┘                    │              │
                                                                  │ ready        │
                                                                  │ blocked      │
                                                                  │ in-progress  │
                                                                  │ done         │
                                                                  └──────┬───────┘
                                                                         │
                                worker-loop.sh picks up “ready” issues   │
                           ┌─────────────────────────────────────────────┘
                           │
                           ▼
                 ┌───────────────────┐
                 │  tmux: N workers  │
                 │                   │
                 │  ┌─────────────┐  │
                 │  │ Worker 1    │  │     Each worker runs in
                 │  │ (worktree)  │──┼──▶  its own git worktree
                 │  ├─────────────┤  │     with a Ralph loop
                 │  │ Worker 2    │  │
                 │  │ (worktree)  │──┼──▶  branch ──▶ code ──▶ PR
                 │  ├─────────────┤  │
                 │  │ Worker N    │  │
                 │  │ (worktree)  │──┼──▶
                 │  └─────────────┘  │
                 └───────────────────┘
                           │
                           │ opens PRs
                           ▼
                    ┌──────────────┐         ┌──────────────────┐
                    │ Pull         │         │ unblock.sh       │
                    │ Requests     │◀────────│ (cron ~30s)      │
                    └──────┬───────┘         │ relabels issues, │
                           │                 │ unblocks deps    │
                           ▼                 └──────────────────┘
                    ┌──────────────┐
                    │ Human merges │
                    └──────────────┘

Issues have labels like “ready”, “in-progress”, “blocked”, “done”, “failed”. Git is the memory and GitHub is the database. I may change parts of this setup completely later. For now, I want cheap iteration and an easy-to-follow trail.

What each part does

1. Backlog capture

“manage.sh” is the entrypoint. When a new backlog item comes in, Claude generates clarifying questions specific to that item rather than asking generic questions. The answers go into a Markdown file and feed into the PRD. So by the time planning starts, the context is there.

2. PRD generation

Claude turns the backlog item into a PRD in an interactive session. I can approve it, edit it directly or ask for a rewrite. Once approved, it gets committed and pushed. From that point it’s the working source of truth. I update the PRD when the plan diverges from reality.

3. Planning

The planner gets “Read”, “Glob” and “Bash”. No “Edit” and no “Skill” on purpose. Skills use hooks that nudge the planner toward implementing, which defeats the point of a dedicated planning phase. The prompt requires JSON output; anything else fails the run.

Two constraints run at planning time. If a task has too many acceptance criteria or affects too many files, Claude splits it before any issues are created. Large scope is the fastest way to fill up a context window. The planner also checks for blocking dependencies.

4. Workers (Ralph loop)

Each worker iteration runs: pull “main”, claim the next “ready” issue with an atomic lock, skip refactor issues if feature or bug issues are available, check for blocking issues, create git worktree for isolation, run Claude with streamed JSON, run validation (types, imports, tests), retry if validation fails, then commit, push and open the PR.

The git worktree matters most when workers run in parallel. Without it, they will step on each other’s toes.
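The selection rules in the worker iteration (only “ready” issues, skip anything blocked, product work before refactors) can be sketched as a pure function. The `Issue` shape, the `blocked_by` field and the lowest-number-first tie-break are assumptions for illustration; the atomic lock and the worktree setup are not shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    number: int
    labels: set[str]
    blocked_by: set[int] = field(default_factory=set)  # hypothetical field


def pick_next(issues: list[Issue], open_numbers: set[int]) -> Optional[Issue]:
    """Pick the next issue to claim, or None if nothing is eligible."""
    ready = [
        i for i in issues
        if "ready" in i.labels and not (i.blocked_by & open_numbers)
    ]
    # Features and bugs first; refactors only when no product work is ready.
    product = [i for i in ready if "refactor" not in i.labels]
    pool = product or ready
    # Assumed tie-break: oldest (lowest-numbered) issue first.
    return min(pool, key=lambda i: i.number) if pool else None
```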

5. Model routing

The planner adds a size estimate in every issue body. Workers read it at runtime: small and medium work goes to Haiku, large work goes to Sonnet or Opus for the heavier reasoning. I’ll keep adjusting this as I go.
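As a rough sketch of that routing, assuming the estimate appears in the issue body as a line like `Size: small` (a hypothetical format) and using tier names rather than real model identifiers:

```python
import re

# Tier labels only; a real setup would map these to concrete model ids.
MODEL_BY_SIZE = {"small": "haiku", "medium": "haiku", "large": "sonnet"}


def route_model(issue_body: str, default: str = "sonnet") -> str:
    """Choose a model tier from the size estimate in an issue body."""
    match = re.search(
        r"^size:\s*(\w+)\s*$", issue_body, flags=re.IGNORECASE | re.MULTILINE
    )
    if not match:
        return default  # assumed: no estimate means use the heavier tier
    return MODEL_BY_SIZE.get(match.group(1).lower(), default)
```

Falling back to the larger tier when the estimate is missing or unrecognised is a design choice in this sketch, trading cost for a lower chance of a failed run.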

6. Refactor loop

Individual units of work shipped quickly will, over time, pull the codebase in different directions. A script scans for refactor opportunities, scores them by impact and opens issues labelled “refactor” and “ready”. Cron runs it weekly for P0 issues and biweekly for P1. Workers always perform product work (features, bugs) before picking up refactors.

7. Review feedback

A script scans open PRs for “CHANGES_REQUESTED” reviews, pulls the general comments and inline line-level notes and creates a “review-feedback” issue containing the branch name. Workers pick these review feedback issues up, check out the existing branch, push the fix and close the loop. There’s no second PR opened. The feedback loop is all in one PR.

Design principles

Every task gets a fresh context window. State lives in files or GitHub labels, not in a running conversation. Validation runs after every agent session, before the PR is opened. Three consecutive failures trip the circuit breaker and stop the loop. The human (you) steps in to approve the PRD and merge the PR. Everything in between is what I’m trying to run autonomously while not trading off too much quality.
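The circuit breaker is small enough to sketch in full. The class and method names are illustrative; only the threshold of three consecutive failures comes from the text.

```python
class CircuitBreaker:
    """Trips after N consecutive failures; any success resets the count."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # A success clears the streak; a failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def tripped(self) -> bool:
        return self.consecutive_failures >= self.threshold
```

A worker loop would call `record()` after each validation run and exit as soon as `tripped` is true.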

What broke

Despite restricting the planner to “Read”, “Glob” and “Bash”, it kept writing code. The “Skill” tool was loading hooks that instructed the agent to implement code despite what the rest of the prompt said. Removing “Skill” entirely and prepending a hard system instruction fixed it for now.

PR summaries were always empty. The culprit was “--output-format json” combined with “2>&1”, which mixed hook events and result JSON into one stream the parser couldn’t untangle. Switching to “--output-format stream-json --verbose” and filtering for lines fixed the bug.
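A sketch of the filtering idea: treat the stream as JSON lines, ignore anything that does not parse, and keep only the final result event. The event shape used here (a `type` of `"result"` carrying a `result` string) is an assumption about the stream, and `extract_result` is a hypothetical helper name.

```python
import json
from typing import Optional


def extract_result(stream_text: str) -> Optional[str]:
    """Return the text of the last result event in a JSON-lines stream."""
    result = None
    for line in stream_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # hook noise or partial output: skip it
        if isinstance(event, dict) and event.get("type") == "result":
            result = event.get("result")
    return result
```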

The planner was also linking PRs as blocker dependencies. GitHub issues and PRs share a number namespace, so without an explicit check the planner would happily link to a PR and create a dependency that would never resolve (as issues were expected instead of PRs). Amateur mistake? Yes.
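One way to make that check explicit: GitHub's REST API returns pull requests from the issues endpoints too, and marks them with a `pull_request` key. A sketch of a guard over already-fetched payloads (the function names are hypothetical):

```python
def is_real_issue(item: dict) -> bool:
    """True for issues, False for pull requests in an issues API payload."""
    return "pull_request" not in item


def valid_blockers(candidates: list[dict]) -> list[int]:
    """Keep only the numbers that refer to issues, not PRs."""
    return [c["number"] for c in candidates if is_real_issue(c)]
```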

Metrics

Each run adds a CSV row with timestamp, issue number, event type, outcome, duration, model and size estimate. The goal is to build up enough data to see which estimate tiers fail most often and whether routing large tasks to Sonnet actually pays off. The data isn’t telling me much yet but I hope to see some insights months down the line.
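A sketch of writing one such row with Python's `csv` module. The column order follows the sentence above; the column names, the one-decimal duration format and the `append_metric` helper are assumptions.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "issue", "event", "outcome", "duration_s", "model", "size"]


def append_metric(out, *, issue, event, outcome, duration_s, model, size, now=None):
    """Append one metrics row to an open text file or buffer."""
    now = now or datetime.now(timezone.utc)
    csv.writer(out).writerow(
        [now.isoformat(), issue, event, outcome, f"{duration_s:.1f}", model, size]
    )


# Usage with an in-memory buffer; a real run would open the CSV in append mode.
buf = io.StringIO()
append_metric(
    buf, issue=42, event="worker_run", outcome="success",
    duration_s=12.34, model="haiku", size="small",
    now=datetime(2026, 4, 6, tzinfo=timezone.utc),
)
```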

What I’m looking to try next

An interesting extension is specialised sub-agents for test passes, QA (Browserbase? Playwright?) and doc updates that run as part of the merge loop. Right now everything goes through the same worker loop regardless of type. Separating those concerns should improve quality.

I also want tracing per phase: worktree setup, agent run, validation, PR creation. Clustering failures by phase would make it much faster to see where the pipeline is spending time or falling over.

The longer-term experiment is a closely observed LLM-as-judge: a second pass that scores whether code changes match the issue description, whether test coverage is adequate and whether a review comment was fully addressed. It could reduce noise before human review but I’m mindful of how it’s incorporated into the flow.

Workers running

Summary

Model quality is not the bottleneck. State, isolation, validation and human review are. The infrastructure is the differentiator.

Implementing things by hand already feels like a distant past. A clear roadmap and a dependable harness can go a long way to do the work of many. I’m still cobbling together the pieces and iterating on the flow. If your setup looks different, I’d love to hear about it!

I write more posts like this at blog.mariohayashi.com. Follow me on X @logicalicy.

Using AI to make Xero expense auditing fast, cheap, and (almost) fun

Mario Hayashi — Mon, 23 Mar 2026 04:27:44 +0000

Github: https://github.com/logicalicy/xero-expense-audit

Expense auditing is the kind of work that nobody talks about but most founders quietly dread (or maybe you are more vocal than I am).

Missing descriptions and wrong tax codes? Bills parked in “draft” state for weeks because fixing them individually is a time sink? Oh, of course, those foreign currency receipts that need converting and the exchange rate evidence that needs attaching.

It’s the kind of work that takes real attention — but not the kind that benefits from being done manually.

I need to audit expenses in Xero from time to time, so I decided to automate it. What I ended up building let me explore how to use AI well in a workflow (“AI UX”). AI is not a magic do-it-all but a very cheap, very fast assistant that fills in the gaps where rules can’t.

This write-up is about the CLI tool I built and how I think about using AI for tasks like accounting.

Setup: Xero, bills, and edge cases

In the context of small business accounting, bills arrive, get created as a “draft” and need to be reviewed before they can be approved and marked as paid.

The review checklist for me is usually:

Does the bill have a description?
Is the account code valid?
Is the tax rate correct?
Is there an attachment (receipt or invoice)?
If there’s a foreign currency receipt, has it been converted at the right rate with evidence? (E.g. USD to GBP.)

In practice, every item on the above list could go wrong in subtle ways. Suppliers don’t label things helpfully. Receipt transactions are in EUR and Xero parsed them as GBP. When you’re running a lean operation, this kind of review is the thing that gets deferred until it’s too late and you need to sink hours into fixing it.

The tool I built is called xero-expense-audit. It’s a Python CLI that automates the tedious work and flags the tricky tasks for human review. (GitHub: https://github.com/logicalicy/xero-expense-audit)

Design principle: deterministic code first, AI second

Before I get into the details, let’s talk about the part that runs first: deterministic code.

This was a conscious choice. The deterministic rules are defined in a rules.yaml config file and they check things like:

Is the description field empty?
Is the tax rate in the approved list for this account code?
Is the amount zero?
Is there no attachment when one is required?
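The four checks above can be sketched as plain Python. The `Bill` fields, the flag names and the `APPROVED_TAX_RATES` mapping are placeholders standing in for whatever rules.yaml actually loads; the account code and tax rate values are made up.

```python
from dataclasses import dataclass


@dataclass
class Bill:
    description: str
    account_code: str
    tax_rate: str
    amount: float
    has_attachment: bool


# Placeholder for config loaded from rules.yaml: account code -> approved rates.
APPROVED_TAX_RATES = {"400": {"STANDARD", "EXEMPT"}}


def audit(bill: Bill, approved=APPROVED_TAX_RATES, attachment_required=True) -> list[str]:
    """Return the list of rule violations for one bill (empty means clean)."""
    flags = []
    if not bill.description.strip():
        flags.append("missing_description")
    if bill.tax_rate not in approved.get(bill.account_code, set()):
        flags.append("tax_rate_not_approved")
    if bill.amount == 0:
        flags.append("zero_amount")
    if attachment_required and not bill.has_attachment:
        flags.append("missing_attachment")
    return flags
```

Only bills that come back with a non-empty list would be passed on to the model.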

These checks are instant, reliable and cheap. And they run before any LLM is called (which absolutely is necessary to keep costs low!). The AI only gets involved when a bill has been flagged and only suggests fixes. AI does not apply the fixes automatically. A human is in the loop.

This design keeps the human (you) in control. AI fills gaps that rules can’t handle: ambiguous categories, unstructured text in receipt images/PDFs, natural-language bill editing (i.e. describing changes that need to be made to the bill in English).

Where I used AI

I used Claude Haiku 4.5 for this project (the smaller, fast Anthropic model), costing approximately $1 per million input tokens at the time of writing:

1. Triaging bills

After the deterministic rules flag issues with bills, Haiku reviews the bill and returns structured JSON suggestions with a confidence score. Anything below 0.7 confidence gets filtered out. The rest is queued for your review.

It pattern-matches across accounting categories and returns ranked guesses. The confidence threshold filters out noise.
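A sketch of the threshold step, assuming the model returns a JSON array of objects with a `confidence` field (the exact response shape and the other field names are hypothetical):

```python
import json

CONFIDENCE_THRESHOLD = 0.7


def keep_confident(raw_json: str, threshold: float = CONFIDENCE_THRESHOLD) -> list[dict]:
    """Drop suggestions below the threshold and rank the rest, best first."""
    suggestions = json.loads(raw_json)
    kept = [s for s in suggestions if s.get("confidence", 0.0) >= threshold]
    return sorted(kept, key=lambda s: s["confidence"], reverse=True)
```

A suggestion with no confidence value is treated as 0.0 here, so it never reaches the review queue.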

2. Reading receipts/invoices’ supplier name

This was always a headache for me. After attaching a receipt/invoice to a Xero bill, Haiku’s vision reads it and extracts the supplier name. That name then gets matched against existing Xero contacts to find the right supplier. If an existing supplier is not found, the CLI suggests you create a new one.

Doing this manually (especially for new suppliers) used to take a tonne of time. Now it’s just a few key presses.

3. Reading receipts’ line item descriptions

In the same pass, we also try to detect the line item description. Haiku reads the receipt and populates empty description fields. For example “Monthly subscription fee.”

4. Foreign currency detection/conversion

I thought this one wouldn’t be possible, given the need for a reliable third-party data source, but it works.

When a receipt/invoice is in a foreign currency, the tool sends it to Haiku for currency identification. Once identified and if it’s in a currency that’s not the business’s currency, it automatically calls the ECB (European Central Bank) Data API to fetch the historical rate for the transaction date, converts the amount to the business’s currency (GBP in my case), and then uploads the ECB rate CSV export directly to the bill as audit evidence.

The whole flow: receipt in USD > AI currency identification > fetch ECB rate > convert amount > attach rate file.
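The conversion arithmetic deserves a worked example. ECB reference rates are quoted as units of a currency per 1 EUR, so a USD to GBP conversion goes through the euro. The sketch below assumes that cross-rate approach (the tool may do it differently), and the rates are made-up numbers, not real ECB data.

```python
from decimal import Decimal, ROUND_HALF_UP


def convert_via_eur(amount: Decimal, from_per_eur: Decimal, to_per_eur: Decimal) -> Decimal:
    """Convert between two non-euro currencies using euro-based rates."""
    eur = amount / from_per_eur       # e.g. USD -> EUR
    converted = eur * to_per_eur      # e.g. EUR -> GBP
    return converted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Illustrative rates only: 1 EUR = 1.25 USD and 1 EUR = 0.85 GBP.
gbp = convert_via_eur(Decimal("100"), Decimal("1.25"), Decimal("0.85"))
```

With those rates, 100 USD is 80 EUR, which is 68.00 GBP. Using `Decimal` rather than floats keeps the rounding explicit, which matters when the figure ends up on a bill.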

5. Context-based description fallback

When there’s no receipt, Haiku generates a short description from supplier name + amount + date. A last resort but at least gives us an “out” if we’re stuck.

6. Natural-language bill editing

In the interactive fix-bill command, instead of navigating Xero’s UI to update fields, you are given suggestions on how the bill can be fixed. Or, you just type what you want in English. For example:

Set description to monthly subscription fee

Haiku converts the above to a JSON “patch” and applies it to the bill. This one is small but it makes the UX go from average to great. Instead of clicking through Xero, you just say what you want.

The interactive review flow

Once the audit run is complete, you can use the fix-next-bill command to fix issues, one at a time. AI suggestions are shown beside the bill and all you need to do is approve or edit (in natural language).

Approved bills get auto-fixed and approved (i.e. authorised). The CLI makes this flow a real joy (compared to clicking through Xero). Rich handles the display (colour-coded panels, tables, status indicators) and Questionary handles the interactive checkboxes.

The impact

Before, I’d open Xero, manually click through draft bills, fix things individually in the Xero site. Click description field, enter description, click account code, enter account code, etc. I often left the currency conversion until it became a time sink.

After building this tool, with one “run” command, I get a structured list of issues with suggested fixes, review the queue, approve bills in bulk.

The AI helps me semi-automate the corrections. But the deterministic code does most of the work and ensures results are consistent for future audits. You (the human) always stay in the loop: nothing auto-applies unless you explicitly say --auto-correct.

Auditing dozens of bills with full AI suggestions and receipt vision runs to a few cents. Haiku is genuinely cheap enough that it’s a game changer for small business.

What I’d do differently

The natural-language editing (“change description to XYZ”) is fun but most fixes could use predictable patterns. Next time I could turn those into interactive menus, which would be faster and less prone to misinterpretation and AI safety issues (which is a whole topic in itself). I’d also invest more in exception handling and testing, as I’ve only tested these on my own expenses (your mileage may vary).

Conclusion

There’s a lot of talk about AI replacing complex knowledge work. What I’ve found through this task is that it’s excellent at very structured tasks. It doesn’t get tired, bored or frustrated about tasks that we might get annoyed about. The tasks still require human input but we get to the decision point much, much faster. This tool freed up real time that I’d rather spend making things.

If you’re experimenting with AI in your workflow, I’d love to hear from you! I write more like this at blog.mariohayashi.com, and feel free to follow me on X: @logicalicy.

I Set Up an AI Personal Assistant with OpenClaw

Mario Hayashi — Thu, 12 Mar 2026 07:57:00 +0000

tl;dr: From email summaries to a Pinecone-powered second brain, this personal assistant has wide-ranging potential!

OpenClaw is an open-source AI personal assistant you self-host, connect to your own accounts, and talk to over a messaging app you already use. Instead of opening five apps, you ask your assistant.

I’ve been meaning to try it for a while. But between work and everything else, I just didn’t get round to it. Then a friend gave me a final nudge. I sat down and set it up.

What Is OpenClaw?

OpenClaw is an open-source AI personal assistant framework. You run a gateway on your machine, connect it to an AI model of your choice, wire up integrations like Gmail or Google Calendar, and talk to it over a messaging app you already use (WhatsApp, Telegram, etc).

Think of it like a personal assistant that has access to the tools you use daily and responds immediately.

It’s not magic but certainly feels like it. It’s mainly scaffolding. Good scaffolding changes how you work.

The Setup

Getting started is mostly configuration. You install OpenClaw, point it at an AI model, and connect your integrations. The whole thing runs in the background.

The part I found most interesting is how you give it a personality and context. There’s a workspace with files like SOUL.md and USER.md where you define who the assistant is and who it’s helping. It reads those at the start of each session.

Within about an hour, I had it running and connected to Telegram.

3x Use Cases I’ve Started With

1. Newsletter Summaries

I subscribe to more newsletters than I read. I’ve not had “zero inbox” in years.

My new workflow: when something lands in my inbox that I want to read but don’t have time for right now, I forward it to the isolated Gmail account my assistant monitors. The assistant knows to summarise forwarded emails.

After the next poll cycle, I get back the key points in plain language.

It’s a tiny improvement in my life, but I can already feel it making a huge impact as I work my inbox down to zero.

I didn’t plan this use case per se, but it happened organically.

2. Quick Questions

This one sounds trivial, but it’s surprisingly fun.

“What’s the weather in Tokyo today?”

“What time does X close?”

These micro-queries used to be micro-distractions. Now I fire them off over Telegram or email and get an answer without breaking flow. (Email is slightly more asynchronous than Telegram.)

The assistant is wherever you are: messaging app, email, it doesn’t matter. It answers questions wherever you are. (For ephemeral questions, you might prefer messaging apps.)

3. Calendar Events

I describe what I want such as “create a calendar entry for X event (see https://…)“ and the assistant creates the event. The part I find particularly cool is when I send it a link and it parses the URL to pull in context, or does a quick search to add relevant details to the invite. No clicking through date pickers, no copy-pasting.

It’s still early days and I’m watching for edge cases, but so far it works OK.

Going Further: Adding a Second Brain with Pinecone

The three use cases above are fairly lightweight. Forward an email, ask a question, create an event. But there’s a more interesting layer you can add on top: a persistent, searchable memory that grows over time.

I call it my “second brain”. This is where Pinecone comes in.

What’s a vector database?

A regular database retrieves by exact match: give me all rows where category = X. A vector database works differently. It stores data as numerical representations of meaning. If that doesn’t make sense, here are a few ways to query information:

Semantic search: Ask “what did I read about AI orchestration tools last week?” and it surfaces relevant notes even if none of them contain those exact words. It matches on meaning, not keywords.
Keyword search: Traditional text matching. Exact words, not meaning. Fast but literal.
Hybrid search: Semantic similarity and keyword weight combined. Better for mixed queries.
Filtered search: Adds metadata constraints (e.g. category = X AND semantically similar to query).
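To make "matching on meaning" concrete, here is a toy semantic and filtered search using cosine similarity. The three-dimensional vectors are hand-made for illustration; real embeddings come from a model and have hundreds of dimensions, and a hosted vector database does this ranking server-side. The note ids and the `search` helper are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, notes, category=None, top_k=2):
    """Rank notes by similarity, optionally filtered by metadata first."""
    pool = [n for n in notes if category is None or n["category"] == category]
    ranked = sorted(pool, key=lambda n: cosine(query_vec, n["vector"]), reverse=True)
    return [n["id"] for n in ranked[:top_k]]


NOTES = [
    {"id": "orchestration", "category": "ai", "vector": [0.9, 0.1, 0.0]},
    {"id": "recipes", "category": "food", "vector": [0.0, 0.2, 0.9]},
    {"id": "agents", "category": "ai", "vector": [0.8, 0.3, 0.1]},
]
```

Calling `search([1.0, 0.0, 0.0], NOTES)` ranks the two AI notes first even though no keywords are compared; passing `category="food"` applies the metadata filter before ranking.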

When you start querying it about things you’ve encountered before, it starts to feel like a second brain.

Setting it up with OpenClaw

The Pinecone DB integration lives in my OpenClaw workspace as a skill. A couple of scripts and a config file. Setup involves creating a Pinecone account and index, adding environment variables, and the skill’s configuration tells the assistant when and how to use the upsert and query scripts.

The schema I’m using is based on Zettelkasten: a note-taking philosophy built around atomic, interlinked ideas. I’ve published the AI prompts I use to drive it on GitHub: github.com/logicalicy/ai-zettelkasten-lite.

In practice, I send a message saying “save this XXX to my second brain”, the assistant fetches the page if there’s a link, writes a structured entry, and upserts it to Pinecone. Later I ask “what have I saved about XXX?” and it does a semantic search and surfaces notes I haven’t thought about in weeks.

The key difference from bookmarking is context. The entry captures why I saved something, not just what it is. I can also fetch related information. That’s what makes retrieval useful later.

It’s early days. But the more you add, the more useful it becomes.

Safety First: How I’m Thinking About This

Giving an AI access to your email is not something I take lightly. This past week, a security writeup on CodeWall described how researchers broke into McKinsey’s AI platform. It’s a useful reminder that even well-resourced teams get this wrong. I’m sure I’ll discover more gaps in my own setup as I go.

I’ve started by isolating the assistant’s Gmail account. The assistant doesn’t have access to my main inbox. I created a dedicated Gmail account specifically for OpenClaw. The only emails it ever sees are ones I explicitly forward to it. That’s a deliberate boundary.

I’m still working on adding more guardrails and fine-grained permissions. You should too. Always start with the principle of least privilege.

The Rough Edges

The early experience isn’t perfectly smooth and that’s worth noting.

The most interesting issue I hit was around memory. OpenClaw maintains state across sessions using files on disk such as a long-term memory file, etc. In theory, the assistant can “remember” things you’ve told it. In practice, I reset the OpenClaw session mid-setup and it forgot a bunch of state I thought had been saved. The assistant had it in conversational “context” but never committed it to the memory/state files on disk. It’s the AI equivalent of closing a customer support chat window and having to set the context again for the next support agent.

This highlights an important point: free-form conversational interaction and reliable, repeatable workflows are different tools. Asking an agent “do X” in natural language works great for one-off tasks. But for something that needs to happen consistently, you want determinism. Code, not conversation.

The principle I’ve started designing around is reach for a script before you reach for a prompt. If code can do it reliably, code should do it. LLMs are great when the task is genuinely fuzzy, when you need synthesis or something that doesn’t have a deterministic answer.

The mental model I’ve settled on: use chat to figure out what you want, then encode it as a script when you need it to be reliable. Free-form interaction is prototyping. The deterministic pipeline is production.

Final thoughts

This setup is still a work-in-progress but looks promising. The use cases I’ve described are simple on purpose as I wanted to validate the workflow before building on it. Setting up the assistant is only the beginning and will be iterated on over time.

Areas I’d like to dive into more next in the world of AI (it’s moving too fast): RAG (retrieval-augmented generation) to pull from my second brain at the right moment, evals to measure whether agents are doing the right thing, determinism-first design principle, harness engineering and the evolving tooling landscape (e.g. agent orchestration).

If you haven’t tried OpenClaw yet: this is your nudge!

If you’re experimenting with AI in your workflow, I’d love to hear what you’re building. I write more like this at blog.mariohayashi.com, and feel free to follow me on Twitter: @logicalicy

Better AI code with strong constraints: Ban it() for it.each()

Mario Hayashi — Fri, 17 Oct 2025 11:14:20 +0000

I started banning it() as an experiment. Every test must use it.each().

Then, I banned try-catch. Every error handler must use catchIfError().

It sounds extreme but I learned that the tighter the constraints on AI, the better the code quality.

_Note: This is a blog post about optimisations for AI-generated code, specifically in TypeScript._

Why ban?

Have you opened an AI-generated PR recently with 1000 lines of changes but with little substance? You start reviewing and you skim. There are so many lines because AI is really good at writing imperative code in TypeScript. You have no time to review all these lines, so you just let it go…

The better way around this is of course to enforce coding standards that make it easier to review. For me, it’s declarative code, starting with tests.

Tests help with code governance, maintainability, and confidence in your codebase (which in turn helps AI’s confidence). You want AI to generate tests but not the imperative mess that tends to happen with dozens of it() blocks.

Most of your time with AI isn’t about writing code but reviewing it. You need to find ways to review code quickly but also trust it.

Constraint #1: Ban `it()` and require `it.each()`

The Problem

If AI generates something like this:

it('should calculate tax for income of 50000', () => {
  const result = calculateTax(50000);
  expect(result).toBe(7500);
});

it('should calculate tax for income of 100000', () => {
  const result = calculateTax(100000);
  expect(result).toBe(18000);
});

it('should calculate tax for income of 200000', () => {
  const result = calculateTax(200000);
  expect(result).toBe(42000);
});

it('should return 0 for income of 0', () => {
  const result = calculateTax(0);
  expect(result).toBe(0);
});

…there’s a lot of repetition. Now imagine you have dozens of test files in a PR. It’s going to get tough to review very quickly.

The Solution

Using it.each():

it.each([
  { income: 50000, expected: 7500 },
  { income: 100000, expected: 18000 },
  { income: 200000, expected: 42000 },
  { income: 0, expected: 0 },
])('calculateTax($income) = $expected', ({ income, expected }) => {
  expect(calculateTax(income)).toBe(expected);
});

This we can scan in seconds. Reasoning about code is much, much easier. It’s declarative: what, not how.

We’ve created a forcing function for:

Standardized test structure
Makes tests scannable and reliable

Constraint #2: Ban `try-catch` and require `catchIfError()`

The problem

If AI generates nested try-catch blocks like this:

async function processUserData(userId) {
  try {
    const user = await fetchUser(userId);
    try {
      const profile = await fetchProfile(user.profileId);
      try {
        const preferences = await fetchPreferences(user.id);
        return { user, profile, preferences };
      } catch (prefError) {
        return { user, profile, preferences: null };
      }
    } catch (profileError) {
      throw new Error('Could not fetch profile');
    }
  } catch (userError) {
    throw new Error('Could not fetch user');
  }
}

…it quickly becomes a nightmare to scan error handling. Human brains (mine especially) are not good at keeping track of layers upon layers. So we need a better method.

The solution

Recently I was introduced to the concept of using error-as-return-value instead of try-catch. It’s an interesting idea (one we see used in e.g. GraphQL), so I ran with it.

async function processUserData(userId) {
  const user = await fetchUser(userId);
  const profile = await fetchProfile(user.profileId);
  const [, preferences] = await catchIfError(fetchPreferences(user.id));

  return { user, profile, preferences: preferences ?? null };
}

Its basic implementation could look something like this (you may want to add custom error types and handling):

async function catchIfError(promise) {
  try {
    const result = await promise;
    return [null, result];
  } catch (error) {
    return [error, null];
  }
}

This could be interesting for AI-generated code. It encourages:

Error handling that’s scannable (flat code structure)
Errors as values
Better cohesion, as try blocks become separate functions

On the last point, when you can’t wrap five operations in a try-catch, you’re forced to extract them into a function. Instead of:

try {
  // 20 lines of complex logic
} catch (e) {
  // handle error
}

You write:

const [error, result] = await catchIfError(doComplexOperation());

This naturally encourages us to write single-responsibility functions.
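
Because this post is about TypeScript, a typed variant of the helper may be worth sketching. The version below is an assumption rather than the implementation used in the post: the `Result` type name, the choice to normalise non-Error throwables into an `Error`, and the stubbed `fetchPreferences()` are all illustrative.

```typescript
// A typed sketch of catchIfError(). `Result` is a hypothetical name, not from the post.
type Result<T> = [error: null, value: T] | [error: Error, value: null];

async function catchIfError<T>(promise: Promise<T>): Promise<Result<T>> {
  try {
    const value = await promise;
    return [null, value];
  } catch (thrown) {
    // Anything can be thrown in JavaScript, so normalise to an Error instance.
    const error = thrown instanceof Error ? thrown : new Error(String(thrown));
    return [error, null];
  }
}

// Hypothetical stub standing in for a real data-access call.
async function fetchPreferences(userId: string): Promise<string[]> {
  if (userId === "") {
    throw new Error("Missing user id");
  }
  return ["dark-mode"];
}

// Usage: preferences are optional, so a failure falls back to an empty list.
async function loadPreferences(userId: string): Promise<string[]> {
  const [error, preferences] = await catchIfError(fetchPreferences(userId));
  if (error !== null) {
    console.warn(`Preferences unavailable: ${error.message}`);
  }
  return preferences ?? [];
}
```

The tuple type makes the "error or value" contract visible in the signature, which gives a reviewer (and the AI) one less thing to infer.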

Make your life easier with constraints

Just like you would guide a junior engineer, add strong guardrails. AI generates better code when you give it constraints.

Instead of “Write tests for this function”, try “Write tests using it.each() with test cases for…”

Instead of “Add error handling”, try “Use catchIfError() for optional operations, let required operations throw.”
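
A prompt is a request, while a lint rule is a gate. One way to back both bans with tooling is ESLint's core `no-restricted-syntax` rule. The sketch below is an assumption, not the setup described in this post: it presumes ESLint flat config with a TypeScript parser already configured, and the file globs, the helper path `src/catchIfError.ts` and the messages are hypothetical.

```typescript
// Sketch of ESLint flat-config entries that turn both bans into lint errors.
// Assumptions: a TypeScript parser is configured elsewhere; globs, paths and
// messages below are illustrative.

const banTryCatch = {
  selector: "TryStatement", // note: this also matches try/finally
  message: "try-catch is banned. Use catchIfError() for optional operations.",
};

const banBareIt = {
  // Matches it(...) only. it.each([...])(...) passes because its callee is
  // not a plain identifier named "it" (the same is true of it.only and it.skip).
  selector: "CallExpression[callee.name='it']",
  message: "it() is banned. Use it.each() with a table of cases.",
};

const config = [
  // All TypeScript files: no try-catch.
  { files: ["**/*.ts"], rules: { "no-restricted-syntax": ["error", banTryCatch] } },
  // A later entry replaces the rule's options for the files it matches,
  // so the test entry repeats the try-catch ban alongside the it() ban.
  { files: ["**/*.test.ts"], rules: { "no-restricted-syntax": ["error", banTryCatch, banBareIt] } },
  // The helper itself is the one place allowed to use try-catch.
  { files: ["src/catchIfError.ts"], rules: { "no-restricted-syntax": "off" } },
];

// In a real ESLint config file this array would be the default export.
```

With something like this in place, the agent gets the same feedback from the linter that it got from the prompt, and a violation fails the check instead of relying on a reviewer to spot it.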

Other ideas

Ban Magic Numbers

Magic Numbers are bad. Your colleagues won’t understand them and it’s likely your future self won’t remember them either.

Try this: “Magic numbers are banned. Extract all numeric literals to named constants at the top of the file.”

// No more magic numbers.
const MAX_RETRY_ATTEMPTS = 3;
const TIMEOUT_MS = 5000;

if (retries > MAX_RETRY_ATTEMPTS) {
  throw new Error('Max retries exceeded');
}

Ban complex conditionals

Try this: Complex if-statements with more than one boolean operator (&&, ||) are banned. Extract any conditional with more than one boolean operator into a named variable.

// AI generated code.
const isEligibleForDiscount = 
  user.isActive && user.age > 65 && !user.hasDiscount;

if (isEligibleForDiscount) {
  applyDiscount();
}

Each extra operator adds another input to the condition, and every input doubles the number of true/false combinations you have to reason about (2^n for n inputs). That’s hard to hold in your head, so extract the condition to a named variable to make your life easier.
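
To make the doubling concrete, here is a small sketch that enumerates every combination for the three inputs of the discount check. The `User` type, the `truthTable()` helper and the sample ages are assumptions made for illustration; only the condition itself comes from the example above.

```typescript
// Hypothetical shape for the user object used in the example above.
type User = { isActive: boolean; age: number; hasDiscount: boolean };

const SENIOR_AGE = 65;
const SAMPLE_SENIOR_AGE = 70; // illustrative age above the threshold
const SAMPLE_YOUNGER_AGE = 30; // illustrative age below the threshold
const INPUT_COUNT = 3;

function isEligibleForDiscount(user: User): boolean {
  return user.isActive && user.age > SENIOR_AGE && !user.hasDiscount;
}

// Builds every true/false combination for `inputCount` inputs (2^inputCount rows).
function truthTable(inputCount: number): boolean[][] {
  const rows: boolean[][] = [];
  for (let mask = 0; mask < (1 << inputCount); mask++) {
    const row: boolean[] = [];
    for (let bit = 0; bit < inputCount; bit++) {
      row.push(((mask >> bit) & 1) === 1);
    }
    rows.push(row);
  }
  return rows;
}

// One case per combination, with the outcome the condition produces.
const cases = truthTable(INPUT_COUNT).map((row) => {
  const isActive = row[0] === true;
  const isSenior = row[1] === true;
  const hasDiscount = row[2] === true;
  const user: User = {
    isActive,
    age: isSenior ? SAMPLE_SENIOR_AGE : SAMPLE_YOUNGER_AGE,
    hasDiscount,
  };
  return { isActive, isSenior, hasDiscount, eligible: isEligibleForDiscount(user) };
});

console.log(`${cases.length} combinations, ${cases.filter((c) => c.eligible).length} eligible`);
```

Three inputs already give eight rows, of which only one is eligible. A table like `cases` is also the kind of input it.each() takes, so the same enumeration can double as the test data.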

Final thoughts

Optimise code for your review.

Your time is finite so add constraints that make AI-generated code:

Scannable
Standardized
Minimal (declarative wins here)

Thanks for reading!

If you’re experimenting with AI in your workflow, want to share your experiences or want to collaborate, I’d love to hear from you. I write more like this at blog.mariohayashi.com, and feel free to follow me on Twitter: @logicalicy.

7 Prompt UX Patterns to Help You Make Quicker Decisions with AI

Mario Hayashi — Mon, 13 Oct 2025 11:02:24 +0000

Most of us using AI on a day-to-day basis will know that having AI ask clarifying questions improves output quality. Prompt engineering 101.

But have you thought of asking it to suggest answers too?

This makes the tedious back-and-forth with AI easier. Instead of typing essays in response to every question, you’re just confirming or correcting. Instead of breaking your flow, maintain the momentum. I really needed this after wasting many hours of writing long essays to refine PRDs while doing spec-driven development.

This is what real human-in-the-loop should look like. Use smart defaults with the freedom to steer. Here are 7 patterns that may improve your prompt UX (certainly did for me).

1. Suggested Answers Pattern

Here’s the single most useful hack (for me) for working with AI:

When you have AI ask clarifying questions, have it suggest the answers too.

Don’t do this:

Ask me clarifying questions to understand the requirements better.

Do this:

Ask me clarifying questions and suggest answers based on the context.

Example structure:
<target_user>
- Q: Who is the primary user of this feature?
- Suggested Answer: [INFERRED_USER_TYPE_FROM_CONTEXT]
- [Confirm or correct this assumption]
</target_user>

Typing is the bottleneck. If you write lots of spec-driven development PRDs like me, this is a must. The context already has most of the answers and the AI just needs confirmation. You’ve transformed a potential essay into a yes/no confirmation.

This is what human-in-the-loop should look like. Not constant back-and-forth but smart defaults. It strikes the balance between letting AI do its thing and maintaining control over decisions.

When you optimise prompt UX this way, you stop managing the AI and start collaborating with it. You’ve set it up for success (where success is optimised workflows). You maintain momentum, preserve flow state, and get better results with less effort.

2. Pre-Filled Assumptions Pattern

Instead of asking five questions and waiting for five answers, make it state assumptions upfront and let you course-correct only what’s wrong.

Do this:

Proceed with assumptions based on the context. For example:

"I’m proceeding with these assumptions based on your code:
- Framework: [INFERRED_FROM_CONTEXT]
- Code quality: [INFERRED_FROM_CONTEXT]  
- Performance needs: [INFERRED_FROM_CONTEXT]

Please correct any assumptions that are wrong, otherwise I’ll continue."

It’s analogous to a good interview candidate and a bad one: good ones state their assumptions while bad ones don’t bother. The user only has to type if something is wrong, not to fill in every detail.

3. Default Template Pattern

Ask it to follow a template rather than start from a blank slate.

Do this:

Please state defaults based on the context and ask me to confirm. For example:

"Here's what I understand about this feature:

<feature_spec>
- Feature name: [INFERRED_FROM_CONTEXT]
- Primary tech: [INFERRED_FROM_CONTEXT]
- Authentication: [INFERRED_FROM_CONTEXT]
- Deployment target: [INFERRED_FROM_CONTEXT]
</feature_spec>

Edit any details above that need changing, or reply 'looks good' to proceed."

It’s my Roomba rule: 80% vacuumed is better than dust accumulating everywhere. Starting with 80% filled in is better than starting with 0% filled in. The user can quickly scan, fix what’s wrong, and move forward.

4. Progressive Disclosure Pattern

Start with the 80%, then allow opt-in as things get more complex.

Do this:

Please progressively disclose options. Start with the simple version and 
progressively ask about advanced options. For example:

"I’ll create a standard REST API with CRUD and basic auth.

Reply 'advanced' if you also need: pagination, rate limiting, 
request validation, or API versioning."

Most users just want the simple version. Don’t burden yourself with advanced options upfront. Make power user options easily accessible however.

5. Quick Pick Pattern

Ease your mental load by having the AI offer multiple-choice options for common scenarios.

Do this:

Offer multiple choice for common decisions. For example:

"How should I structure this code?
A) Quick script - minimal structure
B) Production app - full architecture, tests, docs
C) Prototype - balanced approach

Just reply A, B, or C."

6. Diff Pattern

Show changes as visual diffs instead of descriptions.

Don’t do this:

Describe what you changed in the code.

Do this:

Show proposed changes as diffs. For example:

"Here are the proposed changes:

- app.use(basicAuth);
+ app.use(jwtAuth);
- res.status(500).send('Error');
+ res.status(401).send('Unauthorized');

Reply 'apply' to confirm."


Humans process + and - symbols faster than words. I love coloured diffs. It’s easier to scan, easier to verify, and easier to make decisions about. When presenting options or modifications, always prefer diff format over prose descriptions.

7. Incremental Refinement Pattern

Build in stages with explicit pause points.

Don’t do this:

Build everything and show the complete solution.

Do this:

Work in stages with clear checkpoints. For example:

"I’ll build this in stages:

Step 1: Basic component structure
→ Reply 'continue' for Step 2
→ Reply 'modify [what]' to adjust
→ Reply 'done' if sufficient

[Show Step 1 output and pause]"

This gives users natural checkpoints without forcing them to review every tiny detail.

What Good Prompt UX Achieves

Minimises

Seconds to decision
Keystrokes required (huge!)
Context switching
Back-and-forth rounds

Maximises

Ease of correcting mistakes
Preserving momentum and flow state
Accuracy of overall work (make it easy for you to review)

Other Tips

Make “yes” the default path

Have AI structure questions so continuing requires less effort.

Use visual structures for quick scanning

Tables, lists, and emojis help us process information faster.

Provide escape hatches

Always show the doors but don’t make users walk through every door.

Keep feedback loops tight

Every question should be answerable in seconds.

Learn You A Prompt UX for Great Good!

This isn’t about making AI smarter but the interaction smarter and more fun!

When you design good prompt UX, you’re saving keystrokes, preserving momentum. Ideas flow faster than you can type and it’ll free you to focus on what’s most important: creating.

It shouldn’t feel like filling out forms. It should feel like a conversation, not dictation.

How I Use Cursor (Now with GPT-5)

Mario Hayashi — Fri, 08 Aug 2025 15:14:47 +0000

I didn’t expect to write this today, but GPT-5 dropped, and Cursor is giving paying users free credits for launch week.

That’s basically the equivalent of finding a $20 bill on your desk (and then some, because it seems they gave me extra credits). Let’s talk how I use Cursor day-to-day.

The magic comes from shaping context, often adding as much as possible. The context window is massive compared to a year ago, so use it! Over time, I’ve built a small set of practices that make the AI supercharge my workflow. Hopefully I’ll discover more tricks, now that GPT-5 is out.

1. My `default.mdc` Is an Index File, Not a Dump

One early mistake I made: cramming everything into default.mdc.

That’s like what I used to do as a high school student: carry all my textbooks everywhere instead of looking at the classes for the day and bringing only the books that I actually needed.

Now my default.mdc looks more like this:

Please follow best practices, depending on your task: 
- `best-practices-code-style.mdc` for code style 
- `best-practices-cursor-agent-mode.mdc` for Cursor Agent mode 
- `best-practices-data-access.mdc` for data access conventions 
- ... 

After a task, succinctly list which Cursor rules you used.

The “rules” aren’t rules per se. They’re just guidelines, so the Agent doesn’t get lost. When we get lost as humans, we look for a guide — it’s the same here. By keeping the index slim, I’m trying to avoid context and usage limit headaches and the AI can grab what it needs, when it needs it.

I ask the model to list the Cursor rules it uses because I’m currently trying to see if it helps me optimise the lookups.

2. Best Practices Driven

Best practices aren’t rules. They’re what you reach for when you’re not sure. For me, they cover:

Code Style — Comments explain why, not just what. Imports grouped. Names that actually mean something.
Data Access — Always through a DAO tier.
Error Handling — Fail loud where it matters, dead code is better than limping code, validate inputs, don’t let async errors disappear.
React — Functional components, hooks, and typed props.
Testing — it() for behaviour (instead of less fluent test()), formatting the describe() statement.

These files are living. Every time I trip over something, I think about adding it to the best practice docs.

3. AI Dev Tasks: The Not-So-Secret Secret Weapon

Cursor’s Agent mode is great, but AI Dev Tasks have become the go-to workflow upgrade. If you haven’t seen Ryan Carson’s repo: snarktank/ai-dev-tasks. Have a look.

I use three rules:

create_prd — It’s like having a PM who asks the really good clarifying questions. The questions that require executive-level input.
generate_tasks — Turns the PRD into a concrete task list. Tactical.
process_tasks — With tasks broken down sufficiently, execute them.

It’s not just about prompt optimisation — it’s about context optimisation. With these prompts, the AI knows where we’re going and why.

4. Brand Guidelines Aren’t Just for Marketers, Designers

The other file that is underrated: brand guidelines.

I do a lot of front-end, landing pages, new feature work and copy is often “TBD.” Instead of stalling, I feed the AI my brand voice doc and get solid placeholders instantly.

Mine covers everything from:

Tone of Voice — Sample phrases, dos and don’ts.
Color Palette + Typography — Get the vibe right.
Layout Principles — Mobile-first, modular components.
Key Benefits + Differentiators — So even initial copy feels on-brand.

It means I can keep developing while waiting for final copy. When product managers or designers review your work, they’ll see “good enough” copy to sign it off.

Closing Thought

GPT-5 in Cursor is fun but it still needs structure. The real unlock is to treat AI like a junior dev who can out-type me (times 1,000,000) but still needs the right guardrails.

If you’re experimenting with AI in your workflow, I’d love to hear what it looks like.

I write more like this at blog.mariohayashi.com, and you can find me on Twitter: @logicalicy.

How I Use AI Every Day (as a Software Engineer + Indie Maker Making Shopify Apps)

Mario Hayashi — Mon, 04 Aug 2025 10:10:14 +0000

When the year 2025 started, I didn’t expect to use AI this much.

Almost by accident, AI started showing up in every corner of my workflow. Not just the obvious stuff like code generation or content help, but in much subtler, surprisingly helpful ways. OpenAI Pro, Cursor and other tools are increasingly part of how I operate.

It’s not about chasing the next model drop. For me, it’s about tool fluency — knowing which model to reach for, how to frame a good prompt, and where it actually saves me time or gives me perspective.

Here’s how I use AI in my day-to-day life — especially as a software engineer, indie builder, and someone just trying to get more done with less friction.

1. 🏗️ Cursor + Claude 4 + AI Dev Tasks for software engineering

Architecting a solution and software engineering are sometimes an art. Of course, software engineering principles, best practices and years of experience should be drawn from. There’s no substitute for that! But there’s an element of devising a creative solution that requires perspective and testing key assumptions, which AI is well suited for. That’s where my setup of Cursor, Claude 4 and AI Dev Tasks comes in.

Of all the models, Claude 4 is the one I keep coming back to for slightly trickier coding or architecture planning. It’s thoughtful, coherent, and has unpicked some very tricky problems (including bugs!). I use the AI Dev Tasks Cursor rules as my pseudo product manager and pseudo junior dev, with whom I create PRDs and build fully-fledged, working features. If you don’t know about it, I recommend you check it out.

With AI Dev Tasks, the prompt starts with:

“@create_prd I want to add X feature.”

It doesn’t always get it right. When it fails, it’s usually because I didn’t give it enough context — or I didn’t fully understand the problem myself (a blogpost for another time). Also, if you have bad, smelly code in the codebase or the API/SDK you are using is badly documented, it won’t perform well or will get confused.

But when it works, it’s a massive time-saver.

2. 💬 ChatGPT for career coaching

I’ve worked with a great coach before and AI isn’t a replacement by any means! But ChatGPT has been excellent at helping me think about my career goals. It’s like having a sounding board that doesn’t get tired.

Prompt I use:

You are an expert career coach with a track record of advising seasoned XXXs, YYYs, and ZZZs with over a decade of experience. You deeply understand the transition from strong individual contributor to impactful technical leader, especially in ambiguous environments like startups and big tech. You are committed to helping me develop authentic leadership capabilities—not just by managing people, but by driving outcomes, building influence, and leading through complexity. You provide grounded, evidence-based advice to help me identify and pursue high-leverage growth opportunities. Your role is to support my progression over the next few years into a meaningful, XXX position.

It helps me get unstuck when I’m too in my head. Knowing how valuable a (great) coach can be, I recommend coaching in general. If you can’t find one, try an AI coach?

3. 🗣️ ChatGPT for language roleplay

I’m based in the UK. That means I don’t get many chances to actually speak Japanese day-to-day.

So as an experiment, I created a roleplay prompt for ChatGPT, where it pretends to be a Japanese coworker. I bounce my keigo off it. It’s consistent and responsive, and I can make sure I’m not getting rusty without relying on textbooks alone.

Prompt I use:

You are a polite Japanese-speaking colleague. When I write a sentence, please do the following in your response:

First, provide a natural, polite Japanese translation of my sentence. Use 敬語 (including appropriate use of 尊敬語, 謙譲語, and 丁寧語) based on the context. Do not translate literally—translate naturally as a professional Japanese speaker would.

Then, reply in Japanese as if you are my colleague. Keep the tone professional but warm, and always use appropriate keigo. Include natural phrases, acknowledge what I said, and continue the conversation in a way that would be expected in Japanese business communication.

Structure the response like this:

日本語訳：

[Your translated sentence with keigo and furigana]

返答：

[Your response as a colleague, fully in Japanese keigo with furigana]

Only reply using the above structure every time.

It’s not perfect, but it’s a great way to keep your languages from getting rusty. In a few years’ time, I wouldn’t be surprised if there were language AI buddies who you could speak to in real time.

4. 📝 ChatGPT for writing (with my own tone)

Everyone’s using AI to write now, but I’ve built a bit of a system around it.

I feed ChatGPT my style guide (voice, tone, structure, signature phrases), and then create with it. I use it to draft content, refine tweets, or shape product copy.

It’s like pair writing with a really fast co-writer who respects my vibe.

I use ChatGPT project instructions along these lines:

# 🧠 Content Writing Style Guide

This style guide defines the tone, structure, and phrasing typical of XXX’s writing for use when generating content with ChatGPT.

Tone

…

Then I paste the rest of the guide into the instructions. Each conversation in the project then generates content aligning with the style guide.

5. 🧾 Gemini for OCR + spreadsheets

I sometimes need to scan a receipt or screenshot and enter it into a CSV.

I once gave both ChatGPT and Gemini a photo of a spreadsheet and asked them to convert it to clean CSV. ChatGPT Pro got close. But Gemini 2.5 Pro nailed it. Barely any cleanup required.

If you're dealing with receipts, old docs, or screenshots, it’s a killer use case that not enough people talk about.

6. 🎞️ Playing with Pika for video gen

This one’s still early for me. I’ve been experimenting with Pika for AI-generated videos — mostly for demo purposes and fun product visuals.

It certainly has flaws that you want to be careful of in customer-facing flows, but it’s getting close. More to discover here.

7. 📊 ChatGPT inside Google Sheets (for enriching + writing)

This one feels like a cheat code.

I’ve started using ChatGPT inside Google Sheets via Extensions > Apps Script, and the leverage is wild. Especially for things like:

Enriching data (e.g. pulling company type or sector from a URL)
Auto-generating short product descriptions
Cleaning up messy labels
Translating or rewriting copy at scale

You’ll need to know a bit of scripting (JavaScript knowledge helps). It’s fast, cheap, and gets me 80% of the way there without needing a full enrichment pipeline.

Ask ChatGPT to generate the Apps Script for you but be careful (!) to have the code vetted by a real software engineer. Sometimes the generated code can be dangerous.

You’ll need a lot more than this to make your Sheets do anything useful but here’s the code for making a request to OpenAI. Please don’t hardcode your API key; put it somewhere safer (like Script properties).

var apiKey = PropertiesService.getScriptProperties().getProperty('OPENAI_API_KEY');
var prompt = `
You're helping someone do XXX.
`;

var payload = JSON.stringify({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: "You are a precise assistant. Respond with only a JSON result..."
    },
    {
      role: "user",
      content: prompt
    }
  ],
  temperature: 0.3,
  max_tokens: 50
});

var options = {
  method: "post",
  contentType: "application/json",
  payload: payload,
  headers: {
    Authorization: "Bearer " + apiKey
  },
  muteHttpExceptions: true
};

var aiResponse = UrlFetchApp.fetch("https://api.openai.com/v1/chat/completions", options);
var json = JSON.parse(aiResponse.getContentText());
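
Because the request sets muteHttpExceptions: true, a failed call does not throw, so the status code needs checking before the body is trusted. The helper below is a hedged sketch of that step. The name `extractContent` and the `ChatCompletionBody` interface are assumptions, and it is written in TypeScript, so the type annotations would need removing before pasting it into the Apps Script editor.

```typescript
// Minimal shape of the parts of the Chat Completions response used here.
// `ChatCompletionBody` is a hypothetical name for this sketch.
interface ChatCompletionBody {
  choices?: { message?: { content?: string | null } }[];
  error?: { message?: string };
}

// Returns the assistant's text, or throws with the API's error message.
// Note: JSON.parse itself throws if the body is not JSON at all.
function extractContent(statusCode: number, bodyText: string): string {
  const body = JSON.parse(bodyText) as ChatCompletionBody;

  if (statusCode !== 200) {
    const reason = body.error?.message ?? "unknown error";
    throw new Error(`OpenAI request failed (${statusCode}): ${reason}`);
  }

  const content = body.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("Response contained no message content");
  }
  return content.trim();
}
```

In the script above it could be called as extractContent(aiResponse.getResponseCode(), aiResponse.getContentText()), in place of parsing the body directly.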

It’s not perfect, but for $20/month it’s incredible bang for buck. Especially when you’re solo or early-stage and every minute matters.

🧰 Fluency over hype

I think this is the real unlock.

It’s tool fluency.

Knowing when to reach for which tool. Knowing what to ask. Knowing how to layer it into your day in ways that actually remove friction.

As an indie maker (making Shopify apps) and software engineer, my day is full of small frictions:

Naming things
Drafting content
Debugging bugs
Thinking through UX
Deciding what not to do

AI doesn’t solve everything. But it helps with more than I expected.

If you’re experimenting with AI in your workflow (especially around Shopify), I’d love to hear from you.

I write more like this at logicalicy.substack.com. Feel free to follow me on Twitter: @logicalicy

Back in the Fold: My Return to Building in Public (and Doing It Better This Time)

Mario Hayashi — Sat, 02 Aug 2025 15:46:43 +0000

It’s been a while.

The last time I was seriously indie making was sometime back in 2020-2023. Since then, life took over — work, family, and the usual responsibilities. I never fully stopped building things, but it all felt scattered. Some apps made a little money, but none stuck. This time… I’m doing things differently.

This is my return to the fold. I’m keen on doing it better — and more sustainably.

🚀 The New Journey Begins: Wildcardo

My first milestone? Wildcardo — a Shopify app I just got approved.

Wildcardo helps merchants automatically detect and fix 404 errors before they hurt SEO or customer experience. It offers automated and wildcard-based 301 redirects, branded short links, and error notifications. Basically: no more broken links tanking your rankings or sending customers into the void.

This is also my first personally published Shopify app. I used to build Shopify apps professionally, but this one is fully indie. Just me (and a few very smart AI agents).

🧭 What I’m Doing Differently This Time

After years of trying lots of things, here’s the distilled strategy I’m betting on now:

1. Build in public. Always.

This post is part of that. I’ll be sharing progress, lessons, and flops — regularly. It helps me stay accountable and (hopefully) connect with people who care about the same things.

2. Build to test, not to finish.

I’ve learned the hard way that building too much too early is a trap. Now I aim to build just enough to answer one question: Does this solve a real problem for someone who’s willing to pay?

If not, move on — fast.

3. Time is scarce, automate everything.

Between freelance, family, and focus, time is tighter than ever. So:

I’m automating workflows with n8n
Adding proper monitoring and error alerting
Using Cursor, ChatGPT, and other AI tools as true co-pilots
Avoiding side quests at all costs

4. Make AI a first-class citizen

Not just in how I build, but also in the products themselves. If AI can create real leverage for users — I want it baked in, not bolted on.

5. Spend at least 50% of my time not coding

This one’s hard — but essential. Discovery, customer development, marketing, distribution. That’s where momentum is built (and where I used to stall out).

6. Ship value early, often, and clearly

Nobody cares unless you make their life better today. So I’m working on getting much faster at showing obvious wins. Not vague “it could help” stuff — actual, valuable outcomes. That means leaning heavily on Jobs To Be Done, and hyper-focusing on real user struggles.

🎯 The Goal: $1K MRR in 6 Months

I’ll be honest. I haven’t settled on the exact figure yet. But it helps to have something rather than nothing.

I’ll get there by:

Building a few small, sharp apps that solve painful problems
Creating valuable content around those problems
Reaching out directly with useful demos, not generic cold messages
Letting go of what doesn’t click

If it doesn’t get traction, I’ll learn and move on.

🌱 What’s Next?

I’ve got a list of ideas I’m exploring — and I’ll be building in the open as I go. I’m also carving out time each week to write, reflect, and engage with others on the same path.

There’s no magic plan here. Just a commitment to keep learning, stay small, and stack tiny wins.

Let’s see where this goes.

Follow Along?

If you’re also returning to building — or just curious where this journey leads, feel free to subscribe.

Twitter: @logicalicy

Product updates: logicalicy.substack.com

App: Wildcardo on Shopify

I Built a Whole App by Vibe Coding with AI — And It Changed How I Think About Software

Mario Hayashi — Sat, 14 Jun 2025 15:39:37 +0000

I haven’t written a blogpost in a while. Or built something just for fun in… longer than I’d like to admit.

But recently, I shipped a whole app — not with a roadmap, or a spec doc, or a sleepless weekend of React wiring. I just vibe coded it. With AI. And it worked.

Let me explain.

1. What Is Vibe Coding?

Vibe coding isn’t a framework. It’s not some new agile flavor. It’s more like… jamming.

You show up with an idea and let the model fill in the blanks. You sketch the “what,” and the AI figures out the “how.” You’re not wrestling with syntax or structure. You’re keeping the energy flowing.

It’s a little chaotic. A little magical. You prompt, tweak, rerun, adjust — and before you know it, you’ve got something running locally.

In the early days, I thought AI pair programming would be like having a helpful junior dev. Now it feels more like having an unreasonably productive partner who never sleeps and is oddly good at YAML.

2. I Vibe Coded a small Shopify App

I’ve been exploring ways to make AI workflows feel natural for indie builders — no tooling drama, no AI wrappers on top of wrappers. Just real output.

So I forked ai-dev-tasks which was recommended to me via this YouTube video — thanks LT if you’re reading this — and started using it to build a Shopify app.

No long planning. No ticket grooming. Just iterative thinking with the AI.

I’d describe the vibe like this:

Start with a high-level goal (e.g., “build a Shopify app that handles wildcard redirects”).
Feed the LLM some initial context and constraints. Ask it to ask clarifying questions.
Let it generate everything from folder structure to infrastructure.
Fix where it breaks. Prompt again. Repeat.

It wasn’t perfect. But it was fast. And most importantly, it got me shipping again. That was the hardest part. I don’t have boatloads of time in my life anymore — an hour of free time here and there. AI makes those hours count.

I’m now looking at tools like Task Master — which push this thinking further by chaining prompts into buildable task graphs. There’s a growing ecosystem here.

3. This Isn’t Just Faster — It’s Different

What surprised me most wasn’t the speed — it was the shape of the work.

When vibe coding, I’m not thinking about best practices or writing exhaustive specs. I’m having a conversation about what I want and then steering the AI toward something usable.

It’s not passive. You still need good judgment. Product sense.

But it feels like using the AI to prototype your thoughts, not just generate code.

That’s a big deal. In a few years, I think early-stage builders — and even technical PMs — will regularly vibe-code their way to working MVPs before anyone says “let’s open JIRA.”

The Future Is Vibey

I’m creating again — and AI is making it fun. I’m a big fan.

This way of building won’t replace great engineering. But it’s opening doors for speed, experimentation, and creative flow that I haven’t felt in years.

Vibe coding isn’t the answer to everything. But if you’ve been stuck, burned out, or overthinking your next idea… try jamming with a model.

You might ship something cool. You might even write a blogpost about it.

If this resonated, I’ll be sharing more hands-on explorations, tools I’m testing, and the occasional build log. Subscribe to follow along.
