Hi devs! 👋
I'm Yuto, and I want to share the story of why I built Kaizen Agent — an open-source CLI tool that tests, debugs, and auto-fixes LLM apps and agents.
This post is about why I built it, the pain that led me here, and how it works. If you're building with LLMs and tired of the trial-and-error cycle, I hope this resonates with you.
😤 The real pain behind building LLM apps
Over the past year, I’ve been working on LLM agents and applications as part of my startup and my PhD.
One thing I’ve realized is this:
Building LLM apps isn't that hard — but getting them to production-quality is brutally hard.
You can write a basic agent or prompt flow pretty quickly. But making it robust enough to actually use in production? That’s where it gets messy.
Here’s what I kept running into:
- I’d write a prompt, test it… and get weird or inconsistent output.
- I’d fix the prompt or logic, test again… and break something else.
- I’d try to define test cases, run evaluations, and compare outputs manually — over and over.
Honestly, it felt like I was doing the same boring, manual steps repeatedly:
- Write some test cases
- Run the agent
- Check the outputs manually
- Fix the prompt/code
- Repeat again and again
This manual cycle was draining my energy.
💡 The insight: LLM testing is different
That’s when something clicked:
LLMs are black boxes. You can't know if your change helps unless you actually test it.
Unlike traditional software, where you can reason through logic and expect consistent outputs, LLMs require a test-it-and-see approach.
You must:
- Feed in test data
- Evaluate outputs
- Spot failure patterns
- Iterate based on those observations
So I asked myself:
Why don’t we have tools optimized for this loop — for AI agents and LLM apps specifically?
We don’t just need unit tests or integration tests. We need feedback loops that help us improve LLM behavior.
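To make that loop concrete, here's a minimal sketch of what I was doing by hand. The function names and test cases below are placeholders I made up for this post — they are not Kaizen Agent APIs:

```python
# A rough sketch of the manual test-and-see loop described above.
# run_agent() and is_acceptable() are hypothetical placeholders for your
# own agent call and evaluation logic; they are not Kaizen Agent APIs.

test_cases = [
    {"input": "Summarize this support ticket about a refund.",
     "must_mention": "refund"},
    {"input": "Summarize an empty message.",
     "must_mention": "nothing to summarize"},
]

def run_agent(prompt: str) -> str:
    """Call your LLM app or agent here and return its raw output."""
    return ""  # replace with a real call to your agent / SDK / HTTP client

def is_acceptable(output: str, must_mention: str) -> bool:
    """A crude keyword check; in practice this might be an LLM-as-judge call."""
    return must_mention.lower() in output.lower()

for case in test_cases:
    output = run_agent(case["input"])
    status = "PASS" if is_acceptable(output, case["must_mention"]) else "FAIL"
    print(f"{status}: {case['input']}")
    # On a FAIL: read the output, tweak the prompt or code, rerun everything.
```

Every prompt or code change meant rerunning something like this and eyeballing the failures all over again.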
🛠️ The idea: automate my own debugging process
That’s when I decided to build Kaizen Agent.
The idea was simple:
- Define your test inputs, expected behavior, and evaluation logic in a YAML file
- Run tests on your LLM app or agent
- Detect failures and understand what went wrong
- Suggest prompt/code fixes using another LLM
- Re-run the tests automatically
- (Optional) Open a pull request with the improved prompt/code
So instead of running tests manually and fixing things yourself, you can just run one CLI command — and let the agent debug itself.
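For a rough sense of what a test definition could look like, here's an illustrative sketch. The field names are invented for this post and are not the actual Kaizen Agent YAML schema — the real format is in the docs:

```yaml
# Illustrative sketch only: these field names are made up for this post
# and are not the real Kaizen Agent schema (see the documentation).
name: summarizer-regression
agent:
  entry_point: my_agent.py        # hypothetical path to the agent under test
tests:
  - input: "Summarize this support ticket about a delayed refund."
    expected_behavior: "A two-sentence summary that mentions the refund."
  - input: "Summarize an empty message."
    expected_behavior: "A polite note that there is nothing to summarize."
evaluation:
  method: llm_judge               # e.g. have another LLM grade each output
```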
🚀 The launch
Once the core functionality worked, I put it on GitHub and released the first version. The README was rough. There was no documentation yet. But it worked.
Since then, I’ve:
- Improved the README with a super simple example
- Created a full documentation site for better onboarding
- Published to PyPI (`pip install kaizen-agent`) to make it easier to try
🙏 Final thoughts
If you’ve ever felt stuck in the loop of:
Prompt → test → tweak → test again…
and wished someone (or something) could help — I built this for you.
Check out Kaizen Agent on GitHub, and if it’s helpful, please give us a star ⭐ and share your feedback.
You can also follow me on X/Twitter: @yuto_ai_agent — I’d love to hear your thoughts or questions!
Thanks for reading!
— Yuto