Hi devs! 👋
I'm Yuto, and I want to share the story of why I built Kaizen Agent — an open-source CLI tool that tests, debugs, and auto-fixes LLM apps and agents.
This post is about why I built it, the pain that led me here, and how it works. If you're building with LLMs and tired of the trial-and-error cycle, I hope this resonates with you.
😤 The real pain behind building LLM apps
Over the past year, I’ve been working on LLM agents and applications as part of my startup and my PhD.
One thing I’ve realized is this:
Building LLM apps isn't that hard — but getting them to production-quality is brutally hard.
You can write a basic agent or prompt flow pretty quickly. But making it robust enough to actually use in production? That’s where it gets messy.
Here’s what I kept running into:
- I’d write a prompt, test it… and get weird or inconsistent output.
- I’d fix the prompt or logic, test again… and break something else.
- I’d try to define test cases, run evaluations, and compare outputs manually — over and over.
Honestly, it felt like I was doing the same boring, manual steps repeatedly:
- Write some test cases
- Run the agent
- Check the outputs manually
- Fix the prompt/code
- Repeat again and again
This manual cycle was draining my energy.
💡 The insight: LLM testing is different
That’s when something clicked:
LLMs are black boxes. You can't know if your change helps unless you actually test it.
Unlike traditional software, where you can reason through logic and expect consistent outputs, LLMs require a test-it-and-see approach.
You must:
- Feed in test data
- Evaluate outputs
- Spot failure patterns
- Iterate based on those observations
So I asked myself:
Why don’t we have tools optimized for this loop — for AI agents and LLM apps specifically?
We don’t just need unit tests or integration tests. We need feedback loops that help us improve LLM behavior.
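To make that loop concrete, here's a minimal sketch of what I was doing by hand. The function names and test cases below are placeholders I made up for this post — they are not Kaizen Agent APIs:

```python
# A rough sketch of the manual test-and-see loop described above.
# run_agent() and is_acceptable() are hypothetical placeholders for your
# own agent call and evaluation logic; they are not Kaizen Agent APIs.

test_cases = [
    {"input": "Summarize this support ticket about a refund.",
     "must_mention": "refund"},
    {"input": "Summarize an empty message.",
     "must_mention": "nothing to summarize"},
]

def run_agent(prompt: str) -> str:
    """Call your LLM app or agent here and return its raw output."""
    return ""  # replace with a real call to your agent / SDK / HTTP client

def is_acceptable(output: str, must_mention: str) -> bool:
    """A crude keyword check; in practice this might be an LLM-as-judge call."""
    return must_mention.lower() in output.lower()

for case in test_cases:
    output = run_agent(case["input"])
    status = "PASS" if is_acceptable(output, case["must_mention"]) else "FAIL"
    print(f"{status}: {case['input']}")
    # On a FAIL: read the output, tweak the prompt or code, rerun everything.
```

Every prompt or code change meant rerunning something like this and eyeballing the failures all over again.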
🛠️ The idea: automate my own debugging process
That’s when I decided to build Kaizen Agent.
The idea was simple:
- Define your test inputs, expected behavior, and evaluation logic in a YAML file
- Run tests on your LLM app or agent
- Detect failures and understand what went wrong
- Suggest prompt/code fixes using another LLM
- Re-run the tests automatically
- (Optional) Open a pull request with the improved prompt/code
So instead of running tests manually and fixing things yourself, you can just run one CLI command — and let the agent debug itself.
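For a rough sense of what a test definition could look like, here's an illustrative sketch. The field names are invented for this post and are not the actual Kaizen Agent YAML schema — the real format is in the docs:

```yaml
# Illustrative sketch only: these field names are made up for this post
# and are not the real Kaizen Agent schema (see the documentation).
name: summarizer-regression
agent:
  entry_point: my_agent.py        # hypothetical path to the agent under test
tests:
  - input: "Summarize this support ticket about a delayed refund."
    expected_behavior: "A two-sentence summary that mentions the refund."
  - input: "Summarize an empty message."
    expected_behavior: "A polite note that there is nothing to summarize."
evaluation:
  method: llm_judge               # e.g. have another LLM grade each output
```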
🚀 The launch
Once the core functionality worked, I put it on GitHub and released the first version. The README was rough. There was no documentation yet. But it worked.
Since then, I’ve:
- Improved the README with a super simple example
- Created a full documentation site for better onboarding
- Published to PyPI (`pip install kaizen-agent`) to make it easier to try
🙏 Final thoughts
If you’ve ever felt stuck in the loop of:
Prompt → test → tweak → test again…
and wished someone (or something) could help — I built this for you.
Check out Kaizen Agent on GitHub, and if it’s helpful, please give us a star ⭐ and share your feedback.
You can also follow me on X/Twitter: @yuto_ai_agent — I’d love to hear your thoughts or questions!
Thanks for reading!
— Yuto