Hadi Askari

My LLM Keeps Failing in Production. Here's What I Built to Fix It Automatically.

A story about debugging, frustration, and why I stopped doing it manually.

I want to tell you about a Tuesday afternoon I had about a year ago.

I was staring at a Langfuse dashboard. Fifty-three failed traces. All from the same RAG pipeline I'd spent three weeks building. The failures were spread across different users, different inputs, different times of day — but they all shared one thing: the model was returning JSON that didn't match the schema my downstream code expected.

I knew what the fix was. I needed to tighten the system prompt. Add a few examples. Maybe enforce structured output.

So I did that. Deployed. Went home.

Two weeks later, different failures. New edge cases the new prompt didn't handle. Back to the dashboard.

I did this cycle four times in six weeks.

The Thing Nobody Talks About
Everyone talks about LLM observability. Trace your calls. Log your scores. Monitor your failures.

And the tools for that are genuinely good now. Langfuse, LangSmith, Helicone — they'll show you exactly what broke, when, and how often.

But then what?

You're staring at a failure. You understand it. And now you have to figure out the fix manually. Rewrite the prompt by intuition. Hope it works. Hope it doesn't break the cases that were already working.

No tool says: here is the fix, and here is proof it doesn't break anything else.

That gap — between seeing the problem and safely fixing it — is where ML engineers lose hours every week. I know because I was one of them.

What I Started Building
I got tired of the manual loop. So I started building something I'm calling LangHeal.

The idea is simple to explain:

When your LLM fails in production, LangHeal doesn't just show you the failure. It proposes a fix. A real one — not a label like "prompt issue" or "try rewriting this." An actual rewritten prompt. An actual JSON schema. An actual routing rule for the edge case your agent doesn't handle.

And before it shows you anything, it tests that fix against every failure you've had before. If a proposed change would break a case that was already working, you never see it. It gets filtered out automatically.

You only see proposals that are proven safe against your historical cases.

How It Actually Works
Here's the flow when LangHeal runs an analysis on one of your AI features:

Step 1 — It fetches your failures from Langfuse
Not all your traces. Just the ones scoring below your threshold on the score you told it to watch (e.g., quality < 0.7). Just for the specific AI feature you registered.
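
If you want to picture that step in code, it's roughly this. A minimal sketch using the Langfuse Python SDK; fetch_traces / fetch_trace and the score fields are how I understand the v2 SDK (double-check against your SDK version), and the feature name, score name, and 0.7 threshold are just the example from above:

from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from env

THRESHOLD = 0.7  # the score threshold you configured for this feature

def fetch_failures(feature_name: str, score_name: str = "quality"):
    """Return recent traces for one registered feature whose watched score fell below the threshold."""
    failures = []
    for summary in langfuse.fetch_traces(name=feature_name, limit=100).data:
        trace = langfuse.fetch_trace(summary.id).data  # full trace, including its score objects
        for score in trace.scores:
            if score.name == score_name and score.value is not None and score.value < THRESHOLD:
                failures.append(trace)
                break
    return failures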

Step 2 — It figures out what went wrong
If Langfuse already has a score on the trace, LangHeal uses it. If not, it runs an LLM-as-a-judge classifier to label the failure mode — schema violation, hallucination, edge case, tool failure, and so on.
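
The judge itself is nothing exotic. Here's a minimal sketch of that kind of classifier using the OpenAI Python client; the labels and prompt wording are my own illustration, not LangHeal's actual prompt:

from openai import OpenAI

client = OpenAI()

FAILURE_MODES = ["schema_violation", "hallucination", "edge_case", "tool_failure", "other"]

def classify_failure(user_input: str, bad_output: str) -> str:
    """Label one failed trace with a single failure mode, using an LLM as the judge."""
    prompt = (
        "You are labeling failures of an LLM feature.\n"
        f"Input: {user_input}\n"
        f"Output: {bad_output}\n"
        f"Reply with exactly one label from: {', '.join(FAILURE_MODES)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip()
    return label if label in FAILURE_MODES else "other"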

Step 3 — It generates concrete fixes
Ranked from cheapest to most invasive. First, it tries the small things: a stricter JSON schema, a few-shot example added to the prompt, a retry with validation, and a tightened system instruction. If those can't address the failure, it escalates to bigger changes — a routing rule for the edge case, a tool definition fix, eventually a fine-tuning recipe.
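
To make "cheapest to most invasive" concrete, here's the kind of ordering I mean. The names and structure below are illustrative, not LangHeal's internal types:

from dataclasses import dataclass

@dataclass
class FixProposal:
    kind: str      # which rung of the ladder this change sits on
    cost: int      # lower = cheaper and less invasive; proposals are ranked by this
    payload: dict  # the concrete artifact: rewritten prompt, JSON schema, routing rule, ...

# Cheapest first: prompt-level tweaks, then structural changes, then fine-tuning.
FIX_LADDER = {
    "stricter_json_schema": 1,
    "few_shot_example": 2,
    "retry_with_validation": 3,
    "tightened_system_prompt": 4,
    "routing_rule": 5,
    "tool_definition_fix": 6,
    "fine_tuning_recipe": 7,
}

def rank(proposals: list[FixProposal]) -> list[FixProposal]:
    """Order proposals so the least invasive fixes surface first."""
    return sorted(proposals, key=lambda p: p.cost)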

Step 4 — It tests every verifiable fix against history
Each proposed change is replayed against a sliding window of up to 50 of your past failure cases (input, expected output, and cached tool results from the original trace, so even agentic features replay deterministically). If the fix passes all of them, it surfaces with a badge: "50/50 historical cases passed." If it fails even one, it's rejected silently, and you never see it.

There's a known caveat: on your very first analysis, before there's any history to replay against, proposals are marked "Unverified — first run" and you're warned. The regression suite gets built up as failures accumulate.
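
Mechanically, the replay gate is simple: re-run the feature with the proposed change over each stored case and reject on the first regression. A sketch, where run_feature and passes_check stand in for the real internals:

from dataclasses import dataclass, field

@dataclass
class HistoricalCase:
    input: dict
    expected_output: dict
    cached_tool_results: dict = field(default_factory=dict)  # replayed instead of live tool calls

def verify_against_history(proposed_fix, cases, run_feature, passes_check, window=50):
    """Replay a fix over the sliding window; a single regression rejects it outright."""
    recent = cases[-window:]
    for passed, case in enumerate(recent):
        output = run_feature(
            case.input,
            fix=proposed_fix,
            tool_results=case.cached_tool_results,  # deterministic replay, even for agentic features
        )
        if not passes_check(output, case.expected_output):
            return passed, len(recent)   # rejected; the proposal is never surfaced
    return len(recent), len(recent)      # e.g. 50/50: safe to show with the badge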

Step 5 — You approve
You pick the fix you want. It's applied to the AI feature. Failed cases that prompted the analysis are added to the regression suite, so next time around, there are even more historical cases protecting you.

The Piece That Keeps This Honest
Regression replay only works if LangHeal knows when your AI feature actually changed. If you ship a new prompt or swap a tool and LangHeal doesn't know, the historical sample drifts, and the "passed" badge becomes a lie.

The fix is a one-line addition to your deploy script:

curl -X POST https://langheal.example/api/features/{id}/deployed \
  -H "Content-Type: application/json" \
  -d '{"commit": "abc123", "deployed_at": "..."}'
LangHeal stamps each historical case with the version that captured it, so it knows when the sample is stale and needs refreshing. Without this hook, regression still runs, but it can't catch every kind of silent drift.
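
On LangHeal's side, the stamping amounts to recording which deploy each case was captured under and treating anything older as stale. A sketch (the field name here is mine, not the real schema):

def is_stale(case: dict, current_commit: str) -> bool:
    """A case captured under an older deploy no longer reflects the live feature."""
    return case.get("deployed_commit") != current_commit

def usable_history(cases: list[dict], current_commit: str) -> list[dict]:
    """Replay only against cases stamped with the current version; the rest get re-collected."""
    return [c for c in cases if not is_stale(c, current_commit)]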

What This Isn't
It's not magic. It won't fix every LLM problem automatically without you thinking about it.

You still review the proposals. You still approve the change. The system never applies anything without your sign-off.

What it removes is the busywork of figuring out what to try and the guesswork of wondering whether your fix will break something else.

The Fine-Tuning Part
Sometimes, prompt changes aren't enough. The model needs to actually learn something new.

LangHeal handles that too — but carefully. It takes the failure cases, generates proposed correct outputs using the same healing engine, and puts them in front of a human reviewer. Three columns: the original input, what the model actually said (the bad output), and what the system thinks it should have said.

You approve, edit, or reject each sample. Only verified samples go into the training dataset. Then it kicks off the fine-tuning job (today: OpenAI; Together AI and local Axolotl are on the roadmap), and the resulting model can be put through the same regression suite before you decide to ship it.
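
For reference, the tail end of that pipeline (approved samples into a JSONL file, then a job submission) looks roughly like this with the OpenAI fine-tuning API. The sample structure and base model here are my assumptions, not LangHeal's actual format:

import json
from openai import OpenAI

client = OpenAI()

def launch_fine_tune(approved_samples, base_model="gpt-4o-mini-2024-07-18"):
    """approved_samples: dicts with 'input' and 'verified_output', already through human review."""
    # OpenAI chat fine-tuning expects a JSONL file with one messages array per line
    with open("training.jsonl", "w") as f:
        for sample in approved_samples:
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": sample["input"]},
                    {"role": "assistant", "content": sample["verified_output"]},
                ]
            }) + "\n")

    training_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=training_file.id, model=base_model)
    return job.id  # the resulting model goes through the same regression suite before shipping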

No raw failures going into training data. No fine-tuning without human verification. No deploying a new model without regression testing.

Where I Am Right Now
LangHeal is early. I'm building it as open source. The core loop — fetch failures, propose fixes, verify regression, approve — is what I'm focused on first.

I'm not launching anything yet. I'm not asking you to sign up for anything.

I'm asking: does this problem sound familiar?

If you've spent a Tuesday afternoon staring at failed traces and rewriting prompts by gut feel — I'd genuinely like to hear about it. What broke? What did you try? What would have made it easier?

Every conversation I have right now shapes what gets built first.

Drop a comment, send me a message, or find the project on GitHub when it's ready. The more specific your experience, the more useful it is.

Building LangHeal in public. Follow along if this is your kind of problem.

https://github.com/langheal-io/langheal
