
Reme Le Hane

Originally published at remejuan.substack.com

The Most Useful AI Workflow I’ve Built Isn’t About Writing Code

The Problem Wasn’t Triage

Most AI workflows in software engineering still keep the human directly in the middle of triage.

The AI might help write code.
It might explain a stack trace.
It might summarise a pull request.

But the operational loop itself still depends on someone noticing the issue, prioritising it, investigating it, and deciding whether it matters.

I recently shipped a new feature in one of my side projects. The feature itself worked well and was tested properly, but a reasonably significant crash slipped through because it never crossed the notification thresholds in Firebase Crashlytics.

Had I not gone looking manually, I probably would never have known.

That was the moment the idea started forming.

Building the Workflow

Hermes already had access to Crashlytics through MCP tooling, so I started experimenting with whether the entire discovery and investigation process could be automated.

The workflow now looks roughly like this:

  • Hermes checks Crashlytics every 4 hours
  • New issues are documented and prioritised automatically
  • One issue at a time is delegated into an agent workflow
  • The agent gathers additional context through MCP
  • It attempts reproduction
  • Writes or updates tests
  • Builds the relevant platform bundle
  • Works toward a fix
  • Opens a PR if successful
  • Updates the issue document throughout the process

Once the PR is merged, Hermes closes the Crashlytics issue automatically. CodeRabbit reviews the PR before I even touch it.
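
For concreteness, here is a minimal sketch of that loop in TypeScript. Everything in it is illustrative: the helper functions are hypothetical stand-ins for the Crashlytics MCP tooling and the agent delegation, not the actual Hermes implementation.

```typescript
// Minimal sketch of the triage loop described above. All helpers are
// illustrative stand-ins for the real MCP tooling and agent calls.

interface CrashIssue {
  id: string;
  title: string;
  impactedUsers: number;
  severity: "low" | "medium" | "high";
}

interface AgentResult {
  fixed: boolean;
  prUrl?: string;
  notes: string; // investigation log appended to the issue document
}

// Hypothetical MCP-backed integrations, stubbed out here.
async function fetchOpenCrashlyticsIssues(): Promise<CrashIssue[]> {
  return []; // would call the Crashlytics MCP tooling
}
async function runInvestigationAgent(issue: CrashIssue): Promise<AgentResult> {
  return { fixed: false, notes: `investigated ${issue.id}` }; // would delegate into the agent workflow
}
async function updateIssueDocument(issue: CrashIssue, notes: string): Promise<void> {
  // would update the issue document as the investigation progresses
}

async function triageCycle(): Promise<void> {
  const issues = await fetchOpenCrashlyticsIssues();

  // Prioritise: most impacted users first.
  issues.sort((a, b) => b.impactedUsers - a.impactedUsers);

  // Delegate one issue at a time so investigations never interleave.
  for (const issue of issues) {
    const result = await runInvestigationAgent(issue);
    await updateIssueDocument(issue, result.notes);
    if (result.fixed && result.prUrl) {
      console.log(`PR opened for ${issue.id}: ${result.prUrl}`);
    }
  }
}

// Run the cycle every 4 hours.
setInterval(() => void triageCycle(), 4 * 60 * 60 * 1000);
```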

At this point the workflow is probably around 90% automated, with human involvement mostly happening during review and validation.

The Important Part Wasn’t the Model

The interesting part is that the biggest breakthroughs had very little to do with raw model intelligence.

The models matter, but operational fit matters far more.

A model that performs well in blog writing might be terrible in a real codebase. A cheap model that looks good in benchmarks might waste enormous amounts of time if it produces low-quality investigations or weak fixes.

The point is not to spend less money on AI.
The point is to solve problems faster and more reliably.

Different stages need different capabilities.

Triage is relatively lightweight. Crashlytics already provides severity ratings, impacted user counts, stack traces, and environmental information. Smaller models can usually prioritise effectively.

Investigation is different.

That is where reasoning quality matters more. The system needs to understand platform limitations, validate assumptions, read documentation, and justify why something cannot be solved if that turns out to be the outcome.
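
To make that stage split concrete, here is a hedged sketch of how the routing could be expressed. The model identifiers, stage names, and limits are placeholders, not what the workflow actually runs; the point is only that triage and investigation are configured independently.

```typescript
// Illustrative stage-to-model routing. Model identifiers are placeholders.
type Stage = "triage" | "investigation" | "fix" | "review";

interface StageConfig {
  model: string;              // which model (or human) handles this stage
  maxIterations: number;      // hard stop so a stage cannot loop forever
  requiresCitations: boolean; // investigation must link docs for its claims
}

const stages: Record<Stage, StageConfig> = {
  triage:        { model: "small-cheap-model",      maxIterations: 1, requiresCitations: false },
  investigation: { model: "strong-reasoning-model", maxIterations: 5, requiresCitations: true },
  fix:           { model: "strong-coding-model",    maxIterations: 3, requiresCitations: false },
  review:        { model: "human",                  maxIterations: 1, requiresCitations: false },
};
```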

Failure Still Has to Produce Value

One of the production issues that surfaced was caused by a user attempting to upload a 200MB file on Android.

The feature itself was not broken.
The testing was not bad.

The limitation simply had not been considered.

iOS handled the scenario differently, while Android hit native file selector limitations.

The agent eventually linked the issue back to official platform-level documentation explaining that this was an OS-level constraint rather than an application-level bug.

That was surprisingly important to me.

Even when the system could not fully resolve the issue, it still reduced ambiguity.

Instead of:
“Something crashes sometimes.”

The outcome became:

  • here is the root cause
  • here is the platform limitation
  • here are the relevant docs
  • here are the explored approaches
  • here is why they failed

That is still valuable engineering work.
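
That outcome has a shape worth pinning down. Here is a minimal sketch of the record an investigation can produce even when it fails; the field names are illustrative, not the actual issue-document format Hermes writes.

```typescript
// Illustrative shape of an investigation record. Even an unfixable issue
// should fill in every field except the PR link.
interface InvestigationRecord {
  issueId: string;
  rootCause: string;           // e.g. "OS-level file selector limit on Android"
  platformLimitation?: string; // present when the constraint sits outside the app
  references: string[];        // links to the official docs that justify the conclusion
  exploredApproaches: { approach: string; whyItFailed: string }[];
  resolution: "fixed" | "platform-limitation" | "wont-fix" | "needs-human";
  prUrl?: string;              // only set when a fix was actually opened
}
```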

Ambiguity Breaks Agent Systems Fast

One of the biggest lessons from building this was how quickly ambiguity compounds across agents.

Early versions of the workflow failed repeatedly because the orchestration itself was unclear.

At one point Hermes stayed stuck in monitoring mode for several iterations because the instructions were sequenced poorly. The issue was not really model intelligence. The issue was that the workflow itself had contradictory or badly ordered expectations.

The fix was not adding a smarter model.

The fix was:

  • clearer contracts
  • simplified instructions
  • stronger constraints
  • better sequencing
  • explicit breakout conditions
  • reduced ambiguity between systems

In other words:
the same kinds of improvements that make human engineering teams work better.
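
A hedged sketch of what "explicit breakout conditions" can look like in practice: every stage declares when it is allowed to stop, so an agent cannot sit in monitoring mode indefinitely. The guard values and field names below are made up for illustration.

```typescript
// Illustrative breakout guard: each stage gets a budget and an explicit
// exit condition so ambiguity cannot keep an agent looping.
interface StageState {
  cycles: number;
  newIssuesFound: number;
  openDelegations: number;
}

interface BreakoutGuard {
  maxCycles: number;                        // hard ceiling on iterations for the stage
  exitWhen: (state: StageState) => boolean; // explicit condition that ends the stage
  onBreakout: "escalate-to-human" | "mark-unresolved";
}

const monitoringGuard: BreakoutGuard = {
  maxCycles: 3,
  // Leave monitoring as soon as there is either something to delegate
  // or nothing left to watch.
  exitWhen: (s) => s.newIssuesFound > 0 || s.openDelegations === 0,
  onBreakout: "escalate-to-human",
};

function shouldLeaveStage(guard: BreakoutGuard, state: StageState): boolean {
  return state.cycles >= guard.maxCycles || guard.exitWhen(state);
}
```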

The Real Value

The most useful AI workflows probably are not the flashy ones.

They are the ones that quietly reduce operational overhead, improve visibility, enforce accountability, and help small teams operate beyond their size.

Large companies can throw QA teams, support teams, and operational engineers at problems.

Solo developers and small teams usually cannot.

For them, unnoticed production issues hurt more.
Context switching hurts more.
Operational overhead hurts more.

This type of workflow creates something surprisingly valuable:
a lightweight operational engineering layer that continuously investigates issues instead of waiting for someone to eventually notice them.

The human is still there.
The accountability is still there.
The review process is still there.

But the ignored middle starts disappearing.

And honestly, that feels far more useful than most AI demos I’ve seen online.

Next time: Why AI Workflows Fail Without Clear Ownership and Resolution States
