
Benoit Doyon for Flare


From Vibe-Coded Seed to Agent-Assisted Engineering: Building a Feature at Flare

A story about a banner, a Slack thread, and the two days between a vibe-coded prototype and a feature worth shipping.


The banner problem

A week ago Thursday, an AWS hiccup rippled into Flare. Customer-facing teams wanted a banner in the app, a calm sentence at the top of the screen letting users know we knew, and that we were on it.

We have a tool for this kind of thing. It's an in-app messaging product, the sort that lets non-engineers paint UX on top of a deployed app, and we use it well for the work it was built for: onboarding flows, feature announcements, product education. Improvised emergency banners are not that work.

The runbook agrees, sort of. The bulk of incident response is quick technical steps: declare the war room, announce, start the postmortem doc, mitigate. Then the communications part lands you at one item that links out to its own multi-page wiki article. An eight-step procedure in another product, with a screenshot for every step, ending with a reminder to disable the guide when the incident is over. The people with the right access weren't on the incident. We got the banner up anyway. It was not fast and it was not fun.

At Flare, working with Claude is just how engineers work. We use it for everything from one-line tweaks to multi-file refactors. That morning I picked the loosest possible mode of working with it: I vibe-coded the banner. Ten minutes, two merge requests open, a working banner on screen. I closed my laptop satisfied.

That banner never shipped.

This is the story of why, and of what showed up in its place. It's also a story about how the gap between vibe-coded experiments and production engineering is bigger than it looks, and how the right tooling makes that gap a lot cheaper to cross.


The seed

Here's what I built that morning. I have a coding agent in my editor. I described the problem out loud, something like "I want a banner at the top of the app whose text I can change from a checkbox in our admin tool," and we banged something out together.

We already had a feature-flag system. Feature flags have names and descriptions. The frontend already polled /me/profile for the current user's flags. I asked Claude to add the description field to the response, then to render a banner whenever a specific flag (outage_banner) was on, using its description as the message.
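The vibe-coded version amounted to very little code. A minimal sketch of the frontend half, assuming a profile response shape and field names that are my own invention (only `/me/profile` and `outage_banner` come from the actual change):

```typescript
// Hypothetical shapes; the real /me/profile response differs.
interface FeatureFlag {
  name: string;
  enabled: boolean;
  description: string; // the field the first MR newly exposed
}

interface Profile {
  featureFlags: FeatureFlag[];
}

// Returns the banner text to render, or null when no banner should show.
function outageBannerText(profile: Profile): string | null {
  const flag = profile.featureFlags.find((f) => f.name === "outage_banner");
  return flag && flag.enabled ? flag.description : null;
}
```

The whole trick was reusing machinery that already existed: the flag toggles the banner on, and the flag's free-text description doubles as the message. That reuse is also exactly where the design went wrong, as the next section shows.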

Two MRs, about 140 lines, working banner. Total elapsed time: about ten minutes.

This is what people mean by vibe coding. You hold the goal in your head, narrate the shape of it, and let the agent fill in the keystrokes. It is genuinely thrilling, especially for someone whose job description has lately involved more meetings than commits. Internally at Flare, a number of folks have started doing this with our newly-built experimentation infrastructure, and they're not wrong to be excited. Vibe coding is good at what it's good at: getting a real thing in front of a real person, fast.

The catch is what happens next.


The Slack thread

I posted the MR in our backend channel for a sanity check. The reply was civil and immediate. Over the next thirty minutes, it unwound the design completely.

The first thread: I was changing a public API. I argued, reasonably I thought, that customers never look at which feature flags they have activated. Alex agreed I could probably break the schema. Then he said:

However this is also a problem because this will leak internal descriptions to customers.

Our feature-flag descriptions, it turns out, contain employee names. It's a process thing: when you create a flag, you put your name on it. Mine, Xavier's, Alex's, everyone's. The cute hack of "just expose the description field" was, on closer inspection, a small employee-roster leak attached to every authenticated session.

The second thread: Xavier suggested side-stepping the whole thing: ping our public status page, show a generic message if it's not green. I'd been about to agree when Alex caught a different problem:

I don't think poking an external domain (status.flare.io) is a good idea in our app, this effectively leaks all our customers to Cronitor.

Our status page is hosted by a third party. Every customer browser pinging it from inside our app would have handed that third party a continuous record of who our customers are.

Two well-intentioned designs, two privacy issues neither of us would have caught alone.

The third thread, the one that actually changed the feature: Alex remembered something customer success had been asking about for a while. A per-customer banner. A way for a rep to put account-specific messaging into the app without coming to engineering each time. Not for incidents. For everyday communication with specific customers.

Suddenly the question wasn't "how do I show an outage message." It was: what shape does a system take that lets a customer-facing rep paint a top-of-app banner globally, per organization, or per tenant, without involving an engineer?

That feature already had a stakeholder, a justification, and a real reason to exist in a database table. The outage banner was just one case of it.

By the old rules, this is scope creep: don't build two features when you came to build one. But with an agent doing most of the typing, the cost of an extra table and an extra route is mostly the cost of deciding it should exist. The bottleneck has moved. The expensive part is figuring out the right shape, not writing the code. Bundling the outage banner with the per-customer ask wasn't more work, it was less work. The alternative was shipping a half-system now and reconciling it with a second half-system later.

I closed my laptop again. Not satisfied this time. Humbled by how much my ten-minute banner had been missing, but increasingly proud of what was taking shape in its place. This was no longer a quick hack.


The bloom

What happened next is what I want to show people, especially the colleagues who've been (justifiably) impatient with how long engineering takes once a vibe-coded prototype exists. Half the answer is in the section you just read. Engineers iterate on ideas before they iterate on code, and a working prototype hides that work without replacing it. Mine looked done. It had a security flaw, a privacy leak, and was missing an entire feature. None of those would have surfaced without other engineers pushing back on the design. The conversation was already engineering. The rest of this story, the specs and the MR stacks and the review, is how you ship it.

Step one: write the spec before writing the code

Before I touched a backend file, I wrote a spec. Not a slide deck, not a Confluence page. A Markdown file in our repo under docs/specs/, dated and titled, written with Claude in the same conversation where it had access to the codebase. It covered:

  • Data model: a new app_banners table with nullable organization_id and tenant_id. A check constraint preventing both from being set at once. Partial unique indexes enforcing one banner per scope.
  • Resolution rules: precedence is global, then tenant, then org, then none. A signed-in user sees at most one banner, and the resolver picks deterministically.
  • API surface: one public endpoint (GET /banner/current) that reads scope from the session and returns the resolved banner. One admin endpoint group for reps to set and clear them.
  • Frontend: where the banner mounts in the customer app, and where reps edit it inside our internal admin tool.
  • MR breakdown: how this was going to be cut into independently mergeable pieces.
  • Explicitly out of scope: scheduling, multiple concurrent banners. Listed by name, so they couldn't sneak in during review.
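The resolution rule is the heart of the spec, so it's worth seeing concretely. A sketch of the resolver, with invented names (`Banner`, `resolveBanner`, the field names) standing in for whatever the real implementation calls them; the scope encoding mirrors the table design, where a row with both IDs null is global and the check constraint forbids setting both:

```typescript
// Hypothetical row shape mirroring the app_banners table.
interface Banner {
  message: string;
  organizationId: string | null; // set => org-scoped
  tenantId: string | null;       // set => tenant-scoped; never both set
}

interface SessionScope {
  organizationId: string;
  tenantId: string;
}

// Precedence from the spec: global, then tenant, then org, then none.
// A signed-in user sees at most one banner.
function resolveBanner(banners: Banner[], scope: SessionScope): Banner | null {
  const global = banners.find((b) => b.organizationId === null && b.tenantId === null);
  if (global) return global;
  const tenant = banners.find((b) => b.tenantId === scope.tenantId);
  if (tenant) return tenant;
  const org = banners.find((b) => b.organizationId === scope.organizationId);
  return org ?? null;
}
```

Because the partial unique indexes guarantee at most one banner per scope, each `find` here can match at most one row, which is what makes the resolver deterministic.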

This spec took about thirty minutes to draft. It made every subsequent decision faster, and it gave reviewers a single object to argue with before any code was written. The conversation in Slack had surfaced the shape of the problem. The spec turned that shape into something committable.

Step two: stack the MRs

The spec called for four MRs:

  1. Database table, model, store, and the read endpoint the customer app fetches
  2. Admin endpoints to set and clear banners
  3. Frontend that renders the banner in the customer app
  4. Frontend for the internal admin tool where staff set the banners

Why four? Because each one needed to be a coherent, independently mergeable unit of value: small enough that a reviewer could reason about its safety in isolation, large enough that merging it actually delivered something. Stacked MRs (where each branches off the previous) give you that property. Every commit on the stack is a self-contained change, and each branch can be approved and merged as soon as the one below it lands.

This is where the agent earns its keep. Reading the spec, branching cleanly off the previous MR, writing the migrations, models, stores, resolvers, routes, tests, and frontend composables: that's hours of typing. Mechanical, but the kind of mechanical that gets sloppy when humans do it. The agent is patient and consistent in a way I genuinely am not after lunch.

What it isn't is the architect. The four-tier scope hierarchy, the precedence rule, the decision to keep the outage and per-org systems as one table rather than two: those came from the Slack conversation and from sitting with the problem for a while. Claude wrote the code that expressed those decisions. It did not make them.

Step three: let review do its job

The first round of review found that the admin endpoints had ballooned. The original spec called for six mutation endpoints (PUT /global, PUT /org, PUT /tenant, DELETE /global, DELETE /org, DELETE /tenant), each with their own handler, their own validation, their own audit-event payload. The reviewer's suggestion: collapse it to one PUT and one DELETE, with the scope expressed in the body and query string respectively. Half the surface area, fewer code paths, easier to test.
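The "difference is data, not behavior" point becomes obvious once the scope is a value in the request. A sketch of the collapsed shape, with a payload format and helper names that are assumptions, not the real API:

```typescript
// Hypothetical request payload for the single PUT endpoint.
type Scope =
  | { kind: "global" }
  | { kind: "org"; organizationId: string }
  | { kind: "tenant"; tenantId: string };

interface SetBannerRequest {
  scope: Scope;
  message: string;
}

// One code path validates every scope; what used to be six handlers
// becomes a switch over a field in the body.
function toBannerRow(req: SetBannerRequest) {
  if (!req.message.trim()) throw new Error("message must not be empty");
  switch (req.scope.kind) {
    case "global":
      return { message: req.message, organizationId: null, tenantId: null };
    case "org":
      return { message: req.message, organizationId: req.scope.organizationId, tenantId: null };
    case "tenant":
      return { message: req.message, organizationId: null, tenantId: req.scope.tenantId };
  }
}
```

One handler also means one validation path and one audit-event payload, which is where most of the halved surface area actually comes from.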

Another round of review stripped out comments. The agent had, like a slightly anxious junior engineer, sprinkled the diff with brief explanations of what each function did. The reviewer's note was simple: if removing the comment wouldn't confuse a future reader, don't write it. Down they came.

By the end, the feature was smaller than the spec it started from, and substantially smaller than a naïve translation of the spec into code. Each round of review made the implementation more focused, not just shorter. Fewer code paths, fewer special cases, fewer places a future reader has to look. That matters more now than it ever has. The next person to touch this code might be Claude, and a tight surface is one the agent can hold in context without dropping invariants. Code that humans can maintain easily turns out to be code that LLMs can maintain easily. None of those simplifications would have happened in a vibe-coding session, because there was no second pair of eyes. Review wasn't a checkbox. It was where the implementation got focused enough to be maintainable by whoever, or whatever, reads it next.


What this means for the rest of us

If you have been vibe coding experiments at Flare, please keep doing that. We built the infrastructure for it on purpose: custom tooling and workflows that provision a sandboxed machine, wire it into the data and systems an experiment needs, and hand you a place to try an idea without filing a ticket. The point of that investment was exactly to make experimentation cheap. Most of what gets built there will never need to become a production feature, and that's fine. Its job was to teach us something we couldn't have learned by talking about it.

But the moment a vibe-coded thing starts to feel like it might ship, you discover the gap between prototype and production. In our case the gap was about two days. Small enough to cross routinely, big enough to matter. What was in it:

  • The colleague who points out that the schema you thought was internal (say, a list of feature flags) is actually a roster of every engineer who has ever worked on the product.
  • The realization that a "harmless ping" to a third party reveals who your customers are, just by virtue of who's pinging it.
  • The product ask you didn't know existed: the one that turns your one-off outage banner into a feature another team has been waiting on for months.
  • The reviewer who notices that your six near-identical endpoints want to be two, because the difference between them is data, not behavior.
  • The reviewer who deletes the comments your agent added on autopilot, because they explain what the code does instead of why.

Engineering's job isn't to slow down the vibe-coded prototype. It's to take the prototype seriously, to hear it as a signal that there's a real problem worth solving, and to bring discipline to the design so that what we ship is something we'd be happy to point at six months from now. Agent-assisted engineering makes this dramatically cheaper. The agent does most of the typing, the spec carries the design pressure, and the MR stack carries the review pressure, leaving humans to do the part humans are good at: notice the things that aren't in the code yet.

What shipped, in the end, is a four-MR stack with tests, audit events, and per-org scoping. A customer success manager can now put a message in the app for one specific customer. A renewal reminder, a "heads up, we're migrating your tenant on Saturday," a "check out this new feature, it's right up your alley" nudge.

They did not get this in ten minutes. They got it in two days. Both numbers, I think, are correct.
