Vibe coding with guardrails: on building software in an industry that's reinventing itself

#ai #architecture #softwareengineering #vibecoding

🎵 Writing soundtrack: Architects — Modern Misery

There's a conversation in engineering circles that people are having mostly in private. In Slack threads, in team offsites, in 1:1s with managers. It goes something like: "Are we going to be okay?"

Layoffs across the sector have been brutal and, in many cases, surgical. Junior positions are disappearing faster than they're being created. Entry-level roles, the ones that used to be the training ground for the next generation, are quietly vanishing from job boards. The question underneath all of it: how much of this is cyclical, and how much of it is structural?

I'll be honest: I've asked myself that too. Not in the abstract, but sitting in front of a codebase, watching an AI agent produce in twenty minutes something that would have taken me a full sprint. Impressed and a little unsettled at the same time, wondering what the job is supposed to look like in two years.

Then I started doing it properly. Specs in version control, decision records, automated gates, context that actually persists between sessions. The result caught me off guard. Better coverage. Cleaner architecture. And the design actually got more careful. That part I hadn't expected. Doing this well forces you to think before you build in a way that pure coding quietly lets you skip.

The fear in the sector is real. I just think we're pointing the finger at the wrong thing.

I've spent the last year integrating AI coding assistants into real, everyday work. Not production-level enterprise systems. Things that actually affect my day. A custom MCP that manages my daily schedule and keeps my agenda from becoming a dumpster fire. Another one that tracks food stock at home, monitors expiry dates, and suggests meals based on what's actually in the fridge. A traffic state monitor so I know whether to leave now or grab another coffee. And a job offer scraper with convenience scoring, because life is too short to manually triage LinkedIn.

None of these are impressive as individual artifacts. And honestly, most of them ship without guardrails. They work until they hit an edge case, and when they do, they usually fail silently. That's exactly what taught me the difference. The projects I'd seen fail in production weren't failing because of the model. They were failing because there was no scaffolding around it, and no observability either. GIGO applies whether the input comes from a human or an agent.

"Vibe coding" gets used as a pejorative. I think that's lazy. Generating code through AI was never the issue. Generating it with no guardrails, with nothing that separates a prototype from something you can maintain, audit, debug six months later, and hand off to on-call at 2am when things go sideways: that's the issue.

Spec-Driven Development

Probably the biggest workflow shift of the last eighteen months. Frameworks like GitHub's SpecKit, Amazon's Kiro, or OpenSpec treat the specification as the source of truth, not the code. You define what the system must do, and the agent handles the how. It sounds like waterfall. Hacker News called it exactly that in late 2025 and the thread is worth reading. The criticism isn't unfounded: natural language specs are ambiguous, brownfield projects break the tooling fast, and Kiro turns trivial bugs into twelve user stories nobody asked for. Scott Logic ran SpecKit on an existing hobby app and wrote it up as "a sea of markdown documents".

On greenfield the story changes. Someone shipped a full expense tracker in TDD mode with over 90% test coverage, tests written before a single line of implementation existed.

I ran a PoC trying to adapt SpecKit to an existing production codebase. The agent generated the full environment setup and silently dropped a critical configuration, one that wasn't documented anywhere in the constitution because nobody had ever needed to write it down. It was just how the project worked. I spent more time debugging that than I'd like to admit. The spec being machine-readable is what changes the economics: the agent uses what you write, nothing more. Whatever you leave out, it leaves out too.
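A silently dropped configuration value is the kind of gap a small mechanical check can surface before it costs a debugging session. The sketch below is illustrative only: it assumes the configuration lives in simple KEY=VALUE files, and the `parse_env` and `dropped_keys` helpers are hypothetical names, not part of SpecKit or any other tool mentioned here. It compares the keys in the existing setup against what the agent generated and reports anything that went missing.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def dropped_keys(current: str, generated: str) -> list[str]:
    """Keys present in the current config but missing from the generated one."""
    return sorted(parse_env(current).keys() - parse_env(generated).keys())
```

Run against the existing config and the agent's output, a non-empty result is a reason to stop and ask why a key disappeared. It does nothing for values that changed, and it cannot know which keys matter. It only makes the omission visible.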

Architecture Decision Records paired with fitness functions

The piece I see most consistently skipped. An ADR captures why a decision was made: the context, the tradeoffs, what got rejected. A fitness function turns that into an executable check on every commit. The point isn't just to remember decisions. It's to enforce them, to keep an architecture coherent and under control as the codebase evolves. InfoQ published a piece on architectural governance at AI speed that gets at this directly: when GenAI accelerates output, the bottleneck shifts from writing code to maintaining alignment, and the only way to scale that governance is to make it deterministic rather than dependent on human review. I skip ADRs on personal projects too. Consider this my open TODO.

Agents don't drift maliciously. They fill the gaps with whatever their training says is reasonable, and without documented constraints that reasoning runs unchecked. The catch: agents operate within the repository boundary. If your ADRs live in Confluence or a separate wiki, the agent can see your implementation but not your reasoning, and it will make its own decisions about the gaps. So ADRs live next to the code they affect, and the rules get encoded into linters and CI gates. One guides. The other holds.
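As a minimal sketch of what a fitness function can look like, assume a hypothetical ADR stating that code in a `domain` package must not import from an `infrastructure` package. The package names, the rule, and the `find_violations` helper are all illustrative assumptions, not taken from any real project. The check parses source with Python's standard `ast` module and reports every import that breaks the rule.

```python
import ast

# Hypothetical rule from an ADR: the "domain" layer must not depend on
# the "infrastructure" layer. Both package names are assumptions.
FORBIDDEN_PREFIXES = ("infrastructure",)


def find_violations(source: str, filename: str = "<memory>") -> list[str]:
    """Return one message per import that breaks the layering rule."""
    violations = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative ones are skipped.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in FORBIDDEN_PREFIXES):
                violations.append(f"{filename}:{node.lineno}: forbidden import of {name}")
    return violations
```

In a CI gate, this would run over every file under the protected package and fail the build on a non-empty result. The useful property is that the ADR and the check sit in the same repository, so an agent that violates the decision gets a deterministic failure instead of a review comment.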

Automated guardrails and context engineering

Pre-commit hooks, CI gates, SAST, dependency scanning. None of this is new. But the economics change when AI is generating code at this pace. Veracode ran 80 coding tasks across more than 100 LLMs in 2025 and found vulnerabilities in 45% of outputs. In Java it was over 70%. One developer making a mistake is one mistake. An agent making the same mistake has a blast radius that scales with every file it touches. Shifting left isn't process religion. It's the only way to keep that surface area under control.
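To make "gate" concrete, here is a deliberately small sketch of a pre-commit style check that scans the lines a change adds for obvious secrets. The three patterns and the `scan_added_lines` and `gate` names are illustrative assumptions. Real scanners ship far larger rule sets, and this is not a substitute for one.

```python
import re

# Illustrative patterns only; a real secret scanner covers many more cases.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def scan_added_lines(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, rule name) for each suspicious added line."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the change adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


def gate(diff_text: str) -> int:
    """Exit-code style result: 0 lets the commit through, 1 blocks it."""
    return 1 if scan_added_lines(diff_text) else 0
```

Fed the output of `git diff --cached` from a hook, a non-zero result would block the commit. The check is the same whether a person or an agent wrote the line, which is the point: it runs on every change, at whatever pace the changes arrive.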

There's a fourth thing I'd call context engineering: maintaining project context that persists between sessions in a form the agent can actually use. CLAUDE.md files, .cursorrules, agent configuration prompts. Spotify used Claude Code to run around 50 migrations and merge thousands of PRs across hundreds of repositories in 2025, and their own postmortem admitted prompts still "evolve by trial and error" without structured evaluation. ETH Zurich published research showing LLM-generated context files can hurt agent performance more than having none at all. The files that helped were human-written and limited to constraints the agent cannot infer on its own: custom build commands, specific tooling, non-obvious architectural decisions. Write your own CLAUDE.md. Don't generate it, and don't fill it with things the agent can already figure out from the codebase. That's where most of the drift actually starts.

The junior pipeline problem is harder to be optimistic about. The traditional path (write small things, break them, learn why, repeat) is getting compressed. Some companies are already asking whether they need entry-level engineers when a model generates boilerplate faster than a new hire.

I think that's shortsighted. The engineers who will matter in a few years are the ones who can look at AI output and know when something is subtly wrong, who have enough production experience to catch what slips through the test suite undetected. That judgment doesn't come from prompting. It comes from debugging things you didn't write, from systems that broke in ways you didn't expect, from time that doesn't compress.

None of that makes the transition easier for people going through it right now. The sector owes more honesty than it's currently offering.

The craft isn't disappearing. The boilerplate is. What stays is the thinking: what the system should actually do, what constraints matter, what tradeoffs you're making. And shipping things that hold up in prod instead of just generating them.

Different job. Worth doing well.