On March 5th, Amazon's North American marketplace saw a 99% drop in orders. 6.3 million orders gone. Not because of a hack. Not because of a natural disaster. Because someone pushed a production change without documentation or approval.
Three days earlier, on March 2nd, their AI coding assistant Q gave an engineer inaccurate advice pulled from an outdated internal wiki. That change went live. 120,000 lost orders. 1.6 million website errors. Globally.
Amazon's SVP of e-commerce stood up and said what everyone was thinking: "The availability of the site and related infrastructure has not been good recently."
Understatement of the quarter.
They've now initiated a 90-day "code safety reset" across 335 Tier-1 systems. Senior engineers are now required to approve AI-assisted code changes before deployment. Directors and VPs have been instructed to audit all production code change activities within their organizations. Mandatory deep-dive meetings. Stricter documentation. "Controlled friction."
In other words: the largest e-commerce company on the planet, with some of the best engineers alive, just discovered they need governance.
The Word Nobody Wants to Say
Governance has a bad reputation. It sounds like bureaucracy. It sounds like the word your VP uses to justify a three-week approval chain on a CSS change.
But governance isn't bureaucracy. Governance is a system knowing what it's allowed to do, who's allowed to do it, and having a record of everything that happened.
Amazon didn't have that for some of its most critical paths. A single authorized operator could execute a high-impact configuration change without a second pair of eyes. AI-assisted changes were going straight to production without guardrails. Internal documents admitted that generative AI tooling "accelerated exposure of sharp edges and places where guardrails do not exist."
The code wasn't the problem. The absence of structure around the code was the problem.
Executives Are Now Doing What the Architecture Should Have Done
Here's the thing that should concern every engineer reading this: Amazon's response to the crisis was to put humans in the loop. Senior engineers now review AI-generated code. Directors audit deploys. Mandatory meetings happen weekly.
This works. In the short term. But it doesn't scale.
A senior engineer reviewing code is not a governance layer. It's a bottleneck with institutional knowledge. When that engineer is sick, on vacation, or just burnt out from reviewing 47 AI-generated PRs before lunch, the system breaks again.
The real fix isn't "make a person approve it." The real fix is building systems where the rules are enforced structurally, where state is tracked centrally, where changes are auditable by default, and where no single human decision can cascade into a 6.3-million-order catastrophe.
That's what a governance layer does. And I'll be honest, this is exactly why I've been calling Xano a governance layer and not just a "backend."
What a Governance Layer Actually Looks Like
I've written about this before when I talked about building ChatClipThat, but the Amazon situation makes the argument way more concrete.
A governance layer does five things:
1. Centralized State
Every service, every worker, every node talks to one source of truth. Not a local database. Not an in-memory cache. One centralized place where state lives. When my GPU render node finishes a clip, it doesn't write to a local file. It PATCHes a record in Xano. If the node dies, the state survives.
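The render-node pattern above can be sketched in a few lines. This is a toy model, not Xano's actual API: `CentralStore` stands in for the shared table, and the job IDs and field names are illustrative.

```python
# Minimal sketch of the centralized-state pattern: worker nodes never hold
# authoritative state locally; they PATCH a record in one shared store.
# "CentralStore" is a stand-in for the Xano table, not a real client.

class CentralStore:
    """Single source of truth for job records."""

    def __init__(self):
        self._records = {}

    def create(self, job_id, **fields):
        self._records[job_id] = {"id": job_id, **fields}

    def patch(self, job_id, **changes):
        # Partial update: only the supplied fields change, like HTTP PATCH.
        self._records[job_id].update(changes)

    def get(self, job_id):
        return dict(self._records[job_id])


def render_clip(store, job_id):
    # The node reports progress to the store instead of a local file.
    store.patch(job_id, status="rendering")
    # ... GPU work happens here ...
    store.patch(job_id, status="done", output_url="s3://clips/abc.mp4")


store = CentralStore()
store.create("job-42", status="queued")
render_clip(store, "job-42")

# Even if the node process dies right here, the state survives in the store.
print(store.get("job-42")["status"])  # → done
```

The point isn't the implementation; it's that the worker owns no state. Kill the node mid-job and the record still says exactly how far it got.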
Amazon's problem wasn't just bad code. It was that a change to one system cascaded into everything because nothing was mediating the state between services.
2. Enforced Rules at the Platform Level
Input validation. Authentication. Role-based access. Rate limiting. These shouldn't be things you remember to implement. They should be things the platform gives you by default.
In Xano, every API endpoint has built-in input validation and authentication. You don't "forget" to add auth to a new endpoint because the platform asks you to configure it during creation. You don't push a field update without it immediately being reflected in the API contract and the Swagger docs.
Compare that to Amazon, where an engineer followed bad advice from an AI tool and nothing between them and production said "wait, are you sure?"
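Here's a rough sketch of what "enforced at the platform level" means, as opposed to "remembered per endpoint." The framework names (`endpoint`, `handle`, the token check) are hypothetical, not Xano's actual interface; the idea is that registration itself demands an auth policy and an input schema.

```python
# Sketch: a route cannot exist without declaring auth and a schema up front.
# All names here are illustrative, not a real framework's API.

class ValidationError(Exception): ...
class AuthError(Exception): ...

ROUTES = {}

def endpoint(path, auth_required, schema):
    """Registering a route forces you to declare auth and input schema."""
    def register(fn):
        ROUTES[path] = {"fn": fn, "auth": auth_required, "schema": schema}
        return fn
    return register

def handle(path, payload, token=None):
    route = ROUTES[path]
    if route["auth"] and token != "valid-token":  # stand-in auth check
        raise AuthError(path)
    for field, ftype in route["schema"].items():  # platform-side validation
        if not isinstance(payload.get(field), ftype):
            raise ValidationError(field)
    return route["fn"](payload)

@endpoint("/orders", auth_required=True, schema={"sku": str, "qty": int})
def create_order(payload):
    # By the time this runs, auth and validation have already passed.
    return {"ok": True, "sku": payload["sku"]}

# A malformed payload never reaches the handler:
try:
    handle("/orders", {"sku": "A1", "qty": "three"}, token="valid-token")
except ValidationError as err:
    print("rejected:", err)
```

The design choice worth noticing: the guardrail lives in `handle`, not in `create_order`. You can't "forget" it, because forgetting isn't an available move.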
3. Auditable Changes
Every modification to a table, endpoint, or logic flow in Xano is versioned. You can see what changed, when, and roll back if needed. You don't need a VP to "audit all production code change activities" because the platform maintains that history structurally.
When Amazon says they're now requiring "more extensive documentation for critical code changes," they're admitting this didn't exist before, at least not in a way that caught these issues. A governance layer makes documentation a byproduct of working, not a separate chore you assign to exhausted engineers.
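What "auditable by default" looks like, reduced to a toy: every change appends a new version rather than overwriting the old one, and rollback is itself a recorded change. This is an illustrative model of the concept, not how Xano implements its versioning internally.

```python
# Sketch of an append-only version history: "what changed, when, and by whom"
# is a query against the data, not a document someone has to remember to write.

import time

class VersionedConfig:
    def __init__(self, initial):
        self.history = [
            {"version": 1, "state": dict(initial), "author": "init", "ts": time.time()}
        ]

    @property
    def current(self):
        return self.history[-1]["state"]

    def change(self, author, **updates):
        new_state = {**self.current, **updates}
        self.history.append({
            "version": len(self.history) + 1,
            "state": new_state,
            "author": author,
            "ts": time.time(),
        })

    def rollback(self, version):
        # Rolling back appends the old state as a *new* version, so the
        # audit trail itself is never rewritten or lost.
        old = self.history[version - 1]["state"]
        self.change(author=f"rollback-to-v{version}", **old)


cfg = VersionedConfig({"rate_limit": 100})
cfg.change("alice", rate_limit=10)  # a risky change goes in...
cfg.rollback(1)                     # ...and is undone without erasing history

print(cfg.current["rate_limit"])  # → 100
print(len(cfg.history))           # → 3
```

Notice that after the rollback there are three versions, not one. The mistake and its correction are both on the record, which is exactly what a VP-led audit is trying to reconstruct after the fact.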
4. Separation of Code and Configuration
One of the most undervalued aspects of Xano is that the backend isn't a repo. It's a workspace. You configure logic, data models, and endpoints through a structured interface. That means you can't accidentally rm -rf your way into a production incident. The blast radius of a mistake is inherently smaller because the platform constrains what a change can touch.
You can, and I do, use GitHub to sync my workspace into my private repo. That's totally doable. But it's also not quite the point.
Amazon's March 5th outage happened because "a production change bypassed formal documentation and approval processes." In a governed system, there is no "bypassing." The process is the system.
5. The Visual Layer Is the Validation Layer
This is the one people overlook. In a traditional codebase, logic lives in text files. You validate it by reading it. Line by line. Hoping you catch what the last person (or the last AI) changed across a 400-line diff.
In Xano, the logic is visual. You're looking at function stacks, conditional branches, loops, and data transformations rendered as structured flows. You don't read the logic. You see it. And that changes everything about how mistakes get caught.
When logic is visual, it becomes inherently reviewable. A product manager can look at a flow and ask "why does this branch exist?" A new engineer can trace the path a request takes without deciphering someone else's variable naming conventions. The visual interface isn't just a convenience feature. It's a validation mechanism. It forces the logic to be legible by default.
And critically, the visual representation is the documentation. There's no separate wiki to maintain. No outdated internal doc that contradicts what the code actually does (which is exactly what tripped up Amazon's AI assistant). The thing you build in is the thing you review. The interface is the truth.
For governance, this matters more than people think. When you can see every step of a logic flow at a glance, the blast radius of any change is immediately apparent. You don't need a senior engineer to mentally compile 200 lines of Python to understand what a change touches. You look at the flow and it's right there.
The AI Problem is a Governance Problem
Here's where this gets interesting for anyone using AI coding assistants (which is basically everyone in 2026).
Amazon's internal documents noted that generative AI accelerated the exposure of gaps in their safety infrastructure. AI didn't create the gaps. AI just found them faster. And when it found them, there was nothing to stop the damage.
This is the thing nobody talks about with AI-assisted development: the code an AI writes is often syntactically correct, confidently written, and completely unaware of your system's context. It doesn't know about unwritten constraints. It doesn't know about that deprecated internal wiki that hasn't been updated since Q3 2024. It doesn't know that this specific config change affects 335 downstream systems.
Pull request sizes are growing. Incidents per PR are increasing. Change failure rates are climbing. The bottleneck isn't code generation anymore. It's review. And review doesn't scale when the governance is a person instead of a system.
This is why I keep talking about platforms that enforce structure. Not because I hate writing code, but because when AI is generating code at scale, the thing that protects you isn't the code review. It's the system that catches what the reviewer missed.
Who This Is Actually For
Amazon has 30,000+ engineers. They'll figure it out. They'll build internal tooling, add approval chains, and eventually make this a solved problem in their stack.
But most of us aren't Amazon. We're teams of 3, or 10, or we're solo. We don't have the luxury of 90-day resets, mandatory VP audits, or a dedicated platform reliability team.
And we're using the same AI coding tools. Generating the same volume of code. With the same risk of blindly deploying something that an AI confidently told us was correct.
The difference is whether your architecture has built-in guardrails or whether you're relying on someone to remember to check.
I use Xano because it gives me governance at the platform level. Input validation, auth, versioned logic, centralized state, structured endpoints. Not because I'm building at Amazon's scale. Because I'm building without Amazon's team, and I need the platform to do what their 335 Tier-1 system review process is now scrambling to do manually.
The Takeaway
Amazon didn't have a code quality problem. They had a governance problem. The code was fine until it hit production without anyone knowing what it touched.
The 90-day code safety reset is the right move. Requiring senior engineer review is the right move. But both of those are human solutions to a systems problem. And human solutions degrade when the humans get tired, distracted, or overwhelmed by the volume of AI-generated changes hitting their desks.
Build systems where governance is structural, not procedural. Where the rules are enforced by the platform, not by a person's memory. Where a bad change can't cascade because the architecture constrains its blast radius.
Or, you know, lose 6.3 million orders and then do it anyway.