sumit2401

Posted on May 20 • Originally published at stacknovahq.com

I Used AI for Code Review on a Production ERP for 6 Months. Here's Where It Actually Failed Me.

#ai #vibecoding #webdev #react

Six months ago I started running every non-trivial piece of code through AI before it shipped. Not prototypes — real ERP and CRM modules with paying clients on the other end. Batch processing tables, MRP allocation logic, dynamic invoice builders, real-time dashboards.

Here's what I found out.

What AI is actually good at

Duplicate functions. In a codebase that's been touched by multiple devs over months, this is AI's most reliable win. It flagged things like: "This mirrors formatCurrency in utils/formatters.ts" — the kind of thing that slips through human review because everyone assumes someone else already checked.

Calculation bugs in self-contained utilities. Off-by-one errors, wrong operator precedence, bad unit conversions — if the function is isolated and doesn't depend on external state, AI catches these consistently. In ERP systems, a broken landed cost formula or tax calculation is instant client trust damage. AI has saved me here more than once.
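As an illustration of the kind of isolated bug meant here, the sketch below shows a hypothetical landed cost helper with an operator precedence mistake next to a corrected version. The function names, the integer-cents convention, and the rule that duty applies to goods plus freight are all assumptions for the example, not taken from the actual ERP code.

```typescript
// Hypothetical helper: all amounts are integer cents, dutyRate is a fraction (0.25 = 25%).
// Assumption for this sketch: duty applies to goods value plus freight.

// Buggy: multiplication binds tighter than addition, so duty hits freight only.
function landedCostBuggy(
  unitCents: number,
  qty: number,
  freightCents: number,
  dutyRate: number,
): number {
  return Math.round(unitCents * qty + freightCents * (1 + dutyRate));
}

// Fixed: parentheses make the dutiable base explicit.
function landedCost(
  unitCents: number,
  qty: number,
  freightCents: number,
  dutyRate: number,
): number {
  return Math.round((unitCents * qty + freightCents) * (1 + dutyRate));
}
```

Because both functions are pure and take plain numbers, a reviewer (human or AI) can check them without knowing anything about the surrounding state.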

Direct React state mutations. Pushing directly to an array, mutating nested objects without spreading — AI flags these reliably in simple components. Not groundbreaking, but useful as a first pass before compilation.
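A minimal sketch of the pattern, written as plain functions so it runs without React. The `BatchRow` shape and both function names are hypothetical. The point is that React compares state by reference, so an update that mutates the existing array in place can be missed.

```typescript
// Hypothetical row shape for a batch item table; not from the original codebase.
interface BatchRow {
  id: string;
  qty: number;
  tags: string[];
}

// Mutating version: changes the existing row in place and returns the same array.
// Passing this result to a state setter hands React the same reference it already has.
function addTagMutating(rows: BatchRow[], id: string, tag: string): BatchRow[] {
  const row = rows.find((r) => r.id === id);
  if (row) row.tags.push(tag);
  return rows;
}

// Immutable version: returns a new array, with a new object and a new tags array
// for the changed row only. Untouched rows keep their original references.
function addTagImmutable(
  rows: readonly BatchRow[],
  id: string,
  tag: string,
): BatchRow[] {
  return rows.map((r) => (r.id === id ? { ...r, tags: [...r.tags, tag] } : r));
}
```

In a component, the immutable version would typically be used as `setRows((prev) => addTagImmutable(prev, id, tag))`.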

Where it completely fell apart

Race conditions. This was my most painful production bug in the last 6 months.

Our batch item table allows rapid concurrent mutations: a user triggers cascading calculations, multiple async state updates overlap, and one override silently wins. A classic race condition.

I ran the component through every AI tool I was using at the time. Zero flags. Not even a "this might be worth checking." The bug only surfaced when a client hit a specific sequence of interactions in rapid succession.

These tools read the code statically. They don't execute it, and they can't simulate erratic user interaction timing. A race that only appears under a specific timing sequence was invisible to every one of them.
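For context, one common guard against this class of bug is a "latest request wins" check, sketched below. This is not necessarily the fix used in the project. `LatestOnly`, `recalcAndCommit`, `recalcTotals`, and `commit` are hypothetical names for illustration.

```typescript
// Tracks the most recent run so results from older, slower runs can be dropped.
class LatestOnly {
  private current = 0;

  // Call when starting an async calculation; returns a ticket for that run.
  begin(): number {
    this.current += 1;
    return this.current;
  }

  // Call when the calculation resolves; only the newest ticket may commit.
  isCurrent(ticket: number): boolean {
    return ticket === this.current;
  }
}

// Hypothetical usage: recalcTotals stands in for any async cascading calculation,
// and commit stands in for the state update that applies its result.
async function recalcAndCommit(
  guard: LatestOnly,
  recalcTotals: () => Promise<number>,
  commit: (total: number) => void,
): Promise<boolean> {
  const ticket = guard.begin();
  const total = await recalcTotals();
  if (!guard.isCurrent(ticket)) return false; // a newer edit superseded this run
  commit(total);
  return true;
}
```

The guard makes the overwrite explicit instead of silent, but it still has to be exercised with real interaction timing to confirm it covers every update path.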

Web Worker memory leaks. I implemented workers for heavy client-side calculations. Suspicious of potential leaks from rapid event-driven spawning, I tasked my entire AI stack with auditing the cleanup patterns.

Every tool gave a confident green pass.

My manual browser profiling found that workers weren't reliably terminating when specific runtime exceptions were thrown. Ghost workers stayed alive. The AI verified that the cleanup code existed. It could not verify that the cleanup actually ran on every execution path.
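A minimal sketch of one way to make termination hold on every exit path is to wrap each worker's lifetime in `try`/`finally`. The helper name `withWorker` and the `Terminable` interface are assumptions for illustration; the browser's `Worker` satisfies the interface because it has a `terminate()` method.

```typescript
// Minimal surface of a worker that this sketch needs.
interface Terminable {
  terminate(): void;
}

// Spawns a worker, runs the job, and terminates the worker whether the job
// resolves, rejects, or throws synchronously.
async function withWorker<W extends Terminable, T>(
  spawn: () => W,
  job: (worker: W) => Promise<T>,
): Promise<T> {
  const worker = spawn();
  try {
    return await job(worker);
  } finally {
    worker.terminate();
  }
}
```

In a browser this might be called as `withWorker(() => new Worker(url), (w) => runCalculation(w))`, where `url` and `runCalculation` are placeholders. It only covers workers created through the helper, so profiling is still what confirms that no other path spawns one.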

CSS layout bugs on custom components. Built a proprietary data grid from scratch to hit strict UX requirements. Ran into padding misalignments and layout collapse under certain data payloads.

AI was almost useless. I'd describe the visual bug, it would claim to fix it, the rendered output would still be broken. Without a real rendering environment, it's guessing at cascade behavior.

The thing nobody talks about: chunking

This was my biggest operational discovery.

I was building an MRP allocation table — quantities, supply constraints, fulfillment priorities cascading across hundreds of rows. I fed the full spec into the AI in a single pass.

Every tool failed. Not "slightly off" — confidently wrong logic that looked correct until you traced execution. Broken dependencies, misallocated quantities, state updates that would silently fail on specific edge cases.

Then I split it:

One prompt — just the core allocation math
One prompt — data validation constraints only
One prompt — the immutable React state update pattern
Final prompt — audit each module against the others

Every piece came back clean. The assembled system worked perfectly in production.
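To make the split concrete, here is a rough sketch of what the first two chunks could look like as separate pure functions. The real MRP spec isn't reproduced in this post, so the `Demand` and `Allocation` shapes, the greedy priority rule, and the validation rules are all assumptions for the example.

```typescript
// Hypothetical shapes; the original MRP spec is not shown in the post.
interface Demand {
  id: string;
  qty: number;
  priority: number; // assumption: lower number is fulfilled first
}

interface Allocation {
  id: string;
  allocated: number;
  shortfall: number;
}

// Chunk 1: allocation math only. No validation, no React state.
// Greedy rule (an assumption): fill demands in priority order until supply runs out.
function allocate(demands: readonly Demand[], supply: number): Allocation[] {
  let remaining = supply;
  return [...demands]
    .sort((a, b) => a.priority - b.priority)
    .map((d) => {
      const allocated = Math.min(d.qty, remaining);
      remaining -= allocated;
      return { id: d.id, allocated, shortfall: d.qty - allocated };
    });
}

// Chunk 2: validation only. Returns a list of problems instead of throwing.
function validate(demands: readonly Demand[], supply: number): string[] {
  const errors: string[] = [];
  if (!Number.isFinite(supply) || supply < 0) {
    errors.push("supply must be a non-negative number");
  }
  for (const d of demands) {
    if (!Number.isFinite(d.qty) || d.qty < 0) {
      errors.push(`demand ${d.id}: qty must be a non-negative number`);
    }
  }
  return errors;
}
```

Each function is small enough to describe in a sentence, which is what makes it reviewable in a single prompt. The state update chunk then only has to call `validate` and `allocate` and store the result immutably.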

The mental model shift: in my experience, AI output degrades as logical dependencies compound, not simply as the prompt gets longer. An ERP module has overlapping validation paths, tax calculations, and database state requirements. Feeding all of it in at once gives the model more interacting constraints than it can keep consistent.

If you can't explain the task scope to a senior dev in 2 minutes out loud, chunk it before it touches an LLM.

The confidence problem

This is the part that actually concerns me.

When my Web Worker had a memory leak, no tool said: "This syntax looks valid, but I can't verify runtime cleanup behavior — profile this manually." They said it looked fine. Clean pass. Ship it.

The framing I've settled on:

An AI bug flag is high-signal — act on it immediately.
An AI clean pass is weak evidence — it means no obvious patterns matched, not that the code is safe.

The deadline trap is real. When you're pushing to meet a sprint, "the AI said it's fine" becomes a substitute for actual testing. That's where production regressions come from.

My actual tool stack after 6 months

Codeium — inline autocomplete for single-file work. Hits a ceiling fast on cross-file reasoning.
ChatGPT — useful for isolated logic audits, but snippet output creates integration friction at scale.
Cursor (Claude Sonnet) — my go-to for refactoring. Full-file editing context is genuinely better than any chat interface.
Google Gemini Code Assist — primary daily driver. Large context window, cost-efficient for heavy ERP work.
Claude direct — architectural audits and high-level logic design. Most willing to flag issues with caveats instead of blind passes.
OpenAI Codex App — pre-PR repository-wide audits in sandboxed worktrees. Runs parallel agents without touching my local environment.

Each tool serves a distinct step. Switching between them isn't tool-hopping — it's using the right instrument for the job.

Quick reference

Trust AI for:

Utility functions under ~50 lines with clear I/O
Catching duplicate/redundant modules across large repos
Refactoring known-safe logic into cleaner patterns
Boilerplate React hooks, contexts, basic reducers

Verify manually:

Multi-tiered stateful components with rapid user interactions
Web Workers, Sockets, async lifecycle management
Custom visual components (non-library)
Core financial/billing calculations
Anything you can't explain to a teammate in 2 minutes

Chunk before sending:

Features spanning more than 2 interacting subsystems
Full ERP modules with cascading state (MRP, inventory allocation)
Any prompt that needs multiple paragraphs just to define the constraints

The engineers getting the most out of AI aren't the ones who trust it blindly. They're the ones who know exactly where that trust stops.

Full breakdown with the complete MRP case study, tool-by-tool notes, and a practical checklist here → stacknovahq.com
