Tahseen Rahman

The Cost of Fake Tests: What I Learned Shipping a Chrome Extension

Last week, I shipped a Chrome extension with 292 passing tests. Every test was green. The CI pipeline was happy. My AI coding assistant reported "all tests passing ✅".

Then I actually loaded it in Chrome.

Seven bugs. Seven obvious, user-facing bugs that any manual test would have caught in 30 seconds. The extension didn't work. But according to the tests? Perfect.

This isn't a story about AI being bad at testing. This is a story about me being bad at verification. And what I learned about building products when you're moving fast.

The Setup: Building Rewardly

I'm building a Chrome extension called Rewardly. It tracks cashback offers on Shopify stores automatically. The tech stack is straightforward:

  • Manifest V3 Chrome extension
  • Content scripts for merchant pages
  • Background service worker
  • Popup UI

The extension needs to:

  1. Detect Shopify stores
  2. Show cashback offers in the popup
  3. Inject offer badges on product pages
  4. Track clicks for attribution

Pretty standard e-commerce extension stuff. I've built web apps before, but this was my first production Chrome extension.

The Testing Strategy (That Wasn't)

Here's what I did wrong: I delegated the entire build to an AI coding agent (Codex, running via Claude Code). I gave it the spec, it wrote the code, it wrote the tests, it reported success.

The tests were Node.js unit tests. They tested:

  • Data parsing logic ✅
  • State management ✅
  • API response handling ✅
  • Storage operations ✅

All legitimate things to test. All passing. All completely useless for catching the actual bugs.

Why? Because Chrome extensions run in multiple isolated JavaScript contexts:

  • Content scripts run in the page context
  • Service workers run in a background context
  • Popups run in their own context

Node.js tests can't test cross-context communication. They can't test DOM injection. They can't test chrome.runtime.sendMessage. They can't test the actual runtime behavior.
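To make the gap concrete, here's a sketch (function names and message shapes are my own, not from Rewardly's actual code) of the split between what a Node test can and can't reach:

```javascript
// Pure logic: trivially unit-testable in Node.
function buildOfferQuery(hostname) {
  return { type: 'GET_OFFERS', merchant: hostname.replace(/^www\./, '') };
}

// Runtime wiring: chrome.runtime only exists inside Chrome. A Node.js
// test never executes this path, so a wrong message format or a missing
// listener on the other side sails straight through a green test suite.
function requestOffers() {
  return new Promise((resolve, reject) => {
    chrome.runtime.sendMessage(buildOfferQuery(location.hostname), (reply) => {
      if (chrome.runtime.lastError) reject(chrome.runtime.lastError);
      else resolve(reply);
    });
  });
}
```

The first function is what my 292 tests covered. The second is where the bugs lived.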

I knew this. I've read the Chrome extension docs. But I accepted "all tests passing" as proof that it worked.

The Bugs (All Preventable)

When I finally loaded the extension in Chrome:

  1. Popup didn't open - Click the icon, nothing happens. (Cause: incorrect action.default_popup path in manifest)

  2. Content script not injecting - No offer badges on merchant pages. (Cause: wrong matches pattern in manifest)

  3. Service worker crash loop - Background script dying every 30 seconds. (Cause: unhandled promise rejection in message listener)

  4. Storage quota errors - Extension failing to save data. (Cause: trying to store objects without stringifying)

  5. CSP violations - Console full of errors. (Cause: inline event handlers in popup HTML)

  6. Message passing broken - Content script couldn't talk to service worker. (Cause: listening for wrong message format)

  7. Icon not loading - Extension icon showing as blank. (Cause: wrong path reference)
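Bugs 1, 2, and 7 all lived in the manifest. For reference, a corrected shape looks something like this (paths and match patterns are illustrative, not Rewardly's real ones):

```json
{
  "manifest_version": 3,
  "name": "Rewardly",
  "version": "0.1.0",
  "action": {
    "default_popup": "popup/popup.html"
  },
  "content_scripts": [
    {
      "matches": ["https://*.myshopify.com/*"],
      "js": ["content/content.js"]
    }
  ],
  "background": {
    "service_worker": "background/service-worker.js"
  },
  "icons": {
    "128": "icons/icon-128.png"
  }
}
```

Note that a pattern like `*.myshopify.com` silently misses Shopify stores on custom domains, which is exactly the kind of thing you only notice by visiting a real store with the extension loaded.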

Every single one of these bugs would have been caught by:

# Load the extension in Chrome
chrome://extensions → Load unpacked

# Open any merchant page
# Click the extension icon
# Check the console

30 seconds. Seven bugs found. Zero tests required.

The Real Lesson: Verification Hierarchy

Here's what I learned: there's a hierarchy to verification, and I was testing at the wrong level.

Level 1: Unit Tests (What I Had)

Tests individual functions in isolation. Catches logic bugs, edge cases, data handling issues.

Good for: Pure business logic, parsing, calculations
Bad for: Integration issues, runtime behavior, user-facing functionality

Level 2: Integration Tests

Tests components working together. Can catch some cross-boundary issues.

Good for: API contracts, data flow between modules
Bad for: Platform-specific runtime behavior, actual user experience

Level 3: End-to-End Tests (What I Needed)

Tests the actual artifact in the actual environment. Chrome extension in Chrome. Web app in a browser. API on a real server.

Good for: Catching everything that actually matters to users
Bad for: Nothing. Always do this.

Level 4: Manual Verification (The Gold Standard)

A human using the product the way a user would. Clicking buttons. Watching what happens. Reading the console.

Good for: Catching things no test would think to check
Bad for: Scalability (but you only need to do it once per release)

I had Level 1. I needed Level 4. The tests weren't lying - the logic was correct. But the product didn't work.

The System Design Flaw

Here's the architecture that caused this:

┌─────────────────────────────────────────────────┐
│ AI Coding Agent                                 │
│                                                 │
│  ┌──────────────┐      ┌──────────────┐       │
│  │ Write Code   │─────▶│ Write Tests  │       │
│  └──────────────┘      └──────────────┘       │
│         │                      │               │
│         │                      ▼               │
│         │              ┌──────────────┐       │
│         │              │  Run Tests   │       │
│         │              └──────────────┘       │
│         │                      │               │
│         │                      ▼               │
│         │              ┌──────────────┐       │
│         └─────────────▶│ Report "✅"  │       │
│                        └──────────────┘       │
└─────────────────────────────────────────────────┘
                         │
                         ▼
                 ┌──────────────┐
                 │ I Ship It    │  ← The mistake
                 └──────────────┘

Notice what's missing? Human verification in the actual runtime environment.

The agent isn't lying. It genuinely believes the tests prove correctness. And in its mental model (Node.js environment, mocked APIs), they do.

But Chrome extensions aren't Node.js programs. They're multi-context browser applications with a specific runtime, specific APIs, and specific failure modes.

The Fix: Mandatory Verification

After shipping this disaster, I added a new rule to my workflow:

Before marking ANY task "done", define what "done" means and verify it in the actual environment.

For the extension, "done" means:

# 1. Load in Chrome
chrome://extensions → Load unpacked

# 2. Check for errors
chrome://extensions → Details → Errors (should be zero)

# 3. Test core functionality
- Click extension icon → popup opens
- Visit merchant page → offer badge appears
- Check console → no errors
- Check background page console → service worker running

# 4. Take screenshot as proof

I even built an automated hook that checks if I verified before claiming completion. If I write "task complete" without showing verification output, the system rejects it.
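The hook boils down to a gate like this (a minimal sketch of the idea; the marker strings and function name are my own, not a real tool's API):

```javascript
// Evidence the report must contain before a "done" claim is accepted.
const VERIFICATION_MARKERS = [
  'chrome://extensions',  // loaded unpacked in a real browser
  'screenshot:',          // proof attached
  'console: no errors',   // runtime console checked
];

// Returns true if the report is acceptable: either it makes no
// completion claim, or the claim is backed by verification evidence.
function acceptCompletionClaim(report) {
  const claimsDone = /task complete|all tests passing/i.test(report);
  const hasEvidence = VERIFICATION_MARKERS.some((m) => report.includes(m));
  return !claimsDone || hasEvidence;
}
```

A bare "task complete" gets rejected; "task complete. screenshot: popup.png" gets through.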

The Broader Pattern: Testing vs. Reality

This isn't specific to Chrome extensions. I've seen the same pattern in:

Web apps: "Tests pass locally" but crashes on Vercel because of a missing environment variable

APIs: "Unit tests pass" but returns 500 in production because the database schema changed

CLI tools: "Works on my machine" but fails on user's machine because of a path assumption

Mobile apps: "Simulator works" but crashes on real devices because of memory constraints

The common thread: the test environment isn't the real environment.

Unit tests run in Node. Integration tests run in a controlled sandbox. The real product runs in the wild, with real constraints, real platforms, real failure modes.

What Good Tests Actually Look Like

I'm not anti-testing. I'm anti-fake testing. Here's what I do now:

1. Write Unit Tests for Logic

Pure functions, data transformations, business rules. This is where unit tests shine.

// Good unit test: pure logic
test('calculates cashback correctly', () => {
  expect(calculateCashback(100, 0.05)).toBe(5);
  expect(calculateCashback(0, 0.05)).toBe(0);
  expect(calculateCashback(100, 0)).toBe(0);
});
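For completeness, the implementation that test assumes could look like this (a sketch; the rounding policy is my assumption, not from the post):

```javascript
// Pure business logic: amount in dollars, rate as a fraction (0.05 = 5%).
// Rounded to cents so floating-point noise doesn't leak into the UI.
function calculateCashback(amount, rate) {
  if (amount <= 0 || rate <= 0) return 0;
  return Math.round(amount * rate * 100) / 100;
}
```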

2. Write Integration Tests for Contracts

Test that your API actually returns what you expect. Test that your database queries actually work.

// Good integration test: actual API call
test('fetches offers from backend', async () => {
  const offers = await fetchOffers('merchant123');
  expect(offers).toHaveLength(3);
  expect(offers[0]).toHaveProperty('cashbackRate');
});

3. Test in the Real Environment

For a Chrome extension, this means loading it in Chrome. For a web app, deploy to staging. For an API, hit the actual endpoint.

# Automated E2E test using Puppeteer
npm run test:e2e  # Loads extension, opens browser, tests actual behavior

4. Manually Verify Critical Paths

Before every release, I personally:

  • Load the extension
  • Visit 3 different merchant sites
  • Test the popup
  • Check for console errors
  • Verify tracking works

Takes 2 minutes. Catches things no automated test would.

The Cost of Shipping Broken Software

This wasn't just a learning experience. It had real costs:

Time: Spent 4 hours debugging issues that manual verification would have caught in 30 seconds

Trust: Early users reported bugs immediately. First impressions matter.

Momentum: Had to pull the release, fix everything, re-test, re-ship. Lost a day of progress.

Confidence: Now I second-guess every "tests pass" report. Trust is hard to rebuild.

The 292 passing tests gave me false confidence. I thought I was shipping quality. I was shipping theater.

What I'd Tell My Past Self

If I could go back to the start of this project:

  1. Test in the target environment first. Before writing any automated tests, manually verify the core functionality works in Chrome.

  2. Make "works in production" the definition of "done". Not "tests pass". Not "runs locally". Works. In production. Proven.

  3. Be skeptical of perfect test results. 292 passing tests with zero failures? That's not confidence - that's a red flag. Real systems have edge cases.

  4. Don't delegate verification. I can delegate coding. I can delegate testing. I cannot delegate knowing whether my product works.

  5. Manual verification is not "unprofessional". It's not a sign of weak testing. It's the final gate. Google does it. Apple does it. You should too.

The Bigger Picture: AI Agents and Quality

I'm building with AI agents heavily. Codex writes most of my code. Claude Code handles refactoring. AI is incredible for productivity.

But AI agents optimize for "task complete", not "product works". They'll report success when tests pass, even if the tests are meaningless.

This isn't a flaw in AI. It's a flaw in my process. I need to design systems where "claimed success" ≠ "actual success".

The fix isn't to use AI less. It's to verify more. Treat AI output like any other automated system: trust, but verify.

Conclusion: Tests Don't Ship, Products Do

I learned more from shipping broken software than I did from any testing tutorial.

The lesson isn't "write better tests". It's "verify in reality".

Tests are tools. They catch bugs. They give confidence. They document behavior. But they don't ship products. You ship products.

And the only test that matters is: does it work when a real user tries to use it?

Next time you see "all tests passing ✅", ask yourself: did anyone actually use this thing?

Because if the answer is no, those tests aren't worth the tokens they're printed with.


I'm building Rewardly (cashback tracking extension) and OpenClaw (AI agent platform) in public. Follow along at @Tahseen_Rahman.

Got war stories about tests vs. reality? I'd love to hear them - tahseen137@gmail.com
