I Gave an AI Full Access to My Startup and Asked It to Destroy Me

EzekingDigital — Wed, 01 Jul 2026 16:00:00 +0000

It found the exact bug I'd already learned to fear...

A few days ago I handed an AI model every file in my project — eighteen months of design docs, decision logs, corpus accounting down to the chunk — and gave it a single instruction: act as a hostile red-team, assume my idea is doomed, and tell me exactly how it dies. No mercy. No flattery.

It obliged. And the verdict was worse than I expected — not because it was cruel, but because it was right, and because the bug it found at the heart of my company was the same bug I'd already caught once, in my code, and congratulated myself for fixing.

This is the honest version of building in public. Not the launch-day highlight reel. The reckoning. Here's the whole story — what I built, the first bug, the second bug, and what I'm doing about it now.

What I built, and why it nearly broke me

I'm building DeepForge from Lagos, solo, on a strict zero-budget stack. The first product, SynthForge, is a prompt-engineering synthesis engine — ask it anything, get a detailed, sourced answer. Underneath sits a reusable retrieval core meant, eventually, to power a family of domain engines. That's the vision.

When I started, I did what most people do. I optimized for more. More sources, more coverage — surely a bigger corpus is a better foundation. So I ingested everything: GitHub, a thousand arXiv papers, Reddit through a quality filter, the official docs, LessWrong, Hacker News, Stack Overflow, YouTube transcripts. Ten sources. Nearly twenty thousand chunks. It was live. It worked. By every visible metric, I had built the thing.

I want to be honest about what building this way costs, because the glossy version deletes it, and the deletion is a lie.

Start with electricity, because everything sits on top of it. The backbone of every frustration here is power that arrives and vanishes on a schedule no one can predict. You line up a long embedding run — hours of CPU work, no GPU to shorten it — and the grid, not you, decides whether it finishes. You learn to dread the silence that kills a run you waited hours for. Then the network: an upload that should happen once happens five times, because it fails four. A download you need dies at eighty percent and you start again. You write every script to be resume-safe not as a best practice from a textbook, but because the connection will break and the work has to survive it.

And sometimes the only way to reach a milestone is to physically leave. There are nights I have packed up my laptop and travelled across the city to somewhere I can get the two things I need in the same place at the same time: stable power and free internet. I work through the night there, because that's when both hold, and because the alternative — burning mobile data and generator fuel at home — is a cost that, on a zero budget, you cannot keep paying. I have done it. I'll do it again tomorrow.

This is not Stanford, or MIT, or some lab in San Francisco with backup generators and a fibre line nobody has ever had to think about. This is Lagos, in its raw form. None of that is complaint. It is the project's weather, and every line of this was built inside it.

The first bug: the thing that looked done wasn't

Here is where the project turned the first time.

I eventually went looking at what my pipeline was actually writing to the corpus — not what I'd designed it to write, the real bytes on disk. And I found contamination. Two hundred and two records that should never have qualified had entered anyway.

I traced it, and the root cause was humbling: I had specified an admission contract — rules every chunk had to satisfy before entering — and I had never actually wired that contract into the live write path. The rules existed in the schema. The schema was real. But the code that did the writing bypassed them entirely. I had written the law and never connected it to the door.

Once you see that shape, you see it everywhere. A writer keying records on the wrong field, orphaning them silently. A ghost directory causing failures that threw no error — just quietly wrong results that surfaced later as confusing numbers. None of these were bugs in the ordinary sense. The code ran. Nothing crashed. Every one was a discipline failure: a gap between what I believed was true and what was actually wired and verified.

So I did the thing that felt wrong and turned out to be right. I subtracted. I rebuilt from a fail-closed baseline where every chunk carries a verifiable licence and nothing enters unless it provably qualifies. That clean corpus was 1,231 chunks. From nearly twenty thousand down to twelve hundred — a 94% deletion. And the answers got better, because retrieval over twelve hundred trustworthy chunks beats retrieval over twenty thousand where an unknown fraction is poison.

The lesson I took, and the one I built a whole discipline around: wired, not asserted. "I specified it" is not "the system enforces it." "I verified it in a chat" is not "the pipeline can see it." The gap between those is where everything I'd gotten wrong lived. I got genuinely good at distrusting that gap in my code.

And then I forgot to look for it anywhere else.

The turn: pointing the knife at myself

Having learned to distrust the gap between what I believed and what was verified, I decided to aim that same distrust at the entire project — not the code, the premise. So I gave the AI the CIA-style red-team brief: list every assumption I'm standing on, tier them by how fatal they are, and for each load-bearing one, tell me what evidence would prove it wrong — because if I can't point to that evidence, I'm running on faith, not analysis. Then imagine it's eighteen months later and this failed, badly, and write the post-mortem. Then be a funded competitor who hates me. Then be a customer who paid and felt cheated.

Full access. Every file. I told it the kindest thing it could do with that much context was refuse to flatter me.

It refused.

The verdict, unsparing

Here is what it found, translated out of its cold voice into mine, because I have to own it.

The corpus is too thin to keep the promise — and my own data proves it. SynthForge promises "ask anything, get a sourced answer." But the licence gate that makes the corpus legally shippable also collapsed it 94%, and of those 1,231 clean chunks, the entire clean arXiv spine is two papers. The rest is books and docs. A synthesis engine whose shippable research spine is two papers isn't under-corpused; it's barely-corpused. The promise and the corpus are not the same size, and the corpus is the one telling the truth.

There is zero validated demand — and no instrument even exists to find any. Across eighteen months and forty-plus build sessions, I never once built a way to test whether a single human would pay. The waitlist is a front-end stub wired to nothing. The verdict here isn't a gap in the AI's information; the absence of any demand instrument is itself the finding. I was, in the precise sense the primer meant, operating on faith.

The domain is decaying underneath the build. Prompt engineering as a standalone craft is being absorbed — into the models themselves, and into the auto-optimizers I personally use and write about. My own schema has a "Knowledge Half-Life" field because I know this content ages fast. A 2022 chain-of-thought paper isn't a moat; it's a museum piece.

And the capital allocation is inverted. Bus factor of one, part-time, mid-MSc, under infrastructure stress — and every scarce hour flowing to the recoverable risk (engineering correctness, always fixable later) while starving the existential one (does anyone want this). The line that lodged in me: my verification discipline is genuinely excellent and may be the most sophisticated form of productive procrastination it had ever seen documented. Polishing the drivetrain of a car no one has agreed to buy.

Then it wrote the 1-star review it imagined going viral. I screenshotted it, because every word is a fair hit on the product as it stands today:

★☆☆☆☆ "The most rigorously cited way to learn nothing." Paid for it, asked five questions, three times got told "the corpus is insufficient." I paid for a search engine that refuses to search. Asked Claude the same five for free — better answers, no interpretive dance about source tiers. A $9 monument to one guy's licence compliance. The sourcing is impeccable. There's just nothing to source. Forging, apparently, requires iron.

And the quote-tweet that hurt most, because it's true: a vault door on a shoebox.

The root cause, in one sentence it handed back to me: I built a supply-side engine to an immaculate standard for eighteen months while never once instrumenting the demand side, in a domain that decayed beneath me the whole time.

The bug was the same bug

Here is the part that made me put the laptop down and sit in the dark for a while.

I already know this failure. I have a name for it. It's the bug I deleted 94% of my corpus to fix.

The contamination was a contract I specified but never wired. The demand gap is a market I assumed but never instrumented. They are the identical failure: believing something is true because I designed it to be, instead of building the one test that could tell me I was wrong. The refusal protocol, the two-key verification, the fail-closed licence gate — I had wired rigor into every corner of the engineering, and left the single most important assumption in the whole enterprise — that anyone wants this — completely unwired. Asserted. Faith.

My own signature discipline. Applied everywhere except the one place that decides whether any of it is a business.

You can be immaculate at the level of the code and naive at the level of the company, and the second one is fatal in a way the first never is. I'd been so busy verifying that the door was locked that I never checked whether anyone was trying to come in.

What I'm doing about it, starting now

The method here was never despair. Better embarrassed in the war room than buried on the battlefield — and I asked for the war room. So here's the turn, and I'm executing it as I publish this.

For the next 30 days, I am not touching corpus code. No ingestion, no embedding, no cutover. The build freezes — nothing destructive, the anchor stands — and every hour goes to the one question I've avoided for eighteen months: will one stranger pay?

Concretely: I'm rewriting the promise to what the corpus can honestly keep. Not "ask anything about prompt engineering" — that's the breadth my thin corpus can't back, and the reason that 1-star review lands. Instead, the narrow, defensible thing: a curated, licence-clean, citable reference for the people who actually need provenance — the deep, systematic craft that doesn't decay, not the surface tips a free model already handles better than I can. Free models invent citations; a sourced, licence-clean engine doesn't. That's global, and it's a real difference. Who exactly that audience is, and what they'll pay for, is the thing the next 30 days exists to find out — not another assumption I get to make from my chair. And "sourced, and it tells you when it doesn't know" stops being the embarrassment the review skewered and becomes the whole point.

So: I'm wiring the waitlist for real — an evening's work — with one question that matters: would you pay $9 a month, yes or no. I'm shipping SynthForge alone, free, and instrumented, breaking my own rule that chained it to a harder, unfinished product. I'm having twenty real conversations with people who hit hard PE questions, to hear in plain words whether they'd pay for this when the model they already use is free. And I've written the kill-or-ship line before the data arrives, so I can't move the goalposts on myself: if almost no one would pay and the corpus can't answer half the real questions, I pivot the framing or shelve it and keep the credibility. The 27-Forge vision goes in a drawer marked reopen at ten paying users.

I might be shelving this in a month. That's the honest stakes. But either way, eighteen months of rigorous, legally-clean engineering becomes a portfolio and a lesson instead of a monument — and I'd rather find out now than spend another year forging a beautiful blade with no iron behind the edge.

And if you've read this far, you might be one of those twenty conversations. So let me actually ask you, here:

When you hit a genuinely hard prompt-engineering question — the kind that isn't in a tutorial — what do you do right now? Where do you go?
What would make you stop doing it that way?
Would you pay for something better — and what would it have to do to be worth it?

I don't want a poll. I want a conversation. If any of that provoked an honest answer, reply or reach me directly — that's the signal I'm actually here for. And if the answer to the last one is yes, tell me the realest way you can: on the waitlist, where saying it costs you a click. A click is one unit of commitment more than a comment, and after everything above, you'll understand why that difference is the whole point.

Building in public means publishing this part too — the audit that hurt, not just the wins that flatter. If you're building something and you've never once wired a test that could tell you the whole premise is wrong, I'd gently suggest you run the red-team on yourself before the market runs it on you for free.

I'll write what the 30 days turn up — the numbers, the conversations, the verdict — whichever way it breaks. The honest version is the only one worth reading.

— Building DeepForge from Lagos Nigeria, one verified assumption at a time. Even the painful ones.