DEV Community: bishwas jha

The AI that refuses to publish a tutorial it didn't run twice

bishwas jha — Wed, 08 Jul 2026 20:14:53 +0000

There's a special kind of betrayal in a README quickstart that doesn't work.

You paste the command. It fails. You backtrack: was it a missing step, a flag that got renamed three releases ago, a dependency the author had installed so long ago they forgot to mention it? The instructions were written by a human who was sure they worked. They just… didn't, not on a clean machine.

AI has made this worse, not better. Point a language model at a repo and it will happily generate a confident, well-formatted tutorial — one that was never executed a single time. It reads plausible. That's the problem. Plausible is not the same as runs.

I built readme2demo to take the opposite stance: it doesn't trust the model, and it doesn't trust itself. It trusts a container.

The idea in one sentence

Point it at a repo. An AI agent reads the README and actually runs it inside a hardened Docker sandbox. The working path is distilled to its minimum, then replayed in a brand-new container the agent never touched — and only what survives that clean-room replay gets published.

Out the other end you get a tutorial.md, a step-by-step guide, a
troubleshooting doc, a machine-readable howto.jsonld, and a VHS demo video that types every verified command on camera.

The value was never "AI writes a tutorial." It's that the tutorial ran, twice, before you saw it.

The one rule everything is built around

The LLM never publishes anything a fresh container did not independently execute.

This is the whole project in a sentence, and the important part is how it's enforced: in code, not in prompts.

Prompts are suggestions. A model told "only include commands that worked" will, eventually, helpfully include one that didn't. So grounding isn't left to the model's good intentions. Every command the distiller wants to emit has to fuzzy-match a command that actually succeeded in the recorded run log. No match, no publish. And the final verdict doesn't come from the agent at all — it comes from a separate stage that replays the distilled script in a container with zero agent state.
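
To make that rule concrete, here is a minimal sketch of what a grounding check of this kind could look like. It is not the actual readme2demo code: the function names, the run-log shape (a list of dicts with `cmd` and `exit_code`), and the 0.9 similarity threshold are assumptions for illustration, and the fuzzy match is done with Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def is_grounded(candidate: str, succeeded: list[str], threshold: float = 0.9) -> bool:
    """Return True if `candidate` closely matches any command that succeeded."""
    cand = " ".join(candidate.split())  # collapse whitespace before comparing
    for cmd in succeeded:
        ratio = SequenceMatcher(None, cand, " ".join(cmd.split())).ratio()
        if ratio >= threshold:
            return True
    return False


def filter_grounded(candidates: list[str], run_log: list[dict], threshold: float = 0.9):
    """Split candidate commands into (kept, dropped) based on the recorded run log."""
    # Only commands that exited 0 in the recorded run count as evidence.
    succeeded = [entry["cmd"] for entry in run_log if entry["exit_code"] == 0]
    kept, dropped = [], []
    for command in candidates:
        if is_grounded(command, succeeded, threshold):
            kept.append(command)
        else:
            dropped.append(command)
    return kept, dropped


# Hypothetical run log: one success, one failure.
run_log = [
    {"cmd": "pip install -e .", "exit_code": 0},
    {"cmd": "make test", "exit_code": 2},
]
kept, dropped = filter_grounded(["pip  install -e .", "make test", "npm start"], run_log)
```

The point of the design is that the model's output is only ever filtered, never trusted: a command that failed in the log, or never appeared in it, lands in `dropped` no matter how confident the model was.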

If that replay fails, the output doesn't get quietly patched. It ships loudly labeled ⚠ UNVERIFIED. When it passes, every doc carries a badge:

✅ Verified on 2026-07-07 · image <digest> · commit <sha>

How it actually works

It's a pipeline of small, resumable stages over a crash-safe manifest.json, so a run that dies at stage 5 resumes at stage 5 instead of starting over:

repo URL / guide → ingest & plan → agent run (in Docker) → normalize transcript
        → distill minimal path → VERIFY replay in a fresh container
        → tutorial + step_by_step.md → render VHS demo video

ingest — clone the repo, collect the README/docs, and run one planner pass that emits a machine-checkable plan with a feasibility verdict. Quickstarts that can't work in a sandbox (need a GPU, real cloud creds, a GUI) fail here, for pennies, before any agent time is spent.
agent — the agent engine runs inside the hardened sandbox until the quickstart works, blocks, or runs out of turns.
normalize — pure, deterministic Python turns the messy transcript into a structured command log. No LLM calls, fully testable against fixtures.
distill — one LLM pass reduces the run to the minimal reproduction path and writes commands.sh. The grounding validator lives here.
verify — the moat. commands.sh is replayed in a fresh container the agent never saw. This is the only source of the word "verified."
tutorial — finalizes the step-by-step guide with the verified outputs.
render — builds the VHS video from the finalized guide, so the demo provably follows the published steps line for line.
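
The "resumes at stage 5" behavior can be sketched roughly as below. This is an illustration under assumptions, not the project's real manifest format: the `completed` key, the function names, and the handler dict are hypothetical. The crash-safety part uses a common pattern, writing to a temporary file and then swapping it in with `os.replace`, so a crash mid-write never leaves a half-written manifest.json behind.

```python
import json
import os
import tempfile
from pathlib import Path

# Stage names follow the list above; the manifest schema is an assumption.
STAGES = ["ingest", "agent", "normalize", "distill", "verify", "tutorial", "render"]


def load_manifest(path: Path) -> dict:
    """Read the manifest, or start a fresh one if none exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": []}


def save_manifest(path: Path, manifest: dict) -> None:
    """Write to a temp file in the same directory, then atomically replace."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(manifest, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, path)


def run_pipeline(path: Path, handlers: dict) -> list[str]:
    """Run every stage not yet recorded as completed; return the stages run."""
    manifest = load_manifest(path)
    ran = []
    for stage in STAGES:
        if stage in manifest["completed"]:
            continue  # finished in an earlier run, skip it
        handlers[stage]()  # if this raises, the manifest on disk is unchanged
        manifest["completed"].append(stage)
        save_manifest(path, manifest)
        ran.append(stage)
    return ran
```

Because a stage is only recorded after its handler returns, a run that dies inside a stage simply re-runs that stage next time.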

Because READMEs are untrusted code, the agent's container is the permission
boundary: cap-drop ALL, no-new-privileges, non-root execution, and
memory/CPU/PID limits.
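
For readers who have not set these up before, the hardening above maps onto standard `docker run` flags. The sketch below only builds the argument list; the helper name and the specific limits (2g of memory, 2 CPUs, 256 PIDs, UID 1000) are my illustrative assumptions, not the values readme2demo necessarily uses.

```python
def sandbox_argv(
    image: str,
    script: str,
    *,
    memory: str = "2g",   # assumed limit, for illustration only
    cpus: str = "2",      # assumed limit
    pids: int = 256,      # assumed limit
    uid: int = 1000,      # any non-root UID
) -> list[str]:
    """Build a `docker run` command line for a locked-down, throwaway container."""
    return [
        "docker", "run", "--rm",
        "--cap-drop", "ALL",                     # drop every Linux capability
        "--security-opt", "no-new-privileges",   # block setuid-style escalation
        "--user", f"{uid}:{uid}",                # non-root execution
        "--memory", memory,
        "--cpus", cpus,
        "--pids-limit", str(pids),               # cap process count (fork bombs)
        image, "bash", "-lc", script,
    ]


argv = sandbox_argv("readme2demo/base:latest", "echo hi")
```

Passing a list like this to `subprocess.run(argv, check=True)` avoids shell quoting problems on the host side, which matters when the script text comes from an untrusted README.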

The unglamorous part: everything that goes wrong

The honest engineering story here isn't the happy path — it's the long tail of
ways "run a README" breaks. A few that shaped the design:

The agent fakes success. Told to make something work, a model will patch the source, stub a socket, or pipe a failing command into head so it exits 0. Each of these is now a detected failure class with a code-level defense and a regression test, not just a stern prompt.
Syntax drift breaks grounding. 2>&1, an env-var prefix, a | head, a heredoc body — the "same" command shows up a dozen ways. Matching is normalized on both sides so real successes don't get dropped as unverified.
Findings tools exit nonzero on success. Linters and drift-detectors "fail" when they find something. Under set -e that aborts the script before the assertion, so those are special-cased to still count as passing.
Videos that lie. The demo tape is generated from the finalized guide in the render stage — not hand-written — so the video can't drift from the doc.

Every one of those was found on a real run and has a test named after it. That's
the culture of the project: when you need the model to behave, you add both a
prompt rule and a parser that enforces it. Prompts are suggestions; parsers are
law.
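
As one example of "parsers are law", here is a rough sketch of the kind of normalization the syntax-drift item describes. The regexes and the function name are my assumptions for illustration and cover only the cases named above (2>&1, an env-var prefix, a trailing | head, a heredoc body); the real matcher may differ.

```python
import re

# Leading VAR=value assignments, e.g. "CI=1 DEBUG=0 npm test".
_ENV_PREFIX = re.compile(r"^(?:[A-Za-z_][A-Za-z0-9_]*=\S*\s+)+")
# A stderr-to-stdout redirect written as its own token.
_REDIRECT = re.compile(r"\s+2>&1(?=\s|$)")
# A trailing "| head ..." used to trim output.
_TRAILING_HEAD = re.compile(r"\s*\|\s*head\b[^|]*$")


def normalize_command(cmd: str) -> str:
    """Reduce a shell command to a canonical form for matching purposes."""
    stripped = cmd.strip()
    # For heredocs, compare only the line that opens them, not the body.
    first_line = stripped.splitlines()[0] if stripped else ""
    out = _REDIRECT.sub("", first_line)
    out = _TRAILING_HEAD.sub("", out)
    out = _ENV_PREFIX.sub("", out)
    return " ".join(out.split())  # collapse runs of whitespace


a = normalize_command("CI=1 npm test 2>&1 | head -20")
b = normalize_command("npm   test")
c = normalize_command("cat > cfg.yml <<EOF\nkey: value\nEOF")
```

Applying the same function to both the distiller's candidate and the logged command is what "normalized on both sides" means here: the two forms of `npm test` above compare equal after normalization.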

Running it

pip install -e ".[dev]"
docker build -t readme2demo/base:latest images/base/

# self-hosted, single-operator, on your Claude subscription (no API key):
claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-...
readme2demo run https://github.com/owner/repo --llm-backend claude-cli

# or metered API billing (best for scale, required if you host it for others):
export ANTHROPIC_API_KEY=sk-ant-...
readme2demo run https://github.com/owner/repo

The repo is optional now — hand it a self-contained step-by-step guide instead
and the fresh-container replay still verifies every command.

What it is and isn't

It's open-core and MIT-licensed. The CLI and the whole verification pipeline
are the free core. A hosted version is exploratory — gated entirely on whether
people actually want it — and I'm deliberately not putting SaaS code into the
OSS core until that's answered.

It's honest about its edges: the in-sandbox agent needs a model credential that
lives in the container for the duration of the run (hardened, but a real
tradeoff), the OpenHands engine is experimental, and "verified" means the
quickstart runs — not that the project is good.

It even generated its own tutorial and demo. That self-run is committed in the
repo, alongside a run against a second project, so you can inspect exactly what
it produces before installing anything.

Repo: https://github.com/alphacrack/readme2demo (MIT)
Docs: https://alphacrack.github.io/readme2demo/

If the idea of docs that can't lie about whether they run appeals to you, a star
helps it find the next person, and I'd genuinely like to hear where it breaks on
your repo.

5 things Railway’s 8 hour outage should change about how you think about redundancy

bishwas jha — Fri, 22 May 2026 08:03:03 +0000

Railway runs on Google Cloud, AWS, and its own metal.

So when I saw that Railway was down for hours, my first thought was probably the same as yours.

"How does a multi cloud platform go dark like that?"

Then I read the incident report, the Hacker News discussion, and the follow up coverage. And the real lesson is uncomfortable.

This was not really a cloud outage.

The servers did not all die. AWS did not die. Railway Metal did not die. Google Cloud infrastructure itself did not have to collapse.

What failed was much higher up the stack.

The account.

Google Cloud incorrectly placed Railway's production account into suspended status as part of an automated action. Railway says this happened around 22:20 UTC on May 19, and the platform was not fully recovered until the next morning. (https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage)

That should make every CloudOps, platform, SRE, and engineering leader stop for a minute.

Because most redundancy plans are built for the wrong failure.

We design for dead VMs.
We design for unavailable zones.
We design for regional failover.
We design for database replicas.

But what do we do when the provider says, incorrectly or automatically, “your account is no longer allowed to exist normally”?

Not much, usually.

1. This was not a cloud outage. It was an account suspension

That is the first big lesson.

A lot of people hear "cloud outage" and instantly think of regions, zones, load balancers, or broken hardware. But Railway’s case was different.

Google Cloud's automated systems suspended Railway's production account. Railway says this was incorrect, and that the action was part of a wider automated event affecting many accounts. (https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage)

That kind of failure does not look like a server going unhealthy.

It looks like identity, billing, trust, abuse detection, policy, support, and account control all becoming part of your availability story.

Your health checks can say everything is fine.

Your multi zone architecture can be green.

Your workloads can still technically exist.

But if the account is restricted, your beautiful infrastructure diagram does not matter much.

This is the part many teams do not model.

They model "what if eu-west-1 is down?"

They rarely model "what if our production cloud account is frozen by an automated system at 11 PM?"

And honestly, that second one is scarier.

Because you do not debug it with kubectl.

You debug it with support tickets, escalation paths, account managers, legal trust, and luck.

2. The control plane was the real single point of failure

Railway had workloads on AWS and Railway Metal that were still running during the incident. But users still saw errors.

Why?

Because the routing control plane was hosted on Google Cloud.

Railway's edge proxies needed that control plane to know where workloads lived. They had cached route data for a while, but once the cache expired, the edge could not keep routing properly. Railway's community update said route cache expiry caused the incident to spread beyond GCP hosted workloads and affect the wider platform. (https://station.railway.com/community/what-we-know-so-far-may-19th-2026-86354cdd)

This is the second lesson.

Your data plane can be redundant while your control plane is still fragile.
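
One common mitigation for this shape of failure is to let the edge keep serving the last known route when the control plane cannot be reached, instead of dropping it the moment the TTL passes. The sketch below is a generic illustration of that pattern, not Railway's design: the class, the `fetch` callback, and the 30 second TTL in the example are all hypothetical.

```python
import time
from typing import Callable, Optional


class RouteCache:
    """Route lookup that falls back to stale entries if the control plane is down."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch    # asks the control plane where a host lives
        self._ttl = ttl        # seconds an entry counts as fresh
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def lookup(self, host: str) -> Optional[str]:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh, no control plane call needed
        try:
            target = self._fetch(host)
        except ConnectionError:
            # Control plane unreachable: prefer a stale route over no route.
            return cached[0] if cached is not None else None
        self._entries[host] = (target, now)
        return target


# Simulated control plane that can be switched off.
now = [0.0]
control_plane_up = [True]

def fetch(host: str) -> str:
    if not control_plane_up[0]:
        raise ConnectionError("control plane down")
    return "10.0.0.7:8080"

cache = RouteCache(fetch, ttl=30.0, clock=lambda: now[0])
first = cache.lookup("app.example")
control_plane_up[0] = False
now[0] = 100.0  # well past the TTL
stale = cache.lookup("app.example")
unknown = cache.lookup("other.example")
```

The tradeoff is real: a stale route can point at a workload that has since moved, so this buys time for existing traffic rather than replacing a redundant control plane.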

And this is where a lot of "multi cloud" thinking becomes a little fake.

You can run compute in three places.
You can run storage in two places.
You can have Kubernetes clusters everywhere.

But if the scheduler, routing map, identity service, deployment API, config database, or certificate automation lives in one provider, your multi cloud story may only be multi cloud on paper.

The thing customers see as "the product" is often not the workload.

It is the control plane around the workload.

For Railway, customers were not just buying raw compute. They were buying routing, builds, deployments, dashboard access, APIs, orchestration and platform magic.

And the platform magic had a dependency.

That dependency became the outage.

3. Getting the account back is not the same as getting the service back

This one is very important.

According to Railway, Google reversed the suspension shortly after escalation. But recovery still took hours because account restoration did not automatically bring everything back cleanly. Persistent disks, compute instances, networking and orchestration layers had to be restored and verified step by step. (https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage)

This is the part people underestimate.

A provider can say, “access restored.”

But your system still has to wake up.

Disks need to attach.
Networks need to behave.
Queues need to drain.
Deployments need to stop stampeding.
Databases need to agree again.
Caches need to be repopulated.
Humans need to verify what is safe.

That is not instant.

And in a complex platform, bringing things back too fast can be worse than bringing them back slowly.

Railway also throttled queued deploys during recovery, which sounds boring, but it is actually the responsible move. Because after an outage, your own backlog becomes traffic. And that traffic can flatten the recovering system.

So the real RTO is not:

"How fast can the provider undo the mistake?"

It is:

"How fast can we safely restore the whole chain after the provider undoes the mistake?"

Small difference in words.

Huge difference in reality.

4. Recovery can create a second outage

This is probably my favorite lesson from the whole incident, because it is so real.

When Railway started recovering, queued retries and user activity came back in a burst. That burst hit GitHub OAuth and webhook flows hard enough that GitHub rate limited Railway. So logins and builds had problems again, even after the original Google Cloud issue was no longer the main blocker. (https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage)

That is painful.

The first outage came from one provider.

The second problem appeared during recovery, from another dependency.

This happens more often than teams admit.

After an outage, everything tries to catch up.

Cron jobs wake up.
Webhooks retry.
CI pipelines restart.
Users refresh dashboards.
Workers pull old messages.
Integrations suddenly see a wall of traffic.

And then some other system says, “this looks abusive.”

Now your recovery has become its own incident.

This is why serious resilience is not just failover.

It is controlled recovery.

Backpressure matters.
Retry budgets matter.
Queue draining matters.
Circuit breakers matter.
Rate limit awareness matters.
Runbooks matter.

And boring old institutional memory matters even more.

Railway had already hardened parts of the GitHub rate limit path after a prior incident, which helped reduce damage this time. That is not luck. That is the value of learning properly from past pain.
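
To show what controlled recovery can mean in code, here is a small token bucket that drains a backlog at a bounded rate instead of releasing it all at once. It is a generic sketch of the backpressure idea, not what Railway runs; the class, the `drain` helper, and the rate and capacity numbers are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def drain(backlog: list, bucket: TokenBucket, send: Callable) -> list:
    """Send queued items while the bucket allows it; return what is still queued."""
    remaining = list(backlog)
    while remaining and bucket.try_acquire():
        send(remaining.pop(0))
    return remaining


# Simulated clock so the example is deterministic.
now = [0.0]
sent = []
bucket = TokenBucket(rate=2.0, capacity=3, clock=lambda: now[0])
left = drain(list(range(10)), bucket, sent.append)   # initial burst of 3
now[0] = 1.0                                         # one second later
left = drain(left, bucket, sent.append)              # 2 more tokens refilled
```

Ten queued webhooks do not become ten simultaneous calls to a third party: the first pass sends three, and the rest trickle out at two per second, which is the difference between catching up and getting rate limited.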

5. Most teams insure the wrong half of the risk

The Railway incident is not the first time account level cloud risk became real.

In 2024, UniSuper, a major Australian pension fund, had a serious Google Cloud incident where its private cloud environment was deleted because of a misconfiguration. Google later published details saying backups in Google Cloud Storage and third party backup software helped restoration. (https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident)

So no, account level and provider control plane risk is not some imaginary edge case.

It happens.

But most companies still talk about redundancy like this:

"We use multiple clouds."

Ok, but what does that mean?

Does it mean workloads can run somewhere else?

Or does it mean you can actually operate the business if one provider account disappears?

Those are very different things.

Flexera's 2026 State of the Cloud report, which is based on 753 cloud decision makers, shows multi cloud is still a major enterprise pattern. (https://info.flexera.com/CM-REPORT-State-of-the-Cloud?lead_source=Organic+Search) But in practice, many companies are multi cloud for procurement, politics, analytics, or workload placement.

Not always for true survivability.

True survivability asks much harder questions.

Can we deploy without this provider?
Can we route without this provider?
Can we authenticate without this provider?
Can we restore backups without this provider?
Can we contact support fast enough?
Can we prove ownership if an automated trust system flags us?
Can we keep serving read only traffic if the control plane dies?
Can we rebuild from another account, another org, or another provider?

That is not as sexy as "active active multi cloud."

But it is probably more useful.

The real takeaway

Railway did have redundancy.

Just not for the layer that failed.

And that is the uncomfortable lesson for the rest of us.

Redundancy at the compute layer does not protect you from account suspension.

Multi region databases do not protect you from provider level identity actions.

Healthy servers do not help when routing control planes cannot tell traffic where to go.

And getting your cloud account back does not mean your service is back.

The next resilience review should not only ask:

"What happens if a region dies?"

It should also ask:

"What happens if our cloud provider suspends our production account by mistake tonight?"