Valerii Vainkop
25K Lines, 2 Weeks, Zero Regressions: The AI-Assisted Migration Methodology That Actually Works

If you're sitting on a Terraform migration or a K8s API version upgrade that keeps getting pushed to "next quarter" — this might change your math.

On February 23, Andreas Kling ported LibJS, Ladybird's entire JavaScript engine frontend, from C++ to Rust using AI coding agents. 25,000 lines. Two weeks. Zero regressions across both the ECMAScript test suite and Ladybird's internal tests. I've read his writeup three times now.

The headline is impressive. The methodology is the part worth stealing.

What Actually Got Ported

LibJS handles the JavaScript engine's lexer, parser, AST (abstract syntax tree), and bytecode generator. This is not a utility library. It's the part of the codebase where a subtle bug can break thousands of programs in ways that don't surface immediately. The kind of code where experienced engineers move carefully and budget months, not weeks, for a rewrite.

Kling was using Claude Code and Codex. The same work, done by hand, would have taken "multiple months" by his estimate.

The numbers, from the public PR (ladybird/pull/8104):

  • ~25,000 lines of Rust generated and verified
  • 52,898 test262 test cases — 0 regressions
  • 12,461 Ladybird internal tests — 0 regressions
  • 0 performance regressions
  • Hard requirement: byte-for-byte identical AST and bytecode output — not "functionally equivalent," identical

This is the largest publicly documented AI-assisted migration with verified production-quality results I'm aware of. And the methodology is more useful than the numbers.

The Methodology

1. Human-directed. Every architectural decision stayed human.

The AI didn't wake up and decide to port LibJS. Kling made every structural call: what to port, in what order, which patterns to preserve. The AI executed bounded tasks — "translate this class," "convert this function," "maintain this exact behavior" — but never owned the roadmap.

This distinction is critical. It's the difference between an AI that's "doing the engineering" and an engineer who's using an AI to multiply their execution speed. The second one produces 25k lines with zero regressions. The first one produces code that looks right until it doesn't.

2. Small units, many iterations.

Not whole files. Not whole modules. Individual functions. Individual data structures. Clear, specific, bounded tasks.

The AI's error rate compounds with scope. Give it 30 lines → high accuracy, easy to verify. Give it 1,000 lines → errors compound, review is slow, you've already lost the time advantage.

Small tasks also make the review cycle fast. If a 30-line output is wrong, you know immediately. If a 1,000-line output is subtly wrong, you find out when tests break three steps later.

3. Multi-pass adversarial review.

After initial translation, Kling ran a different AI model over the output specifically to find mistakes and bad patterns. The models checked each other's work.

He didn't just run the test suite and ship whatever passed. He actively used a second model as a code reviewer — the same way you'd use a second human engineer to review a first engineer's PR, except the "reviewer" has no blind spots from writing the original code.

This is underused. Most people use AI to generate. Few use AI to verify the generation.

4. Zero-regression bar enforced by hard requirements.

The byte-for-byte identical output requirement forced discipline. There was no "close enough" — every deviation showed up in the test suite immediately. The quality bar wasn't aspirational; it was a hard check at every step.

In infrastructure terms, this is like requiring kubectl diff to show zero changes before merging a migration. The constraint is what makes the result trustworthy.
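A minimal sketch of that kind of merge gate, with `gate` as a hypothetical helper; in a real pipeline the wrapped command would be `kubectl diff -f ./migrated-manifests/`:

```bash
#!/bin/bash
# Hypothetical merge gate: proceed only when the wrapped diff-style
# command reports zero changes (exit code 0), block otherwise.
gate() {
  if "$@" > /dev/null 2>&1; then
    echo "OK: zero changes"
  else
    echo "BLOCKED: changes detected" >&2
    return 1
  fi
}

# In CI this would be: gate kubectl diff -f ./migrated-manifests/
```

`kubectl diff` already uses its exit code this way (0 for no differences, 1 for differences), so the gate needs no output parsing.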

Why This Maps Directly to Your Infrastructure Work

If you run Kubernetes, write Terraform, or maintain Helm charts — you already have this same problem. The pattern-heavy work that keeps getting pushed because it's tedious but low-risk:

→ Upgrading deprecated K8s API versions between releases (always pattern-heavy, always a lot of YAML)

→ Migrating Helm-templated configurations to Kustomize

→ Converting OPA policies to Kyverno syntax (or updating policies to new Kyverno API versions)

→ Updating Prometheus recording rules when metric naming changes

→ Migrating Flux HelmRelease specs between major versions

Every one of these is the same problem Kling solved. Pattern-heavy. Tedious. High surface area for subtle errors. The kind of work an experienced engineer does correctly but slowly — and an AI does fast but needs verification.

Translating the Pattern to Your Next K8s Migration

Here's how you'd apply this methodology to a real K8s API version migration:

Step 1: Define the migration spec precisely

Don't ask the AI to "migrate my Deployments from apps/v1beta1 to apps/v1." Give it a specific unit:

```yaml
# Task: Convert this single Deployment manifest from apps/v1beta1 to apps/v1.
# Requirements:
#   - Preserve all existing labels, annotations, and selectors
#   - Add required 'selector.matchLabels' field (was optional in beta)
#   - spec.template.metadata.labels must match spec.selector.matchLabels
#   - Do not change any container specs, resource limits, or volume mounts
#   - Output must pass: kubectl apply --dry-run=server
#
# Input manifest:
[paste single manifest here]
```

One manifest, one task, precise constraints. The output is reviewable in 60 seconds.
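For a hypothetical single-label Deployment, the change that task asks for is small and easy to eyeball: the only structural difference is the now-mandatory selector.

```yaml
# apiVersion changes from apps/v1beta1 to apps/v1.
# In apps/v1, spec.selector is required and immutable; in v1beta1 it was
# optional and defaulted from the template labels.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web              # hypothetical example manifest
  labels:
    app: web
spec:
  replicas: 2
  selector:
    matchLabels:         # the new required field
      app: web           # must match spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27   # container spec unchanged from the original
```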

Step 2: Validate each unit with a hard check

```bash
#!/bin/bash
# validate-migration.sh
# Run after AI generates each migrated manifest.
# set -e aborts the script the moment any check fails.
set -euo pipefail

MANIFEST=${1:?usage: validate-migration.sh <manifest.yaml>}

echo "=== Dry-run validation ==="
kubectl apply --dry-run=server -f "$MANIFEST"

echo "=== Schema validation ==="
kubeconform -strict -kubernetes-version 1.35.0 "$MANIFEST"

echo "=== Policy check ==="
conftest test "$MANIFEST" --policy ./policies/

echo "=== Diff against current state ==="
# kubectl diff exits 1 when differences exist; show them without aborting
kubectl diff -f "$MANIFEST" || true
```

If any check fails, the migration doesn't proceed. Same principle as Kling's byte-for-byte identical output requirement.

Step 3: Two-model adversarial review for anything over 50 lines

For larger migrations — updating an entire Helm chart's values schema, rewriting a set of Alertmanager rules — I've started running the AI output through a second model with a focused security/correctness lens:

```python
import openai


def two_pass_review(generated_yaml: str, context: str) -> dict:
    """
    Adversarial second pass: a different model (GPT-4o) reviews the
    YAML that the first model (e.g. Claude) already generated, hunting
    for correctness, security, and subtle cross-version issues.

    The reviewer has no attachment to the generated code.
    That's the point.
    """
    oai = openai.OpenAI()

    review = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Review this Kubernetes config migration for:
1. Correctness: does it achieve the stated migration goal?
2. Security: any privilege escalations, missing RBAC, exposed secrets?
3. Subtle bugs: field renames, removed defaults, changed semantics between API versions?
4. Be specific about any issue. Explain exactly what breaks and why.

Context: {context}

Generated config:
{generated_yaml}"""
        }]
    ).choices[0].message.content

    # Crude gate: any named issue sends the config back for revision
    return {
        "review": review,
        "requires_revision": "issue" in review.lower() or "problem" in review.lower()
    }
```

The second model has no investment in the first model's output. That's the whole point. The first model wants to produce something that looks complete. The second model is explicitly tasked with finding what's wrong.
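Wiring this into a pipeline means closing the loop: generate, review, feed the critique back, and escalate to a human if the reviewer never signs off. A hypothetical driver sketch, where `generate` and `review` would wrap the actual model calls:

```python
def migration_loop(generate, review, max_rounds=3):
    """Generate, adversarially review, revise; stop when the reviewer
    signs off, or escalate to a human after max_rounds attempts.

    generate(feedback) -> str: first model; feedback is None on round 1.
    review(output) -> dict with "review" and "requires_revision" keys.
    """
    feedback = None
    output = None
    for round_num in range(1, max_rounds + 1):
        output = generate(feedback)      # pass 1 (or a revision pass)
        verdict = review(output)         # pass 2: adversarial reviewer
        if not verdict["requires_revision"]:
            return {"output": output, "rounds": round_num, "escalate": False}
        feedback = verdict["review"]     # feed the critique back in
    # Reviewer never signed off: a human takes over
    return {"output": output, "rounds": max_rounds, "escalate": True}
```

The `escalate` flag is the important part: the loop never ships something the reviewer still objects to.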

The SWE-bench Context

Worth noting: the same week as the Ladybird port, OpenAI announced they're deprecating SWE-bench Verified because the benchmark is compromised. At least 59.4% of the hard problems have flawed test cases. All frontier models can reproduce the "gold patch" verbatim — indicating training contamination. The leaderboard progress for the past six months is likely measuring memorization, not capability.

So we have: the headline AI coding benchmark is broken, and simultaneously, one of the clearest real-world proofs of AI-assisted migration capability just shipped.

The lesson isn't that AI coding is more or less capable than the benchmarks say. It's that benchmarks are a bad proxy for production results. Zero regressions on your actual test suite, with your actual code, is the only number that matters.

Kling had that number. 52,898 tests, zero failures.

What I'd Do Differently Starting Now

I've been using AI for config generation but not for systematic migrations. One-off tasks, not structured campaigns.

The Ladybird story changes that calculus for me. For the next major K8s upgrade cycle, I want to build:

  1. Prompt templates per migration type — not generic instructions, but spec'd-out task templates for each API version change, each Helm breaking change, each policy conversion
  2. Validation scripts per migration type — a validate-migration.sh that knows what "correct" looks like for each category
  3. Two-model review pass baked into the workflow — not optional, not manual, part of the pipeline for any change over 50 lines
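The first item might start as nothing fancier than a dict of bounded task templates, one per migration type; the names and wording here are hypothetical placeholders, not a real catalog:

```python
# Hypothetical per-migration-type task templates, each one bounded
# to a single manifest, mirroring the Step 1 prompt above.
MIGRATION_TEMPLATES = {
    "apps-v1beta1-to-v1": (
        "Convert this single Deployment manifest from apps/v1beta1 to apps/v1.\n"
        "Preserve all labels, annotations, and selectors.\n"
        "Add the required selector.matchLabels field.\n"
        "Do not change container specs, resource limits, or volume mounts.\n"
        "Input manifest:\n{manifest}"
    ),
    # "helm-values-breaking-change": ...,
    # "opa-to-kyverno": ...,
}


def build_task(migration_type: str, manifest: str) -> str:
    """Render one bounded, single-manifest task for the AI."""
    return MIGRATION_TEMPLATES[migration_type].format(manifest=manifest)
```

Keeping templates per migration type (rather than one generic "migrate this" prompt) is what preserves the small-unit, precise-constraint discipline across a whole upgrade campaign.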

The goal: migrations that are faster than doing them by hand, and more reliable than doing them by hand, because the verification is rigorous.

Kling showed that's achievable. The methodology is the whole thing.

What This Means for Engineering Leadership

If you're a CTO or VP Eng at a startup with a small platform team, the Ladybird result should change how you scope migration work. The assumption that "we'll do the K8s 1.32 → 1.35 API migration when we have bandwidth" is based on a cost model that may no longer be accurate.

A methodology that turns a multi-month manual effort into two weeks of structured AI-assisted work — with zero regressions — is worth understanding before you scope your next migration project.

The engineers who figure out this pattern first — human-directed, small-unit, adversarially reviewed — will have a structural productivity advantage that compounds over time. Not because they have better AI tools than everyone else. Because they have a better way of working with the tools everyone else also has.

Zero regressions is the bar. It's achievable. Kling just proved it publicly.

What's your current methodology for AI-assisted config or code migrations? Specifically curious whether anyone's running multi-model adversarial review in a real CI pipeline — and what that tooling looks like.
