DEV Community

John Rooney for Extract by Zyte


The compiler caught a lot. It didn't catch enough.

I built a small web scraping framework in Rust, mostly with an AI doing the typing. It's called ferrous — a Colly-style collector: register CSS selector callbacks, queue URLs, write JSONL. About 700 lines. The pitch I kept hearing, and half-believed, was that Rust and LLMs are a good match now: the borrow checker is a correctness oracle the model can lean on, so the class of bugs that plagues AI-written Python just won't compile.

That's true. It's also where the story gets uncomfortable, because the build was green and the code was still wrong.

How I worked

I'm not a Rust native. ferrous was partly an excuse to get fluent (build something real instead of reading about lifetimes) and partly a test of how far an LLM could carry the typing while I drove. The loop was plain: describe the next change in English, let the model write the Rust, read what came back, run cargo, move on. The model kept observations.md as a running design journal, one entry per change, each with a short rationale for the decision it made.

That setup has a soft spot, and it's the whole point of this post. When you drive a language you don't fully know, the only reviewer you've got with real authority is the compiler. Everything past that — is this idiomatic, is it the right abstraction, does it actually do what the journal claims — depends on already knowing what correct looks like. Which is exactly the knowledge a learner doesn't have yet. Keep that in mind through the next part, because it's the difference between the one bug I could have caught and the six I couldn't.

The bug that the toolchain told me wasn't there

I ship two fetch backends. The default goes through the Zyte API; an optional one, gated behind a wreq feature, makes direct requests with browser TLS emulation. Each has an example. After a refactor that added URL resolution (ctx.resolve_and_visit(href), so callbacks stop hand-building absolute URLs), I had the model update the examples to use it. It did, and it logged the change in its design journal:

Added ctx.resolve(href) and ctx.resolve_and_visit(href) ... The examples were updated to use ctx.resolve_and_visit(&href).

cargo test passed. cargo check passed. I believed the journal. Here is the actual line in the feature-gated example:

.on_html("li.next a", |el, ctx| {
    if let Some(href) = el.attr("href") {
        ctx.resolve_and_visit(href);   // href is String; the method wants &str
    }
})

resolve_and_visit takes &str. href is a String. That does not compile — String doesn't coerce to &str in argument position, only &String does. The Zyte example got the &; the wreq one didn't. The model fixed one of the two call sites it claimed to have fixed, and reported both as done.
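To make the coercion rule concrete, here is a small self-contained sketch. The free function resolve_and_visit below is a hypothetical stand-in with the same &str parameter the post describes, not ferrous's real method; it only shows which call forms compile.

```rust
// Hypothetical stand-in with the same parameter type as ctx.resolve_and_visit.
fn resolve_and_visit(href: &str) -> String {
    format!("queued {href}")
}

fn follow(href: String) -> Vec<String> {
    let mut queued = Vec::new();
    // resolve_and_visit(href);  // E0308: expected `&str`, found `String`
    queued.push(resolve_and_visit(&href)); // &String deref-coerces to &str
    queued.push(resolve_and_visit(href.as_str())); // explicit &str view of the String
    queued.push(resolve_and_visit(&href[..])); // full-range slice, also a &str
    queued
}
```

All three working forms borrow the String rather than moving it, which is why the Zyte example's `&href` compiled and the bare `href` did not.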

So why was the build green? Because the broken example is behind a feature flag, and cargo check and cargo test don't build it by default. You have to ask:

$ cargo check --all-targets --all-features
error[E0308]: mismatched types
  --> examples/books_direct.rs:20:39
   |
20 |                 ctx.resolve_and_visit(href);
   |                     ----------------- ^^^^ expected `&str`, found `String`

This is the part worth sitting with. The compiler would have caught it: it's a textbook E0308, the friendliest error rustc produces, complete with "help: consider borrowing here." The type system did its job perfectly. It just never ran on that file, because the default build doesn't enable the wreq feature, and nobody, not me and not the model, pointed it at the path where the error lived. The oracle was switched off for exactly the line that needed it, and the green checkmark covered the gap.
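The post doesn't show exactly how the wreq example is gated, so the single-file sketch below assumes an item-level #[cfg(feature = "wreq")] to reproduce the same effect. With the feature off, the compiler strips the gated function during configuration, before type checking, so the identical String-for-&str mistake sits in a file that builds cleanly. Ctx, follow_via_api, and follow_direct are all hypothetical names.

```rust
// Hypothetical stand-in for ferrous's callback context.
struct Ctx {
    queue: Vec<String>,
}

impl Ctx {
    fn resolve_and_visit(&mut self, href: &str) {
        self.queue.push(href.to_string());
    }
}

// Always compiled, so every plain `cargo check` type-checks this call.
fn follow_via_api(ctx: &mut Ctx, href: String) {
    ctx.resolve_and_visit(&href);
}

// Compiled only with the `wreq` feature on. With it off, this item is removed
// during cfg expansion, before type checking, so the mismatch is never reported.
// It still has to parse.
#[cfg(feature = "wreq")]
fn follow_direct(ctx: &mut Ctx, href: String) {
    ctx.resolve_and_visit(href); // E0308, but only when built with the feature
}
```

In a crate that actually declares a wreq feature, the --all-features flag shown above is what turns the gated item back on and lets the type checker see the bad call.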

What the compiler buys you, honestly

I don't want to undersell the good part, because it's real and it's specific. The most interesting moment in the whole project was the model refactoring the Element type. Originally it stored matched HTML as a string and re-parsed it on every field access — three accessors, three parses of the same fragment. The fix was to parse once and store the scraper::Html. But Html is !Send, and the crawl loop runs callbacks inside spawned tasks. The model reasoned about this explicitly:

Html is !Send but this is safe: Element is only created and consumed inside synchronous callbacks and is never stored in an Arc or sent across threads.

And it structured the crawl loop to hold that invariant — parse the document, run every callback, drain the results into owned Vecs, and drop the Html before the first .await:

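let (all_visits, all_items) = {
    let doc = Html::parse_document(&html);
    let mut all_visits = Vec::new();
    let mut all_items = Vec::new();
    // ... run every callback against `doc`, pushing owned results
    //     into all_visits and all_items ...
    (all_visits, all_items)
}; // `doc` is dropped at this brace, before any .await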

In Python this reasoning is a comment you hope stays true. In Rust, if the model had gotten it wrong — held the Html across the await, stuffed an Element into the task's captured state — it wouldn't compile. The Send bound is load-bearing. That's the version of "the compiler helps the AI" that actually holds up: it converts a class of concurrency mistakes into build failures, and the model can lean on that to attempt refactors it would otherwise have to be timid about.
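A runnable sketch of that invariant, under stated assumptions: Doc is a stand-in for scraper::Html that holds an Rc, which makes it !Send in the same way; next_fetch stands in for the network call the loop awaits; assert_send imposes the same Send bound tokio::spawn puts on its future; and block_on is a tiny busy-polling executor so the sketch runs on std alone. None of this is ferrous's code.

```rust
use std::future::Future;
use std::pin::pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Stand-in for scraper::Html: the Rc makes it !Send.
struct Doc {
    tokens: Rc<Vec<String>>,
}

impl Doc {
    fn parse(html: &str) -> Doc {
        Doc { tokens: Rc::new(html.split_whitespace().map(str::to_string).collect()) }
    }
}

// Stand-in for the next network call the crawl loop awaits.
async fn next_fetch() {}

// Parse, extract owned results, drop the !Send doc, and only then await.
async fn crawl_one(html: String) -> (Vec<String>, usize) {
    let (visits, items) = {
        let doc = Doc::parse(&html);
        let visits: Vec<String> =
            doc.tokens.iter().filter(|t| t.starts_with("href=")).cloned().collect();
        let items = doc.tokens.len();
        (visits, items)
    }; // `doc` is dropped here, before the await below
    next_fetch().await;
    (visits, items)
}

// Same bound tokio::spawn places on its future: if this call compiles, the future is Send.
fn assert_send<F: Future + Send>(fut: F) -> F {
    fut
}

// Minimal no-op waker so the sketch can run without an async runtime.
fn vt_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &NOOP_VTABLE)
}
fn vt_noop(_: *const ()) {}
static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

// Busy-polls a future to completion; fine here because every future is immediately ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    // SAFETY: none of the vtable functions touch the (null) data pointer.
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Moving next_fetch().await inside the inner block, while doc is still alive, makes the assert_send call fail to compile: the future would then hold an Rc across the await point, which is the mistake the compiler stands guard against.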

The same goes for the ordinary stuff. Widening push_item(Value) to push_item<T: Serialize>, swapping a spin-polling semaphore loop for a JoinSet, threading a FetchError enum through the fetch path — these landed cleanly and idiomatically, because they're well-trodden Rust patterns and the types kept the model honest about the seams.

The class of bugs that compiles fine

Then I had a second model do a senior-review pass over the finished code, and it found seven things the green build was hiding. None of them are type errors. All of them compile.

The library panics on normal user mistakes — an invalid selector, a missing API key, a bad output path all unwrap/expect their way to a crash, even though run() returns a ScrapeResult as if failure were representable. concurrency(0) doesn't error; the inner while tasks.len() < 0 loop never spawns anything, tasks.is_empty() is immediately true, and the crawl exits having silently processed nothing. join_next().await discards its Result, so a panicking callback vanishes and the run still reports success. And the stats are quietly miscounted: fetch_errors never increments for HTTP 4xx/5xx — only for network and parse failures — so a crawl that 500s on every page can report fetch_errors: 0 and had_errors() == false. The README, meanwhile, promises that successful status codes are tracked; they aren't recorded at all.
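One possible fix for the concurrency(0) case, sketched under assumptions: the builder shape below (Collector, a concurrency method that returns Result, ConfigError) is hypothetical and not ferrous's API. Storing the limit as std::num::NonZeroUsize makes zero unrepresentable once the builder accepts it, so the bad value surfaces as an error at configuration time instead of a crawl that silently does nothing.

```rust
use std::num::NonZeroUsize;

// Hypothetical configuration error.
#[derive(Debug, PartialEq)]
enum ConfigError {
    ZeroConcurrency,
}

// Hypothetical builder fragment: only the concurrency limit is modeled.
#[derive(Debug)]
struct Collector {
    concurrency: NonZeroUsize,
}

impl Collector {
    fn new() -> Self {
        // Cannot fail: the literal is non-zero.
        Collector { concurrency: NonZeroUsize::new(4).expect("4 is non-zero") }
    }

    // Reject 0 up front instead of letting the spawn loop run zero times.
    fn concurrency(mut self, n: usize) -> Result<Self, ConfigError> {
        self.concurrency = NonZeroUsize::new(n).ok_or(ConfigError::ZeroConcurrency)?;
        Ok(self)
    }
}
```

This pushes one semantic rule into the type system; whether a 500 counts as an error still can't be expressed this way.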

Every one of these is the same shape of mistake: code that satisfies the types and the test suite while doing the wrong thing. The compiler has nothing to say about whether concurrency(0) is meaningful, whether a 500 counts as an error, or whether your docs match your behavior. Those are semantic claims, and semantics is exactly the layer where LLMs are fluent and confident and wrong — and it's the layer Rust doesn't police.

There's a tidy demonstration of this sitting in the repo: the design journal and the README both describe the intended state of the code, not its actual state. The journal says both examples were fixed; one wasn't. The README says status codes are tracked; only some are. The model writes the world as it meant to leave it, and prose has no type checker.

You can't review what you can't read

Go back through those findings, and the &href bug before them, and ask what it would have taken to catch each one by reading. concurrency(0) exiting silently: you'd have to see that with a limit of zero, while tasks.len() < 0 can never be true for an unsigned length, then trace what an empty JoinSet does on the next line. The fetch_errors miscount: you'd have to be holding the intended meaning of "error" in your head and notice that the HttpError arm bumps a status counter where you expected it to bump the error count. The &href that wouldn't compile: you'd have to know that String doesn't coerce to &str in argument position, the single Rust fact the entire bug turns on.
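To see why the fetch_errors miscount reads as plausible, here is a hypothetical version of that match with the missing line restored. The post names FetchError, an HttpError arm, fetch_errors, and had_errors(), but not their definitions, so every shape here (the Stats struct, the status_counts map, record taking a Result<u16, FetchError>) is an assumption. The buggy version is the same code minus one line in the HttpError arm, and it compiles just as cleanly.

```rust
use std::collections::BTreeMap;

// Hypothetical error enum; only the HttpError name comes from the post.
#[derive(Debug)]
enum FetchError {
    HttpError(u16),
    Network,
    Parse,
}

// Hypothetical crawl statistics.
#[derive(Debug, Default)]
struct Stats {
    pages_ok: usize,
    fetch_errors: usize,
    status_counts: BTreeMap<u16, usize>,
}

impl Stats {
    // Record one fetch: Ok carries the HTTP status of a successful page.
    fn record(&mut self, result: &Result<u16, FetchError>) {
        match result {
            Ok(status) => {
                // Track successful statuses too, as the README promises.
                *self.status_counts.entry(*status).or_insert(0) += 1;
                self.pages_ok += 1;
            }
            Err(FetchError::HttpError(status)) => {
                *self.status_counts.entry(*status).or_insert(0) += 1;
                // The line the miscount lacked: a 4xx/5xx is also a fetch error.
                self.fetch_errors += 1;
            }
            Err(FetchError::Network) | Err(FetchError::Parse) => self.fetch_errors += 1,
        }
    }

    fn had_errors(&self) -> bool {
        self.fetch_errors > 0
    }
}
```

Nothing in the types distinguishes the correct arm from the buggy one; only knowing what "error" is supposed to mean does.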

None of these are exotic. They're the things you internalize after enough hours in the language. But that's precisely the trap when a learner pairs with an AI: the model emits code that looks right, reads fluently, and compiles, and the only way to know it's wrong is to already know the thing you were hoping the model would handle for you. Knowing Rust well wouldn't have stopped the model from writing the concurrency(0) hole — it would have stopped me from nodding past it in review.

So the green build gets promoted past its pay grade. When you can't evaluate the semantics yourself, "it compiles" slides from "not obviously broken" to "works," because compiling is the only check you're actually equipped to read. That promotion is the real hazard of writing an unfamiliar language with a model that writes it confidently. The bugs didn't come from the model being bad at Rust; it handles the syntax better than I do. They came from neither of us being positioned to see the gap between code that satisfies the types and code that does the right thing. The model can't see it because it's a semantic claim about intent; I couldn't see it because I didn't yet know what right looked like.

So, is Rust better for AI now?

For the bugs Rust can see, yes, unambiguously, and more than I expected before I watched the !Send refactor go through. The borrow checker and the trait system catch a real category of AI error at compile time, and that lets a model attempt more aggressive changes without silently corrupting state. The floor is genuinely higher than in a dynamic language.

But the floor isn't the problem. Most of what was wrong with ferrous compiled, passed its tests, and was described accurately by documentation that was itself false. The compiler raised the floor; it did nothing for the ceiling, and feature flags punched a hole in even the floor by hiding a target from the default build. "It compiles, the tests are green, and the model says it's done" turned out to be three independent false comforts stacked on top of each other.

The lesson I'd actually act on has two parts. The mechanical one is cheap: the green checkmark is scoped and the model doesn't know the scope, so run --all-features --all-targets on every check, read the diff instead of the summary of the diff, and treat anything the model asserts about its own output — "both examples updated," "status codes tracked" — as a claim to verify rather than a result to log.

The other part is slower and matters more. If you're using an AI to write a language you're still learning, the AI is not a substitute for learning it — it's the thing that makes the learning feel optional right up until a 500 gets counted as a success. The type system is a fast, narrow oracle that flags the bugs it can see and stays silent on the ones that matter most. Closing that silence is on you, and you can only close it by knowing the language well enough to read what the model wrote and see where it's quietly wrong. I came out of ferrous knowing more Rust than I went in with. That, more than the framework, was the point — and it's the only thing that would have caught the other six.
