Fernando Rodriguez

Posted on • Originally published at frr.dev

# RustyClaw: I'm rewriting an AI agent in Rust (because the meme demands it)

*"You know what’s great about Rust? It doesn’t let you compile crappy code. You know what sucks? Everything you write at the beginning **is** crappy code."*
— Mr. Krabs, probably

What’s better than an AI agent? An AI agent rewritten in Rust.

If you’ve spent more than five minutes on the internet, you’re aware of the meme. It doesn’t matter what the project is—text editor, DNS server, BMI calculator. Someone will inevitably comment, "you should rewrite it in Rust." It’s the Rewrite It In Rust meme—RIIR for friends—and it’s as unavoidable as gravity.

Well, I’m actually doing it. I’m going to port 8,300 lines of a Python AI agent to Rust. But not just because the meme demands it (okay, maybe a little). I’m doing it because I need a guinea pig.

## The thesis

For weeks now, I’ve been writing about silent failures, about the five defenses against hallucinations, about how an LLM can generate code that compiles, passes tests, and is still wrong. I even gave it a name: adversarial development. Never trust, always verify.

A lot of theory. Now it’s time to prove it.

I needed a project with three key traits: constrained scope (not a new app with ever-changing requirements), a clear source of truth (the Python code that already works), and enough complexity for the LLM’s hallucinations to have room to hide. A pure port checks all three boxes: the input and expected output already exist. If the Rust version doesn’t behave exactly like the Python one, there’s a bug. Simple as that.

And since I’m going to port something anyway, why not use it as an opportunity to properly learn Rust? The borrow checker, ownership, lifetimes... I’ve spent years reading all about them and touching none of them. Maybe things would be different if I stopped reading tutorials for the 20th time and actually tackled a real project.

## The patient

It’s called nanobot. It’s a personal AI agent derived from OpenClaw: a nifty tool that links LLMs (Claude, GPT, DeepSeek, you name it) to chat channels—Telegram, Discord, Slack, email—and gives them hands. It can read/edit files, run commands, browse the web, schedule cron tasks, and maintain persistent memories between conversations.

It works. It’s been running fine. In Python.

What’s the problem? It’s single-threaded. One message at a time. Send it three messages back-to-back, and they queue up like a Saturday morning line at Walmart. It uses about 50MB of RAM to essentially shuffle JSON between APIs. And its error handling is the type you’re embarrassed about: `return f"Error: {str(e)}"` scattered all over.

To put it bluntly: it works, but it’s a giant hack. Perfect candidate.

## Why Rust (besides the meme)?

I could fix it in Python. I could dial up the asyncio, tighten up error-handling with custom exceptions, and optimize memory. The sane option.

But sane doesn’t give me a test bench for adversarial development. Refactoring in Python lacks an external source of truth—the "before" and "after" would share language, libraries, and the LLM’s biases. A port to a different language? That’s different. If Rust’s output differs from Python’s for the same input, somebody’s lying. And that’s exactly the kind of verification I want to test.

Plus, Rust comes with properties that make the experiment more interesting:

- **The compiler as a first line of defense.** Nulls, type mismatches, data races—entire categories of bugs that might silently creep into Python won’t even compile in Rust. How many LLM hallucinations can the compiler block before they ever reach a test? I want to measure that.
- **True concurrency.** `tokio` makes it trivial to spawn one task per conversation. In Python, that’s a pain. This is the one functional improvement that really justifies the port.
- **Static binaries.** A 10MB executable instead of a `pip install` with 47 dependencies. That’s a win for distribution.
- **It’s cool.** Not technically a reason, but I don’t care.
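To make the concurrency point concrete, here’s a minimal sketch of the fan-out pattern. It uses OS threads and a channel from the standard library so it stays dependency-free; the real port would use `tokio::spawn` with async handlers, and `handle` is a hypothetical stand-in for the actual message handler:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the real message handler (an LLM call in the actual agent).
fn handle(msg: String) -> String {
    format!("echo: {msg}")
}

// Spawn one worker per incoming message and collect replies as they finish.
// The real port would use tokio tasks instead of OS threads, but the shape
// is the same: no message waits in line behind another.
fn fan_out(msgs: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    for msg in msgs {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(handle(msg)).expect("receiver alive");
        });
    }
    drop(tx); // close the channel so the collecting iterator terminates
    rx.iter().collect()
}
```

Compare that to the Python version, where three back-to-back messages serialize behind a single event loop task.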

## The adventure (and the invite)

RustyClaw—that’s the port’s name—is going to be a publicly documented experiment. Each module I port will be its own blog post. With real data: how many tokens used, cost, how often the AI hallucinated, and how long I fought with the borrow checker. No sugarcoating.

If I spend 3 hours on something I could have done in Python in 10 minutes, I’ll admit it. If the LLM invents a non-existent crate (spoiler: it will), I’ll detail it. If I realize at the end this port wasn’t worth it, I’ll confess to that too.

Everyone says, "I used AI to write code." No one publishes how much it cost, how often it lied to them, or if the code held up in production. That’s exactly what I’m going to do.

And I want you to come along for the ride. Because this is going to be an adventure—filled with compiler battles, "WHY WON’T THIS COMPILE IT’S OBVIOUS" moments, and small victories when a differential test passes green. It’s going to be fun. Or, at the very least, honest.

## The stack (cheat sheet)

If you’re a Pythonista, the left column will look familiar. If you’re a Rustacean, the right. If you’re neither, welcome to the chaos.

| Layer | Python (nanobot) | Rust (rustyclaw) |
| --- | --- | --- |
| Async runtime | asyncio | tokio |
| HTTP | httpx | reqwest |
| LLM routing | litellm | Nonexistent — custom router |
| Telegram | python-telegram-bot | teloxide |
| Discord | websockets (raw) | tokio-tungstenite (raw) |
| Config | pydantic | serde + figment |
| CLI | typer | clap |
| Errors | `str(e)` | anyhow + thiserror |
| Logging | loguru | tracing |
| AI copilot | Claude Code + Codex | Claude Code + Codex |
| Task runner | make | just |
| Issue tracker | Linear CLI | Linear CLI |

The row that hurts most is LiteLLM. In Python, it routes 100+ LLM providers through a single call. Nothing comes close in Rust, so I’ll need to roll my own router. The upside? About 80% of LLM providers conform to OpenAI’s API, so between `async-openai` and a custom base URL, most use cases are covered. Anthropic will need its own implementation.
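For illustration, the routing core could be as small as this. Everything here is an assumption about the eventual design: the enum, the model-prefix matching, and the base URLs are placeholders, not final code.

```rust
// Hypothetical routing sketch: map a model id to a provider backend.
// OpenAI-compatible providers share a request shape and differ only in
// their base URL; Anthropic gets its own branch and implementation.
#[derive(Debug, PartialEq)]
enum Provider {
    OpenAiCompatible { base_url: &'static str },
    Anthropic,
}

fn route(model: &str) -> Provider {
    match model {
        m if m.starts_with("claude") => Provider::Anthropic,
        m if m.starts_with("deepseek") => Provider::OpenAiCompatible {
            base_url: "https://api.deepseek.com",
        },
        // Default: assume an OpenAI-compatible endpoint.
        _ => Provider::OpenAiCompatible {
            base_url: "https://api.openai.com",
        },
    }
}
```

The real router will need more than prefix matching (API keys, retries, streaming), but this is the 20% of litellm the agent actually uses.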

Maybe ~300 lines of Rust. Sounds manageable. Sounds.

## Anti-hallucination strategy (the serious bit)

This is where the adversarial development theory meets reality. An LLM assisting in a port this size is a machine for plausibly inventing things.

The top risk isn’t that the code won’t compile—Rust doesn’t let garbage compile. The risk is that it compiles, passes tests, and silently does the wrong thing. Exactly the silent failure I wrote about two weeks ago.

Five layers of defense:

1. Rust’s compiler. Eliminates nulls, type mismatches, and data races. First free line of defense. But just because it compiles doesn’t make it right.

2. Differential tests. Same input → Python nanobot → output. Same input → RustyClaw → output. If they don’t match, something’s off. The Python code is the source of truth. This is the backbone of the experiment.

3. Provenance tracking. Each ported file gets a header with its original Python source, the LLM session that produced it, and its differential test results. Total traceability.

4. Crate verification. Every crate suggested by the LLM → manually verify on crates.io and docs.rs. LLMs will confidently propose non-existent crates and APIs that just don’t work.

5. Incident logging. Every detected hallucination → an issue logged with a hallucination label. Material for posts and lessons learned.

The golden rule:

The verification system must be external to the generator.

If the LLM writes the code, the tests, and the fixtures, you’re validating fiction with fiction. Differential testing against the original Python code naturally breaks the cycle and makes the port inherently verifiable.
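A minimal version of that external check might look like this. The normalization rules (line endings, trailing whitespace) are my guess at what a real harness needs; the actual runner would shell out to both binaries and feed their outputs into `outputs_match`:

```rust
// Normalize output before comparing, so the check flags real behavioral
// differences rather than cosmetic ones (CRLF vs LF, trailing spaces).
fn normalize(out: &str) -> String {
    out.replace("\r\n", "\n")
        .lines()
        .map(str::trim_end)
        .collect::<Vec<_>>()
        .join("\n")
}

// Differential check: same input through Python nanobot and RustyClaw,
// normalized outputs must be identical. Python is the source of truth.
fn outputs_match(python_out: &str, rust_out: &str) -> bool {
    normalize(python_out) == normalize(rust_out)
}
```

The key property: neither the generator LLM nor I get to define what "correct" means. The running Python code does.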

## Does it matter?

So, the uncomfortable question—does porting this to Rust even matter?

| Metric | Python | Rust (estimated) | Does it matter? |
| --- | --- | --- | --- |
| Response latency | ~200ms overhead | ~5ms overhead | No. The LLM takes 2-5 seconds anyway. |
| RAM | ~50MB | ~5MB | No. My server has 8GB. |
| Concurrency | 1 message at a time | N messages in parallel | Yes. |
| Startup time | ~2s | ~50ms | Meh. |
| Binary | pip install + 47 deps | Single executable | Yes. |
| Type safety | `str(e)` everywhere | `Result<T, E>` | Yes. |
| The cool factor | None | High | Subjective. |

Three out of seven. Four, if we’re being generous. The latency and RAM improvements are meaningless since the bottleneck is always the LLM call. Concurrency matters for multiple users. A static binary is a real upgrade. And the type safety? After seeing how many bugs str(e) lets fly under the radar for months, yeah, that matters.
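To show what that type-safety row means in practice, here’s an illustrative contrast with `str(e)`. The error type and the `read_note` function are invented for the example; the real port would derive this boilerplate with `thiserror` and propagate errors with `anyhow`.

```rust
use std::fmt;

// Illustrative only: a typed error instead of Python's str(e).
// Every failure mode is a named variant the caller must handle.
#[derive(Debug, PartialEq)]
enum ToolError {
    FileNotFound(String),
    Timeout { secs: u64 },
}

impl fmt::Display for ToolError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToolError::FileNotFound(p) => write!(f, "file not found: {p}"),
            ToolError::Timeout { secs } => write!(f, "timed out after {secs}s"),
        }
    }
}

// Hypothetical tool function: the signature forces callers to confront
// failure. Nothing flies under the radar as a formatted string.
fn read_note(path: &str) -> Result<String, ToolError> {
    if path == "notes.md" {
        Ok("remember the milk".to_string())
    } else {
        Err(ToolError::FileNotFound(path.to_string()))
    }
}
```

In Python, a swallowed exception becomes `"Error: ..."` text that an LLM happily paraphrases to the user. In Rust, an unhandled `Result` is a compiler warning staring back at you.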

Does it justify weeks of work? As a standalone port, probably not. As a testbed for adversarial development with published real-world data? I think yes. By the end of this series, we’ll have hard numbers—not opinions.

## The raw numbers

Every work session will be logged in a public CSV in the repo:

```csv
date,llm,model,module,tokens_in,tokens_out,cost_usd,duration_min,loc_python,loc_rust,hallucinations,tests_pass
```

Which LLM I used, tokens consumed, cost, duration, lines ported, hallucinations detected, tests passed. It’ll all be public. All verifiable.

At the end of this series, anyone will be able to sum up `cost_usd` and decide if RIIR was worth it. Anyone will be able to count hallucinations and decide if adversarial development works or is just hype. Spoiler: I have no idea what the numbers will be. And that’s what makes it interesting.

## Join me

- **Repo:** [github.com/frr149/rustyclaw](https://github.com/frr149/rustyclaw)—code, issues, tracking
- **Blog:** Each phase will have its own post here in the *RustyClaw: Rewrite It In Rust* series
- **Backlog:** Public on Linear, visible via GitHub issues

What’s better than an AGI? An AGI rewritten in Rust. Just ask the meme. Now let’s prove it.
