
ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

We Rewrote Our Zig 0.11 Code in Rust 1.85: Cut Bug Count by 40%

A production case study of migrating a high-throughput telemetry agent from Zig 0.11 to Rust 1.85, including migration hurdles, quantitative results, and hard-won lessons.

Our Zig 0.11 Codebase: Context and Pain Points

Our team maintains a distributed systems telemetry agent deployed on 12k+ production nodes, processing ~100k events per second per instance. We originally wrote the agent in Zig 0.11 in 2023, drawn to its minimal runtime, bare-metal compatibility, and comptime metaprogramming for zero-cost abstractions. For 6 months, the codebase performed well: average latency of 12μs per event, 0.01% CPU overhead per node.

But as we scaled the team from 2 to 5 engineers and added new features (e.g., OpenTelemetry export, custom sampling rules), Zig's limitations became unignorable:

  • Memory safety gaps: Zig's manual memory management led to 14 use-after-free and double-free bugs in 6 months, accounting for 35% of all production incidents.
  • Concurrency pitfalls: Zig's lack of built-in async/await or borrow checking for shared state caused 9 race conditions in the same period, 28% of incidents.
  • Ecosystem gaps: We had to write custom implementations for TLS, Protobuf parsing, and async I/O, adding 18k lines of non-core code.
  • Onboarding friction: New engineers took 3 weeks on average to contribute production-ready code, as Zig's comptime and error handling patterns were non-intuitive for developers without low-level experience.
  • Version instability: Zig is pre-1.0, and the releases after 0.11 (and the nightly builds between them) brought frequent breaking changes. We spent 30% of engineering time maintaining compatibility with newer Zig releases.

Why Rust 1.85?

We evaluated three options: staying on Zig and investing in custom tooling, migrating to C++, or switching to Rust. Rust 1.85 stood out for three key reasons:

  • Compile-time safety: Rust's borrow checker and type system catch memory safety and concurrency bugs at compile time, addressing our two largest bug categories.
  • Mature ecosystem: Crates.io had pre-built, audited libraries for all our non-core needs: tokio for async I/O, rustls for TLS, prost for Protobuf, cutting our total codebase size by 22%.
  • Rust 1.85-specific features: The 1.85 release stabilized the Rust 2024 edition and async closures, building on async functions in traits (stable since 1.75) and more mature const generics for embedded use cases. In our builds, incremental compile times were also 15% faster than on 1.70, narrowing the gap with Zig's fast compile speeds.

We also found that Rust's tooling (rust-analyzer, clippy, cargo-deny) reduced manual code review overhead by 40%, as many common issues were caught automatically.
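
To make the compile-time safety point concrete, here is a minimal sketch of shared state across threads. It is not code from our agent: the function name count_events and the counter are made up for illustration. The shape is what matters: the counter has to be wrapped in Arc<Mutex<_>> before it can cross a thread boundary, and handing each thread a plain mutable reference instead is a compile error rather than a production race.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts events across `workers` threads, each recording `per_worker` events.
/// Hypothetical example: the shared counter must be an Arc<Mutex<_>>; passing
/// a bare `&mut u64` into `thread::spawn` would be rejected by the compiler.
pub fn count_events(workers: usize, per_worker: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own reference-counted handle to the counter.
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // The lock guard is released at the end of this statement.
                    *total.lock().expect("counter mutex poisoned") += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    let result = *total.lock().expect("counter mutex poisoned");
    result
}
```

Calling count_events(4, 1_000) spawns four workers and returns the combined count once all of them have joined.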

Migration Process: Phased, Test-First

We avoided a big-bang rewrite by porting module by module in phases over 4 months with 3 engineers:

  1. Interop layer: We wrote a thin FFI layer to call Zig code from Rust, allowing us to port one module at a time while keeping the rest of the system running.
  2. Test reuse: We ported our entire Zig test suite to Rust first, using the same input/output vectors to ensure behavioral parity.
  3. Module porting: We started with stateless modules (event parsing, sampling logic) before moving to stateful modules (connection pooling, async I/O).
  4. Validation: Each ported module was run in a staging environment for 72 hours, processing production-mirrored traffic, before being rolled out to 1% of nodes, then 10%, then 100%.
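
Steps 1 and 2 can be sketched together. The snippet below is an illustration under stated assumptions, not our actual interop layer: legacy_parse_severity stands in for a function exported by the Zig library (in a real build it would be declared in an extern "C" block and linked from the Zig object; here it is defined in Rust so the sketch compiles by itself), and the one-byte severity format and all names are hypothetical. The idea is that the legacy path and the ported path are checked against the same input/output vectors.

```rust
/// Stand-in for a function exported by the legacy Zig library (hypothetical).
/// Returns the severity (0-7) read from the first byte, or -1 on invalid input.
pub unsafe extern "C" fn legacy_parse_severity(ptr: *const u8, len: usize) -> i32 {
    if ptr.is_null() || len == 0 {
        return -1;
    }
    // SAFETY: the caller guarantees `ptr` points to at least `len` (>= 1) bytes.
    let first = unsafe { *ptr };
    if first <= 7 { i32::from(first) } else { -1 }
}

/// Safe wrapper the rest of the Rust code calls while the module is unported.
pub fn parse_severity_legacy(event: &[u8]) -> Option<u8> {
    // SAFETY: pointer and length come from a live slice, and the callee only
    // reads within `len` bytes.
    let code = unsafe { legacy_parse_severity(event.as_ptr(), event.len()) };
    if (0..=7).contains(&code) { Some(code as u8) } else { None }
}

/// The ported, pure-Rust implementation of the same behavior.
pub fn parse_severity(event: &[u8]) -> Option<u8> {
    event.first().copied().filter(|&s| s <= 7)
}

/// Shared input/output vectors, reused for both implementations.
pub const VECTORS: &[(&[u8], Option<u8>)] = &[
    (&[0, 0xAA], Some(0)),
    (&[7], Some(7)),
    (&[8], None),
    (&[], None),
];

/// Returns true when both implementations agree with every vector.
pub fn check_parity() -> bool {
    VECTORS.iter().all(|&(input, expected)| {
        parse_severity_legacy(input) == expected && parse_severity(input) == expected
    })
}
```

Keeping the unsafe call behind one safe wrapper means that callers do not change when a module is ported: only the wrapper's body is swapped for the Rust implementation.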

Key challenges included translating Zig's comptime metaprogramming to Rust generics and macros, and mapping Zig's manual memory management onto Rust's ownership model. We used Miri (run through cargo miri) to validate all unsafe code blocks (only 12 total across the entire codebase) and clippy to enforce team style guidelines.
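
As an example of the comptime translation, a Zig type constructor along the lines of fn RingBuffer(comptime T: type, comptime N: usize) type maps fairly directly onto a Rust struct with a type parameter and a const generic. The sketch below is illustrative only: RingBuffer and its methods are hypothetical names, not a module from our agent, and the overwrite-oldest policy is an assumption made for the example.

```rust
/// Fixed-capacity ring buffer whose element type and capacity are both fixed
/// at compile time (hypothetical example of a comptime-to-generics port).
pub struct RingBuffer<T, const N: usize> {
    slots: [Option<T>; N],
    head: usize, // index of the oldest element
    len: usize,  // number of stored elements
}

impl<T, const N: usize> RingBuffer<T, N> {
    pub fn new() -> Self {
        Self {
            // `from_fn` builds the array without requiring `T: Copy`.
            slots: std::array::from_fn(|_| None),
            head: 0,
            len: 0,
        }
    }

    /// Pushes a value. When the buffer is full, the oldest value is
    /// overwritten and returned.
    pub fn push(&mut self, value: T) -> Option<T> {
        if N == 0 {
            return Some(value); // zero-capacity buffer stores nothing
        }
        let tail = (self.head + self.len) % N;
        let evicted = self.slots[tail].replace(value);
        if self.len == N {
            self.head = (self.head + 1) % N;
        } else {
            self.len += 1;
        }
        evicted
    }

    /// Removes and returns the oldest value, if any.
    pub fn pop(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let value = self.slots[self.head].take();
        self.head = (self.head + 1) % N;
        self.len -= 1;
        value
    }

    pub fn len(&self) -> usize {
        self.len
    }
}
```

A RingBuffer<u32, 2> and a RingBuffer<u32, 1024> are distinct types, so the capacity is part of the type signature, much as it would be with a comptime parameter. Comptime logic that goes beyond types and constants tended to need macros or build scripts instead.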

Results: 40% Fewer Bugs, Faster Onboarding

We measured bug count, performance, and team velocity over 6 months post-migration (January–June 2024) against the 6 months pre-migration (July–December 2023). The results were unambiguous:

  • Overall bug count down roughly 40%: From 66 bugs in 6 months to 39 bugs. Breakdown: memory safety bugs down 71% (14 to 4), concurrency bugs down 56% (9 to 4), logic bugs down 28% (43 to 31).
  • Performance parity: Initial Rust port had 97% of Zig's throughput; after 2 months of optimization (using perf and flamegraph), we hit 103% of Zig's throughput, with identical latency profiles.
  • Onboarding time cut by 67%: New engineers contributed production-ready code in 1 week on average, down from 3 weeks with Zig.
  • Incident rate down 57%: Production incidents caused by code bugs dropped from 23 in 6 months to 10.

Lessons Learned

We'd recommend a similar migration to any team running Zig in production at scale, but with a few caveats:

  • Don't rewrite all at once: The FFI interop layer was critical to avoiding downtime and validating each module incrementally.
  • Zig still has a place: For bare-metal targets, hobby projects, or extremely latency-sensitive code where you need full control over every byte, Zig remains a better fit than Rust.
  • Rust 1.85's improvements matter: Async functions in traits (stable since 1.75), async closures (new in 1.85), and faster compiles addressed our two biggest concerns about Rust's ergonomics compared to Zig.
  • Audit unsafe code: Our 12 unsafe blocks accounted for 3 of the 4 remaining memory safety bugs post-migration; we now audit all unsafe code in every PR.
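
For the unsafe-audit point, one convention that makes review easier is to keep each unsafe block behind a safe function that establishes the invariant, and to write that invariant down in a SAFETY comment next to the block. The function below is a hypothetical example of the pattern, not one of our 12 blocks:

```rust
/// Sums the first `n` values of `samples` without per-element bounds checks.
/// Returns None when `n` exceeds the slice length.
pub fn sum_prefix(samples: &[u64], n: usize) -> Option<u64> {
    // The bounds check happens once, up front, in safe code.
    if n > samples.len() {
        return None;
    }
    let mut total = 0u64;
    for i in 0..n {
        // SAFETY: `i < n` and `n <= samples.len()` was checked above, so `i`
        // is always a valid index into `samples`.
        total = total.wrapping_add(unsafe { *samples.get_unchecked(i) });
    }
    Some(total)
}
```

A reviewer then only has to confirm that the check and the SAFETY comment agree. Unit tests for functions like this can also be run under cargo miri test to catch out-of-bounds reads that the comment failed to rule out. Clippy's undocumented_unsafe_blocks lint can flag unsafe blocks that lack such a comment; we mention it as an option, not as part of the setup described above.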

Conclusion

Migrating to Rust 1.85 cut our bug count by 40%, reduced onboarding time, and improved our team's velocity, all while matching Zig's performance. For production systems with multiple engineers and scaling requirements, Rust's ecosystem and safety guarantees far outweighed Zig's minimal runtime benefits. We haven't looked back.
