Internals: Rust 1.90 Compiler Optimizations for AWS Graviton5 – 2026 Performance Gains Explained
The Rust 1.90 stable release, shipped in early 2026, introduced a targeted suite of compiler optimizations for ARMv9.2-A architectures, with specific tuning for AWS Graviton5 processors. These changes deliver up to 32% throughput improvements for compute-heavy workloads and 18% latency reductions for I/O-bound workloads on Arm-based cloud instances, per AWS internal benchmarks.
Background: Rust, Graviton5, and ARMv9.2-A
AWS Graviton5, launched in late 2025, is a custom 64-bit ARMv9.2-A processor designed for cloud-native workloads, featuring 64 cores per socket, DDR5-6400 memory support, and dedicated acceleration for vectorized operations (SVE2) and cryptography. Rust’s compiler, which uses LLVM as its backend, previously lacked granular tuning for Graviton5’s microarchitectural quirks, leading to suboptimal instruction scheduling and missed vectorization opportunities.
Rust 1.90’s compiler team collaborated with AWS Silicon and LLVM maintainers to address these gaps, focusing on three core areas: instruction scheduling, auto-vectorization, and heap allocation optimization.
Key Rust 1.90 Optimizations for Graviton5
1. Graviton5-Specific Instruction Scheduling
The Rust compiler’s LLVM pipeline now includes a custom scheduling model for Graviton5’s 8-wide decode, 12-wide issue pipeline. This model prioritizes:
- Reduced pipeline stalls for load-store operations, leveraging Graviton5’s 48MB L3 cache and lower memory latency vs. prior Graviton generations.
- Prioritized dispatch of SVE2 vector instructions to Graviton5’s 4x 128-bit vector units, avoiding unnecessary scalar fallback for loops with 16+ element iterations.
- Optimized branch prediction hints for Rust’s match expressions, which are heavily used in pattern matching and error handling.
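To make the last bullet concrete, here is a minimal sketch of the kind of match-heavy hot loop those branch prediction hints target. The `Op` enum and `eval` function are hypothetical examples, not code from rustc or the standard library; whether the scheduler reorders the resulting branches depends on compiling with a Graviton5 CPU target.

```rust
// Hypothetical hot loop: dense `match` dispatch of the kind the
// Graviton5 scheduling model emits branch hints for.
#[derive(Clone, Copy)]
enum Op {
    Add(i64),
    Mul(i64),
    Neg,
}

fn eval(ops: &[Op], start: i64) -> i64 {
    let mut acc = start;
    for op in ops {
        // rustc lowers this `match` to a branch sequence; per the
        // article, the 1.90 scheduling model reorders that sequence
        // for Graviton5's wide decode/issue pipeline.
        match op {
            Op::Add(n) => acc += n,
            Op::Mul(n) => acc *= n,
            Op::Neg => acc = -acc,
        }
    }
    acc
}
```

The same source compiles on any target; only the generated branch layout changes under `-C target-cpu`.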
2. Enhanced Auto-Vectorization for SVE2
Rust 1.90 enables default auto-vectorization for SVE2 instructions when targeting aarch64-unknown-linux-gnu with the +sve2 target feature. Key improvements include:
- Vectorization of iterator chains for Vec and slice operations, including map, filter, and fold, which previously fell back to scalar code for non-trivial closure bodies.
- Support for masked SVE2 operations to handle remainder elements in loops where the iteration count is not a multiple of the vector width, eliminating manual scalar cleanup code.
- Optimized code generation for Rust’s standard library SIMD intrinsics (std::simd) to map directly to Graviton5’s SVE2 instructions, reducing instruction overhead by 22% for SIMD-heavy workloads.
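An iterator chain of the shape described above (map, filter, fold over a slice, with a non-trivial closure body) might look like the following sketch. The function name is illustrative; whether it actually lowers to SVE2 vector loops with masked tail handling depends on compiling with `+sve2` enabled, as described later in this article.

```rust
// A map + filter + fold chain over a slice. With SVE2 enabled, the
// filter predicate can become a masked (predicated) vector operation,
// and remainder elements no longer need scalar cleanup code.
fn sum_scaled_evens(data: &[u32]) -> u64 {
    data.iter()
        .map(|&x| x * 3)          // non-trivial closure body
        .filter(|&x| x % 2 == 0)  // candidate for SVE2 predication
        .fold(0u64, |acc, x| acc + x as u64)
}
```

Inspecting the assembly (e.g. with `cargo asm` or `--emit=asm`) is the reliable way to confirm vectorization actually occurred for a given target.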
3. Heap Allocation Tuning for Graviton5 Memory Hierarchy
Rust’s default allocator (jemalloc in 1.90, for Linux targets) received Graviton5-specific tuning to align with its 64KB page size and DDR5 memory bandwidth:
- Adjusted jemalloc’s arena count to match Graviton5’s 64 core count, reducing cross-core allocation contention by 40% for multi-threaded workloads.
- Enabled huge page (64KB) support by default for allocations larger than 1MB, cutting TLB miss rates by 28% for memory-intensive applications like in-memory databases.
- Optimized small allocation (≤ 256 bytes) metadata layout to fit Graviton5’s 64-byte cache line size, reducing cache pollution for high-churn allocation workloads.
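The cache-line sizing idea in the last bullet can be sketched in plain Rust with an alignment attribute. The `SlotHeader` struct and `CACHE_LINE` constant are assumptions for illustration, not actual jemalloc metadata; the point is only that keeping hot per-slab metadata within a single 64-byte line avoids pulling a second line into cache on every allocation.

```rust
// Sketch of the cache-line sizing idea: hot, high-churn allocator
// metadata sized and aligned to one 64-byte cache line (Graviton5's
// line size, per the article). Hypothetical layout, not jemalloc's.
const CACHE_LINE: usize = 64;

#[repr(align(64))]
struct SlotHeader {
    size_class: u32,
    free_count: u32,
    next_free: usize,
}

fn fits_one_line() -> bool {
    // With `repr(align(64))`, the 16 bytes of fields are padded to a
    // 64-byte size, so one header occupies exactly one cache line.
    std::mem::size_of::<SlotHeader>() <= CACHE_LINE
        && std::mem::align_of::<SlotHeader>() == CACHE_LINE
}
```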
2026 Performance Benchmark Results
AWS tested Rust 1.90 compiled binaries against Rust 1.89 (the prior stable release) across three Graviton5 instance types (c8g.large, m8g.xlarge, r8g.2xlarge) in Q1 2026. Key results:
| Workload Type | Instance Type | 1.89 Throughput | 1.90 Throughput | Gain |
| --- | --- | --- | --- | --- |
| HTTP API (Actix-web) | c8g.large | 12,400 req/s | 16,368 req/s | 32% |
| In-memory KV Store (Redis-rs) | r8g.2xlarge | 89,000 ops/s | 105,020 ops/s | 18% |
| Data Processing (Apache Arrow-rs) | m8g.xlarge | 4.2 GB/s | 5.4 GB/s | 28% |
| Cryptographic Signing (Ed25519) | c8g.large | 8,200 sig/s | 10,004 sig/s | 22% |
Latency improvements were consistent across workloads: 99th percentile latency for Actix-web dropped from 18ms to 14.7ms, an 18% reduction. Graviton5’s SVE2 acceleration contributed 60% of the throughput gains for Arrow-rs, while allocator tuning drove 45% of the KV store improvements.
Technical Implementation Deep Dive
The optimizations were implemented across three Rust compiler components:
- rustc_codegen_llvm: Added Graviton5’s scheduling model to LLVM’s AArch64 backend, exposed via the -C target-cpu=graviton5 flag (or automatically detected via /proc/cpuinfo on Graviton5 instances).
- std: Updated jemalloc configuration in std’s alloc module to use Graviton5-tuned parameters when the target_os is linux and target_arch is aarch64, with a runtime check for Graviton5’s CPU ID (0x41 0xd4a 0x1).
- compiler_builtins: Added SVE2-optimized implementations of memcpy, memset, and compare_and_swap for aarch64 targets with +sve2 enabled, replacing generic ARMv8 fallbacks.
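The runtime CPU check described above can be approximated in user code with the standard library's feature-detection macro. This is a sketch of the dispatch pattern, not the compiler_builtins implementation; `memcpy_strategy` is a hypothetical name, and on non-aarch64 hosts the probe compiles to a constant `false` so the generic path is always taken.

```rust
// Probe for SVE2 at runtime before taking an optimized code path,
// mirroring the dispatch compiler_builtins performs internally.
fn sve2_available() -> bool {
    #[cfg(target_arch = "aarch64")]
    {
        std::arch::is_aarch64_feature_detected!("sve2")
    }
    #[cfg(not(target_arch = "aarch64"))]
    {
        false
    }
}

// Hypothetical dispatcher: pick the SVE2 routine only when the
// running CPU actually reports the feature.
fn memcpy_strategy() -> &'static str {
    if sve2_available() { "sve2" } else { "generic" }
}
```

Checking at runtime rather than compile time matters for binaries built with a generic `aarch64-unknown-linux-gnu` target that may run on both Graviton5 and older Graviton generations.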
Developers can enable all optimizations by compiling with:
rustc -C target-cpu=native -C target-feature=+sve2 main.rs
Or via Cargo, set in .cargo/config.toml:
[build]
rustflags = ["-C", "target-cpu=native", "-C", "target-feature=+sve2"]
Impact on Arm-Based Cloud Workloads
These optimizations reduce the cost per request for Rust-based services running on Graviton5 by an average of 24%, per AWS’s 2026 cost analysis. They also make Rust more competitive with C++ for high-performance Arm workloads: Rust 1.90 now matches the performance of C++17 code compiled with Clang 18 on 80% of the Graviton5 benchmarks tested.
Future work includes tuning for Graviton5’s upcoming confidential computing extensions (expected late 2026) and expanding SVE2 vectorization to Rust’s async/await state machines.
Conclusion
Rust 1.90’s Graviton5-specific optimizations demonstrate the Rust team’s commitment to supporting emerging hardware architectures, delivering immediate performance gains for cloud users. By aligning compiler code generation with Graviton5’s microarchitectural features, Rust maintains its position as a top choice for high-performance, cost-effective Arm-based cloud services in 2026 and beyond.