Postmortem: A Rust 1.85 Panic Caused Our Robotics Controller to Reboot 12 Times in Production
On October 12, 2024, our warehouse picking robot fleet experienced 12 unplanned controller reboots over a 6-hour window, resulting in 12 minutes of total downtime. The root cause was traced to a miscompile in the Rust 1.85 compiler for ARMv7-R targets, triggered by our sensor polling subsystem. This postmortem details the incident timeline, root cause analysis, remediation, and prevention measures.
Incident Timeline
- 08:00 UTC: Firmware upgrade to Rust 1.85-based build rolled out to 12 production robots.
- 08:15 UTC: First controller reboot alert triggered for Robot #3.
- 08:15 – 14:00 UTC: 11 additional reboots across the fleet, averaging 2 reboots per hour.
- 14:00 UTC: On-call engineers identified the firmware upgrade as the common factor, rolled back all robots to Rust 1.84-based firmware.
- 14:15 UTC: Reboots stopped, service restored to full capacity.
Root Cause Analysis
Our robotics controller runs on a custom ARM Cortex-R5-based PCB, using a no_std Rust firmware (≈40k lines) with a custom async runtime for real-time sensor polling and motor control. The controller is configured to reboot via hardware watchdog on any unhandled panic, to ensure fail-safe operation.
After the rollback confirmed Rust 1.85 as the culprit, we bisected the toolchain: Rust 1.84 (stable) produced working firmware, while Rust 1.85 (stable) reproduced the panic. The panic message from the controller's debug UART was:
panicked at 'index out of bounds: the len is 128 but the index is 128', src/sensors.rs:42:18
The offending code was the sensor buffer write logic, which used a relaxed atomic load to track the current buffer index:
    use core::sync::atomic::{AtomicUsize, Ordering::Relaxed};

    const SENSOR_BUF_SIZE: usize = 128;
    static SENSOR_IDX: AtomicUsize = AtomicUsize::new(0);
    static mut SENSOR_BUF: [u16; SENSOR_BUF_SIZE] = [0; SENSOR_BUF_SIZE];

    fn poll_sensors() {
        // `sensor` is the sensor driver handle (not shown in this excerpt).
        let reading = sensor.read();
        let idx = SENSOR_IDX.load(Relaxed); // Miscompiled in Rust 1.85
        unsafe {
            // Indexing is still bounds-checked inside `unsafe`.
            SENSOR_BUF[idx] = reading; // Panics when idx == 128
        }
        SENSOR_IDX.store((idx + 1) % SENSOR_BUF_SIZE, Relaxed);
    }
Rust 1.85 changed the default codegen settings for ARMv7-R targets, enabling size optimization (-C opt-level=s or z) for release builds. This exposed an LLVM optimization bug in which a relaxed atomic load could be reordered before an earlier store to the same atomic variable. As a result, SENSOR_IDX.load(Relaxed) returned 128, which equals the buffer length and is one past the last valid index of 127. The bounds check on the array write caught the out-of-range index and panicked, and the hardware watchdog then rebooted the controller.
We confirmed the diagnosis by rebuilding with Rust 1.85 with size optimization disabled, which eliminated the panic. We reported the issue to the Rust compiler team, and the miscompile was fixed in Rust 1.85.1.
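The failure hinges on the bounds check: any index outside 0 to 127 becomes a panic, and the panic becomes a watchdog reboot. As a rough, host-runnable sketch (not our firmware code), the same index logic can be written so that an out-of-range index is reported to the caller instead of panicking. The SensorRing type, its push method, and the reset-to-zero recovery are hypothetical choices made for illustration, and the sketch uses std rather than no_std so it can run on a development machine.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const SENSOR_BUF_SIZE: usize = 128;

/// Hypothetical wrapper around the index and buffer from the excerpt above.
struct SensorRing {
    idx: AtomicUsize,
    buf: [u16; SENSOR_BUF_SIZE],
}

impl SensorRing {
    const fn new() -> Self {
        Self {
            idx: AtomicUsize::new(0),
            buf: [0; SENSOR_BUF_SIZE],
        }
    }

    /// Writes one reading and advances the index.
    /// Returns Err(bad_index) instead of panicking if the index is out of range.
    fn push(&mut self, reading: u16) -> Result<(), usize> {
        let idx = self.idx.load(Ordering::Relaxed);
        match self.buf.get_mut(idx) {
            Some(slot) => *slot = reading,
            None => {
                // Assumed recovery policy: reset the index and report the bad value.
                self.idx.store(0, Ordering::Relaxed);
                return Err(idx);
            }
        }
        self.idx.store((idx + 1) % SENSOR_BUF_SIZE, Ordering::Relaxed);
        Ok(())
    }
}
```

With correct codegen the Err branch should be unreachable, because the stored index is always reduced modulo SENSOR_BUF_SIZE. The branch exists only to turn a corrupted index into a recoverable error.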
Remediation
- Immediate: Rolled back all production controllers to Rust 1.84-based firmware, which stopped reboots within 15 minutes of the rollback start.
- Short-term: Upgraded to Rust 1.85.1 once released, which included the fix for the ARMv7-R codegen bug. Validated the fix in staging with 48 hours of continuous sensor polling (exceeding 128 buffer writes per second) with no panics.
- Long-term: Replaced the relaxed load of the sensor index with Ordering::Acquire, so that later memory accesses cannot be reordered ahead of the load, even if future optimizations attempt it.
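A minimal sketch of the long-term change, assuming a single task advances the index. The next_slot helper is a hypothetical name, and pairing the Acquire load with a Release store is an assumption added here for illustration; the change described above only concerns the load. The sketch imports from std so it runs on a host, whereas firmware code would import from core.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const SENSOR_BUF_SIZE: usize = 128;

/// Returns the buffer slot to write next and advances the shared index.
/// Assumes a single writer: the load and store are separate operations,
/// not one atomic read-modify-write.
fn next_slot(index: &AtomicUsize) -> usize {
    // Acquire: memory accesses after this load cannot be moved before it.
    let idx = index.load(Ordering::Acquire);
    // Release (an assumption of this sketch): earlier writes are visible
    // to any thread that later observes the new index with Acquire.
    index.store((idx + 1) % SENSOR_BUF_SIZE, Ordering::Release);
    idx
}
```

The caller would use the returned slot to index the sensor buffer. Because the stored value is always reduced modulo SENSOR_BUF_SIZE, the returned slot stays in the range 0 to 127 as long as the index starts in that range.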
Prevention Steps
To avoid similar incidents in the future, we implemented the following changes:
- Pin Rust toolchain versions in CI via rust-toolchain.toml, with explicit sign-off required for toolchain upgrades to production.
- Add a 48-hour staging soak test for all toolchain upgrades, with fuzz testing for edge cases (buffer sizes, atomic ordering, sensor dropouts).
- Update panic handling so that panics in non-critical subsystems (sensor polling, telemetry) restart only the affected task; only motor control panics trigger a full controller reboot.
- Subscribe to Rust release notes and target-specific (ARM Cortex-R5) regression reports, with automated alerts for changes to codegen settings for our target.
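The revised panic policy in the list above can be summarized as a small decision table. This is a sketch only: the Subsystem and PanicAction enums and the action_for function are hypothetical names, not the firmware's actual types, and a real handler would also have to carry out the restart or reboot.

```rust
/// Subsystems named in the prevention steps (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Subsystem {
    MotorControl,
    SensorPolling,
    Telemetry,
}

/// What the panic handler should do for a given subsystem (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PanicAction {
    RebootController,
    RestartTask,
}

/// Maps a panicking subsystem to the recovery action described in the list.
fn action_for(subsystem: Subsystem) -> PanicAction {
    match subsystem {
        // Motor control is safety-critical: keep the fail-safe full reboot.
        Subsystem::MotorControl => PanicAction::RebootController,
        // Non-critical subsystems restart only the affected task.
        Subsystem::SensorPolling | Subsystem::Telemetry => PanicAction::RestartTask,
    }
}
```

Writing the policy as an exhaustive match means that adding a new subsystem forces an explicit decision about its recovery action at compile time.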
We apologize for the downtime caused to our warehouse operations team, and have shared the full incident report with all stakeholders. The Rust compiler team has merged the fix into the 1.85 stable branch, so other ARMv7-R users are advised to upgrade to 1.85.1 or later.