Indra Gusti Prasetya

Posted on • Originally published at indragustiprasetya.com

Stop OpenAI Codex Writing 640 TB/Year to Your SSD

Nothing breaks. That is what makes this one nasty. The build passes, Codex answers, the disk still shows free space, and underneath all of it a hardware budget you never charted is draining. Per GitHub issue #28224 filed against openai/codex, one instance left running wrote about 37 TB across 21 days of uptime. Extrapolated, that is roughly 640 TB a year. A typical consumer NVMe drive is warranted to around 600 TBW for its entire service life. So Codex can spend a drive's rated endurance in under twelve months while doing nothing you actually asked it to do.
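The extrapolation is plain linear arithmetic, and it is worth running yourself. The sketch below uses only the figures quoted above (37 TB over 21 days) and treats 600 TBW as an assumed rating for a typical consumer drive, not a spec for any particular model. The function names are made up for illustration.

```python
# Figures quoted from issue #28224 as cited above. The 600 TBW rating is the
# article's "typical consumer NVMe" assumption, not a spec for a specific drive.
TB_WRITTEN = 37.0      # terabytes written during the observed window
DAYS_OBSERVED = 21.0   # uptime of the observed instance
RATED_TBW = 600.0      # assumed warranty endurance, terabytes written


def tb_per_year(tb_written: float, days: float) -> float:
    """Linearly extrapolate an observed write volume to 365 days."""
    return tb_written / days * 365.0


def days_to_exhaust(rated_tbw: float, tb_written: float, days: float) -> float:
    """Days until the rated endurance is consumed at the observed rate."""
    return rated_tbw / (tb_written / days)


if __name__ == "__main__":
    yearly = tb_per_year(TB_WRITTEN, DAYS_OBSERVED)
    runway = days_to_exhaust(RATED_TBW, TB_WRITTEN, DAYS_OBSERVED)
    print(f"{yearly:.0f} TB/year")
    print(f"{runway:.0f} days to reach rated TBW")
```

With those inputs the result is about 643 TB a year and about 341 days to reach the assumed rating. It assumes the write rate stays constant, which an always-on agent makes plausible but does not guarantee.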

The bug is a logging default, not a crash

The mechanism is boring, which is precisely why it slipped through. Codex ships a SQLite feedback log sink wired to a global TRACE default. Issue #28224 traces it to Targets::new().with_default(Level::TRACE), the loudest setting available, persisted to ~/.codex/logs_2.sqlite alongside its -wal and -shm companion files. In the reporter's sample, TRACE-level lines account for 70.7% of retained bytes. Fold in the two OpenTelemetry categories (codex_otel.log_only and codex_otel.trace_safe) and about 96% of the volume is data no end user will ever open.

What is actually in there: raw WebSocket payloads, routine filesystem events, the agent opening passwd and ld.so.cache. This is telemetry for the vendor, shipped at full verbosity onto your machine. A "feedback log" that, measured in flash endurance, behaves like a slow attack on your hardware.

And it is not a fresh regression. Issue #17320, titled "Excessive SQLite WAL writes during streaming due to TRACE logs ignoring RUST_LOG," goes back to at least April. The behavior has been visible for months under different symptoms. What changed in June is that someone finally attached a TBW number to it, posted issue #28224, and Hacker News noticed.

Why du lies to you

Here is the part that should bother any operator. The file on disk stays small. The database prunes as fast as it inserts, so it never grows in any way your usual tooling would flag. In a 15-second window the reporter watched it insert 36,211 rows while the retained row count held flat at 681,774. That is continuous insert-then-delete, not accumulation. The logical file barely moves.
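You can reproduce the shape of this on a throwaway database. The sketch below is not Codex's schema or pruning logic (the table name, batch sizes, and payload are all invented for illustration). It only shows that an insert-then-prune loop in SQLite keeps the file small while the cumulative bytes pushed through it keep climbing.

```python
import os
import sqlite3
import tempfile


def churn(db_path: str, batches: int, batch_size: int, keep: int, payload: bytes) -> dict:
    """Insert rows in batches, pruning so only the newest `keep` rows remain.

    Returns the cumulative payload bytes inserted, the retained row count,
    and the final size of the main database file.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical table; it stands in for a log sink, nothing more.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, body BLOB)"
    )
    inserted_bytes = 0
    for _ in range(batches):
        conn.executemany(
            "INSERT INTO logs (body) VALUES (?)", [(payload,)] * batch_size
        )
        inserted_bytes += batch_size * len(payload)
        # Prune everything except the newest `keep` rows.
        conn.execute(
            "DELETE FROM logs WHERE id <= (SELECT MAX(id) FROM logs) - ?",
            (keep,),
        )
        conn.commit()
    retained = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    # Fold the WAL back into the main file so its size reflects the database.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.close()
    return {
        "inserted_bytes": inserted_bytes,
        "retained_rows": retained,
        "file_bytes": os.path.getsize(db_path),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stats = churn(os.path.join(tmp, "demo.sqlite"), batches=40,
                      batch_size=500, keep=500, payload=b"x" * 256)
        print(stats)
```

The counter here tracks logical payload bytes, not physical flash writes, so it understates what the drive actually absorbs. The point is the divergence: the retained row count and file size stay flat while the inserted total grows with every batch.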

Which means du -sh ~/.codex reports a calm, modest size while the drive controller absorbs terabytes of physical writes you cannot see. File size and bytes-written are two different clocks, and almost every "check disk usage" reflex an operator has reads the wrong one.

Then it gets worse, because SQLite is running in WAL mode. In WAL mode every committed page is written to the -wal file first and written again when it is checkpointed into the main database, so the SSD physically writes far more than the logical data footprint suggests. At the sampled rate of 36,211 rows in 15 seconds, that is roughly 145,000 inserts a minute, each eventually matched by a delete. The -wal and -shm files churn without pause. The single number that matters, lifetime bytes committed to the flash, is invisible to du, invisible to your file manager, invisible to anything short of reading the drive's own SMART counters. A bug that hides inside the gap between two metrics is a bug that survives for months. This one did.
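Reading that counter is less exotic than it sounds. A minimal sketch, assuming an NVMe drive and the text output of smartctl -a: it pulls the "Data Units Written" value and converts it to terabytes. The sample line is illustrative, not taken from the issue, and the regex assumes comma thousands separators, which may differ under other locales.

```python
import re

# In the NVMe SMART/Health log, one data unit is 1,000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512_000


def data_units_written(smartctl_output: str) -> int:
    """Extract the 'Data Units Written' counter from `smartctl -a` text."""
    match = re.search(r"Data Units Written:\s*([\d,]+)", smartctl_output)
    if match is None:
        raise ValueError("no 'Data Units Written' line found")
    return int(match.group(1).replace(",", ""))


def units_to_terabytes(units: int) -> float:
    """Convert NVMe data units to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


if __name__ == "__main__":
    # Illustrative line in the shape smartctl prints; not real measurements.
    sample = "Data Units Written:                 4,447,878 [2.27 TB]"
    units = data_units_written(sample)
    print(units, f"{units_to_terabytes(units):.2f} TB")
```

On a real machine you would feed it the captured output of sudo smartctl -a /dev/nvme0 instead of the sample string. The device path varies between systems.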

Who actually pays for it

Three groups carry the cost, and they are not equally protected.

Individual developers on modern laptops are the worst case. The NVMe in a current ultrabook is frequently soldered to the board. Endurance loss there is permanent and warranty-defining, and when the drive wears out the fix is not a 60-dollar replacement, it is a new machine. You do not get to swap the part.

Platform and CI teams running Codex headless on shared runners are the next tier. One misbehaving sink is a curiosity. The same sink amplified across a fleet of runners is a procurement line item and a wave of surprise drive failures that nobody traces back to a logging default, because the symptom (a dead SSD) shows up far downstream of the cause.

Then there is everyone running agents in long-lived sessions, leaving the thing churning on a goal overnight. That is exactly the usage pattern the entire industry is pushing toward right now. The failure mode is worst in precisely the scenario the tool is being sold for: always on, unattended, long-running. The more useful you make Codex, the more of your drive it eats.

The off switch you would reach for does not work

Any operator who notices runaway logging does the same thing first: set the log level down. In a Rust program that means RUST_LOG. Issue #17320 reports that the SQLite sink ignores it. The standard environment variable, the obvious lever, the first control anyone would try, does not throttle this path. The sink runs independent of the knob users expect to govern it.

That detail is the difference between an annoyance and a real exposure. A noisy logger you can quiet is a config problem. A logger that writes at TRACE, ignores the documented control, and hides its volume behind a self-pruning file is something you have to actively work around. There is no supported toggle in the issue threads, only a redirect (more on that below).

The counterpoint, taken seriously

The reasonable pushback: it is one CLI tool, SSDs are cheap, this is a rounding error. I do not buy it, and the math is why. 640 TB of writes a year against a 600 TBW warranty is not a fraction of the drive's life, it is the whole thing, consumed in under a year, on hardware that on many laptops cannot be replaced. The cost is real, it is concentrated on the long-running headless usage the product is being pushed toward, and it lands hardest on the people least able to swap the part. "SSDs are cheap" is true for a desktop with a socketed M.2. It is false for the soldered drive in the machine you are reading this on.

How to check your drive this week

Do not trust file size for any of this. Read the drive's own write counter, then act.

  1. Get your baseline. On Linux with NVMe, run sudo smartctl -a /dev/nvme0 | grep "Data Units Written" (each unit is 512,000 bytes, that is 1,000 blocks of 512 bytes), or sudo nvme smart-log /dev/nvme0. On a SATA SSD, read the Total_LBAs_Written SMART attribute instead. Write the number down.

  2. Prove it is Codex before you blame anything. Take an idle baseline first: with Codex stopped, read the counter, wait an hour, and read it again. Then leave Codex running but idle for an hour and measure the same delta. If an idle agent moves the lifetime-written counter by gigabytes per hour beyond the baseline, you have this bug. Confirm the source with ls -la ~/.codex/logs_2.sqlite* and watch the -wal file's modification time churn, correlated with iostat -x 5 showing sustained writes on the device. iostat reports per device, not per process, so use pidstat -d 5 or iotop to attribute the writes to the Codex process. Name the artifact, then fix it.

  3. Redirect the sink, but verify the target first. The known workaround is to symlink ~/.codex/logs_2.sqlite to a RAM-backed path so the writes never touch the SSD. The file holds no conversation data, so losing it on reboot is safe. The catch: run df -h /tmp and confirm the filesystem reads tmpfs before you point anything there. On plenty of Linux installs /tmp is on-disk, and if it is, you have relocated the wear, not removed it. No tmpfs on /tmp? Mount an explicit one for the redirect target.

  4. On CI, make it ephemeral by policy, not by hand. Point ~/.codex at the runner's scratch tmpfs in job setup so the sink dies with the container and never reaches persistent storage. Bake it into the image. A per-job afterthought is one you will forget on the next runner you provision.
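Steps 2 and 3 come down to two small checks, sketched here under stated assumptions: the readings are NVMe "Data Units Written" values (512,000 bytes each), and the mount table is the text of /proc/mounts on Linux. Both function names are hypothetical helpers, and mount points containing escaped characters are not handled.

```python
import os


def projected_tb_per_year(units_before: int, units_after: int, hours: float) -> float:
    """Project yearly writes from two NVMe 'Data Units Written' readings."""
    delta_bytes = (units_after - units_before) * 512_000
    return delta_bytes / hours * 24 * 365 / 1e12


def fs_type_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching
    mount point wins; on a tie the later entry wins, since it shadows
    the earlier mount.
    """
    best_point, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        point, fstype = fields[1], fields[2]
        prefix = point.rstrip("/") + "/"
        inside = path == point or path.startswith(prefix)
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fstype
    return best_type


if __name__ == "__main__":
    # Resolve symlinks first so the check applies to the real target.
    target = os.path.realpath("/tmp")
    with open("/proc/mounts") as fh:
        print(target, "is on", fs_type_for(target, fh.read()))
    # Example: 100,000 data units written during one idle hour.
    print(f"{projected_tb_per_year(0, 100_000, 1.0):.1f} TB/year")
```

If the first line prints anything other than tmpfs, the redirect would only relocate the wear. The projection is linear, so subtract the idle baseline delta from the reading before trusting it.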

The broader lesson outlives this one issue. Vendor telemetry sinks that ship at TRACE, ignore the standard log-level controls, and prune their own files to stay small are now part of your infrastructure's write path. Audit what your AI tooling writes to local disk with the same suspicion you apply to what it sends over the network. The trusted tool's debug log is a resource-exhaustion surface, and the file size will tell you nothing.
