How I Made Python Excel Exports 9x Faster with Rust
At my day job, I work with systems that handle a lot of data. One recurring pain point: Excel exports. Every time users triggered a large report — hundreds of thousands of rows — the server would spike. CPU pegged, response times tanked, other requests queued up behind one export job.
I tried everything. xlsxwriter with constant_memory mode. Background workers. Chunked exports. But the fundamental problem remained: Python is slow at writing millions of cells to Excel. A 1M-row export takes ~52 seconds. That's 52 seconds of CPU time blocking the server.
I started researching alternatives in my spare time. That's when I discovered PyO3 — a bridge between Rust and Python — and maturin, which makes building Rust-Python packages dead simple. And I found rust_xlsxwriter — a Rust port of xlsxwriter by the same original author.
The pieces clicked. What if I could keep the Python API but move the heavy lifting to Rust?
My first prototype was rough, but it already showed ~6x speedup over xlsxwriter. That was enough. The server spikes went away, exports finished before users could complain, and I moved on to other work.
The project sat at 6x for a long time. It solved my problem, so there was no urgency to push further. But every now and then, in my spare time, I'd come back to it — tweak something, try an idea, benchmark again.
Recently, I paired up with Claude (Anthropic's AI) for an intensive optimization session. Over a single long conversation, we profiled every layer, found bottlenecks I'd missed, and pushed the performance from 6x all the way to 9x faster.
This is the story of that journey — from the initial problem to the final 9x.
The Problem
Python's xlsxwriter is great. Clean API, well-maintained, does everything you need. But it's pure Python, and for large datasets, that's a problem — especially on a server where one slow export can affect everything else:
```python
import xlsxwriter

wb = xlsxwriter.Workbook("report.xlsx", {"constant_memory": True})
ws = wb.add_worksheet()

# This loop is the bottleneck
for i, record in enumerate(million_records):
    for col, value in enumerate(record.values()):
        ws.write(i, col, value)  # Pure Python per-cell overhead

wb.close()
# ⏱ ~52 seconds for 1M rows
```
Every ws.write() call goes through Python's interpreter. Type checking, method dispatch, object allocation — multiplied by millions of cells. The actual Excel XML generation is fast, but the Python overhead per cell kills performance.
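The gap between millions of interpreted method calls and one batched call shows up even in pure Python, with no Excel involved. A rough stdlib-only illustration (`Sink` is a made-up stand-in for a worksheet, not the real API):

```python
import timeit

data = list(range(100_000))

class Sink:
    """Stand-in for a worksheet object."""
    def __init__(self):
        self.total = 0

    def write(self, value):
        # One interpreted method call per cell, like ws.write()
        self.total += value

    def write_all(self, values):
        # One call for the whole batch; the loop runs in C inside sum()
        self.total += sum(values)

def per_cell():
    s = Sink()
    for v in data:
        s.write(v)
    return s.total

def batched():
    s = Sink()
    s.write_all(data)
    return s.total

assert per_cell() == batched()
t_cell = timeit.timeit(per_cell, number=20)
t_batch = timeit.timeit(batched, number=20)
print(f"per-cell: {t_cell:.3f}s, batched: {t_batch:.3f}s")
```

On my machine the batched version wins by an order of magnitude, and that is before any of the per-cell type checking and object allocation a real writer has to do.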
The Idea: What if the hot loop was in Rust?
The plan was simple:
- Accept Python data (dicts, DataFrames, generators)
- Cross the FFI boundary once
- Do all the heavy lifting in Rust
- Return the Excel file
The First Result
```python
from rustpy_xlsxwriter import write_worksheet

write_worksheet(million_records, "report.xlsx", sheet_name="Data")
# ⏱ ~8 seconds for 1M rows (first version — ~6x faster)
```
No fancy API, just a function call. But already 6x faster — enough to solve the server problem.
Later, I added the FastExcel builder class for a nicer developer experience. But that came after the core was proven.
The Optimization Journey (6x → 9x)
Getting from 6x to 9x wasn't a single change. It was a series of optimizations, each building on the last. Most of these came out of that intensive session with Claude.
1. The Obvious Win: Move the Loop to Rust
The first version simply moved the per-cell loop from Python to Rust. Instead of Python calling ws.write() a million times, Python passes the entire list once, and Rust iterates it internally.
This alone gave ~5x speedup — just from avoiding Python interpreter overhead per cell. This was my initial version.
2. Cache Everything
The original code created a new Format object for every cell that needed formatting. For 1M rows with float formatting, that's 1M allocations.
```rust
// Before: new Format per cell ❌
let format = Format::new().set_num_format("0.00");
worksheet.write_number_with_format(row, col, val, &format)?;

// After: create once, reuse ♻️
let float_fmt = Format::new().set_num_format("0.00"); // once
for row in rows {
    worksheet.write_number_with_format(row, col, val, &float_fmt)?; // reuse
}
```
Same for datetime formats, bold formats — create once, reuse across millions of cells.
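The create-once-reuse pattern is language-agnostic; here it is sketched in Python with a plain dict as the cache (`FormatHandle` and the cache shape are illustrative, not the library's internals):

```python
class FormatHandle:
    """Stand-in for an expensive format object (e.g. a rust_xlsxwriter Format)."""
    instances = 0

    def __init__(self, num_format):
        FormatHandle.instances += 1
        self.num_format = num_format

_format_cache = {}

def get_format(num_format):
    # Create each distinct format once; reuse it for every later cell.
    fmt = _format_cache.get(num_format)
    if fmt is None:
        fmt = _format_cache[num_format] = FormatHandle(num_format)
    return fmt

# A million float cells: one allocation instead of a million.
for _ in range(1_000_000):
    fmt = get_format("0.00")

assert FormatHandle.instances == 1
```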
3. Skip the Type Cascade
Python is dynamically typed, so for each cell we need to figure out: is it a string? int? float? bool? datetime? None?
The naive approach tries each type in sequence:
```rust
if let Ok(s) = value.cast::<PyString>() { ... }
else if let Ok(f) = value.cast::<PyFloat>() { ... } // 1 failed cast
else if let Ok(b) = value.cast::<PyBool>() { ... }  // 2 failed casts
else if let Ok(i) = value.cast::<PyInt>() { ... }   // 3 failed casts
// ... and so on
```
For a column that's always integers, every cell wastes 3 failed casts before finding the right one. Multiply by millions of cells.
Fix: First-row type caching. Detect the type from row 1, cache it per column, and try that type first for all subsequent rows. Cache miss? Fall back to the full cascade.
This eliminated ~39 million unnecessary type checks on my 1M × 13-column test data.
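The idea translates to any language; here is a simplified Python sketch of first-row type caching (the real code does this in Rust with PyO3 casts, and the cascade order is illustrative):

```python
def detect_type(value):
    """Full cascade: try each type in order (the expensive path)."""
    for t in (str, float, bool, int):
        if isinstance(value, t):
            return t
    return type(None)

def write_rows(rows):
    cached = None  # per-column types, learned from the first row
    out = []
    for row in rows:
        if cached is None:
            cached = [detect_type(v) for v in row]
        cells = []
        for col, value in enumerate(row):
            t = cached[col]
            if not isinstance(value, t):  # cache miss: fall back to the cascade
                t = detect_type(value)
                cached[col] = t
            cells.append((t.__name__, value))  # fast path when the cache hit
        out.append(cells)
    return out

print(write_rows([[1, "a", 0.5], [2, "b", 1.5]]))
```

For homogeneous columns, which is the common case in report data, every row after the first takes exactly one `isinstance` check per cell.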
4. Dict Values, Not Keys
A subtle one. When iterating a Python dict, you can either:
```rust
// Slow: hash lookup per key ❌
for header in &headers {
    let value = row_dict.get_item(header)?; // hash lookup
}

// Fast: iterate values directly ✅
for value in row_dict.values() { // sequential iteration
    ...
}
```
Python dicts maintain insertion order, so values() gives us the same column order as keys(). This skips N hash lookups per row.
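Since Python 3.7 the language guarantees that dicts preserve insertion order, which is exactly what makes the `values()` shortcut safe. A quick sanity check:

```python
row = {"name": "ada", "score": 9.5, "active": True}
headers = list(row)  # keys, in insertion order

# Keyed access: one hash lookup per column
by_key = [row[h] for h in headers]

# Direct value iteration: no lookups, same order
by_values = list(row.values())

assert by_key == by_values
```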
5. Arrow Zero-Copy for DataFrames
This was the biggest architectural change. Pandas and Polars DataFrames store data in columnar Arrow format internally. Instead of converting each cell to a Python object and passing it to Rust:
DataFrame → Python objects (slow) → Rust → Excel
I read the Arrow memory buffers directly:
DataFrame → Arrow C Data Interface → Rust reads memory → Excel
Zero Python object allocation. Zero per-cell FFI calls. The data never leaves native memory until it's written to Excel.
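Arrow's C Data Interface is too involved for a snippet, but the zero-copy principle itself is easy to see with the stdlib: a `memoryview` over an `array` shares the underlying native buffer instead of copying it. This is an analogy for what the Arrow path does, not the actual Arrow mechanism:

```python
from array import array

# A contiguous buffer of native doubles, like an Arrow column
col = array("d", [1.0, 2.0, 3.0])

view = memoryview(col)  # zero-copy: shares col's memory
copy = col.tolist()     # copy: allocates new Python float objects

col[0] = 99.0
assert view[0] == 99.0  # the view sees the change (shared buffer)
assert copy[0] == 1.0   # the copy does not
```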
6. Compiler Optimizations
Free performance with zero code changes:
```toml
[profile.release]
lto = "thin"      # Link-Time Optimization: cross-crate inlining
codegen-units = 1 # Better optimization (slower compile)
strip = "symbols" # Smaller binary, better cache
```
This gave ~10-15% on top of everything else.
The Final Numbers
I run `python benchmark.py` to measure everything:
| Input | 500K rows | 1M rows | vs Python |
|---|---|---|---|
| Records (list of dicts) | 2.99s | 5.94s | ~9x faster |
| Pandas DataFrame | 1.21s | 2.41s | ~8x faster |
| Polars DataFrame | 1.20s | 2.42s | ~7x faster |
| CSV output | 0.16s | 0.32s | ~5x faster |
DataFrames are faster than Records because of the Arrow zero-copy path — no Python objects involved at all.
What the API Looks Like
I wanted it to feel Pythonic, not like a Rust library:
```python
from rustpy_xlsxwriter import FastExcel

# Simple
FastExcel("output.xlsx").sheet("Users", records).save()

# Full-featured
with FastExcel("report.xlsx", password="secret") as f:
    f.format(
        float_format="0.00",
        datetime_format="dd/mm/yyyy",
        bold_headers=True,
    )
    f.freeze(row=1)
    f.sheet("Employees", employee_df)    # Pandas or Polars
    f.sheet("Revenue", revenue_records)  # list of dicts
# auto-saves on exit

# CSV/TSV — same API, just change extension
FastExcel("output.csv").sheet("Data", records).save()
```
Features:
- Pandas & Polars DataFrames (Arrow zero-copy)
- Generators for memory-efficient streaming
- Context manager with auto-save
- Freeze panes, password protection
- Float/datetime formatting, bold headers
- CSV/TSV output (~5x faster than Python csv)
- `io.BytesIO` output for web frameworks
Lessons Learned
1. Profile before optimizing. My first instinct was "rewrite everything in Rust." But the actual bottleneck was the per-cell Python overhead, not the Excel generation. Understanding this shaped every optimization.
2. The FFI boundary is the enemy. Every Python↔Rust call has overhead. The goal is to cross it as few times as possible. Arrow zero-copy is the ultimate expression of this — cross the boundary zero times per cell.
3. Diminishing returns are real. Going from 1x to 5x was one optimization (move the loop). Going from 5x to 7x took three more. Going from 7x to 9x took six more. Know when to stop.
4. PyO3 is incredible. Seriously. Writing a Python extension in Rust with PyO3 is surprisingly ergonomic. Type conversions, error handling, GIL management — it handles everything.
5. Tests matter more than benchmarks. I have 102 tests that verify actual cell content via openpyxl read-back. Without these, every optimization would have been a gamble.
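The read-back pattern is cheap to set up. Here's the shape of such a test, using the stdlib `csv` module as a stand-in for openpyxl (the real suite reads the generated `.xlsx` files back; `write_csv` is a toy writer, not the library's code):

```python
import csv
import io

def write_csv(records, f):
    """Toy writer: header row from dict keys, then one row per record."""
    w = csv.writer(f)
    w.writerow(records[0].keys())
    for r in records:
        w.writerow(r.values())

def test_roundtrip():
    records = [{"name": "ada", "score": "9.5"},
               {"name": "bob", "score": "7.0"}]
    buf = io.StringIO()
    write_csv(records, buf)

    buf.seek(0)
    rows = list(csv.reader(buf))
    # Verify actual cell content, not just "no exception was raised"
    assert rows[0] == ["name", "score"]
    assert rows[1] == ["ada", "9.5"]
    assert rows[2] == ["bob", "7.0"]

test_roundtrip()
```

A benchmark tells you an optimization made things faster; only a content check like this tells you it didn't also make them wrong.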
Try It
```bash
pip install rustpy-xlsxwriter
```

```python
from rustpy_xlsxwriter import FastExcel

records = [{"name": f"user_{i}", "score": i * 0.1} for i in range(1_000_000)]
FastExcel("fast.xlsx").sheet("Data", records).save()
```
GitHub: rahmadafandi/rustpy-xlsxwriter
Built with Rust, PyO3, and rust_xlsxwriter. MIT licensed. Contributions welcome.