How I Made Python Excel Exports 9x Faster with Rust
At my day job, I work with systems that handle a lot of data. One recurring pain point: Excel exports. Every time users triggered a large report — hundreds of thousands of rows — the server would spike. CPU pegged, response times tanked, other requests queued up behind one export job.
I tried everything. xlsxwriter with constant_memory mode. Background workers. Chunked exports. But the fundamental problem remained: Python is slow at writing millions of cells to Excel. A 1M-row export takes ~52 seconds. That's 52 seconds of CPU time blocking the server.
I started researching alternatives in my spare time. That's when I discovered PyO3 — a bridge between Rust and Python — and maturin, which makes building Rust-Python packages dead simple. And I found rust_xlsxwriter — a Rust port of xlsxwriter by the same original author.
The pieces clicked. What if I could keep the Python API but move the heavy lifting to Rust?
My first prototype was rough, but it already showed ~6x speedup over xlsxwriter. That was enough. The server spikes went away, exports finished before users could complain, and I moved on to other work.
The project sat at 6x for a long time. It solved my problem, so there was no urgency to push further. But every now and then, in my spare time, I'd come back to it — tweak something, try an idea, benchmark again.
Recently, I paired up with Claude (Anthropic's AI) for an intensive optimization session. Over a single long conversation, we profiled every layer, found bottlenecks I'd missed, and pushed the performance from 6x all the way to 9x faster.
This is the story of that journey — from the initial problem to the final 9x.
The Problem
Python's xlsxwriter is great. Clean API, well-maintained, does everything you need. But it's pure Python, and for large datasets, that's a problem — especially on a server where one slow export can affect everything else:
```python
import xlsxwriter

wb = xlsxwriter.Workbook("report.xlsx", {"constant_memory": True})
ws = wb.add_worksheet()

# This loop is the bottleneck
for i, record in enumerate(million_records):
    for col, value in enumerate(record.values()):
        ws.write(i, col, value)  # Pure Python per-cell overhead

wb.close()
# ⏱ ~52 seconds for 1M rows
```
Every ws.write() call goes through Python's interpreter. Type checking, method dispatch, object allocation — multiplied by millions of cells. The actual Excel XML generation is fast, but the Python overhead per cell kills performance.
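The gap between millions of interpreted method calls and one batched call shows up even in pure Python, with no Excel involved. A rough stdlib-only illustration (`Sink` is a made-up stand-in for a worksheet, not the real API):

```python
import timeit

data = list(range(100_000))

class Sink:
    """Stand-in for a worksheet object."""
    def __init__(self):
        self.total = 0

    def write(self, value):
        # One interpreted method call per cell, like ws.write()
        self.total += value

    def write_all(self, values):
        # One call for the whole batch; the loop runs in C inside sum()
        self.total += sum(values)

def per_cell():
    s = Sink()
    for v in data:
        s.write(v)
    return s.total

def batched():
    s = Sink()
    s.write_all(data)
    return s.total

assert per_cell() == batched()
t_cell = timeit.timeit(per_cell, number=20)
t_batch = timeit.timeit(batched, number=20)
print(f"per-cell: {t_cell:.3f}s, batched: {t_batch:.3f}s")
```

On my machine the batched version wins by an order of magnitude, and that is before any of the per-cell type checking and object allocation a real writer has to do.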
The Idea: What if the hot loop was in Rust?
The plan was simple:
- Accept Python data (dicts, DataFrames, generators)
- Cross the FFI boundary once
- Do all the heavy lifting in Rust
- Return the Excel file
The First Result
```python
from rustpy_xlsxwriter import write_worksheet

write_worksheet(million_records, "report.xlsx", sheet_name="Data")
# ⏱ ~8 seconds for 1M rows (first version — ~6x faster)
```
No fancy API, just a function call. But already 6x faster — enough to solve the server problem.
Later, I added the FastExcel builder class for a nicer developer experience. But that came after the core was proven.
The Optimization Journey (6x → 9x)
Getting from 6x to 9x wasn't a single change. It was a series of optimizations, each building on the last. Most of these came out of that intensive session with Claude.
1. The Obvious Win: Move the Loop to Rust
The first version simply moved the per-cell loop from Python to Rust. Instead of Python calling ws.write() a million times, Python passes the entire list once, and Rust iterates it internally.
This alone gave ~5x speedup — just from avoiding Python interpreter overhead per cell. This was my initial version.
2. Cache Everything
The original code created a new Format object for every cell that needed formatting. For 1M rows with float formatting, that's 1M allocations.
```rust
// Before: new Format per cell ❌
let format = Format::new().set_num_format("0.00");
worksheet.write_number_with_format(row, col, val, &format)?;

// After: create once, reuse ♻️
let float_fmt = Format::new().set_num_format("0.00"); // once
for row in rows {
    worksheet.write_number_with_format(row, col, val, &float_fmt)?; // reuse
}
```
Same for datetime formats, bold formats — create once, reuse across millions of cells.
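The create-once-reuse pattern is language-agnostic; here it is sketched in Python with a plain dict as the cache (`FormatHandle` and the cache shape are illustrative, not the library's internals):

```python
class FormatHandle:
    """Stand-in for an expensive format object (e.g. a rust_xlsxwriter Format)."""
    instances = 0

    def __init__(self, num_format):
        FormatHandle.instances += 1
        self.num_format = num_format

_format_cache = {}

def get_format(num_format):
    # Create each distinct format once; reuse it for every later cell.
    fmt = _format_cache.get(num_format)
    if fmt is None:
        fmt = _format_cache[num_format] = FormatHandle(num_format)
    return fmt

# A million float cells: one allocation instead of a million.
for _ in range(1_000_000):
    fmt = get_format("0.00")

assert FormatHandle.instances == 1
```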
3. Skip the Type Cascade
Python is dynamically typed, so for each cell we need to figure out: is it a string? int? float? bool? datetime? None?
The naive approach tries each type in sequence:
```rust
if let Ok(s) = value.cast::<PyString>() { ... }
else if let Ok(f) = value.cast::<PyFloat>() { ... } // 1 failed cast
else if let Ok(b) = value.cast::<PyBool>() { ... }  // 2 failed casts
else if let Ok(i) = value.cast::<PyInt>() { ... }   // 3 failed casts
// ... and so on
```
For a column that's always integers, every cell wastes 3 failed casts before finding the right one. Multiply by millions of cells.
Fix: First-row type caching. Detect the type from row 1, cache it per column, and try that type first for all subsequent rows. Cache miss? Fall back to the full cascade.
This eliminated ~39 million unnecessary type checks on my 1M × 13-column test data.
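The idea translates to any language; here is a simplified Python sketch of first-row type caching (the real code does this in Rust with PyO3 casts, and the cascade order is illustrative):

```python
def detect_type(value):
    """Full cascade: try each type in order (the expensive path)."""
    for t in (str, float, bool, int):
        if isinstance(value, t):
            return t
    return type(None)

def write_rows(rows):
    cached = None  # per-column types, learned from the first row
    out = []
    for row in rows:
        if cached is None:
            cached = [detect_type(v) for v in row]
        cells = []
        for col, value in enumerate(row):
            t = cached[col]
            if not isinstance(value, t):  # cache miss: fall back to the cascade
                t = detect_type(value)
                cached[col] = t
            cells.append((t.__name__, value))  # fast path when the cache hit
        out.append(cells)
    return out

print(write_rows([[1, "a", 0.5], [2, "b", 1.5]]))
```

For homogeneous columns, which is the common case in report data, every row after the first takes exactly one `isinstance` check per cell.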
4. Dict Values, Not Keys
A subtle one. When iterating a Python dict, you can either:
```rust
// Slow: hash lookup per key ❌
for header in &headers {
    let value = row_dict.get_item(header)?; // hash lookup
}

// Fast: iterate values directly ✅
for value in row_dict.values() { // sequential iteration
    ...
}
```
Python dicts maintain insertion order, so values() gives us the same column order as keys(). This skips N hash lookups per row.
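Since Python 3.7 the language guarantees that dicts preserve insertion order, which is exactly what makes the `values()` shortcut safe. A quick sanity check:

```python
row = {"name": "ada", "score": 9.5, "active": True}
headers = list(row)  # keys, in insertion order

# Keyed access: one hash lookup per column
by_key = [row[h] for h in headers]

# Direct value iteration: no lookups, same order
by_values = list(row.values())

assert by_key == by_values
```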
5. Arrow Zero-Copy for DataFrames
This was the biggest architectural change. Pandas and Polars DataFrames store data in columnar Arrow format internally. Instead of converting each cell to a Python object and passing it to Rust:
DataFrame → Python objects (slow) → Rust → Excel
I read the Arrow memory buffers directly:
DataFrame → Arrow C Data Interface → Rust reads memory → Excel
Zero Python object allocation. Zero per-cell FFI calls. The data never leaves native memory until it's written to Excel.
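Arrow's C Data Interface is too involved for a snippet, but the zero-copy principle itself is easy to see with the stdlib: a `memoryview` over an `array` shares the underlying native buffer instead of copying it. This is an analogy for what the Arrow path does, not the actual Arrow mechanism:

```python
from array import array

# A contiguous buffer of native doubles, like an Arrow column
col = array("d", [1.0, 2.0, 3.0])

view = memoryview(col)  # zero-copy: shares col's memory
copy = col.tolist()     # copy: allocates new Python float objects

col[0] = 99.0
assert view[0] == 99.0  # the view sees the change (shared buffer)
assert copy[0] == 1.0   # the copy does not
```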
6. Compiler Optimizations
Free performance with zero code changes:
```toml
[profile.release]
lto = "thin"      # Link-Time Optimization: cross-crate inlining
codegen-units = 1 # Better optimization (slower compile)
strip = "symbols" # Smaller binary, better cache
```
This gave ~10-15% on top of everything else.
The Final Numbers
I run `python benchmark.py` to measure everything:
| Input | 500K rows | 1M rows | vs Python |
|---|---|---|---|
| Records (list of dicts) | 2.99s | 5.94s | ~9x faster |
| Pandas DataFrame | 1.21s | 2.41s | ~8x faster |
| Polars DataFrame | 1.20s | 2.42s | ~7x faster |
| CSV output | 0.16s | 0.32s | ~5x faster |
DataFrames are faster than Records because of the Arrow zero-copy path — no Python objects involved at all.
What the API Looks Like
I wanted it to feel Pythonic, not like a Rust library:
```python
from rustpy_xlsxwriter import FastExcel

# Simple
FastExcel("output.xlsx").sheet("Users", records).save()

# Full-featured
with FastExcel("report.xlsx", password="secret") as f:
    f.format(
        float_format="0.00",
        datetime_format="dd/mm/yyyy",
        bold_headers=True,
    )
    f.freeze(row=1)
    f.sheet("Employees", employee_df)    # Pandas or Polars
    f.sheet("Revenue", revenue_records)  # list of dicts
# auto-saves on exit

# CSV/TSV — same API, just change extension
FastExcel("output.csv").sheet("Data", records).save()
```
Features:
- Pandas & Polars DataFrames (Arrow zero-copy)
- Generators for memory-efficient streaming
- Context manager with auto-save
- Freeze panes, password protection
- Float/datetime formatting, bold headers
- CSV/TSV output (~5x faster than Python csv)
- `io.BytesIO` output for web frameworks
Lessons Learned
1. Profile before optimizing. My first instinct was "rewrite everything in Rust." But the actual bottleneck was the per-cell Python overhead, not the Excel generation. Understanding this shaped every optimization.
2. The FFI boundary is the enemy. Every Python↔Rust call has overhead. The goal is to cross it as few times as possible. Arrow zero-copy is the ultimate expression of this — cross the boundary zero times per cell.
3. Diminishing returns are real. Going from 1x to 5x was one optimization (move the loop). Going from 5x to 7x took three more. Going from 7x to 9x took six more. Know when to stop.
4. PyO3 is incredible. Seriously. Writing a Python extension in Rust with PyO3 is surprisingly ergonomic. Type conversions, error handling, GIL management — it handles everything.
5. Tests matter more than benchmarks. I have 102 tests that verify actual cell content via openpyxl read-back. Without these, every optimization would have been a gamble.
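The read-back pattern is cheap to set up. Here's the shape of such a test, using the stdlib `csv` module as a stand-in for openpyxl (the real suite reads the generated `.xlsx` files back; `write_csv` is a toy writer, not the library's code):

```python
import csv
import io

def write_csv(records, f):
    """Toy writer: header row from dict keys, then one row per record."""
    w = csv.writer(f)
    w.writerow(records[0].keys())
    for r in records:
        w.writerow(r.values())

def test_roundtrip():
    records = [{"name": "ada", "score": "9.5"},
               {"name": "bob", "score": "7.0"}]
    buf = io.StringIO()
    write_csv(records, buf)

    buf.seek(0)
    rows = list(csv.reader(buf))
    # Verify actual cell content, not just "no exception was raised"
    assert rows[0] == ["name", "score"]
    assert rows[1] == ["ada", "9.5"]
    assert rows[2] == ["bob", "7.0"]

test_roundtrip()
```

A benchmark tells you an optimization made things faster; only a content check like this tells you it didn't also make them wrong.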
Try It
```bash
pip install rustpy-xlsxwriter
```

```python
from rustpy_xlsxwriter import FastExcel

records = [{"name": f"user_{i}", "score": i * 0.1} for i in range(1_000_000)]
FastExcel("fast.xlsx").sheet("Data", records).save()
```
GitHub: rahmadafandi/rustpy-xlsxwriter
Built with Rust, PyO3, and rust_xlsxwriter. MIT licensed. Contributions welcome.