DEV Community

Yuki
When JS Libraries Fail at 1M Rows: Generating XLSX via DuckDB SQL

Introduction

I've been building LeapRows — a browser-based CSV analysis tool that runs entirely client-side using DuckDB-WASM, with no server involved.

At some point I needed to add XLSX export. But every existing approach I tried fell apart at any meaningful scale.

I eventually landed on a solution: generate Excel-compatible XML directly via DuckDB SQL, then compress it into an XLSX file using JSZip. This post covers why I went that route and how it works.

Note: My day job is SEO, not software engineering — so please forgive any imprecise terminology. 🙏

The Problem: Exporting 1 Million Rows to XLSX in the Browser

DuckDB-WASM's Built-in XLSX Export Was Unreliable

My first attempt was DuckDB-WASM's native XLSX output (the COPY TO command via the Excel extension). It was unstable — Excel would throw "The file format or file extension is not valid" on some outputs and refuse to open the file.

I ruled it out for production use and started looking for alternatives.

JS Libraries Ran Out of Memory

Next I tried SheetJS and ExcelJS. Both worked fine on small datasets, but memory usage exploded as file size grew, and the conversion itself became painfully slow.

Testing with 1 million rows, progress would reach 95% ... and then the browser would freeze for nearly 20 seconds. After the conversion finally finished, the browser stayed sluggish — and sometimes just crashed entirely.

I Still Wanted 1 Million Rows to Work

LeapRows is built around the promise of handling 1M+ rows quickly. I didn't want to quietly cap the export at some lower limit. If Excel itself supports up to ~1,048,576 rows, I wanted LeapRows to be able to fill every one of them.

The Insight: XLSX Is Just ZIP + XML

Something that might not be obvious: an XLSX file is actually a ZIP archive containing XML files. You can verify this yourself by renaming any .xlsx file to .zip — you'll see the contents directly.

[Screenshot: Windows File Explorer showing the contents of a renamed XLSX file as a ZIP archive, revealing the folders _rels, docProps, xl, and a [Content_Types].xml file.]

Those XML files contain a lot of styling information (fonts, colors, borders, conditional formatting, etc.), but the actual data lives in just one file: xl/worksheets/sheet1.xml.

[Screenshot: A code editor displaying XML content from inside an XLSX file, showing font styling definitions including Japanese font names like 游ゴシック.]

Since LeapRows only needs to export raw data from a DuckDB table — no styling required — I realized: the only file I need to generate dynamically is sheet1.xml. The other 4–5 files in the archive can be static strings.
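Those static files can simply be hard-coded as constant strings. Here's a minimal sketch of what that might look like, based on the OOXML package structure (this is my illustration, not LeapRows's actual code; note that a date cell with s="1", as used later, also needs an xl/styles.xml defining that style):

```javascript
// Minimal static members of the XLSX archive (everything except sheet1.xml).
// Sketch based on the OOXML package structure; an export that uses s="1"
// date cells also needs an xl/styles.xml defining that cell style.
const STATIC_PARTS = {
  "[Content_Types].xml":
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' +
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">' +
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>' +
    '<Default Extension="xml" ContentType="application/xml"/>' +
    '<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>' +
    '<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>' +
    '</Types>',
  "_rels/.rels":
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' +
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">' +
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook.xml"/>' +
    '</Relationships>',
  "xl/workbook.xml":
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' +
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" ' +
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">' +
    '<sheets><sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets></workbook>',
  "xl/_rels/workbook.xml.rels":
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' +
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">' +
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>' +
    '</Relationships>',
};
```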

Generating XML Directly with DuckDB SQL

The data is already in a DuckDB table. And DuckDB SQL is extremely fast at string operations — CONCAT, REPLACE, and so on.

That led to the idea: "What if I use a SQL query to output XML cell strings directly, then ZIP them up with JSZip?"

Before:

DuckDB → JS objects → library → XML → ZIP

After:

DuckDB → XML strings (via SQL) → JSZip → ZIP

Skipping the intermediate JS object conversion means no memory pressure from large object graphs, and no GC pauses that were causing the browser to freeze.

The approach uses a CASE expression to handle each column type appropriately — numbers pass through as-is, dates get converted to Excel serial values, and strings get XML-escaped.

What the SQL Looks Like

SELECT
    '<row r="' || CAST(rn + 1 AS VARCHAR) || '">'
    -- String column: XML-escaped, output as inline string
    || CASE WHEN "name" IS NOT NULL
       THEN '<c r="A' || CAST(rn + 1 AS VARCHAR) || '" t="inlineStr"><is><t>'
         || REPLACE(REPLACE(REPLACE("name", '&', '&amp;'), '<', '&lt;'), '>', '&gt;')
         || '</t></is></c>'
       ELSE '' END
    -- Numeric column: output value directly
    || CASE WHEN "age" IS NOT NULL
       THEN '<c r="B' || CAST(rn + 1 AS VARCHAR) || '"><v>'
         || CAST("age" AS VARCHAR)
         || '</v></c>'
       ELSE '' END
    -- Date column: convert to Excel serial value (days since 1899-12-30)
    || CASE WHEN "created_at" IS NOT NULL
       THEN '<c r="C' || CAST(rn + 1 AS VARCHAR) || '" s="1"><v>'
         || CAST(date_diff('day', DATE '1899-12-30', "created_at") AS VARCHAR)
         || '</v></c>'
       ELSE '' END
    || '</row>' AS xml_row
FROM (
    SELECT *, ROW_NUMBER() OVER () AS rn
    FROM my_table
    LIMIT 50000 OFFSET 0
)
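The date_diff part maps dates onto Excel's serial-number scheme. A quick way to sanity-check the same conversion in JS (a hypothetical helper, not part of LeapRows):

```javascript
// Excel stores dates as serial numbers. For modern dates this equals the
// day count since 1899-12-30; the odd epoch absorbs Excel's historical
// "1900 was a leap year" bug, so it only holds from 1900-03-01 onward.
function excelSerial(dateUtc) {
  const EPOCH = Date.UTC(1899, 11, 30); // 1899-12-30, UTC midnight
  return Math.floor((dateUtc.getTime() - EPOCH) / 86400000);
}
```

For example, 2024-01-11 comes out as 45302, matching the sample output below.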

Each row in the result set becomes one XML element:

<row r="2"><c r="A2" t="inlineStr"><is><t>John Doe</t></is></c><c r="B2"><v>30</v></c><c r="C2" s="1"><v>45302</v></c></row>
<row r="3"><c r="A3" t="inlineStr"><is><t>Jane Smith</t></is></c><c r="B3"><v>25</v></c><c r="C3" s="1"><v>45150</v></c></row>

Combine these rows with a header row and the static XML files, compress with JSZip, and you have a valid XLSX file.
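The assembly step can be sketched like this (a hypothetical helper, assuming the query's row strings have been collected batch by batch):

```javascript
// Wrap pre-built <row> strings (like the ones the DuckDB query emits,
// header row first) into a complete xl/worksheets/sheet1.xml document.
function buildSheetXml(rowXmlChunks) {
  return (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' +
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">' +
    '<sheetData>' +
    rowXmlChunks.join('') +
    '</sheetData></worksheet>'
  );
}
```

The returned string then goes into the archive alongside the static files using standard JSZip calls: zip.file('xl/worksheets/sheet1.xml', xml) followed by zip.generateAsync({ type: 'blob', compression: 'DEFLATE' }).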

Results

With ExcelJS, exporting 1 million rows took around 20 seconds and consumed over 1 GB of memory (unfortunately I didn't capture a "before" screenshot for comparison).

After switching to the SQL-based XML approach:

  • XML generation: 5.5s
  • ZIP compression: 7.3s
  • Memory usage: significantly reduced
  • Post-export browser freeze: gone

The reduction in GC-eligible objects was what solved the post-conversion sluggishness.

[Screenshot: Processing time breakdown for the 1-million-row XLSX export: DuckDB Init 4 ms, File Load 1 ms, Table Creation 439 ms, Schema Analysis 1 ms, Preview Generation 4 ms, Build XML 5.5 s (41%), ZIP Compress 7.3 s (55%).]

Side note: compression could probably be moved to the WASM layer for even better performance, but that felt out of scope for now. 😇

Trade-offs

This approach doesn't support cell styling — no colors, borders, or Excel-specific formatting. For LeapRows that's fine, since the goal is pure data export. But it's worth keeping in mind if you need rich formatting.

Wrapping Up

Once I let go of the assumption that "XLSX is a complex format," it became clear that for plain data export, a minimal XML structure is completely sufficient.

DuckDB's string processing is fast enough to make "generate XML via SQL" a genuinely practical approach — and I suspect the same idea could be applied in other creative ways beyond XLSX export.
