DEV Community

scndry
scndry

Posted on

Splitting Correctness from Throughput: A Hybrid Approach to XLSX Streaming Writes

Writing XLSX files involves two concerns that have nothing to do with each other:

  1. OOXML correctness — relationships, content types, namespace declarations, theme references, drawing rels. Get these wrong and Excel may refuse to open the file or report a recovery dialog. The complexity is bounded but the surface is large.
  2. Per-cell throughput<c r="A1" t="s"><v>0</v></c> repeated millions of times. This dominates write time at scale.

Yet most libraries solve both via a heavy object model (Apache POI's User Model — correct but slow) or hand-roll everything (fast but you own every OOXML edge case).

There's a middle path: let POI handle correctness, let StringBuilder handle the hot path, split them at the format boundary.

The Existing Spectrum

All POI (User Model — XSSFWorkbook):

  • ✅ OOXML correctness handled
  • ❌ Object allocation per cell (Cell, RichTextString, CellStyle reference chains)
  • ❌ Whole workbook lives in memory until write()
  • POI's SXSSFWorkbook improves memory by flushing rows to disk, but per-cell API overhead remains

Hand-rolled XML:

  • ✅ Maximum throughput
  • ❌ You own every OOXML detail listed above
  • ❌ Easy to ship a file that opens in some readers but breaks Excel
  • ❌ POI version updates can shift the OOXML output you've replicated, forcing re-validation

FastExcel (dhatim) takes this hand-rolled path with discipline — maintaining its own OOXML correctness logic independent of POI. It's a different tradeoff: faster code path, but a separate compatibility surface to maintain.

The Split

The insight: OOXML correctness is a one-time cost per file. Relationships, metadata, theme — all bounded in size, written once.

Per-cell throughput scales with row count.

The split:

  1. Let POI generate a complete XLSX skeleton with styles and an empty sheet. POI owns correctness.
  2. Split sheet1.xml at the <sheetData> boundary into head, data, tail.
  3. Stream <row> entries via StringBuilder at write time — the hottest path.
  4. Stream sharedStrings.xml independently (one entry per unique string).
  5. Copy every other zip entry (rels, theme, drawing, content types) verbatim from the skeleton.

POI owns OOXML correctness. StringBuilder owns per-cell throughput. The skeleton is the contract between them.

Implementation Sketch

setSchema()
  ├─ XSSFWorkbook (styles + empty sheet) → temp file (the skeleton)
  └─ split sheet1.xml at <sheetData>:
        head, tail   (POI-generated; copied verbatim)
        data         (<row> entries — streamed at write time)

close()
  ├─ for each zip entry:
  │     sheet1.xml        → head + streamed rows + tail
  │     sharedStrings.xml → store-driven streaming
  │     others            → copied as-is
  └─ delete temp file
Enter fullscreen mode Exit fullscreen mode

The cell write path itself, from SSMLSheetWriter:

@Override
public void writeString(final String value) {
    try {
        final int index = _cacheString(value);
        _appendCellStart("s").append(index);
        _appendCellEnd();
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
}
Enter fullscreen mode Exit fullscreen mode

_appendCellStart("s") writes <c r="A1" t="s" s="0"><v>, _appendCellEnd() closes with </v></c>. No Cell object. No RichTextString allocation. Just text appended to a StringBuilder that flushes into the zip stream periodically.

Benchmarks (100K rows, mixed types, shared string table, JMH)

Approach Time Memory
XSSFWorkbook via Jackson layer (POI User Model) 334 ms 258 MB
SXSSFWorkbook direct (POI's streaming write) 283 ms 207 MB
Skeleton-based hybrid 150 ms 191 MB

47% reduction in write time vs SXSSFWorkbook, with correctness still inherited from POI. Memory drops because POI's per-cell wrapper objects (Cell, RichTextString, CellStyle references) are no longer allocated — only the underlying values remain.

Why This Stays Correct

ECMA-376 fixes the structure of <worksheet>. <sheetData> is required, exactly once, at a specific position in the child sequence — POI has no choice but to emit it that way. The split point is backed by the spec, not by POI convention.

This shapes the safety story:

  • The variable part (<sheetData> contents) is what we replace.
  • The structural part (everything else) is copied verbatim from POI.
  • POI version bumps that change formatting (whitespace, namespace prefixes, optional metadata) get absorbed automatically — we copy whatever POI emits.
  • ECMA-376 evolution is tracked by POI before us; the skeleton inherits the new shape.

A DOM equivalence test in CI parses both outputs (skeleton-based and POI-direct), verifies their sheet1.xml and styles.xml structures match, and runs on every POI version bump — catches implementation bugs in the hand-rolled <sheetData> and any residual divergence before they ship.

The hybrid inherits correctness from POI (which inherits it from ECMA-376) and owns only the per-cell hot path.

Limitations

Applicability. The skeleton overhead — generating an empty XLSX via POI — is a fixed cost per file. The split earns its keep at scale; for small files this overhead can outweigh per-cell savings.

Style table is fixed at skeleton time (implementation-specific). Styles must be declared up-front. Adding a new style after the first cell write would mean re-emitting the styles part — currently unsupported here, though the pattern itself doesn't preclude it.

Why This Pattern Generalizes

The split works for any format where:

  • Correctness has bounded complexity — a fixed set of metadata pieces to get right
  • The hot path is repetitive — one record type repeated N times, dominating size and time
  • The hot path can be isolated structurally — a clear boundary in the format where the repetitive section lives

OOXML fits this pattern cleanly. Other formats with a "metadata + repeated records" shape can invite similar splits. SAX/StAX-only solutions either skip the correctness side or reimplement it. The hybrid acknowledges those are different problems and assigns each to the right tool.

It's the same intuition as letting an ORM build the schema while you write hot-path queries by hand: use the heavy abstraction where correctness matters, drop to the metal where throughput matters, and respect the boundary between them.

The Read Path

Reading uses the same insight without needing a split. StAX directly on OOXML XML, bypassing POI's User Model entirely for XLSX. The XML pull-parser tooling already aligns with Jackson's pull-token model — direct stream-to-token translation, no object materialization. Same principle: don't pay for an object model when you only need a stream of tokens.

The library that motivated this work is jackson-dataformat-spreadsheet — a Jackson dataformat module for Excel I/O, listed in FasterXML/jackson community modules. Apache 2.0.

Critique welcome — especially OOXML edge cases the split might mishandle. I'm specifically curious about scenarios where a hand-rolled <sheetData> could surprise Excel even when the surrounding skeleton is correct.

Top comments (0)