DEV Community

Yana Postnova

I built a Chrome extension that stream-parses 2GB XML files using only 20MB of RAM. Here's the architecture.

The problem

I work with hotel reservation systems that dump SOAP/OTA XML responses — sometimes 1-2 GB per file. Every XML viewer I tried either crashed, froze the tab, or ran out of memory. Notepad++ tops out around 200MB. Browser-based XML viewers load everything into a DOM tree that eats 3-10x the file size in RAM. A 500MB file? That's 4GB of RAM just to render it.

The solution

I built XML Stream Parser — a Chrome extension that handles XML files up to 2GB without freezing your browser.

How it works (the interesting part)

The core idea is embarrassingly simple: don't build a DOM tree.

  1. file.slice(offset, offset + CHUNK_SIZE) reads a 16 MB chunk without pulling the rest of the file into memory
  2. A single TextDecoder instance, called with decode(bytes, { stream: true }), decodes UTF-8 correctly across chunk boundaries (this is the part everyone gets wrong: a multibyte character can land exactly on the boundary)
  3. A custom SAX parser processes the chunk, firing onOpenTag, onCloseTag, onText events
  4. All of this runs in a Web Worker so the main thread stays free
  5. Worker sends progress updates via postMessage, main thread renders a progress bar
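Steps 1 and 2 can be sketched like this. makeChunkDecoder is an illustrative helper, not the extension's actual code; the key detail is reusing one TextDecoder across all chunks, so a multibyte sequence split at a boundary is buffered by the decoder instead of being emitted as U+FFFD replacement characters:

```javascript
// Sketch of the chunked decode step. CHUNK_SIZE and makeChunkDecoder
// are illustrative names, not the extension's real identifiers.
const CHUNK_SIZE = 16 * 1024 * 1024;

function makeChunkDecoder() {
  // One decoder instance must survive across chunks: with
  // { stream: true } it holds back a partial multibyte sequence
  // at the end of a chunk and completes it on the next call.
  const decoder = new TextDecoder("utf-8");
  return (bytes, done = false) => decoder.decode(bytes, { stream: !done });
}

// Demonstrate the boundary case: "é" is 0xC3 0xA9 in UTF-8,
// split here across two "chunks".
const decode = makeChunkDecoder();
const part1 = new Uint8Array([0x61, 0xc3]); // "a" + first byte of "é"
const part2 = new Uint8Array([0xa9, 0x62]); // second byte of "é" + "b"
const text = decode(part1) + decode(part2, true);
// text === "aéb" — decoding each chunk with a fresh TextDecoder
// would instead produce "a\uFFFD" + "\uFFFDb"
```

In the real flow, `part1`/`part2` would be the `Uint8Array`s read from successive `file.slice(...)` calls inside the Worker.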

Memory usage is ~20MB regardless of file size. A 2GB file uses the same RAM as a 2KB file.

What you can do with it:

  • Stats: total elements, unique tags, attributes, max depth — computed in a single pass
  • Search: filter by tag name, attribute name, attribute value, or text content. Results stream in while the file is still parsing
  • Element explorer: all tags listed by nesting depth. Click any tag to see its actual XML code with syntax highlighting. Navigate through up to 50 samples with ◀ ▶
  • XML anatomy hint: the extension picks a representative element from your file and shows an interactive breakdown — what's a tag, what's an attribute, what's a value. Useful for non-dev users who receive XML exports
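The single-pass stats fall out of the SAX events almost for free. This is a hypothetical sketch (the collector shape and names are mine, not the extension's code), wired to the onOpenTag/onCloseTag events mentioned above:

```javascript
// Hypothetical single-pass stats collector driven by SAX events.
function makeStatsCollector() {
  const stats = { totalElements: 0, uniqueTags: new Set(), attributes: 0, maxDepth: 0 };
  let depth = 0;
  return {
    onOpenTag(name, attrs = {}) {
      stats.totalElements++;
      stats.uniqueTags.add(name);
      stats.attributes += Object.keys(attrs).length;
      depth++;
      if (depth > stats.maxDepth) stats.maxDepth = depth;
    },
    onCloseTag() { depth--; },
    // Snapshot with the Set collapsed to a count.
    result: () => ({ ...stats, uniqueTags: stats.uniqueTags.size }),
  };
}

// usage: feed events the way a parser would
const collector = makeStatsCollector();
collector.onOpenTag("Reservations");
collector.onOpenTag("Hotel", { id: "42" });
collector.onCloseTag();
collector.onCloseTag();
// collector.result() → { totalElements: 2, uniqueTags: 2, attributes: 1, maxDepth: 2 }
```

Nothing is retained per element, so memory stays flat no matter how many elements the file contains.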

The SAX parser gotcha

I wrote a minimal SAX parser from scratch (~200 lines) instead of using sax-js because I needed it to:

  • Handle parser.write(chunk) for incremental feeding
  • Not allocate a tree
  • Correctly handle CDATA, comments, PIs, and entity decoding across chunk boundaries

The trickiest part was self-closing tags like <Foo bar="1"/> — the / can end up in the next chunk if it lands on the boundary. The solution: the parser buffers incomplete tags until the closing > arrives.
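Here is a toy version of that buffering idea. MiniSax and its handler names are illustrative, and this sketch only handles plain tags and text (not the CDATA, comments, PIs, or entities the real parser covers): anything after an unmatched < is carried over to the next write() call, so a tag split across chunks, including the / of a self-closing tag, is only parsed once it is complete.

```javascript
// Toy incremental parser showing the carry-over buffer. Not the
// extension's real parser: no CDATA, comments, PIs, or entities.
class MiniSax {
  constructor(handlers) {
    this.handlers = handlers;
    this.carry = ""; // incomplete tail from the previous chunk
  }
  write(chunk) {
    let data = this.carry + chunk;
    let pos = 0;
    for (;;) {
      const lt = data.indexOf("<", pos);
      if (lt === -1) { this.emitText(data.slice(pos)); this.carry = ""; return; }
      this.emitText(data.slice(pos, lt));
      const gt = data.indexOf(">", lt);
      // Tag is incomplete in this chunk: buffer it and wait for more data.
      if (gt === -1) { this.carry = data.slice(lt); return; }
      this.emitTag(data.slice(lt + 1, gt));
      pos = gt + 1;
    }
  }
  emitText(text) { if (text.trim()) this.handlers.onText?.(text); }
  emitTag(body) {
    if (body.startsWith("/")) { this.handlers.onCloseTag?.(body.slice(1).trim()); return; }
    const selfClosing = body.endsWith("/");
    if (selfClosing) body = body.slice(0, -1);
    const name = body.split(/[\s/]/)[0];
    this.handlers.onOpenTag?.(name);
    if (selfClosing) this.handlers.onCloseTag?.(name);
  }
}

// The nasty case from above: the "/>" lands in the second chunk.
const events = [];
const p = new MiniSax({
  onOpenTag: n => events.push("open:" + n),
  onCloseTag: n => events.push("close:" + n),
});
p.write('<Root><Foo bar="1"');
p.write('/></Root>');
// events → ["open:Root", "open:Foo", "close:Foo", "close:Root"]
```

The first write() stops at the unfinished `<Foo bar="1"` and stashes it in `carry`; the second write() prepends it, sees the complete self-closing tag, and fires the open/close pair correctly.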

Numbers from a real test:

File                  Size     Elements   Parse time   RAM
Hotel reservations    1.8 GB   2.4M       3.4s         ~20MB
Product catalog       890 MB   1.1M       1.7s         ~18MB
API log dump          450 MB   6.2M       2.1s         ~16MB

Stack: Vanilla JS, Web Workers, zero dependencies. The entire extension is 45KB.

Chrome Web Store link | Free, no tracking, all processing is local.

Would love feedback — especially if you have edge-case XML files that break things.
