There's a developer in Tallinn, Estonia, named Andris Reinman who, over the course of a decade, built the entire Node.js email ecosystem. Nodemailer for sending. Mailparser for parsing. WildDuck for IMAP. ZoneMTA for outbound delivery. Plus the half-dozen libraries underneath them — libmime, libqp, libbase64, mailsplit — each handling one piece of what turns out to be an insanely complex problem.
Mailparser alone gets 75 million downloads a year. 56,000 repositories depend on it. And most developers who use it have no idea what's actually happening inside.
Why email parsing is hard
Email looks simple from the outside. A sender, a subject, a body. Maybe an attachment. How hard can it be to parse that?
The answer is: email is not a format. It's an archaeological dig through 40 years of overlapping RFCs, vendor extensions, and broken implementations that all somehow still need to work together.
A single email can contain multiple nested content types. A multipart/mixed wrapping a multipart/alternative wrapping a multipart/related wrapping a message/rfc822 — which is an email inside an email, with its own multipart structure. The parser has to maintain a tree with arbitrary nesting depth.
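A hypothetical forwarded message with attachments makes the nesting concrete; a structure like this is routine:

```text
multipart/mixed; boundary="outer"
├─ multipart/alternative; boundary="alt"
│  ├─ text/plain
│  └─ text/html
├─ message/rfc822              (a forwarded email with its own tree)
│  └─ multipart/mixed; boundary="inner"
│     ├─ text/plain
│     └─ application/pdf       (attachment)
└─ image/png                   (attachment)
```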
Each part can be encoded differently. Base64, quoted-printable, 7bit, 8bit. And each part can be in a different character set. UTF-8, ISO-8859-1, Windows-1252, Shift_JIS, EUC-JP, ISO-2022-JP, KS_C_5601-1987, Big5. A base64-encoded Shift_JIS body part requires two sequential decoding steps — first decode the base64 to recover the raw bytes, then convert those bytes from the declared charset. The order matters.
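A minimal sketch of those two steps, assuming Node's built-in TextDecoder (which supports shift_jis in official full-ICU builds; mailparser itself relies on iconv-lite and related libraries instead):

```javascript
// Sketch of the two-step decode for a base64-encoded Shift_JIS part.
// decodeBody is an illustrative name, not a mailparser API.
function decodeBody(raw, transferEncoding, charset) {
  // Step 1: undo the transfer encoding to recover the original bytes
  const bytes = transferEncoding === 'base64'
    ? Buffer.from(raw, 'base64')
    : Buffer.from(raw, 'binary'); // 7bit/8bit: bytes pass through
  // Step 2: convert those bytes from the declared charset to a JS string
  return new TextDecoder(charset).decode(bytes);
}
```

Swapping the two steps would feed base64 ASCII to the charset converter, which is why the order matters.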
Headers use their own encoding scheme. RFC 2047 encoded-words let you stick any charset and encoding into a header value: =?charset?encoding?text?=. The charset and encoding can vary word-by-word within a single header line. Some senders put encoded email addresses inside the display name field. Some encode the address itself. Some do both wrong.
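A toy decoder for well-formed single tokens gives the flavor (real implementations must also join adjacent encoded-words and cope with malformed input; this sketch does neither):

```javascript
// Hypothetical minimal decoder for RFC 2047 encoded-words, "=?charset?enc?text?=".
function decodeEncodedWord(value) {
  return value.replace(/=\?([^?]+)\?([BQ])\?([^?]*)\?=/gi, (_, charset, enc, text) => {
    let bytes;
    if (enc.toUpperCase() === 'B') {
      bytes = Buffer.from(text, 'base64'); // B = base64
    } else {
      // Q-encoding: "_" means space, "=XX" is a hex-encoded byte
      const raw = text.replace(/_/g, ' ').replace(/=([0-9A-F]{2})/gi,
        (_, hex) => String.fromCharCode(parseInt(hex, 16)));
      bytes = Buffer.from(raw, 'latin1');
    }
    return new TextDecoder(charset).decode(bytes);
  });
}
```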
And then there's the stuff that isn't in any RFC. Priority headers come in four different non-standard flavors (X-Priority, X-MSMail-Priority, Importance, Priority) with both numeric and text values. Date formats vary wildly, and invalid dates are common. Missing Content-Type headers are supposed to default to text/plain, unless there's a Content-Disposition header saying it's an attachment, in which case it's application/octet-stream.
This is what Andris Reinman's parser handles. All of it.
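Flattening the priority mess alone takes something like the following; the mapping is a plausible sketch, not mailparser's actual table:

```javascript
// Illustrative normalizer for the four non-standard priority headers.
// Numeric X-Priority: 1-2 = high, 3 = normal, 4-5 = low; text values vary by client.
function normalizePriority(headers) {
  const raw = headers['x-priority'] ?? headers['x-msmail-priority'] ??
              headers['importance'] ?? headers['priority'];
  if (raw == null) return 'normal';
  const value = String(raw).trim().toLowerCase();
  const num = parseInt(value, 10); // "1 (Highest)" parses as 1
  if (!Number.isNaN(num)) return num < 3 ? 'high' : num > 3 ? 'low' : 'normal';
  if (/^(high|urgent)/.test(value)) return 'high';
  if (/^(low|non-urgent)/.test(value)) return 'low';
  return 'normal';
}
```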
The architecture
Mailparser is built as a pipeline of Node.js Transform streams. Raw bytes go in one end, structured objects come out the other.
The first stage is mailsplit — a state machine that detects MIME boundary lines and splits the email into a tree of nodes. It tracks parent boundaries so it knows when a child multipart ends versus when a parent ends. The boundary detection is done byte-by-byte, not string-by-string, to avoid charset issues.
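The core boundary test itself is small; a toy string-based version (mailsplit works on raw Buffers, byte-by-byte) might look like:

```javascript
// Toy boundary classifier: per RFC 2046, "--" + boundary starts a new part
// and "--" + boundary + "--" terminates the multipart. Trailing whitespace
// and Buffer-level matching are omitted for clarity.
function classifyLine(line, boundary) {
  if (line === '--' + boundary + '--') return 'end';
  if (line === '--' + boundary) return 'boundary';
  return 'data';
}
```

The hard part mailsplit adds on top is tracking a stack of parent boundaries, so a terminator can close either the current node or an enclosing one.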
Each node gets its own decoder pipeline. If the content is base64-encoded, it flows through a base64 decoder. If it's quoted-printable, through a QP decoder. Then through a charset decoder — iconv-lite for most encodings, but a dedicated Japanese decoder for ISO-2022-JP because that encoding is stateful and can't be decoded in a streaming fashion. The Japanese decoder has to buffer the entire body before converting.
If the content uses format=flowed (RFC 3676), there's another Transform stream in the chain that unwraps soft line breaks. Each concern is a separate stream, composed together only when needed. The main parser stays clean.
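The flowed-text transform boils down to one rule, shown here as a buffered function for brevity (the real thing is a stream, and also handles DelSp and quote depth):

```javascript
// RFC 3676 sketch: a line ending in a space is a soft break and is joined
// with the next line; the "-- " signature separator is exempt.
function unwrapFlowed(text) {
  const out = [];
  for (const line of text.split('\r\n')) {
    const prev = out[out.length - 1];
    if (out.length && prev.endsWith(' ') && prev !== '-- ') {
      out[out.length - 1] = prev + line; // soft break: merge
    } else {
      out.push(line); // hard break: keep
    }
  }
  return out.join('\r\n');
}
```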
The header parsing happens in libmime — a library that's essentially a collection of RFC implementations. RFC 2047 for encoded-word decoding. RFC 2231 for parameter continuation (when a filename is too long for one header line). A state machine for parsing key=value pairs out of Content-Type and Content-Disposition headers. Line folding and unfolding. MIME type to file extension mapping.
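A simplified version of that key=value parsing, here as a split-based sketch that ignores semicolons inside quoted strings and RFC 2231 continuations:

```javascript
// Illustrative Content-Type parser: splits the MIME type from its
// key=value parameters and unquotes quoted values.
function parseContentType(value) {
  const [type, ...rest] = value.split(';');
  const params = {};
  for (const part of rest) {
    const eq = part.indexOf('=');
    if (eq === -1) continue;
    const key = part.slice(0, eq).trim().toLowerCase();
    let val = part.slice(eq + 1).trim();
    if (val.startsWith('"') && val.endsWith('"')) val = val.slice(1, -1);
    params[key] = val;
  }
  return { type: type.trim().toLowerCase(), params };
}
```

libmime uses a real state machine instead of split() precisely because quoted values can contain semicolons.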
The whole thing is about 2,500 lines across the core parser and mailsplit, plus another 1,000 in libmime, plus the streaming decoders. The total across all sub-libraries is roughly 6,000 lines.
The clever bits
A few things in the codebase that I found genuinely elegant.
When the parser encounters an attachment, it emits the attachment as an event and then pauses. The consumer has to explicitly call attachment.release() when it's done reading the attachment stream. Only then does the parser continue. This is cooperative scheduling — it prevents the parser from racing ahead and mixing up node contexts while the consumer is still reading attachment bytes. Simple and correct.
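The handshake can be modeled in a few lines; this is a self-contained simulation of the idea, with a stored callback standing in for mailparser's paused streams and a made-up filename:

```javascript
// Simulated release() handshake: the parser parks its continuation on the
// attachment object and resumes only when the consumer calls release().
function emitAttachment(attachment, onAttachment, continueParsing) {
  attachment.release = continueParsing; // consumer decides when parsing resumes
  onAttachment(attachment);
}

// usage: nothing after the attachment is processed until release() runs
const events = [];
emitAttachment(
  { filename: 'report.pdf' }, // hypothetical attachment
  att => { events.push('consumed ' + att.filename); att.release(); },
  () => events.push('parser resumed')
);
```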
The StreamHash class is a Transform stream that computes an MD5 hash and byte count as data flows through it. The attachment's content property IS the StreamHash instance — consumers read decoded content from it while it silently computes the checksum. No extra buffering, no second pass.
The address decoder handles a real-world edge case where an email's display name is actually a base64-encoded string containing an email address. The parser detects this, decodes it, re-parses it with the address parser, and splices the corrected entries back into the address list. It uses a WeakSet to track already-processed entries and prevent infinite loops from the same array being modified during iteration.
The charset normalizer strips all non-alphanumeric characters before comparison — so UTF-8, utf8, and utf-8 all resolve to the same thing. And KS_C_5601-1987 silently remaps to CP949, because that's what Korean email clients actually mean when they say KS_C_5601-1987.
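In miniature, with the alias table reduced to the one example mentioned:

```javascript
// Illustrative charset normalizer: lowercase, strip non-alphanumerics,
// then remap known aliases. Alias table trimmed for the sketch.
const CHARSET_ALIASES = { ksc56011987: 'cp949' };
function normalizeCharset(name) {
  const key = String(name).toLowerCase().replace(/[^a-z0-9]/g, '');
  return CHARSET_ALIASES[key] || key;
}
```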
Why it's streaming
The most obvious question: why not just load the whole email into memory and parse it?
Memory. A production email server processes millions of messages. An email with large attachments can be 100MB+. The streaming architecture processes chunks as they arrive — attachment content flows through the pipeline and out to consumers without ever being fully buffered.
There's also a setImmediate() between processing lines to yield to the event loop. This prevents a large email from blocking the event loop for extended periods. On a server handling thousands of concurrent connections, this is the difference between responsive and frozen.
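The pattern looks roughly like this (the batch size is an arbitrary illustration, not mailparser's actual cadence):

```javascript
// Yield-to-the-event-loop sketch: handle a batch of lines, then hand the
// rest to setImmediate so pending I/O events can run between batches.
function processLines(lines, handle, done) {
  let i = 0;
  (function step() {
    const end = Math.min(i + 1000, lines.length);
    for (; i < end; i++) handle(lines[i]);
    if (i < lines.length) setImmediate(step); // let other events run first
    else done();
  })();
}
```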
The pipeline also respects Node.js backpressure throughout. When any stage returns false from write(), the previous stage waits for drain. This propagates all the way back to the network socket. The parser processes data exactly as fast as the slowest consumer can handle it.
The maintainer
Andris Reinman is the sole npm maintainer of mailparser, nodemailer, and all the sub-libraries. He built the commercial product EmailEngine on top of this stack. The mailparser README now says it's in maintenance mode — security updates and critical bug fixes only. He recommends PostalMime for new projects, which works in both Node.js and browsers.
62 contributors over the library's lifetime, but the architecture and the bulk of the code are one person's work. 2 million weekly downloads. The latest release is from March 2026.
There's something worth sitting with about the fact that a single developer in Estonia built the email parsing infrastructure that tens of thousands of Node.js applications rely on. Not a team. Not a company. One person who understood the problem deeply enough to decompose it into composable stream transforms and got every edge case right.
The takeaway
If you parse email in Node.js, you're almost certainly using Andris Reinman's code, directly or transitively. The next time you call simpleParser() and get back a clean object with text, html, and attachments, remember that underneath it, a state machine is splitting MIME boundaries byte-by-byte, a charset decoder is converting KS_C_5601-1987 to CP949, a base64 decoder is managing 3-byte alignment at chunk boundaries, and a WeakSet is preventing infinite loops in re-parsed address fields.
It's 6,000 lines because email is 6,000 lines of edge cases. And one person wrote them all. You can read the source at nodemailer/mailparser and follow Andris Reinman's work on GitHub.