Jocer Franquiz

Posted on Mar 23

HTML Parsing Algorithm and Memory Structure

#learning #webdev #programming #computerscience

Ever wonder what actually happens between the moment your browser receives raw HTML bytes and the moment you see a page? Most of us just load HTML files all day and never think about the machinery underneath. This is the first article in a series where we dig into that machinery. Our end goal is a working HTML parser and static site generator, written from scratch, for the pure joy of understanding how things work. No frameworks, no libraries, just us and the spec!

The State Machine

The browser uses a state machine to parse HTML. Rather than building a tree directly, it reads the HTML character by character and switches between states as it goes.

Think of it like a traffic light. The light is always in one state: red, yellow, or green. Based on what happens (timer expires, car approaches), it transitions. The HTML parser works the same way. It's always in a specific state, and the character it reads next determines where it goes.

It starts in the "data state." As it reads characters, it looks for <. When it finds one, it jumps to the "tag open state." If the next character is an ASCII letter, it moves to the "tag name state" and starts collecting the tag name. Hit a / like in </? That's the "end tag open state." Each state has strict rules about what's valid and what comes next.
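To make the transitions concrete, here is a minimal sketch of such a state machine in Python. It is a toy, not the spec's tokenizer: it knows only four states, ignores attributes, comments, and character references, and the names State and tokenize are made up for this illustration.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()       # just saw "<"
    END_TAG_OPEN = auto()   # just saw "</"
    TAG_NAME = auto()       # collecting the tag name


def is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()


def tokenize(html):
    """Toy tokenizer: plain text, <name> and </name> only (an assumption)."""
    tokens, text, name = [], [], []
    state, is_end, i = State.DATA, False, 0

    def flush_text():
        # Emit any pending character data as a single text token.
        if text:
            tokens.append(("text", "".join(text)))
            text.clear()

    while i < len(html):
        ch = html[i]
        i += 1
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                text.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
            elif is_ascii_letter(ch):
                flush_text()
                name, is_end, state = [ch.lower()], False, State.TAG_NAME
            else:
                text.append("<")   # a lone "<" is just text
                i -= 1             # reconsume ch in the data state
                state = State.DATA
        elif state is State.END_TAG_OPEN:
            if is_ascii_letter(ch):
                flush_text()
                name, is_end, state = [ch.lower()], True, State.TAG_NAME
            else:
                text.extend("</")  # simplification: the real spec differs here
                i -= 1
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                tokens.append(("end" if is_end else "start", "".join(name)))
                state = State.DATA
            else:
                name.append(ch.lower())
    flush_text()                   # unfinished tags at end of input are dropped
    return tokens


print(tokenize("<p>Hi</p>"))
# [('start', 'p'), ('text', 'Hi'), ('end', 'p')]
```

Notice that the loop body is nothing more than "look at the current state, look at the character, pick the next state." The real tokenizer has the same shape, just with many more branches.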

The spec defines 80 of these states. All of them. This is why every browser parses your janky HTML the same way. They're all implementing the same state machine.

(Ref: WHATWG HTML Spec — Tokenization)

Tree Construction Algorithm

While the state machine tokenizes the HTML, a separate tree construction algorithm builds the DOM. It maintains a stack of open elements. Start tag? Push onto the stack. End tag? Pop elements until you find a match.

But here's the thing: this algorithm doesn't just blindly follow your tags. It has sophisticated error handling and auto-completion baked in. For example, if we forget to close a <p> before opening a new one, the algorithm closes it for us. Nested something illegally? It rearranges things into a valid structure. This is why browsers are ridiculously forgiving of bad HTML.

It also handles special cases. Elements like <br> and <img> are void elements: they have no content and never take an end tag, so the algorithm doesn't wait for one. Elements like <table> and <template> switch the parser into special insertion modes that affect how subsequent tokens are treated.
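Here is a rough sketch of the stack of open elements in Python, including the two behaviors just described: an open <p> being closed automatically, and void elements never being pushed. The token format, the build_tree name, and the two small sets are assumptions for this example; the real rules cover far more elements and check scope, not just the top of the stack.

```python
# Small subsets for illustration only; the spec's lists are longer.
VOID_ELEMENTS = {"br", "img", "hr", "input", "meta", "link"}
CLOSES_OPEN_P = {"p", "div", "ul", "ol", "h1", "h2", "h3", "table"}


def build_tree(tokens):
    """Build a nested dict tree from ("start"|"end"|"text", value) tokens."""
    root = {"tag": "#document", "children": []}
    stack = [root]  # stack of open elements; stack[-1] is the current node

    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "start":
            # Implied end tag: an open <p> is closed before a new block starts.
            # (Simplified: we only look at the top of the stack.)
            if value in CLOSES_OPEN_P and stack[-1]["tag"] == "p":
                stack.pop()
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if value not in VOID_ELEMENTS:
                stack.append(node)  # void elements are never left open
        elif kind == "end":
            # Ignore stray end tags; otherwise pop until the match is popped.
            if any(n["tag"] == value for n in stack[1:]):
                while stack.pop()["tag"] != value:
                    pass
    return root


tokens = [
    ("start", "p"), ("text", "one"),
    ("start", "p"), ("text", "two"), ("start", "br"), ("text", "three"),
    ("end", "p"),
]
tree = build_tree(tokens)
print([child["tag"] for child in tree["children"]])  # ['p', 'p']
```

The first <p> was never closed in the input, yet the result has two sibling paragraphs rather than one nested inside the other.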

You can read the updated HTML specs here: (Ref: WHATWG HTML Spec — Tree Construction)

How the Tree Lives in Memory

The DOM tree is stored as a graph data structure in the browser's heap memory. Each element becomes a node object. Here's what that looks like in practice:

Node Objects: Each DOM element is an object in memory containing its tag name, attributes, and text content, plus pointers (memory addresses) to related nodes.

Parent-Child Relationships: Each node has a pointer to its parent. A <p> inside a <body> has a pointer referencing the body node's memory address. Each parent also maintains pointers to its children — but it doesn't store copies of them, just references to where they live in the heap.

Sibling Relationships: Adjacent elements are linked as siblings. Each node tracks its next sibling and previous sibling through pointers. The DOM API exposes this directly: node.nextSibling, node.previousSibling, node.firstChild. These properties surface the engine's internal links to JavaScript as object references.

Linked List Structure: In engines such as Blink and WebKit, the children of a parent are stored as a linked list rather than an array. Each child points to the next sibling. The nodes aren't in consecutive memory locations; they're scattered across the heap, with pointers connecting them into a logical sequence.

Text Nodes: Text isn't stored inside element nodes. It's wrapped in separate text node objects. <p>Hello</p> has a child node of type "text" containing "Hello", a separate object with its own memory address.

Attributes Storage: Attributes like class, id, and data-* are stored in a per-element collection that maps names to values (the exact layout varies by engine). When you call element.getAttribute('class'), the browser looks up the attribute in this internal storage.

Memory Efficiency: Browsers don't duplicate data needlessly. In V8 (Chrome's JavaScript engine), identical strings can be stored only once through a technique called string interning: a StringTable lets two uses of the same string value reference a single string object. The same idea applies when two elements share the same class name, since both can reference one shared string instead of holding separate copies.
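The pointer relationships above can be sketched as a small Python class. This is an illustration under our own assumptions, not any engine's actual layout: the Node class and its field names are invented here, and Python references stand in for raw memory addresses.

```python
class Node:
    """A toy DOM node linked to its relatives by references."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text            # only used by "#text" nodes
        self.attributes = {}        # name -> value
        self.parent = None
        self.first_child = None
        self.last_child = None
        self.prev_sibling = None
        self.next_sibling = None

    def append_child(self, child):
        # Link the new child at the end of this node's sibling list.
        child.parent = self
        child.prev_sibling = self.last_child
        if self.last_child is None:
            self.first_child = child
        else:
            self.last_child.next_sibling = child
        self.last_child = child
        return child

    def children(self):
        # Walk the linked list: start at first_child, follow next_sibling.
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling


body = Node("body")
p = body.append_child(Node("p"))
p.attributes["class"] = "intro"
p.append_child(Node("#text", text="Hello"))
div = body.append_child(Node("div"))

print([child.name for child in body.children()])  # ['p', 'div']
print(p.next_sibling is div)                      # True
```

Appending a child touches only a handful of references no matter how many children the parent already has, which is one reason a linked list is a comfortable fit for a tree that grows while it is being parsed.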

Find more details about memory management here: (Ref: V8 StringTable implementation — Zhenghao He, "JavaScript memory model demystified")
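String interning is easy to picture with a few lines of Python. The StringTable class below is a hypothetical stand-in written for this article; it shares only the name and the basic idea with V8's internal table, not its implementation.

```python
class StringTable:
    """Keeps one canonical object per distinct string value."""

    def __init__(self):
        self._table = {}

    def intern(self, value):
        # Return the stored object if an equal string exists, else store this one.
        return self._table.setdefault(value, value)


table = StringTable()

# Two strings built separately at runtime, equal in value.
first = table.intern("".join(["con", "tainer"]))
second = table.intern("".join(["contain", "er"]))

print(first == second)  # True: same characters
print(first is second)  # True: same object, stored once
```

After interning, comparing two class names can be a pointer comparison instead of a character-by-character one, so the technique saves time as well as memory.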

The Parsing Flow in Memory

Here's how the whole thing unfolds:

Initial State: A buffer holds the raw HTML text. A pointer reads through it from position zero.

Tokenization: The state machine reads characters and builds tokens: small temporary objects like "start tag: div, attributes: class='container'". These tokens are consumed as they're created; they don't stick around.

Tree Building: Each token goes to the tree construction algorithm. Start tag? Create a new node, populate it from the token, append it as a child of the current open element, push it onto the stack. End tag? Pop elements off the stack until you find a match.

Incremental Growth: The tree grows node by node. First the document node, then <html>, then <head>, then <body>. As parsing continues, deeper nesting happens.

Memory Addresses: Each new node gets allocated memory from the heap. Its parent receives a pointer to the new location. The sibling linked list gets updated. All of this happens while the parser is still chewing through the HTML buffer.
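We can watch this flow with Python's standard library. html.parser is not a spec-compliant HTML5 parser, but its interface mirrors the steps above: it tokenizes the buffer and hands us each token through a callback, and we do the tree building ourselves. The TreeBuilder class and parse_in_chunks helper are our own names for this sketch, and feeding the input in small chunks simulates bytes arriving a few at a time.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Consumes tokens as the parser produces them and grows a dict tree."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.open_elements = [self.root]

    def handle_starttag(self, tag, attrs):
        # Create a node from the token, attach it, and push it on the stack.
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.open_elements[-1]["children"].append(node)
        self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Pop until the matching element is popped; ignore stray end tags.
        if any(n["tag"] == tag for n in self.open_elements[1:]):
            while self.open_elements.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        children = self.open_elements[-1]["children"]
        if children and isinstance(children[-1], str):
            children[-1] += data  # text may arrive split across chunks
        else:
            children.append(data)


def parse_in_chunks(html, size):
    builder = TreeBuilder()
    for start in range(0, len(html), size):
        builder.feed(html[start:start + size])  # the tree grows after each feed
    builder.close()
    return builder.root


page = '<div class="container"><p>Hello</p></div>'
whole = parse_in_chunks(page, len(page))
chunked = parse_in_chunks(page, 5)
print(whole == chunked)  # True
```

The tokens never pile up anywhere: each callback fires, the tree changes, and the token is gone. Only the tree and the stack of open elements persist between chunks.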

Special Considerations

Script Execution: When the parser hits a classic <script> tag without async or defer, it pauses. The script is downloaded if it's external, then executed. That script can modify the DOM tree that's already been built: adding, removing, or rearranging nodes. Parsing resumes after the script finishes.

Incremental Parsing: Modern browsers don't wait for the entire HTML file to download before they start parsing. As bytes arrive over the network, they're fed straight into the parser. The tree starts building before the HTML is fully downloaded, which is why you see pages render progressively. A related but separate optimization, which the spec calls "speculative HTML parsing," lets the browser scan ahead for resources to fetch while the main parser is blocked, for example on a script.

Whitespace Handling: Text nodes get created for whitespace between tags. A line break and spaces between </div> and <p> produce a text node containing just whitespace. The parser drops whitespace only in a few positions (for example, before <head>); elsewhere the DOM preserves it, and the collapsing you see on screen happens later, when CSS lays out the text.

Live Tree: During parsing, the tree is a mutable data structure in memory. After parsing completes, scripts can still modify the tree in real time, and the in-memory structure updates immediately.
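The whitespace point is easy to verify for ourselves. This small sketch again leans on Python's html.parser as a stand-in tokenizer (it is not a browser parser, and TextCollector is a name invented here) and records every run of character data it reports.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Records each run of character data the parser reports."""

    def __init__(self):
        super().__init__()
        self.runs = []

    def handle_data(self, data):
        self.runs.append(data)


collector = TextCollector()
collector.feed("<div>a</div>\n  <p>b</p>")
collector.close()

print(collector.runs)  # ['a', '\n  ', 'b']

# Runs made of nothing but whitespace would become whitespace-only text nodes.
whitespace_only = [run for run in collector.runs if run.strip() == ""]
print(len(whitespace_only))  # 1
```

That middle entry is the line break and indentation between </div> and <p>. In a tree builder that keeps everything, it becomes a text node sitting between the two elements.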

The elegance here is the separation of concerns: the state machine handles the syntactic complexity of HTML, the tree construction algorithm handles semantic relationships and error recovery, and the tree structure in memory provides efficient access for rendering and scripting.

(Ref: WHATWG HTML Spec — Speculative HTML Parsing, MDN — Memory Management)

Practical Example

Consider this simple HTML page:

<!DOCTYPE html>
<html>
<head>
  <title>Simple Page</title>
</head>
<body>
  <header>Header Content</header>
  <div>Div One</div>
  <div>Div Two</div>
  <footer>Footer Content</footer>
</body>
</html>

Memory Layout (Simplified ASCII)

Here's how this DOM tree is connected in memory. Each block is a separate object on the heap, linked by pointers. The addresses are illustrative: real addresses aren't this tidy, and real node objects are much larger than the gaps between these addresses suggest. Whitespace-only text nodes are left out to keep the diagram readable.

HEAP:

[0xA00: Document node]
  └─ children: [0xA10]

[0xA10: html-node { parent: 0xA00, children: [0xB00, 0xC00] }]
  ├─ firstChild → 0xB00 (head)
  └─ lastChild  → 0xC00 (body)

[0xB00: head-node { parent: 0xA10, children: [0xB20] }]
  └─ [0xB20: title-node { parent: 0xB00, children: [0xB30] }]
       └─ [0xB30: text "Simple Page" { parent: 0xB20 }]

[0xC00: body-node { parent: 0xA10, children: [0xD00, 0xD40, 0xD80, 0xDC0] }]
  └─ firstChild → 0xD00

[0xD00: header-node { parent: 0xC00, nextSibling: 0xD40 }]
  └─ [0xD10: text "Header Content" { parent: 0xD00 }]

[0xD40: div-node { parent: 0xC00, prevSibling: 0xD00, nextSibling: 0xD80 }]
  └─ [0xD50: text "Div One" { parent: 0xD40 }]

[0xD80: div-node { parent: 0xC00, prevSibling: 0xD40, nextSibling: 0xDC0 }]
  └─ [0xD90: text "Div Two" { parent: 0xD80 }]

[0xDC0: footer-node { parent: 0xC00, prevSibling: 0xD80 }]
  └─ [0xDD0: text "Footer Content" { parent: 0xDC0 }]

What This Means

Each block is a separate object allocated somewhere on the heap. The hex addresses (like 0xD40) are what the pointers actually store in memory.

For example: html-node at 0xA10 has the Document node at 0xA00 as its parent. It points to two children: head at 0xB00 and body at 0xC00.

nextSibling: 0xD80 on the first div means: "the next sibling lives at address 0xD80", which is the second div.

text "Div One" at 0xD50 is a separate text node object. Its parent pointer leads to 0xD40 (the first div).

When the browser needs to go from <html> to <body>, it follows the pointer 0xC00, jumps to that address, and finds the body node object there.

These objects are scattered across the heap, but the pointers weave them into one coherent tree. That's the trick — physical layout in memory doesn't matter; logical structure is all in the pointers.
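If you want to follow these pointers yourself, Python's xml.dom.minidom exposes the same navigation properties as the browser DOM. Keep in mind it is an XML parser used here as a stand-in, not a browser, and the markup is written without whitespace between tags so that no whitespace-only text nodes get in the way.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<html><head><title>Simple Page</title></head>"
    "<body><header>Header Content</header><div>Div One</div>"
    "<div>Div Two</div><footer>Footer Content</footer></body></html>"
)

html = doc.documentElement
body = html.lastChild              # html -> body, like following 0xC00
header = body.firstChild           # body -> header
first_div = header.nextSibling     # header -> first div
second_div = first_div.nextSibling
text = first_div.firstChild        # the separate text node object

print(body.tagName)                  # body
print(second_div.firstChild.data)    # Div Two
print(text.parentNode is first_div)  # True
print(html.parentNode is doc)        # True
```

Every hop in that script is a single reference lookup. At no point does the code search the tree or care where the objects physically live.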

What's Next

Now that we understand how browsers do it, it's time to get our hands dirty. In the next article, we'll write our own tokenizer from scratch! A simplified state machine that can chew through real HTML and spit out tokens. No regex hacks, no shortcuts. Just a loop, a switch statement, and a growing appreciation for the people who wrote the spec.

Here's the full roadmap for this series:

HTML Parsing Algorithm and Memory Structure: You are here.
Building a Tokenizer from Scratch: Writing a state machine that turns raw HTML into tokens, character by character.
From Tokens to Tree: Implementing a DOM Builder: Taking those tokens and constructing an in-memory tree, with the stack-based algorithm and basic error recovery.
A Minimal Static Site Generator: Using our parser to read HTML templates, inject content, and output a complete static site. The fun payoff.
Rendering to the Terminal (Bonus): Because why not? We'll walk our DOM tree and render a simplified version of the page right in the console.

The goal isn't to compete with Chrome. It's to build something small, understand every line, and have fun doing it. See you in the next one.

Which language should we use?

C99?
Python?
FASM?
Forth?
