Dev Nestio

Posted on Jun 27

I Built a Browser-Only HTML Entity Encoder/Decoder — Named, Decimal & Hex, 246 Tests

#javascript #webdev #html #devtools

Every developer has hit this: you need to escape <, >, &, and quotes before dropping user input into HTML, or you're staring at mangled text full of &amp; and need to convert it back. Most online tools cover the basics, but they fall short on the full HTML5 named entity set or force you to pick one of three encoding formats.

I built one that handles all three formats, 253 named entities, and decodes all of them with a single regex pass — entirely in the browser, no server, no framework.

👉 https://html-entity-encoder.pages.dev

What It Does

Encode: text → HTML entities in three modes
- Named: é → &eacute;, © → &copy;, π → &pi;
- Decimal: é → &#233;, © → &#169;
- Hex: é → &#xE9;, © → &#xA9;
Decode: all three entity formats back to plain text
253 HTML5 named entities — Latin-1 Supplement, Latin Extended-A, Greek, Math, Arrows, Punctuation, Currency, Symbols
Real-time: output updates on every keystroke
Quick Reference: collapsible table you can click to insert characters
Swap, Copy, Clear, Sample buttons
Zero external dependencies — single HTML file, works offline

The Core: Encoding in Three Modes

The encoding logic iterates over Unicode code points (not UTF-16 code units), which is essential for handling emoji and characters outside the BMP:

const ALWAYS_ENCODE = new Set(['&', '<', '>', '"', "'"]);

function encode(text, mode) {
  if (!text) return '';
  const result = [];
  for (const ch of text) {          // for...of iterates code points
    const cp = ch.codePointAt(0);
    const mustEncode = ALWAYS_ENCODE.has(ch) || cp > 127;
    if (!mustEncode) { result.push(ch); continue; }

    if (mode === 'named') {
      result.push(CHAR_TO_ENTITY[ch] || `&#${cp};`);
    } else if (mode === 'decimal') {
      result.push(`&#${cp};`);
    } else {                        // hex
      result.push(`&#x${cp.toString(16).toUpperCase()};`);
    }
  }
  return result.join('');
}

The for...of loop over a string yields Unicode code points. A for loop with index would break on any character outside the Basic Multilingual Plane — emoji like 😀 are encoded as surrogate pairs in UTF-16, so a naive str[i] approach would emit two separate (invalid) entities for a single character.
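The difference is easy to see on a single emoji. A minimal sketch, with illustrative variable names that are not taken from the tool's source:

```javascript
// One emoji, iterated two ways. Names here are illustrative only.
const emoji = '\u{1F600}'; // 😀

// for...of walks code points: one iteration, one full character.
const codePoints = [];
for (const ch of emoji) {
  codePoints.push(ch.codePointAt(0).toString(16).toUpperCase());
}

// Index access walks UTF-16 code units: two lone surrogates.
const codeUnits = [];
for (let i = 0; i < emoji.length; i++) {
  codeUnits.push(emoji.charCodeAt(i).toString(16).toUpperCase());
}

console.log(codePoints); // [ '1F600' ]
console.log(codeUnits);  // [ 'D83D', 'DE00' ]
```

Encoding the two code units separately would yield &#xD83D;&#xDE00;, which is not a valid reference to the emoji.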

Why the &#N; fallback in named mode? Because the 253 named entities don't cover everything. A character like 😀 (U+1F600) has no HTML5 named form, so the encoder falls back to a numeric reference, written in decimal.
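To see the three modes and the fallback side by side, here is a small sketch. It assumes a six-entry sample map (SAMPLE_CHAR_TO_ENTITY) instead of the real 253-entry one, and encodeSketch is a stand-in name, not the tool's function:

```javascript
// Illustrative subset; the tool's real CHAR_TO_ENTITY has 253 entries.
const SAMPLE_CHAR_TO_ENTITY = {
  '&': '&amp;', '<': '&lt;', '>': '&gt;',
  '"': '&quot;', "'": '&apos;', '\u00E9': '&eacute;'
};
const SAMPLE_ALWAYS_ENCODE = new Set(['&', '<', '>', '"', "'"]);

// Same shape as encode() above, wired to the sample map.
function encodeSketch(text, mode) {
  let out = '';
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (!SAMPLE_ALWAYS_ENCODE.has(ch) && cp <= 127) { out += ch; continue; }
    if (mode === 'named') out += SAMPLE_CHAR_TO_ENTITY[ch] || `&#${cp};`;
    else if (mode === 'decimal') out += `&#${cp};`;
    else out += `&#x${cp.toString(16).toUpperCase()};`;
  }
  return out;
}

const namedOut = encodeSketch('caf\u00E9 \u{1F600}', 'named');
const decimalOut = encodeSketch('<\u00E9>', 'decimal');
const hexOut = encodeSketch('<\u00E9>', 'hex');
console.log(namedOut);   // caf&eacute; &#128512;
console.log(decimalOut); // &#60;&#233;&#62;
console.log(hexOut);     // &#x3C;&#xE9;&#x3E;
```

In named mode the é resolves through the map, while the emoji has no entry and drops to its decimal code point.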

The Decode Regex

One regex handles all three entity formats in a single pass:

function decode(text) {
  if (!text) return '';
  return text.replace(
    /&([a-zA-Z][a-zA-Z0-9]*);|&#([0-9]+);|&#[xX]([0-9a-fA-F]+);/g,
    (match, name, dec, hex) => {
      try {
        if (name !== undefined)
          return Object.prototype.hasOwnProperty.call(ENTITY_TO_CHAR, name)
            ? ENTITY_TO_CHAR[name] : match;
        if (dec !== undefined)
          return String.fromCodePoint(parseInt(dec, 10));
        if (hex !== undefined)
          return String.fromCodePoint(parseInt(hex, 16));
      } catch (_) {}
      return match;
    }
  );
}

Three alternation groups, each capturing a different entity format. The named entity lookup uses hasOwnProperty explicitly so that only the map's own keys resolve. Names like toString and constructor match the entity-name pattern, so a direct ENTITY_TO_CHAR[name] lookup would return inherited values from Object.prototype instead of leaving the text unchanged. (__proto__ cannot match at all, because the pattern admits only letters and digits.)

The hex branch accepts both &#x...; and &#X...; (the [xX] in the regex) — the HTML5 spec allows both, even though lowercase is conventional.
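A short usage sketch makes the behavior on unknown or malformed input concrete. It assumes a four-entry sample map; SAMPLE_ENTITY_TO_CHAR and decodeSketch are illustrative names, not the tool's own:

```javascript
// Illustrative subset; the tool's real ENTITY_TO_CHAR has 253 entries.
const SAMPLE_ENTITY_TO_CHAR = { amp: '&', lt: '<', gt: '>', eacute: '\u00E9' };

function decodeSketch(text) {
  return text.replace(
    /&([a-zA-Z][a-zA-Z0-9]*);|&#([0-9]+);|&#[xX]([0-9a-fA-F]+);/g,
    (match, name, dec, hex) => {
      try {
        if (name !== undefined) {
          return Object.prototype.hasOwnProperty.call(SAMPLE_ENTITY_TO_CHAR, name)
            ? SAMPLE_ENTITY_TO_CHAR[name] : match;
        }
        if (dec !== undefined) return String.fromCodePoint(parseInt(dec, 10));
        return String.fromCodePoint(parseInt(hex, 16));
      } catch (_) {
        return match; // e.g. RangeError for code points above U+10FFFF
      }
    }
  );
}

// Named, decimal, and both hex spellings decode in one pass.
const mixed = decodeSketch('&lt;b&gt; &eacute; &#233; &#xE9; &#XE9;');
// Unknown name, out-of-range code point, missing semicolon: all left as-is.
const untouched = decodeSketch('&bogus; &#1114112; &amp');
console.log(mixed);     // <b> é é é é
console.log(untouched); // &bogus; &#1114112; &amp
```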

Building the Entity Maps

The decode map is the source of truth: ENTITY_TO_CHAR maps each name string to its Unicode character. Then the encode map is derived by inverting it:

const CHAR_TO_ENTITY = {};
(function buildCharMap() {
  // First pass: reverse all entries
  for (const [name, ch] of Object.entries(ENTITY_TO_CHAR)) {
    if (!CHAR_TO_ENTITY[ch]) CHAR_TO_ENTITY[ch] = `&${name};`;
  }
  // Second pass: force canonical preferred names for ambiguous chars
  const preferred = {
    '"': '&quot;', '&': '&amp;', "'": '&apos;',
    '<': '&lt;',   '>': '&gt;',  '\u00A0': '&nbsp;',  // U+00A0, not a plain space
    '©': '&copy;', '®': '&reg;', '™': '&trade;',
    '€': '&euro;', '×': '&times;', '÷': '&divide;'
  };
  Object.assign(CHAR_TO_ENTITY, preferred);
})();

Some characters have multiple named forms in HTML5. For example, ' maps to both &apos; (from XHTML) and &squot;. The second pass pins canonical names so the encoder always outputs the most recognizable form.
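The effect of the two passes can be sketched on a tiny map. The sample below is hypothetical and uses the uppercase aliases &QUOT; and &AMP;, which HTML5 defines alongside the lowercase forms; they are listed first on purpose so the first pass picks them:

```javascript
// Hypothetical mini map with two names for the same character.
const SAMPLE_NAMES = { QUOT: '"', quot: '"', AMP: '&', amp: '&', copy: '\u00A9' };

// First pass: invert, keeping the first name seen for each character.
const firstPass = {};
for (const [name, ch] of Object.entries(SAMPLE_NAMES)) {
  if (!firstPass[ch]) firstPass[ch] = `&${name};`;
}

// Second pass: pin the conventional lowercase names.
const pinned = Object.assign({}, firstPass, { '"': '&quot;', '&': '&amp;' });

console.log(firstPass['"'], firstPass['&']); // &QUOT; &AMP;
console.log(pinned['"'], pinned['&']);       // &quot; &amp;
console.log(pinned['\u00A9']);               // &copy;
```

Without the second pass, the encoder's output would depend on the order in which names happen to appear in the decode map.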

What's in the 253-entity Map

Category	Count	Examples
HTML special	5	`&amp;` `&lt;` `&gt;` `&quot;` `&apos;`
Latin-1 Supplement	96	`&eacute;` `&ntilde;` `&copy;` `&euro;`
Latin Extended-A	5	`&OElig;` `&oelig;` `&Scaron;`
Greek	49	`&alpha;` `&pi;` `&Sigma;` `&Omega;`
Mathematical	37	`&infin;` `&ne;` `&le;` `&sum;` `&radic;`
Arrows	11	`&rarr;` `&larr;` `&hArr;` `&crarr;`
Punctuation	20	`&mdash;` `&ndash;` `&hellip;` `&ldquo;`
Misc Symbols	10+	`&trade;` `&bull;` `&spades;` `&hearts;`
Currency	5	`&euro;` `&pound;` `&yen;` `&cent;`

Testing: 246 Cases, No Framework

246 tests across 21 sections, built on a two-function inline runner:

let passed = 0, failed = 0;

function eq(a, b, label) {
  if (a === b) { console.log(`  ✓ ${label}`); passed++; }
  else {
    console.error(`  ✗ ${label}\n    got:      ${JSON.stringify(a)}\n    expected: ${JSON.stringify(b)}`);
    failed++;
  }
}
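For context, this is roughly how such a runner could be wired together. Only the assertion helper comes from the article; section() and summary() are hypothetical helpers added for illustration, and check() is a renamed copy of eq():

```javascript
// Sketch of a tiny runner in the same style. check() mirrors eq() above;
// section() and summary() are hypothetical, not the tool's code.
let passCount = 0, failCount = 0;

function section(title) {
  console.log(`\n${title}`);
}

function check(actual, expected, label) {
  if (actual === expected) { console.log(`  ✓ ${label}`); passCount++; }
  else { console.error(`  ✗ ${label}`); failCount++; }
}

function summary() {
  console.log(`\n${passCount} passed, ${failCount} failed`);
  return failCount === 0 ? 0 : 1; // suitable as a process exit code
}

section('Numeric references');
check(String.fromCodePoint(parseInt('3C', 16)), '<', 'hex 3C is <');
check(String.fromCodePoint(960), '\u03C0', 'decimal 960 is π');
const exitCode = summary();
```

Returning a non-zero code on failure is what lets a script like this plug into npm test.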

Section	Tests	What's covered
Entity map coverage	12	Size ≥ 250, key entries exist
Encode HTML specials (named/decimal/hex)	23	`& < > " '` in all modes
Encode Latin extended (all modes)	25	`© € é ñ ü ± ° ½`
Encode Greek (all modes)	14	`α β γ π Σ Ω`
Encode math & symbols	11	`∞ ≠ ≤ √ → • — …`
ASCII passthrough	8	Letters, digits, misc symbols
Encode mixed strings	7	XSS payloads, café, résumé
Decode named entities	20	All common named entities
Decode decimal entities	10	`&#38;` through `&#960;`
Decode hex (lowercase/uppercase X)	15	`&#x3C;` and `&#X3C;` forms
Decode mixed strings	8	Full HTML tags, price strings
Decode edge cases	10	Unknown entities, no semicolon, empty
Round-trip (encode→decode)	30	10 strings × 3 modes
Double-encoding prevention	2	`&` → `&amp;`
Unicode correctness	7	U+0000, U+0041, U+2665
Entity map value checks	10	Known char values
Misc symbols encode	8	`♠ ♥ ♦ ♣ ⇒ ∑`
Less common entities	13	`&OElig;` `&bull;` `&permil;`
Whitespace entities	5	`&ensp;` `&emsp;` `&zwnj;`
Hex uppercase digits	4	`&#xC4;` `&#xDC;`
Non-BMP encode/decode	4	😀 decimal + hex round-trip

Run with npm test.

A Subtle Edge Case: Prototype Pollution in Decode

The named entity lookup is written as:

Object.prototype.hasOwnProperty.call(ENTITY_TO_CHAR, name)
  ? ENTITY_TO_CHAR[name]
  : match

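rather than the simpler ENTITY_TO_CHAR[name]. Why? Because name comes from user-supplied text via the regex match. If someone passes &constructor; or &toString; as input, a direct bracket lookup would walk the prototype chain and return a function object. String.prototype.replace coerces the callback's return value to a string, so the decoded output would contain the function's source text (something like function Object() { [native code] }) in place of the entity. Names with underscores, such as __proto__, never reach the lookup, because the regex only admits letters and digits.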

The hasOwnProperty check ensures we only return values that are explicitly in the entity map, not inherited from Object.prototype.
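The lookup behavior can be checked in a few lines. This sketch uses a hypothetical one-entry map standing in for ENTITY_TO_CHAR, and also shows a prototype-free object as one alternative design (not what the article's code uses):

```javascript
// Hypothetical one-entry map, standing in for ENTITY_TO_CHAR.
const sampleMap = { amp: '&' };
const namePattern = /^[a-zA-Z][a-zA-Z0-9]*$/;

// 'constructor' fits the name pattern and resolves via Object.prototype.
const inherited = sampleMap['constructor'];
console.log(typeof inherited); // function
console.log(Object.prototype.hasOwnProperty.call(sampleMap, 'constructor')); // false

// '__proto__' contains underscores, so the decode regex never captures it.
console.log(namePattern.test('constructor')); // true
console.log(namePattern.test('__proto__'));   // false

// Alternative (not what the article uses): a map with no prototype at all.
const bareMap = Object.assign(Object.create(null), { amp: '&' });
console.log(bareMap['constructor']); // undefined
```

A Map would work as well. The explicit hasOwnProperty call has the advantage of leaving the plain object literal untouched.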

Try It

https://html-entity-encoder.pages.dev

Single HTML file, no build step. Open DevTools and read the source — everything is there.

Also part of devnestio — a growing collection of zero-dependency browser tools for developers.

Built with: vanilla JS, the HTML5 named character references spec, and an unreasonable number of Greek letters.
