DEV Community

Dev Nestio

I Built a Browser-Only HTML Entity Encoder/Decoder — Named, Decimal & Hex, 246 Tests

Every developer has hit this: you need to escape <, >, &, and quotes before dropping user input into HTML — or you're staring at mangled text full of &amp; and need to convert it back. Most online tools do the basics, but fall short on the full HTML5 named entity set or force you to choose between three encoding formats.

I built one that handles all three formats, 253 named entities, and decodes all of them with a single regex pass — entirely in the browser, no server, no framework.

👉 https://html-entity-encoder.pages.dev

What It Does

  • Encode: text → HTML entities in three modes
    • Named: é → &eacute;, © → &copy;, π → &pi;
    • Decimal: é → &#233;, © → &#169;
    • Hex: é → &#xE9;, © → &#xA9;
  • Decode: all three entity formats back to plain text
  • 253 HTML5 named entities — Latin-1 Supplement, Latin Extended-A, Greek, Math, Arrows, Punctuation, Currency, Symbols
  • Real-time: output updates on every keystroke
  • Quick Reference: collapsible table you can click to insert characters
  • Swap, Copy, Clear, Sample buttons
  • Zero external dependencies — single HTML file, works offline

The Core: Encoding in Three Modes

The encoding logic iterates over Unicode code points (not UTF-16 code units), which is essential for handling emoji and other characters outside the Basic Multilingual Plane (BMP):

const ALWAYS_ENCODE = new Set(['&', '<', '>', '"', "'"]);

function encode(text, mode) {
  if (!text) return '';
  const result = [];
  for (const ch of text) {          // for...of iterates code points
    const cp = ch.codePointAt(0);
    const mustEncode = ALWAYS_ENCODE.has(ch) || cp > 127;
    if (!mustEncode) { result.push(ch); continue; }

    if (mode === 'named') {
      result.push(CHAR_TO_ENTITY[ch] || `&#${cp};`);
    } else if (mode === 'decimal') {
      result.push(`&#${cp};`);
    } else {                        // hex
      result.push(`&#x${cp.toString(16).toUpperCase()};`);
    }
  }
  return result.join('');
}

The for...of loop over a string yields Unicode code points. A for loop with index would break on any character outside the Basic Multilingual Plane — emoji like 😀 are encoded as surrogate pairs in UTF-16, so a naive str[i] approach would emit two separate (invalid) entities for a single character.
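
To make the difference concrete, here is a small sketch comparing the two loop styles on a string containing 😀 (U+1F600). The variable names are illustrative and are not taken from the tool's source.

```javascript
const sample = 'a😀';

// Index-based loop: walks UTF-16 code units, so the emoji appears as two surrogates.
const codeUnits = [];
for (let i = 0; i < sample.length; i++) {
  codeUnits.push(sample.charCodeAt(i).toString(16));
}

// for...of: walks code points, so the emoji is a single step.
const codePoints = [];
for (const ch of sample) {
  codePoints.push(ch.codePointAt(0).toString(16));
}

console.log(codeUnits);  // [ '61', 'd83d', 'de00' ]
console.log(codePoints); // [ '61', '1f600' ]
```

The index-based version would turn d83d and de00 into two numeric references, neither of which names a real character.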

Why &#N; fallback in named mode? Because the 253 named entities don't cover everything. A character like 😀 (U+1F600) has no HTML5 named form, so decimal is the only option.
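
The fallback can be seen in isolation with a cut-down encoder. This is only a sketch: TINY_CHAR_TO_ENTITY is a hypothetical one-entry stand-in for the full CHAR_TO_ENTITY map, and encodeNamedSketch is not a function from the tool.

```javascript
// Hypothetical stand-in for the full CHAR_TO_ENTITY map.
const TINY_CHAR_TO_ENTITY = { 'é': '&eacute;' };

function encodeNamedSketch(text) {
  let out = '';
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const isSpecial = '&<>"\''.includes(ch);
    if (cp <= 127 && !isSpecial) { out += ch; continue; }
    // Use the named form when the map has one, otherwise fall back to decimal.
    out += TINY_CHAR_TO_ENTITY[ch] || `&#${cp};`;
  }
  return out;
}

console.log(encodeNamedSketch('café 😀')); // caf&eacute; &#128512;
```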

The Decode Regex

One regex handles all three entity formats in a single pass:

function decode(text) {
  if (!text) return '';
  return text.replace(
    /&([a-zA-Z][a-zA-Z0-9]*);|&#([0-9]+);|&#[xX]([0-9a-fA-F]+);/g,
    (match, name, dec, hex) => {
      try {
        if (name !== undefined)
          return Object.prototype.hasOwnProperty.call(ENTITY_TO_CHAR, name)
            ? ENTITY_TO_CHAR[name] : match;
        if (dec !== undefined)
          return String.fromCodePoint(parseInt(dec, 10));
        if (hex !== undefined)
          return String.fromCodePoint(parseInt(hex, 16));
      } catch (_) {}
      return match;
    }
  );
}

Three alternation groups, each capturing a different entity format. The named entity lookup uses hasOwnProperty explicitly to guard against inherited properties: toString, constructor, and valueOf all match the entity-name pattern, so a direct ENTITY_TO_CHAR[name] lookup would return functions from Object.prototype instead of leaving the unknown entity untouched.

The hex branch accepts both &#x...; and &#X...; (the [xX] in the regex) — the HTML5 spec allows both, even though lowercase is conventional.
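
To see which capture group fires for which input, the same pattern can be run through matchAll. The classifyEntities helper below is a hypothetical name used only for this sketch.

```javascript
// Same pattern as the one used in decode().
const ENTITY_RE = /&([a-zA-Z][a-zA-Z0-9]*);|&#([0-9]+);|&#[xX]([0-9a-fA-F]+);/g;

function classifyEntities(text) {
  const found = [];
  for (const [, name, dec, hex] of text.matchAll(ENTITY_RE)) {
    // Exactly one of the three groups is defined for each match.
    if (name !== undefined) found.push(`named:${name}`);
    else if (dec !== undefined) found.push(`decimal:${dec}`);
    else found.push(`hex:${hex}`);
  }
  return found;
}

console.log(classifyEntities('&lt; &#60; &#x3C; &#X3c; &amp &__proto__;'));
// [ 'named:lt', 'decimal:60', 'hex:3C', 'hex:3c' ]
```

The last two inputs are skipped on purpose: &amp has no terminating semicolon, and &__proto__; starts with an underscore, which the name group does not allow.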

Building the Entity Maps

The decode map is the source of truth: ENTITY_TO_CHAR maps each name string to its Unicode character. Then the encode map is derived by inverting it:

const CHAR_TO_ENTITY = {};
(function buildCharMap() {
  // First pass: reverse all entries
  for (const [name, ch] of Object.entries(ENTITY_TO_CHAR)) {
    if (!CHAR_TO_ENTITY[ch]) CHAR_TO_ENTITY[ch] = `&${name};`;
  }
  // Second pass: force canonical preferred names for ambiguous chars
  const preferred = {
    '"': '&quot;', '&': '&amp;', "'": '&apos;',
    '<': '&lt;',   '>': '&gt;',  '\u00A0': '&nbsp;',
    '©': '&copy;', '®': '&reg;', '™': '&trade;',
    '€': '&euro;', '×': '&times;', '÷': '&divide;'
  };
  Object.assign(CHAR_TO_ENTITY, preferred);
})();

Some characters have multiple named forms in HTML5. For example, ÷ is both &divide; and &div;, and ® is &reg;, &REG;, and &circledR;. The second pass pins canonical names so the encoder always outputs the most recognizable form.
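
A small sketch of why the second pass matters. DEMO_ENTITY_TO_CHAR is a hypothetical four-entry map; the names are real HTML5 entities, but the tool's actual map may not contain all of them or list them in this order.

```javascript
// Hypothetical decode map in which the less familiar names come first.
const DEMO_ENTITY_TO_CHAR = { circledR: '®', reg: '®', div: '÷', divide: '÷' };

// First pass: invert, keeping whichever name was seen first.
const demoCharToEntity = {};
for (const [name, ch] of Object.entries(DEMO_ENTITY_TO_CHAR)) {
  if (!demoCharToEntity[ch]) demoCharToEntity[ch] = `&${name};`;
}
const afterFirstPass = { ...demoCharToEntity };

// Second pass: pin the preferred names.
Object.assign(demoCharToEntity, { '®': '&reg;', '÷': '&divide;' });

console.log(afterFirstPass);   // { '®': '&circledR;', '÷': '&div;' }
console.log(demoCharToEntity); // { '®': '&reg;', '÷': '&divide;' }
```

Without the override, the encoder's output would depend on the insertion order of the decode map.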

What's in the 253-entity Map

| Category | Count | Examples |
| --- | --- | --- |
| HTML special | 5 | &amp; &lt; &gt; &quot; &apos; |
| Latin-1 Supplement | 96 | &eacute; &ntilde; &copy; |
| Latin Extended-A | 5 | &OElig; &oelig; &Scaron; |
| Greek | 49 | &alpha; &pi; &Sigma; &Omega; |
| Mathematical | 37 | &infin; &ne; &le; &sum; &radic; |
| Arrows | 11 | &rarr; &larr; &hArr; &crarr; |
| Punctuation | 20 | &mdash; &ndash; &hellip; &ldquo; |
| Misc Symbols | 10+ | &trade; &bull; &spades; &hearts; |
| Currency | 5 | &euro; &pound; &yen; &cent; |

Testing: 246 Cases, No Framework

246 tests across 26 sections, built on a two-function inline runner:

let passed = 0, failed = 0;

function eq(a, b, label) {
  if (a === b) { console.log(`  ✓ ${label}`); passed++; }
  else {
    console.error(`  ✗ ${label}\n    got:      ${JSON.stringify(a)}\n    expected: ${JSON.stringify(b)}`);
    failed++;
  }
}

| Section | Tests | What's covered |
| --- | --- | --- |
| Entity map coverage | 12 | Size ≥ 250, key entries exist |
| Encode HTML specials (named/decimal/hex) | 23 | & < > " ' in all modes |
| Encode Latin extended (all modes) | 25 | © € é ñ ü ± ° ½ |
| Encode Greek (all modes) | 14 | α β γ π Σ Ω |
| Encode math & symbols | 11 | ∞ ≠ ≤ √ → • — … |
| ASCII passthrough | 8 | Letters, digits, misc symbols |
| Encode mixed strings | 7 | XSS payloads, café, résumé |
| Decode named entities | 20 | All common named entities |
| Decode decimal entities | 10 | &#38; through &#960; |
| Decode hex (lowercase/uppercase X) | 15 | &#x3C; and &#X3C; forms |
| Decode mixed strings | 8 | Full HTML tags, price strings |
| Decode edge cases | 10 | Unknown entities, no semicolon, empty |
| Round-trip (encode→decode) | 30 | 10 strings × 3 modes |
| Double-encoding prevention | 2 | &amp;, &amp;amp; |
| Unicode correctness | 7 | U+0000, U+0041, U+2665 |
| Entity map value checks | 10 | Known char values |
| Misc symbols encode | 8 | ♠ ♥ ♦ ♣ ⇒ ∑ |
| Less common entities | 13 | &OElig; &bull; &permil; |
| Whitespace entities | 5 | &ensp; &emsp; &zwnj; |
| Hex uppercase digits | 4 | &#xC4; &#xDC; |
| Non-BMP encode/decode | 4 | 😀 decimal + hex round-trip |

Run with npm test.
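
For a sense of how a runner like this is used, here is a minimal sketch. The eq function mirrors the one above; the section helper, the two sample assertions, and the exit-code handling are assumptions for illustration and are not taken from the project's test file.

```javascript
let passed = 0, failed = 0;

function eq(a, b, label) {
  if (a === b) { console.log(`  ✓ ${label}`); passed++; }
  else { console.error(`  ✗ ${label}`); failed++; }
}

// Hypothetical second helper: prints a heading before a group of tests.
function section(title) {
  console.log(`\n${title}`);
}

section('Decode hex (lowercase/uppercase X)');
eq(String.fromCodePoint(parseInt('3C', 16)), '<', '3C decodes to <');
eq(String.fromCodePoint(parseInt('3c', 16)), '<', '3c decodes to <');

console.log(`\n${passed} passed, ${failed} failed`);
// Under Node, signal failure to npm through the exit code.
if (typeof process !== 'undefined') process.exitCode = failed > 0 ? 1 : 0;
```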

A Subtle Edge Case: Inherited Properties in Decode

The named entity lookup is written as:

Object.prototype.hasOwnProperty.call(ENTITY_TO_CHAR, name)
  ? ENTITY_TO_CHAR[name]
  : match

rather than the simpler ENTITY_TO_CHAR[name]. Why? Because name comes from user-supplied text via the regex match. If someone passes &constructor; or &toString; as input, a direct bracket lookup would walk the prototype chain and return a function object. The replace callback's return value is coerced to a string, so the function's source text would be spliced into the output instead of the unknown entity being left alone. (&__proto__; never reaches the lookup, because the regex does not allow underscores in entity names.)

The hasOwnProperty check ensures we only return values that are explicitly in the entity map, not inherited from Object.prototype.
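
The difference is easy to reproduce with a two-entry map. DEMO_MAP, decodeUnsafe, and decodeSafe are hypothetical names for this sketch, and the pattern covers only the named branch of the full regex.

```javascript
const DEMO_MAP = { amp: '&', lt: '<' };
const NAMED_RE = /&([a-zA-Z][a-zA-Z0-9]*);/g;

// Direct bracket lookup: also finds properties inherited from Object.prototype.
const decodeUnsafe = (text) =>
  text.replace(NAMED_RE, (match, name) => DEMO_MAP[name] || match);

// Own-property check: only names that are really in the map are replaced.
const decodeSafe = (text) =>
  text.replace(NAMED_RE, (match, name) =>
    Object.prototype.hasOwnProperty.call(DEMO_MAP, name) ? DEMO_MAP[name] : match);

console.log(decodeSafe('&lt;b&gt; &constructor;'));   // <b&gt; &constructor;
console.log(decodeUnsafe('&constructor;') === '&constructor;'); // false
```

An alternative with the same effect would be to build the map with Object.create(null) or to use a Map, so that no inherited keys exist to begin with.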

Try It

https://html-entity-encoder.pages.dev

Single HTML file, no build step. Open DevTools and read the source — everything is there.

Also part of devnestio — a growing collection of zero-dependency browser tools for developers.


Built with: vanilla JS, the HTML5 named character references spec, and an unreasonable number of Greek letters.
