SEN LLC
I Built a Side-by-Side Base64 / URL / HTML / Unicode Encoder and Finally Stopped Confusing Them

There are only four text encoding schemes you run into daily on the web. And yet the moment you feed them anything beyond ASCII, each one starts behaving differently. The only way to internalize the differences, I found, is to see all four results next to each other.

"Is this supposed to be Base64 or URL-encoded?" "Why is 'あ' three bytes in the URL?" "Wait, is it \u{1f389} or \u1f389 for 🎉?" I kept hitting the same wall, so I put all four encodings on one page with a single input box.

🔗 Live demo: https://sen.ltd/portfolio/encoder-diff/
📦 GitHub: https://github.com/sen-ltd/encoder-diff

Screenshot

One input. Four output cards (Base64, URL percent, HTML entities, Unicode \u escape). Radio buttons flip between encode and decode mode, switching all four cards at once. Errors paint only the offending card red. About 300 lines of vanilla JS, zero dependencies, no build step.

Base64: btoa can't eat UTF-8

Base64 is specified as "a way to represent arbitrary bytes in ASCII." But JavaScript's built-in btoa() only accepts strings where every character code fits in 0–255. Call btoa('あ') and you get InvalidCharacterError.

So you have to encode the string to UTF-8 bytes first, then stuff those bytes into a fake "each char = one byte" string before handing it to btoa:

export function encodeBase64(input) {
  const bytes = new TextEncoder().encode(input)
  let binary = ''
  for (const b of bytes) binary += String.fromCharCode(b)
  return btoa(binary)
}

That intermediate representation is the ugly secret of Base64 in the browser. It's the reason third-party libraries like js-base64 exist: they hide this dance.
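To make the dance concrete, here is the same function exercised on multi-byte and plain-ASCII input. The expected outputs are easy to verify by hand: 'あ' is the three UTF-8 bytes E3 81 82, which pack into four Base64 characters.

```javascript
// Same encodeBase64 as above, shown end to end.
function encodeBase64(input) {
  const bytes = new TextEncoder().encode(input)
  let binary = ''
  for (const b of bytes) binary += String.fromCharCode(b)
  return btoa(binary)
}

console.log(encodeBase64('あ'))     // '44GC' — Base64 of the bytes E3 81 82
console.log(encodeBase64('hello'))  // 'aGVsbG8=' — ASCII needs no trick, but goes through the same path
```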

Decoding reverses it:

export function decodeBase64(input) {
  const binary = atob(input.trim())
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i)
  return new TextDecoder().decode(bytes)
}

One gotcha: TextDecoder defaults to replacing invalid UTF-8 with U+FFFD (the replacement character) rather than throwing. If you want strict mode, pass { fatal: true }.
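You can see both behaviors by decoding a lone 0xE3, the first byte of 'あ' with the rest truncated:

```javascript
const bad = new Uint8Array([0xe3])  // truncated UTF-8 sequence

// Default mode: invalid bytes become U+FFFD, no error.
console.log(new TextDecoder().decode(bad))  // '\uFFFD'

// Strict mode: the same input throws a TypeError.
try {
  new TextDecoder('utf-8', { fatal: true }).decode(bad)
} catch (e) {
  console.log(e.name)  // 'TypeError'
}
```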

URL percent: see the UTF-8 bytes directly

encodeURIComponent does string → UTF-8 bytes → %XX per byte, in a single step. Which is why one Japanese character becomes three percent-triples:

encodeURIComponent('あ')  // '%E3%81%82'

E3 81 82 is the three-byte UTF-8 encoding of U+3042. If you try to decode just %E3, you get a URIError because those bytes alone aren't valid UTF-8:

decodeURIComponent('%E3')  // URIError: URI malformed

Having all four cards visible makes this kind of thing jump out. Paste half of a URL-encoded Japanese string and you'll immediately see the URL card go red while the other three are fine.
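Per-card error isolation comes down to wrapping each decoder in a try/catch and returning a result object instead of letting the exception propagate. This is a minimal sketch of that pattern, not the tool's exact code:

```javascript
// Wrap decodeURIComponent so a failure marks only this card, not the page.
function tryDecodeUrl(input) {
  try {
    return { ok: true, value: decodeURIComponent(input) }
  } catch (e) {
    return { ok: false, error: e.name }  // 'URIError' for truncated %-sequences
  }
}

console.log(tryDecodeUrl('%E3%81%82'))  // { ok: true, value: 'あ' }
console.log(tryDecodeUrl('%E3'))        // { ok: false, error: 'URIError' }
```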

HTML entities: named vs numeric

HTML has three flavors of entity because of course it does:

  • Named: &amp;, &lt;, &quot;
  • Decimal numeric: &#65; (decodes to A)
  • Hex numeric: &#x41; (also A)

Encoding is the easy direction. The practical minimum for XSS-safe output is five characters:

const HTML_ENCODE_MAP = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
}

export function encodeHtml(input) {
  return input.replace(/[&<>"']/g, (c) => HTML_ENCODE_MAP[c])
}
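Running the five-character map over a typical injection payload shows why it's enough: only the dangerous characters change, everything else (including the emoji) passes through untouched.

```javascript
const HTML_ENCODE_MAP = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }

function encodeHtml(input) {
  return input.replace(/[&<>"']/g, (c) => HTML_ENCODE_MAP[c])
}

// The emoji is plain UTF-8 and needs no escaping in HTML.
console.log(encodeHtml('<img src="x">🎉'))  // '&lt;img src=&quot;x&quot;&gt;🎉'
```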

Decoding has to handle all three flavors. One regex catches all of them:

// Minimal named-entity map for illustration; the full HTML table is far larger.
const HTML_NAMED_DECODE = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" }

export function decodeHtml(input) {
  // [xX] so uppercase hex entities like &#X41; match too.
  return input.replace(/&(#(?:[xX][0-9a-fA-F]+|\d+)|[a-zA-Z]+);/g, (match, inner) => {
    if (inner.startsWith('#x') || inner.startsWith('#X')) {
      return String.fromCodePoint(parseInt(inner.slice(2), 16))
    } else if (inner.startsWith('#')) {
      return String.fromCodePoint(parseInt(inner.slice(1), 10))
    } else if (inner in HTML_NAMED_DECODE) {
      return HTML_NAMED_DECODE[inner]
    }
    return match  // unknown entity: leave as-is
  })
}

Note String.fromCodePoint, not String.fromCharCode. The former handles anything above U+FFFF correctly; the latter quietly mangles emoji. &#x1F389; should decode to 🎉, and it does — but only because we used the right function.
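The difference fits in two lines. fromCharCode truncates its argument to 16 bits, so 0x1F389 silently becomes 0xF389, an unrelated character:

```javascript
console.log(String.fromCodePoint(0x1f389))  // '🎉' — full code point, correct
console.log(String.fromCharCode(0x1f389))   // '\uF389' — truncated to 16 bits, wrong character
```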

Unicode escape: the \u1234 vs \u{10000} line

JavaScript source syntax has two Unicode escape forms:

  • \u1234 — fixed four hex digits, BMP only (U+0000 to U+FFFF)
  • \u{1F389} — variable length, handles the full Unicode range (ES6+)

Encoding means picking based on codepoint:

export function encodeUnicode(input) {
  let out = ''
  for (const ch of input) {
    const cp = ch.codePointAt(0)
    if (cp < 0x80) {
      out += ch  // ASCII: leave alone
    } else if (cp <= 0xffff) {
      out += '\\u' + cp.toString(16).padStart(4, '0')
    } else {
      out += '\\u{' + cp.toString(16) + '}'
    }
  }
  return out
}

for (const ch of input) is load-bearing. If you write for (let i = 0; i < input.length; i++), anything above the BMP splits into two surrogate-pair halves (\uD83C\uDF89 instead of one emoji). for...of iterates by codepoint, which is what you want.

Same trap for codePointAt(0) vs charCodeAt(0): the latter returns only the lead surrogate of a pair, silently corrupting emoji. Anything that walks strings character-by-character in JavaScript should use these two APIs.
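The decode direction isn't shown above. A sketch that handles both escape forms with one regex (assuming the input contains only \uXXXX and \u{...} escapes, no \x or octal forms):

```javascript
// Decode both \uXXXX (fixed four digits) and \u{...} (variable length) back to characters.
function decodeUnicode(input) {
  return input.replace(/\\u(?:\{([0-9a-fA-F]+)\}|([0-9a-fA-F]{4}))/g,
    (_, braced, fixed) => String.fromCodePoint(parseInt(braced ?? fixed, 16)))
}

console.log(decodeUnicode('\\u3042'))     // 'あ'
console.log(decodeUnicode('\\u{1f389}'))  // '🎉'
```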

What you see with all four side-by-side

Paste <img src="x">🎉 into the input box, and the four cards show:

| Scheme  | Output |
| ------- | ------ |
| Base64  | PGltZyBzcmM9IngiPvCfjok= |
| URL     | %3Cimg%20src%3D%22x%22%3E%F0%9F%8E%89 |
| HTML    | &lt;img src=&quot;x&quot;&gt;🎉 |
| Unicode | <img src="x">\u{1f389} |

Each one has a different domain of responsibility:

  • Base64 treats the whole string as one opaque blob — reversible, zero human readability.
  • URL percent treats one byte at a time (and leaves alphanumerics alone).
  • HTML entities only touch the characters that would break rendering (the emoji is safe UTF-8, so it's left untouched).
  • Unicode escape only touches non-ASCII.

That difference is why you use a different one for each job: HTML for XSS-safe rendering, URL for query strings, Base64 for binary-in-JSON, \u for embedding in source code. Putting them side-by-side is the clearest way I've seen to hold that distinction in your head at once.

Tests

21 cases on node --test. Each scheme covers:

  • Empty string
  • ASCII
  • Japanese (multi-byte UTF-8)
  • Emoji (astral plane)
  • Round-trip (encode → decode)
  • Malformed input throws

npm test

Zero dependencies, so package.json has exactly one script entry: "test": "node --test tests/".

Series

This is entry #5 in my 100+ public portfolio series. Previous:

  1. Cron TZ Viewer
  2. JSON to TypeScript
  3. Regex Visualizer
  4. ID Generator
  5. Encoder Diff (this one)

Feedback welcome.
