Base64, done properly: the UTF-8 gotcha most tutorials skip

#webdev #programming #javascript #beginners

I build one small browser tool a day and write down what I learned. Day 24 was a Base64 encoder/decoder, and it turned out to hide the single most common bug in front-end Base64 code. Here it is, with the fix.

Live tool: https://dev48v.infy.uk/solve/day24-base64.html

What Base64 is actually for

Base64 is not compression and it is definitely not encryption. It exists for one reason: to move arbitrary binary through channels that were built for text. Email bodies, JSON payloads, URLs, XML — feed them a raw PNG and a stray null byte or control character will truncate or corrupt the message. Base64 re-expresses any bytes using only 64 safe, printable characters, so the binary survives the trip. That is the whole job.

The 64 characters are A-Z, then a-z, then 0-9, then + and /. Sixty-four is chosen because it is 2^6 — a clean six bits per character.
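As a quick sketch, the alphabet can be built as one 64-character string, so that a 6-bit value is simply an index into it. The name STD is borrowed from the comment in the snippet further down; UPPER is a helper name I made up for illustration.

```javascript
// Build the standard alphabet in order: A-Z, a-z, 0-9, then + and /.
const UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const STD = UPPER + UPPER.toLowerCase() + "0123456789" + "+/";

console.log(STD.length);                 // 64
console.log(STD[0], STD[26], STD[52]);   // A a 0
console.log(STD[62], STD[63]);           // + /
```

Any value from 0 to 63 maps to exactly one character, and STD.indexOf(ch) gives the reverse mapping for decoding.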

The 3-into-4 trick

A byte is 8 bits. A Base64 character carries 6 bits. The lowest common multiple of 8 and 6 is 24, so Base64 works in blocks of 24 bits: three input bytes (3 × 8) become four output characters (4 × 6).

You glue three bytes into one 24-bit number, then slice it into four 6-bit chunks, each a number 0-63 that indexes the alphabet:

const n = (b0 << 16) | (b1 << 8) | b2;   // 24 bits
const c0 = (n >> 18) & 63;
const c1 = (n >> 12) & 63;
const c2 = (n >>  6) & 63;
const c3 =  n        & 63;
// STD[c0] + STD[c1] + STD[c2] + STD[c3]

"Man" becomes "TWFu". Three bytes, four characters, every time. That 4/3 ratio is exactly why Base64 inflates data by about 33% — you are spending a full character to carry only six of its available eight bits.
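To check the "Man" example by hand, here is a minimal sketch that wraps the shifts and masks in a function. The name encodeTriple is hypothetical, and it assumes exactly three byte values in the range 0-255.

```javascript
const STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode exactly three bytes (each 0-255) into four Base64 characters.
// encodeTriple is an illustrative name, not part of any library.
function encodeTriple(b0, b1, b2) {
  const n = (b0 << 16) | (b1 << 8) | b2;   // glue into one 24-bit number
  return STD[(n >> 18) & 63]
       + STD[(n >> 12) & 63]
       + STD[(n >> 6) & 63]
       + STD[n & 63];
}

// "Man" is the bytes 77, 97, 110.
console.log(encodeTriple(77, 97, 110));   // "TWFu"
```

The four indices for "Man" come out as 19, 22, 5 and 46, which are T, W, F and u in the alphabet.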

When the input length is not a multiple of three, the last group is short. One leftover byte gives two characters plus ==; two leftover bytes give three characters plus =. The = is not data — it tells the decoder how many real bytes the final group carried, so it can drop the zero-padding and rebuild the exact original length. "M" encodes to "TQ==", "Ma" to "TWE=".
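The padding rules can be sketched as a small byte encoder. This is an illustration of the technique, not the code behind the tool; encodeBytes is a hypothetical name, and it assumes the input is a Uint8Array or a plain array of 0-255 values.

```javascript
const STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode bytes to padded standard Base64, three bytes per step.
function encodeBytes(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i;            // 1, 2, or 3+ bytes left
    const b0 = bytes[i];
    const b1 = remaining > 1 ? bytes[i + 1] : 0;   // zero-fill missing bytes
    const b2 = remaining > 2 ? bytes[i + 2] : 0;
    const n = (b0 << 16) | (b1 << 8) | b2;
    out += STD[(n >> 18) & 63] + STD[(n >> 12) & 63];
    out += remaining > 1 ? STD[(n >> 6) & 63] : "=";   // third char or padding
    out += remaining > 2 ? STD[n & 63] : "=";          // fourth char or padding
  }
  return out;
}

console.log(encodeBytes([77]));            // "TQ=="
console.log(encodeBytes([77, 97]));        // "TWE="
console.log(encodeBytes([77, 97, 110]));   // "TWFu"
```

The output length is always a multiple of four, which is what lets a decoder work in fixed four-character steps.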

The gotcha that bit me

The browser hands you btoa() and it looks like the answer. It is not, and here is why:

btoa("world");   // "d29ybGQ=" — fine
btoa("世");      // 💥 InvalidCharacterError

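btoa only accepts a "binary string" where every character code is 0-255, one byte per character. 世 has a code point far above 255, and so does every emoji, so passing them throws. Accented Latin-1 letters such as é (U+00E9, which is 233) are the quieter trap: they fit the range, so btoa accepts them, but it encodes the single byte 0xE9 rather than the two UTF-8 bytes C3 A9, and any decoder expecting UTF-8 gets garbage. Worse, some code sidesteps the throw with hacks that silently mangle the bytes, and you do not notice until a user with an accent in their name hits your form.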
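A small sketch makes the two failure modes concrete. It assumes an environment with global btoa and TextEncoder (modern browsers and recent Node versions). The strings use escapes so the example does not depend on the file's encoding; the variable names are mine.

```javascript
// "\u00e9" is é (code point 233): inside btoa's 0-255 range, so no throw.
const latin1 = btoa("\u00e9");   // encodes the single byte 0xE9

// The UTF-8 encoding of é is two bytes, 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode("\u00e9");
const utf8 = btoa(String.fromCharCode(...utf8Bytes));

// "\u4e16" is 世 (code point 19990): outside the range, so btoa throws.
let threw = false;
try {
  btoa("\u4e16");
} catch (err) {
  threw = true;
}

console.log(latin1);   // "6Q=="
console.log(utf8);     // "w6k="
console.log(threw);    // true
```

The same character produces two different Base64 strings depending on which byte encoding sits underneath, and only the UTF-8 one matches what a TextDecoder-based decoder expects.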

The correct pipeline converts the string to UTF-8 bytes first, with TextEncoder, before Base64 ever sees it:

function encode(str){
  const bytes = new TextEncoder().encode(str);   // real UTF-8 bytes
  let bin = "";
  // one character per byte, so every char code is 0-255 and btoa accepts it
  for (const b of bytes) bin += String.fromCharCode(b);
  return btoa(bin);
}
function decode(b64){
  const bin = atob(b64);                          // binary string, one char per byte
  const bytes = Uint8Array.from(bin, c => c.charCodeAt(0));
  return new TextDecoder().decode(bytes);         // interpret the bytes as UTF-8
}

Now héllo 世界 round-trips perfectly: it encodes to aMOpbGxvIOS4lueVjA== and decodes straight back. Emoji too. I verified the round-trip and the btoa throw in Node before shipping — non-ASCII in, same non-ASCII out.

URL-safe, because URLs hate + and /

In a URL, + can mean a space and / is the path separator, so a token pasted into a query string gets wrecked. RFC 4648 defines a URL-safe alphabet (often called base64url) that swaps + for - and / for _. The = padding is commonly dropped as well, since = would otherwise have to be percent-encoded as %3D. This is the variant JWTs use, without padding. It is a two-character swap plus a padding strip on top of standard Base64:

const urlSafe = b64.replace(/\+/g, "-")
                   .replace(/\//g, "_")
                   .replace(/=+$/, "");   // drop padding

To decode, reverse the swap and add = back until the length is a multiple of four.
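A minimal sketch of that reverse step, under the assumption that the input is well-formed unpadded URL-safe Base64. The name fromUrlSafe is hypothetical.

```javascript
// Convert URL-safe Base64 back to the standard alphabet and restore padding.
function fromUrlSafe(urlSafe) {
  let b64 = urlSafe.replace(/-/g, "+").replace(/_/g, "/");
  // Valid unpadded input leaves a remainder of 0, 2, or 3, so this adds
  // at most two "=" characters. A remainder of 1 is never valid Base64.
  while (b64.length % 4 !== 0) b64 += "=";
  return b64;
}

console.log(fromUrlSafe("-_8"));                  // "+/8="
console.log(fromUrlSafe("aMOpbGxvIOS4lueVjA"));   // "aMOpbGxvIOS4lueVjA=="
```

The result can go straight into atob(), or into a UTF-8-aware decode function when the payload is text.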

The tool

The live page has two textareas: type text on the left and Base64 appears on the right; paste Base64 on the right and the text decodes back on the left. There is a URL-safe toggle, a padding-strip toggle, and a live size-overhead readout. It also renders a bit-level view so you can watch three amber 8-bit bytes get re-sliced into four blue 6-bit groups, and see the = appear when the final group holds only one or two bytes. Drop in a small file and it gives you the data: URI you can paste straight into an <img src>.
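For the curious, here is roughly how an overhead readout and a data: URI could be computed. This is a sketch under my own assumptions, not the code behind the tool: encodedLength, overheadPercent and toDataUri are hypothetical names, and the MIME type is supplied by the caller.

```javascript
// Length of padded Base64 output for n input bytes:
// four characters per three-byte group, rounded up.
function encodedLength(n) {
  return 4 * Math.ceil(n / 3);
}

// Size overhead as a percentage of the input size.
function overheadPercent(n) {
  return n === 0 ? 0 : ((encodedLength(n) - n) / n) * 100;
}

// Build a data: URI from raw bytes and a caller-supplied MIME type.
function toDataUri(bytes, mime) {
  let bin = "";
  for (const b of bytes) bin += String.fromCharCode(b);   // one char per byte
  return `data:${mime};base64,${btoa(bin)}`;
}

console.log(encodedLength(13));                                       // 20
console.log(overheadPercent(3000).toFixed(1));                        // "33.3"
console.log(toDataUri(new Uint8Array([77, 97, 110]), "text/plain"));  // "data:text/plain;base64,TWFu"
```

For very short inputs the padding pushes the overhead well above 33%: a single byte becomes four characters.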

One thing to keep front of mind: Base64 is fully reversible with no key. Anyone can paste your string into atob() and read it. It is a transport encoding, never a security layer — if you need secrecy, encrypt; if you need integrity, sign or hash, and Base64 only the result so it travels as text.

Try it: https://dev48v.infy.uk/solve/day24-base64.html