Writing Base64 From Scratch in JavaScript — Why atob Isn't Enough

#javascript #encoding #webdev #unicode

JavaScript has btoa() and atob(), but btoa() only accepts characters in the Latin-1 range (U+0000 to U+00FF), so btoa("こんにちは") throws. Neither function supports the URL-safe Base64 variant (- and _ instead of + and /). Implementing Base64 manually (read 3 bytes, write 4 characters, handle padding) takes about 40 lines and lets you handle UTF-8, URL-safe encoding, and line wrapping properly.

Base64 is one of those encodings every developer touches but few understand. It's not encryption. It's a way to represent arbitrary bytes as ASCII text — useful for JSON, URLs, email attachments, and data URIs. The math is trivial but the edge cases (padding, variants, Unicode) trip people up.

🔗 Live demo: https://sen.ltd/portfolio/base64-tool/
📦 GitHub: https://github.com/sen-ltd/base64-tool

Features:

Text mode (UTF-8 encode/decode)
File mode (drop image → data URL)
URL-safe variant toggle
Line wrap toggle (76 chars, MIME format)
Size comparison (original vs base64)
Auto-detect encoding direction
Image preview for decoded data URLs
Japanese / English UI
Zero dependencies, 55 tests

The 3-to-4 byte conversion

Base64 groups 3 input bytes (24 bits) into 4 output characters (24 / 6 = 4 chars of 6 bits each):

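// Alphabets for the standard and URL-safe variants.
const BASE64_ALPHABET     = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
const BASE64_URL_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

export function encode(bytes, urlSafe = false) {
  const alpha = urlSafe ? BASE64_URL_ALPHABET : BASE64_ALPHABET;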
  let result = '';

  for (let i = 0; i < bytes.length; i += 3) {
    const b1 = bytes[i];
    const b2 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b3 = i + 2 < bytes.length ? bytes[i + 2] : 0;

    const c1 = b1 >> 2;                        // top 6 bits of b1
    const c2 = ((b1 & 0x03) << 4) | (b2 >> 4); // bottom 2 of b1 + top 4 of b2
    const c3 = ((b2 & 0x0F) << 2) | (b3 >> 6); // bottom 4 of b2 + top 2 of b3
    const c4 = b3 & 0x3F;                      // bottom 6 of b3

    result += alpha[c1] + alpha[c2];
    result += i + 1 < bytes.length ? alpha[c3] : (urlSafe ? '' : '=');
    result += i + 2 < bytes.length ? alpha[c4] : (urlSafe ? '' : '=');
  }
  return result;
}

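The bit shuffling is the part that is easy to get wrong. Because 8 is not a multiple of 6, output characters straddle byte boundaries: c2 and c3 each combine bits from two adjacent bytes, and every input byte ends up split across two output characters.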
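To make the shifts and masks concrete, here is a small worked example for the three ASCII bytes of "Man". The helper name splitTriple and the constant ALPHABET_DEMO are illustrative only and are not part of the tool's code:

```javascript
// Worked example: the three ASCII bytes of "Man" become "TWFu".
const ALPHABET_DEMO = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Split exactly three bytes into four 6-bit indices (0..63).
function splitTriple(b1, b2, b3) {
  return [
    b1 >> 2,                          // top 6 bits of b1
    ((b1 & 0x03) << 4) | (b2 >> 4),   // bottom 2 of b1 + top 4 of b2
    ((b2 & 0x0f) << 2) | (b3 >> 6),   // bottom 4 of b2 + top 2 of b3
    b3 & 0x3f,                        // bottom 6 of b3
  ];
}

const indices = splitTriple(0x4d, 0x61, 0x6e); // 'M', 'a', 'n'
const chars = indices.map((i) => ALPHABET_DEMO[i]).join('');

console.log(indices); // [ 19, 22, 5, 46 ]
console.log(chars);   // TWFu
```

Reading the indices back against the alphabet (19 is T, 22 is W, 5 is F, 46 is u) is a quick way to check an implementation by hand.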

Padding

When the input length isn't divisible by 3, you pad with zero bytes and mark the "missing" output characters with =:

3 bytes in → 4 chars out, no padding
2 bytes in → 3 chars + 1 =
1 byte in → 2 chars + 2 =

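The URL-safe mode here omits the = padding. RFC 4648 does not require this for the URL-safe alphabet, but it is a common convention: = has to be percent-encoded in URLs, and the padding is redundant because the encoded length mod 4 tells the decoder how many bytes the final group holds. A URL-safe encoded "f" is therefore just "Zg", not "Zg==".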
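The padding rules reduce to a little arithmetic on the input length. The sketch below uses two hypothetical helpers, base64Length and paddingCount, that are not from the tool itself; they could also back a size comparison like the one in the feature list:

```javascript
// Output length in characters for n input bytes, with or without "=" padding.
function base64Length(n, padded = true) {
  const fullGroups = Math.floor(n / 3);
  const remainder = n % 3; // 0, 1, or 2 leftover bytes
  if (remainder === 0) return fullGroups * 4;
  // A partial group yields remainder + 1 data characters,
  // or a full 4 characters once "=" padding is appended.
  return fullGroups * 4 + (padded ? 4 : remainder + 1);
}

// Number of "=" characters a padded encoder appends for n input bytes.
function paddingCount(n) {
  return (3 - (n % 3)) % 3;
}

for (const n of [1, 2, 3, 4]) {
  console.log(n, base64Length(n), base64Length(n, false), paddingCount(n));
}
// 1 4 2 2
// 2 4 3 1
// 3 4 4 0
// 4 8 6 2
```

An unpadded length that leaves a remainder of 1 when divided by 4 can never occur, which gives a decoder a cheap validity check.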

UTF-8 for text

Since browser btoa only accepts Latin-1, encoding Japanese text requires two steps:

export function encodeText(text, urlSafe = false) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return encode(bytes, urlSafe);
}

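export function decodeText(str) {
  const bytes = decode(str); // decode() is the inverse of encode(): Base64 string in, Uint8Array out
  return new TextDecoder().decode(bytes); // interpret the bytes as UTF-8
}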

TextEncoder produces a Uint8Array of UTF-8 bytes. The base64 encoder doesn't care about text — it just takes bytes. This two-step approach works for any Unicode input.
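decodeText relies on a decode function that this post does not show. Here is a minimal sketch of what one might look like, assuming it should accept both alphabets and tolerate whitespace and missing padding. The names DECODE_CHARS and LOOKUP are illustrative, and the tool's actual implementation may differ:

```javascript
// Hypothetical decode(): accepts standard or URL-safe input, padded or not.
const DECODE_CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Map each character to its 6-bit value; '-' and '_' alias '+' and '/'.
const LOOKUP = new Map([...DECODE_CHARS].map((ch, i) => [ch, i]));
LOOKUP.set('-', 62);
LOOKUP.set('_', 63);

function decode(str) {
  // Drop whitespace (line wraps) and trailing padding before decoding.
  const clean = str.replace(/\s/g, '').replace(/=+$/, '');
  if (clean.length % 4 === 1) throw new Error('Invalid Base64 length');

  const out = new Uint8Array(Math.floor((clean.length * 3) / 4));
  let buffer = 0; // bit accumulator
  let bits = 0;   // number of not-yet-emitted bits in the accumulator
  let pos = 0;

  for (const ch of clean) {
    const value = LOOKUP.get(ch);
    if (value === undefined) throw new Error(`Invalid Base64 character: ${ch}`);
    buffer = (buffer << 6) | value;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out[pos++] = (buffer >> bits) & 0xff; // emit the top 8 pending bits
      buffer &= (1 << bits) - 1;            // keep only the leftover bits
    }
  }
  return out;
}

console.log(new TextDecoder().decode(decode('44GT'))); // こ
console.log(decode('Zg=='), decode('Zg'));             // both hold the single byte 102 ("f")
```

The accumulator approach avoids special-casing the final group: leftover bits from a partial group are simply never emitted.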

Example: "こ" is U+3053, which encodes to the 3 UTF-8 bytes 0xE3 0x81 0x93. Base64-encoded, those bytes become "44GT", and decoding "44GT" yields the same three bytes, so the text round-trips.
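The same example can be checked against the built-ins. This sketch assumes an environment where btoa and TextEncoder are globals (browsers, and recent Node.js versions); it shows btoa failing on the raw string and succeeding once the text has been converted to bytes:

```javascript
// btoa rejects any character above U+00FF.
let threw = false;
try {
  btoa('こ');
} catch (err) {
  threw = true; // an InvalidCharacterError is raised
}

// Going through UTF-8 bytes first avoids the problem.
const utf8 = new TextEncoder().encode('こ');
const hex = [...utf8].map((b) => b.toString(16).padStart(2, '0'));

// Every byte is <= 0xFF, so a one-character-per-byte string is valid btoa input.
const viaBtoa = btoa(String.fromCharCode(...utf8));

console.log(threw);   // true
console.log(hex);     // [ 'e3', '81', '93' ]
console.log(viaBtoa); // 44GT
```

The String.fromCharCode(...utf8) trick is fine for short inputs, but spreading a very large array can exceed the engine's argument limit, which is one more reason to encode from bytes directly.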

URL-safe variant (RFC 4648 §5)

Standard Base64 uses + and /, both of which have special meaning in URLs: / separates path segments, and + is commonly decoded as a space in query strings. The URL-safe variant replaces them with - and _:

export const BASE64_ALPHABET     = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
export const BASE64_URL_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

export function standardToUrlSafe(str) {
  return str.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

export function urlSafeToStandard(str) {
  let s = str.replace(/-/g, '+').replace(/_/g, '/');
  while (s.length % 4 !== 0) s += '=';
  return s;
}

The conversion between variants is just character substitution plus handling the padding. This is how JWT tokens encode their parts: URL-safe Base64 without padding, period-separated.
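As an illustration of the JWT case, the sketch below decodes the header and payload segments of a token. The helper names (toStandard, decodeJwtSegment, toSegment) are hypothetical, the sample token is built in place with a placeholder signature, and nothing here verifies a signature. It assumes atob, btoa, TextEncoder, and TextDecoder are available as globals:

```javascript
// Restore standard Base64 from the URL-safe, unpadded form.
function toStandard(segment) {
  let s = segment.replace(/-/g, '+').replace(/_/g, '/');
  while (s.length % 4 !== 0) s += '=';
  return s;
}

// Decode one JWT segment (header or payload) into an object.
// This only reads the JSON; it does NOT verify the signature.
function decodeJwtSegment(segment) {
  const binary = atob(toStandard(segment));
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

// Build a sample segment so the example does not depend on a hard-coded token.
function toSegment(obj) {
  const bytes = new TextEncoder().encode(JSON.stringify(obj));
  return btoa(String.fromCharCode(...bytes))
    .replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

const token = [
  toSegment({ alg: 'HS256', typ: 'JWT' }),
  toSegment({ sub: '1234567890', name: 'こんにちは' }),
  'signature-placeholder',
].join('.');

const [headerPart, payloadPart] = token.split('.');
console.log(decodeJwtSegment(headerPart));  // { alg: 'HS256', typ: 'JWT' }
console.log(decodeJwtSegment(payloadPart)); // { sub: '1234567890', name: 'こんにちは' }
```

Decoding through TextDecoder rather than using the atob result directly matters here: the payload contains non-ASCII text, so the bytes have to be interpreted as UTF-8.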

isBase64 detection

"Does this string look like Base64?" is harder than it seems:

export function isBase64(str) {
  const clean = str.replace(/\s/g, '');
  if (clean.length === 0) return false;
  // Standard: only valid chars + optional = padding, length multiple of 4
  if (/^[A-Za-z0-9+/]+={0,2}$/.test(clean) && clean.length % 4 === 0) return true;
  // URL-safe: with - or _ present
  if (/^[A-Za-z0-9_-]+$/.test(clean) && /[-_]/.test(clean)) return true;
  return false;
}

The "URL-safe" check requires - or _ to be actually present, otherwise every alphanumeric string would match. The auto-direction detection in the UI uses this to guess whether to encode or decode.
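A few sample inputs show where this heuristic succeeds and where it guesses wrong. The function below is a copy of isBase64 from above (without the export, so the snippet runs standalone); the sample strings are chosen for illustration:

```javascript
// Copy of the isBase64 heuristic shown above, minus the export.
function isBase64(str) {
  const clean = str.replace(/\s/g, '');
  if (clean.length === 0) return false;
  // Standard: only valid chars + optional = padding, length multiple of 4
  if (/^[A-Za-z0-9+/]+={0,2}$/.test(clean) && clean.length % 4 === 0) return true;
  // URL-safe: with - or _ present
  if (/^[A-Za-z0-9_-]+$/.test(clean) && /[-_]/.test(clean)) return true;
  return false;
}

const samples = ['SGVsbG8=', 'SGVs\nbG8=', 'Hello', 'test', 'a-b', 'Zg', 'こんにちは', ''];
const results = samples.map((s) => [s, isBase64(s)]);
console.log(results);
// 'SGVsbG8='   -> true   (padded standard Base64 for "Hello")
// 'SGVs\nbG8=' -> true   (whitespace is stripped first)
// 'Hello'      -> false  (length 5 is not a multiple of 4, no - or _)
// 'test'       -> true   (false positive: any 4-letter word passes)
// 'a-b'        -> true   (URL-safe characters with a - present)
// 'Zg'         -> false  (false negative: unpadded URL-safe without - or _)
// 'こんにちは'  -> false
// ''           -> false
```

The "test" and "Zg" rows are the trade-off: a detector based only on the character set cannot tell short plain words from Base64, so the result is best treated as a guess the user can override.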