DEV Community

SEN LLC

A Unicode Lookup Tool That Shows UTF-8 Bytes, Surrogate Pairs, and All Four Normalization Forms

Type a character, get its codepoint. Type a codepoint, get the character. See the UTF-8 byte sequence, the UTF-16 code units, the surrogate pair (if applicable), the HTML entity, CSS escape, JS escape, the Unicode block, the normalization forms. It's a calculator for Unicode questions that come up all the time in practice.

Every developer hits Unicode weirdness eventually. "Why is this string length 2 when I see one character?" (Surrogate pair.) "Why does 'é' compare unequal to 'é'?" (Composed vs decomposed — NFC vs NFD.) "How do I paste this emoji into a CSS file?" (You need \1F600, not \u1F600.)

🔗 Live demo: https://sen.ltd/portfolio/unicode-lookup/
📦 GitHub: https://github.com/sen-ltd/unicode-lookup

Features:

  • 4 search modes: by character, by codepoint, by name, by block
  • UTF-8 bytes (1-4 byte sequences)
  • UTF-16 code units (with surrogate pair detection)
  • HTML entity, CSS escape, JS escape formats
  • All 4 Unicode normalization forms (NFC, NFD, NFKC, NFKD)
  • 59 Unicode blocks browsable
  • Japanese / English UI
  • Zero dependencies, 75 tests

UTF-8 encoding

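UTF-8 is variable-length: ASCII takes 1 byte; codepoints up to U+07FF (accented Latin letters, Greek, Cyrillic, and others) take 2 bytes; the rest of the BMP, including most CJK, takes 3 bytes; and supplementary-plane characters such as most emoji take 4 bytes.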

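// Assumes cp is a valid Unicode scalar value; surrogates and
// out-of-range values are not checked here.
export function toUTF8Bytes(cp) {
  // U+0000..U+007F: 1 byte, 0xxxxxxx
  if (cp < 0x80) return [cp];
  // U+0080..U+07FF: 2 bytes, 110xxxxx 10xxxxxx
  if (cp < 0x800) return [
    0xC0 | (cp >> 6),
    0x80 | (cp & 0x3F),
  ];
  // U+0800..U+FFFF: 3 bytes, 1110xxxx 10xxxxxx 10xxxxxx
  if (cp < 0x10000) return [
    0xE0 | (cp >> 12),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
  // U+10000..U+10FFFF: 4 bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}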
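One way to sanity-check a hand-written encoder like this is to compare it against the platform's built-in `TextEncoder`, which always emits UTF-8. The sketch below is not from the project: `utf8BytesViaPlatform` is a hypothetical helper name, and it assumes an environment that provides `TextEncoder` and `String.fromCodePoint` (modern browsers, Node.js).

```javascript
// Hypothetical helper (not part of the project): encode one codepoint
// with the built-in TextEncoder, which always produces UTF-8.
function utf8BytesViaPlatform(cp) {
  return Array.from(new TextEncoder().encode(String.fromCodePoint(cp)));
}

// One sample per byte-length category: A, é, 漢, 😀
const sampleCodePoints = [0x41, 0xE9, 0x6F22, 0x1F600];
for (const cp of sampleCodePoints) {
  const hex = utf8BytesViaPlatform(cp)
    .map((b) => b.toString(16).toUpperCase().padStart(2, '0'))
    .join(' ');
  console.log(`U+${cp.toString(16).toUpperCase().padStart(4, '0')} -> ${hex}`);
}
// U+0041 -> 41
// U+00E9 -> C3 A9
// U+6F22 -> E6 BC A2
// U+1F600 -> F0 9F 98 80
```

`TextEncoder` replaces lone surrogates with U+FFFD, so the comparison is only meaningful for codepoints outside 0xD800-0xDFFF.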

The lead byte tells you how many continuation bytes follow:

  • 0xxxxxxx → 1 byte (ASCII)
  • 110xxxxx → 2 bytes (1 continuation)
  • 1110xxxx → 3 bytes (2 continuations)
  • 11110xxx → 4 bytes (3 continuations)

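Continuation bytes always have the form 10xxxxxx, which no lead byte shares. This self-synchronizing property is a large part of why UTF-8 is so robust: from any offset in a byte stream, you can find a character boundary by scanning until you hit a byte that is not a continuation byte.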
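The bit patterns above translate directly into mask checks. This is a minimal sketch with illustrative function names (none of them come from the project), and it assumes the input is well-formed UTF-8:

```javascript
// Classify a byte by its high bits. Returns the total sequence length
// for a lead byte, or 0 for a continuation or invalid lead byte.
function utf8SequenceLength(leadByte) {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;
}

function isContinuation(byte) {
  return (byte & 0xC0) === 0x80; // 10xxxxxx
}

// Walk backwards from an arbitrary index to the first byte of the
// character that contains it.
function findCharStart(bytes, index) {
  let i = index;
  while (i > 0 && isContinuation(bytes[i])) i--;
  return i;
}

const demoBytes = [0x61, 0xF0, 0x9F, 0x98, 0x80, 0x62]; // "a😀b"
console.log(findCharStart(demoBytes, 3));      // 1
console.log(utf8SequenceLength(demoBytes[1])); // 4
```

Because at most three continuation bytes can follow a lead byte, the backward scan never needs more than three steps on valid input.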

UTF-16 surrogate pairs

JavaScript strings are sequences of UTF-16 code units. Codepoints above 0xFFFF, that is, outside the Basic Multilingual Plane (BMP), can't fit in a single 16-bit unit, so they're encoded as surrogate pairs:

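export function toUTF16CodeUnits(cp) {
  // BMP codepoints fit in a single 16-bit code unit.
  if (cp <= 0xFFFF) return [cp];
  // Supplementary planes: subtract 0x10000 to get a 20-bit offset.
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}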

High surrogates are in 0xD800-0xDBFF, low surrogates in 0xDC00-0xDFFF. A 20-bit offset (0 to 0xFFFFF) is split into 10 high bits and 10 low bits.
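Going the other way is the same arithmetic in reverse. The following is a sketch rather than project code; `fromSurrogatePair` is a hypothetical name chosen for illustration:

```javascript
// Rebuild a codepoint from a high/low surrogate pair.
function fromSurrogatePair(high, low) {
  const isHigh = high >= 0xD800 && high <= 0xDBFF;
  const isLow = low >= 0xDC00 && low <= 0xDFFF;
  if (!isHigh || !isLow) {
    throw new RangeError('not a valid high/low surrogate pair');
  }
  // 10 bits from each surrogate, then add back the 0x10000 offset.
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600
console.log(fromSurrogatePair(0xDBFF, 0xDFFF).toString(16)); // 10ffff
```

The second call shows why Unicode stops at 0x10FFFF: it is the largest value a surrogate pair can express.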

This is why "😀".length === 2 in JavaScript — the emoji is one codepoint but two code units.
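The built-in string APIs expose both views. A short sketch (`countCodePoints` is an illustrative helper, not part of the project):

```javascript
const sample = 'a😀b';

console.log(sample.length);      // 4 (UTF-16 code units)
console.log([...sample].length); // 3 (codepoints; string iteration is codepoint-aware)

// codePointAt reads a whole surrogate pair; charCodeAt reads one code unit.
console.log(sample.codePointAt(1).toString(16)); // 1f600
console.log(sample.charCodeAt(1).toString(16));  // d83d

// Count codepoints without allocating an intermediate array.
function countCodePoints(str) {
  let n = 0;
  for (const _ of str) n++;
  return n;
}

console.log(countCodePoints(sample)); // 3
```

Codepoints are still not the same thing as user-perceived characters: a decomposed "é" (e + combining acute) counts as two codepoints even though it renders as one glyph.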

The four normalization forms

Unicode has multiple ways to encode the same visual character:

  • é as a single codepoint U+00E9 (precomposed)
  • é as U+0065 (e) + U+0301 (combining acute) (decomposed)

These look identical but don't compare equal. Normalization fixes this:

  • NFC: canonical decomposition followed by canonical composition (precompose where possible)
  • NFD: canonical decomposition (break into base + combining marks)
  • NFKC: compatibility decomposition followed by canonical composition (also folds the "ﬁ" ligature U+FB01 to "fi", superscripts to plain digits, etc.)
  • NFKD: compatibility decomposition

// str is the string being inspected
const forms = {
  NFC: str.normalize('NFC'),
  NFD: str.normalize('NFD'),
  NFKC: str.normalize('NFKC'),
  NFKD: str.normalize('NFKD'),
};
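To make the difference between the canonical and compatibility forms concrete, here is a small sketch that prints the codepoints each form produces. `codePointsOf` is an illustrative helper name, not a project function:

```javascript
// Render a string as a space-separated list of U+XXXX codepoints.
function codePointsOf(str) {
  return [...str]
    .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');
}

const precomposed = '\u00E9'; // é as a single codepoint
const ligature = '\uFB01';    // LATIN SMALL LIGATURE FI

console.log(codePointsOf(precomposed.normalize('NFD'))); // U+0065 U+0301
console.log(codePointsOf(precomposed.normalize('NFC'))); // U+00E9
console.log(codePointsOf(ligature.normalize('NFC')));    // U+FB01
console.log(codePointsOf(ligature.normalize('NFKC')));   // U+0066 U+0069
```

The canonical forms leave the ligature alone; only the compatibility forms replace it with plain "fi", which is why NFKC and NFKD are lossy.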

String comparison, database keys, filename equality — all of these can bite you if you don't normalize first.
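A minimal sketch of normalizing before comparing or storing, assuming NFC as the canonical form for the application (`equalsNormalized` and the `users` map are made up for illustration):

```javascript
// Compare two strings after normalizing both to the same form.
function equalsNormalized(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

const composed = '\u00E9';    // é
const decomposed = 'e\u0301'; // e + combining acute

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true

// Normalizing keys on the way in keeps later lookups consistent.
const users = new Map();
users.set('Jos\u00E9'.normalize('NFC'), { id: 1 });
console.log(users.has('Jose\u0301'.normalize('NFC'))); // true
```

NFC is usually the safer default for storage because it preserves compatibility distinctions; NFKC suits search or identifier matching where folding "ﬁ" to "fi" is desirable.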

Escape formats

Each language's escape syntax is slightly different:

// HTML: &#x1F600;
// CSS: \1F600 (note: no 0x, no braces, can need trailing space)
// JS ES2015+: \u{1F600}
// JS legacy: \uD83D\uDE00 (surrogate pair)
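The HTML and JavaScript formats can be produced with a few lines each. These formatters are a sketch; the function names are assumptions and may not match the project's actual API:

```javascript
function hexOf(cp) {
  return cp.toString(16).toUpperCase();
}

function toHTMLEntity(cp) {
  return `&#x${hexOf(cp)};`;
}

function toJSEscape(cp) {
  return `\\u{${hexOf(cp)}}`;
}

function toJSLegacyEscape(cp) {
  // Walk UTF-16 code units so a supplementary codepoint becomes a pair.
  const str = String.fromCodePoint(cp);
  let out = '';
  for (let i = 0; i < str.length; i++) {
    out += '\\u' + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0');
  }
  return out;
}

console.log(toHTMLEntity(0x1F600));     // &#x1F600;
console.log(toJSEscape(0x1F600));       // \u{1F600}
console.log(toJSLegacyEscape(0x1F600)); // \uD83D\uDE00
console.log(toJSLegacyEscape(0xE9));    // \u00E9
```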

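The CSS escape is the weirdest of the four: a backslash followed by one to six hex digits. If the escape uses fewer than six digits and the next character is itself a hex digit, a single space must follow the escape to terminate it; the parser consumes that space rather than treating it as content.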
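A sketch of that rule in code, based on my reading of the CSS Syntax escape handling. The function name and the `nextChar` parameter are assumptions for illustration. It also covers a related case: the parser swallows one whitespace character after any escape, so a literal space that should survive needs a terminator in front of it too.

```javascript
// Format a codepoint as a CSS escape, adding a terminating space when
// the character that follows would otherwise be misread.
function toCSSEscape(cp, nextChar = '') {
  const hex = cp.toString(16).toUpperCase();
  const nextIsHex = /^[0-9a-fA-F]$/.test(nextChar);
  const nextIsSpace = /^[ \t\n\r\f]$/.test(nextChar);
  // A hex digit would be absorbed into a short escape; whitespace
  // would be consumed as the escape's own terminator.
  const needsSpace = (hex.length < 6 && nextIsHex) || nextIsSpace;
  return '\\' + hex + (needsSpace ? ' ' : '');
}

console.log(JSON.stringify(toCSSEscape(0x1F600)));       // "\\1F600"
console.log(JSON.stringify(toCSSEscape(0x1F600, 'a')));  // "\\1F600 "
console.log(JSON.stringify(toCSSEscape(0x1F600, 'z')));  // "\\1F600"
console.log(JSON.stringify(toCSSEscape(0x10FFFF, 'a'))); // "\\10FFFF"
```

An alternative that avoids the lookahead is to always zero-pad to six digits (`\01F600`), at the cost of slightly longer output.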

Tests

75 tests covering: codepoint parsing in all formats, UTF-8 encoding across all byte-length categories, UTF-16 surrogate pair correctness, escape format output, block lookup, emoji detection, normalization forms, and edge cases (BOM, max codepoint 0x10FFFF, surrogate validation).

Series

This is entry #46 in my 100+ public portfolio series.
