Unicode Has 154,998 Characters: A Developer's Guide to Finding the Right One

#javascript #tutorial #webdev #beginners

Last month I needed to insert a right arrow into a UI. I typed "right arrow unicode" into a search engine and got approximately forty options: rightwards arrow (U+2192), rightwards double arrow (U+21D2), rightwards white arrow (U+21E8), black rightwards arrow (U+27A1), rightwards arrow with hook (U+21AA), and about thirty-five more. Each one rendered slightly differently across operating systems and fonts.

This is the daily reality of working with Unicode. The standard covers 154,998 characters across 168 scripts as of version 16.0. Finding the exact character you need, knowing its code point, and understanding how it will render across platforms is a lookup problem that comes up far more often than most developers expect.

How Unicode is organized

Unicode assigns every character a unique code point, written as U+ followed by a hexadecimal number. The space is divided into 17 planes of 65,536 code points each:

Plane 0 (BMP, U+0000 to U+FFFF): The Basic Multilingual Plane contains virtually all characters used in modern writing systems, plus common symbols. ASCII lives at U+0000 to U+007F. Latin-1 Supplement and Latin Extended-A and -B fill U+0080 to U+024F. CJK Unified Ideographs occupy U+4E00 to U+9FFF. Mathematical operators, currency symbols, arrows, box drawing characters, dingbats, and some older emoji (such as U+2764, heavy black heart) are here too.

Plane 1 (SMP, U+10000 to U+1FFFF): The Supplementary Multilingual Plane contains historic scripts (Egyptian hieroglyphs, cuneiform), musical notation, mathematical alphanumeric symbols, and most emoji. If you've ever had a string length bug involving emoji, the character was probably on this plane.

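Plane 2 (SIP, U+20000 to U+2FFFF): The Supplementary Ideographic Plane holds CJK Unified Ideographs Extension B and several later extensions: rare and historic Chinese, Japanese, and Korean characters.

Plane 3 (the Tertiary Ideographic Plane) holds further CJK extensions. Planes 4-13 are unassigned. Plane 14 contains tag characters and supplementary variation selectors. Planes 15-16 are private use areas where companies and projects can assign their own characters.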
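
Because every plane spans exactly 0x10000 (65,536) code points, the plane of a character follows directly from its code point. Here is a minimal sketch; `planeOf` is a name I made up for illustration, not a built-in.

```javascript
// Plane number = code point divided by 0x10000 (65,536), rounded down.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code point: ' + codePoint);
  }
  return Math.floor(codePoint / 0x10000);
}

const arrowPlane = planeOf('\u2192'.codePointAt(0));    // 0 (BMP)
const firePlane = planeOf('\u{1F525}'.codePointAt(0));  // 1 (SMP)

// JavaScript strings are sequences of UTF-16 code units, so a character
// outside the BMP occupies two units (a surrogate pair).
const fireUnits = '\u{1F525}'.length;       // 2
const fireChars = [...'\u{1F525}'].length;  // 1 (spread iterates by code point)
```

The last two lines are the string length bug in miniature: `.length` counts code units, not characters.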

Characters developers actually search for

In my experience, the characters people look up most frequently fall into a few categories:

Typographic characters. The en dash (U+2013, often approximated in ASCII as --), em dash (U+2014, often approximated as ---), non-breaking space (U+00A0), thin space (U+2009), and various quotation marks (U+201C/U+201D for curly doubles, U+2018/U+2019 for curly singles). These matter for professional typography and are distinct from their ASCII approximations.

Mathematical and technical symbols. The multiplication sign (U+00D7, not the letter x), the degree symbol (U+00B0), the micro sign (U+00B5), the approximately equal sign (U+2248, officially named "almost equal to"), and the not equal sign (U+2260). Using the correct Unicode character instead of ASCII substitutes improves accessibility and machine readability.

Arrows and geometric shapes. As I discovered, there are hundreds. The most useful: leftwards arrow (U+2190), upwards arrow (U+2191), rightwards arrow (U+2192), downwards arrow (U+2193), and the various triangle pointers used in UI disclosure indicators.

Currency symbols. Dollar (U+0024), euro (U+20AC), pound (U+00A3), yen (U+00A5), rupee (U+20B9), bitcoin (U+20BF). Not all fonts include all currency symbols, which is worth testing before shipping.

Invisible and zero-width characters. The zero-width space (U+200B, useful for allowing line breaks in long strings without visible spaces), the zero-width non-joiner (U+200C, prevents adjacent characters from joining or forming ligatures), and the zero-width joiner (U+200D, the glue that combines emoji into sequences). Strictly speaking these are format characters rather than whitespace, and they cause some of the most confusing bugs when they end up in codebases.
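
When one of these sneaks into a string, a small scanner can point at it. The sketch below is illustrative only: `findInvisibles` and the `INVISIBLES` table are hypothetical names, the labels are informal, and I added U+FEFF (the byte order mark) because it tends to turn up in the same kind of bug.

```javascript
// Hypothetical lookup table of hard-to-see characters (not exhaustive).
const INVISIBLES = new Map([
  [0x00A0, 'no-break space'],
  [0x2009, 'thin space'],
  [0x200B, 'zero-width space'],
  [0x200C, 'zero-width non-joiner'],
  [0x200D, 'zero-width joiner'],
  [0xFEFF, 'zero-width no-break space (BOM)'],
]);

// Return the position and identity of every listed character in `text`.
function findInvisibles(text) {
  const hits = [];
  let index = 0; // position in UTF-16 code units, matching String indexing
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (INVISIBLES.has(cp)) {
      hits.push({
        index,
        codePoint: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
        name: INVISIBLES.get(cp),
      });
    }
    index += ch.length;
  }
  return hits;
}

const hits = findInvisibles('user\u200Bname');
// hits: [{ index: 4, codePoint: 'U+200B', name: 'zero-width space' }]
```

A zero-width joiner inside an emoji sequence is legitimate, so treat the result as a list of things to inspect rather than things to delete.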

Using Unicode in code

In HTML, you can insert Unicode characters three ways:

<!-- Named entity (limited set) -->
<p>&mdash; &copy; &euro;</p>

<!-- Decimal reference -->
<p>&#8212; &#169; &#8364;</p>

<!-- Hex reference -->
<p>&#x2014; &#xA9; &#x20AC;</p>

In JavaScript:

// Unicode escape (BMP only)
const arrow = '\u2192';     // →

// Unicode escape (any plane)
const fire = '\u{1F525}';   // 🔥

// String.fromCodePoint (any code point)
const star = String.fromCodePoint(0x2605);  // ★

// Getting a code point from a character
const cp = '→'.codePointAt(0);  // 8594 (decimal)
const hex = cp.toString(16);    // "2192"
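
Building on `codePointAt`, here is a small sketch for dumping the code points of a whole string in U+ notation. `toUPlus` and `codePointsOf` are names I chose for illustration.

```javascript
// Format a code point in the conventional U+XXXX notation
// (uppercase hex, padded to at least four digits).
function toUPlus(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Array.from iterates a string by code point, so surrogate pairs
// stay together instead of being split into two halves.
function codePointsOf(text) {
  return Array.from(text, (ch) => toUPlus(ch.codePointAt(0)));
}

const listed = codePointsOf('a\u2192\u{1F525}');
// ['U+0061', 'U+2192', 'U+1F525']
```

This is handy in a debugger or a failing test when two strings look identical but compare unequal.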

In CSS:

/* Unicode escape in content property */
.arrow::after {
  content: '\2192';  /* → */
}

/* Or use the character directly if your file is UTF-8 */
.arrow::after {
  content: '→';
}
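
The HTML, JavaScript, and CSS notations above all encode the same number, so they can be derived from a single character. A rough sketch, assuming a one-code-point input; `notationsFor` and its property names are hypothetical.

```javascript
// Derive the common escape notations for the first code point of `char`.
function notationsFor(char) {
  const cp = char.codePointAt(0);
  const hex = cp.toString(16).toUpperCase();
  return {
    htmlDecimal: '&#' + cp + ';',
    htmlHex: '&#x' + hex + ';',
    jsEscape: '\\u{' + hex + '}',
    // Caveat: in CSS, add a trailing space (or pad to six digits) if the
    // next character in the string is itself a hex digit.
    cssEscape: '\\' + hex,
  };
}

const em = notationsFor('\u2014');
// em.htmlDecimal === '&#8212;'
// em.htmlHex     === '&#x2014;'
// em.jsEscape    === '\\u{2014}'
// em.cssEscape   === '\\2014'
```

Named entities such as `&mdash;` are not derivable this way; they come from a fixed table in the HTML specification.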

The rendering problem

A Unicode code point is an abstract concept. What you see on screen depends on the font. If your font doesn't include a glyph for a particular code point, you get one of several fallback behaviors:

The system substitutes a glyph from a different font (most common on modern OSes).
The browser shows a blank space where the character should be.
You see the infamous "tofu" -- the small rectangle indicating a missing glyph.

This is why the same emoji can look completely different on iOS, stock Android, Windows, and Samsung devices. Each vendor ships its own emoji font with its own artistic interpretation of each code point. The pistol emoji (U+1F52B) has been drawn as a handgun on some platforms and a water gun on others. Same code point, different rendering.

For text-critical applications, test your Unicode characters across the platforms your users actually use. A character that renders beautifully in Chrome on macOS might be tofu in Firefox on Windows if the font stack doesn't cover it.

Four common Unicode mistakes

Copy-pasting characters that look like ASCII but aren't. Curly quotes copied from Word, en dashes from PDFs, and fullwidth characters from CJK text all look similar to their ASCII counterparts but have different code points. This causes silent failures in code, URLs, and data parsing. If a string comparison is failing mysteriously, check for non-ASCII lookalikes.
Assuming all characters are the same width. CJK characters are typically double-width. Combining characters have zero width. Emoji width varies. If you're building a table or aligning text in a monospace context (terminal output, code editors), you need a function that accounts for character width, not just character count.
Hardcoding characters in code comments or strings without documenting the code point. Two years from now, someone will look at const separator = '\u200B' and have no idea what it does. Comment it: const separator = '\u200B'; // zero-width space.
Using HTML entities when UTF-8 works fine. If your HTML file is served as UTF-8 (which it should be in 2026), you can use Unicode characters directly in your source code. <p>Copyright © 2026</p> is more readable than <p>Copyright &copy; 2026</p>. Reserve entities for characters that have syntactic meaning in HTML, like < (written &lt;) and & (written &amp;).
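
The first two mistakes can be guarded against with small helpers. Both functions below are rough sketches with hypothetical names: `foldToAscii` covers only the lookalikes named in this article, and `displayWidth` uses a hand-picked list of wide ranges rather than the real East Asian Width data, so expect it to be wrong at the edges.

```javascript
// Common typographic lookalikes mapped to ASCII. Illustrative, not exhaustive.
const LOOKALIKES = new Map([
  ['\u2018', "'"], ['\u2019', "'"],  // curly single quotes
  ['\u201C', '"'], ['\u201D', '"'],  // curly double quotes
  ['\u2013', '-'], ['\u2014', '-'],  // en dash, em dash
]);

function foldToAscii(text) {
  // NFKC maps fullwidth forms (e.g. U+FF21) and the no-break space to their
  // plain equivalents. It also rewrites other compatibility characters
  // (ligatures, superscripts), so use it only where that is acceptable.
  const normalized = text.normalize('NFKC');
  return Array.from(normalized, (ch) => LOOKALIKES.get(ch) ?? ch).join('');
}

// Approximate ranges that terminals usually draw two cells wide.
const WIDE_RANGES = [
  [0x1100, 0x115F],   // Hangul Jamo leading consonants
  [0x2E80, 0xA4CF],   // CJK radicals through Yi (approximate)
  [0xAC00, 0xD7A3],   // Hangul syllables
  [0xF900, 0xFAFF],   // CJK compatibility ideographs
  [0xFF01, 0xFF60],   // fullwidth forms
  [0x1F300, 0x1F64F], // many emoji
  [0x20000, 0x3FFFD], // supplementary ideographs
];

function displayWidth(text) {
  let width = 0;
  for (const ch of text) {
    // Nonspacing marks, enclosing marks, and format characters take no cell.
    if (/\p{Mn}|\p{Me}|\p{Cf}/u.test(ch)) continue;
    const cp = ch.codePointAt(0);
    width += WIDE_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi) ? 2 : 1;
  }
  return width;
}

const folded = foldToAscii('\u201Chi\u201D');  // '"hi"'
const cells = displayWidth('\u65E5\u672C');    // 4 (two wide characters)
```

For production terminal alignment, a library backed by the Unicode East Asian Width tables is the safer choice; this sketch only shows the shape of the problem.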

Looking up characters

For searching, browsing, and copying Unicode characters by name, category, or code point, I built a character map at zovo.one/free-tools/character-map. It lets you search by description (like "right arrow") and shows the code point, HTML entity, and CSS escape for each character.

Unicode is one of those standards where the breadth of coverage creates its own complexity. 154,998 characters means that the character you need almost certainly exists. The challenge is finding it among the other 154,997.

I'm Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.