Characters, Bytes, and Code Points: Why String Length Is Never Simple

#javascript #tutorial #webdev #beginners

Pop quiz. What does "hello".length return in JavaScript? Five, obviously. Now what does "cafe\u0301".length return? If you said 5, you're right. The string renders as "café", with the accent sitting on the e, but its length is 5, not 4, because the accent is a separate combining code point (U+0301). Meanwhile "caf\u00e9".length returns 4, even though the two strings look identical on screen.

Two strings that look the same, render the same, and compare as equal in some contexts (for example, after Unicode normalization) have different lengths. Welcome to Unicode.

This is why building a character counter -- the kind you'd use for checking tweet length or meta description limits -- is surprisingly non-trivial once you step outside ASCII.
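
The quiz is easy to check in a console. Here is a minimal sketch; the variable names decomposed and precomposed are chosen here for illustration only:

```javascript
// Two spellings of the same visible word.
const decomposed = "cafe\u0301";   // "e" followed by U+0301 COMBINING ACUTE ACCENT
const precomposed = "caf\u00e9";   // single code point U+00E9

console.log(decomposed.length);            // 5
console.log(precomposed.length);           // 4
console.log(decomposed === precomposed);   // false

// After NFC normalization the two spellings become the same string.
console.log(decomposed.normalize("NFC") === precomposed);   // true
```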

Characters vs. code points vs. grapheme clusters

The word "character" is ambiguous in computing. There are at least three things it can mean:

Code units are the individual values in a string's underlying encoding. In JavaScript (UTF-16), each code unit is 16 bits. A string's length property returns the number of UTF-16 code units, not the number of visible characters. Most characters fit in one code unit, but characters outside the Basic Multilingual Plane (including many emoji) require two code units, called a surrogate pair.

Code points are the abstract numbers assigned to characters in the Unicode standard. The letter "A" is U+0041. The emoji "fire" is U+1F525. Code points above U+FFFF need two UTF-16 code units. In JavaScript, you can iterate over code points with for...of or use the spread operator:

const str = "Hello! 🔥";
console.log(str.length);        // 9 (UTF-16 code units: 🔥 is 2)
console.log([...str].length);   // 8 (code points)
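
To see which code points a string actually contains, for...of can be combined with codePointAt. This is a rough sketch; listCodePoints is a hypothetical helper name, not a built-in:

```javascript
// Lists each code point of a string in U+XXXX notation.
function listCodePoints(str) {
  const result = [];
  for (const ch of str) {   // for...of steps by code point, not by code unit
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    result.push(`U+${hex}`);
  }
  return result;
}

console.log(listCodePoints("A🔥"));   // [ 'U+0041', 'U+1F525' ]
console.log("A🔥".length);            // 3 (the fire emoji is a surrogate pair)
```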

Grapheme clusters are what humans perceive as a single "character." The flag emoji for a country is a grapheme cluster composed of two regional indicator code points. A family emoji can be 5 code points (person + ZWJ + person + ZWJ + child) or 7 with a second child, and skin tone modifiers add more. And that accented "e" (e + combining accent) is one grapheme cluster rendered from two code points.

const flag = "🇺🇸";
console.log(flag.length);       // 4 (four UTF-16 code units)
console.log([...flag].length);  // 2 (two code points: U+1F1FA + U+1F1F8)
// But visually, it's 1 character

const family = "👨‍👩‍👧";
console.log(family.length);     // 8 (UTF-16 code units)
console.log([...family].length); // 5 (code points joined with ZWJ)
// Visually: 1 character

When a user types a tweet and expects to see how many characters they've used, they mean grapheme clusters. When a database column has a VARCHAR(100) limit, it might mean bytes, code units, or code points depending on the encoding and database engine. The word "character" is doing too much work.

The Intl.Segmenter solution

Modern JavaScript has a proper answer for grapheme cluster counting:

function countGraphemes(str) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

countGraphemes("Hello! 🇺🇸");  // 8 (H, e, l, l, o, !, space, flag)
countGraphemes("caf\u00e9");     // 4
countGraphemes("cafe\u0301");    // 4 (same visual result, same count)

Intl.Segmenter handles combining characters, emoji sequences, surrogate pairs, and regional indicators correctly. It's supported in all modern browsers and Node.js 16+. Before it existed, you needed a library like grapheme-splitter or a complex regex based on Unicode Standard Annex #29 (Unicode Text Segmentation).
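
If the code might run somewhere without Intl.Segmenter, one option is to feature-detect it and fall back to code point counting. The sketch below assumes that an approximate count is acceptable in the fallback case; countVisibleCharacters is a hypothetical name:

```javascript
// Counts user-perceived characters when Intl.Segmenter is available,
// and falls back to counting code points otherwise (an approximation).
function countVisibleCharacters(str) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(str)) count += 1;
    return count;
  }
  // Fallback: overcounts combining sequences, flags, and ZWJ emoji.
  return [...str].length;
}
```

The fallback still handles surrogate pairs correctly, so it is never worse than .length, but it will report 5 rather than 4 for the decomposed "cafe\u0301".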

Where character limits actually apply

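Twitter/X: 280 characters, counted on NFC-normalized text with a weighting scheme rather than a plain code point count: most Latin-script code points count as 1, while CJK characters and emoji count as 2. This means a flag emoji counts as 2, not 1, and a tweet with 279 ASCII characters and a single emoji is over the limit.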

HTML meta descriptions: Google typically displays 150-160 characters in search results. There's no hard limit, but exceeding it means truncation. The figure is best read as visible characters (grapheme clusters), and since the cutoff depends on how wide the rendered text is, treat it as an approximation.

SMS: A single SMS can hold 160 GSM-7 characters (basic Latin, numbers, common symbols) or 70 UCS-2 characters (any Unicode). If your message contains a single non-GSM character, the entire message switches to UCS-2, cutting your capacity by more than half. This is why a 100-character SMS with one emoji becomes two messages.

Database columns: MySQL's VARCHAR(255) in utf8mb4 encoding means 255 code points, but the storage is up to 4 bytes per code point. PostgreSQL's VARCHAR(n) counts characters (code points). SQLite doesn't enforce VARCHAR length at all.
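
The SMS rule above can be sketched as a segment estimator. This is a simplified illustration, not a carrier-accurate implementation: it assumes the default GSM-7 alphabet with its extension table (whose characters cost two septets each) and ignores national language shift tables. The 153 and 67 per-segment figures for multi-part messages are not from the text above; they reflect the usual overhead of a concatenation header. estimateSmsSegments is a hypothetical name.

```javascript
// GSM-7 basic alphabet (1 septet each) and extension table (2 septets each).
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
const GSM7_EXTENDED = "\f^{}\\[~]|€";

// Estimates the encoding and number of segments for an SMS body.
function estimateSmsSegments(message) {
  let septets = 0;
  let isGsm7 = true;
  for (const ch of message) {
    if (GSM7_BASIC.includes(ch)) {
      septets += 1;
    } else if (GSM7_EXTENDED.includes(ch)) {
      septets += 2;
    } else {
      isGsm7 = false;   // one non-GSM character switches the whole message
      break;
    }
  }
  if (isGsm7) {
    const segments = septets <= 160 ? 1 : Math.ceil(septets / 153);
    return { encoding: "GSM-7", segments };
  }
  const units = message.length;   // UCS-2 capacity is counted in 16-bit units
  const segments = units <= 70 ? 1 : Math.ceil(units / 67);
  return { encoding: "UCS-2", segments };
}

console.log(estimateSmsSegments("a".repeat(100) + "🔥"));
// { encoding: 'UCS-2', segments: 2 }
```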

Five character counting mistakes

1. Using .length for user-facing counts. In JavaScript, .length counts UTF-16 code units. For ASCII-only text, this matches the user's expectation. The moment emoji or combining marks enter the picture, the count is wrong. Use Intl.Segmenter, or at minimum the spread operator for code point counting.
2. Not normalizing before comparing. "caf\u00e9" (precomposed) and "cafe\u0301" (decomposed) look identical but have different lengths and don't compare as equal with ===. Normalize to NFC before counting or comparing: str.normalize('NFC').
3. Assuming one byte per character. In UTF-8, ASCII characters are 1 byte, most European accented characters are 2 bytes, most CJK characters are 3 bytes, and most emoji code points are 4 bytes. A 100-character Japanese string is roughly 300 bytes in UTF-8. If you're checking against a byte-level storage limit, encode the string and measure bytes, not characters.
4. Truncating in the middle of a grapheme cluster. If you cut a string at a fixed length for display purposes, you might slice through a surrogate pair or a combining character sequence, producing invalid or garbled text. Always truncate at grapheme cluster boundaries.
5. Forgetting newlines and whitespace. A character counter that counts "Hello World" as 11 characters is technically correct (space is a character), but some platforms exclude trailing whitespace, and a line break can count as 1 or 2 characters depending on whether it is stored as LF or CRLF. Be explicit about what you're counting.

// Safe truncation at grapheme boundaries
function truncateGraphemes(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  const segments = [...segmenter.segment(str)];
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}
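
For the byte-counting mistake, the encoded size can be measured directly instead of being estimated. A minimal sketch using the standard TextEncoder, which always encodes to UTF-8; utf8ByteLength is a hypothetical helper name:

```javascript
// Measures the UTF-8 encoded size of a string in bytes.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength("hello"));       // 5 (ASCII: 1 byte each)
console.log(utf8ByteLength("caf\u00e9"));   // 5 (the accented e takes 2 bytes)
console.log(utf8ByteLength("日本語"));       // 9 (3 bytes each)
console.log(utf8ByteLength("🔥"));          // 4
```

Compare this number, not .length, against any limit that is defined in bytes.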

Quick counting

For fast character, word, sentence, and paragraph counting with proper Unicode handling, I built a character counter at zovo.one/free-tools/character-counter. It shows you the breakdown across different counting methods so you know exactly what you're working with.

Strings are arrays of bytes pretending to be arrays of characters pretending to be arrays of what humans see. Every layer of that pretense has edge cases. Knowing which layer you're operating on is the difference between a character counter that works and one that breaks on the first emoji.

I'm Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.