You likely have a database field, a text input, or a textarea with a maxlength attribute. It is highly probable that the way your code calculates that length is incorrect for a global audience.
We are going to look at a practical, structural approach to text length, focusing on a concept called the grapheme cluster. By the end of this guide, you will understand exactly why standard string length properties fail, and how to fix them to respect international users.
When you ask a user for their name, and you limit it to 20 "characters", what do you actually mean?
If you use JavaScript and inspect the length property of a string, or measure string length in languages like Java or C#, you are usually measuring UTF-16 code units. If you are looking at a database like MySQL, you might be measuring bytes or Unicode code points.
None of these represent what a human user considers a "character".
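To make the mismatch concrete, here is a small JavaScript sketch (the string is illustrative) that measures a single user-perceived character three different ways:

```javascript
// One user-perceived character: "e" + combining acute accent (U+0301).
const name = "e\u0301"; // renders as "é"

console.log(name.length);                           // 2 UTF-16 code units
console.log([...name].length);                      // 2 Unicode code points
console.log(new TextEncoder().encode(name).length); // 3 UTF-8 bytes
```

One visual character, three different counts, and none of them is 1.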
Grapheme Cluster
In Unicode terminology, what a user perceives as a single visual unit of text is called a grapheme cluster. A grapheme cluster consists of a base character followed by zero or more combining characters that modify it.
Here are a few practical examples:
The letter é can be represented as a single code point (U+00E9), but it can also be represented as a base letter e (U+0065) followed by a combining acute accent ◌́ (U+0301). Visually, this is 1 grapheme cluster. To a naive string.length check, it appears as 2.
In the Devanagari script used for Hindi, the single visual unit क्ष्म consists of multiple code points, but represents just 1 grapheme cluster.
The "Woman Farmer" emoji 👩‍🌾 is constructed using the "Woman" emoji (U+1F469), a Zero Width Joiner (U+200D), and a "Sheaf of Rice" emoji (U+1F33E). That is 3 code points (and many more bytes!), but only 1 user-perceived character.
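You can verify these examples directly in JavaScript. Each string below is written with explicit escape sequences so the underlying code points are visible:

```javascript
const accented = "\u00E9";                 // é as a single precomposed code point
const decomposed = "e\u0301";              // é as base letter + combining accent
const farmer = "\u{1F469}\u200D\u{1F33E}"; // woman + ZWJ + sheaf of rice

console.log([...accented].length);   // 1 code point
console.log([...decomposed].length); // 2 code points
console.log([...farmer].length);     // 3 code points
console.log(farmer.length);          // 5 UTF-16 code units

// NFC normalization folds the decomposed form into the precomposed one.
console.log(accented === decomposed.normalize("NFC")); // true
```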
For more on how Unicode works under the hood, I recommend reviewing the W3C Character encodings essential concepts.
Why Your Input Limit is Breaking
When you strictly enforce an input length limit, you run the risk of truncating text in the middle of a grapheme cluster.
Imagine your database has a hard limit. A user pastes a string that looks like 10 characters to them, but occupies 12 code points. If your backend brutally slices the string at an arbitrary byte or code point boundary, you might sever a combining accent from its base letter, or split a family emoji back into floating disembodied heads.
This results in broken database records, corrupted user interfaces, and an inaccessible experience for users writing in Arabic, Indic, or Southeast Asian scripts, not to mention users who just want to use a flag emoji.
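Severing a cluster is easy to do by accident, because String.prototype.slice cuts at UTF-16 code unit boundaries. A quick illustration (the strings here are just examples):

```javascript
const farmer = "\u{1F469}\u200D\u{1F33E}"; // 👩‍🌾: one grapheme, 5 code units

// Slicing at code-unit boundaries tears the cluster apart:
console.log(farmer.slice(0, 2)); // "👩": the ZWJ sequence is severed
console.log(farmer.slice(0, 1)); // a lone surrogate: invalid, renders as �

const name = "ne\u0301e"; // "née" with a combining accent
console.log(name.slice(0, 2)); // "ne": the accent is cut off
```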
The Practical Solution
To build robust, reliable internationalized systems, measure text length in grapheme clusters. Modern programming environments ship built-in, standards-compliant tools for exactly this.
In modern web development, the most reliable way to count grapheme clusters accurately is the Intl.Segmenter API, part of the ECMAScript Internationalization API.
```javascript
// A string with a ZWJ emoji sequence, an accented letter, and plain Latin.
const userInput = "\u{1F469}\u200D\u{1F33E}\u00E9x"; // 👩‍🌾éx

// The naive measure: UTF-16 code units
console.log(userInput.length); // Output: 7

// The correct method: Intl.Segmenter with grapheme granularity
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = segmenter.segment(userInput);

// Count the actual grapheme clusters
let graphemeCount = 0;
for (const segment of segments) {
  graphemeCount++;
}
console.log(graphemeCount); // Output: 3
```
By standardizing on Intl.Segmenter (supported in all modern browsers and in Node.js 16 and later), you ensure that your character counts match what users actually see on screen.
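The same technique also gives you safe truncation when you must enforce a hard limit. The helper below is a sketch, not a drop-in library function; truncateGraphemes is a hypothetical name, and it assumes Intl.Segmenter is available:

```javascript
// Hypothetical helper: keep at most `max` grapheme clusters,
// never splitting a cluster in the middle.
function truncateGraphemes(input, max, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(input)) {
    if (count === max) break;
    result += segment;
    count++;
  }
  return result;
}

// "é" + "x" + 👩‍🌾 is 3 clusters; keeping 2 preserves whole clusters.
console.log(truncateGraphemes("e\u0301x\u{1F469}\u200D\u{1F33E}", 2)); // "éx"
```

Run this validation on both the client and the server, so the limit the user sees is the limit the database enforces.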
We have a responsibility to build systems that work reliably regardless of the user's locale. For a broader understanding of how to implement global best practices in your web applications, please consult the W3C Internationalization techniques documentation.