DEV Community

Daniele

Strings are hard. Unicode is harder.

Strings should be simple.

You call .length(), slice off a few characters, maybe lowercase something, generate a URL-friendly slug. Basic stuff. The kind of thing you write once and never think about again.

That's what I thought, too, until I shipped a feature where users could create custom display names, and someone named "José 👨‍💻" signed up.

Everything broke in the weirdest ways. The slug generator mangled his name into gibberish. The "initial" function pulled the wrong letter. The character counter said his name was 11 characters long when it was obviously six.

That's when I realized: strings aren't simple. They're one of those deceptively hard problems in programming that everyone assumes is solved, until it quietly corrupts your data in production.

This post is about that realization, about why I ended up writing str (a Unicode-aware string utility library for Gleam), and why I chose Gleam, specifically, to do it.


The moment strings stopped being simple

Let me show you the bug that started all of this.

How many characters are in this string?

"👨‍👩‍👧‍👦"

If your gut reaction is "one," congratulations, you're thinking like a user. That's a family emoji. One symbol. One visual unit.

But ask your programming language, and you'll probably get a completely different answer:

  • JavaScript: 11 (code units in UTF-16)
  • Python 3: 7 (Unicode code points)
  • Most string libraries: somewhere in between

None of them say "1."

And that's where the bugs start.

Because what users see isn't what the computer counts. Modern text isn't just a flat array of characters. It's layered, contextual, and full of invisible combining marks, zero-width joiners, and modifier sequences.
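To see those different answers side by side, here is a minimal sketch using only Gleam's standard library (gleam/string and gleam/list). The helper names codepoint_count and utf16_units are made up for this illustration and are not part of str.

```gleam
import gleam/int
import gleam/io
import gleam/list
import gleam/string

/// Count Unicode code points (the number Python 3's len() reports).
pub fn codepoint_count(text: String) -> Int {
  text |> string.to_utf_codepoints |> list.length
}

/// Count UTF-16 code units (the number JavaScript's .length reports).
pub fn utf16_units(text: String) -> Int {
  text
  |> string.to_utf_codepoints
  |> list.fold(0, fn(total, codepoint) {
    // Code points above U+FFFF are stored as a surrogate pair: two units.
    case string.utf_codepoint_to_int(codepoint) > 0xFFFF {
      True -> total + 2
      False -> total + 1
    }
  })
}

pub fn main() {
  // The family emoji spelled out: four people joined by three ZWJs (U+200D).
  let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
  io.println("code points:  " <> int.to_string(codepoint_count(family)))
  io.println("UTF-16 units: " <> int.to_string(utf16_units(family)))
  io.println("UTF-8 bytes:  " <> int.to_string(string.byte_size(family)))
  io.println("graphemes:    " <> int.to_string(string.length(family)))
}
```

The first three lines print 7, 11, and 25. The grapheme count depends on the runtime's segmentation rules: a recent Erlang/OTP release should report 1, but older runtimes may not.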

Here's another example that broke things for me:

"🇮🇹"

That's the Italian flag emoji. Visually: one symbol. Internally: two regional indicator code points (🇮 + 🇹).

Slice it in the wrong place, and you don't get "half a flag": you get a stray regional indicator, or, if the cut lands inside a surrogate pair or a UTF-8 byte sequence, outright invalid text. Worse, you might not even notice until a user files a bug report three months later.


Unicode bugs are silent, and that's what makes them dangerous

The worst thing about Unicode bugs? They rarely crash your program.

They just:

  • truncate user input at weird positions
  • generate slugs that look like jos-null-programmer
  • miscount lengths, breaking layout or validation
  • corrupt data silently: everything looks fine until you inspect it closely

I once built a "smart truncation" function that would cut text to exactly 50 characters and append ... if needed. It worked perfectly in tests, because I only tested with ASCII. Then someone entered "Hello 👨‍👩‍👧‍👦 World", and my function split the family emoji right in half. The result? Broken output that rendered as �.
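A grapheme-aware version of that truncation can be sketched in a few lines with Gleam's standard library. This is only an illustration: truncate_graphemes is a hypothetical helper, not the function str ships, and whether the limit should include the suffix is a design choice this sketch does not try to settle.

```gleam
import gleam/list
import gleam/string

/// Keep at most `max` graphemes of `text`, appending `suffix` only when
/// something was actually cut off. Hypothetical helper, not str's API.
pub fn truncate_graphemes(text: String, max: Int, suffix: String) -> String {
  // Split on grapheme boundaries, so a cut can never land inside a cluster.
  let graphemes = string.to_graphemes(text)
  case list.length(graphemes) > max {
    False -> text
    True -> string.concat(list.take(graphemes, max)) <> suffix
  }
}
```

For example, truncate_graphemes("Hello World", 5, "...") returns "Hello...". Because the cut happens between list elements, an emoji sequence is either kept whole or dropped whole, assuming the runtime's grapheme segmentation recognises it as one cluster.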

That's when I started paying attention to how string libraries actually handle Unicode—or more accurately, how they don't.

And I realized: most libraries treat "Unicode support" as an afterthought, a checkbox to tick. They'll handle ASCII perfectly, maybe support UTF-8 encoding, and call it a day.

But real Unicode correctness? Grapheme cluster boundaries? Handling emoji sequences, combining diacritics, regional indicators?

That stuff gets ignored until it causes problems.


The question that became a library

At some point, frustrated with patching Unicode bugs in multiple projects, I asked myself:

What would a string utility library look like if Unicode correctness was the starting point, not an afterthought?

That question eventually became str.

But before I talk about the library itself, I need to explain why I wrote it in Gleam, of all languages.


Why Gleam?

I didn't choose Gleam because it's popular. It's not, yet. The ecosystem is small, there's no massive community, and you can't just Google your way through every problem.

I chose it because of that.

Coming from Elixir, where there are ten competing libraries for everything and mature tooling for every use case, Gleam felt refreshingly constrained. And constraints are great when you want to focus on API design and correctness instead of fighting with dependency hell.

Gleam also gave me things I really care about when working with something as finicky as Unicode:

  • Strong static typing: whole classes of mistakes are caught at compile time instead of at runtime
  • Immutability by default: no accidental state mutations breaking edge cases
  • No null: absence and failure are modeled explicitly with Option and Result
  • Explicit error handling: if something can fail, you have to handle it
  • Simple, readable syntax: I wanted the library to be approachable

But honestly, the biggest reason I chose Gleam?

Because writing this library would force me to understand the language deeply.

And it did. I learned Gleam by building something real, something that had to work correctly, not just compile.


What str is (and what it isn't)

str isn't trying to be a "kitchen sink" library. It's not competing with massive string utilities that do everything from regex to natural language processing.

The goals are deliberately modest:

  • Unicode-aware operations: If a function works with "characters," it should mean graphemes, not bytes or code points
  • Predictable behavior: Functions do what they say and nothing more
  • Small, composable API: Each function has one job
  • Correctness over cleverness: I'd rather be boring and reliable than fast and wrong

Most importantly: if it says it operates on characters, it means graphemes, the things users actually see.


A quick example of what goes wrong without grapheme awareness

Consider this string:

"👩🏽‍🚀"

Visually: one astronaut emoji with medium skin tone.

Internally:

  • 👩 Base emoji (woman)
  • 🏽 Skin tone modifier
  • ZWJ (zero-width joiner)
  • 🚀 Rocket

Four separate Unicode code points, but the user sees one symbol.

Now imagine trying to:

  • Truncate it to "1 character"
  • Capitalize it
  • Reverse it
  • Slice it in half

If your string library doesn't understand grapheme clusters, all of these operations produce broken output.
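Reversal shows the problem most clearly. The sketch below uses only gleam/string and gleam/list; reverse_codepoints and codepoint_values are illustrative names, not part of str. It reverses a string one code point at a time, which is the naive approach that scrambles emoji sequences.

```gleam
import gleam/list
import gleam/string

/// Reverse by code point: the naive approach that scrambles sequences.
pub fn reverse_codepoints(text: String) -> String {
  text
  |> string.to_utf_codepoints
  |> list.reverse
  |> string.from_utf_codepoints
}

/// The code points of `text` as integers, for inspection.
pub fn codepoint_values(text: String) -> List(Int) {
  text
  |> string.to_utf_codepoints
  |> list.map(string.utf_codepoint_to_int)
}
```

For the astronaut, written as "\u{1F469}\u{1F3FD}\u{200D}\u{1F680}", codepoint_values returns [128105, 127997, 8205, 128640]. reverse_codepoints puts the rocket first and the skin tone modifier after the joiner, so the result is no longer the astronaut sequence. The standard library's string.reverse works on graphemes instead, which is the behavior a user would expect.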

That's why str consistently operates at the grapheme level:

import str/core

core.length("👩🏽‍🚀")  // → 1 (not 4, not 7)

One visible symbol. One count. That's the mental model users have, and the library should match it.


The surprising complexity of case conversion

Case conversion is one of those things everyone assumes is trivial.

Until it isn't.

Quick: what should this do?

to_snake_case("Hello World")  // Easy: "hello_world"

Okay, that's straightforward. Now try:

to_snake_case("Crème Brûlée 🍮")

What do you want here? Probably:

"creme_brulee"

Not:

  • "crème_brûlée" (preserving accents breaks URLs)
  • "cr_me_br_l_e" (dropped characters entirely)
  • "creme_brulee_" (weird trailing underscore from emoji)

With str, the behavior is explicit and predictable:

import str/extra

extra.to_snake_case("Crème Brûlée 🍮")
// → "creme_brulee"

extra.to_kebab_case("Crème Brûlée 🍮")
// → "creme-brulee"

It normalizes Unicode, folds accents to ASCII when needed, strips non-alphanumeric symbols, and produces stable, URL-safe output.

No surprises. No edge cases breaking in production.


Slugs, URLs, and why slugification is harder than it looks

Slugification is the perfect example of "simple functionality" that hides massive complexity underneath.

People don't write URLs. They write text:

"Caffè ☕ e codice 👨‍💻"

And you want to turn that into:

"caffe-e-codice"

Under the hood, that involves:

  • Unicode normalization (NFD decomposition)
  • ASCII transliteration (è → e, ä → a)
  • Symbol removal (☕, 👨‍💻 stripped)
  • Whitespace collapsing and replacement
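Those steps can be sketched roughly as follows. This is a simplified illustration, not str's implementation: it skips NFD normalization and instead folds accents with a tiny hand-written lookup table, so it assumes the input uses precomposed characters. The names fold, is_slug_char, and slugify_sketch are hypothetical.

```gleam
import gleam/list
import gleam/string

/// Tiny illustrative folding table; a real one covers far more characters.
/// Assumes precomposed (NFC) input such as U+00E8 for "è".
fn fold(grapheme: String) -> String {
  case grapheme {
    "à" | "á" | "â" | "ä" -> "a"
    "è" | "é" | "ê" | "ë" -> "e"
    "ì" | "í" | "î" | "ï" -> "i"
    "ò" | "ó" | "ô" | "ö" -> "o"
    "ù" | "ú" | "û" | "ü" -> "u"
    other -> other
  }
}

/// True for graphemes allowed in the slug: a-z and 0-9.
fn is_slug_char(grapheme: String) -> Bool {
  string.contains("abcdefghijklmnopqrstuvwxyz0123456789", grapheme)
}

pub fn slugify_sketch(text: String) -> String {
  text
  |> string.lowercase
  |> string.to_graphemes
  // ASCII transliteration via the lookup table.
  |> list.map(fold)
  // Symbol removal: anything outside a-z and 0-9 becomes a space.
  |> list.map(fn(grapheme) {
    case is_slug_char(grapheme) {
      True -> grapheme
      False -> " "
    }
  })
  |> string.concat
  // Whitespace collapsing: split on spaces and drop the empty pieces.
  |> string.split(on: " ")
  |> list.filter(fn(word) { word != "" })
  |> string.join(with: "-")
}
```

With this sketch, slugify_sketch("Caffè ☕ e codice 👨‍💻") returns "caffe-e-codice". Turning every rejected grapheme into a space before splitting is what avoids doubled or trailing separators.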

But the API stays simple:

extra.slugify("Caffè ☕ e codice 👨‍💻")
// → "caffe-e-codice"

Boring. Predictable. Correct.

That's the goal.


Emoji-friendly by design

I paid special attention to emoji handling, because emoji are where bad string handling is most visible.

core.length("👨‍👩‍👧‍👦")  // → 1 (not 7, not 11)

That family emoji is one visual symbol, and str treats it as one grapheme cluster.

Same with operations like truncation:

core.truncate("Hello 👨‍👩‍👧‍👦 World", 8, "...")
// → "Hello 👨‍👩‍👧‍👦..."

The emoji stays intact. No broken Unicode. No weird � characters.

This might seem like a small thing, but it's the difference between software that feels right and software that subtly breaks for international users.


Design choices (and trade-offs)

Some decisions in str are opinionated.

For example:

  • I avoided depending directly on OTP's Unicode helpers: str is pure Gleam
  • Internal character tables are used for ASCII folding instead of runtime lookups
  • Correctness is favored over clever micro-optimizations
  • Fewer functions, but clearer ones

This means:

  • No magic: every function does exactly what it says
  • No surprising behavior: edge cases are handled explicitly
  • No hidden costs: you know what's happening under the hood

Text bugs are expensive. They corrupt data. They break user trust. They're hard to reproduce and harder to fix.

I'd rather str be boring and correct than clever and unpredictable.


Who is str for?

str is useful if you're:

  • Building web backends and need URL-safe slugs
  • Generating identifiers or usernames from user input
  • Writing CLIs that display or format text
  • Cleaning or transforming international text
  • Working with emoji-heavy user-generated content

It's especially useful if you:

  • Care about Unicode correctness
  • Want predictable, well-tested behavior
  • Are exploring Gleam and want a solid utility library

Writing str was also about the ecosystem

One reason I wanted to do this in Gleam was to contribute something tangible to the ecosystem.

In a smaller community:

  • Every library matters
  • Design decisions are more visible
  • Feedback loops are shorter

Writing str wasn't just about solving string bugs. It was about learning Gleam by building something real, something people could actually use.

And honestly? It was fun. The type system caught so many edge cases. The immutability forced me to think carefully about state. The explicit error handling made the API clearer.

Gleam made me a better programmer while I was building this.


Final thoughts

Strings are hard.

Unicode is harder.

Emoji make everything visible.

Gleam gave me the right balance of constraints and expressiveness to tackle this problem thoughtfully, without letting me cut corners.

If you're exploring Gleam, or if you've ever been bitten by subtle Unicode bugs, I hope str is useful to you.

Feedback, issues, and contributions are always welcome.

Thanks for reading.


P.S. If you've ever wondered why "café".length !== "café".length in some languages (one é is a single precomposed code point, the other is an e followed by a combining accent), welcome to Unicode normalization hell. I have stories.
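For the curious, the two spellings can be written out explicitly with escape sequences. This is a small sketch using only the standard library; the constant and function names are illustrative.

```gleam
import gleam/list
import gleam/string

// "café" with é as one precomposed code point (U+00E9).
pub const composed = "caf\u{00E9}"

// "café" with a plain e followed by a combining acute accent (U+0301).
pub const decomposed = "cafe\u{0301}"

/// Count Unicode code points in `text`.
pub fn codepoint_count(text: String) -> Int {
  text |> string.to_utf_codepoints |> list.length
}
```

Here composed == decomposed is False, and codepoint_count returns 4 for the first string and 5 for the second, even though both render as the same four visible characters.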
