Daniele

Posted on Dec 23, 2025

Strings are hard. Unicode is harder.

#gleam #opensource #programming #elixir

Strings should be simple.

You call .length(), slice off a few characters, maybe lowercase something, generate a URL-friendly slug. Basic stuff. The kind of thing you write once and never think about again.

That's what I thought, too, until I shipped a feature where users could create custom display names, and someone named "José 👨‍💻" signed up.

Everything broke in the weirdest ways. The slug generator mangled his name into gibberish. The "initial" function pulled the wrong letter. The character counter said his name was 11 characters long when it was obviously six.

That's when I realized: strings aren't simple. They're one of those deceptively hard problems in programming that everyone assumes is solved, until it quietly corrupts your data in production.

This post is about that realization, about why I ended up writing str (a Unicode-aware string utility library for Gleam), and why I chose Gleam, specifically, to do it.

The moment strings stopped being simple

Let me show you the bug that started all of this.

How many characters are in this string?

"👨‍👩‍👧‍👦"

If your gut reaction is "one," congratulations, you're thinking like a user. That's a family emoji. One symbol. One visual unit.

But ask your programming language, and you'll probably get a completely different answer:

JavaScript: 11 (code units in UTF-16)
Python 3: 7 (Unicode code points)
Anything that counts UTF-8 bytes (Rust's len(), Go's len()): 25

None of them say "1."

And that's where the bugs start.

Because what users see isn't what the computer counts. Modern text isn't just a flat array of characters. It's layered, contextual, and full of invisible combining marks, zero-width joiners, and modifier sequences.
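
To make those numbers concrete, here is a minimal sketch that counts the same string at three levels using only Gleam's standard library (gleam/string and gleam/list). The function names are made up for illustration and are not part of str.

```gleam
import gleam/list
import gleam/string

/// UTF-16 code units: what JavaScript's `.length` counts.
/// Code points above U+FFFF need a surrogate pair, so they count twice.
pub fn utf16_units(text: String) -> Int {
  text
  |> string.to_utf_codepoints
  |> list.fold(0, fn(total, cp) {
    case string.utf_codepoint_to_int(cp) > 0xFFFF {
      True -> total + 2
      False -> total + 1
    }
  })
}

/// Unicode code points: what Python 3's `len()` counts.
pub fn code_points(text: String) -> Int {
  text
  |> string.to_utf_codepoints
  |> list.length
}

/// Grapheme clusters: what a user would call "characters".
pub fn graphemes(text: String) -> Int {
  string.length(text)
}
```

For the family emoji (four people joined by three zero-width joiners), utf16_units gives 11 and code_points gives 7. The grapheme count depends on the segmentation rules shipped with the runtime, but it should be 1 on an up-to-date one.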

Here's another example that broke things for me:

"🇮🇹"

That's the Italian flag emoji. Visually: one symbol. Internally: two regional indicator code points (🇮 + 🇹).

Slice it in the wrong place and you don't get "half a flag." You get a stray regional indicator symbol, or, if the cut lands inside a UTF-16 surrogate pair, outright invalid Unicode. Worse, you might not even notice until a user files a bug report three months later.
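
A small sketch of how a "first character" helper goes wrong on that flag, again using only Gleam's standard library. Both function names are hypothetical, chosen for this example.

```gleam
import gleam/list
import gleam/string

/// Naive version: keeps only the first code point.
pub fn first_code_point(text: String) -> String {
  text
  |> string.to_utf_codepoints
  |> list.take(1)
  |> string.from_utf_codepoints
}

/// Grapheme-aware version: keeps the first user-perceived character.
/// Returns an empty string for empty input.
pub fn first_grapheme(text: String) -> String {
  case string.first(text) {
    Ok(grapheme) -> grapheme
    Error(Nil) -> ""
  }
}
```

Given the Italian flag, first_code_point returns the lone regional indicator 🇮 (U+1F1EE), while first_grapheme is expected to return the whole flag.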

Unicode bugs are silent, and that's what makes them dangerous

The worst thing about Unicode bugs? They rarely crash your program.

They just:

truncate user input at weird positions
generate slugs that look like jos-null-programmer
miscount lengths, breaking layout or validation
corrupt data silently: everything looks fine until you inspect it closely

I once built a "smart truncation" function that would cut text to exactly 50 characters and append ... if needed. It worked perfectly in tests, because I only tested with ASCII. Then someone entered "Hello 👨‍👩‍👧‍👦 World", and my function split the family emoji right in half. The result? Broken output that rendered as �.
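
For comparison, a grapheme-safe truncation can be sketched in a few lines with Gleam's standard library. This is an illustration under assumptions, not str's implementation: truncate_graphemes is a hypothetical name, and the suffix is deliberately not counted toward the limit, which is only one of several reasonable choices.

```gleam
import gleam/list
import gleam/string

/// Keep at most `max` graphemes, then append `suffix` if anything was cut.
/// The cut always falls between grapheme clusters, never inside one.
pub fn truncate_graphemes(text: String, max: Int, suffix: String) -> String {
  let graphemes = string.to_graphemes(text)
  case list.length(graphemes) > max {
    False -> text
    True -> string.concat(list.take(graphemes, max)) <> suffix
  }
}
```

Because the input is split into graphemes before anything is dropped, an emoji sequence or an accented letter is either kept whole or removed whole.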

That's when I started paying attention to how string libraries actually handle Unicode, or more accurately, how they don't.

And I realized: most libraries treat "Unicode support" as an afterthought, a checkbox to tick. They'll handle ASCII perfectly, maybe support UTF-8 encoding, and call it a day.

But real Unicode correctness? Grapheme cluster boundaries? Handling emoji sequences, combining diacritics, regional indicators?

That stuff gets ignored until it causes problems.

The question that became a library

At some point, frustrated with patching Unicode bugs in multiple projects, I asked myself:

What would a string utility library look like if Unicode correctness was the starting point, not an afterthought?

That question eventually became str.

But before I talk about the library itself, I need to explain why I wrote it in Gleam, of all languages.

Why Gleam?

I didn't choose Gleam because it's popular. It's not, yet. The ecosystem is small, there's no massive community, and you can't just Google your way through every problem.

I chose it because of that.

Coming from Elixir, where there are ten competing libraries for everything and mature tooling for every use case, Gleam felt refreshingly constrained. And constraints are great when you want to focus on API design and correctness instead of fighting with dependency hell.

Gleam also gave me things I really care about when working with something as finicky as Unicode:

Strong static typing: far fewer runtime surprises
Immutability by default: no accidental state mutations breaking edge cases
No null: a function that can fail says so in its type by returning a Result or Option
Explicit error handling: if something can fail, you have to handle it
Simple, readable syntax: I wanted the library to be approachable

But honestly, the biggest reason I chose Gleam?

Because writing this library would force me to understand the language deeply.

And it did. I learned Gleam by building something real, something that had to work correctly, not just compile.

What `str` is (and what it isn't)

str isn't trying to be a "kitchen sink" library. It's not competing with massive string utilities that do everything from regex to natural language processing.

The goals are deliberately modest:

Unicode-aware operations: If a function works with "characters," it should mean graphemes, not bytes or code points
Predictable behavior: Functions do what they say and nothing more
Small, composable API: Each function has one job
Correctness over cleverness: I'd rather be boring and reliable than fast and wrong

Most importantly: if it says it operates on characters, it means graphemes, the things users actually see.

A quick example of what goes wrong without grapheme awareness

Consider this string:

"👩🏽‍🚀"

Visually: one astronaut emoji with medium skin tone.

Internally:

👩 Base emoji (woman)
🏽 Skin tone modifier
ZWJ (zero-width joiner)
🚀 Rocket

Four separate Unicode code points, but the user sees one symbol.

Now imagine trying to:

Truncate it to "1 character"
Capitalize it
Reverse it
Slice it in half

If your string library doesn't understand grapheme clusters, any of these operations can produce broken output.
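
Reversal shows the failure most clearly. The sketch below contrasts a code-point reversal with a grapheme reversal, using only Gleam's standard library. Both function names are invented for this example.

```gleam
import gleam/list
import gleam/string

/// Naive reversal: flips the order of individual code points,
/// which detaches modifiers and joiners from their base character.
pub fn reverse_code_points(text: String) -> String {
  text
  |> string.to_utf_codepoints
  |> list.reverse
  |> string.from_utf_codepoints
}

/// Grapheme-aware reversal: flips the order of whole clusters,
/// leaving the code points inside each cluster untouched.
pub fn reverse_graphemes(text: String) -> String {
  text
  |> string.to_graphemes
  |> list.reverse
  |> string.concat
}
```

With "a" followed by "e" plus a combining acute accent, the naive version moves the accent to the front of the string, where it no longer has an "e" to attach to. The grapheme version returns "é" followed by "a".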

That's why str consistently operates at the grapheme level:

import str/core

core.length("👩🏽‍🚀")  // → 1 (not 4, not 7)

One visible symbol. One count. That's the mental model users have, and the library should match it.

The surprising complexity of case conversion

Case conversion is one of those things everyone assumes is trivial.

Until it isn't.

Quick: what should this do?

to_snake_case("Hello World")  // Easy: "hello_world"

Okay, that's straightforward. Now try:

to_snake_case("Crème Brûlée 🍮")

What do you want here? Probably:

"creme_brulee"

Not:

"crème_brûlée" (accents preserved, so it has to be percent-encoded in a URL)
"cr_me_br_l_e" (dropped characters entirely)
"creme_brulee_" (weird trailing underscore from emoji)

With str, the behavior is explicit and predictable:

import str/extra

extra.to_snake_case("Crème Brûlée 🍮")
// → "creme_brulee"

extra.to_kebab_case("Crème Brûlée 🍮")
// → "creme-brulee"

It normalizes Unicode, folds accents to ASCII when needed, strips non-alphanumeric symbols, and produces stable, URL-safe output.

No surprises. No edge cases breaking in production.

Slugs, URLs, and why slugification is harder than it looks

Slugification is the perfect example of "simple functionality" that hides massive complexity underneath.

People don't write URLs. They write text:

"Caffè ☕ e codice 👨‍💻"

And you want to turn that into:

"caffe-e-codice"

Under the hood, that involves:

Unicode normalization (NFD decomposition)
ASCII transliteration (è → e, ä → a)
Symbol removal (☕, 👨‍💻 stripped)
Whitespace collapsing and replacement

But the API stays simple:

extra.slugify("Caffè ☕ e codice 👨‍💻")
// → "caffe-e-codice"

Boring. Predictable. Correct.

That's the goal.
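
The steps listed above can be sketched roughly as follows. This is a simplified illustration, not str's actual code: slugify_sketch and its helpers are hypothetical names, the folding table covers only a handful of letters, and the input is assumed to use precomposed accented letters (the sketch does no normalization of its own).

```gleam
import gleam/list
import gleam/string

const allowed = "abcdefghijklmnopqrstuvwxyz0123456789"

/// Tiny illustrative folding table; a real one covers far more letters.
fn fold(grapheme: String) -> String {
  case grapheme {
    "à" | "á" | "â" | "ä" -> "a"
    "è" | "é" | "ê" | "ë" -> "e"
    "ì" | "í" | "î" | "ï" -> "i"
    "ò" | "ó" | "ô" | "ö" -> "o"
    "ù" | "ú" | "û" | "ü" -> "u"
    _ -> grapheme
  }
}

/// Anything that is not an ASCII letter or digit becomes a separator.
fn keep_or_space(grapheme: String) -> String {
  case string.contains(does: allowed, contain: grapheme) {
    True -> grapheme
    False -> " "
  }
}

pub fn slugify_sketch(text: String) -> String {
  text
  |> string.lowercase
  |> string.to_graphemes
  |> list.map(fold)
  |> list.map(keep_or_space)
  |> string.concat
  // Collapse runs of separators by dropping the empty pieces.
  |> string.split(on: " ")
  |> list.filter(fn(word) { word != "" })
  |> string.join(with: "-")
}
```

Turning every unwanted grapheme into a separator before splitting means emoji, punctuation and repeated spaces all collapse the same way, so there are no stray leading, trailing or doubled hyphens.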

Emoji-friendly by design

I paid special attention to emoji handling, because emoji are where bad string handling is most visible.

core.length("👨‍👩‍👧‍👦")  // → 1 (not 7, not 11)

That family emoji is one visual symbol, and str treats it as one grapheme cluster.

Same with operations like truncation:

core.truncate("Hello 👨‍👩‍👧‍👦 World", 8, "...")
// → "Hello 👨‍👩‍👧‍👦..."

The emoji stays intact. No broken Unicode. No weird � characters.

This might seem like a small thing, but it's the difference between software that feels right and software that subtly breaks for international users.

Design choices (and trade-offs)

Some decisions in str are opinionated.

For example:

I avoided depending directly on OTP's Unicode helpers, so str is pure Gleam
Internal character tables are used for ASCII folding instead of runtime lookups
Correctness is favored over clever micro-optimizations
Fewer functions, but clearer ones

This means:

No magic: every function does exactly what it says
No surprising behavior: edge cases are handled explicitly
No hidden costs: you know what's happening under the hood

Text bugs are expensive. They corrupt data. They break user trust. They're hard to reproduce and harder to fix.

I'd rather str be boring and correct than clever and unpredictable.

Who is `str` for?

str is useful if you're:

Building web backends and need URL-safe slugs
Generating identifiers or usernames from user input
Writing CLIs that display or format text
Cleaning or transforming international text
Working with emoji-heavy user-generated content

It's especially useful if you:

Care about Unicode correctness
Want predictable, well-tested behavior
Are exploring Gleam and want a solid utility library

Writing `str` was also about the ecosystem

One reason I wanted to do this in Gleam was to contribute something tangible to the ecosystem.

In a smaller community:

Every library matters
Design decisions are more visible
Feedback loops are shorter

Writing str wasn't just about solving string bugs; it was about learning Gleam by building something real, something people could actually use.

And honestly? It was fun. The type system caught so many edge cases. The immutability forced me to think carefully about state. The explicit error handling made the API clearer.

Gleam made me a better programmer while I was building this.

Final thoughts

Strings are hard.

Unicode is harder.

Emoji make everything visible.

Gleam gave me the right balance of constraints and expressiveness to tackle this problem thoughtfully, without letting me cut corners.

If you're exploring Gleam, or if you've ever been bitten by subtle Unicode bugs, I hope str is useful to you.

Feedback, issues, and contributions are always welcome.

Thanks for reading.

P.S. — If you've ever wondered why "café".length !== "café".length in some languages, welcome to Unicode normalization hell. I have stories.
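
For the curious, a minimal sketch of that mismatch in Gleam, with the two spellings written out as escapes so the difference is visible. The constant and function names are made up for this example.

```gleam
import gleam/list
import gleam/string

/// "café" with a precomposed é (U+00E9), the form NFC produces.
pub const composed = "caf\u{e9}"

/// "café" as "e" plus a combining acute accent (U+0301), the form NFD produces.
pub const decomposed = "cafe\u{301}"

/// Code point count, the figure many `length` functions report.
pub fn code_point_count(text: String) -> Int {
  text
  |> string.to_utf_codepoints
  |> list.length
}
```

The two strings render identically and both contain four graphemes, yet they hold 4 and 5 code points respectively and do not compare equal unless they are normalized to the same form first.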
