Ashish Kumar

Posted on Apr 25

Building Slugs That Don't Break: Unicode, Diacritics, and Edge Cases

#webdev #seo #javascript #productivity

You ship a blog. The first international post is titled "Café au Lait — A Morning Routine." Your slug generator turns that into /caf-au-lait--a-morning-routine. The double hyphen is ugly, the missing é is worse (the whole letter is gone, not just its accent), and that's just the start of what naive slug generation gets wrong.

This is one of those problems that looks like it deserves five lines of regex and ends up needing four hours and a battle-tested library. Let's walk through what actually goes wrong, why, and the rules a slug generator should follow.

What a slug needs to be

A slug is the human-readable part of a URL: in /blog/why-rust-matters, the slug is why-rust-matters. Good slugs have four properties:

URL-safe — contains only characters that don't need percent-encoding in a URL path
Readable — a human can guess what the page is about from the slug alone
Stable — the same input produces the same slug, forever
Unique — within whatever scope (your blog, your products) two pieces of content don't collide

The naive approach trips on every single one of these.

The five-line slug generator and why it's broken

Most engineers, me included, write something like this the first time:

function slugify(input) {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '')
}

It works on "Hello World" → "hello-world". It also produces these:

"Café au Lait" → "caf-au-lait" (the é is dropped entirely, not just its accent)
"100% Pure" → "100-pure" (dropped the meaning)
"C++ Programming" → "c-programming" (lost the distinguishing feature)
"日本語入門" → "" (empty string: the entire title is gone)
"Hello   World" → "hello-world" (fine only because of the + in the regex; drop it and you get "hello---world")
"Hello-World" → "hello-world" (collides with the natural slug)

And these are just the easy cases.
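To reproduce these yourself, here is a small harness. It is only a sketch: naiveSlugify is the five-line function from above under an illustrative name, and the list of titles is just a sample.

```javascript
// The five-line generator from above, renamed for this sketch.
function naiveSlugify(input) {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '')
}

const titles = ['Hello World', 'Café au Lait', '100% Pure', 'C++ Programming', '日本語入門']

for (const title of titles) {
  // JSON.stringify makes an empty slug visible as "" in the output.
  console.log(`${JSON.stringify(title)} → ${JSON.stringify(naiveSlugify(title))}`)
}
```

Keeping a list like this around as a regression test is cheap, and it pays off every time someone "simplifies" the regex.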

Step 1: Unicode normalization

The first thing a real slug generator does is normalize Unicode. The character é can be represented two ways:

NFC (composed): one code point, U+00E9
NFD (decomposed): two code points, e (U+0065) followed by ◌́ (U+0301, combining acute accent)

These look identical on screen but have different byte sequences. If your slug code only handles one form, the other slips through unchanged.
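A quick way to see the difference is to build both forms from escape sequences and compare them. This is a minimal sketch that should run in Node or any modern browser console; the variable names are illustrative.

```javascript
const composed = '\u00e9'     // é as a single code point (NFC)
const decomposed = 'e\u0301'  // e followed by a combining acute accent (NFD)

console.log(composed === decomposed)                   // false
console.log(composed.length, decomposed.length)        // 1 2
console.log(composed.normalize('NFD') === decomposed)  // true
console.log(decomposed.normalize('NFC') === composed)  // true

// The UTF-8 byte lengths differ as well.
const encoder = new TextEncoder()
console.log(encoder.encode(composed).length)    // 2
console.log(encoder.encode(decomposed).length)  // 3
```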

The fix is simple — normalize first, strip the diacritics second:

function stripDiacritics(input) {
  return input.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

stripDiacritics("Café au Lait")  // → "Cafe au Lait"
stripDiacritics("naïve")          // → "naive"
stripDiacritics("Renée")          // → "Renee"

The \u0300-\u036f range is the Combining Diacritical Marks block. Once é is decomposed into e plus a combining accent, the regex strips just the accent.

This handles most European languages but not all of them. German ß doesn't decompose; it should be transliterated to ss. Polish ł doesn't decompose either. For broad European coverage you need a transliteration map, not just NFD normalization.
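As a rough sketch of what that looks like, the map below handles a handful of letters that NFD leaves alone, then falls through to the diacritic stripping from above. The table and the toAscii name are illustrative assumptions; real libraries ship far larger tables.

```javascript
// Illustrative map only: letters with no canonical decomposition.
const TRANSLITERATE = {
  'ß': 'ss', 'ẞ': 'SS',
  'ł': 'l', 'Ł': 'L',
  'ø': 'o', 'Ø': 'O',
  'æ': 'ae', 'Æ': 'AE',
  'đ': 'd', 'Đ': 'D',
}

function toAscii(input) {
  return input
    .replace(/[ßẞłŁøØæÆđĐ]/g, ch => TRANSLITERATE[ch])  // map first
    .normalize('NFD')                                     // then decompose
    .replace(/[\u0300-\u036f]/g, '')                      // and strip accents
}

console.log(toAscii('Straße'))      // "Strasse"
console.log(toAscii('Łódź'))        // "Lodz"
console.log(toAscii('Smørrebrød'))  // "Smorrebrod"
```

The order matters little here, but mapping first keeps the table limited to precomposed characters.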

Step 2: Transliteration vs dropping

For non-Latin scripts (Chinese, Japanese, Arabic, Hindi, Cyrillic), you have a real decision to make:

Option A: Transliterate. 日本語 becomes nihongo. The slug is readable to a Latin-alphabet reader, but transliteration is lossy and language-specific (東京 → tokyo requires knowing it's Japanese, not Chinese, where it'd be dongjing).

Option B: Pass through. Modern URLs support Unicode. /日本語 is a valid URL: browsers display it as written (on the wire it travels as percent-encoded UTF-8), and search engines index it. The slug becomes meaningful to readers of that language.

Option C: Generate from a separate field. Many blogs let authors set a slug manually for non-Latin titles. The slug is whatever the author types, the title is whatever they meant.

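There's no universally right answer, and platforms differ. As far as I can tell, WordPress strips Latin accents but keeps non-Latin characters (percent-encoded in the stored slug), while Ghost transliterates to ASCII. Many documentation systems lean on option C. Check what your platform actually does, then pick based on your audience.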
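If you go with option B, the ASCII-only character class has to go. Here is a minimal sketch using Unicode property escapes, which need the u flag; unicodeSlug is an illustrative name, not something from a library.

```javascript
function unicodeSlug(input) {
  return input
    .normalize('NFC')                        // store one canonical form
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\p{M}]+/gu, '-')   // keep letters, digits, combining marks
    .replace(/^-+|-+$/g, '')
}

console.log(unicodeSlug('日本語 入門'))    // "日本語-入門"
console.log(unicodeSlug('Café au Lait'))  // "café-au-lait"

// What actually goes over the wire for the path segment:
console.log(encodeURIComponent(unicodeSlug('日本語')))  // "%E6%97%A5%E6%9C%AC%E8%AA%9E"
```

Keeping \p{M} in the class matters for scripts such as Devanagari, where vowel signs are combining marks and stripping them would mangle the word.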

Step 3: Punctuation that means something

100% off shouldn't become 100-off. The % carries information. Battle-tested slug libraries have a symbol map that converts meaningful punctuation into words:

const symbolMap = {
  '&': 'and',
  '%': 'percent',
  '@': 'at',
  '+': 'plus',
  '$': 'dollar',
  '€': 'euro',
  '£': 'pound',
  '#': 'hash',
}

This is opinionated — 100% off → 100-percent-off is more readable than 100-off, but plenty of teams just drop the symbol. Decide once, document it.

For programming languages and tech terms specifically: C++ → cpp, C# → csharp, .NET → dotnet. These are conventions, not deductions; a generic slug library won't get them right unless you tell it to.
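One way to wire the two ideas together is sketched below. The TERMS and SYMBOLS tables and the slugWithSymbols name are assumptions for illustration. The matching is plain substring replacement, so a title like "abc#" would also trigger the c# rule; a production version would want word-boundary checks.

```javascript
// Hypothetical tables for this sketch; extend them for your own content.
const TERMS = new Map([
  ['c++', 'cpp'],
  ['c#', 'csharp'],
  ['.net', 'dotnet'],
])
const SYMBOLS = { '&': 'and', '%': 'percent', '@': 'at', '+': 'plus' }

function slugWithSymbols(input) {
  let text = input.toLowerCase()
  // Tech terms go first, so "c++" becomes "cpp" before "+" could become "plus".
  for (const [term, replacement] of TERMS) {
    text = text.split(term).join(` ${replacement} `)
  }
  return text
    .replace(/[&%@+]/g, ch => ` ${SYMBOLS[ch]} `)  // expand remaining symbols
    .replace(/[^a-z0-9]+/g, '-')                   // everything else → hyphen
    .replace(/^-+|-+$/g, '')
}

console.log(slugWithSymbols('C++ Programming'))  // "cpp-programming"
console.log(slugWithSymbols('100% Pure'))        // "100-percent-pure"
console.log(slugWithSymbols('.NET & C#'))        // "dotnet-and-csharp"
```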

Step 4: The collision problem

You publish "Hello World." The slug is hello-world. Six months later, you publish another "Hello World" — maybe a follow-up, maybe a different topic that happens to share a title. What's the second slug?

Common patterns:

Numeric suffix: hello-world, hello-world-2, hello-world-3
Date suffix: hello-world-2026-04
ID suffix: hello-world-a3f9

The numeric suffix is the most common and almost always wrong. It encourages people to delete and republish to "get the clean URL", which breaks every link to the original. Date suffixes stay meaningful and stable, though two same-titled posts in the same month still need a tiebreaker. ID suffixes look ugly but, when derived from a unique ID, never collide.

Whatever you pick, never silently overwrite an existing slug. Either reject the new content with an error, or generate a unique variant. Slugs that change break every backlink, RSS feed, social share, and search index.
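Here is a minimal sketch of the "generate a unique variant" path using a date suffix. It assumes the existing slugs fit in an in-memory Set called taken, which is a simplification; uniqueSlug is a hypothetical helper, and in a real app a unique constraint in the database should be the final arbiter, since two requests can race.

```javascript
function uniqueSlug(base, taken, date = new Date()) {
  if (!taken.has(base)) return base

  // First fallback: append year and month, e.g. hello-world-2026-04.
  const yyyy = date.getUTCFullYear()
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0')
  const dated = `${base}-${yyyy}-${mm}`
  if (!taken.has(dated)) return dated

  // Same title twice in one month: add a counter as a tiebreaker.
  for (let n = 2; ; n++) {
    const candidate = `${dated}-${n}`
    if (!taken.has(candidate)) return candidate
  }
}

const taken = new Set(['hello-world'])
const april = new Date(Date.UTC(2026, 3, 15))
console.log(uniqueSlug('hello-world', taken, april))  // "hello-world-2026-04"
```

Note that the function never returns a slug that is already in taken, so an existing URL is never overwritten.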

Step 5: Reserved words

If your slug generator ever produces admin, api, login, logout, settings, signup, register, or dashboard, and slugs share a path level with your application routes, you've got a problem. Either:

The real route wins and the content is unreachable (the post would be fine at /blog/admin, but at /admin it never renders)
Or, worse, the slug wins and a user can SEO-impersonate your admin page

Real slug libraries maintain a reserved-words list. Yours should too:

const RESERVED = new Set([
  'admin', 'api', 'app', 'auth', 'dashboard', 'login', 'logout',
  'register', 'settings', 'signup', 'help', 'support', 'about',
  // ...add anything specific to your app
])

if (RESERVED.has(slug)) slug = `${slug}-post`  // or reject

Step 6: Length limits

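The URL specifications set no maximum length, but practical limits exist:

Servers, CDNs, and proxies enforce their own caps on the full URL, often a few kilobytes.
Plain-text emails are conventionally wrapped at under 80 characters per line, so longer links can break in some clients.
Search results typically show only part of a long URL, so anything far past ~60 characters is unlikely to be seen.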

Cap slugs at 60–80 characters, truncated at a word boundary:

function truncate(slug, max = 60) {
  if (slug.length <= max) return slug
  const cut = slug.lastIndexOf('-', max)
  return slug.slice(0, cut > 0 ? cut : max)
}

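The lastIndexOf('-', max) call finds the last hyphen at or before position max, so we cut at a word boundary rather than mid-word. If there is no hyphen to cut at, the function falls back to a hard cut at max.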

Putting it all together

A real slug function looks like this:

function slugify(input, { maxLength = 60 } = {}) {
  const slug = input
    .normalize('NFKD')                       // decompose accents and compatibility forms
    .replace(/[\u0300-\u036f]/g, '')         // strip diacritics
    .replace(/&/g, ' and ')                  // expand symbols
    .replace(/%/g, ' percent ')
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')             // non-alphanumeric runs → one hyphen
    .replace(/^-+|-+$/g, '')                 // trim hyphens
  if (slug.length <= maxLength) return slug
  const cut = slug.lastIndexOf('-', maxLength)     // prefer a word boundary
  return slug.slice(0, cut > 0 ? cut : maxLength)  // hard cut only if no hyphen
}

That's the floor. From there you'd add the reserved-words check, the collision handler, and (for non-Latin support) either transliteration or pass-through Unicode handling.
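As a sketch of that last mile, the helper below takes the "reject" route for collisions and also catches the empty-slug case from earlier. finalizeSlug and its option names are hypothetical; it expects the output of a slug function like the one above.

```javascript
// Hypothetical helper: run it on a generated slug before saving.
function finalizeSlug(slug, { reserved = new Set(), taken = new Set() } = {}) {
  if (slug === '') {
    // For example, an all-CJK title run through an ASCII-only pipeline.
    throw new Error('Empty slug: ask the author for a manual one')
  }
  const candidate = reserved.has(slug) ? `${slug}-post` : slug
  if (taken.has(candidate)) {
    // Reject instead of silently overwriting an existing URL.
    throw new Error(`Slug already in use: ${candidate}`)
  }
  return candidate
}

const reserved = new Set(['admin', 'login'])
console.log(finalizeSlug('admin', { reserved }))        // "admin-post"
console.log(finalizeSlug('hello-world', { reserved }))  // "hello-world"
```

Throwing here is a design choice: it pushes the decision back to the author or the caller rather than guessing a variant on their behalf.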

Use a library — but know what it does

For production use, don't write this yourself. Battle-tested options:

slugify (npm) — handles transliteration for major European languages, fast, good defaults.
@sindresorhus/slugify — more aggressive transliteration, more configuration knobs.
github-slugger — what GitHub uses for anchor links in READMEs. Predictable, simple.
speakingurl — the most thorough, supports the most languages, also the most overhead.

For a one-off — generating a slug while drafting, testing edge cases, or comparing two slug strategies — paste the title into a browser-based slug generator and see what falls out. It runs locally, so internal product names and unreleased post titles don't end up on a third-party server.

TL;DR

The five-line slug regex breaks on Unicode, symbols, collisions, and reserved words.
Normalize Unicode (NFD, or NFKD to also fold compatibility characters), strip combining marks, then decide between transliteration and pass-through for non-Latin scripts.
Map meaningful symbols (% → percent, & → and) — don't silently drop them.
Maintain a reserved-words list. Cap slugs at 60–80 chars, cut on word boundaries.
Never silently overwrite a slug; suffix or reject. Backlinks break forever otherwise.
For everything beyond a one-off, use a library — slugify or @sindresorhus/slugify.

Slug generation is one of those problems that's easy to get 80% right and hard to get the last 20%. Worth doing properly once, then forgetting about.

If this was useful, I've also built a handful of other free, browser-based tools — no signup, no uploads, everything runs client-side:

JSON Tools — https://json.renderlog.in (formatter, validator, JWT decoder, JSONPath tester, 40+ converters)
Text Tools — https://text.renderlog.in (case converters, slug generator, HTML/markdown utilities, 70+ tools)
PDF Tools — https://pdftools.renderlog.in (merge, split, OCR, compress to exact size, 40+ tools)
Image Tools — https://imagetools.renderlog.in (compress, convert, resize, background remover, 50+ tools)
QR Tools — https://qrtools.renderlog.in (WiFi, vCard, UPI, bulk QR codes with logos)
Calc Tools — https://calctool.renderlog.in (60+ calculators for finance, health, math, dates)
Notepad — https://notepad.renderlog.in (private, offline-first notes, no signup)

DEV Community