Lasse Skindstad Ebert
TIL: Unicode chars in regex

Today I needed to match some unicode chars in an Elixir regex.

TL;DR:

Use the u modifier together with \x{...}, e.g. ~r/\x{1234}/u

Matching a unicode char in a regex

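More specifically, I needed to remove all zero-width chars from a string.
These are U+200B (zero width space), U+200C (zero width non-joiner), U+200D (zero width joiner) and U+FEFF (zero width no-break space, also used as the byte order mark).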

Trying to use \u does not work:

iex(1)> ~r/\u200B/
** (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 1
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

Looking at the docs, it seems that \x{} is the way to go, but on its own it fails too:

iex(1)> ~r/\x{200B}/
** (Regex.CompileError) character value in \x{} or \o{} is too large at position 7
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

The trick is that we need to apply the unicode (u) modifier to the regex, telling the regex compiler that we're working in Unicode:

iex(1)> ~r/\x{200B}/u
~r/\x{200B}/u
iex(2)> "Hello,\u200BWorld!" |> String.replace(~r/\x{200B}/u, "")
"Hello,World!"

Yay!
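To compare the two cases without raising, one option is Regex.compile/2, which returns a tagged tuple instead of raising like the sigil does. This is only a minimal sketch: the variable names are illustrative, and the exact error message may differ between Erlang/OTP versions.

```elixir
# Without the u option the pattern is compiled in byte mode,
# so a character value above 0xFF is rejected.
without_u = Regex.compile("\\x{200B}")

# With the u option the same pattern compiles.
with_u = Regex.compile("\\x{200B}", "u")

IO.inspect(match?({:error, _}, without_u), label: "fails without u")
IO.inspect(match?({:ok, _}, with_u), label: "compiles with u")
```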

So my final regex could be something like:

~r/\x{200B}|\x{200C}|\x{200D}|\x{FEFF}/u
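As a possible variation on the alternation above, the four code points could also go into a single character class. The sketch below wraps that in a small helper; the module name ZeroWidth and the function strip/1 are made up for illustration and are not part of any library.

```elixir
defmodule ZeroWidth do
  # Hypothetical helper: removes U+200B, U+200C, U+200D and U+FEFF.
  # The regex is kept inside the function body rather than in a
  # module attribute.
  def strip(string) when is_binary(string) do
    String.replace(string, ~r/[\x{200B}\x{200C}\x{200D}\x{FEFF}]/u, "")
  end
end

cleaned = ZeroWidth.strip("He\u200Bllo,\u200C Wor\u200Dld!\uFEFF")
IO.inspect(cleaned)
```

The character class and the alternation should match the same four characters; the class is just a little shorter to read.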

Interpolation works too.

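We can also interpolate a string containing the character into a regex. It matches the same way, and it works even without the u modifier, because the interpolated character ends up in the pattern literally: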

iex(5)> "Hello,\u200BWorld!" |> String.replace(~r/#{"\u200B"}/, "")
"Hello,World!"
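Building on the interpolation idea, a regex for all four zero-width chars could be assembled from a list of strings. This is a sketch under assumptions: the variable names are illustrative, and Regex.escape/1 is applied only as a precaution in case the list ever contains regex metacharacters.

```elixir
# The four zero-width characters as ordinary Elixir strings.
zero_width = ["\u200B", "\u200C", "\u200D", "\uFEFF"]

# Escape each entry and join them into an alternation.
alternation =
  zero_width
  |> Enum.map(&Regex.escape/1)
  |> Enum.join("|")

# Interpolate the alternation into the sigil.
regex = ~r/#{alternation}/u

cleaned = String.replace("Hello,\u200BWorld!\uFEFF", regex, "")
IO.inspect(cleaned)
```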
