DEV Community

Lasse Skindstad Ebert

TIL: Unicode chars in regex

Today I needed to match some unicode chars in an Elixir regex.

TL;DR:

Use the u modifier together with \x{...}, e.g. ~r/\x{1234}/u

Matching a unicode char in a regex

More specifically, I needed to remove all zero width chars from a string.
These are U+200B, U+200C, U+200D and U+FEFF.

Trying to use \u does not work:

iex(1)> ~r/\u200B/
** (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 1
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

Looking at the docs, it seems that \x{} is the way to go, but no:

iex(1)> ~r/\x{200B}/
** (Regex.CompileError) character value in \x{} or \o{} is too large at position 7
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

The trick is to add the unicode (u) modifier to the regex. It tells the regex compiler that the pattern and the subject are Unicode, so code points above 0xFF are allowed in \x{...}:

iex(1)> ~r/\x{200B}/u
~r/\x{200B}/u
iex(2)> "Hello,\u200BWorld!" |> String.replace(~r/\x{200B}/u, "")
"Hello,World!"
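
The same difference shows up without the sigil. As a minimal sketch, Regex.compile/2 returns a tagged tuple instead of raising, so both cases can be checked side by side (the variable names here are only illustrative):

```elixir
# Without the u modifier the pattern is compiled in byte mode,
# so a code point above 0xFF is rejected.
without_u = Regex.compile("\\x{200B}")
IO.inspect(match?({:error, _}, without_u))

# With the u modifier the pattern compiles and matches U+200B.
{:ok, with_u} = Regex.compile("\\x{200B}", "u")
IO.inspect(Regex.match?(with_u, "Hello,\u200BWorld!"))
```

Both IO.inspect calls should print true.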

Yay!

So my final regex could be something like:

~r/\x{200B}|\x{200C}|\x{200D}|\x{FEFF}/u
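
An equivalent way to write this is a character class. The snippet below is a small sketch of that variant applied to a made-up input string; the names zero_width, dirty and clean are only for illustration:

```elixir
# Character-class form of the alternation above; it still needs the u modifier.
zero_width = ~r/[\x{200B}\x{200C}\x{200D}\x{FEFF}]/u

# Illustrative input containing one of each zero width character.
dirty = "He\u200Bllo,\u200C Wor\u200Dld\uFEFF!"

# String.replace/3 replaces every match by default.
clean = String.replace(dirty, zero_width, "")
IO.inspect(clean)
```

This should print "Hello, World!".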

Interpolation works too.

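We can also interpolate a string into a regex. This gives the same result and works even without the u modifier, because the interpolated character is inserted into the pattern as its raw UTF-8 bytes, which are then matched literally: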

iex(5)> "Hello,\u200BWorld!" |> String.replace(~r/#{"\u200B"}/, "")
"Hello,World!"
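
To see why this works, here is a rough sketch that looks at the bytes involved. It also shows an interpolated character class, where I would keep the u modifier so that the class contains code points rather than individual bytes (variable names are illustrative):

```elixir
# "\u200B" is encoded as three UTF-8 bytes.
zwsp = "\u200B"
IO.inspect(zwsp == <<0xE2, 0x80, 0x8B>>)

# Interpolation puts those bytes into the pattern source as-is.
IO.inspect(Regex.source(~r/#{zwsp}/) == zwsp)

# Inside a character class, use the u modifier so each character
# is treated as one code point and not as three separate bytes.
zero_width = ~r/[#{"\u200B\u200C\u200D\uFEFF"}]/u
cleaned = String.replace("Hello,\u200BWorld!\uFEFF", zero_width, "")
IO.inspect(cleaned)
```

The first two lines should print true, and the last one "Hello,World!".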
