TIL: Unicode chars in regex

#elixir #regex

Today I needed to match some Unicode chars in an Elixir regex.

TL;DR:

Use the u modifier together with \x{...}, e.g. ~r/\x{1234}/u

Matching a Unicode char in a regex

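More specifically, I needed to remove zero-width chars from a string.
These are U+200B (zero width space), U+200C (zero width non-joiner), U+200D (zero width joiner) and U+FEFF (zero width no-break space, also used as the byte order mark).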

Trying to use \u does not work:

iex(1)> ~r/\u200B/
** (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 1
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

Looking at the docs, \x{} seems to be the way to go, but on its own it fails too:

iex(1)> ~r/\x{200B}/
** (Regex.CompileError) character value in \x{} or \o{} is too large at position 7
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

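The trick is that we need to apply the unicode (u) modifier to the regex, telling the regex compiler that we're working in Unicode. Without it the pattern is treated as plain bytes, so any \x{...} value above FF is rejected as too large: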

iex(1)> ~r/\x{200B}/u
~r/\x{200B}/u
iex(2)> "Hello,\u200BWorld!" |> String.replace(~r/\x{200B}/u, "")    
"Hello,World!"

Yay!

So my final regex could be something like:

~r/\x{200B}|\x{200C}|\x{200D}|\x{FEFF}/u
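Since three of the four code points are contiguous, the alternation could also be written as a character class. Here is a minimal sketch of that idea; the ZeroWidth module and strip/1 function are names I made up for illustration, not something from a library:

```elixir
defmodule ZeroWidth do
  # Hypothetical helper for illustration only.
  # U+200B..U+200D are contiguous, so one range covers them;
  # U+FEFF is added to the class separately.
  # The u modifier is still required, because \x{...} values
  # above FF are rejected without it.
  def strip(string) when is_binary(string) do
    String.replace(string, ~r/[\x{200B}-\x{200D}\x{FEFF}]/u, "")
  end
end

IO.inspect(ZeroWidth.strip("He\u200Bllo,\u200C Wor\u200Dld!\uFEFF"))
# prints "Hello, World!"
```

Both forms should match the same four characters; the class is just shorter to read.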

Interpolation works too.

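We can also interpolate strings into a regex. Here "\u200B" is an ordinary Elixir string escape, so the character ends up in the pattern as its raw UTF-8 bytes and is matched byte for byte, which is why this works even without the u modifier: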

iex(5)> "Hello,\u200BWorld!" |> String.replace(~r/#{"\u200B"}/, "")
"Hello,World!"

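Interpolation also makes it possible to build the pattern from a list of characters. The sketch below is one way that might look; the variable names are my own, and I keep the u modifier here because a character class without it would work on individual bytes rather than on whole code points:

```elixir
# Hypothetical variable names, for illustration only.
zero_width = ["\u200B", "\u200C", "\u200D", "\uFEFF"]

# Enum.join/1 glues the four characters into one string; interpolating
# that string inside [...] builds a character class from them.
# With the u modifier the class members are code points, not bytes.
pattern = ~r/[#{Enum.join(zero_width)}]/u

cleaned = String.replace("Hello,\u200BWorld!\uFEFF", pattern, "")
IO.inspect(cleaned)
# prints "Hello,World!"
```

If the interpolated strings could ever contain regex metacharacters, passing them through Regex.escape/1 first is probably a good idea.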