Working with double-byte regex expressions and Python3

As part of my project, Self Hosted Zapier Alternative, I need to run regex searches against the three Japanese writing systems: Kanji, Hiragana and Katakana.

Fortunately, this is a common problem, so I found some references for it.
One of my favourite tools for developing regular expressions, Regex101, also offers support in this area.

I found this useful GitHub Gist.

You should also check the gist directly, as there are some follow-up comments and additions.
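
As a rough illustration of the kind of character classes involved, here is a minimal Python 3 sketch. The ranges are my own assumption, taken from the Unicode Hiragana, Katakana and CJK Unified Ideographs blocks, and are not necessarily the ones listed in the gist. The helper name `find_runs` is hypothetical.

```python
import re

# Assumed ranges: whole Unicode blocks, not necessarily the gist's ranges.
HIRAGANA = r"\u3040-\u309f"   # Hiragana block
KATAKANA = r"\u30a0-\u30ff"   # Katakana block (also contains ・ and ー)
KANJI = r"\u4e00-\u9fff"      # CJK Unified Ideographs block

SCRIPT_PATTERNS = {
    "hiragana": re.compile(f"[{HIRAGANA}]+"),
    "katakana": re.compile(f"[{KATAKANA}]+"),
    "kanji": re.compile(f"[{KANJI}]+"),
}


def find_runs(text):
    """Return, per script, the consecutive runs of characters found in text."""
    return {name: pattern.findall(text) for name, pattern in SCRIPT_PATTERNS.items()}


runs = find_runs("渋谷駅でタッチしました")
print(runs)
```

Python 3 strings are Unicode, so the `re` module can match these ranges directly without any special flags.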

Using Regex101 I was able to come up with the following expression.


This will successfully match a string such as:

「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。

Resulting in the following three groups.

busname = 渋11

destination = 渋谷駅

boardedat = 駒沢大学駅
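
A minimal sketch of an expression that yields these three named groups in Python 3 might look like the one below. It is an illustration under my own assumptions (Kanji taken as the CJK Unified Ideographs block, and the literal markers 行き・ and 前 used as delimiters), not necessarily the exact expression developed on Regex101.

```python
import re

# Assumed Kanji range: the CJK Unified Ideographs block.
KANJI = r"[\u4e00-\u9fff]"

# Hypothetical pattern: route name, destination before 行き, stop name before 前.
BUS_PATTERN = re.compile(
    rf"「(?P<busname>{KANJI}+\d+)\s+"
    rf"(?P<destination>{KANJI}+)行き・"
    rf"(?P<boardedat>{KANJI}+)前」"
)

text = "「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。"
match = BUS_PATTERN.search(text)
groups = match.groupdict() if match else {}
print(groups)
```

Because 行 and 前 are themselves Kanji, the greedy `+` first consumes them and then backtracks by one character so that the literal delimiters can match.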

If you are working in PHP you can also use the following:

\p{Han} (the Unicode Han script property; it targets Chinese characters, which is what Kanji are)



You can also check out my Regex Experiments:

v1 PHP

v2 Python3

Top comments (2)

learnbyexample

You could use if you wish to use \p{} in Python

Also, is another useful online tool for regex debugging

basman

Thanks for the information. I will check it out in the future :)