DEV Community

Adam.S
Adam.S

Posted on • Edited on • Originally published at bas-man.dev

2 1

Working with double-byte regex expressions and Python3

As part of my project Self Hosted Zapier Alternative; I am having to deal with doing regex searches against the three Japanese written forms, Kanji, Hiragana and Katakana.

Fortunately this is a common problem. So I have found some references for this.
Also one of my favourite tools for developing regex expressions, Regex101, also offers support in this area.

I found this useful Github Gist.

note:
You should also check the gist directly as there are some follow up comments and additions. See here

Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9fcf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hirgana or Katakana
([ぁ-んァ-ン])
Regex for matching Non-Hirgana or Non-Katakana
([^ぁ-んァ-ン])
Regex for matching Hirgana or Katakana or basic punctuation (、。’)
([ぁ-んァ-ン\w])
Regex for matching Hirgana or Katakana and random other characters
([ぁ-んァ-ン!:/])
Regex for matching Hirgana
([ぁ-ん])
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Regex for matching half-width Katakana (hankaku 半角)
([ァ-ン゙゚])
Regex for matching full-width Numbers (zenkaku 全角)
([0-9])
Regex for matching full-width Letters (zenkaku 全角)
([A-z])
Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])
Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])
Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ヲ-゚])
Regex for matching Japanese Post Codes
/^¥d{3}¥-¥d{4}$/
/^¥d{3}-¥d{4}$|^¥d{3}-¥d{2}$|^¥d{3}$/
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^¥d{3}-¥d{4}-¥d{4}$|^¥d{11}$/
/^0¥d0-¥d{4}-¥d{4}$/
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^¥d{1,4}-¥d{4}$|^¥d{2,5}-¥d{1,4}-¥d{4}$/

Using Regex101 I was able to come up with the following expression.

r"
^「(?P<busname>[一-龯]\d{1,2})\s
(?P<destination>[一-龯]+)行き・
(?P<boardedat>[一-龯]+)」
"
Enter fullscreen mode Exit fullscreen mode

This will successfully match a string such as:

「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。
 

Resulting in the following three groups.

busname = 渋11

destination = 渋谷駅

boardedat = 駒沢大学駅

If you are working in PHP you can also use the following:

\p{Han} (Using Chinese to match Kanji)

\p{Hiragana}

\p{Katakana}

You can also checkout my Regex Experiments:

v1 PHP

v2 Python3

Retry later

Top comments (2)

Collapse
 
learnbyexample profile image
Sundeep

You could use pypi.org/project/regex/ if you wish to use \p{} in Python

Also, debuggex.com/ is another useful online tool for regex debugging

Collapse
 
basman profile image
Adam.S

Thanks for the information. I will check it out in the future :)

Retry later
Retry later