Background
During the development of a simple divination script, I ran into this interesting problem. If we only need a few specific Chinese characters, we can hard-code a dictionary in the script, but what if we want to get the stroke count of any Chinese character?
pypinyin library
```python
from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    # Style.NORMAL returns plain pinyin without tone marks
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Please enter a Chinese character: ")
strokes = get_strokes_count(character)
print("Character '{}' stroke count: {}".format(character, strokes))
```
I tried it and found that the result is actually the number of pinyin readings returned for the character in NORMAL style, not its stroke count.
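To see why, here is a quick check of what pinyin() actually returns (illustrative only; the exact output depends on the installed pypinyin version and its dictionary data):

```python
from pypinyin import pinyin, Style

result = pinyin("阿", style=Style.NORMAL)
print(result)          # something like [['a']] -- one list of readings per character
print(len(result[0]))  # 1 -- the number of readings, not the stroke count
```

So pypinyin gives us pronunciations, not strokes, and we need a different data source.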
Unihan Database
The Unihan database is a Chinese character database maintained by the Unicode Consortium, which seems quite reliable and also provides online tools.
In its online query tool, Unihan Database Lookup, I found that the query results contain the kTotalStrokes field, which is exactly the stroke-count data we need.
As Unicode's official character database, the current version fully covers the basic needs of Chinese character lookups.
Nice! One step closer to success!
Getting Stroke Information from Unihan Database
I initially planned to send query requests directly through the lookup page, but it was too slow, since the server is hosted outside China. The database file itself turned out to be small, so I simply downloaded it.
After opening the compressed package, there are several files.
Based on the lookup results, the kTotalStrokes field we need lives in the IRG Sources file, Unihan_IRGSources.txt, so extract that file.
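For reference, here is a minimal sketch of downloading and unpacking the archive; I'm assuming the usual public UCD location for Unihan.zip and the standard file name inside it:

```python
import urllib.request
import zipfile
from pathlib import Path

# Public location of the Unihan data (assumed; verify before relying on it)
UNIHAN_URL = "https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip"

target_dir = Path("Stroke")
target_dir.mkdir(exist_ok=True)

zip_path = target_dir / "Unihan.zip"
urllib.request.urlretrieve(UNIHAN_URL, str(zip_path))

# Only the IRG Sources file is needed -- it contains kTotalStrokes
with zipfile.ZipFile(zip_path) as zf:
    zf.extract("Unihan_IRGSources.txt", path=target_dir)
```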
I worked out a regex on regex101 to extract the Unicode code point and the stroke count, which are saved separately for querying.
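As a sanity check, the same pattern used below can be run against one illustrative entry (Unihan entries are tab-separated: code point, field name, value):

```python
import re

pattern = r"(U\+.*)\skTotalStrokes.*\s(\d+)"
sample_line = "U+4E00\tkTotalStrokes\t1"  # hypothetical sample entry for 一

print(re.findall(pattern, sample_line))   # [('U+4E00', '1')]
```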
Coding
- Extracting Stroke Information
```python
import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")

# Capture the code point (e.g. "U+4E00") and the kTotalStrokes value
pattern = r"(U\+.*)\skTotalStrokes.*\s(\d+)"

stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        raw_line = line.strip()
        result = re.findall(pattern=pattern, string=raw_line)
        if len(result) == 0:
            continue
        unicode_key = result[0][0]
        unicode_stroke = result[0][1]
        print(f"{unicode_key}: {unicode_stroke}")
        stroke_dict[unicode_key] = unicode_stroke

with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)
```
The mapping is exported to JSON for easy access later.
- Writing the Lookup Function
```python
import json

# `output` is the JSON file written in the previous step
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str):
    # Build the "U+XXXX" key from the character's code point
    unicode = "U+" + str(hex(ord(char)))[2:].upper()
    return int(unicode2stroke[unicode])

test_char = "阿"
print(get_character_stroke_count(char=test_char))
```
When looking up a character, note that it first has to be converted to its hexadecimal Unicode code point, in the same "U+XXXX" form used as keys in the JSON file.
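As a small illustration of that conversion (assuming 阿 is U+963F, as listed in Unihan):

```python
char = "阿"
print(hex(ord(char)))                     # 0x963f
print("U+" + hex(ord(char))[2:].upper())  # U+963F -- matches the JSON keys
# An equivalent, more direct formatting:
print(f"U+{ord(char):04X}")               # U+963F
```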
Success! The expected result is achieved!