DEV Community

How do MeCab, Kuromoji and Kagome (Japanese text analyzers) compare, and which dictionary should I choose?

Pacharapol Withayasakpunt
Currently interested in TypeScript, Vue, Kotlin and Python. Looking forward to learning DevOps, though.
・1 min read

MeCab is a long-established Japanese text analyzer implemented in C++. Note that I am not very good at reading Japanese (documentation) myself.

Kuromoji is implemented in Java, with Kuromoji.js reimplemented in JavaScript.

Kagome is a more recently updated library implemented in Go.

However, the parsed output also depends on the training data. That's why I am asking about dictionaries, e.g. unidic, neologd...

Discussion (1)

itsupera

Hello!
I think it will depend on your use case.

For example, if you want to extract the phonemes with MeCab, you need to use ipadic and NOT unidic (I made this mistake :), plus neologd for newer words.
You can try it out with a Docker image someone built:

docker pull intimatemerger/mecab-ipadic-neologd
echo "私は一週間日本に行った" | docker run -i intimatemerger/mecab-ipadic-neologd mecab

Then there is some boilerplate to make it work in Python.
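As a rough idea of what that boilerplate can look like, here is a minimal sketch that pipes text into the Docker image above and parses the result. It assumes the default ipadic output format (surface, a tab, then comma-separated features with the katakana reading in the 8th field). The function names `run_mecab` and `parse_mecab` are made up for this example, and `SAMPLE` is a hand-written illustration of that format rather than output captured from the image.

```python
import subprocess

# Image name taken from the docker command above; the rest is illustrative.
IMAGE = "intimatemerger/mecab-ipadic-neologd"


def run_mecab(text: str) -> str:
    """Pipe text into MeCab inside the container and return its raw output."""
    result = subprocess.run(
        ["docker", "run", "-i", "--rm", IMAGE, "mecab"],
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def parse_mecab(output: str) -> list[tuple[str, str]]:
    """Return (surface, reading) pairs from ipadic-style MeCab output."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # Assumption: the katakana reading is the 8th feature field.
        # Unknown words can have fewer fields, so fall back to the surface.
        reading = fields[7] if len(fields) > 7 else surface
        tokens.append((surface, reading))
    return tokens


# Hand-written sample in the ipadic layout, so the parser can be tried without Docker.
SAMPLE = (
    "私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "EOS\n"
)
```

With Docker available, `parse_mecab(run_mecab("私は一週間日本に行った"))` would give the tokens for the sentence above. Calling `parse_mecab(SAMPLE)` exercises only the parsing step.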
For example, if you want to generate furigana for a sentence, check this out: github.com/itsupera/furigana
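To give an idea of the principle, here is a simplified token-level sketch. It is not how the linked project is implemented. It assumes you already have (surface, katakana reading) pairs from the tokenizer, and the names `kata_to_hira`, `has_kanji` and `add_furigana` as well as the sample `TOKENS` are invented for this example.

```python
def kata_to_hira(text: str) -> str:
    """Convert katakana to hiragana by shifting code points down by 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def has_kanji(text: str) -> bool:
    """True if the text contains a character from the main CJK ideograph block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def add_furigana(tokens: list[tuple[str, str]]) -> str:
    """Append the hiragana reading in parentheses after each token containing kanji."""
    parts = []
    for surface, reading in tokens:
        if has_kanji(surface):
            parts.append(f"{surface}({kata_to_hira(reading)})")
        else:
            parts.append(surface)
    return "".join(parts)


# Hand-written (surface, reading) pairs, standing in for tokenizer output.
TOKENS = [("私", "ワタシ"), ("は", "ハ"), ("日本", "ニホン"), ("に", "ニ")]
print(add_furigana(TOKENS))  # 私(わたし)は日本(にほん)に
```

This annotates whole tokens, so a word that mixes kanji and kana gets the reading for the entire token instead of only for its kanji part. Handling that properly takes extra alignment logic.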

MeCab works pretty well for this, but it's not perfect.
The best tokenizer I have found so far is ichiran (see ichi.moe/ for a demo), but it's written in Lisp and there is not a lot of documentation available.
As for Kuromoji and Kagome, I have not tried them.