How computers stumble over Japanese readings

Last Saturday I was tutoring a man in English when he asked a question I hadn't heard before.

"Are the words flea and flea market related?"

I had to stop and think. The short answer is no - they share spelling, but their meanings and origins are different. Native speakers don't usually work that out in real time. We just… know.

I wanted to explain that "just knowing" feeling, and I reached for a Japanese example, partly because I'm learning Japanese myself and this is the kind of thing I keep tripping over.

Take 大分. It's two characters: 大 ("big") and 分 ("part"). But together it can be read in more than one way:
  • Ōita (おおいた), the prefecture

  • daibu (だいぶ), meaning "quite a bit"

A Japanese reader doesn't pause and decode the kanji from scratch. They lean on context without thinking about it. Geography? Ōita. Degree or extent? daibu.
I'm not sure my student cared about Ōita, but the example stuck with me because it's the same kind of problem I run into constantly as a learner.

Some words have two plausible readings. Some have more. Even a single character can split into a handful of options. 明, for example, can be read as mei, myō, min, or akira, depending on where it appears and what the sentence is doing. For a native reader the correct choice often feels obvious. For a learner, or for a computer, it might not be.
And even if you pick the "right" base reading, Japanese pronunciation doesn't always stay put when words combine.

There are a few common sound changes that native speakers apply automatically:

  • Rendaku (連濁) (voicing): 花 (hana, "flower") + 火 (hi, "fire") → 花火 (hanabi, "fireworks")

  • Renjō (連声) (sound linking): 観 (kan, "to observe") + 音 (on, "sound") → 観音 (kannon, the bodhisattva of compassion)

  • Sokuonbin (促音便) (a small っ / doubled consonant): 学校 (gakkō, "school"), often explained as gakukō → gakkō

A dictionary can tell you 火 can be read hi (ひ), and it can list 花火 as hanabi (はなび). But the moment you meet a compound that isn't explicitly in a dictionary, you're back to something fuzzier: patterns, probability, context.
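
To make that concrete, here's a toy rendaku heuristic in Python: voice the first kana of the second element and hope for the best. The mapping table and the examples are mine, purely for illustration; it happens to get 花火 right and immediately overfires on 春風 (harukaze), which resists voicing.

```python
# A toy rendaku heuristic: voice the first kana of the second element
# when two readings are joined. Real rendaku has many exceptions
# (Lyman's law, loanwords, compounds that simply resist voicing).

VOICING = {
    "か": "が", "き": "ぎ", "く": "ぐ", "け": "げ", "こ": "ご",
    "さ": "ざ", "し": "じ", "す": "ず", "せ": "ぜ", "そ": "ぞ",
    "た": "だ", "ち": "ぢ", "つ": "づ", "て": "で", "と": "ど",
    "は": "ば", "ひ": "び", "ふ": "ぶ", "へ": "べ", "ほ": "ぼ",
}

def naive_compound_reading(first: str, second: str) -> str:
    """Join two readings, voicing the first kana of the second element."""
    head, rest = second[0], second[1:]
    return first + VOICING.get(head, head) + rest

print(naive_compound_reading("はな", "ひ"))    # はなび -- matches 花火
print(naive_compound_reading("はる", "かぜ"))  # はるがぜ -- wrong: 春風 is はるかぜ
```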

Readings drift over time

The other part of this is that readings aren't fixed.
Language moves. Pronunciations shift. Classical texts pull in older forms that don't match modern habits. And internet writing keeps inventing new conventions faster than any printed reference can follow.
At some point I started thinking of furigana as a kind of agreement. Not random, but not "encoded" either. We write a reading because people, collectively, have decided that's how it's read, at least for now.

You can see that agreement forming in slang:

  • 草 (kusa): "lol," because www (laughing) looks like grass

  • 尊い (tōtoi): "too cute / too precious"

  • 萎える (naeru): "to lose steam / get deflated" (motivation drops)

These weren't in dictionaries when they started spreading. They became "real" the same way most language does: usage first, and consensus later.
Why I cared enough to build something
I kept hitting the same wall when I tried to read actual books.

I'd be moving through a page, hit a word I didn't recognize, and have to stop. Copy it. Paste it into a dictionary. Scroll through multiple readings. Try to figure out which one made sense in that sentence. Then do it again three lines later.

One paragraph could easily turn into ten minutes of lookups. I wasn't reading anymore. I was bouncing between the text and a search box.

I didn't want a possible reading. I wanted the reading that made sense here, right now.

Until computers understand language the way people do, furigana is still one of the best bridges between written Japanese and comprehension. I wanted a bridge that held up a little better in messy, real-world text: basically, a furigana converter that could survive the stuff I actually read.

Dictionaries help, until they don't

Most furigana systems start with huge dictionaries. They collect common words and compounds and store their expected readings. With enough coverage, you can get pretty far.

But dictionaries don't solve the core ambiguity problem. When 明 appears on its own, is it the Ming dynasty? A given name like Akira? Something else? A person figures it out by glancing at what's around it. A program has to learn how to do that explicitly.
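
A minimal sketch of that gap, with a toy in-memory dictionary standing in for a real one (the entries here are illustrative): the lookup hands back the same candidate list no matter what sentence you're in, and the hard part is left entirely to the caller.

```python
# A toy stand-in for a dictionary lookup. The entries are illustrative.
READINGS = {
    "明": ["めい", "みょう", "みん", "あきら"],  # mei, myō, min, akira
}

def lookup(surface: str) -> list[str]:
    return READINGS.get(surface, [])

# Same candidate list regardless of context -- choosing is someone else's job.
print(lookup("明"))
```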
Rules didn't scale for me

My first instinct was to write rules: look for historical terms or dates and bias toward min; look for person-ish clues and bias toward akira. It works in a few handpicked cases, and then it starts to feel like trying to hand-write the entire language. It's slow to extend, easy to break, and it fails as soon as the text gets novel.
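
For a sense of what I mean, this is roughly the shape those rules took. The cue lists are invented for illustration, and every new text surfaces a context they don't cover.

```python
# Hand-written disambiguation rules for 明, roughly the shape mine took.
# The cue lists are made up for illustration.

HISTORY_CUES = {"王朝", "時代", "清", "皇帝"}    # dynasty-ish words nearby
PERSON_CUES  = {"さん", "氏", "君", "は言った"}  # person-ish words nearby

def guess_reading_for_mei(context: str) -> str:
    if any(cue in context for cue in HISTORY_CUES):
        return "みん"    # Ming dynasty
    if any(cue in context for cue in PERSON_CUES):
        return "あきら"  # given name Akira
    return "めい"        # shrug: fall back to a common on'yomi

print(guess_reading_for_mei("明の時代の陶磁器"))      # みん
print(guess_reading_for_mei("明さんは学校へ行った"))  # あきら
```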
So I moved to machine learning.

Why I didn't reach for a neural network

I considered neural networks because that's what most modern language systems lean on. They're powerful and flexible, and they can pick up subtle grammatical signals.

But I wasn't trying to build a general-purpose language model. I was trying to build a furigana converter that's fast and light enough to be practical. For that, a simpler model made more sense.
I ended up using logistic regression. It's not glamorous, but it's quick, easy to reason about, and surprisingly strong when the features are chosen carefully.
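
As a sketch of the kind of setup I mean, here's a context-based reading classifier built with scikit-learn, with character n-grams from the sentence standing in for real features. The training sentences are placeholders and far too few to trust; the point is the shape, not the numbers.

```python
# A sketch of context-based reading classification with logistic
# regression. Character n-grams stand in for real features
# (neighbouring particles, POS tags, position, and so on).
# The training set is a placeholder -- far too small to be reliable.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = [
    "大分県に住んでいます",    # Ōita, the prefecture
    "大分市の天気は晴れです",  # Ōita again
    "大分よくなりました",      # daibu, "quite a bit"
    "大分時間がかかった",      # daibu again
]
train_labels = ["おおいた", "おおいた", "だいぶ", "だいぶ"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_sentences, train_labels)

# Correct answers: おおいた for the station, だいぶ for "got quite used to it".
print(model.predict(["大分駅に着いた", "大分慣れてきた"]))
```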

Data, and the ways it goes wrong

Once you decide to learn from context, you inherit a data problem.

The text you train on needs range: novels, news, legal writing, casual posts, dialogue. Too broad and the model learns mush; too narrow and it breaks outside its comfort zone. And real-world text is messy (emoji, numbers, odd punctuation, inconsistent formatting), so you often end up building a cleaning pipeline before you can even start training.
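
The cleaning step isn't glamorous either. A rough sketch of the kind of normalization involved, with made-up rules that only gesture at a real pipeline:

```python
# A rough sketch of pre-training cleanup: normalize widths, strip emoji,
# collapse whitespace. A real pipeline has many more rules.

import re
import unicodedata

def clean_line(line: str) -> str:
    line = unicodedata.normalize("NFKC", line)           # full/half-width, odd forms
    line = re.sub(r"[\U0001F300-\U0001FAFF]", "", line)  # most emoji blocks
    line = re.sub(r"\s+", " ", line).strip()             # collapse whitespace
    return line

print(clean_line("今日は　いい天気です🎉  "))  # -> 今日は いい天気です
```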
The part I didn't appreciate at first is how specific the failure modes can be.

A good example is 紅葉. It can be read as もみじ (momiji) or こうよう (kōyō). The meanings overlap: momiji is usually "maple / red leaves," and kōyō is the broader "autumn foliage" idea. Close enough that you don't always get a clean context signal, which means you have to be careful with your training data.

While I was building the dataset, I found out a bunch of it was basically contaminated by a novel that had a character written as 紅葉(もみじ). Once those examples got mixed in, a lot of otherwise useful sentences stopped being usable. The model started learning weird person-ish context around 紅葉, and then it would reach for name-like cues even in ordinary sentences about trees.

Nothing was "broken," exactly. The outputs just slowly started getting stranger until I noticed the pattern. After that I spent an unglamorous amount of time filtering, re-checking, and rebuilding the examples so the model wasn't training on accidental fiction trivia.
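
A toy version of the kind of filter that came out of it, with invented cue strings; the real checks were broader, but the idea was the same: drop sentences where 紅葉 behaves like a person.

```python
# A toy filter: drop training sentences where 紅葉 looks like a person
# rather than foliage. The cue strings are invented examples.

NAME_MARKERS = ("紅葉さん", "紅葉ちゃん", "紅葉は言", "紅葉が笑")

def looks_like_person(sentence: str) -> bool:
    return any(marker in sentence for marker in NAME_MARKERS)

sentences = [
    "庭の紅葉が見頃を迎えた",    # foliage -- keep
    "紅葉さんは静かに微笑んだ",  # character name -- drop
]
kept = [s for s in sentences if not looks_like_person(s)]
print(kept)
```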

Even after cleaning, there's a practical question that never fully goes away: when do you trust a dictionary lookup, and when do you fall back to the statistical model?

A patchwork, like most things

To deal with names, I added another component that tries to tag spans of text that look like personal names. Once something is tagged as a name, I hand it off to different logic for resolving the reading.
At that point the whole project started to look like what most software looks like in the end: not one elegant idea, but a bunch of smaller pieces that cover each other's weak spots.

In practice, what I do is pretty simple.
I try the dictionary approach first. If that doesn't settle it, I look at nearby particles, prefixes, and suffixes that tend to give the reading away. If it smells like a name, I route it through the name path. And if it's one of the characters I kept getting wrong over and over, that's when I let the model make the call.
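
Spelled out, the routing looks something like this. Every helper is a stub standing in for a component described above; only the order of the checks is the point.

```python
# The routing order, spelled out. Each helper is a stub standing in for a
# real component; only the order of the checks matters here.

def dictionary_lookup(word):                   # returns a reading only if unambiguous
    return {"花火": "はなび"}.get(word)

def reading_from_local_cues(word, context):    # particles, prefixes, suffixes
    if word == "大分" and context.endswith("県"):
        return "おおいた"
    return None

def looks_like_name(word, context):            # the person-name tagger
    return "さん" in context

def resolve_name_reading(word, context):
    return "(name path)"

def statistical_model(word, context):          # the logistic-regression fallback
    return "(model's best guess)"

def resolve_reading(word: str, context: str) -> str:
    reading = dictionary_lookup(word)
    if reading is not None:
        return reading
    reading = reading_from_local_cues(word, context)
    if reading is not None:
        return reading
    if looks_like_name(word, context):
        return resolve_name_reading(word, context)
    return statistical_model(word, context)

print(resolve_reading("花火", "夏の花火を見た"))  # dictionary wins
print(resolve_reading("大分", "大分県"))          # local cue wins
```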

Training and checking

I iterated over a lot of Japanese text, mostly Aozora Bunko (an online repository of public-domain books) and Wikipedia. I'd run the algorithm, use an LLM to flag outputs that looked suspicious, then manually verify and correct those cases. Cross-referencing open-source dictionaries helped me build a base dictionary set with massive coverage.

I did track accuracy, but mostly as a sanity check for myself, not as a formal benchmark. Over time it climbed into the mid-90s on the test set I kept around. Adding the name handling pushed it into the high-90s. It still makes mistakes, and I still find edge cases, but it crossed the line where I could read with it without babysitting it.
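
The check itself was nothing fancier than comparing predictions against hand-verified pairs, roughly like this (the data and predict() here are placeholders, not my actual test set):

```python
# What the accuracy check boils down to: compare predicted readings
# against hand-verified pairs. The data and predict() are placeholders.

test_set = [
    ("大分県に行った", "大分", "おおいた"),
    ("大分疲れた", "大分", "だいぶ"),
    ("辛いカレーを食べた", "辛い", "からい"),
    ("この仕事は辛い", "辛い", "つらい"),
]

def predict(sentence: str, word: str) -> str:
    return "おおいた"  # stand-in for the real converter

correct = sum(predict(s, w) == gold for s, w, gold in test_set)
print(f"accuracy: {correct}/{len(test_set)} = {correct / len(test_set):.0%}")
```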

This started as a small personal furigana converter and slowly turned into something I kept reaching for, so I put it up here: www.ezfurigana.com.
The moment it started to feel real

I fed it a couple of sentences with 辛い, a word that can be read as either karai ("spicy") or tsurai ("painful / difficult") depending on context.
Something like 辛いカレー (spicy curry) versus この仕事は辛い (this job is tough).

Then I tried 大分 in a few different shapes too: prefecture, adverb, the usual ambiguity.
It got them right.

Not just once, but consistently. Different sentences, different contexts, and it kept landing on the correct reading. That's when I stopped thinking of it as a project and started thinking of it as a tool I could actually use.

Then I ran it on a full ebook. It generated furigana for the entire thing and kept the formatting intact. I opened it on my reader and just… started reading.
That was the moment I thought: okay. I can actually use this now.
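
For anyone curious about the mechanics: furigana in HTML-based formats like EPUB are carried by ruby markup, so "keeping the formatting intact" mostly means wrapping each word without disturbing the markup around it. A minimal illustration of what that looks like, not the converter's actual output:

```python
# Furigana in HTML/EPUB are expressed with <ruby> markup. A minimal
# illustration of annotated output -- not the converter's actual format.

def annotate(word: str, reading: str) -> str:
    return f"<ruby>{word}<rt>{reading}</rt></ruby>"

print(annotate("花火", "はなび"))
# <ruby>花火<rt>はなび</rt></ruby> -> renders as 花火 with はなび above it
```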
