Building a Furigana HTTP Service in Python: Why It's Not Just a Lookup Table
A small FastAPI service that takes Japanese text and returns it with hiragana reading annotations. The interesting part isn't the HTTP plumbing; it's why generating furigana correctly needs a morphological analyzer in the first place.
📦 GitHub: https://github.com/sen-ltd/furigana-api
If you've ever built anything in the Japanese-learning space (a kids' reading app, an e-reader for graded content, a language-school CMS, a vocabulary notebook), you've hit this problem. You have a blob of Japanese text. You need to show it with the reading hints (furigana) floating above the kanji, like this:
東京[とうきょう]タワーに行[い]こう
And you need to do it for arbitrary user-submitted text, not just a pre-annotated corpus. The pitch for this project was simple: the morphological-analysis step that makes this possible is heavy to embed in every app, so ship it as a small HTTP service with one clear endpoint. Let me walk through why the problem isn't as simple as "look up each kanji in a dictionary," and what the final shape of the service ended up looking like.
The naive approach and why it fails
The first thing anyone writes when they encounter this problem is a dict:
```python
READINGS = {
    "東": "ひがし",
    "京": "きょう",
    "行": "いく",
    # ...
}
```
This fails immediately. Three reasons.
1. Multi-character compounds. 東京 is not ひがし + きょう. It's とうきょう. The reading of a compound is not the concatenation of the per-character readings, because Japanese pronunciations are context-sensitive. On-yomi (Chinese-derived readings) and kun-yomi (native Japanese readings) apply differently depending on whether a kanji stands alone, appears in a compound, or is used as a verb stem.
2. Homographs. Look at these two:
- 生きる: ikiru, "to live"
- 生まれる: umareru, "to be born"
Same 生. Different reading. The only way to pick the right one is to know what morpheme the kanji is participating in, which means morphological analysis, not character lookup. And this isn't a rare edge case. 生, 行, 言, 上, 下, 出, 入, 上がる: a huge fraction of everyday kanji have multiple readings gated on morphological context.
3. Inflection. 学ぶ (manabu, "to learn") inflects to 学んだ, 学べば, 学ばない, ... You cannot annotate 学 alone because the "word" is really the verb, and the verb determines the reading. You need a tokenizer that knows where word boundaries are.
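The compound failure is easy to reproduce: concatenating per-character dictionary readings gives the wrong answer. A tiny sketch, using the hypothetical table entries from the dict above:

```python
# Hypothetical per-character reading table, as in the naive approach.
READINGS = {"東": "ひがし", "京": "きょう", "行": "いく"}

def naive_reading(text: str) -> str:
    # Concatenate per-character readings; pass unknown characters through.
    return "".join(READINGS.get(ch, ch) for ch in text)

print(naive_reading("東京"))  # ひがしきょう: wrong, the compound reads とうきょう
```

This is exactly why a tokenizer has to see the whole compound before choosing any readings.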
So you need a morphological analyzer that (a) segments the text into morphemes, (b) looks up each morpheme's reading in its dictionary, and (c) picks the right reading based on context. This is the thing MeCab does. And fugashi. And Kuromoji. And Sudachi.
Choosing Sudachi
The Japanese NLP landscape is small enough to enumerate:
- MeCab: the classic, from 2005. Written in C++. Needs a system-level dictionary (IPAdic, UniDic). Fast, well-understood, but the install experience is painful because the Python bindings expect the C library to exist.
- fugashi: a lovely MeCab wrapper for Python by Paul McCann. Pulls in unidic-lite (~250 MB) or full UniDic (much more). Excellent quality; the disk footprint is the only knock.
- Janome: pure-Python, no system deps. Easy to install but slower than MeCab, and the dictionary is older.
- Kuromoji: the JVM one. Great if you're in the Java/Scala world.
- Sudachi / SudachiPy: developed by Works Applications, actively maintained since 2017. Ships as a pure pip install, splits cleanly into sudachipy (the analyzer) + sudachidict-{small,core,full} (the data). Handles compound splitting at three levels: A (shortest), B (middle), C (longest natural unit).
I picked Sudachi for this project for three reasons: (1) the pip install sudachipy sudachidict-small experience is bulletproof on Alpine Linux with no system dependencies, (2) the split-mode feature is exactly right for furigana, because mode C keeps compounds like 東京都 together, which reads better than 東京 + 都 as annotated text, and (3) it's the analyzer that's seeing the most ongoing research maintenance in the Japanese NLP world right now.
The analyzer, in ~30 lines
Here's the entire analyzer wrapper, lightly edited for the post:
```python
from sudachipy import Dictionary, Tokenizer

# AnalyzedSegment, _contains_kanji, and katakana_to_hiragana are
# defined elsewhere in the repo.


class FuriganaAnalyzer:
    def __init__(self) -> None:
        # "small" dict: ~120 MB on disk, covers standard modern vocabulary.
        # SplitMode.C = longest natural chunks; 東京都 stays whole.
        self._dict = Dictionary(dict="small")
        self._tokenizer = self._dict.create()
        self._split_mode = Tokenizer.SplitMode.C

    def analyze(self, text: str) -> list[AnalyzedSegment]:
        if not text:
            return []
        morphemes = self._tokenizer.tokenize(text, self._split_mode)
        segments = []
        for m in morphemes:
            surface = m.surface()
            pos = m.part_of_speech()[0]
            if _contains_kanji(surface):
                reading_kata = m.reading_form()
                reading = katakana_to_hiragana(reading_kata) if reading_kata else None
            else:
                reading = None
            segments.append(AnalyzedSegment(surface, reading, pos))
        return segments
```
Three things to notice.
First, the _contains_kanji gate. A segment only needs a reading annotation if it contains kanji. Kana-only segments (タワー, です, の) are their own reading, by definition. Punctuation, Latin, and digits don't need annotation either. Skipping non-kanji segments keeps the output clean: no <ruby>タワー<rt>タワー</rt></ruby> noise.
Second, the katakana-to-hiragana conversion. Sudachi returns readings in katakana because that's the classical NLP convention: katakana is what appeared in the original dictionary files. But for furigana in a learning or reading context, you want hiragana. Every Japanese elementary school textbook uses hiragana for furigana, and every user expectation follows from that. So we convert.
The conversion that's just subtraction
The katakana block in Unicode starts at U+30A1 (ァ) and the hiragana block starts at U+3041 (ぁ). They're laid out in the same order (ぁ, あ, ぃ, い, ぅ, う, ...), so the conversion is a one-line codepoint shift:
```python
_KATAKANA_START = 0x30A1  # ァ
_KATAKANA_END = 0x30F6    # ヶ
_HIRAGANA_OFFSET = 0x60   # katakana codepoint - hiragana codepoint


def katakana_to_hiragana(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if _KATAKANA_START <= cp <= _KATAKANA_END:
            out.append(chr(cp - _HIRAGANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```
No lookup table. No library call. Just arithmetic. And it handles every small kana (ッ ャ ュ ョ), the ヴ/ゔ pair, everything. The long-vowel mark ー falls outside the [30A1..30F6] range (it's at 30FC), so it's preserved as-is, which is correct, because コーヒー → こーひー is the conventional rendering.
This is one of my favourite "why is this so clean" moments in Japanese text processing. Two Unicode blocks, laid out in parallel, make the conversion a subtraction.
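The same shift can also be phrased with str.translate, which accepts a dict keyed by codepoint. This is just an equivalent sketch, not the service's actual code:

```python
# Map each katakana codepoint in U+30A1..U+30F6 down by 0x60 to its
# hiragana twin; ー (U+30FC) is outside the range and passes through.
_KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def katakana_to_hiragana(text: str) -> str:
    return text.translate(_KATA_TO_HIRA)

print(katakana_to_hiragana("コーヒー"))  # こーひー
```

translate builds the mapping once and walks the string in C, which matters only if you're converting megabytes of readings; for a per-request service the loop version reads just as well.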
HTML ruby: the tag that actually exists
Now we have segments with readings. To render them as visible furigana, we use HTML's <ruby> element, which every modern browser supports. The canonical form is:
<ruby>東京<rt>とうきょう</rt></ruby>
<rt> is "ruby text," the annotation that floats above the base. There's also <rp>, "ruby parenthesis," which is fallback content shown only in clients that don't support <ruby> rendering. You use it like this:
<ruby>東京<rp>(</rp><rt>とうきょう</rt><rp>)</rp></ruby>
In a modern browser, you see 東京 with とうきょう floating above it. In a primitive client (old RSS reader, plain-text dump, e-ink device), you see 東京(とうきょう): the parentheses appear instead of being hidden. This is a graceful degradation story that HTML gets right.
The renderer is about eight lines:
```python
from html import escape as _html_escape


def render_html(segments) -> str:
    out = []
    for seg in segments:
        surface = _html_escape(seg.surface)
        if seg.reading:
            reading = _html_escape(seg.reading)
            out.append(f"<ruby>{surface}<rp>(</rp><rt>{reading}</rt><rp>)</rp></ruby>")
        else:
            out.append(surface)
    return "".join(out)
```
And there's a parallel render_markdown() that emits 漢字[かんじ], the de facto convention in JP Markdown notes and Kindle-style ebook sources, because CommonMark has no standard ruby syntax.
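render_markdown() isn't reproduced in the post. A sketch under the same segment shape, with a namedtuple standing in for the repo's AnalyzedSegment type (an assumption), would look like:

```python
from collections import namedtuple

# Hypothetical stand-in for the service's AnalyzedSegment type.
Segment = namedtuple("Segment", ["surface", "reading", "pos"])

def render_markdown(segments) -> str:
    # Emit surface[reading] for annotated segments, plain surface otherwise.
    return "".join(
        f"{seg.surface}[{seg.reading}]" if seg.reading else seg.surface
        for seg in segments
    )

segs = [Segment("東京", "とうきょう", "名詞"), Segment("に", None, "助詞")]
print(render_markdown(segs))  # 東京[とうきょう]に
```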
Tradeoffs I'm not going to pretend I solved
A production-grade furigana service has to deal with things this project acknowledges but doesn't fully solve:
- Homographs in sparse contexts. 今日は by itself: is it きょう ("today") or こんにち ("these days")? Sudachi picks one based on frequency + context, and usually gets it right, but for a single two-word fragment there just isn't enough context to be sure. The right fix for a production app is to let the caller override readings: {"text": "今日は", "overrides": {"今日": "きょう"}}.
- Proper nouns. 佐藤さん is fine; 髙橋さん (note the unusual variant of 高) is a problem because it's not in the small dict. Core + Full fix a lot of this but at image-size cost.
- On vs. kun readings for context-lite input. Same problem as homographs, a subset of it.
- Modern slang and new words. エモい, バズる, and the Reiwa-era additions are covered by the core/full dicts better than small. Small is a tradeoff for image footprint.
I'm writing these down so readers know what the service doesn't promise: it's not a one-shot oracle, it's a reasonable baseline that covers the 90% case for learning and reading apps.
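The caller-override idea from the first bullet can be sketched as a post-processing pass over the analyzed segments (hypothetical names throughout; the repo may handle this differently, again with a namedtuple standing in for the segment type):

```python
from collections import namedtuple

# Hypothetical stand-in for the service's segment type.
Segment = namedtuple("Segment", ["surface", "reading", "pos"])

def apply_overrides(segments, overrides):
    # Wherever the caller pinned a reading for a surface form, prefer it
    # over the analyzer's guess, e.g. overrides={"今日": "きょう"}.
    return [
        seg._replace(reading=overrides.get(seg.surface, seg.reading))
        for seg in segments
    ]

segs = [Segment("今日", "こんにち", "名詞"), Segment("は", None, "助詞")]
print(apply_overrides(segs, {"今日": "きょう"})[0].reading)  # きょう
```

Keeping the override step separate from the analyzer means the analyzer stays stateless and the API contract stays simple.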
Try it in 30 seconds
```shell
docker build -t furigana-api https://github.com/sen-ltd/furigana-api.git
docker run --rm -p 8000:8000 furigana-api

curl -sS -G "http://localhost:8000/furigana" \
  --data-urlencode "text=東京タワーに行こう" | jq

curl -sS -X POST http://localhost:8000/ruby \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{"text": "日本語を学ぶ", "format": "html"}'
```
The image is around 200 MB, larger than I'd hoped, because sudachidict-small turns out to actually be ~120 MB on disk despite the "small" name (the other dicts are 360 MB and 700 MB, so "small" is still the smallest). If anyone knows a way to prune a Sudachi dictionary at build time to shrink it further, I'd love to hear it.
Built as part of a side-project sprint building 100+ public repositories for SEN 合同会社. MIT licensed. Issues and PRs welcome.
