Building a Furigana HTTP Service in Python: Why It's Not Just a Lookup Table
A small FastAPI service that takes Japanese text and returns it with hiragana reading annotations. The interesting part isn't the HTTP plumbing; it's why generating furigana correctly needs a morphological analyzer in the first place.
📦 GitHub: https://github.com/sen-ltd/furigana-api
If you've ever built anything in the Japanese-learning space (a kids' reading app, an e-reader for graded content, a language-school CMS, a vocabulary notebook), you've hit this problem. You have a blob of Japanese text. You need to show it with the reading hints (furigana) floating above the kanji, like this:
東京[とうきょう]タワーに行[い]こう
And you need to do it for arbitrary user-submitted text, not just a pre-annotated corpus. The pitch for this project was simple: the morphological-analysis step that makes this possible is heavy to embed in every app, so ship it as a small HTTP service with one clear endpoint. Let me walk through why the problem isn't as simple as "look up each kanji in a dictionary," and what the final shape of the service ended up looking like.
The naive approach and why it fails
The first thing anyone writes when they encounter this problem is a dict:
```python
READINGS = {
    "東": "ひがし",
    "京": "きょう",
    "行": "いく",
    # ...
}
```
This fails immediately. Three reasons.
1. Multi-character compounds. 東京 is not ひがし + きょう. It's とうきょう. The reading of a compound is not the concatenation of the per-character readings, because Japanese pronunciations are context-sensitive. On-yomi (Chinese-derived readings) and kun-yomi (native Japanese readings) apply differently depending on whether a kanji stands alone, appears in a compound, or is used as a verb stem.
2. Homographs. Look at these two:
- 生きる: ikiru, "to live"
- 生まれる: umareru, "to be born"
Same 生. Different reading. The only way to pick the right one is to know what morpheme the kanji is participating in, which means morphological analysis, not character lookup. And this isn't a rare edge case. 生, 行, 言, 上, 下, 出, 入, 上がる: a huge fraction of everyday kanji have multiple readings gated on morphological context.
3. Inflection. 学ぶ (manabu, "to learn") inflects to 学んだ, 学べば, 学ばない, ... You cannot annotate 学 alone because the "word" is really the verb, and the verb determines the reading. You need a tokenizer that knows where word boundaries are.
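The compound failure is easy to reproduce: concatenating per-character dictionary readings gives the wrong answer. A tiny sketch, using the hypothetical table entries from the dict above:

```python
# Hypothetical per-character reading table, as in the naive approach.
READINGS = {"東": "ひがし", "京": "きょう", "行": "いく"}

def naive_reading(text: str) -> str:
    # Concatenate per-character readings; pass unknown characters through.
    return "".join(READINGS.get(ch, ch) for ch in text)

print(naive_reading("東京"))  # ひがしきょう: wrong, the compound reads とうきょう
```

This is exactly why a tokenizer has to see the whole compound before choosing any readings.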
So you need a morphological analyzer that (a) segments the text into morphemes, (b) looks up each morpheme's reading in its dictionary, and (c) picks the right reading based on context. This is the thing MeCab does. And fugashi. And Kuromoji. And Sudachi.
Choosing Sudachi
The Japanese NLP landscape is small enough to enumerate:
- MeCab: the classic, from 2005. Written in C++. Needs a system-level dictionary (IPAdic, UniDic). Fast, well-understood, but the install experience is painful because the Python bindings expect the C library to exist.
- fugashi: a lovely MeCab wrapper for Python by Paul McCann. Pulls in unidic-lite (~250 MB) or full UniDic (much more). Excellent quality; the disk footprint is the only knock.
- Janome: pure-Python, no system deps. Easy to install but slower than MeCab, and the dictionary is older.
- Kuromoji: the JVM one. Great if you're in the Java/Scala world.
- Sudachi / SudachiPy: developed by Works Applications, actively maintained since 2017. Ships as a pure pip install, splits cleanly into sudachipy (the analyzer) + sudachidict-{small,core,full} (the data). Handles compound splitting at three levels: A (shortest), B (middle), C (longest natural unit).
I picked Sudachi for this project for three reasons: (1) the pip install sudachipy sudachidict-small experience is bulletproof on Alpine Linux with no system dependencies, (2) the split-mode feature is exactly right for furigana, because mode C keeps compounds like 東京都 together, which reads better than 東京 + 都 as annotated text, and (3) it's the analyzer that's seeing the most ongoing research maintenance in the Japanese NLP world right now.
The analyzer, in ~30 lines
Here's the entire analyzer wrapper, lightly edited for the post:
```python
from sudachipy import Dictionary, Tokenizer

# AnalyzedSegment, _contains_kanji, and katakana_to_hiragana are
# defined elsewhere in the repo.


class FuriganaAnalyzer:
    def __init__(self) -> None:
        # "small" dict: ~120 MB on disk, covers standard modern vocabulary.
        # SplitMode.C = longest natural chunks; 東京都 stays whole.
        self._dict = Dictionary(dict="small")
        self._tokenizer = self._dict.create()
        self._split_mode = Tokenizer.SplitMode.C

    def analyze(self, text: str) -> list[AnalyzedSegment]:
        if not text:
            return []
        morphemes = self._tokenizer.tokenize(text, self._split_mode)
        segments = []
        for m in morphemes:
            surface = m.surface()
            pos = m.part_of_speech()[0]
            if _contains_kanji(surface):
                reading_kata = m.reading_form()
                reading = katakana_to_hiragana(reading_kata) if reading_kata else None
            else:
                reading = None
            segments.append(AnalyzedSegment(surface, reading, pos))
        return segments
```
Three things to notice.
First, the _contains_kanji gate. A segment only needs a reading annotation if it contains kanji. Kana-only segments (タワー, です, の) are their own reading, by definition. Punctuation, Latin, and digits don't need annotation either. Skipping non-kanji segments keeps the output clean: no <ruby>タワー<rt>タワー</rt></ruby> noise.
Second, the katakana-to-hiragana conversion. Sudachi returns readings in katakana because that's the classical NLP convention: katakana is what appeared in the original dictionary files. But for furigana in a learning or reading context, you want hiragana. Every Japanese elementary school textbook uses hiragana for furigana, and every user expectation follows from that. So we convert.
The conversion that's just subtraction
The katakana block in Unicode starts at U+30A1 (ァ) and the hiragana block starts at U+3041 (ぁ). They're laid out in the same order (ぁ, あ, ぃ, い, ぅ, う, ...), so the conversion is a one-line codepoint shift:
```python
_KATAKANA_START = 0x30A1  # ァ
_KATAKANA_END = 0x30F6    # ヶ
_HIRAGANA_OFFSET = 0x60   # katakana codepoint - hiragana codepoint


def katakana_to_hiragana(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if _KATAKANA_START <= cp <= _KATAKANA_END:
            out.append(chr(cp - _HIRAGANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```
No lookup table. No library call. Just arithmetic. And it handles every small kana (ッ ャ ュ ョ), the ヴ/ゔ pair, everything. The long-vowel mark ー falls outside the [30A1..30F6] range (it's at 30FC), so it's preserved as-is, which is correct, because コーヒー → こーひー is the conventional rendering.
This is one of my favourite "why is this so clean" moments in Japanese text processing. Two Unicode blocks, laid out in parallel, make the conversion a subtraction.
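The same shift can also be phrased with str.translate, which accepts a dict keyed by codepoint. This is just an equivalent sketch, not the service's actual code:

```python
# Map each katakana codepoint in U+30A1..U+30F6 down by 0x60 to its
# hiragana twin; ー (U+30FC) is outside the range and passes through.
_KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def katakana_to_hiragana(text: str) -> str:
    return text.translate(_KATA_TO_HIRA)

print(katakana_to_hiragana("コーヒー"))  # こーひー
```

translate builds the mapping once and walks the string in C, which matters only if you're converting megabytes of readings; for a per-request service the loop version reads just as well.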
HTML ruby: the tag that actually exists
Now we have segments with readings. To render them as visible furigana, we use HTML's <ruby> element, which every modern browser supports. The canonical form is:
<ruby>東京<rt>とうきょう</rt></ruby>
<rt> is "ruby text," the annotation that floats above the base. There's also <rp>, "ruby parenthesis," which is fallback content shown only in clients that don't support <ruby> rendering. You use it like this:
<ruby>東京<rp>(</rp><rt>とうきょう</rt><rp>)</rp></ruby>
In a modern browser, you see 東京 with とうきょう floating above it. In a primitive client (old RSS reader, plain-text dump, e-ink device), you see 東京(とうきょう): the parentheses appear instead of being hidden. This is a graceful degradation story that HTML gets right.
The renderer is about eight lines:
```python
from html import escape as _html_escape


def render_html(segments) -> str:
    out = []
    for seg in segments:
        surface = _html_escape(seg.surface)
        if seg.reading:
            reading = _html_escape(seg.reading)
            out.append(f"<ruby>{surface}<rp>(</rp><rt>{reading}</rt><rp>)</rp></ruby>")
        else:
            out.append(surface)
    return "".join(out)
```
And there's a parallel render_markdown() that emits 漢字[かんじ], the de facto convention in JP Markdown notes and Kindle-style ebook sources, because CommonMark has no standard ruby syntax.
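render_markdown() isn't reproduced in the post. A sketch under the same segment shape, with a namedtuple standing in for the repo's AnalyzedSegment type (an assumption), would look like:

```python
from collections import namedtuple

# Hypothetical stand-in for the service's AnalyzedSegment type.
Segment = namedtuple("Segment", ["surface", "reading", "pos"])

def render_markdown(segments) -> str:
    # Emit surface[reading] for annotated segments, plain surface otherwise.
    return "".join(
        f"{seg.surface}[{seg.reading}]" if seg.reading else seg.surface
        for seg in segments
    )

segs = [Segment("東京", "とうきょう", "名詞"), Segment("に", None, "助詞")]
print(render_markdown(segs))  # 東京[とうきょう]に
```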
Tradeoffs I'm not going to pretend I solved
A production-grade furigana service has to deal with things this project acknowledges but doesn't fully solve:
- Homographs in sparse contexts. 今日は by itself: is it きょう ("today") or こんにち ("these days")? Sudachi picks one based on frequency + context, and usually gets it right, but for a single two-word fragment there just isn't enough context to be sure. The right fix for a production app is to let the caller override readings: {"text": "今日は", "overrides": {"今日": "きょう"}}.
- Proper nouns. 佐藤さん is fine; 髙橋さん (note the unusual variant of 高) is a problem because it's not in the small dict. Core + Full fix a lot of this but at image-size cost.
- On vs. kun readings for context-lite input. Same problem as homographs, a subset of it.
- Modern slang and new words. エモい, バズる, and the Reiwa-era additions are covered by the core/full dicts better than small. Small is a tradeoff for image footprint.
I'm writing these down so readers know what the service doesn't promise: it's not a one-shot oracle, it's a reasonable baseline that covers the 90% case for learning and reading apps.
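The caller-override idea from the first bullet can be sketched as a post-processing pass over the analyzed segments (hypothetical names throughout; the repo may handle this differently, again with a namedtuple standing in for the segment type):

```python
from collections import namedtuple

# Hypothetical stand-in for the service's segment type.
Segment = namedtuple("Segment", ["surface", "reading", "pos"])

def apply_overrides(segments, overrides):
    # Wherever the caller pinned a reading for a surface form, prefer it
    # over the analyzer's guess, e.g. overrides={"今日": "きょう"}.
    return [
        seg._replace(reading=overrides.get(seg.surface, seg.reading))
        for seg in segments
    ]

segs = [Segment("今日", "こんにち", "名詞"), Segment("は", None, "助詞")]
print(apply_overrides(segs, {"今日": "きょう"})[0].reading)  # きょう
```

Keeping the override step separate from the analyzer means the analyzer stays stateless and the API contract stays simple.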
Try it in 30 seconds
```shell
docker build -t furigana-api https://github.com/sen-ltd/furigana-api.git
docker run --rm -p 8000:8000 furigana-api

curl -sS -G "http://localhost:8000/furigana" \
  --data-urlencode "text=東京タワーに行こう" | jq

curl -sS -X POST http://localhost:8000/ruby \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{"text": "日本語を学ぶ", "format": "html"}'
```
The image is around 200 MB, larger than I'd hoped, because sudachidict-small turns out to actually be ~120 MB on disk despite the "small" name (the other dicts are 360 MB and 700 MB, so "small" is still the smallest). If anyone knows a way to prune a Sudachi dictionary at build time to shrink it further, I'd love to hear it.
Built as part of a side-project sprint building 100+ public repositories for SEN 合同会社. MIT licensed. Issues and PRs welcome.
