DEV Community

SEN LLC

Japanese Width Conversion Without Regex: Full↔Half ASCII, Katakana, and the Dakuten Merge Problem

Every Japanese data pipeline ends up writing a normalization function. This is the one I keep rewriting, finally factored out into a CLI with explicit flags — no "guess what the user wants", no unicodedata.normalize("NFKC", s) silently turning ① into 1. The interesting bits are the offset arithmetic for ASCII and the table-driven dakuten merge for half-width katakana.

📦 GitHub: https://github.com/sen-ltd/mojihenkan

If you've ever processed Japanese text from a legacy ERP, a CSV exported by an older accounting system, or anything that came from a fixed-width terminal, you know the symptom. The "name" column has ＳＡＴＯＵ in full-width ASCII; the "kana" column has ｻﾄｳ in JIS X 0201 half-width katakana; one row in three has サトウ in proper full-width katakana because somebody re-keyed it; and there's a stray ㍍ on row 4,217 because a unit field was free-text. Search doesn't work, joins don't work, and the dedup function emits five rows for what is the same person.

The standard Python answer to this is one line:

import unicodedata
clean = unicodedata.normalize("NFKC", text)

NFKC fixes the obvious cases — ＳＡＴＯＵ becomes SATOU, ｻﾄｳ becomes サトウ. Done, ship it. Except NFKC is compatibility normalization, and "compatibility" means a much larger set of equivalences than just width. NFKC also does this:

>>> unicodedata.normalize("NFKC", "㍍")
'メートル'
>>> unicodedata.normalize("NFKC", "①")
'1'
>>> unicodedata.normalize("NFKC", "㈱")
'(株)'
>>> unicodedata.normalize("NFKC", "～")  # U+FF5E fullwidth tilde
'~'

Some of those are what you wanted. Some of them are silent data loss. Whether ① should become 1 depends entirely on whether your downstream cares about the ordinal information; whether ㍍ should become メートル depends on whether the unit was supposed to be carried as one token or four. There is no general right answer, which means the "just NFKC it" approach is right until the day a customer complains that their bullet points all turned into numbers — and then you're stuck unwinding NFKC, which you can't.

This is what mojihenkan exists for. It exposes each conversion as its own flag, applied in the order you declare it. Nothing happens that you didn't ask for.

$ mojihenkan "ＡＢＣ１２３" --zen2han-ascii
ABC123

$ mojihenkan "ｶﾞﾝﾊﾞﾚ" --han2zen-kana   # 6 codepoints in, 4 out
ガンバレ

$ mojihenkan "ＡＢＣ ひらがな" --zen2han-ascii --hira2kata
ABC ヒラガナ

Zero dependencies — argparse, json, and unicodedata from the standard library, nothing else. Total source is about 350 lines including comments. Most of the interesting logic is in two places: the offset math for ASCII, and the table for katakana with the dakuten merge.

Full↔half ASCII: one subtraction per character

The full-width ASCII forms (Ａ, ｂ, １, ！, …) live in the Unicode block "Halfwidth and Fullwidth Forms" at U+FF01..U+FF5E. That's exactly the printable ASCII range U+0021..U+007E shifted by +0xFEE0. Whoever designed this block made it a clean translation, on purpose, so the conversion is one subtraction per character — no tables, no regex, no surprises.

_OFFSET = 0xFEE0
_FULL_ASCII_START = 0xFF01  # ！
_FULL_ASCII_END = 0xFF5E    # ～

def fullwidth_to_halfwidth_ascii(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if _FULL_ASCII_START <= cp <= _FULL_ASCII_END:
            out.append(chr(cp - _OFFSET))
        elif ch == "\u3000":          # ideographic space
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)

The only special case is the space. The half-width space is U+0020, which is outside U+0021..U+007E, so it doesn't follow the offset rule. The full-width space   is U+3000, in a completely different block (CJK Symbols and Punctuation). So: if you see U+3000, emit U+0020. The reverse direction is symmetric.

That's it. Twenty lines of Python, and it correctly handles every printable ASCII character. NFKC also handles this case (it folds Ａ to A because the full-width forms are on the compatibility decomposition list), but pulling in the whole Unicode character database for a one-line subtraction feels wasteful, and it bundles in everything else NFKC does whether you wanted it or not.
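The reverse direction isn't shown above, but it's the same subtraction run backwards — a sketch (the function name is mine, not necessarily the repo's):

```python
_OFFSET = 0xFEE0
_HALF_ASCII_START = 0x21  # !
_HALF_ASCII_END = 0x7E    # ~

def halfwidth_to_fullwidth_ascii(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if _HALF_ASCII_START <= cp <= _HALF_ASCII_END:
            out.append(chr(cp + _OFFSET))
        elif ch == " ":                # U+0020 → ideographic space
            out.append("\u3000")
        else:
            out.append(ch)
    return "".join(out)
```

The space branch is again the only special case.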

Hiragana ↔ katakana: the same trick at a different offset

The hiragana block starts at U+3041 (ぁ) and the katakana block starts at U+30A1 (ァ). The two blocks are laid out in the same order — small ぁ aligned with small ァ, あ with ア, か with カ, and so on through the historical kana ゐ/ヰ and ゑ/ヱ at the end. So the conversion is again one subtraction:

_OFFSET = 0x60
_HIRA_START = 0x3041
_HIRA_END = 0x3096

def hiragana_to_katakana(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if _HIRA_START <= cp <= _HIRA_END:
            out.append(chr(cp + _OFFSET))
        else:
            out.append(ch)
    return "".join(out)

Two things to note. First, the long-vowel mark ー (U+30FC) is in the katakana block but has no hiragana counterpart, so katakana_to_hiragana correctly leaves it alone — コーヒー becomes こーひー, not a broken こ?ひ?. Second, the dakuten on が is part of the codepoint, not a combining mark, so it shifts correctly: がぎぐげご becomes ガギグゲゴ in one pass.
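The reverse function is the same loop with the range check on the katakana side — a sketch; note that the end of the range deliberately stops before ・ and ー, which have no hiragana twins:

```python
_OFFSET = 0x60
_KATA_START = 0x30A1  # ァ
_KATA_END = 0x30F6    # ヶ — last katakana with a hiragana counterpart

def katakana_to_hiragana(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if _KATA_START <= cp <= _KATA_END:
            out.append(chr(cp - _OFFSET))
        else:
            out.append(ch)  # ー (U+30FC) and everything else pass through
    return "".join(out)
```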

If you only ever needed to convert proper full-width hiragana and full-width katakana to each other, you'd be done in 30 lines. The pain comes when half-width katakana shows up.

Half-width katakana: where the offset trick breaks

JIS X 0201 — the 1969 standard for representing katakana in a single byte, whose kana half maps into Unicode at U+FF61..U+FF9F — gave you the basic 50 katakana, the small forms, the long-vowel mark ｰ, four punctuation symbols (｡｢｣､), and the middle dot (･). What it did not give you was the dakuten and handakuten marks as part of the base characters. There weren't enough codepoints. So they were encoded as separate following characters:

  • ｶ (U+FF76) + ﾞ (U+FF9E) = ｶﾞ "ga", two codepoints
  • ﾊ (U+FF8A) + ﾟ (U+FF9F) = ﾊﾟ "pa", two codepoints

Full-width katakana, on the other hand, is the modern Unicode block U+30A0..U+30FF and gives every dakuten/handakuten variant its own codepoint:

  • ガ = U+30AC, one codepoint
  • パ = U+30D1, one codepoint
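You can see the asymmetry directly in Python — the two renderings look almost identical on screen but have different lengths and compare unequal:

```python
half = "\uff76\uff9e"   # ｶ + ﾞ: JIS X 0201 style, base plus separate mark
full = "\u30ac"         # ガ: the precomposed full-width codepoint

assert len(half) == 2   # two codepoints
assert len(full) == 1   # one codepoint
assert half != full     # naive equality fails even though they look alike
```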

So when you do half-width → full-width katakana, you have to merge combining marks into the previous character — but only when that combination is valid. ｶﾞ merges to ガ. ﾝﾞ doesn't merge to anything, because there is no "n with dakuten" in the katakana block. The mark needs to survive somehow without corrupting the character before it.

Going the other way is the reverse: every composed ガ has to be split into ｶ + ﾞ, doubling the codepoint count. There's no offset; the split table is just the inverse of the merge table.

There is no clean arithmetic for this. It's a table.

# Half-width katakana → full-width base form (no dakuten).
_HALF_TO_FULL = {
    "\uff66": "\u30f2",  # ヲ → ヲ
    "\uff67": "\u30a1",  # ァ → ァ
    # ... 60 entries total ...
    "\uff76": "\u30ab",  # カ → カ
    "\uff77": "\u30ad",  # キ → キ
    # ...
}

# Which (full-width base, mark kind) pairs combine into which composed form.
_DAKUTEN_MERGE = {
    ("\u30ab", "dakuten"): "\u30ac",      # カ + ゙ → ガ
    ("\u30ad", "dakuten"): "\u30ae",      # キ + ゙ → ギ
    # ... k/s/t/h rows ...
    ("\u30cf", "handakuten"): "\u30d1",   # ハ + ゚ → パ
    ("\u30cf", "dakuten"): "\u30d0",      # ハ + ゙ → バ
    # ...
    ("\u30a6", "dakuten"): "\u30f4",      # ウ + ゙ → ヴ
}

That last entry, ウ + dakuten → ヴ, is the corner case that catches people. ヴ (U+30F4) is the modern way of writing the v-sound, used in loanwords like ヴァイオリン ("violin"). It's a perfectly valid composition — u plus dakuten — but it's the only entry in the merge table that isn't in the standard k/s/t/h rows, and it's easy to forget. If you forget it, ｳﾞ round-trips through your converter as ウ followed by a stray dakuten, which is wrong.

Once you have the tables, the merge logic is a one-character lookahead state machine:

def _mark_kind(ch: str):
    # ﾞ (U+FF9E) → "dakuten", ﾟ (U+FF9F) → "handakuten", anything else → None.
    if ch == "\uff9e":
        return "dakuten"
    if ch == "\uff9f":
        return "handakuten"
    return None

def halfwidth_to_fullwidth_kana(text: str) -> str:
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        full = _HALF_TO_FULL.get(ch)
        if full is None:
            # Not a half-width kana char. If it's a standalone combining
            # mark with nothing to attach to, emit the spacing form so we
            # don't drop information.
            if ch == "\uff9e":     out.append("\u309b")
            elif ch == "\uff9f":   out.append("\u309c")
            else:                  out.append(ch)
            i += 1
            continue

        # Look ahead: is the next codepoint a combining mark we can merge?
        if i + 1 < n:
            mark = _mark_kind(text[i + 1])  # "dakuten", "handakuten", or None
            if mark is not None:
                composed = _DAKUTEN_MERGE.get((full, mark))
                if composed is not None:
                    out.append(composed)
                    i += 2
                    continue
        out.append(full)
        i += 1
    return "".join(out)

The interesting moves here are the no-op ones. If the lookahead character is a ﾞ but the current base is ﾝ, we don't merge — (ン, "dakuten") isn't in the merge table, so the lookup misses, and we fall through to the plain "convert base only" path. The ﾞ then becomes the next iteration's main character, fails to convert as a base, and gets emitted as the standalone spacing form ゛ (U+309B). That preserves the mark instead of dropping it. Most "just NFKC it" implementations get this case wrong by silently dropping the standalone mark; my first version of this code did too.

The other subtle case is a half-width base like ｶ followed by something that isn't ﾞ or ﾟ. The lookahead happens, _mark_kind returns None, and we emit the plain カ. The non-mark character becomes the next iteration's main character as normal. This is what makes the function safe to call on mixed text — half-width katakana mixed with hiragana or ASCII or anything else.
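The split direction can be sketched by inverting the tables. This is an illustrative subset with only a handful of entries, not the repo's actual table (the real one covers the whole block), and the function name is mine:

```python
# Composed full-width kana → half-width base + mark. Illustrative subset only.
_FULL_TO_HALF_SPLIT = {
    "\u30ac": "\uff76\uff9e",  # ガ → ｶ + ﾞ
    "\u30d0": "\uff8a\uff9e",  # バ → ﾊ + ﾞ
    "\u30f4": "\uff73\uff9e",  # ヴ → ｳ + ﾞ
}
# Full-width kana with no diacritic → half-width base. Illustrative subset only.
_FULL_TO_HALF_BASE = {
    "\u30ab": "\uff76",  # カ → ｶ
    "\u30f3": "\uff9d",  # ン → ﾝ
    "\u30ec": "\uff9a",  # レ → ﾚ
}

def fullwidth_to_halfwidth_kana(text: str) -> str:
    out = []
    for ch in text:
        if ch in _FULL_TO_HALF_SPLIT:
            out.append(_FULL_TO_HALF_SPLIT[ch])   # one codepoint becomes two
        else:
            out.append(_FULL_TO_HALF_BASE.get(ch, ch))
    return "".join(out)
```

No lookahead needed in this direction — each composed codepoint splits independently.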

A quick test for the merge

The file has 26 tests just for the kana width conversion, but the one I refer back to most often is the ガンバレ test:

def test_dakuten_merge_ganbare():
    src = "ｶﾞﾝﾊﾞﾚ"           # 6 codepoints
    out = halfwidth_to_fullwidth_kana(src)
    assert out == "ガンバレ"   # 4 codepoints
    assert len(out) == 4

Six codepoints in, four out. If you broke the merge, you'd get either five (one merge missed), six (no merges at all), or four-but-wrong (a dakuten silently dropped). The length assertion is what catches the subtler bug: a visual check is satisfied by "ガンバレ" whether the dakuten survived as a separate combining mark or got merged, but the length tells you it really became the composed form, which is what downstream consumers expect.

Composing the pipeline

The last piece is the CLI. I wanted to be able to write:

mojihenkan "ABC ひらがな" --zen2han-ascii --hira2kata
Enter fullscreen mode Exit fullscreen mode

…and have the conversions apply in the order I typed them, left to right. The natural argparse approach — one boolean per flag, then process them in some fixed order — doesn't preserve user order. So I used a custom Action that all the conversion flags share, with a single dest="steps":

class _AppendStep(argparse.Action):
    def __init__(self, option_strings, dest, canonical=None, **kwargs):
        self._canonical = canonical
        super().__init__(option_strings, dest, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        steps = getattr(namespace, self.dest, None) or []
        steps.append(self._canonical)
        setattr(namespace, self.dest, steps)

When the parser sees --zen2han-ascii, it appends "fullwidth-to-halfwidth-ascii" to args.steps. When it sees --hira2kata, it appends "hiragana-to-katakana". The order is preserved because argparse processes argv left to right, and each Action call appends in that order. Then apply_pipeline is a one-liner:

def apply_pipeline(text: str, steps: list[str]) -> str:
    for name in steps:
        text = CONVERTERS[name][0](text)
    return text

This is what makes the order visible in the JSON output:

$ mojihenkan "ABC ひらがな" --zen2han-ascii --hira2kata --format json
{"input":"ABC ひらがな","output":"ABC ヒラガナ",
 "steps":["fullwidth-to-halfwidth-ascii","hiragana-to-katakana"]}
Enter fullscreen mode Exit fullscreen mode

If a user reports a weird result, the steps array is the receipt. You can replay it.
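The flag registration isn't shown above, so here's a self-contained sketch of how the shared Action wires up — the nargs=0 and canonical= details are my assumption about how the repo registers the flags, inferred from the __init__ signature:

```python
import argparse

class _AppendStep(argparse.Action):
    """Every conversion flag shares this Action and dest; each call appends."""
    def __init__(self, option_strings, dest, canonical=None, **kwargs):
        self._canonical = canonical
        # nargs=0: the flags take no value, they just record a step.
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        steps = getattr(namespace, self.dest, None) or []
        steps.append(self._canonical)
        setattr(namespace, self.dest, steps)

parser = argparse.ArgumentParser()
parser.add_argument("--zen2han-ascii", action=_AppendStep, dest="steps",
                    canonical="fullwidth-to-halfwidth-ascii")
parser.add_argument("--hira2kata", action=_AppendStep, dest="steps",
                    canonical="hiragana-to-katakana")

# Order follows argv, not declaration order:
args = parser.parse_args(["--hira2kata", "--zen2han-ascii"])
assert args.steps == ["hiragana-to-katakana", "fullwidth-to-halfwidth-ascii"]
```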

Tradeoffs and gotchas

A few things this tool does not try to do, and why:

  • It doesn't handle kanji. Kanji-to-kana is a different problem entirely — 日 is ひ or にち or じつ depending on context, and doing it right requires either a full morphological analyser (MeCab, Sudachi) or a large pronunciation dictionary. Pull either in and you've blown the zero-dependencies thesis. If you need kanji handling, run your text through pykakasi first.
  • NFKC is not reversible. If you apply --nfkc and store the result, you can't get the original back. ① and 1 are the same string after NFKC. Decide before you normalize whether you want to keep the source.
  • Half-width hiragana doesn't really exist. There is no JIS X 0201 hiragana set. If you see "half-width hiragana" in the wild it's almost certainly a font rendering issue, not an encoding one, and mojihenkan won't touch it.
  • ヴ is the only u-row dakuten. The merge table only has one u-row entry. Don't expect ｲﾞ to become anything sensible — there is no イ-with-dakuten codepoint in the katakana block, so the dakuten falls through to the standalone form.
  • The full-width tilde mess is not solved here. Wave dash (〜, U+301C), full-width tilde (～, U+FF5E), and the em dash (—) are a perennial Japanese text headache, and mojihenkan deliberately doesn't try to canonicalize them — that's a policy decision your team has to make about which one is "the" tilde.
  • Combining vs precomposed. This whole article assumes inputs use the JIS X 0201 form for half-width katakana (precomposed base + spacing-form mark). If your data uses combining U+3099 / U+309A on katakana characters, that's a different beast — --strip-diacritics handles those, but the kana width conversions don't.
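On that last point: for the combining-mark flavor on full-width katakana, plain NFC (canonical composition, not NFKC's compatibility folding) is what merges U+3099 onto its base — a quick check, separate from anything mojihenkan does:

```python
import unicodedata

combined = "カ\u3099"    # full-width カ + combining dakuten U+3099
precomposed = "\u30ac"   # ガ, the single composed codepoint

# NFC canonically composes the pair; no compatibility mappings involved.
assert unicodedata.normalize("NFC", combined) == precomposed
```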

Try it in 30 seconds

git clone https://github.com/sen-ltd/mojihenkan
cd mojihenkan
docker build -t mojihenkan .

docker run --rm mojihenkan "ABC123" --zen2han-ascii
# → ABC123

docker run --rm mojihenkan "ガンバレ" --han2zen-kana
# → ガンバレ  (6 codepoints → 4, dakuten merged)

docker run --rm mojihenkan "ガンバレ" --zen2han-kana
# → ガンバレ  (4 codepoints → 6, dakuten split)

docker run --rm mojihenkan "ひらがな" --hira2kata
# → ヒラガナ

docker run --rm mojihenkan "ABC ひらがな" --zen2han-ascii --hira2kata --format json
# → {"input":"ABC ひらがな","output":"ABC ヒラガナ","steps":[…]}
Enter fullscreen mode Exit fullscreen mode

Sixty-megabyte alpine image, runs as a non-root user, no network, no state. Pipe stdin into it, redirect stdout out of it, drop it into your CSV ETL between pandas.read_csv and df.to_parquet. It's the first install on every Japanese data pipeline I work on — partly because it's faster to type a flag than to remember which unicodedata constant I need, and partly because explicit conversions don't surprise you six months later when somebody asks why all the bullet points became numbers.

The whole codebase, including the 67 tests and this article's worth of explanatory comments, is on GitHub at sen-ltd/mojihenkan. MIT licensed. Issues and PRs welcome — particularly if you find a real-world dakuten edge case my table doesn't cover, because there's always one more.
