DEV Community

SEN LLC

Writing a Hepburn Hiragana ↔ Katakana ↔ Rōmaji Converter (With All The Annoying Edge Cases)

"Sanpo" or "sampo"? "Matcha" or "maccha"? Hepburn romanization has rules that naive 1:1 mapping gets wrong, and those are exactly the rules that appear on road signs and passports. I wrote a converter that handles them properly.

Ask most developers to romanize 「さんぽ」 and you'll get sanpo. Hand a Japanese passport office the same word and they'll write sampo. The difference comes from Hepburn romanization, the system used on Japanese signage, passports, and official documents. I wanted a converter that handled its rules.

🔗 Live demo: https://sen.ltd/portfolio/hiragana-romaji/
📦 GitHub: https://github.com/sen-ltd/hiragana-romaji

Auto-detects which script you typed (hiragana / katakana / Latin) and shows all three forms simultaneously. Handles sokuon (geminate consonants), moraic nasal (ん → m/n), and yōon (palatalized digraphs like きゃ). Vanilla JS, zero deps, no build.

Hiragana ↔ Katakana is one + 0x60 away

This one's a gift from the Unicode committee:

export function hiraganaToKatakana(input) {
  return input.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x60)
  )
}

export function katakanaToHiragana(input) {
  return input.replace(/[\u30a1-\u30f6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  )
}

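Hiragana (U+3041 to U+3096) and katakana (U+30A1 to U+30F6) occupy contiguous blocks in the same order, separated by exactly 0x60. No lookup table is needed, and each direction is three lines of code. Characters outside those ranges, such as the prolonged sound mark ー, pass through unchanged. This is the whole hiragana/katakana conversion.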
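To make the offset concrete, here is a small sketch that prints a couple of code points and applies the same shift. The helper name `shiftKana` and the two wrappers are mine for illustration, not names from the repo.

```javascript
// Hypothetical helper (not from the repo): shift every character matched by
// `range` by `delta` code points.
function shiftKana(input, range, delta) {
  return input.replace(range, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + delta)
  )
}

const toKatakana = (s) => shiftKana(s, /[\u3041-\u3096]/g, 0x60)
const toHiragana = (s) => shiftKana(s, /[\u30a1-\u30f6]/g, -0x60)

// あ is U+3042 and ア is U+30A2: the same slot in each block, 0x60 apart.
console.log('あ'.charCodeAt(0).toString(16)) // 3042
console.log('ア'.charCodeAt(0).toString(16)) // 30a2

console.log(toKatakana('さくら'))   // サクラ
console.log(toHiragana('ラーメン')) // らーめん (ー is outside the range, so it is left alone)
```

The last line shows why the romaji table needs its own entry for ー: folding katakana to hiragana does not touch it.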

Kana → rōmaji: table + state machine

The main table is the basic 50 syllables plus voiced (濁音) and semi-voiced (半濁音) rows:

const KANA_TO_ROMAJI = {
  あ: 'a', い: 'i', う: 'u', え: 'e', お: 'o',
  か: 'ka', // ...
  し: 'shi', ち: 'chi', つ: 'tsu', ふ: 'fu',
  じ: 'ji',
  ん: 'n',
  'ー': '-',
}

const DIGRAPHS = {
  きゃ: 'kya', きゅ: 'kyu', きょ: 'kyo',
  しゃ: 'sha', しゅ: 'shu', しょ: 'sho',
  ちゃ: 'cha', ちゅ: 'chu', ちょ: 'cho',
  // ...yōon forms
}

The converter is a small state machine with special handling for sokuon and ん:

export function kanaToRomaji(input) {
  const s = katakanaToHiragana(input)  // collapse both scripts to hiragana
  let out = ''
  let i = 0
  while (i < s.length) {
    const ch = s[i]

    // Sokuon (っ): duplicate the next consonant
    if (ch === 'っ') {
      const nextRomaji = lookupAhead(s, i + 1)
      if (nextRomaji) {
        // Hepburn edge case: "ch" → "tch", not "cch"
        if (nextRomaji.romaji.startsWith('ch')) {
          out += 't' + nextRomaji.romaji
        } else {
          out += nextRomaji.romaji[0] + nextRomaji.romaji
        }
        i += 1 + nextRomaji.consumed
        continue
      }
    }

    // Moraic nasal (ん): m before ば/ぱ/ま rows, else n
    if (ch === 'ん') {
      const nextCh = s[i + 1]
      if (nextCh && /[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextCh)) {
        out += 'm'
      } else {
        out += 'n'
      }
      i++
      continue
    }

    // Try digraph (2-kana yōon) first
    const two = s.slice(i, i + 2)
    if (DIGRAPHS[two]) { out += DIGRAPHS[two]; i += 2; continue }

    // Single kana
    if (KANA_TO_ROMAJI[ch]) { out += KANA_TO_ROMAJI[ch]; i++; continue }

    out += ch; i++
  }
  return out
}

Four things the loop has to do:

  1. Fold katakana into hiragana first so the table only needs one set of keys
  2. Try 2-character digraph matches before single-character matches
  3. Handle っ by looking one character ahead and duplicating the lead consonant
  4. Handle ん by branching on the following character
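The loop calls `lookupAhead`, which the excerpt doesn't show. Here is a minimal sketch of what such a helper could look like, assuming it returns the romaji of the syllable at a given index plus the number of kana it consumed (the shape the loop relies on). The tables are a tiny illustrative subset, and `geminate` is a hypothetical name for the sokuon branch pulled out on its own.

```javascript
// Illustrative subset of the tables; the real ones cover every kana.
const SINGLE = { か: 'ka', き: 'ki', て: 'te', ぽ: 'po', し: 'shi', ち: 'chi' }
const YOON = { きゃ: 'kya', しゃ: 'sha', ちゃ: 'cha' }

// Hypothetical lookupAhead: returns the romaji of the syllable starting at
// index `i` and how many kana it consumed, or null if nothing matches.
function lookupAhead(s, i) {
  const two = s.slice(i, i + 2)
  if (two.length === 2 && YOON[two]) return { romaji: YOON[two], consumed: 2 }
  const one = s[i]
  if (one !== undefined && SINGLE[one]) return { romaji: SINGLE[one], consumed: 1 }
  return null
}

// The sokuon branch from the loop: double the lead consonant, except that
// "ch" takes a "t" in front instead of a second "c".
function geminate(next) {
  return next.romaji.startsWith('ch')
    ? 't' + next.romaji
    : next.romaji[0] + next.romaji
}

console.log(geminate(lookupAhead('きって', 2)))   // tte
console.log(geminate(lookupAhead('まっちゃ', 2))) // tcha
console.log(lookupAhead('っ', 1))                 // null (nothing follows the sokuon)
```

Trying the two-kana match first inside the helper mirrors step 2 of the list: a sokuon before ちゃ has to see cha, not chi.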

Why "matcha" and not "maccha"

Sokuon's default is "copy the next consonant": いっぽん → ippon, きって → kitte, かっこう → kakkou (kakkō if long vowels are written with macrons). But when the next consonant is ch, Hepburn changes the rule:

if (nextRomaji.romaji.startsWith('ch')) {
  out += 't' + nextRomaji.romaji  // tcha, not ccha
}

「まっちゃ」 becomes matcha, not maccha. This is one of those "because Hepburn was writing for English readers" choices — tcha approximates the actual sound better in English phonotactics than doubled c does. Modern Japanese speakers might prefer the doubled-letter convention, but passport offices use Hepburn, so Hepburn it is.

The same logic applies to the ん → m rule:

if (/[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextCh)) {
  out += 'm'
}

「さんぽ」 becomes sampo, not sanpo. When an English speaker reads sanpo out loud they say /san-po/, which sounds wrong — the actual pronunciation is closer to /sam-po/. Hepburn's m encodes the real sound.
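As a worked example of the rule, the sketch below isolates the decision into one function and runs it against a few following kana. `nasalBefore` and `LABIAL_KANA` are hypothetical names for illustration; the character class is the same one the converter uses.

```javascript
// Kana whose consonant is b, p, or m: ん is written "m" before these.
const LABIAL_KANA = /[ばびぶべぼぱぴぷぺぽまみむめも]/

// Hypothetical helper: pick the romaji for ん given the kana that follows it
// (undefined when ん is the last character).
function nasalBefore(nextKana) {
  return nextKana !== undefined && LABIAL_KANA.test(nextKana) ? 'm' : 'n'
}

console.log(nasalBefore('ぽ'))      // m  (さんぽ → sampo)
console.log(nasalBefore('ぶ'))      // m  (しんぶん → shimbun)
console.log(nasalBefore('ど'))      // n  (さんど → sando)
console.log(nasalBefore(undefined)) // n  (word-final ん)
```

Because yōon such as びょ start with a kana from the same rows, checking only the next single character is enough.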

Rōmaji → kana: longest-match lookup

Going back the other way requires searching longest-first so sha matches before s:

const ROMAJI_LIST = ROMAJI_TO_KANA
  .filter(([r]) => /^[a-z-]+$/.test(r))
  .sort((a, b) => b[0].length - a[0].length)

export function romajiToKana(input) {
  const s = input.toLowerCase()
  let out = ''
  let i = 0
  while (i < s.length) {
    // Sokuon: doubled consonant (vowels and n excluded, so "oo" is not a sokuon)
    if (/[bcdfghjklmpqrstvwxyz]/.test(s[i]) && s[i + 1] === s[i]) {
      out += 'っ'
      i++
      continue
    }
    // ...longest-match lookup
  }
  return out
}

If you sort by length descending and linear-scan, the first hit is always the longest valid token starting at the current position. Without this, shi becomes s (す) + hi (ひ) and everything downstream is broken.
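The elided lookup step could be filled in along these lines. This is a sketch under stated assumptions, not the repo's code: the table is a small illustrative subset, `romajiToKanaSketch` is a hypothetical name, and both the "t before ch" sokuon and the "m before b/p/m" nasal are my guesses at how the two Hepburn rules from the previous section would be inverted.

```javascript
// Illustrative subset of the romaji → hiragana table.
const PAIRS = [
  ['a', 'あ'], ['i', 'い'], ['u', 'う'], ['o', 'お'],
  ['ka', 'か'], ['ki', 'き'], ['sa', 'さ'], ['su', 'す'], ['hi', 'ひ'],
  ['ma', 'ま'], ['po', 'ぽ'], ['te', 'て'], ['n', 'ん'],
  ['shi', 'し'], ['sha', 'しゃ'], ['cha', 'ちゃ'], ['kyo', 'きょ'],
]

// Longest tokens first, so the first hit in a linear scan is the longest match.
const SORTED = [...PAIRS].sort((a, b) => b[0].length - a[0].length)

function romajiToKanaSketch(input) {
  const s = input.toLowerCase()
  let out = ''
  let i = 0
  while (i < s.length) {
    // Assumption: "m" before b/p/m is the moraic nasal written Hepburn-style.
    if (s[i] === 'm' && /[bpm]/.test(s[i + 1] ?? '')) {
      out += 'ん'
      i++
      continue
    }
    // Sokuon: a doubled consonant, or "t" in front of "ch" (matcha).
    const doubled = /[bcdfghjkprstwz]/.test(s[i]) && s[i + 1] === s[i]
    if (doubled || s.startsWith('tch', i)) {
      out += 'っ'
      i++
      continue
    }
    // Longest-match lookup: first entry in SORTED that matches at position i.
    const hit = SORTED.find(([r]) => s.startsWith(r, i))
    if (hit) {
      out += hit[1]
      i += hit[0].length
    } else {
      out += s[i] // pass unknown characters through unchanged
      i++
    }
  }
  return out
}

console.log(romajiToKanaSketch('shi'))    // し
console.log(romajiToKanaSketch('kyou'))   // きょう
console.log(romajiToKanaSketch('kitte'))  // きって
console.log(romajiToKanaSketch('matcha')) // まっちゃ
console.log(romajiToKanaSketch('sampo'))  // さんぽ
```

A linear scan over a sorted list is fine at this table size; a trie would only matter for much larger inputs.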

Detecting the input script

The UI detects which script was typed and routes to the right pair of converters:

function detectScript(s) {
  if (/[\u3041-\u3096]/.test(s)) return 'hiragana'
  if (/[\u30a1-\u30f6]/.test(s)) return 'katakana'
  if (/[a-zA-Z]/.test(s)) return 'romaji'
  return 'unknown'
}

Detect on every input change, run conversion to the other two forms, render. No mode selector — the tool figures out what you're doing.
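A rough sketch of that routing step follows. `convertAll` and its `conv` parameter are hypothetical: the converters from the earlier sections are passed in as an object so the example runs on its own, and `detectScript` is repeated from above for the same reason. It also assumes `romajiToKana` returns hiragana, as the tests suggest.

```javascript
// Same checks as detectScript above, repeated so this sketch is self-contained.
function detectScript(s) {
  if (/[\u3041-\u3096]/.test(s)) return 'hiragana'
  if (/[\u30a1-\u30f6]/.test(s)) return 'katakana'
  if (/[a-zA-Z]/.test(s)) return 'romaji'
  return 'unknown'
}

// Hypothetical router: returns all three forms, or null if nothing was detected.
function convertAll(input, conv) {
  switch (detectScript(input)) {
    case 'hiragana':
      return {
        hiragana: input,
        katakana: conv.hiraganaToKatakana(input),
        romaji: conv.kanaToRomaji(input),
      }
    case 'katakana':
      return {
        hiragana: conv.katakanaToHiragana(input),
        katakana: input,
        romaji: conv.kanaToRomaji(input),
      }
    case 'romaji': {
      const hiragana = conv.romajiToKana(input)
      return { hiragana, katakana: conv.hiraganaToKatakana(hiragana), romaji: input }
    }
    default:
      return null
  }
}

// The checks run in order, so mixed input is classified by the first match.
console.log(detectScript('さんぽ')) // hiragana
console.log(detectScript('サンポ')) // katakana
console.log(detectScript('sampo'))  // romaji
console.log(detectScript('サンぽ')) // hiragana (the hiragana check runs first)
console.log(detectScript('123'))    // unknown
```

The mixed-input case is worth knowing about: a single hiragana character anywhere in the string wins, whatever else is present.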

Tests

13 cases run with node --test, focused on the Hepburn-specific edge cases. A representative subset:

test('basic hiragana', () => {
  assert.equal(kanaToRomaji('さくら'), 'sakura')
})

test('hepburn: し/ち/つ/ふ', () => {
  assert.equal(kanaToRomaji('しちつふ'), 'shichitsufu')
})

test('digraph: きょう', () => {
  assert.equal(kanaToRomaji('きょう'), 'kyou')
})

test('sokuon: いっぽん', () => {
  assert.equal(kanaToRomaji('いっぽん'), 'ippon')
})

test('sokuon before ch: まっちゃ', () => {
  assert.equal(kanaToRomaji('まっちゃ'), 'matcha')
})

test('n before ba-row: さんぽ', () => {
  assert.equal(kanaToRomaji('さんぽ'), 'sampo')
})

test('n before normal consonant: さんど', () => {
  assert.equal(kanaToRomaji('さんど'), 'sando')
})

test('romaji to hiragana: kyou', () => {
  assert.equal(romajiToKana('kyou'), 'きょう')
})

Testing both さんぽ → sampo and さんど → sando side-by-side is the important pair. Without the second test, a bug where the m branch unconditionally fires would still pass the sampo test. State-machine rules need paired "positive and negative" tests to pin them down.

Series

This is entry #17 in my 100+ public portfolio series.

Kunrei-shiki support is a reasonable feature request; issues welcome.
