SEN LLC

Posted on Apr 12

Writing a Hepburn Hiragana ↔ Katakana ↔ Rōmaji Converter (With All The Annoying Edge Cases)

#javascript #webdev #tutorial #i18n

"Sanpo" or "sampo"? "Matcha" or "maccha"? Traditional Hepburn romanization has rules that naive 1:1 mapping gets wrong, and those are exactly the rules that appear on passports and railway station signs. I wrote a converter that handles them properly.

Ask most developers to romanize 「さんぽ」 and you'll get sanpo. Hand a Japanese passport office the same word and they'll write sampo. The difference comes from traditional Hepburn romanization, the style used on Japanese passports and railway station signs, which has a handful of rules that naive 1:1 character mapping gets wrong. I wanted a converter that handled them.

🔗 Live demo: https://sen.ltd/portfolio/hiragana-romaji/
📦 GitHub: https://github.com/sen-ltd/hiragana-romaji

Auto-detects which script you typed (hiragana / katakana / Latin) and shows all three forms simultaneously. Handles sokuon (geminate consonants), moraic nasal (ん → m/n), and yōon (palatalized digraphs like きゃ). Vanilla JS, zero deps, no build.

Hiragana ↔ Katakana is one `+ 0x60` away

This one's a gift from the Unicode committee:

export function hiraganaToKatakana(input) {
  return input.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x60)
  )
}

export function katakanaToHiragana(input) {
  return input.replace(/[\u30a1-\u30f6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  )
}

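Hiragana (U+3041 to U+3096) and katakana (U+30A1 to U+30F6) occupy contiguous blocks in the same order, separated by exactly 0x60. No lookup table is needed, and each direction is a single replace call. Characters outside those ranges, such as the long-vowel mark ー or the katakana-only ヷ to ヺ, pass through unchanged. This is the whole hiragana/katakana conversion.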
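
A quick way to check the offset is to print the code points. The sketch below keeps a local copy of hiraganaToKatakana so it runs on its own; the sample strings are only illustrations.

```javascript
// Local copy of the function above so this snippet runs standalone.
function hiraganaToKatakana(input) {
  return input.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x60)
  )
}

// Code points of a matching pair differ by exactly 0x60.
const hira = 'あ'.charCodeAt(0) // 0x3042
const kata = 'ア'.charCodeAt(0) // 0x30a2
console.log((kata - hira).toString(16)) // 60

// Latin letters, spaces and ー sit outside the hiragana range and pass through.
console.log(hiraganaToKatakana('さくら sakura ー')) // サクラ sakura ー
```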

Kana → rōmaji: table + state machine

The main table is the basic 50 syllables plus voiced (か → が) and semi-voiced (ぱ) rows:

const KANA_TO_ROMAJI = {
  あ: 'a', い: 'i', う: 'u', え: 'e', お: 'o',
  か: 'ka', // ...
  し: 'shi', ち: 'chi', つ: 'tsu', ふ: 'fu',
  じ: 'ji',
  ん: 'n',
  'ー': '-',
}

const DIGRAPHS = {
  きゃ: 'kya', きゅ: 'kyu', きょ: 'kyo',
  しゃ: 'sha', しゅ: 'shu', しょ: 'sho',
  ちゃ: 'cha', ちゅ: 'chu', ちょ: 'cho',
  // ...yōon forms
}

The converter is a small state machine with special handling for sokuon and ん:

export function kanaToRomaji(input) {
  const s = katakanaToHiragana(input)  // collapse both scripts to hiragana
  let out = ''
  let i = 0
  while (i < s.length) {
    const ch = s[i]

    // Sokuon (っ): duplicate the next consonant
    if (ch === 'っ') {
      const nextRomaji = lookupAhead(s, i + 1)
      if (nextRomaji) {
        // Hepburn edge case: "ch" → "tch", not "cch"
        if (nextRomaji.romaji.startsWith('ch')) {
          out += 't' + nextRomaji.romaji
        } else {
          out += nextRomaji.romaji[0] + nextRomaji.romaji
        }
        i += 1 + nextRomaji.consumed
        continue
      }
    }

    // Moraic nasal (ん): m before ば/ぱ/ま rows, else n
    if (ch === 'ん') {
      const nextCh = s[i + 1]
      if (nextCh && /[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextCh)) {
        out += 'm'
      } else {
        out += 'n'
      }
      i++
      continue
    }

    // Try digraph (2-kana yōon) first
    const two = s.slice(i, i + 2)
    if (DIGRAPHS[two]) { out += DIGRAPHS[two]; i += 2; continue }

    // Single kana
    if (KANA_TO_ROMAJI[ch]) { out += KANA_TO_ROMAJI[ch]; i++; continue }

    out += ch; i++
  }
  return out
}

Four things the loop has to do:

1. Fold katakana into hiragana first so the table only needs one set of keys
2. Try 2-character digraph matches before single-character matches
3. Handle っ by looking one character ahead and duplicating the lead consonant
4. Handle ん by branching on the following character
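
The loop calls lookupAhead without showing it. One plausible shape, assuming it returns the romaji for the next kana or digraph together with the number of characters consumed, might look like the sketch below. The tables are trimmed stand-ins, and geminate is a hypothetical name used only here to isolate the sokuon step.

```javascript
// Tiny stand-in tables; the real ones cover the full syllabary.
const KANA_TO_ROMAJI = { ち: 'chi', て: 'te', ぽ: 'po', こ: 'ko' }
const DIGRAPHS = { ちゃ: 'cha', きゃ: 'kya' }

// Hypothetical helper: romanize the kana (or digraph) starting at index i.
// Returns { romaji, consumed }, or null when nothing matches.
function lookupAhead(s, i) {
  const two = s.slice(i, i + 2)
  if (two.length === 2 && DIGRAPHS[two]) {
    return { romaji: DIGRAPHS[two], consumed: 2 }
  }
  const one = s[i]
  if (one && KANA_TO_ROMAJI[one]) {
    return { romaji: KANA_TO_ROMAJI[one], consumed: 1 }
  }
  return null
}

// Sokuon step in isolation: double the lead consonant, except ch -> tch.
function geminate(next) {
  return next.romaji.startsWith('ch')
    ? 't' + next.romaji
    : next.romaji[0] + next.romaji
}

console.log(geminate(lookupAhead('っちゃ', 1))) // tcha
console.log(geminate(lookupAhead('って', 1)))   // tte
console.log(lookupAhead('っ', 1))               // null (trailing っ, nothing to double)
```

Returning null for a trailing っ lets the main loop fall through and emit the character unchanged instead of throwing.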

Why "matcha" and not "maccha"

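Sokuon's default is "copy the next consonant": いっぽん → ippon, きって → kitte, かっこう → kakkou (this converter does not emit macrons, so what strict Hepburn writes as kakkō comes out as kakkou). But when the next consonant is ch, Hepburn changes the rule: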

if (nextRomaji.romaji.startsWith('ch')) {
  out += 't' + nextRomaji.romaji  // tcha, not ccha
}

「まっちゃ」 becomes matcha, not maccha. This is one of those "because Hepburn was writing for English readers" choices — tcha approximates the actual sound better in English phonotactics than doubled c does. Modern Japanese speakers might prefer the doubled-letter convention, but passport offices use Hepburn, so Hepburn it is.

The same logic applies to the ん → m rule:

if (/[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextCh)) {
  out += 'm'
}

「さんぽ」 becomes sampo, not sanpo. When an English speaker reads sanpo out loud they say /san-po/, which sounds wrong — the actual pronunciation is closer to /sam-po/. Hepburn's m encodes the real sound.

Rōmaji → kana: longest-match lookup

Going back the other way requires searching longest-first, so that a longer token such as na is tried before its prefix n:

// ROMAJI_TO_KANA is an array of [romaji, kana] pairs; keep plain tokens, longest first
const ROMAJI_LIST = ROMAJI_TO_KANA
  .filter(([r]) => /^[a-z-]+$/.test(r))
  .sort((a, b) => b[0].length - a[0].length)

export function romajiToKana(input) {
  const s = input.toLowerCase()
  let out = ''
  let i = 0
  while (i < s.length) {
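    // Moraic nasal written as m: m before b/p/m is ん (sampo → さんぽ)
    if (s[i] === 'm' && /[bpm]/.test(s[i + 1] ?? '')) {
      out += 'ん'
      i++
      continue
    }
    // Sokuon: doubled consonant (vowels and n excluded), or t before ch (matcha)
    const doubled = /[bcdfghjklmpqrstvwxyz]/.test(s[i]) && s[i + 1] === s[i]
    if (doubled || s.startsWith('tch', i)) {
      out += 'っ'
      i++
      continue
    }
    // ...longest-match lookup
  }
  return out
}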

If you sort by length descending and linear-scan, the first hit is always the longest token that matches at the current position. Without the sort, a short token can shadow a longer one that starts with it: if n (ん) is tried before na (な), kana comes out as かんあ and everything downstream is broken.
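
To make the longest-match step concrete, here is a self-contained sketch of the whole reverse direction. It is not the repo's implementation: the table is a small illustrative subset, and the handling of m before b/p/m and of tch is an assumption chosen to mirror the kana → rōmaji rules above.

```javascript
// Small illustrative table; a real one would cover every syllable.
const ROMAJI_TO_KANA = [
  ['a', 'あ'], ['ka', 'か'], ['na', 'な'], ['n', 'ん'],
  ['sa', 'さ'], ['shi', 'し'], ['cha', 'ちゃ'], ['ma', 'ま'],
  ['po', 'ぽ'], ['te', 'て'], ['ki', 'き'], ['kyo', 'きょ'], ['u', 'う'],
]

// Longest tokens first, so 'na' is tried before 'n'.
const ROMAJI_LIST = [...ROMAJI_TO_KANA].sort((a, b) => b[0].length - a[0].length)

function romajiToKana(input) {
  const s = input.toLowerCase()
  let out = ''
  let i = 0
  while (i < s.length) {
    // m before b/p/m is the moraic nasal (sampo)
    if (s[i] === 'm' && /[bpm]/.test(s[i + 1] ?? '')) { out += 'ん'; i++; continue }
    // Sokuon: doubled consonant (vowels and n excluded) or t before ch
    const doubled = /[bcdfghjklmpqrstvwxyz]/.test(s[i]) && s[i + 1] === s[i]
    if (doubled || s.startsWith('tch', i)) { out += 'っ'; i++; continue }
    // Longest match at the current position
    const hit = ROMAJI_LIST.find(([r]) => s.startsWith(r, i))
    if (hit) { out += hit[1]; i += hit[0].length; continue }
    out += s[i]; i++ // pass unknown characters through
  }
  return out
}

console.log(romajiToKana('kana'))   // かな
console.log(romajiToKana('sampo'))  // さんぽ
console.log(romajiToKana('matcha')) // まっちゃ
console.log(romajiToKana('kitte'))  // きって
```

A linear scan over a sorted list is fine at this table size; a trie would only pay off for much larger inputs.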

Detecting the input script

The UI detects which script was typed and routes to the right pair of converters:

function detectScript(s) {
  if (/[\u3041-\u3096]/.test(s)) return 'hiragana'
  if (/[\u30a1-\u30f6]/.test(s)) return 'katakana'
  if (/[a-zA-Z]/.test(s)) return 'romaji'
  return 'unknown'
}

Detect on every input change, run conversion to the other two forms, render. No mode selector — the tool figures out what you're doing.
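
One consequence of the ordered checks is that mixed input resolves to whichever test matches first. The sketch below shows that, plus a hypothetical convertAll router. The name convertAll and the injected converters object are assumptions for illustration, not part of the repo.

```javascript
// Local copy of detectScript so the snippet runs standalone.
function detectScript(s) {
  if (/[\u3041-\u3096]/.test(s)) return 'hiragana'
  if (/[\u30a1-\u30f6]/.test(s)) return 'katakana'
  if (/[a-zA-Z]/.test(s)) return 'romaji'
  return 'unknown'
}

// Hypothetical router: c is an assumed object holding the article's
// conversion functions, passed in so the routing can be tested alone.
function convertAll(input, c) {
  switch (detectScript(input)) {
    case 'hiragana':
      return { hiragana: input, katakana: c.hiraganaToKatakana(input), romaji: c.kanaToRomaji(input) }
    case 'katakana':
      return { hiragana: c.katakanaToHiragana(input), katakana: input, romaji: c.kanaToRomaji(input) }
    case 'romaji': {
      const hiragana = c.romajiToKana(input)
      return { hiragana, katakana: c.hiraganaToKatakana(hiragana), romaji: input }
    }
    default:
      return null // nothing recognizable to convert
  }
}

console.log(detectScript('さくら'))     // hiragana
console.log(detectScript('サクラ'))     // katakana
console.log(detectScript('サクラさく')) // hiragana (the first matching check wins)
console.log(detectScript('123'))        // unknown
```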

Tests

The suite has 13 cases and runs with node --test, focused on the Hepburn-specific edge cases. A selection:

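import test from 'node:test'
import assert from 'node:assert/strict'
// Module path below is an assumption; adjust it to wherever the converters live.
import { kanaToRomaji, romajiToKana } from './converter.js'

test('basic hiragana', () => {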
  assert.equal(kanaToRomaji('さくら'), 'sakura')
})

test('hepburn: し/ち/つ/ふ', () => {
  assert.equal(kanaToRomaji('しちつふ'), 'shichitsufu')
})

test('digraph: きょう', () => {
  assert.equal(kanaToRomaji('きょう'), 'kyou')
})

test('sokuon: いっぽん', () => {
  assert.equal(kanaToRomaji('いっぽん'), 'ippon')
})

test('sokuon before ch: まっちゃ', () => {
  assert.equal(kanaToRomaji('まっちゃ'), 'matcha')
})

test('n before ba-row: さんぽ', () => {
  assert.equal(kanaToRomaji('さんぽ'), 'sampo')
})

test('n before normal consonant: さんど', () => {
  assert.equal(kanaToRomaji('さんど'), 'sando')
})

test('kana → romaji → kana roundtrip', () => {
  assert.equal(romajiToKana(kanaToRomaji('きょう')), 'きょう')
})

Testing both さんぽ → sampo and さんど → sando side-by-side is the important pair. Without the second test, a bug where the m branch unconditionally fires would still pass the sampo test. State-machine rules need paired "positive and negative" tests to pin them down.
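
One way to keep such pairs together is a small case table that puts inputs from both sides of the branch next to each other. This is a sketch under assumptions: romanizeNasal is a hypothetical stand-in for the ん branch, not a function from the repo, and it uses plain throws instead of node:test so it runs anywhere.

```javascript
// Minimal stand-in for the ん branch, just enough to exercise the cases.
function romanizeNasal(nextKana) {
  return nextKana && /[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextKana) ? 'm' : 'n'
}

// Each rule gets cases on both sides of the branch.
const cases = [
  { next: 'ぽ', want: 'm' },      // さんぽ: rule fires
  { next: 'ま', want: 'm' },      // さんま: rule fires
  { next: 'ど', want: 'n' },      // さんど: rule must not fire
  { next: undefined, want: 'n' }, // ほん: ん at end of string
]

for (const { next, want } of cases) {
  const got = romanizeNasal(next)
  if (got !== want) throw new Error(`next=${next}: got ${got}, want ${want}`)
}
console.log('all nasal cases pass')
```

If the m branch fired unconditionally, the ど and end-of-string rows would fail right away.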

Series

This is entry #17 in my 100+ public portfolio series.

📦 Repo: https://github.com/sen-ltd/hiragana-romaji
🌐 Live: https://sen.ltd/portfolio/hiragana-romaji/
🏢 Company: https://sen.ltd/

Kunrei-shiki support is a reasonable feature request; issues welcome.
