Writing a Hepburn Hiragana ↔ Katakana ↔ Rōmaji Converter (With All The Annoying Edge Cases)
"Sanpo" or "sampo"? "Matcha" or "maccha"? Hepburn romanization has rules that naive 1:1 mapping gets wrong, and those are exactly the rules that appear on road signs and passports. I wrote a converter that handles them properly.
Ask most developers to romanize 「さんぽ」 and you'll get sanpo. Hand a Japanese passport office the same word and they'll write sampo. The difference is Hepburn romanization — the standard used on Japanese signage, passports, and official documents — which has a handful of rules that naive 1:1 character mapping gets wrong. I wanted a converter that handled them.
🔗 Live demo: https://sen.ltd/portfolio/hiragana-romaji/
📦 GitHub: https://github.com/sen-ltd/hiragana-romaji
Auto-detects which script you typed (hiragana / katakana / Latin) and shows all three forms simultaneously. Handles sokuon (geminate consonants), moraic nasal (ん → m/n), and yōon (palatalized digraphs like きゃ). Vanilla JS, zero deps, no build.
Hiragana ↔ Katakana is one + 0x60 away
This one's a gift from the Unicode committee:
export function hiraganaToKatakana(input) {
  return input.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x60)
  )
}
export function katakanaToHiragana(input) {
  return input.replace(/[\u30a1-\u30f6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  )
}
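Hiragana and katakana occupy contiguous blocks in the same order, separated by exactly 0x60. No lookup table needed: each direction is a three-line regex replace, and that is the whole hiragana/katakana conversion. Anything outside the matched ranges, such as the long-vowel mark ー (U+30FC) or the katakana-only ヷ (U+30F7), passes through unchanged.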
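To make the offset concrete, here is a small runnable check. It restates the two functions without `export` so the snippet stands alone; the sample words are just illustrations.

```javascript
// Same regex-and-offset technique as above, restated so this runs on its own.
function hiraganaToKatakana(input) {
  return input.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x60)
  )
}
function katakanaToHiragana(input) {
  return input.replace(/[\u30a1-\u30f6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  )
}

console.log(hiraganaToKatakana('さくら'))   // サクラ
// ー is U+30FC, outside the katakana range in the regex, so it is left alone
console.log(katakanaToHiragana('ラーメン')) // らーめん
// ヷ is U+30F7; there is no hiragana at U+3097, so it is left alone too
console.log(katakanaToHiragana('ヷ'))       // ヷ
```

The ranges stop at ゖ/ヶ on purpose: shifting the few katakana above U+30F6 would land on unassigned code points.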
Kana → rōmaji: table + state machine
The main table is the basic 50 syllables plus voiced (か → が) and semi-voiced (ぱ) rows:
const KANA_TO_ROMAJI = {
  あ: 'a', い: 'i', う: 'u', え: 'e', お: 'o',
  か: 'ka', // ...
  し: 'shi', ち: 'chi', つ: 'tsu', ふ: 'fu',
  じ: 'ji',
  ん: 'n',
  'ー': '-',
}
const DIGRAPHS = {
  きゃ: 'kya', きゅ: 'kyu', きょ: 'kyo',
  しゃ: 'sha', しゅ: 'shu', しょ: 'sho',
  ちゃ: 'cha', ちゅ: 'chu', ちょ: 'cho',
  // ...yōon forms
}
The converter is a small state machine with special handling for sokuon and ん:
export function kanaToRomaji(input) {
  const s = katakanaToHiragana(input) // collapse both scripts to hiragana
  let out = ''
  let i = 0
  while (i < s.length) {
    const ch = s[i]
    // Sokuon (っ): duplicate the next consonant
    if (ch === 'っ') {
      const nextRomaji = lookupAhead(s, i + 1)
      if (nextRomaji) {
        // Hepburn edge case: "ch" → "tch", not "cch"
        if (nextRomaji.romaji.startsWith('ch')) {
          out += 't' + nextRomaji.romaji
        } else {
          out += nextRomaji.romaji[0] + nextRomaji.romaji
        }
        i += 1 + nextRomaji.consumed
        continue
      }
    }
    // Moraic nasal (ん): m before ば/ぱ/ま rows, else n
    if (ch === 'ん') {
      const nextCh = s[i + 1]
      if (nextCh && /[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextCh)) {
        out += 'm'
      } else {
        out += 'n'
      }
      i++
      continue
    }
    // Try digraph (2-kana yōon) first
    const two = s.slice(i, i + 2)
    if (DIGRAPHS[two]) { out += DIGRAPHS[two]; i += 2; continue }
    // Single kana
    if (KANA_TO_ROMAJI[ch]) { out += KANA_TO_ROMAJI[ch]; i++; continue }
    // Anything unrecognized passes through unchanged
    out += ch; i++
  }
  return out
}
Four things the loop has to do:
- Fold katakana into hiragana first so the table only needs one set of keys
- Try 2-character digraph matches before single-character matches
- Handle っ by looking one character ahead and duplicating the lead consonant
- Handle ん by branching on the following character
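The loop calls `lookupAhead`, which isn't shown in this post. A minimal sketch of what such a helper could look like, assuming it returns `{ romaji, consumed }` or `null` as the calling code implies. The two tables here are tiny stand-ins, and the body is a guess at the shape, not the repo's actual implementation.

```javascript
// Tiny stand-ins for the full tables, just enough to run this sketch.
const KANA_TO_ROMAJI = { か: 'ka', こ: 'ko', て: 'te', ぽ: 'po', ち: 'chi' }
const DIGRAPHS = { ちゃ: 'cha', きゃ: 'kya', しゃ: 'sha' }

// Hypothetical helper: romanize the kana that starts at `pos`, trying a
// two-kana digraph before a single kana. Returns null when nothing matches.
function lookupAhead(s, pos) {
  const two = s.slice(pos, pos + 2)
  if (two.length === 2 && Object.hasOwn(DIGRAPHS, two)) {
    return { romaji: DIGRAPHS[two], consumed: 2 }
  }
  const one = s[pos]
  if (one !== undefined && Object.hasOwn(KANA_TO_ROMAJI, one)) {
    return { romaji: KANA_TO_ROMAJI[one], consumed: 1 }
  }
  return null
}

console.log(lookupAhead('まっちゃ', 2)) // { romaji: 'cha', consumed: 2 }
console.log(lookupAhead('きって', 2))   // { romaji: 'te', consumed: 1 }
console.log(lookupAhead('あっ', 2))     // null (っ at the end of input)
```

Returning `consumed` alongside the romaji is what lets the sokuon branch skip past a digraph in one step instead of re-reading it on the next iteration.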
Why "matcha" and not "maccha"
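Sokuon's default is "copy the next consonant": いっぽん → ippon, きって → kitte, かっこう → kakkou (this converter spells long vowels out letter by letter, so strict Hepburn's kakkō with a macron is not produced). But when the next consonant is ch, Hepburn changes the rule: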
if (nextRomaji.romaji.startsWith('ch')) {
  out += 't' + nextRomaji.romaji // tcha, not ccha
}
「まっちゃ」 becomes matcha, not maccha. This is one of those "because Hepburn was writing for English readers" choices — tcha approximates the actual sound better in English phonotactics than doubled c does. Modern Japanese speakers might prefer the doubled-letter convention, but passport offices use Hepburn, so Hepburn it is.
The same logic applies to the ん → m rule:
if (/[ばびぶべぼぱぴぷぺぽまみむめも]/.test(nextCh)) {
  out += 'm'
}
「さんぽ」 becomes sampo, not sanpo. When an English speaker reads sanpo out loud they say /san-po/, which sounds wrong — the actual pronunciation is closer to /sam-po/. Hepburn's m encodes the real sound.
Rōmaji → kana: longest-match lookup
Going back the other way requires searching longest-first, so that a three-letter token like sha is tried before any shorter entry:
const ROMAJI_LIST = ROMAJI_TO_KANA
  .filter(([r]) => /^[a-z-]+$/.test(r))
  .sort((a, b) => b[0].length - a[0].length) // longest tokens first
export function romajiToKana(input) {
  const s = input.toLowerCase()
  let out = ''
  let i = 0
  while (i < s.length) {
    // Sokuon: doubled consonant (not n, and not a vowel, so "oo" stays おお)
    if (/[a-z]/.test(s[i]) && !'aiueon'.includes(s[i]) && s[i + 1] === s[i]) {
      out += 'っ'
      i++
      continue
    }
    // ...longest-match lookup
  }
  return out
}
If you sort by length descending and linear-scan, the first hit is always the longest table entry that starts at the current position. Without this, a shorter entry can win: na can match n (ん) and then a (あ) instead of な, and everything downstream is broken.
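Here is a runnable sketch of the longest-match step that the snippet elides. The table is a tiny stand-in for the real `ROMAJI_TO_KANA`, and two details are assumptions of this sketch rather than something the post confirms: treating "t" before "ch" as sokuon, and reading "m" before b/m/p as ん (the reverse of the forward rules).

```javascript
// Tiny stand-in table; the real ROMAJI_TO_KANA is assumed to be much larger.
const ROMAJI_TO_KANA = [
  ['a', 'あ'], ['i', 'い'], ['u', 'う'], ['o', 'お'], ['n', 'ん'],
  ['ka', 'か'], ['ki', 'き'], ['sa', 'さ'], ['ma', 'ま'], ['na', 'な'],
  ['po', 'ぽ'], ['te', 'て'], ['shi', 'し'], ['chi', 'ち'],
  ['kyo', 'きょ'], ['cha', 'ちゃ'],
]
const ROMAJI_LIST = ROMAJI_TO_KANA
  .filter(([r]) => /^[a-z-]+$/.test(r))
  .sort((a, b) => b[0].length - a[0].length) // longest tokens first

function romajiToKana(input) {
  const s = input.toLowerCase()
  let out = ''
  let i = 0
  while (i < s.length) {
    const next = s[i + 1]
    // Sokuon: a doubled consonant (no vowels, no n/m), or "t" before "ch"
    if (/[bcdfghjkpqrstvwyz]/.test(s[i]) && (next === s[i] || s.startsWith('tch', i))) {
      out += 'っ'; i++; continue
    }
    // Assumption: "m" before b, m, or p is the moraic nasal (sampo → さんぽ)
    if (s[i] === 'm' && next !== undefined && /[bmp]/.test(next)) {
      out += 'ん'; i++; continue
    }
    // Longest match: the list is sorted, so the first hit is the longest
    const hit = ROMAJI_LIST.find(([r]) => s.startsWith(r, i))
    if (hit) { out += hit[1]; i += hit[0].length; continue }
    out += s[i]; i++ // unknown character: pass it through
  }
  return out
}

console.log(romajiToKana('kana'))   // かな (na wins over n + a)
console.log(romajiToKana('kitte'))  // きって
console.log(romajiToKana('matcha')) // まっちゃ
console.log(romajiToKana('ookii'))  // おおきい (doubled vowels are not sokuon)
```

A linear `find` over a sorted list is plenty at this table size; a trie would only pay off with a far larger vocabulary.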
Detecting the input script
The UI detects which script was typed and routes to the right pair of converters:
function detectScript(s) {
  if (/[\u3041-\u3096]/.test(s)) return 'hiragana'
  if (/[\u30a1-\u30f6]/.test(s)) return 'katakana'
  if (/[a-zA-Z]/.test(s)) return 'romaji'
  return 'unknown'
}
Detect on every input change, run conversion to the other two forms, render. No mode selector — the tool figures out what you're doing.
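The routing step could be sketched like this. `convertAll` is a hypothetical name, and the two rōmaji converters are passed in as arguments so the snippet runs without the full tables; the real UI code may be wired differently.

```javascript
function detectScript(s) {
  if (/[\u3041-\u3096]/.test(s)) return 'hiragana'
  if (/[\u30a1-\u30f6]/.test(s)) return 'katakana'
  if (/[a-zA-Z]/.test(s)) return 'romaji'
  return 'unknown'
}
const hiraganaToKatakana = (input) =>
  input.replace(/[\u3041-\u3096]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) + 0x60))
const katakanaToHiragana = (input) =>
  input.replace(/[\u30a1-\u30f6]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) - 0x60))

// Hypothetical glue: detect the script, then derive the other two forms.
function convertAll(input, { kanaToRomaji, romajiToKana }) {
  switch (detectScript(input)) {
    case 'hiragana':
    case 'katakana': {
      const hiragana = katakanaToHiragana(input)
      return {
        hiragana,
        katakana: hiraganaToKatakana(hiragana),
        romaji: kanaToRomaji(hiragana),
      }
    }
    case 'romaji': {
      const hiragana = romajiToKana(input)
      return { hiragana, katakana: hiraganaToKatakana(hiragana), romaji: input }
    }
    default:
      // Nothing recognizable: show the input unchanged in all three slots
      return { hiragana: input, katakana: input, romaji: input }
  }
}

// Stub converters stand in for the real ones in this demo.
const stubs = { kanaToRomaji: () => 'sakura', romajiToKana: () => 'さくら' }
console.log(convertAll('サクラ', stubs))
// { hiragana: 'さくら', katakana: 'サクラ', romaji: 'sakura' }
```

Note that `detectScript` checks hiragana first, so mixed input such as ラーめん is classified as hiragana; folding through `katakanaToHiragana` in both kana cases keeps that harmless.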
Tests
13 cases on node --test, focused on the Hepburn-specific edge cases. A selection:
test('basic hiragana', () => {
  assert.equal(kanaToRomaji('さくら'), 'sakura')
})
test('hepburn: し/ち/つ/ふ', () => {
  assert.equal(kanaToRomaji('しちつふ'), 'shichitsufu')
})
test('digraph: きょう', () => {
  assert.equal(kanaToRomaji('きょう'), 'kyou')
})
test('sokuon: いっぽん', () => {
  assert.equal(kanaToRomaji('いっぽん'), 'ippon')
})
test('sokuon before ch: まっちゃ', () => {
  assert.equal(kanaToRomaji('まっちゃ'), 'matcha')
})
test('n before ba-row: さんぽ', () => {
  assert.equal(kanaToRomaji('さんぽ'), 'sampo')
})
test('n before normal consonant: さんど', () => {
  assert.equal(kanaToRomaji('さんど'), 'sando')
})
test('romaji to hiragana roundtrip', () => {
  assert.equal(romajiToKana('kyou'), 'きょう')
})
Testing both さんぽ → sampo and さんど → sando side-by-side is the important pair. Without the second test, a bug where the m branch unconditionally fires would still pass the sampo test. State-machine rules need paired "positive and negative" tests to pin them down.
Series
This is entry #17 in my 100+ public portfolio series.
- 📦 Repo: https://github.com/sen-ltd/hiragana-romaji
- 🌐 Live: https://sen.ltd/portfolio/hiragana-romaji/
- 🏢 Company: https://sen.ltd/
Kunrei-shiki support is a reasonable feature request; issues welcome.
