DEV Community

Picute

How to fix garbled (mojibake) subtitles: decode legacy SRT/ASS encodings to UTF-8

The symptom

You open an .srt or .ass subtitle file and instead of text you get garbage:

café  “quotes† ���  ë°©ìâ€"´

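Korean 방송 turns into ë°©ì†¡, Japanese into ã‚ãŒã¦, and Simplified Chinese 你好 into ä½ å¥½. This is mojibake: the file is fine, but its bytes are being read with a different character encoding than the one it was saved in. With subtitles, the usual case is a file saved in a legacy encoding and a player or editor that assumes UTF-8, which often shows up as � or empty boxes rather than accented letters.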

This post is the practical checklist I reach for: how to identify the real encoding and convert the file to clean UTF-8 — on the command line, in Python, in your editor, or in the browser.


Why it happens

Subtitle files are just text, and text has no inherent encoding — the bytes only mean something once you pick a decoder. Older subtitle files (and a lot of files exported by region-specific tools) are saved in pre-Unicode encodings:

| Language | Common legacy encoding(s) |
| --- | --- |
| Japanese | Shift-JIS (CP932) |
| Korean | EUC-KR / CP949 |
| Chinese (Simplified) | GB2312 / GB18030 |
| Chinese (Traditional) | Big5 |
| Cyrillic | Windows-1251 |
| Western European | Windows-1252 |

Modern players assume UTF-8. Feed them Shift-JIS bytes and every multi-byte character decodes into the wrong glyphs. The fix is always the same idea: decode with the original encoding, re-encode as UTF-8.
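To make the decode-then-re-encode idea concrete, here is a minimal Python sketch. The sample word and the variable names are assumptions chosen for illustration, not data from a real subtitle file.

```python
# The same bytes, read with two different decoders.
original = "방송"                         # sample text (assumed for illustration)
legacy_bytes = original.encode("euc-kr")  # what a legacy tool would have saved

# Wrong decoder: UTF-8 cannot make sense of EUC-KR bytes.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Right decoder: the text comes back intact.
as_euc_kr = legacy_bytes.decode("euc-kr")

# The fix: decode with the original encoding, re-encode as UTF-8.
utf8_bytes = as_euc_kr.encode("utf-8")

print(repr(as_utf8))
print(as_euc_kr)
print(utf8_bytes.decode("utf-8") == original)
```

Nothing about the bytes changes between the two reads; only the decoder does, which is why picking the right source encoding is the whole job.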


1. Identify the original encoding

Half the battle is guessing the source encoding. Two quick options:

```bash
# file gives a rough guess (on macOS the flag is -I, not -i)
file -i subtitle.srt
# subtitle.srt: text/plain; charset=iso-8859-1   # <- often wrong, but a hint

# chardetect (installed by `pip install chardet`) is usually better for CJK
chardetect subtitle.srt
# subtitle.srt: EUC-KR with confidence 0.99
```

Heuristics that save time: Korean → try CP949/EUC-KR; Japanese → Shift-JIS/CP932; Simplified Chinese → GB18030 (it's a superset of GB2312, so it rarely hurts to use the wider one); Traditional Chinese → Big5.
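When the language is known but the exact encoding is not, one option is to try each candidate with strict decoding and look at what survives. The sketch below assumes a hand-picked candidate list and a hypothetical helper name, `try_encodings`; the sample bytes stand in for the contents of a real file.

```python
from typing import Dict, Sequence

# Candidate list based on the heuristics above; extend it as needed.
CANDIDATES = ("utf-8", "cp949", "cp932", "gb18030", "big5")


def try_encodings(raw: bytes, candidates: Sequence[str] = CANDIDATES) -> Dict[str, str]:
    """Return {encoding: preview} for every candidate that decodes without errors."""
    previews = {}
    for enc in candidates:
        try:
            text = raw.decode(enc)  # strict: raises on bytes the codec rejects
        except UnicodeDecodeError:
            continue
        previews[enc] = text[:60]   # short preview for a human to eyeball
    return previews


sample = "안녕하세요".encode("cp949")  # stand-in for Path("subtitle.srt").read_bytes()
for enc, preview in try_encodings(sample).items():
    print(f"{enc:10} {preview!r}")
```

More than one candidate may decode without errors, so a clean decode only rules encodings out. The preview that reads as real text is the one to trust.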

2. Convert with iconv

```bash
# Output goes through a shell redirect, which works with every iconv
# implementation (-o is a GNU iconv option). Never redirect onto the input file.

# Korean EUC-KR -> UTF-8
iconv -f EUC-KR -t UTF-8 in.srt > out.srt

# Japanese Shift-JIS -> UTF-8
iconv -f SHIFT-JIS -t UTF-8 in.srt > out.srt

# Simplified Chinese -> UTF-8 (use the superset)
iconv -f GB18030 -t UTF-8 in.srt > out.srt
```

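If iconv stops on a byte it cannot decode, first double-check the source encoding, since a wrong -f value is the usual cause. For a file with only a few damaged bytes, add -c to drop the invalid input so the conversion can finish. //TRANSLIT does not help here: it approximates characters that the target encoding lacks, and UTF-8 can represent every character these legacy encodings contain.

```bash
iconv -c -f SHIFT-JIS -t UTF-8 in.srt > out.srt
```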
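Before dropping bytes, it can help to see where the conversion actually fails. The following is a small Python sketch, not part of iconv: `first_bad_byte` is a hypothetical helper, and the truncated sample stands in for a damaged file.

```python
from typing import Optional


def first_bad_byte(raw: bytes, encoding: str) -> Optional[int]:
    """Return the offset of the first byte that fails strict decoding, else None."""
    try:
        raw.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Stand-in for Path("in.srt").read_bytes(): valid Shift-JIS with the last byte cut off.
damaged = "こんにちは".encode("shift_jis")[:-1]

offset = first_bad_byte(damaged, "shift_jis")
if offset is None:
    print("decodes cleanly")
else:
    context = damaged[max(0, offset - 4): offset + 4]
    print(f"first bad byte at offset {offset}: {context.hex(' ')}")
```

A failure in the first few bytes usually points to a wrong encoding guess, while a single failure deep in the file looks more like real damage.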

3. Convert in Python (batch-friendly)

Useful when you have a folder of files and don't fully trust a single guess:

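```python
from pathlib import Path

import chardet  # pip install chardet


def to_utf8(path: Path) -> Path:
    """Write a UTF-8 copy of `path` next to the original and return its path."""
    raw = path.read_bytes()
    guess = chardet.detect(raw)         # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99, ...}
    enc = guess["encoding"] or "utf-8"  # chardet returns None when it cannot guess
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising
    text = raw.decode(enc, errors="replace")
    out = path.with_suffix(".utf8.srt")  # foo.srt -> foo.utf8.srt
    # write bytes so the original line endings are preserved on every platform
    out.write_bytes(text.encode("utf-8"))
    print(f"{path.name}: {enc} ({guess['confidence']:.0%}) -> utf-8")
    return out


for srt in Path("subs").glob("*.srt"):
    if not srt.name.endswith(".utf8.srt"):  # skip output from earlier runs
        to_utf8(srt)
```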

Two gotchas: chardet can confidently mislabel short files, and it often reports GB2312 for files that contain characters only present in GB18030 — if you see missing glyphs, re-run forcing gb18030.
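One way to guard against the GB2312 mislabel is to map a detected name to a wider, compatible codec before decoding. This is a sketch under assumptions: the mapping table and the helper names `widen` and `decode_widened` are mine, and treating CP949 and CP932 as drop-in replacements assumes the files came from Windows tools (the two Microsoft codecs differ from the plain standards in a few mappings).

```python
# Map a detected encoding to a wider, compatible codec (Python codec names).
WIDER = {
    "gb2312": "gb18030",
    "euc_kr": "cp949",
    "shift_jis": "cp932",
}


def widen(detected: str) -> str:
    """Return the wider codec for a detected name, or the name unchanged."""
    key = detected.lower().replace("-", "_")
    return WIDER.get(key, detected)


def decode_widened(raw: bytes, detected: str):
    """Decode with the widened codec; return (text, codec actually used)."""
    enc = widen(detected)
    return raw.decode(enc, errors="replace"), enc


# "國" is outside GB2312 but inside GB18030, so a GB2312 label would lose it.
raw = "你好國".encode("gb18030")
text, used = decode_widened(raw, "GB2312")  # "GB2312" stands in for chardet's answer
print(used, text)
```

In the batch script, the same idea would amount to passing the detector's answer through `widen` before calling `decode`.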

4. Convert in your editor (no terminal)

In VS Code: open the file, click the encoding in the status bar (e.g. UTF-8) → Reopen with Encoding → pick the real one (e.g. Korean (EUC-KR)). When the text looks right, click the encoding again → Save with Encoding → UTF-8. Sublime Text and Notepad++ offer a similar reopen-then-convert-to-UTF-8 flow.

5. Convert in the browser (no install)

When I just want to drag-and-drop one file without touching a terminal, I use a free browser tool: Picute's subtitle encoding converter. It re-decodes EUC-KR, Shift-JIS, GB18030, Big5, and Windows-125x to UTF-8 entirely client-side — the file never leaves your machine — and previews the result so you can confirm the glyphs are right before downloading.

Full disclosure: I'm affiliated with Picute (it's an AI subtitle/caption tool). The converter above is free and needs no signup; I'm including it as one option next to the CLI/Python/editor methods, not as a replacement for them.


A few rules that prevent mojibake next time

  • Always save subtitles as UTF-8 (without BOM for .srt/.vtt — some players choke on the BOM bytes at the start of the first cue).
  • If a player still shows boxes after converting, the issue may be a missing font for that script, not the encoding — try a font that covers the glyphs.
  • Keep the original file until you've verified the converted one; a wrong source-encoding guess is silent and easy to miss on a quick scan.
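The first and last rules can be partly automated. The sketch below uses two hypothetical helpers: one strips a leading UTF-8 BOM, the other counts U+FFFD replacement characters as a rough signal that some bytes failed to decode. The sample cue is made up for illustration.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def count_replacements(text: str) -> int:
    """Count U+FFFD characters, which mark bytes that failed to decode."""
    return text.count("\ufffd")


# Stand-in for a converted file that was saved with a BOM.
converted = codecs.BOM_UTF8 + "1\n00:00:01,000 --> 00:00:02,000\n안녕하세요\n".encode("utf-8")

clean = strip_utf8_bom(converted)
print(clean.startswith(codecs.BOM_UTF8))
print(count_replacements(clean.decode("utf-8")))
```

A count of zero does not prove the source encoding was guessed correctly, since a wrong codec can decode without errors, so it is still worth reading a few cues by eye.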

That's the whole toolkit. Pick whichever layer fits your workflow — they all do the same job: decode with the real encoding, write UTF-8.
