Picute

Posted on Jun 26

How to fix garbled (mojibake) subtitles: decode legacy SRT/ASS encodings to UTF-8

#beginners #webdev #python #tutorial

The symptom

You open an .srt or .ass subtitle file and instead of text you get garbage:

cafÃ©  â€œquotesâ€  ï¿½ï¿½ï¿½  Ã«Â°Â©Ã¬â€"Â´

Korean turns into ë°©ì†¡, Japanese into ã‚ãŒã¦, simplified Chinese into ä½ å¥½ with extra accents. This is mojibake — the file is fine, but it was saved in a legacy character encoding and your player/editor is reading it as something else (usually UTF-8).

This post is the practical checklist I reach for: how to identify the real encoding and convert the file to clean UTF-8 — on the command line, in Python, in your editor, or in the browser.

Why it happens

Subtitle files are just text, and text has no inherent encoding — the bytes only mean something once you pick a decoder. Older subtitle files (and a lot of files exported by region-specific tools) are saved in pre-Unicode encodings:

Language	Common legacy encoding(s)
Japanese	Shift-JIS (CP932)
Korean	EUC-KR / CP949
Chinese (Simplified)	GB2312 / GB18030
Chinese (Traditional)	Big5
Cyrillic	Windows-1251
Western European	Windows-1252

Modern players assume UTF-8. Feed them Shift-JIS bytes and every multi-byte character decodes into the wrong glyphs. The fix is always the same idea: decode with the original encoding, re-encode as UTF-8.
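To make that concrete, here is a minimal Python sketch that reproduces the cafÃ© example from the symptom section and then decodes some Korean bytes the right way. The sample strings and variable names are illustrative assumptions, not taken from any particular tool.

```python
# Reproduce mojibake: UTF-8 bytes read with the wrong decoder (Windows-1252).
original = "café"
utf8_bytes = original.encode("utf-8")        # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("cp1252")        # each byte becomes its own character
print(garbled)                               # cafÃ©

# The fix in miniature: decode legacy bytes with their real encoding,
# then re-encode as UTF-8.
legacy_bytes = "방송".encode("euc_kr")        # stand-in for the bytes of a legacy .srt
recovered = legacy_bytes.decode("euc_kr")    # decode with the original encoding
clean_utf8 = recovered.encode("utf-8")       # these are the bytes you would save
print(recovered)                             # 방송
```

The bytes on disk never change in the first half; only the decoder does, which is why the file itself is fine.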

1. Identify the original encoding

Half the battle is guessing the source encoding. Two quick options:

# file gives a rough guess (-i on GNU/Linux; macOS uses -I)
file -i subtitle.srt
# subtitle.srt: text/plain; charset=iso-8859-1   # <- often wrong, but a hint

# chardetect (pip install chardet) is usually better for CJK
chardetect subtitle.srt
# subtitle.srt: EUC-KR with confidence 0.99

Heuristics that save time: Korean → try CP949/EUC-KR; Japanese → Shift-JIS/CP932; Simplified Chinese → GB18030 (it's a superset of GB2312, so it rarely hurts to use the wider one); Traditional Chinese → Big5.
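When the detectors disagree, a strict trial decode can narrow the list. The sketch below is one way to do that; `decodable_as` and the candidate list are hypothetical names chosen for this example. A successful decode only proves the bytes are valid in that encoding, not that the text is right, so you still need to eyeball the result.

```python
# Candidate codecs, using Python's names for the encodings in the table above.
CANDIDATES = ["utf-8", "cp949", "cp932", "gb18030", "big5"]

def decodable_as(raw, candidates=CANDIDATES):
    """Return the candidate encodings that decode raw without any error."""
    ok = []
    for enc in candidates:
        try:
            raw.decode(enc)          # strict mode: raises on the first bad byte
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok

sample = "방송".encode("cp949")       # stand-in for Path("subtitle.srt").read_bytes()
print(decodable_as(sample))          # utf-8 is ruled out; several legacy codecs may remain
```

Expect more than one survivor for short inputs, since the CJK encodings share large parts of their byte ranges. Treat this as a filter that rules encodings out, then pick among the rest by language.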

2. Convert with `iconv`

# Korean EUC-KR -> UTF-8
iconv -f EUC-KR -t UTF-8 in.srt -o out.srt

# Japanese Shift-JIS -> UTF-8
iconv -f SHIFT-JIS -t UTF-8 in.srt -o out.srt

# Simplified Chinese -> UTF-8 (use the superset)
iconv -f GB18030 -t UTF-8 in.srt -o out.srt

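If iconv errors out on a stray byte, add -c to drop the bytes it cannot convert. //TRANSLIT does not help here: it approximates characters that the target encoding cannot represent, and UTF-8 can represent all of them. If the errors are frequent rather than isolated, the source encoding is probably wrong; for Japanese files made on Windows, try CP932 instead of SHIFT-JIS.

iconv -c -f SHIFT-JIS -t UTF-8 in.srt -o out.srt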
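For comparison, Python's decoder exposes the same trade-off through its `errors` argument. This is a small sketch, assuming a file with one stray byte appended (the sample bytes are made up for illustration): strict mode raises an error, `ignore` drops the bad byte much as `-c` does, and `replace` marks it with U+FFFD so you can find it later.

```python
good = "방송".encode("euc_kr")
raw = good + b"\xff"              # 0xFF is not valid EUC-KR; simulates a stray byte

try:
    raw.decode("euc_kr")          # strict is the default error mode
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

dropped = raw.decode("euc_kr", errors="ignore")    # bad byte silently removed
marked = raw.decode("euc_kr", errors="replace")    # bad byte becomes U+FFFD

print(strict_failed)              # True
print(dropped == "방송")           # True
print("\ufffd" in marked)         # True
```

Dropping bytes is fine for one corrupt character, but it also hides a wrong encoding guess, so check the output before deleting the original.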

3. Convert in Python (batch-friendly)

Useful when you have a folder of files and don't fully trust a single guess:

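from pathlib import Path
import chardet  # third-party: pip install chardet

def to_utf8(path: Path) -> None:
    raw = path.read_bytes()
    guess = chardet.detect(raw)            # {'encoding': 'EUC-KR', 'confidence': 0.99}
    enc = guess["encoding"] or "utf-8"     # chardet returns None when it cannot guess
    # errors="replace" never raises; bad bytes become U+FFFD, so check the output
    text = raw.decode(enc, errors="replace")
    # write bytes, not text, so Windows does not turn \r\n into \r\r\n
    path.with_suffix(".utf8.srt").write_bytes(text.encode("utf-8"))
    print(f"{path.name}: {enc} ({guess['confidence']:.0%}) -> utf-8")

for srt in Path("subs").glob("*.srt"):
    if not srt.name.endswith(".utf8.srt"):  # skip output from a previous run
        to_utf8(srt)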

Two gotchas: chardet can confidently mislabel short files, and it often reports GB2312 for files that contain characters only present in GB18030 — if you see missing glyphs, re-run forcing gb18030.
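One way to guard against both gotchas is to post-process the detector's answer before decoding. The sketch below is an assumption-laden example, not part of chardet: `pick_codec`, `looks_clean`, the `WIDEN` table, and the 0.8 and 0.001 thresholds are all hypothetical choices you would tune for your own files. It takes a dict shaped like chardet's result, so it runs without chardet installed.

```python
# Detector labels that are safer to widen to a superset codec (illustrative choice).
WIDEN = {"gb2312": "gb18030", "euc-kr": "cp949"}

def pick_codec(guess, min_confidence=0.8):
    """Map a chardet-style result to a Python codec name, or None if untrusted."""
    enc = (guess.get("encoding") or "").lower()
    if not enc or guess.get("confidence", 0.0) < min_confidence:
        return None                      # caller should fall back to a manual choice
    return WIDEN.get(enc, enc)

def looks_clean(text, max_ratio=0.001):
    """True if U+FFFD replacement characters are at most max_ratio of the text."""
    return text.count("\ufffd") <= max_ratio * max(len(text), 1)

print(pick_codec({"encoding": "GB2312", "confidence": 0.99}))   # gb18030
print(pick_codec({"encoding": "Big5", "confidence": 0.42}))     # None
```

A `None` result is the signal to stop and choose the encoding by hand, and a failed `looks_clean` check after decoding is the signal that the guess was wrong.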

4. Convert in your editor (no terminal)

In VS Code: open the file, click the encoding in the status bar (e.g. UTF-8) → Reopen with Encoding → pick the real one (e.g. Korean (EUC-KR)). When it looks right, click the encoding again → Save with Encoding → UTF-8. Sublime Text and Notepad++ have the same "reopen/convert to UTF-8" flow.

5. Convert in the browser (no install)

When I just want to drag-and-drop one file without touching a terminal, I use a free browser tool: Picute's subtitle encoding converter. It re-decodes EUC-KR, Shift-JIS, GB18030, Big5, and Windows-125x to UTF-8 entirely client-side — the file never leaves your machine — and previews the result so you can confirm the glyphs are right before downloading.

Full disclosure: I'm affiliated with Picute (it's an AI subtitle/caption tool). The converter above is free and needs no signup; I'm including it as one option next to the CLI/Python/editor methods, not as a replacement for them.

A few rules that prevent mojibake next time

Always save subtitles as UTF-8 (without BOM for .srt/.vtt — some players choke on the BOM bytes at the start of the first cue).
If a player still shows boxes after converting, the issue may be a missing font for that script, not the encoding — try a font that covers the glyphs.
Keep the original file until you've verified the converted one; a wrong source-encoding guess is silent and easy to miss on a quick scan.
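The first and last rules can be partly automated. Here is a minimal sketch, assuming you already have the converted file's bytes in memory; `strip_utf8_bom` and `is_valid_utf8` are hypothetical helper names for this example.

```python
import codecs

def strip_utf8_bom(raw):
    """Return raw without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

def is_valid_utf8(raw):
    """True if raw decodes as UTF-8 in strict mode."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

converted = codecs.BOM_UTF8 + "1\n00:00:01,000 --> 00:00:02,000\n방송\n".encode("utf-8")
cleaned = strip_utf8_bom(converted)
print(cleaned[:1])             # b'1'  (the first cue number, no BOM in front)
print(is_valid_utf8(cleaned))  # True
```

Passing the validity check only shows that the output is well-formed UTF-8. It cannot tell you that the source encoding guess was right, so a visual check of a few lines is still needed.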

That's the whole toolkit. Pick whichever layer fits your workflow — they all do the same job: decode with the real encoding, write UTF-8.
