The symptom
You open an .srt or .ass subtitle file and instead of text you get garbage:
café “quotes†��� ë°©ìâ€"´
Korean text such as 방송 turns into ë°©ì†¡, Japanese into ã‚ãŒã¦, and Simplified Chinese into ä½ å¥½, a run of accented Latin letters. This is mojibake: the file is fine, but it was saved in a legacy character encoding and your player or editor is reading it as something else (usually UTF-8).
This post is the practical checklist I reach for: how to identify the real encoding and convert the file to clean UTF-8 — on the command line, in Python, in your editor, or in the browser.
Why it happens
Subtitle files are just text, and text has no inherent encoding — the bytes only mean something once you pick a decoder. Older subtitle files (and a lot of files exported by region-specific tools) are saved in pre-Unicode encodings:
| Language | Common legacy encoding(s) |
|---|---|
| Japanese | Shift-JIS (CP932) |
| Korean | EUC-KR / CP949 |
| Chinese (Simplified) | GB2312 / GB18030 |
| Chinese (Traditional) | Big5 |
| Cyrillic / Western EU | Windows-1251 / Windows-1252 |
Modern players assume UTF-8. Feed them Shift-JIS bytes and every multi-byte character decodes into the wrong glyphs. The fix is always the same idea: decode with the original encoding, re-encode as UTF-8.
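Here is a minimal Python sketch of that idea, just to make it concrete. The sample string is made up for illustration; any Korean text behaves the same way.

```python
# Hypothetical sample line, not taken from a real subtitle file.
original = "방송 시작"

# Bytes as a legacy tool would have written them.
legacy_bytes = original.encode("euc-kr")

# Wrong decoder: a UTF-8 reader cannot make sense of EUC-KR bytes,
# so it substitutes U+FFFD replacement characters.
misread = legacy_bytes.decode("utf-8", errors="replace")

# Right decoder, then re-encode as UTF-8: this is the whole fix.
recovered = legacy_bytes.decode("euc-kr")
utf8_bytes = recovered.encode("utf-8")

print(repr(misread))
print(recovered)
```

Every method below is a variation on the last two statements: pick the right source codec, then write UTF-8.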
1. Identify the original encoding
Half the battle is guessing the source encoding. Two quick options:
# file gives a rough guess (on macOS the flag is -I, not -i)
file -i subtitle.srt
# subtitle.srt: text/plain; charset=iso-8859-1 # <- often wrong, but a hint
# chardetect (pip install chardet) is usually better for CJK
chardetect subtitle.srt
# subtitle.srt: EUC-KR with confidence 0.99
Heuristics that save time: Korean → try CP949/EUC-KR; Japanese → Shift-JIS/CP932; Simplified Chinese → GB18030 (it's a superset of GB2312, so it rarely hurts to use the wider one); Traditional Chinese → Big5.
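If you would rather let the machine try those heuristics in order, something like the sketch below can work. The function name and the candidate order are my own assumptions, not a standard recipe; put the encoding you suspect first.

```python
from typing import Optional, Sequence

# Candidate order is an assumption: most likely encoding first.
CANDIDATES = ["utf-8", "cp949", "cp932", "gb18030", "big5"]


def first_clean_decode(raw: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes raw without errors, else None."""
    for enc in candidates:
        try:
            raw.decode(enc)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

A clean decode is only a hint. Several legacy encodings will happily accept each other's bytes and produce the wrong characters, so still eyeball a few lines of the result.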
2. Convert with iconv
# Korean EUC-KR -> UTF-8
iconv -f EUC-KR -t UTF-8 in.srt -o out.srt
# Japanese Shift-JIS -> UTF-8
iconv -f SHIFT-JIS -t UTF-8 in.srt -o out.srt
# Simplified Chinese -> UTF-8 (use the superset)
iconv -f GB18030 -t UTF-8 in.srt -o out.srt
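If iconv stops on a stray byte, -c discards the characters it cannot convert. //TRANSLIT does something different: it approximates characters that the target encoding cannot represent, which rarely applies when the target is UTF-8. Dropping bytes is lossy, and a conversion that fails all over the file usually means the source encoding is wrong, so treat -c as a last resort:
iconv -c -f SHIFT-JIS -t UTF-8 in.srt -o out.srt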
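Before discarding anything, it can help to see exactly which bytes are causing the failure. This is a small Python sketch for that; `find_bad_bytes` is a hypothetical helper name, and it only reports the first problem it hits.

```python
from typing import Optional, Tuple


def find_bad_bytes(raw: bytes, encoding: str) -> Optional[Tuple[int, bytes]]:
    """Return (offset, offending_bytes) for the first undecodable sequence, or None."""
    try:
        raw.decode(encoding)  # strict decoding raises at the first bad sequence
    except UnicodeDecodeError as err:
        return err.start, raw[err.start:err.end]
    return None


# Usage (file name is illustrative):
#   from pathlib import Path
#   print(find_bad_bytes(Path("in.srt").read_bytes(), "cp932"))
```

One bad byte near the end of the file is probably a stray character. A failure within the first few lines suggests the encoding guess itself is off.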
3. Convert in Python (batch-friendly)
Useful when you have a folder of files and don't fully trust a single guess:
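from pathlib import Path

import chardet  # third-party: pip install chardet


def to_utf8(path: Path) -> None:
    raw = path.read_bytes()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99, ...}
    enc = guess["encoding"] or "utf-8"  # detection can return None
    # errors="replace" keeps going on bad bytes but leaves U+FFFD in the output
    text = raw.decode(enc, errors="replace")
    # a.srt -> a.utf8.srt; write bytes so existing line endings are left untouched
    path.with_suffix(".utf8.srt").write_bytes(text.encode("utf-8"))
    print(f"{path.name}: {enc} ({guess['confidence']:.0%}) -> utf-8")


# Snapshot the listing first and skip outputs from a previous run
for srt in sorted(Path("subs").glob("*.srt")):
    if not srt.name.endswith(".utf8.srt"):
        to_utf8(srt)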
Two gotchas: chardet can confidently mislabel short files, and it often reports GB2312 for files that contain characters only present in GB18030 — if you see missing glyphs, re-run forcing gb18030.
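Forcing the wider codec can be as simple as a lookup before decoding. The sketch below is one way to do it; the `WIDER` table and function name are hypothetical, and only the GB2312 to GB18030 pair comes from the advice above.

```python
# Detector labels mapped to a wider superset codec.
# Any entry beyond gb2312 would be your own assumption to verify.
WIDER = {"gb2312": "gb18030"}


def decode_with_superset(raw: bytes, detected: str) -> str:
    """Decode raw, swapping the detected label for its superset when known."""
    enc = WIDER.get(detected.lower(), detected)
    return raw.decode(enc)  # strict: fail loudly instead of hiding bad bytes


# Illustrative input: 龍 is a traditional form that GB2312 does not contain.
sample = "你好龍".encode("gb18030")
print(decode_with_superset(sample, "GB2312"))
```

I decode strictly here on purpose: with the wider codec in place, a remaining error is more likely to be a real problem than a missing character.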
4. Convert in your editor (no terminal)
In VS Code: open the file, click the encoding in the status bar (e.g. UTF-8) → Reopen with Encoding → pick the real one (e.g. Korean (EUC-KR)). When it looks right, click the encoding again → Save with Encoding → UTF-8. Sublime Text and Notepad++ have the same "reopen/convert to UTF-8" flow.
5. Convert in the browser (no install)
When I just want to drag-and-drop one file without touching a terminal, I use a free browser tool: Picute's subtitle encoding converter. It re-decodes EUC-KR, Shift-JIS, GB18030, Big5, and Windows-125x to UTF-8 entirely client-side — the file never leaves your machine — and previews the result so you can confirm the glyphs are right before downloading.
Full disclosure: I'm affiliated with Picute (it's an AI subtitle/caption tool). The converter above is free and needs no signup; I'm including it as one option next to the CLI/Python/editor methods, not as a replacement for them.
A few rules that prevent mojibake next time
- Always save subtitles as UTF-8 (without BOM for .srt/.vtt; some players choke on the BOM bytes at the start of the first cue).
- If a player still shows boxes after converting, the issue may be a missing font for that script, not the encoding; try a font that covers the glyphs.
- Keep the original file until you've verified the converted one; a wrong source-encoding guess is silent and easy to miss on a quick scan.
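The BOM and verification rules are easy to script. Here is a rough sketch with two helpers whose names are my own invention; `looks_clean` only catches damage that is visible as invalid UTF-8 or U+FFFD, so a wrong guess that still decoded to valid characters will pass it.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def looks_clean(data: bytes) -> bool:
    """True if data is valid UTF-8 and contains no U+FFFD replacement characters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return "\ufffd" not in text
```

I would run both on the converted file before deleting the original, then still open it and read a few cues.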
That's the whole toolkit. Pick whichever layer fits your workflow — they all do the same job: decode with the real encoding, write UTF-8.