<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Picute</title>
    <description>The latest articles on DEV Community by Picute (@picute).</description>
    <link>https://dev.to/picute</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4003852%2F26c5caed-b5c5-45f6-b3a7-01b73f5e880b.png</url>
      <title>DEV Community: Picute</title>
      <link>https://dev.to/picute</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/picute"/>
    <language>en</language>
    <item>
      <title>How to fix garbled (mojibake) subtitles: decode legacy SRT/ASS encodings to UTF-8</title>
      <dc:creator>Picute</dc:creator>
      <pubDate>Fri, 26 Jun 2026 11:03:34 +0000</pubDate>
      <link>https://dev.to/picute/how-to-fix-garbled-mojibake-subtitles-decode-legacy-srtass-encodings-to-utf-8-1eb7</link>
      <guid>https://dev.to/picute/how-to-fix-garbled-mojibake-subtitles-decode-legacy-srtass-encodings-to-utf-8-1eb7</guid>
      <description>&lt;h2&gt;
  
  
  The symptom
&lt;/h2&gt;

&lt;p&gt;You open an &lt;code&gt;.srt&lt;/code&gt; or &lt;code&gt;.ass&lt;/code&gt; subtitle file and instead of text you get garbage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cafÃ©  â€œquotesâ€  ï¿½ï¿½ï¿½  Ã«Â°Â©Ã¬â€"Â´
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Korean turns into &lt;code&gt;ë°©ì†¡&lt;/code&gt;, Japanese into &lt;code&gt;ã‚ãŒã¦&lt;/code&gt;, simplified Chinese into &lt;code&gt;ä½ å¥½&lt;/code&gt; with extra accents. This is &lt;strong&gt;mojibake&lt;/strong&gt; — the file is fine, but it was saved in a &lt;em&gt;legacy&lt;/em&gt; character encoding and your player/editor is reading it as something else (usually UTF-8).&lt;/p&gt;
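
&lt;p&gt;One way to see where each kind of garbage comes from is to reproduce it. The sketch below uses only the Python standard library, and the sample strings are my own rather than taken from a real subtitle file. It suggests that accent-heavy output like the samples above is what UTF-8 bytes look like when they are read as Windows-1252, while legacy bytes read as UTF-8 tend to produce U+FFFD replacement characters instead.&lt;/p&gt;

```python
# Direction 1: the file is UTF-8, but the reader assumes Windows-1252.
utf8_bytes = "caf\u00e9".encode("utf-8")            # the word "cafe" with an acute accent
accent_garbage = utf8_bytes.decode("cp1252")        # every byte becomes its own character

# Direction 2: the file is EUC-KR, but the reader assumes UTF-8.
legacy_bytes = "\ubc29\uc1a1".encode("euc-kr")      # Korean for "broadcast"
replacement_garbage = legacy_bytes.decode("utf-8", errors="replace")

print(repr(accent_garbage))
print(repr(replacement_garbage))
```

&lt;p&gt;Knowing which direction you are looking at tells you whether the file needs converting at all, or whether the reader just needs to be told it is UTF-8.&lt;/p&gt;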

&lt;p&gt;This post is the practical checklist I reach for: how to identify the real encoding and convert the file to clean UTF-8 — on the command line, in Python, in your editor, or in the browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why it happens
&lt;/h2&gt;

&lt;p&gt;Subtitle files are just text, and text has no inherent encoding — the bytes only mean something once you pick a decoder. Older subtitle files (and a lot of files exported by region-specific tools) are saved in pre-Unicode encodings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Common legacy encoding(s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;Shift-JIS (CP932)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;EUC-KR / CP949&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese (Simplified)&lt;/td&gt;
&lt;td&gt;GB2312 / &lt;strong&gt;GB18030&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese (Traditional)&lt;/td&gt;
&lt;td&gt;Big5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cyrillic / Western EU&lt;/td&gt;
&lt;td&gt;Windows-1251 / Windows-1252&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Modern players assume &lt;strong&gt;UTF-8&lt;/strong&gt;. Feed them Shift-JIS bytes and every multi-byte character decodes into the wrong glyphs. The fix is always the same idea: &lt;em&gt;decode with the original encoding, re-encode as UTF-8.&lt;/em&gt;&lt;/p&gt;
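
&lt;p&gt;As a minimal sketch of that idea in Python (the function name &lt;code&gt;reencode_to_utf8&lt;/code&gt; and the sample line are hypothetical, chosen for illustration), the whole fix is two calls:&lt;/p&gt;

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes with their original encoding, then encode them as UTF-8."""
    # Strict decoding raises UnicodeDecodeError on bytes that are invalid
    # in source_encoding, which is a useful early warning of a bad guess.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")

# Simulate one Shift-JIS subtitle line, then convert it.
sjis_line = "\u3053\u3093\u306b\u3061\u306f".encode("shift_jis")  # "konnichiwa" in hiragana
utf8_line = reencode_to_utf8(sjis_line, "shift_jis")
print(utf8_line.decode("utf-8"))
```

&lt;p&gt;Every method in the rest of this post is a wrapper around these two steps.&lt;/p&gt;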




&lt;h2&gt;
  
  
  1. Identify the original encoding
&lt;/h2&gt;

&lt;p&gt;Half the battle is guessing the source encoding. Two quick options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# file gives a rough guess&lt;/span&gt;
file &lt;span class="nt"&gt;-i&lt;/span&gt; subtitle.srt
&lt;span class="c"&gt;# subtitle.srt: text/plain; charset=iso-8859-1   # &amp;lt;- often wrong, but a hint&lt;/span&gt;

&lt;span class="c"&gt;# chardetect (pip install chardet) is usually better for CJK&lt;/span&gt;
chardetect subtitle.srt
&lt;span class="c"&gt;# subtitle.srt: EUC-KR with confidence 0.99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Heuristics that save time: Korean → try &lt;strong&gt;CP949/EUC-KR&lt;/strong&gt;; Japanese → &lt;strong&gt;Shift-JIS/CP932&lt;/strong&gt;; Simplified Chinese → &lt;strong&gt;GB18030&lt;/strong&gt; (it's a superset of GB2312, so it rarely hurts to use the wider one); Traditional Chinese → &lt;strong&gt;Big5&lt;/strong&gt;.&lt;/p&gt;
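
&lt;p&gt;These heuristics can be turned into a small preview helper. This is only a sketch: the &lt;code&gt;CANDIDATES&lt;/code&gt; table, the language labels, and the function name are my own assumptions, and it needs Python 3.9 or later for the &lt;code&gt;dict[str, str]&lt;/code&gt; annotation. It decodes the same bytes with each candidate and shows you the results so you can pick the one that reads correctly.&lt;/p&gt;

```python
# Language hint -> encodings worth trying, mirroring the heuristics above.
CANDIDATES = {
    "korean": ["cp949", "euc_kr"],
    "japanese": ["cp932", "shift_jis"],
    "chinese-simplified": ["gb18030"],
    "chinese-traditional": ["big5"],
}

def preview_candidates(raw: bytes, language: str, limit: int = 60) -> dict[str, str]:
    """Return {encoding: decoded preview} for every candidate that decodes cleanly."""
    previews = {}
    for encoding in CANDIDATES[language]:
        try:
            previews[encoding] = raw.decode(encoding)[:limit]
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return previews

sample = "\ubc29\uc1a1".encode("cp949")  # stand-in for the bytes of a real file
for encoding, text in preview_candidates(sample, "korean").items():
    print(encoding, "->", text)
```

&lt;p&gt;A clean decode is not proof of a correct guess, since many byte sequences are valid in several legacy encodings at once. The preview is there so a human can make the final call.&lt;/p&gt;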

&lt;h2&gt;
  
  
  2. Convert with &lt;code&gt;iconv&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Korean EUC-KR -&amp;gt; UTF-8&lt;/span&gt;
iconv &lt;span class="nt"&gt;-f&lt;/span&gt; EUC-KR &lt;span class="nt"&gt;-t&lt;/span&gt; UTF-8 &lt;span class="k"&gt;in&lt;/span&gt;.srt &lt;span class="nt"&gt;-o&lt;/span&gt; out.srt

&lt;span class="c"&gt;# Japanese Shift-JIS -&amp;gt; UTF-8&lt;/span&gt;
iconv &lt;span class="nt"&gt;-f&lt;/span&gt; SHIFT-JIS &lt;span class="nt"&gt;-t&lt;/span&gt; UTF-8 &lt;span class="k"&gt;in&lt;/span&gt;.srt &lt;span class="nt"&gt;-o&lt;/span&gt; out.srt

&lt;span class="c"&gt;# Simplified Chinese -&amp;gt; UTF-8 (use the superset)&lt;/span&gt;
iconv &lt;span class="nt"&gt;-f&lt;/span&gt; GB18030 &lt;span class="nt"&gt;-t&lt;/span&gt; UTF-8 &lt;span class="k"&gt;in&lt;/span&gt;.srt &lt;span class="nt"&gt;-o&lt;/span&gt; out.srt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;iconv&lt;/code&gt; errors out on a stray byte, add &lt;code&gt;-c&lt;/code&gt; to skip the invalid input instead of aborting. (&lt;code&gt;//TRANSLIT&lt;/code&gt; solves a different problem: it approximates characters that the &lt;em&gt;target&lt;/em&gt; encoding cannot represent, which rarely applies when the target is UTF-8.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iconv &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; SHIFT-JIS &lt;span class="nt"&gt;-t&lt;/span&gt; UTF-8 &lt;span class="k"&gt;in&lt;/span&gt;.srt &lt;span class="nt"&gt;-o&lt;/span&gt; out.srt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
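
&lt;p&gt;Python has a rough counterpart to &lt;code&gt;iconv -c&lt;/code&gt; in the &lt;code&gt;errors&lt;/code&gt; argument of &lt;code&gt;bytes.decode&lt;/code&gt;. The sketch below uses a made-up damaged byte string (one valid EUC-KR syllable plus a dangling lead byte) to compare the three most common handlers.&lt;/p&gt;

```python
# EUC-KR bytes for one Korean syllable, followed by a stray, incomplete lead byte.
damaged = "\ubc29".encode("euc-kr") + b"\xbc"

try:
    damaged.decode("euc-kr")                          # strict is the default handler
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

dropped = damaged.decode("euc-kr", errors="ignore")   # similar in spirit to iconv -c
marked = damaged.decode("euc-kr", errors="replace")   # leaves a visible U+FFFD marker
print(strict_result, repr(dropped), repr(marked))
```

&lt;p&gt;Dropping bytes hides the damage, so &lt;code&gt;replace&lt;/code&gt; is often the safer choice while you are still checking whether the encoding guess is right.&lt;/p&gt;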



&lt;h2&gt;
  
  
  3. Convert in Python (batch-friendly)
&lt;/h2&gt;

&lt;p&gt;Useful when you have a folder of files and don't fully trust a single guess:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chardet&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;guess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chardet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# {'encoding': 'EUC-KR', 'confidence': 0.99}
&lt;/span&gt;    &lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guess&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_suffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.utf8.srt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;guess&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) -&amp;gt; utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;srt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.srt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



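&lt;p&gt;Two gotchas: &lt;code&gt;chardet&lt;/code&gt; can confidently mislabel short files, and it often reports &lt;code&gt;GB2312&lt;/code&gt; for files that also contain characters found only in &lt;code&gt;GB18030&lt;/code&gt;. Because the script decodes with &lt;code&gt;errors="replace"&lt;/code&gt;, those characters come out as U+FFFD replacement marks rather than raising an error. If you see them in the output, re-run with the encoding forced to &lt;code&gt;gb18030&lt;/code&gt;.&lt;/p&gt;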
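
&lt;p&gt;One way to guard against both gotchas is to post-process the detector's answer before decoding. The sketch below takes a dict shaped like the one &lt;code&gt;chardet.detect&lt;/code&gt; returns. The function name, the 0.8 threshold, and the &lt;code&gt;WIDER&lt;/code&gt; mapping are my own assumptions, not part of chardet.&lt;/p&gt;

```python
# Narrow labels a detector may report -> wider supersets to decode with.
# This mapping is an assumption for illustration, not part of chardet.
WIDER = {"gb2312": "gb18030", "euc-kr": "cp949"}

def pick_encoding(guess: dict, min_confidence: float = 0.8):
    """Return an encoding name to decode with, or None if the guess is too weak."""
    name = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if not name or min_confidence > confidence:
        return None  # caller should fall back to a manual choice
    return WIDER.get(name.lower(), name)

print(pick_encoding({"encoding": "GB2312", "confidence": 0.99}))
print(pick_encoding({"encoding": "EUC-KR", "confidence": 0.42}))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for weak guesses forces a manual decision, which helps most with the short files that detectors tend to mislabel.&lt;/p&gt;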

&lt;h2&gt;
  
  
  4. Convert in your editor (no terminal)
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;VS Code&lt;/strong&gt;: open the file, click the encoding in the status bar (e.g. &lt;code&gt;UTF-8&lt;/code&gt;) → &lt;strong&gt;Reopen with Encoding&lt;/strong&gt; → pick the real one (e.g. &lt;code&gt;Korean (EUC-KR)&lt;/code&gt;). When it looks right, click the encoding again → &lt;strong&gt;Save with Encoding&lt;/strong&gt; → &lt;code&gt;UTF-8&lt;/code&gt;. Sublime Text and Notepad++ have the same "reopen/convert to UTF-8" flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Convert in the browser (no install)
&lt;/h2&gt;

&lt;p&gt;When I just want to drag-and-drop one file without touching a terminal, I use a free browser tool: &lt;strong&gt;&lt;a href="https://picute.net/en/tools/subtitle-encoding-converter" rel="noopener noreferrer"&gt;Picute's subtitle encoding converter&lt;/a&gt;&lt;/strong&gt;. It re-decodes EUC-KR, Shift-JIS, GB18030, Big5, and Windows-125x to UTF-8 entirely client-side — the file never leaves your machine — and previews the result so you can confirm the glyphs are right before downloading.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full disclosure: I'm affiliated with Picute (it's an AI subtitle/caption tool). The converter above is free and needs no signup; I'm including it as one option next to the CLI/Python/editor methods, not as a replacement for them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A few rules that prevent mojibake next time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always save subtitles as UTF-8&lt;/strong&gt; (without BOM for &lt;code&gt;.srt&lt;/code&gt;/&lt;code&gt;.vtt&lt;/code&gt; — some players choke on the BOM bytes at the start of the first cue).&lt;/li&gt;
&lt;li&gt;If a player still shows boxes after converting, the issue may be a &lt;strong&gt;missing font&lt;/strong&gt; for that script, not the encoding — try a font that covers the glyphs.&lt;/li&gt;
&lt;li&gt;Keep the &lt;em&gt;original&lt;/em&gt; file until you've verified the converted one; a wrong source-encoding guess is silent and easy to miss on a quick scan.&lt;/li&gt;
&lt;/ul&gt;
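
&lt;p&gt;For the first rule, removing a BOM from a file that is already UTF-8 takes a few lines of Python. This is a minimal sketch with a hypothetical helper name and a made-up cue. It works on raw bytes, so line endings are left untouched.&lt;/p&gt;

```python
import codecs

def strip_utf8_bom(raw: bytes) -> bytes:
    """Return the same UTF-8 bytes without a leading byte order mark, if any."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# A tiny SRT cue with a BOM in front of the first cue number.
with_bom = codecs.BOM_UTF8 + b"1\n00:00:01,000 --> 00:00:02,000\nHello\n"
clean = strip_utf8_bom(with_bom)
print(clean[:2])
```

&lt;p&gt;Calling it on a file that has no BOM returns the bytes unchanged, so it is safe to run over a whole folder.&lt;/p&gt;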

&lt;p&gt;That's the whole toolkit. Pick whichever layer fits your workflow — they all do the same job: decode with the real encoding, write UTF-8.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
