You export a table. You open the CSV. Instead of "José" you see "JosÃ©". Instead of "€" you see "â‚¬". Instead of an em-dash you see "â€”".
Welcome to encoding hell.
This guide explains why it happens and how to fix it—whether you're exporting manually or building extraction tools.
## What's Actually Happening
Character encoding is how computers represent text as bytes. Different encodings use different byte patterns for the same character.
The most common culprit: UTF-8 interpreted as Latin-1 or its Windows superset, Windows-1252 (or vice versa).
UTF-8 uses multiple bytes for non-ASCII characters. When software reads those bytes as Latin-1 (single-byte encoding), each byte becomes a separate garbled character.
| Character | UTF-8 bytes | Misread as Latin-1/Windows-1252 |
|---|---|---|
| é | C3 A9 | Ã© |
| € | E2 82 AC | â‚¬ |
| — | E2 80 94 | â€” |
| ñ | C3 B1 | Ã± |
The pattern is consistent: multi-byte sequences become multiple wrong characters.
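You can reproduce the pattern in a couple of lines of Python (a sketch; `cp1252` stands in for Latin-1 here because Windows-1252, its superset, is what Windows applications typically apply):

```python
def mojibake(text: str) -> str:
    # Encode correctly as UTF-8, then misread those bytes as
    # Windows-1252 -- exactly what a confused application does.
    return text.encode('utf-8').decode('cp1252')

print(mojibake('José'))  # JosÃ©
print(mojibake('€'))     # â‚¬
print(mojibake('—'))     # â€”
```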
## Where Encoding Breaks

### 1. The Source Page

The website might declare one encoding but serve another. Or declare nothing, leaving browsers to guess.

```html
<!-- The page declares UTF-8 -->
<meta charset="UTF-8">
```

But the server's HTTP header says otherwise:

```
Content-Type: text/html; charset=ISO-8859-1
```

When declarations conflict, browsers usually figure it out (the HTTP header takes precedence over the meta tag). Export tools might not.
### 2. The Export Process

If your extraction tool doesn't preserve encoding, bytes get reinterpreted along the way.

JavaScript's `textContent` returns Unicode strings, which is safe. But converting a string to a file requires choosing an encoding. If that choice is wrong, corruption happens.
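The same split exists in Python: the in-memory string is fine, and the bytes on disk depend entirely on the encoding chosen at write time (a quick illustration):

```python
s = 'José'                    # a correct Unicode string in memory
print(s.encode('utf-8'))      # b'Jos\xc3\xa9'  (two bytes for é)
print(s.encode('latin-1'))    # b'Jos\xe9'      (one byte for é)
```

Both byte sequences are "valid" files; only the reader's choice of encoding decides whether they come back as "José" or garbage.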
### 3. The Application Opening the File

You export a valid UTF-8 CSV. Excel opens it as Latin-1 because it guesses wrong. Same corruption, different cause.

This is why the same file looks fine in one application and broken in another.
## How to Diagnose the Problem

### Pattern Recognition

Certain corruptions are diagnostic:
| You see | Original was |
|---|---|
| Ã¡, Ã©, Ã­, Ã³, Ãº | á, é, í, ó, ú (Spanish accents) |
| Ã¤, Ã¶, Ã¼ | ä, ö, ü (German umlauts) |
| Ã§ | ç (cedilla) |
| â‚¬ | € (Euro sign) |
| â€” | — (em dash) |
| â€™ | ' (curly apostrophe) |

If you see Ã followed by another character, UTF-8 was read as Latin-1.
### Check the Raw Bytes

In Python:

```python
with open('file.csv', 'rb') as f:
    print(f.read(100))
```

If you see `\xc3\xa9` where you expect "é", the file is UTF-8. If you see `\xe9`, it's Latin-1.
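If you'd rather not eyeball hex, a rough stdlib-only detector looks like this (a heuristic sketch; `sniff_encoding` is not a standard function, and since Latin-1 accepts every byte sequence it can only win by elimination):

```python
def sniff_encoding(raw: bytes) -> str:
    # A BOM is an unambiguous UTF-8 signal.
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    # Strict decoders reject invalid byte sequences;
    # Latin-1 never fails, so it is the fallback of last resort.
    for enc in ('utf-8', 'cp1252', 'latin-1'):
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return 'latin-1'

print(sniff_encoding('José'.encode('utf-8')))    # utf-8
print(sniff_encoding('José'.encode('latin-1')))  # cp1252
```

Note that cp1252 and Latin-1 are indistinguishable for bytes in the 0xA0-0xFF range, so the sniffer reports cp1252 for both.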
### Check What the Page Declares

In browser dev tools:

```javascript
document.characterSet // Returns actual encoding used
```
## Fixing Corrupted Files

### The Re-encode Trick

If UTF-8 was misread as Latin-1, you can reverse it:

```python
# Read as Latin-1 (how it was wrongly interpreted)
with open('corrupted.csv', 'r', encoding='latin-1') as f:
    text = f.read()

# Encode back to Latin-1 bytes, then decode as UTF-8
fixed = text.encode('latin-1').decode('utf-8')

with open('fixed.csv', 'w', encoding='utf-8') as f:
    f.write(fixed)
```
This only works if the corruption was a simple misinterpretation. Double-corruption or mixed encodings are harder.
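The same round trip works on an in-memory string (assuming the misreader was Windows-1252, the usual case on Windows; the sample string is invented for illustration):

```python
corrupted = 'JosÃ© paid â‚¬5 â€” cash'

# Undo the misread: back to the original bytes, then decode correctly.
fixed = corrupted.encode('cp1252').decode('utf-8')
print(fixed)  # José paid €5 — cash
```

For messier cases, the ftfy library (`ftfy.fix_text`) automates this repair, including some double-corruption patterns.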
### Force Excel to Read UTF-8

Excel often ignores UTF-8. Two workarounds:

#### Option 1: Add a BOM

A Byte Order Mark at the file start signals UTF-8:

```python
with open('file.csv', 'w', encoding='utf-8-sig') as f:
    f.write(data)
```
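In byte terms, `utf-8-sig` simply prepends the three BOM bytes EF BB BF to plain UTF-8 output (a quick check):

```python
# 'utf-8-sig' writes the BOM before the payload...
assert 'hi'.encode('utf-8-sig') == b'\xef\xbb\xbfhi'
# ...plain 'utf-8' does not.
assert 'hi'.encode('utf-8') == b'hi'
```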
#### Option 2: Import Wizard
Instead of double-clicking the CSV, use Data → From Text/CSV and specify UTF-8 encoding manually.
### Google Sheets as Intermediary
Google Sheets handles encoding better than Excel. Import the CSV there, then export to XLSX. The Excel file will preserve characters correctly.
## Preventing Encoding Issues

### If You're Exporting Manually
Use tools that handle encoding correctly. HTML Table Exporter preserves Unicode throughout the export process, avoiding the conversion errors that cause mojibake.
For CSV exports, if your tool offers encoding options, choose UTF-8 with BOM for Excel compatibility.
### If You're Building Extraction Tools

Always be explicit about encoding:

```python
import csv
import requests

# Writing CSV
with open('output.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)

# Reading HTML
response = requests.get(url)
response.encoding = response.apparent_encoding  # Auto-detect from the bytes
html = response.text
```
For JavaScript/browser extensions:

```javascript
// Blob with explicit encoding
const blob = new Blob(['\ufeff' + csvContent], {
  type: 'text/csv;charset=utf-8;'
});
```

The `\ufeff` is the BOM that helps Excel.
### If You're Using Pandas

```python
import pandas as pd

# Reading
df = pd.read_csv('file.csv', encoding='utf-8')

# Writing with BOM for Excel
df.to_csv('output.csv', encoding='utf-8-sig', index=False)
```
## Special Characters That Cause Problems

### Smart Quotes and Apostrophes

Word processors convert straight quotes to curly ones:

- ' → ’ (U+2019)
- " → “ ” (U+201C, U+201D)

These are multi-byte in UTF-8 and break under Latin-1.
### Em and En Dashes

- — (em dash, U+2014)
- – (en dash, U+2013)
Common in financial data and publishing.
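If downstream parsing chokes on these characters, one option is to normalize smart punctuation to ASCII before comparing or exporting (a deliberately lossy sketch; `asciify_punct` and its mapping are illustrative, not a standard API):

```python
REPLACEMENTS = {
    '\u2018': "'", '\u2019': "'",    # curly single quotes
    '\u201c': '"', '\u201d': '"',    # curly double quotes
    '\u2013': '-', '\u2014': '--',   # en dash, em dash
}

def asciify_punct(text: str) -> str:
    # Replace each smart-punctuation character with an ASCII stand-in.
    for smart, plain in REPLACEMENTS.items():
        text = text.replace(smart, plain)
    return text

print(asciify_punct('\u201cIt\u2019s 2013\u20132014\u201d'))  # "It's 2013-2014"
```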
### Non-Breaking Spaces

A regular space is 0x20. A non-breaking space is 0xA0 (Latin-1) or 0xC2 0xA0 (UTF-8).

These are invisible but break string comparisons and parsing.

```python
# Normalize non-breaking spaces to regular spaces
text = text.replace('\xa0', ' ')
```
### Currency Symbols
€, £, ¥ are all outside ASCII and encode differently in UTF-8 vs Latin-1.
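A quick way to see the difference (Latin-1 has single-byte slots for £ and ¥ but no slot at all for €, which postdates the standard):

```python
# UTF-8 byte sequences for the three symbols
for ch in '€£¥':
    print(ch, ch.encode('utf-8').hex(' '))
# € e2 82 ac
# £ c2 a3
# ¥ c2 a5

print('£¥'.encode('latin-1'))  # fine: single bytes A3, A5
try:
    '€'.encode('latin-1')
except UnicodeEncodeError:
    print('no € in Latin-1')   # the Euro sign simply has no Latin-1 byte
```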
## Quick Reference
| Problem | Likely Cause | Fix |
|---|---|---|
| Ã followed by character | UTF-8 read as Latin-1 | Re-encode or specify UTF-8 on import |
| ? or □ replacing characters | Encoding doesn't support character | Use UTF-8 throughout |
| Excel shows garbage | Excel guessed wrong encoding | Use UTF-8 BOM or import wizard |
| Invisible characters breaking data | Non-breaking spaces or zero-width chars | Normalize whitespace |
## The Fundamental Rule
Match encoding at every step. Export as UTF-8. Open as UTF-8. Save as UTF-8.
The moment one step uses a different encoding, characters corrupt.
When in doubt, UTF-8 with BOM is the safest choice for files that will touch Excel.
Want exports that handle encoding correctly? Learn more at gauchogrid.com/html-table-exporter or try HTML Table Exporter free on the Chrome Web Store.