You export a table. You open the CSV. Instead of "José" you see "JosÃ©". Instead of "€" you see "â‚¬". Instead of an em-dash you see "â€”".
Welcome to encoding hell.
This guide explains why it happens and how to fix it—whether you're exporting manually or building extraction tools.
## What's Actually Happening
Character encoding is how computers represent text as bytes. Different encodings use different byte patterns for the same character.
The most common culprit: UTF-8 interpreted as Latin-1 or its Windows superset, Windows-1252 (or vice versa).
UTF-8 uses multiple bytes for non-ASCII characters. When software reads those bytes as Latin-1 (single-byte encoding), each byte becomes a separate garbled character.
| Character | UTF-8 bytes | Misread as Latin-1/Windows-1252 |
|---|---|---|
| é | C3 A9 | Ã© |
| € | E2 82 AC | â‚¬ |
| — | E2 80 94 | â€” |
| ñ | C3 B1 | Ã± |
The pattern is consistent: multi-byte sequences become multiple wrong characters.
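You can reproduce the pattern in a couple of lines of Python (a sketch; `cp1252` stands in for Latin-1 here because Windows-1252, its superset, is what Windows applications typically apply):

```python
def mojibake(text: str) -> str:
    # Encode correctly as UTF-8, then misread those bytes as
    # Windows-1252 -- exactly what a confused application does.
    return text.encode('utf-8').decode('cp1252')

print(mojibake('José'))  # JosÃ©
print(mojibake('€'))     # â‚¬
print(mojibake('—'))     # â€”
```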
## Where Encoding Breaks

### 1. The Source Page

The website might declare one encoding but serve another. Or declare nothing, leaving browsers to guess.

```html
<!-- The page declares UTF-8 -->
<meta charset="UTF-8">
```

But the server's HTTP header says otherwise:

```
Content-Type: text/html; charset=ISO-8859-1
```

When declarations conflict, browsers usually figure it out (the HTTP header takes precedence over the meta tag). Export tools might not.
### 2. The Export Process

If your extraction tool doesn't preserve encoding, bytes get reinterpreted along the way.

JavaScript's `textContent` returns Unicode strings, which is safe. But converting a string to a file requires choosing an encoding. If that choice is wrong, corruption happens.
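The same split exists in Python: the in-memory string is fine, and the bytes on disk depend entirely on the encoding chosen at write time (a quick illustration):

```python
s = 'José'                    # a correct Unicode string in memory
print(s.encode('utf-8'))      # b'Jos\xc3\xa9'  (two bytes for é)
print(s.encode('latin-1'))    # b'Jos\xe9'      (one byte for é)
```

Both byte sequences are "valid" files; only the reader's choice of encoding decides whether they come back as "José" or garbage.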
### 3. The Application Opening the File

You export a valid UTF-8 CSV. Excel opens it as Latin-1 because it guesses wrong. Same corruption, different cause.

This is why the same file looks fine in one application and broken in another.
## How to Diagnose the Problem

### Pattern Recognition

Certain corruptions are diagnostic:
| You see | Original was |
|---|---|
| Ã¡, Ã©, Ã­, Ã³, Ãº | á, é, í, ó, ú (Spanish accents) |
| Ã¤, Ã¶, Ã¼ | ä, ö, ü (German umlauts) |
| Ã§ | ç (cedilla) |
| â‚¬ | € (Euro sign) |
| â€” | — (em dash) |
| â€™ | ' (curly apostrophe) |

If you see Ã followed by another character, UTF-8 was read as Latin-1.
### Check the Raw Bytes

In Python:

```python
with open('file.csv', 'rb') as f:
    print(f.read(100))
```

If you see `\xc3\xa9` where you expect "é", the file is UTF-8. If you see `\xe9`, it's Latin-1.
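If you'd rather not eyeball hex, a rough stdlib-only detector looks like this (a heuristic sketch; `sniff_encoding` is not a standard function, and since Latin-1 accepts every byte sequence it can only win by elimination):

```python
def sniff_encoding(raw: bytes) -> str:
    # A BOM is an unambiguous UTF-8 signal.
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    # Strict decoders reject invalid byte sequences;
    # Latin-1 never fails, so it is the fallback of last resort.
    for enc in ('utf-8', 'cp1252', 'latin-1'):
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return 'latin-1'

print(sniff_encoding('José'.encode('utf-8')))    # utf-8
print(sniff_encoding('José'.encode('latin-1')))  # cp1252
```

Note that cp1252 and Latin-1 are indistinguishable for bytes in the 0xA0-0xFF range, so the sniffer reports cp1252 for both.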
### Check What the Page Declares

In browser dev tools:

```javascript
document.characterSet // Returns actual encoding used
```
## Fixing Corrupted Files

### The Re-encode Trick

If UTF-8 was misread as Latin-1, you can reverse it:

```python
# Read as Latin-1 (how it was wrongly interpreted)
with open('corrupted.csv', 'r', encoding='latin-1') as f:
    text = f.read()

# Encode back to Latin-1 bytes, then decode as UTF-8
fixed = text.encode('latin-1').decode('utf-8')

with open('fixed.csv', 'w', encoding='utf-8') as f:
    f.write(fixed)
```
This only works if the corruption was a simple misinterpretation. Double-corruption or mixed encodings are harder.
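The same round trip works on an in-memory string (assuming the misreader was Windows-1252, the usual case on Windows; the sample string is invented for illustration):

```python
corrupted = 'JosÃ© paid â‚¬5 â€” cash'

# Undo the misread: back to the original bytes, then decode correctly.
fixed = corrupted.encode('cp1252').decode('utf-8')
print(fixed)  # José paid €5 — cash
```

For messier cases, the ftfy library (`ftfy.fix_text`) automates this repair, including some double-corruption patterns.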
### Force Excel to Read UTF-8

Excel often ignores UTF-8. Two workarounds:

#### Option 1: Add a BOM

A Byte Order Mark at the file start signals UTF-8:

```python
with open('file.csv', 'w', encoding='utf-8-sig') as f:
    f.write(data)
```
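In byte terms, `utf-8-sig` simply prepends the three BOM bytes EF BB BF to plain UTF-8 output (a quick check):

```python
# 'utf-8-sig' writes the BOM before the payload...
assert 'hi'.encode('utf-8-sig') == b'\xef\xbb\xbfhi'
# ...plain 'utf-8' does not.
assert 'hi'.encode('utf-8') == b'hi'
```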
#### Option 2: Import Wizard
Instead of double-clicking the CSV, use Data → From Text/CSV and specify UTF-8 encoding manually.
### Google Sheets as Intermediary
Google Sheets handles encoding better than Excel. Import the CSV there, then export to XLSX. The Excel file will preserve characters correctly.
## Preventing Encoding Issues

### If You're Exporting Manually
Use tools that handle encoding correctly. HTML Table Exporter preserves Unicode throughout the export process, avoiding the conversion errors that cause mojibake.
For CSV exports, if your tool offers encoding options, choose UTF-8 with BOM for Excel compatibility.
### If You're Building Extraction Tools

Always be explicit about encoding:

```python
import csv
import requests

# Writing CSV
with open('output.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)

# Reading HTML
response = requests.get(url)
response.encoding = response.apparent_encoding  # Auto-detect from the bytes
html = response.text
```
For JavaScript/browser extensions:

```javascript
// Blob with explicit encoding
const blob = new Blob(['\ufeff' + csvContent], {
  type: 'text/csv;charset=utf-8;'
});
```

The `\ufeff` is the BOM that helps Excel.
### If You're Using Pandas

```python
import pandas as pd

# Reading
df = pd.read_csv('file.csv', encoding='utf-8')

# Writing with BOM for Excel
df.to_csv('output.csv', encoding='utf-8-sig', index=False)
```
## Special Characters That Cause Problems

### Smart Quotes and Apostrophes

Word processors convert straight quotes to curly ones:

- ' → ’ (U+2019)
- " → “ ” (U+201C, U+201D)

These are multi-byte in UTF-8 and break under Latin-1.
### Em and En Dashes

- — (em dash, U+2014)
- – (en dash, U+2013)
Common in financial data and publishing.
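If downstream parsing chokes on these characters, one option is to normalize smart punctuation to ASCII before comparing or exporting (a deliberately lossy sketch; `asciify_punct` and its mapping are illustrative, not a standard API):

```python
REPLACEMENTS = {
    '\u2018': "'", '\u2019': "'",    # curly single quotes
    '\u201c': '"', '\u201d': '"',    # curly double quotes
    '\u2013': '-', '\u2014': '--',   # en dash, em dash
}

def asciify_punct(text: str) -> str:
    # Replace each smart-punctuation character with an ASCII stand-in.
    for smart, plain in REPLACEMENTS.items():
        text = text.replace(smart, plain)
    return text

print(asciify_punct('\u201cIt\u2019s 2013\u20132014\u201d'))  # "It's 2013-2014"
```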
### Non-Breaking Spaces

A regular space is 0x20. A non-breaking space is 0xA0 (Latin-1) or 0xC2 0xA0 (UTF-8).

These are invisible but break string comparisons and parsing.

```python
# Normalize non-breaking spaces to regular spaces
text = text.replace('\xa0', ' ')
```
### Currency Symbols
€, £, ¥ are all outside ASCII and encode differently in UTF-8 vs Latin-1.
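A quick way to see the difference (Latin-1 has single-byte slots for £ and ¥ but no slot at all for €, which postdates the standard):

```python
# UTF-8 byte sequences for the three symbols
for ch in '€£¥':
    print(ch, ch.encode('utf-8').hex(' '))
# € e2 82 ac
# £ c2 a3
# ¥ c2 a5

print('£¥'.encode('latin-1'))  # fine: single bytes A3, A5
try:
    '€'.encode('latin-1')
except UnicodeEncodeError:
    print('no € in Latin-1')   # the Euro sign simply has no Latin-1 byte
```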
## Quick Reference
| Problem | Likely Cause | Fix |
|---|---|---|
| Ã followed by character | UTF-8 read as Latin-1 | Re-encode or specify UTF-8 on import |
| ? or □ replacing characters | Encoding doesn't support character | Use UTF-8 throughout |
| Excel shows garbage | Excel guessed wrong encoding | Use UTF-8 BOM or import wizard |
| Invisible characters breaking data | Non-breaking spaces or zero-width chars | Normalize whitespace |
## The Fundamental Rule
Match encoding at every step. Export as UTF-8. Open as UTF-8. Save as UTF-8.
The moment one step uses a different encoding, characters corrupt.
When in doubt, UTF-8 with BOM is the safest choice for files that will touch Excel.
Want exports that handle encoding correctly? Learn more at gauchogrid.com/html-table-exporter or try HTML Table Exporter free on the Chrome Web Store.