Spent 2 hours debugging CSV import. Turns out Excel added invisible characters.
Client sent me a CSV file yesterday. Simple product list with prices. Imported it into my Python script, and half the rows failed validation.
Opened the CSV in a text editor. Everything looked fine. Columns were there. Data looked clean. Ran the script again. Same errors.
## What broke
My script was parsing product prices like this:
```python
import csv

with open('products.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            price = float(row['Price'])
            print(f"{row['Product']}: ${price}")
        except ValueError:
            print(f"Failed: {row['Product']} - Bad price: '{row['Price']}'")
```
Output showed weird stuff:
```
Product A: $19.99
Failed: Product B - Bad price: '29.99'
Product C: $39.99
Failed: Product D - Bad price: '49.99'
```
The prices that failed looked identical to the ones that worked. Copy pasted them into the Python REPL. Worked fine there.
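You can reproduce the mismatch in the REPL once you know what to look for. repr() is the trick that makes the hidden character visible (the values here are made up to mirror the failing rows):

```python
clean = "29.99"
dirty = "\ufeff29.99"  # same text with a BOM (U+FEFF) prepended

print(dirty)           # prints identically to clean -- the BOM is zero-width
print(repr(dirty))     # repr() exposes the hidden character: '\ufeff29.99'
print(clean == dirty)  # False

try:
    float(dirty)       # this is exactly what the import script was doing
except ValueError:
    print("float() rejects it")
```

Copy-pasting from terminal output usually drops the BOM, which is why the pasted values worked in the REPL.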
Fun.
## Figured it out eventually
After way too long staring at this, I hexdumped the CSV file:
```
hexdump -C products.csv | head -20
```
Found this:
```
00000030  50 72 6f 64 75 63 74 20  42 2c ef bb bf 32 39 2e  |Product B,...29.|
00000040  39 39 0d 0a 50 72 6f 64  75 63 74 20 43 2c 33 39  |99..Product C,39|
```
See those ef bb bf bytes? That's a UTF-8 BOM (byte order mark). Excel loves writing one at the start of every CSV it exports, and as the hexdump shows, stray BOMs can end up at the start of individual fields too. Invisible in text editors. Breaks everything.
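If hexdump isn't handy, the same check works from Python by reading the file in binary mode. A quick sketch (find_boms is just a name I made up):

```python
def find_boms(path):
    """Return the byte offset of every UTF-8 BOM (ef bb bf) in a file."""
    data = open(path, 'rb').read()
    offsets, start = [], 0
    # bytes.find returns -1 when no more matches exist
    while (i := data.find(b'\xef\xbb\xbf', start)) != -1:
        offsets.append(i)
        start = i + 1
    return offsets
```

A BOM at offset 0 is the normal Excel export artifact; any other offset means one is lurking inside your data.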
The failing prices had a BOM prepended, so float() choked: the string was actually \ufeff29.99, not 29.99.
## Fix
Strip the BOM when reading:
```python
import csv

with open('products.csv', 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # lstrip handles BOMs that appear inside fields, not just at the file start
        price = float(row['Price'].lstrip('\ufeff'))
        print(f"{row['Product']}: ${price}")
```

The encoding='utf-8-sig' parameter tells Python to strip a BOM at the start of the file automatically. That only covers the leading one, though — the BOMs my hexdump found were inside fields, so the .lstrip('\ufeff') handles those. All rows parsed fine after that.
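If you get CSVs from clients regularly, it may be worth scrubbing every field rather than just prices. A defensive sketch (scrub and the character list are my own choices, not a standard):

```python
import csv
import io

# Characters that commonly sneak in invisibly:
# BOM, zero-width space, non-breaking space
INVISIBLES = '\ufeff\u200b\u00a0'

def scrub(value):
    """Strip invisible characters and surrounding whitespace from one field."""
    for ch in INVISIBLES:
        value = value.replace(ch, '')
    return value.strip()

# Demo with an in-memory CSV containing a mid-field BOM:
raw = "Product,Price\nProduct B,\ufeff29.99\n"
for row in csv.DictReader(io.StringIO(raw)):
    price = float(scrub(row['Price']))
    print(f"{row['Product']}: ${price}")  # Product B: $29.99
```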
When CSV imports fail randomly and you can't see why, check for invisible characters. hexdump is your friend.