Cleaned 10k customer records. One emoji crashed my entire pipeline.
Was scraping ecommerce product reviews last month. Got 10k records, ran a cleaning script to normalize text before feeding it to a sentiment analysis tool. Script ran fine on test data (500 rows). Pushed it to production.
48 minutes in, the whole thing just stops. No error message. Just frozen.
Thought it was memory. 10k rows shouldn't be a problem, but maybe something leaked. Restarted the process, added memory tracking. Same thing. Froze at exactly the same spot (row 6,842).
Checked the CSV manually. Row 6,842 looked fine. Customer name, review text, rating. Nothing weird.
Then I noticed it.
The review had a 💩 emoji in it. Specifically: "This product is 💩 don't buy it"
Encoding hell
My script was using basic text encoding. UTF-8, right? Wrong. I was reading the CSV with encoding='latin-1' because an earlier version of the data had some Spanish characters that broke with UTF-8.
Emojis are multibyte UTF-8 characters. Latin-1 can't represent them. Worse, latin-1 maps every single byte to *some* character, so decoding never raises; the emoji's four bytes just got silently mangled into junk, and somewhere downstream my pipeline choked on the result. No exception, no warning. Just hung.
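Here's the mangling in miniature (a minimal sketch, not from my actual pipeline):

```python
# UTF-8 encodes 💩 as four bytes
raw = "💩".encode('utf-8')
print(raw)  # b'\xf0\x9f\x92\xa9'

# Latin-1 maps every byte to some character, so decoding never fails;
# it just silently turns the emoji into four junk characters
mangled = raw.decode('latin-1')
print(repr(mangled))  # 'ð\x9f\x92©'
```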
Ended up doing this:
```python
import re

import pandas as pd

# Read with encoding_errors='replace' to handle encoding issues
# (encoding_errors needs pandas >= 1.3)
df = pd.read_csv(
    'reviews.csv',
    encoding='utf-8',
    encoding_errors='replace',  # replace undecodable bytes with �
)

# Clean out replacement chars
df['review_text'] = df['review_text'].str.replace('�', '', regex=False)

# Remove emojis if you don't need them
df['review_text'] = df['review_text'].apply(
    lambda x: re.sub(r'[^\x00-\x7F]+', '', str(x))
)

df.to_csv('cleaned_reviews.csv', index=False, encoding='utf-8')
```
That regex strips anything outside the basic ASCII range. Emojis, accents, special characters: all gone.
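For instance, running that same pattern over the review from row 6,842 (a quick sketch):

```python
import re

text = "This product is 💩 don't buy it"
cleaned = re.sub(r'[^\x00-\x7F]+', '', text)
print(cleaned)  # "This product is  don't buy it" (note the double space left behind)

# Accented characters get stripped too, which may not be what you want
print(re.sub(r'[^\x00-\x7F]+', '', "café"))  # "caf"
```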
If you need to keep emojis (some sentiment analysis tools actually use them), just stick with utf8 and don't strip them:
```python
df = pd.read_csv('reviews.csv', encoding='utf-8')
# That's it. Just use utf-8 consistently.
```
What would've saved me time
My 500-row test set had zero emojis. Production data had 147 emojis across 10k rows. Testing with a sample of real data would've caught this immediately.
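A pre-flight check like this would have flagged the problem before the 48-minute run (a sketch: the column name matches my data, but the sample rows here are made up):

```python
import pandas as pd

# Hypothetical sample standing in for the real 10k-row file
df = pd.DataFrame({'review_text': [
    "Works great",
    "This product is 💩 don't buy it",
    "Meh, it's fine",
]})

# Flag rows containing any non-ASCII character (emojis, accents, etc.)
has_non_ascii = df['review_text'].str.contains(r'[^\x00-\x7F]', regex=True)
print(f"{has_non_ascii.sum()} of {len(df)} rows contain non-ASCII text")
```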
Also added logging after this mess:
```python
for idx, row in df.iterrows():
    if idx % 1000 == 0:
        print(f"Processing row {idx}...")
    # process row
```
Now if it breaks, I know exactly where.
Didn't know the encoding_errors parameter existed. It would've surfaced the problem immediately instead of failing silently.
What I ended up doing
Kept emojis in the final dataset. The sentiment tool I was using (TextBlob) actually interprets 💩 correctly as negative sentiment. Stripping them would've lost signal.
Just had to commit to utf-8 everywhere. CSV export, database inserts, API responses: all utf-8. No more mixing encodings.
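Committing to one encoding is easy to verify: write a row out and read it back with utf-8 pinned at both ends (filename and row here are hypothetical):

```python
import csv
import os
import tempfile

row = ["cust_001", "This product is 💩 don't buy it", "1"]
path = os.path.join(tempfile.gettempdir(), 'reviews_utf8.csv')

# Pin encoding='utf-8' on both the write and the read
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(row)
with open(path, encoding='utf-8', newline='') as f:
    round_tripped = next(csv.reader(f))

assert round_tripped == row  # the emoji survives the round trip
```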
Still annoyed it took 48 minutes to find a single emoji tho.