Scraped 1200 products. 87 had prices like '$19.99extra'.
Got hired to scrape competitor pricing for an ecommerce client. Grabbed product names, prices, availability from 3 different sites. Ran the script overnight.
Woke up to a nice CSV with 1200 rows. Felt good.
Client downloads it. Five minutes later I get a message: "Your prices are broken."
Fun times.
Turns out HTML is a mess
Checked the CSV. Most prices looked fine: $19.99, $45.00, $129.95.
But scattered throughout were absolute gems:
- $19.99extra
- $45.00–
- Price: $129.95-
- $89 (where are the decimals)
- FREE (not even a number)
The sites had wildly different HTML. One put "extra savings" text inside the same span as the price. Another stuck a dash after clearance prices for no reason. Third one prefixed everything with "Price:" like it wasn't obvious already.
My scraper just grabbed .innerText and called it a day. Zero cleaning.
Bad call.
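For context, the extraction was about this naive. Here's a reconstructed sketch using stdlib html.parser (the real script used different tooling and selectors; the markup below is illustrative):

```python
from html.parser import HTMLParser

class PriceGrabber(HTMLParser):
    """Collects ALL text inside the price span -- junk included."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.text.append(data)

# One site nested "extra savings" inside the same span as the price
p = PriceGrabber()
p.feed('<span class="price">$19.99<em>extra savings</em></span>')
print("".join(p.text))  # $19.99extra savings
```

The grabber doesn't care about child elements, so any marketing text nested inside the price element rides along with the number.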
Tried the obvious fix first
Figured I'd just strip the junk:
price = price.replace('extra', '').replace('Price:', '').replace('–', '').strip()
Worked for maybe half of them.
Then found prices like "$1,299.00" where the comma got weird. Also didn't handle "FREE" or the missing decimal problem. Spent 20 minutes trying different string replacements. Got nowhere.
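For the record, here's roughly why the replace chain kept losing. Each replace only kills one known token, and every new site added a token I hadn't seen:

```python
def replace_chain(price):
    # Whack-a-mole: strip known junk one literal token at a time
    return price.replace('extra', '').replace('Price:', '').replace('–', '').strip()

print(replace_chain('$19.99extra'))  # '$19.99' -- fine
print(replace_chain('$1,299.00'))    # unchanged; float() still chokes on '$' and ','
print(replace_chain('FREE'))         # 'FREE' -- not a number at all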
What actually fixed it
Stopped trying to clean strings. Started extracting just the number part:
import re

def clean_price(raw_price):
    if not raw_price:
        return None
    # Just grab the first number (kill commas first)
    match = re.search(r'\d+\.?\d*', raw_price.replace(',', ''))
    if not match:
        return None
    # float() already turns "89" into 89.0, so whole-dollar
    # prices need no special case
    return round(float(match.group()), 2)
Ran this on all 1200 rows:
- $19.99extra became 19.99
- Price: $129.95 became 129.95
- $89 became 89.00
- FREE became None (filtered those out later)
Ended up with 1143 valid prices. The other 57 were "FREE", "Call for price", or just missing.
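Wiring the cleaner into the pipeline looked something like this (the column names and rows here are illustrative, not the client's actual file):

```python
import csv
import io
import re

def clean_price(raw_price):
    # Same extractor as above: pull the first number, commas stripped
    if not raw_price:
        return None
    match = re.search(r'\d+\.?\d*', raw_price.replace(',', ''))
    if not match:
        return None
    return round(float(match.group()), 2)

# Stand-in for the scraped CSV -- the real file had 1200 rows from three sites
raw = io.StringIO("name,price\nWidget,$19.99extra\nGadget,Price: $129.95\nFreebie,FREE\n")
rows = list(csv.DictReader(raw))
cleaned = [(r["name"], clean_price(r["price"])) for r in rows]
valid = [(name, price) for name, price in cleaned if price is not None]
print(valid)  # the FREE row drops out
```

Keeping the None rows around until the end (instead of dropping them during scraping) made it easy to count exactly how many failed and why.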
Client was happy. Finally.
The part I actually should've done
Validation before sending anything:
# Look for suspicious outliers
prices = [p for p in prices if p is not None]
avg = sum(prices) / len(prices)
outliers = [p for p in prices if p > avg * 10 or p < 1]
if outliers:
    print(f"Warning: {len(outliers)} weird prices")
    print(outliers[:5])
Would've caught the one site that formatted "$1,299" as "$1.299" somehow. My regex turned that into 1.30 instead of 1299.00. Nobody noticed until week 2.
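A separator-aware parse would have caught that case. This is a sketch of what I'd try now, not what the original script did, and the "dot followed by exactly three digits means thousands" heuristic is an assumption that only works if a site never sells anything priced like $1.299:

```python
import re

def parse_price(raw):
    # Pull the first run of digits with any embedded separators
    m = re.search(r'\d[\d.,]*', raw)
    if not m:
        return None
    num = m.group()
    # Heuristic: '1.299' style (groups of exactly three digits after each
    # dot, no comma) is European-style thousands, not a decimal
    if re.fullmatch(r'\d{1,3}(\.\d{3})+', num):
        num = num.replace('.', '')
    num = num.replace(',', '')  # commas are always thousands separators here
    return round(float(num), 2)

print(parse_price('$1.299'))     # 1299.0, not 1.3
print(parse_price('$1,299.00'))  # 1299.0
print(parse_price('$19.99'))     # 19.99
```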
I check outliers now I guess.