Scraped 1200 products. 87 had prices like '$19.99extra'.
Got hired to scrape competitor pricing for an ecommerce client. Grabbed product names, prices, availability from 3 different sites. Ran the script overnight.
Woke up to a nice CSV with 1200 rows. Felt good.
Client downloads it. Five minutes later I get a message: "Your prices are broken."
Fun times.
Turns out HTML is a mess
Checked the CSV. Most prices looked fine: $19.99, $45.00, $129.95.
But scattered throughout were absolute gems:
- $19.99extra
- $45.00–
- Price: $129.95-
- $89 (where are the decimals)
- FREE (not even a number)
The sites had wildly different HTML. One put "extra savings" text inside the same span as the price. Another stuck a dash after clearance prices for no reason. Third one prefixed everything with "Price:" like it wasn't obvious already.
My scraper just grabbed .innerText and called it a day. Zero cleaning.
Bad call.
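For context, the extraction was about this naive. Here's a reconstructed sketch using stdlib html.parser (the real script used different tooling and selectors; the markup below is illustrative):

```python
from html.parser import HTMLParser

class PriceGrabber(HTMLParser):
    """Collects ALL text inside the price span -- junk included."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.text.append(data)

# One site nested "extra savings" inside the same span as the price
p = PriceGrabber()
p.feed('<span class="price">$19.99<em>extra savings</em></span>')
print("".join(p.text))  # $19.99extra savings
```

The grabber doesn't care about child elements, so any marketing text nested inside the price element rides along with the number.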
Tried the obvious fix first
Figured I'd just strip the junk:
price = price.replace('extra', '').replace('Price:', '').replace('–', '').strip()
Worked for maybe half of them.
Then found prices like "$1,299.00" where the comma got weird. Also didn't handle "FREE" or the missing decimal problem. Spent 20 minutes trying different string replacements. Got nowhere.
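For the record, here's roughly why the replace chain kept losing. Each replace only kills one known token, and every new site added a token I hadn't seen:

```python
def replace_chain(price):
    # Whack-a-mole: strip known junk one literal token at a time
    return price.replace('extra', '').replace('Price:', '').replace('–', '').strip()

print(replace_chain('$19.99extra'))  # '$19.99' -- fine
print(replace_chain('$1,299.00'))    # unchanged; float() still chokes on '$' and ','
print(replace_chain('FREE'))         # 'FREE' -- not a number at all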
What actually fixed it
Stopped trying to clean strings. Started extracting just the number part:
import re

def clean_price(raw_price):
    if not raw_price:
        return None
    # Just grab the first number (kill commas first)
    match = re.search(r'\d+\.?\d*', raw_price.replace(',', ''))
    if not match:
        return None
    # float() already turns "89" into 89.0, so whole-dollar
    # prices need no special case
    return round(float(match.group()), 2)
Ran this on all 1200 rows:
- $19.99extra became 19.99
- Price: $129.95 became 129.95
- $89 became 89.00
- FREE became None (filtered those out later)
Ended up with 1143 valid prices. The other 57 were "FREE", "Call for price", or just missing.
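Wiring the cleaner into the pipeline looked something like this (the column names and rows here are illustrative, not the client's actual file):

```python
import csv
import io
import re

def clean_price(raw_price):
    # Same extractor as above: pull the first number, commas stripped
    if not raw_price:
        return None
    match = re.search(r'\d+\.?\d*', raw_price.replace(',', ''))
    if not match:
        return None
    return round(float(match.group()), 2)

# Stand-in for the scraped CSV -- the real file had 1200 rows from three sites
raw = io.StringIO("name,price\nWidget,$19.99extra\nGadget,Price: $129.95\nFreebie,FREE\n")
rows = list(csv.DictReader(raw))
cleaned = [(r["name"], clean_price(r["price"])) for r in rows]
valid = [(name, price) for name, price in cleaned if price is not None]
print(valid)  # the FREE row drops out
```

Keeping the None rows around until the end (instead of dropping them during scraping) made it easy to count exactly how many failed and why.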
Client was happy. Finally.
The part I actually should've done
Validation before sending anything:
# Look for suspicious outliers
prices = [p for p in prices if p is not None]
avg = sum(prices) / len(prices)
outliers = [p for p in prices if p > avg * 10 or p < 1]
if outliers:
    print(f"Warning: {len(outliers)} weird prices")
    print(outliers[:5])
Would've caught the one site that formatted "$1,299" as "$1.299" somehow. My regex turned that into 1.30 instead of 1299.00. Nobody noticed until week 2.
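A separator-aware parse would have caught that case. This is a sketch of what I'd try now, not what the original script did, and the "dot followed by exactly three digits means thousands" heuristic is an assumption that only works if a site never sells anything priced like $1.299:

```python
import re

def parse_price(raw):
    # Pull the first run of digits with any embedded separators
    m = re.search(r'\d[\d.,]*', raw)
    if not m:
        return None
    num = m.group()
    # Heuristic: '1.299' style (groups of exactly three digits after each
    # dot, no comma) is European-style thousands, not a decimal
    if re.fullmatch(r'\d{1,3}(\.\d{3})+', num):
        num = num.replace('.', '')
    num = num.replace(',', '')  # commas are always thousands separators here
    return round(float(num), 2)

print(parse_price('$1.299'))     # 1299.0, not 1.3
print(parse_price('$1,299.00'))  # 1299.0
print(parse_price('$19.99'))     # 19.99
```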
I check outliers now I guess.