You finally got your scraper working. Data is flowing in. But when you open the output file, it's a mess — missing values, duplicate rows, inconsistent formats, and prices that say "$12.99" in one row and "12,99 EUR" in the next.
Welcome to the real work of web scraping: cleaning the data.
In this guide, I'll walk through practical techniques for turning raw scraped data into something you can actually use — with real Python code examples you can adapt to your own projects.
Why Scraped Data Is Always Messy
Unlike API responses with consistent schemas, scraped data inherits all the inconsistencies of the source websites:
- Missing fields — some product listings have reviews, others don't
- Inconsistent formats — dates as "March 5, 2026" vs "2026-03-05" vs "05/03/26"
- Encoding issues — UTF-8 vs Latin-1, HTML entities like &amp;amp;
- Duplicates — pagination overlaps, retry artifacts
- Type mismatches — prices as strings with currency symbols, quantities as text
Let's tackle each of these systematically.
Setting Up Your Environment
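The walkthrough below assumes pandas and Pydantic v2; the field_validator import is Pydantic 2 syntax and won't work on 1.x.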
import pandas as pd
from pydantic import BaseModel, field_validator, ValidationError
from typing import Optional
from datetime import datetime
import re
import html
Step 1: Load and Inspect Raw Data
First, load your scraped data and understand what you're dealing with:
# Load raw scraped data
df = pd.read_json("scraped_products.json")
# Quick inspection
print(f"Rows: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"\nNull counts:\n{df.isnull().sum()}")
print(f"\nDuplicates: {df.duplicated().sum()}")
print(f"\nSample data types:\n{df.dtypes}")
Output might look like:
Rows: 1547
Columns: ['name', 'price', 'rating', 'reviews', 'url', 'scraped_at']
Null counts:
name 3
price 47
rating 203
reviews 318
url 0
scraped_at 0
Duplicates: 89
Already you can see the damage: 3 missing names, 47 missing prices, 203 missing ratings, and 89 duplicate rows. Let's fix all of it.
Step 2: Remove Duplicates
Duplicates in scraped data usually come from pagination overlaps or retry logic. Handle them carefully:
# Simple dedup on URL (most reliable unique identifier)
df_deduped = df.drop_duplicates(subset=["url"], keep="last")
print(f"Removed {len(df) - len(df_deduped)} duplicates")
# For more nuanced dedup (e.g., same product, different scraped_at)
def smart_dedup(df, key_cols, sort_col="scraped_at"):
"""Keep the most recently scraped version of each item."""
df_sorted = df.sort_values(sort_col, ascending=False)
return df_sorted.drop_duplicates(subset=key_cols, keep="first")
df_clean = smart_dedup(df, key_cols=["name", "url"])
Step 3: Clean Price Data
Price fields are the #1 headache in scraped data. Here's a robust cleaner:
def clean_price(raw_price) -> Optional[float]:
"""Convert messy price strings to float values."""
if pd.isna(raw_price) or raw_price in ("", "N/A", "null"):
return None
price_str = str(raw_price)
# Remove currency symbols and whitespace
price_str = re.sub(r'[^\d.,]', '', price_str)
if not price_str:
return None
# Handle European format: 1.234,56 -> 1234.56
if ',' in price_str and '.' in price_str:
if price_str.rindex(',') > price_str.rindex('.'):
# European: 1.234,56
price_str = price_str.replace('.', '').replace(',', '.')
else:
# US: 1,234.56
price_str = price_str.replace(',', '')
elif ',' in price_str:
# Could be European decimal or US thousands
parts = price_str.split(',')
if len(parts[-1]) == 2:
price_str = price_str.replace(',', '.') # European decimal
else:
price_str = price_str.replace(',', '') # US thousands
try:
return round(float(price_str), 2)
except ValueError:
return None
# Apply to dataframe
df_clean["price"] = df_clean["price"].apply(clean_price)
# Test it
test_prices = ["$12.99", "12,99 EUR", "£1,234.56", "1.234,56€", "N/A", "", None]
for p in test_prices:
print(f"{str(p):>15} -> {clean_price(p)}")
Output:
$12.99 -> 12.99
12,99 EUR -> 12.99
£1,234.56 -> 1234.56
1.234,56€ -> 1234.56
N/A -> None
-> None
None -> None
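One thing clean_price deliberately throws away is the currency itself. If you need it, capture it into its own column before the apply(clean_price) call above overwrites the raw strings. A minimal sketch, assuming only a handful of currencies (extend the pattern map for your data):

# Map common symbols/codes to ISO currency codes (minimal; extend as needed)
CURRENCY_PATTERNS = {
    "USD": re.compile(r"\$|\bUSD\b"),
    "EUR": re.compile(r"€|\bEUR\b"),
    "GBP": re.compile(r"£|\bGBP\b"),
}

def extract_currency(raw_price) -> Optional[str]:
    """Return the first currency code matched in a raw price string, else None."""
    if pd.isna(raw_price):
        return None
    for code, pattern in CURRENCY_PATTERNS.items():
        if pattern.search(str(raw_price)):
            return code
    return None

# Run this before clean_price replaces the raw strings
df_clean["currency"] = df_clean["price"].apply(extract_currency)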
Step 4: Handle Missing Data
Not all missing data should be handled the same way:
# Strategy 1: Drop rows missing critical fields
df_clean = df_clean.dropna(subset=["name", "url"]) # These MUST exist
# Strategy 2: Fill with defaults for non-critical fields
df_clean["rating"] = pd.to_numeric(df_clean["rating"], errors="coerce")
df_clean["reviews"] = pd.to_numeric(df_clean["reviews"], errors="coerce").fillna(0).astype(int)
# Strategy 3: Flag missing data instead of guessing
df_clean["has_price"] = df_clean["price"].notna()
df_clean["has_rating"] = df_clean["rating"].notna()
print(f"Products with price: {df_clean['has_price'].sum()}/{len(df_clean)}")
print(f"Products with rating: {df_clean['has_rating'].sum()}/{len(df_clean)}")
Rule of thumb: Drop rows only when a critical field is missing. For optional fields, flag them or fill with sensible defaults. Never silently fill prices or ratings with averages — that creates misleading data.
Step 5: Validate with Pydantic
For production pipelines, use Pydantic to enforce your data schema. This catches issues early and gives you clear error messages:
class Product(BaseModel):
name: str
price: Optional[float] = None
rating: Optional[float] = None
reviews: int = 0
url: str
scraped_at: datetime
@field_validator("name")
@classmethod
def clean_name(cls, v):
# Decode HTML entities and normalize whitespace
v = html.unescape(v)
v = re.sub(r'\s+', ' ', v).strip()
if len(v) < 2:
raise ValueError("Product name too short")
return v
@field_validator("price")
@classmethod
def validate_price(cls, v):
if v is not None and (v < 0 or v > 1_000_000):
raise ValueError(f"Price {v} outside reasonable range")
return v
@field_validator("rating")
@classmethod
def validate_rating(cls, v):
if v is not None and (v < 0 or v > 5):
raise ValueError(f"Rating {v} outside 0-5 range")
return v
@field_validator("url")
@classmethod
def validate_url(cls, v):
if not v.startswith(("http://", "https://")):
raise ValueError(f"Invalid URL: {v}")
return v
def validate_products(raw_data: list[dict]) -> tuple[list[Product], list[dict]]:
"""Validate a list of raw product dicts. Returns (valid, errors)."""
valid = []
errors = []
for i, item in enumerate(raw_data):
try:
product = Product(**item)
valid.append(product)
except ValidationError as e:
errors.append({"row": i, "data": item, "errors": str(e)})
print(f"Valid: {len(valid)}, Errors: {len(errors)}")
return valid, errors
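A quick way to convince yourself the validators fire: feed in one deliberately bad record (the values here are made up for illustration):

bad_record = {
    "name": "Widget &amp;amp;  Gadget",
    "price": -5.0,
    "rating": 7.2,
    "url": "not-a-url",
    "scraped_at": "2026-03-05T12:00:00",
}
valid, errors = validate_products([bad_record])
# Prints "Valid: 0, Errors: 1": price, rating, and URL all fail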
Step 6: Normalize Text Fields
Scraped text is full of inconsistencies:
def normalize_text(text: str) -> str:
"""Clean up common scraped text issues."""
if not isinstance(text, str):
return str(text)
# Decode HTML entities
text = html.unescape(text)
# Remove zero-width characters
text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)
# Normalize whitespace (but preserve single newlines)
text = re.sub(r'[^\S\n]+', ' ', text)
text = re.sub(r'\n{3,}', '\n\n', text)
return text.strip()
def normalize_date(date_str: str) -> Optional[datetime]:
"""Parse dates from various formats into datetime objects."""
formats = [
"%Y-%m-%d",
"%Y-%m-%dT%H:%M:%S",
"%B %d, %Y",
"%b %d, %Y",
"%d/%m/%Y",
"%m/%d/%Y",
]
for fmt in formats:
try:
return datetime.strptime(date_str.strip(), fmt)
except ValueError:
continue
return None
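Both helpers in action on illustrative inputs. Note that ambiguous numeric dates resolve to whichever format matches first, which is day-first given the order above:

messy = "Caf\u00e9\u200b  &amp;amp;  Bar"
print(normalize_text(messy))  # -> "Café & Bar"

for d in ["March 5, 2026", "2026-03-05", "05/03/2026", "not a date"]:
    print(f"{d!r} -> {normalize_date(d)}")
# "05/03/2026" parses as %d/%m/%Y (March 5), not %m/%d/%Y (May 3)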
Step 7: Build a Complete Cleaning Pipeline
Put it all together into a reusable pipeline:
def clean_scraped_products(raw_file: str, output_file: str) -> dict:
"""Full cleaning pipeline for scraped product data."""
# Load
df = pd.read_json(raw_file)
stats = {"raw_count": len(df)}
# Deduplicate
df = df.drop_duplicates(subset=["url"], keep="last")
stats["after_dedup"] = len(df)
    # Drop rows missing critical fields first; normalize_text would
    # otherwise turn a NaN name into the literal string "nan" and
    # dropna would no longer catch it
    df = df.dropna(subset=["name", "url"])
    # Clean the remaining fields
    df["name"] = df["name"].apply(normalize_text)
    df["price"] = df["price"].apply(clean_price)
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df["reviews"] = pd.to_numeric(df["reviews"], errors="coerce").fillna(0).astype(int)
    stats["after_cleaning"] = len(df)
# Validate with Pydantic
records = df.to_dict(orient="records")
valid, errors = validate_products(records)
stats["valid"] = len(valid)
stats["validation_errors"] = len(errors)
# Save clean data
clean_df = pd.DataFrame([p.model_dump() for p in valid])
clean_df.to_json(output_file, orient="records", indent=2)
print(f"\nCleaning Summary:")
for k, v in stats.items():
print(f" {k}: {v}")
return stats
# Run it
stats = clean_scraped_products("scraped_products.json", "clean_products.json")
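One gap worth closing before you rely on this: the pipeline counts validation errors but discards the details. A small helper to persist them for debugging (the filename is just a placeholder):

import json

def save_validation_errors(errors: list[dict], path: str = "validation_errors.json") -> None:
    """Write validation failures to disk; default=str handles datetimes."""
    with open(path, "w") as f:
        json.dump(errors, f, indent=2, default=str)

# In clean_scraped_products, call save_validation_errors(errors)
# right after validate_products() to keep an audit trail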
Getting Clean Data from the Start
The best cleaning pipeline is one you barely need. If you control the scraping step, you can reduce downstream cleaning work:
- Use ScraperAPI for rendered JavaScript content — you get cleaner HTML to parse
- ScrapeOps provides monitoring and proxy management to reduce failed requests (which cause missing data)
- Structure your scraper output as typed objects from the start rather than raw strings (see the sketch below)
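For that last point, here's a minimal sketch of validating at scrape time instead of batch-cleaning later. The parse_listing helper is hypothetical, a stand-in for whatever HTML-to-dict extraction your scraper already does:

def scrape_listings(raw_listings) -> list[Product]:
    """Validate each item as it's scraped, not after the fact."""
    products = []
    for listing in raw_listings:
        raw = parse_listing(listing)  # hypothetical: your HTML-to-dict extraction
        try:
            products.append(Product(**raw))
        except ValidationError as e:
            print(f"Skipping bad listing: {e}")
    return products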
Common Pitfalls
1. Cleaning too aggressively
Don't delete data just because it looks weird. A product named "USB-C Cable (3-Pack) — 6ft" with special characters is valid. Clean encoding issues, don't strip legitimate content.
2. Silent type coercion
# Bad: silently converts "N/A" to NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# Better: log what you're dropping
bad_prices = df[pd.to_numeric(df["price"], errors="coerce").isna() & df["price"].notna()]
if len(bad_prices) > 0:
print(f"Warning: {len(bad_prices)} unparseable prices dropped")
print(bad_prices["price"].value_counts().head())
3. Not handling encoding
# Always specify encoding when reading files
df = pd.read_csv("data.csv", encoding="utf-8-sig") # Handles BOM
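If you don't know the encoding up front, failing over beats crashing. A rough sketch; latin-1 decodes any byte sequence, so it's a serviceable last resort, though it can mangle non-Western text:

try:
    df = pd.read_csv("data.csv", encoding="utf-8-sig")
except UnicodeDecodeError:
    # latin-1 never raises on decode; lossy but better than a crash
    df = pd.read_csv("data.csv", encoding="latin-1")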
4. Deduplicating on the wrong key
URLs are usually the best dedup key for product data. Names can have slight variations ("iPhone 16" vs "iPhone 16 ") that make them unreliable.
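If names are all you have, normalize them into a key first. A minimal sketch using casefold plus whitespace collapse; real catalogs usually need more rules:

def name_key(name: str) -> str:
    """Normalize a product name into a stable dedup key."""
    return re.sub(r"\s+", " ", name).strip().casefold()

df["name_key"] = df["name"].apply(name_key)
df = df.drop_duplicates(subset=["name_key"], keep="last")
# "iPhone 16" and "iPhone 16 " now collapse to the same key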
Key Takeaways
- Inspect before cleaning — understand what's broken before you fix it
- Clean prices carefully — international formats trip up naive parsers
- Use Pydantic for validation — catch bad data before it enters your pipeline
- Flag, don't fabricate — mark missing data instead of filling with guesses
- Build reusable pipelines — you'll clean data from every new source
- Log everything — when a row gets dropped, know why
The code examples in this article are production-tested patterns. Copy them, adapt them, and build your own cleaning toolkit. The investment pays off every time you start a new scraping project.
What's the messiest scraped data you've ever had to clean? Share your war stories in the comments!