DEV Community

agenthustler

How to Clean and Parse Web Scraped Data with Python in 2026

You finally got your scraper working. Data is flowing in. But when you open the output file, it's a mess — missing values, duplicate rows, inconsistent formats, and prices that say "$12.99" in one row and "12,99 EUR" in the next.

Welcome to the real work of web scraping: cleaning the data.

In this guide, I'll walk through practical techniques for turning raw scraped data into something you can actually use — with real Python code examples you can adapt to your own projects.

Why Scraped Data Is Always Messy

Unlike API responses with consistent schemas, scraped data inherits all the inconsistencies of the source websites:

  • Missing fields — some product listings have reviews, others don't
  • Inconsistent formats — dates as "March 5, 2026" vs "2026-03-05" vs "05/03/26"
  • Encoding issues — UTF-8 vs Latin-1, HTML entities like &amp;
  • Duplicates — pagination overlaps, retry artifacts
  • Type mismatches — prices as strings with currency symbols, quantities as text
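
To make that list concrete, here's what a few raw rows often look like side by side (hypothetical sample data, not from a real scrape):

```python
# Three illustrative raw records showing the problems above in one place
raw_records = [
    {"name": "USB-C Cable", "price": "$12.99", "date": "March 5, 2026", "reviews": "1,204"},
    {"name": "USB-C Cable", "price": "12,99 EUR", "date": "2026-03-05", "reviews": None},
    {"name": "Caf\u00e9 Mug &amp; Lid", "price": None, "date": "05/03/26", "reviews": "17"},
]

# Same logical fields, three different shapes: a duplicate name, two price
# notations, three date formats, a missing value, and an HTML entity.
for record in raw_records:
    print(record["price"], "|", record["date"])
```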

Let's tackle each of these systematically.

Setting Up Your Environment

import pandas as pd
from pydantic import BaseModel, field_validator, ValidationError
from typing import Optional
from datetime import datetime
import re
import html

Step 1: Load and Inspect Raw Data

First, load your scraped data and understand what you're dealing with:

# Load raw scraped data
df = pd.read_json("scraped_products.json")

# Quick inspection
print(f"Rows: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"\nNull counts:\n{df.isnull().sum()}")
print(f"\nDuplicates: {df.duplicated().sum()}")
print(f"\nSample data types:\n{df.dtypes}")

Output might look like:

Rows: 1547
Columns: ['name', 'price', 'rating', 'reviews', 'url', 'scraped_at']

Null counts:
name          3
price        47
rating      203
reviews     318
url           0
scraped_at    0

Duplicates: 89

Already you can see: 47 missing prices, 203 missing ratings, and 89 duplicates. Let's fix all of it.

Step 2: Remove Duplicates

Duplicates in scraped data usually come from pagination overlaps or retry logic. Handle them carefully:

# Simple dedup on URL (most reliable unique identifier)
df_deduped = df.drop_duplicates(subset=["url"], keep="last")
print(f"Removed {len(df) - len(df_deduped)} duplicates")

# For more nuanced dedup (e.g., same product, different scraped_at)
def smart_dedup(df, key_cols, sort_col="scraped_at"):
    """Keep the most recently scraped version of each item."""
    df_sorted = df.sort_values(sort_col, ascending=False)
    return df_sorted.drop_duplicates(subset=key_cols, keep="first")

df_clean = smart_dedup(df, key_cols=["name", "url"])

Step 3: Clean Price Data

Price fields are the #1 headache in scraped data. Here's a robust cleaner:

def clean_price(raw_price: str) -> Optional[float]:
    """Convert messy price strings to float values."""
    if pd.isna(raw_price) or raw_price in ("", "N/A", "null"):
        return None

    price_str = str(raw_price)

    # Remove currency symbols and whitespace
    price_str = re.sub(r'[^\d.,]', '', price_str)

    if not price_str:
        return None

    # Handle European format: 1.234,56 -> 1234.56
    if ',' in price_str and '.' in price_str:
        if price_str.rindex(',') > price_str.rindex('.'):
            # European: 1.234,56
            price_str = price_str.replace('.', '').replace(',', '.')
        else:
            # US: 1,234.56
            price_str = price_str.replace(',', '')
    elif ',' in price_str:
        # Could be European decimal or US thousands
        parts = price_str.split(',')
        if len(parts[-1]) == 2:
            price_str = price_str.replace(',', '.')  # European decimal
        else:
            price_str = price_str.replace(',', '')   # US thousands

    try:
        return round(float(price_str), 2)
    except ValueError:
        return None

# Apply to dataframe
df_clean["price"] = df_clean["price"].apply(clean_price)

# Test it
test_prices = ["$12.99", "12,99 EUR", "£1,234.56", "1.234,56€", "N/A", "", None]
for p in test_prices:
    print(f"{str(p):>15} -> {clean_price(p)}")

Output:

         $12.99 -> 12.99
      12,99 EUR -> 12.99
      £1,234.56 -> 1234.56
      1.234,56€ -> 1234.56
            N/A -> None
                -> None
           None -> None

Step 4: Handle Missing Data

Not all missing data should be handled the same way:

# Strategy 1: Drop rows missing critical fields
df_clean = df_clean.dropna(subset=["name", "url"])  # These MUST exist

# Strategy 2: Fill with defaults for non-critical fields
df_clean["rating"] = pd.to_numeric(df_clean["rating"], errors="coerce")
df_clean["reviews"] = pd.to_numeric(df_clean["reviews"], errors="coerce").fillna(0).astype(int)

# Strategy 3: Flag missing data instead of guessing
df_clean["has_price"] = df_clean["price"].notna()
df_clean["has_rating"] = df_clean["rating"].notna()

print(f"Products with price: {df_clean['has_price'].sum()}/{len(df_clean)}")
print(f"Products with rating: {df_clean['has_rating'].sum()}/{len(df_clean)}")

Rule of thumb: Drop rows only when a critical field is missing. For optional fields, flag them or fill with sensible defaults. Never silently fill prices or ratings with averages — that creates misleading data.

Step 5: Validate with Pydantic

For production pipelines, use Pydantic to enforce your data schema. This catches issues early and gives you clear error messages:

class Product(BaseModel):
    name: str
    price: Optional[float] = None
    rating: Optional[float] = None
    reviews: int = 0
    url: str
    scraped_at: datetime

    @field_validator("name")
    @classmethod
    def clean_name(cls, v):
        # Decode HTML entities and normalize whitespace
        v = html.unescape(v)
        v = re.sub(r'\s+', ' ', v).strip()
        if len(v) < 2:
            raise ValueError("Product name too short")
        return v

    @field_validator("price")
    @classmethod
    def validate_price(cls, v):
        if v is not None and (v < 0 or v > 1_000_000):
            raise ValueError(f"Price {v} outside reasonable range")
        return v

    @field_validator("rating")
    @classmethod
    def validate_rating(cls, v):
        if v is not None and (v < 0 or v > 5):
            raise ValueError(f"Rating {v} outside 0-5 range")
        return v

    @field_validator("url")
    @classmethod
    def validate_url(cls, v):
        if not v.startswith(("http://", "https://")):
            raise ValueError(f"Invalid URL: {v}")
        return v


def validate_products(raw_data: list[dict]) -> tuple[list[Product], list[dict]]:
    """Validate a list of raw product dicts. Returns (valid, errors)."""
    valid = []
    errors = []
    for i, item in enumerate(raw_data):
        try:
            product = Product(**item)
            valid.append(product)
        except ValidationError as e:
            errors.append({"row": i, "data": item, "errors": str(e)})

    print(f"Valid: {len(valid)}, Errors: {len(errors)}")
    return valid, errors
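
A quick smoke test shows how validation errors surface. The trimmed-down `MiniProduct` model here is just an illustration of the pattern, not part of the article's schema:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class MiniProduct(BaseModel):
    name: str
    rating: Optional[float] = None

    @field_validator("rating")
    @classmethod
    def validate_rating(cls, v):
        if v is not None and (v < 0 or v > 5):
            raise ValueError(f"Rating {v} outside 0-5 range")
        return v


# Valid input passes through unchanged
good = MiniProduct(name="USB-C Cable", rating=4.5)
print(good.rating)

# Invalid input raises ValidationError with a readable message
try:
    MiniProduct(name="Bad item", rating=9.7)
except ValidationError as e:
    print("caught", e.error_count(), "error")
```

The same try/except shape is what `validate_products` above uses to sort rows into `valid` and `errors` instead of crashing on the first bad record.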

Step 6: Normalize Text Fields

Scraped text is full of inconsistencies:

def normalize_text(text: str) -> str:
    """Clean up common scraped text issues."""
    if not isinstance(text, str):
        return str(text)

    # Decode HTML entities
    text = html.unescape(text)

    # Remove zero-width characters
    text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)

    # Normalize whitespace (but preserve single newlines)
    text = re.sub(r'[^\S\n]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def normalize_date(date_str: str) -> Optional[datetime]:
    """Parse dates from various formats into datetime objects."""
    formats = [
        "%Y-%m-%d",
        "%Y-%m-%dT%H:%M:%S",
        "%B %d, %Y",
        "%b %d, %Y",
        "%d/%m/%Y",
        "%m/%d/%Y",
    ]
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None
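
One subtlety with the format list: it's order-sensitive, and `%Y` only matches four-digit years, so a value like "05/03/26" from the intro falls through every format above and returns `None`. Adding `%d/%m/%y` (lowercase `y`) covers it. Here's a standalone check, with the parser repeated so the snippet runs on its own:

```python
from datetime import datetime
from typing import Optional


def normalize_date(date_str: str) -> Optional[datetime]:
    """Same parser as above, plus %d/%m/%y for two-digit years."""
    formats = [
        "%Y-%m-%d",
        "%B %d, %Y",
        "%d/%m/%Y",
        "%m/%d/%Y",
        "%d/%m/%y",  # two-digit years like "05/03/26"
    ]
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None


# "%d/%m/%Y" is tried before "%m/%d/%Y", so ambiguous dates parse
# day-first -- reorder the list to match your source site's convention.
print(normalize_date("05/03/26"))    # 2026-03-05 00:00:00
print(normalize_date("05/03/2026"))  # also 5 March, not 3 May
```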

Step 7: Build a Complete Cleaning Pipeline

Put it all together into a reusable pipeline:

def clean_scraped_products(raw_file: str, output_file: str) -> dict:
    """Full cleaning pipeline for scraped product data."""
    # Load
    df = pd.read_json(raw_file)
    stats = {"raw_count": len(df)}

    # Deduplicate
    df = df.drop_duplicates(subset=["url"], keep="last")
    stats["after_dedup"] = len(df)

    # Clean fields
    df["name"] = df["name"].apply(normalize_text)
    df["price"] = df["price"].apply(clean_price)
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df["reviews"] = pd.to_numeric(df["reviews"], errors="coerce").fillna(0).astype(int)

    # Drop rows missing critical fields
    df = df.dropna(subset=["name", "url"])
    stats["after_cleaning"] = len(df)

    # Validate with Pydantic
    records = df.to_dict(orient="records")
    valid, errors = validate_products(records)
    stats["valid"] = len(valid)
    stats["validation_errors"] = len(errors)

    # Save clean data
    clean_df = pd.DataFrame([p.model_dump() for p in valid])
    clean_df.to_json(output_file, orient="records", indent=2)

    print(f"\nCleaning Summary:")
    for k, v in stats.items():
        print(f"  {k}: {v}")

    return stats

# Run it
stats = clean_scraped_products("scraped_products.json", "clean_products.json")

Getting Clean Data from the Start

The best cleaning pipeline is one you barely need. If you control the scraping step, you can reduce downstream cleaning work:

  • Use ScraperAPI for rendered JavaScript content — you get cleaner HTML to parse
  • ScrapeOps provides monitoring and proxy management to reduce failed requests (which cause missing data)
  • Structure your scraper output as typed objects from the start rather than raw strings
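
For that last point, here's a minimal sketch of what "typed from the start" can look like. `RawProduct` and `parse_listing` are hypothetical names for illustration, not from a specific library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawProduct:
    name: str
    price: Optional[float]
    url: str


def parse_listing(name_text: str, price_text: str, url: str) -> RawProduct:
    """Convert raw strings to typed values as soon as they're extracted."""
    cleaned = price_text.replace("$", "").replace(",", "").strip()
    price = float(cleaned) if cleaned else None
    return RawProduct(name=name_text.strip(), price=price, url=url)


item = parse_listing("  USB-C Cable ", "$12.99", "https://example.com/p/1")
print(item)  # RawProduct(name='USB-C Cable', price=12.99, url='https://example.com/p/1')
```

Parsing at extraction time means a bad price fails loudly inside the scraper, right next to the HTML that produced it, instead of surfacing rows later in the cleaning pipeline.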

Common Pitfalls

1. Cleaning too aggressively

Don't delete data just because it looks weird. A product named "USB-C Cable (3-Pack) — 6ft" with special characters is valid. Clean encoding issues, don't strip legitimate content.

2. Silent type coercion

# Bad: silently converts "N/A" to NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Better: log what you're dropping
bad_prices = df[pd.to_numeric(df["price"], errors="coerce").isna() & df["price"].notna()]
if len(bad_prices) > 0:
    print(f"Warning: {len(bad_prices)} unparseable prices dropped")
    print(bad_prices["price"].value_counts().head())

3. Not handling encoding

# Always specify encoding when reading files
df = pd.read_csv("data.csv", encoding="utf-8-sig")  # Handles BOM

4. Deduplicating on the wrong key

URLs are usually the best dedup key for product data. Names can have slight variations ("iPhone 16" vs "iPhone 16 ") that make them unreliable.
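
If you do have to fall back on names, normalizing them into a throwaway key first makes the comparison less fragile. A sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows: same product, three surface forms of the name
df = pd.DataFrame({
    "name": ["iPhone 16", "iPhone 16 ", "IPHONE 16"],
    "url": ["https://a.com/1", "https://a.com/2", "https://a.com/3"],
})

# Case-fold and collapse whitespace into a temporary dedup key
df["name_key"] = (
    df["name"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)
deduped = df.drop_duplicates(subset=["name_key"]).drop(columns=["name_key"])
print(len(deduped))  # 1
```

The key column is dropped afterward, so the surviving row keeps its original display name.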

Key Takeaways

  1. Inspect before cleaning — understand what's broken before you fix it
  2. Clean prices carefully — international formats trip up naive parsers
  3. Use Pydantic for validation — catch bad data before it enters your pipeline
  4. Flag, don't fabricate — mark missing data instead of filling with guesses
  5. Build reusable pipelines — you'll clean data from every new source
  6. Log everything — when a row gets dropped, know why

The code examples in this article are production-tested patterns. Copy them, adapt them, and build your own cleaning toolkit. The investment pays off every time you start a new scraping project.


What's the messiest scraped data you've ever had to clean? Share your war stories in the comments!
