agenthustler

Posted on Mar 26 • Edited on Apr 19

How to Clean and Parse Web Scraped Data with Python in 2026

#webscraping #python #tutorial #beginners

You finally got your scraper working. Data is flowing in. But when you open the output file, it's a mess — missing values, duplicate rows, inconsistent formats, and prices that say "$12.99" in one row and "12,99 EUR" in the next.

Welcome to the real work of web scraping: cleaning the data.

In this guide, I'll walk through practical techniques for turning raw scraped data into something you can actually use — with real Python code examples you can adapt to your own projects.

Why Scraped Data Is Always Messy

Unlike API responses with consistent schemas, scraped data inherits all the inconsistencies of the source websites:

Missing fields — some product listings have reviews, others don't
Inconsistent formats — dates as "March 5, 2026" vs "2026-03-05" vs "05/03/26"
Encoding issues — UTF-8 vs Latin-1, HTML entities like &
Duplicates — pagination overlaps, retry artifacts
Type mismatches — prices as strings with currency symbols, quantities as text

Let's tackle each of these systematically.

Setting Up Your Environment

import pandas as pd
from pydantic import BaseModel, field_validator, ValidationError
from typing import Optional
from datetime import datetime
import re
import html

Step 1: Load and Inspect Raw Data

First, load your scraped data and understand what you're dealing with:

# Load raw scraped data
df = pd.read_json("scraped_products.json")

# Quick inspection
print(f"Rows: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"\nNull counts:\n{df.isnull().sum()}")
print(f"\nDuplicates: {df.duplicated().sum()}")
print(f"\nSample data types:\n{df.dtypes}")

Output might look like:

Rows: 1547
Columns: ['name', 'price', 'rating', 'reviews', 'url', 'scraped_at']

Null counts:
name          3
price        47
rating      203
reviews     318
url           0
scraped_at    0

Duplicates: 89

Already you can see: 47 missing prices, 203 missing ratings, and 89 duplicates. Let's fix all of it.

Step 2: Remove Duplicates

Duplicates in scraped data usually come from pagination overlaps or retry logic. Handle them carefully:

# Simple dedup on URL (most reliable unique identifier)
df_deduped = df.drop_duplicates(subset=["url"], keep="last")
print(f"Removed {len(df) - len(df_deduped)} duplicates")

# For more nuanced dedup (e.g., same product, different scraped_at)
def smart_dedup(df, key_cols, sort_col="scraped_at"):
    """Keep the most recently scraped version of each item."""
    df_sorted = df.sort_values(sort_col, ascending=False)
    return df_sorted.drop_duplicates(subset=key_cols, keep="first")

df_clean = smart_dedup(df, key_cols=["name", "url"])

Step 3: Clean Price Data

Price fields are the #1 headache in scraped data. Here's a robust cleaner:

def clean_price(raw_price: str) -> Optional[float]:
    """Convert messy price strings to float values."""
    if pd.isna(raw_price) or raw_price in ("", "N/A", "null"):
        return None

    price_str = str(raw_price)

    # Remove currency symbols and whitespace
    price_str = re.sub(r'[^\d.,]', '', price_str)

    if not price_str:
        return None

    # Handle European format: 1.234,56 -> 1234.56
    if ',' in price_str and '.' in price_str:
        if price_str.rindex(',') > price_str.rindex('.'):
            # European: 1.234,56
            price_str = price_str.replace('.', '').replace(',', '.')
        else:
            # US: 1,234.56
            price_str = price_str.replace(',', '')
    elif ',' in price_str:
        # Could be European decimal or US thousands
        parts = price_str.split(',')
        if len(parts[-1]) == 2:
            price_str = price_str.replace(',', '.')  # European decimal
        else:
            price_str = price_str.replace(',', '')   # US thousands

    try:
        return round(float(price_str), 2)
    except ValueError:
        return None

# Apply to dataframe
df_clean["price"] = df_clean["price"].apply(clean_price)

# Test it
test_prices = ["$12.99", "12,99 EUR", "£1,234.56", "1.234,56€", "N/A", "", None]
for p in test_prices:
    print(f"{str(p):>15} -> {clean_price(p)}")

Output:

         $12.99 -> 12.99
      12,99 EUR -> 12.99
      £1,234.56 -> 1234.56
       1.234,56€ -> 1234.56
            N/A -> None
                -> None
           None -> None

Step 4: Handle Missing Data

Not all missing data should be handled the same way:

# Strategy 1: Drop rows missing critical fields
df_clean = df_clean.dropna(subset=["name", "url"])  # These MUST exist

# Strategy 2: Fill with defaults for non-critical fields
df_clean["rating"] = pd.to_numeric(df_clean["rating"], errors="coerce")
df_clean["reviews"] = pd.to_numeric(df_clean["reviews"], errors="coerce").fillna(0).astype(int)

# Strategy 3: Flag missing data instead of guessing
df_clean["has_price"] = df_clean["price"].notna()
df_clean["has_rating"] = df_clean["rating"].notna()

print(f"Products with price: {df_clean['has_price'].sum()}/{len(df_clean)}")
print(f"Products with rating: {df_clean['has_rating'].sum()}/{len(df_clean)}")

Rule of thumb: Drop rows only when a critical field is missing. For optional fields, flag them or fill with sensible defaults. Never silently fill prices or ratings with averages — that creates misleading data.

Step 5: Validate with Pydantic

For production pipelines, use Pydantic to enforce your data schema. This catches issues early and gives you clear error messages:

class Product(BaseModel):
    name: str
    price: Optional[float] = None
    rating: Optional[float] = None
    reviews: int = 0
    url: str
    scraped_at: datetime

    @field_validator("name")
    @classmethod
    def clean_name(cls, v):
        # Decode HTML entities and normalize whitespace
        v = html.unescape(v)
        v = re.sub(r'\s+', ' ', v).strip()
        if len(v) < 2:
            raise ValueError("Product name too short")
        return v

    @field_validator("price")
    @classmethod
    def validate_price(cls, v):
        if v is not None and (v < 0 or v > 1_000_000):
            raise ValueError(f"Price {v} outside reasonable range")
        return v

    @field_validator("rating")
    @classmethod
    def validate_rating(cls, v):
        if v is not None and (v < 0 or v > 5):
            raise ValueError(f"Rating {v} outside 0-5 range")
        return v

    @field_validator("url")
    @classmethod
    def validate_url(cls, v):
        if not v.startswith(("http://", "https://")):
            raise ValueError(f"Invalid URL: {v}")
        return v


def validate_products(raw_data: list[dict]) -> tuple[list[Product], list[dict]]:
    """Validate a list of raw product dicts. Returns (valid, errors)."""
    valid = []
    errors = []
    for i, item in enumerate(raw_data):
        try:
            product = Product(**item)
            valid.append(product)
        except ValidationError as e:
            errors.append({"row": i, "data": item, "errors": str(e)})

    print(f"Valid: {len(valid)}, Errors: {len(errors)}")
    return valid, errors

Step 6: Normalize Text Fields

Scraped text is full of inconsistencies:

def normalize_text(text: str) -> str:
    """Clean up common scraped text issues."""
    if not isinstance(text, str):
        return str(text)

    # Decode HTML entities
    text = html.unescape(text)

    # Remove zero-width characters
    text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)

    # Normalize whitespace (but preserve single newlines)
    text = re.sub(r'[^\S\n]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()


def normalize_date(date_str: str) -> Optional[datetime]:
    """Parse dates from various formats into datetime objects."""
    formats = [
        "%Y-%m-%d",
        "%Y-%m-%dT%H:%M:%S",
        "%B %d, %Y",
        "%b %d, %Y",
        "%d/%m/%Y",
        "%m/%d/%Y",
    ]
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None

Step 7: Build a Complete Cleaning Pipeline

Put it all together into a reusable pipeline:

def clean_scraped_products(raw_file: str, output_file: str) -> dict:
    """Full cleaning pipeline for scraped product data."""
    # Load
    df = pd.read_json(raw_file)
    stats = {"raw_count": len(df)}

    # Deduplicate
    df = df.drop_duplicates(subset=["url"], keep="last")
    stats["after_dedup"] = len(df)

    # Clean fields
    df["name"] = df["name"].apply(normalize_text)
    df["price"] = df["price"].apply(clean_price)
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df["reviews"] = pd.to_numeric(df["reviews"], errors="coerce").fillna(0).astype(int)

    # Drop rows missing critical fields
    df = df.dropna(subset=["name", "url"])
    stats["after_cleaning"] = len(df)

    # Validate with Pydantic
    records = df.to_dict(orient="records")
    valid, errors = validate_products(records)
    stats["valid"] = len(valid)
    stats["validation_errors"] = len(errors)

    # Save clean data
    clean_df = pd.DataFrame([p.model_dump() for p in valid])
    clean_df.to_json(output_file, orient="records", indent=2)

    print(f"\nCleaning Summary:")
    for k, v in stats.items():
        print(f"  {k}: {v}")

    return stats

# Run it
stats = clean_scraped_products("scraped_products.json", "clean_products.json")

Getting Clean Data from the Start

The best cleaning pipeline is one you barely need. If you control the scraping step, you can reduce downstream cleaning work:

Use ScraperAPI for rendered JavaScript content — you get cleaner HTML to parse
ScrapeOps provides monitoring and proxy management to reduce failed requests (which cause missing data)
Structure your scraper output as typed objects from the start rather than raw strings

Common Pitfalls

1. Cleaning too aggressively

Don't delete data just because it looks weird. A product named "USB-C Cable (3-Pack) — 6ft" with special characters is valid. Clean encoding issues, don't strip legitimate content.

2. Silent type coercion

# Bad: silently converts "N/A" to NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Better: log what you're dropping
bad_prices = df[pd.to_numeric(df["price"], errors="coerce").isna() & df["price"].notna()]
if len(bad_prices) > 0:
    print(f"Warning: {len(bad_prices)} unparseable prices dropped")
    print(bad_prices["price"].value_counts().head())

3. Not handling encoding

# Always specify encoding when reading files
df = pd.read_csv("data.csv", encoding="utf-8-sig")  # Handles BOM

4. Deduplicating on the wrong key

URLs are usually the best dedup key for product data. Names can have slight variations ("iPhone 16" vs "iPhone 16 ") that make them unreliable.

Key Takeaways

Inspect before cleaning — understand what's broken before you fix it
Clean prices carefully — international formats trip up naive parsers
Use Pydantic for validation — catch bad data before it enters your pipeline
Flag, don't fabricate — mark missing data instead of filling with guesses
Build reusable pipelines — you'll clean data from every new source
Log everything — when a row gets dropped, know why

The code examples in this article are production-tested patterns. Copy them, adapt them, and build your own cleaning toolkit. The investment pays off every time you start a new scraping project.

What's the messiest scraped data you've ever had to clean? Share your war stories in the comments!

Skip the Build

You don't have to reinvent this. We maintain a production-grade scraper as an Apify actor — proxies, anti-bot, retries, and schema all handled. You can run it on a pay-per-result basis and get clean JSON without writing a single line of scraping code.

our Apify scrapers on Apify

DEV Community