
Oddshop

Originally published at oddshop.work

How to Clean Scraped Job Data with Python for Analysis


You've scraped Amazon's careers page and now have a mess of duplicate entries, broken HTML, and inconsistent date formats. The job listings are scattered across multiple rows, descriptions are full of <br> tags and strange line breaks, and some dates are in MM/DD/YYYY while others are DD-MM-YYYY. You need clean data for analysis, but the raw scrape is unusable as-is.

The Manual Way (And Why It Breaks)

Most developers try to clean this by hand: copying and pasting into spreadsheets, deleting rows manually, or running quick find-and-replace passes in Excel or Notepad++. This is slow and error-prone. When scraping at scale you also hit API rate limits or get blocked, so you end up with one massive file and no real way to automate the cleanup. You can spend hours cleaning data that a tool would handle in minutes.

The Python Approach

Here’s a simplified version of what a developer might write to clean a few rows of job data in Python:

import pandas as pd
import re
from dateutil import parser

# Load raw data
df = pd.read_csv("messy_listings.csv")

# Remove duplicates by job ID and title
df.drop_duplicates(subset=["job_id", "title"], keep="first", inplace=True)

# Clean HTML from description
df["description"] = df["description"].apply(lambda x: re.sub(r"<.*?>", "", str(x)))

# Normalize whitespace
df["description"] = df["description"].apply(lambda x: re.sub(r"\s+", " ", str(x)).strip())

# Standardize date formats
def parse_date(date_str):
    try:
        return parser.parse(str(date_str)).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        # Unparseable or out-of-range dates become None instead of crashing
        return None

df["posted_date"] = df["posted_date"].apply(parse_date)

# Save clean data
df.to_csv("clean_listings.csv", index=False)

This script covers deduplication, HTML stripping, and date standardization, but it assumes well-formed input: missing descriptions come through as the literal string "nan", unparseable dates silently become None, and nothing validates location data or catches malformed job IDs and inconsistent fields.
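A slightly more defensive pass can plug those two gaps. This is a minimal sketch, not the marketplace tool; the `clean_defensively` name and the "alphanumeric plus dashes" job-ID pattern are assumptions about what a well-formed row looks like:

```python
import pandas as pd

def clean_defensively(df: pd.DataFrame) -> pd.DataFrame:
    """Patch the gaps the quick script leaves open."""
    df = df.copy()
    # Missing descriptions become empty strings instead of the literal "nan"
    df["description"] = df["description"].fillna("")
    # Keep only rows whose job_id is present and looks like a plain ID
    # (alphanumeric plus dashes -- the exact pattern is an assumption)
    valid_id = df["job_id"].notna() & df["job_id"].astype(str).str.fullmatch(r"[A-Za-z0-9-]+")
    return df[valid_id].reset_index(drop=True)

raw = pd.DataFrame({
    "job_id": ["AMZ-001", None, "??"],
    "description": ["Build things", None, "Ship things"],
})
clean = clean_defensively(raw)  # only the AMZ-001 row survives
```

Running this before the dedup/date steps means the rest of the pipeline only ever sees rows worth keeping.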

What the Full Tool Handles

The Marketplace Job Listings Data Cleaner goes beyond what a basic script can do:

  • Handles edge cases: missing fields, malformed dates, inconsistent data types
  • Supports multiple input formats: CSV, JSON, and Excel (.xlsx)
  • Offers both CLI and programmatic usage
  • Provides configurable output formats: CSV or JSON
  • Includes automated duplicate detection using fuzzy matching
  • Validates and splits location fields into city, state, country
  • Gracefully continues on errors without crashing
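To give a feel for two of those features, here's how fuzzy duplicate detection and location splitting can be approximated in plain Python. This sketch uses the standard library's difflib, not necessarily what the tool uses internally, and it assumes locations follow a "City, State, Country" pattern; treat the 0.9 similarity threshold as a tunable guess:

```python
from difflib import SequenceMatcher

def is_near_duplicate(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    # Ratio of matching characters across both strings; 1.0 means identical
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio() >= threshold

def split_location(location: str) -> dict:
    # Assumes "City, State, Country"; missing parts fall back to empty strings
    parts = [p.strip() for p in location.split(",")]
    parts += [""] * (3 - len(parts))
    return {"city": parts[0], "state": parts[1], "country": parts[2]}
```

So `is_near_duplicate("Senior Software Engineer", "Senior Software Engineer ")` catches the trailing-whitespace dupes that exact matching on `job_id` and `title` misses, and `split_location("Seattle, WA, USA")` yields separate city/state/country fields ready for grouping.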

Running It

You can run the tool on your scraped file from the command line like this:

amazon_job_cleaner --input messy_listings.csv --output clean_listings.json

The tool accepts --input and --output flags, with optional --format to specify input (CSV, JSON, XLSX) or output (CSV, JSON). It processes everything in one go, outputting a clean, validated file with no extra steps.

Results

You save hours of manual cleaning and eliminate a whole class of copy-paste errors. The tool produces a single clean CSV or JSON file with every job listing standardized and ready for analysis: duplicates removed, dates in ISO format, and all HTML artifacts stripped. Everything you need to start building reports or dashboards.

Get the Script

If you're tired of reinventing the wheel every time you scrape and clean data, this tool is the upgrade you’ve been looking for. It's a polished version of what you just read — ready to handle the messy, real-world data you’ll encounter.

Download Marketplace Job Listings Data Cleaner →

$29 one-time. No subscription. Works on Windows, Mac, and Linux.

Built by OddShop — Python automation tools for developers and businesses.
