Oddshop

Posted on • Originally published at oddshop.work

How to Clean Marketplace Job Data with Python

Marketplace job data often arrives as a chaotic mess — inconsistent formats, broken HTML, and duplicate entries that make analysis impossible. If you've ever spent hours cleaning scraped job listings from Amazon's career pages, you know the frustration. A tool that automates this cleanup is more than helpful — it's vital.

The Manual Way (And Why It Breaks)

Manually processing scraped job listings is tedious and error-prone. You end up downloading raw CSVs, opening them in Excel, and painstakingly removing duplicates one by one. Copy-pasting descriptions, dealing with broken HTML tags, and manually standardizing date formats can take days. For data analysts or Python developers working with career page scraping, this workflow wastes time and introduces human errors. The process is especially unwieldy when dealing with hundreds of job entries, making manual data preprocessing a bottleneck in any analyst’s pipeline.

The Python Approach

We can automate basic cleanup with a few lines of Python code — a solid first step in data preprocessing. Here's a snippet that mimics part of what the marketplace job data cleaner does:

import pandas as pd
from datetime import datetime
import re

# Load the raw job listings
df = pd.read_csv("raw_job_listings.csv")

# Remove duplicate job titles and IDs
df.drop_duplicates(subset=["job_id", "title"], keep="first", inplace=True)

# Standardize date formats to ISO 8601
df["posted_date"] = pd.to_datetime(df["posted_date"], errors="coerce").dt.strftime("%Y-%m-%d")

# Clean HTML from description fields (fill missing values first so NaN
# doesn't become the literal string "nan")
df["description"] = df["description"].fillna("").astype(str)
df["description"] = df["description"].str.replace(r"<.*?>", "", regex=True)

# Normalize whitespace in descriptions
df["description"] = df["description"].str.replace(r"\s+", " ", regex=True).str.strip()

# Export cleaned data to CSV
df.to_csv("cleaned_job_listings.csv", index=False)

This script handles deduplication, date standardization, and basic HTML cleaning. But it won’t fully tackle validating location data, handling malformed entries, or exporting in multiple formats — tasks that grow more complex as datasets scale.
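Location validation, which the snippet above skips, can be approximated with plain string splitting. The `split_location` helper below is an illustrative sketch, not part of the tool: it assumes locations arrive as "City, State, Country" or "City, Country" strings, which won't hold for every marketplace listing.

```python
import pandas as pd

def split_location(loc):
    """Split a 'City, State, Country' string into separate fields.

    Two-part strings are treated as 'City, Country'; anything else
    keeps only the city. Missing parts become None.
    """
    parts = [p.strip() for p in str(loc).split(",")]
    if len(parts) == 3:
        city, state, country = parts
    elif len(parts) == 2:
        city, state, country = parts[0], None, parts[1]
    else:
        city, state, country = parts[0], None, None
    return pd.Series([city, state, country], index=["city", "state", "country"])

df = pd.DataFrame({"location": ["Seattle, WA, USA", "Berlin, Germany"]})
# apply() returning a Series expands into three new columns
df[["city", "state", "country"]] = df["location"].apply(split_location)
```

Real-world location strings are messier than this (think "Remote", "US-WA-Seattle", or multi-city postings), which is exactly why structured parsing is worth automating.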

What the Full Tool Handles

This tool does more than basic Python cleanup. It:

  • Deduplicates listings by job ID and title — removes both exact and fuzzy duplicates.
  • Standardizes date formats — converts various string formats into ISO 8601.
  • Cleans HTML artifacts — strips tags and normalizes whitespace in description fields.
  • Validates and structures location data — parses city, state, country into separate columns.
  • Exports to clean CSV or JSON — outputs a consistent, analysis-ready file.
  • Handles edge cases in marketplace job data that manual scripts often miss.
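Fuzzy deduplication, for instance, can be roughed out in the standard library with `difflib`. The tool's actual matching logic isn't published; `drop_fuzzy_duplicates` below is a made-up helper showing one possible approach, keeping the first title seen and dropping later titles that are near-identical.

```python
from difflib import SequenceMatcher

def is_fuzzy_dup(a, b, threshold=0.9):
    """Treat two titles as duplicates if their similarity ratio
    meets the threshold (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def drop_fuzzy_duplicates(titles, threshold=0.9):
    """Keep the first occurrence of each group of near-identical titles."""
    kept = []
    for title in titles:
        if not any(is_fuzzy_dup(title, k, threshold) for k in kept):
            kept.append(title)
    return kept

titles = ["Senior Data Engineer", "Senior  Data Engineer", "Product Manager"]
deduped = drop_fuzzy_duplicates(titles)
```

Note this pairwise scan is O(n²), fine for hundreds of listings but worth replacing with token-based blocking or a library like `rapidfuzz` at larger scale.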

Running It

To use the tool, run this command in your terminal:

amazon_job_cleaner --input messy_listings.csv --output clean_listings.json

The --input flag specifies the source file with messy job listings, and --output sets the destination for the cleaned dataset. The tool outputs a clean JSON file with consistent structure, ready for analysis or further processing.
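The tool itself is closed source, but if you wanted to wrap the earlier pandas snippet in a similar command-line interface, `argparse` makes that straightforward. Everything here (the `main` function, the flag handling, the format switch on the output extension) is an illustrative sketch under those assumptions, not the tool's actual code.

```python
import argparse

import pandas as pd

def main(argv=None):
    """Minimal CLI: read a messy CSV, deduplicate, write CSV or JSON."""
    parser = argparse.ArgumentParser(description="Clean marketplace job listings")
    parser.add_argument("--input", required=True, help="Path to the messy source CSV")
    parser.add_argument("--output", required=True, help="Destination .csv or .json file")
    args = parser.parse_args(argv)

    df = pd.read_csv(args.input)
    df = df.drop_duplicates(subset=["job_id", "title"], keep="first")

    # Pick the export format from the output file extension
    if args.output.endswith(".json"):
        df.to_json(args.output, orient="records", indent=2)
    else:
        df.to_csv(args.output, index=False)

if __name__ == "__main__":
    main()
```

Passing `argv` into `main()` rather than reading `sys.argv` directly keeps the function testable without touching the real command line.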

Get the Script

If you're just starting out, this code gives you a foundation for working with scraped job listings. But the full tool handles many edge cases — and it's already built for you.

Download Marketplace Job Listings Data Cleaner →

$29 one-time. No subscription. Works on Windows, Mac, and Linux.


Built by OddShop — Python automation tools for developers and businesses.
