
Oddshop

Posted on • Originally published at oddshop.work

How to Clean Amazon Job Listings Data with Python

Working with scraped job data from Amazon careers pages can turn your analysis project into a nightmare of inconsistent formats, duplicate entries, and malformed dates. A Python data cleaner becomes essential when you're dealing with thousands of messy listings that need to be transformed into reliable datasets before any meaningful analysis can happen.

The Manual Way (And Why It Breaks)

Most developers start by manually cleaning scraped Amazon careers data using basic pandas operations and regex patterns. You'll spend hours writing individual functions to strip HTML tags from descriptions, then create complex deduplication logic to catch jobs that appear multiple times with slight variations. The job scraping process often introduces inconsistent date formats—some entries show "Jan 15, 2024", others "2024-01-15", and some have "15 days ago" strings that break your analysis pipeline. When you finally get the dates standardized, you discover location fields contain mixed formats like "Seattle, WA, USA", "Seattle, Washington", and "Seattle|WA|US" that require separate parsing logic. This manual approach breaks down quickly as your dataset grows, creating maintenance overhead that diverts time from actual analysis work.
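Even the date problem alone is fiddlier than it looks. As a rough sketch, here's what normalizing those three formats could involve (the `normalize_posted_date` helper below is a hypothetical illustration for this post, not part of any tool):

```python
import re
from datetime import datetime, timedelta

def normalize_posted_date(raw, today=None):
    """Return an ISO 8601 date string for a messy posted-date value, or None."""
    today = today or datetime.now()
    raw = str(raw).strip()

    # Relative dates like "15 days ago" need arithmetic, not parsing
    m = re.match(r'(\d+)\s+days?\s+ago', raw, re.IGNORECASE)
    if m:
        return (today - timedelta(days=int(m.group(1)))).strftime('%Y-%m-%d')

    # Try a few known absolute formats in turn
    for fmt in ('%Y-%m-%d', '%b %d, %Y', '%B %d, %Y'):
        try:
            return datetime.strptime(raw, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return None  # unrecognized format — flag for manual review
```

And that still only covers three formats; every new scrape tends to surface another one.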

The Python Approach

Here's a basic script to handle the most common cleaning operations:

import pandas as pd
from datetime import datetime
import re
import html

def clean_amazon_jobs(input_file):
    df = pd.read_csv(input_file)

    # Strip HTML tags and normalize whitespace
    df['description'] = df['description'].apply(
        lambda x: html.unescape(re.sub(r'<[^>]+>', '', str(x))) if pd.notna(x) else x
    )
    df['description'] = df['description'].apply(
        lambda x: ' '.join(str(x).split()) if pd.notna(x) else x
    )

    # Basic deduplication by job ID
    df = df.drop_duplicates(subset=['job_id'], keep='first')

    # Convert date strings to a standard datetime; note that relative
    # strings like "15 days ago" are coerced to NaT and need separate handling
    df['posted_date'] = pd.to_datetime(df['posted_date'], errors='coerce')

    # Simple location parsing
    df[['city', 'state', 'country']] = df['location'].str.extract(
        r'([^,]+),\s*([^,]+),\s*(.+)', expand=True
    )

    return df

This script handles HTML tag removal, basic deduplication, and simple date conversion. However, it lacks sophisticated fuzzy matching for near-duplicate job titles, comprehensive date format detection, and proper location validation that real-world data requires.
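To give a sense of what fuzzy title matching involves, here's a minimal sketch using `difflib` from the standard library as a stand-in for a heavier similarity library (the helper names are hypothetical, and a real pipeline would tune the threshold against labeled examples):

```python
from difflib import SequenceMatcher

def is_near_duplicate(title_a, title_b, threshold=0.9):
    """Flag two job titles whose similarity ratio meets the threshold."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedupe_titles(titles, threshold=0.9):
    """Keep the first occurrence of each cluster of near-identical titles."""
    kept = []
    for title in titles:
        if not any(is_near_duplicate(title, k, threshold) for k in kept):
            kept.append(title)
    return kept
```

This pairwise approach is O(n²), which is another reason it stops scaling once the dataset grows past a few thousand rows.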

What the Full Tool Handles

Deduplicate listings by job ID and title — removes exact and fuzzy duplicates using advanced matching algorithms

Standardize date formats — converts various string formats to ISO 8601 including relative dates like "3 days ago"

Clean HTML artifacts — strips tags and normalizes whitespace from description fields while preserving formatting intent

Validate and structure location data — parses city, state, country into separate columns with geographic validation

Export to clean CSV or JSON — outputs consistent, analysis-ready files with proper schema validation
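
The location bullet above hides the messiest part: mixed delimiters. A minimal delimiter-tolerant sketch (the `parse_location` helper is hypothetical and skips the geographic validation a real tool would add) looks something like:

```python
import re

def parse_location(raw):
    """Split a location string on commas or pipes into (city, state, country)."""
    parts = [p.strip() for p in re.split(r'[|,]', str(raw)) if p.strip()]
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        # e.g. "Seattle, Washington" — no country component given
        return parts[0], parts[1], None
    return (raw, None, None)  # unparseable — keep the raw value for review
```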

The Python data cleaner handles edge cases that manual scripts miss, such as detecting job postings that appear across multiple regions with different IDs but identical content.
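One common way to catch those cross-region duplicates is to fingerprint the normalized content rather than the job ID, so identical postings collide regardless of which regional ID they carry. A minimal sketch (a hypothetical helper, not the tool's actual implementation):

```python
import hashlib

def content_fingerprint(title, description):
    """Hash normalized title + description so identical postings collide
    even when their job IDs differ across regions."""
    normalized = ' '.join((title + ' ' + description).lower().split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()
```

Grouping rows by this fingerprint then surfaces duplicates that ID-based deduplication can never see.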

Running It

amazon_job_cleaner --input messy_listings.csv --output clean_listings.json

The command accepts input files in CSV, TSV, or JSON formats and allows you to specify output format. Use --format json or --format csv to control the export type, and the tool will validate your cleaned data before writing the final file.

Get the Script

Skip the build phase and get production-ready results immediately.

Download Marketplace Job Listings Data Cleaner →

$29 one-time. No subscription. Works on Windows, Mac, and Linux.


Built by OddShop — Python automation tools for developers and businesses.
