Imagine you've spent hours scraping data from a website—product names, prices, and descriptions. But when you look at the dataset, it feels incomplete. This is the reality of raw scraped data: useful, but limited. To unlock its full potential, you need data enrichment.
In this tutorial, we'll walk through data enrichment techniques using Python.
What Is Data Enrichment?
Data enrichment is the process of enhancing raw data by adding relevant information from external sources:
- Geocoding addresses to get latitude/longitude
- Appending product categories using a database
- Validating email formats with regex or external tools
- Enhancing user profiles with demographic data from APIs
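As a quick taste of the third item, email formats can be checked with Python's built-in re module. The pattern below is a deliberately simple illustration, not a full RFC 5322 validator:

```python
import re

# Simple illustrative pattern: local part, "@", then a dot-separated domain.
# Real-world email validation is far more nuanced than this.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def is_valid_email(address: str) -> bool:
    """Return True if the address matches our basic email pattern."""
    return bool(EMAIL_RE.match(address))

print(is_valid_email("jane.doe@example.com"))  # True
print(is_valid_email("not-an-email"))          # False
```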
Why Enrich Data?
- Improves decision-making: Enriched data provides context
- Reduces errors: Cleaning and validating data early prevents downstream issues
- Boosts ML models: Feature-rich datasets yield better predictions
- Unlocks new insights: Merging datasets reveals hidden patterns
Practical Example: Enriching Scraped Product Data
Assume you've scraped products from an e-commerce site into a CSV file with columns such as Product Name, Price, and Description:
Step 1: Install Required Libraries
```
pip install pandas requests
```
Step 2: Load Raw Data
```python
import pandas as pd

df = pd.read_csv("products.csv")
print(df.head())
```
Step 3: Enrich with Product Categories
```python
import requests

def get_category(product_name):
    # Placeholder endpoint -- replace with your real category-lookup API
    url = "https://api.example.com/product-category"
    payload = {"product_name": product_name}
    response = requests.post(url, json=payload, timeout=10)
    if response.status_code == 200:
        return response.json().get("category", "Unknown")
    return "Unknown"

df["Category"] = df["Product Name"].apply(get_category)
```
Step 4: Add Brand Names
```python
def get_brand(product_name):
    # Placeholder endpoint -- replace with your real brand-lookup API
    url = "https://api.example.com/product-brand"
    payload = {"product_name": product_name}
    response = requests.post(url, json=payload, timeout=10)
    if response.status_code == 200:
        return response.json().get("brand", "Unknown")
    return "Unknown"

df["Brand"] = df["Product Name"].apply(get_brand)
```
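Calling the API once per row quickly becomes wasteful when product names repeat. A minimal sketch of memoizing lookups with functools.lru_cache — the fetch logic here is a stub standing in for the real requests.post call above:

```python
from functools import lru_cache

calls = 0  # counts how many "API requests" actually go out

@lru_cache(maxsize=None)
def get_category_cached(product_name):
    """Memoized lookup: repeated product names hit the cache, not the API."""
    global calls
    calls += 1
    # Stub lookup table standing in for the real API round-trip
    return {"iPhone 15": "Electronics"}.get(product_name, "Unknown")

for name in ["iPhone 15", "iPhone 15", "Mystery Gadget"]:
    get_category_cached(name)

print(calls)  # 2: the duplicate "iPhone 15" was served from the cache
```

With the stub replaced by the real request, `df["Product Name"].apply(get_category_cached)` works the same way as before but skips redundant network calls.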
Step 5: Save the Enriched Data
```python
df.to_csv("enriched_products.csv", index=False)
```
Merging Datasets with Pandas
```python
customers = pd.read_csv("customers.csv")
demographics = pd.read_csv("demographics.csv")

merged_data = pd.merge(customers, demographics, on="email", how="left")
print(merged_data.head())
```
Best Practice: Always check for missing values after merging with `merged_data.isnull().sum()`.
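To see why that check matters, here is a tiny in-memory example (column names invented for illustration) where one customer has no matching demographics row:

```python
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "name": ["Ada", "Ben"],
})
demographics = pd.DataFrame({
    "email": ["a@example.com"],
    "age": [34],
})

# Left join keeps every customer, matched or not
merged = pd.merge(customers, demographics, on="email", how="left")
print(merged.isnull().sum())
# age is NaN for b@example.com: the row survived the join with no match
```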
Common Pitfalls
- Overloading Your Dataset — Prioritize the most impactful enrichments first
- Ignoring Data Quality — Always validate and clean API results
- Not Documenting Your Process — Keep track for reproducibility
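As a concrete illustration of the second pitfall, a small normalization pass over API-returned categories — the allowed set and raw values here are made up for the example:

```python
# Hypothetical allowed categories; anything else collapses to "Unknown"
ALLOWED = {"Electronics", "Clothing", "Home"}

def clean_category(raw):
    """Normalize whitespace/case and reject values outside the allowed set."""
    if not isinstance(raw, str):
        return "Unknown"
    value = raw.strip().title()
    return value if value in ALLOWED else "Unknown"

print(clean_category("  electronics "))  # Electronics
print(clean_category(None))              # Unknown
print(clean_category("gadgets???"))      # Unknown
```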
Next Steps
- Automate enrichment with Apache Airflow or Prefect
- Explore APIs like OpenWeather, Yelp, or LinkedIn
- Learn ETL processes for large-scale workflows
- Use pydantic or jsonschema for data validation
Need professional data scraping and enrichment? N3X1S INTELLIGENCE delivers clean, enriched datasets from any source. Hire us on Fiverr →