
Max Klein


Data Enrichment: How to Add Value to Raw Scraped Data

Imagine you've spent hours scraping a website for product names, prices, and descriptions. When you look at the dataset, though, it feels incomplete. This is the reality of raw scraped data: useful, but limited. To unlock its full potential, you need data enrichment.

In this tutorial, we'll walk through data enrichment techniques using Python.

What Is Data Enrichment?

Data enrichment is the process of enhancing raw data by adding relevant information, usually from external sources. Common examples:

  • Geocoding addresses to get latitude/longitude
  • Appending product categories using a database
  • Validating email formats with regex or external tools
  • Enhancing user profiles with demographic data from APIs
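
As a rough illustration of the email-validation item above, here is a minimal sketch using Python's built-in `re` module. The pattern and the `is_valid_email` helper are assumptions made for this example: the check is deliberately loose (it only looks for a `name@domain.tld` shape) and is not a full standards-compliant validator.

```python
import re

# Deliberately simple pattern (an assumption for illustration): it only checks
# for a "name@domain.tld" shape and does not implement the full email grammar.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_valid_email(value):
    """Return True if value is a string that looks like an email address."""
    if not isinstance(value, str):
        return False
    # fullmatch requires the whole (trimmed) string to fit the pattern
    return EMAIL_PATTERN.fullmatch(value.strip()) is not None

emails = ["ana@example.com", "not-an-email", "bob@site", " carol@example.org "]
flags = [is_valid_email(e) for e in emails]
print(flags)
```

A format check like this only catches obviously malformed values. Confirming that an address actually exists would need an external tool or service.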

Why Enrich Data?

  • Improves decision-making: Enriched data provides context
  • Reduces errors: Cleaning and validating data early prevents downstream issues
  • Boosts ML models: Feature-rich datasets yield better predictions
  • Unlocks new insights: Merging datasets reveals hidden patterns

Practical Example: Enriching Scraped Product Data

Assume you've scraped products from an e-commerce site and saved them to products.csv, with the product names in a "Product Name" column.

Step 1: Install Required Libraries

pip install pandas requests

Step 2: Load Raw Data

import pandas as pd

df = pd.read_csv("products.csv")
print(df.head())
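
Before enriching, it can help to confirm that the column you plan to enrich exists and to tidy up obvious problems. The sketch below is one possible approach, not a required step. The inline CSV text and the `Price` column are invented so the example runs without a file. Only the "Product Name" column comes from this tutorial.

```python
import io
import pandas as pd

# Invented sample standing in for products.csv so the sketch runs on its own
raw_csv = io.StringIO(
    "Product Name,Price\n"
    "  Wireless Mouse ,19.99\n"
    "Desk Lamp,24.50\n"
    "Desk Lamp,24.50\n"
    ",5.00\n"
)
raw = pd.read_csv(raw_csv)

# Fail early if the column the enrichment steps rely on is missing
missing_columns = {"Product Name"} - set(raw.columns)
if missing_columns:
    raise KeyError(f"Missing expected columns: {missing_columns}")

cleaned = raw.copy()
# Trim stray whitespace left over from scraping
cleaned["Product Name"] = cleaned["Product Name"].str.strip()
# Drop rows without a name, then exact duplicate rows
cleaned = (
    cleaned.dropna(subset=["Product Name"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(cleaned)
```

Cleaning first also reduces the number of lookups later, since blank and duplicate rows never reach the API.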

Step 3: Enrich with Product Categories

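import requests

def get_category(product_name):
    """Look up a product's category; fall back to "Unknown" on any failure."""
    url = "https://api.example.com/product-category"  # placeholder endpoint
    payload = {"product_name": product_name}
    try:
        # A timeout keeps one slow request from stalling the whole run
        response = requests.post(url, json=payload, timeout=10)
        response.raise_for_status()
        return response.json().get("category", "Unknown")
    except (requests.RequestException, ValueError):
        # Network errors, non-2xx responses, and invalid JSON all end up here
        return "Unknown"

# One API call per row
df["Category"] = df["Product Name"].apply(get_category)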
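
Calling the API once per row repeats work whenever the same product name appears more than once. One way to reduce that, sketched below, is to look up each unique name once and map the results back onto the column. This assumes the API returns the same category for the same name. `lookup_category` is a hypothetical offline stand-in for `get_category`, and the sample data is invented for illustration.

```python
import pandas as pd

calls = []  # records each lookup so the saving is visible

def lookup_category(product_name):
    # Hypothetical stand-in for the real API call so the sketch runs offline
    calls.append(product_name)
    return "Audio" if "Headphones" in product_name else "Unknown"

sample_df = pd.DataFrame(
    {"Product Name": ["Headphones X", "Mug", "Headphones X", "Mug"]}
)

# Look up each distinct name once, not once per row
unique_names = sample_df["Product Name"].dropna().unique()
category_by_name = {name: lookup_category(name) for name in unique_names}

# Map the results back; rows with no name fall back to "Unknown"
sample_df["Category"] = (
    sample_df["Product Name"].map(category_by_name).fillna("Unknown")
)
print(sample_df)
print(f"{len(calls)} lookups for {len(sample_df)} rows")
```

With four rows and two distinct names, only two lookups are made. On a real catalog with many repeated names the saving can be much larger.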

Step 4: Add Brand Names

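def get_brand(product_name):
    """Look up a product's brand; fall back to "Unknown" on any failure."""
    url = "https://api.example.com/product-brand"  # placeholder endpoint
    payload = {"product_name": product_name}
    try:
        # Same pattern as get_category: bounded wait, explicit error handling
        response = requests.post(url, json=payload, timeout=10)
        response.raise_for_status()
        return response.json().get("brand", "Unknown")
    except (requests.RequestException, ValueError):
        return "Unknown"

df["Brand"] = df["Product Name"].apply(get_brand)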

Step 5: Save the Enriched Data

df.to_csv("enriched_products.csv", index=False)

Merging Datasets with Pandas

customers = pd.read_csv("customers.csv")
demographics = pd.read_csv("demographics.csv")

merged_data = pd.merge(customers, demographics, on="email", how="left")
print(merged_data.head())

Best Practice: Always check for missing values after merging with merged_data.isnull().sum().
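
To make that check concrete, here is a small sketch using invented sample frames in place of customers.csv and demographics.csv. The column names `name` and `age_group` are assumptions for the example. It uses two optional parameters of `pd.merge`: `indicator=True`, which adds a `_merge` column showing where each row came from, and `validate="one_to_one"`, which raises an error if an email appears more than once on either side.

```python
import pandas as pd

# Invented sample data standing in for the two CSV files
customers = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "name": ["Ana", "Bob", "Cy"],
})
demographics = pd.DataFrame({
    "email": ["a@example.com", "c@example.com"],
    "age_group": ["25-34", "35-44"],
})

merged = pd.merge(
    customers,
    demographics,
    on="email",
    how="left",
    indicator=True,          # adds a "_merge" column: both / left_only
    validate="one_to_one",   # fails loudly on duplicate emails
)

# Count missing values per column after the merge
missing_counts = merged.isnull().sum()
# List the customers that found no demographic match
unmatched = merged.loc[merged["_merge"] == "left_only", "email"].tolist()

print(missing_counts)
print(unmatched)
```

Here one customer has no demographic record, so `age_group` shows one missing value and that customer's email appears in `unmatched`.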

Common Pitfalls

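  1. Overloading Your Dataset: Prioritize the most impactful enrichments first instead of adding every field available
  2. Ignoring Data Quality: Always validate and clean API results before writing them into your dataset
  3. Not Documenting Your Process: Record the sources, endpoints, and steps you used so the results are reproducible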
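
As one possible way to act on the second pitfall, the sketch below normalizes category values returned by an API and rejects anything outside a known set. The `ALLOWED_CATEGORIES` set and the `clean_category` helper are hypothetical, chosen only for illustration. A real project would define its own list.

```python
# Hypothetical set of categories the dataset is allowed to contain
ALLOWED_CATEGORIES = {"Electronics", "Home", "Toys"}

def clean_category(raw):
    """Normalize an API value; return "Unknown" if it is missing or unexpected."""
    if not isinstance(raw, str):
        return "Unknown"
    # Trim whitespace and normalize casing before comparing
    candidate = raw.strip().title()
    return candidate if candidate in ALLOWED_CATEGORIES else "Unknown"

api_results = [" electronics ", "HOME", None, "Gadgets", ""]
cleaned = [clean_category(value) for value in api_results]
print(cleaned)
```

Applied to a DataFrame, the same function could be passed to `df["Category"].apply(clean_category)` right after the enrichment step.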

Next Steps

  • Automate enrichment with Apache Airflow or Prefect
  • Explore APIs like OpenWeather, Yelp, or LinkedIn
  • Learn ETL processes for large-scale workflows
  • Use pydantic or jsonschema for data validation

