Imagine you've spent hours scraping data from a website—product names, prices, and descriptions. But when you look at the dataset, it feels incomplete. This is the reality of raw scraped data: useful, but limited. To unlock its full potential, you need data enrichment.
In this tutorial, we'll walk through data enrichment techniques using Python.
What Is Data Enrichment?
Data enrichment is the process of enhancing raw data by adding relevant information from external sources:
- Geocoding addresses to get latitude/longitude
- Appending product categories using a database
- Validating email formats with regex or external tools
- Enhancing user profiles with demographic data from APIs
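As a quick taste of the third item, email formats can be checked with Python's built-in re module. The pattern below is a deliberately simple illustration, not a full RFC 5322 validator:

```python
import re

# Simple illustrative pattern: local part, "@", then a dot-separated domain.
# Real-world email validation is far more nuanced than this.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def is_valid_email(address: str) -> bool:
    """Return True if the address matches our basic email pattern."""
    return bool(EMAIL_RE.match(address))

print(is_valid_email("jane.doe@example.com"))  # True
print(is_valid_email("not-an-email"))          # False
```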
Why Enrich Data?
- Improves decision-making: Enriched data provides context
- Reduces errors: Cleaning and validating data early prevents downstream issues
- Boosts ML models: Feature-rich datasets yield better predictions
- Unlocks new insights: Merging datasets reveals hidden patterns
Practical Example: Enriching Scraped Product Data
Assume you've scraped products from an e-commerce site into a CSV file with columns such as Product Name, Price, and Description:
Step 1: Install Required Libraries
```
pip install pandas requests
```
Step 2: Load Raw Data
```python
import pandas as pd

df = pd.read_csv("products.csv")
print(df.head())
```
Step 3: Enrich with Product Categories
```python
import requests

def get_category(product_name):
    # Placeholder endpoint -- replace with your real category-lookup API
    url = "https://api.example.com/product-category"
    payload = {"product_name": product_name}
    response = requests.post(url, json=payload, timeout=10)
    if response.status_code == 200:
        return response.json().get("category", "Unknown")
    return "Unknown"

df["Category"] = df["Product Name"].apply(get_category)
```
Step 4: Add Brand Names
```python
def get_brand(product_name):
    # Placeholder endpoint -- replace with your real brand-lookup API
    url = "https://api.example.com/product-brand"
    payload = {"product_name": product_name}
    response = requests.post(url, json=payload, timeout=10)
    if response.status_code == 200:
        return response.json().get("brand", "Unknown")
    return "Unknown"

df["Brand"] = df["Product Name"].apply(get_brand)
```
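Calling the API once per row quickly becomes wasteful when product names repeat. A minimal sketch of memoizing lookups with functools.lru_cache — the fetch logic here is a stub standing in for the real requests.post call above:

```python
from functools import lru_cache

calls = 0  # counts how many "API requests" actually go out

@lru_cache(maxsize=None)
def get_category_cached(product_name):
    """Memoized lookup: repeated product names hit the cache, not the API."""
    global calls
    calls += 1
    # Stub lookup table standing in for the real API round-trip
    return {"iPhone 15": "Electronics"}.get(product_name, "Unknown")

for name in ["iPhone 15", "iPhone 15", "Mystery Gadget"]:
    get_category_cached(name)

print(calls)  # 2: the duplicate "iPhone 15" was served from the cache
```

With the stub replaced by the real request, `df["Product Name"].apply(get_category_cached)` works the same way as before but skips redundant network calls.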
Step 5: Save the Enriched Data
```python
df.to_csv("enriched_products.csv", index=False)
```
Merging Datasets with Pandas
```python
customers = pd.read_csv("customers.csv")
demographics = pd.read_csv("demographics.csv")

merged_data = pd.merge(customers, demographics, on="email", how="left")
print(merged_data.head())
```
Best Practice: Always check for missing values after merging with `merged_data.isnull().sum()`.
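To see why that check matters, here is a tiny in-memory example (column names invented for illustration) where one customer has no matching demographics row:

```python
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "name": ["Ada", "Ben"],
})
demographics = pd.DataFrame({
    "email": ["a@example.com"],
    "age": [34],
})

# Left join keeps every customer, matched or not
merged = pd.merge(customers, demographics, on="email", how="left")
print(merged.isnull().sum())
# age is NaN for b@example.com: the row survived the join with no match
```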
Common Pitfalls
- Overloading Your Dataset — Prioritize the most impactful enrichments first
- Ignoring Data Quality — Always validate and clean API results
- Not Documenting Your Process — Keep track for reproducibility
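As a concrete illustration of the second pitfall, a small normalization pass over API-returned categories — the allowed set and raw values here are made up for the example:

```python
# Hypothetical allowed categories; anything else collapses to "Unknown"
ALLOWED = {"Electronics", "Clothing", "Home"}

def clean_category(raw):
    """Normalize whitespace/case and reject values outside the allowed set."""
    if not isinstance(raw, str):
        return "Unknown"
    value = raw.strip().title()
    return value if value in ALLOWED else "Unknown"

print(clean_category("  electronics "))  # Electronics
print(clean_category(None))              # Unknown
print(clean_category("gadgets???"))      # Unknown
```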
Next Steps
- Automate enrichment with Apache Airflow or Prefect
- Explore APIs like OpenWeather, Yelp, or LinkedIn
- Learn ETL processes for large-scale workflows
- Use pydantic or jsonschema for data validation
Need professional data scraping and enrichment? N3X1S INTELLIGENCE delivers clean, enriched datasets from any source. Hire us on Fiverr →