Anusha Kuppili

🧠 Data Collection and Preparation for Machine Learning | Complete Guide with ETL, Data Lakes & Pandas Demo

Welcome to this comprehensive guide on Data Collection and Preparation — the foundation of every successful Machine Learning (ML) project.

If you've ever wondered how apps like Amazon or Swiggy predict what you’ll want before you even click — this is where it all begins.


🚀 Why Data Collection and Preparation Matter

Imagine building a grocery app that delivers a unique experience to every user.

By analyzing purchase history, browsing patterns, and favorite items, the app can predict future needs, boosting engagement and customer satisfaction. These insights power:

  • Personalized homepages
  • Value-based offers and discounts
  • Higher retention and repeat purchases

At the heart of it all? Data.

Clean, accurate, and well-structured data is what drives personalization, predictions, and performance.


📊 Step 1: Understanding Data Collection

Before we start building models, we need to centralize and clean data scattered across multiple sources.

Typical Data Sources:

  • SQL / NoSQL databases
  • APIs and real-time feeds
  • Google Sheets and CSVs
  • Spreadsheets and cloud storage

Integrating these sources into a single pipeline lays the foundation for robust ML training.

Note: Centralizing and cleaning your data is non-negotiable. Garbage in = garbage out.
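
To make this concrete, here is a minimal sketch of pulling a few of these sources into a single pandas DataFrame. The file name, database table, and API URL are placeholders for illustration only.

import sqlite3

import pandas as pd
import requests

# CSV export (for example, a Google Sheets download) -- placeholder file name
csv_orders = pd.read_csv("orders_export.csv")

# SQL database (SQLite used here for simplicity) -- placeholder table and columns
with sqlite3.connect("store.db") as conn:
    db_orders = pd.read_sql("SELECT user_id, item, amount FROM orders", conn)

# REST API returning JSON -- placeholder URL
api_orders = pd.json_normalize(requests.get("https://example.com/api/orders", timeout=10).json())

# Combine everything into one DataFrame for downstream cleaning
all_orders = pd.concat([csv_orders, db_orders, api_orders], ignore_index=True)
print(all_orders.shape)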


⚙️ Step 2: ETL (Extract, Transform, Load) — The Backbone of Data Ingestion

ETL is the process of pulling data from multiple systems, cleaning it up, and loading it into one unified destination for analysis.

💡 The ETL Process

  1. Extract

    Pull data from databases, APIs, CSVs, or logs.

    Tools: Talend, AWS Glue

  2. Transform

    Standardize, clean, and enrich data for consistency.

    • Remove duplicates
    • Normalize formats
    • Handle missing values

    Tools: DBT (runs transformations inside the data warehouse)

  3. Load

    Move the transformed data into a data warehouse or data lake for easy access and analytics.

⚠️ Warning: Poor transformation leads to broken models. Validate your transformations carefully.
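
Here is a tiny end-to-end sketch of the three steps in pandas, assuming a hypothetical raw_orders.csv with order_date and amount columns (writing Parquet requires pyarrow or fastparquet).

import pandas as pd

# Extract: pull raw data from a source system (placeholder CSV)
raw = pd.read_csv("raw_orders.csv")

# Transform: remove duplicates, normalize formats, handle missing values
clean = raw.drop_duplicates()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Load: write the unified, cleaned data to a columnar format for the warehouse or lake
clean.to_parquet("orders_clean.parquet", index=False)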


🏗️ Step 3: Data Lakes — The Modern Data Storage Architecture

A data lake isn’t a tool — it’s an architecture that stores all types of data, raw or processed, in one place.

Key Benefits:

  • Centralized Repository: Store both structured and unstructured data together.
  • Scalability: Handle terabytes to petabytes of growing data effortlessly.
  • Diversity: Store CSVs, JSONs, videos, IoT streams — all in one ecosystem.
  • Cost-Effectiveness: Uses blob storage like AWS S3, Azure Data Lake, or Google Cloud Storage.

Netflix uses data lakes to analyze viewing patterns.

Tesla uses them to process sensor data from autonomous cars.
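
As a rough illustration of the raw vs. processed layout, the sketch below writes partitioned Parquet files to a local lake/ folder; in a real deployment that path would be an object-store location such as s3://... The paths and column names are hypothetical, and partitioned writes require pyarrow.

import os

import pandas as pd

df = pd.read_csv("raw_orders.csv")  # hypothetical raw export

# Raw zone: keep an untouched copy of the source data
os.makedirs("lake/raw", exist_ok=True)
df.to_parquet("lake/raw/orders.parquet", index=False)

# Processed zone: cleaned, typed, and partitioned by year for selective reads
processed = df.copy()
processed["order_date"] = pd.to_datetime(processed["order_date"], errors="coerce")
processed = processed.dropna(subset=["order_date"])
processed["year"] = processed["order_date"].dt.year
os.makedirs("lake/processed", exist_ok=True)
processed.to_parquet("lake/processed/orders", partition_cols=["year"], index=False)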


🔐 Advanced Data Lake Capabilities

  • Data Security & Governance: Protect sensitive data while maintaining compliance.
  • Integration with BI Tools: Seamless use with Athena, Tableau, or Power BI.
  • MLOps Friendly: Centralized data simplifies feature extraction and model training.

⚠️ Tip: Always apply strict security and governance when deploying data lakes — they often hold sensitive business data.
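
For example, a training job can read only the columns it needs from the processed zone (continuing with the illustrative lake/ layout and column names from the sketch above).

import pandas as pd

# Read just the feature columns for model training, not the entire dataset
features = pd.read_parquet(
    "lake/processed/orders",
    columns=["user_id", "amount"],
)
print(features.head())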


🧹 Step 4: Data Cleaning and Transformation

Once data is collected, cleaning it is crucial before using it for ML.

Raw data can come in many forms:

  • CSV files
  • JSON logs
  • Images
  • Documents
  • Parquet files

Your goal is to standardize them into a consistent format (often Parquet) and ensure the data is complete and accurate.

✅ Key Strategies for Data Quality

  1. Ensure Data Accuracy: Fix incorrect entries or typos.
  2. Remove Redundancies: Eliminate duplicate records.
  3. Standardize Formats: Align date, currency, and categorical formats.
  4. Handle Missing Values: Fill or drop nulls strategically.
  5. Optimize for Analysis: Transform unstructured data into analyzable forms.

Example: A retail dataset missing “customer age” values will weaken any personalization model — fill such gaps with the median or mean, or drop the affected rows.
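
Here is a hedged pandas sketch of those five strategies in order; the customers.csv file and its columns are hypothetical.

import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical input

# 1. Accuracy: clean up whitespace and casing in free-text fields
customers["city"] = customers["city"].str.strip().str.title()

# 2. Redundancy: drop exact duplicate records
customers = customers.drop_duplicates()

# 3. Standard formats: one consistent date type
customers["signup_date"] = pd.to_datetime(customers["signup_date"], errors="coerce")

# 4. Missing values: fill customer age with the median, as in the example above
customers["age"] = customers["age"].fillna(customers["age"].median())

# 5. Optimize for analysis: persist in a columnar format (Parquet)
customers.to_parquet("customers_clean.parquet", index=False)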


🧰 Step 5: Data Processing Tools & Technologies

| Tool | Use Case | Example |
| --- | --- | --- |
| Pandas | Small to medium data processing | Cleaning sales data for dashboards |
| Apache Spark | Large-scale, real-time analytics | Processing global transport data (e.g., Uber) |
| AWS Glue | Serverless ETL | Transforming gaming event logs into structured tables |
| Talend | Visual ETL workflows | Integrating healthcare records |
| DBT | Transformations inside data warehouses | Converting raw clickstream data into insights |

⚠️ Choose your tools based on data scale and complexity. Pandas can’t handle terabytes, but Spark can.
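
For scale beyond a single machine, the same kind of cleaning can be written in PySpark. This is only a sketch, assuming a working Spark setup and a placeholder bucket path.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-cleaning").getOrCreate()

# Read a dataset too large for pandas (placeholder path)
orders = spark.read.csv("s3://my-bucket/raw/orders/", header=True, inferSchema=True)

# Distributed versions of the same transformations: dedupe, normalize, fill missing values
clean = (
    orders.dropDuplicates()
          .withColumn("order_date", F.to_date("order_date"))
          .na.fill({"amount": 0.0})
)

clean.write.mode("overwrite").parquet("s3://my-bucket/processed/orders/")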


💻 Step 6: Hands-On Demo — Data Transformation with Pandas

Let’s get practical!

Here’s a small-to-medium dataset example using Pandas to explore, clean, and transform data.


1️⃣ Load and Inspect Data

import pandas as pd

# Load the CSV
df = pd.read_csv("mock_data.csv")

# Explore the dataset
df.head()
df.info()
df.isnull().sum()
df.describe(include='all')

2️⃣ Handle Missing Values

# Fill numeric columns with median
df['age'] = df['age'].fillna(df['age'].median())
df['salary'] = df['salary'].fillna(df['salary'].median())

# Fill categorical with 'Unknown'
df['department'] = df['department'].fillna('Unknown')

3️⃣ Parse and Transform JSON Data

import json

# Convert JSON strings into dictionaries
df['profile'] = df['profile'].apply(lambda x: json.loads(x) if pd.notnull(x) else {})

# Extract fields
df['address'] = df['profile'].apply(lambda x: x.get('address'))
df['phone'] = df['profile'].apply(lambda x: x.get('phone'))
df['email'] = df['profile'].apply(lambda x: x.get('email'))

# Drop original column
df.drop(columns=['profile'], inplace=True)

# Save cleaned data
df.to_csv("cleaned_data.csv", index=False)

4️⃣ Derive New Columns

# Address length check (treat missing addresses as length 0 instead of counting the string 'nan')
df['address_length'] = df['address'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)

# Salary categories (values outside the 0 to 100000 range fall outside the bins and become NaN)
bins = [0, 50000, 70000, 100000]
labels = ['low', 'medium', 'high']
df['salary_category'] = pd.cut(df['salary'], bins=bins, labels=labels)

5️⃣ Group and Aggregate Data

summary = df.groupby('department').agg({
    'salary': 'mean',
    'age': 'mean'
}).reset_index()

summary.rename(columns={'salary': 'average_salary', 'age': 'average_age'}, inplace=True)
print(summary)

🎯 Data Insight

Grouping and aggregation reveal department-level trends — essential for optimizing performance and guiding data-driven ML feature engineering.
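
As one possible next step (a sketch, continuing with the df and summary variables from the demo), the department-level averages can be merged back onto each row and turned into model features.

# Attach department-level aggregates to every row as candidate features
df_features = df.merge(summary, on='department', how='left')

# Example engineered feature: salary relative to the department average
df_features['salary_vs_dept_avg'] = df_features['salary'] / df_features['average_salary']
print(df_features[['department', 'salary', 'salary_vs_dept_avg']].head())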

🧩 Final Thoughts

Here’s what we covered:

  • How to collect and consolidate data from diverse sources
  • The ETL process and why transformation matters
  • The role of data lakes in modern architectures
  • Data cleaning strategies to ensure accuracy
  • Practical data transformation using Pandas

Clean, consistent data is the secret ingredient behind every intelligent system — from recommendation engines to fraud detection models.

If you want to explore the hands-on code, check it out here:
👉 GitHub Repository - https://github.com/anushakuppili19/MLOps

❤️ If you found this helpful:

  • Drop a like on this post
  • Follow me for more MLOps and Data Engineering tutorials
  • Subscribe to my YouTube channel Data Enthusiast Era for upcoming videos explained through analogies and real-world examples!
https://youtu.be/lKm_UslTU-A?si=n8lJSgz0MlRd6vfj
