🧠 Data Collection and Preparation for Machine Learning | Complete Guide with ETL, Data Lakes & Pandas Demo
Welcome to this comprehensive guide on Data Collection and Preparation — the foundation of every successful Machine Learning (ML) project.
If you've ever wondered how apps like Amazon or Swiggy predict what you’ll want before you even click — this is where it all begins.
🚀 Why Data Collection and Preparation Matter
Imagine building a grocery app that delivers a unique experience to every user.
By analyzing purchase history, browsing patterns, and favorite items, the app can predict future needs, boosting engagement and customer satisfaction. These insights power:
- Personalized homepages
- Value-based offers and discounts
- Higher retention and repeat purchases
At the heart of it all? Data.
Clean, accurate, and well-structured data is what drives personalization, predictions, and performance.
📊 Step 1: Understanding Data Collection
Before we start building models, we need to centralize and clean data scattered across multiple sources.
Typical Data Sources:
- SQL / NoSQL databases
- APIs and real-time feeds
- Google Sheets and CSVs
- Spreadsheets and cloud storage
Integrating these sources into a single pipeline lays the foundation for robust ML training.
Note: Centralizing and cleaning your data is non-negotiable. Garbage in = garbage out.
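To make this concrete, here is a minimal sketch (not production code) of pulling records from a database and a CSV export into one pandas DataFrame. The database file, table name, and column names are hypothetical placeholders for illustration.

```python
import sqlite3
import pandas as pd

# Source 1: a relational database (SQLite here for simplicity; the
# table and column names are placeholders)
conn = sqlite3.connect("app.db")
db_orders = pd.read_sql_query("SELECT user_id, item, amount FROM orders", conn)
conn.close()

# Source 2: a CSV export (e.g., from Google Sheets or cloud storage)
csv_orders = pd.read_csv("sheet_export.csv")  # expects the same columns

# Consolidate into a single table for downstream cleaning
combined = pd.concat([db_orders, csv_orders], ignore_index=True)
print(combined.shape)
```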
⚙️ Step 2: ETL (Extract, Transform, Load) — The Backbone of Data Ingestion
ETL is the process of pulling data from multiple systems, cleaning it up, and loading it into one unified destination for analysis.
💡 The ETL Process
Extract
Pull data from databases, APIs, CSVs, or logs.
Tools: Talend, DBT, AWS Glue.
Transform
Standardize, clean, and enrich data for consistency:
- Remove duplicates
- Normalize formats
- Handle missing values
Load
Move the transformed data into a data warehouse or data lake for easy access and analytics.
⚠️ Warning: Poor transformation leads to broken models. Validate your transformations carefully.
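To illustrate the three stages end to end, here is a hand-rolled, minimal ETL sketch in pandas. The file names and columns are assumptions for this example; in a real pipeline you would typically reach for a tool like Glue, Talend, or DBT.

```python
import pandas as pd

# Extract: pull raw records from a source (could equally be a database or API)
raw = pd.read_csv("raw_orders.csv")  # hypothetical source file

# Transform: remove duplicates, normalize formats, handle missing values
clean = raw.drop_duplicates()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Load: write the unified result to a warehouse/lake-friendly format
clean.to_parquet("warehouse/orders.parquet", index=False)  # needs pyarrow or fastparquet
```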
🏗️ Step 3: Data Lakes — The Modern Data Storage Architecture
A data lake isn’t a tool — it’s an architecture that stores all types of data, raw or processed, in one place.
Key Benefits:
- Centralized Repository: Store both structured and unstructured data together.
- Scalability: Handle terabytes to petabytes of growing data effortlessly.
- Diversity: Store CSVs, JSONs, videos, IoT streams — all in one ecosystem.
- Cost-Effectiveness: Built on low-cost object storage such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage.
Netflix uses data lakes to analyze viewing patterns.
Tesla uses them to process sensor data from autonomous cars.
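As a rough illustration of the storage side, the snippet below writes the (hypothetical) cleaned dataset from the ETL sketch above as a partitioned Parquet dataset. The paths and bucket name are placeholders; writing directly to S3 additionally requires the s3fs package.

```python
import pandas as pd

# The cleaned dataset from the ETL sketch above (hypothetical)
df = pd.read_parquet("warehouse/orders.parquet")

# Partition by month and write a Parquet dataset. With s3fs installed, the
# same call works against an object-store URI such as
# "s3://my-data-lake/orders/" (the bucket name is a placeholder).
df["order_month"] = df["order_date"].dt.to_period("M").astype(str)
df.to_parquet("datalake/orders/", partition_cols=["order_month"], index=False)
```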
🔐 Advanced Data Lake Capabilities
- Data Security & Governance: Protect sensitive data while maintaining compliance.
- Integration with Query & BI Tools: Works seamlessly with Amazon Athena, Tableau, or Power BI.
- MLOps Friendly: Centralized data simplifies feature extraction and model training.
⚠️ Tip: Always apply strict security and governance when deploying data lakes — they often hold sensitive business data.
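On the consumption side, here is a sketch of reading that partitioned data back for feature extraction (assuming the layout written above and pyarrow installed):

```python
import pandas as pd

# Read the partitioned dataset written above; pyarrow understands the
# Hive-style directory layout and restores "order_month" as a column.
features = pd.read_parquet("datalake/orders/")

# A simple per-user, per-month feature for a downstream model
monthly_spend = (
    features.groupby(["user_id", "order_month"])["amount"]
            .sum()
            .reset_index(name="monthly_spend")
)
```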
🧹 Step 4: Data Cleaning and Transformation
Once data is collected, cleaning it is crucial before using it for ML.
Raw data can come in many forms:
- CSV files
- JSON logs
- Images
- Documents
- Parquet files
Your goal is to standardize them into a consistent format (often Parquet) and ensure the data is complete and accurate.
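For example, a small (hypothetical) sketch of aligning a CSV export and a JSON Lines log into one Parquet file might look like this; the file paths and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw inputs in two different formats
csv_part = pd.read_csv("exports/events.csv")
json_part = pd.read_json("logs/events.json", lines=True)  # JSON Lines log

# Align column names, then standardize everything on Parquet
json_part = json_part.rename(columns={"ts": "timestamp"})
combined = pd.concat([csv_part, json_part], ignore_index=True)
combined["timestamp"] = pd.to_datetime(combined["timestamp"], errors="coerce")
combined.to_parquet("standardized/events.parquet", index=False)  # needs pyarrow or fastparquet
```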
✅ Key Strategies for Data Quality
- Ensure Data Accuracy: Fix incorrect entries or typos.
- Remove Redundancies: Eliminate duplicate records.
- Standardize Formats: Align date, currency, and categorical formats.
- Handle Missing Values: Fill or drop nulls strategically.
- Optimize for Analysis: Transform unstructured data into analyzable forms.
Example: A retail dataset with missing "customer age" values will hurt personalization models; fill such gaps with the median or mean.
🧰 Step 5: Data Processing Tools & Technologies
| Tool | Use Case | Example |
|---|---|---|
| Pandas | Small to medium data processing | Cleaning sales data for dashboards |
| Apache Spark | Large-scale real-time analytics | Processing global transport data (e.g., Uber) |
| AWS Glue | Serverless ETL | Transforming gaming event logs into structured tables |
| Talend | Visual ETL workflows | Integrating healthcare records |
| DBT | Transformations inside data warehouses | Converting raw clickstream data into insights |
⚠️ Choose your tools based on data scale and complexity. Pandas can’t handle terabytes, but Spark can.
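For a sense of the difference, here is a rough PySpark sketch of the same kind of department-level aggregation you'll see in the pandas demo below, reading a directory of Parquet files that wouldn't fit in a single machine's memory. The path and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session and read a whole directory of Parquet files
spark = SparkSession.builder.appName("dept-summary").getOrCreate()
sdf = spark.read.parquet("datalake/employees/")

# Distributed equivalent of the pandas groupby in the demo below
summary = sdf.groupBy("department").agg(
    F.avg("salary").alias("average_salary"),
    F.avg("age").alias("average_age"),
)
summary.show()
```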
💻 Step 6: Hands-On Demo — Data Transformation with Pandas
Let’s get practical!
Here’s a small-to-medium dataset example using Pandas to explore, clean, and transform data.
1️⃣ Load and Inspect Data
```python
import pandas as pd

# Load the CSV
df = pd.read_csv("mock_data.csv")

# Explore the dataset
df.head()
df.info()
df.isnull().sum()
df.describe(include='all')
```
2️⃣ Handle Missing Values
```python
# Fill numeric columns with the median
df['age'] = df['age'].fillna(df['age'].median())
df['salary'] = df['salary'].fillna(df['salary'].median())

# Fill categorical columns with 'Unknown'
df['department'] = df['department'].fillna('Unknown')
```
3️⃣ Parse and Transform JSON Data
```python
import json

# Convert JSON strings into dictionaries
df['profile'] = df['profile'].apply(lambda x: json.loads(x) if pd.notnull(x) else {})

# Extract individual fields
df['address'] = df['profile'].apply(lambda x: x.get('address'))
df['phone'] = df['profile'].apply(lambda x: x.get('phone'))
df['email'] = df['profile'].apply(lambda x: x.get('email'))

# Drop the original nested column
df.drop(columns=['profile'], inplace=True)

# Save the cleaned data
df.to_csv("cleaned_data.csv", index=False)
```
4️⃣ Derive New Columns
```python
# Address length (0 for missing addresses)
df['address_length'] = df['address'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)

# Salary categories; salaries outside the bins (e.g., above 100000) become NaN
bins = [0, 50000, 70000, 100000]
labels = ['low', 'medium', 'high']
df['salary_category'] = pd.cut(df['salary'], bins=bins, labels=labels)
```
5️⃣ Group and Aggregate Data
```python
summary = df.groupby('department').agg({
    'salary': 'mean',
    'age': 'mean'
}).reset_index()

summary.rename(columns={'salary': 'average_salary', 'age': 'average_age'}, inplace=True)
print(summary)
```
🎯 Data Insight
Grouping and aggregation reveal department-level trends — essential for optimizing performance and guiding data-driven ML feature engineering.
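Building on the df and summary from the demo above, one common (illustrative) next step is to merge the aggregates back onto each row as features:

```python
# Builds on the df and summary created in the demo above
df = df.merge(summary, on='department', how='left')

# Example derived feature: how far each salary is from its department's average
df['salary_vs_dept_avg'] = df['salary'] - df['average_salary']
```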
🧩 Final Thoughts
Here’s what we covered:
- How to collect and consolidate data from diverse sources
- The ETL process and why transformation matters
- The role of data lakes in modern architectures
- Data cleaning strategies to ensure accuracy
- Practical data transformation using Pandas
Clean, consistent data is the secret ingredient behind every intelligent system — from recommendation engines to fraud detection models.
If you want to explore the hands-on code, check it out here:
👉 GitHub Repository - https://github.com/anushakuppili19/MLOps
❤️ If you found this helpful:
- Drop a like on this post
- Follow me for more MLOps and Data Engineering tutorials
- Subscribe to my YouTube channel Data Enthusiast Era for upcoming videos explained through analogies and real-world examples!
https://youtu.be/lKm_UslTU-A?si=n8lJSgz0MlRd6vfj