Mercy Moraa

Posted on Jun 7

Building ML Pipelines with Python: From Data to Insights

#datascience #machinelearning #python

In machine learning, writing a script that trains a model on a clean dataset is only a fraction of the work. The real challenge lies in building a system that can reliably ingest raw data, transform it, train a model, and serve predictions in production.

When code is written as a series of disconnected Jupyter Notebook cells, it tends to become brittle, difficult to test, and prone to data leakage. The solution is to transition from isolated scripts to structured Machine Learning Pipelines. A pipeline automates the workflow, supports reproducibility, and bridges the gap between data science and software engineering.

Let us build an end-to-end Machine Learning pipeline in Python using the industry-standard scikit-learn library.

What is an ML Pipeline?

An ML pipeline binds data preprocessing steps and model execution into a single, cohesive software element.

Instead of manually applying transformations to your training data and remembering to apply those exact same transformations to your testing data, the pipeline executes the sequence automatically. When the pipeline is fitted on the training split only, this design prevents a common form of data leakage, in which information from outside the training dataset (for example, a mean or median computed over the test rows) is accidentally used to train the model, leading to overly optimistic but invalid evaluation metrics.
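To make the leakage concrete, here is a minimal sketch with a single scaler. The toy feature and the variable names (leaky_scaler, safe_scaler) are assumptions made up for illustration; the point is only that the statistics a transformer learns depend on which rows it was fitted on.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature: 100 values drawn at random, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: the mean and variance are computed from training AND test rows
leaky_scaler = StandardScaler().fit(X)

# Safe: statistics come from the training rows only and are reused on the test rows
safe_scaler = StandardScaler().fit(X_train)
X_test_scaled = safe_scaler.transform(X_test)

print("mean learned by leaky scaler:", leaky_scaler.mean_[0])
print("mean learned by safe scaler: ", safe_scaler.mean_[0])
```

A Pipeline applies the "safe" pattern for every step it contains, provided you only ever call fit on the training split.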

Step 1: Setting Up the Environment

To follow along, initialize a clean workspace and install the core data science libraries:

pip install numpy pandas scikit-learn

Step 2: Designing the Complete Pipeline Code

We will build a pipeline that handles a realistic, messy dataset containing both numerical features (which need scaling) and categorical features (which need encoding), followed by a classification model.

Create a file named pipeline.py and implement the following structure:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Simulate a realistic raw dataset
def generate_mock_data():
    np.random.seed(42)
    n_samples = 1000

    data = {
        'age': np.random.choice([np.nan, 25, 34, 45, 52, 61], size=n_samples),
        'income': np.random.choice([50000, 80000, 120000, np.nan], size=n_samples),
        'department': np.random.choice(['Sales', 'Engineering', 'Marketing', None], size=n_samples),
        'purchased': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3])
    }

    df = pd.DataFrame(data)
    # np.random.choice leaves Python None objects in this column; convert them to
    # np.nan, the marker SimpleImputer looks for by default
    df['department'] = df['department'].where(df['department'].notna(), np.nan)
    return df

def main():
    # Load raw data
    df = generate_mock_data()

    # Separate features (X) and target label (y)
    X = df.drop(columns=['purchased'])
    y = df['purchased']

    # Split into train and test sets before any preprocessing occurs
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # 2. Define feature groups
    numeric_features = ['age', 'income']
    categorical_features = ['department']

    # 3. Create sub-transformers for different data types
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),   # Fill missing values with median
        ('scaler', StandardScaler())                     # Standardize to zero mean and unit variance
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing text with mode
        ('encoder', OneHotEncoder(handle_unknown='ignore'))     # Convert text strings to numeric vectors
    ])

    # 4. Combine transformers using ColumnTransformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )

    # 5. Build the master pipeline (Preprocessing + Model Estimator)
    clf_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ])

    # 6. Train the entire pipeline with a single call
    # Transformations are fitted strictly on training data
    clf_pipeline.fit(X_train, y_train)

    # 7. Evaluate performance
    # Test data is passed through the pre-fit transformations automatically
    predictions = clf_pipeline.predict(X_test)

    print("--- Model Performance Metrics ---")
    print(classification_report(y_test, predictions))

if __name__ == '__main__':
    main()
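Because the pipeline behaves like a single estimator, it can also be handed to scikit-learn's cross-validation helpers, which refit every preprocessing step inside each fold. The sketch below is an illustration under assumptions: the synthetic data and the smaller forest (n_estimators=25) are stand-ins chosen to keep it quick, not part of pipeline.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 200 rows, 3 numeric features, a label tied to feature 0
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Knock out roughly 10% of the values so the imputer has work to do
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=25, random_state=42)),
])

# Each of the 5 folds fits the imputer and scaler on that fold's training rows only
scores = cross_val_score(pipe, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```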

Step 3: Unpacking the Architectural Choices

ColumnTransformer

Real-world data is heterogeneous. Your code needs to treat numerical values differently from strings. The ColumnTransformer lets you route specific columns to dedicated processing sub-pipelines, each fitted independently of the others, and then concatenates their outputs into a single feature matrix for the machine learning algorithm. Columns that are not listed in any transformer are dropped by default (remainder='drop').

Streamlined Inference

Notice the execution phase: clf_pipeline.fit(X_train, y_train) handles the entire transformation and training sequence. When it is time to make a prediction on new, raw data, you simply call clf_pipeline.predict(X_new). You do not need to repeat the code for scaling or imputing missing values; the pipeline stores the statistics learned during fitting (medians, most frequent categories, means and standard deviations, known category lists) and reapplies them to the new rows.
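The following sketch rebuilds the same pipeline structure on a hypothetical six-row training frame, then predicts on raw rows that contain missing values and a department never seen in training. The tiny dataset and the name new_data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with gaps in every feature column
train = pd.DataFrame({
    "age": [25, 34, np.nan, 52, 61, 45],
    "income": [50000, np.nan, 120000, 80000, 50000, 120000],
    "department": ["Sales", "Engineering", "Sales", np.nan, "Marketing", "Sales"],
    "purchased": [0, 1, 0, 1, 0, 1],
})

numeric = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler())])
categorical = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                        ("encoder", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([("num", numeric, ["age", "income"]),
                                  ("cat", categorical, ["department"])])
clf_pipeline = Pipeline([("preprocessor", preprocessor),
                         ("classifier", RandomForestClassifier(n_estimators=10, random_state=42))])
clf_pipeline.fit(train.drop(columns=["purchased"]), train["purchased"])

# The medians learned from the training rows are stored on the fitted imputer
learned_medians = (clf_pipeline.named_steps["preprocessor"]
                   .named_transformers_["num"]
                   .named_steps["imputer"].statistics_)
print(learned_medians)  # medians of age and income: 45 and 80000

# Raw, unprocessed rows: missing values plus a department unseen in training
new_data = pd.DataFrame({
    "age": [np.nan, 40],
    "income": [90000, np.nan],
    "department": ["Legal", "Sales"],
})
predictions = clf_pipeline.predict(new_data)
print(predictions)
```

No scaling or imputation code appears at prediction time; the missing age is filled with the stored median, and "Legal" is encoded as an all-zero vector because of handle_unknown='ignore'.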

Step 4: Production Deployment Strategy

Once your pipeline runs end to end and performs well, it needs to move out of your local development environment.

Model Serialization

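To save the entire trained pipeline, including both the fitted preprocessing parameters and the model parameters, use joblib. It is the serialization tool commonly recommended in the scikit-learn ecosystem because it handles large NumPy arrays more efficiently than the standard pickle module. joblib files are still pickle-based, so load them only from sources you trust and with the same scikit-learn version that produced them: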

import joblib

# Persist the entire trained pipeline object to disk
joblib.dump(clf_pipeline, 'ml_pipeline.pkl')

To load the pipeline later for inference:

# Load the pipeline in a production script or notebook
clf_pipeline = joblib.load('ml_pipeline.pkl')

# Predict on brand new raw data
predictions = clf_pipeline.predict(new_data)
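A quick way to gain confidence in a saved artifact is a round-trip check: dump, reload, and compare predictions. The sketch below assumes a small stand-in pipeline on synthetic data and writes to a temporary directory so it does not touch your project files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the trained clf_pipeline
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ml_pipeline.pkl")
    joblib.dump(pipe, path)          # persist the fitted pipeline
    restored = joblib.load(path)     # reload it as a production script would

# The restored object should reproduce the original predictions exactly
same = np.array_equal(pipe.predict(X), restored.predict(X))
print("round trip preserved predictions:", same)
```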

Serving via API

In a production deployment, an API service layer (such as FastAPI) loads this single serialized file into memory on startup. When a user submits raw data to a JSON endpoint, the payload is converted into a Pandas DataFrame with the same column names used in training and passed straight to .predict(). Because preprocessing and model travel together in one object, the transformations applied in production are the ones fitted during training, with no separately maintained preprocessing code to drift out of sync.
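The payload-to-DataFrame step might look like the sketch below. It is deliberately framework-free: the web framework wiring is left out, and the helper name payload_to_frame and the column constants are hypothetical, chosen to match the features used in pipeline.py.

```python
import numpy as np
import pandas as pd

# Assumption: these mirror the feature groups the pipeline was trained on
NUMERIC_COLUMNS = ["age", "income"]
CATEGORICAL_COLUMNS = ["department"]

def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Turn one JSON-decoded record into a single-row DataFrame for .predict()."""
    # Absent keys become None here and are converted to NaN below
    row = {col: payload.get(col) for col in NUMERIC_COLUMNS + CATEGORICAL_COLUMNS}
    frame = pd.DataFrame([row])
    for col in NUMERIC_COLUMNS:
        # JSON null and non-numeric junk both become NaN for the numeric imputer
        frame[col] = pd.to_numeric(frame[col], errors="coerce").astype(float)
    for col in CATEGORICAL_COLUMNS:
        # Replace None with np.nan so the categorical imputer can detect it
        values = frame[col].astype(object)
        frame[col] = values.where(values.notna(), np.nan)
    return frame

frame = payload_to_frame({"age": 40, "department": None})
print(frame)
```

Inside an endpoint handler, the returned frame would be passed to clf_pipeline.predict(frame); input validation beyond this (types, ranges, required fields) is a separate concern that this sketch does not cover.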

Summary Principles for Clean ML Engineering

Split First, Transform Second: Always execute train_test_split before fitting any transformer. If you calculate the mean or median of a column using the entire dataset, your model is subtly cheating by seeing data from the test set.
Handle Missing Values Correctly: Ensure that Python None objects and placeholder strings like "None" or "NaN" are converted to proper np.nan values so that SimpleImputer can detect and fill them appropriately. Failing to do this can cause them to be treated as valid categories, which introduces noise into your model.
Handle Unknown Labels: When configuring categorical encoders, always include parameters like handle_unknown='ignore'. This prevents your API from crashing if a user inputs a completely new category in production that wasn't present during training.
Version Your Artifacts: Treat your serialized pipeline file like source code. If the data schema or hyperparameters change, tag the exported file with a clear version so you can easily roll back if production anomalies occur.
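One possible way to act on the versioning principle is to write a small JSON sidecar next to each exported file. This is a sketch, not a standard: the helper save_with_metadata, the file naming scheme, and the metadata fields are all assumptions, and a DummyClassifier stands in for the real pipeline.

```python
import hashlib
import json
import os
import platform
import tempfile

import joblib
import sklearn
from sklearn.dummy import DummyClassifier

def save_with_metadata(model, directory, version):
    """Dump the model and write a JSON sidecar describing how it was produced."""
    artifact_path = os.path.join(directory, f"ml_pipeline_v{version}.pkl")
    joblib.dump(model, artifact_path)

    # A content hash lets you verify later that the file was not altered
    with open(artifact_path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()

    metadata = {
        "version": version,
        "sklearn_version": sklearn.__version__,
        "python_version": platform.python_version(),
        "sha256": digest,
    }
    metadata_path = artifact_path + ".json"
    with open(metadata_path, "w") as fh:
        json.dump(metadata, fh, indent=2)
    return artifact_path, metadata_path

# Hypothetical stand-in model; in practice this would be the fitted clf_pipeline
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path, metadata_path = save_with_metadata(model, tmp_dir, "1.0.0")
    with open(metadata_path) as fh:
        loaded_metadata = json.load(fh)

print(loaded_metadata["version"], loaded_metadata["sklearn_version"])
```

Recording the scikit-learn version is useful because a pickled estimator is not guaranteed to load correctly under a different library version.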

Transitioning from raw scripting blocks to structured object-oriented pipelines makes your code reliable, clean, and immediately ready for modern deployment architectures.

Happy coding!
