Kaemon Lovendahl

Posted on • Originally published at glitchedgoblet.blog

Intermediate Machine Learning

TL;DR
Clean messy data, encode categorical variables, bundle everything in pipelines, validate with cross‑validation, and unleash XGBoost. All with hands‑on code you can run in a notebook today.

If you finished the Intro to ML post and thought, “Cool, but real data is way uglier,” this follow‑up is for you. We’ll use the same California Housing dataset so you can reuse your environment and immediately see the effects of each technique.
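
If you want to code along, here's a minimal setup sketch. It assumes the Kaggle-style California Housing CSV (the version that includes the ocean_proximity column) is saved locally as housing.csv; the file path and target column are assumptions, so adjust them to match the notebook from the intro post.

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file name: point this at wherever your copy of the dataset lives
housing = pd.read_csv('housing.csv')

# median_house_value is the target in this version of the dataset
y = housing['median_house_value']
X = housing.drop('median_house_value', axis=1)

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)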

1. Wrestling With Missing Values

When sensors drop packets or humans skip survey fields, you get NaN (Not‑a‑Number) entries. Most scikit‑learn models will straight‑up error if you feed them NaNs, so we MUST decide what to do. The three classic moves are:

| Strategy | What it does | When it's OK |
| --- | --- | --- |
| Drop Columns | Delete any column containing any NaNs | Column is mostly NaNs or not predictive |
| Imputation | Replace NaNs with a statistic (mean/median) | Numeric columns with moderate sparsity |
| Extended Imputation | Impute and add a boolean "was_missing" flag | When the fact a value is missing may carry signal |
# Detect columns that contain NaNs
cols_with_missing = [c for c in X_train.columns if X_train[c].isna().any()]

# Drop them from both splits
X_drop_train = X_train.drop(cols_with_missing, axis=1)
X_drop_val   = X_val.drop(cols_with_missing, axis=1)

Dropping is safe but often throws out good data. Imputation keeps columns alive:

from sklearn.impute import SimpleImputer

# Impute missing values with the median value
imp = SimpleImputer(strategy='median')

# fit_transform returns a NumPy array, so restore the column names and row index
X_imp_train = pd.DataFrame(imp.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
X_imp_val = pd.DataFrame(imp.transform(X_val),
                         columns=X_val.columns, index=X_val.index)

Sometimes the mere absence of a value matters. We signal that via extended imputation:

# Flag which rows were originally missing, one boolean column per affected feature
for col in cols_with_missing:
    X_train[col + '_was_missing'] = X_train[col].isna()
    X_val[col + '_was_missing'] = X_val[col].isna()

Why care? Models can only learn from what they see. Telling them “this cell was originally blank” gives another dimension to reason over.
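
scikit-learn can also do the impute-plus-flag combo in one step with SimpleImputer's add_indicator option. A minimal sketch (assuming, as in the snippets above, that X_train is numeric at this point):

from sklearn.impute import SimpleImputer

# add_indicator=True appends one binary "was missing" column for every
# feature that had NaNs in the training data; the output is a NumPy array
imp_flag = SimpleImputer(strategy='median', add_indicator=True)

X_ext_train = imp_flag.fit_transform(X_train)
X_ext_val = imp_flag.transform(X_val)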

2. Taming Categorical Variables

Algorithms speak numbers, not strings. A column like ocean_proximity (values: '<1H OCEAN', 'INLAND', …) needs translation.

2.a Ordinal Encoding

Assigns an integer to each category. Use only when the categories have an inherent order (e.g., low < medium < high).

from sklearn.preprocessing import OrdinalEncoder

# Identify the categorical (object-dtype) columns
obj_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

ord_enc = OrdinalEncoder()

X_ord_train = X_train.copy()
X_ord_val = X_val.copy()

X_ord_train[obj_cols] = ord_enc.fit_transform(X_train[obj_cols])
X_ord_val[obj_cols] = ord_enc.transform(X_val[obj_cols])
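
One caveat: by default, OrdinalEncoder numbers categories alphabetically, which rarely matches the real ranking. If the order matters, pass it explicitly. A minimal sketch using a hypothetical quality column (not part of the housing dataset):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature, just to show the API
df = pd.DataFrame({'quality': ['low', 'high', 'medium', 'low']})

quality_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df[['quality']] = quality_enc.fit_transform(df[['quality']])
# 'low' -> 0.0, 'medium' -> 1.0, 'high' -> 2.0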

2.b One‑Hot Encoding

Makes a binary column per category—perfect when red isn’t “greater than” blue.

from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns (obj_cols from above)
# On scikit-learn < 1.2, use sparse=False instead of sparse_output=False
oh = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

OH_train = pd.DataFrame(oh.fit_transform(X_train[obj_cols]),
                        columns=oh.get_feature_names_out(obj_cols),
                        index=X_train.index)
OH_val = pd.DataFrame(oh.transform(X_val[obj_cols]),
                      columns=oh.get_feature_names_out(obj_cols),
                      index=X_val.index)

# Drop the original string columns and append the one-hot columns
num_train = X_train.drop(obj_cols, axis=1)
num_val = X_val.drop(obj_cols, axis=1)
X_oh_train = pd.concat([num_train, OH_train], axis=1)
X_oh_val = pd.concat([num_val, OH_val], axis=1)

Memory watch: one-hot encoding explodes the column count, so it works best for low-cardinality features (roughly 20 or fewer unique values). For high-cardinality features (e.g., zip codes), consider hashing tricks or target encoding instead.
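
A quick cardinality check makes the decision concrete; a minimal sketch:

# Unique-value count per categorical column
for col in obj_cols:
    print(col, X_train[col].nunique())

# One-hot only the low-cardinality columns
low_card_cols = [c for c in obj_cols if X_train[c].nunique() <= 20]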

3. Pipelines: Duct‑Tape for Your Workflow

Copy‑pasting preprocessed arrays around eventually ends in tears. Pipelines bundle every step—imputation, encoding, model—into a single object. A pipeline is a sequence of transformations, each feeding into the next.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Column lists by dtype: numeric vs. object/categorical
num_cols = [c for c in X_train.columns if X_train[c].dtype != 'object']
cat_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']

# Numeric columns just get median imputation
num_pipe = SimpleImputer(strategy='median')

cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oh', OneHotEncoder(handle_unknown='ignore'))
])

pre = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

model = RandomForestRegressor(n_estimators=300, random_state=0)

tree_pipe = Pipeline([
    ('prep', pre),
    ('model', model)
])

Fit once and you’re done:

tree_pipe.fit(X_train, y_train)
preds = tree_pipe.predict(X_val)

Perks: cleaner code, cross‑validation becomes trivial, and serialization (joblib.dump) ships the exact preprocessing logic to prod.
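
For example, a minimal serialization sketch (the file name is arbitrary):

import joblib

# Persist the fitted pipeline: preprocessing and model travel together
joblib.dump(tree_pipe, 'housing_pipeline.joblib')

# Later, or in production: load it and predict on raw, unprocessed rows
loaded_pipe = joblib.load('housing_pipeline.joblib')
preds = loaded_pipe.predict(X_val)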

4. Cross‑Validation: Trust but Verify

A single train/validation split might get lucky. k‑fold cross‑validation (commonly k = 5) rotates the validation slice and averages the score.

from sklearn.model_selection import cross_val_score

neg_mae = cross_val_score(tree_pipe, X, y,
                          cv=5,
                          scoring='neg_mean_absolute_error',
                          n_jobs=-1)

print(f"CV MAE: {(-neg_mae).mean():.3f}")

Yes, it’s slower. Yes, it’s worth it for models that train within a few minutes.

5. XGBoost: Rocket Fuel

When you’ve squeezed all you can out of a Random Forest, gradient boosting often goes further. XGBoost is the industry-standard implementation: fast, flexible, and loaded with hyperparameters to tune. It builds trees sequentially, each new tree fitted to the errors the ensemble has made so far, and the final prediction is the sum of every tree’s contribution.

from xgboost import XGBRegressor

# early_stopping_rounds in the constructor needs a recent xgboost (>= 1.6);
# on older versions, pass it to fit() instead
booster = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    n_jobs=4,
    early_stopping_rounds=5
)

# eval_set provides the validation data that early stopping monitors
booster.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

Tweak key parameters:

| Param | Effect |
| --- | --- |
| n_estimators | number of trees |
| learning_rate | shrinkage per tree |
| max_depth / min_child_weight | control tree complexity |
| subsample / colsample_bytree | row / feature subsampling |

Early stopping halts training when validation error stops improving—huge time saver.
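
Putting those knobs together, here's a minimal sketch of a more aggressively tuned booster (the values are illustrative starting points, not recommendations):

from xgboost import XGBRegressor

booster_tuned = XGBRegressor(
    n_estimators=2000,         # plenty of trees; early stopping trims the excess
    learning_rate=0.03,        # smaller step per tree
    max_depth=6,               # deeper trees capture more interactions
    min_child_weight=3,        # require more weight per leaf before splitting
    subsample=0.8,             # each tree sees 80% of the rows
    colsample_bytree=0.8,      # each tree sees 80% of the features
    early_stopping_rounds=10,
    n_jobs=4,
    random_state=0
)

booster_tuned.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)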

6. Data Leakage: The Silent Model Killer

Leakage means your model peeked at future info. Two common culprits:

  1. Target leakage: predictors include a feature computed AFTER the target (e.g., actual sale price when predicting listing price).
  2. Train‑test contamination: you fit a preprocessor on the full dataset BEFORE splitting, letting validation rows influence the transformation.

The fix is to fit all transformers inside the pipeline after train_test_split, and audit features to ensure they’d be available at prediction time.
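
To make the contamination case concrete, here's a minimal sketch of the wrong and the right way to impute before cross-validation:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

# LEAKY: the imputer is fitted on every row, so validation folds influence
# the medians the model is later evaluated against
X_leaky = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(X[num_cols]),
                       columns=num_cols, index=X.index)

# SAFE: the imputer lives inside tree_pipe, so cross_val_score re-fits it
# on each training fold only
scores = cross_val_score(tree_pipe, X, y, cv=5,
                         scoring='neg_mean_absolute_error')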

Quick‑Reference Cheatsheet

| Problem | Go-to Fix |
| --- | --- |
| Missing numeric values | SimpleImputer(strategy='median') |
| Missing categorical values | SimpleImputer(strategy='most_frequent') |
| Nominal categorical | OneHotEncoder(handle_unknown='ignore') |
| Ordinal categorical | OrdinalEncoder() |
| Workflow sprawl | Pipeline + ColumnTransformer |
| Over-optimistic scores | 5-fold cross_val_score |
| Last-mile accuracy | XGBRegressor() |

Final Thoughts

Intermediate ML is about discipline: rigorous preprocessing, airtight evaluation, and leak‑proof pipelines. Master these, and algorithms like XGBoost become powerful allies instead of mysterious black boxes.

As always, try the notebook, tweak hyper‑parameters, and let me know on BlueSky if your MAE drops. Happy glitching!
