Part 1: The Core Idea
Machine Learning = Finding patterns in data to make predictions.
# Simple pattern: Height predicts weight
height = 170 # cm
weight = height * 0.5 # Simple rule: weight = height * 0.5
print(f"Predicted weight: {weight} kg")
Intuition: We find mathematical relationships between inputs (height) and outputs (weight).
Part 2: Working with Data
import numpy as np
# Data is just numbers in arrays
heights = np.array([160, 170, 180, 190]) # Input features
weights = np.array([60, 70, 80, 90]) # Target values
print(f"Average height: {heights.mean()}")
What happened: NumPy arrays store our data efficiently and provide math operations.
Part 3: Finding Patterns
# Measure how strongly height and weight move together
correlation = np.corrcoef(heights, weights)[0, 1]  # Pearson correlation coefficient
print(f"Correlation: {correlation:.2f}")  # Close to 1 = strong linear pattern
Intuition: Correlation tells us how strongly two variables are related.
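Correlation only measures how strong the relationship is, not what the line actually looks like. As a small follow-up sketch, still using the heights and weights arrays above, np.polyfit estimates the slope and intercept of the best-fit line:
slope, intercept = np.polyfit(heights, weights, 1)  # degree-1 (straight line) fit
print(f"Best-fit line: weight ≈ {slope:.2f} * height + {intercept:.2f}")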
Part 4: Making Predictions
# Simple linear prediction
def predict_weight(height):
    return height * 0.5 - 15  # Our discovered pattern
new_height = 175
predicted = predict_weight(new_height)
print(f"Height {new_height}cm → Weight {predicted}kg")
Key insight: Once we find the pattern, we can predict new values.
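Because the rule is plain arithmetic, it applies to a whole NumPy array at once. A tiny sketch, reusing the heights array from Part 2:
all_predicted = predict_weight(heights)  # element-wise: every height in one call
print(f"Predictions: {all_predicted} kg")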
Part 5: Measuring Errors
# How wrong are our predictions?
actual = np.array([65, 75, 85, 95])
predicted = np.array([60, 70, 80, 90])
mse = np.mean((actual - predicted) ** 2)  # Mean Squared Error (units: kg squared)
print(f"Mean squared error: {mse}")
Why this matters: We need to know how good our predictions are.
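One caveat: MSE is in squared units (kg² here). Taking the square root gives the Root Mean Squared Error, which is back in kg and easier to interpret (using the mse value above):
rmse = np.sqrt(mse)  # Root Mean Squared Error, in kg
print(f"Typical error: {rmse:.1f} kg")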
Part 6: Learning from Data
from sklearn.linear_model import LinearRegression
# Let the computer find the pattern
model = LinearRegression()
model.fit(heights.reshape(-1, 1), weights) # Learn from data
# Make predictions
prediction = model.predict([[175]])
print(f"Learned prediction: {prediction[0]:.1f}kg")
Magic moment: The computer automatically finds the best line through our data.
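You can peek at the line it found: a fitted LinearRegression stores the slope in coef_ and the intercept in intercept_.
print(f"Learned slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")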
Part 7: Train/Test Split
from sklearn.model_selection import train_test_split
# Split data: some for learning, some for testing
X_train, X_test, y_train, y_test = train_test_split(heights.reshape(-1,1), weights, test_size=0.5)
model.fit(X_train, y_train) # Learn from training data
score = model.score(X_test, y_test) # Test on unseen data
print(f"Model accuracy: {score:.2f}")
Why split: We test on data the model hasn't seen to avoid cheating.
Part 8: Multiple Features
# Use multiple inputs for better predictions
data = np.array([[170, 25],   # [height, age]
                 [180, 30],
                 [160, 20],
                 [175, 35]])
weights = np.array([70, 80, 60, 75])
model.fit(data, weights) # Learn from height AND age
prediction = model.predict([[172, 28]])  # Predict using both features
print(f"Predicted weight: {prediction[0]:.1f} kg")
Power of ML: Use many features to make better predictions.
Part 9: Classification vs Regression
# Regression: Predict numbers (weight, price, temperature)
regressor = LinearRegression()
# Classification: Predict categories (spam/not spam, cat/dog)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
# Same interface, different problems
Two main types: Predicting numbers vs predicting categories.
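A minimal classification sketch with made-up labels (X_demo and y_demo are invented here purely for illustration): fit and predict look exactly like regression, but the output is a category.
X_demo = [[160], [170], [180], [190]]  # heights
y_demo = [0, 0, 1, 1]                  # invented labels: 1 = "tall", 0 = "not tall"
classifier.fit(X_demo, y_demo)
print(classifier.predict([[185]]))     # e.g. [1] -- a category, not a point on a scale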
Part 10: Real Data with Pandas
import pandas as pd
# Load real data
df = pd.DataFrame({
    'height': [160, 170, 180, 190, 165],
    'weight': [60, 70, 80, 90, 65],
    'age': [25, 30, 35, 40, 28]
})
print(df.head()) # See first few rows
print(df.describe()) # Get statistics
Pandas power: Handle real-world messy data with ease.
Part 11: Data Preprocessing
# Clean and prepare data
df['bmi'] = df['weight'] / (df['height']/100)**2 # Create new feature
df = df.dropna() # Remove missing values
# Separate features and target
X = df[['height', 'age']] # Features
y = df['weight'] # Target
Essential step: Clean data before feeding to algorithms.
Part 12: Different Algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
# Try different algorithms
models = {
    'Linear': LinearRegression(),
    'Tree': DecisionTreeRegressor(),
    'Forest': RandomForestRegressor(),
    'SVM': SVR()
}
# Each finds patterns differently
Algorithm zoo: Different algorithms are suited to different problems.
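A small comparison loop, reusing the data and weights arrays from Part 8 (scored on the training data only to keep the sketch short; in practice use the train/test split from Part 7):
for name, m in models.items():
    m.fit(data, weights)
    print(f"{name}: R² = {m.score(data, weights):.2f}")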
Part 13: Model Evaluation
from sklearn.metrics import mean_squared_error, r2_score
# Evaluate model performance
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}") # Lower is better
print(f"R²: {r2:.2f}") # Higher is better (max 1.0)
Metrics matter: Different ways to measure how good your model is.
Part 14: Cross-Validation
from sklearn.model_selection import cross_val_score
# Test model on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validation (each fold needs enough rows to score, so use more data than our toy set)
print(f"Average score: {scores.mean():.2f} (+/- {scores.std()*2:.2f})")
Robust testing: Get more reliable estimate of model performance.
Part 15: Feature Engineering
# Create better features
df['height_squared'] = df['height'] ** 2
df['age_height'] = df['age'] * df['height'] # Interaction feature
# Sometimes simple transformations improve predictions
Domain knowledge: Understanding your data helps create better features.
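A minimal sketch of actually using the new columns (X_fe and model_fe are just names made up for this example, and the score is on the training data to keep it short):
X_fe = df[['height', 'age', 'height_squared', 'age_height']]  # original + engineered features
model_fe = LinearRegression()
model_fe.fit(X_fe, df['weight'])
print(f"R² with engineered features: {model_fe.score(X_fe, df['weight']):.2f}")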
Part 16: Handling Categorical Data
# Text categories need special handling
df['gender'] = ['M', 'F', 'M', 'F', 'M']
# Convert to numbers
df_encoded = pd.get_dummies(df, columns=['gender'])
print(df_encoded.columns) # gender_F, gender_M columns
Encoding: Convert text to numbers for ML algorithms.
Part 17: Scaling Features
from sklearn.preprocessing import StandardScaler
# Scale features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Mean=0, Std=1
# Some algorithms work better with scaled data
Why scale: Prevents features with large values from dominating.
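To see what the scaler did, check each column's mean and standard deviation in X_scaled; they should come out close to 0 and 1.
print(f"Column means: {X_scaled.mean(axis=0)}")  # roughly 0
print(f"Column stds: {X_scaled.std(axis=0)}")    # roughly 1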
Part 18: Pipeline
from sklearn.pipeline import Pipeline
# Chain preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])
pipeline.fit(X_train, y_train) # Scaling and training in one step
Clean workflow: Combines preprocessing and modeling automatically.
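A handy detail: calling predict on a fitted pipeline pushes new data through the already-fitted scaler before handing it to the model, so you never scale by hand. A quick sketch, reusing X_test from the Part 7 split:
y_pred = pipeline.predict(X_test)  # scaler.transform + model.predict in one call
print(f"First prediction: {y_pred[0]:.1f} kg")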
Part 19: Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Find best model settings
param_grid = {'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
Optimization: Automatically find best settings for your model.
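After the search finishes, grid_search.best_estimator_ holds the model refitted with the winning settings, and grid_search.best_score_ reports its cross-validated score. A short follow-up sketch, assuming the training set has enough rows for 5-fold CV and reusing X_test and y_test from Part 7 (best_model is just a local name here):
best_model = grid_search.best_estimator_  # refitted with the best parameters
print(f"Best CV score: {grid_search.best_score_:.2f}")
print(f"Test score: {best_model.score(X_test, y_test):.2f}")  # R² on held-out data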
Part 20: Putting It All Together
# Complete ML workflow
def ml_workflow(data, target_column):
    # 1. Split features and target
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    # 2. Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # 3. Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestRegressor())
    ])
    # 4. Train model
    pipeline.fit(X_train, y_train)
    # 5. Evaluate
    score = pipeline.score(X_test, y_test)
    return pipeline, score
# Usage
model, r2 = ml_workflow(df_encoded, 'weight')  # df_encoded from Part 16: all columns are numeric
print(f"Test R² score: {r2:.2f}")
Complete solution: From raw data to trained model in one function.
Key Takeaways
- Data = Numbers: Everything must be converted to numbers
- Patterns = Models: Algorithms find mathematical relationships
- Train/Test = Validation: Always test on unseen data
- Features = Input: Good features make good predictions
- Metrics = Evaluation: Measure how well your model works
- Pipeline = Workflow: Combine steps for clean, reproducible ML
This foundation gives you the tools to solve real machine learning problems!