Part 1: The Core Idea
Machine Learning = Finding patterns in data to make predictions.
# Simple pattern: Height predicts weight
height = 170 # cm
weight = height * 0.5 # Simple rule: weight = height * 0.5
print(f"Predicted weight: {weight} kg")
Intuition: We find mathematical relationships between inputs (height) and outputs (weight).
Part 2: Working with Data
import numpy as np
# Data is just numbers in arrays
heights = np.array([160, 170, 180, 190]) # Input features
weights = np.array([60, 70, 80, 90]) # Target values
print(f"Average height: {heights.mean()}")
What happened: NumPy arrays store our data efficiently and provide math operations.
Part 3: Finding Patterns
# Measure how strongly height and weight move together
correlation = np.corrcoef(heights, weights)[0, 1]  # Pearson correlation coefficient
print(f"Correlation: {correlation:.2f}")  # Close to 1 = strong linear pattern
Intuition: Correlation tells us how strongly two variables are related.
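Correlation only measures how strong the relationship is, not what the line actually looks like. As a small follow-up sketch, still using the heights and weights arrays above, np.polyfit estimates the slope and intercept of the best-fit line:
slope, intercept = np.polyfit(heights, weights, 1)  # degree-1 (straight line) fit
print(f"Best-fit line: weight ≈ {slope:.2f} * height + {intercept:.2f}")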
Part 4: Making Predictions
# Simple linear prediction
def predict_weight(height):
    return height * 0.5 - 15  # Our discovered pattern
new_height = 175
predicted = predict_weight(new_height)
print(f"Height {new_height}cm → Weight {predicted}kg")
Key insight: Once we find the pattern, we can predict new values.
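Because the rule is plain arithmetic, it applies to a whole NumPy array at once. A tiny sketch, reusing the heights array from Part 2:
all_predicted = predict_weight(heights)  # element-wise: every height in one call
print(f"Predictions: {all_predicted} kg")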
Part 5: Measuring Errors
# How wrong are our predictions?
actual = np.array([65, 75, 85, 95])
predicted = np.array([60, 70, 80, 90])
mse = np.mean((actual - predicted) ** 2)  # Mean Squared Error (units: kg squared)
print(f"Mean squared error: {mse}")
Why this matters: We need to know how good our predictions are.
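One caveat: MSE is in squared units (kg² here). Taking the square root gives the Root Mean Squared Error, which is back in kg and easier to interpret (using the mse value above):
rmse = np.sqrt(mse)  # Root Mean Squared Error, in kg
print(f"Typical error: {rmse:.1f} kg")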
Part 6: Learning from Data
from sklearn.linear_model import LinearRegression
# Let the computer find the pattern
model = LinearRegression()
model.fit(heights.reshape(-1, 1), weights) # Learn from data
# Make predictions
prediction = model.predict([[175]])
print(f"Learned prediction: {prediction[0]:.1f}kg")
Magic moment: The computer automatically finds the best line through our data.
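You can peek at the line it found: a fitted LinearRegression stores the slope in coef_ and the intercept in intercept_.
print(f"Learned slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")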
Part 7: Train/Test Split
from sklearn.model_selection import train_test_split
# Split data: some for learning, some for testing
X_train, X_test, y_train, y_test = train_test_split(heights.reshape(-1,1), weights, test_size=0.5)
model.fit(X_train, y_train) # Learn from training data
score = model.score(X_test, y_test) # Test on unseen data
print(f"Model accuracy: {score:.2f}")
Why split: We test on data the model hasn't seen to avoid cheating.
Part 8: Multiple Features
# Use multiple inputs for better predictions
data = np.array([[170, 25],   # [height, age]
                 [180, 30],
                 [160, 20],
                 [175, 35]])
weights = np.array([70, 80, 60, 75])
model.fit(data, weights) # Learn from height AND age
prediction = model.predict([[172, 28]])  # Predict using both features
print(f"Predicted weight: {prediction[0]:.1f} kg")
Power of ML: Use many features to make better predictions.
Part 9: Classification vs Regression
# Regression: Predict numbers (weight, price, temperature)
regressor = LinearRegression()
# Classification: Predict categories (spam/not spam, cat/dog)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
# Same interface, different problems
Two main types: Predicting numbers vs predicting categories.
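A minimal classification sketch with made-up labels (X_demo and y_demo are invented here purely for illustration): fit and predict look exactly like regression, but the output is a category.
X_demo = [[160], [170], [180], [190]]  # heights
y_demo = [0, 0, 1, 1]                  # invented labels: 1 = "tall", 0 = "not tall"
classifier.fit(X_demo, y_demo)
print(classifier.predict([[185]]))     # e.g. [1] -- a category, not a point on a scale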
Part 10: Real Data with Pandas
import pandas as pd
# Load real data
df = pd.DataFrame({
    'height': [160, 170, 180, 190, 165],
    'weight': [60, 70, 80, 90, 65],
    'age': [25, 30, 35, 40, 28]
})
print(df.head()) # See first few rows
print(df.describe()) # Get statistics
Pandas power: Handle real-world messy data with ease.
Part 11: Data Preprocessing
# Clean and prepare data
df['bmi'] = df['weight'] / (df['height']/100)**2 # Create new feature
df = df.dropna() # Remove missing values
# Separate features and target
X = df[['height', 'age']] # Features
y = df['weight'] # Target
Essential step: Clean data before feeding to algorithms.
Part 12: Different Algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
# Try different algorithms
models = {
    'Linear': LinearRegression(),
    'Tree': DecisionTreeRegressor(),
    'Forest': RandomForestRegressor(),
    'SVM': SVR()
}
# Each finds patterns differently
Algorithm zoo: Different algorithms are suited to different problems.
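A small comparison loop, reusing the data and weights arrays from Part 8 (scored on the training data only to keep the sketch short; in practice use the train/test split from Part 7):
for name, m in models.items():
    m.fit(data, weights)
    print(f"{name}: R² = {m.score(data, weights):.2f}")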
Part 13: Model Evaluation
from sklearn.metrics import mean_squared_error, r2_score
# Evaluate model performance
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}") # Lower is better
print(f"R²: {r2:.2f}") # Higher is better (max 1.0)
Metrics matter: Different ways to measure how good your model is.
Part 14: Cross-Validation
from sklearn.model_selection import cross_val_score
# Test model on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validation (each fold needs enough rows to score, so use more data than our toy set)
print(f"Average score: {scores.mean():.2f} (+/- {scores.std()*2:.2f})")
Robust testing: Get more reliable estimate of model performance.
Part 15: Feature Engineering
# Create better features
df['height_squared'] = df['height'] ** 2
df['age_height'] = df['age'] * df['height'] # Interaction feature
# Sometimes simple transformations improve predictions
Domain knowledge: Understanding your data helps create better features.
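A minimal sketch of actually using the new columns (X_fe and model_fe are just names made up for this example, and the score is on the training data to keep it short):
X_fe = df[['height', 'age', 'height_squared', 'age_height']]  # original + engineered features
model_fe = LinearRegression()
model_fe.fit(X_fe, df['weight'])
print(f"R² with engineered features: {model_fe.score(X_fe, df['weight']):.2f}")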
Part 16: Handling Categorical Data
# Text categories need special handling
df['gender'] = ['M', 'F', 'M', 'F', 'M']
# Convert to numbers
df_encoded = pd.get_dummies(df, columns=['gender'])
print(df_encoded.columns) # gender_F, gender_M columns
Encoding: Convert text to numbers for ML algorithms.
Part 17: Scaling Features
from sklearn.preprocessing import StandardScaler
# Scale features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Mean=0, Std=1
# Some algorithms work better with scaled data
Why scale: Prevents features with large values from dominating.
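To see what the scaler did, check each column's mean and standard deviation in X_scaled; they should come out close to 0 and 1.
print(f"Column means: {X_scaled.mean(axis=0)}")  # roughly 0
print(f"Column stds: {X_scaled.std(axis=0)}")    # roughly 1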
Part 18: Pipeline
from sklearn.pipeline import Pipeline
# Chain preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])
pipeline.fit(X_train, y_train) # Scaling and training in one step
Clean workflow: Combines preprocessing and modeling automatically.
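A handy detail: calling predict on a fitted pipeline pushes new data through the already-fitted scaler before handing it to the model, so you never scale by hand. A quick sketch, reusing X_test from the Part 7 split:
y_pred = pipeline.predict(X_test)  # scaler.transform + model.predict in one call
print(f"First prediction: {y_pred[0]:.1f} kg")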
Part 19: Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Find best model settings
param_grid = {'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
Optimization: Automatically find best settings for your model.
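After the search finishes, grid_search.best_estimator_ holds the model refitted with the winning settings, and grid_search.best_score_ reports its cross-validated score. A short follow-up sketch, assuming the training set has enough rows for 5-fold CV and reusing X_test and y_test from Part 7 (best_model is just a local name here):
best_model = grid_search.best_estimator_  # refitted with the best parameters
print(f"Best CV score: {grid_search.best_score_:.2f}")
print(f"Test score: {best_model.score(X_test, y_test):.2f}")  # R² on held-out data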
Part 20: Putting It All Together
# Complete ML workflow
def ml_workflow(data, target_column):
    # 1. Split features and target
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    # 2. Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # 3. Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestRegressor())
    ])
    # 4. Train model
    pipeline.fit(X_train, y_train)
    # 5. Evaluate
    score = pipeline.score(X_test, y_test)
    return pipeline, score
# Usage
model, r2 = ml_workflow(df_encoded, 'weight')  # df_encoded from Part 16: all columns are numeric
print(f"Test R² score: {r2:.2f}")
Complete solution: From raw data to trained model in one function.
Key Takeaways
- Data = Numbers: Everything must be converted to numbers
- Patterns = Models: Algorithms find mathematical relationships
- Train/Test = Validation: Always test on unseen data
- Features = Input: Good features make good predictions
- Metrics = Evaluation: Measure how well your model works
- Pipeline = Workflow: Combine steps for clean, reproducible ML
This foundation gives you the tools to solve real machine learning problems!