Author: Trix Cyrus
Supervised learning is the cornerstone of many AI and ML applications, where models are trained on labeled datasets to make predictions. In this article, we’ll explore the two main types of supervised learning tasks—classification and regression—delve into popular algorithms like Logistic Regression, Decision Trees, and Support Vector Machines (SVMs), and demonstrate real-world applications through a hands-on example: spam email classification.
1. Understanding Supervised Learning Tasks
a. Classification Tasks
- Goal: Categorize input data into predefined classes or labels.
- Examples:
  - Spam vs. non-spam emails.
  - Predicting whether a patient has a disease (yes/no).
- Common Metrics (a quick code sketch follows this list):
  - Accuracy: Percentage of correctly classified instances.
  - Precision & Recall: Useful for imbalanced datasets.
  - F1-Score: Harmonic mean of precision and recall.
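To make these metrics concrete, here is a minimal sketch using scikit-learn's metric functions; the label arrays are made-up toy values purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up toy labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-Score: ", f1_score(y_true, y_pred))
```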
b. Regression Tasks
- Goal: Predict continuous numeric values based on input features.
- Examples:
  - Predicting house prices based on features like size and location.
  - Estimating stock prices.
- Common Metrics (a quick code sketch follows this list):
  - Mean Absolute Error (MAE): Average absolute difference between predicted and actual values.
  - Mean Squared Error (MSE): Average squared difference (penalizes larger errors more).
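Similarly, a minimal sketch of MAE and MSE with scikit-learn, using made-up house prices (in thousands) as the example values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up house prices in $1000s: actual vs. predicted
y_true = [250, 300, 410, 520]
y_pred = [240, 330, 400, 500]

print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute error
print("MSE:", mean_squared_error(y_true, y_pred))   # squares errors, so large misses dominate
```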
2. Popular Supervised Learning Algorithms
a. Logistic Regression
- Type: Classification.
- How It Works: Estimates the probability of a binary outcome (e.g., spam or not) using the logistic (sigmoid) function.
- Equation: \( P(y=1|x) = \frac{1}{1 + e^{-(b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n)}} \)
- Advantages: Simple, fast, interpretable.
- Limitations: Struggles with non-linear relationships.
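Here is a minimal sketch of fitting a logistic regression with scikit-learn. The synthetic dataset from `make_classification` is an assumption for illustration; any binary-labeled feature matrix works the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption: a toy stand-in for real features)
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

model = LogisticRegression()
model.fit(X, y)

# predict_proba applies the sigmoid: columns are P(y=0|x) and P(y=1|x)
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]))
```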
b. Decision Trees
- Type: Classification and regression.
- How It Works: Splits data into subsets based on feature values, creating a tree-like structure.
- Example Split:
  - Feature: Email contains "FREE."
  - If yes → Likely spam.
  - If no → Likely not spam.
- Advantages: Easy to interpret, handles non-linear relationships.
- Limitations: Prone to overfitting (solved by pruning or ensemble methods like Random Forests).
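Below is a minimal sketch of a decision tree in scikit-learn; the iris dataset and `max_depth=3` setting are assumptions chosen for illustration. Capping the depth is one simple guard against the overfitting mentioned above, and `export_text` prints the learned splits, which is why trees are considered easy to interpret.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset (assumption for illustration)
X, y = load_iris(return_X_y=True)

# Limiting depth is a simple way to curb overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Show the learned if/else splits as plain text
print(export_text(tree))
```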
c. Support Vector Machines (SVMs)
- Type: Classification and regression.
- How It Works: Finds the hyperplane that best separates classes in a feature space.
- Key Concepts:
  - Margin: Distance between the hyperplane and the nearest data points (support vectors).
  - Kernel Trick: Maps data to higher dimensions to handle complex relationships.
- Advantages: Effective for high-dimensional data.
- Limitations: Computationally expensive for large datasets.
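A minimal sketch of an SVM with the RBF kernel, using scikit-learn's two-moons toy data (an assumption for illustration) where a straight-line boundary would fail:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-circles: not linearly separable
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# The RBF kernel applies the kernel trick to find a non-linear boundary
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```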
3. Evaluating Model Performance
a. Cross-Validation
- Splits the dataset into multiple subsets (folds) to validate performance across all data.
- Example: 5-Fold Cross-Validation.
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Example dataset so the snippet runs as-is (iris is an assumption, not part of the original)
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of the 5 folds
print("Mean accuracy:", scores.mean())
```
b. Confusion Matrix
- A table showing correct and incorrect predictions for classification models.
- Example: Spam Classification.
| Predicted \ Actual | Spam | Not Spam |
| --- | --- | --- |
| Spam | 90 | 10 |
| Not Spam | 5 | 95 |
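The same kind of table can be produced with scikit-learn's `confusion_matrix`; the labels below are made-up toy values. Note that scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the table above.

```python
from sklearn.metrics import confusion_matrix

# Made-up toy labels: 1 = spam, 0 = not spam
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
```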
4. Real-World Example: Spam Email Classification
Dataset
Use the "SMS Spam Collection" dataset, which contains labeled text messages (spam or not spam).
Steps:
- Load Dataset: Load the data into a Pandas DataFrame.
- Preprocess Text: Remove stopwords, convert to lowercase, and tokenize.
- Convert Text to Features: Use Term Frequency-Inverse Document Frequency (TF-IDF) vectorization.
- Train Model: Train a Logistic Regression model to classify messages as spam or ham.
- Evaluate Performance: Use accuracy and F1-score metrics.
Code Example:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']].rename(columns={'v1': 'label', 'v2': 'text'})
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.2, random_state=42
)

# Text vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predictions and evaluation
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
~Trixsec