DEV Community: mungaime-25

UNSUPERVISED LEARNING

mungaime-25 — Sun, 27 Jul 2025 17:55:55 +0000

INTRODUCTION TO UNSUPERVISED LEARNING

Unsupervised learning is a type of machine learning where the model is not given any labels. Instead, it tries to find patterns, structures, or relationships in the input data without any human supervision.

Main characteristics of unsupervised learning

No labeled outputs.
The system learns patterns from raw data.
Focuses on data exploration and dimensionality reduction.

TYPES OF UNSUPERVISED LEARNING
There are two main types of unsupervised learning:

CLUSTERING
Clustering is the process of grouping similar data points together such that:
Points in the same cluster are very similar.
Points in different clusters are very different.

DIMENSIONALITY REDUCTION
Reducing the number of input variables while preserving key information (e.g., PCA, t-SNE).

Common Clustering Algorithms

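1. K-Means Clustering

K is the number of clusters to form, chosen in advance.

The algorithm tries to find K centroids (central points).
It assigns each data point to the nearest centroid.
It then moves each centroid to the mean of its assigned points and repeats until the assignments stop changing.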

from sklearn.cluster import KMeans
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Income': [45, 54, 67, 120, 130, 150],
    'Spending': [50, 60, 65, 90, 85, 95]
})

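# Fix n_init and random_state so the result is reproducible between runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(data)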

print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
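To make the centroid and assignment steps concrete, here is a minimal NumPy sketch of the loop K-Means is usually described with. The function name kmeans_sketch and its parameters are made up for illustration, and this is not how scikit-learn implements KMeans internally; it only mirrors the idea on the same Income/Spending values.

```python
import numpy as np

# Same toy Income/Spending values as the example above
points = np.array([
    [45, 50], [54, 60], [67, 65],
    [120, 90], [130, 85], [150, 95],
], dtype=float)

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans_sketch(points, k=2)
print("Centroids:\n", centroids)
print("Labels:", labels)
```

On this small dataset the loop should separate the three lower-income points from the three higher-income ones, although which group gets label 0 depends on the random start.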

2. Hierarchical Clustering

Doesn’t require you to specify the number of clusters in advance.
Creates a tree of clusters (dendrogram).
You can "cut" the tree at any level to decide how many clusters you want.

Types:
Agglomerative (Bottom-Up): Start with each point as its own cluster and repeatedly merge the closest pair.
Divisive (Top-Down): Start with one cluster containing all points and repeatedly split it.

import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import dendrogram,linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Sample data
data = pd.DataFrame({
    'Age': [25, 30, 45, 35, 50, 23, 40, 60],
    'Income': [30000, 40000, 50000, 45000, 80000, 32000, 60000, 90000]
})

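# Build the linkage matrix with Ward's method.
# Note: the data is not scaled here, so Income (much larger values) dominates the distances.
link = linkage(data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 6))
dendrogram(link, labels=list(range(1, len(data) + 1)), orientation='top',
           distance_sort='ascending', show_leaf_counts=True)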
plt.title('hierarchical dendrogram')
plt.xlabel('datapoint')
plt.show()

# Apply Agglomerative Clustering with 3 clusters
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
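# Select the feature columns explicitly so re-running this cell never clusters on 'Cluster' itself
data['Cluster'] = model.fit_predict(data[['Age', 'Income']])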

# Visualize
plt.scatter(data['Age'], data['Income'], c=data['Cluster'], cmap='Accent')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Agglomerative Clustering')
plt.show()

# Standardize so Age and Income contribute equally to the distances
sl = StandardScaler()
scaled_data = sl.fit_transform(data[['Age', 'Income']])

# Cluster
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(scaled_data)

# Evaluate with the silhouette score (ranges from -1 to 1; higher means better-separated clusters)
score = silhouette_score(scaled_data, labels)
print(f'Silhouette Score: {score:.4f}')
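The idea of "cutting" the tree at different levels can be sketched with SciPy's fcluster, which turns a linkage matrix into flat cluster labels. The block below is only an illustration: it reuses the Age/Income values from above, standardizes them by hand, and the variable names are chosen here rather than taken from the original code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same Age/Income values as above, standardized so both columns count equally
raw = np.array([
    [25, 30000], [30, 40000], [45, 50000], [35, 45000],
    [50, 80000], [23, 32000], [40, 60000], [60, 90000],
], dtype=float)
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# One tree is built once...
tree = linkage(scaled, method="ward")

# ...and can then be cut at different levels to get at most 2, 3, or 4 flat clusters
cuts = {}
for k in (2, 3, 4):
    cuts[k] = fcluster(tree, t=k, criterion="maxclust")
    print(k, "clusters:", cuts[k])
```

The point of the sketch is that the tree is computed once and the number of clusters is decided afterwards, which is what distinguishes this approach from K-Means.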

Dimensionality Reduction – Finding Simplicity in Complexity

Principal Component Analysis (PCA)

Combines many variables into a smaller set of new variables (principal components) that still capture most of the variance in the data.
Helps visualize high-dimensional data in 2D or 3D.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print("Reduced shape:", reduced.shape)
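One way to check how much information the two components keep is PCA's explained_variance_ratio_ attribute. The short sketch below repeats the Iris example and prints the share of variance per component; treat it as an illustration rather than a full analysis.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

# Fraction of the total variance carried by each kept component
ratios = pca.explained_variance_ratio_
print("Per component:", ratios.round(3))
print("Total kept:", round(float(ratios.sum()), 3))
```

If the total is close to 1, the 2D view loses little of the original spread in the data.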

🔍 Understanding Supervised Learning: A Guide for Beginners

mungaime-25 — Mon, 21 Jul 2025 15:15:17 +0000

In today’s world of data and artificial intelligence, Supervised Learning is one of the most commonly used techniques in machine learning. It powers everything from predicting house prices to detecting spam emails. But what exactly is supervised learning, and how does it work?

Let’s break it down in simple terms.

📘 What is Supervised Learning?

Supervised Learning is a type of machine learning where the model learns from labeled data. That means, for every input in the dataset, we already know the correct output.

Think of it like teaching a child using flashcards:

You show them a picture of a cat and tell them, “This is a cat.”

You show a picture of a dog and say, “This is a dog.”
After seeing many such examples, the child begins to recognize the difference and can identify new animals on their own.

Similarly, in supervised learning, the algorithm is trained on data with known answers (labels), so it can later predict outcomes for new, unseen data.

⚙️ How Does It Work?

Supervised learning works in two main stages:

1. Training Phase:
The model is fed a dataset containing inputs (also called features) and their correct outputs (labels). It tries to find a pattern or relationship between them.

2. Testing or Prediction Phase:
Once trained, the model is given new inputs it hasn’t seen before, and it uses what it has learned to predict the outputs.
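The two phases can be sketched with scikit-learn as follows. The choice of the Iris dataset, a decision tree as the model, and a 25% hold-out are assumptions made for this illustration, not something the article prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows so the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Training phase: learn the relationship between features and labels
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction phase: apply the model to inputs it has not seen
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy on unseen data: {accuracy:.2f}")
```

Measuring accuracy on the held-out rows, rather than on the training rows, gives a fairer estimate of how the model behaves on new data.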
🧠 Types of Supervised Learning Problems

There are two main types of problems in supervised learning:

Regression: Predicts a continuous value
Example: Predicting the price of a car based on mileage, brand, and model year.

Classification: Predicts a category or label
Example: Classifying emails as “Spam” or “Not Spam”.

🔍 Popular Supervised Learning Models

Let’s explore a few common models used in supervised learning:

1. 📈 Linear Regression

Use: For predicting numeric values.

How it works: It fits the straight line that best represents the relationship between the input and the output, usually by minimizing the squared differences between the predicted and actual values.

Example: Predicting house prices based on the size of the house.
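A minimal sketch of the house-price example, using made-up sizes and prices chosen so the relationship is exactly linear (price = 3 × size). Real data would be noisier, and the numbers here are purely hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square metres, price in thousands
size = np.array([[50], [80], [100], [120], [150]])
price = np.array([150, 240, 300, 360, 450])  # constructed as 3 * size

model = LinearRegression()
model.fit(size, price)

# The fitted line is: price = slope * size + intercept
slope = model.coef_[0]
intercept = model.intercept_
predicted = model.predict(np.array([[110]]))[0]
print("Slope:", slope)
print("Intercept:", intercept)
print("Predicted price for 110 square metres:", predicted)
```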
2. 🌳 Decision Trees

Use: Can be used for both classification and regression.

How it works: Think of it like a flowchart. It splits the data based on decision rules (e.g., “Is the age > 30?”), forming a tree-like structure.

Example: Classifying whether a customer will buy a product based on age, income, and past behavior.
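To see the flowchart idea in practice, the sketch below trains a small tree on invented customer data and prints the learned rules with scikit-learn's export_text. Only age and income are used, and all values and labels are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customers: [age, income in thousands]; 1 = bought, 0 = did not buy
X = [[22, 20], [25, 35], [28, 30], [35, 60], [42, 80], [50, 75], [31, 40], [45, 55]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned decision rules as a text flowchart
rules = export_text(tree, feature_names=["age", "income"])
print(rules)

# Classify a new customer aged 40 with an income of 70
new_customer_label = tree.predict([[40, 70]])[0]
print("Will buy:", bool(new_customer_label))
```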
3. 🚀 Gradient Boosting Machines (GBM)

Use: For complex regression and classification tasks.

How it works: GBM builds models in a sequence. Each new model tries to correct the errors of the previous one, gradually improving the performance.

Example: Predicting loan default risk in financial applications.
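The "each new model corrects the previous one" idea can be sketched by hand for a regression task: every round fits a tiny tree to the current errors (residuals) and adds a fraction of its prediction to the running total. This is a simplified illustration on synthetic data, not the implementation used by any particular GBM library, and the learning rate and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # round 0: predict the mean everywhere
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                   # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1)   # a very small tree
    stump.fit(X, residuals)                      # the next model learns those errors
    prediction += learning_rate * stump.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {errors[0]:.3f}")
print(f"Training MSE after 20 rounds: {errors[-1]:.3f}")
```

The training error should shrink round after round, which is the gradual improvement described above. In practice you would use a library class such as scikit-learn's GradientBoostingRegressor and check the error on held-out data.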

4. 🧮 K-Nearest Neighbors (KNN)

Use: Simple and effective for small datasets.

How it works: It looks at the ‘k’ closest points (neighbors) to a new input and assigns the most common label (for classification) or average value (for regression).

Example: Classifying a flower species based on petal length and width.
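The voting procedure is short enough to write by hand. The sketch below uses a few invented petal measurements and a hypothetical helper, knn_predict; a real project would more likely use scikit-learn's KNeighborsClassifier.

```python
from collections import Counter
import numpy as np

# Hypothetical petal measurements (length, width in cm) with species labels
X_train = np.array([
    [1.4, 0.2], [1.5, 0.2], [1.3, 0.3],
    [4.5, 1.5], [4.7, 1.4], [4.0, 1.3],
    [6.0, 2.5], [5.8, 2.2], [6.1, 2.3],
])
y_train = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

def knn_predict(x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict([4.4, 1.4]))
print(knn_predict([1.6, 0.25]))
```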

5. 🧠 Support Vector Machines (SVM)

Use: Mainly for classification tasks.

How it works: SVM finds the boundary (or hyperplane) that separates the different classes in the data with the widest possible margin.

Example: Detecting whether an email is spam or not.
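A small sketch of a linear SVM on invented spam features. The two features (number of links and number of promotional words per email) and all values are assumptions for illustration; real spam filters use far richer features.

```python
from sklearn.svm import SVC

# Hypothetical features per email: [number of links, number of promotional words]
X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 4], [6, 3], [4, 5], [7, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = not spam, 1 = spam

svm = SVC(kernel="linear")
svm.fit(X, y)

# With a linear kernel the boundary is the line w . x + b = 0
print("w:", svm.coef_[0], "b:", svm.intercept_[0])
# The support vectors are the training points closest to the boundary
print("Support vectors:\n", svm.support_vectors_)

new_emails = [[1, 2], [6, 5]]
new_labels = svm.predict(new_emails)
print("Predictions:", new_labels)
```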

🧪 Real-Life Example: Predicting Student Grades

Let’s say we want to predict a student’s final grade based on:

Hours studied

Attendance rate

Participation in class

We would:

Collect data from past students with their actual grades.
Train a regression model using this data.
Use the model to predict the grade of a current student.
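The three steps above could look roughly like this in code. Every number below is invented for illustration, including the scale used for participation (0 to 10), and a plain linear regression is just one possible choice of model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: hypothetical past students
# Columns: hours studied per week, attendance rate (%), participation score (0-10)
past_features = np.array([
    [2, 60, 3], [4, 70, 5], [6, 80, 6], [8, 85, 7],
    [10, 90, 8], [12, 95, 9], [5, 75, 4], [9, 88, 6],
])
past_grades = np.array([48, 57, 66, 74, 81, 90, 60, 76])  # made-up final grades

# Step 2: train a regression model on the past data
model = LinearRegression()
model.fit(past_features, past_grades)

# Step 3: predict the grade of a current student
current_student = np.array([[7, 82, 6]])
predicted_grade = model.predict(current_student)[0]
print(f"Predicted grade: {predicted_grade:.1f}")
```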

🎯 Conclusion

Supervised learning is like learning with a teacher — the answers are given, and the model learns by example. It’s powerful, widely used, and forms the basis of many AI systems today.

From predicting prices to classifying emails and diagnosing diseases, supervised learning is everywhere. By understanding its models, such as Linear Regression, Decision Trees, and Gradient Boosting, we unlock the potential to turn data into valuable predictions.

Introduction to SQL for data science

mungaime-25 — Wed, 16 Apr 2025 09:48:42 +0000

Structured Query Language (SQL) is a fundamental tool for any data scientist. It allows you to efficiently retrieve, manipulate and analyze structured data stored in relational databases, so insights can be extracted directly where the data lives.
In this article we will cover the basics of SQL and essential queries for data science.

Why SQL for Data Science

Data retrieval - SQL enables efficient extraction of data from databases.
Data manipulation - SQL lets us filter, aggregate and transform data before analysis.
Performance - Relational database engines are optimized for querying large datasets.
Integration - SQL works seamlessly with Python, R and BI tools.

Basic SQL Queries

SELECT statement
The SELECT statement is used to retrieve data from a database.
Example: selecting the first-name and second-name columns from the customer table returns the first and second name of every customer.
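The original query is not shown here, so the following is a sketch of what such a SELECT might look like, run through Python's built-in sqlite3 module. The column names (first_name, second_name, city) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, first_name TEXT, second_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (first_name, second_name, city) VALUES (?, ?, ?)",
    [("Amina", "Otieno", "Kisumu"), ("Brian", "Mwangi", "Nairobi"), ("Carol", "Achieng", "Kisumu")],
)

# SELECT picks the columns to return; FROM names the table
select_rows = conn.execute("SELECT first_name, second_name FROM customer").fetchall()
print(select_rows)
conn.close()
```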

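WHERE clause
Used to filter rows from a table so that only those matching a condition are returned.
Example: counting customers with a WHERE condition on the city counts only the customers who are from Kisumu.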
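A sketch of such a filtered count, again using sqlite3 with a hypothetical customer table (the city column and the sample rows are assumptions, since the original query is not shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (first_name, city) VALUES (?, ?)",
    [("Amina", "Kisumu"), ("Brian", "Nairobi"), ("Carol", "Kisumu"), ("David", "Mombasa")],
)

# WHERE keeps only the rows that satisfy the condition before they are counted
kisumu_count = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE city = ?", ("Kisumu",)
).fetchone()[0]
print("Customers from Kisumu:", kisumu_count)
conn.close()
```

Passing the city as a parameter (the ? placeholder) rather than pasting it into the query string is the usual way to avoid SQL injection when the value comes from user input.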

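HAVING clause
Used to filter aggregated data. WHERE filters individual rows before grouping, while HAVING filters the groups produced by GROUP BY.
Example: grouping orders by customer and keeping only the groups with a count above one gives the total orders for only the customers who had more than one order.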
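A sketch of a HAVING query under the same assumptions as before: an invented orders table with customer_id and amount columns, queried through sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 250.0), (1, 100.0), (2, 75.0), (3, 40.0), (3, 60.0), (3, 20.0)],
)

# GROUP BY forms one group per customer; HAVING then drops groups with a single order
query = """
    SELECT customer_id, COUNT(*) AS total_orders
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 1
    ORDER BY customer_id
"""
having_rows = conn.execute(query).fetchall()
print(having_rows)
conn.close()
```

Customer 2 has only one order, so that group is filtered out of the result.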

ORDER BY
Used to sort results in a specified order, either ascending (ASC) or descending (DESC).
N.B. The default sort order is ascending.
Example: ordering by the price column lists the prices from the lowest to the highest.
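A sketch of sorting with ORDER BY, using an invented products table (the table name, columns, and rows are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("Sugar", 220.0), ("Milk", 65.0), ("Bread", 70.0), ("Rice", 180.0)],
)

# No direction given, so the default (ascending) applies: lowest price first
ascending = conn.execute("SELECT name, price FROM products ORDER BY price").fetchall()
# DESC reverses the order: highest price first
descending = conn.execute("SELECT name, price FROM products ORDER BY price DESC").fetchall()

print(ascending)
print(descending)
conn.close()
```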

SUMMARY
In summary, SQL is a crucial tool for data scientists, enabling efficient data retrieval, manipulation and analysis from a relational database.
In this article we have covered key SQL concepts, including the basic SELECT, WHERE, HAVING and ORDER BY queries for retrieving, filtering and sorting data.