Amruta

ML-chronicles: Day 1 – Understanding KNN with the Iris Dataset

Understanding KNN with the Iris Dataset – A Beginner's Visual Guide

Introduction
As part of my ml-chronicles series, I decided to start things off with a foundational yet powerful algorithm: K-Nearest Neighbors (KNN).
Before diving in, I skimmed through some great GFG articles to grasp the theory. Later, I followed a YouTube tutorial to see how it's applied in practice. I wanted this project to be more interactive, so I added several features to visualize the model's behavior and to explore its evaluation metrics.
This post covers the following:

  1. Building a KNN model from scratch,
  2. Applying it to the Iris dataset,
  3. Visualizing the results,
  4. Testing with custom inputs, and
  5. Finding the best value of k.

Credits and resources:
GFG article: https://www.geeksforgeeks.org/k-nearest-neighbours/
YouTube tutorial: https://youtu.be/mpfU9n4MzBE?feature=shared

What is KNN?

KNN stands for K-Nearest Neighbors. It is often called a lazy learning algorithm because it doesn't train a model explicitly: it simply stores the training dataset and does all the work at prediction time. To classify a query point, it looks at the k nearest data points around it and returns the most common class among them.
If the dataset has significant outliers or noise, a higher k helps smooth out predictions. However, a very high k can cause underfitting.
There are various ways to select a good value of k: cross-validation, the elbow method, preferring odd values to avoid ties, etc.
Distances can be measured with any of the common metrics: Euclidean distance, Manhattan distance, Minkowski distance, etc.
Steps involved:
1. Measure the distance (usually Euclidean) from the query to all training points.
2. Pick the k closest ones.
3. Vote on the class label.
KNN works best with normalized data and small to medium datasets.
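For reference, the common distance metrics mentioned above look roughly like this (a minimal sketch; the function names are mine, not part of the project):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    # Sum of absolute differences along each feature
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def minkowski(a, b, p=3):
    # Generalization: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p)
```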

The Dataset: Iris
The classic Iris dataset has:

  • 150 samples across 3 classes: Setosa, Versicolor, and Virginica.
  • 4 numerical features: Sepal Length, Sepal Width, Petal Length, and Petal Width.
  • For my project, I removed the Id column and normalized all features using MinMaxScaler. I also mapped the class labels to numeric codes (1, 2, 3) for internal processing.

My Model Architecture
Preprocessing

  1. Normalized all feature columns to [0, 1].
  2. Mapped species to numeric values for classification.
  3. Used 3D scatter plots to visualize separability among classes.
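Here is a minimal sketch of those preprocessing steps. The column names follow the common Iris.csv layout (SepalLengthCm, SepalWidthCm, ...); adjust them to match your file:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("Iris.csv").drop(columns=["Id"])  # drop the Id column

# Map species names to numeric codes (1, 2, 3)
species_map = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
df["Species"] = df["Species"].map(species_map)

# Scale the four feature columns to [0, 1]
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Quick 3D scatter of three features, coloured by class, to eyeball separability
ax = plt.axes(projection="3d")
ax.scatter(df["PetalLengthCm"], df["PetalWidthCm"], df["SepalLengthCm"], c=df["Species"])
plt.show()
```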

Custom KNN from Scratch
I implemented KNN in a separate module (kNN_modules.py).
It includes:

  1. A distance function to measure closeness (Euclidean distance is used).
  2. A KNN() function that predicts the class using statistics.mode() from nearest neighbors.
  3. An optional print of the selected neighbors, which helped me build intuition about how predictions are formed.
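A rough reconstruction of what such a module can look like (this is my own sketch, not the exact code in kNN_modules.py):

```python
import statistics
import numpy as np

def euclidean_distance(a, b):
    # Distance between a query point and one training point
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def KNN(X_train, y_train, query, k=5, verbose=False):
    # Distance from the query to every training sample
    distances = [euclidean_distance(query, x) for x in X_train]
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    neighbor_labels = [y_train[i] for i in nearest]
    if verbose:
        print("Selected neighbors (index, label):", list(zip(nearest.tolist(), neighbor_labels)))
    # Majority vote among the neighbors decides the predicted class
    return statistics.mode(neighbor_labels)
```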

Visualizations: The following visuals were implemented; the results are attached below.

Confusion Matrix
Used sklearn’s ConfusionMatrixDisplay to evaluate predictions visually for a fixed k=5.
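Assuming y_test holds the true labels and y_pred the predictions from the custom KNN function, the plot is only a couple of lines:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix from true vs. predicted labels (k fixed at 5)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
```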

[Figure: confusion matrix for k = 5]

Accuracy Heatmap
I ran 10 random train-test splits (30% test) and varied k from 1 to 15.
Then visualized accuracy trends across combinations using a heatmap.
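The experiment loop looks roughly like this (a sketch that reuses the KNN function and the preprocessed df from the earlier sketches; seaborn draws the heatmap):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X = df[feature_cols].values   # scaled features
y = df["Species"].values      # numeric class labels

records = []
for run in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=run)
    for k in range(1, 16):
        preds = [KNN(X_train, y_train, q, k=k) for q in X_test]
        records.append({"run": run, "k": k, "accuracy": np.mean(np.array(preds) == y_test)})

acc_df = pd.DataFrame(records)
# Pivot to a (run x k) grid and plot accuracy as a heatmap
sns.heatmap(acc_df.pivot(index="run", columns="k", values="accuracy"), annot=True, cmap="viridis")
plt.show()
```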

[Figure: accuracy heatmap across train-test splits and k values]

Best k – Average Accuracy Plot
I computed the average accuracy for each k value across runs, then found the best k using .idxmax().
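Continuing the sketch above, finding the best k from the accuracy table boils down to:

```python
import matplotlib.pyplot as plt

# Average accuracy per k across the 10 runs, then take the k with the highest mean
mean_acc = acc_df.groupby("k")["accuracy"].mean()
best_k = mean_acc.idxmax()

mean_acc.plot(marker="o", xlabel="k", ylabel="average accuracy", title=f"Best k = {best_k}")
plt.show()
```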

[Figure: average accuracy vs. k]

Export
Saved all accuracy results to CSV for later inspection.
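Continuing the same sketch, the export is a one-liner (the filename here is just an example):

```python
# Persist every (run, k, accuracy) record for later inspection
acc_df.to_csv("knn_accuracy_results.csv", index=False)
```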

Interactive Query Input
One of my favorite parts was enabling this command-line prediction:
The user can input 4 features like: SepalL SepalW PetalL PetalW
The input is scaled and classified using the custom KNN function.
It prints the predicted species!

(It might skip in non-interactive environments — I handled that too.)
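A sketch of how that prompt can be wired up, including the guard for non-interactive environments. The scaler, KNN, X_train, y_train and best_k come from the earlier sketches, and the species names assume the standard Iris labels:

```python
species_names = {1: "Iris-setosa", 2: "Iris-versicolor", 3: "Iris-virginica"}

try:
    raw = input("Enter SepalL SepalW PetalL PetalW: ")
    values = [float(v) for v in raw.split()]
    # Scale the query with the same MinMaxScaler that was fitted on the training data
    query = scaler.transform([values])[0]
    prediction = KNN(X_train, y_train, query, k=best_k)
    print("Predicted species:", species_names[prediction])
except (EOFError, OSError):
    # input() fails when no stdin is attached (some notebooks / CI runs), so skip gracefully
    print("Interactive input not available; skipping the query demo.")
```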

Results & Observations
The model performed best at k = 9, with average accuracy peaking there.
Visualization helped reveal class boundaries and overlaps.

Writing the KNN logic from scratch gave me a hands-on view of its behavior.

Features I Added

  1. Normalization using MinMaxScaler
  2. Confusion Matrix using sklearn
  3. Heatmap of accuracy vs. k
  4. Interactive command-line query support
  5. CSV export for tracking experimental results

My code link: https://github.com/Amruta-25/ML_chronicles_KNN.git

This was Day 1 of my ml-chronicles.
If you're just starting out in ML — try building this from scratch.
What would you add or change in this project?

Top comments (1)

Mauricio Mercado

I loved the tutorial, very clear. I'm just starting to explore this; in a month I have to use ML techniques.