Amruta

ML-chronicles: Day 1 – Understanding KNN with the Iris Dataset

Understanding KNN with the Iris Dataset – A Beginner's Visual Guide

Introduction
As part of my ml-chronicles series, I decided to start things off with a foundational yet powerful algorithm: K-Nearest Neighbors (KNN).
Before diving in, I skimmed through some great GFG articles to grasp the theory. Later, I followed a YouTube tutorial to see how it's applied in practice. I wanted this project to be more interactive, so I added several features to visualize the model's behavior and to explore its evaluation metrics.
This post covers the following:

  1. Building a KNN model from scratch,
  2. Applying it to the Iris dataset,
  3. Visualizing the results,
  4. Testing with custom inputs, and
  5. Finding the best value of k.

Credits and resources:
GFG article: https://www.geeksforgeeks.org/k-nearest-neighbours/
YouTube tutorial: https://youtu.be/mpfU9n4MzBE?feature=shared

What is KNN?

KNN stands for K-Nearest Neighbors. It is often called a lazy learning algorithm because it doesn't train a model explicitly: it simply stores the training dataset and does all the work at prediction time. To classify a query point, it looks at the k nearest data points around it and returns the most common class among them.
If the dataset has significant outliers or noise, a higher k helps smooth out predictions. However, a very high k can cause underfitting.
There are various ways to select a good value of k: cross-validation, the elbow method, preferring odd values to avoid ties, etc.
Distances can be measured with any of the common metrics: Euclidean distance, Manhattan distance, Minkowski distance, etc.
Steps involved:
1. Measure the distance (usually Euclidean) from the query to all training points.
2. Pick the k closest ones.
3. Vote on the class label.
KNN works best with normalized data and small to medium datasets.
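For reference, the common distance metrics mentioned above look roughly like this (a minimal sketch; the function names are mine, not part of the project):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    # Sum of absolute differences along each feature
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def minkowski(a, b, p=3):
    # Generalization: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p)
```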

The Dataset: Iris
The classic Iris dataset has:

  • 150 samples across 3 classes: Setosa, Versicolor, and Virginica.
  • 4 numerical features: Sepal Length, Sepal Width, Petal Length, and Petal Width.
  • For my project, I removed the Id column and normalized all features using MinMaxScaler. I also mapped the class labels to numeric codes (1, 2, 3) for internal processing.

My Model Architecture
Preprocessing

  1. Normalized all feature columns to [0, 1].
  2. Mapped species to numeric values for classification.
  3. Used 3D scatter plots to visualize separability among classes.
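Here is a minimal sketch of those preprocessing steps. The column names follow the common Iris.csv layout (SepalLengthCm, SepalWidthCm, ...); adjust them to match your file:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("Iris.csv").drop(columns=["Id"])  # drop the Id column

# Map species names to numeric codes (1, 2, 3)
species_map = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
df["Species"] = df["Species"].map(species_map)

# Scale the four feature columns to [0, 1]
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Quick 3D scatter of three features, coloured by class, to eyeball separability
ax = plt.axes(projection="3d")
ax.scatter(df["PetalLengthCm"], df["PetalWidthCm"], df["SepalLengthCm"], c=df["Species"])
plt.show()
```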

Custom KNN from Scratch
I implemented KNN in a separate module (kNN_modules.py).
It includes:

  1. A distance function to measure closeness (Euclidean distance is used).
  2. A KNN() function that predicts the class using statistics.mode() from nearest neighbors.
  3. An optional print of the selected neighbors, which helped me build intuition about how predictions are formed.
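A rough reconstruction of what such a module can look like (this is my own sketch, not the exact code in kNN_modules.py):

```python
import statistics
import numpy as np

def euclidean_distance(a, b):
    # Distance between a query point and one training point
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def KNN(X_train, y_train, query, k=5, verbose=False):
    # Distance from the query to every training sample
    distances = [euclidean_distance(query, x) for x in X_train]
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    neighbor_labels = [y_train[i] for i in nearest]
    if verbose:
        print("Selected neighbors (index, label):", list(zip(nearest.tolist(), neighbor_labels)))
    # Majority vote among the neighbors decides the predicted class
    return statistics.mode(neighbor_labels)
```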

Visualizations: The following visuals were implemented; the results are attached below.

Confusion Matrix
Used sklearn’s ConfusionMatrixDisplay to evaluate predictions visually for a fixed k=5.
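Assuming y_test holds the true labels and y_pred the predictions from the custom KNN function, the plot is only a couple of lines:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix from true vs. predicted labels (k fixed at 5)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
```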

[Figure: confusion matrix for k = 5]

Accuracy Heatmap
I ran 10 random train-test splits (30% test) and varied k from 1 to 15.
Then visualized accuracy trends across combinations using a heatmap.
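The experiment loop looks roughly like this (a sketch that reuses the KNN function and the preprocessed df from the earlier sketches; seaborn draws the heatmap):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X = df[feature_cols].values   # scaled features
y = df["Species"].values      # numeric class labels

records = []
for run in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=run)
    for k in range(1, 16):
        preds = [KNN(X_train, y_train, q, k=k) for q in X_test]
        records.append({"run": run, "k": k, "accuracy": np.mean(np.array(preds) == y_test)})

acc_df = pd.DataFrame(records)
# Pivot to a (run x k) grid and plot accuracy as a heatmap
sns.heatmap(acc_df.pivot(index="run", columns="k", values="accuracy"), annot=True, cmap="viridis")
plt.show()
```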

[Figure: accuracy heatmap across train-test splits and k values]

Best k – Average Accuracy Plot
I computed the average accuracy for each k value across runs, then found the best k using .idxmax().
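Continuing the sketch above, finding the best k from the accuracy table boils down to:

```python
import matplotlib.pyplot as plt

# Average accuracy per k across the 10 runs, then take the k with the highest mean
mean_acc = acc_df.groupby("k")["accuracy"].mean()
best_k = mean_acc.idxmax()

mean_acc.plot(marker="o", xlabel="k", ylabel="average accuracy", title=f"Best k = {best_k}")
plt.show()
```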

[Figure: average accuracy vs. k]

Export
Saved all accuracy results to CSV for later inspection.
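Continuing the same sketch, the export is a one-liner (the filename here is just an example):

```python
# Persist every (run, k, accuracy) record for later inspection
acc_df.to_csv("knn_accuracy_results.csv", index=False)
```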

Interactive Query Input
One of my favorite parts was enabling this command-line prediction:
The user can input 4 features like: SepalL SepalW PetalL PetalW
The input is scaled and classified using the custom KNN function.
It prints the predicted species!

(It might skip in non-interactive environments — I handled that too.)
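A sketch of how that prompt can be wired up, including the guard for non-interactive environments. The scaler, KNN, X_train, y_train and best_k come from the earlier sketches, and the species names assume the standard Iris labels:

```python
species_names = {1: "Iris-setosa", 2: "Iris-versicolor", 3: "Iris-virginica"}

try:
    raw = input("Enter SepalL SepalW PetalL PetalW: ")
    values = [float(v) for v in raw.split()]
    # Scale the query with the same MinMaxScaler that was fitted on the training data
    query = scaler.transform([values])[0]
    prediction = KNN(X_train, y_train, query, k=best_k)
    print("Predicted species:", species_names[prediction])
except (EOFError, OSError):
    # input() fails when no stdin is attached (some notebooks / CI runs), so skip gracefully
    print("Interactive input not available; skipping the query demo.")
```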

Results & Observations
The model performed best at k = 9, with average accuracy peaking there.
Visualization helped reveal class boundaries and overlaps.

Writing the KNN logic from scratch gave me a hands-on view of its behavior.

Features I Added

  1. Normalization using MinMaxScaler
  2. Confusion Matrix using sklearn
  3. Heatmap of accuracy vs. k
  4. Interactive command-line query support
  5. CSV export for tracking experimental results

My code link: https://github.com/Amruta-25/ML_chronicles_KNN.git

This was Day 1 of my ml-chronicles.
If you're just starting out in ML — try building this from scratch.
What would you add or change in this project?

Top comments (1)

Mauricio Mercado

I loved the tutorial, very clear. I'm just starting to explore this; in a month I have to use ML techniques.