What is k-NN?
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric classification method developed by Evelyn Fix and Joseph Hodges in 1951 and later expanded by Thomas Cover. It is used for both classification and regression. In both cases, the input consists of the k closest training examples in the data set. The output depends on whether k-NN is used for classification or regression:
- In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, being assigned to the class most common among its k nearest neighbors. If k = 1, the object is simply assigned to the class of its single nearest neighbor.
- In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors (see the sketch below).
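To make the regression case concrete, here is a minimal from-scratch sketch of k-NN regression using NumPy. It is for illustration only; the function name knn_regress_one, the toy data, and the choice of Euclidean distance are our own assumptions (the model later in this post uses scikit-learn instead).

import numpy as np

def knn_regress_one(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # k-NN regression: average the targets of the k nearest neighbors
    return y_train[nearest].mean()

# Toy example: four 1-D training points, query at x = 2.5
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])
print(knn_regress_one(X_train, y_train, np.array([2.5]), k=2))  # 25.0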
Dataset
The dataset used for the model is "Fish.csv". It consists of 159 rows and 7 columns.
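As a quick sanity check, these counts can be confirmed right after loading the file. The path 'Fish.csv' below is an assumption; adjust it to wherever the CSV lives (the model section later uses the Colab path '/content/Fish.csv').

import pandas as pd

df = pd.read_csv('Fish.csv')  # assumed path; adjust as needed
print(df.shape)  # expected: (159, 7)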
Dataset Description
The attributes used in the dataset are given below:
- Species
- Weight
- Length1
- Length2
- Length3
- Height
- Width
Independent Attributes
The independent attributes in the dataset are:
- Length1
- Length2
- Length3
- Height
- Width
Dependent Attribute
The dependent attribute in the dataset is:
- Weight
Target Attribute
The target attribute in the dataset is:
- Weight
We will predict the weight of a fish by using the other attributes to train the model.
Dataset Head
from math import sqrt

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the dataset (Google Colab path) and show its first rows
df = pd.read_csv('/content/Fish.csv')
df.head()
Here is the head of the dataset used in the model.
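Beyond the head, it can help to glance at the column types and summary statistics. This is an optional exploration step, not part of the original pipeline, and it reuses the df loaded above.

# Optional exploration of the loaded dataset
df.info()      # column names, non-null counts, dtypes
df.describe()  # count, mean, std, min/max of the numeric columns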
Dataset Preprocessing
Since we don't need the Species attribute to predict the weight of a fish, we drop it.
# Drop the Species column; it is not used to predict weight
df.drop(['Species'], axis=1, inplace=True)
# One-hot-encode any remaining categorical columns
# (after dropping Species, all columns are numeric, so this is a no-op)
df = pd.get_dummies(df)
df
The final dataset is given below:
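One caveat worth noting: k-NN is distance-based, so features measured on larger numeric scales dominate the Euclidean distance. The pipeline in this post does not scale its features; the sketch below shows how a standard-scaling step could be added, and is an optional variation rather than part of the original model.

from sklearn.preprocessing import StandardScaler

# Optional: standardize the feature columns so no single feature
# dominates the distance computation (Weight stays unscaled: it is the target)
feature_cols = ['Length1', 'Length2', 'Length3', 'Height', 'Width']
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])
# For a rigorous evaluation, fit the scaler on the training split only,
# so that test-set statistics do not leak into training.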
Model Training and Testing
After preprocessing the dataset, we use the preprocessed data for model training. For this purpose, we split the data, reserving 30% for testing and 70% for training. We then test the model for k from 1 up to 20 to find the value with the minimum Root Mean Square Error (RMSE).
# Model Training
train, test = train_test_split(df, test_size=0.3)
x_train = train.drop('Weight', axis=1)
y_train = train['Weight']
x_test = test.drop('Weight', axis=1)
y_test = test['Weight']
%matplotlib inline
# Model Testing
rmse_val = []
for K in range(20):
    K = K + 1
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    # Model Fitting
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    error = sqrt(mean_squared_error(y_test, pred))
    rmse_val.append(error)
    print('RMSE value for K = ', K, 'is:', error)
# Plotting RMSE values against value of K
curve = pd.DataFrame(rmse_val)
curve.plot()
The RMSE for each value of k is plotted below:
As the figure makes clear, the RMSE is at its minimum for k = 3.
RMSE value for k = 3 is: 47.21824893415052
Prediction Results
So, we use k = 3 to predict the weights of the fish, since it gives the minimum RMSE of approximately 47.
# Model Fitting with the k that gave the minimum RMSE
model = neighbors.KNeighborsRegressor(n_neighbors=3)
model.fit(x_train, y_train)
pred = model.predict(x_test)
error = sqrt(mean_squared_error(y_test, pred))
print('RMSE value for K = ', 3, 'is:', error)
# Prediction Results: attach the predicted weights to the test set
test = test.copy()  # copy to avoid pandas' SettingWithCopyWarning
test['predicted weights'] = pred
test
The prediction results are given below:
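Once fitted, the model can also predict the weight of a single new fish. The measurements below are made-up illustrative values, not rows from the dataset; the column names and order match x_train.

# Hypothetical measurements for one new fish (illustrative values only)
new_fish = pd.DataFrame(
    [[23.0, 25.0, 28.0, 11.0, 4.0]],
    columns=['Length1', 'Length2', 'Length3', 'Height', 'Width'],
)
print('Predicted weight:', model.predict(new_fish)[0])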
References
- Dataset Link: View Fish Dataset
- Download Link: Download Fish Dataset
- GitHub Repository: k-nearest neighbors algorithm (k-NN)