<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rupak Biswas</title>
    <description>The latest articles on DEV Community by Rupak Biswas (@rupak2001).</description>
    <link>https://dev.to/rupak2001</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F744879%2F000f5885-fea4-467e-af60-6ec9349ac10e.jpeg</url>
      <title>DEV Community: Rupak Biswas</title>
      <link>https://dev.to/rupak2001</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rupak2001"/>
    <language>en</language>
    <item>
      <title>Demonstrating K means Clustering on Iris Dataset</title>
      <dc:creator>Rupak Biswas</dc:creator>
      <pubDate>Wed, 11 Oct 2023 18:14:09 +0000</pubDate>
      <link>https://dev.to/rupak2001/demonstrating-k-means-clustering-on-iris-dataset-4f3d</link>
      <guid>https://dev.to/rupak2001/demonstrating-k-means-clustering-on-iris-dataset-4f3d</guid>
      <description>&lt;p&gt;K-means clustering was performed to evaluate the possible clusters that can be derived from the features of the given dataset, hence giving an unsupervised model. The following explanatory variables were included as possible contributors to the K-means clustering model (output): the petal length &amp;amp; petal width.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iris = load_iris(as_frame=True)
iris.data
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

X = iris.data
X_act = iris.data 
X = X.drop(['sepal length (cm)','sepal width (cm)'],axis=1)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X[['petal length (cm)']])
X['Scaled_PL'] = scaler.transform(X[['petal length (cm)']])
scaler.fit(X[['petal width (cm)']])
X['Scaled_PW'] = scaler.transform(X[['petal width (cm)']])

X = X.drop(['petal length (cm)','petal width (cm)'],axis=1)

plt.scatter(X['Scaled_PL'],X['Scaled_PW'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U7BaP1wt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/le0idcw6dsaybe0yabcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U7BaP1wt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/le0idcw6dsaybe0yabcz.png" alt="Scatter Plot" width="800" height="565"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Scatter Plot Showing the possible cluster between petal length and petal width&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.cluster import KMeans
model = KMeans(n_clusters = 2)
model.fit(X)

predictions = model.predict(X)
X['clusters'] = predictions
X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ACSgSw0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gepwmo7havrnuxpxkk4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ACSgSw0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gepwmo7havrnuxpxkk4n.png" alt="list" width="556" height="593"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Prediction shows cluster number (Here we have 2)&lt;/em&gt;&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cluster0 = X[['Scaled_PL','Scaled_PW']][X.clusters == 0]
cluster1 = X[['Scaled_PL','Scaled_PW']][X.clusters == 1]
centroids = model.cluster_centers_
plt.scatter(cluster0['Scaled_PL'],cluster0['Scaled_PW'],color="yellow",label="Cluster1")
plt.scatter(cluster1['Scaled_PL'],cluster1['Scaled_PW'],color="orange",label="Cluster2")
plt.scatter(centroids[:,0],centroids[:,1],marker="*",color="purple",label="centroid")
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5AAM_e5B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fmaxw3eyhpk0i7mzsyo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5AAM_e5B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fmaxw3eyhpk0i7mzsyo4.png" alt="predicted plot" width="800" height="536"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Scatter Plot representing the 2 predicted clusters along with their centroids&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Finding Elbow&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSE = []

for i in range(1,11):
    test_model = KMeans(n_clusters=i)
    test_model.fit(X[['Scaled_PL','Scaled_PW']])
    SSE.append(test_model.inertia_)

plt.plot(SSE)
plt.xlabel("K")
plt.ylabel("SSE")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zvI9toFj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yq0jjk8wl60t136mvjkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zvI9toFj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yq0jjk8wl60t136mvjkh.png" alt="elbow plot" width="800" height="551"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Possible elbow found for the predicted Kmeans model&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Here we get the elbow of the curve in roughly the 1 to 5 range (no. of clusters). To get accurate predictions we should choose a K value between 1 and 5.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>House price prediction using Lasso Regression</title>
      <dc:creator>Rupak Biswas</dc:creator>
      <pubDate>Fri, 06 Oct 2023 18:23:53 +0000</pubDate>
      <link>https://dev.to/rupak2001/house-price-prediction-using-lasso-regression-4lng</link>
      <guid>https://dev.to/rupak2001/house-price-prediction-using-lasso-regression-4lng</guid>
      <description>&lt;p&gt;Lasso Regression analysis was performed to evaluate the importance of a series of explanatory variables in predicting a probable answer, or price in this case. The following explanatory variables were included as possible contributors to a Lasso Regression evaluating the probable price of a house in Melbourne (output): the area, no. of rooms, landsize and many more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is my copy of the Colab notebook so everyone can see the output along with the code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv("Melbourne_housing_FULL.csv")
df.nunique()

Suburb 351
Address 34009
Rooms 12
Type 3
Price 2871
Method 9
SellerG 388
Date 78
Distance 215
Postcode 211
Bedroom2 15
Bathroom 11
Car 15
Landsize 1684
BuildingArea 740
YearBuilt 160
CouncilArea 33
Lattitude 13402
Longtitude 14524
Regionname 8
Propertycount 342
dtype: int64

dfS = df[['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 
               'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']]

Suburb Rooms Type Method SellerG Regionname Propertycount Distance CouncilArea Bedroom2 Bathroom
0 Abbotsford 2 h SS Jellis Northern
Metropolitan 4019.0 2.5 Yarra City Council 2.0 1.0
34857 rows × 15 columns

dfS
dfS.isna().sum()

Suburb 0
Rooms 0
Type 0
Method 0
SellerG 0
Regionname 3
Propertycount 3
Distance 1
CouncilArea 3


dfS[['Propertycount','Distance','Bedroom2','Bathroom','Car']] = dfS[['Propertycount','Distance','
dfS['Landsize']=dfS['Landsize'].fillna(dfS['Landsize'].mean())
dfS['BuildingArea']=dfS['BuildingArea'].fillna(dfS['BuildingArea'].mean())

dfS.dropna(inplace=True)
dfS = pd.get_dummies(dfS,drop_first=True)

X = dfS.drop 
Y= dfS['Price']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import Lasso #l1
CPU times: user 3 µs, sys: 2 µs, total: 5 µs
Wall time: 10.3 µs
▾ Lasso
Lasso(alpha=50, tol=0.1)
lasso = Lasso(alpha=50, max_iter=1000, tol=0.1)
lasso.fit(X_train,Y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predictions = lasso.predict(X_test)
predictions
array([1323721.23922339, 721160.34344916, 623689.80964616, ...,
 987946.0460597 , 983561.59313765, 160658.00272658])
lasso.score(X_test,Y_test)
0.6388165172009165

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;As mentioned in the code from the Lasso Regression analysis, we get an overall accuracy of about 63% (as shown in the code)&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Survival Prediction in Titanic Using Decision Tree</title>
      <dc:creator>Rupak Biswas</dc:creator>
      <pubDate>Thu, 05 Oct 2023 18:23:48 +0000</pubDate>
      <link>https://dev.to/rupak2001/survival-prediction-in-titanic-using-decision-tree-3c2h</link>
      <guid>https://dev.to/rupak2001/survival-prediction-in-titanic-using-decision-tree-3c2h</guid>
      <description>&lt;p&gt;Decision Tree analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a decision tree evaluating Survival of a person from the Titanic wreckage (output) includes the Passenger-class, Sex, Age, &amp;amp; Fare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is my copy of the Colab notebook so everyone can see the output along with the code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv('titanic.csv')

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1
Cumings, Mrs. John
Bradley (Florence Briggs
Th...
female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2.
3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques
Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

df.head(5)
X=df[['Pclass','Sex','Age','Fare','Survived']]
from sklearn.preprocessing import LabelEncoder
 X.Sex = le.fit_transform(X.Sex)

Pclass Sex Age Fare Survived
0 3 1 22.0 7.2500 0
1 1 0 38.0 71.2833 1
2 3 0 26.0 7.9250 1
3 1 0 35.0 53.1000 1
4 3 1 35.0 8.0500 0
... ... ... ... ... ...
886 2 1 27.0 13.0000 0
887 1 0 19.0 30.0000 1
888 3 0 NaN 23.4500 0
889 1 1 26.0 30.0000 1
890 3 1 32.0 7.7500 0
891 rows × 5 columns

le = LabelEncoder()
X.Sex = le.fit_transform(X.Sex)
X
X = X.dropna()
Y = X['Survived']
X = X.drop('Survived',axis='columns')
Y

0 0
1 1
2 1
3 1
4 0
 ..
885 0
886 0

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,_Y_test = train_test_split(X,Y,test_size=0.2)
from sklearn import tree
# from sklearn.linear_model import LogisticRegression
▾ DecisionTreeClassifier
DecisionTreeClassifier()
model = tree.DecisionTreeClassifier()
# model = LogisticRegression()
model.fit(X_train,Y_train)
predictions = model.predict(X_test)
predictions

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,
 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output &amp;amp; Accuracy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.score(X_test,_Y_test)
0.7482517482517482

model.predict([[1,0,39,71.2833]])
[1]

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(_Y_test,predictions)
import seaborn as sn
sn.heatmap(cm,annot=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HBNR5pjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4b8vgx0i5k6x37cajvqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HBNR5pjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4b8vgx0i5k6x37cajvqk.png" alt="Confusion matrix of decision tree" width="800" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As mentioned in the code from the decision tree classifier, We get an overall accuracy of about 75% (as shown in code)&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Demonstrating Random Forest by using IRIS dataset</title>
      <dc:creator>Rupak Biswas</dc:creator>
      <pubDate>Tue, 03 Oct 2023 18:35:18 +0000</pubDate>
      <link>https://dev.to/rupak2001/demonstrating-random-forest-by-using-iris-dataset-2n31</link>
      <guid>https://dev.to/rupak2001/demonstrating-random-forest-by-using-iris-dataset-2n31</guid>
      <description>&lt;p&gt;Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating Flower type (output) includes the petal length, petal width, sepal length and sepal width.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is my copy of the Colab notebook so everyone can see the output along with the code&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Code:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
classes = iris['target_names']
classes
array(['setosa', 'versicolor', 'virginica'], dtype='&amp;lt;U10')
X=iris['data']
Y=iris['target']
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3)
from sklearn.ensemble import RandomForestClassifier
▾ RandomForestClassifier
RandomForestClassifier(n_estimators=50)
model = RandomForestClassifier(n_estimators=50)
model.fit(X_train,Y_train)
model.score(X_test,Y_test)
0.9555555555555556
Output
print("actual result:",classes[Y_test[2]])
print("predicted result:",classes[model.predict([X_test[2]])[0]])
actual result: virginica
predicted result: virginica
predictions = model.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test,predictions)
import seaborn as sn
sn.heatmap(cm,annot=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Confusion Matrix Created based on predictions done on test dataset&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;X axis represents actual values &amp;amp; Y_axis represents predicted values&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output and accuracy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.score(X_test,Y_test)
0.9555555555555556
Output
print("actual result:",classes[Y_test[2]])
print("predicted result:",classes[model.predict([X_test[2]])[0]])
actual result: virginica
predicted result: virginica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UeKCXXNo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9se7895ugp94dgxji56c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UeKCXXNo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9se7895ugp94dgxji56c.png" alt="Confusion Matrix Created based on predictions done on test dataset" width="800" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As mentioned in the code from the random forest classifier, we get an overall accuracy of about 96% (as shown in the code)&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
