<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: rashidgithub01</title>
    <description>The latest articles on DEV Community by rashidgithub01 (@rashidgithub01).</description>
    <link>https://dev.to/rashidgithub01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F971354%2F438e954e-d6b2-4c10-b5cc-e75e743b83c4.jpg</url>
      <title>DEV Community: rashidgithub01</title>
      <link>https://dev.to/rashidgithub01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rashidgithub01"/>
    <language>en</language>
    <item>
      <title>Running a Lasso Regression</title>
      <dc:creator>rashidgithub01</dc:creator>
      <pubDate>Sun, 13 Nov 2022 18:34:09 +0000</pubDate>
      <link>https://dev.to/rashidgithub01/running-a-lasso-regression-1ki6</link>
      <guid>https://dev.to/rashidgithub01/running-a-lasso-regression-1ki6</guid>
      <description>&lt;h2&gt;
  
  
  What is Lasso Regression
&lt;/h2&gt;

&lt;p&gt;Lasso regression is a regularization technique. It is used over plain regression methods for more accurate prediction. This model uses shrinkage: data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This type of regression is well-suited to models showing high levels of multicollinearity, or when you want to automate parts of model selection, such as variable elimination.&lt;br&gt;
Lasso regression uses the L1 regularization technique. It is useful when we have many features, because it automatically performs feature selection.&lt;/p&gt;
&lt;h2&gt;
  
  
  Lasso Meaning
&lt;/h2&gt;

&lt;p&gt;The word “LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for the regularization of data models and feature selection.&lt;/p&gt;
&lt;h2&gt;
  
  
  Regularization
&lt;/h2&gt;

&lt;p&gt;Regularization is an important concept used to avoid overfitting the data, especially when the training and test data differ substantially. Regularization is implemented by adding a “penalty” term to the best fit derived from the training data, to achieve lower variance on the test data; it also restricts the influence of predictor variables on the output variable by compressing their coefficients. In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients, using regression techniques that apply such a penalty.&lt;/p&gt;
&lt;h2&gt;
  
  
  Lasso Regularization Technique
&lt;/h2&gt;

&lt;p&gt;There are two main regularization techniques, namely Ridge Regression and Lasso Regression. They differ in how they assign a penalty to the coefficients. In this blog, we will focus on the Lasso regularization technique.&lt;/p&gt;
&lt;h2&gt;
  
  
  L1 Regularization
&lt;/h2&gt;

&lt;p&gt;If a regression model uses the L1 regularization technique, it is called Lasso Regression; if it uses the L2 regularization technique, it is called Ridge Regression. L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients. This type of regularization can result in sparse models with few coefficients: some coefficients become exactly zero and are eliminated from the model. Larger penalties push coefficient values closer to zero, which is ideal for producing simpler models. L2 regularization, on the other hand, does not produce sparse models or eliminate coefficients. Thus, Lasso regression is easier to interpret than Ridge.&lt;/p&gt;
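&lt;p&gt;As a tiny illustration of the two penalty terms on a made-up coefficient vector (the numbers below are purely hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

coef = np.array([0.5, -2.0, 0.0, 3.0])  # hypothetical model coefficients
l1_penalty = np.abs(coef).sum()         # L1 (Lasso) penalty term: 5.5
l2_penalty = np.square(coef).sum()      # L2 (Ridge) penalty term: 13.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;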
&lt;h2&gt;
  
  
  Mathematical equation of Lasso Regression
&lt;/h2&gt;

&lt;p&gt;Residual Sum of Squares + λ × (sum of the absolute values of the coefficients)&lt;/p&gt;

&lt;p&gt;Where,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;λ denotes the amount of shrinkage.&lt;/li&gt;
&lt;li&gt;λ = 0 means all features are kept; the model is equivalent to linear regression, where only the residual sum of squares is minimized.&lt;/li&gt;
&lt;li&gt;As λ approaches infinity, more and more features are eliminated, until none are considered.&lt;/li&gt;
&lt;li&gt;Bias increases as λ increases.&lt;/li&gt;
&lt;li&gt;Variance increases as λ decreases.&lt;/li&gt;
&lt;/ul&gt;
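&lt;p&gt;To make the effect of λ concrete, here is a small hedged sketch using scikit-learn (which exposes λ as the &lt;code&gt;alpha&lt;/code&gt; parameter) on synthetic data; as the penalty grows, more coefficients are driven to exactly zero:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic data: 10 features, only 3 of them actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in [0.01, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"alpha={alpha:&gt;6}: {n_zero} of 10 coefficients are exactly zero")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;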

&lt;p&gt;&lt;strong&gt;Lasso Regression Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating New Train and Validation Datasets&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
data_train, data_val = train_test_split(new_data_train, test_size = 0.2, random_state = 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Classifying Predictors and Target&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Classifying Independent and Dependent Features
#_______________________________________________
#Dependent Variable
Y_train = data_train.iloc[:, -1].values
#Independent Variables
X_train = data_train.iloc[:,0 : -1].values
#Independent Variables for Test Set
X_test = data_val.iloc[:,0 : -1].values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Evaluating the Model with RMSLE&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def score(y_pred, y_true):
error = np.square(np.log10(y_pred +1) - np.log10(y_true +1)).mean() ** 0.5
score = 1 - error
return score
actual_cost = list(data_val['COST'])
actual_cost = np.asarray(actual_cost)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Building the Lasso Regressor&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import Lasso
#Initializing the Lasso Regressor with Normalization Factor as True
lasso_reg = Lasso(normalize=True)
#Fitting the Training data to the Lasso regressor
lasso_reg.fit(X_train,Y_train)
#Predicting for X_test
y_pred_lass =lasso_reg.predict(X_test)
#Printing the Score with RMLSE
print("\n\nLasso SCORE : ", score(y_pred_lass, actual_cost))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HNjFS-qP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tyv6fj07msmgmezbx7kv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HNjFS-qP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tyv6fj07msmgmezbx7kv.jpeg" alt="Image description" width="880" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running a K-means Cluster Analysis</title>
      <dc:creator>rashidgithub01</dc:creator>
      <pubDate>Sun, 13 Nov 2022 14:41:28 +0000</pubDate>
      <link>https://dev.to/rashidgithub01/ghj-mi6</link>
      <guid>https://dev.to/rashidgithub01/ghj-mi6</guid>
<description>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TABLE OF CONTENTS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;What is K-means Clustering&lt;/li&gt;
&lt;li&gt;Explanation: Working of K-means Algorithm&lt;/li&gt;
&lt;li&gt;Example code for a set of clustering variables:
&lt;ul&gt;
&lt;li&gt;1. Loading Data&lt;/li&gt;
&lt;li&gt;1.1 Removing rows with incomplete data&lt;/li&gt;
&lt;li&gt;1.2 Selecting clustering variables&lt;/li&gt;
&lt;li&gt;2. Preprocessing Data&lt;/li&gt;
&lt;li&gt;2.1 Scaling Data&lt;/li&gt;
&lt;li&gt;2.2 Splitting into train and test sets&lt;/li&gt;
&lt;li&gt;3. Making K-means analysis for 1 to 9 clusters&lt;/li&gt;
&lt;li&gt;4. Plotting the relation between the number of clusters and average distance&lt;/li&gt;
&lt;li&gt;Output of Plot&lt;/li&gt;
&lt;li&gt;5. Checking the solution for the 3-cluster model&lt;/li&gt;
&lt;li&gt;5.1 Plotting clusters&lt;/li&gt;
&lt;li&gt;Output for Plotting Cluster&lt;/li&gt;
&lt;li&gt;5.2 Merging cluster assignment with clustering variables to examine cluster variable means by cluster&lt;/li&gt;
&lt;li&gt;5.3 Calculating clustering variable means by cluster&lt;/li&gt;
&lt;li&gt;5.4 Validating clusters in training data by examining cluster differences in GPA&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is K-means Clustering
&lt;/h2&gt;

&lt;p&gt;K-means clustering is an unsupervised machine learning algorithm that belongs to a much deeper pool of data techniques and operations in the realm of Data Science. It is one of the fastest and most efficient algorithms for categorizing data points into groups, even when very little information is available about the data.&lt;/p&gt;

&lt;p&gt;Moreover, as with other unsupervised learning methods, it is necessary to understand the data before deciding which technique fits a given dataset best. Choosing the correct algorithm can save time and effort and help obtain more accurate results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explanation: Working of K-means Algorithm
&lt;/h2&gt;

&lt;p&gt;The number of clusters is set by the value of k; if k equals 3, the algorithm produces 3 clusters. The K-means algorithm works as follows (a minimal sketch of this loop is shown after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;K centers are initialized randomly, one for each of the K clusters.&lt;/li&gt;
&lt;li&gt;K-means assigns each data point in the dataset to its nearest center, minimizing the Euclidean distance between point and center. A data point belongs to a particular cluster if it is closer to that cluster’s center than to any other center.&lt;/li&gt;
&lt;li&gt;K-means then recomputes each center as the mean of all data points assigned to that cluster, reducing the total intra-cluster variance relative to the previous step. The “means” in K-means refers to this averaging of data points to identify the new center.&lt;/li&gt;
&lt;li&gt;Steps 2 and 3 are repeated until a stopping criterion is met: the sum of distances between data points and their centers stops decreasing, a set number of iterations is reached, or the cluster centers (and hence the assignments) no longer change.&lt;/li&gt;
&lt;/ul&gt;
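&lt;p&gt;As a rough, illustrative pure-NumPy sketch of these steps (not the scikit-learn implementation used in the example below; it ignores edge cases such as empty clusters):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;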

&lt;h2&gt;
  
  
  Example Code for a Set of Clustering Variables
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
%matplotlib inline
RND_STATE = 55121
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Loading Data
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv("data/tree_addhealth.csv")
data.columns = map(str.upper, data.columns)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.1 Removing rows with incomplete data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_clean = data.dropna()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.2 Selecting clustering variables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]

cluster.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Preprocessing Data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2.1 Scaling Data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clustervar=cluster.copy()
clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64'))
clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64'))
clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64'))
clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64'))
clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64'))
clustervar['VIOL1']=preprocessing.scale(clustervar['VIOL1'].astype('float64'))
clustervar['DEVIANT1']=preprocessing.scale(clustervar['DEVIANT1'].astype('float64'))
clustervar['FAMCONCT']=preprocessing.scale(clustervar['FAMCONCT'].astype('float64'))
clustervar['SCHCONN1']=preprocessing.scale(clustervar['SCHCONN1'].astype('float64'))
clustervar['PARACTV']=preprocessing.scale(clustervar['PARACTV'].astype('float64'))
clustervar['PARPRES']=preprocessing.scale(clustervar['PARPRES'].astype('float64'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.2 Splitting into train and test sets&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clus_train, clus_test = train_test_split(clustervar, test_size=0.3, random_state=RND_STATE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Making K-means analysis for 1 to 9 clusters
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters=range(1,10)
meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) 
    / clus_train.shape[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Plotting the relation between the number of clusters and average distance
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of Plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FCAM6TTa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g3pi3mz7r2s18hk183cl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FCAM6TTa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g3pi3mz7r2s18hk183cl.jpeg" alt="Image description" width="880" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Checking the solution for the 3-cluster model
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.1 Plotting clusters&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output for Plotting Cluster&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xr-RZNIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/prz0f3mbbaqfsq6e7w3s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xr-RZNIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/prz0f3mbbaqfsq6e7w3s.jpeg" alt="Image description" width="880" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5.2 Merging cluster assignment with clustering variables to examine cluster variable means by cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clus_train.reset_index(level=0, inplace=True)
cluslist=list(clus_train['index'])
labels=list(model3.labels_)
newlist=dict(zip(cluslist, labels))

newclus=DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']
newclus.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0D85eJs_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gd01k18uf56qibgxbnex.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0D85eJs_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gd01k18uf56qibgxbnex.jpeg" alt="Image description" width="880" height="891"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;newclus.reset_index(level=0, inplace=True)
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.3 Calculating clustering variable means by cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.4 Validating clusters in training data by examining cluster differences in GPA using ANOVA&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpa_data=data_clean['GPA1']
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=RND_STATE)
gpa_train1=pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print (gpamod.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print ('means for GPA by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print ('standard deviations for GPA by cluster')
m2= sub1.groupby('cluster').std()
print (m2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>coursera</category>
    </item>
    <item>
      <title>Running Random Forest</title>
      <dc:creator>rashidgithub01</dc:creator>
      <pubDate>Sat, 12 Nov 2022 19:39:40 +0000</pubDate>
      <link>https://dev.to/rashidgithub01/running-random-forest-14i5</link>
      <guid>https://dev.to/rashidgithub01/running-random-forest-14i5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Random forest is a technique used in predictive modeling and behavior analysis, and is built on decision trees.&lt;br&gt;
It contains many decision trees, each representing a distinct instance of the classification of the data fed into the random forest, and is widely used in classification and regression problems.&lt;br&gt;
It builds decision trees on different samples and takes their majority vote for classification or their average for regression.&lt;br&gt;
One of the most important features of the random forest algorithm is that it can handle data sets containing continuous variables, as in regression, as well as categorical variables, as in classification.&lt;/p&gt;
&lt;h2&gt;
  
  
  Real Life Analogy:-
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S9iFRSbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ke4qvvdzs7z81tr5nqc3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S9iFRSbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ke4qvvdzs7z81tr5nqc3.jpeg" alt="Image description" width="763" height="260"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Working of Random Forest Algorithm:
&lt;/h2&gt;

&lt;p&gt;Random forest is an ensemble technique, so it helps to understand ensembles first. Ensemble methods use two main types of approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Boosting&lt;/strong&gt; :- Boosting combines weak learners into a strong learner by building models sequentially so that the final model has the highest accuracy. Examples: AdaBoost, XGBoost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Bagging&lt;/strong&gt; :- Bagging creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. Example: random forest.&lt;/p&gt;
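&lt;p&gt;A quick hedged sketch of the bootstrap sampling that bagging relies on (toy data, NumPy only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy training set of 10 records
# bagging: each tree is trained on a sample drawn with replacement,
# so some records repeat and others are left out ("out-of-bag")
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;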

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0sg6mow7wlgpzfft32fm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0sg6mow7wlgpzfft32fm.jpeg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Steps involved in Random Forest:-
&lt;/h2&gt;

&lt;p&gt;Step 1: In random forest, n random records are taken from a data set having k records.&lt;br&gt;
Step 2: An individual decision tree is constructed for each sample.&lt;br&gt;
Step 3: Each decision tree generates an output.&lt;br&gt;
Step 4: The final output is based on majority voting for classification or averaging for regression, as sketched below.&lt;/p&gt;
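&lt;p&gt;A toy sketch of Step 4, assuming we already have the per-tree predictions (the arrays below are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# hypothetical predictions from 5 trees for 4 samples (classification: 0/1)
tree_preds = np.array([[1, 0, 1, 1],
                       [1, 1, 0, 1],
                       [0, 0, 1, 1],
                       [1, 0, 0, 1],
                       [1, 0, 1, 0]])
# classification: majority vote across the trees
final_class = (tree_preds.mean(axis=0) &gt; 0.5).astype(int)  # -&gt; [1 0 1 1]
# regression: the forest would average the per-tree predictions instead
final_reg = tree_preds.mean(axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;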

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H5WRbT3W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2kzqh2jlnyia8l22euom.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H5WRbT3W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2kzqh2jlnyia8l22euom.jpeg" alt="Image description" width="846" height="430"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Coding in Python:-
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Let's import the Libraries:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Importing data set:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('heart_v2.csv')
print(df.head())
sns.countplot(x='heart disease', data=df)
plt.title('Value counts of heart disease patients')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wsUdqsep--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mtnp9j6hxtnhmudpj5va.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wsUdqsep--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mtnp9j6hxtnhmudpj5va.jpeg" alt="Image description" width="302" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Putting Feature Variable to X and Target Variable to Y:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = df.drop('heart disease',axis=1)
y = df['heart disease']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Train Test Split is Performed:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6XtbfcrX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n380k65r15c52rtoa0az.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6XtbfcrX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n380k65r15c52rtoa0az.jpeg" alt="Image description" width="168" height="27"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Import Random Forest Classifier and fit the data:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5, n_estimators=100, oob_score=True)
# %time is a Jupyter line magic; drop it when running this as a plain script
%time classifier_rf.fit(X_train, y_train)
classifier_rf.oob_score_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Anae2ryb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7x5sabdfmcycvhehoi6g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Anae2ryb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7x5sabdfmcycvhehoi6g.jpeg" alt="Image description" width="633" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Hyperparameter tuning for Random Forest using GridSearchCV and fit the data:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
    'max_depth': [2,3,5,10,20],
    'min_samples_leaf': [5,10,20,50,100,200],
    'n_estimators': [10,25,30,50,100,200]
}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=params, cv = 4, n_jobs=-1, verbose=1, scoring="accuracy")
# %time is a Jupyter line magic; drop it when running this as a plain script
%time grid_search.fit(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hBBLM3VT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxfxnfnhqktrtw4ndg46.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hBBLM3VT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxfxnfnhqktrtw4ndg46.jpeg" alt="Image description" width="681" height="280"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grid_search.best_score_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sNyW0VRg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7w7410g90bnv2iqkb7t.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sNyW0VRg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7w7410g90bnv2iqkb7t.jpeg" alt="Image description" width="170" height="25"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf_best = grid_search.best_estimator_
rf_best
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QaaanrFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhh4pe4xd6784lx1057k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QaaanrFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhh4pe4xd6784lx1057k.jpeg" alt="Image description" width="578" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Visualization:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.tree import plot_tree
plt.figure(figsize=(80,40))
# class_names must follow the classifier's class order (assumed here: 0 = No Disease, 1 = Disease)
plot_tree(rf_best.estimators_[5], feature_names=X.columns, class_names=['No Disease', 'Disease'], filled=True);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FZccHpyC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/erc67vp1u1g22a616ljx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FZccHpyC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/erc67vp1u1g22a616ljx.jpeg" alt="Image description" width="880" height="430"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.tree import plot_tree
plt.figure(figsize=(80,40))
plot_tree(rf_best.estimators_[7], feature_names=X.columns, class_names=['No Disease', 'Disease'], filled=True);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-2utg7k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7kgln47ovlizb03lwozo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-2utg7k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7kgln47ovlizb03lwozo.jpeg" alt="Image description" width="880" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Sorting of Data according to feature importance:-&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf_best.feature_importances_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H2SIzKHL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s21txf9l4nol6qaj66bi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H2SIzKHL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s21txf9l4nol6qaj66bi.jpeg" alt="Image description" width="442" height="32"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf_best.feature_importances_
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;imp_df.sort_values(by="Imp", ascending=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LpEK_iKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3zl9gxetuzdm9gkr371q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LpEK_iKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3zl9gxetuzdm9gkr371q.jpeg" alt="Image description" width="176" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary:
&lt;/h2&gt;

&lt;p&gt;We can conclude that random forest is one of the best-performing techniques, widely used across industries for its efficiency. It can handle binary, continuous, and categorical data.&lt;br&gt;
Random forest is a great choice for building a model quickly and efficiently, and many implementations can handle missing values as well.&lt;br&gt;
Overall, random forest is a fast, simple, flexible, and robust model, with some limitations. It is a combination of decision trees that can be used for prediction and behavior analysis.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
