<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: F.elicià</title>
    <description>The latest articles on DEV Community by F.elicià (@priyanshusah7).</description>
    <link>https://dev.to/priyanshusah7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F942853%2F32f9946e-2700-4734-be50-617bb122b94c.jpg</url>
      <title>DEV Community: F.elicià</title>
      <link>https://dev.to/priyanshusah7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/priyanshusah7"/>
    <language>en</language>
    <item>
      <title>Running a K-means Cluster Analysis</title>
      <dc:creator>F.elicià</dc:creator>
      <pubDate>Sun, 13 Nov 2022 13:09:59 +0000</pubDate>
      <link>https://dev.to/priyanshusah7/running-a-k-means-cluster-analysis-jig</link>
      <guid>https://dev.to/priyanshusah7/running-a-k-means-cluster-analysis-jig</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is k-means clustering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;k-means clustering is an unsupervised machine learning algorithm that belongs to a much deeper pool of data techniques and operations in the realm of Data Science. It is one of the fastest and most efficient algorithms for categorizing data points into groups, even when very little information is available about the data.&lt;br&gt;
Moreover, as with other unsupervised learning methods, it is necessary to understand the data before deciding which technique fits a given dataset. Choosing the correct algorithm, in return, can save time and effort and assist in obtaining more accurate results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation: Working of the k-means Algorithm&lt;/strong&gt; &lt;br&gt;
The number of clusters is fixed by the value of k: if k is equal to 3, the algorithm produces 3 clusters. The k-means algorithm works in the following steps (a code sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;k centers are placed randomly, in accordance with the chosen value of k.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;k-means assigns each data point in the dataset to the nearest center, attempting to minimize the Euclidean distance between points and centers. A data point belongs to a particular cluster if it is closer to that cluster's center than to any other center.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After that, k-means recomputes each center as the mean of all data points assigned to that cluster, which reduces the total intra-cluster variance relative to the previous step. Here the "means" refers to the average of the data points, and it identifies the new center in the k-means method.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm repeats steps 2 and 3 until a stopping criterion is met, such as: the distances between data points and their respective centers are minimized, an appropriate number of iterations is reached, the cluster centers stop changing, or no data points change cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
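
&lt;p&gt;To make these steps concrete, here is a minimal NumPy sketch of the k-means loop described above. It is an illustration only, not the scikit-learn implementation used later; X, k, and the helper name kmeans are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;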

&lt;p&gt;&lt;strong&gt;Example code is given below, run on a set of clustering variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
%matplotlib inline
RND_STATE = 55121
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Loading the Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv("data/tree_addhealth.csv")
data.columns = map(str.upper, data.columns)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  1.1 Removing rows with incomplete data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_clean = data.dropna()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  1.2 Selecting clustering variables
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]
cluster.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Pre-processing the Data
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.1 Scaling data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clustervar=cluster.copy()
clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64'))
clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64'))
clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64'))
clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64'))
clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64'))
clustervar['VIOL1']=preprocessing.scale(clustervar['VIOL1'].astype('float64'))
clustervar['DEVIANT1']=preprocessing.scale(clustervar['DEVIANT1'].astype('float64'))
clustervar['FAMCONCT']=preprocessing.scale(clustervar['FAMCONCT'].astype('float64'))
clustervar['SCHCONN1']=preprocessing.scale(clustervar['SCHCONN1'].astype('float64'))
clustervar['PARACTV']=preprocessing.scale(clustervar['PARACTV'].astype('float64'))
clustervar['PARPRES']=preprocessing.scale(clustervar['PARPRES'].astype('float64'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.2 Splitting into train and test sets
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clus_train,clus_test=train_test_split(clustervar,test_size=0.3,random_state=RND_STATE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Running k-means for 1 to 9 clusters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters=range(1,10)
meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) 
    / clus_train.shape[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Plotting the relation between the number of clusters and average distance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.plot(clusters,meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output of plot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rsnV8ELc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xi4uu0iugnk4lqa868w2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rsnV8ELc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xi4uu0iugnk4lqa868w2.png" alt="Image description" width="389" height="278"&gt;&lt;/a&gt;&lt;/p&gt;
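
&lt;p&gt;As an aside, scikit-learn's KMeans also exposes the within-cluster sum of squares directly through the inertia_ attribute, so an equivalent elbow curve can be drawn without computing the distances by hand. A minimal sketch, reusing clusters and clus_train from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inertias = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    # inertia_ is the sum of squared distances of samples to their closest center
    inertias.append(model.inertia_)

plt.plot(clusters, inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;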

&lt;h3&gt;
  
  
  5. Checking the 3-cluster solution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  5.1 Plotting cluster
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output for plotting cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dqo8m0Bx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fo5r76yuiu0inmycik10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dqo8m0Bx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fo5r76yuiu0inmycik10.png" alt="Image description" width="388" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  5.2 Merging cluster assignment with clustering variables to examine cluster variable means by cluster
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clus_train.reset_index(level=0, inplace=True)
cluslist=list(clus_train['index'])
labels=list(model3.labels_)
newlist=dict(zip(cluslist, labels))

newclus=DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']
newclus.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--grXsfhmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4hwxtvlxplz3s0stziu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--grXsfhmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4hwxtvlxplz3s0stziu.png" alt="Image description" width="880" height="506"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;newclus.reset_index(level=0, inplace=True)
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
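
&lt;p&gt;The same merge can be done more directly, since model3.labels_ is ordered like the rows of clus_train, so the labels can simply be attached as a new column. A shorter equivalent sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# attach each training row's cluster label as a new column
merged_train = clus_train.copy()
merged_train['cluster'] = model3.labels_
merged_train.cluster.value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;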



&lt;h4&gt;
  
  
  5.3 Calculating clustering variable means by cluster
&lt;/h4&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  5.4 Validating clusters in training data by examining cluster differences in GPA using ANOVA
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpa_data=data_clean['GPA1']
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=RND_STATE)
gpa_train1=pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print( gpamod.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('Means for GPA by cluster')
m1= sub1.groupby('cluster').mean()
print(m1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('Standard deviation for GPA by cluster')
m2= sub1.groupby('cluster').std()
print(m2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc1=multi.MultiComparison(sub1['GPA1'],sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Running a Lasso Regression Analysis</title>
      <dc:creator>F.elicià</dc:creator>
      <pubDate>Thu, 10 Nov 2022 18:11:03 +0000</pubDate>
      <link>https://dev.to/priyanshusah7/running-a-lasso-regression-analysis-k15</link>
      <guid>https://dev.to/priyanshusah7/running-a-lasso-regression-analysis-k15</guid>
<description>&lt;p&gt;&lt;strong&gt;What is Lasso Regression&lt;/strong&gt;&lt;br&gt;
Lasso regression is a regularization technique. It is used over plain regression methods for more accurate prediction. This model uses shrinkage: data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, like variable selection/parameter elimination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lasso Meaning&lt;/strong&gt;&lt;br&gt;
The word "LASSO" stands for Least Absolute Shrinkage and Selection Operator. It is a statistical method for the regularization of data models and feature selection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regularization&lt;/strong&gt;&lt;br&gt;
Regularization is an important concept used to avoid overfitting the data, especially when the training and test data vary considerably. Regularization is implemented by adding a "penalty" term to the best fit derived from the training data, to achieve lower variance on the test data; it also restricts the influence of predictor variables over the output variable by compressing their coefficients. In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients, using regression techniques that apply regularization to overcome this problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lasso Regularization Technique&lt;/strong&gt;&lt;br&gt;
There are two main regularization techniques, namely Ridge Regression and Lasso Regression. They differ in the way they assign a penalty to the coefficients. In this blog, we will try to understand more about the Lasso regularization technique.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L1 Regularization&lt;/strong&gt;&lt;br&gt;
If a regression model uses the L1 regularization technique, it is called Lasso Regression; if it uses the L2 regularization technique, it is called Ridge Regression. L1 regularization adds a penalty that is equal to the absolute value of the magnitude of the coefficients. Some coefficients may become zero and be eliminated from the model. Larger penalties result in coefficient values that are closer to zero (ideal for producing simpler models). L2 regularization, on the other hand, does not eliminate coefficients or produce sparse models. Thus, Lasso Regression is easier to interpret than Ridge (a short demonstration follows the list below).&lt;/p&gt;

&lt;p&gt;Mathematical equation of Lasso Regression:&lt;br&gt;
Residual Sum of Squares + λ * (sum of the absolute values of the coefficients)&lt;/p&gt;

&lt;p&gt;Where ,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;λ denotes the amount of shrinkage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;λ=0 implies all features are considered; this is equivalent to linear regression, where only the residual sum of squares is used to build the predictive model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;λ=∞ implies no feature is considered, i.e. as λ approaches infinity, more and more features are eliminated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bias increases as λ increases; variance increases as λ decreases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
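
&lt;p&gt;To see this shrinkage in action, the sketch below fits scikit-learn's Lasso at increasing values of λ (called alpha in scikit-learn) and counts how many coefficients are driven to exactly zero. The synthetic dataset is made up purely for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# synthetic data: 10 features, only 3 of them actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in [0.01, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print("alpha =", alpha, "-", n_zero, "of 10 coefficients are exactly zero")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;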

&lt;p&gt;&lt;strong&gt;Lasso Regression Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating new train and validation datasets
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
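# new_data_train is assumed to be the preprocessed training DataFrame
# loaded in an earlier step, with the target COST in its last column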
data_train, data_val = train_test_split(new_data_train, test_size = 0.2, random_state=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Separating the predictors and the target
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#separating independent and dependent features
#_____________________________________________
#Dependent Variable
Y_train = data_train.iloc[:, -1].values
#Independent Variables
X_train = data_train.iloc[:,0 : -1].values
#Independent Variables for Test Set
X_test = data_val.iloc[:,0 : -1].values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluating the Model with RMSLE
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def score(y_pred, y_true):
    error = np.square(np.log10(y_pred + 1) - np.log10(y_true + 1)).mean() ** 0.5
    score = 1 - error
    return score

actual_cost = list(data_val['COST'])
actual_cost = np.asarray(actual_cost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
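
&lt;p&gt;As a quick sanity check on the metric, identical predictions give an error of 0 and hence a score of 1; a hypothetical call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# a perfect prediction scores 1.0
print(score(np.array([100.0, 200.0]), np.array([100.0, 200.0])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;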



&lt;h2&gt;
  
  
  Building the Lasso Regressor
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Lasso Regression


from sklearn.linear_model import Lasso
#Initializing the Lasso Regressor with Normalization Factor as True
lasso_reg = Lasso(normalize=True)
#Fitting the Training data to the Lasso regressor
lasso_reg.fit(X_train,Y_train)
#Predicting for X_test
y_pred_lass =lasso_reg.predict(X_test)
#Printing the Score with RMSLE
print("\n\nLasso SCORE : ", score(y_pred_lass, actual_cost))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;0.7335508027883148&lt;/code&gt;&lt;br&gt;
&lt;code&gt;The lasso regression attained a score of about 73 percent on the given dataset&lt;/code&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running a random forest</title>
      <dc:creator>F.elicià</dc:creator>
      <pubDate>Sun, 06 Nov 2022 18:21:22 +0000</pubDate>
      <link>https://dev.to/priyanshusah7/running-a-random-forest-3edk</link>
      <guid>https://dev.to/priyanshusah7/running-a-random-forest-3edk</guid>
<description>&lt;p&gt;Random forest is a commonly used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Random forest algorithm
&lt;/h2&gt;

&lt;p&gt;The random forest algorithm is an extension of the bagging method: it combines bagging with feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also known as feature bagging or "the random subspace method", generates a random subset of the features for each tree.&lt;/p&gt;

&lt;p&gt;If we go back to the "Should I surf?" example, the questions that I may ask to determine the prediction may not be as comprehensive as someone else's set of questions. By accounting for all the potential variability in the data, we can reduce the risk of overfitting, bias, and overall variance, resulting in more precise predictions.&lt;/p&gt;
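
&lt;p&gt;In scikit-learn, this feature randomness is exposed through the max_features argument of the forest estimators: each split in each tree considers only a random subset of the columns. A minimal sketch on made-up data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# with max_features="sqrt", each split considers only about 4 of the 20 features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;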

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mAjVdHMj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27k45wvxk0xc8v7q2el2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mAjVdHMj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27k45wvxk0xc8v7q2el2.png" alt="Image description" width="880" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Random forest applications
&lt;/h2&gt;

&lt;p&gt;The random forest algorithm has been applied across a number of industries, allowing them to make better business decisions. Some use cases include:&lt;/p&gt;

&lt;h2&gt;
  
  
  Finance
&lt;/h2&gt;

&lt;p&gt;It is a preferred algorithm over others as it reduces time spent on data management and pre-processing tasks. It can be used to evaluate customers with high credit risk, to detect fraud, and for option pricing problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Healthcare
&lt;/h2&gt;

&lt;p&gt;The random forest algorithm has applications in computational biology, allowing doctors to tackle problems such as gene expression classification, biomarker discovery, and sequence annotation. As a result, doctors can make estimates around drug responses to specific medications.&lt;br&gt;
E-commerce: it can be used in recommendation engines for cross-sell purposes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('Salaries.csv')
print(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X3pteBcO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27bn5n5g0mpnzhite8k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X3pteBcO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27bn5n5g0mpnzhite8k2.png" alt="Image description" width="349" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uq9IIACk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eommqcprvi0jiedk54wu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uq9IIACk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eommqcprvi0jiedk54wu.png" alt="Image description" width="80" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CTp8oFCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rom10wntdr92a3z83g2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CTp8oFCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rom10wntdr92a3z83g2z.png" alt="Image description" width="583" height="59"&gt;&lt;/a&gt;&lt;/p&gt;
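
&lt;p&gt;Steps 2 and 3 (shown in the screenshots above) extract the feature and target arrays from the dataframe. A sketch of what that code presumably looks like, assuming Salaries.csv follows the common Position/Level/Salary layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 'Level' as the single feature (kept 2-D for scikit-learn), 'Salary' as the target
x = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;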

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor.fit(x,y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JWaSV3pR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0b0xw730vf5u1tcd2on4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JWaSV3pR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0b0xw730vf5u1tcd2on4.png" alt="Image description" width="588" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Y_pred = regressor.predict(np.array([6.5]).reshape(1, 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_grid = np.arange(min(x), max(x), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(x, y, color = 'blue') 
plt.plot(X_grid, regressor.predict(X_grid), 
         color = 'green') 
plt.title('Random Forest Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RQ-fC36z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z902h02l4vbc8b7xkvbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RQ-fC36z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z902h02l4vbc8b7xkvbg.png" alt="Image description" width="880" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Random forest</title>
      <dc:creator>F.elicià</dc:creator>
      <pubDate>Sun, 23 Oct 2022 17:19:09 +0000</pubDate>
      <link>https://dev.to/priyanshusah7/random-forest-22g6</link>
      <guid>https://dev.to/priyanshusah7/random-forest-22g6</guid>
<description>&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;br&gt;
Random Forest is a classifier that fits several Decision Trees on various subsets of a given dataset and averages their results to improve the predictive accuracy on that dataset. During the implementation of homework #2, I fitted several classifiers, including RandomForestClassifier and ExtraTreesClassifier, to predict the binary response variable TREG1 (whether a person is a smoker or not). All variables in the dataset, like age, gender, race, alcohol use, and others (see dataset), were used to build the final model. After fitting the model, these factors influenced the final variable with different levels of importance. I calculated these factors and sorted them in descending order into the feature importance list below:&lt;/p&gt;

&lt;p&gt;marever1     0.096374&lt;br&gt;
age          0.083599&lt;br&gt;
DEVIANT1     0.080081&lt;br&gt;
SCHCONN1     0.075221&lt;br&gt;
GPA1         0.074775&lt;br&gt;
DEP1         0.071728&lt;br&gt;
FAMCONCT     0.067389&lt;br&gt;
PARACTV      0.063784&lt;br&gt;
ESTEEM1      0.057945&lt;br&gt;
ALCPROBS1    0.057670&lt;br&gt;
VIOL1        0.048614&lt;br&gt;
ALCEVR1      0.043539&lt;br&gt;
PARPRES      0.039425&lt;br&gt;
WHITE        0.022146&lt;br&gt;
cigavail     0.021671&lt;br&gt;
BLACK        0.018512&lt;br&gt;
BIO_SEX      0.014942&lt;br&gt;
inhever1     0.012832&lt;br&gt;
cocever1     0.012590&lt;br&gt;
PASSIST      0.010221&lt;br&gt;
EXPEL1       0.009777&lt;br&gt;
HISPANIC     0.007991&lt;br&gt;
AMERICAN     0.005332&lt;br&gt;
ASIAN        0.003844&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
%matplotlib inline
RND_STATE = 55324

AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes

data_clean.describe()

predictors = data_clean[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN', 'age',
'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail', 'DEP1', 'ESTEEM1',
'VIOL1',
'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1', 'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=RND_STATE)

print("Predict train shape: ", pred_train.shape)
print("Predict test shape: ", pred_test.shape)
print("Target train shape: ", tar_train.shape)
print("Target test shape: ", tar_test.shape)

classifier = RandomForestClassifier(n_estimators=25, random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print("Confusion matrix:")
print(confusion_matrix(tar_test, predictions))
print()
print("Accuracy: ", accuracy_score(tar_test, predictions))

important_features = pd.Series(data=classifier.feature_importances_,index=predictors.columns)
important_features.sort_values(ascending=False,inplace=True)

print(important_features)

model = ExtraTreesClassifier(random_state=RND_STATE)
model.fit(pred_train, tar_train)

print(model.feature_importances_)

trees = range(25)
accuracy = np.zeros(25)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1, random_state=RND_STATE)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
The final model performed well on the test data, showing an accuracy of 83.4%. The results are presented in this plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ubcBQGvn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdpo814m183p6iprd4vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ubcBQGvn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdpo814m183p6iprd4vg.png" alt="Image description" width="382" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see from the plot, even one tree achieves a good level of accuracy, so the given data can be described reasonably well by a single tree. On the other hand, it is clear that adding more trees increases the final accuracy a bit, which can make the model predict the data more precisely.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running a classification tree</title>
      <dc:creator>F.elicià</dc:creator>
      <pubDate>Thu, 13 Oct 2022 13:55:18 +0000</pubDate>
      <link>https://dev.to/priyanshusah7/running-a-classification-tree-3dda</link>
      <guid>https://dev.to/priyanshusah7/running-a-classification-tree-3dda</guid>
      <description>&lt;h2&gt;
  
  
  Output
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filresv6uizcxjx2lrpm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filresv6uizcxjx2lrpm1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Explanation
&lt;/h2&gt;

&lt;p&gt;Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested.&lt;/p&gt;

&lt;p&gt;This decision tree uses these variables to predict output variable (TREG1) – whether person is a smoker, or not:&lt;/p&gt;

&lt;p&gt;BIO_SEX – categorical – gender&lt;br&gt;
GPA1 – numeric – current GPA&lt;br&gt;
ALCEVR1 – binary – alcohol use&lt;br&gt;
WHITE – binary – whether participant is white&lt;br&gt;
BLACK – binary – whether participant is black&lt;br&gt;
To train the decision tree, I split the given dataset into train and test sets in a 70/30 proportion.&lt;/p&gt;

&lt;p&gt;After fitting the tree, I tested it on the test dataset and got an accuracy of 0.826. This is a good result for a model based on only five explanatory variables.&lt;/p&gt;

&lt;p&gt;From decision tree we can observe:&lt;/p&gt;

&lt;p&gt;Participants who used alcohol were more likely to be smokers.(up to 5 times more smokers who used alcohol)&lt;br&gt;
Most smokers are white&lt;br&gt;
People with lower GPA are more usual to be regular smokers&lt;br&gt;
Code for the above output is :-&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus
RND_STATE = 55324

AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()

predictors = data_clean[['BIO_SEX', 'GPA1', 'ALCEVR1', 'WHITE', 'BLACK']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=0.3)

classifier=DecisionTreeClassifier(random_state=RND_STATE)
classifier=classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)

print("Confusion matrix:\n", sklearn.metrics.confusion_matrix(tar_test, predictions))
print("Accuracy: ", sklearn.metrics.accuracy_score(tar_test, predictions))

out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=["sex", "gpa", "alcohol", "white", "black"], proportion=True, filled=True, max_depth=4)
graph=pydotplus.graph_from_dot_data(out.getvalue())
img = Image(data=graph.create_png())
img

with open("output" + ".png", "wb") as f:
    f.write(img.data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
  </channel>
</rss>
