<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abzal Seitkaziyev</title>
    <description>The latest articles on DEV Community by Abzal Seitkaziyev (@xsabzal).</description>
    <link>https://dev.to/xsabzal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F344308%2Fda2e55ff-e75e-4b6c-a585-501ef13afea2.png</url>
      <title>DEV Community: Abzal Seitkaziyev</title>
      <link>https://dev.to/xsabzal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xsabzal"/>
    <language>en</language>
    <item>
      <title>Classifiers' Evaluation Metrics</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Sat, 20 Mar 2021 03:36:36 +0000</pubDate>
      <link>https://dev.to/xsabzal/classifiers-evaluation-metrics-16oo</link>
      <guid>https://dev.to/xsabzal/classifiers-evaluation-metrics-16oo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Confusion matrix&lt;/strong&gt;&lt;br&gt;
A confusion matrix is a table that holds the counts of True and False Positives ('TP' and 'FP'), as well as True and False Negatives ('TN' and 'FN').&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Seyd26V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jru9bzpbuwsyu5lo7yt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Seyd26V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jru9bzpbuwsyu5lo7yt1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/@m.virk1/classification-metrics-65b79bfdd776"&gt;Image&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is important for the project&lt;/strong&gt;&lt;br&gt;
For example, suppose we have an image classifier that identifies whether a rock is a precious stone (e.g., a diamond), and we use it for automated mining. &lt;br&gt;
In this context, we may want to capture as many precious stones as possible ('TP'), even if some non-precious stones are identified as diamonds ('FP'), because those can be sorted out by an expert at a later stage.&lt;br&gt;
Now let's imagine that we are buying these stones based on our image classifier's output. We do not want to buy non-precious stones ('FP'), so our model should be very careful about False Positive predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Evaluation Metrics&lt;/strong&gt;&lt;br&gt;
To evaluate and quantify the performance of a classification model, we can use common evaluation metrics: accuracy, balanced accuracy, precision, recall (a.k.a. sensitivity or True Positive Rate), specificity (= 1 - False Positive Rate), ROC (TPR vs. FPR), and the F1 score. &lt;br&gt;
As we can see, there are many evaluation metrics to choose from. However, all of them can be calculated from the confusion matrix values (TP, FP, TN, and FN). So the main idea is to know which metrics are most important for the project, and how balanced the target we are trying to predict (classify) is.&lt;br&gt;
The most general approach is to choose a few metrics to optimize (e.g., accuracy, recall, precision, F1 score, ROC-AUC). &lt;/p&gt;
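&lt;p&gt;As a quick sketch (with purely illustrative counts, not taken from any real model), all of these metrics can be computed directly from the four confusion matrix values:&lt;/p&gt;

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # fraction of predicted positives that are correct
recall = tp / (tp + fn)                      # fraction of actual positives that were found (TPR)
specificity = tn / (tn + fp)                 # fraction of actual negatives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, specificity, f1)
```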

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>Coefficient of Determination R squared</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 15 Mar 2021 03:52:24 +0000</pubDate>
      <link>https://dev.to/xsabzal/coefficient-of-determination-r-squared-1cf3</link>
      <guid>https://dev.to/xsabzal/coefficient-of-determination-r-squared-1cf3</guid>
<description>&lt;p&gt;To measure the 'goodness of fit' of the line in a linear regression analysis, we can calculate the Coefficient of Determination (R squared). R squared measures how well our model explains the variance in the target: here we measure the percentage of variance explained by the linear model versus a baseline model (in this case, simply the mean value of the target). &lt;/p&gt;

&lt;p&gt;We can visualize this with a simple example. Suppose we have some target, e.g. the number of sales of an item over 5 days, and we fit a line. Now we want to check how good our fit is, i.e. how well we perform compared to a naive prediction: the mean value of the sales.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# given target
&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# base line
&lt;/span&gt;&lt;span class="n"&gt;y_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

&lt;span class="c1"&gt;# fit a line
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intercept_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;model_intercept = 2&lt;br&gt;
model_coef = 3.4&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# regression line
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;3.4&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calculate R squared using formula
&lt;/span&gt;&lt;span class="n"&gt;var_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y_mean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;
&lt;span class="n"&gt;var_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var_mean&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;var_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.9730639730639731&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calculate R squared using scikit learn
&lt;/span&gt;&lt;span class="n"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.9730639730639731&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calculate using pearson correlation
&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corrcoef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.9730639730639731&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yVETwp3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diuetp61qm5x1b2yqigw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yVETwp3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diuetp61qm5x1b2yqigw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the regression line explains about 97% of the variance in the target, compared to the baseline of simply predicting the mean value. &lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>K-Means Clustering</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 08 Mar 2021 04:44:42 +0000</pubDate>
      <link>https://dev.to/xsabzal/k-means-clustering-394b</link>
      <guid>https://dev.to/xsabzal/k-means-clustering-394b</guid>
<description>&lt;p&gt;K-Means clustering is an unsupervised algorithm that is very intuitive and can be visualized geometrically. &lt;br&gt;
Basically, we try to split the data into k groups, or clusters, where each cluster has a center defined as the geometric centroid of the points in that cluster.&lt;/p&gt;

&lt;p&gt;Steps of the K-means clustering algorithm:&lt;/p&gt;

&lt;p&gt;1) Set k initial centers randomly&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gVe_r2T5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hz39xv5gl4vj3se7304s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gVe_r2T5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hz39xv5gl4vj3se7304s.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
2) Calculate 'distances' (e.g., Euclidean distances in 2-D space) from each data point to these centers and group the data points by their closest center&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TR-VDI1y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn6ix8zpqta2bbufqtmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TR-VDI1y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn6ix8zpqta2bbufqtmm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
3) Recalculate the positions of the k centers (as the mean of the data in each cluster) &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bdUmR-oK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffi1nx7fboqr0zrot784.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bdUmR-oK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffi1nx7fboqr0zrot784.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
4) Repeat steps 2 and 3 until the assignments no longer change.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g_OPIeWt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n4iuxu04q06zbvcixz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g_OPIeWt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4n4iuxu04q06zbvcixz5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RfEVYGL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5qozv6majc0v86u8jyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RfEVYGL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5qozv6majc0v86u8jyw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the &lt;a href="http://tech.nitoyon.com/en/blog/2013/11/07/k-means/"&gt;link&lt;/a&gt; I used to play and visualize clustering.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Introduction to Support Vector Machines</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 01 Mar 2021 04:40:19 +0000</pubDate>
      <link>https://dev.to/xsabzal/introduction-to-support-vector-machines-1aba</link>
      <guid>https://dev.to/xsabzal/introduction-to-support-vector-machines-1aba</guid>
<description>&lt;p&gt;Support Vector Machines (SVMs) are supervised models, and they can be very effective for classification, numerical prediction, and outlier detection problems.&lt;/p&gt;

&lt;p&gt;The main idea is to separate different classes effectively: getting accurate results (e.g., a higher accuracy score) while balancing overfitting and underfitting at the same time (SVM introduces a slack term to account for this). &lt;/p&gt;

&lt;p&gt;SVM divides classes using a line, plane, or hyperplane. In the simple example with a line, we can divide using a maximum margin or a soft margin. The soft margin is more flexible: it allows some misclassification to account for outliers, which helps the model neither overfit nor underfit.&lt;/p&gt;

&lt;p&gt;Another thing that sets SVM apart is the use of so-called kernel functions. In &lt;a href="https://scikit-learn.org/stable/modules/svm.html#kernel-functions"&gt;scikit-learn&lt;/a&gt; we can use Linear, Polynomial, Radial Basis Function (RBF), or Sigmoid kernels. These functions make it possible to separate classes with hyperplanes in implicit higher-dimensional spaces. &lt;/p&gt;
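&lt;p&gt;For example, on data that is not linearly separable, an RBF kernel can do what a linear kernel cannot (a minimal scikit-learn sketch on synthetic data):&lt;/p&gt;

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Concentric circles: two classes that no straight line can separate (synthetic data)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = svm.SVC(kernel="linear").fit(X, y)
rbf_clf = svm.SVC(kernel="rbf").fit(X, y)  # the kernel trick handles the circular boundary

print("linear:", linear_clf.score(X, y))
print("rbf:", rbf_clf.score(X, y))
```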

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P1vdn_MX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dggb3fyd2xchusd8yfui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P1vdn_MX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dggb3fyd2xchusd8yfui.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.jeremyjordan.me/support-vector-machines/"&gt;image source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though the picture above shows a transformation of the data from 2d to 3d, SVM does not actually transform the data into the higher dimension; rather, it uses dot products to measure the relationship of each point with the others as if they lived in that space (the Kernel Trick).&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Cross-Validation for Time Series</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 15 Feb 2021 04:26:25 +0000</pubDate>
      <link>https://dev.to/xsabzal/cross-validation-for-time-series-19ho</link>
      <guid>https://dev.to/xsabzal/cross-validation-for-time-series-19ho</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@markuswinkler?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Markus Winkler&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/analytics?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;To estimate the performance of a machine learning model, we may consider using cross-validation (CV), which uses multiple (e.g., n) train-test splits and trains/tests n models respectively. &lt;/p&gt;

&lt;p&gt;There is a k-fold CV in scikit-learn, which splits the data into k train-test groups and assumes that observations are independent. However, in time series there is a dependency between observations, which can lead to target leakage in the estimate when k-fold CV is used.&lt;br&gt;
For time series data I explored the following cross-validation techniques:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Scikit-learn's Time Series Split.&lt;/strong&gt;&lt;br&gt;
Here we use an expanding window for the train set and a fixed-size window for the test data. &lt;/p&gt;

&lt;p&gt;Example of Indices Split:&lt;br&gt;
TRAIN: [0 1 2 3 4 5] TEST: [6 7]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]&lt;br&gt;
...&lt;/p&gt;
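&lt;p&gt;The split above can be reproduced with scikit-learn's TimeSeriesSplit on 12 dummy observations (the test_size parameter is available in newer scikit-learn versions):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12)  # 12 dummy time-ordered observations
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_idx, test_idx in tscv.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```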

&lt;p&gt;&lt;strong&gt;2) Blocking Time Series Split.&lt;/strong&gt;&lt;br&gt;
Here we train and test on disjoint, non-overlapping blocks of data.&lt;br&gt;
Example of the split:&lt;/p&gt;

&lt;p&gt;TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]&lt;br&gt;
TRAIN: [12 13 14 15 16 17 18 19 20 21] TEST: [22 23]&lt;br&gt;
TRAIN: [24 25 26 27 28 29 30 31 32 33] TEST: [34 35]&lt;br&gt;
...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Walk-Forward Validation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a) We use a fixed-size (sliding) window for the train data and one observation ahead for the test.&lt;/p&gt;

&lt;p&gt;Example of Indices Split:&lt;br&gt;
TRAIN: [0 1 2 3 4 5] TEST: [6]&lt;br&gt;
TRAIN: [1 2 3 4 5 6] TEST: [7]&lt;br&gt;
TRAIN: [2 3 4 5 6 7] TEST: [8]&lt;br&gt;
....&lt;/p&gt;

&lt;p&gt;b) We use an expanding window for the train data and one observation ahead for the test, which is a variation of scikit-learn's Time Series Split.&lt;/p&gt;

&lt;p&gt;Example of Indices Split:&lt;br&gt;
TRAIN: [0 1 2 3 4 5] TEST: [6]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6] TEST: [7]&lt;br&gt;
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]&lt;br&gt;
....&lt;/p&gt;
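&lt;p&gt;Both walk-forward variants can be sketched with a small generator (walk_forward is a hypothetical helper written for this post, not a scikit-learn function):&lt;/p&gt;

```python
def walk_forward(n_obs, window, expanding=False):
    """Yield (train_indices, test_indices), moving one observation ahead per step."""
    for end in range(window, n_obs):
        start = 0 if expanding else end - window  # expanding vs. sliding window
        yield list(range(start, end)), [end]

# variant a: fixed-size (sliding) window
for train, test in walk_forward(9, 6):
    print("TRAIN:", train, "TEST:", test)

# variant b: expanding window
for train, test in walk_forward(9, 6, expanding=True):
    print("TRAIN:", train, "TEST:", test)
```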

</description>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gas Field Production Project. Part 1</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 08 Feb 2021 04:36:35 +0000</pubDate>
      <link>https://dev.to/xsabzal/gas-field-production-project-part-1-5h33</link>
      <guid>https://dev.to/xsabzal/gas-field-production-project-part-1-5h33</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@kobuagency?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;KOBU Agency&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/data?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Here I will briefly describe the data collection and exploratory data analysis (EDA) process for the gas field production project. After researching a few oil and gas fields, I selected the Lakeshore Gas Field (NY) for my project, as it is a gas field with around 4000 production wells. Using dynamic web scraping, I collected field data and yearly water, gas, and oil production data for each well. As we might expect, the data does not include many physical or extraction properties, e.g. pressure, well fracking, or other stimulation and maintenance activities. There are some options to generate pseudo-pressure values using petroleum engineering models (e.g., by making assumptions about the initial reservoir pressure and using production volume and mass data). At this stage, the data we have is enough for the EDA. &lt;/p&gt;

&lt;p&gt;Here is some elevation data of the wells.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kI0Juhca--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/22m95p9bjumvr0rswkee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kI0Juhca--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/22m95p9bjumvr0rswkee.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I plotted below the data related to the gas and water produced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v36t7l7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qaqywbalqg14d8y3etq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v36t7l7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qaqywbalqg14d8y3etq1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c78MuspO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mwf5h5lx7d2j1tbaiqbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c78MuspO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mwf5h5lx7d2j1tbaiqbd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uzTcg_JP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z4ze5hltfhzuw3gvbzb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uzTcg_JP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z4ze5hltfhzuw3gvbzb4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we could expect, gas production drops over time (as reservoir pressure drops), even with an increased number of active wells and stimulation activities.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gradient Boosting Regressor Example</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 01 Feb 2021 04:53:23 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boosting-regressor-example-2ghi</link>
      <guid>https://dev.to/xsabzal/gradient-boosting-regressor-example-2ghi</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@lazycreekimages?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Michael Dziedzic&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/algorithms?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/xsabzal/gradient-boosting-classifier-1de5"&gt;post&lt;/a&gt;, I briefly explained Gradient Boosting using a classification problem. Here I will give a step-by-step explanation of how the Gradient Boosting Regressor works using sklearn and Python, to complement the theory given &lt;a href="https://dev.to/xsabzal/gradient-boost-for-regression-1e42"&gt;here&lt;/a&gt;. I did this exercise mainly to build an intuition for the processes inside gradient boosted trees, and by doing so to avoid using them as some sort of 'black box' algorithm.&lt;/p&gt;

&lt;p&gt;I used a dataset with car prices (&lt;a href="https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes"&gt;source&lt;/a&gt;) for this purpose. To make the processes inside the gradient boosted trees easy to follow, I used a small portion of the data with a minimal number of trees (m=2) and a shallow tree depth (max_depth=2).&lt;/p&gt;

&lt;p&gt;1) First, we initialize the model by getting the initial prediction Pred_0, calculated as the mean of the prices in the train dataset. Then we calculate the initial residuals: Res_0 = train['price'] - Pred_0. See below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--68b4nfR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xm5oq7acvps6wa9up2gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--68b4nfR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xm5oq7acvps6wa9up2gt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Here we fit all data points (each row's features and Res_0) into the first tree. This tree is built using 'MSE' as the splitting criterion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kEfxkK_D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/szzytgmzy28skrlzd6xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kEfxkK_D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/szzytgmzy28skrlzd6xa.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each value in a leaf is calculated as the mean of the residuals in that leaf. Then the prediction is updated:&lt;br&gt;
Pred_1 = Pred_0 + learning_rate*output_value_1&lt;/p&gt;

&lt;p&gt;Then we calculate the residuals:&lt;br&gt;
Res_1 = train['price'] - Pred_1&lt;/p&gt;

&lt;p&gt;Node #2, 3, 5, and 6 Predictions and Residuals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jJ3JwOnx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1k76kvd6hbpyv8y89o5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jJ3JwOnx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1k76kvd6hbpyv8y89o5f.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Here we fit all data points (each row's features and Res_1) into the second tree. This tree is also built using 'MSE' as the criterion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m6pec_0H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gjt190lqqim4r26fxnkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m6pec_0H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gjt190lqqim4r26fxnkk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node #5 predictions are shown below.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YTbBvO1f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g8msl0ar3o2bajk6eo73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YTbBvO1f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g8msl0ar3o2bajk6eo73.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) We continue this iterative training process. Here I used only two trees for simplicity.&lt;/p&gt;
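&lt;p&gt;The steps above can be sketched in a few lines of Python. The toy features, the 'price' target, and the learning rate below are illustrative assumptions, not the article's exact dataset:&lt;/p&gt;

```python
# Sketch of the walk-through above: initialize with the mean price, then
# fit each tree to the residuals.  The toy features and 'price' target
# are illustrative, not the article's dataset.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                   # toy features
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 100)   # toy train['price']

learning_rate = 0.1

# 1) Pred_0 = mean price, Res_0 = price - Pred_0
pred = np.full_like(y, y.mean())
res = y - pred

# 2)-3) fit each tree to the current residuals (the default criterion is
# squared error, i.e. 'MSE'), then update predictions and residuals
for _ in range(2):                                      # two trees, as above
    tree = DecisionTreeRegressor(max_depth=2).fit(X, res)
    pred = pred + learning_rate * tree.predict(X)       # Pred_m
    res = y - pred                                      # Res_m
```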

&lt;p&gt;You can find the detailed code and step-by-step walkthrough in chapter 5 &lt;a href="https://github.com/xs-abzal/Blogs_stat_examples/blob/master/Decision%20Tree%20Splitting%20Criterions.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Gradient Boosting Classifier</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 25 Jan 2021 04:11:36 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boosting-classifier-1de5</link>
      <guid>https://dev.to/xsabzal/gradient-boosting-classifier-1de5</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@rodlong?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Rod Long&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/ai?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In this post, I would like to show briefly how theoretical components given in the &lt;a href="https://dev.to/xsabzal/gradient-boost-for-classification-2f15"&gt;gradient boost for classification&lt;/a&gt; are implemented in sklearn.&lt;/p&gt;

&lt;p&gt;Gradient Boosting Classifier uses &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.decision_function"&gt;regressor trees&lt;/a&gt;: even though the task is classification, each tree is a regression tree that uses mean squared error as its splitting criterion and fits the gradients of a differentiable loss function (Log-Loss by default). Regression trees are used because the residuals being fitted are continuous values, not class labels.&lt;/p&gt;

&lt;p&gt;Let us see the first and last tree in the model (&lt;a href="https://github.com/xs-abzal/Blogs_stat_examples/blob/master/Decision%20Tree%20Splitting%20Criterions.ipynb"&gt;repo reference&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A2SJRDMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i50konalzuoljmodzd9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A2SJRDMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i50konalzuoljmodzd9d.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7KT5sVap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9jbz5odkcmcwuyg9eadu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7KT5sVap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9jbz5odkcmcwuyg9eadu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, by iteratively building and fitting trees to the negative gradients of the Log-Loss function (-G = Residuals = True_Probability - Predicted_Probability), each leaf gives an output value.&lt;br&gt;
Each output value is calculated as&lt;br&gt;
predicted_leaf_output = (sum of residuals in the leaf) / [sum of Predicted_Probability*(1-Predicted_Probability)].&lt;/p&gt;

&lt;p&gt;Then, using the log of odds, we can convert it to the new probability:&lt;br&gt;
log(new_odds) = log(previous_odds) + alpha * output_value&lt;br&gt;
New_probability = e^log(new_odds)/[1+e^log(new_odds)].&lt;/p&gt;

&lt;p&gt;Then we just loop, iteratively improving our prediction until we reach the maximum number of estimators or satisfy the other stopping hyperparameters of the classifier.&lt;/p&gt;

&lt;p&gt;When we need to predict for new data (known features with an unknown label), the data goes through the trained model (n fitted trees); the probability is calculated using the formula above, and Class 0 or 1 is assigned accordingly for binary classification.&lt;/p&gt;
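&lt;p&gt;A minimal sketch of this workflow in sklearn; the toy dataset and hyperparameter values below are illustrative assumptions:&lt;/p&gt;

```python
# Sketch of the sklearn workflow described in the post; the toy dataset
# and hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Each boosting stage fits a regression tree to the negative gradient
# (the residuals) of the log-loss.
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                   max_depth=3, random_state=42)
model.fit(X, y)

# The individual weak learners are regression trees, as noted above.
first_tree = model.estimators_[0, 0]
last_tree = model.estimators_[-1, 0]
print(type(first_tree).__name__)   # DecisionTreeRegressor

# predict_proba converts the accumulated log-odds into probabilities.
proba = model.predict_proba(X[:3])
```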

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>XG Boost for Classification</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 18 Jan 2021 03:23:16 +0000</pubDate>
      <link>https://dev.to/xsabzal/xg-boost-for-classification-2og5</link>
      <guid>https://dev.to/xsabzal/xg-boost-for-classification-2og5</guid>
      <description>&lt;p&gt;The logic of the XGBoost Classification algorithm is similar to &lt;a href="https://dev.to/xsabzal/xg-boost-for-regression-5c90"&gt;XGBoost Regression&lt;/a&gt;, with a few minor differences, like using the Log-Likelihood Loss function instead of Least Squares, and using Probability and Log of Odds in the calculations. &lt;/p&gt;

&lt;p&gt;1) Define initial values and hyperparameters.&lt;/p&gt;

&lt;p&gt;1a) Define a differentiable Loss Function, e.g. 'Negative Log-Likelihood': &lt;br&gt;
L(yi,pi) = -[yi*ln(pi) + (1-yi)*ln(1-pi)], where &lt;br&gt;
yi - true labels (1 or 0), pi - predicted probabilities.&lt;br&gt;
Here we will convert pi to odds and use the log of odds when optimizing the objective function.&lt;/p&gt;

&lt;p&gt;1b) Assign a value to the initial predicted probabilities (p), by default, it is the same number for all observations, e.g. 0.5.&lt;/p&gt;

&lt;p&gt;1c) Assign values to parameters:&lt;br&gt;
learning rate (eta), max_depth, max_leaves, number of boosted rounds etc.&lt;br&gt;
and regularization hyperparameters: lambda, gamma. &lt;br&gt;
Default values in the XG Boost &lt;a href="https://xgboost.readthedocs.io/en/latest/parameter.html"&gt;documentation&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;2) Build 1 to N number of trees iteratively.&lt;br&gt;
This step is the same as in the &lt;a href="https://dev.to/xsabzal/xg-boost-for-regression-5c90"&gt;XGBoost Regression&lt;/a&gt;, where we fit each tree to the residuals.&lt;/p&gt;

&lt;p&gt;The differences here: &lt;br&gt;
a) in the formula of the output calculation, where &lt;br&gt;
H = Previous_Probability*[1-Previous_Probability].&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XltcV9eE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/49asc6jv2a994jc7qqmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XltcV9eE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/49asc6jv2a994jc7qqmx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b) we compute new prediction values as &lt;br&gt;
log(new_odds) = log(previous_odds) + eta * output_value&lt;br&gt;
new_probability = e^log(new_odds)/[1+e^log(new_odds)]&lt;/p&gt;
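&lt;p&gt;The output-value and probability-update formulas above can be checked numerically; the labels, prior probabilities, eta, and lambda values below are illustrative assumptions:&lt;/p&gt;

```python
# Numeric check of the leaf output and probability update described above.
# The labels, prior probabilities, eta, and lambda are illustrative.
import math

y_true = [1, 0, 1]                 # labels of the rows in one leaf
p_prev = [0.5, 0.5, 0.5]           # previous predicted probabilities
eta, lam = 0.3, 1.0                # learning rate and L2 regularization

residuals = [y - p for y, p in zip(y_true, p_prev)]   # negative gradients
H = sum(p * (1 - p) for p in p_prev)                  # sum of p*(1-p)

# a) leaf output = sum(residuals) / (H + lambda)
output_value = sum(residuals) / (H + lam)

# b) update the log of odds and convert back to a probability
prev_log_odds = math.log(0.5 / (1 - 0.5))             # 0.0 when p = 0.5
new_log_odds = prev_log_odds + eta * output_value
new_probability = math.exp(new_log_odds) / (1 + math.exp(new_log_odds))
print(round(new_probability, 4))
```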

&lt;p&gt;3) Last step: get the final predictions.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>XG Boost for Regression</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 11 Jan 2021 04:40:59 +0000</pubDate>
      <link>https://dev.to/xsabzal/xg-boost-for-regression-5c90</link>
      <guid>https://dev.to/xsabzal/xg-boost-for-regression-5c90</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@markusspiske?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Markus Spiske&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/algorithm?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In the previous posts, I described how the Gradient Boosted Trees algorithm works for &lt;a href="https://dev.to/xsabzal/gradient-boost-for-regression-1e42"&gt;regression&lt;/a&gt; and &lt;a href="https://dev.to/xsabzal/gradient-boost-for-classification-2f15"&gt;classification&lt;/a&gt; problems. However, for bigger datasets, training can be a slow process. Several advanced versions of boosted trees address this issue, such as Extreme Gradient Boost (XG Boost), Light GBM, and CatBoost. Here I will give an overview of how XG Boost works for regression.&lt;/p&gt;

&lt;p&gt;1) Define initial values and hyperparameters.&lt;/p&gt;

&lt;p&gt;1a) Define a differentiable Loss Function, e.g. 'Least Squares': &lt;br&gt;
L(yi,pi) = 1/2*(yi - pi)^2, where &lt;br&gt;
yi - true values, pi - predictions &lt;/p&gt;

&lt;p&gt;1b) Assign a value to the initial predictions (p), by default, it is the same number for all observations, e.g. 0.5.&lt;/p&gt;

&lt;p&gt;1c) Assign values to parameters:&lt;br&gt;
learning rate (eta), max_depth, max_leaves, number of boosted rounds etc.&lt;br&gt;
and regularization hyperparameters: lambda, gamma. &lt;br&gt;
Default values in the XG Boost &lt;a href="https://xgboost.readthedocs.io/en/latest/parameter.html"&gt;documentation&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;2) Build 1 to N trees iteratively.&lt;br&gt;
2a) Get the residuals (yi - pi) that the tree will be fitted to. &lt;br&gt;
Note: &lt;br&gt;
- Similar to ordinary Gradient Boosted Trees, we fit the trees iteratively to the residuals, not to the predictions. &lt;/p&gt;

&lt;p&gt;- Building trees in XG Boost is a bit different compared to ordinary decision trees, where we could use criteria like Gini impurity or entropy to measure the gain. In XG Boost we use a formula derived from optimizing the objective function (the objective function is the sum of the loss function and the regularization terms). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VC0rLzq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hr8tiv5cgh0ivkq96lyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VC0rLzq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hr8tiv5cgh0ivkq96lyd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HzO1xJ1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/y2iwos5bmj7h3kpbp635.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HzO1xJ1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/y2iwos5bmj7h3kpbp635.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;G - represents the sum of gradients (the first derivative of the loss function with respect to the prediction p); in our case it is the negative sum of the residuals in the leaf (L - left leaf, R - right leaf).&lt;/p&gt;

&lt;p&gt;H - the sum of second derivatives of the Loss Function with respect to the prediction p, which here equals the number of residuals in the leaf.&lt;/p&gt;

&lt;p&gt;- XG Boost allows using a greedy algorithm or an approximate greedy algorithm (for bigger datasets) when building the trees and calculating gains.&lt;/p&gt;

&lt;p&gt;2c) Once we choose the best split by the Gain calculated in the previous step, we build the full tree (the size of the tree is limited either by the gain values, whose formula also includes gamma for pruning, or by the parameters we specified initially).&lt;br&gt;
Now compute the output value for each leaf in the tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ABT4BYrq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h0tufkca8dtxteobj8ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ABT4BYrq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h0tufkca8dtxteobj8ca.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2d) Compute new prediction values as &lt;br&gt;
new_p = previous_p + eta * output_value&lt;/p&gt;
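&lt;p&gt;A minimal sketch of the similarity, gain, and output-value calculations above; the residual values and the lambda/gamma settings are illustrative assumptions:&lt;/p&gt;

```python
# Minimal sketch of the similarity score, gain, and output value above.
# The residuals and the lambda/gamma settings are illustrative assumptions.
def similarity(residuals, lam):
    # Similarity = (sum of residuals)^2 / (count + lambda), squared-error loss
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam, gamma):
    # Gain = Sim_left + Sim_right - Sim_root - gamma (gamma prunes weak splits)
    return (similarity(left, lam) + similarity(right, lam)
            - similarity(left + right, lam) - gamma)

def output_value(residuals, lam):
    # Output = sum of residuals / (count + lambda), i.e. -G / (H + lambda)
    return sum(residuals) / (len(residuals) + lam)

left, right = [-10.5, -7.5], [6.5, 7.5]
print(gain(left, right, lam=1.0, gamma=0.0))    # positive: a useful split
print(output_value(left, lam=1.0))              # -6.0
```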

&lt;p&gt;3) Get the final predictions.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gradient Boost for Classification</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 04 Jan 2021 04:41:55 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boost-for-classification-2f15</link>
      <guid>https://dev.to/xsabzal/gradient-boost-for-classification-2f15</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@spacex?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;SpaceX&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/rocket?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Here I would like to go through the steps of the Gradient Boost for Classification. Gradient Boost Classification is very similar to the Gradient Boost Regression algorithm with a few differences: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target values in binary classification are 0s and 1s&lt;/li&gt;
&lt;li&gt;Log Loss function (similar to logistic regression problems). &lt;/li&gt;
&lt;li&gt;Using the log of odds and probabilities based on the Log Loss function.
So we can apply the same mathematical algorithm we used in the previous &lt;a href="https://dev.to/xsabzal/gradient-boost-for-regression-1e42"&gt;post&lt;/a&gt;, taking into account the above differences. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are the steps of the Gradient Boost Classification algorithm when used with the Logistic Loss Function L(y,F(x)).&lt;br&gt;
I have also tried to avoid mathematical notation and to simplify the steps.&lt;/p&gt;

&lt;p&gt;1) Get the initial prediction: the log of odds and the probability of class = 1. Basically, we count the numbers of class 1s and class 0s, then calculate P(class=1) and log(odds) = log(P/(1-P)). &lt;br&gt;
Example: if we have balanced data, we have equal (or similar) numbers of 0s and 1s, so initially:&lt;br&gt;
Predicted_Probability(class_1) = 0.5&lt;br&gt;
log(odds) = 0.&lt;/p&gt;
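&lt;p&gt;Step 1 can be sketched directly; the label vector below is an illustrative assumption:&lt;/p&gt;

```python
# Sketch of step 1: initial probability and log of odds from class counts.
# The label vector is an illustrative assumption.
import math

y = [1, 0, 1, 0, 1, 0]             # balanced toy labels

p = sum(y) / len(y)                # P(class = 1)
log_odds = math.log(p / (1 - p))   # initial predicted log of odds

print(p, log_odds)                 # balanced data gives 0.5 and 0.0
```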

&lt;p&gt;2) m is the number of weak learners. So we do the below steps for each decision tree (e.g. m=1 to m=100, when n_estimators=100):&lt;/p&gt;

&lt;p&gt;a. Compute the residuals (True - Predicted_Probability) for each tree iteratively (the residuals from the previous step are used as the target for the next decision tree).&lt;/p&gt;

&lt;p&gt;Example: for the first tree, Residual = True - 0.5 (predicted in the previous step), where True = 0 or 1 (the target class)&lt;/p&gt;

&lt;p&gt;b. Fit decision tree to the residuals&lt;/p&gt;

&lt;p&gt;c. Compute the output value for each leaf in the tree. We cannot simply take the average of the values in the leaf as we did in regression. Instead, we use the following formula:&lt;br&gt;
predicted_leaf_output = (sum of residuals in the leaf) / [sum of Predicted_Probability*(1-Predicted_Probability)]&lt;/p&gt;

&lt;p&gt;d. First, update the predicted log of odds for each row of data:&lt;br&gt;
log(odds) = Previously_predicted_log(odds) + learning_rate * predicted_leaf_output&lt;/p&gt;

&lt;p&gt;e. Then calculate the probability for each row of data using log(odds):&lt;br&gt;
P = odds/(1+odds) or P = exp(log(odds))/[1+exp(log(odds))].&lt;/p&gt;
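&lt;p&gt;Steps c-e can be checked numerically for a single leaf; the residuals, prior probabilities, and learning rate below are illustrative assumptions:&lt;/p&gt;

```python
# Numeric check of steps c-e for a single leaf.  The residuals, prior
# probabilities, and learning rate are illustrative assumptions.
import math

residuals = [0.5, 0.5, -0.5]       # True - Predicted_Probability per row
p_prev = [0.5, 0.5, 0.5]           # previous predicted probabilities
learning_rate = 0.1

# c. predicted_leaf_output = sum(residuals) / sum(p * (1 - p))
leaf_output = sum(residuals) / sum(p * (1 - p) for p in p_prev)

# d. update the log of odds for a row that lands in this leaf
log_odds = math.log(0.5 / (1 - 0.5)) + learning_rate * leaf_output

# e. convert log(odds) back to a probability
P = math.exp(log_odds) / (1 + math.exp(log_odds))
print(round(P, 4))
```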

&lt;p&gt;3) Compute the final prediction F(x).&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Gradient Boost for Regression</title>
      <dc:creator>Abzal Seitkaziyev</dc:creator>
      <pubDate>Mon, 28 Dec 2020 04:15:51 +0000</pubDate>
      <link>https://dev.to/xsabzal/gradient-boost-for-regression-1e42</link>
      <guid>https://dev.to/xsabzal/gradient-boost-for-regression-1e42</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@billjelen?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Bill Jelen&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/rocket?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient Tree Boosting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gradient Tree Boosting is an ensemble algorithm that can be applied to both classification and regression problems. Here I will describe how gradient boosting works for regression. Gradient Boost uses Decision Trees as weak learners; each Decision Tree predicts pseudo-residual values, and all Decision Trees have the same 'weight' in the final prediction (set by the learning rate). &lt;/p&gt;

&lt;p&gt;There are a few key components in the Gradient Boosting algorithm:&lt;/p&gt;

&lt;p&gt;a) Loss function - the natural choice for regression is 'Least Squares' &lt;br&gt;
(Note: similar to linear regression, but commonly used with the coefficient 1/2, i.e. 1/2*(True-Predicted)^2, so that the negative gradient equals the actual residual (True-Predicted) rather than a scaled pseudo-residual.)&lt;/p&gt;

&lt;p&gt;b) Hyperparameters:&lt;br&gt;
learning rate (used to scale each weak learner prediction), and parameters related to the weak learners themselves (e.g. number of weak learners, maximum depth of each tree, etc.) &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Algorithm steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's dive into the details of the algorithm itself. This is the mathematical description of Gradient Boosted Trees.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g-XgVeTW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mp00a2bfvqxk7s258849.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g-XgVeTW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mp00a2bfvqxk7s258849.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Source is &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below is the simplified explanation of the above steps when using 'Least Squares' Loss Function L(y,F(x)):&lt;/p&gt;

&lt;p&gt;1) Get the initial prediction, it is equal to the mean of the target column. &lt;br&gt;
2) m is the number of weak learners. So we do the below steps for each decision tree (e.g. m=1 to m=100, when n_estimators=100):&lt;/p&gt;

&lt;p&gt;a. Compute the residuals (True - Predicted) for each tree iteratively (the residuals from the previous step are used as the target for the next decision tree).&lt;br&gt;
note: for the first tree, Residual = True - Target_Mean (predicted in the previous step)&lt;br&gt;
    b. Fit a decision tree to the residuals&lt;br&gt;
    c. Compute the output value for each leaf in the tree (in this case, the mean of the residuals in that leaf)&lt;br&gt;
    d. Update the predicted values: new_prediction = previous_prediction + learning_rate * output_value.&lt;br&gt;
    e. Repeat steps 2a to 2d until all weak learners are constructed.&lt;/p&gt;

&lt;p&gt;3) Compute the final prediction F(x)&lt;/p&gt;
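&lt;p&gt;Steps 1-3 can be sketched with sklearn decision trees as the weak learners; the toy data, number of learners, and learning rate below are illustrative assumptions:&lt;/p&gt;

```python
# Sketch of steps 1-3 using sklearn decision trees as the weak learners.
# The toy data, number of learners, and learning rate are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate, n_estimators = 0.1, 100
prediction = np.full_like(y, y.mean())     # 1) initial prediction = target mean
trees = []

for _ in range(n_estimators):              # 2) build the weak learners
    residuals = y - prediction             # 2a) residuals as the new target
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # 2b)
    prediction += learning_rate * tree.predict(X)  # 2c-2d) leaf means, update
    trees.append(tree)

# 3) final prediction F(x) for new data: initial mean + scaled tree outputs
def predict(X_new):
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
```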

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the short description above, we can see that there are some similarities with AdaBoost (like iterativeness - the next tree depends on the previous predictions) and differences (each tree has the same learning rate vs. different weights in AdaBoost). &lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3CC4N4z3GJc"&gt;Video material&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting"&gt;scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting"&gt;Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
