<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bechir Jamoussi</title>
    <description>The latest articles on DEV Community by Bechir Jamoussi (@bechir_jamoussi_cf523a5bb).</description>
    <link>https://dev.to/bechir_jamoussi_cf523a5bb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F893136%2Fd96dd226-c40a-4abe-a4a0-e00afac5b80c.png</url>
      <title>DEV Community: Bechir Jamoussi</title>
      <link>https://dev.to/bechir_jamoussi_cf523a5bb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bechir_jamoussi_cf523a5bb"/>
    <language>en</language>
    <item>
      <title>Feature Engineer each class separately in Binary Classification</title>
      <dc:creator>Bechir Jamoussi</dc:creator>
      <pubDate>Tue, 19 Jul 2022 10:03:40 +0000</pubDate>
      <link>https://dev.to/bechir_jamoussi_cf523a5bb/feature-engineer-each-class-separately-in-binary-classification-3f83</link>
      <guid>https://dev.to/bechir_jamoussi_cf523a5bb/feature-engineer-each-class-separately-in-binary-classification-3f83</guid>
      <description>&lt;p&gt;I have an imbalanced tabular dataset, my problem is a binary classification. The dataset is strongly imbalanced so I have performed oversampling, but it did not solve the issue, you can find the Classification Report below:(The accuracy is 88% but I don't care, it does not represent well the performance since the dataset is imbalanced)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3eXPSArT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%252F5421044%252Fea4e02f481271a010e67679484b33c64%252FCR.PNG%3Fgeneration%3D1658224466177856%26alt%3Dmedia" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3eXPSArT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%252F5421044%252Fea4e02f481271a010e67679484b33c64%252FCR.PNG%3Fgeneration%3D1658224466177856%26alt%3Dmedia" alt="" width="563" height="176"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
The only explanation I can find is that the selected features are well suited to detecting the "0" class while carrying little information about the "1" class. Is there a way to find the features that best represent the "1" class? For example, could I split the dataset into a 1_Class_Dataset and a 0_Class_Dataset, select the best features for each, and then combine both? If that is not possible, can you please suggest another solution?&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Low F1-Score due to Imbalanced Dataset even after resampling</title>
      <dc:creator>Bechir Jamoussi</dc:creator>
      <pubDate>Sat, 16 Jul 2022 23:50:18 +0000</pubDate>
      <link>https://dev.to/bechir_jamoussi_cf523a5bb/low-f1-score-due-to-imbalanced-dataset-even-after-resampling-4ege</link>
      <guid>https://dev.to/bechir_jamoussi_cf523a5bb/low-f1-score-due-to-imbalanced-dataset-even-after-resampling-4ege</guid>
      <description>&lt;p&gt;I am performing a Binary Classification over an &lt;strong&gt;imbalanced dataset&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
0: 16,263&lt;br&gt;&lt;br&gt;
1: 214&lt;/p&gt;
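
&lt;p&gt;For scale, those counts give an imbalance ratio of roughly 76:1. As a hedged aside (plain Python, no library assumed), these are the standard class-weight quantities one would derive from such counts, e.g. for cost-sensitive training as an alternative to resampling:&lt;/p&gt;

```python
# Class counts from the post
n_neg, n_pos = 16263, 214
n_samples = n_neg + n_pos

# Ratio of majority to minority, used e.g. as XGBoost's scale_pos_weight
imbalance_ratio = n_neg / n_pos

# "balanced" class weights: n_samples / (n_classes * class_count)
w_neg = n_samples / (2 * n_neg)
w_pos = n_samples / (2 * n_pos)

print(round(imbalance_ratio, 1))  # roughly 76.0
```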

&lt;p&gt;I have used multiple oversampling, undersampling, and combination techniques; below are the results I obtained.&lt;br&gt;
These plots were generated with the following code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def plot_resampling(X, y, sampler, ax, title=None):
    X_res, y_res = sampler.fit_resample(X, y)
    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor="k")
    if title is None:
        title = f"Resampling with {sampler.__class__.__name__}"
    ax.set_title(title)
    sns.despine(ax=ax, offset=10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Clarification: X and y here are X_train and y_train; I used them to show the distribution of my data points before and after resampling.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
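
&lt;p&gt;To make the "resample the training split only" point concrete, here is a minimal, hedged sketch in plain Python, with random duplication standing in for the fancier samplers and purely illustrative toy data:&lt;/p&gt;

```python
import random

random.seed(0)

# Toy labelled data: 10 majority (0) and 2 minority (1) samples. Hypothetical.
data = [([i, i], 0) for i in range(10)] + [([100, 100], 1), ([101, 101], 1)]

# Simple split: hold out the last majority and last minority sample for test.
test = [data[9], data[11]]
train = data[:9] + [data[10]]

# Random oversampling applied to the TRAINING split only:
minority = [d for d in train if d[1] == 1]
majority = [d for d in train if d[1] == 0]
train_res = majority + [random.choice(minority) for _ in range(len(majority))]

# The held-out test split keeps its original, real-world class distribution.
```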

&lt;p&gt;&lt;a href="https://i.stack.imgur.com/NiRuq.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kAHYm8Pb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.stack.imgur.com/NiRuq.png" alt="Comparison Initial Dataset and multiple oversampling techniques" width="880" height="884"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For the RandomUnderSampler, the first plot is without replacement and the second is with replacement=True&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://i.stack.imgur.com/IJri5.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i0LqMnOJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.stack.imgur.com/IJri5.png" alt="Comparison Initial Dataset and multiple oversampling techniques" width="880" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.stack.imgur.com/KpNx0.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sFngHRSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.stack.imgur.com/KpNx0.png" alt="Use of Combination techniques" width="880" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note also that my dataset contains &lt;strong&gt;multiple outliers&lt;/strong&gt;, and hence several columns are skewed, so I chose models that are less sensitive to skewness, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SVC&lt;/li&gt;
&lt;li&gt;Naive Bayes Classifier&lt;/li&gt;
&lt;li&gt;Ensemble XGboost&lt;/li&gt;
&lt;li&gt;KNN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, the best result I have obtained is with &lt;strong&gt;SVC(kernel = "rbf")&lt;/strong&gt; combined with the &lt;strong&gt;SMOTE technique&lt;/strong&gt; (of course, the resampling is only performed on the training set, since the test set should represent the real population):  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test Accuracy: 0.75&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Training Accuracy: 0.88&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
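
&lt;p&gt;Since SMOTE gave the best result here, a simplified, hedged sketch of what it does: it creates synthetic minority points by interpolating between existing minority samples. This toy version picks a random minority partner instead of a true k-nearest neighbour, and all data are hypothetical:&lt;/p&gt;

```python
import random

random.seed(42)

def smote_like(minority, n_new):
    """Generate n_new synthetic minority samples by linear interpolation
    between pairs of existing minority samples (simplified SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(minority, 2)  # two distinct minority points
        gap = random.random()              # interpolation factor in [0, 1)
        synthetic.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]
new_points = smote_like(minority, 5)
# Each synthetic point lies on a segment between two real minority points.
```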

&lt;p&gt;But the classification report is not good: the &lt;strong&gt;f1-score is 0.51&lt;/strong&gt;, and there is a &lt;strong&gt;real issue with the "1" class&lt;/strong&gt; even after resampling, as you can see below:&lt;br&gt;&lt;br&gt;
&lt;a href="https://i.stack.imgur.com/EMjuC.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hRplCAHM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.stack.imgur.com/EMjuC.png" alt="enter image description here" width="563" height="176"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Here is also the Confusion Matrix:&lt;br&gt;&lt;br&gt;
&lt;a href="https://i.stack.imgur.com/8GRjf.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Llgpqf1U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.stack.imgur.com/8GRjf.png" alt="enter image description here" width="435" height="320"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Can you please help me improve the f1-score? What is your analysis of the situation, and what are your suggestions?&lt;/p&gt;
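
&lt;p&gt;For reference when reading the report: the class-1 f1-score comes directly from the confusion-matrix counts for that class. The sketch below uses made-up counts (chosen only to land near an f1 of about 0.51; the real numbers are in the images above) to show how a low minority-class f1 can coexist with decent overall accuracy:&lt;/p&gt;

```python
# Hypothetical confusion-matrix counts for the minority ("1") class only.
# Illustrative values, NOT the author's actual numbers.
tp, fp, fn = 30, 25, 33

precision = tp / (tp + fp)  # 30 / 55
recall = tp / (tp + fn)     # 30 / 63
f1 = 2 * precision * recall / (precision + recall)

# A model can score high overall accuracy while class-1 f1 stays low,
# because the abundant "0" class dominates the accuracy figure.
```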

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
