<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elahe Dorani</title>
    <description>The latest articles on DEV Community by Elahe Dorani (@elldora).</description>
    <link>https://dev.to/elldora</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F577666%2F6ee0ca9f-8b36-4007-b87f-243f9c98ca86.jpg</url>
      <title>DEV Community: Elahe Dorani</title>
      <link>https://dev.to/elldora</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elldora"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Mon, 26 May 2025 13:46:18 +0000</pubDate>
      <link>https://dev.to/elldora/-gbp</link>
      <guid>https://dev.to/elldora/-gbp</guid>
      <description>&lt;p&gt;Boost: &lt;a href="https://dev.to/mehrandvd/demystifying-aicontents-in-microsoftextensionsai-5hg8"&gt;Demystifying AIContents in Microsoft.Extensions.AI&lt;/a&gt; by Mehran Davoudi (Jan 13 '25, 2 min read).&lt;/p&gt;</description>
      <category>dotnet</category>
      <category>openai</category>
      <category>extensions</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Data Drift: Understanding and Detecting Changes in Data Distribution</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Wed, 21 Jun 2023 20:21:38 +0000</pubDate>
      <link>https://dev.to/elldora/data-drift-understanding-and-detecting-changes-in-data-distribution-ne</link>
      <guid>https://dev.to/elldora/data-drift-understanding-and-detecting-changes-in-data-distribution-ne</guid>
      <description>&lt;h2&gt;
  
  
  What is Data Drift?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Data drift&lt;/code&gt; refers to the &lt;code&gt;distributional change&lt;/code&gt; between the data used to train a model and the data being sent to the deployed model. One of the important approaches in machine learning modeling is &lt;strong&gt;probabilistic modeling&lt;/strong&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;From &lt;strong&gt;Probabilistic Machine Learning&lt;/strong&gt; perspective, we can assume that features in a dataset, are drawn from a hypothetical distribution.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, in real-world modeling, it becomes evident that &lt;em&gt;data does not remain constant over time&lt;/em&gt;. It is influenced by various factors such as &lt;code&gt;seasonality changes&lt;/code&gt;, &lt;code&gt;missing values&lt;/code&gt;, &lt;code&gt;technical issues&lt;/code&gt;, and &lt;code&gt;time fluctuations&lt;/code&gt;. This means that a dataset collected for machine learning modeling may not be the same at all times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regular monitoring of the model performance&lt;/strong&gt; allows us to catch instances of data drift. It is crucial to monitor the &lt;strong&gt;change in data distribution&lt;/strong&gt; between the training data and live data from time to time. &lt;/p&gt;

&lt;p&gt;In most cases, the occurrence of data drift shows that our trained model is becoming outdated, and it should be retrained or updated with the newest dataset. Here, "live data" refers to the data that is being sent to the deployed model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 5 Data Drift Techniques
&lt;/h2&gt;

&lt;p&gt;Because I needed to evaluate a deployed model, I had to monitor its results on unseen data. But it was a real quest to understand how to measure the model performance. It was also not clear how I could measure the data behavior!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.evidentlyai.com/"&gt;EvidentlyAI&lt;/a&gt;&lt;/strong&gt; is one of websites I check regularly its articles. In &lt;a href="https://www.evidentlyai.com/blog/data-drift-detection-large-datasets"&gt;this article&lt;/a&gt;, it has introduced the &lt;code&gt;data drift&lt;/code&gt; concept and &lt;code&gt;top 5 techniques&lt;/code&gt; to detect it on the features used in a large dataset. It also has provided a simple example &lt;/p&gt;

&lt;p&gt;These techniques are listed below (a small code sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test"&gt;Kolmogrorov-Smirnov (KS)&lt;/a&gt;&lt;/strong&gt; technique which is more suitable for numerical features. It is a non-parametric test score. When we use this test, we want to accept or reject that if two datasets are drawn from the same distribution or not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://mwburke.github.io/data%20science/2018/04/29/population-stability-index.html"&gt;Population stability index (PSI)&lt;/a&gt;&lt;/strong&gt; used to measure the data shift between two different datasets. It is suitable for both numerical and categorical dataset. The more this metric, the more different between the distribution of two datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;Kullback-Leibler divergence(KL)&lt;/a&gt;&lt;/strong&gt; is a metric to measure the difference between two distributions. I could be applied on numeric and categorical datasets. Its range is between 0 to infinity. The more smaller KL metric shows that two distributions are very similar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence"&gt;Jensen-Shannon divergence&lt;/a&gt;&lt;/strong&gt; is defined based on the KL divergence. Its  difference is that it relies between 0 to 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance"&gt;Wasserstein distance&lt;/a&gt;&lt;/strong&gt; is a measure to monitor the numerical data drift. It is measured by the difference of the dataset means.
This article also has provided a practical example which I could apply on my own data to understand it well. &lt;/li&gt;
&lt;/ul&gt;
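
&lt;p&gt;As a quick, hedged sketch of my own (not code from the EvidentlyAI article), here is how a few of these scores could be computed for a single numerical feature with SciPy; the two arrays stand for a hypothetical training column and its live counterpart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy import stats
from scipy.spatial import distance

rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, scale=1.0, size=5000)  # reference (training) feature
live_col = rng.normal(loc=0.3, scale=1.2, size=5000)   # live feature with a mild shift

# Kolmogorov-Smirnov test: a small p-value suggests the samples come from different distributions
ks_stat, p_value = stats.ks_2samp(train_col, live_col)

# Wasserstein distance: the minimum "work" needed to turn one distribution into the other
w_dist = stats.wasserstein_distance(train_col, live_col)

# Jensen-Shannon distance on binned histograms (with base=2 it lies between 0 and 1)
bins = np.histogram_bin_edges(np.concatenate([train_col, live_col]), bins=30)
p, _ = np.histogram(train_col, bins=bins)
q, _ = np.histogram(live_col, bins=bins)
js_dist = distance.jensenshannon(p, q, base=2)

print(ks_stat, p_value, w_dist, js_dist)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;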

&lt;h2&gt;
  
  
  More Resources:
&lt;/h2&gt;

&lt;p&gt;As I work with the Azure Machine Learning platform, I am very interested in unlocking its features. &lt;br&gt;
First of all, I found a &lt;a href="https://learn.microsoft.com/en-us/training/modules/monitor-data-drift-with-azure-machine-learning/"&gt;mini course&lt;/a&gt; about data drift which you can easily get through to understand the main concepts in this field. &lt;/p&gt;

&lt;p&gt;Then, I really suggest having a look at this &lt;a href="https://towardsdatascience.com/getting-a-grip-on-data-and-model-drift-with-azure-machine-learning-ebd240176b8b"&gt;article&lt;/a&gt;, which clearly describes data and model drift. It also applies these concepts using the Azure Machine Learning capabilities for data drift. &lt;/p&gt;

&lt;p&gt;Finally, I found a &lt;a href="https://github.com/Azure/data-model-drift/tree/main"&gt;git repository&lt;/a&gt; which tries to monitor data drift using Azure ML and integrate it with a Power BI dashboard.&lt;/p&gt;

&lt;p&gt;I am interested in knowing more about this topic. If you know other useful resources, please leave some notes about them :)&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datadrift</category>
      <category>distribution</category>
      <category>largedataset</category>
    </item>
    <item>
      <title>How to plot feature importance using Truncated SVD</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Thu, 08 Jun 2023 11:09:06 +0000</pubDate>
      <link>https://dev.to/elldora/how-to-plot-feature-importance-using-truncated-svd-166n</link>
      <guid>https://dev.to/elldora/how-to-plot-feature-importance-using-truncated-svd-166n</guid>
      <description>&lt;p&gt;&lt;em&gt;How to choose and plot the most important features from an overwhelming pool of features, using Truncated SVD!?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What was the problem?
&lt;/h2&gt;

&lt;p&gt;As I explained in my &lt;a href="https://dev.to/elldora/unveiling-the-hidden-gems-exploring-important-features-with-truncated-svd-and-pca-22j6"&gt;previous post&lt;/a&gt;, in one of our projects at &lt;a href="https://melkradar.com/"&gt;MelkRadar&lt;/a&gt;, I had to tackle a large dataset including more than 1500 features! It was really overwhelming to identify the most important features, especially after the feature transformation.&lt;/p&gt;

&lt;p&gt;This problem was the beginning of my journey through feature selection and feature extraction techniques. I found out about two well-known techniques: Truncated SVD and PCA. In the end, I understood that Truncated SVD was a better solution to handle our problem with a large sparse dataset. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why is the Truncated SVD more informative than PCA in my problem?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Truncated SVD and PCA:
&lt;/h3&gt;

&lt;p&gt;Truncated SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) are both linear algebra techniques. They are dimensionality reduction techniques used to reduce the dimensionality of high-dimensional datasets. They both aim to find a lower-dimensional representation of the data while retaining the most important information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Truncated SVD vs. PCA:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The main difference between truncated SVD and PCA lies in how they handle data. Truncated SVD works directly on sparse matrices because it does not center the data, while PCA centers the data first and therefore effectively requires a dense matrix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Truncated SVD is often preferred for text data, while PCA is commonly used for numerical data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Truncated SVD is typically faster than PCA for large datasets, as it only computes a subset of the singular vectors and values.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How does Truncated SVD identify the most important features?
&lt;/h2&gt;

&lt;p&gt;Truncated SVD does &lt;code&gt;not directly&lt;/code&gt; identify the most frequent features in a dataset. Its primary goal is to reduce the dimensionality of the data while retaining the most important information. However, it is possible to indirectly identify the most frequent features by &lt;code&gt;examining the singular vectors&lt;/code&gt; obtained from the truncated SVD.&lt;/p&gt;

&lt;p&gt;In truncated SVD, the singular vectors are the linear combinations of the original features that explain the most variance in the data. Therefore, the features that have the highest coefficients in the singular vectors can be considered the most important features in the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now let's do some coding...
&lt;/h2&gt;

&lt;p&gt;To identify the most frequent features using truncated SVD, one could perform the following steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1:
&lt;/h3&gt;

&lt;p&gt;To keep it simple, I generate a random data matrix to simulate a dataset that I had in my real project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TruncatedSVD&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a random data matrix X of size (m x n)
&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2:
&lt;/h3&gt;

&lt;p&gt;Now compute the SVD of X to obtain the singular vectors, and keep only the first k of them to build the low-rank approximation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# number of singular vectors to keep
&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_approx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Vt&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="c1"&gt;# fit data to the model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;U&lt;/code&gt;: the left singular vectors matrix. It is an m*m matrix. Its columns are the eigenvectors of X multiplied by its transpose, one for each singular value.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;S&lt;/code&gt;: the singular values. Mathematically this is an m*n diagonal matrix whose diagonal elements are the singular values of X; &lt;code&gt;np.linalg.svd&lt;/code&gt; returns it as a 1-D array, which is why the code above wraps it with &lt;code&gt;np.diag(S[:k])&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Vt&lt;/code&gt;: the right singular vectors matrix. It is an n*n matrix. Its rows are the eigenvectors of the transpose of X multiplied by X, one for each singular value.&lt;/p&gt;

&lt;p&gt;Note that &lt;strong&gt;U&lt;/strong&gt; captures the &lt;code&gt;relationships among the rows&lt;/code&gt; (samples) of X, while &lt;strong&gt;Vt&lt;/strong&gt; captures the &lt;code&gt;relationships among the columns&lt;/code&gt; (features) of X. That is why the feature importance below is read from &lt;strong&gt;Vt&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3:
&lt;/h3&gt;

&lt;p&gt;To evaluate the model, we have to calculate the relative approximation error like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;approx_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;X_approx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Relative approximation error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;approx_error&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4:
&lt;/h3&gt;

&lt;p&gt;Keep the first &lt;code&gt;k&lt;/code&gt; right singular vectors (the top rows of &lt;code&gt;Vt&lt;/code&gt;). The features with the highest absolute coefficients in these vectors are the most important ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Vk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vt&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5:
&lt;/h3&gt;

&lt;p&gt;Compute the importance score of each feature as the sum of the absolute values of its coefficients across the first k singular vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_importance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Vk&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
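
&lt;p&gt;Side note: since Step 1 imports &lt;code&gt;TruncatedSVD&lt;/code&gt; from scikit-learn, here is a short, equivalent sketch with that estimator (my own addition, not part of the original steps). Its &lt;code&gt;components_&lt;/code&gt; attribute plays the role of &lt;code&gt;Vt[:k, :]&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.decomposition import TruncatedSVD
import numpy as np

svd = TruncatedSVD(n_components=k)        # keep k singular vectors, as above
X_reduced = svd.fit_transform(X)          # (m, k) low-dimensional representation

# components_ has shape (k, n) and corresponds to Vt[:k, :]
feature_importance = np.abs(svd.components_).sum(axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;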



&lt;h3&gt;
  
  
  Step 6:
&lt;/h3&gt;

&lt;p&gt;Sort the feature importance scores in descending order and use the resulting ranking to identify the most important features in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sorted_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_importance&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Save the names of the top 10 most frequent features in a list
&lt;/span&gt;&lt;span class="n"&gt;top_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sorted_idx&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7:
&lt;/h3&gt;

&lt;p&gt;Create a bar plot of the top 10 most frequent features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;feature_importance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sorted_idx&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;top_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Importance Score'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Most Frequent Features in Truncated SVD'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Note that...
&lt;/h2&gt;

&lt;p&gt;This approach assumes that the most frequent features are also the most important features in the dataset. This may not always be the case, as important features may have lower frequencies if they are correlated with other features that are more frequent. &lt;/p&gt;

&lt;p&gt;Additionally, the choice of the truncation parameter in truncated SVD can affect the results, so it is important to choose an appropriate truncation level based on the problem at hand.&lt;/p&gt;

</description>
      <category>featureimportance</category>
      <category>plot</category>
      <category>truncatedsvd</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Unveiling the Hidden Gems: Exploring Important Features with Truncated SVD and PCA</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Sat, 20 May 2023 14:36:03 +0000</pubDate>
      <link>https://dev.to/elldora/unveiling-the-hidden-gems-exploring-important-features-with-truncated-svd-and-pca-22j6</link>
      <guid>https://dev.to/elldora/unveiling-the-hidden-gems-exploring-important-features-with-truncated-svd-and-pca-22j6</guid>
      <description>&lt;p&gt;&lt;em&gt;My Journey with Multimodal Data Preprocessing and Truncated SVD&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dealing with multimodal dataset and dimensionality reduction
&lt;/h2&gt;

&lt;p&gt;In one of our projects, we had &lt;em&gt;a dataset containing over 1500 features&lt;/em&gt; to create a machine learning model. By &lt;code&gt;multimodality&lt;/code&gt;, I mean there was a combination of &lt;code&gt;numerical&lt;/code&gt;, &lt;code&gt;categorical&lt;/code&gt;, and &lt;code&gt;text&lt;/code&gt; features in it.&lt;/p&gt;

&lt;p&gt;To handle this dataset, I employed a &lt;em&gt;standard preprocessing strategy&lt;/em&gt;, and the original features were transformed into even more features. A crucial aspect of analyzing these additional features was determining a method to identify &lt;strong&gt;the most important&lt;/strong&gt; ones.&lt;/p&gt;

&lt;p&gt;Of course, before modeling, we analyze the data to keep the most informative samples and features. But in this project, we still had to deal with the curse of dimensionality. &lt;/p&gt;

&lt;p&gt;For example, among these features there were numerous &lt;strong&gt;categorical&lt;/strong&gt; variables, which I converted to numeric values using &lt;code&gt;OneHotEncoding&lt;/code&gt;. This picture shows the idea simply, but if you want to know more about it you can visit &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mMa4T4Gl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zydst39q6994d7x6dg6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mMa4T4Gl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zydst39q6994d7x6dg6j.png" alt="OneHotEncoding" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, there are some &lt;strong&gt;text&lt;/strong&gt; features in this dataset. To use this kind of feature, the &lt;code&gt;Tfidf-Vectorizer&lt;/code&gt; came in handy! This technique tries to identify the more important tokens in a text by weighting how often they appear in a document against how common they are across all documents. This picture may show the idea in one shot, but if you want to know more you can again visit &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x6_4VXjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a2bhlbo0eqxh4nh4x6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x6_4VXjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a2bhlbo0eqxh4nh4x6q.png" alt="TF-IDF Vectorizer" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our machine learning pipeline consists of &lt;code&gt;featurization&lt;/code&gt;, &lt;code&gt;preprocessing&lt;/code&gt; and &lt;code&gt;modeling&lt;/code&gt;. After the &lt;strong&gt;featurization&lt;/strong&gt; step, we were faced with an enormous sparse data matrix. In a sparse matrix, there are lots of cells with zero and just a few cells containing non-zero values. Using this kind of data matrix can cause computational overhead and slow down the modeling process. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hwln_m9x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jt99vvphu6gds4ldrrie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hwln_m9x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jt99vvphu6gds4ldrrie.png" alt="Spars Data Matrix" width="600" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first idea was to use the well-known PCA algorithm as a dimensionality reduction technique. When I attempted to apply the PCA algorithm, I encountered an error indicating that the algorithm could not be used with a sparse matrix. But why?&lt;/p&gt;
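
&lt;p&gt;A minimal way to reproduce that behaviour (my own sketch, not the project code; in older scikit-learn versions the default PCA refuses sparse input because it would have to densify the data in order to center it):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from scipy import sparse
from sklearn.decomposition import PCA

# a hypothetical sparse matrix, similar in shape to our featurized data
X_sparse = sparse.random(1000, 1500, density=0.01, format='csr', random_state=0)

try:
    PCA(n_components=10).fit(X_sparse)
except TypeError as err:
    # older scikit-learn versions complain that PCA does not support sparse input
    # and point to TruncatedSVD as a possible alternative
    print(err)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;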

&lt;p&gt;Consequently, I started exploring Truncated SVD as an alternative method. &lt;/p&gt;

&lt;p&gt;In the next section I have tried to sum up everything I learned about this technique in comparison to PCA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why was Truncated SVD better than PCA for a sparse data matrix?
&lt;/h2&gt;

&lt;p&gt;Truncated SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) are both linear algebra techniques that can be used to reduce the dimensionality of high-dimensional data, while retaining the most important information.&lt;/p&gt;

&lt;p&gt;As I mentioned before, I was dealing with a large dataset that, even after the featurization step, was still large enough to push me to look for an alternative way to deal with it!&lt;/p&gt;

&lt;p&gt;The main differences between Truncated SVD and PCA which I found out about are:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The objective:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PCA&lt;/em&gt;&lt;/strong&gt; aims to find the directions (principal components) that explain the &lt;code&gt;maximum amount of variance&lt;/code&gt; in the data, while &lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; aims to &lt;code&gt;factorize a matrix&lt;/code&gt; into two lower rank matrices.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The input data:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PCA&lt;/em&gt;&lt;/strong&gt; is typically applied to a &lt;code&gt;covariance matrix&lt;/code&gt;, while &lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; can be applied &lt;code&gt;directly to a data matrix&lt;/code&gt; without computing the covariance matrix.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The output:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PCA&lt;/em&gt;&lt;/strong&gt; provides the &lt;code&gt;principal components&lt;/code&gt;, which are linear combinations of the original variables, while &lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; provides the &lt;code&gt;singular vectors&lt;/code&gt;, which are also linear combinations of the original variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The number of components:
&lt;/h3&gt;

&lt;p&gt;In PCA, the &lt;code&gt;number of principal components&lt;/code&gt; to keep is typically chosen based on the &lt;code&gt;percentage of variance&lt;/code&gt; explained or by setting a fixed number of components. In Truncated SVD, the &lt;code&gt;number of singular vectors&lt;/code&gt; to keep is typically chosen based on the &lt;code&gt;rank of the matrix&lt;/code&gt; or a fixed number of components.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The computation:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; is typically &lt;code&gt;faster&lt;/code&gt; than PCA for &lt;code&gt;large datasets&lt;/code&gt;, as it only computes a subset of the singular vectors and values. &lt;/p&gt;

&lt;p&gt;As I described at the beginning, our dataset was large and sparse. This was very important in our case, because we use a pay-as-you-go Azure Compute to run the experiments, so it was crucial to save computation time. &lt;/p&gt;

&lt;h2&gt;
  
  
  To sum up...
&lt;/h2&gt;

&lt;p&gt;Both &lt;code&gt;Truncated SVD&lt;/code&gt; and &lt;code&gt;PCA&lt;/code&gt; are useful techniques for reducing the dimensionality of high-dimensional data. &lt;/p&gt;

&lt;p&gt;The choice of which technique to use depends on the specific requirements of the problem at hand. In our case, the large sparse data matrix led us to choose &lt;code&gt;Truncated SVD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In my next post, I will show some simple code to use this technique!&lt;/p&gt;

</description>
      <category>featureimportance</category>
      <category>pca</category>
      <category>truncatedsvd</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Fake Data with Google Back-Translator API!</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Sat, 06 May 2023 18:09:05 +0000</pubDate>
      <link>https://dev.to/elldora/fake-data-with-google-back-translator-api-ndf</link>
      <guid>https://dev.to/elldora/fake-data-with-google-back-translator-api-ndf</guid>
      <description>&lt;p&gt;&lt;strong&gt;Machine learning&lt;/strong&gt; requires techniques to address the challenges of working with terribly &lt;strong&gt;imbalance&lt;/strong&gt; datasets. &lt;em&gt;&lt;strong&gt;Data Augmentation&lt;/strong&gt;&lt;/em&gt; is a class of techniques you can use to generate fake data.&lt;/p&gt;

&lt;h2&gt;
  
  
  SMOTE
&lt;/h2&gt;

&lt;p&gt;One of the most popular ways to create fake data for multimodal datasets is the &lt;strong&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/smote?view=azureml-api-2"&gt;SMOTE&lt;/a&gt;&lt;/strong&gt; technique, which can be applied to numerical and categorical features. &lt;br&gt;
The SMOTE technique is based on the KNN algorithm. You can read more about it here:&lt;br&gt;
&lt;a href="https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QvbbVOeX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2fbh3me4c63h1gwuxz8t.png" alt="SMOTE technique visualization" width="656" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the &lt;strong&gt;SMOTE&lt;/strong&gt; technique can be a nice data generator for numerical and categorical features, when we apply it to text data it can be biased due to &lt;em&gt;duplicate text samples&lt;/em&gt;. On the other hand, it can inject &lt;em&gt;noisy samples&lt;/em&gt; into the dataset.&lt;/p&gt;

&lt;p&gt;In a real project, we were tackling an &lt;em&gt;imbalanced multimodal dataset&lt;/em&gt;. The issues we needed to handle in this dataset were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multimodality&lt;/strong&gt;: there were &lt;code&gt;numerical&lt;/code&gt;, &lt;code&gt;categorical&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; features in this dataset. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severe imbalance&lt;/strong&gt;: there was a terribly unequal proportion between the classes, e.g. 98 percent of Class1 and 2 percent of Class2. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of data&lt;/strong&gt;: there were just about 80 samples of the minority class. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-English text&lt;/strong&gt;: the text feature was in Persian. It was important to generate similar text data while keeping it close to the original.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Back-Translation Technique
&lt;/h2&gt;

&lt;p&gt;As I described before, we were suffering from a lack of data for the minority class. Using the SMOTE technique injected copied text samples into the dataset and did not improve the model. So, we needed a more efficient technique. &lt;br&gt;
&lt;strong&gt;&lt;a href="https://www.kaggle.com/code/sajjadayobi360/filtered-back-translation"&gt;Filtered Back-Translator&lt;/a&gt;&lt;/strong&gt; was a great idea to handle this issue.&lt;br&gt;
Google's translation service is backed by pretrained neural models and can be accessed as an API. Used as a back-translator, it generates high-quality fake text data by translating text between the original language and another language and back again, thereby creating new text samples that are similar to the original. This picture shows the whole procedure simply:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/topics/back-translation"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wc-bDwxd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dl10vs1krbiopv907rbo.png" alt="Back-Translation Procedure" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the translations are performed by Google's translation engine, the generated text is of high quality and can be used for data augmentation and model testing. &lt;/p&gt;

&lt;p&gt;The Google back translator has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, it requires no training data or model training, making it an easy-to-use API for generating new data. &lt;/li&gt;
&lt;li&gt;Second, the generated text is of high quality thanks to Google's expertise and the huge amount of data behind its translation models. &lt;/li&gt;
&lt;li&gt;Finally, the generated text can be used for various applications, such as improving the performance of machine learning models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;To sum up...&lt;/strong&gt;&lt;/em&gt; &lt;br&gt;
Generating fake data is an important technique for addressing severe imbalances in datasets in machine learning. SMOTE is a popular approach, but it cannot be applied directly to text data. The Google back translator is an alternative approach that produces high-quality results and can be used to augment text data. By combining SMOTE and the Google back translator, it is possible to create fake data for multimodal datasets that include text data, resulting in improved machine learning model performance. &lt;br&gt;
We successfully used the Google back translator to generate more text data for a project with an imbalance of 98-2 in the class distribution, resulting in a 20% improvement in the F-score and a more reliable model.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>backtranslation</category>
      <category>smote</category>
      <category>imbalancedata</category>
    </item>
    <item>
      <title>How f1-score helped me to choose the best classification model?</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Sat, 01 Apr 2023 21:54:07 +0000</pubDate>
      <link>https://dev.to/elldora/how-f1-score-helped-me-to-choose-the-best-classification-model-209c</link>
      <guid>https://dev.to/elldora/how-f1-score-helped-me-to-choose-the-best-classification-model-209c</guid>
      <description>&lt;p&gt;&lt;em&gt;"F1-score" is one of the main metrics that have always been suggested to evaluate the result of any imbalance classification model. But if you had tried to use it as your key metric, may be faced with different variations of this metric... f1-score, f-score weighted, f-score macro, f-score micro, f-score binary, and f-score class-wise!!!&lt;br&gt;
So, when choose which? Or which helps when?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you have any experience in classification modeling with an imbalanced dataset, one of the main metrics that has always been suggested by the experts is the famous "f1-score".&lt;br&gt;
Imbalanced datasets are those that have an asymmetric proportion of items belonging to different classes. In my project, I have wrangled with a 90-10 imbalanced dataset of adverts written by "Realtors" and "People". My main goal is to find the best classification model, the one that classifies the written adverts with the minimum number of misclassified items.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;In imbalanced datasets, accuracy, the most common metric for classification problems, does not describe the model well. Why?&lt;br&gt;
If I define a fake model which labels all items as "Realtor", then this fake model has 90 percent accuracy. It might seem there is no need to put much time and effort into developing a better model!&lt;br&gt;
On the other hand, the model has not seen the classes in equal proportions, so it cannot learn them equally. It is more likely to learn the majority class than the minority one. But accuracy treats both classes as equally important when evaluating the model.&lt;/p&gt;

&lt;p&gt;In these cases, the f1-score is the best metric that could help to assess the model efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Confusion Matrix
&lt;/h2&gt;

&lt;p&gt;In all classification problems, the first and most useful step to get the most valuable insight into the model is to calculate the value of each cell in the confusion matrix.&lt;br&gt;
The confusion matrix clearly shows how many of the items are truly or falsely classified by the proposed model.&lt;br&gt;
In this matrix, there is one row and one column for each class. So, for a binary problem, there are four main cells that categorize the results of the model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S8vv6fXA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwbycgyusxwommhif2bj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S8vv6fXA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwbycgyusxwommhif2bj.png" alt="Image description" width="753" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I suppose that the positive label belongs to the "Realtor" class and the negative one to the "People" class:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;TP (true-positive):&lt;/em&gt; number of items classified as "Realtor" whose true label is "Realtor"&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;TN (true-negative):&lt;/em&gt; number of items classified as "People" whose true label is "People"&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;FP (false-positive):&lt;/em&gt; number of items classified as "Realtor" whose true label is "People"&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;FN (false-negative):&lt;/em&gt; number of items classified as "People" whose true label is "Realtor"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As is clear, the denser the main diagonal, the more reliable the model.&lt;br&gt;
So, if I develop a model which predicts the most TP and TN among the other models, then I have done my job :)&lt;/p&gt;

&lt;p&gt;Well, there are already well-defined metrics over the confusion matrix. The two most important ones that can help me are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Recall = TP / (TP+FN)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Precision = TP / (TP+FP)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, if I aim to increase both metrics at the same time, then I will get my best model.&lt;br&gt;
The "F1-score" metric is the one that will do this for me!&lt;/p&gt;

&lt;h2&gt;
  
  
  F1-score Variations in Azure ML
&lt;/h2&gt;

&lt;p&gt;In the previous section, you can clearly see that both the &lt;strong&gt;"Precision"&lt;/strong&gt; and &lt;strong&gt;"Recall"&lt;/strong&gt; metrics lie between 0 and 1. On the other hand, it is important that our evaluation accounts for the effect of the imbalanced dataset. So, considering the harmonic mean of &lt;strong&gt;Precision&lt;/strong&gt; and &lt;strong&gt;Recall&lt;/strong&gt; helps us support all these purposes. The F1-score is the harmonic average of &lt;strong&gt;Recall&lt;/strong&gt; and &lt;strong&gt;Precision&lt;/strong&gt;. For more information about how the harmonic average can help us in this case, you can take a look at &lt;a href="https://www.investopedia.com/ask/answers/06/geometricmean.asp"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I work with the Azure Machine Learning Service. There are lots of variations of the &lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#classification-metrics"&gt;f1-score metric&lt;/a&gt;. At first, it may be quite confusing to choose the right one, but once you know the meaning of each metric, it is even helpful to consider more than one of them to evaluate the model's performance. &lt;/p&gt;

&lt;p&gt;An example of Azure ML metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l4REMQ8l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uru4q0vn3p2dsidxjo11.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l4REMQ8l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uru4q0vn3p2dsidxjo11.JPG" alt="Image description" width="780" height="765"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;F1-score as a harmonic average:&lt;/em&gt;&lt;br&gt;
F1-score = 2 * (precision * recall) / (precision + recall)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Class-wise F1-score:&lt;/em&gt;&lt;br&gt;
In the Azure ML Service, the class-wise f1-score is shown as a dictionary of the f1-score for each class. In binary classification, it is calculated from the formula above. For multiclass problems, it uses One-vs-Rest to calculate the f1-score for each class.&lt;/p&gt;

&lt;p&gt;Sample of f1-score for binary classification problem: {'True': 0.80,'False': 0.70}&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Macro F1-score:&lt;/em&gt;&lt;br&gt;
As its name declares, the f1-scores of all classes are taken into account to calculate the macro f1-score.&lt;br&gt;
This metric assumes that all classes have the same weight, so all of them participate as equally-weighted parts in the calculation.&lt;/p&gt;

&lt;p&gt;For example, if the f-score of the "Realtor" class is 0.80 and the f-score of the "People" class is 0.70, the f-score macro of this model is:&lt;br&gt;
Macro f1-score = (0.80 + 0.70) / 2 = 0.75&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Micro F1-score:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the micro f1-score, we sum up the element-wise counts needed to calculate this metric.&lt;/p&gt;

&lt;p&gt;I mean that it could be calculated if we have the total TP, FN, and FP over all classes. To get the total TP, we should sum up all TPs for each class, and do so for FNs and FPs. Then we calculate the &lt;strong&gt;micro f1-score&lt;/strong&gt;, using the total TP, total FN, and total FP.&lt;/p&gt;

&lt;p&gt;So, again, the name of this metric reveals that it considers the overall TP, FP, and FN counts from the individual items.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Weighted F1-score:&lt;/em&gt;&lt;br&gt;
As I mentioned earlier, we have an imbalanced dataset. It is clear that if the proportion of the classes is imbalanced, we need a technique that accounts for that proportion in the calculation.&lt;/p&gt;

&lt;p&gt;In the weighted f1-score, we use the weight of each class to highlight the effect of the minority class and not let the majority fade it with its power.&lt;/p&gt;

&lt;p&gt;In my example, the weighted f1-score will be calculated in this way:&lt;br&gt;
Weighted f1-score = 0.90 * 0.80 + 0.10 * 0.70 = 0.79&lt;/p&gt;

&lt;p&gt;As is clear, if I had a balanced dataset, the &lt;strong&gt;macro f1-score&lt;/strong&gt; and &lt;strong&gt;weighted f1-score&lt;/strong&gt; would be the same value.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Binary F1-score:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is the f1-score of the positive class in a binary classification problem.&lt;/p&gt;
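
&lt;p&gt;To make these variations concrete, here is a small sketch of my own (plain scikit-learn, not Azure ML output) that computes all of them on hypothetical labels, where 1 stands for "Realtor" and 0 for "People":&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import confusion_matrix, f1_score

# hypothetical true and predicted labels (1 = "Realtor", 0 = "People")
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print(f1_score(y_true, y_pred, average=None))        # class-wise f1-scores
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of the class-wise scores
print(f1_score(y_true, y_pred, average='micro'))     # computed from the total TP, FP and FN
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
print(f1_score(y_true, y_pred, average='binary'))    # positive class only (the default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;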

&lt;h2&gt;
  
  
  To Sum up...
&lt;/h2&gt;

&lt;p&gt;To sum up this article, the f1-score is one of the useful metrics which really helped me to evaluate the results of my experiments... I always consider all the metrics above to assess the validity of the model and the effect of the imbalanced dataset on my modeling. I hope this helps readers better evaluate their own machine learning models.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>azureml</category>
      <category>confusionmatrix</category>
      <category>math</category>
    </item>
    <item>
      <title>Configure a custom env on Azure ML</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Tue, 14 Mar 2023 17:47:12 +0000</pubDate>
      <link>https://dev.to/elldora/install-customized-env-on-your-azure-ml-platform-244d</link>
      <guid>https://dev.to/elldora/install-customized-env-on-your-azure-ml-platform-244d</guid>
      <description>&lt;p&gt;Configure a custom env on Azure ML&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared workspace for remote AI Teams
&lt;/h2&gt;

&lt;p&gt;When your team is working remotely, they need to collaborate on a shared cloud-based workspace. In this way, all the developers on your team can use it to run the experiments. &lt;/p&gt;

&lt;p&gt;My team and I at &lt;a href="https://melkradar.com/p/search"&gt;MelkRadar&lt;/a&gt; have had a nice experience working with &lt;strong&gt;Azure ML&lt;/strong&gt;. On this platform, you are able to import a wide variety of predefined environments and delegate your tasks to the Azure computes. Fortunately, the Azure ML designers have prepared some &lt;strong&gt;predefined environments&lt;/strong&gt; with the most useful and popular packages to make things &lt;strong&gt;more straightforward for developers&lt;/strong&gt;. You can easily find a list of these predefined environments based on your compute type in the Azure ML platform. &lt;/p&gt;

&lt;h2&gt;
  
  
  Customizing packages on a predefined env
&lt;/h2&gt;

&lt;p&gt;If you are an ML developer, you are familiar with the &lt;code&gt;Anaconda&lt;/code&gt; package manager. It is used to create your local environment and install the required packages. If it doesn't work, you may also know how to create a &lt;strong&gt;virtual env on your local machine&lt;/strong&gt; to do so. But when it comes to remote teamwork, it's a totally different challenge!&lt;/p&gt;

&lt;p&gt;In this case, you actually need to install your own package(s) through a customized environment on that machine. Here is my experience handling such situations.&lt;/p&gt;

&lt;p&gt;At the beginning of the project, it was OK to use the pre-defined env &lt;strong&gt;until I tried to work with some packages which were specially designed for a specific language&lt;/strong&gt;. To be clear, I was working with Persian texts, which have their own libraries for preprocessing tasks. I needed the &lt;code&gt;Hazm library&lt;/code&gt; to preprocess the Persian texts. I could easily add it to the Anaconda environment and work on my local machine. But working with Persian text is not as popular as working with English, so this library is not found in the predefined environments on the Azure machines. &lt;/p&gt;

&lt;p&gt;The challenge was to &lt;strong&gt;customize the predefined environments on Azure&lt;/strong&gt;. While handling this issue, I found that Azure ML lets you define your own &lt;code&gt;custom environment&lt;/code&gt; for this job. &lt;br&gt;
First, you list the must-install packages and their versions in a &lt;code&gt;yml&lt;/code&gt; file. Then, by adding some lines to your code, you tell the workspace to create this environment on the Azure machine.&lt;/p&gt;
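
&lt;p&gt;For illustration, a &lt;code&gt;conda_dependencies.yml&lt;/code&gt; for this scenario could look like the snippet below; the pinned versions are hypothetical, the point is simply that &lt;code&gt;hazm&lt;/code&gt; is listed as a pip dependency:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: azure-custom-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - azureml-defaults   # needed for runs submitted with azureml-core
      - hazm               # Persian text preprocessing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;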

&lt;p&gt;Here are some snippets to give you an insight into this topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core.runconfig&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerConfiguration&lt;/span&gt;

&lt;span class="n"&gt;myenv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_conda_specification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'azure-custom-env'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'./conda_dependencies.yml'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myenv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04'&lt;/span&gt;
&lt;span class="n"&gt;docker_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DockerConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_docker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then initialize the &lt;code&gt;ScriptRunConfig&lt;/code&gt; with this new env:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ScriptRunConfig&lt;/span&gt;

&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ScriptRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'script.py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;docker_runtime_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docker_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After starting the run, you will find a link to the environment that was built for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modifying packages and versions
&lt;/h2&gt;

&lt;p&gt;If you define the environment once and don't change the packages or their versions, the Azure compute will reuse the first installed env. But if you add or remove packages or change their versions, the Azure machine will consider it a new env and will install a new environment.&lt;/p&gt;

&lt;p&gt;There is also another way of environment management, known as system-managed, and I will talk about my experience with it in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Experience at MelkRadar AI Team
&lt;/h2&gt;

&lt;p&gt;I am an AI developer at &lt;a href="https://melkradar.com/p/search"&gt;MelkRadar&lt;/a&gt;, which is a real estate search engine in Iran. We are using &lt;strong&gt;Azure ML&lt;/strong&gt; as our main platform for collaboration among AI team members. In my recent project, handling a customized environment for my experiments was crucial, and this feature really helped me, so I shared my experience to help you as well :). You can find more information at this link:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=cli"&gt;How to manage environments in Azure ML&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azuremachinelearning</category>
      <category>python</category>
      <category>environment</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
