<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elahe Dorani</title>
    <description>The latest articles on DEV Community by Elahe Dorani (@elldora).</description>
    <link>https://dev.to/elldora</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F577666%2F6ee0ca9f-8b36-4007-b87f-243f9c98ca86.jpg</url>
      <title>DEV Community: Elahe Dorani</title>
      <link>https://dev.to/elldora</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elldora"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Mon, 26 May 2025 13:46:18 +0000</pubDate>
      <link>https://dev.to/elldora/-gbp</link>
      <guid>https://dev.to/elldora/-gbp</guid>
      <description>&lt;p&gt;Boost: &lt;a href="https://dev.to/mehrandvd/demystifying-aicontents-in-microsoftextensionsai-5hg8"&gt;Demystifying AIContents in Microsoft.Extensions.AI&lt;/a&gt; by Mehran Davoudi (Jan 13 '25, 2 min read).&lt;/p&gt;</description>
      <category>dotnet</category>
      <category>openai</category>
      <category>extensions</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Data Drift: Understanding and Detecting Changes in Data Distribution</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Wed, 21 Jun 2023 20:21:38 +0000</pubDate>
      <link>https://dev.to/elldora/data-drift-understanding-and-detecting-changes-in-data-distribution-ne</link>
      <guid>https://dev.to/elldora/data-drift-understanding-and-detecting-changes-in-data-distribution-ne</guid>
      <description>&lt;h2&gt;
  
  
  What is Data Drift?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Data drift&lt;/code&gt; refers to the &lt;code&gt;distributional change&lt;/code&gt; between the data used to train a model and the data being sent to the deployed model. One of the important approaches in machine learning modeling is &lt;strong&gt;probabilistic modeling&lt;/strong&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;From &lt;strong&gt;Probabilistic Machine Learning&lt;/strong&gt; perspective, we can assume that features in a dataset, are drawn from a hypothetical distribution.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, in real-world modeling, it becomes evident that &lt;em&gt;data does not remain constant over time&lt;/em&gt;. It is influenced by various factors such as &lt;code&gt;seasonality changes&lt;/code&gt;, &lt;code&gt;missing values&lt;/code&gt;, &lt;code&gt;technical issues&lt;/code&gt;, and &lt;code&gt;time fluctuations&lt;/code&gt;. This means that a dataset collected for machine learning modeling may not be the same at all times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regular monitoring of the model performance&lt;/strong&gt; allows us to catch instances of data drift. It is crucial to monitor the &lt;strong&gt;change in data distribution&lt;/strong&gt; between the training data and live data from time to time. &lt;/p&gt;

&lt;p&gt;In most cases, the occurrence of data drift shows that our trained model is becoming outdated, and it should be retrained or updated with the newest dataset. Here, "live data" refers to the data that is being sent to the deployed model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 5 Data Drift Techniques
&lt;/h2&gt;

&lt;p&gt;Because I needed to evaluate a deployed model, I had to monitor its results on unseen data. But it was a real quest to understand how to measure the model performance. It was also not clear how I could measure the data behavior!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.evidentlyai.com/"&gt;EvidentlyAI&lt;/a&gt;&lt;/strong&gt; is one of websites I check regularly its articles. In &lt;a href="https://www.evidentlyai.com/blog/data-drift-detection-large-datasets"&gt;this article&lt;/a&gt;, it has introduced the &lt;code&gt;data drift&lt;/code&gt; concept and &lt;code&gt;top 5 techniques&lt;/code&gt; to detect it on the features used in a large dataset. It also has provided a simple example &lt;/p&gt;

&lt;p&gt;These techniques are listed below (a small code sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test"&gt;Kolmogrorov-Smirnov (KS)&lt;/a&gt;&lt;/strong&gt; technique which is more suitable for numerical features. It is a non-parametric test score. When we use this test, we want to accept or reject that if two datasets are drawn from the same distribution or not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://mwburke.github.io/data%20science/2018/04/29/population-stability-index.html"&gt;Population stability index (PSI)&lt;/a&gt;&lt;/strong&gt; used to measure the data shift between two different datasets. It is suitable for both numerical and categorical dataset. The more this metric, the more different between the distribution of two datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;Kullback-Leibler divergence(KL)&lt;/a&gt;&lt;/strong&gt; is a metric to measure the difference between two distributions. I could be applied on numeric and categorical datasets. Its range is between 0 to infinity. The more smaller KL metric shows that two distributions are very similar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence"&gt;Jensen-Shannon divergence&lt;/a&gt;&lt;/strong&gt; is defined based on the KL divergence. Its  difference is that it relies between 0 to 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance"&gt;Wasserstein distance&lt;/a&gt;&lt;/strong&gt; is a measure to monitor the numerical data drift. It is measured by the difference of the dataset means.
This article also has provided a practical example which I could apply on my own data to understand it well. &lt;/li&gt;
&lt;/ul&gt;
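
&lt;p&gt;As a quick, hedged sketch of my own (not code from the EvidentlyAI article), here is how a few of these scores could be computed for a single numerical feature with SciPy; the two arrays stand for a hypothetical training column and its live counterpart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy import stats
from scipy.spatial import distance

rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, scale=1.0, size=5000)  # reference (training) feature
live_col = rng.normal(loc=0.3, scale=1.2, size=5000)   # live feature with a mild shift

# Kolmogorov-Smirnov test: a small p-value suggests the samples come from different distributions
ks_stat, p_value = stats.ks_2samp(train_col, live_col)

# Wasserstein distance: the minimum "work" needed to turn one distribution into the other
w_dist = stats.wasserstein_distance(train_col, live_col)

# Jensen-Shannon distance on binned histograms (with base=2 it lies between 0 and 1)
bins = np.histogram_bin_edges(np.concatenate([train_col, live_col]), bins=30)
p, _ = np.histogram(train_col, bins=bins)
q, _ = np.histogram(live_col, bins=bins)
js_dist = distance.jensenshannon(p, q, base=2)

print(ks_stat, p_value, w_dist, js_dist)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;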

&lt;h2&gt;
  
  
  More Resources:
&lt;/h2&gt;

&lt;p&gt;As I work with the Azure Machine Learning platform, I am very interested in unlocking its features. &lt;br&gt;
First of all, I found a &lt;a href="https://learn.microsoft.com/en-us/training/modules/monitor-data-drift-with-azure-machine-learning/"&gt;mini course&lt;/a&gt; about data drift which you can easily get through to understand the main concepts in this field. &lt;/p&gt;

&lt;p&gt;Then, I really suggest having a look at this &lt;a href="https://towardsdatascience.com/getting-a-grip-on-data-and-model-drift-with-azure-machine-learning-ebd240176b8b"&gt;article&lt;/a&gt;, which clearly describes data and model drift. It also applies these concepts using the Azure Machine Learning capabilities for data drift. &lt;/p&gt;

&lt;p&gt;Finally, I found a &lt;a href="https://github.com/Azure/data-model-drift/tree/main"&gt;git repository&lt;/a&gt; which tries to monitor data drift using Azure ML and integrate it with a Power BI dashboard.&lt;/p&gt;

&lt;p&gt;I am interested in knowing more about this topic. If you know other useful resources, please leave some notes about them :)&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datadrift</category>
      <category>distribution</category>
      <category>largedataset</category>
    </item>
    <item>
      <title>How to plot feature importance using Truncated SVD</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Thu, 08 Jun 2023 11:09:06 +0000</pubDate>
      <link>https://dev.to/elldora/how-to-plot-feature-importance-using-truncated-svd-166n</link>
      <guid>https://dev.to/elldora/how-to-plot-feature-importance-using-truncated-svd-166n</guid>
      <description>&lt;p&gt;&lt;em&gt;How to choose and plot the most important features from an overwhelming pool of features, using Truncated SVD!?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What was the problem?
&lt;/h2&gt;

&lt;p&gt;As I explained in my &lt;a href="https://dev.to/elldora/unveiling-the-hidden-gems-exploring-important-features-with-truncated-svd-and-pca-22j6"&gt;previous post&lt;/a&gt;, in one of our projects at &lt;a href="https://melkradar.com/"&gt;MelkRadar&lt;/a&gt;, I had to tackle a large dataset including more than 1500 features! It was really overwhelming to identify the most important features, especially after the feature transformation.&lt;/p&gt;

&lt;p&gt;This problem was the beginning of my journey through feature selection and feature extraction techniques. I found out about two well-known techniques: Truncated SVD and PCA. In the end, I understood that Truncated SVD was a better solution to handle our problem with a large sparse dataset. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why is the Truncated SVD more informative than PCA in my problem?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Truncated SVD and PCA:
&lt;/h3&gt;

&lt;p&gt;Truncated SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) are both linear algebra techniques. They are dimensionality reduction techniques used to reduce the dimensionality of high-dimensional datasets. They both aim to find a lower-dimensional representation of the data while retaining the most important information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Truncated SVD vs. PCA:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The main difference between truncated SVD and PCA lies in how they handle data. Truncated SVD works directly on sparse matrices because it does not center the data, while PCA centers the data first and therefore effectively requires a dense matrix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Truncated SVD is often preferred for text data, while PCA is commonly used for numerical data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Truncated SVD is typically faster than PCA for large datasets, as it only computes a subset of the singular vectors and values.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How does Truncated SVD identify the most important features?
&lt;/h2&gt;

&lt;p&gt;Truncated SVD does &lt;code&gt;not directly&lt;/code&gt; identify the most frequent features in a dataset. Its primary goal is to reduce the dimensionality of the data while retaining the most important information. However, it is possible to indirectly identify the most frequent features by &lt;code&gt;examining the singular vectors&lt;/code&gt; obtained from the truncated SVD.&lt;/p&gt;

&lt;p&gt;In truncated SVD, the singular vectors are the linear combinations of the original features that explain the most variance in the data. Therefore, the features that have the highest coefficients in the singular vectors can be considered the most important features in the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now let's do some coding...
&lt;/h2&gt;

&lt;p&gt;To identify the most frequent features using truncated SVD, one could perform the following steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1:
&lt;/h3&gt;

&lt;p&gt;To keep it simple, I generate a random data matrix to simulate a dataset that I had in my real project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TruncatedSVD&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a random data matrix X of size (m x n)
&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2:
&lt;/h3&gt;

&lt;p&gt;Now compute the SVD of X to obtain the singular vectors, and keep only the first k of them to build the low-rank approximation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# number of singular vectors to keep
&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;svd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_approx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Vt&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="c1"&gt;# fit data to the model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;U&lt;/code&gt;: the left singular vectors matrix. It is an m*m matrix. Its columns are the eigenvectors of X multiplied by its transpose, one for each singular value.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;S&lt;/code&gt;: the singular values. Mathematically this is an m*n diagonal matrix whose diagonal elements are the singular values of X; &lt;code&gt;np.linalg.svd&lt;/code&gt; returns it as a 1-D array, which is why the code above wraps it with &lt;code&gt;np.diag(S[:k])&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Vt&lt;/code&gt;: the right singular vectors matrix. It is an n*n matrix. Its rows are the eigenvectors of the transpose of X multiplied by X, one for each singular value.&lt;/p&gt;

&lt;p&gt;Note that &lt;strong&gt;U&lt;/strong&gt; captures the &lt;code&gt;relationships among the rows&lt;/code&gt; (samples) of X, while &lt;strong&gt;Vt&lt;/strong&gt; captures the &lt;code&gt;relationships among the columns&lt;/code&gt; (features) of X. That is why the feature importance below is read from &lt;strong&gt;Vt&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3:
&lt;/h3&gt;

&lt;p&gt;To evaluate the model, we have to calculate the relative approximation error like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;approx_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;X_approx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Relative approximation error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;approx_error&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4:
&lt;/h3&gt;

&lt;p&gt;Keep the first &lt;code&gt;k&lt;/code&gt; right singular vectors (the top rows of &lt;code&gt;Vt&lt;/code&gt;). The features with the highest absolute coefficients in these vectors are the most important ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Vk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vt&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5:
&lt;/h3&gt;

&lt;p&gt;Compute the importance score of each feature as the sum of the absolute values of its coefficients across the first k singular vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_importance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Vk&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
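
&lt;p&gt;Side note: since Step 1 imports &lt;code&gt;TruncatedSVD&lt;/code&gt; from scikit-learn, here is a short, equivalent sketch with that estimator (my own addition, not part of the original steps). Its &lt;code&gt;components_&lt;/code&gt; attribute plays the role of &lt;code&gt;Vt[:k, :]&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.decomposition import TruncatedSVD
import numpy as np

svd = TruncatedSVD(n_components=k)        # keep k singular vectors, as above
X_reduced = svd.fit_transform(X)          # (m, k) low-dimensional representation

# components_ has shape (k, n) and corresponds to Vt[:k, :]
feature_importance = np.abs(svd.components_).sum(axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;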



&lt;h3&gt;
  
  
  Step 6:
&lt;/h3&gt;

&lt;p&gt;Sort the feature importance scores in descending order and use the resulting ranking to identify the most important features in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sorted_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_importance&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Save the names of the top 10 most frequent features in a list
&lt;/span&gt;&lt;span class="n"&gt;top_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sorted_idx&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7:
&lt;/h3&gt;

&lt;p&gt;Create a bar plot of the top 10 most frequent features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;feature_importance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sorted_idx&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;top_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Importance Score'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Most Frequent Features in Truncated SVD'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Note that...
&lt;/h2&gt;

&lt;p&gt;This approach assumes that the most frequent features are also the most important features in the dataset. This may not always be the case, as important features may have lower frequencies if they are correlated with other features that are more frequent. &lt;/p&gt;

&lt;p&gt;Additionally, the choice of the truncation parameter in truncated SVD can affect the results, so it is important to choose an appropriate truncation level based on the problem at hand.&lt;/p&gt;

</description>
      <category>featureimportance</category>
      <category>plot</category>
      <category>truncatedsvd</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Unveiling the Hidden Gems: Exploring Important Features with Truncated SVD and PCA</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Sat, 20 May 2023 14:36:03 +0000</pubDate>
      <link>https://dev.to/elldora/unveiling-the-hidden-gems-exploring-important-features-with-truncated-svd-and-pca-22j6</link>
      <guid>https://dev.to/elldora/unveiling-the-hidden-gems-exploring-important-features-with-truncated-svd-and-pca-22j6</guid>
      <description>&lt;p&gt;&lt;em&gt;My Journey with Multimodal Data Preprocessing and Truncated SVD&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dealing with multimodal dataset and dimensionality reduction
&lt;/h2&gt;

&lt;p&gt;In one of our projects, we had &lt;em&gt;a dataset containing over 1500 features&lt;/em&gt; to create a machine learning model. By &lt;code&gt;multimodality&lt;/code&gt;, I mean there was a combination of &lt;code&gt;numerical&lt;/code&gt;, &lt;code&gt;categorical&lt;/code&gt;, and &lt;code&gt;text&lt;/code&gt; features in it.&lt;/p&gt;

&lt;p&gt;To handle this dataset, I employed a &lt;em&gt;standard preprocessing strategy&lt;/em&gt;, and the original features were transformed into even more features. A crucial aspect of analyzing these additional features was determining a method to identify &lt;strong&gt;the most important&lt;/strong&gt; ones.&lt;/p&gt;

&lt;p&gt;Of course, before modeling, we analyze the data to keep the most informative samples and features. But in this project, we still had to deal with the curse of dimensionality. &lt;/p&gt;

&lt;p&gt;For example, among these features there were numerous &lt;strong&gt;categorical&lt;/strong&gt; variables, which I converted to numeric values using &lt;code&gt;OneHotEncoding&lt;/code&gt;. This picture shows the idea simply, but if you want to know more about it you can visit &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mMa4T4Gl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zydst39q6994d7x6dg6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mMa4T4Gl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zydst39q6994d7x6dg6j.png" alt="OneHotEncoding" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, there are some &lt;strong&gt;text&lt;/strong&gt; features in this dataset. To use this kind of feature, the &lt;code&gt;Tfidf-Vectorizer&lt;/code&gt; came in handy! This technique tries to identify the more important tokens in a text by weighting how often they appear in a document against how common they are across all documents. This picture may show the idea in one shot, but if you want to know more you can again visit &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x6_4VXjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a2bhlbo0eqxh4nh4x6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x6_4VXjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a2bhlbo0eqxh4nh4x6q.png" alt="TF-IDF Vectorizer" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our machine learning pipeline consists of &lt;code&gt;featurization&lt;/code&gt;, &lt;code&gt;preprocessing&lt;/code&gt; and &lt;code&gt;modeling&lt;/code&gt;. After the &lt;strong&gt;featurization&lt;/strong&gt; step, we were faced with an enormous sparse data matrix. In a sparse matrix, there are lots of cells with zero and just a few cells containing non-zero values. Using this kind of data matrix can cause computational overhead and slow down the modeling process. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hwln_m9x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jt99vvphu6gds4ldrrie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hwln_m9x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jt99vvphu6gds4ldrrie.png" alt="Spars Data Matrix" width="600" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first idea was to use the well-known PCA algorithm as a dimensionality reduction technique. When I attempted to apply the PCA algorithm, I encountered an error indicating that the algorithm could not be used with a sparse matrix. But why?&lt;/p&gt;
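
&lt;p&gt;A minimal way to reproduce that behaviour (my own sketch, not the project code; in older scikit-learn versions the default PCA refuses sparse input because it would have to densify the data in order to center it):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from scipy import sparse
from sklearn.decomposition import PCA

# a hypothetical sparse matrix, similar in shape to our featurized data
X_sparse = sparse.random(1000, 1500, density=0.01, format='csr', random_state=0)

try:
    PCA(n_components=10).fit(X_sparse)
except TypeError as err:
    # older scikit-learn versions complain that PCA does not support sparse input
    # and point to TruncatedSVD as a possible alternative
    print(err)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;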

&lt;p&gt;Consequently, I started exploring Truncated SVD as an alternative method. &lt;/p&gt;

&lt;p&gt;In the next section I have tried to sum up everything I learned about this technique in comparison to PCA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why was Truncated SVD better than PCA for a sparse data matrix?
&lt;/h2&gt;

&lt;p&gt;Truncated SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) are both linear algebra techniques that can be used to reduce the dimensionality of high-dimensional data, while retaining the most important information.&lt;/p&gt;

&lt;p&gt;As I mentioned before, I was dealing with a large dataset that, even after the featurization step, was still large enough to push me to look for an alternative way to deal with it!&lt;/p&gt;

&lt;p&gt;The main differences between Truncated SVD and PCA which I found out about are:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The objective:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PCA&lt;/em&gt;&lt;/strong&gt; aims to find the directions (principal components) that explain the &lt;code&gt;maximum amount of variance&lt;/code&gt; in the data, while &lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; aims to &lt;code&gt;factorize a matrix&lt;/code&gt; into two lower rank matrices.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The input data:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PCA&lt;/em&gt;&lt;/strong&gt; is typically applied to a &lt;code&gt;covariance matrix&lt;/code&gt;, while &lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; can be applied &lt;code&gt;directly to a data matrix&lt;/code&gt; without computing the covariance matrix.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The output:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PCA&lt;/em&gt;&lt;/strong&gt; provides the &lt;code&gt;principal components&lt;/code&gt;, which are linear combinations of the original variables, while &lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; provides the &lt;code&gt;singular vectors&lt;/code&gt;, which are also linear combinations of the original variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The number of components:
&lt;/h3&gt;

&lt;p&gt;In PCA, the &lt;code&gt;number of principal components&lt;/code&gt; to keep is typically chosen based on the &lt;code&gt;percentage of variance&lt;/code&gt; explained or by setting a fixed number of components. In Truncated SVD, the &lt;code&gt;number of singular vectors&lt;/code&gt; to keep is typically chosen based on the &lt;code&gt;rank of the matrix&lt;/code&gt; or a fixed number of components.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The computation:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Truncated SVD&lt;/em&gt;&lt;/strong&gt; is typically &lt;code&gt;faster&lt;/code&gt; than PCA for &lt;code&gt;large datasets&lt;/code&gt;, as it only computes a subset of the singular vectors and values. &lt;/p&gt;

&lt;p&gt;As I described at the beginning, our dataset was large and sparse. This was very important in our case, because we use a pay-as-you-go Azure Compute to run the experiments, so it was crucial to save computation time. &lt;/p&gt;

&lt;h2&gt;
  
  
  To sum up...
&lt;/h2&gt;

&lt;p&gt;Both &lt;code&gt;Truncated SVD&lt;/code&gt; and &lt;code&gt;PCA&lt;/code&gt; are useful techniques for reducing the dimensionality of high-dimensional data. &lt;/p&gt;

&lt;p&gt;The choice of which technique to use depends on the specific requirements of the problem at hand. In our case, the large sparse data matrix led us to choose &lt;code&gt;Truncated SVD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In my next post, I will show some simple code to use this technique!&lt;/p&gt;

</description>
      <category>featureimportance</category>
      <category>pca</category>
      <category>truncatedsvd</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Fake Data with Google Back-Translator API!</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Sat, 06 May 2023 18:09:05 +0000</pubDate>
      <link>https://dev.to/elldora/fake-data-with-google-back-translator-api-ndf</link>
      <guid>https://dev.to/elldora/fake-data-with-google-back-translator-api-ndf</guid>
      <description>&lt;p&gt;&lt;strong&gt;Machine learning&lt;/strong&gt; requires techniques to address the challenges of working with terribly &lt;strong&gt;imbalance&lt;/strong&gt; datasets. &lt;em&gt;&lt;strong&gt;Data Augmentation&lt;/strong&gt;&lt;/em&gt; is a class of techniques you can use to generate fake data.&lt;/p&gt;

&lt;h2&gt;
  
  
  SMOTE
&lt;/h2&gt;

&lt;p&gt;One of the most popular ways to create fake data for multimodal datasets is the &lt;strong&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/smote?view=azureml-api-2"&gt;SMOTE&lt;/a&gt;&lt;/strong&gt; technique, which can be applied to numerical and categorical features. &lt;br&gt;
The SMOTE technique is based on the KNN algorithm. You can read more about it here:&lt;br&gt;
&lt;a href="https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QvbbVOeX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2fbh3me4c63h1gwuxz8t.png" alt="SMOTE technique visualization" width="656" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the &lt;strong&gt;SMOTE&lt;/strong&gt; technique can be a nice data generator for numerical and categorical features, when we apply it to text data it can be biased due to &lt;em&gt;duplicate text samples&lt;/em&gt;. On the other hand, it can inject &lt;em&gt;noisy samples&lt;/em&gt; into the dataset.&lt;/p&gt;

&lt;p&gt;In a real project, we were tackling an &lt;em&gt;imbalanced multimodal dataset&lt;/em&gt;. The issues we needed to handle in this dataset were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multimodality&lt;/strong&gt;: there were &lt;code&gt;numerical&lt;/code&gt;, &lt;code&gt;categorical&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; features in this dataset. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severe imbalance&lt;/strong&gt;: there was a terribly unequal proportion between the classes, e.g. 98 percent of Class1 and 2 percent of Class2. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of data&lt;/strong&gt;: there were just about 80 samples of the minority class. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-English text&lt;/strong&gt;: the text feature was in Persian. It was important to generate similar text data while keeping it close to the original.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Back-Translation Technique
&lt;/h2&gt;

&lt;p&gt;As I described before, we were suffering from a lack of data for the minority class. Using the SMOTE technique injected copied text samples into the dataset and did not improve the model. So, we needed a more efficient technique. &lt;br&gt;
&lt;strong&gt;&lt;a href="https://www.kaggle.com/code/sajjadayobi360/filtered-back-translation"&gt;Filtered Back-Translator&lt;/a&gt;&lt;/strong&gt; was a great idea to handle this issue.&lt;br&gt;
Google's translation service is backed by pretrained neural models and can be accessed as an API. Used as a back-translator, it generates high-quality fake text data by translating text between the original language and another language and back again, thereby creating new text samples that are similar to the original. This picture shows the whole procedure simply:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/topics/back-translation"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wc-bDwxd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dl10vs1krbiopv907rbo.png" alt="Back-Translation Procedure" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the translations are performed by Google's translation engine, the generated text is of high quality and can be used for data augmentation and model testing. &lt;/p&gt;

&lt;p&gt;The Google back translator has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, it requires no training data or model training, making it an easy-to-use API for generating new data. &lt;/li&gt;
&lt;li&gt;Second, the generated text is of high quality thanks to Google's expertise and the huge amount of data behind its translation models. &lt;/li&gt;
&lt;li&gt;Finally, the generated text can be used for various applications, such as improving the performance of machine learning models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;To sum up...&lt;/strong&gt;&lt;/em&gt; &lt;br&gt;
Generating fake data is an important technique for addressing severe imbalances in datasets in machine learning. SMOTE is a popular approach, but it cannot be applied directly to text data. The Google back translator is an alternative approach that produces high-quality results and can be used to augment text data. By combining SMOTE and the Google back translator, it is possible to create fake data for multimodal datasets that include text data, resulting in improved machine learning model performance. &lt;br&gt;
We successfully used the Google back translator to generate more text data for a project with an imbalance of 98-2 in the class distribution, resulting in a 20% improvement in the F-score and a more reliable model.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>backtranslation</category>
      <category>smote</category>
      <category>imbalancedata</category>
    </item>
    <item>
      <title>How f1-score helped me to choose the best classification model?</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Sat, 01 Apr 2023 21:54:07 +0000</pubDate>
      <link>https://dev.to/elldora/how-f1-score-helped-me-to-choose-the-best-classification-model-209c</link>
      <guid>https://dev.to/elldora/how-f1-score-helped-me-to-choose-the-best-classification-model-209c</guid>
      <description>&lt;p&gt;&lt;em&gt;"F1-score" is one of the main metrics that have always been suggested to evaluate the result of any imbalance classification model. But if you had tried to use it as your key metric, may be faced with different variations of this metric... f1-score, f-score weighted, f-score macro, f-score micro, f-score binary, and f-score class-wise!!!&lt;br&gt;
So, when choose which? Or which helps when?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you have any experience in classification modeling with an imbalanced dataset, one of the main metrics that has always been suggested by the experts is the famous "f1-score".&lt;br&gt;
Imbalanced datasets are those that have an asymmetric proportion of items belonging to different classes. In my project, I have wrangled with a 90-10 imbalanced dataset of adverts written by "Realtors" and "People". My main goal is to find the best classification model, the one that classifies the written adverts with the minimum number of misclassified items.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;In imbalanced datasets, accuracy, the most common metric for classification problems, does not describe the model well. Why?&lt;br&gt;
If I define a fake model which labels all items as "Realtor", then this fake model has 90 percent accuracy. It might seem there is no need to put much time and effort into developing a better model!&lt;br&gt;
On the other hand, the model has not seen the classes in equal proportions, so it cannot learn them equally. It is more likely to learn the majority class than the minority one. But accuracy treats both classes as equally important when evaluating the model.&lt;/p&gt;

&lt;p&gt;In these cases, the f1-score is the best metric that could help to assess the model efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Confusion Matrix
&lt;/h2&gt;

&lt;p&gt;In all classification problems, the first and most useful step to get the most valuable insight into the model is to calculate the value of each cell in the confusion matrix.&lt;br&gt;
The confusion matrix clearly shows how many of the items are truly or falsely classified by the proposed model.&lt;br&gt;
In this matrix, there is one row and one column for each class. So, for a binary problem, there are four main cells that categorize the results of the model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S8vv6fXA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwbycgyusxwommhif2bj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S8vv6fXA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xwbycgyusxwommhif2bj.png" alt="Image description" width="753" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I suppose that the positive label belongs to the "Realtor" class and the negative one to the "People" class:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;TP (true-positive):&lt;/em&gt; number of items classified as "Realtor" whose true label is "Realtor"&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;TN (true-negative):&lt;/em&gt; number of items classified as "People" whose true label is "People"&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;FP (false-positive):&lt;/em&gt; number of items classified as "Realtor" whose true label is "People"&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;FN (false-negative):&lt;/em&gt; number of items classified as "People" whose true label is "Realtor"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As is clear, the denser the main diagonal, the more reliable the model.&lt;br&gt;
So, if I develop a model which predicts the most TP and TN among the other models, then I have done my job :)&lt;/p&gt;

&lt;p&gt;Well, there are already well-defined metrics over the confusion matrix. The two most important ones that can help me are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Recall = TP / (TP+FN)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Precision = TP / (TP+FP)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, if I aim to increase both metrics at the same time, then I will get my best model.&lt;br&gt;
The "F1-score" metric is the one that will do this for me!&lt;/p&gt;

&lt;h2&gt;
  
  
  F1-score Variations in Azure ML
&lt;/h2&gt;

&lt;p&gt;In the previous section, you can clearly see that both the &lt;strong&gt;"Precision"&lt;/strong&gt; and &lt;strong&gt;"Recall"&lt;/strong&gt; metrics lie between 0 and 1. On the other hand, it is important that our evaluation accounts for the effect of the imbalanced dataset. So, considering the harmonic mean of &lt;strong&gt;Precision&lt;/strong&gt; and &lt;strong&gt;Recall&lt;/strong&gt; helps us support all these purposes. The F1-score is the harmonic average of &lt;strong&gt;Recall&lt;/strong&gt; and &lt;strong&gt;Precision&lt;/strong&gt;. For more information about how the harmonic average can help us in this case, you can take a look at &lt;a href="https://www.investopedia.com/ask/answers/06/geometricmean.asp"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I work with the Azure Machine Learning Service. There are lots of variations of the &lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#classification-metrics"&gt;f1-score metric&lt;/a&gt;. At first, it may be quite confusing to choose the right one, but once you know the meaning of each metric, it is even helpful to consider more than one of them to evaluate the model's performance. &lt;/p&gt;

&lt;p&gt;An example of Azure ML metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l4REMQ8l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uru4q0vn3p2dsidxjo11.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l4REMQ8l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uru4q0vn3p2dsidxjo11.JPG" alt="Image description" width="780" height="765"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;F1-score as a harmonic average:&lt;/em&gt;&lt;br&gt;
F1-score = 2 * (precision * recall) / (precision + recall)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Class-wise F1-score:&lt;/em&gt;&lt;br&gt;
In the Azure ML Service, the class-wise f1-score is shown as a dictionary of the f1-score for each class. In binary classification, it is calculated from the formula above. For multiclass problems, it uses One-vs-Rest to calculate the f1-score for each class.&lt;/p&gt;

&lt;p&gt;Sample of f1-score for binary classification problem: {'True': 0.80,'False': 0.70}&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Macro F1-score:&lt;/em&gt;&lt;br&gt;
As its name declares, the f1-scores of all classes are taken into account to calculate the macro f1-score.&lt;br&gt;
This metric assumes that all classes have the same weight, so all of them participate as equally-weighted parts in the calculation.&lt;/p&gt;

&lt;p&gt;For example, if the f-score of the "Realtor" class is 0.80 and the f-score of the "People" class is 0.70, the f-score macro of this model is:&lt;br&gt;
Macro f1-score = (0.80 + 0.70) / 2 = 0.75&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Micro F1-score:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the micro f1-score, we sum up the element-wise counts needed to calculate this metric.&lt;/p&gt;

&lt;p&gt;I mean that it could be calculated if we have the total TP, FN, and FP over all classes. To get the total TP, we should sum up all TPs for each class, and do so for FNs and FPs. Then we calculate the &lt;strong&gt;micro f1-score&lt;/strong&gt;, using the total TP, total FN, and total FP.&lt;/p&gt;

&lt;p&gt;So, again, the name of this metric reveals that it considers the overall TP, FP, and FN counts from the individual items.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Weighted F1-score:&lt;/em&gt;&lt;br&gt;
As I mentioned earlier, we have an imbalanced dataset. It is clear that if the proportion of the classes is imbalanced, we need a technique that accounts for that proportion in the calculation.&lt;/p&gt;

&lt;p&gt;In the weighted f1-score, we use the weight of each class to highlight the effect of the minority class and not let the majority fade it with its power.&lt;/p&gt;

&lt;p&gt;In my example, the weighted f1-score will be calculated in this way:&lt;br&gt;
Weighted f1-score = 0.90 * 0.80 + 0.10 * 0.70 = 0.79&lt;/p&gt;

&lt;p&gt;As is clear, if I had a balanced dataset, the &lt;strong&gt;macro f1-score&lt;/strong&gt; and &lt;strong&gt;weighted f1-score&lt;/strong&gt; would be the same value.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Binary F1-score:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is the f1-score of the positive class in a binary classification problem.&lt;/p&gt;
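
&lt;p&gt;To make these variations concrete, here is a small sketch of my own (plain scikit-learn, not Azure ML output) that computes all of them on hypothetical labels, where 1 stands for "Realtor" and 0 for "People":&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import confusion_matrix, f1_score

# hypothetical true and predicted labels (1 = "Realtor", 0 = "People")
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print(f1_score(y_true, y_pred, average=None))        # class-wise f1-scores
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of the class-wise scores
print(f1_score(y_true, y_pred, average='micro'))     # computed from the total TP, FP and FN
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
print(f1_score(y_true, y_pred, average='binary'))    # positive class only (the default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;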

&lt;h2&gt;
  
  
  To Sum up...
&lt;/h2&gt;

&lt;p&gt;To sum up this article, the f1-score is one of the useful metrics which really helped me to evaluate the results of my experiments... I always consider all the metrics above to assess the validity of the model and the effect of the imbalanced dataset on my modeling. I hope this helps readers better evaluate their own machine learning models.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>azureml</category>
      <category>confusionmatrix</category>
      <category>math</category>
    </item>
    <item>
      <title>Configure a custom env on Azure ML</title>
      <dc:creator>Elahe Dorani</dc:creator>
      <pubDate>Tue, 14 Mar 2023 17:47:12 +0000</pubDate>
      <link>https://dev.to/elldora/install-customized-env-on-your-azure-ml-platform-244d</link>
      <guid>https://dev.to/elldora/install-customized-env-on-your-azure-ml-platform-244d</guid>
      <description>&lt;p&gt;Configure a custom env on Azure ML&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared workspace for remote AI Teams
&lt;/h2&gt;

&lt;p&gt;When your team is working remotely, they need to collaborate on a shared cloud-based workspace. In this way, all the developers on your team can use it to run the experiments. &lt;/p&gt;

&lt;p&gt;My team and I at &lt;a href="https://melkradar.com/p/search"&gt;MelkRadar&lt;/a&gt; have had a nice experience working with &lt;strong&gt;Azure ML&lt;/strong&gt;. On this platform, you are able to import a wide variety of predefined environments and delegate your tasks to the Azure computes. Fortunately, the Azure ML designers have prepared some &lt;strong&gt;predefined environments&lt;/strong&gt; with the most useful and popular packages to make things &lt;strong&gt;more straightforward for developers&lt;/strong&gt;. You can easily find a list of these predefined environments based on your compute type in the Azure ML platform. &lt;/p&gt;

&lt;h2&gt;
  
  
  Customizing packages on a predefined env
&lt;/h2&gt;

&lt;p&gt;If you are an ML developer, you are familiar with the &lt;code&gt;Anaconda&lt;/code&gt; package manager. It is used to create your local environment and install the required packages. If it doesn't work, you may also know how to create a &lt;strong&gt;virtual env on your local machine&lt;/strong&gt; to do so. But when it comes to remote teamwork, it's a totally different challenge!&lt;/p&gt;

&lt;p&gt;In this case, you actually need to install your own package(s) through a customized environment on that machine. Here is my experience handling such situations.&lt;/p&gt;

&lt;p&gt;At the beginning of the project, it was OK to use the pre-defined env &lt;strong&gt;until I tried to work with some packages which were specially designed for a specific language&lt;/strong&gt;. To be clear, I was working with Persian texts, which have their own libraries for preprocessing tasks. I needed the &lt;code&gt;Hazm library&lt;/code&gt; to preprocess the Persian texts. I could easily add it to the Anaconda environment and work on my local machine. But working with Persian text is not as popular as working with English, so this library is not found in the predefined environments on the Azure machines. &lt;/p&gt;

&lt;p&gt;The challenge was to &lt;strong&gt;customize the predefined environments on Azure&lt;/strong&gt;. While handling this issue, I found that Azure ML lets you define your own &lt;code&gt;custom environment&lt;/code&gt; for this job. &lt;br&gt;
First, you list the must-install packages and their versions in a &lt;code&gt;yml&lt;/code&gt; file. Then, by adding some lines to your code, you tell the workspace to create this environment on the Azure machine.&lt;/p&gt;
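
&lt;p&gt;For illustration, a &lt;code&gt;conda_dependencies.yml&lt;/code&gt; for this scenario could look like the snippet below; the pinned versions are hypothetical, the point is simply that &lt;code&gt;hazm&lt;/code&gt; is listed as a pip dependency:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: azure-custom-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - azureml-defaults   # needed for runs submitted with azureml-core
      - hazm               # Persian text preprocessing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;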

&lt;p&gt;Here are some snippets to give you an insight into this topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core.runconfig&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerConfiguration&lt;/span&gt;

&lt;span class="n"&gt;myenv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_conda_specification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'azure-custom-env'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'./conda_dependencies.yml'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myenv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04'&lt;/span&gt;
&lt;span class="n"&gt;docker_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DockerConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_docker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then initialize the &lt;code&gt;ScriptRunConfig&lt;/code&gt; with this new env:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ScriptRunConfig&lt;/span&gt;

&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ScriptRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'script.py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;docker_runtime_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docker_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After starting the run, you will find a link to the environment that was built for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modifying packages and versions
&lt;/h2&gt;

&lt;p&gt;If you define the environment once and don't change the packages or their versions, the Azure compute will reuse the first installed env. But if you add or remove packages or change their versions, the Azure machine will consider it a new env and will install a new environment.&lt;/p&gt;

&lt;p&gt;There is also another way of environment management, known as system-managed, and I will talk about my experience with it in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Experience at MelkRadar AI Team
&lt;/h2&gt;

&lt;p&gt;I am an AI developer at &lt;a href="https://melkradar.com/p/search"&gt;MelkRadar&lt;/a&gt;, which is a real estate search engine in Iran. We are using &lt;strong&gt;Azure ML&lt;/strong&gt; as our main platform for collaboration among AI team members. In my recent project, handling a customized environment for my experiments was crucial, and this feature really helped me, so I shared my experience to help you as well :). You can find more information at this link:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=cli"&gt;How to manage environments in Azure ML&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azuremachinelearning</category>
      <category>python</category>
      <category>environment</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
