<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Umberto Calice</title>
    <description>The latest articles on DEV Community by Umberto Calice (@insidbyte).</description>
    <link>https://dev.to/insidbyte</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1078565%2F905b8099-c1b0-4764-b00d-06e26dc390ff.png</url>
      <title>DEV Community: Umberto Calice</title>
      <link>https://dev.to/insidbyte</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/insidbyte"/>
    <language>en</language>
    <item>
      <title>Analysis And Generation Model ML</title>
      <dc:creator>Umberto Calice</dc:creator>
      <pubDate>Mon, 08 May 2023 10:37:44 +0000</pubDate>
      <link>https://dev.to/insidbyte/analysisandgenerationmodelml-15l</link>
      <guid>https://dev.to/insidbyte/analysisandgenerationmodelml-15l</guid>
      <description>&lt;h1&gt;
  
  
  Analysis_And_Generation_Model_ML
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;BEFORE READING THIS REPOSITORY IT IS RECOMMENDED TO START FROM:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/insidbyte/Analysis_and_processing"&gt;https://github.com/insidbyte/Analysis_and_processing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I decided to generate a custom vocabulary to train the model, so it is worth reviewing that repository's code first.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  SEE THIS REPOSITORY AT: &lt;a href="https://github.com/insidbyte/Analysis_And_Generation_Model_ML"&gt;https://github.com/insidbyte/Analysis_And_Generation_Model_ML&lt;/a&gt;
&lt;/h1&gt;




&lt;p&gt;&lt;strong&gt;&lt;em&gt;OPTIONS:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1)-GENERATE MODEL

&lt;p&gt;2)-TEST WITH HYPERPARAMETER TUNING&lt;/p&gt;

&lt;p&gt;3)-PLOT WITH TFIDF VECTORIZER AND SVD TRUNCATED REDUCTION&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  Menu
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When ModelsGenerator.py is launched from the terminal, the following menu appears:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WuzYzeS5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aj6wwy60xud3raf0iapv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WuzYzeS5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aj6wwy60xud3raf0iapv.png" alt="Image description" width="679" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;&lt;em&gt;OPTION 1:&lt;/em&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Model generation:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I decided to use TF-IDF and a support vector machine because they are well suited to text processing, and an SVM with a linear kernel works particularly well for two-class classification, as in our case: positive and negative.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8QbYEZHU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hdebr7viz6shxl7c26jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8QbYEZHU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hdebr7viz6shxl7c26jr.png" alt="Image description" width="792" height="1045"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Kaggle IMDb dataset example:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KzEDf_FS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7mng52u3jt45hubyj6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KzEDf_FS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7mng52u3jt45hubyj6j.png" alt="Image description" width="800" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;em&gt;I created a Client in Angular to send requests to a Python Server&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h1&gt;
  
  
  CLIENT:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8EPpHNiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pio900ekqiafdeq0swlx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8EPpHNiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pio900ekqiafdeq0swlx.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SERVER:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k4s-cov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epsndnypy8cwo2i2zs0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k4s-cov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epsndnypy8cwo2i2zs0v.png" alt="Image description" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  RESPONSE FROM THE SERVER:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BPjGKgva--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2t652130c0ykq10akvhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BPjGKgva--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2t652130c0ykq10akvhs.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ANOTHER EXAMPLE:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ksBDiWPP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7lfdcjizo2f6fftyw1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ksBDiWPP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7lfdcjizo2f6fftyw1e.png" alt="Image description" width="793" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N8ZdJp0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0csg4713mmmvoi4wa6ln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N8ZdJp0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0csg4713mmmvoi4wa6ln.png" alt="Image description" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;&lt;em&gt;OPTION 2:&lt;/em&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Test hyperparameters with GridSearchCV and the TF-IDF vectorizer:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A good way to automate the test phase and save time searching for the parameters that yield the most accurate model is to use GridSearchCV, made available by scikit-learn. The code in ModelsGenerator.py must be customized for the dataset to be analyzed.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
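&lt;p&gt;&lt;em&gt;As a hedged sketch (the parameter grid and toy data here are illustrative assumptions, not the configuration in ModelsGenerator.py), a grid search over a TF-IDF + linear SVM pipeline looks like this:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative GridSearchCV sketch; grid values and data are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams+bigrams
    "svm__C": [0.1, 1.0, 10.0],              # regularization strength
}
search = GridSearchCV(pipeline, param_grid, cv=2)

texts = ["great movie", "loved it", "fine acting",
         "terrible plot", "boring film", "awful pacing"]
labels = [1, 1, 1, 0, 0, 0]
search.fit(texts, labels)
print(search.best_params_)
```

&lt;p&gt;&lt;em&gt;Every combination in the grid is cross-validated, which is why an over-large grid can run practically forever.&lt;/em&gt;&lt;/p&gt;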

&lt;h3&gt;
  &lt;strong&gt;&lt;em&gt;WARNING!&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  &lt;strong&gt;&lt;em&gt;If we do not study the scikit-learn documentation, we could start a search that runs practically forever, so it is always advisable to know what we are doing.&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Link scikit-learn: &lt;a href="https://scikit-learn.org/"&gt;https://scikit-learn.org/&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BdW57Lbe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32fx2hqqbhhh3kjux3ht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BdW57Lbe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32fx2hqqbhhh3kjux3ht.png" alt="Image description" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rtmfpQXC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5bjmxa681tr17fxctt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rtmfpQXC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5bjmxa681tr17fxctt7.png" alt="Image description" width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;&lt;em&gt;OPTION 3:&lt;/em&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gt652dJn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xd4wbnxqf93prap424wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gt652dJn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xd4wbnxqf93prap424wv.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This option is experimental: the reduction is not applied to model training because it yields too few components, and the 8 GB of RAM in my PC is not enough to generate more, even though the results are interesting!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
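&lt;p&gt;&lt;em&gt;A minimal sketch of the idea, assuming the goal is a 2-D scatter plot of the documents (the toy data is illustrative):&lt;/em&gt;&lt;/p&gt;

```python
# Reduce a sparse TF-IDF matrix to 2 components with TruncatedSVD so each
# document becomes an (x, y) point that can be plotted. Toy data only.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull"]
matrix = TfidfVectorizer().fit_transform(texts)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(matrix)
print(coords.shape)  # one (x, y) pair per document
```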

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--syisDYZl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jqnh9lgb9jonha1gze9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--syisDYZl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jqnh9lgb9jonha1gze9q.png" alt="Image description" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  CONCLUSION:
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;We obtained satisfactory results and generated a fairly accurate model. This repository will be updated over time.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;For info or collaborations contact me at: &lt;a href="mailto:u.calice@studenti.poliba.it"&gt;u.calice@studenti.poliba.it&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>DATASET Analysis and processing</title>
      <dc:creator>Umberto Calice</dc:creator>
      <pubDate>Mon, 08 May 2023 10:30:08 +0000</pubDate>
      <link>https://dev.to/insidbyte/dataset-analysis-and-processing-npn</link>
      <guid>https://dev.to/insidbyte/dataset-analysis-and-processing-npn</guid>
      <description>&lt;h1&gt;
  
  
  SEE THIS REPOSITORY AT: &lt;a href="https://github.com/insidbyte/Analysis_and_processing"&gt;https://github.com/insidbyte/Analysis_and_processing&lt;/a&gt;
&lt;/h1&gt;




&lt;p&gt;1)-Install Python with a version &amp;gt;= 3.9.*&lt;/p&gt;

&lt;p&gt;2)-Install virtualenv via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install virtualenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3)-Access the folder where you want to create the virtual environment and type the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;virtualenv --python C:\Path\To\Python\python.exe name_of_new_venv_folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4)-Access the created folder with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd name_of_new_venv_folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5)-Activate the virtual environment with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.\Sripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6)-The following command creates a file called requirements.txt which enumerates the installed packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip freeze &amp;gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;7)-This file can then be used by contributors to update virtual environments using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;8)-To return to normal system settings, use the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Analysis_and_processing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performs a statistical analysis of the dataset and offers several options:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1)-Merges two datasets.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2)-Performs a first cleanup of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3)-Analyzes and optionally eliminates stop words.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;4)-Lemmatizes.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;5)-Corrects the text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All the options present in this tool, with the exception of number 3, use multiprocessing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The script starts as many processes as the machine has cores, so it is advisable to run it from the terminal and close any other activity running on the machine.&lt;br&gt;
The steps must be performed in order, otherwise the output dataset will not be reliable!&lt;/em&gt;&lt;/p&gt;
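&lt;p&gt;&lt;em&gt;The multiprocessing scheme can be sketched as follows (a simplified illustration, not the repository's actual code):&lt;/em&gt;&lt;/p&gt;

```python
# Simplified sketch: split the dataset into one chunk per CPU core and
# clean the chunks in parallel, then concatenate the results.
import multiprocessing as mp

def clean_chunk(rows):
    # Placeholder cleanup: strip whitespace and lowercase each row.
    return [row.strip().lower() for row in rows]

def parallel_clean(rows):
    cores = mp.cpu_count()
    size = max(1, len(rows) // cores)
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with mp.Pool(cores) as pool:
        cleaned = pool.map(clean_chunk, chunks)
    # Concatenate the cleaned chunks back into one dataset.
    return [row for chunk in cleaned for row in chunk]

if __name__ == "__main__":
    print(parallel_clean(["  Great MOVIE  ", "  Bad PLOT  "]))
```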

&lt;h2&gt;
  
  
  First step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;We will remove special characters, websites, e-mail addresses, HTML code, and all the contractions of the English language. First we open the first file and write True on its first line.&lt;/em&gt;&lt;/p&gt;
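&lt;p&gt;&lt;em&gt;The cleanup pass can be sketched like this (the patterns and the tiny contraction table are illustrative assumptions, not the tool's exact rules):&lt;/em&gt;&lt;/p&gt;

```python
# Hedged sketch of the first cleanup pass: drop websites, e-mail
# addresses, and special characters, and expand a few contractions.
import re

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # websites
    text = re.sub(r"\S+@\S+", " ", text)                # e-mail addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # special characters
    return re.sub(r"\s+", " ", text).strip()

print(clean("It's GREAT!!! visit www.example.com or mail me@site.com"))
```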

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y5dxxauF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzvju59kyzu1saovijob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y5dxxauF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzvju59kyzu1saovijob.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next we launch main.py.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X7b81Qnt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnnq6s5fm8hdes59lqck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X7b81Qnt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnnq6s5fm8hdes59lqck.png" alt="Image description" width="180" height="40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YVGHO1FA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diwcp3ywuqv4esmd81wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YVGHO1FA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diwcp3ywuqv4esmd81wn.png" alt="Image description" width="462" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Then we insert the following Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6LRHLd8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6cy3j86c0439hi6nnsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6LRHLd8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6cy3j86c0439hi6nnsm.png" alt="Image description" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e4c3GIoH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/70ipmvzb9y2rtmk22nke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e4c3GIoH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/70ipmvzb9y2rtmk22nke.png" alt="Image description" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Second step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This option corrects the supplied text, dividing it into 8 datasets and concatenating them to return the requested dataset. With 8 cores it took 9 hours for 60 MB of dataset! It is highly expensive in terms of CPU, memory, and execution time; I recommend doing this only if necessary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CFh98c0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0tm1mi7hn3kkotk143eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CFh98c0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0tm1mi7hn3kkotk143eh.png" alt="Image description" width="754" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YhSAta1l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67onvw665f05fnwsxls8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhSAta1l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67onvw665f05fnwsxls8.png" alt="Image description" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Third step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Let's lemmatize, trimming the dataset a bit by replacing each inflected word with its root form (lemma).&lt;/em&gt;&lt;/p&gt;
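&lt;p&gt;&lt;em&gt;Conceptually, lemmatization maps each word to its root form. A toy sketch with a tiny hand-made lemma table (the real tool presumably relies on a full NLP library, so this table is purely an assumption):&lt;/em&gt;&lt;/p&gt;

```python
# Toy illustration of lemmatization: each word is looked up in a small
# lemma table and replaced by its root form; unknown words pass through.
LEMMAS = {"movies": "movie", "actors": "actor", "acted": "act"}

def lemmatize(text):
    return " ".join(LEMMAS.get(word, word) for word in text.split())

print(lemmatize("the actors acted in two movies"))
```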

&lt;p&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZaLN8rwE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chpn2ubxyb127fwkevcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZaLN8rwE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chpn2ubxyb127fwkevcv.png" alt="Image description" width="790" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--meTmE2ox--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7wedwuqg1jr08vuse3q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--meTmE2ox--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7wedwuqg1jr08vuse3q8.png" alt="Image description" width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Fourth step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In this case it is not necessary, but if we have cleaned and lemmatized the positive and negative reviews separately, we need to merge them back into one dataset before proceeding to the analysis phase.&lt;/em&gt;&lt;/p&gt;
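&lt;p&gt;&lt;em&gt;The merge itself is a simple concatenation; a sketch with pandas, where the column names "review" and "sentiment" are hypothetical:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the merge step: concatenate the separately processed positive
# and negative review frames into one dataset. Column names are assumed.
import pandas as pd

positive = pd.DataFrame({"review": ["loved it"], "sentiment": ["positive"]})
negative = pd.DataFrame({"review": ["hated it"], "sentiment": ["negative"]})
merged = pd.concat([positive, negative], ignore_index=True)
print(len(merged))
```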

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--valS6bEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpry5uut4xenos6ptaxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--valS6bEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpry5uut4xenos6ptaxj.png" alt="Image description" width="657" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As we can see, the merged dataset weighs less because the stop words were eliminated separately for the positive and negative subsets.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_cZ74TSu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4666vrugdxpv67w1muc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_cZ74TSu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4666vrugdxpv67w1muc.png" alt="Image description" width="730" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After several tests we noticed that the merged dataset is less effective for model generation.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Fifth step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is the most important step because it greatly lightens the lemmatized, clean dataset. To add new stop words beyond those already present in the repository, just add the words to the text files:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uWMbVOMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8v09ek0mtwu406v236lh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uWMbVOMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8v09ek0mtwu406v236lh.png" alt="Image description" width="648" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--050nf-BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/089ce2z4bhhe8ruitxjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--050nf-BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/089ce2z4bhhe8ruitxjh.png" alt="Image description" width="685" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UPRsbBVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w2gz3w0ofj6sm355w1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UPRsbBVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w2gz3w0ofj6sm355w1p.png" alt="Image description" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7A6H6pSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh6g8ia6lve3xsxl3flo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7A6H6pSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh6g8ia6lve3xsxl3flo.png" alt="Image description" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We can see how many positive and negative reviews the dataset has and perform word-cloud or n-gram analyses. Below are some images that show the effectiveness of the previous phases, along with some invaluable information for building personalized word lists.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Positive and negative review count:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHaFFjBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pj1yquvqttai6u36gw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHaFFjBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pj1yquvqttai6u36gw1.png" alt="Image description" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most Meaningful Words for Word Cloud:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Negative:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IcfZpstw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzkigljao8qdknfwj3tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IcfZpstw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzkigljao8qdknfwj3tw.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Positive:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2hFQSCjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29su4k0hh3guem5lxcgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2hFQSCjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29su4k0hh3guem5lxcgt.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most common words in the dataset:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Positive:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ANd0P4sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r5od3m4oley3kvo7p15o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ANd0P4sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r5od3m4oley3kvo7p15o.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Negative:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---O1tuJtA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rcdvtbeyrzteqoqxjv8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---O1tuJtA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rcdvtbeyrzteqoqxjv8i.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most common words in the dataset with NGRAMS 2:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m9-dUnQ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lbdf7pfz0rqx37rdabs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m9-dUnQ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lbdf7pfz0rqx37rdabs.png" alt="Image description" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  CONCLUSIONS:
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;At this point we can say that we have created a complete tool that allows us to analyze and modify the dataset. In another repository I will show another useful tool for vectorization and hyperparameter tuning with GridSearchCV.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Link:&lt;/strong&gt; &lt;a href="https://github.com/insidbyte/Analysis_And_Generation_Model_ML"&gt;https://github.com/insidbyte/Analysis_And_Generation_Model_ML&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation link for this repository&lt;/strong&gt;: &lt;a href="https://scikit-learn.org"&gt;https://scikit-learn.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>POTMY WEB-APP</title>
      <dc:creator>Umberto Calice</dc:creator>
      <pubDate>Sun, 07 May 2023 21:03:30 +0000</pubDate>
      <link>https://dev.to/insidbyte/potmy-web-app-mj6</link>
      <guid>https://dev.to/insidbyte/potmy-web-app-mj6</guid>
<description>&lt;p&gt;A web application written in JavaScript that uses the Spotify REST API to search for and listen to music, building a playlist from the searched tracks.&lt;/p&gt;

&lt;p&gt;Search modes:&lt;br&gt;
1)-&lt;strong&gt;Random&lt;/strong&gt;:&lt;br&gt;
based on an artist.&lt;br&gt;
2)-&lt;strong&gt;Specific&lt;/strong&gt;:&lt;br&gt;
based on title and artist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.tourl"&gt;https://github.com/insidbyte/Potimy_App&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>api</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
