DEV Community

Umberto Calice

DATASET Analysis and processing

See this repository at: https://github.com/insidbyte/Analysis_and_processing


1)-Install Python 3.9 or later.

2)-Install virtualenv via pip:

pip install virtualenv

3)-Go to the folder where you want to create the virtual environment and run:

virtualenv --python C:\Path\To\Python\python.exe name_of_new_venv_folder

4)-Access the created folder with the command:

cd name_of_new_venv_folder

5)-Activate the virtual environment with the command:

.\Scripts\activate

6)-The following command creates a file called requirements.txt that lists the installed packages:

pip freeze > requirements.txt

7)-This file can then be used by contributors to update virtual environments using the following command:

pip install -r requirements.txt

8)-To return to normal system settings, use the command:

deactivate

Analysis_and_processing

Performs a statistical analysis of the dataset and offers several options:
1)-Merges two datasets.
2)-Performs a first cleanup of the dataset.
3)-Analyzes and optionally eliminates stop words.
4)-Lemmatizes.
5)-Corrects spelling.

All the options present in this tool, with the exception of number 3, use multiprocessing.

The tool starts as many processes as the machine has cores, so it is advisable to run the script from the terminal and to close any other activity running on the machine.
The steps must be performed in order, otherwise the output dataset will not be reliable!

First step:

We will remove special characters, website URLs, email addresses, HTML code, and all the contractions of the English language. First we open the first file and write True on its first line.

(screenshot)

Next we launch main.py.

(screenshot)

(screenshot)

Then we insert the following Input:

(screenshot)

Output:

(screenshot)
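The cleanup performed in this step can be sketched in plain Python with regular expressions. This is a minimal illustration, not the repository's actual code: the patterns, the `clean_text` helper, and the tiny contraction map are all my own assumptions (the repository handles a fuller set of contractions).

```python
import re

# Illustrative contraction map; a real run would use a much fuller list.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(text):
    """Remove URLs, emails, HTML tags, contractions and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # websites
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # emails
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    for contraction, expansion in CONTRACTIONS.items():  # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)                # special characters
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

print(clean_text("It's great! <br> Visit www.example.com or mail me@test.com"))
```

The order matters: contractions must be expanded before the special-character pass, otherwise the apostrophes are stripped first and the contractions can no longer be matched.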


Second step:

This option corrects the spelling of the supplied text, splitting it into 8 chunks and concatenating them to return the requested dataset. With 8 cores it took 9 hours for a 60 MB dataset! It is highly expensive in terms of CPU, memory, and execution time. I recommend doing this only if necessary.

Input:

(screenshot)

Output:

(screenshot)
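The split-correct-concatenate pattern described above can be sketched with `multiprocessing.Pool`. The `correct_chunk` placeholder below only fixes one hard-coded typo; the post does not say which spelling library the repository actually uses, so treat both helper names as illustrative:

```python
from multiprocessing import Pool, cpu_count

def correct_chunk(rows):
    # Placeholder corrector: a real run would call a spelling library here.
    return [row.replace("teh", "the") for row in rows]

def parallel_correct(rows):
    """Split the rows into one chunk per core, correct the chunks in
    parallel, then concatenate them back in their original order."""
    n = min(cpu_count(), len(rows))
    size = -(-len(rows) // n)  # ceiling division: chunk size per process
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with Pool(len(chunks)) as pool:
        corrected = pool.map(correct_chunk, chunks)
    return [row for chunk in corrected for row in chunk]

if __name__ == "__main__":
    print(parallel_correct(["teh movie was good", "great film"]))
```

Splitting into contiguous chunks and flattening the mapped results preserves the row order, which is what lets the corrected chunks be concatenated back into a single dataset.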


Third step:

Let's lemmatize, trimming the dataset a bit by replacing each inflected word with its root form.

Input:

(screenshot)

Output

(screenshot)
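Lemmatization replaces each word with its dictionary root, and the idea can be sketched with a lookup table. The tiny `LEMMAS` table below is purely illustrative; a real run would use a full lemmatizer such as NLTK's WordNetLemmatizer or spaCy (an assumption on my part, the post does not name the library):

```python
# Tiny illustrative lemma table; a real lemmatizer covers the whole language.
LEMMAS = {"movies": "movie", "watched": "watch", "running": "run"}

def lemmatize(text):
    """Replace each word with its lemma (root form) when one is known,
    leaving unknown words untouched."""
    return " ".join(LEMMAS.get(word, word) for word in text.split())

print(lemmatize("watched two movies while running"))
```

Because different inflections collapse onto one root, lemmatization shrinks the vocabulary, which is what "trims the dataset a bit" in practice.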


Fourth step:

In this case it is not necessary, but if we have cleaned and lemmatized the positive and the negative reviews separately, we need to merge the datasets before proceeding to the analysis phase.

(screenshot)

As we can see, the merged dataset weighs less because the stop words were eliminated separately for the positive and the negative reviews.

(screenshot)

After several tests we noticed that the merged dataset is less effective for model generation.
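The merge itself can be sketched with pandas. This is only an illustration of the idea: the `review` and `sentiment` column names, and the shuffle after concatenation, are my assumptions rather than the repository's actual schema:

```python
import pandas as pd

def merge_datasets(pos, neg):
    """Concatenate positive and negative reviews into one dataset and
    shuffle the rows so the classes are interleaved."""
    merged = pd.concat([pos, neg], ignore_index=True)
    return merged.sample(frac=1, random_state=0).reset_index(drop=True)

pos = pd.DataFrame({"review": ["great film", "loved it"], "sentiment": "positive"})
neg = pd.DataFrame({"review": ["terrible plot"], "sentiment": "negative"})
print(merge_datasets(pos, neg))
```

`ignore_index=True` rebuilds a clean 0..n-1 index instead of keeping the two files' overlapping indices.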


Fifth step:

This is the most important step because it greatly lightens the lemmatized and cleaned dataset. To add new stop words beyond those already present in the repository, just add the words to the text files:

(screenshot)

Input:

(screenshot)

Output:

(screenshot)

(screenshot)
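The stop-word removal with extendable wordlists can be sketched as follows. The one-word-per-line file layout and both helper names are assumptions based on the description above, not the repository's actual code:

```python
def load_stopwords(paths):
    """Build one stop-word set from text files listing one word per line;
    blank lines are skipped and case is normalized."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    return words

def remove_stopwords(text, stopwords):
    """Drop every stop word from the text, keeping the remaining order."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

# In practice the set would come from load_stopwords(["stopwords.txt", ...]).
print(remove_stopwords("the movie was a masterpiece", {"the", "was", "a"}))
```

Because every removed word disappears from every review, this is the step that shrinks the dataset the most, which matches the file-size difference shown above.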

We can see how many positive and negative reviews the dataset has and perform word-cloud or n-gram analysis. Below are some images that show the effectiveness of the previous phases, along with information that is invaluable for building personalized wordlists.

Positive and negative review count:

(screenshot)

Most Meaningful Words for Word Cloud:

Negative:

(screenshot)

Positive:

(screenshot)

Most common words in the dataset:

Positive:

(screenshot)

Negative:

(screenshot)

Most common 2-grams (bigrams) in the dataset:

(screenshot)
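The most-common-words and n-gram counts shown in these charts can be sketched with a `Counter`. The `top_ngrams` helper and the sample reviews are illustrative, not the repository's actual code:

```python
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    """Return the k most frequent word n-grams across the given reviews
    (n=1 gives plain word counts, n=2 gives the bigrams shown above)."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

reviews = ["great movie great cast", "great movie indeed"]
print(top_ngrams(reviews, n=2, k=2))
```

These frequency lists are exactly what makes building personalized stop-word lists easy: any high-frequency term that carries no sentiment is a candidate for the wordlist files.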


CONCLUSIONS:

At this point we can say that we have created a complete tool that allows us to analyze and modify the dataset. In another repository I will show another useful tool for vectorization and hyperparameter search with GridSearchCV.

Link: https://github.com/insidbyte/Analysis_And_Generation_Model_ML

Documentation used by this repository: https://scikit-learn.org
