DEV Community

Umberto Calice

DATASET Analysis and processing

See this repository at: https://github.com/insidbyte/Analysis_and_processing


1)-Install Python 3.9 or later.

2)-Install virtualenv via pip:

pip install virtualenv

3)-Go to the folder where you want to create the virtual environment and run:

virtualenv --python C:\Path\To\Python\python.exe name_of_new_venv_folder

4)-Access the created folder with the command:

cd name_of_new_venv_folder

5)-Activate the virtual environment with the command:

.\Scripts\activate

6)-The following command creates a file called requirements.txt that lists the installed packages:

pip freeze > requirements.txt

7)-This file can then be used by contributors to update virtual environments using the following command:

pip install -r requirements.txt

8)-To return to normal system settings, use the command:

deactivate

Analysis_and_processing

Performs a statistical analysis of the dataset and offers several options:
1)-Merges two datasets.
2)-Performs a first cleanup of the dataset.
3)-Analyzes and optionally eliminates stop words.
4)-Lemmatizes.
5)-Corrects spelling.

All the options present in this tool, with the exception of number 3, use multiprocessing.

The tool starts as many processes as the machine has cores, so it is advisable to run the script from the terminal and to close any other activity running on the machine.
The steps must be performed in order, otherwise the output dataset will not be reliable!

First step:

We will remove special characters, website URLs, email addresses, HTML code, and all the contractions of the English language. First we open the first file and write True on its first line.

(screenshot)

Next we launch main.py.

(screenshot)

(screenshot)

Then we insert the following Input:

(screenshot)

Output:

(screenshot)
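The cleanup performed in this step can be sketched in plain Python with regular expressions. This is a minimal illustration, not the repository's actual code: the patterns, the `clean_text` helper, and the tiny contraction map are all my own assumptions (the repository handles a fuller set of contractions).

```python
import re

# Illustrative contraction map; a real run would use a much fuller list.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(text):
    """Remove URLs, emails, HTML tags, contractions and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # websites
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # emails
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    for contraction, expansion in CONTRACTIONS.items():  # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)                # special characters
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

print(clean_text("It's great! <br> Visit www.example.com or mail me@test.com"))
```

The order matters: contractions must be expanded before the special-character pass, otherwise the apostrophes are stripped first and the contractions can no longer be matched.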


Second step:

This option corrects the spelling of the supplied text, splitting it into 8 chunks and concatenating them to return the requested dataset. With 8 cores it took 9 hours for a 60 MB dataset! It is highly expensive in terms of CPU, memory, and execution time. I recommend doing this only if necessary.

Input:

(screenshot)

Output:

(screenshot)
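The split-correct-concatenate pattern described above can be sketched with `multiprocessing.Pool`. The `correct_chunk` placeholder below only fixes one hard-coded typo; the post does not say which spelling library the repository actually uses, so treat both helper names as illustrative:

```python
from multiprocessing import Pool, cpu_count

def correct_chunk(rows):
    # Placeholder corrector: a real run would call a spelling library here.
    return [row.replace("teh", "the") for row in rows]

def parallel_correct(rows):
    """Split the rows into one chunk per core, correct the chunks in
    parallel, then concatenate them back in their original order."""
    n = min(cpu_count(), len(rows))
    size = -(-len(rows) // n)  # ceiling division: chunk size per process
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with Pool(len(chunks)) as pool:
        corrected = pool.map(correct_chunk, chunks)
    return [row for chunk in corrected for row in chunk]

if __name__ == "__main__":
    print(parallel_correct(["teh movie was good", "great film"]))
```

Splitting into contiguous chunks and flattening the mapped results preserves the row order, which is what lets the corrected chunks be concatenated back into a single dataset.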


Third step:

Let's lemmatize, trimming the dataset a bit by replacing each inflected word with its root form.

Input:

(screenshot)

Output

(screenshot)
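Lemmatization replaces each word with its dictionary root, and the idea can be sketched with a lookup table. The tiny `LEMMAS` table below is purely illustrative; a real run would use a full lemmatizer such as NLTK's WordNetLemmatizer or spaCy (an assumption on my part, the post does not name the library):

```python
# Tiny illustrative lemma table; a real lemmatizer covers the whole language.
LEMMAS = {"movies": "movie", "watched": "watch", "running": "run"}

def lemmatize(text):
    """Replace each word with its lemma (root form) when one is known,
    leaving unknown words untouched."""
    return " ".join(LEMMAS.get(word, word) for word in text.split())

print(lemmatize("watched two movies while running"))
```

Because different inflections collapse onto one root, lemmatization shrinks the vocabulary, which is what "trims the dataset a bit" in practice.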


Fourth step:

In this case it is not necessary, but if we have cleaned and lemmatized the positive and the negative reviews separately, we need to merge the datasets before proceeding to the analysis phase.

(screenshot)

As we can see, the merged dataset weighs less because the stop words were eliminated separately for the positive and the negative reviews.

(screenshot)

After several tests we noticed that the merged dataset is less effective for model generation.
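The merge itself can be sketched with pandas. This is only an illustration of the idea: the `review` and `sentiment` column names, and the shuffle after concatenation, are my assumptions rather than the repository's actual schema:

```python
import pandas as pd

def merge_datasets(pos, neg):
    """Concatenate positive and negative reviews into one dataset and
    shuffle the rows so the classes are interleaved."""
    merged = pd.concat([pos, neg], ignore_index=True)
    return merged.sample(frac=1, random_state=0).reset_index(drop=True)

pos = pd.DataFrame({"review": ["great film", "loved it"], "sentiment": "positive"})
neg = pd.DataFrame({"review": ["terrible plot"], "sentiment": "negative"})
print(merge_datasets(pos, neg))
```

`ignore_index=True` rebuilds a clean 0..n-1 index instead of keeping the two files' overlapping indices.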


Fifth step:

This is the most important step because it greatly lightens the lemmatized and cleaned dataset. To add new stop words beyond those already present in the repository, just add the words to the text files:

(screenshot)

Input:

(screenshot)

Output:

(screenshot)

(screenshot)
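The stop-word removal with extendable wordlists can be sketched as follows. The one-word-per-line file layout and both helper names are assumptions based on the description above, not the repository's actual code:

```python
def load_stopwords(paths):
    """Build one stop-word set from text files listing one word per line;
    blank lines are skipped and case is normalized."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    return words

def remove_stopwords(text, stopwords):
    """Drop every stop word from the text, keeping the remaining order."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

# In practice the set would come from load_stopwords(["stopwords.txt", ...]).
print(remove_stopwords("the movie was a masterpiece", {"the", "was", "a"}))
```

Because every removed word disappears from every review, this is the step that shrinks the dataset the most, which matches the file-size difference shown above.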

We can see how many positive and negative reviews the dataset has and perform word-cloud or n-gram analysis. Below are some images that show the effectiveness of the previous phases, along with information that is invaluable for building personalized wordlists.

Positive and negative review count:

(screenshot)

Most Meaningful Words for Word Cloud:

Negative:

(screenshot)

Positive:

(screenshot)

Most common words in the dataset:

Positive:

(screenshot)

Negative:

(screenshot)

Most common 2-grams (bigrams) in the dataset:

(screenshot)
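The most-common-words and n-gram counts shown in these charts can be sketched with a `Counter`. The `top_ngrams` helper and the sample reviews are illustrative, not the repository's actual code:

```python
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    """Return the k most frequent word n-grams across the given reviews
    (n=1 gives plain word counts, n=2 gives the bigrams shown above)."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

reviews = ["great movie great cast", "great movie indeed"]
print(top_ngrams(reviews, n=2, k=2))
```

These frequency lists are exactly what makes building personalized stop-word lists easy: any high-frequency term that carries no sentiment is a candidate for the wordlist files.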


CONCLUSIONS:

At this point we can say that we have created a complete tool that allows us to analyze and modify the dataset. In another repository I will show another useful tool for vectorization and hyperparameter search with GridSearchCV.

Link: https://github.com/insidbyte/Analysis_And_Generation_Model_ML

Documentation used by this repository: https://scikit-learn.org
