One of the hardest parts of supervised classification is obtaining a labeled dataset with which to train a model and then validate it. In text mining, few corpora are available for this purpose. The objective here is therefore to build a labeled dataset and use it to train a classification model that recognizes the positive and negative reviews of a set of films.
Data source: https://www.cinemablend.com/reviews/
Project repository: https://github.com/esdanielgomez/MovieClassificationRMP/
Part 1: Collection of unstructured data.
This process begins by extracting the links to the film comments (1,470 in total) from 30 pages of the site.
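The link extraction itself is done in the repository process referenced below. As a rough Python illustration of the same idea, the sketch below collects review links from one listing page. The sample markup, the `/reviews/` path prefix, and the `ReviewLinkCollector` name are assumptions made for this example, not the site's actual structure.

```python
from html.parser import HTMLParser


class ReviewLinkCollector(HTMLParser):
    """Collect unique hrefs from <a> tags that start with a given prefix."""

    def __init__(self, prefix="/reviews/"):
        super().__init__()
        self.prefix = prefix  # assumed path prefix for review pages
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


# Hypothetical listing-page markup, used only to exercise the parser.
SAMPLE_PAGE = """
<ul>
  <li><a href="/reviews/example-film-one">Example Film One</a></li>
  <li><a href="/reviews/example-film-two">Example Film Two</a></li>
  <li><a href="/news/unrelated-story">Unrelated</a></li>
</ul>
"""

collector = ReviewLinkCollector()
collector.feed(SAMPLE_PAGE)
links = collector.links
print(links)
```

In a real run, the same collector would be fed the HTML of each of the 30 listing pages in turn.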
REPOSITORY: Final_A
In this case, both the comment link and its star rating (on a scale from 1.0 to 5.0) are selected:
REPOSITORY: Final_B
As a next step, the comments are divided into two groups, Positive and Negative:
Negative comments are those with a rating less than or equal to 3.0:
Positive comments are those with a rating greater than 3.0:
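The split is a simple threshold on the star rating. A minimal Python sketch of it follows, assuming ratings at or below 3.0 count as negative and ratings above 3.0 as positive. The `(link, rating)` pairs and the `split_by_rating` helper are hypothetical stand-ins for the scraped data.

```python
# Hypothetical (link, rating) pairs; the real ones come from the scraping step.
reviews = [
    ("review-a", 4.5),
    ("review-b", 3.0),
    ("review-c", 1.5),
    ("review-d", 3.5),
]

THRESHOLD = 3.0  # boundary between negative and positive comments


def split_by_rating(items, threshold=THRESHOLD):
    """Return (negative, positive); ratings <= threshold are negative."""
    negative = [item for item in items if item[1] <= threshold]
    positive = [item for item in items if item[1] > threshold]
    return negative, positive


negative, positive = split_by_rating(reviews)
print("negative:", negative)
print("positive:", positive)
```

Note that a rating of exactly 3.0 lands in the negative group under this rule.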
Subsequently, the information for each film is retrieved; the most relevant attribute for this study is Comment:
The process for obtaining comments is as follows:
Finally, these comments are stored in two Excel files (datasetComentariosNegativos.xlsx and datasetComentariosPositivos.xlsx).
REPOSITORY: Final_C
Next, 650 movie comments are selected from each dataset (Positive and Negative), and two new datasets are prepared: one for training and one for later testing.
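The selection and split can be sketched in Python as follows. The post does not state how the 650 comments per class are divided between training and test data, so the 80/20 ratio, the fixed seed, and the `sample_and_split` helper are assumptions for illustration only.

```python
import random


def sample_and_split(comments, n_select=650, train_fraction=0.8, seed=42):
    """Pick n_select comments at random, then split them into train and test.

    train_fraction and seed are assumed values, not taken from the post.
    """
    if len(comments) < n_select:
        raise ValueError("not enough comments to select from")
    rng = random.Random(seed)
    selected = rng.sample(comments, n_select)
    cut = round(n_select * train_fraction)
    return selected[:cut], selected[cut:]


# Placeholder comments standing in for the two Excel datasets.
positive = [f"positive comment {i}" for i in range(700)]
negative = [f"negative comment {i}" for i in range(700)]

pos_train, pos_test = sample_and_split(positive)
neg_train, neg_test = sample_and_split(negative)

# Attach the label to every comment and merge both classes.
train = [(text, "positive") for text in pos_train] + [(text, "negative") for text in neg_train]
test = [(text, "positive") for text in pos_test] + [(text, "negative") for text in neg_test]
print(len(train), len(test))
```

Sampling the same number of comments from each class keeps the training and test sets balanced.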
Part 2: Preparation of the data.
So far, the data is organized as follows:
REPOSITORY: Final_D
Next, pre-processing steps are run to remove HTML tags, English stop words, and words listed in a custom dictionary, among other operations:
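To make these steps concrete, here is a minimal Python sketch of the same kind of clean-up: strip HTML tags, lowercase, tokenize, and drop stop words. Both word lists are tiny illustrative assumptions, and the `preprocess` helper is hypothetical; the actual process uses the operators in the repository.

```python
import html
import re

# Tiny illustrative lists; a real stop-word list is far longer.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "and", "to", "in", "this", "was"}
CUSTOM_STOP_WORDS = {"movie", "film"}  # assumed custom dictionary entries

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(raw, min_length=2):
    """Strip HTML tags, lowercase, tokenize, and remove stop words."""
    text = TAG_RE.sub(" ", raw)   # drop markup such as <p> or <b>
    text = html.unescape(text)    # turn entities such as &amp; into characters
    tokens = TOKEN_RE.findall(text.lower())
    blocked = STOP_WORDS | CUSTOM_STOP_WORDS
    return [t for t in tokens if len(t) >= min_length and t not in blocked]


tokens = preprocess("<p>This movie was a <b>brilliant</b> surprise &amp; a joy.</p>")
print(tokens)  # ['brilliant', 'surprise', 'joy']
```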
Process Documents from Data:
Part 3: Identify key characteristics.
REPOSITORY: Final_D (the same repository as Part 2).
The previous process (Part 2) returns a structured table with one row for each entry extracted from the website and one column for each token found in the documents, holding that token's occurrence weight. However, running a learning process with so many variables is very expensive. The objective of this part is therefore to filter those columns (tokens) using a feature selection method.
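The shape of that table can be illustrated with a short Python sketch. It assumes raw occurrence counts as the weight, which may differ from the weighting used in the actual process, and `build_term_table` is a hypothetical helper.

```python
from collections import Counter


def build_term_table(token_lists):
    """Return (vocabulary, rows): one row per document, one column per token."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


# Two already-tokenized toy documents.
docs = [["great", "plot", "great", "cast"], ["dull", "plot"]]
vocabulary, rows = build_term_table(docs)
print(vocabulary)  # ['cast', 'dull', 'great', 'plot']
print(rows)        # [[1, 0, 2, 1], [0, 1, 0, 1]]
```

With 1,300 documents the vocabulary quickly reaches thousands of columns, which is why the table is filtered before training.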
For this, the Weight by Information Gain operator is used to rank the features by their relevance to the label attribute, based on information gain.
The Select by Weights operator is then used to keep only those attributes whose weights meet a specified criterion. In this case the top k criterion is used with k = 90, so the 90 best features of the dataset are selected.
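The two steps, weighting by information gain and keeping the top k, can be sketched in Python. This is a simplified version that treats each token as a present/absent feature; it is not a reproduction of the operators' exact behavior, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(present, labels):
    """Entropy reduction from splitting documents on token presence."""
    with_token = [label for p, label in zip(present, labels) if p]
    without_token = [label for p, label in zip(present, labels) if not p]
    total = len(labels)
    remainder = (len(with_token) / total) * entropy(with_token) \
        + (len(without_token) / total) * entropy(without_token)
    return entropy(labels) - remainder


def select_top_k(token_lists, labels, k):
    """Weight every token by information gain and keep the k best."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    weights = {}
    for token in vocabulary:
        present = [token in tokens for tokens in token_lists]
        weights[token] = information_gain(present, labels)
    # Highest weight first; ties are broken alphabetically.
    ranked = sorted(vocabulary, key=lambda t: (-weights[t], t))
    return ranked[:k], weights


docs = [["great", "plot"], ["great", "cast"], ["dull", "plot"], ["dull", "cast"]]
labels = ["positive", "positive", "negative", "negative"]
top_tokens, weights = select_top_k(docs, labels, k=2)
print(top_tokens)
```

In this toy data "great" and "dull" separate the classes perfectly, while "plot" and "cast" appear equally in both and carry no information. The post's setting corresponds to `k=90`.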
Parts 4 and 5: Build the training model and apply it.
Two algorithms are used to build the model, Naive Bayes and SVM, so that you can identify which one provides the better accuracy and performance.
Within the same process, the two trained models are also applied to the test data and their accuracy is verified (the test dataset was pre-processed in the previous parts).
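As a rough illustration of the train-then-apply flow, the Python sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing and measures its accuracy on held-out data. The multinomial variant, the toy documents, and the class and function names are assumptions for this example; the actual models are built with the operators in the repository, and the SVM case is not shown.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.token_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus the log likelihood of every known token.
            score = math.log(self.class_counts[label] / total_docs)
            denominator = sum(self.token_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                if token in self.vocabulary:  # unseen tokens are ignored
                    score += math.log((self.token_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, token_lists, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(token_lists, labels))
    return correct / len(labels)


# Toy training and test data standing in for the pre-processed comments.
train_docs = [["great", "fun"], ["brilliant", "great"], ["dull", "boring"], ["boring", "mess"]]
train_labels = ["positive", "positive", "negative", "negative"]
test_docs = [["great", "brilliant"], ["dull", "mess"]]
test_labels = ["positive", "negative"]

model = NaiveBayes().fit(train_docs, train_labels)
test_accuracy = accuracy(model, test_docs, test_labels)
print(test_accuracy)
```

Working in log space avoids numeric underflow when a comment contains many tokens.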
Model and validation NB (Naive Bayes)
Model and validation SVM (Support Vector Machine)