DEV Community: Bruno Luvizotto

Brazilian News Sentiment Analysis

Bruno Luvizotto — Sat, 02 May 2020 18:55:13 +0000

Disclaimer: this article describes a project that uses the sentiment analysis feature of Google's Cloud Natural Language API; it doesn't train any machine learning model.

Introduction

As a side project, I decided to run sentiment analysis on the headlines of some of the most important Brazilian news agencies. On the one hand, I wanted to test Google's API; on the other, I wanted to check whether the sentiment of the headlines differs significantly from one news agency to another.

Architecture

The architecture decisions for this project were based on two criteria:

Lowest price
Least work

Database

For the database I decided to use Google's Firestore (a non-relational database), for no special reason other than "I'm already using GCP (Google Cloud Platform) for the sentiment analysis".

The database has three collections: websites, keywords and sentiments.

The documents in the collections have the following fields:

websites:
- name: the website's name
- regex: regex used for scraping the website's headlines
- url: the website's url
keywords (the terms we want to look for):
- value: the string that we are looking for on the news agencies' websites
sentiments:
- headline: the original headline analyzed
- headlineEnglish: headline translated to English (we'll talk about that later)
- isOnline: boolean that indicates if the headline is still being displayed on the website
- keywords: array with the keywords found in the headline
- onlineStartDate: timestamp of the first time the headline was seen on the website
- onlineEndDate: timestamp of the last time the headline was seen on the website
- onlineTotalTimeMS: the difference between the end and start dates (in milliseconds)
- sentimentScore: score of the sentiment analyzed (-1 to -0.25 means a negative sentiment, -0.25 to 0.25 a neutral sentiment and 0.25 to 1 a positive sentiment)
- sentimentMagnitude: the magnitude of the sentiment analyzed
- website: the website's name (from where the headline has been scraped)
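
To make the score ranges and the onlineTotalTimeMS field concrete, here is a minimal JavaScript sketch. The function names (classifySentiment, onlineTotalTimeMs) and the sample document are hypothetical and are not taken from the project's repository. The ranges above meet at -0.25 and 0.25, so treating those exact values as neutral is also an assumption. Plain Date objects stand in for Firestore timestamps.

```javascript
// Hypothetical helpers; names and boundary handling are assumptions.

// Map a sentiment score in [-1, 1] to a label using the ranges listed above.
// Assumption: exactly -0.25 and 0.25 count as neutral.
function classifySentiment(score) {
  if (score < -0.25) return 'negative';
  if (score > 0.25) return 'positive';
  return 'neutral';
}

// Difference between the last and the first time a headline was seen, in ms.
function onlineTotalTimeMs(startDate, endDate) {
  return endDate.getTime() - startDate.getTime();
}

// Made-up document shaped like an entry of the "sentiments" collection.
const exampleSentiment = {
  headline: 'Manchete de exemplo',
  headlineEnglish: 'Example headline',
  isOnline: false,
  keywords: ['Bolsonaro'],
  onlineStartDate: new Date('2020-05-01T10:00:00Z'),
  onlineEndDate: new Date('2020-05-01T10:30:00Z'),
  sentimentScore: -0.4,
  sentimentMagnitude: 0.4,
  website: 'UOL',
};
exampleSentiment.onlineTotalTimeMS = onlineTotalTimeMs(
  exampleSentiment.onlineStartDate,
  exampleSentiment.onlineEndDate
);

console.log(classifySentiment(exampleSentiment.sentimentScore)); // negative
console.log(exampleSentiment.onlineTotalTimeMS); // 1800000 (30 minutes)
```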

Node.js Job

The actual work is done by a Node.js script (https://github.com/Brudhu/politicians_analysis). The script does the following:

Get all the info it needs (like websites info, keywords etc.) from Firestore
Scrape the websites to get the headlines (using puppeteer and the regex stored on Firestore)
Pick headlines that have at least one of the keywords
Check which of the scraped headlines have not been analyzed yet
Translate headlines to English (using an API from Azure); this is the translation mentioned earlier: in a quick test of the sentiment analysis API, I realized it works a lot better with English sentences than with Portuguese ones
Analyze the sentiment of the headline translated to English (GCP Language API)
Insert new sentiments in the "sentiments" collection
Update sentiments that are not online anymore
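
The steps that don't depend on an external service (extracting headlines with the stored regex, picking headlines by keyword, and working out which headlines are new or have gone offline) can be sketched as plain functions. This is only an illustration under stated assumptions, not the code from the repository: the function names are hypothetical, the HTML is assumed to be already downloaded (the real job uses puppeteer for that), and the stored regex is assumed to have one capture group holding the headline text.

```javascript
// Hypothetical helpers; not the repository's code.

// Extract headlines from already-downloaded HTML.
// Assumption: the stored pattern has one capture group with the headline text.
function extractHeadlines(html, pattern) {
  const regex = new RegExp(pattern, 'g');
  return Array.from(html.matchAll(regex), (match) => match[1].trim());
}

// Keep headlines containing at least one keyword, recording which ones matched.
// Note: this is a case-insensitive substring match, so "Moro" would also
// match inside a longer word such as "namoro".
function pickByKeywords(headlines, keywords) {
  return headlines
    .map((headline) => ({
      headline,
      keywords: keywords.filter((keyword) =>
        headline.toLowerCase().includes(keyword.toLowerCase())
      ),
    }))
    .filter((item) => item.keywords.length > 0);
}

// Compare the current scrape with the stored documents:
// - toAnalyze: scraped headlines that were never stored
// - wentOffline: stored headlines flagged online that are no longer scraped
function diffHeadlines(scrapedHeadlines, stored) {
  const scrapedSet = new Set(scrapedHeadlines);
  const knownSet = new Set(stored.map((doc) => doc.headline));
  return {
    toAnalyze: scrapedHeadlines.filter((h) => !knownSet.has(h)),
    wentOffline: stored
      .filter((doc) => doc.isOnline && !scrapedSet.has(doc.headline))
      .map((doc) => doc.headline),
  };
}

// Made-up page and stored data, for illustration only.
const html = '<h2>Lula speaks</h2><h2>Weather today</h2><h2> Moro testifies </h2>';
const picked = pickByKeywords(extractHeadlines(html, '<h2>(.*?)</h2>'), ['Lula', 'Moro']);
const stored = [
  { headline: 'Lula speaks', isOnline: true },
  { headline: 'Old headline', isOnline: true },
];
console.log(diffHeadlines(picked.map((item) => item.headline), stored));
// { toAnalyze: [ 'Moro testifies' ], wentOffline: [ 'Old headline' ] }
```

Only the headlines in toAnalyze would need to be sent to the translation and sentiment APIs, which keeps the number of paid calls per run small.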

I decided to run this job every 30 minutes (not more often, because I don't want to spend too much on cloud resources).

I had two options to host the job: GCP (again) and Heroku. I know there are thousands of options, but these are the ones I have the most experience with. I decided to go with Heroku and the Heroku Scheduler add-on (the scheduler is responsible for running the script periodically). It's free for now.

Pricing

While the job on Heroku is free, the project on GCP is costing me 0.01 BRL per day.

First Results

To get the data from Firestore and analyze it, I wrote a Python script (will release it later).

For the first tests I set up two news agencies: UOL and G1.

The keywords are:

Bolsonaro (Brazilian president)
Moro (Former Brazilian minister of justice, who left the ministry in April)
Lula (Former Brazilian president)
Dória (Governor of São Paulo state in Brazil)

In less than 14 days I got 571 headlines analyzed: 366 from UOL (the first one I started collecting data from) and 205 from G1.

The only keyword that has enough data for some analysis is "Bolsonaro", which makes sense since he is the current president.

Top Positive and Negative Sentiment Headlines

Most positive sentiment headline on UOL (Portuguese and the translated version in English):

Opinião: Com a PF, Bolsonaro cumpre a profecia de Jucá
Opinion: With PF, Bolsonaro fulfills the prophecy of Jucá

Most positive sentiment headline on G1:

Bolsonaro amplia lista de atividades consideradas essenciais na pandemia
Bolsonaro expands list of activities considered essential in the pandemic

Most negative sentiment headline on UOL:

Bolsonaro culpa governadores: 'Essa conta não é minha'
Bolsonaro blames governors: 'This account is not mine'

In this case we can see an error in the translation. I would say a better translation would be "Bolsonaro blames governors: 'This bill is not mine'".

Most negative sentiment headline on G1:

Procuradora diz que Bolsonaro violou a Constituição ao determinar revogação de portarias sobre armas
Prosecutor says Bolsonaro violated the Constitution by determining repeal of ordinances on weapons

Word Clouds

The word clouds display only words with 3 or more occurrences. The only keyword analyzed so far is "Bolsonaro".
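
The "3 or more occurrences" filter amounts to a word count over the headlines. The sketch below shows one way to do it in JavaScript. It is only an illustration: the analysis itself was done with a Python script that is not shown here, and the function name frequentWords, the tokenization rule, and the sample headlines are assumptions.

```javascript
// Hypothetical sketch: count words across headlines and keep the frequent ones.
function frequentWords(headlines, minCount = 3) {
  const counts = new Map();
  for (const headline of headlines) {
    // Normalize accents to composed form, then split on anything that is not
    // a letter (Unicode-aware, so a word like "não" stays in one piece).
    const words = headline
      .normalize('NFC')
      .toLowerCase()
      .split(/[^\p{L}]+/u)
      .filter(Boolean);
    for (const word of words) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  // Keep only the words that reach the minimum number of occurrences.
  return new Map([...counts].filter(([, count]) => count >= minCount));
}

// Made-up headlines, for illustration only.
const sample = ['Bolsonaro diz não', 'Bolsonaro não vai', 'Não, diz Bolsonaro'];
console.log([...frequentWords(sample, 3)]);
// [ [ 'bolsonaro', 3 ], [ 'não', 3 ] ]
```

A real word cloud would usually also drop stop words (articles, prepositions and so on), which this sketch does not do.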

The word cloud of every single headline analyzed is the following (it's in Portuguese, don't kill me):

Word cloud of positive sentiments:

Word cloud of negative sentiments:

Word cloud of neutral sentiments:

Word cloud of positive sentiments on UOL:

Word cloud of negative sentiments on UOL:

Word cloud of neutral sentiments on UOL:

Word cloud of positive sentiments on G1:

Word cloud of negative sentiments on G1:

Word cloud of neutral sentiments on G1:

Plots

Now that we have an idea of what the word clouds look like under different conditions, let's take a look at some plots. The first one is a box plot of the sentiments grouped by website:

They look very similar: both are largely concentrated around the neutral area and both medians are pretty close (around 0, shifted a little towards negative sentiments), but they are not exactly the same. The minimum and maximum whiskers of UOL's box plot are longer than the ones from G1. Let's take a closer look.
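
For reference, the numbers a box plot summarizes (minimum, quartiles, median, maximum) can be computed as in the sketch below. The function names and sample scores are hypothetical, and the quartiles use linear interpolation between neighbouring values, which is one of several common conventions; plotting libraries may also draw whiskers at 1.5 times the interquartile range instead of the minimum and maximum.

```javascript
// Hypothetical sketch of a five-number summary.

// Quantile of an already-sorted array, using linear interpolation.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  const weight = pos - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

function fiveNumberSummary(values) {
  if (values.length === 0) {
    throw new Error('fiveNumberSummary needs at least one value');
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b);
  return {
    min: sorted[0],
    q1: quantile(sorted, 0.25),
    median: quantile(sorted, 0.5),
    q3: quantile(sorted, 0.75),
    max: sorted[sorted.length - 1],
  };
}

// Made-up scores, for illustration only.
console.log(fiveNumberSummary([0.1, -0.8, 0.6, 0, -0.2]));
// { min: -0.8, q1: -0.2, median: 0, q3: 0.1, max: 0.6 }
```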

Percentages

Total:
- Negative: 26.8%
- Neutral: 57.4%
- Positive: 15.8%
UOL:
- Negative: 25.3%
- Neutral: 58.6%
- Positive: 16.1%
G1:
- Negative: 29.9%
- Neutral: 55.2%
- Positive: 14.9%

While they are still similar, we can see that G1 has more negative sentiment headlines than UOL, while UOL has more neutral and positive sentiment headlines.
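
Percentages like these come from counting how many scores fall into each range of the sentimentScore field. A minimal sketch, with a hypothetical function name, made-up scores, and the assumption that scores of exactly -0.25 or 0.25 count as neutral:

```javascript
// Hypothetical sketch: share of negative, neutral and positive scores,
// as percentages rounded to one decimal place.
function sentimentShares(scores) {
  const counts = { negative: 0, neutral: 0, positive: 0 };
  for (const score of scores) {
    if (score < -0.25) counts.negative += 1;
    else if (score > 0.25) counts.positive += 1;
    else counts.neutral += 1; // assumption: boundary values are neutral
  }
  const total = scores.length;
  const toPercent = (n) =>
    total === 0 ? 0 : Math.round((n / total) * 1000) / 10;
  return {
    negative: toPercent(counts.negative),
    neutral: toPercent(counts.neutral),
    positive: toPercent(counts.positive),
  };
}

// Made-up scores, for illustration only.
console.log(sentimentShares([-0.6, -0.1, 0, 0.2, 0.7]));
// { negative: 20, neutral: 60, positive: 20 }
```

Running the same function once per website (after grouping the documents by their website field) gives a per-agency breakdown.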

Histograms

The histogram with all the sentiments for the "Bolsonaro" keyword is the following:

In the histogram we can confirm what we saw before: we have more negative than positive sentiments, but neutral sentiments are way more common.

Now let's break the sentiments by website:

And the two previous histograms combined in the same plot:

It looks like while G1 has proportionally more negative sentiments than UOL (like we saw on the percentages before), UOL tends to be a little more "extremist", with more very negative and very positive sentiment headlines.

Now let's break the histograms even more: by positive and negative sentiments for each website.

UOL has more headlines with sentiments >= 0.7 (very positive sentiments).

Even though we know that G1 has more headlines with negative sentiments, these histograms show that UOL has more headlines with sentiments <= -0.6 (very negative sentiments).

Conclusion

While it was a lot of fun to work on this project and I learned new things, I have to point out some of its flaws:

The translation from Portuguese to English (Azure) is very good, but not perfect for some cases
Headlines related to Brazilian politics sometimes have a specific context that would be useful for the translation and Azure doesn't get it
Some of the headlines were written by columnists and may be too informal to make sense after being translated. For example, "Batata assou no fogo do parquinho dos Bolsonaro" was translated to "Potato baked in the fire of bolsonaro playground"; the sentence contains a Brazilian expression and means, in a very simplistic translation, something like "The Bolsonaros are in a bad situation".
Getting way more negative than positive sentiments does not necessarily reflect a biased position of the news agencies. Many headlines are about problems related to Covid-19 and may be inherently negative (some are not).

Both agencies have similar results - not exactly the same, but very similar.

Next steps

Recently I added a new news agency (R7) and will try to update the data and analysis once I have more relevant data - maybe with new news agencies and new keywords.

Creating my first Node.js app

Bruno Luvizotto — Sat, 02 May 2020 00:09:47 +0000

This tutorial was written on Linux, so the commands may not work as shown on a Windows computer. While it's not a requirement, if you are planning to become a developer, I strongly recommend using a Unix-based operating system.

The only official requirement to run a Node project is having Node installed on your computer, but in practice projects also rely on additional tools that make an application easier to build and deploy. In this case, that tool is npm (Node Package Manager).

The first step is to install npm (how to do it depends on your operating system or Linux distribution).

Installing NPM (Node Package Manager)

On Arch Linux, npm is provided by the npm package from the community repository:

[brudhu@brudhu-manjaro tutorials]$ sudo pacman -Sy npm

On Ubuntu (and other distributions), the instructions can be found here: https://github.com/nodesource/distributions/blob/master/README.md

[brudhu@brudhu-manjaro tutorials]$ curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -
[brudhu@brudhu-manjaro tutorials]$ sudo apt-get install -y nodejs

Creating the app using NPM

Create a directory for your project and enter it:

[brudhu@brudhu-manjaro tutorials]$ mkdir tutorial-project-1
[brudhu@brudhu-manjaro tutorials]$ cd tutorial-project-1

Once you are in the directory, create the app using NPM:

[brudhu@brudhu-manjaro tutorial-project-1]$ npm init

After you run the init command, it will ask some questions about your project (for this project you can just press enter for all of them):

package name: the name of your project
version: the version of your project
description: the description of your project
entry point: the file that will be called to run your project
test command: a command to run tests on your project
git repository: the git repository of your project, in case it already has one
keywords: keywords of your project
author: the author's name
license: the license type of the project

This is what I answered for this tutorial. Once you answer all the questions, it will create a package.json file, as shown below:

[brudhu@brudhu-manjaro tutorial-project-1]$ npm init
This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sensible defaults.

See `npm help json` for definitive documentation on these fields
and exactly what they do.

Use `npm install <pkg>` afterwards to install a package and
save it as a dependency in the package.json file.

Press ^C at any time to quit.
package name: (tutorial-project-1)
version: (1.0.0)
description: My first Node.js app project
entry point: (index.js)
test command:
git repository:
keywords: node tutorial
author: Bruno Luvizotto
license: (ISC)
About to write to /home/brudhu/tutorials/tutorial-project-1/package.json:

{
  "name":"tutorial-project-1",
  "version":"1.0.0",
  "description":"My first Node.js app project",
  "main":"index.js",
  "scripts":{
    "test":"echo \"Error: no test specified\" && exit 1"
  },
  "keywords":[
    "node",
    "tutorial"
  ],
  "author":"Bruno Luvizotto",
  "license":"ISC"
}

Is this OK? (yes)

The package.json file is the descriptor of your project: it stores all the information you provided to the npm init command and will also store information about the packages used by the project (dependencies).

If you list the files in the project's directory, there will be the new package.json file:

[brudhu@brudhu-manjaro tutorial-project-1]$ ls
package.json

Now that we have the project descriptor (aka package.json), let's create the first file (the entry point of the project):

[brudhu@brudhu-manjaro tutorial-project-1]$ echo 'console.log("I did it! My first project!")' > index.js

At this point, we have the package.json and the index.js files. The next thing to do is to create a start script in your package.json file. Add the line "start": "node index.js" under “scripts”. Don't forget to add the comma after the previous line:

{
  "name": "tutorial-project-1",
  "version": "1.0.0",
  "description": "My first Node.js app project",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "node index.js"
  },
  "keywords": [
    "node",
    "tutorial"
  ],
  "author": "Bruno Luvizotto",
  "license": "ISC"
}

The scripts described under “scripts” in the package.json file can be run using the npm run command (e.g. npm run test or npm run start in this case).
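
To see which script names npm run can find, you can read the "scripts" object of package.json yourself. The sketch below is only an illustration (the function name listScripts is hypothetical, and it parses a JSON string instead of reading the file from disk so that it is self-contained):

```javascript
// Hypothetical helper: list the script names defined in a package.json string.
function listScripts(packageJsonText) {
  const pkg = JSON.parse(packageJsonText);
  // A package.json without a "scripts" section simply has no scripts.
  return Object.keys(pkg.scripts || {});
}

// A trimmed-down version of the package.json from this tutorial.
const examplePackageJson = JSON.stringify({
  name: 'tutorial-project-1',
  version: '1.0.0',
  scripts: {
    test: 'echo "Error: no test specified" && exit 1',
    start: 'node index.js',
  },
});

console.log(listScripts(examplePackageJson)); // [ 'test', 'start' ]
```

Running npm run with no arguments prints the available scripts as well, and npm start and npm test are shortcuts for npm run start and npm run test.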

Now that we have the start script described and also the index.js file, we can finally run the project:

[brudhu@brudhu-manjaro tutorial-project-1]$ npm run start

> tutorial-project-1@1.0.0 start /home/brudhu/tutorials/tutorial-project-1
> node index.js

I did it! My first project!

Congratulations! This is the very beginning of a Node.js project!