Detecting Fake News with Python and Machine Learning

#python #news #machinelearning #beginners

Is all news real? Should we trust all the news presented to us? Apparently not! Fake news exists, and it tends to go viral, reducing the impact of real news.

Fake news is one of the most significant new disturbing trends that must be resolved, otherwise the internet cannot truly serve and benefit humanity ~ Tim Berners-Lee

So how can you detect and fight fake news? Python and machine learning offer one way to go. By working through this Python project, you will build a model that classifies a piece of news as real or fake.

Fake News:

Fake news covers anything from intentionally fabricated reporting on a controversial topic to misleading information presented as news. It is generally spread through social media and other online media. Fake news aims at damaging the reputation of a person or an entity by imposing certain ideas with harmful intent. Types of fake news include satire or parody, false connection, misleading content, false content, impostor content, manipulated content, and fabricated content. Such news may contain false claims and may end up going viral through recommendation algorithms.

TfidfVectorizer:

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
Term Frequency (TF): TF is the number of times a word appears in a document. A higher value means the term appears more often than others, so the document is presumably a good match when that term is part of a search.
Inverse Document Frequency (IDF): IDF measures how significant a term is across the entire corpus. Words that occur in many documents receive a lower weight, since they help little in telling documents apart.
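
To make the two definitions above concrete, here is a minimal sketch that runs TfidfVectorizer on a tiny corpus. The three documents are made up for illustration and are not part of the news dataset. The word "news" occurs in every document, so it should end up with a lower IDF than a word such as "election" that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, invented for this illustration
corpus = [
    "news election results",
    "news weather forecast",
    "news sports scores",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# Map every term to the IDF weight learned from the corpus
idf = {term: vectorizer.idf_[index]
       for term, index in vectorizer.vocabulary_.items()}

print(matrix.shape)                   # (documents, distinct terms)
print(idf["news"], idf["election"])   # common word vs. rare word
```

Each row of the matrix is one document and each column is one term, which is the shape the classifier works with later on.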

PassiveAggressiveClassifier:

Passive-aggressive algorithms are online learning algorithms. Such an algorithm remains passive when an example is classified correctly and turns aggressive when it is misclassified, updating and adjusting the model. It does not converge. Its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector.
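
The passive and aggressive behaviour can be sketched in a few lines of NumPy. This is a simplified version of the basic passive-aggressive update for labels in {-1, +1}, written for illustration. The function name pa_update is hypothetical, and scikit-learn's PassiveAggressiveClassifier adds further details such as a regularization parameter, so this is not its exact implementation.

```python
import numpy as np

def pa_update(w, x, y):
    """One basic passive-aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w                             # passive: margin is already large enough
    tau = loss / np.dot(x, x)                # smallest step that removes the loss
    return w + tau * y * x                   # aggressive: correct the weights

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])

w_new = pa_update(w, x, +1)          # the zero vector has a loss, so w changes
margin = float(np.dot(w_new, x))     # margin on the same example after the update
w_same = pa_update(w_new, x, +1)     # no loss is left, so w stays as it is

print(w_new, margin)
```

The step size tau is chosen so that the update is just large enough to fix the current example, which is why the weight vector changes as little as possible.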

Detecting Fake News with Python

This Python project for telling fake news from real news uses scikit-learn (imported as sklearn). We will build a TfidfVectorizer on our dataset, initialize a PassiveAggressiveClassifier, and fit the model to classify a piece of news as either real or fake.
Project Prerequisites:
You will need the following:

Fake News Dataset: The dataset is a CSV file, which we will call news.csv. Its columns identify each news item and hold its title, its text, and a label denoting whether it is fake or real.
The Libraries: To perform this classification, you will need the basic data science stack: scikit-learn, numpy, and pandas. Other libraries such as transformers and PyCaret can also be used for this kind of task, but they are not needed here. Install the required libraries with pip: pip install numpy pandas scikit-learn. Note that the package is installed as scikit-learn but imported as sklearn.
Jupyter Lab: You will need to install JupyterLab to run your code.

STEPS FOR DETECTING FAKE NEWS WITH PYTHON

Below are steps to help you detect fake news using Python:

Import the Libraries:

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

Load the data:

Let's read the data into a DataFrame, and get the shape and columns. Next, get the labels from the DataFrame and finally split the dataset into training and testing sets.


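#Read the data (news.csv is the dataset file described in the prerequisites)
df = pd.read_csv('news.csv')

#Get the shape and head
df.shape
df.head()

#Get the labels
labels = df.labels
labels.head()

#Split the dataset: 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)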
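
If you want to see what the split does before running it on the real file, here is a small sketch on a made-up DataFrame. The ten placeholder rows and the name toy are assumptions for illustration only. The column names mirror the ones used in the code above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for news.csv with ten placeholder rows
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "labels": ["FAKE", "REAL"] * 5,
})

x_train, x_test, y_train, y_test = train_test_split(
    toy["text"], toy["labels"], test_size=0.2, random_state=7
)

# test_size=0.2 keeps 20% of the rows aside for testing
print(len(x_train), len(x_test))
```

Texts and labels are shuffled together, so every text in x_train still lines up with its own label in y_train.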

Initializing TfidfVectorizer:

We'll initialize TfidfVectorizer with English stop words and a maximum document frequency of 0.7, which means that terms appearing in more than 70% of the documents are discarded. Stop words are the most common words in a language; they carry little meaning on their own and are filtered out before processing natural language data.
TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.
We fit the vectorizer on the training set and transform it, then only transform the test set, so that both sets share the vocabulary learned from the training data.

#Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)
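
The effect of max_df and of reusing the fitted vectorizer can be checked on a tiny example. The documents below are invented for illustration. Because "news" appears in all three training documents (more than 70% of them), it should be dropped from the vocabulary, and words that only appear in the test document are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, not taken from the real dataset
train_docs = [
    "news election results",
    "news weather forecast",
    "news sports scores",
]
test_docs = ["breaking news election fraud"]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
train_matrix = vectorizer.fit_transform(train_docs)  # learn vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuse them, learn nothing new

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what lets a classifier trained on one make predictions on the other.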

Initializing a PassiveAggressiveClassifier:

We will initialize the PassiveAggressiveClassifier and then fit it on tfidf_train and y_train.

#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
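
To try the vectorizer and the classifier together without the full dataset, here is a minimal sketch on four invented headlines. The texts, labels, and variable names are all hypothetical, and four examples are far too few to say anything about real accuracy. The point is only to show the fit and predict calls working end to end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical toy data, invented for this sketch
toy_texts = [
    "aliens secretly control the city council",
    "miracle pill cures every disease overnight",
    "city council approves new budget for schools",
    "health agency publishes annual disease report",
]
toy_labels = ["FAKE", "FAKE", "REAL", "REAL"]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(toy_texts)

clf = PassiveAggressiveClassifier(max_iter=50, random_state=7)
clf.fit(features, toy_labels)

# A new document must go through the same fitted vectorizer
new_doc = vectorizer.transform(["council approves school budget report"])
prediction = clf.predict(new_doc)
print(prediction)
```

The random_state argument is added here only to make the sketch repeatable.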

Next, we predict labels for the vectorized test set and calculate the accuracy of the model.

#Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')
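
Accuracy is simply the share of predictions that match the true labels. A small worked example with made-up labels (not results from the real dataset) shows the arithmetic: three of the four predictions below are correct.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only
true_labels = ["FAKE", "REAL", "REAL", "FAKE"]
predicted_labels = ["FAKE", "REAL", "FAKE", "FAKE"]

toy_score = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {round(toy_score * 100, 2)}%")  # 3 correct out of 4 -> 75.0%
```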

Finally, we can print the confusion matrix to see how many fake and real articles were classified correctly and how many were mistaken for the other class.
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
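
Reading the matrix is easier with a small example. Using made-up labels (again, not results from the real dataset), rows stand for the true class and columns for the predicted class, in the order given by the labels argument.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
true_labels = ["FAKE", "REAL", "REAL", "FAKE"]
predicted_labels = ["FAKE", "REAL", "FAKE", "FAKE"]

matrix = confusion_matrix(true_labels, predicted_labels, labels=["FAKE", "REAL"])
print(matrix)
# [[2 0]   2 fake articles predicted FAKE, 0 predicted REAL
#  [1 1]]  1 real article predicted FAKE, 1 predicted REAL
```

The diagonal holds the correct predictions, so the closer the off-diagonal numbers are to zero, the better the model is doing.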

Conclusion.
We have now seen how to detect fake news with Python: we loaded the fake news dataset, built a TfidfVectorizer, and fitted a PassiveAggressiveClassifier on the resulting features. The walkthrough here is mostly conceptual, so I hope it all falls into place once you run the code yourself.

Thank you for reading.

Bibliography:
DataFlair Team: https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/
Wikipedia, Fake news: https://en.wikipedia.org/wiki/Fake_news