Detecting Fake News with Python and Machine Learning

Hello There!

We keep up with current events through many channels, but they all lead back to one thing: news. News gives us the information we need to stay informed, but not every story is legitimate or trustworthy. Anyone can spread false information, which can cause confusion and unnecessary panic. Don't worry, though: Python comes to the rescue!

We can build a fake news prediction model with Python and machine learning. The model is trained on a labeled dataset of articles and predicts whether a given article is real or fake. So get ready, and let's begin coding.

Fake News Prediction Model with Python
Before beginning, make sure you have the following libraries, which the model needs:

scikit-learn (imported as sklearn)
numpy
pandas
itertools (part of the Python standard library, so there is nothing to install)

You can check which packages are already installed from Command Prompt (Windows) or Terminal (macOS):

pip list

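If any of the packages are missing, install them by running the command below. Note that the library is published as scikit-learn, even though it is imported as sklearn. On Windows you may need to run Command Prompt as Administrator if Python is installed system-wide:

pip install scikit-learn numpy pandas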
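
If scanning the pip list output by eye feels tedious, a small script can do the check instead. The sketch below is only one way to go about it: the installed_versions helper and the REQUIRED list are names made up for this example, and it relies on the standard library's importlib.metadata module.

```python
from importlib import metadata

# Distribution names as published on PyPI; scikit-learn is installed as
# "scikit-learn" but imported as "sklearn".
REQUIRED = ["scikit-learn", "numpy", "pandas"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed is one to pass to pip install.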

Once everything is installed, open a new Jupyter Notebook in your code editor and let's get coding!

Opening a new Jupyter Notebook in VS Code

Press Ctrl + Shift + P to open the Command Palette.
Type in "Jupyter Notebook" and the command Create: New Jupyter Notebook will pop up. Press Enter.
A new Jupyter Notebook with the file extension .ipynb will be created. Now you can work on the prediction model with less hassle.

1. Libraries

With everything installed, we can import the required libraries:

import pandas as pd
import numpy as np
import itertools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

Scikit-learn (sklearn): This is a machine learning library for Python. We use it to split the data, extract features from the text, train the classifier and score its predictions.
Pandas: This is a library for Python used for data analysis and manipulation.
Itertools: This is a standard-library Python module with tools for building and combining iterators.
NumPy: This is a library for Python used for scientific computing.

2. Reading dataset

Using pandas, we can read data from a .csv file and have our dataset ready:

df = pd.read_csv('magazines.csv')

After reading the file containing the dataset, we can view the first five rows with head() and get the number of rows and columns as a tuple with shape. Run each one in its own notebook cell, since a cell only displays the value of its last expression:

df.head() 
df.shape

The labels, REAL and FAKE, are what the model learns to predict, so we pull them out of the dataframe:

labels = df.label
labels.head()
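
Before going further, it can help to confirm that the dataset really has the columns the later steps use and to see how many articles carry each label. Here is a minimal sketch of such a check. It assumes the text and label columns used in this tutorial, and the three-row CSV is a made-up stand-in so the example runs without the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real CSV, with the same
# 'text' and 'label' columns the tutorial relies on.
sample_csv = io.StringIO(
    "title,text,label\n"
    "Headline A,Some article body,REAL\n"
    "Headline B,Another article body,FAKE\n"
    "Headline C,A third article body,REAL\n"
)
df = pd.read_csv(sample_csv)

# Fail early if a column the later steps depend on is absent.
missing = {"text", "label"} - set(df.columns)
assert not missing, f"missing columns: {missing}"

# Rows with no text or no label cannot be used for training.
df = df.dropna(subset=["text", "label"])

# Count how many articles fall under each label.
label_counts = df["label"].value_counts()
print(label_counts)
```

If one label heavily outnumbers the other, the accuracy score on its own can look better than the model really is.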

3. Splitting dataset

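After getting the labels from our dataframe, the next step is splitting the data into training and testing sets using train_test_split. Here test_size = 0.2 holds out 20% of the rows for testing, and random_state = 7 fixes the shuffle so the split is the same on every run: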

x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size = 0.2, random_state = 7)
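
To see what the split produces, here is a small sketch on made-up data. The ten toy texts and their labels are hypothetical. The stratify argument is an optional extra that the call above does not use; it asks train_test_split to keep the REAL/FAKE ratio the same in both sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten texts with a 50/50 label balance.
texts = [f"article {i}" for i in range(10)]
labels = ["REAL", "FAKE"] * 5

x_train, x_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,    # hold out 20% of the rows for testing
    random_state=7,   # fixed seed so the split is repeatable
    stratify=labels,  # keep the REAL/FAKE ratio the same in both sets
)

print(len(x_train), len(x_test))
print(sorted(y_test))
```

With ten rows and test_size=0.2, eight rows go to training and two to testing.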

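4. Initializing TfidfVectorizer

TfidfVectorizer converts a collection of raw documents into a matrix of Term Frequency (TF)-Inverse Document Frequency (IDF) features. Setting stop_words = 'english' removes common English words, and max_df = 0.7 ignores terms that appear in more than 70% of the documents.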

tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

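After initializing tfidf_vectorizer, fit it on the training set with fit_transform() and then apply it to the test set with transform(). The vectorizer learns its vocabulary from the training data only, so the test set is encoded with that same vocabulary:

tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)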
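
To get a feel for the numbers inside that matrix, the sketch below works out TF-IDF weights by hand in plain Python. It is a simplified illustration, not the library's code: it splits on whitespace, skips stop-word removal, and uses a made-up three-document corpus. The IDF formula is the smoothed variant that scikit-learn documents as its default, followed by L2 normalisation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on whitespace.
docs = ["fake news spreads fast", "real news", "fake claims"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency: rarer terms score higher.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens):
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalise so long and short documents are comparable.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vector = tfidf(tokenized[0])
for term, weight in sorted(vector.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "spreads" appears in only one document while "news" appears in two, so "spreads" ends up with the larger weight.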

5. Initializing PassiveAggressiveClassifier

PassiveAggressiveClassifier is an online learning algorithm: it stays passive when an example is classified correctly and turns aggressive on a misclassification, updating its weights to correct the mistake.
After initializing the classifier, fit it on the training set and predict on the test set.

pac = PassiveAggressiveClassifier(max_iter = 50)
pac.fit(tfidf_train, y_train)       

y_pred = pac.predict(tfidf_test)
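
The passive and aggressive behaviour is easier to see in a single update step. The sketch below is a hand-written illustration of the PA-I variant of the update rule for labels coded as -1 and +1. It is not scikit-learn's implementation, and the pa_update helper, the cap C and the feature vector are all made up for the example.

```python
import numpy as np


def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}.

    Assumes x is not the zero vector.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w                             # passive: margin already met
    tau = min(C, loss / np.dot(x, x))        # aggressive: capped step size
    return w + tau * y * x


w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])    # hypothetical feature vector

w = pa_update(w, x, y=+1)        # margin violated, so the weights move
print(w)

w_again = pa_update(w, x, y=+1)  # margin now met, so nothing changes
print(np.array_equal(w, w_again))
```

The first call moves the weights just far enough to satisfy the margin on that example, which is why the second call leaves them alone.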

6. Calculating the accuracy score and confusion matrix of the prediction model

The accuracy score is the fraction of test articles the model labels correctly, while the confusion matrix breaks the predictions down into true and false positives and negatives.

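score = accuracy_score(y_test, y_pred)
print(f"Accuracy: {round(score*100, 2)}%")

# Rows are the true labels and columns the predicted labels, in the order FAKE, REAL
cm = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
print(cm)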
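
To make the layout of the confusion matrix concrete, here is a hand-rolled sketch in plain Python on six made-up predictions. It follows the same convention as the call above, with rows for the true label and columns for the predicted label in the order FAKE, REAL. The y_true and y_hat lists are hypothetical.

```python
from collections import Counter

# Hypothetical true labels and predictions for six test articles.
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_hat = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
label_order = ["FAKE", "REAL"]

# Tally every (true label, predicted label) pair.
pairs = Counter(zip(y_true, y_hat))

# Rows follow the true label, columns follow the predicted label.
matrix = [[pairs[(t, p)] for p in label_order] for t in label_order]

# Accuracy is the diagonal (correct predictions) over the total.
correct = sum(matrix[i][i] for i in range(len(label_order)))
accuracy = correct / len(y_true)

print(matrix)
print(f"Accuracy: {round(accuracy * 100, 2)}%")
```

Here the top row says that two FAKE articles were caught and one was mistaken for REAL, and the bottom row says that one REAL article was flagged as FAKE and two were labeled correctly.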

Conclusion:
Congratulations! You have built a prediction model that tells whether a news article is fake or real. There is a lot packed into this short program, so take your time going through the code and getting to understand the algorithms and functions used in machine learning to create spectacular models.
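
To try the idea on an article the model has not seen, something like the sketch below could work. It is a minimal, self-contained example: the four training sentences, the classify helper and the sample headline are all made up for illustration, and a model trained on so little data will not give meaningful predictions. The point is only the order of the calls.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical miniature training set; a real model would be trained
# on the full dataset instead.
train_texts = [
    "scientists publish peer reviewed study on vaccine safety",
    "government releases official budget report for the year",
    "shocking miracle cure doctors do not want you to know",
    "aliens secretly control the world banks says anonymous blog",
]
train_labels = ["REAL", "REAL", "FAKE", "FAKE"]

vectorizer = TfidfVectorizer(stop_words="english")
classifier = PassiveAggressiveClassifier(max_iter=50, random_state=7)
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)


def classify(article_text):
    """Return the predicted label for one unseen article."""
    # Reuse the fitted vectorizer: transform(), not fit_transform(), here.
    features = vectorizer.transform([article_text])
    return classifier.predict(features)[0]


print(classify("official report on the yearly budget"))
```

The new text goes through the already fitted vectorizer, so it is encoded with the vocabulary the classifier was trained on.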
Happy coding!

DEV Community
