<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: iyissa</title>
    <description>The latest articles on DEV Community by iyissa (@iyissa).</description>
    <link>https://dev.to/iyissa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F801946%2Fe8e2cff7-7d5d-4c4e-bf02-640f21efd836.jpeg</url>
      <title>DEV Community: iyissa</title>
      <link>https://dev.to/iyissa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iyissa"/>
    <language>en</language>
    <item>
      <title>Fake News Detection with Machine Learning and Flask</title>
      <dc:creator>iyissa</dc:creator>
      <pubDate>Wed, 15 Jun 2022 14:13:21 +0000</pubDate>
      <link>https://dev.to/iyissa/fake-news-detection-with-machine-learning-and-flask-e5l</link>
      <guid>https://dev.to/iyissa/fake-news-detection-with-machine-learning-and-flask-e5l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The world has become more digital, and there is an abundance of data available. Not all of it can be checked before it is released into the world. As the amount of data grows, some of it will be true while the rest will be false. Not every source can be independently verified, and doing so manually is impossible.&lt;/p&gt;

&lt;p&gt;Machine Learning occupies a unique position here: when utilised correctly, it can construct a model from a trusted dataset that can subsequently be used to sort through news. This project develops a model that analyses text to determine whether it is true news or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diving into the Project
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Data
&lt;/h3&gt;

&lt;p&gt;The data used for this project comes from the &lt;a href="https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset"&gt;Fake and real news dataset&lt;/a&gt; on Kaggle. For a simple guide on loading the data from Kaggle into Google Colab, check out this &lt;a href="https://medium.com/geekculture/how-to-download-datasets-from-kaggle-to-google-colab-7bb3c5a44c51"&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Cleaning
&lt;/h3&gt;

&lt;p&gt;After the data has been loaded, a bit of cleaning is needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;true['label'] = 1
fake['label'] = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Labelling at this stage ensures the target is expressed as numbers the model can interpret: true news is labelled 1, while fake news is labelled 0.&lt;/p&gt;

&lt;p&gt;To increase the speed of the experiment, only the first 5,000 rows of each dataset are used and then put into a single data frame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;frames = [true.loc[:5000][:], fake.loc[:5000][:]]
df = pd.concat(frames)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data frame is then divided into features (X) and labels (y).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = df. drop('label', axis=1)
y = df['label']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Missing values are then dropped, and a copy of the data frame is created for later use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.dropna()
df2 = df.copy()
df2.reset_index(inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Text Preprocessing
&lt;/h3&gt;

&lt;p&gt;Preprocessing is the process of converting data into a format that a computer can understand and use. For text data, a common preprocessing step is removing useless data, referred to as stop words. Stop words are commonly used words that programs and search engines have been instructed to ignore. Examples include 'a', 'i', 'me', 'my', 'the', and 'you'.&lt;/p&gt;

&lt;p&gt;Continuing with the fake news project, preprocessing is done with &lt;strong&gt;nltk&lt;/strong&gt;, a Python package for text preprocessing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
import nltk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After importing the required libraries, the next bit involves removing all punctuation, lowercasing all capitalised characters, removing all stopwords, and then stemming. Stemming is the process where words in the dataset are reduced to their base forms. For example, words like "likes", "liked", "likely", and "liking" are all reduced to "like". This is required to eliminate redundancy in the data.&lt;/p&gt;
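
&lt;p&gt;As a quick standalone illustration of stemming (a minimal sketch, assuming nltk is installed):&lt;/p&gt;

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
words = ["likes", "liked", "likely", "liking"]
# each variant is reduced toward the same base form
print([ps.stem(w) for w in words])
```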

&lt;blockquote&gt;
&lt;p&gt;Regex is used in this section; if you're not familiar with it, you can get an introduction &lt;a href="https://www.youtube.com/watch?v=sa-TUpSx1JA"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nltk.download('stopwords')
ps = PorterStemmer()
corpus = []
for i in range(0, len(df2)):
    review = re.sub('[^a-zA-Z]', ' ', df2['text'][i])
    review = review.lower()
    review = review.split()

    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is feature extraction: turning the text into numeric features that machine learning models can work with. There are different techniques for this, such as Word2Vec, GloVe, and BERT embeddings, but TF-IDF is sufficient for this project.&lt;/p&gt;

&lt;p&gt;TF-IDF is a statistical method for capturing the significance of a text's terms in relation to the corpus as a whole. It's ideal for retrieving information and extracting keywords from a document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1,3))
X = tfidf_v.fit_transform(corpus).toarray()
y = df2['label']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that is done, the next step involves splitting the dataset into train and test sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training and Validating the Model
&lt;/h3&gt;

&lt;p&gt;The data has been split and is ready for modelling. For this project, the PassiveAggressiveClassifier is used: an online learning algorithm that works well for detecting fake news. Other algorithms, such as logistic regression, XGBoost, or neural networks, could also be used in this step. For a more detailed explanation, check &lt;a href="https://www.geeksforgeeks.org/passive-aggressive-classifiers/"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics
import numpy as np
import itertools
classifier = PassiveAggressiveClassifier(max_iter=1000)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A confusion matrix is then used to visualise the results. If you want to learn more about the confusion matrix, you can check out my &lt;a href="https://ayoyissa.hashnode.dev/what-is-a-confusion-matrix"&gt;previous article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the validation process, a random unseen article is run through the same preprocessing pipeline and then predicted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Validation
import random
r1 = random.randint(5001, len(fake))

review = re.sub('[^a-zA-Z]', ' ', fake['text'][r1])
review = review.lower()
review = review.split() 
review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
review = ' '.join(review)

# Vectorization
val = tfidf_v.transform([review]).toarray()

# Predict 
classifier.predict(val)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To save the model, we make use of the pickle package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pickle
pickle.dump(classifier, open('model2.pkl', 'wb'))
pickle.dump(tfidf_v, open('tfidfvect2.pkl', 'wb'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model and vectorizer are then loaded back to confirm the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load model and vectorizer
joblib_model = pickle.load(open('model2.pkl', 'rb'))
joblib_vect = pickle.load(open('tfidfvect2.pkl', 'rb'))
val_pkl = joblib_vect.transform([review]).toarray()
joblib_model.predict(val_pkl)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying the model
&lt;/h3&gt;

&lt;p&gt;There are many ways to deploy a model, but this one is deployed using Flask, so some experience with Flask will help in this section. The app.py used can be found on GitHub &lt;a href="https://github.com/iyissa/Fake_News/blob/master/app.py"&gt;here&lt;/a&gt; and the index.html &lt;a href="https://github.com/iyissa/Fake_News/blob/master/templates/index.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
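
&lt;p&gt;For orientation, here is a minimal sketch of what such a Flask app can look like. This is not the project's actual app.py (see the links above for that): the route names and the form field are assumptions for illustration only.&lt;/p&gt;

```python
import pickle
from flask import Flask, request

app = Flask(__name__)

def load_artifacts():
    # load the classifier and vectorizer pickled earlier
    model = pickle.load(open('model2.pkl', 'rb'))
    vect = pickle.load(open('tfidfvect2.pkl', 'rb'))
    return model, vect

@app.route('/')
def index():
    return 'POST an article to /predict to classify it'

@app.route('/predict', methods=['POST'])
def predict():
    model, vect = load_artifacts()
    # run the submitted text through the same TF-IDF vectorizer
    val = vect.transform([request.form['text']]).toarray()
    label = model.predict(val)[0]
    return 'True news' if label == 1 else 'Fake news'
```

&lt;p&gt;Running this with "flask run" serves the saved classifier behind a simple HTTP endpoint.&lt;/p&gt;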

&lt;p&gt;The code for this project is available at this &lt;a href="https://github.com/iyissa/Fake_News"&gt;repo&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bringing it All Together
&lt;/h3&gt;

&lt;p&gt;This blog post has gone through the steps from downloading the data to cleaning it, building the model, validating it, and finally deploying it with Flask. Thank you for reading. Any feedback is appreciated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/how-to-build-a-fake-news-detection-web-app-using-flask-c0cfd1d9c2d4"&gt;How to Build a Fake News Detection Web App Using Flask&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What is a Confusion Matrix</title>
      <dc:creator>iyissa</dc:creator>
      <pubDate>Mon, 04 Apr 2022 10:50:18 +0000</pubDate>
      <link>https://dev.to/iyissa/what-is-a-confusion-matrix-3jb7</link>
      <guid>https://dev.to/iyissa/what-is-a-confusion-matrix-3jb7</guid>
      <description>&lt;p&gt;For beginners in Machine Learning, it is quite interesting that there is something as aptly named as the Confusion Matrix. It really is quite difficult to interpret or understand what is going on when you encounter it for the first time in your journey but for every machine learning engineer, the confusion matrix is a fundamental tool to have in your toolkit.&lt;/p&gt;

&lt;p&gt;The Confusion Matrix is a visualisation tool for measuring the performance of a machine learning model. It can be used to derive a number of metrics, including Accuracy, Recall, Precision, Specificity, and AUC-ROC curves. Knowing which metric to use for a given problem is a different ballgame on its own.&lt;/p&gt;

&lt;p&gt;In machine learning, different methods are used to solve different problems. All of them involve splitting the problem dataset into training and test sets so the resulting model can be checked rather than trusted blindly. After these sub-datasets are created, the model is built, and the Confusion Matrix is one tool for testing how correct that model is on the dataset.&lt;/p&gt;

&lt;p&gt;The rows in a Confusion Matrix signify the results that were predicted by the model while the columns stand for the truth that is known. This is where the concept of True Positives, True Negatives, False Positives and False Negatives come from. &lt;/p&gt;

&lt;p&gt;Each of these is explained below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A true positive indicates that the prediction was positive and it is true (correct). It can also be explained by saying a true positive is an outcome where the model correctly predicts the positive class.  e.g A woman is predicted to be pregnant and she is pregnant &lt;/li&gt;
&lt;li&gt;A true negative indicates when the prediction is negative and it is true (correct). It can also be understood as an outcome where a model correctly predicts a negative class. e.g A man is predicted to not be pregnant and he is not pregnant &lt;/li&gt;
&lt;li&gt;A false positive indicates that the prediction is positive and it is false (incorrect). It can also be interpreted as an outcome where the model incorrectly predicts the positive class e.g A man is predicted to be pregnant but he is not pregnant &lt;/li&gt;
&lt;li&gt;A false negative indicates that the prediction is negative but it is false (incorrect). It can also be understood as an outcome where the model incorrectly predicts the negative class. e.g A woman is predicted to be not pregnant but she is.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v9KgMQpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a1ejwpfotztlvg7y3ww.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v9KgMQpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5a1ejwpfotztlvg7y3ww.jpeg" alt="A Confusion Matrix" width="667" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen from the diagram above, the True Positive count is in the top left corner and the True Negative count is in the bottom right corner. Those are largely the important figures, and they sit on the main diagonal. The numbers off the diagonal show how many samples were classified wrongly by the model. So, confusion matrices produced for two different models can show at face value which one is worse at predicting results.&lt;/p&gt;

&lt;p&gt;Not all confusion matrices are 2x2; the size of a confusion matrix is determined by the number of classes to be predicted. What stays constant, no matter the size of the matrix, is that the diagonal shows how correct the model's predictions are.&lt;/p&gt;
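
&lt;p&gt;As a minimal, standalone sketch (the labels below are illustrative), scikit-learn can compute a confusion matrix directly. Note that scikit-learn's convention places the actual labels on the rows and the predictions on the columns:&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix

# illustrative labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# rows are actual labels, columns are predictions;
# the diagonal holds the correctly classified counts
cm = confusion_matrix(y_true, y_pred)
print(cm)
```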

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning and Tabular Data</title>
      <dc:creator>iyissa</dc:creator>
      <pubDate>Sun, 13 Mar 2022 15:53:30 +0000</pubDate>
      <link>https://dev.to/iyissa/machine-learning-and-tabular-data-mbm</link>
      <guid>https://dev.to/iyissa/machine-learning-and-tabular-data-mbm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Machine Learning is quite simple, yet it can be quite complicated at the same time. Neural networks are a fundamental part of machine learning: a series of algorithms that work together to identify underlying patterns in data. They loosely mimic the way the neurons of the human brain work, which is where the name comes from.&lt;/p&gt;

&lt;p&gt;Data comes in various forms: structured, semi-structured, and unstructured. Structured data can be found in spreadsheets and databases, while unstructured data exists in log files, images, audio, and so on. Tabular data is a form of structured data: data arranged into rows, where each row contains information about one thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems
&lt;/h2&gt;

&lt;p&gt;Using neural networks on tabular data has not always been ideal. Models built with neural networks typically underperform traditional machine learning models on this kind of data. Tabular data typically does not have the highly non-linear relationships that image recognition and NLP datasets have, and there is not enough information in tabular data for neural models to capitalise on to increase their performance.&lt;/p&gt;

&lt;p&gt;The quality of the available data is another major concern with tabular data: there are often outliers and missing values. It is also difficult to find spatial correlations between the variables in tabular datasets, which means that methods like Convolutional Neural Networks struggle to build models from tabular data. Another important problem is the conversion of categorical attributes, usually done with one-hot encoding, which worsens the problem of dimensionality. Data augmentation is a very important part of machine learning, as it helps a model become more accurate, yet it is very challenging to apply to tabular data. All of these combine to show the complexity of using neural networks with tabular data.&lt;/p&gt;
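
&lt;p&gt;To make the dimensionality point concrete, here is a small, hypothetical example of one-hot encoding with pandas: a single categorical column becomes one new column per category, so high-cardinality columns inflate the feature space quickly.&lt;/p&gt;

```python
import pandas as pd

# a hypothetical categorical column
df = pd.DataFrame({'city': ['Lagos', 'Abuja', 'Lagos', 'Kano']})
encoded = pd.get_dummies(df, columns=['city'])

# one original column has become one column per category
print(sorted(encoded.columns.tolist()))
```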

&lt;p&gt;Models that perform very well on tabular data, such as gradient boosted trees, random forests, and linear algorithms, all do very well at mapping “shallow” non-linear relationships, and the mapping is done in an efficient and simple way. So neural networks are not inherently bad for tabular data; rather, the amount of data required for a neural network to perform well is not typically found in tabular datasets, which explains the underperformance.&lt;/p&gt;

&lt;p&gt;The time and resources needed to tune neural networks and deep learning for tabular data are also not easily justifiable knowing how well gradient boosting algorithms work on the same type of data. &lt;/p&gt;

&lt;h2&gt;
  
  
  What ML algorithms work instead
&lt;/h2&gt;

&lt;p&gt;As alluded to earlier, gradient boosting algorithms have been shown to be the best for problems involving tabular data; the best bets for accurate modelling of these problems are LightGBM, XGBoost, and CatBoost. These three can be considered the holy grail of tabular data and should be the first port of call in tabular data problems. Linear models such as Logistic Regression and ElasticNet also perform admirably.&lt;/p&gt;

&lt;p&gt;If there is still a need for a deep learning model to be created for tabular data, there exists Tabnet. TabNet is a Deep Neural Network for working with Structured, Tabular Data. &lt;a href="https://paperswithcode.com/paper/tabnet-attentive-interpretable-tabular/review/"&gt;It has outperformed previously mentioned Decision Tree-based models on multiple benchmark datasets&lt;/a&gt; and can be used in practice. A simple guide for implementation in solving a problem can be found &lt;a href="https://towardsdatascience.com/tabnet-deep-neural-network-for-structured-tabular-data-39eb4b27a9e4"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Understanding what your problem needs and knowing what to prioritize will aid in choosing the right machine learning method to use. Hopefully, this article helps you understand the available options. Thank you.&lt;/p&gt;

&lt;p&gt;Some content that was used to gain an understanding of this issue include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2110.01889.pdf"&gt;Deep Neural Networks and Tabular Data: A Survey&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.reddit.com/r/MachineLearning/comments/9826bt/d_why_is_deep_learning_so_bad_for_tabular_data/"&gt;Why is Deep Learning so bad for Tabular Data&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>Evaluation Metrics for an ML Regression Model</title>
      <dc:creator>iyissa</dc:creator>
      <pubDate>Sun, 23 Jan 2022 17:59:58 +0000</pubDate>
      <link>https://dev.to/iyissa/evaluation-metrics-for-an-ml-regression-model-49op</link>
      <guid>https://dev.to/iyissa/evaluation-metrics-for-an-ml-regression-model-49op</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Building a machine learning model is not as difficult as it initially seems. The issues come with what to do after the initial model has been created. This article aims to answer questions such as “What is evaluation?”, “Why do you need it?”, “What are evaluation metrics?”, “Which evaluation metrics should be used for a regression model?”, and “In which situation should I use each metric?”. By the end of this article, you should understand the basics and the most important parts of evaluating a regression model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;p&gt;Machine Learning can be called a branch of Artificial Intelligence whose main focus is building systems (models) that learn and improve based on the data they consume. Machine Learning is a fundamental part of how the world operates, with its use cases and applications visible in various sectors. A machine learning model takes in independent variables (as input) and aims to predict a dependent variable (output) based on the data it has consumed and the relationships that exist within that data. More specifically, a linear regression model is one that uses the formula (y=mx+c) to model the relationship between the dependent variable and the independent variable(s).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Evaluate?
&lt;/h2&gt;

&lt;p&gt;After a model is created, it is necessary to gauge how well it performs on data. Evaluation is the process of testing the performance of the model. It is one of the most important aspects of Machine Learning, as it measures how accurate a model is at predicting outcomes. Without it, the risk of a bad model being used is greatly increased. An evaluation metric is a mathematical quantifier of the quality of the model that has been created. Examples of evaluation metrics are accuracy, precision, the Inlier Ratio Metric, Mean Squared Error, Mean Absolute Error, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Residual Errors and Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;There are different evaluation metrics for different types of models, such as classification models and regression models. A question frequently asked is “How can I calculate the accuracy of a regression model?” The simple answer is that you cannot, because the output of a regression model is a numeric value such as a height or a dollar amount. The goal is not to predict the exact value but to measure how close the predicted value is to the actual value. That is where residual errors for regression models come in.&lt;/p&gt;

&lt;p&gt;For the evaluation of a regression model, it is important to understand the concept of residual errors. A residual error is the difference between the actual and predicted values; for each output, there is a residual, and residuals can be either positive or negative. Technically, it is possible to manually check each residual to see how a model performed, but in datasets with thousands or millions of points, that is not feasible. Hence, there are evaluation metrics calculated from the residual errors to simplify the evaluation process.&lt;/p&gt;

&lt;p&gt;There are a lot of metrics but the most common ones used are:&lt;/p&gt;

&lt;h3&gt;
  
  
  Mean Absolute Error (MAE):
&lt;/h3&gt;

&lt;p&gt;The MAE is calculated by taking the absolute value of the residual errors and then finding their average. The resulting figure is the absolute mean of the errors, and how high or low it is determines how well the model is performing. The lower the MAE, the better the model fits the dataset.&lt;/p&gt;

&lt;p&gt;The MAE's advantage is that all the errors are weighted on the same scale since absolute values are used. Hence, not too much attention is given to outliers, and it provides a fairly even measure of how well the model performs. The disadvantage is that if outliers are very important to the problem, the MAE is not very effective: a model with a good MAE is usually solid overall but may still make a few disappointingly poor individual predictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mean Squared Error (MSE):
&lt;/h3&gt;

&lt;p&gt;The MSE is calculated by squaring the residual errors and then finding their average. Because of the squaring, the MSE is typically much larger than the MAE. For the MSE, the closer the value gets to 0, the better the model is; when comparing multiple models, the one with the lowest MSE is the best.&lt;/p&gt;

&lt;p&gt;The advantage of using the MSE is that it heavily penalises outlier predictions that produce huge errors, since squaring places greater influence on those errors. The downside is that if the model makes one bad prediction, the squaring greatly magnifies that error and can skew the total.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Mean Squared Error (RMSE):
&lt;/h3&gt;

&lt;p&gt;The RMSE is simply the square root of the mean squared error. The value of the RMSE should be as low as possible: the lower the RMSE, the better the predictions. A high RMSE indicates a large deviation between predicted and actual values, which suggests the model could be bad.&lt;/p&gt;
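
&lt;p&gt;The three metrics above can be computed directly with scikit-learn; the values below are purely illustrative:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# illustrative actual and predicted values
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # average of the absolute residuals
mse = mean_squared_error(y_true, y_pred)    # average of the squared residuals
rmse = np.sqrt(mse)                         # square root of the MSE
print(mae, mse, rmse)
```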

&lt;h3&gt;
  
  
  Huber Loss Function:
&lt;/h3&gt;

&lt;p&gt;The Huber loss is something of a mid-point between the Mean Absolute Error and the Mean Squared Error: the MSE emphasises outliers while the MAE ignores them. The calculation of the Huber loss function is somewhat involved, but simplified, it says that for residuals smaller than a threshold delta, an MSE-like squared term is used, and for residuals larger than delta, an MAE-like linear term is used. That is a combination of the best of both error terms.&lt;/p&gt;

&lt;p&gt;The advantage of the Huber loss is that using the MAE-like term for the larger residuals lowers the weight given to outliers, while using the MSE-like term for the other residuals adds up to a well-rounded model.&lt;/p&gt;
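
&lt;p&gt;As a sketch of the piecewise rule described above (the delta of 1.0 is an assumed default, not a prescribed value):&lt;/p&gt;

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # piecewise: a squared (MSE-like) term for residuals up to delta,
    # and a linear (MAE-like) term for residuals beyond it
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    clipped = np.minimum(residual, delta)
    return np.mean(0.5 * clipped ** 2 + delta * (residual - clipped))

# the last residual (3.0) falls in the linear regime, the others in the squared one
print(huber_loss([3.0, 5.0, 2.5], [2.5, 5.0, 5.5]))
```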

&lt;p&gt;Other evaluation metrics that are not explained include the R-squared, Adjusted R-squared, Max Error, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerations and Recommendations
&lt;/h2&gt;

&lt;p&gt;For your model, the Huber loss function should be used when a balance is needed between giving weight to outliers and not letting the model depend entirely on their presence; it is useful in regression cases such as estimating location. For cases where outliers are very important, the MSE is advisable, and in cases where outliers are not cared about at all, the MAE functions very well. For models where any slight variation in the error means a lot, such as clinical trials, the RMSE should be used because it is the most exacting of these metrics.&lt;/p&gt;

&lt;p&gt;Evaluation metrics as we have seen are quite useful and very helpful in reducing the stress of manual inspection of each point in the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is recommended to use more than one metric when evaluating a model, as some models perform very well on one metric while failing on another. Reporting only the good metric could give a false impression.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The majority of models created do not care much about outliers and are simply built to provide well-rounded predictions that perform well on the majority of the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts and Further Reading Resources
&lt;/h2&gt;

&lt;p&gt;It is important to understand how to evaluate a model because evaluation is the bedrock for judging how good the model is. For further and more in-depth reading on evaluation metrics for regression models, including less commonly used ones, check out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://keras.io/api/losses/regression_losses/"&gt;Keras Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.qualdo.ai/blog/complete-list-of-performance-metrics-for-monitoring-regression-models/"&gt;Complete List of Metrics&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
