Matias Carpintini

Posted on Apr 11, 2022

Spam filtering system - Bayes classifier

#rails #ruby #machinelearning

What the f*ck is a Bayes Classifier?

The Bayes Rule, "The theory that never died"

A simple 18th-century theory for the evaluation of knowledge, criticized for most of the 20th century.

It helps people evaluate their initial ideas, update and modify them with new information, in order to make better decisions.

Initial beliefs + recent objective data = A new improved belief.

The theory is very robust. In practice, applying Bayes' rule can require a lot of calculation, with powerful computers recomputing the probability of an initial belief millions of times as new information arrives. Bayes' rule does not produce an absolutely true, exact answer; instead, it uses probability to move step by step toward the most likely conclusion.
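In symbols, for a hypothesis H (say, "this email is spam") and a new piece of evidence E (say, "it contains the word Free"), the rule is usually written as below. This is the standard textbook notation, not something specific to this post. The second line is the same rule in odds form: prior odds times a likelihood ratio gives the posterior odds.

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}

\frac{P(H \mid E)}{P(\neg H \mid E)}
  = \frac{P(H)}{P(\neg H)} \times \frac{P(E \mid H)}{P(E \mid \neg H)}
```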

It was discovered and published by two clergymen and amateur mathematicians, the Englishman Thomas Bayes and his Welsh friend Richard Price, during the 18th century.

The French mathematician Pierre-Simon Laplace developed it into the form in which it is used today. Now we could call it the Bayes-Price-Laplace theory, or BPL for short.

Part of the initial controversy stems from a harsh debate that opened in the 1740s about the improbability of Christian miracles. The question was whether there was evidence in the natural world that would help us reach rational conclusions about God the creator, which in the 18th century was known as "the cause" or "the first cause".

We do not know if Bayes was trying to prove the existence of God as a cause. But we do know that he tried to deal mathematically with the problem of cause and effect.

During the Cold War, the United States Air Force lost a hydrogen bomb off the coast of Palomares, Spain, and the United States Navy began to secretly develop Bayesian theory for finding underwater objects.

In 2009, Air France Flight 447 disappeared in the South Atlantic Ocean with 228 people on board. By then the United States Navy had developed Bayesian search theory far enough that applying it ended two years of unsuccessful searching for AF447 after about a week of underwater search.

Today we use it for an immense amount of things, such as filtering spam and training autopilot systems.

If you want to know more about the history behind it, take a look at this video.

Ok, nice. But you haven't answered the previous question!

Naive Bayes Classification

The Naive Bayes classifier is a machine learning technique that can be used to classify objects such as text documents into two or more classes. A new object is classified by its similarity to objects the classifier has already seen.

Despite its "naivete", the naive Bayes method tends to work very well in practice.

“all models are wrong, but some are useful” - George Box

Cool, go for it!

So, for our purpose (spam filtering), we need data (a lot of emails). We classify each email as spam or ham (legitimate) and then analyze the words independently, in order to find the most common words in each of those classes.

P.S.: You can have as many classes as you want, like Gmail does with Promotions, Updates, Forums or Social.

Initial beliefs: For simplicity, let's say that half of the emails we get are spam, so the prior odds are 1:1.

Now we need to obtain the probability that each of those words appears in spam or ham. The simplest way is to count how many times each word appears in each class and divide that number by the total word count of the class.

word	spam	ham
Free	184	12
total	104342	294554
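If you want to see the counting step as code, here is a minimal sketch. The four emails are made up for illustration (they are not the dataset used later), and `tokenize` and `word_rate` are helper names I picked for this example.

```ruby
# Tiny made-up corpus, purely for illustration
emails = [
  { label: :spam, text: "Free money, claim your free prize now" },
  { label: :spam, text: "Free offer just for you" },
  { label: :ham,  text: "Are you free for lunch tomorrow" },
  { label: :ham,  text: "Meeting notes attached, see you tomorrow" }
]

# Split a text into lowercase words, dropping punctuation
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# counts[:spam]["free"] is how many times "free" appears in spam emails
counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
# totals[:spam] is the total number of words seen in spam emails
totals = Hash.new(0)

emails.each do |email|
  tokenize(email[:text]).each do |word|
    counts[email[:label]][word] += 1
    totals[email[:label]] += 1
  end
end

# Relative frequency of a word within one class
def word_rate(counts, totals, label, word)
  counts[label][word].to_f / totals[label]
end

puts word_rate(counts, totals, :spam, "free")
puts word_rate(counts, totals, :ham, "free")
```

A real dataset would give you a table like the one above: one row per word, one column per class, plus the per-class totals.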

In this example, the word "Free" accounts for about 1 out of every 567 words in spam (104342 / 184). Same exercise for ham: about 1 in 24,546 (294554 / 12).

Recent objective data: If we find the word "Free" in our message, the odds of it being spam are multiplied by about 43.3 (24546 / 567).

1/1 (initial beliefs: half of the messages we receive are spam) * 43.3 (we just found the word "Free" in our email) = 43.3.

So, among messages containing "Free", there are on average about 43.3 spam messages for each ham message, or to use whole numbers, 433 (43.3*10) spam messages for every 10 ham messages. So, the probability would be: 433 / (433 + 10) * 100 = 97.7%
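To double-check the arithmetic, here is a small sketch that recomputes every step directly from the counts in the table. The variable names are mine; the only inputs are the four numbers from the table and the 1:1 prior.

```ruby
# Counts taken from the table above
free_in_spam = 184
spam_words   = 104_342
free_in_ham  = 12
ham_words    = 294_554

# How often "Free" shows up per word in each class
rate_spam = free_in_spam.to_f / spam_words
rate_ham  = free_in_ham.to_f / ham_words

# Likelihood ratio: how much more typical the word is of spam than of ham
likelihood_ratio = rate_spam / rate_ham

# Initial belief: half of our mail is spam, so the odds are 1:1
prior_odds     = 1.0
posterior_odds = prior_odds * likelihood_ratio

# Convert odds back into a probability: odds / (odds + 1)
spam_probability = posterior_odds / (posterior_odds + 1)

puts "1 in #{(1 / rate_spam).round} spam words is \"Free\""
puts "1 in #{(1 / rate_ham).round} ham words is \"Free\""
puts "likelihood ratio: #{likelihood_ratio.round(1)}"
puts "P(spam | \"Free\"): #{(spam_probability * 100).round(1)}%"
```

Changing `prior_odds` is an easy way to see how much the starting belief matters: with a prior of 1:10 the same word leaves you far less certain.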

To handle the rest of the words in an email, we can use exactly the same procedure. The posterior odds after one word (what we just calculated) become the prior odds (or the initial belief) for the next word, and so on.
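The chaining could be sketched like this. Careful: the likelihood ratios in the hash are invented for illustration, not measured on any dataset, and `spam_odds` and `odds_to_probability` are hypothetical helper names.

```ruby
# Hypothetical likelihood ratios P(word | spam) / P(word | ham).
# These values are made up for illustration.
LIKELIHOOD_RATIOS = {
  "free"    => 40.0,
  "prize"   => 15.0,
  "meeting" => 0.1
}.freeze

# Walk through the words; each posterior becomes the prior for the next word.
# Words we know nothing about leave the odds unchanged (ratio 1.0).
def spam_odds(words, prior_odds: 1.0)
  words.reduce(prior_odds) do |odds, word|
    odds * LIKELIHOOD_RATIOS.fetch(word.downcase, 1.0)
  end
end

# Convert odds into a probability between 0 and 1
def odds_to_probability(odds)
  odds / (odds + 1)
end

odds = spam_odds(%w[Free prize inside])
puts odds
puts odds_to_probability(odds)
```

For long emails, multiplying many ratios can overflow or underflow a float, so real implementations usually add up logarithms of the ratios instead of multiplying the ratios themselves.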

You may have noticed that the whole thing can end up biased: by the time we analyze the second word, we already hold a strong belief that the email is spam because it contains the word "Free", and the same goes for the words after it.

That's known as the base rate fallacy, and you can read more about it here.

But let's keep it simple,

"Land-and-Expand"

How can I play with this?

I just found classifier-reborn, a gem that keeps it pretty simple. And this dataset.

Here you go:

require 'classifier-reborn'
require 'csv'

# Load the dataset; headers: true consumes the header row,
# so every remaining row is one labeled email
dataset = CSV.parse(File.read("spam_ham_dataset.csv"), headers: true)

# Create our Bayes classifier with two categories
classifier = ClassifierReborn::Bayes.new('Spam', 'Ham')

# Train the classifier on every labeled email
dataset.each do |email|
  if email["label"] == "spam"
    classifier.train "Spam", email["text"]
  else
    classifier.train "Ham", email["text"]
  end
end

# Play with it
puts "Insert your email here (txt)"
puts classifier.classify gets

You can also check this repo, to see how it looks without the gem :p
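If you just want a feel for what happens under the hood, here is a minimal gem-free sketch. To be clear, this is not the code from that repo and not how classifier-reborn is implemented; `TinyBayes` is a name I made up, and the add-one smoothing and log scores are my own choices to keep unseen words and long emails from breaking the math.

```ruby
# A minimal naive Bayes text classifier (illustrative sketch)
class TinyBayes
  def initialize(*categories)
    @word_counts = categories.to_h { |c| [c, Hash.new(0)] } # per-category word tallies
    @word_totals = Hash.new(0)                              # words seen per category
    @doc_counts  = Hash.new(0)                              # documents seen per category
    @vocabulary  = {}                                       # every distinct word seen
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @word_totals[category] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the category with the highest log score
  def classify(text)
    scores(text).max_by { |_category, score| score }.first
  end

  # log P(category) + sum of log P(word | category), with add-one smoothing
  def scores(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @word_counts.keys.to_h do |category|
      log_prior = Math.log(@doc_counts[category] / total_docs)
      log_likelihood = words.sum do |word|
        Math.log((@word_counts[category][word] + 1.0) /
                 (@word_totals[category] + @vocabulary.size))
      end
      [category, log_prior + log_likelihood]
    end
  end

  private

  # Split a text into lowercase words, dropping punctuation
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

# Made-up training emails, purely for illustration
classifier = TinyBayes.new("Spam", "Ham")
classifier.train "Spam", "Free prize, claim your free money now"
classifier.train "Spam", "Win a free vacation today"
classifier.train "Ham",  "Lunch tomorrow? Let me know if you are free"
classifier.train "Ham",  "Here are the meeting notes from today"

puts classifier.classify("Claim your free prize")
puts classifier.classify("Notes from the meeting")
```

The interface mirrors the `train` / `classify` calls used with the gem above, so you can swap one for the other while experimenting.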

Speaking of spam... I'm going to launch a job board for devs who want to work remotely, and I'm looking for feedback. Take a look!

Bye.

DEV Community