<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Victor Zhou</title>
    <description>The latest articles on DEV Community by Victor Zhou (@vzhou842).</description>
    <link>https://dev.to/vzhou842</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152146%2Fd37b2113-1b6e-4273-bf6a-3d939b0a3d9d.png</url>
      <title>DEV Community: Victor Zhou</title>
      <link>https://dev.to/vzhou842</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vzhou842"/>
    <language>en</language>
    <item>
      <title>Random Forests for Complete Beginners</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Wed, 10 Apr 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/random-forests-for-complete-beginners-4odd</link>
      <guid>https://dev.to/vzhou842/random-forests-for-complete-beginners-4odd</guid>
      <description>&lt;p&gt;In my opinion, most Machine Learning tutorials aren’t beginner-friendly enough.&lt;/p&gt;

&lt;p&gt;Last month, I wrote an &lt;a href="https://victorzhou.com/blog/intro-to-neural-networks/" rel="noopener noreferrer"&gt;introduction to Neural Networks &lt;strong&gt;for complete beginners&lt;/strong&gt;&lt;/a&gt;. This post will adopt the same strategy, meaning it again &lt;strong&gt;assumes ZERO prior knowledge of machine learning&lt;/strong&gt;. We’ll learn what Random Forests are and how they work from the ground up.&lt;/p&gt;

&lt;p&gt;Ready? Let’s dive in.&lt;/p&gt;

&lt;h2&gt;1. Decision Trees 🌲&lt;/h2&gt;

&lt;p&gt;A Random Forest 🌲🌲🌲 is actually just a bunch of Decision Trees 🌲 bundled together (ohhhhh 💡 that’s why it’s called a &lt;em&gt;forest&lt;/em&gt;). We need to talk about trees before we can get into forests.&lt;/p&gt;

&lt;p&gt;Look at the following dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset.svg"&gt;&lt;/a&gt;The Dataset&lt;/p&gt;

&lt;p&gt;If I told you that there was a new point with an x coordinate of 1, what color do you think it’d be?&lt;/p&gt;

&lt;p&gt;Blue, right?&lt;/p&gt;

&lt;p&gt;You just evaluated a decision tree in your head:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s a simple decision tree with one &lt;strong&gt;decision node&lt;/strong&gt; that &lt;strong&gt;tests&lt;/strong&gt; x &amp;lt; 2. If the test passes (x &amp;lt; 2), we take the left &lt;strong&gt;branch&lt;/strong&gt; and pick Blue. If the test fails (x ≥ 2), we take the right &lt;strong&gt;branch&lt;/strong&gt; and pick Green.&lt;/p&gt;
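
&lt;p&gt;In code, that one-node tree is nothing more than a conditional (a sketch; the class names come from the figure):&lt;/p&gt;

```python
def classify(x):
    # The whole tree is one decision node: test x < 2.
    if x < 2:
        return "blue"   # left branch
    return "green"      # right branch
```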

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset-split.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset-split.svg"&gt;&lt;/a&gt;The Dataset, split at x=2&lt;/p&gt;

&lt;p&gt;Decision Trees are often used to answer that kind of question: given a &lt;strong&gt;labelled&lt;/strong&gt; dataset, how should we &lt;strong&gt;classify&lt;/strong&gt; new samples?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Labelled&lt;/strong&gt;: Our dataset is &lt;em&gt;labelled&lt;/em&gt; because each point has a &lt;strong&gt;class&lt;/strong&gt; (color): blue or green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify&lt;/strong&gt;: To &lt;em&gt;classify&lt;/em&gt; a new datapoint is to assign a class (color) to it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s a dataset that has 3 classes now instead of 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg"&gt;&lt;/a&gt;The Dataset v2&lt;/p&gt;

&lt;p&gt;Our old decision tree doesn’t work so well anymore. Given a new point (x,y),&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If x≥2, we can still confidently classify it as green. &lt;/li&gt;
&lt;li&gt;If x&amp;lt;2, we can’t immediately classify it as blue - it could be red, too.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need to add another &lt;strong&gt;decision node&lt;/strong&gt; to our decision tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-split.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-split.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty simple, right? That’s the basic idea behind decision trees.&lt;/p&gt;
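
&lt;p&gt;The two-node tree is just one more conditional (a sketch; I’m assuming blues fall below the y threshold and reds above it, as the figure suggests, with the y=2 threshold taken from Section 2.2 below):&lt;/p&gt;

```python
def classify(x, y):
    # Root decision node: test x < 2.
    if x < 2:
        # Second decision node: test y < 2 (assumed: blues below, reds above).
        return "blue" if y < 2 else "red"
    return "green"
```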

&lt;h2&gt;2. Training a Decision Tree&lt;/h2&gt;

&lt;p&gt;Let’s start training a decision tree! We’ll use the 3 class dataset again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg"&gt;&lt;/a&gt;The Dataset v2&lt;/p&gt;

&lt;h3&gt;2.1 Training a Decision Tree: The Root Node&lt;/h3&gt;

&lt;p&gt;Our first task is to determine the root decision node in our tree. Which feature (x or y) will it test on, and what will the test threshold be? For example, the root node in our tree from earlier used the x feature with a test threshold of 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-root.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-root.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Intuitively, we want a decision node that makes a “good” split, where “good” can be loosely defined as &lt;strong&gt;separating different classes as much as possible&lt;/strong&gt;. The root node above makes a “good” split: &lt;em&gt;all&lt;/em&gt; the greens are on the right, and &lt;em&gt;no&lt;/em&gt; greens are on the left.&lt;/p&gt;

&lt;p&gt;Thus, our goal is now to pick a root node that gives us the “best” split possible. &lt;strong&gt;But how do we quantify how good a split is?&lt;/strong&gt; It’s complicated. I wrote &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;an entire blog post about one way to do this using a metric called Gini Impurity&lt;/a&gt;. &lt;strong&gt;← I recommend reading it right now&lt;/strong&gt; before you continue - we’ll be using those concepts later in this post.&lt;/p&gt;




&lt;p&gt;Welcome back!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hopefully, you just read &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;my Gini Impurity post&lt;/a&gt;. If you didn’t, here’s a very short TL;DR: We can use Gini Impurity to calculate a value called &lt;strong&gt;Gini Gain&lt;/strong&gt; for any split. &lt;strong&gt;A better split has higher Gini Gain&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
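
&lt;p&gt;In code, Gini Impurity for a set of class labels is only a few lines (a sketch of the formula from that post: one minus the sum of squared class probabilities):&lt;/p&gt;

```python
from collections import Counter

def gini_impurity(labels):
    # Probability of misclassifying a random point if we label it
    # randomly according to the class distribution: 1 - sum(p_i^2).
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```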

&lt;p&gt;Back to the problem of determining our root decision node. Now that we have a way to evaluate splits, all we have to do is find the best split possible! For the sake of simplicity, we’re just going to &lt;strong&gt;try every possible split&lt;/strong&gt; and use the best one (the one with the highest Gini Gain). &lt;strong&gt;This is not the fastest way to find the best split&lt;/strong&gt;, but it is the easiest to understand.&lt;/p&gt;

&lt;p&gt;Trying every split means trying&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every feature (x or y).&lt;/li&gt;
&lt;li&gt;All “unique” thresholds. &lt;strong&gt;We only need to try thresholds that produce different splits.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
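
&lt;p&gt;One way to enumerate those unique thresholds for a feature: take midpoints between consecutive distinct values (a sketch; midpoints are just one convention - any value strictly between the same two neighboring points produces the same split):&lt;/p&gt;

```python
def candidate_thresholds(values):
    # Midpoints between consecutive distinct sorted values. Any two
    # thresholds between the same pair of points give the same split,
    # so midpoints are the only "unique" thresholds we need to try.
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]
```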

&lt;p&gt;For example, here are the thresholds we might select if we wanted to use the x coordinate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds-x.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds-x.svg"&gt;&lt;/a&gt;x Thresholds&lt;/p&gt;

&lt;p&gt;Let’s do an example Gini Gain calculation for the x=0.4 split.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj6dl96l2tlmpn7ragihr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj6dl96l2tlmpn7ragihr.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we calculate the Gini Impurity of the whole dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyavkzt1pfq6xrd3c3etf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyavkzt1pfq6xrd3c3etf.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we calculate the Gini Impurities of the two branches:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpibzd5vzmvuzcjcbkuno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpibzd5vzmvuzcjcbkuno.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we calculate Gini Gain by subtracting the weighted branch impurities from the original impurity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Farwvak8l6qeo57svtrs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Farwvak8l6qeo57svtrs6.png"&gt;&lt;/a&gt;&lt;/p&gt;
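
&lt;p&gt;The same three steps in code (a sketch; &lt;code&gt;gini&lt;/code&gt; implements the impurity formula from my Gini Impurity post, and it assumes the threshold actually separates the points so neither branch is empty):&lt;/p&gt;

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(points, labels, feature, threshold):
    # points: list of (x, y) tuples; feature: 0 for x, 1 for y.
    left  = [l for p, l in zip(points, labels) if p[feature] < threshold]
    right = [l for p, l in zip(points, labels) if p[feature] >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted
```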

&lt;blockquote&gt;
&lt;p&gt;Confused about what just happened? I told you you should’ve read &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;my Gini Impurity post&lt;/a&gt;. It’ll explain all of this Gini stuff.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can calculate Gini Gain for every possible split in the same way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnq4bevt5jflhihuzjdo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnq4bevt5jflhihuzjdo.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds.svg"&gt;&lt;/a&gt;All Thresholds&lt;/p&gt;

&lt;p&gt;After trying all thresholds for both x and y, we’ve found that the x=2 split has the highest Gini Gain, so we’ll make our root decision node use the x feature with a threshold of 2. Here’s what we’ve got so far:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build1.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build1.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Making progress!&lt;/p&gt;

&lt;h3&gt;2.2 Training a Decision Tree: The Second Node&lt;/h3&gt;

&lt;p&gt;Time to make our second decision node. Let’s (arbitrarily) go to the left branch. &lt;strong&gt;We’re now only using the datapoints that would take the left branch&lt;/strong&gt; (i.e. the datapoints satisfying x&amp;lt;2), specifically the 3 blues and 3 reds.&lt;/p&gt;

&lt;p&gt;To build our second decision node, &lt;strong&gt;we just do the same thing!&lt;/strong&gt; We try every possible split for the 6 datapoints we have and realize that y=2 is the best split. We make that into a decision node and now have this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build2.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our decision tree is almost done…&lt;/p&gt;

&lt;h3&gt;2.3 Training a Decision Tree: When to Stop?&lt;/h3&gt;

&lt;p&gt;Let’s keep it going and try to make a third decision node. We’ll use the right branch from the root node this time. The only datapoints in that branch are the 3 greens.&lt;/p&gt;

&lt;p&gt;Again, we try all the possible splits, but they all&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are equally good.&lt;/li&gt;
&lt;li&gt;Have a Gini Gain of 0 (the Gini Impurity was already 0 and can’t go any lower).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn’t make sense to add a decision node here because doing so wouldn’t improve our decision tree. Thus, we’ll make this node a &lt;strong&gt;leaf node&lt;/strong&gt; and slap the Green label on it. This means that &lt;strong&gt;we’ll classify any datapoint that reaches this node as Green&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If we continue to the 2 remaining nodes, the same thing will happen: we’ll make the bottom left node our Blue leaf node, and we’ll make the bottom right node our Red leaf node. That brings us to the final result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once all possible branches in our decision tree end in leaf nodes, we’re done.&lt;/strong&gt; We’ve trained a decision tree!&lt;/p&gt;
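
&lt;p&gt;The full training procedure fits in one short recursion: greedily take the split with the highest Gini Gain, recurse into each branch, and turn a node into a leaf once no split has positive gain. A minimal self-contained sketch (the toy dataset in the test is an invented stand-in for Dataset v2, not the exact points from the figures):&lt;/p&gt;

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(points, labels):
    # Returns a class label (leaf) or a dict (decision node).
    n, best = len(labels), (None, None, 0.0)  # gain must beat 0 to split
    for f in range(len(points[0])):           # each feature (x=0, y=1, ...)
        vals = sorted(set(p[f] for p in points))
        for a, b in zip(vals, vals[1:]):      # each unique midpoint threshold
            t = (a + b) / 2
            left  = [l for p, l in zip(points, labels) if p[f] < t]
            right = [l for p, l in zip(points, labels) if p[f] >= t]
            gain = (gini(labels)
                    - (len(left) / n) * gini(left)
                    - (len(right) / n) * gini(right))
            if gain > best[2]:
                best = (f, t, gain)
    f, t, _ = best
    if f is None:  # no split improves Gini Gain -> make a leaf node
        return Counter(labels).most_common(1)[0][0]
    go_left = [p[f] < t for p in points]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([p for p, g in zip(points, go_left) if g],
                           [l for l, g in zip(labels, go_left) if g]),
        "right": build_tree([p for p, g in zip(points, go_left) if not g],
                            [l for l, g in zip(labels, go_left) if not g]),
    }

def predict(tree, point):
    # Walk decision nodes until we hit a leaf label.
    while isinstance(tree, dict):
        branch = "left" if point[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree
```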

&lt;h2&gt;3. Random Forests 🌲🌳🌲🌳🌲&lt;/h2&gt;

&lt;p&gt;We’re finally ready to talk about Random Forests. Remember what I said earlier?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A Random Forest is actually just a bunch of Decision Trees bundled together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s true, but it’s a bit of a simplification.&lt;/p&gt;

&lt;h3&gt;3.1 Bagging&lt;/h3&gt;

&lt;p&gt;Consider the following algorithm to train a bundle of decision trees given a dataset of n points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sample, &lt;strong&gt;with replacement&lt;/strong&gt;, n training examples from the dataset.&lt;/li&gt;
&lt;li&gt;Train a decision tree on the n samples.&lt;/li&gt;
&lt;li&gt;Repeat t times, for some t.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To make a prediction using this model with t trees, we aggregate the predictions from the individual decision trees and either&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the &lt;strong&gt;majority vote&lt;/strong&gt; if our trees produce class labels (like colors).&lt;/li&gt;
&lt;li&gt;Take the &lt;strong&gt;average&lt;/strong&gt; if our trees produce numerical values (e.g. when predicting temperature, price, etc).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This technique is called &lt;strong&gt;bagging&lt;/strong&gt;, or &lt;a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating" rel="noopener noreferrer"&gt;&lt;strong&gt;b&lt;/strong&gt;ootstrap &lt;strong&gt;agg&lt;/strong&gt;regating&lt;/a&gt;. The sampling with replacement we did is known as a &lt;a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)" rel="noopener noreferrer"&gt;bootstrap&lt;/a&gt; sample.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Frandom-forest.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Frandom-forest.svg"&gt;&lt;/a&gt;Bagged Decision Trees predicting color&lt;/p&gt;
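
&lt;p&gt;The three training steps and the vote can be sketched like this (a minimal version; &lt;code&gt;train_tree&lt;/code&gt; and &lt;code&gt;predict_one&lt;/code&gt; are placeholders for whatever decision tree trainer you use):&lt;/p&gt;

```python
import random
from collections import Counter

def bag(points, labels, train_tree, t=10, seed=0):
    # Train t trees, each on its own bootstrap sample (drawn WITH replacement).
    rng = random.Random(seed)
    n = len(points)
    forest = []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size n
        forest.append(train_tree([points[i] for i in idx],
                                 [labels[i] for i in idx]))
    return forest

def predict_majority(forest, predict_one, point):
    # Aggregate by majority vote (use the mean instead for numerical outputs).
    votes = Counter(predict_one(tree, point) for tree in forest)
    return votes.most_common(1)[0][0]
```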

&lt;p&gt;Bagged decision trees are very close to Random Forests - they’re just missing one thing…&lt;/p&gt;

&lt;h3&gt;3.2 Bagging → Random Forest&lt;/h3&gt;

&lt;p&gt;Bagged decision trees have only one parameter: t, the number of trees.&lt;/p&gt;

&lt;p&gt;Random Forests have a second parameter that controls &lt;strong&gt;how many features to try when finding the best split&lt;/strong&gt;. Our simple dataset for this tutorial only had 2 features (x and y), but most datasets will have far more (hundreds or thousands).&lt;/p&gt;

&lt;p&gt;Suppose we had a dataset with p features. Instead of trying all features every time we make a new decision node, we &lt;strong&gt;only try a subset of the features&lt;/strong&gt;. We do this primarily to inject randomness that makes individual trees more unique and &lt;strong&gt;reduces correlation between trees&lt;/strong&gt;, which improves the forest’s performance overall. This technique is sometimes referred to as &lt;strong&gt;feature bagging&lt;/strong&gt;.&lt;/p&gt;
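
&lt;p&gt;The change itself is tiny: at each decision node, draw a random subset of the features and only search those for the best split. A sketch (the √p subset size is a common convention, not something fixed by the algorithm):&lt;/p&gt;

```python
import math
import random

def feature_subset(num_features, rng):
    # At each decision node, consider only a random subset of the p
    # features. sqrt(p) is a common default for classification forests.
    k = max(1, round(math.sqrt(num_features)))
    return rng.sample(range(num_features), k)
```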

&lt;h2&gt;4. Now What?&lt;/h2&gt;

&lt;p&gt;That’s a beginner’s introduction to Random Forests! A quick recap of what we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduced &lt;strong&gt;decision trees&lt;/strong&gt;, the building blocks of Random Forests.&lt;/li&gt;
&lt;li&gt;Learned how to train decision trees by iteratively making the best split possible.&lt;/li&gt;
&lt;li&gt;Defined &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;Gini Impurity&lt;/a&gt;, a metric used to quantify how “good” a split is.&lt;/li&gt;
&lt;li&gt;Saw that &lt;strong&gt;a random forest = a bunch of decision trees.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understood how &lt;strong&gt;bagging&lt;/strong&gt; combines predictions from multiple trees.&lt;/li&gt;
&lt;li&gt;Learned that &lt;strong&gt;feature bagging&lt;/strong&gt; is the difference between bagged decision trees and a random forest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few things you could do from here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with scikit-learn’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="noopener noreferrer"&gt;DecisionTreeClassifier&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" rel="noopener noreferrer"&gt;RandomForestClassifier&lt;/a&gt; classes on real datasets.&lt;/li&gt;
&lt;li&gt;Try writing a simple Decision Tree or Random Forest implementation from scratch. I’m happy to give guidance or code review! Just &lt;a href="https://twitter.com/victorczhou" rel="noopener noreferrer"&gt;tweet at me&lt;/a&gt; or &lt;a href="mailto:vzhou842@gmail.com"&gt;email me&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read about &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting" rel="noopener noreferrer"&gt;Gradient Boosted Decision Trees&lt;/a&gt; and play with &lt;a href="https://xgboost.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;XGBoost&lt;/a&gt;, a powerful gradient boosting library.&lt;/li&gt;
&lt;li&gt;Read about &lt;a href="https://en.wikipedia.org/wiki/Random_forest#ExtraTrees" rel="noopener noreferrer"&gt;ExtraTrees&lt;/a&gt;, an extension of Random Forests, or play with scikit-learn’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html" rel="noopener noreferrer"&gt;ExtraTreesClassifier&lt;/a&gt; class.&lt;/li&gt;
&lt;/ul&gt;
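
&lt;p&gt;For that first suggestion, a minimal scikit-learn session might look like this (the six points below are an invented stand-in for the toy dataset in this post):&lt;/p&gt;

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: x < 2 is blue (low y) or red (high y); x >= 2 is green.
X = [[1, 1], [1.5, 1.5], [1, 3], [1.5, 2.5], [3, 1], [3, 3]]
y = ["blue", "blue", "red", "red", "green", "green"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 0], [3, 2]]))
```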

&lt;p&gt;That concludes this tutorial. I like &lt;a href="https://dev.to/tag/machine-learning"&gt;writing about Machine Learning&lt;/a&gt; (but also other topics), so &lt;strong&gt;&lt;a href="http://eepurl.com/gf8JCX" rel="noopener noreferrer"&gt;subscribe&lt;/a&gt; if you want to get notified about new posts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>randomforests</category>
    </item>
    <item>
      <title>Why I Replaced Disqus and You Should Too</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Wed, 03 Apr 2019 02:27:50 +0000</pubDate>
      <link>https://dev.to/vzhou842/why-i-replaced-disqus-and-you-should-too-2o0e</link>
      <guid>https://dev.to/vzhou842/why-i-replaced-disqus-and-you-should-too-2o0e</guid>
      <description>&lt;p&gt;When I started &lt;a href="https://victorzhou.com" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;, I used &lt;a href="https://disqus.com/" rel="noopener noreferrer"&gt;Disqus&lt;/a&gt; for comments on posts. This was a natural choice: I'd seen sites use Disqus all over the internet, it was easy to setup, and they had a free tier. I happily integrated Disqus and moved on.&lt;/p&gt;

&lt;p&gt;Here's the thing: I've always known that using Disqus came at the cost of some page bloat. I've &lt;a href="https://victorzhou.com/blog/properly-size-images/" rel="noopener noreferrer"&gt;written about web performance&lt;/a&gt; before and generally strive to make my pages fast, but I just assumed having Disqus was worth the bit of extra weight. My logic: If Disqus were really so bloated, everyone would've migrated away from them by now. Surely Disqus prioritizes keeping their payload reasonably small, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I was wrong&lt;/strong&gt;. Last week, I finally did what I should've done at the beginning: benchmark it myself. Here are my results (benchmarked on &lt;a href="https://victorzhou.com/blog/why-you-should-use-webpack/" rel="noopener noreferrer"&gt;my Webpack post&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding Disqus increased my page weight by over 10x and my request count by over 6x&lt;/strong&gt;. That's ridiculous! I immediately started looking to replace Disqus - &lt;a href="https://developers.google.com/web/fundamentals/performance/why-performance-matters/" rel="noopener noreferrer"&gt;web performance is important&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;An Alternative: Commento&lt;/h2&gt;

&lt;p&gt;A while back, I saw a &lt;a href="https://news.ycombinator.com/item?id=19210697" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; post about a fast, privacy-focused alternative to Disqus called &lt;a href="https://commento.io/" rel="noopener noreferrer"&gt;Commento&lt;/a&gt;. Having learned my lesson, I benchmarked Commento before committing to it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests2.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What a difference. &lt;strong&gt;Commento is &lt;em&gt;orders of magnitude&lt;/em&gt; lighter than Disqus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It gets even better. Here are more reasons I was sold on Commento:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's &lt;a href="https://gitlab.com/commento" rel="noopener noreferrer"&gt;open source&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;It's &lt;a href="https://commento.io/privacy" rel="noopener noreferrer"&gt;privacy focused&lt;/a&gt; - it doesn't sell user data and tries to collect as little as possible. This is especially nice given that my blog's audience is probably more privacy-conscious than the average internet user.&lt;/li&gt;
&lt;li&gt;You can &lt;a href="https://commento.io/pricing" rel="noopener noreferrer"&gt;pay what you want&lt;/a&gt;. Disqus's free tier is ad-supported, and its cheapest paid tier is $9/month. Commento is actually cheaper (if you want it to be)!&lt;/li&gt;
&lt;li&gt;It's &lt;a href="https://docs.commento.io/configuration/frontend/" rel="noopener noreferrer"&gt;configurable&lt;/a&gt;. If you scroll down to the comments of this post, you'll see that the styling of the Commento integration matches the styling of the rest of the site.&lt;/li&gt;
&lt;li&gt;It has an &lt;strong&gt;Import from Disqus&lt;/strong&gt; tool that's easy to use. I was able to quickly migrate all of my old Disqus comments to Commento.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commento works great for me, but I'm not trying to say it's the right solution for everyone - there are several good, &lt;em&gt;lightweight&lt;/em&gt; commenting platforms out there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you still using Disqus?&lt;/strong&gt; Did you know how much bloat it adds to your page? What's keeping you from switching?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://victorzhou.com" rel="noopener noreferrer"&gt;victorzhou.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>bestpractices</category>
      <category>disqus</category>
    </item>
    <item>
      <title>How I Became a Programmer</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Thu, 14 Mar 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/how-i-became-a-programmer-2pe</link>
      <guid>https://dev.to/vzhou842/how-i-became-a-programmer-2pe</guid>
      <description>&lt;p&gt;It all started when I was 12 years old. 👶&lt;/p&gt;

&lt;p&gt;Back in those days, my friends and I all played &lt;a href="https://www.runescape.com"&gt;Runescape&lt;/a&gt;, a popular browser-based &lt;a href="https://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game"&gt;MMORPG&lt;/a&gt;. Those who’ve played Runescape will remember that it requires lots of &lt;a href="https://en.wikipedia.org/wiki/Grinding_(gaming)"&gt;grinding&lt;/a&gt; - the “best” players were the ones who spent the most time killing monsters and leveling up. Unfortunately, my parents didn’t let me play much - all of my friends had passed level 50 by the time I reached level 30. This level gap was &lt;del&gt;all I worried about day and night because nobody wants to play with someone 20 levels below them&lt;/del&gt; somewhat frustrating.&lt;/p&gt;

&lt;p&gt;The obvious solution was to&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a better version of Runescape that required less grinding, and then&lt;/li&gt;
&lt;li&gt;Convince millions of players to play my version instead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this foolproof 2-step plan in place, I set out to complete Step 1. I remember googling “&lt;em&gt;How was Runescape made&lt;/em&gt;,” reading that it was built in Java, and then googling “&lt;em&gt;How to code in Java&lt;/em&gt;.” Over the next 20 minutes, I would come to realize that learning Java on your own as a 12-year-old is not that easy. Thus, I made the brilliant decision to pivot: I would now make a better version of Runescape &lt;strong&gt;&lt;em&gt;without writing any code&lt;/em&gt;&lt;/strong&gt;. &lt;em&gt;I’ll just skip the parts that require coding&lt;/em&gt;, I thought. &lt;em&gt;This way, I’ll finish it faster, too.&lt;/em&gt; See? Brilliant.&lt;/p&gt;

&lt;p&gt;I ended up actually finding a way to make games without writing code: &lt;a href="https://www.yoyogames.com/gamemaker"&gt;GameMaker&lt;/a&gt;. I downloaded it and began making simple, codeless games. One of my favorite memories from middle school was bringing a USB loaded with a &lt;a href="https://en.wikipedia.org/wiki/Space_Invaders"&gt;Space Invaders&lt;/a&gt;-style game I’d made to the computer lab and passing it around between my friends. That proud feeling of seeing people use something I built has been driving me ever since.&lt;/p&gt;

&lt;p&gt;Despite my ambitions, I eventually realized that you can’t actually do much without writing code. Luckily, GameMaker supported a proprietary programming language called the &lt;a href="https://docs.yoyogames.com/source/dadiospice/002_reference/001_gml%20language%20overview/"&gt;GameMaker Language&lt;/a&gt; (GML) that had lots of examples and tutorials for it online. I reluctantly decided to learn a bit of GML so I could make more advanced games (read: Runescape). Those were my first &lt;code&gt;if&lt;/code&gt; statements and &lt;code&gt;for&lt;/code&gt; loops! That’s right. &lt;strong&gt;My first programming language was GML&lt;/strong&gt;. 😲&lt;/p&gt;

&lt;p&gt;Ever since those first few lines of GML, I’ve been hooked. In 9th grade, I began making iOS apps and competing in programming competitions. Out of high school, I landed my first software engineering internship and started more seriously pursuing a career in tech. In college, I got into web development and sold my first website.&lt;/p&gt;

&lt;p&gt;To summarize: I’ve spent a decade learning and building and &lt;em&gt;still&lt;/em&gt; haven’t taken down Runescape 😡. I guess everyone has their white whale…&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://victorzhou.com"&gt;victorzhou.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>career</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Can You Find The Bug in This Code?</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Sat, 09 Feb 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/can-you-find-the-bug-in-this-code-2k0h</link>
      <guid>https://dev.to/vzhou842/can-you-find-the-bug-in-this-code-2k0h</guid>
      <description>&lt;p&gt;Here’s a bit of Javascript that prints “Hello World!” on two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})()&lt;/span&gt;

  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})()&lt;/span&gt;
&lt;span class="p"&gt;})()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;…except it fails with a runtime error. Can you spot the bug without running the code?&lt;/p&gt;

&lt;p&gt;Scroll down for a hint.&lt;/p&gt;
















&lt;h2&gt;
  
  
  Hint
&lt;/h2&gt;

&lt;p&gt;Here’s the text of the error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: (intermediate value)(...) is not a function
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What’s going on?&lt;/p&gt;

&lt;p&gt;Scroll down for the solution.&lt;/p&gt;
















&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;One character fixes this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})();&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})()&lt;/span&gt;
&lt;span class="p"&gt;})()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Without that semicolon, the parentheses wrapping the second function are interpreted as a function call: the second function is passed as an argument to whatever the first IIFE returned. Here’s a rewrite that demonstrates what’s going on when the code is run without the semicolon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;&lt;span class="nx"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;)();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are 3 function invocations in that last line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;f1&lt;/code&gt; is called with no arguments&lt;/li&gt;
&lt;li&gt;The return value of &lt;code&gt;f1()&lt;/code&gt; is called with &lt;code&gt;f2&lt;/code&gt; as its only argument&lt;/li&gt;
&lt;li&gt;The return value of &lt;code&gt;f1()(f2)&lt;/code&gt; is called with no arguments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the return value of &lt;code&gt;f1()&lt;/code&gt; is not a function, the runtime throws a &lt;code&gt;TypeError&lt;/code&gt; during the second invocation.&lt;/p&gt;

&lt;p&gt;With the semicolon added, this becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;();(&lt;/span&gt;&lt;span class="nx"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;)();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which runs as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, you had this bug once?
&lt;/h2&gt;

&lt;p&gt;Yup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why would you ever write code with so many Immediately Invoked Function Expressions (&lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/IIFE"&gt;IIFEs&lt;/a&gt;)?
&lt;/h2&gt;

&lt;p&gt;It’s a long story - &lt;a href="https://victorzhou.com/blog/why-you-should-use-webpack/"&gt;this post&lt;/a&gt; explains how I wrote bad enough code to have this bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Always use semicolons&lt;/strong&gt;. This specific case was a bit contrived, but something similar could happen to you. Here’s another Hello World program that fails for a related reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I’ll leave figuring this one out as an exercise for you.&lt;/p&gt;

&lt;p&gt;Most JavaScript style guides require semicolons, including &lt;a href="https://google.github.io/styleguide/jsguide.html#formatting-semicolons-are-required"&gt;Google’s&lt;/a&gt;, &lt;a href="https://github.com/airbnb/javascript#semicolons"&gt;Airbnb’s&lt;/a&gt;, and &lt;a href="https://contribute.jquery.org/style-guide/js/#semicolons"&gt;jQuery’s&lt;/a&gt;. To summarize: &lt;strong&gt;always use semicolons&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>debugging</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building a Better Profanity Detection Library with scikit-learn</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Mon, 04 Feb 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/building-a-better-profanity-detection-library-with-scikit-learn-3b7f</link>
      <guid>https://dev.to/vzhou842/building-a-better-profanity-detection-library-with-scikit-learn-3b7f</guid>
      <description>&lt;p&gt;A few months ago, I needed a way to detect profanity in user-submitted text strings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Ai2fk4aGvplR7le_3PPajAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Ai2fk4aGvplR7le_3PPajAA.png"&gt;&lt;/a&gt;This shouldn’t be that hard, right?&lt;/p&gt;

&lt;p&gt;I ended up building and releasing my own library for this purpose called &lt;a href="https://github.com/vzhou842/profanity-check" rel="noopener noreferrer"&gt;profanity-check&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course, before I did that, I looked in the &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;Python Package Index&lt;/a&gt; (PyPI) for any existing libraries that could do this for me. The only half-decent results for the search query “profanity” were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/profanity/" rel="noopener noreferrer"&gt;profanity&lt;/a&gt; (the ideal package name)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/better-profanity/" rel="noopener noreferrer"&gt;better-profanity&lt;/a&gt;: &lt;em&gt;“Inspired from package&lt;/em&gt; &lt;a href="https://github.com/ben174/profanity" rel="noopener noreferrer"&gt;&lt;em&gt;profanity&lt;/em&gt;&lt;/a&gt; &lt;em&gt;of&lt;/em&gt; &lt;a href="https://github.com/ben174" rel="noopener noreferrer"&gt;&lt;em&gt;Ben Friedland&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, this library is much faster than the original one.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/profanityfilter/" rel="noopener noreferrer"&gt;profanityfilter&lt;/a&gt; (has 31 Github stars, which is 30 more than most of the other results have)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/profanity-filter/" rel="noopener noreferrer"&gt;profanity-filter&lt;/a&gt; (uses Machine Learning, enough said?!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Third-party libraries can sometimes be sketchy, though, so I did my due diligence on these 4 results.&lt;/p&gt;

&lt;h2&gt;
  
  
  profanity, better-profanity, and profanityfilter
&lt;/h2&gt;

&lt;p&gt;After a quick dig through the &lt;code&gt;profanity&lt;/code&gt; repository, I found a file named &lt;a href="https://github.com/ben174/profanity/blob/master/profanity/data/wordlist.txt" rel="noopener noreferrer"&gt;wordlist.txt&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0lTbmHR5WE7HZ8wCvLpqtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0lTbmHR5WE7HZ8wCvLpqtg.png"&gt;&lt;/a&gt;NSFW&lt;/p&gt;

&lt;p&gt;The entire &lt;code&gt;profanity&lt;/code&gt; library is just a wrapper over this list of 32 words! &lt;code&gt;profanity&lt;/code&gt; detects profanity simply by looking for one of these words.&lt;/p&gt;

&lt;p&gt;To my dismay, &lt;code&gt;better-profanity&lt;/code&gt; and &lt;code&gt;profanityfilter&lt;/code&gt; both took the same approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;better-profanity&lt;/code&gt; uses &lt;a href="https://github.com/snguyenthanh/better_profanity/blob/master/better_profanity/profanity_wordlist.txt" rel="noopener noreferrer"&gt;a 140-word wordlist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;profanityfilter&lt;/code&gt; uses &lt;a href="https://github.com/areebbeigh/profanityfilter/blob/master/profanityfilter/data/badwords.txt" rel="noopener noreferrer"&gt;a 418-word wordlist&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is bad because &lt;strong&gt;profanity detection libraries based on wordlists are extremely subjective.&lt;/strong&gt; For example, &lt;code&gt;better-profanity&lt;/code&gt;’s wordlist includes the word “suck.” Are you willing to say that any sentence containing the word “suck” is profane? Furthermore, any hard-coded list of bad words will inevitably be incomplete — do you think &lt;code&gt;profanity&lt;/code&gt;’s 32 bad words are the only ones out there?&lt;/p&gt;
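&lt;p&gt;To make that subjectivity concrete, here’s a toy sketch of what wordlist-based detection boils down to. This is my own illustration, not code from any of these libraries, and the two-word list is obviously a stand-in for the real ones:&lt;/p&gt;

```python
# Toy sketch of wordlist-based detection (an illustration, not code from
# profanity, better-profanity, or profanityfilter).
WORDLIST = {"suck", "damn"}  # imagine a few hundred entries here

def is_profane(text):
    # Flag the text if any whitespace-separated word is on the list.
    return any(word in WORDLIST for word in text.lower().split())

print(is_profane("vacuum cleaners suck up dust"))  # True: flagged, but harmless
print(is_profane("f4ck you"))  # False: obfuscated variants slip through
```

&lt;p&gt;Both failure modes show up immediately: innocuous sentences get flagged, and anything not spelled exactly like a list entry sails through.&lt;/p&gt;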

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fstatic%2Fb0cc99529a0fad11e9353fc7ff189e2f%2Fb8b3f%2Fxkcd-290.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fstatic%2Fb0cc99529a0fad11e9353fc7ff189e2f%2Fb8b3f%2Fxkcd-290.png" alt="xkcd 290"&gt;&lt;/a&gt;&lt;/p&gt;

  Fucking Blue Shells. source: &lt;a href="https://xkcd.com/290/" rel="noopener noreferrer"&gt;xkcd&lt;/a&gt;




&lt;p&gt;Having already ruled out 3 libraries, I put my hopes on the 4th and final one: &lt;code&gt;profanity-filter&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  profanity-filter
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;profanity-filter&lt;/code&gt; uses Machine Learning! Sweet!&lt;/p&gt;

&lt;p&gt;Turns out, it’s &lt;strong&gt;&lt;em&gt;really&lt;/em&gt;&lt;/strong&gt; slow. Here’s a benchmark I ran in December 2018 comparing (1) &lt;code&gt;profanity-filter&lt;/code&gt;, (2) my library &lt;code&gt;profanity-check&lt;/code&gt;, and (3) &lt;code&gt;profanity&lt;/code&gt; (the one with the list of 32 words):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AKRJEl4YHfSTk9PmmScIcUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AKRJEl4YHfSTk9PmmScIcUA.png"&gt;&lt;/a&gt;A human could probably do this faster than profanity-filter can&lt;/p&gt;

&lt;p&gt;I needed to be able to perform many predictions in real time, and &lt;code&gt;profanity-filter&lt;/code&gt; was not even close to being fast enough. But hey, maybe this is a classic tradeoff of accuracy for speed, right?&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ALYOeGE6vTXTAKhJ_W1fZgQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ALYOeGE6vTXTAKhJ_W1fZgQ.png"&gt;&lt;/a&gt;At least profanity-filter is not dead last this time&lt;/p&gt;

&lt;p&gt;None of the libraries I’d found on PyPI met my needs, so I built my own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building profanity-check, Part 1: Data
&lt;/h2&gt;

&lt;p&gt;I knew that I wanted &lt;code&gt;profanity-check&lt;/code&gt; to base its classifications on data to avoid being subjective &lt;em&gt;(read: to be able to say I used Machine Learning)&lt;/em&gt;. I put together a combined dataset from two publicly-available sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the “Twitter” dataset from &lt;a href="https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data" rel="noopener noreferrer"&gt;t-davidson/hate-speech-and-offensive-language&lt;/a&gt;, which contains tweets scraped from Twitter.&lt;/li&gt;
&lt;li&gt;the “Wikipedia” dataset from &lt;a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge" rel="noopener noreferrer"&gt;this Kaggle competition&lt;/a&gt; published by Alphabet’s &lt;a href="https://conversationai.github.io/" rel="noopener noreferrer"&gt;Conversation AI&lt;/a&gt; team, which contains comments from Wikipedia’s talk page edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these datasets contains text samples hand-labeled by humans through crowdsourcing sites like &lt;a href="https://www.figure-eight.com/" rel="noopener noreferrer"&gt;Figure Eight&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here’s what my dataset ended up looking like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ABw_we8cbs-WOpWXOCxzSTg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ABw_we8cbs-WOpWXOCxzSTg.png"&gt;&lt;/a&gt;Combined = Tweets + Wikipedia&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Twitter dataset has a column named &lt;code&gt;class&lt;/code&gt; that’s 0 if the tweet contains hate speech, 1 if it contains offensive language, and 2 if it contains neither. I classified any tweet with a &lt;code&gt;class&lt;/code&gt; of 2 as “Not Offensive” and all other tweets as “Offensive.”&lt;/p&gt;

&lt;p&gt;The Wikipedia dataset has several binary columns (e.g. &lt;code&gt;toxic&lt;/code&gt; or &lt;code&gt;threat&lt;/code&gt;) that represent whether or not that text contains that type of toxicity. I classified any text that contained &lt;em&gt;any&lt;/em&gt; of the types of toxicity as “Offensive” and all other texts as “Not Offensive.”&lt;/p&gt;
&lt;/blockquote&gt;
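&lt;p&gt;In code, the label mapping described above might look something like this. It’s a stdlib-only sketch; the dict rows are made-up stand-ins for rows of the real Twitter and Wikipedia CSVs:&lt;/p&gt;

```python
# Sketch of the binarization rules above; the dicts stand in for rows of
# the real datasets.
twitter_rows = [{"class": 0}, {"class": 1}, {"class": 2}]
# class 2 means "neither hate speech nor offensive language"
twitter_labels = [0 if row["class"] == 2 else 1 for row in twitter_rows]

wiki_rows = [{"toxic": 1, "threat": 0}, {"toxic": 0, "threat": 0}]
# "Offensive" if ANY toxicity column is set
wiki_labels = [1 if any(row.values()) else 0 for row in wiki_rows]

print(twitter_labels)  # [1, 1, 0]
print(wiki_labels)     # [1, 0]
```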

&lt;h2&gt;
  
  
  Building profanity-check, Part 2: Training
&lt;/h2&gt;

&lt;p&gt;Now armed with a cleaned, combined dataset (which you can &lt;a href="https://github.com/vzhou842/profanity-check/blob/master/profanity_check/data/clean_data.csv" rel="noopener noreferrer"&gt;download here&lt;/a&gt;), I was ready to train the model!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’m skipping over how I cleaned the dataset because, honestly, it’s pretty boring — if you’re interested in learning more about preprocessing text datasets, check out &lt;a href="https://machinelearningmastery.com/clean-text-machine-learning-python/" rel="noopener noreferrer"&gt;this article&lt;/a&gt; or &lt;a href="https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908" rel="noopener noreferrer"&gt;this post&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.externals import joblib

# Read in data
data = pd.read_csv('clean_data.csv')
texts = data['text'].astype(str)
y = data['is_offensive']

# Vectorize the text
vectorizer = CountVectorizer(stop_words='english', min_df=0.0001)
X = vectorizer.fit_transform(texts)

# Train the model
model = LinearSVC(class_weight="balanced", dual=False, tol=1e-2, max_iter=1e5)
cclf = CalibratedClassifierCV(base_estimator=model)
cclf.fit(X, y)

# Save the model
joblib.dump(vectorizer, 'vectorizer.joblib')
joblib.dump(cclf, 'model.joblib')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

  Are you also surprised the code is so short?
  Apparently &lt;a href="https://scikit-learn.org/" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt; does everything.






&lt;p&gt;Two major steps are happening here: (1) vectorization and (2) training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vectorization: Bag of Words
&lt;/h3&gt;

&lt;p&gt;I used &lt;code&gt;scikit-learn&lt;/code&gt;’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" rel="noopener noreferrer"&gt;CountVectorizer&lt;/a&gt; class, which basically turns any text string into a vector by counting how many times each given word appears. This is known as a &lt;a href="https://en.wikipedia.org/wiki/Bag-of-words_model" rel="noopener noreferrer"&gt;Bag of Words&lt;/a&gt; (BOW) representation. For example, if the only words in the English language were &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;sat&lt;/code&gt;, and &lt;code&gt;hat&lt;/code&gt;, a possible vectorization of the sentence &lt;code&gt;the cat sat in the hat&lt;/code&gt; might be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Asbnts1u_QFB_V-X5DSC3pg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Asbnts1u_QFB_V-X5DSC3pg.png"&gt;&lt;/a&gt;“the cat sat in the hat” -&amp;gt; [2, 1, 1, 1, 1]&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;???&lt;/code&gt; represents any unknown word, which for this sentence is &lt;code&gt;in&lt;/code&gt;. Any sentence can be represented in this way as counts of &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;sat&lt;/code&gt;, &lt;code&gt;hat&lt;/code&gt;, and &lt;code&gt;???&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2A-wONWZDab2gNQP3Rfdpt_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2A-wONWZDab2gNQP3Rfdpt_A.png"&gt;&lt;/a&gt;A handy reference table for the next time you need to vectorize “cat cat cat cat cat”&lt;/p&gt;

&lt;p&gt;Of course, there are far more words in the English language, so in the code above I use the &lt;code&gt;fit_transform()&lt;/code&gt; method, which does 2 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fit:&lt;/strong&gt; learns a vocabulary by looking at all words that appear in the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; turns each text string in the dataset into its vector form.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training: Linear SVM
&lt;/h3&gt;

&lt;p&gt;The model I decided to use was a Linear Support Vector Machine (SVM), which is implemented by &lt;code&gt;scikit-learn&lt;/code&gt;’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html" rel="noopener noreferrer"&gt;LinearSVC&lt;/a&gt; class. &lt;a href="https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72" rel="noopener noreferrer"&gt;This post&lt;/a&gt; and &lt;a href="https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt; are good introductions if you don’t know what SVMs are.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html" rel="noopener noreferrer"&gt;CalibratedClassifierCV&lt;/a&gt; in the code above exists as a wrapper to give me the &lt;code&gt;predict_proba()&lt;/code&gt; method, which returns a probability for each class instead of just a classification. You can pretty much just ignore it if that last sentence made no sense to you, though.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s one (simplified) way you could think about why the Linear SVM works: during the training process, the model learns which words are “bad” and how “bad” they are because those words appear more often in offensive texts. &lt;strong&gt;It’s as if the training process is picking out the “bad” words for me&lt;/strong&gt;, which is much better than using a wordlist I write myself!&lt;/p&gt;

&lt;p&gt;A Linear SVM combines the best aspects of the other profanity detection libraries I found: it’s fast enough to run in real-time yet robust enough to handle many different kinds of profanity.&lt;/p&gt;
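&lt;p&gt;If you’re curious what the &lt;code&gt;CalibratedClassifierCV&lt;/code&gt; wrapper actually buys you, here’s a toy end-to-end run on a made-up two-sentence “dataset” (illustration only; nothing like the real training data):&lt;/p&gt;

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up dataset: 1 = offensive, 0 = not offensive.
texts = ["you are awful", "have a nice day"] * 10
y = [1, 0] * 10

vec = CountVectorizer()
X = vec.fit_transform(texts)

# LinearSVC alone only has predict(); wrapping it in
# CalibratedClassifierCV adds predict_proba().
cclf = CalibratedClassifierCV(LinearSVC(dual=False))
cclf.fit(X, y)

probs = cclf.predict_proba(vec.transform(["you are awful"]))
print(probs.shape)  # (1, 2): one row, columns P(not offensive), P(offensive)
```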

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;That being said, &lt;code&gt;profanity-check&lt;/code&gt; is far from perfect. Let me be clear: take predictions from &lt;code&gt;profanity-check&lt;/code&gt; with a grain of salt because &lt;strong&gt;it makes mistakes.&lt;/strong&gt; For example, it’s not good at picking up less common variants of profanities like “f4ck you” or “you b1tch” because they don’t appear often enough in the training data. You’ll never be able to detect &lt;em&gt;all&lt;/em&gt; profanity (people will come up with new ways to evade filters), but &lt;code&gt;profanity-check&lt;/code&gt; does a good job of catching most of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  profanity-check
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;profanity-check&lt;/code&gt; is open source and available on PyPI! To use it, simply&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install profanity-check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How could &lt;code&gt;profanity-check&lt;/code&gt; be even better? Feel free to reach out or comment with any thoughts or suggestions!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally posted on &lt;a href="https://victorzhou.com/blog/better-profanity-detection-with-scikit-learn/" rel="noopener noreferrer"&gt;victorzhou.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>python</category>
      <category>scikitlearn</category>
    </item>
  </channel>
</rss>
