<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Victor Zhou</title>
    <description>The latest articles on DEV Community by Victor Zhou (@vzhou842).</description>
    <link>https://dev.to/vzhou842</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152146%2Fd37b2113-1b6e-4273-bf6a-3d939b0a3d9d.png</url>
      <title>DEV Community: Victor Zhou</title>
      <link>https://dev.to/vzhou842</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vzhou842"/>
    <language>en</language>
    <item>
      <title>Random Forests for Complete Beginners</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Wed, 10 Apr 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/random-forests-for-complete-beginners-4odd</link>
      <guid>https://dev.to/vzhou842/random-forests-for-complete-beginners-4odd</guid>
      <description>&lt;p&gt;In my opinion, most Machine Learning tutorials aren’t beginner-friendly enough.&lt;/p&gt;

&lt;p&gt;Last month, I wrote an &lt;a href="https://victorzhou.com/blog/intro-to-neural-networks/" rel="noopener noreferrer"&gt;introduction to Neural Networks &lt;strong&gt;for complete beginners&lt;/strong&gt;&lt;/a&gt;. This post will adopt the same strategy, meaning it again &lt;strong&gt;assumes ZERO prior knowledge of machine learning&lt;/strong&gt;. We’ll learn what Random Forests are and how they work from the ground up.&lt;/p&gt;

&lt;p&gt;Ready? Let’s dive in.&lt;/p&gt;

&lt;h2&gt;1. Decision Trees 🌲&lt;/h2&gt;

&lt;p&gt;A Random Forest 🌲🌲🌲 is actually just a bunch of Decision Trees 🌲 bundled together (ohhhhh 💡 that’s why it’s called a &lt;em&gt;forest&lt;/em&gt;). We need to talk about trees before we can get into forests.&lt;/p&gt;

&lt;p&gt;Look at the following dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset.svg"&gt;&lt;/a&gt;The Dataset&lt;/p&gt;

&lt;p&gt;If I told you that there was a new point with an x coordinate of 1, what color do you think it’d be?&lt;/p&gt;

&lt;p&gt;Blue, right?&lt;/p&gt;

&lt;p&gt;You just evaluated a decision tree in your head:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s a simple decision tree with one &lt;strong&gt;decision node&lt;/strong&gt; that &lt;strong&gt;tests&lt;/strong&gt; x &amp;lt; 2. If the test passes (x &amp;lt; 2), we take the left &lt;strong&gt;branch&lt;/strong&gt; and pick Blue. If the test fails (x ≥ 2), we take the right &lt;strong&gt;branch&lt;/strong&gt; and pick Green.&lt;/p&gt;
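
&lt;p&gt;In code, that one-node tree is nothing more than a conditional (a sketch; the class names come from the figure):&lt;/p&gt;

```python
def classify(x):
    # The whole tree is one decision node: test x < 2.
    if x < 2:
        return "blue"   # left branch
    return "green"      # right branch
```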

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset-split.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset-split.svg"&gt;&lt;/a&gt;The Dataset, split at x=2&lt;/p&gt;

&lt;p&gt;Decision Trees are often used to answer that kind of question: given a &lt;strong&gt;labelled&lt;/strong&gt; dataset, how should we &lt;strong&gt;classify&lt;/strong&gt; new samples?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Labelled&lt;/strong&gt;: Our dataset is &lt;em&gt;labelled&lt;/em&gt; because each point has a &lt;strong&gt;class&lt;/strong&gt; (color): blue or green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify&lt;/strong&gt;: To &lt;em&gt;classify&lt;/em&gt; a new datapoint is to assign a class (color) to it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s a dataset that has 3 classes now instead of 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg"&gt;&lt;/a&gt;The Dataset v2&lt;/p&gt;

&lt;p&gt;Our old decision tree doesn’t work so well anymore. Given a new point (x,y),&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If x≥2, we can still confidently classify it as green. &lt;/li&gt;
&lt;li&gt;If x&amp;lt;2, we can’t immediately classify it as blue - it could be red, too.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need to add another &lt;strong&gt;decision node&lt;/strong&gt; to our decision tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-split.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-split.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty simple, right? That’s the basic idea behind decision trees.&lt;/p&gt;
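
&lt;p&gt;The two-node tree is just one more conditional (a sketch; I’m assuming blues fall below the y threshold and reds above it, as the figure suggests, with the y=2 threshold taken from Section 2.2 below):&lt;/p&gt;

```python
def classify(x, y):
    # Root decision node: test x < 2.
    if x < 2:
        # Second decision node: test y < 2 (assumed: blues below, reds above).
        return "blue" if y < 2 else "red"
    return "green"
```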

&lt;h2&gt;2. Training a Decision Tree&lt;/h2&gt;

&lt;p&gt;Let’s start training a decision tree! We’ll use the 3 class dataset again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2.svg"&gt;&lt;/a&gt;The Dataset v2&lt;/p&gt;

&lt;h3&gt;2.1 Training a Decision Tree: The Root Node&lt;/h3&gt;

&lt;p&gt;Our first task is to determine the root decision node in our tree. Which feature (x or y) will it test on, and what will the test threshold be? For example, the root node in our tree from earlier used the x feature with a test threshold of 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-root.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-root.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Intuitively, we want a decision node that makes a “good” split, where “good” can be loosely defined as &lt;strong&gt;separating different classes as much as possible&lt;/strong&gt;. The root node above makes a “good” split: &lt;em&gt;all&lt;/em&gt; the greens are on the right, and &lt;em&gt;no&lt;/em&gt; greens are on the left.&lt;/p&gt;

&lt;p&gt;Thus, our goal is now to pick a root node that gives us the “best” split possible. &lt;strong&gt;But how do we quantify how good a split is?&lt;/strong&gt; It’s complicated. I wrote &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;an entire blog post about one way to do this using a metric called Gini Impurity&lt;/a&gt;. &lt;strong&gt;← I recommend reading it right now&lt;/strong&gt; before you continue - we’ll be using those concepts later in this post.&lt;/p&gt;




&lt;p&gt;Welcome back!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hopefully, you just read &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;my Gini Impurity post&lt;/a&gt;. If you didn’t, here’s a very short TL;DR: We can use Gini Impurity to calculate a value called &lt;strong&gt;Gini Gain&lt;/strong&gt; for any split. &lt;strong&gt;A better split has higher Gini Gain&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
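
&lt;p&gt;In code, Gini Impurity for a set of class labels is only a few lines (a sketch of the formula from that post: one minus the sum of squared class probabilities):&lt;/p&gt;

```python
from collections import Counter

def gini_impurity(labels):
    # Probability of misclassifying a random point if we label it
    # randomly according to the class distribution: 1 - sum(p_i^2).
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```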

&lt;p&gt;Back to the problem of determining our root decision node. Now that we have a way to evaluate splits, all we have to do is find the best split possible! For the sake of simplicity, we’re just going to &lt;strong&gt;try every possible split&lt;/strong&gt; and use the best one (the one with the highest Gini Gain). &lt;strong&gt;This is not the fastest way to find the best split&lt;/strong&gt;, but it is the easiest to understand.&lt;/p&gt;

&lt;p&gt;Trying every split means trying&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every feature (x or y).&lt;/li&gt;
&lt;li&gt;All “unique” thresholds. &lt;strong&gt;We only need to try thresholds that produce different splits.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
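
&lt;p&gt;One way to enumerate those unique thresholds for a feature: take midpoints between consecutive distinct values (a sketch; midpoints are just one convention - any value strictly between the same two neighboring points produces the same split):&lt;/p&gt;

```python
def candidate_thresholds(values):
    # Midpoints between consecutive distinct sorted values. Any two
    # thresholds between the same pair of points give the same split,
    # so midpoints are the only "unique" thresholds we need to try.
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]
```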

&lt;p&gt;For example, here are the thresholds we might select if we wanted to use the x coordinate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds-x.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds-x.svg"&gt;&lt;/a&gt;x Thresholds&lt;/p&gt;

&lt;p&gt;Let’s do an example Gini Gain calculation for the x=0.4 split.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj6dl96l2tlmpn7ragihr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj6dl96l2tlmpn7ragihr.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we calculate the Gini Impurity of the whole dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyavkzt1pfq6xrd3c3etf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyavkzt1pfq6xrd3c3etf.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we calculate the Gini Impurities of the two branches:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpibzd5vzmvuzcjcbkuno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpibzd5vzmvuzcjcbkuno.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we calculate Gini Gain by subtracting the weighted branch impurities from the original impurity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Farwvak8l6qeo57svtrs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Farwvak8l6qeo57svtrs6.png"&gt;&lt;/a&gt;&lt;/p&gt;
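
&lt;p&gt;The same three steps in code (a sketch; &lt;code&gt;gini&lt;/code&gt; implements the impurity formula from my Gini Impurity post, and it assumes the threshold actually separates the points so neither branch is empty):&lt;/p&gt;

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(points, labels, feature, threshold):
    # points: list of (x, y) tuples; feature: 0 for x, 1 for y.
    left  = [l for p, l in zip(points, labels) if p[feature] < threshold]
    right = [l for p, l in zip(points, labels) if p[feature] >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted
```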

&lt;blockquote&gt;
&lt;p&gt;Confused about what just happened? I told you you should’ve read &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;my Gini Impurity post&lt;/a&gt;. It’ll explain all of this Gini stuff.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can calculate Gini Gain for every possible split in the same way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnq4bevt5jflhihuzjdo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnq4bevt5jflhihuzjdo.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdataset2-thresholds.svg"&gt;&lt;/a&gt;All Thresholds&lt;/p&gt;

&lt;p&gt;After trying all thresholds for both x and y, we’ve found that the x=2 split has the highest Gini Gain, so we’ll make our root decision node use the x feature with a threshold of 2. Here’s what we’ve got so far:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build1.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build1.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Making progress!&lt;/p&gt;

&lt;h3&gt;2.2 Training a Decision Tree: The Second Node&lt;/h3&gt;

&lt;p&gt;Time to make our second decision node. Let’s (arbitrarily) go to the left branch. &lt;strong&gt;We’re now only using the datapoints that would take the left branch&lt;/strong&gt; (i.e. the datapoints satisfying x&amp;lt;2), specifically the 3 blues and 3 reds.&lt;/p&gt;

&lt;p&gt;To build our second decision node, &lt;strong&gt;we just do the same thing!&lt;/strong&gt; We try every possible split for the 6 datapoints we have and realize that y=2 is the best split. We make that into a decision node and now have this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2-build2.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our decision tree is almost done…&lt;/p&gt;

&lt;h3&gt;2.3 Training a Decision Tree: When to Stop?&lt;/h3&gt;

&lt;p&gt;Let’s keep it going and try to make a third decision node. We’ll use the right branch from the root node this time. The only datapoints in that branch are the 3 greens.&lt;/p&gt;

&lt;p&gt;Again, we try all the possible splits, but they all&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are equally good.&lt;/li&gt;
&lt;li&gt;Have a Gini Gain of 0 (the Gini Impurity was already 0 and can’t go any lower).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn’t make sense to add a decision node here because doing so wouldn’t improve our decision tree. Thus, we’ll make this node a &lt;strong&gt;leaf node&lt;/strong&gt; and slap the Green label on it. This means that &lt;strong&gt;we’ll classify any datapoint that reaches this node as Green&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If we continue to the 2 remaining nodes, the same thing will happen: we’ll make the bottom left node our Blue leaf node, and we’ll make the bottom right node our Red leaf node. That brings us to the final result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Fdecision-tree2.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once all possible branches in our decision tree end in leaf nodes, we’re done.&lt;/strong&gt; We’ve trained a decision tree!&lt;/p&gt;
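
&lt;p&gt;The full training procedure fits in one short recursion: greedily take the split with the highest Gini Gain, recurse into each branch, and turn a node into a leaf once no split has positive gain. A minimal self-contained sketch (the toy dataset in the test is an invented stand-in for Dataset v2, not the exact points from the figures):&lt;/p&gt;

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(points, labels):
    # Returns a class label (leaf) or a dict (decision node).
    n, best = len(labels), (None, None, 0.0)  # gain must beat 0 to split
    for f in range(len(points[0])):           # each feature (x=0, y=1, ...)
        vals = sorted(set(p[f] for p in points))
        for a, b in zip(vals, vals[1:]):      # each unique midpoint threshold
            t = (a + b) / 2
            left  = [l for p, l in zip(points, labels) if p[f] < t]
            right = [l for p, l in zip(points, labels) if p[f] >= t]
            gain = (gini(labels)
                    - (len(left) / n) * gini(left)
                    - (len(right) / n) * gini(right))
            if gain > best[2]:
                best = (f, t, gain)
    f, t, _ = best
    if f is None:  # no split improves Gini Gain -> make a leaf node
        return Counter(labels).most_common(1)[0][0]
    go_left = [p[f] < t for p in points]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([p for p, g in zip(points, go_left) if g],
                           [l for l, g in zip(labels, go_left) if g]),
        "right": build_tree([p for p, g in zip(points, go_left) if not g],
                            [l for l, g in zip(labels, go_left) if not g]),
    }

def predict(tree, point):
    # Walk decision nodes until we hit a leaf label.
    while isinstance(tree, dict):
        branch = "left" if point[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree
```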

&lt;h2&gt;3. Random Forests 🌲🌳🌲🌳🌲&lt;/h2&gt;

&lt;p&gt;We’re finally ready to talk about Random Forests. Remember what I said earlier?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A Random Forest is actually just a bunch of Decision Trees bundled together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s true, but it’s a bit of a simplification.&lt;/p&gt;

&lt;h3&gt;3.1 Bagging&lt;/h3&gt;

&lt;p&gt;Consider the following algorithm to train a bundle of decision trees given a dataset of n points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sample, &lt;strong&gt;with replacement&lt;/strong&gt;, n training examples from the dataset.&lt;/li&gt;
&lt;li&gt;Train a decision tree on the n samples.&lt;/li&gt;
&lt;li&gt;Repeat t times, for some t.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To make a prediction using this model with t trees, we aggregate the predictions from the individual decision trees and either&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the &lt;strong&gt;majority vote&lt;/strong&gt; if our trees produce class labels (like colors).&lt;/li&gt;
&lt;li&gt;Take the &lt;strong&gt;average&lt;/strong&gt; if our trees produce numerical values (e.g. when predicting temperature, price, etc).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This technique is called &lt;strong&gt;bagging&lt;/strong&gt;, or &lt;a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating" rel="noopener noreferrer"&gt;&lt;strong&gt;b&lt;/strong&gt;ootstrap &lt;strong&gt;agg&lt;/strong&gt;regating&lt;/a&gt;. The sampling with replacement we did is known as a &lt;a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)" rel="noopener noreferrer"&gt;bootstrap&lt;/a&gt; sample.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Frandom-forest.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Frandom-forest-post%2Frandom-forest.svg"&gt;&lt;/a&gt;Bagged Decision Trees predicting color&lt;/p&gt;
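
&lt;p&gt;The three training steps and the vote can be sketched like this (a minimal version; &lt;code&gt;train_tree&lt;/code&gt; and &lt;code&gt;predict_one&lt;/code&gt; are placeholders for whatever decision tree trainer you use):&lt;/p&gt;

```python
import random
from collections import Counter

def bag(points, labels, train_tree, t=10, seed=0):
    # Train t trees, each on its own bootstrap sample (drawn WITH replacement).
    rng = random.Random(seed)
    n = len(points)
    forest = []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size n
        forest.append(train_tree([points[i] for i in idx],
                                 [labels[i] for i in idx]))
    return forest

def predict_majority(forest, predict_one, point):
    # Aggregate by majority vote (use the mean instead for numerical outputs).
    votes = Counter(predict_one(tree, point) for tree in forest)
    return votes.most_common(1)[0][0]
```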

&lt;p&gt;Bagged decision trees are very close to Random Forests - they’re just missing one thing…&lt;/p&gt;

&lt;h3&gt;3.2 Bagging → Random Forest&lt;/h3&gt;

&lt;p&gt;Bagged decision trees have only one parameter: t, the number of trees.&lt;/p&gt;

&lt;p&gt;Random Forests have a second parameter that controls &lt;strong&gt;how many features to try when finding the best split&lt;/strong&gt;. Our simple dataset for this tutorial only had 2 features (x and y), but most datasets will have far more (hundreds or thousands).&lt;/p&gt;

&lt;p&gt;Suppose we had a dataset with p features. Instead of trying all features every time we make a new decision node, we &lt;strong&gt;only try a subset of the features&lt;/strong&gt;. We do this primarily to inject randomness that makes individual trees more unique and &lt;strong&gt;reduces correlation between trees&lt;/strong&gt;, which improves the forest’s performance overall. This technique is sometimes referred to as &lt;strong&gt;feature bagging&lt;/strong&gt;.&lt;/p&gt;
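
&lt;p&gt;The change itself is tiny: at each decision node, draw a random subset of the features and only search those for the best split. A sketch (the √p subset size is a common convention, not something fixed by the algorithm):&lt;/p&gt;

```python
import math
import random

def feature_subset(num_features, rng):
    # At each decision node, consider only a random subset of the p
    # features. sqrt(p) is a common default for classification forests.
    k = max(1, round(math.sqrt(num_features)))
    return rng.sample(range(num_features), k)
```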

&lt;h2&gt;4. Now What?&lt;/h2&gt;

&lt;p&gt;That’s a beginner’s introduction to Random Forests! A quick recap of what we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduced &lt;strong&gt;decision trees&lt;/strong&gt;, the building blocks of Random Forests.&lt;/li&gt;
&lt;li&gt;Learned how to train decision trees by iteratively making the best split possible.&lt;/li&gt;
&lt;li&gt;Defined &lt;a href="https://victorzhou.com/blog/gini-impurity/" rel="noopener noreferrer"&gt;Gini Impurity&lt;/a&gt;, a metric used to quantify how “good” a split is.&lt;/li&gt;
&lt;li&gt;Saw that &lt;strong&gt;a random forest = a bunch of decision trees.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understood how &lt;strong&gt;bagging&lt;/strong&gt; combines predictions from multiple trees.&lt;/li&gt;
&lt;li&gt;Learned that &lt;strong&gt;feature bagging&lt;/strong&gt; is the difference between bagged decision trees and a random forest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few things you could do from here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with scikit-learn’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="noopener noreferrer"&gt;DecisionTreeClassifier&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" rel="noopener noreferrer"&gt;RandomForestClassifier&lt;/a&gt; classes on real datasets.&lt;/li&gt;
&lt;li&gt;Try writing a simple Decision Tree or Random Forest implementation from scratch. I’m happy to give guidance or code review! Just &lt;a href="https://twitter.com/victorczhou" rel="noopener noreferrer"&gt;tweet at me&lt;/a&gt; or &lt;a href="mailto:vzhou842@gmail.com"&gt;email me&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read about &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting" rel="noopener noreferrer"&gt;Gradient Boosted Decision Trees&lt;/a&gt; and play with &lt;a href="https://xgboost.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;XGBoost&lt;/a&gt;, a powerful gradient boosting library.&lt;/li&gt;
&lt;li&gt;Read about &lt;a href="https://en.wikipedia.org/wiki/Random_forest#ExtraTrees" rel="noopener noreferrer"&gt;ExtraTrees&lt;/a&gt;, an extension of Random Forests, or play with scikit-learn’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html" rel="noopener noreferrer"&gt;ExtraTreesClassifier&lt;/a&gt; class.&lt;/li&gt;
&lt;/ul&gt;
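
&lt;p&gt;For that first suggestion, a minimal scikit-learn session might look like this (the six points below are an invented stand-in for the toy dataset in this post):&lt;/p&gt;

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: x < 2 is blue (low y) or red (high y); x >= 2 is green.
X = [[1, 1], [1.5, 1.5], [1, 3], [1.5, 2.5], [3, 1], [3, 3]]
y = ["blue", "blue", "red", "red", "green", "green"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 0], [3, 2]]))
```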

&lt;p&gt;That concludes this tutorial. I like &lt;a href="https://dev.to/tag/machine-learning"&gt;writing about Machine Learning&lt;/a&gt; (but also other topics), so &lt;strong&gt;&lt;a href="http://eepurl.com/gf8JCX" rel="noopener noreferrer"&gt;subscribe&lt;/a&gt; if you want to get notified about new posts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>randomforests</category>
    </item>
    <item>
      <title>Why I Replaced Disqus and You Should Too</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Wed, 03 Apr 2019 02:27:50 +0000</pubDate>
      <link>https://dev.to/vzhou842/why-i-replaced-disqus-and-you-should-too-2o0e</link>
      <guid>https://dev.to/vzhou842/why-i-replaced-disqus-and-you-should-too-2o0e</guid>
      <description>&lt;p&gt;When I started &lt;a href="https://victorzhou.com" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;, I used &lt;a href="https://disqus.com/" rel="noopener noreferrer"&gt;Disqus&lt;/a&gt; for comments on posts. This was a natural choice: I'd seen sites use Disqus all over the internet, it was easy to setup, and they had a free tier. I happily integrated Disqus and moved on.&lt;/p&gt;

&lt;p&gt;Here's the thing: I've always known that using Disqus came at the cost of some page bloat. I've &lt;a href="https://victorzhou.com/blog/properly-size-images/" rel="noopener noreferrer"&gt;written about web performance&lt;/a&gt; before and generally strive to make my pages fast, but I just assumed having Disqus was worth the bit of extra weight. My logic: If Disqus were really so bloated, everyone would've migrated away from them by now. Surely Disqus prioritizes keeping their payload reasonably small, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I was wrong&lt;/strong&gt;. Last week, I finally did what I should've done at the beginning: benchmark it myself. Here are my results (benchmarked on &lt;a href="https://victorzhou.com/blog/why-you-should-use-webpack/" rel="noopener noreferrer"&gt;my Webpack post&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding Disqus increased my page weight by over 10x and my request count by over 6x&lt;/strong&gt;. That's ridiculous! I immediately started looking to replace Disqus - &lt;a href="https://developers.google.com/web/fundamentals/performance/why-performance-matters/" rel="noopener noreferrer"&gt;web performance is important&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;An Alternative: Commento&lt;/h2&gt;

&lt;p&gt;A while back, I saw a &lt;a href="https://news.ycombinator.com/item?id=19210697" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; post about a fast, privacy-focused alternative to Disqus called &lt;a href="https://commento.io/" rel="noopener noreferrer"&gt;Commento&lt;/a&gt;. Having learned my lesson, I benchmarked Commento before committing to it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Frequests2.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fmedia%2Fcommento-post%2Fsize2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What a difference. &lt;strong&gt;Commento is &lt;em&gt;orders of magnitude&lt;/em&gt; lighter than Disqus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It gets even better. Here are more reasons I was sold on Commento:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's &lt;a href="https://gitlab.com/commento" rel="noopener noreferrer"&gt;open source&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;It's &lt;a href="https://commento.io/privacy" rel="noopener noreferrer"&gt;privacy focused&lt;/a&gt; - it doesn't sell user data and tries to collect as little as possible. This is especially nice given that my blog's audience is probably more privacy-conscious than the average internet user.&lt;/li&gt;
&lt;li&gt;You can &lt;a href="https://commento.io/pricing" rel="noopener noreferrer"&gt;pay what you want&lt;/a&gt;. Disqus's free tier is ad-supported, and its cheapest paid tier is $9/month. Commento is actually cheaper (if you want it to be)!&lt;/li&gt;
&lt;li&gt;It's &lt;a href="https://docs.commento.io/configuration/frontend/" rel="noopener noreferrer"&gt;configurable&lt;/a&gt;. If you scroll down to the comments of this post, you'll see that the styling of the Commento integration matches the styling of the rest of the site.&lt;/li&gt;
&lt;li&gt;It has an &lt;strong&gt;Import from Disqus&lt;/strong&gt; tool that's easy to use. I was able to quickly migrate all of my old Disqus comments to Commento.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commento works great for me, but I'm not trying to say it's the right solution for everyone - there are several good, &lt;em&gt;lightweight&lt;/em&gt; commenting platforms out there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you still using Disqus?&lt;/strong&gt; Did you know how much bloat it adds to your page? What's keeping you from switching?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://victorzhou.com" rel="noopener noreferrer"&gt;victorzhou.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>bestpractices</category>
      <category>disqus</category>
    </item>
    <item>
      <title>How I Became a Programmer</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Thu, 14 Mar 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/how-i-became-a-programmer-2pe</link>
      <guid>https://dev.to/vzhou842/how-i-became-a-programmer-2pe</guid>
      <description>&lt;p&gt;It all started when I was 12 years old. 👶&lt;/p&gt;

&lt;p&gt;Back in those days, my friends and I all played &lt;a href="https://www.runescape.com"&gt;Runescape&lt;/a&gt;, a popular browser-based &lt;a href="https://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game"&gt;MMORPG&lt;/a&gt;. Those who’ve played Runescape will remember that it requires lots of &lt;a href="https://en.wikipedia.org/wiki/Grinding_(gaming)"&gt;grinding&lt;/a&gt; - the “best” players were the ones who spent the most time killing monsters and leveling up. Unfortunately, my parents didn’t let me play much - all of my friends had passed level 50 by the time I reached level 30. This level gap was &lt;del&gt;all I worried about day and night because nobody wants to play with someone 20 levels below them&lt;/del&gt; somewhat frustrating.&lt;/p&gt;

&lt;p&gt;The obvious solution was to&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a better version of Runescape that required less grinding, and then&lt;/li&gt;
&lt;li&gt;Convince millions of players to play my version instead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this foolproof 2-step plan in place, I set out to complete Step 1. I remember googling “&lt;em&gt;How was Runescape made&lt;/em&gt;,” reading that it was built in Java, and then googling “&lt;em&gt;How to code in Java&lt;/em&gt;.” Over the next 20 minutes, I would come to realize that learning Java on your own as a 12-year-old is not that easy. Thus, I made the brilliant decision to pivot: I would now make a better version of Runescape &lt;strong&gt;&lt;em&gt;without writing any code&lt;/em&gt;&lt;/strong&gt;. &lt;em&gt;I’ll just skip the parts that require coding&lt;/em&gt;, I thought. &lt;em&gt;This way, I’ll finish it faster, too.&lt;/em&gt; See? Brilliant.&lt;/p&gt;

&lt;p&gt;I ended up actually finding a way to make games without writing code: &lt;a href="https://www.yoyogames.com/gamemaker"&gt;GameMaker&lt;/a&gt;. I downloaded it and began making simple, codeless games. One of my favorite memories from middle school was bringing a USB loaded with a &lt;a href="https://en.wikipedia.org/wiki/Space_Invaders"&gt;Space Invaders&lt;/a&gt;-style game I’d made to the computer lab and passing it around between my friends. That proud feeling of seeing people use something I built has been driving me ever since.&lt;/p&gt;

&lt;p&gt;Despite my ambitions, I eventually realized that you can’t actually do much without writing code. Luckily, GameMaker supported a proprietary programming language called the &lt;a href="https://docs.yoyogames.com/source/dadiospice/002_reference/001_gml%20language%20overview/"&gt;GameMaker Language&lt;/a&gt; (GML) that had lots of examples and tutorials for it online. I reluctantly decided to learn a bit of GML so I could make more advanced games (read: Runescape). Those were my first &lt;code&gt;if&lt;/code&gt; statements and &lt;code&gt;for&lt;/code&gt; loops! That’s right. &lt;strong&gt;My first programming language was GML&lt;/strong&gt;. 😲&lt;/p&gt;

&lt;p&gt;Ever since those first few lines of GML, I’ve been hooked. In 9th grade, I began making iOS apps and competing in programming competitions. Out of high school, I landed my first software engineering internship and started more seriously pursuing a career in tech. In college, I got into web development and sold my first website.&lt;/p&gt;

&lt;p&gt;To summarize: I’ve spent a decade learning and building and &lt;em&gt;still&lt;/em&gt; haven’t taken down Runescape 😡. I guess everyone has their white whale…&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://victorzhou.com"&gt;victorzhou.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>career</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Can You Find The Bug in This Code?</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Sat, 09 Feb 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/can-you-find-the-bug-in-this-code-2k0h</link>
      <guid>https://dev.to/vzhou842/can-you-find-the-bug-in-this-code-2k0h</guid>
      <description>&lt;p&gt;Here’s a bit of Javascript that prints “Hello World!” on two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})()&lt;/span&gt;

  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})()&lt;/span&gt;
&lt;span class="p"&gt;})()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;…except it fails with a runtime error. Can you spot the bug without running the code?&lt;/p&gt;

&lt;p&gt;Scroll down for a hint.&lt;/p&gt;
















&lt;h2&gt;
  
  
  Hint
&lt;/h2&gt;

&lt;p&gt;Here’s the text of the error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: (intermediate value)(...) is not a function
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What’s going on?&lt;/p&gt;

&lt;p&gt;Scroll down for the solution.&lt;/p&gt;
















&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;One character fixes this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})();&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})()&lt;/span&gt;
&lt;span class="p"&gt;})()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Without that semicolon, the parentheses wrapping the second function are interpreted as a function call: the second function is passed as an argument to whatever the first IIFE returned. Here’s a rewrite that demonstrates what’s going on when the code is run without the semicolon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;&lt;span class="nx"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;)();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are 3 function invocations in that last line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;f1&lt;/code&gt; is called with no arguments&lt;/li&gt;
&lt;li&gt;The return value of &lt;code&gt;f1()&lt;/code&gt; is called with &lt;code&gt;f2&lt;/code&gt; as its only argument&lt;/li&gt;
&lt;li&gt;The return value of &lt;code&gt;f1()(f2)&lt;/code&gt; is called with no arguments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the return value of &lt;code&gt;f1()&lt;/code&gt; is not a function, the runtime throws a &lt;code&gt;TypeError&lt;/code&gt; during the second invocation.&lt;/p&gt;

&lt;p&gt;With the semicolon added, this becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;();(&lt;/span&gt;&lt;span class="nx"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;)();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which runs as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, you had this bug once?
&lt;/h2&gt;

&lt;p&gt;Yup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why would you ever write code with so many Immediately Invoked Function Expressions (&lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/IIFE"&gt;IIFEs&lt;/a&gt;)?
&lt;/h2&gt;

&lt;p&gt;It’s a long story - &lt;a href="https://victorzhou.com/blog/why-you-should-use-webpack/"&gt;this post&lt;/a&gt; explains how I wrote bad enough code to have this bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Always use semicolons&lt;/strong&gt;. This specific case was a bit contrived, but something similar could happen to you. Here’s another Hello World program that fails for a related reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;World&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I’ll leave figuring this one out as an exercise for you.&lt;/p&gt;

&lt;p&gt;Most JavaScript style guides require semicolons, including &lt;a href="https://google.github.io/styleguide/jsguide.html#formatting-semicolons-are-required"&gt;Google’s&lt;/a&gt;, &lt;a href="https://github.com/airbnb/javascript#semicolons"&gt;Airbnb’s&lt;/a&gt;, and &lt;a href="https://contribute.jquery.org/style-guide/js/#semicolons"&gt;jQuery’s&lt;/a&gt;. To summarize: &lt;strong&gt;always use semicolons&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>debugging</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building a Better Profanity Detection Library with scikit-learn</title>
      <dc:creator>Victor Zhou</dc:creator>
      <pubDate>Mon, 04 Feb 2019 12:00:00 +0000</pubDate>
      <link>https://dev.to/vzhou842/building-a-better-profanity-detection-library-with-scikit-learn-3b7f</link>
      <guid>https://dev.to/vzhou842/building-a-better-profanity-detection-library-with-scikit-learn-3b7f</guid>
      <description>&lt;p&gt;A few months ago, I needed a way to detect profanity in user-submitted text strings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Ai2fk4aGvplR7le_3PPajAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Ai2fk4aGvplR7le_3PPajAA.png"&gt;&lt;/a&gt;This shouldn’t be that hard, right?&lt;/p&gt;

&lt;p&gt;I ended up building and releasing my own library for this purpose called &lt;a href="https://github.com/vzhou842/profanity-check" rel="noopener noreferrer"&gt;profanity-check&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course, before I did that, I looked in the &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;Python Package Index&lt;/a&gt; (PyPI) for any existing libraries that could do this for me. The only half-decent results for the search query “profanity” were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/profanity/" rel="noopener noreferrer"&gt;profanity&lt;/a&gt; (the ideal package name)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/better-profanity/" rel="noopener noreferrer"&gt;better-profanity&lt;/a&gt;: &lt;em&gt;“Inspired from package&lt;/em&gt; &lt;a href="https://github.com/ben174/profanity" rel="noopener noreferrer"&gt;&lt;em&gt;profanity&lt;/em&gt;&lt;/a&gt; &lt;em&gt;of&lt;/em&gt; &lt;a href="https://github.com/ben174" rel="noopener noreferrer"&gt;&lt;em&gt;Ben Friedland&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, this library is much faster than the original one.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/profanityfilter/" rel="noopener noreferrer"&gt;profanityfilter&lt;/a&gt; (has 31 Github stars, which is 30 more than most of the other results have)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/profanity-filter/" rel="noopener noreferrer"&gt;profanity-filter&lt;/a&gt; (uses Machine Learning, enough said?!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Third-party libraries can sometimes be sketchy, though, so I did my due diligence on these 4 results.&lt;/p&gt;

&lt;h2&gt;
  
  
  profanity, better-profanity, and profanityfilter
&lt;/h2&gt;

&lt;p&gt;After a quick dig through the &lt;code&gt;profanity&lt;/code&gt; repository, I found a file named &lt;a href="https://github.com/ben174/profanity/blob/master/profanity/data/wordlist.txt" rel="noopener noreferrer"&gt;wordlist.txt&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0lTbmHR5WE7HZ8wCvLpqtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0lTbmHR5WE7HZ8wCvLpqtg.png"&gt;&lt;/a&gt;NSFW&lt;/p&gt;

&lt;p&gt;The entire &lt;code&gt;profanity&lt;/code&gt; library is just a wrapper over this list of 32 words! &lt;code&gt;profanity&lt;/code&gt; detects profanity simply by looking for one of these words.&lt;/p&gt;

&lt;p&gt;To my dismay, &lt;code&gt;better-profanity&lt;/code&gt; and &lt;code&gt;profanityfilter&lt;/code&gt; both took the same approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;better-profanity&lt;/code&gt; uses &lt;a href="https://github.com/snguyenthanh/better_profanity/blob/master/better_profanity/profanity_wordlist.txt" rel="noopener noreferrer"&gt;a 140-word wordlist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;profanityfilter&lt;/code&gt; uses &lt;a href="https://github.com/areebbeigh/profanityfilter/blob/master/profanityfilter/data/badwords.txt" rel="noopener noreferrer"&gt;a 418-word wordlist&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is bad because &lt;strong&gt;profanity detection libraries based on wordlists are extremely subjective.&lt;/strong&gt; For example, &lt;code&gt;better-profanity&lt;/code&gt;’s wordlist includes the word “suck.” Are you willing to say that any sentence containing the word “suck” is profane? Furthermore, any hard-coded list of bad words will inevitably be incomplete — do you think &lt;code&gt;profanity&lt;/code&gt;’s 32 bad words are the only ones out there?&lt;/p&gt;
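&lt;p&gt;To make that subjectivity concrete, here’s a toy sketch of what wordlist-based detection boils down to. This is my own illustration, not code from any of these libraries, and the two-word list is obviously a stand-in for the real ones:&lt;/p&gt;

```python
# Toy sketch of wordlist-based detection (an illustration, not code from
# profanity, better-profanity, or profanityfilter).
WORDLIST = {"suck", "damn"}  # imagine a few hundred entries here

def is_profane(text):
    # Flag the text if any whitespace-separated word is on the list.
    return any(word in WORDLIST for word in text.lower().split())

print(is_profane("vacuum cleaners suck up dust"))  # True: flagged, but harmless
print(is_profane("f4ck you"))  # False: obfuscated variants slip through
```

&lt;p&gt;Both failure modes show up immediately: innocuous sentences get flagged, and anything not spelled exactly like a list entry sails through.&lt;/p&gt;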

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fstatic%2Fb0cc99529a0fad11e9353fc7ff189e2f%2Fb8b3f%2Fxkcd-290.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvictorzhou.com%2Fstatic%2Fb0cc99529a0fad11e9353fc7ff189e2f%2Fb8b3f%2Fxkcd-290.png" alt="xkcd 290"&gt;&lt;/a&gt;&lt;/p&gt;

  Fucking Blue Shells. source: &lt;a href="https://xkcd.com/290/" rel="noopener noreferrer"&gt;xkcd&lt;/a&gt;




&lt;p&gt;Having already ruled out 3 libraries, I put my hopes on the 4th and final one: &lt;code&gt;profanity-filter&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  profanity-filter
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;profanity-filter&lt;/code&gt; uses Machine Learning! Sweet!&lt;/p&gt;

&lt;p&gt;Turns out, it’s &lt;strong&gt;&lt;em&gt;really&lt;/em&gt;&lt;/strong&gt; slow. Here’s a benchmark I ran in December 2018 comparing (1) &lt;code&gt;profanity-filter&lt;/code&gt;, (2) my library &lt;code&gt;profanity-check&lt;/code&gt;, and (3) &lt;code&gt;profanity&lt;/code&gt; (the one with the list of 32 words):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AKRJEl4YHfSTk9PmmScIcUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AKRJEl4YHfSTk9PmmScIcUA.png"&gt;&lt;/a&gt;A human could probably do this faster than profanity-filter can&lt;/p&gt;

&lt;p&gt;I needed to be able to perform many predictions in real time, and &lt;code&gt;profanity-filter&lt;/code&gt; was not even close to being fast enough. But hey, maybe this is a classic tradeoff of accuracy for speed, right?&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ALYOeGE6vTXTAKhJ_W1fZgQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ALYOeGE6vTXTAKhJ_W1fZgQ.png"&gt;&lt;/a&gt;At least profanity-filter is not dead last this time&lt;/p&gt;

&lt;p&gt;None of the libraries I’d found on PyPI met my needs, so I built my own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building profanity-check, Part 1: Data
&lt;/h2&gt;

&lt;p&gt;I knew that I wanted &lt;code&gt;profanity-check&lt;/code&gt; to base its classifications on data to avoid being subjective &lt;em&gt;(read: to be able to say I used Machine Learning)&lt;/em&gt;. I put together a combined dataset from two publicly-available sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the “Twitter” dataset from &lt;a href="https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data" rel="noopener noreferrer"&gt;t-davidson/hate-speech-and-offensive-language&lt;/a&gt;, which contains tweets scraped from Twitter.&lt;/li&gt;
&lt;li&gt;the “Wikipedia” dataset from &lt;a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge" rel="noopener noreferrer"&gt;this Kaggle competition&lt;/a&gt; published by Alphabet’s &lt;a href="https://conversationai.github.io/" rel="noopener noreferrer"&gt;Conversation AI&lt;/a&gt; team, which contains comments from Wikipedia’s talk page edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these datasets contains text samples hand-labeled by humans through crowdsourcing sites like &lt;a href="https://www.figure-eight.com/" rel="noopener noreferrer"&gt;Figure Eight&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here’s what my dataset ended up looking like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ABw_we8cbs-WOpWXOCxzSTg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2ABw_we8cbs-WOpWXOCxzSTg.png"&gt;&lt;/a&gt;Combined = Tweets + Wikipedia&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Twitter dataset has a column named &lt;code&gt;class&lt;/code&gt; that’s 0 if the tweet contains hate speech, 1 if it contains offensive language, and 2 if it contains neither. I classified any tweet with a &lt;code&gt;class&lt;/code&gt; of 2 as “Not Offensive” and all other tweets as “Offensive.”&lt;/p&gt;

&lt;p&gt;The Wikipedia dataset has several binary columns (e.g. &lt;code&gt;toxic&lt;/code&gt; or &lt;code&gt;threat&lt;/code&gt;) that represent whether or not that text contains that type of toxicity. I classified any text that contained &lt;em&gt;any&lt;/em&gt; of the types of toxicity as “Offensive” and all other texts as “Not Offensive.”&lt;/p&gt;
&lt;/blockquote&gt;
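&lt;p&gt;In code, the label mapping described above might look something like this. It’s a stdlib-only sketch; the dict rows are made-up stand-ins for rows of the real Twitter and Wikipedia CSVs:&lt;/p&gt;

```python
# Sketch of the binarization rules above; the dicts stand in for rows of
# the real datasets.
twitter_rows = [{"class": 0}, {"class": 1}, {"class": 2}]
# class 2 means "neither hate speech nor offensive language"
twitter_labels = [0 if row["class"] == 2 else 1 for row in twitter_rows]

wiki_rows = [{"toxic": 1, "threat": 0}, {"toxic": 0, "threat": 0}]
# "Offensive" if ANY toxicity column is set
wiki_labels = [1 if any(row.values()) else 0 for row in wiki_rows]

print(twitter_labels)  # [1, 1, 0]
print(wiki_labels)     # [1, 0]
```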

&lt;h2&gt;
  
  
  Building profanity-check, Part 2: Training
&lt;/h2&gt;

&lt;p&gt;Now armed with a cleaned, combined dataset (which you can &lt;a href="https://github.com/vzhou842/profanity-check/blob/master/profanity_check/data/clean_data.csv" rel="noopener noreferrer"&gt;download here&lt;/a&gt;), I was ready to train the model!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’m skipping over how I cleaned the dataset because, honestly, it’s pretty boring — if you’re interested in learning more about preprocessing text datasets, check out &lt;a href="https://machinelearningmastery.com/clean-text-machine-learning-python/" rel="noopener noreferrer"&gt;this article&lt;/a&gt; or &lt;a href="https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908" rel="noopener noreferrer"&gt;this post&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.externals import joblib

# Read in data
data = pd.read_csv('clean_data.csv')
texts = data['text'].astype(str)
y = data['is_offensive']

# Vectorize the text
vectorizer = CountVectorizer(stop_words='english', min_df=0.0001)
X = vectorizer.fit_transform(texts)

# Train the model
model = LinearSVC(class_weight="balanced", dual=False, tol=1e-2, max_iter=1e5)
cclf = CalibratedClassifierCV(base_estimator=model)
cclf.fit(X, y)

# Save the model
joblib.dump(vectorizer, 'vectorizer.joblib')
joblib.dump(cclf, 'model.joblib')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

  Are you also surprised the code is so short?
  Apparently &lt;a href="https://scikit-learn.org/" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt; does everything.






&lt;p&gt;Two major steps are happening here: (1) vectorization and (2) training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vectorization: Bag of Words
&lt;/h3&gt;

&lt;p&gt;I used &lt;code&gt;scikit-learn&lt;/code&gt;’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" rel="noopener noreferrer"&gt;CountVectorizer&lt;/a&gt; class, which basically turns any text string into a vector by counting how many times each given word appears. This is known as a &lt;a href="https://en.wikipedia.org/wiki/Bag-of-words_model" rel="noopener noreferrer"&gt;Bag of Words&lt;/a&gt; (BOW) representation. For example, if the only words in the English language were &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;sat&lt;/code&gt;, and &lt;code&gt;hat&lt;/code&gt;, a possible vectorization of the sentence &lt;code&gt;the cat sat in the hat&lt;/code&gt; might be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Asbnts1u_QFB_V-X5DSC3pg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2Asbnts1u_QFB_V-X5DSC3pg.png"&gt;&lt;/a&gt;“the cat sat in the hat” -&amp;gt; [2, 1, 1, 1, 1]&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;???&lt;/code&gt; represents any unknown word, which for this sentence is &lt;code&gt;in&lt;/code&gt;. Any sentence can be represented in this way as counts of &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;sat&lt;/code&gt;, &lt;code&gt;hat&lt;/code&gt;, and &lt;code&gt;???&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2A-wONWZDab2gNQP3Rfdpt_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2A-wONWZDab2gNQP3Rfdpt_A.png"&gt;&lt;/a&gt;A handy reference table for the next time you need to vectorize “cat cat cat cat cat”&lt;/p&gt;

&lt;p&gt;Of course, there are far more words in the English language, so in the code above I use the &lt;code&gt;fit_transform()&lt;/code&gt; method, which does 2 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fit:&lt;/strong&gt; learns a vocabulary by looking at all words that appear in the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; turns each text string in the dataset into its vector form.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training: Linear SVM
&lt;/h3&gt;

&lt;p&gt;The model I decided to use was a Linear Support Vector Machine (SVM), which is implemented by &lt;code&gt;scikit-learn&lt;/code&gt;’s &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html" rel="noopener noreferrer"&gt;LinearSVC&lt;/a&gt; class. &lt;a href="https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72" rel="noopener noreferrer"&gt;This post&lt;/a&gt; and &lt;a href="https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt; are good introductions if you don’t know what SVMs are.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html" rel="noopener noreferrer"&gt;CalibratedClassifierCV&lt;/a&gt; in the code above exists as a wrapper to give me the &lt;code&gt;predict_proba()&lt;/code&gt; method, which returns a probability for each class instead of just a classification. You can pretty much just ignore it if that last sentence made no sense to you, though.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s one (simplified) way you could think about why the Linear SVM works: during the training process, the model learns which words are “bad” and how “bad” they are because those words appear more often in offensive texts. &lt;strong&gt;It’s as if the training process is picking out the “bad” words for me&lt;/strong&gt;, which is much better than using a wordlist I write myself!&lt;/p&gt;

&lt;p&gt;A Linear SVM combines the best aspects of the other profanity detection libraries I found: it’s fast enough to run in real-time yet robust enough to handle many different kinds of profanity.&lt;/p&gt;
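&lt;p&gt;If you’re curious what the &lt;code&gt;CalibratedClassifierCV&lt;/code&gt; wrapper actually buys you, here’s a toy end-to-end run on a made-up two-sentence “dataset” (illustration only; nothing like the real training data):&lt;/p&gt;

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up dataset: 1 = offensive, 0 = not offensive.
texts = ["you are awful", "have a nice day"] * 10
y = [1, 0] * 10

vec = CountVectorizer()
X = vec.fit_transform(texts)

# LinearSVC alone only has predict(); wrapping it in
# CalibratedClassifierCV adds predict_proba().
cclf = CalibratedClassifierCV(LinearSVC(dual=False))
cclf.fit(X, y)

probs = cclf.predict_proba(vec.transform(["you are awful"]))
print(probs.shape)  # (1, 2): one row, columns P(not offensive), P(offensive)
```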

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;That being said, &lt;code&gt;profanity-check&lt;/code&gt; is far from perfect. Let me be clear: take predictions from &lt;code&gt;profanity-check&lt;/code&gt; with a grain of salt because &lt;strong&gt;it makes mistakes.&lt;/strong&gt; For example, it’s not good at picking up less common variants of profanities like “f4ck you” or “you b1tch” because they don’t appear often enough in the training data. You’ll never be able to detect &lt;em&gt;all&lt;/em&gt; profanity (people will come up with new ways to evade filters), but &lt;code&gt;profanity-check&lt;/code&gt; does a good job of catching most of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  profanity-check
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;profanity-check&lt;/code&gt; is open source and available on PyPI! To use it, simply&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install profanity-check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How could &lt;code&gt;profanity-check&lt;/code&gt; be even better? Feel free to reach out or comment with any thoughts or suggestions!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally posted on &lt;a href="https://victorzhou.com/blog/better-profanity-detection-with-scikit-learn/" rel="noopener noreferrer"&gt;victorzhou.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>python</category>
      <category>scikitlearn</category>
    </item>
  </channel>
</rss>
