Nick Doty

Posted on Sep 29, 2020

Discord Censor Bot Building a Discord bot to filter out swear words

#discord #python #trie #bot

Now that it's been 7 years since quarantine began, I've been expanding my assortment of random Discord bots.
My latest experiment has been a bot that can determine if a message has a "bad word" in it, and if so, deletes the message, and then shames the user. It also can determine if the user is trying to get around the filter by using "leetspeak". This is a short write-up of how that bot works.

Disclaimer: This bot (and the following write-up) revolves around swear words. I will be talking about swear words in this write-up. If that bothers you, I would recommend not reading this. Personally, I have much worse things to worry about in 2020.

Bot information

If you'd like to install the bot on your Discord server, simply click this link. FYI: You must be an administrator on the Discord server to install it. It requires the manage messages permission as it deletes messages with bad words.
If you'd like to take a look at the bot, all of the code is available here on Github.

If you have any requests or bugs to report, you can report an issue here on Github or you can reach out to me on Twitter, @nldoty.

Building the bot

In building this bot there were 4 areas that I focused on in development and I'll be focusing on each shortly below.

Bad word list

The first problem is establishing a list of "bad words". Depending on what you're comfortable with, this can vary quite a bit. You could choose a classic, the 7 words you can't say by George Carlin. Another article I found had 26 swear words listed. But I wanted to be as thorough as possible - so I chose this list of 1300 words from CMU. That being said - there were a lot of words that could be seen as "controversial" that I didn't feel the need to censor. Words like "Republican" and "Democrat". I narrowed it down to about 1000 words.

Data Structure

Next we need a way to see if any words in a message happens to be in the designated list of "bad words".
The naive approach would be a list search. Every time you come across a word, you check to see if it exists in your list of "bad words".
The problem with this approach is time complexity. Searching a python list (x in list) is O(n) where n is the number of items in your list. If your list has only 7 words in it, this might be acceptable! But with over a thousand words, this isn't going to work.
Thankfully, this is almost a textbook use-case for a Trie.

So uhh, what's a Trie?

A Trie, or prefix tree, is an ordered-tree structure used to store dynamic information that can be searched for. It's a unique type of tree because unlike a binary tree, no node in the tree stores the key that you're searching for - instead, the key itself is distributed.

The purpose of this post isn't to teach Tries: so if you'd like to read more, this GeeksforGeeks post is a great resource. But here's a TL;DR:
Tries are made up of nodes, and each node contains two properties: children representing the nodes beneath the node (usually an array), and a way to designate the END of a word. The implementation is stored here, but the code is very simple:

class TrieNode:
    _MAX_SIZE = 40

    def __init__(self):
        self.children = [None] * self._MAX_SIZE
        self.is_end_of_word = False

The _MAX_SIZE is 40 because that is the size of my dictionary:
'abcdefghijklmnopqrstuvwxyz12345670(|@$[!'
The Trie itself has two base functions: insert and search. To "build" the Trie you will insert all of your words into the structure, and then you can search to see if a word is in your Trie.
Here is an example image from TheoryOfProgramming:

So for my example, I want to not only add all my words from my bad words list to my Trie, but I also wanted to add leet speak variations to Trie to catch as many foul words as possible.

To do this, I use some recursion to add extra TrieNodes when I find a letter that has a leetspeak equivalent - passing the remainder of the letters as the word to insert to fill in the Trie with all possibilities.
The code can be seen here.

Let's look at how this would work in practice, with the word shit.
Looking at each individual letter, s h i t:
s has two equivalents, $ and 5
h has no equivalents
i has two equivalents, | and !
t has one equivalent, 7

This ultimately gives us 18 different options to search for:
[shit, shi7, sh|t, sh|7, sh!t, sh!7, $hit, $hi7, $h|t, $h|7, $h!t, $h!7, 5hit, 5hi7, 5h|t, 5h|7, 5h!t, 5h!7]

Searching through 18 different words for one word would be slow. But the Trie is quick and simple, because the complexity is limited to the length of your word.

The more words you're searching for, the more time you'll be saving. This is the beauty of using a Trie.

Discord integration

The Discord integration code can be seen here and is very basic: it listens for all messages to a server, parses the message into words and searches the Trie for each word. If a word is found in the Trie, the bot will delete the message and then shame the user using a random assortment of shaming messages.

The only somewhat interesting part of this code snippet is this line:
text = text.translate(str.maketrans(table))

A major problem I ran into in working on the bot is determining the question, "what is a word?" This was one of the hardest things for me to solve. When people type, they use things like punctuation - commas, periods, dashes, you name it. A typical way to divide a string in python is using .split(), which will break things up by space. However, if you had the sentence
I like spinach, corn, and beans.
Using .split(), your result would be
['I', 'like', 'spinach,', 'corn,', 'and', 'beans.']

In building my Trie, I don't include punctuation like periods and commas. So if you wrote the sentence
"Shut up you stupid fucker."
.split() would break this up as
['Shut', 'up', 'you', 'stupid', 'fucker.']
and my Trie would miss the word fucker because there is a period as part of the word, so I would not see the is_end_of_word at the right point.

To avoid this, I use string.translate. Using the defined dictionary, it maps keys to their respective values - in my case, I map all the common characters that aren't part of my Trie leetspeak to None. This solved quite a few of my problems, but obviously not all of them.

The last major issue I have is one not easily solved. A really easy way to get around the bot is to use spaces in the middle of the word - instead of shit, just write sh it. The bot sees this as two words, and doesn't connect that without the space, a swear word was made. I could add spaces to the Trie at every level, but how would words be ultimately divided? How could I pass words to the Trie so that shit, sh, and it are all checked? This gets more and more complicated the more words/spaces your message has. You could search for all possible iterations with a moving window of sorts, but then you're running into major speed issues.
The best answer would be some way for a program to know that fu ck is really just one word, but oh hm should be treated as two words. I think this problem is best explained using this XKCD comic:

Dockerization

Finally, I like to build my Discord bots into Docker containers. It just makes deployment a lot easier. And the Dockerfile for this one is very simple:


# set base image (host OS)
FROM python:3.7.7

# set the working directory in the container
WORKDIR /code

# copy the dependencies file to the working directory
COPY requirements.txt .

# install dependencies
RUN pip install -r requirements.txt

# copy the content of the local src directory to the working directory
COPY src/ .

# command to run on container start
CMD [ "python", "./main.py" ]

Then, building the container in the top directory is as simple as
docker build -t pottybot .
and running the container is just
docker run pottybot

And that's it!
If you want to run the bot yourself, it's fairly straight-forward. Follow along with this guide to create a bot in the Discord Developer Portal. Once you have a token, create a .env file in the src/ folder with one line:
DISCORD_TOKEN='YOUR_TOKEN_HERE'

Then build and run the Docker container and you're set! If you have any questions or comments, feel free to reach out to me and I'm happy to help!

Stay safe and healthy!

Top comments (6)

Lars Kellogg-Stedman • Feb 11 '21 • Edited

Why a Trie instead of the native set data type? set will give you O(1) lookups, which beats the O(L) lookup time of the Trie, and you're not dealing with a large enough dataset that relative memory efficiency is going to matter, even if there is a difference.

Nick Doty • Feb 12 '21

Because when you have a lot of words that are very similar (especially when accounting for leet-speak letters) you are much more likely to have collisions in a hash table. It will also take up significantly more space (although in this instance, space is relatively cheap.) But partially because I just wanted to implement a trie.

Mathews Joy • Sep 21 '21

Hello I like the look of this idea, however i tried implmenting it the same but shorten the words down a bit, and iam getting
Char not found: *
Char not found: B
Char not found: *
Char not found: F
Char not found: *
Char not found: a
Char not found: o
Char not found: 5
Char not found: m
Char not found: d
Char not found: i
Char not found: t

shadowtime2000 • Sep 29 '20

I made something like this with a bot called censorbot though instead of just checking if a word is a bad word, it allows admins to censor certain words.