Hey there, my name is Ian. I'm a software engineer at a company whose website gets hundreds of thousands of visitors every month. That may seem small to some, but for me it's the first time I've written code and managed deployments for a site of this scale, and it has come with plenty of lessons and growing pains. I wanted to share some of those lessons and how I learned them.
Recently, I moved the website's account server (it controls user logins and session tokens) to a Kubernetes cluster so we could have minimal downtime and built-in load balancing. Before the move, session tokens were handled in memory. Usually that wouldn't be a huge problem, but with Kubernetes you need to be careful about stateful applications: pods can be restarted, rescheduled, or scaled out at any time, so anything held in one pod's memory is lost on restart and invisible to the other replicas. In this case, the solution was moving our session tokens to the MongoDB instance outside of the cluster.
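To make the idea concrete, here's a minimal sketch of what that looks like. The field names (`id`, `userId`, `createdAt`) and function names are illustrative, not our actual schema, and a plain `Map` stands in for the real MongoDB `tokens` collection so the sketch is self-contained. The key point is that any pod can look a session up by its token id instead of relying on its own memory:

```javascript
// Illustrative sketch: session tokens kept in a shared store instead of pod memory.
// A Map stands in for the real MongoDB `tokens` collection here.
const tokens = new Map();

// Create a session and persist it in the shared store.
function createSession(userId) {
  const token = {
    id: Math.random().toString(36).slice(2), // random session id (illustrative)
    userId,
    createdAt: new Date(),
  };
  tokens.set(token.id, token); // in MongoDB: db.tokens.insertOne(token)
  return token.id;
}

// Any pod can validate a session, because the state lives outside the pod.
function getSession(tokenId) {
  return tokens.get(tokenId) ?? null; // in MongoDB: db.tokens.findOne({ id: tokenId })
}
```

With this shape, a login handled by one pod can be validated by any other pod, which is what makes the deployment safe to scale and restart.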
Now that we have our tokens being managed correctly, things should be all good to go... right?
I woke up on a Saturday morning to a pleasant Slack message: "The site is down, no one can log in."
I opened my laptop and confirmed the website was indeed down, specifically the account server I had just finished moving to Kubernetes. My stomach dropped. I had worked hard learning Kubernetes, Docker, and Nginx to migrate the server, and it felt like all that work was for nothing.
I immediately checked the status of the Kubernetes pods: all were running. Next, I opened the logs for each pod and used `kubectl describe pod <pod_name>` to gather more information. All pods were alive and well, so why couldn't users log in?
It was time to get my hands dirty and load up the account server locally to do some testing. Every request worked instantly: MongoDB reads and writes took about 1 ms, our user index was being used, and connecting to the production database worked too.
That meant the problem only showed up at scale.
After a couple of hours of reviewing and rewriting code, I picked up on our first clue! Only the endpoints that used MongoDB's `MongoClient` were returning `504` errors.
I decided to try some queries in the mongo CLI to see if we were having trouble with reads and writes. First, I tried a `findOne` on the `users` collection; that worked fine. Next, I wrote a user with `insertOne`; that also worked fine.
Hmmm. What could the issue be then? Without any more clues to go off of, I updated the `mongodb` npm package to the latest version in hopes that I'd run into a bug that had been fixed upstream. Unfortunately, we were still in no man's land, with no success.
Out of curiosity, I ran a `findOne` query on our `tokens` collection. It took 10 seconds. That might not seem long, but compared to the millisecond response times on `users`, it was a huge difference.
I ran MongoDB's `.explain()` function on a `tokens` collection query and realized it was scanning every token document: there was no index, so every single time a user started a session, MongoDB had to examine all of the tokens in our database. This explained exactly why our requests were timing out.
This was a huge issue.
I simply ran `db.tokens.createIndex()` on the token's `id` field and BOOM, the problem was solved.
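For reference, the fix amounts to a one-line piece of database configuration in the mongo shell (assuming the token id lives in a field called `id`). Afterward, `.explain()` should report an index scan (IXSCAN) instead of a collection scan (COLLSCAN):

```javascript
// Create an ascending index on the token id field (field name assumed).
db.tokens.createIndex({ id: 1 })

// Verify the query now uses the index: the winning plan should show IXSCAN.
db.tokens.find({ id: "some-token-id" }).explain("executionStats")
```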
Here are the lessons I took away:
- Know your databases! Learn the tools for scaling your database. Indexes, pool sizes, replica sets, etc. are all essential to scaling a MongoDB database.
- Think about each database query before pushing your code to production. How often does this query run? How expensive is this query?
- Even if your server goes down after migrating it to Kubernetes, that doesn't mean your work was wasted. It's important to use it as a learning experience.
If you made it this far, hopefully you enjoyed the read and learned something too! If you'd like to follow me on other platforms, I stream on Twitch, and you can also find me on Twitter.
Thanks for reading!