The day our web server reached 100% capacity 💾

Doaa Mahely ・4 min read

How it began

One day a few months back, the QA engineer messaged me on Slack saying that he couldn't log into the web application. Naturally, I tried to log in with his credentials and was able to. I thought he had forgotten his password, so I sent it to him, and he said that he could log in now. I didn't think anything of it.

Two hours later, right about log-off time, I got an email from a client saying they weren't able to log in. I dismissed it, thinking they had forgotten their password, and intended to get back to them first thing the next day. Then a mobile developer on my team said the same thing.

Smells a bit fishy

So I got to investigating. I went to the website and tried to log in, and I couldn't. The page would reload when I hit Enter without showing any errors.

Wait what

I had just started debugging when a colleague mentioned it could be an issue with the database connection. We had recently moved from a single database instance to a database cluster, and we assumed this might be causing the issue or that one instance was taking on too much load. Since this change was the biggest and most recent one, we narrowed our focus to it. However, the database console looked fine and didn't show any extra load on any specific instance.

What does that mean

This issue only happened on production, so it was safe to assume it was not related to any code changes. At this point, we started getting more and more complaints from clients, so I decided to do something dangerous: debug on production.

Buckle up

I connected to the server using Cyberduck, navigated to the login view file, and added a log statement, something like "logging in". To my surprise, when I hit save, the file didn't get saved. Cyberduck showed a vague error that I can't remember and didn't understand at the time.

Huh

After a couple more hours of debugging, we realized that the server had reached 100% disk usage. That day, I learned two useful Unix commands: du and df. From their man pages:

The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument.

The df utility displays statistics about the amount of free disk space on the specified filesystem or on the filesystem of which file is a part.
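As a rough illustration of how the two commands complement each other, here is a minimal shell sketch: df answers "which filesystem is full?" and du answers "which directory is using the space?". The function names free_space and largest_entries are made up for this example, and sizes are reported in kilobytes so the sorting does not depend on GNU-only flags.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not the exact commands we ran that day.

# Show how full the filesystem holding a given path is (defaults to /).
free_space() {
    df -h "${1:-/}"
}

# List the N largest entries directly under a directory, biggest first.
# Sizes are in kilobytes (du -k).
largest_entries() {
    dir=${1:-.}
    count=${2:-5}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

free_space /
```

Running largest_entries /var 5, then repeating it on the biggest directory it reports, is one way to walk down to the culprit. Scanning a large tree with du can take a while.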

This meant one thing: we had to upgrade the disk size. Thankfully my colleague figured out how to do that with no downtime.

Crisis averted. People were able to log in.

Phew

The end... Not

Believe it or not, due to the immense workload we had at the time, no further action was taken to monitor the server's disk space or dig deeper into why this happened. So, somewhat unsurprisingly, two months later, the server reached 100% capacity again!

This time we were better prepared: we quickly identified the issue and upgraded the disk size. I also took the time to dig into why this happened, since we hadn't uploaded nearly enough files in the previous two months to justify filling up roughly 90 gigabytes.

Again, I utilized the du and df commands to pinpoint the directory that's eating up the disk space:

$ du -sh /var/*
...
170.3G    /var/mail
...

Surprised

Imagine that: the mail directory was taking up 170 gigabytes, almost 80% of the entire server's disk space! Further digging showed that the culprit was cron. We had several cron jobs running, and cron emails each job's output to the user who owns the crontab; those messages get stored in /var/mail. This was stated clearly in the crontab file, as shown below, but one particular cron job was producing a lot of junk output that managed to fill up the directory quite quickly.

$ crontab -l
...
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
...

Now what?

The plan of action was to first stop further emails, then to delete the existing emails to free up the server.

$ crontab -e
# Disable cron emails. Keep this comment on its own line:
# cron does not allow comments after a variable setting.
MAILTO=""

$ sudo rm /var/mail/ubuntu
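Setting MAILTO to an empty string silences mail for the jobs that follow it in the crontab. An alternative is to keep a job's output out of the mail path altogether by redirecting it (for example, appending >/dev/null 2>&1 to the job's line). The sketch below is a middle ground that we did not actually use: a hypothetical run_quiet wrapper that appends a job's output to a log file and trims that file to its last few lines, so a noisy job neither triggers mail nor leaves an ever-growing log behind. The function name, log path, and line limit are all assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper, for illustration only.
# Usage: run_quiet LOG_FILE MAX_LINES COMMAND [ARGS...]
run_quiet() {
    log_file=$1
    max_lines=$2
    shift 2

    # Send stdout and stderr to the log, so cron sees no output to mail.
    status=0
    "$@" >>"$log_file" 2>&1 || status=$?

    # Keep only the most recent lines so the log stays small between runs.
    tail -n "$max_lines" "$log_file" >"$log_file.tmp" &&
        mv "$log_file.tmp" "$log_file"

    # Report the job's own exit status to the caller.
    return "$status"
}
```

A cron job would then call something like run_quiet /var/log/myjob.log 1000 followed by the real command; the log path and the 1000-line limit here are placeholders.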

Smarter and wiser, we decided to set up a monitoring service to catch this particular issue in case it happens again. The service of choice was Monit, and it was surprisingly easy to start using. It provides a dashboard that lets us easily see all the numbers we need, from disk space to CPU usage to memory, and it sends email alerts on custom events. This great article is very helpful in setting up Monit on an Ubuntu server.
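Monit does this job for us, so we never wrote our own check. Purely to illustrate the kind of rule it evaluates, here is a hand-rolled shell sketch (not Monit configuration) that reads the usage percentage from df and complains when it passes a threshold. The function names and the 80% default are assumptions for the example.

```shell
#!/bin/sh
# Hand-rolled illustration only; Monit provides this kind of check out of the box.

# Print the usage percentage (digits only) of the filesystem holding a path.
disk_usage_percent() {
    # -P forces the POSIX one-line-per-filesystem format; column 5 is "Capacity".
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 1 and print a warning if usage is above the threshold (default 80).
check_disk() {
    path=${1:-/}
    threshold=${2:-80}
    used=$(disk_usage_percent "$path")
    if [ "$used" -gt "$threshold" ]; then
        echo "WARNING: $path is ${used}% full (threshold ${threshold}%)"
        return 1
    fi
    return 0
}

check_disk / 80 || true
```

The check prints only when the threshold is crossed. That matters if it is scheduled from cron: a script that printed on every run would recreate the very mail pile-up described above.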

And the rest is history. We didn't face an issue with disk space again. So far.

So relieved

Thanks for reading! Until next time 👋

Cover photo by Taylor Vick on Unsplash

Discussion

Giacomo "Mr. Wolf" Furlan

I'm sorry to write this, but we're in 2020 and you're still on a monolithic architecture? If you sandboxed every service in small, resources-monitored (and limited) instances (containers, for example), your headache would have been way shorter.

I know, I know: business first. But it's up to you to stand up and talk about tech debt.

Side note: I don't remember the last time I used an (S)FTP client to access a server... brrrr...

Mike Healy

Of course implementing microservices could never cause its own headaches 🙄

Giacomo "Mr. Wolf" Furlan

Obviously, running microservices requires experience and good knowledge to avoid pitfalls, but in my personal experience, probably because I've become more experienced over the years, I tend to consider small (in code and responsibility) services a great way to keep things simple and easy to understand and develop for both experienced developers and newcomers in the company.

Doaa Mahely Author

It really takes you back, huh? 😂
As you said, startups in particular often need to put their business needs first to accelerate growth and market share. Plus, not every application needs to use microservices.

Giacomo "Mr. Wolf" Furlan

In my experience, every company puts business in front of everything, and most of them put marketing in front of business ("the real heroes"). Having worked a long time for startups, old companies, and lately consulting ones (i.e. more or less short projects), I can tell you that you have more power than you might think when you adopt standards that are simply different from monolithic approaches, without it meaning more work (once you get practiced with them). I've been working for years with autoscaling solutions, microservices, or simply containerised services on a single machine, because in my opinion simple services give you great flexibility and replaceability along with great control. Which apparently was what you lacked on this occasion :)

Doaa Mahely Author

I appreciate you sharing your experience.

I’m skeptical about any power I have at the moment, but I will try to be a better advocate for better and more modern tech. I have much to learn!

Carlos Saltos

Thanks for sharing. I like the graphics you added; they help me stay engaged and keep reading.

huncyrus

Interesting. Why was a simple rm the answer for the mails, instead of filtering and keeping the important messages (e.g. business-logic-related ones), even as an archive?

Cyberduck showed a vague error I can't remember and didn't understand at the time

Isn't it the job of a DevOps engineer to know what they're doing? (I have the feeling there is no dedicated DevOps/sysops person working there...)
I really liked the conclusion of setting up a monitoring system, because 99% of companies lack monitoring or an understanding of this kind of thing. (It is super common for the answer to balancing or scaling to be just adding more CPU/memory, instead of investigating why that amount is even needed.)

Doaa Mahely Author

Hi there, you're right about us not having a dedicated DevOps person, as it is a small startup.
The number of emails was huge, and after going through a number of them, I noticed that one cron job's response was adding backslashes, the number of which increased exponentially with each email. It reached a point where the entire screen was just filled with backslashes. The cron job's output wasn't incorrect; it was just being parsed incorrectly.

This Stack Overflow answer explains the root cause better:

Why is json_encode adding backslashes?

I've been using json_encode for a long time, and I've not had any problems so far. Now I'm working with an upload script and I try to return some JSON data after file upload.

I have the following code:

print_r($result); // <-- This is an associative array
echo json_encode($result); //

Saifur Rahman Mohsin

A bit misleading... I got excited over a simple error. This is merely disk usage. Server usage depends on several factors (which do include disk usage), but I was thinking you had hit a bandwidth limit due to excess customers and had to add load balancers or write custom provisioning scripts to create new cloud instances or turn them off based on bandwidth usage.

Either way, this wasn't even an organic thing caused by customers. The issue was crontab, so your server upgrades were kinda pointless when you could have simply done the logging right, i.e. handled it better from the start. In fact, it's wrong to have a cron job that logs so much info. Ideally, you'd want to log only abnormal behaviour, and normal behaviour should be consolidated into aggregated statistics instead. Anyway, glad you were able to find the culprit in the end.

P.S. It may be ideal to switch from FTP to SSH.

Doaa Mahely Author

Hello Saifur.
Exactly, the point of this story is that the first time around we didn't bother debugging closely to understand the issue and just placed a bandaid on it, which led to it happening again. I find FTP faster and easier for editing and uploading files, and that's why I used it in this case. After all, it was a high-pressure situation.
Thanks for reading.

Kevin Ard

Hi 👋! I'll be quiet about any of my observations that mirror what's already noted here :) One single character to add that may help you out: du -sh *. That * behaves like the * in ls and most others - it'll give you itemized, summarized, human-readable usage for all of the contents of $PWD.

You can start from / and work your way down.

Bonus: du -sh * | sort -h will organize the list - but I wouldn't recommend running that at / with the disk 100% full. It could take some time. There are some other pipes that will limit the results to the x heaviest items, too (for busy dirs).

Kevin Ard

Oh, another one for clearing disk usage, if you're logrotating and don't need the archives: find /var/log -type f -name "*.gz" -exec rm -f {} \;

I typed that on mobile, please don't copy paste lol.

Finds all the rotated .gz logs and nukes em - those add up to A LOT (esp if you're running mod_security lol)

Doaa Mahely Author

Hello Kevin! Thanks for that lol.

Yes, du / took forever. I remember experimenting with a few different flags and parameters, but I'm not sure which combination I ended up using to figure out the root cause. Thanks a lot for your suggestions, I'll keep them in mind 👍

Pavel Morava

Interesting reading, but the gifs were rather distracting.

Doaa Mahely Author

Trying out something different 😁 thanks for reading

manish srivastava

Nice sharing, Doaa. Monit is a good choice. I am using LXC to avoid this kind of situation, where email used 80% of the available disk space. I have separate LXC containers for email, app, and database. Each container is created with a profile that fixes its hard drive space, RAM, and memory.

Garett Dunn

I had some of the same reactions that others had regarding your approach and infrastructure setup. However, after taking a step back, I think that at this point in your business (an early startup, it sounds like) and given your team's expertise, you're doing just fine. We can't (and shouldn't?) all start off running Kubernetes against microservices on the blockchain. I think you're doing great!

Thanks for sharing your postmortem and a transparent view into your work!

Doaa Mahely Author

Thanks a lot Garett.
We can't all work using the latest tech. We're doing the best we can and slowly moving towards a more stable system and better infrastructure, but it's understandably a laborious process.
I appreciate your response :)

Daniel Coturel

Good post!

Alexandru Simandi

and that's why we drink

mwendarapho

Great article