How it began
One day a few months back, our QA engineer messaged me on Slack saying that he couldn't log into the web application. Natural...
I'm sorry to write this, but it's 2020 and you're still on a monolithic architecture? If you sandboxed every service in small, resource-monitored (and resource-limited) instances (containers, for example), your headache would have been way shorter.
I know, I know: business first. But it's up to you to stand up and talk about tech debt.
Side note: I don't remember the last time I used an (S)FTP client to access a server... brrrr...
Of course implementing microservices could never cause its own headaches 🙄
Obviously running microservices requires experience and good knowledge to avoid pitfalls. But in my personal experience - probably because I got more experienced over the years - I consider small (in code and responsibility) services a great way to keep things simple and understandable, and workable by both experienced engineers and newcomers to the company.
It really takes you back, huh? 😂
As you said, startups in particular often need to put their business needs first to accelerate growth and market share. Plus, not every application needs to use microservices.
In my experience every company puts business in front of everything - and most of them put marketing in front of business ("the real heroes"). Having worked for a long time at startups, old companies, and lately consulting firms (i.e. more or less short projects), I can tell you that you have more power than you might think once you adopt standards that are simply different from monolithic approaches, without that meaning more work (once you get practiced with them). For years now I've been working with autoscaling solutions, microservices, or simply containerised services on a single machine, because in my opinion simple services give you great flexibility and replaceability in exchange for fine-grained control. Which apparently was what you lacked on this occasion :)
I appreciate you sharing your experience.
I’m skeptical about any power I have at the moment, but I will try to be a better advocate for better and more modern tech. I have much to learn!
Thanks for sharing. I like the graphics you added; they helped keep me engaged and reading.
Interesting - why was a simple `rm` the answer for the mails, instead of filtering and keeping the important messages (e.g. the business-logic-related ones), even just as an archive?
Isn't it a devop's task to know what he/she/they are doing? (I have the feeling there is no dedicated devop/sysop working there...)
Totally liked the conclusion about having a monitoring system set up, because 99% of companies lack monitoring or an understanding of this kind of thing. (It is super common, for balancing or scaling, to just add more CPU/memory and that's all, instead of investigating why that amount is even needed.)
Hi there, you're right that we don't have a dedicated DevOps person, as we're a small startup.
The amount of emails was huge and after going through a number of them, I noticed one cron job's response was adding backslashes, the number of which increased exponentially with each email. It reached a point where the entire screen was just filled with backslashes. The cron job's output wasn't incorrect, it was just being parsed incorrectly.
This Stack Overflow answer explains the root cause better:
Why is json_encode adding backslashes?
> I've been using `json_encode` for a long time, and I've not had any problems so far. Now I'm working with an upload script and try to return some JSON data after the file upload. I have the following code: […]
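The double-encoding effect is easy to reproduce. PHP's `json_encode` and Python's `json.dumps` escape strings the same way, so here is an illustrative sketch using Python (the original cron job was PHP; the payload here is made up):

```python
import json

payload = {"status": "ok"}

once = json.dumps(payload)   # a normal JSON document, no backslashes
twice = json.dumps(once)     # re-encoding a string escapes every quote
thrice = json.dumps(twice)   # ...and every backslash, so they multiply

for label, text in [("once", once), ("twice", twice), ("thrice", thrice)]:
    print(label, text.count("\\"), text)
```

Each pass escapes both the quotes and the backslashes produced by the previous pass, so the backslash count grows geometrically - which is exactly how the emails ended up as walls of backslashes.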
…A bit misleading... I got excited over a simple error. This is merely disk usage; server usage depends on several factors (which do include disk usage). I was thinking you'd hit a bandwidth limit due to excess customers and had to add load balancers, or write custom provisioning scripts to spin cloud instances up and down based on bandwidth usage.
Either way, this wasn't even an organic thing caused by customers; the issue was the crontab, so your server upgrades were kinda pointless when you could have simply done the logging right, i.e. handled it better from the start. In fact, it's wrong to have a cron job that logs so much info. Ideally, you'd want to log only abnormal behaviour, and normal behaviour should be consolidated into aggregated statistics instead. Anyway, glad you were able to find the culprit in the end.
P.S. It may be ideal to switch from FTP to SSH.
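For reference, the "log it, don't mail it" advice above: cron emails whatever a job writes to stdout/stderr, so redirecting the output to a log file silences the mail flood. The job name, schedule, and paths below are purely illustrative:

```
# /etc/cron.d/sync-job  (hypothetical job)
# An empty MAILTO stops cron from emailing the job's output to the owner.
MAILTO=""
*/10 * * * * www-data /usr/local/bin/sync-job >> /var/log/sync-job.log 2>&1
```

A matching logrotate stanza (e.g. `weekly`, `rotate 4`, `compress` in `/etc/logrotate.d/sync-job`) then keeps that log from refilling the disk.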
Hello Saifur.
Exactly - the point of this story is that the first time around we didn't bother debugging closely to understand the issue and just placed a band-aid on it, which led to it happening again. I find FTP faster and easier for editing and uploading files, and that's why I used it in this case; after all, it was a high-pressure situation.
Thanks for reading.
Hi 👋! I'll be quiet about any of my observations that mirror what's already noted here :) One single character to add that may help you out: `du -sh *`. That `*` behaves like the `*` in `ls` and most other commands - it'll give you itemized, summarized, human-readable usage for all of the contents of `$PWD`. You can start from `/` and work your way down.
Bonus: `du -sh * | sort -h` will sort the list - but I wouldn't recommend running that at `/` with 100% disk usage; it could take some time. There are some other pipes that will limit the results to the x heaviest items, too (for busy dirs).
Oh, another one for clearing disk usage, if you're logrotating and don't need the archives: `find /var/log -type f -name "*.gz" -exec rm -f {} \;`
I typed that on mobile, so double-check before you paste lol.
Finds all the rotated .gz logs and nukes 'em - those add up to A LOT (esp. if you're running mod_security lol)
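The "x heaviest items" pipe mentioned above can be sketched like this (directory names illustrative):

```shell
# List entries under the current directory with human-readable sizes,
# biggest last; tail keeps only the N heaviest (sort -h understands
# the K/M/G suffixes that du -h emits).
du -sh -- * | sort -h | tail -n 5
```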
Hello Kevin! Thanks for that lol.
Yes, `du /` took forever. I remember experimenting with a few different flags and parameters, but I'm not sure which combination I ended up using to figure out the root cause. Thanks a lot for your suggestions, I'll keep them in mind 👍
Nice sharing, Doaa. Monit is a good choice. I am using LXC to avoid this situation, where email used 80% of the available disk space. I have separate LXC containers for email, app, and database. Each container is created with a profile fixing hard drive and memory (RAM) limits.
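Since Monit came up: a minimal disk-space rule of the kind the article's conclusion calls for could look like this (file name, mount point, and threshold are illustrative):

```
# /etc/monit/conf.d/disk  (hypothetical file name)
check filesystem rootfs with path /
    if space usage > 80% then alert
```

With alerting configured, this sends a notification when the root filesystem passes 80% usage, instead of the first symptom being a login that fails.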
I had some of the same reactions that others had regarding your approach and infrastructure setup. However, after taking a step back, I think at this point in your business (early startup it sounds like) and your team's expertise, you're doing just fine. We can't (and shouldn't?) all start off running kubernetes against microservices on the blockchain. I think you're doing great!
Thanks for sharing your postmortem and a transparent view into your work!
Thanks a lot Garett.
We can't all work using the latest tech. We're doing the best we can and slowly moving towards a more stable system and better infrastructure, but it's understandably a laborious process.
I appreciate your response :)
Trying out something different 😁 thanks for reading
Good post!
Great article
and that's why we drink