How to troubleshoot potential DoS attacks

#kubernetes #docker #devops #sre

I've had a few scares in my career as a developer, but none comes close to the pure horror of failing to SSH into your VM. Normal people tend to be scared of spiders, snakes, or saws.

Silent reference to the mediocre movie franchise with various types of hacking equipment and saws.

Anyhow, not me! Connections that hang and time out, when I don't know why or how to troubleshoot them, are how my nightmares tend to start. I'm feeling short of breath already.

TL;DR

This tutorial will walk you through how I tend to troubleshoot potential denial-of-service issues. There's a plethora of tools at your disposal for monitoring issues like these.

I'll show you some terminal-based tools as well as SaaS products. It's up to you to pick what suits you best. The end goal is to use a SaaS product that gives you access to logs, metrics, and the correlations between them, so you can pinpoint where a DoS attack is coming from. Then you can blacklist the offending IPs and stop them from hurting you.

Ready? Let's go.

The scenario: an Nginx web server

Nginx has become the go-to web server and reverse proxy for Kubernetes and Node.js-based applications. Because I'm quite focused on these technologies, you'll have to put up with me explaining this tutorial by using them as examples.

From the get-go, you'll most likely use an Nginx Ingress controller as a load balancer in front of your cluster. It routes requests either to Nginx containers hosting your front-end code or to Node.js containers running your APIs. Pretty straightforward, wouldn't you agree?

What happens if you notice increased latency and response times from your APIs, hmm? I go for lunch and hope it goes away. Just kidding. My boss is reading this, so of course not! * sweating profusely *

As long as there are no malicious requests, this setup works like a charm.

Noticing high request latencies through the terminal

Slow responses can mean many different things: slow, resource-hungry processes, or blocking code if you write poorly optimized JavaScript. You may even face the issue of having too many threads if you write any real programming language. Yes, that was a "bashing on JavaScript" reference, ha-ha! * whisper * PHP is still worse... 😐

Anyhow, htop is a nice command-line tool for checking resource usage. On Debian or Ubuntu, installing it is stupid-simple.

$ sudo apt install htop

Running it only requires you to enter one command.

$ htop

Hey presto, you have a nice process list!

You can even switch to a tree view of parent and child processes by hitting F5.

Even though htop is an amazing tool, the first thing you'll ask is how it handles multiple hosts, nodes, and clusters. Well, it really doesn't: it only shows the processes on the machine it runs on.

Using Kubernetes Dashboard to see your inventory

The next logical step is to make sure your resources are running at all, meaning they haven't failed and aren't stuck in an eternal restart loop. A plain kubectl get is your first line of defense. Fire off this command:

$ kubectl get all

[output]
NAME                DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ds/sematext-agent   1         1         1         1            1           <none>          Xm
ds/st-logagent      1         1         1         1            1           <none>          Xm

NAME            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/nodejs   3         3         3            3           Xm
deploy/nginx    3         3         3            3           Xm

NAME                   DESIRED   CURRENT   READY     AGE
rs/nodejs-788f59f8d4   3         3         3         Xm
rs/nginx-788e93r8a5    3         3         3         Xm

NAME                        READY     STATUS    RESTARTS   AGE
po/nginx-788e93r8a5-45t4d   1/1       Running   4          Xm
po/nginx-788e93r8a5-qsgdr   1/1       Running   4          Xm
po/nginx-788e93r8a5-tghhj   1/1       Running   4          Xm
po/nodejs-788f59f8d4-v9cjt  1/1       Running   4          Xm
po/nodejs-788f59f8d4-zllqj  1/1       Running   4          Xm
po/nodejs-788f59f8d4-ztjxw  1/1       Running   4          Xm
po/sematext-agent-rdfm8     1/1       Running   5          Xm
po/st-logagent-52zkt        1/1       Running   4          Xm

NAME            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
svc/nginx       LoadBalancer   10.103.55.204   <pending>     80:31567/TCP   Xm
svc/nodejs      LoadBalancer   10.103.55.204   <pending>     3000:31567/TCP Xm
svc/kubernetes  ClusterIP      10.96.0.1       <none>        443/TCP        Xm

Listing the common resources in your current namespace (pods, services, deployments, replica sets, and so on) gives you a quick glance at what's running and what's not.
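Scanning that output by eye works for a handful of pods but gets tedious fast. As a rough sketch, assuming the column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE), a hypothetical `flag_unhealthy_pods` shell function can pick out anything that isn't Running or has restarted at least once. Note that every pod in the sample output above has restarted four or more times, so all of them would be flagged.

```shell
# Hypothetical helper (not part of kubectl): reads `kubectl get pods` output
# on stdin and prints NAME, STATUS and RESTARTS for every pod that is not
# Running or has a non-zero restart count. The header line is skipped.
flag_unhealthy_pods() {
  awk 'NR > 1 && ($3 != "Running" || $4 + 0 > 0) { print $1, $3, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods | flag_unhealthy_pods
```

Newer kubectl versions append something like "(2m ago)" after the restart count, which lands in a later field, so the fourth column is still the count itself.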

One step up from this is to start the Kubernetes Dashboard and get a nice UI. This is where you can get an overview of your inventory.

It's very nice, but you can't dig down and get to the bottom of the issue at hand. The only thing you have access to is logs.

However nice this is, it doesn't really solve my problem.

Using an infrastructure monitoring tool

It all boils down to finding a tool that can keep everything under one roof. I don't want to SSH into machines, run dozens of kubectl commands, or fiddle with the Kubernetes Dashboard. It's a huge overhead when you need to multitask between several tools to solve one issue.

The monitoring market is incredibly saturated: from the likes of Sumo Logic, Datadog, and New Relic, there's a huge number of tools to choose from.

I use Sematext as the observability tool to bundle everything together. I have every type of monitoring imaginable with logs that can be correlated to metrics. Alerts can be based on both metrics and logs, making it even more powerful.

Here's what the spike in my processes' CPU usage looked like when I had a mini heart attack, thinking I was being DoSed.

From the processes screen, I can find the exact pod and container that caused the issue. From there, I can pinpoint where the DoS attack came from by looking at the Nginx access logs.
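If you want to do the access-log step from a terminal, here is a minimal sketch for finding the noisiest clients. It assumes Nginx's default `combined` log format, where the client address (`$remote_addr`) is the first whitespace-separated field; the `top_ips` function name and the paths in the usage comments are illustrative assumptions. Behind a load balancer, that first field may be the proxy's address unless real client IP forwarding is configured.

```shell
# Hypothetical helper: prints the N busiest client IPs in an Nginx access log,
# as "<request count> <ip>", busiest first. Defaults to the top 10.
# Assumes $remote_addr is field 1 (default "combined" log format).
top_ips() {
  local log_file="$1"
  local limit="${2:-10}"
  awk '{ print $1 }' "$log_file" | sort | uniq -c | sort -rn | head -n "$limit"
}

# Usage (the log path is an assumption; adjust it to your setup):
#   top_ips /var/log/nginx/access.log 5
# The Ingress controller logs to stdout, so in Kubernetes you could dump them first:
#   kubectl logs <ingress-controller-pod> > access.log && top_ips access.log
```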

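Blocking this IP is as simple as adding a deny directive to the matching location block of your Nginx configuration. With the Nginx Ingress controller, which generates that configuration for you, you'd typically inject it through annotations or the controller's ConfigMap rather than editing the file by hand.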

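location / {
    # ... your existing settings for this location ...
    deny    142.93.252.121;   # requests from this IP get 403 Forbidden
    allow   all;              # everyone else is still let through
}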
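Rather than copying IPs by hand, you could generate candidate deny lines from the same access log. This is a hedged sketch: the `deny_rules` function and the request threshold are hypothetical, and it again assumes the client IP is the first field of each log line. Review the output before it goes anywhere near production, since a single IP can be a NAT gateway or proxy shared by many legitimate users.

```shell
# Hypothetical helper: prints an Nginx "deny" directive for every client IP
# that made MORE than THRESHOLD requests in the given access log.
# Assumes $remote_addr is field 1 (default "combined" log format).
deny_rules() {
  local log_file="$1"
  local threshold="$2"
  awk -v max="$threshold" '
    { hits[$1]++ }                       # count requests per client IP
    END {
      for (ip in hits)
        if (hits[ip] > max) printf "deny    %s;\n", ip
    }
  ' "$log_file" | sort                   # awk iteration order is unspecified
}

# Usage (path and threshold are assumptions):
#   deny_rules /var/log/nginx/access.log 1000
```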

No more sweating

Correlations that make it easy to see which metric spike ties to which log event, plus notifications about errors, are what I look for in an observability tool. Every tool should minimize the work I need to do to drill down and find the issue.

Hope you enjoyed my story of when I almost had a heart attack. Stay tuned for more life stories about cool new DevOps and observability topics.

If you want a 30-day free trial of Sematext, sign up here. Otherwise, feel free to support my writing by following me here on DEV. 🙂