<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JA Samitier</title>
    <description>The latest articles on DEV Community by JA Samitier (@eckelon).</description>
    <link>https://dev.to/eckelon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F339053%2Ff1ceea51-d08f-4220-b554-04c857a576bd.jpg</url>
      <title>DEV Community: JA Samitier</title>
      <link>https://dev.to/eckelon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eckelon"/>
    <language>en</language>
    <item>
      <title>How to start a Python project easily</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 23 Jan 2023 22:26:23 +0000</pubDate>
      <link>https://dev.to/eckelon/how-to-start-a-python-project-easily-3265</link>
      <guid>https://dev.to/eckelon/how-to-start-a-python-project-easily-3265</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@d_mccullough" rel="noopener noreferrer"&gt;Daniel McCullough&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I like to try out new technologies and programming languages, and one of the first blockers I face is how to start a project. I absolutely love the way you can start a NodeJS project with &lt;code&gt;npm init&lt;/code&gt;. You can do something similar with Python.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to start a Python project while ensuring that its dependencies won't interfere with other projects' dependencies, and how to use Make to distribute your project so it can be installed on other development machines in no time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check your versions
&lt;/h2&gt;

&lt;p&gt;First, check your versions. I'm working with Python 3.10, but this will work with any version of Python 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;λ python &lt;span class="nt"&gt;--version&lt;/span&gt;
Python 3.10.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the virtual environment
&lt;/h2&gt;

&lt;p&gt;First, create a folder to store your project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; ~/Development/python-starter-example
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Development/python-starter-example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you need to create "the project" itself, aka the &lt;a href="https://docs.python.org/3/library/venv.html" rel="noopener noreferrer"&gt;virtual environment&lt;/a&gt;. The &lt;code&gt;venv&lt;/code&gt; module lets you keep all the dependencies and configuration for your Python project inside the project directory, similar to how NodeJS projects keep all their dependencies in &lt;code&gt;node_modules&lt;/code&gt;. Let's do this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Development/python-starter-example
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="nb"&gt;env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Working with the virtual environment
&lt;/h2&gt;

&lt;p&gt;Now the environment is created, but we need to go "inside" it by activating it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ./env/bin/activate 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time you go back to the project directory, you'll need to activate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I know that I'm inside the environment?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Easy-peasy: You'll see a little &lt;code&gt;(env)&lt;/code&gt; text in the prompt of your terminal.&lt;/p&gt;

&lt;p&gt;Now, every Python command you run inside the environment will be executed in that context. If you install a Python library, it won't interfere with other versions of that library used by other Python projects.&lt;/p&gt;
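&lt;p&gt;&lt;em&gt;Besides looking at the prompt, you can also check programmatically: inside a virtual environment, Python's &lt;code&gt;sys.prefix&lt;/code&gt; points at the env directory, while &lt;code&gt;sys.base_prefix&lt;/code&gt; keeps pointing at the base installation. A minimal sketch (the &lt;code&gt;in_virtualenv&lt;/code&gt; helper name is mine):&lt;/em&gt;&lt;/p&gt;

```python
import sys

# Inside an activated virtual environment, sys.prefix points at the
# env directory, while sys.base_prefix still points at the base
# Python installation; outside a venv, both are equal.
def in_virtualenv():
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```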

&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Let's install &lt;code&gt;Flask&lt;/code&gt;, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# (env) =&amp;gt; we are inside the environment&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;flask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment of truth: let's check that &lt;code&gt;Flask&lt;/code&gt; was installed &lt;strong&gt;inside&lt;/strong&gt; the environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example
&lt;span class="nb"&gt;env &lt;/span&gt;λ &lt;span class="nb"&gt;ls env&lt;/span&gt;/lib/python3.10/site-packages/flask 
__init__.py app.py      config.py   globals.py  logging.py  sessions.py testing.py  wrappers.py
__main__.py blueprints.py   ctx.py      helpers.py  py.typed    signals.py  typing.py
__pycache__ cli.py      debughelpers.py json        scaffold.py templating.py   views.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
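&lt;p&gt;&lt;em&gt;Another way to check where a library is imported from, without hunting through &lt;code&gt;site-packages&lt;/code&gt; by hand, is to ask Python itself. A small sketch (&lt;code&gt;package_location&lt;/code&gt; is a hypothetical helper):&lt;/em&gt;&lt;/p&gt;

```python
import importlib.util

# Hypothetical helper: where would this package be imported from?
# Returns None when the package isn't installed in the current env.
def package_location(name):
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

print(package_location("json"))
```

&lt;p&gt;&lt;em&gt;Run inside the activated environment, &lt;code&gt;package_location("flask")&lt;/code&gt; should print a path under &lt;code&gt;env/lib/.../site-packages&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;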



&lt;p&gt;So it's there! But... what about the Flask binary? Remember that some Python libraries ship with a binary that you sometimes need to execute. Those binaries live under the &lt;code&gt;env/bin&lt;/code&gt; folder inside your project. See:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example
&lt;span class="nb"&gt;env &lt;/span&gt;λ &lt;span class="nb"&gt;ls env&lt;/span&gt;/bin                               
Activate.ps1    activate.csh    flask       pip3        python      python3.10
activate    activate.fish   pip     pip3.10     python3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the project
&lt;/h2&gt;

&lt;p&gt;Let's use the example &lt;code&gt;Flask&lt;/code&gt; application &lt;a href="https://flask.palletsprojects.com/en/2.2.x/quickstart/" rel="noopener noreferrer"&gt;from its documentation&lt;/a&gt;. Create a file called &lt;code&gt;app.py&lt;/code&gt; in the project and edit it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello_world&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;Hello, World!&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, run it. Remember that the project needs to use the &lt;code&gt;Flask&lt;/code&gt; version installed in the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;env&lt;/span&gt;/bin/flask &lt;span class="nt"&gt;--app&lt;/span&gt; app run  

&lt;span class="k"&gt;*&lt;/span&gt; Serving Flask app &lt;span class="s1"&gt;'app'&lt;/span&gt;
 &lt;span class="k"&gt;*&lt;/span&gt; Debug mode: off
WARNING: This is a development server. Do not use it &lt;span class="k"&gt;in &lt;/span&gt;a production deployment. Use a production WSGI server instead.
 &lt;span class="k"&gt;*&lt;/span&gt; Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:12:24] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:12:24] &lt;span class="s2"&gt;"GET /favicon.ico HTTP/1.1"&lt;/span&gt; 404 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Et voilà!&lt;/p&gt;

&lt;p&gt;You could also run it directly using the &lt;code&gt;python&lt;/code&gt; command. You just need to tweak the code a little.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello_world&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;Hello, World!&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run it with the &lt;code&gt;python&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example 10s
&lt;span class="nb"&gt;env &lt;/span&gt;λ python app.py
 &lt;span class="k"&gt;*&lt;/span&gt; Serving Flask app &lt;span class="s1"&gt;'app'&lt;/span&gt;
 &lt;span class="k"&gt;*&lt;/span&gt; Debug mode: off
WARNING: This is a development server. Do not use it &lt;span class="k"&gt;in &lt;/span&gt;a production deployment. Use a production WSGI server instead.
 &lt;span class="k"&gt;*&lt;/span&gt; Running on http://127.0.0.1:3000
Press CTRL+C to quit
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:16:37] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:16:38] &lt;span class="s2"&gt;"GET /favicon.ico HTTP/1.1"&lt;/span&gt; 404 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create the requirements file
&lt;/h2&gt;

&lt;p&gt;If you want to distribute your project with the rest of your team, or in a git repository, you need to create &lt;a href="https://pip.pypa.io/en/latest/user_guide/#requirements-files" rel="noopener noreferrer"&gt;the &lt;code&gt;requirements.txt&lt;/code&gt; file that will contain all the dependencies that your project's using&lt;/a&gt;. Since you installed everything using &lt;code&gt;pip&lt;/code&gt;, the easiest way of creating this file is using the &lt;code&gt;pip freeze&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have your requirements file - something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example
&lt;span class="nb"&gt;env &lt;/span&gt;λ &lt;span class="nb"&gt;cat &lt;/span&gt;requirements.txt            
&lt;span class="nv"&gt;click&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;8.1.3
&lt;span class="nv"&gt;Flask&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.2.2
&lt;span class="nv"&gt;itsdangerous&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.1.2
&lt;span class="nv"&gt;Jinja2&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.1.2
&lt;span class="nv"&gt;MarkupSafe&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.1.2
&lt;span class="nv"&gt;Werkzeug&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.2.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the dependencies are documented in the &lt;code&gt;requirements.txt&lt;/code&gt; file, anyone can install them with a simple &lt;code&gt;pip&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
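&lt;p&gt;&lt;em&gt;The format of &lt;code&gt;requirements.txt&lt;/code&gt; is simple enough that you can sketch a parser for the pinned entries in a few lines (&lt;code&gt;parse_requirements&lt;/code&gt; is a hypothetical helper, handling only the plain &lt;code&gt;name==version&lt;/code&gt; case):&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical helper: split pinned "name==version" entries out of a
# requirements.txt, skipping blank lines and comments.
def parse_requirements(text):
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pairs.append((name, version))
    return pairs
```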



&lt;h2&gt;
  
  
  Creating an application runner with Make
&lt;/h2&gt;

&lt;p&gt;Now, let's automate it. What if the virtual environment could be created automatically, with everything installed in it? I like doing this with &lt;code&gt;Make&lt;/code&gt;. Create a file called &lt;code&gt;Makefile&lt;/code&gt; inside the project directory and start with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;.PHONY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;

&lt;span class="nl"&gt;env/bin/activate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;requirements.txt&lt;/span&gt;
    python &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="nb"&gt;env&lt;/span&gt;
    ./env/bin/pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="nl"&gt;run&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;env/bin/activate&lt;/span&gt;
    ./env/bin/python app.py

&lt;span class="nl"&gt;freeze&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;env/bin/pip&lt;/span&gt;
    ./env/bin/pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt

&lt;span class="nl"&gt;clean&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; __pycache__
    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; ./env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down this &lt;code&gt;Makefile&lt;/code&gt;. It has three commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;run&lt;/code&gt;: will run the &lt;code&gt;app.py&lt;/code&gt; Python file using the environment's interpreter. If the environment doesn't exist yet (or &lt;code&gt;requirements.txt&lt;/code&gt; has changed since it was created), it will first create the virtual environment and install the dependencies. If there's no &lt;code&gt;requirements.txt&lt;/code&gt; file at all, &lt;code&gt;make&lt;/code&gt; will stop with an error.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;freeze&lt;/code&gt;: will update the &lt;code&gt;requirements.txt&lt;/code&gt; with all the libraries installed with &lt;code&gt;pip&lt;/code&gt;. This is useful if you installed new libraries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clean&lt;/code&gt;: will delete the Python cache and the environment. You can use this safely, because if you run your application again, everything will be re-created!&lt;/li&gt;
&lt;/ul&gt;
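&lt;p&gt;&lt;em&gt;The trick behind the &lt;code&gt;env/bin/activate&lt;/code&gt; rule is Make's timestamp comparison: a target is rebuilt when it's missing or older than its prerequisite. A minimal sketch of that check in Python (&lt;code&gt;needs_rebuild&lt;/code&gt; is a hypothetical helper):&lt;/em&gt;&lt;/p&gt;

```python
import os

# Hypothetical helper mirroring Make's rule: rebuild a target when it
# is missing, or when its prerequisite is newer than the target.
def needs_rebuild(target, prerequisite):
    if not os.path.exists(target):
        return True
    return os.path.getmtime(prerequisite) > os.path.getmtime(target)
```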

&lt;p&gt;Using &lt;code&gt;Make&lt;/code&gt; is really easy: just type &lt;code&gt;make&lt;/code&gt; followed by the command you want to execute, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Please note that you need to have Make installed. On a Mac, you can install it with &lt;code&gt;brew&lt;/code&gt; (&lt;code&gt;brew install make&lt;/code&gt;); on Linux, it's usually preinstalled or available in the software repositories (&lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;dnf&lt;/code&gt;...); and on Windows, you can use &lt;a href="https://learn.microsoft.com/en-us/windows/wsl/about" rel="noopener noreferrer"&gt;WSL to install a Linux distribution inside Windows&lt;/a&gt; and run Make from there.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's sum up
&lt;/h2&gt;

&lt;p&gt;In this article, you learned how to create a Python project in its own virtual environment, and how to write a Makefile to run everything from it. Now, when someone clones the project, they'll only need to type &lt;code&gt;make run&lt;/code&gt;, and it will automatically create the environment, install the dependencies, and run the project. Yay!&lt;/p&gt;

&lt;p&gt;I hope you found this interesting. Of course, this isn't the only way to manage a Python project; it's just how I like to do it. If there's something I missed, please ping me and I'll update the article. Thanks!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>aws</category>
      <category>googlecloud</category>
      <category>azure</category>
    </item>
    <item>
      <title>Prometheus 2.37 – The first long-term supported release!</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 18 Jul 2022 14:16:42 +0000</pubDate>
      <link>https://dev.to/eckelon/prometheus-237-the-first-long-term-supported-release-2f2h</link>
      <guid>https://dev.to/eckelon/prometheus-237-the-first-long-term-supported-release-2f2h</guid>
      <description>&lt;p&gt;&lt;strong&gt;Prometheus 2.37 is out and brings exciting news&lt;/strong&gt;: this is the first long-term supported release. It'll be supported for at least six months. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Long-Term Support (LTS) so significant?
&lt;/h2&gt;

&lt;p&gt;Prior to this release, each Prometheus version had a six-week life cycle. That means that if you wanted to &lt;strong&gt;stay up-to-date with the latest features and bug fixes&lt;/strong&gt;, you needed to upgrade your Prometheus server every six weeks or so. &lt;/p&gt;

&lt;p&gt;Upgrading isn't always as easy as clicking a button. &lt;strong&gt;As Prometheus grows, more and more companies depend on it&lt;/strong&gt; as the key component of their monitoring infrastructure, and they can't take the risk that new features and enhancements also bring regressions, forcing them to upgrade again. That's why Prometheus is adding LTS releases to its release cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uG-U1DhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mlvjy1eydbtir825f1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uG-U1DhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mlvjy1eydbtir825f1w.png" alt="Image description" width="880" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/docs/introduction/release-cycle/"&gt;Prometheus LTS releases&lt;/a&gt; will bring &lt;strong&gt;bug, security, and documentation fixes&lt;/strong&gt;, so companies limit the risks of upgrades while having the Prometheus server still up-to-date.&lt;/p&gt;

&lt;p&gt;So, you won't have the latest Prometheus features, but you'll know that &lt;strong&gt;upgrading to the next 2.37 fix release will be straightforward&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prometheus community is getting more and more mature
&lt;/h2&gt;

&lt;p&gt;2022 is a great year for the Prometheus community. At the last KubeCon EU in Valencia, the &lt;a href="https://www.cncf.io/announcements/2022/05/18/prometheus-associate-certification-will-demonstrate-ability-to-monitor-infrastructure/"&gt;CNCF announced&lt;/a&gt; the &lt;a href="https://training.linuxfoundation.org/certification/prometheus-certified-associate/"&gt;Prometheus Associate Certification, which is currently in beta&lt;/a&gt;. It allows engineers to demonstrate their proficiency in the Prometheus ecosystem and cloud-native observability concepts. Now, Prometheus is announcing an LTS release.&lt;/p&gt;

&lt;p&gt;The release of these new LTS versions means that, now, every time the community fixes bugs and security issues in Prometheus, &lt;strong&gt;the maintainers will add these fixes both in the latest Prometheus minor and in the latest LTS version&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This extra effort is a &lt;strong&gt;serious investment in the Prometheus maturity&lt;/strong&gt; that will bring more stability to the vast number of companies and projects using Prometheus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some nice changes included in Prometheus 2.37
&lt;/h2&gt;

&lt;p&gt;This release also includes other nice changes, like a new built-in &lt;a href="https://github.com/prometheus/prometheus/pull/10915"&gt;service discovery for HashiCorp Nomad&lt;/a&gt;, and an &lt;a href="https://github.com/prometheus/prometheus/pull/10759"&gt;enhancement that allows attaching node labels for endpoint roles&lt;/a&gt; in the Kubernetes service discovery.&lt;/p&gt;

&lt;p&gt;You can find the full list of changes in the &lt;a href="https://github.com/prometheus/prometheus/releases/tag/v2.37.0"&gt;official release notes of Prometheus 2.37&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to monitor nginx in Kubernetes with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 04 Jul 2022 08:28:02 +0000</pubDate>
      <link>https://dev.to/eckelon/how-to-monitor-nginx-in-kubernetes-with-prometheus-j5f</link>
      <guid>https://dev.to/eckelon/how-to-monitor-nginx-in-kubernetes-with-prometheus-j5f</guid>
      <description>&lt;p&gt;nginx is an open source web server often used as a reverse proxy, load balancer, and web cache. Designed for high loads of concurrent connections, it's fast, versatile, reliable, and most importantly, very light on resources.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to monitor nginx in Kubernetes with Prometheus, and also how to troubleshoot different issues related to latency, saturation, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ingredients
&lt;/h2&gt;

&lt;p&gt;Before we begin, let's summarize the tools you'll be using for this project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nginx server (I bet it's already running in your cluster!).&lt;/li&gt;
&lt;li&gt;Our beloved &lt;a href="https://prometheus.io" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, the open source monitoring standard.&lt;/li&gt;
&lt;li&gt;The official &lt;a href="https://github.com/nginxinc/nginx-prometheus-exporter" rel="noopener noreferrer"&gt;nginx exporter&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.fluentd.org/" rel="noopener noreferrer"&gt;Fluentd&lt;/a&gt;, and its &lt;a href="https://github.com/fluent/fluent-plugin-prometheus/blob/master/README.md" rel="noopener noreferrer"&gt;plugin for Prometheus&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Starting with the basics: nginx exporter
&lt;/h2&gt;

&lt;p&gt;The first thing you need to do when you want to monitor nginx in Kubernetes with Prometheus is install the nginx exporter. Our recommendation is to install it as a sidecar for your nginx servers, just by adding it to the deployment. It should be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-server
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9113'
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: nginx.conf
        - name: nginx-exporter
          image: 'nginx/nginx-prometheus-exporter:0.10.0'
          args:
            - '-nginx.scrape-uri=http://localhost/nginx_status'
          resources:
            limits:
              memory: 128Mi
              cpu: 500m
          ports:
            - containerPort: 9113
      volumes:
        - configMap:
            defaultMode: 420
            name: nginx-config
          name: nginx-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, you've just added an nginx exporter container in each nginx server pod. Since we configured three replicas, there'll be three pods, each containing one nginx server container and one nginx exporter container. Apply this new configuration and &lt;em&gt;voilà!&lt;/em&gt; You've easily exposed metrics from your nginx server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring nginx overall status with Prometheus
&lt;/h2&gt;

&lt;p&gt;Do you want to confirm that it worked? Easy-peasy. Go to Prometheus and try this PromQL out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum (nginx_up)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should return &lt;code&gt;3&lt;/code&gt;: the three exporter containers reporting &lt;em&gt;nginx_up&lt;/em&gt; as one. Don't worry about the metrics yet, we'll get there in no time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferxh3ubu9kul0fgn4wfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferxh3ubu9kul0fgn4wfd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring nginx connections with Prometheus
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Active connections
&lt;/h3&gt;

&lt;p&gt;Let's use the following metrics to take a look at the nginx active connections. You can also focus on which ones are reading or writing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_active&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_reading&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_writing&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just by using them you'll have something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wimsapnbpzd35g9qth3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wimsapnbpzd35g9qth3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Unhandled connections
&lt;/h3&gt;

&lt;p&gt;Now, let's focus on how many connections are not being handled by nginx. You just need to subtract the handled connections from the accepted connections. The nginx exporter gives us both metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_handled&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_accepted&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let's get the percentage of accepted connections that are being unhandled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(nginx_connections_accepted{kube_cluster_name=~$cluster}[$__interval]) - rate(nginx_connections_handled{kube_cluster_name=~$cluster}[$__interval]) or vector(0) / rate(nginx_connections_accepted{kube_cluster_name=~$cluster}[$__interval]) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvc1hk6gtw34nkreg6k8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvc1hk6gtw34nkreg6k8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully this number will be near zero!&lt;/p&gt;
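&lt;p&gt;&lt;em&gt;The arithmetic behind this query can be sketched outside PromQL, too (a hypothetical helper, assuming you already have the counter increases for the time window):&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical helper mirroring the PromQL above: the percentage of
# accepted connections that were not handled over a time window,
# given the counter increases for that window.
def unhandled_pct(accepted, handled):
    if accepted == 0:
        return 0.0
    return (accepted - handled) / accepted * 100
```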

&lt;h3&gt;
  
  
  Waiting connections
&lt;/h3&gt;

&lt;p&gt;Fortunately, this is also an easy query. Just type &lt;code&gt;nginx_connections_waiting&lt;/code&gt;, which is the metric that the nginx exporter uses to expose this information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqdw3jyy6bjkyikmt699.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqdw3jyy6bjkyikmt699.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Need more metrics? Take them from the logs!
&lt;/h2&gt;

&lt;p&gt;In case you need more information to monitor nginx in Kubernetes with Prometheus, you can use nginx's &lt;code&gt;access.log&lt;/code&gt; to extract a little more. Let's see how.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fluentd, the open source data collector
&lt;/h3&gt;

&lt;p&gt;You can configure Fluentd to pick up information from the nginx access.log and convert it into a Prometheus metric. This can be really handy for situations where the instrumented application doesn't expose much information.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to install and configure Fluentd
&lt;/h3&gt;

&lt;p&gt;We already talked about Fluentd and its Prometheus plugin here, so &lt;a href="https://sysdig.com/blog/fluentd-monitoring/" rel="noopener noreferrer"&gt;just follow the instructions in that article&lt;/a&gt;, and you'll be ready to rock.&lt;/p&gt;
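&lt;p&gt;As a quick reference, once the Prometheus plugin is installed, Fluentd needs a source that exposes the metrics endpoint for Prometheus to scrape. A minimal sketch, assuming the plugin's default port and path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;source&amp;gt;
    @type prometheus
    bind 0.0.0.0
    port 24231
    metrics_path /metrics
&amp;lt;/source&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;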

&lt;h3&gt;
  
  
  Let's configure Fluentd to export a few more metrics
&lt;/h3&gt;

&lt;p&gt;To do this, you need to tweak the &lt;code&gt;access.log&lt;/code&gt; format a little: you can pick the default logging format, and add the &lt;code&gt;&lt;a href="https://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_response_time" rel="noopener noreferrer"&gt;$upstream_response_time&lt;/a&gt;&lt;/code&gt; at the end. This way, Fluentd will have this variable and use it to create some useful metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: nginx-config
data:
  nginx.conf: |
    log_format custom_format '$remote_addr - $remote_user [$time_local] '
      '"$request" $status $body_bytes_sent '
      '"$http_referer" "$http_user_agent" '
      '$upstream_response_time';
    server {
      access_log /var/log/nginx/access.log custom_format;
      ...
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This config goes in the &lt;code&gt;nginx.conf&lt;/code&gt;, usually in a &lt;code&gt;ConfigMap&lt;/code&gt;.&lt;/p&gt;
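&lt;p&gt;If your nginx runs as a Deployment, you then mount that &lt;code&gt;ConfigMap&lt;/code&gt; over the default configuration file. Here's a minimal sketch of the relevant pod spec fragment (the volume and container names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
  volumes:
    - name: nginx-config
      configMap:
        name: nginx-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;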

&lt;p&gt;Next, you need to configure Fluentd to read the new log format. You can do this by creating a new config for nginx in Fluentd's &lt;code&gt;fileConfig&lt;/code&gt; section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;source&amp;gt;
    @type prometheus_tail_monitor
&amp;lt;/source&amp;gt;
&amp;lt;source&amp;gt;
    @type tail
    &amp;lt;parse&amp;gt;
    @type regexp
    expression /^(?&amp;lt;timestamp&amp;gt;.+) (?&amp;lt;stream&amp;gt;stdout|stderr)( (.))? (?&amp;lt;remote&amp;gt;[^ ]*) (?&amp;lt;host&amp;gt;[^ ]*) (?&amp;lt;user&amp;gt;[^ ]*) \[(?&amp;lt;time&amp;gt;[^\]]*)\] \"(?&amp;lt;method&amp;gt;\w+)(?:\s+(?&amp;lt;path&amp;gt;[^\"]*?)(?:\s+\S*)?)?\" (?&amp;lt;status_code&amp;gt;[^ ]*) (?&amp;lt;size&amp;gt;[^ ]*)(?:\s"(?&amp;lt;referer&amp;gt;[^\"]*)") "(?&amp;lt;agent&amp;gt;[^\"]*)" (?&amp;lt;urt&amp;gt;[^ ]*)$/
        time_format %d/%b/%Y:%H:%M:%S %z
        keep_time_key true
        types size:integer,reqtime:float,uct:float,uht:float,urt:float
    &amp;lt;/parse&amp;gt;
    tag nginx
    path /var/log/containers/nginx*.log
    pos_file /tmp/fluent_nginx.pos
&amp;lt;/source&amp;gt;

&amp;lt;filter nginx&amp;gt;
     @type prometheus
&amp;lt;/filter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that config, you basically created a regex parser for the nginx access.log. This is the &lt;code&gt;expression&lt;/code&gt; config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;expression /^(?&amp;lt;timestamp&amp;gt;.+) (?&amp;lt;stream&amp;gt;stdout|stderr)( (.))? (?&amp;lt;remote&amp;gt;[^ ]*) (?&amp;lt;host&amp;gt;[^ ]*) (?&amp;lt;user&amp;gt;[^ ]*) \[(?&amp;lt;time&amp;gt;[^\]]*)\] \"(?&amp;lt;method&amp;gt;\w+)(?:\s+(?&amp;lt;path&amp;gt;[^\"]*?)(?:\s+\S*)?)?\" (?&amp;lt;status_code&amp;gt;[^ ]*) (?&amp;lt;size&amp;gt;[^ ]*)(?:\s"(?&amp;lt;referer&amp;gt;[^\"]*)") "(?&amp;lt;agent&amp;gt;[^\"]*)" (?&amp;lt;urt&amp;gt;[^ ]*)$/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take this log line for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-06-07T14:16:57.754883042Z stdout F 100.96.2.5 - - [07/Jun/2022:14:16:57 +0000] "GET /ok/500/5000000 HTTP/1.1" 200 5005436 "-" "python-requests/2.22.0" 0.091 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the parser, you broke that log line into the following parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;timestamp: 2022-06-07T14:16:57.754883042Z&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stream: stdout&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;remote: 100.96.2.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;host: -&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;user: -&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time: 07/Jun/2022:14:16:57 +0000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;method: GET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;path: /ok/500/5000000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status_code: 200&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size: 5005436&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;referer: -&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;agent: python-requests/2.22.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;urt: 0.091&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you've configured Fluentd to read the access.log, you can create some metrics using the variables from the parser.&lt;/p&gt;

&lt;h3&gt;
  
  
  nginx bytes sent
&lt;/h3&gt;

&lt;p&gt;You can use the &lt;code&gt;size&lt;/code&gt; variable to create the &lt;code&gt;nginx_size_bytes_total&lt;/code&gt; metric: a counter with the total nginx bytes sent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      &amp;lt;metric&amp;gt;
        name nginx_size_bytes_total
        type counter
        desc nginx bytes sent
        key size
      &amp;lt;/metric&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
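&lt;p&gt;Since this metric is a counter, you'll usually query it with &lt;code&gt;rate()&lt;/code&gt; to turn it into throughput. For example, to chart the bytes sent per second over the last five minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(nginx_size_bytes_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;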



&lt;h3&gt;
  
  
  Error rates
&lt;/h3&gt;

&lt;p&gt;Let's create this simple metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;metric&amp;gt;
        name nginx_request_status_code_total
        type counter
        desc nginx request status code
        &amp;lt;labels&amp;gt;
          method ${method}
          path ${path}
          status_code ${status_code}
        &amp;lt;/labels&amp;gt;
&amp;lt;/metric&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This metric is just a counter with all the log lines. So, why is it useful? Well, you can use other variables as labels, which can be handy to break down all the information. Let's use this metric to get the total error rate percentage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(nginx_request_status_code_total{status_code=~"[45].."}[1h])) / sum(rate(nginx_request_status_code_total[1h])) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You could also get this information aggregated by &lt;code&gt;method&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (method) (rate(nginx_request_status_code_total{status_code=~"[45].."}[1h])) / sum by (method) (rate(nginx_request_status_code_total[1h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or even by path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (path) (rate(nginx_request_status_code_total{status_code=~"[45].."}[1h])) / sum by (path) (rate(nginx_request_status_code_total[1h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Wouldn't it be great if you could monitor the latency of the successful requests? Well, it might as well be your birthday because you can! Remember when we told you to add the &lt;code&gt;$upstream_response_time&lt;/code&gt; variable? &lt;/p&gt;

&lt;p&gt;This variable stores the time spent receiving the response from the upstream server, in seconds. You can create a histogram metric with Fluentd, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;metric&amp;gt;
        name nginx_upstream_time_seconds_hist
        type histogram
        desc Histogram of the total time spent on receiving the response from the upstream server.
        key urt
        &amp;lt;labels&amp;gt;
          method ${method}
          path ${path}
          status_code ${status_code}
        &amp;lt;/labels&amp;gt;
&amp;lt;/metric&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now, magically, you can try this PromQL query to get the p95 latency of all the successful requests, aggregated by the path of the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, sum(rate(nginx_upstream_time_seconds_hist_bucket{status_code!~"[45].."}[1h])) by (le, path))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  To sum up
&lt;/h2&gt;

&lt;p&gt;In this article, you learned how to monitor nginx in Kubernetes with Prometheus, and how to create more metrics using Fluentd to read the nginx access.log. You also learned some interesting metrics to monitor and troubleshoot nginx with Prometheus. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Monitor and troubleshoot Consul with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Fri, 29 Apr 2022 09:21:31 +0000</pubDate>
      <link>https://dev.to/eckelon/monitor-and-troubleshoot-consul-with-prometheus-2pkf</link>
      <guid>https://dev.to/eckelon/monitor-and-troubleshoot-consul-with-prometheus-2pkf</guid>
      <description>&lt;p&gt;In this article, you’ll learn how to Monitor Consul with Prometheus. Also, troubleshoot Consul control plane with Prometheus from scratch, &lt;a href="https://www.consul.io/docs/agent/telemetry" rel="noopener noreferrer"&gt;following Consul’s docs monitoring recommendations&lt;/a&gt;. Also, you’ll find out how to troubleshoot the most common Consul issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to install Consul in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Installing Consul in Kubernetes is straightforward: just take a look at the &lt;a href="https://www.consul.io/docs" rel="noopener noreferrer"&gt;Consul documentation page&lt;/a&gt; and follow the instructions. We &lt;a href="https://www.consul.io/docs/k8s/installation/install#helm-chart-installation" rel="noopener noreferrer"&gt;recommend using the Helm chart&lt;/a&gt;, since it’s the easiest way of deploying applications in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-01.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to configure Consul to expose Prometheus metrics
&lt;/h2&gt;

&lt;p&gt;Consul &lt;a href="https://www.consul.io/docs/k8s/connect/observability/metrics" rel="noopener noreferrer"&gt;automatically exports metrics in the Prometheus format&lt;/a&gt;. You just need to enable these options in the &lt;code&gt;global.metrics&lt;/code&gt; configuration. If you’re using Helm, you can do it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--set 'global.metrics.enabled=true'
--set 'global.metrics.enableAgentMetrics=true'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, you’ll need to set &lt;code&gt;telemetry.disable_hostname&lt;/code&gt; for both the Consul server and client so the metrics don’t contain the name of the instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"'
--set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
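&lt;p&gt;Putting the flags together, a complete install could look like the sketch below (the release name and chart repository are placeholders for your own setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install consul hashicorp/consul \
  --set 'global.metrics.enabled=true' \
  --set 'global.metrics.enableAgentMetrics=true' \
  --set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"' \
  --set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;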



&lt;h2&gt;
  
  
  Monitor Consul with Prometheus: Overall status
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Autopilot
&lt;/h3&gt;

&lt;p&gt;First, you can check the overall health of the Consul server using the Autopilot metric (&lt;code&gt;consul_autopilot_healthy&lt;/code&gt;). If all servers are healthy, this metric returns 1; otherwise, it returns 0. All non-leader servers report &lt;code&gt;NaN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You could add this PromQL query to your dashboard to check the overall status of the Consul server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;min(consul_autopilot_healthy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding these thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;1&lt;/code&gt;: “Healthy”&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;0&lt;/code&gt;: “Unhealthy”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To trigger an alert when one or many Consul servers in the cluster are unhealthy, you can simply use this PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consul_autopilot_healthy == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
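&lt;p&gt;If you manage alerts directly in Prometheus, that same expression fits into a standard alerting rule. Here's a sketch (the group name, alert name, and &lt;code&gt;for&lt;/code&gt; duration are up to you):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: consul
    rules:
      - alert: ConsulServerUnhealthy
        expr: consul_autopilot_healthy == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: One or more Consul servers are unhealthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;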



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-02.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to dig deeper into PromQL? Read our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noopener noreferrer"&gt;getting started with PromQL&lt;/a&gt; guide to learn how Prometheus stores data, and how to use PromQL functions and operators.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Leadership changes
&lt;/h3&gt;

&lt;p&gt;Consul deploys several instances of the control-plane controllers to ensure high availability. However, only one of them is the leader and the rest are for contingency. A Consul cluster should always have a stable leader. If it’s not stable, due to frequent elections or leadership changes, you could be facing network issues between the Consul servers.&lt;/p&gt;

&lt;p&gt;To check leadership stability, you can use the following metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;consul_raft_leader_lastContact&lt;/code&gt;: Indicates how much time has passed since the leader contacted the follower nodes when checking its leader lease.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;consul_raft_state_leader&lt;/code&gt;: Number of leaders.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;consul_raft_state_candidate&lt;/code&gt;: Number of candidates to promote to leader. If this metric returns a number higher than 0, it means that a leadership change is in progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a healthy cluster, you’re looking for a &lt;code&gt;consul_raft_leader_lastContact&lt;/code&gt; lower than 200ms, a &lt;code&gt;consul_raft_state_leader&lt;/code&gt; greater than 0, and a &lt;code&gt;consul_raft_state_candidate&lt;/code&gt; equal to 0.&lt;/p&gt;

&lt;p&gt;Let’s create some alerts to trigger if there is flapping leadership.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  There are too many elections for leadership: &lt;code&gt;sum(rate(consul_raft_state_candidate[1m]))&amp;gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  There are too many leadership changes: &lt;code&gt;sum(rate(consul_raft_state_leader[1m]))&amp;gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Leader time to contact followers is too high: &lt;code&gt;consul_raft_leader_lastContact{quantile="0.9"}&amp;gt;200&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The last query contains the label &lt;code&gt;quantile="0.9"&lt;/code&gt;, which selects the &lt;a href="https://en.wikipedia.org/wiki/Percentile" rel="noopener noreferrer"&gt;90th percentile&lt;/a&gt;. If the p90 exceeds 200ms, at least 10% of the leader’s contacts with its followers are taking longer than 200ms.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-03.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Top troubleshooting situations to monitor Consul
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Long latency in Consul transactions
&lt;/h3&gt;

&lt;p&gt;Long latency in Consul transactional operations could be due to an unexpected load on the Consul servers, or to issues on the servers themselves.&lt;/p&gt;

&lt;p&gt;Anomalies need to be detected in a time context: the network is dynamic by nature, so you can’t just compare your samples with a fixed value. Instead, compare current values with those from the last hour (or the last day, or the last five minutes) to determine whether a value is acceptable or needs attention.&lt;/p&gt;

&lt;p&gt;To detect anomalies, you can dust off your old statistics book and find the chapter explaining the normal distribution: 95% of the samples in a normal distribution fall between the average plus or minus two times the standard deviation.&lt;/p&gt;

&lt;p&gt;To calculate this in PromQL, you can use the &lt;code&gt;avg_over_time&lt;/code&gt; and &lt;code&gt;stddev_over_time&lt;/code&gt; functions, like in this example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_kvs_apply_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]) + 2* stddev_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-04.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s see a few alerts that are triggered if the transaction latency isn’t normal.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key-Value Store update time anomaly
&lt;/h4&gt;

&lt;p&gt;Consul KV Store update time had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_kvs_apply_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]) + 2* stddev_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-05.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that these examples contain &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/examples/#subquery" rel="noopener noreferrer"&gt;PromQL subqueries&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Transaction time anomalies
&lt;/h4&gt;

&lt;p&gt;Consul Transaction time had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_txn_apply_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_txn_apply_sum[1m])[1h:1m])+2*stddev_over_time(rate(consul_txn_apply_sum[1m])[1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Consul has a &lt;a href="https://www.consul.io/docs/architecture/consensus" rel="noopener noreferrer"&gt;Consensus protocol that uses the Raft algorithm&lt;/a&gt;. Raft is a “consensus” algorithm, a method to achieve value convergence over a distributed and fault-tolerant set of cluster nodes.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Transactions count anomaly
&lt;/h4&gt;

&lt;p&gt;Consul transactions count rate had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_raft_apply[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_raft_apply[1m])[1h:1m])+2*stddev_over_time(rate(consul_raft_apply[1m])[1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Commit time anomalies
&lt;/h4&gt;

&lt;p&gt;Consul commit time had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_raft_commitTime_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_raft_commitTime_sum[1m])[1h:1m])+2*stddev_over_time(rate(consul_raft_commitTime_sum[1m]) [1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  High memory consumption
&lt;/h3&gt;

&lt;p&gt;Keeping the memory usage under control is key to keeping the Consul server healthy. Let’s create some alerts to be sure that your Consul server doesn’t use more memory than available.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consul is using more than 90% of available memory.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 * sum by(namespace,pod,container)(container_memory_usage_bytes{container!="POD",container!="", namespace="consul"}) / sum by(namespace,pod,container)(kube_pod_container_resource_limits{job!="",resource="memory", namespace="consul"}) &amp;gt; 90

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-06.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The garbage collection pause is high
&lt;/h4&gt;

&lt;p&gt;Consul’s garbage collector has a &lt;em&gt;pause&lt;/em&gt; event that blocks all runtime threads until the garbage collection completes. Each pause is usually brief, but if Consul’s memory usage is high, it can trigger more and more GC events that could potentially slow down Consul.&lt;/p&gt;

&lt;p&gt;Let’s create two alerts: a warning alert if the GC takes more than two seconds per minute, and a critical alert if the GC takes more than five seconds per minute.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that one second is 1000000000 nanoseconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Garbage Collection stop-the-world pauses were greater than two seconds per minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_runtime_gc_pause_ns_sum[1m]) / 1000000000 &amp;gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Garbage Collection stop-the-world pauses were greater than five seconds per minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_runtime_gc_pause_ns_sum[1m]) / 1000000000 &amp;gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network load is high
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-07.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A high RPC call count may mean that requests are being rate-limited, which could imply a misconfigured Consul agent.&lt;/p&gt;

&lt;p&gt;Now it’s time to make sure that your Consul clients aren’t being rate-limited when sending requests to the Consul server. These are the recommended alerts for RPC connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  Client RPC requests anomaly
&lt;/h4&gt;

&lt;p&gt;Consul Client RPC requests had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_client_rpc[1m]) &amp;gt; 0) &amp;gt; (avg_over_time(rate(consul_client_rpc[1m]) [1h:1m])+ 2* stddev_over_time(rate(consul_client_rpc[1m]) [1h:1m]) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Client RPC requests rate limit exceeded
&lt;/h4&gt;

&lt;p&gt;Over 10% of Consul Client RPC requests have exceeded the rate limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_client_rpc_exceeded[1m]) / rate(consul_client_rpc[1m]) &amp;gt; 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Client RPC requests failed
&lt;/h4&gt;

&lt;p&gt;Over 10% of Consul Client RPC requests are failing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_client_rpc_failed[1m]) / rate(consul_client_rpc[1m]) &amp;gt; 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Replica issues
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Restoration time is too high
&lt;/h4&gt;

&lt;p&gt;In this situation, restoring from disk or the leader is slower than the leader writing a new snapshot and truncating its logs. After a restart, followers might never rejoin the cluster until write rates reduce.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consul_raft_leader_oldestLogAge &amp;lt; 2* max(consul_raft_fsm_lastRestoreDuration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using Consul Enterprise? Check that your license is up-to-date!
&lt;/h2&gt;

&lt;p&gt;You can use this simple PromQL query to check if your Consul Enterprise license will expire in less than 30 days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consul_system_licenseExpiration / 24 &amp;lt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitor Consul with Prometheus, with these dashboards
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-08.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don’t miss these &lt;a href="https://promcat.io/apps/consul#Dashboard" rel="noopener noreferrer"&gt;open source dashboards, already set up&lt;/a&gt; to monitor not only your Consul cluster overview, but also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Health&lt;/li&gt;
&lt;li&gt;  Transaction&lt;/li&gt;
&lt;li&gt;  Leadership&lt;/li&gt;
&lt;li&gt;  Network&lt;/li&gt;
&lt;li&gt;  Cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you’ve learned how to monitor the Consul control plane with Prometheus, and some alert recommendations, useful for troubleshooting the most common Consul issues.&lt;/p&gt;




</description>
      <category>consul</category>
      <category>prometheus</category>
      <category>kubernetes</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How to monitor Starlink with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Tue, 01 Mar 2022 09:04:31 +0000</pubDate>
      <link>https://dev.to/eckelon/how-to-monitor-starlink-with-prometheus-4bb7</link>
      <guid>https://dev.to/eckelon/how-to-monitor-starlink-with-prometheus-4bb7</guid>
      <description>&lt;p&gt;SpaceX's Starlink uses satellites in low-earth orbit to provide high-speed Internet services to most of the planet. During the beta, Starlink expects users to see data speeds vary from 50Mb/s to 150Mb/s and latency from 20ms to 40ms. It's also expected that there will be brief periods of no connectivity at all. Currently, there are around &lt;a href="https://www.spacex.com/launches"&gt;1,800 Starlink satellites in orbit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vWS-syrU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vWS-syrU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-01.png" alt="" width="697" height="970"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to monitor a Starlink connection
&lt;/h2&gt;

&lt;p&gt;There are several great projects available from the open source community, but the one we settled on using for the basis of our project was the &lt;a href="https://github.com/danopstech/starlink_exporter"&gt;Starlink Prometheus Exporter&lt;/a&gt; from Daniel Willcocks. We encourage you to look at his other project, Starlink Monitoring System, if you are interested in a pre-packaged solution.&lt;/p&gt;

&lt;p&gt;To monitor Starlink connections, we decided to fork the Starlink Prometheus Exporter project and &lt;a href="https://github.com/danopstech/starlink_exporter/pull/59"&gt;create a PR that updates the Starlink gRPC bindings using the latest Starlink firmware&lt;/a&gt; to provide some additional metrics from Starlink Dishy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BmzVaB3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-Featured-image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BmzVaB3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-Featured-image.png" alt="" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How does it work
&lt;/h3&gt;

&lt;p&gt;The Starlink Dishy is contactable at &lt;code&gt;192.168.100.1&lt;/code&gt; on port &lt;code&gt;9200&lt;/code&gt; for gRPC. If you are using the Starlink Wi-Fi router, this should be reachable by default. In this example, you'll monitor your Starlink connection using the Starlink Exporter to talk to Starlink Dishy via gRPC and expose metrics in a format Prometheus understands.&lt;/p&gt;
&lt;h3&gt;
  
  
  Requirements and what you will use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Access to a Starlink Internet Service.&lt;/li&gt;
&lt;li&gt;  Linux Node running Ubuntu 20.04 LTS.&lt;/li&gt;
&lt;li&gt;  Docker and Docker Compose.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/sysdigdan/starlink_exporter"&gt;Starlink Prometheus Exporter&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Prometheus.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Configuring Prometheus and Launching Containers
&lt;/h2&gt;

&lt;p&gt;First, you need to configure Prometheus to scrape the Starlink Exporter. Create a &lt;code&gt;prometheus&lt;/code&gt; folder and add the configuration file &lt;code&gt;prometheus.yml&lt;/code&gt; as seen below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval:     10s # By default, scrape targets every 15 seconds.
  evaluation_interval: 10s # By default, evaluate rules every 15 seconds.
  scrape_timeout:      10s # By default, it is set to the global default (10s).


  external_labels:
    monitor: 'starlink-exporter'
    origin_prometheus: 'starlink'

scrape_configs:
  - job_name: 'starlink'
    static_configs:
      - targets: ['127.0.0.1:9817']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, launch the Prometheus and Starlink Exporter containers using Docker Compose and the following YAML (save this as &lt;code&gt;docker-compose.yml&lt;/code&gt; in the same location as your prometheus.yml above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.8'

volumes:
  prometheus_data: {}

services:
  starlink-exporter:
    image: sysdigdan/starlink_exporter:v0.1.3
    container_name: starlink_exporter
    restart: unless-stopped
    network_mode: host

  prometheus:
    image: prom/prometheus:v2.32.1
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    network_mode: host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, from the same directory as your &lt;code&gt;docker-compose.yml&lt;/code&gt; and &lt;code&gt;prometheus.yml&lt;/code&gt;, you can launch the containers with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose up -d&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let's make sure everything is running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker ps&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9mrdg8Lk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9mrdg8Lk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-02.png" alt="" width="880" height="42"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Monitor Starlink connection with Prometheus dashboards
&lt;/h2&gt;

&lt;p&gt;Now that both containers are running, you can access Prometheus (http://&amp;lt;NODE IP&amp;gt;:9090/) and look at the available metrics coming from Starlink Dishy (http://&amp;lt;NODE IP&amp;gt;:9817/metrics).&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitor Starlink connection: Performance Metrics
&lt;/h3&gt;

&lt;p&gt;You can review throughput utilization using &lt;code&gt;starlink_dish_downlink_throughput_bytes&lt;/code&gt; and &lt;code&gt;starlink_dish_uplink_throughput_bytes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--74C6h2p---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--74C6h2p---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-03.png" alt="" width="880" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hZVRbedx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hZVRbedx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-04.png" alt="" width="880" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also quickly see the latency between Starlink Dishy, Satellite, and Ground Station by using &lt;code&gt;starlink_dish_pop_ping_latency_seconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LHqd-kJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LHqd-kJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-05.png" alt="" width="880" height="478"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitor Starlink connection: Stability Metrics
&lt;/h3&gt;

&lt;p&gt;If you are interested in understanding the cause of outages, you can use the following PromQL query to review all outages over the past 24 hours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (cause) (sum_over_time(starlink_dish_outage_duration{cause!='UNKNOWN'}[24h])) / 10^9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
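The division by 10^9 converts the outage durations, which Dishy reports in nanoseconds, into seconds. A minimal Python sketch of the same per-cause aggregation, using hypothetical sample values:

```python
# Sum outage durations per cause and convert nanoseconds to seconds,
# mirroring: sum by (cause) (sum_over_time(...[24h])) / 10^9
from collections import defaultdict

# Hypothetical 24h samples of starlink_dish_outage_duration: (cause, duration_ns)
samples = [
    ("OBSTRUCTED", 3_500_000_000),
    ("NO_DOWNLINK", 1_200_000_000),
    ("OBSTRUCTED", 500_000_000),
]

def outage_seconds_by_cause(samples):
    totals = defaultdict(int)
    for cause, duration_ns in samples:
        totals[cause] += duration_ns
    return {cause: ns / 1e9 for cause, ns in totals.items()}

print(outage_seconds_by_cause(samples))
# {'OBSTRUCTED': 4.0, 'NO_DOWNLINK': 1.2}
```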



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OCmWzkqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OCmWzkqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-06.png" alt="" width="880" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also count the occurrences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (cause) (count_over_time(starlink_dish_outage_duration{cause!='UNKNOWN'}[24h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EfoBxjJj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EfoBxjJj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-07.png" alt="" width="880" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to dig deeper into PromQL? Read our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/"&gt;getting started with PromQL&lt;/a&gt; guide to learn how Prometheus stores data, and how to use PromQL functions and operators.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitor Starlink connection: Troubleshooting Metrics
&lt;/h3&gt;

&lt;p&gt;To understand satellite obstruction, use the following PromQL query, which shows a measure of obstruction in twelve 30-degree wedges around Dishy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;starlink_dish_wedge_abs_fraction_obstruction_ratio &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BV4Vp6S---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BV4Vp6S---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-08.png" alt="" width="880" height="595"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitor Starlink connection with Sysdig Monitor LTS
&lt;/h2&gt;

&lt;p&gt;With Prometheus and the Starlink Exporter all set up, we need to think about how best to provide longer retention for comparison over time. By default, Prometheus provides 15 days of retention. This can be adjusted, but the downside is that we would then need to manage storage and backups.&lt;/p&gt;

&lt;p&gt;One of the features that customers of Sysdig Monitor are taking full advantage of is Prometheus Remote Write, which allows us to natively ingest metrics from many Prometheus servers. There's also no need to manage storage, and with &lt;a href="https://sysdig.com/blog/challenges-prometheus-lts/"&gt;long retention and always-on metrics&lt;/a&gt;, it's a simple choice!&lt;/p&gt;

&lt;p&gt;The configuration for Prometheus Remote Write is simple. You just need to append a new remote_write section to the &lt;code&gt;prometheus.yml&lt;/code&gt; file we created earlier, similar to the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;remote_write:
    - url: "https:///prometheus/remote/write"
      bearer_token: ""
      tls_config:
        insecure_skip_verify: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the Prometheus container and you're done!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker restart prometheus&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XdMmqjpi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XdMmqjpi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-09.png" alt="" width="880" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GdNuJtzU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-010.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GdNuJtzU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-010.png" alt="" width="880" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lIGR-aUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-011.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lIGR-aUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-011.png" alt="" width="880" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article was posted originally &lt;a href="https://sysdig.com/blog/monitor-starlink/"&gt;by Dan Moloney in Sysdig&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>monitoring</category>
      <category>starlink</category>
      <category>prometheus</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Top PostgreSQL monitoring metrics for Prometheus – Includes cheat sheet</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 15 Nov 2021 15:49:43 +0000</pubDate>
      <link>https://dev.to/eckelon/top-postgresql-monitoring-metrics-for-prometheus-includes-cheat-sheet-47ch</link>
      <guid>https://dev.to/eckelon/top-postgresql-monitoring-metrics-for-prometheus-includes-cheat-sheet-47ch</guid>
      <description>&lt;p&gt;PostgreSQL monitoring with Prometheus is an &lt;a href="https://promcat.io/apps/postgresql#SetupGuide" rel="noopener noreferrer"&gt;easy thing to do&lt;/a&gt; thanks to the &lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;PostgreSQL Exporter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL is an open-source relational database with a powerful community behind it. It’s very popular due to its &lt;strong&gt;strong stability and powerful data types&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn the &lt;strong&gt;top 10 metrics in PostgreSQL monitoring&lt;/strong&gt;, with alert examples, both for PostgreSQL instances in Kubernetes and AWS RDS PostgreSQL instances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Blog-PostgreSQL-Monitoring-Featured-Image-v2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Featured-Image-v2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, we encourage you to &lt;a href="https://dig.sysdig.com/c/pf-top-10-metrics-in-postgresql?x=u_WFRi" rel="noopener noreferrer"&gt;download our Top 10 PostgreSQL monitoring metrics cheat sheet&lt;/a&gt; to dig deeper on how to monitor PostgreSQL with Prometheus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 metrics in PostgreSQL monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-1.png" alt="PostgreSQL dashboard showing the availability metric to 1, in a green background"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  #1 Check if PostgreSQL is running
&lt;/h4&gt;

&lt;p&gt;Checking that &lt;strong&gt;your PostgreSQL instance is up and running&lt;/strong&gt; should be the first step in PostgreSQL monitoring. The exporter will monitor the connection and availability of the PostgreSQL instance. The metric for monitoring PostgreSQL availability is &lt;code&gt;pg_up&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create an alert that triggers if the PostgreSQL server goes down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_up == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  #2 Postmaster Service Uptime
&lt;/h4&gt;

&lt;p&gt;Also, it’s important to ensure that the &lt;strong&gt;postmaster service uptime reflects the last known controlled server restart&lt;/strong&gt;. Otherwise, it means that the server has been restarted for unknown reasons. The metric for monitoring postmaster uptime is &lt;code&gt;pg_postmaster_start_time_seconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create an alert to notify if the PostgreSQL server was restarted without a known reason in the last hour (&lt;code&gt;3600&lt;/code&gt; seconds).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;time() - pg_postmaster_start_time_seconds &amp;lt; 3600
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
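The expression is true while the postmaster started less than an hour ago. The same check sketched in Python, with hypothetical timestamps:

```python
import time

# Fires if the server started within the last `window` seconds (default one hour),
# mirroring: time() - pg_postmaster_start_time_seconds compared against 3600
def restarted_recently(start_time_seconds, now=None, window=3600):
    now = time.time() if now is None else now
    return window - (now - start_time_seconds) > 0

# Hypothetical: a server that started 10 minutes ago vs. 2 hours ago
now = 1_700_000_000
print(restarted_recently(now - 600, now=now))   # True
print(restarted_recently(now - 7200, now=now))  # False
```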



&lt;h3&gt;
  
  
  Replication
&lt;/h3&gt;

&lt;h4&gt;
  
  
  #3 Replication lag
&lt;/h4&gt;

&lt;p&gt;In scenarios with replicated PostgreSQL servers, &lt;strong&gt;a high replication lag rate can lead to coherence problems&lt;/strong&gt; if the master goes down. The metric for monitoring replication lag is &lt;code&gt;pg_replication_lag&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create an alert that triggers if the replication lag is greater than &lt;code&gt;10&lt;/code&gt; seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_replication_lag &amp;gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;Running out of disk is a common problem in all databases. It can also prevent the Write Ahead Log (WAL) from being written to disk, which could end up in &lt;strong&gt;transaction issues&lt;/strong&gt; affecting data persistence.&lt;/p&gt;

&lt;p&gt;Luckily, it’s also a very easy thing to monitor. We will check the database size and the available disk space.&lt;/p&gt;

&lt;h4&gt;
  
  
  #4 Database size
&lt;/h4&gt;

&lt;p&gt;First, let’s figure out the storage usage of each of the &lt;strong&gt;PostgreSQL databases&lt;/strong&gt; in our instance. For this, we’ll use the &lt;code&gt;pg_database_size_bytes&lt;/code&gt; metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-2.png" alt="PosgreSQL dashboard showing the sizes of the different databases. In a chart, with a different color for each db."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  #5 Available storage
&lt;/h4&gt;

&lt;p&gt;It depends on how you run your PostgreSQL instance:&lt;/p&gt;

&lt;h5&gt;
  
  
  Kubernetes
&lt;/h5&gt;

&lt;p&gt;You can use the &lt;code&gt;node_filesystem_free_bytes&lt;/code&gt; metric from the &lt;a href="https://github.com/prometheus/node_exporter" rel="noopener noreferrer"&gt;node_exporter&lt;/a&gt;. You may remember when we predicted the future in our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noopener noreferrer"&gt;getting started PromQL guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-3.png" alt="PosgreSQL dashboard showing the percentage of disk used per node, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert to notify us when we are going to have less than &lt;code&gt;1&lt;/code&gt; GiB in the next &lt;code&gt;24&lt;/code&gt; hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(node_filesystem_free_bytes[1w], 3600 * 24) / (1024 * 1024 * 1024) &amp;lt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
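Under the hood, predict_linear fits a linear regression over the range and extrapolates it forward. A rough Python sketch of that idea, using a least-squares fit over hypothetical free-bytes samples and projecting 24 hours ahead:

```python
# Least-squares linear fit over (timestamp, value) samples, then extrapolation
# t_ahead seconds past the last sample -- roughly what PromQL's
# predict_linear(series[range], t_ahead) computes.
def predict_linear(samples, t_ahead):
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + t_ahead) + intercept

# Hypothetical node_filesystem_free_bytes: losing ~1 GiB every 6 hours
gib = 1024 ** 3
samples = [(0, 10 * gib), (21600, 9 * gib), (43200, 8 * gib)]
projected = predict_linear(samples, 24 * 3600)
print(projected / gib)  # about 4.0 GiB projected to remain in 24 hours
```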



&lt;h5&gt;
  
  
  AWS RDS PostgreSQL
&lt;/h5&gt;

&lt;p&gt;Cloud-managed database solutions, like AWS RDS, are getting more and more popular. If you are running an AWS RDS PostgreSQL instance, you can monitor it &lt;a href="https://sysdig.com/blog/monitoring-amazon-rds/" rel="noopener noreferrer"&gt;through CloudWatch and the YACE exporter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can use the &lt;code&gt;aws_rds_free_storage_space_average&lt;/code&gt; metric. Let’s create an alert if you’re going to run out of storage in the next &lt;code&gt;48&lt;/code&gt; hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(aws_rds_free_storage_space_average[48h], 48 * 3600) &amp;lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://dig.sysdig.com/c/pf-top-10-metrics-in-postgresql?x=u_WFRi" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FTop-10-metrics-in-ProgreSQL-Post-image_top-10-metrics-blog-img-2.png" alt="Download the PromQL CheatSheet!"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;If you had to keep just one networking metric, it should be the available connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  #6 Number of available connections
&lt;/h4&gt;

&lt;p&gt;We are going to &lt;strong&gt;calculate the available connections&lt;/strong&gt; by subtracting the superuser reserved connections (&lt;code&gt;pg_settings_superuser_reserved_connections&lt;/code&gt;) and the active connections (&lt;code&gt;pg_stat_activity_count&lt;/code&gt;) from the maximum number of connections (&lt;code&gt;pg_settings_max_connections&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-4.png" alt="PosgreSQL dashboard showing the percentage of available connections per node, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert to notify if the number of available connections is under &lt;code&gt;10&lt;/code&gt; percent of the total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((sum(pg_settings_max_connections) by (server) - sum(pg_settings_superuser_reserved_connections) by (server) - sum(pg_stat_activity_count) by (server)) / sum(pg_settings_max_connections) by (server)) * 100 &amp;lt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
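In plain arithmetic, the alert fires when (max - reserved - active) / max drops below 10 percent. A minimal sketch with hypothetical settings:

```python
# Available-connection percentage: (max - reserved - active) * 100 / max
def available_connection_pct(max_conn, reserved, active):
    return (max_conn - reserved - active) * 100 / max_conn

# Hypothetical instance: 100 max connections, 3 reserved for superusers, 90 active
pct = available_connection_pct(100, 3, 90)
print(pct)  # 7.0 -- under the 10 percent threshold, so the alert would fire
```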



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-5.png" alt="PosgreSQL dashboard showing the number of available connections per node, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;p&gt;Checking performance in any database means keeping an eye on &lt;strong&gt;CPU and memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a server runs out of memory, it can lead to more CPU load. Fortunately, some indicators warn us if memory usage needs to be optimized.&lt;/p&gt;

&lt;h4&gt;
  
  
  #7 Latency
&lt;/h4&gt;

&lt;p&gt;First, we are going to measure performance by calculating how much time it takes to get the results from the slowest active transaction. To do that, we’ll use the &lt;code&gt;pg_stat_activity_max_tx_duration&lt;/code&gt; metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-6.png" alt="PosgreSQL dashboard showing the max active transaction time, by DB, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert that notifies us when the active transaction takes more than &lt;code&gt;2&lt;/code&gt; seconds to complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_stat_activity_max_tx_duration{state="active"} &amp;gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  #8 Cache hit rate
&lt;/h4&gt;

&lt;p&gt;High latency can be a consequence of &lt;strong&gt;problems with the cache in memory&lt;/strong&gt;, which increases disk usage, so everything is slower.&lt;/p&gt;

&lt;p&gt;For analyzing the cache hit rate, we’ll check the in-memory transactions (&lt;code&gt;pg_stat_database_blks_hit&lt;/code&gt;) and the transactions running in disk (&lt;code&gt;pg_stat_database_blks_read&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-7.png" alt="PosgreSQL dashboard showing the average cache hit rate for the instance, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert when the cache hit rate is lower than &lt;code&gt;80&lt;/code&gt; percent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 * (rate(pg_stat_database_blks_hit[$__interval]) /
((rate(pg_stat_database_blks_hit[$__interval]) +
rate(pg_stat_database_blks_read[$__interval]))&amp;gt;0)) &amp;lt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
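The ratio behind the query is simply hits / (hits + reads). A small sketch with hypothetical counter rates:

```python
# Cache hit rate: share of block fetches served from memory rather than disk
def cache_hit_rate(blks_hit, blks_read):
    total = blks_hit + blks_read
    if total == 0:
        return None  # no traffic in the window, nothing to alert on
    return 100 * blks_hit / total

print(cache_hit_rate(750, 250))  # 75.0 -- under 80 percent, so the alert would fire
```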



&lt;h4&gt;
  
  
  #9 Memory available
&lt;/h4&gt;

&lt;p&gt;The solution for a low hit rate is &lt;strong&gt;increasing the memory&lt;/strong&gt; of your instance. But this is &lt;strong&gt;not always possible&lt;/strong&gt; due to potential memory limitations. So, first, we need to be sure that we have &lt;strong&gt;enough available memory&lt;/strong&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Kubernetes
&lt;/h5&gt;

&lt;p&gt;You can combine the total memory available for your instance (&lt;code&gt;kube_pod_container_resource_limits{resource="memory"}&lt;/code&gt;) with the memory being used (&lt;code&gt;container_memory_usage_bytes{container!="POD",container!=""}&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Let’s write a PromQL query that uses those metrics to get the total available memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(namespace,pod,container)(kube_pod_container_resource_limits{resource="memory"}) - sum by(namespace,pod,container)(container_memory_usage_bytes{container!="POD",container!=""})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this information, you can now determine &lt;strong&gt;how much more memory&lt;/strong&gt; you can allocate to your instance.&lt;/p&gt;

&lt;h5&gt;
  
  
  AWS RDS PostgreSQL instance
&lt;/h5&gt;

&lt;p&gt;If you are using AWS RDS PostgreSQL, then it’s really easy to know the available memory: just use the &lt;code&gt;aws_rds_freeable_memory_average&lt;/code&gt; metric!&lt;/p&gt;

&lt;h4&gt;
  
  
  #10 Requested buffer checkpoints
&lt;/h4&gt;

&lt;p&gt;PostgreSQL uses the buffer checkpoints to write the dirty buffers on disk, so it creates safe points for the Write Ahead Log (WAL). These checkpoints are scheduled periodically but also &lt;strong&gt;can be requested on-demand&lt;/strong&gt; when the buffer runs out of space.&lt;/p&gt;

&lt;p&gt;A high number of requested checkpoints compared to the number of scheduled checkpoints can directly impact the performance of your PostgreSQL instance. To avoid this situation, you could &lt;strong&gt;increase the database buffer size&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that increasing the buffer size &lt;strong&gt;will also increase the memory usage of your PostgreSQL instance&lt;/strong&gt;. Check your memory availability in the previous step.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s create a PromQL query to visualize the percentage of requested checkpoints (&lt;code&gt;pg_stat_bgwriter_checkpoints_req&lt;/code&gt;) compared with the total of both scheduled (&lt;code&gt;pg_stat_bgwriter_checkpoints_timed&lt;/code&gt;) and requested checkpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(pg_stat_bgwriter_checkpoints_req[5m]) /
(rate(pg_stat_bgwriter_checkpoints_req[5m]) + rate(pg_stat_bgwriter_checkpoints_timed[5m])) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
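The query boils down to requested / (requested + scheduled) * 100. A minimal sketch with hypothetical 5-minute rates:

```python
# Percentage of checkpoints that were requested on demand rather than scheduled
def requested_checkpoint_pct(req_rate, timed_rate):
    total = req_rate + timed_rate
    if total == 0:
        return 0.0  # no checkpoints in the window
    return req_rate * 100 / total

# Hypothetical rates: 2 requested and 8 scheduled checkpoints per window
print(requested_checkpoint_pct(2, 8))  # 20.0
```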



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-8.png" alt="PosgreSQL dashboard showing the percentage of requested checkpoints, comparing to the scheduled ones for the instance, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  That was nice, but where are my PostgreSQL monitoring dashboards?
&lt;/h2&gt;

&lt;p&gt;In this article, we introduced PostgreSQL monitoring with Prometheus, using &lt;code&gt;postgres_exporter&lt;/code&gt;. It doesn’t matter if you run your own &lt;strong&gt;PostgreSQL instance in Kubernetes, or in an AWS RDS PostgreSQL&lt;/strong&gt; instance. We also introduced the &lt;a href="https://dig.sysdig.com/c/pf-top-10-metrics-in-postgresql?x=u_WFRi" rel="noopener noreferrer"&gt;Top 10 metrics in PostgreSQL monitoring with Prometheus cheat sheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can now download the already configured &lt;a href="https://promcat.io/apps/postgresql#Dashboard" rel="noopener noreferrer"&gt;PostgreSQL monitoring dashboards from PromCat&lt;/a&gt; and add them to your Grafana installation (or to Sysdig Monitor!).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-9.png" alt="Screenshot showing the available PostgreSQL monitoring dashboards to download, in PromCat.io"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>monitoring</category>
      <category>kubernetes</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Top 5 key metrics for monitoring AWS RDS</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Thu, 15 Apr 2021 15:40:33 +0000</pubDate>
      <link>https://dev.to/eckelon/top-5-key-metrics-for-monitoring-aws-rds-3563</link>
      <guid>https://dev.to/eckelon/top-5-key-metrics-for-monitoring-aws-rds-3563</guid>
      <description>&lt;p&gt;Monitoring AWS RDS may require some &lt;strong&gt;observability strategy changes&lt;/strong&gt; if you switched from a classic on-prem MySQL/PostgreSQL solution.&lt;/p&gt;

&lt;p&gt;AWS RDS is a great solution that helps you &lt;strong&gt;focus on the data, and forget about bare metal&lt;/strong&gt;, patches, backups, etc. However, since you don’t have direct access to the machine, you’ll need to adapt your monitoring platform.&lt;/p&gt;

&lt;p&gt;In this article, we are going to describe the differences between an on-prem database solution and AWS RDS, as well as how you can &lt;strong&gt;start monitoring AWS RDS&lt;/strong&gt;. Also, we will identify the top five key metrics for monitoring AWS RDS. Maybe even more!&lt;/p&gt;

&lt;h2&gt;
  
  
  How AWS RDS is different from other on-prem database solutions
&lt;/h2&gt;

&lt;p&gt;Since AWS RDS is a managed cloud service, the way you configure and use it is &lt;strong&gt;through the AWS Console or AWS API&lt;/strong&gt;. You won’t have a terminal to access the machine directly, so every operation, like replication, backups, or disk management, has to be performed this way.&lt;/p&gt;

&lt;p&gt;You won’t have to worry about infrastructure matters such as replication, scaling, or backups. But you won’t have direct access to the instance either. That being so, &lt;strong&gt;you won’t be able to monitor AWS RDS using a classic node-exporter strategy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring AWS RDS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/blog/improving-prometheus-cloudwatch-exporter/" rel="noopener noreferrer"&gt;Monitoring AWS is pretty straightforward&lt;/a&gt;, using &lt;a href="https://github.com/ivx/yet-another-cloudwatch-exporter" rel="noopener noreferrer"&gt;YACE&lt;/a&gt; to get data from AWS CloudWatch and store it in Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2F01_Improving-the-Prometheus-CloudWatch-exporter.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2F01_Improving-the-Prometheus-CloudWatch-exporter.png" alt="Sysdig collaborated with the YACE exporter to make it production ready. CloudWatch gathers metrics, that YACE reads and presents in a Prometheus compatible format."&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://promcat.io/" rel="noopener noreferrer"&gt;PromCat&lt;/a&gt; to include AWS RDS in this setup will take you a couple of clicks. Just configure the credentials and apply the deployment in your cluster. Every step in the &lt;strong&gt;configuration is very well explained&lt;/strong&gt; in the &lt;a href="https://promcat.io/apps/aws-rds" rel="noopener noreferrer"&gt;AWS RDS PromCat setup guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-promcat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-promcat.png" alt="screenshot showing the setup guide page for the RDS configuration in PromCat.io"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 5 metrics you should look at
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Memory is constantly used in databases to cache the queries, tables, and results in order to minimize disk operations. This is directly related to how your database will perform. Not having enough memory will cause a low hit rate in the cache and an increase in the response time in your database. This is not good news!&lt;/p&gt;

&lt;p&gt;Also, every time a client connects to your database, it creates a new process that will use some memory. In situations with massive concurrent connections, like Black Friday, running out of memory can result in multiple rejected connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_freeable_memory_average&lt;/code&gt; metric (which YACE reads from the CloudWatch &lt;code&gt;FreeableMemory&lt;/code&gt; metric). It reports the memory available, in bytes, for your instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Memory.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Memory.png" alt="chart showing the values for the aws_rds_freeable_memory_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the available memory is under 128 MB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_freeable_memory_average &amp;lt; 128*1024*1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
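
&lt;p&gt;You can also anticipate memory exhaustion instead of reacting to it. As a sketch, this fires when the freeable memory trend of the last hour predicts hitting zero within four hours (both windows are illustrative and worth tuning for your workload):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(aws_rds_freeable_memory_average[1h], 4 * 3600) &amp;lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;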



&lt;h3&gt;
  
  
  DB Connections
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Even if there’s enough available memory, there is a max number of DB connections in every instance. If you reach this number, the following connections will be rejected, causing database errors in your application.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_database_connections_average&lt;/code&gt; metric (which uses the &lt;code&gt;DatabaseConnections&lt;/code&gt; CloudWatch metric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dbconnections.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dbconnections.png" alt="chart showing the values for the aws_rds_database_connections_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the number of DB connections is greater than 1,000. Unfortunately, CloudWatch does not provide the maximum number of DB connections, so you’ll need to specify it manually in the PromQL query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_database_connections_average &amp;gt; 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create an alert in case the number of connections has increased significantly in the last hour. That can be used to detect brute-force or DDoS attack attempts. In this example, you’ll be notified if the number of connections has grown more than tenfold over the last hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_database_connections_average / aws_rds_database_connections_average offset 1h &amp;gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CPU
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Databases use CPU to run queries. If there are many concurrent, complex, or poorly optimized queries, the CPU usage can reach the limit of the running instance. This will result in a very high response time and possibly some time-outs.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_cpuutilization_average&lt;/code&gt; metric (which uses the CloudWatch &lt;code&gt;CPUUtilization&lt;/code&gt; metric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-CPU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-CPU.png" alt="chart showing the values for the aws_rds_cpuutilization_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the average CPU usage of the instance is higher than 95%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_cpuutilization_average &amp;gt; 0.95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Storage is one of the most important parts of a database, since it’s where the data is held. Not having enough storage capacity will crash your database.&lt;/p&gt;

&lt;p&gt;Although setting up an auto-scaling strategy in AWS RDS is very easy, it could affect your infrastructure costs. That’s why you should be aware of the instance disk state.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_free_storage_space_average&lt;/code&gt; metric (which uses the &lt;code&gt;FreeStorageSpace&lt;/code&gt; CloudWatch metric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Storage.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Storage.png" alt="chart showing the values for the aws_rds_free_storage_space_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the available storage is lower than 512 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_free_storage_space_average &amp;lt; 512*1024*1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apart from this PromQL query, you can go further by traveling to the future. How? Using the &lt;code&gt;predict_linear&lt;/code&gt; PromQL function to predict when you are going to run out of storage. You may remember this from when &lt;a href="https://sysdig.com/blog/cooking-iot-prometheus/" rel="noopener noreferrer"&gt;we used it to cook a ham&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This PromQL query will alert you if you’re going to run out of storage in the next 48 hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(aws_rds_free_storage_space_average[48h], 48 * 3600) &amp;lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to dig deeper into PromQL functions, you can check our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noopener noreferrer"&gt;getting started PromQL CheatSheet&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read/Write Latency
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;In situations where there are queries returning a massive amount of data, the database will need to perform disk operations.&lt;/p&gt;

&lt;p&gt;Database disks normally have a low read/write latency, but issues can arise that result in high-latency operations. Monitoring this ensures the disk latency stays as low as expected.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_read_latency_average&lt;/code&gt; and &lt;code&gt;aws_rds_write_latency_average&lt;/code&gt; metrics (which use the &lt;code&gt;ReadLatency&lt;/code&gt; and &lt;code&gt;WriteLatency&lt;/code&gt; CloudWatch metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-read-write-latency.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-read-write-latency.png" alt="chart showing the values for the aws_rds_read_latency_average and aws_rds_write_latency_average metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create alerts to notify when the read or write latency is over 250ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_read_latency_average &amp;gt; 0.250
aws_rds_write_latency_average &amp;gt; 0.250
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Just 5? Let’s dig deeper with some bonus metrics!
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Network I/O
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;It doesn’t matter whether the database itself is working correctly if it can’t be reached from the outside. A misconfiguration or a malicious act from an attacker can result in losing connection to the instance.&lt;/p&gt;

&lt;p&gt;Learn how an attacker can &lt;a href="https://sysdig.com/blog/lateral-movement-cloud-containers/" rel="noopener noreferrer"&gt;infiltrate your cloud infrastructure and perform lateral movement&lt;/a&gt;. Also, learn how to prevent and detect such attacks.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_network_receive_throughput_average&lt;/code&gt; and &lt;code&gt;aws_rds_network_transmit_throughput_average&lt;/code&gt; metrics (which use the &lt;code&gt;NetworkReceiveThroughput&lt;/code&gt; and &lt;code&gt;NetworkTransmitThroughput&lt;/code&gt; CloudWatch metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-NetworkIO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-NetworkIO.png" alt="chart showing the values for the aws_rds_network_receive_throughput_average and aws_rds_network_transmit_throughput_average metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the network traffic is down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_network_receive_throughput_average = 0 AND aws_rds_network_transmit_throughput_average = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Read / Write IOPS
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;The number of input/output operations per second (IOPS) available in the instance can be configured and is billed separately.&lt;/p&gt;

&lt;p&gt;Not having enough can affect the performance of your application, and having more than needed will have a negative impact on your infrastructure costs.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_read_iops_average&lt;/code&gt; and &lt;code&gt;aws_rds_write_iops_average&lt;/code&gt; metrics (which use the &lt;code&gt;ReadIOPS&lt;/code&gt; and &lt;code&gt;WriteIOPS&lt;/code&gt; CloudWatch metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-iops.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-iops.png" alt="chart showing the values for the aws_rds_read_iops_average and aws_rds_write_iops_average metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create alerts if the read or write IOPS are greater than 2,500 operations per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_read_iops_average &amp;gt; 2500
aws_rds_write_iops_average &amp;gt; 2500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
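
&lt;p&gt;If your instance is provisioned with a combined IOPS budget, you can also alert on the sum of reads and writes. A sketch, where the 5,000 threshold is a hypothetical provisioned value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_read_iops_average + aws_rds_write_iops_average &amp;gt; 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;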



&lt;h2&gt;
  
  
  What’s next: Install this dashboard in a few clicks
&lt;/h2&gt;

&lt;p&gt;In this article, we’ve learned how easy it is to monitor AWS RDS and identified the top five key metrics for monitoring AWS RDS, with examples.&lt;/p&gt;

&lt;p&gt;All these metrics are &lt;a href="https://promcat.io/apps/aws-rds" rel="noopener noreferrer"&gt;available in the dashboards you can download from PromCat&lt;/a&gt;. They can be used in Grafana and in Sysdig Monitor as well!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dashboards.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dashboards.png" alt="screenshot showing the dashboard page for the RDS configuration in PromCat.io, where you can download the dashboards for both Grafana and Sysdig Monitor!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These top key metrics will allow you to see the full picture when troubleshooting and performing improvements in your AWS RDS instance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you would like to try this integration, we invite you to &lt;a href="https://sysdig.com/company/start-free/" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; of Sysdig Monitor.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rds</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Getting started with PromQL – Includes Cheatsheet!</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Thu, 11 Mar 2021 16:48:17 +0000</pubDate>
      <link>https://dev.to/eckelon/getting-started-with-promql-includes-cheatsheet-3a1d</link>
      <guid>https://dev.to/eckelon/getting-started-with-promql-includes-cheatsheet-3a1d</guid>
      <description>&lt;p&gt;Getting started with PromQL can be challenging when you first arrive in the fascinating world of Prometheus. Since &lt;strong&gt;Prometheus stores data in a time-series data model&lt;/strong&gt;, queries in a Prometheus server are radically different from good old SQL.&lt;/p&gt;

&lt;p&gt;Understanding &lt;strong&gt;how data is managed in Prometheus&lt;/strong&gt; is key to learning how to write good, performant PromQL queries.&lt;/p&gt;

&lt;p&gt;This article will &lt;strong&gt;introduce you to the PromQL basics&lt;/strong&gt; and provide a cheat sheet you can download to dig deeper into Prometheus and PromQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dig.sysdig.com/c/pf-infographic-promql-cheatsheet?x=u_WFRi"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ZwZoiRx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Download-now.png" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How time-series databases work
&lt;/h2&gt;

&lt;p&gt;Time series are &lt;strong&gt;streams of values associated with a timestamp&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3JykVpSq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Time-Series.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3JykVpSq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Time-Series.png" width="880" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every time series is identified by its metric name and its labels, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodb_up{}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube_node_labels{cluster="aws-01", label_kubernetes_io_role="master"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the above example, you can see the metric name (&lt;code&gt;kube_node_labels&lt;/code&gt;) and the labels (&lt;code&gt;cluster&lt;/code&gt; and &lt;code&gt;label_kubernetes_io_role&lt;/code&gt;). Although normally this is how the metrics and labels are referenced, the name of the metric is actually a label too. The query above can also be written like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{__name__ = "kube_node_labels", cluster="aws-01", label_kubernetes_io_role="master"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are four types of metrics in Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gauges&lt;/strong&gt; are arbitrary values that can go up and down. For example, &lt;code&gt;mongodb_up&lt;/code&gt; tells us if the exporter has a connection to the MongoDB instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Counters&lt;/strong&gt; represent totalizers from the beginning of the exporter and usually have the &lt;code&gt;_total&lt;/code&gt; suffix. For example, &lt;code&gt;http_requests_total&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Histogram&lt;/strong&gt; samples observations, such as the request durations or response sizes, and counts them in configurable buckets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; works as a histogram and also calculates configurable quantiles.&lt;/li&gt;
&lt;/ul&gt;
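
&lt;p&gt;For instance, histograms are usually queried with the &lt;code&gt;histogram_quantile&lt;/code&gt; function. This sketch estimates the 95th percentile request duration over the last five minutes, assuming a conventional &lt;code&gt;http_request_duration_seconds_bucket&lt;/code&gt; histogram metric (the metric name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;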

&lt;h2&gt;
  
  
  Getting started with PromQL data selection
&lt;/h2&gt;

&lt;p&gt;Selecting data in PromQL is as easy as specifying the &lt;strong&gt;metric you want to get the data from&lt;/strong&gt;. In this example, we will use the metric &lt;code&gt;http_requests_total&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Imagine that we want to know the number of requests for the &lt;code&gt;/api&lt;/code&gt; path in the host &lt;code&gt;10.2.0.4&lt;/code&gt;. To do so, we will use the &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;path&lt;/code&gt; labels from that metric.&lt;/p&gt;

&lt;p&gt;We could run this PromQL query:&lt;/p&gt;

&lt;p&gt;http_requests_total{host="10.2.0.4", path="/api"}&lt;/p&gt;

&lt;p&gt;It would return the following data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;host&lt;/td&gt;
&lt;td&gt;path&lt;/td&gt;
&lt;td&gt;status_code&lt;/td&gt;
&lt;td&gt;value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;98&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;503&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;20&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;401&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every row in that table represents a series with the last available value. As &lt;code&gt;http_requests_total&lt;/code&gt; contains the number of requests made since the last counter restart, we see 98 successful requests.&lt;/p&gt;

&lt;p&gt;This is called an &lt;strong&gt;instant vector&lt;/strong&gt;: the latest value for every series at the moment specified by the query. Since samples are not taken at the exact query time, Prometheus selects the closest sample before the specified timestamp. If no time is specified, it returns the most recent available value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--96PXEwVV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Instant-Vector.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--96PXEwVV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Instant-Vector.png" alt="Graphic showing three-time series and the exact time the query took place, returning an instant vector with the nearest values" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, you can get an instant vector from another moment (i.e., from one day ago).&lt;/p&gt;

&lt;p&gt;To do so, you only need to add an &lt;code&gt;offset&lt;/code&gt;, like this:&lt;/p&gt;

&lt;p&gt;http_requests_total{host="10.2.0.4", path="/api", status_code="200"} offset 1d&lt;/p&gt;

&lt;p&gt;To obtain metric results within a time range, you need to indicate it between square brackets:&lt;/p&gt;

&lt;p&gt;http_requests_total{host="10.2.0.4", path="/api"}[10m]&lt;/p&gt;

&lt;p&gt;It would return something like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;host&lt;/td&gt;
&lt;td&gt;path&lt;/td&gt;
&lt;td&gt;status_code&lt;/td&gt;
&lt;td&gt;value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;641309@1614690905.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641314@1614690965.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641319@1614691025.502&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;641319 @1614690936.628&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641324 @1614690996.628&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641329 @1614691056.628&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;401&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;368736 @1614690901.371&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;368737 @1614690961.372&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;368738 @1614691021.372&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The query returns multiple values for each time series; that’s because we asked for data within a time range. Thus, every value is associated with a timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is called a range vector&lt;/strong&gt;: all the values for every series within a range of timestamps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BWpVweBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Range-Vector.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BWpVweBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Range-Vector.png" alt="Graphic showing three time series and the time range the query took place, returning an range vector with the all the values inside the range" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started with PromQL aggregators and operators
&lt;/h2&gt;

&lt;p&gt;As you can see, the PromQL selectors help you obtain metrics data. But what if you want to get more sophisticated results?&lt;/p&gt;

&lt;p&gt;Imagine if we had the metric &lt;code&gt;node_cpu_cores&lt;/code&gt; with a &lt;code&gt;cluster&lt;/code&gt; label. We could, for example, sum the results, aggregating them by a particular label:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (cluster) (node_cpu_cores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This would return something like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cluster&lt;/td&gt;
&lt;td&gt;value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;foo&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bar&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this simple query, we can see that there are &lt;code&gt;100&lt;/code&gt; CPU cores for the cluster &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;50&lt;/code&gt; for the cluster &lt;code&gt;bar&lt;/code&gt;.&lt;/p&gt;
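
&lt;p&gt;Other aggregation operators follow the same pattern. For example, assuming there is one &lt;code&gt;node_cpu_cores&lt;/code&gt; series per node, this sketch counts the nodes in each cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (cluster) (node_cpu_cores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;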

&lt;p&gt;Furthermore, we can use arithmetic operators in our PromQL queries. For example, using the metric &lt;code&gt;node_memory_MemFree_bytes&lt;/code&gt;, which returns the amount of free memory in bytes, we could get that value in megabytes by using the division operator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_memory_MemFree_bytes / (1024 * 1024)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We could also get the percentage of free memory available by comparing the previous metric with &lt;code&gt;node_memory_MemTotal_bytes&lt;/code&gt;, which returns the total memory available in the node.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can also use it to create an alert in case there are nodes with less than 5% of free memory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100 &amp;lt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Getting started with PromQL functions
&lt;/h2&gt;

&lt;p&gt;PromQL offers a vast collection of functions we can use to get even more sophisticated results. Continuing with the previous example, we could use the &lt;code&gt;topk&lt;/code&gt; function to identify which two nodes have the highest free memory percentages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(2, (node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Prometheus not only gives us information from the past, &lt;strong&gt;but also the future&lt;/strong&gt;. The &lt;code&gt;predict_linear&lt;/code&gt; function predicts where the time series will be in the given amount of seconds. You may remember that we used this function to &lt;a href="https://sysdig.com/blog/cooking-iot-prometheus/"&gt;cook the perfect holiday ham&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Imagine that you want to know how much free disk space will be available in the next 24 hours. You could apply the &lt;code&gt;predict_linear&lt;/code&gt; function to last week’s results of the &lt;code&gt;node_filesystem_free_bytes&lt;/code&gt; metric, which returns the free disk space available. This lets you &lt;strong&gt;predict the free disk space&lt;/strong&gt;, in gigabytes, 24 hours from now, and alert if it is expected to fall below 100 GB:&lt;/p&gt;

&lt;pre&gt;predict_linear(node_filesystem_free_bytes[1w], 3600 * 24) / (1024 * 1024 * 1024) &amp;lt; 100&lt;/pre&gt;
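&lt;p&gt;For intuition, &lt;code&gt;predict_linear&lt;/code&gt; essentially fits a least-squares line through the samples in the range and extrapolates it forward. Here is a rough Python sketch of that idea; it is a simplification (the real PromQL function anchors the prediction at the query's evaluation time, which this sketch glosses over):&lt;/p&gt;

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation over (timestamp, value) samples.
    A simplified sketch of what PromQL's predict_linear computes."""
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    n = len(samples)
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in samples)
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = v_mean - slope * t_mean
    # Extrapolate from the last sample's timestamp.
    return slope * (ts[-1] + seconds_ahead) + intercept

# Free space shrinking by 10 units every minute: the line predicts 70
# one minute after the last sample.
print(predict_linear([(0, 100), (60, 90), (120, 80)], 60))  # 70.0
```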

&lt;p&gt;When working with Prometheus counters, the &lt;code&gt;rate&lt;/code&gt; function is pretty convenient. It calculates the per-second increase of a counter, accounting for resets and extrapolating at the edges of the range to provide better results.&lt;/p&gt;

&lt;p&gt;What if we need to create an alert when we haven’t received a request in the last 10 minutes? We couldn’t just use the &lt;code&gt;http_requests_total&lt;/code&gt; metric, because if the counter got reset during the time range, the results wouldn’t be accurate.&lt;/p&gt;

&lt;pre&gt;http_requests_total[10m]&lt;/pre&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;th&gt;path&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;100@1614690905.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;300@1614690965.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;50@1614691025.502&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the example above, the counter got reset, so the raw values drop from &lt;code&gt;300&lt;/code&gt; to &lt;code&gt;50&lt;/code&gt;, which would look like a negative increase. Using just this metric wouldn’t be enough. Here is where the &lt;code&gt;rate&lt;/code&gt; function comes to the rescue. Since it accounts for resets, it treats the series as if the values were like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;th&gt;path&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;100@1614690905.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;300@1614690965.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;350@1614691025.502&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;pre&gt;rate(http_requests_total[10m])&lt;/pre&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;th&gt;path&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.83&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Regardless of the resets, there were 0.83 requests per second, averaged over the last 10 minutes. Now we can configure the desired alert:&lt;/p&gt;

&lt;pre&gt;rate(http_requests_total[10m]) == 0&lt;/pre&gt;
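&lt;p&gt;The reset handling described above can be sketched in a few lines of Python. This is an illustration only, not Prometheus's exact algorithm (real &lt;code&gt;rate&lt;/code&gt; also extrapolates at the edges of the window):&lt;/p&gt;

```python
def counter_increase(samples):
    """Total increase of a Prometheus-style counter, accounting for
    resets (a sample lower than its predecessor)."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter was reset: assume it restarted from zero, so the
            # whole current value counts as new increase.
            increase += curr
    return increase

def simple_rate(samples, window_seconds):
    """Per-second rate over the window (no edge extrapolation)."""
    return counter_increase(samples) / window_seconds

# The series from the example above, 100 -> 300 -> 50 (reset),
# is treated as 100 -> 300 -> 350: an increase of 250.
print(counter_increase([100, 300, 50]))  # 250.0
```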

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;In this article, we learned how Prometheus stores data and how to &lt;strong&gt;start selecting and aggregating data with PromQL&lt;/strong&gt; examples.&lt;/p&gt;

&lt;p&gt;You can download the PromQL Cheatsheet to &lt;strong&gt;learn more PromQL operators, aggregations, and functions,&lt;/strong&gt; as well as examples. You can also try all the examples in our &lt;a href="https://learn.sysdig.com/promql-playground"&gt;Prometheus playground&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dig.sysdig.com/c/pf-infographic-promql-cheatsheet?x=u_WFRi"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ZwZoiRx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Download-now.png" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can also try the Sysdig Monitor Free 30-day Trial, since Sysdig Monitor is fully compatible with Prometheus. You’ll &lt;a href="https://sysdig.com/company/free-trial-platform/"&gt;get started in just a few minutes&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>prometheus</category>
      <category>promql</category>
      <category>monitoring</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>How to monitor AWS SQS with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Fri, 05 Feb 2021 11:13:21 +0000</pubDate>
      <link>https://dev.to/eckelon/https-sysdig-com-blog-monitor-aws-sqs-prometheus-4gg0</link>
      <guid>https://dev.to/eckelon/https-sysdig-com-blog-monitor-aws-sqs-prometheus-4gg0</guid>
      <description>&lt;p&gt;Article by &lt;a href="https://sysdig.com/blog/monitor-aws-sqs-prometheus/" rel="noopener noreferrer"&gt;David de Torres&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In this article, we will explain how to monitor AWS SQS with Prometheus. To do so, we will leverage the data offered by CloudWatch, exporting the metrics to Prometheus with the YACE exporter (&lt;a href="https://github.com/ivx/yet-another-cloudwatch-exporter" rel="noopener noreferrer"&gt;Yet Another CloudWatch Exporter&lt;/a&gt;). Finally, we will dive into what to monitor and what to alert on.&lt;/p&gt;

&lt;p&gt;AWS SQS (Simple Queue Service) has gained popularity as a way to communicate and decouple asynchronous applications, specifically for its easy integration with AWS Lambda functions.&lt;/p&gt;

&lt;p&gt;Having two decoupled applications allows you to implement and scale both ends independently. To achieve this decoupling, the system must allow applications to produce and process messages at different rates. Any bottleneck can cause messages to not be processed on time and hurt the overall performance of the system.&lt;/p&gt;

&lt;p&gt;You need to monitor AWS SQS queues closely to find bottleneck situations, properly scale the producers and consumers of messages, and detect errors as soon as possible.&lt;/p&gt;

&lt;p&gt;But how do you monitor a managed service like this one?&lt;/p&gt;

&lt;p&gt;And can you monitor it from the same place you monitor your entire infrastructure?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-01.png" alt="It is possible to monitor AWS SQS next to your cloud-native infrastructure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The relevant metrics for this service are all available in AWS CloudWatch. You can consult them via the web interface or through the API. To check these metrics from your Prometheus-compatible monitoring solution, you can use a Prometheus exporter.&lt;/p&gt;

&lt;p&gt;Let's now dig into how SQS works in detail, how to monitor it with Prometheus, and what key metrics you should keep an eye on.&lt;/p&gt;


&lt;h2&gt;
  
  
  How do AWS SQS queues work?
&lt;/h2&gt;

&lt;p&gt;Let's establish common ground on how AWS SQS queues work, making it easier to later identify what's important to monitor and alert on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-02.png" alt="The AWS SQS producer send messages to the queue. The delayed messages wait for a bit and the rest are visible. When a receiver procesed a message it becomes invisible until it is processed and removed from the queue."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the workflow of a message in a SQS queue:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The message is created by a producer service and sent to the SQS queue.&lt;/li&gt;
&lt;li&gt; The message becomes visible in the queue to all possible receivers. This step may not be immediate: for example, if you configure a delay on the message, it will stay in the queue in a delayed state and will not be available to receivers until the delay expires.&lt;/li&gt;
&lt;li&gt; One of the possible receivers polls the messages of the SQS queue. This operation retrieves the visible messages from the queue and switches them to an &lt;code&gt;invisible&lt;/code&gt; state, but does not delete them. This keeps other receivers from getting those messages if they execute a new poll.&lt;/li&gt;
&lt;li&gt; When the receiver finishes processing the message, it explicitly removes it from the queue.&lt;/li&gt;
&lt;/ol&gt;
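&lt;p&gt;The four steps above can be captured in a tiny in-memory simulation. This is purely illustrative Python showing the visibility semantics, not the real SQS service or the boto3 API:&lt;/p&gt;

```python
import time

class FakeSqsQueue:
    """A minimal in-memory sketch of SQS visibility semantics."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.messages = []  # each: {"body": ..., "invisible_until": ts}

    def send(self, body, delay=0):
        # Step 1-2: a delayed message stays hidden until the delay expires.
        self.messages.append({"body": body,
                              "invisible_until": time.time() + delay})

    def poll(self):
        """Step 3: return visible messages and hide them for the
        visibility timeout, without deleting them."""
        now = time.time()
        batch = [m for m in self.messages if m["invisible_until"] <= now]
        for m in batch:
            m["invisible_until"] = now + self.visibility_timeout
        return batch

    def delete(self, message):
        # Step 4: the receiver must delete explicitly once it is done.
        self.messages.remove(message)

q = FakeSqsQueue()
q.send("hello")
batch = q.poll()        # message is now invisible to other receivers
assert q.poll() == []   # a second poll sees nothing
q.delete(batch[0])      # processed: remove it from the queue
assert q.messages == []
```

If a receiver never called `delete`, the message would become visible again once `invisible_until` passed, which is exactly the re-delivery behavior described next.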

&lt;p&gt;Now, what happens if a receiver processes a message but does not remove it from the queue?&lt;/p&gt;

&lt;p&gt;After a configurable delay, the message is marked again as visible so other receivers can get the message and process it.&lt;/p&gt;

&lt;p&gt;Wow, that sounds interesting. If you get a message that generates an error in the receiver, shortly after, another receiver will get that message and process it again.&lt;/p&gt;

&lt;p&gt;And what if that other receiver also suffers an error? And all of the others after that?&lt;/p&gt;

&lt;p&gt;That's a tricky question.&lt;/p&gt;

&lt;p&gt;To prevent these old messages from piling up in the queue and recurrently appearing in the polls, AWS SQS allows you to configure another queue as a dead-letter queue. A dead-letter queue is where messages end up after being polled a certain number of times. This helps developers and site operation engineers detect these messages and treat them appropriately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor AWS SQS with Prometheus metrics
&lt;/h2&gt;

&lt;p&gt;Now that we understand how SQS queues work, let's see how we can get metrics to address all of the possible situations that we can find while working with them.&lt;/p&gt;

&lt;p&gt;AWS SQS emits &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;certain metrics&lt;/a&gt; that can be gathered by the CloudWatch service under the namespace AWS/SQS. We'll now see how to extract those metrics to be able to monitor AWS SQS with Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/opensource/prometheus/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is a leading open source monitoring solution, which provides means to easily create integrations by &lt;a href="https://prometheus.io/docs/instrumenting/exporters/" rel="noopener noreferrer"&gt;writing exporters&lt;/a&gt;. With Prometheus, you can gather metrics from your whole infrastructure which may be spread across multiple cloud providers, following a &lt;em&gt;single-pane-of-glass&lt;/em&gt; approach.&lt;/p&gt;

&lt;p&gt;Prometheus exporters gather metrics from services and publish them in a standardized format that both a Prometheus server and the Sysdig Agent can scrape natively. We will use one of these exporters, specifically the YACE exporter (&lt;a href="https://github.com/ivx/yet-another-cloudwatch-exporter" rel="noopener noreferrer"&gt;Yet Another CloudWatch Exporter&lt;/a&gt;), to get metrics from AWS CloudWatch. &lt;a href="https://sysdig.com/blog/improving-prometheus-cloudwatch-exporter" rel="noopener noreferrer"&gt;We contributed to this exporter to make it more efficient and reliable&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this use case, we will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Deploy the CloudWatch exporter in a Kubernetes cluster.&lt;/li&gt;
&lt;li&gt; Configure it to gather metrics of SQS in AWS.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This exporter will be conveniently annotated with Prometheus tags, so both a Prometheus server and the Sysdig agent can scrape it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-03.png" alt="AWS SQS metrics are available in CloudWatch. The Prometheus Exporter polls them through the CloudWatch API and makes them available in Prometheus format for Prometheus Servers and Sysdig Agents."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and configuring Prometheus CloudWatch exporter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Setting up permissions to access CloudWatch metrics
&lt;/h3&gt;

&lt;p&gt;The exporter will connect to the AWS CloudWatch API and pull the metrics, but to get them we need to grant the right permissions.&lt;/p&gt;

&lt;p&gt;First, you will need to create an AWS IAM policy that contains the following permissions:&lt;/p&gt;

&lt;pre&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudWatchExporterPolicy",
            "Effect": "Allow",
            "Action": [
                "tag:GetResources",
                "cloudwatch:ListTagsForResource",
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        }
    ]
}&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Configuring the AWS IAM policy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You will also need to supply the credentials for an AWS IAM account to the CloudWatch exporter. This can be done in a standard manner, in the &lt;code&gt;$HOME/.aws/credentials&lt;/code&gt; file.&lt;/p&gt;

&lt;pre&gt;# CREDENTIALS FOR AWS ACCOUNT
[default]
aws_region = us-east-1
aws_access_key_id = AKIAQ33BWUG3BLXXXXX
aws_secret_access_key = bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Configuring the AWS IAM account in the $HOME/.aws/credentials file.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can either assign the IAM policy directly to the IAM account or to an IAM role to grant the permissions to the exporter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the exporter
&lt;/h3&gt;

&lt;p&gt;The YACE exporter has images for its stable version ready to be deployed in Kubernetes. So, we just need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Specify what to scrape from CloudWatch in a &lt;code&gt;config.yml&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt; Create a deployment file.&lt;/li&gt;
&lt;li&gt; Deploy in a Kubernetes cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's focus on the configuration file. Here, you'll define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Which metrics the exporter will scrape.&lt;/li&gt;
&lt;li&gt;  From which region.&lt;/li&gt;
&lt;li&gt;  What dimensions you’ll ask CloudWatch to make the aggregations with.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example &lt;code&gt;config.yml&lt;/code&gt; configuration file:&lt;/p&gt;

&lt;pre&gt;discovery:
  jobs:
  - regions: 
    - us-east-1
    type: sqs
    enableMetricData: true
    metrics: 
      - name: ApproximateAgeOfOldestMessage
        statistics:
        - Maximum
        period: 300
        length: 3600
      - name: ApproximateNumberOfMessagesDelayed
        statistics:
        - Average
        period: 300
        length: 3600
      - name: ApproximateNumberOfMessagesNotVisible
        statistics:
        - Average
        period: 300
        length: 3600
      - name: ApproximateNumberOfMessagesVisible
        statistics:
        - Average
        period: 300
        length: 3600
      - name: NumberOfEmptyReceives
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: NumberOfMessagesDeleted
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: NumberOfMessagesReceived
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: NumberOfMessagesSent
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: SentMessageSize
        statistics:
        - Average
        - Sum
        period: 300
        length: 3600&lt;/pre&gt;

&lt;p&gt;Please be aware of the following caveats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; If you wish to &lt;strong&gt;add an additional metric&lt;/strong&gt;, be sure to read up on &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;AWS SQS metrics&lt;/a&gt; to use the correct statistic.&lt;/li&gt;
&lt;li&gt; CloudWatch offers &lt;strong&gt;aggregations by different dimensions&lt;/strong&gt;. For SQS, the YACE exporter automatically selects &lt;code&gt;QueueName&lt;/code&gt; as the default dimension to aggregate the metrics by.&lt;/li&gt;
&lt;li&gt; Gathering CloudWatch metrics may incur a certain &lt;strong&gt;cost to the AWS bill&lt;/strong&gt;. Be sure to check the AWS Documentation on CloudWatch Service Quota limits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last step is to actually deploy the YACE exporter. To make things easier, you can put the IAM account credentials and the configuration in a file, like this:&lt;/p&gt;

&lt;pre&gt;apiVersion: v1
kind: Namespace
metadata:
  name: yace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yace-sqs
  namespace: yace
spec:
  selector:
    matchLabels:
      app: yace-sqs
  replicas: 1
  template:
    metadata:
      labels:
        app: yace-sqs
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5000"
    spec:
      containers:
      - name: yace
        image: quay.io/invisionag/yet-another-cloudwatch-exporter:v0.21.0-alpha
        ports:
        - containerPort: 5000
        volumeMounts:
          - name: yace-sqs-config
            mountPath: /tmp/config.yml
            subPath: config.yml
          - name: yace-sqs-credentials
            mountPath: /exporter/.aws/credentials
            subPath: credentials
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
      volumes:
        - configMap:
            defaultMode: 420
            name: yace-sqs-config
          name: yace-sqs-config
        - secret:
            defaultMode: 420
            secretName: yace-sqs-credentials
          name: yace-sqs-credentials
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: yace-sqs-config
  namespace: yace
data:
  config.yml: |
    discovery:
      jobs:
      - regions: 
        - us-east-1
        type: sqs
        enableMetricData: true
        metrics: 
          - name: ApproximateAgeOfOldestMessage
            statistics:
            - Maximum
            period: 300
            length: 3600
          - name: ApproximateNumberOfMessagesDelayed
            statistics:
            - Average
            period: 300
            length: 3600
          - name: ApproximateNumberOfMessagesNotVisible
            statistics:
            - Average
            period: 300
            length: 3600
          - name: ApproximateNumberOfMessagesVisible
            statistics:
            - Average
            period: 300
            length: 3600
          - name: NumberOfEmptyReceives
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: NumberOfMessagesDeleted
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: NumberOfMessagesReceived
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: NumberOfMessagesSent
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: SentMessageSize
            statistics:
            - Average
            - Sum
            period: 300
            length: 3600
---
apiVersion: v1
kind: Secret
metadata:
  name: yace-sqs-credentials
  namespace: yace
data:
  # Add in credentials the result of:
  # cat ~/.aws/credentials | base64
  credentials: |
    XXX&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;Note that leaving your AWS credentials inside a deployment file is not the safest option. You should use a secrets store instead, but the example was simplified to keep the focus.&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In this file, we can find:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;namespace: yace&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; The &lt;strong&gt;&lt;code&gt;kind: Deployment&lt;/code&gt;&lt;/strong&gt; with the exporter. Note the &lt;code&gt;annotations:&lt;/code&gt; with the Prometheus tags for scraping, and the scraping port. This deployment also has two volumes: one with the configuration file, and another with the credentials.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;kind: ConfigMap&lt;/code&gt;&lt;/strong&gt; with the contents of the config.yml file.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;kind: Secret&lt;/code&gt;&lt;/strong&gt; with the credentials of the IAM account.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, you just need to deploy like you usually do:&lt;/p&gt;

&lt;pre&gt;kubectl apply -f deploy.yaml&lt;/pre&gt;

&lt;p&gt;Is it working?&lt;/p&gt;

&lt;p&gt;Let's do a quick test by sending an HTTP request to the exporter port. You can use a web browser or curl in a console. As we set port &lt;code&gt;5000&lt;/code&gt; in our example pod &lt;code&gt;yace-sqs&lt;/code&gt;, we would do (replacing the placeholder with the pod's IP):&lt;/p&gt;

&lt;pre&gt;curl http://&amp;lt;pod-ip&amp;gt;:5000/metrics&lt;/pre&gt;

&lt;p&gt;If everything is OK, you should see metrics like the following (output truncated due to size):&lt;/p&gt;

&lt;pre&gt;# HELP aws_sqs_approximate_age_of_oldest_message_maximum Help is not implemented yet.
# TYPE aws_sqs_approximate_age_of_oldest_message_maximum gauge
aws_sqs_approximate_age_of_oldest_message_maximum{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 2
# HELP aws_sqs_approximate_number_of_messages_delayed_average Help is not implemented yet.
# TYPE aws_sqs_approximate_number_of_messages_delayed_average gauge
aws_sqs_approximate_number_of_messages_delayed_average{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 3
# HELP aws_sqs_approximate_number_of_messages_not_visible_average Help is not implemented yet.
# TYPE aws_sqs_approximate_number_of_messages_not_visible_average gauge
aws_sqs_approximate_number_of_messages_not_visible_average{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 12
# HELP aws_sqs_approximate_number_of_messages_visible_average Help is not implemented yet.
# TYPE aws_sqs_approximate_number_of_messages_visible_average gauge
aws_sqs_approximate_number_of_messages_visible_average{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 1&lt;/pre&gt;
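&lt;p&gt;To sanity-check this output programmatically, you could parse the exposition-format lines. Here is a small, hypothetical Python helper; the regex only covers the simple labeled lines shown above, not the full Prometheus exposition format:&lt;/p&gt;

```python
import re

# Matches lines like: metric_name{label="value",...} 42
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metric_line(line):
    """Extract (name, labels, value) from one labeled metric line,
    or None for comments (# HELP / # TYPE) and other lines."""
    m = LINE_RE.match(line)
    if not m:
        return None
    name, labels_raw, value = m.groups()
    labels = dict(LABEL_RE.findall(labels_raw))
    return name, labels, float(value)

line = ('aws_sqs_approximate_number_of_messages_visible_average'
        '{dimension_QueueName="queue_01",region="us-east-1"} 1')
name, labels, value = parse_metric_line(line)
print(labels["dimension_QueueName"], value)  # queue_01 1.0
```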

&lt;h2&gt;
  
  
  Monitoring AWS SQS: What to look for?
&lt;/h2&gt;

&lt;p&gt;AWS SQS queues have a simple design, so there isn't much to monitor. However, depending on how you are using them, you will want to monitor a different set of metrics.&lt;/p&gt;

&lt;p&gt;Let's explore some scenarios and their relevant metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple producer-consumer
&lt;/h3&gt;

&lt;p&gt;For this approach, we will consider that you only have producers and consumers processing the messages. We will not cover delayed messages or dead-letter queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visible messages&lt;/strong&gt;: This metric will give you information about the &lt;strong&gt;saturation of the system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Visible messages are the ones that are ready to be processed, but not yet polled and deleted by a receiver. This is a good indicator of how many pending messages you have in the queue.&lt;/p&gt;

&lt;p&gt;The metric that offers this information is &lt;code&gt;aws_sqs_approximate_number_of_messages_visible_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-04.png" alt="A PromQL dashboard panel showing a spike on the visible messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not visible messages&lt;/strong&gt;: This metric is a good indicator of the &lt;strong&gt;messages that are being processed&lt;/strong&gt; at each moment.&lt;/p&gt;

&lt;p&gt;Not visible messages are the ones that have been polled by a receiver but have not yet been deleted.&lt;/p&gt;

&lt;p&gt;The metric that offers this information is &lt;code&gt;aws_sqs_approximate_number_of_messages_not_visible_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-05.png" alt="A PromQL dashboard panel showing a spike on the not visible messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleted messages:&lt;/strong&gt; This metric is a good indicator of the number of messages actually processed by the receivers.&lt;/p&gt;

&lt;p&gt;Remember, when a receiver processes a message, it manually deletes the message from the queue.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_messages_deleted_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-06.png" alt="A PromQL dashboard panel showing a spike on the deleted messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Received messages&lt;/strong&gt;: The received messages count how many messages went out of the queue. Take into account that a message can be received by a consumer several times if it was not deleted from the queue.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_messages_received_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-07.png" alt="A PromQL dashboard panel showing a spike on the received messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty receives&lt;/strong&gt;: This metric allows you to detect how many empty requests have been made in order to optimize the way your application makes the requests.&lt;/p&gt;

&lt;p&gt;Amazon bills SQS based on the number of requests made. Each poll is one request, and each request can retrieve 1 to 10 messages with a maximum total payload of 256 KB.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_empty_receives_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-08.png" alt="A PromQL dashboard panel showing a spike on the empty receives."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To optimize the billing, you can either reduce the request frequency or use &lt;code&gt;long polling&lt;/code&gt;. With long polling, a single request waits up to 10 seconds and returns the visible messages, plus any messages that arrive during that window, reducing the number of requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring a producer that can delay messages
&lt;/h3&gt;

&lt;p&gt;If you can estimate the time needed to process the messages, the producer can add a delay to the messages. Leaving time between messages can help alleviate possible bottlenecks caused by a high number of messages sent at the same time.&lt;/p&gt;

&lt;p&gt;Some extra metrics worth tracking in this scenario are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delayed messages&lt;/strong&gt;: This indicator can help you scale the number of receivers up or down to match the workload coming in the next few minutes.&lt;/p&gt;

&lt;p&gt;You can have the number of messages delayed in the queue with the metric &lt;code&gt;aws_sqs_approximate_number_of_messages_delayed_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If your producers are deployed in Kubernetes, you can use the &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;Kubernetes horizontal pod autoscaler (HPA)&lt;/a&gt; and the &lt;a href="https://github.com/DirectXMan12/k8s-prometheus-adapter" rel="noopener noreferrer"&gt;Prometheus Adapter&lt;/a&gt; to adjust the number of pods depending on the value of this metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total number of messages in the queue&lt;/strong&gt;: The number of messages gives you an idea of the occupancy and saturation of the pipeline.&lt;/p&gt;

&lt;p&gt;To estimate the number of messages that the senders have produced and that are still waiting to be processed, you can sum the delayed, visible (ready to be delivered to receivers), and not visible (currently being processed) messages. If message processing were immediate, this sum would be zero.&lt;/p&gt;

&lt;p&gt;The PromQL query that produces this value is:&lt;/p&gt;

&lt;pre&gt;aws_sqs_approximate_number_of_messages_delayed_average + aws_sqs_approximate_number_of_messages_not_visible_average + aws_sqs_approximate_number_of_messages_visible_average&lt;/pre&gt;

&lt;h3&gt;
  
  
  Dealing with dead-letter queues
&lt;/h3&gt;

&lt;p&gt;When dealing with dead-letter queues, it is important to detect when a message arrives in the queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sent messages&lt;/strong&gt;: This can give you an idea of the number of errors, or messages that the receivers could not process.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_messages_sent_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-09.png" alt="A PromQL dashboard panel showing a spike on the sent messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring AWS SQS: What to alert?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;High number of messages in queue for a long time&lt;/strong&gt;: The total number of messages in the queue is an indicator of the saturation of the pipeline. You can set a threshold (e.g., 100 messages) and trigger an alert if the number of messages stays above it for an extended period of time.&lt;/p&gt;

&lt;pre&gt;(aws_sqs_approximate_number_of_messages_delayed_average + aws_sqs_approximate_number_of_messages_not_visible_average + aws_sqs_approximate_number_of_messages_visible_average) &amp;gt; 100&lt;/pre&gt;

&lt;p&gt;This alert can also detect messages that are recurrently sent back to the visible state if a dead-letter queue is not configured.&lt;/p&gt;
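&lt;p&gt;If you manage your own Prometheus rules, this condition can be written as an alerting rule. A minimal sketch; the rule name, duration, and labels are illustrative:&lt;/p&gt;

&lt;pre&gt;groups:
  - name: sqs-alerts
    rules:
      - alert: SQSQueueBacklog
        expr: (aws_sqs_approximate_number_of_messages_delayed_average + aws_sqs_approximate_number_of_messages_not_visible_average + aws_sqs_approximate_number_of_messages_visible_average) &amp;gt; 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: More than 100 messages in the queue for 15 minutes&lt;/pre&gt;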

&lt;p&gt;&lt;strong&gt;Oldest message in queue&lt;/strong&gt;: This metric gives the age of the oldest message in the queue, which is a good indicator of the maximum latency of the pipeline. This alert triggers when the maximum age is higher than five minutes (&lt;code&gt;300&lt;/code&gt; seconds; adjust the threshold as needed).&lt;/p&gt;

&lt;pre&gt;aws_sqs_approximate_age_of_oldest_message_maximum &amp;gt; 300&lt;/pre&gt;

&lt;p&gt;For this alert to work properly, make sure to configure a dead-letter queue to prevent messages from repeatedly returning to the visible state.&lt;/p&gt;
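&lt;p&gt;A dead-letter queue is attached to a source queue through its redrive policy. As a sketch with the AWS CLI (the queue URL and target ARN are placeholders; &lt;code&gt;maxReceiveCount&lt;/code&gt; controls how many times a message can be received before it is moved to the dead-letter queue):&lt;/p&gt;

&lt;pre&gt;aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\": \"arn:aws:sqs:us-east-1:123456789012:dead-letter-my-queue\", \"maxReceiveCount\": \"5\"}"}'&lt;/pre&gt;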

&lt;p&gt;&lt;strong&gt;Recurring empty receives&lt;/strong&gt;: You can detect when your application repeatedly tries to fetch new messages from an empty queue. This can help you adjust your polling frequency or the number of receivers to lower your infrastructure costs.&lt;/p&gt;

&lt;pre&gt;aws_sqs_number_of_empty_receives_sum &amp;gt; 0&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Received message in a dead-letter queue&lt;/strong&gt;: You can detect if a new message has arrived in a dead-letter queue by alerting on the sent messages metric. To filter on the dead-letter queues, you can use a naming convention, such as prefixing your dead-letter queue names with &lt;code&gt;dead-letter-&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This PromQL query will alert you when a message arrives in any of your dead-letter queues:&lt;/p&gt;

&lt;pre&gt;aws_sqs_number_of_messages_sent_sum{dimension_QueueName=~"dead-letter-.+"} &amp;gt; 0&lt;/pre&gt;

&lt;h2&gt;
  
  
  Getting the CloudWatch metrics into Sysdig Monitor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sysdig agent setup
&lt;/h3&gt;

&lt;p&gt;To scrape metrics using the Sysdig agent:&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;yace&lt;/code&gt; Deployment, remember to include the Prometheus &lt;code&gt;annotations&lt;/code&gt; that configure the port of the exporter as a scraping port.&lt;/p&gt;
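&lt;p&gt;For reference, those annotations on the pod template could look like the following sketch (&lt;code&gt;yace&lt;/code&gt; listens on port &lt;code&gt;5000&lt;/code&gt; by default; adjust the port and path if you changed them):&lt;/p&gt;

&lt;pre&gt;spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5000"
        prometheus.io/path: "/metrics"&lt;/pre&gt;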

&lt;p&gt;Also, in the Sysdig Agent configuration, make sure to have these lines of configuration that enable the scraping of containers with Prometheus annotations.&lt;/p&gt;

&lt;pre&gt;process_filter:
  - include:
      kubernetes.pod.annotation.prometheus.io/scrape: true
      conf:
        path: "{kubernetes.pod.annotation.prometheus.io/path}"
        port: "{kubernetes.pod.annotation.prometheus.io/port}"&lt;/pre&gt;

&lt;h3&gt;
  
  
  Monitoring AWS SQS with dashboard and alerts
&lt;/h3&gt;

&lt;p&gt;Once you have SQS metrics in Sysdig Monitor, you can use the AWS SQS dashboard to get a full overview of your queues. In the dashboard, you can filter by cluster and select as many SQS queues as needed. This is especially useful when you need to correlate an SQS queue with its dead-letter queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-10.png" alt="A PromQL dashboard example showing all the metrics explained."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;em&gt;Sysdig Monitor: AWS SQS dashboard&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;In &lt;a href="https://promcat.io/apps/aws-sqs" rel="noopener noreferrer"&gt;PromCat.io&lt;/a&gt;, you can find instructions on how to install the exporter, along with ready-to-use configurations to monitor AWS SQS. There, you will also find the dashboards that we presented in both Grafana and Sysdig format, as well as examples of alerts for your services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You can monitor AWS SQS in the same place you monitor the rest of your cloud-native infrastructure. Because Prometheus offers a standardized interface, you can leverage existing exporters to ingest CloudWatch metrics as Prometheus metrics.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you would like to try this integration, we invite you to sign up for a &lt;a href="https://aws.amazon.com/marketplace/pp/B08DL3X2FV?ref_=srh_res_product_title" rel="noopener noreferrer"&gt;free trial in Sysdig Essentials directly from the AWS marketplace.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can find out more about our Prometheus integration in our documentation or by reading &lt;a href="https://sysdig.com/blog/improving-prometheus-cloudwatch-exporter/" rel="noopener noreferrer"&gt;our blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sqs</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
