<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JA Samitier</title>
    <description>The latest articles on DEV Community by JA Samitier (@eckelon).</description>
    <link>https://dev.to/eckelon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F339053%2Ff1ceea51-d08f-4220-b554-04c857a576bd.jpg</url>
      <title>DEV Community: JA Samitier</title>
      <link>https://dev.to/eckelon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eckelon"/>
    <language>en</language>
    <item>
      <title>How to start a Python project easily</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 23 Jan 2023 22:26:23 +0000</pubDate>
      <link>https://dev.to/eckelon/how-to-start-a-python-project-easily-3265</link>
      <guid>https://dev.to/eckelon/how-to-start-a-python-project-easily-3265</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@d_mccullough" rel="noopener noreferrer"&gt;Daniel McCullough&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I like to try out new technologies and programming languages, and one of the first blockers I face is how to start a project. I absolutely love the way you can start a NodeJS project with &lt;code&gt;npm init&lt;/code&gt;. You can do something similar with Python.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to start a Python project while ensuring that its dependencies won't interfere with other projects' dependencies, and how to use Make to distribute your project so it can be installed on other development machines in no time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check your versions
&lt;/h2&gt;

&lt;p&gt;First, check your versions. I'm working with Python 3.10, but this will work with any version of Python 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;λ python &lt;span class="nt"&gt;--version&lt;/span&gt;
Python 3.10.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the virtual environment
&lt;/h2&gt;

&lt;p&gt;First, create a folder to store your project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; ~/Development/python-starter-example
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Development/python-starter-example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you need to create "the project" itself, aka the &lt;a href="https://docs.python.org/3/library/venv.html" rel="noopener noreferrer"&gt;virtual environment&lt;/a&gt;. The &lt;code&gt;venv&lt;/code&gt; module lets you keep all the dependencies and configuration for your Python project inside the project directory, similar to how NodeJS projects keep all their dependencies in &lt;code&gt;node_modules&lt;/code&gt;. Let's do this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Development/python-starter-example
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="nb"&gt;env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Working with the virtual environment
&lt;/h2&gt;

&lt;p&gt;Now the environment is created, but we need to go "inside" it by activating it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ./env/bin/activate 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time you go back to the project directory, you'll need to activate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I know that I'm inside the environment?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Easy-peasy: You'll see a little &lt;code&gt;(env)&lt;/code&gt; text in the prompt of your terminal.&lt;/p&gt;

&lt;p&gt;Now, every Python command you run inside the environment will be executed in that context. If you install a Python library, it won't interfere with other versions of that library used by other Python projects.&lt;/p&gt;
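&lt;p&gt;&lt;em&gt;Besides looking at the prompt, you can also check programmatically: inside a virtual environment, Python's &lt;code&gt;sys.prefix&lt;/code&gt; points at the env directory, while &lt;code&gt;sys.base_prefix&lt;/code&gt; keeps pointing at the base installation. A minimal sketch (the &lt;code&gt;in_virtualenv&lt;/code&gt; helper name is mine):&lt;/em&gt;&lt;/p&gt;

```python
import sys

# Inside an activated virtual environment, sys.prefix points at the
# env directory, while sys.base_prefix still points at the base
# Python installation; outside a venv, both are equal.
def in_virtualenv():
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```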

&lt;h2&gt;
  
  
  Installing dependencies
&lt;/h2&gt;

&lt;p&gt;Let's install &lt;code&gt;Flask&lt;/code&gt;, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# (env) =&amp;gt; we are inside the environment&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;flask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment of truth: let's check that &lt;code&gt;Flask&lt;/code&gt; was installed &lt;strong&gt;inside&lt;/strong&gt; the environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example
&lt;span class="nb"&gt;env &lt;/span&gt;λ &lt;span class="nb"&gt;ls env&lt;/span&gt;/lib/python3.10/site-packages/flask 
__init__.py app.py      config.py   globals.py  logging.py  sessions.py testing.py  wrappers.py
__main__.py blueprints.py   ctx.py      helpers.py  py.typed    signals.py  typing.py
__pycache__ cli.py      debughelpers.py json        scaffold.py templating.py   views.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
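&lt;p&gt;&lt;em&gt;Another way to check where a library is imported from, without hunting through &lt;code&gt;site-packages&lt;/code&gt; by hand, is to ask Python itself. A small sketch (&lt;code&gt;package_location&lt;/code&gt; is a hypothetical helper):&lt;/em&gt;&lt;/p&gt;

```python
import importlib.util

# Hypothetical helper: where would this package be imported from?
# Returns None when the package isn't installed in the current env.
def package_location(name):
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

print(package_location("json"))
```

&lt;p&gt;&lt;em&gt;Run inside the activated environment, &lt;code&gt;package_location("flask")&lt;/code&gt; should print a path under &lt;code&gt;env/lib/.../site-packages&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;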



&lt;p&gt;So it's there! But... what about the Flask binary? Remember that some Python libraries ship with a binary that you sometimes need to execute. Those binaries live under the &lt;code&gt;env/bin&lt;/code&gt; folder inside your project. See:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example
&lt;span class="nb"&gt;env &lt;/span&gt;λ &lt;span class="nb"&gt;ls env&lt;/span&gt;/bin                               
Activate.ps1    activate.csh    flask       pip3        python      python3.10
activate    activate.fish   pip     pip3.10     python3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the project
&lt;/h2&gt;

&lt;p&gt;Let's use the example &lt;code&gt;Flask&lt;/code&gt; application &lt;a href="https://flask.palletsprojects.com/en/2.2.x/quickstart/" rel="noopener noreferrer"&gt;from its documentation&lt;/a&gt;. Create a file called &lt;code&gt;app.py&lt;/code&gt; in the project and edit it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello_world&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;Hello, World!&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, run it. Remember that the project needs to use the &lt;code&gt;Flask&lt;/code&gt; version installed in the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;env&lt;/span&gt;/bin/flask &lt;span class="nt"&gt;--app&lt;/span&gt; app run  

&lt;span class="k"&gt;*&lt;/span&gt; Serving Flask app &lt;span class="s1"&gt;'app'&lt;/span&gt;
 &lt;span class="k"&gt;*&lt;/span&gt; Debug mode: off
WARNING: This is a development server. Do not use it &lt;span class="k"&gt;in &lt;/span&gt;a production deployment. Use a production WSGI server instead.
 &lt;span class="k"&gt;*&lt;/span&gt; Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:12:24] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:12:24] &lt;span class="s2"&gt;"GET /favicon.ico HTTP/1.1"&lt;/span&gt; 404 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Et voilà!&lt;/p&gt;

&lt;p&gt;You could also run it directly using the &lt;code&gt;python&lt;/code&gt; command. You just need to tweak the code a little.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello_world&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;Hello, World!&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run it with the &lt;code&gt;python&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example 10s
&lt;span class="nb"&gt;env &lt;/span&gt;λ python app.py
 &lt;span class="k"&gt;*&lt;/span&gt; Serving Flask app &lt;span class="s1"&gt;'app'&lt;/span&gt;
 &lt;span class="k"&gt;*&lt;/span&gt; Debug mode: off
WARNING: This is a development server. Do not use it &lt;span class="k"&gt;in &lt;/span&gt;a production deployment. Use a production WSGI server instead.
 &lt;span class="k"&gt;*&lt;/span&gt; Running on http://127.0.0.1:3000
Press CTRL+C to quit
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:16:37] &lt;span class="s2"&gt;"GET / HTTP/1.1"&lt;/span&gt; 200 -
127.0.0.1 - - &lt;span class="o"&gt;[&lt;/span&gt;22/Jan/2023 13:16:38] &lt;span class="s2"&gt;"GET /favicon.ico HTTP/1.1"&lt;/span&gt; 404 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create the requirements file
&lt;/h2&gt;

&lt;p&gt;If you want to distribute your project with the rest of your team, or in a git repository, you need to create &lt;a href="https://pip.pypa.io/en/latest/user_guide/#requirements-files" rel="noopener noreferrer"&gt;the &lt;code&gt;requirements.txt&lt;/code&gt; file that will contain all the dependencies that your project's using&lt;/a&gt;. Since you installed everything using &lt;code&gt;pip&lt;/code&gt;, the easiest way of creating this file is using the &lt;code&gt;pip freeze&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have your requirements file - something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/Development/python-starter-example
&lt;span class="nb"&gt;env &lt;/span&gt;λ &lt;span class="nb"&gt;cat &lt;/span&gt;requirements.txt            
&lt;span class="nv"&gt;click&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;8.1.3
&lt;span class="nv"&gt;Flask&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.2.2
&lt;span class="nv"&gt;itsdangerous&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.1.2
&lt;span class="nv"&gt;Jinja2&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.1.2
&lt;span class="nv"&gt;MarkupSafe&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.1.2
&lt;span class="nv"&gt;Werkzeug&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.2.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the dependencies are documented in the &lt;code&gt;requirements.txt&lt;/code&gt; file, anyone can install them with a simple &lt;code&gt;pip&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
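&lt;p&gt;&lt;em&gt;The format of &lt;code&gt;requirements.txt&lt;/code&gt; is simple enough that you can sketch a parser for the pinned entries in a few lines (&lt;code&gt;parse_requirements&lt;/code&gt; is a hypothetical helper, handling only the plain &lt;code&gt;name==version&lt;/code&gt; case):&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical helper: split pinned "name==version" entries out of a
# requirements.txt, skipping blank lines and comments.
def parse_requirements(text):
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pairs.append((name, version))
    return pairs
```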



&lt;h2&gt;
  
  
  Creating an application runner with Make
&lt;/h2&gt;

&lt;p&gt;Now, let's automate it. What if the virtual environment could be created automatically, with everything installed in it? I like doing this with &lt;code&gt;Make&lt;/code&gt;. Create a file called &lt;code&gt;Makefile&lt;/code&gt; inside the project directory and start with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;.PHONY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;

&lt;span class="nl"&gt;env/bin/activate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;requirements.txt&lt;/span&gt;
    python &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="nb"&gt;env&lt;/span&gt;
    ./env/bin/pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="nl"&gt;run&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;env/bin/activate&lt;/span&gt;
    ./env/bin/python app.py

&lt;span class="nl"&gt;freeze&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;env/bin/pip&lt;/span&gt;
    ./env/bin/pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt

&lt;span class="nl"&gt;clean&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; __pycache__
    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; ./env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down this &lt;code&gt;Makefile&lt;/code&gt;. It has three commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;run&lt;/code&gt;: will run the &lt;code&gt;app.py&lt;/code&gt; Python file using the environment's interpreter. If the environment doesn't exist yet (or &lt;code&gt;requirements.txt&lt;/code&gt; has changed since it was created), it will first create the virtual environment and install the dependencies. If there's no &lt;code&gt;requirements.txt&lt;/code&gt; file at all, &lt;code&gt;make&lt;/code&gt; will stop with an error.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;freeze&lt;/code&gt;: will update the &lt;code&gt;requirements.txt&lt;/code&gt; with all the libraries installed with &lt;code&gt;pip&lt;/code&gt;. This is useful if you installed new libraries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clean&lt;/code&gt;: will delete the Python cache and the environment. You can use this safely, because if you run your application again, everything will be re-created!&lt;/li&gt;
&lt;/ul&gt;
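&lt;p&gt;&lt;em&gt;The trick behind the &lt;code&gt;env/bin/activate&lt;/code&gt; rule is Make's timestamp comparison: a target is rebuilt when it's missing or older than its prerequisite. A minimal sketch of that check in Python (&lt;code&gt;needs_rebuild&lt;/code&gt; is a hypothetical helper):&lt;/em&gt;&lt;/p&gt;

```python
import os

# Hypothetical helper mirroring Make's rule: rebuild a target when it
# is missing, or when its prerequisite is newer than the target.
def needs_rebuild(target, prerequisite):
    if not os.path.exists(target):
        return True
    return os.path.getmtime(prerequisite) > os.path.getmtime(target)
```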

&lt;p&gt;Using &lt;code&gt;Make&lt;/code&gt; is really easy: just type &lt;code&gt;make&lt;/code&gt; followed by the command you want to execute, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Please note that you need to have Make installed. On a Mac, you can install it with &lt;code&gt;brew&lt;/code&gt; (&lt;code&gt;brew install make&lt;/code&gt;); on Linux, it's usually preinstalled or available in the software repositories (&lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;dnf&lt;/code&gt;...); and on Windows, you can use &lt;a href="https://learn.microsoft.com/en-us/windows/wsl/about" rel="noopener noreferrer"&gt;WSL to install a Linux distribution inside Windows&lt;/a&gt; and run Make from there.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's sum up
&lt;/h2&gt;

&lt;p&gt;In this article, you learned how to create a Python project in its own virtual environment, and how to write a Makefile to run everything from it. Now, when someone clones the project, they'll only need to type &lt;code&gt;make run&lt;/code&gt;, and it will automatically create the environment, install the dependencies, and run the project. Yay!&lt;/p&gt;

&lt;p&gt;I hope you found this interesting. Of course, this isn't the only way to manage a Python project; it's just how I like to do it. If there's something I missed, please ping me and I'll update the article. Thanks!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>aws</category>
      <category>googlecloud</category>
      <category>azure</category>
    </item>
    <item>
      <title>Prometheus 2.37 – The first long-term supported release!</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 18 Jul 2022 14:16:42 +0000</pubDate>
      <link>https://dev.to/eckelon/prometheus-237-the-first-long-term-supported-release-2f2h</link>
      <guid>https://dev.to/eckelon/prometheus-237-the-first-long-term-supported-release-2f2h</guid>
      <description>&lt;p&gt;&lt;strong&gt;Prometheus 2.37 is out and brings exciting news&lt;/strong&gt;: this is the first long-term supported release. It'll be supported for at least six months. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Long-Term Support (LTS) so significant?
&lt;/h2&gt;

&lt;p&gt;Prior to this release, each Prometheus version had a six-week life cycle. That means that if you wanted to &lt;strong&gt;stay up-to-date with the latest features and bug fixes&lt;/strong&gt;, you needed to upgrade your Prometheus server every six weeks or so. &lt;/p&gt;

&lt;p&gt;Upgrading isn't always as easy as clicking a button. &lt;strong&gt;As Prometheus grows, more and more companies depend on it&lt;/strong&gt; as the key component of their monitoring infrastructure, and they can't take the risk that new features and enhancements also bring regressions, forcing them to upgrade again. That's why Prometheus is adding LTS releases to its release cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uG-U1DhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mlvjy1eydbtir825f1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uG-U1DhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mlvjy1eydbtir825f1w.png" alt="Image description" width="880" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/docs/introduction/release-cycle/"&gt;Prometheus LTS releases&lt;/a&gt; will bring &lt;strong&gt;bug, security, and documentation fixes&lt;/strong&gt;, so companies limit the risks of upgrades while having the Prometheus server still up-to-date.&lt;/p&gt;

&lt;p&gt;So, you won't have the latest Prometheus features, but you'll know that &lt;strong&gt;upgrading to the next 2.37 fix release will be straightforward&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prometheus community is getting more and more mature
&lt;/h2&gt;

&lt;p&gt;2022 is a great year for the Prometheus community. At the last KubeCon EU in Valencia, the &lt;a href="https://www.cncf.io/announcements/2022/05/18/prometheus-associate-certification-will-demonstrate-ability-to-monitor-infrastructure/"&gt;CNCF announced&lt;/a&gt; the &lt;a href="https://training.linuxfoundation.org/certification/prometheus-certified-associate/"&gt;Prometheus Associate Certification, which is currently in beta&lt;/a&gt;. It allows engineers to demonstrate their proficiency in the Prometheus ecosystem and cloud-native observability concepts. Now, Prometheus is announcing an LTS release.&lt;/p&gt;

&lt;p&gt;The release of these new LTS versions means that, now, every time the community fixes bugs and security issues in Prometheus, &lt;strong&gt;the maintainers will add these fixes both in the latest Prometheus minor and in the latest LTS version&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This extra effort is a &lt;strong&gt;serious investment in the Prometheus maturity&lt;/strong&gt; that will bring more stability to the vast number of companies and projects using Prometheus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some nice changes included in Prometheus 2.37
&lt;/h2&gt;

&lt;p&gt;This release also includes other nice changes, like a new built-in &lt;a href="https://github.com/prometheus/prometheus/pull/10915"&gt;service discovery for HashiCorp Nomad&lt;/a&gt;, and an &lt;a href="https://github.com/prometheus/prometheus/pull/10759"&gt;enhancement that allows attaching node labels for endpoint roles&lt;/a&gt; in the Kubernetes service discovery.&lt;/p&gt;

&lt;p&gt;You can find the full list of changes in the &lt;a href="https://github.com/prometheus/prometheus/releases/tag/v2.37.0"&gt;official release notes of Prometheus 2.37&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to monitor nginx in Kubernetes with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 04 Jul 2022 08:28:02 +0000</pubDate>
      <link>https://dev.to/eckelon/how-to-monitor-nginx-in-kubernetes-with-prometheus-j5f</link>
      <guid>https://dev.to/eckelon/how-to-monitor-nginx-in-kubernetes-with-prometheus-j5f</guid>
      <description>&lt;p&gt;nginx is an open source web server often used as a reverse proxy, load balancer, and web cache. Designed for high loads of concurrent connections, it's fast, versatile, reliable, and most importantly, very light on resources.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to monitor nginx in Kubernetes with Prometheus, and also how to troubleshoot different issues related to latency, saturation, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ingredients
&lt;/h2&gt;

&lt;p&gt;Before we begin, let's summarize the tools you'll be using for this project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nginx server (I bet it's already running in your cluster!).&lt;/li&gt;
&lt;li&gt;Our beloved &lt;a href="https://prometheus.io" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, the open source monitoring standard.&lt;/li&gt;
&lt;li&gt;The official &lt;a href="https://github.com/nginxinc/nginx-prometheus-exporter" rel="noopener noreferrer"&gt;nginx exporter&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.fluentd.org/" rel="noopener noreferrer"&gt;Fluentd&lt;/a&gt;, and its &lt;a href="https://github.com/fluent/fluent-plugin-prometheus/blob/master/README.md" rel="noopener noreferrer"&gt;plugin for Prometheus&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Starting with the basics: nginx exporter
&lt;/h2&gt;

&lt;p&gt;The first thing you need to do when you want to monitor nginx in Kubernetes with Prometheus is install the nginx exporter. Our recommendation is to install it as a sidecar for your nginx servers, just by adding it to the deployment. It should be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-server
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9113'
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: nginx.conf
        - name: nginx-exporter
          image: 'nginx/nginx-prometheus-exporter:0.10.0'
          args:
            - '-nginx.scrape-uri=http://localhost/nginx_status'
          resources:
            limits:
              memory: 128Mi
              cpu: 500m
          ports:
            - containerPort: 9113
      volumes:
        - configMap:
            defaultMode: 420
            name: nginx-config
          name: nginx-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, you've just added an nginx exporter container in each nginx server pod. Since we configured three replicas, there'll be three pods, each containing one nginx server container and one nginx exporter container. Apply this new configuration and &lt;em&gt;voilà!&lt;/em&gt; You've easily exposed metrics from your nginx server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring nginx overall status with Prometheus
&lt;/h2&gt;

&lt;p&gt;Do you want to confirm that it worked? Easy-peasy. Go to Prometheus and try this PromQL out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum (nginx_up)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should return &lt;code&gt;3&lt;/code&gt;: the three exporter containers reporting &lt;em&gt;nginx_up&lt;/em&gt; as one. Don't worry about the metrics yet, we'll get there in no time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferxh3ubu9kul0fgn4wfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferxh3ubu9kul0fgn4wfd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring nginx connections with Prometheus
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Active connections
&lt;/h3&gt;

&lt;p&gt;Let's use the following metrics to take a look at the nginx active connections. You can also focus on which ones are reading or writing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_active&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_reading&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_writing&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just by using them you'll have something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wimsapnbpzd35g9qth3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wimsapnbpzd35g9qth3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Unhandled connections
&lt;/h3&gt;

&lt;p&gt;Now, let's focus on how many connections are not being handled by nginx. You just need to subtract the handled connections from the accepted connections. The nginx exporter gives us both metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_handled&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx_connections_accepted&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let's get the percentage of accepted connections that are being unhandled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(nginx_connections_accepted{kube_cluster_name=~$cluster}[$__interval]) - rate(nginx_connections_handled{kube_cluster_name=~$cluster}[$__interval]) or vector(0) / rate(nginx_connections_accepted{kube_cluster_name=~$cluster}[$__interval]) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvc1hk6gtw34nkreg6k8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvc1hk6gtw34nkreg6k8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully this number will be near zero!&lt;/p&gt;
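&lt;p&gt;&lt;em&gt;The arithmetic behind this query can be sketched outside PromQL, too (a hypothetical helper, assuming you already have the counter increases for the time window):&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical helper mirroring the PromQL above: the percentage of
# accepted connections that were not handled over a time window,
# given the counter increases for that window.
def unhandled_pct(accepted, handled):
    if accepted == 0:
        return 0.0
    return (accepted - handled) / accepted * 100
```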

&lt;h3&gt;
  
  
  Waiting connections
&lt;/h3&gt;

&lt;p&gt;Fortunately, this is also an easy query. Just type &lt;code&gt;nginx_connections_waiting&lt;/code&gt;, which is the metric that the nginx exporter uses to expose this information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqdw3jyy6bjkyikmt699.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqdw3jyy6bjkyikmt699.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Need more metrics? Take them from the logs!
&lt;/h2&gt;

&lt;p&gt;In case you need more information to monitor nginx in Kubernetes with Prometheus, you can use nginx's &lt;code&gt;access.log&lt;/code&gt; to extract a little more. Let's see how.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fluentd, the open source data collector
&lt;/h3&gt;

&lt;p&gt;You can configure Fluentd to pick up information from the nginx access.log and convert it into a Prometheus metric. This can be really handy for situations where the instrumented application doesn't expose much information.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to install and configure Fluentd
&lt;/h3&gt;

&lt;p&gt;We already talked about Fluentd and its Prometheus plugin here, so &lt;a href="https://sysdig.com/blog/fluentd-monitoring/" rel="noopener noreferrer"&gt;just follow the instructions in that article&lt;/a&gt;, and you'll be ready to rock.&lt;/p&gt;
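&lt;p&gt;As a quick reference, once the Prometheus plugin is installed, Fluentd needs a source that exposes the metrics endpoint for Prometheus to scrape. A minimal sketch, assuming the plugin's default port and path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;source&amp;gt;
    @type prometheus
    bind 0.0.0.0
    port 24231
    metrics_path /metrics
&amp;lt;/source&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;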

&lt;h3&gt;
  
  
  Let's configure Fluentd to export a few more metrics
&lt;/h3&gt;

&lt;p&gt;To do this, you need to tweak the &lt;code&gt;access.log&lt;/code&gt; format a little: you can pick the default logging format, and add the &lt;code&gt;&lt;a href="https://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_response_time" rel="noopener noreferrer"&gt;$upstream_response_time&lt;/a&gt;&lt;/code&gt; at the end. This way, Fluentd will have this variable and use it to create some useful metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: nginx-config
data:
  nginx.conf: |
    log_format custom_format '$remote_addr - $remote_user [$time_local] '
      '"$request" $status $body_bytes_sent '
      '"$http_referer" "$http_user_agent" '
      '$upstream_response_time';
    server {
      access_log /var/log/nginx/access.log custom_format;
      ...
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This config goes in the &lt;code&gt;nginx.conf&lt;/code&gt;, usually in a &lt;code&gt;ConfigMap&lt;/code&gt;.&lt;/p&gt;
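&lt;p&gt;If your nginx runs as a Deployment, you then mount that &lt;code&gt;ConfigMap&lt;/code&gt; over the default configuration file. Here's a minimal sketch of the relevant pod spec fragment (the volume and container names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
  volumes:
    - name: nginx-config
      configMap:
        name: nginx-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;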

&lt;p&gt;Next, you need to configure Fluentd to read the new log format. You can do this by creating a new config for nginx in Fluentd's &lt;code&gt;fileConfig&lt;/code&gt; section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;source&amp;gt;
    @type prometheus_tail_monitor
&amp;lt;/source&amp;gt;
&amp;lt;source&amp;gt;
    @type tail
    &amp;lt;parse&amp;gt;
    @type regexp
    expression /^(?&amp;lt;timestamp&amp;gt;.+) (?&amp;lt;stream&amp;gt;stdout|stderr)( (.))? (?&amp;lt;remote&amp;gt;[^ ]*) (?&amp;lt;host&amp;gt;[^ ]*) (?&amp;lt;user&amp;gt;[^ ]*) \[(?&amp;lt;time&amp;gt;[^\]]*)\] \"(?&amp;lt;method&amp;gt;\w+)(?:\s+(?&amp;lt;path&amp;gt;[^\"]*?)(?:\s+\S*)?)?\" (?&amp;lt;status_code&amp;gt;[^ ]*) (?&amp;lt;size&amp;gt;[^ ]*)(?:\s"(?&amp;lt;referer&amp;gt;[^\"]*)") "(?&amp;lt;agent&amp;gt;[^\"]*)" (?&amp;lt;urt&amp;gt;[^ ]*)$/
        time_format %d/%b/%Y:%H:%M:%S %z
        keep_time_key true
        types size:integer,reqtime:float,uct:float,uht:float,urt:float
    &amp;lt;/parse&amp;gt;
    tag nginx
    path /var/log/containers/nginx*.log
    pos_file /tmp/fluent_nginx.pos
&amp;lt;/source&amp;gt;

&amp;lt;filter nginx&amp;gt;
     @type prometheus
&amp;lt;/filter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that config, you basically created a regex parser for the nginx access.log. This is the &lt;code&gt;expression&lt;/code&gt; config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;expression /^(?&amp;lt;timestamp&amp;gt;.+) (?&amp;lt;stream&amp;gt;stdout|stderr)( (.))? (?&amp;lt;remote&amp;gt;[^ ]*) (?&amp;lt;host&amp;gt;[^ ]*) (?&amp;lt;user&amp;gt;[^ ]*) \[(?&amp;lt;time&amp;gt;[^\]]*)\] \"(?&amp;lt;method&amp;gt;\w+)(?:\s+(?&amp;lt;path&amp;gt;[^\"]*?)(?:\s+\S*)?)?\" (?&amp;lt;status_code&amp;gt;[^ ]*) (?&amp;lt;size&amp;gt;[^ ]*)(?:\s"(?&amp;lt;referer&amp;gt;[^\"]*)") "(?&amp;lt;agent&amp;gt;[^\"]*)" (?&amp;lt;urt&amp;gt;[^ ]*)$/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take this log line for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-06-07T14:16:57.754883042Z stdout F 100.96.2.5 - - [07/Jun/2022:14:16:57 +0000] "GET /ok/500/5000000 HTTP/1.1" 200 5005436 "-" "python-requests/2.22.0" 0.091 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the parser, you broke that log line into the following parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;timestamp: 2022-06-07T14:16:57.754883042Z&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stream: stdout&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;remote: 100.96.2.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;host: -&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;user: -&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time: 07/Jun/2022:14:16:57 +0000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;method: GET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;path: /ok/500/5000000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status_code: 200&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size: 5005436&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;referer: -&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;agent: python-requests/2.22.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;urt: 0.091&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you've configured Fluentd to read the access.log, you can create some metrics using the variables from the parser.&lt;/p&gt;

&lt;h3&gt;
  
  
  nginx bytes sent
&lt;/h3&gt;

&lt;p&gt;You can use the &lt;code&gt;size&lt;/code&gt; variable to create the &lt;code&gt;nginx_size_bytes_total&lt;/code&gt; metric: a counter with the total nginx bytes sent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      &amp;lt;metric&amp;gt;
        name nginx_size_bytes_total
        type counter
        desc nginx bytes sent
        key size
      &amp;lt;/metric&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
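&lt;p&gt;Since this metric is a counter, you'll usually query it with &lt;code&gt;rate()&lt;/code&gt; to turn it into throughput. For example, to chart the bytes sent per second over the last five minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(nginx_size_bytes_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;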



&lt;h3&gt;
  
  
  Error rates
&lt;/h3&gt;

&lt;p&gt;Let's create this simple metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;metric&amp;gt;
        name nginx_request_status_code_total
        type counter
        desc nginx request status code
        &amp;lt;labels&amp;gt;
          method ${method}
          path ${path}
          status_code ${status_code}
        &amp;lt;/labels&amp;gt;
&amp;lt;/metric&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This metric is just a counter with all the log lines. So, why is it useful? Well, you can use other variables as labels, which can be handy to break down all the information. Let's use this metric to get the total error rate percentage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(nginx_request_status_code_total{status_code=~"[45].."}[1h])) / sum(rate(nginx_request_status_code_total[1h])) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You could also get this information aggregated by &lt;code&gt;method&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (method) (rate(nginx_request_status_code_total{status_code=~"[45].."}[1h])) / sum by (method) (rate(nginx_request_status_code_total[1h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or even by path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (path) (rate(nginx_request_status_code_total{status_code=~"[45].."}[1h])) / sum by (path) (rate(nginx_request_status_code_total[1h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Wouldn't it be great if you could monitor the latency of the successful requests? Well, it might as well be your birthday because you can! Remember when we told you to add the &lt;code&gt;$upstream_response_time&lt;/code&gt; variable? &lt;/p&gt;

&lt;p&gt;This variable stores the time spent receiving the response from the upstream server, in seconds. You can create a histogram metric with Fluentd, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;metric&amp;gt;
        name nginx_upstream_time_seconds_hist
        type histogram
        desc Histogram of the total time spent on receiving the response from the upstream server.
        key urt
        &amp;lt;labels&amp;gt;
          method ${method}
          path ${path}
          status_code ${status_code}
        &amp;lt;/labels&amp;gt;
&amp;lt;/metric&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now, magically, you can try this PromQL query to get the p95 latency of all the successful requests, aggregated by the path of the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, sum(rate(nginx_upstream_time_seconds_hist_bucket{status_code!~"[45].."}[1h])) by (le, path))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  To sum up
&lt;/h2&gt;

&lt;p&gt;In this article, you learned how to monitor nginx in Kubernetes with Prometheus, and how to create more metrics using Fluentd to read the nginx access.log. You also learned some interesting metrics to monitor and troubleshoot nginx with Prometheus. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Monitor and troubleshoot Consul with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Fri, 29 Apr 2022 09:21:31 +0000</pubDate>
      <link>https://dev.to/eckelon/monitor-and-troubleshoot-consul-with-prometheus-2pkf</link>
      <guid>https://dev.to/eckelon/monitor-and-troubleshoot-consul-with-prometheus-2pkf</guid>
      <description>&lt;p&gt;In this article, you’ll learn how to Monitor Consul with Prometheus. Also, troubleshoot Consul control plane with Prometheus from scratch, &lt;a href="https://www.consul.io/docs/agent/telemetry" rel="noopener noreferrer"&gt;following Consul’s docs monitoring recommendations&lt;/a&gt;. Also, you’ll find out how to troubleshoot the most common Consul issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to install Consul in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Installing Consul in Kubernetes is straightforward: just take a look at the &lt;a href="https://www.consul.io/docs" rel="noopener noreferrer"&gt;Consul documentation page&lt;/a&gt; and follow the instructions. We &lt;a href="https://www.consul.io/docs/k8s/installation/install#helm-chart-installation" rel="noopener noreferrer"&gt;recommend using the Helm chart&lt;/a&gt;, since it’s the easiest way of deploying applications in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-01.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to configure Consul to expose Prometheus metrics
&lt;/h2&gt;

&lt;p&gt;Consul &lt;a href="https://www.consul.io/docs/k8s/connect/observability/metrics" rel="noopener noreferrer"&gt;automatically exports metrics in the Prometheus format&lt;/a&gt;. You just need to enable these options in the &lt;code&gt;global.metrics&lt;/code&gt; configuration. If you’re using Helm, you can do it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--set 'global.metrics.enabled=true'
--set 'global.metrics.enableAgentMetrics=true'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, you’ll need to set &lt;code&gt;telemetry.disable_hostname&lt;/code&gt; for both the Consul server and client so the metrics don’t contain the name of the instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"'
--set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
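&lt;p&gt;Putting the flags together, a complete install could look like the sketch below (the release name and chart repository are placeholders for your own setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install consul hashicorp/consul \
  --set 'global.metrics.enabled=true' \
  --set 'global.metrics.enableAgentMetrics=true' \
  --set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"' \
  --set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;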



&lt;h2&gt;
  
  
  Monitor Consul with Prometheus: Overall status
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Autopilot
&lt;/h3&gt;

&lt;p&gt;First, you can check the overall health of the Consul server using the Autopilot metric (&lt;code&gt;consul_autopilot_healthy&lt;/code&gt;). If all servers are healthy, this metric returns 1; otherwise, it returns 0. All non-leader servers report &lt;code&gt;NaN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You could add this PromQL query to your dashboard to check the overall status of the Consul server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;min(consul_autopilot_healthy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding these thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;1&lt;/code&gt;: “Healthy”&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;0&lt;/code&gt;: “Unhealthy”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To trigger an alert when one or many Consul servers in the cluster are unhealthy, you can simply use this PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consul_autopilot_healthy == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
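&lt;p&gt;If you manage alerts directly in Prometheus, that same expression fits into a standard alerting rule. Here's a sketch (the group name, alert name, and &lt;code&gt;for&lt;/code&gt; duration are up to you):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: consul
    rules:
      - alert: ConsulServerUnhealthy
        expr: consul_autopilot_healthy == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: One or more Consul servers are unhealthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;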



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-02.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to dig deeper into PromQL? Read our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noopener noreferrer"&gt;getting started with PromQL&lt;/a&gt; guide to learn how Prometheus stores data, and how to use PromQL functions and operators.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Leadership changes
&lt;/h3&gt;

&lt;p&gt;Consul deploys several instances of the control-plane controllers to ensure high availability. However, only one of them is the leader and the rest are for contingency. A Consul cluster should always have a stable leader. If it’s not stable, due to frequent elections or leadership changes, you could be facing network issues between the Consul servers.&lt;/p&gt;

&lt;p&gt;To check leadership stability, you can use the following metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;consul_raft_leader_lastContact&lt;/code&gt;: Indicates how much time has passed since the leader contacted the follower nodes when checking its leader lease.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;consul_raft_state_leader&lt;/code&gt;: Number of leaders.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;consul_raft_state_candidate&lt;/code&gt;: Number of candidates to promote to leader. If this metric returns a number higher than 0, it means that a leadership change is in progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a healthy cluster, you’re looking for a &lt;code&gt;consul_raft_leader_lastContact&lt;/code&gt; lower than 200ms, a &lt;code&gt;consul_raft_state_leader&lt;/code&gt; greater than 0, and a &lt;code&gt;consul_raft_state_candidate&lt;/code&gt; equal to 0.&lt;/p&gt;

&lt;p&gt;Let’s create some alerts to trigger if there is flapping leadership.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  There are too many elections for leadership: &lt;code&gt;sum(rate(consul_raft_state_candidate[1m]))&amp;gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  There are too many leadership changes: &lt;code&gt;sum(rate(consul_raft_state_leader[1m]))&amp;gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Leader time to contact followers is too high: &lt;code&gt;consul_raft_leader_lastContact{quantile="0.9"}&amp;gt;200&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The last query contains the label &lt;code&gt;quantile="0.9"&lt;/code&gt;, which selects the &lt;a href="https://en.wikipedia.org/wiki/Percentile" rel="noopener noreferrer"&gt;90th percentile&lt;/a&gt;. If the p90 exceeds 200ms, at least 10% of the leader’s contacts with its followers are taking longer than 200ms.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-03.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Top troubleshooting situations to monitor Consul
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Long latency in Consul transactions
&lt;/h3&gt;

&lt;p&gt;Long latency in Consul transactional operations could be due to an unexpected load on the Consul servers, or to issues on the servers themselves.&lt;/p&gt;

&lt;p&gt;Anomalies need to be detected in a time context: the network is dynamic by nature, so you can’t just compare your samples with a fixed value. Instead, compare current values with those from the last hour (or the last day, or the last five minutes) to determine whether a value is acceptable or needs attention.&lt;/p&gt;

&lt;p&gt;To detect anomalies, you can dust off your old statistics book and find the chapter explaining the normal distribution: 95% of the samples in a normal distribution fall between the average plus or minus two times the standard deviation.&lt;/p&gt;

&lt;p&gt;To calculate this in PromQL, you can use the &lt;code&gt;avg_over_time&lt;/code&gt; and &lt;code&gt;stddev_over_time&lt;/code&gt; functions, like in this example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_kvs_apply_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]) + 2* stddev_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-04.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s see a few alerts that are triggered if the transaction latency isn’t normal.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key-Value Store update time anomaly
&lt;/h4&gt;

&lt;p&gt;Consul KV Store update time had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_kvs_apply_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]) + 2* stddev_over_time(rate(consul_kvs_apply_sum[1m]) [1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-05.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that these examples contain &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/examples/#subquery" rel="noopener noreferrer"&gt;PromQL subqueries&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Transaction time anomalies
&lt;/h4&gt;

&lt;p&gt;Consul Transaction time had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_txn_apply_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_txn_apply_sum[1m])[1h:1m])+2*stddev_over_time(rate(consul_txn_apply_sum[1m])[1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Consul has a &lt;a href="https://www.consul.io/docs/architecture/consensus" rel="noopener noreferrer"&gt;Consensus protocol that uses the Raft algorithm&lt;/a&gt;. Raft is a “consensus” algorithm, a method to achieve value convergence over a distributed and fault-tolerant set of cluster nodes.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Transactions count anomaly
&lt;/h4&gt;

&lt;p&gt;Consul transactions count rate had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_raft_apply[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_raft_apply[1m])[1h:1m])+2*stddev_over_time(rate(consul_raft_apply[1m])[1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Commit time anomalies
&lt;/h4&gt;

&lt;p&gt;Consul commit time had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_raft_commitTime_sum[1m]) &amp;gt; 0)&amp;gt;(avg_over_time(rate(consul_raft_commitTime_sum[1m])[1h:1m])+2*stddev_over_time(rate(consul_raft_commitTime_sum[1m]) [1h:1m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  High memory consumption
&lt;/h3&gt;

&lt;p&gt;Keeping the memory usage under control is key to keeping the Consul server healthy. Let’s create some alerts to be sure that your Consul server doesn’t use more memory than available.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consul is using more than 90% of available memory.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 * sum by(namespace,pod,container)(container_memory_usage_bytes{container!="POD",container!="", namespace="consul"}) / sum by(namespace,pod,container)(kube_pod_container_resource_limits{job!="",resource="memory", namespace="consul"}) &amp;gt; 90

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-06.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The garbage collection pause is high
&lt;/h4&gt;

&lt;p&gt;Consul’s garbage collector has a &lt;em&gt;pause&lt;/em&gt; event that blocks all runtime threads until the garbage collection completes. Each pause is usually brief, but if Consul’s memory usage is high, it can trigger more and more GC events that could potentially slow down Consul.&lt;/p&gt;

&lt;p&gt;Let’s create two alerts: a warning alert if the GC takes more than two seconds per minute, and a critical alert if the GC takes more than five seconds per minute.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that one second is 1000000000 nanoseconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Garbage Collection stop-the-world pauses were greater than two seconds per minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_runtime_gc_pause_ns_sum[1m]) / 1000000000 &amp;gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Garbage Collection stop-the-world pauses were greater than five seconds per minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_runtime_gc_pause_ns_sum[1m]) / 1000000000 &amp;gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network load is high
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-07.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A high RPC call count may mean that requests are being rate-limited, which could imply a misconfigured Consul agent.&lt;/p&gt;

&lt;p&gt;Now it’s time to make sure that your Consul clients aren’t being rate-limited when sending requests to the Consul server. These are the recommended alerts for RPC connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  Client RPC requests anomaly
&lt;/h4&gt;

&lt;p&gt;Consul Client RPC requests had noticeable deviations from baseline over the previous hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg(rate(consul_client_rpc[1m]) &amp;gt; 0) &amp;gt; (avg_over_time(rate(consul_client_rpc[1m]) [1h:1m])+ 2* stddev_over_time(rate(consul_client_rpc[1m]) [1h:1m]) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Client RPC requests rate limit exceeded
&lt;/h4&gt;

&lt;p&gt;Over 10% of Consul Client RPC requests have exceeded the rate limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_client_rpc_exceeded[1m]) / rate(consul_client_rpc[1m]) &amp;gt; 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Client RPC requests failed
&lt;/h4&gt;

&lt;p&gt;Over 10% of Consul Client RPC requests are failing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(consul_client_rpc_failed[1m]) / rate(consul_client_rpc[1m]) &amp;gt; 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Replica issues
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Restoration time is too high
&lt;/h4&gt;

&lt;p&gt;In this situation, restoring from disk or the leader is slower than the leader writing a new snapshot and truncating its logs. After a restart, followers might never rejoin the cluster until write rates reduce.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consul_raft_leader_oldestLogAge &amp;lt; 2* max(consul_raft_fsm_lastRestoreDuration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using Consul Enterprise? Check that your license is up-to-date!
&lt;/h2&gt;

&lt;p&gt;You can use this simple PromQL query to check if your Consul Enterprise license will expire in less than 30 days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consul_system_licenseExpiration / 24 &amp;lt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitor Consul with Prometheus, with these dashboards
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-Monitor-and-troubleshooting-consul-with-Prometheus-image-08.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don’t miss these &lt;a href="https://promcat.io/apps/consul#Dashboard" rel="noopener noreferrer"&gt;open source dashboards, already set up&lt;/a&gt; to monitor not only your Consul cluster overview, but also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Health&lt;/li&gt;
&lt;li&gt;  Transaction&lt;/li&gt;
&lt;li&gt;  Leadership&lt;/li&gt;
&lt;li&gt;  Network&lt;/li&gt;
&lt;li&gt;  Cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you’ve learned how to monitor the Consul control plane with Prometheus, and some alert recommendations, useful for troubleshooting the most common Consul issues.&lt;/p&gt;




</description>
      <category>consul</category>
      <category>prometheus</category>
      <category>kubernetes</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How to monitor Starlink with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Tue, 01 Mar 2022 09:04:31 +0000</pubDate>
      <link>https://dev.to/eckelon/how-to-monitor-starlink-with-prometheus-4bb7</link>
      <guid>https://dev.to/eckelon/how-to-monitor-starlink-with-prometheus-4bb7</guid>
      <description>&lt;p&gt;SpaceX's Starlink uses satellites in low-earth orbit to provide high-speed Internet services to most of the planet. During the beta, Starlink expects users to see data speeds vary from 50Mb/s to 150Mb/s and latency from 20ms to 40ms. It's also expected that there will be brief periods of no connectivity at all. Currently, there are around &lt;a href="https://www.spacex.com/launches"&gt;1,800 Starlink satellites in orbit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vWS-syrU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vWS-syrU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-01.png" alt="" width="697" height="970"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to monitor a Starlink connection
&lt;/h2&gt;

&lt;p&gt;There are several great projects available from the open source community, but the one we settled on using for the basis of our project was the &lt;a href="https://github.com/danopstech/starlink_exporter"&gt;Starlink Prometheus Exporter&lt;/a&gt; from Daniel Willcocks. We encourage you to look at his other project, Starlink Monitoring System, if you are interested in a pre-packaged solution.&lt;/p&gt;

&lt;p&gt;To monitor Starlink connections, we decided to fork the Starlink Prometheus Exporter project and &lt;a href="https://github.com/danopstech/starlink_exporter/pull/59"&gt;create a PR that updates the Starlink gRPC bindings using the latest Starlink firmware&lt;/a&gt; to provide some additional metrics from Starlink Dishy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BmzVaB3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-Featured-image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BmzVaB3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-Featured-image.png" alt="" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How does it work
&lt;/h3&gt;

&lt;p&gt;The Starlink Dishy is contactable at &lt;code&gt;192.168.100.1&lt;/code&gt; on port &lt;code&gt;9200&lt;/code&gt; for gRPC. If you are using the Starlink Wi-Fi router, this should be reachable by default. In this example, you'll monitor your Starlink connection using the Starlink Exporter to talk to Starlink Dishy via gRPC and expose metrics in a format Prometheus understands.&lt;/p&gt;
&lt;h3&gt;
  
  
  Requirements and what you will use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Access to a Starlink Internet Service.&lt;/li&gt;
&lt;li&gt;  Linux Node running Ubuntu 20.04 LTS.&lt;/li&gt;
&lt;li&gt;  Docker and Docker Compose.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/sysdigdan/starlink_exporter"&gt;Starlink Prometheus Exporter&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Prometheus.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Configuring Prometheus and Launching Containers
&lt;/h2&gt;

&lt;p&gt;First, you need to configure Prometheus to scrape the Starlink Exporter. Create a &lt;code&gt;prometheus&lt;/code&gt; folder and add the configuration file &lt;code&gt;prometheus.yml&lt;/code&gt; as seen below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval:     10s # By default, scrape targets every 15 seconds.
  evaluation_interval: 10s # By default, evaluate rules every 15 seconds.
  scrape_timeout:      10s # By default, it is set to the global default (10s).


  external_labels:
    monitor: 'starlink-exporter'
    origin_prometheus: 'starlink'

scrape_configs:
  - job_name: 'starlink'
    static_configs:
      - targets: ['127.0.0.1:9817']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, launch the Prometheus and Starlink Exporter containers using Docker Compose and the following YAML (save this as &lt;code&gt;docker-compose.yml&lt;/code&gt; in the same location as your prometheus.yml above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.8'

volumes:
  prometheus_data: {}

services:
  starlink-exporter:
    image: sysdigdan/starlink_exporter:v0.1.3
    container_name: starlink_exporter
    restart: unless-stopped
    network_mode: host

  prometheus:
    image: prom/prometheus:v2.32.1
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    network_mode: host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, from the same directory as your &lt;code&gt;docker-compose.yml&lt;/code&gt; and &lt;code&gt;prometheus.yml&lt;/code&gt;, you can launch the containers with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose up -d&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let's make sure everything is running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker ps&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9mrdg8Lk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9mrdg8Lk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-02.png" alt="" width="880" height="42"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Monitor Starlink connection with Prometheus dashboards
&lt;/h2&gt;

&lt;p&gt;Now that both containers are running, you can access Prometheus (http://&amp;lt;NODE IP&amp;gt;:9090/) and look at the available metrics coming from Starlink Dishy (http://&amp;lt;NODE IP&amp;gt;:9817/metrics).&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitor Starlink connection: Performance Metrics
&lt;/h3&gt;

&lt;p&gt;You can review throughput utilization using &lt;code&gt;starlink_dish_downlink_throughput_bytes&lt;/code&gt; and &lt;code&gt;starlink_dish_uplink_throughput_bytes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--74C6h2p---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--74C6h2p---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-03.png" alt="" width="880" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hZVRbedx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hZVRbedx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-04.png" alt="" width="880" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also quickly see the latency between Starlink Dishy, Satellite, and Ground Station by using &lt;code&gt;starlink_dish_pop_ping_latency_seconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LHqd-kJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LHqd-kJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-05.png" alt="" width="880" height="478"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitor Starlink connection: Stability Metrics
&lt;/h3&gt;

&lt;p&gt;If you are interested in understanding the cause of outages, you can use the following PromQL query to review all outages over the past 24 hours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (cause) (sum_over_time(starlink_dish_outage_duration{cause!='UNKNOWN'}[24h])) / 10^9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
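The division by 10^9 converts the outage durations, which Dishy reports in nanoseconds, into seconds. A minimal Python sketch of the same per-cause aggregation, using hypothetical sample values:

```python
# Sum outage durations per cause and convert nanoseconds to seconds,
# mirroring: sum by (cause) (sum_over_time(...[24h])) / 10^9
from collections import defaultdict

# Hypothetical 24h samples of starlink_dish_outage_duration: (cause, duration_ns)
samples = [
    ("OBSTRUCTED", 3_500_000_000),
    ("NO_DOWNLINK", 1_200_000_000),
    ("OBSTRUCTED", 500_000_000),
]

def outage_seconds_by_cause(samples):
    totals = defaultdict(int)
    for cause, duration_ns in samples:
        totals[cause] += duration_ns
    return {cause: ns / 1e9 for cause, ns in totals.items()}

print(outage_seconds_by_cause(samples))
# {'OBSTRUCTED': 4.0, 'NO_DOWNLINK': 1.2}
```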



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OCmWzkqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OCmWzkqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-06.png" alt="" width="880" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also count the occurrences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (cause) (count_over_time(starlink_dish_outage_duration{cause!='UNKNOWN'}[24h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EfoBxjJj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EfoBxjJj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-07.png" alt="" width="880" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to dig deeper into PromQL? Read our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/"&gt;getting started with PromQL&lt;/a&gt; guide to learn how Prometheus stores data, and how to use PromQL functions and operators.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitor Starlink connection: Troubleshooting Metrics
&lt;/h3&gt;

&lt;p&gt;To understand satellite obstruction, use the following PromQL query, which shows a measure of obstruction in twelve 30-degree wedges around Dishy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;starlink_dish_wedge_abs_fraction_obstruction_ratio &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BV4Vp6S---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BV4Vp6S---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-08.png" alt="" width="880" height="595"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitor Starlink connection with Sysdig Monitor LTS
&lt;/h2&gt;

&lt;p&gt;With Prometheus and the Starlink Exporter all set up, we need to think about how best to provide longer retention for comparison over time. By default, Prometheus provides 15 days of retention. This can be adjusted, but the downside is that we would then need to manage storage and backups.&lt;/p&gt;

&lt;p&gt;One of the features that customers of Sysdig Monitor are taking full advantage of is Prometheus Remote Write, which allows us to natively ingest metrics from many Prometheus servers. There's also no need to manage storage, and with &lt;a href="https://sysdig.com/blog/challenges-prometheus-lts/"&gt;long retention and always-on metrics&lt;/a&gt;, it's a simple choice!&lt;/p&gt;

&lt;p&gt;The configuration for Prometheus Remote Write is simple. You just need to append a new remote_write section to the &lt;code&gt;prometheus.yml&lt;/code&gt; file we created earlier, similar to the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;remote_write:
    - url: "https:///prometheus/remote/write"
      bearer_token: ""
      tls_config:
        insecure_skip_verify: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the Prometheus container and you're done!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker restart prometheus&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XdMmqjpi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XdMmqjpi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-09.png" alt="" width="880" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GdNuJtzU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-010.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GdNuJtzU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-010.png" alt="" width="880" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lIGR-aUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-011.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lIGR-aUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Blog-How-to-monitor-Starlink-image-011.png" alt="" width="880" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article was posted originally &lt;a href="https://sysdig.com/blog/monitor-starlink/"&gt;by Dan Moloney in Sysdig&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>monitoring</category>
      <category>starlink</category>
      <category>prometheus</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Top PostgreSQL monitoring metrics for Prometheus – Includes cheat sheet</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Mon, 15 Nov 2021 15:49:43 +0000</pubDate>
      <link>https://dev.to/eckelon/top-postgresql-monitoring-metrics-for-prometheus-includes-cheat-sheet-47ch</link>
      <guid>https://dev.to/eckelon/top-postgresql-monitoring-metrics-for-prometheus-includes-cheat-sheet-47ch</guid>
      <description>&lt;p&gt;PostgreSQL monitoring with Prometheus is an &lt;a href="https://promcat.io/apps/postgresql#SetupGuide" rel="noopener noreferrer"&gt;easy thing to do&lt;/a&gt; thanks to the &lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;PostgreSQL Exporter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL is an open-source relational database with a powerful community behind it. It’s very popular due to its &lt;strong&gt;strong stability and powerful data types&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn the &lt;strong&gt;top 10 metrics in PostgreSQL monitoring&lt;/strong&gt;, with alert examples, both for PostgreSQL instances in Kubernetes and AWS RDS PostgreSQL instances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Blog-PostgreSQL-Monitoring-Featured-Image-v2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Featured-Image-v2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, we encourage you to &lt;a href="https://dig.sysdig.com/c/pf-top-10-metrics-in-postgresql?x=u_WFRi" rel="noopener noreferrer"&gt;download our Top 10 PostgreSQL monitoring metrics cheat sheet&lt;/a&gt; to dig deeper on how to monitor PostgreSQL with Prometheus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 metrics in PostgreSQL monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-1.png" alt="PostgreSQL dashboard showing the availability metric to 1, in a green background"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  #1 Check if PostgreSQL is running
&lt;/h4&gt;

&lt;p&gt;Checking that &lt;strong&gt;your PostgreSQL instance is up and running&lt;/strong&gt; should be the first step in PostgreSQL monitoring. The exporter will monitor the connection and availability of the PostgreSQL instance. The metric for monitoring PostgreSQL availability is &lt;code&gt;pg_up&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create an alert that triggers if the PostgreSQL server goes down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_up == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  #2 Postmaster Service Uptime
&lt;/h4&gt;

&lt;p&gt;Also, it’s important to ensure that the &lt;strong&gt;postmaster service uptime reflects the last known controlled server restart&lt;/strong&gt;. Otherwise, it means that the server has been restarted for unknown reasons. The metric for monitoring postmaster uptime is &lt;code&gt;pg_postmaster_start_time_seconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create an alert to notify if the PostgreSQL server was restarted without a known reason in the last hour (&lt;code&gt;3600&lt;/code&gt; seconds).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;time() - pg_postmaster_start_time_seconds &amp;lt; 3600
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
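The expression is true while the postmaster started less than an hour ago. The same check sketched in Python, with hypothetical timestamps:

```python
import time

# Fires if the server started within the last `window` seconds (default one hour),
# mirroring: time() - pg_postmaster_start_time_seconds compared against 3600
def restarted_recently(start_time_seconds, now=None, window=3600):
    now = time.time() if now is None else now
    return window - (now - start_time_seconds) > 0

# Hypothetical: a server that started 10 minutes ago vs. 2 hours ago
now = 1_700_000_000
print(restarted_recently(now - 600, now=now))   # True
print(restarted_recently(now - 7200, now=now))  # False
```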



&lt;h3&gt;
  
  
  Replication
&lt;/h3&gt;

&lt;h4&gt;
  
  
  #3 Replication lag
&lt;/h4&gt;

&lt;p&gt;In scenarios with replicated PostgreSQL servers, &lt;strong&gt;a high replication lag rate can lead to coherence problems&lt;/strong&gt; if the master goes down. The metric for monitoring replication lag is &lt;code&gt;pg_replication_lag&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s create an alert that triggers if the replication lag is greater than &lt;code&gt;10&lt;/code&gt; seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_replication_lag &amp;gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;Running out of disk is a common problem in all databases. It can also prevent the Write Ahead Log (WAL) from being written to disk, which could end up in &lt;strong&gt;transaction issues&lt;/strong&gt; affecting data persistence.&lt;/p&gt;

&lt;p&gt;Luckily, it’s also a very easy thing to monitor. We will check the database size and the available disk space.&lt;/p&gt;

&lt;h4&gt;
  
  
  #4 Database size
&lt;/h4&gt;

&lt;p&gt;First, let’s figure out the storage usage of each of the &lt;strong&gt;PostgreSQL databases&lt;/strong&gt; in our instance. For this, we’ll use the &lt;code&gt;pg_database_size_bytes&lt;/code&gt; metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-2.png" alt="PosgreSQL dashboard showing the sizes of the different databases. In a chart, with a different color for each db."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  #5 Available storage
&lt;/h4&gt;

&lt;p&gt;It depends on how you run your PostgreSQL instance:&lt;/p&gt;

&lt;h5&gt;
  
  
  Kubernetes
&lt;/h5&gt;

&lt;p&gt;You can use the &lt;code&gt;node_filesystem_free_bytes&lt;/code&gt; metric from the &lt;a href="https://github.com/prometheus/node_exporter" rel="noopener noreferrer"&gt;node_exporter&lt;/a&gt;. You may remember when we predicted the future in our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noopener noreferrer"&gt;getting started PromQL guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-3.png" alt="PosgreSQL dashboard showing the percentage of disk used per node, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert to notify us when we are going to have less than &lt;code&gt;1&lt;/code&gt; GiB in the next &lt;code&gt;24&lt;/code&gt; hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(node_filesystem_free_bytes[1w], 3600 * 24) / (1024 * 1024 * 1024) &amp;lt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
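Under the hood, predict_linear fits a linear regression over the range and extrapolates it forward. A rough Python sketch of that idea, using a least-squares fit over hypothetical free-bytes samples and projecting 24 hours ahead:

```python
# Least-squares linear fit over (timestamp, value) samples, then extrapolation
# t_ahead seconds past the last sample -- roughly what PromQL's
# predict_linear(series[range], t_ahead) computes.
def predict_linear(samples, t_ahead):
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + t_ahead) + intercept

# Hypothetical node_filesystem_free_bytes: losing ~1 GiB every 6 hours
gib = 1024 ** 3
samples = [(0, 10 * gib), (21600, 9 * gib), (43200, 8 * gib)]
projected = predict_linear(samples, 24 * 3600)
print(projected / gib)  # about 4.0 GiB projected to remain in 24 hours
```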



&lt;h5&gt;
  
  
  AWS RDS PostgreSQL
&lt;/h5&gt;

&lt;p&gt;Cloud-managed database solutions, like AWS RDS, are getting more and more popular. If you are running an AWS RDS PostgreSQL instance, you can monitor it &lt;a href="https://sysdig.com/blog/monitoring-amazon-rds/" rel="noopener noreferrer"&gt;through CloudWatch and the YACE exporter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can use the &lt;code&gt;aws_rds_free_storage_space_average&lt;/code&gt; metric. Let’s create an alert if you’re going to run out of storage in the next &lt;code&gt;48&lt;/code&gt; hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(aws_rds_free_storage_space_average[48h], 48 * 3600) &amp;lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://dig.sysdig.com/c/pf-top-10-metrics-in-postgresql?x=u_WFRi" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FTop-10-metrics-in-ProgreSQL-Post-image_top-10-metrics-blog-img-2.png" alt="Download the PromQL CheatSheet!"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;If you had to keep just one networking metric, it should be the available connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  #6 Number of available connections
&lt;/h4&gt;

&lt;p&gt;We are going to &lt;strong&gt;calculate the available connections&lt;/strong&gt; by subtracting the superuser reserved connections (&lt;code&gt;pg_settings_superuser_reserved_connections&lt;/code&gt;) and the active connections (&lt;code&gt;pg_stat_activity_count&lt;/code&gt;) from the maximum number of connections (&lt;code&gt;pg_settings_max_connections&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-4.png" alt="PosgreSQL dashboard showing the percentage of available connections per node, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert to notify if the number of available connections is under &lt;code&gt;10&lt;/code&gt; percent of the total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((sum(pg_settings_max_connections) by (server) - sum(pg_settings_superuser_reserved_connections) by (server) - sum(pg_stat_activity_count) by (server)) / sum(pg_settings_max_connections) by (server)) * 100 &amp;lt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
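In plain arithmetic, the alert fires when (max - reserved - active) / max drops below 10 percent. A minimal sketch with hypothetical settings:

```python
# Available-connection percentage: (max - reserved - active) * 100 / max
def available_connection_pct(max_conn, reserved, active):
    return (max_conn - reserved - active) * 100 / max_conn

# Hypothetical instance: 100 max connections, 3 reserved for superusers, 90 active
pct = available_connection_pct(100, 3, 90)
print(pct)  # 7.0 -- under the 10 percent threshold, so the alert would fire
```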



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-5.png" alt="PosgreSQL dashboard showing the number of available connections per node, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;p&gt;Checking performance in any database means keeping an eye on &lt;strong&gt;CPU and memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a server runs out of memory, it can lead to more CPU load. Fortunately, some indicators warn us if memory usage needs to be optimized.&lt;/p&gt;

&lt;h4&gt;
  
  
  #7 Latency
&lt;/h4&gt;

&lt;p&gt;First, we are going to measure performance by calculating how much time it takes to get the results from the slowest active transaction. To do that, we’ll use the &lt;code&gt;pg_stat_activity_max_tx_duration&lt;/code&gt; metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-6.png" alt="PosgreSQL dashboard showing the max active transaction time, by DB, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert that notifies us when the active transaction takes more than &lt;code&gt;2&lt;/code&gt; seconds to complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_stat_activity_max_tx_duration{state="active"} &amp;gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  #8 Cache hit rate
&lt;/h4&gt;

&lt;p&gt;High latency can be a consequence of &lt;strong&gt;problems with the cache in memory&lt;/strong&gt;, which increases disk usage, so everything is slower.&lt;/p&gt;

&lt;p&gt;For analyzing the cache hit rate, we’ll check the in-memory transactions (&lt;code&gt;pg_stat_database_blks_hit&lt;/code&gt;) and the transactions running in disk (&lt;code&gt;pg_stat_database_blks_read&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-7.png" alt="PosgreSQL dashboard showing the average cache hit rate for the instance, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert when the cache hit rate is lower than &lt;code&gt;80&lt;/code&gt; percent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 * (rate(pg_stat_database_blks_hit[$__interval]) /
((rate(pg_stat_database_blks_hit[$__interval]) +
rate(pg_stat_database_blks_read[$__interval]))&amp;gt;0)) &amp;lt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
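The ratio behind the query is simply hits / (hits + reads). A small sketch with hypothetical counter rates:

```python
# Cache hit rate: share of block fetches served from memory rather than disk
def cache_hit_rate(blks_hit, blks_read):
    total = blks_hit + blks_read
    if total == 0:
        return None  # no traffic in the window, nothing to alert on
    return 100 * blks_hit / total

print(cache_hit_rate(750, 250))  # 75.0 -- under 80 percent, so the alert would fire
```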



&lt;h4&gt;
  
  
  #9 Memory available
&lt;/h4&gt;

&lt;p&gt;The solution for a low hit rate is &lt;strong&gt;increasing the memory&lt;/strong&gt; of your instance. But this is &lt;strong&gt;not always possible&lt;/strong&gt; due to potential memory limitations. So, first, we need to be sure that we have &lt;strong&gt;enough available memory&lt;/strong&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Kubernetes
&lt;/h5&gt;

&lt;p&gt;You can combine the total memory available for your instance (&lt;code&gt;kube_pod_container_resource_limits{resource="memory"}&lt;/code&gt;) with the memory being used (&lt;code&gt;container_memory_usage_bytes{container!="POD",container!=""}&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Let’s write a PromQL query that uses those metrics to get the total available memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(namespace,pod,container)(kube_pod_container_resource_limits{resource="memory"}) - sum by(namespace,pod,container)(container_memory_usage_bytes{container!="POD",container!=""})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this information, you can now determine &lt;strong&gt;how much more memory&lt;/strong&gt; you can allocate to your instance.&lt;/p&gt;

&lt;h5&gt;
  
  
  AWS RDS PostgreSQL instance
&lt;/h5&gt;

&lt;p&gt;If you are using AWS RDS PostgreSQL, then it’s really easy to know the available memory: just use the &lt;code&gt;aws_rds_freeable_memory_average&lt;/code&gt; metric!&lt;/p&gt;

&lt;h4&gt;
  
  
  #10 Requested buffer checkpoints
&lt;/h4&gt;

&lt;p&gt;PostgreSQL uses the buffer checkpoints to write the dirty buffers on disk, so it creates safe points for the Write Ahead Log (WAL). These checkpoints are scheduled periodically but also &lt;strong&gt;can be requested on-demand&lt;/strong&gt; when the buffer runs out of space.&lt;/p&gt;

&lt;p&gt;A high number of requested checkpoints compared to the number of scheduled checkpoints can directly impact the performance of your PostgreSQL instance. To avoid this situation, you could &lt;strong&gt;increase the database buffer size&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please note that increasing the buffer size &lt;strong&gt;will also increase the memory usage of your PostgreSQL instance&lt;/strong&gt;. Check your memory availability in the previous step.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s create a PromQL query to visualize the percentage of requested checkpoints (&lt;code&gt;pg_stat_bgwriter_checkpoints_req&lt;/code&gt;) compared with the total of both scheduled (&lt;code&gt;pg_stat_bgwriter_checkpoints_timed&lt;/code&gt;) and requested checkpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(pg_stat_bgwriter_checkpoints_req[5m]) /
(rate(pg_stat_bgwriter_checkpoints_req[5m]) + rate(pg_stat_bgwriter_checkpoints_timed[5m])) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
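The query boils down to requested / (requested + scheduled) * 100. A minimal sketch with hypothetical 5-minute rates:

```python
# Percentage of checkpoints that were requested on demand rather than scheduled
def requested_checkpoint_pct(req_rate, timed_rate):
    total = req_rate + timed_rate
    if total == 0:
        return 0.0  # no checkpoints in the window
    return req_rate * 100 / total

# Hypothetical rates: 2 requested and 8 scheduled checkpoints per window
print(requested_checkpoint_pct(2, 8))  # 20.0
```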



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-8.png" alt="PosgreSQL dashboard showing the percentage of requested checkpoints, comparing to the scheduled ones for the instance, in a chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  That was nice, but where are my PostgreSQL monitoring dashboards?
&lt;/h2&gt;

&lt;p&gt;In this article, we introduced PostgreSQL monitoring with Prometheus, using &lt;code&gt;postgres_exporter&lt;/code&gt;. It doesn’t matter if you run your own &lt;strong&gt;PostgreSQL instance in Kubernetes, or in an AWS RDS PostgreSQL&lt;/strong&gt; instance. We also introduced the &lt;a href="https://dig.sysdig.com/c/pf-top-10-metrics-in-postgresql?x=u_WFRi" rel="noopener noreferrer"&gt;Top 10 metrics in PostgreSQL monitoring with Prometheus cheat sheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can now download the already configured &lt;a href="https://promcat.io/apps/postgresql#Dashboard" rel="noopener noreferrer"&gt;PostgreSQL monitoring dashboards from PromCat&lt;/a&gt; and add them to your Grafana installation (or to Sysdig Monitor!).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-PostgreSQL-Monitoring-Image-9.png" alt="Screenshot showing the available PostgreSQL monitoring dashboards to download, in PromCat.io"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>monitoring</category>
      <category>kubernetes</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Top 5 key metrics for monitoring AWS RDS</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Thu, 15 Apr 2021 15:40:33 +0000</pubDate>
      <link>https://dev.to/eckelon/top-5-key-metrics-for-monitoring-aws-rds-3563</link>
      <guid>https://dev.to/eckelon/top-5-key-metrics-for-monitoring-aws-rds-3563</guid>
      <description>&lt;p&gt;Monitoring AWS RDS may require some &lt;strong&gt;observability strategy changes&lt;/strong&gt; if you switched from a classic on-prem MySQL/PostgreSQL solution.&lt;/p&gt;

&lt;p&gt;AWS RDS is a great solution that helps you &lt;strong&gt;focus on the data, and forget about bare metal&lt;/strong&gt;, patches, backups, etc. However, since you don’t have direct access to the machine, you’ll need to adapt your monitoring platform.&lt;/p&gt;

&lt;p&gt;In this article, we are going to describe the differences between an on-prem database solution and AWS RDS, as well as how you can &lt;strong&gt;start monitoring AWS RDS&lt;/strong&gt;. Also, we will identify the top five key metrics for monitoring AWS RDS. Maybe even more!&lt;/p&gt;

&lt;h2&gt;
  
  
  How AWS RDS is different from other on-prem database solutions
&lt;/h2&gt;

&lt;p&gt;Since AWS RDS is a managed cloud service, the way you configure and use it is &lt;strong&gt;through the AWS Console or AWS API&lt;/strong&gt;. You won’t have a terminal to access the machine directly, so every operation, like replication, backups, or disk management, has to be performed this way.&lt;/p&gt;

&lt;p&gt;You won’t have to worry about infrastructure matters such as replication, scaling, or backups. But you won’t have direct access to the instance either. That being so, &lt;strong&gt;you won’t be able to monitor AWS RDS using a classic node-exporter strategy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring AWS RDS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/blog/improving-prometheus-cloudwatch-exporter/" rel="noopener noreferrer"&gt;Monitoring AWS is pretty straightforward&lt;/a&gt;, using &lt;a href="https://github.com/ivx/yet-another-cloudwatch-exporter" rel="noopener noreferrer"&gt;YACE&lt;/a&gt; to get data from AWS CloudWatch and store it in Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2F01_Improving-the-Prometheus-CloudWatch-exporter.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2F01_Improving-the-Prometheus-CloudWatch-exporter.png" alt="Sysdig collaborated with the YACE exporter to make it production ready. CloudWatch gathers metrics, that YACE reads and presents in a Prometheus compatible format."&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://promcat.io/" rel="noopener noreferrer"&gt;PromCat&lt;/a&gt; to include AWS RDS in this setup will take you a couple of clicks. Just configure the credentials and apply the deployment in your cluster. Every step in the &lt;strong&gt;configuration is very well explained&lt;/strong&gt; in the &lt;a href="https://promcat.io/apps/aws-rds" rel="noopener noreferrer"&gt;AWS RDS PromCat setup guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-promcat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-promcat.png" alt="screenshot showing the setup guide page for the RDS configuration in PromCat.io"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 5 metrics you should look at
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Memory is constantly used in databases to cache the queries, tables, and results in order to minimize disk operations. This is directly related to how your database will perform. Not having enough memory will cause a low hit rate in the cache and an increase in the response time in your database. This is not good news!&lt;/p&gt;

&lt;p&gt;Also, every time a client connects to your database, it creates a new process that will use some memory. In situations with massive concurrent connections, like Black Friday, running out of memory can result in multiple rejected connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_freeable_memory_average&lt;/code&gt; metric (which YACE reads from the CloudWatch &lt;code&gt;FreeableMemory&lt;/code&gt; metric). It reports the memory available, in bytes, for your instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Memory.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Memory.png" alt="chart showing the values for the aws_rds_freeable_memory_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the available memory is under 128 MB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_freeable_memory_average &amp;lt; 128*1024*1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
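
&lt;p&gt;You can also anticipate memory exhaustion instead of reacting to it. As a sketch, this fires when the freeable memory trend of the last hour predicts hitting zero within four hours (both windows are illustrative and worth tuning for your workload):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(aws_rds_freeable_memory_average[1h], 4 * 3600) &amp;lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;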



&lt;h3&gt;
  
  
  DB Connections
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Even if there’s enough available memory, there is a max number of DB connections in every instance. If you reach this number, the following connections will be rejected, causing database errors in your application.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_database_connections_average&lt;/code&gt; metric (which uses the &lt;code&gt;DatabaseConnections&lt;/code&gt; CloudWatch metric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dbconnections.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dbconnections.png" alt="chart showing the values for the aws_rds_database_connections_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the number of DB connections is greater than 1,000. Unfortunately, CloudWatch does not provide the maximum number of DB connections, so you’ll need to specify it manually in the PromQL query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_database_connections_average &amp;gt; 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create an alert in case the number of connections has increased significantly in the last hour. That can be used to detect brute-force or DDoS attack attempts. In this example, you’ll be notified if the number of connections has grown more than tenfold over the last hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_database_connections_average / aws_rds_database_connections_average offset 1h &amp;gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CPU
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Databases use CPU to run queries. If there are many concurrent, complex, or poorly optimized queries, the CPU usage can reach the limit of the running instance. This will result in a very high response time and possibly some time-outs.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_cpuutilization_average&lt;/code&gt; metric (which uses the CloudWatch &lt;code&gt;CPUUtilization&lt;/code&gt; metric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-CPU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-CPU.png" alt="chart showing the values for the aws_rds_cpuutilization_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the average CPU usage of the instance is higher than 95%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_cpuutilization_average &amp;gt; 0.95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;Storage is one of the most important parts of a database, since it’s where the data is held. Not having enough storage capacity will crash your database.&lt;/p&gt;

&lt;p&gt;Although setting up an auto-scaling strategy in AWS RDS is very easy, it could affect your infrastructure costs. That’s why you should be aware of the instance disk state.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_free_storage_space_average&lt;/code&gt; metric (which uses the &lt;code&gt;FreeStorageSpace&lt;/code&gt; CloudWatch metric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Storage.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-Storage.png" alt="chart showing the values for the aws_rds_free_storage_space_average metric"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the available storage is lower than 512 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_free_storage_space_average &amp;lt; 512*1024*1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apart from this PromQL query, you can go further by traveling to the future. How? Using the &lt;code&gt;predict_linear&lt;/code&gt; PromQL function to predict when you are going to run out of storage. You may remember this from when &lt;a href="https://sysdig.com/blog/cooking-iot-prometheus/" rel="noopener noreferrer"&gt;we used it to cook a ham&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This PromQL query will alert you if you’re going to run out of storage in the next 48 hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict_linear(aws_rds_free_storage_space_average[48h], 48 * 3600) &amp;lt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to dig deeper into PromQL functions, you can check our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noopener noreferrer"&gt;getting started PromQL CheatSheet&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read/Write Latency
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;In situations where there are queries returning a massive amount of data, the database will need to perform disk operations.&lt;/p&gt;

&lt;p&gt;Database disks normally have a low read/write latency, but issues can arise that result in high-latency operations. Monitoring this ensures the disk latency stays as low as expected.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_read_latency_average&lt;/code&gt; and &lt;code&gt;aws_rds_write_latency_average&lt;/code&gt; metrics (which use the &lt;code&gt;ReadLatency&lt;/code&gt; and &lt;code&gt;WriteLatency&lt;/code&gt; CloudWatch metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-read-write-latency.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-read-write-latency.png" alt="chart showing the values for the aws_rds_read_latency_average and aws_rds_write_latency_average metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create alerts to notify when the read or write latency is over 250ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_read_latency_average &amp;gt; 0.250
aws_rds_write_latency_average &amp;gt; 0.250
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Just 5? Let’s dig deeper with some bonus metrics!
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Network I/O
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;It doesn’t matter whether the database itself is working correctly if it can’t be reached from the outside. A misconfiguration or a malicious act from an attacker can result in losing connection to the instance.&lt;/p&gt;

&lt;p&gt;Learn how an attacker can &lt;a href="https://sysdig.com/blog/lateral-movement-cloud-containers/" rel="noopener noreferrer"&gt;infiltrate your cloud infrastructure and perform lateral movement&lt;/a&gt;. Also, learn how to prevent and detect such attacks.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_network_receive_throughput_average&lt;/code&gt; and &lt;code&gt;aws_rds_network_transmit_throughput_average&lt;/code&gt; metrics (which use the &lt;code&gt;NetworkReceiveThroughput&lt;/code&gt; and &lt;code&gt;NetworkTransmitThroughput&lt;/code&gt; CloudWatch metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-NetworkIO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-NetworkIO.png" alt="chart showing the values for the aws_rds_network_receive_throughput_average and aws_rds_network_transmit_throughput_average metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an alert if the network traffic is down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_network_receive_throughput_average = 0 AND aws_rds_network_transmit_throughput_average = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Read / Write IOPS
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why it matters
&lt;/h4&gt;

&lt;p&gt;The number of input/output operations per second (IOPS) available in the instance can be configured and is billed separately.&lt;/p&gt;

&lt;p&gt;Not having enough can affect the performance of your application, and having more than needed will have a negative impact on your infrastructure costs.&lt;/p&gt;

&lt;h4&gt;
  
  
  How to monitor and alert
&lt;/h4&gt;

&lt;p&gt;Use the &lt;code&gt;aws_rds_read_iops_average&lt;/code&gt; and &lt;code&gt;aws_rds_write_iops_average&lt;/code&gt; metrics (which use the &lt;code&gt;ReadIOPS&lt;/code&gt; and &lt;code&gt;WriteIOPS&lt;/code&gt; CloudWatch metrics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-iops.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-iops.png" alt="chart showing the values for the aws_rds_read_iops_average and aws_rds_write_iops_average metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create alerts if the read or write IOPS are greater than 2,500 operations per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_read_iops_average &amp;gt; 2500
aws_rds_write_iops_average &amp;gt; 2500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
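
&lt;p&gt;If your instance is provisioned with a combined IOPS budget, you can also alert on the sum of reads and writes. A sketch, where the 5,000 threshold is a hypothetical provisioned value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_rds_read_iops_average + aws_rds_write_iops_average &amp;gt; 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;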



&lt;h2&gt;
  
  
  What’s next: Install this dashboard in a few clicks
&lt;/h2&gt;

&lt;p&gt;In this article, we’ve learned how easy it is to monitor AWS RDS and identified the top five key metrics for monitoring AWS RDS, with examples.&lt;/p&gt;

&lt;p&gt;All these metrics are &lt;a href="https://promcat.io/apps/aws-rds" rel="noopener noreferrer"&gt;available in the dashboards you can download from PromCat&lt;/a&gt;. They can be used in Grafana and in Sysdig Monitor as well!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dashboards.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Ftop-5-rds-key-metrics-dashboards.png" alt="screenshot showing the dashboard page for the RDS configuration in PromCat.io, where you can download the dashboards for both Grafana and Sysdig Monitor!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These top key metrics will allow you to see the full picture when troubleshooting and performing improvements in your AWS RDS instance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you would like to try this integration, we invite you to &lt;a href="https://sysdig.com/company/start-free/" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; of Sysdig Monitor.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rds</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Getting started with PromQL – Includes Cheatsheet!</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Thu, 11 Mar 2021 16:48:17 +0000</pubDate>
      <link>https://dev.to/eckelon/getting-started-with-promql-includes-cheatsheet-3a1d</link>
      <guid>https://dev.to/eckelon/getting-started-with-promql-includes-cheatsheet-3a1d</guid>
      <description>&lt;p&gt;Getting started with PromQL can be challenging when you first arrive in the fascinating world of Prometheus. Since &lt;strong&gt;Prometheus stores data in a time-series data model&lt;/strong&gt;, queries in a Prometheus server are radically different from good old SQL.&lt;/p&gt;

&lt;p&gt;Understanding &lt;strong&gt;how data is managed in Prometheus&lt;/strong&gt; is key to learning how to write good, performant PromQL queries.&lt;/p&gt;

&lt;p&gt;This article will &lt;strong&gt;introduce you to the PromQL basics&lt;/strong&gt; and provide a cheat sheet you can download to dig deeper into Prometheus and PromQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dig.sysdig.com/c/pf-infographic-promql-cheatsheet?x=u_WFRi"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ZwZoiRx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Download-now.png" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How time-series databases work
&lt;/h2&gt;

&lt;p&gt;Time series are &lt;strong&gt;streams of values associated with a timestamp&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3JykVpSq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Time-Series.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3JykVpSq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Time-Series.png" width="880" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every time series is identified by its metric name and its labels, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodb_up{}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube_node_labels{cluster="aws-01", label_kubernetes_io_role="master"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the above example, you can see the metric name (&lt;code&gt;kube_node_labels&lt;/code&gt;) and the labels (&lt;code&gt;cluster&lt;/code&gt; and &lt;code&gt;label_kubernetes_io_role&lt;/code&gt;). Although normally this is how the metrics and labels are referenced, the name of the metric is actually a label too. The query above can also be written like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{__name__ = "kube_node_labels", cluster="aws-01", label_kubernetes_io_role="master"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are four types of metrics in Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gauges&lt;/strong&gt; are arbitrary values that can go up and down. For example, &lt;code&gt;mongodb_up&lt;/code&gt; tells us if the exporter has a connection to the MongoDB instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Counters&lt;/strong&gt; represent totalizers from the beginning of the exporter and usually have the &lt;code&gt;_total&lt;/code&gt; suffix. For example, &lt;code&gt;http_requests_total&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Histogram&lt;/strong&gt; samples observations, such as the request durations or response sizes, and counts them in configurable buckets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; works as a histogram and also calculates configurable quantiles.&lt;/li&gt;
&lt;/ul&gt;
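
&lt;p&gt;For instance, histograms are usually queried with the &lt;code&gt;histogram_quantile&lt;/code&gt; function. This sketch estimates the 95th percentile request duration over the last five minutes, assuming a conventional &lt;code&gt;http_request_duration_seconds_bucket&lt;/code&gt; histogram metric (the metric name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;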

&lt;h2&gt;
  
  
  Getting started with PromQL data selection
&lt;/h2&gt;

&lt;p&gt;Selecting data in PromQL is as easy as specifying the &lt;strong&gt;metric you want to get the data from&lt;/strong&gt;. In this example, we will use the metric &lt;code&gt;http_requests_total&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Imagine that we want to know the number of requests for the &lt;code&gt;/api&lt;/code&gt; path in the host &lt;code&gt;10.2.0.4&lt;/code&gt;. To do so, we will use the &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;path&lt;/code&gt; labels from that metric.&lt;/p&gt;

&lt;p&gt;We could run this PromQL query:&lt;/p&gt;

&lt;p&gt;http_requests_total{host="10.2.0.4", path="/api"}&lt;/p&gt;

&lt;p&gt;It would return the following data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;host&lt;/td&gt;
&lt;td&gt;path&lt;/td&gt;
&lt;td&gt;status_code&lt;/td&gt;
&lt;td&gt;value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;98&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;503&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;20&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;401&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every row in that table represents a series with the last available value. As &lt;code&gt;http_requests_total&lt;/code&gt; contains the number of requests made since the last counter restart, we see 98 successful requests.&lt;/p&gt;

&lt;p&gt;This is called an &lt;strong&gt;instant vector&lt;/strong&gt;: the latest value for every series at the moment specified by the query. Since samples are not taken at the exact query time, Prometheus selects the closest sample before the specified timestamp. If no time is specified, it returns the most recent available value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--96PXEwVV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Instant-Vector.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--96PXEwVV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Instant-Vector.png" alt="Graphic showing three-time series and the exact time the query took place, returning an instant vector with the nearest values" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, you can get an instant vector from another moment (i.e., from one day ago).&lt;/p&gt;

&lt;p&gt;To do so, you only need to add an &lt;code&gt;offset&lt;/code&gt;, like this:&lt;/p&gt;

&lt;p&gt;http_requests_total{host="10.2.0.4", path="/api", status_code="200"} offset 1d&lt;/p&gt;

&lt;p&gt;To obtain metric results within a time range, you need to indicate it between square brackets:&lt;/p&gt;

&lt;p&gt;http_requests_total{host="10.2.0.4", path="/api"}[10m]&lt;/p&gt;

&lt;p&gt;It would return something like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;host&lt;/td&gt;
&lt;td&gt;path&lt;/td&gt;
&lt;td&gt;status_code&lt;/td&gt;
&lt;td&gt;value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;641309@1614690905.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641314@1614690965.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641319@1614691025.502&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;641319 @1614690936.628&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641324 @1614690996.628&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;641329 @1614691056.628&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;401&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;368736 @1614690901.371&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;368737 @1614690961.372&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;368738 @1614691021.372&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The query returns multiple values for each time series; that’s because we asked for data within a time range. Thus, every value is associated with a timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is called a range vector&lt;/strong&gt;: all the values for every series within a range of timestamps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BWpVweBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Range-Vector.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BWpVweBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Range-Vector.png" alt="Graphic showing three time series and the time range the query took place, returning an range vector with the all the values inside the range" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started with PromQL aggregators and operators
&lt;/h2&gt;

&lt;p&gt;As you can see, the PromQL selectors help you obtain metrics data. But what if you want to get more sophisticated results?&lt;/p&gt;

&lt;p&gt;Imagine if we had the metric &lt;code&gt;node_cpu_cores&lt;/code&gt; with a &lt;code&gt;cluster&lt;/code&gt; label. We could, for example, sum the results, aggregating them by a particular label:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (cluster) (node_cpu_cores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This would return something like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cluster&lt;/td&gt;
&lt;td&gt;value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;foo&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bar&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this simple query, we can see that there are &lt;code&gt;100&lt;/code&gt; CPU cores for the cluster &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;50&lt;/code&gt; for the cluster &lt;code&gt;bar&lt;/code&gt;.&lt;/p&gt;
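
&lt;p&gt;Other aggregation operators follow the same pattern. For example, assuming there is one &lt;code&gt;node_cpu_cores&lt;/code&gt; series per node, this sketch counts the nodes in each cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (cluster) (node_cpu_cores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;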

&lt;p&gt;Furthermore, we can use arithmetic operators in our PromQL queries. For example, using the metric &lt;code&gt;node_memory_MemFree_bytes&lt;/code&gt;, which returns the amount of free memory in bytes, we could get that value in megabytes by using the division operator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_memory_MemFree_bytes / (1024 * 1024)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We could also get the percentage of free memory available by comparing the previous metric with &lt;code&gt;node_memory_MemTotal_bytes&lt;/code&gt;, which returns the total memory available in the node.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can also use it to create an alert in case there are nodes with less than 5% of free memory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100 &amp;lt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Getting started with PromQL functions
&lt;/h2&gt;

&lt;p&gt;PromQL offers a vast collection of functions we can use to get even more sophisticated results. Continuing with the previous example, we could use the &lt;code&gt;topk&lt;/code&gt; function to identify which two nodes have the highest free memory percentages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(2, (node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Prometheus not only gives us information from the past, &lt;strong&gt;but also the future&lt;/strong&gt;. The &lt;code&gt;predict_linear&lt;/code&gt; function predicts where the time series will be in the given amount of seconds. You may remember that we used this function to &lt;a href="https://sysdig.com/blog/cooking-iot-prometheus/"&gt;cook the perfect holiday ham&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Imagine that you want to know how much free disk space will be available in the next 24 hours. You could apply the &lt;code&gt;predict_linear&lt;/code&gt; function to last week’s results of the &lt;code&gt;node_filesystem_free_bytes&lt;/code&gt; metric, which returns the free disk space available. This lets you &lt;strong&gt;predict the free disk space&lt;/strong&gt;, in gigabytes, 24 hours from now, and alert if it is expected to fall below 100 GB:&lt;/p&gt;

&lt;pre&gt;predict_linear(node_filesystem_free_bytes[1w], 3600 * 24) / (1024 * 1024 * 1024) &amp;lt; 100&lt;/pre&gt;
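&lt;p&gt;For intuition, &lt;code&gt;predict_linear&lt;/code&gt; essentially fits a least-squares line through the samples in the range and extrapolates it forward. Here is a rough Python sketch of that idea; it is a simplification (the real PromQL function anchors the prediction at the query's evaluation time, which this sketch glosses over):&lt;/p&gt;

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation over (timestamp, value) samples.
    A simplified sketch of what PromQL's predict_linear computes."""
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    n = len(samples)
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in samples)
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = v_mean - slope * t_mean
    # Extrapolate from the last sample's timestamp.
    return slope * (ts[-1] + seconds_ahead) + intercept

# Free space shrinking by 10 units every minute: the line predicts 70
# one minute after the last sample.
print(predict_linear([(0, 100), (60, 90), (120, 80)], 60))  # 70.0
```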

&lt;p&gt;When working with Prometheus counters, the &lt;code&gt;rate&lt;/code&gt; function is pretty convenient. It calculates the per-second increase of a counter, accounting for resets and extrapolating at the edges of the range to provide better results.&lt;/p&gt;

&lt;p&gt;What if we need to create an alert when we haven’t received a request in the last 10 minutes? We couldn’t just use the &lt;code&gt;http_requests_total&lt;/code&gt; metric, because if the counter got reset during the time range, the results wouldn’t be accurate.&lt;/p&gt;

&lt;pre&gt;http_requests_total[10m]&lt;/pre&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;th&gt;path&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;100@1614690905.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;300@1614690965.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;50@1614691025.502&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the example above, the counter got reset, so the raw values drop from &lt;code&gt;300&lt;/code&gt; to &lt;code&gt;50&lt;/code&gt;, which would look like a negative increase. Using just this metric wouldn’t be enough. Here is where the &lt;code&gt;rate&lt;/code&gt; function comes to the rescue. Since it accounts for resets, it treats the series as if the values were like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;th&gt;path&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;100@1614690905.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;300@1614690965.515&lt;/code&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;350@1614691025.502&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;pre&gt;rate(http_requests_total[10m])&lt;/pre&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;th&gt;path&lt;/th&gt;
&lt;th&gt;status_code&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http_requests_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.2.0.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/api&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.83&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Regardless of the resets, there were 0.83 requests per second, averaged over the last 10 minutes. Now we can configure the desired alert:&lt;/p&gt;

&lt;pre&gt;rate(http_requests_total[10m]) == 0&lt;/pre&gt;
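&lt;p&gt;The reset handling described above can be sketched in a few lines of Python. This is an illustration only, not Prometheus's exact algorithm (real &lt;code&gt;rate&lt;/code&gt; also extrapolates at the edges of the window):&lt;/p&gt;

```python
def counter_increase(samples):
    """Total increase of a Prometheus-style counter, accounting for
    resets (a sample lower than its predecessor)."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter was reset: assume it restarted from zero, so the
            # whole current value counts as new increase.
            increase += curr
    return increase

def simple_rate(samples, window_seconds):
    """Per-second rate over the window (no edge extrapolation)."""
    return counter_increase(samples) / window_seconds

# The series from the example above, 100 -> 300 -> 50 (reset),
# is treated as 100 -> 300 -> 350: an increase of 250.
print(counter_increase([100, 300, 50]))  # 250.0
```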

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;In this article, we learned how Prometheus stores data and how to &lt;strong&gt;start selecting and aggregating data with PromQL&lt;/strong&gt; examples.&lt;/p&gt;

&lt;p&gt;You can download the PromQL Cheatsheet to &lt;strong&gt;learn more PromQL operators, aggregations, and functions,&lt;/strong&gt; as well as examples. You can also try all the examples in our &lt;a href="https://learn.sysdig.com/promql-playground"&gt;Prometheus playground&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dig.sysdig.com/c/pf-infographic-promql-cheatsheet?x=u_WFRi"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ZwZoiRx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/Blog-Images-Getting-started-with-PromQL-Download-now.png" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can also try the Sysdig Monitor Free 30-day Trial, since Sysdig Monitor is fully compatible with Prometheus. You’ll &lt;a href="https://sysdig.com/company/free-trial-platform/"&gt;get started in just a few minutes&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>prometheus</category>
      <category>promql</category>
      <category>monitoring</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>How to monitor AWS SQS with Prometheus</title>
      <dc:creator>JA Samitier</dc:creator>
      <pubDate>Fri, 05 Feb 2021 11:13:21 +0000</pubDate>
      <link>https://dev.to/eckelon/https-sysdig-com-blog-monitor-aws-sqs-prometheus-4gg0</link>
      <guid>https://dev.to/eckelon/https-sysdig-com-blog-monitor-aws-sqs-prometheus-4gg0</guid>
      <description>&lt;p&gt;Article by &lt;a href="https://sysdig.com/blog/monitor-aws-sqs-prometheus/" rel="noopener noreferrer"&gt;David de Torres&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In this article, we will explain how to monitor AWS SQS with Prometheus. To do so, we will leverage the data offered by CloudWatch, exporting the metrics to Prometheus with the YACE exporter (&lt;a href="https://github.com/ivx/yet-another-cloudwatch-exporter" rel="noopener noreferrer"&gt;Yet Another CloudWatch Exporter&lt;/a&gt;). Finally, we will dive into what to monitor and what to alert on.&lt;/p&gt;

&lt;p&gt;AWS SQS (Simple Queue Service) has gained popularity as a way to communicate and decouple asynchronous applications, specifically for its easy integration with AWS Lambda functions.&lt;/p&gt;

&lt;p&gt;Having two decoupled applications allows you to implement and scale both ends independently. To achieve this decoupling, the system must allow applications to produce and process messages at different rates. Any bottleneck can cause messages to not be processed on time and hurt the overall performance of the system.&lt;/p&gt;

&lt;p&gt;You need to monitor AWS SQS queues closely to find bottleneck situations, properly scale the producers and consumers of messages, and detect errors as soon as possible.&lt;/p&gt;

&lt;p&gt;But how do you monitor a managed service like this one?&lt;/p&gt;

&lt;p&gt;And can you monitor it from the same place you monitor your entire infrastructure?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-01.png" alt="It is possible to monitor AWS SQS next to your cloud-native infrastructure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The relevant metrics for this service are all available in AWS CloudWatch. You can consult them via the web interface or through the API. To check these metrics from your Prometheus-compatible monitoring solution, you can use a Prometheus exporter.&lt;/p&gt;

&lt;p&gt;Let's now dig into how SQS works in detail, how to monitor it with Prometheus, and what key metrics you should keep an eye on.&lt;/p&gt;


&lt;h2&gt;
  
  
  How do AWS SQS queues work?
&lt;/h2&gt;

&lt;p&gt;Let's establish common ground on how AWS SQS queues work, making it easier to later identify what's important to monitor and alert on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-02.png" alt="The AWS SQS producer send messages to the queue. The delayed messages wait for a bit and the rest are visible. When a receiver procesed a message it becomes invisible until it is processed and removed from the queue."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the workflow of a message in a SQS queue:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The message is created by a producer service and sent to the SQS queue.&lt;/li&gt;
&lt;li&gt; The message becomes visible in the queue to all possible receivers. This step may not be immediate: for example, if you configure a delay on the message, it will stay in the queue in a delayed state and will not be available to receivers until the delay expires.&lt;/li&gt;
&lt;li&gt; One of the possible receivers polls the messages of the SQS queue. This operation retrieves the visible messages from the queue and switches them to an &lt;code&gt;invisible&lt;/code&gt; state, but does not delete them. This keeps other receivers from getting those messages if they execute a new poll.&lt;/li&gt;
&lt;li&gt; When the receiver finishes processing the message, it explicitly removes it from the queue.&lt;/li&gt;
&lt;/ol&gt;
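&lt;p&gt;The four steps above can be captured in a tiny in-memory simulation. This is purely illustrative Python showing the visibility semantics, not the real SQS service or the boto3 API:&lt;/p&gt;

```python
import time

class FakeSqsQueue:
    """A minimal in-memory sketch of SQS visibility semantics."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.messages = []  # each: {"body": ..., "invisible_until": ts}

    def send(self, body, delay=0):
        # Step 1-2: a delayed message stays hidden until the delay expires.
        self.messages.append({"body": body,
                              "invisible_until": time.time() + delay})

    def poll(self):
        """Step 3: return visible messages and hide them for the
        visibility timeout, without deleting them."""
        now = time.time()
        batch = [m for m in self.messages if m["invisible_until"] <= now]
        for m in batch:
            m["invisible_until"] = now + self.visibility_timeout
        return batch

    def delete(self, message):
        # Step 4: the receiver must delete explicitly once it is done.
        self.messages.remove(message)

q = FakeSqsQueue()
q.send("hello")
batch = q.poll()        # message is now invisible to other receivers
assert q.poll() == []   # a second poll sees nothing
q.delete(batch[0])      # processed: remove it from the queue
assert q.messages == []
```

If a receiver never called `delete`, the message would become visible again once `invisible_until` passed, which is exactly the re-delivery behavior described next.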

&lt;p&gt;Now, what happens if a receiver processes a message but does not remove it from the queue?&lt;/p&gt;

&lt;p&gt;After a configurable delay, the message is marked again as visible so other receivers can get the message and process it.&lt;/p&gt;

&lt;p&gt;Wow, that sounds interesting. If you get a message that generates an error in the receiver, shortly after, another receiver will get that message and process it again.&lt;/p&gt;

&lt;p&gt;And what if that other receiver also suffers an error? And all of the others after that?&lt;/p&gt;

&lt;p&gt;That's a tricky question.&lt;/p&gt;

&lt;p&gt;To prevent these old messages from piling up in the queue and recurrently appearing in the polls, AWS SQS allows you to configure another queue as a dead-letter queue. A dead-letter queue is where messages end up after being polled a certain number of times. This helps developers and site operation engineers detect these messages and treat them appropriately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor AWS SQS with Prometheus metrics
&lt;/h2&gt;

&lt;p&gt;Now that we understand how SQS queues work, let's see how we can get metrics to address all of the possible situations that we can find while working with them.&lt;/p&gt;

&lt;p&gt;AWS SQS emits &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;certain metrics&lt;/a&gt; that can be gathered by the CloudWatch service under the namespace AWS/SQS. We'll now see how to extract those metrics to be able to monitor AWS SQS with Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/opensource/prometheus/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is a leading open source monitoring solution, which provides means to easily create integrations by &lt;a href="https://prometheus.io/docs/instrumenting/exporters/" rel="noopener noreferrer"&gt;writing exporters&lt;/a&gt;. With Prometheus, you can gather metrics from your whole infrastructure which may be spread across multiple cloud providers, following a &lt;em&gt;single-pane-of-glass&lt;/em&gt; approach.&lt;/p&gt;

&lt;p&gt;Prometheus exporters gather metrics from services and publish them in a standardized format that both a Prometheus server and the Sysdig Agent can scrape natively. We will use one of these exporters, specifically the YACE exporter (&lt;a href="https://github.com/ivx/yet-another-cloudwatch-exporter" rel="noopener noreferrer"&gt;Yet Another CloudWatch Exporter&lt;/a&gt;), to get metrics from AWS CloudWatch. &lt;a href="https://sysdig.com/blog/improving-prometheus-cloudwatch-exporter" rel="noopener noreferrer"&gt;We contributed to this exporter to make it more efficient and reliable&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this use case, we will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Deploy the CloudWatch exporter in a Kubernetes cluster.&lt;/li&gt;
&lt;li&gt; Configure it to gather metrics of SQS in AWS.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This exporter will be conveniently annotated with Prometheus tags, so both a Prometheus server and the Sysdig agent can scrape it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-03.png" alt="AWS SQS metrics are available in CloudWatch. The Prometheus Exporter polls them through the CloudWatch API and makes them available in Prometheus format for Prometheus Servers and Sysdig Agents."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and configuring Prometheus CloudWatch exporter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Setting up permissions to access CloudWatch metrics
&lt;/h3&gt;

&lt;p&gt;The exporter will connect to the AWS CloudWatch API and pull the metrics, but to get them we need to grant the right permissions.&lt;/p&gt;

&lt;p&gt;First, you will need to create an AWS IAM policy that contains the following permissions:&lt;/p&gt;

&lt;pre&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudWatchExporterPolicy",
            "Effect": "Allow",
            "Action": [
                "tag:GetResources",
                "cloudwatch:ListTagsForResource",
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        }
    ]
}&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Configuring the AWS IAM policy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You will also need to supply the credentials for an AWS IAM account to the CloudWatch exporter. This can be done in a standard manner, in the &lt;code&gt;$HOME/.aws/credentials&lt;/code&gt; file.&lt;/p&gt;

&lt;pre&gt;# CREDENTIALS FOR AWS ACCOUNT
[default]
aws_region = us-east-1
aws_access_key_id = AKIAQ33BWUG3BLXXXXX
aws_secret_access_key = bXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Configuring the AWS IAM account in the $HOME/.aws/credentials file.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can either assign the IAM policy directly to the IAM account or to an IAM role to grant the permissions to the exporter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the exporter
&lt;/h3&gt;

&lt;p&gt;The YACE exporter has images for its stable version ready to be deployed in Kubernetes. So, we just need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Specify what to scrape from CloudWatch in a &lt;code&gt;config.yml&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt; Create a deployment file.&lt;/li&gt;
&lt;li&gt; Deploy in a Kubernetes cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's focus on the configuration file. Here, you'll define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Which metrics the exporter will scrape.&lt;/li&gt;
&lt;li&gt;  From which region.&lt;/li&gt;
&lt;li&gt;  What dimensions you’ll ask CloudWatch to make the aggregations with.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example &lt;code&gt;config.yml&lt;/code&gt; configuration file:&lt;/p&gt;

&lt;pre&gt;discovery:
  jobs:
  - regions: 
    - us-east-1
    type: sqs
    enableMetricData: true
    metrics: 
      - name: ApproximateAgeOfOldestMessage
        statistics:
        - Maximum
        period: 300
        length: 3600
      - name: ApproximateNumberOfMessagesDelayed
        statistics:
        - Average
        period: 300
        length: 3600
      - name: ApproximateNumberOfMessagesNotVisible
        statistics:
        - Average
        period: 300
        length: 3600
      - name: ApproximateNumberOfMessagesVisible
        statistics:
        - Average
        period: 300
        length: 3600
      - name: NumberOfEmptyReceives
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: NumberOfMessagesDeleted
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: NumberOfMessagesReceived
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: NumberOfMessagesSent
        statistics:
        - Sum
        period: 300
        length: 3600
      - name: SentMessageSize
        statistics:
        - Average
        - Sum
        period: 300
        length: 3600&lt;/pre&gt;

&lt;p&gt;Please be aware of the following caveats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; If you wish to &lt;strong&gt;add an additional metric&lt;/strong&gt;, be sure to read up on &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;AWS SQS metrics&lt;/a&gt; to use the correct statistic.&lt;/li&gt;
&lt;li&gt; CloudWatch offers &lt;strong&gt;aggregations by different dimensions&lt;/strong&gt;. For SQS, the YACE exporter automatically selects &lt;code&gt;QueueName&lt;/code&gt; as the default dimension to aggregate the metrics by.&lt;/li&gt;
&lt;li&gt; Gathering CloudWatch metrics may incur a certain &lt;strong&gt;cost to the AWS bill&lt;/strong&gt;. Be sure to check the AWS Documentation on CloudWatch Service Quota limits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last step is to actually deploy the YACE exporter. To make things easier, you can put the IAM account credentials and the configuration in a file, like this:&lt;/p&gt;

&lt;pre&gt;apiVersion: v1
kind: Namespace
metadata:
  name: yace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yace-sqs
  namespace: yace
spec:
  selector:
    matchLabels:
      app: yace-sqs
  replicas: 1
  template:
    metadata:
      labels:
        app: yace-sqs
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5000"
    spec:
      containers:
      - name: yace
        image: quay.io/invisionag/yet-another-cloudwatch-exporter:v0.21.0-alpha
        ports:
        - containerPort: 5000
        volumeMounts:
          - name: yace-sqs-config
            mountPath: /tmp/config.yml
            subPath: config.yml
          - name: yace-sqs-credentials
            mountPath: /exporter/.aws/credentials
            subPath: credentials
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
      volumes:
        - configMap:
            defaultMode: 420
            name: yace-sqs-config
          name: yace-sqs-config
        - secret:
            defaultMode: 420
            secretName: yace-sqs-credentials
          name: yace-sqs-credentials
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: yace-sqs-config
  namespace: yace
data:
  config.yml: |
    discovery:
      jobs:
      - regions: 
        - us-east-1
        type: sqs
        enableMetricData: true
        metrics: 
          - name: ApproximateAgeOfOldestMessage
            statistics:
            - Maximum
            period: 300
            length: 3600
          - name: ApproximateNumberOfMessagesDelayed
            statistics:
            - Average
            period: 300
            length: 3600
          - name: ApproximateNumberOfMessagesNotVisible
            statistics:
            - Average
            period: 300
            length: 3600
          - name: ApproximateNumberOfMessagesVisible
            statistics:
            - Average
            period: 300
            length: 3600
          - name: NumberOfEmptyReceives
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: NumberOfMessagesDeleted
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: NumberOfMessagesReceived
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: NumberOfMessagesSent
            statistics:
            - Sum
            period: 300
            length: 3600
          - name: SentMessageSize
            statistics:
            - Average
            - Sum
            period: 300
            length: 3600
---
apiVersion: v1
kind: Secret
metadata:
  name: yace-sqs-credentials
  namespace: yace
data:
  # Add in credentials the result of:
  # cat ~/.aws/credentials | base64
  credentials: |
    XXX&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;Note that leaving your AWS credentials inside a deployment file is not the safest option. You should use a secrets store instead, but the example was simplified to keep the focus.&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In this file, we can find:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;namespace: yace&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; The &lt;strong&gt;&lt;code&gt;kind: Deployment&lt;/code&gt;&lt;/strong&gt; with the exporter. Note the &lt;code&gt;annotations:&lt;/code&gt; with the Prometheus tags for scraping, and the scraping port. This deployment also has two volumes: one with the configuration file, and another with the credentials.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;kind: ConfigMap&lt;/code&gt;&lt;/strong&gt; with the contents of the config.yml file.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;kind: Secret&lt;/code&gt;&lt;/strong&gt; with the credentials of the IAM account.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, you just need to deploy like you usually do:&lt;/p&gt;

&lt;pre&gt;kubectl apply -f deploy.yaml&lt;/pre&gt;

&lt;p&gt;Is it working?&lt;/p&gt;

&lt;p&gt;Let's do a quick test by sending an HTTP request to the exporter port. You can use a web browser or curl in a console. As we set port &lt;code&gt;5000&lt;/code&gt; in our example pod &lt;code&gt;yace-sqs&lt;/code&gt;, we would do (replacing the placeholder with the pod's IP):&lt;/p&gt;

&lt;pre&gt;curl http://&amp;lt;pod-ip&amp;gt;:5000/metrics&lt;/pre&gt;

&lt;p&gt;If everything is OK, you should see metrics like the following (output truncated due to size):&lt;/p&gt;

&lt;pre&gt;# HELP aws_sqs_approximate_age_of_oldest_message_maximum Help is not implemented yet.
# TYPE aws_sqs_approximate_age_of_oldest_message_maximum gauge
aws_sqs_approximate_age_of_oldest_message_maximum{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 2
# HELP aws_sqs_approximate_number_of_messages_delayed_average Help is not implemented yet.
# TYPE aws_sqs_approximate_number_of_messages_delayed_average gauge
aws_sqs_approximate_number_of_messages_delayed_average{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 3
# HELP aws_sqs_approximate_number_of_messages_not_visible_average Help is not implemented yet.
# TYPE aws_sqs_approximate_number_of_messages_not_visible_average gauge
aws_sqs_approximate_number_of_messages_not_visible_average{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 12
# HELP aws_sqs_approximate_number_of_messages_visible_average Help is not implemented yet.
# TYPE aws_sqs_approximate_number_of_messages_visible_average gauge
aws_sqs_approximate_number_of_messages_visible_average{dimension_QueueName="queue_01",name="arn:aws:sqs:us-east-1:029747528706:queue_01",region="us-east-1"} 1&lt;/pre&gt;
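&lt;p&gt;To sanity-check this output programmatically, you could parse the exposition-format lines. Here is a small, hypothetical Python helper; the regex only covers the simple labeled lines shown above, not the full Prometheus exposition format:&lt;/p&gt;

```python
import re

# Matches lines like: metric_name{label="value",...} 42
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metric_line(line):
    """Extract (name, labels, value) from one labeled metric line,
    or None for comments (# HELP / # TYPE) and other lines."""
    m = LINE_RE.match(line)
    if not m:
        return None
    name, labels_raw, value = m.groups()
    labels = dict(LABEL_RE.findall(labels_raw))
    return name, labels, float(value)

line = ('aws_sqs_approximate_number_of_messages_visible_average'
        '{dimension_QueueName="queue_01",region="us-east-1"} 1')
name, labels, value = parse_metric_line(line)
print(labels["dimension_QueueName"], value)  # queue_01 1.0
```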

&lt;h2&gt;
  
  
  Monitoring AWS SQS: What to look for?
&lt;/h2&gt;

&lt;p&gt;AWS SQS queues have a simple design, so there isn't much to monitor. However, depending on how you are using them, you will want to monitor a different set of metrics.&lt;/p&gt;

&lt;p&gt;Let's explore some scenarios and their relevant metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple producer-consumer
&lt;/h3&gt;

&lt;p&gt;For this approach, we will consider that you only have producers and consumers processing the messages. We will not cover delayed messages or dead-letter queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visible messages&lt;/strong&gt;: This metric will give you information about the &lt;strong&gt;saturation of the system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Visible messages are the ones that are ready to be processed, but not yet polled and deleted by a receiver. This is a good indicator of how many pending messages you have in the queue.&lt;/p&gt;

&lt;p&gt;The metric that offers this information is &lt;code&gt;aws_sqs_approximate_number_of_messages_visible_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-04.png" alt="A PromQL dashboard panel showing a spike on the visible messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not visible messages&lt;/strong&gt;: This metric is a good indicator of the &lt;strong&gt;messages that are being processed&lt;/strong&gt; at each moment.&lt;/p&gt;

&lt;p&gt;Not visible messages are the ones that have been polled by a receiver but have not yet been deleted.&lt;/p&gt;

&lt;p&gt;The metric that offers this information is &lt;code&gt;aws_sqs_approximate_number_of_messages_not_visible_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-05.png" alt="A PromQL dashboard panel showing a spike on the not visible messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleted messages:&lt;/strong&gt; This metric is a good indicator of the number of messages actually processed by the receivers.&lt;/p&gt;

&lt;p&gt;Remember, when a receiver processes a message, it manually deletes the message from the queue.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_messages_deleted_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-06.png" alt="A PromQL dashboard panel showing a spike on the deleted messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Received messages&lt;/strong&gt;: The received messages count how many messages went out of the queue. Take into account that a message can be received by a consumer several times if it was not deleted from the queue.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_messages_received_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-07.png" alt="A PromQL dashboard panel showing a spike on the received messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty receives&lt;/strong&gt;: This metric allows you to detect how many empty requests have been made in order to optimize the way your application makes the requests.&lt;/p&gt;

&lt;p&gt;Amazon bills SQS based on the number of requests made. Each poll is one request, and each request can retrieve 1 to 10 messages with a maximum total payload of 256 KB.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_empty_receives_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-08.png" alt="A PromQL dashboard panel showing a spike on the empty receives."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To optimize the billing, you can either reduce the request frequency or use &lt;code&gt;long polling&lt;/code&gt;. With long polling, a single request waits up to 10 seconds and returns the visible messages, plus any messages that arrive during that window, reducing the number of requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring a producer that can delay messages
&lt;/h3&gt;

&lt;p&gt;If you can estimate the time needed to process the messages, the producer can add a delay to the messages. Leaving time between messages can help alleviate possible bottlenecks caused by a high number of messages sent at the same time.&lt;/p&gt;

&lt;p&gt;Some extra metrics worth tracking in this scenario are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delayed messages&lt;/strong&gt;: This indicator can help you scale the number of receivers up or down to match the workload coming in the next few minutes.&lt;/p&gt;

&lt;p&gt;You can have the number of messages delayed in the queue with the metric &lt;code&gt;aws_sqs_approximate_number_of_messages_delayed_average&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If your producers are deployed in Kubernetes, you can use the &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;Kubernetes horizontal pod autoscaler (HPA)&lt;/a&gt; and the &lt;a href="https://github.com/DirectXMan12/k8s-prometheus-adapter" rel="noopener noreferrer"&gt;Prometheus Adapter&lt;/a&gt; to adjust the number of pods depending on the value of this metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total number of messages in the queue&lt;/strong&gt;: The number of messages gives you an idea of the occupancy and saturation of the pipeline.&lt;/p&gt;

&lt;p&gt;To estimate the number of messages that the senders have produced and that are still waiting to be processed, you can sum the delayed, visible (ready to be delivered to receivers), and not visible (currently being processed) messages. If message processing were immediate, this sum would be zero.&lt;/p&gt;

&lt;p&gt;The PromQL query that produces this value is:&lt;/p&gt;

&lt;pre&gt;aws_sqs_approximate_number_of_messages_delayed_average + aws_sqs_approximate_number_of_messages_not_visible_average + aws_sqs_approximate_number_of_messages_visible_average&lt;/pre&gt;

&lt;h3&gt;
  
  
  Dealing with dead-letter queues
&lt;/h3&gt;

&lt;p&gt;When dealing with dead-letter queues, it is important to detect when a message arrives in the queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sent messages&lt;/strong&gt;: This can give you an idea of the number of errors, or messages that the receivers could not process.&lt;/p&gt;

&lt;p&gt;The metric that gives this information is &lt;code&gt;aws_sqs_number_of_messages_sent_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-09.png" alt="A PromQL dashboard panel showing a spike on the sent messages."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring AWS SQS: What to alert?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;High number of messages in queue for a long time&lt;/strong&gt;: The total number of messages in the queue is an indicator of the saturation of the pipeline. You can set a threshold (e.g., 100 messages) and trigger an alert if the number of messages stays above it for an extended period of time.&lt;/p&gt;

&lt;pre&gt;(aws_sqs_approximate_number_of_messages_delayed_average + aws_sqs_approximate_number_of_messages_not_visible_average + aws_sqs_approximate_number_of_messages_visible_average) &amp;gt; 100&lt;/pre&gt;

&lt;p&gt;This alert can also detect messages that are recurrently sent back to the visible state if a dead-letter queue is not configured.&lt;/p&gt;
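&lt;p&gt;If you manage your own Prometheus rules, this condition can be written as an alerting rule. A minimal sketch; the rule name, duration, and labels are illustrative:&lt;/p&gt;

&lt;pre&gt;groups:
  - name: sqs-alerts
    rules:
      - alert: SQSQueueBacklog
        expr: (aws_sqs_approximate_number_of_messages_delayed_average + aws_sqs_approximate_number_of_messages_not_visible_average + aws_sqs_approximate_number_of_messages_visible_average) &amp;gt; 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: More than 100 messages in the queue for 15 minutes&lt;/pre&gt;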

&lt;p&gt;&lt;strong&gt;Oldest message in queue&lt;/strong&gt;: This metric gives the age of the oldest message in the queue, which is a good indicator of the maximum latency of the pipeline. This alert triggers when the maximum age is higher than five minutes (&lt;code&gt;300&lt;/code&gt; seconds; adjust the threshold as needed).&lt;/p&gt;

&lt;pre&gt;aws_sqs_approximate_age_of_oldest_message_maximum &amp;gt; 300&lt;/pre&gt;

&lt;p&gt;For this alert to work properly, make sure to configure a dead-letter queue to prevent messages from repeatedly returning to the visible state.&lt;/p&gt;
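&lt;p&gt;A dead-letter queue is attached to a source queue through its redrive policy. As a sketch with the AWS CLI (the queue URL and target ARN are placeholders; &lt;code&gt;maxReceiveCount&lt;/code&gt; controls how many times a message can be received before it is moved to the dead-letter queue):&lt;/p&gt;

&lt;pre&gt;aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\": \"arn:aws:sqs:us-east-1:123456789012:dead-letter-my-queue\", \"maxReceiveCount\": \"5\"}"}'&lt;/pre&gt;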

&lt;p&gt;&lt;strong&gt;Recurring empty receives&lt;/strong&gt;: You can detect when your application repeatedly tries to fetch new messages from an empty queue. This can help you adjust your polling frequency or the number of receivers to lower your infrastructure costs.&lt;/p&gt;

&lt;pre&gt;aws_sqs_number_of_empty_receives_sum &amp;gt; 0&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Received message in a dead-letter queue&lt;/strong&gt;: You can detect if a new message has arrived in a dead-letter queue by alerting on the sent messages metric. To filter on the dead-letter queues, you can use a naming convention, such as prefixing your dead-letter queue names with &lt;code&gt;dead-letter-&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This PromQL query will alert you when a message arrives in any of your dead-letter queues:&lt;/p&gt;

&lt;pre&gt;aws_sqs_number_of_messages_sent_sum{dimension_QueueName=~"dead-letter-.+"} &amp;gt; 0&lt;/pre&gt;

&lt;h2&gt;
  
  
  Getting the CloudWatch metrics into Sysdig Monitor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sysdig agent setup
&lt;/h3&gt;

&lt;p&gt;To scrape metrics using the Sysdig agent:&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;yace&lt;/code&gt; Deployment, remember to include the Prometheus &lt;code&gt;annotations&lt;/code&gt; that configure the port of the exporter as a scraping port.&lt;/p&gt;
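&lt;p&gt;For reference, those annotations on the pod template could look like the following sketch (&lt;code&gt;yace&lt;/code&gt; listens on port &lt;code&gt;5000&lt;/code&gt; by default; adjust the port and path if you changed them):&lt;/p&gt;

&lt;pre&gt;spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5000"
        prometheus.io/path: "/metrics"&lt;/pre&gt;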

&lt;p&gt;Also, in the Sysdig Agent configuration, make sure to have these lines of configuration that enable the scraping of containers with Prometheus annotations.&lt;/p&gt;

&lt;pre&gt;process_filter:
  - include:
      kubernetes.pod.annotation.prometheus.io/scrape: true
      conf:
        path: "{kubernetes.pod.annotation.prometheus.io/path}"
        port: "{kubernetes.pod.annotation.prometheus.io/port}"&lt;/pre&gt;

&lt;h3&gt;
  
  
  Monitoring AWS SQS with dashboard and alerts
&lt;/h3&gt;

&lt;p&gt;Once you have SQS metrics in Sysdig Monitor, you can use the AWS SQS dashboard to get a full overview of your queues. In the dashboard, you can filter by cluster and select as many SQS queues as needed. This is especially useful when you need to correlate an SQS queue with its dead-letter queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FHow-to-monitor-AWS-SQS-with-Prometheus-10.png" alt="A PromQL dashboard example showing all the metrics explained."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;em&gt;Sysdig Monitor: AWS SQS dashboard&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;In &lt;a href="https://promcat.io/apps/aws-sqs" rel="noopener noreferrer"&gt;PromCat.io&lt;/a&gt;, you can find instructions on how to install the exporter, along with ready-to-use configurations to monitor AWS SQS. There, you will also find the dashboards that we presented in both Grafana and Sysdig format, as well as examples of alerts for your services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You can monitor AWS SQS in the same place you monitor the rest of your cloud-native infrastructure. Because Prometheus offers a standardized interface, you can leverage existing exporters to ingest CloudWatch metrics as Prometheus metrics.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you would like to try this integration, we invite you to sign up for a &lt;a href="https://aws.amazon.com/marketplace/pp/B08DL3X2FV?ref_=srh_res_product_title" rel="noopener noreferrer"&gt;free trial in Sysdig Essentials directly from the AWS marketplace.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can find out more about our Prometheus integration in our documentation or by reading &lt;a href="https://sysdig.com/blog/improving-prometheus-cloudwatch-exporter/" rel="noopener noreferrer"&gt;our blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sqs</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
