<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefan Verkerk</title>
    <description>The latest articles on DEV Community by Stefan Verkerk (@aksie).</description>
    <link>https://dev.to/aksie</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F73219%2F8bb0c128-8d37-45b9-9725-2e632f0e32ab.jpg</url>
      <title>DEV Community: Stefan Verkerk</title>
      <link>https://dev.to/aksie</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aksie"/>
    <language>en</language>
    <item>
      <title>How AppSignal Monitors Their Own Kafka Brokers</title>
      <dc:creator>Stefan Verkerk</dc:creator>
      <pubDate>Thu, 30 Jul 2020 15:35:53 +0000</pubDate>
      <link>https://dev.to/appsignal/how-appsignal-monitors-their-own-kafka-brokers-g65</link>
      <guid>https://dev.to/appsignal/how-appsignal-monitors-their-own-kafka-brokers-g65</guid>
      <description>&lt;p&gt;Today, we dip our toes into collecting custom metrics with a standalone agent. We'll be taking our own Kafka brokers and using the StatsD protocol to get the metrics into AppSignal. This post is for those with some experience in using monitoring tools, and who want to take monitoring to every corner of their architecture, or want to add their own metrics to their monitoring setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Standalone Agent
&lt;/h2&gt;

&lt;p&gt;AppSignal for Ruby, Elixir, and Node.js gives you all-in-one monitoring out of the box: errors, performance, host metrics, custom metrics, and dashboards.&lt;/p&gt;

&lt;p&gt;By having all relevant metrics collected and sent to AppSignal, all the monitoring data is in one place. That is one of the things we think is crucial about good monitoring:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You need to have errors, performance, and host metrics in one place, and you need as much of your architecture as possible monitored in that same place.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That way you can triage from all angles. For example, you might spot a performance issue in a background job, see that it is caused by a long wait on the DB server, and then find out that the wait was caused by network issues on the host the database runs on.&lt;/p&gt;

&lt;p&gt;This works out-of-the-box in a Ruby, Elixir, or Node.js app. If part of your architecture runs on systems without any Ruby, Elixir, or Node.js, you can add their metrics to AppSignal as well using the standalone agent. For example, when you have certain workers doing things in Rust, or when you have a standalone database server.&lt;/p&gt;

&lt;h2&gt;
  
  
  How We Do In-house Monitoring
&lt;/h2&gt;

&lt;p&gt;We use this setup to monitor our Kafka brokers. &lt;/p&gt;

&lt;p&gt;One of the things we need to keep a close eye on is available retention in Kafka. There are some hard limits on disk space and we need to make sure we can achieve the required retention. Otherwise, we need to expand our cluster.&lt;/p&gt;

&lt;p&gt;Because the requests we monitor for our customers can have a lot of metadata added to them, Kafka's topics can be huge. To make sure things don't go astray because a host's disk fills up completely, we've dimensioned our Kafka setup by disk usage, which we monitor as well. 😉&lt;/p&gt;

&lt;p&gt;Disk usage is easy to monitor because the standalone AppSignal agent will automatically send host metrics to AppSignal right out of the box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.appsignal.com%2Fimages%2Fblog%2F2020-03%2FHost_metrics.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.appsignal.com%2Fimages%2Fblog%2F2020-03%2FHost_metrics.png" alt="Kafka host metrics on AppSignal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can actually see in this 30-day resolution graph when we did a clean-up of that disk, way before things went wrong 😁&lt;/p&gt;

&lt;p&gt;That leaves retention. Retention is not available in Kafka's standard metrics, so we have to find another solution. We monitor retention by looking at the first indexed timestamp of each partition/topic on the broker itself. If that is shorter than a set number of hours or days, we'd like to be notified, as that might impact our customers during an outage.&lt;/p&gt;

&lt;p&gt;We do so using our Kafka monitoring tool (called Watchman). It is written in Rust and uses a (very simple) StatsD client to send data to the &lt;a href="https://docs.appsignal.com/standalone-agent/installation.html" rel="noopener noreferrer"&gt;standalone AppSignal Agent&lt;/a&gt; running on each broker.&lt;/p&gt;

&lt;p&gt;In our Watchman process, we send these retention metrics to the AppSignal agent that listens for StatsD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;    &lt;span class="n"&gt;statsd_client&lt;/span&gt;&lt;span class="nf"&gt;.send_gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic_retention_hours"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;now_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;3600.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic:{},partition:{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;   &lt;span class="n"&gt;statsd_client&lt;/span&gt;&lt;span class="nf"&gt;.send_gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic_retention_short"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ms_short&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"topic:{},partition:{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
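&lt;p&gt;For reference, a StatsD gauge is just a short line of text sent as a UDP datagram. Here is a minimal sketch of what such a client puts on the wire, in Python for readability. It assumes the agent accepts DogStatsD-style &lt;code&gt;|#key:value&lt;/code&gt; tags; the topic name and port are made-up examples, not values from our setup.&lt;/p&gt;

```python
import socket

def format_gauge(name, value, tags):
    # StatsD gauge wire format: "<name>:<value>|g", with tags appended
    # using the DogStatsD-style "|#key:value,key:value" extension.
    line = f"{name}:{value}|g"
    if tags:
        line += "|#" + ",".join(tags)
    return line

def send_gauge(name, value, tags, host="127.0.0.1", port=8125):
    # StatsD is fire-and-forget: one metric per UDP datagram, no reply.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_gauge(name, value, tags).encode("utf-8"), (host, port))
    sock.close()

print(format_gauge("topic_retention_hours", 71.5, ["topic:payloads", "partition:0"]))
# topic_retention_hours:71.5|g|#topic:payloads,partition:0
```

&lt;p&gt;Because it is plain UDP text, the client side stays tiny, which is exactly why a "very simple" StatsD client is enough here.&lt;/p&gt;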



&lt;p&gt;Kafka brokers report many more metrics, with JMX as the default reporter. So besides Watchman, we also send these JMX metrics to the standalone agent running on the server. &lt;/p&gt;

&lt;h2&gt;
  
  
  Using Monitoring Data With Graphs and Triggers
&lt;/h2&gt;

&lt;p&gt;Once that data gets into AppSignal, you can build dashboards with it and set triggers, just like with everything else in AppSignal.&lt;/p&gt;

&lt;p&gt;Here’s what that dashboard looks like on AppSignal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.appsignal.com%2Fimages%2Fblog%2F2020-03%2FTopic_retention.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.appsignal.com%2Fimages%2Fblog%2F2020-03%2FTopic_retention.png" alt="Kafka metrics on AppSignal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this one-hour resolution graph, you can actually see Kafka's log rotation, which discards old segments every once in a while. You can spot it happening at every sudden drop in the line. &lt;/p&gt;

&lt;p&gt;We can then set up triggers for when &lt;code&gt;topic_retention_short&lt;/code&gt; is above &lt;code&gt;0 ms&lt;/code&gt;; that way, we will be alerted when the timestamp indicates that the retention is shorter than the specified time. This helps us make informed decisions on which topics to give more disk space, and on what impact the message flow has on retention times. &lt;/p&gt;
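&lt;p&gt;Our reading of that metric: &lt;code&gt;topic_retention_short&lt;/code&gt; records how far (in milliseconds) the observed retention falls below the target, clamped at zero, so any value above &lt;code&gt;0 ms&lt;/code&gt; means the target is missed. A sketch of that calculation, with made-up target values (the post doesn't state the exact formula, so treat this as an assumption):&lt;/p&gt;

```python
def retention_shortfall_ms(oldest_ts_ms, now_ms, target_retention_ms):
    # Observed retention is the age of the oldest indexed message.
    observed_ms = now_ms - oldest_ts_ms
    # Anything above 0 means the retention target is not being met,
    # which is exactly what the trigger alerts on.
    return max(0, target_retention_ms - observed_ms)

day_ms = 24 * 3600 * 1000
# Oldest message is 3 days old, target is 2 days: no shortfall.
print(retention_shortfall_ms(0, 3 * day_ms, 2 * day_ms))  # 0
# Oldest message is only 1 day old, target is 2 days: 1 day short.
print(retention_shortfall_ms(0, 1 * day_ms, 2 * day_ms))  # 86400000
```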

&lt;h2&gt;
  
  
  Roundup
&lt;/h2&gt;

&lt;p&gt;This concludes our little dive into using the standalone agent and StatsD to monitor custom data, and into how we dogfood AppSignal.&lt;/p&gt;

&lt;p&gt;This is not the only way to get custom metrics into AppSignal, though: from your Ruby, Elixir, or Node.js app, you can also add any metric you want. If you want to set this up yourself, read the documentation about &lt;a href="https://docs.appsignal.com/" rel="noopener noreferrer"&gt;custom metrics&lt;/a&gt; or the documentation about the &lt;a href="https://docs.appsignal.com/standalone-agent/installation.html" rel="noopener noreferrer"&gt;standalone agent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;PS. Big thanks to Robert and Thijs. I typed the blog post, but your brains wrote it 😉&lt;/p&gt;

</description>
      <category>devops</category>
      <category>webperf</category>
      <category>architecture</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Performance and N+1 Queries: Explained, Spotted, and Solved</title>
      <dc:creator>Stefan Verkerk</dc:creator>
      <pubDate>Wed, 24 Jun 2020 14:29:05 +0000</pubDate>
      <link>https://dev.to/appsignal/performance-and-n-1-queries-explained-spotted-and-solved-10le</link>
      <guid>https://dev.to/appsignal/performance-and-n-1-queries-explained-spotted-and-solved-10le</guid>
      <description>&lt;p&gt;Today, we’ll dive into N+1 queries⁠—what they are, how to spot them, why they have such an impact, and how to solve them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;When you run into performance issues in your app that affect user experience, or when things break, you'll probably look at the timeline of the requests in your APM. You might see a large part of the timeline being used by a single, repeated query.&lt;/p&gt;

&lt;p&gt;N+1 queries can be quite impactful but are usually easy to spot. In this article, we’ll go over ways to find, fix and prevent slowdowns caused by N+1 queries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The N+1 Antipattern In Short&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The N+1 query antipattern happens when a query is executed for every result of a previous query. &lt;br&gt;&lt;br&gt;
The query count is N + 1, with N being the number of results of the initial query, each triggering one extra query. &lt;br&gt;
If that initial query has one result, N+1 = 2. &lt;br&gt;&lt;br&gt;
If it has 1000 results, N+1 = 1001 queries.&lt;/p&gt;
&lt;/blockquote&gt;
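&lt;p&gt;The arithmetic from the definition above can be written down as a one-line sanity check:&lt;/p&gt;

```python
def query_count(initial_query_results):
    # One initial query, plus one follow-up query per result.
    return 1 + initial_query_results

print(query_count(1))     # 2
print(query_count(1000))  # 1001
```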

&lt;h2&gt;
  
  
  An N+1 Query Isn't That Hard to Spot (with AppSignal 😉)
&lt;/h2&gt;

&lt;p&gt;We’ve written about N+1 before, but today, we wanted to write a language-agnostic post, because we’ve just released a way to find N+1 queries more easily. So, if a request is slow because of an N+1 query, it will stand out even more.&lt;/p&gt;

&lt;p&gt;We were already tracking N+1 queries in the event timeline, but we've added two improvements that help you find these opportunities for improvement more quickly.&lt;/p&gt;

&lt;p&gt;First, the performance issues overview now includes an &lt;code&gt;N+1&lt;/code&gt; label. This indicates that there are queries that might need optimization. Second, we've added a box with a warning and a short explanation of what N+1 queries are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OngOdnIZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.appsignal.com/images/blog/2020-06/nplus1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OngOdnIZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.appsignal.com/images/blog/2020-06/nplus1.png" alt="N+1 in AppSignal" title="N+1 in AppSignal" width="880" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s go back and look at what causes an N+1 query.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Lazy Loading, and Which ORMs Default to It?
&lt;/h2&gt;

&lt;p&gt;Loading relationships from a database can roughly be split into two approaches. First, lazy loading will only load data from your database when needed. Then, there’s eager loading, which loads all data right away. We’ll show you what that means.&lt;/p&gt;

&lt;p&gt;If you use an ORM, this is where you set lazy or eager loading. Looking at different languages we are familiar with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Elixir, things are set explicitly with Ecto (not an ORM really).&lt;/li&gt;
&lt;li&gt;In Node, TypeORM, the most popular ORM, doesn't do lazy loading by default.&lt;/li&gt;
&lt;li&gt;In Ruby, ActiveRecord by default uses lazy loading. This makes sense because, in Ruby on Rails, things are often implicit: convention over configuration. But it also means the N+1 anti-pattern is something you can easily stumble into when using ActiveRecord.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It All Starts: When Fast, Easy and Lazy Are Nice
&lt;/h2&gt;

&lt;p&gt;Let's dive in with a specific example, where we can see the pattern emerge and understand its impact. Because we like Stroopwafels, the example will be an imaginary Cookie webshop. In this imaginary webshop example, each &lt;code&gt;cookie&lt;/code&gt; can have any number of &lt;code&gt;topping&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;We will start with the case where a page shows all cookies, and allows you to navigate to a cookie detail page to see all the available toppings.&lt;/p&gt;

&lt;p&gt;When we lazy load the data on the overview page, we loop over every cookie and only load their toppings later, at the point where we need them. On a detail page, you will load one cookie, and all of the toppings it has. Your ORM will do 2 queries to show these. To keep the explanation language-agnostic, we’ll just show the resulting queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="no"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"cookie_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When a Thing Starts Going Bad
&lt;/h2&gt;

&lt;p&gt;When you not only want to list all cookies, but ALSO show all the toppings for each cookie, AND you lazy load, N+1 rears its ugly head. Because we lazy load, a query is executed for every result of a previous query.&lt;/p&gt;

&lt;p&gt;This case returns all 3 cookies from the database and each of their toppings. This leads to 4 queries. Again, we just show the queries to make this applicable across languages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt;
&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"cookie_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"cookie_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"cookie_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Much Is N+1?
&lt;/h2&gt;

&lt;p&gt;We started with rendering a view for 1 cookie and its toppings. All was well, and it led to 2 queries. The second case rendering a view with all cookies and all toppings led to 4 queries.&lt;/p&gt;

&lt;p&gt;Looking at the first query, this is executed by the explicit call to &lt;code&gt;Cookie.all&lt;/code&gt; in the controller, which finds all cookies. Queries 2 to 4 are lazily executed while we loop through all the cookies.&lt;/p&gt;

&lt;p&gt;This results in the number of queries being N+1.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N here is the number of cookies (yum); the plus one is the first (explicit) query that fetched all the cookies.&lt;/li&gt;
&lt;li&gt;We do one query for each result of that first query, plus the initial query itself.&lt;/li&gt;
&lt;li&gt;Because we have 3 cookies here, N = 3 and it leads to  N + 1 = 3 + 1 = 4 queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With 3 cookies, this probably won’t lead to any performance issues, but if we had 1000, we can predict it would lead to 1001 queries. Ouch. Perhaps there IS such a thing as too many stroopwafels.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Much Difference Does It Make?
&lt;/h2&gt;

&lt;p&gt;OK, we ate all of the cookies by now. But we timed how much time that took. I mean, how long the queries took 😉. We used a Ruby on Rails setup with ActiveRecord for this example, but the relative differences will be similar in any language. With 3 cookies, eager loading was 12% faster than lazy loading. With 10 cookies, the difference was already almost 60%. With 1000 cookies, the difference would be close to 80%: 58 ms against a whopping 290 ms in our example.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Solve It: Eagerness
&lt;/h2&gt;

&lt;p&gt;The way to solve this is to use eager loading. By preloading the toppings, when we show all cookies and their toppings on a page again, the query count drops back to 2, even if the number of cookies increases to 1000. The query count is 2 because the second query depends on the data from the first one to know which toppings to fetch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"cookies"&lt;/span&gt;
&lt;span class="no"&gt;SELECT&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="no"&gt;FROM&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt; &lt;span class="no"&gt;WHERE&lt;/span&gt; &lt;span class="s2"&gt;"toppings"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;"cookie_id"&lt;/span&gt; &lt;span class="no"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
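&lt;p&gt;To make the difference concrete outside of any ORM, here is a self-contained sketch with SQLite that counts the queries each strategy issues. The table layout and data are made up to match the cookie example:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cookies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE toppings (id INTEGER PRIMARY KEY, cookie_id INTEGER, name TEXT);
    INSERT INTO cookies VALUES (1, 'stroopwafel'), (2, 'chocolate chip'), (3, 'gingerbread');
    INSERT INTO toppings VALUES (1, 1, 'caramel'), (2, 2, 'chocolate'), (3, 3, 'icing');
""")

def lazy_load():
    # N+1: one query for the cookies, then one toppings query per cookie.
    queries = 1
    cookies = conn.execute("SELECT id FROM cookies").fetchall()
    for (cookie_id,) in cookies:
        conn.execute("SELECT name FROM toppings WHERE cookie_id = ?", (cookie_id,)).fetchall()
        queries += 1
    return queries

def eager_load():
    # Eager: one query for the cookies, one IN query for all their toppings.
    cookies = conn.execute("SELECT id FROM cookies").fetchall()
    ids = [cid for (cid,) in cookies]
    placeholders = ",".join("?" * len(ids))
    conn.execute(f"SELECT name FROM toppings WHERE cookie_id IN ({placeholders})", ids).fetchall()
    return 2

print(lazy_load(), eager_load())  # 4 2
```

&lt;p&gt;With 3 cookies, lazy loading issues 4 queries while eager loading stays at 2; with 1000 cookies, the counts would be 1001 versus 2.&lt;/p&gt;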



&lt;h2&gt;
  
  
  Now What?
&lt;/h2&gt;

&lt;p&gt;Depending on your state of mind, grab some cookies. Or &lt;a href="https://www.appsignal.com/tour/performance"&gt;find out more about AppSignal&lt;/a&gt; and set up an account. If you ask for stroopwafels, we will send you some. If you are already using AppSignal, go and see whether you can solve some N+1 issues.&lt;/p&gt;

</description>
      <category>database</category>
      <category>webdev</category>
      <category>programming</category>
      <category>sql</category>
    </item>
    <item>
      <title>How to Monitor Your Host Metrics Automatically</title>
      <dc:creator>Stefan Verkerk</dc:creator>
      <pubDate>Wed, 04 Mar 2020 15:37:39 +0000</pubDate>
      <link>https://dev.to/appsignal/how-to-monitor-your-host-metrics-automatically-njh</link>
      <guid>https://dev.to/appsignal/how-to-monitor-your-host-metrics-automatically-njh</guid>
      <description>&lt;p&gt;Today, we’ll dive deep into monitoring hosts. The good news is that we’ll point you to some shortcuts on how to set up host monitoring in an easy way. The bad news is that we won't be doing any &lt;a href="https://www.technology.org/2019/03/29/its-not-stupid-if-it-works-why-percussive-maintenance-is-a-legit-repair-method/"&gt;percussive maintenance&lt;/a&gt; on any host.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Getting the Data In
&lt;/h2&gt;

&lt;p&gt;To monitor hosts, you have to put a few layers in place. Doing all of this yourself would be the hard way. You may ask: "How hard could it be?". Well, AppSignal started as a side project in 2012 by a group of "How hard can it be?" people, and it took three years of work before the system was in a shape we were somewhat proud of.&lt;/p&gt;

&lt;p&gt;After monitoring thousands of billions of requests, we've learned all kinds of things along the way. So, regardless of whether you choose AppSignal to monitor your app, think thrice before you roll your own solution.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;These challenges go from emitting the data in a lightweight manner to ingesting masses of data in a way that doesn't influence the hosts you are monitoring. If you think that’s cool, we wrote a bit about how we do it in &lt;a href="https://docs.appsignal.com/appsignal/data-life-cycle.html"&gt;our documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Automated Part
&lt;/h2&gt;

&lt;p&gt;If you monitor your app with AppSignal, &lt;a href="https://docs.appsignal.com/metrics/host.html"&gt;host metrics&lt;/a&gt; are collected by the agent every minute. We try to give as many insights as possible right ‘out-of-the-box’, so you don't need to manually set up anything.&lt;/p&gt;

&lt;p&gt;We love the combination of many things working right away, and also being able to continuously tweak things. So, with this, you &lt;em&gt;can&lt;/em&gt; also turn it off if you want to.&lt;/p&gt;

&lt;p&gt;Out-of-the-box, we collect the following metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU usage&lt;/td&gt;
&lt;td&gt;User, nice, system, idle and iowait in percentages. Read more about CPU metrics in our academy article.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load average&lt;/td&gt;
&lt;td&gt;1 minute load average on the host.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;Available, free and used memory. Also includes swap total and swap used.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk usage&lt;/td&gt;
&lt;td&gt;Percentage of every disk used.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk IO&lt;/td&gt;
&lt;td&gt;Throughput of data read from and written to every disk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network traffic&lt;/td&gt;
&lt;td&gt;Throughput of data received and transmitted through every network interface.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 2: Initial Triggers and Setting up Alerts
&lt;/h2&gt;

&lt;p&gt;Now that we have the host metrics available, let's put in some thresholds for when you want to be alerted. The first step is to set this up with relatively sensitive, noisy alerting on a day (not a night) when you are available to look into things.&lt;/p&gt;

&lt;p&gt;What counts as a noisy setting depends on what you see on average on your hosts, but some sane defaults are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Trigger setting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;idle below 20%, without warmup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O&lt;/td&gt;
&lt;td&gt;larger than 10MB per minute, without warmup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk usage&lt;/td&gt;
&lt;td&gt;over 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load average&lt;/td&gt;
&lt;td&gt;over 1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
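&lt;p&gt;Conceptually, a trigger is just a threshold comparison against the latest value of a metric. The following sketch illustrates that idea; the rule format and metric names are illustrative, not AppSignal's actual trigger configuration:&lt;/p&gt;

```python
# Each rule: (metric name, comparison operator, threshold).
NOISY_RULES = [
    ("disk_io_mb_per_min", ">", 10),
    ("disk_usage_percent", ">", 90),
    ("load_average", ">", 1),
]

def triggered(metrics, rules):
    # Return the name of every metric that crossed its threshold.
    alerts = []
    for name, op, threshold in rules:
        value = metrics.get(name)
        if value is not None and op == ">" and value > threshold:
            alerts.append(name)
    return alerts

sample = {"disk_io_mb_per_min": 4.2, "disk_usage_percent": 93, "load_average": 0.4}
print(triggered(sample, NOISY_RULES))  # ['disk_usage_percent']
```

&lt;p&gt;Deliberately low thresholds like these fire often at first; that noise is the point, because it shows you the normal operating range before you tune the levels down in step 6.&lt;/p&gt;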

&lt;p&gt;When values go above these defaults, set up an alert via email or Slack (we don’t recommend starting with PagerDuty for these noisy alerts just yet).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4OTq3QOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/bc266bd8435b693700b28d5ccd444531e53a79e7/7bc1c/images/blog/2020-02/new-host-trigger.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4OTq3QOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/bc266bd8435b693700b28d5ccd444531e53a79e7/7bc1c/images/blog/2020-02/new-host-trigger.png" alt="New Host Trigger"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Monitoring the Patterns Over Time
&lt;/h2&gt;

&lt;p&gt;Once you have some data coming in, the next step is to monitor the emerging patterns. Simply having all this data available is nice, but in our experience, it is all about the patterns.&lt;/p&gt;

&lt;p&gt;There are some rights and wrongs, but the key is to monitor, get to know the bandwidth in which your hosts operate in a ‘normal’ setting, and be able to see when things move beyond that bandwidth. In other words: it is not about having one data point, but about observing the line over a longer period of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Ice Cold Drinks
&lt;/h2&gt;

&lt;p&gt;Now that we have metrics coming in and we've set some triggers, we need to wait for some time before the patterns emerge.&lt;/p&gt;

&lt;p&gt;Sit back and relax, alt-tab to another task or get an ice-cold drink; looking at the graphs all day won’t speed up the process. You may want to have the noisy settings running for a week to get a full pattern of weekdays and perhaps lower-traffic weekends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Monitoring Patterns Between Hosts
&lt;/h2&gt;

&lt;p&gt;The next step, while we're enjoying that ice-cold drink, is to compare the metrics across different hosts. Next to patterns over time, the other important patterns in host monitoring are in how one host behaves compared to another host with a similar function.&lt;/p&gt;

&lt;p&gt;This is why we show different hosts in one view like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uPet6PAd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/2cbdb5444d1b0d49dfdd0e55c1b0bd2519e2becc/3d722/images/blog/2020-02/hostcompared.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uPet6PAd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/2cbdb5444d1b0d49dfdd0e55c1b0bd2519e2becc/3d722/images/blog/2020-02/hostcompared.png" alt="HostsCompared"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve experienced that sometimes different hosts can have different performance characteristics, especially in a virtualized environment. For example, if you share the hardware with another customer of your hosting company, a change on their end might trigger issues on your side.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens on the Hosts…
&lt;/h2&gt;

&lt;p&gt;We could also call this ‘when the host issue is the effect, and not the cause’. So far, we've focused on steps to make sure we monitor and set triggers to the overall host-level metrics. But other than interference by other virtual machines, something like a CPU spike is the effect, not the cause of an issue (although it can trigger new problems).&lt;/p&gt;

&lt;p&gt;This is why we built monitoring in an integrated way. We realize we are biased 😉 But if you have your monitoring in one integrated place, like AppSignal or a comparable solution, you can also see what is running on a host and causing those peaks.&lt;/p&gt;

&lt;p&gt;At AppSignal, these metrics are scoped per namespace. By default, we create web and background namespaces for you, but you have complete freedom in how you organize your namespaces.&lt;/p&gt;

&lt;p&gt;By &lt;a href="https://docs.appsignal.com/application/namespaces.html"&gt;setting your own namespaces&lt;/a&gt;, you can also set different levels of triggers. For webhooks, for example, you can expect a much higher throughput than for normal web requests, so you will want to set different triggers for each.&lt;/p&gt;

&lt;p&gt;We get really excited about this part, so apologies if we dove in a bit too deep. We’ll cover it in a separate post in a few weeks. Let’s get back to the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Less Noise &amp;amp; Repeat
&lt;/h2&gt;

&lt;p&gt;Once we’ve seen patterns emerge over time and between hosts, the next step is to set the alerting to some realistic levels that are not as noisy as the ones we started with. Once you’ve seen the spread around the mean, and you feel confident, hook up PagerDuty or OpsGenie and set up the real alerting that &lt;em&gt;will&lt;/em&gt; wake people up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: A Good Night’s Sleep
&lt;/h2&gt;

&lt;p&gt;In our Ops team, the vision is ‘a good night’s sleep’. You should aim for that as well: being alerted when needed, and solving things so it doesn’t keep alerting you.&lt;/p&gt;

&lt;p&gt;With all the steps taken, a good setup for monitoring your host will now get you some well-deserved sleep. Good night! (* depending on your timezone 😉)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PS: Another hard part about monitoring is digesting the hundreds of billions of requests that you may emit. We might write a blog post about that another day ;-) It's why we think you should use a dedicated service rather than run it yourself. And if you'd &lt;a href="https://appsignal.com"&gt;try us&lt;/a&gt;, we'd be honored.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Monitoring the Erlang VM With AppSignal's Magic Dashboard</title>
      <dc:creator>Stefan Verkerk</dc:creator>
      <pubDate>Wed, 12 Feb 2020 15:53:40 +0000</pubDate>
      <link>https://dev.to/appsignal/monitoring-the-erlang-vm-with-appsignal-s-magic-dashboard-12a2</link>
      <guid>https://dev.to/appsignal/monitoring-the-erlang-vm-with-appsignal-s-magic-dashboard-12a2</guid>
      <description>&lt;p&gt;Today, we will dive into one of the hard parts of using any monitoring - making sense out of all the data that is emitted. We think this is one of the hard parts. And being developers building for developers, we think a lot like you do -- we think. Pun intended.&lt;/p&gt;

&lt;p&gt;Nowadays, we monitor AppSignal with AppSignal (on a separate setup), so we are still dogfooding all the time. We still run into the same challenges you do, often before you do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Magic Dashboards
&lt;/h2&gt;

&lt;p&gt;We believe one of the harder challenges is finding the right data and making sense of it. Once we discover what works best for a certain setup, we don't keep that knowledge to ourselves; we turn it into a solution that's available to all of our users.&lt;/p&gt;

&lt;p&gt;We call this solution Magic Dashboards. Based on the architecture that you are running, we add dashboards that make sense for that architecture.&lt;/p&gt;

&lt;p&gt;If you are running a recent version of the AppSignal integration, Magic Dashboards will show up when you add a new application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O04QtW9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/dd04ec2fce63d135836cd58b00dfb44580cd3db5/784be/images/blog/2020-01/erlang_splash.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O04QtW9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/dd04ec2fce63d135836cd58b00dfb44580cd3db5/784be/images/blog/2020-01/erlang_splash.png" alt="Erlang VM Magic Dashboard image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Erlang VM Magic Dashboard
&lt;/h2&gt;

&lt;p&gt;A Magic Dashboard that we made for the Elixir integration is the Erlang VM dashboard. It graphs metrics on IO, schedulers, processes, and memory. This is what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iT5Y0s07--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/14fd699b02f933344dbc61781f385cbd54b712b7/5f782/images/blog/2020-01/erlang_bdashboard.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iT5Y0s07--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d33wubrfki0l68.cloudfront.net/14fd699b02f933344dbc61781f385cbd54b712b7/5f782/images/blog/2020-01/erlang_bdashboard.png" alt="Erlang VM Magic Dashboard image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IO
&lt;/h3&gt;

&lt;p&gt;This shows the cumulative amount of input and output the VM has performed, expressed in kB.&lt;/p&gt;
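&lt;p&gt;These IO counters come straight from the BEAM, so you can read the same numbers yourself in an IEx session with Erlang's built-in &lt;code&gt;:erlang.statistics/1&lt;/code&gt; (a small sketch; the dashboard presumably derives its values from this same source):&lt;/p&gt;

```elixir
# Cumulative bytes read from and written to ports since the VM started.
{{:input, input_bytes}, {:output, output_bytes}} = :erlang.statistics(:io)

# The dashboard plots these in kB.
IO.puts("IO in: #{div(input_bytes, 1024)} kB, out: #{div(output_bytes, 1024)} kB")
```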

&lt;h3&gt;
  
  
  Schedulers
&lt;/h3&gt;

&lt;p&gt;This graph shows the total number of available schedulers and the number of online schedulers. Erlang's schedulers schedule CPU time between your processes. The number of schedulers defaults to the number of CPU cores on the machine.&lt;/p&gt;

&lt;p&gt;If you want to know more about schedulers, here’s &lt;a href="https://hamidreza-s.github.io/erlang/scheduling/real-time/preemptive/migration/2016/02/09/erlang-scheduler-details.html"&gt;a good article on Hamidreza Soleimani’s Blog on why the details of schedulers are important&lt;/a&gt;.&lt;/p&gt;
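&lt;p&gt;You can check both of the plotted values in an IEx session with &lt;code&gt;:erlang.system_info/1&lt;/code&gt;:&lt;/p&gt;

```elixir
# Schedulers the VM was started with (defaults to the number of CPU cores).
schedulers = :erlang.system_info(:schedulers)

# Schedulers currently allowed to pick up work; can be changed at runtime.
online = :erlang.system_info(:schedulers_online)

IO.puts("#{online} of #{schedulers} schedulers online")
```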

&lt;h3&gt;
  
  
  Processes
&lt;/h3&gt;

&lt;p&gt;The number of processes and the process limit are plotted here. The limit is the maximum number of simultaneously existing processes on the local node; once it is reached, spawning a new process raises an exception. The default Erlang limit is 262144, which should be high enough for almost all applications.&lt;/p&gt;
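&lt;p&gt;Both values are available through &lt;code&gt;:erlang.system_info/1&lt;/code&gt; as well:&lt;/p&gt;

```elixir
# Number of processes that currently exist on this node.
count = :erlang.system_info(:process_count)

# Maximum number of simultaneously existing processes
# (262144 by default; can be raised with the +P emulator flag).
limit = :erlang.system_info(:process_limit)

IO.puts("#{count} of #{limit} processes in use")
```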

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;p&gt;This shows the total amount of memory in use, split into processes, system, binary, ets, and code.&lt;/p&gt;
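&lt;p&gt;The same breakdown is exposed by &lt;code&gt;:erlang.memory/0&lt;/code&gt;, which returns byte counts per category (the dashboard presumably reads from the same source):&lt;/p&gt;

```elixir
# Keyword list of byte counts, e.g. [total: ..., processes: ..., system: ...].
mem = :erlang.memory()

for key <- [:total, :processes, :system, :binary, :ets, :code] do
  IO.puts("#{key}: #{div(mem[key], 1_048_576)} MB")
end
```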

&lt;p&gt;What counts as a normal level depends on your situation, but a sudden increase can be an indicator that something is wrong.&lt;/p&gt;

&lt;p&gt;For anything that's monitored on a dashboard, you can set up triggers (which we call ‘anomaly detection’) that will message you via email, Slack, or PagerDuty when a metric exceeds a value that's normal for your case, for a certain period. Our &lt;a href="https://docs.appsignal.com/application/anomaly-detection"&gt;documentation on anomaly detection&lt;/a&gt; describes how to set that up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dashboards for Host Metrics
&lt;/h2&gt;

&lt;p&gt;Apart from what runs inside your Elixir setup, once you have AppSignal installed, we also immediately add &lt;a href="https://docs.appsignal.com/metrics/host.html"&gt;dashboards and metrics for your hosts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, check the ‘Host usage’ link in the Inspect menu to see throughput, response time, and queue time for any namespace on that host. And check ‘Host Metrics’ to see CPU, disk usage, load average, and more for each of your hosts.&lt;/p&gt;

&lt;p&gt;We've seen that the integrated approach of monitoring really helps in narrowing down what causes issues. So for each of these metrics, you can click on a peak, check 'what happened here' in the graph legend and see the entire overview of errors, performance issues and host metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edit or Extend as You Want
&lt;/h2&gt;

&lt;p&gt;We always aim to combine loads of out-of-the-box insight with the ability to tweak things exactly as you want. So if you want any of these dashboards set up differently, you can edit the dashboard configuration YAML and make it do exactly what you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  What More Do You Want Us to Add?
&lt;/h2&gt;

&lt;p&gt;We are always curious to hear what else you’d like us to set up Magic Dashboards for. So if you have something in your Elixir setup that you want us to help visualize in a graph, drop us a line at &lt;a href="mailto:support@appsignal.com"&gt;support@appsignal.com&lt;/a&gt;. We'll then magically make it available to everyone else.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PS: Another hard part about monitoring is digesting the hundreds of billions of requests that you may emit. We might write a blog post about that another day ;-) It's why we think you should use a dedicated service rather than run it yourself. And if you'd &lt;a href="https://appsignal.com/elixir/"&gt;try us&lt;/a&gt; we'd be honored.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>erlang</category>
      <category>elixir</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
