<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cameron Gray</title>
    <description>The latest articles on DEV Community by Cameron Gray (@camerondgray).</description>
    <link>https://dev.to/camerondgray</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F401232%2Ffa415c4d-0ed1-4b22-a047-6e8f88c7ddc6.jpeg</url>
      <title>DEV Community: Cameron Gray</title>
      <link>https://dev.to/camerondgray</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/camerondgray"/>
    <language>en</language>
    <item>
      <title>Configuring Logging on Kubernetes</title>
      <dc:creator>Cameron Gray</dc:creator>
      <pubDate>Fri, 14 Aug 2020 14:26:47 +0000</pubDate>
      <link>https://dev.to/camerondgray/configuring-logging-on-kubernetes-2n87</link>
      <guid>https://dev.to/camerondgray/configuring-logging-on-kubernetes-2n87</guid>
      <description>&lt;p&gt;In my previous posts I gave &lt;a href="https://dev.to/camerondgray/monitoring-applications-on-kubernetes-an-overview-n91"&gt;an overview of Kubernetes monitoring&lt;/a&gt; as well as a deeper dive into &lt;a href="https://dev.to/camerondgray/configuring-apm-on-kubernetes-93f"&gt;application monitoring&lt;/a&gt;. In this post I am going to dig into the final piece of the monitoring puzzle which is log consolidation. To get a complete picture it's important to consolidate the logs from your applications and infrastructure into a single stream so you can get a detailed view, in real time if necessary, of everything that is happening in your cluster.&lt;/p&gt;

&lt;h2&gt;Consolidating Your Logs&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://convox.com"&gt;Convox&lt;/a&gt; we allow you to easily tail and search your application logs with the &lt;code&gt;convox logs&lt;/code&gt; CLI command, and we provide the same for your infrastructure with &lt;code&gt;convox rack logs&lt;/code&gt;. While this works well for many situations, we also recognize that sometimes you need a more robust solution. Seeing all your logs in one place can help you gain a better understanding of what's happening in your environment. In addition, if you are using a third-party service such as Datadog or New Relic, having your logs collected in the same platform allows you to correlate changes in application performance or uptime with specific events in your logs. To this end, a complete cluster-level logging solution can be extremely helpful.&lt;/p&gt;

&lt;h2&gt;Kubernetes Logging Overview&lt;/h2&gt;

&lt;p&gt;By default the logs from your containers are written to &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt;, and Kubernetes writes these logs to the local file system of the Node running your containers. You can retrieve the logs from a given Pod with the &lt;a href="https://kubectl.docs.kubernetes.io/pages/container_debugging/container_logs.html"&gt;&lt;code&gt;kubectl logs&lt;/code&gt;&lt;/a&gt; command. While this can be useful for troubleshooting the initial configuration of your cluster or troubleshooting a new app, it doesn't provide the cluster-wide view you often need to manage apps in production. In addition, if a Pod or Node fails or is replaced, the logs are lost.&lt;/p&gt;
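
&lt;p&gt;For quick troubleshooting, the basic &lt;code&gt;kubectl logs&lt;/code&gt; invocations look something like this (the Pod and container names below are placeholders, not from a real cluster):&lt;/p&gt;

```shell
# Print the logs of a single Pod
kubectl logs my-app-pod-abc123

# Follow the log stream (like tail -f), limited to the last 50 lines
kubectl logs -f --tail=50 my-app-pod-abc123

# If the Pod runs more than one container, name the container explicitly
kubectl logs my-app-pod-abc123 -c my-sidecar
```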

&lt;p&gt;While Kubernetes does not provide a built-in method for cluster-level logging, the &lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/logging/"&gt;official documentation&lt;/a&gt; does provide a few options for systems you can set up yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/logging/#using-a-sidecar-container-with-the-logging-agent"&gt;run a dedicated sidecar container in your application pod&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/logging/#using-a-node-logging-agent"&gt;run a logging agent on every node&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/logging/#exposing-logs-directly-from-the-application"&gt;push logs directly from your application to a custom backend application or third party service&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Sidecar Logging&lt;/h3&gt;

&lt;p&gt;We find that the sidecar approach is a bit cumbersome and generally difficult to set up and maintain, so I won't go into too much detail on this option. One benefit worth mentioning: because the sidecar container runs within each application's Pod, if you have multiple apps with logs in different formats you can use the per-app sidecar to either reformat the logs into a common format or send different types of logs to different locations. You can even use a sidecar in conjunction with the logging-agent approach if you want to pre-process your logs before sending them to a Node-level agent. This option is definitely the most complex but also the most flexible.&lt;/p&gt;
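
&lt;p&gt;As a rough illustration of the pattern, a Pod with a logging sidecar might look like the sketch below. All names and the image are hypothetical; the app writes to a shared volume and the sidecar streams that file out:&lt;/p&gt;

```yaml
# Hypothetical Pod with a logging sidecar: the app writes its log file
# to a shared emptyDir volume and the sidecar tails it to stdout
# (or forwards it to a collector).
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:latest            # assumed to write /var/log/app/app.log
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-sidecar
      image: busybox
      args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs
      emptyDir: {}
```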

&lt;h3&gt;Per Node Logging Agents&lt;/h3&gt;

&lt;p&gt;Running a logging agent on every node is a great starting point. Once again we find that a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"&gt;DaemonSet&lt;/a&gt; is the right approach for this. At Convox we even offer this as a built-in solution, using a Fluentd DaemonSet installed by default with every Rack. Fluentd provides three major functions. First, it can gather all your logs from &lt;code&gt;stdout&lt;/code&gt; or from an application-specific stream. Second, it can store those logs in a cloud-specific storage endpoint such as Elasticsearch or Stackdriver. Finally, it can stream those logs to an outside collector such as a syslog endpoint or even an S3 bucket. At Convox we also offer the option of enabling a &lt;a href="https://docs.convox.com/installation/production-rack/aws#install-rack"&gt;syslog forwarder&lt;/a&gt; on your cluster so you can easily send all these logs to a third-party provider like Papertrail. If you are not running on Convox and you want to deploy a Fluentd DaemonSet, you can consult the official documentation &lt;a href="https://docs.fluentd.org/container-deployment/kubernetes"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Using a 3rd Party Service&lt;/h3&gt;

&lt;p&gt;One of the shortcomings of solely using the logging-agent approach is that it will only collect application logs from &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt;. While this is fine for most applications, sometimes you will want to push logs directly from your application to a third-party service. While there are some specific use cases for this, we have found that most of the third-party services you would use for this also provide a DaemonSet-based logging agent, which you can deploy to serve as both a per-Node logging agent and a &lt;a href="https://docs.datadoghq.com/agent/logs/?tab=tailfiles#custom-log-collection"&gt;custom logging endpoint&lt;/a&gt;. With this approach, much like we previously outlined with &lt;a href="https://convox.com/blog/k8s-monitoring-APM"&gt;APM&lt;/a&gt;, you can push your custom application logs directly to the agent running on the local Node and gain the benefits of tying Node and application logs together.&lt;/p&gt;

&lt;p&gt;If you are using a service such as &lt;a href="https://docs.datadoghq.com/agent/kubernetes/log/?tab=daemonset"&gt;Datadog&lt;/a&gt; or &lt;a href="https://docs.logdna.com/docs/logdna-agent-kubernetes"&gt;LogDNA&lt;/a&gt; their logging agents will collect all the Node and Application logs from &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt;. Also, in the case of Datadog if you are already using an Agent for infrastructure monitoring and/or APM you can enable logging with a simple &lt;a href="https://docs.datadoghq.com/agent/logs/?tab=tailfiles#activate-log-collection"&gt;configuration option&lt;/a&gt;.&lt;/p&gt;
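
&lt;p&gt;With Datadog, for example, switching log collection on for an agent DaemonSet you already run comes down to a couple of environment variables. This is only an excerpt of the agent container spec, not a complete manifest; consult the Datadog documentation linked above for the rest:&lt;/p&gt;

```yaml
# Excerpt of a Datadog agent DaemonSet container spec
# with log collection enabled.
env:
  - name: DD_LOGS_ENABLED
    value: "true"
  - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
    value: "true"    # collect logs from all discovered containers
```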

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Hopefully, this has given you a good overview of the various options for collecting logs for both your Kubernetes cluster and the applications you are hosting on it. As always if you don't want to worry about all this stuff you can just use &lt;a href="https://docs.convox.com/getting-started/introduction"&gt;Convox&lt;/a&gt; and we will take care of it for you!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Configuring APM on Kubernetes</title>
      <dc:creator>Cameron Gray</dc:creator>
      <pubDate>Wed, 12 Aug 2020 16:59:17 +0000</pubDate>
      <link>https://dev.to/camerondgray/configuring-apm-on-kubernetes-93f</link>
      <guid>https://dev.to/camerondgray/configuring-apm-on-kubernetes-93f</guid>
<description>&lt;h1&gt;Monitoring your apps&lt;/h1&gt;

&lt;p&gt;In our &lt;a href="https://dev.to/camerondgray/monitoring-applications-on-kubernetes-an-overview-n91"&gt;previous post&lt;/a&gt; we gave you a general overview of infrastructure and application monitoring for Kubernetes. In this follow up post we are going to take a deeper dive specifically into application monitoring. &lt;/p&gt;

&lt;p&gt;The term used by most of the industry for application monitoring is APM which is an acronym for either Application Performance Monitoring or Application Performance Management depending on who you ask. Whether the terms are actually synonymous or not is the source of &lt;a href="https://www.appdynamics.com/blog/product/monitoring-versus-management/"&gt;much debate&lt;/a&gt; but I am not going to get into that here. Many vendors offer "APM" solutions and the feature sets of these offerings do have some variety but also a great deal of overlap. At a high level, an APM tool instruments your apps and allows you to dig into actual code performance, stack traces, crashes, etc...&lt;/p&gt;

&lt;p&gt;Digging into the specifics, &lt;a href="https://www.gartner.com/en/documents/3983892/magic-quadrant-for-application-performance-monitoring"&gt;Gartner&lt;/a&gt; identifies the following core components of APM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Front-end monitoring (page load times, browser render time, etc...)&lt;/li&gt;
&lt;li&gt;Application discovery, tracing and diagnostics (tracing your running applications through your code all the way down to your underlying infrastructure)&lt;/li&gt;
&lt;li&gt;Analytics (providing overview and analysis of the collected data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the major APM vendors you may have heard of include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://newrelic.com/products/application-monitoring"&gt;New Relic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datadoghq.com/dg/apm/benefits-os/"&gt;Datadog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.appdynamics.com/"&gt;AppDynamics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dynatrace.com/platform/application-performance-monitoring/"&gt;Dynatrace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Odds are, if you have an application in production, you are already using one or more of these services. As you make the move to Kubernetes it's important to understand the core components of installing and configuring an APM solution. There are typically three pieces to implementing any APM solution for Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing and configuring a collection and reporting agent on every Node running in your cluster&lt;/li&gt;
&lt;li&gt;Instrumenting your applications with a tracing library or module to collect and report trace data&lt;/li&gt;
&lt;li&gt;Configuring your application trace component to communicate with the collection agent running on the local Node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you get this set up, all of your Nodes should have an APM agent running on them. Your application code will then be collecting trace data and sending it to the agent running on its Node, which allows you to tie application performance to the underlying host performance. Finally, your agents will stream all the collected data back to a central ingestion endpoint for your APM provider so it can be aggregated and analyzed. Once you have all of this in place you can create dashboards, reports, alerts, etc...&lt;/p&gt;

&lt;h2&gt;Installing a Collection Agent on Every Node&lt;/h2&gt;

&lt;p&gt;We covered this in some detail in our &lt;a href="https://convox.com/blog/k8s-monitoring-overview"&gt;previous post&lt;/a&gt;, but the short story is to use a Kubernetes DaemonSet. You can read our detailed instructions on how to do this with Datadog &lt;a href="https://docs.convox.com/integrations/monitoring/datadog"&gt;here&lt;/a&gt;. For most providers these collection agents perform many different monitoring tasks, and you will need to set a &lt;a href="https://docs.datadoghq.com/agent/kubernetes/apm/?tab=daemonset"&gt;configuration option&lt;/a&gt; to enable APM trace collection. Here are a few links to the DaemonSet configuration instructions for some of the other providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.newrelic.com/docs/integrations/kubernetes-integration/installation/kubernetes-integration-install-configure#customized-manifest"&gt;New Relic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.appdynamics.com/display/PRO45/Monitoring+Kubernetes+with+the+Cluster+Agent"&gt;AppDynamics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dynatrace.com/support/help/technology-support/cloud-platforms/kubernetes/deploy-oneagent-k8/"&gt;Dynatrace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Instrumenting Your App&lt;/h2&gt;

&lt;p&gt;Once you have your collection agents up and running the next step is to start collecting data from your running code. Most of the APM providers have libraries or modules for all the major languages and frameworks.&lt;/p&gt;

&lt;p&gt;Let's look at the simple example of a Node.js application that we want to monitor using Datadog's APM offering. First we will need to install the Datadog tracer module and save it to our &lt;code&gt;package.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;npm install --save dd-trace
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next we will need to add the tracer to our code and initialize it. Typically we will want to do this as part of starting up our app server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INSTANCE_IP&lt;/span&gt; &lt;span class="c1"&gt;//we will get to this&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dd-trace&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;init&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;8126&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For each provider there are several configuration options for the tracer libraries, so I strongly recommend you consult the documentation for the particular provider and language/framework you are using. Here are a few links to the various providers' APM libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.newrelic.com/docs/agents"&gt;New Relic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/tracing/compatibility_requirements/"&gt;Datadog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.appdynamics.com/display/PRO45/Application+Monitoring"&gt;AppDynamics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dynatrace.com/support/help/technology-support/application-software/"&gt;Dynatrace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Configuring Your Tracer to Communicate With the Local Agent&lt;/h2&gt;

&lt;p&gt;As you can see above we are using the &lt;code&gt;INSTANCE_IP&lt;/code&gt; environment variable to find the IP address of the current instance that the agent is listening on. If you are used to using APM services on traditional servers, VMs, or instances you are likely using &lt;code&gt;localhost&lt;/code&gt; or &lt;code&gt;127.0.0.1&lt;/code&gt; to communicate with the local agent. In a Kubernetes cluster that approach is not going to work. You will need to find the actual instance IP as &lt;code&gt;localhost&lt;/code&gt; would refer to the container your App is running in, rather than the underlying Node that hosts your container.&lt;/p&gt;

&lt;p&gt;If you are deploying your application on a &lt;a href="https://docs.convox.com/reference/primitives/rack"&gt;Convox Rack&lt;/a&gt; then every container will have the &lt;code&gt;INSTANCE_IP&lt;/code&gt; environment variable automatically injected into your App’s container for your convenience. Otherwise you will need to use the Kubernetes API or another third party utility to determine the IP address. As an example you can find the Datadog documentation for this &lt;a href="https://docs.datadoghq.com/agent/kubernetes/apm/?tab=daemonset"&gt;here&lt;/a&gt;.&lt;/p&gt;
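
&lt;p&gt;If you are wiring this up yourself, the standard way to get the Node's IP into a container is the Kubernetes Downward API. The variable name &lt;code&gt;INSTANCE_IP&lt;/code&gt; below simply mirrors the tracer example above; this is a container spec excerpt, not a full manifest:&lt;/p&gt;

```yaml
# Deployment/Pod container spec excerpt: expose the host Node's IP
# to the application via the Downward API.
env:
  - name: INSTANCE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
```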

&lt;h2&gt;Pulling it All Together&lt;/h2&gt;

&lt;p&gt;Once you have completed the steps outlined above you should have a collection agent running on every Node in your cluster. Your applications should be fully instrumented and streaming data to the collection agent on their respective Nodes. With this in place, you should be able to keep a close eye on the performance of your applications as well as any impact those applications may be having on your Nodes. For example, if your infrastructure monitoring indicates that one of your Nodes is consuming all of its available memory, your APM service should be able to pinpoint which application, and potentially even which line of code, is causing the problem.&lt;/p&gt;

&lt;p&gt;At Convox we strive not only to make &lt;a href="https://docs.convox.com/getting-started/introduction#install-a-rack"&gt;setting up a Kubernetes cluster&lt;/a&gt; easy but also to ensure you have all the tools you need to run a production application on Kubernetes, such as &lt;a href="https://docs.convox.com/deployment/scaling"&gt;auto-scaling&lt;/a&gt; and &lt;a href="https://docs.convox.com/integrations/monitoring/datadog"&gt;monitoring&lt;/a&gt;. If you haven't already tried setting up a Kubernetes cluster with Convox, it only takes a few minutes and works on all major clouds, so give it a &lt;a href="https://docs.convox.com/getting-started/introduction"&gt;try&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>apm</category>
      <category>datadog</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring Applications on Kubernetes an Overview</title>
      <dc:creator>Cameron Gray</dc:creator>
      <pubDate>Thu, 06 Aug 2020 16:47:54 +0000</pubDate>
      <link>https://dev.to/camerondgray/monitoring-applications-on-kubernetes-an-overview-n91</link>
      <guid>https://dev.to/camerondgray/monitoring-applications-on-kubernetes-an-overview-n91</guid>
<description>&lt;p&gt;As many teams make the move to Kubernetes, one of the challenges is how to monitor your infrastructure and apps. Most of the monitoring tools people are already familiar with are available for Kubernetes, but they do require a slightly different approach. In this post my hope is to give those of you who are new to Kubernetes an overview of the core components as well as the key things to monitor. I will also cover some specific strategies for implementing a variety of third-party monitoring services. This is the first of a multi-part series and we will dig into specific examples in subsequent posts.&lt;/p&gt;

&lt;h1&gt;Kubernetes Architecture&lt;/h1&gt;

&lt;p&gt;If you are new to deploying onto Kubernetes it's helpful to have a basic understanding of how a Kubernetes cluster is structured before we dive into the details of monitoring. While a complete overview of Kubernetes is definitely outside of the scope of this post, I will try and cover the basics. For the simple case of deploying an application that runs in a Docker container using the typical &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/#using-pods"&gt;"one-container-per-Pod"&lt;/a&gt; model that is the most common for Kubernetes, you really only need to understand a few core concepts.&lt;/p&gt;

&lt;p&gt;The biggest philosophical difference between Kubernetes and more traditional deployment options such as dedicated servers, or virtual machines like EC2 is the often mentioned pets vs cattle analogy. While pets are beloved individuals that are generally thought of as unique and irreplaceable, cattle are generally thought of as a herd. You may not give much consideration to an individual cow, what is important is the total amount of cattle you have and if one is to expire for whatever reason it can be replaced by any other cow.&lt;/p&gt;

&lt;p&gt;In a typical server or VM based deployment, you may have a bunch of task specific instances, like the backend server or the mail server (pets). If you have ever had the one EC2 instance that's running your MySQL database suddenly fail you know exactly what I am talking about! Kubernetes takes a different approach and runs your cluster on a group of identical instances that can be terminated and replaced at any time (cattle). When you deploy onto a Kubernetes cluster you don't spend much time at all thinking about the individual instances. You are more concerned with the total capacity and load on your cluster as a whole. Kubernetes will automatically handle replacing any instances and you generally don't spend much, if any, time interacting with individual instances as an end user.&lt;/p&gt;

&lt;p&gt;Your cluster is going to be running on a group of underlying physical or virtual hosts, such as AWS EC2 instances, and Kubernetes refers to those hosts as Nodes. Every instance of a container that you deploy (in the simple and most common use case) is referred to as a Pod. When you deploy a Pod onto a cluster the Kubernetes scheduler will attempt to find a Node with sufficient resources (typically RAM and CPU) to host that Pod.  For example if you tell Kubernetes to deploy three copies of your main web service and two copies of your backend API you are telling Kubernetes to run five Pods in total and Kubernetes will figure out which Nodes to deploy those Pods on to without you needing to worry about it. &lt;/p&gt;

&lt;p&gt;There are two things you do need to worry about, however, when it comes to the Nodes in your Kubernetes cluster. The first is how the required resources for your individual Pods align with the available resources on any one Node, and the second is the total number of Nodes you will need to run to have the capacity required to meet your demand. If a single instance of your backend web service requires 8GB of memory and your hosts have 2GB of memory each, your cluster will be unable to host your application because no single Node can fit your Pod. Alternatively, if your backend web service only requires 1GB of memory and your hosts have 32GB of memory each, you can run almost 32 copies of your service (allowing room for overhead) on each Node.&lt;/p&gt;
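
&lt;p&gt;The arithmetic above can be sketched in a few lines. The 10% overhead reservation here is an assumed figure for illustration, not a Kubernetes default:&lt;/p&gt;

```javascript
// Rough bin-packing estimate: how many Pods of a given memory size fit
// on one Node, reserving a fraction of memory for system overhead.
function podsPerNode(nodeMemGb, podMemGb, overheadFraction) {
  const usable = nodeMemGb * (1 - overheadFraction);
  return Math.floor(usable / podMemGb);
}

// 1GB Pods on a 32GB Node with ~10% held back: 28 Pods per Node.
console.log(podsPerNode(32, 1, 0.1));
// An 8GB Pod can never fit on a 2GB Node.
console.log(podsPerNode(2, 8, 0.1));
```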

&lt;p&gt;While this seems like a very simple concept it gets a little more interesting when you are running multiple types of services (Pods) each with their own resource requirements and you start thinking about scaling up and scaling down to meet changing demand. The important concept to keep in mind is that it's much faster to find a Node with spare capacity and deploy an additional Pod there than it is to spin up a new Node to create additional capacity but it's also expensive to run more or larger Nodes than you need because you are paying for resources that you aren't using. &lt;/p&gt;

&lt;p&gt;Kubernetes will try and make the most efficient use possible of the available Node capacity which means if you have different types of Pods with different resource requirements (typically CPU and Memory) that need to be run they will not necessarily be evenly distributed across your Nodes. This is called Bin Packing, and is sometimes referred to as Kubernetes Tetris because of this great &lt;a href="https://youtu.be/u_iAXzy3xBA?t=1033"&gt;talk&lt;/a&gt; given by the always awesome Kelsey Hightower.  The important thing is that you don't need to concern yourself with which Pods are running on which Nodes and if a Node should fail, Kubernetes is going to automatically find a new place to put those pods and spin up more Nodes if necessary. There is obviously a lot more to Kubernetes under the hood but as an app developer this hopefully gives you a good overview.&lt;/p&gt;

&lt;h1&gt;What Do You Need to Monitor&lt;/h1&gt;

&lt;p&gt;Now that we have a basic understanding of how our apps run on a Kubernetes cluster, let's dig into what we need to keep an eye on once we have our apps deployed.&lt;/p&gt;

&lt;p&gt;There are many facets to monitoring a set of web applications and their underlying infrastructure, but for the purposes of this post I am going to focus on three main elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure Monitoring (monitor your Nodes)&lt;/li&gt;
&lt;li&gt;APM (monitor your apps)&lt;/li&gt;
&lt;li&gt;Log collection (consolidating your infrastructure and app logs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Infrastructure Monitoring&lt;/h2&gt;

&lt;p&gt;While I spent much of the previous section explaining why you don't need to concern yourself with individual Nodes, there are two important reasons to monitor the health of the Nodes in your cluster. First, only by monitoring all your individual Nodes will you be able to get a complete picture of the health and capacity of your cluster. While you don't care if one Node crashes and needs to be replaced, it is important to know if all your Nodes are running at 99% CPU usage. Second, if you have individual Nodes that are spiking or crashing, it's important to know which Pods are running on those Nodes. For example, if you have a bug in your application code that consumes excessive amounts of memory, you may only discover it if you see memory usage spiking on a Node and have a record of which Pods are running there.&lt;/p&gt;

&lt;p&gt;One of the easiest ways to keep an eye on all your Nodes is to run a monitoring agent, such as the &lt;a href="https://docs.datadoghq.com/agent/kubernetes/?tab=daemonset"&gt;Datadog Agent&lt;/a&gt;, as a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"&gt;DaemonSet&lt;/a&gt; on your cluster.  In its simplest and most common form, a DaemonSet is a way to instruct Kubernetes to run exactly one copy of a given container (Pod) on each of your Nodes. This is an ideal configuration for a monitoring agent application because it ensures that each Node in your cluster will be monitored. Once you have a monitoring agent running as a DaemonSet on your cluster you can start collecting important information such as memory and CPU usage on all your Nodes and therefore your cluster as a whole.&lt;/p&gt;
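
&lt;p&gt;A minimal DaemonSet manifest looks something like the sketch below. The image and names are placeholders; a real monitoring agent needs additional volumes, API keys, and RBAC, so treat this only as the structural shape:&lt;/p&gt;

```yaml
# Sketch of a DaemonSet: Kubernetes runs one copy of this Pod
# on every Node in the cluster.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      containers:
        - name: agent
          image: example/monitoring-agent:latest   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```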

&lt;p&gt;When running a monitoring agent like Datadog, it's important to make sure that your other containers are labeled with something that will help you identify the app running in each container. This ensures you can tie the load on your underlying Nodes back to the containers running on them. At Convox we automatically name all containers with the name of the app running in them, as well as adding labels to provide an even more granular view.&lt;/p&gt;

&lt;p&gt;With a monitoring agent running on each of your nodes you can start to make informed decisions about how to best configure your cluster for your particular demands. Scaling decisions in Kubernetes are a little different than in a typical containerized hosting environment. You need to consider the factors we mentioned earlier such as bin packing and headroom as well as understanding how quickly you will need to scale up or down which will significantly impact whether you want to run many smaller nodes or fewer larger nodes. Of course having a smart autoscaler &lt;a href="https://docs.convox.com/deployment/scaling"&gt;(such as Convox provides)&lt;/a&gt; can help a lot.&lt;/p&gt;

&lt;h2&gt;Application Performance Monitoring (APM)&lt;/h2&gt;

&lt;p&gt;APM is a concept that has been around for some time and like infrastructure monitoring the major APM providers (Datadog, New Relic, etc..) have solutions for most major hosting configurations including Kubernetes. &lt;/p&gt;

&lt;p&gt;While Infrastructure monitoring gives you an important view into the health and capacity of your underlying physical or virtual infrastructure, APM gives you insight into the health and performance of your applications. If you introduced a new bug in your application code that is causing crashes or slow response times, APM is typically the tool that you will use to find the cause of those issues. APM can often pinpoint the performance of individual modules or even lines of code. In production this type of monitoring can also often find the dreaded "works on my machine" problem where a particular application may work just fine on an individual developer's laptop but behaves completely differently in a production environment.&lt;/p&gt;

&lt;p&gt;Fortunately the same DaemonSet strategy that you use to deploy an infrastructure monitoring agent in your Kubernetes cluster works for APM. In fact, for many providers such as Datadog it is the same agent and you simply enable the APM features with a &lt;a href="https://docs.datadoghq.com/agent/kubernetes/apm/?tab=daemonset"&gt;configuration option&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Once you have your agent Pods running on all your Nodes you have two important steps to get an APM solution up and running. First you will need to add the actual APM instrumentation code to your application which is typically a fairly lightweight code change and all the major APM providers have libraries for most popular languages and frameworks. This will allow your application to start collecting trace data and now all you need is send that data to the Agent running on the local Node.&lt;/p&gt;

&lt;p&gt;This can be a little tricky and depends on how your Kubernetes cluster is configured and which APM service you are using. Once you have it all wired up, however, your application should be sending trace data to the agent running on its Node, and in turn the agent should be reporting that data back to your APM service. We will get into the details of configuring this in a follow-up post.&lt;/p&gt;

&lt;h2&gt;Log Collection&lt;/h2&gt;

&lt;p&gt;The final piece of the monitoring puzzle is to consolidate the logs from your applications and infrastructure into a single stream so you can get a detailed view, in real time if necessary, of everything that is happening in your cluster.&lt;/p&gt;

&lt;p&gt;While infrastructure and application monitoring provide you with a real-time view of what is happening, the logs provide the details and the historical view of everything that has happened in your environment. While not a perfect analogy, you can think of the monitors as the instrument cluster in an airplane cockpit and the logs as the black box in the tail that records everything that happens. Oftentimes when something goes wrong you will find you need to pore over the logs to really figure out what happened.&lt;/p&gt;

&lt;p&gt;The flaw in the airplane analogy is that monitoring applications and logs each have both real-time and after-the-fact benefits: they provide different views of your environment and applications, and often you will need to look at both to get a complete picture of what is happening or has happened. The key with logging is that every piece of your environment, from individual hosts, to the Kubernetes cluster components, up to your custom applications, spits out logs, and often you need a consolidated, cluster-level view of these logs to get a complete picture of what is going on.&lt;/p&gt;

&lt;p&gt;Kubernetes does not provide a built-in solution for cluster-level logging, although the documentation does provide some guidance on a few options for setting up cluster-level logging yourself. In what has become a common theme for this post, we once again find that leveraging a DaemonSet to run a logging agent on every node is a really good solution. If you are already running an Agent such as Datadog or &lt;a href="https://docs.logdna.com/docs/logdna-agent-kubernetes"&gt;LogDNA&lt;/a&gt; for the purposes of Node monitoring and/or APM, typically those agents will also provide a solution for gathering logs from stdout or from a file system logging driver, such as the Docker json logging driver, and directing those logs to an outside ingest for analysis. I will also dig into some detailed implementation examples in a follow up post. &lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Hopefully this post has given you some insight into the important aspects of monitoring your Kubernetes clusters and the applications you deploy onto them. While it can seem daunting at first, you can generally use the same tools you are already familiar with as long as you understand some basic strategies that are unique to Kubernetes. Watch this space for some follow up posts that dig into the details. Meanwhile, if you want to spin up your first Kubernetes cluster on any cloud and deploy an application in just a few clicks and a few minutes there is no easier way than &lt;a href="https://docs.convox.com/getting-started/introduction"&gt;Convox&lt;/a&gt; so give it a try!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
