<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Davide Berdin</title>
    <description>The latest articles on DEV Community by Davide Berdin (@spaghettifunk).</description>
    <link>https://dev.to/spaghettifunk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F217039%2F96c13f81-d75b-462a-ab7c-572f39adeade.jpeg</url>
      <title>DEV Community: Davide Berdin</title>
      <link>https://dev.to/spaghettifunk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/spaghettifunk"/>
    <language>en</language>
    <item>
      <title>Part 2: Monitoring, Logging and Alarming</title>
      <dc:creator>Davide Berdin</dc:creator>
      <pubDate>Sun, 08 Sep 2019 15:28:19 +0000</pubDate>
      <link>https://dev.to/spaghettifunk/part-2-monitoring-logging-and-alarming-1bc7</link>
      <guid>https://dev.to/spaghettifunk/part-2-monitoring-logging-and-alarming-1bc7</guid>
      <description>&lt;p&gt;I'm writing a series on how to build a Data Platform from scratch and it's time for Part 2! In &lt;a href="https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72"&gt;Part 1&lt;/a&gt; I explained how to start building your data platform. But, when your infrastructure grows, making sure that everything is working as expected becomes a challenge. And as one of my dearest colleague tells me all the time: "monitoring and logging is an art!".&lt;/p&gt;

&lt;p&gt;In Part 2, I want to show you how to set up a production-grade monitoring system for your infrastructure. Concepts and caveats are more valuable than copy-paste snippets of code, for the simple reason that it is important to understand why certain choices are made.&lt;/p&gt;

&lt;h2&gt;The tools&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72"&gt;Previously&lt;/a&gt;, I mentioned that we are using a bunch of tools. When I started listing all of them I thought: wow, so many things for something that looks relatively simple in concept. I mean, what we want to achieve is to &lt;strong&gt;be able to see logs and act if something happens&lt;/strong&gt;. Simple right? Well, not quite. To refer back to my colleague: monitoring and logging is definitely an art :)&lt;/p&gt;

&lt;p&gt;Let's start by dividing the problem into sub-problems. In school they taught us that when a problem is too big, we have to &lt;code&gt;divide&lt;/code&gt; it and &lt;code&gt;conquer&lt;/code&gt; it (aka &lt;code&gt;divide et impera&lt;/code&gt; if you are into Latin).&lt;/p&gt;

&lt;h3&gt;Problem 1: Logging&lt;/h3&gt;

&lt;p&gt;Since we want to understand what is going on in our infrastructure and applications, we should tackle logging first. Generally speaking, an application writes logs to &lt;strong&gt;STDOUT&lt;/strong&gt;. Since we are using Kubernetes, we can read those logs through the &lt;code&gt;logging&lt;/code&gt; facilities that come with Kubernetes itself. And because we can read the logs, we can also collect them. How? As I mentioned in the &lt;a href="https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72"&gt;previous article&lt;/a&gt;, we use &lt;a href="https://www.elastic.co/products/elasticsearch"&gt;Elasticsearch&lt;/a&gt; for indexing the logs, and to collect them we use &lt;a href="https://github.com/fluent/fluent-bit/"&gt;Fluent-Bit&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Solution 1: Fluent Bit&lt;/h4&gt;

&lt;p&gt;Fluent Bit collects information from different sources, buffers it, and dispatches it to different outputs. The &lt;a href="https://github.com/helm/charts/tree/master/stable/fluent-bit"&gt;helm chart&lt;/a&gt; we use runs it as a &lt;code&gt;daemon set&lt;/code&gt; in Kubernetes (more info about daemon sets &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"&gt;here&lt;/a&gt;). This guarantees that there is one instance of Fluent Bit per machine in the Kubernetes cluster. The process collects each pod's &lt;code&gt;standard output&lt;/code&gt; and redirects it towards another system. In our case we chose &lt;strong&gt;Kafka&lt;/strong&gt;.&lt;/p&gt;
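&lt;p&gt;As a sketch of what this looks like, here is a hedged excerpt of the Fluent Bit helm chart's &lt;code&gt;values.yaml&lt;/code&gt; with a Kafka output. The broker addresses and topic name are placeholders, and the exact keys may differ between chart versions, so double-check against the chart you install.&lt;/p&gt;

```yaml
# values.yaml for the stable/fluent-bit helm chart (sketch, not verbatim)
backend:
  type: kafka                # ship records to Kafka instead of the default output
  kafka:
    brokers: "kafka-0.kafka:9092,kafka-1.kafka:9092"   # placeholder broker list
    topics: "cluster-logs"                             # placeholder topic name
input:
  tail:
    parser: docker           # parse the JSON log lines the container runtime writes
```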

&lt;p&gt;Now our logs are safely sent to a Kafka topic, ready to be consumed by something else.&lt;/p&gt;

&lt;h3&gt;Problem 2: System Metrics&lt;/h3&gt;

&lt;p&gt;Applications write logs to the standard output, but &lt;code&gt;machines&lt;/code&gt; don't write logs, right? So, how do I know how a machine is behaving? How do I know if the &lt;code&gt;CPU&lt;/code&gt; is sky-rocketing, or whether the &lt;code&gt;I/O operations&lt;/code&gt; on the disk are the bottleneck for my application? Well, that's where &lt;a href="https://github.com/prometheus/node_exporter"&gt;Node-Exporter&lt;/a&gt; comes into play.&lt;/p&gt;

&lt;h3&gt;Solution 2: Node Exporter&lt;/h3&gt;

&lt;p&gt;Node Exporter collects metrics from the underlying Operating System. This is powerful because now we can gather the system information we need. Once again, there is a &lt;a href="https://github.com/helm/charts/tree/master/stable/prometheus-node-exporter"&gt;helm chart&lt;/a&gt; coming to the rescue.&lt;/p&gt;
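&lt;p&gt;For reference, a minimal Prometheus scrape job for node-exporter can be sketched as follows. Port &lt;code&gt;9100&lt;/code&gt; is node-exporter's default; the configuration the Prometheus operator generates for you is more involved than this hand-written version.&lt;/p&gt;

```yaml
# prometheus.yml fragment (sketch): scrape node-exporter on every node
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node                     # discover every node in the cluster
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'            # discovered address points at the kubelet
        replacement: '${1}:9100'       # rewrite it to node-exporter's default port
        target_label: __address__
```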

&lt;h3&gt;Problem 3: Application Metrics&lt;/h3&gt;

&lt;p&gt;Cool, but what if an application can give me more information than simple logging? For example, what if my database can report its current &lt;code&gt;memory consumption&lt;/code&gt; or &lt;code&gt;average query latency&lt;/code&gt;? That's harder: these are neither logs nor metrics coming from a machine, yet they are available for us to use. That's when &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; enters the arena.&lt;/p&gt;

&lt;h3&gt;Solution 3: Prometheus&lt;/h3&gt;

&lt;p&gt;Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. &lt;strong&gt;BINGO&lt;/strong&gt;. This sounds like a tool that will do a lot of things for us. But where does it stand in the big picture? Let's take a look at the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xwo2LbXV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/prometheus_kubernetes_diagram_overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xwo2LbXV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/prometheus_kubernetes_diagram_overview.png" alt="Prometheus" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I took this picture from &lt;a href="https://sysdig.com/blog/kubernetes-monitoring-prometheus/"&gt;this&lt;/a&gt; great article, and it describes clearly what Prometheus does. Basically, its role is to &lt;strong&gt;pull&lt;/strong&gt; and &lt;strong&gt;push&lt;/strong&gt; information. What is most relevant to know is that Prometheus standardized the format in which metrics are exposed, so that they can be parsed and, at a later stage, queried.&lt;/p&gt;

&lt;p&gt;Because our data platform has multiple Kubernetes clusters (remember controlplane, production, development, etc.), Prometheus needs to be installed in all of them. Thanks to the awesome community of developers, there is a &lt;a href="https://github.com/helm/charts/tree/master/stable/prometheus-operator"&gt;helm chart&lt;/a&gt; that we can use. This operator also allows us to run Prometheus in &lt;a href="https://prometheus.io/docs/prometheus/latest/federation/"&gt;&lt;code&gt;federation&lt;/code&gt;&lt;/a&gt; mode, which is &lt;em&gt;very important&lt;/em&gt; in this context. Federation allows the Prometheus in controlplane to &lt;code&gt;scrape&lt;/code&gt; the information from the other Prometheus services, so that we can centralize all the metrics in one place.&lt;/p&gt;
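&lt;p&gt;A federation job on the controlplane Prometheus looks roughly like the following; the downstream hostnames are placeholders, and in production you would narrow the &lt;code&gt;match[]&lt;/code&gt; selector instead of pulling every series.&lt;/p&gt;

```yaml
# prometheus.yml fragment on controlplane (sketch): scrape the /federate
# endpoint of each downstream Prometheus
scrape_configs:
  - job_name: federate
    honor_labels: true            # keep the labels set by the source Prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'           # pull every job; narrow this in production
    static_configs:
      - targets:                  # placeholder addresses of the other clusters
          - prometheus.production.example.com
          - prometheus.development.example.com
```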

&lt;h3&gt;Problem 4: Fetching the logs&lt;/h3&gt;

&lt;p&gt;We decided to create &lt;code&gt;controlplane&lt;/code&gt; to centralize the information about the other environments and have an overview of what is going on in our platform. Since we pushed our logs into Kafka, we now need to consume them and store them in a format that is readable to humans.&lt;/p&gt;

&lt;p&gt;There is a famous acronym, &lt;code&gt;ELK&lt;/code&gt;, which stands for Elasticsearch, Logstash, Kibana. So far we have mentioned the &lt;strong&gt;E&lt;/strong&gt; and the &lt;strong&gt;K&lt;/strong&gt; but never the &lt;strong&gt;L&lt;/strong&gt;. Well, that time has just arrived.&lt;/p&gt;

&lt;h3&gt;Solution 4: Logstash&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/products/logstash"&gt;Logstash&lt;/a&gt; is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favourite "stash." This is part of the Elastic suit, and it is a fundamental piece for making sure that we are able to have the same type of logging for everything that comes in. &lt;/p&gt;

&lt;p&gt;Our input is the Kafka topic we mentioned before, and our output is Elasticsearch, where the data will be indexed and "stashed". The &lt;a href="https://github.com/helm/charts/blob/master/stable/logstash"&gt;helm chart&lt;/a&gt; helps you install the application, and by modifying this &lt;a href="https://github.com/helm/charts/blob/master/stable/logstash/values.yaml#L286"&gt;part&lt;/a&gt; of the &lt;code&gt;values.yml&lt;/code&gt; you can easily read from Kafka. The major issue we found was with the &lt;code&gt;@timestamp&lt;/code&gt; field: we had to adapt the &lt;code&gt;values.yml&lt;/code&gt; a little to avoid issues in parsing the timestamp.&lt;/p&gt;

&lt;p&gt;The fix boils down to adding a &lt;code&gt;date&lt;/code&gt; filter to the Logstash pipeline so that the timestamp is parsed with the correct timezone.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have to adjust the timezone to match your setup; this mismatch was the main reason why our data wasn't being ingested into Elasticsearch correctly.&lt;/p&gt;
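&lt;p&gt;A sketch of such a filter is shown below; the source field name, the date format and the timezone are assumptions you will need to adapt to your own log format.&lt;/p&gt;

```
# Logstash pipeline fragment (sketch): normalize the @timestamp field
filter {
  date {
    match => ["time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"]  # assumed source field and format
    timezone => "Europe/Amsterdam"                   # change to your timezone
    target => "@timestamp"
  }
}
```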

&lt;h3&gt;Visualize All&lt;/h3&gt;

&lt;p&gt;We just finished covering the logging part, but how do we visualize everything? There are two main applications: &lt;a href="https://www.elastic.co/products/kibana"&gt;Kibana&lt;/a&gt; and &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt;. We use Kibana to explore all the logs coming in from all the applications. Without it, debugging would be extremely hard: searching for what is going on with a &lt;code&gt;kubectl logs -f &amp;lt;pod-name&amp;gt; | grep whatever-error&lt;/code&gt; gets old very quickly :)&lt;/p&gt;

&lt;p&gt;Grafana helps in visualizing all the metrics coming in from Prometheus. There are a ton of &lt;a href="https://grafana.com/grafana/dashboards"&gt;pre-made dashboards&lt;/a&gt; that you can just install and use. The only thing you need to do is to set up the &lt;code&gt;prometheus&lt;/code&gt; installed in &lt;code&gt;controlplane&lt;/code&gt; as a &lt;code&gt;data-source&lt;/code&gt; in Grafana, and that's it. All the metrics will &lt;code&gt;automagically&lt;/code&gt; be available to you.&lt;/p&gt;
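&lt;p&gt;If you prefer provisioning the data source from a file rather than clicking through the UI, a minimal sketch (the URL is a placeholder for your controlplane Prometheus service) looks like this:&lt;/p&gt;

```yaml
# grafana provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
  - name: controlplane-prometheus
    type: prometheus
    access: proxy                                  # Grafana proxies the queries
    url: http://prometheus.controlplane.svc:9090   # placeholder in-cluster URL
    isDefault: true
```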

&lt;h3&gt;Problem 5: Alarming&lt;/h3&gt;

&lt;p&gt;This is the toughest part of the process. Once again, the concept is simple, but deciding the thresholds at which an alarm should fire is difficult, and they need to be tuned along the way. I would recommend starting from this &lt;a href="https://awesome-prometheus-alerts.grep.to/rules.html"&gt;&lt;strong&gt;awesome&lt;/strong&gt;&lt;/a&gt; website and building the rules that are important for your data platform.&lt;/p&gt;
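&lt;p&gt;As a concrete starting point, a high-CPU rule in Prometheus format can look like this; the threshold and durations are examples to tune for your own platform.&lt;/p&gt;

```yaml
# prometheus rule file (sketch): alert when average CPU usage stays above 80%
groups:
  - name: node-alerts
    rules:
      - alert: HostHighCpuLoad
        # 100 minus the idle percentage = CPU usage percentage per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m                      # only fire if sustained for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
```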

&lt;h3&gt;Solution 5: Slack and PagerDuty&lt;/h3&gt;

&lt;p&gt;I can't help you with building the rules that you need for your data platform, but I can give you advice on selecting the right tool to notify you and your team when an alarm is triggered. My suggestion is to send notifications to &lt;a href="https://prometheus.io/docs/alerting/configuration/#slack_config"&gt;Slack&lt;/a&gt; for the alarms that you consider "minor" (I leave the definition of minor up to you). We only send a Slack notification for the &lt;code&gt;development&lt;/code&gt; environment and for those applications that are not public in &lt;code&gt;production&lt;/code&gt; yet.&lt;/p&gt;
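&lt;p&gt;In Alertmanager terms, routing the minor alerts to Slack is only a few lines of configuration; the webhook URL and channel below are placeholders.&lt;/p&gt;

```yaml
# alertmanager.yml fragment (sketch)
route:
  receiver: slack-minor
  routes:
    - match:
        severity: warning              # send only the "minor" alerts to Slack
      receiver: slack-minor
receivers:
  - name: slack-minor
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#data-platform-alerts'
        send_resolved: true            # also notify when the alert clears
```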

&lt;p&gt;For production systems we use &lt;a href="https://www.pagerduty.com"&gt;PagerDuty&lt;/a&gt; to create an on-call rotation among team members and make sure that everything is always up-and-running. There is a great &lt;a href="https://www.pagerduty.com/docs/guides/prometheus-integration-guide/"&gt;integration&lt;/a&gt; with Prometheus that I highly recommend setting up.&lt;/p&gt;

&lt;p&gt;Grafana also helps with &lt;a href="https://grafana.com/docs/alerting/rules/"&gt;alerting&lt;/a&gt;, but we haven't used it for that yet. It looks awesome though. If you've been using Grafana this way, it would be great if you could share your experience in the comments below :)&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this long blog post I gave you an overview of the tools that my team and I are using for our data platform. I hope this gave you more ideas on how to start! You will encounter problems along the way, because "Rome wasn't built in a day", but I do hope you now have all the information you need to collect logs, visualize metrics and receive alarms in your data platform. And remember my colleague's motto: &lt;code&gt;logging and monitoring is an art&lt;/code&gt; :)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>elk</category>
      <category>prometheus</category>
      <category>devops</category>
    </item>
    <item>
      <title>Part 1: How to build a data platform from scratch</title>
      <dc:creator>Davide Berdin</dc:creator>
      <pubDate>Sat, 24 Aug 2019 20:50:06 +0000</pubDate>
      <link>https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72</link>
      <guid>https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72</guid>
      <description>&lt;p&gt;Have you ever tried to build a data platform from scratch? Do you know where to start, what to look for? And most importantly: what &lt;em&gt;not&lt;/em&gt; to do? Fear no more! Learn how to build the perfect data platform with these series of articles :) This article will be the first part of a series, that will give you an inside in my journey towards building a data platform from scratch. By sharing my experience, the lessons learned while developing it, the advantages and disadvantages of certain design choices, and my goals, I hope to make your journey less bumpy.&lt;/p&gt;

&lt;p&gt;Whether you are given the task to build a data platform, or you decide to do this in your own time: you aim high with your ambitions - or at least that's what I do. Having ambition is great, and it is necessary in order to push yourself a little bit further. That's why, when you start designing something, it's important to look further than the initial &lt;em&gt;requirements&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Without further ado, let's start this journey by discussing the motivations I had behind creating a Data Platform for my current job.&lt;/p&gt;

&lt;h2&gt;Motivations (aka Requirements)&lt;/h2&gt;

&lt;p&gt;I work in the Media industry, and the company I work for doesn't have a Tech department like Netflix, GitHub or Twitter. We might not be considered a 'tech company', but we have a team of great engineers who are as motivated as I am to create something nice.&lt;/p&gt;

&lt;p&gt;One day, our manager came in and said that our company had decided to build a Data Platform from scratch. The requirements we got were the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It needs to be scalable&lt;/li&gt;
&lt;li&gt;It needs to ingest a lot of data&lt;/li&gt;
&lt;li&gt;It needs to be able to query very quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's "all"!&lt;/p&gt;

&lt;p&gt;The engineering team had a discussion on what type of technologies we wanted to use, and what the architecture should look like. We decided to focus a lot on &lt;strong&gt;Kubernetes&lt;/strong&gt;, but the choice wasn't straightforward. A large part of the team had never worked with Kubernetes; they had more experience with AWS Lambda and API Gateway. While to me Kubernetes was a natural choice, for other people it was a bit scary.&lt;/p&gt;

&lt;p&gt;In the end we collectively agreed that Kubernetes would serve our needs better, especially in terms of scalability and deployment of our Dockerized applications. Using it meant that we had to migrate tons of services from ECS and other EC2 instances to Kubernetes.&lt;/p&gt;

&lt;h2&gt;Initial architecture&lt;/h2&gt;

&lt;p&gt;After choosing Kubernetes, we created a bunch of bash files and &lt;strong&gt;Terraform&lt;/strong&gt; scripts to get started. Within the engineering team, there was a discussion on how to tackle the automation, and we decided to build a "1-click-deployment" system.&lt;/p&gt;

&lt;p&gt;This "1-click-deployment" system, consisted of a Docker image that contained all the packages we needed in order to generate the entire infrastructure of the Data Platform. After that we created the initial skeleton based on this awesome &lt;a href="https://blog.gruntwork.io/how-to-create-reusable-infrastructure-with-terraform-modules-25526d65f73d" rel="noopener noreferrer"&gt;article&lt;/a&gt; of Yevgeniy Brikman from Gruntwork. Our result looked similar to this &lt;a href="https://github.com/spaghettifunk/cluster-example" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; When there are a lot of "moving" parts within your infrastructure, it's good practice to automate it and to rely on tools that help you and your team put all the pieces together with minimum effort. The picture below shows what the initial architecture looked like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F3cKjcxf%2Finit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F3cKjcxf%2Finit.png" alt="Initial Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was quite simple for us; we needed to &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks" rel="noopener noreferrer"&gt;create a Kubernetes cluster&lt;/a&gt; and add a &lt;a href="https://github.com/nginxinc/kubernetes-ingress/blob/master/docs/installation.md" rel="noopener noreferrer"&gt;Load Balancer via NGINX&lt;/a&gt;. Then we were ready to move to the next step.&lt;/p&gt;

&lt;h2&gt;Drama was about to come&lt;/h2&gt;

&lt;p&gt;We only used one cluster, and we immediately had to start thinking about how we were going to deploy applications. Then other questions popped up: how do we create a &lt;strong&gt;Staging&lt;/strong&gt; environment, or a &lt;strong&gt;Testing&lt;/strong&gt; environment? The initial choice was to scale the cluster with a few more machines and leverage &lt;strong&gt;Kubernetes Namespaces&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The advantage of using Namespaces is that you can &lt;code&gt;"isolate"&lt;/code&gt; your applications in a sort of &lt;code&gt;box&lt;/code&gt;. Isolation is what we were trying to achieve, so that we could distinguish the &lt;code&gt;environments&lt;/code&gt;. Doing this, however, created an issue with naming and with how we were going to deploy the applications. For example, let's assume the following: we have two APIs and one WebApp. All three need a Testing and a Staging environment, so that developers can safely deploy their code before going to production. Because we decided to use Namespaces, we first tried to create three of them: Production, Staging, Testing. There are some drawbacks to putting all the applications under the same namespace. The first is that it becomes much easier to delete everything: the Kubernetes CLI (&lt;code&gt;kubectl&lt;/code&gt;) can delete namespaces, and if a developer issued the command &lt;code&gt;kubectl delete ns &amp;lt;env&amp;gt;&lt;/code&gt;, all the applications under that namespace would be gone. The second problem is the &lt;code&gt;isolation&lt;/code&gt; itself: the reason why namespacing was created in the first place was to partition the cluster into &lt;code&gt;smaller virtual clusters&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We tried another approach: we create a namespace per application per environment. For example we would have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;production_api1&lt;/li&gt;
&lt;li&gt;staging_api1&lt;/li&gt;
&lt;li&gt;testing_api1&lt;/li&gt;
&lt;li&gt;production_webapp&lt;/li&gt;
&lt;li&gt;staging_webapp&lt;/li&gt;
&lt;li&gt;testing_webapp&lt;/li&gt;
&lt;li&gt;and so on...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can clearly see that this would pollute the Kubernetes cluster with tons of namespaces. The following &lt;a href="https://kubernetes.io/blog/2016/08/kubernetes-namespaces-use-cases-insights/" rel="noopener noreferrer"&gt;article&lt;/a&gt; gave us a lot of insights on how to use namespaces in a better way. Although the article states that you can use both of the approaches I described above, it also highlights the anti-patterns of such approaches. For us, those solutions didn't work well.&lt;/p&gt;

&lt;p&gt;What did we do? The solution was simple: multiple Kubernetes clusters.&lt;/p&gt;

&lt;h2&gt;Multiple clusters&lt;/h2&gt;

&lt;p&gt;As mentioned above, we initially decided to adopt the "1-click-deployment" strategy, which means that our code-base was ready for deploying as many clusters as we wanted. And so we did. The second architecture we pulled off looks like the picture below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F9ymV46J%2Fsecond.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F9ymV46J%2Fsecond.png" alt="Multiple clusters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the picture, we have three clusters. In this way we could deploy applications depending on the environment we wanted to target.&lt;/p&gt;

&lt;p&gt;The whole team was very hyped and we thought that we had nailed it. But with great power comes great responsibility. Responsibility here means &lt;strong&gt;logging&lt;/strong&gt; and &lt;strong&gt;monitoring&lt;/strong&gt;. With three clusters, understanding what is going on is rather challenging. Yet again we had a problem to tackle.&lt;/p&gt;

&lt;p&gt;In a distributed environment with multiple applications, understanding what is happening is crucial. In fact, being able to quickly debug your application in Development or understand what caused an eventual error in Production is the fundamental step for keeping control over your systems.&lt;/p&gt;

&lt;p&gt;But logging is not sufficient. Machines in general don't write logs; they expose metrics about their hardware. Collecting and visualizing that information allows you and your team to set up &lt;strong&gt;alarms&lt;/strong&gt; based on certain rules. A rule can be "if the average CPU usage is above 80%, trigger an alarm". The alarm itself can be an email or a Slack message to your team. Alarms will prevent your machines from reaching undesirable states.&lt;/p&gt;

&lt;h2&gt;Controlplane to the rescue&lt;/h2&gt;

&lt;p&gt;When you have a lot of applications, different environments, and a team of engineers and data scientists eager to create, test, and put applications or models in production, you need a solid monitoring and logging system. But how do we get one? Yet again the answer was simple: another Kubernetes cluster :)&lt;/p&gt;

&lt;p&gt;Controlplane was a fancy name we took from another team within the company. The sole purpose of this cluster is to collect metrics and logs from the other clusters and centralize the visualization of that information. At this point the architecture looks like the following.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FmzD23s9%2Fthird.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FmzD23s9%2Fthird.png" alt="Controlplane system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On each cluster we set up &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/prometheus/node_exporter" rel="noopener noreferrer"&gt;&lt;strong&gt;Node-exporter&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/kubernetes/kube-state-metrics" rel="noopener noreferrer"&gt;&lt;strong&gt;Kube-state-metrics&lt;/strong&gt;&lt;/a&gt; to expose all the metrics of the cluster. To collect that information and send it out of the cluster, we used &lt;a href="https://github.com/fluent/fluent-bit" rel="noopener noreferrer"&gt;&lt;strong&gt;Fluent-Bit&lt;/strong&gt;&lt;/a&gt;. All the metrics and logs were redirected towards a &lt;strong&gt;Kafka Topic&lt;/strong&gt;. In this way we were able to fetch them from the controlplane.&lt;/p&gt;

&lt;p&gt;In controlplane we installed &lt;a href="https://www.elastic.co/products/logstash" rel="noopener noreferrer"&gt;&lt;strong&gt;Logstash&lt;/strong&gt;&lt;/a&gt; and connected it to the &lt;strong&gt;Kafka Topic&lt;/strong&gt; mentioned above so it could start fetching the logs. Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to an output.&lt;/p&gt;

&lt;p&gt;Now that Logstash has access to the logs we just need to redirect them to the &lt;a href="https://github.com/elastic/elasticsearch" rel="noopener noreferrer"&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt;&lt;/a&gt; service and visualize them with &lt;a href="https://github.com/elastic/kibana" rel="noopener noreferrer"&gt;&lt;strong&gt;Kibana&lt;/strong&gt;&lt;/a&gt;. For the metrics we put in place &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/a&gt; and created a bunch of Dashboards for visualizing the status of each system.&lt;/p&gt;

&lt;p&gt;With the controlplane in place we had full visualization and control of every single aspect of each individual cluster. The controlplane was a win-win solution.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this first part I explained how we started designing the Data Platform. This article focused mostly on the underlying infrastructure and the problems that we had to overcome.&lt;/p&gt;

&lt;p&gt;The lesson I learned during this part of the development is to think ahead much more than when I design software. Software can be changed more rapidly than infrastructure. When designing an infrastructure there are more variables to take into account, and the choices you make can have a big impact on the final result. Long story short: think first and take your time. Rushing to get things out quickly will not always help you.&lt;/p&gt;

&lt;p&gt;In the next article I am going to talk about how applications are deployed and how the team interacts with this architecture.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>design</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to run Apache Druid in Kubernetes using Terraform</title>
      <dc:creator>Davide Berdin</dc:creator>
      <pubDate>Sat, 24 Aug 2019 12:01:51 +0000</pubDate>
      <link>https://dev.to/spaghettifunk/how-to-run-apache-druid-in-kubernetes-using-terraform-3ia7</link>
      <guid>https://dev.to/spaghettifunk/how-to-run-apache-druid-in-kubernetes-using-terraform-3ia7</guid>
      <description>&lt;p&gt;In this article I would like to show how I created a Terraform module where I could deploy Apache Druid on Kubernetes. This is not production-ready but tests with Real-time ingestion have been quite successful. Hopefully, my experience will be helpful to you!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First things first: what is Apache Druid?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3128%2F1%2Ah9cBMCb1QtVU5VsYx_hRBg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3128%2F1%2Ah9cBMCb1QtVU5VsYx_hRBg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Druid is a high performance real-time analytics database. Originally created at &lt;a href="https://metamarkets.com" rel="noopener noreferrer"&gt;Metamarkets&lt;/a&gt; and later donated to the Apache Foundation, Druid has been designed to be fault-tolerant, blazing fast, horizontally scalable and many other things. It’s a big and complex project that requires some time in order to master it properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prepare for scaling!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest challenges I faced when working with Druid was to deploy the system in such a way that I could easily scale it. I started by creating a fleet of machines in an AWS VPC, then on each instance I started the service that was most suitable for that type of node. For example, for the &lt;a href="https://druid.apache.org/docs/latest/tutorials/cluster.html#query-server" rel="noopener noreferrer"&gt;Broker&lt;/a&gt; I selected a &lt;em&gt;general-purpose&lt;/em&gt; type, whereas for the &lt;a href="https://druid.apache.org/docs/latest/tutorials/cluster.html#data-server" rel="noopener noreferrer"&gt;Historical&lt;/a&gt; I used a &lt;em&gt;storage-optimized&lt;/em&gt; machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2Aw1THsy_Ablf9rzUU9NotAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2Aw1THsy_Ablf9rzUU9NotAA.png" alt="Druid services"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all the machines and services were up and running, I had to make sure that all the components were able to &lt;em&gt;connect&lt;/em&gt; to each other and could exchange information. I did this by checking the logs of each service; when something could not connect, it usually pointed to a networking mistake or a misconfiguration. To plumb everything together and automate it, I chose &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; in combination with &lt;a href="https://www.packer.io/" rel="noopener noreferrer"&gt;Packer&lt;/a&gt; and &lt;a href="https://www.vagrantup.com/" rel="noopener noreferrer"&gt;Vagrant&lt;/a&gt;. After days of trial (and error!), I managed to get the Druid cluster up and running. However, automatic scaling was still an issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes at Rescue
&lt;/h2&gt;

&lt;p&gt;One thing I like about Kubernetes is its ability to horizontally scale pods when the load rises above a certain threshold. The cluster I’m using can also autoscale the machines themselves when no more resources are available, which makes Kubernetes a good candidate for deploying Druid.&lt;/p&gt;
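&lt;p&gt;As a sketch of what pod-level autoscaling could look like (the deployment name and thresholds here are illustrative, not part of my module), a HorizontalPodAutoscaler for the Broker might be defined as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: scale a "druid-broker" Deployment on CPU load
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: druid-broker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: druid-broker
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
&lt;/code&gt;&lt;/pre&gt;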

&lt;p&gt;&lt;strong&gt;Cluster configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once again, I used Terraform to deploy the infrastructure. Since Terraform has a Kubernetes provider available, I decided to set up some of the services via Terraform as well, and Druid is one of them. All the other applications, such as APIs, web apps, etc., are initially deployed manually and then updated via the CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;I deployed an &lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;Amazon EKS&lt;/a&gt; cluster with different &lt;strong&gt;node-groups&lt;/strong&gt; so that I could deploy Druid only on particular nodes. As mentioned above, I needed different types of machines depending on the service. The &lt;em&gt;autoscaler&lt;/em&gt; application adds or removes machines based on the load on the cluster. An example of how to create such a cluster can be found &lt;a href="https://github.com/spaghettifunk/cluster-example" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I started the cluster with 8 machines divided into two node-groups. In this way I could deploy Druid, Zookeeper and Postgres on different nodes and keep the setup “&lt;em&gt;fault-tolerant&lt;/em&gt;” (there is an open &lt;a href="https://github.com/spaghettifunk/druid-terraform/issues/8" rel="noopener noreferrer"&gt;issue&lt;/a&gt; for the pod-affinity).&lt;/p&gt;
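&lt;p&gt;To give an idea of what a dedicated node-group looks like (the names, instance type and sizes below are assumptions for illustration, not the exact values from the linked example), a storage-optimized group for the Historical service could be declared in Terraform along these lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical node-group reserved for the Historical service
resource "aws_eks_node_group" "druid_historical" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "druid-historical"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  # storage-optimized instances for the Historical nodes
  instance_types = ["i3.2xlarge"]

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 6
  }

  # label the nodes so Druid pods can be scheduled onto this group only
  labels = {
    "druid/node-type" = "historical"
  }
}
&lt;/code&gt;&lt;/pre&gt;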

&lt;h2&gt;
  
  
  Time to put this in action!
&lt;/h2&gt;

&lt;p&gt;The module I created can be imported by letting Terraform download it via &lt;strong&gt;git&lt;/strong&gt;. The module assumes that Terraform is able to deploy to Kubernetes directly. Here is an example:&lt;/p&gt;
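&lt;p&gt;A minimal sketch of what the import could look like (the module may expect additional input variables; check the repository’s README for the exact ones):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Terraform must be able to reach the Kubernetes cluster directly
provider "kubernetes" {
  config_path = "~/.kube/config"
}

# Let Terraform download the module via git
module "druid" {
  source = "git::https://github.com/spaghettifunk/druid-terraform.git"
}
&lt;/code&gt;&lt;/pre&gt;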

&lt;p&gt;After importing it, you can run &lt;code&gt;terraform plan&lt;/code&gt; and (hopefully!) &lt;code&gt;terraform apply&lt;/code&gt; on your cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that Druid is set up in Kubernetes, the system can scale automatically without manual intervention. The module I created needs some help to make it better and &lt;strong&gt;production-ready&lt;/strong&gt;, but I think it’s a good start if you are planning to use Druid.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Contributions to the &lt;a href="https://github.com/spaghettifunk/druid-terraform" rel="noopener noreferrer"&gt;project&lt;/a&gt; are very welcome 🎉 Feel free to open up a &lt;strong&gt;PR&lt;/strong&gt; or an &lt;strong&gt;Issue&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are a couple of caveats to keep in mind. For example, if you configure the &lt;a href="https://druid.apache.org/docs/latest/dependencies/deep-storage.html" rel="noopener noreferrer"&gt;Deep Storage&lt;/a&gt; without using S3 or HDFS, you may have trouble retaining the data (I do not recommend using a PVC for this purpose).&lt;/p&gt;

&lt;p&gt;Scaling up the &lt;a href="https://druid.apache.org/docs/latest/design/historical.html" rel="noopener noreferrer"&gt;Historical&lt;/a&gt; nodes is generally not a problem, but scaling down could potentially lead to either losing information or forcing Druid to re-index the ingested data. Make sure that you have good scaling rules for this service.&lt;/p&gt;

&lt;p&gt;If you have any questions related to this topic, I’m happy to help! You can find me on &lt;a href="https://twitter.com/davideberdin" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and &lt;a href="https://github.com/spaghettifunk" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>druid</category>
      <category>terraform</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
