<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Davide Berdin</title>
    <description>The latest articles on DEV Community by Davide Berdin (@spaghettifunk).</description>
    <link>https://dev.to/spaghettifunk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F217039%2F96c13f81-d75b-462a-ab7c-572f39adeade.jpeg</url>
      <title>DEV Community: Davide Berdin</title>
      <link>https://dev.to/spaghettifunk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/spaghettifunk"/>
    <language>en</language>
    <item>
      <title>Part 2: Monitoring, Logging and Alarming</title>
      <dc:creator>Davide Berdin</dc:creator>
      <pubDate>Sun, 08 Sep 2019 15:28:19 +0000</pubDate>
      <link>https://dev.to/spaghettifunk/part-2-monitoring-logging-and-alarming-1bc7</link>
      <guid>https://dev.to/spaghettifunk/part-2-monitoring-logging-and-alarming-1bc7</guid>
      <description>&lt;p&gt;I'm writing a series on how to build a Data Platform from scratch and it's time for Part 2! In &lt;a href="https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72"&gt;Part 1&lt;/a&gt; I explained how to start building your data platform. But, when your infrastructure grows, making sure that everything is working as expected becomes a challenge. And as one of my dearest colleague tells me all the time: "monitoring and logging is an art!".&lt;/p&gt;

&lt;p&gt;In Part 2, I want to show you how to set up a production-grade monitoring system for your infrastructure. Concepts and caveats are more valuable than copy-paste snippets of code, for the simple reason that it is important to understand why certain choices are made.&lt;/p&gt;

&lt;h2&gt;The tools&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72"&gt;Previously&lt;/a&gt;, I mentioned that we are using a bunch of tools. When I started listing all of them I thought: wow, so many things for something that looks relatively simple in concept. I mean, what we want to achieve is to &lt;strong&gt;be able to see logs and act if something happens&lt;/strong&gt;. Simple right? Well, not quite. To refer back to my colleague: monitoring and logging is definitely an art :)&lt;/p&gt;

&lt;p&gt;Let's start by dividing the problem into sub-problems. In school they taught us that when a problem is too big, we have to &lt;code&gt;divide&lt;/code&gt; it and &lt;code&gt;conquer&lt;/code&gt; it (aka &lt;code&gt;divide et impera&lt;/code&gt; if you are into Latin).&lt;/p&gt;

&lt;h3&gt;Problem 1: Logging&lt;/h3&gt;

&lt;p&gt;Since we want to understand what is going on in our infrastructure and applications, we should tackle logging first. Generally speaking, an application writes logs to &lt;strong&gt;STDOUT&lt;/strong&gt;. Since we are using Kubernetes, we can read those logs through the &lt;code&gt;logging&lt;/code&gt; facilities that come with Kubernetes itself. And because we can read the logs, we can also collect them. How? As I mentioned in the &lt;a href="https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72"&gt;previous article&lt;/a&gt;, we use &lt;a href="https://www.elastic.co/products/elasticsearch"&gt;Elasticsearch&lt;/a&gt; for indexing the logs, and to collect them we use &lt;a href="https://github.com/fluent/fluent-bit/"&gt;Fluent-Bit&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Solution 1: Fluent Bit&lt;/h4&gt;

&lt;p&gt;Fluent Bit collects information from different sources, buffers it, and dispatches it to different outputs. The &lt;a href="https://github.com/helm/charts/tree/master/stable/fluent-bit"&gt;helm chart&lt;/a&gt; we use runs it as a &lt;code&gt;daemon set&lt;/code&gt; in Kubernetes (more info about daemon sets &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"&gt;here&lt;/a&gt;). This guarantees that there is one instance of Fluent Bit per machine in the Kubernetes cluster. The process collects each pod's &lt;code&gt;standard output&lt;/code&gt; and redirects it towards another system. In our case we chose &lt;strong&gt;Kafka&lt;/strong&gt;.&lt;/p&gt;
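&lt;p&gt;As a sketch of what this looks like, here is a hedged excerpt of the Fluent Bit helm chart's &lt;code&gt;values.yaml&lt;/code&gt; with a Kafka output. The broker addresses and topic name are placeholders, and the exact keys may differ between chart versions, so double-check against the chart you install.&lt;/p&gt;

```yaml
# values.yaml for the stable/fluent-bit helm chart (sketch, not verbatim)
backend:
  type: kafka                # ship records to Kafka instead of the default output
  kafka:
    brokers: "kafka-0.kafka:9092,kafka-1.kafka:9092"   # placeholder broker list
    topics: "cluster-logs"                             # placeholder topic name
input:
  tail:
    parser: docker           # parse the JSON log lines the container runtime writes
```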

&lt;p&gt;Now our logs are safely sent to a Kafka topic, ready to be consumed by something else.&lt;/p&gt;

&lt;h3&gt;Problem 2: System Metrics&lt;/h3&gt;

&lt;p&gt;Applications write logs to the standard output, but &lt;code&gt;machines&lt;/code&gt; don't write logs, right? So, how do I know how a machine is behaving? How do I know if the &lt;code&gt;CPU&lt;/code&gt; is sky-rocketing, or whether the &lt;code&gt;I/O operations&lt;/code&gt; on the disk are the bottleneck for my application? Well, that's where &lt;a href="https://github.com/prometheus/node_exporter"&gt;Node-Exporter&lt;/a&gt; comes into play.&lt;/p&gt;

&lt;h3&gt;Solution 2: Node Exporter&lt;/h3&gt;

&lt;p&gt;Node Exporter collects metrics from the underlying Operating System. This is powerful because now we can gather the system information we need. Once again, there is a &lt;a href="https://github.com/helm/charts/tree/master/stable/prometheus-node-exporter"&gt;helm chart&lt;/a&gt; coming to the rescue.&lt;/p&gt;
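&lt;p&gt;For reference, a minimal Prometheus scrape job for node-exporter can be sketched as follows. Port &lt;code&gt;9100&lt;/code&gt; is node-exporter's default; the configuration the Prometheus operator generates for you is more involved than this hand-written version.&lt;/p&gt;

```yaml
# prometheus.yml fragment (sketch): scrape node-exporter on every node
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node                     # discover every node in the cluster
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'            # discovered address points at the kubelet
        replacement: '${1}:9100'       # rewrite it to node-exporter's default port
        target_label: __address__
```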

&lt;h3&gt;Problem 3: Application Metrics&lt;/h3&gt;

&lt;p&gt;Cool, but what if an application can give me more information than simple logging? For example, what if my database can report its current &lt;code&gt;memory consumption&lt;/code&gt; or &lt;code&gt;average query latency&lt;/code&gt;? That's harder: these are neither logs nor metrics coming from a machine, yet they are available for us to use. That's when &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; enters the arena.&lt;/p&gt;

&lt;h3&gt;Solution 3: Prometheus&lt;/h3&gt;

&lt;p&gt;Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. &lt;strong&gt;BINGO&lt;/strong&gt;. This sounds like a tool that will do a lot of things for us. But where does it stand in the big picture? Let's take a look at the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xwo2LbXV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/prometheus_kubernetes_diagram_overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xwo2LbXV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://478h5m1yrfsa3bbe262u7muv-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/prometheus_kubernetes_diagram_overview.png" alt="Prometheus" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I took this picture from &lt;a href="https://sysdig.com/blog/kubernetes-monitoring-prometheus/"&gt;this&lt;/a&gt; great article, and it describes clearly what Prometheus does. Basically, its role is to &lt;strong&gt;pull&lt;/strong&gt; and &lt;strong&gt;push&lt;/strong&gt; information. What is most relevant to know is that Prometheus standardized the format in which metrics are exposed, so that they can be parsed and, at a later stage, queried.&lt;/p&gt;

&lt;p&gt;Because our data platform has multiple Kubernetes clusters (remember controlplane, production, development, etc.), Prometheus needs to be installed in all of them. Thanks to the awesome community of developers, there is a &lt;a href="https://github.com/helm/charts/tree/master/stable/prometheus-operator"&gt;helm chart&lt;/a&gt; that we can use. This operator also allows us to run Prometheus in &lt;a href="https://prometheus.io/docs/prometheus/latest/federation/"&gt;&lt;code&gt;federation&lt;/code&gt;&lt;/a&gt; mode, which is &lt;em&gt;very important&lt;/em&gt; in this context. Federation allows the Prometheus in controlplane to &lt;code&gt;scrape&lt;/code&gt; the information from the other Prometheus services, so that we can centralize all the metrics in one place.&lt;/p&gt;
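&lt;p&gt;A federation job on the controlplane Prometheus looks roughly like the following; the downstream hostnames are placeholders, and in production you would narrow the &lt;code&gt;match[]&lt;/code&gt; selector instead of pulling every series.&lt;/p&gt;

```yaml
# prometheus.yml fragment on controlplane (sketch): scrape the /federate
# endpoint of each downstream Prometheus
scrape_configs:
  - job_name: federate
    honor_labels: true            # keep the labels set by the source Prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'           # pull every job; narrow this in production
    static_configs:
      - targets:                  # placeholder addresses of the other clusters
          - prometheus.production.example.com
          - prometheus.development.example.com
```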

&lt;h3&gt;Problem 4: Fetching the logs&lt;/h3&gt;

&lt;p&gt;We decided to create &lt;code&gt;controlplane&lt;/code&gt; to centralize the information about the other environments and have an overview of what is going on in our platform. Since we pushed our logs into Kafka, we now need to consume them and store them in a format that is readable to humans.&lt;/p&gt;

&lt;p&gt;There is a famous acronym, &lt;code&gt;ELK&lt;/code&gt;, which stands for Elasticsearch, Logstash, Kibana. So far we have mentioned the &lt;strong&gt;E&lt;/strong&gt; and the &lt;strong&gt;K&lt;/strong&gt; but never the &lt;strong&gt;L&lt;/strong&gt;. Well, that time has just arrived.&lt;/p&gt;

&lt;h3&gt;Solution 4: Logstash&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/products/logstash"&gt;Logstash&lt;/a&gt; is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favourite "stash." This is part of the Elastic suit, and it is a fundamental piece for making sure that we are able to have the same type of logging for everything that comes in. &lt;/p&gt;

&lt;p&gt;Our input is the Kafka topic we mentioned before, and our output is Elasticsearch, where the data will be indexed and "stashed". The &lt;a href="https://github.com/helm/charts/blob/master/stable/logstash"&gt;helm chart&lt;/a&gt; helps you install the application, and by modifying this &lt;a href="https://github.com/helm/charts/blob/master/stable/logstash/values.yaml#L286"&gt;part&lt;/a&gt; of the &lt;code&gt;values.yml&lt;/code&gt; you can easily read from Kafka. The major issue we found was with the &lt;code&gt;@timestamp&lt;/code&gt; field: we had to adapt the &lt;code&gt;values.yml&lt;/code&gt; a little to avoid issues in parsing the timestamp.&lt;/p&gt;

&lt;p&gt;The fix boils down to adding a &lt;code&gt;date&lt;/code&gt; filter to the Logstash pipeline so that the timestamp is parsed with the correct timezone.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have to adjust the timezone to match your setup; this mismatch was the main reason why our data wasn't being ingested into Elasticsearch correctly.&lt;/p&gt;
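&lt;p&gt;A sketch of such a filter is shown below; the source field name, the date format and the timezone are assumptions you will need to adapt to your own log format.&lt;/p&gt;

```
# Logstash pipeline fragment (sketch): normalize the @timestamp field
filter {
  date {
    match => ["time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"]  # assumed source field and format
    timezone => "Europe/Amsterdam"                   # change to your timezone
    target => "@timestamp"
  }
}
```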

&lt;h3&gt;Visualize All&lt;/h3&gt;

&lt;p&gt;We just finished covering the logging part, but how do we visualize everything? There are two main applications: &lt;a href="https://www.elastic.co/products/kibana"&gt;Kibana&lt;/a&gt; and &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt;. We use Kibana to explore all the logs coming in from all the applications. Without it, debugging would be extremely hard: searching for what is going on with a &lt;code&gt;kubectl logs -f &amp;lt;pod-name&amp;gt; | grep whatever-error&lt;/code&gt; gets old very quickly :)&lt;/p&gt;

&lt;p&gt;Grafana helps in visualizing all the metrics coming in from Prometheus. There are a ton of &lt;a href="https://grafana.com/grafana/dashboards"&gt;pre-made dashboards&lt;/a&gt; that you can just install and use. The only thing you need to do is to set up the &lt;code&gt;prometheus&lt;/code&gt; installed in &lt;code&gt;controlplane&lt;/code&gt; as a &lt;code&gt;data-source&lt;/code&gt; in Grafana, and that's it. All the metrics will &lt;code&gt;automagically&lt;/code&gt; be available to you.&lt;/p&gt;
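&lt;p&gt;If you prefer provisioning the data source from a file rather than clicking through the UI, a minimal sketch (the URL is a placeholder for your controlplane Prometheus service) looks like this:&lt;/p&gt;

```yaml
# grafana provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
  - name: controlplane-prometheus
    type: prometheus
    access: proxy                                  # Grafana proxies the queries
    url: http://prometheus.controlplane.svc:9090   # placeholder in-cluster URL
    isDefault: true
```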

&lt;h3&gt;Problem 5: Alarming&lt;/h3&gt;

&lt;p&gt;This is the toughest part of the process. Once again, the concept is simple, but deciding the thresholds at which an alarm should fire is difficult, and they need to be tuned along the way. I would recommend starting from this &lt;a href="https://awesome-prometheus-alerts.grep.to/rules.html"&gt;&lt;strong&gt;awesome&lt;/strong&gt;&lt;/a&gt; website and building the rules that are important for your data platform.&lt;/p&gt;
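&lt;p&gt;As a concrete starting point, a high-CPU rule in Prometheus format can look like this; the threshold and durations are examples to tune for your own platform.&lt;/p&gt;

```yaml
# prometheus rule file (sketch): alert when average CPU usage stays above 80%
groups:
  - name: node-alerts
    rules:
      - alert: HostHighCpuLoad
        # 100 minus the idle percentage = CPU usage percentage per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m                      # only fire if sustained for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
```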

&lt;h3&gt;Solution 5: Slack and PagerDuty&lt;/h3&gt;

&lt;p&gt;I can't help you with building the rules that you need for your data platform, but I can give you advice on selecting the right tool to notify you and your team when an alarm is triggered. My suggestion is to send notifications to &lt;a href="https://prometheus.io/docs/alerting/configuration/#slack_config"&gt;Slack&lt;/a&gt; for the alarms that you consider "minor" (I leave the definition of minor up to you). We only send a Slack notification for the &lt;code&gt;development&lt;/code&gt; environment and for those applications that are not public in &lt;code&gt;production&lt;/code&gt; yet.&lt;/p&gt;
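&lt;p&gt;In Alertmanager terms, routing the minor alerts to Slack is only a few lines of configuration; the webhook URL and channel below are placeholders.&lt;/p&gt;

```yaml
# alertmanager.yml fragment (sketch)
route:
  receiver: slack-minor
  routes:
    - match:
        severity: warning              # send only the "minor" alerts to Slack
      receiver: slack-minor
receivers:
  - name: slack-minor
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#data-platform-alerts'
        send_resolved: true            # also notify when the alert clears
```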

&lt;p&gt;For production systems we use &lt;a href="https://www.pagerduty.com"&gt;PagerDuty&lt;/a&gt; to create an on-call rotation among team members and make sure that everything is always up-and-running. There is a great &lt;a href="https://www.pagerduty.com/docs/guides/prometheus-integration-guide/"&gt;integration&lt;/a&gt; with Prometheus that I highly recommend setting up.&lt;/p&gt;

&lt;p&gt;Grafana also helps with &lt;a href="https://grafana.com/docs/alerting/rules/"&gt;alerting&lt;/a&gt;, but we haven't used it for that yet. It looks awesome though. If you've been using Grafana this way, it would be great if you could share your experience in the comments below :)&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this long blog post I gave you an overview of the tools that my team and I are using for our data platform. I hope this gave you more ideas on how to start! You will encounter problems along the way, because "Rome wasn't built in a day", but I do hope you now have all the information you need to collect logs, visualize metrics and receive alarms in your data platform. And remember my colleague's motto: &lt;code&gt;logging and monitoring is an art&lt;/code&gt; :)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>elk</category>
      <category>prometheus</category>
      <category>devops</category>
    </item>
    <item>
      <title>Part 1: How to build a data platform from scratch</title>
      <dc:creator>Davide Berdin</dc:creator>
      <pubDate>Sat, 24 Aug 2019 20:50:06 +0000</pubDate>
      <link>https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72</link>
      <guid>https://dev.to/spaghettifunk/part-1-where-it-all-begins-1d72</guid>
      <description>&lt;p&gt;Have you ever tried to build a data platform from scratch? Do you know where to start, what to look for? And most importantly: what &lt;em&gt;not&lt;/em&gt; to do? Fear no more! Learn how to build the perfect data platform with these series of articles :) This article will be the first part of a series, that will give you an inside in my journey towards building a data platform from scratch. By sharing my experience, the lessons learned while developing it, the advantages and disadvantages of certain design choices, and my goals, I hope to make your journey less bumpy.&lt;/p&gt;

&lt;p&gt;Whether you are given the task to build a data platform, or you decide to do this in your own time: you aim high with your ambitions - or at least that's what I do. Having ambition is great, and it is necessary in order to push yourself a little bit further. That's why, when you start designing something, it's important to look further than the initial &lt;em&gt;requirements&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Without further ado, let's start this journey by discussing the motivations I had behind creating a Data Platform for my current job.&lt;/p&gt;

&lt;h2&gt;Motivations (aka Requirements)&lt;/h2&gt;

&lt;p&gt;I work in the Media industry, and the company I work for doesn't have a Tech department like Netflix, GitHub or Twitter. We might not be considered a 'tech company', but we have a team of great engineers who are as motivated as I am to create something nice.&lt;/p&gt;

&lt;p&gt;One day, our manager came in and said that our company had decided to build a Data Platform from scratch. The requirements we got were the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It needs to be scalable&lt;/li&gt;
&lt;li&gt;It needs to ingest a lot of data&lt;/li&gt;
&lt;li&gt;It needs to be able to query very quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's "all"!&lt;/p&gt;

&lt;p&gt;The engineering team had a discussion on what type of technologies we wanted to use, and what the architecture should look like. We decided to focus a lot on &lt;strong&gt;Kubernetes&lt;/strong&gt;, but the choice wasn't straightforward. A large part of the team had never worked with Kubernetes; they had more experience with AWS Lambda and API Gateway. While to me Kubernetes was a natural choice, for other people it was a bit scary.&lt;/p&gt;

&lt;p&gt;In the end we collectively agreed that Kubernetes would serve our needs better, especially in terms of scalability and deployment of our Dockerized applications. Using it meant that we had to migrate tons of services from ECS and other EC2 instances to Kubernetes.&lt;/p&gt;

&lt;h2&gt;Initial architecture&lt;/h2&gt;

&lt;p&gt;After choosing Kubernetes, we created a bunch of bash files and &lt;strong&gt;Terraform&lt;/strong&gt; scripts to get started. Within the engineering team, there was a discussion on how to tackle the automation, and we decided to build a "1-click-deployment" system.&lt;/p&gt;

&lt;p&gt;This "1-click-deployment" system, consisted of a Docker image that contained all the packages we needed in order to generate the entire infrastructure of the Data Platform. After that we created the initial skeleton based on this awesome &lt;a href="https://blog.gruntwork.io/how-to-create-reusable-infrastructure-with-terraform-modules-25526d65f73d" rel="noopener noreferrer"&gt;article&lt;/a&gt; of Yevgeniy Brikman from Gruntwork. Our result looked similar to this &lt;a href="https://github.com/spaghettifunk/cluster-example" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; When there are a lot of "moving" parts within your infrastructure, it's good practice to automate it and to rely on tools that help you and your team put all the pieces together with minimum effort. The picture below shows what the initial architecture looked like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F3cKjcxf%2Finit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F3cKjcxf%2Finit.png" alt="Initial Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was quite simple for us; we needed to &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-eks" rel="noopener noreferrer"&gt;create a Kubernetes cluster&lt;/a&gt; and add a &lt;a href="https://github.com/nginxinc/kubernetes-ingress/blob/master/docs/installation.md" rel="noopener noreferrer"&gt;Load Balancer via NGINX&lt;/a&gt;. Then we were ready to move to the next step.&lt;/p&gt;

&lt;h2&gt;Drama was about to come&lt;/h2&gt;

&lt;p&gt;We only used one cluster, and we immediately had to start thinking about how we were going to deploy applications. Then other questions popped up: how do we create a &lt;strong&gt;Staging&lt;/strong&gt; environment, or a &lt;strong&gt;Testing&lt;/strong&gt; environment? The initial choice was to scale the cluster with a few more machines and leverage &lt;strong&gt;Kubernetes Namespaces&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The advantage of using Namespaces is that you can &lt;code&gt;"isolate"&lt;/code&gt; your applications in a sort of &lt;code&gt;box&lt;/code&gt;. Isolation is what we were trying to achieve, so that we could distinguish the &lt;code&gt;environments&lt;/code&gt;. Doing this, however, created an issue with naming and with how we were going to deploy the applications. For example, let's assume the following: we have two APIs and one WebApp. All three need a Testing and a Staging environment, so that developers can safely deploy their code before going to production. Because we decided to use Namespaces, we first tried to create three of them: Production, Staging, Testing. There are some drawbacks to putting all the applications under the same namespace. The first is that it becomes much easier to delete everything: the Kubernetes CLI (&lt;code&gt;kubectl&lt;/code&gt;) can delete namespaces, and if a developer issued the command &lt;code&gt;kubectl delete ns &amp;lt;env&amp;gt;&lt;/code&gt;, all the applications under that namespace would be gone. The second problem is the &lt;code&gt;isolation&lt;/code&gt; itself: the reason why namespacing was created in the first place was to partition the cluster into &lt;code&gt;smaller virtual clusters&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We tried another approach: we create a namespace per application per environment. For example we would have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;production_api1&lt;/li&gt;
&lt;li&gt;staging_api1&lt;/li&gt;
&lt;li&gt;testing_api1&lt;/li&gt;
&lt;li&gt;production_webapp&lt;/li&gt;
&lt;li&gt;staging_webapp&lt;/li&gt;
&lt;li&gt;testing_webapp&lt;/li&gt;
&lt;li&gt;and so on...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can clearly see that this would pollute the Kubernetes cluster with tons of namespaces. The following &lt;a href="https://kubernetes.io/blog/2016/08/kubernetes-namespaces-use-cases-insights/" rel="noopener noreferrer"&gt;article&lt;/a&gt; gave us a lot of insights on how to use namespaces in a better way. Although the article states that you can use both of the approaches I described above, it also highlights the anti-patterns of such approaches. For us, those solutions didn't work well.&lt;/p&gt;

&lt;p&gt;What did we do? The solution was simple: multiple Kubernetes clusters.&lt;/p&gt;

&lt;h2&gt;Multiple clusters&lt;/h2&gt;

&lt;p&gt;As mentioned above, we initially decided to adopt the "1-click-deployment" strategy, which means that our code-base was ready for deploying as many clusters as we wanted. And so we did. The second architecture we pulled off looks like the picture below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F9ymV46J%2Fsecond.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F9ymV46J%2Fsecond.png" alt="Multiple clusters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the picture, we have three clusters. In this way we could deploy applications depending on the environment we wanted to target.&lt;/p&gt;

&lt;p&gt;The whole team was very hyped and we thought that we had nailed it. But with great power comes great responsibility. Responsibility here means &lt;strong&gt;logging&lt;/strong&gt; and &lt;strong&gt;monitoring&lt;/strong&gt;. With three clusters, understanding what is going on is rather challenging. Yet again we had a problem to tackle.&lt;/p&gt;

&lt;p&gt;In a distributed environment with multiple applications, understanding what is happening is crucial. In fact, being able to quickly debug your application in Development or understand what caused an eventual error in Production is the fundamental step for keeping control over your systems.&lt;/p&gt;

&lt;p&gt;But logging is not sufficient. Machines in general don't write logs; they expose metrics about their hardware. Collecting and visualizing that information allows you and your team to set up &lt;strong&gt;alarms&lt;/strong&gt; based on certain rules. A rule can be "if the average CPU usage is above 80%, trigger an alarm". The alarm itself can be an email or a Slack message to your team. Alarms will prevent your machines from reaching undesirable states.&lt;/p&gt;

&lt;h2&gt;Controlplane to the rescue&lt;/h2&gt;

&lt;p&gt;When you have a lot of applications, different environments, and a team of engineers and data scientists eager to create, test, and put applications or models in production, you need a solid monitoring and logging system. But how do we get one? Yet again the answer was simple: another Kubernetes cluster :)&lt;/p&gt;

&lt;p&gt;Controlplane was a fancy name we took from another team within the company. The sole purpose of this cluster is to collect metrics and logs from the other clusters and centralize the visualization of that information. At this point the architecture looks like the following.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FmzD23s9%2Fthird.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FmzD23s9%2Fthird.png" alt="Controlplane system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On each cluster we set up &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/prometheus/node_exporter" rel="noopener noreferrer"&gt;&lt;strong&gt;Node-exporter&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/kubernetes/kube-state-metrics" rel="noopener noreferrer"&gt;&lt;strong&gt;Kube-state-metrics&lt;/strong&gt;&lt;/a&gt; to expose all the metrics of the cluster. To collect that information and send it out of the cluster, we used &lt;a href="https://github.com/fluent/fluent-bit" rel="noopener noreferrer"&gt;&lt;strong&gt;Fluent-Bit&lt;/strong&gt;&lt;/a&gt;. All the metrics and logs were redirected towards a &lt;strong&gt;Kafka Topic&lt;/strong&gt;. In this way we were able to fetch them from the controlplane.&lt;/p&gt;

&lt;p&gt;In controlplane we installed &lt;a href="https://www.elastic.co/products/logstash" rel="noopener noreferrer"&gt;&lt;strong&gt;Logstash&lt;/strong&gt;&lt;/a&gt; and connected it to the &lt;strong&gt;Kafka Topic&lt;/strong&gt; mentioned above so it could start fetching the logs. Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to an output.&lt;/p&gt;

&lt;p&gt;Now that Logstash has access to the logs we just need to redirect them to the &lt;a href="https://github.com/elastic/elasticsearch" rel="noopener noreferrer"&gt;&lt;strong&gt;Elasticsearch&lt;/strong&gt;&lt;/a&gt; service and visualize them with &lt;a href="https://github.com/elastic/kibana" rel="noopener noreferrer"&gt;&lt;strong&gt;Kibana&lt;/strong&gt;&lt;/a&gt;. For the metrics we put in place &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/a&gt; and created a bunch of Dashboards for visualizing the status of each system.&lt;/p&gt;

&lt;p&gt;With the controlplane in place we had full visualization and control of every single aspect of each individual cluster. The controlplane was a win-win solution.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this first part I explained how we started designing the Data Platform. This article focused mostly on the underlying infrastructure and the problems that we had to overcome.&lt;/p&gt;

&lt;p&gt;The lesson I learned during this part of the development is to think ahead much more than when I design software. Software can be changed more rapidly than infrastructure. When designing an infrastructure there are more variables to take into account, and the choices you make can have a big impact on the final result. Long story short: think first and take your time. Rushing to get things out quickly will not always help you.&lt;/p&gt;

&lt;p&gt;In the next article I am going to talk about how applications are deployed and how the team interacts with this architecture.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>design</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to run Apache Druid in Kubernetes using Terraform</title>
      <dc:creator>Davide Berdin</dc:creator>
      <pubDate>Sat, 24 Aug 2019 12:01:51 +0000</pubDate>
      <link>https://dev.to/spaghettifunk/how-to-run-apache-druid-in-kubernetes-using-terraform-3ia7</link>
      <guid>https://dev.to/spaghettifunk/how-to-run-apache-druid-in-kubernetes-using-terraform-3ia7</guid>
      <description>&lt;p&gt;In this article I would like to show how I created a Terraform module where I could deploy Apache Druid on Kubernetes. This is not production-ready but tests with Real-time ingestion have been quite successful. Hopefully, my experience will be helpful to you!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First things first: what is Apache Druid?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3128%2F1%2Ah9cBMCb1QtVU5VsYx_hRBg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3128%2F1%2Ah9cBMCb1QtVU5VsYx_hRBg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Druid is a high performance real-time analytics database. Originally created at &lt;a href="https://metamarkets.com" rel="noopener noreferrer"&gt;Metamarkets&lt;/a&gt; and later donated to the Apache Foundation, Druid has been designed to be fault-tolerant, blazing fast, horizontally scalable and many other things. It’s a big and complex project that requires some time in order to master it properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prepare for scaling!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest challenges I faced when working with Druid was to deploy the system in such a way that I could easily scale it. I started by creating a fleet of machines in an AWS VPC, then on each instance I started the service that was most suitable for that type of node. For example, for the &lt;a href="https://druid.apache.org/docs/latest/tutorials/cluster.html#query-server" rel="noopener noreferrer"&gt;Broker&lt;/a&gt; I selected a &lt;em&gt;general-purpose&lt;/em&gt; type, whereas for the &lt;a href="https://druid.apache.org/docs/latest/tutorials/cluster.html#data-server" rel="noopener noreferrer"&gt;Historical&lt;/a&gt; I used a &lt;em&gt;storage-optimized&lt;/em&gt; machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2Aw1THsy_Ablf9rzUU9NotAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2Aw1THsy_Ablf9rzUU9NotAA.png" alt="Druid services"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all the machines and services were up and running, I had to make sure that all the components were able to &lt;em&gt;connect&lt;/em&gt; to each other and could exchange information. I did this by checking the logs of each service; when something could not connect, it usually pointed to a networking mistake or a misconfiguration. To plumb everything together and automate it, I chose &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; in combination with &lt;a href="https://www.packer.io/" rel="noopener noreferrer"&gt;Packer&lt;/a&gt; and &lt;a href="https://www.vagrantup.com/" rel="noopener noreferrer"&gt;Vagrant&lt;/a&gt;. After days of trial (and error!), I managed to get the Druid cluster up and running. However, automatic scaling was still an issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes at Rescue
&lt;/h2&gt;

&lt;p&gt;One thing I like about Kubernetes is its ability to horizontally scale pods when the load rises above a certain threshold. The cluster I’m using can also autoscale the machines themselves when no more resources are available, which makes Kubernetes a good candidate for deploying Druid.&lt;/p&gt;
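&lt;p&gt;As a sketch of what pod-level autoscaling could look like (the deployment name and thresholds here are illustrative, not part of my module), a HorizontalPodAutoscaler for the Broker might be defined as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: scale a "druid-broker" Deployment on CPU load
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: druid-broker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: druid-broker
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
&lt;/code&gt;&lt;/pre&gt;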

&lt;p&gt;&lt;strong&gt;Cluster configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once again, I used Terraform to deploy the infrastructure. Since Terraform has a Kubernetes provider available, I decided to set up some of the services via Terraform as well, and Druid is one of them. All the other applications, such as APIs, web apps, etc., are initially deployed manually and then updated via the CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;I deployed an &lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;Amazon EKS&lt;/a&gt; cluster with different &lt;strong&gt;node-groups&lt;/strong&gt; so that I could deploy Druid only on particular nodes. As mentioned above, I needed different types of machines depending on the service. The &lt;em&gt;autoscaler&lt;/em&gt; application adds or removes machines based on the load on the cluster. An example of how to create such a cluster can be found &lt;a href="https://github.com/spaghettifunk/cluster-example" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I started the cluster with 8 machines divided into two node-groups. In this way I could deploy Druid, Zookeeper and Postgres on different nodes and keep the setup “&lt;em&gt;fault-tolerant&lt;/em&gt;” (there is an open &lt;a href="https://github.com/spaghettifunk/druid-terraform/issues/8" rel="noopener noreferrer"&gt;issue&lt;/a&gt; for the pod-affinity).&lt;/p&gt;
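&lt;p&gt;To give an idea of what a dedicated node-group looks like (the names, instance type and sizes below are assumptions for illustration, not the exact values from the linked example), a storage-optimized group for the Historical service could be declared in Terraform along these lines:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical node-group reserved for the Historical service
resource "aws_eks_node_group" "druid_historical" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "druid-historical"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  # storage-optimized instances for the Historical nodes
  instance_types = ["i3.2xlarge"]

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 6
  }

  # label the nodes so Druid pods can be scheduled onto this group only
  labels = {
    "druid/node-type" = "historical"
  }
}
&lt;/code&gt;&lt;/pre&gt;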

&lt;h2&gt;
  
  
  Time to put this in action!
&lt;/h2&gt;

&lt;p&gt;The module I created can be imported by letting Terraform download it via &lt;strong&gt;git&lt;/strong&gt;. The module assumes that Terraform is able to deploy to Kubernetes directly. Here is an example:&lt;/p&gt;
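&lt;p&gt;A minimal sketch of what the import could look like (the module may expect additional input variables; check the repository’s README for the exact ones):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Terraform must be able to reach the Kubernetes cluster directly
provider "kubernetes" {
  config_path = "~/.kube/config"
}

# Let Terraform download the module via git
module "druid" {
  source = "git::https://github.com/spaghettifunk/druid-terraform.git"
}
&lt;/code&gt;&lt;/pre&gt;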

&lt;p&gt;After importing it, you can run &lt;code&gt;terraform plan&lt;/code&gt; and (hopefully!) &lt;code&gt;terraform apply&lt;/code&gt; on your cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that Druid is set up in Kubernetes, the system can scale automatically without manual intervention. The module I created needs some help to make it better and &lt;strong&gt;production-ready&lt;/strong&gt;, but I think it’s a good start if you are planning to use Druid.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Contributions to the &lt;a href="https://github.com/spaghettifunk/druid-terraform" rel="noopener noreferrer"&gt;project&lt;/a&gt; are very welcome 🎉 Feel free to open up a &lt;strong&gt;PR&lt;/strong&gt; or an &lt;strong&gt;Issue&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are a couple of caveats to keep in mind. For example, if you configure the &lt;a href="https://druid.apache.org/docs/latest/dependencies/deep-storage.html" rel="noopener noreferrer"&gt;Deep Storage&lt;/a&gt; without using S3 or HDFS, you may have trouble retaining the data (I do not recommend using a PVC for this purpose).&lt;/p&gt;

&lt;p&gt;Scaling up the &lt;a href="https://druid.apache.org/docs/latest/design/historical.html" rel="noopener noreferrer"&gt;Historical&lt;/a&gt; nodes is generally not a problem, but scaling down could potentially lead to either losing information or forcing Druid to re-index the ingested data. Make sure that you have good scaling rules for this service.&lt;/p&gt;

&lt;p&gt;If you have any questions related to this topic, I’m happy to help! You can find me on &lt;a href="https://twitter.com/davideberdin" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and &lt;a href="https://github.com/spaghettifunk" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>druid</category>
      <category>terraform</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
