How to run Apache Druid in Kubernetes using Terraform

#kubernetes #druid #terraform #architecture

In this article I would like to show how I created a Terraform module that deploys Apache Druid on Kubernetes. The module is not production-ready, but tests with real-time ingestion have been quite successful. Hopefully, my experience will be helpful to you!

First things first: what is Apache Druid?

Apache Druid is a high-performance real-time analytics database. Originally created at Metamarkets and later donated to the Apache Software Foundation, Druid has been designed to be fault-tolerant, fast and horizontally scalable, among many other things. It’s a big and complex project that takes some time to master.

Prepare for scaling!

One of the biggest challenges I faced when working with Druid was deploying the system in a way that made it easy to scale. I started by creating a fleet of machines in an AWS VPC, then on each instance I started the service best suited to that type of node. For example, for the Broker I selected a general-purpose instance type, whereas for the Historical I used a storage-optimized machine.

Once all the machines and services were up and running, I had to make sure that all the components were able to connect to each other and exchange information. I did this by checking the logs of each service. When services failed to connect, the cause was usually a networking mistake or a configuration error. To wire everything together and automate it, I chose Terraform in combination with Packer and Vagrant. After days of trial (and error!), I managed to get the Druid cluster up and running. However, automatic scaling was still an issue.

Kubernetes to the rescue

One thing I like about Kubernetes is its ability to scale pods horizontally when the load rises above a certain threshold. The cluster I’m using can also autoscale the machines when no more resources are available, which makes Kubernetes a good candidate for deploying Druid.
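To make the threshold-based scaling concrete, here is a minimal sketch of a Horizontal Pod Autoscaler defined through Terraform's Kubernetes provider. The `druid` namespace, the `druid-broker` Deployment name, the replica counts and the 70% CPU target are all assumptions chosen for illustration, not values taken from the module.

```hcl
# Hypothetical names and numbers: the "druid" namespace, the
# "druid-broker" Deployment and the thresholds are illustrative only.
resource "kubernetes_horizontal_pod_autoscaler" "druid_broker" {
  metadata {
    name      = "druid-broker"
    namespace = "druid"
  }

  spec {
    min_replicas = 2
    max_replicas = 6

    # Add pods when average CPU usage exceeds 70% of the requested CPU.
    target_cpu_utilization_percentage = 70

    # The workload whose replica count the autoscaler adjusts.
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "druid-broker"
    }
  }
}
```

CPU-based autoscaling only works if the target pods declare CPU requests and a metrics source such as metrics-server is running in the cluster.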

Cluster configuration

Once again, I use Terraform to deploy the infrastructure. Since Terraform has a Kubernetes provider available, I decided to set up some of the services via Terraform as well, and Druid is one of them. All the other applications, such as APIs and web apps, are initially deployed manually and then updated via CI/CD.

I deployed an Amazon EKS cluster with different node groups so that I could deploy Druid only on particular nodes. As mentioned above, I needed different types of machines depending on the service. The autoscaler application is able to add or remove machines based on the load on the cluster. An example of how to create such a cluster can be found here.

I started the cluster with 8 machines, divided into 2 different node groups. This way I could deploy Druid, Zookeeper and Postgres on different nodes and make the setup “fault-tolerant” (there is an issue open for the pod-affinity).
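A layout like this could be sketched with two `aws_eks_node_group` resources from the Terraform AWS provider. The group names, instance types, label key and the 4 + 4 split of the 8 machines are assumptions for illustration; the cluster name, IAM role and subnets are passed in as variables because they depend on your environment.

```hcl
# Inputs that depend on your environment (hypothetical variable names).
variable "cluster_name" { type = string }
variable "node_role_arn" { type = string }
variable "subnet_ids" { type = list(string) }

# General-purpose nodes, e.g. for Brokers, Zookeeper and Postgres.
resource "aws_eks_node_group" "general" {
  cluster_name    = var.cluster_name
  node_group_name = "druid-general"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids
  instance_types  = ["m5.xlarge"] # assumed general-purpose type

  # Label used by pods (via a node selector) to land on this group.
  labels = { "node-type" = "general" }

  scaling_config {
    desired_size = 4
    min_size     = 2
    max_size     = 8
  }

  # Let the cluster autoscaler change the size without Terraform reverting it.
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }
}

# Storage-optimized nodes, e.g. for Historicals.
resource "aws_eks_node_group" "storage" {
  cluster_name    = var.cluster_name
  node_group_name = "druid-storage"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids
  instance_types  = ["i3.xlarge"] # assumed storage-optimized type

  labels = { "node-type" = "storage" }

  scaling_config {
    desired_size = 4
    min_size     = 2
    max_size     = 8
  }

  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }
}
```

Pods can then be pinned to a group with a node selector on the `node-type` label, which is how each Druid service ends up on the machine type that suits it.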

Time to put this in action!

The module I created can be imported by letting Terraform download it via git. The module assumes that Terraform is able to deploy to Kubernetes directly. Here is an example:

After importing, you can run terraform plan and (hopefully!) terraform apply on your cluster.
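As a minimal sketch of such an import, the snippet below configures the Kubernetes provider from a local kubeconfig and pulls a module over git. The repository URL is a placeholder, not the real location of the module, and the module's input variables are left out because their names are not given here.

```hcl
# Assumes a kubeconfig that already points at the target cluster.
provider "kubernetes" {
  config_path = "~/.kube/config"
}

module "druid" {
  # Placeholder URL: replace it with the actual repository of the module.
  # The "git::" prefix tells Terraform to fetch the source with git,
  # and "?ref=" pins a branch or tag.
  source = "git::https://example.com/your-org/terraform-kubernetes-druid.git?ref=main"

  # Module inputs go here; check the module's variables for the real names.
}
```

Terraform fetches the module source when you run terraform init, which has to happen before planning or applying.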

Conclusion

With Druid set up in Kubernetes, you can scale the system automatically, without manual intervention. The module I created needs some help to become better and production-ready, but I think it’s a good start if you are planning to use Druid.

Contributions to the project are very welcome 🎉 Feel free to open a PR or an issue.

There are a couple of caveats that you need to keep in mind. For example, if you configure Deep Storage without S3 or HDFS, you may have trouble retaining the data (I do not recommend using a PVC for this purpose).

Scaling up the Historical nodes is generally not a problem, but scaling down could lead to either losing information or forcing Druid to re-index the ingested data. Make sure that you have good scaling rules for this service.
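One way to express cautious scale-down rules is the `behavior` block of a v2 Horizontal Pod Autoscaler. The sketch below assumes the Historicals run as a StatefulSet named `druid-historical` in a `druid` namespace; those names, the replica counts and the timings are all illustrative, not values from the module.

```hcl
# Hypothetical names and numbers; tune them to your own data volume.
resource "kubernetes_horizontal_pod_autoscaler_v2" "druid_historical" {
  metadata {
    name      = "druid-historical"
    namespace = "druid"
  }

  spec {
    min_replicas = 3
    max_replicas = 10

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "StatefulSet"
      name        = "druid-historical"
    }

    # Scale on average CPU utilization across the Historical pods.
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }

    behavior {
      scale_down {
        # Wait 30 minutes of sustained low load before removing anything.
        stabilization_window_seconds = 1800
        select_policy                = "Min"

        # Remove at most one pod every 10 minutes.
        policy {
          type           = "Pods"
          value          = 1
          period_seconds = 600
        }
      }
    }
  }
}
```

Rules like these only slow the scale-down; they do not by themselves stop Druid from having to reload segments when a Historical disappears, so a sensible `min_replicas` still matters.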

If you have any questions related to this topic, I’m happy to help! You can find me on Twitter and GitHub.