Part 1: How to build a data platform from scratch

#kubernetes #devops #design #architecture

Have you ever tried to build a data platform from scratch? Do you know where to start, what to look for? And most importantly: what not to do? Fear no more! Learn how to build the perfect data platform with these series of articles :) This article will be the first part of a series, that will give you an inside in my journey towards building a data platform from scratch. By sharing my experience, the lessons learned while developing it, the advantages and disadvantages of certain design choices, and my goals, I hope to make your journey less bumpy.

Wether you are given the task to build a data platform, or you decide to do this in your own time: you aim high with your ambitions - or at least that's what I do. Having ambition is great and it is necessary in order to push yourself a little bit further. That's why, when you start designing something it's important that you look further than the initial requirements.

Without further ado, let's start this journey by discussing the motivations I had behind creating a Data Platform for my current job.

Motivations (aka Requirements)

I work in the Media industry, and the company I work for doesn't have a Tech department like Netflix, GitHub or Twitter. We might not be considered as a 'tech company', we have a team of great engineers that are as motivated as me to create something nice.

One day, our manager came in and said that our company decided to build a Data Platform from scratch. The requirements we've got were the following:

It needs to be scalable
It needs to ingest a lot of data
It needs to be able to query very quickly

That's "all"!

The engineering team had a discussion on what type of technologies we wanted to use, and what the architecture should look like. We decided to focus a lot on Kubernetes. But the choice wasn't straightforward. A large part of the team had never worked with Kubernetes, they had more knowledge on AWS Lambda and API Gateway. While to me Kubernetes was a natural choice, for other people it was a bit scary.

In the end we commonly agreed that it would serve our needs better to use Kubernetes, especially in terms of scalability and deployment of our Dockerized application. Using it meant that we had to migrate tons of services from ECS and other EC2 instances, to Kubernetes.

Initial architecture

After the choice for Kubernetes, we created a bunch of bash files and Terraform scripts to get started. Within the engineering team, there was a discussion on how to tackle the automation and we decided to build a "1-click-deployment" system.

This "1-click-deployment" system, consisted of a Docker image that contained all the packages we needed in order to generate the entire infrastructure of the Data Platform. After that we created the initial skeleton based on this awesome article of Yevgeniy Brikman from Gruntwork. Our result looked similar to this repository.

Tip: When there are a lot of "moving" parts within your infrastructure, it's a good practice to automate it and to rely on tools that will help you and your team to put all the pieces together with minimum effort. The picture below shows how the initial architecture looked like

This was quite simple to us; we needed to create a Kubernetes cluster, add a Load Balancer via NGINX. We were ready to move to the next step.

Drama was about to come

We only used one cluster, and we immediately had to start thinking about how we were going to deploy applications. Then other questions popped up; how do we create a Staging environment, or a Testing environment? The initial choice was to scale the cluster with few more machines and leverage the Kubernetes Namespaces.

The advantage of using Namespaces, is that you can "isolate" your applications in a sort of box. Isolation is what we were trying to achieve, so that we could distinguish the environments. Doing this, created an issue with naming and with how we were going to deploy the applications. For example, let's assume the following: we have two APIs and one WebApp. All three need to have a Testing and Staging environment, so that developers can safely deploy their code before going to production. Because we decided to use Namespaces, we tried to first create three Namespaces: Production, Staging, Testing. There are some drawbacks in putting all the application under the same namespace. The first issue is that it would be much easier to delete everything. For example, the Kubernetes CLI (kubectl) is able to delete namespaces. If a developer would issue the follwing command kubectl delete ns <env> all the applications under that namespace would be gone. The second problem is in the isolation. In fact, the reason why namespacing was created in the first place, was to partition the cluster in smaller virtual cluster.

We tried another approach: we create a namespace per application per environment. For example we would have the following:

production_api1
staging_api1
testing_api1
production_webapp
staging_webapp
testing_webapp
and so on...

You can clearly see that this would pollute the Kubernetes cluster with tons of namespaces. The following article gave us a lot of insights on how to use namespace in better way. Despite the article states that you can use both of the approaches I described above, it also highlights the anti-patterns when using such approaches. For us, those solution didn't work well.

What did we do? The solution was simple: multiple Kubernetes clusters.

Multiple clusters

As mentioned above, we initially decided to adopt the "1-click-deployment" strategy: which means that our code-base was ready for deploying as many clusters we wanted. And so we did. The second Architecture we pulled-off looks like the picture below

As you can see from the picture, we have three clusters. In this way we could deploy applications depending on the environment we wanted to target.

The whole team was very hyped and we thought that we nailed it. But, with great powers come great responsibilities. Responsibilities here means logging and monitoring. With three clusters, understanding what it's going on is rather challenging. We yet again had a problem we needed to tackle.

In a distributed environment with multiple applications, understanding what is happening is crucial. In fact, being able to quickly debug your application in Development or understand what caused an eventual error in Production is the fundamental step for keeping control over your systems.

But logging is not sufficient. Machines themselves in general don't write logs, the expose metrics of their hardware. Thus, collecting such information and visualize them, it will allow you and your team to set up alarms based on certain rules. A rule can be "if the average CPU usage is above 80% trigger an alarm". The alarm itself can be an email or a Slack message to your team. Alarms will prevent having your machines to reach undesirable states.

Controlplane at rescue

When you have a lot of applications, different environments, and a team of engineers and data scientists that is eager to create, test and put applications or models in production, you need to have a solid monitoring and logging system. But how do we do so? Yet again the answer was simple: another Kubernetes cluster :)

Controlplane was a fancy name we took from another team within the company. The sole purpose of this cluster is to collect metrics and logs from the other clusters and centralize the visualization of such information. At this point the architecture looks like the following

On each cluster we set up Prometheus, Node-exporter, Kube-state-metrics for exposing all the metrics of the cluster. To collect the information and send them out the cluster, we used Fluent-Bit. All the metrics and logs were redirected towards a Kafka Topic. In this way we were able to fetch them from the controlplane.

In controlplane we installed Logstash and connected it to the Kafka Topic mentioned above so it could start fetching the logs. Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to an output.

Now that Logstash has access to the logs we just need to redirect them to the Elasticsearch service and visualize them with Kibana. For the metrics we put in place Grafana and created a bunch of Dashboards for visualizing the status of each system.

With the controlplane in place we had full visualization and control of every single aspect of each individual cluster. The controlplane was a win-win solution.

Conclusion

In this first part I explained how we started designing the Data Platform. This article focused mostly on the underlying infrastructure and the problems that we had to overcome.

The lesson I learned during this part of the development: is to think ahead much more than when I design software. Software can be changed more rapidly than infrastructure. When designing an infrastructure there are more variables to take into account, and the choices that you make can have a big impact on the final result. Long story short: think first and take your time. Rushing to get things out quickly will not always help you.

In the next article I am going to talk about how applications are deployed and how the team interacts with this architecture.