Kilian Decaderincourt for 365Talents

Posted on Jul 1 • Edited on Jul 2

Live Coding in Kubernetes: How we debug production data remotely with Tilt

#webdev #devops #cloud #kubernetes

tldr: We use Tilt to enable developers to live-edit code running directly on our remote Kubernetes cluster, allowing them to debug issues on realistic production data from their local IDE. It plays nicely with our multitenant model, allowing developers to target and replace specific tenant services while keeping everything else running.

What Tilt does

First, a quick recap on what Tilt does if you're not familiar with it. Tilt is a tool that lets you describe your dev environment as code to run on Kubernetes and start it in a single command with smart rebuild and live updates.

Let's say you have a Dockerized app you want to test. If you do this manually, you first need to build the Docker image, push it to a repository, then create a pod using this image. In its simplest form, Tilt automates all these steps and repeats them on every change. It also comes with a nice UI to visualise everything and stream your pod logs.

You can also go further: instead of rebuilding on every file change, you can configure Tilt to sync specific files directly into the pod. The possibilities are almost endless because Tilt is easily configurable with some Python scripting.
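To make the sync-versus-rebuild idea concrete, here is a minimal Python sketch of the decision, not Tilt's actual implementation and not our Tiltfile. It assumes a simplified model in which a change is live-updated only when every changed file is covered by a sync rule, and anything else falls back to a full image rebuild. The `SYNC_RULES` mapping and the paths in it are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical sync rules: local directory -> directory inside the container.
SYNC_RULES = {
    "frontend/src": "/app/src",
}


def sync_target(changed_file: str) -> Optional[str]:
    """Return the in-container path for a changed file, or None if no rule matches."""
    path = PurePosixPath(changed_file)
    for local_dir, remote_dir in SYNC_RULES.items():
        try:
            relative = path.relative_to(local_dir)
        except ValueError:
            continue  # the file is outside this synced directory
        return str(PurePosixPath(remote_dir) / relative)
    return None


def plan_update(changed_files: List[str]) -> str:
    """Choose a live update only when every changed file is covered by a sync rule."""
    if changed_files and all(sync_target(f) is not None for f in changed_files):
        return "live update"
    return "full rebuild"


print(plan_update(["frontend/src/App.vue"]))
print(plan_update(["frontend/src/App.vue", "frontend/package.json"]))
```

In this model, editing a source file only copies that file into the running pod, while touching a file outside the synced directories (a dependency manifest, for example) sends you back to the slow path.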

Our architecture

As an HR B2B SaaS provider, ensuring data segregation between our customers is very important: you wouldn't want your employee to receive an internal mobility offer... from a different company. This has a significant impact on how we design our architecture. Each customer gets their own database, S3 storage container, Elasticsearch indexes, Redis namespace, DNS record, etc.

This is what we call a tenant. Each customer has a production tenant for their employees, optionally one or more sandbox/test tenants, and a corresponding staging tenant. Developers also get their own tenant in the development environment. Overall, we have over 200 tenants spread across 3 environments, each with its own purpose.

To work at that scale, most of our application components are multitenant, meaning a single deployment can handle multiple tenants. That's the case for the data science backoffice components and our frontend application. However, for historical reasons, some components like our API backend have always been developed to serve a single tenant, so they require a distinct deployment per tenant.

How we're using it

Instead of using Tilt on a local Kubernetes cluster, which seems to be the most common scenario, we use it directly on our remote cluster hosted on Azure. The main benefit for us is that we can hijack any tenant's pods and make them run different code, which can then be edited in real time from a local IDE on our laptop.

When Tilt starts, it builds an image per component, pushes it to our repository and deploys pods using this image in the namespace of the selected tenant. Pods deployed by Tilt reuse all the Secrets and ConfigMaps already provisioned and updated by the infrastructure to connect to the right database, S3 bucket, etc.

For components handling a single tenant, like our API backend, Tilt first calls a script that stops the existing deployment by scaling it down to zero, which prevents background tasks from editing data. Components shared with other tenants, like the frontend, are not stopped. Instead, Tilt creates a new pod in the tenant namespace and a new Ingress with a higher priority that routes requests to our new frontend and backend pods.
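The takeover logic described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than our actual script: the component names, the `Component` type and the idea that the namespace is named after the tenant are all hypothetical, and the steps are returned as plain descriptions instead of being executed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    multitenant: bool  # True when one deployment serves several tenants


def takeover_plan(tenant: str, components: List[Component]) -> List[str]:
    """List the steps needed before Tilt-managed pods take over a tenant.

    Assumes (hypothetically) that the namespace has the same name as the tenant.
    """
    steps = []
    for component in components:
        if not component.multitenant:
            # Dedicated deployment: stop it so no background task edits data.
            steps.append(
                f"kubectl scale deployment/{component.name} "
                f"--replicas=0 --namespace {tenant}"
            )
        # Shared deployments keep running for the other tenants; in both
        # cases a new pod is added in the tenant namespace.
        steps.append(f"deploy {component.name} pod in namespace {tenant}")
    # A single higher-priority Ingress sends the tenant's traffic to the new pods.
    steps.append(f"create higher-priority Ingress in namespace {tenant}")
    return steps


for step in takeover_plan(
    "tenant-a",
    [Component("api-backend", multitenant=False), Component("frontend", multitenant=True)],
):
    print(step)
```

The key design point is that only dedicated deployments are ever scaled down, so other tenants served by a shared component are never affected.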

To leverage the live update feature and avoid rebuilding on every change, the images are not the same as the production ones. For example, our frontend image uses a Node.js-based image to run the Vite server instead of serving compiled assets with nginx. Every step above is described in the Tiltfile, either using built-in functions or by calling custom scripts.

tilt up -- --tenant-name tenant-a

For obvious reasons, we don't allow tilting directly onto the production tenant. Instead, we first make a copy of it. This process would merit its own blog post but, in essence, each of our production tenants has a corresponding staging tenant in a lower environment. We can copy all the tenant data on demand and import it into the staging tenant with a layer of anonymisation for personal information like emails, names and avatars. This lets our developers investigate issues and debug them in their development environment with real data, making the reproduction of bugs much easier.
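As a rough idea of what such an anonymisation layer can look like, here is a minimal Python sketch. It is not our actual import pipeline: the record shape, the field names (`email`, `name`, `avatar_url`) and the salted-hash approach are assumptions chosen for illustration.

```python
import hashlib


def pseudonym(value: str, salt: str) -> str:
    """Return a stable token derived from a personal value and a secret salt."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def anonymise_user(user: dict, salt: str) -> dict:
    """Copy a user record, replacing personal fields and keeping everything else."""
    token = pseudonym(user["email"], salt)
    cleaned = dict(user)
    cleaned["email"] = f"user-{token}@example.invalid"
    cleaned["name"] = f"User {token[:6]}"
    cleaned["avatar_url"] = None  # drop the picture instead of copying it
    return cleaned


original = {
    "id": 42,
    "email": "jane.doe@customer.example",
    "name": "Jane Doe",
    "avatar_url": "https://cdn.example/avatars/42.png",
    "skills": ["python", "kubernetes"],
}
print(anonymise_user(original, salt="per-copy-secret"))
```

Deriving the replacement values from a hash keeps them stable within one copy, so two records that referred to the same person still match after the import, while non-personal fields such as skills stay untouched for debugging.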

Each developer has their own tenant in the dev cluster so they can use Tilt for day-to-day development work. The code runs on the remote cluster while their IDE stays on their local laptop. To limit the cost of this environment, we run everything on Azure spot instances. Not everyone at 365Talents uses Tilt as their standard development workflow; some prefer the more local solution based on Devcontainers. Either way, everyone uses Tilt when they need to debug things on a staging tenant.

Advantages

Running code on realistic data

Because our customers have different use cases and don't all use the same features, their data is highly heterogeneous. The main advantage for us is that a developer can plug into a specific tenant and edit, in real time, the code that runs on a realistic dataset.

Production similarity

When developing with Tilt, your setup ends up closer to what is happening in production than if you were on your laptop. The code runs in the same place and almost in the same image. The requests go through the same network components.

Fewer resources used on laptop

Another benefit is that developer laptops consume fewer resources, since every service runs remotely in our cloud. This leaves more resources available for the IDE and linter. While this is not a major concern for our web developers working on the stack presented in this article, it matters more for our data science team. They have also started using Tilt for their own stack, which runs 6 different data services that are hungry for memory and computing power.

Difficulties

Wrapper script to switch context

One of Tilt's limitations, for security reasons, is that once Tilt is started, it cannot switch to another Kubernetes context. For our use case, we had to create a wrapper script so that we only need to specify the tenant we want to tilt to, without worrying about which cluster it's on. The script simply finds the corresponding cluster based on the requested tenant and starts Tilt with the correct Kubernetes context.
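A wrapper of this kind can be very small. The Python sketch below is an illustration, not our real script: the `TENANT_CONTEXTS` mapping and the context names are hypothetical (a real wrapper would look the tenant up in whatever inventory is available), and it assumes the kubeconfig context is selected with Tilt's `--context` flag.

```python
import subprocess
from typing import List

# Hypothetical tenant -> kubeconfig context mapping.
TENANT_CONTEXTS = {
    "tenant-a": "staging-cluster",
    "dev-alice": "dev-cluster",
}


def tilt_command(tenant: str) -> List[str]:
    """Build the `tilt up` command line for the cluster hosting the tenant."""
    if tenant not in TENANT_CONTEXTS:
        raise ValueError(f"unknown tenant: {tenant}")
    context = TENANT_CONTEXTS[tenant]
    # Everything after `--` is passed through to the Tiltfile.
    return ["tilt", "up", "--context", context, "--", "--tenant-name", tenant]


def run(tenant: str) -> int:
    """Start Tilt against the right cluster and return its exit code."""
    return subprocess.call(tilt_command(tenant))


print(" ".join(tilt_command("tenant-a")))
```

Because the context is fixed on the command line before Tilt starts, the developer only ever types the tenant name and cannot accidentally point a session at the wrong cluster.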

Slow cold start

This one comes from our all-remote approach: pushing the image to the repository depends on network speed and can take a while. Since we first introduced Tilt, there have been multiple iterations and attempts to improve this.

These include optimizing the Dockerfile order to reuse existing layers, tweaking the Docker config to use the repository as a build cache, and even removing npm dependencies from the image to either download them at startup or store them in a PVC. Each comes with its own drawbacks, and overall this is still a persistent problem. It shouldn't matter much as long as we don't do a full rebuild very often...

Too many uncontrolled full image rebuilds

Despite our best efforts, there are still too many times when a full rebuild is triggered. This mainly occurs when switching branches with lots of changes. This is a persistent pain point that we haven't truly solved yet.

Initial friction with Kubernetes

Having control over their development environment is very important for developers. When we introduced Tilt, we had only recently finished migrating all our services to Kubernetes, so developers were still unfamiliar with some Kubernetes concepts. In the end, working with Tilt helped them become more comfortable with Kubernetes. They have more control over it now and are the ones editing and improving the Tiltfile.

Conclusion

Despite the challenges, Tilt has become an integral part of our development workflow. It lets us target a single tenant and edit the code running live inside our Kubernetes cluster. Combined with our process for copying production data, it allows developers to debug issues easily on realistic data, from the comfort of their local IDE. We were able to make it fit our needs thanks to its great customizability.

References

Tilt - The official Tilt website with documentation, tutorials, and download links
365Talents - Adaptive Skills & Talent Intelligence for ambitious HR.
365Talents Jobs - Join us !
