This article was originally published on BookMyShow Engineering
A brief post explaining our efforts towards a more modular platform.
If you’ve followed the infrastructure ecosystem for a while, you might’ve come across different ways to manage applications, described with terms like containers and orchestration. A quick Google search will then lead you to technologies like Docker, Kubernetes, Mesos, LXC, Helm charts, etc. These are touted as the next big things in the intersecting worlds of DevOps and SRE, as they can help fix a lot of problems when done the right way.
So, what is this all about?
While the BookMyShow platform has had Kubernetes running in production for over 3 years for the majority of its microservices, one thing was missing: widespread adoption by engineering teams.
Some frontends, APIs and scheduled jobs were still running on Linux virtual machines (VMs) that few people were aware of, left on autopilot because development on certain projects had stopped.
Why did we need this?
- VMs limit auto-scaling of systems during high traffic (think big movie/event releases), as bootstrapping a new VM takes a while.
- They also had issues such as certain packages/configs being set up manually and periodic problems in relaying logs (disk space alerts, anyone?), and in most use-cases they were over-provisioned.
- Moving these workloads to Kubernetes would help standardize the application build, deployment & maintenance life-cycle by requiring effort on only a single platform.
How is it set up?
At BookMyShow, we use Docker to build images that are pushed into an internal registry, Bamboo to build these applications in a CI pipeline with detailed visibility & audit logs, and Helm to package those apps as Charts. A custom wrapper script then takes the Helm charts & deploys them into Kubernetes clusters.
This script also injects a list of necessary parameters, such as ENV variables based on the environment, NewRelic and/or Elastic APM config for application monitoring, and Coralogix config for logs. It also adds a few Kubernetes-related parameters with smart defaults, such as replicaCount, HPA and keepSecrets, while the rest (CPU/Memory limits) can be configured in the individual Helm charts as & when required.
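The wrapper script itself is not shown in this post. As a rough illustration of the idea only, here is a minimal Python sketch that merges smart defaults with per-app overrides and turns them into a `helm upgrade --install` command. The function name, the value keys (`hpa.enabled`, `env.APP_ENV`) and the default values are assumptions made for this example, not the actual script.

```python
import shlex
from typing import Dict, List


def build_helm_command(
    release: str,
    chart: str,
    namespace: str,
    env_name: str,
    overrides: Dict[str, str],
) -> List[str]:
    """Build a `helm upgrade --install` command with injected parameters."""
    # Hypothetical smart defaults; the real keys and values are not in the post.
    values = {
        "replicaCount": "2",
        "hpa.enabled": "true",
        "keepSecrets": "true",
        "env.APP_ENV": env_name,
    }
    # Per-app overrides win over the defaults.
    values.update(overrides)

    cmd = ["helm", "upgrade", "--install", release, chart, "--namespace", namespace]
    # Sort the keys so the generated command is stable between runs.
    for key in sorted(values):
        cmd += ["--set", f"{key}={values[key]}"]
    return cmd


if __name__ == "__main__":
    command = build_helm_command(
        "web", "./charts/web", "prod", "production", {"replicaCount": "4"}
    )
    # Print instead of executing, so the sketch is safe to run anywhere.
    print(shlex.join(command))
```

Keeping the defaults in one place means an individual chart only needs to state what differs for that app.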
Challenges Faced and Fixed
What's any migration without challenges? While the majority of transitions from VMs to Kubernetes went smoothly, there was a lot of additional knowledge we acquired along the way. Here's a brief list:
NodeJS applications do not respect container resource limits and instead look at the host's compute resources. For example, if a Kubernetes worker node has 16 CPUs and a NodeJS app is deployed as is, it will start and spawn 16 worker processes. This causes a severe performance impact when the app has a CPU usage limit, while also starving other apps of the CPU they need. Our findings led us to a fix that we implemented as an ENV variable defined per app, rather than hard-coding a specific number.
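The logic behind that fix is language-agnostic, so here is a minimal sketch of it in Python rather than NodeJS. The variable name `WORKER_COUNT`, the fallback of one worker and the cap at the host CPU count are assumptions for illustration; the post does not name the actual variable or its default.

```python
import os
from typing import Mapping


def resolve_worker_count(
    env: Mapping[str, str], host_cpus: int, default: int = 1
) -> int:
    """Pick the worker count from a per-app ENV variable.

    WORKER_COUNT is a hypothetical name. When it is unset or invalid we fall
    back to a small default instead of the host CPU count, which is what a
    container with a CPU limit must not rely on.
    """
    raw = env.get("WORKER_COUNT")
    if raw is None or not raw.strip():
        return default
    try:
        requested = int(raw)
    except ValueError:
        return default
    # Never go below one worker or above what the host can actually run.
    return max(1, min(requested, host_cpus))


if __name__ == "__main__":
    workers = resolve_worker_count(os.environ, os.cpu_count() or 1)
    print(f"starting {workers} worker process(es)")
```

A NodeJS app would apply the same rule before forking its workers, so the count follows the pod's CPU limit rather than the node's 16 CPUs.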
Another challenge was applications showing relative slowness after the migration, which we tackled by setting appropriate timeouts on certain datastores like Redis. Reviewing the Redis documentation on TCP keepalives and client timeouts, among others, helped us identify and roll out a fix quickly.
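The post does not say which settings or values were changed. As a sketch of the client-side half of the same idea, the snippet below uses only the Python standard library to open a TCP socket with an explicit timeout and TCP keepalive enabled. The function name and the numbers are illustrative assumptions, and a real application would normally set the equivalent options through its Redis client library.

```python
import socket


def open_keepalive_socket(timeout_s: float = 2.0, idle_s: int = 60) -> socket.socket:
    """Create an unconnected TCP socket with a timeout and keepalive enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Fail fast instead of hanging forever on a dead or half-open connection.
    sock.settimeout(timeout_s)
    # Ask the kernel to probe idle connections so stale ones are detected.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is not available on every platform, so guard its use.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    return sock


if __name__ == "__main__":
    s = open_keepalive_socket()
    print("timeout:", s.gettimeout())
    s.close()
```

On the server side, Redis exposes the `timeout` and `tcp-keepalive` configuration directives for the same purpose.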
The Result?
Right after the migrations:
- We leveraged Kubernetes to ensure the appropriate resource limits for each app, auto-scaling up & down only when necessary via the Horizontal Pod Autoscaler (HPA).
- Another benefit was freeing up much-needed computing capacity (approximately 25% savings) for other use-cases.
- All of this came with built-in logging/monitoring, streamlined deployments & a move towards the famous principle that infrastructure should be cattle, not pets.
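To make the "only when necessary" part concrete, here is a small Python sketch of the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The metric values and the min/max bounds below are made-up numbers, not BookMyShow's settings.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA decision for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(desired, max_replicas))


if __name__ == "__main__":
    # Illustrative numbers: 4 pods at 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
```

The tolerance band is what prevents pods from being added and removed on every small fluctuation in load.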
While Kubernetes is known to be complex with multiple moving parts and has a steep learning curve, it certainly eases the aforementioned issues if implemented correctly.
We hope the learnings in this post guide you one step closer towards a more modular infrastructure. Do watch this space for more such experiences. In the meantime, you can also follow us on Facebook, Twitter, Instagram and LinkedIn.
Thanks to the BookMyShow Site Reliability Engineering Team for their helpful insights.
Curious to solve problems like these? Come join us onboard at BookMyShow - India’s premier platform for everything in entertainment.