The big break-up: an independent order fulfillment service

#python #django #monolith #multiservice

Together, João Nadas, Raymond Penners, Kostas Petrakis, Joris van Rooij, Marlin Cremers, and Yunier Rojas make up the Infrastructure team. “Infrastructure” is a bit of a misnomer, though, since the team isn't really focused on what is commonly thought of as infra. They’re all Backend Developers, and João is the Product Owner. Raymond has been with Sendcloud since 2016, back when the development team was just 5 people.

This bunch took on the momentous task of extracting a large chunk from the monolith - essentially splitting it in half - to live on its own. No easy feat, as they soon discovered.

João summarizes: the Sendcloud platform is essentially everything we have, including background tasks neatly, and tightly, coupled in. And for a small team, moving fast, a monolith works. However, with 70 developers working across different teams at Sendcloud today, you hit a breaking point.

The scope was clear: keep a bunch of functionality in the monolith, but take the orders fulfillment service out.

The scope

In Python you can do imports from all directions, which may cause cyclic dependencies. When connecting everything is so native to a language, it’s extra painful to break it up. And we wanted a true split, not just putting code in 2 repositories. We wanted to really decouple services. Once you’re bigger, you can only move fast when you don't have too many inter-dependencies.

Why split out the orders fulfillment service, you ask? It’s a natural split, coherent. Coincidentally it makes up about half of the service Sendcloud provides. The infrastructure team was brought to life for the purpose of creating the infrastructure for the two services to talk to each other.

How did we get here?

Less than 1.5 years ago we had 3 teams; today we’re at 9 product teams and 2 platform teams. At that size, merging code needs queues and communication. Moving towards a microservices architecture improves velocity. It allows teams to make optimizations without depending on libraries.

The database for the orders is under heavy load at any given time. When that processing runs while the rest of the platform also needs to interact with the database, you start hitting limits.

To further stress how decoupling was a necessity: our test suites took very long to run. A deploy to a branch would easily take 30 to 45 minutes. It took 1.5 hours before changes showed up in production, which is simply unacceptable for a hotfix.

But the ordeal was something like transforming a car (Sendcloud) while it’s driving, with feature teams adding functionality along the way. And then the Infrastructure team swoops in and splits it in half, making it functionally 2 cars, without ever stopping the car. Continuing the metaphor, other services were built in the meantime, causing whole other cars to join the fleet, orbiting the original car closely, again sharing resources.

The groundwork

In late 2018 we had already split up the database, but we stuck with the monolith. It wasn’t the right time to split up the codebase. In late 2019 some initial work was done, but more as a side activity, prompted by how our pipelines started to hurt.

We created an orders directory, put all related code in there and out of the monolith, and tried to make it work again. In Python it's quite easy to make a big bowl of spaghetti code work. To make sure we weren’t taking shortcuts we tracked these cyclic imports (“forbidden imports”) and corrected those one by one by putting in place APIs. Whenever a developer created a new feature that introduced new forbidden imports, the pipeline failed on purpose.
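The "forbidden imports" check described above could look something like the following. This is a minimal sketch, not our actual pipeline code: the package names in FORBIDDEN_PREFIXES and the function name find_forbidden_imports are hypothetical, and it only looks at absolute imports.

```python
import ast

# Hypothetical rule set: code in the orders directory must not import these packages.
FORBIDDEN_PREFIXES = ("monolith.billing", "monolith.accounts")


def find_forbidden_imports(source, forbidden=FORBIDDEN_PREFIXES):
    """Return the imported module names in `source` that fall under a forbidden prefix."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b, c" -> ["a.b", "c"]
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> ["a.b"]; relative imports are skipped in this sketch
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not look-alike names.
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                hits.append(name)
    return hits
```

In a pipeline, a script along these lines would be run over every file in the orders directory and exit with a non-zero status as soon as the list of hits is not empty, which is what makes the build fail on purpose.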

We had to build some missing infra bits, prerequisites, and non-functionals. Stuff that isn’t vital to the business and easily gets pushed down on the backlog. Which is exactly what had happened before. From a management perspective, the big split did not get the priority it should have had. We worked on this while also shipping other code and projects. For example: at this time the DevOps team changed from Terraform to Kubernetes, and we needed to decide what we wanted to go with. We went with Kubernetes, but even though everything was already dockerized, it still added work to our backlog.

Crunch time!

We set ourselves an artificial target date to push the project, the 5th of November, including 5 working days to get it deployed. Sendcloud handles hundreds of thousands of orders every day, so we can’t afford downtime. That’s why we need to deploy services live. We do so today by serving traffic to both EC2 and Kubernetes, split for gradual rollout.

Like the rest of the world, we were all working from home, but because this was all hands on deck we came together in the office for these few days. Even the CEO checked in on us and encouraged us to not play so safe! To motivate ourselves further, we adopted the V for Vendetta theme: remember remember the 5th of November!

We did a soft release to our Pre-prod environment the week before, to check that everything worked. We went so far as to DDoS our test endpoint by running watch -n .01 curl test_endpoint, as well as running HTTP load tests on our test environment. Once we went live, legacy got all the traffic at first. Then, when we put the ingress load balancing to work, traffic gradually switched to the other service.
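To illustrate the idea of a gradual switch, here is a small simulation of weighted routing between the legacy monolith and the new service. It is only a sketch under assumed names (pick_backend, simulate_rollout, and the ramp of weights are all hypothetical); the real traffic split was done by the ingress load balancing, not by application code like this.

```python
import random


def pick_backend(new_service_weight, rng):
    """Route one request to "orders" with probability new_service_weight, else "monolith"."""
    return "orders" if rng.random() < new_service_weight else "monolith"


def simulate_rollout(weights, requests_per_step=10_000, seed=0):
    """For each weight in the ramp, return the observed share of traffic on the new service."""
    rng = random.Random(seed)
    shares = []
    for weight in weights:
        hits = sum(
            pick_backend(weight, rng) == "orders" for _ in range(requests_per_step)
        )
        shares.append(hits / requests_per_step)
    return shares


if __name__ == "__main__":
    # Hypothetical ramp: legacy gets everything first, then traffic shifts step by step.
    ramp = [0.0, 0.1, 0.5, 1.0]
    for weight, share in zip(ramp, simulate_rollout(ramp)):
        print(f"target {weight:.0%} -> observed {share:.1%}")
```

Setting the weight back to 0 sends everything to the monolith again, which is the kind of escape hatch a parallel setup gives you.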

We set up heavy monitoring in advance to track what happened while we made changes. Our monitoring stack is Datadog and Sentry. The orders service includes background tasks, event consumers, public API endpoints, and XHR endpoints for the frontend, all of which needed to be moved. During the whole procedure we were in close contact with support, so we could roll back if customers complained.

We didn’t strictly make the November 5 deadline, so we directed all traffic back to the monolith for the weekend, since we still supported that setup in parallel. Monday is the busiest day of the week for Sendcloud. It’s when our customers collect weekend orders, so we couldn't just get cracking at 8am after the weekend.

The following week, the new service deployed with flying colours in 2 days. Zero split-related issues were recorded. What remained was fixing non-functionals that were not related to the production service.

Coincidentally, on that Friday, something was deployed that broke one of our other services, and everybody was pointing the finger at us. People were spooked by the issues the split could cause. It must have been the time of the year*. Even with a small delay, we pushed before the busiest time of the year (Black Friday / Cyber Weekend), or else we’d have had to wait until 2022.

To celebrate the achievement, everyone in the team received a nice bottle with a thematic label (featured on the cover of this post), showing the graph that measured their progress during a large chunk of the split.

What’s next?

We advocate heavily for standalone services that are only very loosely coupled. While we were working on the orders fulfillment service split we already started splitting up more services. The Account service split only took 3 sprints (6 weeks). With the “big breakup” on track, we felt confident about separating out more, smaller services.

Everything went well on Cyber Weekend - in fact, Sendcloud hit its record high in terms of number of shipments in a day!

Now with this milestone reached, we will focus on creating the building blocks for making our architecture more horizontally scalable. More on that soon.

Looking to join a company that tackles interesting engineering challenges like the one described above? Sendcloud is hiring! Also check out our events on meetup.com. We look forward to meeting you - virtually - soon!

*Halloween got our colleagues jumpy it seems.