There is an endless supply of blog posts, white papers, and slide decks evangelizing the virtues of microservices. They talk about how microservices “increase agility,” are “more scalable,” and promise that when you make the switch, engineers will be pounding at your office door looking for a job.
Let’s be clear: though occasionally exaggerated, the benefits of microservices can be real. Particularly for large organizations with many teams, microservices can make a lot of sense. However, microservices aren’t magic: for all of their benefits, they come with significant drawbacks. In this post, I’ll describe how the distributed nature of microservices makes them inherently more complex.
A distributed system is any collection of computers that work together to perform a task. Microservices are simply a type of distributed system designed for delivering the backend of a web service.
Since the early days of distributed systems research in the 1970s, we’ve known that distributed systems are hard. From a theoretical perspective, the difficulty mostly arises from two key areas: consensus and partial failure.
The fundamental issue with building workable distributed systems comes down to the problem of consensus: agreement on distributed state. Nearly all distributed systems research attempts to grapple with this problem in some way. Paxos, Raft, Vector Clocks, ACID, Eventual Consistency, MapReduce, Spark, Spanner, and most other significant advances in this area are all fiddling with the tradeoff between strong consensus and performance.
To better understand the problem of distributed consensus, let’s illustrate with an example. Suppose Bob asks Server_1 to write x=5 while, concurrently, another client asks Server_2 to write x=6. What is the value of x? Naively, one could look at the time x=5 occurred and the time x=6 occurred, and choose whichever happened last. But how do you determine the time the writes happened? Look at a clock? Whose clock? How do you know that clock is accurate? How do you know Server_1 and Server_2 agree with that clock? Clocks are notoriously out of sync, and (as Albert Einstein taught us) that's not fixable. For that matter, does everyone really need to agree on the value of x? If so, how much agreement? How long should the agreement take? What if Bob dies while trying to reach agreement? It gets complicated.
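To make the clock problem concrete, here is a minimal sketch in Python of the naive "latest timestamp wins" approach. Everything in it is hypothetical: the `Write` class, the `clock_skew` field, and the numbers are assumptions chosen for illustration, and a real system never gets to see `true_time` at all.

```python
from dataclasses import dataclass


@dataclass
class Write:
    server: str
    value: int
    true_time: float   # when the write really happened (unknowable in practice)
    clock_skew: float  # hypothetical error of that server's local clock, in seconds

    @property
    def local_timestamp(self) -> float:
        # Each server can only stamp a write using its own, possibly wrong, clock.
        return self.true_time + self.clock_skew


def last_write_wins(writes):
    """Naive resolution: keep the write with the largest local timestamp."""
    return max(writes, key=lambda w: w.local_timestamp)


# Server_1 writes x=5 first; Server_2 writes x=6 half a second later.
w1 = Write("Server_1", 5, true_time=100.0, clock_skew=2.0)  # clock runs 2s fast
w2 = Write("Server_2", 6, true_time=100.5, clock_skew=0.0)

winner = last_write_wins([w1, w2])
truly_last = max([w1, w2], key=lambda w: w.true_time)

print("chosen by timestamps:", winner.value)
print("actually written last:", truly_last.value)
```

With these assumed skews, the timestamp comparison keeps x=5 even though x=6 was written later. Nothing inside the system can detect the mistake, because the only evidence available is the skewed timestamps themselves.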
So, given that distributed consensus is hard, how does this problem manifest in the context of microservices? Good microservice implementations tend to sidestep the issue altogether by simply disallowing shared state. In the case of our above example, there exists no x such that two microservices need to agree on the value of x at any particular point in time. Instead, all shared state in the system is punted to an external database or the container orchestrator.
This approach both does and doesn’t solve the consensus problem. It doesn’t solve the problem in the sense that, from a theoretical perspective, there still is shared state that still requires management. You’ve just moved it. By the way, this is why Kubernetes and databases are so darn complicated.
The approach does solve the problem in that, from a practical perspective, Kubernetes and databases are better at managing shared state than most microservices. Those systems are designed by engineers who spend all day every day thinking about these issues. As a result, they’re more likely to get consensus right.
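As a rough illustration of what "punting" shared state looks like, the sketch below lets a database arbitrate two conflicting writes to x. It assumes an in-memory SQLite database standing in for the external store, and it uses a version column (optimistic concurrency) as one possible arbitration scheme. The table layout and the helper names `read` and `write_if_unchanged` are made up for this example.

```python
import sqlite3


def make_store():
    # In-memory SQLite stands in for the external database in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kv (key TEXT PRIMARY KEY, value INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO kv VALUES ('x', 0, 0)")
    conn.commit()
    return conn


def read(conn, key):
    """Return (value, version) for a key."""
    return conn.execute(
        "SELECT value, version FROM kv WHERE key = ?", (key,)
    ).fetchone()


def write_if_unchanged(conn, key, new_value, expected_version):
    """Apply the write only if nobody else has written since we last read."""
    cur = conn.execute(
        "UPDATE kv SET value = ?, version = version + 1 "
        "WHERE key = ? AND version = ?",
        (new_value, key, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


store = make_store()

# Two services read x at the same version, then both try to write.
_, seen_by_a = read(store, "x")
_, seen_by_b = read(store, "x")

a_ok = write_if_unchanged(store, "x", 5, seen_by_a)
b_ok = write_if_unchanged(store, "x", 6, seen_by_b)

final_value, final_version = read(store, "x")
print(a_ok, b_ok, final_value, final_version)
```

Neither service needs to know about the other or trust anyone's clock. The database decides which write lands, and the loser finds out and can re-read and retry. The hard problem has not gone away; it now lives inside the store.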
Consider an HTTP request serviced by a monolith. When the request is received, a single server handles the transaction from beginning to end. If there is a problem, be it a software bug or hardware failure, the entire monolith crashes – every failure is a total failure.
Now consider the same HTTP request coming into a microservice. That microservice may send new requests to other microservices that, in turn, may generate more requests going to yet more microservices. Now suppose one of those microservices fails. What now? One or more microservices are depending on the data that microservice was preparing. What should they do? Wait for a while? How long? Try again? Try someone else? Who else? Give up and do their best with the data they’ve got? Microservices must be engineered to handle these issues, again making them more challenging to develop.
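Here is a small sketch of the kind of code those questions force every caller to write: retry a few times with backoff, then give up and return a degraded answer. The downstream service is simulated in-process, and `DownstreamError`, `call_with_retries`, and `make_flaky_service` are hypothetical names invented for this example. Per-call timeouts, which a real caller would also need, are left out to keep it short.

```python
import time


class DownstreamError(Exception):
    """Raised by the hypothetical downstream call when it fails."""


def call_with_retries(call, attempts=3, base_delay=0.01, fallback=None):
    """Try `call` up to `attempts` times with backoff; return `fallback` if all fail."""
    for attempt in range(attempts):
        try:
            return call()
        except DownstreamError:
            if attempt < attempts - 1:
                # Exponential backoff: wait longer after each failure.
                time.sleep(base_delay * (2 ** attempt))
    return fallback


def make_flaky_service(failures_before_success):
    """Build a fake downstream service that fails a fixed number of times first."""
    state = {"calls": 0}

    def service():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise DownstreamError("downstream unavailable")
        return {"recommendations": ["a", "b"]}

    return service, state


empty = {"recommendations": []}

# A service that recovers on the third try: the retries hide the failure.
recovering, recovering_state = make_flaky_service(failures_before_success=2)
result_ok = call_with_retries(recovering, attempts=3, fallback=empty)

# A service that stays down: the caller gives up and degrades gracefully.
dead, dead_state = make_flaky_service(failures_before_success=10)
result_degraded = call_with_retries(dead, attempts=3, fallback=empty)

print(result_ok, result_degraded)
```

Even this toy version has three knobs (attempt count, delay, fallback value) that someone has to choose and tune for every dependency. None of them exist in a monolith.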
Partial failure has been described as an unqualified good thing. The thinking goes, by supporting partial failure, an application becomes more resilient – small problems can be papered over gracefully. In my opinion, the benefits are small, rarely obtained in practice, and come at the expense of vastly increased implementation complexity.
In addition to the theoretical challenges, there are also simply a lot of microservices to deal with. Having so many moving pieces complicates nearly every part of the stack and every part of the software development lifecycle.
You can typically run a monolith directly on your laptop. Getting microservices to work on a local machine requires more specialized tools such as docker-compose and minikube. Furthermore, microservices are CPU and memory intensive, making them painfully slow on a laptop. For a detailed description of this problem, check out Kelda, and specifically our whitepaper.
Everything happening in a monolith occurs in a single process. You can attach the debugger of your choice, and you are off to the races. With microservices, a single request may be spread across dozens of different processes. Distributed tracing tools like Jaeger may help, but it’s still a challenge.
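The core idea behind tracing is simple enough to sketch: tag the incoming request with an ID and forward that ID on every downstream call, so the hops can be stitched back together later. The example below simulates three services as plain functions in one process. The header name `X-Trace-Id`, the service names, and the `trace_log` list standing in for a collector are all assumptions for illustration, not how Jaeger itself works.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch
trace_log = []               # stands in for a central trace collector


def record(service, headers, message):
    """Report one event, tagged with the trace ID carried in the headers."""
    trace_log.append((headers[TRACE_HEADER], service, message))


def inventory_service(headers):
    record("inventory", headers, "checked stock")
    return {"in_stock": True}


def pricing_service(headers):
    record("pricing", headers, "computed price")
    return {"price": 42}


def frontend_service(headers):
    # Reuse the caller's trace ID if present; otherwise start a new trace.
    headers = dict(headers)
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    record("frontend", headers, "received request")
    # Every downstream call must forward the same headers.
    stock = inventory_service(headers)
    price = pricing_service(headers)
    return {**stock, **price}, headers[TRACE_HEADER]


response, trace_id = frontend_service({})

# Reconstruct the path of this one request from the collected events.
hops = [service for tid, service, _ in trace_log if tid == trace_id]
print(trace_id, hops)
```

The catch is that every service on the path has to cooperate. If a single hop forgets to forward the header, the trace breaks there, which is part of why debugging stays hard even with good tools.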
With a monolith, you can store logs in a file and grab them when needed. With microservices, you need a tool like Splunk or the ELK stack to handle this for you.
Simple on-server monitoring tools like Nagios don’t scale when you’ve got hundreds of microservices. Again, better tools (Prometheus/Datadog/Sysdig, etc.) make the problem tractable, but it’s still hard.
Tools like Chef and Puppet are good enough for deploying a monolith, but for microservices, you need something much more sophisticated like Kubernetes.
Monoliths can be handled with a simple load balancer. Microservices have many more endpoints, all of which require load balancing, service discovery, consistent security policy, etc. I suppose a service mesh can help with this (I’m not convinced, but that’s a topic for a future post).
From a technical perspective, microservices are strictly more difficult than monoliths. However, from a human perspective, microservices can improve the efficiency of a large organization. They allow different teams within a large company to deploy software independently. This means that teams can move quickly without waiting for the slowest common denominator to get their code QA’d and ready for release. It also means that there’s less coordination overhead between engineers, teams, and divisions within a large software engineering organization.
And while microservices can make sense, the key point here is that they aren’t magic. Like nearly everything in computer science, there are tradeoffs: in this case, trading technical complexity for organizational efficiency. It’s a reasonable choice, but you had better be sure you need that organizational efficiency for the technical challenges to be worth it.
A note on the Einstein remark: yes, of course, most clocks on earth aren’t moving anywhere near the speed of light. Furthermore, several modern distributed systems (notably Spanner) rely on this fact, using extremely accurate atomic clocks to sidestep the consensus issue. Still, these systems are themselves extremely complicated, proving my point: distributed consensus is hard.