Before talking about Service Meshes, I’d like to review how we got here. The first turning point was when people started adopting a different architecture that was designed for many small teams and the distributed nature of cloud computing: in other words, a microservice architecture.
Before this architectural proposal, systems were usually designed as giant monoliths. But as the software grew, the teams sharing the same codebase grew too, increasing its complexity until the codebase became unmanageable. Meanwhile, with the rising popularity of cloud computing, it became desirable to have scalable applications that could handle a low number of requests at minimal cost while still being able to serve many requests when needed.
A microservice architecture splits the giant monolith into smaller services, each maintained by a single team and exposing its functionality to other services via a communication protocol (usually REST APIs). Unlike a monolith, smaller services can be quickly deployed and duplicated to handle more requests without the large memory and CPU footprint of replicating the entire system.
However, microservices heavily increase operational complexity: platform logs end up scattered across several services, and a service must discover the location of its dependencies before sending any message, a primitive form of communication that can be exploited without much effort.
The first step to reducing operational overhead was the idea of container orchestrators, such as Kubernetes. These intricate pieces of software can manage all platform services, their versions, new deployments, networking, and any capacity increase. However, they are not designed to address issues regarding service communication and discovery.
Service meshes propose a cheaper way to handle this operational burden without removing the advantages of microservices, as I will show in the following sections.
So, the problem surrounding a microservice architecture is the complexity it creates, as every service is an independent application. Each one demands continuous integration and continuous deployment configuration, service discovery, and its own security, networking, and monitoring layers.
In a containerized environment, each team or squad responsible for a service can develop it using a different set of solutions by choosing its programming language and libraries. However, IT departments need to dictate some practices to handle all of these operations with minimal effort.
Microservices, especially those running on a container orchestration platform (like Kubernetes), can increase or decrease their number of instances to answer requests within a reasonable time. If the number of requests grows beyond what the current instances can handle in that time, the container orchestrator will create new instances of the service.
The incoming requests are handled by a load-balancing service that acts as a front door for all instances and chooses the best instance to serve a given request.
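As a minimal sketch of that front door, a balancer can simply cycle through a pool of instances; the addresses below are made up for illustration, and real load balancers also track instance health:

```python
import itertools


class RoundRobinBalancer:
    """Toy front door for a pool of service instances: cycles through
    them so requests are spread evenly across the pool."""

    def __init__(self, instances):
        # itertools.cycle repeats the instance list forever
        self._cycle = itertools.cycle(instances)

    def next_instance(self):
        """Return the instance that should serve the next request."""
        return next(self._cycle)
```

A caller would ask the balancer for an instance before each request, e.g. `RoundRobinBalancer(["10.0.0.1", "10.0.0.2"]).next_instance()`.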
The problem with remote services is that they may fail at some point. It gets worse: if a service is failing with timeouts, every service calling it must wait for the timeout to elapse before receiving an error. The calling service may then run out of critical resources as requests pile up on hold, creating a cascading failure across your entire infrastructure. A circuit breaker is simply a pattern that, once it detects a timeout issue, immediately returns an error for subsequent calls instead of making each one wait out the timeout.
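The pattern can be sketched in a few lines; the thresholds below are illustrative, and production implementations add a proper half-open state, metrics, and per-endpoint configuration:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures, calls fail fast for `reset_timeout` seconds instead of
    waiting on a slow, failing dependency."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # timeout elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Wrapping every remote call in `breaker.call(...)` turns a slow, repeated timeout into an instant error once the failure threshold is crossed.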
With multiple services running, it’s hard to discover where they’re located. The dependencies between multiple services are not always easily found, and new services may be deployed with a new dependency on an older service. Those services can be deployed anywhere in the infrastructure, so what you need is a Service Discovery service. There are plenty available, such as Netflix Eureka or HashiCorp Consul.
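At its core, such a service maps service names to instance addresses so callers never hard-code locations; a toy in-memory version (with hypothetical names and addresses) might look like this, while tools like Eureka and Consul add health checks, replication, and expiry on top:

```python
class ServiceRegistry:
    """Toy in-memory service registry: instances register under a
    service name, and callers look the name up at request time."""

    def __init__(self):
        self._services = {}  # service name -> list of addresses

    def register(self, name, address):
        self._services.setdefault(name, []).append(address)

    def deregister(self, name, address):
        self._services.get(name, []).remove(address)

    def lookup(self, name):
        """Return all known instances of a service, or fail loudly."""
        instances = self._services.get(name)
        if not instances:
            raise LookupError(f"no instances registered for {name!r}")
        return list(instances)
```

Each service instance registers itself on startup and deregisters on shutdown, so dependents always resolve a current address.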
If the communication is not encrypted, all data transferred between services can be read by external, and sometimes malicious, tools. Usually, the services operate on the same network, which makes it even easier to intercept a communication. Plus, since it's not common to restrict access between services that aren't supposed to communicate, an attacker can exploit this gap to force communication between them.
Monitoring microservices is also much harder. Since each service runs its own codebase, you have to establish a monitoring convention across all of them. Tracing requests is also much more challenging due to the distributed nature of microservices: a single request can trigger many services, so it can be hard to understand where a failure happened without tracing every service involved.
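A common building block for such tracing is propagating a single correlation ID across every hop of a request. The sketch below uses a made-up header name; real systems follow standards such as the W3C `traceparent` header:

```python
import uuid

# Illustrative header name, not a standard one
TRACE_HEADER = "x-trace-id"


def ensure_trace_id(headers):
    """Reuse the caller's trace ID if present, otherwise start a new
    trace, so every service touched by one request logs the same ID."""
    headers = dict(headers)  # don't mutate the caller's mapping
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return headers
```

If every service forwards these headers on its outbound calls, grepping the logs for one ID reconstructs the full path of a request.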
So, what is a Service Mesh and how can it solve all these issues? I describe it as a high-throughput, dedicated infrastructure layer focused on service-to-service communication. A Service Mesh can manage all communication between services, easily enabling all the features described above, as I will show below.
Most Service Meshes use a sidecar solution: each service is paired with a secondary service that adds a communication layer between the application and the other services. This avoids platform-specific solutions that depend on libraries or technologies embedded in the application. Applying a sidecar is particularly easy in Kubernetes environments and usually does not demand changes to the application.
Every service communication is now made through this secondary service, which manages how the application sends its packets. In some solutions, the application must be configured to use a network proxy that handles all packets.
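The sidecar's role can be sketched as follows: the application hands a logical service name and payload to its local proxy, which resolves the destination and forwards the call, transparently adding mesh concerns such as retries. All names here are illustrative, and a real sidecar (like Envoy) does this at the network level rather than as an in-process object:

```python
class Sidecar:
    """Toy sidecar proxy: the application never talks to remote
    services directly, only to this local intermediary."""

    def __init__(self, resolve, transport, retries=1):
        self.resolve = resolve      # service name -> concrete address
        self.transport = transport  # (address, payload) -> response
        self.retries = retries

    def send(self, service, payload):
        address = self.resolve(service)  # discovery handled here
        last_error = None
        for _ in range(self.retries + 1):
            try:
                return self.transport(address, payload)
            except ConnectionError as exc:
                last_error = exc  # retry transparently on failure
        raise last_error
```

Because discovery, retries, and (in a real mesh) encryption live in the proxy, the application code stays unaware of them.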
These secondary services can communicate with other secondary services and provide service discovery and load balancing between multiple instances. Other than that, most Service Mesh solutions offer a circuit breaker out of the box.
Communication is easily protected since only services meant to communicate are allowed to exchange data, while others are blocked and reported. Private TLS keys also ensure that communication is only permitted between authorized services.
It is also possible to allow communication with external consumers by granting them explicit access permissions.
Since a unique service manages all communication flows, it’s much easier to create metrics and traces between the services. You don’t need to add platform-specific libraries to collect metrics such as latency, traffic, errors, or saturation. Besides, every communication can have a distributed-tracing capability, so you can detect how the entire cluster manages a single request.
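As an illustration of how a proxy can record these signals without touching application code, here is a minimal wrapper that counts traffic and errors and measures latency per service; saturation would come from host metrics, which this sketch omits:

```python
import time
from collections import defaultdict


class Metrics:
    """Toy per-service metrics collector for three of the signals a
    mesh proxy reports: traffic, errors, and latency."""

    def __init__(self):
        self.requests = defaultdict(int)     # traffic per service
        self.errors = defaultdict(int)       # errors per service
        self.latencies = defaultdict(list)   # latency samples (seconds)

    def observe(self, service, func):
        """Run one call on behalf of `service`, recording its outcome."""
        start = time.monotonic()
        self.requests[service] += 1
        try:
            return func()
        except Exception:
            self.errors[service] += 1
            raise
        finally:
            # Runs whether the call succeeded or failed
            self.latencies[service].append(time.monotonic() - start)
```

Because the proxy already sits on every call path, it can feed such counters to a metrics backend with zero changes to the services themselves.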
There are many Service Mesh products available nowadays. Mostly, they can be integrated with Kubernetes to provide a more concise and secure microservice deployment. Istio is probably the most famous Kubernetes Service Mesh solution, and Google supports it. However, there are other solutions like NGINX Service Mesh, Linkerd, and HashiCorp Consul.
It's vital to notice that, despite all the advantages of using a Service Mesh, there are problems to consider as well. If you take a look at the ThoughtWorks “Technology Radar”, Istio was indicated for adoption only in the first half of 2020. The technology is still incipient and should be managed with care, since all of your communication will rely on it.
A Service Mesh also adds some computational overhead, which can raise the cost of large clusters. It can also slightly increase service communication latency, since every communication must now be encrypted and secured.
Service Meshes propose a standardized way to handle all service communication: without one, each service team could propose a different solution that every dependent service would then need to implement. A Service Mesh additionally gives you a transparent way to provide inter-service communication with advanced security, following best practices.
Still, the technology is only now reaching maturity: It is so new that you may find breaking changes between versions and complicated version updates. Before going all in, first use it in test environments or with a backup plan. Once you feel safe, moving forward should be a smoother process.