In this article, I would like to share our approach to designing a microservice architecture for one of the world’s leading classifieds. The resulting architecture efficiently serves tens of thousands of requests per second, comprises thousands of microservices, and is used daily by hundreds of developers.
Why we moved to microservices
When I joined the company, we had about 200 developers and a giant monolith. The more the organization grew, the longer it took for teams to deliver features. We went through a series of well-known organizational growth problems: slow releases, frequent rollbacks, lots of feature toggling, and so on.
The microservice architecture was a reasonable choice, allowing us to scale our company to a thousand engineers with hundreds of microservices. This approach to scaling especially worked for us because the company wanted to scale in multiple verticals: real estate, auto, jobs, etc., which required many independent products and features.
Of course, the organizational growth problem can’t be solved just by changing the architecture. According to Conway's Law, changes in architecture should go hand in hand with changes in organizational structure, or vice versa. Our company was no exception: the move to a microservice architecture was accompanied by a transition from functional teams to cross-functional ones.
Our target architecture
Our core design, the target architecture, was based on the following premises:
- We will use the Time To Market (TTM) metric as a signal that teams are becoming more efficient at delivering features.
- We will rely on Conway's Law as a primary way of structuring the organization and the architecture underneath.
Figure 1. The classified’s target architecture
Considering all of the above, we came up with the architecture shown in Figure 1, which should (and eventually did):
- Improve Time To Market.
- Allow us to scale to hundreds of developers.
- Accommodate tens of verticals (business directions) and dozens of product streams.
- Be reliable and resilient.
- Serve tens of thousands of requests per second.
I suspect every architecture looks neat and clean when it is first designed, but the real world often turns it into an unmanageable mess once implemented. So here are the practices, processes, and technical patterns that helped us keep ours straight.
How to avoid microservices chaos: 3 tips
Implement cross-functional teams
Business logic spread across multiple microservices is one of the most common problems. Many microservices end up with shared owners, and teams end up with a highly interdependent mesh of microservices that is nothing more than a distributed monolith.
This is probably the most important and most challenging problem to solve when it comes to keeping a team's business logic within its boundaries. In our case, it required changing our organizational structure to cross-functional teams with business goals and streams.
To improve TTM, you need to ensure that the team is autonomous, owns all of its underlying microservices, and can deliver with as little dependency on the rest of the company as possible. Making the team the sole owner of its microservices also helps keep the business logic within its boundaries.
If your processes and infrastructure are mature enough, you should be able to track the following metrics on a per-team basis (see the sketch after this list):
- Resource consumption (CPU, network, RAM, etc.).
- Service reliability (SLA).
- Code quality / Tech debt / Test coverage.
- On-call load.
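As an illustration, here is a minimal sketch of how per-team attribution could work, assuming Prometheus is used for monitoring; the metric name and labels are invented for the example, and in practice the team label would usually be injected from deployment metadata rather than hard-coded.

```go
// A minimal sketch of per-team metric attribution (assumption: Prometheus).
package metrics

import "github.com/prometheus/client_golang/prometheus"

var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "service_requests_total",
		Help: "Requests handled, attributed to the owning team.",
	},
	// Every dashboard, alert, and cost report can then be sliced per team.
	[]string{"team", "service"},
)

func init() {
	prometheus.MustRegister(requestsTotal)
}

// CountRequest would be called from request middleware.
func CountRequest(team, service string) {
	requestsTotal.WithLabelValues(team, service).Inc()
}
```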
This ensures that the team guards its business boundaries: nobody wants their microservice to go down because someone rolled out unrelated business logic into it.
This practice helps to draw clear boundaries between the team's microservices and the rest of the business, but it doesn't guarantee what happens within those boundaries.
Create API-Composition service
The next piece that can go wrong and turn into a touch point for multiple teams is the infrastructure that allows services to expose their APIs to the outside world. It’s common that once one service has all the infrastructure set up to proxy requests to internal services, other teams start adding their endpoints there, and the snowball grows into a monolith.
Conversely, if you make it too easy to expose any microservice to the outside world, a service's API may end up being used by external consumers, internal ones, or both. This breaks the separation between internal and external protocols as well as the request flow (see Internal vs External protocols below).
Internal vs External protocols
A clear separation between the protocols used for internal and external communication is a good idea. Companies choose them depending on their workloads and business needs, but a general rule of thumb is that internal protocols are designed for safety, rapid development, and efficiency, while external ones are driven by clients’ requirements, maintainability, conventions, etc.
Our approach was to introduce a new type of service, API-Composition (see the table below), which is the only type of microservice that exposes an API to the outside world. Let’s compare the API-Composition service with a typical business service:
A few important things about API-Composition: as the comparison shows, this type of microservice can’t have persistent storage because we assume no business operations happen in it. What it does (sketched in the example after this list) is:
- Receive a request and transform it to an internal format.
- Parallelize requests to internal services.
- Aggregate the result and cache it if necessary.
- Respond in the external format.
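To make the flow concrete, here is a hedged sketch of such a composition handler in Go, assuming golang.org/x/sync/errgroup for the parallel fan-out; the endpoint, client interfaces, and field names are invented for the example and are not our real API.

```go
// A hedged sketch of an API-Composition handler (all names are illustrative).
package composition

import (
	"context"
	"encoding/json"
	"net/http"

	"golang.org/x/sync/errgroup"
)

// Internal clients would normally be generated clients owned by the
// corresponding business services.
type ListingClient interface {
	Get(ctx context.Context, id string) (*Listing, error)
}

type SellerClient interface {
	GetByListing(ctx context.Context, listingID string) (*Seller, error)
}

type Listing struct {
	ID    string `json:"id"`
	Title string `json:"title"`
}

type Seller struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// ListingPage is the external response format; it is owned by the
// composition layer, not by the business services.
type ListingPage struct {
	Listing *Listing `json:"listing"`
	Seller  *Seller  `json:"seller"`
}

type Handler struct {
	listings ListingClient
	sellers  SellerClient
}

func (h *Handler) GetListingPage(w http.ResponseWriter, r *http.Request) {
	// 1. Receive the external request and transform it to an internal format.
	id := r.URL.Query().Get("id")

	var page ListingPage
	g, ctx := errgroup.WithContext(r.Context())

	// 2. Parallelize requests to internal services.
	g.Go(func() error {
		l, err := h.listings.Get(ctx, id)
		page.Listing = l
		return err
	})
	g.Go(func() error {
		s, err := h.sellers.GetByListing(ctx, id)
		page.Seller = s
		return err
	})

	// 3. Aggregate the result (caching could be added here if necessary).
	if err := g.Wait(); err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}

	// 4. Respond in the external format.
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(page)
}
```

Because the handler only transforms, fans out, and aggregates, it stays stateless, which is exactly why it doesn't need persistent storage.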
It should also be reasonably easy for a team to set up a new API-Composition service for its cluster of services. In our case, we solved that by providing a tool that generates the API-Composition service and its handlers from an OpenAPI schema, and we automated exposing the new handlers upstream (for example, the rewrite rules on the API gateway).
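For illustration, the generated code could look roughly like this; the interface, paths, and names below are hypothetical, not the output of our actual generator.

```go
// A hypothetical illustration of the generator's output: the tool turns the
// OpenAPI schema into a server interface and routing code, and the team only
// implements the handler bodies.
package composition

import "net/http"

// ServerInterface would be produced from the OpenAPI schema.
type ServerInterface interface {
	// GET /v1/listing-page?id={id}
	GetListingPage(w http.ResponseWriter, r *http.Request)
	// GET /v1/search
	SearchListings(w http.ResponseWriter, r *http.Request)
}

// RegisterRoutes wires the generated routes to the team's implementation;
// exposing them upstream on the API gateway is automated separately.
func RegisterRoutes(mux *http.ServeMux, s ServerInterface) {
	mux.HandleFunc("/v1/listing-page", s.GetListingPage)
	mux.HandleFunc("/v1/search", s.SearchListings)
}
```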
The benefit of this separation is that it helps keep business services:
- Free from libraries that process external requests.
- Safe: since business services never expose their APIs to the outside world, leaks can only happen at the API-Composition level.
The downside of this approach is an extra hop, which increases response time and adds a new point of failure. However, the more downstream services whose calls you can parallelize, the more net value API-Composition brings: for example, three sequential 50 ms calls add up to roughly 150 ms, while fanning them out in parallel costs about 50 ms plus the hop overhead.
Avoid multi-domain services
The data layer, or core services, is the other place where things go wrong. Let's first look at what core services are. In our case, they were user-profile, listings, and similar services: the entities required by every other vertical and domain, such as monetization, fraud prevention, listing creation, and search.
The problem is that if you don’t spot these types of services early enough, they become a shared space for multiple teams. A vivid example is a listing service holding all the listings: teams mistakenly try to put every piece of listing-related logic into it, even though, for example, job listings may have specific access-control logic while short-term rental listings have their own life cycle.
The solution in this case is to have separate, dedicated services, owned by the extending teams, that extend the core listing and reference the core object. This pattern is known as bounded contexts. The main blocker to using it is business processes that can't tolerate eventual consistency (which is rare nowadays). Besides that, there are other problems you might need to take care of before suggesting this pattern (a sketch of such an extension service follows the list):
- The message broker and the infrastructure can't guarantee that business events won't get lost.
- It might not be easy for a team to spin up a new service like this.
- The extended entity doesn't provide all the lifecycle events.
- There could be a situation where nobody owns the core service, and it's just easier to put your stuff in there.
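Here is a hedged sketch of such an extension service: a team-owned consumer that keeps job-specific data in its own store and reacts to lifecycle events from the core listing service, accepting eventual consistency. The event names, fields, and storage interface are assumptions made up for the example.

```go
// A hedged sketch of a bounded-context extension owned by the jobs team.
package joblistings

import (
	"context"
	"encoding/json"
)

// ListingEvent is what we assume the core listing service publishes on
// create/update/archive.
type ListingEvent struct {
	Type      string `json:"type"` // e.g. "listing.created", "listing.archived"
	ListingID string `json:"listing_id"`
	Category  string `json:"category"` // e.g. "jobs", "auto", "real_estate"
}

// Store is the team-owned storage for job-specific attributes; the core
// listing service knows nothing about it.
type Store interface {
	CreateJobExtension(ctx context.Context, listingID string) error
	Archive(ctx context.Context, listingID string) error
}

type Consumer struct {
	store Store
}

// Handle is invoked by whatever broker consumer (Kafka, NATS, etc.) the
// platform provides.
func (c *Consumer) Handle(ctx context.Context, payload []byte) error {
	var e ListingEvent
	if err := json.Unmarshal(payload, &e); err != nil {
		return err
	}
	if e.Category != "jobs" {
		return nil // not our bounded context
	}
	switch e.Type {
	case "listing.created":
		return c.store.CreateJobExtension(ctx, e.ListingID)
	case "listing.archived":
		return c.store.Archive(ctx, e.ListingID)
	default:
		return nil // ignore lifecycle events we don't extend
	}
}
```

With this split, the core listing service stays free of job-specific logic, and the extension can evolve independently as long as the team accepts eventual consistency.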
Conclusion
These three high-level suggestions helped us maintain the architecture shown in Figure 1 and allowed us to:
- Reduce TTM.
- Increase rollouts from a few per day to a few dozen.
- Roll out features in a more granular way with less risk.
- Reduce deployment time from hours to minutes.
- Decrease the number of rollbacks.
References
- Conway's law — a Wikipedia article.
- The book “Team Topologies” by Matthew Skelton and Manuel Pais.
- The Command Query Responsibility Segregation Pattern.
- Microservice Architecture article.