Microservices are ubiquitous. These days, not doing microservices is something like not writing unit tests or not washing your hands before dinner: if you're not doing it, you feel ashamed. A critical look at microservices sounds almost politically incorrect. But I'll try...
First of all, I'll mostly be looking at different types of architectures from a pragmatic (read: real-life) design point of view.
From a design point of view, the introduction of microservices gives only one advantage (by imposing a limitation, by the way): it requires engineers to think much more carefully about the boundaries of each service and how it interacts with the rest of the system.
For a monolithic application this is also important, but there it is always possible to take a shortcut or hack something internally without much hassle. There is no way to perform such a trick with microservices. Needless to say, better-thought-out systems work better.
The rise of microservices coincided with the rise of reactive approaches and the maturation of other technologies (HTTP server implementations and microframeworks, for example). Along with shifting the scaling of individual services to external infrastructure, all these things together enabled more performant systems.
All of the above gave the impression that microservices architectures are better, faster, and more scalable. They, in fact, are. But from the design point of view, there are only two points which actually account for this gain:
- A better-thought-out system
- The ability to scale an individual service rather than the whole application
In order to achieve these two goals, we had to pay a high price. Let's set aside the deployment and maintenance nightmare; after all, it now feeds a lot of DevOps engineers and companies which sell software and services to manage this hell.
From a pragmatic design perspective, during the transition we lost:
- Design flexibility. There is no free lunch anymore: you can't easily refactor the system and shift functionality from one service to another, and you can't easily change a service API.
- Simplicity of local deployment. To debug some issue you need to start a bunch of dependencies. In my practice, most apps can't be started locally anymore (or can be, but it is so complex that nobody bothers), and apps are debugged with debug prints (say "hello" to the 80s).
- Per-application handling of "environment issues". Network and disk failures, configuration management, monitoring, etc. now need to be handled for each service individually. Yes, there are apps, frameworks, and patterns to handle all of these. But now we have to keep all these issues in mind all the time, spending precious brain resources on things which are not directly related to the business logic we're implementing.
- Predictability of failure patterns. They are now much more complex and harder to predict and prepare for, with a whole lot of "semi-working" states of the system.
All of the above might look like an appeal to return to monoliths. It is not. Monoliths have their own set of issues, and I see no point in repeating them here; every article about microservices doesn't forget to list them.
So, are there alternatives to microservices and monoliths? Well, I think there is at least one.
First of all, let's try to summarize what an alternative architecture should achieve:
- Be service-based, where the interface of each service is clearly defined. In this way we force engineers to be more careful about design.
- Be service-friendly: isolate services from environment issues as much as possible.
- Enable per-service scalability.
Ideally, it should also have the following properties:
- Keep external dependencies to a minimum.
- Be simple to deploy and maintain; in particular, local deployment should not be an issue.
The first thing which comes to mind is the traditional application server as it was initially envisioned by the guys at Sun: apps should just plug into it and use all available services. The idea didn't get the expected acceptance, I think mostly because it was oriented toward technologies and approaches which were modern at the time the idea was introduced. Nevertheless, it does provide a service-friendly environment of sorts, although one that is too regulated and too limited to a specific set of APIs.
But there is a somewhat different approach, described below.
At a high level, the architecture is a cluster consisting of identical nodes. The cluster is built on top of a data/computing grid (for example, Apache Ignite, Infinispan, Hazelcast, etc.). Unlike the traditional approach, the grid is not something external to the application; instead, each grid node is at the same time an application node.
Every node consists of a service-friendly shell and user services. The components on the right are part of the shell, while those on the left are user services:
(Well, HTTP can be part of the shell as well; this actually does not matter much.)
Every node has four working modes: single, dormant, slave, and master.
Single mode is used for development/debugging or in very small deployments.
Dormant, slave, and master are the modes used in a clustered environment. A node starts in dormant mode and tries to connect to the cluster. While a node is in dormant mode, all user services are stopped, so there is no risk of doing something wrong. A node is also switched back into dormant mode if for some reason the cluster can't be formed, for example if there is no majority of nodes in the cluster (either because not enough nodes have connected, or the cluster is experiencing network issues and the node belongs to a disconnected minority). Once the node is connected to the cluster and a majority of nodes are available, it switches into either slave or master mode, depending on the results of the master election. There is no difference between master and slave nodes from the point of view of user services. The difference is visible only to the Cluster Manager (see below), which is enabled only on the master node.
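The mode transitions above can be sketched as a small decision function. This is a hypothetical illustration, assuming the node re-evaluates its mode on every cluster view change; the names `NodeMode` and `next_mode` are mine, not from any real grid API:

```python
from enum import Enum

class NodeMode(Enum):
    SINGLE = "single"
    DORMANT = "dormant"
    SLAVE = "slave"
    MASTER = "master"

def next_mode(connected: bool, has_majority: bool, elected_master: bool) -> NodeMode:
    """Decide the working mode from the current cluster view.

    A node stays dormant until it is connected to the cluster AND sees a
    majority of nodes; only then does the election result matter.
    """
    if not connected or not has_majority:
        return NodeMode.DORMANT
    return NodeMode.MASTER if elected_master else NodeMode.SLAVE
```

Note that a node which loses the majority falls back to dormant regardless of its previous role, including a former master.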
The Service Manager is responsible for starting/stopping individual services according to the active configuration (which is stored in the data grid). The Service Manager listens to cluster events, and once a node is disconnected from the cluster or connected only to a minority of nodes, all services are immediately stopped, preserving system consistency.
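The "stop everything when in a minority" rule might look like this. A minimal sketch under my own assumptions: `grid` is a stand-in for the data grid client, each started service exposes a `stop()` method, and the event handler receives the current membership counts:

```python
class ServiceManager:
    """Stops user services when the node loses its majority view.

    `services` maps a service name to a running instance with `stop()`.
    """

    def __init__(self, grid):
        self.grid = grid
        self.services = {}

    def on_cluster_event(self, connected_nodes: int, cluster_size: int) -> None:
        # A node in a minority partition can no longer trust its view of
        # the shared state, so it stops everything to preserve consistency.
        if connected_nodes < cluster_size // 2 + 1:
            for name, service in list(self.services.items()):
                service.stop()
                del self.services[name]
```

The point is that this logic lives in the shell, once, rather than being reimplemented inside every service.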
The Cluster Manager is responsible for making decisions about which services should be running at each node and in how many instances. The Cluster Manager itself is activated only on the master node, so there is always only one source of truth about the services configuration. The decision about the number and location of services can be made using different approaches: static configuration, performance monitoring, heuristics, etc. It is also possible to let the Cluster Manager trigger starting/stopping nodes by interacting with an external service (Amazon ECS/EC2, Kubernetes, etc.). Note that unlike external monitoring, the Cluster Manager has access to all the details, so it can make a much more informed decision.
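The simplest of the placement approaches, static configuration, can be sketched in a few lines. Everything here is illustrative (the function name and round-robin strategy are my own choice, not a prescription):

```python
def plan_placement(desired: dict, nodes: list) -> dict:
    """Spread the desired number of instances of each service over the
    nodes round-robin.

    `desired` maps service name -> instance count; the result maps
    node -> list of services the Cluster Manager should start there.
    """
    plan = {node: [] for node in nodes}
    i = 0
    for service, count in sorted(desired.items()):
        for _ in range(count):
            plan[nodes[i % len(nodes)]].append(service)
            i += 1
    return plan
```

A monitoring-driven Cluster Manager would replace the static `desired` map with counts derived from load metrics, but the output, an assignment of service instances to nodes, stays the same.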
Well, this is just the data grid code which is part of the node.
This is an optional component which is necessary if data needs to be persisted to disk (or some other storage). Technically, it is the part of the data grid configuration which enables storing local data to storage.
First of all, transparent and (almost) instant access to all data in the system. Data is not just stored but also replicated, so the entire system is durable and reliable. Replication and consistency can be flexibly tuned to precisely fit the requirements. The entire system can survive the loss of some number of nodes (up to N/2-1, where N is the maximal cluster size) due to various issues. The loss of nodes may somewhat affect performance, but it does not affect system availability.
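The "up to N/2-1" figure is just majority-quorum arithmetic (exact for even N; for odd N the cluster actually tolerates (N-1)/2 losses). A minimal sketch:

```python
def majority(n: int) -> int:
    """Smallest number of nodes forming a majority of an n-node cluster."""
    return n // 2 + 1

def max_tolerated_losses(n: int) -> int:
    """How many nodes can fail while a majority can still be formed."""
    return n - majority(n)
```

For example, a 6-node cluster keeps working after losing 2 nodes (6/2 - 1), because the remaining 4 still form a majority.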
A service-friendly environment. Services are isolated from environment issues and can behave as if there were no problems with connectivity or anything like that. The shell takes care of retrying and redirecting calls to other nodes if necessary. All this significantly simplifies writing services, and developers can focus on business logic rather than on issue handling. Overall, services become very thin and lightweight, so I've called them "nanoservices".
The whole system is highly scalable. Unlike microservices, it has two dimensions along which to scale: adding nodes and starting more service instances. Starting a service instance is much faster than starting a new node, so the time necessary to react to a load change is significantly smaller.
The system is either working or not; there are no intermediate states. Failure patterns are limited in number and predictable.
Minimal dependencies. A DB, messaging, queues, distributed computing, etc. are already built in.
Simple deployment and configuration. No need for external "orchestration" services.
It's quite easy to extend the shell with more functionality, for example, letting each node also be a Kafka node.
Data and processing are collocated: it is possible to design services and configure the data grid so that all processing is performed at the node which holds all (or most of) the necessary data locally. This approach can significantly reduce network traffic and distributes processing in a natural way. By properly configuring the data-to-node assignment, it is possible to collect related data at the same nodes, further improving performance.
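The data-to-node assignment trick is usually done with an affinity key: related records are hashed on a shared field rather than on their own id, so they land in the same partition. A minimal sketch, assuming a fixed partition count and using CRC32 in place of the grid's real partition map (the record shapes and the `customer_id` field are illustrative):

```python
import zlib

def partition(key: str, partitions: int) -> int:
    """Deterministically map a key to a partition.

    Real grids use a partition map plus consistent hashing; CRC32 here
    just makes the sketch reproducible.
    """
    return zlib.crc32(key.encode()) % partitions

def affinity_key(order: dict) -> str:
    """Collocate each order with its customer by hashing on customer_id
    instead of the order's own id."""
    return order["customer_id"]

# The order and its customer land in the same partition, so a service on
# the owning node can join them without any network traffic.
customer = {"customer_id": "c42", "name": "ACME"}
order = {"order_id": "o1", "customer_id": "c42", "total": 99}
collocated = (partition(affinity_key(order), 16)
              == partition(customer["customer_id"], 16))
```

Apache Ignite, Infinispan, and Hazelcast all expose this idea under names like affinity collocation or data affinity.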
Such an architecture is a natural fit for reactive asynchronous processing.
We have almost the same freedom of refactoring as we do with a monolith.
There are no real systems built with the architecture described above (at least none known to me). Nevertheless, a few years ago I designed and implemented a system which contains most of the elements described above, and it still works just fine (to the best of my knowledge, since I'm no longer working for that company).