<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frol K.</title>
    <description>The latest articles on DEV Community by Frol K. (@frolk).</description>
    <link>https://dev.to/frolk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F615404%2F14ca20bc-1cbe-4c33-a3ca-8399b9718130.jpeg</url>
      <title>DEV Community: Frol K.</title>
      <link>https://dev.to/frolk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frolk"/>
    <language>en</language>
    <item>
      <title>Microservices: How to use Null Object Pattern to improve user experience</title>
      <dc:creator>Frol K.</dc:creator>
      <pubDate>Wed, 31 May 2023 08:56:58 +0000</pubDate>
      <link>https://dev.to/frolk/microservices-how-to-use-null-object-pattern-to-improve-user-experience-5400</link>
      <guid>https://dev.to/frolk/microservices-how-to-use-null-object-pattern-to-improve-user-experience-5400</guid>
      <description>&lt;h2&gt;
  
  
  Topics:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is a Null Object Pattern?&lt;/li&gt;
&lt;li&gt;How to apply&lt;/li&gt;
&lt;li&gt;How to implement&lt;/li&gt;
&lt;li&gt;Testing noncritical dependencies&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a Null Object Pattern?
&lt;/h2&gt;

&lt;p&gt;The resilience of a microservice architecture comes from its ability to recover quickly from failures while continuing to serve customers during recovery, all without significant degradation of the customer experience. The Null Object Pattern targets the latter: it ensures that the customer experience doesn’t degrade much when a non-essential part of the system is not working. The pattern can be applied at many levels of the tech stack: from a frontend component whose backing service is unavailable (see Pic 1 with the status bar) down to a failure to get data for an object or a class within a (micro)service in the backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhr1fxh10ti32voinppx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhr1fxh10ti32voinppx.jpg" alt="Pic 1. Example when failure of status-bar block causes service denial for whole web page"&gt;&lt;/a&gt;&lt;br&gt;
Pic 1. Example where a failure of the status-bar block causes denial of service for the whole web page&lt;/p&gt;

&lt;h2&gt;
  
  
  How to apply
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Come up with a list of web pages, app screens, API endpoints, etc., that are crucial from a business perspective.&lt;/li&gt;
&lt;li&gt;For each item, list all external dependencies whose failure can cause a denial of service.&lt;/li&gt;
&lt;li&gt;Decide whether each dependency is actually critical for the user's intent (e.g., should a failure to load recommended items block the checkout process?).&lt;/li&gt;
&lt;li&gt;Address the noncritical services that cause the most failures first.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvgybjbzdwy8n6tcix8c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvgybjbzdwy8n6tcix8c.jpg" alt="List of critical endpoins"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to implement
&lt;/h2&gt;

&lt;p&gt;Implementing this pattern is straightforward. If your method (such as a factory or repository) cannot retrieve data for the object, simply return a null object. For an implementation in your language, check Wikipedia; the diagram below illustrates the idea:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3cv8wxay1y25i0kot1b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3cv8wxay1y25i0kot1b.jpg" alt="Pic 2. Interface diagram of implementing Null Object pattern"&gt;&lt;/a&gt;&lt;br&gt;
Pic 2. Interface diagram for implementing the Null Object Pattern&lt;/p&gt;
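&lt;p&gt;As a minimal sketch of the idea (names like &lt;code&gt;UserRepository&lt;/code&gt; and &lt;code&gt;NULL_USER&lt;/code&gt; are illustrative, not from any specific codebase): a repository can swallow the failure of a noncritical dependency and hand back a null object with safe defaults:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    name: str


# The null object: an ordinary User carrying safe default values.
NULL_USER = User(id=0, name="")


class UserRepository:
    """Hypothetical repository; `fetch` stands in for a call to another service."""

    def __init__(self, fetch):
        self._fetch = fetch

    def get(self, user_id: int) -> User:
        try:
            return self._fetch(user_id)
        except Exception:
            # The noncritical dependency failed: degrade gracefully
            # instead of propagating the error to the caller.
            return NULL_USER
```

&lt;p&gt;Callers always receive a valid &lt;code&gt;User&lt;/code&gt; and never need to branch on a missing value.&lt;/p&gt;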

&lt;p&gt;However, blindly applying this pattern everywhere is a bad idea, especially if the object's data is later used in modification queries. Imagine you have a repository that retrieves user data from another service. If retrieval fails, the repository returns a null object with defaulted values (ID set to 0, Name set to “empty name”, and so on). You can’t use it for write queries, otherwise your business logic becomes inconsistent. So how do you deal with this?&lt;/p&gt;

&lt;p&gt;One option is to split the user object's interfaces into two kinds: those that a null object can implement and those it can’t. This separation is crucial to make sure a null object is never used in write queries.&lt;/p&gt;
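&lt;p&gt;A sketch of that split (the interface and class names are hypothetical): the null object implements only the read-side interface, so any write path that requires the write-side interface rejects it:&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class ReadableUser(ABC):
    """Read-side interface: safe for a null object to implement."""

    @abstractmethod
    def display_name(self) -> str: ...


class WritableUser(ReadableUser):
    """Write-side interface: only real, persisted users implement it."""

    @abstractmethod
    def user_id(self) -> int: ...


class RealUser(WritableUser):
    def __init__(self, user_id: int, name: str):
        self._id, self._name = user_id, name

    def display_name(self) -> str:
        return self._name

    def user_id(self) -> int:
        return self._id


class NullUser(ReadableUser):
    # Implements only ReadableUser, so it can be rendered but can
    # never be passed where a WritableUser is required.
    def display_name(self) -> str:
        return ""


def save(user) -> None:
    # Guard the write path explicitly.
    if not isinstance(user, WritableUser):
        raise TypeError("null objects must never reach write queries")
```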

&lt;h2&gt;
  
  
  Testing noncritical dependencies
&lt;/h2&gt;

&lt;p&gt;This pattern is relatively easy to implement for services with clear bounded contexts (e.g., a recommendation block). For commodity services like a user profile, however, it isn’t that simple: turning such a service into a noncritical dependency takes a lot of hard work, because almost every request to the backend ends up querying it to render a name or other, usually trivial, information. To make sure all this effort is not wasted, it’s wise to enforce the behavior by introducing noncritical dependency testing.&lt;/p&gt;

&lt;p&gt;One way to implement this type of testing is to make it a special case of your e2e regression test suite. The only difference is that we reroute traffic for the dependency we want to artificially fail to a service that mocks failing behavior, such as responding with an error or timing out after, say, 5 seconds. See the example in the picture below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favsazb7q48o47dk870w2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favsazb7q48o47dk870w2.jpg" alt="Pic 3. Example of testing mocking user service with bad &amp;lt;br&amp;gt;
behavior. Mock tests make a request with a header to turn off user-service. This header got propagated down the stack and when reaches user-service it’s got rerouted to mock with mocked behavior"&gt;&lt;/a&gt;&lt;br&gt;
Pic 3. Example of testing with a mocked, misbehaving user service.&lt;br&gt;
Mock tests make a request with a header that turns off user-service. The header is propagated down the stack, and when it reaches user-service the request is rerouted to a mock with failing behavior&lt;/p&gt;

&lt;p&gt;For this implementation, we need a sidecar that reroutes traffic based on request headers, and a mock service that implements the failing behavior.&lt;/p&gt;
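&lt;p&gt;A toy version of that routing rule (the header name &lt;code&gt;X-Fail-Service&lt;/code&gt; and the handler shapes are assumptions for illustration): the sidecar inspects a test-only header and, when it names the dependency under test, returns the failing mock instead of the real upstream:&lt;/p&gt;

```python
def route(headers: dict, upstream, failing_mock, service_name: str):
    """Sidecar routing sketch: reroute to the mock only when a
    test-only header names this service."""
    if headers.get("X-Fail-Service") == service_name:
        return failing_mock
    return upstream


def failing_user_service(request: dict) -> dict:
    # Mocked bad behavior: always respond with an error.
    return {"status": 500, "error": "injected failure"}
```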

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Null Object Pattern is an absolute must-have for well-scoped dependencies. For commodity ones, though, it can be a burden, since turning them into noncritical dependencies requires a lot of engineering hours.&lt;/p&gt;

&lt;p&gt;Keep in mind that this pattern only “hides” failures from the end user, while patterns like circuit breakers and health checks actually try to recover. So apply it where necessary, and don’t forget to set up monitoring for each case where the null object pattern triggers, since otherwise those hidden failures go unnoticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Null_object_pattern" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Null_object_pattern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sourcemaking.com/design_patterns/null_object" rel="noopener noreferrer"&gt;https://sourcemaking.com/design_patterns/null_object&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/books/" rel="noopener noreferrer"&gt;https://sre.google/books/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/frolk/microservices-what-to-check-in-readiness-probes-3epo"&gt;https://dev.to/frolk/microservices-what-to-check-in-readiness-probes-3epo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microservices</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Microservices: What to check in readiness probes?</title>
      <dc:creator>Frol K.</dc:creator>
      <pubDate>Tue, 09 May 2023 10:03:58 +0000</pubDate>
      <link>https://dev.to/frolk/microservices-what-to-check-in-readiness-probes-3epo</link>
      <guid>https://dev.to/frolk/microservices-what-to-check-in-readiness-probes-3epo</guid>
      <description>&lt;p&gt;Microservice architecture is famous for its resilience and ability to self-recover. A health check pattern (e. g. Kubernetes —  readiness, liveness, startup probes) is one of the patterns which actually makes it happen. The topic of how to set up health check probes is well covered but there is not much about what to check in the probe and how to understand if a dependency is critical for a particular service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topics to cover
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How a health check pattern works.&lt;/li&gt;
&lt;li&gt;What to check when a service is under load.&lt;/li&gt;
&lt;li&gt;What to check on the initial setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How a health check pattern works
&lt;/h2&gt;

&lt;p&gt;Before explaining the health check pattern, we first need to understand how load balancing works. Every microservice has multiple instances (or replicas) to provide redundancy and reduce the load on any single instance. So whenever a particular instance of microservice A wants to make an API call to microservice B, it needs to choose a particular instance of microservice B. But how? The &lt;strong&gt;L&lt;/strong&gt;oad &lt;strong&gt;B&lt;/strong&gt;alancer (LB) makes this decision. &lt;/p&gt;

&lt;p&gt;Load balancers keep track of all the healthy instances of a microservice and spread the load between them (see load-balancing strategies). But because instances in a microservice architecture are usually ephemeral, the list of healthy instances is constantly updated: on every rollout, new instances are added and old ones are removed. The problem is that by the time an instance gets into the LB's list, it may not be ready to handle requests yet (for example, the cache is not warmed up or connections to the database are not yet established). This is where the health check pattern helps the LB understand whether a service instance is ready, and whether it is still capable of handling traffic.&lt;/p&gt;

&lt;p&gt;How does the load balancer do this? It makes a request to a health check endpoint on each instance of the microservice, and if the instance answers that it is ready a pre-configured number of times in a row (to avoid flapping), it receives traffic. That, in a nutshell, is the health check pattern.&lt;/p&gt;
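&lt;p&gt;As a minimal sketch of such an endpoint's logic (the dependency names and the shape of the checks are assumptions): the instance runs a registered check per dependency and reports ready only if all of them pass. Which dependencies you register is exactly the decision discussed next:&lt;/p&gt;

```python
def readiness(checks: dict) -> tuple:
    """`checks` maps a dependency name to a zero-argument callable
    returning True when the dependency is reachable. Only the
    dependencies deliberately registered here can fail the probe."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # An exception while probing counts as an unhealthy dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

&lt;p&gt;The LB would call this endpoint repeatedly and only route traffic after several consecutive 200s.&lt;/p&gt;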

&lt;p&gt;But what should a microservice check in these health check endpoints? That depends on which of two scenarios it is in:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Service has production traffic.&lt;/li&gt;
&lt;li&gt;Service is at the initial startup. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Service has production traffic
&lt;/h2&gt;

&lt;p&gt;The usual advice is to check all critical dependencies. But what are these critical dependencies? A service that serves synchronous requests usually has some combination of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent database;&lt;/li&gt;
&lt;li&gt;Cache database;&lt;/li&gt;
&lt;li&gt;Message broker;&lt;/li&gt;
&lt;li&gt;Other services;&lt;/li&gt;
&lt;li&gt;Static storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But should you actually check all of them during a health check endpoint call? Let's look at a few scenarios. Assume we have a service with a persistent database (which handles 5% of writes) and a cache database (which serves 95% of reads). What are the pros and cons of checking both dependencies or just the cache? Let's explore three scenarios:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcna30clycpgxs79pqtz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcna30clycpgxs79pqtz.jpg" alt="Scenario 1. Instance #3 stops receiving production traffic if it has no access to the database or cache, or both."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scenario 1. Instance #3 stops receiving production traffic if it has no access to the database or cache, or both.&lt;/p&gt;

&lt;p&gt;Whenever an instance of the service fails to reach any of its dependencies, the health check fails, so the LB stops sending traffic to that instance. Seems like a safe option, doesn't it? Let's look at the other scenarios:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1z1110fpu9wt4wrahnz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1z1110fpu9wt4wrahnz.jpg" alt="Scenario 2. All instances stop receiving production traffic because they lose the access to the database"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scenario 2. All instances stop receiving production traffic because they lose the access to the database&lt;/p&gt;

&lt;p&gt;Here our critical dependency fails, so all the instances of our service start failing too, because none of the health checks go through. Now our configuration doesn't look good: to make sure the 5% of writes are reliably served when an individual instance loses access to the database, we sacrifice 95% of reads whenever the database itself fails.&lt;/p&gt;

&lt;p&gt;Let's say our SLA for the 95% of reads is more important and much stricter. In that case, we put only the cache in the health check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z1bszirkq4svowfhaf9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z1bszirkq4svowfhaf9.jpg" alt="Scenario 3. All instances keep receiving traffic but 5% of writes fail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scenario 3. All instances keep receiving traffic but 5% of writes fail&lt;/p&gt;

&lt;p&gt;When the database fails, our health check still returns OK. This means all instances keep serving 95% of reads while failing user requests for the 5% of writes, which is much better for the read SLA and worse for writes.&lt;/p&gt;

&lt;p&gt;When the cache fails, we are also in pretty good shape. From the SLA or the client's perspective, it probably doesn't matter whether a read request fails because a service instance is unavailable or because the service responds with a “cache is unavailable” error. Of course, there are two edge cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the first case, we could theoretically serve reads from the database, but then the value of having a cache at all is questionable.&lt;/li&gt;
&lt;li&gt;The second is less obvious: if you allow writes while the cache is unavailable, the cache becomes inconsistent, and at that point it is useless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, choosing which critical dependencies to check in a health check depends entirely on your SLA and traffic distribution. I hope the case above gives you an idea of how to choose what to check.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service at the initial startup (initial probe)
&lt;/h2&gt;

&lt;p&gt;The initial startup of your service is slightly different from the situation above. The main difference is that this check happens only once, so the goal of a startup check is to ensure that the service has the right configuration for all its critical dependencies. And before the service starts serving production traffic, we make sure it is capable of doing so; by that I mean proactively establishing all required connections and doing a pre-warm-up. &lt;/p&gt;
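&lt;p&gt;A one-shot startup check could be sketched like this (the callables are hypothetical placeholders for your real connection setup and warm-up): it validates the configuration by actually establishing every required connection, then pre-warms before the instance is marked ready:&lt;/p&gt;

```python
def startup_probe(connects: list, warm_up) -> bool:
    """Runs once at startup. `connects` are callables that establish
    required connections (DB pool, cache client, broker) and raise on
    misconfiguration; `warm_up` pre-fills local caches."""
    try:
        for connect in connects:
            connect()
        warm_up()
    except Exception:
        # Any failure keeps the instance out of the load balancer.
        return False
    return True
```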

&lt;p&gt;One thing that is not recommended, though, is checking other critical services, because this can fail your rollout and trigger a chain of calls that is likely to fail. Imagine every service checking every other service, and those services checking their own dependencies. A single rollout of one instance could easily trigger hundreds of requests, and a few of them will almost certainly fail. The picture below shows how the probability grows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgz26mvfzzs7zomwpgcc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgz26mvfzzs7zomwpgcc.jpg" alt="Probability of failure of health check vs number of dependant services"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you can see from the article, there are a few cases to watch out for when choosing which dependencies to check in health checks.&lt;br&gt;
When a service has production traffic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The choice of critical dependencies depends on your SLA requirements and traffic distribution.&lt;/li&gt;
&lt;li&gt;Usually it’s an anti-pattern to check other services as it causes “chatty traffic”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At initial startup, however, it’s better to check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your service can access all critical dependencies like database and cache.&lt;/li&gt;
&lt;li&gt;As before, it’s not recommended to check other services, because the rollout could fail due to failures in its dependencies. &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microservices</category>
      <category>sla</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>3 tips to keep microservice architecture from becoming a mess</title>
      <dc:creator>Frol K.</dc:creator>
      <pubDate>Tue, 28 Feb 2023 11:00:00 +0000</pubDate>
      <link>https://dev.to/frolk/3-tips-to-keep-microservice-architecture-from-becoming-a-mess-4hic</link>
      <guid>https://dev.to/frolk/3-tips-to-keep-microservice-architecture-from-becoming-a-mess-4hic</guid>
      <description>&lt;p&gt;In this article, I would like to share our approach to designing a microservice architecture for one of the world’s leading classifieds. The resulting architecture efficiently serves tens of thousands of requests per sec, has thousands of microservices, and hundreds of developers use it daily.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why we moved to microservices
&lt;/h1&gt;

&lt;p&gt;When I joined the company, we had about 200 developers and a giant monolith. The more the organization grew, the longer it took for teams to deliver features. We went through a series of well-known organizational growth problems: slow releases, frequent rollbacks, lots of feature toggling, and so on.&lt;/p&gt;

&lt;p&gt;The microservice architecture was a reasonable choice, allowing us to scale our company to a thousand engineers with hundreds of microservices. This approach to scaling especially worked for us because the company wanted to scale in multiple verticals: real estate, auto, jobs, etc., which required many independent products and features.&lt;/p&gt;

&lt;p&gt;Of course, the organizational growth problem can’t be solved just by changing the architecture. According to &lt;a href="https://en.wikipedia.org/wiki/Conway%27s_law" rel="noopener noreferrer"&gt;Conway's Law&lt;/a&gt;, changes in architecture should go hand in hand with changes in organizational structure, or vice versa. So our company wasn't exempt, and the move to microservice architecture was accompanied by a transition from functional teams to &lt;a href="https://en.wikipedia.org/wiki/Cross-functional_team" rel="noopener noreferrer"&gt;cross-functional ones&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our target architecture
&lt;/h1&gt;

&lt;p&gt;Our core design, the target architecture, was based on the following premises:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will use the &lt;a href="https://en.wikipedia.org/wiki/Time_to_market" rel="noopener noreferrer"&gt;Time To Market&lt;/a&gt; (TTM) metric as a signal that teams are becoming more efficient at delivering features.&lt;/li&gt;
&lt;li&gt;We will rely on Conway's Law as a primary way of structuring the organization and the architecture underneath.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1svym3ewq4map89uwv0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1svym3ewq4map89uwv0.jpg" alt="Microservice architecture - classified case"&gt;&lt;/a&gt;&lt;br&gt;
      Figure 1. The classified’s target architecture &lt;/p&gt;

&lt;p&gt;Considering all of the above, we came up with the architecture from Figure 1. This should (and eventually did) expose the following properties: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve Time To Market.&lt;/li&gt;
&lt;li&gt;Allow us to scale to hundreds of developers.&lt;/li&gt;
&lt;li&gt;Accommodate tens of verticals (business directions) and dozens of product streams.&lt;/li&gt;
&lt;li&gt;Be reliable and resilient.&lt;/li&gt;
&lt;li&gt;Serve tens of thousands of requests per second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I suspect everyone's architecture is neat and clean when first designed, but the real world often turns it into an unmanageable mess once implemented. So here they are: the practices, processes, and technical patterns that helped us keep ours straight.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to avoid microservices chaos: 3 tips
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Implement cross-functional teams
&lt;/h2&gt;

&lt;p&gt;Business logic spread across multiple microservices is one of the most common problems. Many microservices end up with shared owners, and teams end up in a highly dependent mesh of microservices that is nothing more than a distributed monolith. &lt;/p&gt;

&lt;p&gt;This is probably the most important and most challenging problem to solve when it comes to keeping a team's business logic within its boundaries. In our case, it required changing our organizational structure to cross-functional teams with their own business goals and streams.&lt;/p&gt;

&lt;p&gt;To improve TTM, you need to ensure that the team is autonomous and owns all its underlying microservices, so it can deliver with as little dependency on the rest of the company as possible. Making the team the sole owner of its microservices also helps keep the business logic within its boundaries.&lt;/p&gt;

&lt;p&gt;If your processes and infrastructure are mature enough, you should be able to track the following metrics on a per-team basis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource consumption (CPU, Network, RAM, etc).&lt;/li&gt;
&lt;li&gt;Services reliability (SLA).&lt;/li&gt;
&lt;li&gt;Code quality / Tech debt / Test coverage.&lt;/li&gt;
&lt;li&gt;On-call load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that the team guards its business boundaries: nobody wants their microservice to go down due to a rollout of someone else's unrelated business logic inside it. &lt;/p&gt;

&lt;p&gt;This practice helps to draw clear boundaries between the team's microservices and the rest of the business, but it doesn't guarantee what happens within those boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create API-Composition service
&lt;/h2&gt;

&lt;p&gt;The next piece that can go wrong and turn into a touch point for multiple teams is the infrastructure that allows services to expose their API to the outside world. It’s common that once one service has all the infra set up to proxy requests to the internal infra, other teams start adding their endpoints there, and the snowball grows into a monolith.&lt;/p&gt;

&lt;p&gt;Conversely, if you make it too easy to expose any microservice to the external world, a service's API may end up being used by external consumers, internal consumers, or both. This breaks the internal-vs-external protocols pattern as well as the request flow (see Internal vs External protocols).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Internal vs External protocols&lt;/strong&gt;&lt;br&gt;
A clear separation between the protocols used for internal and external service communication is a good idea. Different companies choose them depending on their workloads and business needs, but a general rule of thumb is that internal protocols are designed for safety, rapid development, and efficiency, while external ones are driven by clients’ requirements, maintainability, conventions, etc. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our approach was to introduce a new type of service, API-Composition (see the table below), which is the only type of microservice allowed to expose an API. Let’s compare the API-Composition service with a typical business service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqbkifasak44pu3zja0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqbkifasak44pu3zja0b.png" alt="Capabilities comparison of API-composition and business service types of microservices"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few important things about API-composition. As you can see, this type of microservice can’t have persistent storage, because we assume no business operations happen there. What it does is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive a request and transform it into an internal format.&lt;/li&gt;
&lt;li&gt;Parallelize the resulting internal requests.&lt;/li&gt;
&lt;li&gt;Aggregate the results and cache them if necessary.&lt;/li&gt;
&lt;li&gt;Respond.&lt;/li&gt;
&lt;/ol&gt;
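&lt;p&gt;The four steps above can be sketched with async fan-out (the request shapes and the &lt;code&gt;backends&lt;/code&gt; callables are illustrative assumptions, not the real service's interface):&lt;/p&gt;

```python
import asyncio


async def compose(external_request: dict, backends: list) -> dict:
    # 1. Transform the external request into an internal format.
    internal = {"query": external_request.get("q", "")}
    # 2. Fan out to the internal services in parallel.
    results = await asyncio.gather(
        *(call(internal) for call in backends), return_exceptions=True
    )
    # 3. Aggregate, dropping failed noncritical calls.
    parts = [r for r in results if not isinstance(r, Exception)]
    # 4. Respond in the external format.
    return {"items": parts}
```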

&lt;p&gt;It should also be reasonably easy for a team to set up a new API-composition for their cluster of services. In our case, we solved that by providing a tool that generates the API-composition service and its handlers from an OpenAPI schema, and we automated the exposure of the new handlers upstream (for example, rewriting rules on the API gateway).&lt;/p&gt;

&lt;p&gt;The benefits of this separation are that it helps to keep business services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free from the libraries processing any external requests.&lt;/li&gt;
&lt;li&gt;Safe: since a business service never exposes its API to the outside world, leaks can only happen at the API-composition level.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downside of this approach is an extra hop, which increases the response time and adds a new point of failure. However, the more services you have downstream whose calls you can parallelize, the more net value API-composition brings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid multi-domain services
&lt;/h2&gt;

&lt;p&gt;The data layer, or core services, is the other place where things go wrong. Let's first look at what core services are. In my particular case, they were user-profile, listing services, and so on: the entities required by all other verticals, such as monetization, fraud prevention, listing creation, and search. &lt;/p&gt;

&lt;p&gt;The problem is that if you don’t spot these types of services early enough, they become a shared space for multiple teams. A vivid example is a listing service holding all the listings: teams mistakenly try to put all listing-related logic into it, even though job listings may have specific access-control logic while short-term rental listings have their own life cycle.&lt;/p&gt;

&lt;p&gt;The solution in this case could be to have separate dedicated services, owned by the extending teams, that relate to the core listing object. This pattern is called bounded contexts. The main blocker for it is business processes that can't tolerate eventual consistency (which is rare nowadays). Besides that, there are other problems you might need to take care of before suggesting this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The message broker and the infrastructure can't guarantee that business events won't get lost.&lt;/li&gt;
&lt;li&gt;It may not be easy for a team to spin up a new service like this.&lt;/li&gt;
&lt;li&gt;The extended entity doesn't provide all the lifecycle events.&lt;/li&gt;
&lt;li&gt;There could be a situation where nobody owns the core service, and it's just easier to put your stuff in there.&lt;/li&gt;
&lt;/ul&gt;
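&lt;p&gt;A sketch of the bounded-context extension described above (entity and event names are hypothetical): the vertical-owned service keeps its own data keyed by the core listing's id, and stays consistent by consuming the core service's lifecycle events:&lt;/p&gt;

```python
class JobListingExtension:
    """Vertical-owned service state extending the core listing entity."""

    def __init__(self):
        self.access_rules = {}  # listing_id -> vertical-specific data

    def on_listing_created(self, event: dict) -> None:
        # Consume a core lifecycle event (eventually consistent).
        self.access_rules[event["listing_id"]] = {"visibility": "public"}

    def on_listing_deleted(self, event: dict) -> None:
        self.access_rules.pop(event["listing_id"], None)
```

&lt;p&gt;The core listing service never learns about job-specific access rules; it only needs to publish its lifecycle events reliably.&lt;/p&gt;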

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;These three high-level suggestions helped us maintain the architecture shown in Figure 1. They helped us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reduce TTM.&lt;/li&gt;
&lt;li&gt;Increase rollouts from a few per day to a few dozen.&lt;/li&gt;
&lt;li&gt;Roll out features in a more granular way with less risk.&lt;/li&gt;
&lt;li&gt;Reduce deployment time from hours to minutes.&lt;/li&gt;
&lt;li&gt;Decrease the number of rollbacks.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Conway%27s_law" rel="noopener noreferrer"&gt;Conway's law&lt;/a&gt; — a Wikipedia article. &lt;/li&gt;
&lt;li&gt;The book &lt;a href="https://teamtopologies.com/" rel="noopener noreferrer"&gt;“Team Topologies”&lt;/a&gt; by Matthew Skelton and Manuel Pais. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj591573(v=pandp.10)" rel="noopener noreferrer"&gt;The Command Query Responsibility Segregation Pattern&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://microservices.io/" rel="noopener noreferrer"&gt;Microservice Architecture&lt;/a&gt; article.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microservices</category>
      <category>design</category>
      <category>refactoring</category>
    </item>
  </channel>
</rss>
