DEV Community: Grégoire Mielle

Distributed architecture: how to make microservices talk to each others?

Grégoire Mielle — Mon, 31 Aug 2020 13:12:56 +0000

Software engineers all aim for the same thing: creating a fault tolerant application to serve users's needs.

As an application and its services grow, so does the organization. Wether a microservice-oriented architecture was built from the ground up or you inherited a monolithic application turned distributed, the need for effective communication between services is clear: we want the checkout service to be able to talk to the customer service in order to send the order confirmation email to the right person.

Let's discuss what are the implications of putting a layer (the network) between services, the pitfalls we want to avoid & the best practises to maintain a fault tolerant experience for our users.

From a function call to multiple network requests: welcome to the the distributed paradigm world

While calling a service to send an email may seem trivial within a monolith, it can become a real headache with a distributed architecture. Indeed, within a monolithic application, calling a service can be summed up to an in-process function call, which is incredibly fast and (almost) always succeeds. Remote calls, on the other hand, can fail due to a failure in the remote process or the connection, thus potentially causing increased errors and latency.

When adopting a distributed architecture, we want to keep the same level of resiliency and performance while benefiting from its pros (independent teams, better testability & deployability, technical decoupling). So how do we make services talk to each other effectively?

The distributed monolith paradigm and how to avoid it

When migrating from a monolith to a distributed architecture (microservices-oriented) or when adopting a distributed architecture from scratch, it's easy to fall into the distributed monolith paradigm.

While microservices allow easy disposal of modules, they require careful consideration of cross-cutting protocols.

Because you want to be able to create new services in seconds while following DRY principles, you tend to think that most of the core pieces of a service should be shared across them: routing, tracing, etc. Also called platform libraries/packages depending on the organization you work in, they are the de facto dependencies you need to install to get started.

Now, what happens if you want to bump the version of one of those shared binaries in a new service: do you have to update it in all other services? And how easy it is to create a new service using a different language from the one in which those libraries are written in?

A trace generated with a different version of the tracing package can not be used, causing monitoring issues

The all promise of microservices independent deployability and technical decoupling is now gone: you've got a distributed monolith.

A better approach and way to avoid that scenario is to define network protocols & data contracts, just like programming languages have interfaces as Ben Christensen puts it in his talk "Don't build a distributed monolith". Thus, teams can choose to use or reimplement independent librairies depending on their timeline & architectural decisions.

Putting this pitfall apart, let's now compare the different ways you can use data contracts and network protocols to easily make distributed services talk to each others.

Data contracts & Network protocols: where do we start?

There are lots of ways to make two entities communicate over a network. One simple way to put it down is to compare synchronous communication mechanisms against asynchronous communication mechanisms to understand where they each work best.

Synchronous communication

Synchronous communication is the most straightforward solution when trying to make services communicate: the client sends a request and waits for a response from the service to come back.

A synchronous request is considered blocking, the response is needed for the process to continue

Most technologies around synchronous communication are associated with HTTP, including examples like gRPC, REST or GraphQL.

REST: Everything is a resource

Using REST, services expose resources which are available on dedicated endpoints using different HTTP verbs depending on the action you want to perform. Information is transported using JSON which leads to serialisation & deserialisation of each request's body.

HTTP 1.1
JSON
HTTP verbs
Almost never fully adhere to all principles: too strict for most apps
Serialisation & Deserialisation
Request/Response
TCP handshakes for each request

gRPC: Calling remote functions from

Using gRPC, services are defined using Protocol buffers. Other services can then generate clients in different languages to send & receive Protobuf messages which are strongly typed.

HTTP 2
Protobuf messages
Interfaces & Structured messages
Strongly typed messages
Streaming

Synchronous communication is easy to grasp but can pose some limitations: while making a query to get a result is supposed to be fast and can fail, how de handle requests that need to perform an action (also known as a command)? What if the command takes a lot of time to complete? Do we want to wait indefinitely risking to create a congestion at the service level?

Asynchronous communication: events & messages at the center

When a service performs a command that other services might be interested in (think of an order service on an e-commerce website), it can become really hard to keep track of all other services to call using synchronous communication. If one call fails, we don't want the whole order to be canceled.

This is when asynchronous communication comes into action. In an event-driven architecture, asynchronous communication between services is done using messages or events. A service produces an event (a record that something was performed) which can be consumed by other services that need to do their own work related to that change. This way, the service creating the event (the order service in our example) does not need to know which other services are gonna react to it, which leads to loose coupling between them.

If a service fails to consume an event, it's easy to re-send the event later or to a different instance of that service. With synchronous communication, we would need to handle this recovery mechanism in the service producing the event.

The service producing the message does not care which other services consume it

There are two asynchronous communication types:

Message queuing: One producer creates a message consumed by a consumer

Message queuing allows messages to be created by producers and consumed once by a consumer. Examples include RabbitMQ, NSQ or ActiveMQ.

Publish/Subscribe: Subscribers listen for messages added to a topic

The Publish/Subscribe pattern allows messages to be created by publishers in topics. Consumers subscribe to one or more topics and consume messages in that topic. Examples include Apache Kafka, Amazon SNS or Redis.

Synchronous vs asynchronous: when to choose one over the other?

Both types of communication have advantages and limitations. While event-driven architectures are hard to get right but offer loose coupling, synchronous communication is synonymous with high coupling but is simple to use & debug. It is very common to find both of them in the same application. Here are common rules you can apply to choose one over the other:

Use synchronous communication if:

The operation is a simple query which is not changing any state
The operation result is needed to move forward in the current process
The operation can fail and does not require a complex retry mechanism
The operation needs to be synchronous

Use asynchronous communication if:

The operation involves multiple services reacting to it
The operation is a command to change a state
The operation must be performed while allowing failures & retries

The flaws you're exposed to when working with a distributed architecture and how to mitigate them

When putting a new layer (the network) between your microservices, your application is exposed to several pitfalls:

Network latency: the time it takes for a request to get from one service to another increases, causing the application to be slow or unavailable
Logical failure: a bug is introduced in a service which is key to your application, causing the application to be unavailable
Congestion & scaling failure: the number of instances needed for a service to run properly is not enough, causing the application to be unavailable

When not handled properly, each of them can take a service down, which leads to increased latency & congestion for other services relying on it, thus making them saturated and more services to go down.

What is the main risk? A cascading failure leading to the whole application to be unavailable.

Software engineers should aim for two objectives:

Failing fast & in silent: Services experiencing failures or anormal latency should be detected quickly and not create troubles for other services depending on it, thus limiting the blast of the issue.
Fallback and gracefully recover: Services which depend on a failing service should be able to rely on an alternative solution while the impacted service is restored to its normal state.

Widely used solutions which meet those objectives follow the circuit breaker pattern. By acting like electrical circuit breakers, they can stop remote services from accessing a service which experiences a high rate of failures immediately. After a timeout period, traffic to that service can resume if no new errors are detected. While solutions dedicated to this task like Netflix's Hystrix, Resilience4j or Alibaba's Sentinel exist, service meshes like Istio also contain traffic management features to enable it.

Wrapping up

Distributed services communication is hard to get right from the ground up. When considering existing services interactions and scaling issues, it gets even harder. Here are some principles you can follow:

Each service should define clear data contracts & communication protocols with other services
Critical services need to be listed and circuit breakers need to be in place to ensure their resiliency to other services's failures

Even with the best architecture, your application is not protected from bugs introduced in new releases, worsened network connections or cloud systems failures. This is why monitoring tools and observability are even more important in a distributed context.

This story was originally published on Redefined.cloud, an open-source publication where to get started with cloud computing using simple words and analogies you can understand.

Monitoring explained: what is the difference between Logging, Tracing and Profiling?

Grégoire Mielle — Mon, 08 Jun 2020 11:20:25 +0000

While organizations and engineers are shifting to a new paradigm which changes the way we build and operate applications, the need of effective monitoring is even more important to meet reliability objectives and user satisfaction.

Indeed, with cloud computing, visibility into services is crucial, especially when we talk about containers, microservices and highly distributed systems. While imagining a plane flying without any way to tell how its systems are doing is hard, imagining a fleet or aircrafts flying over cities without any traffic control towers monitoring the all thing is even harder and close to impossible.

From this starting point, we should be able to agree that monitoring is important, because:

Issues will arise, even with the best applications built by the best engineers
With distributed systems come distributed failures, which can be devastating when not prepared for and no way to tell where they come from
It contributes to transparency and accountability
It reveals mistakes early and offers paths for learning and improvements

With all that said, where should we start and what are the differences between Logging, Tracing and Profiling which are essential parts of how we monitor systems?

Logging: taking notes of things happening into a system

Let's start by the most straightforward way of understanding how a system behaves : logs.

A log is a record of events that happened over time: a screenshot of something with an associated timestamp.

A list of logs displayed in Datadog

A flight manifest is a good example of that principle: we log all passengers and crew members before departing so we know who is in all flying aircrafts at any given time.

In the context of cloud applications, logging can be used to save information about requests (duration, status code, userId), database queries, load balancer usage and more. It gives you precious details when bugs arise to determine the root cause of an outage or performance issue.

While logging everything is tempting, this strategy can be really expensive and ineffective. You need to find the right balance between logging everything and nothing to gain enough context for it to be useful.

Logging consists of multiple steps:

Collecting & Ingesting: When you generate logs in different services, you need a central place where to send them.
Processing: Ingested logs are enriched with metadata & attributes for future use.
Indexing: Logs are segmented into groups to generate metrics, patterns and dashboards.

With tools like the ELK stack, Datadog or AWS CloudWatch, you can generate powerful insights from huge amounts of logs coming from hundreds of different services.

Digging deeper into Logging

Learn more about the Elastic stack to collect, aggregate & analyzie logs using Elasticsearch, Logstash and Kibana
See how AWS logging solution, AWS CloudWatch, works
Read how you can use services like Datadog to collect logs and use them to monitor your application

Tracing: a single user's or request's journey through a system

While per-service logging is a good way of introspecting things, it is not enough to convey the big picture of a request propagating across a distributed system. In a microservices architecture, a request is the result of many interactions between different entities (APIs, databases, queues) which can all be a point of failure.

Tracing acts like the blackbox of an aircraft during a crash: it helps you understand how things went during a crash, to discover the chain of events that led to a problem.

It provides a low-level view to understand:

what triggered what in the program
with which arguments
in which order
how long did each step lasts

It helps discover performance bottlenecks quickly and find the cause of failures when the occur.

The result of Tracing can be visualized in two ways:

Traces: It looks like a flame graph with spans and their associated metadata
Service maps: It looks like a cloud of nodes and links between them to visualize the flow of requests

A distributed trace using the ELK stack

With tools like Open Tracing, AWS X-Ray or Zipkin, you can create traces and service maps to provide richer and more relevant context when troubleshooting issues.

Digging deeper into Tracing:

Learn how Uber uses distributed tracing to conquer microservices complexity
Watch how Lyft integrated and now uses distributed tracing
Read how Netflix built observability tools to better understand its systems
Learn more about tools like Open Tracing and Jaeger Tracing

Profiling & Metrics: measure a system's health over time

Profiling and metrics are the last piece of the monitoring puzzle. Together, they provide a statistic overview of a system's health and tracked events over time.

Software profiling helps you create profiles just like profilers in the police. But instead of catching bad guys, you want to catch bad performances.

Metrics displayed in Grafana using Prometheus

Profiles range from low-level metrics like CPU usage or file I/O to higher-level metrics like throughput or latency. When aggregated and seen together, they are powerful signals giving you a holistic view of your system and can help detect issues.

Like for logging, it's easy to think that the more metrics you have, the best your monitoring is gonna be. While it is tempting, you want to make sure to measure things that directly affect users of your application to be able to effectively detect and alert engineers when necessary.

With tools like Prometheus, Zabbix or Datadog, you can build profiles and metrics to improve how you monitor your services and underlying infrastructure.

Digging deeper into profiling & metrics

Learn more about the golden SRE signals as described by Steve Mushero
Read about Digital Ocean's introduction to Monitoring & Alerting

The bottom line

When used together, Logging, Tracing and Profiling can give you powerful insights about your services and systems: detecting anomalies when they arise and quickly understand the root cause of performance issues.

While monitoring is perfect at answering questions you're already asking yourself about your application, it only works when systems fail in predictable ways. Observability is about going one step further by providing tooling to openly observe and explore systems.

This story was originally published on Redefined.cloud, an open-source publication where to get started with cloud computing using simple words and analogies you can understand.