<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Grégoire Mielle</title>
    <description>The latest articles on DEV Community by Grégoire Mielle (@greeeg).</description>
    <link>https://dev.to/greeeg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F191608%2F98cbac0a-9b6f-4f55-9d19-0dbccf13e3f4.jpg</url>
      <title>DEV Community: Grégoire Mielle</title>
      <link>https://dev.to/greeeg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/greeeg"/>
    <language>en</language>
    <item>
      <title>Distributed architecture: how to make microservices talk to each other?</title>
      <dc:creator>Grégoire Mielle</dc:creator>
      <pubDate>Mon, 31 Aug 2020 13:12:56 +0000</pubDate>
      <link>https://dev.to/greeeg/distributed-architecture-how-to-make-microservices-talk-to-each-others-4pd8</link>
      <guid>https://dev.to/greeeg/distributed-architecture-how-to-make-microservices-talk-to-each-others-4pd8</guid>
      <description>&lt;p&gt;Software engineers all aim for the same thing: creating a fault-tolerant application to serve users' needs.&lt;/p&gt;

&lt;p&gt;As an application and its services grow, so does the organization. Whether a &lt;a href="https://redefined.cloud/en/microservice-architecture"&gt;microservice-oriented architecture&lt;/a&gt; was built from the ground up or you inherited a &lt;a href="https://redefined.cloud/en/monolith-architecture"&gt;monolithic application&lt;/a&gt; turned distributed, the need for effective communication between services is clear: we want the checkout service to be able to talk to the customer service in order to send the order confirmation email to the right person.&lt;/p&gt;

&lt;p&gt;Let's discuss the implications of putting a layer (the network) between services, the pitfalls we want to avoid &amp;amp; the best practices to maintain a fault-tolerant experience for our users.&lt;/p&gt;

&lt;h2&gt;
  
  
  From a function call to multiple network requests: welcome to the distributed world
&lt;/h2&gt;

&lt;p&gt;While calling a service to send an email may seem trivial within a monolith, it can become a real headache with a distributed architecture. Indeed, within a monolithic application, calling a service boils down to an in-process function call, which is incredibly fast and (almost) always succeeds. Remote calls, on the other hand, can fail due to a failure in the remote process or the connection, thus potentially causing increased errors and latency.&lt;/p&gt;

&lt;p&gt;When adopting a distributed architecture, we want to keep the same level of resiliency and performance while benefiting from its pros (independent teams, better testability &amp;amp; deployability, technical decoupling). So how do we make services talk to each other effectively?&lt;/p&gt;

&lt;h2&gt;
  
  
  The distributed monolith paradigm and how to avoid it
&lt;/h2&gt;

&lt;p&gt;When migrating from a monolith to a distributed architecture (microservices-oriented) or when adopting a distributed architecture from scratch, it's easy to fall into the distributed monolith paradigm.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;While microservices allow easy disposal of modules, they require careful consideration of cross-cutting protocols.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because you want to be able to create new services in seconds while following DRY principles, you tend to think that most of the core pieces of a service should be shared across them: routing, tracing, etc. Also called platform libraries/packages depending on the organization you work in, they are the de facto dependencies you need to install to get started.&lt;/p&gt;

&lt;p&gt;Now, what happens if you want to bump the version of one of those shared libraries in a new service: do you have to update it in all other services? And how easy is it to create a new service using a different language from the one those libraries are written in?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uTZi68EW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s7ose6wnbklc0fh9e119.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uTZi68EW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s7ose6wnbklc0fh9e119.png" alt="A trace generated with a different version of the tracing package can not be used, causing monitoring issues"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A trace generated with a different version of the tracing package can not be used, causing monitoring issues&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The whole promise of microservices (independent deployability and technical decoupling) is now gone: you've got a distributed monolith.&lt;/p&gt;

&lt;p&gt;A better approach and way to avoid that scenario is to &lt;strong&gt;define network protocols &amp;amp; data contracts&lt;/strong&gt;, just like programming languages have interfaces, as Ben Christensen puts it in his talk "Don't build a distributed monolith". Thus, teams can choose to use or reimplement independent libraries depending on their timeline &amp;amp; architectural decisions.&lt;/p&gt;
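&lt;p&gt;As a minimal sketch of what such a data contract could look like, here is a hypothetical &lt;code&gt;OrderConfirmed&lt;/code&gt; message exchanged between the checkout and customer services. The message name, its fields and the version number are assumptions made for illustration; the point is only that both sides agree on a serialised shape rather than on a shared library.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical contract: the name, fields and version are illustrative only.
CONTRACT_VERSION = 1


@dataclass(frozen=True)
class OrderConfirmed:
    order_id: str
    customer_email: str
    total_cents: int


def encode(message: OrderConfirmed) -> str:
    """Serialise the message to the JSON shape both services agree on."""
    payload = {"version": CONTRACT_VERSION, **asdict(message)}
    return json.dumps(payload, sort_keys=True)


def decode(raw: str) -> OrderConfirmed:
    """Parse a message, rejecting payloads with an unknown contract version."""
    payload = json.loads(raw)
    if payload.pop("version", None) != CONTRACT_VERSION:
        raise ValueError("unsupported contract version")
    return OrderConfirmed(**payload)
```

&lt;p&gt;A service written in another language only needs to produce or parse the same JSON document, which is what keeps the two sides independently deployable.&lt;/p&gt;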

&lt;p&gt;With this pitfall out of the way, let's now compare the different ways you can use data contracts and network protocols to make distributed services talk to each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data contracts &amp;amp; Network protocols: where do we start?
&lt;/h2&gt;

&lt;p&gt;There are lots of ways to make two entities communicate over a network. One simple way to frame the choice is to compare synchronous and asynchronous communication mechanisms and understand where each works best.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synchronous communication
&lt;/h3&gt;

&lt;p&gt;Synchronous communication is the most straightforward solution when trying to make services communicate: the client sends a request and waits for a response from the service to come back.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A3L7XB58--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eogvqp6e3zggtz4hshir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A3L7XB58--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eogvqp6e3zggtz4hshir.png" alt="A synchronous request is considered blocking, the response is needed for the process to continue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A synchronous request is considered blocking, the response is needed for the process to continue&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most technologies around synchronous communication are associated with HTTP, including examples like gRPC, REST or GraphQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST: Everything is a resource&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using REST, services expose resources which are available on dedicated endpoints using different HTTP verbs depending on the action you want to perform. Information is usually transported as JSON, which means each request's body has to be serialised &amp;amp; deserialised.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP 1.1&lt;/li&gt;
&lt;li&gt;JSON&lt;/li&gt;
&lt;li&gt;HTTP verbs&lt;/li&gt;
&lt;li&gt;APIs almost never adhere fully to all REST principles: too strict for most apps&lt;/li&gt;
&lt;li&gt;Serialisation &amp;amp; Deserialisation&lt;/li&gt;
&lt;li&gt;Request/Response&lt;/li&gt;
&lt;li&gt;A new TCP handshake for each request, unless the connection is kept alive and reused&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;gRPC: Calling remote functions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using gRPC, services are defined using Protocol Buffers. Other services can then generate clients in different languages to send &amp;amp; receive strongly typed Protobuf messages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP 2&lt;/li&gt;
&lt;li&gt;Protobuf messages&lt;/li&gt;
&lt;li&gt;Interfaces &amp;amp; Structured messages&lt;/li&gt;
&lt;li&gt;Strongly typed messages&lt;/li&gt;
&lt;li&gt;Streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Synchronous communication is easy to grasp but has some limitations: while a query to get a result is supposed to be fast and is allowed to fail, how do we handle requests that need to perform an action (also known as a command)? What if the command takes a lot of time to complete? Do we want to wait indefinitely, at the risk of creating congestion at the service level?&lt;/p&gt;

&lt;h3&gt;
  
  
  Asynchronous communication: events &amp;amp; messages at the center
&lt;/h3&gt;

&lt;p&gt;When a service performs a command that other services might be interested in (think of an order service on an e-commerce website), it can become really hard to keep track of all other services to call using synchronous communication. If one call fails, we don't want the whole order to be canceled.&lt;/p&gt;

&lt;p&gt;This is when asynchronous communication comes into action. In an event-driven architecture, asynchronous communication between services is done using messages or events. A service produces an event (a record that something was performed) which can be consumed by other services that need to do their own work related to that change. This way, the service creating the event (the order service in our example) does not need to know which other services are gonna react to it, which leads to loose coupling between them.&lt;/p&gt;

&lt;p&gt;If a service fails to consume an event, it's easy to re-send the event later or to a different instance of that service. With synchronous communication, we would need to handle this recovery mechanism in the service producing the event.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2rQZaYtl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hgnjjkno9378u1jo4wey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2rQZaYtl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hgnjjkno9378u1jo4wey.png" alt="The service producing the message does not care which other services consume it"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The service producing the message does not care which other services consume it&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are two asynchronous communication types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message queuing: One producer creates a message consumed by a consumer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://redefined.cloud/en/message-broker-message-queue"&gt;Message queuing&lt;/a&gt; allows messages to be created by producers and consumed once by a consumer. Examples include RabbitMQ, NSQ or ActiveMQ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Publish/Subscribe: Subscribers listen for messages added to a topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://redefined.cloud/en/pub-sub"&gt;Publish/Subscribe pattern&lt;/a&gt; allows messages to be created by publishers in topics. Consumers subscribe to one or more topics and consume messages in that topic. Examples include Apache Kafka, Amazon SNS or Redis.&lt;/p&gt;
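&lt;p&gt;To make the decoupling concrete, here is a toy in-memory publish/subscribe broker. It is only a sketch: the &lt;code&gt;InMemoryBroker&lt;/code&gt; class and the &lt;code&gt;order.created&lt;/code&gt; topic are hypothetical names, and delivery is a plain in-process function call, so it illustrates that the publisher does not know its consumers, not how a real broker such as Kafka transports, persists or retries messages.&lt;/p&gt;

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy publish/subscribe broker; a real one would persist and retry."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


broker = InMemoryBroker()
emails: List[str] = []
shipments: List[str] = []

# Two independent consumers react to the same event.
broker.subscribe("order.created", lambda e: emails.append(e["email"]))
broker.subscribe("order.created", lambda e: shipments.append(e["order_id"]))

# The order service publishes without knowing who listens.
delivered = broker.publish("order.created", {"order_id": "o-1", "email": "a@example.com"})
```

&lt;p&gt;Adding a third consumer means one more &lt;code&gt;subscribe&lt;/code&gt; call; the publishing code does not change.&lt;/p&gt;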

&lt;h2&gt;
  
  
  Synchronous vs asynchronous: when to choose one over the other?
&lt;/h2&gt;

&lt;p&gt;Both types of communication have advantages and limitations. While event-driven architectures are hard to get right but offer loose coupling, synchronous communication is synonymous with high coupling but is simple to use &amp;amp; debug. It is very common to find both of them in the same application. Here are common rules you can apply to choose one over the other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use synchronous communication if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The operation is a simple query which is not changing any state&lt;/li&gt;
&lt;li&gt;The operation result is needed to move forward in the current process&lt;/li&gt;
&lt;li&gt;The operation can fail and does not require a complex retry mechanism&lt;/li&gt;
&lt;li&gt;The operation needs to be synchronous&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use asynchronous communication if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The operation involves multiple services reacting to it&lt;/li&gt;
&lt;li&gt;The operation is a command to change a state&lt;/li&gt;
&lt;li&gt;The operation must be performed while allowing failures &amp;amp; retries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The flaws you're exposed to when working with a distributed architecture and how to mitigate them
&lt;/h2&gt;

&lt;p&gt;When putting a new layer (the network) between your microservices, your application is exposed to several pitfalls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network latency:&lt;/strong&gt; the time it takes for a request to get from one service to another increases, causing the application to be slow or unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical failure:&lt;/strong&gt; a bug is introduced in a service which is key to your application, causing the application to be unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Congestion &amp;amp; scaling failure:&lt;/strong&gt; the number of running instances of a service is not enough to handle its load, causing the application to be unavailable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When not handled properly, each of them can take a service down, which leads to increased latency &amp;amp; congestion for other services relying on it, thus saturating them and causing more services to go down.&lt;/p&gt;

&lt;p&gt;What is the main risk? A cascading failure that makes the whole application unavailable.&lt;/p&gt;

&lt;p&gt;Software engineers should aim for two objectives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failing fast &amp;amp; silently:&lt;/strong&gt; Services experiencing failures or abnormal latency should be detected quickly and should not create trouble for the services depending on them, thus limiting the blast radius of the issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fall back and recover gracefully:&lt;/strong&gt; Services which depend on a failing service should be able to rely on an alternative solution while the impacted service is restored to its normal state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Widely used solutions which meet those objectives follow the &lt;strong&gt;circuit breaker pattern&lt;/strong&gt;. By acting like electrical circuit breakers, they can immediately stop remote services from calling a service that experiences a high rate of failures. After a timeout period, traffic to that service can resume if no new errors are detected. While solutions dedicated to this task like Netflix's Hystrix, Resilience4j or Alibaba's Sentinel exist, &lt;a href="https://redefined.cloud/en/service-mesh"&gt;service meshes&lt;/a&gt; like Istio also contain traffic management features to enable it.&lt;/p&gt;
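&lt;p&gt;The pattern can be sketched in a few lines. This is a simplified illustration, not how Hystrix, Resilience4j or Sentinel are implemented: the &lt;code&gt;CircuitBreaker&lt;/code&gt; class, its parameters and the choice to count consecutive failures (rather than a failure rate) are assumptions made to keep the example short.&lt;/p&gt;

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at >= self.reset_timeout:
                # Timeout elapsed: let one trial call through (half-open).
                # A single new failure will reopen the circuit.
                self._opened_at = None
                self._failures = self.max_failures - 1
            else:
                # Fail fast without touching the remote service.
                raise CircuitOpenError("circuit is open")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # A successful call closes the circuit again.
        self._failures = 0
        return result
```

&lt;p&gt;A caller would wrap each remote call, for example &lt;code&gt;breaker.call(lambda: send_email(order))&lt;/code&gt; with a hypothetical &lt;code&gt;send_email&lt;/code&gt; function, and catch &lt;code&gt;CircuitOpenError&lt;/code&gt; to switch to its fallback.&lt;/p&gt;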

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Distributed services communication is hard to get right from the ground up. When considering existing services interactions and scaling issues, it gets even harder. Here are some principles you can follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Each service should define clear data contracts &amp;amp; communication protocols with other services&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Critical services need to be listed and circuit breakers need to be in place to ensure their resiliency to other services' failures&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even with the best architecture, your application is not protected from bugs introduced in new releases, worsened network connections or cloud systems failures. This is why &lt;a href="https://redefined.cloud/en/posts/monitoring-explained-differences-logging-tracing-profiling"&gt;monitoring tools and observability&lt;/a&gt; are even more important in a distributed context.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This story was originally published on &lt;a href="https://redefined.cloud/en"&gt;Redefined.cloud&lt;/a&gt;, an open-source publication to help you get started with cloud computing using simple words and analogies you can understand.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Monitoring explained: what is the difference between Logging, Tracing and Profiling?</title>
      <dc:creator>Grégoire Mielle</dc:creator>
      <pubDate>Mon, 08 Jun 2020 11:20:25 +0000</pubDate>
      <link>https://dev.to/greeeg/monitoring-explained-what-is-the-difference-between-logging-tracing-and-profiling-4028</link>
      <guid>https://dev.to/greeeg/monitoring-explained-what-is-the-difference-between-logging-tracing-and-profiling-4028</guid>
      <description>&lt;p&gt;While organizations and engineers are shifting to a new paradigm which changes the way we build and operate applications, the need for effective &lt;a href="https://redefined.cloud/en/monitoring" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt; is even more important to meet reliability objectives and user satisfaction.&lt;/p&gt;

&lt;p&gt;Indeed, with cloud computing, visibility into services is crucial, especially when we talk about &lt;a href="https://redefined.cloud/en/containers-and-docker" rel="noopener noreferrer"&gt;containers&lt;/a&gt;, &lt;a href="https://redefined.cloud/en/microservice-architecture" rel="noopener noreferrer"&gt;microservices&lt;/a&gt; and highly distributed systems. While imagining a plane flying without any way to tell how its systems are doing is hard, imagining a fleet of aircraft flying over cities without any traffic control towers monitoring the whole thing is even harder and close to impossible.&lt;/p&gt;

&lt;p&gt;From this starting point, we should be able to agree that monitoring is important, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issues will arise, even with the best applications built by the best engineers&lt;/li&gt;
&lt;li&gt;With distributed systems come distributed failures, which can be devastating when you are not prepared for them and have no way to tell where they come from&lt;/li&gt;
&lt;li&gt;It contributes to transparency and accountability&lt;/li&gt;
&lt;li&gt;It reveals mistakes early and offers paths for learning and improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With all that said, where should we start and what are the differences between Logging, Tracing and Profiling which are essential parts of how we monitor systems?&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging: taking notes of things happening in a system
&lt;/h2&gt;

&lt;p&gt;Let's start with the most straightforward way of understanding how a system behaves: logs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A log is a record of events that happened over time: a snapshot of something with an associated timestamp.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8rcofmo8o0cya3ycij7w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8rcofmo8o0cya3ycij7w.jpg" alt="A list of logs displayed in Datadog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A list of logs displayed in Datadog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A flight manifest is a good example of that principle: we log all passengers and crew members before departing so we know who is on board every aircraft in flight at any given time.&lt;/p&gt;

&lt;p&gt;In the context of cloud applications, logging can be used to save information about requests (duration, status code, userId), database queries, load balancer usage and more. It gives you precious details when bugs arise to determine the root cause of an outage or performance issue.&lt;/p&gt;
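&lt;p&gt;A request log like the one described above could be emitted as one JSON object per line, which is a shape log pipelines can usually parse without extra rules. The sketch below uses only Python's standard &lt;code&gt;logging&lt;/code&gt; module; the &lt;code&gt;JsonFormatter&lt;/code&gt; class, the &lt;code&gt;checkout&lt;/code&gt; logger name and the attribute names are assumptions chosen for illustration.&lt;/p&gt;

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attributes passed via `extra=` end up on the record itself.
        for key in ("status_code", "duration_ms", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in one place

logger.info(
    "request handled",
    extra={"status_code": 200, "duration_ms": 42, "user_id": "u-1"},
)
```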

&lt;p&gt;While logging everything is tempting, this strategy can be really expensive and ineffective. You need to find the right balance between logging everything and nothing to gain enough context for it to be useful.&lt;/p&gt;

&lt;p&gt;Logging consists of multiple steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Collecting &amp;amp; Ingesting:&lt;/strong&gt; When you generate logs in different services, you need a central place to send them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing:&lt;/strong&gt; Ingested logs are enriched with metadata &amp;amp; attributes for future use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing:&lt;/strong&gt; Logs are segmented into groups to generate metrics, patterns and dashboards.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With tools like the &lt;a href="https://www.elastic.co/what-is/elk-stack" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt;, &lt;a href="https://docs.datadoghq.com/logs/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; or &lt;a href="https://aws.amazon.com/cloudwatch/" rel="noopener noreferrer"&gt;AWS CloudWatch&lt;/a&gt;, you can generate powerful insights from huge amounts of logs coming from hundreds of different services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digging deeper into Logging
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Learn more about &lt;a href="https://logz.io/learn/complete-guide-elk-stack/#intro" rel="noopener noreferrer"&gt;the Elastic stack&lt;/a&gt; to collect, aggregate &amp;amp; analyzie logs using Elasticsearch, Logstash and Kibana&lt;/li&gt;
&lt;li&gt;See how AWS's logging solution, &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_architecture.html" rel="noopener noreferrer"&gt;AWS CloudWatch&lt;/a&gt;, works&lt;/li&gt;
&lt;li&gt;Read how you can use services like &lt;a href="https://docs.datadoghq.com/logs/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; to collect logs and use them to monitor your application&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tracing: a single user's or request's journey through a system
&lt;/h2&gt;

&lt;p&gt;While per-service logging is a good way of introspecting things, it is not enough to convey the big picture of a request propagating across a distributed system. In a microservices architecture, a request is the result of many interactions between different entities (APIs, databases, queues) which can all be a point of failure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tracing acts like the black box of an aircraft: after a crash, it helps you understand how things went and discover the chain of events that led to the problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It provides a low-level view to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what triggered what in the program&lt;/li&gt;
&lt;li&gt;with which arguments&lt;/li&gt;
&lt;li&gt;in which order&lt;/li&gt;
&lt;li&gt;how long each step lasted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It helps discover performance bottlenecks quickly and find the cause of failures when they occur.&lt;/p&gt;
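&lt;p&gt;The idea of a trace made of nested, timed spans can be sketched without any tracing library. The &lt;code&gt;Tracer&lt;/code&gt; and &lt;code&gt;Span&lt;/code&gt; classes below are hypothetical and live in a single process; real tools also propagate the trace identifier across services (typically in request headers) and export spans to a backend, which this sketch leaves out.&lt;/p&gt;

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: float = 0.0


@dataclass
class Tracer:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    finished: List[Span] = field(default_factory=list)
    _stack: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        # The span currently on top of the stack becomes the parent.
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent, name)
        self._stack.append(current)
        started = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - started) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("checkout") as root:
    with tracer.span("charge-card"):
        pass  # a call to the payment service would go here
    with tracer.span("send-email"):
        pass  # a call to the customer service would go here
```

&lt;p&gt;The parent identifiers answer "what triggered what", the order in which spans finish gives the sequence, and &lt;code&gt;duration_ms&lt;/code&gt; tells how long each step lasted.&lt;/p&gt;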

&lt;p&gt;The result of Tracing can be visualized in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traces:&lt;/strong&gt; It looks like a flame graph with spans and their associated metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service maps:&lt;/strong&gt; It looks like a cloud of nodes and links between them to visualize the flow of requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fji2sxk2hfa1fi7drswt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fji2sxk2hfa1fi7drswt9.png" alt="A distributed trace using the ELK stack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A distributed trace using the ELK stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With tools like &lt;a href="https://opentracing.io/" rel="noopener noreferrer"&gt;Open Tracing&lt;/a&gt;, &lt;a href="https://aws.amazon.com/xray/" rel="noopener noreferrer"&gt;AWS X-Ray&lt;/a&gt; or &lt;a href="https://zipkin.io/" rel="noopener noreferrer"&gt;Zipkin&lt;/a&gt;, you can create traces and service maps to provide richer and more relevant context when troubleshooting issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digging deeper into Tracing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Learn &lt;a href="https://www.youtube.com/watch?v=EW9GjQNcyzI" rel="noopener noreferrer"&gt;how Uber uses distributed tracing&lt;/a&gt; to conquer microservices complexity&lt;/li&gt;
&lt;li&gt;Watch &lt;a href="https://www.youtube.com/watch?v=URCLeycMrhU" rel="noopener noreferrer"&gt;how Lyft integrated and now uses distributed tracing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Read &lt;a href="https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17" rel="noopener noreferrer"&gt;how Netflix built observability tools&lt;/a&gt; to better understand its systems&lt;/li&gt;
&lt;li&gt;Learn more about tools like &lt;a href="https://opentracing.io/" rel="noopener noreferrer"&gt;OpenTracing&lt;/a&gt; and &lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;Jaeger Tracing&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Profiling &amp;amp; Metrics: measure a system's health over time
&lt;/h2&gt;

&lt;p&gt;Profiling and metrics are the last piece of the monitoring puzzle. Together, they provide a statistical overview of a system's health and tracked events over time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Software profiling helps you create profiles just like profilers in the police. But instead of catching bad guys, you want to catch bad performance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fufodfrfb58mmnj351ofn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fufodfrfb58mmnj351ofn.png" alt="Metrics displayed in Grafana using Prometheus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics displayed in Grafana using Prometheus&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Profiles range from low-level metrics like CPU usage or file I/O to higher-level metrics like throughput or latency. When aggregated and seen together, they are powerful signals giving you a holistic view of your system and can help detect issues.&lt;/p&gt;
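&lt;p&gt;As a rough illustration of how raw measurements turn into higher-level metrics, the sketch below aggregates a window of request latencies into throughput and latency percentiles using Python's standard &lt;code&gt;statistics&lt;/code&gt; module. The sample values, the five-second window and the &lt;code&gt;summarize&lt;/code&gt; function are made up for the example; a tool like Prometheus would compute similar aggregates from data it scrapes.&lt;/p&gt;

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float], window_seconds: float) -> Dict[str, float]:
    """Aggregate one window of request latencies into a few metrics."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "throughput_rps": len(latencies_ms) / window_seconds,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "max_ms": max(latencies_ms),
    }


# Made-up sample: mostly fast requests plus two slow outliers.
sample = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 12.0, 18.0, 500.0]
summary = summarize(sample, window_seconds=5.0)
```

&lt;p&gt;The median stays low while the 95th percentile is pulled up by the outliers, which is why percentiles tend to say more about what users experience than an average does.&lt;/p&gt;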

&lt;p&gt;As with logging, it's easy to think that the more metrics you have, the better your monitoring is gonna be. While it is tempting, you want to make sure to measure things that directly affect users of your application to be able to effectively detect issues and alert engineers when necessary.&lt;/p&gt;

&lt;p&gt;With tools like &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://www.zabbix.com/" rel="noopener noreferrer"&gt;Zabbix&lt;/a&gt; or &lt;a href="https://docs.datadoghq.com/tracing/profiling/?tab=java" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt;, you can build profiles and metrics to improve how you monitor your services and underlying infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digging deeper into profiling &amp;amp; metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Learn more about &lt;a href="https://medium.com/faun/how-to-monitor-the-sre-golden-signals-1391cadc7524" rel="noopener noreferrer"&gt;the golden SRE signals&lt;/a&gt; as described by Steve Mushero&lt;/li&gt;
&lt;li&gt;Read about &lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting" rel="noopener noreferrer"&gt;Digital Ocean's introduction to Monitoring &amp;amp; Alerting&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;When used together, Logging, Tracing and Profiling can give you powerful insights about your services and systems: you can detect anomalies when they arise and quickly understand the root cause of performance issues.&lt;/p&gt;

&lt;p&gt;While monitoring is perfect at answering questions you're already asking yourself about your application, it only works when systems fail in predictable ways. &lt;a href="https://redefined.cloud/en/observability" rel="noopener noreferrer"&gt;Observability&lt;/a&gt; is about going one step further by providing tooling to openly observe and explore systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This story was originally published on &lt;a href="https://redefined.cloud/en" rel="noopener noreferrer"&gt;Redefined.cloud&lt;/a&gt;, an open-source publication to help you get started with cloud computing using simple words and analogies you can understand.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
