<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anadi Misra</title>
    <description>The latest articles on DEV Community by Anadi Misra (@anadimisra).</description>
    <link>https://dev.to/anadimisra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F488849%2F9cee1362-fd10-45b4-a317-341cccf233a7.jpeg</url>
      <title>DEV Community: Anadi Misra</title>
      <link>https://dev.to/anadimisra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anadimisra"/>
    <language>en</language>
    <item>
      <title>A Guide to Managing the First Fallacy of Distributed Computing</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 24 Oct 2023 12:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/a-guide-to-managing-the-first-fallacy-of-distributed-computing-57m</link>
      <guid>https://dev.to/anadimisra/a-guide-to-managing-the-first-fallacy-of-distributed-computing-57m</guid>
      <description>&lt;p&gt;Distributed computing is a complex field with numerous challenges, and understanding the fallacies associated with it is crucial for building robust and reliable distributed systems. Here are eight fallacies of distributed computing and their significance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Network Is Reliable:&lt;/strong&gt; Assuming that network connections are always available and reliable can lead to system failures when network outages occur, even when the network outages are transitory. It's essential to design systems that can gracefully handle network failures through redundancy and fault tolerance mechanisms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Is Zero:&lt;/strong&gt; Overestimating the speed of communication between distributed components can result in slow and unresponsive systems. Acknowledging network latency and optimizing for it is vital for delivering efficient user experiences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth Is Infinite:&lt;/strong&gt; Believing that network bandwidth is unlimited can lead to overloading the network and causing congestion. Efficient data transmission and bandwidth management are crucial to avoid performance bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Network Is Secure:&lt;/strong&gt; Assuming that the network is inherently secure can result in vulnerabilities and data breaches. Implementing strong security measures, including encryption and authentication, is necessary to protect sensitive information in distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topology Doesn't Change:&lt;/strong&gt; Networks evolve, and assuming a static topology can lead to configuration errors and system instability. Systems should be designed to adapt to changing network conditions and configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;There Is One Administrator:&lt;/strong&gt; Believing that a single administrator controls the entire distributed system can lead to coordination issues and conflicts. In reality, distributed systems often involve multiple administrators, and clear governance and coordination mechanisms are needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport Cost Is Zero:&lt;/strong&gt; Neglecting the cost associated with data transfer can lead to inefficient resource utilization and increased operational expenses. Optimizing data transfer and considering the associated costs are essential for cost-effective distributed computing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Network Is Homogeneous:&lt;/strong&gt; Assuming that all network components and nodes have the same characteristics can result in compatibility issues and performance disparities. Systems should be designed to handle heterogeneity and accommodate various types of devices and platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these fallacies is critical because they underscore the challenges and complexities of distributed computing. Failure to account for them can lead to system failures, security breaches, and increased operational costs. Building reliable, efficient, and secure distributed systems requires a deep understanding of these fallacies and the implementation of appropriate software design, architecture, and IT operational strategies to address them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unreliable Networks
&lt;/h2&gt;

&lt;p&gt;In this blog post, we will look at the first fallacy, its impact on microservices architecture, and how to work around this limitation. Let's say we're using Spring Boot to write our microservice, with MongoDB as the backend deployed as a StatefulSet in Kubernetes, and we're running all of this on EKS. You might argue that it is your cloud provider's job to give us a reliable network and that we're paying them for high availability. The expectation isn't wrong, but unfortunately it doesn't always work out that way when you rent hardware over the cloud. Let's say your cloud provider promises 99.99% availability. Impressive, right? Not quite, and I'll explain why. 99.99% availability could mean&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One request in every 10,000 requests failing&lt;/li&gt;
&lt;li&gt;Ten requests in every 100,000 requests failing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you might say that my system doesn't get that kind of traffic! Fair enough, but this is the availability figure for the cloud provider as a whole, not for your instances of a service. That means if the cloud is handling a billion network requests, 100,000 of them will fail! And to make things more complex, you can't expect the provider to distribute these failures evenly across all the accounts on its hardware; you might be hit by any number of those failures depending on your luck. The question here is: do you want to run a business on the mere chance that these outages won't hit you? I hope not! So that's the fundamental description of the first (and most critical) fallacy of distributed computing.&lt;/p&gt;
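&lt;p&gt;To put numbers on this, here's a quick back-of-the-envelope sketch, assuming failures are spread uniformly at the promised availability level (the class and method names are purely illustrative):&lt;/p&gt;

```java
public class AvailabilityMath {
    // Expected number of failed requests at a given availability percentage,
    // assuming failures are uniformly distributed across all requests.
    static long expectedFailures(long totalRequests, double availabilityPercent) {
        double failureRate = (100.0 - availabilityPercent) / 100.0;
        return Math.round(totalRequests * failureRate);
    }

    public static void main(String[] args) {
        // 99.99% availability: one in every 10,000 requests fails...
        System.out.println(expectedFailures(10_000L, 99.99));
        // ...which is 100,000 failures out of a billion requests
        System.out.println(expectedFailures(1_000_000_000L, 99.99));
    }
}
```

&lt;p&gt;At a billion requests, a 99.99% SLA still leaves a hundred thousand failed calls to handle.&lt;/p&gt;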

&lt;h2&gt;
  
  
  The Impact of Network Failures
&lt;/h2&gt;

&lt;p&gt;Let's take the example of an e-commerce system. We'd usually serve a product catalogue from the Product microservice; however, the SKU availability might be fetched from another microservice when building the Product Catalogue response. One could argue that the SKU information could be replicated into the product catalogue via choreography, but for the scope of this example, let's assume that's not in place. The Product service is therefore making a REST API call to the SKU service. What happens when this call fails? How would you convey to the end user whether the product they are looking at is available or not?&lt;/p&gt;

&lt;p&gt;Scary stuff, yeah? Well, not that scary; as engineers we love to brave the harder frontiers, and we have a few tricks up our sleeves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding for Fault Tolerance and Resilience
&lt;/h2&gt;

&lt;p&gt;This is perhaps a topic worthy of a book in itself rather than a blog post, but I'll try to cover all I can while keeping it simple. Most of what I'm sharing here comes from our experience transitioning from a monolith to microservices for the SaaS business at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;, and I hope others find it helpful too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Patterns for Transitory Outages
&lt;/h3&gt;

&lt;p&gt;The following patterns help circumvent transitory outages, or blips as we usually call them. The fundamental underlying assumption is that such outages last a second or two at worst.&lt;/p&gt;

&lt;h4&gt;
  
  
  Retries
&lt;/h4&gt;

&lt;p&gt;One of the simplest things to do is to wrap your network calls in retry logic so that there are multiple attempts before the calling service finally gives up. The idea here is that a temporary network snag at the cloud provider won't outlast the retries made for fetching data. Microservice libraries and frameworks in almost all common programming languages provide this feature. The retries themselves have to be nuanced and selective; retrying on an HTTP &lt;code&gt;400&lt;/code&gt;, for example, will not change the outcome until the request signature changes. Here's an example of using retries when making REST API calls with Spring WebFlux WebClient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;webClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpHeaders&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;retrieve&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;bodyToMono&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ParameterizedTypeReference&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="o"&gt;})&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;log&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getClass&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;Level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FINE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;retryWhen&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="nc"&gt;Retry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;jitter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;throwable&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nc"&gt;RuntimeException&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
                                        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;throwable&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                                        &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GATEWAY_TIMEOUT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SERVICE_UNAVAILABLE&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAD_GATEWAY&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;onRetryExhaustedThrow&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;retryBackoffSpec&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrySignal&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Service at {} failed to respond, after max attempts of: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrySignal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;totalRetries&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
                                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retrySignal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;failure&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                                &lt;span class="o"&gt;}))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;onErrorResume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;is4xxClientError&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;Mono&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Mono&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's a summary of what we're trying to achieve with this piece of code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry a maximum of three times, with an exponential backoff starting at two seconds&lt;/li&gt;
&lt;li&gt;Randomize the spacing between retries based on the &lt;code&gt;jitter&lt;/code&gt; factor&lt;/li&gt;
&lt;li&gt;Retry only if the upstream service returned an HTTP &lt;code&gt;504&lt;/code&gt;, &lt;code&gt;503&lt;/code&gt; or &lt;code&gt;502&lt;/code&gt; status&lt;/li&gt;
&lt;li&gt;Log the error and pass it downstream when the maximum attempts are exhausted&lt;/li&gt;
&lt;li&gt;Return an empty response instead for client errors, or pass the error from the previous step downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These retries can help recover from blips or snags which aren't expected to last long. This can also be a good mechanism if the upstream service we're calling restarts for whatever reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We've noticed running replica sets in Kubernetes with &lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/" rel="noopener noreferrer"&gt;Rolling Updates&lt;/a&gt; strategy helps reduce such blips and hence retries.&lt;/p&gt;

&lt;p&gt;While this is an example using the Reactor Project's implementation in Spring, all major frameworks and languages provide alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/spring-projects/spring-retry" rel="noopener noreferrer"&gt;Spring Retry&lt;/a&gt; if you're Spring Framework but not on reactive programming&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doc.akka.io/docs/akka/current/fault-tolerance.html" rel="noopener noreferrer"&gt;Supervisor Strategy&lt;/a&gt; when you're on Akka with Scala or Java&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scala.util.{Failure, Try}&lt;/code&gt; if you're using Scala without any framework as such&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/retry/" rel="noopener noreferrer"&gt;Retry Decorator&lt;/a&gt; in python&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/package/fetch-retry" rel="noopener noreferrer"&gt;fetch-retry&lt;/a&gt; in JavaScript&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and I'm sure this is not an exhaustive list. This pattern takes care of transitory network blips. But what if there's a sustained outage? More on that later in this article.&lt;/p&gt;
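&lt;p&gt;If none of these fit your stack, the underlying idea is simple enough to hand-roll: a loop with exponential backoff, jitter, and a predicate deciding which failures are worth retrying. Here's a minimal, framework-free sketch (the names and the retryable check are illustrative):&lt;/p&gt;

```java
import java.util.Random;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class SimpleRetry {
    // Retries the call up to maxAttempts times with exponential backoff
    // and jitter, but only for failures the caller deems retryable.
    public static <T> T withBackoff(Supplier<T> call, int maxAttempts,
                                    long baseDelayMillis,
                                    Predicate<RuntimeException> retryable)
            throws InterruptedException {
        Random jitter = new Random();
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e; // exhausted, or not worth retrying (e.g. a 400)
                }
                long delay = baseDelayMillis * (1L << (attempt - 1)); // exponential
                Thread.sleep(delay + jitter.nextInt((int) delay));    // plus jitter
            }
        }
    }
}
```

&lt;p&gt;The predicate plays the same role as the &lt;code&gt;filter&lt;/code&gt; in the WebClient example: it keeps the loop from wasting attempts on failures that retries cannot fix.&lt;/p&gt;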

&lt;h4&gt;
  
  
  Last Known Good Version
&lt;/h4&gt;

&lt;p&gt;What if the called service crashes continuously and all retries from the various clients are exhausted? I prefer falling back to a last known good version. There are a couple of strategies that can enable this &lt;code&gt;last-known-good-version&lt;/code&gt; policy on the infrastructure and client side, and we'll briefly touch upon each of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployments&lt;/strong&gt; The simplest option from an infrastructure perspective is to redeploy the last known stable version of the service, under the assumption that the downstream apps are still compatible with this older version. This is easy to do in Kubernetes, which saves previous &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-to-a-previous-revision" rel="noopener noreferrer"&gt;revisions&lt;/a&gt; of deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cached at downstream&lt;/strong&gt; Another way is for clients to save the last successful response and fall back on it in case of failures from the service; showing a stale-data prompt to the end user on the browser or mobile UI is a good option here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Caching at downstream&lt;/strong&gt; &lt;br&gt;
The browser, or any client for that matter, continuously writes data to an in-memory store until it receives a heartbeat from the upstream. This mechanism offers various implementations for both UI and headless clients that make service calls through gRPC or REST. Here is a summary of what to do, regardless of the type of client.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clients are registered on their first API Call for the service to keep track&lt;/li&gt;
&lt;li&gt;Subsequent updates to the clients are managed as a push from the service to the client&lt;/li&gt;
&lt;li&gt;Clients retain the state locally: Redux on the browser; Redis or Memcached for headless clients (psst ... LinkedHashMaps too, if your soul allows that 😏)&lt;/li&gt;
&lt;li&gt;If you're not at a scale where you can afford push, you can use tools like RTK for ReactJS and the NgRx store for Angular and keep pulling state updates; be sure to inform the end user that they might be seeing stale data when you get any 5XX status errors&lt;/li&gt;
&lt;/ul&gt;
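&lt;p&gt;For headless clients, the fallback itself can be a thin wrapper that remembers the last successful response per key and serves it when a call fails. A minimal in-memory sketch (the names are illustrative; in production you'd likely back this with Redis or Memcached rather than a map):&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class LastKnownGoodCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    // Calls the upstream; on success caches and returns the fresh value,
    // on failure falls back to the last known good value (if any).
    public V fetch(K key, Supplier<V> upstreamCall) {
        try {
            V fresh = upstreamCall.get();
            lastGood.put(key, fresh);
            return fresh;
        } catch (RuntimeException e) {
            V stale = lastGood.get(key);
            if (stale == null) {
                throw e; // nothing to fall back on: propagate the failure
            }
            return stale; // caller should flag this as possibly stale data
        }
    }
}
```

&lt;p&gt;Whenever the stale branch is taken, the client should surface the stale-data prompt described above.&lt;/p&gt;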

&lt;h3&gt;
  
  
  Patterns for sustained outages
&lt;/h3&gt;

&lt;p&gt;We'd be lucky if every outage in a distributed architecture were just a blip, but they are not. Hence, we have to build our systems to handle long-lived outages too. Here are some of the patterns that help in this regard.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bulkheads
&lt;/h4&gt;

&lt;p&gt;Bulkheads address the contingency of outages caused by slow upstream services. While the ideal solution is to fix the upstream issue, it's not always feasible. Consider a scenario where the service (X) you're calling relies on another service (Y) that exhibits sluggish response times. If service (X) experiences a high volume of incoming traffic, a significant portion of its threads may be left waiting for the slower upstream service (Y) to respond. This waiting not only slows down the service (X) but also increases the rate of dropped requests, leading to more client retries and exacerbating the bottleneck.&lt;/p&gt;

&lt;p&gt;To mitigate this issue, one effective approach is to localize the impact of failures. For instance, you can create a dedicated thread pool with a limited number of threads for calling the slower service. By doing so, you confine the effects of slowness and timeouts to a specific API call, thereby enhancing the overall service throughput.&lt;/p&gt;
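&lt;p&gt;Such a bulkhead can be as simple as a small, dedicated thread pool with a bounded queue in front of the slow dependency; when the pool is saturated, callers fail fast instead of tying up the service's main request threads. A rough sketch (the pool sizes and names are illustrative):&lt;/p&gt;

```java
import java.util.concurrent.*;

public class SkuServiceBulkhead {
    // Only this small pool ever waits on the slow upstream; a full queue
    // rejects new work immediately instead of letting waits pile up.
    private final ExecutorService pool = new ThreadPoolExecutor(
            4, 4,                                  // fixed pool of 4 threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(16),          // bounded backlog
            new ThreadPoolExecutor.AbortPolicy()); // fail fast when saturated

    public Future<String> fetchAvailability(Callable<String> slowCall) {
        return pool.submit(slowCall); // throws RejectedExecutionException when full
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

&lt;p&gt;Slowness in the SKU call can now exhaust at most four threads and sixteen queued requests; the rest of the service keeps its throughput.&lt;/p&gt;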

&lt;h4&gt;
  
  
  Circuit Breakers
&lt;/h4&gt;

&lt;p&gt;Circuit breakers could easily be avoided if we only wrote services that never go down! The reality, however, is that our applications often rely on external services developed by others. In these situations, the Circuit Breaker pattern becomes invaluable. It routes all traffic between services through a proxy, which promptly starts rejecting requests once a defined threshold of failures is reached. This pattern proves particularly useful during prolonged network outages in external services, which could otherwise lead to outages in the calling services. Nevertheless, ensuring a seamless user experience in such scenarios is vital, and we've found two approaches to be effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notify users of the outage in the affected area while enabling them to use other parts of the system.&lt;/li&gt;
&lt;li&gt;Allow the client to cache user transactions, providing a "202 Accepted" response instead of "200" or "201" as usual, and resume these transactions once the upstream service becomes available again.&lt;/li&gt;
&lt;/ul&gt;
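&lt;p&gt;At its core, the proxy is a small state machine: count consecutive failures, trip open at a threshold, and let a trial request through after a cool-down. A minimal sketch of the closed/open transitions (the threshold and timings are illustrative; libraries like Resilience4j provide production-grade implementations):&lt;/p&gt;

```java
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> upstream) {
        if (openedAt >= 0) { // OPEN: reject until the cool-down elapses
            if (System.currentTimeMillis() - openedAt < openMillis) {
                throw new IllegalStateException("circuit open, rejecting call");
            }
            openedAt = -1; // HALF-OPEN: let one trial call through
        }
        try {
            T result = upstream.get();
            consecutiveFailures = 0; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip OPEN
            }
            throw e;
        }
    }
}
```

&lt;p&gt;While the circuit is open, the caller can fall back to the cached-response or 202-and-resume approaches above instead of waiting on a dead upstream.&lt;/p&gt;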

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Despite a cloud provider's commitment to high availability, network failures remain inevitable given the vast scale and unpredictable nature of these networks, and that realization underscores the critical need for resilient systems. This journey immerses us in the realm of distributed computing, challenging us as engineers to arm ourselves with strategies for fault tolerance and resilience. Techniques like retries, last-known-good-version policies, and client-server architectures with state management on both ends equip us to confront the unpredictability of network outages.&lt;/p&gt;

&lt;p&gt;As we navigate the intricacies of distributed systems, the adoption of these strategies becomes imperative to ensure smooth user experiences and system stability. Welcome to the world of Microservices in the Cloud, where challenges inspire innovation, and resilience forms the bedrock of our response to unreliable networks. 😉&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>kubernetes</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Serverless CI - Running Jenkins Slaves on AWS EKS Fargate</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 26 Sep 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/the-serverless-ci-running-jenkins-slaves-on-aws-eks-fargate-1g0p</link>
      <guid>https://dev.to/anadimisra/the-serverless-ci-running-jenkins-slaves-on-aws-eks-fargate-1g0p</guid>
      <description>&lt;p&gt;Jenkins requires no introduction, as it stands as the undisputed king of Continuous Integration. Over the years, it has adapted to all the technological disruptions in the industry, including Kubernetes. This blog post delves into an intriguing topic: how to execute on-demand slaves in a remote AWS Fargate cluster from a Jenkins master instance. For those wondering why such a capability is necessary, the following sections will elucidate not only the reasons but also the methods and associated advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything is remote!
&lt;/h2&gt;

&lt;p&gt;Imagine this: you’re running cloud-native services on AWS EKS, and as the diligent engineer that you are, you establish two distinct clusters—one for production and another for all non-production purposes. You might be wondering why you would undertake such an approach. Here’s a hint: consider the blast radius. If you prioritize the security of your cloud-native services as fervently as we do at &lt;a href="https://www.nimblework.com/" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;, this decision makes sense. The dev cluster runs, among other things, all our Continuous Integration and Delivery tools. Speaking of Continuous Delivery, we run nightly pipelines that test the services for performance, security vulnerabilities, and regression before deploying them to the production cluster. In the production cluster, we adhere to a blue-green deployment model. This requires the Jenkins master, which runs on the dev EKS cluster, to run slaves on the production EKS cluster for various deployment, management, and general housekeeping tasks. Having outlined the reasons for this setup, let’s delve into the details of how it is accomplished.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This post assumes you have a running AWS EKS cluster, either on Fargate or Worker Nodes. You can refer to this &lt;a href="https://www.anadimisra.com/post/building-an-eks-fargate-cluster-with-terraform" rel="noopener noreferrer"&gt;article&lt;/a&gt; for creating a Fargate cluster or &lt;a href="https://www.anadimisra.com/post/eks-karpenter" rel="noopener noreferrer"&gt;this one&lt;/a&gt; for Worker nodes if you don’t have them handy. The next step is to install Jenkins on Kubernetes; you can refer to &lt;a href="https://www.jenkins.io/doc/book/installing/kubernetes/" rel="noopener noreferrer"&gt;this&lt;/a&gt; page in the official documentation. Since we’re configuring Jenkins slaves to run on AWS Fargate, also install the &lt;a href="https://plugins.jenkins.io/kubernetes/" rel="noopener noreferrer"&gt;Kubernetes Plugin&lt;/a&gt; in Jenkins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless Jenkins Slaves
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuring Kubernetes Connection in Jenkins Master
&lt;/h3&gt;

&lt;p&gt;A Kubernetes cluster can be configured from the Manage Nodes and Clouds option on the Manage Jenkins page. Navigate to &lt;code&gt;Manage Jenkins &amp;gt; Clouds &amp;gt; New Cloud&lt;/code&gt; to open the Cloud configuration page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-cloud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-cloud.png" alt="Cloud configuration page in Jenkins LTS" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a name for the cloud, choose Kubernetes in the Type section, and click on “Create” to create the cloud configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fnew-cloud-jenkins.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fnew-cloud-jenkins.png" alt="Create a Kubernetes cloud in Jenkins" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expand the Kubernetes Cloud details dropdown; this is where we will configure Jenkins master access to the AWS Fargate cluster.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes URL
&lt;/h4&gt;

&lt;p&gt;Here we add the public API Server URL of the AWS Fargate Cluster. Log in to the AWS Management Console and select Elastic Kubernetes Service, click on the clusters link to list all clusters in your account and then click on the name of the cluster you want to connect Jenkins with to reach the overview page. Copy the API Server URL from the highlighted section in the image below and paste it to the Kubernetes URL field in the cloud configuration page in Jenkins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Feks-api-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Feks-api-server.png" alt="Cluster Info to get details of the API Server" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can get the same information via &lt;code&gt;kubectl&lt;/code&gt; on the command line. Point your kubeconfig at the EKS cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"KEY_ID_HERE"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ACCESS_KEY_HERE"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SESSION_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"SESSION_TOKEN_HERE"&lt;/span&gt; 

aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--name&lt;/span&gt; mycluster

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then run the &lt;code&gt;kubectl&lt;/code&gt; commands as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl cluster-info
Kubernetes control plane is running at https://XXXXXXXXXXXXX.gr7.us-east-1.eks.amazonaws.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll be using a Kubernetes Service Account to authenticate to the API Server. Perform the following steps on the AWS EKS cluster to enable Jenkins access.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a namespace &lt;code&gt;jenkins-jobs&lt;/code&gt; associated with the Fargate profile&lt;/li&gt;
&lt;li&gt;Create a service account named &lt;code&gt;jenkins-service-account&lt;/code&gt; in the namespace&lt;/li&gt;
&lt;li&gt;Create a secret token named &lt;code&gt;jenkins-token&lt;/code&gt; in the namespace associated with the service account&lt;/li&gt;
&lt;li&gt;Create a role-binding providing the service-account ClusterRole admin in this &lt;code&gt;jenkins-jobs&lt;/code&gt; namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can achieve this in multiple ways: via the Management Console, the AWS CLI, or infrastructure-as-code. At NimbleWork we like to stick to IaC and manage these objects in Terraform.&lt;/p&gt;
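&lt;p&gt;For reference, here is a minimal sketch of the same four objects as plain Kubernetes manifests (the RoleBinding name is illustrative; everything else uses the names from the bullets above), which you can apply with &lt;code&gt;kubectl apply -f&lt;/code&gt;:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: jenkins-jobs
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jenkins-service-account
  namespace: jenkins-jobs
---
# Token secret bound to the service account
apiVersion: v1
kind: Secret
metadata:
  name: jenkins-token
  namespace: jenkins-jobs
  annotations:
    kubernetes.io/service-account.name: jenkins-service-account
type: kubernetes.io/service-account-token
---
# Grant the ClusterRole "admin" within the jenkins-jobs namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-admin-binding   # illustrative name
  namespace: jenkins-jobs
subjects:
  - kind: ServiceAccount
    name: jenkins-service-account
    namespace: jenkins-jobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
```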

&lt;p&gt;Let’s retrieve the certificate key now; run the following command to get the service account certificate and token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;% kubectl get secret jenkins-token &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jenkins-jobs &lt;span class="nt"&gt;-o&lt;/span&gt; yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output contains the &lt;code&gt;ca.crt&lt;/code&gt; and &lt;code&gt;token&lt;/code&gt; values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ca.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXX&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXXXXXX&lt;/span&gt;
  &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXX&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/service-account.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/service-account.uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXXXXXX&lt;/span&gt;
  &lt;span class="na"&gt;creationTimestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023-05-20T17:25:16Z"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins-token&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins-jobs&lt;/span&gt;
  &lt;span class="na"&gt;resourceVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3388"&lt;/span&gt;
  &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXXXXXX&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/service-account-token&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ca.crt&lt;/code&gt; value is &lt;code&gt;base64&lt;/code&gt;-encoded; decode it via the &lt;code&gt;base64 -d&lt;/code&gt; command and paste the resulting value into the Kubernetes server certificate key field.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes Namespace
&lt;/h4&gt;

&lt;p&gt;Enter the value &lt;code&gt;jenkins-jobs&lt;/code&gt; here&lt;/p&gt;

&lt;h4&gt;
  
  
  Credentials
&lt;/h4&gt;

&lt;p&gt;Click on &lt;code&gt;Add &amp;gt; Jenkins&lt;/code&gt; and choose Secret text in the Kind dropdown of the Credentials provider pop-up, then add the base64-decoded value of the token from the output of the &lt;code&gt;kubectl get secret&lt;/code&gt; command above to create the credentials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-add-creds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-add-creds.png" alt="Adding Service Account token to Jenkins" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Test Connection button; when the connection succeeds, you’ll see the message&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Connected to Kubernetes v1.28-eks-XXXXXXX&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod Template and Retention
&lt;/h4&gt;

&lt;p&gt;Add the following values to Pod-Template and Retention settings&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fpod-settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fpod-settings.png" alt="Pod Settings" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click Save to finish adding the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the Kubernetes Cloud in Pipelines
&lt;/h2&gt;

&lt;p&gt;Now that we have the Jenkins configuration in place, let’s look at defining builds that run on Fargate Pods as agents. We’re using the declarative pipeline syntax here; the pipeline job DSL should reference the configured cloud name as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;  &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hackernoonkube&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;yamlFile&lt;/span&gt; &lt;span class="s1"&gt;'builder.yaml'&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the labels defined in the Pod Template section above to run the job on a Jenkins agent running as a Fargate Pod.&lt;/p&gt;
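&lt;p&gt;For completeness, here’s a hypothetical sketch of what the &lt;code&gt;builder.yaml&lt;/code&gt; pod template referenced above might contain; the container image and label values are illustrative and should match the labels configured in the Pod Template section:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    # Must match the labels configured in the Jenkins Pod Template
    jenkins-agent: fargate
spec:
  containers:
    - name: maven
      image: maven:3.9-eclipse-temurin-17  # illustrative build image
      command: ["sleep"]
      args: ["infinity"]
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
```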

</description>
      <category>jenkins</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>DevOps or Lean?</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Thu, 21 Sep 2023 16:11:29 +0000</pubDate>
      <link>https://dev.to/anadimisra/devops-or-lean-2e50</link>
      <guid>https://dev.to/anadimisra/devops-or-lean-2e50</guid>
      <description>&lt;p&gt;DevOps has been instrumental in transforming IT responsiveness to business, as is evident by the fact that we have enterprises from all walks of life including Banks adopting DevOps. It’s certainly not a thing that hip web companies do now. However, with this rush to “do” DevOps comes the noise associated with it and hence, the confusion.&lt;/p&gt;

&lt;p&gt;I’ve been asked this quite a bit: "We’re practicing Lean Kanban already, how would DevOps help us?" Or "Do we need to do DevOps when we’re practising Lean?" This blog attempts two things: showing how DevOps borrows Lean principles, and showing that it’s not an either/or equation between these practices, as they complement each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps
&lt;/h2&gt;

&lt;p&gt;While there are many definitions of DevOps the one I like to stick to is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A cultural and professional movement that stresses communication, collaboration and integration between software developers and IT operations professionals while automating the process of software delivery and infrastructure changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Initially there wasn’t necessarily anything more to DevOps than an observe-and-solve sort of cycle; over the years we’ve arrived at a more formalised way of working. I’d still say most of it is common sense, but creating a structure or nomenclature does help people identify with a movement. DevOps therefore has certain principles that make up most of what we know as DevOps today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lean
&lt;/h2&gt;

&lt;p&gt;Most of us would know Lean well enough; the classic definition in the IT industry context is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lean IT applies the key ideas behind lean production to the development and management of IT products and services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And to know more about Lean Production you can start from here. Let’s look at the lean principles before I elaborate on this blog’s theme.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define value precisely from the perspective of the end customer&lt;/li&gt;
&lt;li&gt;Identify the entire value stream for each service, product or product family and eliminate waste&lt;/li&gt;
&lt;li&gt;Make the remaining value-creating steps flow&lt;/li&gt;
&lt;li&gt;As flow is introduced, let the customer pull what the customer wants when the customer wants it&lt;/li&gt;
&lt;li&gt;Pursue perfection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Three Ways of DevOps
&lt;/h2&gt;

&lt;p&gt;AFAIK the Three Ways of DevOps first appeared in a &lt;a href="https://itrevolution.com/the-three-ways-principles-underpinning-devops/" rel="noopener noreferrer"&gt;blog-post&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/realgenekim/" rel="noopener noreferrer"&gt;Gene Kim&lt;/a&gt;, and there’s been quite some commentary (including criticism) on them. I won’t delve into just what the three ways are, but into how they appear, to my eyes, derived from Lean.&lt;/p&gt;

&lt;h3&gt;
  
  
  The First Way
&lt;/h3&gt;

&lt;p&gt;The first way looks at maximising the flow of work from the “left” (business idea) to the “right” (finished product), represented by the rather deceptively facile diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-firstway.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-firstway.jpg" alt="First Way of DevOps, adapted from https://itrevolution.com blog post" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gene Kim uses the term Systems Thinking for the first way, which is essentially the practice of looking at a system holistically and studying how its parts are connected or dependent so as to improve overall system performance. He goes on to explain how we should maximise the flow of value through the system as a whole, using the Deming Cycle to ensure we deliver value. Another (I’d say better) interpretation is Flow: the flow of value is to be maximised from business to customer by studying each step of the flow, analysing bottlenecks, and solving each of the parts in the context of improving the whole.&lt;/p&gt;

&lt;p&gt;If you stop here and look back at the aforementioned Lean principles, you’ll notice where this might be stemming from: it essentially mandates principles 1 to 4 in this form. This makes sense too; I can passionately argue that you cannot really define the flow of a system as a whole without a meticulous study of each of its parts in a value-stream, so essentially you’d be applying Lean when you’re working at this first way of flow. If you were to implement the first way you would invariably deploy these (in addition to other practices):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuous Flow&lt;/li&gt;
&lt;li&gt;Kanban&lt;/li&gt;
&lt;li&gt;Value Stream Mapping&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s for this reason I believe the first way is an application of Lean to the entire delivery pipeline, unlike Scrum (or similar Agile practices), which unintentionally focus largely on development practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Way
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-secondway.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-secondway.jpg" alt="Second Way of DevOps, adapted from https://itrevolution.com blog post" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second way is about creating amplified feedback loops between each step of the flow, or parts of the system. Beyond providing critical operative information to teams, this is essentially a step towards eliminating overburden and inconsistencies, thereby reducing waste in the system. Let’s see how: when you are able to get feedback from each stage of your delivery life cycle, you can build actionable information about the bottlenecks and act on them. With amplified feedback, therefore, you’d move toward what’s rightly called Tribal Knowledge by &lt;a href="https://itrevolution.com/author/timhunter/" rel="noopener noreferrer"&gt;Tim Hunter&lt;/a&gt;. There isn’t a direct correlation here, but if you look at Lean tools there’s a natural tendency to gain deeper insights at each step, as that would help improve the flow at each step. Having said that, to claim this is necessarily an application of Lean would be a desperate attempt to retrofit everything to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Third Way
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-thirdway.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-thirdway.jpg" alt="Second Way of DevOps, adapted from https://itrevolution.com blog post" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third way is about fostering continual experimentation and learning, and creating repeatable operations, so as to help increase throughput. This again seems to stem from the pursuit of perfection and the removal of inconsistencies and overburdening that Lean introduced to the software world. While the practices can be manifold, I would again say that the intention is to bring this higher state of knowledge to the system as a whole rather than just its individual parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So I’d like to put this blog’s opening questions to rest by saying there is no DevOps-or-Lean equation; DevOps (like Agile) is Lean. In other words, do not look at DevOps as a framework or isolated practice. DevOps does not substitute your existing knowledge in any way; it complements the knowledge and capabilities of the entire organisation to help them achieve higher performance. As I’ve shown in this article, it plays well with Lean.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Building Observability for Microservices</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Thu, 21 Sep 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/building-observability-for-microservices-338k</link>
      <guid>https://dev.to/anadimisra/building-observability-for-microservices-338k</guid>
      <description>&lt;p&gt;Let’s say you distribute the work that a single highly experienced person is doing to multiple individuals, each performing a specific task. Distributing the work this way may increase throughput and eliminate the single point of failure. However, now you have to monitor many people instead of just one! Observability in microservices addresses a similar issue: how to monitor and proactively address problems in a distributed system? The solution lies in measuring the state of a system through metrics recorded for each of its services. A more software-specific definition is (ref. &lt;a href="https://en.wikipedia.org/wiki/Observability" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article explains setting up an observability stack for Microservices to measure the health and performance of the system, sharing our experiences from operating observability for Microservices at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools
&lt;/h2&gt;

&lt;p&gt;There are a lot of tools, paid and open source, in the observability space. We at NimbleWork, though, prefer the following stack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; - an open source monitoring and alerting toolkit built at &lt;a href="https://soundcloud.com/" rel="noopener noreferrer"&gt;SoundCloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; - a visualisation and analytics tool for data from multiple sources, including Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt; - a tool to manage Kubernetes applications&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/reference/kubectl/" rel="noopener noreferrer"&gt;kubectl&lt;/a&gt; - a command line tool for working on Kubernetes clusters&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;terraform&lt;/a&gt; - a tool to automate provisioning of infrastructure on clouds such as AWS.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’re running this on Kubernetes, specifically on AWS EKS which is also the choice for running microservices at NimbleWork. Let’s dive into getting things working! The post assumes you have a working EKS cluster, you can choose AWS Fargate as described in &lt;a href="https://www.anadimisra.com/post/building-an-eks-fargate-cluster-with-terraform" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog post, or Worker Node cluster as described &lt;a href="https://www.anadimisra.com/post/eks-karpenter" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus on EKS
&lt;/h2&gt;

&lt;p&gt;Let’s look at deploying prometheus first as it serves as a data-source for Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparing Volumes on EKS Node Group.
&lt;/h3&gt;

&lt;p&gt;We will be using persistent storage for Prometheus data, as it’s a bad idea to keep crucial observability data in ephemeral storage, so we declare the volumes first. Since we’re using EKS Node Groups, we have to configure Persistent Volumes for the Prometheus node. EFS is a risky choice here, as Prometheus does not support NFS well enough: while you can certainly get it running on EFS, it remains unstable, with sudden restarts related to errors in the logs about NFS writes. &lt;a href="https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects" rel="noopener noreferrer"&gt;This link&lt;/a&gt; in the Prometheus documentation strongly recommends against it.&lt;/p&gt;

&lt;p&gt;Let’s configure our EKS cluster’s storage using Terraform; the configuration typically goes into a &lt;code&gt;storage.tf&lt;/code&gt; file in the Terraform module you’ll write for managing EKS. We’re following the same structure as in the &lt;a href="https://www.anadimisra.com/post/eks-karpenter" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; on setting up EKS via Terraform, which you can use for reference. You can of course do it manually, but there’s a reason why IaC exists, and I like to respect it!&lt;/p&gt;

&lt;p&gt;Running the Terraform script gives, amongst other things, the values of the file system and its access point ID in the output, which we’ll then use in our helm charts.&lt;/p&gt;
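&lt;p&gt;The volume manifests below reference a storage class named &lt;code&gt;eks-efs-storage-class&lt;/code&gt;; if you create it outside Terraform, a minimal sketch (assuming the AWS EFS CSI driver is installed on the cluster) looks like:&lt;/p&gt;

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: eks-efs-storage-class
# Requires the AWS EFS CSI driver add-on on the cluster
provisioner: efs.csi.aws.com
```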

&lt;h3&gt;
  
  
  Deploying Prometheus via Helm
&lt;/h3&gt;

&lt;p&gt;We prefer the community helm chart for Prometheus, available on &lt;a href="https://github.com/prometheus-community/helm-charts" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. First, add the chart repository using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let’s look at how we can configure it to work with EKS Node Groups. Create a &lt;code&gt;volumes.yaml&lt;/code&gt; file with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-volume-0
  labels:
    type: prometheus-volume
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    name: prometheus-volume-claim
    namespace: observability
  storageClassName: eks-efs-storage-class
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-XXXXXXXXXXXX::fsap-XXXXXXXXXXXX
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-volume-claim
  namespace: observability
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: eks-efs-storage-class
  volumeName: prometheus-volume-0
  resources:
    requests:
      storage: 10Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those who’ve been working with EC2 instances or EKS Node Groups will immediately recognize what we’re doing here: the values for the file system and access point IDs come from the Terraform output described above and go into &lt;code&gt;volumeHandle: fs-XXXXXXXXXXXX::fsap-XXXXXXXXXXXX&lt;/code&gt;. With the storage configuration in place, let’s configure Prometheus for use in EKS. We’re doing a default deployment with only a few changes, as a detailed customisation of the &lt;code&gt;values.yaml&lt;/code&gt; is beyond the scope of this blog. Refer to the &lt;code&gt;prometheus.yaml&lt;/code&gt; file in this &lt;a href="https://gist.github.com/anadimisra/e7cd377bca32deaf6ecb906fd06d63bf" rel="noopener noreferrer"&gt;gist&lt;/a&gt; for deploying Prometheus to EKS. You’ll notice the job names&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubernetes-apiservers&lt;/li&gt;
&lt;li&gt;kubernetes-nodes&lt;/li&gt;
&lt;li&gt;kubernetes-nodes-cadvisor&lt;/li&gt;
&lt;li&gt;kubernetes-service-endpoints&lt;/li&gt;
&lt;li&gt;kubernetes-service-endpoints-slow&lt;/li&gt;
&lt;li&gt;kubernetes-services&lt;/li&gt;
&lt;li&gt;kubernetes-pods&lt;/li&gt;
&lt;li&gt;kubernetes-pods-slow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are Prometheus jobs defined to collect metrics from EKS itself, to monitor the health of the Kubernetes cluster in addition to the microservices. Save the aforementioned file in the gist to a &lt;code&gt;values.yaml&lt;/code&gt; file and install the helm chart using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install -i prometheus prometheus-community/prometheus -f values.yaml -n observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s look at enabling metrics collection for a service written in Spring Boot. Following the configuration reference &lt;a href="https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;, expose the Prometheus endpoint through Spring Boot Actuator in the application or bootstrap YAML (this assumes the &lt;code&gt;micrometer-registry-prometheus&lt;/code&gt; dependency is on the classpath), and add a matching scrape job to the Prometheus configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# application.yaml: expose the Prometheus actuator endpoint
management:
  endpoints:
    web:
      exposure:
        include: "health,prometheus"
---
# prometheus.yml: scrape job for the service
scrape_configs:
  - job_name: "spring"
    metrics_path: "/actuator/prometheus"
    static_configs:
      - targets: ["0.0.0.0:${server.port}"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you deploy this service, you’ll notice the metrics exposed by Spring Boot Actuator are available in Prometheus. You can access Prometheus via the URL exposed through the ALB ingress.&lt;/p&gt;
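&lt;p&gt;As an illustration of the ALB-based access mentioned above, a minimal ingress for the Prometheus server could look like the following sketch; it assumes the AWS Load Balancer Controller is installed, and the ingress name and scheme are illustrative:&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress   # illustrative name
  namespace: observability
  annotations:
    # Handled by the AWS Load Balancer Controller
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-server  # default service name of the community chart
                port:
                  number: 80
```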

&lt;h2&gt;
  
  
  Grafana on EKS
&lt;/h2&gt;

&lt;p&gt;Grafana will be deployed for visualising data collected by prometheus. Let’s see how to get it done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying EFS volumes for Grafana
&lt;/h3&gt;

&lt;p&gt;Assuming we have the EFS block configured for our EKS cluster (ref. this blog post), create a file &lt;code&gt;grafana-volumes.yaml&lt;/code&gt; defining the volumes for Grafana as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-volume-0
  labels:
    type: grafana-storage-volume
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    name: grafana-volume-0
    namespace: observability
  storageClassName: eks-efs-storage-class
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-XXXXXXXXXXXXXXX::fsap-XXXXXXXXXXXXXXX
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-volume-0
  namespace: observability
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: eks-efs-storage-class
  volumeName: grafana-volume-0
  resources:
    requests:
      storage: 10Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following command to create the volumes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f grafana-volumes.yaml -n observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying Grafana via HELM chart
&lt;/h3&gt;

&lt;p&gt;Add the Grafana helm chart repository with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add grafana https://grafana.github.io/helm-charts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;grafana.yaml&lt;/code&gt; file in this &lt;a href="https://gist.github.com/anadimisra/e7cd377bca32deaf6ecb906fd06d63bf" rel="noopener noreferrer"&gt;gist&lt;/a&gt; describes a basic Grafana configuration for working with Prometheus as a data-source. We’re using mostly standard config here, adding prometheus as a data-source via the block&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      version: 1
      url: http://prometheus-server:80
      access: proxy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save it as &lt;code&gt;values.yaml&lt;/code&gt; and deploy the grafana helm chart by running the following command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install -i grafana grafana/grafana -f values.yaml -n observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here again, you can access Grafana via the URL exposed through the ALB ingress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana Dashboards
&lt;/h2&gt;

&lt;p&gt;Now that we have the infrastructure in place, let’s see how we can use it for observability. I’ll limit the examples to Spring Boot for simplicity. Our microservices expose REST APIs and use MongoDB for data persistence. Here’s what we recommend monitoring for such services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spring Boot Microservices
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fgrafana.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fgrafana.jpg" alt="JVM Dashboard" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the panels we’ve added to the Grafana dashboard for monitoring Spring Boot microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Facts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uptime Monitor:&lt;/strong&gt; total uptime of pods, can be filtered over namespace, pod names and containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current Memory Allocation:&lt;/strong&gt; current total memory allocated to pods, can be filtered over namespace, pod names and containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current Memory Usage:&lt;/strong&gt; average memory consumption by app containers in each pod&lt;/li&gt;
&lt;li&gt;CPU Load&lt;/li&gt;
&lt;li&gt;CPU Usage&lt;/li&gt;
&lt;li&gt;FATAL, ERROR and WARN logs count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Execution Time:&lt;/strong&gt; Total time taken per method per repository, a high value indicates a performance issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Execution Count:&lt;/strong&gt; Number of methods executed against each of the repositories, gives an idea of traffic on each collection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Avg Execution Time:&lt;/strong&gt; Average time taken per method per repository; a high value indicates a performance issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Execution Max Time:&lt;/strong&gt; Max recorded time of each MongoDB operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time in Seconds of Operations Per Collection:&lt;/strong&gt; The maximum time in seconds it took to execute a particular command on a MongoDB collection. This is the worst-case figure; you can use it to identify the slowest operations per collection in your app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of Commands Per Collection Per Second:&lt;/strong&gt; The number of commands executed on each of the collections per second; helps you optimise collection sharding and indexing to suit the read/write operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations Average Time Per Collection:&lt;/strong&gt; The average time of various operations running in a collection, this is a quick view to spot slowest transactions for an app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mongo Operation Max Time:&lt;/strong&gt; max time of operation on each of the collections, this gives a sense of what might be skewing the quick view averages in previous panel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Dashboard config JSON with the aforementioned panels and additional ones for measuring Java Heap, Thread and GC Details can be imported from the &lt;code&gt;springboot.json&lt;/code&gt; file in this &lt;a href="https://gist.github.com/anadimisra/e7cd377bca32deaf6ecb906fd06d63bf" rel="noopener noreferrer"&gt;gist&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summing it up
&lt;/h2&gt;

&lt;p&gt;Setting up observability for microservices isn’t quite intuitive, but neither is it an uphill task given the tools at our disposal. Having said that, this is only the starting step of your observability journey. Once you have the infrastructure set up as demonstrated here, you still have to meticulously design the graphs you create in Grafana, which in turn depends on the kind of services you are running and the parts of your system you want to track. I hope fellow SRE and DevOps engineers find the steps in this post helpful in setting up their own observability stack.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>grafana</category>
      <category>prometheus</category>
      <category>devops</category>
    </item>
    <item>
      <title>Autoscaling EKS Node Groups with Karpenter</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 19 Sep 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/autoscaling-eks-node-groups-with-karpenter-3oc7</link>
      <guid>https://dev.to/anadimisra/autoscaling-eks-node-groups-with-karpenter-3oc7</guid>
      <description>&lt;p&gt;&lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt; is an open-source tool for automating node provisioning in Kubernetes. Karpenter aims to enhance both the effectiveness and affordability of managing workloads within a Kubernetes cluster. The core mechanics of Karpenter involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring unschedulable pods identified by the Kubernetes scheduler.&lt;/li&gt;
&lt;li&gt;Scrutinizing the scheduling constraints, including resource requests, node selectors, affinities, tolerations, and topology spread constraints, as stipulated by the pods.&lt;/li&gt;
&lt;li&gt;Provisioning nodes that precisely align with the pods' requirements.&lt;/li&gt;
&lt;li&gt;Streamlining cluster resource usage by removing nodes once their services are no longer required.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why We Moved to Karpenter
&lt;/h2&gt;

&lt;p&gt;We at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt; used AWS Fargate in the past for running on-demand, short-lived or one-off workloads, one example being Jenkins slaves running in AWS Fargate while the Master runs on worker nodes. Fargate is good in the sense that it takes care of managing node infrastructure, but it can cost a premium if you're using it for long-running workloads. For this reason, an EKS deployment with worker nodes is the preferred path. But with that comes a new problem: unlike Fargate, we not only have to manage creating nodes and node groups, we also have to ensure that utilisation of our EC2 nodes is optimal. It comes back to hurt us especially when we realise there's an entire VM running at just 10% of its CPU/memory capacity because it has two active pods, which we could have moved to another node so this one could be reclaimed. In the past we relied on a cocktail of Prometheus alerts and Fluent Bit monitoring data to conclude that we could reschedule pods and clean up unused nodes. But any self-respecting Engineering Manager would tell you they'd jump to a better alternative as soon as they found one. For us, Karpenter was that alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Karpenter allows you to define Provisioners, which are the heart of its cluster-management capability. When initially installing Karpenter, you establish a default Provisioner, which imposes specific constraints on the nodes Karpenter creates and on the pods eligible to run on those nodes. These constraints include defining taints to restrict pod deployment on Karpenter-created nodes, establishing startup taints to indicate temporary node tainting, narrowing node creation to preferred zones, instance types and CPU architectures, and configuring default settings for node expiration. The Provisioner, in essence, gives you fine-grained control over resource allocation within your Kubernetes cluster. You can read more on Provisioners &lt;a href="https://karpenter.sh/preview/concepts/provisioners/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying EKS Cluster
&lt;/h2&gt;

&lt;p&gt;Here's how to deploy the EKS cluster with Karpenter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the VPC
&lt;/h3&gt;

&lt;p&gt;Before we begin, let's deploy the AWS VPC to run our EKS cluster. We'll be using Terraform for provisioning on the AWS Cloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc" {
  source               = "terraform-aws-modules/vpc/aws"
  version              = "3.19.0"
  name                 = "mycluster-vpc"
  cidr                 = var.vpc_cidr
  azs                  = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets      = var.private_subnets_cidr
  public_subnets       = var.public_subnets_cidr
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/mycluster" = "shared"
    "kubernetes.io/role/elb"          = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/mycluster" = "shared"
    "kubernetes.io/role/internal-elb" = "1"
    "karpenter.sh/discovery"          = "mycluster"
  }

  tags = {
    "kubernetes.io/cluster/mycluster" = "shared"
  }

}

module "vpc-security-group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "4.17.1"
  create  = true

  name        = "mycluster-security-group"
  description = "Security group for VPC"
  vpc_id      = module.vpc.vpc_id

  ingress_with_cidr_blocks = var.ingress_rules
  ingress_with_self = [
    {
      from_port   = 0
      to_port     = 0
      protocol    = -1
      description = "Ingress with Self"
    }
  ]
  egress_with_cidr_blocks = [{
    cidr_blocks = "0.0.0.0/0"
    from_port   = 0
    to_port     = 0
    protocol    = -1
  }]
  tags = {
    Name                      = "mycluster-security-group"
    "karpenter.sh/discovery"  = "mycluster"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're using the community-contributed modules here to spin up a VPC with public and private subnets, and ingress rules. For those interested in more detail, here's a simple example of what could go in the ingress rules&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "ingress_rules" {
  type        = list(map(string))
  description = "VPC Default Security Group Ingress Rules"
  default = [
    {
      cidr_blocks = "0.0.0.0/0"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      description = "Karpenter ingress allow"
    },
    { # other CIDR blocks you might want to restrict access to (for example, if this were your dev cluster)
      cidr_blocks = "XX.XX.XX.XXX/XX"
      from_port   = 0
      to_port     = 0
      protocol    = -1
      description = "MyCluster-NAT"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;"karpenter.sh/discovery"  = "mycluster"&lt;/code&gt; tag in the vpc module and in the security group tags is what lets Karpenter discover the subnets and security groups to use when provisioning nodes for this cluster. You can get the VPC up and running via the&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform plan
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;commands. It's good practice to define the key values you'll need in other modules as outputs of this module run; also, we save the state in an S3 bucket as our Terraform builds run from a Jenkins slave on Fargate with ephemeral storage. You'd see the following values in the console output of the &lt;code&gt;terraform apply&lt;/code&gt; command if you've included publishing the VPC and security group IDs in the &lt;code&gt;outputs.tf&lt;/code&gt; of your vpc module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security_group_id &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sg-dkfjksdhf83983c883"&lt;/span&gt;
vpc_id &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-2l4jc2lj4l2cbj42"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
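
&lt;p&gt;For reference, these values come from an &lt;code&gt;outputs.tf&lt;/code&gt; along these lines (a minimal sketch, assuming the module names used above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "vpc_id" {
  description = "ID of the VPC"
  value       = module.vpc.vpc_id
}

output "security_group_id" {
  description = "ID of the VPC security group"
  value       = module.vpc-security-group.security_group_id
}

output "private_subnets" {
  description = "Private subnet IDs, consumed later by the node group module"
  value       = module.vpc.private_subnets
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;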



&lt;p&gt;With this our VPC is ready; let's deploy the EKS cluster with node groups and Karpenter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying EKS Cluster with Node Group Workers and Karpenter
&lt;/h3&gt;

&lt;p&gt;Add the following code to your Terraform module to include EKS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "eks-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "19.12.0"
  cluster_name    = "mycluster"
  cluster_version = "1.26"
  subnet_ids      = ["subnet-XX", "subnet-YY", "subnet-ZZ"]
  create_cloudwatch_log_group = false
  tags = {
    Name                      = "mycluster"
    "karpenter.sh/discovery"  = "mycluster"
  }

  vpc_id = "vpc-2l4jc2lj4l2cbj42"

  cluster_endpoint_public_access_cidrs = ["XX.XX.XX.XXX/YY"] #important if the cluster_endpoint_public_access is set to true
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_security_group_id            = "sg-dkfjksdhf83983c883"
}

module "mycluster-workernodes" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "19.12.0"

  name            = "${var.eks_cluster_name}-services"
  cluster_name    = module.eks-cluster.cluster_name
  cluster_version = module.eks-cluster.cluster_version
  create_iam_role = false
  iam_role_arn    = aws_iam_role.nodegroup_role.arn

  subnet_ids = flatten([data.terraform_remote_state.db.outputs.private_subnets])

  cluster_primary_security_group_id = "sg-dkfjksdhf83983c883"
  vpc_security_group_ids            = [module.eks-cluster.cluster_security_group_id]

  min_size     = 1
  max_size     = 5
  desired_size = 2

  instance_types     = ["t3.large"]
  capacity_type      = "ON_DEMAND"
  labels = {
    NodeGroups = "mycluster-workernodes"
  }

  tags = {
    Name                      = "mycluster-workernodes"
    "karpenter.sh/discovery"  = module.eks-cluster.cluster_name
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's the same &lt;code&gt;"karpenter.sh/discovery"&lt;/code&gt; tag at play here too, and that's it! You have an EKS cluster with Karpenter-managed provisioning ready!&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Karpenter Provisioners
&lt;/h2&gt;

&lt;p&gt;Now that we have a cluster ready, let's look at using Karpenter to manage the pods. We'll define provisioners for different purposes and then associate pods with each of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provisioner for Nodes running Spot Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a good alternative to Fargate, especially for running one-off workloads that do not live beyond job completion. Here's an example of a Karpenter provisioner using spot instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# spot default&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provisioner&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spot"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-category"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-cpu"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;providerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;consolidation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWSNodeTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;subnetSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
  &lt;span class="na"&gt;securityGroupSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use this provisioner, add the following to the &lt;code&gt;nodeSelector&lt;/code&gt; in your Kubernetes deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;karpenter.sh/provisioner-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will provision the pods to run on spot instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provisioner for Nodes running On-Demand Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's a sample of how to use on-demand instances for worker nodes and schedule pods on them. The following file defines a provisioner for on-demand instances&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# on-demand&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provisioner&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# taints:&lt;/span&gt;
  &lt;span class="c1"&gt;#   - key: "name"&lt;/span&gt;
  &lt;span class="c1"&gt;#     value: "on-demand"&lt;/span&gt;
  &lt;span class="c1"&gt;#     effect: "NoSchedule"&lt;/span&gt;
  &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on-demand"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-category"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-cpu"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topology.kubernetes.io/zone"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NotIn&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1b"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;providerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
  &lt;span class="c1"&gt;# consolidation:&lt;/span&gt;
  &lt;span class="c1"&gt;#   enabled: true&lt;/span&gt;
  &lt;span class="na"&gt;ttlSecondsAfterEmpty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWSNodeTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;subnetSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
  &lt;span class="na"&gt;securityGroupSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again we can use the &lt;code&gt;nodeSelector&lt;/code&gt; in the Kubernetes deployment YAML to schedule pods on these nodes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;karpenter.sh/provisioner-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
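
&lt;p&gt;Should you enable the commented-out taints in the provisioner above, the deployment would also need a matching toleration alongside the &lt;code&gt;nodeSelector&lt;/code&gt;; a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeSelector:
  karpenter.sh/provisioner-name: on-demand
tolerations:
  - key: "name"
    operator: "Equal"
    value: "on-demand"
    effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;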



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is a simplified example of how to get started with Karpenter on AWS EKS. Production-grade deployments require more nuanced provisioner definitions, including but not limited to resource limits and eviction policies. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attribution&lt;/strong&gt;&lt;br&gt;
Image Credits: Photo by &lt;a href="https://unsplash.com/@growtika?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Growtika&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/qPkdgA-KDik?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>karpenter</category>
      <category>aws</category>
    </item>
    <item>
      <title>Application Integration Pattern : the Choreography way</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Thu, 07 Sep 2023 17:04:14 +0000</pubDate>
      <link>https://dev.to/anadimisra/application-integration-pattern-the-choreography-way-18j3</link>
      <guid>https://dev.to/anadimisra/application-integration-pattern-the-choreography-way-18j3</guid>
      <description>&lt;p&gt;Integration can be quite complex, especially when you are talking about making your swanky new application work with a legacy system. I've seen multiple generations of patterns to tackle this problem gain prominence over the past decade and a half. Starting with SOAP web-services to Message Bus to Enterprise Service Bus, REST API and even WebSockets! Each of them has been the blue-eyed boy, or heartthrob of their generation and seen their popularity vain as the new kid on the block got more famous. Having said that, and allowing myself to feel relatively old, whatever the approach, the bigger challenge in integrating legacy services has been that they might eventually get replaced with other services soon, or soon enough for you to get into yet another exercise of re-integrating with the new system all over again with a sense of repetitive labour induced fatigue. Shameless plug, We at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt; were facing this very strategic question when our SaaS journey started a couple of years ago; and this blog is about our learnings and experience in using Choreography as the cornerstone of our integration approach just like it was for &lt;a href="https://www.anadimisra.com/post/microservices-choreography" rel="noopener noreferrer"&gt;implementing Strangler Fig&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choreography
&lt;/h2&gt;

&lt;p&gt;There are many definitions of Choreography if you happen to browse popular articles or blogs over the web. The way I like to put it is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Choreography represents the communication between microservices through the publishing and subscribing of domain events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It essentially means that each service publishes a Domain Event notifying of a change that occurred as the result of an action on that service. Services to whom this event is of interest subscribe to it and act accordingly. It's an extremely efficient, lightweight, distributed chain of command. Choreography has emerged as the mechanism of choice for implementing the Saga pattern in microservices, and I've seen it increasingly replace orchestrators. I'd like to point out, however, that it's not a universally applicable solution or design in the microservices world; there's some criticism of this approach going around too. I haven't burnt my hands using it yet, so I'd like to think the criticism stems more from design issues elsewhere which might have led to bottlenecks when using Choreography.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;Here's our approach: both the legacy system and the new systems emit Domain Events, which are published to Kafka. Each Kafka topic is backed by a MongoDB collection using the Kafka MongoDB connector, for fault tolerance and high availability, and each of these collections is replicated over a MongoDB replica set. In case of failed processing of a domain event, the failing side emits an event for compensating transactions instead of rolling back. Any outage on either side is covered by two provisions&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kafka messages are persisted for durations longer than the outage of either service&lt;/li&gt;
&lt;li&gt;On coming back up, each service resumes from the consumer offset at which it went offline&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There isn't much we have to do as developers for either of these provisions, but knowing they exist is more than just handy.&lt;/p&gt;
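
&lt;p&gt;Backing a topic with a collection is connector configuration rather than application code. Here's a hedged sketch of a MongoDB Kafka sink connector definition; the topic, database and connection details are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "domain-events-sink",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "domain-events",
    "connection.uri": "mongodb://mongo-0:27017,mongo-1:27017/?replicaSet=rs0",
    "database": "events",
    "collection": "domain_events"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;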

&lt;h2&gt;
  
  
  Handling Failures
&lt;/h2&gt;

&lt;p&gt;Handling failures, as in most modern architectures, is a layered affair here. So let's look at failures from the perspective of where we choose to tackle them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outages&lt;/strong&gt;: are best handled at the infrastructure layer. If you're deploying both the legacy and the new services in Kubernetes (which makes sense, given you'll eventually phase out the legacy system with newer services), you can leave it to Kubernetes liveness and readiness probes, coupled with deploying services as a stateful or replica set, to get over outages. If the legacy system is too old to run as a stateless service, you're still not out of luck: you can deploy the traditional sticky-session cluster fronted by a load balancer using internal ingresses in Kubernetes too, and there the master-slave cluster gives you some degree of protection against outages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dirty Reads&lt;/strong&gt;: if Choreography has scared the daylights out of any of your colleagues, it's most likely for fear of this. A comprehensive explanation of how to mitigate it by design is a whole blog post in itself, so I'll keep it short. Dirty reads are more likely in bi-directional integration, by which I mean both services read from and write to each other. If you find yourself doing that, just stop; you've probably created too many routes for data change. It's better to design so that all operations on data owned by the new services happen in the new service only, with the associated events flowing to the legacy app, and vice versa. Even if you have a duplex of Domain Events, don't allow the legacy system to update its copy of new-system data directly, nor the new system to edit its copy of legacy-system data directly. Keep a clear separation of traffic, is what I'm saying essentially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotence&lt;/strong&gt;: this is a tricky one. While idempotent API operations can be guaranteed within the context of each of the integrated systems, the consumption of Domain Events also has to preserve idempotence on both sides of the integration. A simple rule of thumb is to follow what the APIs do: a CREATE call is the easiest to handle, since neither it nor the Create Domain Event can be idempotent anyway. What about DELETE and PUT? That's where idempotence has to be preserved (assuming the legacy system respected this principle in the first place, or else God bless you!). We've noticed that using upsert operations for processing a PUT Domain Event helps a lot. DELETE idempotence is tricky: should we keep returning 200 OK when deleting the same ID repeatedly, or throw an error the second time around? We chose the first option in our REST API design, so we followed the same norm when processing a DELETE Domain Event, letting it fail silently when deleting an already-deleted entity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choreography is a strong pattern in microservices and is here to stay for good. This blog post, however, was our attempt to show that we can improvise, using well-established patterns for much wider goals than they were envisioned for. It requires a bit of imagination, lots and lots of reading, hours of experimentation and, not to forget, the temperament to handle failures. I hope this blog gives fellow engineers one more option to consider during the transition phase of a microservices journey, where new services have to integrate and play well with legacy systems built on completely different designs and patterns altogether.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>microservices</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Strangling Monolith to Microservices with Choreography and CQRS</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Mon, 17 Jul 2023 05:30:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/strangling-monolith-to-microservices-with-choreography-and-cqrs-5d92</link>
      <guid>https://dev.to/anadimisra/strangling-monolith-to-microservices-with-choreography-and-cqrs-5d92</guid>
      <description>&lt;p&gt;Monoliths aren't necessarily bad, but in some cases, they don't perform the job well enough. And no, autoscaling isn't the only solution. You can have stateless monoliths running as a replicaset with HPA or even VPA (if you're not stuck with JVM on fixed heap space) to achieve autoscaling. Taking that thought further, Kubernetes isn't even necessary. You can perform autoscaling on VMs, although managing the spin-up and down times and IAC around it on VMs can be challenging. With that said, this post discusses what to do if you find yourself on a mission to break down a monolith into microservices. The most highly praised approach, the Strangler Application (inspired by the &lt;a href="(https://martinfowler.com/bliki/StranglerFigApplication.html)"&gt;Strangler Fig application&lt;/a&gt; from Martin Fowler), is a great way to embark on this journey. There are plenty of blogs that explain the approach, so that's not the reason why I'm typing away on a cool breezy evening in Bangalore with Arijit Singh's magical voice in the background (no beer). If you're new to the pattern itself, you can read about it here. This post describes an interesting strategy we implemented to refactor a powerful enterprise solution platform monolith into microservices while using the Strangler pattern. If you're new to the pattern itself &lt;a href="https://microservices.io/patterns/refactoring/strangler-application.html" rel="noopener noreferrer"&gt;read here first&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A bit of background, and a shameless plug: this is our hands-on experience of moving to microservices at &lt;a href="https://www.nimblework.com/" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;. Let's explore how we utilized Choreography and CQRS as an implementation strategy to strangle an EJB-based monolith into microservices. This approach has enabled us to deliver new features more quickly and efficiently handle traffic bursts, such as when users log in to file timesheets or move cards to "done" on a Friday evening.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choreography
&lt;/h2&gt;

&lt;p&gt;Choreography represents the communication between microservices through the publishing and subscribing of domain events. It differs from traditional messaging in that there is no two-way message sent/acknowledged flow. Instead, it follows a publish-and-subscribe model where downstream services decide whether to act upon the messages they receive. To understand this concept better, think of each microservice as a radio station broadcasting music on a specific FM frequency. If you want to listen to their music, you tune your radio to their frequency. The microservice doesn't actively manage every connected client; it's up to the client (in this case, the radio) to connect and respond to the received data.&lt;/p&gt;

&lt;p&gt;In scenarios where you are employing the Database per Service pattern and a business transaction spans multiple services, relying on traditional models like the 2PC (two-phase commit) or messaging from the SOA era is not feasible. This is where choreography proves to be extremely useful. The diagram below illustrates this concept within a fictional e-commerce system, showcasing the handling of workflows when a customer signs up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043k8uwjgx761zvynlyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043k8uwjgx761zvynlyx.png" alt="Microservices Choreography" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the Loyalty, Delivery, and Notify services have &lt;strong&gt;tuned&lt;/strong&gt; into a channel where the Customer Service publishes the Customer Created Event. It's important to note that the Customer Service itself is unaware of who is listening to the messages.&lt;/p&gt;
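
&lt;p&gt;The radio-station analogy can be sketched as a minimal publish-subscribe loop. This is an illustration only, with an in-memory channel standing in for a real broker; the service names match the figure, but the API is made up:&lt;/p&gt;

```python
# Choreography in miniature: the publisher broadcasts Domain Events on a
# channel and never tracks who is listening; each subscriber decides on its
# own whether and how to react. The Channel class is a stand-in for a broker.

class Channel:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        # The Customer Service neither knows nor cares who these handlers are.
        for handler in self.subscribers:
            handler(event)

customer_created = Channel()
reactions = []

# Loyalty, Delivery and Notify "tune in" independently.
customer_created.subscribe(lambda e: reactions.append(("loyalty", e["customer_id"])))
customer_created.subscribe(lambda e: reactions.append(("delivery", e["customer_id"])))
customer_created.subscribe(lambda e: reactions.append(("notify", e["customer_id"])))

customer_created.publish({"type": "CustomerCreated", "customer_id": "c-1"})
```

&lt;p&gt;Adding a fourth downstream service is one more &lt;code&gt;subscribe&lt;/code&gt; call; the publishing side never changes, which is exactly why choreography decouples services so well.&lt;/p&gt;
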

&lt;h3&gt;
  
  
  How does it come in handy in Strangler?
&lt;/h3&gt;

&lt;p&gt;Simply put, I introduce a spy into the monolith that tracks all activities within it. The implementation depends on the framework you are using. For example, with EJBs, you can easily create an EventListener that gets invoked after the creation, updating, or deletion of your business objects. This listener captures the business object and publishes corresponding events such as EntityCreated, EntityUpdated, and EntityDeleted. In Hibernate, you can use JPA Lifecycle Events, or in Spring Boot, you can use listeners on entity lifecycle methods. Regardless of your chosen approach, it's crucial to execute these operations asynchronously to avoid keeping any threads tied to the transaction or the initiating request. The key is to ensure that these operations are quick and detached from the actual flow between layers in the application you're refactoring. Now, you have a continuous stream of events relaying your business objects or transactional data to a message/event broker. From this point onward, you have two main options to consider. But before delving into those options, let's first understand CQRS.&lt;/p&gt;
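
&lt;p&gt;The essential mechanics of that spy, a lifecycle hook handing the event to a background publisher so the transaction thread returns immediately, can be sketched like this. It's a Python stand-in for what would be an EJB/JPA lifecycle listener, with a queue playing the role of the broker:&lt;/p&gt;

```python
# Sketch of the "spy": an after-save hook enqueues a Domain Event and returns
# at once; a background thread drains the queue and publishes. In the real
# system the publish goes to a message/event broker, not an in-memory list.
import queue
import threading

outbox = queue.Queue()
published = []

def publisher_loop():
    while True:
        event = outbox.get()
        if event is None:  # sentinel used to stop the example cleanly
            break
        published.append(event)  # stand-in for broker.publish(event)

worker = threading.Thread(target=publisher_loop, daemon=True)
worker.start()

def after_save(entity_id, operation):
    # Called from the entity lifecycle hook; enqueue and return immediately,
    # keeping the transaction/request thread free.
    outbox.put({"type": "Entity" + operation, "id": entity_id})

after_save("order-7", "Created")
after_save("order-7", "Updated")
outbox.put(None)  # drain the queue for the example
worker.join()
```
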

&lt;h2&gt;
  
  
  CQRS
&lt;/h2&gt;

&lt;p&gt;CQRS, coined by &lt;a href="https://twitter.com/gregyoung" rel="noopener noreferrer"&gt;Greg Young&lt;/a&gt;, stands for Command Query Responsibility Segregation, which is a pattern closely aligned with Choreography. When running microservices based on the Database per Service pattern or utilizing Choreography, it becomes challenging to query data that spans multiple microservices. This is where CQRS comes into play. In this approach, services that write data to their respective databases use Choreography to publish domain events, such as OrderCreated or NewSubscription in our e-commerce example. Downstream services then consume these events through event handlers to persist the data in a read-only database. This approach provides us with the flexibility to easily create multiple denormalized views of the data across various services. It also simplifies querying what would have been complex joins in a monolithic architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybywvrhngrqonmzvzmoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybywvrhngrqonmzvzmoe.png" alt="CQRS Example where order history is the read-only DB" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;
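
&lt;p&gt;The query side can be reduced to an event handler folding events into a denormalized view. Here is a sketch, with made-up event fields, of how an OrderCreated event builds the per-customer order history that would otherwise need a join:&lt;/p&gt;

```python
# CQRS query side in miniature: an event handler projects OrderCreated events
# into a denormalized, read-only order-history view keyed by customer.
# Field names are illustrative only.

order_history = {}

def on_order_created(event):
    history = order_history.setdefault(event["customer_id"], [])
    history.append({
        "order_id": event["order_id"],
        "total": event["total"],
        # Duplicated on purpose so queries never join back to the write side.
        "customer_name": event["customer_name"],
    })

on_order_created({"customer_id": "c-1", "order_id": "o-1", "total": 20, "customer_name": "Ada"})
on_order_created({"customer_id": "c-1", "order_id": "o-2", "total": 35, "customer_name": "Ada"})

# The read side answers without touching the write database.
orders = order_history["c-1"]
```
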

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;So, there you have it. On one side, the Monolith emits domain events, and a consumer at the other end of the message queue/message broker processes these domain events to save the business objects into another store. For example, we save them in MongoDB, which then serves as a backend for Reports and Analytics, Mobile Apps, and the &lt;a href="https://www.nimblework.com/products/nimble/cafe/" rel="noopener noreferrer"&gt;Nimble Café&lt;/a&gt;. This approach offers multiple benefits. Firstly, it allows us to create a system where reads outnumber writes. Consequently, we have moved a significant amount of traffic away from the monolith and into autoscaling Microservices that can handle traffic better without impacting our Cloud Budget. MongoDB itself can be optimized for reads and has connectors to systems like Spark, Snowflake, and others, which can serve as a streaming backend for near real-time analytics or even AI. Essentially, I have now split my legacy system into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The older monolith handles the Write Transactions (Command).&lt;/li&gt;
&lt;li&gt;Reporting, Analytics, and other read-heavy apps in our product suite rely on Microservices that access a read-only NoSQL copy of the transactional data (Query).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From here onwards, we continue extracting functionalities into one Microservice at a time. All of these Microservices read from the common read-only copy of the database while writing to their respective stores. Over time, the monolith keeps shrinking and reducing the number of domain events it fires as functionality moves out. But it doesn't stop here. This broker also serves as the backbone for enabling choreography to the newer modules we've built on Microservices. More on Choreography and Saga will be discussed in a later post!&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to watch out for
&lt;/h2&gt;

&lt;p&gt;During the transition phase, the system operates on an eventual consistency model, and there are several factors to consider. What if your message broker goes down? What if the downstream service consuming messages is down? Will those messages be lost forever? What if a message is consumed from a broker but fails to persist into MongoDB? Building retries and utilizing Kubernetes-assisted restarts of failed services based on heartbeat monitors (&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/" rel="noopener noreferrer"&gt;Liveness And Readiness Probes&lt;/a&gt;) helps in outage scenarios. Similarly, incorporating retry logic in services before giving up and writing failed messages to a Dead Letter Queue proves to be beneficial. However, the most powerful technique in this case was using Upsert transactions in MongoDB. With this approach, a domain event that initially fails would eventually get inserted. If your system prioritizes availability and performance over consistency, this technique can work wonders, as it allows you to navigate through outages effectively.&lt;/p&gt;
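
&lt;p&gt;The retry-then-Dead-Letter-Queue flow above can be sketched as follows. This is an illustration with stand-ins (a plain list for the DLQ and a deliberately flaky persist function), not our production consumer:&lt;/p&gt;

```python
# Retry a failing persist a bounded number of times; only after giving up,
# park the message on a Dead Letter Queue so it can be replayed, not lost.

MAX_RETRIES = 3
dead_letter_queue = []

def consume(message, persist):
    for attempt in range(MAX_RETRIES):
        try:
            persist(message)
            return "persisted"
        except ConnectionError:
            continue  # in production: back off before the next attempt
    dead_letter_queue.append(message)
    return "dead-lettered"

# A store that fails twice and then recovers, simulating a transient outage.
calls = {"n": 0}
def flaky_persist(message):
    calls["n"] += 1
    if calls["n"] in (1, 2):
        raise ConnectionError("store unavailable")

result = consume({"id": "evt-1"}, flaky_persist)  # survives the outage
```

&lt;p&gt;Pairing this with the MongoDB upsert technique described above is what makes replaying dead-lettered events safe.&lt;/p&gt;
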

&lt;p&gt;The other aspect to consider is eventual consistency itself. If the cycle time between data being written to the core product and being saved in MongoDB runs into seconds, it can lead to issues: reports and feeds in the Café may start showing stale information. Therefore, it is crucial to ensure that the process completes before a user, in our case, browses to the reporting and analytics view or the Café feed after adding a card. Reactive programming, especially with Spring Boot's reactive extensions to the Database, Messaging, Cloud, and Web modules, has proven to be a saviour in such cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Patterns provide a structured way to solve recurring design problems, or at least that's how I see them. With sufficient reading and practice with samples, one can grasp these patterns. The challenging part, however, lies in improvising on a pattern's layout to achieve a technical or business strategy. This blog post cannot fully summarize the countless hours we have spent translating relational database schemas into NoSQL, all while ensuring it remains useful for other Microservices that, in their initial phase, lift and shift the functionality. I'm not advocating that this approach is suitable for every Strangler Application, but the benefits we have gained from reaching this point have made the months of effort we put into getting the ES/CQRS backbone right worthwhile. I hope this helps fellow engineers who may encounter the same problem in the future.&lt;/p&gt;

</description>
      <category>microservices</category>
    </item>
    <item>
      <title>Serverless Tekton Pipelines on AWS EKS Fargate</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Mon, 21 Feb 2022 11:31:44 +0000</pubDate>
      <link>https://dev.to/anadimisra/serverless-tekton-pipelines-on-aws-eks-fargate-3m89</link>
      <guid>https://dev.to/anadimisra/serverless-tekton-pipelines-on-aws-eks-fargate-3m89</guid>
      <description>&lt;p&gt;Continuous Delivery is hard business! Specially if you're dealing with microservices. While Jenkins does work pretty well unto a scale by creating shared libraries of sorts for common builds, but after a while when you're running your SaaS on microservices like we do at &lt;a href="https://www.digite.com" rel="noopener noreferrer"&gt;Digité&lt;/a&gt;, managing the builds, and the infrastructure for CI/CD can get cumbersome. It is for both optimized Cloud Infra usage and ability to easily write and maintain CD pipelines that we considered moving to &lt;a href="https://tekton.dev/" rel="noopener noreferrer"&gt;Tekton&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Having said that, blocking two extra-large VMs for the "what if there are too many jobs running in parallel?" scenario does not appear natural to me, so I set out to make Tekton work on Fargate. The appeal of Fargate is the ease of serverless, letting us concentrate on managing our CI/CD pipelines without having to manage the infrastructure for them. Hence, I'll share my experience of getting a serverless CI/CD infrastructure for Tekton up and running quickly via Terraform in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Let's start by creating a Terraform module for installing Tekton on Fargate; you can refer to &lt;a href="https://www.anadimisra.com/post/building-an-eks-fargate-cluster-with-terraform" rel="noopener noreferrer"&gt;this article&lt;/a&gt; for a basic EKS Fargate cluster setup. Assuming you have that in place, the next steps are as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fargate Profiles
&lt;/h3&gt;

&lt;p&gt;We'll first create the Fargate profile for running Tekton, the Tekton Dashboard and Tekton Triggers in the &lt;code&gt;tekton-pipelines&lt;/code&gt; namespace&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_eks_fargate_profile" "tekton-dashboard-profile" {
  cluster_name           = module.eks.cluster_id
  fargate_profile_name   = "tekton-dashboard-profile"
  pod_execution_role_arn = module.eks.fargate_iam_role_arn
  subnet_ids             = module.vpc.private_subnets
  selector {
    namespace = "tekton-pipelines"
    labels = {
      "app.kubernetes.io/part-of" = "tekton-dashboard"
    }
  }
  # A labels map can't repeat a key, so the triggers components get their own selector.
  selector {
    namespace = "tekton-pipelines"
    labels = {
      "app.kubernetes.io/part-of" = "tekton-triggers"
    }
  }
  depends_on = [module.eks]
  tags = {
    Environment = "${var.environment}"
    Cost        = "${var.cost_tag}"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EFS Setup
&lt;/h3&gt;

&lt;p&gt;EFS is the approach recommended by AWS for mounting persistent volumes on Fargate nodes; hence, we'll add the EFS configuration in the next steps.&lt;/p&gt;

&lt;p&gt;It's good practice to restrict EFS access to the VPC running the EKS cluster and to your internal network, so that IAM-controlled users can access it over the AWS CLI. Declare a security group with ingress rules for each subnet CIDR of the VPC running EKS Fargate to restrict access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "efs-access-security-group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "4.3.0"
  create  = true

  name        = "efs-${var.cluster_title}-${var.environment}-security-group"
  description = "Security group for pipeline tekton EFS, created via terraform"
  vpc_id      = module.vpc.vpc_id

  ingress_with_cidr_blocks = [{ cidr_blocks = "172.18.1.0/24"
    from_port = 0
    to_port   = 2049
    protocol  = "tcp"
    self      = true
    }, {
    cidr_blocks = "172.18.3.0/24"
    from_port   = 0
    to_port     = 2049
    protocol    = "tcp"
    self        = true
    }, 
    // All Subnet CIDRs...
, ]
  ingress_with_self = [{
    from_port   = 0
    to_port     = 0
    protocol    = -1
    self        = true
    description = "Ingress with Self"
  }]

  egress_with_cidr_blocks = [{
    cidr_blocks = "0.0.0.0/0"
    from_port   = 0
    to_port     = 0
    protocol    = -1
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While Fargate auto-installs the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html" rel="noopener noreferrer"&gt;EFS CSI Driver&lt;/a&gt;, we still have to declare an IAM policy for the cluster's EFS access. Here's how to do it in our Terraform module&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_policy" "efs-csi-driver-policy" {
  name        = "TektonEFSCSIDriverPolicy"
  description = "EFS CSI Driver Policy"

  policy = jsonencode({
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : [
          "elasticfilesystem:DescribeAccessPoints",
          "elasticfilesystem:DescribeFileSystems"
        ],
        "Resource" : "*"
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "elasticfilesystem:CreateAccessPoint"
        ],
        "Resource" : "*",
        "Condition" : {
          "StringLike" : {
            "aws:RequestTag/efs.csi.aws.com/cluster" : "true"
          }
        }
      },
      {
        "Effect" : "Allow",
        "Action" : "elasticfilesystem:DeleteAccessPoint",
        "Resource" : "*",
        "Condition" : {
          "StringEquals" : {
            "aws:ResourceTag/efs.csi.aws.com/cluster" : "true"
          }
        }
      }
    ]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that done, we'll define the cluster IAM role for EFS access. First, the policy document that details the policy statements for the role&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_iam_policy_document" "efs-iam-assume-role-policy" {

  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    effect  = "Allow"
    condition {
      test     = "StringEquals"
      variable = "${replace(aws_iam_openid_connect_provider.tekton-main.url, "https://", "")}:sub"
      values   = ["system:serviceaccount:tekton-pipelines:tekton-efs-serviceaccount"]
    }
    principals {
      identifiers = [aws_iam_openid_connect_provider.tekton-main.arn]
      type        = "Federated"
    }
  }
  depends_on = [
    aws_iam_policy.efs-csi-driver-policy
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then we add the role&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_role" "efs-service-account-iam-role" {
  assume_role_policy = data.aws_iam_policy_document.efs-iam-assume-role-policy.json
  name               = "tekton-efs-service-account-role"
}

resource "aws_iam_role_policy_attachment" "efs-csi-driver-policy-attachment" {
  role       = aws_iam_role.efs-service-account-iam-role.name
  policy_arn = aws_iam_policy.efs-csi-driver-policy.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we map it to a service account&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "kubernetes_service_account" "efs-service-account" {
  metadata {
    name      = "tekton-efs-serviceaccount"
    namespace = "tekton-pipelines"
    labels = {
      "app.kubernetes.io/name" = "tekton-efs-serviceaccount"
    }
    annotations = {
      # This annotation is only used when running on EKS which can use IAM roles for service accounts.
      "eks.amazonaws.com/role-arn" = aws_iam_role.efs-service-account-iam-role.arn
    }
  }
  depends_on = [
    aws_iam_role_policy_attachment.efs-csi-driver-policy-attachment
  ]
}

resource "kubernetes_role" "efs-kube-role" {
  metadata {
    name = "efs-kube-role"
    labels = {
      "name" = "efs-kube-role"
    }
  }

  rule {
    api_groups = [""]
    resources  = ["persistentvolumeclaims", "persistentvolumes"]
    verbs      = ["create", "get", "list", "update", "watch", "patch"]
  }

  rule {
    api_groups = ["", "storage"]
    resources  = ["nodes", "pods", "events", "csidrivers", "csinodes", "csistoragecapacities", "storageclasses"]
    verbs      = ["get", "list", "watch"]
  }
  depends_on = [aws_iam_role_policy_attachment.efs-csi-driver-policy-attachment]
}

resource "kubernetes_role_binding" "efs-role-binding" {
  depends_on = [
    kubernetes_service_account.efs-service-account
  ]
  metadata {
    name = "tekton-efs-role-binding"
    labels = {
      "app.kubernetes.io/name" = "tekton-efs-role-binding"
    }
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = "efs-kube-role"
  }
  subject {
    kind      = "ServiceAccount"
    name      = "tekton-efs-serviceaccount"
    namespace = "tekton-pipelines"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the IAM linked service account in place, we'll define the EFS file system&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_efs_file_system" "eks-efs" {
  creation_token = "tekton-eks-efs"
  encrypted      = true
  tags = {
    Name                  = "tekton-eks-efs"
    Cost                  = var.cost_tag

  }
  depends_on = [
    kubernetes_role_binding.efs-role-binding
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And its mount targets and storage class&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_efs_mount_target" "eks-efs-private-subnet-mnt-target" {
  count           = length(module.vpc.private_subnets)
  file_system_id  = aws_efs_file_system.eks-efs.id
  subnet_id       = module.vpc.private_subnets[count.index]
  security_groups = [module.efs-access-security-group.security_group_id]
}

resource "aws_efs_access_point" "eks-efs-tekton-access-point" {
  file_system_id = aws_efs_file_system.eks-efs.id
  root_directory {
    path = "/workspace"
    creation_info {
      owner_gid   = 1000
      owner_uid   = 1000
      permissions = 755
    }
  }
  posix_user {
    gid = 1000
    uid = 1000
  }
  tags = {
    Name        = "eks-efs-tekton-access-point"
    Cost        = var.cost_tag
    Environment = "${var.environment}"
  }
}

resource "kubernetes_storage_class" "eks-efs-storage-class" {
  metadata {
    name = "eks-efs-storage-class"
  }
  storage_provisioner = "efs.csi.aws.com"
  reclaim_policy      = "Retain"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the EFS and access point IDs in the Terraform output when applying these changes; they'll be used in the PV and PVC definitions. My scripts gave the output&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fs-8a7eXXXX::fsap-0f60de28766XXXXXX&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Tekton
&lt;/h3&gt;

&lt;p&gt;It's pretty simple from here on; the following command installs Tekton&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply --filename https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;followed by Tekton dashboard (read-only install)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -sL https://raw.githubusercontent.com/tektoncd/dashboard/main/scripts/release-installer | \
   bash -s -- install latest --read-only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;code&gt;kubectl apply --filename tekton-dashboard-readonly.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;after downloading the Read Only YAML from &lt;a href="https://github.com/tektoncd/dashboard/releases" rel="noopener noreferrer"&gt;this GitHub link&lt;/a&gt;.  Next we setup the persistent volume, refer to the generated EFS IDs from Terraform run in your PV definition, here's an example for a PV and PVC that will be used by a maven task for running tekton pipeline&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: piglet-source-pv
  labels:
    type: piglet-source-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: eks-efs-storage-class
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-8a7eXXXX::fsap-0f60de28766XXXXXX
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: piglet-source-pvc
spec:
  selector:
    matchLabels:
      type: piglet-source-pv
  storageClassName: eks-efs-storage-class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While the Tekton installation itself doesn't change (you're using a kubectl apply command as always), we have to be aware of how Fargate profiles are applied for any workloads to run on EKS Fargate, and therefore provision a Fargate profile using existing Tekton labels as its selectors so that our tasks can run on Fargate. Other than that, we have to provision and configure the PV and PVC via EFS for tasks to use at runtime.&lt;/p&gt;

&lt;p&gt;With those in place we have a working Tekton installation over EKS Fargate with a truly on-demand way of running builds and CI/CD Pipelines.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building an EKS Fargate cluster with Terraform</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Wed, 19 Jan 2022 10:16:37 +0000</pubDate>
      <link>https://dev.to/anadimisra/building-an-eks-fargate-cluster-with-terraform-522g</link>
      <guid>https://dev.to/anadimisra/building-an-eks-fargate-cluster-with-terraform-522g</guid>
      <description>&lt;p&gt;Fargate is service by AWS to run serverless workloads in Kubernetes. With Fargate you do not have to manage VMs as cluster nodes yourself as each of the pods are provisioned as nodes by Fargate itself. It is different from Lambda in the sense that you're still self-managing the Kubernetes cluster or the runtime for all the workloads you run in that cluster. Having said that, I believe it's more suitable for teams that are running containerised microservices and want to do away with managing Kubernetes infrastructure themselves.&lt;/p&gt;

&lt;p&gt;While Lambda prices on a combination of requests, CPU and memory, Fargate pricing is just the CPU and memory of the nodes running in the cluster, in addition to a fixed monthly cost for the service itself. If you want to go serverless without vendor lock-in, Fargate is a good option. Hence we at &lt;a href="https://www.digite.com/" rel="noopener noreferrer"&gt;Digité&lt;/a&gt; prefer running our microservices in the Fargate model.&lt;/p&gt;

&lt;p&gt;Managing such an infrastructure manually is certainly not feasible, hence we rely on IaC to manage and operate our infrastructure. Here, &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; has been our tool of choice for various reasons, from ease of learning to its robust design. Terraform is an open-source Infrastructure as Code tool by HashiCorp that lets you define AWS infrastructure via a descriptive DSL, and it has been quite popular in the DevOps world since its inception.&lt;/p&gt;

&lt;p&gt;In this blog I'll share how we've used Terraform to deploy an EKS Fargate cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  VPC
&lt;/h2&gt;

&lt;p&gt;We'll start with deploying the Amazon VPC via Terraform. There are three recommended approaches for deploying a VPC to run EKS Fargate, let's look at each of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public and Private Subnets: the pods run in private subnets while load balancers, whether Application or Network, are deployed in the public subnets. One public and one private subnet is deployed in each availability zone of the region for availability and fault tolerance; this is the deployment model we will follow for this blog&lt;/li&gt;
&lt;li&gt;Public Subnets Only: both the pods (or nodes) and the load balancers are in public subnets; here three public subnets are deployed in three different availability zones within the region. All nodes get a public IP address, and a security group blocks all inbound and outbound traffic to the nodes. To be honest, I haven't ever figured out why anyone would need this :-)&lt;/li&gt;
&lt;li&gt;Private Subnets Only: both pods and load balancers run in private subnets only, one created in each availability zone of the region. Quite naturally, we have to configure an additional NAT Gateway, Egress-Only Gateway, VPN or Direct Connect to be able to access the cluster. There's additional configuration on the &lt;code&gt;kubectl&lt;/code&gt; side as well, which we will skip in this blog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VPC subnets should have certain tags which allow EKS Fargate to deploy internal load balancers to them and provision nodes; let's look at the tags first&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;kubernetes.io/cluster/cluster-name&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Value: &lt;code&gt;shared&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following tags allow EKS Fargate to decide where auto-provisioned Elastic Load Balancers are deployed, and also allow you to control where Application or Network Load Balancers are configured&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Private Subnets:

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;kubernetes.io/role/internal-elb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Value: 1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Public Subnets:

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;kubernetes.io/role/elb&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Value: 1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The VPC configuration is therefore as follows; we'll use the AWS VPC Terraform module for this purpose, as it provides easier configuration via declarative properties instead of having to write all the resources yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc" {
  source                        = "terraform-aws-modules/vpc/aws"
  version                       = "3.4.0"
  name                          = "vpc-serverless"
  cidr                          = "176.24.0.0/16"
  azs                           = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets               = ["176.24.1.0/24","176.24.3.0/24","176.24.5.0/24"]
  public_subnets                = ["176.24.2.0/24","176.24.4.0/24","176.24.6.0/24"]
  enable_nat_gateway            = true
  single_nat_gateway            = true
  enable_dns_hostnames          = true
  manage_default_security_group = true
  default_security_group_name   = "vpc-serverless-security-group"

  public_subnet_tags = {
    "kubernetes.io/cluster/vpc-serverless" = "shared"
    "kubernetes.io/role/elb"               = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/vpc-serverless" = "shared"
    "kubernetes.io/role/internal-elb"      = "1"
  }

  tags = {
    "kubernetes.io/cluster/vpc-serverless" = "shared"
  }

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a single NAT gateway managing all traffic from the nodes running in the private subnets. We also have to keep the &lt;code&gt;enable_dns_hostnames&lt;/code&gt; option set to true so that any ALBs we configure in the future can be assigned hostnames for CNAME DNS mapping.&lt;/p&gt;

&lt;h2&gt;
  
  
  EKS Cluster
&lt;/h2&gt;

&lt;p&gt;We'll use the AWS EKS Terraform module to deploy the EKS Fargate cluster. A basic configuration is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "eks-cluster" {
  source                        = "terraform-aws-modules/eks/aws"
  version                       = "17.1.0"
  cluster_name                  = "eks-serverless"
  cluster_version               = "1.21"
  subnets                       = flatten([module.vpc.public_subnets, module.vpc.private_subnets])
  cluster_delete_timeout        = "30m"
  cluster_iam_role_name         = "eks-serverless-cluster-iam-role"
  cluster_enabled_log_types     = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cluster_log_retention_in_days = 7

  vpc_id = module.vpc.vpc_id

  fargate_pod_execution_role_name = "eks-serverless-pod-execution-role"
  // Fargate profiles here
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fargate Profiles and CoreDNS
&lt;/h2&gt;

&lt;p&gt;A basic configuration like the one above will deploy the EKS cluster; however, you need to create Fargate profiles that define which pods will run on Fargate. A Fargate profile defines selectors and a namespace to run the pods in, along with optional tags. You also have to provide a pod execution role name, which allows the EKS infrastructure to make AWS API calls on the cluster owner's behalf. You can have up to five selectors per Fargate profile. &lt;/p&gt;

&lt;p&gt;While Fargate takes care of provisioning nodes as pods for the EKS cluster, it still needs a component to handle DNS resolution within the cluster; CoreDNS is that plugin for EKS Fargate and, like any other workload, it needs a Fargate profile to run. So we'll add both the plugin and the profile configuration to our Terraform code.&lt;/p&gt;

&lt;p&gt;First, let's update the profile configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fargate_profiles = {
    coredns-fargate-profile = {
      name = "coredns"
      selectors = [
        {
          namespace = "kube-system"
          labels = {
            k8s-app = "kube-dns"
          }
        },
        {
          namespace = "default"
        }
      ]
      subnets = module.vpc.private_subnets
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're essentially saying: run the pods in the &lt;code&gt;kube-system&lt;/code&gt; namespace that carry the label &lt;code&gt;k8s-app: kube-dns&lt;/code&gt; on this Fargate profile. Let's also add the CoreDNS plugin to the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_eks_addon" "coredns" {
  addon_name        = "coredns"
  addon_version     = "v1.8.4-eksbuild.1"
  cluster_name      = "eks-serverless"
  resolve_conflicts = "OVERWRITE"
  depends_on        = [module.eks-cluster]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
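
&lt;p&gt;If the CoreDNS pods remain stuck in &lt;code&gt;Pending&lt;/code&gt; after the add-on is installed, AWS documents removing the &lt;code&gt;eks.amazonaws.com/compute-type&lt;/code&gt; annotation from the CoreDNS deployment so the pods can be scheduled on Fargate. A one-off patch, run against your cluster context, looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl patch deployment coredns -n kube-system --type json \
  -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;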



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;At this stage it's simple enough to bundle all of this into a single module. Here's what the file structure looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cluster
├── main.tf
├── outputs.tf
├── providers.tf
├── terraform.tf
├── terraform.tfvars
└── variables.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
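
&lt;p&gt;The &lt;code&gt;outputs.tf&lt;/code&gt; file can export values that other modules or tooling will need later; for example (the output names here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "vpc_id" {
  value = module.vpc.vpc_id
}

output "cluster_id" {
  value = module.eks-cluster.cluster_id
}

output "cluster_endpoint" {
  value = module.eks-cluster.cluster_endpoint
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;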



&lt;p&gt;The &lt;code&gt;providers.tf&lt;/code&gt; file defines the AWS provider, with the AWS CLI credentials supplied as variables that are declared in the &lt;code&gt;variables.tf&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_version = "=1.0.2"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.49.0"
    }
  }
}

# Provider definition
provider "aws" {
  access_key = var.access_key
  secret_key = var.secret_key
  region     = var.region
  token      = var.session_token
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
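
&lt;p&gt;A minimal &lt;code&gt;variables.tf&lt;/code&gt; backing that provider block could look like this (the variable names come from the provider definition above; the defaults are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "access_key" {
  type        = string
  description = "AWS access key ID"
  sensitive   = true
}

variable "secret_key" {
  type        = string
  description = "AWS secret access key"
  sensitive   = true
}

variable "session_token" {
  type        = string
  description = "AWS session token, when using temporary credentials"
  default     = ""
  sensitive   = true
}

variable "region" {
  type        = string
  description = "AWS region to deploy into"
  default     = "us-east-1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;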



&lt;p&gt;If you're saving Terraform state in a remote backend, you can define its configuration in the &lt;code&gt;terraform.tf&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {

  backend "s3" {
    bucket         = "swiftalk-iac"
    dynamodb_table = "swiftalk-iac-locks"
    key            = "vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Run &lt;code&gt;terraform init&lt;/code&gt; followed by &lt;code&gt;terraform apply&lt;/code&gt; to get an EKS Fargate cluster up and running in minutes!&lt;/p&gt;
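
&lt;p&gt;The usual Terraform workflow applies; from the &lt;code&gt;cluster&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# initialise providers, modules and the S3 backend
terraform init

# preview the resources to be created
terraform plan -out=cluster.tfplan

# create the VPC and the EKS Fargate cluster
terraform apply cluster.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;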

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Tekton and the Promise of Reusable Pipelines</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 13 Oct 2020 14:21:25 +0000</pubDate>
      <link>https://dev.to/anadimisra/tekton-and-the-promise-of-reusable-pipelines-4jko</link>
      <guid>https://dev.to/anadimisra/tekton-and-the-promise-of-reusable-pipelines-4jko</guid>
      <description>&lt;p&gt;The advent of Cloud and Container technology ushered a new era in distributed computing at “planet scale” which was unheard of and unimaginable just a decade ago. Another interesting movement was brewing up a decade ago which bolstered delivering these complex solutions at high speed and accuracy, DevOps. These two paradigm shifts have gone hand in hand complementing each other to shape up distributed computing in the way we know (or are still learning about) today.&lt;/p&gt;

&lt;p&gt;We all know how integral Continuous Integration and Continuous Deployment are to the DevOps automation paradigm, and how organizations have designed verbose pipelines to bring a factory-floor model to shipping software. &lt;/p&gt;

&lt;p&gt;If you’ve ever been part of implementing any of these DevOps practices for a cloud-native distributed system, you’d perhaps know how quickly these CI/CD pipelines become a cacophony of complex tools and integrations, requiring their own sub-organization of specialists to build and maintain them, thereby adding to the very silos the practices had set out to break.&lt;/p&gt;

&lt;p&gt;A large system is usually composed of multiple distributed subsystems deployed as Docker containers clustered over an orchestration runtime such as Kubernetes, and for such systems these pipelines come anything but easy. The problem is that, at any time, you'd be dealing with at least five tools for anything from triggers, to running builds and packaging, to creating test environments and running tests, and finally the holy grail of one-click deploy (if it is holy at all!).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tekton
&lt;/h2&gt;

&lt;p&gt;Some good people at the Knative project with Google felt the aforementioned problem deeply enough to come up with a solution that is (I believe) one of the best attempts yet at building shift-left pipelines: Tekton.&lt;/p&gt;

&lt;p&gt;Tekton aims to bring much-needed simplicity and uniformity to creating and running these pipelines by providing a highly reusable, declarative, component-based cloud-native build system that uses Kubernetes CRDs to get the job done. In the Tekton philosophy, any pipeline can be broken down into the following three key parts: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core services: version control, artifact store, deployment automation&lt;/li&gt;
&lt;li&gt;Tasks, which could range from running a Maven build to test automation to security and performance evaluations&lt;/li&gt;
&lt;li&gt;A workflow which decides how and when the tasks run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vision, therefore, is to cut through the inconsistency and complexity and provide a mechanism for building pipelines that is: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flfiawcvj7zklx59pfnnm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flfiawcvj7zklx59pfnnm.jpg" alt="Alt Text" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Tekton defines resources which fulfil the characteristics shown above, letting you concentrate on what needs to be done and when, leaving the how to the underlying implementation. Let's look at the key building blocks of a pipeline created with Tekton.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;p&gt;The most basic of Tekton's components is the step: essentially a Kubernetes container spec, an existing resource type that lets you define an image and the information you need to run it. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt; &lt;span class="c1"&gt;# contains bash&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;#!/usr/bin/env bash&lt;/span&gt;
    &lt;span class="s"&gt;echo "Hello from Bash!"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task
&lt;/h3&gt;

&lt;p&gt;A Task is composed of one or more steps (you can make tasks as granular or as coarse as you wish) and is the unit of work in a pipeline that achieves a specific goal (a built JAR archive, a Docker image, a test run, etc.). The following task runs a Maven build, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tekton.dev/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mvn&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;workspaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output&lt;/span&gt;
 &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GOALS&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The Maven goals to run&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;array&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;package"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
 &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mvn&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/cloud-builders/mvn&lt;/span&gt;
      &lt;span class="na"&gt;workingDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/workspace/output&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/mvn"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-Dmaven.repo.local=$(workspaces.maven-repo.path)&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$(inputs.params.GOALS)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;A Pipeline is a collection of Tasks that you define and arrange in a specific order of execution as part of your continuous integration flow. Each Task in a Pipeline executes as a Pod on your Kubernetes cluster, and you can configure various execution conditions to fit your business needs. A Pipeline can be the entire workflow or just a part of one, as you desire. Here's a diagrammatic representation of what a pipeline would achieve in Tekton.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw57x0q67c6mccuuhrh9q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw57x0q67c6mccuuhrh9q.jpeg" alt="Alt Text" width="800" height="1130"&gt;&lt;/a&gt;&lt;/p&gt;
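
&lt;p&gt;As a sketch, a minimal Pipeline referencing the &lt;code&gt;mvn&lt;/code&gt; Task from the previous section might look like this (the pipeline and workspace names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-app
spec:
  workspaces:
    - name: shared-workspace
  tasks:
    - name: maven-build
      taskRef:
        name: mvn
      params:
        - name: GOALS
          value: ["clean", "package"]
      workspaces:
        - name: output
          workspace: shared-workspace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;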

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;Let's look at what we would expect a pipeline to do for most modern-day projects:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnqpcn9smpw2ro4ddi5g6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnqpcn9smpw2ro4ddi5g6.jpeg" alt="Alt Text" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have multiple apps that need these steps you can essentially:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define common tasks such as unit tests, linting, building images, running tests (integration or end-to-end), publishing images, etc.&lt;/li&gt;
&lt;li&gt;Define multiple pipelines, or create a standardized pipeline to be reused across similar modules&lt;/li&gt;
&lt;li&gt;Parameterize pipeline runs and scale to a large number of pipelines with less automation and CI/CD configuration&lt;/li&gt;
&lt;/ol&gt;
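
&lt;p&gt;Point 3 in practice: each module gets its own PipelineRun that binds parameters and workspaces to a shared Pipeline definition. A hedged sketch, where the pipeline name and the PVC claim are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-app-run
spec:
  pipelineRef:
    name: build-app
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: app-source-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;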

&lt;p&gt;Therein lies the power of this tool: being able to author any number of pipelines without having to integrate multiple tools or manage complex orchestration. This thoroughly DRY approach to automated CI/CD pipelines is certainly a great tool at the disposal of software development teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Pipeline with Tekton
&lt;/h2&gt;

&lt;p&gt;Now that we've seen what Tekton is all about and the promise it brings to the table, let's see how well it lives up to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Tekton
&lt;/h3&gt;

&lt;p&gt;To install Tekton's core component, Tekton Pipelines (assuming you already have a Kubernetes cluster up and running; if not, set up the cluster first), run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;--filename&lt;/span&gt; https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take a few moments before the installation completes. You can check the progress with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--namespace&lt;/span&gt; tekton-pipelines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm that every component listed has the status Running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Persistent Volumes
&lt;/h3&gt;

&lt;p&gt;To run a CI/CD workflow, you need to provide Tekton a Persistent Volume for storage purposes. By default, Tekton requests a volume of 5Gi with the default storage class. Your Kubernetes cluster, such as one from Google Kubernetes Engine, may have persistent volumes set up at creation time, in which case no extra step is required; if not, you may have to create them manually. Alternatively, you may ask Tekton to use a Google Cloud Storage bucket or an AWS Simple Storage Service (Amazon S3) bucket instead. Note that Tekton's performance may vary depending on the storage option you choose. The command below, for example, configures Tekton to request 10Gi volumes using the &lt;code&gt;manual&lt;/code&gt; storage class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create configmap config-artifact-pvc &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10Gi &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;storageClassName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="nt"&gt;-n&lt;/span&gt; tekton-pipelines &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client | kubectl replace &lt;span class="nt"&gt;-f&lt;/span&gt; -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more specific details on the Installation and configuration of Tekton you may refer to their &lt;a href="https://github.com/tektoncd/pipeline/blob/master/docs/install.md" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Steps
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;In this post we saw what Tekton brings to the table in terms of providing a way to author highly scalable pipelines built from reusable tasks, and how to quickly get it up and running on a Kubernetes cluster. In the next part we will look into building and running a pipeline on Tekton for a simple Java application.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
