<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anadi Misra</title>
    <description>The latest articles on DEV Community by Anadi Misra (@anadimisra).</description>
    <link>https://dev.to/anadimisra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F488849%2F9cee1362-fd10-45b4-a317-341cccf233a7.jpeg</url>
      <title>DEV Community: Anadi Misra</title>
      <link>https://dev.to/anadimisra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anadimisra"/>
    <language>en</language>
    <item>
      <title>A Guide to Managing the First Fallacy of Distributed Computing</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 24 Oct 2023 12:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/a-guide-to-managing-the-first-fallacy-of-distributed-computing-57m</link>
      <guid>https://dev.to/anadimisra/a-guide-to-managing-the-first-fallacy-of-distributed-computing-57m</guid>
      <description>&lt;p&gt;Distributed computing is a complex field with numerous challenges, and understanding the fallacies associated with it is crucial for building robust and reliable distributed systems. Here are eight fallacies of distributed computing and their significance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Network Is Reliable:&lt;/strong&gt; Assuming that network connections are always available and reliable can lead to system failures when network outages occur, even when the network outages are transitory. It's essential to design systems that can gracefully handle network failures through redundancy and fault tolerance mechanisms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Is Zero:&lt;/strong&gt; Overestimating the speed of communication between distributed components can result in slow and unresponsive systems. Acknowledging network latency and optimizing for it is vital for delivering efficient user experiences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth Is Infinite:&lt;/strong&gt; Believing that network bandwidth is unlimited can lead to overloading the network and causing congestion. Efficient data transmission and bandwidth management are crucial to avoid performance bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Network Is Secure:&lt;/strong&gt; Assuming that the network is inherently secure can result in vulnerabilities and data breaches. Implementing strong security measures, including encryption and authentication, is necessary to protect sensitive information in distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topology Doesn't Change:&lt;/strong&gt; Networks evolve, and assuming a static topology can lead to configuration errors and system instability. Systems should be designed to adapt to changing network conditions and configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;There Is One Administrator:&lt;/strong&gt; Believing that a single administrator controls the entire distributed system can lead to coordination issues and conflicts. In reality, distributed systems often involve multiple administrators, and clear governance and coordination mechanisms are needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport Cost Is Zero:&lt;/strong&gt; Neglecting the cost associated with data transfer can lead to inefficient resource utilization and increased operational expenses. Optimizing data transfer and considering the associated costs are essential for cost-effective distributed computing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Network Is Homogeneous:&lt;/strong&gt; Assuming that all network components and nodes have the same characteristics can result in compatibility issues and performance disparities. Systems should be designed to handle heterogeneity and accommodate various types of devices and platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these fallacies is critical because they underscore the challenges and complexities of distributed computing. Failure to account for them can lead to system failures, security breaches, and increased operational costs. Building reliable, efficient, and secure distributed systems requires a deep understanding of these fallacies and the implementation of appropriate software design, architecture, and IT operational strategies to address them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unreliable Networks
&lt;/h2&gt;

&lt;p&gt;In this blog post, we will look at the first fallacy, its impact on microservices architecture, and how to work around this limitation. Let's say we're using Spring Boot to write our microservice, with MongoDB as the backend deployed as a StatefulSet in Kubernetes, and we're running all of this on EKS. You might argue that it is your cloud provider's job to give us a reliable network and that we're paying them for high availability. The expectation isn't wrong, but unfortunately it doesn't always work out that way when you rent hardware over the cloud. Let's say your cloud provider promises 99.99% availability. Impressive, right? Not quite, and I'll explain why. 99.99% availability could mean&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One request in every 10,000 requests failing&lt;/li&gt;
&lt;li&gt;Ten requests in every 100,000 requests failing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you might say that my system doesn't get that kind of traffic! Fair enough, but this is the availability figure for the cloud provider as a whole, not for your instances of a service. That means if the cloud is handling a billion network requests, 100,000 of them will fail! And to make things more complex, you can't expect the provider to distribute these failures evenly across all the accounts on its hardware; you might be hit by any number of those failures depending on your luck. The question here is: do you want to run a business on the mere chance that these outages won't hit you? I hope not! So that's the fundamental description of the first (and most critical) fallacy of distributed computing.&lt;/p&gt;
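&lt;p&gt;To put numbers on this, here's a quick back-of-the-envelope sketch, assuming failures are spread uniformly at the promised availability level (the class and method names are purely illustrative):&lt;/p&gt;

```java
public class AvailabilityMath {
    // Expected number of failed requests at a given availability percentage,
    // assuming failures are uniformly distributed across all requests.
    static long expectedFailures(long totalRequests, double availabilityPercent) {
        double failureRate = (100.0 - availabilityPercent) / 100.0;
        return Math.round(totalRequests * failureRate);
    }

    public static void main(String[] args) {
        // 99.99% availability: one in every 10,000 requests fails...
        System.out.println(expectedFailures(10_000L, 99.99));
        // ...which is 100,000 failures out of a billion requests
        System.out.println(expectedFailures(1_000_000_000L, 99.99));
    }
}
```

&lt;p&gt;At a billion requests, a 99.99% SLA still leaves a hundred thousand failed calls to handle.&lt;/p&gt;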

&lt;h2&gt;
  
  
  The Impact of Network Failures
&lt;/h2&gt;

&lt;p&gt;Let's take the example of an e-commerce system. We'd usually serve a product catalogue from the Product microservice; however, the SKU availability might be fetched from another microservice when building the Product Catalogue response. One could argue that the SKU information could be replicated into the product catalogue via choreography, but for the scope of this example, let's assume that's not in place. The Product service is therefore making a REST API call to the SKU service. What happens when this call fails? How would you convey to the end user whether the product they are looking at is available or not?&lt;/p&gt;

&lt;p&gt;Scary stuff, yeah? Well, not that scary; as engineers we love to brave the harder frontiers, and we have a few tricks up our sleeves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding for Fault Tolerance and Resilience
&lt;/h2&gt;

&lt;p&gt;This is perhaps a topic worthy of a book in itself rather than a blog post, but I'll try to cover all I can while keeping it simple. Most of what I'm sharing here comes from our experience transitioning from a monolith to microservices for the SaaS business at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;, and I hope others find it helpful too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Patterns for Transitory Outages
&lt;/h3&gt;

&lt;p&gt;The following patterns help circumvent transitory outages, or blips as we usually call them. The fundamental underlying assumption is that such outages last a second or two at worst.&lt;/p&gt;

&lt;h4&gt;
  
  
  Retries
&lt;/h4&gt;

&lt;p&gt;One of the simplest things to do is to wrap your network calls in retry logic so that there are multiple attempts before the calling service finally gives up. The idea here is that a temporary network snag at the cloud provider won't outlast the retries made for fetching data. Microservice libraries and frameworks in almost all common programming languages provide this feature. The retries themselves have to be nuanced and selective; retrying on an HTTP &lt;code&gt;400&lt;/code&gt;, for example, will not change the outcome until the request signature changes. Here's an example of using retries when making REST API calls with Spring WebFlux WebClient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;webClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpHeaders&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;retrieve&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;bodyToMono&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ParameterizedTypeReference&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="o"&gt;})&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;log&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getClass&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;Level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FINE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;retryWhen&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="nc"&gt;Retry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;jitter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;throwable&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nc"&gt;RuntimeException&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
                                        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;throwable&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                                        &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GATEWAY_TIMEOUT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SERVICE_UNAVAILABLE&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAD_GATEWAY&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;onRetryExhaustedThrow&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;retryBackoffSpec&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrySignal&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Service at {} failed to respond, after max attempts of: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrySignal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;totalRetries&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
                                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retrySignal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;failure&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                                &lt;span class="o"&gt;}))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;onErrorResume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WebClientResponseException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;is4xxClientError&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;Mono&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Mono&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's a summary of what we're trying to achieve with this piece of code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry a maximum of three times, with an exponential backoff starting at two seconds&lt;/li&gt;
&lt;li&gt;Randomize the spacing between retries based on the &lt;code&gt;jitter&lt;/code&gt; factor&lt;/li&gt;
&lt;li&gt;Retry only if the upstream service returned an HTTP &lt;code&gt;504&lt;/code&gt;, &lt;code&gt;503&lt;/code&gt; or &lt;code&gt;502&lt;/code&gt; status&lt;/li&gt;
&lt;li&gt;Log the error and pass it downstream when the maximum attempts are exhausted&lt;/li&gt;
&lt;li&gt;Return an empty response instead for client errors, or pass the error from the previous step downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These retries can help recover from blips or snags which aren't expected to last long. This can also be a good mechanism if the upstream service we're calling restarts for whatever reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We've noticed running replica sets in Kubernetes with &lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/" rel="noopener noreferrer"&gt;Rolling Updates&lt;/a&gt; strategy helps reduce such blips and hence retries.&lt;/p&gt;

&lt;p&gt;While this is an example using the Reactor Project's implementation in Spring, all major frameworks and languages provide alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/spring-projects/spring-retry" rel="noopener noreferrer"&gt;Spring Retry&lt;/a&gt; if you're Spring Framework but not on reactive programming&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doc.akka.io/docs/akka/current/fault-tolerance.html" rel="noopener noreferrer"&gt;Supervisor Strategy&lt;/a&gt; when you're on Akka with Scala or Java&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scala.util.{Failure, Try}&lt;/code&gt; if you're using Scala without any framework as such&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/retry/" rel="noopener noreferrer"&gt;Retry Decorator&lt;/a&gt; in python&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/package/fetch-retry" rel="noopener noreferrer"&gt;fetch-retry&lt;/a&gt; in JavaScript&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and I'm sure this is not an exhaustive list. This pattern takes care of transitory network blips. But what if there's a sustained outage? More on that later in this article.&lt;/p&gt;
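&lt;p&gt;If none of these fit your stack, the underlying idea is simple enough to hand-roll: a loop with exponential backoff, jitter, and a predicate deciding which failures are worth retrying. Here's a minimal, framework-free sketch (the names and the retryable check are illustrative):&lt;/p&gt;

```java
import java.util.Random;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class SimpleRetry {
    // Retries the call up to maxAttempts times with exponential backoff
    // and jitter, but only for failures the caller deems retryable.
    public static <T> T withBackoff(Supplier<T> call, int maxAttempts,
                                    long baseDelayMillis,
                                    Predicate<RuntimeException> retryable)
            throws InterruptedException {
        Random jitter = new Random();
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e; // exhausted, or not worth retrying (e.g. a 400)
                }
                long delay = baseDelayMillis * (1L << (attempt - 1)); // exponential
                Thread.sleep(delay + jitter.nextInt((int) delay));    // plus jitter
            }
        }
    }
}
```

&lt;p&gt;The predicate plays the same role as the &lt;code&gt;filter&lt;/code&gt; in the WebClient example: it keeps the loop from wasting attempts on failures that retries cannot fix.&lt;/p&gt;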

&lt;h4&gt;
  
  
  Last Known Good Version
&lt;/h4&gt;

&lt;p&gt;What if the called service crashes continuously and all retries from the various clients are exhausted? I prefer falling back to a last known good version. There are a couple of strategies that can enable this &lt;code&gt;last-known-good-version&lt;/code&gt; policy on the infrastructure and client side, and we'll briefly touch upon each of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployments&lt;/strong&gt; The simplest option from an infrastructure perspective is to redeploy the last known stable version of the service, under the assumption that the downstream apps are still compatible with this older version. This is easy to do in Kubernetes, which saves previous &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-to-a-previous-revision" rel="noopener noreferrer"&gt;revisions&lt;/a&gt; of deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cached at downstream&lt;/strong&gt; Another way is for clients to save the last successful response and fall back on it in case of failures from the service; showing a stale-data prompt to the end user on the browser or mobile UI is a good option here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Caching at downstream&lt;/strong&gt; &lt;br&gt;
The browser, or any client for that matter, continuously writes data to an in-memory store until it receives a heartbeat from the upstream. This mechanism offers various implementations for both UI and headless clients that make service calls through gRPC or REST. Here is a summary of what to do, regardless of the type of client.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clients are registered on their first API Call for the service to keep track&lt;/li&gt;
&lt;li&gt;Subsequent updates to the clients are managed as a push from the service to the client&lt;/li&gt;
&lt;li&gt;Clients retain the state locally: Redux on the browser; Redis or Memcached for headless clients (psst ... LinkedHashMaps too, if your soul allows that 😏)&lt;/li&gt;
&lt;li&gt;If you're not at a scale where you can afford push, you can use tools like RTK for ReactJS and the NgRx store for Angular and keep pulling state updates; be sure to inform the end user that they might be seeing stale data when you get any 5XX status errors&lt;/li&gt;
&lt;/ul&gt;
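&lt;p&gt;For headless clients, the fallback itself can be a thin wrapper that remembers the last successful response per key and serves it when a call fails. A minimal in-memory sketch (the names are illustrative; in production you'd likely back this with Redis or Memcached rather than a map):&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class LastKnownGoodCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    // Calls the upstream; on success caches and returns the fresh value,
    // on failure falls back to the last known good value (if any).
    public V fetch(K key, Supplier<V> upstreamCall) {
        try {
            V fresh = upstreamCall.get();
            lastGood.put(key, fresh);
            return fresh;
        } catch (RuntimeException e) {
            V stale = lastGood.get(key);
            if (stale == null) {
                throw e; // nothing to fall back on: propagate the failure
            }
            return stale; // caller should flag this as possibly stale data
        }
    }
}
```

&lt;p&gt;Whenever the stale branch is taken, the client should surface the stale-data prompt described above.&lt;/p&gt;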

&lt;h3&gt;
  
  
  Patterns for sustained outages
&lt;/h3&gt;

&lt;p&gt;We'd be lucky if every outage in a distributed architecture were just a blip, but they are not. Hence, we have to build our systems to handle long-lived outages too. Here are some of the patterns that help in this regard.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bulkheads
&lt;/h4&gt;

&lt;p&gt;Bulkheads address the contingency of outages caused by slow upstream services. While the ideal solution is to fix the upstream issue, it's not always feasible. Consider a scenario where the service (X) you're calling relies on another service (Y) that exhibits sluggish response times. If service (X) experiences a high volume of incoming traffic, a significant portion of its threads may be left waiting for the slower upstream service (Y) to respond. This waiting not only slows down the service (X) but also increases the rate of dropped requests, leading to more client retries and exacerbating the bottleneck.&lt;/p&gt;

&lt;p&gt;To mitigate this issue, one effective approach is to localize the impact of failures. For instance, you can create a dedicated thread pool with a limited number of threads for calling the slower service. By doing so, you confine the effects of slowness and timeouts to a specific API call, thereby enhancing the overall service throughput.&lt;/p&gt;
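&lt;p&gt;Such a bulkhead can be as simple as a small, dedicated thread pool with a bounded queue in front of the slow dependency; when the pool is saturated, callers fail fast instead of tying up the service's main request threads. A rough sketch (the pool sizes and names are illustrative):&lt;/p&gt;

```java
import java.util.concurrent.*;

public class SkuServiceBulkhead {
    // Only this small pool ever waits on the slow upstream; a full queue
    // rejects new work immediately instead of letting waits pile up.
    private final ExecutorService pool = new ThreadPoolExecutor(
            4, 4,                                  // fixed pool of 4 threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(16),          // bounded backlog
            new ThreadPoolExecutor.AbortPolicy()); // fail fast when saturated

    public Future<String> fetchAvailability(Callable<String> slowCall) {
        return pool.submit(slowCall); // throws RejectedExecutionException when full
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

&lt;p&gt;Slowness in the SKU call can now exhaust at most four threads and sixteen queued requests; the rest of the service keeps its throughput.&lt;/p&gt;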

&lt;h4&gt;
  
  
  Circuit Breakers
&lt;/h4&gt;

&lt;p&gt;Circuit breakers could easily be avoided if we only wrote services that never go down! The reality, however, is that our applications often rely on external services developed by others. In these situations, the Circuit Breaker pattern becomes invaluable. It routes all traffic between services through a proxy, which promptly starts rejecting requests once a defined threshold of failures is reached. This pattern proves particularly useful during prolonged network outages in external services, which could otherwise lead to outages in the calling services. Nevertheless, ensuring a seamless user experience in such scenarios is vital, and we've found two approaches to be effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notify users of the outage in the affected area while enabling them to use other parts of the system.&lt;/li&gt;
&lt;li&gt;Allow the client to cache user transactions, providing a "202 Accepted" response instead of "200" or "201" as usual, and resume these transactions once the upstream service becomes available again.&lt;/li&gt;
&lt;/ul&gt;
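&lt;p&gt;At its core, the proxy is a small state machine: count consecutive failures, trip open at a threshold, and let a trial request through after a cool-down. A minimal sketch of the closed/open transitions (the threshold and timings are illustrative; libraries like Resilience4j provide production-grade implementations):&lt;/p&gt;

```java
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> upstream) {
        if (openedAt >= 0) { // OPEN: reject until the cool-down elapses
            if (System.currentTimeMillis() - openedAt < openMillis) {
                throw new IllegalStateException("circuit open, rejecting call");
            }
            openedAt = -1; // HALF-OPEN: let one trial call through
        }
        try {
            T result = upstream.get();
            consecutiveFailures = 0; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip OPEN
            }
            throw e;
        }
    }
}
```

&lt;p&gt;While the circuit is open, the caller can fall back to the cached-response or 202-and-resume approaches above instead of waiting on a dead upstream.&lt;/p&gt;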

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Despite a cloud provider's commitment to high availability, network failures remain inevitable given the vast scale and unpredictable nature of these networks, and that realization underscores the critical need for resilient systems. This journey immerses us in the realm of distributed computing, challenging us as engineers to arm ourselves with strategies for fault tolerance and resilience. Techniques like retries, last-known-good-version policies, and client-server architectures with state management on both ends equip us to confront the unpredictability of network outages.&lt;/p&gt;

&lt;p&gt;As we navigate the intricacies of distributed systems, the adoption of these strategies becomes imperative to ensure smooth user experiences and system stability. Welcome to the world of Microservices in the Cloud, where challenges inspire innovation, and resilience forms the bedrock of our response to unreliable networks. 😉&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>kubernetes</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Serverless CI - Running Jenkins Slaves on AWS EKS Fargate</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 26 Sep 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/the-serverless-ci-running-jenkins-slaves-on-aws-eks-fargate-1g0p</link>
      <guid>https://dev.to/anadimisra/the-serverless-ci-running-jenkins-slaves-on-aws-eks-fargate-1g0p</guid>
      <description>&lt;p&gt;Jenkins requires no introduction, as it stands as the undisputed king of Continuous Integration. Over the years, it has adapted to all the technological disruptions in the industry, including Kubernetes. This blog post delves into an intriguing topic: how to execute on-demand slaves in a remote AWS Fargate cluster from a Jenkins master instance. For those wondering why such a capability is necessary, the following sections will elucidate not only the reasons but also the methods and associated advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything is remote!
&lt;/h2&gt;

&lt;p&gt;Imagine this: you’re running cloud-native services on AWS EKS, and as the diligent engineer that you are, you establish two distinct clusters—one for production and another for all non-production purposes. You might be wondering why you would undertake such an approach. Here’s a hint: consider the blast radius. If you prioritize the security of your cloud-native services as fervently as we do at &lt;a href="https://www.nimblework.com/" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;, this decision makes sense. The dev cluster runs, among other things, all our Continuous Integration and Delivery tools. Speaking of Continuous Delivery, we run nightly pipelines that test the services for performance, security vulnerabilities, and regression before deploying them to the production cluster. In the production cluster, we adhere to a blue-green deployment model. This requires the Jenkins master, which runs on the dev EKS cluster, to run slaves on the production EKS cluster for various deployment, management, and general housekeeping tasks. Having outlined the reasons for this setup, let’s delve into the details of how it is accomplished.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This post assumes you have a running AWS EKS cluster, either on Fargate or Worker Nodes. You can refer to this &lt;a href="https://www.anadimisra.com/post/building-an-eks-fargate-cluster-with-terraform" rel="noopener noreferrer"&gt;article&lt;/a&gt; for creating a Fargate cluster or &lt;a href="https://www.anadimisra.com/post/eks-karpenter" rel="noopener noreferrer"&gt;this one&lt;/a&gt; for Worker nodes if you don’t have them handy. The next step is to install Jenkins on Kubernetes; you can refer to &lt;a href="https://www.jenkins.io/doc/book/installing/kubernetes/" rel="noopener noreferrer"&gt;this&lt;/a&gt; page in the official documentation. Since we’re configuring Jenkins slaves to run on AWS Fargate, also install the &lt;a href="https://plugins.jenkins.io/kubernetes/" rel="noopener noreferrer"&gt;Kubernetes Plugin&lt;/a&gt; in Jenkins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless Jenkins Slaves
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuring Kubernetes Connection in Jenkins Master
&lt;/h3&gt;

&lt;p&gt;A Kubernetes cluster can be configured from the Manage Nodes and Clouds option on the Manage Jenkins page. Navigate to &lt;code&gt;Manage Jenkins &amp;gt; Clouds &amp;gt; New Cloud&lt;/code&gt; to open the Cloud configuration page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-cloud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-cloud.png" alt="Cloud configuration page in Jenkins LTS" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a name for the cloud, choose Kubernetes in the Type section, and click on “Create” to create the cloud configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fnew-cloud-jenkins.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fnew-cloud-jenkins.png" alt="Create a Kubernetes cloud in Jenkins" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expand the Kubernetes Cloud details dropdown; this is where we will configure Jenkins master access to the AWS Fargate cluster.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes URL
&lt;/h4&gt;

&lt;p&gt;Here we add the public API Server URL of the AWS Fargate Cluster. Log in to the AWS Management Console and select Elastic Kubernetes Service, click on the clusters link to list all clusters in your account and then click on the name of the cluster you want to connect Jenkins with to reach the overview page. Copy the API Server URL from the highlighted section in the image below and paste it to the Kubernetes URL field in the cloud configuration page in Jenkins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Feks-api-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Feks-api-server.png" alt="Cluster Info to get details of the API Server" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can get the same information via &lt;code&gt;kubectl&lt;/code&gt; on the command line. Point your kubeconfig at the EKS cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"KEY_ID_HERE"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ACCESS_KEY_HERE"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SESSION_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"SESSION_TOKEN_HERE"&lt;/span&gt; 

aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--name&lt;/span&gt; mycluster

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then run the &lt;code&gt;kubectl&lt;/code&gt; commands as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl cluster-info
Kubernetes control plane is running at https://XXXXXXXXXXXXX.gr7.us-east-1.eks.amazonaws.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll be using a Kubernetes Service Account to authenticate to the API Server. Perform the following steps on the AWS EKS cluster to enable Jenkins access.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a namespace &lt;code&gt;jenkins-jobs&lt;/code&gt; associated with the Fargate profile&lt;/li&gt;
&lt;li&gt;Create a service account named &lt;code&gt;jenkins-service-account&lt;/code&gt; in the namespace&lt;/li&gt;
&lt;li&gt;Create a secret token named &lt;code&gt;jenkins-token&lt;/code&gt; in the namespace associated with the service account&lt;/li&gt;
&lt;li&gt;Create a role-binding providing the service-account ClusterRole admin in this &lt;code&gt;jenkins-jobs&lt;/code&gt; namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can achieve this in multiple ways: via the Management Console, the AWS CLI, or infrastructure-as-code. At NimbleWork we like to stick to IaC and manage these objects in Terraform.&lt;/p&gt;
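&lt;p&gt;For reference, here is a minimal sketch of the same four objects as plain Kubernetes manifests (the RoleBinding name is illustrative; everything else uses the names from the bullets above), which you can apply with &lt;code&gt;kubectl apply -f&lt;/code&gt;:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: jenkins-jobs
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jenkins-service-account
  namespace: jenkins-jobs
---
# Token secret bound to the service account
apiVersion: v1
kind: Secret
metadata:
  name: jenkins-token
  namespace: jenkins-jobs
  annotations:
    kubernetes.io/service-account.name: jenkins-service-account
type: kubernetes.io/service-account-token
---
# Grant the ClusterRole "admin" within the jenkins-jobs namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-admin-binding   # illustrative name
  namespace: jenkins-jobs
subjects:
  - kind: ServiceAccount
    name: jenkins-service-account
    namespace: jenkins-jobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
```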

&lt;p&gt;Let’s retrieve the certificate key now; run the following command to get the service account certificate and token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;% kubectl get secret jenkins-token &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jenkins-jobs &lt;span class="nt"&gt;-o&lt;/span&gt; yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output contains the &lt;code&gt;ca.crt&lt;/code&gt; and &lt;code&gt;token&lt;/code&gt; values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ca.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXX&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXXXXXX&lt;/span&gt;
  &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXX&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/service-account.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/service-account.uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXXXXXX&lt;/span&gt;
  &lt;span class="na"&gt;creationTimestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023-05-20T17:25:16Z"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins-token&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jenkins-jobs&lt;/span&gt;
  &lt;span class="na"&gt;resourceVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3388"&lt;/span&gt;
  &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXXXXXXXXXXX&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/service-account-token&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ca.crt&lt;/code&gt; value is &lt;code&gt;base64&lt;/code&gt;-encoded; decode it via the &lt;code&gt;base64 -d&lt;/code&gt; command and paste the resulting value into the Kubernetes server certificate key field.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kubernetes Namespace
&lt;/h4&gt;

&lt;p&gt;Enter the value &lt;code&gt;jenkins-jobs&lt;/code&gt; here&lt;/p&gt;

&lt;h4&gt;
  
  
  Credentials
&lt;/h4&gt;

&lt;p&gt;Click on &lt;code&gt;Add &amp;gt; Jenkins&lt;/code&gt; and choose Secret text in the Kind dropdown of the Credentials provider pop-up, then add the base64-decoded value of the token from the output of the &lt;code&gt;kubectl get secret&lt;/code&gt; command above to create the credentials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-add-creds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fjenkins-add-creds.png" alt="Adding Service Account token to Jenkins" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Test Connection button; when the connection succeeds, you’ll see the message&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Connected to Kubernetes v1.28-eks-XXXXXXX&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod Template and Retention
&lt;/h4&gt;

&lt;p&gt;Add the following values to Pod-Template and Retention settings&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fpod-settings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fpod-settings.png" alt="Pod Settings" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click Save to finish adding the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the Kubernetes Cloud in Pipelines
&lt;/h2&gt;

&lt;p&gt;Now that we have the Jenkins configuration in place, let’s look at defining builds that run on Fargate Pods as agents. We’re using the declarative pipeline syntax here; the pipeline job DSL should reference the configured cloud name as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;  &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hackernoonkube&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;yamlFile&lt;/span&gt; &lt;span class="s1"&gt;'builder.yaml'&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the labels defined in the Pod Template section above to run the job on a Jenkins agent running as a Fargate Pod.&lt;/p&gt;
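&lt;p&gt;For completeness, here’s a hypothetical sketch of what the &lt;code&gt;builder.yaml&lt;/code&gt; pod template referenced above might contain; the container image and label values are illustrative and should match the labels configured in the Pod Template section:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    # Must match the labels configured in the Jenkins Pod Template
    jenkins-agent: fargate
spec:
  containers:
    - name: maven
      image: maven:3.9-eclipse-temurin-17  # illustrative build image
      command: ["sleep"]
      args: ["infinity"]
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
```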

</description>
      <category>jenkins</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>DevOps or Lean?</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Thu, 21 Sep 2023 16:11:29 +0000</pubDate>
      <link>https://dev.to/anadimisra/devops-or-lean-2e50</link>
      <guid>https://dev.to/anadimisra/devops-or-lean-2e50</guid>
      <description>&lt;p&gt;DevOps has been instrumental in transforming IT responsiveness to business, as is evident by the fact that we have enterprises from all walks of life including Banks adopting DevOps. It’s certainly not a thing that hip web companies do now. However, with this rush to “do” DevOps comes the noise associated with it and hence, the confusion.&lt;/p&gt;

&lt;p&gt;I’ve been asked this quite a bit: "We’re practicing Lean Kanban already, how would DevOps help us?" Or "Do we need to do DevOps when we’re practising Lean?" This blog attempts two things: showing how DevOps borrows Lean principles, and showing that it’s not an either/or equation between these practices, as they complement each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps
&lt;/h2&gt;

&lt;p&gt;While there are many definitions of DevOps the one I like to stick to is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A cultural and professional movement that stresses communication, collaboration and integration between software developers and IT operations professionals while automating the process of software delivery and infrastructure changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Initially there wasn’t necessarily anything more to DevOps than an observe-and-solve sort of cycle; over the years we’ve arrived at a more formalised way of working. I’d still say most of it is common sense, but creating a structure or nomenclature does help people identify with a movement. DevOps therefore has certain principles that make up most of what we know as DevOps today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lean
&lt;/h2&gt;

&lt;p&gt;Most of us would know Lean well enough; the classic definition in the IT industry context is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lean IT applies the key ideas behind lean production to the development and management of IT products and services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And to know more about Lean Production you can start from here. Let’s look at the lean principles before I elaborate on this blog’s theme.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define value precisely from the perspective of the end customer&lt;/li&gt;
&lt;li&gt;Identify the entire value stream for each service, product or product family and eliminate waste&lt;/li&gt;
&lt;li&gt;Make the remaining value-creating steps flow&lt;/li&gt;
&lt;li&gt;As flow is introduced, let the customer pull what the customer wants when the customer wants it&lt;/li&gt;
&lt;li&gt;Pursue perfection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Three Ways of DevOps
&lt;/h2&gt;

&lt;p&gt;AFAIK the Three Ways of DevOps first appeared in a &lt;a href="https://itrevolution.com/the-three-ways-principles-underpinning-devops/" rel="noopener noreferrer"&gt;blog-post&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/realgenekim/" rel="noopener noreferrer"&gt;Gene Kim&lt;/a&gt;, and there’s been quite some commentary (including criticism) on them. I won’t delve into just what the three ways are, but into how they appear, to my eyes, derived from Lean.&lt;/p&gt;

&lt;h3&gt;
  
  
  The First Way
&lt;/h3&gt;

&lt;p&gt;The first way looks at maximising the flow of work from the “left” (business idea) to the “right” (finished product), represented by the rather deceptively facile diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-firstway.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-firstway.jpg" alt="First Way of DevOps, adapted from https://itrevolution.com blog post" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gene Kim uses the term Systems Thinking for the first way, which is essentially the practice of looking at a system holistically and studying how its parts are connected or dependent so as to improve overall system performance. He goes on to explain how we should maximise the flow of value through the system as a whole, using the Deming Cycle to ensure we deliver value. Another (I’d say better) interpretation is Flow: the flow of value is to be maximised from business to customer by studying each step of the flow, analysing bottlenecks, and solving each of the parts in the context of improving the whole.&lt;/p&gt;

&lt;p&gt;If you stop here and look back at the aforementioned Lean principles, you’ll notice where this might be stemming from: it essentially mandates principles 1 to 4 in this form. This makes sense too; I can passionately argue that you cannot really define the flow of a system as a whole without a meticulous study of each of its parts in a value-stream, so essentially you’d be applying Lean when you’re working at this first way of flow. If you were to implement the first way you would invariably deploy these (in addition to other practices):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuous Flow&lt;/li&gt;
&lt;li&gt;Kanban&lt;/li&gt;
&lt;li&gt;Value Stream Mapping&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s for this reason I believe the first way is an application of Lean to the entire delivery pipeline, unlike Scrum (or similar Agile practices), which unintentionally focus largely on development practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Way
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-secondway.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-secondway.jpg" alt="Second Way of DevOps, adapted from https://itrevolution.com blog post" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second way is about creating amplified feedback loops between each step of the flow, or parts of the system. Beyond providing critical operative information to teams, this is essentially a step towards eliminating overburden and inconsistencies, thereby reducing waste in the system. Let’s see how: when you are able to get feedback from each stage of your delivery life cycle, you can build actionable information about the bottlenecks and act on them. With amplified feedback, therefore, you’d move toward what’s rightly called Tribal Knowledge by &lt;a href="https://itrevolution.com/author/timhunter/" rel="noopener noreferrer"&gt;Tim Hunter&lt;/a&gt;. There isn’t a direct correlation here, but if you look at Lean tools there’s a natural tendency to gain deeper insights at each step, as that would help improve the flow at each step. Having said that, to claim this is necessarily an application of Lean would be a desperate attempt to retrofit everything to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Third Way
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-thirdway.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fdevops-thirdway.jpg" alt="Second Way of DevOps, adapted from https://itrevolution.com blog post" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third way is about fostering continual experimentation and learning, and creating repeatable operations, so as to help increase throughput. This again seems to stem from the pursuit of perfection and the removal of inconsistencies and overburdening that Lean introduced to the software world. While the practices can be manifold, I would again say that the intention is to bring this higher state of knowledge to the system as a whole rather than just its individual parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So I’d like to put this blog’s opening questions to rest by saying there is no DevOps-or-Lean equation; DevOps (like Agile) is Lean. In other words, do not look at DevOps as a framework or isolated practice. DevOps does not substitute your existing knowledge in any way; it complements the knowledge and capabilities of the entire organisation to help them achieve higher performance. As I’ve shown in this article, it plays well with Lean.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Building Observability for Microservices</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Thu, 21 Sep 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/building-observability-for-microservices-338k</link>
      <guid>https://dev.to/anadimisra/building-observability-for-microservices-338k</guid>
      <description>&lt;p&gt;Let’s say you distribute the work that a single highly experienced person is doing to multiple individuals, each performing a specific task. Distributing the work this way may increase throughput and eliminate the single point of failure. However, now you have to monitor many people instead of just one! Observability in microservices addresses a similar issue: how to monitor and proactively address problems in a distributed system? The solution lies in measuring the state of a system through metrics recorded for each of its services. A more software-specific definition is (ref. &lt;a href="https://en.wikipedia.org/wiki/Observability" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article explains setting up an observability stack for Microservices to measure the health and performance of the system, sharing our experiences from operating observability for Microservices at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools
&lt;/h2&gt;

&lt;p&gt;There are a lot of tools, paid and open source, in the observability space. We at NimbleWork, though, prefer the following stack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; - an open source monitoring and alerting toolkit built at &lt;a href="https://soundcloud.com/" rel="noopener noreferrer"&gt;SoundCloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; - a visualisation and analytics tool for data from multiple sources, including Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt; - a tool to manage Kubernetes applications&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/reference/kubectl/" rel="noopener noreferrer"&gt;kubectl&lt;/a&gt; - a command line tool for working on Kubernetes clusters&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;terraform&lt;/a&gt; - a tool to automate provisioning of infrastructure on clouds such as AWS.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’re running this on Kubernetes, specifically on AWS EKS which is also the choice for running microservices at NimbleWork. Let’s dive into getting things working! The post assumes you have a working EKS cluster, you can choose AWS Fargate as described in &lt;a href="https://www.anadimisra.com/post/building-an-eks-fargate-cluster-with-terraform" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog post, or Worker Node cluster as described &lt;a href="https://www.anadimisra.com/post/eks-karpenter" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus on EKS
&lt;/h2&gt;

&lt;p&gt;Let’s look at deploying prometheus first as it serves as a data-source for Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparing Volumes on EKS Node Group.
&lt;/h3&gt;

&lt;p&gt;We will be using persistent storage for Prometheus data, as it’s a bad idea to keep crucial observability data in ephemeral storage, so we declare the volumes first. Since we’re using EKS Node Groups, we have to configure Persistent Volumes for the Prometheus node. EFS is a risky choice here, as Prometheus does not support NFS well enough: while you can certainly get it running on EFS, it remains unstable, with sudden restarts related to errors in the logs about NFS writes. &lt;a href="https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects" rel="noopener noreferrer"&gt;This link&lt;/a&gt; in the Prometheus documentation strongly recommends against it.&lt;/p&gt;

&lt;p&gt;Let’s configure our EKS cluster’s storage using Terraform; the configuration typically goes into a &lt;code&gt;storage.tf&lt;/code&gt; file in the Terraform module you’ll write for managing EKS. We’re following the same structure as in the &lt;a href="https://www.anadimisra.com/post/eks-karpenter" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; on setting up EKS via Terraform, which you can use for reference. You can of course do it manually, but there’s a reason why IaC exists, and I like to respect it!&lt;/p&gt;

&lt;p&gt;Running the Terraform script gives, amongst other things, the values of the file system and its access point ID in the output, which we’ll then use in our helm charts.&lt;/p&gt;
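&lt;p&gt;The volume manifests below reference a storage class named &lt;code&gt;eks-efs-storage-class&lt;/code&gt;; if you create it outside Terraform, a minimal sketch (assuming the AWS EFS CSI driver is installed on the cluster) looks like:&lt;/p&gt;

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: eks-efs-storage-class
# Requires the AWS EFS CSI driver add-on on the cluster
provisioner: efs.csi.aws.com
```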

&lt;h3&gt;
  
  
  Deploying Prometheus via Helm
&lt;/h3&gt;

&lt;p&gt;We prefer the community helm chart for Prometheus, available on &lt;a href="https://github.com/prometheus-community/helm-charts" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. First, add the chart repository using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let’s look at how we can configure it to work with EKS Node Groups. Create a &lt;code&gt;volumes.yaml&lt;/code&gt; file with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-volume-0
  labels:
    type: prometheus-volume
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    name: prometheus-volume-claim
    namespace: observability
  storageClassName: eks-efs-storage-class
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-XXXXXXXXXXXX::fsap-XXXXXXXXXXXX
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-volume-claim
  namespace: observability
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: eks-efs-storage-class
  volumeName: prometheus-volume-0
  resources:
    requests:
      storage: 10Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those who’ve been working with EC2 instances or EKS Node Groups will immediately recognize what we’re doing here: the values for the file system and access point IDs come from the Terraform output described above and go into &lt;code&gt;volumeHandle: fs-XXXXXXXXXXXX::fsap-XXXXXXXXXXXX&lt;/code&gt;. With the storage configuration in place, let’s configure Prometheus for use in EKS. We’re doing a default deployment with only a few changes, as a detailed customisation of the &lt;code&gt;values.yaml&lt;/code&gt; is beyond the scope of this blog. Refer to the &lt;code&gt;prometheus.yaml&lt;/code&gt; file in this &lt;a href="https://gist.github.com/anadimisra/e7cd377bca32deaf6ecb906fd06d63bf" rel="noopener noreferrer"&gt;gist&lt;/a&gt; for deploying Prometheus to EKS. You’ll notice the job names&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubernetes-apiservers&lt;/li&gt;
&lt;li&gt;kubernetes-nodes&lt;/li&gt;
&lt;li&gt;kubernetes-nodes-cadvisor&lt;/li&gt;
&lt;li&gt;kubernetes-service-endpoints&lt;/li&gt;
&lt;li&gt;kubernetes-service-endpoints-slow&lt;/li&gt;
&lt;li&gt;kubernetes-services&lt;/li&gt;
&lt;li&gt;kubernetes-pods&lt;/li&gt;
&lt;li&gt;kubernetes-pods-slow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are Prometheus jobs defined to collect metrics from EKS itself, to monitor the health of the Kubernetes cluster in addition to the microservices. Save the aforementioned file in the gist to a &lt;code&gt;values.yaml&lt;/code&gt; file and install the helm chart using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install -i prometheus prometheus-community/prometheus -f values.yaml -n observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s look at enabling metrics collection for a service written in Spring Boot. Following the configuration reference &lt;a href="https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;, expose the Prometheus endpoint through Spring Boot Actuator in the application or bootstrap YAML (this assumes the &lt;code&gt;micrometer-registry-prometheus&lt;/code&gt; dependency is on the classpath), and add a matching scrape job to the Prometheus configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# application.yaml: expose the Prometheus actuator endpoint
management:
  endpoints:
    web:
      exposure:
        include: "health,prometheus"
---
# prometheus.yml: scrape job for the service
scrape_configs:
  - job_name: "spring"
    metrics_path: "/actuator/prometheus"
    static_configs:
      - targets: ["0.0.0.0:${server.port}"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you deploy this service, you’ll notice the metrics exposed by Spring Boot Actuator are available in Prometheus. You can access Prometheus via the URL exposed through the ALB ingress.&lt;/p&gt;
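&lt;p&gt;As an illustration of the ALB-based access mentioned above, a minimal ingress for the Prometheus server could look like the following sketch; it assumes the AWS Load Balancer Controller is installed, and the ingress name and scheme are illustrative:&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress   # illustrative name
  namespace: observability
  annotations:
    # Handled by the AWS Load Balancer Controller
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-server  # default service name of the community chart
                port:
                  number: 80
```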

&lt;h2&gt;
  
  
  Grafana on EKS
&lt;/h2&gt;

&lt;p&gt;Grafana will be deployed for visualising data collected by prometheus. Let’s see how to get it done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying EFS volumes for Grafana
&lt;/h3&gt;

&lt;p&gt;Assuming we have the EFS block configured for our EKS cluster (ref. this blog post), create a file &lt;code&gt;grafana-volumes.yaml&lt;/code&gt; defining the volumes for Grafana as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-volume-0
  labels:
    type: grafana-storage-volume
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    name: grafana-volume-0
    namespace: observability
  storageClassName: eks-efs-storage-class
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-XXXXXXXXXXXXXXX::fsap-XXXXXXXXXXXXXXX
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-volume-0
  namespace: observability
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: eks-efs-storage-class
  volumeName: grafana-volume-0
  resources:
    requests:
      storage: 10Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following command to create the volumes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f grafana-volumes.yaml -n observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying Grafana via HELM chart
&lt;/h3&gt;

&lt;p&gt;Add the Grafana helm chart repository with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add grafana https://grafana.github.io/helm-charts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;grafana.yaml&lt;/code&gt; file in this &lt;a href="https://gist.github.com/anadimisra/e7cd377bca32deaf6ecb906fd06d63bf" rel="noopener noreferrer"&gt;gist&lt;/a&gt; describes a basic Grafana configuration for working with Prometheus as a data-source. We’re using mostly standard config here, adding prometheus as a data-source via the block&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      version: 1
      url: http://prometheus-server:80
      access: proxy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save it as &lt;code&gt;values.yaml&lt;/code&gt; and deploy the grafana helm chart by running the following command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install -i grafana grafana/grafana -f values.yaml -n observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here again, you can access Grafana via the URL exposed through the ALB ingress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana Dashboards
&lt;/h2&gt;

&lt;p&gt;Now that we have the infrastructure in place, let’s see how we can use it for observability. I’ll limit the examples to Spring Boot for simplicity. Our microservices expose REST APIs and use MongoDB for data persistence. Here’s what we recommend monitoring for such services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spring Boot Microservices
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fgrafana.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.anadimisra.com%2Fassets%2Fimg%2Fposts%2Fgrafana.jpg" alt="JVM Dashboard" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the panels we’ve added to the Grafana dashboard for monitoring Spring Boot microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Facts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uptime Monitor:&lt;/strong&gt; total uptime of pods, can be filtered over namespace, pod names and containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current Memory Allocation:&lt;/strong&gt; current total memory allocated to pods, can be filtered over namespace, pod names and containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current Memory Usage:&lt;/strong&gt; average memory consumption by app containers in each pod&lt;/li&gt;
&lt;li&gt;CPU Load&lt;/li&gt;
&lt;li&gt;CPU Usage&lt;/li&gt;
&lt;li&gt;FATAL, ERROR and WARN logs count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Execution Time:&lt;/strong&gt; Total time taken per method per repository, a high value indicates a performance issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Execution Count:&lt;/strong&gt; Number of methods executed against each of the repositories, gives an idea of traffic on each collection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Avg Execution Time:&lt;/strong&gt; Average time taken per method per repository; a high value indicates a performance issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository Method Execution Max Time:&lt;/strong&gt; Max recorded time of each MongoDB operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time in Seconds of Operations Per Collection:&lt;/strong&gt; The maximum time in seconds it took to execute a particular command on a MongoDB collection. This is the worst-case figure; you can use it to identify the slowest operations per collection in your app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of Commands Per Collection Per Second:&lt;/strong&gt; The number of commands executed on each of the collections per second; helps you optimise collection sharding and indexing to suit the read/write operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations Average Time Per Collection:&lt;/strong&gt; The average time of various operations running in a collection, this is a quick view to spot slowest transactions for an app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mongo Operation Max Time:&lt;/strong&gt; max time of operation on each of the collections, this gives a sense of what might be skewing the quick view averages in previous panel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Dashboard config JSON with the aforementioned panels and additional ones for measuring Java Heap, Thread and GC Details can be imported from the &lt;code&gt;springboot.json&lt;/code&gt; file in this &lt;a href="https://gist.github.com/anadimisra/e7cd377bca32deaf6ecb906fd06d63bf" rel="noopener noreferrer"&gt;gist&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summing it up
&lt;/h2&gt;

&lt;p&gt;Setting up observability for microservices isn’t quite intuitive, but neither is it an uphill task given the tools at our disposal. Having said that, this is only the starting step of your observability journey. Once you have the infrastructure set up as demonstrated here, you still have to meticulously design the graphs you create in Grafana, which in turn depends on the kind of services you are running and the parts of your system you want to track. I hope fellow SRE and DevOps engineers find the steps in this post helpful in setting up their own observability stack.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>grafana</category>
      <category>prometheus</category>
      <category>devops</category>
    </item>
    <item>
      <title>Autoscaling EKS Node Groups with Karpenter</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 19 Sep 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/autoscaling-eks-node-groups-with-karpenter-3oc7</link>
      <guid>https://dev.to/anadimisra/autoscaling-eks-node-groups-with-karpenter-3oc7</guid>
      <description>&lt;p&gt;&lt;a href="https://karpenter.sh" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt; is an open-source tool for automating node provisioning in Kubernetes. Karpenter aims to enhance both the effectiveness and affordability of managing workloads within a Kubernetes cluster. The core mechanics of Karpenter involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring unschedulable pods identified by the Kubernetes scheduler.&lt;/li&gt;
&lt;li&gt;Scrutinizing the scheduling constraints, including resource requests, node selectors, affinities, tolerations, and topology spread constraints, as stipulated by the pods.&lt;/li&gt;
&lt;li&gt;Provisioning nodes that precisely align with the pods' requirements.&lt;/li&gt;
&lt;li&gt;Streamlining cluster resource usage by removing nodes once their services are no longer required.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why We Moved to Karpenter
&lt;/h2&gt;

&lt;p&gt;We at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt; used AWS Fargate in the past for running on-demand, short-lived or one-off workloads, one example being Jenkins slaves running in AWS Fargate while the Master runs on worker nodes. Fargate is good in the sense that it takes care of managing node infrastructure, but it can cost a premium if you're using it for long-running workloads. For this reason, an EKS deployment with worker nodes is the preferred path. But with that comes a new problem: unlike Fargate, we not only have to manage creating nodes and node groups, we also have to ensure that utilisation of our EC2 nodes is optimal. It comes back to hurt us especially when we realise there's an entire VM running at just 10% of its CPU/memory capacity because it has two active pods, which we could have moved to another node so this one could be reclaimed. In the past we relied on a cocktail of Prometheus alerts and Fluent Bit monitoring data to conclude that we could reschedule pods and clean up unused nodes. But any self-respecting Engineering Manager would tell you they'd jump to a better alternative as soon as they found one. For us, Karpenter was that alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Karpenter allows you to define Provisioners, which are the heart of its cluster-management capability. When initially installing Karpenter, you establish a default Provisioner, which imposes specific constraints on the nodes Karpenter creates and on the pods eligible to run on those nodes. These constraints include defining taints to restrict pod deployment on Karpenter-created nodes, establishing startup taints to indicate temporary node tainting, narrowing node creation to preferred zones, instance types and CPU architectures, and configuring default settings for node expiration. The Provisioner, in essence, gives you fine-grained control over resource allocation within your Kubernetes cluster. You can read more on Provisioners &lt;a href="https://karpenter.sh/preview/concepts/provisioners/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying EKS Cluster
&lt;/h2&gt;

&lt;p&gt;Here's how to deploy the EKS cluster with Karpenter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the VPC
&lt;/h3&gt;

&lt;p&gt;Before we begin, let's deploy the AWS VPC to run our EKS cluster. We'll be using Terraform for provisioning on the AWS Cloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc" {
  source               = "terraform-aws-modules/vpc/aws"
  version              = "3.19.0"
  name                 = "mycluster-vpc"
  cidr                 = var.vpc_cidr
  azs                  = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets      = var.private_subnets_cidr
  public_subnets       = var.public_subnets_cidr
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/mycluster" = "shared"
    "kubernetes.io/role/elb"          = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/mycluster" = "shared"
    "kubernetes.io/role/internal-elb" = "1"
    "karpenter.sh/discovery"          = "mycluster"
  }

  tags = {
    "kubernetes.io/cluster/mycluster" = "shared"
  }

}

module "vpc-security-group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "4.17.1"
  create  = true

  name        = "mycluster-security-group"
  description = "Security group for VPC"
  vpc_id      = module.vpc.vpc_id

  ingress_with_cidr_blocks = var.ingress_rules
  ingress_with_self = [
    {
      from_port   = 0
      to_port     = 0
      protocol    = -1
      description = "Ingress with Self"
    }
  ]
  egress_with_cidr_blocks = [{
    cidr_blocks = "0.0.0.0/0"
    from_port   = 0
    to_port     = 0
    protocol    = -1
  }]
  tags = {
    Name                      = "mycluster-security-group"
    "karpenter.sh/discovery"  = "mycluster"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're using the community-contributed modules here to spin up a VPC with public and private subnets, and ingress rules. For those interested in more detail, here's a simple example of what could go in the ingress rules&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "ingress_rules" {
  type        = list(map(string))
  description = "VPC Default Security Group Ingress Rules"
  default = [
    {
      cidr_blocks = "0.0.0.0/0"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      description = "Karpenter ingress allow"
    },
    { # other CIDR blocks you might want to restrict access to (for example, if this were your dev cluster)
      cidr_blocks = "XX.XX.XX.XXX/XX"
      from_port   = 0
      to_port     = 0
      protocol    = -1
      description = "MyCluster-NAT"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;"karpenter.sh/discovery"  = "mycluster"&lt;/code&gt; tag in the vpc module and in the security group tags is what lets Karpenter discover the subnets and security groups to use when provisioning nodes for this cluster. You can get the VPC up and running via the&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform plan
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;commands. It's good practice to define the key values you'll need in other modules as outputs of this module run; also, we save the state in an S3 bucket as our Terraform builds run from a Jenkins slave on Fargate with ephemeral storage. You'd see the following values in the console output of the &lt;code&gt;terraform apply&lt;/code&gt; command if you've included publishing the VPC and security group IDs in the &lt;code&gt;outputs.tf&lt;/code&gt; of your vpc module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security_group_id &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sg-dkfjksdhf83983c883"&lt;/span&gt;
vpc_id &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-2l4jc2lj4l2cbj42"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
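
&lt;p&gt;For reference, these values come from an &lt;code&gt;outputs.tf&lt;/code&gt; along these lines (a minimal sketch, assuming the module names used above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "vpc_id" {
  description = "ID of the VPC"
  value       = module.vpc.vpc_id
}

output "security_group_id" {
  description = "ID of the VPC security group"
  value       = module.vpc-security-group.security_group_id
}

output "private_subnets" {
  description = "Private subnet IDs, consumed later by the node group module"
  value       = module.vpc.private_subnets
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;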



&lt;p&gt;With this our VPC is ready; let's deploy the EKS cluster with node groups and Karpenter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying EKS Cluster with Node Group Workers and Karpenter
&lt;/h3&gt;

&lt;p&gt;Add the following code to your Terraform module to include EKS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "eks-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "19.12.0"
  cluster_name    = "mycluster"
  cluster_version = "1.26"
  subnet_ids      = ["subnet-XX", "subnet-YY", "subnet-ZZ"]
  create_cloudwatch_log_group = false
  tags = {
    Name                      = "mycluster"
    "karpenter.sh/discovery"  = "mycluster"
  }

  vpc_id = "vpc-2l4jc2lj4l2cbj42"

  cluster_endpoint_public_access_cidrs = ["XX.XX.XX.XXX/YY"] #important if the cluster_endpoint_public_access is set to true
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_security_group_id            = "sg-dkfjksdhf83983c883"
}

module "mycluster-workernodes" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "19.12.0"

  name            = "${var.eks_cluster_name}-services"
  cluster_name    = module.eks-cluster.cluster_name
  cluster_version = module.eks-cluster.cluster_version
  create_iam_role = false
  iam_role_arn    = aws_iam_role.nodegroup_role.arn

  subnet_ids = flatten([data.terraform_remote_state.db.outputs.private_subnets])

  cluster_primary_security_group_id = "sg-dkfjksdhf83983c883"
  vpc_security_group_ids            = [module.eks-cluster.cluster_security_group_id]

  min_size     = 1
  max_size     = 5
  desired_size = 2

  instance_types     = ["t3.large"]
  capacity_type      = "ON_DEMAND"
  labels = {
    NodeGroups = "mycluster-workernodes"
  }

  tags = {
    Name                      = "mycluster-workernodes"
    "karpenter.sh/discovery"  = module.eks-cluster.cluster_name
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's the same &lt;code&gt;"karpenter.sh/discovery"&lt;/code&gt; tag at play here too, and that's it! You have an EKS cluster with Karpenter-managed provisioning ready!&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Karpenter Provisioners
&lt;/h2&gt;

&lt;p&gt;Now that we have a cluster ready, let's look at using Karpenter to manage the pods. We'll define provisioners for different purposes and then associate pods with each of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provisioner for Nodes running Spot Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a good alternative to Fargate, especially for running one-off workloads that do not live beyond job completion. Here's an example of a Karpenter provisioner using spot instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# spot default&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provisioner&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spot"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-category"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-cpu"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;providerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;consolidation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWSNodeTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;subnetSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
  &lt;span class="na"&gt;securityGroupSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use this provisioner, add the following to the &lt;code&gt;nodeSelector&lt;/code&gt; in your Kubernetes deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;karpenter.sh/provisioner-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will provision the pods to run on spot instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provisioner for Nodes running On-Demand Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's a sample of how to use on-demand instances for worker nodes and schedule pods on them. The following file defines a provisioner for on-demand instances&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# on-demand&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provisioner&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# taints:&lt;/span&gt;
  &lt;span class="c1"&gt;#   - key: "name"&lt;/span&gt;
  &lt;span class="c1"&gt;#     value: "on-demand"&lt;/span&gt;
  &lt;span class="c1"&gt;#     effect: "NoSchedule"&lt;/span&gt;
  &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on-demand"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-category"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;karpenter.k8s.aws/instance-cpu"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topology.kubernetes.io/zone"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NotIn&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1b"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;providerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
  &lt;span class="c1"&gt;# consolidation:&lt;/span&gt;
  &lt;span class="c1"&gt;#   enabled: true&lt;/span&gt;
  &lt;span class="na"&gt;ttlSecondsAfterEmpty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWSNodeTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;subnetSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
  &lt;span class="na"&gt;securityGroupSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mycluster&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again we can use the &lt;code&gt;nodeSelector&lt;/code&gt; in the Kubernetes deployment YAML to schedule pods on these nodes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;karpenter.sh/provisioner-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;on-demand&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
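
&lt;p&gt;Should you enable the commented-out taints in the provisioner above, the deployment would also need a matching toleration alongside the &lt;code&gt;nodeSelector&lt;/code&gt;; a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeSelector:
  karpenter.sh/provisioner-name: on-demand
tolerations:
  - key: "name"
    operator: "Equal"
    value: "on-demand"
    effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;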



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is a simplified example of how to get started with Karpenter on AWS EKS. Production-grade deployments require more nuanced provisioner definitions, including but not limited to resource limits and eviction policies. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attribution&lt;/strong&gt;&lt;br&gt;
Image Credits: Photo by &lt;a href="https://unsplash.com/@growtika?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Growtika&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/qPkdgA-KDik?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>karpenter</category>
      <category>aws</category>
    </item>
    <item>
      <title>Application Integration Pattern : the Choreography way</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Thu, 07 Sep 2023 17:04:14 +0000</pubDate>
      <link>https://dev.to/anadimisra/application-integration-pattern-the-choreography-way-18j3</link>
      <guid>https://dev.to/anadimisra/application-integration-pattern-the-choreography-way-18j3</guid>
      <description>&lt;p&gt;Integration can be quite complex, especially when you are talking about making your swanky new application work with a legacy system. I've seen multiple generations of patterns to tackle this problem gain prominence over the past decade and a half. Starting with SOAP web-services to Message Bus to Enterprise Service Bus, REST API and even WebSockets! Each of them has been the blue-eyed boy, or heartthrob of their generation and seen their popularity vain as the new kid on the block got more famous. Having said that, and allowing myself to feel relatively old, whatever the approach, the bigger challenge in integrating legacy services has been that they might eventually get replaced with other services soon, or soon enough for you to get into yet another exercise of re-integrating with the new system all over again with a sense of repetitive labour induced fatigue. Shameless plug, We at &lt;a href="https://www.nimblework.com" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt; were facing this very strategic question when our SaaS journey started a couple of years ago; and this blog is about our learnings and experience in using Choreography as the cornerstone of our integration approach just like it was for &lt;a href="https://www.anadimisra.com/post/microservices-choreography" rel="noopener noreferrer"&gt;implementing Strangler Fig&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choreography
&lt;/h2&gt;

&lt;p&gt;There are many definitions of Choreography if you happen to browse popular articles or blogs over the web. The way I like to put it is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Choreography represents the communication between microservices through the publishing and subscribing of domain events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It essentially means that each service publishes a Domain Event notifying of a change that occurred as the result of an action on that service. Services to whom this event is of interest subscribe to it and act accordingly. It's an extremely efficient, lightweight, distributed chain of command. Choreography has emerged as the mechanism of choice for implementing the Saga pattern in microservices, and I've seen it increasingly replace orchestrators. I'd like to point out, however, that it's not a universally applicable solution or design in the microservices world; there's some criticism of this approach going around too. I haven't burnt my hands using it yet, so I'd like to think the criticism stems more from design issues elsewhere which might have led to bottlenecks when using Choreography.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;Here's our approach: both the legacy system and the new systems emit Domain Events, which are published to Kafka. Each Kafka topic is backed by a MongoDB collection using the Kafka MongoDB connector, for fault tolerance and high availability, and each of these collections is replicated over a MongoDB replica set. In case of failed processing of a domain event, the failing side emits an event for compensating transactions instead of rolling back. Any outage on either side is covered by two provisions&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kafka messages are persisted for durations longer than the outage of either service&lt;/li&gt;
&lt;li&gt;On coming back up, each service resumes from the consumer offset at which it went offline&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There isn't much we have to do as developers for either of these provisions, but knowing they exist is more than just handy.&lt;/p&gt;
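
&lt;p&gt;Backing a topic with a collection is connector configuration rather than application code. Here's a hedged sketch of a MongoDB Kafka sink connector definition; the topic, database and connection details are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "domain-events-sink",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "domain-events",
    "connection.uri": "mongodb://mongo-0:27017,mongo-1:27017/?replicaSet=rs0",
    "database": "events",
    "collection": "domain_events"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;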

&lt;h2&gt;
  
  
  Handling Failures
&lt;/h2&gt;

&lt;p&gt;Handling failures, as in most modern architectures, is a layered affair here. So let's look at failures from the perspective of where we choose to tackle them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outages&lt;/strong&gt;: are best handled at the infrastructure layer. If you're deploying both the legacy and the new services in Kubernetes (which makes sense, given you'll eventually phase out the legacy system with newer services), you can leave it to Kubernetes liveness and readiness probes, coupled with deploying services as a stateful or replica set, to get over outages. If the legacy system is too old to run as a stateless service, you're still not out of luck: you can deploy the traditional sticky-session cluster fronted by a load balancer using internal ingresses in Kubernetes too, and there the master-slave cluster gives you some degree of protection against outages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dirty Reads&lt;/strong&gt;: if Choreography has scared the daylights out of any of your colleagues, it's most likely for fear of this. A comprehensive explanation of how to mitigate it by design is a whole blog post in itself, so I'll keep it short. Dirty reads are more likely in bi-directional integration, by which I mean both services read from and write to each other. If you find yourself doing that, just stop; you've probably created too many routes for data change. It's better to design so that all operations on data owned by the new services happen in the new service only, with the associated events flowing to the legacy app, and vice versa. Even if you have a duplex of Domain Events, don't allow the legacy system to update its copy of new-system data directly, nor the new system to edit its copy of legacy-system data directly. Keep a clear separation of traffic, is what I'm saying essentially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotence&lt;/strong&gt;: this is a tricky one. While idempotent API operations can be guaranteed within the context of each of the integrated systems, the consumption of Domain Events also has to preserve idempotence on both sides of the integration. A simple rule of thumb is to follow what the APIs do: a CREATE call is the easiest to handle, since neither it nor the Create Domain Event can be idempotent anyway. What about DELETE and PUT? That's where idempotence has to be preserved (assuming the legacy system respected this principle in the first place, or else God bless you!). We've noticed that using upsert operations for processing a PUT Domain Event helps a lot. DELETE idempotence is tricky: should we keep returning 200 OK when deleting the same ID repeatedly, or throw an error the second time around? We chose the first option in our REST API design, so we followed the same norm when processing a DELETE Domain Event, letting it fail silently when deleting an already-deleted entity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choreography is a strong pattern in microservices and is here to stay for good. This blog post, however, was our attempt to show that we can improvise, using well-established patterns for much wider goals than they were envisioned for. It requires a bit of imagination, lots and lots of reading, hours of experimentation and, not to forget, the temperament to handle failures. I hope this blog gives fellow engineers one more option to consider during the transition phase of a microservices journey, where new services have to integrate and play well with legacy systems built on completely different designs and patterns altogether.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>microservices</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Strangling Monolith to Microservices with Choreography and CQRS</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Mon, 17 Jul 2023 05:30:00 +0000</pubDate>
      <link>https://dev.to/anadimisra/strangling-monolith-to-microservices-with-choreography-and-cqrs-5d92</link>
      <guid>https://dev.to/anadimisra/strangling-monolith-to-microservices-with-choreography-and-cqrs-5d92</guid>
      <description>&lt;p&gt;Monoliths aren't necessarily bad, but in some cases, they don't perform the job well enough. And no, autoscaling isn't the only solution. You can have stateless monoliths running as a replicaset with HPA or even VPA (if you're not stuck with JVM on fixed heap space) to achieve autoscaling. Taking that thought further, Kubernetes isn't even necessary. You can perform autoscaling on VMs, although managing the spin-up and down times and IAC around it on VMs can be challenging. With that said, this post discusses what to do if you find yourself on a mission to break down a monolith into microservices. The most highly praised approach, the Strangler Application (inspired by the &lt;a href="(https://martinfowler.com/bliki/StranglerFigApplication.html)"&gt;Strangler Fig application&lt;/a&gt; from Martin Fowler), is a great way to embark on this journey. There are plenty of blogs that explain the approach, so that's not the reason why I'm typing away on a cool breezy evening in Bangalore with Arijit Singh's magical voice in the background (no beer). If you're new to the pattern itself, you can read about it here. This post describes an interesting strategy we implemented to refactor a powerful enterprise solution platform monolith into microservices while using the Strangler pattern. If you're new to the pattern itself &lt;a href="https://microservices.io/patterns/refactoring/strangler-application.html" rel="noopener noreferrer"&gt;read here first&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A bit of background, and a shameless plug: this is our hands-on experience of moving to microservices at &lt;a href="https://www.nimblework.com/" rel="noopener noreferrer"&gt;NimbleWork&lt;/a&gt;. Let's explore how we utilized Choreography and CQRS as an implementation strategy to strangle an EJB-based monolith into microservices. This approach has enabled us to deliver new features more quickly and efficiently handle traffic bursts, such as when users log in to file timesheets or move cards to "done" on a Friday evening.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choreography
&lt;/h2&gt;

&lt;p&gt;Choreography represents the communication between microservices through the publishing and subscribing of domain events. It differs from traditional messaging in that there is no two-way message sent/acknowledged flow. Instead, it follows a publish-and-subscribe model where downstream services decide whether to act upon the messages they receive. To understand this concept better, think of each microservice as a radio station broadcasting music on a specific FM frequency. If you want to listen to their music, you tune your radio to their frequency. The microservice doesn't actively manage every connected client; it's up to the client (in this case, the radio) to connect and respond to the received data.&lt;/p&gt;

&lt;p&gt;In scenarios where you are employing the Database per Service pattern and a business transaction spans multiple services, relying on traditional models like the 2PC (two-phase commit) or messaging from the SOA era is not feasible. This is where choreography proves to be extremely useful. The diagram below illustrates this concept within a fictional e-commerce system, showcasing the handling of workflows when a customer signs up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043k8uwjgx761zvynlyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043k8uwjgx761zvynlyx.png" alt="Microservices Choreography" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the Loyalty, Delivery, and Notify services have &lt;strong&gt;tuned&lt;/strong&gt; into a channel where the Customer Service publishes the Customer Created Event. It's important to note that the Customer Service itself is unaware of who is listening to the messages.&lt;/p&gt;
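
&lt;p&gt;The radio-station analogy can be sketched as a minimal publish-subscribe loop. This is an illustration only, with an in-memory channel standing in for a real broker; the service names match the figure, but the API is made up:&lt;/p&gt;

```python
# Choreography in miniature: the publisher broadcasts Domain Events on a
# channel and never tracks who is listening; each subscriber decides on its
# own whether and how to react. The Channel class is a stand-in for a broker.

class Channel:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        # The Customer Service neither knows nor cares who these handlers are.
        for handler in self.subscribers:
            handler(event)

customer_created = Channel()
reactions = []

# Loyalty, Delivery and Notify "tune in" independently.
customer_created.subscribe(lambda e: reactions.append(("loyalty", e["customer_id"])))
customer_created.subscribe(lambda e: reactions.append(("delivery", e["customer_id"])))
customer_created.subscribe(lambda e: reactions.append(("notify", e["customer_id"])))

customer_created.publish({"type": "CustomerCreated", "customer_id": "c-1"})
```

&lt;p&gt;Adding a fourth downstream service is one more &lt;code&gt;subscribe&lt;/code&gt; call; the publishing side never changes, which is exactly why choreography decouples services so well.&lt;/p&gt;
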

&lt;h3&gt;
  
  
  How does it come in handy in Strangler?
&lt;/h3&gt;

&lt;p&gt;Simply put, I introduce a spy into the monolith that tracks all activities within it. The implementation depends on the framework you are using. For example, with EJBs, you can easily create an EventListener that gets invoked after the creation, updating, or deletion of your business objects. This listener captures the business object and publishes corresponding events such as EntityCreated, EntityUpdated, and EntityDeleted. In Hibernate, you can use JPA Lifecycle Events, or in Spring Boot, you can use listeners on entity lifecycle methods. Regardless of your chosen approach, it's crucial to execute these operations asynchronously to avoid keeping any threads tied to the transaction or the initiating request. The key is to ensure that these operations are quick and detached from the actual flow between layers in the application you're refactoring. Now, you have a continuous stream of events relaying your business objects or transactional data to a message/event broker. From this point onward, you have two main options to consider. But before delving into those options, let's first understand CQRS.&lt;/p&gt;
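
&lt;p&gt;The essential mechanics of that spy, a lifecycle hook handing the event to a background publisher so the transaction thread returns immediately, can be sketched like this. It's a Python stand-in for what would be an EJB/JPA lifecycle listener, with a queue playing the role of the broker:&lt;/p&gt;

```python
# Sketch of the "spy": an after-save hook enqueues a Domain Event and returns
# at once; a background thread drains the queue and publishes. In the real
# system the publish goes to a message/event broker, not an in-memory list.
import queue
import threading

outbox = queue.Queue()
published = []

def publisher_loop():
    while True:
        event = outbox.get()
        if event is None:  # sentinel used to stop the example cleanly
            break
        published.append(event)  # stand-in for broker.publish(event)

worker = threading.Thread(target=publisher_loop, daemon=True)
worker.start()

def after_save(entity_id, operation):
    # Called from the entity lifecycle hook; enqueue and return immediately,
    # keeping the transaction/request thread free.
    outbox.put({"type": "Entity" + operation, "id": entity_id})

after_save("order-7", "Created")
after_save("order-7", "Updated")
outbox.put(None)  # drain the queue for the example
worker.join()
```
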

&lt;h2&gt;
  
  
  CQRS
&lt;/h2&gt;

&lt;p&gt;CQRS, coined by &lt;a href="https://twitter.com/gregyoung" rel="noopener noreferrer"&gt;Greg Young&lt;/a&gt;, stands for Command Query Responsibility Segregation, which is a pattern closely aligned with Choreography. When running microservices based on the Database per Service pattern or utilizing Choreography, it becomes challenging to query data that spans multiple microservices. This is where CQRS comes into play. In this approach, services that write data to their respective databases use Choreography to publish domain events, such as OrderCreated or NewSubscription in our e-commerce example. Downstream services then consume these events through event handlers to persist the data in a read-only database. This approach provides us with the flexibility to easily create multiple denormalized views of the data across various services. It also simplifies querying what would have been complex joins in a monolithic architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybywvrhngrqonmzvzmoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybywvrhngrqonmzvzmoe.png" alt="CQRS Example where order history is the read-only DB" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;
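
&lt;p&gt;The query side can be reduced to an event handler folding events into a denormalized view. Here is a sketch, with made-up event fields, of how an OrderCreated event builds the per-customer order history that would otherwise need a join:&lt;/p&gt;

```python
# CQRS query side in miniature: an event handler projects OrderCreated events
# into a denormalized, read-only order-history view keyed by customer.
# Field names are illustrative only.

order_history = {}

def on_order_created(event):
    history = order_history.setdefault(event["customer_id"], [])
    history.append({
        "order_id": event["order_id"],
        "total": event["total"],
        # Duplicated on purpose so queries never join back to the write side.
        "customer_name": event["customer_name"],
    })

on_order_created({"customer_id": "c-1", "order_id": "o-1", "total": 20, "customer_name": "Ada"})
on_order_created({"customer_id": "c-1", "order_id": "o-2", "total": 35, "customer_name": "Ada"})

# The read side answers without touching the write database.
orders = order_history["c-1"]
```
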

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;So, there you have it. On one side, the Monolith emits domain events, and a consumer at the other end of the message queue/message broker processes these domain events to save the business objects into another store. For example, we save them in MongoDB, which then serves as a backend for Reports and Analytics, Mobile Apps, and the &lt;a href="https://www.nimblework.com/products/nimble/cafe/" rel="noopener noreferrer"&gt;Nimble Café&lt;/a&gt;. This approach offers multiple benefits. Firstly, it allows us to create a system where reads outnumber writes. Consequently, we have moved a significant amount of traffic away from the monolith and into autoscaling Microservices that can handle traffic better without impacting our Cloud Budget. MongoDB itself can be optimized for reads and has connectors to systems like Spark, Snowflake, and others, which can serve as a streaming backend for near real-time analytics or even AI. Essentially, I have now split my legacy system into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The older monolith handles the Write Transactions (Command).&lt;/li&gt;
&lt;li&gt;Reporting, Analytics, and other read-heavy apps in our product suite rely on Microservices that access a read-only NoSQL copy of the transactional data (Query).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From here onwards, we continue extracting functionalities into one Microservice at a time. All of these Microservices read from the common read-only copy of the database while writing to their respective stores. Over time, the monolith keeps shrinking and reducing the number of domain events it fires as functionality moves out. But it doesn't stop here. This broker also serves as the backbone for enabling choreography to the newer modules we've built on Microservices. More on Choreography and Saga will be discussed in a later post!&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to watch out for
&lt;/h2&gt;

&lt;p&gt;During the transition phase, the system operates on an eventual consistency model, and there are several factors to consider. What if your message broker goes down? What if the downstream service consuming messages is down? Will those messages be lost forever? What if a message is consumed from a broker but fails to persist into MongoDB? Building retries and utilizing Kubernetes-assisted restarts of failed services based on heartbeat monitors (&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/" rel="noopener noreferrer"&gt;Liveness And Readiness Probes&lt;/a&gt;) helps in outage scenarios. Similarly, incorporating retry logic in services before giving up and writing failed messages to a Dead Letter Queue proves to be beneficial. However, the most powerful technique in this case was using Upsert transactions in MongoDB. With this approach, a domain event that initially fails would eventually get inserted. If your system prioritizes availability and performance over consistency, this technique can work wonders, as it allows you to navigate through outages effectively.&lt;/p&gt;
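
&lt;p&gt;The retry-then-Dead-Letter-Queue flow above can be sketched as follows. This is an illustration with stand-ins (a plain list for the DLQ and a deliberately flaky persist function), not our production consumer:&lt;/p&gt;

```python
# Retry a failing persist a bounded number of times; only after giving up,
# park the message on a Dead Letter Queue so it can be replayed, not lost.

MAX_RETRIES = 3
dead_letter_queue = []

def consume(message, persist):
    for attempt in range(MAX_RETRIES):
        try:
            persist(message)
            return "persisted"
        except ConnectionError:
            continue  # in production: back off before the next attempt
    dead_letter_queue.append(message)
    return "dead-lettered"

# A store that fails twice and then recovers, simulating a transient outage.
calls = {"n": 0}
def flaky_persist(message):
    calls["n"] += 1
    if calls["n"] in (1, 2):
        raise ConnectionError("store unavailable")

result = consume({"id": "evt-1"}, flaky_persist)  # survives the outage
```

&lt;p&gt;Pairing this with the MongoDB upsert technique described above is what makes replaying dead-lettered events safe.&lt;/p&gt;
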

&lt;p&gt;The other aspect to consider is eventual consistency itself. If the cycle time between data being written to the core product and being saved in MongoDB runs into seconds, it can lead to issues: reports and feeds in the Café may start showing stale information. Therefore, it is crucial to ensure that the process completes before a user, in our case, browses to the reporting and analytics view or the Café feed after adding a card. Reactive programming, especially with Spring Boot's reactive extensions to the Database, Messaging, Cloud, and Web modules, has proven to be a saviour in such cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Patterns provide a structured way to solve recurring design problems, or at least that's how I see them. With sufficient reading and practice with samples, one can grasp these patterns. The challenging part, however, lies in improvising on a pattern's layout to achieve a technical or business strategy. This blog post cannot fully summarize the countless hours we have spent translating relational database schemas into NoSQL, all while ensuring it remains useful for other Microservices that, in their initial phase, lift and shift the functionality. I'm not advocating that this approach is suitable for every Strangler Application, but the benefits we have gained from reaching this point have made the months of effort we put into getting the ES/CQRS backbone right worthwhile. I hope this helps fellow engineers who may encounter the same problem in the future.&lt;/p&gt;

</description>
      <category>microservices</category>
    </item>
    <item>
      <title>Serverless Tekton Pipelines on AWS EKS Fargate</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Mon, 21 Feb 2022 11:31:44 +0000</pubDate>
      <link>https://dev.to/anadimisra/serverless-tekton-pipelines-on-aws-eks-fargate-3m89</link>
      <guid>https://dev.to/anadimisra/serverless-tekton-pipelines-on-aws-eks-fargate-3m89</guid>
      <description>&lt;p&gt;Continuous Delivery is hard business! Specially if you're dealing with microservices. While Jenkins does work pretty well unto a scale by creating shared libraries of sorts for common builds, but after a while when you're running your SaaS on microservices like we do at &lt;a href="https://www.digite.com" rel="noopener noreferrer"&gt;Digité&lt;/a&gt;, managing the builds, and the infrastructure for CI/CD can get cumbersome. It is for both optimized Cloud Infra usage and ability to easily write and maintain CD pipelines that we considered moving to &lt;a href="https://tekton.dev/" rel="noopener noreferrer"&gt;Tekton&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Having said that, blocking two extra-large VMs for the "what if there are too many jobs running in parallel?" scenario does not appear natural to me, so I set out to make Tekton work on Fargate. The appeal of Fargate is the ease of serverless, letting us concentrate on managing our CI/CD pipelines without having to manage the infrastructure for them. Hence, I'll share my experience of getting a serverless CI/CD infrastructure for Tekton up and running quickly via Terraform in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Let's start by creating a Terraform module for installing Tekton on Fargate; you can refer to &lt;a href="https://www.anadimisra.com/post/building-an-eks-fargate-cluster-with-terraform" rel="noopener noreferrer"&gt;this article&lt;/a&gt; for a basic EKS Fargate cluster setup. Assuming you have that in place, the next steps are as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fargate Profiles
&lt;/h3&gt;

&lt;p&gt;We'll first create the Fargate profile for running Tekton, the Tekton Dashboard and Tekton Triggers in the &lt;code&gt;tekton-pipelines&lt;/code&gt; namespace&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_eks_fargate_profile" "tekton-dashboard-profile" {
  cluster_name           = module.eks.cluster_id
  fargate_profile_name   = "tekton-dashboard-profile"
  pod_execution_role_arn = module.eks.fargate_iam_role_arn
  subnet_ids             = module.vpc.private_subnets
  selector {
    namespace = "tekton-pipelines"
    labels = {
      "app.kubernetes.io/part-of" = "tekton-dashboard"
    }
  }
  # A labels map can't repeat a key, so the triggers components get their own selector.
  selector {
    namespace = "tekton-pipelines"
    labels = {
      "app.kubernetes.io/part-of" = "tekton-triggers"
    }
  }
  depends_on = [module.eks]
  tags = {
    Environment = "${var.environment}"
    Cost        = "${var.cost_tag}"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EFS Setup
&lt;/h3&gt;

&lt;p&gt;EFS is the approach recommended by AWS for mounting persistent volumes on Fargate nodes; hence, we'll add the EFS configuration in the next steps.&lt;/p&gt;

&lt;p&gt;It's good practice to restrict EFS access to the VPC running the EKS cluster and to your internal network, so that IAM-controlled users can access it over the AWS CLI. Declare a security group with ingress rules for each subnet CIDR of the VPC running EKS Fargate to restrict access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "efs-access-security-group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "4.3.0"
  create  = true

  name        = "efs-${var.cluster_title}-${var.environment}-security-group"
  description = "Security group for pipeline tekton EFS, created via terraform"
  vpc_id      = module.vpc.vpc_id

  ingress_with_cidr_blocks = [{ cidr_blocks = "172.18.1.0/24"
    from_port = 0
    to_port   = 2049
    protocol  = "tcp"
    self      = true
    }, {
    cidr_blocks = "172.18.3.0/24"
    from_port   = 0
    to_port     = 2049
    protocol    = "tcp"
    self        = true
    }, 
    // All Subnet CIDRs...
, ]
  ingress_with_self = [{
    from_port   = 0
    to_port     = 0
    protocol    = -1
    self        = true
    description = "Ingress with Self"
  }]

  egress_with_cidr_blocks = [{
    cidr_blocks = "0.0.0.0/0"
    from_port   = 0
    to_port     = 0
    protocol    = -1
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While Fargate auto-installs the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html" rel="noopener noreferrer"&gt;EFS CSI Driver&lt;/a&gt;, we still have to declare an IAM policy for the cluster's EFS access. Here's how to do it in our Terraform module&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_policy" "efs-csi-driver-policy" {
  name        = "TektonEFSCSIDriverPolicy"
  description = "EFS CSI Driver Policy"

  policy = jsonencode({
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : [
          "elasticfilesystem:DescribeAccessPoints",
          "elasticfilesystem:DescribeFileSystems"
        ],
        "Resource" : "*"
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "elasticfilesystem:CreateAccessPoint"
        ],
        "Resource" : "*",
        "Condition" : {
          "StringLike" : {
            "aws:RequestTag/efs.csi.aws.com/cluster" : "true"
          }
        }
      },
      {
        "Effect" : "Allow",
        "Action" : "elasticfilesystem:DeleteAccessPoint",
        "Resource" : "*",
        "Condition" : {
          "StringEquals" : {
            "aws:ResourceTag/efs.csi.aws.com/cluster" : "true"
          }
        }
      }
    ]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that done, we'll define the cluster IAM role for EFS access. First, the policy document that details the policy statements for the role&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_iam_policy_document" "efs-iam-assume-role-policy" {

  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    effect  = "Allow"
    condition {
      test     = "StringEquals"
      variable = "${replace(aws_iam_openid_connect_provider.tekton-main.url, "https://", "")}:sub"
      values   = ["system:serviceaccount:tekton-pipelines:tekton-efs-serviceaccount"]
    }
    principals {
      identifiers = [aws_iam_openid_connect_provider.tekton-main.arn]
      type        = "Federated"
    }
  }
  depends_on = [
    aws_iam_policy.efs-csi-driver-policy
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then we add the role&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_role" "efs-service-account-iam-role" {
  assume_role_policy = data.aws_iam_policy_document.efs-iam-assume-role-policy.json
  name               = "tekton-efs-service-account-role"
}

resource "aws_iam_role_policy_attachment" "efs-csi-driver-policy-attachment" {
  role       = aws_iam_role.efs-service-account-iam-role.name
  policy_arn = aws_iam_policy.efs-csi-driver-policy.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we map it to a service account&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "kubernetes_service_account" "efs-service-account" {
  metadata {
    name      = "tekton-efs-serviceaccount"
    namespace = "tekton-pipelines"
    labels = {
      "app.kubernetes.io/name" = "tekton-efs-serviceaccount"
    }
    annotations = {
      # This annotation is only used when running on EKS which can use IAM roles for service accounts.
      "eks.amazonaws.com/role-arn" = aws_iam_role.efs-service-account-iam-role.arn
    }
  }
  depends_on = [
    aws_iam_role_policy_attachment.efs-csi-driver-policy-attachment
  ]
}

resource "kubernetes_role" "efs-kube-role" {
  metadata {
    name = "efs-kube-role"
    labels = {
      "name" = "efs-kube-role"
    }
  }

  rule {
    api_groups = [""]
    resources  = ["persistentvolumeclaims", "persistentvolumes"]
    verbs      = ["create", "get", "list", "update", "watch", "patch"]
  }

  rule {
    api_groups = ["", "storage"]
    resources  = ["nodes", "pods", "events", "csidrivers", "csinodes", "csistoragecapacities", "storageclasses"]
    verbs      = ["get", "list", "watch"]
  }
  depends_on = [aws_iam_role_policy_attachment.efs-csi-driver-policy-attachment]
}

resource "kubernetes_role_binding" "efs-role-binding" {
  depends_on = [
    kubernetes_service_account.efs-service-account
  ]
  metadata {
    name = "tekton-efs-role-binding"
    labels = {
      "app.kubernetes.io/name" = "tekton-efs-role-binding"
    }
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = "efs-kube-role"
  }
  subject {
    kind      = "ServiceAccount"
    name      = "tekton-efs-serviceaccount"
    namespace = "tekton-pipelines"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the IAM linked service account in place, we'll define the EFS file system&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_efs_file_system" "eks-efs" {
  creation_token = "tekton-eks-efs"
  encrypted      = true
  tags = {
    Name                  = "tekton-eks-efs"
    Cost                  = var.cost_tag

  }
  depends_on = [
    kubernetes_role_binding.efs-role-binding
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And its mount targets and storage class&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_efs_mount_target" "eks-efs-private-subnet-mnt-target" {
  count           = length(module.vpc.private_subnets)
  file_system_id  = aws_efs_file_system.eks-efs.id
  subnet_id       = module.vpc.private_subnets[count.index]
  security_groups = [module.efs-access-security-group.security_group_id]
}

resource "aws_efs_access_point" "eks-efs-tekton-access-point" {
  file_system_id = aws_efs_file_system.eks-efs.id
  root_directory {
    path = "/workspace"
    creation_info {
      owner_gid   = 1000
      owner_uid   = 1000
      permissions = 755
    }
  }
  posix_user {
    gid = 1000
    uid = 1000
  }
  tags = {
    Name        = "eks-efs-tekton-access-point"
    Cost        = var.cost_tag
    Environment = "${var.environment}"
  }
}

resource "kubernetes_storage_class" "eks-efs-storage-class" {
  metadata {
    name = "eks-efs-storage-class"
  }
  storage_provisioner = "efs.csi.aws.com"
  reclaim_policy      = "Retain"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the EFS and access point IDs in the Terraform output when applying these changes; they'll be used in the PV and PVC definitions. My scripts gave the output&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fs-8a7eXXXX::fsap-0f60de28766XXXXXX&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Tekton
&lt;/h3&gt;

&lt;p&gt;It's pretty simple from here on; the following command installs Tekton&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply --filename https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;followed by Tekton dashboard (read-only install)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -sL https://raw.githubusercontent.com/tektoncd/dashboard/main/scripts/release-installer | \
   bash -s -- install latest --read-only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;code&gt;kubectl apply --filename tekton-dashboard-readonly.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;after downloading the Read Only YAML from &lt;a href="https://github.com/tektoncd/dashboard/releases" rel="noopener noreferrer"&gt;this GitHub link&lt;/a&gt;.  Next we setup the persistent volume, refer to the generated EFS IDs from Terraform run in your PV definition, here's an example for a PV and PVC that will be used by a maven task for running tekton pipeline&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: piglet-source-pv
  labels:
    type: piglet-source-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: eks-efs-storage-class
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-8a7eXXXX::fsap-0f60de28766XXXXXX
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: piglet-source-pvc
spec:
  selector:
    matchLabels:
      type: piglet-source-pv
  storageClassName: eks-efs-storage-class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While the Tekton installation itself doesn't change (you're using a kubectl apply command as always), we have to be aware of how Fargate profiles are applied for any workloads to run on EKS Fargate, and therefore provision a Fargate profile using existing Tekton labels as its selectors so that our tasks can run on Fargate. Other than that, we have to provision and configure the PV and PVC via EFS for tasks to use at runtime.&lt;/p&gt;

&lt;p&gt;With those in place we have a working Tekton installation over EKS Fargate with a truly on-demand way of running builds and CI/CD Pipelines.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building an EKS Fargate cluster with Terraform</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Wed, 19 Jan 2022 10:16:37 +0000</pubDate>
      <link>https://dev.to/anadimisra/building-an-eks-fargate-cluster-with-terraform-522g</link>
      <guid>https://dev.to/anadimisra/building-an-eks-fargate-cluster-with-terraform-522g</guid>
      <description>&lt;p&gt;Fargate is service by AWS to run serverless workloads in Kubernetes. With Fargate you do not have to manage VMs as cluster nodes yourself as each of the pods are provisioned as nodes by Fargate itself. It is different from Lambda in the sense that you're still self-managing the Kubernetes cluster or the runtime for all the workloads you run in that cluster. Having said that, I believe it's more suitable for teams that are running containerised microservices and want to do away with managing Kubernetes infrastructure themselves.&lt;/p&gt;

&lt;p&gt;While Lambda prices on a combination of requests, CPU and memory, Fargate pricing is just the CPU and memory of the nodes running in the cluster, in addition to a fixed monthly cost for the service itself. If you want to go serverless without vendor lock-in, Fargate is a good option. Hence we at &lt;a href="https://www.digite.com/" rel="noopener noreferrer"&gt;Digité&lt;/a&gt; prefer running our microservices in the Fargate model.&lt;/p&gt;

&lt;p&gt;Managing such an infrastructure manually is certainly not feasible, hence we rely on IaC to manage and operate our infrastructure. Here, &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; has been our tool of choice for various reasons, from ease of learning to its robust design. Terraform is an open-source Infrastructure as Code tool by HashiCorp that lets you define AWS infrastructure via a descriptive DSL, and it has been quite popular in the DevOps world since its inception.&lt;/p&gt;

&lt;p&gt;In this blog I'll share how we've used Terraform to deploy an EKS Fargate cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  VPC
&lt;/h2&gt;

&lt;p&gt;We'll start with deploying the Amazon VPC via Terraform. There are three recommended approaches for deploying a VPC to run EKS Fargate, let's look at each of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public and Private Subnets: the pods run in private subnets while load balancers, whether Application or Network, are deployed in the public subnets. One public and one private subnet is deployed in each availability zone of the region for availability and fault tolerance; this is the deployment model we will follow for this blog&lt;/li&gt;
&lt;li&gt;Public Subnets Only: both the pods (or nodes) and the load balancers are in public subnets; here three public subnets are deployed in three different availability zones within the region. All nodes get a public IP address, and a security group blocks all inbound and outbound traffic to the nodes. To be honest, I haven't ever figured out why anyone would need this :-)&lt;/li&gt;
&lt;li&gt;Private Subnets Only: both pods and load balancers run in private subnets only, one created in each availability zone of the region. Quite naturally, we have to configure an additional NAT Gateway, Egress-Only Gateway, VPN or Direct Connect to be able to access the cluster. There's additional configuration on the &lt;code&gt;kubectl&lt;/code&gt; side as well, which we will skip in this blog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VPC subnets should have certain tags which allow EKS Fargate to deploy internal load balancers to them and provision nodes; let's look at the tags first&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;kubernetes.io/cluster/cluster-name&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Value: &lt;code&gt;shared&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following tags allow EKS Fargate to decide where auto-provisioned Elastic Load Balancers are deployed, and also allow you to control where Application or Network Load Balancers are configured&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Private Subnets:

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;kubernetes.io/role/internal-elb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Value: 1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Public Subnets:

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;kubernetes.io/role/elb&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Value: 1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The VPC configuration is therefore as follows; we'll use the AWS VPC Terraform module for this purpose, as it provides easier configuration via declarative properties instead of having to write all the resources yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc" {
  source                        = "terraform-aws-modules/vpc/aws"
  version                       = "3.4.0"
  name                          = "vpc-serverless"
  cidr                          = "176.24.0.0/16"
  azs                           = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets               = ["176.24.1.0/24","176.24.3.0/24","176.24.5.0/24"]
  public_subnets                = ["176.24.2.0/24","176.24.4.0/24","176.24.6.0/24"]
  enable_nat_gateway            = true
  single_nat_gateway            = true
  enable_dns_hostnames          = true
  manage_default_security_group = true
  default_security_group_name   = "vpc-serverless-security-group"

  public_subnet_tags = {
    "kubernetes.io/cluster/vpc-serverless" = "shared"
    "kubernetes.io/role/elb"               = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/vpc-serverless" = "shared"
    "kubernetes.io/role/internal-elb"      = "1"
  }

  tags = {
    "kubernetes.io/cluster/vpc-serverless" = "shared"
  }

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a single NAT gateway managing all traffic from the nodes running in the private subnets. We also have to keep the &lt;code&gt;enable_dns_hostnames&lt;/code&gt; option set to true so that any ALBs we configure in the future can be assigned hostnames for CNAME DNS mapping.&lt;/p&gt;

&lt;h2&gt;
  
  
  EKS Cluster
&lt;/h2&gt;

&lt;p&gt;We'll use the AWS EKS Terraform module to deploy the EKS Fargate cluster. A basic configuration is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "eks-cluster" {
  source                        = "terraform-aws-modules/eks/aws"
  version                       = "17.1.0"
  cluster_name                  = "eks-serverless"
  cluster_version               = "1.21"
  subnets                       = flatten([module.vpc.public_subnets, module.vpc.private_subnets])
  cluster_delete_timeout        = "30m"
  cluster_iam_role_name         = "eks-serverless-cluster-iam-role"
  cluster_enabled_log_types     = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cluster_log_retention_in_days = 7

  vpc_id = module.vpc.vpc_id

  fargate_pod_execution_role_name = "eks-serverless-pod-execution-role"
  // Fargate profiles here
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fargate Profiles and CoreDNS
&lt;/h2&gt;

&lt;p&gt;A basic configuration like the one above will deploy the EKS cluster; however, you need to create Fargate profiles that define which pods will run on Fargate. A Fargate profile defines selectors and a namespace to run the pods in, along with optional tags. You also have to provide a pod execution role name, which allows the EKS infrastructure to make AWS API calls on the cluster owner's behalf. You can have up to five selectors per Fargate profile. &lt;/p&gt;

&lt;p&gt;While Fargate takes care of provisioning nodes as pods for the EKS cluster, it still needs a component to handle DNS resolution within the cluster; CoreDNS is that plugin for EKS Fargate and, like any other workload, it needs a Fargate profile to run. So we'll add both the plugin and the profile configuration to our Terraform code.&lt;/p&gt;

&lt;p&gt;First, let's update the profile configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fargate_profiles = {
    coredns-fargate-profile = {
      name = "coredns"
      selectors = [
        {
          namespace = "kube-system"
          labels = {
            k8s-app = "kube-dns"
          }
        },
        {
          namespace = "default"
        }
      ]
      subnets = module.vpc.private_subnets
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're essentially saying: run the pods in the &lt;code&gt;kube-system&lt;/code&gt; namespace that carry the label &lt;code&gt;k8s-app: kube-dns&lt;/code&gt; on this Fargate profile. Let's also add the CoreDNS plugin to the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_eks_addon" "coredns" {
  addon_name        = "coredns"
  addon_version     = "v1.8.4-eksbuild.1"
  cluster_name      = "eks-serverless"
  resolve_conflicts = "OVERWRITE"
  depends_on        = [module.eks-cluster]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
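
&lt;p&gt;If the CoreDNS pods remain stuck in &lt;code&gt;Pending&lt;/code&gt; after the add-on is installed, AWS documents removing the &lt;code&gt;eks.amazonaws.com/compute-type&lt;/code&gt; annotation from the CoreDNS deployment so the pods can be scheduled on Fargate. A one-off patch, run against your cluster context, looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl patch deployment coredns -n kube-system --type json \
  -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;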



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;At this stage it's simple enough to bundle all of this into a single module. Here's what the file structure looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cluster
├── main.tf
├── outputs.tf
├── providers.tf
├── terraform.tf
├── terraform.tfvars
└── variables.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
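
&lt;p&gt;The &lt;code&gt;outputs.tf&lt;/code&gt; file can export values that other modules or tooling will need later; for example (the output names here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "vpc_id" {
  value = module.vpc.vpc_id
}

output "cluster_id" {
  value = module.eks-cluster.cluster_id
}

output "cluster_endpoint" {
  value = module.eks-cluster.cluster_endpoint
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;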



&lt;p&gt;The &lt;code&gt;providers.tf&lt;/code&gt; file defines the AWS provider, with the AWS CLI credentials supplied as variables that are declared in the &lt;code&gt;variables.tf&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_version = "=1.0.2"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.49.0"
    }
  }
}

# Provider definition
provider "aws" {
  access_key = var.access_key
  secret_key = var.secret_key
  region     = var.region
  token      = var.session_token
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
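
&lt;p&gt;A minimal &lt;code&gt;variables.tf&lt;/code&gt; backing that provider block could look like this (the variable names come from the provider definition above; the defaults are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "access_key" {
  type        = string
  description = "AWS access key ID"
  sensitive   = true
}

variable "secret_key" {
  type        = string
  description = "AWS secret access key"
  sensitive   = true
}

variable "session_token" {
  type        = string
  description = "AWS session token, when using temporary credentials"
  default     = ""
  sensitive   = true
}

variable "region" {
  type        = string
  description = "AWS region to deploy into"
  default     = "us-east-1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;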



&lt;p&gt;If you're saving Terraform state in a remote backend, you can define its configuration in the &lt;code&gt;terraform.tf&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {

  backend "s3" {
    bucket         = "swiftalk-iac"
    dynamodb_table = "swiftalk-iac-locks"
    key            = "vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Run &lt;code&gt;terraform init&lt;/code&gt; followed by &lt;code&gt;terraform apply&lt;/code&gt; to get an EKS Fargate cluster up and running in minutes!&lt;/p&gt;
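
&lt;p&gt;The usual Terraform workflow applies; from the &lt;code&gt;cluster&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# initialise providers, modules and the S3 backend
terraform init

# preview the resources to be created
terraform plan -out=cluster.tfplan

# create the VPC and the EKS Fargate cluster
terraform apply cluster.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;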

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Tekton and the Promise of Reusable Pipelines</title>
      <dc:creator>Anadi Misra</dc:creator>
      <pubDate>Tue, 13 Oct 2020 14:21:25 +0000</pubDate>
      <link>https://dev.to/anadimisra/tekton-and-the-promise-of-reusable-pipelines-4jko</link>
      <guid>https://dev.to/anadimisra/tekton-and-the-promise-of-reusable-pipelines-4jko</guid>
      <description>&lt;p&gt;The advent of Cloud and Container technology ushered a new era in distributed computing at “planet scale” which was unheard of and unimaginable just a decade ago. Another interesting movement was brewing up a decade ago which bolstered delivering these complex solutions at high speed and accuracy, DevOps. These two paradigm shifts have gone hand in hand complementing each other to shape up distributed computing in the way we know (or are still learning about) today.&lt;/p&gt;

&lt;p&gt;We all know how integral Continuous Integration and Continuous Deployment are to the DevOps automation paradigm, and how organizations have designed verbose pipelines to bring a factory-floor model to shipping software. &lt;/p&gt;

&lt;p&gt;If you’ve ever been part of implementing any of these DevOps practices for a cloud-native distributed system, you’d perhaps know how quickly these CI/CD pipelines become a cacophony of complex tools and integrations, requiring their own sub-organization of specialists to build and maintain them, thereby adding to the very silos the practices had set out to break.&lt;/p&gt;

&lt;p&gt;A large system is usually composed of multiple distributed subsystems deployed as Docker containers clustered over an orchestration runtime such as Kubernetes, and for such systems these pipelines come anything but easy. The problem is that, at any time, you'd be dealing with at least five tools for anything from triggers, to running builds and packaging, to creating test environments and running tests, and finally the holy grail of one-click deploy (if it is holy at all!).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tekton
&lt;/h2&gt;

&lt;p&gt;Some good people at the Knative project with Google felt the aforementioned problem deeply enough to come up with a solution that is (I believe) one of the best attempts yet at building shift-left pipelines: Tekton.&lt;/p&gt;

&lt;p&gt;Tekton aims to bring much-needed simplicity and uniformity to creating and running these pipelines by providing a highly reusable, declarative, component-based cloud-native build system that uses Kubernetes CRDs to get the job done. In the Tekton philosophy, any pipeline can be broken down into the following three key parts: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core services: version control, artifact store, deployment automation&lt;/li&gt;
&lt;li&gt;Tasks, which could range from running a Maven build to test automation to security and performance evaluations&lt;/li&gt;
&lt;li&gt;A workflow which decides how and when the tasks run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vision, therefore, is to cut through the inconsistency and complexity and provide a mechanism for building pipelines that is: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flfiawcvj7zklx59pfnnm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flfiawcvj7zklx59pfnnm.jpg" alt="Alt Text" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Tekton defines resources which fulfil the characteristics shown above, letting you concentrate on what needs to be done and when, leaving the how to the underlying implementation. Let's look at the key building blocks of a pipeline created with Tekton.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;p&gt;The most basic of Tekton's components is the step: essentially a Kubernetes container spec, an existing resource type that lets you define an image and the information you need to run it. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt; &lt;span class="c1"&gt;# contains bash&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;#!/usr/bin/env bash&lt;/span&gt;
    &lt;span class="s"&gt;echo "Hello from Bash!"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task
&lt;/h3&gt;

&lt;p&gt;A Task is composed of one or more steps (you can make tasks as granular or as coarse as you wish) and is the unit of work in a pipeline that achieves a specific goal (a built JAR archive, a Docker image, a test run, etc.). The following task runs a Maven build, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tekton.dev/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mvn&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;workspaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output&lt;/span&gt;
 &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GOALS&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The Maven goals to run&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;array&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;package"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
 &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mvn&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/cloud-builders/mvn&lt;/span&gt;
      &lt;span class="na"&gt;workingDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/workspace/output&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/mvn"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-Dmaven.repo.local=$(workspaces.maven-repo.path)&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$(inputs.params.GOALS)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;A Pipeline is a collection of Tasks that you define and arrange in a specific order of execution as part of your continuous integration flow. Each Task in a Pipeline executes as a Pod on your Kubernetes cluster, and you can configure various execution conditions to fit your business needs. A Pipeline can be the entire workflow or just a part of one, as you desire. Here's a diagrammatic representation of what a pipeline would achieve in Tekton.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw57x0q67c6mccuuhrh9q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw57x0q67c6mccuuhrh9q.jpeg" alt="Alt Text" width="800" height="1130"&gt;&lt;/a&gt;&lt;/p&gt;
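
&lt;p&gt;As a sketch, a minimal Pipeline referencing the &lt;code&gt;mvn&lt;/code&gt; Task from the previous section might look like this (the pipeline and workspace names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-app
spec:
  workspaces:
    - name: shared-workspace
  tasks:
    - name: maven-build
      taskRef:
        name: mvn
      params:
        - name: GOALS
          value: ["clean", "package"]
      workspaces:
        - name: output
          workspace: shared-workspace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;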

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;Let's look at what we would expect a pipeline to do for most modern-day projects:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnqpcn9smpw2ro4ddi5g6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnqpcn9smpw2ro4ddi5g6.jpeg" alt="Alt Text" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have multiple apps that need these steps you can essentially:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define common tasks such as unit tests, linting, building images, running tests (integration or end-to-end), publishing images, etc.&lt;/li&gt;
&lt;li&gt;Define multiple pipelines, or create a standardized pipeline to be reused across similar modules&lt;/li&gt;
&lt;li&gt;Parameterize pipeline runs and scale to a large number of pipelines with less automation and CI/CD configuration&lt;/li&gt;
&lt;/ol&gt;
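
&lt;p&gt;Point 3 in practice: each module gets its own PipelineRun that binds parameters and workspaces to a shared Pipeline definition. A hedged sketch, where the pipeline name and the PVC claim are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-app-run
spec:
  pipelineRef:
    name: build-app
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: app-source-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;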

&lt;p&gt;Therein lies the power of this tool: being able to author any number of pipelines without having to integrate multiple tools or manage complex orchestration. This thoroughly DRY approach to automated CI/CD pipelines is certainly a great tool at the disposal of software development teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Pipeline with Tekton
&lt;/h2&gt;

&lt;p&gt;Now that we've seen what Tekton is all about and the promise it brings to the table, let's see how well it lives up to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Tekton
&lt;/h3&gt;

&lt;p&gt;To install Tekton's core component, Tekton Pipelines (assuming you already have a Kubernetes cluster up and running; if not, set up the cluster first), run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;--filename&lt;/span&gt; https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take a few moments before the installation completes. You can check the progress with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--namespace&lt;/span&gt; tekton-pipelines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm that every component listed has the status Running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Persistent Volumes
&lt;/h3&gt;

&lt;p&gt;To run a CI/CD workflow, you need to provide Tekton a Persistent Volume for storage purposes. By default, Tekton requests a volume of 5Gi with the default storage class. Your Kubernetes cluster, such as one from Google Kubernetes Engine, may have persistent volumes set up at creation time, in which case no extra step is required; if not, you may have to create them manually. Alternatively, you may ask Tekton to use a Google Cloud Storage bucket or an AWS Simple Storage Service (Amazon S3) bucket instead. Note that Tekton's performance may vary depending on the storage option you choose. The command below, for example, configures Tekton to request 10Gi volumes using the &lt;code&gt;manual&lt;/code&gt; storage class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create configmap config-artifact-pvc &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10Gi &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;storageClassName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;manual &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="nt"&gt;-n&lt;/span&gt; tekton-pipelines &lt;span class="se"&gt;\&lt;/span&gt;
                         &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client | kubectl replace &lt;span class="nt"&gt;-f&lt;/span&gt; -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more specific details on the Installation and configuration of Tekton you may refer to their &lt;a href="https://github.com/tektoncd/pipeline/blob/master/docs/install.md" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Steps
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;In this post we saw what Tekton brings to the table in terms of providing a way to author highly scalable pipelines built from reusable tasks, and how to quickly get it up and running on a Kubernetes cluster. In the next part we will look into building and running a pipeline on Tekton for a simple Java application.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
