<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ruturaj Kadikar</title>
    <description>The latest articles on DEV Community by Ruturaj Kadikar (@ruturaj_k).</description>
    <link>https://dev.to/ruturaj_k</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F405427%2F42c4a367-56c9-40b5-bdc6-b44505da69b3.jpg</url>
      <title>DEV Community: Ruturaj Kadikar</title>
      <link>https://dev.to/ruturaj_k</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ruturaj_k"/>
    <language>en</language>
    <item>
      <title>Metrics at a Glance for Production Clusters</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Mon, 31 Mar 2025 14:18:50 +0000</pubDate>
      <link>https://dev.to/infracloud/metrics-at-a-glance-for-production-clusters-5e35</link>
      <guid>https://dev.to/infracloud/metrics-at-a-glance-for-production-clusters-5e35</guid>
      <description>&lt;p&gt;Keeping a close eye on your production clusters is not just good practice—it’s essential for survival. Whether you’re managing applications at scale or ensuring robust service delivery, understanding the vital signs of your clusters through metrics is like having a dashboard in a race car, giving you real-time insights and foresight into performance bottlenecks, resource usage and the operational health of your car.&lt;/p&gt;

&lt;p&gt;However, a lot happens in any cluster. There are so many metrics to track that the sheer volume of observability data you collect can itself become an obstacle to seeing what is actually happening. That’s why you should collect only the important metrics: those that give you a complete picture of your cluster’s health without overwhelming you.&lt;/p&gt;

&lt;p&gt;In this blog post, we will cut through the complexity and spotlight the essential metrics you need on your radar to quickly detect and address issues as they arise. From CPU usage to network throughput, we’ll break down each metric, show you how to monitor them effectively and provide the queries that get you the data you need. Before we dive into the specifics of which metrics to monitor, let's understand the foundational monitoring principles that guide our approach. We'll explore the RED and USE methods along with the Four Golden Signals, providing a robust framework for what to measure and why it matters in maintaining the health of your production clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Principles
&lt;/h2&gt;

&lt;p&gt;Effective monitoring is the cornerstone of maintaining the health and performance of your production clusters. It helps you catch issues early, optimize resource usage, and ensure that your systems are running smoothly. In this section, we introduce two essential monitoring frameworks — USE and RED — and the Four Golden Signals. These principles provide a structured approach to monitoring, making it easier to interpret vast amounts of data and identify critical performance metrics. By understanding and applying these principles, you can transform raw data into actionable insights that keep your systems in top shape.&lt;/p&gt;

&lt;h3&gt;
  
  
  RED and USE methods
&lt;/h3&gt;

&lt;p&gt;In modern systems, keeping track of numerous metrics can be overwhelming, especially when troubleshooting or simply checking for issues. To make this easier, you can use two helpful acronyms: &lt;a href="https://orangematter.solarwinds.com/2017/10/05/monitoring-and-observability-with-use-and-red/" rel="noopener noreferrer"&gt;USE and RED&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.brendangregg.com/usemethod.html" rel="noopener noreferrer"&gt;The USE Method&lt;/a&gt; (Utilization, Saturation, Errors) was introduced by Brendan Gregg, a renowned performance engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization: Measures how busy your resources are.&lt;/li&gt;
&lt;li&gt;Saturation: Shows how much backlog or congestion there is.&lt;/li&gt;
&lt;li&gt;Errors: Counts the number of error events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services" rel="noopener noreferrer"&gt;The RED Method&lt;/a&gt; (Rate, Errors, Duration) was introduced by Tom Wilkie. Drawing on his experience at Google, Wilkie developed this methodology around three key metrics for monitoring microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate: Measures the request throughput.&lt;/li&gt;
&lt;li&gt;Errors: Tracks the error rates.&lt;/li&gt;
&lt;li&gt;Duration: Measures how long requests take to be processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The USE method focuses on resource performance from an internal perspective, while the RED method looks at request performance from an external, workload-focused perspective. Together, they give you a comprehensive view of system health by covering both resource usage and workload behavior. By using these standard performance metrics, USE and RED provide a solid foundation for monitoring and diagnosing issues in complex systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Golden Signals
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Four Golden Signals&lt;/a&gt;&lt;/strong&gt; — &lt;strong&gt;Latency&lt;/strong&gt;, &lt;strong&gt;Traffic&lt;/strong&gt;, &lt;strong&gt;Errors&lt;/strong&gt;, and &lt;strong&gt;Saturation&lt;/strong&gt; — are foundational metrics introduced in Google's Site Reliability Engineering (SRE) practices to monitor system performance and reliability. According to this method, dashboards should address all the fundamental questions about your service. These signals are essential for understanding system performance and should be prioritized when selecting metrics to monitor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency: Refers to the time taken to handle a request, distinguishing between successful and failed requests.&lt;/li&gt;
&lt;li&gt;Traffic: Measures the demand placed on the system, typically quantified by metrics like HTTP requests per second or network I/O rate.&lt;/li&gt;
&lt;li&gt;Errors: Represents the rate of failed requests, including explicit errors like HTTP 500s and implicit errors like incorrect content responses.&lt;/li&gt;
&lt;li&gt;Saturation: Indicates how "full" the service is, emphasizing the most constrained resources and predicting impending saturation for proactive maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7q9d1oib1ruqggn43dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7q9d1oib1ruqggn43dh.png" alt="Four Golden Signals" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By monitoring these four golden signals and promptly alerting administrators or support engineers when issues arise, your cluster will benefit from comprehensive monitoring coverage, ensuring reliability and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Four Golden Signals for Comprehensive Monitoring
&lt;/h2&gt;

&lt;p&gt;If you're managing a production Kubernetes cluster, you know the importance of staying on top of your monitoring game. We’re here to simplify your monitoring approach by integrating the RED and USE methods with Google's Four Golden Signals, enabling comprehensive monitoring from a single dashboard. This approach allows you to swiftly spot and address issues, ensuring your cluster operates smoothly without the hassle of jumping between multiple dashboards. To get started, you can download the &lt;a href="https://grafana.com/grafana/dashboards/21073-monitoring-golden-signals/" rel="noopener noreferrer"&gt;Monitoring Golden Signals for Kubernetes Grafana dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c3jqs1hxv5wapcu28ko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c3jqs1hxv5wapcu28ko.png" alt="Monitoring Golden Signals for Kubernetes Grafana dashboard" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s jump into each golden signal to understand what metrics should be monitored to track them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic: What is it and how to monitor it?
&lt;/h3&gt;

&lt;p&gt;If we think of Kubernetes as a city’s road system, then pods are the cars, nodes are the streets, and services are the traffic lights that manage the flow. In this picture, monitoring in Kubernetes is like using traffic cameras and sensors at crossroads to keep everything moving smoothly and avoid traffic jams.&lt;/p&gt;

&lt;p&gt;Network I/O is like the main roads that handle cars coming into and going out of the city. If these roads are too busy, it slows everything down. The API server is like a Regional Transport Office (RTO), regulating and overseeing all operations within the cluster, much like traffic and vehicle management in a region. Monitoring traffic to external services such as databases is also important, similar to watching vehicles travel to other cities. You can use tools like the &lt;a href="https://www.infracloud.io/blogs/monitoring-endpoints-kubernetes-blackbox-exporter/" rel="noopener noreferrer"&gt;blackbox exporter to keep an eye on traffic leaving Kubernetes&lt;/a&gt;. This highlights the importance of pinpointing key areas for monitoring traffic flow. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff30aefts9o0it4pyca7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff30aefts9o0it4pyca7v.png" alt="How to monitor traffic" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below, we outline the primary and general metrics for monitoring the traffic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;br&gt;(Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor this Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingress Traffic (Istio)&lt;/td&gt;
&lt;td&gt;istio_requests_total&lt;/td&gt;
&lt;td&gt;Tracks the total number of requests handled by Istio, essential for understanding ingress controller load and overall health.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Traffic&lt;/td&gt;
&lt;td&gt;apiserver_request_total&lt;/td&gt;
&lt;td&gt;Measures the number of API server requests, which helps in monitoring control plane load and identifying potential bottlenecks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Traffic&lt;/td&gt;
&lt;td&gt;workqueue_adds_total&lt;/td&gt;
&lt;td&gt;Indicates the total number of items added to work queues, helping identify workload spikes and manage resource allocation effectively.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Traffic&lt;/td&gt;
&lt;td&gt;node_network_receive_bytes_total&lt;br&gt;node_network_transmit_bytes_total&lt;/td&gt;
&lt;td&gt;Monitors data received/transmitted by nodes, which is crucial for identifying and addressing network capacity issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Traffic&lt;/td&gt;
&lt;td&gt;node_network_receive_packets_total&lt;br&gt;node_network_transmit_packets_total&lt;/td&gt;
&lt;td&gt;Monitors the number of packets received/transmitted, important for analyzing network traffic, identifying issues, and maintaining robust network communication.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload Traffic&lt;/td&gt;
&lt;td&gt;container_network_receive_bytes_total&lt;br&gt;container_network_transmit_bytes_total&lt;/td&gt;
&lt;td&gt;Vital for monitoring the amount of network traffic received/transmitted by containers, ensuring proper traffic handling and performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage Operations&lt;/td&gt;
&lt;td&gt;storage_operation_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Provides insights into storage operation performance, helping diagnose and address slow disk access issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CoreDNS Requests/s&lt;/td&gt;
&lt;td&gt;coredns_dns_requests_total&lt;/td&gt;
&lt;td&gt;Monitors the number of DNS queries handled by CoreDNS, ensuring reliable service discovery and network performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
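
&lt;p&gt;To put these metrics to work, here are a few illustrative PromQL queries for the traffic panels above. These are sketches, not definitive dashboard queries: label names such as &lt;code&gt;destination_service&lt;/code&gt; or &lt;code&gt;instance&lt;/code&gt; depend on your service mesh and exporter configuration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Ingress request rate handled by Istio, per destination service
sum(rate(istio_requests_total[5m])) by (destination_service)

# API server request rate, broken down by verb
sum(rate(apiserver_request_total[5m])) by (verb)

# Per-node network receive throughput (bytes/s)
sum(rate(node_network_receive_bytes_total[5m])) by (instance)

# CoreDNS queries per second
sum(rate(coredns_dns_requests_total[5m]))
&lt;/code&gt;&lt;/pre&gt;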

&lt;h3&gt;
  
  
  Latency: What is it and how to monitor it?
&lt;/h3&gt;

&lt;p&gt;To understand latency in Kubernetes, let’s return to the traffic analogy. Latency in Kubernetes is like delays in a city’s traffic system, where slowdowns at various points affect overall efficiency. If a major road is under construction or blocked due to an accident (slow data processing), cars must take detours, increasing travel time. Similarly, when a microservice is overloaded, requests pile up, causing system-wide slowdowns.&lt;/p&gt;

&lt;p&gt;Traffic lights that take too long to change (rate-limited APIs or overloaded queues) create long waiting lines, much like API call delays that hold up processing. Pod startup delays are like malfunctioning traffic signals: cars remain idle and congestion builds up, just as new pods that take too long to initialize slow down request handling.&lt;/p&gt;

&lt;p&gt;During rush hour congestion, roads get overwhelmed, making travel slower for everyone. In Kubernetes, when resources like CPU and memory are exhausted, requests are delayed, affecting performance. Likewise, a single-lane road with no passing option (sequential processing) forces cars to crawl behind slow-moving vehicles, just as inefficient sequential request handling slows down application performance.&lt;/p&gt;

&lt;p&gt;Just as city planners use traffic monitoring and smart infrastructure to optimize flow, engineers must track key latency metrics to prevent bottlenecks in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578qwxeu9t4jiw8illwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578qwxeu9t4jiw8illwg.png" alt="How to monitor latency" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component (Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor This Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod Start Duration&lt;/td&gt;
&lt;td&gt;kubelet_pod_start_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Monitors time taken for pods to start, crucial for optimizing scaling and recovery processes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Startup Latency&lt;/td&gt;
&lt;td&gt;kubelet_pod_worker_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Tracks duration of pod operations, important for assessing pod management efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETCD Cache Duration 99th Quantile&lt;/td&gt;
&lt;td&gt;etcd_request_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Essential for monitoring ETCD request processing latency, impacts overall cluster performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Request Duration 99th Quantile&lt;/td&gt;
&lt;td&gt;apiserver_request_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Important for understanding API server response times, indicates control plane health.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Work Queue Latency&lt;/td&gt;
&lt;td&gt;workqueue_queue_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Measures delays in API server work queues, vital for spotting potential processing issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Work Queue Depth&lt;/td&gt;
&lt;td&gt;workqueue_depth&lt;/td&gt;
&lt;td&gt;Provides insight into API server queue load, critical for preventing system overloads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CoreDNS Request Duration&lt;/td&gt;
&lt;td&gt;coredns_dns_request_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Tracks CoreDNS DNS request processing times, key for efficient network resolution.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
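
&lt;p&gt;Since most of these latency metrics are Prometheus histograms (hence the &lt;code&gt;_bucket&lt;/code&gt; suffix), you typically wrap them in &lt;code&gt;histogram_quantile&lt;/code&gt;. A few illustrative queries, assuming the default metric labels:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 99th percentile API server request duration, per verb
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))

# 99th percentile pod start duration
histogram_quantile(0.99,
  sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))

# Current API server work queue depth, per queue
sum(workqueue_depth) by (name)
&lt;/code&gt;&lt;/pre&gt;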

&lt;h3&gt;
  
  
  Errors: What are they and how to monitor them?
&lt;/h3&gt;

&lt;p&gt;Again, continuing our analogy, let's consider a Kubernetes cluster like a city’s road system. Everything needs to move smoothly for the city to function well. But what happens when things go wrong?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoreDNS crashes: It’s like traffic signals failing. Without proper directions, cars (data) can’t find their way, leading to confusion and delays.&lt;/li&gt;
&lt;li&gt;API Server goes down: This is like losing the central traffic control center. The entire system becomes unresponsive, and nothing moves.&lt;/li&gt;
&lt;li&gt;Pod failures: These are like car breakdowns. A few stalled cars won’t stop the whole city, but they slow down traffic in specific lanes (services).&lt;/li&gt;
&lt;li&gt;Node issues (like DiskPressure): Imagine a major road being closed. Cars (pods) have to reroute, leading to congestion and bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just as traffic disruptions cause delays and frustration, Kubernetes failures impact SLAs, SLOs, and user experience. That’s why monitoring errors is like a real-time traffic control system. It detects problems early and helps keep everything running smoothly. The following metrics will help monitor the errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzv596bfaesyv4nmmqi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzv596bfaesyv4nmmqi4.png" alt="How to monitor errors" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component (Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor this Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CoreDNS&lt;/td&gt;
&lt;td&gt;coredns_cache_misses_total&lt;/td&gt;
&lt;td&gt;Tracks cache misses in CoreDNS, important for identifying DNS resolution issues affecting cluster connectivity and performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Errors&lt;/td&gt;
&lt;td&gt;apiserver_request_total&lt;/td&gt;
&lt;td&gt;Monitors API server request errors, crucial for detecting and diagnosing failures in handling cluster management tasks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nodes&lt;/td&gt;
&lt;td&gt;kube_node_spec_unschedulable&lt;/td&gt;
&lt;td&gt;Counts nodes that are unschedulable, essential for understanding cluster capacity and scheduling issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nodes&lt;/td&gt;
&lt;td&gt;kube_node_status_condition&lt;/td&gt;
&lt;td&gt;Tracks node conditions like 'OutOfDisk', 'DiskPressure', 'MemoryPressure', important for preemptive system health alerts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubelet&lt;/td&gt;
&lt;td&gt;kubelet_runtime_operations_errors_total&lt;/td&gt;
&lt;td&gt;Measures error rates in kubelet operations, key for maintaining node and pod health.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workloads&lt;/td&gt;
&lt;td&gt;kube_pod_status_phase&lt;/td&gt;
&lt;td&gt;Monitors pods in failed states, critical for identifying failed workloads and ensuring reliability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;node_network_receive_errs_total, node_network_transmit_errs_total&lt;/td&gt;
&lt;td&gt;Monitors network errors in data transmission and reception, vital for maintaining robust network communication.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
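
&lt;p&gt;As a sketch, the error metrics above can be queried like this (the &lt;code&gt;code&lt;/code&gt; label on API server requests and the exact node condition names may vary with your Kubernetes version):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# API server error ratio: 5xx responses as a fraction of all requests
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# Nodes currently reporting memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"}

# Pods stuck in the Failed phase, per namespace
sum(kube_pod_status_phase{phase="Failed"}) by (namespace)

# Network transmit errors per node
sum(rate(node_network_transmit_errs_total[5m])) by (instance)
&lt;/code&gt;&lt;/pre&gt;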

&lt;h3&gt;
  
  
  Saturation: What is it and how to monitor it?
&lt;/h3&gt;

&lt;p&gt;To explain saturation, let’s return to the city traffic analogy one more time. CPU and memory utilization are akin to the flow of vehicles: too much traffic causes congestion, slowing down the city. Node resource exhaustion is similar to key intersections getting overwhelmed, which can halt traffic across the city. Network capacity matches the width and condition of roads; inadequate capacity leads to bottlenecks. Monitoring the top ten nodes and pods with the highest resource utilization is like tracking the busiest areas in the city to prevent and manage traffic jams more effectively. This approach ensures smooth operation and prevents system slowdowns.&lt;/p&gt;

&lt;p&gt;The following metrics help you quickly identify potential slowdowns in Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rrk8l0p0w67hbbzdff0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rrk8l0p0w67hbbzdff0.png" alt="How to monitor saturation" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component (Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor this Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cluster Memory Utilization&lt;/td&gt;
&lt;td&gt;node_memory_MemFree_bytes, node_memory_MemTotal_bytes, node_memory_Buffers_bytes, node_memory_Cached_bytes&lt;/td&gt;
&lt;td&gt;Tracks memory usage metrics to prevent saturation and ensure resource availability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster CPU Utilization&lt;/td&gt;
&lt;td&gt;node_cpu_seconds_total&lt;/td&gt;
&lt;td&gt;Monitors CPU usage to prevent overload and maintain performance efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Count&lt;/td&gt;
&lt;td&gt;kube_node_labels&lt;/td&gt;
&lt;td&gt;Counts the number of nodes, essential for scaling and resource allocation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVCs&lt;/td&gt;
&lt;td&gt;kube_persistentvolumeclaim_info&lt;/td&gt;
&lt;td&gt;Tracks persistent volume claims, important for storage resource management.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Parameters&lt;/td&gt;
&lt;td&gt;node_filefd_maximum&lt;br&gt;node_filefd_allocated&lt;/td&gt;
&lt;td&gt;Tracks maximum file descriptors, prevents resource exhaustion.&lt;br&gt;Tracks allocated file descriptors, prevents resource exhaustion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Parameters&lt;/td&gt;
&lt;td&gt;node_sockstat_sockets_used&lt;/td&gt;
&lt;td&gt;Monitors sockets in use, crucial for system stability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Parameters&lt;/td&gt;
&lt;td&gt;node_nf_conntrack_entries&lt;br&gt;node_nf_conntrack_entries_limit&lt;/td&gt;
&lt;td&gt;Tracks active network connections, ensures capacity isn't exceeded.&lt;br&gt;Monitors conntrack entries limit, prevents network tracking overload.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
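
&lt;p&gt;The saturation metrics above combine naturally into utilization ratios. A few illustrative PromQL sketches:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Cluster memory utilization (fraction of memory in use)
1 - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)
      / sum(node_memory_MemTotal_bytes)

# Cluster CPU utilization (fraction of CPU time not spent idle)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# File descriptor saturation per node
node_filefd_allocated / node_filefd_maximum

# Conntrack table usage per node
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
&lt;/code&gt;&lt;/pre&gt;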

&lt;p&gt;Note: This dashboard is designed specifically for infrastructure monitoring. To cover application insights, you need to create similar dashboards from application metrics, assuming the relevant metrics are available. Additionally, you can generate metrics from logs as needed and incorporate them into these dashboards to achieve a unified view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By meticulously applying these Four Golden Signals in our monitoring strategy, we ensure a proactive approach to infrastructure management. This not only helps in quick problem resolution but also aids in efficient resource utilization, ultimately enhancing the performance and stability of your Kubernetes cluster. With this comprehensive view provided by this single-dashboard approach, Kubernetes administrators and SREs can effortlessly manage cluster health, allowing them to focus on strategic improvements and innovation. No more navigating through complex monitoring setups—everything you need is now in one place, streamlined for efficiency and effectiveness.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. I’d love to hear your thoughts on this post; let’s connect and start a conversation on &lt;a href="https://www.linkedin.com/in/ruturaj-kadikar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Prometheus vs CloudWatch for Cloud Native Applications (Updated in 2024)</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Thu, 10 Oct 2024 06:19:23 +0000</pubDate>
      <link>https://dev.to/infracloud/prometheus-vs-cloudwatch-for-cloud-native-applications-updated-in-2024-4na3</link>
      <guid>https://dev.to/infracloud/prometheus-vs-cloudwatch-for-cloud-native-applications-updated-in-2024-4na3</guid>
      <description>&lt;p&gt;Many companies are &lt;a href="https://www.infracloud.io/kubernetes-consulting-partner/" rel="noopener noreferrer"&gt;moving to Kubernetes&lt;/a&gt; as the platform of choice for running software workloads. When an organization using VMs in AWS earlier decides to move to Kubernetes (Either EKS or self-managed in AWS), one of the questions that come up is whether one should continue to use Amazon CloudWatch or switch to some other tool like Prometheus? Many organizations think managing Prometheus can be challenging and bring more overheads. Recently, AWS launched a managed Prometheus service for such organizations. While CloudWatch vs Prometheus is not an exact apple to apple comparison, there are reasons to explore this and choose tooling that is built for the future. This post will explore various aspects and pros and cons, including costs of all three options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why compare Prometheus and CloudWatch?
&lt;/h2&gt;

&lt;p&gt;Prometheus and Amazon CloudWatch are very different in the problem they solve, and a 1-1 comparison may seem unfair, but as you start &lt;a href="https://www.infracloud.io/cloud-native-consulting/" rel="noopener noreferrer"&gt;moving to cloud-native stack&lt;/a&gt;, Prometheus starts popping up in conversations and for many right reasons. Before we start comparing the two technologies, let’s do a quick high-level overview of both.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch overview
&lt;/h2&gt;

&lt;p&gt;AWS CloudWatch is a native service within the suite of AWS services. It helps collect metrics, logs, and events, enabling users to visualize data through customizable dashboards, set alarms for operational thresholds, and respond to system-wide performance changes. In very simplistic terms, CloudWatch acts as a metrics sink to which AWS services publish metrics. These metrics are then used to configure alarms and statistics. The services that publish metrics to CloudWatch are listed in the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;documentation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhgrpwi301zmcvn8ev3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhgrpwi301zmcvn8ev3w.png" alt="AWS CloudWatch overview" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Working of AWS CloudWatch
&lt;/h3&gt;

&lt;p&gt;An application or service publishes metrics (a set of data points ordered by time) in a namespace (a construct that isolates metrics, e.g., &lt;em&gt;AWS/EC2&lt;/em&gt; or &lt;em&gt;AWS/APIGateway&lt;/em&gt;) with dimensions (identifying key/value pairs used to filter and look up metrics). Statistics are the aggregations performed over the reported unit (value) of the metrics over a period of time.&lt;/p&gt;
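
&lt;p&gt;For example, a custom metric can be published into a namespace with the AWS CLI. The namespace, metric name, and dimensions below are hypothetical placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;aws cloudwatch put-metric-data \
  --namespace "MyApp" \
  --metric-name RequestLatency \
  --dimensions Service=checkout,Environment=prod \
  --value 42 \
  --unit Milliseconds
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;CloudWatch can then compute statistics (such as Average or p99) over this metric for any period, filtered by those dimensions.&lt;/p&gt;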

&lt;h2&gt;
  
  
  What is Prometheus?
&lt;/h2&gt;

&lt;p&gt;Prometheus is an application used for &lt;a href="https://www.infracloud.io/blogs/prometheus-architecture-metrics-use-cases/" rel="noopener noreferrer"&gt;monitoring and alerting&lt;/a&gt;, typically paired with &lt;a href="https://www.infracloud.io/grafana-consulting/" rel="noopener noreferrer"&gt;Grafana for dashboarding&lt;/a&gt;. It’s a popular, actively maintained open source project under the CNCF, and it enjoys a lot of community support and integrations with other applications in the monitoring ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1kfgtn041ayxy7eapjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1kfgtn041ayxy7eapjv.png" alt="Prometheus architecture overview" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Working of Prometheus
&lt;/h3&gt;

&lt;p&gt;An application exposes metrics at a specific endpoint (or uses an exporter to do so), which the Prometheus server scrapes; the server also acts as a sink for storing metric time-series data. Prometheus works on a pull-based mechanism: it scrapes the metrics that applications expose at a specific endpoint. It also provides a Pushgateway component, which allows short-lived jobs such as cron and batch jobs to export their metrics. Prometheus exporters help applications expose metrics in the Prometheus format, and the Alertmanager component handles alert routing and management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Managed Service for Prometheus (AMP)
&lt;/h2&gt;

&lt;p&gt;Amazon Managed Service for Prometheus is a fully managed, scalable, and highly available service that enables organizations to monitor their containerized applications. Built on the open source Prometheus project, it allows users to ingest, store, query, and visualize time series metrics from sources such as Kubernetes clusters. AMP integrates seamlessly with AWS services like Amazon EKS, ECS, and Fargate, making it a good fit for cloud native environments. It lets users observe their distributed applications without handling the complexity of scaling or maintaining their own Prometheus infrastructure. AWS also offers secure integration with IAM for fine-grained access control, making it easier to manage permissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How AMP Works
&lt;/h3&gt;

&lt;p&gt;AWS Managed Prometheus works by ingesting metrics from Kubernetes clusters and other data sources through the Prometheus Remote Write API. AMP does not natively scrape operational metrics from containerized workloads in a Kubernetes cluster. Instead, users need to deploy and manage a standard Prometheus server or use an &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry agent&lt;/a&gt;, such as the &lt;a href="https://aws-otel.github.io/docs/getting-started/collector" rel="noopener noreferrer"&gt;AWS Distro for OpenTelemetry Collector&lt;/a&gt;, within their cluster to handle the metric scraping. Once ingested, the metrics are stored in a scalable, multi-tenant, and highly available time-series database. Users can query the stored metrics using PromQL, Prometheus' powerful query language, to analyze application performance and resource utilization. AWS handles all the infrastructure management tasks, such as scaling, patching, and backup. Users can visualize the metrics using Amazon Managed Grafana or other compatible dashboards.&lt;/p&gt;
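&lt;p&gt;For illustration, an in-cluster Prometheus server can forward its scraped metrics to AMP with a &lt;code&gt;remote_write&lt;/code&gt; section along these lines; the workspace ID and region are placeholders, not real values:&lt;/p&gt;

```yaml
# Illustrative remote_write block pointing at an AMP workspace.
# The workspace ID and region are placeholders.
remote_write:
  - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write"
    sigv4:
      region: us-east-1           # AMP requires SigV4-signed requests
    queue_config:
      max_samples_per_send: 1000  # optional batching tweak
```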

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7oeji1y9x19y7vxz9fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7oeji1y9x19y7vxz9fh.png" alt="Working of AMP" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s compare CloudWatch and Prometheus
&lt;/h2&gt;

&lt;p&gt;We will compare the two technologies, primarily aimed at Kubernetes-based cloud-native platforms, and explore which one fits which use cases better. One of the goals is also to work out the cost of each tool, using a mid-sized cluster as the reference. When we say “cost” here, it is not just the dollar value; there are multiple aspects to it. We will include a comparison table to give you a brief idea of the cost incurred by all three tools (CloudWatch, self-managed Prometheus, and AMP).&lt;/p&gt;

&lt;h3&gt;
  
  
  Metric support in CloudWatch vs Prometheus
&lt;/h3&gt;

&lt;p&gt;In CloudWatch, you can have core metrics or custom metrics at standard resolution (one-minute granularity) or high resolution (one-second granularity). For pod autoscaling to work, you will have to write an adapter and publish the metrics to the &lt;a href="https://www.infracloud.io/kubernetes-school/basics-of-kubernetes/how-does-kube-api-server-work/" rel="noopener noreferrer"&gt;Kubernetes API server&lt;/a&gt;. For integration with other tools such as PagerDuty, you will have to go through AWS SQS or similar tooling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.infracloud.io/prometheus-monitoring-support/" rel="noopener noreferrer"&gt;Prometheus supports&lt;/a&gt; scraping Kubernetes metrics natively and is well integrated with the API server for autoscaling. It also comes with support for custom metrics, and the most commonly used use cases will have a community developed adapter ready to use.&lt;/p&gt;

&lt;p&gt;One area where CloudWatch still needs to be used in combination with Prometheus is managed services such as RDS. The metrics from RDS are exported directly to CloudWatch, and from there you can use a Prometheus exporter to bring the metrics into Prometheus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting in Prometheus vs CloudWatch
&lt;/h3&gt;

&lt;p&gt;Alerting in Prometheus and Amazon CloudWatch takes different approaches suited to their ecosystems. Prometheus uses Alertmanager to handle alerts based on custom PromQL queries, offering flexibility in defining rules and integrating with various notification channels like Slack and email. This setup is ideal for complex, dynamic environments. Note that AMP takes a similar approach, but with the Alertmanager provided through the Amazon Managed Grafana service. CloudWatch, on the other hand, employs CloudWatch Alarms to monitor metrics and logs, triggering notifications or automated actions when thresholds are met. It integrates with AWS services and supports notifications through Amazon SNS and automated responses via AWS Lambda. While CloudWatch provides a seamless experience within AWS, Prometheus excels with granular, flexible alerting capabilities that suit diverse or multi-cloud setups.&lt;/p&gt;
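&lt;p&gt;To make the Prometheus side concrete, an alerting rule is a PromQL expression with a duration and labels, which Alertmanager then routes; the metric, threshold, and labels below are illustrative examples:&lt;/p&gt;

```yaml
# Illustrative Prometheus alerting rule; threshold and labels are examples.
groups:
  - name: example-alerts
    rules:
      - alert: HighPodMemory
        expr: container_memory_working_set_bytes{job="kubelet"} > 1e9
        for: 5m                    # must hold for 5 minutes before firing
        labels:
          severity: warning        # used by Alertmanager routing
        annotations:
          summary: "Pod memory above 1 GiB for 5 minutes"
```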

&lt;h3&gt;
  
  
  Querying and Dashboarding in Prometheus vs CloudWatch
&lt;/h3&gt;

&lt;p&gt;Querying and dashboarding is one area where Prometheus, with &lt;a href="https://www.infracloud.io/grafana-consulting/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; as the dashboarding tool and the &lt;a href="https://www.infracloud.io/blogs/promql-prometheus-guide/" rel="noopener noreferrer"&gt;Prometheus query language&lt;/a&gt; behind it, wins hands down: the dashboards are backed by a rich query language, and the overall dashboarding experience is far better. Also, CloudWatch and Amazon Managed Grafana charge for each additional dashboard, whereas in a self-managed Prometheus setup there is no reason not to create a dashboard whenever you need one.&lt;/p&gt;
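&lt;p&gt;To give a flavor of that query language, here are two typical PromQL expressions of the kind you might put behind Grafana panels; they use the standard cAdvisor and node-exporter metric names:&lt;/p&gt;

```promql
# Per-pod CPU usage in cores, averaged over the last 5 minutes
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Fraction of each node's memory currently in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```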

&lt;h2&gt;
  
  
  The big question: cost in CloudWatch vs Prometheus
&lt;/h2&gt;

&lt;p&gt;The final area that we want to discuss is the cost comparison of the three tools covered in this post. The cost here is not just the dollar value paid for infrastructure but also the engineering bandwidth and skills required. There are other aspects too, such as cloud lock-in, multi-cloud deployments, and hybrid deployments. Let’s start with pure data-driven costs and then dive into the other areas. Please note that the cost comparison is based on the pricing of CloudWatch and Prometheus in September 2024 and should be confirmed before making a business decision.&lt;/p&gt;

&lt;p&gt;When evaluating Prometheus and CloudWatch for Kubernetes monitoring, it's helpful to start with a few assumptions based on a medium-sized organization operating in a single environment. For this comparison, we’ve used data from a 100-node Amazon EKS (Elastic Kubernetes Service) cluster with over 4,000 running pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment setup
&lt;/h3&gt;

&lt;h4&gt;
  
  
  CloudWatch
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: Amazon CloudWatch Agents are deployed as DaemonSets on all nodes in the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics collection&lt;/strong&gt;: These agents collect cluster metrics and export them to the ContainerInsights namespace in CloudWatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention period&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-update-extended-metrics-retention-user-interface-update/" rel="noopener noreferrer"&gt;CloudWatch retains these metrics for approximately 63 days with a scraping interval of 5 minutes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric count&lt;/strong&gt;: For our EKS cluster, we recorded around 19,000 metrics across various namespaces, such as ContainerInsights, EC2, AutoScaling Group, and EBS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Prometheus and Thanos
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: Prometheus is deployed on the EKS cluster alongside Thanos for long-term storage, along with all other components, including node exporters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention period&lt;/strong&gt;: Prometheus is configured with a 3-day in-memory retention period for metrics. Thanos then stores these metrics in an Amazon S3 bucket for long-term storage (for our comparison, it will be for 2 months).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We compared these metrics monitoring tools with a reference setup of 100 nodes and a storage duration of 60 days (2 months). Below is a summary of the key parameters used for this comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Parameter&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;AMP&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Number of Nodes
   &lt;/td&gt;
   &lt;td&gt;100
   &lt;/td&gt;
   &lt;td&gt;100
   &lt;/td&gt;
   &lt;td&gt;100
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Time Series Definition
   &lt;/td&gt;
   &lt;td&gt;Metrics x labels
   &lt;/td&gt;
   &lt;td&gt;Namespace x Metrics x dimension
   &lt;/td&gt;
   &lt;td&gt;Metrics x labels
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Time Series
   &lt;/td&gt;
   &lt;td&gt;~1.5 M
   &lt;/td&gt;
   &lt;td&gt;~19 K
   &lt;/td&gt;
   &lt;td&gt;~1.5 M
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Metrics Storage Duration
   &lt;/td&gt;
   &lt;td&gt;60 days
   &lt;/td&gt;
   &lt;td&gt;60 days
   &lt;/td&gt;
   &lt;td&gt;60 days
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Scrape Duration
   &lt;/td&gt;
   &lt;td&gt;20 s
   &lt;/td&gt;
   &lt;td&gt;5 min
   &lt;/td&gt;
   &lt;td&gt;20 s
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since CloudWatch is a managed service, there is minimal overhead in managing resources, aside from the CloudWatch Agent DaemonSets. In contrast, using Prometheus requires managing the Prometheus and Thanos StatefulSets, as well as additional components like Node Exporter DaemonSets for scraping cluster metrics. The table below highlights the major components to consider for resource management.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Resource&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;AMP&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Agents (DaemonSets)
   &lt;/td&gt;
   &lt;td&gt;1 GiB x 100 Nodes
   &lt;/td&gt;
   &lt;td&gt;512 MiB x 100 Nodes
   &lt;/td&gt;
   &lt;td&gt;1 GiB x 100 Nodes
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Prometheus Pod
   &lt;/td&gt;
   &lt;td&gt;~25 GiB
   &lt;/td&gt;
   &lt;td rowspan="6"&gt;Not Applicable
   &lt;/td&gt;
   &lt;td rowspan="6"&gt;Not Applicable
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Retention Period
   &lt;/td&gt;
   &lt;td&gt;3 days
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Prometheus EBS Storage
   &lt;/td&gt;
   &lt;td&gt;~40 GiB
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Thanos Pod
   &lt;/td&gt;
   &lt;td&gt;~20 GiB
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Thanos EBS Storage
   &lt;/td&gt;
   &lt;td&gt;~40 GiB
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;S3 Storage
   &lt;/td&gt;
&lt;td&gt;~1 TB (500 GB x 2 months)
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Based on these resources, we conducted a cost analysis. For CloudWatch, the cost calculation is straightforward and can be directly estimated using the &lt;a href="https://calculator.aws" rel="noopener noreferrer"&gt;AWS Cost Calculator&lt;/a&gt; by inputting the number of metrics. We included the cost of the agents, assuming a t4g.nano instance type, which offers the lowest cost for provisioning 512 MiB of memory. For Prometheus and Thanos, we used r6g.2xlarge instances to accommodate both StatefulSets. For node exporters, we assumed t4g.micro instances, which provide the lowest cost for provisioning 1 GiB of memory. EBS and S3 storage costs were calculated according to the resource usage in the table above. The following table presents a cost comparison of the options under discussion.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Cost Component&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;AMP&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Agents
   &lt;/td&gt;
   &lt;td&gt;$613.2 (Considering t4g.micro instance)
   &lt;/td&gt;
   &lt;td&gt;$306.6 (Considering t4g.nano instance)
   &lt;/td&gt;
   &lt;td&gt;$613.2 (Considering t4g.micro instance)
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;AWS Calculator Estimate
   &lt;/td&gt;
   &lt;td&gt;Not Applicable
   &lt;/td&gt;
   &lt;td&gt;$7,800
   &lt;/td&gt;
   &lt;td&gt;$14,227.86
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Instance (r6g.2xlarge for 2 months x 2 instances)
   &lt;/td&gt;
   &lt;td&gt;$1,177.34
   &lt;/td&gt;
   &lt;td rowspan="3"&gt;Not Applicable
   &lt;/td&gt;
   &lt;td rowspan="3"&gt;Not Applicable
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;S3
   &lt;/td&gt;
   &lt;td&gt;$47.1
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;EBS
   &lt;/td&gt;
   &lt;td&gt;$59.6
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Total Cost
   &lt;/td&gt;
   &lt;td&gt;$1,897.24
   &lt;/td&gt;
   &lt;td&gt;$8,106.60
   &lt;/td&gt;
   &lt;td&gt;$14,841.06
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;
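&lt;p&gt;As a sanity check, the per-column totals above can be reproduced with a few lines of Python; the figures are the same September 2024 estimates quoted in the table:&lt;/p&gt;

```python
# Reproduce the total-cost rows of the comparison table (2-month figures, USD).
prometheus = {
    "agents": 613.20,       # 100 x t4g.micro node exporters
    "instances": 1177.34,   # 2 x r6g.2xlarge for Prometheus + Thanos
    "s3": 47.10,
    "ebs": 59.60,
}
cloudwatch = {"agents": 306.60, "aws_estimate": 7800.00}
amp = {"agents": 613.20, "aws_estimate": 14227.86}

for name, costs in [("Prometheus", prometheus), ("CloudWatch", cloudwatch), ("AMP", amp)]:
    print(f"{name}: ${sum(costs.values()):,.2f}")
# Prometheus: $1,897.24 / CloudWatch: $8,106.60 / AMP: $14,841.06
```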

&lt;p&gt;When considering monitoring tools, it's also important to factor in the cost of dashboards and visualizations, since they play a significant role in monitoring effectiveness and their cost varies with how often metrics are fetched and displayed. The key driver is the number of requests made to fetch metrics each month. Let’s break this down with some assumptions. Suppose you have 100 dashboards, each used by 100 users, and each dashboard refreshes every minute. Each dashboard then generates 60 requests per hour per user, or 1,440 requests per day. Over a month, this totals around 432 million requests (60 requests/hour × 24 hours/day × 30 days × 100 dashboards × 100 users).&lt;/p&gt;
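&lt;p&gt;The request arithmetic from these assumptions is easy to verify; the dashboard count, user count, and refresh interval are the stated assumptions, not measurements:&lt;/p&gt;

```python
# Monthly metric-fetch requests under the stated assumptions.
refreshes_per_hour = 60   # one dashboard refresh per minute
hours_per_day = 24
days_per_month = 30
dashboards = 100
users = 100

monthly_requests = refreshes_per_hour * hours_per_day * days_per_month * dashboards * users
print(f"{monthly_requests:,}")  # 432,000,000, i.e. ~432 million requests
```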

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;AMP&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Monthly requests
   &lt;/td&gt;
   &lt;td colspan="3"&gt;432 million requests
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Grafana
   &lt;/td&gt;
&lt;td&gt;t3.medium (self-managed), $30.37
   &lt;/td&gt;
   &lt;td&gt;NA
   &lt;/td&gt;
   &lt;td&gt;Considering AWS managed Grafana
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;EBS (Volume for Grafana)
   &lt;/td&gt;
   &lt;td&gt;$7.45
   &lt;/td&gt;
   &lt;td&gt;NA
   &lt;/td&gt;
   &lt;td&gt;NA
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;S3
   &lt;/td&gt;
   &lt;td&gt;$2,160 (PUT, COPY, POST, LIST requests to S3 Standard)
   &lt;/td&gt;
   &lt;td&gt;NA
   &lt;/td&gt;
   &lt;td&gt;NA
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;AWS Calculator Estimate
   &lt;/td&gt;
   &lt;td&gt;NA
   &lt;/td&gt;
   &lt;td&gt;$4,320 (GetMetricData: 432M metrics requested) + $291 (100 dashboards)
   &lt;/td&gt;
   &lt;td&gt;$509 (1 Admin and 100 viewers)
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Total
   &lt;/td&gt;
   &lt;td&gt;$2197.82
   &lt;/td&gt;
   &lt;td&gt;$4611
   &lt;/td&gt;
   &lt;td&gt;$509
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Total (2 months)
   &lt;/td&gt;
   &lt;td&gt;$4395.64
   &lt;/td&gt;
   &lt;td&gt;$9222
   &lt;/td&gt;
   &lt;td&gt;$1,018
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Total with Metrics (from previous table)
   &lt;/td&gt;
   &lt;td&gt;$6292.88
   &lt;/td&gt;
   &lt;td&gt;$17,328.6
   &lt;/td&gt;
   &lt;td&gt;$15,859.06
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Monthly Cost
   &lt;/td&gt;
   &lt;td&gt;$3,146.44
   &lt;/td&gt;
   &lt;td&gt;$8,664.3
   &lt;/td&gt;
   &lt;td&gt;$7,929.53
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzif5n9nhekcy64fjcfti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzif5n9nhekcy64fjcfti.png" alt="Self-Managed-Prometheus pricing" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvxff4pfud0jnvlb5l9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvxff4pfud0jnvlb5l9w.png" alt="AWS CloudWatch pricing" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p2babzsv6iuz4w65y4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p2babzsv6iuz4w65y4u.png" alt="AWS managed Prometheus pricing" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These pricing estimates are provided as JSON &lt;a href="https://github.com/infracloudio/cloudwatch-prometheus-blog" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To be fair, this is again not an exact apples-to-apples comparison. The cost of the engineering effort and skills needed to run self-managed Prometheus has to be accounted for. Also, the instance costs considered here are for on-demand instances and would be lower for reserved or spot instances, which we are not taking into account. Overall, it is clear from the calculation that running a self-managed Prometheus stack gets cheaper beyond a certain threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pricing estimates&lt;/strong&gt;: The prices mentioned in this blog are calculated using the AWS Pricing Calculator and are accurate as of the time of writing. However, AWS pricing may change over time, so it's recommended to verify current costs through the AWS Pricing Page or directly in the AWS Pricing Calculator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Discounts and offers&lt;/strong&gt;: AWS frequently offers discounts or credits, especially for long-term commitments (e.g., Reserved Instances or Savings Plans). Depending on the usage and business agreements, the actual cost may be lower than the on-demand pricing used in this comparison.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Additional costs&lt;/strong&gt;: This comparison focuses on the core infrastructure and monitoring services. Additional services such as data transfer, storage, or advanced features (like cross-region metrics replication or higher retention periods) might incur extra costs, depending on the specific use case and setup.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key takeaways - Prometheus vs CloudWatch
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you are starting a new organization or a new product, you may well want to use a managed service such as CloudWatch or Amazon Managed Service for Prometheus.&lt;/li&gt;
&lt;li&gt;Once you &lt;a href="https://www.infracloud.io/cloud-native-consulting/" rel="noopener noreferrer"&gt;start embracing cloud-native technologies&lt;/a&gt; and thinking about multi-cloud or cloud-agnostic infrastructure, it is better to start considering Prometheus.&lt;/li&gt;
&lt;li&gt;If you are operating at any non-trivial scale, there can be benefits to using Prometheus, both economic and beyond economics, such as feature richness.&lt;/li&gt;
&lt;li&gt;For some AWS managed services, such as RDS, you will need to use CloudWatch for native monitoring and then use Prometheus exporters to get the data into your Prometheus servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We would love to hear the stories of using CloudWatch vs Prometheus or other services and how you make the decision of choosing one over the other. Let’s start a conversation with &lt;a href="https://www.linkedin.com/in/hrishikesh-deodhar/" rel="noopener noreferrer"&gt;Hrishikesh&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/ruturaj-kadikar/" rel="noopener noreferrer"&gt;Ruturaj&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking for help with observability stack implementation and consulting? Do check out how we’re helping startups &amp;amp; enterprises as an &lt;a href="https://www.infracloud.io/observability-consulting/" rel="noopener noreferrer"&gt;observability consulting services provider&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Building Resilience with Chaos Engineering and Litmus</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Sat, 10 Jun 2023 13:51:07 +0000</pubDate>
      <link>https://dev.to/infracloud/building-resilience-with-chaos-engineering-and-litmus-3cpl</link>
      <guid>https://dev.to/infracloud/building-resilience-with-chaos-engineering-and-litmus-3cpl</guid>
      <description>&lt;p&gt;Microservices architecture is a popular choice for businesses today due to its scalability, agility, and continuous delivery. However, microservices architectures are not immune to outages. Outages can be caused by a variety of factors, including network communication, inter-service dependencies, external dependencies, and scalability issues.&lt;/p&gt;

&lt;p&gt;Several well-known companies, such as &lt;a href="https://www.spiceworks.com/tech/tech-general/news/slack-outage-service-disruption/#:~:text=On%20Tuesday%2C%20the%20business%20communication,out%20from%20the%20desktop%20application."&gt;Slack&lt;/a&gt;, &lt;a href="https://techcrunch.com/2023/05/04/twitters-mobile-web-app-is-currently-down-for-some/"&gt;Twitter&lt;/a&gt;, &lt;a href="https://www.usatoday.com/story/money/markets/2021/06/30/robinhood-fined-70-million-over-outages-and-misleading-customers/7811627002/"&gt;Robinhood Trading&lt;/a&gt;, &lt;a href="https://www.crn.com/news/cloud/amazon-alexa-down-more-than-15-000-reports-of-outages"&gt;Amazon&lt;/a&gt;, &lt;a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-365-hit-by-new-outage-causing-connectivity-issues/"&gt;Microsoft&lt;/a&gt;, &lt;a href="https://www.financialexpress.com/life/technology-google-outage-services-back-up-for-us-users-confirms-company-3053719/"&gt;Google&lt;/a&gt;, and &lt;a href="https://www.crn.com/news/cloud/the-10-biggest-cloud-outages-of-2022-so-far-"&gt;many more&lt;/a&gt; have recently experienced &lt;a href="https://uptimeinstitute.com/about-ui/press-releases/2022-outage-analysis-finds-downtime-costs-and-consequences-worsening"&gt;outages that caused significant downtime costs&lt;/a&gt;. These outages highlight the diverse sources of outages in microservices architectures, which can range from configuration errors and database issues to infrastructure scaling failures and code problems.&lt;/p&gt;

&lt;p&gt;To minimize the impact of outages and improve system availability, businesses should prioritize resiliency principles in the design, development, and operation of microservices architectures. In this article, we will learn how to improve the resiliency of a system with the help of chaos engineering to minimize system outages. &lt;br&gt;
I recently spoke at &lt;a href="https://chaoscarnival.io/"&gt;Chaos Carnival&lt;/a&gt; on the same topic, you can also watch my talk &lt;a href="https://www.infracloud.io/cloud-native-talks/testing-resiliency-within-beyond-kubernetes/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Chaos Engineering?
&lt;/h2&gt;

&lt;p&gt;Chaos engineering is a method for testing the resiliency and reliability of complex systems by intentionally introducing controlled failures into them. The goal of chaos engineering is to identify and highlight faults in a system before they can cause real-world problems such as outages, data loss, or security breaches.&lt;/p&gt;

&lt;p&gt;This is done by simulating various failure scenarios, such as network outages, server failures, or unexpected spikes in traffic, and observing how the system responds. By intentionally inducing failure in a controlled environment, chaos engineering enables teams to better understand the limits and failure domains of their systems and develop strategies to mitigate or avoid such failures in the future.&lt;/p&gt;

&lt;p&gt;Many big companies like Netflix, Amazon, Google, and Microsoft are emphasizing chaos engineering as a crucial part of site reliability. &lt;a href="https://netflixtechblog.com/tagged/chaos-engineering"&gt;Netflix introduced chaos-testing tools like Chaos Monkey, Chaos Kong, and ChAP&lt;/a&gt; at different infrastructure levels to maintain their SLAs. Amazon incorporated the concept of &lt;a href="https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.gameday.en.html"&gt;Gamedays&lt;/a&gt; into their &lt;a href="https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/index.en.html"&gt;AWS Well-Architected Framework&lt;/a&gt;, wherein various teams collaborate and run chaos experiments in their environment to educate themselves and reinforce system knowledge in order to increase overall reliability.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Resiliency Testing?
&lt;/h2&gt;

&lt;p&gt;Resiliency testing is primarily concerned with evaluating a system's ability to recover from disruptions or failures and continue to function as intended. The goal of resiliency testing is to improve the overall reliability and availability of a system and minimize the impact of potential disruptions or failures. By identifying and addressing potential vulnerabilities or weaknesses in system design or implementation, resiliency testing can help ensure that the system continues to function in the face of unexpected events or conditions.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why should I test Resiliency?
&lt;/h2&gt;

&lt;p&gt;Resiliency testing is essential for a number of reasons. Here are a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoiding costly downtime:&lt;/strong&gt; Resiliency testing helps identify potential points of failure in a system that can lead to costly downtime if not addressed. By testing a system's ability to recover from disruptions or failures, you can ensure that it'll continue to function as intended even when unexpected events occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increase reliability:&lt;/strong&gt; It will help improve the overall reliability of a system. By identifying and addressing potential vulnerabilities, you can build a more robust and resilient system that is less likely to fail or be disrupted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improve the user experience:&lt;/strong&gt; A system that is resilient and can recover quickly from disruptions or outages is likely to provide a better user experience. It's less likely to experience outages or data loss, which can increase user satisfaction with the system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance requirements:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/pulse/cloud-resilience-testing-ana-biazetti/"&gt;Many industries and regulations require that systems have a certain level of resilience and uptime&lt;/a&gt;. You can use this testing to ensure your system meets these requirements and avoids potential legal or regulatory issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, testing resiliency is important to ensure that your system is reliable, available, and able to recover quickly from failures or outages. By identifying and fixing potential failure points, you can build a more robust and resilient system that provides a better user experience and meets regulatory requirements.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why should I test Resiliency in Kubernetes?
&lt;/h2&gt;

&lt;p&gt;Testing resiliency in Kubernetes is important because Kubernetes is a complex and distributed system designed for large-scale, mission-critical applications. Kubernetes provides many features to ensure resiliency, such as &lt;a href="https://www.infracloud.io/blogs/3-autoscaling-projects-optimising-kubernetes-costs/"&gt;automatic Kubernetes scaling&lt;/a&gt;, self-healing, and rolling updates, but it's still possible for a Kubernetes cluster to experience glitches or failures.&lt;/p&gt;

&lt;p&gt;Here are the top reasons why we should test resiliency in Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Underlying infrastructure is critical to your application:&lt;/strong&gt; If your application relies on Kubernetes to manage and orchestrate its components, any disruption to the Kubernetes cluster can lead to downtime or data loss. You can use resiliency testing to ensure that your Kubernetes cluster can recover from disruptions and continue to function as intended.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed system:&lt;/strong&gt; Kubernetes consists of many components such as nodes, controllers, and APIs that work together to create a unified platform for deploying and managing applications. Auditing the resilience of Kubernetes can help identify potential points of failure in this complex system and ensure that it can recover from disruptions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constant evolution:&lt;/strong&gt; Kubernetes is a rapidly evolving platform, with new features and updates being released regularly. You can use resiliency testing to ensure that your Kubernetes cluster can handle these changes and updates without downtime or disruption.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Considering all of this, testing resiliency in Kubernetes is important to ensure that your application can handle interruptions and continue to function as intended.&lt;/p&gt;
&lt;h2&gt;
  
  
  Chaos vs Resiliency vs Reliability
&lt;/h2&gt;

&lt;p&gt;Chaos, resiliency, and reliability are related concepts, but they aren't interchangeable. Here you'll find an overview of each concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chaos:&lt;/strong&gt; Chaos is the intentional introduction of controlled failures or disruptions into a system to test its resilience and identify potential vulnerabilities. Chaos engineering is a method of simulating these failures and evaluating the system's response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resiliency:&lt;/strong&gt; Resiliency refers to the ability of a system to recover from disruptions or failures and continue to function as intended. Resilience testing is about evaluating a system's ability to recover from failures and identify potential failure points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; Reliability refers to the consistency and predictability of a system's performance over time. A reliable system can be relied upon to perform as intended, without unexpected failures or interruptions. Reliability is typically measured in terms of uptime, availability, and mean time between failures (MTBF).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a nutshell, chaos engineering is a way to deliberately inject failures into your system to test resilience, which is the ability of a system to recover from failures, while reliability is a measure of the consistent and predictable performance of a system over time. All three concepts are important for building and maintaining robust and trustworthy systems, and each plays a different role in ensuring the overall quality and resilience of a system.&lt;/p&gt;
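&lt;p&gt;Since reliability is usually quantified, a worked example helps. With an MTBF of 720 hours and a mean time to repair (MTTR) of 0.5 hours (illustrative numbers, not from any system in this article), steady-state availability is MTBF / (MTBF + MTTR):&lt;/p&gt;

```shell
# Steady-state availability from MTBF and MTTR (illustrative numbers):
#   availability = MTBF / (MTBF + MTTR)
awk 'BEGIN { mtbf = 720; mttr = 0.5; printf "%.4f%%\n", 100 * mtbf / (mtbf + mttr) }'
```

&lt;p&gt;This prints &lt;code&gt;99.9306%&lt;/code&gt;, i.e. slightly better than "three nines" of availability.&lt;/p&gt;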

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LgILfmMJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1pd0fv9svnh4w9yyprkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LgILfmMJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1pd0fv9svnh4w9yyprkp.png" alt="chaos-resilience-reliability" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What Tools Are Available to Test System Resiliency?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/litmuschaos/litmus"&gt;Litmus&lt;/a&gt;, &lt;a href="https://www.gremlin.com/chaos-engineering/"&gt;Gremlin&lt;/a&gt;, &lt;a href="https://chaos-mesh.org/"&gt;Chaos Mesh&lt;/a&gt;, and &lt;a href="https://netflix.github.io/chaosmonkey/"&gt;Chaos Monkey&lt;/a&gt; are all popular open-source tools used for chaos engineering. As we will be using AWS cloud infrastructure, we will also explore AWS &lt;a href="https://aws.amazon.com/fis/"&gt;Fault Injection Simulator&lt;/a&gt; (FIS). While they share the same goals of testing and improving the resilience of a system, there are some differences between them. Here are some comparisons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Chaos Mesh&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Chaos Monkey&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Litmus&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Gremlin&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AWS FIS&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes-native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-native&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (AWS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Baremetal&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in Library&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Extensive&lt;/td&gt;
&lt;td&gt;Extensive&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customization&lt;/td&gt;
&lt;td&gt;Using YAML&lt;/td&gt;
&lt;td&gt;Using YAML&lt;/td&gt;
&lt;td&gt;Using Operator&lt;/td&gt;
&lt;td&gt;Using DSL&lt;/td&gt;
&lt;td&gt;Using &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/documents.html"&gt;SSM docs&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bottom line is that while all of these tools share similar features, we chose Litmus because it provides the flexibility to leverage &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ssm-document.html"&gt;AWS SSM documents&lt;/a&gt; to execute chaos in our AWS infrastructure. &lt;br&gt;
Now let’s see how we can use Litmus to execute chaos, such as terminating pods and EC2 instances, in Kubernetes and AWS environments respectively.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installing Litmus in Kubernetes
&lt;/h2&gt;

&lt;p&gt;First, let's see how to install Litmus in Kubernetes so we can execute chaos experiments in the environment.&lt;/p&gt;

&lt;p&gt;Here are the basic installation steps for LitmusChaos:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Set up a Kubernetes cluster: LitmusChaos requires a running Kubernetes cluster. If you don't already have one set up, you can use a tool like kubeadm or kops to set up a cluster on your own infrastructure or use a managed Kubernetes service like GKE, EKS, or AKS. For this article, we will use k3d.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;k3d cluster create
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```sh
$ kubectl cluster-info

Kubernetes control plane is running at https://0.0.0.0:38537

CoreDNS is running at https://0.0.0.0:38537/api/vl/namespaces/kube-system/services/kube-dns:dns/proxy

Metrics-server is running at https://0.0.0.0:38537/api/vl/namespaces/kube-system/services/https:metrics-server:https/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster—info dump’.
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;Install Helm: Helm is a package manager for Kubernetes that you'll need to use to install Litmus. You can install Helm by following the instructions on the &lt;a href="https://helm.sh"&gt;Helm website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Add the LitmusChaos chart repository: Run the following command to add the LitmusChaos chart repository:&lt;br&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;Install LitmusChaos: Run the following command to install LitmusChaos:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;litmuschaos litmuschaos/litmus &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;litmus
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This will install the LitmusChaos control plane in the &lt;code&gt;litmus&lt;/code&gt; namespace. You can change the namespace to your liking.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Verify the installation: Run the following command to verify that LitmusChaos is running:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; litmus
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-nlitmus&lt;/span&gt;
NAME                            READY  STATUS   RESTARTS  AGE
chaos-litmus-frontend-6££c95c884-x2452  1/1   Running     0       6m22s
chaos-litmus-auth-server-b8dcdf66b-v8hf9  1/1  Running  0       6m22s
chaos-litmus-server-585786dd9c-16x37  1/1   Running     0       6m22s
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This should show the LitmusChaos control plane pods running.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Login into the Litmus portal using port-forwarding.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/chaos-litmus-frontend-service &lt;span class="nt"&gt;-nlitmus&lt;/span&gt; 9091:9091
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v5jt_ntC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wjkeyczuoo68wjw0a2o6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v5jt_ntC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wjkeyczuoo68wjw0a2o6.png" alt="Litmus chaos installed" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Once you log in, a webhook will install litmus-agent (called self-agent) components in the cluster. Verify it.

Output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```sh
$  kubectl get pods -n litmus
NAME                                    STATUS  RESTARTS   AGE   READY
chaos-litmus-frontend-6£fc95c884-x245z Running   0         9m6s  1/1
chaos-litmus-auth-server-b8dcdf66b-v8he9  Running   0       9m6s  1/1
chaos-litmus-server-585786dd9c-16x37    Running   0         9m6s  1/1
subscriber-686d9b8dd9-bjgih             Running   0         9m6s  1/1
chaos-operator-ce-84bc885775-kzwzk      Running   0         92s   1/1
chaos-exporter-6c9b5988cd-1wmpm         Running   0         94s   1/1
event-tracker-744b6fd8cf-rhrfc          Running   0         94s   1/1
workflow-controller-768b7d94dc-xr6vy    Running   0         92s   1/1
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these steps, you should have LitmusChaos installed and ready to use on your Kubernetes cluster.&lt;/p&gt;
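&lt;p&gt;If you want to script this verification (for CI or repeated chaos runs), a small sketch like the one below asserts that every pod is Running. It is fed captured output via a here-doc so it is self-contained; in practice you would pipe in &lt;code&gt;kubectl get pods -n litmus --no-headers&lt;/code&gt; instead.&lt;/p&gt;

```shell
# Assert that every pod in the namespace is Running.
# Self-contained sample input; replace the here-doc with:
#   kubectl get pods -n litmus --no-headers
pods=$(cat <<'EOF'
chaos-litmus-frontend-6ffc95c884-x245z   1/1   Running   0   9m6s
chaos-litmus-server-585786dd9c-16x37     1/1   Running   0   9m6s
subscriber-686d9b8dd9-bjgjh              1/1   Running   0   9m6s
EOF
)
# Column 3 of each row is the pod STATUS; count rows that are not Running.
bad=$(printf '%s\n' "$pods" | awk '$3 != "Running"' | wc -l)
if [ "$bad" -eq 0 ]; then
  echo "all pods Running"
else
  echo "$bad pod(s) not Running" >&2
  exit 1
fi
```

&lt;p&gt;This prints &lt;code&gt;all pods Running&lt;/code&gt; for the sample input, and exits non-zero otherwise, which makes it easy to wire into a pipeline gate.&lt;/p&gt;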

&lt;h2&gt;
  
  
  Experimenting with chaos
&lt;/h2&gt;

&lt;p&gt;Experimenting with chaos within a cloud-native environment typically involves using a chaos engineering tool to simulate various failure scenarios and test the resilience of the system. Most cloud-native application infrastructure consists of Kubernetes and the corresponding cloud components. For this article, we will look at chaos in Kubernetes and in the cloud environment, i.e., AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaos in Kubernetes
&lt;/h3&gt;

&lt;p&gt;In order to evaluate the resilience of a Kubernetes cluster, we can test the following failure scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes node failure:&lt;/strong&gt; Simulate the failure of a Kubernetes node by shutting down a node or disconnecting it from the network. This tests whether the cluster can withstand the failure of a node and whether the affected pods can be moved to other nodes. A delay in migrating the pods from one node to another may cause a cascading failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pod failure:&lt;/strong&gt; We can simulate the failure of a pod by shutting it down or introducing a fault into the pod's container. This tests the cluster's ability to detect and recover from a pod failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network failure:&lt;/strong&gt; This consists of simulating network partitioning or network congestion to test the cluster's ability to handle communication failures between nodes and pods. You can use the &lt;a href="https://man7.org/linux/man-pages/man8/tc.8.html"&gt;Linux Traffic Control&lt;/a&gt; (tc) tool to manipulate traffic flowing in and out of your system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource saturation:&lt;/strong&gt; Simulate resource saturation, such as CPU or memory exhaustion, to test the cluster's ability to handle resource contention and prioritize critical workloads. You can use the &lt;a href="https://manpages.ubuntu.com/manpages/bionic/man1/stress-ng.1.html"&gt;stress-ng&lt;/a&gt; tool to hog memory or CPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS failure:&lt;/strong&gt; Introduce DNS failures to test the cluster's ability to resolve DNS names and handle service lookup failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cluster upgrades:&lt;/strong&gt; Simulate upgrades to the Kubernetes cluster, including the control plane and worker nodes, to test the cluster's ability to perform rolling upgrades and maintain availability during the upgrade process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
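&lt;p&gt;To make one of these scenarios concrete: with Litmus, the resource-saturation case can be expressed declaratively as a ChaosEngine. The sketch below is illustrative only; the name, service account, and label selector are assumptions that must match your cluster, and the pod-cpu-hog tunables follow the Litmus experiment documentation.&lt;/p&gt;

```yaml
# Illustrative sketch of a Litmus ChaosEngine for CPU saturation;
# metadata.name, chaosServiceAccount, and applabel are assumed values.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-cpu-hog          # hypothetical name
  namespace: litmus
spec:
  appinfo:
    appns: litmus
    applabel: app=nginx        # must match the AUT's labels
    appkind: deployment
  chaosServiceAccount: litmus-admin
  engineState: active
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"      # seconds of CPU stress
            - name: CPU_CORES
              value: "1"       # cores to hog per target pod
```

&lt;p&gt;Applying a manifest along these lines has the chaos-operator run the experiment against pods matching &lt;code&gt;applabel&lt;/code&gt;, much like scheduling the same experiment from the portal.&lt;/p&gt;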

&lt;p&gt;By testing these failure scenarios, you can identify potential vulnerabilities in the cluster's resilience and improve the system to ensure high availability and reliability.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario: Killing a Pod
&lt;/h4&gt;

&lt;p&gt;In this experiment, we will kill a pod using Litmus. We will use an Nginx deployment as the sample application under test (AUT).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create deploy nginx &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx &lt;span class="nt"&gt;-nlitmus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy &lt;span class="nt"&gt;-nlitmus&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nginx

NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           109m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to Litmus portal, and click on Home.&lt;/p&gt;

&lt;p&gt;Click on Schedule a Chaos Scenario and select Self Agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kPLbWhs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/taps3ul3lv9hqen6eolf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kPLbWhs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/taps3ul3lv9hqen6eolf.png" alt="Schedule Chaos Scenario" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, select chaos experiment from ChaosHubs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HXXvWXwy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vdnlb1u2gdn7nc0ho6zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HXXvWXwy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vdnlb1u2gdn7nc0ho6zz.png" alt="Select chaos experiment from ChaosHubs" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, name the scenario as ‘kill-pod-test’.&lt;/p&gt;

&lt;p&gt;Next, click on ‘Add a new chaos Experiment’.&lt;/p&gt;

&lt;p&gt;Choose the generic/pod-delete experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bKFf_Nbg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ox80z98cvftnme2gpps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bKFf_Nbg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ox80z98cvftnme2gpps.png" alt="Choose experiment" width="644" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tune the experiment parameters to select the correct deployment labels and namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8kToUNkI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zywhc4mbfv1smn4vl3sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8kToUNkI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zywhc4mbfv1smn4vl3sj.png" alt="Tune the selected chaos scenario" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enable Revert Schedule and click Next.&lt;/p&gt;

&lt;p&gt;Assign the required weight for the experiment; for now, we will keep 10 points.&lt;/p&gt;

&lt;p&gt;Click Schedule Now and then Finish. The execution of the Chaos Scenario will start.&lt;/p&gt;

&lt;p&gt;To view the Chaos Scenario, click on ‘Show the Chaos Scenario’.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1NwENLbg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/olt7ef3a67c3x4i76grz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1NwENLbg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/olt7ef3a67c3x4i76grz.png" alt="View Chaos scenario" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will see the Chaos Scenario and experiment CRDs getting deployed and the corresponding pods getting created.&lt;/p&gt;

&lt;p&gt;Once the Chaos Scenario is completed, you will see that the existing Nginx pod is deleted and a new pod is up and running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---qDTlEKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ydhxijbw2fwyenz2w9vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---qDTlEKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ydhxijbw2fwyenz2w9vp.png" alt="New Pod" width="800" height="156"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-nlitmus&lt;/span&gt;
NAME                                        READY   STATUS  RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z      1/1     Running   0         32m
chaos-mongodb-68f8b9444c-w2kkm              1/1     Running   0         32m
chaos-litmus-auth-server-b8dcdf66b-v8hf9    1/1     Running   0         32m
chaos-litmus-server-585786dd9c-16xj7        1/1     Running   0         32m
subscriber-686d9b8dd9-bjgjh                 1/1     Running   0         24m
chaos-operator-ce-84bc885775-kzwzk          1/1     Running   0         24m
chaos-exporter-6c9b5988c4-1wmpm             1/1     Running   0         24m
event-tracker-744b6fd8cf-rhrfc              1/1     Running   0         24m
workflow-controller-768f7d94dc-xr6vv        1/1     Running   0         24m
kill-pod-test-1683898747-869605847          0/2     Completed 0         9m36s
kill-pod-test-1683898747-2510109278         2/2     Running   0         5m49s
pod-delete-tdoklgkv-runner                  1/1     Running   0         4m29s
pod-delete-swkok2-pj48x                     1/1     Running   0         3m37s
nginx-76d6c9b8c-mnk8f                       1/1     Running   0         4m29s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can verify the series of events to understand the entire process. Some of the main events are shown below: the experiment pod is created, the nginx pod (AUT) is deleted, a new nginx pod is created, and the experiment completes successfully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get events &lt;span class="nt"&gt;-nlitmus&lt;/span&gt;                                              
66s   Normal    Started            pod/pod-delete-swkok2-pj48x                  Started container pod-delete-swkok2                                                 
62s   Normal    Awaited            chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Awaited
58s   Normal    PreChaosCheck      chaosengine/pod-delete-tdok1gkv              AUT: Running                                                                        
58s   Normal    Killing            pod/nginx-76d6c9b8c-c8vv7                    Stopping container nginx                                                            
58s   Normal    Successfulcreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-mnk8f                                                                                                             
44s   Normal    Killing            pod/nginx-76d6c9b8c-mnk8f                    Stopping container nginx                                                            
44s   Normal    Successfulcreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-kqtgq                                                  
43s   Normal    Scheduled          pod/nginx-76d6c9b8c-kqtgq                    Successfully assigned litmus/nginx-76d6c9b8c-kqtgq to k3d-k3s-default-server-0         
128   Normal    PostChaosCheck     chaosengine/pod-delete-tdok1gkv              AUT: Running                                                                        
8s    Normal    Pass               chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Pass
8s    Normal    Summary            chaosengine/pod-delete-tdok1gkv              pod-delete experiment has been Passed                                               
3s    Normal    Completed          job/pod-delete-swkok2                        Job completed  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
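&lt;p&gt;The verdict can also be pulled out of the events programmatically. The sketch below parses captured event text via a here-doc so it runs anywhere; in practice you would feed it &lt;code&gt;kubectl get events -n litmus&lt;/code&gt; output instead.&lt;/p&gt;

```shell
# Extract the final experiment verdict from chaosresult events.
# Sample input below; replace the here-doc with:
#   kubectl get events -n litmus
events=$(cat <<'EOF'
62s   Normal   Awaited   chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Awaited
8s    Normal   Pass      chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Pass
EOF
)
# The last "Result:" event wins, so an early "Awaited" is superseded by "Pass".
verdict=$(printf '%s\n' "$events" | awk '/Result:/ { v = $NF } END { if (v) print v }')
echo "pod-delete verdict: $verdict"
```

&lt;p&gt;For the sample input this prints &lt;code&gt;pod-delete verdict: Pass&lt;/code&gt;.&lt;/p&gt;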



&lt;h3&gt;
  
  
  Chaos in AWS
&lt;/h3&gt;

&lt;p&gt;Here are some potential problems that can be simulated to assess the ability of an application running on AWS to recover from failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability zone failure:&lt;/strong&gt; Simulate the failure of an availability zone in an AWS region to test the application's ability to withstand a data center failure. You can simulate this type of failure by changing the NACL or route table rules and then restoring them. &lt;br&gt;
Note: you need to be extremely cautious while performing such activities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instance failure:&lt;/strong&gt; Simulate the failure of an EC2 instance by terminating the instance to test the application's ability to handle node failures and maintain availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-Scaling group failure:&lt;/strong&gt; Simulate the failure of an auto-scaling group by suspending or terminating all instances in the group to test the application's ability to handle scaling events and maintain availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network failure:&lt;/strong&gt; Simulate network failures, such as network congestion or network partitioning, to test the application's ability to handle communication failures between nodes. You can also simulate &lt;a href="https://www.techtarget.com/searchnetworking/definition/Network-Time-Protocol"&gt;Network Time Protocol&lt;/a&gt; (NTP) desynchronization and observe the effect when one node or a set of nodes is out of sync.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database failure:&lt;/strong&gt; Simulate database failure by shutting down the database or introducing a fault into the database system to test the application's ability to handle database failures and maintain data consistency. You can check whether your backup and restore mechanisms work correctly, verify whether a secondary node is promoted to primary in case of failure, and measure how long the promotion takes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security breach:&lt;/strong&gt; Simulate security breaches, such as unauthorized access or data breaches, to test the application's ability to handle security incidents and maintain data confidentiality.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Scenario: Terminate EC2 instance
&lt;/h4&gt;

&lt;p&gt;In this scenario, we will include one chaos experiment of terminating an EC2 instance. Litmus leverages AWS SSM documents for executing experiments in AWS. For this scenario, we will require two manifest files; one for configMap consisting of the script for the SSM document and the other consisting of a complete workflow of the scenario. Both these manifest files can be found &lt;a href="https://github.com/rutu-k/litmus-ssm-docs/tree/main"&gt;here&lt;/a&gt;.&lt;/p&gt;
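&lt;p&gt;Before applying them, it is worth knowing what the workflow manifest tunes. A trimmed sketch of the experiment section (values here are illustrative; see the linked repository for the exact manifest) looks roughly like this:&lt;/p&gt;

```yaml
# Trimmed, illustrative sketch of the aws-ssm-chaos-by-id tunables;
# the full workflow manifest lives in the linked repository.
experiments:
  - name: aws-ssm-chaos-by-id
    spec:
      components:
        env:
          - name: EC2_INSTANCE_ID        # instance(s) to target
            value: "i-0da74bcaa6357ad60"
          - name: REGION                 # assumed region
            value: "us-east-1"
          - name: TOTAL_CHAOS_DURATION   # seconds of chaos overall
            value: "960"
          - name: CHAOS_INTERVAL         # seconds between injections
            value: "120"
```

&lt;p&gt;The custom SSM document from the ConfigMap is what actually terminates the instance; the experiment only orchestrates the SSM command and monitors its status.&lt;/p&gt;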

&lt;p&gt;First, apply the ConfigMap in the ‘litmus’ namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/rutu-k/litmus-ssm-docs/main/terminate-instance-cm.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, go to the Litmus portal, and click on Home.&lt;/p&gt;

&lt;p&gt;Click on Schedule a Chaos Scenario and select Self Agent (refer to the installation and Chaos in Kubernetes sections above).&lt;/p&gt;

&lt;p&gt;Now, instead of selecting a chaos experiment from ChaosHubs, we will select Import a Chaos Scenario using YAML and upload our workflow manifest.&lt;/p&gt;

&lt;p&gt;Click Next and Finish.&lt;/p&gt;

&lt;p&gt;To View the Chaos Scenario, click on Show the Chaos Scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wcCrrPOS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl5h24tb10d48jey2282.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wcCrrPOS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl5h24tb10d48jey2282.png" alt="Chaos scenario" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will see the Chaos Scenario and experiment CRDs getting deployed and the corresponding pods getting created.&lt;/p&gt;

&lt;p&gt;Verify the logs of the experiment pods. It will show the overall process and status of each step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl logs aws-ssm-chaos-by-id-vSoazu-w6tmj &lt;span class="nt"&gt;-n&lt;/span&gt; litmus &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:05:10Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Experiment Name: aws-ssm-chaos-by-id"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:05:14Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"The instance information is as follows"&lt;/span&gt; Chaos &lt;span class="nv"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;litmus Instance &lt;span class="nv"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;i-0da74bcaa6357ad60 &lt;span class="nv"&gt;Sequence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;parallel Total Chaos &lt;span class="nv"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;960
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:05:14Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Info]: The instances under chaos(IUC) are: [i-0da74bcaa6357ad60]"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:07:252"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Status]: Checking SSM command status"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:07:26Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"The ssm command status is Success"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:07:28Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Wait]: Waiting for chaos interval of 120s"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:09:28Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Info]: Target instanceID list, [i-0da74bcaa6357ad60]"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:09:28Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Chaos]: Starting the ssm command"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:09:28Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Wait]: Waiting for the ssm command to get in InProgress state”
time="&lt;/span&gt;2023-05-11T13:09:28Z&lt;span class="s2"&gt;" level=info msg="&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;Status]: Checking SSM &lt;span class="nb"&gt;command &lt;/span&gt;status”
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:09:30Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"The ssm command status is InProgress”
time="&lt;/span&gt;2023-05-11T13:09:32Z&lt;span class="s2"&gt;" level=info msg="&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;Wait]: waiting &lt;span class="k"&gt;for &lt;/span&gt;the ssm &lt;span class="nb"&gt;command &lt;/span&gt;to get completed”
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:09:32Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"[Status]: Checking SSM command status"&lt;/span&gt;
&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2023-05-11T13:09:32Z"&lt;/span&gt; &lt;span class="nv"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info &lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"The ssm command status is Success"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
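&lt;p&gt;A quick sanity check on the tunables in this log: with a total chaos duration of 960s and a 120s chaos interval, the experiment injects chaos in roughly duration/interval rounds.&lt;/p&gt;

```shell
# Rough number of chaos rounds implied by the log's tunables:
#   TOTAL_CHAOS_DURATION=960, chaos interval=120 (both in seconds)
awk 'BEGIN { duration = 960; interval = 120; printf "%d rounds\n", duration / interval }'
```

&lt;p&gt;This prints &lt;code&gt;8 rounds&lt;/code&gt;, which is a useful cross-check that the experiment ran for as long as you intended before reading its verdict.&lt;/p&gt;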



&lt;p&gt;Once the Chaos Scenario is completed, you will see that the SSM document is executed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n1RvWpDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/43g2ivv3qmsm6jci0l30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n1RvWpDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/43g2ivv3qmsm6jci0l30.png" alt="SSM document is executed" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can verify that the EC2 instance is being terminated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--86avmYZW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2b3mok5k1of5x4695trc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--86avmYZW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2b3mok5k1of5x4695trc.png" alt="EC2 instance is terminated" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do next?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design a Resiliency Framework
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EZh-4jpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qe63ot6r7lqb0a1oavnu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EZh-4jpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qe63ot6r7lqb0a1oavnu.png" alt="Resiliency Framework" width="672" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A resiliency framework refers to a structured approach or set of principles and strategies leveraging chaos engineering to build resilience and ensure overall reliability. The following is a detailed description of the typical steps or lifecycle involved in the resiliency framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define steady state:&lt;/strong&gt; The steady state of a system refers to a state where the system is in equilibrium and operating normally under typical conditions. It represents a stable and desired outcome where the system's components, services, and dependencies are functioning correctly and fulfilling their intended roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define the hypothesis:&lt;/strong&gt; In this step, you hypothesize or predict the behavior of your system when subjected to a specific kind of chaos, such as high load, failure of a specific component, or network disruptions. Suppose we have strategically distributed our workloads across four distinct availability zones (AZs) to ensure robust availability. Now, imagine a scenario where we deliberately introduce chaos into the system, causing one of the AZs to fail. In such a situation, can the system effectively handle and adapt to this unexpected event while maintaining its overall functionality?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Formulate and execute the experiment:&lt;/strong&gt; Determine the scope and parameters of the experiment. This includes identifying the specific type of chaos to be introduced (such as network latency, resource exhaustion, or random termination of pods/instances), the duration of the experiment, and any constraints or safety measures to be put in place. Implement the chaos experiment by introducing controlled disruptions or failures into the target system. The chaos should be introduced gradually and monitored closely to ensure it remains within acceptable boundaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Revert chaos:&lt;/strong&gt; Revert the chaos induced in your system and bring the system back to a steady state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify steady state:&lt;/strong&gt; Analyze the data collected during the chaos experiment to determine the system's resilience and identify any weaknesses or vulnerabilities. Compare the observed behavior with expected outcomes (hypothesis) and evaluate the system's ability to recover and maintain its desired level of performance and functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Report:&lt;/strong&gt; Document the experiment details, findings, and recommendations for future reference. Share the results with the broader team or organization to foster a culture of learning and continuous improvement. This documentation can serve as a valuable resource for future chaos engineering experiments, improve your understanding of the system, and help build institutional knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Periodic resiliency checks:&lt;/strong&gt; This is an ongoing process rather than a one-time event. Regularly repeat the above steps to validate system resilience, especially after making changes or updates to the system. Gradually scale up the complexity and intensity of the experiments as confidence in the system's resilience grows. Based on the insights gained from the experiment, make necessary adjustments and improvements to the system's architecture, design, or operational procedures. This could involve fixing identified issues, optimizing resource allocation, enhancing fault tolerance mechanisms, or refining automation and monitoring capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
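&lt;p&gt;The lifecycle above can be sketched as a simple loop. The following is a minimal illustration in which &lt;code&gt;check_steady_state&lt;/code&gt;, &lt;code&gt;inject_chaos&lt;/code&gt;, and &lt;code&gt;revert_chaos&lt;/code&gt; are hypothetical callables you would supply; this is not a real Litmus API:&lt;/p&gt;

```python
# Minimal sketch of one pass through the resiliency framework.
# All callables are hypothetical placeholders, not a real chaos-tool API.

def run_experiment(check_steady_state, inject_chaos, revert_chaos, hypothesis):
    """Run one chaos experiment and report whether the hypothesis held."""
    report = {"hypothesis": hypothesis}
    if not check_steady_state():
        report["result"] = "aborted: system not in steady state"
        return report
    inject_chaos()                      # formulate and execute the experiment
    held_during_chaos = check_steady_state()
    revert_chaos()                      # bring the system back to normal
    recovered = check_steady_state()    # verify steady state
    report["held_during_chaos"] = held_during_chaos
    report["recovered"] = recovered
    report["result"] = "passed" if (held_during_chaos and recovered) else "failed"
    return report
```

&lt;p&gt;A periodic resiliency check is then just this loop run on a schedule, with each report archived for comparison against earlier runs.&lt;/p&gt;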

&lt;h3&gt;
  
  
  Assign a Resiliency Score
&lt;/h3&gt;

&lt;p&gt;The resiliency score is a metric used to measure and quantify the level of resiliency or robustness of a system. It is typically calculated from various factors, including the system's architecture, its mean time to recover (MTTR), its mean time between failures (MTBF), redundancy measures, availability, scalability, fault tolerance, monitoring capabilities, and recovery strategies. The exact formulation varies from system to system and organization to organization, depending on their priorities and requirements.&lt;/p&gt;

&lt;p&gt;The resiliency score helps organizations evaluate their system's resiliency posture and identify areas that need improvement. A higher resiliency score indicates a more resilient system, capable of handling failures with minimal impact on its functionality and user experience. Organizations can track their progress in improving system resiliency over time by continuously measuring and monitoring the resiliency score.&lt;/p&gt;
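&lt;p&gt;As one concrete illustration, a score can be computed as a weighted combination of normalized factors. The factors, targets, and weights below are purely hypothetical; each organization would choose its own:&lt;/p&gt;

```python
# Hypothetical resiliency score: weighted average of factors normalized to [0, 1].
# The weights and target values are examples only, not a standard formula.

def resiliency_score(mttr_minutes, mtbf_hours, availability,
                     target_mttr=30.0, target_mtbf=720.0):
    # Lower MTTR is better: full marks at or below the target, decaying beyond it.
    mttr_factor = min(1.0, target_mttr / max(mttr_minutes, 1e-9))
    # Higher MTBF is better: full marks at or above the target.
    mtbf_factor = min(1.0, mtbf_hours / target_mtbf)
    weights = {"mttr": 0.4, "mtbf": 0.3, "availability": 0.3}
    score = (weights["mttr"] * mttr_factor
             + weights["mtbf"] * mtbf_factor
             + weights["availability"] * availability)
    return round(100 * score, 1)
```

&lt;p&gt;Tracking this number across releases gives a simple trend line for whether resiliency is improving or regressing.&lt;/p&gt;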

&lt;h3&gt;
  
  
  Gamedays
&lt;/h3&gt;

&lt;p&gt;Gamedays are controlled and planned events where organizations simulate real-world failure scenarios and test their system's resiliency in a safe and controlled environment. During a Gameday, a team deliberately introduces failures or injects chaos into the system to observe and analyze its behavior and response. &lt;/p&gt;

&lt;p&gt;Organizations should practice Gamedays, as they offer teams a chance to practice and improve their incident response and troubleshooting skills. Gamedays enhance team collaboration, communication, and coordination during high-stress situations, which are valuable skills when dealing with real-world failures or incidents. They also demonstrate an organization's proactive approach to ensuring that the system can endure unexpected events and continue operating without significant disruptions.&lt;/p&gt;

&lt;p&gt;Overall, Gamedays serve as a valuable practice to improve system resiliency, validate recovery mechanisms, and build a culture of preparedness and continuous improvement within organizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incorporate Resiliency Checks in CI/CD Pipelines
&lt;/h3&gt;

&lt;p&gt;Integrating resiliency checks into CI/CD pipelines offers several advantages, helping to enhance the overall robustness and reliability of software systems. Here are some key benefits of incorporating resiliency checks in these pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Early detection of resiliency issues:&lt;/strong&gt; By including resiliency checks in the CI/CD pipeline, organizations can identify potential resiliency issues early in the software development lifecycle. This allows teams to address these issues proactively before they manifest as critical failures in production environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhanced user experience:&lt;/strong&gt; Resilient software systems are better equipped to handle failures without affecting the end-user experience. By incorporating resiliency checks, organizations can identify and mitigate issues that could impact user interactions, ensuring a seamless and uninterrupted user experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased system stability:&lt;/strong&gt; Resiliency checks pertaining to the system’s stability can be validated in CI/CD pipelines. This helps prevent cascading failures and ensures that the system remains stable and performs optimally even under challenging conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better preparedness for production environments:&lt;/strong&gt; CD pipelines provide an environment for simulating real-world scenarios, including failures and disruptions. By including resiliency checks, teams can better prepare their software for production environments, allowing them to identify and address resiliency gaps before deployment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost savings:&lt;/strong&gt; By addressing resiliency issues early in the CD pipeline, organizations can mitigate potential financial losses resulting from system failures in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance requirements:&lt;/strong&gt; By integrating resiliency checks into the CI/CD pipeline, organizations can ensure that their software systems meet compliance requirements like &lt;a href="https://www.tigera.io/blog/extend-ci-cd-with-cr-for-continuous-app-resilience/"&gt;SoC/SoX&lt;/a&gt; requirements and demonstrate their adherence to industry standards.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
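&lt;p&gt;As a sketch, a pipeline stage could gate deployment on the verdicts of the resiliency checks. The verdict format below is made up for illustration; it is not the actual Litmus result schema:&lt;/p&gt;

```python
# Hypothetical CI gate: fail the pipeline unless every required check passed.
# The verdict dictionaries use an invented shape, not a real Litmus/CI schema.

def resiliency_gate(verdicts, required=("pod-delete", "network-latency")):
    """Return (passed, reasons) for a list of {"name": ..., "verdict": ...} entries."""
    by_name = {v["name"]: v["verdict"] for v in verdicts}
    reasons = []
    for name in required:
        if by_name.get(name) != "Pass":
            reasons.append("required check " + name + " did not pass")
    return (len(reasons) == 0, reasons)
```

&lt;p&gt;The pipeline would call such a gate after the chaos stage and abort the deployment when it fails.&lt;/p&gt;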

&lt;h3&gt;
  
  
  Improve Observability Posture
&lt;/h3&gt;

&lt;p&gt;It is presumed that the system is actively monitored, and relevant &lt;a href="https://www.infracloud.io/blogs/prometheus-architecture-metrics-use-cases/"&gt;metrics&lt;/a&gt;, &lt;a href="https://www.infracloud.io/blogs/grafana-loki-log-monitoring-alerting/"&gt;logs&lt;/a&gt;, &lt;a href="https://www.infracloud.io/blogs/opentelemetry-auto-instrumentation-jaeger/"&gt;traces&lt;/a&gt;, and other events are captured and analyzed before inducing chaos in the system. Make sure your observability tools and processes provide visibility into the system's health, performance, and potential issues, triggering alerts or notifications when anomalies or deviations from the steady state are detected. &lt;/p&gt;

&lt;p&gt;If an issue that occurs during the chaos is not visible in your tooling, you have to incorporate and improve your observability measures accordingly. Over the course of several different chaos experiments, you can identify missing observability data and add it to your system.&lt;/p&gt;
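&lt;p&gt;For example, a simple deviation check against a recorded steady-state baseline can drive such alerts. The metric names and the 20% tolerance below are illustrative assumptions:&lt;/p&gt;

```python
# Illustrative steady-state check: flag metrics that drift from a baseline by
# more than a fractional tolerance, or that are missing entirely (a visibility gap).

def find_deviations(baseline, observed, tolerance=0.2):
    deviations = {}
    for metric, expected in baseline.items():
        actual = observed.get(metric)
        if actual is None:
            deviations[metric] = "missing"  # observability gap to close
            continue
        if expected == 0:
            continue  # avoid dividing by zero; treat a zero baseline separately
        drift = abs(actual - expected) / abs(expected)
        if drift > tolerance:
            deviations[metric] = round(drift, 3)
    return deviations
```

&lt;p&gt;A &lt;code&gt;"missing"&lt;/code&gt; entry is exactly the signal that a chaos experiment has surfaced observability data your system does not yet collect.&lt;/p&gt;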

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we learned what chaos engineering, resiliency, and reliability are, and how the three relate to each other. We surveyed the available tools for executing chaos and explained why we chose Litmus for our use case.&lt;/p&gt;

&lt;p&gt;Further, we explored the types of chaos experiments we can execute in Kubernetes and AWS environments, and saw a demo of chaos experiments executed in both. In addition, we learned how to design a resiliency framework and incorporate resiliency scoring, Gamedays, and resiliency checks in CI/CD pipelines, along with improving the observability posture of a platform.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Hope you found this blog post helpful. If you have any questions or suggestions, please do reach out to &lt;a href="https://www.linkedin.com/in/ruturaj-kadikar/"&gt;Ruturaj&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/devops/chaos-engineering-on-amazon-eks-using-aws-fault-injection-simulator/"&gt;Chaos engineering on Amazon EKS using AWS Fault Injection Simulator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/"&gt;Chaos Engineering in the cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opensource.com/downloads/chaos-engineering-kubernetes"&gt;Chaos engineering for Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.container-solutions.com/comparing-chaos-engineering-tools"&gt;Comparing Chaos Engineering Tools for Kubernetes Workloads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.in/DevOps-Toolkit-Kubernetes-Chaos-Engineering/dp/B086Y5M9CW"&gt;The DevOps Toolkit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/the-cloud-architect/the-chaos-engineering-collection-5e188d6a90e2"&gt;The Chaos Engineering Collection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>EdgeX Foundry on K3s - the Initiation</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Wed, 24 Nov 2021 18:33:01 +0000</pubDate>
      <link>https://dev.to/infracloud/edgex-foundry-on-k3s-the-initiation-1lo5</link>
      <guid>https://dev.to/infracloud/edgex-foundry-on-k3s-the-initiation-1lo5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This blog post is part 2 of a series of articles about how to deploy and operate EdgeX Foundry - an open source software framework for IoT Edge on K3s - a lightweight, highly available, and secured orchestrator.&lt;/em&gt;&lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;In the first part of this series, we covered all the prerequisites needed to proceed with the hands-on. We will extend the &lt;a href="https://docs.edgexfoundry.org/1.2/examples/LinuxTutorial/EdgeX-Foundry-tutorial-ver1.0.pdf"&gt;EdgeX Foundry tutorial by Jonas Werner&lt;/a&gt; and deploy the EdgeX Foundry services on K3s. We have already seen that K3s is a good lightweight solution to manage and orchestrate the EdgeX microservices. We will use the &lt;code&gt;Geneva&lt;/code&gt; version of EdgeX Foundry.&lt;/p&gt;

&lt;p&gt;The scope of this post is to demonstrate an edge use case that consumes sensor data, e.g. ambient temperature. The sensor data will be processed by EdgeX Foundry services hosted on K3s and then pushed to a cloud-based MQTT broker called &lt;a href="https://www.hivemq.com/"&gt;HiveMQ&lt;/a&gt;, from where the data can be stored and processed in the cloud. The configurations and manifests used in these posts are available in this &lt;a href="https://github.com/rutu-k/edgex-k3s"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The end-to-end setup looks like Fig.1. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ya7PRbGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1fbqug2w5ebb0gxzrlx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ya7PRbGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1fbqug2w5ebb0gxzrlx0.png" alt="Fig. 1: showing end to end setup" width="755" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DHT-22 with Raspberry-Pi (Edge Device)
&lt;/h3&gt;

&lt;p&gt;We will use a DHT-22 sensor that captures ambient temperature and humidity. Note that the sensor doesn't require a breadboard or resistor; everything is mounted on the SMD (surface-mounted device) module. The DHT sensor is connected to the GPIO (general-purpose I/O) pins of the Raspberry Pi.&lt;/p&gt;

&lt;p&gt;The following changes are made in the script that captures the temperature and sends the sensor data to EdgeX: the &lt;code&gt;EdgeX IP&lt;/code&gt;, the &lt;code&gt;DHT sensor type&lt;/code&gt;, the &lt;code&gt;GPIO&lt;/code&gt; pin of the Raspberry Pi, and the NodePort of the &lt;code&gt;edgex-device-rest&lt;/code&gt; service, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Adafruit_DHT&lt;/span&gt;

&lt;span class="n"&gt;edgexip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"192.168.1.179"&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;# Update to match DHT sensor type and GPIO pin
&lt;/span&gt;    &lt;span class="n"&gt;rawHum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rawTmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Adafruit_DHT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;urlTemp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'http://%s:32536/api/v1/resource/Temp_and_Humidity_sensor_cluster_01/temperature'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;edgexip&lt;/span&gt;
    &lt;span class="n"&gt;urlHum&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'http://%s:32536/api/v1/resource/Temp_and_Humidity_sensor_cluster_01/humidity'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;edgexip&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
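&lt;p&gt;The rest of the loop (elided above) POSTs each reading to those URLs. The following is a minimal sketch of how that sending step could look; the rounding and error handling are illustrative, not taken verbatim from the original script:&lt;/p&gt;

```python
# Sketch of the sending step of the sensor loop. The REST device service is
# assumed to accept the value as plain text in the request body.

def format_reading(value):
    # One decimal place is plenty for a DHT-22 reading.
    return str(round(float(value), 1))

def send_reading(url, value, timeout=5):
    import requests  # same HTTP client the sensor script already imports
    try:
        resp = requests.post(url, data=format_reading(value), timeout=timeout)
        return resp.status_code
    except requests.RequestException:
        return None  # keep the sensor loop alive on transient network errors
```

&lt;p&gt;Inside the &lt;code&gt;while&lt;/code&gt; loop, &lt;code&gt;send_reading(urlTemp, rawTmp)&lt;/code&gt; and &lt;code&gt;send_reading(urlHum, rawHum)&lt;/code&gt; would be called after each &lt;code&gt;read_retry&lt;/code&gt;, with a short sleep between iterations.&lt;/p&gt;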



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E_Ibzocl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/10uqubzzogpzpkqhliwc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E_Ibzocl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/10uqubzzogpzpkqhliwc.jpg" alt="DHT sensor connected to Raspberry-Pi" width="800" height="776"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you deploy EdgeX on K3s?
&lt;/h3&gt;

&lt;p&gt;Let us first deploy a K3s cluster, with the K3s server and K3s agent on two separate VMs as seen in Fig. 3. The EdgeX services on K3s will act more like a gateway (see Fig. 4 in &lt;a href="https://www.infracloud.io/blogs/edgex-foundry-k3s-part-1/"&gt;part-1&lt;/a&gt;). For the VMs, we will use Ubuntu 20.04. Once the VMs are created, proceed with the following steps to deploy K3s.&lt;br&gt;
Note that it is better to configure static IPs on both VMs.&lt;/p&gt;
&lt;h3&gt;
  
  
  K3s server/master
&lt;/h3&gt;

&lt;p&gt;To configure the K3s server, we will use the following steps&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;K3S_NODE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOSTNAME&lt;/span&gt;&lt;span class="p"&gt;//_/-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;K3S_EXTERNAL_IP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;host-ip&amp;gt;
curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://docs.rancher.cn/k3s/k3s-install.sh | sh -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the node token generated&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /var/lib/rancher/k3s/server/node-token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether the K3s server is up and running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl status k3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  K3s agent
&lt;/h3&gt;

&lt;p&gt;To configure the K3s agent, we will use the following steps&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export K3S_TOKEN=&amp;lt;node-token of K3s server&amp;gt;
export K3S_URL=https://&amp;lt;ip of k3s server&amp;gt;:6443
export INSTALL_K3S_EXEC="--docker --token $K3S_TOKEN --server $K3S_URL"
export K3S_NODE_NAME=${HOSTNAME//_/-}
curl -sfL https://docs.rancher.cn/k3s/k3s-install.sh | sh -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether the K3s agent is up and running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl status k3s-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure you have installed CLI tools like &lt;code&gt;kubectl&lt;/code&gt; and &lt;code&gt;helm&lt;/code&gt;, and that you have set the correct permissions for them. The &lt;code&gt;kubeconfig&lt;/code&gt; file for K3s is stored at &lt;code&gt;/etc/rancher/k3s/k3s.yaml&lt;/code&gt;. Set the KUBECONFIG environment variable before proceeding further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/rancher/k3s/k3s.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to deploy EdgeX Foundry on K3s?
&lt;/h3&gt;

&lt;p&gt;Once K3s is up and running, clone this &lt;a href="https://github.com/rutu-k/edgex-k3s"&gt;repository&lt;/a&gt; for deploying EdgeX Foundry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/rutu-k/edgex-k3s.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that EdgeX Foundry provides a docker-compose manifest to try and test its services. Using &lt;code&gt;kompose&lt;/code&gt;, we can convert the docker-compose manifest into Kubernetes manifests.&lt;/p&gt;

&lt;p&gt;Also, after converting to Kubernetes manifests, applications that were using volumes may not be configured properly and won't work in Kubernetes. For this reason, the respective manifests must be corrected. For simplicity, &lt;code&gt;emptyDir&lt;/code&gt; volumes are configured.&lt;/p&gt;

&lt;p&gt;First, we will start with Consul. Consul is used as a registry by EdgeX Foundry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt; /etc/rancher/k3s/k3s.yaml upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; consul ./consul-helm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Consul is in running state, you can visit the dashboard &lt;code&gt;http://[K3s server ip]:[NodePort of service]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note that if you visit the Key-Value store, it will be empty. We need to provide the respective configs for the EdgeX Foundry services, so we will import the key-values and store them in Consul.&lt;br&gt;
I exported the key/value JSON file from Consul while going through the tutorial's docker-compose deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;consul kv import &lt;span class="nt"&gt;--http-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://[K3s server ip]:[NodePort of consul service] @edgex-kv.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you will see the configs in the Key-Value store.&lt;/p&gt;

&lt;p&gt;Deploy the EdgeX Foundry application services&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ./k3s/.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that all the application services are up and running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get svc
NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                                                                   AGE
kubernetes                             ClusterIP   10.43.0.1       &amp;lt;none&amp;gt;        443/TCP                                                                   4d1h
edgex-redis                            ClusterIP   10.43.56.89     &amp;lt;none&amp;gt;        6379/TCP                                                                  3d23h
edgex-core-consul                      ClusterIP   None            &amp;lt;none&amp;gt;        8500/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP   25h
consul                                 NodePort    10.43.193.246   &amp;lt;none&amp;gt;        80:32688/TCP                                                              25h
edgex-app-service-configurable-mqtt    NodePort    10.43.102.126   &amp;lt;none&amp;gt;        48101:32294/TCP                                                           4h26m
edgex-app-service-configurable-rules   NodePort    10.43.138.76    &amp;lt;none&amp;gt;        48100:30136/TCP                                                           4h26m
edgex-core-command                     NodePort    10.43.52.70     &amp;lt;none&amp;gt;        48082:32400/TCP                                                           4h26m
edgex-device-rest                      NodePort    10.43.167.127   &amp;lt;none&amp;gt;        49986:32536/TCP                                                           4h26m
edgex-core-metadata                    NodePort    10.43.132.29    &amp;lt;none&amp;gt;        48081:30220/TCP                                                           4h26m
edgex-support-notifications            NodePort    10.43.39.183    &amp;lt;none&amp;gt;        48060:32680/TCP                                                           4h26m
edgex-kuiper                           NodePort    10.43.231.24    &amp;lt;none&amp;gt;        48075:30082/TCP,20498:31868/TCP                                           4h26m
edgex-support-scheduler                NodePort    10.43.6.49      &amp;lt;none&amp;gt;        48085:31497/TCP                                                           4h26m
edgex-sys-mgmt-agent                   NodePort    10.43.250.114   &amp;lt;none&amp;gt;        48090:31736/TCP                                                           4h25m
edgex-core-data                        NodePort    10.43.251.191   &amp;lt;none&amp;gt;        5563:32220/TCP,48080:31931/TCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
edgex-redis-54fb576f64-bdv9x                            1/1     Running   1          3d12h
consul-0                                                1/1     Running   0          25h
edgex-core-metadata-5bd45879cf-h9tbs                    1/1     Running   0          4h25m
edgex-kuiper-bbc6cf47-trkdl                             1/1     Running   0          4h25m
edgex-sys-mgmt-agent-7fb78c6fc5-qmqc2                   1/1     Running   0          4h25m
edgex-support-notifications-7b45446cbc-bqhvb            1/1     Running   0          4h25m
edgex-app-service-configurable-mqtt-59b5c7b6c8-8zhfc    1/1     Running   1          4h25m
edgex-app-service-configurable-rules-58c6846d54-s29hp   1/1     Running   1          4h25m
edgex-core-command-78b5ff9864-hb6wr                     1/1     Running   0          4h25m
edgex-core-data-86f5864db6-cvbl7                        1/1     Running   0          4h25m
edgex-support-scheduler-755b5779dc-br2d8                1/1     Running   0          4h25m
edgex-device-rest-599c579bf5-zbrg8                      1/1     Running   0          4h25m

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  EdgeX Foundry workflows
&lt;/h2&gt;

&lt;p&gt;The EdgeX workflow can be divided into three main parts: Device, Core Data Service, and Application Service. There is another part for actuation, where an action can be taken by analyzing the sensor data, but that is out of scope for this post.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Device workflow: It is the process of adding a particular sensor device, its profile, selecting the proper device protocol, creating an event object, and sending it to Core Data Service. If you want to add any device to the EdgeX Foundry, it needs three configurations as shown in Fig. 3:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QP5umIw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddbf9nqcj5316ur8e8yx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QP5umIw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddbf9nqcj5316ur8e8yx.png" alt="Fig. 3 showing device workflow components" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;ValueDescriptor: The device service needs to inform EdgeX about the type of data it will be sending on behalf of the devices. If you are given the number 5, what does that mean to you? Nothing, without some context and unit of measure. For example, if I was to say 5 feet is the scan depth of the camera right now, you have a much better understanding of what the number 5 represents. In EdgeX, Value Descriptors provide the context and unit of measure for any data (or values) sent to and from a device. As the name implies, a Value Descriptor describes a value - its unit of measure, its minimum and maximum values (if there are any), the way to display the value when showing it on the screen, and more. Any data obtained from a device (we call this a GET from the device) or any data sent to the device for actuation (we call this a SET or PUT to the device) requires a Value Descriptor to be associated with that data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Device profile: The device profile describes a type of device within the EdgeX system. Each device managed by a device service has an association with a device profile, which defines that device type in terms of the operations which it supports.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Device definition: Device information like manufacturers, a protocol which it will use, device profile, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CoreData Service Workflow: Data is submitted to core data as an &lt;code&gt;Event&lt;/code&gt; object. An event is a collection of sensor readings from a device (associated with a device by its ID or name) at a particular point in time. A &lt;code&gt;Reading&lt;/code&gt; object in an &lt;code&gt;Event&lt;/code&gt; object is a particular value sensed by the device and associated with a Value Descriptor in order to provide context to the reading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Application Service Workflow: Application Services are a means to get data from EdgeX Foundry to external systems and processes (be it analytics package, enterprise or on-prem application, cloud systems like Azure IoT, AWS IoT, or Google IoT Core, etc.). Application Services provide the means for data to be prepared (transformed, enriched, filtered, etc.) and groomed (formatted, compressed, encrypted, etc.) before being sent to an endpoint of choice. Endpoints supported out of the box today include HTTP and MQTT endpoints, but will include additional offerings in the future and could include custom endpoints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
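&lt;p&gt;To make the Event/Reading model above concrete, here is a minimal Python sketch of the kind of payload a device service submits to core data. The field names mirror the readings shown later in this walkthrough; treat the exact payload shape as illustrative rather than as the authoritative EdgeX v1 API.&lt;/p&gt;

```python
import json
import time

def make_event(device, readings):
    """Build an EdgeX-v1-style Event payload: one Event groups several
    Readings from the same device. Payload shape is illustrative."""
    origin = time.time_ns()  # nanoseconds, like the `origin` field in the output below
    return {
        "device": device,
        "origin": origin,
        "readings": [
            # each Reading pairs a Value Descriptor name with a sensed value
            {"name": name, "value": str(value), "origin": origin}
            for name, value in readings
        ],
    }

event = make_event("Temp_and_Humidity_sensor_cluster_01",
                   [("temperature", 28), ("humidity", 75.5)])
print(json.dumps(event, indent=2))
```

A device service would POST such a document to core data, which persists each Reading and associates it with its Value Descriptor.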

&lt;h3&gt;
  
  
  Let's see it in action
&lt;/h3&gt;

&lt;p&gt;Trigger the script to activate the DHT sensor and send the temperature values to EdgeX Foundry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python rpiPutTempHum.py 
Temp: 27.7999992371C, humidity: 77.5%
Temp: 28.2000007629C, humidity: 75.5999984741%
Temp: 28.1000003815C, humidity: 75.5%
Temp: 28.1000003815C, humidity: 75.4000015259%
Temp: 28.2000007629C, humidity: 75.3000030518%
Temp: 28.2000007629C, humidity: 75.3000030518%
Temp: 28.2000007629C, humidity: 75.3000030518%

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the sensor readings increase, the event count in EdgeX Foundry also increases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;└─ &lt;span class="nv"&gt;$ &lt;/span&gt;▶ curl http://192.168.1.179:31931/api/v1/event/count
2043
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also get the latest temperature value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;└─ &lt;span class="nv"&gt;$ &lt;/span&gt;▶ curl http://192.168.1.179:31931/api/v1/reading | json_pp &lt;span class="nt"&gt;-json_opt&lt;/span&gt; pretty,canonical | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 10

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  396k    0  396k    0     0  9660k      0 &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- &lt;span class="nt"&gt;--&lt;/span&gt;:--:-- 9660k
   &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"created"&lt;/span&gt; : 1632159295863,
      &lt;span class="s2"&gt;"device"&lt;/span&gt; : &lt;span class="s2"&gt;"Temp_and_Humidity_sensor_cluster_01"&lt;/span&gt;,
      &lt;span class="s2"&gt;"id"&lt;/span&gt; : &lt;span class="s2"&gt;"ffcf2a3b-6ecc-4476-9ab6-e17ae983886f"&lt;/span&gt;,
      &lt;span class="s2"&gt;"name"&lt;/span&gt; : &lt;span class="s2"&gt;"temperature"&lt;/span&gt;,
      &lt;span class="s2"&gt;"origin"&lt;/span&gt; : 1632159295861600937,
      &lt;span class="s2"&gt;"value"&lt;/span&gt; : &lt;span class="s2"&gt;"28"&lt;/span&gt;,
      &lt;span class="s2"&gt;"valueType"&lt;/span&gt; : &lt;span class="s2"&gt;"Int64"&lt;/span&gt;
   &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
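&lt;p&gt;Since the readings come back as plain JSON, a few lines of Python are enough to post-process them, for example converting the millisecond &lt;code&gt;created&lt;/code&gt; timestamp into a readable time. The sample object below mirrors the curl output above.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# one reading as returned by the /api/v1/reading endpoint above
sample = """{
  "created": 1632159295863,
  "device": "Temp_and_Humidity_sensor_cluster_01",
  "id": "ffcf2a3b-6ecc-4476-9ab6-e17ae983886f",
  "name": "temperature",
  "origin": 1632159295861600937,
  "value": "28",
  "valueType": "Int64"
}"""

def summarize(reading_json):
    r = json.loads(reading_json)
    # `created` is epoch milliseconds; convert to an aware UTC datetime
    ts = datetime.fromtimestamp(r["created"] / 1000, tz=timezone.utc)
    return f'{r["device"]}: {r["name"]}={r["value"]} at {ts.isoformat()}'

print(summarize(sample))
```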



&lt;p&gt;To demo Application Services, we will configure the MQTT application service to send the temperature values to an MQTT broker. The &lt;code&gt;edgex-app-service-configurable-mqtt&lt;/code&gt; service (check the deployed services in the section 'EdgeX Foundry on K3s') is a community-provided exporter that sends EdgeX sensor data to the public MQTT broker hosted by HiveMQ at (&lt;a href="http://broker.mqttdashboard.com"&gt;http://broker.mqttdashboard.com&lt;/a&gt;) on port 1883. This sensor data can then be visualized via the HiveMQ-provided MQTT browser client by publishing and subscribing to a particular &lt;code&gt;topic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The topic name is configured in the env variables of &lt;code&gt;edgex-app-service-configurable-mqtt&lt;/code&gt; deployment (&lt;a href="https://github.com/rutu-k/edgex-k3s/blob/25d9bff61f2dbdd6865586c5ee946b82ce95fb5d/k3s/app-service-mqtt-deployment.yaml#L65"&gt;refer here&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WRITABLE_PIPELINE_FUNCTIONS_MQTTSEND_ADDRESSABLE_TOPIC&lt;/span&gt;
    &lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DHT-SENSOR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to &lt;a href="http://www.hivemq.com/demos/websocket-client/"&gt;HiveMQ MQTT browser client&lt;/a&gt;. Click &lt;code&gt;Connect&lt;/code&gt; with default configuration. Next, click on &lt;code&gt;Add New Topic Subscription&lt;/code&gt;, type the topic name &lt;code&gt;DHT-SENSOR&lt;/code&gt; and click on &lt;code&gt;Subscribe&lt;/code&gt;. You will see the sensor data in Messages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FTShEl-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n0fi5f00a7st2lefo7nt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FTShEl-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n0fi5f00a7st2lefo7nt.png" alt="Fig. 4 showing HiveMQ web client" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we have seen the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to deploy K3s&lt;/li&gt;
&lt;li&gt;How to convert a docker-compose manifest to Kubernetes manifests&lt;/li&gt;
&lt;li&gt;How to deploy EdgeX Foundry on K3s&lt;/li&gt;
&lt;li&gt;How to send sensor data from a Raspberry Pi to EdgeX Foundry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, we have seen how to process the data received from the sensors. We can also take pre-defined actions by further analyzing the data. EdgeX Foundry supports Edge analytics by incorporating the &lt;code&gt;eKuiper&lt;/code&gt; rules engine. We will try to cover it in the next part.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. For more posts like this one, do subscribe to our weekly newsletter. I’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.twitter.com/rutu_kadikar"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/ruturaj-kadikar/"&gt;LinkedIn&lt;/a&gt; :).&lt;/p&gt;

&lt;h3&gt;
  
  
  References and further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.edgexfoundry.org/1.2/walk-through/Ch-WalkthroughData/#value-descriptors"&gt;Value Descriptors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.edgexfoundry.org/1.2/microservices/device/profile/Ch-DeviceProfile/"&gt;Device Profile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.edgexfoundry.org/1.2/general/Definitions/"&gt;EdgeX Foundry definitions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k3s.io/"&gt;K3s&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>edge</category>
      <category>5g</category>
      <category>microservices</category>
    </item>
    <item>
      <title>EdgeX Foundry on K3s - the Inception</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Wed, 24 Nov 2021 18:32:14 +0000</pubDate>
      <link>https://dev.to/infracloud/edgex-foundry-on-k3s-the-inception-3d4n</link>
      <guid>https://dev.to/infracloud/edgex-foundry-on-k3s-the-inception-3d4n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This blog post is part 1 of a series of articles about how to deploy and operate EdgeX Foundry - an open source software framework for IoT Edge on K3s - a lightweight, highly available, and secured orchestrator.&lt;/em&gt;&lt;/p&gt;


&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why emphasize the Edge?
&lt;/h2&gt;

&lt;p&gt;As we start using edge computing in its true sense, I think we are approaching the edge of a technological renaissance. The Edge is becoming an essential element of all the upcoming and futuristic technologies. Some popular examples are shown in Fig. 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edge is a complementary solution to 5G.&lt;/li&gt;
&lt;li&gt;Edge is the backbone for IoT and Fog.&lt;/li&gt;
&lt;li&gt;Edge is required for AI and ML workloads to enable real-time data processing.&lt;/li&gt;
&lt;li&gt;Automated vehicles leverage the Edge for mission-critical latency and high reliability.&lt;/li&gt;
&lt;li&gt;VR applications need the Edge for stringent requirements of latency, network and reliability.&lt;/li&gt;
&lt;li&gt;Edge will conserve broadband networks while streaming global events.&lt;/li&gt;
&lt;li&gt;Software upgrades can use the Edge to minimize network pressure on backhaul.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lk29MRhC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uqukiub0umbmn1y3l5zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lk29MRhC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uqukiub0umbmn1y3l5zx.png" width="500" height="339"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Fig. 1 - Showing the 5G stack&lt;/b&gt;



&lt;p&gt;The requirements common to all the above areas are minimal latency and high network availability. Both are difficult to meet with traditional cloud-based architectures, which are centralized by design.&lt;br&gt;
The Edge plays an important role in alleviating these two pressure points by distributing data processing, which is why it is an essential part of all these technologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to look while exploring the Edge?
&lt;/h2&gt;

&lt;p&gt;Edge is primarily an application- or use-case-driven strategy. Thus its implementation changes according to the use case and the underlying constraints (e.g. latency, network bandwidth, scalability, security, etc.). This is also true for the adoption of 5G. Industries are formulating new tools and technologies to adopt 5G in various areas such as IoT, IIoT, entertainment, and more. This is the right stage to choose the correct implementation path and to formulate the standards that will help onboard future technologies.&lt;/p&gt;

&lt;p&gt;The Linux Foundation initiated two open communities, namely LF Networking and LF Edge. These communities provide an ecosystem for network infrastructure and services, and an interoperable framework for Edge computing, respectively. Furthermore, LF Networking integrates with LF Edge to provide an open source Edge framework and seamless Edge networking. The one thing that caught my attention and triggered me to explore further was the 5G Super Blueprint initiative.&lt;/p&gt;

&lt;p&gt;As we have learned so far, all the upcoming technologies desperately need low-latency, high-bandwidth, and scalable networks. To address all these issues, LF Networking announced 5G Super Blueprint (see Fig. 2), a community-driven integration/illustration of multiple open source initiatives coming together to show end-to-end use cases demonstrating implementation architectures for end users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--76hgxkWu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zdae3fm7mnku4q9xjjty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--76hgxkWu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zdae3fm7mnku4q9xjjty.png" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Fig. 2 - 5G Super Blueprint, Courtesy: &lt;a href="https://www.lfnetworking.org/5g-super-blueprint/"&gt;Linux Foundation&lt;/a&gt;&lt;/b&gt;



&lt;h3&gt;
  
  
  User Edge (UE)
&lt;/h3&gt;

&lt;p&gt;As you can see, the 5G Super Blueprint is mainly divided into three sections. The first section, named User Edge, is also considered the last-mile network. It deals with applications that are closer to end users and uses on-prem and distributed compute resources to reduce latency. It also lowers the pressure on broadband networks by minimizing unnecessary backhaul communication to the data centers. In addition, we achieve autonomy, increased security and privacy, and a reduction in overall cost. The business model for applications in the UE is generally CAPEX-based, as the infrastructure and its operation are handled by the user rather than delivered as a managed service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Provider Edge (SPE)
&lt;/h3&gt;

&lt;p&gt;In contrast to UE, SPE is distributed yet a shared space, and is primarily consumed as a service. It is considered more secure and private than the cloud, as it uses private networks (both wired and wireless/cellular) operated by service providers. It is more standardized than UE, but it also has unique requirements depending on the use case and location.&lt;/p&gt;

&lt;h3&gt;
  
  
  5G Core
&lt;/h3&gt;

&lt;p&gt;The core of the 5G Blueprint consists of tools that provide open, cloud native 5G network functions, along with cloud infrastructure that adheres to 5G principles and can provision these functions (e.g. network accelerators, vector packet processors, etc.). It also includes a management plane that orchestrates, automates, and manages the lifecycle of network functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Edge native similar to cloud native?
&lt;/h3&gt;

&lt;p&gt;No, there is a slight difference. Edge native applications leverage cloud native principles while taking into account the unique characteristics of the Edge in areas such as resource constraints, security, latency, and autonomy. Edge native applications are developed in ways that leverage the cloud and work in concert with upstream resources. Edge applications that don’t comprehend centralized cloud compute resources, remote management, and orchestration or leverage CI/CD aren’t truly “edge native”, rather they more closely resemble traditional on-premises applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why EdgeX Foundry?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Covers both UE and SPE in 5G Super Blueprint
&lt;/h3&gt;

&lt;p&gt;The first reason I started my Edge exploration with EdgeX Foundry is that it is an overlapping project in the UE and SPE space: you can use it in UE, in SPE, or in a combination of both. To understand how this is possible, let's dive into its architecture. &lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Kw2NDnH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipuwkvvpjwx731yxed6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Kw2NDnH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipuwkvvpjwx731yxed6y.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Fig. 3 - EdgeX architecture, Courtesy: &lt;a href="https://docs.edgexfoundry.org/2.0/#edgex-foundry-service-layers"&gt;EdgeX Foundry&lt;/a&gt;&lt;/b&gt;&lt;br&gt;



&lt;p&gt;As you can see in Fig. 3, EdgeX Foundry is primarily divided into 4 layers which are briefly described as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Device Services&lt;/strong&gt;: Responsible for interacting with the Edge devices and connecting with the other services.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Core Services:&lt;/strong&gt; Mainly responsible for handling device information and data processing. They consist of the following services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Core data:&lt;/em&gt;&lt;/strong&gt; a persistence repository and associated management service for data collected from south side objects.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;em&gt;Command:&lt;/em&gt;&lt;/strong&gt; a service that facilitates and controls actuation requests from the north side to the south side.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;em&gt;Metadata:&lt;/em&gt;&lt;/strong&gt; a repository and associated management service of metadata about the objects that are connected to EdgeX Foundry. Metadata provides the capability to provision new devices and pair them with their owning device services.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;em&gt;Registry and Configuration:&lt;/em&gt;&lt;/strong&gt; provides other EdgeX Foundry micro services with information about associated services within EdgeX Foundry and micro services configuration properties (i.e. - a repository of initialization values).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Supporting services (Optional):&lt;/strong&gt; The supporting services encompass a wide range of micro services to include the Edge analytics (also known as local analytics). They are mainly responsible for logging, scheduling, and data clean up (also known as scrubbing in EdgeX).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Rules Engine:&lt;/em&gt;&lt;/strong&gt; the reference implementation of Edge analytics service that performs if-then conditional actuation at the Edge, based on sensor data collected by the EdgeX instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Scheduling:&lt;/em&gt;&lt;/strong&gt; an internal EdgeX “clock” that can kick off operations in any EdgeX service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Logging:&lt;/em&gt;&lt;/strong&gt; provides a central logging facility for all of EdgeX services. Services send log entries into the logging facility via a REST API where log entries can be persisted in a database or log file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Alerts and Notifications:&lt;/em&gt;&lt;/strong&gt; provides EdgeX services with a central facility to send out an alert or notification.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application services:&lt;/strong&gt; Application services are the means to extract, process/transform and send sensed data from EdgeX to an endpoint or process of your choice. They also send data to many of the major cloud providers (Amazon IoT Hub, Google IoT Core, Azure IoT Hub, IBM Watson IoT…), to MQTT(s) topics, and HTTP(s) REST endpoints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The placement of these services defines whether the EdgeX Foundry implementation sits on the UE or the SPE. The following diagram (see Fig. 4) describes this placement in detail.&lt;/p&gt;

&lt;p&gt;The loosely coupled architecture and the microservices design enable the deployment of its services in various combinations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--msI110E_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nk0papmxm8l3dm0tg42y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--msI110E_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nk0papmxm8l3dm0tg42y.png" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Fig. 4 - EdgeX implementation strategies, Courtesy: &lt;a href="https://docs.edgexfoundry.org/2.0/#deployments"&gt;EdgeX Foundry&lt;/a&gt;&lt;/b&gt;



&lt;h3&gt;
  
  
  Graduated to Impact Stage in LF Edge projects
&lt;/h3&gt;

&lt;p&gt;Coming back to the reasons for starting with EdgeX Foundry: it is currently in Stage 3 of LF Edge's Project Lifecycle Document (PLD) process. All new projects enter at Stage 1; the stages are defined as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1, the “At Large” stage, is for projects which the TAC believes are, or have the potential to be, important to the ecosystem of Top-Level Projects, or the Edge ecosystem as a whole.&lt;/li&gt;
&lt;li&gt;Stage 2, the “Growth Stage”, is for projects that are interested in reaching the Impact Stage and have identified a growth plan for doing so.&lt;/li&gt;
&lt;li&gt;Stage 3, the “Impact Stage”, is for projects that have reached their growth goals and are now on a self-sustaining cycle of development, maintenance, and long-term support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VSoQjv21--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzuxfu7gjs1h2ixbrkei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VSoQjv21--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzuxfu7gjs1h2ixbrkei.png" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Fig. 5 - LF Edge Project Lifecycle Document, Courtesy: &lt;a href="https://www.lfedge.org/projects/"&gt;LF Edge&lt;/a&gt;&lt;/b&gt;



&lt;h3&gt;
  
  
  Why K3s?
&lt;/h3&gt;

&lt;p&gt;While working at the Edge, you require a lightweight, highly available orchestrator that can manage Edge native applications within tight resource constraints. There are many options available like &lt;a href="https://github.com/kubernetes/minikube"&gt;minikube&lt;/a&gt;, &lt;a href="https://github.com/kubernetes-sigs/kind"&gt;kind&lt;/a&gt;, &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://k3s.io/"&gt;K3s&lt;/a&gt;, and &lt;a href="https://microk8s.io/"&gt;MicroK8s&lt;/a&gt;. Although minikube and kind are popular tools for hands-on or demo purposes, they are not production-grade. Kubernetes is a good option for orchestrating Edge native microservices, and K3s and MicroK8s are lightweight variants of Kubernetes that are more suitable for Edge scenarios. Both can be deployed on small devices like Raspberry Pis as well as on AWS instances. K3s is Linux-distribution independent and supports a multi-node architecture. For these reasons, we are focusing on K3s for deploying EdgeX Foundry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we have seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why the Edge is necessary for upcoming technologies&lt;/li&gt;
&lt;li&gt;How the Linux Foundation is contributing to open source Edge and networking&lt;/li&gt;
&lt;li&gt;What EdgeX Foundry is&lt;/li&gt;
&lt;li&gt;How K3s is complementary to the Edge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you found this post informative and engaging. Stay tuned for part 2 of this post, where we explore how to deploy EdgeX Foundry on K3s and how to send sensor data from a Raspberry Pi to EdgeX Foundry.&lt;/p&gt;

&lt;p&gt;For more posts like this one, do subscribe to our weekly newsletter. I’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.twitter.com/infracloudio"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/company/infracloudio"&gt;LinkedIn&lt;/a&gt; :).&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://enterprisersproject.com/article/2021/6/edge-computing-and-5g-reality-check"&gt;Edge computing and 5G: A reality check&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.akamai.com/2021/03/the-edge-is-becoming-more-critical-in-a-world-of-5g-and-iot.html"&gt;The Edge is Becoming More Critical in a World of 5G and IoT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stlpartners.com/edge_computing/5g-edge-computing/"&gt;Where does Edge computing work with 5G?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cisco.com/c/en/us/solutions/enterprise-networks/edge-computing-architecture-5g.html"&gt;5G technology needs Edge computing architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ciena.com/insights/articles/the-ying-and-yang-of-5g-and-edge-cloud.html"&gt;The Yin and Yang of 5G and Edge Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiki.lfnetworking.org/display/LN/LFN+Demo:+5G+Super+Blueprint"&gt;LF 5G Super Blueprint&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>edge</category>
      <category>5g</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Avoiding Kubernetes Cluster Outages with Synthetic Monitoring</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Tue, 22 Jun 2021 10:15:48 +0000</pubDate>
      <link>https://dev.to/infracloud/avoiding-kubernetes-cluster-outages-with-synthetic-monitoring-4o90</link>
      <guid>https://dev.to/infracloud/avoiding-kubernetes-cluster-outages-with-synthetic-monitoring-4o90</guid>
      <description>&lt;h2&gt;
  
  
  What is synthetic monitoring?
&lt;/h2&gt;

&lt;p&gt;Synthetic monitoring consists of pre-defined checks to proactively monitor the critical elements in your infrastructure. These checks simulate the functionality of the elements. We can also simulate the communication between the elements to ensure end-to-end connectivity. Continuous monitoring of these checks also helps to measure overall performance in terms of availability and response times.&lt;/p&gt;

&lt;p&gt;We will narrow the scope to synthetic checks for Kubernetes clusters; the rest of the post is based on that scope.&lt;/p&gt;

&lt;p&gt;Synthetic checks can help SREs identify issues and detect slow responses and downtime before they affect the actual business. They may help to proactively detect network failures, misconfigurations, loss of end-to-end connectivity, etc., during upgrades, major architectural changes, or feature releases.&lt;/p&gt;
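&lt;p&gt;At its core, a synthetic check boils down to: run a probe, time it, and compare the outcome against a budget. The sketch below shows that idea in Python; the probe is injected as a plain callable so the example stays self-contained, whereas a real check would hit a health endpoint over HTTP.&lt;/p&gt;

```python
import time

def synthetic_check(probe, max_latency_s=1.0):
    """Run one synthetic check: call the probe, time it, and report
    OK only if it succeeded within the latency budget."""
    start = time.monotonic()
    try:
        ok = probe()
    except Exception:
        # a crashing probe counts as a failed check, not a crashed checker
        ok = False
    latency = time.monotonic() - start
    return {"OK": bool(ok) and latency <= max_latency_s, "latency_s": latency}

# stand-in probe; a real one would perform an HTTP GET against a service
result = synthetic_check(lambda: True)
print(result["OK"])
```

Running such a function on a schedule and exporting the result is, in essence, what the Kubernetes-native tools described below automate.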

&lt;h2&gt;
  
  
  Why are synthetic checks important in Kubernetes?
&lt;/h2&gt;

&lt;p&gt;Kubernetes is a collection of distributed processes running simultaneously. Thus, identifying the failure domains in a Kubernetes cluster can be a troublesome task. A well-described synthetic check can reduce/avoid the possible downtime due to these failure domains by replicating the intended workflow and measuring its performance. Some failure domains can be described as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node issues (Docker daemon/Kubelet in a failed state, unallocated IP address due to CNI failures, etc.).&lt;/li&gt;
&lt;li&gt;Pod issues (failed health checks, pods not in running state, etc.)&lt;/li&gt;
&lt;li&gt;Namespace issues (pods not able to schedule in a Namespace)&lt;/li&gt;
&lt;li&gt;DNS resolution issues (CoreDNS lookup failures)&lt;/li&gt;
&lt;li&gt;Network issues (changes in Network policies, etc.)&lt;/li&gt;
&lt;li&gt;And many more ...&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tools available for Kubernetes synthetic checks/monitoring
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://geekflare.com/synthetic-monitoring-tools/"&gt;multiple tools available for synthetic monitoring,&lt;/a&gt; such as AppDynamics, New Relic, Dynatrace, etc. For this post, let's focus on Kubernetes native synthetic checks.&lt;/p&gt;

&lt;p&gt;At the time of writing this post, two tools have Kubernetes native synthetic checks, namely &lt;strong&gt;Kuberhealthy&lt;/strong&gt; and &lt;strong&gt;Grafana Cloud&lt;/strong&gt;. &lt;a href="https://github.com/kuberhealthy/kuberhealthy"&gt;Kuberhealthy&lt;/a&gt; is an operator-based synthetic monitoring tool that uses custom resources called Kuberhealthy checks (khchecks), while Grafana Cloud uses agents to gather data from probes that periodically check pre-defined endpoints. Kuberhealthy provides many more synthetic checks than Grafana Cloud, and it is open source. Thus, we will explore synthetic monitoring in Kubernetes clusters with the help of Kuberhealthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kuberhealthy?
&lt;/h2&gt;

&lt;p&gt;Kuberhealthy is an operator for running synthetic checks. Each synthetic check is a test container (a checker pod) created by a custom resource called a khcheck or khjob (Kuberhealthy check or Kuberhealthy job). Once the checks are created, Kuberhealthy schedules all of them at a given interval and within a given timeout. The two custom resources are almost identical in functionality, except that a khjob runs once whereas a khcheck runs at regular intervals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6r8L2rmT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hxpsh6celg5al5ws7u94.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6r8L2rmT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hxpsh6celg5al5ws7u94.gif" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;
Deployment check [Courtesy: Kuberhealthy]
&lt;/p&gt;

&lt;p&gt;Kuberhealthy provisions checker pods corresponding to a particular khcheck. The checker pod is destroyed once its purpose is served. The creation/deletion cycle repeats at regular intervals depending upon the &lt;code&gt;runInterval&lt;/code&gt; and &lt;code&gt;timeout&lt;/code&gt; durations in the khcheck configuration. The result is then sent to Kuberhealthy, which in turn exposes it on its metrics and status endpoints. For monitoring, we can integrate it with Prometheus, or view it on a JSON-based status page. This page gives a consolidated status of all the khchecks.&lt;/p&gt;
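&lt;p&gt;Because the status page is plain JSON, it is easy to script against. The sketch below flags failing khchecks; the &lt;code&gt;OK&lt;/code&gt;, &lt;code&gt;Errors&lt;/code&gt;, and &lt;code&gt;CheckDetails&lt;/code&gt; field names reflect Kuberhealthy's status page format, so verify them against your installed version, and the check names shown are illustrative.&lt;/p&gt;

```python
import json

# a trimmed example of what the Kuberhealthy status page returns
status_json = """{
  "OK": false,
  "Errors": ["kuberhealthy/a-minio-reachable: connection refused"],
  "CheckDetails": {
    "kuberhealthy/a-minio-reachable": {"OK": false},
    "kuberhealthy/daemonset": {"OK": true}
  }
}"""

def failing_checks(status_json):
    """Return the names of khchecks that are not reporting OK."""
    status = json.loads(status_json)
    return [name for name, detail in status.get("CheckDetails", {}).items()
            if not detail.get("OK", False)]

print(failing_checks(status_json))
```

A cron job or alerting sidecar could run this against the status endpoint and page an SRE whenever the list is non-empty.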

&lt;h3&gt;
  
  
  Checks available with Kuberhealthy
&lt;/h3&gt;

&lt;p&gt;There are pre-defined checks available for core Kubernetes functions. We can use the checks provided directly by Kuberhealthy, or write our own custom checks according to the use case.&lt;/p&gt;

&lt;p&gt;Here is one example of a khcheck. Any application performing CRUD operations on a database/storage needs a constant connection with it. The Kuberhealthy HTTP check verifies the connectivity of HTTP/HTTPS endpoints. For example, the following khcheck checks the reachability of a MinIO cluster. To simulate a realistic scenario, MinIO is exposed via ngrok. If the connection is successful, it will show &lt;code&gt;OK: true&lt;/code&gt;; if the connection breaks, it will show &lt;code&gt;OK: false&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;comcast.github.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KuberhealthyCheck&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a-minio-reachable&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kuberhealthy&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
  &lt;span class="na"&gt;podSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a-minio-reachable&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kuberhealthy/http-check:v1.5.0&lt;/span&gt;
        &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CHECK_URL&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://ff333084d5a0.ngrok.io/login"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COUNT&lt;/span&gt; &lt;span class="c1"&gt;#### default: "0"&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SECONDS&lt;/span&gt; &lt;span class="c1"&gt;#### default: "0"&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PASSING_PERCENT&lt;/span&gt; &lt;span class="c1"&gt;#### default: "100"&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REQUEST_TYPE&lt;/span&gt; &lt;span class="c1"&gt;#### default: "GET"&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EXPECTED_STATUS_CODE&lt;/span&gt; &lt;span class="c1"&gt;#### default: "200"&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200"&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15Mi&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;25m&lt;/span&gt;
    &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
    &lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OoEp88LQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2356ehpi7htbxlwjv1v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OoEp88LQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2356ehpi7htbxlwjv1v5.png" alt="Kuberhealthy Status Page" width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the important use cases are covered in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we avoided a major outage in a Kubernetes cluster
&lt;/h2&gt;

&lt;p&gt;We started facing IP address shortages as our Kubernetes cluster in AWS grew, with a large number of microservices being onboarded onto it. The issue became more serious during burst scaling and upgrades. The feasible solution was to adopt the secondary CIDR solution provided by AWS. However, this required many network changes, and a small mistake could result in a major outage.&lt;/p&gt;

&lt;p&gt;We wanted a solution that would buy us some time to identify misconfigurations (if any) during the rollout. We identified the endpoints of the dependent services for all the microservices, created the corresponding TCP and HTTP khchecks, and installed Kuberhealthy along with the khcheck manifests. The following image shows the setup before rolling out the secondary CIDR: all the pods can connect to their dependent services. &lt;br&gt;
(Note that the diagram is a minimalistic version of the scenario.)&lt;/p&gt;
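
&lt;p&gt;For illustration, a TCP-style khcheck for a dependent service follows the same pattern as the HTTP check shown earlier. Note that the image tag and the target endpoint below are placeholders, not our actual configuration; consult the Kuberhealthy check registry for the current image.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: db-endpoint-reachable
  namespace: kuberhealthy
spec:
  runInterval: 2m
  timeout: 2m
  podSpec:
    containers:
      - name: db-endpoint-reachable
        # illustrative image tag; check the Kuberhealthy registry for the current one
        image: kuberhealthy/network-connection-check:v0.2.0
        env:
          - name: CONNECTION_TARGET
            # placeholder endpoint of a dependent service
            value: "tcp://my-database.example.com:5432"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;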

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7-XqRNx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/su1hcd7633ksrw075uij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7-XqRNx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/su1hcd7633ksrw075uij.png" alt="Secondary CIDR rollout step 1" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the rollout, we wanted to ensure that everything would work with the new pod IP range (100.64.x.x). We therefore manually added one new node that used the secondary CIDR. Kuberhealthy placed a DaemonSet pod on the new node and checked connectivity with all the endpoints. We found that some of the endpoints could not be reached. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QwcEdQgE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2l3bcivq26ef4t50v5nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QwcEdQgE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2l3bcivq26ef4t50v5nn.png" alt="Secondary CIDR rollout step 2" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We reviewed the whitelisting in Security Groups, NACLs, and WAFs, and found that the new CIDR was not whitelisted in some of the WAFs. We corrected the WAF configuration, after which the khchecks showed status OK. We then proceeded with the actual secondary CIDR rollout, and everything worked fine as shown.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PinSdHIg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/24plu1x38dm2f3t07jmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PinSdHIg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/24plu1x38dm2f3t07jmt.png" alt="Secondary CIDR rollout step 3" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way, we safeguarded our Kubernetes cluster from a major outage with the help of Kuberhealthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases for Kuberhealthy synthetic checks
&lt;/h2&gt;

&lt;p&gt;In our exploration, we found that Kuberhealthy can make a Kubernetes cluster more reliable in the following use cases:&lt;/p&gt;

&lt;h3&gt;
  
  
  Network changes
&lt;/h3&gt;

&lt;p&gt;If you have to carry out major network changes, running HTTP or TCP khchecks against important endpoints can surface misconfigurations early and help you proactively avoid major downtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM changes
&lt;/h3&gt;

&lt;p&gt;Kuberhealthy provides KIAM checks to verify proper KIAM functionality. The concept extends to any production-grade cluster that must be stringent about workloads' IAM access. While hardening access, the security team might inadvertently block required permissions, which can lead to downtime. Appropriate IAM checks (KIAM checks, if you use KIAM in your cluster) help minimize that downtime. &lt;/p&gt;

&lt;p&gt;Additionally, we can check for unnecessary access: modify the khchecks to always look for full or power-user access and alert if anybody grants such access to a workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint connectivity
&lt;/h3&gt;

&lt;p&gt;We can always check whether important components outside the cluster, such as databases and key-value stores, are up and running, with khchecks monitoring connectivity to their respective endpoints.&lt;/p&gt;
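
&lt;p&gt;Assuming Kuberhealthy is installed in the &lt;code&gt;kuberhealthy&lt;/code&gt; namespace, the current result of each such connectivity check can be read from Kuberhealthy's state resources or from its JSON status page (the service name and port may differ in your installation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# list the latest state of all checks
kubectl get khstate -n kuberhealthy

# query the aggregated JSON status page
kubectl -n kuberhealthy port-forward svc/kuberhealthy 8080:80
curl -s http://localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;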

&lt;h3&gt;
  
  
  AMI verification
&lt;/h3&gt;

&lt;p&gt;There is a predefined AMI check that verifies that the AMI used in the cluster exists in the AWS Marketplace. We can modify it to verify important features of a custom-baked AMI, such as NTP synchronization, directory structures, and user access.&lt;/p&gt;

&lt;h3&gt;
  
  
  CoreDNS checks
&lt;/h3&gt;

&lt;p&gt;An improper CoreDNS configuration may hamper DNS resolution under heavy load. A DNS check can report the status of both internal and external DNS resolution in such scenarios. To learn more, follow this guide on &lt;a href="https://dev.to/blogs/using-coredns-effectively-kubernetes/"&gt;how to effectively use CoreDNS with Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Quotas checks
&lt;/h3&gt;

&lt;p&gt;The resource quota check is another helpful check that should run in any production-grade cluster with resource quotas enabled. Suppose the quota of a particular namespace is exhausted by scaling at peak load: the new pods needed to serve the additional traffic cannot be scheduled in that namespace, which affects the business for that duration.&lt;/p&gt;
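
&lt;p&gt;For reference, a namespace-scoped resource quota of the kind such a check would watch looks like the following (the limits are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    pods: "50"            # scheduling stops once 50 pods exist in the namespace
    requests.cpu: "10"
    requests.memory: 20Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;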

&lt;p&gt;These are a few of the many common use cases. You can identify the use cases specific to your own infrastructure and write your own checks for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article covered the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What synthetic monitoring is and why it matters in production-grade clusters&lt;/li&gt;
&lt;li&gt;Why synthetic checks are important for a Kubernetes cluster&lt;/li&gt;
&lt;li&gt;What Kuberhealthy is&lt;/li&gt;
&lt;li&gt;How we safeguarded our Kubernetes cluster from a major outage&lt;/li&gt;
&lt;li&gt;Some important use cases for synthetic checks with Kuberhealthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To sum it up, this post introduced you to Kuberhealthy, a tool for synthetic monitoring of Kubernetes clusters that helps avoid outages and increase infrastructure reliability.&lt;/p&gt;

&lt;p&gt;We hope this article was helpful. If you have further questions, feel free to start a conversation with me on &lt;a href="https://twitter.com/rutu_kadikar"&gt;Twitter&lt;/a&gt;. Happy coding :)&lt;/p&gt;

&lt;h3&gt;
  
  
  References and further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2020/05/29/k8s-kpis-with-kuberhealthy/"&gt;https://kubernetes.io/blog/2020/05/29/k8s-kpis-with-kuberhealthy/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/news/2019/05/kuberhealthy-synthetic-testing/"&gt;https://www.infoq.com/news/2019/05/kuberhealthy-synthetic-testing/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opensource.com/article/19/4/kuberhealthy"&gt;https://opensource.com/article/19/4/kuberhealthy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grepmymind.com/kuberhealthy-the-writing-the-ami-exists-check-c9e986298e4"&gt;https://grepmymind.com/kuberhealthy-the-writing-the-ami-exists-check-c9e986298e4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/box-tech-blog/a-trip-down-the-dns-rabbit-hole-understanding-the-role-of-kubernetes-golang-libc-systemd-41fd80ffd679"&gt;https://medium.com/box-tech-blog/a-trip-down-the-dns-rabbit-hole-understanding-the-role-of-kubernetes-golang-libc-systemd-41fd80ffd679&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://geekflare.com/synthetic-monitoring-tools/"&gt;https://geekflare.com/synthetic-monitoring-tools/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>syntheticmonitoring</category>
      <category>kuberhealthy</category>
      <category>devops</category>
    </item>
    <item>
      <title>Tracing in Grafana with Tempo and Jaeger</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Fri, 23 Apr 2021 18:02:15 +0000</pubDate>
      <link>https://dev.to/infracloud/tracing-in-grafana-with-tempo-and-jaeger-ec</link>
      <guid>https://dev.to/infracloud/tracing-in-grafana-with-tempo-and-jaeger-ec</guid>
      <description>&lt;h2&gt;
  
  
  Why do I need tracing if I have a good logging and monitoring framework?
&lt;/h2&gt;

&lt;p&gt;Application logs are beneficial for recording important events when something is not working as expected (a failure, error, incorrect config, etc.). Although logging is a crucial element of application design, one should log thriftily: log collection, transformation, and storage are costly. &lt;/p&gt;

&lt;p&gt;Unlike logging, which is event-triggered and discrete, tracing provides a broader, continuous view of the application. Tracing helps us understand the path of a process/transaction/entity as it traverses the application stack and identify the bottlenecks at various stages. This helps optimize the application and increase performance. &lt;/p&gt;

&lt;p&gt;In this post, we will see how to introduce tracing in logs and visualize it easily. In this example, we will use Prometheus, Grafana Loki, and Jaeger/Grafana Tempo as Grafana data sources for metrics, logs, and traces, respectively. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is distributed tracing?
&lt;/h2&gt;

&lt;p&gt;In a microservices architecture, understanding an application's behavior can be a challenging task. Incoming requests may span multiple services, and each intermediate service may perform one or more operations on a request. This increases complexity and makes troubleshooting more time-consuming. &lt;/p&gt;

&lt;p&gt;Distributed tracing provides insight into each individual operation and helps pinpoint failures and areas of poor performance. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is OpenTracing?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://opentracing.io/" rel="noopener noreferrer"&gt;OpenTracing&lt;/a&gt; comprises an API specification, frameworks, and libraries to enable distributed tracing in any application. The OpenTracing APIs are generic and prevent vendor/product lock-in. Recently, OpenTracing and OpenCensus merged to form &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; (abbreviated OTel), which targets the creation and management of telemetry data such as traces, metrics, and logs through a set of APIs, SDKs, tooling, and integrations.&lt;br&gt;
Note: &lt;a href="https://opencensus.io/" rel="noopener noreferrer"&gt;OpenCensus&lt;/a&gt; is a set of libraries for various languages to collect metrics and traces from applications, visualize them locally, and send them to remote storage for analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the fundamental elements of OpenTracing?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Span&lt;/strong&gt;: The primary building block of a distributed trace. It comprises a name, a start time, and a duration. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace&lt;/strong&gt;: A visualization of a request/transaction as it traverses a distributed system. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: Key-value information that identifies a span. Tags help to query, filter, and analyze trace data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs&lt;/strong&gt;: Key-value pairs useful for capturing span-specific logging messages and other debugging or informational output from the application itself. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Span context&lt;/strong&gt;: Data associated with the incoming request; this context is accessible in all other layers of the application within the same process. &lt;/p&gt;
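
&lt;p&gt;These elements map directly onto the opentracing-go API. The following fragment is a minimal sketch; the operation name, tag key, and log values are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
// start a span and propagate its span context via ctx
span, ctx := opentracing.StartSpanFromContext(ctx, "fetch-user")
defer span.Finish()

// tag: key-value metadata used to query and filter traces
span.SetTag("user.id", userID)

// log: a span-scoped, timestamped event
span.LogFields(log.String("event", "cache miss"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;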

&lt;h2&gt;
  
  
  What are available tools compatible with OpenTracing?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://zipkin.io/" rel="noopener noreferrer"&gt;Zipkin&lt;/a&gt;&lt;/strong&gt;: One of the first distributed tracing tools, developed by Twitter and inspired by Google's Dapper paper. Zipkin is written in Java and supports Cassandra and Elasticsearch for backend scalability. &lt;/p&gt;

&lt;p&gt;It comprises clients or reporters that gather trace data, collectors that index and store the data, a query service to extract and retrieve the trace data, and a UI to visualize the traces. Zipkin is compatible with the OpenTracing standard, so these implementations should also work with other distributed tracing systems. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;&lt;/strong&gt;: Another OpenTracing-compatible project, created by Uber Technologies and written in Go. Jaeger also supports Cassandra and Elasticsearch as scalable backend solutions. Although its architecture is similar to Zipkin's, it includes an additional agent on each host that aggregates data in batches before sending it to the collector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/sourcegraph/appdash" rel="noopener noreferrer"&gt;Appdash&lt;/a&gt;&lt;/strong&gt;: Appdash, created by Sourcegraph, is another distributed tracing system written in Go. It also supports the OpenTracing standard. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://grafana.com/oss/tempo/" rel="noopener noreferrer"&gt;Grafana Tempo&lt;/a&gt;&lt;/strong&gt;: An open source, highly scalable distributed tracing backend. It integrates easily with Grafana, Loki, and Prometheus, requires only object storage, and is compatible with open tracing protocols such as Jaeger, Zipkin, and OpenTelemetry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enabling and Visualizing Traces
&lt;/h2&gt;

&lt;p&gt;Many hands-on tutorials/demos are available, but they target a docker-compose environment. Here, we will run a tracing example in a Kubernetes environment instead, using the classic HOTROD example provided by Jaeger. Although Jaeger has its own UI for visualizing traces, we will visualize them in Grafana with Jaeger as a data source. Similarly, we will see how Grafana Tempo can be used to visualize the traces.&lt;/p&gt;

&lt;p&gt;To get started, we will clone the Jaeger GitHub repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

git clone https://github.com/jaegertracing/jaeger.git


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Enable distributed tracing in Microservice application
&lt;/h3&gt;

&lt;p&gt;You can check how to enable OpenTracing by navigating through the repo as shown below.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

cd jaeger/examples/hotrod
cat pkg/log/factory.go


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;opentracing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanFromContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;spanLogger&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;jaegerCtx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jaeger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanContext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spanFields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;zapcore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;zap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jaegerCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TraceID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                &lt;span class="n"&gt;zap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"span_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jaegerCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Convert docker-compose manifest to Kubernetes manifest
&lt;/h3&gt;

&lt;p&gt;In the hotrod directory, check the existing Docker manifests.&lt;/p&gt;

&lt;p&gt;You will see the docker-compose.yml file deploying services like Jaeger and HOTROD. We will use &lt;a href="https://kompose.io/" rel="noopener noreferrer"&gt;kompose&lt;/a&gt; to convert docker-compose manifest to Kubernetes manifest.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kompose convert


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You will see some files being created. We are specifically interested in &lt;code&gt;hotrod-deployment.yaml&lt;/code&gt;, &lt;code&gt;hotrod-service.yaml&lt;/code&gt;, &lt;code&gt;jaeger-deployment.yaml&lt;/code&gt;, and &lt;code&gt;jaeger-service.yaml&lt;/code&gt;. For simplicity, we will add the following label in the hotrod-deployment manifest.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kompose.cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kompose convert&lt;/span&gt;
    &lt;span class="na"&gt;kompose.version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.21.0 (992df58d8)&lt;/span&gt;
  &lt;span class="na"&gt;creationTimestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hotrod&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hotrod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hotrod&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Enable Jaeger tracing in deployment manifest
&lt;/h3&gt;

&lt;p&gt;Now we need to add the following environment variables in &lt;code&gt;hotrod-deployment.yaml&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAEGER_AGENT_HOST&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAEGER_AGENT_PORT&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6831"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAEGER_SAMPLER_TYPE&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;const&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAEGER_SAMPLER_PARAM&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAEGER_TAGS&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app=hotrod&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaegertracing/example-hotrod:latest&lt;/span&gt;
    &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hotrod&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;JAEGER_AGENT_HOST&lt;/code&gt;: The hostname used to communicate with the agent (defaults to localhost).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;JAEGER_AGENT_PORT&lt;/code&gt;: The port used to communicate with the agent (defaults to 6831).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;JAEGER_SAMPLER_TYPE&lt;/code&gt;: Four sampler types are available: remote, const, probabilistic, and ratelimiting (defaults to remote). The const type, for example, applies the same sampling decision to every trace.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;JAEGER_SAMPLER_PARAM&lt;/code&gt;: A value between 0 and 1 (for the const sampler, 1 samples every trace and 0 samples none).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;JAEGER_TAGS&lt;/code&gt;: A comma-separated list of name=value tracer-level tags, which are added to all reported spans.&lt;/p&gt;

&lt;p&gt;Now we will apply these manifests. Note that a running Kubernetes cluster is a prerequisite.&lt;/p&gt;
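
&lt;p&gt;The file names below are the ones generated by kompose earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
kubectl apply -f jaeger-deployment.yaml -f jaeger-service.yaml
kubectl apply -f hotrod-deployment.yaml -f hotrod-service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;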

&lt;h3&gt;
  
  
  Install Prometheus and Loki
&lt;/h3&gt;

&lt;p&gt;Next, we install Prometheus, Loki, and Grafana. The Prometheus Operator Helm chart (&lt;a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" rel="noopener noreferrer"&gt;&lt;code&gt;kube-prometheus-stack&lt;/code&gt;&lt;/a&gt;) will install Prometheus and Grafana. Loki Helm chart (&lt;a href="https://github.com/grafana/helm-charts/tree/main/charts/loki-stack" rel="noopener noreferrer"&gt;&lt;code&gt;loki-stack&lt;/code&gt;&lt;/a&gt;) will install Loki and Promtail. This &lt;a href="https://www.infracloud.io/blogs/grafana-loki-log-monitoring-alerting/" rel="noopener noreferrer"&gt;post&lt;/a&gt; provides more details about log monitoring with Loki.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

helm upgrade --install prometheus prometheus-community/kube-prometheus-stack
helm upgrade --install loki grafana/loki-stack


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
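
&lt;p&gt;If the chart repositories are not configured yet, add them first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;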

&lt;p&gt;We need to add the Jaeger and Loki data sources in Grafana. You can either add them manually or declare them in code. We will take the latter approach by creating a custom values file &lt;code&gt;prom-oper-values.yaml&lt;/code&gt; as shown below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;additionalDataSources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
      &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-loki&lt;/span&gt;
      &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
      &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://loki:3100&lt;/span&gt;
      &lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;editable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
      &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-jaeger&lt;/span&gt;
      &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;browser&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://jaeger:16686&lt;/span&gt;
      &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;editable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;uid&lt;/code&gt;: A unique, user-defined identifier for the data-source.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;access&lt;/code&gt;: Sets the access mode: proxy (server) or direct (browser).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;isDefault&lt;/code&gt;: Marks the data-source as the default one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;version&lt;/code&gt;: Helps with versioning of the config file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;editable&lt;/code&gt;: Allows the data-source to be updated from the UI.&lt;/p&gt;

&lt;p&gt;We will now upgrade the kube-prometheus-stack Helm chart with the custom values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

helm upgrade --install prometheus prometheus-community/kube-prometheus-stack --values=prom-oper-values.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you go to Grafana's data-sources, you will see that jaeger and loki have been added. It's time to see how traces appear in the log messages. For this, we will open the HOTROD UI and trigger a request from there.&lt;/p&gt;

&lt;p&gt;Note: In our configuration, we have given the names &lt;code&gt;loki&lt;/code&gt; and &lt;code&gt;jaeger&lt;/code&gt; to the Loki and Jaeger data-sources, respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wqg66hvlorjkm36wror.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wqg66hvlorjkm36wror.png" alt="hotrod_query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: Since the Grafana and HOTROD services use ClusterIP, we will use port-forwarding to access their UIs.&lt;/p&gt;
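
&lt;p&gt;For example, assuming the default release and service names (these may differ in your setup), the port-forwarding commands could look like this:&lt;/p&gt;

```shell
# Grafana service created by the kube-prometheus-stack release (name assumed)
kubectl port-forward svc/prometheus-grafana 3000:80

# HOTROD UI (service name and port assumed)
kubectl port-forward svc/hotrod 8080:8080
```

&lt;p&gt;With the forwards in place, Grafana is reachable at http://localhost:3000 and the HOTROD UI at http://localhost:8080.&lt;/p&gt;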

&lt;p&gt;Go to Explore, select loki as the data-source, and set the Log labels selector to &lt;code&gt;{app="hotrod"}&lt;/code&gt; to visualize the logs. You can see the span context, containing information such as the trace and span IDs, in the JSON output. Copy the trace ID, duplicate the window, go to Explore again, and select jaeger as the data-source. Paste the trace ID and run the query to visualize all the traces for that request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl60467uhl4q43ffslepz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl60467uhl4q43ffslepz.png" alt="visualize_trace_initial"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Loki Derived Fields
&lt;/h3&gt;

&lt;p&gt;Copying trace IDs by hand won't scale when you are analyzing bursts of requests. We need something more efficient and easier to operate. For this, we will use Loki derived fields. Derived fields allow us to add a field parsed from the log message and attach a URL built from the parsed value. Let's see how this does the trick, but first, add the following config in the &lt;code&gt;prom-oper-values.yaml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
  &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
  &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://loki:3100&lt;/span&gt;
  &lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;editable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;derivedFields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-jaeger&lt;/span&gt;
      &lt;span class="na"&gt;matcherRegex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;((\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+)(\d+|[a-z]+))&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$${__value.raw}'&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TraceID&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;datasourceUid&lt;/code&gt; has the value of Jaeger's &lt;code&gt;uid&lt;/code&gt;, which identifies the target data-source when the internal link is created. &lt;code&gt;matcherRegex&lt;/code&gt; holds the regex pattern for parsing the trace ID out of the log message. &lt;code&gt;url&lt;/code&gt; contains a full link if it points to an external source; for an internal link, this value serves as the query for the target data-source. The &lt;code&gt;$${__value.raw}&lt;/code&gt; macro interpolates the parsed field's value into the internal link.&lt;/p&gt;
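
&lt;p&gt;To make the derived-field mechanics concrete, here is a minimal Python sketch. The sample log line and the simplified pattern are illustrative assumptions; Grafana applies the actual &lt;code&gt;matcherRegex&lt;/code&gt; from the config above.&lt;/p&gt;

```python
import re

# A simplified HOTROD-style log line (sample values, for illustration only).
log_line = '{"level":"info","msg":"Dispatching driver","trace_id":"6e2df16f00f0f87c","span_id":"3fd1ce7a"}'

# Simplified stand-in for the matcherRegex: capture the hex trace ID that
# follows the trace_id key in the JSON log message.
pattern = re.compile(r'trace_id"\s*:\s*"([0-9a-f]+)"')

match = pattern.search(log_line)
trace_id = match.group(1)
print(trace_id)  # 6e2df16f00f0f87c
```

&lt;p&gt;Grafana then substitutes the captured value wherever &lt;code&gt;$${__value.raw}&lt;/code&gt; appears in the derived field's URL, which is how the clickable TraceID link is built.&lt;/p&gt;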

&lt;h3&gt;
  
  
  Add new log labels using Promtail pipelines
&lt;/h3&gt;

&lt;p&gt;We will add one more change for ease of operation. As you saw earlier, there was no trace ID label on the Loki logs. To add such a label, we will use &lt;code&gt;pipelineStages&lt;/code&gt; to derive a label from the log messages. Create a &lt;code&gt;loki-stack-values.yaml&lt;/code&gt; file and add the following code to it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;promtail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceMonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;additionalLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus-operator&lt;/span&gt;
        &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;

    &lt;span class="na"&gt;pipelineStages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;docker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{app="hotrod"}'&lt;/span&gt;
        &lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*(?P&amp;lt;trace&amp;gt;trace_id&lt;/span&gt;&lt;span class="se"&gt;\"\\&lt;/span&gt;&lt;span class="s"&gt;S)&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;(?P&amp;lt;traceID&amp;gt;[a-zA-Z&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d]+).*"&lt;/span&gt;
            &lt;span class="na"&gt;traceID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traceID&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;traceID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;pipelineStages&lt;/code&gt; declares a pipeline that adds a trace ID label. You can find more details about the pipeline parameters &lt;a href="https://grafana.com/docs/loki/latest/clients/promtail/pipelines/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Now we will upgrade both the kube-prometheus-stack and loki-stack Helm charts with the updated values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

helm upgrade --install prometheus prometheus-community/kube-prometheus-stack --values=prom-oper-values.yaml
helm upgrade --install loki grafana/loki-stack --values=loki-stack-values.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
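
&lt;p&gt;As a rough illustration of what the regex stage does, the following Python sketch applies a similar named-group pattern to a sample log line. Both the line and the simplified expression are assumptions; Promtail evaluates the real expression from the config above.&lt;/p&gt;

```python
import re

# Sample HOTROD-style log line (illustrative only).
log_line = 'level=info msg="Dispatch successful" "trace_id": "7be2a01d9f3c"'

# Simplified version of the pipeline's regex stage: the named capture group
# yields a value that the labels stage can then promote to a Loki label.
stage_expression = re.compile(r'trace_id"\s*:\s*"(?P<traceID>[a-zA-Z\d]+)')

match = stage_expression.search(log_line)
labels = {"traceID": match.group("traceID")}  # what the labels stage attaches
print(labels)  # {'traceID': '7be2a01d9f3c'}
```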
&lt;h3&gt;
  
  
  Visualize distributed tracing in Grafana using Jaeger and Tempo
&lt;/h3&gt;

&lt;p&gt;We will again visit the HOTROD UI and trigger a request from there. In the Grafana dashboard, click Explore and select loki as the data-source. Add &lt;code&gt;{app="hotrod"}&lt;/code&gt; in Log labels. You will now see a derived field named TraceID with an automatically generated internal link to Jaeger, as well as an extra label named traceID. Click the derived field TraceID, and it will take you directly to the Jaeger data-source and show all the traces for that particular trace ID. This makes switching between logs and traces much easier, and it shows how to parse the log message according to your requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k70c6u5rqja9oa7s2oa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k70c6u5rqja9oa7s2oa.png" alt="derived_fields"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we will add Grafana Tempo as a data-source and visualize traces with minimal changes to the same setup. To enable this, add the following lines under &lt;code&gt;additionalDataSources&lt;/code&gt; in &lt;code&gt;prom-oper-values.yaml&lt;/code&gt; and upgrade the Helm chart:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo&lt;/span&gt;
  &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-tempo&lt;/span&gt;
  &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;browser&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://tempo:16686&lt;/span&gt;
  &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;editable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;basicAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Change the data-source &lt;code&gt;uid&lt;/code&gt; in Loki's derived-field configuration to Tempo's &lt;code&gt;uid&lt;/code&gt; (i.e. &lt;code&gt;datasourceUid: my-tempo&lt;/code&gt;) in &lt;code&gt;prom-oper-values.yaml&lt;/code&gt;. Tempo can receive traces emitted by the Jaeger client libraries, so the application keeps sending trace information unchanged; we can therefore delete the Jaeger deployment and its service. To install Tempo in single binary mode, we will use the standard Helm &lt;a href="https://github.com/grafana/helm-charts/tree/main/charts/tempo" rel="noopener noreferrer"&gt;chart&lt;/a&gt; provided by Grafana.&lt;/p&gt;
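
&lt;p&gt;Assuming the Deployment and Service are both named &lt;code&gt;jaeger&lt;/code&gt; (adjust to match your manifests), the cleanup could look like this:&lt;/p&gt;

```shell
# Remove the standalone Jaeger deployment and its service (names assumed)
kubectl delete deployment jaeger
kubectl delete service jaeger
```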

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

helm upgrade --install tempo grafana/tempo


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We also need to change the &lt;code&gt;JAEGER_AGENT_HOST&lt;/code&gt; variable in the HOTROD deployment (&lt;code&gt;hotrod-deployment.yaml&lt;/code&gt;) to &lt;code&gt;tempo&lt;/code&gt; so that traces are sent to the right place. An incorrect or missing value may lead to the following error:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1tz02vlmlc6g5ibgoju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1tz02vlmlc6g5ibgoju.png" alt="tempo_error"&gt;&lt;/a&gt;&lt;/p&gt;
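
&lt;p&gt;The change might look like the following excerpt of &lt;code&gt;hotrod-deployment.yaml&lt;/code&gt; (surrounding fields omitted; the container spec shown here is an assumption):&lt;/p&gt;

```yaml
containers:
  - name: hotrod
    image: jaegertracing/example-hotrod  # image name assumed
    env:
      - name: JAEGER_AGENT_HOST
        value: tempo                     # previously: jaeger
```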

&lt;p&gt;Re-apply the hotrod-deployment manifest to incorporate the changes. Once again, visit the HOTROD UI and trigger a request from there, then check the HOTROD logs in loki. You will notice that the link in the derived field now points to Tempo. Click it, and you can visualize all the trace information as before.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdsa8pxfdtg7slov5uxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdsa8pxfdtg7slov5uxn.png" alt="tempo_derived_fields"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To summarize the post, we touched upon the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to enable distributed tracing in a microservice application.&lt;/li&gt;
&lt;li&gt;How to convert a docker-compose manifest into Kubernetes manifest.&lt;/li&gt;
&lt;li&gt;How to enable Jaeger tracing in the deployment manifest of an application.&lt;/li&gt;
&lt;li&gt;How to configure Loki derived fields.&lt;/li&gt;
&lt;li&gt;How to parse log messages to add new labels using the Promtail pipeline concept.&lt;/li&gt;
&lt;li&gt;How to visualize distributed tracing in Grafana using Jaeger and Tempo data-sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We hope you found this blog informative and engaging. If you have questions, feel free to reach out to me on &lt;a href="https://www.twitter.com/rutu_kadikar" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/ruturaj-kadikar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and start a conversation :)&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://opentracing.io/docs/overview/what-is-tracing/" rel="noopener noreferrer"&gt;What is Distributed Tracing?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/joe-elliott/tracing-example" rel="noopener noreferrer"&gt;tempo-otel-example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/velotio-perspectives/a-comprehensive-tutorial-to-implementing-opentracing-with-jaeger-a01752e1a8ce" rel="noopener noreferrer"&gt;A Comprehensive Tutorial to Implementing OpenTracing With Jaeger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/docs/grafana/latest/datasources/loki/" rel="noopener noreferrer"&gt;Using Loki in Grafana&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
