<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Londa</title>
    <description>The latest articles on DEV Community by Patrick Londa (@patricklonda).</description>
    <link>https://dev.to/patricklonda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F729626%2F41f43855-071c-460f-b907-e96ba8993a64.jpg</url>
      <title>DEV Community: Patrick Londa</title>
      <link>https://dev.to/patricklonda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patricklonda"/>
    <language>en</language>
    <item>
      <title>The Business Case for Chaos Engineering: An ROI Calculator for Testing Application Reliability</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Tue, 10 Mar 2026 21:41:50 +0000</pubDate>
      <link>https://dev.to/steadybit/the-business-case-for-chaos-engineering-an-roi-calculator-for-testing-application-reliability-2dhk</link>
      <guid>https://dev.to/steadybit/the-business-case-for-chaos-engineering-an-roi-calculator-for-testing-application-reliability-2dhk</guid>
      <description>&lt;p&gt;&lt;strong&gt;"What do we get from intentionally injecting failures into our systems?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;Chaos engineering&lt;/a&gt; is one of the best ways to proactively test your application reliability, but many leadership teams have never heard of the concept.&lt;/p&gt;

&lt;p&gt;Engineering teams need to be able to frame a strong business case to explain the value of chaos engineering and reliability testing to budget holders. When there’s a major outage, the value of application reliability becomes immediately clear, but a strong ROI plan can help earn and maintain executive support while your systems are steady.&lt;/p&gt;

&lt;p&gt;We have just released an &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;interactive ROI calculator&lt;/a&gt; that can help SRE teams frame the business value of proactive reliability efforts like chaos engineering.&lt;/p&gt;

&lt;h2&gt;Ensuring Application Resilience with Chaos Engineering&lt;/h2&gt;

&lt;p&gt;Complex software systems are destined to break and fail at some point, especially with all the factors present in modern production environments. When we ask teams about their approach to system reliability, we often hear back: “We have enough chaos already!”&lt;/p&gt;

&lt;p&gt;When done correctly, &lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;chaos engineering&lt;/a&gt; isn’t adding chaos to your systems. Rather, it’s running different scenarios to validate how resilient your systems are under stressful conditions. You can test your expectations against the reality of performance degradation during an availability zone outage, a delayed dependency, or a sudden surge of users.&lt;/p&gt;

&lt;p&gt;These experiments provide a feedback loop earlier in the software development cycle that enables teams to design more fault-tolerant systems for their users.&lt;/p&gt;

&lt;p&gt;When reliability testing is rolled out across applications, teams can find and address risks that could lead to critical incidents and improve their mean time to remediation (MTTR).&lt;/p&gt;

&lt;h2&gt;Early ROI Prototypes and Why They Didn’t Work&lt;/h2&gt;

&lt;h3&gt;📉 Savings from Reducing the Overall Number of Incidents&lt;/h3&gt;

&lt;p&gt;Initially, we explored the premise that implementing chaos engineering leads to fewer incidents at every severity tier. If a user provides the number of incidents they had in the past year, we can assume some percentage reduction across all of them, which makes the calculation simple once a cost is assigned to each tier of incident.&lt;/p&gt;
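
&lt;p&gt;As a minimal sketch of that first prototype’s arithmetic (the tier names, per-tier costs, and 25% reduction below are hypothetical example inputs, not figures from the calculator):&lt;/p&gt;

```python
# Sketch of the discarded first model: a flat percentage reduction applied
# to every severity tier. Tier names, counts, and per-incident costs are
# hypothetical example inputs.
INCIDENT_COSTS = {"sev0": 250_000, "sev1": 60_000, "sev2": 8_000}

def naive_annual_savings(incidents_per_tier, reduction=0.25):
    """Savings = avoided incidents per tier times assumed cost per tier."""
    return sum(
        count * reduction * INCIDENT_COSTS[tier]
        for tier, count in incidents_per_tier.items()
    )

# e.g. naive_annual_savings({"sev0": 4, "sev1": 20, "sev2": 150})
```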

&lt;p&gt;The trouble with this approach is that it assumes all incidents should be avoided. Incidents are useful metrics for flagging anomalous behaviors, and some don’t necessarily have a negative impact on customers. An increase in low-level incidents could actually be a positive sign that alert coverage is improving and properly surfacing system weaknesses.&lt;/p&gt;

&lt;p&gt;Moving forward, we chose to focus on savings from reducing the number of critical incidents (Sev0, P1, etc.) year over year. This metric is easier to track and doesn’t create incentives to avoid logging lower-severity incidents.&lt;/p&gt;

&lt;h3&gt;🧪 Identifying &amp;amp; Fixing Reliability Risks By Running More Experiments&lt;/h3&gt;

&lt;p&gt;If you go from 0 to 100 experiment runs on your systems, you are bound to discover new performance gaps and reliability risks. How many more risks will you discover at 200 experiment runs?&lt;/p&gt;

&lt;p&gt;We built a version of an ROI calculator that assumed that as the number of experiments increased, a certain percentage of experiments would reveal issues at different incident risk levels with assigned potential costs. Teams would then fix a certain percentage of these reliability risks depending on their development capacity. As the experiment run count scaled, there would be diminished returns for revealing new issues.&lt;/p&gt;

&lt;p&gt;While it’s true that teams will find more potential reliability issues as they run more experiments, this approach was a little too one-dimensional. We didn’t have well-documented references for how issue detection rates change with scale, and teams would likely need to create a new reporting mechanism to follow along as they mitigated risks.&lt;/p&gt;

&lt;p&gt;There is also nuance here: some teams automate their experiment runs and use them as regression tests in their CI/CD pipelines. We decided it would be better to measure impact with metrics that are already tracked and available to SREs at most organizations.&lt;/p&gt;

&lt;h2&gt;Where We Landed with Key Metrics for Our ROI Calculator&lt;/h2&gt;

&lt;h3&gt;⚡ Savings from Faster Average MTTR&lt;/h3&gt;

&lt;p&gt;As we were iterating on the inputs and outputs, we saw a &lt;a href="https://youtu.be/mYKNR0UXwMc?si=d-HUlm7xivWNaTHd&amp;amp;t=2444" rel="noopener noreferrer"&gt;great presentation&lt;/a&gt; from Keith Blizard and Joe Cho at AWS re:Invent 2024, featuring a case study on the progress Fidelity Investments had made in rolling out chaos engineering across their organization. They documented major improvements to mean time to resolution (MTTR) as they scaled chaos testing coverage across applications.&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.youtube.com/embed/mYKNR0UXwMc"&gt;&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;We used these case study metrics to plot the correlation between the percent of applications with chaos testing coverage and the incremental improvement to MTTR. We then used this relationship to calculate improvements against an assumed industry-wide average MTTR of 175 minutes, per this &lt;a href="https://www.pagerduty.com/resources/insights/learn/cost-of-downtime/" rel="noopener noreferrer"&gt;2024 PagerDuty report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This MTTR savings means fewer minutes of downtime, which that same study estimated can cost between $4,000 and $15,000 per minute. In our calculator, we ask users to input their “Annual Company Revenue” so we can use the most relevant cost of downtime per minute, as downtime is typically more costly for larger enterprises. This &lt;a href="https://www.bigpanda.io/wp-content/uploads/2024/04/EMA-BigPanda-final-Outage-eBook.pdf" rel="noopener noreferrer"&gt;2024 report&lt;/a&gt; commissioned by BigPanda found that downtime cost an average of $14,056 per minute for organizations with more than 1,000 employees.&lt;/p&gt;
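
&lt;p&gt;Numerically, the savings line can be sketched like this (the linear coverage-to-improvement relationship and the &lt;code&gt;max_mttr_improvement&lt;/code&gt; factor are simplifying assumptions; the calculator fits its curve to the case-study data):&lt;/p&gt;

```python
# Back-of-the-envelope sketch of the MTTR savings line. The linear
# coverage-to-improvement relationship and the max_mttr_improvement factor
# below are hypothetical simplifications, not the calculator's fitted curve.
BASELINE_MTTR_MIN = 175    # assumed industry average (2024 PagerDuty report)
COST_PER_MINUTE = 14_056   # avg cost for orgs with 1,000+ employees (BigPanda)

def mttr_savings(coverage, critical_incidents_per_year, max_mttr_improvement=0.5):
    """Annual downtime-cost savings from faster MTTR at a coverage level (0 to 1)."""
    minutes_saved_per_incident = BASELINE_MTTR_MIN * max_mttr_improvement * coverage
    return minutes_saved_per_incident * critical_incidents_per_year * COST_PER_MINUTE
```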

&lt;h3&gt;🛡️ Savings from Reducing Critical Incidents&lt;/h3&gt;

&lt;p&gt;At &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt;, we partner with a wide range of customers and have seen how many major reliability gaps are uncovered by running chaos experiments. Using insights from our customers and referencing industry studies, we've seen that actively running reliability tests on any given application conservatively leads to an average 30% reduction in critical incidents for that application per year.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;our calculator&lt;/a&gt;, we ask users to input the total number of applications their organization operates and how many of these applications have reliability testing coverage. We multiply the standard 30% reduction per year in critical incidents by the percent of applications with testing coverage to get the overall incident reduction for the organization.&lt;/p&gt;
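
&lt;p&gt;In code, that overall reduction works out to the following (a simplified sketch of the math described above, not the calculator’s actual implementation):&lt;/p&gt;

```python
# Sketch of the critical-incident reduction described above: a conservative
# 30% per-application reduction, scaled by the share of applications that
# have reliability testing coverage.
REDUCTION_PER_COVERED_APP = 0.30

def critical_incidents_avoided(total_apps, covered_apps, critical_incidents_per_year):
    """Expected critical incidents avoided per year across the organization."""
    coverage = covered_apps / total_apps
    return critical_incidents_per_year * REDUCTION_PER_COVERED_APP * coverage
```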

&lt;h3&gt;🛠️ Costs of Implementing Chaos Engineering&lt;/h3&gt;

&lt;p&gt;If you want to run chaos experiments at scale, you will likely need to onboard a commercial reliability platform or chaos engineering tool. Open source solutions can be a good starting place, but deploying these across teams and technologies can become increasingly time-intensive. We used general license estimates based on market knowledge and projected experiment activity.&lt;/p&gt;

&lt;p&gt;As with any new program, an organization will need engineers owning the project and dedicating time to a successful rollout of chaos testing. We included a field in our calculator for “Testing Rollout Managers”, measured in FTEs (40 hr/week of staff time). We used an average SRE salary of $160k per year as a benchmark to estimate the cost of this implementation effort.&lt;/p&gt;

&lt;h2&gt;Showing the Return on Investment for Reliability Testing&lt;/h2&gt;

&lt;p&gt;We ask users to project how they would expect to roll out chaos engineering at their organization, including unique test types, number of experiments, and coverage across applications. Our &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;ROI calculator&lt;/a&gt; will then output a summary and detailed view of your projected savings, implementation costs, and return on investment. When you game out multi-year adoption goals, you'll be building a business case that can help you frame the value of making this type of investment.&lt;/p&gt;

&lt;p&gt;If you’re successful in getting buy-in to roll out chaos engineering, you’ll need to report back on your progress. If you’re using an incident management platform like Splunk or PagerDuty, you may already have built-in MTTR metrics available to reference. You can also track the number of critical incidents using observability tools like Datadog, Dynatrace, or Grafana.&lt;/p&gt;

&lt;p&gt;These metrics will hopefully show clear improvements, but your systems may become increasingly complex at the same time that you’re rolling out this testing, especially with the rise of AI agents. Even simply maintaining your current reliability posture as your systems evolve and become significantly more complex could be framed as a win.&lt;/p&gt;

&lt;h2&gt;Rolling Out Chaos Engineering Across Your Organization&lt;/h2&gt;

&lt;p&gt;Highly available applications don’t naturally draw attention in the way that outages do. If you want to continue the momentum and foster a culture of reliability, you'll need to intentionally share your wins. For example, if you find a major reliability vulnerability and are able to address it before it impacts customers, that's something to celebrate internally.&lt;/p&gt;

&lt;p&gt;If you’d like guidance getting started with chaos testing and adopting a proactive reliability program, our team of experts at &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; is ready to help.&lt;/p&gt;

&lt;p&gt;You can explore our reliability platform with a &lt;a href="https://signup.steadybit.com/" rel="noopener noreferrer"&gt;30-day free trial&lt;/a&gt; or &lt;a href="https://steadybit.com/book-demo/" rel="noopener noreferrer"&gt;book a quick call&lt;/a&gt; with us to discuss how you can implement chaos engineering and start saving money today.&lt;/p&gt;

</description>
      <category>roi</category>
      <category>chaosengineering</category>
      <category>sre</category>
      <category>testing</category>
    </item>
    <item>
      <title>3 Types of Chaos Experiments and How To Run Them</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Thu, 24 Apr 2025 17:44:42 +0000</pubDate>
      <link>https://dev.to/steadybit/3-types-of-chaos-experiments-and-how-to-run-them-1p59</link>
      <guid>https://dev.to/steadybit/3-types-of-chaos-experiments-and-how-to-run-them-1p59</guid>
      <description>&lt;p&gt;The primary objective of a Chaos Experiment is to uncover hidden bugs, weaknesses, or non-obvious points of failure in a system that could lead to significant outages, degradation of service, or system failure under unpredictable real-world conditions.&lt;/p&gt;

&lt;h1&gt;What is a Chaos Experiment?&lt;/h1&gt;

&lt;p&gt;A Chaos Experiment is a carefully designed, controlled, and monitored process that systematically introduces disturbances or abnormalities into a system’s operation to observe and understand its response to such conditions.&lt;/p&gt;

&lt;p&gt;It forms the core part of &lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;‘Chaos Engineering’&lt;/a&gt;, which is predicated on the idea that ‘the best way to understand system behavior is by observing it under stress.’ This means intentionally injecting faults into a system in production or simulated environments to test its reliability and resilience.&lt;/p&gt;

&lt;p&gt;This practice emerged from the understanding that systems, especially distributed systems, are inherently complex and unpredictable due to their numerous interactions and dependencies.&lt;/p&gt;

&lt;h1&gt;The Components of a Chaos Engineering Experiment&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hypothesis Formation.&lt;/strong&gt; At the initial stage, a hypothesis is formed about the system’s steady-state behavior and expected resilience against certain types of disturbances. This hypothesis predicts no significant deviation in the system’s steady state as a result of the experiment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variable Introduction.&lt;/strong&gt; This involves injecting specific variables or conditions that simulate real-world disturbances (such as network latency, server failures, or resource depletion). These variables are introduced in a controlled manner to avoid unnecessary risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scope and Safety.&lt;/strong&gt; The experiment’s scope is clearly defined to limit its impact, often called the “blast radius.” Safety mechanisms, such as automatic rollback or kill switches, are implemented to halt the experiment if unexpected negative effects are observed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observation and Data Collection.&lt;/strong&gt; Throughout the experiment, system performance and behavior are closely monitored using detailed logging, metrics, and observability tools. This data collection is critical for analyzing the system’s response to the introduced variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analysis and Learning.&lt;/strong&gt; After the experiment, the data is analyzed to determine whether the hypothesis was correct. This analysis extracts insights regarding the system’s vulnerabilities, resilience, and performance under stress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterative Improvement.&lt;/strong&gt; The findings from each chaos experiment inform adjustments in system design, architecture, or operational practices. These adjustments aim to mitigate identified weaknesses and enhance overall resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Note → The ultimate goal is not to break things randomly but to uncover systemic weaknesses to improve the system’s resilience. By introducing chaos, you can enhance the understanding of your systems, leading to higher availability, reliability, and a better user experience.&lt;/p&gt;
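
&lt;p&gt;One lightweight way to capture these components in code (the field names below are illustrative, not any particular tool’s schema):&lt;/p&gt;

```python
from dataclasses import dataclass, field

# One lightweight way to capture the components above as a structured
# experiment definition. Field names are illustrative, not any tool's schema.
@dataclass
class ChaosExperiment:
    hypothesis: str                 # expected steady-state behavior
    fault: str                      # variable introduced into the system
    blast_radius: list[str]         # services or hosts in scope
    abort_conditions: list[str] = field(default_factory=list)  # kill-switch triggers
    observations: dict = field(default_factory=dict)           # metrics collected

exp = ChaosExperiment(
    hypothesis="p99 checkout latency stays under 800ms during the fault",
    fault="300ms added latency on payment-service egress",
    blast_radius=["checkout", "payment-service"],
    abort_conditions=["checkout error rate above 5%"],
)
```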

&lt;h1&gt;Types of Chaos Experiments&lt;/h1&gt;

&lt;h2&gt;1. Dependency Failure Experiment&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To assess how microservices behave when one or more of their dependencies fail. In a microservices architecture, services are designed to perform small tasks and often rely on other services to fulfill a request. The failure of these external dependencies can lead to cascading failures across the system, resulting in degraded performance or system outages. Understanding how these failures impact the overall system is crucial for building resilient services.&lt;/p&gt;

&lt;h3&gt;Possible Experiments&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Latency and Packet Loss.&lt;/strong&gt; Simulate increased latency or packet loss to understand its impact on service response times and throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service Downtime.&lt;/strong&gt; Emulate the unavailability of a critical service to observe the system’s resilience and failure modes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database Connectivity Issues.&lt;/strong&gt; Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Third-party API Limiting.&lt;/strong&gt; Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How to Run a Dependency Failure Experiment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Map Out Dependencies.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Begin with a comprehensive inventory of all the external services your system interacts with. This includes databases, third-party APIs, cloud services, and internal services if you work in a microservices architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dependency, document how your system interacts with it. Note the data exchanged, request frequency, and criticality of each interaction to your system’s operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rank these dependencies based on their importance to your system’s core functionalities. This will help you focus your efforts on the most critical dependencies first.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Simulate Failures&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use service virtualization or proxy tools like Steadybit to simulate various failures for your dependencies. These can range from network latency, dropped connections, and timeouts to complete unavailability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dependency, configure the types of faults you want to introduce. This could include delays, error rates, or bandwidth restrictions, mimicking real-world issues that could occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start with less severe faults (like increased latency) and gradually move to more severe conditions (like complete downtime), observing the system’s behavior at each stage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test Microservices Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implement Resilience Patterns. Use libraries like &lt;a href="https://github.com/Netflix/Hystrix" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt;, &lt;a href="https://resilience4j.readme.io/docs/getting-started" rel="noopener noreferrer"&gt;resilience4j&lt;/a&gt;, or &lt;a href="https://spring.io/projects/spring-cloud-circuitbreaker" rel="noopener noreferrer"&gt;Spring Cloud Circuit Breaker&lt;/a&gt; to implement patterns that prevent failures from cascading across services. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bulkheads.&lt;/strong&gt; Isolate parts of the application into “compartments” to prevent failures in one area from overwhelming others.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Breakers.&lt;/strong&gt; Automatically “cut off” calls to a dependency if it’s detected as down, allowing it to recover without being overwhelmed by constant requests.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Carefully configure thresholds and timeouts for these patterns. This includes setting the appropriate parameters for circuit breakers to trip and recover and defining bulkheads to isolate services effectively.&lt;/p&gt;
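
&lt;p&gt;For illustration, here is a minimal circuit-breaker sketch showing the two knobs discussed above, a failure threshold and a reset timeout; in practice you would rely on a library like resilience4j rather than rolling your own:&lt;/p&gt;

```python
import time

# Minimal circuit-breaker sketch illustrating the threshold/timeout tuning
# discussed above. Production code would use resilience4j, Hystrix, or a
# similar battle-tested library instead of this toy version.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed >= self.reset_timeout:
                self.opened_at = None   # half-open: allow one trial call
                self.failures = 0
            else:
                raise RuntimeError("circuit open; failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0   # success resets the failure count
        return result
```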

&lt;p&gt;&lt;strong&gt;Monitor Inter-Service Communication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Utilize monitoring solutions like &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, or &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; to monitor how services communicate under normal and failure conditions. Service meshes like Istio or Linkerd can provide detailed insights without changing your application code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on metrics like request success rates, latency, throughput, and error rates. These metrics will help you understand the impact of dependency failures on your system’s performance and reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Recommendation → Monitoring in real-time allows you to quickly identify and respond to unexpected behaviors, minimizing the impact on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze Fallback Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evaluate the effectiveness of implemented fallback mechanisms. This includes static responses, cache usage, default values, or switching to a secondary service if the primary is unavailable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assess if the ‘retry logic’ is appropriately configured. This includes evaluating the retry intervals, backoff strategies, and the maximum number of attempts to prevent overwhelming a failing service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that fallback mechanisms enable your system to operate in a degraded mode rather than failing outright. This helps maintain a service level even when dependencies are experiencing issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
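
&lt;p&gt;A compact sketch of the retry logic described above, with exponential backoff, jitter, and a capped attempt count (the parameter defaults are illustrative):&lt;/p&gt;

```python
import random
import time

# Retry with exponential backoff plus full jitter, capped at a maximum
# attempt count so a failing dependency isn't hammered forever.
def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.random())  # full jitter
```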

&lt;h2&gt;2. Resource Manipulation Experiment&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To understand how a system behaves when subjected to unusual or extreme resource constraints, such as CPU, memory, disk I/O, and network bandwidth. The aim is to identify potential bottlenecks and ensure that the system can handle unexpected spikes in demand without significantly degrading service.&lt;/p&gt;

&lt;h3&gt;Possible Experiments&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU Saturation.&lt;/strong&gt; Increase CPU usage gradually to see how the system prioritizes tasks and whether essential services remain available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Consumption.&lt;/strong&gt; Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk I/O and Space Exhaustion.&lt;/strong&gt; Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How to Run a Resource Manipulation Experiment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Define Resource Limits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by monitoring your system under normal operating conditions to establish a baseline for CPU, memory, disk I/O, and network bandwidth usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on historical data and performance metrics, define the normal operating range for each critical resource. This will help you identify when the system is under stress, or resource usage is abnormally high during the experiment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
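
&lt;p&gt;As a simple illustration, a baseline range can be derived from monitoring samples, for example mean plus or minus three standard deviations (your observability platform likely offers more robust baselining than this):&lt;/p&gt;

```python
import statistics

# Simple illustration of turning monitoring samples into a "normal
# operating range": mean plus or minus three standard deviations per
# resource. Real baselining tools account for seasonality and trends.
def normal_range(samples):
    mean = statistics.fmean(samples)
    spread = 3 * statistics.pstdev(samples)
    return (mean - spread, mean + spread)

# e.g. cpu_low, cpu_high = normal_range(cpu_percent_samples)
```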

&lt;p&gt;&lt;strong&gt;Check and Verify the Break-Even Point&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Understand your system’s maximum capacity before it requires scaling. This involves testing the system under gradually increasing load to identify the point at which performance starts to degrade and additional resources are needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you’re using auto-scaling (either in the cloud or on-premises), clearly define and verify the rules for adding new instances or allocating resources. This includes setting CPU and memory usage thresholds, along with any other metrics that trigger scaling actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use load testing tools like &lt;a href="https://jmeter.apache.org/" rel="noopener noreferrer"&gt;JMeter&lt;/a&gt;, &lt;a href="https://gatling.io/" rel="noopener noreferrer"&gt;Gatling&lt;/a&gt;, or &lt;a href="https://locust.io/" rel="noopener noreferrer"&gt;Locust&lt;/a&gt; to simulate demand spikes and verify that your auto-scaling rules work as expected. This will ensure that your system can handle real-world traffic patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Select Manipulation Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While stress and stress-ng are powerful for generating CPU, memory, and I/O load on Linux systems, they might not be easy to use across distributed or containerized environments. Tools like &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; offer more user-friendly interfaces for various environments, including microservices and cloud-native applications.&lt;/p&gt;

&lt;p&gt;💡 Pro Tip → Ensure that the tool you select can accurately simulate the types of resource manipulation you’re interested in, whether it’s exhausting CPU cycles, filling up memory, saturating disk I/O, or hogging network bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apply Changes Gradually&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by applying small changes to resource consumption and monitor the system’s response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor system performance carefully to identify the thresholds at which performance degrades or fails. This will help you understand the system’s resilience and where improvements are needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor System Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use comprehensive monitoring solutions to track the impact of resource manipulation on system performance. Look for changes in response times, throughput, error rates, and system resource utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Pro Tip → Platforms like &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; can integrate with monitoring tools to provide a unified view of how resource constraints affect system health, making it easier to correlate actions with outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate Resilience&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analyze how effectively your system scales up resources in response to the induced stress. This includes evaluating the timeliness of scaling actions and whether the added resources alleviate the performance issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluate the efficiency of your resource allocation algorithms. This involves assessing whether resources are being utilized optimally and whether unnecessary wastage or contention exists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test the robustness of your failover and redundancy mechanisms under ‘conditions of resource scarcity’. This can include switching to standby systems, redistributing load among available resources, or degrading service gracefully.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;3. Network Disruption Experiment&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To simulate various network conditions that can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, the experiment seeks to understand how a system responds and adapts to network unreliability, ensuring critical applications can withstand and recover from real-world network issues.&lt;/p&gt;

&lt;h3&gt;Possible Experiments&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS Failures.&lt;/strong&gt; Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency Injection.&lt;/strong&gt; Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Packet Loss Simulation.&lt;/strong&gt; Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bandwidth Throttling.&lt;/strong&gt; Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Drops.&lt;/strong&gt; Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How to Run a Network Disruption Experiment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Identify Network Paths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by mapping out your network’s topology, including routers, switches, gateways, and the connections between different segments. Tools like &lt;a href="https://nmap.org/" rel="noopener noreferrer"&gt;Nmap&lt;/a&gt; or network diagram software can help visualize your network’s structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on identifying the critical paths data takes when traveling through your system. These include paths between microservices, external APIs, databases, and the Internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document these paths and prioritize them based on their importance to your system’s operation. This will help you decide where to start with your network disruption experiments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Disruption Type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decide on the type of network disruption to simulate. Options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;complete network outages,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;latency (delays in data transmission)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;packet loss (data packets being lost during transmission)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;bandwidth limitations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, choose disruptions based on their likelihood and potential impact on your system.&lt;br&gt;
For example, simulating latency and packet loss might be particularly relevant if your system is distributed across multiple geographic locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Network Chaos Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic Control (TC).&lt;/strong&gt; The ‘tc’ command in Linux is a powerful tool for controlling network traffic. It allows you to introduce delays, packet loss, and bandwidth restrictions on your network interfaces.&lt;/p&gt;
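
&lt;p&gt;For example, a typical netem invocation adds delay and packet loss on an interface. The hypothetical helper below only assembles the argument list; actually applying it requires root privileges and the iproute2 &lt;code&gt;tc&lt;/code&gt; binary:&lt;/p&gt;

```python
# Hypothetical helper that assembles tc/netem argument lists like those
# described above. It only builds the commands; running them for real
# requires root and the iproute2 tc binary (e.g. via subprocess.run).
def netem_command(dev, delay_ms=None, loss_pct=None):
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct is not None:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd

# netem_command("eth0", delay_ms=100, loss_pct=1) builds:
#   tc qdisc add dev eth0 root netem delay 100ms loss 1%
```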

&lt;p&gt;⚠️ Note → Simulating DNS failures can be complex but is crucial for understanding how your system reacts to DNS resolution issues. Consider using specialized tools or features for this purpose.&lt;/p&gt;

&lt;p&gt;Alternatively, chaos engineering platforms like Steadybit provide user-friendly interfaces for simulating network disruptions, along with safety features like built-in rollback strategies that minimize the risk of long-term impact on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Connectivity and Throughput&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During the experiment, use network monitoring tools and observability platforms to track connectivity and throughput metrics in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on monitoring packet loss rates, latency, bandwidth usage, and error rates to assess the impact of the network disruptions you’re simulating.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
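&lt;p&gt;For a quick, low-tooling check while the experiment runs, standard utilities can sample these metrics (the hostname below is hypothetical):&lt;/p&gt;

```shell
# Sample packet loss and round-trip latency against a dependency.
# api.internal.example.com is a placeholder host.
ping -c 50 api.internal.example.com | tail -2
# The summary lines report the packet loss percentage and min/avg/max RTT.

# Measure achievable throughput with iperf3 (the target must be
# running "iperf3 -s").
iperf3 -c api.internal.example.com -t 30
```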

&lt;p&gt;&lt;strong&gt;Assess Failover and Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evaluate how well your system’s failover mechanisms respond to network disruptions. For example, you could switch to a redundant network path, use a different DNS server, or take other predefined recovery actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Measure the time it takes for the system to detect and recover from the issue. This includes the time it takes to fail over and return to normal operations after the disruption ends.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Recommended → Analyze the overall resilience of your system to network instability. This assessment should include how well services degrade (if at all) and how quickly and effectively they recover once normal conditions are restored.&lt;/p&gt;

&lt;p&gt;If you want to read more, you can check out the &lt;a href="https://steadybit.com/blog/chaos-experiments/" rel="noopener noreferrer"&gt;rest of the post&lt;/a&gt; about principles of chaos engineering and popular tools here.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>performance</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Reducing Your Cloud Costs: An Operational Optimization Guide</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Mon, 17 Oct 2022 14:49:20 +0000</pubDate>
      <link>https://dev.to/blink-ops/reducing-your-cloud-costs-an-operational-optimization-guide-3eh6</link>
      <guid>https://dev.to/blink-ops/reducing-your-cloud-costs-an-operational-optimization-guide-3eh6</guid>
      <description>&lt;p&gt;Cloud costs are top of mind as business leaders and teams focus on honing their operational efficiency.&lt;/p&gt;

&lt;p&gt;In April at CIO.com’s Future of Cloud Summit, Dave McCarthy, research vice president of cloud infrastructure services at IDC, shared that cloud spending represents roughly &lt;a href="https://www.cio.com/article/403231/cios-contend-with-rising-cloud-costs.html" rel="noopener noreferrer"&gt;30% of current IT budgets&lt;/a&gt;. In the 2022 State of Cloud Report by Flexera, 750 surveyed executives shared that they estimate they are &lt;a href="https://www.forbes.com/sites/joemckendrick/2020/04/29/one-third-of-cloud-spending-wasted-but-still-accelerates/?sh=5a313399489e" rel="noopener noreferrer"&gt;wasting 30% of their cloud spend&lt;/a&gt;, while also saying that they expect costs to increase 47% over the next year. If you combine those stats, wasted cloud spend represents an efficiency opportunity of roughly 9% of total IT budgets (30% of the 30% spent on cloud).&lt;/p&gt;

&lt;p&gt;Achieving those cost savings isn’t as easy as flipping a switch. There is wasted spend embedded across multiple resource types, regions, and services. By function, the main categories of cloud spending are compute time, data storage, and data transfer.&lt;/p&gt;

&lt;p&gt;In this post, we’ll outline a framework for reviewing your cloud spending today, identifying wasted resources, and reviewing your long-term infrastructure efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewing Your Current Spending
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“What are we currently spending money on?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To start, you can review your current spend at the account level with the major cloud providers. AWS, Azure, and GCP all have reporting options that let you view and filter your spending over a period of time.&lt;/p&gt;

&lt;p&gt;In AWS, you can create &lt;a href="https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html" rel="noopener noreferrer"&gt;Cost and Usage Reports&lt;/a&gt;. In GCP, you can review your &lt;a href="https://cloud.google.com/billing/docs/how-to/reports" rel="noopener noreferrer"&gt;Cloud Billing Report&lt;/a&gt; and view spend by “Project” or other filters. In the Azure portal, you can download usage and charges from the “&lt;a href="https://learn.microsoft.com/en-us/azure/cost-management-billing/understand/download-azure-daily-usage" rel="noopener noreferrer"&gt;Cost Management + Billing&lt;/a&gt;” section.&lt;/p&gt;

&lt;p&gt;These views are useful for getting started and seeing transactional costs, such as data transfer charges. To get more granular detail on your cloud spending, you should leverage resource labels and tags to accurately categorize expenses.&lt;/p&gt;

&lt;p&gt;With labels and tags, you can associate resources with specific cost centers, projects, business units, or teams. You can then easily organize your resource data, create custom reports, and run specific queries.&lt;/p&gt;
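&lt;p&gt;As a sketch, once tags are in place you can group spend by tag straight from the CLI. The tag key and time period below are assumptions; substitute your own:&lt;/p&gt;

```shell
# Group a month of AWS spend by the cost-allocation tag "team".
# The tag key and dates are placeholders; adjust to your setup.
aws ce get-cost-and-usage \
  --time-period Start=2022-09-01,End=2022-10-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=team
```

&lt;p&gt;GCP and Azure offer similar label and tag filters in their billing reports.&lt;/p&gt;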

&lt;p&gt;If you do not currently have a mechanism or standard practice around resource tags and labels, you can refer to these how-to guides for setting up mandatory tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/enforcing-mandatory-tags-across-aws-resources" rel="noopener noreferrer"&gt;Enforcing Mandatory Tags Across Your AWS Resources&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/enforcing-labels-and-tags-across-your-gcp-resources" rel="noopener noreferrer"&gt;Enforcing Labels and Tags Across Your GCP Resources&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/enforcing-mandatory-tags-across-azure-resources" rel="noopener noreferrer"&gt;Enforcing Mandatory Tags Across Your Azure Resources&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use more than one cloud computing provider, you’ll need to aggregate invoices and usage reports across vendors. In this scenario, having consistent tagging methods across platforms is even more useful as it can offer a consistent way to view your resource usage and expenses.&lt;/p&gt;

&lt;p&gt;Once you have a clear sense of your current spending, you can look for opportunities to reduce your expenses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eliminating Unnecessary Resources
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“What resources are we spending money on and not using at all?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As projects are spun up and shut down, there are often resources that become unattached and left behind. While they are no longer in use, they are still costing your organization money on a recurring basis.&lt;/p&gt;

&lt;p&gt;Ideally, you have an automated way to regularly catch and delete these unattached resources. With a no-code platform like &lt;a href="https://app.blinkops.com/signup" rel="noopener noreferrer"&gt;Blink&lt;/a&gt;, teams can scale up scheduled automations to continuously detect and remove unnecessary resources.&lt;/p&gt;

&lt;p&gt;If you don’t have automations already in place, you can manually review resources in the console and remove unused ones in bulk. It can be time-consuming, but it may reduce your operating costs significantly in the short term.&lt;/p&gt;
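&lt;p&gt;For example, on AWS you can list unattached EBS volumes from the CLI before deciding what to delete (a read-only sketch; review the output carefully before removing anything):&lt;/p&gt;

```shell
# List EBS volumes not attached to any instance ("available" status
# means unattached). This only lists; deletion is a separate step.
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table
```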

&lt;p&gt;To know what types of resources to review, here are some common examples:&lt;/p&gt;

&lt;h4&gt;
  
  
  Unattached Disks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-delete-unattached-aws-resources" rel="noopener noreferrer"&gt;How to Find and Delete Unattached AWS Volumes and Gateways&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/finding-and-deleting-unattached-disks-with-the-azure-cli" rel="noopener noreferrer"&gt;Finding and Deleting Unattached Disks with the Azure CLI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-delete-unattached-gcp-disks" rel="noopener noreferrer"&gt;How to Find and Delete Unattached GCP Disks&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Unattached IP Addresses
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/finding-and-removing-unattached-aws-elastic-ip-addresses" rel="noopener noreferrer"&gt;Finding and Removing Unattached AWS Elastic IP Addresses&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/how-to-detect-and-remove-unattached-azure-public-ip-addresses" rel="noopener noreferrer"&gt;How to Detect and Remove Unattached Azure Public IP Addresses&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/finding-and-removing-unattached-gcp-external-ip-addresses" rel="noopener noreferrer"&gt;Finding and Removing Unattached GCP External IP Addresses&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Old Snapshots
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-remove-old-ebs-snapshots" rel="noopener noreferrer"&gt;How to Find and Remove Old EBS Snapshots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-remove-old-azure-snapshots" rel="noopener noreferrer"&gt;How to Find and Remove Old Azure Snapshots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-remove-old-gcp-disk-snapshots" rel="noopener noreferrer"&gt;How to Find and Remove Old GCP Disk Snapshots&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding and removing idle resources is a clear way to cut your operating costs, but it also is an important practice for maintaining a strong security posture. If you leave resources like unattached IP addresses, &lt;a href="https://www.blinkops.com/blog/how-to-find-and-delete-unattached-aws-resources" rel="noopener noreferrer"&gt;idle NAT Gateways&lt;/a&gt;, &lt;a href="https://www.blinkops.com/blog/tracking-down-amazon-load-balancers-with-no-target" rel="noopener noreferrer"&gt;load balancers with no target&lt;/a&gt;, or &lt;a href="https://www.blinkops.com/blog/getting-and-deleting-orphaned-secrets-with-kubectl" rel="noopener noreferrer"&gt;orphaned Secrets&lt;/a&gt; lying around, bad actors could find them and take advantage of the information. In this way, resource management is key to reducing costs and reducing risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing and Updating Resources
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“How can we optimize our existing resources?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that you’ve reviewed and removed unused resources, you can look at optimizing the resources you are using.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using the Right Family for the Job
&lt;/h4&gt;

&lt;p&gt;Whether you are creating new resources or evaluating existing ones, it’s important to consider which family of resources best fits your needs. If you’re using general-purpose machines, there might be another more cost-effective machine that is a better fit.&lt;/p&gt;

&lt;p&gt;Depending on your usage, you may need more capacity in some specifications than others. For example, if you’re using AWS, there are Compute Optimized instances under the C family (e.g. EC2 C7g instances) which offer optimal price performance for especially compute-intensive use cases, like batch processing workloads and scientific modeling. Other families include Memory Optimized (e.g. EC2 R6a instances) and Storage Optimized (e.g. EC2 Im4gn instances). There are lots of other families (e.g. IOPS-, network-, and accelerator-optimized) depending on the platform and the specification you want to optimize for.&lt;/p&gt;

&lt;p&gt;When considering your performance requirements, you might have use cases like batch jobs or workloads that are fault-tolerant. &lt;a href="https://azure.microsoft.com/en-us/products/virtual-machines/spot/#overview" rel="noopener noreferrer"&gt;Azure&lt;/a&gt;, &lt;a href="https://cloud.google.com/spot-vms" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; all have unused capacity that they offer as less expensive, less reliable Spot VMs. Compared to on-demand instances, they are up to 90% less expensive to run.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating to New Machines
&lt;/h4&gt;

&lt;p&gt;Within each of these families, newer versions are regularly introduced. The newer versions often run more efficiently or deliver higher performance, so it’s good practice to upgrade to them whenever you can.&lt;/p&gt;

&lt;p&gt;One example of this is with EBS volumes. By switching from &lt;a href="https://www.blinkops.com/blog/switching-gp2-volumes-to-gp3-volumes-to-lower-aws-ebs-costs" rel="noopener noreferrer"&gt;EBS GP2 volumes to EBS GP3 volumes&lt;/a&gt;, you can reduce your costs by 20%. There are some small performance tradeoffs, but it’s important to keep these types of upgrade opportunities in mind.&lt;/p&gt;

&lt;p&gt;Another AWS example is switching from older machines to ones that use the new AWS Graviton2 processors. Instances running on Graviton2 processors vs. Intel processors offer up to 40% better price performance, with specific efficiencies varying by family.&lt;/p&gt;

&lt;h4&gt;
  
  
  Looking for Low CPU Usage
&lt;/h4&gt;

&lt;p&gt;One way to optimize your spending is by rightsizing resources to match the usage level that you need. For example, you may be running an instance or virtual machine that has more compute capacity than you need.&lt;/p&gt;

&lt;p&gt;By reviewing your usage data, you can determine whether you are running at an average CPU usage of, say, 30% or less. By reducing the size or type of instance, you can trim your spend, and those savings add up over time.&lt;/p&gt;
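&lt;p&gt;On AWS, for instance, you can pull the CPU data for a candidate instance from CloudWatch (the instance ID and dates below are placeholders):&lt;/p&gt;

```shell
# Fetch two weeks of daily average CPU utilization for one instance.
# 86400 seconds = one datapoint per day; the instance ID is hypothetical.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2022-10-01T00:00:00Z \
  --end-time 2022-10-15T00:00:00Z \
  --period 86400 \
  --statistics Average
```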

&lt;p&gt;Here are some how-to guides that show examples for each platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/finding-and-resizing-amazon-ec2-instances-with-low-cpu-usage" rel="noopener noreferrer"&gt;Finding and Resizing Amazon EC2 Instances with Low CPU Usage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/finding-and-resizing-gcp-compute-instances-with-low-cpu-usage" rel="noopener noreferrer"&gt;Finding and Resizing GCP Compute Instances with Low CPU Usage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/finding-and-resizing-azure-virtual-machines-with-low-cpu-usage" rel="noopener noreferrer"&gt;Finding and Resizing Azure Virtual Machines with Low CPU Usage&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Using Long-Term Resourcing for Predictable CPU Usage
&lt;/h4&gt;

&lt;p&gt;Another way to optimize your costs is by leveraging reserved instances or committed use discounts. In exchange for predictable computing expectations, the major cloud providers offer resources at a discount with a committed term, such as 1 year or 3 years.&lt;/p&gt;

&lt;p&gt;Here are some how-to guides that show examples for each platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/lowering-costs-on-long-running-aws-ec2-instances" rel="noopener noreferrer"&gt;Lowering Costs on Long Running AWS EC2 Instances&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/lowering-costs-for-long-running-gcp-instances-with-committed-use-discounts" rel="noopener noreferrer"&gt;Lower Costs for Long Running GCP Instances with Committed Use Discounts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/optimizing-costs-for-long-running-azure-vms-with-reserved-instances" rel="noopener noreferrer"&gt;Optimizing Costs for Long Running Azure VMs with Reserved Instances&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Starting Nightly Non-Production Scale-Downs
&lt;/h4&gt;

&lt;p&gt;Are there any resources that you can shut down when they are not being used? For example, if your team works with a test environment only during certain hours, you don’t need to run it 24 hours a day. You can scale it down at night and scale it back up the next morning.&lt;/p&gt;

&lt;p&gt;With some automation, pausing and restarting a non-production cluster can be as simple as clicking an approval button in a Slack message, reducing your daily cloud costs with minimal effort.&lt;/p&gt;
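&lt;p&gt;To gauge whether the automation is worth building, a back-of-the-envelope estimate helps. The figures below are hypothetical assumptions (five nodes at the $0.096/hour on-demand rate of an m5.large):&lt;/p&gt;

```shell
# Estimate monthly savings from a 12-hour nightly scale-down.
# All inputs are assumptions; substitute your own node count and rate.
awk 'BEGIN {
  nodes = 5; hourly = 0.096; hours_off = 12; days = 30
  printf "Monthly savings: $%.2f\n", nodes * hourly * hours_off * days
}'
# prints: Monthly savings: $172.80
```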

&lt;p&gt;Here are a couple examples of how to pause and restart clusters nightly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/how-to-scale-down-aws-eks-clusters-nightly-to-lower-ec2-costs" rel="noopener noreferrer"&gt;How to Scale Down AWS EKS Clusters Nightly&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/how-to-pause-your-gke-cluster-nightly" rel="noopener noreferrer"&gt;How to Pause Your GKE Cluster Nightly&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/how-to-pause-your-aks-clusters-nightly" rel="noopener noreferrer"&gt;How to Pause Your AKS Cluster Nightly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storing and Moving Data Efficiently
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“Can we optimize how our data is stored and transferred?”&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Storing Only Relevant Data
&lt;/h4&gt;

&lt;p&gt;Your cloud bill is also impacted by how much data you are storing. While it’s useful to collect data to see how your services are running, that data usually becomes less relevant over time. Even if you want to retain as much data as possible, you should periodically move it to less costly, long-term storage tiers, such as Amazon’s &lt;a href="https://aws.amazon.com/archive/" rel="noopener noreferrer"&gt;S3 Glacier storage&lt;/a&gt;.&lt;/p&gt;
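&lt;p&gt;On AWS, this kind of policy can be automated with an S3 lifecycle rule. The bucket name and day counts below are assumptions to adapt:&lt;/p&gt;

```shell
# Transition objects to Glacier after 90 days, expire after 365 days.
# The bucket name and thresholds are placeholders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-log-archive-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-then-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'
```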

&lt;p&gt;Here are some how-to guides for AWS on how to identify data that hasn’t changed in a while and how to reduce logging storage costs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/detecting-aws-dynamodb-tables-with-stale-data" rel="noopener noreferrer"&gt;Detecting AWS DynamoDB Tables with Stale Data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/lowering-aws-cloudtrail-costs-by-removing-redundant-trails" rel="noopener noreferrer"&gt;Lowering AWS CloudTrail Costs by Removing Redundant Trails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/ensuring-aws-cloudwatch-log-groups-have-set-retention-periods" rel="noopener noreferrer"&gt;Ensuring AWS CloudWatch Log Groups Have Set Retention Periods&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Optimizing Data Transfers
&lt;/h4&gt;

&lt;p&gt;Data transfers may also account for a significant part of your cloud costs, and their price varies greatly depending on the source, destination, method of transport, and size of each transfer.&lt;/p&gt;

&lt;p&gt;You can also expect charges when transferring data across regions or across availability zones. Unless your business case requires it, avoid data transfers that cross these boundaries.&lt;/p&gt;

&lt;p&gt;While inbound (or ingress) data transfers between the internet and your cloud provider are not charged, outbound transfers are charged per service. You should reduce outbound data transfers from your cloud to external destinations as much as possible.&lt;/p&gt;

&lt;p&gt;If you are transferring data between AWS services, for example, you should use VPC endpoints. This way, when you access an S3 bucket from an EC2 instance, the traffic stays on the AWS network and you avoid data transfer charges.&lt;/p&gt;
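&lt;p&gt;As a sketch, creating a gateway endpoint for S3 looks like this (the VPC and route table IDs are placeholders, and the service name embeds your region):&lt;/p&gt;

```shell
# Create a gateway VPC endpoint so EC2-to-S3 traffic stays on the
# AWS network instead of traversing the internet.
# The service name is region-specific (us-east-1 here).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```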

&lt;p&gt;The same principle applies when transferring data from your cloud to on-premises locations: tools like AWS &lt;a href="https://aws.amazon.com/directconnect/" rel="noopener noreferrer"&gt;Direct Connect&lt;/a&gt;, GCP &lt;a href="https://cloud.google.com/network-connectivity/docs/direct-peering" rel="noopener noreferrer"&gt;Direct Peering&lt;/a&gt;, and Azure &lt;a href="https://azure.microsoft.com/en-us/products/expressroute/#overview" rel="noopener noreferrer"&gt;ExpressRoute&lt;/a&gt; may offer a lower cost per GB than transfers over the public internet. Actual savings depend on the amount of data you are moving; below a certain volume, these options might not make sense.&lt;/p&gt;

&lt;p&gt;You can read more about the types of data transfer charges in the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/plan-for-data-transfer.html" rel="noopener noreferrer"&gt;Cost Optimization&lt;/a&gt; pillar of the AWS Well-Architected Framework, or these &lt;a href="https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;, &lt;a href="https://cloud.google.com/vpc/network-pricing" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;, and &lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;Azure&lt;/a&gt; resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Achieving Operational Excellence with Blink Automations
&lt;/h2&gt;

&lt;p&gt;So far, we have covered several areas where you and your team can optimize your costs, but achieving significant savings over time requires new processes.&lt;/p&gt;

&lt;p&gt;Beyond finding unused resources, you need an automated process that alerts you to cost reduction opportunities and makes approving resource removal as easy as clicking a button. If you rely only on scripts, you may accidentally take down environments or resources that should have been left running.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.blinkops.com/" rel="noopener noreferrer"&gt;Blink&lt;/a&gt;, you can use no-code automations to achieve operational excellence. In the cost optimization context, Blink lets you create and run dozens of common resource checks and send reports to email or Slack channels with simple, actionable options.&lt;/p&gt;

&lt;p&gt;By running these Blink automations on a schedule, you’ll be able to confidently ensure that you are achieving operational excellence not just one time, but daily. You can take the same Blink automation approach for other operational excellence categories, like security operations, incident response, troubleshooting, and permissions management.&lt;/p&gt;

&lt;p&gt;Get started with a &lt;a href="https://app.blinkops.com/signup" rel="noopener noreferrer"&gt;free Blink account&lt;/a&gt; or reach out to us directly to &lt;a href="https://www.blinkops.com/contact" rel="noopener noreferrer"&gt;hear more&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Finding and Deleting Orphaned ConfigMaps</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Thu, 16 Jun 2022 18:00:03 +0000</pubDate>
      <link>https://dev.to/blink-ops/finding-and-deleting-orphaned-configmap-g4p</link>
      <guid>https://dev.to/blink-ops/finding-and-deleting-orphaned-configmap-g4p</guid>
      <description>&lt;p&gt;If you don’t take steps to maintain your Kubernetes cluster, you could end up wasting money and storage on orphaned resources. Orphaned (or unused) resources, like ConfigMaps, Secrets, and Services, should be regularly located and removed to clear up storage space and prevent performance issues. &lt;/p&gt;

&lt;p&gt;In this post, we’ll be focusing on how to find and remove orphaned ConfigMaps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/configuration/configmap/" rel="noopener noreferrer"&gt;ConfigMaps&lt;/a&gt; are API objects created to hold small amounts of non-confidential configuration data. They decouple configuration data from container images and application code, making applications more portable, but they cannot hold secret or encrypted data.&lt;/p&gt;

&lt;p&gt;ConfigMaps may become orphaned if they are left behind by the deployment they were created to support, or if their owners have been deleted. Once orphaned, these ConfigMaps waste cluster storage and increase the risk of cluster instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding and Deleting Orphaned ConfigMaps
&lt;/h2&gt;

&lt;p&gt;Here are some steps you can take to find and remove orphaned ConfigMaps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Find all ConfigMaps
&lt;/h3&gt;

&lt;p&gt;First off, you can generate a list of all ConfigMaps using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get configmaps --all-namespaces -o json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will return the list of ConfigMaps across all namespaces, but as you’ll see, the ConfigMap object does not reference its owner. You’ll need to run another command to identify which of the ConfigMaps have owners and are in use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Compare with a List of Used ConfigMaps
&lt;/h3&gt;

&lt;p&gt;To find orphaned ConfigMaps, you need the list of ConfigMaps that pods actually reference, whether through volumes, projected volumes, environment variables, or envFrom. The following collects those references and diffs them against the full list of ConfigMaps, leaving the unused ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Run this per namespace; ConfigMap names are only unique within a namespace.
volumesCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.volumes[*].configMap.name}' | xargs -n1)
volumesProjectedCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.volumes[*].projected.sources[*].configMap.name}' | xargs -n1)
envCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.containers[*].env[*].valueFrom.configMapKeyRef.name}' | xargs -n1)
envFromCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.containers[*].envFrom[*].configMapRef.name}' | xargs -n1)

diff \
&amp;lt;(printf '%s\n' "$volumesCM" "$volumesProjectedCM" "$envCM" "$envFromCM" | grep . | sort | uniq) \
&amp;lt;(kubectl get configmaps -o jsonpath='{.items[*].metadata.name}' | xargs -n1 | sort | uniq)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the diff output, names that appear only in the second list (prefixed with &amp;gt;) exist in the cluster but are not referenced by any pod. These are your orphaned ConfigMaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Delete Orphaned ConfigMaps
&lt;/h3&gt;

&lt;p&gt;Now that you have a list of orphaned ConfigMaps, you can run this command for each one (substituting its name for samplemap) to delete it and free up storage in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete configmap/samplemap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;configmap "samplemap" deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you’ve deleted all the orphaned ConfigMaps you found, you’ll have removed unneeded resources from your cluster and freed up storage space. If you remove orphaned resources regularly, you’ll help your team maintain optimal Kubernetes resource management.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Let me know if this worked for you.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
