<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Londa</title>
    <description>The latest articles on DEV Community by Patrick Londa (@patricklonda).</description>
    <link>https://dev.to/patricklonda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F729626%2F41f43855-071c-460f-b907-e96ba8993a64.jpg</url>
      <title>DEV Community: Patrick Londa</title>
      <link>https://dev.to/patricklonda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patricklonda"/>
    <language>en</language>
    <item>
      <title>The Business Case for Chaos Engineering: An ROI Calculator for Testing Application Reliability</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Tue, 10 Mar 2026 21:41:50 +0000</pubDate>
      <link>https://dev.to/steadybit/the-business-case-for-chaos-engineering-an-roi-calculator-for-testing-application-reliability-2dhk</link>
      <guid>https://dev.to/steadybit/the-business-case-for-chaos-engineering-an-roi-calculator-for-testing-application-reliability-2dhk</guid>
      <description>&lt;p&gt;&lt;strong&gt;"What do we get from intentionally injecting failures into our systems?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;Chaos engineering&lt;/a&gt; is one of the best ways to proactively test your application reliability, but many leadership teams have never heard of the concept.&lt;/p&gt;

&lt;p&gt;Engineering teams need to be able to frame a strong business case to explain the value of chaos engineering and reliability testing to budget holders. When there’s a major outage, the value of application reliability becomes immediately clear, but a strong ROI plan can help earn and maintain executive support while your systems are steady.&lt;/p&gt;

&lt;p&gt;We have just released an &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;interactive ROI calculator&lt;/a&gt; that can help SRE teams frame the business value of proactive reliability efforts like chaos engineering.&lt;/p&gt;

&lt;h2&gt;Ensuring Application Resilience with Chaos Engineering&lt;/h2&gt;

&lt;p&gt;Complex software systems are destined to break and fail at some point, especially with all the factors present in modern production environments. When we ask teams about their approach to system reliability, we often hear back: “We have enough chaos already!”&lt;/p&gt;

&lt;p&gt;When done correctly, &lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;chaos engineering&lt;/a&gt; isn’t adding chaos to your systems. Rather, it’s running different scenarios to validate how resilient your systems are under stressful conditions. You can test your expectations against the reality of performance degradation during an availability zone outage, a delayed dependency, or a sudden surge of users.&lt;/p&gt;

&lt;p&gt;These experiments provide a feedback loop earlier in the software development cycle that enables teams to design more fault-tolerant systems for their users.&lt;/p&gt;

&lt;p&gt;When reliability testing is rolled out across applications, teams can find and address risks that could lead to critical incidents and improve their mean time to remediation (MTTR).&lt;/p&gt;

&lt;h2&gt;Early ROI Prototypes and Why They Didn’t Work&lt;/h2&gt;

&lt;h3&gt;📉 Savings from Reducing the Overall Number of Incidents&lt;/h3&gt;

&lt;p&gt;Initially, we explored the premise that implementing chaos engineering leads to fewer incidents at every severity tier. If a user provides the number of incidents they had in the past year, we can assume some percentage reduction across all of them, which makes the calculation simple once a cost is assigned to each tier of incident.&lt;/p&gt;
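
&lt;p&gt;As a minimal sketch of that first prototype’s arithmetic (the tier names, per-tier costs, and 25% reduction below are hypothetical example inputs, not figures from the calculator):&lt;/p&gt;

```python
# Sketch of the discarded first model: a flat percentage reduction applied
# to every severity tier. Tier names, counts, and per-incident costs are
# hypothetical example inputs.
INCIDENT_COSTS = {"sev0": 250_000, "sev1": 60_000, "sev2": 8_000}

def naive_annual_savings(incidents_per_tier, reduction=0.25):
    """Savings = avoided incidents per tier times assumed cost per tier."""
    return sum(
        count * reduction * INCIDENT_COSTS[tier]
        for tier, count in incidents_per_tier.items()
    )

# e.g. naive_annual_savings({"sev0": 4, "sev1": 20, "sev2": 150})
```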

&lt;p&gt;The trouble with this approach is that it assumes all incidents should be avoided. Incidents are useful metrics for flagging anomalous behaviors, and some don’t necessarily have a negative impact on customers. An increase in low-level incidents could actually be a positive sign that alert coverage is improving and properly surfacing system weaknesses.&lt;/p&gt;

&lt;p&gt;Moving forward, we chose to focus on savings from reducing the number of critical incidents (Sev0, P1, etc.) year over year. This metric is easier to track and doesn’t create incentives to avoid logging lower-severity incidents.&lt;/p&gt;

&lt;h3&gt;🧪 Identifying &amp;amp; Fixing Reliability Risks By Running More Experiments&lt;/h3&gt;

&lt;p&gt;If you go from 0 to 100 experiment runs on your systems, you are bound to discover new performance gaps and reliability risks. How many more risks will you discover at 200 experiment runs?&lt;/p&gt;

&lt;p&gt;We built a version of an ROI calculator that assumed that as the number of experiments increased, a certain percentage of experiments would reveal issues at different incident risk levels with assigned potential costs. Teams would then fix a certain percentage of these reliability risks depending on their development capacity. As the experiment run count scaled, there would be diminished returns for revealing new issues.&lt;/p&gt;

&lt;p&gt;While it’s true that teams will find more potential reliability issues as they run more experiments, this approach was a little too one-dimensional. We didn’t have well-documented references for how issue detection rates change with scale, and teams would likely need to create a new reporting mechanism to follow along as they mitigated risks.&lt;/p&gt;

&lt;p&gt;There is also nuance here: some teams automate their experiment runs and use them as regression tests in their CI/CD pipelines. We decided it would be better to measure impact with metrics that are already tracked and available to SREs at most organizations.&lt;/p&gt;

&lt;h2&gt;Where We Landed with Key Metrics for Our ROI Calculator&lt;/h2&gt;

&lt;h3&gt;⚡ Savings from Faster Average MTTR&lt;/h3&gt;

&lt;p&gt;As we were iterating on the inputs and outputs, we saw a &lt;a href="https://youtu.be/mYKNR0UXwMc?si=d-HUlm7xivWNaTHd&amp;amp;t=2444" rel="noopener noreferrer"&gt;great presentation&lt;/a&gt; from Keith Blizard and Joe Cho at AWS re:Invent 2024, featuring a case study on the progress Fidelity Investments had made in rolling out chaos engineering across their organization. They documented major improvements to mean time to resolution (MTTR) as they scaled chaos testing coverage across applications.&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.youtube.com/embed/mYKNR0UXwMc"&gt;&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;We used these case study metrics to plot the correlation between the percent of applications with chaos testing coverage and the incremental improvement to MTTR. We then used this relationship to calculate improvements against an assumed industry-wide average MTTR of 175 minutes, per this &lt;a href="https://www.pagerduty.com/resources/insights/learn/cost-of-downtime/" rel="noopener noreferrer"&gt;2024 PagerDuty report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This MTTR savings means fewer minutes of downtime, which that same study estimated can cost between $4,000 and $15,000 per minute. In our calculator, we ask users to input their “Annual Company Revenue” so we can use the most relevant cost of downtime per minute, as downtime is typically more costly for larger enterprises. This &lt;a href="https://www.bigpanda.io/wp-content/uploads/2024/04/EMA-BigPanda-final-Outage-eBook.pdf" rel="noopener noreferrer"&gt;2024 report&lt;/a&gt; commissioned by BigPanda found that downtime cost an average of $14,056 per minute for organizations with more than 1,000 employees.&lt;/p&gt;
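
&lt;p&gt;Numerically, the savings line can be sketched like this (the linear coverage-to-improvement relationship and the &lt;code&gt;max_mttr_improvement&lt;/code&gt; factor are simplifying assumptions; the calculator fits its curve to the case-study data):&lt;/p&gt;

```python
# Back-of-the-envelope sketch of the MTTR savings line. The linear
# coverage-to-improvement relationship and the max_mttr_improvement factor
# below are hypothetical simplifications, not the calculator's fitted curve.
BASELINE_MTTR_MIN = 175    # assumed industry average (2024 PagerDuty report)
COST_PER_MINUTE = 14_056   # avg cost for orgs with 1,000+ employees (BigPanda)

def mttr_savings(coverage, critical_incidents_per_year, max_mttr_improvement=0.5):
    """Annual downtime-cost savings from faster MTTR at a coverage level (0 to 1)."""
    minutes_saved_per_incident = BASELINE_MTTR_MIN * max_mttr_improvement * coverage
    return minutes_saved_per_incident * critical_incidents_per_year * COST_PER_MINUTE
```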

&lt;h3&gt;🛡️ Savings from Reducing Critical Incidents&lt;/h3&gt;

&lt;p&gt;At &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt;, we partner with a wide range of customers and have seen how many major reliability gaps are uncovered by running chaos experiments. Using insights from our customers and referencing industry studies, we've seen that actively running reliability tests on any given application conservatively leads to an average 30% reduction in critical incidents for that application per year.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;our calculator&lt;/a&gt;, we ask users to input the total number of applications their organization operates and how many of these applications have reliability testing coverage. We multiply the standard 30% reduction per year in critical incidents by the percent of applications with testing coverage to get the overall incident reduction for the organization.&lt;/p&gt;
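
&lt;p&gt;In code, that overall reduction works out to the following (a simplified sketch of the math described above, not the calculator’s actual implementation):&lt;/p&gt;

```python
# Sketch of the critical-incident reduction described above: a conservative
# 30% per-application reduction, scaled by the share of applications that
# have reliability testing coverage.
REDUCTION_PER_COVERED_APP = 0.30

def critical_incidents_avoided(total_apps, covered_apps, critical_incidents_per_year):
    """Expected critical incidents avoided per year across the organization."""
    coverage = covered_apps / total_apps
    return critical_incidents_per_year * REDUCTION_PER_COVERED_APP * coverage
```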

&lt;h3&gt;🛠️ Costs of Implementing Chaos Engineering&lt;/h3&gt;

&lt;p&gt;If you want to run chaos experiments at scale, you will likely need to onboard a commercial reliability platform or chaos engineering tool. Open source solutions can be a good starting place, but deploying these across teams and technologies can become increasingly time-intensive. We used general license estimates based on market knowledge and projected experiment activity.&lt;/p&gt;

&lt;p&gt;As with any new program, an organization will need engineers owning the project and dedicating time to a successful rollout of chaos testing. We included a field in our calculator for “Testing Rollout Managers”, measured in FTEs (40 hr/week of staff time). We used an average SRE salary of $160k per year as a benchmark to estimate the cost of this implementation effort.&lt;/p&gt;

&lt;h2&gt;Showing the Return on Investment for Reliability Testing&lt;/h2&gt;

&lt;p&gt;We ask users to project how they would expect to roll out chaos engineering at their organization, including unique test types, number of experiments, and coverage across applications. Our &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;ROI calculator&lt;/a&gt; will then output a summary and detailed view of your projected savings, implementation costs, and return on investment. When you game out multi-year adoption goals, you'll be building a business case that can help you frame the value of making this type of investment.&lt;/p&gt;

&lt;p&gt;If you’re successful in getting buy-in to roll out chaos engineering, you’ll need to report back on your progress. If you’re using an incident management platform like Splunk or PagerDuty, you may already have built-in MTTR metrics available to reference. You can also track the number of critical incidents using observability tools like Datadog, Dynatrace, or Grafana.&lt;/p&gt;

&lt;p&gt;These metrics will hopefully show clear improvements, but your systems may become increasingly complex at the same time that you’re rolling out this testing, especially with the rise of AI agents. Even simply maintaining your current reliability posture as your systems evolve and become significantly more complex could be framed as a win.&lt;/p&gt;

&lt;h2&gt;Rolling Out Chaos Engineering Across Your Organization&lt;/h2&gt;

&lt;p&gt;Highly available applications don’t naturally draw attention in the way that outages do. If you want to continue the momentum and foster a culture of reliability, you'll need to intentionally share your wins. For example, if you find a major reliability vulnerability and are able to address it before it impacts customers, that's something to celebrate internally.&lt;/p&gt;

&lt;p&gt;If you’d like guidance getting started with chaos testing and adopting a proactive reliability program, our team of experts at &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; is ready to help.&lt;/p&gt;

&lt;p&gt;You can explore our reliability platform with a &lt;a href="https://signup.steadybit.com/" rel="noopener noreferrer"&gt;30-day free trial&lt;/a&gt; or &lt;a href="https://steadybit.com/book-demo/" rel="noopener noreferrer"&gt;book a quick call&lt;/a&gt; with us to discuss how you can implement chaos engineering and start saving money today.&lt;/p&gt;

</description>
      <category>roi</category>
      <category>chaosengineering</category>
      <category>sre</category>
      <category>testing</category>
    </item>
    <item>
      <title>3 Types of Chaos Experiments and How To Run Them</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Thu, 24 Apr 2025 17:44:42 +0000</pubDate>
      <link>https://dev.to/steadybit/3-types-of-chaos-experiments-and-how-to-run-them-1p59</link>
      <guid>https://dev.to/steadybit/3-types-of-chaos-experiments-and-how-to-run-them-1p59</guid>
      <description>&lt;p&gt;The primary objective of a Chaos Experiment is to uncover hidden bugs, weaknesses, or non-obvious points of failure in a system that could lead to significant outages, degradation of service, or system failure under unpredictable real-world conditions.&lt;/p&gt;

&lt;h1&gt;What is a Chaos Experiment?&lt;/h1&gt;

&lt;p&gt;A Chaos Experiment is a carefully designed, controlled, and monitored process that systematically introduces disturbances or abnormalities into a system’s operation to observe and understand its response to such conditions.&lt;/p&gt;

&lt;p&gt;It forms the core part of &lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;‘Chaos Engineering’&lt;/a&gt;, which is predicated on the idea that ‘the best way to understand system behavior is by observing it under stress.’ This means intentionally injecting faults into a system in production or simulated environments to test its reliability and resilience.&lt;/p&gt;

&lt;p&gt;This practice emerged from the understanding that systems, especially distributed systems, are inherently complex and unpredictable due to their numerous interactions and dependencies.&lt;/p&gt;

&lt;h1&gt;The Components of a Chaos Engineering Experiment&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hypothesis Formation.&lt;/strong&gt; At the initial stage, a hypothesis is formed about the system’s steady-state behavior and expected resilience against certain types of disturbances. This hypothesis predicts no significant deviation in the system’s steady state as a result of the experiment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variable Introduction.&lt;/strong&gt; This involves injecting specific variables or conditions that simulate real-world disturbances (such as network latency, server failures, or resource depletion). These variables are introduced in a controlled manner to avoid unnecessary risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scope and Safety.&lt;/strong&gt; The experiment’s scope is clearly defined to limit its impact, often called the “blast radius.” Safety mechanisms, such as automatic rollback or kill switches, are implemented to halt the experiment if unexpected negative effects are observed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observation and Data Collection.&lt;/strong&gt; Throughout the experiment, system performance and behavior are closely monitored using detailed logging, metrics, and observability tools. This data collection is critical for analyzing the system’s response to the introduced variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analysis and Learning.&lt;/strong&gt; After the experiment, the data is analyzed to determine whether the hypothesis was correct. This analysis extracts insights regarding the system’s vulnerabilities, resilience, and performance under stress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterative Improvement.&lt;/strong&gt; The findings from each chaos experiment inform adjustments in system design, architecture, or operational practices. These adjustments aim to mitigate identified weaknesses and enhance overall resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Note → The ultimate goal is not to break things randomly but to uncover systemic weaknesses to improve the system’s resilience. By introducing chaos, you can enhance the understanding of your systems, leading to higher availability, reliability, and a better user experience.&lt;/p&gt;
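
&lt;p&gt;One lightweight way to capture these components in code (the field names below are illustrative, not any particular tool’s schema):&lt;/p&gt;

```python
from dataclasses import dataclass, field

# One lightweight way to capture the components above as a structured
# experiment definition. Field names are illustrative, not any tool's schema.
@dataclass
class ChaosExperiment:
    hypothesis: str                 # expected steady-state behavior
    fault: str                      # variable introduced into the system
    blast_radius: list[str]         # services or hosts in scope
    abort_conditions: list[str] = field(default_factory=list)  # kill-switch triggers
    observations: dict = field(default_factory=dict)           # metrics collected

exp = ChaosExperiment(
    hypothesis="p99 checkout latency stays under 800ms during the fault",
    fault="300ms added latency on payment-service egress",
    blast_radius=["checkout", "payment-service"],
    abort_conditions=["checkout error rate above 5%"],
)
```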

&lt;h1&gt;Types of Chaos Experiments&lt;/h1&gt;

&lt;h2&gt;1. Dependency Failure Experiment&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To assess how microservices behave when one or more of their dependencies fail. In a microservices architecture, services are designed to perform small tasks and often rely on other services to fulfill a request. The failure of these external dependencies can lead to cascading failures across the system, resulting in degraded performance or system outages. Understanding how these failures impact the overall system is crucial for building resilient services.&lt;/p&gt;

&lt;h3&gt;Possible Experiments&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Latency and Packet Loss.&lt;/strong&gt; Simulate increased latency or packet loss to understand its impact on service response times and throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service Downtime.&lt;/strong&gt; Emulate the unavailability of a critical service to observe the system’s resilience and failure modes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database Connectivity Issues.&lt;/strong&gt; Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Third-party API Limiting.&lt;/strong&gt; Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How to Run a Dependency Failure Experiment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Map Out Dependencies.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Begin with a comprehensive inventory of all the external services your system interacts with. This includes databases, third-party APIs, cloud services, and internal services if you work in a microservices architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dependency, document how your system interacts with it. Note the data exchanged, request frequency, and criticality of each interaction to your system’s operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rank these dependencies based on their importance to your system’s core functionalities. This will help you focus your efforts on the most critical dependencies first.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Simulate Failures&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use service virtualization or proxy tools like Steadybit to simulate various failures for your dependencies. These can range from network latency, dropped connections, and timeouts to complete unavailability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dependency, configure the types of faults you want to introduce. This could include delays, error rates, or bandwidth restrictions, mimicking real-world issues that could occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start with less severe faults (like increased latency) and gradually move to more severe conditions (like complete downtime), observing the system’s behavior at each stage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test Microservices Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implement Resilience Patterns. Use libraries like &lt;a href="https://github.com/Netflix/Hystrix" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt;, &lt;a href="https://resilience4j.readme.io/docs/getting-started" rel="noopener noreferrer"&gt;resilience4j&lt;/a&gt;, or &lt;a href="https://spring.io/projects/spring-cloud-circuitbreaker" rel="noopener noreferrer"&gt;Spring Cloud Circuit Breaker&lt;/a&gt; to implement patterns that prevent failures from cascading across services. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bulkheads.&lt;/strong&gt; Isolate parts of the application into “compartments” to prevent failures in one area from overwhelming others.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Breakers.&lt;/strong&gt; Automatically “cut off” calls to a dependency if it’s detected as down, allowing it to recover without being overwhelmed by constant requests.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Carefully configure thresholds and timeouts for these patterns. This includes setting the appropriate parameters for circuit breakers to trip and recover and defining bulkheads to isolate services effectively.&lt;/p&gt;
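
&lt;p&gt;For illustration, here is a minimal circuit-breaker sketch showing the two knobs discussed above, a failure threshold and a reset timeout; in practice you would rely on a library like resilience4j rather than rolling your own:&lt;/p&gt;

```python
import time

# Minimal circuit-breaker sketch illustrating the threshold/timeout tuning
# discussed above. Production code would use resilience4j, Hystrix, or a
# similar battle-tested library instead of this toy version.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed >= self.reset_timeout:
                self.opened_at = None   # half-open: allow one trial call
                self.failures = 0
            else:
                raise RuntimeError("circuit open; failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0   # success resets the failure count
        return result
```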

&lt;p&gt;&lt;strong&gt;Monitor Inter-Service Communication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Utilize monitoring solutions like &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, or &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; to monitor how services communicate under normal and failure conditions. Service meshes like Istio or Linkerd can provide detailed insights without changing your application code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on metrics like request success rates, latency, throughput, and error rates. These metrics will help you understand the impact of dependency failures on your system’s performance and reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Recommendation → Monitoring in real-time allows you to quickly identify and respond to unexpected behaviors, minimizing the impact on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze Fallback Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evaluate the effectiveness of implemented fallback mechanisms. This includes static responses, cache usage, default values, or switching to a secondary service if the primary is unavailable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assess if the ‘retry logic’ is appropriately configured. This includes evaluating the retry intervals, backoff strategies, and the maximum number of attempts to prevent overwhelming a failing service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that fallback mechanisms enable your system to operate in a degraded mode rather than failing outright. This helps maintain a service level even when dependencies are experiencing issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
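
&lt;p&gt;A compact sketch of the retry logic described above, with exponential backoff, jitter, and a capped attempt count (the parameter defaults are illustrative):&lt;/p&gt;

```python
import random
import time

# Retry with exponential backoff plus full jitter, capped at a maximum
# attempt count so a failing dependency isn't hammered forever.
def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.random())  # full jitter
```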

&lt;h2&gt;2. Resource Manipulation Experiment&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To understand how a system behaves when subjected to unusual or extreme resource constraints, such as CPU, memory, disk I/O, and network bandwidth. The aim is to identify potential bottlenecks and ensure that the system can handle unexpected spikes in demand without significantly degrading service.&lt;/p&gt;

&lt;h3&gt;Possible Experiments&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU Saturation.&lt;/strong&gt; Increase CPU usage gradually to see how the system prioritizes tasks and whether essential services remain available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Consumption.&lt;/strong&gt; Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk I/O and Space Exhaustion.&lt;/strong&gt; Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How to Run a Resource Manipulation Experiment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Define Resource Limits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by monitoring your system under normal operating conditions to establish a baseline for CPU, memory, disk I/O, and network bandwidth usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on historical data and performance metrics, define the normal operating range for each critical resource. This will help you identify when the system is under stress, or resource usage is abnormally high during the experiment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
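
&lt;p&gt;As a simple illustration, a baseline range can be derived from monitoring samples, for example mean plus or minus three standard deviations (your observability platform likely offers more robust baselining than this):&lt;/p&gt;

```python
import statistics

# Simple illustration of turning monitoring samples into a "normal
# operating range": mean plus or minus three standard deviations per
# resource. Real baselining tools account for seasonality and trends.
def normal_range(samples):
    mean = statistics.fmean(samples)
    spread = 3 * statistics.pstdev(samples)
    return (mean - spread, mean + spread)

# e.g. cpu_low, cpu_high = normal_range(cpu_percent_samples)
```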

&lt;p&gt;&lt;strong&gt;Check and Verify the Break-Even Point&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Understand your system’s maximum capacity before it requires scaling. This involves testing the system under gradually increasing load to identify the point at which performance starts to degrade and additional resources are needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you’re using auto-scaling (either in the cloud or on-premises), clearly define and verify the rules for adding new instances or allocating resources. This includes setting CPU and memory usage thresholds, along with any other metrics that trigger scaling actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use load testing tools like &lt;a href="https://jmeter.apache.org/" rel="noopener noreferrer"&gt;JMeter&lt;/a&gt;, &lt;a href="https://gatling.io/" rel="noopener noreferrer"&gt;Gatling&lt;/a&gt;, or &lt;a href="https://locust.io/" rel="noopener noreferrer"&gt;Locust&lt;/a&gt; to simulate demand spikes and verify that your auto-scaling rules work as expected. This will ensure that your system can handle real-world traffic patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Select Manipulation Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While stress and stress-ng are powerful for generating CPU, memory, and I/O load on Linux systems, they might not be easy to use across distributed or containerized environments. Tools like &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; offer more user-friendly interfaces for various environments, including microservices and cloud-native applications.&lt;/p&gt;

&lt;p&gt;💡 Pro Tip → Ensure that the tool you select can accurately simulate the types of resource manipulation you’re interested in, whether it’s exhausting CPU cycles, filling up memory, saturating disk I/O, or hogging network bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apply Changes Gradually&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by applying small changes to resource consumption and monitor the system’s response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor system performance carefully to identify the thresholds at which performance degrades or fails. This will help you understand the system’s resilience and where improvements are needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor System Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use comprehensive monitoring solutions to track the impact of resource manipulation on system performance. Look for changes in response times, throughput, error rates, and system resource utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Pro Tip → Platforms like &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; can integrate with monitoring tools to provide a unified view of how resource constraints affect system health, making it easier to correlate actions with outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate Resilience&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analyze how effectively your system scales up resources in response to the induced stress. This includes evaluating the timeliness of scaling actions and whether the added resources alleviate the performance issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluate the efficiency of your resource allocation algorithms. This involves assessing whether resources are being utilized optimally and whether unnecessary wastage or contention exists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test the robustness of your failover and redundancy mechanisms under ‘conditions of resource scarcity’. This can include switching to standby systems, redistributing load among available resources, or degrading service gracefully.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;3. Network Disruption Experiment&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To simulate various network conditions that can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, the experiment seeks to understand how a system responds and adapts to network unreliability, ensuring critical applications can withstand and recover from real-world network issues.&lt;/p&gt;

&lt;h3&gt;Possible Experiments&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS Failures.&lt;/strong&gt; Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency Injection.&lt;/strong&gt; Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Packet Loss Simulation.&lt;/strong&gt; Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bandwidth Throttling.&lt;/strong&gt; Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Drops.&lt;/strong&gt; Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How to Run a Network Disruption Experiment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Identify Network Paths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by mapping out your network’s topology, including routers, switches, gateways, and the connections between different segments. Tools like &lt;a href="https://nmap.org/" rel="noopener noreferrer"&gt;Nmap&lt;/a&gt; or network diagram software can help visualize your network’s structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on identifying the critical paths data takes when traveling through your system. These include paths between microservices, external APIs, databases, and the Internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document these paths and prioritize them based on their importance to your system’s operation. This will help you decide where to start with your network disruption experiments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Disruption Type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decide on the type of network disruption to simulate. Options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;complete network outages,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;latency (delays in data transmission)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;packet loss (data packets being lost during transmission)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;bandwidth limitations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, choose disruptions based on their likelihood and potential impact on your system.&lt;br&gt;
For example, simulating latency and packet loss might be particularly relevant if your system is distributed across multiple geographic locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Network Chaos Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic Control (TC).&lt;/strong&gt; The ‘tc’ command in Linux is a powerful tool for controlling network traffic. It allows you to introduce delays, packet loss, and bandwidth restrictions on your network interfaces.&lt;/p&gt;
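
&lt;p&gt;For example, a typical netem invocation adds delay and packet loss on an interface. The hypothetical helper below only assembles the argument list; actually applying it requires root privileges and the iproute2 &lt;code&gt;tc&lt;/code&gt; binary:&lt;/p&gt;

```python
# Hypothetical helper that assembles tc/netem argument lists like those
# described above. It only builds the commands; running them for real
# requires root and the iproute2 tc binary (e.g. via subprocess.run).
def netem_command(dev, delay_ms=None, loss_pct=None):
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct is not None:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd

# netem_command("eth0", delay_ms=100, loss_pct=1) builds:
#   tc qdisc add dev eth0 root netem delay 100ms loss 1%
```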

&lt;p&gt;⚠️ Note → Simulating DNS failures can be complex but is crucial for understanding how your system reacts to DNS resolution issues. Consider using specialized tools or features for this purpose.&lt;/p&gt;

&lt;p&gt;Alternatively, chaos engineering platforms like Steadybit provide user-friendly interfaces for simulating network disruptions, along with safety features like built-in rollback strategies that minimize the risk of long-term impact on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Connectivity and Throughput&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During the experiment, use network monitoring tools and observability platforms to track connectivity and throughput metrics in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on monitoring packet loss rates, latency, bandwidth usage, and error rates to assess the impact of the network disruptions you’re simulating.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
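&lt;p&gt;For a quick, low-tooling check while the experiment runs, standard utilities can sample these metrics (the hostname below is hypothetical):&lt;/p&gt;

```shell
# Sample packet loss and round-trip latency against a dependency.
# api.internal.example.com is a placeholder host.
ping -c 50 api.internal.example.com | tail -2
# The summary lines report the packet loss percentage and min/avg/max RTT.

# Measure achievable throughput with iperf3 (the target must be
# running "iperf3 -s").
iperf3 -c api.internal.example.com -t 30
```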

&lt;p&gt;&lt;strong&gt;Assess Failover and Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evaluate how well your system’s failover mechanisms respond to network disruptions. For example, you could switch to a redundant network path, use a different DNS server, or take other predefined recovery actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Measure the time it takes for the system to detect and recover from the issue. This includes the time it takes to fail over and return to normal operations after the disruption ends.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Recommended → Analyze the overall resilience of your system to network instability. This assessment should include how well services degrade (if at all) and how quickly and effectively they recover once normal conditions are restored.&lt;/p&gt;

&lt;p&gt;If you want to read more, you can check out the &lt;a href="https://steadybit.com/blog/chaos-experiments/" rel="noopener noreferrer"&gt;rest of the post&lt;/a&gt; about principles of chaos engineering and popular tools here.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>performance</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Reducing Your Cloud Costs: An Operational Optimization Guide</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Mon, 17 Oct 2022 14:49:20 +0000</pubDate>
      <link>https://dev.to/blink-ops/reducing-your-cloud-costs-an-operational-optimization-guide-3eh6</link>
      <guid>https://dev.to/blink-ops/reducing-your-cloud-costs-an-operational-optimization-guide-3eh6</guid>
      <description>&lt;p&gt;Cloud costs are top of mind as business leaders and teams focus on honing their operational efficiency.&lt;/p&gt;

&lt;p&gt;In April at CIO.com’s Future of Cloud Summit, Dave McCarthy, research vice president of cloud infrastructure services at IDC, shared that cloud spending represents roughly &lt;a href="https://www.cio.com/article/403231/cios-contend-with-rising-cloud-costs.html" rel="noopener noreferrer"&gt;30% of current IT budgets&lt;/a&gt;. In the 2022 State of Cloud Report by Flexera, 750 surveyed executives shared that they estimate they are &lt;a href="https://www.forbes.com/sites/joemckendrick/2020/04/29/one-third-of-cloud-spending-wasted-but-still-accelerates/?sh=5a313399489e" rel="noopener noreferrer"&gt;wasting 30% of their cloud spend&lt;/a&gt;, while also saying that they expect costs to increase 47% over the next year. If you combine those stats, wasted cloud spend represents an efficiency opportunity of roughly 9% of total IT budgets (30% of the 30% spent on cloud).&lt;/p&gt;

&lt;p&gt;Achieving those cost savings isn’t as easy as flipping a switch. There is wasted spend embedded across multiple resource types, regions, and services. By function, the main categories of cloud spending are compute time, data storage, and data transfer.&lt;/p&gt;

&lt;p&gt;In this post, we’ll outline a framework for reviewing your cloud spending today, identifying wasted resources, and reviewing your long-term infrastructure efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewing Your Current Spending
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“What are we currently spending money on?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To start, you can review your current spend at the account level with the major cloud providers. AWS, Azure, and GCP all have reporting options that let you view and filter your spending over a period of time.&lt;/p&gt;

&lt;p&gt;In AWS, you can create &lt;a href="https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html" rel="noopener noreferrer"&gt;Cost and Usage Reports&lt;/a&gt;. In GCP, you can review your &lt;a href="https://cloud.google.com/billing/docs/how-to/reports" rel="noopener noreferrer"&gt;Cloud Billing Report&lt;/a&gt; and view spend by “Project” or other filters. In the Azure portal, you can download usage and charges from the “&lt;a href="https://learn.microsoft.com/en-us/azure/cost-management-billing/understand/download-azure-daily-usage" rel="noopener noreferrer"&gt;Cost Management + Billing&lt;/a&gt;” section.&lt;/p&gt;

&lt;p&gt;These views are useful for getting started and seeing transactional costs, such as data transfer charges. To get more granular detail on your cloud spending, you should leverage resource labels and tags to accurately categorize expenses.&lt;/p&gt;

&lt;p&gt;With labels and tags, you can associate resources with specific cost centers, projects, business units, or teams. You can then easily organize your resource data, create custom reports, and run specific queries.&lt;/p&gt;
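&lt;p&gt;As a sketch, once tags are in place you can group spend by tag straight from the CLI. The tag key and time period below are assumptions; substitute your own:&lt;/p&gt;

```shell
# Group a month of AWS spend by the cost-allocation tag "team".
# The tag key and dates are placeholders; adjust to your setup.
aws ce get-cost-and-usage \
  --time-period Start=2022-09-01,End=2022-10-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=team
```

&lt;p&gt;GCP and Azure offer similar label and tag filters in their billing reports.&lt;/p&gt;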

&lt;p&gt;If you do not currently have a mechanism or standard practice around resource tags and labels, you can refer to these how-to guides for setting up mandatory tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/enforcing-mandatory-tags-across-aws-resources" rel="noopener noreferrer"&gt;Enforcing Mandatory Tags Across Your AWS Resources&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/enforcing-labels-and-tags-across-your-gcp-resources" rel="noopener noreferrer"&gt;Enforcing Labels and Tags Across Your GCP Resources&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/enforcing-mandatory-tags-across-azure-resources" rel="noopener noreferrer"&gt;Enforcing Mandatory Tags Across Your Azure Resources&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use more than one cloud computing provider, you’ll need to aggregate invoices and usage reports across vendors. In this scenario, having consistent tagging methods across platforms is even more useful as it can offer a consistent way to view your resource usage and expenses.&lt;/p&gt;

&lt;p&gt;Once you have a clear sense of your current spending, you can look for opportunities to reduce your expenses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eliminating Unnecessary Resources
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“What resources are we spending money on and not using at all?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As projects are spun up and shut down, there are often resources that become unattached and left behind. While they are no longer in use, they are still costing your organization money on a recurring basis.&lt;/p&gt;

&lt;p&gt;Ideally, you have an automated way to regularly catch and delete these unattached resources. With a no-code platform like &lt;a href="https://app.blinkops.com/signup" rel="noopener noreferrer"&gt;Blink&lt;/a&gt;, teams can scale up scheduled automations to continuously detect and remove unnecessary resources.&lt;/p&gt;

&lt;p&gt;If you don’t have automations already in place, you can manually review resources in the console and remove unused ones in bulk. It can be time-consuming, but it may reduce your operating costs significantly in the short term.&lt;/p&gt;
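&lt;p&gt;For example, on AWS you can list unattached EBS volumes from the CLI before deciding what to delete (a read-only sketch; review the output carefully before removing anything):&lt;/p&gt;

```shell
# List EBS volumes not attached to any instance ("available" status
# means unattached). This only lists; deletion is a separate step.
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table
```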

&lt;p&gt;To know what types of resources to review, here are some common examples:&lt;/p&gt;

&lt;h4&gt;
  
  
  Unattached Disks
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-delete-unattached-aws-resources" rel="noopener noreferrer"&gt;How to Find and Delete Unattached AWS Volumes and Gateways&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/finding-and-deleting-unattached-disks-with-the-azure-cli" rel="noopener noreferrer"&gt;Finding and Deleting Unattached Disks with the Azure CLI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-delete-unattached-gcp-disks" rel="noopener noreferrer"&gt;How to Find and Delete Unattached GCP Disks&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Unattached IP Addresses
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/finding-and-removing-unattached-aws-elastic-ip-addresses" rel="noopener noreferrer"&gt;Finding and Removing Unattached AWS Elastic IP Addresses&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/how-to-detect-and-remove-unattached-azure-public-ip-addresses" rel="noopener noreferrer"&gt;How to Detect and Remove Unattached Azure Public IP Addresses&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/finding-and-removing-unattached-gcp-external-ip-addresses" rel="noopener noreferrer"&gt;Finding and Removing Unattached GCP External IP Addresses&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Old Snapshots
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-remove-old-ebs-snapshots" rel="noopener noreferrer"&gt;How to Find and Remove Old EBS Snapshots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-remove-old-azure-snapshots" rel="noopener noreferrer"&gt;How to Find and Remove Old Azure Snapshots&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/how-to-find-and-remove-old-gcp-disk-snapshots" rel="noopener noreferrer"&gt;How to Find and Remove Old GCP Disk Snapshots&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding and removing idle resources is a clear way to cut your operating costs, but it also is an important practice for maintaining a strong security posture. If you leave resources like unattached IP addresses, &lt;a href="https://www.blinkops.com/blog/how-to-find-and-delete-unattached-aws-resources" rel="noopener noreferrer"&gt;idle NAT Gateways&lt;/a&gt;, &lt;a href="https://www.blinkops.com/blog/tracking-down-amazon-load-balancers-with-no-target" rel="noopener noreferrer"&gt;load balancers with no target&lt;/a&gt;, or &lt;a href="https://www.blinkops.com/blog/getting-and-deleting-orphaned-secrets-with-kubectl" rel="noopener noreferrer"&gt;orphaned Secrets&lt;/a&gt; lying around, bad actors could find them and take advantage of the information. In this way, resource management is key to reducing costs and reducing risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing and Updating Resources
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“How can we optimize our existing resources?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that you’ve reviewed and removed unused resources, you can look at optimizing the resources you are using.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using the Right Family for the Job
&lt;/h4&gt;

&lt;p&gt;Whether you are creating new resources or evaluating existing ones, it’s important to consider which family of resources best fits your needs. If you’re using general-purpose machines, there might be another more cost-effective machine that is a better fit.&lt;/p&gt;

&lt;p&gt;Depending on your usage, you may need more capacity in some specifications than others. For example, if you’re using AWS, there are Compute Optimized instances under the C family (e.g. EC2 C7g instances) which offer optimal price performance for especially compute-intensive use cases, like batch processing workloads and scientific modeling. Other families include Memory Optimized (e.g. EC2 R6a instances) and Storage Optimized (e.g. EC2 Im4gn instances). There are lots of other families (e.g. IOPS-, network-, and accelerator-optimized) depending on the platform and the specification you want to optimize for.&lt;/p&gt;

&lt;p&gt;When considering your performance requirements, you might have use cases like batch jobs or workloads that are fault-tolerant. &lt;a href="https://azure.microsoft.com/en-us/products/virtual-machines/spot/#overview" rel="noopener noreferrer"&gt;Azure&lt;/a&gt;, &lt;a href="https://cloud.google.com/spot-vms" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; all have unused capacity that they offer as less expensive, less reliable Spot VMs. Compared to on-demand instances, they are up to 90% less expensive to run.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating to New Machines
&lt;/h4&gt;

&lt;p&gt;Within each of these families, newer versions are regularly introduced. The newer versions often run more efficiently or deliver higher performance, so it’s good practice to upgrade to them whenever you can.&lt;/p&gt;

&lt;p&gt;One example of this is with EBS volumes. By switching from &lt;a href="https://www.blinkops.com/blog/switching-gp2-volumes-to-gp3-volumes-to-lower-aws-ebs-costs" rel="noopener noreferrer"&gt;EBS GP2 volumes to EBS GP3 volumes&lt;/a&gt;, you can reduce your costs by 20%. There are some small performance tradeoffs, but it’s important to keep these types of upgrade opportunities in mind.&lt;/p&gt;

&lt;p&gt;Another AWS example is switching from older machines to ones that use the new AWS Graviton2 processors. Instances running on Graviton2 processors vs. Intel processors offer up to 40% better price performance, with specific efficiencies varying by family.&lt;/p&gt;

&lt;h4&gt;
  
  
  Looking for Low CPU Usage
&lt;/h4&gt;

&lt;p&gt;One way to optimize your spending is by rightsizing resources to match the usage level that you need. For example, you may be running an instance or virtual machine that has more compute capacity than you need.&lt;/p&gt;

&lt;p&gt;By reviewing your usage data, you can determine whether you are running at an average CPU usage of, say, 30% or less. By reducing the size or type of instance, you can trim your spend, and those savings add up over time.&lt;/p&gt;
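&lt;p&gt;On AWS, for instance, you can pull the CPU data for a candidate instance from CloudWatch (the instance ID and dates below are placeholders):&lt;/p&gt;

```shell
# Fetch two weeks of daily average CPU utilization for one instance.
# 86400 seconds = one datapoint per day; the instance ID is hypothetical.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2022-10-01T00:00:00Z \
  --end-time 2022-10-15T00:00:00Z \
  --period 86400 \
  --statistics Average
```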

&lt;p&gt;Here are some how-to guides that show examples for each platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/finding-and-resizing-amazon-ec2-instances-with-low-cpu-usage" rel="noopener noreferrer"&gt;Finding and Resizing Amazon EC2 Instances with Low CPU Usage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/finding-and-resizing-gcp-compute-instances-with-low-cpu-usage" rel="noopener noreferrer"&gt;Finding and Resizing GCP Compute Instances with Low CPU Usage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/finding-and-resizing-azure-virtual-machines-with-low-cpu-usage" rel="noopener noreferrer"&gt;Finding and Resizing Azure Virtual Machines with Low CPU Usage&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Using Long-Term Resourcing for Predictable CPU Usage
&lt;/h4&gt;

&lt;p&gt;Another way to optimize your costs is by leveraging reserved instances or committed use discounts. In exchange for predictable computing expectations, the major cloud providers offer resources at a discount with a committed term, such as 1 year or 3 years.&lt;/p&gt;

&lt;p&gt;Here are some how-to guides that show examples for each platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/lowering-costs-on-long-running-aws-ec2-instances" rel="noopener noreferrer"&gt;Lowering Costs on Long Running AWS EC2 Instances&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/lowering-costs-for-long-running-gcp-instances-with-committed-use-discounts" rel="noopener noreferrer"&gt;Lower Costs for Long Running GCP Instances with Committed Use Discounts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/optimizing-costs-for-long-running-azure-vms-with-reserved-instances" rel="noopener noreferrer"&gt;Optimizing Costs for Long Running Azure VMs with Reserved Instances&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Starting Nightly Non-Production Scale-Downs
&lt;/h4&gt;

&lt;p&gt;Are there any resources that you can shut down when they are not being used? For example, if your team works with a test environment only during certain hours, you don’t need to run it 24 hours a day. You can scale it down at night and scale it back up the next morning.&lt;/p&gt;

&lt;p&gt;With some automation, pausing and restarting a non-production cluster can be as simple as clicking an approval button in a Slack message, reducing your daily cloud costs with minimal effort.&lt;/p&gt;
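&lt;p&gt;To gauge whether the automation is worth building, a back-of-the-envelope estimate helps. The figures below are hypothetical assumptions (five nodes at the $0.096/hour on-demand rate of an m5.large):&lt;/p&gt;

```shell
# Estimate monthly savings from a 12-hour nightly scale-down.
# All inputs are assumptions; substitute your own node count and rate.
awk 'BEGIN {
  nodes = 5; hourly = 0.096; hours_off = 12; days = 30
  printf "Monthly savings: $%.2f\n", nodes * hourly * hours_off * days
}'
# prints: Monthly savings: $172.80
```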

&lt;p&gt;Here are a couple examples of how to pause and restart clusters nightly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/how-to-scale-down-aws-eks-clusters-nightly-to-lower-ec2-costs" rel="noopener noreferrer"&gt;How to Scale Down AWS EKS Clusters Nightly&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GCP: &lt;a href="https://www.blinkops.com/blog/how-to-pause-your-gke-cluster-nightly" rel="noopener noreferrer"&gt;How to Pause Your GKE Cluster Nightly&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Azure: &lt;a href="https://www.blinkops.com/blog/how-to-pause-your-aks-clusters-nightly" rel="noopener noreferrer"&gt;How to Pause Your AKS Cluster Nightly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storing and Moving Data Efficiently
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“Can we optimize how our data is stored and transferred?”&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Storing Only Relevant Data
&lt;/h4&gt;

&lt;p&gt;Your cloud bill is also impacted by how much data you are storing. While it’s useful to collect data to see how your services are running, that data usually becomes less relevant over time. Even if you want to retain as much data as possible, you should periodically move it to less costly, long-term storage tiers, such as Amazon’s &lt;a href="https://aws.amazon.com/archive/" rel="noopener noreferrer"&gt;S3 Glacier storage&lt;/a&gt;.&lt;/p&gt;
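&lt;p&gt;On AWS, this kind of policy can be automated with an S3 lifecycle rule. The bucket name and day counts below are assumptions to adapt:&lt;/p&gt;

```shell
# Transition objects to Glacier after 90 days, expire after 365 days.
# The bucket name and thresholds are placeholders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-log-archive-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-then-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'
```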

&lt;p&gt;Here are some how-to guides for AWS on how to identify data that hasn’t changed in a while and how to reduce logging storage costs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/detecting-aws-dynamodb-tables-with-stale-data" rel="noopener noreferrer"&gt;Detecting AWS DynamoDB Tables with Stale Data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/lowering-aws-cloudtrail-costs-by-removing-redundant-trails" rel="noopener noreferrer"&gt;Lowering AWS CloudTrail Costs by Removing Redundant Trails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS: &lt;a href="https://www.blinkops.com/blog/ensuring-aws-cloudwatch-log-groups-have-set-retention-periods" rel="noopener noreferrer"&gt;Ensuring AWS CloudWatch Log Groups Have Set Retention Periods&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Optimizing Data Transfers
&lt;/h4&gt;

&lt;p&gt;Data transfers may also account for a significant part of your cloud costs, and their price varies greatly depending on the source, destination, method of transport, and size of each transfer.&lt;/p&gt;

&lt;p&gt;You can also expect charges when transferring data across regions or across availability zones. Unless your business case requires it, avoid data transfers that cross these boundaries.&lt;/p&gt;

&lt;p&gt;While inbound (or ingress) data transfers between the internet and your cloud provider are not charged, outbound transfers are charged per service. You should reduce outbound data transfers from your cloud to external destinations as much as possible.&lt;/p&gt;

&lt;p&gt;If you are transferring data between AWS services, for example, you should use VPC endpoints. This way, when you access an S3 bucket from an EC2 instance, the traffic stays on the AWS network and you avoid data transfer charges.&lt;/p&gt;
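&lt;p&gt;As a sketch, creating a gateway endpoint for S3 looks like this (the VPC and route table IDs are placeholders, and the service name embeds your region):&lt;/p&gt;

```shell
# Create a gateway VPC endpoint so EC2-to-S3 traffic stays on the
# AWS network instead of traversing the internet.
# The service name is region-specific (us-east-1 here).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```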

&lt;p&gt;The same principle applies when transferring data from your cloud to on-premises locations: tools like AWS &lt;a href="https://aws.amazon.com/directconnect/" rel="noopener noreferrer"&gt;Direct Connect&lt;/a&gt;, GCP &lt;a href="https://cloud.google.com/network-connectivity/docs/direct-peering" rel="noopener noreferrer"&gt;Direct Peering&lt;/a&gt;, and Azure &lt;a href="https://azure.microsoft.com/en-us/products/expressroute/#overview" rel="noopener noreferrer"&gt;ExpressRoute&lt;/a&gt; may offer a lower cost per GB than transfers over the public internet. Actual savings depend on the amount of data you are moving; below a certain volume, these options might not make sense.&lt;/p&gt;

&lt;p&gt;You can read more about the types of data transfer charges in the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/plan-for-data-transfer.html" rel="noopener noreferrer"&gt;Cost Optimization&lt;/a&gt; pillar of the AWS Well-Architected Framework, or these &lt;a href="https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;, &lt;a href="https://cloud.google.com/vpc/network-pricing" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;, and &lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;Azure&lt;/a&gt; resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Achieving Operational Excellence with Blink Automations
&lt;/h2&gt;

&lt;p&gt;So far, we have covered several areas where you and your team can optimize your costs, but achieving significant savings over time requires new processes.&lt;/p&gt;

&lt;p&gt;Beyond finding unused resources, you need an automated process that alerts you to cost reduction opportunities and makes approving resource removal as easy as clicking a button. If you rely only on scripts, you may accidentally take down environments or resources that should have been left running.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.blinkops.com/" rel="noopener noreferrer"&gt;Blink&lt;/a&gt;, you can use no-code automations to achieve operational excellence. In the cost optimization context, Blink lets you create and run dozens of common resource checks and send reports to email or Slack channels with simple, actionable options.&lt;/p&gt;

&lt;p&gt;By running these Blink automations on a schedule, you’ll be able to confidently ensure that you are achieving operational excellence not just one time, but daily. You can take the same Blink automation approach for other operational excellence categories, like security operations, incident response, troubleshooting, and permissions management.&lt;/p&gt;

&lt;p&gt;Get started with a &lt;a href="https://app.blinkops.com/signup" rel="noopener noreferrer"&gt;free Blink account&lt;/a&gt; or reach out to us directly to &lt;a href="https://www.blinkops.com/contact" rel="noopener noreferrer"&gt;hear more&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Finding and Deleting Orphaned ConfigMaps</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Thu, 16 Jun 2022 18:00:03 +0000</pubDate>
      <link>https://dev.to/blink-ops/finding-and-deleting-orphaned-configmap-g4p</link>
      <guid>https://dev.to/blink-ops/finding-and-deleting-orphaned-configmap-g4p</guid>
      <description>&lt;p&gt;If you don’t take steps to maintain your Kubernetes cluster, you could end up wasting money and storage on orphaned resources. Orphaned (or unused) resources, like ConfigMaps, Secrets, and Services, should be regularly located and removed to clear up storage space and prevent performance issues. &lt;/p&gt;

&lt;p&gt;In this post, we’ll be focusing on how to find and remove orphaned ConfigMaps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/configuration/configmap/" rel="noopener noreferrer"&gt;ConfigMaps&lt;/a&gt; are API objects created to hold small amounts of non-confidential configuration data. They decouple configuration data from container images and application code, making applications more portable, but they cannot hold secret or encrypted data.&lt;/p&gt;

&lt;p&gt;ConfigMaps may become orphaned if they are left behind by the deployment they were created to support, or if their owners have been deleted. Once orphaned, these ConfigMaps waste cluster storage and increase the risk of cluster instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding and Deleting Orphaned ConfigMaps
&lt;/h2&gt;

&lt;p&gt;Here are some steps you can take to find and remove orphaned ConfigMaps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Find all ConfigMaps
&lt;/h3&gt;

&lt;p&gt;First off, you can generate a list of all ConfigMaps using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get configmaps --all-namespaces -o json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will return the list of ConfigMaps across all namespaces, but as you’ll see, the ConfigMap object does not reference its owner. You’ll need to run another command to identify which of the ConfigMaps have owners and are in use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Compare with a List of Used ConfigMaps
&lt;/h3&gt;

&lt;p&gt;To find orphaned ConfigMaps, you need the list of ConfigMaps that pods actually reference, whether through volumes, projected volumes, environment variables, or envFrom. The following collects those references and diffs them against the full list of ConfigMaps, leaving the unused ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Run this per namespace; ConfigMap names are only unique within a namespace.
volumesCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.volumes[*].configMap.name}' | xargs -n1)
volumesProjectedCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.volumes[*].projected.sources[*].configMap.name}' | xargs -n1)
envCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.containers[*].env[*].valueFrom.configMapKeyRef.name}' | xargs -n1)
envFromCM=$(kubectl get pods -o \
  jsonpath='{.items[*].spec.containers[*].envFrom[*].configMapRef.name}' | xargs -n1)

diff \
&amp;lt;(printf '%s\n' "$volumesCM" "$volumesProjectedCM" "$envCM" "$envFromCM" | grep . | sort | uniq) \
&amp;lt;(kubectl get configmaps -o jsonpath='{.items[*].metadata.name}' | xargs -n1 | sort | uniq)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the diff output, names that appear only in the second list (prefixed with &amp;gt;) exist in the cluster but are not referenced by any pod. These are your orphaned ConfigMaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Delete Orphaned ConfigMaps
&lt;/h3&gt;

&lt;p&gt;Now that you have a list of orphaned ConfigMaps, you can run this command for each one (substituting its name for samplemap) to delete it and free up storage in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete configmap/samplemap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;configmap "samplemap" deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you’ve deleted all the orphaned ConfigMaps you found, you’ll have removed unneeded resources from your cluster and freed up storage space. If you remove orphaned resources regularly, you’ll help your team maintain optimal Kubernetes resource management.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Let me know if this worked for you.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
