DEV Community

Cover image for Why Apps Crash and How Resilience Testing Can Help
keploy
keploy

Posted on • Originally published at keploy.io

Why Apps Crash and How Resilience Testing Can Help

Most of the times teams spend their time testing that features work correctly under normal conditions. But do you know what happens when the database goes down,the network goes slow or a third party service stops responding?

In this case, resilience testing solves this problem by intentionally breaking parts of your system to see how it responds to it. Instead of only testing perfect scenarios, you find real-world failures like server crashes, network issues and service timeouts. the goal isn’t to prevent all problems but to make sure your application can handle them gracefully when they inevitably happen.

What is Resilience Testing?

What is Resilience Testing

In simple terms, basically it's the process of breaking things in our system on our own to understand how it behaves under different conditions and different scenarios. We're talking network failures, server crashes, database timeouts, memory leaks – you name it and the end goal? Make sure your app can either handle these situations gracefully or recover quickly when they happen.

We will simulate failures, outages, slowdowns, and all sorts of chaos to see if our applications can either:

  • Keep working despite the problems or

  • Fail gracefully and recover quickly

Hold Up - Isn't This Just Reliability Testing?

Hold Up - Isn't This Just Reliability Testing

I think you also have got this question, right? I also got this confusion a lot, and honestly, I used to mix them up too.

Here is the deal: reliability testing and resilience testing are like cousins, but they are not the same.

Reliability testing is about the consistency under perfect conditions. It is asking: If I run this same test 100 times under good conditions, will I get the same result every time? We can think of it like testing if your car starts every morning when it is sitting in your garage.

Resilience testing, on the other hand, is all about how your system behaves when things go sideways. It is asking like, what happens when I try to start my car in a blizzard, with a dead battery or while being chased by police?(Maybe you have stolen something at that time).

You know, both are important part of testing, but they are testing completely different things. Reliability is all about predictability and resilience is about adaptability.

Why is Resilience Testing Vital? (Spoiler: It Really Is)

See, I have learned this the hard way. A few days back, I was working on the e-commerce platforms, and we thought we had everything covered. Regular testing, good code reviews and everything was working. Then Black Friday hit, traffic spiked, and out payment gateway started timing out. Customers were unable to complete purchases, resulting in thousands of dollars in lost sales.

That is when I realised, functional testing only tells you if something works under perfect conditions. Resilience testing tells you if it works when everything is on fire.

Here is why it matters:

  • Real-world chaos happens: Sometimes the network fails, the server crashes, and third-party APIs go down. Then your app needs to handle this gracefully.

  • User experience is everything: Users do not care about your technical difficulties. They just want things to work in a smooth way.

  • Cost of downtime: Every minute your app is down or behaving weirdly costs money and reputation.

  • Regulatory compliance: In some industries, resilience is not optional; it is required.

Key Objectives of Resilience Testing

As we know, resilience testing is all about making sure your system will be able to sustain unforeseen challenges and recover smoothly. This is what it is attempting to do.

  1. Maintain the system under stress - It basically tests how efficiently your application behaves when things are not right, such as high traffic, sluggish services or partial failures.

  2. Identify the soft spots - By stressing the systems to their limits, you can identify the parts that are most likely to fail at random.

  3. Test recovery procedures - Resilience testing ensures that your system will recover elegantly from crashes, outages, or any other unforeseen interruptions

  4. Make sure it can handle faults - It verifies whether your system can continue running even when some parts of it fails, without affecting the overall user experience.

  5. Strengthen the system design - You can make your architecture more robust and better prepared by learning the test outcomes.

  6. Maintain a good user experience - So, even if something breaks in the background, the user should still get meaningful responses or fallback options.

How Does Resilience Testing Work?

The concept is straightforward. We simulate real world failures and observe how the system responds. But the execution? That is where it gets interesting.

We use various techniques like chaos engineering(thanks to Netflix for popularising this!), fault injection, load testing with failures, and network partitioning. Basically, idea is to create controlled chaos and see what breaks at that time.

For example, I might:

  • Suddenly kill a database connection mid-transaction to test it.

  • Introduce random delays in API responses.

  • Fill up disk space to simulate storage issues.

  • Simulate network partitions between services.

Advantages of Resilience Testing

  • Better system reliability

    It helps make sure your system stays up and running, even when something goes wrong behind the scenes and something broke.

  • Smoother user experience

    Users shouldn't even notice if a service hiccups, then resilience testing helps make that possible.

  • Catches hidden issues

    Some problems don’t show up in regular testing. Resilience testing digs deeper and often uncovers issues you didn’t even know existed.

  • Validates your recovery plans

    This is the thing you should always do. It’s one thing to have a recovery strategy on paper. Resilience testing checks if it works.

  • Helps meet SLAs

    If you're promising high availability or uptime, this kind of testing helps ensure you’re living up to it.

Disadvantages

As we know, there always exist some challenges. Here are a some of them:

  • Setup can get complicated

    As we know, simulating real-world failures isn’t easy and often needs custom tools or configurations.

  • Risk of breaking things (if done in prod)

    If you're testing in a live environment and are not careful, you might cause real disruption.

  • You need to know your system well

    Mostly it requires a solid understanding of how all the pieces of your infrastructure fit together.

  • Time-intensive

    Planning, running the tests, and analysing the results take serious time and effort

Steps in Conducting Resilience Testing

Steps in Conducting Resilience Testing

Let me walk you through the typical process:

  1. Get Familiar with Your System - Take some time to get familiar with how your system operates, what services communicate with one another, where the key components are, and what relies on what.

  2. Consider What May Go Wrong - Write down potential failure scenarios: What if a server crashes? What if the database fails? What if a third-party API stalls? Realistic failure scenarios will make your testing more useful.

  3. Define What You Want to Achieve - Basically, you have to be specific about what you want to accomplish. Are you checking how quickly the system recovers? Is data safe or not? Or how your users are impacted when there's a failure? Knowing this keeps you focused and updated.

  4. Set Up a Safe Testing Environment - One thing you want to know always, don't run tests directly in production (unless you know what you're doing). Create an environment very close to your actual system where you can safely replicate failures.

  5. Define Test Cases for Every Scenario - You should always document your specific test cases. Let say, "What should occur if Service A goes down for 30 seconds?" These kind of thing will improve your testing process and enable you to quantify outcomes.

  6. Begin Breaking Things (On Purpose!) - Now this is exciting you have to inject failures by design. This might break cutting network connections, halting a microservice, or adding delays. Chaos engineering tools such as Chaos Monkey or litmuschaos can assist with this.

  7. Observe What Happens - Some time the system might be "under attack," the first thing you should do is observe it firmly. You can check logs, dashboards, alerts, and user behaviour. Did the things break quietly? Were there any alerts fired? Did the system recover itself. You have observe these things also.

  8. Inspect and Learn from the Outcomes - Once the testing is complete, walk through what worked and what did not. Were the users impacted? Did you lose any data? How much time did it take to recover? This is where you locate the actual insights.

  9. Repair the Problems You Discovered - According to what you learned, fix things, perhaps you require more robust error handling, quicker failover, or better monitoring.

  10. Continuously Test Regularly - Resilience testing is not a once-off. Make it routine and best if you can automate it. So you're always prepared for the unexpected.

Types of Resilience Testing

Resilience testing comes in many flavours depending on what kind of failure you want to simulate and how you expect the system to react.

1. Failure Injection Testing

What it is:

In this type of testing, we intentionally introduce failures into the system (like crashing a service or killing a process) to see how the system behaves under stress on our own.

Key Features:

  • Simulates service failures in real-world scenarios.

  • Assists in validating fallback and retry strategies.

  • Typically introduce failure with tools such as Chaos Monkey or LitmusChaos.

  • Can test at the application and infrastructure level smoothly.

Use case example:

One of the best use cases is killing a Kubernetes microservice pod to verify if traffic is routed properly to healthy instances or not.

2. Load Induced Failure Testing

What it is:

In this type of testing, the end goal is basically to stress the system with a heavy load (CPU, memory, network, or user traffic) unless something breaks in the infrastructure and then observe how the system handles that situation.

Key Features:

  • Identifies system failures and weak scaling points.

  • Validated auto scaling and resources limits.

  • Simulates sudden traffic spikes or batch jobs.

  • Good for backend heavy systems like APIs and databases

Use case example:

Flooding an API with thousands of simultaneous requests to see how it performs under sudden peak loads, this will give a great example for this type of testing.

3. Recovery Testing

What it is:

This is about the speed and efficiency at which a system recovers from a crash or unexpected shutdown. It's about the elegant return.

Key Features:

  • Verifies the speed and correctness of recovery procedures

  • Verifies if services are restarted properly

  • Ensures data integrity after recovery

  • Usually coupled with disaster recovery scenarios

Use case example:

Imitating a server crash, and then rebooting to check whether all services continue as usual and no data is lost.

4. Chaos Testing (Chaos Engineering)

What it is:

Chaos testing advances resilience testing by injecting faults into the system at random in a controlled environment, usually in production (with caution!).

Key Features:

  • Tries out the system's resilience against surprising, random failure

  • Promotes creating strong, fault-tolerant architectures

  • Typically automated and ongoing

  • Pays closer attention to system behavior than individual failure responses

Use case example:

Random network link disconnections or injecting latency between services to observe how the system compensates

Best Practices I've Learned

  • Start Small: Do not try to break everything at once. Always begin with non critical components.

  • Monitor everything: You can not improve what you don’t measure. Good observability is important.

  • Automate where possible: Manual testing is time-consuming and inconsistent also.

  • Test in production-like environments: Staging should mirror production as closely as possible

  • Plan for failure: Always have rollback procedures ready before you start testing.

  • Involve the whole team: Resilience testing is not just a QA concern; developers, ops, and business folks should all be involved.

How Keploy Can Help with Resilience Testing

How Keploy Can Help with Resilience Testing

Now, here is where things get interesting. Keploy is API testing platform that can significantly simplify your resilience testing efforts.

What I love about keploy is that I automatically generates test cases from your API interactions. But here is the kicker - you can use these generates tests as a baseline for resilience testing.

  • Baseline establishment: Keploy captures a normal API behaviour, giving you a reference point for what “good” looks like for you.

  • Automated test generation: Instead of manually writing hundred of test cases on your own, Keploy generates them from real traffic patterns using the AI.

  • Regression testing: After implementing resilience testing, we can easly verify that normal functionality still works or not.

  • Integration with failure scenarios: The good thing is that, we can combine Keploy’s automated tests with fault injection to see how APUs behave under stress, or if something broke or not.

The best part is that we can integrate keploy with existing CI/CD pipelines, making it easy to incorporate resilience testing into your regular development workflow

Related Article

  • Reliability Testing – A Complete Guide*:* In this blog, you will walk through Reliability Testing, exploring its primary objectives, testing methods, various types, and industry-standard tools that make it effective

  • Latency Testing Guide For Faster Apps: In this blog, you are going to explore about latency testing and how it makes a bigger impact and why you should care about it.

Conclusion

Look, resilience testing is not just another checkbox in your testing process, It is essential for building systems that can handle the real world. And trust me, the real world is messy.

I have seem too many projects fail not because the code was bad, but nobody thought about what will happen when things go wrong. So, do not try to be that team

You can start small but be consistent and gradually build up your resilience testing practices. Remember the goal isn’t to prevent all type of failures, that is impossible. The goal is to fail gracefully, recover quickly from it and then learn from each incident.

Frequently Asked Questions

Q. How often should I run resilience tests?

It depends on your system's criticality and change frequency. For mission-critical systems, I recommend weekly automated tests with monthly comprehensive scenarios. For less critical systems, monthly testing might suffice..

Q. What's the difference between resilience testing and disaster recovery testing?

As we know, Resilience testing focuses on how systems handle and recover from various failures during normal operation. On the other hand, disaster recovery testing specifically validates your ability to restore services after major incidents like data centre outages.

Q. Can Keploy help with resilience testing?

Keploy isn’t a resilience testing tool directly, but it helps by generating real traffic-based test cases. You can use these tests alongside failure injection tools to check how your system behaves under stress

Q. Can resilience testing replace traditional testing?

Not. Resilience testing complements functional, performance, and security testing. You need all of them for a comprehensive testing strategy.

Q. What if resilience tests break something important?

This is why we test! Better to find issues in testing than in production. Always have rollback procedures ready and test in environments that mirror production as closely as possible.

Top comments (0)