A list of my resiliency-related blog posts.
Resilient systems embrace the idea that failures are typical, and that it’s entirely OK to run applications in what we call a partially failing mode. While not suitable for life-critical applications, running in a partially failing mode is a viable option for most web applications. Of course, I’m not saying it doesn’t matter if your system fails. It does, and it might result in lost revenue. But it’s probably not life-critical.
Building resilient architectures has had its ups and downs, some 1 a.m. wake-up calls, some Christmases spent debugging, some “I’m done, I quit” … but most of all, it’s been an incredible learning experience and journey.
This blog post is a collection of tips and tricks that have served me well throughout this journey, and I hope they will serve you well too.
In part 1 of this series, I focus on the infrastructure layer, redundancy, immutability, and the concept of infrastructure as code.
In part 2, I focus on cascading failure prevention. Cascading failures happen when one part of a system experiences a local failure and takes down the entire system through inter-connections and failure propagation.
In part 3, I discuss the importance and the challenge of health checks — striking a balance between failure detection and reaction.
In part 4, I talk about caching. While caching is often associated with accelerating content delivery, it is also essential from a resiliency standpoint.
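To make the resiliency side of caching concrete, here is a minimal sketch of one common pattern: serving a stale cached value when the backend call fails, so the application keeps running in a partially failing mode. It is an illustration only, not code taken from the series. The names `StaleOnErrorCache`, `ttl_seconds`, `clock`, and `fetch` are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve fresh values when possible and fall back to stale ones on failure.

    Hypothetical helper for illustration; all names here are assumptions.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (time the value was stored, value)
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # fresh hit: the backend is not called
        try:
            value = fetch()
        except Exception:
            if entry is not None:
                return entry[1]  # backend failed: serve the stale copy
            raise  # nothing cached, so the failure has to surface
        self._store[key] = (now, value)
        return value
```

A caller would wrap each backend lookup, for example `cache.get("profile:42", load_profile)`, where `load_profile` is whatever zero-argument function normally hits the database or remote service. The trade-off is that users may briefly see outdated data, which is usually acceptable for the non-life-critical web applications discussed above.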