Building Anti-Fragile Systems For Modern-Day DevOps

#devops #antifragility #cloud #cloudcomputing

INTRODUCTION
What is antifragility?

Antifragility is a concept introduced by Nassim Nicholas Taleb in his book “Antifragile: Things That Gain from Disorder,” published in 2012. The term refers to a property of systems or entities that thrive and benefit from volatility, uncertainty, stress, and disorder.

It is common to say that the opposite of fragile is robust or resilient. I respectfully disagree: the true opposite is antifragile.

“Resilience refers to the ability of a system or entity to withstand shocks, recover from adversity, and return to its original state or function. Resilience suggests the capacity to absorb and adapt to challenges without significant damage.”

Antifragility goes beyond resilience. An antifragile system not only withstands stressors but actually benefits and improves as a result of exposure to adversity. It thrives in dynamic and uncertain environments, becoming stronger through challenges. The human muscular system is one example of an antifragile mechanism in nature: our muscles experience stress when we work out, which causes them to grow and strengthen. A comparable idea in psychology is post-traumatic growth.

Consider sending a piece of glass through a courier service. Packages marked “fragile” are those that cannot tolerate stress and break easily when exposed to it. The exact opposite would be an item that is designed to be mishandled and anticipates stress.

Let’s bring that idea into the day-to-day design, construction, and management of scalable systems by building systems that expect variability and anticipate its outcomes.

HOW TO MEASURE FRAGILITY
Fragility refers to the quality or state of being fragile, which means easily broken, delicate, or vulnerable. It can be used to describe physical objects that are prone to breaking or damage.

Let’s revisit the glass example, this time with a mirror. A mirror has only two states: whole and broken. There is no middle ground when measuring the risk associated with that mirror; it either breaks or it doesn’t. Because we know in advance what a fall will do, we effectively know the second-order derivative, that is, how sharply the harm accelerates once the stress passes a certain point, and this information helps us prevent the mirror from falling.

The same idea applies to systems. Only when we are aware of everything that could go wrong in a system do we stand a good chance of preventing failure in it.
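
One way to make the "second-order" idea concrete is a finite-difference check: measure the harm at a stress level x, then at x + delta and x - delta, and compare. If the two extremes together are worse than twice the harm at x, harm is accelerating with stress, which is how Taleb characterises a fragile response. The sketch below is only an illustration; the two harm curves and the numbers are hypothetical, not measurements from a real system.

```python
from typing import Callable


def second_order_effect(harm: Callable[[float], float], x: float, delta: float) -> float:
    """Central finite difference: harm(x + d) + harm(x - d) - 2 * harm(x).

    Positive  -> harm accelerates as stress grows (a fragile response).
    Near zero -> harm grows linearly (no second-order effect).
    Negative  -> harm decelerates as stress grows.
    """
    return harm(x + delta) + harm(x - delta) - 2 * harm(x)


# Hypothetical harm curves, e.g. errors per minute as a function of load.
def linear_harm(load: float) -> float:
    return 2.0 * load


def accelerating_harm(load: float) -> float:
    return 0.01 * load ** 2


if __name__ == "__main__":
    for name, fn in [("linear", linear_harm), ("accelerating", accelerating_harm)]:
        effect = second_order_effect(fn, x=100.0, delta=20.0)
        print(f"{name}: second-order effect = {effect:.2f}")
```

In practice the harm function would come from load tests or production metrics rather than a formula, but the comparison stays the same.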

FACTORS THAT INFLUENCE ANTIFRAGILITY
Build, test, and fail fast: AWS and other public cloud providers have made building and developing systems easier because we can spin up multiple environments quickly. This also lets us build and test quickly. Imagine spinning up a high-end server for a 30-minute test today compared with having to rent that same server 20 years ago. The ease cannot be overemphasized. The ability to test fast also means failing fast, and when we fail fast, we learn fast. By learning, we are able to measure the fragility of that environment.

Size: The business’s size is a critical factor. The stressors on a monolithic program with ten infrequent users are not the same as those on a mobile app with over 500k concurrent users. Determine the environment’s risk level and apply antifragility accordingly.

Complexity: Complexity can manifest itself in various ways; it might originate from the way certain functionality is handled in the code or from the architecture of the entire infrastructure. I’ll use two AWS environments, ENV A and ENV B, as an example.

ENV A comprises a single server responsible for hosting file systems, databases, and the web server. In the event of a server failure, a recent backup can be deployed to replace the malfunctioning server. Additionally, when faced with a surge in traffic, auto-scaling mechanisms come into play to keep the service operational. In this scenario, the principles of disaster recovery contribute to antifragility.

ENV B, by contrast, is a complex, loosely coupled microservice environment that relies on Lambda functions, EventBridge, SNS, SQS, Step Functions, and databases. Since the playing field is now bigger, so is the risk: multiple parts can fail, which creates the need for observability. Observability in turn provides insights that help predict failures from patterns. We can then configure alerts and automated actions that, depending on the kind of event, either reverse a change or apply one.

Because both environments made it possible to measure fragility and anticipate the second-order derivative, antifragility could be introduced. This is also what makes it possible to build systems that anticipate variability.
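
The idea of automated actions that either reverse a change or apply one can be sketched as a small decision function. This is a minimal illustration, assuming a hypothetical `Alert` shape and hypothetical rules (roll back when an error-rate breach follows a deployment, scale out when a queue backs up); a real setup would wire something like this to alarms and deployment tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"            # threshold not breached, nothing to do
    ROLLBACK = "rollback"    # reverse a recent change
    SCALE_OUT = "scale_out"  # apply a change: add capacity
    NOTIFY = "notify"        # no safe automated response, page a human


@dataclass(frozen=True)
class Alert:
    source: str                 # hypothetical component name, e.g. "orders-queue"
    kind: str                   # "error_rate", "queue_depth", ...
    value: float                # observed metric value
    threshold: float            # value above which the alert fires
    after_deploy: bool = False  # did the breach follow a recent deployment?


def choose_action(alert: Alert) -> Action:
    """Map an alert to an automated response (hypothetical rules)."""
    if alert.value <= alert.threshold:
        return Action.NONE
    if alert.kind == "error_rate" and alert.after_deploy:
        return Action.ROLLBACK
    if alert.kind == "queue_depth":
        return Action.SCALE_OUT
    return Action.NOTIFY


if __name__ == "__main__":
    print(choose_action(Alert("api", "error_rate", 0.20, 0.05, after_deploy=True)))
    print(choose_action(Alert("orders-queue", "queue_depth", 5000, 1000)))
```

Keeping the decision separate from the code that executes it makes the rules easy to test before they are allowed to touch production.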

AREAS WHERE ANTIFRAGILITY CAN BE IMPLEMENTED
Security: Ensuring the security of our infrastructure is crucial, requiring a comprehensive approach from entry to exit. While traditional firewalls primarily filtered traffic against static rules, next-generation firewalls (NGFWs) bring enhanced features. These include an intrusion prevention system (IPS), application awareness and control, and cloud-delivered threat intelligence, among others. NGFWs employ automation rules to detect anomalies and promptly respond by adjusting rules to mitigate potential threats. Examples include Fortinet FortiGate, an NGFW, and AWS WAF, a web application firewall that applies comparable rule-based filtering to web traffic.
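
To illustrate "detect an anomaly, then adjust the rules", here is a minimal sketch of a rate-based blocker: it counts requests per source over a sliding time window and adds the source to a block list once a limit is exceeded. The class name, limit, and window are hypothetical, and this is not how any particular firewall product implements the feature.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class RateBasedBlocker:
    """Blocks a source IP once it exceeds `limit` requests within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)
        self.blocked: Set[str] = set()  # the "rule" that gets adjusted at runtime

    def allow(self, source_ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds) and decide whether to allow it."""
        if source_ip in self.blocked:
            return False
        hits = self._hits[source_ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) > self.limit:
            # Anomaly detected: adjust the rule set by blocking this source.
            # This sketch never unblocks; a real system would expire entries.
            self.blocked.add(source_ip)
            return False
        return True


if __name__ == "__main__":
    blocker = RateBasedBlocker(limit=3, window_seconds=10)
    for t in range(5):
        print(t, blocker.allow("203.0.113.7", now=float(t)))
```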

Compute: Public cloud providers, such as AWS and Azure, have taken significant steps to enhance system robustness. One notable feature at the compute level is auto scaling, which dynamically adjusts resources in response to sudden increases in traffic. Additionally, on AWS, Elastic Load Balancing (ELB) is a key service that spans multiple Availability Zones (AZs) and conducts regular health checks. This ensures that only healthy servers receive traffic; if an instance fails its checks, the load balancer automatically redirects traffic to the remaining healthy instances in the pool. This approach helps keep the environment available even when individual instances fail.
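
The health-check behaviour described above can be sketched in a few lines: keep a pool of targets, periodically re-evaluate which ones are healthy, and hand out only healthy targets in round-robin order. This is a simplified model, not the ELB implementation; the class, the target names, and the `is_healthy` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


class HealthCheckedPool:
    """Round-robin routing over the targets that passed the last health check."""

    def __init__(self, targets: List[str], is_healthy: Callable[[str], bool]) -> None:
        self._targets = list(targets)
        self._is_healthy = is_healthy
        self._healthy: List[str] = list(targets)  # assume healthy until checked
        self._next = 0

    def run_health_checks(self) -> None:
        """Re-evaluate every target; a real balancer would do this on a timer."""
        self._healthy = [t for t in self._targets if self._is_healthy(t)]

    def route(self) -> Optional[str]:
        """Return the next healthy target, or None if the whole pool is down."""
        if not self._healthy:
            return None
        target = self._healthy[self._next % len(self._healthy)]
        self._next += 1
        return target


if __name__ == "__main__":
    # Hypothetical instance names and a hand-maintained status table.
    status: Dict[str, bool] = {"web-1": True, "web-2": True, "web-3": True}
    pool = HealthCheckedPool(list(status), lambda t: status[t])
    status["web-2"] = False   # simulate an instance failure
    pool.run_health_checks()
    print([pool.route() for _ in range(4)])
```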

Networking: Networking is crucial for building robust systems, and antifragility at the network level depends on redundant interconnectivity. Spanning two networks can enhance this antifragility, with services like Amazon Route 53 enabling availability at a global scale. Route 53, a scalable Domain Name System (DNS) web service, routes end-user requests to globally distributed endpoints, contributing to application availability and reliability.

Monitoring and Observability: To gain insight into the patterns used for measuring fragility, we need systems that support monitoring. Tools like CloudWatch, Prometheus, and Grafana are used to set up alerts and notifications when anomalies are detected. Observability tools, such as AWS X-Ray, trace requests through existing systems. The insights gathered from observation are then used to predict and anticipate anomalies, making fragility more predictable.
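
As a rough sketch of what "detecting an anomaly" can mean, the example below flags a metric value that sits more than a few standard deviations away from a rolling baseline. It is a deliberately simple stand-in, not how CloudWatch, Prometheus, or Grafana evaluate alerts; the class name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque


class RollingAnomalyDetector:
    """Flags values far from the mean of a rolling window of recent normal values."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0, min_history: int = 5) -> None:
        self._values: Deque[float] = deque(maxlen=window)
        self._z = z_threshold
        self._min_history = min_history

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the recent baseline."""
        anomalous = False
        if len(self._values) >= self._min_history:
            mu = mean(self._values)
            sigma = pstdev(self._values)
            if sigma > 0 and abs(value - mu) / sigma > self._z:
                anomalous = True
        # Only learn from normal points so a spike does not distort the baseline.
        if not anomalous:
            self._values.append(value)
        return anomalous


if __name__ == "__main__":
    detector = RollingAnomalyDetector(window=10)
    # Hypothetical latency samples in milliseconds, with one spike.
    for latency_ms in [100, 101, 99, 100, 102, 98, 100, 101, 250, 100]:
        if detector.observe(latency_ms):
            print(f"anomaly: {latency_ms} ms")
```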

I hope the breakdown above has shown that building antifragile systems, ones that can thrive in disorder, is a sound approach to development.

QUICK SUMMARY

Building anti-fragile systems is possible.
Fragility should always be measured.
The next part will focus on the day-to-day DevOps practices, including developing CI-CD pipelines, testing and integration, automated deployment, and monitoring.

I hope you enjoyed this read. If you did, kindly connect with me on LinkedIn.

HAPPY CLOUD COMPUTING!!
