<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikita Vetoshkin</title>
    <description>The latest articles on DEV Community by Nikita Vetoshkin (@nekto0n).</description>
    <link>https://dev.to/nekto0n</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1186686%2Ffb7a20a1-93f9-4f7f-8b27-507d785c5742.png</url>
      <title>DEV Community: Nikita Vetoshkin</title>
      <link>https://dev.to/nekto0n</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nekto0n"/>
    <language>en</language>
    <item>
      <title>Software automation done right</title>
      <dc:creator>Nikita Vetoshkin</dc:creator>
      <pubDate>Mon, 23 Oct 2023 23:01:09 +0000</pubDate>
      <link>https://dev.to/nekto0n/software-automation-done-right-528l</link>
      <guid>https://dev.to/nekto0n/software-automation-done-right-528l</guid>
<description>&lt;p&gt;Software is eating the world. For better or worse. Among the many tasks we try to make computers perform, automation covers the most tedious, repetitive, and therefore error-prone kind of human work. A single mistake in an everyday routine operation can take an Internet-scale business down for minutes or &lt;a href="https://en.wikipedia.org/wiki/2021_Facebook_outage"&gt;hours&lt;/a&gt;, take more than 24 hours to &lt;a href="https://github.blog/2018-10-30-oct21-post-incident-analysis/"&gt;fully recover&lt;/a&gt; from, and incur millions of dollars in losses. Automating away most operations is thus a legitimate goal that saves operating expenses and improves business robustness.&lt;/p&gt;

&lt;p&gt;On the other hand, automation is not free. It is an expensive and lengthy process and should be properly evaluated from business and engineering perspectives. Overengineering and building low-business-value Rube Goldberg machines can often be as bad as under-engineering and leaving crucial parts manual. Here I would like to describe a reasoning approach to such software engineering tasks. This approach is not a law of nature, but a useful abstraction that I came up with to guide decisions for me, my team, and the business I work for. Let’s start with a task that sounds like a good junior DevOps assignment:&lt;br&gt;
    &lt;em&gt;Convert from RAW to MP3 and back up all call record files on a call center server&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Do it somehow
&lt;/h2&gt;

&lt;p&gt;How do we start things? Manually - employing the power of our brain, its experience, and modern Internet search. Our brain is good at decomposing tasks and addressing them one by one. We also constantly run a cost function, checking the effort and time spent against current progress and potential profit. After some time and trial and error, we arrive at one of the possible results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first one&lt;/strong&gt;: the task is impossible; there’s no currently known workaround for a fundamental limitation. This is a good (if disappointing) result. If we were planning to bet our business on it, we now have data against doing so and can plan and act accordingly.&lt;br&gt;
&lt;strong&gt;Another similar conclusion&lt;/strong&gt;: the task is possible, but prohibitively slow and/or expensive. Again, armed with this data we can make better business decisions.&lt;br&gt;
&lt;strong&gt;Best case&lt;/strong&gt;: the task is possible after a series of manual steps, and here’s the result. That is a great achievement. At this point it is worth asking yourself a question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Do we ever need to repeat that?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Or, even better, what probability can we assign to a positive answer? If the answer is “no” or “&lt;strong&gt;probably no&lt;/strong&gt;”, then we’re done. No need to spend time and resources on this, no need to move to the next step of automation. Software engineering projects usually have plenty of other things to work on.&lt;br&gt;
If we’d like to persist and make this task easily repeatable, then it is time to move to the next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write it down
&lt;/h2&gt;

&lt;p&gt;According to the “Software Engineering at Google” book, software development is a team effort integrated over time, and one of its many aspects is preserving and sharing knowledge among org members. Right now there is only one person who knows how to solve the task (or even whether it is solvable at all) and can repeat it: you. The &lt;a href="https://en.wikipedia.org/wiki/Bus_factor"&gt;bus factor&lt;/a&gt; is 1, and that is not the state we want our team to be in.&lt;/p&gt;

&lt;p&gt;So let us put a HOWTO.txt in our project’s repo or write the exact steps on a wiki page called “136 easy steps to …”. Yes, as simple as that. What’s the profit?&lt;/p&gt;

&lt;p&gt;We &lt;strong&gt;serialized our experience&lt;/strong&gt; (probably double-checking it in the process) and put it on persistent storage, which is more reliable than a human brain in the long run. In a month, in a year, or in five we can read and repeat. In software we serialize data when we need to pass it around and share it. The same applies directly here, because we’ve just &lt;strong&gt;shared the knowledge&lt;/strong&gt; and someone else can carry out the task. Again, software engineering is a team game.&lt;/p&gt;

&lt;p&gt;Another useful property to stress is the low requirement for precision: the document is targeted at a human being, who handles inaccuracies and errors much more gracefully than, say, a Python interpreter. Many details can be omitted, meaning the document can be written fast and thus cheaply.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are we done? It is imperative to ask the following questions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
&lt;p&gt;Is the task tedious enough to invest in it further?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the task is a sequence of a couple of bash commands with a clear path for handling errors and dealing with changes, we’re probably done. If we need to wait hours for something to download, or juggle and copy-paste tens of variables, certificates, and long unreadable hash strings (remember, humans are bad at that), it may be worth moving to the next step.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How long does it take to execute the task?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it takes hours to execute (i.e. longer than a typical human attention span), that is a good indicator that we (i.e. our business) will benefit from further automation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are the readers of our doc capable of following it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What if a manager or a customer needs to follow the steps - can they accomplish them, confidently deal with imperfections, handle errors, and recover from missed steps?&lt;/p&gt;

&lt;p&gt;Depending on the answers we might decide to move on to the next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Translate into a script
&lt;/h2&gt;

&lt;p&gt;Usually all it takes is to follow the existing doc and translate it from natural language into a stricter computer language. The result can be as simple as a page-long bash script or an elaborate Python program employing the rich ecosystem of freely available libraries. Today (in 2023) some may even consider using Go for this task.&lt;br&gt;
Let’s look at the profit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A script can accept parameters, such as a version to work with or where to put the resulting artifacts.&lt;/li&gt;
&lt;li&gt;We lowered the bar: executing the task is even easier - just check out (or download) and run.&lt;/li&gt;
&lt;li&gt;Thus it can be integrated into other automation pipelines such as CI/CD.&lt;/li&gt;
&lt;li&gt;A script can log usage stats to allow data-driven assessment of its usefulness.&lt;/li&gt;
&lt;/ul&gt;
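
&lt;p&gt;As a sketch of this step (the &lt;code&gt;ffmpeg&lt;/code&gt; flags, directory layout, and function names are my assumptions, not part of the original task), the HOWTO might translate into a Python script along these lines:&lt;/p&gt;

```python
import subprocess
from pathlib import Path

# Hypothetical parameters for the raw call records; adjust to the telephony setup.
FFMPEG_ARGS = ["-f", "s16le", "-ar", "8000", "-ac", "1"]

def conversion_command(raw_file, out_dir):
    """Build the ffmpeg command converting one RAW file to MP3."""
    target = Path(out_dir) / (Path(raw_file).stem + ".mp3")
    cmd = ["ffmpeg", "-y"] + FFMPEG_ARGS + ["-i", str(raw_file), str(target)]
    return cmd, target

def convert_all(src_dir, out_dir, run=subprocess.run):
    """Convert every .raw file that has no .mp3 counterpart yet."""
    converted = []
    for raw_file in sorted(Path(src_dir).glob("*.raw")):
        cmd, target = conversion_command(raw_file, out_dir)
        if not target.exists():  # idempotent: skip already-converted files
            run(cmd, check=True)
            converted.append(target)
    return converted
```

&lt;p&gt;Making the conversion idempotent (skipping files that already have an MP3 counterpart) means the script can be re-run safely after a partial failure.&lt;/p&gt;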

&lt;p&gt;This is no small feat. We improved robustness, and now our team/org can execute the same task with only a couple of seconds of human attention: to start it and to check the results.&lt;br&gt;
The script approach often works great for periodic, cron-style jobs. If we want to run the task daily or hourly, we install the script on the server (packaging it along with other server components) and configure the local cron daemon (or a systemd timer) accordingly. Done. Simple, stateless automation is easy to manage and debug.&lt;/p&gt;

&lt;p&gt;On the flip side, the script approach falls short when we need to react to an external event: an RPC call, disk usage reaching some limit, etc. In other words, a script is not well suited for interactive or reactive tasks. The next shortcoming is the lack of state: if we need to keep TCP connections open or load lookup data from remote storage on each execution, a script might be too resource-hungry and slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let it run in the background
&lt;/h2&gt;

&lt;p&gt;If we make the next logical step and make our script stateful and running in the background, it is now called a daemon. With a daemon we can proactively react to local changes: timers, disk space, etc. And we can do that efficiently - connection state and an in-memory cache are at our disposal. Though state can be fragile to manage, we can keep the process single-threaded to continue keeping things simple. Do not forget: in our case simple means “reliable”. The cost of having a daemon is higher: we need to carefully manage state and watch for memory leaks - we had no such issues with a script. That is why we need to carefully estimate whether it is worth the cost in each case.&lt;/p&gt;
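
&lt;p&gt;A minimal single-threaded daemon sketch (the class and method names are illustrative assumptions): the in-memory &lt;code&gt;seen&lt;/code&gt; set is exactly the kind of state a plain script would have to rebuild on every run:&lt;/p&gt;

```python
import time
from pathlib import Path

class EncoderDaemon:
    """Single-threaded daemon sketch: state lives in memory, work happens in one loop."""

    def __init__(self, src_dir, handler):
        self.src_dir = Path(src_dir)
        self.handler = handler  # e.g. a conversion function from the script step
        self.seen = set()       # in-memory state that survives across iterations

    def tick(self):
        """One iteration: process files we have not seen yet."""
        new_files = []
        for raw_file in sorted(self.src_dir.glob("*.raw")):
            if raw_file not in self.seen:
                self.handler(raw_file)
                self.seen.add(raw_file)
                new_files.append(raw_file)
        return new_files

    def run_forever(self, interval=5.0):
        # The actual daemon loop; tick() is kept separate so it can be tested.
        while True:
            self.tick()
            time.sleep(interval)
```

&lt;p&gt;Keeping the per-iteration logic in a separate &lt;code&gt;tick()&lt;/code&gt; method makes the daemon testable without running the endless loop.&lt;/p&gt;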

&lt;p&gt;The daemon approach has shortcomings too in our Internet-scale times. First, it cannot handle anything that does not fit on a single host. Modern servers can be super powerful, and oftentimes it is cheaper and faster to buy (or rent in your cloud) a beefier machine while keeping the software simple. We might add multiple threads to handle the load and utilize additional cores. The upside here is that we can improve things &lt;strong&gt;gradually&lt;/strong&gt;, having something that works and gets things done at each step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provide a service
&lt;/h2&gt;

&lt;p&gt;If our task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fails to fit into a single machine,&lt;/li&gt;
&lt;li&gt;needs to provide an external service (in our case: remote sound file encoding and storage),&lt;/li&gt;
&lt;li&gt;needs to be resistant to single-machine failure,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;we can solve it by promoting our daemon into a service: adding an external API and making it discoverable and reachable via a service mesh, DNS, or other means. Upsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible: the task execution is not tied to a particular machine.&lt;/li&gt;
&lt;li&gt;Scalable: we can add more replicas and scale our service horizontally.&lt;/li&gt;
&lt;li&gt;Reliable: we can be resilient to failures across multiple domains: server, rack, DC, etc.&lt;/li&gt;
&lt;/ul&gt;
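
&lt;p&gt;A minimal sketch of such a promotion, using only the Python standard library (the &lt;code&gt;/encode&lt;/code&gt; endpoint and the payload shape are hypothetical, and a real service would use durable storage instead of an in-memory list):&lt;/p&gt;

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

JOBS = []  # in-memory queue; a real service would use durable storage

class EncodeHandler(BaseHTTPRequestHandler):
    """Tiny external API: POST /encode enqueues an encoding job."""

    def do_POST(self):
        if self.path != "/encode":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        job = json.loads(self.rfile.read(length))
        JOBS.append(job)
        body = json.dumps({"status": "queued", "position": len(JOBS)}).encode()
        self.send_response(202)  # accepted for asynchronous processing
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # keep the sketch quiet
        pass

def start_service(port=0):
    """Start the service on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), EncodeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

&lt;p&gt;Returning 202 rather than 200 signals that the job was accepted for asynchronous processing, which matches the long-running nature of encoding work.&lt;/p&gt;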

&lt;p&gt;Downsides are there too and include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased complexity: we have a distributed system on our hands, with all its complexity and potential issues.&lt;/li&gt;
&lt;li&gt;Increased cognitive load: we need to carefully &lt;a href="https://www.hyrumslaw.com/"&gt;design and evolve the API&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Increased operational costs: we need to monitor whether our service is reachable over the network, protect against DoS attacks, potentially isolate customers, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these increase costs, which must be assessed against the profits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking the rules
&lt;/h2&gt;

&lt;p&gt;These steps are not a definitive recipe for success, but they do provide a framework for reasoning about the engineering task at hand. Throughout my career I’ve seen many attempts to skip ahead, ignoring a step or two to get to the final solution faster. Sometimes that works, but oftentimes it doesn’t, and we end up wasting a lot of time and effort. For example, given the same task, I might want to try my hand at API design and skip the boring manual and script parts. After days of intensive labour I arrive at a perfect set of API functions my encoding server will provide. Then I decide to build a tiny prototype to do the encoding… While there is only one good ending to this story, there are many not-so-“happy ever after” ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network costs to send uncompressed files are too high, given the refined requirements (which, of course, were not there from the start).&lt;/li&gt;
&lt;li&gt;There’s no spare server/capacity on premises to run this service on.&lt;/li&gt;
&lt;li&gt;The script solution fits the needs and was needed yesterday.&lt;/li&gt;
&lt;li&gt;The daemon solution we started off with is leaking memory, and clients are not happy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list can go on and on; the message here is: be cautious, know the rules, and justify breaking them when needed. And that’s all - thank you.&lt;/p&gt;

</description>
      <category>software</category>
      <category>automation</category>
    </item>
    <item>
      <title>Fault Tolerance in Distributed Systems: Strategies and Case Studies</title>
      <dc:creator>Nikita Vetoshkin</dc:creator>
      <pubDate>Wed, 18 Oct 2023 11:31:56 +0000</pubDate>
      <link>https://dev.to/nekto0n/fault-tolerance-in-distributed-systems-strategies-and-case-studies-29d2</link>
      <guid>https://dev.to/nekto0n/fault-tolerance-in-distributed-systems-strategies-and-case-studies-29d2</guid>
      <description>&lt;p&gt;The complex technological web that supports our daily lives has grown into a vast network of distributed systems. It is especially visible in the present era when our world is more connected than ever. The smooth operation of these systems has evolved into more than just a convenience; rather, it has become essential for everything from streaming our favourite movies to managing crucial financial transactions.&lt;/p&gt;

&lt;p&gt;Imagine living in a society where a single system glitch could impair your ability to access essential services or even the world economy. Quoting Leslie Lamport: “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable” &lt;a href="https://amturing.acm.org/award_winners/lamport_1205376.cfm#:~:text=%60%60A%20distributed%20system%20is,is%20concerned%20with%20fault%20tolerance."&gt;[1]&lt;/a&gt;. A situation like this emphasises the critical significance of fault tolerance, a concept at the core of these complex networks.&lt;/p&gt;

&lt;p&gt;This article is therefore dedicated to a more focused consideration of what fault tolerance in distributed systems is, which approaches to achieving it work best, and which of them are already implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;Fault tolerance, in the realm of distributed systems, refers to the ability of a system to continue operating without interruption despite encountering failures or faults in one or more of its components. It is a measure of the system's resilience against disruptions (ranging from a single server failure to a whole data centre outage due to power failure) and its capability to ensure consistent and reliable performance.&lt;/p&gt;

&lt;p&gt;Our reliance on online platforms for everything from business operations to personal communications means that even a minor system disruption can have far-ranging consequences. An outage can result in financial losses, hinder productivity, compromise security, or shatter trust among users.&lt;/p&gt;

&lt;p&gt;However, ensuring fault tolerance in distributed systems is not at all easy. These systems are complex, with multiple nodes or components working together. A failure in one node can cascade across the system if not addressed promptly. Moreover, the inherently distributed nature of these systems can make it challenging to pinpoint the exact location and cause of a fault - that is why modern systems rely heavily on distributed tracing solutions pioneered by &lt;a href="https://research.google/pubs/pub36356/"&gt;Google Dapper&lt;/a&gt; and now widely available in &lt;a href="https://www.jaegertracing.io/"&gt;Jaeger&lt;/a&gt; and &lt;a href="https://opentracing.io/"&gt;OpenTracing&lt;/a&gt;. Still, understanding and implementing fault tolerance becomes not just about addressing failures but about predicting and mitigating potential risks before they escalate.&lt;/p&gt;

&lt;p&gt;In essence, the journey to achieving fault tolerance is riddled with challenges, but its importance in ensuring seamless technological experiences makes it an indispensable pursuit. Therefore, it is important to observe the strategies for improving this resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategies for Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Redundancy&lt;/strong&gt; &lt;br&gt;
At its core, redundancy implies having backup systems or components that can take over if the primary ones fail (either manually or automatically) — this ensures that a single failure doesn’t compromise the entire system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sharding&lt;/strong&gt;&lt;br&gt;
A technique primarily used in databases, sharding involves dividing the data into smaller and independent chunks called shards. If one shard fails, only a subset of the data is affected. It allows the remaining shards to serve the unaffected parts.&lt;/p&gt;
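
&lt;p&gt;Hash-based routing is one common way to implement sharding; here is a toy sketch (the shard count and key types are illustrative):&lt;/p&gt;

```python
import hashlib

def shard_for(key, num_shards):
    """Stable hash routing: the same key always lands on the same shard."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def partition(records, num_shards):
    """Split (key, value) records into independent shards.

    Losing one shard loses only the subset of keys that hash to it;
    the remaining shards keep serving their own keys.
    """
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards
```

&lt;p&gt;The crucial property is stability: routing must not change between writes and reads, which is why a fixed hash function is used rather than, say, random assignment.&lt;/p&gt;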

&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;&lt;br&gt;
This strategy involves creating copies of data or services. In the event of a failure, the system can switch to a replica, ensuring continuous service. Replication can be local, within the same data centre, or geographically distributed for even higher fault tolerance. Replicas can also serve the same traffic, providing higher throughput; in a search engine, for example, having 10 or more replicas is not uncommon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt;&lt;br&gt;
By distributing incoming traffic across multiple servers or components, load balancers prevent any single component from becoming a bottleneck or point of failure. If one component fails, the load balancer redirects traffic to the operational ones. There is a multitude of &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers"&gt;concrete strategies&lt;/a&gt; and this is a rapidly evolving part of computer science.&lt;/p&gt;
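
&lt;p&gt;The simplest of these strategies, round-robin with failover, can be sketched in a few lines (backend names are placeholders):&lt;/p&gt;

```python
import itertools

class RoundRobinBalancer:
    """Rotate over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or None if all are down."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        return None
```

&lt;p&gt;Real load balancers such as Envoy layer weights, outlier detection, and connection counts on top of this basic rotation.&lt;/p&gt;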

&lt;p&gt;&lt;strong&gt;Failure Detection and Recovery&lt;/strong&gt;&lt;br&gt;
It’s not enough to have backup systems. It’s also crucial to detect failures quickly. Modern systems employ monitoring tools and rely on distributed coordination systems such as &lt;a href="https://zookeeper.apache.org/"&gt;Zookeeper&lt;/a&gt; or &lt;a href="https://github.com/etcd-io/etcd"&gt;etcd&lt;/a&gt; to identify faults in real-time: once detected, recovery mechanisms are triggered to restore the service.&lt;/p&gt;
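
&lt;p&gt;A common building block for failure detection is a heartbeat timeout. The sketch below is a simplified illustration; systems like Zookeeper and etcd build sessions, leases, and quorums on top of this idea:&lt;/p&gt;

```python
import operator

class HeartbeatDetector:
    """Declares a node failed once its last heartbeat is older than the timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record that the node was alive at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose heartbeat age strictly exceeds the timeout."""
        failed = []
        for node, seen in self.last_seen.items():
            if operator.gt(now - seen, self.timeout):
                failed.append(node)
        return failed
```

&lt;p&gt;Choosing the timeout is the hard part in practice: too short and slow-but-alive nodes are falsely declared dead, too long and recovery is delayed.&lt;/p&gt;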

&lt;p&gt;In the journey towards achieving fault tolerance, the blend of these strategies ensures that systems are resilient, reliable, and consistently available, even in the face of unexpected challenges. Let us proceed to practical cases that showcase the art of applying fault tolerance approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study 1: Google's Infrastructure
&lt;/h2&gt;

&lt;p&gt;Google's colossal distributed infrastructure is symbolic of a robust fault-tolerant system. A central strategy they employ is replication, which we’ve already discussed. By replicating &lt;a href="https://research.google/pubs/pub48190/"&gt;Zanzibar&lt;/a&gt; data across the globe, latency is diminished and data resilience is enhanced. Specifically, replicas are placed in various locations worldwide, with multiple replicas within each region.&lt;/p&gt;

&lt;p&gt;Another crucial aspect of Google's fault-tolerance approach is the focus on &lt;strong&gt;performance isolation&lt;/strong&gt;. This strategy is indispensable for shared services aiming for low latency and high uptime. In situations where Zanzibar or its clients might not provide sufficient resources due to unpredictable usage patterns, performance isolation mechanisms help: they ensure that performance issues are contained within the problematic area, with no adverse effects on other clients.&lt;/p&gt;

&lt;p&gt;Furthermore, Google's large-scale cluster management, exemplified by &lt;a href="https://research.google/pubs/pub43438/"&gt;Borg&lt;/a&gt;, showcases its commitment to reliability and availability, even as challenges arise from scale and complexity. In essence, Borg manages vast clusters by combining optimised task distribution, performance isolation, and fault-recovery features while simplifying user experience with a declarative job specification and integrated monitoring tools. This fusion of technology and strategy underscores Google's dedication to real-world benefits while managing inherent challenges in its vast infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study 2: AWS Route 53
&lt;/h2&gt;

&lt;p&gt;Amazon Web Services (AWS) exemplifies high availability and fault tolerance, particularly in Route 53. This service employs a widespread network of health checkers across multiple AWS regions that continuously monitor targets. Through smart aggregation logic, isolated failures don't destabilise the system: a target is only deemed unhealthy if multiple checks fail, and this threshold can be customised based on user preferences.&lt;/p&gt;
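
&lt;p&gt;The aggregation idea can be sketched in a few lines (the threshold semantics below are a simplified illustration, not Route 53's actual algorithm):&lt;/p&gt;

```python
import operator

def aggregate_health(check_results, failure_threshold):
    """Return True (healthy) unless enough independent checkers report failure.

    check_results is a list of booleans, one per checker; an isolated failing
    checker does not flip the target's status on its own.
    """
    failures = sum(1 for ok in check_results if not ok)
    return not operator.ge(failures, failure_threshold)
```

&lt;p&gt;Spreading the checkers across regions makes their failures largely independent, which is what allows a simple count like this to filter out local network blips.&lt;/p&gt;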

&lt;p&gt;Regardless of the target's health status, the system maintains a constant workload &lt;a href="https://aws.amazon.com/builders-library/reliability-and-constant-work/"&gt;[2]&lt;/a&gt;, which ensures operational predictability during high-demand periods. The &lt;strong&gt;cellular design&lt;/strong&gt; of health checkers and aggregators allows for scalability. As needs grow, new cells can be introduced without compromising the system's capacity. &lt;/p&gt;

&lt;p&gt;Even in the face of large-scale failures, such as numerous targets failing simultaneously, the system remains resilient, with potential reductions in workload due to aligned system redundancies. Instead of making numerous DNS adjustments, Route 53 efficiently updates its DNS servers with fixed-size health status tables. By proactively pushing data, workload distribution remains balanced. In essence, Route 53's design ensures remarkable resilience and adaptability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Future Trends
&lt;/h2&gt;

&lt;p&gt;Since a growing number of projects are transitioning to distributed systems, the imperative for fault tolerance is greater than ever. The complexity and interconnectedness of these systems mean that early error detection, often referred to as "shifting left" error discovery, is vital.&lt;/p&gt;

&lt;p&gt;Emerging strategies include a deep focus on static analysis. &lt;a href="https://en.wikipedia.org/wiki/TLA%2B"&gt;TLA+&lt;/a&gt; models and modern programming languages like Rust are at the leading edge of this movement, aiming to identify and address issues even before runtime. However, while preventive measures are important, it's equally crucial to have runtime safeguards: machine learning algorithms can predict potential system failures, allowing for timely interventions; additionally, robotics research, branching into automated testing and maintenance, offers promising avenues to ensure system robustness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Implementing Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;To make the presented case studies more practical and useful, I’d prefer to present a checklist for designing fault-tolerant systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;: Implement data replication across multiple regions and ensure multiple replicas within each region as well.&lt;br&gt;
&lt;strong&gt;Isolate Performance&lt;/strong&gt;: Create barriers so that a fault in one area doesn't spread.&lt;br&gt;
&lt;strong&gt;Monitor Constantly&lt;/strong&gt;: Utilise integrated tools for constant system health checks.&lt;br&gt;
&lt;strong&gt;Stay Scalable&lt;/strong&gt;: Adopt designs that allow easy scalability in response to growing needs.&lt;br&gt;
&lt;strong&gt;Maintain Consistency&lt;/strong&gt;: Ensure that the system behaves predictably at all times, especially during peak loads or failures.&lt;br&gt;
&lt;strong&gt;Plan for Failures&lt;/strong&gt;: Assume things will break and design recovery strategies in advance.&lt;/p&gt;

&lt;p&gt;By adhering to these principles and referencing this checklist, businesses can foster systems that stand resilient against the unpredictable nature of the digital realm.&lt;/p&gt;

&lt;p&gt;As technology continually evolves, the complexities and demands of these systems heighten. With such rapid advancements in this realm, it's highly important for professionals and enthusiasts alike to keep pace with the latest methodologies and strategies. Hopefully this overview is a good distillation of current strategies that will help developers and engineers build resilient systems.&lt;/p&gt;

</description>
      <category>faulttolerance</category>
      <category>distributedsystems</category>
      <category>strategies</category>
    </item>
  </channel>
</rss>
