<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrii Liashenko</title>
    <description>The latest articles on DEV Community by Andrii Liashenko (@liashenko).</description>
    <link>https://dev.to/liashenko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F280895%2Fdef53efa-1119-4c6a-b5f8-549593577c1b.jpg</url>
      <title>DEV Community: Andrii Liashenko</title>
      <link>https://dev.to/liashenko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/liashenko"/>
    <language>en</language>
    <item>
      <title>Fault-tolerance in distributed systems</title>
      <dc:creator>Andrii Liashenko</dc:creator>
      <pubDate>Thu, 30 Jan 2020 15:57:31 +0000</pubDate>
      <link>https://dev.to/liashenko/fault-tolerance-in-distributed-systems-3hdd</link>
      <guid>https://dev.to/liashenko/fault-tolerance-in-distributed-systems-3hdd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1zy6ang7kyzhu0agyzug.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1zy6ang7kyzhu0agyzug.jpg" alt="distributed system" width="800" height="587"&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;A distributed system is a network of computers that communicate with each other by passing messages but act as a single computer to the end user.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;With distributed power come big challenges, and one of them is the inevitable failures caused by the system's distributed nature.&lt;br&gt;
Network connections fail or degrade, servers crash or respond extremely slowly, software has bugs, etc.&lt;br&gt;&lt;br&gt;
How do you make your system stable and tolerant of failures?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Make your components redundant&lt;/strong&gt;. Avoid single point of failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle your Integration Points&lt;/strong&gt; (calls to remote services).&lt;/li&gt;
&lt;li&gt;When it's possible, &lt;strong&gt;respond to requests when failures happen&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test your system&lt;/strong&gt; to discover its behavior under pressure.&lt;/li&gt;
&lt;li&gt;Embrace the chaos to bring order to your system by running &lt;strong&gt;Chaos Engineering&lt;/strong&gt; experiments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Handle your Integration Points
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Integration points are the number-one killer of systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every remote call is a risk to your system's health, and a single failing call can take the whole system down if not handled properly.&lt;br&gt;&lt;br&gt;
Let's review some common patterns to handle remote calls.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9pafqexuz1e3f5f9rp0o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9pafqexuz1e3f5f9rp0o.jpg" alt="Something failed" width="800" height="587"&gt;&lt;/a&gt;  &lt;/p&gt;
&lt;h4&gt;
  
  
  Retries
&lt;/h4&gt;

&lt;p&gt;Often simply trying the same request again makes it succeed. That's because many failures are partial or transient.&lt;br&gt;
A partial failure is when only a fraction of requests fail. &lt;br&gt;
A transient failure is when requests fail only for a short period of time.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj1gzir3fpe3uw5jlgby2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj1gzir3fpe3uw5jlgby2.jpg" alt="Retry" width="800" height="387"&gt;&lt;/a&gt;&lt;br&gt;
But it's not always safe to retry. A retry can increase the load on the system being called. Instead of retrying immediately, you can use &lt;strong&gt;exponential backoff&lt;/strong&gt;, where the wait time is increased exponentially after every attempt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;waitTime = min(maxWait, baseInterval * exponentialFactor ** attempt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When failures are caused by overload, backoff alone doesn't help much: if all the failed calls back off for the same amount of time, they retry in synchronized waves and overload the system again.&lt;br&gt;
The solution is &lt;strong&gt;jitter&lt;/strong&gt;. Jitter adds randomness to the backoff to spread the retries out in time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;waitTime = rand(0, min(maxWait, baseInterval * exponentialFactor ** attempt))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
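
&lt;p&gt;Putting the two formulas together, a retry helper might look like the minimal sketch below. It is illustrative only: the function and parameter names are made up, and &lt;em&gt;remoteCall&lt;/em&gt; stands for whatever function performs the remote request.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative retry helper with exponential backoff and full jitter (not from any particular library).
async function callWithRetry(remoteCall, maxAttempts, baseInterval, maxWait) {
  let attempt = 0;
  while (true) {
    try {
      return await remoteCall();                // success: return the result
    } catch (error) {
      attempt += 1;
      if (attempt &gt;= maxAttempts) throw error;  // give up after the last attempt
      // full jitter: wait a random time between 0 and the capped exponential backoff
      const backoff = Math.min(maxWait, baseInterval * 2 ** attempt);
      const waitTime = Math.random() * backoff;
      await new Promise(function (resolve) { setTimeout(resolve, waitTime); });
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only retry operations that are safe to repeat (idempotent); otherwise a retry can duplicate the side effect.&lt;/p&gt;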



&lt;h4&gt;
  
  
  Timeouts
&lt;/h4&gt;

&lt;p&gt;When a request takes longer than usual, it increases latency in your system (and may fail eventually anyway).&lt;br&gt;&lt;br&gt;
The caller also holds on to the resources used for that request, and under high load the server can quickly run out of them (memory, threads, connections, etc.).&lt;br&gt;&lt;br&gt;
To avoid this situation, set &lt;strong&gt;connection and request timeouts&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp5yvopziklyap5m5lbd6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp5yvopziklyap5m5lbd6.jpg" alt="Timeout" width="800" height="370"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h4&gt;
  
  
  Circuit breakers
&lt;/h4&gt;

&lt;p&gt;When there’s an issue with a dependency, stop calling it!&lt;br&gt;&lt;br&gt;
In the normal “closed” state, the circuit breaker executes requests as usual.&lt;br&gt;
Once the number of failures or the frequency of failures exceeds a threshold, the circuit breaker “opens” the circuit for some time.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwxb4hfvy3g684ois4wdi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwxb4hfvy3g684ois4wdi.jpg" alt="Circuit Breaker" width="800" height="320"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h4&gt;
  
  
  Bulkhead
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;In a ship, a bulkhead is a dividing wall or barrier between compartments.&lt;br&gt;&lt;br&gt;
If the hull of a ship is compromised, only the damaged section fills with water, which prevents the ship from sinking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Isolate the failure.&lt;br&gt;
Dedicate separate thread pools to different functions (e.g. a separate thread pool for each remote service), so that if one fails, the others continue to function.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp8qhnq821gloiz5g7ssj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp8qhnq821gloiz5g7ssj.jpg" alt="Bulkheads" width="800" height="563"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Respond when failure happens
&lt;/h3&gt;

&lt;p&gt;“Fail fast” is generally a good idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no increased latency&lt;/li&gt;
&lt;li&gt;no risk for the whole system to halt&lt;/li&gt;
&lt;li&gt;no invalid system behaviour&lt;/li&gt;
&lt;li&gt;releasing the pressure on underlying systems (i.e. shed load) when they are having issues &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there are scenarios where your service can provide responses in a “fallback mode” to reduce the impact of failure on users.  &lt;/p&gt;

&lt;p&gt;Some fallback approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache&lt;/strong&gt;
Save the data that comes from remote services to a local or remote cache and reuse the cached data as a response when the service fails (see the sketch after this list).
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frynjjwts3tidybmiccf3.jpg" alt="Cache" width="800" height="696"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue&lt;/strong&gt;
Set up a queue where requests to a remote service are persisted until the dependency is available again.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ngl4rssawmmnjue8ejk.jpg" alt="Queue" width="800" height="584"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stubbed (default) values&lt;/strong&gt;
Return default values when personalized options can’t be retrieved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail silently&lt;/strong&gt;.
Return an empty or null response that can be handled by the caller (e.g. the UI).
If possible, disable the functionality that is failing.&lt;/li&gt;
&lt;/ul&gt;
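
&lt;p&gt;As an illustration of the cache fallback, the sketch below serves the last known value when the remote call fails. The cache and the &lt;em&gt;bookCatalogService&lt;/em&gt; name are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative cache fallback: prefer fresh data, fall back to the last known value.
const lastKnownGood = new Map();  // in practice this could be a local or remote cache

async function getBookDetails(bookId) {
  try {
    const details = await bookCatalogService.fetch(bookId);  // hypothetical remote service
    lastKnownGood.set(bookId, details);                      // refresh the cache on success
    return details;
  } catch (error) {
    if (lastKnownGood.has(bookId)) {
      return lastKnownGood.get(bookId);  // serve possibly stale data instead of failing
    }
    throw error;  // no fallback available: propagate the failure
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;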

&lt;h3&gt;
  
  
  Hystrix
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/Netflix/Hystrix/" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt; is a Netflix open-source library that helps you handle Integration Points using the techniques described before: Timeout, Circuit Breaker, Bulkhead, and without effort allows you to provide fallback options.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8wyz522eeyca0vhrdusi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8wyz522eeyca0vhrdusi.jpg" alt="Hystrix flow" width="800" height="388"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Embed fault tolerance and latency tolerance in your system by wrapping calls to external services into HystrixCommands:&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fruvibukekkzoenloz2fu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fruvibukekkzoenloz2fu.jpg" alt="Hystrix" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Load testing and stress testing
&lt;/h4&gt;

&lt;p&gt;Perform load and stress testing to discover how your system behaves under load. It might uncover unexpected issues and failures.&lt;br&gt;&lt;br&gt;
Run the tests for a long period of time to discover how your system behaves under continuous stress.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2qn9waf3plt07misajoz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2qn9waf3plt07misajoz.jpg" alt="Load testing" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Test for remote service failures
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;no response&lt;/li&gt;
&lt;li&gt;failed response&lt;/li&gt;
&lt;li&gt;slow response&lt;/li&gt;
&lt;/ul&gt;
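
&lt;p&gt;One way to exercise these failure modes in automated tests is to stub the remote service. For example, with the &lt;em&gt;nock&lt;/em&gt; HTTP-mocking library for Node.js (the endpoint below is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Simulating remote-service failures in tests with nock (hypothetical endpoint).
const nock = require("nock");

// failed response: the dependency returns a 500
nock("https://books-api.example.com").get("/purchases").reply(500);

// slow response: the reply is delayed past the client timeout
nock("https://books-api.example.com").get("/purchases").delay(5000).reply(200, []);

// no response: the connection errors out entirely
nock("https://books-api.example.com").get("/purchases").replyWithError("connection reset");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;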

&lt;h4&gt;
  
  
  Chaos engineering (resilience testing)
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run Chaos Engineering experiments to understand your system's robustness and discover its weaknesses.  &lt;/p&gt;

&lt;p&gt;&lt;a href="http://principlesofchaos.org/" rel="noopener noreferrer"&gt;Chaos Engineering experiments&lt;/a&gt; follow four steps:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.&lt;/li&gt;
&lt;li&gt;Hypothesize that this steady state will continue in both the control group and the experimental group.&lt;/li&gt;
&lt;li&gt;Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.&lt;/li&gt;
&lt;li&gt;Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Gremlin
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.gremlin.com" rel="noopener noreferrer"&gt;Gremlin&lt;/a&gt; is a chaos engineering platform. Gremlin provides the framework to safely and simply simulate real outages.&lt;br&gt;
Be prepared - Gremlins come:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource gremlins: throttle CPU, memory, I/O, and disk&lt;/li&gt;
&lt;li&gt;State gremlins: reboot hosts, kill processes, travel in time&lt;/li&gt;
&lt;li&gt;Network gremlins: introduce latency, blackhole traffic, lose packets, fail DNS&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In distributed systems, failures are unavoidable by nature. Keep that in mind while architecting, implementing and testing your system.&lt;br&gt;
Handle your Integration Points with Retries, Timeouts, Bulkheads and Circuit Breakers. &lt;br&gt;
Minimize the impact of failures on users by responding even when failures happen: leverage caching and queues, return default values, or disable the failing functionality. &lt;br&gt;
Test your system rigorously. Test for remote service failures.&lt;br&gt;
Break your system to make it unbreakable by running chaos engineering experiments.  &lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Michael T. Nygard. Release It!: Design and Deploy Production-Ready Software 2nd Edition (2018)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/fault-tolerance-in-a-high-volume-distributed-system-91ab4faae74a" rel="noopener noreferrer"&gt;Article "Fault-tolerance in a high volume distributed system" by Netflix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04" rel="noopener noreferrer"&gt;Article "Lessons learned from the AWS outage" by Netflix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/" rel="noopener noreferrer"&gt;Article "Exponential backoff and jitter" by AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Netflix/Hystrix/wiki" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://principlesofchaos.org/" rel="noopener noreferrer"&gt;Principles of chaos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gremlin.com/" rel="noopener noreferrer"&gt;Gremlin&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>faulttolerance</category>
    </item>
    <item>
      <title>AWS S3 + Athena real-time business analytics</title>
      <dc:creator>Andrii Liashenko</dc:creator>
      <pubDate>Sat, 30 Nov 2019 17:27:18 +0000</pubDate>
      <link>https://dev.to/liashenko/aws-s3-athena-real-time-business-analytics-333b</link>
      <guid>https://dev.to/liashenko/aws-s3-athena-real-time-business-analytics-333b</guid>
      <description>&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;br&gt;
Business analytics is crucial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It gives business people a view of the current status of your software.&lt;/li&gt;
&lt;li&gt;It is the key to the data-driven door (oh, what a pun).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make business analytics possible we need data, data that represents the right business metrics for the software (KPIs).&lt;br&gt;
Business metrics can be stored in a database, logs, files or a dedicated warehouse.&lt;/p&gt;

&lt;p&gt;In this article I’d like to show you &lt;strong&gt;real-time business analytics in AWS S3 using AWS Athena&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS S3&lt;/strong&gt; is a simple object storage service. &lt;br&gt;
It is highly available (99.9%) and durable (99.999999999%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Athena&lt;/strong&gt; is a service to query data (basically files with records) in S3 using SQL.&lt;br&gt;
Athena supports querying CSV, JSON, and Apache Parquet data formats, among others.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s serverless! You don't need to set up or maintain hosts/databases.&lt;/li&gt;
&lt;li&gt;You pay per query! $5 per TB of data scanned.&lt;/li&gt;
&lt;li&gt;You can use it with different business intelligence or SQL clients.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How does Athena work?&lt;/strong&gt;&lt;br&gt;
Athena uses &lt;strong&gt;Presto&lt;/strong&gt; under the hood.&lt;br&gt;
&lt;a href="https://aws.amazon.com/big-data/what-is-presto/" rel="noopener noreferrer"&gt;What is Presto?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9osyjh8afoj9n9ho89d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9osyjh8afoj9n9ho89d.jpg" alt="bookstore" width="800" height="599"&gt;&lt;/a&gt;&lt;br&gt;
Now, getting back to our topic, let's imagine we have an online bookstore and we need to analyze book purchases that are processed by &lt;em&gt;PurchaseService&lt;/em&gt;. &lt;br&gt;
Let's define our purchase metrics metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bookId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ec437d98-455d-4fec-8dbe-2c2630454bdd&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Orlando&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Virginia Woolf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;genre&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Fiction&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;userId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test-user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;userCountry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;country&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;userAge&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;purchaseTimestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1575121641&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good enough to support different kinds of business analytics.&lt;/p&gt;

&lt;p&gt;We could publish our purchase metrics directly to S3, but &lt;strong&gt;AWS Kinesis Firehose&lt;/strong&gt; is a better choice, and here is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firehose buffers incoming records and delivers them in batches.&lt;/li&gt;
&lt;li&gt;Firehose can convert the records to another data format, for example Apache Parquet, which is much more efficient for Athena querying (1TB of JSON records can shrink to about 130GB, meaning faster and cheaper queries).&lt;/li&gt;
&lt;li&gt;Firehose can compress the data (gzip, snappy, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's create a Firehose stream in the AWS Console called &lt;em&gt;books-purchase-stream&lt;/em&gt; that delivers data to S3.&lt;br&gt;
&lt;em&gt;PurchaseService&lt;/em&gt; is a NodeJS AWS Lambda function that will publish purchase events (in the purchase metrics format we defined above) to &lt;em&gt;books-purchase-stream&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;firehose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Firehose&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;firehoseStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;books-purchase-stream&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metricsPublisher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;purchaseRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;firehoseRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;DeliveryStreamName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;firehoseStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;purchaseRecord&lt;/span&gt;
       &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;firehose&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;firehoseRecord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Now that we have our purchase metrics stored in S3, how do we query them?&lt;/strong&gt;&lt;br&gt;
Athena is integrated with the AWS Glue Data Catalog and requires a Glue database and a Glue table for querying. &lt;br&gt;
The &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; is a metadata repository: a Glue table is the data model (schema), and a Glue database contains tables.&lt;/p&gt;

&lt;p&gt;To create a Glue database and a table with our purchase metrics metadata we're going to use a &lt;strong&gt;Glue Crawler&lt;/strong&gt;.&lt;br&gt;
Point the crawler at the data in S3 and it will extract the metadata into the AWS Glue Data Catalog.&lt;/p&gt;
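
&lt;p&gt;For example, the crawler can also be created and started with the AWS SDK; the crawler name, IAM role, database and S3 path below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Create and start a Glue Crawler over the purchase metrics in S3 (illustrative values).
const AWS = require('aws-sdk');
const glue = new AWS.Glue();

const crawlerParams = {
    Name: 'books-purchase-crawler',
    Role: 'AWSGlueServiceRole-books',  // an IAM role with Glue and S3 access
    DatabaseName: 'books_analytics',
    Targets: { S3Targets: [{ Path: 's3://books-purchase-bucket/' }] }
};

glue.createCrawler(crawlerParams, function (error) {
    if (error) {
        console.log(error, error.stack);
    } else {
        glue.startCrawler({ Name: 'books-purchase-crawler' }, function (startError) {
            if (startError) console.log(startError, startError.stack);
        });
    }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;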

&lt;p&gt;&lt;strong&gt;The flow we've created so far:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb93o249ywdl1jvssefsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb93o249ywdl1jvssefsl.png" alt="AWS S3 + Athena real-time business analytics" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We have our purchase metrics in AWS S3 and their metadata in the AWS Glue Data Catalog, can we query them now?&lt;/strong&gt;&lt;br&gt;
Yes! Let's go to Athena and write a simple query:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8j1gln7cw68zf4wu7cw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8j1gln7cw68zf4wu7cw.png" alt="Athena query" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In the end we have a simple yet powerful serverless real-time business analytics infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/athena/latest/ug/what-is.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/" rel="noopener noreferrer"&gt;https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>serverless</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
