Joel Lutman for AWS Community Builders


Automating data pipelines with AWS Step Functions

Apache Spark, Serverless, and microservices are phrases you rarely hear spoken about together, but that's all about to change with AWS Step Functions.

Apache Spark vs Serverless

As someone who works as an SME in Apache Spark, I commonly find myself working with large Hadoop clusters (either on premise or as part of an EMR cluster on AWS). These clusters run up large bills even though they are mostly idle, seeing only short periods of intense compute when pipelines run.


In contrast, we have the serverless movement.

The serverless movement aims to abstract away many of these issues with managed services where you only pay for what you use, examples being AWS Lambda, Glue, DynamoDB, and S3.


Here I'm going to talk about how we can bring these two concepts together, using serverless to deliver big data solutions and get the best of both worlds.

Hello world AWS Step Functions


Welcome to AWS Step Functions, a managed service that lets us coordinate multiple AWS services into workflows.

AWS Step Functions can be used for a number of use cases and workflows, including:

  • sequencing batch processing jobs
  • transcoding media files
  • publishing events from serverless workflows
  • sending messages from automated workflows
  • or orchestrating big data workflows.
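
To make the "hello world" in the heading literal, here is roughly the smallest workflow you can write in the Amazon States Language (covered in more detail below): a single Pass state that returns a fixed result and ends the execution. The state name is purely illustrative.

```json
{
  "Comment": "A minimal hello-world state machine: one Pass state that returns a fixed result.",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Pass",
      "Result": "Hello, world!",
      "End": true
    }
  }
}
```

Swap the Pass state for a Task state pointing at a Lambda function or a Glue job and you have the building block for everything that follows.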

A traditional enterprise big data architecture may involve many complex, distributed, self-managed tools. This could include clusters for Apache Spark, Zookeeper, HDFS, and more. This type of architecture is heavily reliant on time-based schedulers such as cron, and does a poor job of binding individual workflow steps together.

[Diagram: a typical big data workflow running on a traditional, cluster-based architecture]

The diagram above illustrates a typical big data workflow of sourcing data into a data lake, ETL'ing our data from source format to Parquet, and using a pre-trained machine learning model to make predictions on the new data. Data is made available for user interaction via SQL queries.

What if a single service goes down? How are we alerted? Orchestration timings have to be well defined and follow a synchronous, blocking workflow. We have no contract between services, which leads to slow development of each component and drives a waterfall approach. These are all questions and problems that arise with such an architecture.

So let's try to replicate this using serverless components and see if we can do a better job.

[Diagram: the same workflow rebuilt from serverless components, orchestrated by AWS Step Functions]

In the above diagram we've been able to replicate the previous architecture in a completely serverless approach, thanks to Step Functions giving us a way of binding any AWS service into a workflow. Additionally, using managed serverless components has enabled us to overcome many of the problems and issues identified with the previous approach.
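
As a rough sketch of how those pieces can be chained together (the state names, job names, and S3 paths below are placeholders rather than a reference implementation), the core of the pipeline becomes a sequence of Task states using Step Functions' service integrations for Lambda, Glue, and SageMaker:

```json
{
  "Comment": "Sketch of the serverless pipeline: source data with Lambda, ETL to Parquet with Glue, then batch predictions with SageMaker.",
  "StartAt": "SourceData",
  "States": {
    "SourceData": {
      "Comment": "Sourcing Lambda pulls new data into the data lake.",
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "sourcing-lambda", "Payload.$": "$" },
      "Next": "SparkETL"
    },
    "SparkETL": {
      "Comment": "Runs a Glue Spark job and waits for it to finish (.sync).",
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "parquet-etl-job" },
      "Next": "BatchPredict"
    },
    "BatchPredict": {
      "Comment": "SageMaker batch transform scores the freshly curated data.",
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
      "Parameters": {
        "TransformJobName.$": "$$.Execution.Name",
        "ModelName": "pretrained-model",
        "TransformInput": {
          "DataSource": {
            "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://my-data-lake/curated/" }
          }
        },
        "TransformOutput": { "S3OutputPath": "s3://my-data-lake/predictions/" }
      },
      "End": true
    }
  }
}
```

Athena doesn't need to appear as a state at all here; it simply queries the Parquet output in S3 on demand.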

This serverless approach gives us the ability to:

  • Query the data at any stage via AWS Athena
  • Handle any errors or timeouts across the entire stack, route the error to an SNS topic, and then on to the support team (sketched after this list)
  • Configure retries at a per-service or whole-stack level
  • Inspect any file movement or service state via a simple query or HTTP request to DynamoDB
  • Configure Spark resources independently of each job, without worrying about cluster constraints or YARN resource sharing
  • Orchestrate stages neatly together in many different ways (sequential, parallel, diverging)
  • Trigger the entire pipeline on a cron schedule or via events
  • Monitor ETL workflows via a UI
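
To make the error-handling and retry bullets concrete, here is a hedged sketch of a single Task state declaring its own timeout, retries with exponential backoff, and a catch-all route to an SNS topic for the support team (the job name and topic ARN are made up for illustration):

```json
{
  "StartAt": "SparkETL",
  "States": {
    "SparkETL": {
      "Comment": "Retries transient failures, then hands anything unrecoverable to the support topic.",
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "parquet-etl-job" },
      "TimeoutSeconds": 3600,
      "Retry": [
        {
          "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifySupport"
        }
      ],
      "End": true
    },
    "NotifySupport": {
      "Comment": "The caught error object arrives as this state's input; its Cause becomes the alert message.",
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:eu-west-1:123456789012:pipeline-alerts",
        "Subject": "Data pipeline failure",
        "Message.$": "$.Cause"
      },
      "Next": "PipelineFailed"
    },
    "PipelineFailed": { "Type": "Fail" }
  }
}
```

The same Retry and Catch blocks can sit on any individual Task state, or on a Parallel state wrapping the whole pipeline, which is what gives us the per-service versus whole-stack choice.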


In comparison to a traditional single-stack, server-based architecture, organisations and businesses also gain a number of advantages for both the development process and service management:

  • Increase development velocity and flexibility by splitting the sourcing Lambda, Spark ETL, view Lambda, and SageMaker scripts into microservices or monorepos
  • Treat each ETL stage as a standalone service which only requires data in S3 as the interface between services
  • Recreate our services quickly and reproducibly by leveraging tools such as Terraform
  • Create and manage workflows in a simple readable configuration language
  • Avoid managing servers, clusters, databases, replication, or failure scenarios
  • Reduce our cloud spend and hidden maintenance costs by consuming resources as a service.

Configuration as Code

As mentioned, one of the clear benefits of using AWS Step Functions is being able to describe and orchestrate our pipelines with a simple configuration language.

This enables us to remove any reliance on explicitly sending signals between services, or on custom error handling, timeouts, and retries. Instead, we define these with the Amazon States Language: a simple, straightforward, JSON-based, structured configuration language.


With the States Language we declare each task in our step function as a state. We define how that state transitions into subsequent states, what happens in the event of a state's failure (allowing for different transitions depending on the type of failure), and how we want a state to execute (sequentially or in parallel alongside other states).
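
As a small illustration of the parallel case (again, all names are purely illustrative), a Parallel state runs each branch concurrently and only transitions on once every branch has completed:

```json
{
  "StartAt": "FanOut",
  "States": {
    "FanOut": {
      "Comment": "Writes an audit record to DynamoDB while the Glue ETL job runs alongside it.",
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "RecordAudit",
          "States": {
            "RecordAudit": {
              "Type": "Task",
              "Resource": "arn:aws:states:::dynamodb:putItem",
              "Parameters": {
                "TableName": "pipeline-audit",
                "Item": { "runId": { "S.$": "$$.Execution.Name" } }
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "SparkETL",
          "States": {
            "SparkETL": {
              "Type": "Task",
              "Resource": "arn:aws:states:::glue:startJobRun.sync",
              "Parameters": { "JobName": "parquet-etl-job" },
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}
```

If either branch fails, the Parallel state as a whole fails, and a Catch on it can route the error exactly as described above.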

Not the only option, but...

It's worth pointing out that some of these benefits are not limited to just AWS Step Functions.

Airflow, Luigi, and NiFi are all alternative orchestration tools that are able to provide us with a subset of these benefits, in particular scheduling and a UI.

However, these rely on running on top of EC2 instances, which in turn would have to be maintained.

If the servers were to go offline, our entire stack would be non-functional, which is not acceptable to any high-performing business. They also lack many of the other benefits discussed, such as stack-level error and timeout handling, and configuration as code, among others.


AWS Step Functions - a versatile and reliable tool

AWS Step Functions is a versatile service which allows us to focus on delivering value through orchestrating components.

Used in conjunction with serverless applications, we can avoid waterfall architecture patterns. Swapping different services in to fulfil roles during development allows developers to focus on the core use case rather than solutionising. For instance, we could easily swap DynamoDB out for AWS RDS without any architectural burden.

As we've demonstrated, it can be a powerful and reliable tool for leveraging big data within the serverless framework, and should not be overlooked by anyone exploring orchestration of big data pipelines on AWS.

Used in conjunction with the serverless framework, it can enable us to quickly deliver huge value without the traditional big (data) headaches.

For more information about AWS and serverless, feel free to check out my other blogs and my website, Manta Innovations.
