<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luca Silvestri</title>
    <description>The latest articles on DEV Community by Luca Silvestri (@silvestriluca).</description>
    <link>https://dev.to/silvestriluca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F746875%2Ff435098e-8139-4c0e-970d-440a703ff3e1.jpeg</url>
      <title>DEV Community: Luca Silvestri</title>
      <link>https://dev.to/silvestriluca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/silvestriluca"/>
    <language>en</language>
    <item>
      <title>Digital Twin service: extracting data continuously from a data pipeline with a fully serverless infrastructure.</title>
      <dc:creator>Luca Silvestri</dc:creator>
      <pubDate>Mon, 08 Nov 2021 10:11:02 +0000</pubDate>
      <link>https://dev.to/silvestriluca/digital-twin-service-extracting-data-continuously-from-a-data-pipeline-with-a-fully-serverless-infrastructure-28c2</link>
      <guid>https://dev.to/silvestriluca/digital-twin-service-extracting-data-continuously-from-a-data-pipeline-with-a-fully-serverless-infrastructure-28c2</guid>
      <description>&lt;p&gt;&lt;em&gt;This post has been originally published as Case Study on &lt;a href="https://www.amanox.ch/case-study-sbb-cargo-digitaltwinservice/"&gt;Amanox blog&lt;/a&gt;. Customer is &lt;a href="https://www.sbbcargo.com/"&gt;SBB Cargo&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@ashwinv11?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Ashwin Vaswani&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/digital?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: Digital Twin.
&lt;/h2&gt;

&lt;p&gt;Our customer needed to extract the most recent data points from a wide range of live data providers.&lt;br&gt;
Those data are continuously acquired through a set of data pipelines (as described in this blog article).&lt;br&gt;
The customer also wanted the most recent data points to be made available through an HTTP REST API.&lt;br&gt;
The resulting service has been called "Digital Twin" because it collects and exposes the most recent data for a wide array of datasets coming from physical objects, providing a "digital representation" of the objects' states.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical challenges.
&lt;/h2&gt;

&lt;p&gt;The data have various structures and formats, depending on which service they come from. Many of these services provide near-real-time data, acquired at rates that vary from once every 15 minutes to once per second. These data need to be filtered in real time, normalized, and persisted in a database that can subsequently be queried by API users.&lt;/p&gt;

&lt;h2&gt;
  
  
  A serverless solution based on AWS.
&lt;/h2&gt;

&lt;p&gt;We wanted the solution to be consistent with the data pipeline architecture and with the "serverless design philosophy" that we have successfully embraced together with the customer.&lt;br&gt;
We also wanted to keep using tools and managed services provided by AWS.&lt;br&gt;
During architectural design, we kept an eye on easy maintenance and incremental extension.&lt;br&gt;
The resulting architecture is depicted in the following picture and is described using Terraform as the IaC tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xpplWRgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2wkiw2d7k911gsq6utcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xpplWRgr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2wkiw2d7k911gsq6utcg.png" alt="Digitaltwin" width="800" height="940"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Digital Twin service takes full advantage of the pluggability of the data streaming architecture: it collects data by hooking into the Kinesis Stream/Firehose buffers deployed in every data pipeline.&lt;br&gt;
The services used in this implementation are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Kinesis Analytics as the data extraction service.&lt;/li&gt;
&lt;li&gt;AWS Kinesis Stream as the data stream service.&lt;/li&gt;
&lt;li&gt;AWS Lambda as the computing infrastructure to run the storage access microservice.&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB as the persistence layer.&lt;/li&gt;
&lt;li&gt;AWS API Gateway to provide HTTPS REST interfaces.&lt;/li&gt;
&lt;li&gt;Amazon Cognito to provide the OAuth2 infrastructure.&lt;/li&gt;
&lt;li&gt;AWS KMS as the encryption service.&lt;/li&gt;
&lt;li&gt;AWS IAM for policy/role definitions.&lt;/li&gt;
&lt;li&gt;AWS CloudWatch to provide logging infrastructure and observability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What follows is the description of each component.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Extractor =&amp;gt; AWS Kinesis Analytics:&lt;/strong&gt; Once data are retrieved by the pipeline's sourcing workers, they are buffered in a Kinesis Stream for a configurable amount of time, remaining available for near-real-time analysis, ETL transformations, or continuous monitoring. By creating a Data Extractor as a Kinesis consumer, we were able to leverage the near-instant data availability in Kinesis and the modular architecture of the data streaming services. Instead of implementing a Kinesis consumer from scratch, we opted for Kinesis Analytics and its SQL capabilities. Hence, the data extractor became a simple Kinesis Analytics job that does three things: mapping data properties ("data templating"), continuously extracting a subset of them ("data pumping"), and providing a normalized output that is sent to the Digitaltwin Kinesis Stream ("data delivery"). By hooking a different data extractor, with a consistent output, into every data pipeline, we achieved data normalization across all data pipelines. All this was done by just defining a Kinesis Analytics Terraform resource and a set of SQL statements saved in a single file per Extractor (see the first sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Digitaltwin stream =&amp;gt; AWS Kinesis Stream:&lt;/strong&gt; A data stream with a short retention period is a good way to buffer extracted data while they wait to be persisted (or further analyzed in future improvements of this architecture). On AWS, Kinesis Stream is the natural choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Access microservice =&amp;gt; AWS Lambda:&lt;/strong&gt; We used AWS Lambda as the computing infrastructure to build a serverless microservice that writes and reads data to/from the persistence layer. Lambda was chosen because of its flexible invocation model: the code to access the persistence layer is the same regardless of which service needs it, and it can be triggered by a Kinesis event (data coming from the Extractors) or by an API Gateway one (a REST API request from an external consumer service). A sketch of this wiring also follows the list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence layer =&amp;gt; Amazon DynamoDB:&lt;/strong&gt; We needed to store the most recent value for a given component metric. This access pattern can be efficiently served by a K-V store, where the key is Component.MetricName and the value is the actual metric (plus some metadata to identify its lineage). DynamoDB is a serverless NoSQL store that fits perfectly in this scenario. Thanks to its conditional-write capability, it is simple to enforce that a write attempt succeeds only when it carries the most recent data. Through its autoscaling capabilities, DynamoDB can also absorb virtually any access load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTPS REST interfaces =&amp;gt; AWS API Gateway:&lt;/strong&gt; To provide external access to Digital Twin data, we deployed an API Gateway integrated with the Storage Access microservice (which retrieves and serves the data). It is a serverless solution, fully integrated with AWS Lambda for compute and with Amazon Cognito for user/service-based access control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth 2.0 infrastructure =&amp;gt; Amazon Cognito:&lt;/strong&gt; Implementing OAuth 2.0 on the server side is not trivial. Cognito offers a set of tools and resources that makes it much easier to implement OAuth 2.0 flows, providing auth endpoints and signed JWT tokens to authorized users/service clients. It is also fully integrated with API Gateway for simple JWT checks, while more complex authorization logic can be implemented with a Lambda Authorizer in API Gateway (see the last sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/1 =&amp;gt; KMS:&lt;/strong&gt; Data in transit and at rest are encrypted using KMS with scope-limited CMKs (Customer Master Keys).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/2 =&amp;gt; Lambda-specific role &amp;amp; policies:&lt;/strong&gt; Lambda functions run with scoped-down permissions, expressed through IAM policy documents attached to a specific role that is assigned to the function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/3 =&amp;gt; Kinesis Analytics specific role &amp;amp; policies:&lt;/strong&gt; Kinesis Analytics runs with scoped-down permissions, expressed through IAM policy documents attached to a specific role that is assigned to the Kinesis Analytics application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging and observability =&amp;gt; CloudWatch:&lt;/strong&gt; All logs (execution, access, errors) are collected by CloudWatch. Operational metrics are plotted in a dedicated CloudWatch dashboard, providing an easy way to set custom alarms for DevOps activities.&lt;/li&gt;
&lt;/ul&gt;
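
&lt;p&gt;To make the Extractor pattern more concrete, here is a minimal Terraform sketch of one Data Extractor feeding the Digitaltwin stream. It is an illustrative sketch, not the production code: stream names, the record columns and the IAM role reference (assumed to be defined elsewhere) are made up, and the real schema maps the full property set of each pipeline.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: one extractor for "pipeline A", feeding the shared Digitaltwin stream.
resource "aws_kinesis_stream" "digitaltwin" {
  name             = "digitaltwin-stream"
  shard_count      = 1
  retention_period = 24 # hours: short expiration, data are only buffered until persisted
}

resource "aws_kinesis_analytics_application" "extractor_pipeline_a" {
  name = "digitaltwin-extractor-pipeline-a"

  # "Data templating": map the incoming JSON records to SQL columns.
  inputs {
    name_prefix = "SOURCE_SQL_STREAM"
    kinesis_stream {
      resource_arn = aws_kinesis_stream.pipeline_a.arn # pipeline stream, defined elsewhere
      role_arn     = aws_iam_role.analytics.arn        # scoped-down role, defined elsewhere
    }
    schema {
      record_columns {
        name     = "component"
        sql_type = "VARCHAR(64)"
        mapping  = "$.component"
      }
      record_columns {
        name     = "metric_value"
        sql_type = "DOUBLE"
        mapping  = "$.value"
      }
      record_format {
        mapping_parameters {
          json {
            record_row_path = "$"
          }
        }
      }
    }
  }

  # "Data pumping": in the real setup this SQL lives in one file per extractor, loaded with file().
  code = &amp;lt;&amp;lt;-SQL
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("component" VARCHAR(64), "metric_value" DOUBLE);
    CREATE OR REPLACE PUMP "EXTRACTOR_PUMP" AS
      INSERT INTO "DESTINATION_SQL_STREAM"
      SELECT STREAM "component", "metric_value" FROM "SOURCE_SQL_STREAM_001";
  SQL

  # "Data delivery": normalized output goes to the Digitaltwin stream.
  outputs {
    name = "DESTINATION_SQL_STREAM"
    kinesis_stream {
      resource_arn = aws_kinesis_stream.digitaltwin.arn
      role_arn     = aws_iam_role.analytics.arn
    }
    schema {
      record_format_type = "JSON"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;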
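
&lt;p&gt;Similarly, a hedged sketch of the persistence side: the Digitaltwin stream triggers the Storage Access Lambda, which keeps the latest value per Component.MetricName in DynamoDB. Table, function and artifact names are hypothetical, the runtime is a guess (the article does not state the Lambda language), and the conditional-write check itself lives in the function code rather than in Terraform.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Key-value table: one item per Component.MetricName, holding the most recent value.
resource "aws_dynamodb_table" "digitaltwin" {
  name         = "digitaltwin-latest-values"
  billing_mode = "PAY_PER_REQUEST" # absorbs variable access load without capacity planning
  hash_key     = "component_metric"

  attribute {
    name = "component_metric"
    type = "S"
  }
}

# Storage Access microservice: one function, two triggers (Kinesis and API Gateway).
resource "aws_lambda_function" "storage_access" {
  function_name = "digitaltwin-storage-access"
  role          = aws_iam_role.storage_access.arn # scoped-down role, defined elsewhere
  handler       = "index.handler"
  runtime       = "nodejs14.x"         # illustrative runtime
  filename      = "storage_access.zip" # hypothetical build artifact

  environment {
    variables = {
      TABLE_NAME = aws_dynamodb_table.digitaltwin.name
    }
  }
}

# Kinesis trigger: data coming from the Extractors.
resource "aws_lambda_event_source_mapping" "from_digitaltwin_stream" {
  event_source_arn  = aws_kinesis_stream.digitaltwin.arn
  function_name     = aws_lambda_function.storage_access.arn
  starting_position = "LATEST"
  batch_size        = 100
}
&lt;/code&gt;&lt;/pre&gt;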
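
&lt;p&gt;Finally, a sketch of the read path: API Gateway fronting the same Lambda, with a Cognito user pool authorizer performing the JWT check. Again, resource names, the /metrics path and the user pool configuration are purely illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resource "aws_cognito_user_pool" "digitaltwin" {
  name = "digitaltwin-users"
}

resource "aws_api_gateway_rest_api" "digitaltwin" {
  name = "digitaltwin-api"
}

# JWT check delegated to Cognito; more complex logic would use a Lambda Authorizer instead.
resource "aws_api_gateway_authorizer" "cognito" {
  name          = "digitaltwin-cognito-authorizer"
  rest_api_id   = aws_api_gateway_rest_api.digitaltwin.id
  type          = "COGNITO_USER_POOLS"
  provider_arns = [aws_cognito_user_pool.digitaltwin.arn]
}

resource "aws_api_gateway_resource" "metrics" {
  rest_api_id = aws_api_gateway_rest_api.digitaltwin.id
  parent_id   = aws_api_gateway_rest_api.digitaltwin.root_resource_id
  path_part   = "metrics"
}

resource "aws_api_gateway_method" "get_metrics" {
  rest_api_id   = aws_api_gateway_rest_api.digitaltwin.id
  resource_id   = aws_api_gateway_resource.metrics.id
  http_method   = "GET"
  authorization = "COGNITO_USER_POOLS"
  authorizer_id = aws_api_gateway_authorizer.cognito.id
}

# Proxy integration to the Storage Access Lambda (read path).
resource "aws_api_gateway_integration" "get_metrics" {
  rest_api_id             = aws_api_gateway_rest_api.digitaltwin.id
  resource_id             = aws_api_gateway_resource.metrics.id
  http_method             = aws_api_gateway_method.get_metrics.http_method
  integration_http_method = "POST"
  type                    = "AWS_PROXY"
  uri                     = aws_lambda_function.storage_access.invoke_arn
}
&lt;/code&gt;&lt;/pre&gt;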

&lt;p&gt;Everything in this implementation is described as IaC and kept under version control in a Git repository.&lt;br&gt;
A CI/CD pipeline is attached to the repository, providing automatic infrastructure build and deployment on every new commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Building this Digital Twin service was not trivial, given the many moving parts and heterogeneous data sources, with different acquisition rates and structures.&lt;/li&gt;
&lt;li&gt;By adopting the data pipeline paradigm and integrating the service into it, leveraging the data stream architecture, the problem can be divided into smaller and simpler parts, making it easier to solve from a methodological point of view.&lt;/li&gt;
&lt;li&gt;By using the tools and services provided by AWS, all the infrastructure and application server provisioning is offloaded completely to the cloud provider, letting us developers and our customer focus only on business logic, and making the implementation achievable in a few days by a single developer, with greatly reduced operational costs.&lt;/li&gt;
&lt;li&gt;Maintenance is carried out through IaC and Git commits.&lt;/li&gt;
&lt;li&gt;New DevOps services can easily be introduced by leveraging operational metrics, most of which are collected in the CloudWatch dashboard for live monitoring.&lt;/li&gt;
&lt;li&gt;Expanding the service to future datasets will also be as simple as adding a new Kinesis Analytics job and a new set of SQL statements, making this implementation fully modular.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>bigdata</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Data lakes: building a serverless data pipeline</title>
      <dc:creator>Luca Silvestri</dc:creator>
      <pubDate>Mon, 08 Nov 2021 09:50:23 +0000</pubDate>
      <link>https://dev.to/silvestriluca/data-lakes-building-a-serverless-data-pipeline-2d1n</link>
      <guid>https://dev.to/silvestriluca/data-lakes-building-a-serverless-data-pipeline-2d1n</guid>
      <description>&lt;p&gt;&lt;em&gt;This post has been originally published as Case Study on &lt;a href="https://www.amanox.ch/case-study-sbbcargo/" rel="noopener noreferrer"&gt;Amanox blog&lt;/a&gt;. Customer is &lt;a href="https://www.sbbcargo.com/" rel="noopener noreferrer"&gt;SBB Cargo&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Header photo by &lt;a href="https://unsplash.com/@riegal?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;René Riegal&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/silo?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The silo problem
&lt;/h2&gt;

&lt;p&gt;In modern businesses, data are ubiquitous and come from a wide array of different sources like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Company website(s).&lt;/li&gt;
&lt;li&gt;Company app(s).&lt;/li&gt;
&lt;li&gt;IoT devices.&lt;/li&gt;
&lt;li&gt;CRMs.&lt;/li&gt;
&lt;li&gt;Social Media.&lt;/li&gt;
&lt;li&gt;Inventories &amp;amp; logistics tools (like SAP).&lt;/li&gt;
&lt;li&gt;Data warehouses.&lt;/li&gt;
&lt;li&gt;Internal reports.&lt;/li&gt;
&lt;li&gt;Sales channels.&lt;/li&gt;
&lt;li&gt;You name it.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main problem with such a scenario is that each of these data sources is saved into a so-called "vertical silo", and those siloes aren't designed to allow cross-silo data sharing.&lt;/p&gt;

&lt;p&gt;This root problem generates many other issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trying to analyze those data with aggregation tools, transversally across siloes, requires a lot of error-prone and tedious handwork, involving data exporting, data extraction and spreadsheet "mumbo-jumbo": within a few years, companies end up with a set of handcrafted tools.&lt;/li&gt;
&lt;li&gt;Such handcrafted tools are quite fragile and heavily dependent on whoever implemented them originally, making them a nightmare to maintain.&lt;/li&gt;
&lt;li&gt;When it comes to data, it's not clear what the "single source of truth" is.&lt;/li&gt;
&lt;li&gt;It's even more unclear where data reside, especially in mid-to-large organizations.&lt;/li&gt;
&lt;li&gt;Raw (unmodified) data are often missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following diagram summarizes the silo architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ba7qrpyi2z9k14pct8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ba7qrpyi2z9k14pct8.png" alt="Silo architecture"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Silo architecture&lt;/b&gt;


 

&lt;h2&gt;
  
  
  The data lake solution.
&lt;/h2&gt;

&lt;p&gt;A way to solve the "vertical siloes" problem is to build a data lake: a place where data coming from different sources can be stored and accessed in a consistent way.&lt;br&gt;
Data are stored in their natural/raw format, which keeps future data manipulation possible.&lt;br&gt;
Sometimes transformed data are stored in the data lake as well; they are necessary for tasks such as reporting, visualization, advanced analytics, and machine learning.&lt;br&gt;
Usually the data "lineage" (where the data come from, when they were acquired and manipulated, and by whom/what) is saved together with the datasets (as metadata or in some easy-to-search form).&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data pipeline?
&lt;/h2&gt;

&lt;p&gt;To "hydrate" a data lake, workers that acquire data from sources (siloes) need to be provisioned.&lt;br&gt;
Each worker is specific for a data source and is attached to services that allow further near-real-time data analysis.&lt;br&gt;
This set of functionalities (worker + services) is commonly referred as "data pipelines". In analogy to water pumps and pipes, data are extracted ("pumped") from sources and "flow" through services until they land in data lakes or specific storage/database solutions.&lt;br&gt;
The following pictures exemplify these concepts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q8zl3r2aog7treq8aps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q8zl3r2aog7treq8aps.png" alt="Datalake"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;
Data lake and data pipelines&lt;/b&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cwf6a16k39weal85oaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cwf6a16k39weal85oaf.png" alt="Data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;b&gt;Data pipeline general architecture&lt;/b&gt;


 

&lt;h2&gt;
  
  
  Going serverless
&lt;/h2&gt;

&lt;p&gt;Building a data lake and hydrating it through data pipelines can be achieved using:&lt;br&gt;
a) On-prem services.&lt;br&gt;
b) Cloud based self-managed services.&lt;br&gt;
c) Cloud based fully managed services.&lt;br&gt;
d) Cloud based serverless infrastructure.  &lt;/p&gt;

&lt;p&gt;The level of commitment required to operate these architectures decreases significantly from a) to d).&lt;br&gt;
With a), all the ops work falls on the IT team, including hardware management.&lt;br&gt;
With d), the only things the IT team must focus on are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;application logic development (Worker logic);&lt;/li&gt;
&lt;li&gt;choosing the right tools for the job and using their APIs, following proper security best practices.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since data lakes involve a lot of moving parts and different technologies, going fully serverless is a choice worth considering to reduce operational complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation on AWS
&lt;/h2&gt;

&lt;p&gt;AWS is all about choices, so there are services to build each of the architectures mentioned before. Customers can go from fully on-prem (renting bare-metal machines) to a fully serverless, event-driven infrastructure. And even in a serverless scenario there are several options to choose from.&lt;br&gt;
What follows is an implementation that we developed for one of our key customers and that is running in production.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fry2g3xv428mp0dt17xgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fry2g3xv428mp0dt17xgj.png" alt="Implementation on AWS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The services used in this implementation are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda as the computing infrastructure to run the data sourcing worker and the transformation worker.&lt;/li&gt;
&lt;li&gt;AWS Kinesis Stream / Firehose as the data stream service.&lt;/li&gt;
&lt;li&gt;Amazon S3 as the data lake storage system.&lt;/li&gt;
&lt;li&gt;AWS KMS as the encryption service.&lt;/li&gt;
&lt;li&gt;AWS Systems Manager Parameter Store as the repository for secrets.&lt;/li&gt;
&lt;li&gt;AWS IAM for policy/role definitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go briefly through each component to analyze the logic behind the architectural choices.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Sourcing Worker =&amp;gt; AWS Lambda:&lt;/strong&gt; This worker needs to connect to the silo/service and extract data from it. Depending on the type of data and its usage, this worker could run a few times per day or continuously. In a serverless environment on AWS, the choices are:

&lt;ul&gt;
&lt;li&gt;Lambda (scheduled execution).&lt;/li&gt;
&lt;li&gt;A container deployed on ECS using Fargate as computing infrastructure (continuous execution).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;We opted for Lambda because the customer wanted a data pipeline architecture able to deal with different data frequencies; for nearly continuous data flows, a short invocation interval coupled with small datasets per execution makes a simple, scheduled Lambda function a good alternative to a more complicated containerized worker, while also keeping architectural consistency with discrete data flows.&lt;/p&gt;
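
&lt;p&gt;As a minimal sketch of this choice (function name, runtime, artifact and schedule are all hypothetical), a sourcing worker can be expressed as a Lambda function invoked on a fixed schedule by a CloudWatch Events rule:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sourcing worker: a Lambda function invoked on a short, fixed schedule.
resource "aws_lambda_function" "sourcing_worker" {
  function_name = "pipeline-a-sourcing-worker"
  role          = aws_iam_role.sourcing_worker.arn # scoped-down role, defined elsewhere
  handler       = "index.handler"
  runtime       = "nodejs14.x"          # illustrative runtime
  filename      = "sourcing_worker.zip" # hypothetical build artifact
}

# The invocation rate can be tuned per data source, from a few times per day down to nearly continuous.
resource "aws_cloudwatch_event_rule" "sourcing_schedule" {
  name                = "pipeline-a-sourcing-schedule"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "sourcing_worker" {
  rule = aws_cloudwatch_event_rule.sourcing_schedule.name
  arn  = aws_lambda_function.sourcing_worker.arn
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.sourcing_worker.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.sourcing_schedule.arn
}
&lt;/code&gt;&lt;/pre&gt;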

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transformation Worker =&amp;gt; AWS Lambda:&lt;/strong&gt; This worker transforms data according to specific criteria, which depend on the actual data source. Lambda is the recommended solution to use together with AWS Kinesis Firehose: an input event containing the raw sourced records is generated by Firehose, the Lambda function processes and transforms the data according to the logic implemented in its code, and the transformed data are sent back to Kinesis Firehose so that they can be persisted (see the first sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Service =&amp;gt; Kinesis Stream / Firehose:&lt;/strong&gt; Streams are particularly useful in data ingestion because they buffer input data for a configurable period and enable a pluggable architecture to consume, analyze and process data, in batch and/or near real time. Kinesis Firehose is a serverless managed data stream service whose purpose is to deliver data continuously to a specific target type. In this case, the target is an S3 bucket. Like Kinesis Stream, Firehose can generate events carrying the data flowing through it. Those events can be used to execute code through Lambda or to trigger complex tasks using AWS Step Functions or other AWS services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lake =&amp;gt; Amazon S3:&lt;/strong&gt; Virtually unlimited storage capacity, configurable lifecycle policies and custom metadata are perfect for building a data lake. Amazon S3 offers all of that, together with a powerful API and seamless integration with other AWS services, like Firehose. The data lake may consist of a single bucket or of multiple ones (raw data, transformed data, aggregated data...).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/1 =&amp;gt; KMS:&lt;/strong&gt; Data in transit and at rest are encrypted using KMS with scope-limited CMKs (Customer Master Keys).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/2 =&amp;gt; Lambda-specific role &amp;amp; policies:&lt;/strong&gt; Lambda functions run with scoped-down permissions, expressed through IAM policy documents attached to a specific role that is assigned to the function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/3 =&amp;gt; Secrets:&lt;/strong&gt; To store secrets and service-wide parameters, Parameter Store is used. Secrets are saved as SecureStrings and encrypted with KMS using a dedicated CMK (see the second sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits/1:&lt;/strong&gt; One of the key properties of this implementation is its modularity: once it is described using an IaC tool (like CloudFormation or, as in our case, Terraform), it can easily be templated and reproduced for every other data pipeline and in different environments, since the only differentiation between pipelines is the code in the Lambdas and its execution parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits/2:&lt;/strong&gt; Standardizing architecture with templates helps tremendously with multi-region failover.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits/3:&lt;/strong&gt; Using serverless and fully decoupled stateless services also makes them scalable out-of-the-box.&lt;/li&gt;
&lt;/ul&gt;
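
&lt;p&gt;To make the Firehose/S3/Lambda wiring above more tangible, here is a hedged Terraform sketch. Bucket, stream and function names are illustrative, the IAM role and the transformation Lambda are assumed to be defined elsewhere, and buffering values are left at their defaults.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Data lake bucket (a single bucket here; raw/transformed/aggregated data may use separate ones).
resource "aws_s3_bucket" "data_lake" {
  bucket = "example-data-lake-raw" # hypothetical name
}

# Firehose delivery stream: buffers records, runs the transformation Lambda, dumps the result to S3.
resource "aws_kinesis_firehose_delivery_stream" "pipeline_a" {
  name        = "pipeline-a-firehose"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn # scoped-down role, defined elsewhere
    bucket_arn = aws_s3_bucket.data_lake.arn

    # Transformation Worker hooked in as a Firehose processor.
    processing_configuration {
      enabled = true
      processors {
        type = "Lambda"
        parameters {
          parameter_name  = "LambdaArn"
          parameter_value = aws_lambda_function.transformation_worker.arn
        }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;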
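
&lt;p&gt;And a similarly hedged sketch of the secrets handling: a dedicated CMK plus a SecureString parameter in Systems Manager Parameter Store. The key alias, parameter name and variable are made up for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Dedicated CMK used only for pipeline secrets.
resource "aws_kms_key" "secrets" {
  description         = "CMK for pipeline secrets stored in Parameter Store"
  enable_key_rotation = true
}

resource "aws_kms_alias" "secrets" {
  name          = "alias/pipeline-secrets" # hypothetical alias
  target_key_id = aws_kms_key.secrets.key_id
}

# Secret saved as a SecureString, encrypted with the dedicated CMK.
resource "aws_ssm_parameter" "source_api_token" {
  name   = "/pipelines/pipeline-a/source-api-token" # hypothetical parameter name
  type   = "SecureString"
  key_id = aws_kms_key.secrets.key_id
  value  = var.source_api_token # injected at deploy time, never committed to the repository
}
&lt;/code&gt;&lt;/pre&gt;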

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data lakes are key to freeing data and making them available to multiple users and standard tools.&lt;br&gt;
They are also quite challenging to implement correctly from scratch, as they require cloud expertise and good development skills. Having the right partner in this phase helps and speeds up the process a lot.&lt;br&gt;
Major cloud providers like AWS offer a huge set of tools and services that make the transition from data siloes to data ubiquity much easier and operationally sustainable in the long term.&lt;br&gt;
We proved that with our customer, where a fully remote &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/two-pizza-teams.html" rel="noopener noreferrer"&gt;pizza-box-sized team&lt;/a&gt; developed and deployed to production multiple data pipelines, making data widely accessible to business and operational units.&lt;br&gt;
In 2021 there are fewer excuses not to introduce data lakes and automated data ingestion through data pipelines into a company's data strategy.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>bigdata</category>
      <category>aws</category>
      <category>datapipelines</category>
    </item>
  </channel>
</rss>
