Elizabeth Fuentes L for AWS

Originally published at community.aws

Relax and let the data flow: A Zero-ETL Pipeline

Real-time Data Visualization with OpenSearch and Amazon DynamoDB: A Zero-ETL Pipeline


Amazon OpenSearch Service and Amazon DynamoDB provide a powerful combination for real-time data visualization without the need for complex Extract, Transform, Load (ETL) processes. This repository introduces an AWS Cloud Development Kit (CDK) stack that deploys a serverless architecture for efficient, real-time data ingestion using the OpenSearch Ingestion service (OSIS).

By leveraging OSIS, you can process and transform data from DynamoDB streams directly into OpenSearch, enabling near-instant visualization and analysis. This zero-ETL pipeline eliminates the overhead of traditional data transformation workflows, allowing you to focus on deriving insights from your data.

The CDK stack provisions key components such as Amazon Cognito for authentication, IAM roles for secure access, an OpenSearch domain for indexing and visualization, an S3 bucket for data backups, and a DynamoDB table as the data source. OpenSearch Ingestion acts as the central component, efficiently processing data based on a declarative YAML configuration.

Prerequisites

💰 Cost to complete:

How Does This Application Work?


The flow starts with data stored in Amazon DynamoDB, a managed and scalable NoSQL database. OpenSearch Ingestion takes an initial export of the table through Amazon S3 and then captures ongoing changes from the table's stream.

From there, the data is indexed by Amazon OpenSearch Service, which enables real-time search and analysis over large volumes of data. OpenSearch indexes the data and makes it easily accessible for fast queries.

The next component is Amazon Cognito, a service for user identity and access management. Cognito authenticates and authorizes users to access OpenSearch Dashboards.

AWS Identity and Access Management (IAM) is used to define roles and access permissions.

To create an OpenSearch Ingestion pipeline, you need an IAM role that the pipeline will assume to write data to the sink, and the role's ARN must be included in the pipeline configuration. The sink, which can be an OpenSearch Service domain (running OpenSearch 1.0+ or Elasticsearch 7.4+) or an OpenSearch Serverless collection, must have an access policy granting the necessary permissions to the pipeline role (see Granting Amazon OpenSearch Ingestion pipelines access to domains and Granting Amazon OpenSearch Ingestion pipelines access to collections).
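
For illustration, here is a minimal sketch of such a pipeline role in the CDK (Python), assuming a single OpenSearch Service domain as the sink. The construct ID, actions, and domain ARN are assumptions, not values taken from this stack:

from aws_cdk import aws_iam as iam

# Inside a Stack's __init__ (self is the stack).
# A role that OpenSearch Ingestion pipelines can assume; note the
# trust relationship with osis-pipelines.amazonaws.com.
pipeline_role = iam.Role(
    self,
    "OsisPipelineRole",
    assumed_by=iam.ServicePrincipal("osis-pipelines.amazonaws.com"),
)

# Let the pipeline write to the domain (placeholder ARN).
pipeline_role.add_to_policy(
    iam.PolicyStatement(
        actions=["es:DescribeDomain", "es:ESHttp*"],
        resources=["arn:aws:es:us-east-1:111122223333:domain/my-domain/*"],
    )
)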

OpenSearch Ingestion requires specific IAM permissions to create pipelines, including osis:CreatePipeline to create a pipeline, osis:ValidatePipeline to validate the pipeline configuration, and iam:PassRole to pass the pipeline role to OpenSearch Ingestion, allowing it to write data to the domain. The iam:PassRole permission must be granted on the pipeline role resource (specified as sts_role_arn in the pipeline configuration) or set to * if different roles will be used for each pipeline.
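
As a rough sketch, those permissions could be expressed in the CDK (Python) as follows; the account ID and role name are placeholders, not values from this stack:

from aws_cdk import aws_iam as iam

# Hypothetical pipeline role ARN -- must match the sts_role_arn used in
# the pipeline configuration.
pipeline_role_arn = "arn:aws:iam::111122223333:role/osis-pipeline-role"

# Permissions for the identity that creates and validates the pipeline.
create_pipeline_statement = iam.PolicyStatement(
    actions=["osis:CreatePipeline", "osis:ValidatePipeline"],
    resources=["*"],
)

# iam:PassRole on the pipeline role, so OpenSearch Ingestion can assume it.
pass_role_statement = iam.PolicyStatement(
    actions=["iam:PassRole"],
    resources=[pipeline_role_arn],  # or "*" if a different role is used per pipeline
)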

The heart of the pipeline is a YAML configuration file that connects the DynamoDB table with OpenSearch:

version: "2"
dynamodb-pipeline:
  source:
    dynamodb:
      acknowledgments: true
      tables:
        # REQUIRED: Supply the DynamoDB table ARN and whether export or stream processing is needed, or both
        - table_arn: "DYNAMODB_TABLE_ARN"
          # Remove the stream block if only export is needed
          stream:
            start_position: "LATEST"
          # Remove the export block if only stream is needed
          export:
            # REQUIRED for export: Specify the name of an existing S3 bucket for DynamoDB to write export data files to
            s3_bucket: "<<my-bucket>>"
            # Specify the region of the S3 bucket
            s3_region: "<<REGION_NAME>>"
            # Optionally set the name of a prefix that DynamoDB export data files are written to in the bucket.
            s3_prefix: "ddb-to-opensearch-export/"
      aws:
        # REQUIRED: Provide the role to assume that has the necessary permissions to DynamoDB, OpenSearch, and S3.
        sts_role_arn: "<<STS_ROLE_ARN>>"
        # Provide the region to use for aws credentials
        region: "<<REGION_NAME>>"
  sink:
    - opensearch:
        # REQUIRED: Provide an AWS OpenSearch endpoint
        hosts:
          [
            "<<https://OpenSearch_DOMAIN>>"
          ]
        index: "<<table-index>>"
        index_type: custom
        document_id: "${getMetadata(\"primary_key\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        document_version: "${getMetadata(\"document_version\")}"
        document_version_type: "external"
        aws:
          # REQUIRED: Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "<<STS_ROLE_ARN>>"
          # Provide the region of the domain.
          region: "<<REGION_NAME>>"

The pipeline configuration file is automatically created in the CDK stack along with all the other resources.
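
For context, a CDK (Python) definition of such a pipeline might look roughly like the following; the construct ID, capacity units, and the pipeline_yaml variable are assumptions, not the exact code from the repository:

from aws_cdk import aws_osis as osis

# pipeline_yaml is assumed to hold the YAML configuration shown above,
# with the placeholders already substituted.
pipeline = osis.CfnPipeline(
    self,
    "DynamoDBToOpenSearchPipeline",
    pipeline_name="dynamodb-pipeline",
    min_units=1,
    max_units=4,
    pipeline_configuration_body=pipeline_yaml,
)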

Let's build!

Step 1: App Setup

✅ Clone the repo

git clone https://github.com/build-on-aws/realtime-dynamodb-zero-etl-opensearch-visualization

✅ Go to the app directory:

cd dashboard

✅ Create The Virtual Environment by following the steps in the README:

python3 -m venv .venv
source .venv/bin/activate

For Windows:

.venv\Scripts\activate.bat

✅ Install The Requirements:

pip install -r requirements.txt

✅ Synthesize The CloudFormation Template With The Following Command:

cdk synth

✅🚀 The Deployment:

cdk deploy

The deployment will take between 5 and 10 minutes, which is how long it takes for the OpenSearch domain to be created.

When it is ready, you will see that the status changes to completed.


To access OpenSearch Dashboards through the OpenSearch Dashboards URL (IPv4), you need to create a user in the Amazon Cognito user pool.


With the created user, access the Dashboard and begin to experience the magic of Zero-ETL between the DynamoDB table and OpenSearch.

In this repository you created a table into which you can inject data (see the sketch below), but you can also change the setup by updating the Amazon OpenSearch Ingestion pipeline, either by editing the YAML file or by modifying the CDK stack.
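
As a quick test of the flow, here is a minimal sketch that injects an item with boto3; the table name and attribute names are assumptions, so replace them with the values from your deployed stack:

import uuid

import boto3

# Placeholder table name -- use the table name output by cdk deploy.
table = boto3.resource("dynamodb").Table("zero-etl-demo-table")

# Write a sample item; within moments it should appear in the
# OpenSearch index defined in the pipeline YAML.
table.put_item(
    Item={
        "id": str(uuid.uuid4()),  # assumed partition key name
        "message": "hello zero-etl",
    }
)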

Conclusion

The combination of Amazon OpenSearch Service and Amazon DynamoDB enables real-time data visualization without the complexities of traditional ETL processes. By utilizing the OpenSearch Ingestion Service (OSIS), a serverless architecture can be implemented that efficiently processes and transforms data from DynamoDB directly into OpenSearch. Building the application with AWS CDK streamlines the setup of key components such as authentication, secure access, indexing, visualization, and data backup.

This solution allows you to focus on gaining insights from your data rather than managing infrastructure. Ideal for real-time dashboards, log analytics, or IoT event monitoring, this Zero-ETL pipeline offers a scalable and agile approach to data ingestion and visualization. Clone the repository, customize the configuration, and deploy the stack on AWS to leverage the power of OpenSearch and DynamoDB for real-time data visualization.

