NaDia for AWS Community Builders

Optimising Sentiment Analysis Workflows: AWS Zero-ETL and Amazon Redshift Synergy-Part 1

Introduction

Being a passionate advocate for Machine Learning within Data Warehouses, I find the most intriguing aspect of this solution to be its ability to alleviate data fragmentation. By incorporating ML into your Data Warehouse, you centralise your data rather than dispersing fragments across various storage options such as Amazon S3 buckets or Azure Blob Storage simply to make it accessible to your machine learning tools. This is just one of the numerous benefits that come with implementing ML in a Data Warehouse.

Data Warehouse: Your New Data Lab!

Today, leading cloud providers have made it super simple to venture into Machine Learning right from your Data Warehouse. Amazon Redshift boasts the Amazon Redshift ML feature, Azure Synapse Analytics buddies up with Azure ML, and Google BigQuery is your go-to for ML adventures. Even Snowflake is in on the action, offering SQL-based ML magic with SnowPark ML.

At AWS re:Invent, AWS made a big deal about Zero-ETL integration with Amazon Redshift and tossed LLM support into Redshift ML. This seriously boosted the whole idea!

In this blog post series, we'll explore the details of Zero ETL Integration with Amazon Redshift. Part one kicks off with connecting Amazon Aurora to Amazon Redshift, facilitating almost real-time data interactions in the Data Warehouse. Subsequently, in Part 2, we will harness this data for insightful analyses using advanced LLM models now accessible in Redshift ML.

Simplify Data Movement With AWS Zero ETL Integration

The ETL process (Extract, Transform, Load) is super important for preparing data for a central Data Warehouse. This means gathering, cleaning, normalising, and combining data from different sources to make sure it's all set for use in downstream systems.

Traditional ETL can be a bit of a hassle. It can cost a lot, be tricky to set up, and take a while to get the data ready.
What if your primary goal is to provide your Data Analytics team with immediate access to the data?
Zero-ETL integration is a fully managed solution designed to get your transactional or operational data into the Amazon Redshift Data Warehouse almost in real time. As a fully managed solution it handles all the hard work on its own, keeping the data secure and reducing the complexity of setting up an ETL data pipeline.

Aurora Zero ETL Integration With Amazon Redshift

Aurora is designed for online transaction processing rather than analytics. When handling extensive analytics queries, its performance can drop noticeably. A common practice is to have a primary database cluster and a read replica for analytics to improve performance.

To tackle performance challenges, an advanced alternative is to set up an ETL data pipeline. You might opt for AWS Database Migration Service (DMS) to move data to S3, utilise AWS Glue for ETL jobs, or employ Amazon EMR for distributed ETL and ML tasks. Afterwards, you can load the transformed data or model artifacts back to S3 and store the refined data in Redshift for analytics. With this many steps involved, the concept of Zero-ETL steps in, offering an innovative way to simplify the process.

Set up Zero ETL Integration

Considerations

Before creating a Zero-ETL integration in your Aurora database, or any other database that supports this feature, it is essential to verify that your Aurora MySQL/PostgreSQL version is compatible and indeed supports Zero-ETL integration. For the full list of prerequisites, please check the AWS documentation.

To set up Zero-ETL integration for your Aurora source database, simply follow these steps:

Create a Custom DB Parameter Group

The first step with Aurora Zero-ETL is to create a custom DB cluster parameter group that controls replication and associate it with your Aurora DB cluster.

[Screenshot: create cluster parameter group]

[Screenshot: edit cluster parameters]

Once the cluster parameter group is created successfully, select the custom group, modify the values for each parameter as below, and hit Save changes. Also, make sure the binlog_row_value_options parameter is unset.

  binlog_backup=0
  binlog_replication_globaldb=0
  binlog_format=ROW
  aurora_enhanced_binlog=1
  binlog_row_image=full
  binlog_row_metadata=full
  binlog_transaction_compression=OFF
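If you prefer to script this step, here's a minimal AWS CLI sketch; the group name zero-etl-cluster-params and the aurora-mysql8.0 family are illustrative assumptions:

# Create a custom DB cluster parameter group (name and family are illustrative)
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-cluster-params \
  --db-parameter-group-family aurora-mysql8.0 \
  --description "Binlog settings for Aurora zero-ETL integration"

# Apply the replication-related parameters listed above
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-cluster-params \
  --parameters \
    "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=pending-reboot" \
    "ParameterName=aurora_enhanced_binlog,ParameterValue=1,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_row_image,ParameterValue=full,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_row_metadata,ParameterValue=full,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_backup,ParameterValue=0,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_replication_globaldb,ParameterValue=0,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_transaction_compression,ParameterValue=OFF,ApplyMethod=pending-reboot"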

Create Aurora Source Database

If your Amazon Aurora DB cluster is not already set up, the next step is to create the Aurora source database instance. You can simply follow the AWS instructions. A few important notes to consider:

1. Make sure the Aurora MySQL version is set to 3.05.0 or higher.

[Screenshot: create source DB cluster]

2. Change the default DB cluster parameter group to the custom parameter group that you created in the previous step.

[Screenshot: modify source DB cluster parameter group]

3. If you associate the parameter group with the DB cluster after the cluster has been created, you'll need to reboot the primary DB instance in the cluster before initiating a zero-ETL integration so the changes take effect.
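If you'd rather create the cluster from the CLI, a minimal sketch follows; the identifiers, credentials handling, and instance class are placeholders, and the version string matches the 3.05.0 minimum noted above:

# Create the Aurora MySQL source cluster with the custom parameter group
aws rds create-db-cluster \
  --db-cluster-identifier zero-etl-source-cluster \
  --engine aurora-mysql \
  --engine-version 8.0.mysql_aurora.3.05.0 \
  --db-cluster-parameter-group-name zero-etl-cluster-params \
  --master-username admin \
  --manage-master-user-password

# Add a writer instance to the cluster
aws rds create-db-instance \
  --db-instance-identifier zero-etl-source-instance \
  --db-cluster-identifier zero-etl-source-cluster \
  --db-instance-class db.r6g.large \
  --engine aurora-mysql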

For demonstration purposes, I downloaded a public dataset called Consumer Reviews of Amazon Products from Kaggle and stored it in an S3 bucket. When creating my source DB cluster I chose to restore data from S3. If you choose to restore data from an S3 bucket, make sure you have granted Aurora permission to get objects from your bucket.

[Screenshot: add data to cluster from S3]

[Screenshot: create source cluster in progress]

Set up Redshift Serverless

The next step is to set up an Amazon Redshift Serverless workgroup and namespace to use as our target data warehouse.
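For a scripted setup, here's a minimal sketch with illustrative names (8 RPUs is the smallest base capacity):

# Create the namespace, which holds databases and users
aws redshift-serverless create-namespace \
  --namespace-name zero-etl-namespace

# Create the workgroup, which holds compute settings, in that namespace
aws redshift-serverless create-workgroup \
  --workgroup-name zero-etl-workgroup \
  --namespace-name zero-etl-namespace \
  --base-capacity 8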

[Screenshot: Redshift Serverless successfully set up]

Once the Redshift namespace and workgroup are ready, we must enable the enable_case_sensitive_identifier parameter for the Zero-ETL integration to succeed. To enable case sensitivity on a Redshift Serverless workgroup, run this AWS CLI command:

aws redshift-serverless update-workgroup \
  --workgroup-name <YOUR_REDSHIFT_SERVERLESS_WORKGROUP> \
  --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true

[Screenshot: enable case sensitivity]

One last step before creating a Zero-ETL integration is to add our Aurora source DB as an authorised integration source to the namespace. This allows the Aurora source DB to update our Amazon Redshift data warehouse. To do this, go to the Resource Policy tab, add the ARN of the Aurora source DB as an authorised integration source, and add our AWS account ID as an authorised principal for Amazon Redshift.
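The same authorisation can be scripted as a resource policy on the namespace. A sketch, assuming placeholder ARNs and account ID that you would swap for your own:

# Policy authorising the Aurora cluster as an integration source and the
# account as an integration principal (ARNs and account ID are placeholders)
cat > rs-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "redshift.amazonaws.com" },
      "Action": "redshift:AuthorizeInboundIntegration",
      "Condition": {
        "StringEquals": { "aws:SourceArn": "arn:aws:rds:ap-southeast-2:123456789012:cluster:zero-etl-source-cluster" }
      }
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "redshift:CreateInboundIntegration"
    }
  ]
}
EOF

# Attach the policy to the Redshift Serverless namespace
aws redshift put-resource-policy \
  --resource-arn <YOUR_REDSHIFT_NAMESPACE_ARN> \
  --policy file://rs-policy.json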

[Screenshot: edit integration resource policy]

[Screenshot: update integration source ARN]

[Screenshot: edit integration principal policy]

Create Zero ETL Integration

To make this showcase easy to follow, I am using a click-ops approach in the AWS console instead of setting up the integration using the AWS SDK.

To create an Aurora zero-ETL integration with Amazon Redshift simply follow these steps:

[Screenshot: create Zero-ETL integration]

[Screenshot: create Zero-ETL, step 1]

[Screenshot: create Zero-ETL, step 2]

Review the configuration and select Create.

[Screenshot: create Zero-ETL, step 3]

[Screenshot: create Zero-ETL, step 4]

It takes approximately 30 minutes for the integration to become active. Once the integration is successfully created, the status of the integration and of the target Amazon Redshift data warehouse both change to Active.
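The console flow above boils down to a single API call; an equivalent CLI sketch with placeholder ARNs:

# Create the zero-ETL integration between the Aurora cluster and the
# Redshift Serverless namespace (name and ARNs are placeholders)
aws rds create-integration \
  --integration-name aurora-to-redshift \
  --source-arn arn:aws:rds:ap-southeast-2:123456789012:cluster:zero-etl-source-cluster \
  --target-arn <YOUR_REDSHIFT_NAMESPACE_ARN>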

[Screenshot: create Zero-ETL, step 5]

Create Destination Database

After successfully creating a zero-ETL integration, we must create a destination database within the target Amazon Redshift workgroup. I do this in Query Editor v2 by running the following SQL command:

CREATE DATABASE <DESTINATION_DB_NAME> FROM INTEGRATION '<INTEGRATION_ID>';

To get the integration ID, navigate to the integration list on the Amazon Redshift console.
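You can also fetch it from the CLI; the integration ID is the trailing UUID of the IntegrationArn:

# List zero-ETL integrations with their names, ARNs, and statuses
aws rds describe-integrations \
  --query "Integrations[].[IntegrationName,IntegrationArn,Status]" \
  --output table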

[Screenshot: create destination database in Redshift]

As simple as that! Now it's time to test the Zero-ETL integration in action.

How Zero ETL Works

When a Zero-ETL integration is created, it first loads the existing data from the source database into the target data warehouse, then starts streaming transactional data into the Amazon Redshift destination database.
Let's test the Zero-ETL integration by adding data to our Aurora MySQL source DB.

I use MySQL Workbench to connect to my Aurora instance and load a new dataset into my table. As soon as new data lands in the Aurora source database, we can query the destination table in Amazon Redshift and get the data back.
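Here's a sketch of that round trip from the command line; the endpoint, database, table, and column names are hypothetical, and since the Redshift Data API is asynchronous the result is fetched in a second step:

# Insert a sample row into the Aurora source table
mysql -h <AURORA_WRITER_ENDPOINT> -u admin -p reviews_db \
  -e "INSERT INTO product_reviews (product_id, review_text, rating)
      VALUES ('B00ABC123', 'Great battery life!', 5);"

# Moments later, query the replicated row in Redshift Serverless;
# zero-ETL surfaces the source MySQL database as a schema in the destination DB
STMT_ID=$(aws redshift-data execute-statement \
  --workgroup-name zero-etl-workgroup \
  --database <DESTINATION_DB_NAME> \
  --sql "SELECT product_id, review_text, rating FROM reviews_db.product_reviews;" \
  --query Id --output text)

# Give the statement a moment to finish, then fetch the rows
aws redshift-data get-statement-result --id "$STMT_ID"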

Note: if you have issues connecting to your RDS instances, I recommend following the AWS troubleshooting guide. Also, if your RDS clusters are inside a VPC, make sure the correct inbound rules are attached to the security group.

[Screenshot: load data into table]

[Screenshot: Zero-ETL data sync]

Now that we have the product feedback data available in Amazon Redshift, we can leverage pre-trained, publicly available LLMs from Amazon SageMaker JumpStart in Amazon Redshift ML to summarise feedback and perform entity extraction, sentiment analysis, and product feedback classification.

Final Words

To wrap up, I'd like to review some advantages of employing Zero-ETL integration:

  • Data is seamlessly available in Redshift.
  • It enables us to run near real-time analytics, visualisation, and ML on the data without impacting production workloads.
  • With Zero-ETL we don't need to build and maintain complex data pipelines to perform ETL operations. There are still lots of other use cases for building and maintaining a data pipeline, but if the goal is specifically analytics processing, a Zero-ETL integration is the convenient choice.
  • Zero-ETL with Redshift is provided at no additional cost.
  • We can create integrations from multiple source databases into a single Redshift warehouse.
  • We can build an end-to-end serverless solution with Aurora Serverless and Redshift Serverless.
  • Zero-ETL integrations are continuously monitored: the service detects when data tables need to be reseeded or when the integration needs to be fixed or recovered, and heals them automatically.
  • Also, Redshift sends integration-related events to Amazon EventBridge (see the sketch after this list).
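To illustrate that last point, here's a hedged sketch of an EventBridge rule that forwards Redshift events to an SNS topic; the event pattern and topic ARN are assumptions to adapt to your account:

# Match Redshift-sourced events (which include zero-ETL integration events)
# and route them to an SNS topic for alerting
aws events put-rule \
  --name redshift-zero-etl-events \
  --event-pattern '{"source": ["aws.redshift"]}'

aws events put-targets \
  --rule redshift-zero-etl-events \
  --targets "Id=sns-alert,Arn=arn:aws:sns:ap-southeast-2:123456789012:zero-etl-alerts"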

Zero-ETL is an easy, one-step, and secure way to enable near real-time analytics on transactional or operational data. Also, the new SUPER data type has advanced Amazon Redshift ML's capabilities, allowing the integration of large language models (LLMs) from SageMaker JumpStart for remote inference. Combining these two features empowers us to build robust end-to-end ML/AI solutions faster and more cost-effectively.
Keep an eye out for Part 2, where we leverage an LLM model in Redshift for immediate sentiment analysis on our dataset.
