It is no secret that SAP customers lean on the flexibility and scalability of the Amazon Web Services (AWS) platform, and on its support for the complete set of SAP-certified offerings, to run their business-critical SAP workloads.
But their ambitions go beyond infrastructure: they need to capture a wide variety of data and convert it into a format usable for decision-making, gaining an advantage over companies that rely only on traditional reports. By adopting the broad range of AWS data analytics and machine learning services, this need can be met quickly and in a serverless fashion.
In this tutorial, we will set up a small architecture to run analytical queries on SAP data and get better insights from sales order details, using two key AWS services: Amazon AppFlow and Amazon Redshift Serverless. We will carry out the following activities:
- Set up a new Redshift Serverless instance;
- Create an IAM role and an S3 bucket;
- Create Amazon AppFlow connections;
- Create a flow and run it on demand;
- Run a query on the SAP data.
Tutorial Architecture
Below is an illustration of the architecture we will cover:
In a nutshell, AppFlow will call the Sales Order (A2X) OData API to extract the sales order data from SAP and dump it into an S3 bucket, and will afterwards issue a COPY command to Redshift Serverless for further analytics.
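To make that last step concrete, here is a hand-written illustration of such a COPY issued through the Redshift Data API. AppFlow builds and runs the real command internally; the bucket, prefix, IAM role ARN, and file format below are assumptions, not what AppFlow actually generates:

```python
import boto3

# Illustration only: AppFlow performs this load for you. The bucket,
# prefix, IAM role ARN, and file format are placeholders.
client = boto3.client("redshift-data")

client.execute_statement(
    WorkgroupName="default",  # Redshift Serverless workgroup
    Database="dev",
    Sql="""
        COPY salesorder
        FROM 's3://appflow-redshift/intermediate-prefix/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3AccessRole'
        FORMAT AS JSON 'auto';
    """,
)
```

First of all, let me give you a brief overview of the AWS resources used in this tutorial: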
What is Amazon AppFlow?
Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between applications (e.g. Salesforce and SAP) and AWS services, including Amazon S3 and Amazon Redshift, in just a few clicks. It offers several transformation capabilities (such as filtering, validation, partitioning, and aggregation) and an integration with the AWS Glue Data Catalog to automatically infer the data schema. It can also encrypt data in motion, reducing exposure to security threats.
AWS is continuously investing in this service, as we can observe from the several improvement announcements made at re:Invent 2022.
And what about Amazon Redshift?
Amazon Redshift is a column-oriented, massively parallel data warehouse that can be deployed in two modes: provisioned or serverless. The latter is the newer option (generally available since July 2022) and allows us to run analytics workloads of any size without having to manage infrastructure. In other words, you focus on insights, and AWS takes care of the rest.
The illustration below highlights an overview of the functionalities and integrations available for the serverless option:
There is also a personal reason for choosing serverless: at the time of writing this post, AWS is offering $300 in credits 🤑 for a 90-day free trial.
Before you begin!
To transfer data from SAP OData to supported destinations using Amazon AppFlow, your SAP NetWeaver stack version must be 7.40 SP02 or above, and you must enable the catalog service for service discovery.
To keep things simple, my SAP demo system is accessible through an Elastic Load Balancer with a proper Amazon-issued SSL certificate, so Amazon AppFlow can reach it over an HTTPS connection from the public internet.
Getting started with the step-by-step tutorial
Step 1 - Redshift Serverless instance creation
On the AWS Console, navigate to Redshift and click "Try Redshift Serverless free trial". Before proceeding, make sure you are in an eligible region to receive the $300 in AWS credits for the trial.
Under "Configuration" settings, please select "Use default settings"; a default namespace and workgroup will be automatically created.
An IAM role must be assigned to your Redshift instance, allowing it to access the content of your S3 bucket (to be created). Under "Associated IAM roles", click "Manage IAM roles" > "Create IAM role", select the option "Any S3 bucket", and then click the "Create IAM role as default" button to confirm.
After clicking the final "Save configuration" button, your Redshift instance will start to be set up; it takes a few minutes to become available. When finished, the dashboard will be displayed, containing more details about your instance, as highlighted below:
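If you prefer scripting to console clicks, the same setup can be sketched with boto3. This is a minimal sketch under assumptions: the names mirror the console defaults, and the IAM role ARN is a placeholder for a pre-created role with S3 access (the free-trial wizard creates all of this for you):

```python
import boto3

serverless = boto3.client("redshift-serverless")

# The namespace holds the database objects; the role ARN is a
# placeholder for a role that grants Redshift access to S3.
serverless.create_namespace(
    namespaceName="default",
    iamRoles=["arn:aws:iam::123456789012:role/RedshiftS3AccessRole"],
    defaultIamRoleArn="arn:aws:iam::123456789012:role/RedshiftS3AccessRole",
)

# The workgroup provides the compute and references the namespace.
serverless.create_workgroup(
    workgroupName="default",
    namespaceName="default",
)
```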
Now we must access the Redshift query editor to create a new table to store the data coming from SAP. It is accessible by clicking the "Query data" button at the top right of your dashboard.
On first access, a pop-up will prompt you to create a connection to the database "dev". Choose the "Federated user" authentication option and then click the "Create connection" button to proceed.
Let's execute the following SQL commands to create the new table and grant access to Amazon AppFlow:
```sql
CREATE TABLE SalesOrder(
    SalesOrder          CHAR(10),
    SoldToParty         CHAR(10),
    CreationDate        DATE,
    TotalNetAmount      DECIMAL(15,2),
    TransactionCurrency CHAR(5)
);

GRANT ALL PRIVILEGES ON TABLE SalesOrder TO public;
```
After a few seconds, the new table 'salesorder' will appear in the "public" schema, as exemplified in the figure below. If you prefer to script this step, see the sketch right after the figure.
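The same DDL can be submitted through the Redshift Data API instead of the query editor; a sketch, reusing the default workgroup and database names from Step 1:

```python
import boto3

data_api = boto3.client("redshift-data")

ddl = """
CREATE TABLE SalesOrder(
    SalesOrder          CHAR(10),
    SoldToParty         CHAR(10),
    CreationDate        DATE,
    TotalNetAmount      DECIMAL(15,2),
    TransactionCurrency CHAR(5)
);
"""

# Statements run asynchronously; poll describe_statement with the
# returned Id to confirm completion. The GRANT can be run the same way.
response = data_api.execute_statement(
    WorkgroupName="default",
    Database="dev",
    Sql=ddl,
)
print(response["Id"])
```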
Last but not least, to satisfy a service permission policy condition used by Amazon AppFlow, the Redshift workgroup "default" must be tagged with the key 'RedshiftDataFullAccess', as illustrated below.
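The tag can also be applied programmatically. A sketch, assuming you replace the workgroup ARN placeholder with your real one (retrievable via list_workgroups); the policy condition checks the key, and the value here is an arbitrary placeholder:

```python
import boto3

serverless = boto3.client("redshift-serverless")

# Replace the ARN with your workgroup's ARN (see list_workgroups()).
# The tag key is what matters; the value is an arbitrary placeholder.
serverless.tag_resource(
    resourceArn="arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/<workgroup-id>",
    tags=[{"key": "RedshiftDataFullAccess", "value": "true"}],
)
```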
Step 2 - S3 bucket and IAM Role creation
A new S3 bucket is needed to configure the connection between AppFlow and Redshift, because AppFlow uses it as an intermediate destination during data transfer to Amazon Redshift. I have created an S3 bucket called "appflow-redshift", which will be used in the next steps.
To keep this tutorial as short as possible, I will not detail the bucket creation here. However, if you are not sure how to proceed, please read the official AWS User Guide.
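For completeness, here is a minimal sketch of the bucket creation. Keep in mind that bucket names are globally unique, so 'appflow-redshift' will most likely need a suffix in your account:

```python
import boto3

s3 = boto3.client("s3")

# In us-east-1 no extra configuration is needed; in other regions pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket="appflow-redshift")
```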
Also, a new IAM role is needed to grant Amazon AppFlow access to data via the Amazon Redshift Data API. I have created the role "AppFlowRedshiftDataFullAccess" using the AWS managed policy AmazonRedshiftDataFullAccess.
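Here is a sketch of the equivalent role creation with boto3. Note the trust policy that lets the AppFlow service assume the role; the account-scoped details are placeholders:

```python
import boto3
import json

iam = boto3.client("iam")

# Trust policy allowing Amazon AppFlow to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "appflow.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AppFlowRedshiftDataFullAccess",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed policy mentioned above.
iam.attach_role_policy(
    RoleName="AppFlowRedshiftDataFullAccess",
    PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftDataFullAccess",
)
```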
Step 3 - AppFlow connections creation
Let's move on with AppFlow connections creation, starting with Redshift.
Navigate to the AppFlow menu and click the "Connections" option in the left sidebar. Then select "Amazon Redshift" as the connector and proceed by clicking the "Create connection" button.
A pop-up will show up with fields asking for connection details, which I have filled out as illustrated below. After clicking the "Connect" button, a new connection between Amazon AppFlow and Redshift will be available for use.
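The console pop-up corresponds to the create_connector_profile API. Below is a sketch for a Redshift Serverless target; the profile name, ARNs, and bucket are assumptions matching the resources created earlier, and the exact property names may vary slightly across SDK versions:

```python
import boto3

appflow = boto3.client("appflow")

appflow.create_connector_profile(
    connectorProfileName="redshift-serverless-conn",
    connectorType="Redshift",
    connectionMode="Public",
    connectorProfileConfig={
        "connectorProfileProperties": {
            "Redshift": {
                "bucketName": "appflow-redshift",  # intermediate S3 bucket
                "roleArn": "arn:aws:iam::123456789012:role/RedshiftS3AccessRole",
                "dataApiRoleArn": "arn:aws:iam::123456789012:role/AppFlowRedshiftDataFullAccess",
                "isRedshiftServerless": True,
                "workgroupName": "default",
                "databaseName": "dev",
            }
        },
        # No database credentials are needed when going through the
        # Data API role.
        "connectorProfileCredentials": {"Redshift": {}},
    },
)
```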
To set up a connection with SAP, Amazon AppFlow requires detailed information about the system, such as the application host, port, client number, and authentication settings. I strongly recommend contacting your SAP Basis team to provide or validate this information.
Additionally, here is the link to the Amazon AppFlow User Guide, which contains details on how to proceed with the SAP connection creation.
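As a reference, the same connection could be sketched programmatically. The host, service path, port, client, and credentials below are placeholders; ask your Basis team for the real values:

```python
import boto3

appflow = boto3.client("appflow")

appflow.create_connector_profile(
    connectorProfileName="sap-odata-conn",
    connectorType="SAPOData",
    connectionMode="Public",
    connectorProfileConfig={
        "connectorProfileProperties": {
            "SAPOData": {
                "applicationHostUrl": "https://my-sap-host.example.com",
                "applicationServicePath": "/sap/opu/odata/sap/API_SALES_ORDER_SRV",
                "portNumber": 443,
                "clientNumber": "100",
                "logonLanguage": "EN",
            }
        },
        "connectorProfileCredentials": {
            "SAPOData": {
                "basicAuthCredentials": {
                    "username": "SAP_USER",  # placeholder
                    "password": "********",  # placeholder
                }
            }
        },
    },
)
```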
Step 4 - Create a new flow and run on demand
I hope you have done well so far! Now it is time to create a new flow in AppFlow to integrate sales order data from SAP into Redshift Serverless. Under the "Flows" option in the left sidebar, click the "Create flow" button.
The flow creation works like a wizard, containing sequential steps with fields asking for configuration details. Below is all the information I used to create the flow:
Choose 'Ignore and continue the flow run' as the error handling option and 'Run on demand' as the flow trigger option. In the next step, define the field mapping as follows:
Skip the remaining steps until the flow is created.
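For reference, the wizard roughly maps to a create_flow call like the sketch below. The connection names, object path, and task shapes are assumptions based on the mapping above; the console may generate additional task properties:

```python
import boto3

appflow = boto3.client("appflow")

fields = ["SalesOrder", "SoldToParty", "CreationDate",
          "TotalNetAmount", "TransactionCurrency"]

appflow.create_flow(
    flowName="sap-salesorder-to-redshift",
    triggerConfig={"triggerType": "OnDemand"},  # 'Run on demand'
    sourceFlowConfig={
        "connectorType": "SAPOData",
        "connectorProfileName": "sap-odata-conn",
        "sourceConnectorProperties": {
            "SAPOData": {"objectPath": "API_SALES_ORDER_SRV/A_SalesOrder"}
        },
    },
    destinationFlowConfigList=[{
        "connectorType": "Redshift",
        "connectorProfileName": "redshift-serverless-conn",
        "destinationConnectorProperties": {
            "Redshift": {
                "object": "public.salesorder",
                "intermediateBucketName": "appflow-redshift",
                # 'Ignore and continue the flow run'
                "errorHandlingConfig": {"failOnFirstDestinationError": False},
            }
        },
    }],
    tasks=[
        # Projection task: select the source fields to extract.
        {
            "sourceFields": fields,
            "taskType": "Filter",
            "connectorOperator": {"SAPOData": "PROJECTION"},
        },
        # One Map task per field, copying it to the destination column.
        *[
            {
                "sourceFields": [f],
                "taskType": "Map",
                "destinationField": f,
                "connectorOperator": {"SAPOData": "NO_OP"},
            }
            for f in fields
        ],
    ],
)
```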
After starting the flow manually by clicking the "Run flow" button at the top right of the flow details page, it will take some time to extract the SAP data and copy it into the Redshift 'salesorder' table. In the end, the following message will be displayed:
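The same run can be triggered and monitored from code (flow name as in the sketch above):

```python
import boto3

appflow = boto3.client("appflow")

# Equivalent to clicking "Run flow" in the console.
appflow.start_flow(flowName="sap-salesorder-to-redshift")

# Check the most recent execution records for the status.
records = appflow.describe_flow_execution_records(
    flowName="sap-salesorder-to-redshift"
)
for execution in records["flowExecutions"]:
    print(execution["executionId"], execution["executionStatus"])
```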
Step 5 - Run a query on SAP data
Alright, folks! Here we are! Let's get into the Redshift query editor to play a little!
The first query I want to run is, of course, a check that the data was copied correctly into the table, and... it looks just right!
To finish this tutorial, we will perform a more elaborate query with a chart as the final result. I want to find the five most valuable customers, in other words, those who spend the most. This information can be found by executing the following SQL command:
```sql
SELECT TOP 5 SOLDTOPARTY,
       SUM(TOTALNETAMOUNT) AS TOTALSPENT
FROM SALESORDER
GROUP BY SOLDTOPARTY
ORDER BY TOTALSPENT DESC;
```
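If you want the same numbers outside the query editor, the Data API can run this query too; a sketch, reusing the default workgroup and database:

```python
import boto3
import time

data_api = boto3.client("redshift-data")

sql = """
SELECT TOP 5 SOLDTOPARTY, SUM(TOTALNETAMOUNT) AS TOTALSPENT
FROM SALESORDER
GROUP BY SOLDTOPARTY
ORDER BY TOTALSPENT DESC;
"""

statement = data_api.execute_statement(
    WorkgroupName="default",
    Database="dev",
    Sql=sql,
)

# The Data API is asynchronous: poll until the statement finishes.
while True:
    status = data_api.describe_statement(Id=statement["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

# Print customer / total pairs (DECIMAL values arrive as strings).
result = data_api.get_statement_result(Id=statement["Id"])
for row in result["Records"]:
    print(row[0]["stringValue"], row[1]["stringValue"])
```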
Back in the query editor, activating the "Chart" view on the result tab and choosing the "Pie" visual type gives you a better view of your query results, as shown below.
Conclusion
In this tutorial, I tried to demonstrate, as clearly and briefly as possible, how to set up and use a data warehouse quickly, without the need to provision infrastructure, by adopting Redshift Serverless. Additionally, we saw that Amazon AppFlow can extract data from SAP with no development effort (no-code).
I hope you enjoyed the content, and thank you very much for reading. See you in the next post! 🙌