Build a data warehouse quickly with Amazon Redshift - Part 1

#aws #database #tutorial #cloud

In the previous tutorial, you learnt how to create an AWS account to access the Management Console for Amazon Service.

In this post, you will learn about data analytics and machine learning use cases for building a cloud data warehouse.

The Challenge

Operational data is siloed, your organization wants to improve the quality of data used for business intelligence and reporting. You need to build an analytics platform that can connect to a data warehouse and produce dashboards for your leaders by bringing in your own data.

Synopsis ✨

Eventually your organization wants to make business decisions close to real-time and you will connect other sophisticated data sources such as audio transcripts, csv files, Tableau Server, Power BI, customer satisfaction surveys, Salesforce, SharePoint data via a Microsoft Connector and extract data from APIs from another hybrid cloud.

Your task is to produce real-time dashboards for operational as a data analyst, business analyst, data analyst or data scientist.

Solution overview

You may not have been involved in the technical information discovery sessions such as 101 Consulting 👀 but your organization has decided they want to build a data warehouse and they have chosen Amazon Redshift.

Use Case 1

The high-level architecture includes ingesting a csv file that is stored in an Amazon S3 bucket and a cluster is created on Amazon Redshift. Once a connection is made to the data base, tables are created and using the COPY command the data is loaded into the cluster.

Once data is loaded into Amazon Redshift by connecting to Amazon Quicksight, data visualization will provide insights and dashboards may be produced. Alternatively there is an Amazon S3 connector to Tableau Server and Microsoft Power BI connector to Amazon Redshift for producing dashboards.

Use Case 2

The high-level architecture includes integrating processed data from Amazon Redshift used for machine learning to make predictions using Amazon SageMaker. Again there is a direct integration from Amazon SageMaker to Tableau Server, Microsoft Power BI and Amazon Quick Sight to visualize predicted results.

Start small and develop the first use case and build a proof of concept.Help your decision-makers understand your solution and what success means for them. I recommend sketching the solution architecture using AWS Architecture Icons to help you build your architecture diagrams located here for story-telling to obtain the stakeholder buy-in. Help your organization, navigate for future success by laying down the foundations for a successful deployment.

Full Data analytics maturity life-cycle (descriptive to predictive analytics):

Analyze speech to text
Improve customer experience by analyze customer sentiment scores
Agent call monitoring
Provide call scoring of agent performance
Monitor live chat agents
Measure chat bot call deflection
Perform machine learning algorithm topic modelling
Extract key phrases from speech
Monitor call drivers from a campaign
Connect to another data lake with Amazon Redshift

What is Amazon Redshift ?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the AWS Cloud. An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.

You can try Amazon Redshift on a free trial for 2 months and receive 750 hours free per month if you have never created a Redshift cluster before.

Data Warehouse Architecture

You may read more about Amazon Redshift's data warehouse architecture which includes a leader node, client application and compute node.

Cluster: A cluster is composed of one or more compute nodes.
Leader node: The leader node manages communications with client programs and all communication with compute nodes.

For details of each node type, see Amazon Redshift clusters in the Amazon Redshift Cluster Management Guide.

Lake House Architecture 💧

With lake house architecture, Amazon Redshift enables you to query data across the data warehouse, S3 data lakes and operational databases using SQL queries and write data back to the S3 data lake in open format.

This is a video at Amazon re:invent in 2020 explaining lake house approach by Amazon Principal Product Manager Vinay Shukla here.

Secondly, with lake house architecture you have the ability to connect other semi-structured and unstructured data such as call recordings, mobile application data into the data lake and query the data from S3 without loading them into Amazon Redshift.

Thirdly, as data analytics matures in your organization, you may connect Amazon Sagemaker to Amazon Redshift for machine learning to make predictions such as regression or classification and perform topic modelling to analyze NPS surveys.

Amazon Redshift may also connect with Amazon Quicksight or third-party business intelligence tools such as Tableau and Microsoft Power BI to produce data visualization and reporting dashboards to answer key business questions.

In September 2021, Amazon released a new product feature for Amazon Quicksight called Amazon Quicksight Q which includes machine learning to answer business questions with natural language processing this produces dashboards based on topics that is specified by the author, this is like your personal data analyst. You can read more about Amazon Quicksight Q here.

Some new features for Amazon Redshift include:

Data sharing across multiple accounts
AQUA (Advanced Query Accelerator)
Federated queries announced in April 2020 at Amazon re-invent discusses the ability to query data from Amazon Redshift, S3 and operational databases.Here is the link to the announcement.

Other Considerations for a Greenfield project 🚦

As mentioned above you might be wearing three hats because you work in a small team. Ask the right questions and iterate because this is Greenfield, test and learn.

Business Analyst:

Make your organization aware about data governance and protecting customer-sensitive information.

Have you thought about Amazon Macie for PII?
Do you have a data strategy in place for a future audit?
Has your organization obtained sign-off from your legal team for data governance to bring in your chosen data sources to build a cloud data warehouse?

Data Engineer:

Ensure that customer sensitive data is not stored in S3 bucket
Do the IAM user group members have the necessary permissions to access the data stored in the S3 bucket?
When creating configuring security on a Redshift cluster, what is the IPv4 CIDR block, what is the given range?

Data Analyst:

Experiment with the newly Redshift cluster and ensure you are able to connect to the database, use the Query Editor to create a table and the COPY command to load data from the S3 bucket to the Redshift cluster (i.e. database).