DEV Community


Real-Time Data Processing using AWS

Awan Shrestha on March 04, 2023

Real-time data is data that is updated on a near-real-time basis. We will be using different AWS services to create a data pipeline t...
charlottekellogg944 • Edited

The architecture you described outlines how real-time data is handled and integrated using various AWS services. Here's a summary of the steps involved:

1. The architecture assumes that the real-time data is being updated in a Google BigQuery View, which contains relational data for real-time integration.
2. AWS Glue Job is used as an extraction script to connect to the Google BigQuery View. A connection to the view is established using the JDBC URL and connection credentials stored in AWS Secrets Manager. The Glue Job script accesses the data from the view and converts it into a PySpark DataFrame.
3. The extracted data is then written to a Kinesis Data Stream. Kinesis Data Stream is a scalable and distributed streaming service that ingests large streams of data in real time. The data is distributed among shards by partition key, allowing multiple consumers to process the data in parallel.
4. Kinesis Firehose is used to receive records from the Kinesis Data Stream and deliver them to a destination. In this scenario, the destination is an AWS S3 bucket, which serves as a data lake for storing the ingested data. Kinesis Firehose can transform the data format and enable quick ingestion.
5. The ingested data is delivered to the specified S3 bucket by Kinesis Firehose. The bucket acts as a storage location for the real-time data.
6. A Lambda function is triggered when the data is delivered to the S3 bucket. The Lambda function retrieves the data file from the bucket and loads it into the desired tables in AWS Redshift. Redshift is a petabyte-scale data warehouse that can store large amounts of relational data.
7. After the data is loaded into Redshift tables, further transformations and processing can be performed using Lambda functions. Redshift provides a powerful environment for querying and analyzing the data stored in the data warehouse. Different schemas, such as staging (STG), temporary (TMP), and target (TGT), can be created to organize the data.
8. With the data stored in Redshift, views can be created on top of the target tables to facilitate data analysis. Tools like Tableau can be used to create interactive reports and visualizations based on the processed data.
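To make step 2 a bit more concrete, here is a minimal sketch of how the Glue job might assemble its JDBC read options from a secret fetched out of Secrets Manager. The secret keys, JDBC URL format, and view name are assumptions for illustration, not the article's actual configuration:

```python
def build_jdbc_options(secret: dict, view: str) -> dict:
    """Assemble the options dict passed to spark.read.format("jdbc")."""
    return {
        "url": secret["jdbc_url"],   # JDBC URL for the BigQuery view (hypothetical secret key)
        "dbtable": view,             # the BigQuery view to read from
        "user": secret["user"],
        "password": secret["password"],
        "fetchsize": "10000",        # tune for read throughput
    }

# Inside the Glue job itself (needs AWS access, so shown as a comment):
#   import boto3, json
#   sm = boto3.client("secretsmanager")
#   secret = json.loads(sm.get_secret_value(SecretId="bq/jdbc")["SecretString"])
#   df = (spark.read.format("jdbc")
#              .options(**build_jdbc_options(secret, "dataset.realtime_view"))
#              .load())
```

Keeping the credentials in Secrets Manager and only assembling the options at runtime means nothing sensitive is hard-coded in the job script.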
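For step 3, a producer calling `PutRecords` is limited to 500 records per API call, so rows are typically serialized with a partition key and chunked into batches. A sketch of that batching logic (the stream name and key field are hypothetical):

```python
import json

MAX_BATCH = 500  # PutRecords accepts at most 500 records per call

def to_entries(rows, key_field):
    """Serialize dict rows into PutRecords entries, keyed for shard distribution."""
    return [
        {"Data": json.dumps(row).encode("utf-8"),
         "PartitionKey": str(row[key_field])}
        for row in rows
    ]

def batches(entries, size=MAX_BATCH):
    """Yield successive batches no larger than the API limit."""
    for i in range(0, len(entries), size):
        yield entries[i : i + size]

# The actual producer call (needs AWS credentials, so shown as a comment):
#   kinesis = boto3.client("kinesis")
#   for batch in batches(to_entries(rows, "customer_id")):
#       kinesis.put_records(StreamName="realtime-stream", Records=batch)
```

Choosing a high-cardinality field as the partition key spreads records evenly across shards, which is what lets multiple consumers read in parallel.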
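For step 6, the Lambda receives the S3 put event for the object Firehose delivered, and loads it with a Redshift COPY. Here is a sketch of parsing the event and building the COPY statement; the table name, IAM role ARN, and file format are placeholders, not the article's actual values:

```python
import urllib.parse

def copy_statement(event: dict, table: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for the object named in an S3 put event."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # S3 event keys are URL-encoded, so decode before building the path
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto';"
    )

# In the handler, the statement could be run via the Redshift Data API:
#   boto3.client("redshift-data").execute_statement(
#       ClusterIdentifier=..., Database=...,
#       Sql=copy_statement(event, "stg.events", ROLE_ARN))
```

COPY loads directly and in parallel from S3, which is far faster than row-by-row inserts from the Lambda.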
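Step 7's flow from staging (STG) into target (TGT) schemas is commonly an upsert: delete the matching keys from the target, insert from staging, and clear staging, all inside one transaction. A sketch generating that SQL (the schema, table, and key names are illustrative):

```python
def upsert_sql(stg: str, tgt: str, key: str) -> list:
    """Classic Redshift merge pattern: delete-then-insert in one transaction."""
    return [
        "BEGIN;",
        # Remove target rows that are about to be replaced by fresher staging rows
        f"DELETE FROM {tgt} USING {stg} WHERE {tgt}.{key} = {stg}.{key};",
        f"INSERT INTO {tgt} SELECT * FROM {stg};",
        # Clear staging so the next batch starts empty
        f"TRUNCATE {stg};",
        "END;",
    ]
```

Wrapping the delete and insert in one transaction keeps the target table consistent for any dashboard query that runs mid-load.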