Introduction
Modern data engineering requires scalable, fault-tolerant, and secure architectures. In this article, I walk through a fully operational AWS data pipeline using S3, Kinesis, Glue, Athena, Redshift, and QuickSight.
Everything here is hands-on: every step can be reproduced in your own AWS console, and I include screenshots from my own implementation.
This article will help you learn:
- How to build a real AWS ETL pipeline end-to-end
- How to combine batch + streaming data
- How to orchestrate jobs with Glue + Lambda
- How to query data with Athena and Redshift
- How to build dashboards with QuickSight
Architecture Overview
We will build this architecture:
Architecture Components
- Amazon S3 – Data lake (Raw → Clean → Analytics Zones)
- Amazon Kinesis Data Streams – Real-time data ingestion
- AWS Glue – ETL jobs & data catalog
- AWS Lambda – Event-driven transformations
- Amazon Athena – Serverless SQL analytics
- Amazon Redshift – Data warehouse
- Amazon QuickSight – Dashboards
- AWS Lake Formation – Governance
- Amazon CloudWatch – Monitoring
Step 1: Create the Data Lake on Amazon S3
Create an S3 bucket (for example, your-datalake-name) and organize it into three zones (a scripted version follows the list):
/raw/
/clean/
/analytics/
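If you prefer to script this instead of clicking through the console, a minimal boto3 sketch could look like this (the bucket name datalyte-data-lake and the region are my assumptions; S3 bucket names are globally unique, so pick your own):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Assumed bucket name -- replace with your own globally unique name
bucket = "datalyte-data-lake"

# Outside us-east-1 you must also pass CreateBucketConfiguration={"LocationConstraint": <region>}
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; zero-byte objects with trailing slashes act as zone prefixes
for zone in ("raw/", "clean/", "analytics/"):
    s3.put_object(Bucket=bucket, Key=zone)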
Step 2: Ingest Real-Time Data using Kinesis
We simulate a real IoT or clickstream pipeline.
Steps:
- Go to Kinesis → Data Streams
- Click Create Stream
- Name it:
datalyte_stream
- Set the number of shards to 1
✔ The stream's Monitoring tab shows a graph of incoming data
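To simulate clickstream traffic without a real device fleet, you can push JSON events into the stream with boto3. This is only a sketch: the event fields are made up for illustration, and the stream name matches the one created above.

import json
import random
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Send a handful of fake clickstream events to datalyte_stream
for _ in range(10):
    event = {
        "user_id": random.randint(1, 100),                       # illustrative field
        "page": random.choice(["/home", "/product", "/cart"]),   # illustrative field
        "event_time": int(time.time()),
    }
    kinesis.put_record(
        StreamName="datalyte_stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )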
Step 3: Create AWS Glue Crawler (Schema Discovery)
We want Glue to:
- Crawl the S3 raw zone
- Detect the schema
- Create a database and tables in the Glue Data Catalog
Steps:
- Go to AWS Glue → Crawlers
- Click Create Crawler
- Name:
datalyte_raw_crawler
- Choose the S3 data source: s3://datalyte-data-lake/raw/
- Create a new Glue database:
datalyte_db
✔ Once the crawler runs, the tables from the raw zone appear in the datalyte_db database in the Glue Data Catalog
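The same crawler can also be defined programmatically. The sketch below assumes you already have a Glue service role with S3 read access; the role ARN is a placeholder.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder role ARN -- the role needs Glue permissions plus read access to the raw zone
role_arn = "arn:aws:iam::123456789012:role/GlueServiceRole"

# Create the catalog database, then the crawler that will populate it
glue.create_database(DatabaseInput={"Name": "datalyte_db"})
glue.create_crawler(
    Name="datalyte_raw_crawler",
    Role=role_arn,
    DatabaseName="datalyte_db",
    Targets={"S3Targets": [{"Path": "s3://datalyte-data-lake/raw/"}]},
)

glue.start_crawler(Name="datalyte_raw_crawler")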
Step 4: Build ETL Transformations with AWS Glue (Spark)
We transform data from raw → clean.
Steps:
- Go to Glue → Jobs
- Click Create Job
- Choose Spark / Python
- Script:
import pyspark.sql.functions as F
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read the raw table that the crawler registered in the Glue Data Catalog
df = glueContext.create_dynamic_frame.from_catalog(
    database="datalyte_db",
    table_name="raw_data"
).toDF()

# Add a processing timestamp as a simple example transformation
clean_df = df.withColumn("processed_at", F.current_timestamp())

# Convert back to a DynamicFrame and write Parquet to the clean zone
clean_dyf = DynamicFrame.fromDF(clean_df, glueContext, "clean_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=clean_dyf,
    connection_type="s3",
    connection_options={"path": "s3://datalyte-data-lake/clean/"},
    format="parquet",
)
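Once the script is saved as a Glue job, you can also trigger it on demand from boto3. The job name datalyte_clean_job is an assumption; use whatever name you gave the job.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# "datalyte_clean_job" is an assumed name -- substitute your own job name
response = glue.start_job_run(JobName="datalyte_clean_job")
print("Started run:", response["JobRunId"])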
Step 5: Query Clean Data using Amazon Athena
- Go to Athena
- Run a crawler over the clean zone (or re-point the existing one) so that the clean_data table is registered in the catalog
- Select database: datalyte_db
- Run this query:
SELECT *
FROM clean_data
LIMIT 20;
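The same query can be issued programmatically with boto3. Athena needs an S3 location for its query results; the athena-results/ prefix below is a placeholder I am assuming, not something created earlier.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM clean_data LIMIT 20",
    QueryExecutionContext={"Database": "datalyte_db"},
    # Placeholder output location -- Athena writes its result files here
    ResultConfiguration={"OutputLocation": "s3://datalyte-data-lake/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])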
Step 6: Load Analytics Data into Redshift
Now we push curated data into Redshift for BI.
Steps:
1. Create a Redshift Serverless workgroup
2. Create a schema:
CREATE SCHEMA datalyte;
3. Load the S3 data into Redshift:
COPY datalyte.analytics_table
FROM 's3://datalyte-data-lake/analytics/'
IAM_ROLE '<your_redshift_role>'
FORMAT AS PARQUET;
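Note that COPY requires the target table to exist. Here is a hedged sketch using the Redshift Data API; the column definitions and the workgroup name are hypothetical and depend on your actual analytics data and Serverless setup.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Hypothetical table definition -- adjust the columns to match your Parquet schema
ddl = """
CREATE TABLE IF NOT EXISTS datalyte.analytics_table (
    user_id BIGINT,
    page VARCHAR(256),
    processed_at TIMESTAMP
);
"""

copy_sql = """
COPY datalyte.analytics_table
FROM 's3://datalyte-data-lake/analytics/'
IAM_ROLE '<your_redshift_role>'
FORMAT AS PARQUET;
"""

for sql in (ddl, copy_sql):
    rsd.execute_statement(
        WorkgroupName="datalyte-workgroup",  # assumed Redshift Serverless workgroup name
        Database="dev",                      # default database name; adjust if different
        Sql=sql,
    )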
Step 7: Build a Dashboard with QuickSight
- Connect QuickSight to Redshift
- Import analytics table
- Create:
  - Line charts
  - Bar charts
  - KPIs (metrics)
Conclusion
This hands-on project demonstrates how AWS services like S3, Glue, Kinesis, Athena, and Redshift come together to form a complete, production-ready data pipeline. With this foundation in place, you can now extend the architecture to include BI, machine learning, or more advanced data engineering workflows on AWS.