
Building a Production-Ready Data Pipeline on AWS: A Hands-On Guide for Data Engineers

Introduction

Modern data engineering requires scalable, fault-tolerant, and secure architectures. In this article, I walk through a fully operational AWS data pipeline using S3, Kinesis, Glue, Athena, Redshift, and QuickSight.
Everything here is hands-on: every step can be reproduced in your own AWS console, and screenshots from my implementation are included along the way.

This article will help you learn:

  • How to build a real AWS ETL pipeline end-to-end
  • How to combine batch + streaming data
  • How to orchestrate jobs with Glue + Lambda
  • How to query data with Athena and Redshift
  • How to build dashboards with QuickSight

Architecture Overview
We will build the following architecture:

(Architecture diagram)

Architecture Components

  • Amazon S3 – Data lake (Raw → Clean → Analytics Zones)
  • Amazon Kinesis Data Streams – Real-time data ingestion
  • AWS Glue – ETL jobs & data catalog
  • AWS Lambda – Event-driven transformations
  • Amazon Athena – Serverless SQL analytics
  • Amazon Redshift – Data warehouse
  • Amazon QuickSight – Dashboards
  • AWS Lake Formation – Governance
  • Amazon CloudWatch – Monitoring

Step 1: Create the Data Lake on Amazon S3

Create an S3 bucket (for example, your-datalake-name) with three zone prefixes:

/raw/
/clean/
/analytics/

(Screenshot: folder structure in the S3 console)
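If you prefer to script this step, here is a minimal boto3 sketch. The bucket name is illustrative (bucket names must be globally unique), and it assumes the us-east-1 region:

# Minimal sketch, assuming us-east-1; other regions need a
# CreateBucketConfiguration argument to create_bucket.
import boto3

s3 = boto3.client("s3")
s3.create_bucket(Bucket="datalyte-data-lake")

# S3 has no real folders: zero-byte keys ending in "/" act as zone prefixes.
for zone in ("raw/", "clean/", "analytics/"):
    s3.put_object(Bucket="datalyte-data-lake", Key=zone)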

Step 2: Ingest Real-Time Data using Kinesis

We simulate a real IoT or clickstream pipeline.

Steps:

  1. Go to Kinesis → Data Streams
  2. Click Create Stream
  3. Name it:

datalyte_stream

  4. Set the number of shards to 1
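To get records flowing into the stream, you can run a small producer. The sketch below assumes an invented clickstream event schema (user_id, action, ts); only the stream name datalyte_stream comes from the setup above:

# Simulated clickstream producer; the event fields are invented for the demo.
import json
import random
import time

import boto3

kinesis = boto3.client("kinesis")

for _ in range(100):
    event = {"user_id": random.randint(1, 100), "action": "click", "ts": time.time()}
    kinesis.put_record(
        StreamName="datalyte_stream",
        Data=json.dumps(event),
        PartitionKey=str(event["user_id"]),  # spreads records across shards
    )
    time.sleep(1)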

(Screenshot: Kinesis data stream)

(Screenshot: monitoring graph of incoming data)

Step 3: Create AWS Glue Crawler (Schema Discovery)

We want Glue to:

  • Crawl the S3 raw zone
  • Detect the schema
  • Create a database and tables in the Glue Data Catalog

Steps:

  1. Go to AWS Glue → Crawlers
  2. Click Create Crawler
  3. Name:

datalyte_raw_crawler

  4. Choose the S3 input: s3://datalyte-data-lake/raw/
  5. Create a new Glue database:

datalyte_db
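The same crawler can also be created programmatically. In this sketch the IAM role name is an assumption; use any role with Glue permissions and read access to the raw zone:

# Sketch of the equivalent setup via boto3; the role name
# "AWSGlueServiceRole-datalyte" is an assumed placeholder.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="datalyte_raw_crawler",
    Role="AWSGlueServiceRole-datalyte",
    DatabaseName="datalyte_db",
    Targets={"S3Targets": [{"Path": "s3://datalyte-data-lake/raw/"}]},
)
glue.start_crawler(Name="datalyte_raw_crawler")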

(Screenshot: Glue crawler setup)

(Screenshot: running crawler)

Step 4: Build ETL Transformations with AWS Glue (Spark)

We transform data from raw → clean.

Steps:

  1. Go to Glue → Jobs
  2. Click Create Job
  3. Choose Spark / Python
  4. Script:
import pyspark.sql.functions as F
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the raw table discovered by the crawler from the Glue Data Catalog
df = glueContext.create_dynamic_frame.from_catalog(
    database="datalyte_db",
    table_name="raw_data"
).toDF()

# Stamp each record with the time it was processed
clean_df = df.withColumn("processed_at", F.current_timestamp())

# Convert back to a DynamicFrame and write Parquet to the clean zone
clean_dyf = DynamicFrame.fromDF(clean_df, glueContext, "clean_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=clean_dyf,
    connection_type="s3",
    connection_options={"path": "s3://datalyte-data-lake/clean/"},
    format="parquet"
)


(Screenshot: Glue job script)
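After saving the job, you can trigger it from the console or programmatically. A short sketch, where datalyte_clean_job stands in for whatever name you gave the job:

# Trigger the Glue job and print its run ID; the job name is a placeholder.
import boto3

glue = boto3.client("glue")
run = glue.start_job_run(JobName="datalyte_clean_job")
print(run["JobRunId"])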

Step 5: Query Clean Data using Amazon Athena

  1. Go to Athena
  2. Select database: datalyte_db
  3. Run this query:
SELECT *
FROM clean_data
LIMIT 20;


(Screenshot: Athena query editor)
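The same query can also be submitted through the Athena API. In this sketch the results location is an assumption; Athena needs some writable S3 output path:

# Run the query via boto3; the OutputLocation prefix is an assumed path.
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT * FROM clean_data LIMIT 20;",
    QueryExecutionContext={"Database": "datalyte_db"},
    ResultConfiguration={"OutputLocation": "s3://datalyte-data-lake/athena-results/"},
)
print(resp["QueryExecutionId"])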

Step 6: Load Analytics Data into Redshift

Now we push curated data into Redshift for BI.

Steps:

  1. Create a Redshift Serverless workgroup
  2. Create a schema:

CREATE SCHEMA datalyte;

  3. Load S3 data into Redshift:

COPY datalyte.analytics_table
FROM 's3://datalyte-data-lake/analytics/'
IAM_ROLE '<your_redshift_role>'
FORMAT AS PARQUET;

(Screenshot: Redshift query editor)
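If you want to script the load instead of using the query editor, the Redshift Data API works with Serverless. The workgroup and database names below are assumptions:

# Sketch using the Redshift Data API; "datalyte-wg" and "dev" are assumed
# names for the serverless workgroup and database.
import boto3

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    WorkgroupName="datalyte-wg",
    Database="dev",
    Sql=(
        "COPY datalyte.analytics_table "
        "FROM 's3://datalyte-data-lake/analytics/' "
        "IAM_ROLE '<your_redshift_role>' "
        "FORMAT AS PARQUET;"
    ),
)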

Step 7: Build a Dashboard with QuickSight

  1. Connect QuickSight to Redshift
  2. Import analytics table
  3. Create:
  • Line charts
  • Bar charts
  • KPIs (metrics)

Conclusion

This hands-on project demonstrates how AWS services like S3, Glue, Kinesis, Athena, and Redshift come together to form a complete, production-ready data pipeline. With this foundation in place, you can now extend the architecture to include BI, machine learning, or more advanced data engineering workflows on AWS.
