This is how I approached it as a beginner.
Data Flow and Storage
- Created an S3-based data lake with three zones:
  - raw for incoming data
  - processed for cleaned data
  - curated for analytics-ready data
- Enabled versioning on the raw bucket to preserve original data for reprocessing.
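
A minimal boto3 sketch of that setup. The bucket names are placeholders, and I show one bucket per zone; a single bucket with zone prefixes works just as well.

```python
import boto3

s3 = boto3.client("s3")

# One bucket per zone (placeholder names).
buckets = {
    "raw": "my-data-lake-raw",
    "processed": "my-data-lake-processed",
    "curated": "my-data-lake-curated",
}

# Note: outside us-east-1, create_bucket also needs a CreateBucketConfiguration
# with the LocationConstraint set to your region.
for name in buckets.values():
    s3.create_bucket(Bucket=name)

# Versioning on the raw bucket so original files can always be reprocessed.
s3.put_bucket_versioning(
    Bucket=buckets["raw"],
    VersioningConfiguration={"Status": "Enabled"},
)
```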
Catalog and Schema
- Created a Glue Data Catalog database.
- Used Glue Crawlers to scan raw data and infer schemas.
- Enabled automatic partition discovery based on date folders.
- Scheduled crawlers to run after each data ingestion.
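
Roughly what the catalog and crawler setup looks like in boto3. The database name, crawler name, IAM role ARN, S3 path, and cron schedule are all placeholders.

```python
import boto3

glue = boto3.client("glue")

# Catalog database for the lake.
glue.create_database(DatabaseInput={"Name": "data_lake_db"})

# Crawler over the raw zone. Glue infers the schema and, because the data is
# laid out in date folders (e.g. raw/events/year=2024/month=01/), it registers
# those folders as partitions automatically.
glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="data_lake_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-raw/"}]},
    Schedule="cron(30 1 * * ? *)",  # run after the nightly ingestion window
)
```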
ETL Transformation
- Implemented AWS Glue jobs in PySpark (a trimmed-down sketch follows this list).
- Transformation steps:
- Read raw CSV/JSON data from S3.
- Standardize column names and data types.
- Handle null and malformed records.
- Convert data into Parquet format with Snappy compression.
- Enabled job bookmarks to ensure incremental processing.
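
A trimmed-down sketch of the job script. The S3 paths, column names (event_id, event_time, amount), and partition column are placeholders; the transformation_ctx on the source is what lets job bookmarks skip files that were already processed.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session

# init/commit bracket the run so job bookmarks can record progress
# (bookmarks must also be enabled on the job itself).
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw CSV data from S3; transformation_ctx ties this source to the bookmark.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-data-lake-raw/events/"]},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="raw_events",
)
df = dyf.toDF()

# Standardize column names: trimmed, lowercase, underscores instead of spaces.
for old in df.columns:
    df = df.withColumnRenamed(old, old.strip().lower().replace(" ", "_"))

# Cast types, derive a partition column, and drop null/malformed key records.
df = (
    df.withColumn("event_time", F.to_timestamp("event_time"))
      .withColumn("amount", F.col("amount").cast("double"))
      .withColumn("event_date", F.to_date("event_time"))
      .dropna(subset=["event_id", "event_time"])
)

# Write Parquet with Snappy compression to the processed zone.
(
    df.write.mode("append")
      .option("compression", "snappy")
      .partitionBy("event_date")
      .parquet("s3://my-data-lake-processed/events/")
)

job.commit()
```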
Query and Validation
- Configured Amazon Athena to use the Glue Data Catalog.
- Ran validation queries on processed and curated datasets.
- Used partition filters to minimize scanned data and reduce cost.
- Verified record counts and schema consistency.
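
One of the validation queries, run through boto3 for illustration. The database, table, partition column, and results location are placeholders; the WHERE clause on the partition column is what keeps Athena from scanning the whole dataset.

```python
import boto3

athena = boto3.client("athena")

# Count records for a single partition; the partition filter limits scanned data.
query = """
    SELECT event_date, COUNT(*) AS record_count
    FROM processed_events
    WHERE event_date = DATE '2024-01-15'
    GROUP BY event_date
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "data_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```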
Automation
- Triggered Glue Jobs using EventBridge schedules.
- Monitored job execution and failures via CloudWatch.
- Configured SNS alerts for ETL failures.
- Archived older raw data to lower-cost S3 storage classes.
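
A sketch of two of the automation pieces: an EventBridge schedule that invokes a small Lambda to start the Glue job (one common pattern; a scheduled Glue trigger works too), and a lifecycle rule that moves old raw data to Glacier. Every name, ARN, and threshold here is a placeholder.

```python
import boto3

events = boto3.client("events")
s3 = boto3.client("s3")

# Nightly schedule rule (02:00 UTC) that fires the ETL kickoff Lambda.
events.put_rule(
    Name="nightly-etl-trigger",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)
events.put_targets(
    Rule="nightly-etl-trigger",
    Targets=[{
        "Id": "start-etl-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:start-etl-job",
    }],
)

# Lambda handler (deployed separately) that actually starts the Glue job.
def handler(event, context):
    glue = boto3.client("glue")
    return glue.start_job_run(JobName="raw-to-processed-etl")

# Move raw objects older than 90 days to a lower-cost storage class.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-raw",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-raw",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```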