Jessica Tiwari
Building a Batch Data Pipeline on AWS

This is how I approached it as a beginner.

Define the Data Flow and Storage

  1. Created an S3-based data lake with three zones:
     - raw for incoming data
     - processed for cleaned data
     - curated for validated, query-ready data
  2. Enabled versioning on the raw bucket to preserve original data for reprocessing (see the sketch after this list).
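
A minimal boto3 sketch of that storage setup, assuming one bucket per zone. The bucket names and region here are placeholders, not the ones I actually used.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # region is an assumption

# Hypothetical bucket names -- substitute your own
buckets = ["my-pipeline-raw", "my-pipeline-processed", "my-pipeline-curated"]

for name in buckets:
    # In us-east-1 no CreateBucketConfiguration is needed; other regions require it
    s3.create_bucket(Bucket=name)

# Versioning only on the raw zone, so original files survive reprocessing mistakes
s3.put_bucket_versioning(
    Bucket="my-pipeline-raw",
    VersioningConfiguration={"Status": "Enabled"},
)
```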

Catalog and Schema

  1. Created a Glue Data Catalog database.
  2. Used Glue Crawlers to scan raw data and infer schemas.
  3. Enabled automatic partition discovery based on date folders.
  4. Scheduled crawlers to run after each data ingestion.
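
The catalog and crawler can be created from the console, IaC, or the SDK; a boto3 sketch of the same idea looks roughly like this. The database name, crawler name, IAM role ARN, S3 path, and cron schedule are all placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Catalog database for the lake (name is a placeholder)
glue.create_database(DatabaseInput={"Name": "batch_pipeline_db"})

# Crawler over the raw zone; partitions are discovered from the
# date-based folder layout (e.g. s3://my-pipeline-raw/events/date=2024-01-01/)
glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder IAM role
    DatabaseName="batch_pipeline_db",
    Targets={"S3Targets": [{"Path": "s3://my-pipeline-raw/events/"}]},
    # Cron expression is an assumption -- runs shortly after the nightly ingestion
    Schedule="cron(30 2 * * ? *)",
)
```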

ETL Transformation

  1. Implemented AWS Glue jobs using PySpark.
  2. Transformation steps:
     - Read raw CSV/JSON data from S3.
     - Standardize column names and data types.
     - Handle null and malformed records.
     - Convert the data to Parquet with Snappy compression.
  3. Enabled job bookmarks to ensure incremental processing (a job sketch follows this list).
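
A condensed sketch of what such a Glue PySpark job can look like: bookmarks are tracked through Job.init/Job.commit and the source read goes through the Data Catalog table the crawler created. The database, table, column, and bucket names are placeholders, and the real cleaning logic depends on the dataset.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Job bookmarks are enabled on the job definition; init/commit record progress
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw zone through the catalog table the crawler created
# ("batch_pipeline_db" / "events" are placeholder names)
raw_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="batch_pipeline_db",
    table_name="events",
    transformation_ctx="raw_dyf",  # needed for bookmark tracking
)

df = raw_dyf.toDF()

# Standardize column names (lower snake_case) and drop fully empty records
df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])
df = df.dropna(how="all")

# Example type fix -- the actual casts depend on the source schema
df = df.withColumn("event_time", F.to_timestamp("event_time"))

# Write the processed zone as partitioned, Snappy-compressed Parquet
(
    df.write.mode("append")
    .partitionBy("date")
    .option("compression", "snappy")
    .parquet("s3://my-pipeline-processed/events/")
)

job.commit()  # commits the bookmark state
```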

Query and Validation

  1. Configured Amazon Athena to use the Glue Data Catalog.
  2. Ran validation queries on processed and curated datasets.
  3. Used partition filters to minimize scanned data and reduce cost.
  4. Verified record counts and schema consistency.
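
As an example of one of those validation queries run through the Athena API with a partition filter, a sketch like the one below works; the table, database, date value, and results bucket are placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# The partition filter limits the scan to one day's data, which keeps cost down
query = """
SELECT COUNT(*) AS record_count
FROM events
WHERE date = '2024-01-01'
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "batch_pipeline_db"},
    ResultConfiguration={"OutputLocation": "s3://my-pipeline-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```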

Automation

  1. Triggered Glue Jobs using EventBridge schedules.
  2. Monitored job execution and failures via CloudWatch.
  3. Configured SNS alerts for ETL failures.
  4. Archived older raw data to lower-cost S3 storage classes.
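
One possible way to wire the schedule-and-alert part together is an EventBridge schedule that invokes a small Lambda function, which starts the Glue job and publishes to SNS if the start call fails. This is only a sketch under those assumptions: the job name and topic ARN are placeholders, run-time job failures are better caught from Glue job state-change events, and the archiving step is an S3 lifecycle rule configured separately.

```python
import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

JOB_NAME = "batch-etl-job"  # placeholder Glue job name
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-alerts"  # placeholder


def handler(event, context):
    """Invoked on an EventBridge schedule; starts the Glue job and
    alerts via SNS if the start call itself fails."""
    try:
        run = glue.start_job_run(JobName=JOB_NAME)
        return {"job_run_id": run["JobRunId"]}
    except Exception as exc:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="ETL job failed to start",
            Message=str(exc),
        )
        raise
```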
