- Step 1: Platform Selection
- I considered three options: Spark on EC2 (high control, high operational burden), Databricks, and AWS Glue.
- I selected AWS Glue to minimize operational complexity.
- Step 2: Ingestion Strategy
- Raw data lands in the raw/ prefix in S3
- Glue Crawlers detect schema changes
- The Glue Data Catalog is updated automatically
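The crawler setup above can be sketched as a request body for the Glue `create_crawler` API. This is a minimal illustration, not the post's actual configuration: the crawler name, IAM role, and bucket path are placeholders.

```python
def crawler_request(database: str, raw_path: str) -> dict:
    """Build the request body for a Glue crawler watching the raw/ prefix.

    All names are placeholders; pass the result to
    glue_client.create_crawler(**request) in a real setup.
    """
    return {
        "Name": f"{database}-raw-crawler",        # placeholder crawler name
        "Role": "GlueCrawlerRole",                # placeholder IAM role
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": raw_path}]},
        # Update the catalog in place when the crawler detects schema drift,
        # and log (rather than delete) tables whose source objects vanish.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
```

Scheduling the crawler (or triggering it on new S3 objects via EventBridge) keeps the catalog current without manual DDL.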
- Step 3: Transformation Logic
Glue Jobs perform:
- Type casting
- Null handling
- Deduplication
- Format conversion (CSV → Parquet)
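The transformation steps above can be mirrored in plain Python for illustration. The field names (`order_id`, `customer_id`, `amount`) are hypothetical; a real Glue job would apply the same logic to a DynamicFrame or Spark DataFrame rather than a list of dicts.

```python
def clean_records(rows: list[dict]) -> list[dict]:
    """Apply type casting, null handling, and deduplication to raw rows."""
    seen: set = set()
    out: list[dict] = []
    for row in rows:
        # Type casting: numeric fields arrive as strings in the raw CSV.
        try:
            amount = float(row.get("amount") or 0.0)
        except (TypeError, ValueError):
            continue  # drop rows whose amount cannot be cast
        # Null handling: substitute a sentinel for missing customer ids.
        customer = row.get("customer_id") or "unknown"
        # Deduplication on the order_id business key: first row wins.
        key = row.get("order_id")
        if key in seen:
            continue
        seen.add(key)
        out.append({"order_id": key, "customer_id": customer, "amount": amount})
    return out
```

The final step, CSV → Parquet conversion, happens at write time in the Glue job rather than per record, so it is not shown here.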
- Step 4: Performance Optimization
- Enabled job bookmarks so reruns skip already-processed data
- Tuned DPU allocation to the workload
- Used Parquet + Snappy compression
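The bookmark and DPU settings above map onto Glue job configuration. Below is a sketch of the settings dict for the `create_job` API; the job name, role ARN, and script location are placeholders, and the worker count is an assumed default, not a value from the post.

```python
def glue_job_settings(num_workers: int = 10) -> dict:
    """Assemble settings for a Glue ETL job; all names/ARNs are placeholders."""
    return {
        "Name": "csv-to-parquet-etl",  # placeholder job name
        "Role": "arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/etl.py",  # placeholder
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Job bookmarks track processed input so reruns skip it.
            "--job-bookmark-option": "job-bookmark-enable",
        },
        # DPU tuning: each G.1X worker provides 1 DPU, so capacity is
        # adjusted by changing the worker count.
        "WorkerType": "G.1X",
        "NumberOfWorkers": num_workers,
        "GlueVersion": "4.0",
    }
```

Parquet with Snappy compression is configured at write time in the job script itself (Snappy is Glue's default codec for Parquet output), not in these job-level settings.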
- Step 5: Output Strategy
- Processed data written to S3
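One common layout for the processed output (assumed here, not stated in the post) is Hive-style date partitioning, which lets Athena and Glue prune partitions when querying. A small helper sketching that key structure; the `processed/` prefix and dataset name are placeholders:

```python
from datetime import date


def partitioned_key(dataset: str, run_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key for processed output."""
    return (
        f"processed/{dataset}/"
        f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}/"
        f"{filename}"
    )
```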