AWS Glue & ETL Bookmarks

  • ETL is not a tool → it’s a methodology or workflow.
  • Extract → Transform → Load = a process to move raw data into a clean, usable form for analytics.

🔹 AWS Glue (tool/service)

  • AWS Glue = Amazon’s serverless ETL service.
  • It lets you build and run ETL pipelines without managing servers.
  • Glue provides all the parts you need to implement ETL:

🔑 How Glue Fits Into ETL

  1. Extract
  • Glue connectors pull data from sources: S3, RDS, DynamoDB, JDBC databases, etc.
  • In general, the extract step pulls data from databases, APIs, files, IoT sensors, logs, etc.
  • Example: extract customer data from MySQL, clickstream data from S3, and logs from CloudWatch (see the sketch after this step).
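
For concreteness, here is a minimal sketch of the extract step in a Glue PySpark script. The catalog database and table names (sales_db, raw_orders) are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Extract: read a table that a crawler has already registered
# in the Glue Data Catalog (names are placeholders).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="orders_src",  # used by job bookmarks (see below)
)
```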
  2. Transform
  • Glue generates Spark (PySpark) jobs to clean and enrich data.
  • You can customize transformations with Python.
  • Supports job bookmarks to avoid reprocessing the same data.
  • In general, the transform step cleans, enriches, and reformats the data. Examples:
    • Remove duplicates.
    • Convert dates to a standard format.
    • Join multiple tables (e.g., customer + orders).
    • Aggregate (e.g., daily sales totals).

This step ensures the data is usable and consistent (a minimal sketch follows).
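
Continuing the extract sketch above: convert the DynamicFrames to Spark DataFrames and apply exactly those example transforms. The column names (order_id, order_date, customer_id, amount) and the second source table are hypothetical.

```python
from pyspark.sql import functions as F

# A second source, also assumed to be in the catalog (hypothetical name).
customers = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_customers",
    transformation_ctx="customers_src",
)

orders_df = orders.toDF()        # DynamicFrame -> Spark DataFrame
customers_df = customers.toDF()

clean = (
    orders_df
    .dropDuplicates(["order_id"])  # remove duplicates
    .withColumn(                   # convert dates to a standard format
        "order_date", F.to_date("order_date", "MM/dd/yyyy")
    )
)

daily_sales = (
    clean.join(customers_df, "customer_id")          # join customer + orders
         .groupBy("order_date")
         .agg(F.sum("amount").alias("daily_total"))  # daily sales totals
)
```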

  3. Load
  • Glue can load data into S3 (data lake), Redshift (warehouse), or other targets (see the sketch below).
  • In general, the load step writes the transformed data into a target system:
    • Data warehouse (Amazon Redshift, Snowflake).
    • Data lake (Amazon S3, managed with AWS Lake Formation).
    • Search/analytics systems (Elasticsearch/OpenSearch); query engines such as Athena then read directly from the data lake rather than being loaded.
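
And a sketch of the load step, writing the result to S3 as Parquet. The bucket path is a placeholder; a Redshift or JDBC target would use the same call with a different connection_type.

```python
from awsglue.dynamicframe import DynamicFrame

# Load: write the aggregated result to the data lake as Parquet.
out = DynamicFrame.fromDF(daily_sales, glue_context, "daily_sales")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/daily_sales/"},
    format="parquet",
    transformation_ctx="daily_sales_sink",
)
```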

🔹 Extra Glue Features

  • Glue Data Catalog → a centralized metadata store (like a database of all your datasets).
  • Glue Crawlers → scan data sources and automatically infer schema (tables, columns, data types); see the sketch after this list.
  • Glue Studio → visual interface to design ETL jobs.
  • Glue Streaming ETL → for real-time data pipelines.

🔹 Tools for ETL

  • AWS Glue (serverless ETL service).
  • Apache Spark, Apache Flink.
  • Talend, Informatica.
  • Custom Python jobs with Pandas (see the tiny sketch below).
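
For comparison, the same ETL pattern as a tiny custom Pandas job; the file paths and column names are made up.

```python
import pandas as pd

# Extract: read the raw file.
orders = pd.read_csv("raw_orders.csv")

# Transform: dedupe, standardize dates, aggregate daily totals.
orders = orders.drop_duplicates(subset=["order_id"])
orders["order_date"] = pd.to_datetime(orders["order_date"])
daily = orders.groupby(orders["order_date"].dt.date)["amount"].sum()

# Load: write the result as Parquet (requires pyarrow or fastparquet).
daily.to_frame("daily_total").to_parquet("daily_sales.parquet")
```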

🔹 What is a Job Bookmark?

  • A bookmark is a mechanism to keep track of previously processed data in an ETL job.
  • It ensures that when your ETL job runs again, it only processes new or changed data, instead of reprocessing everything.

🔹 Why It Matters

Without bookmarks:

  • Each ETL run processes the entire dataset → inefficient, expensive, and may cause duplicates.

With bookmarks:

  • ETL job “remembers” where it left off.
  • Next run starts from the last checkpoint (like saving your place in a book); a sketch of enabling bookmarks follows.
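
Here is a sketch of what enabling bookmarks looks like in practice: the job is run with the argument --job-bookmark-option job-bookmark-enable (or "Job bookmark: Enable" in the console), the script brackets its work with Job.init()/Job.commit(), and each tracked source gets a transformation_ctx. Paths and names are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # resumes from the last committed bookmark

# With bookmarks enabled, only data not seen by a previous
# successful run is read from this source.
new_orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/orders/"]},
    format="json",
    transformation_ctx="orders_src",  # key under which bookmark state is kept
)

# ... transform and load ...

job.commit()  # persists the bookmark so the next run starts here
```

Note that without job.commit() the bookmark state is never saved, and the next run reprocesses the same data.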

🔹 Where Used

  • AWS Glue ETL jobs (Spark or Python shell).
  • Glue streaming jobs (with checkpoints).
  • A similar concept exists in Apache Spark and other ETL tools → often called checkpointing or incremental processing (sketched below).
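
For the Spark analogue, a sketch of Structured Streaming's checkpointing: the query persists its progress under checkpointLocation, so a restarted query resumes where it left off instead of reprocessing. The paths and schema are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructType

spark = SparkSession.builder.appName("incremental-demo").getOrCreate()

# Streaming file sources need an explicit schema.
schema = StructType().add("event_id", StringType()).add("amount", DoubleType())

events = spark.readStream.schema(schema).json("s3://my-raw-bucket/events/")

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-clean-bucket/events/")
    # Progress is saved here; a restart resumes from this checkpoint.
    .option("checkpointLocation", "s3://my-clean-bucket/_checkpoints/events/")
    .start()
)
```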

🔑 Takeaway

  • Bookmark = memory of ETL job progress.
  • Ensures incremental processing (only new/changed data).
  • Prevents duplicates, saves time & cost.
