AWS Glue & ETL Bookmarks

  • ETL is not a tool → it’s a methodology or workflow.
  • Extract → Transform → Load = a process to move raw data into a clean, usable form for analytics.

🔹 AWS Glue (tool/service)

  • AWS Glue = Amazon’s serverless ETL service.
  • It lets you build and run ETL pipelines without managing servers.
  • Glue provides all the parts you need to implement ETL:

🔑 How Glue Fits Into ETL

  1. Extract
  • Glue connectors pull data from sources: S3, RDS, DynamoDB, JDBC databases, etc.
  • In general, the extract step pulls data from databases, APIs, files, IoT sensors, logs, etc.
  • Example: extract customer data from MySQL, clickstream data from S3, and logs from CloudWatch (see the sketch after this step).
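
For concreteness, here is a minimal sketch of the extract step in a Glue PySpark script. The catalog database and table names (sales_db, raw_orders) are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Extract: read a table that a crawler has already registered
# in the Glue Data Catalog (names are placeholders).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="orders_src",  # used by job bookmarks (see below)
)
```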
  2. Transform
  • Glue generates Spark (PySpark) jobs to clean and enrich data.
  • You can customize transformations with Python.
  • Supports job bookmarks to avoid reprocessing the same data.
  • In general, the transform step cleans, enriches, and reformats the data. Examples:
    • Remove duplicates.
    • Convert dates to a standard format.
    • Join multiple tables (e.g., customer + orders).
    • Aggregate (e.g., daily sales totals).

This step ensures the data is usable and consistent (a minimal sketch follows).
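
Continuing the extract sketch above: convert the DynamicFrames to Spark DataFrames and apply exactly those example transforms. The column names (order_id, order_date, customer_id, amount) and the second source table are hypothetical.

```python
from pyspark.sql import functions as F

# A second source, also assumed to be in the catalog (hypothetical name).
customers = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_customers",
    transformation_ctx="customers_src",
)

orders_df = orders.toDF()        # DynamicFrame -> Spark DataFrame
customers_df = customers.toDF()

clean = (
    orders_df
    .dropDuplicates(["order_id"])  # remove duplicates
    .withColumn(                   # convert dates to a standard format
        "order_date", F.to_date("order_date", "MM/dd/yyyy")
    )
)

daily_sales = (
    clean.join(customers_df, "customer_id")          # join customer + orders
         .groupBy("order_date")
         .agg(F.sum("amount").alias("daily_total"))  # daily sales totals
)
```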

  3. Load
  • Glue can load data into S3 (data lake), Redshift (warehouse), or other targets (see the sketch below).
  • In general, the load step writes the transformed data into a target system:
    • Data warehouse (Amazon Redshift, Snowflake).
    • Data lake (Amazon S3, managed with AWS Lake Formation).
    • Search/analytics systems (Elasticsearch/OpenSearch); query engines such as Athena then read directly from the data lake rather than being loaded.
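
And a sketch of the load step, writing the result to S3 as Parquet. The bucket path is a placeholder; a Redshift or JDBC target would use the same call with a different connection_type.

```python
from awsglue.dynamicframe import DynamicFrame

# Load: write the aggregated result to the data lake as Parquet.
out = DynamicFrame.fromDF(daily_sales, glue_context, "daily_sales")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/daily_sales/"},
    format="parquet",
    transformation_ctx="daily_sales_sink",
)
```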

🔹 Extra Glue Features

  • Glue Data Catalog → a centralized metadata store (like a database of all your datasets).
  • Glue Crawlers → scan data sources and automatically infer schema (tables, columns, data types); see the sketch after this list.
  • Glue Studio → visual interface to design ETL jobs.
  • Glue Streaming ETL → for real-time data pipelines.

🔹 Tools for ETL

  • AWS Glue (serverless ETL service).
  • Apache Spark, Apache Flink.
  • Talend, Informatica.
  • Custom Python jobs with Pandas (see the tiny sketch below).
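
For comparison, the same ETL pattern as a tiny custom Pandas job; the file paths and column names are made up.

```python
import pandas as pd

# Extract: read the raw file.
orders = pd.read_csv("raw_orders.csv")

# Transform: dedupe, standardize dates, aggregate daily totals.
orders = orders.drop_duplicates(subset=["order_id"])
orders["order_date"] = pd.to_datetime(orders["order_date"])
daily = orders.groupby(orders["order_date"].dt.date)["amount"].sum()

# Load: write the result as Parquet (requires pyarrow or fastparquet).
daily.to_frame("daily_total").to_parquet("daily_sales.parquet")
```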

🔹 What is a Job Bookmark?

  • A bookmark is a mechanism to keep track of previously processed data in an ETL job.
  • It ensures that when your ETL job runs again, it only processes new or changed data, instead of reprocessing everything.

🔹 Why It Matters

Without bookmarks:

  • Each ETL run processes the entire dataset → inefficient, expensive, and may cause duplicates.

With bookmarks:

  • ETL job “remembers” where it left off.
  • Next run starts from the last checkpoint (like saving your place in a book); a sketch of enabling bookmarks follows.
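
Here is a sketch of what enabling bookmarks looks like in practice: the job is run with the argument --job-bookmark-option job-bookmark-enable (or "Job bookmark: Enable" in the console), the script brackets its work with Job.init()/Job.commit(), and each tracked source gets a transformation_ctx. Paths and names are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # resumes from the last committed bookmark

# With bookmarks enabled, only data not seen by a previous
# successful run is read from this source.
new_orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/orders/"]},
    format="json",
    transformation_ctx="orders_src",  # key under which bookmark state is kept
)

# ... transform and load ...

job.commit()  # persists the bookmark so the next run starts here
```

Note that without job.commit() the bookmark state is never saved, and the next run reprocesses the same data.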

🔹 Where Used

  • AWS Glue ETL jobs (Spark or Python shell).
  • Glue streaming jobs (with checkpoints).
  • A similar concept exists in Apache Spark and other ETL tools → often called checkpointing or incremental processing (sketched below).
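
For the Spark analogue, a sketch of Structured Streaming's checkpointing: the query persists its progress under checkpointLocation, so a restarted query resumes where it left off instead of reprocessing. The paths and schema are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructType

spark = SparkSession.builder.appName("incremental-demo").getOrCreate()

# Streaming file sources need an explicit schema.
schema = StructType().add("event_id", StringType()).add("amount", DoubleType())

events = spark.readStream.schema(schema).json("s3://my-raw-bucket/events/")

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-clean-bucket/events/")
    # Progress is saved here; a restart resumes from this checkpoint.
    .option("checkpointLocation", "s3://my-clean-bucket/_checkpoints/events/")
    .start()
)
```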

🔑 Takeaway

  • Bookmark = memory of ETL job progress.
  • Ensures incremental processing (only new/changed data).
  • Prevents duplicates, saves time & cost.
