If you are familiar with databases or data warehouses, you have probably heard the term “ETL.” As the amount of data in organisations grows, so does the need to use that data in analytics to derive business insights. Managing this data efficiently calls for an ETL (Extract, Transform, and Load) process. AWS helps you perform ETL tasks, especially complex ones, with AWS Glue.
AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. So what is ETL?
ETL stands for “extract, transform, and load.” The process of ETL plays a key role in data integration strategies. ETL allows businesses to gather data from multiple sources and consolidate it into a single, centralized location. ETL also makes it possible for different types of data to work together.
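As a toy illustration of the three steps (not AWS-specific), a minimal ETL pass in plain Python might look like this; the CSV data and table schema are made up for the example:

```python
import csv
import io
import sqlite3

# Hypothetical raw data: a CSV export from one of several source systems.
RAW_CSV = """order_id,amount,currency
1001,19.99,usd
1002,5.00,USD
"""

def extract(raw):
    """Extract: read rows out of the source format."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: normalize types and values for the target schema."""
    return [(int(r["order_id"]), float(r["amount"]), r["currency"].upper())
            for r in rows]

def load(rows, conn):
    """Load: write the cleaned rows into the centralized store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

# The single, centralized location -- here just an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
```

AWS Glue performs the same three steps, but against distributed data stores and at Spark scale.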
AWS Glue Architecture
AWS Glue Components
AWS Glue provides a console and API operations to set up and manage your extract, transform, and load (ETL) workload. You can use API operations through several language-specific SDKs and the AWS Command Line Interface (AWS CLI). AWS Glue uses the AWS Glue Data Catalog to store metadata about data sources, transforms, and targets. The Data Catalog is a drop-in replacement for the Apache Hive Metastore. The AWS Glue Jobs system provides a managed infrastructure for defining, scheduling, and running ETL operations on your data.
AWS Glue Console
You use the AWS Glue console to define and orchestrate your ETL workflow. The console calls several API operations in the AWS Glue Data Catalog and AWS Glue Jobs system to perform the following tasks:
Define AWS Glue objects such as jobs, tables, crawlers, and connections.
Schedule when crawlers run.
Define events or schedules for job triggers.
Search and filter lists of AWS Glue objects.
Edit transformation scripts.
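The console tasks above map to Glue API operations, which you can also call directly. This sketch shows scheduling a job trigger and filtering cataloged tables with boto3; the job, trigger, and database names are hypothetical, and the API calls require AWS credentials, so they are wrapped in a function:

```python
def daily_cron(hour, minute=0):
    """Glue schedule expression for a daily run at the given UTC time."""
    return f"cron({minute} {hour} * * ? *)"

def schedule_and_search(database_name="sales_db"):
    import boto3
    glue = boto3.client("glue")
    # Define a schedule-based trigger for a job (daily at 02:00 UTC).
    glue.create_trigger(
        Name="nightly-etl-trigger",
        Type="SCHEDULED",
        Schedule=daily_cron(2),
        Actions=[{"JobName": "sales-etl-job"}],
        StartOnCreation=True,
    )
    # Search and filter cataloged tables by a name pattern.
    resp = glue.get_tables(DatabaseName=database_name, Expression="orders*")
    return [t["Name"] for t in resp["TableList"]]
```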
AWS Glue Data Catalog
The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.
AWS Glue Crawlers and Classifiers
AWS Glue also lets you set up crawlers that can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. From there it can be used to guide ETL operations.
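As a sketch of how one such crawler might be set up over an S3 prefix (the crawler name, role ARN, bucket path, and database name are all hypothetical):

```python
def crawler_definition(name="sales-crawler",
                       role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
                       s3_path="s3://my-data-bucket/sales/"):
    """Request body for glue.create_crawler(). The IAM role must grant Glue
    read access to the data and write access to the Data Catalog."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": "sales_db",  # Data Catalog database to write into
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",  # optional: crawl daily at 02:00 UTC
    }

def create_crawler(cfg):
    """Issue the actual API call (requires AWS credentials)."""
    import boto3
    boto3.client("glue").create_crawler(**cfg)
```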
Which Data Stores Can I Crawl?
Crawlers can crawl both file-based and table-based data stores. Crawlers can crawl the following data stores through their respective native interfaces:
- Amazon Simple Storage Service (Amazon S3)
- Amazon DynamoDB
Crawlers can crawl the following data stores through a JDBC connection:
- Amazon Redshift
- Amazon Relational Database Service (Amazon RDS)
  - Amazon Aurora
  - MariaDB
  - Microsoft SQL Server
  - MySQL
  - Oracle
  - PostgreSQL
- Publicly accessible databases
  - Aurora
  - MariaDB
  - SQL Server
  - MySQL
  - Oracle
  - PostgreSQL
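Before a crawler can reach any of these JDBC sources, it needs a Glue connection. A sketch of registering one might look like this (the URL and credentials are placeholders; a real deployment would pull the password from a secrets store rather than hard-coding it):

```python
def jdbc_connection_input(name="redshift-conn",
                          url="jdbc:redshift://example-cluster:5439/analytics",
                          user="etl_user",
                          password="REPLACE_ME"):
    """ConnectionInput body for glue.create_connection()."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": url,
            "USERNAME": user,
            "PASSWORD": password,
        },
    }

def create_connection(connection_input):
    """Issue the actual API call (requires AWS credentials)."""
    import boto3
    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```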
What Happens When a Crawler Runs?
When a crawler runs, it takes the following actions to interrogate a data store:
Classifies data to determine the format, schema, and associated properties of the raw data – You can configure the results of classification by creating a custom classifier.
Groups data into tables or partitions – Data is grouped based on crawler heuristics.
Writes metadata to the Data Catalog – You can configure how the crawler adds, updates, and deletes tables and partitions.
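How the crawler adds, updates, and deletes catalog entries is controlled by its schema change policy. This sketch shows such a policy alongside a simple run-and-wait loop; the crawler name is hypothetical, and the API calls require AWS credentials:

```python
# Policy governing what the crawler does when schemas change or data disappears.
SCHEMA_CHANGE_POLICY = {
    "UpdateBehavior": "UPDATE_IN_DATABASE",    # or "LOG"
    "DeleteBehavior": "DEPRECATE_IN_DATABASE", # or "LOG" / "DELETE_FROM_DATABASE"
}

def is_finished(state):
    """A crawler reports READY once it is no longer running or stopping."""
    return state == "READY"

def run_and_wait(name="sales-crawler", poll_seconds=30):
    import time
    import boto3
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if is_finished(state):
            return
        time.sleep(poll_seconds)
```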
AWS Glue ETL Operations
Using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations. For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a relational form and save it in Amazon Redshift.
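A generated script for the CSV-to-Redshift example might take roughly this shape (the database, table, connection, and bucket names are hypothetical; the awsglue imports only resolve inside the Glue runtime, so they live in the function body):

```python
# ApplyMapping tuples: (source column, source type, target column, target type).
MAPPINGS = [
    ("col0", "string", "customer_id", "int"),
    ("col1", "string", "order_total", "double"),
]

def run_job():
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from awsglue.transforms import ApplyMapping

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_ctx = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_ctx)
    job.init(args["JOB_NAME"], args)

    # Extract: read the table the crawler cataloged from the raw CSV data.
    source = glue_ctx.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders_csv")

    # Transform: rename and retype columns into the relational schema.
    mapped = ApplyMapping.apply(frame=source, mappings=MAPPINGS)

    # Load: write the result to Amazon Redshift via a cataloged JDBC connection.
    glue_ctx.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "orders", "database": "analytics"},
        redshift_tmp_dir="s3://my-temp-bucket/redshift/")
    job.commit()
```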
The AWS Glue Jobs System
The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.
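Chaining can be expressed with a conditional trigger. This sketch defines a job and fires it when a crawler succeeds; the names, role ARN, and script location are hypothetical, and the API call requires AWS credentials:

```python
def job_definition(name, role_arn, script_location):
    """Request body for glue.create_job()."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # a Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "3.0",
    }

def chain_after_crawler(job_name, crawler_name):
    """Conditional trigger: run the job once the crawler finishes successfully."""
    import boto3
    glue = boto3.client("glue")
    glue.create_trigger(
        Name=f"{crawler_name}-then-{job_name}",
        Type="CONDITIONAL",
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": crawler_name,
            "CrawlState": "SUCCEEDED",
        }]},
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
```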
AWS Glue’s Features
Easy – AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
Integrated – AWS Glue is integrated across a wide range of AWS services.
Serverless – AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
Developer Friendly – AWS Glue generates ETL code that is customizable, reusable, and portable, using familiar technology – Scala, Python, and Apache Spark. You can also import custom readers, writers and transformations into your Glue ETL code. Since the code AWS Glue generates is based on open frameworks, there is no lock-in. You can use it anywhere.