AWS Glue - ETL Power!

#aws #cloud #serverless #architecture

Glue is one of the most powerful tools AWS offers for ETL (Extract, Transform, and Load) processes. It makes it easy to clean and categorize data and to move it from one data store to another. It is also a cost-effective way to test and run ETL jobs. You can use the AWS Glue console to transform data and make it available for querying and searching. It works mostly with semi-structured data.

Guess what? It's serverless, so do you need to think about infrastructure? Not at all!

## When should you use AWS Glue?

To generate ETL scripts that transform, flatten, or cleanse data on its way from source to target.
To create triggers that express the dependencies between the ETL jobs in your job flows. These triggers can be either scheduled or event-based.
To catalog data stored in Amazon S3, one of the data lake services that AWS offers, and then query it with Athena and Redshift Spectrum.
To maintain a unified view of your data, which avoids silos of data spread out across different places.
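
As a rough illustration of the scheduled triggers mentioned above, here is a minimal sketch that builds the request for boto3's `create_trigger` call. The trigger name and cron expression are hypothetical placeholders, and the actual API call is left commented out because it needs AWS credentials.

```python
def build_scheduled_trigger(trigger_name, job_name, cron_expression):
    """Return keyword arguments for boto3's glue.create_trigger call.

    The names passed in are hypothetical placeholders for this sketch.
    """
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue expects schedules in cron(...) form, evaluated in UTC
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


request = build_scheduled_trigger("nightly-trigger", "CSV to CSV", "0 2 * * ? *")
print(request["Schedule"])

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("glue").create_trigger(**request)
```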

## How Glue jobs work

There are three types of jobs in AWS Glue:

Spark - A Spark job is run in an Apache Spark environment managed by AWS Glue. It processes data in batches.
Streaming ETL - It is similar to a Spark job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.
Python shell - The job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. You can use these jobs to schedule and run tasks that don't require an Apache Spark environment.
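
When a job is created through the API instead of the console, the job type is selected by the `Name` field of the job's `Command`. Here is a small sketch of that mapping. The dictionary keys, the helper name, and the script path are made up for illustration. Only the command names come from the Glue API.

```python
# Command names the Glue API uses for each job type
JOB_TYPE_COMMANDS = {
    "spark": "glueetl",
    "streaming": "gluestreaming",
    "python_shell": "pythonshell",
}


def build_command(job_type, script_location):
    """Build the Command structure for a create_job request (hypothetical helper)."""
    if job_type not in JOB_TYPE_COMMANDS:
        raise ValueError(f"unknown job type: {job_type}")
    return {"Name": JOB_TYPE_COMMANDS[job_type], "ScriptLocation": script_location}


command = build_command("python_shell", "s3://example-bucket/scripts/task.py")
print(command)
```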

An AWS Glue job can generate output in different file formats, such as JSON, CSV, ORC, Parquet, and Avro.
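
The output format is chosen with the `format` argument when the job writes its result. The sketch below only assembles those arguments, so it can be tried outside Glue. The helper name and bucket path are placeholders. Inside a real job, the dictionary would be passed to `glueContext.write_dynamic_frame.from_options` together with the frame to write.

```python
# The output formats listed above, as the lowercase names Glue expects
OUTPUT_FORMATS = {"json", "csv", "orc", "parquet", "avro"}


def s3_sink_options(path, output_format):
    """Assemble keyword arguments for writing a DynamicFrame to S3."""
    fmt = output_format.lower()
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return {
        "connection_type": "s3",
        "connection_options": {"path": path},
        "format": fmt,
    }


options = s3_sink_options("s3://example-bucket/output/", "Parquet")
print(options["format"])

# Inside a Glue job, with dyf being a DynamicFrame:
# glueContext.write_dynamic_frame.from_options(frame=dyf, **options)
```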

## Development and testing in the Glue Studio console

Glue Studio is a visual editor with a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue's Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job.

I use Glue Studio extensively to run my ETL jobs. It also has a Job details tab where you can edit the job configuration, add library paths and resource paths, add connections, and so on. You can save, run, and delete your job from Glue Studio. Under the Runs tab you can see whether your job is running, has succeeded, or has thrown an error. It also links to the CloudWatch error and output logs, which let you inspect the errors and output that your Glue job generates.
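
The run status shown in Glue Studio is also available through the API, via `get_job_run`. Here is a minimal polling sketch. To keep it runnable without AWS access it uses a fake client in place of `boto3.client("glue")`, and the run ID is a made-up value.

```python
import time

# States after which a job run will not change any more
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(glue_client, job_name, run_id, poll_seconds=30, sleep=time.sleep):
    """Poll get_job_run until the run reaches a terminal state, then return it."""
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


class FakeGlueClient:
    """Stand-in for boto3.client("glue") so the sketch runs without AWS access."""

    def __init__(self, states):
        self._states = iter(states)

    def get_job_run(self, JobName, RunId):
        return {"JobRun": {"JobRunState": next(self._states)}}


fake = FakeGlueClient(["STARTING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(fake, "CSV to CSV", "jr_example", sleep=lambda s: None)
print(final_state)
```

In a real script you would pass `boto3.client("glue")` and the run ID returned by `start_job_run` instead of the fake client.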

Local development is available for all AWS Glue versions, including AWS Glue version 0.9 and AWS Glue version 1.0 and later.

## Programming ETL jobs in Glue

Sign in to the AWS Management Console and follow these steps: search for the AWS Glue console > navigate to Glue Studio > create a new job > add configurations > security details.

All set to start your programming in the new job -

Basic imports and libraries
```python
import sys

# Glue's built-in transforms and helpers
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Wrap the Spark context in a GlueContext to get Glue's readers and writers
glueContext = GlueContext(SparkContext.getOrCreate())
```
You are now ready to write Glue ETL jobs in your preferred language (Python or Scala).

## Running the job

$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://glue_jobs/library/sample_lib.py"'

This command uses the --arguments parameter to pass arguments to your Glue job. A few argument names are used internally by Glue and should never be set by you: --conf, --debug, --mode, and --JOB_NAME.
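
If you start runs from Python rather than the CLI, the same arguments go into the `Arguments` dictionary of boto3's `start_job_run`. The sketch below guards against the reserved names listed above. The helper name and the example argument names are my own, not part of Glue.

```python
# Argument names Glue uses internally; the caller must not set these
RESERVED_ARGUMENTS = {"--conf", "--debug", "--mode", "--JOB_NAME"}


def build_job_arguments(user_arguments):
    """Prefix each name with "--" and reject the reserved ones."""
    arguments = {}
    for name, value in user_arguments.items():
        key = name if name.startswith("--") else f"--{name}"
        if key in RESERVED_ARGUMENTS:
            raise ValueError(f"{key} is reserved by AWS Glue")
        # Glue passes job arguments as strings
        arguments[key] = str(value)
    return arguments


arguments = build_job_arguments({"input_path": "s3://example-bucket/in/", "day": 7})
print(arguments)

# With AWS credentials configured:
# import boto3
# boto3.client("glue").start_job_run(JobName="CSV to CSV", Arguments=arguments)
```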

When programming ETL jobs, AWS Glue also offers connection parameters that let your Python or Scala scripts connect to other AWS services, external databases, and file formats, such as Amazon DocumentDB, DynamoDB, Amazon S3, Kinesis, Kafka, MongoDB, ORC, JDBC, and Parquet.
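
In a PySpark script, these connection parameters are passed as a `connection_type` string plus a `connection_options` dictionary. Here is a rough sketch for two of the sources, S3 and DynamoDB. The helper names, bucket path, and table name are placeholders, and the code only builds the arguments so it can run outside Glue.

```python
def s3_csv_source(paths):
    """Arguments for reading CSV files with a header row from S3."""
    return {
        "connection_type": "s3",
        "connection_options": {"paths": list(paths), "recurse": True},
        "format": "csv",
        "format_options": {"withHeader": True},
    }


def dynamodb_source(table_name):
    """Arguments for reading a DynamoDB table."""
    return {
        "connection_type": "dynamodb",
        "connection_options": {"dynamodb.input.tableName": table_name},
    }


csv_source = s3_csv_source(["s3://example-bucket/input/"])
table_source = dynamodb_source("example_table")
print(csv_source["connection_type"], table_source["connection_type"])

# Inside a Glue job, either dictionary can be unpacked into the reader:
# dyf = glueContext.create_dynamic_frame.from_options(**csv_source)
```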

Glue is one of the services I use most in my day-to-day work, so I wanted to share my practices with you all.
Do like, share and comment. Happy Blogging! 😊

For local development and testing, check out this blog:
https://aws.amazon.com/blogs/big-data/building-an-aws-glue-etl-pipeline-locally-without-an-aws-account/