Mohammed Ali Chherawalla

Posted on Nov 6, 2023

Code reusability and CD with AWS Glue

#aws #awsglue #dataengineering

In the realm of data engineering, AWS Glue has emerged as a powerful, fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. But what if we told you that you could harness even more power from this service by using custom code and continuous deployment? In this tutorial, we'll show you exactly how to do that.

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It provides all the capabilities needed for data integration so you can start analyzing your data and putting it to use in minutes instead of months.

What use cases does it handle very well?

AWS Glue shines in scenarios where you need to clean, enrich and move data across various data stores. It's especially useful when dealing with large amounts of disparate data, where manual coding would be time-consuming and error-prone.

What are the limitations of the visual editor?

While the visual editor in AWS Glue is a great tool for building ETL jobs, it does have its limitations. It may not provide the flexibility needed for complex transformations or specific use cases. Additionally, it might not be the best fit for developers who prefer coding over visual interfaces.

Why should I use custom code?

Custom code allows you to tailor your ETL jobs to your specific needs, providing flexibility and control that the visual editor might not offer. It enables you to handle complex transformations and unique use cases, making your ETL jobs more efficient and effective.

What we're going to do

In this tutorial, we'll walk you through the process of setting up a continuous deployment (CD) pipeline for your AWS Glue job using GitHub Actions. We'll also show you how to automate the building of a library, which will be pushed to an S3 bucket. This library will then be used within the Glue job.

Prerequisites

Before we start, make sure you have a basic understanding of the following:

How we're going to do it

Here's a step-by-step guide on how we'll proceed:

Step 1: Configure local setup for AWS Glue using Jupyter notebooks

We'll start by setting up your local environment for AWS Glue using Jupyter Notebooks. This will allow you to write, test, and debug your Glue scripts locally.

Step 2: Set up a CD pipeline for our Glue job

Next, we'll set up a CD pipeline for our Glue job using GitHub Actions. This will ensure that every time there's a merge to the dev branch, the script in AWS will be updated.
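To make this step concrete, here is a minimal sketch of what the deploy step of such a pipeline could run after a merge. It assumes the Glue job reads its script from an S3 location, so overwriting that object updates the job. The function name `upload_glue_script`, the bucket, and the key are hypothetical, and `s3_client` is expected to behave like `boto3.client("s3")`.

```python
from pathlib import Path


def upload_glue_script(s3_client, script_path, bucket, key):
    """Upload a local Glue script to S3 and return its s3:// URI.

    Assumption: the Glue job's script location points at this bucket/key,
    so replacing the object is enough to update the job's code.
    """
    path = Path(script_path)
    if not path.is_file():
        raise FileNotFoundError(f"Glue script not found: {path}")

    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call
    # for sending a local file to a bucket.
    s3_client.upload_file(Filename=str(path), Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"
```

In a GitHub Actions step this could be called as `upload_glue_script(boto3.client("s3"), "jobs/my_job.py", "my-glue-assets", "scripts/my_job.py")`, with AWS credentials supplied through repository secrets. The paths and bucket name here are placeholders. Passing the client in as an argument keeps the function easy to test without AWS access.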

Step 3: Create a library

After that, we'll create a library that will contain common functionalities used in our Glue job.
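As an illustration of what "common functionalities" might look like, here is a small sketch of one module in such a library. The package name `glue_common`, the module name, and both helpers are hypothetical examples, not the functions used in the full tutorial. They are written as plain Python so they can be tested without a Spark session.

```python
"""glue_common/cleaning.py: hypothetical shared helpers for Glue jobs."""
import re


def normalize_column_name(name):
    """Convert an arbitrary column header to snake_case."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Trim underscores left at either end, then lowercase the result.
    return cleaned.strip("_").lower()


def drop_empty_fields(record):
    """Return a copy of the record without None or blank-string values."""
    return {
        key: value
        for key, value in record.items()
        if value is not None
        and not (isinstance(value, str) and value.strip() == "")
    }
```

For example, `normalize_column_name(" Order ID (USD) ")` returns `"order_id_usd"`, and `drop_empty_fields` keeps falsy but meaningful values such as `0`.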

Step 4: Automate building of the library in GitHub Actions

We'll then automate the building of the library using GitHub Actions. This will ensure that the latest version of the library is always available for our Glue job.
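One way the build step could work is to zip the package and hand the archive to the upload step. The sketch below assumes the library is a plain Python package folder and that the Glue job will reference the uploaded archive, for example through the `--extra-py-files` job parameter. The function name `build_library_zip` and the package layout are hypothetical.

```python
import zipfile
from pathlib import Path


def build_library_zip(package_dir, output_zip):
    """Zip a Python package so the package folder sits at the archive root."""
    package_dir = Path(package_dir)
    output_zip = Path(output_zip)
    if not (package_dir / "__init__.py").is_file():
        raise ValueError(f"{package_dir} is not a Python package")

    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the package's parent directory,
            # e.g. glue_common/cleaning.py, so `import glue_common` works.
            arcname = source.relative_to(package_dir.parent).as_posix()
            archive.write(source, arcname)
    return output_zip
```

A workflow step could call `build_library_zip("glue_common", "dist/glue_common.zip")` (with `dist/` created beforehand) and then upload the resulting file to the S3 bucket on every merge.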

Step 5: Run this on AWS

Now that we have our library ready, we'll run our Glue job on AWS. We'll do this by creating a pull request on GitHub, or if we're confident, pushing it to the main branch directly. Once our changes are reflected on AWS, we'll hit run, either via the notebook or from the actions drop-down.
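If you would rather trigger the run from code than from the console, the sketch below shows one possible alternative using the Glue API. It is not the approach described above, just a scripted equivalent. It assumes `glue_client` behaves like `boto3.client("glue")`; the function name `run_glue_job` and the polling interval are hypothetical.

```python
import time

# Job run states after which Glue will not change the run any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and wait until it reaches a terminal state.

    Returns a (run_id, final_state) tuple.
    """
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        # Still starting or running: wait before polling again.
        sleep(poll_seconds)
```

The `sleep` parameter exists only so the polling loop can be tested without real delays.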

Step 6: Use common functionality from the library

Finally, we'll show you how to use the common functionalities from the library in your Glue job. This will help you keep your Glue scripts clean and efficient.
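To give a feel for how a script can consume a zipped library, here is a self-contained sketch that builds a tiny archive and imports from it. It relies on standard Python behaviour: a zip file placed on `sys.path` is importable. We assume the archive supplied to the Glue job ends up on the Python path in a similar way. The package `glue_common` and its helper are hypothetical stand-ins for your own library.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def import_from_zip(zip_path, module_name):
    """Put a zip archive on sys.path and import a module from it."""
    zip_path = str(zip_path)
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


def demo():
    """Build a throwaway library archive, import it, and call one helper."""
    helper_source = (
        "def normalize_column_name(name):\n"
        "    return name.strip().lower().replace(' ', '_')\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "glue_common.zip"
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("glue_common/__init__.py", "")
            archive.writestr("glue_common/cleaning.py", helper_source)
        try:
            cleaning = import_from_zip(archive_path, "glue_common.cleaning")
            return cleaning.normalize_column_name(" Order ID ")
        finally:
            # Remove the temporary archive from sys.path again.
            if str(archive_path) in sys.path:
                sys.path.remove(str(archive_path))
```

Inside a real Glue script the import would simply be `from glue_common.cleaning import normalize_column_name`, with no path handling needed, provided the archive is attached to the job.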

Now that you've made it this far, are you ready to dive into the step-by-step tutorial and start building your continuous deployment pipeline for AWS Glue? Click here to access the comprehensive guide.

This was originally published at https://www.wednesday.is. Come say Hi :) and let us know if we can help you design and build digital products.
