
Charles Uneze

Cost-Efficient ETL for Small Datasets using AWS Lambda, S3, Wrangler, and Glue

While studying the AWS Data Associate certification guide and designing a data pipeline, you will come across two kinds of data transformation: transforming a file's format and transforming the actual data.

As a refresher, the file format is usually transformed to improve query speed, and columnar file formats are preferred for this. Popular examples are Parquet and ORC (Optimized Row Columnar).
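To see why columnar formats help, here is a minimal PyArrow sketch; the file name and column names are made up for illustration. Because Parquet stores data by column, a query engine can read only the columns it needs instead of scanning whole rows.

# Read just two columns from a Parquet file; the other columns are never scanned.
# "sales.parquet", "order_id", and "amount" are placeholder names.
import pyarrow.parquet as pq

table = pq.read_table("sales.parquet", columns=["order_id", "amount"])
print(table.num_rows, table.schema.names)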

Transforming the actual data means operating on the rows and columns themselves, for example removing duplicate rows.

The tricky thing about transforming a file format in the cloud is that it can get expensive, and you often don't need that expensive transformer when you aren't regularly working with a huge dataset.

An AWS Glue ETL job uses Spark for transformation, and the smallest data processing unit allocates 4 vCPUs, 16 GB of memory, and 94 GB of disk. While you are only studying, that can get expensive.

An alternative is to convert the file using a library. Since I code in Python, the PyArrow library is a good choice, and because my source was a CSV file, I added the Pandas library too.
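Here is a minimal sketch of that local conversion; the file paths are placeholders.

# Local CSV-to-Parquet conversion with Pandas and the PyArrow engine.
# Input and output paths are placeholders.
import pandas as pd

df = pd.read_csv("input/dataset.csv")
df = df.drop_duplicates()  # transform the actual data
df.to_parquet("output/dataset.parquet", engine="pyarrow", index=False)  # transform the file format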

The solution works locally; it converts CSV to Parquet. However, the installed packages are too large to bundle into a Lambda deployment package. I also didn't want to build a container image, push it to ECR, and have Lambda pull it.

The best alternative I found was the AWS Wrangler package. It bundles Pandas and PyArrow and can be attached to Lambda as a layer.

How to install AWS Wrangler on Lambda

  1. To install it as a Lambda layer, visit the AWS Serverless Application Repository.
  2. Deploy it for free.
  3. Once the deployment completes, the layer becomes available to attach to your Lambda function.
  4. Now you can write your transformation logic in Python and run it as a Lambda function, referencing the newly added layer (a minimal sketch follows this list).
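Below is a minimal sketch of such a function. It assumes the function is triggered by an S3 upload event; the output path and Glue workflow name are placeholders. wr.s3.read_csv, wr.s3.to_parquet, and the boto3 start_workflow_run call do the heavy lifting.

# Lambda handler sketch: the output bucket, prefix, and workflow name are placeholders.
import boto3
import awswrangler as wr

glue = boto3.client("glue")

def lambda_handler(event, context):
    # The S3 event tells us which CSV object was just uploaded.
    record = event["Records"][0]["s3"]
    source = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Read the CSV, drop duplicate rows, and write it back to S3 as Parquet.
    df = wr.s3.read_csv(source)
    df = df.drop_duplicates()
    wr.s3.to_parquet(df=df, path="s3://my-etl-bucket/parquet/dataset.parquet")

    # Kick off the Glue workflow that crawls the Parquet file and creates the table.
    glue.start_workflow_run(Name="my-etl-workflow")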

Defining the rest of the ETL logic

  1. Create a Glue catalog, database, crawler, trigger, and workflow.
  2. Create an isolated workgroup for Athena, the Presto-based query service.
  3. Add your CSV dataset to S3.
  4. As previously described, your Lambda function reads the dataset on S3:
    • Removes duplicate rows
    • Converts the CSV into Parquet
    • Saves the new file to an S3 path
    • Then it triggers your Glue workflow
  5. The Glue workflow then does the following:
    • Crawls the Parquet file
    • Creates a table
  6. Then you can use Athena to query the data in S3 (a query sketch follows this list).
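Once the table exists, a query is one call with the same AWS Wrangler package; a minimal sketch, assuming placeholder database, table, and workgroup names:

# Query the crawled table through Athena; all names below are placeholders.
import awswrangler as wr

df = wr.athena.read_sql_query(
    "SELECT * FROM my_table LIMIT 10",
    database="my_glue_database",
    workgroup="my-isolated-workgroup",
)
print(df.head())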

Demo

Architecture

I have defined this logic using Terraform so that you can play with it. Clone the following repository, update the backend configuration, bucket name, and region, then apply your configuration.

git clone https://github.com/network-charles/aws-etl-wrangler.git
cd aws-etl-wrangler
terraform init -backend-config=backend.conf
terraform apply --auto-approve
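When you are done experimenting, tear everything down so the resources stop accruing charges:

terraform destroy --auto-approve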

Final takeaway

If you are studying for the AWS Data Associate exam with small datasets, then once you understand how a full Glue ETL job works, you can switch your pipeline to AWS Wrangler instead and keep costs down.
