<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thiago Panini</title>
    <description>The latest articles on DEV Community by Thiago Panini (@thiagopanini).</description>
    <link>https://dev.to/thiagopanini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1047759%2F19f64412-fd89-4466-bda2-0bd2b39187a8.jpg</url>
      <title>DEV Community: Thiago Panini</title>
      <link>https://dev.to/thiagopanini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thiagopanini"/>
    <language>en</language>
    <item>
      <title>Building Modular AWS Infrastructure with Terraform: Inside the tfbox Project</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Sun, 20 Jul 2025 00:15:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-modular-aws-infrastructure-with-terraform-inside-the-tfbox-project-4fc2</link>
      <guid>https://dev.to/aws-builders/building-modular-aws-infrastructure-with-terraform-inside-the-tfbox-project-4fc2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome, fellow cloud wrangler! Whether you’re a seasoned DevOps pro, a data engineer moonlighting as an infrastructure architect, or just someone who likes their YAML with a side of automation, you’re in the right place. &lt;/p&gt;

&lt;p&gt;In this article we'll go through the &lt;code&gt;tfbox&lt;/code&gt; project: a curated collection of production-ready Terraform modules for AWS, designed to accelerate cloud provisioning and standardize best practices across teams. By encapsulating common AWS resources such as DynamoDB tables, IAM roles, and Lambda layers (with many others planned), &lt;code&gt;tfbox&lt;/code&gt; empowers engineers to compose robust infrastructure with minimal boilerplate and maximum flexibility.&lt;/p&gt;

&lt;p&gt;Whether you’re here to learn, contribute, or just see how someone else solved a real world problem, grab a coffee and let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modules Included
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB Table&lt;/strong&gt;: Configurable provisioning of tables, keys, attributes, and billing modes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Role&lt;/strong&gt;: Automated creation of roles, trust policies, and policy attachments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Layer&lt;/strong&gt;: Build and deploy Lambda layers from Python requirements, with packaging and cleanup automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architectural Patterns and Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modular Terraform Design
&lt;/h3&gt;

&lt;p&gt;Each AWS resource is encapsulated as a standalone Terraform module, adhering to the following principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Modules are self-contained, with their own variables, resources, and outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability&lt;/strong&gt;: Modules can be referenced independently in any Terraform configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: Every module is documented, with input variables, outputs, and usage examples available in the repository Wiki.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example: Referencing a Module
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"dynamodb_table"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"git::https://github.com/ThiagoPanini/tfbox.git//aws/dynamodb-table?ref=v1.0.0"&lt;/span&gt;
  &lt;span class="c1"&gt;# ...module variables...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern enables versioned, remote module usage, critical for reproducible infrastructure and CI/CD workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Versioning and Release Management
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;tfbox&lt;/code&gt; leverages the &lt;a href="https://github.com/marketplace/actions/terraform-module-releaser" rel="noopener noreferrer"&gt;terraform-module-releaser&lt;/a&gt; GitHub Action for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated releases&lt;/strong&gt;: New module versions are published upon merging changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation updates&lt;/strong&gt;: The Wiki is refreshed with every release, ensuring up-to-date module references.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic versioning&lt;/strong&gt;: Modules are tagged for precise dependency management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This automation reduces manual overhead and ensures consistency across environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean Separation of Concerns
&lt;/h3&gt;

&lt;p&gt;Each module directory typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main.tf&lt;/code&gt;: Core resource definitions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;variables.tf&lt;/code&gt;: Input variable declarations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;locals.tf&lt;/code&gt;: Local values for intermediate computations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;versions.tf&lt;/code&gt;: Provider and module version constraints.&lt;/li&gt;
&lt;li&gt;Additional files (e.g., &lt;code&gt;policies.tf&lt;/code&gt;, &lt;code&gt;role.tf&lt;/code&gt; for IAM) for logical separation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure supports maintainability and extensibility, allowing teams to add new modules or enhance existing ones without cross-module coupling.&lt;/p&gt;
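&lt;p&gt;To make the layout concrete, a &lt;code&gt;variables.tf&lt;/code&gt; in such a module might declare its inputs like this (a hypothetical sketch with illustrative names, not the actual module code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# variables.tf: input variable declarations (illustrative only)
variable "table_name" {
  description = "Name of the DynamoDB table to create"
  type        = string
}

variable "billing_mode" {
  description = "Billing mode: PROVISIONED or PAY_PER_REQUEST"
  type        = string
  default     = "PAY_PER_REQUEST"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;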

&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda Layer Module: Automated Packaging
&lt;/h3&gt;

&lt;p&gt;The Lambda Layer module stands out for its automation of Python dependency packaging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requirements-driven builds&lt;/strong&gt;: Layers are built from a &lt;code&gt;requirements.txt&lt;/code&gt;, streamlining dependency management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated cleanup&lt;/strong&gt;: Temporary files and artifacts are managed within a dedicated directory, reducing clutter and risk of stale state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform-native orchestration&lt;/strong&gt;: All steps are orchestrated via Terraform, enabling declarative infrastructure and repeatable builds.&lt;/li&gt;
&lt;/ul&gt;
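&lt;p&gt;A common way to implement this kind of pipeline in plain Terraform (shown here as a generic sketch, not necessarily the exact &lt;code&gt;tfbox&lt;/code&gt; code) combines a &lt;code&gt;local-exec&lt;/code&gt; provisioner with the &lt;code&gt;archive_file&lt;/code&gt; data source:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Install Python dependencies into a build directory whenever requirements.txt changes
resource "null_resource" "pip_install" {
  triggers = {
    requirements = filemd5("${path.module}/requirements.txt")
  }

  provisioner "local-exec" {
    command = "pip install -r ${path.module}/requirements.txt -t ${path.module}/build/python"
  }
}

# Zip the build directory so it can be published as a layer
data "archive_file" "layer" {
  type        = "zip"
  source_dir  = "${path.module}/build"
  output_path = "${path.module}/layer.zip"
  depends_on  = [null_resource.pip_install]
}

# Publish the zipped artifact as a Lambda layer version
resource "aws_lambda_layer_version" "this" {
  layer_name          = "my-layer"
  filename            = data.archive_file.layer.output_path
  compatible_runtimes = ["python3.12"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;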

&lt;h3&gt;
  
  
  IAM Role Module: Policy Management
&lt;/h3&gt;

&lt;p&gt;The IAM Role module provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Template-driven trust policies&lt;/strong&gt;: Simplifies cross-service and cross-account role assumptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible policy attachment&lt;/strong&gt;: Supports both inline and managed policies, catering to diverse security requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locals for policy composition&lt;/strong&gt;: Uses Terraform &lt;code&gt;locals&lt;/code&gt; to dynamically construct policy documents, improving readability and maintainability.&lt;/li&gt;
&lt;/ul&gt;
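&lt;p&gt;As an illustration of the trust-policy piece, a module like this could compose the policy with the &lt;code&gt;aws_iam_policy_document&lt;/code&gt; data source (a generic sketch with assumed names, not the module's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Trust policy allowing AWS Lambda to assume the role (illustrative)
data "aws_iam_policy_document" "trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "this" {
  name               = "my-app-role"
  assume_role_policy = data.aws_iam_policy_document.trust.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;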

&lt;h3&gt;
  
  
  DynamoDB Table Module: Flexible Schema Definition
&lt;/h3&gt;

&lt;p&gt;The DynamoDB Table module allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configurable keys and attributes&lt;/strong&gt;: Supports various partition and sort key configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Billing mode selection&lt;/strong&gt;: Enables choice between provisioned and on-demand throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data-driven resource creation&lt;/strong&gt;: Uses variables and locals to abstract table schema, making it easy to adapt to changing requirements.&lt;/li&gt;
&lt;/ul&gt;
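&lt;p&gt;Putting it together, a call to such a module might look like the snippet below (the input names are illustrative; check the repository Wiki for the real interface):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;module "orders_table" {
  source = "git::https://github.com/ThiagoPanini/tfbox.git//aws/dynamodb-table?ref=v1.0.0"

  # Illustrative inputs: the actual variable names may differ
  name         = "orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "order_id"

  attributes = [
    { name = "order_id", type = "S" }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;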

&lt;h2&gt;
  
  
  Deployment and Usage
&lt;/h2&gt;

&lt;p&gt;Modules are designed for seamless integration into existing Terraform projects. By referencing modules via Git URLs and version tags, teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin module versions&lt;/strong&gt; for stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade modules&lt;/strong&gt; with minimal disruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share best practices&lt;/strong&gt; across projects and teams.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;tfbox&lt;/code&gt; is infrastructure engineering for the real world: modular, automated, and ready for action. By abstracting common AWS resources into reusable Terraform modules, it helps you move fast, stay consistent, and avoid reinventing the wheel (again).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why you’ll love it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid, reliable AWS provisioning&lt;/li&gt;
&lt;li&gt;Automated versioning and docs&lt;/li&gt;
&lt;li&gt;Clean, maintainable module design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add more AWS modules (VPC, ECS, RDS, bring your wish list!)&lt;/li&gt;
&lt;li&gt;Integrate automated testing and compliance checks&lt;/li&gt;
&lt;li&gt;Enhance observability and monitoring integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🤝 Let’s Build This Together
&lt;/h2&gt;

&lt;p&gt;If you’ve made it this far, awesome. That means you’re probably the kind of builder who enjoys digging into code, improving ideas, or helping others learn.&lt;/p&gt;

&lt;p&gt;This project is open source, and that’s not just a license, it’s an invitation. Whether it’s fixing a typo, proposing a new feature, or writing better docs, your contribution helps &lt;strong&gt;make the whole ecosystem stronger&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every pull request is a chance to learn, grow, and connect. Let’s keep this feedback loop alive and build tools that empower devs everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get in touch
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/ThiagoPanini" rel="noopener noreferrer"&gt;@ThiagoPanini&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/thiago-panini/" rel="noopener noreferrer"&gt;Thiago Panini&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hashnode: &lt;a href="https://panini.hashnode.dev/" rel="noopener noreferrer"&gt;panini-tech-lab&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>datadelivery: Providing public datasets to explore in AWS</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Sun, 09 Apr 2023 01:57:58 +0000</pubDate>
      <link>https://dev.to/aws-builders/datadelivery-providing-public-datasets-to-explore-in-aws-2029</link>
      <guid>https://dev.to/aws-builders/datadelivery-providing-public-datasets-to-explore-in-aws-2029</guid>
      <description>&lt;h1&gt;
  
  
  Project Story
&lt;/h1&gt;

&lt;p&gt;In this documentation page, I will talk about the big idea behind the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module and how it can be a huge milestone in your AWS learning journey.&lt;/p&gt;

&lt;p&gt;🪄 This is the story about how I had to decouple my open source solutions to have a more scalable kit of projects that help users in their analytics learning journey on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraglue: the Beginning and the First
&lt;/h2&gt;

&lt;p&gt;No, you are not reading anything wrong, nor are you on the wrong documentation page. The truth is that we can't talk about &lt;em&gt;datadelivery&lt;/em&gt; without talking about &lt;em&gt;terraglue&lt;/em&gt; first.&lt;/p&gt;

&lt;p&gt;I know, it's a bunch of unknown names and maybe you're wondering what's going on. But let me tell you something really important: the &lt;em&gt;datadelivery&lt;/em&gt; project was born from the &lt;em&gt;terraglue&lt;/em&gt; project. To know more about &lt;em&gt;terraglue&lt;/em&gt; (and to understand the decoupling process), I suggest you pause this reading for a little while and go to the &lt;a href="https://terraglue.readthedocs.io/en/latest/story/" rel="noopener noreferrer"&gt;main story about how it all started&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is not like Dark, the Netflix TV show, where you travel through time, but you will probably want to know the beginning of everything before going ahead on this page. Feel free to choose your beginning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz8ufcob0cqca4f96r7t.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz8ufcob0cqca4f96r7t.gif" alt="A gif of Jonas, the main character of a Netflix TV Show called Dark" width="540" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is the Data?
&lt;/h2&gt;

&lt;p&gt;Well, regardless of how you got here and which of my other open source projects you are familiar with, the &lt;em&gt;datadelivery&lt;/em&gt; project was born to solve a specific problem: the lack of public data sources available to explore AWS services.&lt;/p&gt;

&lt;p&gt;In fact, this is an honest claim from anyone who wants to learn more about AWS services using different datasets. After all, data is probably the most important thing in any data project (at the risk of being redundant).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"So where can we find datasets to explore?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nowadays, finding public datasets isn't too hard. There are many websites, blog posts, books, and other sources that offer links to download datasets on the most varied subjects. To name a few, we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets/" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://archive-beta.ics.uci.edu/" rel="noopener noreferrer"&gt;UCI Machine Learning Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitHub repository from books such as:

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data" rel="noopener noreferrer"&gt;Spark - The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/databricks/LearningSparkV2/tree/master/databricks-datasets/learning-spark-v2" rel="noopener noreferrer"&gt;Learning Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/PacktPublishing/Apache-Hive-Essentials-Second-Edition/tree/master/data" rel="noopener noreferrer"&gt;Apache Hive Essentials&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;So, it's enough to say that there are many ways to download and use public datasets for whatever learning purpose. Fair enough.&lt;/p&gt;

&lt;p&gt;But in our context we are talking about using those datasets inside AWS, right? What about all the effort needed to download the files, upload them into a storage system (like S3), and catalog all the metadata into the Data Catalog? That still seems a little too hard.&lt;/p&gt;

&lt;p&gt;🚛💨 This is where &lt;em&gt;datadelivery&lt;/em&gt; shines!&lt;/p&gt;

&lt;h2&gt;
  
  
  datadelivery: A Data Exploration Toolkit
&lt;/h2&gt;

&lt;p&gt;I think you get the idea, but just to reinforce: the &lt;em&gt;datadelivery&lt;/em&gt; project provides an efficient way to activate services in an AWS account so that users can explore preselected public datasets. It does that by providing a Terraform module that can be called directly from its source GitHub repository.&lt;/p&gt;

&lt;p&gt;I state that in the project documentation home page, but this is the perfect time to clarify what really happens when users call the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Five different buckets are created in the target AWS account&lt;/li&gt;
&lt;li&gt;The contents of the &lt;code&gt;data/&lt;/code&gt; folder in the source module are uploaded to the SoR bucket&lt;/li&gt;
&lt;li&gt;An IAM role is created with enough permissions to run a Glue Crawler&lt;/li&gt;
&lt;li&gt;A Glue Crawler is created with an S3 target pointing to the SoR bucket&lt;/li&gt;
&lt;li&gt;A cron expression is configured to trigger the Glue Crawler 2 minutes after the infrastructure deployment finishes&lt;/li&gt;
&lt;li&gt;All files in the SoR bucket (previously in the &lt;code&gt;data/&lt;/code&gt; folder) are cataloged as new tables in the Data Catalog&lt;/li&gt;
&lt;li&gt;A preconfigured Athena workgroup is created to enable users to run queries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If writing it isn't enough, take a look at the project diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/feature/first-deploy/docs/assets/imgs/project-diagram.png?raw=true" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqg447zuvcpywlfld2pe.png" alt="A diagram of services deployed" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
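&lt;p&gt;From the user's point of view, all of those steps are triggered by a single module call (the version tag below is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Calling the module straight from its source GitHub repository
module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery?ref=v0.1.0"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;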

&lt;p&gt;Do you want to know more about the "behind the scenes" of the project's construction? Below I present some code details about how all the infrastructure was declared in the module.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage Structure in S3
&lt;/h3&gt;

&lt;p&gt;This was the first infrastructure block created in the project. After all, it would be impossible to provide the exploration of public datasets in analytics services in AWS without thinking about the storage layer.&lt;/p&gt;

&lt;p&gt;To do such a thing, I declared some useful variables into a &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/locals.tf" rel="noopener noreferrer"&gt;&lt;code&gt;locals.tf&lt;/code&gt;&lt;/a&gt; Terraform file as you can see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Defining&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sources&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;help&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_region"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"current"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"current"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Defining&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;variables&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;used&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;module&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;locals&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;account_id&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data.aws_caller_identity.current.account_id&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;region_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data.aws_region.current.name&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Creating&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;map&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;bucket&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;deployed&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bucket_names_map&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"sor"&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-sor-data-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"sot"&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-sot-data-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"spec"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-spec-data-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"athena"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-athena-query-results-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"glue"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-glue-assets-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;more&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;code&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;below&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;aws_region&lt;/code&gt; and the &lt;code&gt;aws_caller_identity&lt;/code&gt; Terraform data sources were created to make it possible to get some useful attributes from the target AWS account, like the &lt;code&gt;account_id&lt;/code&gt; and &lt;code&gt;region_name&lt;/code&gt; local values.&lt;/p&gt;

&lt;p&gt;According to the official &lt;a href="https://developer.hashicorp.com/terraform/language/values/locals" rel="noopener noreferrer"&gt;Terraform documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A local value assigns a name to an expression, so you can use the name multiple times within a module instead of repeating the expression. [...] The expressions in local values are not limited to literal constants; they can also reference other values in the module in order to transform or combine them, including variables, resource attributes, or other local."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the heart of the storage layer is the &lt;code&gt;bucket_names_map&lt;/code&gt; local value, which builds a map of bucket names using dynamic information retrieved from the aforementioned data sources.&lt;/p&gt;

&lt;p&gt;So, the next step was to create a &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/storage.tf" rel="noopener noreferrer"&gt;storage.tf&lt;/a&gt; Terraform file declaring an &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket" rel="noopener noreferrer"&gt;aws_s3_bucket&lt;/a&gt; Terraform resource for each entry in the &lt;code&gt;bucket_names_map&lt;/code&gt; local value, as you can see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Creating&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;buckets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_s3_bucket"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"this"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;for_each&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local.bucket_names_map&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bucket&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;each.value&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;force_destroy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The big idea about the resource block code above is the definition of a &lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/for_each" rel="noopener noreferrer"&gt;&lt;code&gt;for_each&lt;/code&gt;&lt;/a&gt; meta-argument that makes it possible to create several similar objects without writing a separate block for each one.&lt;/p&gt;

&lt;p&gt;And once again, according to the Terraform official documentation page:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If a resource or module block includes a for_each argument whose value is a map or a set of strings, Terraform creates one instance for each member of that map or set."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that's how multiple buckets could be created using a local value that maps different bucket names.&lt;/p&gt;

&lt;p&gt;In addition to that, other bucket configurations and resources were defined in the &lt;code&gt;storage.tf&lt;/code&gt; file, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public access block with &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_public_access_block" rel="noopener noreferrer"&gt;aws_s3_bucket_public_access_block&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Server-side encryption with &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_server_side_encryption_configuration" rel="noopener noreferrer"&gt;aws_s3_bucket_server_side_encryption_configuration&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
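&lt;p&gt;These companion resources can follow the same &lt;code&gt;for_each&lt;/code&gt; pattern as the buckets themselves; for example, a public access block for every bucket might be declared like this (a sketch based on the pattern above, not necessarily the exact project code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Blocking public access for every bucket created above
resource "aws_s3_bucket_public_access_block" "this" {
  for_each = aws_s3_bucket.this

  bucket                  = each.value.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;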

&lt;p&gt;And finally, with all the buckets created and configured, it was possible to upload the preselected public datasets originally stored in the &lt;code&gt;data/&lt;/code&gt; folder of the source GitHub repository. Before showing the Terraform code block that does that, let's look at the structure of this folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├───data
│   ├───bike_data
│   │   ├───tbl_bikedata_station
│   │   └───tbl_bikedata_trip
│   ├───br_ecommerce
│   │   ├───tbl_brecommerce_customers
│   │   ├───tbl_brecommerce_geolocation
│   │   ├───tbl_brecommerce_orders
│   │   ├───tbl_brecommerce_order_items
│   │   ├───tbl_brecommerce_payments
│   │   ├───tbl_brecommerce_products
│   │   ├───tbl_brecommerce_reviews
│   │   └───tbl_brecommerce_sellers
│   ├───flights_data
│   │   ├───tbl_flights_airport_codes_na
│   │   ├───tbl_flights_departure_delays
│   │   └───tbl_flights_summary_data
│   ├───tbl_airbnb
│   ├───tbl_blogs
│   └───tbl_iot_devices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you can see some data folders simulating table structures, with raw files in each one of them. To provide some context, the table below shows some useful information about the datasets in the &lt;code&gt;data/&lt;/code&gt; source repository folder:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;🎲 Dataset&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;🏷️ Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;🔗 Source Link&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bike Data&lt;/td&gt;
&lt;td&gt;The dataset has information about the San Francisco bike share service from August 2013 to August 2015.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/bike-data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brazilian E-Commerce&lt;/td&gt;
&lt;td&gt;The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flights Data&lt;/td&gt;
&lt;td&gt;This dataset has information about flight travels in the United States.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/flight-data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Airbnb&lt;/td&gt;
&lt;td&gt;A dataset with interactions with Airbnb in many of their services. There are 700 attributes to be explored.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blogs&lt;/td&gt;
&lt;td&gt;A small, fake dataset with information about blogs published on the internet.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IoT Devices&lt;/td&gt;
&lt;td&gt;A fake dataset with measurements from IoT devices collected in a company facility.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/LearningSparkV2/tree/master/databricks-datasets/learning-spark-v2/iot-devices" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the future, new datasets will likely be added to &lt;em&gt;datadelivery&lt;/em&gt;, giving users an even wider range of possibilities.&lt;/p&gt;

&lt;p&gt;So, now that you know the contents of the &lt;code&gt;data/&lt;/code&gt; folder, let's turn back to the &lt;code&gt;storage.tf&lt;/code&gt; file to see how the upload to S3 works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Adding local files on SoR bucket
resource "aws_s3_object" "data_sources" {
  for_each               = fileset(local.data_path, "**")
  bucket                 = aws_s3_bucket.this["sor"].bucket
  key                    = each.value
  source                 = "${local.data_path}${each.value}"
  server_side_encryption = "aws:kms"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the end, it's all about using the &lt;a href="https://developer.hashicorp.com/terraform/language/functions/fileset" rel="noopener noreferrer"&gt;&lt;code&gt;fileset()&lt;/code&gt;&lt;/a&gt; Terraform function to list the contents of a local path (represented by a local value called &lt;code&gt;data_path&lt;/code&gt;). The upload target is the SoR bucket (since we're dealing with raw files, it makes sense to store them in the System of Record layer).&lt;/p&gt;
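&lt;p&gt;As a quick illustration (the file names here are hypothetical, just to show the mapping), each relative path returned by &lt;code&gt;fileset()&lt;/code&gt; becomes both the S3 object key and part of the local source path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical result of fileset(local.data_path, "**")
# ["tbl_blogs/blogs.json", "tbl_iot_devices/iot_devices.json", ...]

# For the entry "tbl_blogs/blogs.json", the resource receives:
#   key    = "tbl_blogs/blogs.json"
#   source = "${local.data_path}tbl_blogs/blogs.json"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;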

&lt;p&gt;The &lt;code&gt;data_path&lt;/code&gt; local value is nothing more than a combination of the path module and the &lt;code&gt;data/&lt;/code&gt; folder, as you can see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Referencing a data folder where the files to be uploaded are located
data_path = "${path.module}/data/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is how the storage structure was built. In the end, users get a set of S3 buckets with the public datasets stored in the SoR bucket.&lt;/p&gt;

&lt;p&gt;This is just the beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crawling the Data
&lt;/h3&gt;

&lt;p&gt;We know that uploading raw files to S3 isn't enough to build all the elements needed to explore analytics services on AWS. It is also necessary to &lt;strong&gt;catalog&lt;/strong&gt; data in order to make it accessible.&lt;/p&gt;

&lt;p&gt;The first step taken to accomplish this mission was to create a &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/catalog.tf" rel="noopener noreferrer"&gt;&lt;code&gt;catalog.tf&lt;/code&gt;&lt;/a&gt; Terraform file declaring all the infrastructure needed to register the metadata of the raw files uploaded by &lt;code&gt;storage.tf&lt;/code&gt; in the Data Catalog.&lt;/p&gt;

&lt;p&gt;So we start by defining an &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_database" rel="noopener noreferrer"&gt;aws_glue_catalog_database&lt;/a&gt; resource to create the databases in the Glue Data Catalog that will receive the new tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Creating Glue databases on Data Catalog
resource "aws_glue_catalog_database" "mesh" {
  for_each    = var.glue_db_names
  name        = each.value
  description = "Database ${each.value} for storing tables in this specific layer"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we can see the &lt;code&gt;glue_db_names&lt;/code&gt; variable, taken from a &lt;code&gt;variables.tf&lt;/code&gt; Terraform file which declares all the input variables accepted by the &lt;em&gt;datadelivery&lt;/em&gt; module. The database names are defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "glue_db_names" {
  description = "List of database names for storing Glue catalog tables"
  type        = map(string)
  default = {
    "sor"  = "db_datadelivery_sor",
    "sot"  = "db_datadelivery_sot",
    "spec" = "db_datadelivery_spec"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each entry in the &lt;code&gt;glue_db_names&lt;/code&gt; map variable, a new database will be created in the target AWS account. It's important to note that only the "db_datadelivery_sor" database receives the cataloged data (the SoR layer handles raw data, so it's enough to create tables in this database alone). The similar SoT and Spec databases are provided in case users want to register their own tables from processes like Glue jobs or Athena queries.&lt;/p&gt;
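&lt;p&gt;Since &lt;code&gt;glue_db_names&lt;/code&gt; is a plain &lt;code&gt;map(string)&lt;/code&gt;, users who prefer their own naming convention can override it in the module call. The sketch below is hypothetical (the database names are illustrative, not part of the project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical module call overriding the default database names
module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery"

  glue_db_names = {
    "sor"  = "db_mycompany_raw",
    "sot"  = "db_mycompany_trusted",
    "spec" = "db_mycompany_refined"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;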

&lt;p&gt;Then, the most important resource to make things happen is the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_crawler" rel="noopener noreferrer"&gt;aws_glue_crawler&lt;/a&gt;. Before showing the Terraform declaration block, let's take a look at the definition of a Glue Crawler.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html" rel="noopener noreferrer"&gt;official AWS documentation page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.&lt;/p&gt;

&lt;p&gt;[...]&lt;/p&gt;

&lt;p&gt;When a crawler runs, it takes the following actions to interrogate a data store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classifies data to determine the format, schema, and associated properties of the raw data&lt;/li&gt;
&lt;li&gt;Groups data into tables or partitions&lt;/li&gt;
&lt;li&gt;Writes metadata to the Data Catalog&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the following Terraform code declares a Glue Crawler resource with some special attributes that will be explained later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Defining a Glue Crawler
resource "aws_glue_crawler" "sor" {
  database_name = var.glue_db_names["sor"]
  name          = "terracatalog-glue-crawler-sor"
  role          = aws_iam_role.glue_crawler_role.arn

  s3_target {
    path = "s3://${local.bucket_names_map["sor"]}"
  }

  schedule = local.crawler_cron_expr

  depends_on = [
    aws_s3_object.data_sources,
    aws_iam_policy.glue_policies,
    aws_iam_role.glue_crawler_role
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some points need to be clarified here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The target database for the Crawler is the SoR database&lt;/li&gt;
&lt;li&gt;The target storage location for the Crawler is the SoR S3 bucket&lt;/li&gt;
&lt;li&gt;A new IAM role is previously created in the &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/iam.tf" rel="noopener noreferrer"&gt;&lt;code&gt;iam.tf&lt;/code&gt;&lt;/a&gt; Terraform file with all the permissions needed to run a Glue Crawler (you can check the source link if you want to see it in detail)&lt;/li&gt;
&lt;li&gt;A cron expression is defined in the &lt;code&gt;locals.tf&lt;/code&gt; file to &lt;strong&gt;run the Crawler&lt;/strong&gt; 2 minutes (by default) after the infrastructure deployment finishes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is surely a great way to automate the crawling process without having to access the AWS account and run the Crawler manually. So let's take a deep dive into it.&lt;/p&gt;

&lt;p&gt;Coming back to the &lt;code&gt;locals.tf&lt;/code&gt; Terraform file: to build a valid cron expression that runs the Crawler a couple of minutes after the infrastructure deployment, it was necessary to get the current time at execution time. The approach chosen involved the &lt;a href="https://developer.hashicorp.com/terraform/language/functions/timestamp" rel="noopener noreferrer"&gt;&lt;code&gt;timestamp()&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.hashicorp.com/terraform/language/functions/timeadd" rel="noopener noreferrer"&gt;&lt;code&gt;timeadd()&lt;/code&gt;&lt;/a&gt; Terraform functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# [...]

# Extracting current timestamp and adding a delay
timestamp_to_run = timeadd(timestamp(), var.delay_to_run_crawler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;delay_to_run_crawler&lt;/code&gt; variable can be passed by the user in a &lt;em&gt;datadelivery&lt;/em&gt; module call. Its default value is "2m", meaning the timestamp used to build the cron expression is the current timestamp delayed by 2 minutes.&lt;/p&gt;
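&lt;p&gt;If two minutes isn't enough for a slower deployment, the delay can be raised in the module call. This is a sketch assuming the variable accepts any duration string supported by &lt;code&gt;timeadd()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Running the Crawler 10 minutes after the deployment instead of 2
module "datadelivery" {
  source               = "git::https://github.com/ThiagoPanini/datadelivery"
  delay_to_run_crawler = "10m"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;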

&lt;p&gt;So, the next step was to extract all the elements needed for a valid cron expression. This is done by calling the &lt;a href="https://developer.hashicorp.com/terraform/language/functions/formatdate" rel="noopener noreferrer"&gt;&lt;code&gt;formatdate()&lt;/code&gt;&lt;/a&gt; Terraform function with different date format arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Getting date information
cron_day    = formatdate("D", local.timestamp_to_run)
cron_month  = formatdate("M", local.timestamp_to_run)
cron_year   = formatdate("YYYY", local.timestamp_to_run)
cron_hour   = formatdate("h", local.timestamp_to_run)
cron_minute = formatdate("m", local.timestamp_to_run)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then, the last step was to build the cron expression from the individual local values for each cron attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Building a cron expression for Glue Crawler to run minutes after infrastructure deploy
crawler_cron_expr = "cron(${local.cron_minute} ${local.cron_hour} ${local.cron_day} ${local.cron_month} ? ${local.cron_year})"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if a user calls the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module at 6:45PM and the infrastructure deployment takes about 5 minutes to finish (at 6:50PM, to be exact), then the Glue Crawler will run in the target AWS account at 6:52PM (and never again).&lt;/p&gt;
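&lt;p&gt;To make that example concrete, here is roughly what the local values would look like for an apply finishing around 6:50PM (the date is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# timestamp() evaluated at apply time : 2023-03-31T18:50:00Z
# timeadd(timestamp(), "2m")          : 2023-03-31T18:52:00Z
crawler_cron_expr = "cron(52 18 31 3 ? 2023)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;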

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl33x5t3x9mqsdm6nqn0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl33x5t3x9mqsdm6nqn0.gif" alt="A gif from the Neflix TV show called Dark where Jonas, the main charactere, is asking " width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, coming back to the &lt;code&gt;catalog.tf&lt;/code&gt; Terraform file, the last thing done is the creation of an Athena workgroup through the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/athena_workgroup" rel="noopener noreferrer"&gt;aws_athena_workgroup&lt;/a&gt; Terraform resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Defining an Athena preconfigured workgroup
resource "aws_athena_workgroup" "analytics" {
  name          = "terracatalog-workgroup"
  force_destroy = true

  configuration {
    result_configuration {
      output_location = "s3://${local.bucket_names_map["athena"]}"

      encryption_configuration {
        encryption_option = "SSE_KMS"
        kms_key_arn       = data.aws_kms_key.s3.arn
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The great thing about it is that this is a preconfigured workgroup that stores Athena query results in the parametrized &lt;em&gt;datadelivery&lt;/em&gt; Athena bucket (taken from the &lt;code&gt;bucket_names_map&lt;/code&gt; local value). Users will be able to start using the Athena query editor without worrying about any other settings.&lt;/p&gt;

&lt;p&gt;So, with the &lt;code&gt;storage.tf&lt;/code&gt; and &lt;code&gt;catalog.tf&lt;/code&gt; files, users can extract the real power of the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module. The &lt;code&gt;iam.tf&lt;/code&gt; file, as said before, is also essential, since it provides the IAM policies and role that make everything work (especially the crawler process).&lt;/p&gt;

&lt;h2&gt;
  
  
  So What About Now?
&lt;/h2&gt;

&lt;p&gt;Well, by now I really invite all readers to join in and read more about the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module. There is a &lt;a href="https://datadelivery.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;comprehensive documentation page&lt;/a&gt; hosted on &lt;a href="https://readthedocs.org/" rel="noopener noreferrer"&gt;readthedocs&lt;/a&gt; with a lot of useful information about how this project can help users on their analytics journey in AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k02oll0funbsz3rm7e6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k02oll0funbsz3rm7e6.gif" alt="A gif showing Jonas and Martha, the two main characters from Dark, a Netflix TV Show" width="498" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With everything presented here, to start using &lt;em&gt;datadelivery&lt;/em&gt; in your AWS account you just need to call the module from its source GitHub repository, as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calling datadelivery module with default configuration
module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, if you want to know more, I reinforce: don't forget to check the &lt;a href="https://datadelivery.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;official documentation page&lt;/a&gt;. I really believe that anyone using AWS to learn more about analytics can benefit from &lt;em&gt;datadelivery&lt;/em&gt; and its features!&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/values/locals" rel="noopener noreferrer"&gt;Terraform - Local Values&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/for_each" rel="noopener noreferrer"&gt;Terraform - The &lt;code&gt;for_each&lt;/code&gt; Meta-Argument&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/formatdate" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;formatdate&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/fileset" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;fileset&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html" rel="noopener noreferrer"&gt;AWS - Defining Crawlers in AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/crawler-running.html" rel="noopener noreferrer"&gt;AWS - How Crawlers Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/timestamp" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;timestamp&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/timeadd" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;timeadd&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>analytics</category>
      <category>iac</category>
    </item>
    <item>
      <title>The story about how I took my learning on AWS Glue to the next level</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Fri, 31 Mar 2023 12:33:43 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-story-about-how-i-took-my-learning-on-aws-glue-to-the-next-level-42c5</link>
      <guid>https://dev.to/aws-builders/the-story-about-how-i-took-my-learning-on-aws-glue-to-the-next-level-42c5</guid>
      <description>&lt;h1&gt;
  
  
  Project Story
&lt;/h1&gt;

&lt;p&gt;For everyone reading this page, I ask poetic license to tell you a story about the challenges I faced on my analytics learning journey using AWS services and what I did to overcome them.&lt;/p&gt;

&lt;p&gt;🪄 In fact, this is a story about how I really started to build and share open source solutions that help people learn more about analytics services on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vtx77j74isuyjr00r5v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vtx77j74isuyjr00r5v.webp" alt="Snoopy reading a book" width="500" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it All Started
&lt;/h2&gt;

&lt;p&gt;First of all, it's important to provide some context on how it all started. I'm an Analytics Engineer working for a financial company that has a lot of data and an increasing number of opportunities to use it. The company adopted the Data Mesh architecture to give data teams more autonomy to build and share their own datasets through three different layers: SoR (System of Record), SoT (Source of Truth), and Spec (Specialized).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There are a lot of articles explaining the Data Mesh architecture and the differences between the SoR, SoT, and Spec layers for storing and sharing data. In fact, this is a really useful way to improve analytics in organizations.&lt;/p&gt;

&lt;p&gt;If you want to know a little bit more about it, there are some links that can help you on this mission:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;a href="https://martinfowler.com/articles/data-mesh-principles.html" rel="noopener noreferrer"&gt;Data Mesh Principles and Logical Architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;a href="https://www.integrify.com/blog/posts/system-of-record-vs-source-of-truth/" rel="noopener noreferrer"&gt;Building a System of Record vs. a Source of Truth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;a href="https://www.linkedin.com/pulse/difference-between-system-record-source-truth-santosh-kudva/" rel="noopener noreferrer"&gt;The Difference Between System of Record and Source of Truth&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, the company decided to rely mainly on AWS services for this journey. From an analytics perspective, services like Glue, EMR, Athena, and QuickSight popped up as really good options to solve real problems in the company.&lt;/p&gt;

&lt;p&gt;And that's how the story begins: an Analytics Engineer trying his best to deep dive into those services in a sandbox environment, learning everything he could.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Steps
&lt;/h2&gt;

&lt;p&gt;Well, I had to choose an initial goal. After deciding to start learning more about AWS Glue to develop ETL jobs, I looked for documentation pages, watched some tutorial videos to prepare myself and talked to other developers to collect thoughts and experiences about the whole thing.&lt;/p&gt;

&lt;p&gt;After a little while, I found myself ready to start building something useful. In my hands, I had an AWS sandbox account and a noble desire to learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht3fneg3f5z3phhstpqr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht3fneg3f5z3phhstpqr.gif" alt="Michael B. Jordan on Creed movie" width="500" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Here is an important piece of information:&lt;/strong&gt;&lt;br&gt;
The AWS sandbox account came from a subscription I had on a learning platform. The platform allowed subscribers to use an AWS environment for learning purposes, which was really nice. However, it was an ephemeral environment with an automatic shutdown mechanism after a few hours. This behavior is one of the key points of the story. Keep it in mind; you will soon see why.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Creating the Storage Layers
&lt;/h2&gt;

&lt;p&gt;I started to get my hands dirty by creating S3 buckets to replicate something close to a Data Lake storage architecture in a Data Mesh approach. So, one day I just logged into my sandbox AWS account and created the following buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bucket to store SoR data&lt;/li&gt;
&lt;li&gt;A bucket to store SoT data&lt;/li&gt;
&lt;li&gt;A bucket to store Spec data&lt;/li&gt;
&lt;li&gt;A bucket to store Glue assets&lt;/li&gt;
&lt;li&gt;A bucket to store Athena query results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-storage.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-storage.png" alt="Storage resources" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Uploading Files on Buckets
&lt;/h2&gt;

&lt;p&gt;Once the storage structure was created, I started to search for public datasets to be part of my learning path. The idea was to upload some data into the buckets to make it possible to do some analytics, such as creating ETL jobs or even querying with Athena.&lt;/p&gt;

&lt;p&gt;So, I found the excellent &lt;a href="https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce" rel="noopener noreferrer"&gt;Brazilian E-Commerce dataset&lt;/a&gt; on Kaggle and it fit perfectly. I was now able to download the data and upload it to the SoR bucket to simulate raw data available for further analysis in an ETL pipeline.&lt;/p&gt;

&lt;p&gt;And now my diagram was like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-data.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-data.png" alt="Storage resources with data" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cataloging Data
&lt;/h2&gt;

&lt;p&gt;Uploading data to S3 buckets wasn't enough for a complete analytics experience. It was important to catalog the files' metadata in the Data Catalog to make them visible to services like Glue and Athena.&lt;/p&gt;

&lt;p&gt;So, the next step was registering all the files of the Brazilian E-Commerce dataset as tables in the Data Catalog. For this task, I tested two different approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Building and running &lt;code&gt;CREATE TABLE&lt;/code&gt; queries on Athena based on file schema&lt;/li&gt;
&lt;li&gt;Manually inputting fields and table properties on Data Catalog&lt;/li&gt;
&lt;/ol&gt;
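
&lt;p&gt;Just as a hedged sketch of the first approach, an Athena DDL statement for one of the CSV files could look like the query below. The database, bucket name, and column list here are illustrative and would follow the actual file schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: registering a CSV file as an external table on the Data Catalog
CREATE EXTERNAL TABLE IF NOT EXISTS ecommerce.orders (
  order_id string,
  customer_id string,
  order_status string,
  order_purchase_timestamp string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-sor-bucket/ecommerce/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;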

&lt;p&gt;By the way, as Athena proved itself to be a good service for exploring cataloged data, I took the opportunity to create a workgroup with the appropriate parameters for storing query results.&lt;/p&gt;

&lt;p&gt;And now my diagram was like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-catalog.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-catalog.png" alt="Catalog process using Data Catalog and Athena" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating IAM Roles and Policies
&lt;/h2&gt;

&lt;p&gt;A huge milestone was reached at that moment. I had a storage structure, I had data to be used and I had all metadata information already cataloged on Data Catalog.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"What was still missing to start creating Glue jobs?"&lt;/strong&gt;&lt;br&gt;
The answer was IAM roles and policies. Simple as that.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point I must tell you that creating IAM roles wasn't an easy step to complete. First of all, it was a little bit difficult to understand all the permissions needed to run Glue jobs on AWS, to log steps to CloudWatch, and everything else.&lt;/p&gt;

&lt;p&gt;Suddenly I found myself searching through docs pages and studying which Glue actions to include in my policy. After a while, I was able to create a set of policies for a solid IAM role to be assumed by my future first Glue job on AWS.&lt;/p&gt;

&lt;p&gt;And, once again, I added more pieces to my diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-iam.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-iam.png" alt="IAM role and policies" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Glue Job
&lt;/h2&gt;

&lt;p&gt;Well, after all that manual setup I was finally able to create my first Glue job on AWS and build ETL pipelines using the public datasets available in the Data Catalog.&lt;/p&gt;

&lt;p&gt;I was really excited at that moment, and the big idea was to simulate a data pipeline that read data from the SoR layer, transformed it, and put the curated dataset in the SoT layer. After learning a lot about the &lt;code&gt;awsglue&lt;/code&gt; library and elements like &lt;code&gt;GlueContext&lt;/code&gt; and &lt;code&gt;DynamicFrame&lt;/code&gt;, I was able to create a PySpark application with enough features to reach that goal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-glue.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-glue.png" alt="Final diagram with Glue job" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now my diagram was complete!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.tenor.com%2Fs89PZe54F4IAAAAd%2Frenuu-thanos.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.tenor.com%2Fs89PZe54F4IAAAAd%2Frenuu-thanos.gif" alt="Thanos resting on the end of Infinity War movie" width="1024" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Not Yet: A Real Big Problem
&lt;/h2&gt;

&lt;p&gt;As much as this looks like a happy-ending story, the ending doesn't happen just yet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The AWS sandbox account problem.&lt;/strong&gt;&lt;br&gt;
Well, remember, as I said at the beginning of the story, that I had an AWS sandbox account in my hands? By sandbox account I mean a temporary environment that shuts down after a few hours.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that was the first big problem: I needed to recreate ALL the components of the final diagram every single day.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The huge manual effort.&lt;/strong&gt;&lt;br&gt;
As you can imagine, I spent almost one hour setting things up every time I wanted to practice with Glue. It was a huge manual effort, and that hour was almost half of the sandbox's lifetime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Something needed to be done.&lt;/p&gt;

&lt;p&gt;Of course, you could ask me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Why don't you build that architecture in your personal account?"&lt;/strong&gt;&lt;br&gt;
That was a nice option but the problem was the charges. I was just trying to learn Glue and running jobs multiple times (that's the expectation when you are learning) may incur some unpredictable costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ok, so now I think everyone is trying to figure out what I did to solve those problems.&lt;/p&gt;

&lt;p&gt;Yes, I found a way!&lt;/p&gt;

&lt;p&gt;If you are still here with me, I think you would like to know it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraglue: A New Project is Born
&lt;/h2&gt;

&lt;p&gt;Well, the problems were laid out, and I had to think of a solution to make my life easier for that simple learning task.&lt;/p&gt;

&lt;p&gt;The answer was right in front of me all the time. If my main problem was spending time recreating infrastructure over and over again, why not &lt;strong&gt;automate&lt;/strong&gt; the infrastructure creation with an IaC tool? That way, every time my sandbox environment expired, I could create everything again with much less overhead.&lt;/p&gt;

&lt;p&gt;That was a fantastic idea, and I started to use &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; to declare the resources used in my architecture. I split things into modules, and suddenly I had enough code to create buckets, upload data, catalog metadata, and create IAM roles, policies, and a preconfigured Glue job!&lt;/p&gt;

&lt;p&gt;While creating all this, I just felt that everyone who had faced the same learning challenges that brought me to this point would enjoy the project. So I prepared it, documented it, and called it &lt;a href="https://terraglue.readthedocs.io/en/latest/?badge=latest" rel="noopener noreferrer"&gt;terraglue&lt;/a&gt;.&lt;/p&gt;
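
&lt;p&gt;Just as a hedged illustration, consuming a Terraform module like this straight from its GitHub repository could look like the snippet below (the exact source reference and input variables may differ from the project's current documentation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical example: calling the terraglue module from its Git source
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;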

&lt;p&gt;&lt;br&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fterraglue%2Fblob%2Fmain%2Fdocs%2Fassets%2Fimgs%2Fheader-readme.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fterraglue%2Fblob%2Fmain%2Fdocs%2Fassets%2Fimgs%2Fheader-readme.png%3Fraw%3Dtrue" alt="terraglue-logo" width="600" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was really impressive how I could deploy all the components with just a couple of commands. If I used to spend about one hour creating and configuring every service manually, after &lt;em&gt;terraglue&lt;/em&gt; that time was reduced to just seconds!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/diagram-user-view.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Farchitecture%2Fdiagram-user-view.png" alt="Terraglue diagram with IaC modules" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Constant Evolution with New Solutions
&lt;/h2&gt;

&lt;p&gt;After a while, I noticed that the &lt;em&gt;terraglue&lt;/em&gt; project had become a central point for almost everything. The source repository contained all the infrastructure (including buckets, data files, and a Glue job) and also a Spark application with modules, classes, and methods used to develop an example Glue job.&lt;/p&gt;

&lt;p&gt;That wasn't a good thing.&lt;/p&gt;

&lt;p&gt;Imagine if I started a learning journey on EMR, for example. I would have to duplicate almost all of the terraglue infrastructure into a new project just to have components like buckets and data files. The same applies to the application layer of terraglue: I would have to copy and paste scripts from project to project. It was not scalable.&lt;/p&gt;

&lt;p&gt;So, thinking about the best way to provide a great experience for me and for everyone who could use all of this, I started to &lt;strong&gt;decouple&lt;/strong&gt; the initial idea of terraglue into new open source projects. And that's the current state of my solutions shelf:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fdatadelivery%2Fblob%2Ffeature%2Ffirst-deploy%2Fdocs%2Fassets%2Fimgs%2Fproducts-overview-v2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fdatadelivery%2Fblob%2Ffeature%2Ffirst-deploy%2Fdocs%2Fassets%2Fimgs%2Fproducts-overview-v2.png%3Fraw%3Dtrue" alt="A diagram showing how its possible to use other solutions like datadelivery, terraglue and sparksnake" width="480" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case you want to know more about each of these new solutions, I'll leave some links to documentation pages created to provide the best possible user experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚛 &lt;a href="https://datadelivery.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;datadelivery&lt;/strong&gt;&lt;/a&gt;: a Terraform module that helps users to have public datasets to explore using AWS services&lt;/li&gt;
&lt;li&gt;🌖 &lt;a href="https://terraglue.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;terraglue&lt;/strong&gt;&lt;/a&gt;: a Terraform module that helps users to create their own preconfigured Glue jobs in their AWS account&lt;/li&gt;
&lt;li&gt;✨ &lt;a href="https://sparksnake.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;sparksnake&lt;/strong&gt;&lt;/a&gt;: a Python package that contains useful Spark features created to help users to develop their own Spark applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there may be more to come! Terraglue was the first and, for a while, the only one. Then new solutions were created to fill specific needs. I see this as a continuous process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If I had to summarize this story in a few topics, I think the best sequence would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🤔 An Analytics Engineer wanted to learn AWS Glue and other analytics services on AWS&lt;/li&gt;
&lt;li&gt;🤪 He started to build a complete infrastructure in his AWS sandbox account manually&lt;/li&gt;
&lt;li&gt;🥲 Every time this AWS sandbox account expired, he did it all again&lt;/li&gt;
&lt;li&gt;😮‍💨 He got tired of doing this every time, so he started to think about how to solve the problem&lt;/li&gt;
&lt;li&gt;😉 He started to apply Terraform to declare all infrastructure&lt;/li&gt;
&lt;li&gt;🤩 He created a pocket AWS environment to learn analytics and called it &lt;em&gt;terraglue&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🤔 He noticed that new projects could be created from &lt;em&gt;terraglue&lt;/em&gt; in order to make it more scalable&lt;/li&gt;
&lt;li&gt;🚀 He now has a shelf of open source solutions that can help many people on learning AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's the real story about how I faced a huge problem on my learning journey and used Terraform to declare AWS components and take my learning experience to the next level.&lt;/p&gt;

&lt;p&gt;I really hope the solutions presented here can be useful for anyone who needs to learn more about analytics services on AWS.&lt;/p&gt;

&lt;p&gt;Finally, if you like this story, don't forget to interact with me, star the repos, comment and leave your feedback. Thank you!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;AWS Glue&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS - Glue Official Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html" rel="noopener noreferrer"&gt;AWS - Jobs Parameters Used by AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog" rel="noopener noreferrer"&gt;AWS - GlueContext Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html" rel="noopener noreferrer"&gt;AWS - DynamicFrame Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/50992655/etl-job-failing-with-pyspark-sql-utils-analysisexception-in-aws-glue" rel="noopener noreferrer"&gt;Stack Overflow - Job Failing by Job Bookmark Issue - Empty DataFrame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html" rel="noopener noreferrer"&gt;AWS - Calling AWS Glue APIs in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#aws-glue-programming-python-libraries-zipping" rel="noopener noreferrer"&gt;AWS - Using Python Libraries with AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/53718221/aws-glue-data-catalog-temporary-tables-and-apache-spark-createorreplacetempview" rel="noopener noreferrer"&gt;Spark Temporary Tables in Glue Jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.plainenglish.io/understanding-all-aws-glue-import-statements-and-why-we-need-them-59279c402224" rel="noopener noreferrer"&gt;Medium - Understanding All AWS Glue Import Statements and Why We Need Them&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/" rel="noopener noreferrer"&gt;AWS - Develop and test AWS Glue jobs Locally using Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html" rel="noopener noreferrer"&gt;AWS - Creating OpenID Connect (OIDC) identity providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Terraform&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform - Hashicorp Terraform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/expressions/conditionals" rel="noopener noreferrer"&gt;Terraform - Conditional Expressions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/68911814/combine-count-and-for-each-is-not-possible" rel="noopener noreferrer"&gt;Stack Overflow - combine "count" and "for_each" on Terraform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Apache Spark&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sparkbyexamples.com/pyspark/pyspark-sql-date-and-timestamp-functions/" rel="noopener noreferrer"&gt;SparkByExamples - Pyspark Date Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/docs/latest/configuration.html" rel="noopener noreferrer"&gt;Spark - Configuration Properties&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce" rel="noopener noreferrer"&gt;Stack Overflow - repartition() vs coalesce()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;GitHub&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.conventionalcommits.org/en/v1.0.0/#summary" rel="noopener noreferrer"&gt;Conventional Commits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://semver.org/" rel="noopener noreferrer"&gt;Semantic Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/angular/angular/blob/main/CONTRIBUTING.md#-commit-message-format" rel="noopener noreferrer"&gt;GitHub - Angular Commit Message Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/conventional-changelog/commitlint" rel="noopener noreferrer"&gt;GitHub - commitlint&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shields.io/" rel="noopener noreferrer"&gt;shields.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.codecov.com/docs" rel="noopener noreferrer"&gt;Codecoverage - docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/marketplace?type=actions" rel="noopener noreferrer"&gt;GitHub Actions Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://endjin.com/blog/2022/09/continuous-integration-with-github-actions" rel="noopener noreferrer"&gt;Continuous Integration with GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect" rel="noopener noreferrer"&gt;GitHub - About security hardening with OpenID Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.eliasbrange.dev/posts/secure-aws-deploys-from-github-actions-with-oidc/#:~:text=To%20be%20able%20to%20authenticate,Provider%20type%2C%20select%20OpenID%20Connect." rel="noopener noreferrer"&gt;GitHub - Securing deployments to AWS from GitHub Actions with OpenID Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions" rel="noopener noreferrer"&gt;GitHub - Workflow syntax for GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=L1f6N6NcgPw&amp;amp;t=3043s&amp;amp;ab_channel=EduardoMendes" rel="noopener noreferrer"&gt;Eduardo Mendes - Live de Python #170 - GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Docker&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/marketplace/actions/docker-run-action" rel="noopener noreferrer"&gt;GitHub Docker Run Action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aschmelyun.com/blog/using-docker-run-inside-of-github-actions/" rel="noopener noreferrer"&gt;Using Docker Run inside of GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/62546743/running-aws-glue-jobs-in-docker-container-outputs-com-amazonaws-sdkclientexcep" rel="noopener noreferrer"&gt;Stack Overflow - Unable to find region when running docker locally&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Tests&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=MjQCvJmc31A&amp;amp;" rel="noopener noreferrer"&gt;Eduardo Mendes - Live de Python #167 - Pytest: Uma Introdução&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=sidi9Z_IkLU&amp;amp;t" rel="noopener noreferrer"&gt;Eduardo Mendes - Live de Python #168 - Pytest Fixtures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=uzVewG8M6r0&amp;amp;t=1127s" rel="noopener noreferrer"&gt;Databricks - Data + AI Summit 2022 - Learn to Efficiently Test ETL Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-testing/" rel="noopener noreferrer"&gt;Real Python - Getting Started with Testing in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.inspiredpython.com/article/five-advanced-pytest-fixture-patterns" rel="noopener noreferrer"&gt;Inspired Python - Five Advanced Pytest Fixture Patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/getmoto/moto/blob/master/tests/test_glue/fixtures/datacatalog.py" rel="noopener noreferrer"&gt;getmoto/moto - mock inputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://about.codecov.io/blog/should-i-include-test-files-in-code-coverage-calculations/" rel="noopener noreferrer"&gt;Codecov - Do test files belong in code coverage calculations?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.jenkins.io/browse/JENKINS-63177" rel="noopener noreferrer"&gt;Jenkins Issue: Endpoint does not contain a valid host name&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Others&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/pulse/difference-between-system-record-source-truth-santosh-kudva/" rel="noopener noreferrer"&gt;Differences between System of Record and Source of Truth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce" rel="noopener noreferrer"&gt;Olist Brazilian E-Commerce Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/6843549/are-there-any-benefits-from-using-a-staticmethod" rel="noopener noreferrer"&gt;Stack Overflow - @staticmethod&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>glue</category>
      <category>terraform</category>
      <category>analytics</category>
      <category>iac</category>
    </item>
    <item>
      <title>Improving ETL jobs on AWS with sparksnake</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Mon, 20 Mar 2023 21:53:15 +0000</pubDate>
      <link>https://dev.to/aws-builders/improving-etl-jobs-on-aws-with-sparksnake-2e35</link>
      <guid>https://dev.to/aws-builders/improving-etl-jobs-on-aws-with-sparksnake-2e35</guid>
<description>&lt;p&gt;Have you ever wished for a set of Spark features and reusable code blocks that could improve, once and for all, your experience developing Spark applications on AWS services like Glue and EMR? In this article I'll introduce &lt;a href="https://sparksnake.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;sparksnake&lt;/code&gt;&lt;/a&gt;, a Python package that can be a game changer for Spark application development on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea behind sparksnake
&lt;/h2&gt;

&lt;p&gt;To understand the main reasons for bringing &lt;code&gt;sparksnake&lt;/code&gt; to life, let's first take a quick look at the boilerplate code presented whenever a new job is created in the AWS console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt;

&lt;span class="c1"&gt;## @params: [JOB_NAME]
&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let me show you two simple perspectives from Glue users at different experience levels.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Beginner:&lt;/strong&gt; it's reasonable to say that the code block above isn't something we see every day outside Glue, right? So people trying Glue for the first time are likely to have questions about elements like GlueContext, Job, and their special methods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experienced Developer:&lt;/strong&gt; even for this group, the Glue setup can be painful (especially because it has to be repeated every single time a new job is started).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, the main idea behind &lt;code&gt;sparksnake&lt;/code&gt; is to take every common step of a Spark application developed on AWS services and encapsulate it in classes and methods that users can call with a single line of code. In other words, all the boilerplate code shown above is replaced in &lt;code&gt;sparksnake&lt;/code&gt; by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Importing sparksnake's main class
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sparksnake.manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkETLManager&lt;/span&gt;

&lt;span class="c1"&gt;# Initializing a glue job
&lt;/span&gt;&lt;span class="n"&gt;spark_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkETLManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init_job&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just one of a series of features available in &lt;code&gt;sparksnake&lt;/code&gt;! The great thing about it is the ability to call the same methods and functions for common Spark features whether your jobs run on AWS services like Glue and EMR or locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ohkolrquk9fn7ys5lkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ohkolrquk9fn7ys5lkn.png" alt="A simple diagram showing how sparksnake package inherits features from AWS services like Glue and EMR to provide users a custom experience" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The library structure
&lt;/h2&gt;

&lt;p&gt;After this quick overview of &lt;code&gt;sparksnake&lt;/code&gt;, it's worth understanding a little more about how the library is structured under the hood.&lt;/p&gt;

&lt;p&gt;At the time of writing, the package has two modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;manager&lt;/code&gt;: the central module, hosting the &lt;code&gt;SparkETLManager&lt;/code&gt; class with common Spark features. It inherits features from other classes based on the operation mode chosen by the user&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;glue&lt;/code&gt;: a companion module hosting the &lt;code&gt;GlueJobManager&lt;/code&gt; class with features specific to Glue jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the common usage pattern, users import the &lt;code&gt;SparkETLManager&lt;/code&gt; class and choose an operation mode according to where the Spark application will be developed and deployed. The operation mode tells &lt;code&gt;SparkETLManager&lt;/code&gt; which class to inherit features from, providing a custom experience on top of AWS services like Glue and EMR.&lt;/p&gt;
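&lt;p&gt;To make the idea of an operation mode more concrete, here is a minimal, illustrative Python sketch of a mode-based dispatch pattern. This is &lt;em&gt;not&lt;/em&gt; &lt;code&gt;sparksnake&lt;/code&gt;'s actual source code, and the class names below (&lt;code&gt;ETLManager&lt;/code&gt;, &lt;code&gt;GlueFeatures&lt;/code&gt;, &lt;code&gt;LocalFeatures&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```python
# Illustrative sketch only: a facade class picks a feature set based on
# the operation mode chosen by the user. All class names here are
# hypothetical and do NOT come from sparksnake's source code.

class BaseFeatures:
    """Helpers shared by every operation mode."""
    mode: str = "base"

    def describe(self) -> str:
        return f"running in '{self.mode}' mode"

class GlueFeatures(BaseFeatures):
    """Features that only make sense inside an AWS Glue job."""
    def init_job(self) -> str:
        return "glue job context initialized"

class LocalFeatures(BaseFeatures):
    """Features for developing and testing Spark code locally."""
    def init_job(self) -> str:
        return "local spark session initialized"

class ETLManager:
    """Facade: routes the user to a feature class based on the mode."""
    _modes = {"glue": GlueFeatures, "local": LocalFeatures}

    def __new__(cls, mode: str):
        try:
            impl = cls._modes[mode]
        except KeyError:
            raise ValueError(f"unsupported mode: {mode!r}")
        instance = impl()
        instance.mode = mode
        return instance

manager = ETLManager(mode="glue")
print(manager.describe())   # running in 'glue' mode
print(manager.init_job())   # glue job context initialized
```

&lt;p&gt;Swapping &lt;code&gt;mode="glue"&lt;/code&gt; for &lt;code&gt;mode="local"&lt;/code&gt; routes the same calls to the local feature set, which is the essence of the pattern described above.&lt;/p&gt;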

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;Now that you know the main concepts behind &lt;code&gt;sparksnake&lt;/code&gt;, let's summarize some of its features:&lt;/p&gt;

&lt;p&gt;🤖 An enhanced development experience for Spark applications deployed as jobs on AWS services like Glue and EMR&lt;br&gt;
🌟 Common Spark operations for improving ETL steps, available through custom classes and methods&lt;br&gt;
⚙️ No need to worry about hard and complex service setup (e.g. with sparksnake you can get all the elements of a Glue job on AWS with a single line of code)&lt;br&gt;
👁️‍🗨️ Improved application observability with detailed log messages in CloudWatch&lt;br&gt;
🛠️ Exception handling already embedded in library methods&lt;/p&gt;
&lt;h2&gt;
  
  
  A quickstart
&lt;/h2&gt;

&lt;p&gt;To start using &lt;code&gt;sparksnake&lt;/code&gt;, just install it using pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sparksnake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's say we are developing a new Glue job on AWS and want to use &lt;code&gt;sparksnake&lt;/code&gt; to make things easier. To show how powerful the library can be, imagine the job has to read a series of data sources. It would be very painful to write multiple lines of code to read each data source from the catalog.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;sparksnake&lt;/code&gt;, we can read multiple data sources from the catalog with a single line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generating a dictionary of Spark DataFrames from catalog
&lt;/span&gt;&lt;span class="n"&gt;dfs_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_dataframes_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Indexing to get individual DataFrames
&lt;/span&gt;&lt;span class="n"&gt;df_orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And what about writing data to S3 and cataloging it in the Data Catalog? No worries, that can be done with a single line of code too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Writing data on S3 and cataloging on Data Catalog
&lt;/span&gt;&lt;span class="n"&gt;spark_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_and_catalog_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s3_table_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket-name/table-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_database_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;partition_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partition-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_data_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# e.g. "parquet"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, these are only two examples from a series of features already available in the library. This article was written to offer all users a different way to learn and to sharpen their skills with Spark applications on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more
&lt;/h2&gt;

&lt;p&gt;There are some useful links and docs about sparksnake. Check them out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sparksnake.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;sparksnake.readthedocs.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ThiagoPanini/sparksnake" rel="noopener noreferrer"&gt;ThiagoPanini/sparksnake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/sparksnake/" rel="noopener noreferrer"&gt;PyPI/sparksnake&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>spark</category>
      <category>python</category>
      <category>etl</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
