<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thiago Panini</title>
    <description>The latest articles on DEV Community by Thiago Panini (@thiagopanini).</description>
    <link>https://dev.to/thiagopanini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1047759%2F19f64412-fd89-4466-bda2-0bd2b39187a8.jpg</url>
      <title>DEV Community: Thiago Panini</title>
      <link>https://dev.to/thiagopanini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thiagopanini"/>
    <language>en</language>
    <item>
      <title>Building Modular AWS Infrastructure with Terraform: Inside the tfbox Project</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Sun, 20 Jul 2025 00:15:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-modular-aws-infrastructure-with-terraform-inside-the-tfbox-project-4fc2</link>
      <guid>https://dev.to/aws-builders/building-modular-aws-infrastructure-with-terraform-inside-the-tfbox-project-4fc2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome, fellow cloud wrangler! Whether you’re a seasoned DevOps pro, a data engineer moonlighting as an infrastructure architect, or just someone who likes their YAML with a side of automation, you’re in the right place. &lt;/p&gt;

&lt;p&gt;In this article we'll go through the &lt;code&gt;tfbox&lt;/code&gt; project: a curated collection of production-ready Terraform modules for AWS, designed to accelerate cloud provisioning and standardize best practices across teams. By encapsulating common AWS resources such as DynamoDB tables, IAM roles, and Lambda layers (with many others planned), &lt;code&gt;tfbox&lt;/code&gt; empowers engineers to compose robust infrastructure with minimal boilerplate and maximum flexibility.&lt;/p&gt;

&lt;p&gt;Whether you’re here to learn, contribute, or just see how someone else solved a real world problem, grab a coffee and let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modules Included
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB Table&lt;/strong&gt;: Configurable provisioning of tables, keys, attributes, and billing modes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Role&lt;/strong&gt;: Automated creation of roles, trust policies, and policy attachments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Layer&lt;/strong&gt;: Build and deploy Lambda layers from Python requirements, with packaging and cleanup automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architectural Patterns and Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modular Terraform Design
&lt;/h3&gt;

&lt;p&gt;Each AWS resource is encapsulated as a standalone Terraform module, adhering to the following principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Modules are self-contained, with their own variables, resources, and outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability&lt;/strong&gt;: Modules can be referenced independently in any Terraform configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: Every module is documented, with input variables, outputs, and usage examples available in the repository Wiki.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example: Referencing a Module
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"dynamodb_table"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"git::https://github.com/ThiagoPanini/tfbox.git//aws/dynamodb-table?ref=v1.0.0"&lt;/span&gt;
  &lt;span class="c1"&gt;# ...module variables...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern enables versioned, remote module usage, critical for reproducible infrastructure and CI/CD workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Versioning and Release Management
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;tfbox&lt;/code&gt; leverages the &lt;a href="https://github.com/marketplace/actions/terraform-module-releaser" rel="noopener noreferrer"&gt;terraform-module-releaser&lt;/a&gt; GitHub Action for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated releases&lt;/strong&gt;: New module versions are published upon merging changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation updates&lt;/strong&gt;: The Wiki is refreshed with every release, ensuring up-to-date module references.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic versioning&lt;/strong&gt;: Modules are tagged for precise dependency management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This automation reduces manual overhead and ensures consistency across environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean Separation of Concerns
&lt;/h3&gt;

&lt;p&gt;Each module directory typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main.tf&lt;/code&gt;: Core resource definitions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;variables.tf&lt;/code&gt;: Input variable declarations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;locals.tf&lt;/code&gt;: Local values for intermediate computations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;versions.tf&lt;/code&gt;: Provider and module version constraints.&lt;/li&gt;
&lt;li&gt;Additional files (e.g., &lt;code&gt;policies.tf&lt;/code&gt;, &lt;code&gt;role.tf&lt;/code&gt; for IAM) for logical separation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure supports maintainability and extensibility, allowing teams to add new modules or enhance existing ones without cross-module coupling.&lt;/p&gt;
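&lt;p&gt;To make the layout concrete, a &lt;code&gt;variables.tf&lt;/code&gt; in such a module might declare its inputs like this (a hypothetical sketch with illustrative names, not the actual module code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# variables.tf: input variable declarations (illustrative only)
variable "table_name" {
  description = "Name of the DynamoDB table to create"
  type        = string
}

variable "billing_mode" {
  description = "Billing mode: PROVISIONED or PAY_PER_REQUEST"
  type        = string
  default     = "PAY_PER_REQUEST"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;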

&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda Layer Module: Automated Packaging
&lt;/h3&gt;

&lt;p&gt;The Lambda Layer module stands out for its automation of Python dependency packaging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requirements-driven builds&lt;/strong&gt;: Layers are built from a &lt;code&gt;requirements.txt&lt;/code&gt;, streamlining dependency management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated cleanup&lt;/strong&gt;: Temporary files and artifacts are managed within a dedicated directory, reducing clutter and risk of stale state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform-native orchestration&lt;/strong&gt;: All steps are orchestrated via Terraform, enabling declarative infrastructure and repeatable builds.&lt;/li&gt;
&lt;/ul&gt;
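&lt;p&gt;A common way to implement this kind of pipeline in plain Terraform (shown here as a generic sketch, not necessarily the exact &lt;code&gt;tfbox&lt;/code&gt; code) combines a &lt;code&gt;local-exec&lt;/code&gt; provisioner with the &lt;code&gt;archive_file&lt;/code&gt; data source:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Install Python dependencies into a build directory whenever requirements.txt changes
resource "null_resource" "pip_install" {
  triggers = {
    requirements = filemd5("${path.module}/requirements.txt")
  }

  provisioner "local-exec" {
    command = "pip install -r ${path.module}/requirements.txt -t ${path.module}/build/python"
  }
}

# Zip the build directory so it can be published as a layer
data "archive_file" "layer" {
  type        = "zip"
  source_dir  = "${path.module}/build"
  output_path = "${path.module}/layer.zip"
  depends_on  = [null_resource.pip_install]
}

# Publish the zipped artifact as a Lambda layer version
resource "aws_lambda_layer_version" "this" {
  layer_name          = "my-layer"
  filename            = data.archive_file.layer.output_path
  compatible_runtimes = ["python3.12"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;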

&lt;h3&gt;
  
  
  IAM Role Module: Policy Management
&lt;/h3&gt;

&lt;p&gt;The IAM Role module provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Template-driven trust policies&lt;/strong&gt;: Simplifies cross-service and cross-account role assumptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible policy attachment&lt;/strong&gt;: Supports both inline and managed policies, catering to diverse security requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locals for policy composition&lt;/strong&gt;: Uses Terraform &lt;code&gt;locals&lt;/code&gt; to dynamically construct policy documents, improving readability and maintainability.&lt;/li&gt;
&lt;/ul&gt;
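&lt;p&gt;As an illustration of the trust-policy piece, a module like this could compose the policy with the &lt;code&gt;aws_iam_policy_document&lt;/code&gt; data source (a generic sketch with assumed names, not the module's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Trust policy allowing AWS Lambda to assume the role (illustrative)
data "aws_iam_policy_document" "trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "this" {
  name               = "my-app-role"
  assume_role_policy = data.aws_iam_policy_document.trust.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;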

&lt;h3&gt;
  
  
  DynamoDB Table Module: Flexible Schema Definition
&lt;/h3&gt;

&lt;p&gt;The DynamoDB Table module allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configurable keys and attributes&lt;/strong&gt;: Supports various partition and sort key configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Billing mode selection&lt;/strong&gt;: Enables choice between provisioned and on-demand throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data-driven resource creation&lt;/strong&gt;: Uses variables and locals to abstract table schema, making it easy to adapt to changing requirements.&lt;/li&gt;
&lt;/ul&gt;
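&lt;p&gt;Putting it together, a call to such a module might look like the snippet below (the input names are illustrative; check the repository Wiki for the real interface):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;module "orders_table" {
  source = "git::https://github.com/ThiagoPanini/tfbox.git//aws/dynamodb-table?ref=v1.0.0"

  # Illustrative inputs: the actual variable names may differ
  name         = "orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "order_id"

  attributes = [
    { name = "order_id", type = "S" }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;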

&lt;h2&gt;
  
  
  Deployment and Usage
&lt;/h2&gt;

&lt;p&gt;Modules are designed for seamless integration into existing Terraform projects. By referencing modules via Git URLs and version tags, teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin module versions&lt;/strong&gt; for stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade modules&lt;/strong&gt; with minimal disruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share best practices&lt;/strong&gt; across projects and teams.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;tfbox&lt;/code&gt; is infrastructure engineering for the real world: modular, automated, and ready for action. By abstracting common AWS resources into reusable Terraform modules, it helps you move fast, stay consistent, and avoid reinventing the wheel (again).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why you’ll love it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid, reliable AWS provisioning&lt;/li&gt;
&lt;li&gt;Automated versioning and docs&lt;/li&gt;
&lt;li&gt;Clean, maintainable module design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add more AWS modules (VPC, ECS, RDS, bring your wish list!)&lt;/li&gt;
&lt;li&gt;Integrate automated testing and compliance checks&lt;/li&gt;
&lt;li&gt;Enhance observability and monitoring integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🤝 Let’s Build This Together
&lt;/h2&gt;

&lt;p&gt;If you’ve made it this far, awesome. That means you’re probably the kind of builder who enjoys digging into code, improving ideas, or helping others learn.&lt;/p&gt;

&lt;p&gt;This project is open source, and that’s not just a license, it’s an invitation. Whether it’s fixing a typo, proposing a new feature, or writing better docs, your contribution helps &lt;strong&gt;make the whole ecosystem stronger&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every pull request is a chance to learn, grow, and connect. Let’s keep this feedback loop alive and build tools that empower devs everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get in touch
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/ThiagoPanini" rel="noopener noreferrer"&gt;@ThiagoPanini&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/thiago-panini/" rel="noopener noreferrer"&gt;Thiago Panini&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hashnode: &lt;a href="https://panini.hashnode.dev/" rel="noopener noreferrer"&gt;panini-tech-lab&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>datadelivery: Providing public datasets to explore in AWS</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Sun, 09 Apr 2023 01:57:58 +0000</pubDate>
      <link>https://dev.to/aws-builders/datadelivery-providing-public-datasets-to-explore-in-aws-2029</link>
      <guid>https://dev.to/aws-builders/datadelivery-providing-public-datasets-to-explore-in-aws-2029</guid>
      <description>&lt;h1&gt;
  
  
  Project Story
&lt;/h1&gt;

&lt;p&gt;In this documentation page, I will talk about the big idea behind the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module and how it can be a huge milestone in your AWS learning journey.&lt;/p&gt;

&lt;p&gt;🪄 This is the story about how I had to decouple my open source solutions to have a more scalable kit of projects that help users in their analytics learning journey on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraglue: the Beginning and the First
&lt;/h2&gt;

&lt;p&gt;No, you are not reading anything wrong, nor are you on the wrong documentation page. The truth is that we can't talk about &lt;em&gt;datadelivery&lt;/em&gt; without talking about &lt;em&gt;terraglue&lt;/em&gt; first.&lt;/p&gt;

&lt;p&gt;I know, it's a bunch of unknown names and maybe you're wondering what's going on. But let me tell you something really important: the &lt;em&gt;datadelivery&lt;/em&gt; project was born from the &lt;em&gt;terraglue&lt;/em&gt; project. To know more about &lt;em&gt;terraglue&lt;/em&gt; (and to understand the decoupling process), I suggest you pause this reading for a little while and go to the &lt;a href="https://terraglue.readthedocs.io/en/latest/story/" rel="noopener noreferrer"&gt;main story about how it all started&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is not like Dark, the Netflix TV show, where you travel through time, but you will probably want to know the beginning of everything before going ahead on this page. Feel free to choose your beginning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz8ufcob0cqca4f96r7t.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpz8ufcob0cqca4f96r7t.gif" alt="A gif of Jonas, the main character of a Netflix TV Show called Dark" width="540" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is the Data?
&lt;/h2&gt;

&lt;p&gt;Well, regardless of how you got here and which of my other open source projects you are familiar with, the &lt;em&gt;datadelivery&lt;/em&gt; project was born to solve a specific problem: the lack of public data sources available to explore AWS services.&lt;/p&gt;

&lt;p&gt;In fact, this is an honest claim from anyone who wants to learn more about AWS services using different datasets. After all, data is probably the most important thing in any data project (at the risk of being redundant).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"So where can we find datasets to explore?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nowadays, finding public datasets isn't too hard. There are many websites, blog posts, books, and other sources that offer links to download datasets on the most varied subjects. To name a few, we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets/" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://archive-beta.ics.uci.edu/" rel="noopener noreferrer"&gt;UCI Machine Learning Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;GitHub repository from books such as:

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data" rel="noopener noreferrer"&gt;Spark - The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/databricks/LearningSparkV2/tree/master/databricks-datasets/learning-spark-v2" rel="noopener noreferrer"&gt;Learning Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/PacktPublishing/Apache-Hive-Essentials-Second-Edition/tree/master/data" rel="noopener noreferrer"&gt;Apache Hive Essentials&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;So, it's enough to say that there are many ways to download and use public datasets for whatever learning purpose. Fair enough.&lt;/p&gt;

&lt;p&gt;But in our context we are talking about using those datasets inside AWS, right? What about all the effort needed to download the files, upload them into a storage system (like S3), and catalog all the metadata into the Data Catalog? That still seems a little too hard.&lt;/p&gt;

&lt;p&gt;🚛💨 This is where &lt;em&gt;datadelivery&lt;/em&gt; shines!&lt;/p&gt;

&lt;h2&gt;
  
  
  datadelivery: A Data Exploration Toolkit
&lt;/h2&gt;

&lt;p&gt;I think you get the idea, but just to reinforce: the &lt;em&gt;datadelivery&lt;/em&gt; project provides an efficient way to activate services in an AWS account so that users can explore preselected public datasets. It does that by providing a Terraform module that can be called directly from its source GitHub repository.&lt;/p&gt;

&lt;p&gt;I state that in the project documentation home page, but this is the perfect time to clarify what really happens when users call the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Five different buckets are created in the target AWS account&lt;/li&gt;
&lt;li&gt;The contents of the &lt;code&gt;data/&lt;/code&gt; folder in the source module are uploaded to the SoR bucket&lt;/li&gt;
&lt;li&gt;An IAM role is created with enough permissions to run a Glue Crawler&lt;/li&gt;
&lt;li&gt;A Glue Crawler is created with an S3 target pointing to the SoR bucket&lt;/li&gt;
&lt;li&gt;A cron expression is configured to trigger the Glue Crawler 2 minutes after the infrastructure deployment finishes&lt;/li&gt;
&lt;li&gt;All files in the SoR bucket (previously in the &lt;code&gt;data/&lt;/code&gt; folder) are cataloged as new tables in the Data Catalog&lt;/li&gt;
&lt;li&gt;A preconfigured Athena workgroup is created to enable users to run queries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If writing it isn't enough, take a look at the project diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/feature/first-deploy/docs/assets/imgs/project-diagram.png?raw=true" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqg447zuvcpywlfld2pe.png" alt="A diagram of services deployed" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
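&lt;p&gt;From the user's point of view, all of those steps are triggered by a single module call (the version tag below is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Calling the module straight from its source GitHub repository
module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery?ref=v0.1.0"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;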

&lt;p&gt;Do you want to know more about the "behind the scenes" of the project's construction? Below I present some code details about how all the infrastructure was declared in the module.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage Structure in S3
&lt;/h3&gt;

&lt;p&gt;This was the first infrastructure block created in the project. After all, it would be impossible to provide the exploration of public datasets in analytics services in AWS without thinking about the storage layer.&lt;/p&gt;

&lt;p&gt;To do such a thing, I declared some useful variables into a &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/locals.tf" rel="noopener noreferrer"&gt;&lt;code&gt;locals.tf&lt;/code&gt;&lt;/a&gt; Terraform file as you can see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Defining&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sources&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;help&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_region"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"current"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"current"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Defining&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;variables&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;used&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;module&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;locals&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;account_id&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data.aws_caller_identity.current.account_id&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;region_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data.aws_region.current.name&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Creating&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;map&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;bucket&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;deployed&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bucket_names_map&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"sor"&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-sor-data-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"sot"&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-sot-data-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"spec"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-spec-data-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"athena"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-athena-query-results-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"glue"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datadelivery-glue-assets-${local.account_id}-${local.region_name}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;more&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;code&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;below&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;aws_region&lt;/code&gt; and the &lt;code&gt;aws_caller_identity&lt;/code&gt; Terraform data sources were created to make it possible to get some useful attributes from the target AWS account, like the &lt;code&gt;account_id&lt;/code&gt; and &lt;code&gt;region_name&lt;/code&gt; local values.&lt;/p&gt;

&lt;p&gt;According to the official &lt;a href="https://developer.hashicorp.com/terraform/language/values/locals" rel="noopener noreferrer"&gt;Terraform documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A local value assigns a name to an expression, so you can use the name multiple times within a module instead of repeating the expression. [...] The expressions in local values are not limited to literal constants; they can also reference other values in the module in order to transform or combine them, including variables, resource attributes, or other local."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the heart of the storage layer is the &lt;code&gt;bucket_names_map&lt;/code&gt; local value, which builds a map of bucket names using dynamic information retrieved from the aforementioned data sources.&lt;/p&gt;

&lt;p&gt;So, the next step was to create a &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/storage.tf" rel="noopener noreferrer"&gt;storage.tf&lt;/a&gt; Terraform file declaring an &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket" rel="noopener noreferrer"&gt;aws_s3_bucket&lt;/a&gt; Terraform resource for each entry in the &lt;code&gt;bucket_names_map&lt;/code&gt; local value, as you can see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Creating&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;buckets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_s3_bucket"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"this"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;for_each&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local.bucket_names_map&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bucket&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;each.value&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;force_destroy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The big idea about the resource block code above is the definition of a &lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/for_each" rel="noopener noreferrer"&gt;&lt;code&gt;for_each&lt;/code&gt;&lt;/a&gt; meta-argument that makes it possible to create several similar objects without writing a separate block for each one.&lt;/p&gt;

&lt;p&gt;And once again, according to the Terraform official documentation page:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If a resource or module block includes a for_each argument whose value is a map or a set of strings, Terraform creates one instance for each member of that map or set."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that's how multiple buckets could be created using a local value that maps different bucket names.&lt;/p&gt;

&lt;p&gt;In addition to that, other bucket configurations and resources were defined in the &lt;code&gt;storage.tf&lt;/code&gt; file, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public access block with &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_public_access_block" rel="noopener noreferrer"&gt;aws_s3_bucket_public_access_block&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Server-side encryption with &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_server_side_encryption_configuration" rel="noopener noreferrer"&gt;aws_s3_bucket_server_side_encryption_configuration&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
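&lt;p&gt;These companion resources can follow the same &lt;code&gt;for_each&lt;/code&gt; pattern as the buckets themselves; for example, a public access block for every bucket might be declared like this (a sketch based on the pattern above, not necessarily the exact project code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Blocking public access for every bucket created above
resource "aws_s3_bucket_public_access_block" "this" {
  for_each = aws_s3_bucket.this

  bucket                  = each.value.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;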

&lt;p&gt;And finally, with all the buckets created and configured, it was possible to upload the preselected public datasets originally stored in the &lt;code&gt;data/&lt;/code&gt; folder of the source GitHub repository. Before showing the Terraform code block that does that, let's look at the structure of this folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├───data
│   ├───bike_data
│   │   ├───tbl_bikedata_station
│   │   └───tbl_bikedata_trip
│   ├───br_ecommerce
│   │   ├───tbl_brecommerce_customers
│   │   ├───tbl_brecommerce_geolocation
│   │   ├───tbl_brecommerce_orders
│   │   ├───tbl_brecommerce_order_items
│   │   ├───tbl_brecommerce_payments
│   │   ├───tbl_brecommerce_products
│   │   ├───tbl_brecommerce_reviews
│   │   └───tbl_brecommerce_sellers
│   ├───flights_data
│   │   ├───tbl_flights_airport_codes_na
│   │   ├───tbl_flights_departure_delays
│   │   └───tbl_flights_summary_data
│   ├───tbl_airbnb
│   ├───tbl_blogs
│   └───tbl_iot_devices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you can see some data folders simulating table structures, with raw files in each one of them. To provide some context, the table below shows some useful information about the datasets in the &lt;code&gt;data/&lt;/code&gt; source repository folder:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;🎲 Dataset&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;🏷️ Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;🔗 Source Link&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bike Data&lt;/td&gt;
&lt;td&gt;The dataset has information about the San Francisco bike share service from August 2013 to August 2015.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/bike-data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brazilian E-Commerce&lt;/td&gt;
&lt;td&gt;The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flights Data&lt;/td&gt;
&lt;td&gt;This dataset has information about flight travels in the United States.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/flight-data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Airbnb&lt;/td&gt;
&lt;td&gt;A dataset with interactions with Airbnb in many of their services. There are 700 attributes to be explored.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blogs&lt;/td&gt;
&lt;td&gt;A small, fake dataset with information about blogs published on the internet.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IoT Devices&lt;/td&gt;
&lt;td&gt;A fake dataset with measurements from IoT devices collected in a company facility.&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/databricks/LearningSparkV2/tree/master/databricks-datasets/learning-spark-v2/iot-devices" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the future, new datasets will likely be added to &lt;em&gt;datadelivery&lt;/em&gt;, giving users an even wider range of possibilities.&lt;/p&gt;

&lt;p&gt;So, now that you know the contents of the &lt;code&gt;data/&lt;/code&gt; folder, let's turn back to the &lt;code&gt;storage.tf&lt;/code&gt; file to see how the upload to S3 works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Adding local files on SoR bucket
resource "aws_s3_object" "data_sources" {
  for_each               = fileset(local.data_path, "**")
  bucket                 = aws_s3_bucket.this["sor"].bucket
  key                    = each.value
  source                 = "${local.data_path}${each.value}"
  server_side_encryption = "aws:kms"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the end, it's all about using the &lt;a href="https://developer.hashicorp.com/terraform/language/functions/fileset" rel="noopener noreferrer"&gt;&lt;code&gt;fileset()&lt;/code&gt;&lt;/a&gt; Terraform function to list the contents of a local path (represented by a local value called &lt;code&gt;data_path&lt;/code&gt;). The upload target is the SoR bucket (since we're dealing with raw files, it makes sense to store them in the System of Record layer).&lt;/p&gt;
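&lt;p&gt;As a quick illustration (the file names here are hypothetical, just to show the mapping), each relative path returned by &lt;code&gt;fileset()&lt;/code&gt; becomes both the S3 object key and part of the local source path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical result of fileset(local.data_path, "**")
# ["tbl_blogs/blogs.json", "tbl_iot_devices/iot_devices.json", ...]

# For the entry "tbl_blogs/blogs.json", the resource receives:
#   key    = "tbl_blogs/blogs.json"
#   source = "${local.data_path}tbl_blogs/blogs.json"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;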

&lt;p&gt;The &lt;code&gt;data_path&lt;/code&gt; local value is nothing more than a combination of the path module and the &lt;code&gt;data/&lt;/code&gt; folder, as you can see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Referencing a data folder where the files to be uploaded are located
data_path = "${path.module}/data/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is how the storage structure was built. In the end, users get a set of S3 buckets with the public datasets stored in the SoR bucket.&lt;/p&gt;

&lt;p&gt;This is just the beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crawling the Data
&lt;/h3&gt;

&lt;p&gt;We know that uploading raw files to S3 isn't enough to build all the elements needed to explore analytics services on AWS. It is also necessary to &lt;strong&gt;catalog&lt;/strong&gt; data in order to make it accessible.&lt;/p&gt;

&lt;p&gt;The first step taken to accomplish this mission was to create a &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/catalog.tf" rel="noopener noreferrer"&gt;&lt;code&gt;catalog.tf&lt;/code&gt;&lt;/a&gt; Terraform file declaring all the infrastructure needed to register the metadata of the raw files uploaded by &lt;code&gt;storage.tf&lt;/code&gt; in the Data Catalog.&lt;/p&gt;

&lt;p&gt;So we start by defining an &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_database" rel="noopener noreferrer"&gt;aws_glue_catalog_database&lt;/a&gt; resource to create the databases in the Glue Data Catalog that will receive the new tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Creating Glue databases on Data Catalog
resource "aws_glue_catalog_database" "mesh" {
  for_each    = var.glue_db_names
  name        = each.value
  description = "Database ${each.value} for storing tables in this specific layer"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we can see the &lt;code&gt;glue_db_names&lt;/code&gt; variable, taken from a &lt;code&gt;variables.tf&lt;/code&gt; Terraform file which declares all the input variables accepted by the &lt;em&gt;datadelivery&lt;/em&gt; module. The database names are defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "glue_db_names" {
  description = "List of database names for storing Glue catalog tables"
  type        = map(string)
  default = {
    "sor"  = "db_datadelivery_sor",
    "sot"  = "db_datadelivery_sot",
    "spec" = "db_datadelivery_spec"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each entry in the &lt;code&gt;glue_db_names&lt;/code&gt; map variable, a new database will be created in the target AWS account. It's important to note that only the "db_datadelivery_sor" database receives the cataloged data (the SoR layer handles raw data, so it's enough to create tables in this database alone). The similar SoT and Spec databases are provided in case users want to register their own tables from processes like Glue jobs or Athena queries.&lt;/p&gt;
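&lt;p&gt;Since &lt;code&gt;glue_db_names&lt;/code&gt; is a plain &lt;code&gt;map(string)&lt;/code&gt;, users who prefer their own naming convention can override it in the module call. The sketch below is hypothetical (the database names are illustrative, not part of the project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical module call overriding the default database names
module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery"

  glue_db_names = {
    "sor"  = "db_mycompany_raw",
    "sot"  = "db_mycompany_trusted",
    "spec" = "db_mycompany_refined"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;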

&lt;p&gt;Then, the most important resource to make things happen is the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_crawler" rel="noopener noreferrer"&gt;aws_glue_crawler&lt;/a&gt;. Before showing the Terraform declaration block, let's take a look at the definition of a Glue Crawler.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html" rel="noopener noreferrer"&gt;official AWS documentation page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.&lt;/p&gt;

&lt;p&gt;[...]&lt;/p&gt;

&lt;p&gt;When a crawler runs, it takes the following actions to interrogate a data store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classifies data to determine the format, schema, and associated properties of the raw data&lt;/li&gt;
&lt;li&gt;Groups data into tables or partitions&lt;/li&gt;
&lt;li&gt;Writes metadata to the Data Catalog&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the following Terraform code declares a Glue Crawler resource with some special attributes that will be explained later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Defining a Glue Crawler
resource "aws_glue_crawler" "sor" {
  database_name = var.glue_db_names["sor"]
  name          = "terracatalog-glue-crawler-sor"
  role          = aws_iam_role.glue_crawler_role.arn

  s3_target {
    path = "s3://${local.bucket_names_map["sor"]}"
  }

  schedule = local.crawler_cron_expr

  depends_on = [
    aws_s3_object.data_sources,
    aws_iam_policy.glue_policies,
    aws_iam_role.glue_crawler_role
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some points need to be clarified here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The target database for the Crawler is the SoR database&lt;/li&gt;
&lt;li&gt;The target storage location for the Crawler is the SoR S3 bucket&lt;/li&gt;
&lt;li&gt;A new IAM role is previously created in the &lt;a href="https://github.com/ThiagoPanini/datadelivery/blob/main/iam.tf" rel="noopener noreferrer"&gt;&lt;code&gt;iam.tf&lt;/code&gt;&lt;/a&gt; Terraform file with all the permissions needed to run a Glue Crawler (you can check the source link if you want to see it in detail)&lt;/li&gt;
&lt;li&gt;A cron expression is defined in the &lt;code&gt;locals.tf&lt;/code&gt; file to &lt;strong&gt;run the Crawler&lt;/strong&gt; 2 minutes (by default) after the infrastructure deployment finishes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is surely a great way to automate the crawling process without having to access the AWS account and run the Crawler manually. So let's take a deep dive into it.&lt;/p&gt;

&lt;p&gt;Coming back to the &lt;code&gt;locals.tf&lt;/code&gt; Terraform file: to build a valid cron expression that runs the Crawler a couple of minutes after the infrastructure deployment, it was necessary to get the current time at execution time. The approach chosen involved the &lt;a href="https://developer.hashicorp.com/terraform/language/functions/timestamp" rel="noopener noreferrer"&gt;&lt;code&gt;timestamp()&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.hashicorp.com/terraform/language/functions/timeadd" rel="noopener noreferrer"&gt;&lt;code&gt;timeadd()&lt;/code&gt;&lt;/a&gt; Terraform functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# [...]

# Extracting current timestamp and adding a delay
timestamp_to_run = timeadd(timestamp(), var.delay_to_run_crawler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;delay_to_run_crawler&lt;/code&gt; variable can be passed by the user in a &lt;em&gt;datadelivery&lt;/em&gt; module call. Its default value is "2m", meaning the timestamp used to build the cron expression is the current timestamp delayed by 2 minutes.&lt;/p&gt;
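&lt;p&gt;If two minutes isn't enough for a slower deployment, the delay can be raised in the module call. This is a sketch assuming the variable accepts any duration string supported by &lt;code&gt;timeadd()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Running the Crawler 10 minutes after the deployment instead of 2
module "datadelivery" {
  source               = "git::https://github.com/ThiagoPanini/datadelivery"
  delay_to_run_crawler = "10m"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;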

&lt;p&gt;So, the next step was to extract all the elements needed for a valid cron expression. This is done by calling the &lt;a href="https://developer.hashicorp.com/terraform/language/functions/formatdate" rel="noopener noreferrer"&gt;&lt;code&gt;formatdate()&lt;/code&gt;&lt;/a&gt; Terraform function with different date format arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Getting date information
cron_day    = formatdate("D", local.timestamp_to_run)
cron_month  = formatdate("M", local.timestamp_to_run)
cron_year   = formatdate("YYYY", local.timestamp_to_run)
cron_hour   = formatdate("h", local.timestamp_to_run)
cron_minute = formatdate("m", local.timestamp_to_run)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then, the last step was to build the cron expression from the individual local values for each cron attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Building a cron expression for Glue Crawler to run minutes after infrastructure deploy
crawler_cron_expr = "cron(${local.cron_minute} ${local.cron_hour} ${local.cron_day} ${local.cron_month} ? ${local.cron_year})"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if a user calls the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module at 6:45PM and the infrastructure deployment takes about 5 minutes to finish (at 6:50PM, to be exact), then the Glue Crawler will run in the target AWS account at 6:52PM (and never again).&lt;/p&gt;
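&lt;p&gt;To make that example concrete, here is roughly what the local values would look like for an apply finishing around 6:50PM (the date is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# timestamp() evaluated at apply time : 2023-03-31T18:50:00Z
# timeadd(timestamp(), "2m")          : 2023-03-31T18:52:00Z
crawler_cron_expr = "cron(52 18 31 3 ? 2023)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;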

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl33x5t3x9mqsdm6nqn0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl33x5t3x9mqsdm6nqn0.gif" alt="A gif from the Neflix TV show called Dark where Jonas, the main charactere, is asking " width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, coming back to the &lt;code&gt;catalog.tf&lt;/code&gt; Terraform file, the last thing done is the creation of an Athena workgroup through the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/athena_workgroup" rel="noopener noreferrer"&gt;aws_athena_workgroup&lt;/a&gt; Terraform resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Defining an Athena preconfigured workgroup
resource "aws_athena_workgroup" "analytics" {
  name          = "terracatalog-workgroup"
  force_destroy = true

  configuration {
    result_configuration {
      output_location = "s3://${local.bucket_names_map["athena"]}"

      encryption_configuration {
        encryption_option = "SSE_KMS"
        kms_key_arn       = data.aws_kms_key.s3.arn
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The great thing about it is that this is a preconfigured workgroup that stores Athena query results in the parametrized &lt;em&gt;datadelivery&lt;/em&gt; Athena bucket (taken from the &lt;code&gt;bucket_names_map&lt;/code&gt; local value). Users will be able to start using the Athena query editor without worrying about any other settings.&lt;/p&gt;

&lt;p&gt;So, with the &lt;code&gt;storage.tf&lt;/code&gt; and &lt;code&gt;catalog.tf&lt;/code&gt; files, users can extract the real power of the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module. The &lt;code&gt;iam.tf&lt;/code&gt; file, as said before, is also essential, since it provides the IAM policies and role that make everything work (especially the crawler process).&lt;/p&gt;

&lt;h2&gt;
  
  
  So What About Now?
&lt;/h2&gt;

&lt;p&gt;Well, by now I really invite all readers to join in and read more about the &lt;em&gt;datadelivery&lt;/em&gt; Terraform module. There is a &lt;a href="https://datadelivery.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;comprehensive documentation page&lt;/a&gt; hosted on &lt;a href="https://readthedocs.org/" rel="noopener noreferrer"&gt;readthedocs&lt;/a&gt; with a lot of useful information about how this project can help users on their analytics journey in AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k02oll0funbsz3rm7e6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k02oll0funbsz3rm7e6.gif" alt="A gif showing Jonas and Martha, the two main characters from Dark, a Netflix TV Show" width="498" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With everything presented here, to start using &lt;em&gt;datadelivery&lt;/em&gt; in your AWS account you just need to call the module from its source GitHub repository, as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calling datadelivery module with default configuration
module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, if you want to know more, I reinforce: don't forget to check the &lt;a href="https://datadelivery.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;official documentation page&lt;/a&gt;. I really believe that anyone using AWS to learn more about analytics can benefit from &lt;em&gt;datadelivery&lt;/em&gt; and its features!&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/values/locals" rel="noopener noreferrer"&gt;Terraform - Local Values&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/meta-arguments/for_each" rel="noopener noreferrer"&gt;Terraform - The &lt;code&gt;for_each&lt;/code&gt; Meta-Argument&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/formatdate" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;formatdate&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/fileset" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;fileset&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html" rel="noopener noreferrer"&gt;AWS - Defining Crawlers in AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/crawler-running.html" rel="noopener noreferrer"&gt;AWS - How Crawlers Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/timestamp" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;timestamp&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/functions/timeadd" rel="noopener noreferrer"&gt;Terraform - &lt;code&gt;timeadd&lt;/code&gt; Function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>analytics</category>
      <category>iac</category>
    </item>
    <item>
      <title>The story about how I took my learning on AWS Glue to the next level</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Fri, 31 Mar 2023 12:33:43 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-story-about-how-i-took-my-learning-on-aws-glue-to-the-next-level-42c5</link>
      <guid>https://dev.to/aws-builders/the-story-about-how-i-took-my-learning-on-aws-glue-to-the-next-level-42c5</guid>
      <description>&lt;h1&gt;
  
  
  Project Story
&lt;/h1&gt;

&lt;p&gt;For everyone reading this page, I ask poetic license to tell you a story about the challenges I faced on my analytics learning journey using AWS services and what I did to overcome them.&lt;/p&gt;

&lt;p&gt;🪄 In fact, this is a story about how I really started to build and share open source solutions that help people learn more about analytics services on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vtx77j74isuyjr00r5v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vtx77j74isuyjr00r5v.webp" alt="Snoopy reading a book" width="500" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it All Started
&lt;/h2&gt;

&lt;p&gt;First of all, it's important to provide some context on how it all started. I'm an Analytics Engineer working for a financial company that has a lot of data and an increasing number of opportunities to use it. The company adopted the Data Mesh architecture to give data teams more autonomy to build and share their own datasets through three different layers: SoR (System of Record), SoT (Source of Truth), and Spec (Specialized).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There are a lot of articles explaining the Data Mesh architecture and the differences between the SoR, SoT, and Spec layers for storing and sharing data. In fact, this is a really useful way to improve analytics in organizations.&lt;/p&gt;

&lt;p&gt;If you want to know a little bit more about it, there are some links that can help you on this mission:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;a href="https://martinfowler.com/articles/data-mesh-principles.html" rel="noopener noreferrer"&gt;Data Mesh Principles and Logical Architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;a href="https://www.integrify.com/blog/posts/system-of-record-vs-source-of-truth/" rel="noopener noreferrer"&gt;Building a System of Record vs. a Source of Truth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;a href="https://www.linkedin.com/pulse/difference-between-system-record-source-truth-santosh-kudva/" rel="noopener noreferrer"&gt;The Difference Between System of Record and Source of Truth&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, the company decided to rely mainly on AWS services for this journey. From an analytics perspective, services like Glue, EMR, Athena, and QuickSight popped up as really good options to solve real problems in the company.&lt;/p&gt;

&lt;p&gt;And that's how the story begins: an Analytics Engineer trying his best to deep dive into those services in a sandbox environment, learning everything he could.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Steps
&lt;/h2&gt;

&lt;p&gt;Well, I had to choose an initial goal. After deciding to start learning more about AWS Glue to develop ETL jobs, I looked for documentation pages, watched some tutorial videos to prepare myself and talked to other developers to collect thoughts and experiences about the whole thing.&lt;/p&gt;

&lt;p&gt;After a little while, I found myself ready to start building something useful. In my hands, I had an AWS sandbox account and a noble desire to learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht3fneg3f5z3phhstpqr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht3fneg3f5z3phhstpqr.gif" alt="Michael B. Jordan on Creed movie" width="500" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Here is an important piece of information:&lt;/strong&gt;&lt;br&gt;
The AWS sandbox account came from a subscription I had on a learning platform. The platform allowed subscribers to use an AWS environment for learning purposes, which was really nice. However, it was an ephemeral environment with an automatic shutdown mechanism after a few hours. This behavior is one of the key points of the story. Keep it in mind; you will soon see why.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Creating the Storage Layers
&lt;/h2&gt;

&lt;p&gt;I started to get my hands dirty by creating S3 buckets to replicate something close to a Data Lake storage architecture in a Data Mesh approach. So, one day I just logged into my sandbox AWS account and created the following buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bucket to store SoR data&lt;/li&gt;
&lt;li&gt;A bucket to store SoT data&lt;/li&gt;
&lt;li&gt;A bucket to store Spec data&lt;/li&gt;
&lt;li&gt;A bucket to store Glue assets&lt;/li&gt;
&lt;li&gt;A bucket to store Athena query results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-storage.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-storage.png" alt="Storage resources" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Uploading Files on Buckets
&lt;/h2&gt;

&lt;p&gt;Once the storage structure was created, I started to search for public datasets to be part of my learning path. The idea was to upload some data into the buckets to make it possible to do some analytics, such as creating ETL jobs or even querying with Athena.&lt;/p&gt;

&lt;p&gt;So, I found the excellent &lt;a href="https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce" rel="noopener noreferrer"&gt;Brazilian E-Commerce dataset&lt;/a&gt; on Kaggle and it fit perfectly. I was now able to download the data and upload it to the SoR bucket to simulate raw data available for further analysis in an ETL pipeline.&lt;/p&gt;

&lt;p&gt;And now my diagram was like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-data.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-data.png" alt="Storage resources with data" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cataloging Data
&lt;/h2&gt;

&lt;p&gt;Uploading data to S3 buckets wasn't enough for a complete analytics experience. It was important to catalog the files' metadata in the Data Catalog to make them visible to services like Glue and Athena.&lt;/p&gt;

&lt;p&gt;So, the next step was registering all the files of the Brazilian E-Commerce dataset as tables in the Data Catalog. For this task, I tested two different approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Building and running &lt;code&gt;CREATE TABLE&lt;/code&gt; queries on Athena based on file schema&lt;/li&gt;
&lt;li&gt;Manually inputting fields and table properties on Data Catalog&lt;/li&gt;
&lt;/ol&gt;
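
&lt;p&gt;Just as a hedged sketch of the first approach, an Athena DDL statement for one of the CSV files could look like the query below. The database, bucket name, and column list here are illustrative and would follow the actual file schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: registering a CSV file as an external table on the Data Catalog
CREATE EXTERNAL TABLE IF NOT EXISTS ecommerce.orders (
  order_id string,
  customer_id string,
  order_status string,
  order_purchase_timestamp string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-sor-bucket/ecommerce/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;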

&lt;p&gt;By the way, as Athena proved itself to be a good service for exploring cataloged data, I took the opportunity to create a workgroup with the appropriate parameters for storing query results.&lt;/p&gt;

&lt;p&gt;And now my diagram was like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-catalog.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-catalog.png" alt="Catalog process using Data Catalog and Athena" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating IAM Roles and Policies
&lt;/h2&gt;

&lt;p&gt;A huge milestone was reached at that moment. I had a storage structure, I had data to be used and I had all metadata information already cataloged on Data Catalog.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"What was still missing to start creating Glue jobs?"&lt;/strong&gt;&lt;br&gt;
The answer was IAM roles and policies. Simple as that.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point I must tell you that creating IAM roles wasn't an easy step to complete. First of all, it was a little bit difficult to understand all the permissions needed to run Glue jobs on AWS, to log steps to CloudWatch, and everything else.&lt;/p&gt;

&lt;p&gt;Suddenly I found myself searching through docs pages and studying which Glue actions to include in my policy. After a while, I was able to create a set of policies for a solid IAM role to be assumed by my future first Glue job on AWS.&lt;/p&gt;

&lt;p&gt;And, once again, I added more pieces to my diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-iam.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-iam.png" alt="IAM role and policies" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Glue Job
&lt;/h2&gt;

&lt;p&gt;Well, after all that manual setup I was finally able to create my first Glue job on AWS and build ETL pipelines using the public datasets available in the Data Catalog.&lt;/p&gt;

&lt;p&gt;I was really excited at that moment, and the big idea was to simulate a data pipeline that read data from the SoR layer, transformed it, and put the curated dataset in the SoT layer. After learning a lot about the &lt;code&gt;awsglue&lt;/code&gt; library and elements like &lt;code&gt;GlueContext&lt;/code&gt; and &lt;code&gt;DynamicFrame&lt;/code&gt;, I was able to create a PySpark application with enough features to reach that goal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/terraglue-diagram-resources-glue.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Fproject-story%2Fterraglue-diagram-resources-glue.png" alt="Final diagram with Glue job" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now my diagram was complete!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.tenor.com%2Fs89PZe54F4IAAAAd%2Frenuu-thanos.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.tenor.com%2Fs89PZe54F4IAAAAd%2Frenuu-thanos.gif" alt="Thanos resting on the end of Infinity War movie" width="1024" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Not Yet: A Real Big Problem
&lt;/h2&gt;

&lt;p&gt;As much as this looks like a happy-ending story, the ending doesn't happen just yet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The AWS sandbox account problem.&lt;/strong&gt;&lt;br&gt;
Well, remember, as I said at the beginning of the story, that I had an AWS sandbox account in my hands? By sandbox account I mean a temporary environment that shuts down after a few hours.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that was the first big problem: I needed to recreate ALL the components of the final diagram every single day.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The huge manual effort.&lt;/strong&gt;&lt;br&gt;
As you can imagine, I spent almost one hour setting things up every time I wanted to practice with Glue. It was a huge manual effort, and that hour was almost half of the sandbox's lifetime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Something needed to be done.&lt;/p&gt;

&lt;p&gt;Of course, you could ask me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Why don't you build that architecture in your personal account?"&lt;/strong&gt;&lt;br&gt;
That was a nice option but the problem was the charges. I was just trying to learn Glue and running jobs multiple times (that's the expectation when you are learning) may incur some unpredictable costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ok, so now I think everyone is trying to figure out what I did to solve those problems.&lt;/p&gt;

&lt;p&gt;Yes, I found a way!&lt;/p&gt;

&lt;p&gt;If you are still here with me, I think you would like to know it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraglue: A New Project is Born
&lt;/h2&gt;

&lt;p&gt;Well, the problems were laid out, and I had to think of a solution to make my life easier for that simple learning task.&lt;/p&gt;

&lt;p&gt;The answer was right in front of me all the time. If my main problem was spending time recreating infrastructure over and over again, why not &lt;strong&gt;automate&lt;/strong&gt; the infrastructure creation with an IaC tool? That way, every time my sandbox environment expired, I could create everything again with much less overhead.&lt;/p&gt;

&lt;p&gt;That was a fantastic idea, and I started to use &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; to declare the resources used in my architecture. I split things into modules, and suddenly I had enough code to create buckets, upload data, catalog metadata, and create IAM roles, policies, and a preconfigured Glue job!&lt;/p&gt;

&lt;p&gt;While creating all this, I just felt that everyone who had faced the same learning challenges that brought me to this point would enjoy the project. So I prepared it, documented it, and called it &lt;a href="https://terraglue.readthedocs.io/en/latest/?badge=latest" rel="noopener noreferrer"&gt;terraglue&lt;/a&gt;.&lt;/p&gt;
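
&lt;p&gt;Just as a hedged illustration, consuming a Terraform module like this straight from its GitHub repository could look like the snippet below (the exact source reference and input variables may differ from the project's current documentation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical example: calling the terraglue module from its Git source
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;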

&lt;p&gt;&lt;br&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fterraglue%2Fblob%2Fmain%2Fdocs%2Fassets%2Fimgs%2Fheader-readme.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fterraglue%2Fblob%2Fmain%2Fdocs%2Fassets%2Fimgs%2Fheader-readme.png%3Fraw%3Dtrue" alt="terraglue-logo" width="600" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was really impressive how I could deploy all the components with just a couple of commands. If I used to spend about one hour creating and configuring every service manually, after &lt;em&gt;terraglue&lt;/em&gt; that time was reduced to just seconds!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/ThiagoPanini/terraglue/feature/terraglue-refactor/docs/assets/imgs/architecture/diagram-user-view.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FThiagoPanini%2Fterraglue%2Ffeature%2Fterraglue-refactor%2Fdocs%2Fassets%2Fimgs%2Farchitecture%2Fdiagram-user-view.png" alt="Terraglue diagram with IaC modules" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Constant Evolution with New Solutions
&lt;/h2&gt;

&lt;p&gt;After a while, I noticed that the &lt;em&gt;terraglue&lt;/em&gt; project had become a central point for almost everything. The source repository contained all the infrastructure (including buckets, data files, and a Glue job) and also a Spark application with modules, classes, and methods used to develop an example Glue job.&lt;/p&gt;

&lt;p&gt;That wasn't a good thing.&lt;/p&gt;

&lt;p&gt;Imagine if I started a learning journey on EMR, for example. I would have to duplicate almost all of the terraglue infrastructure into a new project just to have components like buckets and data files. The same applies to the application layer of terraglue: I would have to copy and paste scripts from project to project. It was not scalable.&lt;/p&gt;

&lt;p&gt;So, thinking about the best way to provide a great experience for me and for everyone who could use all of this, I started to &lt;strong&gt;decouple&lt;/strong&gt; the initial idea of terraglue into new open source projects. And that's the current state of my solutions shelf:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fdatadelivery%2Fblob%2Ffeature%2Ffirst-deploy%2Fdocs%2Fassets%2Fimgs%2Fproducts-overview-v2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FThiagoPanini%2Fdatadelivery%2Fblob%2Ffeature%2Ffirst-deploy%2Fdocs%2Fassets%2Fimgs%2Fproducts-overview-v2.png%3Fraw%3Dtrue" alt="A diagram showing how its possible to use other solutions like datadelivery, terraglue and sparksnake" width="480" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case you want to know more about each of these new solutions, I'll leave some links to documentation pages created to provide the best possible user experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚛 &lt;a href="https://datadelivery.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;datadelivery&lt;/strong&gt;&lt;/a&gt;: a Terraform module that helps users to have public datasets to explore using AWS services&lt;/li&gt;
&lt;li&gt;🌖 &lt;a href="https://terraglue.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;terraglue&lt;/strong&gt;&lt;/a&gt;: a Terraform module that helps users to create their own preconfigured Glue jobs in their AWS account&lt;/li&gt;
&lt;li&gt;✨ &lt;a href="https://sparksnake.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;sparksnake&lt;/strong&gt;&lt;/a&gt;: a Python package that contains useful Spark features created to help users to develop their own Spark applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there may be more to come! Terraglue was the first and, for a while, the only one. Then new solutions were created to fill specific needs. I see this as a continuous process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If I had to summarize this story in a few topics, I think the best sequence would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🤔 An Analytics Engineer wanted to learn AWS Glue and other analytics services on AWS&lt;/li&gt;
&lt;li&gt;🤪 He started to build a complete infrastructure in his AWS sandbox account manually&lt;/li&gt;
&lt;li&gt;🥲 Every time this AWS sandbox account expired, he did it all again&lt;/li&gt;
&lt;li&gt;😮‍💨 He got tired of doing this every time, so he started to think about how to solve the problem&lt;/li&gt;
&lt;li&gt;😉 He started to apply Terraform to declare all infrastructure&lt;/li&gt;
&lt;li&gt;🤩 He created a pocket AWS environment to learn analytics and called it &lt;em&gt;terraglue&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🤔 He noticed that new projects could be created from &lt;em&gt;terraglue&lt;/em&gt; in order to make it more scalable&lt;/li&gt;
&lt;li&gt;🚀 He now has a shelf of open source solutions that can help many people on learning AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's the real story about how I faced a huge problem on my learning journey and used Terraform to declare AWS components and take my learning experience to the next level.&lt;/p&gt;

&lt;p&gt;I really hope the solutions presented here can be useful for anyone who needs to learn more about analytics services on AWS.&lt;/p&gt;

&lt;p&gt;Finally, if you like this story, don't forget to interact with me, star the repos, comment and leave your feedback. Thank you!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;AWS Glue&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS - Glue Official Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html" rel="noopener noreferrer"&gt;AWS - Jobs Parameters Used by AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog" rel="noopener noreferrer"&gt;AWS - GlueContext Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html" rel="noopener noreferrer"&gt;AWS - DynamicFrame Class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/50992655/etl-job-failing-with-pyspark-sql-utils-analysisexception-in-aws-glue" rel="noopener noreferrer"&gt;Stack Overflow - Job Failing by Job Bookmark Issue - Empty DataFrame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html" rel="noopener noreferrer"&gt;AWS - Calling AWS Glue APIs in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#aws-glue-programming-python-libraries-zipping" rel="noopener noreferrer"&gt;AWS - Using Python Libraries with AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/53718221/aws-glue-data-catalog-temporary-tables-and-apache-spark-createorreplacetempview" rel="noopener noreferrer"&gt;Spark Temporary Tables in Glue Jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.plainenglish.io/understanding-all-aws-glue-import-statements-and-why-we-need-them-59279c402224" rel="noopener noreferrer"&gt;Medium - Understanding All AWS Glue Import Statements and Why We Need Them&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/" rel="noopener noreferrer"&gt;AWS - Develop and test AWS Glue jobs Locally using Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html" rel="noopener noreferrer"&gt;AWS - Creating OpenID Connect (OIDC) identity providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Terraform&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform - Hashicorp Terraform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/expressions/conditionals" rel="noopener noreferrer"&gt;Terraform - Conditional Expressions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/68911814/combine-count-and-for-each-is-not-possible" rel="noopener noreferrer"&gt;Stack Overflow - combine "count" and "for_each" on Terraform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Apache Spark&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sparkbyexamples.com/pyspark/pyspark-sql-date-and-timestamp-functions/" rel="noopener noreferrer"&gt;SparkByExamples - Pyspark Date Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/docs/latest/configuration.html" rel="noopener noreferrer"&gt;Spark - Configuration Properties&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce" rel="noopener noreferrer"&gt;Stack Overflow - repartition() vs coalesce()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;GitHub&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.conventionalcommits.org/en/v1.0.0/#summary" rel="noopener noreferrer"&gt;Conventional Commits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://semver.org/" rel="noopener noreferrer"&gt;Semantic Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/angular/angular/blob/main/CONTRIBUTING.md#-commit-message-format" rel="noopener noreferrer"&gt;GitHub - Angular Commit Message Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/conventional-changelog/commitlint" rel="noopener noreferrer"&gt;GitHub - commitlint&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shields.io/" rel="noopener noreferrer"&gt;shields.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.codecov.com/docs" rel="noopener noreferrer"&gt;Codecoverage - docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/marketplace?type=actions" rel="noopener noreferrer"&gt;GitHub Actions Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://endjin.com/blog/2022/09/continuous-integration-with-github-actions" rel="noopener noreferrer"&gt;Continuous Integration with GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect" rel="noopener noreferrer"&gt;GitHub - About security hardening with OpenID Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.eliasbrange.dev/posts/secure-aws-deploys-from-github-actions-with-oidc/#:~:text=To%20be%20able%20to%20authenticate,Provider%20type%2C%20select%20OpenID%20Connect." rel="noopener noreferrer"&gt;GitHub - Securing deployments to AWS from GitHub Actions with OpenID Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions" rel="noopener noreferrer"&gt;GitHub - Workflow syntax for GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=L1f6N6NcgPw&amp;amp;t=3043s&amp;amp;ab_channel=EduardoMendes" rel="noopener noreferrer"&gt;Eduardo Mendes - Live de Python #170 - GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Docker&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/marketplace/actions/docker-run-action" rel="noopener noreferrer"&gt;GitHub Docker Run Action&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aschmelyun.com/blog/using-docker-run-inside-of-github-actions/" rel="noopener noreferrer"&gt;Using Docker Run inside of GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/62546743/running-aws-glue-jobs-in-docker-container-outputs-com-amazonaws-sdkclientexcep" rel="noopener noreferrer"&gt;Stack Overflow - Unable to find region when running docker locally&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Tests&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=MjQCvJmc31A&amp;amp;" rel="noopener noreferrer"&gt;Eduardo Mendes - Live de Python #167 - Pytest: Uma Introdução&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=sidi9Z_IkLU&amp;amp;t" rel="noopener noreferrer"&gt;Eduardo Mendes - Live de Python #168 - Pytest Fixtures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=uzVewG8M6r0&amp;amp;t=1127s" rel="noopener noreferrer"&gt;Databricks - Data + AI Summit 2022 - Learn to Efficiently Test ETL Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-testing/" rel="noopener noreferrer"&gt;Real Python - Getting Started with Testing in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.inspiredpython.com/article/five-advanced-pytest-fixture-patterns" rel="noopener noreferrer"&gt;Inspired Python - Five Advanced Pytest Fixture Patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/getmoto/moto/blob/master/tests/test_glue/fixtures/datacatalog.py" rel="noopener noreferrer"&gt;getmoto/moto - mock inputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://about.codecov.io/blog/should-i-include-test-files-in-code-coverage-calculations/" rel="noopener noreferrer"&gt;Codecov - Do test files belong in code coverage calculations?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://issues.jenkins.io/browse/JENKINS-63177" rel="noopener noreferrer"&gt;Jenkins Issue: Endpoint does not contain a valid host name&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Others&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/pulse/difference-between-system-record-source-truth-santosh-kudva/" rel="noopener noreferrer"&gt;Differences between System of Record and Source of Truth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce" rel="noopener noreferrer"&gt;Olist Brazilian E-Commerce Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/6843549/are-there-any-benefits-from-using-a-staticmethod" rel="noopener noreferrer"&gt;Stack Overflow - @staticmethod&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>glue</category>
      <category>terraform</category>
      <category>analytics</category>
      <category>iac</category>
    </item>
    <item>
      <title>Improving ETL jobs on AWS with sparksnake</title>
      <dc:creator>Thiago Panini</dc:creator>
      <pubDate>Mon, 20 Mar 2023 21:53:15 +0000</pubDate>
      <link>https://dev.to/aws-builders/improving-etl-jobs-on-aws-with-sparksnake-2e35</link>
      <guid>https://dev.to/aws-builders/improving-etl-jobs-on-aws-with-sparksnake-2e35</guid>
<description>&lt;p&gt;Have you ever wished for a set of Spark features and reusable code blocks that could improve, once and for all, your experience developing Spark applications on AWS services like Glue and EMR? In this article I'll introduce &lt;a href="https://sparksnake.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;sparksnake&lt;/code&gt;&lt;/a&gt;, a Python package that can be a game changer for Spark application development on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea behind sparksnake
&lt;/h2&gt;

&lt;p&gt;To understand the main reasons for bringing &lt;code&gt;sparksnake&lt;/code&gt; to life, let's first take a quick look at the boilerplate code presented whenever a new job is created in the AWS console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt;

&lt;span class="c1"&gt;## @params: [JOB_NAME]
&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let me show you two simple perspectives from Glue users at different experience levels.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Beginner:&lt;/strong&gt; it's reasonable to say that the code block above isn't something we see every day outside Glue, right? So people trying Glue for the first time are likely to have questions about elements like GlueContext, Job, and their special methods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experienced Developer:&lt;/strong&gt; even for this group, the Glue setup can be painful (especially because it has to be repeated every single time a new job is started).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, the main idea behind &lt;code&gt;sparksnake&lt;/code&gt; is to take every common step of a Spark application developed on AWS services and encapsulate it in classes and methods that users can call with a single line of code. In other words, all the boilerplate code shown above is replaced in &lt;code&gt;sparksnake&lt;/code&gt; by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Importing sparksnake's main class
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sparksnake.manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkETLManager&lt;/span&gt;

&lt;span class="c1"&gt;# Initializing a glue job
&lt;/span&gt;&lt;span class="n"&gt;spark_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkETLManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init_job&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just one of a series of features available in &lt;code&gt;sparksnake&lt;/code&gt;! The great thing about it is the ability to call the same methods and functions for common Spark features whether your jobs run on AWS services like Glue and EMR or locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ohkolrquk9fn7ys5lkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ohkolrquk9fn7ys5lkn.png" alt="A simple diagram showing how sparksnake package inherits features from AWS services like Glue and EMR to provide users a custom experience" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The library structure
&lt;/h2&gt;

&lt;p&gt;After this quick overview of &lt;code&gt;sparksnake&lt;/code&gt;, it's worth understanding a little more about how the library is structured under the hood.&lt;/p&gt;

&lt;p&gt;At the time of writing, the package has two modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;manager&lt;/code&gt;: the central module, hosting the &lt;code&gt;SparkETLManager&lt;/code&gt; class with common Spark features. It inherits features from other classes based on the operation mode chosen by the user&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;glue&lt;/code&gt;: a companion module hosting the &lt;code&gt;GlueJobManager&lt;/code&gt; class with features specific to Glue jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the common usage pattern, users import the &lt;code&gt;SparkETLManager&lt;/code&gt; class and choose an operation mode according to where the Spark application will be developed and deployed. The operation mode tells &lt;code&gt;SparkETLManager&lt;/code&gt; which class to inherit features from, providing a custom experience on top of AWS services like Glue and EMR.&lt;/p&gt;
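&lt;p&gt;To make the idea of an operation mode more concrete, here is a minimal, illustrative Python sketch of a mode-based dispatch pattern. This is &lt;em&gt;not&lt;/em&gt; &lt;code&gt;sparksnake&lt;/code&gt;'s actual source code, and the class names below (&lt;code&gt;ETLManager&lt;/code&gt;, &lt;code&gt;GlueFeatures&lt;/code&gt;, &lt;code&gt;LocalFeatures&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```python
# Illustrative sketch only: a facade class picks a feature set based on
# the operation mode chosen by the user. All class names here are
# hypothetical and do NOT come from sparksnake's source code.

class BaseFeatures:
    """Helpers shared by every operation mode."""
    mode: str = "base"

    def describe(self) -> str:
        return f"running in '{self.mode}' mode"

class GlueFeatures(BaseFeatures):
    """Features that only make sense inside an AWS Glue job."""
    def init_job(self) -> str:
        return "glue job context initialized"

class LocalFeatures(BaseFeatures):
    """Features for developing and testing Spark code locally."""
    def init_job(self) -> str:
        return "local spark session initialized"

class ETLManager:
    """Facade: routes the user to a feature class based on the mode."""
    _modes = {"glue": GlueFeatures, "local": LocalFeatures}

    def __new__(cls, mode: str):
        try:
            impl = cls._modes[mode]
        except KeyError:
            raise ValueError(f"unsupported mode: {mode!r}")
        instance = impl()
        instance.mode = mode
        return instance

manager = ETLManager(mode="glue")
print(manager.describe())   # running in 'glue' mode
print(manager.init_job())   # glue job context initialized
```

&lt;p&gt;Swapping &lt;code&gt;mode="glue"&lt;/code&gt; for &lt;code&gt;mode="local"&lt;/code&gt; routes the same calls to the local feature set, which is the essence of the pattern described above.&lt;/p&gt;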

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;Now that you know the main concepts behind &lt;code&gt;sparksnake&lt;/code&gt;, let's summarize some of its features:&lt;/p&gt;

&lt;p&gt;🤖 An enhanced development experience for Spark applications deployed as jobs on AWS services like Glue and EMR&lt;br&gt;
🌟 Common Spark operations for improving ETL steps, available through custom classes and methods&lt;br&gt;
⚙️ No need to worry about hard and complex service setup (e.g. with sparksnake you can get all the elements of a Glue job on AWS with a single line of code)&lt;br&gt;
👁️‍🗨️ Improved application observability with detailed log messages in CloudWatch&lt;br&gt;
🛠️ Exception handling already embedded in library methods&lt;/p&gt;
&lt;h2&gt;
  
  
  A quickstart
&lt;/h2&gt;

&lt;p&gt;To start using &lt;code&gt;sparksnake&lt;/code&gt;, just install it using pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sparksnake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's say we are developing a new Glue job on AWS and want to use &lt;code&gt;sparksnake&lt;/code&gt; to make things easier. To show how powerful the library can be, imagine the job has to read a series of data sources. It would be very painful to write multiple lines of code to read each data source from the catalog.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;sparksnake&lt;/code&gt;, we can read multiple data sources from the catalog with a single line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generating a dictionary of Spark DataFrames from catalog
&lt;/span&gt;&lt;span class="n"&gt;dfs_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_dataframes_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Indexing to get individual DataFrames
&lt;/span&gt;&lt;span class="n"&gt;df_orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfs_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And what about writing data to S3 and cataloging it in the Data Catalog? No worries, that can be done with a single line of code too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Writing data on S3 and cataloging on Data Catalog
&lt;/span&gt;&lt;span class="n"&gt;spark_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_and_catalog_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s3_table_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket-name/table-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_database_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;partition_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partition-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_data_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# e.g. "parquet"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, these are only two examples from a series of features already available in the library. This article was written to offer all users a different way to learn and to sharpen their skills with Spark applications on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more
&lt;/h2&gt;

&lt;p&gt;There are some useful links and docs about sparksnake. Check them out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sparksnake.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;sparksnake.readthedocs.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ThiagoPanini/sparksnake" rel="noopener noreferrer"&gt;ThiagoPanini/sparksnake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/sparksnake/" rel="noopener noreferrer"&gt;PyPI/sparksnake&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>spark</category>
      <category>python</category>
      <category>etl</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
