<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kyle Escosia</title>
    <description>The latest articles on DEV Community by Kyle Escosia (@escosiakyle).</description>
    <link>https://dev.to/escosiakyle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624370%2F9bc3b695-a2e8-4b55-9ce6-4af478526869.jpg</url>
      <title>DEV Community: Kyle Escosia</title>
      <link>https://dev.to/escosiakyle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/escosiakyle"/>
    <language>en</language>
    <item>
      <title>Building a Pinoy-themed Game using Amazon Q CLI</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Mon, 30 Jun 2025 06:39:44 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/building-a-pinoy-themed-game-using-amazon-q-cli-1ge</link>
      <guid>https://dev.to/awscommunity-asean/building-a-pinoy-themed-game-using-amazon-q-cli-1ge</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I remember the very first game I built back in my college days (circa 2016). Our Rizal professor wanted us to think about how we could apply what we do in our course to create awareness for our national hero, Jose Rizal. Basically: how would you relate Information Technology to Rizal?&lt;/p&gt;

&lt;p&gt;The first thing that came to mind was to build a fighting game featuring the life of Rizal throughout the Spanish period, written in Java since we had just learned it the previous semester. It doesn't look like much, but it goes like this:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/nBwq1O50q5M"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;So when I encountered an article about building games with Amazon Q CLI, I thought it was a good opportunity to reimagine my game. And since we Filipinos celebrate our Independence 🇵🇭 this June, it made me a bit nostalgic. This time, though, I wanted to incorporate my work experience, which is Data Engineering. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up
&lt;/h2&gt;

&lt;p&gt;Getting Amazon Q CLI up and running is actually very straightforward. I followed the guide from AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html" rel="noopener noreferrer"&gt;Installing Amazon Q for command line&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You do need to &lt;a href="https://docs.aws.amazon.com/signin/latest/userguide/create-aws_builder_id.html" rel="noopener noreferrer"&gt;Create your AWS Builder ID&lt;/a&gt;. But after that, Amazon Q is all yours! &lt;/p&gt;

&lt;h2&gt;
  
  
  My attempt
&lt;/h2&gt;

&lt;p&gt;Now, for the prompt, I wanted to be as specific as possible. Here's what I came up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I want a turn-based SQL fighting game that will allow me to fight through bosses using a chosen character. 
I want to have Philippines as a theme for my game. Since we celebrate our independence in the month of June. 
The game should be like Street fighter/Tekken. Where characters face off with each other. 
The battle system is like Pokemon, each character's actions are based on whatever action the user will choose.

For the game, I want the following mechanics.

Background and goal:

Hero - my "Bayani" starts unarmed in a Pre-Hispanic tutorial level, then faces colonial “bosses” as he goes through times

Levels tied to eras – pre-Spanish, Spanish, American, Japanese, final Independence Day showdown (June 12 1898)

Each level introduces new tables and sql query concepts (SELECT basics, JOINs, aggregates, window functions)

Game Mechanics:

SQL-type quizzes

On each turn, the game asks a quiz question about the current battle’s database

Quiz-type sql puzzles
- Each action corresponds to a quiz question on a specific SQL concept
- Questions appear as multiple choice or fill in inputs
- Correct answer executes the chosen action’s animation and adjusts the HP or shield values accordingly, 
  important to note that boss also attacks afterwards
- Wrong answer causes the boss to attack instead, dealing damage to the your player

Combat Actions
- Attack – executes by answering a SELECT/WHERE question correctly and deals damage to the boss’s HP
- Defend – executes by answering an aggregates question and grants a shield that reduces incoming damage
- Heal – executes by answering a set-operation question and restores a portion of the hero’s HP
- Special Move - unlock by a successful multi-step quiz (CTEs or window functions) which unleashes a high-damage combo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, I already knew what I wanted, but feel free to collaborate with Amazon Q on your game. At the end of the day, it's all about what you want your game to look like, so have fun with the process!&lt;/p&gt;

&lt;p&gt;One prompt, and it got down to business creating all sorts of scripts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y25uow33n313ezsvg8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y25uow33n313ezsvg8v.png" alt="amazon-q-cli-generating-code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing that I liked about Amazon Q was that it automatically knew it needed to test the scripts and created its own test cases as part of the workflow. Most chatbots don't do this out of the box; you need to explicitly ask them to create test scenarios and execute them. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lsnr3e04xi867sfgglm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lsnr3e04xi867sfgglm.png" alt="amazon-q-cli-test-cases"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It also creates documentation covering the game mechanics, the files it created, and how to play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu08uf7g7y5dtashcwbdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu08uf7g7y5dtashcwbdn.png" alt="amazon-q-cli-generates-documentations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the game, here's what I had:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2rnbg0d1izcx7yvee8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2rnbg0d1izcx7yvee8f.png" alt="amazon-q-cli-game"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Improvements
&lt;/h2&gt;

&lt;p&gt;After one prompt, Amazon Q was able to create a CLI-based SQL game that lets users play through the eras, with each boss asking different SQL questions.&lt;/p&gt;

&lt;p&gt;But there is one problem: not everyone wants a CLI-based game. I know I don't. I want something I can see and interact with using my mouse.&lt;/p&gt;

&lt;p&gt;I also downloaded asset packs from &lt;a href="https://ansimuz.itch.io/gothicvania-patreon-collection" rel="noopener noreferrer"&gt;Ansimuz's Legacy Collection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's another prompt. I didn't think too much about it; I kept it simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This is a good start. Please give it a UI. Use PyGame. 
I also have assets under /assets folder. 
Please make use of it as sprites.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Amazon Q CLI proceeded to make the necessary adjustments&lt;br&gt;
and came up with this:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/pltPFiSxQ7Q"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Looks good, but then I was curious about extending it further, so I asked Amazon Q for improvements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Let's work on improvements. Can you suggest? 
Please list down your suggestions before implementing. 
I want improvements on the battle system, game mechanics, questions, and animations.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6eosquh84irf0u8hkan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6eosquh84irf0u8hkan.png" alt="amazon-q-cli-improvement-suggestions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of these are good suggestions, but I like that Amazon Q also prioritizes them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkybd0110ltnxisjmck9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkybd0110ltnxisjmck9x.png" alt="amazon-q-cli-improvement-prioritizations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I decided to go with the Phase 1 changes, but I had some suggestions as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please implement high-priority improvements. 
I don't want an interactive SQL Editor yet as it is complex to implement. 
For the list of SQL Questions, can you please add more? 
It seems that the user can just choose the same questions over and over, it should vary per turn.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I noticed while playing the game was that the questions were very limited. There was only one question per action across the battle, so the player could just memorize the answers and keep attacking. That's no fun :) &lt;/p&gt;

&lt;p&gt;Again, it made all the necessary adjustments. Here's the final output:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/qZjgU-2WTJw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Check out the full game on my GitHub:&lt;br&gt;
&lt;a href="https://github.com/klescosia/bayani-sql-fighter" rel="noopener noreferrer"&gt;https://github.com/klescosia/bayani-sql-fighter&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered assistants have come a long way since their inception.&lt;/li&gt;
&lt;li&gt;What makes Amazon Q special is its training data. It is fine-tuned on years of AWS knowledge, best practices, resources, and well-architected patterns.&lt;/li&gt;
&lt;li&gt;This tool enabled me to quickly build a working application in just one prompt.&lt;/li&gt;
&lt;li&gt;It can automate most development tasks so that users can focus on delivering value, though it's important to take the output with a grain of salt and do your due diligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, it was a fun experience, and I definitely would like to improve on this further. I could spend the whole day talking to Amazon Q CLI, but maybe that's for another blog or video. I'll keep you posted! I highly recommend Amazon Q; I think it's one of those useful tools that can really help you in your development. Did I mention that you can also use Amazon Q in your favorite IDE? &lt;/p&gt;

&lt;p&gt;Check this out:&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-in-IDE.html" rel="noopener noreferrer"&gt;Using Amazon Q Developer in the IDE.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;br&gt;
 &lt;/p&gt;

</description>
      <category>aws</category>
      <category>awschallenge</category>
      <category>ai</category>
      <category>sql</category>
    </item>
    <item>
      <title>Securing Amazon Redshift - Best Practices for Access Control</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Sun, 12 Jan 2025 16:59:14 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/securing-amazon-redshift-best-practices-for-access-control-15l6</link>
      <guid>https://dev.to/awscommunity-asean/securing-amazon-redshift-best-practices-for-access-control-15l6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Not long ago, I had the chance to conduct a knowledge transfer session focused on access management in Amazon Redshift for our partner client. As I started researching the topic, I realized something surprising - while there’s plenty of material explaining Redshift’s features, &lt;strong&gt;finding clear best practices&lt;/strong&gt; or &lt;strong&gt;structured approaches&lt;/strong&gt; for managing access was a challenge. Most of the information available felt basic and lacked the depth I was looking for. Yes, I also watched re:Invent videos.&lt;/p&gt;

&lt;p&gt;This realization motivated me to dive deeper and build a better understanding of how to effectively secure Redshift. In this blog, I want to share what I’ve learned - not just the technical details, but practical strategies and insights I’ve gained from working hands-on with Redshift access management. Whether you’re setting up a new cluster or refining an existing one, I hope this guide gives you the tools and confidence to tackle access management with clarity and purpose. &lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Redshift Access Management
&lt;/h2&gt;

&lt;p&gt;Of course, before we can apply these settings in Redshift, we need to understand how its security framework works. &lt;/p&gt;

&lt;h3&gt;
  
  
  Redshift Built-In Security and Compliance
&lt;/h3&gt;

&lt;p&gt;Let’s review the core components of Redshift security.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For this blog, I’ll skip VPC isolation since, if you’ve been using AWS for a while, you’re likely familiar with VPC-supported services.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Redshift supports multiple authentication methods to verify user identities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/generating-user-credentials.html" rel="noopener noreferrer"&gt;Using IAM authentication to generate database user credentials&lt;/a&gt; - You can use AWS IAM (Users, Roles) to manage user access.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/federated-authentication-to-amazon-redshift-using-aws-single-sign-on/" rel="noopener noreferrer"&gt;Single Sign-On (SSO)&lt;/a&gt; - Simplifies user access by leveraging corporate identity providers. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/amazon-redshift-identity-federation-with-multi-factor-authentication/" rel="noopener noreferrer"&gt;MFA (Multi-Factor Authentication)&lt;/a&gt; - You can enforce MFA for added security.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Credentials&lt;/strong&gt; - Traditional username and password-based authentication. Make sure to &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-secrets-manager-integration.html" rel="noopener noreferrer"&gt;store your superuser or admin credentials in AWS Secrets Manager&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once users are authenticated, Redshift manages what they can access through robust authorization features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default Permissions&lt;/strong&gt; - By default, only the owner of a database object (schema, tables, views) can modify or delete it, ensuring secure defaults. &lt;strong&gt;THIS CONCEPT IS IMPORTANT.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users and Groups&lt;/strong&gt; - Permissions can be granted to individual users or groups, allowing for more efficient access control. Groups are always advisable so that you avoid the pain of managing permissions user by user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt; - Roles simplify managing permissions by grouping privileges and assigning them to users.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A Review on Authentication vs. Authorization&lt;/strong&gt;: &lt;br&gt;
Authentication verifies &lt;strong&gt;&lt;em&gt;who a user is&lt;/em&gt;&lt;/strong&gt;, ensuring only valid individuals or systems can connect to your Redshift cluster through methods like credentials, MFA, or federated SSO. Authorization, on the other hand, determines &lt;em&gt;&lt;strong&gt;what authenticated users can access and do within the system&lt;/strong&gt;&lt;/em&gt;, such as querying specific tables, viewing certain rows, or accessing sensitive columns.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;3. Data Encryption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security isn’t just about controlling access; it’s also about protecting the data itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data at Rest&lt;/strong&gt; - Encrypted using AWS Key Management Service (KMS) or customer-managed keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data in Transit&lt;/strong&gt; - Protected with SSL to ensure secure communication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Data Encryption&lt;/strong&gt; - Ensures sensitive data remains protected even during data loading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Advanced Access Management Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column-Level Security (CLS)&lt;/strong&gt; - Restricts access to sensitive columns (e.g., SSNs, mobile numbers, credit cards) while allowing access to non-sensitive data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row-Level Security (RLS)&lt;/strong&gt; - Ensures users see only the rows they are authorized to access, based on attributes or roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Data Masking&lt;/strong&gt; - Masks sensitive information for users without full permissions, providing an extra layer of data protection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Compliance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Third-party auditors assess the security and compliance of Amazon Redshift as part of multiple AWS compliance programs. These include SOC, PCI, FedRAMP, HIPAA, and others. &lt;/p&gt;

&lt;p&gt;Here's a link for the list: &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/security-compliance.html" rel="noopener noreferrer"&gt;Compliance validation for Amazon Redshift&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Understanding these foundational components is the first step toward mastering Redshift access management, since all of these components work together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default Permissions
&lt;/h3&gt;

&lt;p&gt;I want to expand a bit more on default permissions in Redshift. In my experience, overlooking how these default settings interact with specific roles or use cases can lead to prolonged back-and-forth conversations between users and administrators, especially when working across time zones. This is why it’s essential to understand how these permissions work and how to adjust them as needed.&lt;/p&gt;

&lt;p&gt;New users are assigned a set of default permissions that determine their initial access rights. These permissions provide a basic level of access to the database, allowing users to perform common actions while restricting access to sensitive areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Principle of Object Ownership&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a user creates an object (e.g., a table, view, or schema) in Redshift, they automatically become its owner.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Rule&lt;/strong&gt; - Only the owner has permission to modify or drop the object unless they &lt;strong&gt;explicitly grant those permissions to others&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;This principle ensures that no unauthorized user can tamper with critical objects, even if they have general access to the schema (see the short sketch after this list).&lt;/li&gt;

&lt;/ul&gt;
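
&lt;p&gt;To make the ownership rule concrete, here's a minimal sketch, assuming a hypothetical &lt;code&gt;sales.orders&lt;/code&gt; table owned by the current user, plus hypothetical &lt;code&gt;reporting_user&lt;/code&gt; and &lt;code&gt;etl_service_user&lt;/code&gt; users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: only the owner of sales.orders can share or drop it
GRANT SELECT ON sales.orders TO reporting_user;      -- the owner explicitly shares read access
ALTER TABLE sales.orders OWNER TO etl_service_user;  -- ...or transfers ownership altogether
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;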

&lt;p&gt;&lt;strong&gt;&lt;code&gt;GRANT&lt;/code&gt; and &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;GRANT&lt;/code&gt;&lt;/strong&gt; statement provides access to &lt;strong&gt;existing objects&lt;/strong&gt; in Redshift, such as tables and views, for users, groups, or roles.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt;&lt;/strong&gt; statement is used to manage permissions for &lt;strong&gt;future objects that will be created in a schema&lt;/strong&gt;. This is especially useful in collaborative environments where you want new objects to automatically inherit specific access rules.&lt;/p&gt;
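
&lt;p&gt;Here's a minimal sketch of the two statements side by side, using a hypothetical &lt;code&gt;sales&lt;/code&gt; schema, a &lt;code&gt;reporting_readers&lt;/code&gt; group, and an &lt;code&gt;etl_service_user&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- GRANT covers objects that already exist in the schema
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP reporting_readers;

-- ALTER DEFAULT PRIVILEGES covers tables that etl_service_user will create later
ALTER DEFAULT PRIVILEGES FOR USER etl_service_user IN SCHEMA sales
GRANT SELECT ON TABLES TO GROUP reporting_readers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;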

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Regranting Permissions After Object Modification&lt;/strong&gt; - If a table or view is recreated (e.g., dropped and re-created), all previously granted permissions are lost and must be reapplied. Use the &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt; statement to handle this, or execute the &lt;code&gt;GRANT&lt;/code&gt; statement again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Advanced Access Management Features
&lt;/h2&gt;

&lt;p&gt;One of the requirements of a modern data platform is the ability to provide granular access control. Amazon Redshift delivers this through features such as Column-Level Security (CLS), Row-Level Security (RLS), and Dynamic Data Masking, ensuring users see only what they’re authorized to see. Let's go through them one-by-one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Column-Level Security (CLS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Column-Level Security restricts access to &lt;strong&gt;sensitive columns&lt;/strong&gt; within a table while allowing users to interact with non-sensitive data. This prevents unnecessary exposure of information such as personal identifiers or payment details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;customer_service_representatives&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating a view is also an option and can sometimes be a  straightforward and effective way to control column-level access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;customer_info&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also change this into a &lt;code&gt;MATERIALIZED VIEW&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Row-Level Security (RLS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RLS restricts access to specific rows within a table based on user roles or attributes, ensuring that users see only the data relevant to them.&lt;/p&gt;

&lt;p&gt;This feature is essential for scenarios where data segregation is required, such as multi-tenant environments or organizations with hierarchical roles (e.g., regional managers, department leads).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;RLS&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;region_a_policy&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Region A'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="n"&gt;RLS&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;region_a_policy&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt;
&lt;span class="n"&gt;sales_manager&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;SECURITY&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;CREATE RLS POLICY&lt;/code&gt; statement defines the filtering condition (region = 'Region A').&lt;/li&gt;
&lt;li&gt;The policy is attached to the &lt;code&gt;sales_data&lt;/code&gt; table and applies only to users assigned the &lt;code&gt;sales_manager&lt;/code&gt; role.&lt;/li&gt;
&lt;li&gt;Row-level security is enabled for the table with &lt;code&gt;ALTER TABLE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, you can always use a &lt;code&gt;VIEW&lt;/code&gt; for this. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxkdkq7v9ofuepw5vl0n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxkdkq7v9ofuepw5vl0n.jpg" alt="row-level-security" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;
An example of a row-level security policy



&lt;p&gt;&lt;strong&gt;3. Dynamic Data Masking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynamic Data Masking (DDM) is a powerful feature in Amazon Redshift that protects sensitive data by replacing it with masked values when accessed by users without full permissions. Unlike traditional encryption, which hides data entirely, masking obfuscates sensitive fields while maintaining their usability for tasks like reporting and analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Obfuscation&lt;/strong&gt; - Sensitive information is replaced with masked values when accessed by users without full permissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conditional Masking&lt;/strong&gt; - Masking can be applied based on user roles or attributes, ensuring the right users see appropriate levels of data.&lt;/p&gt;

&lt;p&gt;The masking is applied at query runtime, ensuring that the underlying data remains unchanged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvi3uwmaj82oa7mya8r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvi3uwmaj82oa7mya8r8.png" alt="dynamic-data-masking-redshift" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
Dynamic Data Masking in Redshift



&lt;p&gt;&lt;strong&gt;How to implement:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a &lt;code&gt;MASKING POLICY&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define how sensitive data should be masked.&lt;/p&gt;

&lt;p&gt;Example: Masking a credit card number with a fixed placeholder pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;mask_credit_card_full&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'000000XXXX0000'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Attach the &lt;code&gt;MASKING POLICY&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply the masking policy to the desired column and restrict access based on &lt;code&gt;ROLES&lt;/code&gt;. You can also attach this to &lt;code&gt;USERS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Example: Attaching the policy to the credit_card column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;mask_credit_card&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;credit_cards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;customer_support_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Test the Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When users with the &lt;code&gt;customer_support_role&lt;/code&gt; query the &lt;code&gt;credit_card&lt;/code&gt; column, they'll see masked values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Result:
****-****-****-1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users with elevated permissions (e.g., administrators), however, will see the full credit card number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced: Custom Masking Logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In cases where more complex masking logic is needed, you can define custom functions written in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;REDACT_CREDIT_CARD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;IMMUTABLE&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
    &lt;span class="n"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
    &lt;span class="n"&gt;regexp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"^([0-9]{6})[0-9]{5,6}([0-9]{4})"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;regexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"000000"&lt;/span&gt;
        &lt;span class="k"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"0000"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"{first}XXXXX{last}"&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpythonu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;custom_mask_credit_card&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;REDACT_CREDIT_CARD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;custom_mask_credit_card&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;credit_cards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;data_analyst_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiokvhni6mrqbsemj54p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiokvhni6mrqbsemj54p.jpg" alt="dynamic-data-masking-policies-redshift" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;
Example of Masking Policies



&lt;h2&gt;
  
  
  Best Practices for Redshift Access Management
&lt;/h2&gt;

&lt;p&gt;While Amazon Redshift provides many options for different use cases, it can sometimes feel overwhelming to decide which one to use. The key is to simplify your approach by focusing on your specific needs and objectives, rather than trying to use every available feature. Remember your KISS principle, people :) - keep it simple, &lt;del&gt;stupid&lt;/del&gt; and straightforward.&lt;/p&gt;

&lt;p&gt;In this section, let’s use the example of a &lt;strong&gt;Sales Report System&lt;/strong&gt; where different teams, like sales analysts and regional managers, interact with data in various ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Role-Based Access Control (RBAC)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of assigning permissions to individual users, define roles that are related to their job responsibilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;sales_read_only&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;sales_read_only&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;sales_read_only&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;sales_analyst1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Roles can only be granted to &lt;strong&gt;USERS&lt;/strong&gt;, not GROUPS.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;USER&lt;/code&gt; can have multiple &lt;code&gt;ROLES&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. Least Privilege Principle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’ve been in the cloud domain for some time now, you’ve likely heard this phrase countless times - and for good reason. The Least Privilege Principle means granting users only the permissions they need to perform their specific tasks. This approach minimizes risks by limiting access to resources, ensuring users can do their job without accidentally compromising security.&lt;/p&gt;

&lt;p&gt;For instance, a &lt;strong&gt;Sales Analyst&lt;/strong&gt; needs access to view the &lt;code&gt;monthly_sales&lt;/code&gt; table but &lt;strong&gt;shouldn’t&lt;/strong&gt; be able to modify it. &lt;/p&gt;
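
&lt;p&gt;A minimal sketch of that setup, assuming a hypothetical &lt;code&gt;sales&lt;/code&gt; schema and a &lt;code&gt;sales_analyst1&lt;/code&gt; user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: grant read access only
GRANT USAGE ON SCHEMA sales TO sales_analyst1;
GRANT SELECT ON sales.monthly_sales TO sales_analyst1;
-- No INSERT, UPDATE, DELETE, or DROP is granted, so the analyst cannot modify the table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;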

&lt;p&gt;&lt;strong&gt;3. Use Groups for Database-Level Security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Group users into logical categories to manage permissions more efficiently. If you've worked with IAM in AWS, you'd be familiar with this concept.&lt;/p&gt;
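
&lt;p&gt;For example, here's a minimal sketch using a hypothetical &lt;code&gt;sales_analysts&lt;/code&gt; group and &lt;code&gt;sales&lt;/code&gt; schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: manage analysts through one group instead of per-user grants
CREATE GROUP sales_analysts;
ALTER GROUP sales_analysts ADD USER sales_analyst1, sales_analyst2;
GRANT USAGE ON SCHEMA sales TO GROUP sales_analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP sales_analysts;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;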

&lt;p&gt;&lt;strong&gt;4. Secure Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strengthen how users authenticate to your Redshift cluster to prevent unauthorized access. A secure authentication process is the first line of defense against security breaches.&lt;/p&gt;

&lt;p&gt;Store and manage database credentials securely using &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Implement Advanced Access Management Features&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Leverage Redshift’s built-in capabilities for fine-grained access control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Column-Level Security (CLS)&lt;/strong&gt; - Restrict access to sensitive fields&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Row-Level Security (RLS)&lt;/strong&gt; - Ensure users see only the rows they are authorized to access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these features are very useful when creating reports, though a key consideration is whether to apply them on the Redshift side or in BI tools like Power BI, Tableau, or QuickSight. Each approach has its trade-offs and should align with your overall data governance strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Data Masking (DDM)&lt;/strong&gt; - Mask sensitive fields like credit card numbers for users without full permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Use Admin Scripts and Tools for Access Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing and monitoring access in Amazon Redshift becomes much easier when you have the right tools and scripts in place. To ensure a secure and well-governed environment, it’s a game-changer to use tools like &lt;a href="https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminViews" rel="noopener noreferrer"&gt;Redshift Utils&lt;/a&gt;, which provide a collection of administrative views and scripts for monitoring and managing your Redshift clusters effectively.&lt;/p&gt;
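
&lt;p&gt;As a quick illustration (using built-in system views and functions rather than Redshift Utils itself, and a hypothetical &lt;code&gt;sales.monthly_sales&lt;/code&gt; table), you can audit who is able to read a given table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: list users and whether they can SELECT from sales.monthly_sales
SELECT u.usename,
       has_table_privilege(u.usename, 'sales.monthly_sales', 'select') AS can_select
FROM pg_user u
ORDER BY u.usename;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;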

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Managing access in Amazon Redshift is about keeping your data safe while making sure users can do their jobs effectively. Simple practices like creating roles for specific tasks, giving only the necessary permissions, and using features like column and row restrictions have worked well for us and for our partners. &lt;/p&gt;

&lt;p&gt;This approach has been effective in our case, but I’m always eager to learn from others. If you have similar practices, ideas, or constructive feedback, I’d love to hear them!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>redshift</category>
      <category>security</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Hello everyone, are you stuck with deploying your Glue Jobs to production? Here’s how I did it using AWS CDK and GitHub</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Tue, 07 Jan 2025 03:15:30 +0000</pubDate>
      <link>https://dev.to/escosiakyle/hello-everyone-are-you-stuck-with-deploying-your-glue-jobs-to-production-heres-how-i-did-it-1ok4</link>
      <guid>https://dev.to/escosiakyle/hello-everyone-are-you-stuck-with-deploying-your-glue-jobs-to-production-heres-how-i-did-it-1ok4</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/aws-builders" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png" alt="AWS Community Builders " width="350" height="350"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624370%2F9bc3b695-a2e8-4b55-9ce6-4af478526869.jpg" alt="" width="800" height="805"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/aws-builders/build-a-cicd-pipeline-using-aws-glue-aws-cdk-and-github-4m0j" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Build a CI/CD Pipeline Using AWS Glue, AWS CDK and GitHub&lt;/h2&gt;
      &lt;h3&gt;Kyle Escosia for AWS Community Builders  ・ Jan 7&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cicd&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Build a CI/CD Pipeline Using AWS Glue, AWS CDK and GitHub</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Tue, 07 Jan 2025 02:57:11 +0000</pubDate>
      <link>https://dev.to/aws-builders/build-a-cicd-pipeline-using-aws-glue-aws-cdk-and-github-4m0j</link>
      <guid>https://dev.to/aws-builders/build-a-cicd-pipeline-using-aws-glue-aws-cdk-and-github-4m0j</guid>
      <description>&lt;h2&gt;
  
  
  The Phantom Menace
&lt;/h2&gt;

&lt;p&gt;I’ve been a heavy user of AWS Glue since its early days, starting with version 0.9. It’s been a bit of a love-hate relationship—especially back then, when Glue jobs took what felt like an eternity to run. Over the years, though, Glue has come a long way. From upgrading Apache Spark to supporting modern data lake formats like &lt;strong&gt;Hudi&lt;/strong&gt;, &lt;strong&gt;Iceberg&lt;/strong&gt;, and &lt;strong&gt;Delta Lake&lt;/strong&gt;, to introducing &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-generative-ai-upgrades-for-apache-spark-in-aws-glue-preview/" rel="noopener noreferrer"&gt;Generative AI capabilities&lt;/a&gt;, Glue has evolved into a powerful tool for building scalable ETL solutions.&lt;/p&gt;

&lt;p&gt;But even as Glue has improved, I’ve found myself grappling with a different challenge—&lt;strong&gt;managing Glue jobs manually&lt;/strong&gt;. As I’ve built solutions for clients over the years, the lack of automation in provisioning and deploying Glue jobs has become a pain point.&lt;/p&gt;

&lt;p&gt;Back in the early days, we didn’t have CI/CD pipelines or automation tools to provision Glue jobs. Everything was done manually—configuring jobs, managing dependencies, and deploying them. At first, this seemed manageable for small-scale solutions, but as the complexity of pipelines grew, so did the problems. And I’ve always felt there was something wrong with that, since such a workflow creates room for error, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inconsistencies Across Environments (Dev, QA, Prod) &lt;/li&gt;
&lt;li&gt;Scaling issues - adding or modifying jobs manually is time-consuming and error-prone&lt;/li&gt;
&lt;li&gt;No Version Control - need I say more?&lt;/li&gt;
&lt;li&gt;Deployment Complexity - dependencies, configurations, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These challenges didn’t just slow the project down—they put the reliability of the data pipelines at risk, introducing errors and inefficiencies that could cascade into costly problems downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Hope
&lt;/h2&gt;

&lt;p&gt;Fast forward to today, and I’ve adopted a completely different approach. By using &lt;strong&gt;AWS CDK (Cloud Development Kit)&lt;/strong&gt; in combination with &lt;strong&gt;GitHub&lt;/strong&gt; (though this can also be GitLab, BitBucket), I’ve been able to solve many of these challenges. AWS CDK allows me to define Glue resources as &lt;strong&gt;Infrastructure-as-Code (IaC)&lt;/strong&gt;, ensuring consistency and scalability. Integrating this with GitHub and CI/CD workflows has made deploying Glue jobs faster, more reliable, and far less error-prone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Medallion Architecture as a Framework&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the data design patterns that is popular today is the Medallion Architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowivhseg0izufww48lcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowivhseg0izufww48lcd.png" alt="the-medallion-architecture" width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline that I built contains Glue scripts for the Bronze layer, Silver layer, and Gold layer. &lt;/p&gt;

&lt;h2&gt;
  
  
  Development Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbd8hloz64q0v5yo3ukw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbd8hloz64q0v5yo3ukw.png" alt="development-workflow" width="800" height="1426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure consistency and collaboration across the team, a structured development workflow is followed, as outlined in the diagram above. This workflow integrates tools like Jira for task tracking, connected to GitHub to map tickets to Git branches. The workflow is pretty standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23ez5zvqiksf9g1uldbk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23ez5zvqiksf9g1uldbk.jpeg" alt="darth-vader-meme" width="348" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You don't seem to believe me, eh? Read on :) &lt;/p&gt;




&lt;h2&gt;
  
  
  The Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;project-root/
├── ingestion/                         # For ingestion Glue jobs
│   ├── configs/      
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/                       # Scripts for ingestion Glue jobs
│   │   ├── dev-ingestion-script.py
│   │   └── prd-ingestion-script.py
│   ├── ingestion_stack.py             # CDK stack for ingestion jobs
│   └── README.md
│
├── standardization/                   # For standardization Glue jobs
│   ├── configs/
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/
│   │   ├── dev-standardization-script.py
│   │   └── prd-standardization-script.py
│   ├── standardization_stack.py    # CDK stack for standardization jobs
│   └── README.md
│
├── transformation/                   # For transformation Glue jobs
│   ├── configs/
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/
│   │   ├── dev-transformation-script.py
│   │   └── prd-transformation-script.py
│   ├── transformation_stack.py      # CDK stack for transformation jobs
│   └── README.md
│
├── loading/                         # For loading Glue jobs
│   ├── configs/
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/
│   │   ├── dev-loading-script.py
│   │   └── prd-loading-script.py
│   ├── loading_stack.py              # CDK stack for loading jobs
│   └── README.md
│
├── upload_script.py                  # Script to upload files to S3
├── app.py                            # Root entry point for AWS CDK
├── requirements.txt                  # Python dependencies for CDK
└── README.md                         # High-level project documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Config Files: Defaults and Customizations
&lt;/h2&gt;

&lt;p&gt;Each folder &lt;code&gt;(ingestion/, standardization/, transformation/, loading/)&lt;/code&gt; contains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;configs/&lt;/strong&gt; - Configuration files specific to that component. These can share a similar structure but should have unique data for each purpose.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;jobs.csv&lt;/strong&gt; - Defines the Glue jobs and their classifications. This file acts as the source of truth for the jobs you want to deploy. The columns can be adjusted as needed. Take note of the &lt;strong&gt;classification&lt;/strong&gt; as well; it will be used to identify whether to provision the job as &lt;strong&gt;default&lt;/strong&gt; or &lt;strong&gt;custom&lt;/strong&gt;, which is discussed below.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;JobName&lt;/th&gt;
&lt;th&gt;Classification&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;ConnectionName&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dim-products&lt;/td&gt;
&lt;td&gt;default&lt;/td&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;redshift-conn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dim-users&lt;/td&gt;
&lt;td&gt;default&lt;/td&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;redshift-conn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fact-sales&lt;/td&gt;
&lt;td&gt;custom&lt;/td&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;redshift-conn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;default_configs.yaml&lt;/strong&gt; - Separation of default and custom configurations gives me the &lt;strong&gt;flexibility&lt;/strong&gt; to manage AWS Glue jobs efficiently. With default configurations, I can define a baseline setup—for example, provisioning a Glue job with a &lt;code&gt;G.1X&lt;/code&gt; worker type and 2 DPUs. This ensures that most jobs follow a consistent and standardized configuration, reducing the need for repetitive definitions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;WorkerType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G.1X&lt;/span&gt;
&lt;span class="na"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;GlueVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0"&lt;/span&gt;
&lt;span class="na"&gt;ExecutionClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STANDARD&lt;/span&gt;
&lt;span class="na"&gt;IAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/glue-role"&lt;/span&gt;
&lt;span class="na"&gt;Command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glueetl"&lt;/span&gt;
  &lt;span class="na"&gt;PythonVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
&lt;span class="na"&gt;DefaultArguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-metrics"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--TempDir"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://default-bucket/temp/"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--job-language"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-glue-datacatalog"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spark-event-logs-path"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/logs/sparkHistoryLogs/"&lt;/span&gt;
&lt;span class="na"&gt;ScriptLocationBase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/cdk/scripts/transformation/"&lt;/span&gt;
&lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales"&lt;/span&gt;
  &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;custom_configs.yaml&lt;/strong&gt; - Custom configurations, on the other hand, allow me to handle exceptions where jobs require more specialized settings, like &lt;strong&gt;higher memory&lt;/strong&gt; or specific Spark arguments. By separating these two, I can keep the defaults simple and focused while tailoring individual jobs as needed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fact-sales&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;WorkerType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G.2X&lt;/span&gt;
  &lt;span class="na"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;GlueVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0"&lt;/span&gt;
  &lt;span class="na"&gt;ExecutionClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STANDARD&lt;/span&gt;
  &lt;span class="na"&gt;IAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/glue-role"&lt;/span&gt;
  &lt;span class="na"&gt;DefaultArguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-metrics"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--TempDir"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/temp/"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--job-language"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;
  &lt;span class="na"&gt;Command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glueetl"&lt;/span&gt;
    &lt;span class="na"&gt;PythonVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
  &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Dashboard"&lt;/span&gt;
    &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev"&lt;/span&gt;

&lt;span class="na"&gt;transformation-job2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;WorkerType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G.1X&lt;/span&gt;
  &lt;span class="na"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;GlueVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0"&lt;/span&gt;
  &lt;span class="na"&gt;ExecutionClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FLEX&lt;/span&gt;
  &lt;span class="na"&gt;IAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/glue-role"&lt;/span&gt;
  &lt;span class="na"&gt;DefaultArguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-continuous-cloudwatch-log"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Dashboard"&lt;/span&gt;
    &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;scripts/&lt;/strong&gt; - contains the actual Python-based Glue scripts specific to the component. &lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;

&lt;span class="c1"&gt;# Get job arguments
&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize GlueContext and SparkContext
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;

&lt;span class="c1"&gt;# Input and output paths (passed as parameters)
&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., s3://your-bucket/raw-data/customers/
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., s3://your-bucket/processed-data/dim_customers/
&lt;/span&gt;
&lt;span class="c1"&gt;# Load raw data into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inferSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Register the DataFrame as a temporary SQL table
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use Spark SQL to create the dimension table
&lt;/span&gt;&lt;span class="n"&gt;dimension_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
SELECT
    CAST(customer_id AS STRING) AS customer_id,
    first_name,
    last_name,
    email,
    CAST(date_of_birth AS DATE) AS date_of_birth,
    country,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
FROM
    raw_customers
WHERE
    country IS NOT NULL
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Execute the SQL query
&lt;/span&gt;&lt;span class="n"&gt;dimension_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter to include only the latest record per customer
&lt;/span&gt;&lt;span class="n"&gt;final_dimension_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dimension_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_num&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write the resulting DataFrame to S3 in Parquet format
&lt;/span&gt;&lt;span class="n"&gt;final_dimension_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dimension table created and saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;_stack.py&lt;/strong&gt; - AWS CDK stack for defining Glue jobs for that specific component.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_glue&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;glue&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CfnOutput&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;constructs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GlueTransformationStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A CDK stack for creating AWS Glue jobs based on configurations specified in CSV and YAML files.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_csv_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_config_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_config_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Initialize the GlueTransformationStack.

        :param scope: The scope in which to define this construct.
        :param id: The scoped construct ID.
        :param job_csv_file: Path to the CSV file containing job definitions.
        :param custom_config_file: Path to the YAML file containing custom configurations.
        :param default_config_file: Path to the YAML file containing default configurations.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load configurations from YAML files
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_config_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;custom_configs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_config_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;default_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Process each job in the CSV file
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_csv_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8-sig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;csv_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DictReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_job_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_job_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Process a single row from the CSV file and create a Glue job.

        :param row: A dictionary representing a row from the CSV file.
        :param custom_configs: A dictionary of custom job configurations.
        :param default_config: A dictionary of default configurations.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;job_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JobName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Classification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;connection_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ConnectionName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Determine job configuration: merge custom settings if available
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;job_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;

        &lt;span class="c1"&gt;# Ensure 'Command' is defined in the configuration
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

        &lt;span class="c1"&gt;# Set script location directly from the job name
&lt;/span&gt;        &lt;span class="n"&gt;script_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ScriptLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ScriptLocationBase&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;script_name&lt;/span&gt;
        &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ConnectionName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;connection_name&lt;/span&gt;

        &lt;span class="c1"&gt;# Merge tags from default and custom configurations
&lt;/span&gt;        &lt;span class="n"&gt;default_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;custom_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;combined_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;default_tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;custom_tags&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_glue_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combined_tags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_glue_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Create an AWS Glue job using the provided configuration.

        :param job_name: The name of the Glue job.
        :param job_config: A dictionary containing the job configuration.
        :param tags: Combined tags for the Glue job.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;glue_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IAMRole&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;glue_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GlueVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnJob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JobCommandProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;script_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ScriptLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;python_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PythonVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;default_arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DefaultArguments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;execution_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ExecutionClass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnJob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConnectionsListProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ConnectionName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;worker_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WorkerType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;number_of_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;


        &lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;Output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;glue_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name of the Glue job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automating Script Uploads
&lt;/h2&gt;

&lt;p&gt;This script dynamically uploads scripts from the &lt;code&gt;scripts/&lt;/code&gt; folder of each component to an S3 bucket. It takes the bucket name as a command-line argument; update the &lt;strong&gt;folders&lt;/strong&gt; mapping and the &lt;strong&gt;s3_destination&lt;/strong&gt; prefix to target specific components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upload_files_to_s3.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_files_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Uploads files from a local directory to an S3 bucket.

    :param local_directory: The local directory to upload.
    :param bucket_name: The name of the S3 bucket.
    :param s3_destination: The destination path in the S3 bucket.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;s3_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Walk through the local directory
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;local_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;relative_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relative_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Upload file to S3
&lt;/span&gt;            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully uploaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uploaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to upload &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to upload &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

    &lt;span class="c1"&gt;# Check command-line arguments
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: python upload_files_to_s3.py &amp;lt;bucket_name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Script called with insufficient arguments.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse the bucket name
&lt;/span&gt;    &lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Define the folders to process
&lt;/span&gt;    &lt;span class="n"&gt;folders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standardization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standardization/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformation/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loading/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Loop through each folder and upload files
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_directory&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;s3_destination&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glue-scripts/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Processing folder: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing folder: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;upload_files_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bringing It All Together with app.py
&lt;/h2&gt;

&lt;p&gt;The root-level CDK application orchestrates the deployment of all component-specific stacks (Ingestion, Standardization, Transformation, and Loading).&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ingestion.ingestion_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IngestionStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;standardization.standardization_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardizationStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformation.transformation_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TransformationStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;loading.loading_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoadingStack&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate each component stack
&lt;/span&gt;&lt;span class="nc"&gt;IngestionStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IngestionStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;StandardizationStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;StandardizationStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;TransformationStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TransformationStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;LoadingStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LoadingStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
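
&lt;p&gt;One note: the component stacks shown earlier expect the paths to &lt;code&gt;jobs.csv&lt;/code&gt;, &lt;code&gt;custom_configs.yaml&lt;/code&gt;, and &lt;code&gt;default_configs.yaml&lt;/code&gt; as constructor arguments. Here is a minimal sketch of how an instantiation could look with those paths passed in; the file locations are illustrative and depend on your repository layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: point each stack at its own component's config files
TransformationStack(
    app,
    "TransformationStack",
    job_csv_file="transformation/configs/jobs.csv",
    custom_config_file="transformation/configs/custom_configs.yaml",
    default_config_file="transformation/configs/default_configs.yaml",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;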



&lt;h2&gt;
  
  
  Local Testing
&lt;/h2&gt;

&lt;p&gt;Once you have these components, you can test locally using the following commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthesize the CloudFormation Templates&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk synth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploy to a Test Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please make sure that your local AWS credentials point to your development environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy IngestionStack &lt;span class="nt"&gt;--require-approval&lt;/span&gt; never
cdk deploy StandardizationStack &lt;span class="nt"&gt;--require-approval&lt;/span&gt; never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify Resources in AWS:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confirm that Glue jobs were created with the expected configurations.&lt;/p&gt;
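
&lt;p&gt;A quick way to do this, besides the console or &lt;code&gt;aws glue get-job&lt;/code&gt;, is a short boto3 snippet; the job name below is just one from the sample &lt;code&gt;jobs.csv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# Fetch one of the deployed jobs and print the settings we care about
glue_client = boto3.client("glue")
job = glue_client.get_job(JobName="dim-products")["Job"]

print(job["WorkerType"], job["NumberOfWorkers"], job["GlueVersion"], job["ExecutionClass"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;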

&lt;p&gt;&lt;strong&gt;(OPTIONAL) Purge the resources:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CI/CD Integration with GitHub
&lt;/h2&gt;

&lt;p&gt;Once you have the components ready, the next step is to push the code into a repository. In this section, we’ll explore how to push your code to a repository (like GitHub) and configure CI/CD to automatically deploy Glue jobs on each push to a specific branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/dtconsole/latest/userguide/connections-create-github.html" rel="noopener noreferrer"&gt;Authenticate GitHub to AWS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub Account&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create Repository&lt;/li&gt;
&lt;li&gt;Push the code into the repository&lt;/li&gt;
&lt;li&gt;Go to your repository and click on the &lt;strong&gt;Actions&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;New Workflow&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;When choosing a workflow, locate and choose &lt;strong&gt;set up a workflow yourself&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Paste the following YAML, changing values as necessary.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Deploy Data Lake Glue Jobs

on:
  push:
    branches:
      - main # Replace with your branch if different

permissions:
  id-token: write   # Required for requesting the JWT
  contents: read    # Required for actions/checkout

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
    # Step 1: Check out the code repository
    - name: Checkout Repository
      uses: actions/checkout@v3

    # Step 2: Configure AWS credentials using the IAM role
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4.0.2
      with:
        role-to-assume: ${{ secrets.AWS_IAM_ROLE }}
        aws-region: ${{ secrets.AWS_REGION }}
        role-session-name: GitHubActionsDeployment

    # Step 3: Set up Python
    - name: Setup Python
      uses: actions/setup-python@v5.1.0
      with:
        python-version: '3.10'
        cache: 'pip'

    # Step 4: Set up Node.js for AWS CDK
    - name: Setup Node.js
      uses: actions/setup-node@v4.0.0
      with:
        node-version: '21.2.0'

    # Step 5: Install AWS CDK CLI globally
    - name: Install AWS CDK
      run: npm install -g aws-cdk

    # Step 6: Verify CDK installation
    - name: Verify CDK Installation
      run: cdk --version

    # Step 7: Install Python dependencies globally
    - name: Install Python Dependencies
      run: pip install -r ./infrastructure/requirements.txt

    # Step 8: Upload Glue scripts for each layer
    - name: Upload Glue Scripts to S3
      run: python3 upload_files.py &amp;lt;your-bucket-name&amp;gt;

    # Step 9: Deploy all Glue stacks using AWS CDK
    - name: Deploy Glue Stacks
      run: cdk deploy IngestionStack StandardizationStack TransformationStack LoadingStack --require-approval never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Workflow Structure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trigger (on)&lt;/strong&gt; - Specifies that the workflow runs when there is a push to the main branch. If your main branch is named differently (e.g., &lt;code&gt;main-prod&lt;/code&gt;), replace it here. This ensures the workflow executes only on the primary branch where approved code resides. More on GitHub event triggers can be found in the GitHub Actions documentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Permissions&lt;/strong&gt; - note that you must configure OIDC with GitHub from AWS. See the prerequisites.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;id-token: write&lt;/code&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grants the workflow permission to request an OpenID Connect (OIDC) token for securely authenticating with AWS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;contents: read&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Allows the workflow to read repository content during the actions/checkout step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job Definition:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runner&lt;/strong&gt; - Specifies the operating system environment (ubuntu-latest) where the job executes. This ensures compatibility with Python and Node.js for AWS CDK and Glue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check Out the Code Repository&lt;/strong&gt; - Checks out the code from the repository so that subsequent steps have access to the scripts. This is similar to checking out the code locally; it pulls the contents of the repository into the runner's environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure AWS Credentials&lt;/strong&gt; - Configures AWS credentials using an IAM role defined in GitHub Secrets. To define secrets in GitHub, follow the GitHub documentation on repository secrets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Parameters&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;role-to-assume&lt;/code&gt;: The ARN of the IAM role the workflow assumes for permissions. You can also use access keys, but I highly recommend IAM roles for security reasons.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws-region&lt;/code&gt;: The region where AWS resources are deployed (e.g., ap-southeast-1).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Up Python&lt;/strong&gt; - Sets up &lt;code&gt;Python 3.10&lt;/code&gt; in the workflow environment.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Up Node.js&lt;/strong&gt; - Installs &lt;code&gt;Node.js&lt;/code&gt;, required for running AWS CDK.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install AWS CDK&lt;/strong&gt; - Installs AWS CDK globally using npm. This is necessary to execute cdk commands.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify CDK Installation&lt;/strong&gt; - Checks that AWS CDK was installed correctly by outputting its version.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install Python Dependencies&lt;/strong&gt; - Installs Python dependencies specified in requirements.txt, such as &lt;code&gt;aws-cdk-lib&lt;/code&gt;, &lt;code&gt;constructs&lt;/code&gt;, or &lt;code&gt;boto3&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upload Glue Scripts to S3&lt;/strong&gt; - Executes a Python script to upload Glue job scripts to the specified S3 bucket. This ensures the Glue jobs have access to the latest ETL scripts stored in S3 (a minimal sketch of such a script is shown after this list).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy Glue Stacks&lt;/strong&gt; - Deploys all defined Glue stacks (&lt;strong&gt;IngestionStack&lt;/strong&gt;, &lt;strong&gt;StandardizationStack&lt;/strong&gt;, etc.) using AWS CDK. &lt;code&gt;--require-approval never&lt;/code&gt;: Skips manual approval prompts, enabling fully automated deployments.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
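&lt;p&gt;For reference, here is a minimal sketch of what an &lt;code&gt;upload_files.py&lt;/code&gt; script could look like. The local folder layout and S3 prefixes are assumptions for illustration - adjust them to your own project structure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

import boto3


def main(bucket_name):
    s3 = boto3.client("s3")
    # Hypothetical local layout: one Glue script per layer
    layers = ["ingestion", "standardization", "transformation", "loading"]
    for layer in layers:
        local_path = f"glue_scripts/{layer}/job.py"  # assumed local path
        s3_key = f"scripts/{layer}/job.py"           # assumed S3 prefix
        s3.upload_file(local_path, bucket_name, s3_key)
        print(f"Uploaded {local_path} to s3://{bucket_name}/{s3_key}")


if __name__ == "__main__":
    main(sys.argv[1])  # bucket name passed in by the workflow step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;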

&lt;p&gt;You can now push to your main branch and automatically deploy Glue Jobs using GitHub Actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Have The High Ground
&lt;/h2&gt;

&lt;p&gt;Adopting a CI/CD-driven workflow for deploying AWS Glue jobs has been a transformative step in solving the problems I described earlier. By integrating AWS CDK, GitHub, and automated pipelines, I’ve significantly improved the deployment process. &lt;strong&gt;Manual errors&lt;/strong&gt;, &lt;strong&gt;configuration inconsistencies&lt;/strong&gt;, and &lt;strong&gt;deployment delays&lt;/strong&gt; are challenges I’ve left behind, allowing me to focus on delivering reliable and scalable data solutions.&lt;/p&gt;

&lt;p&gt;This approach ensures that every change is &lt;strong&gt;traceable&lt;/strong&gt;, &lt;strong&gt;reviewable&lt;/strong&gt;, and &lt;strong&gt;deployed&lt;/strong&gt; &lt;strong&gt;consistently&lt;/strong&gt; across environments. &lt;/p&gt;

&lt;p&gt;This workflow has worked well for my project based on the requirements and project goals. However, I know that there’s always &lt;strong&gt;room for improvement&lt;/strong&gt;, and &lt;strong&gt;workflows often evolve over time to meet new challenges.&lt;/strong&gt; If you have suggestions, insights, or ideas on how to further enhance this approach, I’d be happy to discuss them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrpxl46gzn0y3d656bg7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrpxl46gzn0y3d656bg7.gif" alt="obi-wan-star-wars" width="462" height="200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Installing Python Packages in AWS Glue using AWS CodeArtifact without Internet Access</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Fri, 10 Nov 2023 09:24:22 +0000</pubDate>
      <link>https://dev.to/aws-builders/installing-python-packages-in-aws-glue-using-aws-codeartifact-cag</link>
      <guid>https://dev.to/aws-builders/installing-python-packages-in-aws-glue-using-aws-codeartifact-cag</guid>
      <description>&lt;h2&gt;
  
  
  Background of the Problem
&lt;/h2&gt;

&lt;p&gt;I spent quite some time figuring out how to install Python packages in AWS Glue inside a VPC &lt;strong&gt;without&lt;/strong&gt; internet access, and I managed to figure it out after some tinkering. As a recap, AWS introduced support for the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#addl-python-modules-support" rel="noopener noreferrer"&gt;installation of Python Packages&lt;/a&gt; via the &lt;code&gt;--additional-python-modules&lt;/code&gt; option. While this is a lifesaver - especially for those who started working with Glue 1.0 - it only works if your Glue Job can connect to the internet.&lt;/p&gt;

&lt;p&gt;Given the emphasis on security, a number of customers choose to &lt;strong&gt;limit/restrict&lt;/strong&gt; egress traffic from their VPC to the public internet and therefore need a method to manage the packages used by their data pipelines.&lt;/p&gt;

&lt;p&gt;This article focuses on that challenge. It is a step-by-step guide on how to set up your Glue Job to connect to a PyPI mirror via AWS CodeArtifact, allowing you to install packages in a Private Subnet. For this tutorial, a working knowledge of AWS basics (e.g. networking, core services) is recommended, but I'll try my best to explain each part.&lt;/p&gt;

&lt;p&gt;Let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bk5w0y20awdy2nygox5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bk5w0y20awdy2nygox5.jpg" alt="kyle-escosia-aws-codeartifact-aws-glue-integration" width="800" height="473"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 1. Architecture for the AWS CodeArtifact and AWS Glue Integration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The core of the solution is AWS CodeArtifact, which allows you to &lt;strong&gt;securely&lt;/strong&gt; store, publish, and share packages - in this case, &lt;code&gt;PyPi&lt;/code&gt; packages - across your &lt;strong&gt;private network&lt;/strong&gt; without connecting directly to the public PyPI repository. This is made possible by VPC Endpoints through &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html" rel="noopener noreferrer"&gt;PrivateLink connections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You do need to create endpoints for S3 and CodeArtifact for this to work; otherwise, you'll get &lt;code&gt;Connection timed out&lt;/code&gt; errors.&lt;/p&gt;

&lt;p&gt;Here are some resources to help you out with that:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#create-gateway-endpoint-s3" rel="noopener noreferrer"&gt;Gateway endpoints for Amazon S3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/codeartifact/latest/ug/create-vpc-endpoints.html" rel="noopener noreferrer"&gt;Create VPC endpoints for CodeArtifact&lt;/a&gt; - if via console, kindly follow the same steps as with the S3 Endpoint.&lt;/p&gt;
&lt;h2&gt;
  
  
  What you will need
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;An AWS account&lt;/strong&gt;, of course &lt;/p&gt;

&lt;p&gt;Note: Test this on your dev environment first&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/codeartifact/" rel="noopener noreferrer"&gt;AWS CodeArtifact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;AWS Access Keys (with permissions on AWS CodeArtifact)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I won't go over these tools one by one, as I believe ChatGPT can give you their definitions and uses better than I can.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;In this section, I'll go over the step-by-step solution for each process.&lt;/p&gt;

&lt;p&gt;Let's start by setting up our CodeArtifact Repository.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up AWS CodeArtifact
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Create a CodeArtifact Repository
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p3wfa338u5ndilqhskl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p3wfa338u5ndilqhskl.png" alt="kyle-escosia-codeartifact-home" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qha4e6ladmyyhnz64dt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qha4e6ladmyyhnz64dt.png" alt="kyle-escosia-codeartifact-home-creation" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Fill in the details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Repository Name&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Repository Details&lt;/code&gt; (Optional)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Public upstream repositories&lt;/code&gt; - I chose PyPi&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Select the domain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf6ob8i49s5bt2z55gvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf6ob8i49s5bt2z55gvt.png" alt="kyle-escosia-codeartifact-domain" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Specify your domain name&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5g22xqvwk3xbsuyugwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5g22xqvwk3xbsuyugwj.png" alt="kyle-escosia-codeartifact-repo-list" width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should have the following repositories after creation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;your-repo&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pypi-store&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that that's done, you can inspect the created repositories. The &lt;code&gt;pypi-store&lt;/code&gt; repository was created automatically. &lt;code&gt;&amp;lt;your-repo&amp;gt;&lt;/code&gt; is the one we're interested in, since it will contain our Python packages.&lt;/p&gt;

&lt;p&gt;With that, let's proceed with configuring your local environment.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up your local environment
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Install Docker
&lt;/h3&gt;

&lt;p&gt;Install here:&lt;br&gt;
&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;https://docs.docker.com/get-docker/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Pull the Amazon Linux 2 Image
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker pull amazonlinux:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Run the container
&lt;/h3&gt;

&lt;p&gt;Run the container and interact with the command line of the container using &lt;code&gt;-it&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -it --rm -v /path/on/host:/path/in/container image_name /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;-v /path/on/host:/path/in/container&lt;/code&gt;: This is the volume mount option. It mounts a directory from your host &lt;code&gt;(/path/on/host)&lt;/code&gt; into the container &lt;code&gt;(/path/in/container)&lt;/code&gt;. Any changes made in the mounted directory inside the container will be reflected on the host directory and vice versa.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;--rm&lt;/code&gt;: This tells Docker to automatically remove the container when it exits. This means that once you're done with the bash session and exit, the container will be cleaned up, and no container filesystem will be left on your host system. &lt;strong&gt;Feel free to remove this option if you do not want your container to behave like that.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Install Python 3.10
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
$ tar -xf Python-3.10.0.tgz
$ cd Python-3.10.0
$ ./configure --enable-optimizations
$ sudo make altinstall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;AWS Glue 4.0&lt;/code&gt; runs &lt;code&gt;Python 3.10&lt;/code&gt; version. For others, kindly refer to the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/release-notes.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Install AWS CLI
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;pip&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install awscli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;yum&lt;/code&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Configure AWS Credentials
&lt;/h3&gt;

&lt;p&gt;Refer to this for creating your access keys:&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After getting the values for the access keys, configure your AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Connect to Repository
&lt;/h3&gt;

&lt;p&gt;Go back to the AWS Console and click on your created repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe8g3cdbt92f6qd3hrec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe8g3cdbt92f6qd3hrec.png" alt="kyle-escosia-codeartifact-my-code-repository" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;View connection instructions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczf853bicpm9yv9lvn4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczf853bicpm9yv9lvn4s.png" alt="kyle-escosia-codeartifact-connection-instructions" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copy&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; the command in &lt;code&gt;Step 3&lt;/code&gt; of the &lt;code&gt;Connection instructions&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws codeartifact login \
--tool pip \
--repository &amp;lt;your-repo-name&amp;gt; \
--domain &amp;lt;your-domain-name&amp;gt; \
--domain-owner &amp;lt;your-account-id&amp;gt; \
--region &amp;lt;your-region&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once successfully logged in, kindly note that any &lt;code&gt;pip install&lt;/code&gt; command run from the container will now resolve packages through this repository (pulling them from the public PyPI upstream and retaining them there) instead of going directly to the public index.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Install Python Packages
&lt;/h3&gt;

&lt;p&gt;Install your packages!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdkznkm0xoqwg60h10g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdkznkm0xoqwg60h10g8.png" alt="kyle-escosia-codeartifact-packages-installed" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the repository is ready, we can install from AWS Glue using this PyPI mirror that we created!&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CodeArtifact and AWS Glue Integration
&lt;/h2&gt;

&lt;p&gt;This section discusses how you can point the installation of Python packages in AWS Glue to AWS CodeArtifact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Get the Authorization Token
&lt;/h3&gt;

&lt;p&gt;We need to generate an &lt;code&gt;authorization token&lt;/code&gt; from &lt;a href="https://docs.aws.amazon.com/codeartifact/latest/ug/tokens-authentication.html" rel="noopener noreferrer"&gt;AWS CodeArtifact&lt;/a&gt;. This is done using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws codeartifact get-authorization-token \
--domain my_domain \
--domain-owner 111122223333 \
--query authorizationToken \
--output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the maximum duration of this token is &lt;code&gt;12 hours&lt;/code&gt;. And yes, you do &lt;strong&gt;need to generate this every day&lt;/strong&gt; if you are planning to run your jobs daily.&lt;/p&gt;

&lt;p&gt;Store this in a &lt;code&gt;.txt&lt;/code&gt; file. &lt;/p&gt;
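&lt;p&gt;If you run your jobs on a schedule, you will probably want to script this refresh instead of doing it by hand. Here is a minimal &lt;code&gt;boto3&lt;/code&gt; sketch of the same call - the domain name and account ID follow the CLI example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

codeartifact = boto3.client("codeartifact", region_name="ap-southeast-1")

# Request a token valid for the maximum of 12 hours
response = codeartifact.get_authorization_token(
    domain="my_domain",
    domainOwner="111122223333",
    durationSeconds=43200,
)
token = response["authorizationToken"]

# Persist it for the step that configures the Glue job parameters
with open("codeartifact_token.txt", "w") as f:
    f.write(token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;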

&lt;h3&gt;
  
  
  Step 2: Configure Job Details in Glue Job
&lt;/h3&gt;

&lt;p&gt;Navigate to your Glue Job&lt;/p&gt;

&lt;p&gt;I'm assuming you have already configured the &lt;code&gt;Data Connections&lt;/code&gt;. If not, kindly configure them before proceeding with this step. The idea is that the Glue Job will run inside the Private Subnet of the VPC. &lt;/p&gt;

&lt;p&gt;See screenshot below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyez8r991vetawpo93h7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyez8r991vetawpo93h7w.png" alt="kyle-escosia-step-glue-configure-connection" width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;Job Parameters&lt;/code&gt;, add the following &lt;code&gt;key-value&lt;/code&gt; pairs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter 1&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Key - "--additional-python-modules" // without double quotes

Value - "&amp;lt;your-python-package&amp;gt;==&amp;lt;version&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameter 2&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key - "--python-modules-installer-option"

Value - "--no-cache-dir --verbose --index-url https://aws:&amp;lt;CODEARTIFACT-AUTH-TOKEN&amp;gt;@&amp;lt;DOMAIN-NAME&amp;gt;-&amp;lt;ACCOUNT-ID&amp;gt;.d.codeartifact.&amp;lt;REGION-NAME&amp;gt;.amazonaws.com/pypi/pypi-store/simple/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CODEARTIFACT-AUTH-TOKEN&lt;/code&gt; - refer to Step 1&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DOMAIN-NAME&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ACCOUNT-ID&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;REGION-NAME&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
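&lt;p&gt;If you would rather wire these parameters in from the same script that refreshes the token (instead of editing them in the console every day), a hedged &lt;code&gt;boto3&lt;/code&gt; sketch using &lt;code&gt;start_job_run&lt;/code&gt; could look like this - the job name and package are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

region = "ap-southeast-1"
codeartifact = boto3.client("codeartifact", region_name=region)
glue = boto3.client("glue", region_name=region)

# Fetch a fresh 12-hour token (domain and account ID are placeholders)
token = codeartifact.get_authorization_token(
    domain="my_domain", domainOwner="111122223333", durationSeconds=43200
)["authorizationToken"]

index_url = (
    f"https://aws:{token}"
    f"@my_domain-111122223333.d.codeartifact.{region}.amazonaws.com"
    "/pypi/pypi-store/simple/"
)

# Pass the two job parameters as run arguments (job name and package are examples)
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={
        "--additional-python-modules": "awswrangler==3.4.0",
        "--python-modules-installer-option": f"--no-cache-dir --verbose --index-url {index_url}",
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;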

&lt;h3&gt;
  
  
  Step 3: Run your Glue Job
&lt;/h3&gt;

&lt;p&gt;After configuring all of that, run your Glue Job and check the CloudWatch Logs to confirm that the packages are being installed correctly. You should see a line there that says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Looking in indexes: https://aws:****@test-mirror-1234561234.d.codeartifact.ap-southeast-1.amazonaws.com/pypi/pypi-store/simple/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kindly make sure that the &lt;code&gt;IAM_ROLE&lt;/code&gt; that you are using for the Glue Jobs has access to &lt;code&gt;write&lt;/code&gt; to &lt;code&gt;CloudWatch Logs&lt;/code&gt; - engineers often forget this. Also, tick the &lt;code&gt;Enable logs in CloudWatch&lt;/code&gt; option on the Glue Job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap up
&lt;/h3&gt;

&lt;p&gt;That's it! In this article, we demonstrated how we can leverage CodeArtifact to manage Python packages and modules for AWS Glue jobs that run inside a Private Subnet with no internet access. &lt;/p&gt;

&lt;p&gt;Do let me know if you have any questions on this; I'm happy to answer any queries you might have.&lt;/p&gt;

&lt;p&gt;Happy Coding, builders! &lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>aws</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>SQL-based INSERTS, DELETES and UPSERTS in S3 using AWS Glue 3.0 and Delta Lake</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Mon, 23 Aug 2021 15:17:42 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0</link>
      <guid>https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0</guid>
      <description>&lt;p&gt;&lt;strong&gt;AWS NOW SUPPORTS DELTA LAKE ON GLUE NATIVELY. &lt;br&gt;
CHECK IT OUT HERE:&lt;/strong&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://aws.amazon.com/blogs/big-data/handle-upsert-data-operations-using-open-source-delta-lake-and-aws-glue/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2023%2F01%2F30%2Fupsert-data-lake-glue.jpg" height="398" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://aws.amazon.com/blogs/big-data/handle-upsert-data-operations-using-open-source-delta-lake-and-aws-glue/" rel="noopener noreferrer" class="c-link"&gt;
          Handle UPSERT data operations using open-source Delta Lake and AWS Glue | AWS Big Data Blog
        &lt;/a&gt;
      &lt;/h2&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fa0.awsstatic.com%2Fmain%2Fimages%2Fsite%2Ffav%2Ffavicon.ico" width="16" height="16"&gt;
        aws.amazon.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The purpose of this blog post is to demonstrate how you can use &lt;code&gt;Spark SQL Engine&lt;/code&gt; to do &lt;code&gt;UPSERTS&lt;/code&gt;, &lt;code&gt;DELETES&lt;/code&gt;, and &lt;code&gt;INSERTS&lt;/code&gt;. Basically, updates.&lt;/p&gt;

&lt;p&gt;Earlier this month, I made a blog post about doing this via &lt;code&gt;PySpark&lt;/code&gt;.&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;div class="ltag__link__content"&gt;
    &lt;div class="missing"&gt;
      &lt;h2&gt;Article No Longer Available&lt;/h2&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;But what if we want to make it more &lt;strong&gt;simple&lt;/strong&gt; and &lt;strong&gt;familiar&lt;/strong&gt;?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This month, AWS released &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/" rel="noopener noreferrer"&gt;Glue version 3.0&lt;/a&gt;! &lt;strong&gt;AWS Glue 3.0&lt;/strong&gt; introduces a performance-optimized &lt;strong&gt;Apache Spark 3.1&lt;/strong&gt; runtime for batch and stream processing. The new engine speeds up data ingestion, processing, and integration, allowing you to hydrate your data lake and extract insights from your data more quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4em3re9n4r1igwxe52k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4em3re9n4r1igwxe52k.jpg" alt="aws-glue-3.0-updates" width="585" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqrb33r822am02ows6nn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqrb33r822am02ows6nn.jpg" alt="aws-glue-3.0-performance-improvements" width="600" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But, what's the big deal with this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, aside from a &lt;strong&gt;lot&lt;/strong&gt; of general performance improvements of the Spark Engine, it can now also support the latest versions of &lt;a href="https://github.com/delta-io/delta/releases" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;. The most notable one is the &lt;a href="https://databricks.com/blog/2020/08/27/enabling-spark-sql-ddl-and-dml-in-delta-lake-on-apache-spark-3-0.html#toc-3" rel="noopener noreferrer"&gt;Support for SQL Insert, Delete, Update and Merge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don't know what &lt;strong&gt;Delta Lake&lt;/strong&gt; is, you can check out my blog post that I referenced above to have a general idea of what it is.&lt;/p&gt;

&lt;p&gt;Let's proceed with the demo!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F5YTFe5djWgq0o%2Fgiphy.gif%3Fcid%3Decf05e47czfo46hm6rpy1j4milotn5w9yzslxtccccx6y3r0%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F5YTFe5djWgq0o%2Fgiphy.gif%3Fcid%3Decf05e47czfo46hm6rpy1j4milotn5w9yzslxtccccx6y3r0%26rid%3Dgiphy.gif%26ct%3Dg" width="260" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Format to Delta&lt;/li&gt;
&lt;li&gt;Upsert&lt;/li&gt;
&lt;li&gt;Delete&lt;/li&gt;
&lt;li&gt;Insert&lt;/li&gt;
&lt;li&gt;Partitioned Data&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✅ Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2inq5fmmbqogt8wffjf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2inq5fmmbqogt8wffjf1.png" alt="kyle-escosia-aws-glue-delta-lake-diagram" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is basically a simple process flow of what we'll be doing. We take a &lt;code&gt;sample csv&lt;/code&gt; file, load it into an &lt;code&gt;S3 Bucket&lt;/code&gt; then process it using &lt;code&gt;Glue&lt;/code&gt;. (OPTIONAL) Then you can connect it into your favorite BI tool (I'll leave it up to you) and start visualizing your updated data.&lt;/p&gt;

&lt;h2&gt;
  
  
  ❗ Pre-requisites
&lt;/h2&gt;

&lt;p&gt;But, before we get to that, we need to do some pre-work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download the Delta Lake package &lt;a href="https://mvnrepository.com/artifact/io.delta/delta-core_2.12/1.0.0" rel="noopener noreferrer"&gt;here&lt;/a&gt; - &lt;em&gt;a bit hard to spot, but look for the &lt;code&gt;Files&lt;/code&gt; in the table and click on the &lt;code&gt;jar&lt;/code&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;An AWS Account - ❗ &lt;strong&gt;Glue ETL is not included in the free tier&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Download the sample data &lt;a href="https://github.com/klescosia/aws-glue-delta-lake/tree/main/data" rel="noopener noreferrer"&gt;here&lt;/a&gt; - you can use your own, but I'll be using this one&lt;/li&gt;
&lt;li&gt;Code can be found in my &lt;a href="https://github.com/klescosia/aws-glue-delta-lake" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✅ Format to Delta Table
&lt;/h2&gt;

&lt;p&gt;First things first, we need to convert each of our datasets into Delta format. Below is the code for doing this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Import the packages
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark Session along with configs for Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-aws-glue-demo/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as a DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;updatesDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-aws-glue-demo/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as a DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;updatesDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/updates_delta/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL, UNCOMMENT IF YOU WANT TO VIEW ALSO THE DATA FOR UPDATES IN ATHENA
###
# Generate MANIFEST file for Updates
# updatesDeltaTable = DeltaTable.forPath(spark, "s3a://delta-lake-aws-glue-demo/updates_delta/")
# updatesDeltaTable.generate("symlink_format_manifest")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code converts our dataset into &lt;code&gt;delta&lt;/code&gt; format. This is done for both our source data and the updates.&lt;/p&gt;

&lt;p&gt;After generating the &lt;code&gt;SYMLINK MANIFEST&lt;/code&gt; file, we can view it via Athena. &lt;em&gt;SQL code is also included in the repository&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpyc4kt35cqfrtwhi626.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpyc4kt35cqfrtwhi626.PNG" alt="athena-sample-data" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🔀 Upserts
&lt;/h2&gt;

&lt;p&gt;Upsert is defined as an operation that &lt;code&gt;inserts&lt;/code&gt; rows into a database table if they &lt;code&gt;do not already exist&lt;/code&gt;, or &lt;code&gt;updates&lt;/code&gt; them if they do.&lt;/p&gt;

&lt;p&gt;In this example, we'll be updating the value for a couple of rows on &lt;code&gt;ship_mode&lt;/code&gt;, &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;sales&lt;/code&gt;, and &lt;code&gt;profit&lt;/code&gt;. &lt;del&gt;I just did a random character spam and I didn't think it through 😅.&lt;/del&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Import as always
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark Session along with configs for Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;updateDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;

MERGE INTO delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore
USING delta.`s3a://delta-lake-aws-glue-demo/updates_delta/` as updates
ON superstore.row_id = updates.row_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL
## SQL-BASED GENERATION OF SYMLINK
&lt;/span&gt;
&lt;span class="c1"&gt;# spark.sql("""
# GENERATE symlink_format_manifest 
# FOR TABLE delta.`s3a://delta-lake-aws-glue-demo/current/`
# """)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL code above &lt;code&gt;updates&lt;/code&gt; the current table with the rows found in the updates table, matching on the &lt;code&gt;row_id&lt;/code&gt;. It then proceeds to evaluate the condition that, &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If &lt;code&gt;row_id&lt;/code&gt; is matched, then &lt;code&gt;UPDATE ALL&lt;/code&gt; the data. If not, then do an &lt;code&gt;INSERT ALL&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want to check out the &lt;strong&gt;full operation semantics&lt;/strong&gt; of &lt;code&gt;MERGE&lt;/code&gt; you can read through &lt;a href="https://docs.delta.io/latest/delta-update.html#operation-semantics" rel="noopener noreferrer"&gt;this&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, we update the &lt;code&gt;MANIFEST&lt;/code&gt; file again. Note that generation of the &lt;code&gt;MANIFEST&lt;/code&gt; file can be set to update &lt;em&gt;automatically&lt;/em&gt; by running the query below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`&amp;lt;path-to-delta-table&amp;gt;`&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compatibility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;symlinkFormatManifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More information can be found &lt;a href="https://docs.delta.io/latest/presto-integration.html#step-3-update-manifests" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should now see your updated table in Athena.&lt;/p&gt;

&lt;h2&gt;
  
  
  ❌ Deletes
&lt;/h2&gt;

&lt;p&gt;Deletes via Delta Lake are very straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;deleteDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
DELETE 
FROM delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore 
WHERE CAST(superstore.row_id as integer) &amp;lt;= 20
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL
## SQL-BASED GENERATION OF SYMLINK MANIFEST
&lt;/span&gt;
&lt;span class="c1"&gt;# spark.sql("""
&lt;/span&gt;
&lt;span class="c1"&gt;# GENERATE symlink_format_manifest 
# FOR TABLE delta.`s3a://delta-lake-aws-glue-demo/current/`
&lt;/span&gt;
&lt;span class="c1"&gt;# """)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation does a simple delete based on the &lt;code&gt;row_id&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt; 
&lt;span class="c1"&gt;-- Need to CAST hehe bec it is currently a STRING&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pm4a9w651ne1623k34p.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pm4a9w651ne1623k34p.PNG" alt="aws-athena-delete" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ⤴ Inserts
&lt;/h2&gt;

&lt;p&gt;Like Deletes, Inserts are also very straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;insertDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
INSERT INTO delta.`s3a://delta-lake-aws-glue-demo/current/`
SELECT *
FROM delta.`s3a://delta-lake-aws-glue-demo/updates_delta/`
WHERE CAST(row_id as integer) &amp;lt;= 20
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL
## SQL-BASED GENERATION OF SYMLINK MANIFEST
&lt;/span&gt;
&lt;span class="c1"&gt;# spark.sql("""
&lt;/span&gt;
&lt;span class="c1"&gt;# GENERATE symlink_format_manifest 
# FOR TABLE delta.`s3a://delta-lake-aws-glue-demo/current/`
&lt;/span&gt;
&lt;span class="c1"&gt;# """)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ❗ Partitioned Data
&lt;/h2&gt;

&lt;p&gt;We've done Upsert, Delete, and Insert operations on a simple dataset. But that &lt;strong&gt;rarely&lt;/strong&gt; happens in real life. So what if we spice things up and do the same on partitioned data? &lt;/p&gt;

&lt;p&gt;I went ahead and created a &lt;code&gt;partitioned&lt;/code&gt; version of this dataset via Spark, using the &lt;code&gt;order_date&lt;/code&gt; as the &lt;strong&gt;partition key&lt;/strong&gt;. The S3 structure looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6m9ykdzzuzjl7cdr0g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6m9ykdzzuzjl7cdr0g.PNG" alt="s3-partitioned-data" width="641" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❗ What do you think?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is: &lt;strong&gt;YES!&lt;/strong&gt; You can also do this on partitioned data.&lt;/p&gt;

&lt;p&gt;The concept of Delta Lake is based on &lt;code&gt;log history&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Delta Lake will generate delta logs for each committed transaction. &lt;/p&gt;

&lt;p&gt;Delta logs will have delta files stored as &lt;code&gt;JSON&lt;/code&gt;, which contain &lt;strong&gt;information about the operations that occurred&lt;/strong&gt;, details about the latest snapshot of the table, and statistics about the data. &lt;/p&gt;

&lt;p&gt;Delta files are &lt;code&gt;JSON&lt;/code&gt; files with sequentially &lt;strong&gt;increasing&lt;/strong&gt; names, and together they make up the &lt;strong&gt;log of all changes&lt;/strong&gt; that have occurred to a table.&lt;/p&gt;

&lt;p&gt;-from &lt;a href="https://datafloq.com/read/understand-the-fundamentals-of-delta-lake-concept/7610#:~:text=How%20does%20it%20works%3F,existing%20cloud%20storage%20data%20lake.&amp;amp;text=When%20you%20store%20data%20as,logs%20for%20each%20committed%20transactions." rel="noopener noreferrer"&gt;Data Floq&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see this in the examples below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;raw date_part=2014-08-27/&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92e3r65cy2sydrk2bnn3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92e3r65cy2sydrk2bnn3.PNG" alt="raw-partitioned" width="671" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;current date_part=2014-08-27/&lt;/strong&gt; - &lt;code&gt;DELETED ROWS&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" alt="current-partitioned" width="664" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we open the &lt;code&gt;parquet&lt;/code&gt; file:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84h396lvhqgpyx4uexmn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84h396lvhqgpyx4uexmn.PNG" alt="updated-data" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the examples above, we can see that our code wrote a new &lt;code&gt;parquet&lt;/code&gt; file during the delete, &lt;code&gt;excluding&lt;/code&gt; the rows filtered out by our &lt;code&gt;delete&lt;/code&gt; operation. The &lt;code&gt;JSON&lt;/code&gt; log file then maps the table to the newly generated &lt;code&gt;parquet&lt;/code&gt;.&lt;/p&gt;
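&lt;p&gt;If you're curious, you can peek at those &lt;code&gt;JSON&lt;/code&gt; commit files yourself. This is purely for exploration and not something the jobs above need; it assumes the same table path used earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each numbered JSON file under _delta_log/ is one committed transaction.
# Its add / remove / commitInfo entries describe which parquet files the commit
# added or logically removed, and what operation produced it.
log_df = spark.read.json("s3a://delta-lake-aws-glue-demo/current/_delta_log/*.json")
log_df.printSchema()
log_df.show(truncate=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;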

&lt;p&gt;Additionally, in &lt;strong&gt;Athena&lt;/strong&gt;, if your table is partitioned, you need to declare the partition in the &lt;code&gt;CREATE TABLE&lt;/code&gt; statement when you create the schema&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;superstore&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
    &lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_mode&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;postal_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_part&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;

&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Add PARTITIONED BY option&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_part&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'&lt;/span&gt; 
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt; 
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://delta-lake-aws-glue-demo/current/_symlink_format_manifest/'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run an &lt;code&gt;MSCK REPAIR TABLE &amp;lt;table&amp;gt;&lt;/code&gt; (for this example, &lt;code&gt;MSCK REPAIR TABLE superstore&lt;/code&gt;) to &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html" rel="noopener noreferrer"&gt;add the partitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don't do these steps, you'll get an error.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2nhocc3hgfbxradts0l.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2nhocc3hgfbxradts0l.PNG" alt="partition-error" width="461" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FdsKnRuALlWsZG%2Fgiphy.gif%3Fcid%3Decf05e4707uvnmsa4eaa2si3o9bzrvpstd2vqte2mmi8c3b9%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FdsKnRuALlWsZG%2Fgiphy.gif%3Fcid%3Decf05e4707uvnmsa4eaa2si3o9bzrvpstd2vqte2mmi8c3b9%26rid%3Dgiphy.gif%26ct%3Dg" width="500" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! It's a great time to be a SQL Developer! Thank you for reading through! I hope you learned something new in this post.&lt;/p&gt;

&lt;p&gt;Have you tried Delta Lake? What tips, tricks, and best practices can you share with the community? I'd love to hear your thoughts in the comments below!&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>UPSERTS and DELETES using AWS Glue and Delta Lake</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Wed, 21 Jul 2021 03:45:42 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/making-your-data-lake-acid-compliant-using-aws-glue-and-delta-lake-gk9</link>
      <guid>https://dev.to/awscommunity-asean/making-your-data-lake-acid-compliant-using-aws-glue-and-delta-lake-gk9</guid>
      <description>&lt;p&gt;The purpose of this blog post is to demonstrate how you can enable your Data Lake to be ACID-compliant, that is, having the same functionality as a database. This will allow you to do UPSERTS and DELETES directly to your data lake&lt;/p&gt;

&lt;p&gt;Let me start first by defining what a Data Lake is:&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data lake is a &lt;strong&gt;centralized repository&lt;/strong&gt; that allows you to store all your structured and unstructured data at any scale. You can &lt;strong&gt;store your data as-is&lt;/strong&gt;, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to &lt;strong&gt;big data processing&lt;/strong&gt;, real-time analytics, and machine learning to guide better decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Data Lake
&lt;/h2&gt;

&lt;p&gt;A data lake is scalable, performant, secure, and cost-efficient, and it plays a crucial part in an organization's Data Analytics pipeline. So what's the problem?&lt;/p&gt;

&lt;p&gt;Well, &lt;strong&gt;updates&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We all know that data lakes are &lt;a href="https://www.ctl.io/developers/blog/post/immutability" rel="noopener noreferrer"&gt;immutable&lt;/a&gt; - &lt;em&gt;the idea that data or objects should not be modified after they are created&lt;/em&gt;; how do we then go beyond that immutability? &lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake
&lt;/h2&gt;

&lt;p&gt;The answer is &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An &lt;a href="https://github.com/delta-io/delta" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads. It provides serializability, the strongest level of isolation, plus Scalable Metadata Handling and Time Travel, and is 100% compatible with Apache Spark APIs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Basically, it allows you to do DELETES and UPSERTS directly to your data lake. &lt;/p&gt;

&lt;h2&gt;
  
  
  How Spark Fails ACID
&lt;/h2&gt;

&lt;p&gt;We all know our beloved Spark doesn't support ACID transactions, but to be fair, it isn't really built to address that kind of specific use case.&lt;/p&gt;

&lt;p&gt;I came across a blog post from &lt;a href="https://blog.knoldus.com/spark-acid-compliant-or-not/" rel="noopener noreferrer"&gt;kundankumarr&lt;/a&gt;, explaining how Spark fails ACID. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A&lt;/strong&gt;tomicity &amp;amp; &lt;strong&gt;C&lt;/strong&gt;onsistency&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt; states that it should either write full data or nothing to the data source when using spark data frame writer. &lt;strong&gt;Consistency&lt;/strong&gt;, on the other hand, ensures that the data is always in a valid state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;I&lt;/strong&gt;solation &amp;amp; &lt;strong&gt;D&lt;/strong&gt;urability&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We know that when a transaction is in process and not yet committed, it must remain isolated from any other transaction. This is called &lt;strong&gt;Isolation&lt;/strong&gt; Property. It means writing to a data set shouldn’t impact another concurrent read/write on the same data set.&lt;/p&gt;

&lt;p&gt;Finally, &lt;strong&gt;Durability&lt;/strong&gt;. It is the ACID property which guarantees that transactions that have committed will survive permanently. However, when Spark doesn’t correctly implement the commit, then all the durability features offered by the storage goes for a toss.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  AWS Glue and Delta Lake
&lt;/h2&gt;

&lt;p&gt;This part demonstrates how you can use Delta Lake with AWS Glue. &lt;/p&gt;

&lt;p&gt;These are the services that will be used in this exercise:&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Glue
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Amazon Athena
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Amazon S3
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;an object storage service that offers industry-leading scalability, data availability, security, and performance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what we'll be doing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr0m67akjycfnyrffza5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr0m67akjycfnyrffza5.PNG" alt="kyle-escosia-process-flow" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, I have an &lt;code&gt;initial data&lt;/code&gt; set, then I want to apply changes to the &lt;code&gt;Sales&lt;/code&gt; and &lt;code&gt;Profit&lt;/code&gt; columns. The table in the &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; should then be able to capture those changes. Just a basic update to the data.&lt;/p&gt;

&lt;p&gt;So, let's start!&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites
&lt;/h3&gt;

&lt;p&gt;First, download the data &lt;a href="https://www.kaggle.com/bravehart101/sample-supermarket-dataset" rel="noopener noreferrer"&gt;here&lt;/a&gt; - I used Tableau's Superstore Dataset, this one is on Kaggle, you may need to register for an account to download.&lt;/p&gt;

&lt;p&gt;Then, you need to download the Delta Lake &lt;code&gt;.jar&lt;/code&gt; file to access its libraries. You can download it &lt;a href="https://search.maven.org/artifact/io.delta/delta-core_2.11/0.6.1/jar" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Upload it to your S3 Bucket and take note of the S3 path; we'll use it as a reference later. &lt;/p&gt;

&lt;p&gt;❗ &lt;em&gt;As of this writing, Glue's Spark Engine (v2.4) only supports v0.6.1 of Delta Lake, since later versions require Spark 3.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;❗❗❗ &lt;em&gt;UPDATE: AWS GLUE 3.0 WAS RELEASED IN AUGUST 2021! Check out my blog post on this one:&lt;/em&gt; ❗❗❗&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/awscommunity-asean" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3974%2F95700370-548c-431b-8ed5-cce70f477aed.png" alt="AWS Community ASEAN" width="800" height="800"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624370%2F9bc3b695-a2e8-4b55-9ce6-4af478526869.jpg" alt="" width="800" height="805"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;SQL-based INSERTS, DELETES and UPSERTS in S3 using AWS Glue 3.0 and Delta Lake&lt;/h2&gt;
      &lt;h3&gt;Kyle Escosia for AWS Community ASEAN ・ Aug 23 '21&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#bigdata&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  AWS Glue
&lt;/h3&gt;

&lt;p&gt;Navigate to AWS Glue, then proceed to create an ETL Job. Set &lt;code&gt;This job runs&lt;/code&gt; to &lt;code&gt;A new script to be authored by you&lt;/code&gt;. This will allow you to write custom Spark code.&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;Security configuration, script libraries, and job parameters (optional)&lt;/code&gt;, specify the location where you stored the &lt;code&gt;.jar&lt;/code&gt; file, as shown below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flab1rujbugkn6wmo7ltl.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flab1rujbugkn6wmo7ltl.PNG" alt="etl-job" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, on the blank script page, paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This imports the &lt;code&gt;SparkSession&lt;/code&gt; class as well as the Delta Lake libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize Spark Session with Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code initializes the &lt;code&gt;SparkSession&lt;/code&gt; along with the Delta Lake configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We read the source CSV file into a Spark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write data as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we write it out in &lt;code&gt;Delta&lt;/code&gt; format. &lt;/p&gt;

&lt;p&gt;❗ Notice the use of the &lt;code&gt;s3a&lt;/code&gt; prefix in the save path; it is essential to use &lt;code&gt;s3a&lt;/code&gt; instead of the standard &lt;code&gt;s3&lt;/code&gt;. Using the &lt;code&gt;s3&lt;/code&gt; prefix will throw an &lt;code&gt;UnsupportedFileSystemException&lt;/code&gt; error, followed by &lt;code&gt;fs.AbstractFileSystem.s3.impl=null: No AbstractFileSystem configured for scheme: s3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;More on the differences between &lt;code&gt;s3&lt;/code&gt; and &lt;code&gt;s3a&lt;/code&gt; &lt;a href="https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Athena supports reading from external tables using a &lt;code&gt;manifest&lt;/code&gt; file, which is a text file containing the list of data files to read for querying a table. Running the above code will generate a &lt;code&gt;manifest&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Read more about Delta Lake's integration for Presto and Athena &lt;a href="https://docs.delta.io/0.6.1/presto-integration.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Final Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Amazon Athena
&lt;/h3&gt;

&lt;p&gt;In your S3 bucket, you should see a &lt;code&gt;_symlink_format_manifest&lt;/code&gt; prefix/folder. This will be used by Amazon Athena for mapping out the parquet files.&lt;/p&gt;

&lt;p&gt;Create your table using the code below as a reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
&lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;ship_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;ship_mode&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;postal_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sub_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;discount&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'&lt;/span&gt; 

&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://delta-lake-ia-test/current/_symlink_format_manifest/'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;del&gt;I'm lazy.&lt;/del&gt; I've made things simple by using &lt;code&gt;STRING&lt;/code&gt; for all columns.&lt;/p&gt;

&lt;p&gt;Note that you have to define the &lt;strong&gt;table name&lt;/strong&gt; as it is when you wrote it as a &lt;strong&gt;delta table&lt;/strong&gt;, or else you'll get blank results when querying with Athena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run a simple select&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypbn4wr5tmr2ibiztepx.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypbn4wr5tmr2ibiztepx.PNG" alt="kyle-escosia-athena-sample-resultset" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read from a CSV&lt;/li&gt;
&lt;li&gt;Created a Spark DataFrame from the CSV&lt;/li&gt;
&lt;li&gt;Written the DataFrame as a Delta Table&lt;/li&gt;
&lt;li&gt;Made a manifest file&lt;/li&gt;
&lt;li&gt;Created an external table in Athena&lt;/li&gt;
&lt;li&gt;Query sample data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we'll do next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the updates from the CSV&lt;/li&gt;
&lt;li&gt;Make an update based on the new files&lt;/li&gt;
&lt;li&gt;Generate/update the manifest file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's add another Glue ETL Job for the updates.&lt;/p&gt;

&lt;p&gt;I manually modified my raw data to simulate the updates; I just plugged in &lt;code&gt;99999&lt;/code&gt; values for the sales and profit of the first 15 rows. Feel free to make your own modifications. &lt;/p&gt;

&lt;p&gt;After which, upload it to your S3 Bucket in a different location.&lt;br&gt;
&lt;/p&gt;
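&lt;p&gt;If you'd rather script the modification instead of editing the file by hand, here is a minimal PySpark sketch of one way to do it. It reuses the &lt;code&gt;raw/&lt;/code&gt; and &lt;code&gt;updates/&lt;/code&gt; paths from the jobs in this post, but the exact approach is up to you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import col, lit, when
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw CSV and plug 99999 into sales/profit for the first 15 rows
raw_df = spark.read.format("csv").option("header", "true").load("s3://delta-lake-ia-test/raw/")

updates_df = raw_df \
    .withColumn("sales", when(col("row_id").cast("integer") &amp;lt;= 15, lit("99999")).otherwise(col("sales"))) \
    .withColumn("profit", when(col("row_id").cast("integer") &amp;lt;= 15, lit("99999")).otherwise(col("profit")))

# Write the simulated updates to the location the update job reads from
updates_df.write.format("csv").option("header", "true").mode("overwrite").save("s3://delta-lake-ia-test/updates/")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;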

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing new here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read updates
&lt;/span&gt;&lt;span class="n"&gt;df_updates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read current as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line is a typical CSV read.&lt;/p&gt;

&lt;p&gt;The next line creates a &lt;code&gt;DeltaTable&lt;/code&gt; object, which allows us to call functions in the delta package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# UPSERT process
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_updates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df.row_id = full_df.row_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;\
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of these is the &lt;code&gt;merge(source, condition)&lt;/code&gt; function, which:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Merges the data from the &lt;em&gt;source&lt;/em&gt; DataFrame based on the given merge &lt;em&gt;condition&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, we take the &lt;code&gt;DeltaTable&lt;/code&gt; object and give it an alias. We then call the &lt;code&gt;merge()&lt;/code&gt; function, supplying its parameters with our arguments, which, in this case, are the updates &lt;code&gt;DataFrame&lt;/code&gt; and the merge condition.&lt;/p&gt;

&lt;p&gt;Then, we call the &lt;code&gt;whenMatchedUpdateAll(condition=None)&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Updates all the columns of the matched table row with the values of the corresponding columns in the source row. If a &lt;code&gt;condition&lt;/code&gt; is specified, then it must be true for the new row to be updated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to have the code update all the columns.&lt;/p&gt;

&lt;p&gt;If the condition specified in the &lt;code&gt;merge()&lt;/code&gt; function doesn't match, then we do a &lt;code&gt;whenNotMatchedInsertAll(condition=None)&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Insert a new target Delta table row by assigning the target columns to the values of the corresponding columns in the source row. If a &lt;code&gt;condition&lt;/code&gt; is specified, then it must evaluate to true for the new row to be inserted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lastly, we call the &lt;code&gt;execute()&lt;/code&gt; function to wrap it all up:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Execute the merge operation based on the built matched and not matched actions.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate new MANIFEST file
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we update the &lt;code&gt;manifest&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;For more functions in the library, kindly refer to the &lt;a href="https://docs.delta.io/0.6.1/api/python/index.html" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;&lt;/p&gt;
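&lt;p&gt;For instance, besides &lt;code&gt;merge()&lt;/code&gt;, the same &lt;code&gt;DeltaTable&lt;/code&gt; handle can also show the table's commit history, which is handy for checking that the upsert actually landed. A small aside, not required by the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from delta import *
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Show the table's commit history (operation, timestamp, etc.)
deltaTable = DeltaTable.forPath(spark, "s3a://delta-lake-ia-test/current/")
deltaTable.history().show(truncate=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;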

&lt;p&gt;Final code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Read updates
&lt;/span&gt;&lt;span class="n"&gt;df_updates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read current as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# UPSERT process
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_updates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df.row_id = full_df.row_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;\
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate new MANIFEST file
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, try querying your updated table in Athena. It should show the most up-to-date data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7fljgeli4qrnznai2ij.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7fljgeli4qrnznai2ij.PNG" alt="kyle-escosia-athena-updated-dataset" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion and Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This blog post demonstrated how you can leverage ACID transactions in your data lake.&lt;/p&gt;

&lt;p&gt;Having functionality like this is helpful, especially if you have requirements such as Change Data Capture (CDC). I'm curious to know how you've implemented such things. Let me know in the comments! &lt;/p&gt;

&lt;p&gt;I've read a couple of articles and blog posts on whether a data lake should be &lt;code&gt;immutable&lt;/code&gt; or not.&lt;/p&gt;

&lt;p&gt;From &lt;em&gt;O'Reilly&lt;/em&gt;, &lt;a href="https://www.oreilly.com/library/view/data-lake-for/9781787281349/91414449-b6a4-463a-bdf8-ac578047e7ff.xhtml" rel="noopener noreferrer"&gt;Data Lake for Enterprises by Tomcy John, Pankaj Misra&lt;/a&gt; on the topic of &lt;strong&gt;Immutable Data Principle&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The data should be stored in a raw format from the different source systems. More importantly, the data stored should be immutable in nature. &lt;/p&gt;

&lt;p&gt;By making it immutable, it inherently takes care of human fault tolerance to at least some extent and takes away errors with regards to data loss and corruption. It allows data to be selected, inserted, and not updated or deleted. &lt;/p&gt;

&lt;p&gt;To cater to fundamental fast processing/performance, the data is usually stored in a denormalized fashion. Data being immutable makes the system in general simpler and more manageable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://www.sqlservercentral.com/editorials/should-the-data-lake-be-immutable" rel="noopener noreferrer"&gt;SQLServerCentral&lt;/a&gt; by Steve Jones on the topic of &lt;em&gt;Should the Data Lake be Immutable?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine I had a large set of data, say GBs in a file, would I want to download this and change a few values before uploading it again? Do we want a large ETL load process to repeat? &lt;/p&gt;

&lt;p&gt;Could we repeat the process and reload a file again? I don't think so, but it's hard to decide. After all, the lake isn't the source of data; that is some other system.&lt;/p&gt;

&lt;p&gt;Maybe that's the simplest solution, and one that reduces complexity, downtime, or anything else that might be involved with locking and changing a file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Comments on the topic can be found &lt;a href="https://www.sqlservercentral.com/forums/topic/should-the-data-lake-be-immutable" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is a comment from roger.plowman:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I suspect immutability should be asked after asking if you should even have the data lake or warehouse in the first place.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What do you guys think? Would love to hear your thoughts!&lt;/p&gt;

&lt;p&gt;Speaking of Data Lakes vs Data Warehouses, there's also a very interesting concept I picked up from one of the AWS Community Builders (&lt;a href="https://dev.to/jol_farvault_72301b8e349"&gt;Joel Farvault&lt;/a&gt;) called the &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html" rel="noopener noreferrer"&gt;Data Mesh Architecture&lt;/a&gt;, which is described as the &lt;em&gt;next enterprise data platform architecture&lt;/em&gt;. I'll leave it up to you to read more about it. AWS also made a &lt;a href="https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; on this using AWS Lake Formation.&lt;/p&gt;

&lt;p&gt;There is also a YouTube video from the AWS DevDay Data &amp;amp; Analytics held on July 14, 2021, where AWS Technical Evangelists &lt;a href="https://aws.amazon.com/developer/community/evangelists/javier-ramirez/" rel="noopener noreferrer"&gt;Javier Ramirez&lt;/a&gt; and &lt;a href="https://aws.amazon.com/developer/community/evangelists/ricardo-sueiras/" rel="noopener noreferrer"&gt;Ricardo Sueiras&lt;/a&gt; discuss the &lt;a href="https://youtu.be/l3sV9YxIcSo?t=5558" rel="noopener noreferrer"&gt;Data Mesh Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, here is a video about &lt;a href="https://www.youtube.com/watch?v=eiUhV56uVUc&amp;amp;t=3s" rel="noopener noreferrer"&gt;Data Mesh in practice&lt;/a&gt; at Europe's biggest online fashion retailer. &lt;/p&gt;

&lt;p&gt;Hope this helps! Let me know if you have questions below.&lt;/p&gt;




&lt;p&gt;Happy coding!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.S. The CloudFormation stack is still ongoing&lt;/em&gt; ⚙&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>bigdata</category>
      <category>analytics</category>
    </item>
    <item>
      <title>I passed the AWS Data Analytics Specialty Exam! (DAS-C01)</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Sat, 10 Jul 2021 04:54:25 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/i-passed-the-aws-data-analytics-specialty-exam-das-c01-3a83</link>
      <guid>https://dev.to/awscommunity-asean/i-passed-the-aws-data-analytics-specialty-exam-das-c01-3a83</guid>
      <description>&lt;p&gt;While still fresh from my memory, I will share tips on how to pass the Data Analytics Specialty Exam! This exam tests your ability to &lt;strong&gt;design&lt;/strong&gt;, &lt;strong&gt;build&lt;/strong&gt;, &lt;strong&gt;secure&lt;/strong&gt;, and &lt;strong&gt;maintain&lt;/strong&gt; analytics solutions on AWS that are &lt;strong&gt;efficient&lt;/strong&gt;, &lt;strong&gt;cost-effective&lt;/strong&gt;, and &lt;strong&gt;secure&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This exam is definitely challenging and very detailed. I was lucky enough to be involved in various data lake projects using some of these AWS services, so I had hands-on experience. But hey, anything can be learned, right? &lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Pre-requisites
&lt;/h2&gt;

&lt;p&gt;I would definitely recommend taking the &lt;a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/" rel="noopener noreferrer"&gt;AWS Solutions Architect Associate Exam&lt;/a&gt; first before this one, as it will help you get an overview of the principles of the AWS Cloud and make it much easier for you to visualize how AWS services work together.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Content Outline
&lt;/h2&gt;

&lt;p&gt;Below are the AWS Services covered:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmd9npgwwf84qihlbmzol.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmd9npgwwf84qihlbmzol.PNG" alt="aws-services-data-analytics-exam" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exam will test you on different domains:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8o1mxsa4bcp2inhyvo5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8o1mxsa4bcp2inhyvo5.PNG" alt="aws-data-analytics-content-outline" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the full Exam Guide &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-data-analytics-specialty/AWS-Certified-Data-Analytics-Specialty_Exam-Guide.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Tips in General (study materials, practice exams, etc.)
&lt;/h2&gt;

&lt;p&gt;Read the questions carefully; &lt;strong&gt;190 minutes&lt;/strong&gt; is a long time and will give you enough room to properly analyze your answers.&lt;/p&gt;

&lt;p&gt;Look for keywords in the question, choose the most appropriate service for that keyword, and try to build and visualize the solution in your mind or on the whiteboard.&lt;/p&gt;

&lt;p&gt;Abuse the Review button! If you are unsure of your answer, don't spend too much time on it; just proceed to the next one.&lt;/p&gt;

&lt;p&gt;Pre-exam, make sure you get a good night's sleep. Tbh, I was anxious going into the exam, so I drank a lot of coffee and ate chocolates lol. But as soon as I answered the first few questions, I gained confidence and proceeded with it! I guess it worked out pretty well?? 😂😂&lt;/p&gt;

&lt;p&gt;Again, this shouldn't be your first exam. AWS recommends having at least an associate-level certification first.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Practice Exam
&lt;/h2&gt;

&lt;p&gt;I personally recommend &lt;strong&gt;Sir Jon Bonso's&lt;/strong&gt; practice exams, which are available on Udemy and on &lt;a href="https://portal.tutorialsdojo.com/courses/aws-certified-data-analytics-specialty-practice-exams/" rel="noopener noreferrer"&gt;Tutorials Dojo&lt;/a&gt;. These are, in my opinion, the &lt;strong&gt;CLOSEST&lt;/strong&gt; to the actual exam; I even got an exact scenario from the practice exam while taking the real one. So try your best to score high on these! Their explanations are also super helpful!&lt;/p&gt;

&lt;p&gt;Additionally, their &lt;a href="https://tutorialsdojo.com/" rel="noopener noreferrer"&gt;website&lt;/a&gt; hosts the AWS Service Cheat Sheets, so be sure to read through those!&lt;/p&gt;

&lt;h2&gt;
  
  
  Study Materials 📘🎥
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stephane Maarek&lt;/strong&gt; and &lt;strong&gt;Frank Kane's&lt;/strong&gt; &lt;a href="https://www.udemy.com/course/aws-data-analytics/" rel="noopener noreferrer"&gt;Udemy course&lt;/a&gt; is very insightful; be generous with your time on these study materials as they will greatly help you in the exam. They also have a practice exam.&lt;/p&gt;

&lt;p&gt;I'd also like to recommend &lt;a href="https://cloudacademy.com/" rel="noopener noreferrer"&gt;CloudAcademy&lt;/a&gt; as a learning portal. They offer full courses for your certifications, including hands-on labs (using their service account)! So if you want to experience using a service without using your own account, definitely check them out!&lt;/p&gt;

&lt;p&gt;🎥 re:Invent videos are also very helpful; these are the hidden gems from the re:Invent sessions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/lj8oaSpCFTc" rel="noopener noreferrer"&gt;Deep dive and best practices for Amazon Redshift&lt;/a&gt; - this helped me have a detailed understanding of what Redshift is&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/S_xeHvP7uMo" rel="noopener noreferrer"&gt;Building Serverless Analytics Pipelines with AWS Glue&lt;/a&gt; - a deep dive on Glue components&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/pT5lAYTCYJ4" rel="noopener noreferrer"&gt;Serverless data preparation with AWS Glue&lt;/a&gt; - a more recent one, this discusses the updates that were rolled out in AWS Glue&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/jKPlGznbfZ0" rel="noopener noreferrer"&gt;High Performance Data Streaming with Amazon Kinesis: Best Practices&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/DIQVJqiSUkE" rel="noopener noreferrer"&gt;Data modeling with Amazon DynamoDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/tzoXRRCVmIQ" rel="noopener noreferrer"&gt;Deep dive into Amazon Athena&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/ovPheIbY7U8" rel="noopener noreferrer"&gt;Big Data Analytics Architectural Patterns &amp;amp; Best Practices&lt;/a&gt; - this is a great discussion on how AWS Services integrate with each other&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/Aj5T5fcZZr0" rel="noopener noreferrer"&gt;Deep Dive Into AWS Lake Formation&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More AWS Videos can be found here at &lt;a href="https://awsstash.com/" rel="noopener noreferrer"&gt;AWS Stash&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Tips by Service
&lt;/h1&gt;

&lt;p&gt;In this section, I will give tips on what to study for each service; below is an outline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collect
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Kinesis&lt;/li&gt;
&lt;li&gt;AWS Database Migration Service (DMS)&lt;/li&gt;
&lt;li&gt;Amazon Simple Queue Service (SQS)&lt;/li&gt;
&lt;li&gt;AWS Snowball&lt;/li&gt;
&lt;li&gt;AWS IoT&lt;/li&gt;
&lt;li&gt;AWS Managed Streaming for Kafka (MSK)&lt;/li&gt;
&lt;li&gt;AWS Direct Connect&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storage
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB&lt;/li&gt;
&lt;li&gt;Amazon Elasticache&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Processing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;AWS Glue&lt;/li&gt;
&lt;li&gt;Amazon EMR&lt;/li&gt;
&lt;li&gt;AWS Lake Formation&lt;/li&gt;
&lt;li&gt;AWS Step Functions&lt;/li&gt;
&lt;li&gt;AWS Data Pipeline&lt;/li&gt;
&lt;li&gt;Other AWS Services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Kinesis Data Analytics&lt;/li&gt;
&lt;li&gt;Amazon ElasticSearch Service&lt;/li&gt;
&lt;li&gt;Amazon Athena&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Amazon SageMaker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visualization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Quicksight&lt;/li&gt;
&lt;li&gt;Other Visualization Tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon STS&lt;/li&gt;
&lt;li&gt;AWS Key Management Service (KMS)&lt;/li&gt;
&lt;li&gt;Cloud HSM (Hardware Security Module)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  ✅ Collection
&lt;/h1&gt;

&lt;p&gt;For the most part, questions require you to know how to move data from one source to another, using the right tools in the right situation. Knowing the advantages and disadvantages of each one will help you answer those questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Kinesis
&lt;/h2&gt;

&lt;p&gt;Amazon Kinesis has 4 capabilities, namely: Video Streams, Data Streams, Firehose, Data Analytics.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Video Streams&lt;/strong&gt;, I didn't get a question on this, so you only need to remember that it is used for &lt;em&gt;streaming video data for analytics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Kinesis Data Streams&lt;/strong&gt; and &lt;strong&gt;Kinesis Firehose&lt;/strong&gt;, most of the collection part revolves around these two. You need to &lt;strong&gt;KNOW&lt;/strong&gt; how to differentiate Kinesis Firehose from Data Streams; I can't stress this enough. Please study this, as a lot of the answer choices involve Kinesis Data Streams and Firehose, and you need to be able to distinguish them.&lt;/p&gt;

&lt;p&gt;There are also troubleshooting and scenario-based questions, such as how you would solve a &lt;code&gt;ProvisionedThroughputExceeded&lt;/code&gt; error, when you should merge or split shards, what encryption options are available, and how Kinesis integrates with other services.&lt;/p&gt;
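
&lt;p&gt;To make that concrete while you study, here's a rough boto3 sketch (not from the exam itself) of the operations those scenarios revolve around: retrying a throttled write and resharding a stream. The stream name is just a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")
STREAM = "demo-stream"  # placeholder stream name

def put_with_backoff(data, partition_key, retries=5):
    """Retry writes throttled with ProvisionedThroughputExceededException."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(StreamName=STREAM, Data=data, PartitionKey=partition_key)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("Stream is still throttling; consider adding shards")

# Scale the stream out or in by changing the shard count (uniform scaling)
kinesis.update_shard_count(StreamName=STREAM, TargetShardCount=4, ScalingType="UNIFORM_SCALING")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;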

&lt;h2&gt;
  
  
  AWS Database Migration Service (DMS)
&lt;/h2&gt;

&lt;p&gt;This came up in a few questions; make sure you know when to use DMS vs other tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Simple Queue Service (SQS)
&lt;/h2&gt;

&lt;p&gt;Just know the difference between Kinesis and SQS and which one you should use for each problem. &lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Snowball
&lt;/h2&gt;

&lt;p&gt;The exam will include this as an option. Although adding it as part of the solution can be feasible, look at what the question actually asks for. If the solution requires you to migrate data quickly, this is probably not the most appropriate choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS IoT
&lt;/h2&gt;

&lt;p&gt;I didn't get a lot of IoT questions, but it's nice to have an overview of this one: IoT topics, rules, etc. Just browse through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Managed Streaming for Kafka (MSK)
&lt;/h2&gt;

&lt;p&gt;MSK is another option for streaming data similar to Kinesis, so knowing when to use MSK vs Kinesis will be crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Direct Connect
&lt;/h2&gt;

&lt;p&gt;Part of a data warehouse migration is integrating your on-premises data center with your Amazon VPC network; knowing when to use a Site-to-Site VPN versus Direct Connect will help you here.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Storage
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon S3
&lt;/h2&gt;

&lt;p&gt;From storage classes to replication and lifecycle policies, you need to know your Amazon S3 concepts! Amazon S3 Glacier also covers part of the exam for archiving purposes. &lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon DynamoDB
&lt;/h2&gt;

&lt;p&gt;Understanding DynamoDB and its features is also essential, as the exam will try to trick you into choosing other services instead of it; knowing DynamoDB's advantages and disadvantages will help you filter out those tricky questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Elasticache
&lt;/h2&gt;

&lt;p&gt;I didn't get an Elasticache question, but knowing what it is also helps.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Processing
&lt;/h1&gt;

&lt;h2&gt;
  
  
  AWS Lambda
&lt;/h2&gt;

&lt;p&gt;Lambda covers a lot of the exam, as it integrates with almost all AWS services, so knowing when to use Lambda (and when not to) is definitely something you should study.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Glue
&lt;/h2&gt;

&lt;p&gt;AWS Glue also shows up in a lot of the options. I've had no trouble with Glue as I've been using it since version 1. Generally, if the question looks for a &lt;em&gt;cost-effective&lt;/em&gt; solution that &lt;em&gt;requires no operational overhead&lt;/em&gt;, definitely look for a Glue answer.&lt;/p&gt;

&lt;p&gt;Glue features also show up: bookmarks, &lt;code&gt;DynamicFrame&lt;/code&gt; functions, job metrics, etc. Troubleshooting a Glue job is another one, i.e., what you should do if Glue throws an error.&lt;/p&gt;
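
&lt;p&gt;For context while reviewing, here's a minimal Glue job skeleton (a sketch, not exam material) showing where bookmarks and &lt;code&gt;DynamicFrame&lt;/code&gt; functions fit in. The database, table, and bucket names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required for job bookmarks to work

# transformation_ctx is what bookmarks use to track already-processed data
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
    transformation_ctx="source_dyf",
)

# Example DynamicFrame function: resolve an ambiguous column type
dyf = dyf.resolveChoice(specs=[("amount", "cast:double")])

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder
    format="parquet",
    transformation_ctx="sink_dyf",
)

job.commit()  # records the bookmark checkpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;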

&lt;h2&gt;
  
  
  Amazon EMR
&lt;/h2&gt;

&lt;p&gt;Study and understand EMR and all of its applications! Period.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Lake Formation
&lt;/h2&gt;

&lt;p&gt;If the question is about managing access to your data lake, Lake Formation is the answer instead of managing it via IAM. &lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Step Functions
&lt;/h2&gt;

&lt;p&gt;Used for orchestration, an overview will do.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Data Pipeline
&lt;/h2&gt;

&lt;p&gt;I didn't get a Data Pipeline question, but it shows up as one of the answer choices, so knowing it will help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other AWS Services
&lt;/h2&gt;

&lt;p&gt;For S3 Select, S3DistCp, and Hadoop ecosystem tools (Ganglia, Mahout, Ranger, HCatalog, etc.), just a basic understanding of what they do will suffice.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Analysis
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon Kinesis Data Analytics
&lt;/h2&gt;

&lt;p&gt;KDA allows you to query streaming data using SQL; knowing window functions will help, as well as knowing when you should use KDA vs Lambda.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon ElasticSearch Service
&lt;/h2&gt;

&lt;p&gt;Generally, for log analysis, look for an ES solution along with Kibana.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Athena
&lt;/h2&gt;

&lt;p&gt;The most &lt;strong&gt;cost-effective&lt;/strong&gt; solutions often involve Athena. Watch out for answers that involve Redshift Spectrum, as the exam will try to trick you into using Athena even when Spectrum is the better fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Redshift
&lt;/h2&gt;

&lt;p&gt;Node types, resizing options, distribution styles, Redshift Spectrum, cluster administration, encryption options: please study and remember these.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon SageMaker
&lt;/h2&gt;

&lt;p&gt;I didn't get a lot of SageMaker questions but an overview will help.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Visualization
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon Quicksight
&lt;/h2&gt;

&lt;p&gt;Row-level security, Standard and Enterprise editions, authentication options (MS AD, SAML), Kibana vs Quicksight solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Visualization Tools
&lt;/h2&gt;

&lt;p&gt;There are 1-2 questions that offer D3.js, Highcharts, or a custom chart as a solution; knowing when to choose between those and QuickSight is nice.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Security
&lt;/h1&gt;

&lt;p&gt;Security covers a lot of the exam, and it can be very tricky if you haven't studied for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon STS
&lt;/h2&gt;

&lt;p&gt;Some questions require you to access another AWS account, so knowing IAM in general and how STS and authentication work with AWS is a nice-to-have. &lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Key Management Service (KMS)
&lt;/h2&gt;

&lt;p&gt;KMS shows up in almost all of the security questions in the exam, so please make sure you are prepared for this. I warned you lol.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud HSM (Hardware Security Module)
&lt;/h2&gt;

&lt;p&gt;If the exam asks you about managing your own security options, look for an HSM solution.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Studying everything at once can be overwhelming, so try to take your time understanding each service first and how they work.&lt;/p&gt;

&lt;p&gt;Whitepapers and Webinars are very helpful, especially Migration videos, as they give you an overall design on how things are done in AWS. &lt;/p&gt;

&lt;p&gt;Also consider making it a habit to watch one video or read one whitepaper at a time to avoid information overload (small wins!).&lt;/p&gt;

&lt;p&gt;To others who have passed and taken the same exam, feel free to share your thoughts. I would gladly add it to this post to help others pass! Let's learn from each other!&lt;/p&gt;

&lt;p&gt;Good Luck! &lt;/p&gt;

</description>
      <category>aws</category>
      <category>certification</category>
      <category>tutorial</category>
      <category>cloudskills</category>
    </item>
    <item>
      <title>Introduction to the AWS Big Data Portfolio</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Sat, 29 May 2021 07:15:08 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/introduction-to-the-aws-big-data-portfolio-2539</link>
      <guid>https://dev.to/awscommunity-asean/introduction-to-the-aws-big-data-portfolio-2539</guid>
      <description>&lt;p&gt;Want to build an end-to-end data pipeline in AWS? &lt;/p&gt;

&lt;p&gt;You're in luck! In this post, I will introduce you to AWS' Big Data portfolio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; (below image)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmteh8d4czp7fzru4t2wj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmteh8d4czp7fzru4t2wj.PNG" alt="Alt Text" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Content Outline 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Collect
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Direct Connect&lt;/li&gt;
&lt;li&gt;Amazon Kinesis&lt;/li&gt;
&lt;li&gt;AWS Snowball&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Store
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon Glacier&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB&lt;/li&gt;
&lt;li&gt;Amazon RDS&lt;/li&gt;
&lt;li&gt;Amazon Aurora&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Process and Analyze
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Amazon Athena&lt;/li&gt;
&lt;li&gt;AWS Glue&lt;/li&gt;
&lt;li&gt;Amazon EMR&lt;/li&gt;
&lt;li&gt;Amazon EC2&lt;/li&gt;
&lt;li&gt;Amazon Sagemaker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visualize
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon QuickSight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my future posts, I will be going into details on these services, so make sure you watch out for that! &lt;/p&gt;

&lt;p&gt;Before we dive into the suite of AWS Services, let's first define what &lt;strong&gt;Big Data&lt;/strong&gt; is.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Big Data?
&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://aws.amazon.com/big-data/what-is-big-data/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Big data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make this definition simpler:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data set is considered &lt;em&gt;big data&lt;/em&gt; when it is too &lt;strong&gt;big&lt;/strong&gt; or &lt;strong&gt;complex&lt;/strong&gt; to be &lt;strong&gt;stored&lt;/strong&gt; or &lt;strong&gt;analyzed&lt;/strong&gt; by &lt;strong&gt;traditional data systems&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Obviously, there are many definitions of Big Data around the web, but for me, this is the most simple one to understand.&lt;/p&gt;

&lt;p&gt;Now that we've defined what Big Data is, we'll proceed with the AWS Services that will help you answer those challenges.&lt;/p&gt;




&lt;h2&gt;
  
  
  Collect
&lt;/h2&gt;

&lt;p&gt;The collection of raw data has always been a challenge for many organizations, especially for us developers, because you have different complex source systems scattered across the company, such as ERP systems, CRM systems, transactional DBs, etc. &lt;/p&gt;

&lt;p&gt;You have to also think about how you would integrate the data between these systems to create a unified view of your data. &lt;/p&gt;

&lt;p&gt;AWS helps make these steps easier, allowing us developers to ingest all kinds of data, from structured to unstructured and from real-time to batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Direct Connect
&lt;/h3&gt;

&lt;p&gt;AWS Direct Connect is a networking service that provides an alternative to using the internet to connect to AWS. &lt;/p&gt;

&lt;p&gt;Using AWS Direct Connect, data that would have previously been transported over the internet is delivered through a private network connection between your facilities and AWS.&lt;/p&gt;

&lt;p&gt;This is useful if you want consistent network performance or if you have bandwidth-heavy workloads. I personally haven't tried it yet; in most of our implementations, we just use AWS Site-to-Site VPN.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Kinesis
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Easily collect, process, and analyze video and data streams in real time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.&lt;/p&gt;

&lt;p&gt;Amazon Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure.&lt;/p&gt;

&lt;p&gt;Kinesis has 4 capabilities namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kinesis Video Streams&lt;/li&gt;
&lt;li&gt;Kinesis Data Streams&lt;/li&gt;
&lt;li&gt;Kinesis Data Firehose &lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Amazon Kinesis Video Streams
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Capture, process, and store video streams&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Amazon Kinesis Data Streams
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Capture, process, and store data streams&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources. &lt;/p&gt;

&lt;h4&gt;
  
  
  Amazon Kinesis Data Firehose
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Load data streams into AWS data stores&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.&lt;/p&gt;
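
&lt;p&gt;Producing to Firehose from code is essentially a one-liner; here's a small boto3 sketch, assuming you have already created a delivery stream (the name below is a placeholder) pointed at an S3 destination.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

firehose = boto3.client("firehose")

# Placeholder delivery stream name - Firehose buffers the records and
# delivers them to the configured destination (e.g. S3) for you
firehose.put_record(
    DeliveryStreamName="demo-delivery-stream",
    Record={"Data": (json.dumps({"event": "page_view", "user": "123"}) + "\n").encode("utf-8")},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;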

&lt;h4&gt;
  
  
  Amazon Kinesis Data Analytics
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Analyze data streams with SQL or Apache Flink&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Data Analytics is the easiest way to process data streams in real time with SQL or Apache Flink without having to learn new programming languages or processing frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Snowball
&lt;/h3&gt;

&lt;p&gt;An interesting way to move your data from on-premises to the AWS Cloud is AWS Snowball, a service that provides secure, rugged devices so you can bring AWS computing and storage capabilities to your edge environments and transfer data into and out of AWS.&lt;/p&gt;

&lt;p&gt;I personally haven't tried this yet but would love to do so in the future!&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon S3
&lt;/h3&gt;

&lt;p&gt;The most famous AWS service would be Amazon S3, an object storage service built to store and retrieve any amount of data from anywhere. &lt;/p&gt;

&lt;p&gt;It’s a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Amazon S3 is AWS' first service that launched back in 2006!&lt;/em&gt;&lt;/p&gt;
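
&lt;p&gt;To give you an idea of how simple it is, here's a tiny boto3 sketch that stores and then retrieves an object. The bucket name is a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # placeholder bucket name

# Store an object, then retrieve it
s3.put_object(Bucket=BUCKET, Key="raw/sample.json", Body=b'{"hello": "data lake"}')
obj = s3.get_object(Bucket=BUCKET, Key="raw/sample.json")
print(obj["Body"].read())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;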

&lt;h3&gt;
  
  
  Amazon S3 Glacier
&lt;/h3&gt;

&lt;p&gt;S3 Glacier is an extremely low-cost storage service that provides secure, durable, and flexible storage for data backup and archival.&lt;/p&gt;

&lt;p&gt;It is excellent for businesses or organizations that need to retain their data for years or even decades!&lt;/p&gt;




&lt;h2&gt;
  
  
  Store
&lt;/h2&gt;

&lt;p&gt;I'm honestly a big fan of Amazon S3, given how scalable and how easy it is to use. I'll just say that if you aren't using Amazon S3 for your data lakes, then you are missing out on a lot of things lol.&lt;/p&gt;

&lt;p&gt;There are obviously a lot of factors that need to be considered when building your big data project. Any big data platform needs a secure, scalable, and durable repository to store data prior to, or even after, processing tasks. AWS provides you with services depending on your specific requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon DynamoDB
&lt;/h3&gt;

&lt;p&gt;DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. &lt;/p&gt;

&lt;p&gt;It is one of the AWS Services that is &lt;strong&gt;fully-managed&lt;/strong&gt;, meaning that you don't have to worry about setting up the infrastructure and software updates, you just use the service.&lt;/p&gt;
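
&lt;p&gt;Here's a quick sketch of the key-value access pattern DynamoDB is built for, using boto3. The table and attribute names are made up for illustration and assume a table with a partition key of &lt;code&gt;user_id&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")  # placeholder table with partition key "user_id"

# Write an item, then read it back by key
table.put_item(Item={"user_id": "u-123", "name": "Jose", "plan": "free"})
response = table.get_item(Key={"user_id": "u-123"})
print(response.get("Item"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;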

&lt;h3&gt;
  
  
  Amazon RDS
&lt;/h3&gt;

&lt;p&gt;Amazon RDS is a &lt;strong&gt;managed&lt;/strong&gt; service that makes it easy to set up, operate, and scale a relational database in the cloud.&lt;/p&gt;

&lt;p&gt;Amazon RDS supports Amazon Aurora, MySQL, MariaDB, Oracle, SQL Server, and PostgreSQL database engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Aurora
&lt;/h3&gt;

&lt;p&gt;Amazon Aurora is a relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process and Analyze
&lt;/h2&gt;

&lt;p&gt;This is the step where data is transformed from its raw state into a consumable format – usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms. &lt;/p&gt;

&lt;p&gt;The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Redshift
&lt;/h3&gt;

&lt;p&gt;Amazon Redshift is the most widely used cloud data warehouse. &lt;/p&gt;

&lt;p&gt;It makes it fast, simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. &lt;/p&gt;

&lt;p&gt;It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.&lt;/p&gt;

&lt;p&gt;We've had some successful implementations on Redshift, and I can share some of my experiences with it, so watch out for that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Athena
&lt;/h3&gt;

&lt;p&gt;Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. &lt;/p&gt;

&lt;p&gt;Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.&lt;/p&gt;

&lt;p&gt;We've used Athena a lot in our implementations, and I must say that it really helped us in terms of data exploration and data validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Glue
&lt;/h3&gt;

&lt;p&gt;AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.&lt;/p&gt;

&lt;p&gt;AWS Glue has evolved significantly from its initial release 0.9 to AWS Glue 2.0. Along with that are enhancements that glue (pun intended) all your pipelines together. Definitely worth looking into.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon EMR
&lt;/h3&gt;

&lt;p&gt;Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. &lt;/p&gt;

&lt;p&gt;It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).&lt;/p&gt;

&lt;p&gt;As opposed to Glue, which is serverless (meaning you don't need to provision your own servers), EMR gives you more flexibility in sizing your cluster depending on how "big" your data processing workloads are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon EC2
&lt;/h3&gt;

&lt;p&gt;Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud.&lt;/p&gt;

&lt;p&gt;Basically, it's your virtual machine in the cloud with a lot of use cases, living up to its name "Elastic".&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Sagemaker
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. &lt;/p&gt;

&lt;p&gt;SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.&lt;/p&gt;

&lt;p&gt;AWS re:Invent 2020 introduced a lot of significant improvements to Amazon SageMaker, such as Data Wrangler, Clarify, SageMaker Pipelines, and many more! I'll be doing a deep dive on these exciting features soon! &lt;/p&gt;




&lt;h2&gt;
  
  
  Visualize
&lt;/h2&gt;

&lt;p&gt;Big data is all about getting high value, actionable insights from your data assets. &lt;/p&gt;

&lt;p&gt;Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets. &lt;/p&gt;

&lt;p&gt;Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical “predictions” – in the case of predictive analytics – or recommended actions – in the case of prescriptive analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon QuickSight
&lt;/h3&gt;

&lt;p&gt;Amazon QuickSight is a very fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within an organization to build visualizations, perform ad-hoc analysis, and quickly get business insights from their data, anytime, on any device. &lt;/p&gt;

&lt;p&gt;QuickSight is easy to use and has made some major improvements since it was publicly released. It's still fairly new compared to other major BI tools, but I think it has potential. Look into QuickSight if you want a cost-effective BI solution.&lt;/p&gt;




&lt;p&gt;That's it for me. Would love to hear your thoughts!&lt;/p&gt;

&lt;h6&gt;
  
  
  References:
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/big-data/what-is-big-data/" rel="noopener noreferrer"&gt;https://aws.amazon.com/big-data/what-is-big-data/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/directconnect/" rel="noopener noreferrer"&gt;https://aws.amazon.com/directconnect/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/kinesis/" rel="noopener noreferrer"&gt;https://aws.amazon.com/kinesis/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/snowball/" rel="noopener noreferrer"&gt;https://aws.amazon.com/snowball/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;https://aws.amazon.com/s3/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/glacier/" rel="noopener noreferrer"&gt;https://aws.amazon.com/glacier/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;https://aws.amazon.com/dynamodb/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/rds/" rel="noopener noreferrer"&gt;https://aws.amazon.com/rds/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;https://aws.amazon.com/redshift/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/athena/" rel="noopener noreferrer"&gt;https://aws.amazon.com/athena/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;https://aws.amazon.com/glue/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/emr/" rel="noopener noreferrer"&gt;https://aws.amazon.com/emr/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/ec2/" rel="noopener noreferrer"&gt;https://aws.amazon.com/ec2/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/quicksight/" rel="noopener noreferrer"&gt;https://aws.amazon.com/quicksight/&lt;/a&gt;
&lt;/h6&gt;

</description>
      <category>aws</category>
      <category>community</category>
      <category>data</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
