<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kyle Escosia</title>
    <description>The latest articles on DEV Community by Kyle Escosia (@escosiakyle).</description>
    <link>https://dev.to/escosiakyle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624370%2F9bc3b695-a2e8-4b55-9ce6-4af478526869.jpg</url>
      <title>DEV Community: Kyle Escosia</title>
      <link>https://dev.to/escosiakyle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/escosiakyle"/>
    <language>en</language>
    <item>
      <title>Building a Pinoy-themed Game using Amazon Q CLI</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Mon, 30 Jun 2025 06:39:44 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/building-a-pinoy-themed-game-using-amazon-q-cli-1ge</link>
      <guid>https://dev.to/awscommunity-asean/building-a-pinoy-themed-game-using-amazon-q-cli-1ge</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I remember the very first game I built back in my college days (circa 2016). Our Rizal professor wanted us to think about how we could apply what we do in our course to create awareness for our national hero, Jose Rizal. Basically: how would you relate Information Technology to Rizal?&lt;/p&gt;

&lt;p&gt;The first thing that came to mind was to build a fighting game featuring the life of Rizal throughout the Spanish period, written in Java since we had just learned it the previous semester. It doesn't look like much, but it goes like this:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/nBwq1O50q5M"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;So when I encountered an article about building games with Amazon Q CLI, I thought it was a good opportunity to reimagine my game. And since we Filipinos celebrate our Independence 🇵🇭 this June, it made me a bit nostalgic. This time, though, I wanted to incorporate my work experience, which is Data Engineering. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up
&lt;/h2&gt;

&lt;p&gt;Getting Amazon Q CLI up and running is actually very straightforward. I followed the guide from AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html" rel="noopener noreferrer"&gt;Installing Amazon Q for command line&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You do need to &lt;a href="https://docs.aws.amazon.com/signin/latest/userguide/create-aws_builder_id.html" rel="noopener noreferrer"&gt;Create your AWS Builder ID&lt;/a&gt;. But after that, Amazon Q is all yours! &lt;/p&gt;

&lt;h2&gt;
  
  
  My attempt
&lt;/h2&gt;

&lt;p&gt;Now, for the prompt, I wanted to be as specific as possible. Here's what I came up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I want a turn-based SQL fighting game that will allow me to fight through bosses using a chosen character. 
I want to have Philippines as a theme for my game. Since we celebrate our independence in the month of June. 
The game should be like Street fighter/Tekken. Where characters face off with each other. 
The battle system is like Pokemon, each character's actions are based on whatever action the user will choose.

For the game, I want the following mechanics.

Background and goal:

Hero - my "Bayani" starts unarmed in a Pre-Hispanic tutorial level, then faces colonial “bosses” as he goes through times

Levels tied to eras – pre-Spanish, Spanish, American, Japanese, final Independence Day showdown (June 12 1898)

Each level introduces new tables and sql query concepts (SELECT basics, JOINs, aggregates, window functions)

Game Mechanics:

SQL-type quizzes

On each turn, the game asks a quiz question about the current battle’s database

Quiz-type sql puzzles
- Each action corresponds to a quiz question on a specific SQL concept
- Questions appear as multiple choice or fill in inputs
- Correct answer executes the chosen action’s animation and adjusts the HP or shield values accordingly, 
  important to note that boss also attacks afterwards
- Wrong answer causes the boss to attack instead, dealing damage to the your player

Combat Actions
- Attack – executes by answering a SELECT/WHERE question correctly and deals damage to the boss’s HP
- Defend – executes by answering an aggregates question and grants a shield that reduces incoming damage
- Heal – executes by answering a set-operation question and restores a portion of the hero’s HP
- Special Move - unlock by a successful multi-step quiz (CTEs or window functions) which unleashes a high-damage combo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, I already knew what I wanted, but feel free to collaborate with Amazon Q on your game. At the end of the day, it's all about what you want your game to look like, so have fun with the process!&lt;/p&gt;

&lt;p&gt;One prompt, and it got down to business creating all sorts of scripts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y25uow33n313ezsvg8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y25uow33n313ezsvg8v.png" alt="amazon-q-cli-generating-code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing that I liked about Amazon Q was that it automatically knew it needed to test the scripts and created its own test cases as part of the workflow. Most chatbots don't do this out of the box; you need to explicitly ask them to create test scenarios and execute them. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lsnr3e04xi867sfgglm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lsnr3e04xi867sfgglm.png" alt="amazon-q-cli-test-cases"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It also creates documentation covering the game mechanics, the files it created, and how to play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu08uf7g7y5dtashcwbdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu08uf7g7y5dtashcwbdn.png" alt="amazon-q-cli-generates-documentations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the game, here's what I had:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2rnbg0d1izcx7yvee8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2rnbg0d1izcx7yvee8f.png" alt="amazon-q-cli-game"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Improvements
&lt;/h2&gt;

&lt;p&gt;After one prompt, Amazon Q was able to create a CLI-based SQL game that lets users play through the eras, with each boss asking different SQL questions.&lt;/p&gt;

&lt;p&gt;But there is one problem: not everyone wants a CLI-based game. I know I don't. I want something I can see and interact with using my mouse.&lt;/p&gt;

&lt;p&gt;I also downloaded asset packs from &lt;a href="https://ansimuz.itch.io/gothicvania-patreon-collection" rel="noopener noreferrer"&gt;Ansimuz's Legacy Collection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's another prompt. I didn't think too much about it; I kept it simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This is a good start. Please give it a UI. Use PyGame. 
I also have assets under /assets folder. 
Please make use of it as sprites.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Amazon Q CLI proceeded to make the necessary adjustments&lt;br&gt;
and came up with this:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/pltPFiSxQ7Q"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Looks good, but then I was curious about extending it further, so I asked Amazon Q for improvements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Let's work on improvements. Can you suggest? 
Please list down your suggestions before implementing. 
I want improvements on the battle system, game mechanics, questions, and animations.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6eosquh84irf0u8hkan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6eosquh84irf0u8hkan.png" alt="amazon-q-cli-improvement-suggestions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of these are good suggestions, but I like that Amazon Q also prioritizes them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkybd0110ltnxisjmck9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkybd0110ltnxisjmck9x.png" alt="amazon-q-cli-improvement-prioritizations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I decided to go with the Phase 1 changes, but I had some suggestions as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please implement high-priority improvements. 
I don't want an interactive SQL Editor yet as it is complex to implement. 
For the list of SQL Questions, can you please add more? 
It seems that the user can just choose the same questions over and over, it should vary per turn.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I noticed while playing the game was that the questions were very limited. There was only one question per action across the battle, so the player could just memorize the answers and keep attacking. That's no fun :) &lt;/p&gt;

&lt;p&gt;Again, it made all the necessary adjustments. Here's the final output:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/qZjgU-2WTJw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Check out the full game on my GitHub:&lt;br&gt;
&lt;a href="https://github.com/klescosia/bayani-sql-fighter" rel="noopener noreferrer"&gt;https://github.com/klescosia/bayani-sql-fighter&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered assistants have come a long way since their inception.&lt;/li&gt;
&lt;li&gt;What makes Amazon Q special is its training data. It is fine-tuned on years of AWS knowledge, best practices, resources, and well-architected patterns.&lt;/li&gt;
&lt;li&gt;This tool enabled me to quickly build a working application in just one prompt.&lt;/li&gt;
&lt;li&gt;It can automate most development tasks so that users can focus on delivering value, though it's important to take the output with a grain of salt and do your due diligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, it was a fun experience, and I definitely would like to improve on this further. I could spend the whole day talking to Amazon Q CLI, but maybe that's for another blog or video. I'll keep you posted! I highly recommend Amazon Q; I think it's one of those useful tools that can really help you in your development. Did I mention that you can also use Amazon Q in your favorite IDE? &lt;/p&gt;

&lt;p&gt;Check this out:&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-in-IDE.html" rel="noopener noreferrer"&gt;Using Amazon Q Developer in the IDE.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;br&gt;
 &lt;/p&gt;

</description>
      <category>aws</category>
      <category>awschallenge</category>
      <category>ai</category>
      <category>sql</category>
    </item>
    <item>
      <title>Securing Amazon Redshift - Best Practices for Access Control</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Sun, 12 Jan 2025 16:59:14 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/securing-amazon-redshift-best-practices-for-access-control-15l6</link>
      <guid>https://dev.to/awscommunity-asean/securing-amazon-redshift-best-practices-for-access-control-15l6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Not long ago, I had the chance to conduct a knowledge transfer session focused on access management in Amazon Redshift for our partner client. As I started researching the topic, I realized something surprising - while there’s plenty of material explaining Redshift’s features, &lt;strong&gt;finding clear best practices&lt;/strong&gt; or &lt;strong&gt;structured approaches&lt;/strong&gt; for managing access was a challenge. Most of the information available felt basic and lacked the depth I was looking for. Yes, I also watched re:Invent videos.&lt;/p&gt;

&lt;p&gt;This realization motivated me to dive deeper and build a better understanding of how to effectively secure Redshift. In this blog, I want to share what I’ve learned - not just the technical details, but practical strategies and insights I’ve gained from working hands-on with Redshift access management. Whether you’re setting up a new cluster or refining an existing one, I hope this guide gives you the tools and confidence to tackle access management with clarity and purpose. &lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Redshift Access Management
&lt;/h2&gt;

&lt;p&gt;Of course, before we can apply these settings in Redshift, we need to understand how its security framework works. &lt;/p&gt;

&lt;h3&gt;
  
  
  Redshift Built-In Security and Compliance
&lt;/h3&gt;

&lt;p&gt;Let’s review the core components of Redshift security.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For this blog, I’ll skip VPC isolation since, if you’ve been using AWS for a while, you’re likely familiar with VPC-supported services.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Redshift supports multiple authentication methods to verify user identities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/generating-user-credentials.html" rel="noopener noreferrer"&gt;Using IAM authentication to generate database user credentials&lt;/a&gt; - You can use AWS IAM (Users, Roles) to manage user access.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/federated-authentication-to-amazon-redshift-using-aws-single-sign-on/" rel="noopener noreferrer"&gt;Single Sign-On (SSO)&lt;/a&gt; - Simplifies user access by leveraging corporate identity providers. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/amazon-redshift-identity-federation-with-multi-factor-authentication/" rel="noopener noreferrer"&gt;MFA (Multi-Factor Authentication)&lt;/a&gt; - You can enforce MFA for added security.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Credentials&lt;/strong&gt; - Traditional username and password-based authentication. Make sure to &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-secrets-manager-integration.html" rel="noopener noreferrer"&gt;store your superuser or admin credentials in AWS Secrets Manager&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once users are authenticated, Redshift manages what they can access through robust authorization features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default Permissions&lt;/strong&gt; - By default, only the owner of a database object (schema, tables, views) can modify or delete it, ensuring secure defaults. &lt;strong&gt;THIS CONCEPT IS IMPORTANT.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users and Groups&lt;/strong&gt; - Permissions can be granted to individual users or groups, allowing for more efficient access control. Groups are always advisable so that you avoid the pain of managing permissions user by user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt; - Roles simplify managing permissions by grouping privileges and assigning them to users.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A Review on Authentication vs. Authorization&lt;/strong&gt;: &lt;br&gt;
Authentication verifies &lt;strong&gt;&lt;em&gt;who a user is&lt;/em&gt;&lt;/strong&gt;, ensuring only valid individuals or systems can connect to your Redshift cluster through methods like credentials, MFA, or federated SSO. Authorization, on the other hand, determines &lt;em&gt;&lt;strong&gt;what authenticated users can access and do within the system&lt;/strong&gt;&lt;/em&gt;, such as querying specific tables, viewing certain rows, or accessing sensitive columns.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;3. Data Encryption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security isn’t just about controlling access; it’s also about protecting the data itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data at Rest&lt;/strong&gt; - Encrypted using AWS Key Management Service (KMS) or customer-managed keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data in Transit&lt;/strong&gt; - Protected with SSL to ensure secure communication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Data Encryption&lt;/strong&gt; - Ensures sensitive data remains protected even during data loading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Advanced Access Management Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column-Level Security (CLS)&lt;/strong&gt; - Restricts access to sensitive columns (e.g., SSNs, mobile numbers, credit cards) while allowing access to non-sensitive data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row-Level Security (RLS)&lt;/strong&gt; - Ensures users see only the rows they are authorized to access, based on attributes or roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Data Masking&lt;/strong&gt; - Masks sensitive information for users without full permissions, providing an extra layer of data protection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Compliance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Third-party auditors assess the security and compliance of Amazon Redshift as part of multiple AWS compliance programs. These include SOC, PCI, FedRAMP, HIPAA, and others. &lt;/p&gt;

&lt;p&gt;Here's a link for the list: &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/security-compliance.html" rel="noopener noreferrer"&gt;Compliance validation for Amazon Redshift&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Understanding these foundational components is the first step toward mastering Redshift access management, since all of these components work together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default Permissions
&lt;/h3&gt;

&lt;p&gt;I want to expand a bit more on default permissions in Redshift. In my experience, overlooking how these default settings interact with specific roles or use cases can lead to prolonged back-and-forth conversations between users and administrators, especially when working across time zones. This is why it’s essential to understand how these permissions work and how to adjust them as needed.&lt;/p&gt;

&lt;p&gt;New users are assigned a set of default permissions that determine their initial access rights. These permissions provide a basic level of access to the database, allowing users to perform common actions while restricting access to sensitive areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Principle of Object Ownership&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a user creates an object (e.g., a table, view, or schema) in Redshift, they automatically become its owner.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Rule&lt;/strong&gt; - Only the owner has permission to modify or drop the object unless they &lt;strong&gt;explicitly grant those permissions to others&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;This principle ensures that no unauthorized user can tamper with critical objects, even if they have general access to the schema (see the short sketch after this list).&lt;/li&gt;

&lt;/ul&gt;
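
&lt;p&gt;To make the ownership rule concrete, here's a minimal sketch, assuming a hypothetical &lt;code&gt;sales.orders&lt;/code&gt; table owned by the current user, plus hypothetical &lt;code&gt;reporting_user&lt;/code&gt; and &lt;code&gt;etl_service_user&lt;/code&gt; users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: only the owner of sales.orders can share or drop it
GRANT SELECT ON sales.orders TO reporting_user;      -- the owner explicitly shares read access
ALTER TABLE sales.orders OWNER TO etl_service_user;  -- ...or transfers ownership altogether
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;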

&lt;p&gt;&lt;strong&gt;&lt;code&gt;GRANT&lt;/code&gt; and &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;GRANT&lt;/code&gt;&lt;/strong&gt; statement provides access to &lt;strong&gt;existing objects&lt;/strong&gt; in Redshift, such as tables and views, for users, groups, or roles.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt;&lt;/strong&gt; statement is used to manage permissions for &lt;strong&gt;future objects that will be created in a schema&lt;/strong&gt;. This is especially useful in collaborative environments where you want new objects to automatically inherit specific access rules.&lt;/p&gt;
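
&lt;p&gt;Here's a minimal sketch of the two statements side by side, using a hypothetical &lt;code&gt;sales&lt;/code&gt; schema, a &lt;code&gt;reporting_readers&lt;/code&gt; group, and an &lt;code&gt;etl_service_user&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- GRANT covers objects that already exist in the schema
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP reporting_readers;

-- ALTER DEFAULT PRIVILEGES covers tables that etl_service_user will create later
ALTER DEFAULT PRIVILEGES FOR USER etl_service_user IN SCHEMA sales
GRANT SELECT ON TABLES TO GROUP reporting_readers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;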

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Regranting Permissions After Object Modification&lt;/strong&gt; - If a table or view is recreated (e.g., dropped and re-created), all previously granted permissions are lost and must be reapplied. Use the &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt; statement to handle this, or execute the &lt;code&gt;GRANT&lt;/code&gt; statement again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Advanced Access Management Features
&lt;/h2&gt;

&lt;p&gt;One of the requirements of a modern data platform is the ability to provide granular access control. Amazon Redshift delivers this through features such as Column-Level Security (CLS), Row-Level Security (RLS), and Dynamic Data Masking, ensuring users see only what they’re authorized to see. Let's go through them one-by-one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Column-Level Security (CLS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Column-Level Security restricts access to &lt;strong&gt;sensitive columns&lt;/strong&gt; within a table while allowing users to interact with non-sensitive data. This prevents unnecessary exposure of information such as personal identifiers or payment details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;customer_service_representatives&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating a view is also an option and can sometimes be a  straightforward and effective way to control column-level access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;customer_info&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also change this into a &lt;code&gt;MATERIALIZED VIEW&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Row-Level Security (RLS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RLS restricts access to specific rows within a table based on user roles or attributes, ensuring that users see only the data relevant to them.&lt;/p&gt;

&lt;p&gt;This feature is essential for scenarios where data segregation is required, such as multi-tenant environments or organizations with hierarchical roles (e.g., regional managers, department leads).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;RLS&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;region_a_policy&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Region A'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="n"&gt;RLS&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;region_a_policy&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt;
&lt;span class="n"&gt;sales_manager&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;SECURITY&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;CREATE RLS POLICY&lt;/code&gt; statement defines the filtering condition (region = 'Region A').&lt;/li&gt;
&lt;li&gt;The policy is attached to the &lt;code&gt;sales_data&lt;/code&gt; table and applies only to users assigned the &lt;code&gt;sales_manager&lt;/code&gt; role.&lt;/li&gt;
&lt;li&gt;Row-level security is enabled for the table with &lt;code&gt;ALTER TABLE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, you can always use a &lt;code&gt;VIEW&lt;/code&gt; for this. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxkdkq7v9ofuepw5vl0n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxkdkq7v9ofuepw5vl0n.jpg" alt="row-level-security" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;
An example of a row-level security policy



&lt;p&gt;&lt;strong&gt;3. Dynamic Data Masking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynamic Data Masking (DDM) is a powerful feature in Amazon Redshift that protects sensitive data by replacing it with masked values when accessed by users without full permissions. Unlike traditional encryption, which hides data entirely, masking obfuscates sensitive fields while maintaining their usability for tasks like reporting and analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Obfuscation&lt;/strong&gt; - Sensitive information is replaced with masked values when accessed by users without full permissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conditional Masking&lt;/strong&gt; - Masking can be applied based on user roles or attributes, ensuring the right users see appropriate levels of data.&lt;/p&gt;

&lt;p&gt;The masking is applied at query runtime, ensuring that the underlying data remains unchanged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvi3uwmaj82oa7mya8r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvi3uwmaj82oa7mya8r8.png" alt="dynamic-data-masking-redshift" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
Dynamic Data Masking in Redshift



&lt;p&gt;&lt;strong&gt;How to implement:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a &lt;code&gt;MASKING POLICY&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define how sensitive data should be masked.&lt;/p&gt;

&lt;p&gt;Example: Masking a credit card number with a fixed placeholder pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;mask_credit_card_full&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'000000XXXX0000'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Attach the &lt;code&gt;MASKING POLICY&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply the masking policy to the desired column and restrict access based on &lt;code&gt;ROLES&lt;/code&gt;. You can also attach this to &lt;code&gt;USERS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Example: Attaching the policy to the credit_card column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;mask_credit_card&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;credit_cards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;customer_support_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Test the Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When users with the &lt;code&gt;customer_support_role&lt;/code&gt; query the &lt;code&gt;credit_card&lt;/code&gt; column, they'll see masked values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Result:
****-****-****-1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users with elevated permissions (e.g., administrators), however, will see the full credit card number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced: Custom Masking Logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In cases where more complex masking logic is needed, you can define custom functions written in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;REDACT_CREDIT_CARD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;IMMUTABLE&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
    &lt;span class="n"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
    &lt;span class="n"&gt;regexp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"^([0-9]{6})[0-9]{5,6}([0-9]{4})"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;regexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"000000"&lt;/span&gt;
        &lt;span class="k"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"0000"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"{first}XXXXX{last}"&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpythonu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;custom_mask_credit_card&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;REDACT_CREDIT_CARD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="n"&gt;MASKING&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;custom_mask_credit_card&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;credit_cards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credit_card&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;data_analyst_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiokvhni6mrqbsemj54p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiokvhni6mrqbsemj54p.jpg" alt="dynamic-data-masking-policies-redshift" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;
Example of Masking Policies



&lt;h2&gt;
  
  
  Best Practices for Redshift Access Management
&lt;/h2&gt;

&lt;p&gt;While Amazon Redshift provides many options for different use cases, it can sometimes feel overwhelming to decide which one to use. The key is to simplify your approach by focusing on your specific needs and objectives, rather than trying to use every available feature. Remember your KISS principle, people :) - keep it simple, &lt;del&gt;stupid&lt;/del&gt; and straightforward.&lt;/p&gt;

&lt;p&gt;In this section, let’s use the example of a &lt;strong&gt;Sales Report System&lt;/strong&gt; where different teams, like sales analysts and regional managers, interact with data in various ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Role-Based Access Control (RBAC)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of assigning permissions to individual users, define roles that are related to their job responsibilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;sales_read_only&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;sales_read_only&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;sales_read_only&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;sales_analyst1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Roles can only be granted to &lt;strong&gt;USERS&lt;/strong&gt;, not GROUPS.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;USER&lt;/code&gt; can have multiple &lt;code&gt;ROLES&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. Least Privilege Principle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’ve been in the cloud domain for some time now, you’ve likely heard this phrase countless times - and for good reason. The Least Privilege Principle means granting users only the permissions they need to perform their specific tasks. This approach minimizes risks by limiting access to resources, ensuring users can do their job without accidentally compromising security.&lt;/p&gt;

&lt;p&gt;For instance, a &lt;strong&gt;Sales Analyst&lt;/strong&gt; needs access to view the &lt;code&gt;monthly_sales&lt;/code&gt; table but &lt;strong&gt;shouldn’t&lt;/strong&gt; be able to modify it. &lt;/p&gt;
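
&lt;p&gt;A minimal sketch of that setup, assuming a hypothetical &lt;code&gt;sales&lt;/code&gt; schema and a &lt;code&gt;sales_analyst1&lt;/code&gt; user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: grant read access only
GRANT USAGE ON SCHEMA sales TO sales_analyst1;
GRANT SELECT ON sales.monthly_sales TO sales_analyst1;
-- No INSERT, UPDATE, DELETE, or DROP is granted, so the analyst cannot modify the table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;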

&lt;p&gt;&lt;strong&gt;3. Use Groups for Database-Level Security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Group users into logical categories to manage permissions more efficiently. If you've worked with IAM in AWS, you'd be familiar with this concept.&lt;/p&gt;
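
&lt;p&gt;For example, here's a minimal sketch using a hypothetical &lt;code&gt;sales_analysts&lt;/code&gt; group and &lt;code&gt;sales&lt;/code&gt; schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: manage analysts through one group instead of per-user grants
CREATE GROUP sales_analysts;
ALTER GROUP sales_analysts ADD USER sales_analyst1, sales_analyst2;
GRANT USAGE ON SCHEMA sales TO GROUP sales_analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP sales_analysts;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;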

&lt;p&gt;&lt;strong&gt;4. Secure Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strengthen how users authenticate to your Redshift cluster to prevent unauthorized access. A secure authentication process is the first line of defense against security breaches.&lt;/p&gt;

&lt;p&gt;Store and manage database credentials securely using &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Implement Advanced Access Management Features&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Leverage Redshift’s built-in capabilities for fine-grained access control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Column-Level Security (CLS)&lt;/strong&gt; - Restrict access to sensitive fields&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Row-Level Security (RLS)&lt;/strong&gt; - Ensure users see only the rows they are authorized to access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these features are very useful when creating reports, though a key consideration is whether to apply them on the Redshift side or in BI tools like Power BI, Tableau, or QuickSight. Each approach has its trade-offs and should align with your overall data governance strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Data Masking (DDM)&lt;/strong&gt; - Mask sensitive fields like credit card numbers for users without full permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Use Admin Scripts and Tools for Access Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing and monitoring access in Amazon Redshift becomes much easier when you have the right tools and scripts in place. To ensure a secure and well-governed environment, it’s a game-changer to use tools like &lt;a href="https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminViews" rel="noopener noreferrer"&gt;Redshift Utils&lt;/a&gt;, which provide a collection of administrative views and scripts for monitoring and managing your Redshift clusters effectively.&lt;/p&gt;
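
&lt;p&gt;As a quick illustration (using built-in system views and functions rather than Redshift Utils itself, and a hypothetical &lt;code&gt;sales.monthly_sales&lt;/code&gt; table), you can audit who is able to read a given table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: list users and whether they can SELECT from sales.monthly_sales
SELECT u.usename,
       has_table_privilege(u.usename, 'sales.monthly_sales', 'select') AS can_select
FROM pg_user u
ORDER BY u.usename;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;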

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Managing access in Amazon Redshift is about keeping your data safe while making sure users can do their jobs effectively. Simple practices like creating roles for specific tasks, giving only the necessary permissions, and using features like column and row restrictions have worked well for us and for our partners. &lt;/p&gt;

&lt;p&gt;This approach has been effective in our case, but I’m always eager to learn from others. If you have similar practices, ideas, or constructive feedback, I’d love to hear them!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>redshift</category>
      <category>security</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Hello everyone, are you stuck with deploying your Glue Jobs to production? Here’s how I did it using AWS CDK and GitHub</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Tue, 07 Jan 2025 03:15:30 +0000</pubDate>
      <link>https://dev.to/escosiakyle/hello-everyone-are-you-stuck-with-deploying-your-glue-jobs-to-production-heres-how-i-did-it-1ok4</link>
      <guid>https://dev.to/escosiakyle/hello-everyone-are-you-stuck-with-deploying-your-glue-jobs-to-production-heres-how-i-did-it-1ok4</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/aws-builders" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png" alt="AWS Community Builders " width="350" height="350"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624370%2F9bc3b695-a2e8-4b55-9ce6-4af478526869.jpg" alt="" width="800" height="805"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/aws-builders/build-a-cicd-pipeline-using-aws-glue-aws-cdk-and-github-4m0j" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Build a CI/CD Pipeline Using AWS Glue, AWS CDK and GitHub&lt;/h2&gt;
      &lt;h3&gt;Kyle Escosia for AWS Community Builders  ・ Jan 7&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cicd&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Build a CI/CD Pipeline Using AWS Glue, AWS CDK and GitHub</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Tue, 07 Jan 2025 02:57:11 +0000</pubDate>
      <link>https://dev.to/aws-builders/build-a-cicd-pipeline-using-aws-glue-aws-cdk-and-github-4m0j</link>
      <guid>https://dev.to/aws-builders/build-a-cicd-pipeline-using-aws-glue-aws-cdk-and-github-4m0j</guid>
      <description>&lt;h2&gt;
  
  
  The Phantom Menace
&lt;/h2&gt;

&lt;p&gt;I’ve been a heavy user of AWS Glue since its early days, starting with version 0.9. It’s been a bit of a love-hate relationship—especially back then, when Glue jobs took what felt like an eternity to run. Over the years, though, Glue has come a long way. From upgrading Apache Spark to supporting modern data lake formats like &lt;strong&gt;Hudi&lt;/strong&gt;, &lt;strong&gt;Iceberg&lt;/strong&gt;, and &lt;strong&gt;Delta Lake&lt;/strong&gt;, to introducing &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-generative-ai-upgrades-for-apache-spark-in-aws-glue-preview/" rel="noopener noreferrer"&gt;Generative AI capabilities&lt;/a&gt;, Glue has evolved into a powerful tool for building scalable ETL solutions.&lt;/p&gt;

&lt;p&gt;But even as Glue has improved, I’ve found myself grappling with a different challenge—&lt;strong&gt;managing Glue jobs manually&lt;/strong&gt;. As I’ve built solutions for clients over the years, the lack of automation in provisioning and deploying Glue jobs has become a pain point.&lt;/p&gt;

&lt;p&gt;Back in the early days, we didn’t have CI/CD pipelines or automation tools to provision Glue jobs. Everything was done manually—configuring jobs, managing dependencies, and deploying them. At first, this seemed manageable for small-scale solutions, but as the complexity of pipelines grew, so did the problems. And I’ve always felt there was something wrong with that, since such a workflow creates room for error, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inconsistencies Across Environments (Dev, QA, Prod) &lt;/li&gt;
&lt;li&gt;Scaling issues - adding or modifying jobs manually is time-consuming and error-prone&lt;/li&gt;
&lt;li&gt;No Version Control - need I say more?&lt;/li&gt;
&lt;li&gt;Deployment Complexity - dependencies, configurations, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These challenges didn’t just slow the project down—they put the reliability of the data pipelines at risk, introducing errors and inefficiencies that could cascade into costly problems downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Hope
&lt;/h2&gt;

&lt;p&gt;Fast forward to today, and I’ve adopted a completely different approach. By using &lt;strong&gt;AWS CDK (Cloud Development Kit)&lt;/strong&gt; in combination with &lt;strong&gt;GitHub&lt;/strong&gt; (though this can also be GitLab, BitBucket), I’ve been able to solve many of these challenges. AWS CDK allows me to define Glue resources as &lt;strong&gt;Infrastructure-as-Code (IaC)&lt;/strong&gt;, ensuring consistency and scalability. Integrating this with GitHub and CI/CD workflows has made deploying Glue jobs faster, more reliable, and far less error-prone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Medallion Architecture as a Framework&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the data design patterns that is popular today is the Medallion Architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowivhseg0izufww48lcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowivhseg0izufww48lcd.png" alt="the-medallion-architecture" width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline that I built contains Glue scripts for the Bronze layer, Silver layer, and Gold layer. &lt;/p&gt;

&lt;h2&gt;
  
  
  Development Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbd8hloz64q0v5yo3ukw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbd8hloz64q0v5yo3ukw.png" alt="development-workflow" width="800" height="1426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure consistency and collaboration across the team, a structured development workflow is followed, as outlined in the diagram above. This workflow integrates tools like Jira for task tracking, connected to GitHub to map tickets to Git branches. The workflow is pretty standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23ez5zvqiksf9g1uldbk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23ez5zvqiksf9g1uldbk.jpeg" alt="darth-vader-meme" width="348" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You don't seem to believe me, eh? Read on :) &lt;/p&gt;




&lt;h2&gt;
  
  
  The Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;project-root/
├── ingestion/                         # For ingestion Glue jobs
│   ├── configs/      
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/                       # Scripts for ingestion Glue jobs
│   │   ├── dev-ingestion-script.py
│   │   └── prd-ingestion-script.py
│   ├── ingestion_stack.py             # CDK stack for ingestion jobs
│   └── README.md
│
├── standardization/                   # For standardization Glue jobs
│   ├── configs/
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/
│   │   ├── dev-standardization-script.py
│   │   └── prd-standardization-script.py
│   ├── standardization_stack.py    # CDK stack for standardization jobs
│   └── README.md
│
├── transformation/                   # For transformation Glue jobs
│   ├── configs/
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/
│   │   ├── dev-transformation-script.py
│   │   └── prd-transformation-script.py
│   ├── transformation_stack.py      # CDK stack for transformation jobs
│   └── README.md
│
├── loading/                         # For loading Glue jobs
│   ├── configs/
│   │   ├── jobs.csv
│   │   ├── custom_jobs.yaml
│   │   ├── default_configs.yaml
│   │   └── README.md
│   ├── scripts/
│   │   ├── dev-loading-script.py
│   │   └── prd-loading-script.py
│   ├── loading_stack.py              # CDK stack for loading jobs
│   └── README.md
│
├── upload_script.py                  # Script to upload files to S3
├── app.py                            # Root entry point for AWS CDK
├── requirements.txt                  # Python dependencies for CDK
└── README.md                         # High-level project documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Config Files: Defaults and Customizations
&lt;/h2&gt;

&lt;p&gt;Each folder &lt;code&gt;(ingestion/, standardization/, transformation/, loading/)&lt;/code&gt; contains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;configs/&lt;/strong&gt; - Configuration files specific to that component. These can share a similar structure but should have unique data for each purpose.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;jobs.csv&lt;/strong&gt; - Defines the Glue jobs and their classifications. This file acts as the source of truth for the jobs you want to deploy. The columns can be adjusted as needed. Take note of the &lt;strong&gt;classification&lt;/strong&gt; as well; it will be used to identify whether to provision the job as &lt;strong&gt;default&lt;/strong&gt; or &lt;strong&gt;custom&lt;/strong&gt;, which is discussed below.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;JobName&lt;/th&gt;
&lt;th&gt;Classification&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;ConnectionName&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dim-products&lt;/td&gt;
&lt;td&gt;default&lt;/td&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;redshift-conn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dim-users&lt;/td&gt;
&lt;td&gt;default&lt;/td&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;redshift-conn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fact-sales&lt;/td&gt;
&lt;td&gt;custom&lt;/td&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;redshift-conn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;default_configs.yaml&lt;/strong&gt; - Separation of default and custom configurations gives me the &lt;strong&gt;flexibility&lt;/strong&gt; to manage AWS Glue jobs efficiently. With default configurations, I can define a baseline setup—for example, provisioning a Glue job with a &lt;code&gt;G.1X&lt;/code&gt; worker type and 2 DPUs. This ensures that most jobs follow a consistent and standardized configuration, reducing the need for repetitive definitions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;WorkerType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G.1X&lt;/span&gt;
&lt;span class="na"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;GlueVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0"&lt;/span&gt;
&lt;span class="na"&gt;ExecutionClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STANDARD&lt;/span&gt;
&lt;span class="na"&gt;IAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/glue-role"&lt;/span&gt;
&lt;span class="na"&gt;Command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glueetl"&lt;/span&gt;
  &lt;span class="na"&gt;PythonVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
&lt;span class="na"&gt;DefaultArguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-metrics"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--TempDir"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://default-bucket/temp/"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--job-language"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-glue-datacatalog"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spark-event-logs-path"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/logs/sparkHistoryLogs/"&lt;/span&gt;
&lt;span class="na"&gt;ScriptLocationBase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/cdk/scripts/transformation/"&lt;/span&gt;
&lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales"&lt;/span&gt;
  &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;custom_configs.yaml&lt;/strong&gt; - Custom configurations, on the other hand, allow me to handle exceptions where jobs require more specialized settings, like &lt;strong&gt;higher memory&lt;/strong&gt; or specific Spark arguments. By separating these two, I can keep the defaults simple and focused while tailoring individual jobs as needed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fact-sales&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;WorkerType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G.2X&lt;/span&gt;
  &lt;span class="na"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;GlueVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0"&lt;/span&gt;
  &lt;span class="na"&gt;ExecutionClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STANDARD&lt;/span&gt;
  &lt;span class="na"&gt;IAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/glue-role"&lt;/span&gt;
  &lt;span class="na"&gt;DefaultArguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-metrics"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--TempDir"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/temp/"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--job-language"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;
  &lt;span class="na"&gt;Command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glueetl"&lt;/span&gt;
    &lt;span class="na"&gt;PythonVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
  &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Dashboard"&lt;/span&gt;
    &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev"&lt;/span&gt;

&lt;span class="na"&gt;transformation-job2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;WorkerType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;G.1X&lt;/span&gt;
  &lt;span class="na"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;GlueVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.0"&lt;/span&gt;
  &lt;span class="na"&gt;ExecutionClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FLEX&lt;/span&gt;
  &lt;span class="na"&gt;IAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/glue-role"&lt;/span&gt;
  &lt;span class="na"&gt;DefaultArguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-continuous-cloudwatch-log"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="na"&gt;Tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Dashboard"&lt;/span&gt;
    &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;scripts/&lt;/strong&gt; - contains the actual Python-based Glue scripts specific to the component. &lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;

&lt;span class="c1"&gt;# Get job arguments
&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize GlueContext and SparkContext
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;

&lt;span class="c1"&gt;# Input and output paths (passed as parameters)
&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., s3://your-bucket/raw-data/customers/
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., s3://your-bucket/processed-data/dim_customers/
&lt;/span&gt;
&lt;span class="c1"&gt;# Load raw data into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inferSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Register the DataFrame as a temporary SQL table
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use Spark SQL to create the dimension table
&lt;/span&gt;&lt;span class="n"&gt;dimension_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
SELECT
    CAST(customer_id AS STRING) AS customer_id,
    first_name,
    last_name,
    email,
    CAST(date_of_birth AS DATE) AS date_of_birth,
    country,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
FROM
    raw_customers
WHERE
    country IS NOT NULL
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Execute the SQL query
&lt;/span&gt;&lt;span class="n"&gt;dimension_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter to include only the latest record per customer
&lt;/span&gt;&lt;span class="n"&gt;final_dimension_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dimension_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_num&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write the resulting DataFrame to S3 in Parquet format
&lt;/span&gt;&lt;span class="n"&gt;final_dimension_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dimension table created and saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;_stack.py&lt;/strong&gt; - AWS CDK stack for defining Glue jobs for that specific component.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_glue&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;glue&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CfnOutput&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;constructs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GlueTransformationStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A CDK stack for creating AWS Glue jobs based on configurations specified in CSV and YAML files.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_csv_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_config_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_config_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Initialize the GlueTransformationStack.

        :param scope: The scope in which to define this construct.
        :param id: The scoped construct ID.
        :param job_csv_file: Path to the CSV file containing job definitions.
        :param custom_config_file: Path to the YAML file containing custom configurations.
        :param default_config_file: Path to the YAML file containing default configurations.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load configurations from YAML files
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_config_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;custom_configs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_config_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;default_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Process each job in the CSV file
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_csv_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8-sig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;csv_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DictReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_job_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_job_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Process a single row from the CSV file and create a Glue job.

        :param row: A dictionary representing a row from the CSV file.
        :param custom_configs: A dictionary of custom job configurations.
        :param default_config: A dictionary of default configurations.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;job_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JobName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Classification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;connection_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ConnectionName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Determine job configuration: merge custom settings if available
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;job_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;custom_configs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;

        &lt;span class="c1"&gt;# Ensure 'Command' is defined in the configuration
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

        &lt;span class="c1"&gt;# Set script location directly from the job name
&lt;/span&gt;        &lt;span class="n"&gt;script_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ScriptLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ScriptLocationBase&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;script_name&lt;/span&gt;
        &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ConnectionName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;connection_name&lt;/span&gt;

        &lt;span class="c1"&gt;# Merge tags from default and custom configurations
&lt;/span&gt;        &lt;span class="n"&gt;default_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;custom_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;combined_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;default_tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;custom_tags&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_glue_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combined_tags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_glue_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Create an AWS Glue job using the provided configuration.

        :param job_name: The name of the Glue job.
        :param job_config: A dictionary containing the job configuration.
        :param tags: Combined tags for the Glue job.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;glue_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IAMRole&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;glue_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GlueVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnJob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JobCommandProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;script_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ScriptLocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;python_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PythonVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;default_arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DefaultArguments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;execution_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ExecutionClass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CfnJob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConnectionsListProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ConnectionName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;worker_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WorkerType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;number_of_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NumberOfWorkers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;


        &lt;span class="nc"&gt;CfnOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;Output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;glue_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name of the Glue job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automating Script Uploads
&lt;/h2&gt;

&lt;p&gt;This script dynamically uploads scripts from the &lt;code&gt;scripts/&lt;/code&gt; folder of each component to an S3 bucket. It takes the bucket name as a command-line argument; update the &lt;strong&gt;folders&lt;/strong&gt; mapping and the &lt;strong&gt;s3_destination&lt;/strong&gt; prefix to target specific components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upload_files_to_s3.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_files_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Uploads files from a local directory to an S3 bucket.

    :param local_directory: The local directory to upload.
    :param bucket_name: The name of the S3 bucket.
    :param s3_destination: The destination path in the S3 bucket.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;s3_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Walk through the local directory
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;local_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;relative_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relative_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Upload file to S3
&lt;/span&gt;            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;s3_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully uploaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uploaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to upload &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to upload &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

    &lt;span class="c1"&gt;# Check command-line arguments
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: python upload_files_to_s3.py &amp;lt;bucket_name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Script called with insufficient arguments.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse the bucket name
&lt;/span&gt;    &lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Define the folders to process
&lt;/span&gt;    &lt;span class="n"&gt;folders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standardization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standardization/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transformation/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loading/scripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Loop through each folder and upload files
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_directory&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;s3_destination&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glue-scripts/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Processing folder: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing folder: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;upload_files_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_destination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bringing It All Together with app.py
&lt;/h2&gt;

&lt;p&gt;The root-level CDK application orchestrates the deployment of all component-specific stacks (Ingestion, Standardization, Transformation, and Loading).&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ingestion.ingestion_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IngestionStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;standardization.standardization_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardizationStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformation.transformation_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TransformationStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;loading.loading_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoadingStack&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate each component stack
&lt;/span&gt;&lt;span class="nc"&gt;IngestionStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IngestionStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;StandardizationStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;StandardizationStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;TransformationStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TransformationStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;LoadingStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LoadingStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
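
&lt;p&gt;One note: the component stacks shown earlier expect the paths to &lt;code&gt;jobs.csv&lt;/code&gt;, &lt;code&gt;custom_configs.yaml&lt;/code&gt;, and &lt;code&gt;default_configs.yaml&lt;/code&gt; as constructor arguments. Here is a minimal sketch of how an instantiation could look with those paths passed in; the file locations are illustrative and depend on your repository layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: point each stack at its own component's config files
TransformationStack(
    app,
    "TransformationStack",
    job_csv_file="transformation/configs/jobs.csv",
    custom_config_file="transformation/configs/custom_configs.yaml",
    default_config_file="transformation/configs/default_configs.yaml",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;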



&lt;h2&gt;
  
  
  Local Testing
&lt;/h2&gt;

&lt;p&gt;Once you have these components, you can test locally using the following commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthesize the CloudFormation Templates&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk synth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploy to a Test Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please make sure that your local AWS credentials point to your development environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy IngestionStack &lt;span class="nt"&gt;--require-approval&lt;/span&gt; never
cdk deploy StandardizationStack &lt;span class="nt"&gt;--require-approval&lt;/span&gt; never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify Resources in AWS:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confirm that Glue jobs were created with the expected configurations.&lt;/p&gt;
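
&lt;p&gt;A quick way to do this, besides the console or &lt;code&gt;aws glue get-job&lt;/code&gt;, is a short boto3 snippet; the job name below is just one from the sample &lt;code&gt;jobs.csv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# Fetch one of the deployed jobs and print the settings we care about
glue_client = boto3.client("glue")
job = glue_client.get_job(JobName="dim-products")["Job"]

print(job["WorkerType"], job["NumberOfWorkers"], job["GlueVersion"], job["ExecutionClass"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;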

&lt;p&gt;&lt;strong&gt;(OPTIONAL) Purge the resources:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CI/CD Integration with GitHub
&lt;/h2&gt;

&lt;p&gt;Once you have the components ready, the next step is to push the code into a repository. In this section, we’ll explore how to push your code to a repository (like GitHub) and configure CI/CD to automatically deploy Glue jobs on each push to a specific branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/dtconsole/latest/userguide/connections-create-github.html" rel="noopener noreferrer"&gt;Authenticate GitHub to AWS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub Account&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create Repository&lt;/li&gt;
&lt;li&gt;Push the code into the repository&lt;/li&gt;
&lt;li&gt;Go to your repository and click on the &lt;strong&gt;Actions&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;New Workflow&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;When choosing a workflow, locate and choose &lt;strong&gt;set up a workflow yourself&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Paste the following YAML, changing values as necessary.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Deploy Data Lake Glue Jobs

on:
  push:
    branches:
      - main # Replace with your branch if different

permissions:
  id-token: write   # Required for requesting the JWT
  contents: read    # Required for actions/checkout

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
    # Step 1: Check out the code repository
    - name: Checkout Repository
      uses: actions/checkout@v3

    # Step 2: Configure AWS credentials using the IAM role
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4.0.2
      with:
        role-to-assume: ${{ secrets.AWS_IAM_ROLE }}
        aws-region: ${{ secrets.AWS_REGION }}
        role-session-name: GitHubActionsDeployment

    # Step 3: Set up Python
    - name: Setup Python
      uses: actions/setup-python@v5.1.0
      with:
        python-version: '3.10'
        cache: 'pip'

    # Step 4: Set up Node.js for AWS CDK
    - name: Setup Node.js
      uses: actions/setup-node@v4.0.0
      with:
        node-version: '21.2.0'

    # Step 5: Install AWS CDK CLI globally
    - name: Install AWS CDK
      run: npm install -g aws-cdk

    # Step 6: Verify CDK installation
    - name: Verify CDK Installation
      run: cdk --version

    # Step 7: Install Python dependencies globally
    - name: Install Python Dependencies
      run: pip install -r ./infrastructure/requirements.txt

    # Step 8: Upload Glue scripts for each layer
    - name: Upload Glue Scripts to S3
      run: python3 upload_files.py &amp;lt;your-bucket-name&amp;gt;

    # Step 9: Deploy all Glue stacks using AWS CDK
    - name: Deploy Glue Stacks
      run: cdk deploy IngestionStack StandardizationStack TransformationStack LoadingStack --require-approval never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Workflow Structure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trigger (on)&lt;/strong&gt; - Specifies that the workflow runs when there is a push to the main branch. If your main branch is named differently (e.g., &lt;code&gt;main-prod&lt;/code&gt;), replace it here. This ensures the workflow executes only on the primary branch where approved code resides. More on GitHub event triggers can be found in the GitHub Actions documentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Permissions&lt;/strong&gt; - note that you must configure OIDC with GitHub from AWS. See the prerequisites.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;id-token: write&lt;/code&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grants the workflow permission to request an OpenID Connect (OIDC) token for securely authenticating with AWS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;contents: read&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Allows the workflow to read repository content during the actions/checkout step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job Definition:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runner&lt;/strong&gt; - Specifies the operating system environment (ubuntu-latest) where the job executes. This ensures compatibility with Python and Node.js for AWS CDK and Glue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check Out the Code Repository&lt;/strong&gt; - Checks out the code from the repository so that subsequent steps have access to the scripts. This is similar to checking out the code locally; it pulls the contents of the repository into the runner's environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure AWS Credentials&lt;/strong&gt; - Configures AWS credentials using an IAM role defined in GitHub Secrets. To define secrets in GitHub, follow the GitHub documentation on repository secrets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Parameters&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;role-to-assume&lt;/code&gt;: The ARN of the IAM role the workflow assumes for permissions. You can also use access keys, but I highly recommend IAM roles for security reasons.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws-region&lt;/code&gt;: The region where AWS resources are deployed (e.g., ap-southeast-1).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Up Python&lt;/strong&gt; - Sets up &lt;code&gt;Python 3.10&lt;/code&gt; in the workflow environment.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Up Node.js&lt;/strong&gt; - Installs &lt;code&gt;Node.js&lt;/code&gt;, required for running AWS CDK.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install AWS CDK&lt;/strong&gt; - Installs AWS CDK globally using npm. This is necessary to execute cdk commands.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify CDK Installation&lt;/strong&gt; - Checks that AWS CDK was installed correctly by outputting its version.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install Python Dependencies&lt;/strong&gt; - Installs Python dependencies specified in requirements.txt, such as &lt;code&gt;aws-cdk-lib&lt;/code&gt;, &lt;code&gt;constructs&lt;/code&gt;, or &lt;code&gt;boto3&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upload Glue Scripts to S3&lt;/strong&gt; - Executes a Python script to upload Glue job scripts to the specified S3 bucket. This ensures the Glue jobs have access to the latest ETL scripts stored in S3 (a minimal sketch of such a script is shown after this list).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy Glue Stacks&lt;/strong&gt; - Deploys all defined Glue stacks (&lt;strong&gt;IngestionStack&lt;/strong&gt;, &lt;strong&gt;StandardizationStack&lt;/strong&gt;, etc.) using AWS CDK. &lt;code&gt;--require-approval never&lt;/code&gt;: Skips manual approval prompts, enabling fully automated deployments.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
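&lt;p&gt;For reference, here is a minimal sketch of what an &lt;code&gt;upload_files.py&lt;/code&gt; script could look like. The local folder layout and S3 prefixes are assumptions for illustration - adjust them to your own project structure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

import boto3


def main(bucket_name):
    s3 = boto3.client("s3")
    # Hypothetical local layout: one Glue script per layer
    layers = ["ingestion", "standardization", "transformation", "loading"]
    for layer in layers:
        local_path = f"glue_scripts/{layer}/job.py"  # assumed local path
        s3_key = f"scripts/{layer}/job.py"           # assumed S3 prefix
        s3.upload_file(local_path, bucket_name, s3_key)
        print(f"Uploaded {local_path} to s3://{bucket_name}/{s3_key}")


if __name__ == "__main__":
    main(sys.argv[1])  # bucket name passed in by the workflow step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;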

&lt;p&gt;You can now push to your main branch and automatically deploy Glue Jobs using GitHub Actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Have The High Ground
&lt;/h2&gt;

&lt;p&gt;Adopting a CI/CD-driven workflow for deploying AWS Glue jobs has been a transformative step in solving the problems I described earlier. By integrating AWS CDK, GitHub, and automated pipelines, I’ve significantly improved the deployment process. &lt;strong&gt;Manual errors&lt;/strong&gt;, &lt;strong&gt;configuration inconsistencies&lt;/strong&gt;, and &lt;strong&gt;deployment delays&lt;/strong&gt; are challenges I’ve left behind, allowing me to focus on delivering reliable and scalable data solutions.&lt;/p&gt;

&lt;p&gt;This approach ensures that every change is &lt;strong&gt;traceable&lt;/strong&gt;, &lt;strong&gt;reviewable&lt;/strong&gt;, and &lt;strong&gt;deployed&lt;/strong&gt; &lt;strong&gt;consistently&lt;/strong&gt; across environments. &lt;/p&gt;

&lt;p&gt;This workflow has worked well for my project based on the requirements and project goals. However, I know that there’s always &lt;strong&gt;room for improvement&lt;/strong&gt;, and &lt;strong&gt;workflows often evolve over time to meet new challenges.&lt;/strong&gt; If you have suggestions, insights, or ideas on how to further enhance this approach, I’d be happy to discuss them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrpxl46gzn0y3d656bg7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrpxl46gzn0y3d656bg7.gif" alt="obi-wan-star-wars" width="462" height="200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Installing Python Packages in AWS Glue using AWS CodeArtifact without Internet Access</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Fri, 10 Nov 2023 09:24:22 +0000</pubDate>
      <link>https://dev.to/aws-builders/installing-python-packages-in-aws-glue-using-aws-codeartifact-cag</link>
      <guid>https://dev.to/aws-builders/installing-python-packages-in-aws-glue-using-aws-codeartifact-cag</guid>
      <description>&lt;h2&gt;
  
  
  Background of the Problem
&lt;/h2&gt;

&lt;p&gt;I spent quite some time figuring out how to install Python packages in AWS Glue inside a VPC &lt;strong&gt;without&lt;/strong&gt; internet access, and I managed to figure it out after some tinkering. As a recap, AWS introduced support for the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#addl-python-modules-support" rel="noopener noreferrer"&gt;installation of Python Packages&lt;/a&gt; via the &lt;code&gt;--additional-python-modules&lt;/code&gt; option. While this is a lifesaver - especially for those who started working with Glue 1.0 - it only works if your Glue Job can connect to the internet.&lt;/p&gt;

&lt;p&gt;Given the emphasis on security, a number of customers choose to &lt;strong&gt;limit/restrict&lt;/strong&gt; egress traffic from their VPC to the public internet and therefore need a method to manage the packages used by their data pipelines.&lt;/p&gt;

&lt;p&gt;This article focuses on that challenge. It is a step-by-step guide on how to set up your Glue Job to connect to a PyPI mirror via AWS CodeArtifact, allowing you to install packages in a Private Subnet. For this tutorial, a working knowledge of AWS basics (e.g. networking, core services) is recommended, but I'll try my best to explain each part.&lt;/p&gt;

&lt;p&gt;Let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bk5w0y20awdy2nygox5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bk5w0y20awdy2nygox5.jpg" alt="kyle-escosia-aws-codeartifact-aws-glue-integration" width="800" height="473"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 1. Architecture for the AWS CodeArtifact and AWS Glue Integration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The core of the solution is AWS CodeArtifact, which allows you to &lt;strong&gt;securely&lt;/strong&gt; store, publish, and share packages - in this case, &lt;code&gt;PyPi&lt;/code&gt; packages - across your &lt;strong&gt;private network&lt;/strong&gt; without connecting directly to the public PyPI repository. This is made possible by VPC Endpoints through &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html" rel="noopener noreferrer"&gt;PrivateLink connections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You do need to create endpoints for S3 and CodeArtifact for this to work; otherwise, you'll get &lt;code&gt;Connection timed out&lt;/code&gt; errors.&lt;/p&gt;

&lt;p&gt;Here are some resources to help you out with that:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html#create-gateway-endpoint-s3" rel="noopener noreferrer"&gt;Gateway endpoints for Amazon S3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/codeartifact/latest/ug/create-vpc-endpoints.html" rel="noopener noreferrer"&gt;Create VPC endpoints for CodeArtifact&lt;/a&gt; - if via console, kindly follow the same steps as with the S3 Endpoint.&lt;/p&gt;
&lt;h2&gt;
  
  
  What you will need
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;An AWS account&lt;/strong&gt;, of course &lt;/p&gt;

&lt;p&gt;Note: Test this on your dev environment first&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/codeartifact/" rel="noopener noreferrer"&gt;AWS CodeArtifact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;AWS Access Keys (with permissions on AWS CodeArtifact)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I won't go over these tools one by one, as I believe ChatGPT can give you their definitions and uses better than I can.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;In this section, I'll go over the step-by-step solution for each process.&lt;/p&gt;

&lt;p&gt;Let's start by setting up our CodeArtifact Repository.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up AWS CodeArtifact
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Create a CodeArtifact Repository
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p3wfa338u5ndilqhskl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p3wfa338u5ndilqhskl.png" alt="kyle-escosia-codeartifact-home" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qha4e6ladmyyhnz64dt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qha4e6ladmyyhnz64dt.png" alt="kyle-escosia-codeartifact-home-creation" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Fill in the details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Repository Name&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Repository Details&lt;/code&gt; (Optional)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Public upstream repositories&lt;/code&gt; - I chose PyPi&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Select the domain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf6ob8i49s5bt2z55gvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf6ob8i49s5bt2z55gvt.png" alt="kyle-escosia-codeartifact-domain" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Specify your domain name&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5g22xqvwk3xbsuyugwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5g22xqvwk3xbsuyugwj.png" alt="kyle-escosia-codeartifact-repo-list" width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should have the following repositories after creation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;your-repo&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pypi-store&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that that's done, you can inspect the created repositories. The &lt;code&gt;pypi-store&lt;/code&gt; repository was created automatically. &lt;code&gt;&amp;lt;your-repo&amp;gt;&lt;/code&gt; is the one we're interested in, since it will contain our Python packages.&lt;/p&gt;

&lt;p&gt;With that, let's proceed with configuring your local environment.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up your local environment
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Install Docker
&lt;/h3&gt;

&lt;p&gt;Install here:&lt;br&gt;
&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;https://docs.docker.com/get-docker/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Pull the Amazon Linux 2 Image
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker pull amazonlinux:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Run the container
&lt;/h3&gt;

&lt;p&gt;Run the container and interact with the command line of the container using &lt;code&gt;-it&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -it --rm -v /path/on/host:/path/in/container image_name /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;-v /path/on/host:/path/in/container&lt;/code&gt;: This is the volume mount option. It mounts a directory from your host &lt;code&gt;(/path/on/host)&lt;/code&gt; into the container &lt;code&gt;(/path/in/container)&lt;/code&gt;. Any changes made in the mounted directory inside the container will be reflected on the host directory and vice versa.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;--rm&lt;/code&gt;: This tells Docker to automatically remove the container when it exits. This means that once you're done with the bash session and exit, the container will be cleaned up, and no container filesystem will be left on your host system. &lt;strong&gt;Feel free to remove this option if you do not want your container to behave like that.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Install Python 3.10
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
$ tar -xf Python-3.10.0.tgz
$ cd Python-3.10.0
$ ./configure --enable-optimizations
$ sudo make altinstall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;AWS Glue 4.0&lt;/code&gt; runs &lt;code&gt;Python 3.10&lt;/code&gt; version. For others, kindly refer to the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/release-notes.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Install AWS CLI
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;pip&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install awscli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;yum&lt;/code&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Configure AWS Credentials
&lt;/h3&gt;

&lt;p&gt;Refer to this for creating your access keys:&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After getting the values for the access keys, configure your AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Connect to Repository
&lt;/h3&gt;

&lt;p&gt;Go back to the AWS Console and click on your created repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe8g3cdbt92f6qd3hrec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe8g3cdbt92f6qd3hrec.png" alt="kyle-escosia-codeartifact-my-code-repository" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;View connection instructions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczf853bicpm9yv9lvn4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczf853bicpm9yv9lvn4s.png" alt="kyle-escosia-codeartifact-connection-instructions" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copy&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; the command in &lt;code&gt;Step 3&lt;/code&gt; of the &lt;code&gt;Connection instructions&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws codeartifact login \
--tool pip \
--repository &amp;lt;your-repo-name&amp;gt; \
--domain &amp;lt;your-domain-name&amp;gt; \
--domain-owner &amp;lt;your-account-id&amp;gt; \
--region &amp;lt;your-region&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once successfully logged in, kindly note that any &lt;code&gt;pip install&lt;/code&gt; command run from the container will now resolve packages through this repository (pulling them from the public PyPI upstream and retaining them there) instead of going directly to the public index.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Install Python Packages
&lt;/h3&gt;

&lt;p&gt;Install your packages!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdkznkm0xoqwg60h10g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdkznkm0xoqwg60h10g8.png" alt="kyle-escosia-codeartifact-packages-installed" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the repository is ready, we can install from AWS Glue using this PyPI mirror that we created!&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CodeArtifact and AWS Glue Integration
&lt;/h2&gt;

&lt;p&gt;This section discusses how you can point the installation of Python packages in AWS Glue to AWS CodeArtifact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Get the Authorization Token
&lt;/h3&gt;

&lt;p&gt;We need to generate an &lt;code&gt;authorization token&lt;/code&gt; from &lt;a href="https://docs.aws.amazon.com/codeartifact/latest/ug/tokens-authentication.html" rel="noopener noreferrer"&gt;AWS CodeArtifact&lt;/a&gt;. This is done using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws codeartifact get-authorization-token \
--domain my_domain \
--domain-owner 111122223333 \
--query authorizationToken \
--output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the maximum duration of this token is &lt;code&gt;12 hours&lt;/code&gt;. And yes, you do &lt;strong&gt;need to generate this every day&lt;/strong&gt; if you are planning to run your jobs daily.&lt;/p&gt;

&lt;p&gt;Store this in a &lt;code&gt;.txt&lt;/code&gt; file. &lt;/p&gt;
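&lt;p&gt;If you run your jobs on a schedule, you will probably want to script this refresh instead of doing it by hand. Here is a minimal &lt;code&gt;boto3&lt;/code&gt; sketch of the same call - the domain name and account ID follow the CLI example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

codeartifact = boto3.client("codeartifact", region_name="ap-southeast-1")

# Request a token valid for the maximum of 12 hours
response = codeartifact.get_authorization_token(
    domain="my_domain",
    domainOwner="111122223333",
    durationSeconds=43200,
)
token = response["authorizationToken"]

# Persist it for the step that configures the Glue job parameters
with open("codeartifact_token.txt", "w") as f:
    f.write(token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;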

&lt;h3&gt;
  
  
  Step 2: Configure Job Details in Glue Job
&lt;/h3&gt;

&lt;p&gt;Navigate to your Glue Job&lt;/p&gt;

&lt;p&gt;I'm assuming you have already configured the &lt;code&gt;Data Connections&lt;/code&gt;. If not, kindly configure them before proceeding with this step. The idea is that the Glue Job will run inside the Private Subnet of the VPC. &lt;/p&gt;

&lt;p&gt;See screenshot below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyez8r991vetawpo93h7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyez8r991vetawpo93h7w.png" alt="kyle-escosia-step-glue-configure-connection" width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;Job Parameters&lt;/code&gt;, add the following &lt;code&gt;key-value&lt;/code&gt; pairs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter 1&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Key - "--additional-python-modules" // without double quotes

Value - "&amp;lt;your-python-package&amp;gt;==&amp;lt;version&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameter 2&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key - "--python-modules-installer-option"

Value - "--no-cache-dir --verbose --index-url https://aws:&amp;lt;CODEARTIFACT-AUTH-TOKEN&amp;gt;@&amp;lt;DOMAIN-NAME&amp;gt;-&amp;lt;ACCOUNT-ID&amp;gt;.d.codeartifact.&amp;lt;REGION-NAME&amp;gt;.amazonaws.com/pypi/pypi-store/simple/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CODEARTIFACT-AUTH-TOKEN&lt;/code&gt; - refer to Step 1&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DOMAIN-NAME&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ACCOUNT-ID&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;REGION-NAME&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
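&lt;p&gt;If you would rather wire these parameters in from the same script that refreshes the token (instead of editing them in the console every day), a hedged &lt;code&gt;boto3&lt;/code&gt; sketch using &lt;code&gt;start_job_run&lt;/code&gt; could look like this - the job name and package are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

region = "ap-southeast-1"
codeartifact = boto3.client("codeartifact", region_name=region)
glue = boto3.client("glue", region_name=region)

# Fetch a fresh 12-hour token (domain and account ID are placeholders)
token = codeartifact.get_authorization_token(
    domain="my_domain", domainOwner="111122223333", durationSeconds=43200
)["authorizationToken"]

index_url = (
    f"https://aws:{token}"
    f"@my_domain-111122223333.d.codeartifact.{region}.amazonaws.com"
    "/pypi/pypi-store/simple/"
)

# Pass the two job parameters as run arguments (job name and package are examples)
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={
        "--additional-python-modules": "awswrangler==3.4.0",
        "--python-modules-installer-option": f"--no-cache-dir --verbose --index-url {index_url}",
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;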

&lt;h3&gt;
  
  
  Step 3: Run your Glue Job
&lt;/h3&gt;

&lt;p&gt;After configuring all of that, run your Glue Job and check the CloudWatch Logs to confirm that the packages are being installed correctly. You should see a line there that says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Looking in indexes: https://aws:****@test-mirror-1234561234.d.codeartifact.ap-southeast-1.amazonaws.com/pypi/pypi-store/simple/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kindly make sure that the &lt;code&gt;IAM_ROLE&lt;/code&gt; that you are using for the Glue Jobs has access to &lt;code&gt;write&lt;/code&gt; to &lt;code&gt;CloudWatch Logs&lt;/code&gt; - engineers often forget this. Also, tick the &lt;code&gt;Enable logs in CloudWatch&lt;/code&gt; option on the Glue Job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap up
&lt;/h3&gt;

&lt;p&gt;That's it! In this article, we demonstrated how we can leverage CodeArtifact to manage Python packages and modules for AWS Glue jobs that run inside a Private Subnet with no internet access. &lt;/p&gt;

&lt;p&gt;Do let me know if you have any questions on this; I'm happy to answer any queries you might have.&lt;/p&gt;

&lt;p&gt;Happy Coding, builders! &lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog is authored solely by me and reflects my personal opinions, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>aws</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>SQL-based INSERTS, DELETES and UPSERTS in S3 using AWS Glue 3.0 and Delta Lake</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Mon, 23 Aug 2021 15:17:42 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0</link>
      <guid>https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0</guid>
      <description>&lt;p&gt;&lt;strong&gt;AWS NOW SUPPORTS DELTA LAKE ON GLUE NATIVELY. &lt;br&gt;
CHECK IT OUT HERE:&lt;/strong&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://aws.amazon.com/blogs/big-data/handle-upsert-data-operations-using-open-source-delta-lake-and-aws-glue/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd2908q01vomqb2.cloudfront.net%2Fb6692ea5df920cad691c20319a6fffd7a4a766b8%2F2023%2F01%2F30%2Fupsert-data-lake-glue.jpg" height="398" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://aws.amazon.com/blogs/big-data/handle-upsert-data-operations-using-open-source-delta-lake-and-aws-glue/" rel="noopener noreferrer" class="c-link"&gt;
          Handle UPSERT data operations using open-source Delta Lake and AWS Glue | AWS Big Data Blog
        &lt;/a&gt;
      &lt;/h2&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fa0.awsstatic.com%2Fmain%2Fimages%2Fsite%2Ffav%2Ffavicon.ico" width="16" height="16"&gt;
        aws.amazon.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The purpose of this blog post is to demonstrate how you can use &lt;code&gt;Spark SQL Engine&lt;/code&gt; to do &lt;code&gt;UPSERTS&lt;/code&gt;, &lt;code&gt;DELETES&lt;/code&gt;, and &lt;code&gt;INSERTS&lt;/code&gt;. Basically, updates.&lt;/p&gt;

&lt;p&gt;Earlier this month, I made a blog post about doing this via &lt;code&gt;PySpark&lt;/code&gt;.&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;div class="ltag__link__content"&gt;
    &lt;div class="missing"&gt;
      &lt;h2&gt;Article No Longer Available&lt;/h2&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;But what if we want to make it more &lt;strong&gt;simple&lt;/strong&gt; and &lt;strong&gt;familiar&lt;/strong&gt;?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This month, AWS released &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-aws-glue-3-0-with-optimized-apache-spark-3-1-runtime-for-faster-data-integration/" rel="noopener noreferrer"&gt;Glue version 3.0&lt;/a&gt;! &lt;strong&gt;AWS Glue 3.0&lt;/strong&gt; introduces a performance-optimized &lt;strong&gt;Apache Spark 3.1&lt;/strong&gt; runtime for batch and stream processing. The new engine speeds up data ingestion, processing, and integration, allowing you to hydrate your data lake and extract insights from your data more quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4em3re9n4r1igwxe52k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4em3re9n4r1igwxe52k.jpg" alt="aws-glue-3.0-updates" width="585" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqrb33r822am02ows6nn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqrb33r822am02ows6nn.jpg" alt="aws-glue-3.0-performance-improvements" width="600" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But, what's the big deal with this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, aside from a &lt;strong&gt;lot&lt;/strong&gt; of general performance improvements of the Spark Engine, it can now also support the latest versions of &lt;a href="https://github.com/delta-io/delta/releases" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;. The most notable one is the &lt;a href="https://databricks.com/blog/2020/08/27/enabling-spark-sql-ddl-and-dml-in-delta-lake-on-apache-spark-3-0.html#toc-3" rel="noopener noreferrer"&gt;Support for SQL Insert, Delete, Update and Merge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don't know what &lt;strong&gt;Delta Lake&lt;/strong&gt; is, you can check out my blog post that I referenced above to have a general idea of what it is.&lt;/p&gt;

&lt;p&gt;Let's proceed with the demo!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F5YTFe5djWgq0o%2Fgiphy.gif%3Fcid%3Decf05e47czfo46hm6rpy1j4milotn5w9yzslxtccccx6y3r0%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F5YTFe5djWgq0o%2Fgiphy.gif%3Fcid%3Decf05e47czfo46hm6rpy1j4milotn5w9yzslxtccccx6y3r0%26rid%3Dgiphy.gif%26ct%3Dg" width="260" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Format to Delta&lt;/li&gt;
&lt;li&gt;Upsert&lt;/li&gt;
&lt;li&gt;Delete&lt;/li&gt;
&lt;li&gt;Insert&lt;/li&gt;
&lt;li&gt;Partitioned Data&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✅ Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2inq5fmmbqogt8wffjf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2inq5fmmbqogt8wffjf1.png" alt="kyle-escosia-aws-glue-delta-lake-diagram" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is basically a simple process flow of what we'll be doing. We take a &lt;code&gt;sample csv&lt;/code&gt; file, load it into an &lt;code&gt;S3 Bucket&lt;/code&gt; then process it using &lt;code&gt;Glue&lt;/code&gt;. (OPTIONAL) Then you can connect it into your favorite BI tool (I'll leave it up to you) and start visualizing your updated data.&lt;/p&gt;

&lt;h2&gt;
  
  
  ❗ Pre-requisites
&lt;/h2&gt;

&lt;p&gt;But, before we get to that, we need to do some pre-work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download the Delta Lake package &lt;a href="https://mvnrepository.com/artifact/io.delta/delta-core_2.12/1.0.0" rel="noopener noreferrer"&gt;here&lt;/a&gt; - &lt;em&gt;a bit hard to spot, but look for the &lt;code&gt;Files&lt;/code&gt; in the table and click on the &lt;code&gt;jar&lt;/code&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;An AWS Account - ❗ &lt;strong&gt;Glue ETL is not included in the free tier&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Download the sample data &lt;a href="https://github.com/klescosia/aws-glue-delta-lake/tree/main/data" rel="noopener noreferrer"&gt;here&lt;/a&gt; - you can use your own, but I'll be using this one&lt;/li&gt;
&lt;li&gt;Code can be found in my &lt;a href="https://github.com/klescosia/aws-glue-delta-lake" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✅ Format to Delta Table
&lt;/h2&gt;

&lt;p&gt;First things first, we need to convert each of our datasets into Delta format. Below is the code for doing this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Import the packages
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark Session along with configs for Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-aws-glue-demo/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as a DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;updatesDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-aws-glue-demo/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as a DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;updatesDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/updates_delta/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL, UNCOMMENT IF YOU WANT TO VIEW ALSO THE DATA FOR UPDATES IN ATHENA
###
# Generate MANIFEST file for Updates
# updatesDeltaTable = DeltaTable.forPath(spark, "s3a://delta-lake-aws-glue-demo/updates_delta/")
# updatesDeltaTable.generate("symlink_format_manifest")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code converts our dataset into &lt;code&gt;delta&lt;/code&gt; format. This is done for both our source data and the updates.&lt;/p&gt;

&lt;p&gt;After generating the &lt;code&gt;SYMLINK MANIFEST&lt;/code&gt; file, we can view it via Athena. &lt;em&gt;SQL code is also included in the repository&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpyc4kt35cqfrtwhi626.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpyc4kt35cqfrtwhi626.PNG" alt="athena-sample-data" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🔀 Upserts
&lt;/h2&gt;

&lt;p&gt;Upsert is defined as an operation that &lt;code&gt;inserts&lt;/code&gt; rows into a database table if they &lt;code&gt;do not already exist&lt;/code&gt;, or &lt;code&gt;updates&lt;/code&gt; them if they do.&lt;/p&gt;

&lt;p&gt;In this example, we'll be updating the value for a couple of rows on &lt;code&gt;ship_mode&lt;/code&gt;, &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;sales&lt;/code&gt;, and &lt;code&gt;profit&lt;/code&gt;. &lt;del&gt;I just did a random character spam and I didn't think it through 😅.&lt;/del&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Import as always
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark Session along with configs for Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;updateDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;

MERGE INTO delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore
USING delta.`s3a://delta-lake-aws-glue-demo/updates_delta/` as updates
ON superstore.row_id = updates.row_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL
## SQL-BASED GENERATION OF SYMLINK
&lt;/span&gt;
&lt;span class="c1"&gt;# spark.sql("""
# GENERATE symlink_format_manifest 
# FOR TABLE delta.`s3a://delta-lake-aws-glue-demo/current/`
# """)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL code above &lt;code&gt;updates&lt;/code&gt; the current table with the rows found in the updates table, matching on the &lt;code&gt;row_id&lt;/code&gt;. It then proceeds to evaluate the condition that, &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If &lt;code&gt;row_id&lt;/code&gt; is matched, then &lt;code&gt;UPDATE ALL&lt;/code&gt; the data. If not, then do an &lt;code&gt;INSERT ALL&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want to check out the &lt;strong&gt;full operation semantics&lt;/strong&gt; of &lt;code&gt;MERGE&lt;/code&gt; you can read through &lt;a href="https://docs.delta.io/latest/delta-update.html#operation-semantics" rel="noopener noreferrer"&gt;this&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, we update the &lt;code&gt;MANIFEST&lt;/code&gt; file again. Note that generation of the &lt;code&gt;MANIFEST&lt;/code&gt; file can be set to update &lt;em&gt;automatically&lt;/em&gt; by running the query below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`&amp;lt;path-to-delta-table&amp;gt;`&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compatibility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;symlinkFormatManifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More information can be found &lt;a href="https://docs.delta.io/latest/presto-integration.html#step-3-update-manifests" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should now see your updated table in Athena.&lt;/p&gt;

&lt;h2&gt;
  
  
  ❌ Deletes
&lt;/h2&gt;

&lt;p&gt;Deletes via Delta Lake are very straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;deleteDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
DELETE 
FROM delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore 
WHERE CAST(superstore.row_id as integer) &amp;lt;= 20
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL
## SQL-BASED GENERATION OF SYMLINK MANIFEST
&lt;/span&gt;
&lt;span class="c1"&gt;# spark.sql("""
&lt;/span&gt;
&lt;span class="c1"&gt;# GENERATE symlink_format_manifest 
# FOR TABLE delta.`s3a://delta-lake-aws-glue-demo/current/`
&lt;/span&gt;
&lt;span class="c1"&gt;# """)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation does a simple delete based on the &lt;code&gt;row_id&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt; 
&lt;span class="c1"&gt;-- Need to CAST hehe bec it is currently a STRING&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pm4a9w651ne1623k34p.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pm4a9w651ne1623k34p.PNG" alt="aws-athena-delete" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ⤴ Inserts
&lt;/h2&gt;

&lt;p&gt;Like Deletes, Inserts are also very straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;insertDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
INSERT INTO delta.`s3a://delta-lake-aws-glue-demo/current/`
SELECT *
FROM delta.`s3a://delta-lake-aws-glue-demo/updates_delta/`
WHERE CAST(row_id as integer) &amp;lt;= 20
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-aws-glue-demo/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;### OPTIONAL
## SQL-BASED GENERATION OF SYMLINK MANIFEST
&lt;/span&gt;
&lt;span class="c1"&gt;# spark.sql("""
&lt;/span&gt;
&lt;span class="c1"&gt;# GENERATE symlink_format_manifest 
# FOR TABLE delta.`s3a://delta-lake-aws-glue-demo/current/`
&lt;/span&gt;
&lt;span class="c1"&gt;# """)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ❗ Partitioned Data
&lt;/h2&gt;

&lt;p&gt;We've done Upsert, Delete, and Insert operations on a simple dataset. But that &lt;strong&gt;rarely&lt;/strong&gt; happens in real life. So what if we spice things up and do the same on partitioned data? &lt;/p&gt;

&lt;p&gt;I went ahead and created a &lt;code&gt;partitioned&lt;/code&gt; version of this dataset via Spark, using the &lt;code&gt;order_date&lt;/code&gt; as the &lt;strong&gt;partition key&lt;/strong&gt;. The S3 structure looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6m9ykdzzuzjl7cdr0g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6m9ykdzzuzjl7cdr0g.PNG" alt="s3-partitioned-data" width="641" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❗ What do you think?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is: &lt;strong&gt;YES!&lt;/strong&gt; You can also do this on partitioned data.&lt;/p&gt;

&lt;p&gt;The concept of Delta Lake is based on &lt;code&gt;log history&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Delta Lake will generate delta logs for each committed transaction. &lt;/p&gt;

&lt;p&gt;Delta logs will have delta files stored as &lt;code&gt;JSON&lt;/code&gt;, which contain &lt;strong&gt;information about the operations that occurred&lt;/strong&gt;, details about the latest snapshot of the table, and statistics about the data. &lt;/p&gt;

&lt;p&gt;Delta files are &lt;code&gt;JSON&lt;/code&gt; files with sequentially &lt;strong&gt;increasing&lt;/strong&gt; names, and together they make up the &lt;strong&gt;log of all changes&lt;/strong&gt; that have occurred to a table.&lt;/p&gt;

&lt;p&gt;-from &lt;a href="https://datafloq.com/read/understand-the-fundamentals-of-delta-lake-concept/7610#:~:text=How%20does%20it%20works%3F,existing%20cloud%20storage%20data%20lake.&amp;amp;text=When%20you%20store%20data%20as,logs%20for%20each%20committed%20transactions." rel="noopener noreferrer"&gt;Data Floq&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see this in the examples below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;raw date_part=2014-08-27/&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92e3r65cy2sydrk2bnn3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92e3r65cy2sydrk2bnn3.PNG" alt="raw-partitioned" width="671" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;current date_part=2014-08-27/&lt;/strong&gt; - &lt;code&gt;DELETED ROWS&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi081l8rb7l2x1c7jxx24.PNG" alt="current-partitioned" width="664" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we open the &lt;code&gt;parquet&lt;/code&gt; file:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84h396lvhqgpyx4uexmn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84h396lvhqgpyx4uexmn.PNG" alt="updated-data" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the examples above, we can see that our code wrote a new &lt;code&gt;parquet&lt;/code&gt; file during the delete, &lt;code&gt;excluding&lt;/code&gt; the rows filtered out by our &lt;code&gt;delete&lt;/code&gt; operation. The &lt;code&gt;JSON&lt;/code&gt; log file then maps the table to the newly generated &lt;code&gt;parquet&lt;/code&gt;.&lt;/p&gt;
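&lt;p&gt;If you're curious, you can peek at those &lt;code&gt;JSON&lt;/code&gt; commit files yourself. This is purely for exploration and not something the jobs above need; it assumes the same table path used earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each numbered JSON file under _delta_log/ is one committed transaction.
# Its add / remove / commitInfo entries describe which parquet files the commit
# added or logically removed, and what operation produced it.
log_df = spark.read.json("s3a://delta-lake-aws-glue-demo/current/_delta_log/*.json")
log_df.printSchema()
log_df.show(truncate=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;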

&lt;p&gt;Additionally, in &lt;strong&gt;Athena&lt;/strong&gt;, if your table is partitioned, you need to declare the partition in the &lt;code&gt;CREATE TABLE&lt;/code&gt; statement when you create the schema&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;superstore&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
    &lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ship_mode&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;postal_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_part&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;

&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Add PARTITIONED BY option&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_part&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'&lt;/span&gt; 
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt; 
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://delta-lake-aws-glue-demo/current/_symlink_format_manifest/'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run an &lt;code&gt;MSCK REPAIR TABLE &amp;lt;table&amp;gt;&lt;/code&gt; (for this example, &lt;code&gt;MSCK REPAIR TABLE superstore&lt;/code&gt;) to &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html" rel="noopener noreferrer"&gt;add the partitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don't do these steps, you'll get an error.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2nhocc3hgfbxradts0l.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2nhocc3hgfbxradts0l.PNG" alt="partition-error" width="461" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FdsKnRuALlWsZG%2Fgiphy.gif%3Fcid%3Decf05e4707uvnmsa4eaa2si3o9bzrvpstd2vqte2mmi8c3b9%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FdsKnRuALlWsZG%2Fgiphy.gif%3Fcid%3Decf05e4707uvnmsa4eaa2si3o9bzrvpstd2vqte2mmi8c3b9%26rid%3Dgiphy.gif%26ct%3Dg" width="500" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! It's a great time to be a SQL Developer! Thank you for reading through! I hope you learned something new in this post.&lt;/p&gt;

&lt;p&gt;Have you tried Delta Lake? What tips, tricks, and best practices can you share with the community? I'd love to hear your thoughts in the comments below!&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
    <item>
      <title>UPSERTS and DELETES using AWS Glue and Delta Lake</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Wed, 21 Jul 2021 03:45:42 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/making-your-data-lake-acid-compliant-using-aws-glue-and-delta-lake-gk9</link>
      <guid>https://dev.to/awscommunity-asean/making-your-data-lake-acid-compliant-using-aws-glue-and-delta-lake-gk9</guid>
      <description>&lt;p&gt;The purpose of this blog post is to demonstrate how you can enable your Data Lake to be ACID-compliant, that is, having the same functionality as a database. This will allow you to do UPSERTS and DELETES directly to your data lake&lt;/p&gt;

&lt;p&gt;Let me start first by defining what a Data Lake is:&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data lake is a &lt;strong&gt;centralized repository&lt;/strong&gt; that allows you to store all your structured and unstructured data at any scale. You can &lt;strong&gt;store your data as-is&lt;/strong&gt;, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to &lt;strong&gt;big data processing&lt;/strong&gt;, real-time analytics, and machine learning to guide better decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Data Lake
&lt;/h2&gt;

&lt;p&gt;A data lake is scalable, performant, secure, and cost-efficient, and it plays a crucial part in an organization's Data Analytics pipeline. So what's the problem?&lt;/p&gt;

&lt;p&gt;Well, &lt;strong&gt;updates&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We all know that data lakes are &lt;a href="https://www.ctl.io/developers/blog/post/immutability" rel="noopener noreferrer"&gt;immutable&lt;/a&gt; - &lt;em&gt;the idea that data or objects should not be modified after they are created&lt;/em&gt;; how do we then go beyond that immutability? &lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake
&lt;/h2&gt;

&lt;p&gt;The answer is &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An &lt;a href="https://github.com/delta-io/delta" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads. It provides serializability, the strongest level of isolation, plus Scalable Metadata Handling and Time Travel, and is 100% compatible with Apache Spark APIs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Basically, it allows you to do DELETES and UPSERTS directly to your data lake. &lt;/p&gt;

&lt;h2&gt;
  
  
  How Spark Fails ACID
&lt;/h2&gt;

&lt;p&gt;We all know our beloved Spark doesn't support ACID transactions, but to be fair, it isn't really built to address that kind of specific use case.&lt;/p&gt;

&lt;p&gt;I came across a blog post from &lt;a href="https://blog.knoldus.com/spark-acid-compliant-or-not/" rel="noopener noreferrer"&gt;kundankumarr&lt;/a&gt;, explaining how Spark fails ACID. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A&lt;/strong&gt;tomicity &amp;amp; &lt;strong&gt;C&lt;/strong&gt;onsistency&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt; states that it should either write full data or nothing to the data source when using spark data frame writer. &lt;strong&gt;Consistency&lt;/strong&gt;, on the other hand, ensures that the data is always in a valid state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;I&lt;/strong&gt;solation &amp;amp; &lt;strong&gt;D&lt;/strong&gt;urability&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We know that when a transaction is in process and not yet committed, it must remain isolated from any other transaction. This is called &lt;strong&gt;Isolation&lt;/strong&gt; Property. It means writing to a data set shouldn’t impact another concurrent read/write on the same data set.&lt;/p&gt;

&lt;p&gt;Finally, &lt;strong&gt;Durability&lt;/strong&gt;. It is the ACID property which guarantees that transactions that have committed will survive permanently. However, when Spark doesn’t correctly implement the commit, then all the durability features offered by the storage goes for a toss.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  AWS Glue and Delta Lake
&lt;/h2&gt;

&lt;p&gt;This part demonstrates how you can use Delta Lake with AWS Glue. &lt;/p&gt;

&lt;p&gt;These are the services that will be used in this exercise:&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Glue
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Amazon Athena
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Amazon S3
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;an object storage service that offers industry-leading scalability, data availability, security, and performance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what we'll be doing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr0m67akjycfnyrffza5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr0m67akjycfnyrffza5.PNG" alt="kyle-escosia-process-flow" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, I have an &lt;code&gt;initial data&lt;/code&gt; set, then I want to apply changes to the &lt;code&gt;Sales&lt;/code&gt; and &lt;code&gt;Profit&lt;/code&gt; columns. The table in the &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; should then be able to capture those changes. Just a basic update to the data.&lt;/p&gt;

&lt;p&gt;So, let's start!&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites
&lt;/h3&gt;

&lt;p&gt;First, download the data &lt;a href="https://www.kaggle.com/bravehart101/sample-supermarket-dataset" rel="noopener noreferrer"&gt;here&lt;/a&gt; - I used Tableau's Superstore Dataset, this one is on Kaggle, you may need to register for an account to download.&lt;/p&gt;

&lt;p&gt;Then, you need to download the Delta Lake &lt;code&gt;.jar&lt;/code&gt; file to access its libraries. You can download it &lt;a href="https://search.maven.org/artifact/io.delta/delta-core_2.11/0.6.1/jar" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Upload it to your S3 Bucket and take note of the S3 path; we'll use it as a reference later. &lt;/p&gt;

&lt;p&gt;❗ &lt;em&gt;As of this writing, Glue's Spark Engine (v2.4) only supports v0.6.1 of Delta Lake, since later versions require Spark 3.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;❗❗❗ &lt;em&gt;UPDATE: AWS GLUE 3.0 WAS RELEASED IN AUGUST 2021! Check out my blog post on this one:&lt;/em&gt; ❗❗❗&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/awscommunity-asean" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3974%2F95700370-548c-431b-8ed5-cce70f477aed.png" alt="AWS Community ASEAN" width="800" height="800"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624370%2F9bc3b695-a2e8-4b55-9ce6-4af478526869.jpg" alt="" width="800" height="805"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;SQL-based INSERTS, DELETES and UPSERTS in S3 using AWS Glue 3.0 and Delta Lake&lt;/h2&gt;
      &lt;h3&gt;Kyle Escosia for AWS Community ASEAN ・ Aug 23 '21&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#bigdata&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  AWS Glue
&lt;/h3&gt;

&lt;p&gt;Navigate to AWS Glue, then proceed to create an ETL Job. Set &lt;code&gt;This job runs&lt;/code&gt; to &lt;code&gt;A new script to be authored by you&lt;/code&gt;. This will allow you to write custom Spark code.&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;Security configuration, script libraries, and job parameters (optional)&lt;/code&gt;, specify the location where you stored the &lt;code&gt;.jar&lt;/code&gt; file, as shown below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flab1rujbugkn6wmo7ltl.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flab1rujbugkn6wmo7ltl.PNG" alt="etl-job" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, on the blank script page, paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This imports the &lt;code&gt;SparkSession&lt;/code&gt; class as well as the Delta Lake libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize Spark Session with Delta Lake
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code initializes the &lt;code&gt;SparkSession&lt;/code&gt; along with the Delta Lake configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We read the source CSV file into a Spark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write data as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we write it out in &lt;code&gt;Delta&lt;/code&gt; format. &lt;/p&gt;

&lt;p&gt;❗ Notice the use of the &lt;code&gt;s3a&lt;/code&gt; prefix in the save path; it is essential to use &lt;code&gt;s3a&lt;/code&gt; instead of the standard &lt;code&gt;s3&lt;/code&gt;. Using the &lt;code&gt;s3&lt;/code&gt; prefix will throw an &lt;code&gt;UnsupportedFileSystemException&lt;/code&gt; error, followed by &lt;code&gt;fs.AbstractFileSystem.s3.impl=null: No AbstractFileSystem configured for scheme: s3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;More on the differences between &lt;code&gt;s3&lt;/code&gt; and &lt;code&gt;s3a&lt;/code&gt; &lt;a href="https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Athena supports reading from external tables using a &lt;code&gt;manifest&lt;/code&gt; file, which is a text file containing the list of data files to read for querying a table. Running the above code will generate a &lt;code&gt;manifest&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Read more about Delta Lake's integration for Presto and Athena &lt;a href="https://docs.delta.io/0.6.1/presto-integration.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Final Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Read Source
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write data as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate MANIFEST file for Athena/Catalog
&lt;/span&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Amazon Athena
&lt;/h3&gt;

&lt;p&gt;In your S3 bucket, you should see a &lt;code&gt;_symlink_format_manifest&lt;/code&gt; prefix/folder. This will be used by Amazon Athena for mapping out the parquet files.&lt;/p&gt;

&lt;p&gt;Create your table using the code below as a reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
&lt;span class="n"&gt;row_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;ship_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;ship_mode&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;postal_code&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sub_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;discount&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;profit&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'&lt;/span&gt; 

&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUTFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://delta-lake-ia-test/current/_symlink_format_manifest/'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;del&gt;I'm lazy.&lt;/del&gt; I've made things simple by using &lt;code&gt;STRING&lt;/code&gt; for all columns.&lt;/p&gt;

&lt;p&gt;Note that you have to define the &lt;strong&gt;table name&lt;/strong&gt; as it is when you wrote it as a &lt;strong&gt;delta table&lt;/strong&gt;, or else you'll get blank results when querying with Athena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run a simple select&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"superstore"&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypbn4wr5tmr2ibiztepx.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypbn4wr5tmr2ibiztepx.PNG" alt="kyle-escosia-athena-sample-resultset" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read from a CSV&lt;/li&gt;
&lt;li&gt;Created a Spark DataFrame from the CSV&lt;/li&gt;
&lt;li&gt;Written the DataFrame as a Delta Table&lt;/li&gt;
&lt;li&gt;Made a manifest file&lt;/li&gt;
&lt;li&gt;Created an external table in Athena&lt;/li&gt;
&lt;li&gt;Query sample data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we'll do next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the updates from the CSV&lt;/li&gt;
&lt;li&gt;Make an update based on the new files&lt;/li&gt;
&lt;li&gt;Generate/update the manifest file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's add another Glue ETL Job for the updates.&lt;/p&gt;

&lt;p&gt;I manually modified my raw data to simulate the updates; I just plugged in &lt;code&gt;99999&lt;/code&gt; values for the sales and profit of the first 15 rows. Feel free to make your own modifications. &lt;/p&gt;

&lt;p&gt;After which, upload it to your S3 Bucket in a different location.&lt;br&gt;
&lt;/p&gt;
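&lt;p&gt;If you'd rather script the modification instead of editing the file by hand, here is a minimal PySpark sketch of one way to do it. It reuses the &lt;code&gt;raw/&lt;/code&gt; and &lt;code&gt;updates/&lt;/code&gt; paths from the jobs in this post, but the exact approach is up to you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import col, lit, when
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw CSV and plug 99999 into sales/profit for the first 15 rows
raw_df = spark.read.format("csv").option("header", "true").load("s3://delta-lake-ia-test/raw/")

updates_df = raw_df \
    .withColumn("sales", when(col("row_id").cast("integer") &amp;lt;= 15, lit("99999")).otherwise(col("sales"))) \
    .withColumn("profit", when(col("row_id").cast("integer") &amp;lt;= 15, lit("99999")).otherwise(col("profit")))

# Write the simulated updates to the location the update job reads from
updates_df.write.format("csv").option("header", "true").mode("overwrite").save("s3://delta-lake-ia-test/updates/")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;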

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing new here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read updates
&lt;/span&gt;&lt;span class="n"&gt;df_updates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read current as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line is a typical CSV read.&lt;/p&gt;

&lt;p&gt;The next line creates a &lt;code&gt;DeltaTable&lt;/code&gt; object, which allows us to call functions in the delta package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# UPSERT process
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_updates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df.row_id = full_df.row_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;\
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of these is the &lt;code&gt;merge(source, condition)&lt;/code&gt; function, which:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Merges the data from the &lt;em&gt;source&lt;/em&gt; DataFrame based on the given merge &lt;em&gt;condition&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, we take the &lt;code&gt;DeltaTable&lt;/code&gt; object and give it an alias. We then call the &lt;code&gt;merge()&lt;/code&gt; function, supplying its parameters with our arguments, which, in this case, are the updates &lt;code&gt;DataFrame&lt;/code&gt; and the merge condition.&lt;/p&gt;

&lt;p&gt;Then, we call the &lt;code&gt;whenMatchedUpdateAll(condition=None)&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Updates all the columns of the matched table row with the values of the corresponding columns in the source row. If a &lt;code&gt;condition&lt;/code&gt; is specified, then it must be true for the new row to be updated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to have the code update all the columns.&lt;/p&gt;

&lt;p&gt;If the condition specified in the &lt;code&gt;merge()&lt;/code&gt; function doesn't match, then we do a &lt;code&gt;whenNotMatchedInsertAll(condition=None)&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Insert a new target Delta table row by assigning the target columns to the values of the corresponding columns in the source row. If a &lt;code&gt;condition&lt;/code&gt; is specified, then it must evaluate to true for the new row to be inserted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lastly, we call the &lt;code&gt;execute()&lt;/code&gt; function to wrap it all up:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Execute the merge operation based on the built matched and not matched actions.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate new MANIFEST file
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we update the &lt;code&gt;manifest&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;For more functions in the library, kindly refer to the &lt;a href="https://docs.delta.io/0.6.1/api/python/index.html" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;&lt;/p&gt;
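&lt;p&gt;For instance, besides &lt;code&gt;merge()&lt;/code&gt;, the same &lt;code&gt;DeltaTable&lt;/code&gt; handle can also show the table's commit history, which is handy for checking that the upsert actually landed. A small aside, not required by the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from delta import *
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Show the table's commit history (operation, timestamp, etc.)
deltaTable = DeltaTable.forPath(spark, "s3a://delta-lake-ia-test/current/")
deltaTable.history().show(truncate=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;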

&lt;p&gt;Final code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Read updates
&lt;/span&gt;&lt;span class="n"&gt;df_updates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://delta-lake-ia-test/updates/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read current as DELTA TABLE
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# UPSERT process
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_updates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append_df.row_id = full_df.row_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;\
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate new MANIFEST file
&lt;/span&gt;&lt;span class="n"&gt;final_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://delta-lake-ia-test/current/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symlink_format_manifest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, try querying your updated table in Athena. It should show the most up-to-date data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7fljgeli4qrnznai2ij.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7fljgeli4qrnznai2ij.PNG" alt="kyle-escosia-athena-updated-dataset" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion and Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This blog post demonstrated how you can leverage ACID transactions in your data lake.&lt;/p&gt;

&lt;p&gt;Having functionality like this is helpful, especially if you have requirements such as Change Data Capture (CDC). I'm curious to know how you've implemented such things. Let me know in the comments! &lt;/p&gt;

&lt;p&gt;I've read a couple of articles and blog posts on whether a data lake should be &lt;code&gt;immutable&lt;/code&gt; or not.&lt;/p&gt;

&lt;p&gt;From &lt;em&gt;O'Reilly&lt;/em&gt;, &lt;a href="https://www.oreilly.com/library/view/data-lake-for/9781787281349/91414449-b6a4-463a-bdf8-ac578047e7ff.xhtml" rel="noopener noreferrer"&gt;Data Lake for Enterprises by Tomcy John, Pankaj Misra&lt;/a&gt; on the topic of &lt;strong&gt;Immutable Data Principle&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The data should be stored in a raw format from the different source systems. More importantly, the data stored should be immutable in nature. &lt;/p&gt;

&lt;p&gt;By making it immutable, it inherently takes care of human fault tolerance to at least some extent and takes away errors with regards to data loss and corruption. It allows data to be selected, inserted, and not updated or deleted. &lt;/p&gt;

&lt;p&gt;To cater to fundamental fast processing/performance, the data is usually stored in a denormalized fashion. Data being immutable makes the system in general simpler and more manageable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://www.sqlservercentral.com/editorials/should-the-data-lake-be-immutable" rel="noopener noreferrer"&gt;SQLServerCentral&lt;/a&gt; by Steve Jones on the topic of &lt;em&gt;Should the Data Lake be Immutable?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine I had a large set of data, say GBs in a file, would I want to download this and change a few values before uploading it again? Do we want a large ETL load process to repeat? &lt;/p&gt;

&lt;p&gt;Could we repeat the process and reload a file again? I don't think so, but it's hard to decide. After all, the lake isn't the source of data; that is some other system.&lt;/p&gt;

&lt;p&gt;Maybe that's the simplest solution, and one that reduces complexity, downtime, or anything else that might be involved with locking and changing a file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Comments on the topic can be found &lt;a href="https://www.sqlservercentral.com/forums/topic/should-the-data-lake-be-immutable" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is a comment from roger.plowman:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I suspect immutability should be asked after asking if you should even have the data lake or warehouse in the first place.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What do you guys think? Would love to hear your thoughts!&lt;/p&gt;

&lt;p&gt;Speaking of Data Lakes vs Data Warehouses, there's also a very interesting concept I picked up from one of the AWS Community Builders (&lt;a href="https://dev.to/jol_farvault_72301b8e349"&gt;Joel Farvault&lt;/a&gt;) called the &lt;a href="https://martinfowler.com/articles/data-monolith-to-mesh.html" rel="noopener noreferrer"&gt;Data Mesh Architecture&lt;/a&gt;, which is described as the &lt;em&gt;next enterprise data platform architecture&lt;/em&gt;. I'll leave it up to you to read more about it. AWS also made a &lt;a href="https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; on this using AWS Lake Formation.&lt;/p&gt;

&lt;p&gt;There is also a YouTube video from the AWS DevDay Data &amp;amp; Analytics held on July 14, 2021, where AWS Technical Evangelists &lt;a href="https://aws.amazon.com/developer/community/evangelists/javier-ramirez/" rel="noopener noreferrer"&gt;Javier Ramirez&lt;/a&gt; and &lt;a href="https://aws.amazon.com/developer/community/evangelists/ricardo-sueiras/" rel="noopener noreferrer"&gt;Ricardo Sueiras&lt;/a&gt; discuss the &lt;a href="https://youtu.be/l3sV9YxIcSo?t=5558" rel="noopener noreferrer"&gt;Data Mesh Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, here is a video about &lt;a href="https://www.youtube.com/watch?v=eiUhV56uVUc&amp;amp;t=3s" rel="noopener noreferrer"&gt;Data Mesh in practice&lt;/a&gt; at Europe's biggest online fashion retailer. &lt;/p&gt;

&lt;p&gt;Hope this helps! Let me know if you have questions below.&lt;/p&gt;




&lt;p&gt;Happy coding!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.S. The CloudFormation stack is still ongoing&lt;/em&gt; ⚙&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>bigdata</category>
      <category>analytics</category>
    </item>
    <item>
      <title>I passed the AWS Data Analytics Specialty Exam! (DAS-C01)</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Sat, 10 Jul 2021 04:54:25 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/i-passed-the-aws-data-analytics-specialty-exam-das-c01-3a83</link>
      <guid>https://dev.to/awscommunity-asean/i-passed-the-aws-data-analytics-specialty-exam-das-c01-3a83</guid>
      <description>&lt;p&gt;While still fresh from my memory, I will share tips on how to pass the Data Analytics Specialty Exam! This exam tests your ability to &lt;strong&gt;design&lt;/strong&gt;, &lt;strong&gt;build&lt;/strong&gt;, &lt;strong&gt;secure&lt;/strong&gt;, and &lt;strong&gt;maintain&lt;/strong&gt; analytics solutions on AWS that are &lt;strong&gt;efficient&lt;/strong&gt;, &lt;strong&gt;cost-effective&lt;/strong&gt;, and &lt;strong&gt;secure&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This exam is definitely challenging and very detailed. I was lucky enough to be involved in various data lake projects using some of these AWS services, so I had hands-on experience. But hey, anything can be learned, right? &lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Pre-requisites
&lt;/h2&gt;

&lt;p&gt;I would definitely recommend taking the &lt;a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/" rel="noopener noreferrer"&gt;AWS Solutions Architect Associate Exam&lt;/a&gt; first before this one, as it will help you get an overview of the principles of the AWS Cloud and make it much easier for you to visualize how AWS services work together.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Content Outline
&lt;/h2&gt;

&lt;p&gt;Below are the AWS Services covered:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmd9npgwwf84qihlbmzol.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmd9npgwwf84qihlbmzol.PNG" alt="aws-services-data-analytics-exam" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exam will test you on different domains:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8o1mxsa4bcp2inhyvo5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8o1mxsa4bcp2inhyvo5.PNG" alt="aws-data-analytics-content-outline" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the full Exam Guide &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-data-analytics-specialty/AWS-Certified-Data-Analytics-Specialty_Exam-Guide.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Tips in General (study materials, practice exams, etc.)
&lt;/h2&gt;

&lt;p&gt;Read the questions carefully; &lt;strong&gt;190 minutes&lt;/strong&gt; is a long time and will give you enough room to properly analyze your answers.&lt;/p&gt;

&lt;p&gt;Look for keywords in the question, choose the most appropriate service for that keyword, and try to build and visualize the solution in your mind or on the whiteboard.&lt;/p&gt;

&lt;p&gt;Abuse the Review button! If you are unsure of your answer, don't spend too much time on it; just proceed to the next one.&lt;/p&gt;

&lt;p&gt;Pre-exam, make sure you get a good night's sleep. Tbh, I was anxious going into the exam, so I drank a lot of coffee and ate chocolates lol. But as soon as I answered the first few questions, I gained confidence and proceeded with it! I guess it worked out pretty well?? 😂😂&lt;/p&gt;

&lt;p&gt;Again, this shouldn't be your first exam. AWS recommends having at least an associate-level certification first.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Practice Exam
&lt;/h2&gt;

&lt;p&gt;I personally recommend &lt;strong&gt;Sir Jon Bonso's&lt;/strong&gt; practice exams, which are available on Udemy and on &lt;a href="https://portal.tutorialsdojo.com/courses/aws-certified-data-analytics-specialty-practice-exams/" rel="noopener noreferrer"&gt;Tutorials Dojo&lt;/a&gt;. These are, in my opinion, the &lt;strong&gt;CLOSEST&lt;/strong&gt; to the actual exam; I even got an exact scenario from the practice exam while taking the real one. So try your best to score high on these! Their explanations are also super helpful!&lt;/p&gt;

&lt;p&gt;Additionally, their &lt;a href="https://tutorialsdojo.com/" rel="noopener noreferrer"&gt;website&lt;/a&gt; hosts the AWS Service Cheat Sheets, so be sure to read through those!&lt;/p&gt;

&lt;h2&gt;
  
  
  Study Materials 📘🎥
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stephane Maarek&lt;/strong&gt; and &lt;strong&gt;Frank Kane's&lt;/strong&gt; &lt;a href="https://www.udemy.com/course/aws-data-analytics/" rel="noopener noreferrer"&gt;Udemy course&lt;/a&gt; is very insightful; be generous with your time on these study materials as they will greatly help you in the exam. They also have a practice exam.&lt;/p&gt;

&lt;p&gt;I'd also like to recommend &lt;a href="https://cloudacademy.com/" rel="noopener noreferrer"&gt;CloudAcademy&lt;/a&gt; as a learning portal. They offer full courses for your certifications, including hands-on labs (using their service account)! So if you want to experience using a service without using your own account, definitely check them out!&lt;/p&gt;

&lt;p&gt;🎥 re:Invent videos are also very helpful; these are the hidden gems from the re:Invent sessions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/lj8oaSpCFTc" rel="noopener noreferrer"&gt;Deep dive and best practices for Amazon Redshift&lt;/a&gt; - this helped me have a detailed understanding of what Redshift is&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/S_xeHvP7uMo" rel="noopener noreferrer"&gt;Building Serverless Analytics Pipelines with AWS Glue&lt;/a&gt; - a deep dive on Glue components&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/pT5lAYTCYJ4" rel="noopener noreferrer"&gt;Serverless data preparation with AWS Glue&lt;/a&gt; - a more recent one, this discusses the updates that were rolled out in AWS Glue&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/jKPlGznbfZ0" rel="noopener noreferrer"&gt;High Performance Data Streaming with Amazon Kinesis: Best Practices&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/DIQVJqiSUkE" rel="noopener noreferrer"&gt;Data modeling with Amazon DynamoDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/tzoXRRCVmIQ" rel="noopener noreferrer"&gt;Deep dive into Amazon Athena&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/ovPheIbY7U8" rel="noopener noreferrer"&gt;Big Data Analytics Architectural Patterns &amp;amp; Best Practices&lt;/a&gt; - this is a great discussion on how AWS Services integrate with each other&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://youtu.be/Aj5T5fcZZr0" rel="noopener noreferrer"&gt;Deep Dive Into AWS Lake Formation&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More AWS Videos can be found here at &lt;a href="https://awsstash.com/" rel="noopener noreferrer"&gt;AWS Stash&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Tips by Service
&lt;/h1&gt;

&lt;p&gt;In this section, I will give tips on what to study for each service; below is an outline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collect
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Kinesis&lt;/li&gt;
&lt;li&gt;AWS Database Migration Service (DMS)&lt;/li&gt;
&lt;li&gt;Amazon Simple Queue Service (SQS)&lt;/li&gt;
&lt;li&gt;AWS Snowball&lt;/li&gt;
&lt;li&gt;AWS IoT&lt;/li&gt;
&lt;li&gt;AWS Managed Streaming for Kafka (MSK)&lt;/li&gt;
&lt;li&gt;AWS Direct Connect&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storage
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB&lt;/li&gt;
&lt;li&gt;Amazon Elasticache&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Processing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;AWS Glue&lt;/li&gt;
&lt;li&gt;Amazon EMR&lt;/li&gt;
&lt;li&gt;AWS Lake Formation&lt;/li&gt;
&lt;li&gt;AWS Step Functions&lt;/li&gt;
&lt;li&gt;AWS Data Pipeline&lt;/li&gt;
&lt;li&gt;Other AWS Services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Kinesis Data Analytics&lt;/li&gt;
&lt;li&gt;Amazon ElasticSearch Service&lt;/li&gt;
&lt;li&gt;Amazon Athena&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Amazon SageMaker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visualization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Quicksight&lt;/li&gt;
&lt;li&gt;Other Visualization Tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon STS&lt;/li&gt;
&lt;li&gt;AWS Key Management Service (KMS)&lt;/li&gt;
&lt;li&gt;Cloud HSM (Hardware Security Module)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  ✅ Collection
&lt;/h1&gt;

&lt;p&gt;For the most part, questions require you to know how to move data from one source to another, using the right tools in the right situation. Knowing the advantages and disadvantages of each one will help you answer those questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Kinesis
&lt;/h2&gt;

&lt;p&gt;Amazon Kinesis has 4 capabilities, namely: Video Streams, Data Streams, Firehose, Data Analytics.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Video Streams&lt;/strong&gt;, I didn't get a question on this, so you only need to remember that it is used for &lt;em&gt;streaming video data for analytics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Kinesis Data Streams&lt;/strong&gt; and &lt;strong&gt;Kinesis Firehose&lt;/strong&gt;, most of the collection part revolves around these two. You need to &lt;strong&gt;KNOW&lt;/strong&gt; how to differentiate Kinesis Firehose from Data Streams; I can't stress this enough. Please study this, as a lot of the answer choices involve Kinesis Data Streams and Firehose, and you need to be able to distinguish them.&lt;/p&gt;

&lt;p&gt;There are also troubleshooting and scenario-based questions, such as how you would solve a &lt;code&gt;ProvisionedThroughputExceeded&lt;/code&gt; error, when you should merge or split shards, what encryption options are available, and how Kinesis integrates with other services.&lt;/p&gt;
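
&lt;p&gt;To make that concrete while you study, here's a rough boto3 sketch (not from the exam itself) of the operations those scenarios revolve around: retrying a throttled write and resharding a stream. The stream name is just a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")
STREAM = "demo-stream"  # placeholder stream name

def put_with_backoff(data, partition_key, retries=5):
    """Retry writes throttled with ProvisionedThroughputExceededException."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(StreamName=STREAM, Data=data, PartitionKey=partition_key)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("Stream is still throttling; consider adding shards")

# Scale the stream out or in by changing the shard count (uniform scaling)
kinesis.update_shard_count(StreamName=STREAM, TargetShardCount=4, ScalingType="UNIFORM_SCALING")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;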

&lt;h2&gt;
  
  
  AWS Database Migration Service (DMS)
&lt;/h2&gt;

&lt;p&gt;This came up in a few questions; make sure you know when to use DMS vs other tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Simple Queue Service (SQS)
&lt;/h2&gt;

&lt;p&gt;Just know the difference between Kinesis and SQS and which one you should use for each problem. &lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Snowball
&lt;/h2&gt;

&lt;p&gt;The exam will include this as an option. Although adding it as part of the solution can be feasible, look at what the question actually asks for. If the solution requires you to migrate data quickly, this is probably not the most appropriate choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS IoT
&lt;/h2&gt;

&lt;p&gt;I didn't get a lot of IoT questions, but it's nice to have an overview of this one: IoT topics, rules, etc. Just browse through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Managed Streaming for Kafka (MSK)
&lt;/h2&gt;

&lt;p&gt;MSK is another option for streaming data similar to Kinesis, so knowing when to use MSK vs Kinesis will be crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Direct Connect
&lt;/h2&gt;

&lt;p&gt;Part of a data warehouse migration is integrating your on-premises data center with your Amazon VPC network; knowing when to use a Site-to-Site VPN versus Direct Connect will help you here.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Storage
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon S3
&lt;/h2&gt;

&lt;p&gt;From storage classes to replication and lifecycle policies, you need to know your Amazon S3 concepts! Amazon S3 Glacier also covers part of the exam for archiving purposes. &lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon DynamoDB
&lt;/h2&gt;

&lt;p&gt;Understanding DynamoDB and its features is also essential, as the exam will try to trick you into choosing other services instead of it; knowing DynamoDB's advantages and disadvantages will help you filter out those tricky questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Elasticache
&lt;/h2&gt;

&lt;p&gt;I didn't get an Elasticache question, but knowing what it is also helps.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Processing
&lt;/h1&gt;

&lt;h2&gt;
  
  
  AWS Lambda
&lt;/h2&gt;

&lt;p&gt;Lambda covers a lot of the exam, as it integrates with almost all AWS services, so knowing when to use Lambda (and when not to) is definitely something you should study.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Glue
&lt;/h2&gt;

&lt;p&gt;AWS Glue also shows up in a lot of the options. I've had no trouble with Glue as I've been using it since version 1. Generally, if the question looks for a &lt;em&gt;cost-effective&lt;/em&gt; solution that &lt;em&gt;requires no operational overhead&lt;/em&gt;, definitely look for a Glue answer.&lt;/p&gt;

&lt;p&gt;Glue features also show up: bookmarks, &lt;code&gt;DynamicFrame&lt;/code&gt; functions, job metrics, etc. Troubleshooting a Glue job is another one, i.e., what you should do if Glue throws an error.&lt;/p&gt;
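
&lt;p&gt;For context while reviewing, here's a minimal Glue job skeleton (a sketch, not exam material) showing where bookmarks and &lt;code&gt;DynamicFrame&lt;/code&gt; functions fit in. The database, table, and bucket names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required for job bookmarks to work

# transformation_ctx is what bookmarks use to track already-processed data
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
    transformation_ctx="source_dyf",
)

# Example DynamicFrame function: resolve an ambiguous column type
dyf = dyf.resolveChoice(specs=[("amount", "cast:double")])

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder
    format="parquet",
    transformation_ctx="sink_dyf",
)

job.commit()  # records the bookmark checkpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;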

&lt;h2&gt;
  
  
  Amazon EMR
&lt;/h2&gt;

&lt;p&gt;Study and understand EMR and all of its applications! Period.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Lake Formation
&lt;/h2&gt;

&lt;p&gt;If the question is about managing access to your data lake, Lake Formation is the answer instead of managing it via IAM. &lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Step Functions
&lt;/h2&gt;

&lt;p&gt;Used for orchestration, an overview will do.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Data Pipeline
&lt;/h2&gt;

&lt;p&gt;I didn't get a Data Pipeline question, but it shows up as one of the answer choices, so knowing it will help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other AWS Services
&lt;/h2&gt;

&lt;p&gt;For S3 Select, S3DistCp, and Hadoop ecosystem tools (Ganglia, Mahout, Ranger, HCatalog, etc.), just a basic understanding of what they do will suffice.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Analysis
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon Kinesis Data Analytics
&lt;/h2&gt;

&lt;p&gt;KDA allows you to query streaming data using SQL; knowing window functions will help, as well as knowing when you should use KDA vs Lambda.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon ElasticSearch Service
&lt;/h2&gt;

&lt;p&gt;Generally, for log analysis, look for an ES solution along with Kibana.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Athena
&lt;/h2&gt;

&lt;p&gt;The most &lt;strong&gt;cost-effective&lt;/strong&gt; solutions often involve Athena. Watch out for answers that involve Redshift Spectrum, as the exam will try to trick you into using Athena even when Spectrum is the better fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Redshift
&lt;/h2&gt;

&lt;p&gt;Node types, resizing options, distribution styles, Redshift Spectrum, cluster administration, encryption options: please study and remember these.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon SageMaker
&lt;/h2&gt;

&lt;p&gt;I didn't get a lot of SageMaker questions but an overview will help.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Visualization
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon Quicksight
&lt;/h2&gt;

&lt;p&gt;Row-level security, Standard and Enterprise editions, authentication options (MS AD, SAML), Kibana vs Quicksight solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Visualization Tools
&lt;/h2&gt;

&lt;p&gt;There are 1-2 questions that offer D3.js, Highcharts, or a custom chart as a solution; knowing when to choose between those and QuickSight is nice.&lt;/p&gt;

&lt;h1&gt;
  
  
  ✅ Security
&lt;/h1&gt;

&lt;p&gt;Security covers a lot of the exam, and it can be very tricky if you haven't studied for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon STS
&lt;/h2&gt;

&lt;p&gt;Some questions require you to access another AWS account, so knowing IAM in general and how STS and authentication work with AWS is a nice-to-have. &lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Key Management Service (KMS)
&lt;/h2&gt;

&lt;p&gt;KMS shows up in almost all of the security questions in the exam, so please make sure you are prepared for this. I warned you lol.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud HSM (Hardware Security Module)
&lt;/h2&gt;

&lt;p&gt;If the exam asks you about managing your own security options, look for an HSM solution.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Studying everything at once can be overwhelming, so try to take your time understanding each service first and how they work.&lt;/p&gt;

&lt;p&gt;Whitepapers and Webinars are very helpful, especially Migration videos, as they give you an overall design on how things are done in AWS. &lt;/p&gt;

&lt;p&gt;Also consider making it a habit to watch one video or read one whitepaper at a time to avoid information overload (small wins!).&lt;/p&gt;

&lt;p&gt;To others who have passed and taken the same exam, feel free to share your thoughts. I would gladly add it to this post to help others pass! Let's learn from each other!&lt;/p&gt;

&lt;p&gt;Good Luck! &lt;/p&gt;

</description>
      <category>aws</category>
      <category>certification</category>
      <category>tutorial</category>
      <category>cloudskills</category>
    </item>
    <item>
      <title>Introduction to the AWS Big Data Portfolio</title>
      <dc:creator>Kyle Escosia</dc:creator>
      <pubDate>Sat, 29 May 2021 07:15:08 +0000</pubDate>
      <link>https://dev.to/awscommunity-asean/introduction-to-the-aws-big-data-portfolio-2539</link>
      <guid>https://dev.to/awscommunity-asean/introduction-to-the-aws-big-data-portfolio-2539</guid>
      <description>&lt;p&gt;Want to build an end-to-end data pipeline in AWS? &lt;/p&gt;

&lt;p&gt;You're in luck! In this post, I will introduce you to AWS' Big Data portfolio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; (below image)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmteh8d4czp7fzru4t2wj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmteh8d4czp7fzru4t2wj.PNG" alt="Alt Text" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Content Outline 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Collect
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Direct Connect&lt;/li&gt;
&lt;li&gt;Amazon Kinesis&lt;/li&gt;
&lt;li&gt;AWS Snowball&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Store
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon Glacier&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB&lt;/li&gt;
&lt;li&gt;Amazon RDS&lt;/li&gt;
&lt;li&gt;Amazon Aurora&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Process and Analyze
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Amazon Athena&lt;/li&gt;
&lt;li&gt;AWS Glue&lt;/li&gt;
&lt;li&gt;Amazon EMR&lt;/li&gt;
&lt;li&gt;Amazon EC2&lt;/li&gt;
&lt;li&gt;Amazon Sagemaker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visualize
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon QuickSight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my future posts, I will be going into details on these services, so make sure you watch out for that! &lt;/p&gt;

&lt;p&gt;Before we dive into the suite of AWS Services, let's first define what &lt;strong&gt;Big Data&lt;/strong&gt; is.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Big Data?
&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://aws.amazon.com/big-data/what-is-big-data/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Big data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make this definition simpler:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data set is considered &lt;em&gt;big data&lt;/em&gt; when it is too &lt;strong&gt;big&lt;/strong&gt; or &lt;strong&gt;complex&lt;/strong&gt; to be &lt;strong&gt;stored&lt;/strong&gt; or &lt;strong&gt;analyzed&lt;/strong&gt; by &lt;strong&gt;traditional data systems&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Obviously, there are many definitions of Big Data around the web, but for me, this is the most simple one to understand.&lt;/p&gt;

&lt;p&gt;Now that we've defined what Big Data is, we'll proceed with the AWS Services that will help you answer those challenges.&lt;/p&gt;




&lt;h2&gt;
  
  
  Collect
&lt;/h2&gt;

&lt;p&gt;The collection of raw data has always been a challenge for many organizations, especially for us developers, because you have different complex source systems scattered across the company, such as ERP systems, CRM systems, transactional DBs, etc. &lt;/p&gt;

&lt;p&gt;You have to also think about how you would integrate the data between these systems to create a unified view of your data. &lt;/p&gt;

&lt;p&gt;AWS helps make these steps easier, allowing us developers to ingest all kinds of data, from structured to unstructured and from real-time to batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Direct Connect
&lt;/h3&gt;

&lt;p&gt;AWS Direct Connect is a networking service that provides an alternative to using the internet to connect to AWS. &lt;/p&gt;

&lt;p&gt;Using AWS Direct Connect, data that would have previously been transported over the internet is delivered through a private network connection between your facilities and AWS.&lt;/p&gt;

&lt;p&gt;This is useful if you want consistent network performance or if you have bandwidth-heavy workloads. I personally haven't tried it yet; in most of our implementations, we just use AWS Site-to-Site VPN.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Kinesis
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Easily collect, process, and analyze video and data streams in real time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.&lt;/p&gt;

&lt;p&gt;Amazon Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure.&lt;/p&gt;

&lt;p&gt;Kinesis has 4 capabilities namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kinesis Video Streams&lt;/li&gt;
&lt;li&gt;Kinesis Data Streams&lt;/li&gt;
&lt;li&gt;Kinesis Data Firehose &lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Amazon Kinesis Video Streams
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Capture, process, and store video streams&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Amazon Kinesis Data Streams
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Capture, process, and store data streams&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources. &lt;/p&gt;

&lt;h4&gt;
  
  
  Amazon Kinesis Data Firehose
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Load data streams into AWS data stores&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.&lt;/p&gt;
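
&lt;p&gt;Producing to Firehose from code is essentially a one-liner; here's a small boto3 sketch, assuming you have already created a delivery stream (the name below is a placeholder) pointed at an S3 destination.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

firehose = boto3.client("firehose")

# Placeholder delivery stream name - Firehose buffers the records and
# delivers them to the configured destination (e.g. S3) for you
firehose.put_record(
    DeliveryStreamName="demo-delivery-stream",
    Record={"Data": (json.dumps({"event": "page_view", "user": "123"}) + "\n").encode("utf-8")},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;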

&lt;h4&gt;
  
  
  Amazon Kinesis Data Analytics
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Analyze data streams with SQL or Apache Flink&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kinesis Data Analytics is the easiest way to process data streams in real time with SQL or Apache Flink without having to learn new programming languages or processing frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Snowball
&lt;/h3&gt;

&lt;p&gt;An interesting way to move your data from on-premises to the AWS Cloud is AWS Snowball, a service that provides secure, rugged devices so you can bring AWS computing and storage capabilities to your edge environments and transfer data into and out of AWS.&lt;/p&gt;

&lt;p&gt;I personally haven't tried this yet but would love to do so in the future!&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon S3
&lt;/h3&gt;

&lt;p&gt;The most famous AWS service would be Amazon S3, an object storage service built to store and retrieve any amount of data from anywhere. &lt;/p&gt;

&lt;p&gt;It’s a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Amazon S3 is AWS' first service that launched back in 2006!&lt;/em&gt;&lt;/p&gt;
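
&lt;p&gt;To give you an idea of how simple it is, here's a tiny boto3 sketch that stores and then retrieves an object. The bucket name is a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # placeholder bucket name

# Store an object, then retrieve it
s3.put_object(Bucket=BUCKET, Key="raw/sample.json", Body=b'{"hello": "data lake"}')
obj = s3.get_object(Bucket=BUCKET, Key="raw/sample.json")
print(obj["Body"].read())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;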

&lt;h3&gt;
  
  
  Amazon S3 Glacier
&lt;/h3&gt;

&lt;p&gt;S3 Glacier is an extremely low-cost storage service that provides secure, durable, and flexible storage for data backup and archival.&lt;/p&gt;

&lt;p&gt;It is excellent for businesses or organizations that need to retain their data for years or even decades!&lt;/p&gt;




&lt;h2&gt;
  
  
  Store
&lt;/h2&gt;

&lt;p&gt;I'm honestly a big fan of Amazon S3, given how scalable and how easy it is to use. I'll just say that if you aren't using Amazon S3 for your data lakes, then you are missing out on a lot of things lol.&lt;/p&gt;

&lt;p&gt;There are obviously a lot of factors that need to be considered when building your big data project. Any big data platform needs a secure, scalable, and durable repository to store data prior to, or even after, processing tasks. AWS provides you with services depending on your specific requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon DynamoDB
&lt;/h3&gt;

&lt;p&gt;DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. &lt;/p&gt;

&lt;p&gt;It is one of the AWS Services that is &lt;strong&gt;fully-managed&lt;/strong&gt;, meaning that you don't have to worry about setting up the infrastructure and software updates, you just use the service.&lt;/p&gt;
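
&lt;p&gt;Here's a quick sketch of the key-value access pattern DynamoDB is built for, using boto3. The table and attribute names are made up for illustration and assume a table with a partition key of &lt;code&gt;user_id&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")  # placeholder table with partition key "user_id"

# Write an item, then read it back by key
table.put_item(Item={"user_id": "u-123", "name": "Jose", "plan": "free"})
response = table.get_item(Key={"user_id": "u-123"})
print(response.get("Item"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;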

&lt;h3&gt;
  
  
  Amazon RDS
&lt;/h3&gt;

&lt;p&gt;Amazon RDS is a &lt;strong&gt;managed&lt;/strong&gt; service that makes it easy to set up, operate, and scale a relational database in the cloud.&lt;/p&gt;

&lt;p&gt;Amazon RDS supports Amazon Aurora, MySQL, MariaDB, Oracle, SQL Server, and PostgreSQL database engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Aurora
&lt;/h3&gt;

&lt;p&gt;Amazon Aurora is a relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process and Analyze
&lt;/h2&gt;

&lt;p&gt;This is the step where data is transformed from its raw state into a consumable format – usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms. &lt;/p&gt;

&lt;p&gt;The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Redshift
&lt;/h3&gt;

&lt;p&gt;Amazon Redshift is the most widely used cloud data warehouse. &lt;/p&gt;

&lt;p&gt;It makes it fast, simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. &lt;/p&gt;

&lt;p&gt;It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.&lt;/p&gt;

&lt;p&gt;We've had some successful implementations on Redshift, and I can share some of my experiences with it, so watch out for that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Athena
&lt;/h3&gt;

&lt;p&gt;Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. &lt;/p&gt;

&lt;p&gt;Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.&lt;/p&gt;

&lt;p&gt;We've used Athena a lot in our implementations, and I must say that it really helped us in terms of data exploration and data validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Glue
&lt;/h3&gt;

&lt;p&gt;AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.&lt;/p&gt;

&lt;p&gt;AWS Glue has evolved significantly from its initial release 0.9 to AWS Glue 2.0. Along with that are enhancements that glue (pun intended) all your pipelines together. Definitely worth looking into.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon EMR
&lt;/h3&gt;

&lt;p&gt;Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. &lt;/p&gt;

&lt;p&gt;It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).&lt;/p&gt;

&lt;p&gt;As opposed to Glue, which is serverless (meaning you don't need to provision your own servers), EMR gives you more flexibility in sizing your cluster depending on how "big" your data processing workloads are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon EC2
&lt;/h3&gt;

&lt;p&gt;Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud.&lt;/p&gt;

&lt;p&gt;Basically, it's your virtual machine in the cloud with a lot of use cases, living up to its name "Elastic".&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Sagemaker
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. &lt;/p&gt;

&lt;p&gt;SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.&lt;/p&gt;

&lt;p&gt;AWS re:Invent 2020 introduced a lot of significant improvements to Amazon SageMaker, such as Data Wrangler, Clarify, SageMaker Pipelines, and many more! I'll be doing a deep dive on these exciting features soon! &lt;/p&gt;




&lt;h2&gt;
  
  
  Visualize
&lt;/h2&gt;

&lt;p&gt;Big data is all about getting high value, actionable insights from your data assets. &lt;/p&gt;

&lt;p&gt;Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets. &lt;/p&gt;

&lt;p&gt;Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical “predictions” – in the case of predictive analytics – or recommended actions – in the case of prescriptive analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon QuickSight
&lt;/h3&gt;

&lt;p&gt;Amazon QuickSight is a very fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within an organization to build visualizations, perform ad-hoc analysis, and quickly get business insights from their data, anytime, on any device. &lt;/p&gt;

&lt;p&gt;QuickSight is easy to use and has made some major improvements since it was publicly released. It's still fairly new compared to other major BI tools, but I think it has potential. Look into QuickSight if you want a cost-effective BI solution.&lt;/p&gt;




&lt;p&gt;That's it for me. Would love to hear your thoughts!&lt;/p&gt;

&lt;h6&gt;
  
  
  References:
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/big-data/what-is-big-data/" rel="noopener noreferrer"&gt;https://aws.amazon.com/big-data/what-is-big-data/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/directconnect/" rel="noopener noreferrer"&gt;https://aws.amazon.com/directconnect/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/kinesis/" rel="noopener noreferrer"&gt;https://aws.amazon.com/kinesis/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/snowball/" rel="noopener noreferrer"&gt;https://aws.amazon.com/snowball/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;https://aws.amazon.com/s3/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/glacier/" rel="noopener noreferrer"&gt;https://aws.amazon.com/glacier/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;https://aws.amazon.com/dynamodb/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/rds/" rel="noopener noreferrer"&gt;https://aws.amazon.com/rds/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;https://aws.amazon.com/redshift/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/athena/" rel="noopener noreferrer"&gt;https://aws.amazon.com/athena/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;https://aws.amazon.com/glue/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/emr/" rel="noopener noreferrer"&gt;https://aws.amazon.com/emr/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/ec2/" rel="noopener noreferrer"&gt;https://aws.amazon.com/ec2/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/&lt;/a&gt;
&lt;/h6&gt;

&lt;h6&gt;
  
  
  &lt;a href="https://aws.amazon.com/quicksight/" rel="noopener noreferrer"&gt;https://aws.amazon.com/quicksight/&lt;/a&gt;
&lt;/h6&gt;

</description>
      <category>aws</category>
      <category>community</category>
      <category>data</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
