<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anand</title>
    <description>The latest articles on DEV Community by Anand (@anandp86).</description>
    <link>https://dev.to/anandp86</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F127289%2Fd4411b77-a219-4a63-bd86-af4ce1ee1591.JPG</url>
      <title>DEV Community: Anand</title>
      <link>https://dev.to/anandp86</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anandp86"/>
    <language>en</language>
    <item>
      <title>Connect to AWS Aurora PostgreSQL/Amazon Redshift Database from AWS Lambda</title>
      <dc:creator>Anand</dc:creator>
      <pubDate>Fri, 05 Mar 2021 20:58:45 +0000</pubDate>
      <link>https://dev.to/anandp86/connect-to-aws-aurora-postgresql-amazon-redshift-database-from-aws-lambda-1kne</link>
      <guid>https://dev.to/anandp86/connect-to-aws-aurora-postgresql-amazon-redshift-database-from-aws-lambda-1kne</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vo4hWGPe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv23ciie5kl5awezfw7e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vo4hWGPe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv23ciie5kl5awezfw7e.jpg" alt="Alt Text" width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
Image by &lt;a href="https://pixabay.com/users/peggy_marco-1553824/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=1019769"&gt;Peggy und Marco Lachmann-Anke&lt;/a&gt; from &lt;a href="https://pixabay.com/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=1019769"&gt;Pixabay&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post I will discuss the following scenarios for connecting to databases from an AWS Lambda function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting to Amazon Aurora PostgreSQL database &lt;strong&gt;&lt;em&gt;in private subnet&lt;/em&gt;&lt;/strong&gt; with public accessibility set to No &lt;strong&gt;&lt;em&gt;in same AWS account&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Connecting to &lt;strong&gt;&lt;em&gt;cross account&lt;/em&gt;&lt;/strong&gt; Amazon Redshift database &lt;strong&gt;&lt;em&gt;in public subnet&lt;/em&gt;&lt;/strong&gt; with public accessibility set to Yes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
&lt;span&gt;Connect to Amazon Aurora PostgreSQL&lt;/span&gt; &lt;span&gt;database in Private subnet with Public accessibility set to No in the same AWS account&lt;/span&gt;
&lt;/h4&gt;

&lt;p&gt;In this setup, the Amazon Aurora PostgreSQL database is running in a private subnet with public accessibility set to No. The connectivity and security details are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/02/rds_db_security.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gcyhh97L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/02/rds_db_security.png%3Fw%3D1024" alt="" width="880" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To connect to the Aurora PostgreSQL database in a private subnet you need to configure the Lambda function in a Virtual Private Cloud (VPC). Let's go ahead and create the Lambda function - &lt;/p&gt;

&lt;p&gt;&lt;em&gt;AWS Service &amp;gt; Lambda &amp;gt; Functions &amp;gt; Create Function &amp;gt; Author from scratch&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Under Basic Information &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enter the function name&lt;/li&gt;
&lt;li&gt;Choose your preferred language. Here I selected Python 3.8.&lt;/li&gt;
&lt;li&gt;In the Permissions section you can keep the default - Create a new role with basic Lambda permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under Advanced settings&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Enter the details in Network&lt;ul&gt;
&lt;li&gt;Choose the VPC name in which the Aurora PostgreSQL database is running&lt;/li&gt;
&lt;li&gt;Select at least 2 private subnets. To access private Amazon VPC resources, such as a DB instance, you need to associate your Lambda function with one or more private subnets. If you select public subnets instead, the Lambda function will time out, because Lambda network interfaces cannot have public IP addresses.&lt;/li&gt;
&lt;li&gt;Choose the default security group&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Once you have completed the details, click on Create function. The function creation can take some time, as Lambda creates an ENI (elastic network interface) in each subnet of the VPC configuration. An ENI represents a virtual network card; you can read more &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_ElasticNetworkInterfaces.html" rel="noreferrer noopener"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this case the Lambda function is launched with an execution role (IAM role) having 2 managed policies attached by default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWSLambdaBasicExecutionRole&lt;/li&gt;
&lt;li&gt;AWSLambdaVPCAccessExecutionRole&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In case you choose to use an existing role while creating the Lambda function, make sure to attach the &lt;em&gt;AWSLambdaVPCAccessExecutionRole&lt;/em&gt; policy. This managed policy grants the following permissions, which Lambda uses to create and manage network interfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ec2:CreateNetworkInterface&lt;/li&gt;
&lt;li&gt;ec2:DescribeNetworkInterfaces&lt;/li&gt;
&lt;li&gt;ec2:DeleteNetworkInterface&lt;/li&gt;
&lt;/ul&gt;
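&lt;p&gt;For reference, the ENI-related permissions above can be expressed as a policy document. Below is a minimal sketch built in Python - an illustration of the shape of the statement, not the exact AWS-managed policy document:&lt;/p&gt;

```python
import json

# Minimal sketch of the ENI-management permissions carried by
# AWSLambdaVPCAccessExecutionRole. This is an illustration only,
# not the exact AWS-managed policy document.
eni_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DeleteNetworkInterface",
            ],
            "Resource": "*",
        }
    ],
}

print(json.dumps(eni_policy, indent=2))
```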

&lt;p&gt;Below is the VPC configuration for my Lambda function, with 2 private subnets attached:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/01/screen-shot-2021-01-31-at-4.39.48-pm.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jzDp6Jj6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/01/screen-shot-2021-01-31-at-4.39.48-pm.png%3Fw%3D1024" alt="" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the Lambda function, I used the code below to connect to and execute SQL against the Amazon Aurora PostgreSQL database. The code installs the PostgreSQL interface library &lt;a href="https://pypi.org/project/pg8000/" rel="noreferrer noopener"&gt;pg8000&lt;/a&gt; at runtime to interact with the database. The handler creates a connection to the PostgreSQL database, executes a SELECT statement to fetch the current timestamp into the results variable, and returns the results as a string.&lt;/p&gt;

&lt;pre class="wp-block-syntaxhighlighter-code"&gt;import sys
import boto3
import logging
import urllib.parse
from pip._internal import main

# install pg8000
main(['install', '-I', '-q', 'pg8000', '--target', '/tmp/', '--no-cache-dir', '--disable-pip-version-check'])
sys.path.insert(0,'/tmp/')
    
import pg8000

def lambda_handler(event, context):
    
    sql = """SELECT current_timestamp"""
    
    conn = pg8000.connect(
        database='demodb',
        user='admin',
        password='xxxxxxx',
        host='cluster-demodb.cluster-cijke9kklkrh.us-east-1.rds.amazonaws.com',
        port=8192,
        ssl_context=True
        )
        
    dbcur = conn.cursor()
    dbcur.execute(sql)
    results = dbcur.fetchall()
    dbcur.close()
    
    return str(results)&lt;/pre&gt;
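&lt;p&gt;One note on the snippet above: the credentials are hardcoded only for brevity. A safer pattern is to read them from Lambda environment variables (or AWS Secrets Manager). A minimal sketch, assuming illustrative variable names such as DB_NAME and DB_HOST:&lt;/p&gt;

```python
import os

# Build the connection settings from environment variables with demo defaults.
# The variable names are illustrative assumptions, not a fixed contract.
def db_config_from_env(environ=os.environ):
    return {
        "database": environ.get("DB_NAME", "demodb"),
        "user": environ.get("DB_USER", "admin"),
        "password": environ.get("DB_PASSWORD", ""),
        "host": environ.get("DB_HOST", "localhost"),
        "port": int(environ.get("DB_PORT", "5432")),
    }

cfg = db_config_from_env({"DB_NAME": "demodb", "DB_HOST": "example-host", "DB_PORT": "8192"})
print(cfg)
```

&lt;p&gt;The resulting dict can then be passed straight to the connect call, e.g. pg8000.connect(**cfg).&lt;/p&gt;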

&lt;p&gt;  Output: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/01/screen-shot-2021-01-31-at-4.54.42-pm.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bzC2tFj_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/01/screen-shot-2021-01-31-at-4.54.42-pm.png%3Fw%3D1024" alt="" width="880" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;&lt;span&gt;Connect to cross account Amazon Redshift database in Public subnet with Publicly accessible set to Yes &lt;/span&gt;&lt;/h4&gt;

&lt;p&gt;Below are the Amazon Redshift connection details for the cluster running in &lt;strong&gt;&lt;em&gt;Account A&lt;/em&gt;&lt;/strong&gt;. The database is running in a public subnet and is publicly accessible. The security group acts as a virtual firewall for the cluster to control inbound and outbound traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-01-31-at-9.31.30-pm.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uwc2HG_5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-01-31-at-9.31.30-pm.png%3Fw%3D1024" alt="" width="880" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;&lt;em&gt;Account B&lt;/em&gt;&lt;/strong&gt; I created a new VPC, "Cross-Account-Lambda-VPC", to test this use-case. To create the Lambda function I followed the same steps as in the previous section, except that the VPC selected in this case was "Cross-Account-Lambda-VPC" (the VPC in Account B). The screenshot below is of the Lambda function in Account B, which has 2 private subnets added.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/01/screen-shot-2021-01-31-at-9.40.14-pm-1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MpPOXyt1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/01/screen-shot-2021-01-31-at-9.40.14-pm-1.png%3Fw%3D1024" alt="" width="880" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The private subnets are attached to a route table. The route table needs a route that targets a NAT gateway. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-02-28-at-6.33.16-pm-1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ebc87P_e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-02-28-at-6.33.16-pm-1.png%3Fw%3D964" alt="" width="880" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The NAT gateway resides in a public subnet and has an Elastic IP (EIP) associated with it, which acts as its public IP address and lets it connect to the internet through the VPC's internet gateway. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-02-28-at-6.44.39-pm.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_eBj6Zxh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-02-28-at-6.44.39-pm.png%3Fw%3D1024" alt="" width="880" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This EIP address needs to be added to the inbound rules of the security group attached to the Amazon Redshift database in Account A, as shown below. With this configuration the Lambda function is able to connect to the database in the other account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-02-28-at-7.19.20-pm.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---mqr8kO4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2021/02/screen-shot-2021-02-28-at-7.19.20-pm.png%3Fw%3D1024" alt="" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To conclude, in this blog post we reviewed step by step how to set up a Lambda function to connect to a database running in a private subnet in the same AWS account, and to connect to a cross-account database running in a public subnet. In the next blog post I will show a few libraries I experimented with to connect to and query the cross-account Redshift database from a Lambda function.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>New Features in Amazon DynamoDB - PartiQL, Export to S3, Integration with Kinesis Data Streams</title>
      <dc:creator>Anand</dc:creator>
      <pubDate>Tue, 15 Dec 2020 06:45:34 +0000</pubDate>
      <link>https://dev.to/anandp86/new-features-in-amazon-dynamodb-partiql-export-to-s3-integration-with-kinesis-data-streams-54eg</link>
      <guid>https://dev.to/anandp86/new-features-in-amazon-dynamodb-partiql-export-to-s3-integration-with-kinesis-data-streams-54eg</guid>
      <description>&lt;p&gt;Around AWS re:Invent each year, AWS releases many new features over the span of a month. In this blog post I will touch on 3 new features introduced for Amazon DynamoDB. DynamoDB is a managed non-relational database with single-digit millisecond performance at any scale.&lt;/p&gt;

&lt;p&gt;New Features in Amazon DynamoDB - &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PartiQL - SQL-compatible query language for Amazon DynamoDB.&lt;/li&gt;
&lt;li&gt;Export to S3 - Export Amazon DynamoDB table to S3. In this blog I have added a use-case of deserializing the DynamoDB items, writing it to S3 and query using Athena.&lt;/li&gt;
&lt;li&gt;Direct integration of DynamoDB with Kinesis Streams - Stream item-level images of Amazon DynamoDB as a Kinesis Data Stream.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To start with, let's look at the new Amazon DynamoDB console. I have 2 DynamoDB tables to play around with for this blog. The Books table is partitioned by Author and sorted by Title. The Movies table is partitioned by year and sorted by title.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/11/screen-shot-2020-11-23-at-9.18.58-pm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F11%2Fscreen-shot-2020-11-23-at-9.18.58-pm.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's jump on to the features:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;
&lt;strong&gt;&lt;span&gt;PartiQL&lt;/span&gt;&lt;/strong&gt; - You can use SQL to select, insert, update, and delete items in Amazon DynamoDB. Currently you can use PartiQL for DynamoDB from the Amazon DynamoDB console, the AWS Command Line Interface (AWS CLI), and the DynamoDB APIs. For this blog, I am using the AWS console. &lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;DynamoDB &amp;gt; PartiQL editor&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SELECT SQLs -&lt;/span&gt; &lt;/p&gt;

&lt;p&gt;Simple select SQL&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT * FROM Books where Author='William Shakespeare'&lt;/code&gt;&lt;/pre&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;Formats&lt;/td&gt;
&lt;td&gt;Author&lt;/td&gt;
&lt;td&gt;Category&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hamlet&lt;/td&gt;
&lt;td&gt;{ "Hardcover" : { "S" : "GVJZQ7JK" }, "Paperback" : { "S" : "A4TFUR98" }, "Audiobook" : { "S" : "XWMGHW96" } }&lt;/td&gt;
&lt;td&gt;William Shakespeare&lt;/td&gt;
&lt;td&gt;Drama&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;





&lt;p&gt;The following SQL returns the title, hardcover format, and category using a key path -&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;SELECT Title, Formats['Hardcover'], category FROM Books where Author='John Grisham'&lt;/code&gt;&lt;/pre&gt;





&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Category&lt;/td&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;Hardcover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suspense&lt;/td&gt;
&lt;td&gt;The Firm&lt;/td&gt;
&lt;td&gt;Q7QWE3U2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suspense&lt;/td&gt;
&lt;td&gt;The Rainmaker&lt;/td&gt;
&lt;td&gt;J4SUKVGU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thriller&lt;/td&gt;
&lt;td&gt;The Reckoning&lt;/td&gt;
&lt;td&gt;null&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;





&lt;p&gt;The following SQL uses the "contains" function, which returns TRUE if the Category attribute contains the string 'Suspense' - &lt;/p&gt;





&lt;pre&gt;&lt;code&gt;SELECT Title, Formats['Audiobook'], Category FROM Books where Author='John Grisham' and contains(Category, 'Suspense')&lt;/code&gt;&lt;/pre&gt;





&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;year&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;title&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;release_date&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;rank&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;2011&lt;/td&gt;
&lt;td&gt;Sherlock Holmes: A Game of Shadows&lt;/td&gt;
&lt;td&gt;2011-12-10T00:00:00Z&lt;/td&gt;
&lt;td&gt;570&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;span&gt;INSERT SQL - &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Insert a single item -&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&lt;strong&gt;INSERT&lt;/strong&gt; INTO Books value {'Title' : 'A time to kill', 'Author' : 'John Grisham', 'Category' : 'Suspense' }

SELECT * FROM Books WHERE Title='A time to kill'&lt;/code&gt;&lt;/pre&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Category&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;John Grisham&lt;/td&gt;
&lt;td&gt;A time to kill&lt;/td&gt;
&lt;td&gt;Suspense&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p id="block-fd0bc930-3997-46f2-a705-2af32d6f4501"&gt;"INSERT INTO SELECT" SQL fails with ValidationException: Unsupported operation: Inserting  multiple items in a single statement is not supported, use "INSERT INTO  tableName VALUE item" instead&lt;/p&gt;



&lt;p&gt;&lt;span&gt;UPDATE SQL -&lt;/span&gt; &lt;/p&gt;



&lt;p id="block-7de45d7f-d34a-4bc7-ae12-5051e2a7b9d5"&gt;In the previous insert sql, Formats column was null. So lets update the Formats column for the book.&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;&lt;strong&gt;UPDATE&lt;/strong&gt; Books &lt;strong&gt;SET&lt;/strong&gt; Formats={'Hardcover':'J4SUKVGU' ,'Paperback': 'D7YF4FCX'} WHERE Author='John Grisham' and Title='A time to kill'&lt;/code&gt;&lt;/pre&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Formats&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Category&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;A time to kill&lt;/td&gt;
&lt;td&gt;{"Hardcover":{"S":"J4SUKVGU"},"Paperback":{"S":"D7YF4FCX"}}&lt;/td&gt;
&lt;td&gt;John Grisham&lt;/td&gt;
&lt;td&gt;Suspense&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;You can also use an UPDATE SQL to remove a key from a map - &lt;/p&gt;



&lt;pre&gt;&lt;code&gt;UPDATE Books REMOVE Formats.Paperback WHERE Author='John Grisham' and Title='A time to kill'&lt;/code&gt;&lt;/pre&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Formats&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Category&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;A time to kill&lt;/td&gt;
&lt;td&gt;{"Hardcover":{"S":"J4SUKVGU"}}&lt;/td&gt;
&lt;td&gt;John Grisham&lt;/td&gt;
&lt;td&gt;Suspense&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;&lt;span&gt;DELETE SQL - &lt;/span&gt;&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;DELETE FROM Books WHERE Author='John Grisham' and Title='A time to kill'&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;For more references - &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.html" rel="noreferrer noopener"&gt;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.html&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;2. &lt;strong&gt;&lt;span&gt;EXPORT TO S3&lt;/span&gt;&lt;/strong&gt; - Export a full Amazon DynamoDB table to an Amazon S3 bucket. Using DynamoDB table export, you can export data from an Amazon DynamoDB table as of any point in time within your point-in-time recovery window. To do so the table must have point-in-time recovery (PITR) enabled. If PITR is not enabled for the table, Export to S3 will report an error asking for it to be enabled.&lt;/p&gt;



&lt;p&gt;DynamoDB &amp;gt; Exports to S3&lt;/p&gt;



&lt;a href="https://aprakash.files.wordpress.com/2020/11/screen-shot-2020-11-23-at-11.03.16-pm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F11%2Fscreen-shot-2020-11-23-at-11.03.16-pm.png%3Fw%3D792" alt=""&gt;&lt;/a&gt;



&lt;a href="https://aprakash.files.wordpress.com/2020/12/screen-shot-2020-12-02-at-11.46.36-am-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fscreen-shot-2020-12-02-at-11.46.36-am-1.png%3Fw%3D819" alt=""&gt;&lt;/a&gt;



&lt;p&gt;After the export is complete it generates a manifest-summary.json file summarizing the export details and a manifest-files.json file listing the S3 locations of the exported data files.&lt;/p&gt;
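&lt;p&gt;As an illustration, the per-file manifest is newline-delimited JSON, one object per exported data file. The sketch below pulls out the S3 keys; the field name dataFileS3Key matches what I saw in my exports, but treat it as an assumption and check your own manifest before relying on it:&lt;/p&gt;

```python
import json

# Parse the newline-delimited per-file manifest and collect the S3 keys of
# the exported data files. "dataFileS3Key" is assumed from observed exports.
def list_export_keys(manifest_text):
    keys = []
    for line in manifest_text.splitlines():
        if line.strip():
            keys.append(json.loads(line)["dataFileS3Key"])
    return keys

sample = (
    '{"itemCount": 4, "dataFileS3Key": "AWSDynamoDB/abc/data/file1.json.gz"}\n'
    '{"itemCount": 2, "dataFileS3Key": "AWSDynamoDB/abc/data/file2.json.gz"}\n'
)
print(list_export_keys(sample))
```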



&lt;p&gt;To extend the Export to S3 feature further, you can create a data pipeline that deserializes the DynamoDB JSON format so that you can perform analytics and complex queries on your data from AWS Athena.&lt;/p&gt;



&lt;a href="https://aprakash.files.wordpress.com/2020/12/ddb_pipeline_1-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fddb_pipeline_1-6.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;



&lt;p&gt;The workflow contains the following steps:&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;A time-based CloudWatch event is triggered.&lt;/li&gt;
&lt;li&gt;The event triggers an AWS Lambda function.&lt;/li&gt;
&lt;li&gt;A DynamoDB Export to S3 is initiated.&lt;/li&gt;
&lt;li&gt;The export writes the data in DynamoDB JSON format to the S3 RAW bucket. The S3 objects are gzipped JSON files.&lt;/li&gt;
&lt;li&gt;For each .json.gz file in the S3 RAW bucket an event notification is set to trigger AWS Lambda. The detail is shown below in the S3 Event to trigger AWS Lambda section.&lt;/li&gt;
&lt;li&gt;AWS Lambda reads the S3 object and deserializes the DynamoDB JSON format data using the DynamoDB TypeDeserializer class, which deserializes DynamoDB types to Python types. The deserialized data is written to the S3 Content bucket in Parquet format. The code is in the Lambda function code section.&lt;/li&gt;
&lt;li&gt;The AWS Lambda function updates the table location in the AWS Glue catalog.&lt;/li&gt;
&lt;li&gt;Query the data using AWS Athena.&lt;/li&gt;
&lt;/ul&gt;
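&lt;p&gt;The "initiate export" step above can be sketched with the ExportTableToPointInTime API. The table ARN, bucket, and prefix below are placeholders, and the DynamoDB client is injected so the function can be exercised with a stub, without AWS access:&lt;/p&gt;

```python
# Hypothetical sketch of kicking off a DynamoDB table export to S3.
# ARN, bucket, and prefix are placeholders.
def start_ddb_export(ddb_client, table_arn, bucket, prefix):
    return ddb_client.export_table_to_point_in_time(
        TableArn=table_arn,
        S3Bucket=bucket,
        S3Prefix=prefix,
        ExportFormat="DYNAMODB_JSON",
    )

# Stub client so the sketch can run without AWS credentials
class _StubDdb:
    def export_table_to_point_in_time(self, **kwargs):
        self.last_call = kwargs
        return {"ExportDescription": {"ExportStatus": "IN_PROGRESS"}}

stub = _StubDdb()
resp = start_ddb_export(stub,
                        "arn:aws:dynamodb:us-west-2:111122223333:table/Movies",
                        "my-raw-bucket", "dynamodb/movies/raw/")
print(resp["ExportDescription"]["ExportStatus"])
```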



&lt;p&gt;&lt;span&gt;S3 Event to trigger AWS Lambda &lt;/span&gt;&lt;/p&gt;



&lt;a href="https://aprakash.files.wordpress.com/2020/12/screen-shot-2020-12-02-at-11.48.08-am.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fscreen-shot-2020-12-02-at-11.48.08-am.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;



&lt;p&gt;&lt;span&gt;Lambda function code&lt;/span&gt; &lt;/p&gt;



&lt;pre class="wp-block-syntaxhighlighter-code"&gt;import io
import gzip
import json
import boto3
import uuid
import pandas as pd
import awswrangler as wr
from datetime import datetime
from urllib.parse import unquote_plus


def update_glue_table(*, database, table_name, new_location, region_name):
    """ Update AWS Glue non-partitioned table location
    """

    glue = boto3.client("glue", region_name=region_name)

    response = glue.get_table(
        DatabaseName=database, Name=table_name)

    table_input = response["Table"]
    current_location = table_input["StorageDescriptor"]["Location"]

    table_input.pop("UpdateTime", None)
    table_input.pop("CreatedBy", None)
    table_input.pop("CreateTime", None)
    table_input.pop("DatabaseName", None)
    table_input.pop("IsRegisteredWithLakeFormation", None)
    table_input.pop("CatalogId", None)

    table_input["StorageDescriptor"]["Location"] = new_location

    response = glue.update_table(
        DatabaseName=database,
        TableInput=table_input
    )

    return response
    

def lambda_handler(event, context): 
    
    """
    Uses class TypeDeserializer which deserializes DynamoDB types to Python types 
    
    Example - 

    raw data format :
        [{'ACTIVE': {'BOOL': True}, 'params': {'M': {'customer': {'S': 'TEST'}, 'index': {'N': '1'}}}}, ]
    deserialized data format:
        [{'ACTIVE': True, 'params': {'customer': 'TEST', 'index': Decimal('1')}}]

    """
        
    
    s3client = boto3.client('s3')
    athena_db = "default"
    athena_table = "movies"
    
    # TypeDeserializer converts DynamoDB-typed values to plain Python types
    deserializer = TypeDeserializer()
    all_data = []
    
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
 
        response = s3client.get_object(Bucket=bucket, Key=key)
        content = response['Body'].read()
        with gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb') as fh:
            data = [json.loads(line) for line in fh]
        
        # Accumulate deserialized items across all records in the event
        for row in data:
            all_data.append({k: deserializer.deserialize(v) for k, v in row['Item'].items()})
    
    data_df = pd.DataFrame(all_data)
    
    dt = datetime.utcnow().strftime("%Y-%m-%d-%H-%M")
    s3_path="s3://%s/dynamodb/%s/content/dt=%s/" % (bucket, athena_table, dt)
    
    wr.s3.to_parquet(
        df=data_df,
        path=s3_path,
        dataset = True,
    )
    

    update_response = update_glue_table(
        database=athena_db, 
        table_name=athena_table, 
        new_location=s3_path,
        region_name="us-west-2")

    if update_response["ResponseMetadata"]["HTTPStatusCode"] == 200:
        return (f"Successfully updated glue table location - {athena_db}.{athena_table}")
    else:
        return (f"Failed updating glue table location - {athena_db}.{athena_table}")

&lt;/pre&gt;
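&lt;p&gt;To make the deserialization step concrete, here is a simplified pure-Python sketch of the mapping TypeDeserializer performs. It handles only the S, N, BOOL, L, and M type tags; the real boto3 class covers every DynamoDB type (binary, sets, NULL, and so on):&lt;/p&gt;

```python
from decimal import Decimal

# Simplified illustration of the DynamoDB JSON to Python mapping.
# Each DynamoDB-typed value is a single-key dict: {"S": ...}, {"N": ...}, etc.
def deserialize(value):
    (tag, inner), = value.items()
    if tag == "S":
        return inner
    if tag == "N":
        return Decimal(inner)      # numbers come back as Decimal, as in boto3
    if tag == "BOOL":
        return inner
    if tag == "L":
        return [deserialize(v) for v in inner]
    if tag == "M":
        return {k: deserialize(v) for k, v in inner.items()}
    raise ValueError(f"unhandled type tag: {tag}")

# Same shape as the raw/deserialized example in the handler's docstring
raw = {"ACTIVE": {"BOOL": True},
       "params": {"M": {"customer": {"S": "TEST"}, "index": {"N": "1"}}}}
flat = {k: deserialize(v) for k, v in raw.items()}
print(flat)
```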

&lt;p&gt;&lt;span&gt;Query data from Athena&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT title, info.actors, info.rating, info.release_date, year FROM movies where title='Christmas Vacation'&lt;/code&gt;&lt;/pre&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;actors&lt;/th&gt;
&lt;th&gt;rating&lt;/th&gt;
&lt;th&gt;release_date&lt;/th&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Christmas Vacation&lt;/td&gt;
&lt;td&gt;[Chevy Chase, Beverly D'Angelo, Juliette Lewis]&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;1989-11-30T00:00:00Z&lt;/td&gt;
&lt;td&gt;1989&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For more references - &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html" rel="noreferrer noopener"&gt;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;strong&gt;3. DynamoDB integration with Kinesis Stream.&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;DynamoDB Streams captures a time-ordered sequence of item-level modifications in a DynamoDB table. Earlier, one way to publish DynamoDB data to S3 in near real time was to enable DynamoDB Streams and use an AWS Lambda function to forward the data to Kinesis Firehose, which published it to S3. To do so you can use a handy package provided by AWS Labs - &lt;a rel="noreferrer noopener" href="https://github.com/awslabs/lambda-streams-to-firehose"&gt;https://github.com/awslabs/lambda-streams-to-firehose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/12/ddb_pipeline_1-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fddb_pipeline_1-1.png%3Fw%3D572" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With several such use cases, AWS has now integrated Amazon DynamoDB directly with Amazon Kinesis Data Streams. Now you can capture item-level changes in your DynamoDB tables as a Kinesis data stream. This enables you to publish the data to S3, as shown in the pipeline below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/12/ddb_pipeline_2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fddb_pipeline_2.png%3Fw%3D502" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To set up the pipeline, create an Amazon Kinesis data stream. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/12/screen-shot-2020-12-12-at-12.51.12-pm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fscreen-shot-2020-12-12-at-12.51.12-pm.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up the Amazon Kinesis data stream, create an Amazon Kinesis Data Firehose delivery stream. The source for Kinesis Firehose will be the Amazon Kinesis data stream and the destination will be S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/12/screen-shot-2020-12-12-at-12.54.37-pm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fscreen-shot-2020-12-12-at-12.54.37-pm.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, enable streaming to Kinesis.&lt;/p&gt;

&lt;p&gt;DynamoDB &amp;gt; Table &amp;gt; Kinesis data stream details &amp;gt; Manage streaming to Kinesis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/12/screen-shot-2020-11-29-at-12.16.49-am.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fscreen-shot-2020-11-29-at-12.16.49-am.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://aprakash.files.wordpress.com/2020/12/screen-shot-2020-11-29-at-12.43.04-am.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F12%2Fscreen-shot-2020-11-29-at-12.43.04-am.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the stream is enabled, any item-level change in the table will be captured and written to the Amazon S3 bucket. Below is an example of a record that was updated in DynamoDB using PartiQL. The record contains the approximate creation date-time of the record in the stream, along with the New and Old images of the record. These records can be parsed using AWS Lambda or AWS Glue and stored in a data lake for analytical use-cases. &lt;/p&gt;

&lt;pre&gt;{
   "awsRegion": "us-west-2",
   "dynamodb": {
     "ApproximateCreationDateTime": 1606714671542,
     "Keys": {
       "Author": {
         "S": "James Patterson"
       },
       "Title": {
         "S": "The President Is Missing"
       }
     },
     "NewImage": {
       "Title": {
         "S": "The President Is Missing"
       },
       "Formats": {
         "M": {
           "Hardcover": {
             "S": "JSU4KGVU"
           }
         }
       },
       "Author": {
         "S": "James Patterson"
       },
       "Category": {
         "S": "Mystery"
       }
     },
     "OldImage": {
       "Title": {
         "S": "The President Is Missing"
       },
       "Formats": {
         "M": {
           "Hardcover": {
             "S": "JSU4KGVU"
           },
           "Paperback": {
             "S": "DY7F4CFX"
           }
         }
       },
       "Author": {
         "S": "James Patterson"
       },
       "Category": {
         "S": "Mystery"
       }
     },
     "SizeBytes": 254
   },
   "eventID": "bcaaf073-7e0d-49c2-818e-fe3cf7e5f18a",
   "eventName": "MODIFY",
   "userIdentity": null,
   "recordFormat": "application/json",
   "tableName": "Books",
   "eventSource": "aws:dynamodb"
 }&lt;/pre&gt;
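
&lt;p&gt;As a minimal sketch of the Lambda parsing step mentioned above: Kinesis delivers each record to Lambda base64-encoded, so a handler can decode the payload and pull out the fields of interest. The handler name and the selection of fields here are assumptions for illustration; the record layout follows the example above.&lt;/p&gt;

```python
import base64
import json

def handler(event, context):
    """Sketch of a Lambda handler for DynamoDB change records delivered
    via a Kinesis data stream. Handler name and selected fields are
    hypothetical; the payload layout matches the example record above."""
    changes = []
    for record in event.get("Records", []):
        # Kinesis delivers each payload base64-encoded under kinesis.data
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("eventName") == "MODIFY":
            changes.append({
                "table": payload.get("tableName"),
                "keys": payload["dynamodb"]["Keys"],
                "new_image": payload["dynamodb"].get("NewImage"),
            })
    return changes
```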

&lt;p&gt;To conclude, in this post I introduced you to PartiQL, which provides a SQL-compatible query language for Amazon DynamoDB. We also looked at the Export to S3 feature and how to create an end-to-end pipeline to query Amazon DynamoDB data in an Amazon S3 bucket using Amazon Athena. Finally, we looked at a real-time analytics use-case where you can enable streams to capture item-level changes in an Amazon DynamoDB table as Kinesis data streams.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dynamodb</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Transform AWS CloudTrail data using AWS Data Wrangler</title>
      <dc:creator>Anand</dc:creator>
      <pubDate>Sun, 20 Sep 2020 18:27:01 +0000</pubDate>
      <link>https://dev.to/anandp86/transform-aws-cloudtrail-data-using-aws-data-wrangler-1cmo</link>
      <guid>https://dev.to/anandp86/transform-aws-cloudtrail-data-using-aws-data-wrangler-1cmo</guid>
      <description>&lt;p&gt;AWS CloudTrail service captures actions taken by an IAM user, IAM role, APIs, SDKs and other AWS services. By default, AWS CloudTrail is enabled in your AWS account. You can create "trail" to record ongoing events which will be delivered in JSON format to an Amazon S3 Bucket of your choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6gz-Hrfm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.03.45-pm.png%3Fw%3D218" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6gz-Hrfm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.03.45-pm.png%3Fw%3D218" alt="" width="218" height="264"&gt;&lt;/a&gt;CloudTrail Dashboard&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XPT5lSLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.08.54-pm.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPT5lSLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.08.54-pm.png%3Fw%3D1024" alt="" width="880" height="477"&gt;&lt;/a&gt;Create Trail&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9SvyVpAf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.13.30-pm.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9SvyVpAf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.13.30-pm.png%3Fw%3D1024" alt="" width="880" height="696"&gt;&lt;/a&gt;Choose events to capture&lt;/p&gt;

&lt;p&gt;You can configure the trail to log read-write, read-only, or write-only data events for all current and future S3 buckets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2SGfT5ao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.32.56-pm-1.png%3Fw%3D846" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2SGfT5ao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.32.56-pm-1.png%3Fw%3D846" alt="" width="846" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, you have the option to log data events for Lambda functions. You can select all Regions and all functions, or specify a particular Lambda ARN or Region.&lt;/p&gt;



&lt;p&gt;The trail creates small, mostly KB-sized gzipped JSON files in the S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SORcAx6S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.22.46-pm.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SORcAx6S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-1.22.46-pm.png%3Fw%3D1024" alt="" width="880" height="272"&gt;&lt;/a&gt;Trail Log files in S3 Bucket&lt;/p&gt;

&lt;p&gt;You can select a file and use the "Select from" tab to view its content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iX_ApBBY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-17-at-6.20.28-pm.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iX_ApBBY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-17-at-6.20.28-pm.png%3Fw%3D1024" alt="" width="880" height="833"&gt;&lt;/a&gt;Select from S3 file&lt;/p&gt;

&lt;p&gt;Below is an example of a "PutObject" event on an S3 bucket.&lt;/p&gt;

&lt;pre&gt;{
    "eventVersion": "1.07",
    "userIdentity": {
        "type": "AWSService",
        "invokedBy": "s3.amazonaws.com"
    },
    "eventTime": "2020-09-12T23:53:22Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "PutObject",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "s3.amazonaws.com",
    "userAgent": "s3.amazonaws.com",
    "requestParameters": {
        "bucketName": "my-data-bucket",
        "Host": "s3.us-east-1.amazonaws.com",
        "key": "mydatabase/mytable/data-content.snappy.parquet"
    },
    "responseElements": null,
    "additionalEventData": {
        "SignatureVersion": "SigV4",
        "CipherSuite": "ECDHE-RSA-AES128-SHA",
        "bytesTransferredIn": 107886,
        "AuthenticationMethod": "AuthHeader",
        "x-amz-id-2": "Dg9gelyiPojDT00UJ+CI7MmmEyUhPRe1EAUtzQSs3kJAZ8JxMe+2IQ4f6wT2Kpd+Czih1Dc2SI8=",
        "bytesTransferredOut": 0
    },
    "requestID": "29C76F4BC75743BF",
    "eventID": "6973f9b1-1a7d-46d4-a48f-f2d91c80b2d3",
    "readOnly": false,
    "resources": [
        {
            "type": "AWS::S3::Object",
            "ARN": "arn:aws:s3:::my-data-bucket/mydatabase/mytable/data-content.snappy.parquet"
        },
        {
            "accountId": "xxxxxxxxxxxx",
            "type": "AWS::S3::Bucket",
            "ARN": "arn:aws:s3:::my-data-bucket"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": false,
    "recipientAccountId": "xxxxxxxxxxxx",
    "sharedEventID": "eb37214b-623b-43e6-876b-7088c7d0e0ee",
    "vpcEndpointId": "vpce-xxxxxxx",
    "eventCategory": "Data"
}&lt;/pre&gt;
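
&lt;p&gt;Since events like the one above are plain JSON, the fields most useful for auditing can be pulled out with a few lines of Python. This is a minimal sketch; the function name and the choice of fields are assumptions, the field paths follow the example event.&lt;/p&gt;

```python
import json

def summarize_s3_event(raw_event):
    """Sketch: extract audit-relevant fields from a CloudTrail S3 data
    event (structure as in the PutObject example above). The function
    name and field selection are hypothetical."""
    event = json.loads(raw_event) if isinstance(raw_event, str) else raw_event
    params = event.get("requestParameters") or {}
    return {
        "event_name": event.get("eventName"),
        "event_time": event.get("eventTime"),
        "bucket": params.get("bucketName"),
        "key": params.get("key"),
        "read_only": event.get("readOnly"),
    }
```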

&lt;p&gt;Under Event history, CloudTrail provides a useful feature to create an Athena table over the trail's Amazon S3 bucket, which you can use to query the data with standard SQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JEytbzRX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-17-at-5.31.33-pm.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JEytbzRX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-17-at-5.31.33-pm.png%3Fw%3D1024" alt="" width="880" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3A7M-jxY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.18.23-pm.png%3Fw%3D824" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3A7M-jxY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.18.23-pm.png%3Fw%3D824" alt="" width="824" height="828"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, depending on the duration and the events captured, CloudTrail creates lots of small files in S3, which can increase execution time when the data is queried from Athena. &lt;/p&gt;

&lt;p&gt;Moving ahead, I will show you how you can use &lt;a rel="noreferrer noopener" href="https://aws-data-wrangler.readthedocs.io/en/latest/what.html"&gt;AWS Data Wrangler&lt;/a&gt; and Pandas to perform the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query data from Athena into Pandas dataframe using AWS Data Wrangler.&lt;/li&gt;
&lt;li&gt;Transform the eventtime column from string datatype to datetime datatype.&lt;/li&gt;
&lt;li&gt;Extract year, month, and day columns from eventtime and add them to the dataframe.&lt;/li&gt;
&lt;li&gt;Write the dataframe to S3 in Parquet format with Hive partitions using AWS Data Wrangler.&lt;/li&gt;
&lt;li&gt;While writing the dataframe, create the table in the Glue catalog using AWS Data Wrangler.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this example, I have set up a SageMaker notebook with a &lt;a rel="noreferrer noopener" href="https://aws-data-wrangler.readthedocs.io/en/latest/install.html"&gt;Lifecycle configuration&lt;/a&gt;. Once you have the notebook open, you can use the conda_python3 kernel to work with AWS Data Wrangler.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;Import the required libraries&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import awswrangler as wr
import pandas as pd
pd.set_option('display.width', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_rows', None)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;Python function to execute the SQL in Athena using AWS Data Wrangler&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def execute_sql(sql, database, ctas=False):
    return wr.athena.read_sql_query(sql, database, ctas_approach=ctas)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;SQL query to get details related to S3 Events&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;s3ObjectSql = """
SELECT 
    useridentity.sessioncontext.sessionissuer.username as username,
    useridentity.sessioncontext.sessionissuer.type as type,
    useridentity.principalid as principalid,
    useridentity.invokedby as invokedby,
    eventname as event_name,
    eventtime,
    eventsource as event_source,
    awsregion as aws_region,
    sourceipaddress,
    eventtype as event_type,
    readonly as read_only,
    requestparameters
FROM cloudtrail_logs_cloudtrail_logs_traillogs
WHERE eventname in ('ListObjects', 'PutObject', 'GetObject') and eventtime &amp;gt; '2020-08-23'
"""&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;Execute the SQL and load the results into a Pandas dataframe&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;data = execute_sql(sql=s3ObjectSql, database='default')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZWkbTdVG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.40.38-pm.png%3Fw%3D384" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZWkbTdVG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.40.38-pm.png%3Fw%3D384" alt="" width="384" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;Count events per username&lt;/u&gt;&lt;/span&gt; (just for fun)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;data['username'].value_counts()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice that the eventtime column is of "string" datatype, which makes date transformations difficult. So here we will create a new column with datetime datatype and drop eventtime.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;String to Datetime conversion for eventtime column&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;data['event_time'] = pd.to_datetime(data['eventtime'], errors='coerce')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p8f4u2gW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.45.16-pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p8f4u2gW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.45.16-pm.png" alt="" width="142" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--chIXqrqq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.45.24-pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--chIXqrqq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-13-at-2.45.24-pm.png" alt="" width="115" height="305"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;data.drop('eventtime', axis=1, inplace=True)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let's extract year, month, and day columns from the event_time column and add them to the dataframe. With this change you can write the data back to S3 as Hive partitions.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;u&gt;Extract and add new fields to dataframe&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;data['year'] = data['event_time'].dt.year
data['month'] = data['event_time'].dt.month
data['day'] = data['event_time'].dt.day&lt;/code&gt;&lt;/pre&gt;
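
&lt;p&gt;Putting the last few steps together, the eventtime transformation can be sketched end to end on a toy dataframe. The sample values below are made up for illustration; the column names follow the snippets above.&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the Athena query result (values are illustrative)
data = pd.DataFrame({
    "event_name": ["PutObject", "GetObject"],
    "eventtime": ["2020-08-24T19:28:00Z", "2020-09-12T23:53:22Z"],
})

# String -> datetime, then drop the original string column
data["event_time"] = pd.to_datetime(data["eventtime"], errors="coerce")
data = data.drop("eventtime", axis=1)

# Derive the Hive partition columns
data["year"] = data["event_time"].dt.year
data["month"] = data["event_time"].dt.month
data["day"] = data["event_time"].dt.day
```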

&lt;p&gt;Now, using the AWS Data Wrangler s3.to_parquet API, you can write the data back to S3 in Parquet format, partitioned by year, month, and day. You can also add database and table parameters to write the metadata to the Athena/Glue catalog. Note that the database must already exist for the command to succeed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wr.s3.to_parquet(
    df=data,
    path='s3://my-bucket/s3_access_analyzer/cloudtrail/',
    dataset=True,
    partition_cols=['year', 'month', 'day'],
    database='default',  # Athena/Glue database
    table='cloudtrail' # Athena/Glue table
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can query Athena to view the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g3I5CZvt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-17-at-5.56.32-pm.png%3Fw%3D544" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g3I5CZvt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aprakash.files.wordpress.com/2020/09/screen-shot-2020-09-17-at-5.56.32-pm.png%3Fw%3D544" alt="" width="544" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The query took just 1.74 seconds to complete with 0 KB of data scanned. Now why 0 KB? Well, I will leave that for you to think about and answer :)&lt;/p&gt;

&lt;p&gt;To conclude, with AWS Data Wrangler you can easily and efficiently perform extract, transform, and load (ETL) tasks as shown above. It is well integrated with other AWS services and is actively updated with new features and enhancements. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>bigdata</category>
      <category>cloud</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Guide - AWS Glue and PySpark </title>
      <dc:creator>Anand</dc:creator>
      <pubDate>Sat, 19 Sep 2020 17:25:49 +0000</pubDate>
      <link>https://dev.to/anandp86/using-aws-glue-and-pyspark-56fi</link>
      <guid>https://dev.to/anandp86/using-aws-glue-and-pyspark-56fi</guid>
      <description>&lt;p&gt;In this post, I have penned down AWS Glue and PySpark functionalities which can be helpful when thinking of creating AWS pipeline and writing AWS Glue PySpark scripts.&lt;/p&gt;

&lt;p&gt;AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing.&lt;/p&gt;

&lt;p&gt;While creating an AWS Glue job, you can select between Spark, Spark Streaming, and Python shell. These jobs can run a script generated by AWS Glue, an existing script that you provide, or a new script authored by you. You can also select different monitoring options, job execution capacity, timeouts, delayed notification thresholds, and overridable and non-overridable parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-6.15.45-pm.png%3Fw%3D300" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-6.15.45-pm.png%3Fw%3D300" alt=""&gt;&lt;/a&gt;Glue Job Type and Glue Version&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-6.16.07-pm.png%3Fw%3D288" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-6.16.07-pm.png%3Fw%3D288" alt=""&gt;&lt;/a&gt;Script file name and other available options&lt;/p&gt;

&lt;p&gt;AWS recently launched Glue version 2.0, which features 10x faster Spark ETL job start times and reduces the minimum billing duration from 10 minutes to 1 minute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featuring-10x-faster-job-start-times-and-1-minute-minimum-billing-duration" rel="noreferrer noopener"&gt;https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featuring-10x-faster-job-start-times-and-1-minute-minimum-billing-duration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With AWS Glue you can create a development endpoint and configure SageMaker or Zeppelin notebooks to develop and test your Glue ETL scripts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-6.56.12-pm-1.png%3Fw%3D110" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-6.56.12-pm-1.png%3Fw%3D110" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I created a SageMaker notebook connected to the dev endpoint to author and test the ETL scripts. Depending on the language you are comfortable with, you can spin up the notebook accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-7.06.00-pm-1.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-7.06.00-pm-1.png%3Fw%3D1024" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's talk about some specific features and functionalities in AWS Glue and PySpark that can be helpful.&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;&lt;span&gt;Spark DataFrames&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database. You can create a DataFrame from an RDD, or from files in formats such as CSV, JSON, and Parquet.&lt;/p&gt;

&lt;p&gt;With the SageMaker Sparkmagic (PySpark) kernel notebook, a Spark session is created automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F07%2Fscreen-shot-2020-07-04-at-9.43.19-pm.png%3Fw%3D699" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F07%2Fscreen-shot-2020-07-04-at-9.43.19-pm.png%3Fw%3D699" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To create a DataFrame:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# from CSV files 
S3_IN = "s3://mybucket/train/training.csv"

csv_df = (
    spark.read.format("org.apache.spark.csv")
    .option("header", True)
    .option("quote", '"')
    .option("escape", '"')
    .option("inferSchema", True)
    .option("ignoreLeadingWhiteSpace", True)
    .option("ignoreTrailingWhiteSpace", True)
    .csv(S3_IN, multiLine=False)
)

# from PARQUET files 
S3_PARQUET="s3://mybucket/folder1/dt=2020-08-24-19-28/"

df = spark.read.parquet(S3_PARQUET)

# from JSON files
df = spark.read.json(S3_JSON)

# from multiline JSON file 
df = spark.read.json(S3_JSON, multiLine=True)&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;2. &lt;strong&gt;&lt;span&gt;GlueContext&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;GlueContext is the entry point for reading and writing DynamicFrames in AWS Glue. It wraps the Apache SparkSQL SQLContext object providing mechanisms for interacting with the Apache Spark platform.&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())&lt;/code&gt;&lt;/pre&gt;



&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F08%2Fscreen-shot-2020-08-21-at-8.03.09-pm.png%3Fw%3D1024" alt=""&gt;



&lt;p&gt;3. &lt;strong&gt;&lt;span&gt;DynamicFrame&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;AWS Glue DynamicFrames are similar to SparkSQL DataFrames. A DynamicFrame represents a distributed collection of data without requiring you to specify a schema. It can also be used to read and transform data that contains inconsistent values and types. &lt;/p&gt;



&lt;p&gt;A DynamicFrame can be created using the following methods –&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;create_dynamic_frame_from_rdd&lt;/em&gt; – created from an Apache Spark Resilient Distributed Dataset (RDD)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;create_dynamic_frame_from_catalog&lt;/em&gt; – created using a Glue catalog database and table name&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;create_dynamic_frame_from_options&lt;/em&gt; – created with the specified connection and format. Example – The connection type, such as Amazon S3, Amazon Redshift, and JDBC&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;DynamicFrames can be converted to and from DataFrames using .toDF() and fromDF().&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;#create DynamicFame from S3 parquet files
datasource0 = glueContext.create_dynamic_frame_from_options(
            connection_type="s3",
            connection_options = {
                "paths": [S3_location]
            },
            format="parquet",
            transformation_ctx="datasource0")

#create DynamicFame from glue catalog 
datasource0 = glueContext.create_dynamic_frame.from_catalog(
           database = "demo",
           table_name = "testtable",
           transformation_ctx = "datasource0")

#convert to spark DataFrame 
df1 = datasource0.toDF()

#convert to Glue DynamicFrame
df2 = DynamicFrame.fromDF(df1, glueContext , "df2")
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Further Read - &lt;a rel="noreferrer noopener" href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog"&gt;https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;4. &lt;strong&gt;&lt;span&gt;AWS Glue Job Bookmark &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;AWS Glue &lt;em&gt;Job bookmark&lt;/em&gt; helps process incremental data when rerunning the job on a scheduled interval, preventing reprocessing of old data.&lt;/p&gt;



&lt;p&gt;Further Read  - &lt;a href="https://aprakash.wordpress.com/2020/05/07/implementing-glue-etl-job-with-job-bookmarks/" rel="noopener noreferrer"&gt;https://aprakash.wordpress.com/2020/05/07/implementing-glue-etl-job-with-job-bookmarks/&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;5. &lt;strong&gt;&lt;span&gt;Write out data&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;The DynamicFrame of a transformed dataset can be written out to S3 as non-partitioned (the default) or partitioned. The "&lt;em&gt;partitionKeys&lt;/em&gt;" parameter can be specified in connection_options to write the data to S3 as partitioned. AWS Glue organizes these datasets in Hive-style partitions.&lt;/p&gt;



&lt;p&gt;In the code example below, the AWS Glue DynamicFrame is partitioned by year, month, day, and hour and written in Parquet format in Hive-style partitions to S3, for example:&lt;/p&gt;



&lt;p&gt;s3://bucket_name/table_name/year=2020/month=7/day=13/hour=14/part-000-671c.c000.snappy.parquet&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;S3_location = "s3://bucket_name/table_name"

datasink = glueContext.write_dynamic_frame_from_options(
    frame= data,
    connection_type="s3",
    connection_options={
        "path": S3_location,
        "partitionKeys": ["year", "month", "day", "hour"]
    },
    format="parquet",
    transformation_ctx ="datasink")&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Further Read - &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;6. &lt;strong&gt;&lt;span&gt;"glueparquet" format option&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;glueparquet is a performance-optimized Apache Parquet writer type for writing DynamicFrames. It computes and modifies the schema dynamically. &lt;/p&gt;



&lt;pre&gt;&lt;code&gt;datasink = glueContext.write_dynamic_frame_from_options(
               frame=dynamicframe,
               connection_type="s3",
               connection_options={
                  "path": S3_location,
                  "partitionKeys": ["year", "month", "day", "hour"]
               },
               format="glueparquet",
               format_options = {"compression": "snappy"},
               transformation_ctx ="datasink")&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Further Read - &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html" rel="noreferrer noopener"&gt;https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;7. &lt;strong&gt;&lt;span&gt;S3 Lister and other options for optimizing memory management &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;AWS Glue provides an optimized mechanism to list files on S3 while reading data into a DynamicFrame, which can be enabled by setting the additional_options parameter "&lt;em&gt;useS3ListImplementation&lt;/em&gt;" to True.&lt;/p&gt;



&lt;p&gt;Further Read - &lt;a href="https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;8. &lt;strong&gt;&lt;span&gt;Purge S3 path&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;&lt;em&gt;purge_s3_path&lt;/em&gt; is a nice option available to delete files from a specified S3 path recursively, based on a retention period or other available filters. As an example, suppose you are running an AWS Glue job that fully refreshes a table each day, writing the data to S3 with the naming convention &lt;em&gt;s3://bucket-name/table-name/dt=&amp;lt;date-time&amp;gt;&lt;/em&gt;. Based on the defined retention period, the Glue job itself can delete the old dt=&amp;lt;date-time&amp;gt; S3 folders. Another option is to set an S3 bucket lifecycle policy with a prefix.&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;#purge locations older than 3 days
print("Attempting to purge S3 path with retention set to 3 days.")
glueContext.purge_s3_path(
    s3_path=output_loc, 
    options={"retentionPeriod": 72})&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;You also have other options like purge_table, transition_table, and transition_s3_path available. The transition_table option transitions the storage class of the files stored on Amazon S3 for the specified catalog database and table.&lt;/p&gt;



&lt;p&gt;Further Read - &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_s3_path" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_s3_path&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;9. &lt;strong&gt;&lt;span&gt;Relationalize Class&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;The &lt;em&gt;Relationalize&lt;/em&gt; class can help flatten nested JSON at the outermost level. &lt;/p&gt;



&lt;p&gt;Further read - &lt;a href="https://aprakash.wordpress.com/2020/02/26/aws-glue-querying-nested-json-with-relationalize-transform/" rel="noopener noreferrer"&gt;https://aprakash.wordpress.com/2020/02/26/aws-glue-querying-nested-json-with-relationalize-transform/&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;10. &lt;strong&gt;&lt;span&gt;Unbox Class&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;The &lt;em&gt;Unbox&lt;/em&gt; class unboxes (reformats) a string field in a DynamicFrame into the specified format type, such as JSON.&lt;/p&gt;
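&lt;p&gt;The idea of unboxing a JSON string field can be illustrated with the standard library. unbox_json_field is a hypothetical sketch, not the Glue API:&lt;/p&gt;

```python
import json

def unbox_json_field(record, field):
    # Parse a JSON-encoded string field into structured data,
    # leaving the rest of the record untouched.
    out = dict(record)
    out[field] = json.loads(out[field])
    return out

row = {"id": 7, "payload": '{"status": "ok", "count": 2}'}
unboxed = unbox_json_field(row, "payload")
```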



&lt;p&gt;Further read - &lt;a href="https://aprakash.wordpress.com/2020/02/26/aws-glue-querying-nested-json-with-relationalize-transform/" rel="noopener noreferrer"&gt;https://aprakash.wordpress.com/2020/02/26/aws-glue-querying-nested-json-with-relationalize-transform/&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;11. &lt;strong&gt;&lt;span&gt;Unnest Class &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;The &lt;em&gt;Unnest&lt;/em&gt; class flattens nested objects to top-level elements in a DynamicFrame.&lt;/p&gt;



&lt;pre&gt;root
|-- id: string
|-- type: string
|-- content: map
|    |-- keyType: string
|    |-- valueType: string&lt;/pre&gt;



&lt;p&gt;With the content attribute/column being a map type, we can use the Unnest class to unnest each key into its own top-level column.&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;unnested = UnnestFrame.apply(frame=data_dynamic_dframe)
unnested.printSchema()&lt;/code&gt;&lt;/pre&gt;



&lt;pre&gt;root
|-- id: string
|-- type: string
|-- content.dateLastUpdated: string
|-- content.creator: string
|-- content.dateCreated: string
|-- content.title: string&lt;/pre&gt;



&lt;p&gt;12. &lt;strong&gt;&lt;span&gt;printSchema()&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;To print the schema of a Spark DataFrame or Glue DynamicFrame in tree format, use &lt;em&gt;printSchema()&lt;/em&gt;.&lt;/p&gt;



&lt;pre&gt;datasource0.printSchema()

root
|-- ID: int
|-- Name: string
|-- Identity: string
|-- Alignment: string
|-- EyeColor: string
|-- HairColor: string
|-- Gender: string
|-- Status: string
|-- Appearances: int
|-- FirstAppearance: choice
|    |-- int
|    |-- long
|    |-- string
|-- Year: int
|-- Universe: string
&lt;/pre&gt;

&lt;p&gt;13. &lt;strong&gt;&lt;span&gt;Fields Selection&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;select_fields&lt;/em&gt; can be used to select fields from Glue DynamicFrame.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# From DynamicFrame

datasource0.select_fields(["Status","HairColor"]).toDF().distinct().show()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F07%2Fscreen-shot-2020-07-05-at-12.14.07-am.png%3Fw%3D700" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F07%2Fscreen-shot-2020-07-05-at-12.14.07-am.png%3Fw%3D700" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To select fields from Spark Dataframe use "&lt;em&gt;select&lt;/em&gt;" - &lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# From Dataframe

datasource0_df.select(["Status","HairColor"]).distinct().show()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F07%2Fscreen-shot-2020-07-05-at-12.16.33-am.png%3Fw%3D661" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Faprakash.files.wordpress.com%2F2020%2F07%2Fscreen-shot-2020-07-05-at-12.16.33-am.png%3Fw%3D661" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;14. &lt;strong&gt;&lt;span&gt;Timestamp&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose the application writes data into DynamoDB and has a last_updated attribute/column. DynamoDB does not natively support a date/timestamp data type, so you could store it either as a String or as a Number. If stored as a number, it's usually epoch time - the number of seconds since 00:00:00 UTC on 1 January 1970. You would see something like "1598331963", which is 2020-08-25T05:06:03+00:00 in ISO 8601.&lt;/p&gt;
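&lt;p&gt;You can verify that mapping with just the Python standard library:&lt;/p&gt;

```python
from datetime import datetime, timezone

epoch_seconds = 1598331963                     # as stored in DynamoDB (Number)
as_utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
iso8601 = as_utc.isoformat()                   # '2020-08-25T05:06:03+00:00'
```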

&lt;p&gt;&lt;a rel="noreferrer noopener" href="https://www.unixtimestamp.com/index.php"&gt;https://www.unixtimestamp.com/index.php&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can you convert it to timestamp?&lt;/p&gt;

&lt;p&gt;When you read the data using AWS Glue DynamicFrame and view the schema, it will show it as "long" data type.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;root
|-- version: string
|-- item_id: string
|-- status: string
|-- event_type: string
|-- last_updated: long&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To convert the last_updated long value into a timestamp data type, you can use the below -&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyspark.sql.functions as f
import pyspark.sql.types as t

new_df = (
    df
        # from_unixtime expects seconds; dividing by 1000 assumes the stored
        # value is epoch milliseconds - remove it if the value is in seconds
        .withColumn("last_updated", f.from_unixtime(f.col("last_updated")/1000).cast(t.TimestampType()))
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;15. &lt;strong&gt;&lt;span&gt;Temporary View from Spark DataFrame&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to store the Spark DataFrame as a table and query it with Spark SQL, you can register it as a temporary view (available only within that Spark session) using &lt;em&gt;createOrReplaceTempView&lt;/em&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;df = spark.createDataFrame(
    [
        (1, ['a', 'b', 'c'], 90.00),
        (2, ['x', 'y'], 99.99),
    ],
    ['id', 'event', 'score'] 
)

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- event: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- score: double (nullable = true)

df.createOrReplaceTempView("example")

spark.sql("select * from example").show()

+---+---------+-----+
| id|    event|score|
+---+---------+-----+
|  1|[a, b, c]| 90.0|
|  2|   [x, y]|99.99|
+---+---------+-----+&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;16. &lt;strong&gt;&lt;span&gt;Extract element from ArrayType&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose from the above example, you want to create a new attribute/column to store only the last event. How would you do it? &lt;/p&gt;

&lt;p&gt;Use the &lt;em&gt;element_at&lt;/em&gt; function. If the column is an array, it returns the element at the given index; if the column is a map, it returns the value for the given key.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql.functions import element_at

newdf = df.withColumn("last_event", element_at("event", -1))

newdf.printSchema()
root
 |-- id: long (nullable = true)
 |-- event: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- score: double (nullable = true)
 |-- last_event: string (nullable = true)

newdf.show()
+---+---------+-----+----------+
| id|    event|score|last_event|
+---+---------+-----+----------+
|  1|[a, b, c]| 90.0|         c|
|  2|   [x, y]|99.99|         y|
+---+---------+-----+----------+&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;17. &lt;strong&gt;&lt;span&gt;explode&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;explode&lt;/em&gt; function in PySpark explodes an array or map column into rows. As an example, let's explode the "event" column from the above example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql.functions import explode

df1 = df.select(df.id,explode(df.event))

df1.printSchema()
root
 |-- id: long (nullable = true)
 |-- col: string (nullable = true)

df1.show()
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  x|
|  2|  y|
+---+---+&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;18. &lt;strong&gt;&lt;span&gt;getField&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a Struct type, if you want to get a field by name, you can use "&lt;em&gt;getField&lt;/em&gt;".&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyspark.sql.functions as f
from pyspark.sql import Row

df = spark.createDataFrame([Row(attributes=Row(Name='scott', Height=6.0, Hair='black')),
                            Row(attributes=Row(Name='kevin', Height=6.1, Hair='brown'))]
)

df.printSchema()
root
 |-- attributes: struct (nullable = true)
 |    |-- Hair: string (nullable = true)
 |    |-- Height: double (nullable = true)
 |    |-- Name: string (nullable = true)

df.show()
+-------------------+
|         attributes|
+-------------------+
|[black, 6.0, scott]|
|[brown, 6.1, kevin]|
+-------------------+

df1 = (df
      .withColumn("name", f.col("attributes").getField("Name"))
      .withColumn("height", f.col("attributes").getField("Height"))
      .drop("attributes")
      )

df1.show()
+-----+------+
| name|height|
+-----+------+
|scott|   6.0|
|kevin|   6.1|
+-----+------+&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;19. &lt;strong&gt;&lt;span&gt;startswith&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to find records based on a string prefix match, you can use "&lt;em&gt;startswith&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;In the below example I search for all records where the value of the description column starts with "[{".&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyspark.sql.functions as f

df.filter(f.col("description").startswith("[{")).show()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;20. &lt;strong&gt;&lt;span&gt;Extract year,  month,  day,  hour &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the common use cases is to write the AWS Glue DynamicFrame or Spark DataFrame to S3 in Hive-style partitions. To do so, you can extract year, month, day and hour from a timestamp column and use them as partition keys when writing the DynamicFrame/DataFrame to S3.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyspark.sql.functions as f

df2 = (raw_df
        .withColumn('year', f.year(f.col('last_updated')))
        .withColumn('month', f.month(f.col('last_updated')))
        .withColumn('day', f.dayofmonth(f.col('last_updated')))
        .withColumn('hour', f.hour(f.col('last_updated')))            
        )&lt;/code&gt;&lt;/pre&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>bigdata</category>
      <category>pyspark</category>
    </item>
  </channel>
</rss>
