<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neylson Crepalde</title>
    <description>The latest articles on DEV Community by Neylson Crepalde (@neylsoncrepalde).</description>
    <link>https://dev.to/neylsoncrepalde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F897876%2F3db842a6-2dc8-428d-b1a5-8bca0cf32963.jpeg</url>
      <title>DEV Community: Neylson Crepalde</title>
      <link>https://dev.to/neylsoncrepalde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neylsoncrepalde"/>
    <language>en</language>
    <item>
      <title>How to access data from other AWS account using Athena</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Sat, 24 Jun 2023 03:36:29 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-access-data-from-other-aws-account-using-athena-32ie</link>
      <guid>https://dev.to/aws-builders/how-to-access-data-from-other-aws-account-using-athena-32ie</guid>
      <description>&lt;p&gt;If you find yourself collaborating with a Data Product team, there's a high probability that you'll encounter the need to access data residing in an S3 bucket located in a different account. While this task may not be overly complicated, it isn't always as straightforward as one might hope, leading many colleagues to seek guidance on the matter. In the following article, I'll provide you with a comprehensive guide on how to effortlessly establish cross-account access between Athena and S3 in just two simple steps. So, without further ado, let's delve into the details!&lt;/p&gt;

&lt;h2&gt;Step 1: Bucket Policy in Source Account&lt;/h2&gt;

&lt;p&gt;The first thing you need to do is create a bucket policy in the source account (where the data in S3 actually lives) that allows the "client" account to read data. Go to the bucket in S3, click on the "Permissions" tab and edit the bucket policy as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Id": "cross-account-bucket-policy-ney",
    "Statement": [
        {
            "Sid": "CrossAccountPermission",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::&amp;lt;CLIENT-ACCOUNT-NUMBER&amp;gt;:root"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::&amp;lt;BUCKET-NAME&amp;gt;",
                "arn:aws:s3:::&amp;lt;BUCKET-NAME&amp;gt;/silver/titanic/*"
            ]
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This bucket policy allows all users within the client account to access this bucket and get objects within the &lt;code&gt;/silver/titanic/&lt;/code&gt; folders. If you want to restrict access to a specific user, you can replace &lt;code&gt;root&lt;/code&gt; with &lt;code&gt;user/&amp;lt;USERNAME&amp;gt;&lt;/code&gt;, for instance &lt;code&gt;user/neylson.crepalde&lt;/code&gt;.&lt;/p&gt;
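
&lt;p&gt;If you prefer to apply the policy programmatically, here is a minimal boto3 sketch of the user-scoped variant. The bucket name and account number are placeholders, and the username is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: apply the bucket policy from the source account with boto3.
# &amp;lt;BUCKET-NAME&amp;gt; and &amp;lt;CLIENT-ACCOUNT-NUMBER&amp;gt; are placeholders, not real values.
import json

import boto3

bucket = "&amp;lt;BUCKET-NAME&amp;gt;"
policy = {
    "Version": "2012-10-17",
    "Id": "cross-account-bucket-policy-ney",
    "Statement": [{
        "Sid": "CrossAccountPermission",
        "Effect": "Allow",
        # Scope access to a single user instead of the whole client account
        "Principal": {"AWS": "arn:aws:iam::&amp;lt;CLIENT-ACCOUNT-NUMBER&amp;gt;:user/neylson.crepalde"},
        "Action": [
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:ListBucketMultipartUploads",
            "s3:ListMultipartUploadParts",
            "s3:AbortMultipartUpload",
        ],
        "Resource": [
            f"arn:aws:s3:::{bucket}",
            f"arn:aws:s3:::{bucket}/silver/titanic/*",
        ],
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;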

&lt;h2&gt;Step 2: Create an external table in Athena in the Client Account&lt;/h2&gt;

&lt;p&gt;In the client account, navigate to the Athena console and open a new query editor. Run a SQL statement to create an external table pointing to the S3 bucket in the source account. In our case, we are testing with the (very famous) TITANIC dataset partitioned by one of its columns, &lt;code&gt;pclass&lt;/code&gt;, stored as a &lt;a href="https://docs.delta.io/latest/delta-intro.html" rel="noopener noreferrer"&gt;delta table&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTERNAL TABLE `titanicdelta`(
  `passengerid` int, 
  `survived` int, 
  `name` string, 
  `sex` string, 
  `age` double)
PARTITIONED BY ( 
  `pclass` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://&amp;lt;BUCKET-NAME&amp;gt;/silver/titanic/_symlink_format_manifest'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Because this data is partitioned, after creating the external table you have to load the data partitions with the following SQL command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;MSCK REPAIR TABLE `titanicdelta`
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;And you're done! Now, if you query your data in the client account:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkqdjk53qi1yqgsz7873.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkqdjk53qi1yqgsz7873.png" alt="SQL query"&gt;&lt;/a&gt;&lt;/p&gt;
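
&lt;p&gt;If you want to run the same check programmatically from the client account, here is a minimal boto3 sketch. It assumes the table lives in the &lt;code&gt;default&lt;/code&gt; database, and the query-results bucket is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: query the cross-account table with boto3.
# The database name and results bucket below are assumptions/placeholders.
import time

import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString="SELECT * FROM titanicdelta LIMIT 10",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://&amp;lt;QUERY-RESULTS-BUCKET&amp;gt;/athena/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then fetch the rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;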

&lt;p&gt;By following these two straightforward steps, you can establish a secure and efficient cross-account connection between Athena and S3. This enables you and your Data Product team to access and analyze data stored in a remote S3 bucket effortlessly. So, the next time you encounter the need to access data from another account, you can confidently navigate the process and achieve your goals without any hassle.&lt;/p&gt;

&lt;p&gt;Remember, fostering collaboration and enabling seamless data access across different accounts is crucial for efficient and streamlined workflows within a Data Product team. By mastering this skill, you can enhance your productivity and contribute to the success of your projects.&lt;/p&gt;

</description>
      <category>crossaccount</category>
      <category>athena</category>
      <category>iam</category>
    </item>
    <item>
      <title>Data Governance Hands On with Amazon DataZone</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Mon, 22 May 2023 11:47:13 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-governance-hands-on-with-amazon-datazone-3k5o</link>
      <guid>https://dev.to/aws-builders/data-governance-hands-on-with-amazon-datazone-3k5o</guid>
      <description>&lt;p&gt;At re:Invent 2022, AWS announced its new data governance solution, Amazon DataZone. Although the tool is currently in preview, it is now available in the US East (Northern Virginia), US West (Oregon) and Europe (Ireland) regions.&lt;/p&gt;

&lt;p&gt;Data Governance is one of the hottest topics on the market today. Companies across the international market have pointed out problems in this area, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient data cataloging;&lt;/li&gt;
&lt;li&gt;Discovery and documentation of data that gives users autonomy for decision making;&lt;/li&gt;
&lt;li&gt;Data Literacy — sharing data knowledge across the organization allowing the operation to become increasingly data driven;&lt;/li&gt;
&lt;li&gt;Data Quality — Correct, reliable and consistent data;&lt;/li&gt;
&lt;li&gt;Data Availability — Data available at all times and failsafe processing and serving pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In response, a pool of tools appeared on the market with features that cover some of the challenges cited, especially those related to data cataloging. &lt;strong&gt;Informatica's&lt;/strong&gt; tool is perhaps the best known among the licensed options. Among the open source tools, I highlight &lt;strong&gt;Data Hub&lt;/strong&gt; (&lt;a href="http://www.datahubproject.io" rel="noopener noreferrer"&gt;www.datahubproject.io&lt;/a&gt;) developed at LinkedIn, &lt;strong&gt;Open Metadata&lt;/strong&gt; (&lt;a href="https://open-metadata.org/" rel="noopener noreferrer"&gt;https://open-metadata.org/&lt;/a&gt;) and &lt;strong&gt;Amundsen&lt;/strong&gt; (&lt;a href="https://www.amundsen.io" rel="noopener noreferrer"&gt;https://www.amundsen.io&lt;/a&gt;) powered by Lyft. In addition to cataloging and discovering data artifacts, these tools provide a view of data lineage, include technical documentation and business terms, and build relationships between data artifacts. It is also possible to register data owners (the people responsible for the data) in those tools. This greatly facilitates the access request and evaluation process (which today is a major bottleneck).&lt;/p&gt;

&lt;h1&gt;Amazon DataZone&lt;/h1&gt;

&lt;p&gt;I always tell my friends: "It was about time for AWS to launch its data governance tool!". DataZone comes with the main features needed for a good data catalog, in addition to security features and access management integrated into the AWS environment. Let's take a look.&lt;/p&gt;

&lt;p&gt;The first step in using DataZone is the creation of a data domain (oriented to a business vertical, for example, sales, logistics, finance). It is worth noting that AWS took great care in showing all the access policies granted to DataZone, as the tool will have great power in the environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgw90ooqm97241jog7ax0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgw90ooqm97241jog7ax0.png" alt="View of the necessary accesses to the DataZone. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Policy details are available by clicking &lt;em&gt;View permission details&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdpwhdenxariz14kjie3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdpwhdenxariz14kjie3.png" alt="Details of DataZone access policies. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, you can leverage IAM Identity Center (formerly AWS SSO) for granting domain access permissions. However, the domain must be deployed in the same region as the Identity Center. In my case, as you can see, this was not possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpa3okhptry0geh0ffe1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpa3okhptry0geh0ffe1f.png" alt="Access management configuration. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few seconds to create the domain, the access link becomes available. The tool's interface impresses with a beautiful design. The next step is to create a project within the existing domain. The tool already suggests a default profile that has a native connection with Athena, AWS Glue and S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75o65jab3dmnj6szcjob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75o65jab3dmnj6szcjob.png" alt="DataZone project creation. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the project is ready, we will publish data to be queried in the tool. When the project was created, DataZone automatically created two databases in Amazon Athena, one with the _pub_db suffix and one with the _sub_db suffix. The "pub" database is used by data-producing teams to publish their tables. In my case, I already had some Glue crawlers configured to automatically map tables in S3 - the 3 tables from Brazil's Higher Education Census (Censo da Educação Superior) 2019 and one table from Brazil's National Student Performance Exam 2017, all public datasets from the Ministry of Education. I just edited the crawlers so that these tables would land in the "pub" database. After that, it is necessary to publish the data in DataZone. In the project interface, we click on Publish Data and see the screen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjgo2rf11nchaz3jdfut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjgo2rf11nchaz3jdfut.png" alt="Data publication. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;
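
&lt;p&gt;Repointing an existing crawler at the "pub" database can also be done with a quick boto3 call. A minimal sketch, with a hypothetical crawler name and database name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: point an existing Glue crawler at the DataZone "pub" database.
# Both names below are placeholders for your own resources.
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="censo-superior-crawler",        # hypothetical crawler name
    DatabaseName="sales_project_pub_db",  # the DataZone-created "pub" database
)
glue.start_crawler(Name="censo-superior-crawler")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;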

&lt;p&gt;After filling in the settings, and once the job has executed successfully, we can access the ingested tables in the "publishing" menu on the project page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far3x9mt98si36o9nttvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far3x9mt98si36o9nttvq.png" alt="Menu of published tables. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that all tables are still marked as drafts, that is, they are not yet active and visible in the catalog. By clicking on the table name, it is possible to view general information such as the location in S3 where the data is stored, region, data format, table schema and subscribers (if any). On the schema screen, we can edit the columns, giving them a more readable name (the real name of the column is not changed and continues to be displayed in the interface) and a detailed description. A "readable" title and description can also be added to the table. After editing, it is necessary to click on &lt;em&gt;set asset to active&lt;/em&gt; so that the table can be found by other users of the catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ntm9tk801d8uxyasmyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ntm9tk801d8uxyasmyy.png" alt="Table schema with readable column names and description. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After activating the tables, they become available on the Data catalog page and also via search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltmy4ugsv7aprqu1b4rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltmy4ugsv7aprqu1b4rb.png" alt="Tables cataloged in the Sales domain. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To request access to these artifacts, users registered in another project can open a subscription request as consumers (subscribers). From there, data publishers can approve requests and set granular per-table permissions for their consumption.&lt;/p&gt;

&lt;h1&gt;Final evaluation&lt;/h1&gt;

&lt;p&gt;DataZone is still in preview and is not intended by Amazon for production use. The tool already has several important features for data governance but there is still a lot to develop. Below, I write down some pros and cons of the tool so far.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy integration with the AWS environment;&lt;/li&gt;
&lt;li&gt;Integrated security and access management with AWS Identity Center or IAM;&lt;/li&gt;
&lt;li&gt;Easy metadata ingestion;&lt;/li&gt;
&lt;li&gt;Creation of automated projects with CloudFormation (transparent to the user);&lt;/li&gt;
&lt;li&gt;Efficient data search;&lt;/li&gt;
&lt;li&gt;Glossary of business terms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The tool still does not have data lineage implemented, a very important feature to understand the construction of KPIs;&lt;/li&gt;
&lt;li&gt;It does not have integration with any data quality framework;&lt;/li&gt;
&lt;li&gt;It is not possible to register other data artifacts such as dashboards, charts, pipelines, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon DataZone is still a tool under development, and it has enormous potential. I look forward to the next development steps of this promising tool.&lt;/p&gt;

</description>
      <category>datagovernance</category>
      <category>datacatalog</category>
      <category>datazone</category>
      <category>beginners</category>
    </item>
    <item>
      <title>We need to talk about Data Contracts</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Fri, 17 Mar 2023 15:02:36 +0000</pubDate>
      <link>https://dev.to/neylsoncrepalde/we-need-to-talk-about-data-contracts-1b1o</link>
      <guid>https://dev.to/neylsoncrepalde/we-need-to-talk-about-data-contracts-1b1o</guid>
      <description>&lt;p&gt;The term &lt;em&gt;Data Contracts&lt;/em&gt; is the latest buzzword in the data world and has been heavily explored in publications around the world. Although the subject has not yet reached Brazil with the force it deserves, the pain that it proposes to solve is already very much present in the market. But what are we really talking about and what is this pain?&lt;/p&gt;

&lt;h2&gt;Seeking to be Data Driven&lt;/h2&gt;

&lt;p&gt;In recent years, organizations in the Brazilian market have made a great effort to become more and more &lt;em&gt;Data Driven&lt;/em&gt;. In the beginning, the aim was to gather the data available in the company and mobilize it to bring intelligence to day-to-day decision-making.&lt;/p&gt;

&lt;p&gt;After overcoming this challenge, the market realized that, although it managed to use the data, the data was unstructured, disorganized and siloed in different systems. At this stage, the priority became building an environment that would make it possible to work with very large volumes of data. It was also important that this architecture allowed efficient and fast scaling of resources when necessary and that it brought operational efficiency to the company's processes, making pipelines more automated and performant. It was critical, at this stage, to make data from different sources and in different formats able to "talk to each other", giving the user a consolidated view of the phenomenon underlying their decisions.&lt;/p&gt;

&lt;p&gt;At the time of writing this article, I realize that the market has already achieved a certain maturity in data architectures. Companies' Data Lakes and Data Mesh structures are already running in production with data pipelines that can deliver data in a consolidated manner to the business user. HOWEVER, there is an important consequence here:&lt;/p&gt;

&lt;p&gt;As organizations move towards being more data driven, they are, in fact, anchoring their business operation on data. This makes the architecture enormously critical: if there is any type of failure, or data becomes unavailable somehow, the business operation can grind to a halt (!!!).&lt;/p&gt;

&lt;p&gt;Unfortunately, data unavailability is a very common situation in the daily lives of data teams. It is very common for these teams to receive a distressed call from a business user saying that data is not available, has not been updated, is blank or something like that. This is amplified when the organization is large and complex, with several teams &lt;strong&gt;producing&lt;/strong&gt; and &lt;strong&gt;consuming&lt;/strong&gt; data. In such cases, the process's failure points multiply. According to &lt;a href="https://www.linkedin.com/in/barrmoses/" rel="noopener noreferrer"&gt;Barr Moses&lt;/a&gt;, CEO of Monte Carlo, among the top data challenges are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Data pipelines constantly break and create quality and usability issues."&lt;/li&gt;
&lt;li&gt;"There is a communication chasm between service implementers, data engineers and data consumers" (Check out the original article &lt;a href="https://barrmoses.medium.com/implementing-data-contracts-7-key-learnings-d214a5947d5e" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;high availability&lt;/strong&gt; of data, therefore, presents itself as a first-order problem to be solved. There are other very important issues in this space such as security, access management, quality… but high availability takes on enormous relevance in this context. What to do?&lt;/p&gt;

&lt;h2&gt;What are Data Contracts&lt;/h2&gt;

&lt;p&gt;Data contracts emerged as a proposed solution to the aforementioned problem. The term gained prominence in the market with &lt;a href="https://www.linkedin.com/in/chad-sanderson/" rel="noopener noreferrer"&gt;Chad Sanderson&lt;/a&gt; in August 2022 in his text &lt;a href="https://dataproducts.substack.com/p/the-rise-of-data-contracts" rel="noopener noreferrer"&gt;The Rise of Data Contracts&lt;/a&gt;. In this article, he poses the problem and proposes the concept of data contracts as "API-like agreements between Software Engineers who own services and Data Consumers that understand how the business works in order to generate well-modeled, high-quality, trusted, real-time data".&lt;/p&gt;

&lt;p&gt;Maggie Hays comments that a data contract needs to define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"what data needs to move from a (producer’s) source to a (consumer’s) destination"&lt;/li&gt;
&lt;li&gt;"the shape of that data, its schema, and semantics"&lt;/li&gt;
&lt;li&gt;"expectations around availability and data quality"&lt;/li&gt;
&lt;li&gt;"details about contract violation(s) and enforcement"&lt;/li&gt;
&lt;li&gt;"how (and for how long) the consumer will use the data" (Check out the original article &lt;a href="https://blog.datahubproject.io/the-what-why-and-how-of-data-contracts-278aa7c5f294" rel="noopener noreferrer"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point that most calls my attention is that the idea of data contracts does not reside only in the clear establishment of data delivery and consumption criteria; it is also concerned with &lt;strong&gt;ensuring&lt;/strong&gt;, in an automated way, that the contract is not breached, breaking entire data pipelines and generating losses for the business operation.&lt;/p&gt;
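
&lt;p&gt;To make that enforcement idea concrete, here is a purely illustrative Python sketch (not any vendor's API): a contract declared as code, plus an automated check that fails the pipeline run before a breach propagates downstream. All names here are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: a data contract as a declared schema plus an
# automated check that stops the pipeline before a breach propagates.
from dataclasses import dataclass


@dataclass
class DataContract:
    producer: str
    consumer: str
    schema: dict              # column name to expected Python type
    max_null_fraction: float  # availability/quality expectation


orders_contract = DataContract(
    producer="checkout-service",   # hypothetical producer
    consumer="sales-analytics",    # hypothetical consumer
    schema={"order_id": int, "amount": float, "created_at": str},
    max_null_fraction=0.01,
)


def enforce(contract: DataContract, rows: list) -&amp;gt; None:
    """Raise (breaking the pipeline run) if the batch violates the contract."""
    for column, expected_type in contract.schema.items():
        values = [row.get(column) for row in rows]
        nulls = sum(v is None for v in values)
        if nulls / max(len(rows), 1) &amp;gt; contract.max_null_fraction:
            raise ValueError(f"Contract breach: too many nulls in '{column}'")
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            raise ValueError(f"Contract breach: wrong type in '{column}'")


enforce(orders_contract, [{"order_id": 1, "amount": 9.9, "created_at": "2023-03-17"}])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;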

&lt;h2&gt;My 2 cents on the matter&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A4800%2Fformat%3Awebp%2F1%2AU4YP9lqGwFD4FD87JW5FKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A4800%2Fformat%3Awebp%2F1%2AU4YP9lqGwFD4FD87JW5FKg.png" alt="Data Lineage"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Source: &lt;a href="https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4" rel="noopener noreferrer"&gt;https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to what has already been exhaustively argued, I would like to add one more feature necessary for a good implementation of data contracts: &lt;strong&gt;the visualization of the whole path of the data and its points of failure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's easy to argue that the feature I'm talking about is nothing more than data lineage, a feature available in some data catalog solutions on the market. However, lineage (although critical to a good data governance strategy) is just one part. The view that I believe is useful for data contracts would also show the &lt;strong&gt;process's points of failure&lt;/strong&gt; and &lt;strong&gt;whether these checkpoints are OK or a failure was detected&lt;/strong&gt;: a consolidated view of all the contracts that permeate the organization, a holistic view of the process.&lt;/p&gt;

&lt;p&gt;In addition to functioning as a monitor of the entire data flow in the company, this view would be extremely important in the culture and data literacy strategy, making the entire process explicit to stakeholders.&lt;/p&gt;

&lt;h2&gt;Perspectives for the future&lt;/h2&gt;

&lt;p&gt;At this moment there are some published ideas for implementing data contracts. Some use &lt;a href="https://dataproducts.substack.com/p/an-engineers-guide-to-data-contracts?utm_source=substack&amp;amp;utm_campaign=post_embed&amp;amp;utm_medium=web" rel="noopener noreferrer"&gt;Kafka's schema registry and CDC as enforcement&lt;/a&gt; in an event-oriented architecture, others propose using &lt;a href="https://dataproducts.substack.com/p/data-contracts-for-the-warehouse" rel="noopener noreferrer"&gt;dbt in a batch fashion&lt;/a&gt;... but there is still no clear definition or consolidated method to implement the concept.&lt;/p&gt;

&lt;p&gt;The expectation regarding this feature, however, is huge. The promise of accelerating the production and strategic mobilization of data along with a significant improvement in quality and availability brought this concept to the market spotlight. The forecast is that in the coming months the first products in this area will begin to emerge.&lt;/p&gt;

</description>
      <category>datacontracts</category>
      <category>datagovernance</category>
      <category>dataarchitecture</category>
    </item>
    <item>
      <title>How to run Amazon EMR Serverless with --packages flag</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Thu, 18 Aug 2022 00:57:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-run-amazon-emr-serverless-with-packages-flag-1p95</link>
      <guid>https://dev.to/aws-builders/how-to-run-amazon-emr-serverless-with-packages-flag-1p95</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39"&gt;this previous post&lt;/a&gt;, we showed how to run Delta Lake on Amazon EMR Serverless. Since then, a new release was out (6.7.0) with the &lt;code&gt;--packages&lt;/code&gt; flag implemented . This helps us getting things done with spark a lot easier. Yet, &lt;code&gt;--packages&lt;/code&gt; flag requires some extra networking setup that most of Data Scientists and Engineers are not familiar with. Our goal is to show step by step how to do it.&lt;/p&gt;

&lt;h1&gt;First, some concept explanations&lt;/h1&gt;

&lt;p&gt;When using Spark with Java dependencies, we have two options: (1) build &lt;code&gt;.jar&lt;/code&gt; files and insert them manually into the cluster, or (2) pass the dependencies to the &lt;code&gt;--packages&lt;/code&gt; flag so Spark can automatically download them from Maven. Since release 6.7.0 of EMR Serverless, this flag is available for use.&lt;/p&gt;

&lt;p&gt;The problem is that the Spark cluster must reach the internet to download packages from Maven. Amazon EMR Serverless, by default, lives outside any VPC and so cannot reach the internet. To fix that, you must create your EMR application inside a VPC. However, EMR applications can only be created in &lt;strong&gt;private subnets&lt;/strong&gt; which (by the way...) &lt;strong&gt;don't reach the internet&lt;/strong&gt; and &lt;strong&gt;cannot reach S3&lt;/strong&gt; 😭... How do we fix this?&lt;/p&gt;

&lt;h1&gt;Step one: networking&lt;/h1&gt;

&lt;p&gt;The diagram below shows the whole network structure that is necessary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c5idpkw3xfwqu7f62sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c5idpkw3xfwqu7f62sx.png" alt="Network diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is easily created on AWS in the VPC interface. Click the &lt;strong&gt;Create VPC&lt;/strong&gt; button and select &lt;strong&gt;VPC and more&lt;/strong&gt;. AWS does the heavy lifting and provides a design for a VPC with 2 public subnets, 2 private subnets, an internet gateway, the necessary route tables and an S3 endpoint (so resources inside the VPC can reach S3).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75103x0efd0axcpiixuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75103x0efd0axcpiixuq.png" alt="Create VPC"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can set the &lt;strong&gt;Number of availability zones&lt;/strong&gt; to 1 if you want, but for high availability you should work with at least 2 AZs.&lt;/p&gt;

&lt;p&gt;Next, you need to make sure you select at least one &lt;strong&gt;NAT Gateway&lt;/strong&gt;, which is responsible for letting private subnets reach the internet. Below is the screen with the final setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wu9bxllcszfxbpvrqg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wu9bxllcszfxbpvrqg8.png" alt="Final VPC configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hit &lt;strong&gt;create VPC&lt;/strong&gt; and we're done with networking. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kjxfktepf5qt52u9mbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kjxfktepf5qt52u9mbh.png" alt="Creating network resources"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last thing is to create a Security Group that allows outbound traffic to the internet. Go back to VPC in AWS and click Security Groups in the left panel. Then, click &lt;strong&gt;Create security group&lt;/strong&gt;. Name your security group, clear the pre-selected VPC and choose the one you have just created. By default, security groups don't allow any inbound traffic and allow all outbound traffic. We can leave it that way. Create the security group &lt;em&gt;et voilà&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd8y59yhbrzw1mqj9u37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd8y59yhbrzw1mqj9u37.png" alt="Create security group"&gt;&lt;/a&gt;&lt;/p&gt;
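
&lt;p&gt;The same security group can be created with a minimal boto3 sketch; the VPC id below is a placeholder for the one you just created:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: the same security group via boto3 (VPC id is a placeholder).
# New security groups allow all outbound and no inbound traffic by default,
# which is exactly what we need here.
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_security_group(
    GroupName="emr-serverless-sg",
    Description="Outbound-only SG for EMR Serverless",
    VpcId="vpc-0123456789abcdef0",  # placeholder: your new VPC
)
print("Security group id:", response["GroupId"])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;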

&lt;h1&gt;Step two: IAM Roles and Policies&lt;/h1&gt;

&lt;p&gt;You need two roles: a Service Linked Role and another role that gives permission to access S3 and Glue. We have already discussed that &lt;a href="https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39"&gt;in this previous post&lt;/a&gt; in the &lt;strong&gt;Setup - Authentication&lt;/strong&gt; section. Check it out. We will also need a dataset to work with. The famous &lt;em&gt;Titanic&lt;/em&gt; dataset should do it. You can download it &lt;a href="https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
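
&lt;p&gt;If you want to stage the dataset in S3 from code, here is a minimal sketch; the bucket name and key are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: download the Titanic CSV and stage it in your bucket.
# The bucket name and key are placeholders; the URL is from the post above.
import boto3
import requests

URL = ("https://raw.githubusercontent.com/neylsoncrepalde/"
       "titanic_data_with_semicolon/main/titanic.csv")

data = requests.get(URL, timeout=30).content
boto3.client("s3").put_object(
    Bucket="&amp;lt;YOUR-BUCKET&amp;gt;",
    Key="titanic/titanic.csv",
    Body=data,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;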

&lt;h1&gt;Step three: Create EMR Studio and an EMR Serverless Application&lt;/h1&gt;

&lt;p&gt;First, we must create an EMR Studio. If you don't have any studios created yet, this is very straightforward. After clicking Get started on the EMR Serverless home page, you can click to create a studio automatically.&lt;/p&gt;

&lt;p&gt;Second, you have to create an EMR Serverless application. Set a name and (remember!) choose release 6.7.0. To set up networking, you have to check &lt;strong&gt;Choose custom settings&lt;/strong&gt; and scroll down to &lt;strong&gt;Network connections&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv8pc9g487sz1emrqkrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv8pc9g487sz1emrqkrb.png" alt="Create application"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Network connections choose the VPC you created, the two private subnets and the security group. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb218r7jniicsr9zyaiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb218r7jniicsr9zyaiw.png" alt="EMR application networking configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Step four: Spark code&lt;/h1&gt;

&lt;p&gt;Now we prepare a simple PySpark script to simulate some modifications to the dataset (we will insert two new passengers - Ney and Sarah - and update information on two passengers who were presumed dead but found alive, Mr. Owen Braund and Mr. William Allen). Below is the code to do that.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import functions as f
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

from delta.tables import *

print("Reading CSV file from S3...")

schema = "PassengerId int, Survived int, Pclass int, Name string, Sex string, Age double, SibSp int, Parch int, Ticket string, Fare double, Cabin string, Embarked string"
df = spark.read.csv(
    "s3://&amp;lt;YOUR-BUCKET&amp;gt;/titanic", 
    header=True, schema=schema, sep=";"
)

print("Writing titanic dataset as a delta table...")
df.write.format("delta").save("s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")

print("Updating and inserting new rows...")
new = df.where("PassengerId IN (1, 5)")
new = new.withColumn("Survived", f.lit(1))
newrows = [
    (892, 1, 1, "Sarah Crepalde", "female", 23.0, 1, 0, None, None, None, None),
    (893, 0, 1, "Ney Crepalde", "male", 35.0, 1, 0, None, None, None, None)
]
newrowsdf = spark.createDataFrame(newrows, schema=schema)
new = new.union(newrowsdf)

print("Create a delta table object...")
old = DeltaTable.forPath(spark, "s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")


print("UPSERT...")
# UPSERT
(
    old.alias("old")
    .merge(new.alias("new"), 
    "old.PassengerId = new.PassengerId"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

print("Checking if everything is ok")
print("New data...")

(
    spark.read.format("delta")
    .load("s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")
    .where("PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888")
    .show()
)

print("Old data - with time travel")
(
    spark.read.format("delta")
    .option("versionAsOf", "0")
    .load("s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")
    .where("PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888")
    .show()
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This &lt;code&gt;.py&lt;/code&gt; file should be uploaded to S3.&lt;/p&gt;

&lt;h1&gt;Step five: GO!&lt;/h1&gt;

&lt;p&gt;Now, we submit a job for execution. We can do it with the AWS CLI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

aws emr-serverless start-job-run &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--name&lt;/span&gt; Delta-Upsert &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--application-id&lt;/span&gt; &amp;lt;YOUR-APPLICATION-ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--execution-role-arn&lt;/span&gt; arn:aws:iam::&amp;lt;ACCOUNT-NUMBER&amp;gt;:role/EMRServerlessJobRole &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--job-driver&lt;/span&gt; &lt;span class="s1"&gt;'{
  "sparkSubmit": {
    "entryPoint": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/emrserverless_delta_titanic.py", 
    "sparkSubmitParameters": "--packages io.delta:delta-core_2.12:2.0.0"
  }
}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--configuration-overrides&lt;/span&gt; &lt;span class="s1"&gt;'{
"monitoringConfiguration": {
  "s3MonitoringConfiguration": {
    "logUri": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/emr-serverless-logs/"} 
  } 
}'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
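
&lt;p&gt;While the job runs, you can also poll its state from the terminal. A minimal sketch with the AWS CLI, assuming the job run ID returned by &lt;code&gt;start-job-run&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Check the state of the submitted job run (e.g. RUNNING, SUCCESS, FAILED)
aws emr-serverless get-job-run \
  --application-id &amp;lt;YOUR-APPLICATION-ID&amp;gt; \
  --job-run-id &amp;lt;YOUR-JOB-RUN-ID&amp;gt; \
  --query 'jobRun.state' --output text


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;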

&lt;p&gt;That's it! When the job is done, go to your log folder and check the logs (look for your application ID, job ID, and SPARK_DRIVER logs). You should see something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Reading CSV file from S3...
Writing titanic dataset as a delta table...
Updating and inserting new rows...
Create a delta table object...
UPSERT...
Checking if everything is ok
New data...
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       1|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       1|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
|        892|       1|     1|      Sarah Crepalde|female|23.0|    1|    0|            null|   null| null|    null|
|        893|       0|     1|        Ney Crepalde|  male|35.0|    1|    0|            null|   null| null|    null|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

Old data - with time travel
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Happy coding and build on!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bigdata</category>
      <category>spark</category>
      <category>emrserverless</category>
    </item>
    <item>
      <title>Running Delta Lake on Amazon EMR Serverless</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Sat, 30 Jul 2022 15:24:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39</link>
      <guid>https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39</guid>
<description>&lt;p&gt;Amazon EMR Serverless is a brand new AWS service made generally available on June 1st, 2022. With this service, it is possible to run serverless Spark clusters that can process terabyte-scale data very easily, using any open-source Spark libraries. Getting started with EMR Serverless can be a bit tricky. The goal of this post is to help you get your Spark+Delta jobs up and running "serverlessly". Let's get to it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup - Authentication
&lt;/h2&gt;

&lt;p&gt;In order to run EMR Serverless you'll need to configure two IAM roles: a service-linked role and a job execution role that authorizes access for your Spark jobs. The service-linked role is very straightforward to create. Go to IAM on the AWS console, click on Roles and click on &lt;strong&gt;Create role&lt;/strong&gt;,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrdg0jc7s5ugtuvzzolm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrdg0jc7s5ugtuvzzolm.png" alt="IAM Role console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;choose &lt;strong&gt;Amazon EMR Serverless&lt;/strong&gt;,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vier0c7e8tc6tth95tr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vier0c7e8tc6tth95tr.png" alt="Configurations to create a Service role"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and keep the default settings until you finish creating the role. Next, create a job role with permissions to access S3 and Glue. We will create a very open role (not the best practice) for didactic purposes. In a production environment, you should make your permissions much stricter.&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Create role&lt;/strong&gt; again in the AWS console Roles section and select &lt;strong&gt;Custom trust policy&lt;/strong&gt;. Below, in the "Service" key, replace "{}" with &lt;strong&gt;emr-serverless.amazonaws.com&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vq56stycoxnyepb6tv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vq56stycoxnyepb6tv.png" alt="Configurations for Custom trust policy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, you can select two AWS managed policies, "AmazonS3FullAccess" and "AWSGlueConsoleFullAccess". Click next, give your new role an easily identifiable name (like "EMRServerlessJobRole") and finish creating the role.&lt;/p&gt;
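
&lt;p&gt;If you prefer to script this step, a rough AWS CLI equivalent is sketched below. The role name follows the suggestion above, and the trust policy mirrors the custom trust policy from the console:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Create the job role with a trust policy that lets EMR Serverless assume it
aws iam create-role \
  --role-name EMRServerlessJobRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "emr-serverless.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the two managed policies used in this post
aws iam attach-role-policy --role-name EMRServerlessJobRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name EMRServerlessJobRole \
  --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;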

&lt;h2&gt;
  
  
  Setup - Data
&lt;/h2&gt;

&lt;p&gt;For this post, we are working with the (very famous) &lt;strong&gt;titanic&lt;/strong&gt; dataset, which you can download &lt;a href="https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv" rel="noopener noreferrer"&gt;here&lt;/a&gt; and upload to S3.&lt;/p&gt;
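
&lt;p&gt;For example, with curl and the AWS CLI (the destination prefix below matches the path the PySpark script reads from):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Download the semicolon-separated titanic dataset and push it to S3
curl -L -o titanic.csv https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv
aws s3 cp titanic.csv s3://&amp;lt;YOUR-BUCKET&amp;gt;/titanic/titanic.csv


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;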

&lt;h2&gt;
  
  
  Data Pipeline Strategy
&lt;/h2&gt;

&lt;p&gt;Delta Lake is a great tool that implements the Lakehouse architecture. It has many cool features (such as schema evolution, data time travel, transaction logs, ACID transactions) and it is fundamentally valuable when we have a case of incremental data ingestion. Thus, we are going to simulate some changes in the titanic dataset. We will include two new passengers (Ney and Sarah) and we will update information on two passengers who were presumed dead but found alive(!!!), Mr. Owen Braund and Mr. William Allen.&lt;/p&gt;

&lt;p&gt;The first version of the data is written as a delta table, and the updates are applied with an &lt;a href="https://docs.delta.io/1.0.1/delta-update.html#upsert-into-a-table-using-merge" rel="noopener noreferrer"&gt;upsert&lt;/a&gt; transaction.&lt;/p&gt;

&lt;p&gt;The Python code to perform those operations is presented below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reading CSV file from S3...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId int, Survived int, Pclass int, Name string, Sex string, Age double, SibSp int, Parch int, Ticket string, Fare double, Cabin string, Embarked string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/titanic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writing titanic dataset as a delta table...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Updating and inserting new rows...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId IN (1, 5)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;newrows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;892&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sarah Crepalde&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;female&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;23.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;893&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ney Crepalde&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;male&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;35.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;newrowsdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newrows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newrowsdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a delta table object...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPSERT...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# UPSERT
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;old&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;old.PassengerId = new.PassengerId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checking if everything is ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New data...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Old data - with time travel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This &lt;code&gt;.py&lt;/code&gt; file should be uploaded to S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;One thing about the latest available EMR Serverless release (6.6.0) is that the &lt;code&gt;spark-submit&lt;/code&gt; flag &lt;code&gt;--packages&lt;/code&gt; is &lt;em&gt;not available&lt;/em&gt; yet (😢). So, we have an extra step to package Java and Python dependencies ourselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jars
&lt;/h3&gt;

&lt;p&gt;To use Java dependencies, we have to bundle them manually into a single .jar file. AWS has provided a &lt;a href="https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/Dockerfile.jars" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; that we can use to build the dependencies without having to install Maven locally (😍). I used this &lt;code&gt;pom.xml&lt;/code&gt; file to define the dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

&lt;span class="nt"&gt;&amp;lt;project&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0"&lt;/span&gt; &lt;span class="na"&gt;xmlns:xsi=&lt;/span&gt;&lt;span class="s"&gt;"http://www.w3.org/2001/XMLSchema-instance"&lt;/span&gt; &lt;span class="na"&gt;xsi:schemaLocation=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;modelVersion&amp;gt;&lt;/span&gt;4.0.0&lt;span class="nt"&gt;&amp;lt;/modelVersion&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.serverless-samples&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jars&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;packaging&amp;gt;&lt;/span&gt;jar&lt;span class="nt"&gt;&amp;lt;/packaging&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;jars&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;http://maven.apache.org&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.delta&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;delta-core_2.12&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.2.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;compile&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;build&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;plugins&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;plugin&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.maven.plugins&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-shade-plugin&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;executions&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;execution&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;phase&amp;gt;&lt;/span&gt;package&lt;span class="nt"&gt;&amp;lt;/phase&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;goals&amp;gt;&lt;/span&gt;
                            &lt;span class="nt"&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;shade&lt;span class="nt"&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;/goals&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;/execution&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/executions&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;finalName&amp;gt;&lt;/span&gt;uber-${artifactId}-${version}&lt;span class="nt"&gt;&amp;lt;/finalName&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/plugin&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/plugins&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/build&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We build the final .jar file with the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

docker build &lt;span class="nt"&gt;-f&lt;/span&gt; Dockerfile.jars &lt;span class="nt"&gt;--output&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output &lt;code&gt;uber-jars-1.0-SNAPSHOT.jar&lt;/code&gt; must be uploaded to S3. &lt;/p&gt;

&lt;p&gt;With this .jar file, we can use &lt;code&gt;.format("delta")&lt;/code&gt; in our Python code, but if we try to &lt;code&gt;import delta.tables&lt;/code&gt; we will get a Python dependency error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;

&lt;p&gt;We can provide Python dependencies in two ways: uploading dependency files to S3 or building a virtual environment to use in EMR Serverless. For this post, uploading a zip file with the Delta Python library is the simplest approach.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;mkdir &lt;/span&gt;dependencies
pip &lt;span class="nb"&gt;install &lt;/span&gt;delta-spark&lt;span class="o"&gt;==&lt;/span&gt;1.2.1 &lt;span class="nt"&gt;--target&lt;/span&gt; dependencies
&lt;span class="nb"&gt;cd &lt;/span&gt;dependencies
zip &lt;span class="nt"&gt;-r9&lt;/span&gt; ../emrserverless_dependencies.zip &lt;span class="nb"&gt;.&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;emrserverless_dependencies.zip&lt;/code&gt; file must also be uploaded to S3.&lt;/p&gt;
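
&lt;p&gt;For reference, both build artifacts can be pushed to S3 with the AWS CLI. The prefixes below are placeholders that match the job configuration further down:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Upload the shaded jar and the zipped Python dependencies
aws s3 cp uber-jars-1.0-SNAPSHOT.jar s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/jars/uber-jars-1.0-SNAPSHOT.jar
aws s3 cp emrserverless_dependencies.zip s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/dependencies/emrserverless_dependencies.zip


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;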

&lt;p&gt;Now, we are ready to configure our serverless Spark application.&lt;/p&gt;

&lt;h2&gt;
  
  
  EMR Serverless
&lt;/h2&gt;

&lt;p&gt;First, we must create an EMR Studio. If you don't have any studios created yet, this is very straightforward. After clicking &lt;strong&gt;Get started&lt;/strong&gt; on the EMR Serverless home page, you can click to create a studio automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r0q56c6130txdj4kxri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r0q56c6130txdj4kxri.png" alt="Create EMR Studio automatically"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click the studio URL to enter it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabox2r1g2stwrmkrwlsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabox2r1g2stwrmkrwlsr.png" alt="Studio url"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;With EMR Serverless, we don't have to create a cluster. Instead, we work with the &lt;strong&gt;application&lt;/strong&gt; concept. To create a new EMR Serverless application, click &lt;strong&gt;Create application&lt;/strong&gt;, type an application name, select a release version, and click &lt;strong&gt;Create application&lt;/strong&gt; again at the bottom of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznqiygyc4xu9ebq9ndgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznqiygyc4xu9ebq9ndgn.png" alt="Application creation page"&gt;&lt;/a&gt;&lt;/p&gt;
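
&lt;p&gt;The same can be done with the AWS CLI. A minimal sketch, assuming a Spark application on the 6.6.0 release used in this post (the application name is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Create the application; the returned applicationId is the one
# referenced as &amp;lt;YOUR-APPLICATION-ID&amp;gt; in the job submission below
aws emr-serverless create-application \
  --name delta-titanic-app \
  --type SPARK \
  --release-label emr-6.6.0


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;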

&lt;p&gt;Now, the last thing to do is to submit a Spark job. If you have the &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;aws cli&lt;/a&gt; installed, the code below will submit the job.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

aws emr-serverless start-job-run &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--name&lt;/span&gt; Delta-Upsert &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--application-id&lt;/span&gt; &amp;lt;YOUR-APPLICATION-ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--execution-role-arn&lt;/span&gt; arn:aws:iam::&amp;lt;ACCOUNT-NUMBER&amp;gt;:role/EMRServerlessJobRole &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--job-driver&lt;/span&gt; &lt;span class="s1"&gt;'{
  "sparkSubmit": {
    "entryPoint": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/emrserverless_delta_titanic.py", 
    "sparkSubmitParameters": "--jars s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/jars/uber-jars-1.0-SNAPSHOT.jar --conf spark.submit.pyFiles=s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/dependencies/emrserverless_dependencies.zip"
  }
}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--configuration-overrides&lt;/span&gt; &lt;span class="s1"&gt;'{
"monitoringConfiguration": {
  "s3MonitoringConfiguration": {
    "logUri": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/emr-serverless-logs/"} 
  } 
}'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Some important parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;entryPoint&lt;/em&gt; sets the S3 path for your PySpark script&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;sparkSubmitParameters&lt;/em&gt;: you should add the Java dependencies with the &lt;code&gt;--jars&lt;/code&gt; flag and set &lt;code&gt;--conf spark.submit.pyFiles=&amp;lt;YOUR .py/.zip/.egg FILE&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;s3MonitoringConfiguration&lt;/em&gt; sets the S3 path that will be used to save job logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you wish to use the console, set the job name, role and script location&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdggwkhudd69wbpz5nq23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdggwkhudd69wbpz5nq23.png" alt="Job initial parameters"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;and .jar file and .zip file location as follows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplq65fklf4wmxkuy84z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplq65fklf4wmxkuy84z9.png" alt="Jar files parameters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark job should start after this. When it finishes, check the logs folder in S3 (look for your application ID, job ID, and SPARK_DRIVER logs). You should see something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Reading CSV file from S3...
Writing titanic dataset as a delta table...
Updating and inserting new rows...
Create a delta table object...
UPSERT...
Checking if everything is ok
New data...
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       1|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       1|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
|        892|       1|     1|      Sarah Crepalde|female|23.0|    1|    0|            null|   null| null|    null|
|        893|       0|     1|        Ney Crepalde|  male|35.0|    1|    0|            null|   null| null|    null|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

Old data - with time travel
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
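
&lt;p&gt;To pull the driver stdout to your terminal instead of browsing S3 in the console, something like the sketch below should work. The folder layout under the log URI is an assumption based on how EMR Serverless organizes logs, so adjust it if yours differs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Stream the gzipped driver stdout from S3 and decompress it locally
# (path layout under the log URI is an assumption; adjust as needed)
aws s3 cp s3://&amp;lt;YOUR-BUCKET&amp;gt;/emr-serverless-logs/applications/&amp;lt;YOUR-APPLICATION-ID&amp;gt;/jobs/&amp;lt;YOUR-JOB-RUN-ID&amp;gt;/SPARK_DRIVER/stdout.gz - | gunzip


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;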

&lt;p&gt;Notice the same delta table being shown at two different moments using time travel: the latest version, with Mr. Braund and Mr. Allen marked as alive and the new passengers included, in the first table, and the original version of the dataset in the second.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>deltalake</category>
      <category>spark</category>
      <category>emr</category>
    </item>
  </channel>
</rss>
