<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damon P. Cortesi</title>
    <description>The latest articles on DEV Community by Damon P. Cortesi (@dacort).</description>
    <link>https://dev.to/dacort</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F615165%2F37725df3-d1c1-4f7b-bad2-8b17f9ad7526.jpeg</url>
      <title>DEV Community: Damon P. Cortesi</title>
      <link>https://dev.to/dacort</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dacort"/>
    <language>en</language>
    <item>
      <title>An Introduction to Modern Data Lake Storage Layers</title>
      <dc:creator>Damon P. Cortesi</dc:creator>
      <pubDate>Wed, 02 Feb 2022 22:22:22 +0000</pubDate>
      <link>https://dev.to/dacort/an-introduction-to-modern-data-lake-storage-layers-3iid</link>
      <guid>https://dev.to/dacort/an-introduction-to-modern-data-lake-storage-layers-3iid</guid>
      <description>&lt;p&gt;In recent years we’ve seen a rise in new storage layers for data lakes. In 2017, &lt;a href="https://eng.uber.com/hoodie/" rel="noopener noreferrer"&gt;Uber announced Hudi&lt;/a&gt; - an incremental processing framework for data pipelines. In 2018, &lt;a href="https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/69503.html" rel="noopener noreferrer"&gt;Netflix introduced Iceberg&lt;/a&gt; - a new table format for managing extremely large cloud datasets. And in 2019, &lt;a href="https://techcrunch.com/2019/04/24/databricks-open-sources-delta-lake-to-make-data-lakes-more-reliable/" rel="noopener noreferrer"&gt;Databricks open-sourced Delta Lake&lt;/a&gt; - originally intended to bring ACID transactions to data lakes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📹 &lt;em&gt;If you’d like to watch a video that discusses the content of this post, I’ve also recorded &lt;a href="https://www.youtube.com/watch?v=fryfx0Zg7KA" rel="noopener noreferrer"&gt;an overview here&lt;/a&gt;. Each relevant section below will also link to individual timestamps.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post aims to introduce each of these engines and give some insight into how they function under the hood and some of the differences in each. While I’ll summarize the findings here, you can also view my Jupyter notebooks for each in my &lt;a href="https://github.com/dacort/modern-data-lake-storage-layers/tree/main/notebooks" rel="noopener noreferrer"&gt;modern-data-lake-storage-layers&lt;/a&gt; repository. We begin with basic operations of writing and updating datasets.&lt;/p&gt;

&lt;p&gt;One thing to note about all of these frameworks is that each began with a different challenge to solve, but over time they have converged on a common set of functionality. I should &lt;strong&gt;also&lt;/strong&gt; note that I am still learning about these frameworks myself - the comments here are neither authoritative nor comprehensive. 🤗&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Hudi
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;📹 &lt;a href="https://www.youtube.com/watch?v=fryfx0Zg7KA&amp;amp;t=323s" rel="noopener noreferrer"&gt;Intro to Apache Hudi video&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Apache Hudi (Hadoop Upsert Delete and Incremental) was originally designed as an incremental stream processing framework and was built to combine the benefits of stream and batch processing. Hudi can be used with Spark, Flink, Presto, Trino, and Hive, but much of the original work was focused on Spark, and that’s what I use for these examples. One of the other huge benefits of Hudi is the concept of a self-managed data layer. For example, Hudi can automatically perform &lt;a href="https://hudi.apache.org/docs/compaction" rel="noopener noreferrer"&gt;asynchronous compaction&lt;/a&gt; to optimize data lakes and also supports &lt;a href="https://hudi.apache.org/docs/concurrency_control" rel="noopener noreferrer"&gt;multi-writer guarantees&lt;/a&gt;. Hudi also offers flexibility in storage formats depending on read/write requirements and data size.&lt;/p&gt;

&lt;p&gt;For Hudi, we create a simple Spark DataFrame partitioned by &lt;code&gt;creation_date&lt;/code&gt; and write that to S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "creation_date", "last_update_time"],
)

# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
    "hoodie.table.name": "my_hudi_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.table": "my_hudi_table",
    "hoodie.datasource.hive_sync.partition_fields": "creation_date",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.index.type": "GLOBAL_BLOOM", # This is required if we want to ensure we upsert a record, even if the partition changes
    "hoodie.bloom.index.update.partition.path": "true", # This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)
}

# Write a DataFrame as a Hudi dataset
inputDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "insert"
).options(**hudiOptions).mode("overwrite").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi/")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we look at the file structure on S3, we see a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;code&gt;hoodie.properties&lt;/code&gt; file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-14 00:33:46 503 tmp/hudi/.hoodie/hoodie.properties

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file contains certain metadata about the Hudi dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Properties saved on Fri Jan 14 00:33:45 UTC 2022
#Fri Jan 14 00:33:45 UTC 2022
hoodie.table.precombine.field=last_update_time
hoodie.table.partition.fields=creation_date
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.populate.meta.fields=true
hoodie.timeline.layout.version=1
hoodie.table.version=2
hoodie.table.recordkey.fields=id
hoodie.table.base.file.format=PARQUET
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.table.name=my_hudi_table

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;A set of commit-related files
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-14 00:33:57 2706 tmp/hudi/.hoodie/20220114003341.commit
2022-01-14 00:33:48 0 tmp/hudi/.hoodie/20220114003341.commit.requested
2022-01-14 00:33:52 1842 tmp/hudi/.hoodie/20220114003341.inflight

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The actual &lt;code&gt;.parquet&lt;/code&gt; data files and associated metadata organized into date-based partitions.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-14 00:33:54 93 tmp/hudi/2015-01-01/.hoodie_partition_metadata
2022-01-14 00:33:54 434974 tmp/hudi/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet
2022-01-14 00:33:55 93 tmp/hudi/2015-01-02/.hoodie_partition_metadata
2022-01-14 00:33:55 434943 tmp/hudi/2015-01-02/43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then update the &lt;code&gt;creation_date&lt;/code&gt; of one row in this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import lit

# Create a new DataFrame from the first row of inputDF with a different creation_date value
updateDF = inputDF.where("id = 100").withColumn("creation_date", lit("2022-01-11"))

updateDF.show()

# Update by using the "upsert" operation
updateDF.write.format("org.apache.hudi").option(
    "hoodie.datasource.write.operation", "upsert"
).options(**hudiOptions).mode("append").save(f"s3://{S3_BUCKET_NAME}/tmp/hudi/")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing to note here is that since we’re updating a partition value ( &lt;strong&gt;DANGER!&lt;/strong&gt; ), we had to set the &lt;code&gt;hoodie.index.type&lt;/code&gt; to &lt;code&gt;GLOBAL_BLOOM&lt;/code&gt; as well as set &lt;code&gt;hoodie.bloom.index.update.partition.path&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;. This can have a large impact on performance, so normally we would try not to change a partition value in a production environment, but it’s useful here to see the impact it has. You can find more details in the Hudi FAQ about &lt;a href="https://hudi.apache.org/learn/faq/#how-does-the-hudi-indexing-work--what-are-its-benefits" rel="noopener noreferrer"&gt;Hudi indexing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After this write, we have a new set of commit-related files on S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-14 00:34:15 2706 tmp/hudi/.hoodie/20220114003401.commit
2022-01-14 00:34:03 0 tmp/hudi/.hoodie/20220114003401.commit.requested
2022-01-14 00:34:08 2560 tmp/hudi/.hoodie/20220114003401.inflight

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we actually have &lt;strong&gt;2&lt;/strong&gt; new &lt;code&gt;.parquet&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-14 00:34:12 434925 tmp/hudi/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet
...
2022-01-14 00:34:13 93 tmp/hudi/2022-01-11/.hoodie_partition_metadata
2022-01-14 00:34:14 434979 tmp/hudi/2022-01-11/0c210872-484e-428b-a9ca-90a26e42125c-0_1-43-13681_20220114003401.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what happened with the update is that the old partition (&lt;code&gt;2015-01-01&lt;/code&gt;) had its data overwritten and the new partition (&lt;code&gt;2022-01-11&lt;/code&gt;) &lt;em&gt;also&lt;/em&gt; had data written to it. You can now see why the global bloom index could have such a large impact on write performance, as there is significant potential for write amplification.&lt;/p&gt;

&lt;p&gt;If we query the data and add the source filename for each row, we can also see that data for the old partition now comes from the new parquet file (notice the commit ID &lt;code&gt;20220114003401&lt;/code&gt; shows up in the filename):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import input_file_name

snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi/") \
    .select('id', 'creation_date') \
    .withColumn("filename", input_file_name())

snapshotQueryDF.show(truncate=False)


+---+-------------+------------------------------------------------------------------------------------------------------------------------------+
|id |creation_date|filename |
+---+-------------+------------------------------------------------------------------------------------------------------------------------------+
|100|2022-01-11 |/hudi/2022-01-11/0c210872-484e-428b-a9ca-90a26e42125c-0_1-43-13681_20220114003401.parquet |
|105|2015-01-02 |/hudi/2015-01-02/43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet |
|104|2015-01-02 |/hudi/2015-01-02/43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet |
|102|2015-01-01 |/hudi/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet |
|103|2015-01-01 |/hudi/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet |
|101|2015-01-01 |/hudi/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet |
+---+-------------+------------------------------------------------------------------------------------------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One other thing to note is that Hudi adds quite a bit of metadata to your Parquet files. This data helps enable record-level change streams - more detail can be found in this &lt;a href="https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform/#writers" rel="noopener noreferrer"&gt;comprehensive blog post about the Hudi platform&lt;/a&gt;. If we use native Spark to read one of the Parquet files and show it, we see that there’s various &lt;code&gt;_hoodie&lt;/code&gt;-prefixed keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import split

rawDF = (
    spark.read.parquet(f"s3://{S3_BUCKET_NAME}/tmp/hudi/*/*.parquet")
    .withColumn("filename", split(input_file_name(), "tmp/hudi").getItem(1))
    .sort("_hoodie_commit_time", "_hoodie_commit_seqno")
)
rawDF.show(truncate=False)


+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+------------------------------------------------------------------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name |id |creation_date|last_update_time |filename |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+------------------------------------------------------------------------------------+
|20220114003341 |20220114003341_0_1 |100 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |100|2015-01-01 |2015-01-01T13:51:39.340396Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |
|20220114003341 |20220114003341_0_2 |102 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |102|2015-01-01 |2015-01-01T13:51:40.417052Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet|
|20220114003341 |20220114003341_0_2 |102 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |102|2015-01-01 |2015-01-01T13:51:40.417052Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |
|20220114003341 |20220114003341_0_3 |103 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |103|2015-01-01 |2015-01-01T13:51:40.519832Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet|
|20220114003341 |20220114003341_0_3 |103 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |103|2015-01-01 |2015-01-01T13:51:40.519832Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |
|20220114003341 |20220114003341_0_4 |101 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |101|2015-01-01 |2015-01-01T12:14:58.597216Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-37-13680_20220114003401.parquet|
|20220114003341 |20220114003341_0_4 |101 |2015-01-01 |57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |101|2015-01-01 |2015-01-01T12:14:58.597216Z|/2015-01-01/57f66198-5303-4922-9323-91737ec40d25-0_0-4-98_20220114003341.parquet |
|20220114003341 |20220114003341_1_5 |105 |2015-01-02 |43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet |105|2015-01-02 |2015-01-01T13:51:42.248818Z|/2015-01-02/43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet |
|20220114003341 |20220114003341_1_6 |104 |2015-01-02 |43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet |104|2015-01-02 |2015-01-01T12:15:00.512679Z|/2015-01-02/43051d12-87e7-4dfb-8201-6ce293cf0df7-0_1-6-99_20220114003341.parquet |
|20220114003401 |20220114003401_1_1 |100 |2022-01-11 |0c210872-484e-428b-a9ca-90a26e42125c-0_1-43-13681_20220114003401.parquet|100|2022-01-11 |2015-01-01T13:51:39.340396Z|/2022-01-11/0c210872-484e-428b-a9ca-90a26e42125c-0_1-43-13681_20220114003401.parquet|
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---+-------------+---------------------------+------------------------------------------------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the background, Hudi figures out which commits and values to show based on the commit files and metadata in the parquet files.&lt;/p&gt;
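
&lt;p&gt;If you want to see those commits without digging through S3, one quick check (just a sketch, reusing the same table path from above) is to look at the distinct &lt;code&gt;_hoodie_commit_time&lt;/code&gt; values Hudi stores with every record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: list the commit times Hudi has recorded for this table,
# using the _hoodie_commit_time metadata column it adds to every record
commitsDF = (
    spark.read.format("org.apache.hudi")
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi/")
    .select("_hoodie_commit_time")
    .distinct()
    .orderBy("_hoodie_commit_time")
)

commitsDF.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
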

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;📹 &lt;a href="https://www.youtube.com/watch?v=fryfx0Zg7KA&amp;amp;t=1039s" rel="noopener noreferrer"&gt;Intro to Apache Iceberg video&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I first heard about Iceberg, the phrase “table format for storing large, slow-moving tabular data” didn’t really make sense to me. But after working with data lakes at scale, it became quite clear. Apache Hive is a popular data warehouse project that provides a SQL-like interface to large datasets. Built on top of Hadoop, it originally used HDFS as its data store. With cloud migrations, object stores like Amazon S3 made it possible to store even more data, particularly without the operational concerns of a large Hadoop cluster, but with some limitations when compared to HDFS. Specifically, directory listings are slower (simple physics here, network calls are slower), renames are not atomic (by design), and results were previously only eventually consistent.&lt;/p&gt;

&lt;p&gt;So imagine you are Netflix, you have &lt;a href="https://netflixtechblog.com/optimizing-data-warehouse-storage-7b94a48fdcbe" rel="noopener noreferrer"&gt;hundreds of petabytes of data&lt;/a&gt; stored on S3, and you need a way for your organization to efficiently query this. You need a data storage layer that reduces or removes directory listings, you want atomic changes, and you want to ensure that when you’re reading your data you get consistent results. &lt;em&gt;There is more to Iceberg, but I’m simplifying because this helped me understand. :)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These were some of the original goals for Iceberg, so let’s dive in and see how it works. Similar to Hudi, we’ll create a simple Spark DataFrame and write that to S3 in Iceberg format.&lt;/p&gt;

&lt;p&gt;I should note that much of Iceberg is focused around Spark SQL, so I will switch to that below for certain operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "creation_date", "last_update_time"],
)

# Write a DataFrame as an Iceberg dataset
inputDF.write.format("iceberg").mode("overwrite").partitionBy("creation_date").option(
    "path", f"s3://{S3_BUCKET_NAME}/tmp/iceberg/"
).saveAsTable(ICEBERG_TABLE_NAME)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two main differences here - there is not as much “configuration” as we needed with Hudi, and we explicitly use &lt;code&gt;saveAsTable&lt;/code&gt;. With Iceberg, much of the metadata is stored in a data catalog, so creating the table is necessary.&lt;/p&gt;
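
&lt;p&gt;Because the table is registered in a catalog, we can also read it back by name with Spark SQL rather than by S3 path. Here’s a minimal sketch, assuming the same &lt;code&gt;ICEBERG_TABLE_NAME&lt;/code&gt; used above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: read the Iceberg table back through the catalog by name
spark.sql(f"SELECT * FROM {ICEBERG_TABLE_NAME}").show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now let’s see what happened on S3.&lt;/p&gt;
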

&lt;ol&gt;
&lt;li&gt;First, we have a &lt;code&gt;metadata.json&lt;/code&gt; file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-28 06:03:50 2457 tmp/iceberg/metadata/00000-bb1d38a9-af77-42c4-a7b7-69416fe36d9c.metadata.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Then a snapshot &lt;strong&gt;manifest list&lt;/strong&gt; file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-28 06:03:50 3785 tmp/iceberg/metadata/snap-7934053180928033536-1-e79c79ba-c7f0-45ad-8f2e-fd1bc349db55.avro

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;And a manifest file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-28 06:03:50 6244 tmp/iceberg/metadata/e79c79ba-c7f0-45ad-8f2e-fd1bc349db55-m0.avro

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;And finally, we’ve got our Parquet data files
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-28 06:03:49 1197 tmp/iceberg/data/creation_date=2015-01-01/00000-4-fa9a18fd-abc4-4e04-91b4-e2ac4c9531be-00001.parquet
2022-01-28 06:03:49 1171 tmp/iceberg/data/creation_date=2015-01-01/00001-5-eab30115-a1d6-4918-abb4-a198ac12b262-00001.parquet
2022-01-28 06:03:50 1182 tmp/iceberg/data/creation_date=2015-01-02/00001-5-eab30115-a1d6-4918-abb4-a198ac12b262-00002.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a lot of moving pieces here, but the image from the Iceberg spec illustrates it quite well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tq70LNET--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dacort.dev/posts/modern-data-lake-storage-layers/iceberg-metadata.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tq70LNET--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dacort.dev/posts/modern-data-lake-storage-layers/iceberg-metadata.png" alt="Iceberg Metadata Diagram" width="800" height="827"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to Hudi, our data is written to Parquet files in each partition, although Hive-style partitioning is used by default. Hudi can also do this by setting the &lt;a href="https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritehive_style_partitioning" rel="noopener noreferrer"&gt;&lt;code&gt;hoodie.datasource.write.hive_style_partitioning&lt;/code&gt;&lt;/a&gt; parameter.&lt;/p&gt;

&lt;p&gt;Different from Hudi, though, is the default use of the data catalog to identify the current metadata file. (&lt;a href="https://hudi.apache.org/releases/release-0.7.0/#metadata-table" rel="noopener noreferrer"&gt;Hudi 0.7.0&lt;/a&gt; introduced support for a metadata table to reduce the performance impact of file listings.) That metadata file references a list of manifest files that determine which data files make up the dataset for a particular version, also known as a snapshot. The snapshot data also includes quite a bit of additional information. Let’s update our dataset, then take a look at S3 again and at the snapshot portion of the metadata file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.sql(f"UPDATE {ICEBERG_TABLE_NAME} SET creation_date = '2022-01-11' WHERE id = 100")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 new .parquet data files
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-28 06:07:07 1180 tmp/iceberg/data/creation_date=2015-01-01/00000-16-033354bd-7b02-44f4-95e2-7045e10706fc-00001.parquet
2022-01-28 06:07:08 1171 tmp/iceberg/data/creation_date=2022-01-11/00000-16-033354bd-7b02-44f4-95e2-7045e10706fc-00002.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As well as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 new metadata.json file&lt;/li&gt;
&lt;li&gt;2 new .avro metadata listings&lt;/li&gt;
&lt;li&gt;1 new snap-*.avro snapshot file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s look at the snapshot portion of the &lt;code&gt;metadata.json&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"snapshots": [
    {
        "manifest-list": "s3://&amp;lt;BUCKET&amp;gt;/tmp/iceberg/metadata/snap-7934053180928033536-1-e79c79ba-c7f0-45ad-8f2e-fd1bc349db55.avro",
        "schema-id": 0,
        "snapshot-id": 7934053180928033536,
        "summary": {
            "added-data-files": "3",
            "added-files-size": "3550",
            "added-records": "6",
            "changed-partition-count": "2",
            "operation": "append",
            "spark.app.id": "application_1643153254969_0029",
            "total-data-files": "3",
            "total-delete-files": "0",
            "total-equality-deletes": "0",
            "total-files-size": "3550",
            "total-position-deletes": "0",
            "total-records": "6"
        },
        "timestamp-ms": 1643349829278
    },
    {
        "manifest-list": "s3://&amp;lt;BUCKET&amp;gt;/tmp/iceberg/metadata/snap-5441092870212826638-1-605de48f-8ccf-450c-935e-bbd4194ee8cc.avro",
        "parent-snapshot-id": 7934053180928033536,
        "schema-id": 0,
        "snapshot-id": 5441092870212826638,
        "summary": {
            "added-data-files": "2",
            "added-files-size": "2351",
            "added-records": "3",
            "changed-partition-count": "2",
            "deleted-data-files": "1",
            "deleted-records": "3",
            "operation": "overwrite",
            "removed-files-size": "1197",
            "spark.app.id": "application_1643153254969_0029",
            "total-data-files": "4",
            "total-delete-files": "0",
            "total-equality-deletes": "0",
            "total-files-size": "4704",
            "total-position-deletes": "0",
            "total-records": "6"
        },
        "timestamp-ms": 1643350027635
    }
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is pretty amazing - we see how many files &lt;strong&gt;and records&lt;/strong&gt; were added or deleted, what the file sizes were, and even what the Spark &lt;code&gt;app_id&lt;/code&gt; was! 🤯 Some of this data is in the &lt;code&gt;manifest-list&lt;/code&gt; files as well, but you can begin to see &lt;em&gt;just&lt;/em&gt; how much you could potentially optimize your queries using this data.&lt;/p&gt;
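
&lt;p&gt;You also don’t have to read the &lt;code&gt;metadata.json&lt;/code&gt; file by hand - Iceberg exposes snapshot information as queryable metadata tables. A minimal sketch, again assuming the table is registered as &lt;code&gt;ICEBERG_TABLE_NAME&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: query the Iceberg "snapshots" metadata table with Spark SQL
spark.sql(f"""
    SELECT committed_at, snapshot_id, parent_id, operation
    FROM {ICEBERG_TABLE_NAME}.snapshots
""").show(truncate=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
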

&lt;h2&gt;
  
  
  Delta Lake
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;📹 &lt;a href="https://www.youtube.com/watch?v=fryfx0Zg7KA&amp;amp;t=1814s" rel="noopener noreferrer"&gt;Intro to Delta Lake video&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Delta Lake was introduced by Databricks as a way to address many of the challenges of data lakes. Similar to Hudi and Iceberg, its goals include unifying batch and stream processing, ACID transactions, and scalable metadata handling, among others.&lt;/p&gt;

&lt;p&gt;Again, we’ll create a simple Spark DataFrame and write it to S3 in Delta format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "creation_date", "last_update_time"],
)

# Write a DataFrame as a Delta dataset
inputDF.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).partitionBy("creation_date").save(f"s3://{S3_BUCKET_NAME}/tmp/delta/")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On S3, we now see the following files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a &lt;code&gt;00000000000000000000.json&lt;/code&gt; file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-24 22:57:54 2120 tmp/delta/_delta_log/00000000000000000000.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Several &lt;code&gt;.snappy.parquet&lt;/code&gt; files
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-24 22:57:52 875 tmp/delta/creation_date=2015-01-01/part-00005-2e09dbe4-469e-40dc-9b36-833480f6d375.c000.snappy.parquet
2022-01-24 22:57:52 875 tmp/delta/creation_date=2015-01-01/part-00010-848c69e1-71fb-4f8f-a19a-dd74e0ef1b8a.c000.snappy.parquet
2022-01-24 22:57:53 875 tmp/delta/creation_date=2015-01-01/part-00015-937d1837-0f03-4306-9b4e-4366207e688d.c000.snappy.parquet
2022-01-24 22:57:54 875 tmp/delta/creation_date=2015-01-01/part-00021-978a808e-4c36-4646-b7b1-ef5a21e706d8.c000.snappy.parquet
2022-01-24 22:57:54 875 tmp/delta/creation_date=2015-01-02/part-00026-538e1ac6-055e-4e72-9177-63daaaae1f98.c000.snappy.parquet
2022-01-24 22:57:52 875 tmp/delta/creation_date=2015-01-02/part-00031-8a03451a-0297-4c43-b64d-56db25807d02.c000.snappy.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK, so what’s in that &lt;code&gt;_delta_log&lt;/code&gt; file? Similar to Iceberg, quite a bit of information about this initial write to S3 including the number of files written, the schema of the dataset, and even the individual &lt;code&gt;add&lt;/code&gt; operations for each file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "commitInfo": {
    "timestamp": 1643065073634,
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite",
      "partitionBy": "[\"creation_date\"]"
    },
    "isBlindAppend": false,
    "operationMetrics": {
      "numFiles": "6",
      "numOutputBytes": "5250",
      "numOutputRows": "6"
    }
  }
}
{
  "protocol": {
    "minReaderVersion": 1,
    "minWriterVersion": 2
  }
}
{
  "metaData": {
    "id": "a7f4b1d1-09f6-4475-894a-0eec90d1aab5",
    "format": {
      "provider": "parquet",
      "options": {}
    },
    "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"creation_date\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_update_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",
    "partitionColumns": [
      "creation_date"
    ],
    "configuration": {},
    "createdTime": 1643065064066
  }
}
{
  "add": {
    "path": "creation_date=2015-01-01/part-00005-2e09dbe4-469e-40dc-9b36-833480f6d375.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2015-01-01"
    },
    "size": 875,
    "modificationTime": 1643065072000,
    "dataChange": true
  }
}
{
  "add": {
    "path": "creation_date=2015-01-01/part-00010-848c69e1-71fb-4f8f-a19a-dd74e0ef1b8a.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2015-01-01"
    },
    "size": 875,
    "modificationTime": 1643065072000,
    "dataChange": true
  }
}
{
  "add": {
    "path": "creation_date=2015-01-01/part-00015-937d1837-0f03-4306-9b4e-4366207e688d.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2015-01-01"
    },
    "size": 875,
    "modificationTime": 1643065073000,
    "dataChange": true
  }
}
{
  "add": {
    "path": "creation_date=2015-01-01/part-00021-978a808e-4c36-4646-b7b1-ef5a21e706d8.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2015-01-01"
    },
    "size": 875,
    "modificationTime": 1643065074000,
    "dataChange": true
  }
}
{
  "add": {
    "path": "creation_date=2015-01-02/part-00026-538e1ac6-055e-4e72-9177-63daaaae1f98.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2015-01-02"
    },
    "size": 875,
    "modificationTime": 1643065074000,
    "dataChange": true
  }
}
{
  "add": {
    "path": "creation_date=2015-01-02/part-00031-8a03451a-0297-4c43-b64d-56db25807d02.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2015-01-02"
    },
    "size": 875,
    "modificationTime": 1643065072000,
    "dataChange": true
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, let’s go ahead and update one of our rows. Delta Lake provides a merge operation that we can use. We’ll use the syntax &lt;a href="https://docs.delta.io/latest/quick-start.html#update-table-data" rel="noopener noreferrer"&gt;from the docs&lt;/a&gt; that’s slightly different from native Spark as it creates a DeltaTable object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import lit

# Create a new DataFrame from the first row of inputDF with a different creation_date value
updateDF = inputDF.where("id = 100").withColumn("creation_date", lit("2022-01-11"))

from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, f"s3://{S3_BUCKET_NAME}/tmp/delta/")

deltaTable.alias("oldData") \
  .merge(
    updateDF.alias("newData"),
    "oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "creation_date": col("newData.creation_date") }) \
  .execute()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interestingly, now when we look at S3 we see 1 new &lt;code&gt;json&lt;/code&gt; file and only 1 new &lt;code&gt;parquet&lt;/code&gt; file (Remember Hudi and Iceberg both had 2 new &lt;code&gt;parquet&lt;/code&gt; files).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-01-24 23:05:46 1018 tmp/delta/_delta_log/00000000000000000001.json
2022-01-24 23:05:46 875 tmp/delta/creation_date=2022-01-11/part-00000-3f3fd83a-b876-4b6f-8f64-d8a4189392ae.c000.snappy.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we look at that new JSON file we see something really interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "commitInfo": {
    "timestamp": 1643065545396,
    "operation": "MERGE",
    "operationParameters": {
      "predicate": "(oldData.`id` = newData.`id`)",
      "matchedPredicates": "[{\"actionType\":\"update\"}]",
      "notMatchedPredicates": "[]"
    },
    "readVersion": 0,
    "isBlindAppend": false,
    "operationMetrics": {
      "numTargetRowsCopied": "0",
      "numTargetRowsDeleted": "0",
      "numTargetFilesAdded": "1",
      "executionTimeMs": "4705",
      "numTargetRowsInserted": "0",
      "scanTimeMs": "3399",
      "numTargetRowsUpdated": "1",
      "numOutputRows": "1",
      "numSourceRows": "1",
      "numTargetFilesRemoved": "1",
      "rewriteTimeMs": "1265"
    }
  }
}
{
  "remove": {
    "path": "creation_date=2015-01-01/part-00005-2e09dbe4-469e-40dc-9b36-833480f6d375.c000.snappy.parquet",
    "deletionTimestamp": 1643065545378,
    "dataChange": true,
    "extendedFileMetadata": true,
    "partitionValues": {
      "creation_date": "2015-01-01"
    },
    "size": 875
  }
}
{
  "add": {
    "path": "creation_date=2022-01-11/part-00000-3f3fd83a-b876-4b6f-8f64-d8a4189392ae.c000.snappy.parquet",
    "partitionValues": {
      "creation_date": "2022-01-11"
    },
    "size": 875,
    "modificationTime": 1643065546000,
    "dataChange": true
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the &lt;code&gt;operationMetrics&lt;/code&gt; that gives us insight into how the data changed on “disk”, we also now see both a &lt;code&gt;remove&lt;/code&gt; and an &lt;code&gt;add&lt;/code&gt; operation. In Delta Lake (and I’m not quite sure why this happened yet…), each row was written to an individual &lt;code&gt;.parquet&lt;/code&gt; file! So for this second version of the data, the fact that that row was updated simply lives in the metadata because it was the only row stored in that Parquet file. I’m guessing this is simply because my dataset is so small that the default number of partitions in Spark/Delta Lake resulted in this write configuration.&lt;/p&gt;
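
&lt;p&gt;If reading the JSON log files by hand gets tedious, Delta Lake also exposes this commit history through the &lt;code&gt;DeltaTable&lt;/code&gt; API. A minimal sketch, reusing the &lt;code&gt;deltaTable&lt;/code&gt; object from the merge above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: view the Delta Lake commit history (one row per version)
deltaTable.history().select(
    "version", "timestamp", "operation", "operationMetrics"
).show(truncate=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
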

&lt;h2&gt;
  
  
  Snapshots
&lt;/h2&gt;

&lt;p&gt;So now we’ve got a good idea of the semantics of each of these storage layers. Let’s take one more look at an important component of all of them: snapshots!&lt;/p&gt;

&lt;h3&gt;
  
  
  Hudi
&lt;/h3&gt;

&lt;p&gt;Hudi has a concept of “point-in-time” queries where you provide a begin and end commit timestamp and it shows you what the data looked like at that point in time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Query data from the first version of the table
readOptions = {
  'hoodie.datasource.query.type': 'incremental',
  'hoodie.datasource.read.begin.instanttime': '0',
  'hoodie.datasource.read.end.instanttime': '20220114003341',
}

incQueryDF = spark.read \
    .format('org.apache.hudi') \
    .options(**readOptions) \
    .load(f"s3://{S3_BUCKET_NAME}/tmp/hudi")

incQueryDF.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Iceberg
&lt;/h3&gt;

&lt;p&gt;Iceberg supports a similar mechanism called time travel, where you can use either a &lt;code&gt;snapshot-id&lt;/code&gt; or &lt;code&gt;as-of-timestamp&lt;/code&gt;, similar to Hudi.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# time travel to 2022-01-27 22:04:00 -0800
df = spark.read \
    .option("as-of-timestamp", "1643349840000") \
    .format("iceberg") \
    .load(ICEBERG_TABLE_NAME)

df.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
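
&lt;p&gt;You can also pin an exact snapshot rather than a timestamp. A quick sketch using the first &lt;code&gt;snapshot-id&lt;/code&gt; from the metadata above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: time travel by snapshot-id instead of timestamp
# (7934053180928033536 is the first snapshot-id shown in metadata.json above)
df = spark.read \
    .option("snapshot-id", "7934053180928033536") \
    .format("iceberg") \
    .load(ICEBERG_TABLE_NAME)

df.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
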



&lt;h3&gt;
  
  
  Delta Lake
&lt;/h3&gt;

&lt;p&gt;And, of course, Delta Lake supports this as well using either &lt;a href="https://docs.delta.io/latest/delta-batch.html#-deltatimetravel" rel="noopener noreferrer"&gt;Spark SQL or DataFrames&lt;/a&gt;. Similar to Iceberg, you can use &lt;code&gt;versionAsOf&lt;/code&gt; or &lt;code&gt;timestampAsOf&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# time travel to 2022-01-24 23:00
df1 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-01-24 23:00")
    .load(f"s3://{S3_BUCKET_NAME}/tmp/delta/")
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
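
&lt;p&gt;And the equivalent with &lt;code&gt;versionAsOf&lt;/code&gt; - a small sketch assuming version &lt;code&gt;0&lt;/code&gt; is the initial write (the &lt;code&gt;00000000000000000000.json&lt;/code&gt; commit above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: time travel to the initial write by version number
df0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load(f"s3://{S3_BUCKET_NAME}/tmp/delta/")
)

df0.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
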



&lt;h2&gt;
  
  
  Deletes
&lt;/h2&gt;

&lt;p&gt;I bet you’re surprised I haven’t mentioned deletes or GDPR yet. Don’t worry…I will. 😀 But first I just wanted to understand exactly how these different systems work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapup
&lt;/h2&gt;

&lt;p&gt;In this post, we reviewed the basics of Apache Hudi, Apache Iceberg, and Delta Lake - modern data lake storage layers. All these frameworks enable a set of functionality that optimize working with data in cloud-based object stores, albeit with slightly different approaches.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>SSH to EC2 Instances with Session Manager</title>
      <dc:creator>Damon P. Cortesi</dc:creator>
      <pubDate>Wed, 29 Sep 2021 16:49:03 +0000</pubDate>
      <link>https://dev.to/dacort/ssh-to-ec2-instances-with-session-manager-1494</link>
      <guid>https://dev.to/dacort/ssh-to-ec2-instances-with-session-manager-1494</guid>
      <description>&lt;p&gt;I’m kind of an old-school sys admin (aka, managed NT4 in the 90’s) so I’m really used to SSH’ing into hosts. More often than not, however, I’m working with AWS EC2 instances in a private subnet.&lt;/p&gt;

&lt;p&gt;If you’re not familiar with it, &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager Session Manager&lt;/a&gt; is a pretty sweet feature that allows you to connect remotely to EC2 instances with the AWS CLI, without needing to open up ports for SSH or utilize a bastion host.&lt;/p&gt;

&lt;p&gt;I’ve been using it in my browser occasionally, which is pretty handy, but I wanted to use it from my terminal. It required a couple steps to get working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up Session Manager with AWS CLI
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install the &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html" rel="noopener noreferrer"&gt;Session Manager plugin&lt;/a&gt; for the AWS CLI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’m on a Mac, so I just installed the plugin with the signed installer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/mac/session-manager-plugin.pkg" -o ~/Downloads/session-manager-plugin.pkg

sudo installer -pkg ~/Downloads/session-manager-plugin.pkg -target /
sudo ln -s /usr/local/sessionmanagerplugin/bin/session-manager-plugin /usr/local/bin/session-manager-plugin


➜ session-manager-plugin

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sweet, good to go there!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now use the AWS CLI to connect to an instance!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;You may need to specify the region your instance is in&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session --target i-abcdefgh123456789 --region us-west-2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Awesome! You’re good to go!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aXWnUebP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dacort.dev/posts/ssh-to-ec2-instances-with-session-manager/session-output.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aXWnUebP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dacort.dev/posts/ssh-to-ec2-instances-with-session-manager/session-output.png" alt="Session Output" width="505" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enable Logging
&lt;/h2&gt;

&lt;p&gt;The other nice thing is that if your memory is as bad as mine (or you want auditing, which is a more legitimate reason), you can also enable logging of your sessions to S3 or CloudWatch.&lt;/p&gt;

&lt;p&gt;This is what the default config looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm get-document \
    --region us-west-2 \
    --name "SSM-SessionManagerRunShell" \
    --document-version "\$LATEST" \
    | jq '.Content | fromjson'


{
  "schemaVersion": "1.0",
  "description": "Document to hold regional settings for Session Manager",
  "sessionType": "Standard_Stream",
  "inputs": {
    "s3BucketName": "",
    "s3KeyPrefix": "",
    "s3EncryptionEnabled": true,
    "cloudWatchLogGroupName": "",
    "cloudWatchEncryptionEnabled": true,
    "cloudWatchStreamingEnabled": true,
    "idleSessionTimeout": "20",
    "kmsKeyId": "",
    "runAsEnabled": false,
    "runAsDefaultUser": "",
    "shellProfile": {
      "windows": "",
      "linux": ""
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we’ll just update that to add in the S3 configuration.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update Session Manager preferences&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Note that I &lt;strong&gt;do not&lt;/strong&gt; enable encryption here.&lt;/em&gt; This setting needs to match your bucket setting and you need to make sure your VPC has the proper endpoints and access to write to S3. Check &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-troubleshooting.html#session-manager-troubleshooting-start-blank-screen" rel="noopener noreferrer"&gt;troubleshooting&lt;/a&gt; if you get a blank screen when trying to start a session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BUCKET=&amp;lt;BUCKET_NAME&amp;gt;
PREFIX=logs/session_manager/

aws ssm update-document \
    --region us-west-2 \
    --name "SSM-SessionManagerRunShell" \
    --document-version "\$LATEST" \
    --content '{
  "schemaVersion": "1.0",
  "description": "Document to hold regional settings for Session Manager",
  "sessionType": "Standard_Stream",
  "inputs": {
    "s3BucketName": "'${BUCKET}'",
    "s3KeyPrefix": "'${PREFIX}'",
    "s3EncryptionEnabled": false,
    "cloudWatchLogGroupName": "",
    "cloudWatchEncryptionEnabled": true,
    "cloudWatchStreamingEnabled": true,
    "idleSessionTimeout": "20",
    "kmsKeyId": "",
    "runAsEnabled": false,
    "runAsDefaultUser": "",
    "shellProfile": {
      "windows": "",
      "linux": ""
    }
  }
}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create another session!
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session --target i-abcdefgh123456789 --region us-west-2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you’re done with your session and exit, you should have a log file in your S3 bucket.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;View logs
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 ls s3://&amp;lt;BUCKET_NAME&amp;gt;/logs/session_manager/


2021-09-29 10:21:24 4177 your-aws-username-abcdefgh123456789.log

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that log file will have the full contents of your session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 cp s3://&amp;lt;BUCKET_NAME&amp;gt;/logs/session_manager/your-aws-username-abcdefgh123456789.log -

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tMlZLCMs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dacort.dev/posts/ssh-to-ec2-instances-with-session-manager/log-output.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tMlZLCMs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dacort.dev/posts/ssh-to-ec2-instances-with-session-manager/log-output.png" alt="SSM Log Output" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And yes, the &lt;em&gt;FULL CONTENTS&lt;/em&gt;. So if you enter a password or sensitive info, you should follow the steps &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-logging.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stty -echo; read passwd; stty echo;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>aws</category>
    </item>
    <item>
      <title>Updating Partition Values With Apache Hudi</title>
      <dc:creator>Damon P. Cortesi</dc:creator>
      <pubDate>Thu, 23 Sep 2021 19:21:15 +0000</pubDate>
      <link>https://dev.to/dacort/updating-partition-values-with-apache-hudi-3hbg</link>
      <guid>https://dev.to/dacort/updating-partition-values-with-apache-hudi-3hbg</guid>
      <description>&lt;p&gt;If you're not familiar with &lt;a href="https://hudi.apache.org/" rel="noopener noreferrer"&gt;Apache Hudi&lt;/a&gt;, it's a pretty awesome piece of software that brings transactions and record-level updates/deletes to data lakes.&lt;/p&gt;

&lt;p&gt;More specifically, if you're doing Analytics with S3, Hudi provides a way for you to &lt;em&gt;consistently&lt;/em&gt; update records in your data lake, which historically has been pretty challenging. It can also optimize file sizes, allow for rollbacks, and makes &lt;a href="https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-available-in-amazon-emr/" rel="noopener noreferrer"&gt;streaming CDC data impressively easy&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating Partition Values
&lt;/h2&gt;

&lt;p&gt;I'm learning more about Hudi and was following this &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html" rel="noopener noreferrer"&gt;EMR guide to working with a Hudi dataset&lt;/a&gt;, but the "Upsert" operation didn't quite work as I expected. Instead of overwriting the desired record, it added a second one with the same ID. 🤔&lt;/p&gt;

&lt;p&gt;After some furious searching, I finally came across this post about &lt;a href="https://medium.com/apache-hudi-blogs/employing-the-right-indexes-for-fast-updates-deletes-in-apache-hudi-814d863635f6" rel="noopener noreferrer"&gt;employing the right indexes in Apache Hudi&lt;/a&gt;. Specifically, this line caught my attention:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Global indexes enforce uniqueness of keys across all partitions of a table i.e guarantees that exactly one record exists in the table for a given record key.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ah-ha! In the example, we're updating a partition value. &lt;em&gt;BY DEFAULT&lt;/em&gt;, the &lt;code&gt;hoodie.index.type&lt;/code&gt; is &lt;code&gt;BLOOM&lt;/code&gt;. I tried changing it to &lt;code&gt;GLOBAL_BLOOM&lt;/code&gt;, and when updating the record, it wrote it into the old partition. It turns out that there is &lt;em&gt;also&lt;/em&gt; a &lt;code&gt;hoodie.bloom.index.update.partition.path&lt;/code&gt; setting that will also update the partition path. This defaults to &lt;code&gt;true&lt;/code&gt; in Hudi v0.9.0, but I'm using v0.8.0 where it defaults to &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that there is a performance/storage impact to enabling global indexes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So flipping that, I got the expected behavior. Using the example from the EMR docs, my code now looks like this:&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing initial dataset
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;inputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01T13:51:39.340396Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;101&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01T12:14:58.597216Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;102&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01T13:51:40.417052Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;103&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01T13:51:40.519832Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;104&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01T12:15:00.512679Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;105&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2015-01-01T13:51:42.248818Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creation_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_update_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Specify common DataSourceWriteOptions in the single hudiOptions variable
&lt;/span&gt;&lt;span class="n"&gt;hudiOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.table.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_hudi_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.write.recordkey.field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.write.partitionpath.field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creation_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.write.precombine.field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_update_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.hive_sync.enable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.hive_sync.table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_hudi_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.hive_sync.partition_fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creation_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.hive_sync.partition_extractor_class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.hudi.hive.MultiPartKeysValueExtractor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.index.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GLOBAL_BLOOM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# This is required if we want to ensure we upsert a record, even if the partition changes
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.bloom.index.update.partition.path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Write a DataFrame as a Hudi dataset
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.hudi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.write.operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;hudiOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;BUCKET&amp;gt;/tmp/myhudidataset_001/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Updating one &lt;em&gt;partition&lt;/em&gt; row
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;

&lt;span class="n"&gt;updateDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;creation_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2021-09-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;updateDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.hudi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hoodie.datasource.write.operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upsert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;hudiOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;BUCKET&amp;gt;/tmp/myhudidataset_001/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resulting Parquet Files
&lt;/h3&gt;

&lt;p&gt;Now if we look at the Parquet files on S3, we can see that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The old partition has a new Parquet file with the record removed&lt;/li&gt;
&lt;li&gt;There is a new partition with the single record
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://&amp;lt;BUCKET&amp;gt;/tmp/myhudidataset_001/

2021-09-23 11:45:23     434901 tmp/myhudidataset_001/2015-01-01/cd4b4b74-13f7-4c1e-a7ce-110bba8e16fd-0_0-404-90423_20210923184511.parquet
2021-09-23 11:45:44     434864 tmp/myhudidataset_001/2015-01-01/cd4b4b74-13f7-4c1e-a7ce-110bba8e16fd-0_0-442-103950_20210923184526.parquet
2021-09-23 11:45:23     434863 tmp/myhudidataset_001/2015-01-02/578ea02b-09f0-4952-afe5-94d44d158d29-0_1-404-90424_20210923184511.parquet
2021-09-23 11:45:43     434895 tmp/myhudidataset_001/2021-09-22/d67c9b50-1034-44b2-8ec9-2f3b1dcbf26c-0_1-442-103951_20210923184526.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Athena Compatibility
&lt;/h3&gt;

&lt;p&gt;We can also query this dataset from Athena and see the updated data!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"my_hudi_table"&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pg4jgfv7empx230ibgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pg4jgfv7empx230ibgl.png" alt="Athena Results" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note the different &lt;code&gt;_hoodie_file_name&lt;/code&gt; for record id &lt;code&gt;100&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Awesome! Now that I understand what's going on, it makes perfect sense. 🙌&lt;/p&gt;
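&lt;p&gt;If you want to verify this yourself, a quick check (a minimal sketch, not part of the original notebook - the exact path glob needed depends on your Hudi version) is to read the dataset back with Spark and look at Hudi's metadata columns for that record:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Read the Hudi dataset back and inspect the metadata columns for record id 100.
# Depending on the Hudi version, you may need to append a glob such as /*/* to the path.
hudi_df = spark.read.format("org.apache.hudi").load("s3://&amp;lt;BUCKET&amp;gt;/tmp/myhudidataset_001/")
(
    hudi_df.where("id = '100'")
    .select("_hoodie_partition_path", "_hoodie_file_name", "id", "creation_date")
    .show(truncate=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;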

</description>
      <category>aws</category>
      <category>hudi</category>
      <category>datalakes</category>
      <category>spark</category>
    </item>
    <item>
      <title>Continuous Deployment of Jupyter Notebooks</title>
      <dc:creator>Damon P. Cortesi</dc:creator>
      <pubDate>Wed, 14 Jul 2021 23:37:28 +0000</pubDate>
      <link>https://dev.to/dacort/continuous-deployment-of-jupyter-notebooks-4adh</link>
      <guid>https://dev.to/dacort/continuous-deployment-of-jupyter-notebooks-4adh</guid>
      <description>&lt;p&gt;This is a guide on how to use AWS CodePipeline to continuously deploy Jupyter notebooks to an S3-backed static website.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Since I started using &lt;a href="https://aws.amazon.com/emr/features/studio/" rel="noopener noreferrer"&gt;EMR Studio&lt;/a&gt;, I've been making more use of Jupyter notebooks as scratch pads and often want to be able to easily share the results of my research. I hunted around for a few different solutions and while there are some good ones like &lt;a href="https://nbconvert.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;nbconvert&lt;/a&gt; and &lt;a href="https://github.com/mwouts/jupytext/" rel="noopener noreferrer"&gt;jupytext&lt;/a&gt;, I wanted something a bit simpler and off-the-shelf. This post from &lt;a href="https://www.linkedin.com/in/mikkelhartmann/" rel="noopener noreferrer"&gt;Mikkel Hartmann&lt;/a&gt; about &lt;a href="http://mikkelhartmann.dk/2019/05/14/static-website-from-jupyter-notebooks.html" rel="noopener noreferrer"&gt;making a static website from Jupyter Notebooks&lt;/a&gt; led me to &lt;a href="https://www.mkdocs.org/" rel="noopener noreferrer"&gt;MkDocs&lt;/a&gt; and luckily, I came across &lt;a href="https://github.com/greenape/mknotebooks" rel="noopener noreferrer"&gt;mknotebooks&lt;/a&gt;, which offers a simple plugin for MkDocs. 😅&lt;/p&gt;

&lt;p&gt;So, with a simple static site generator geared toward project documentation, a plugin that renders Jupyter notebooks quite well, and a few fancy code pipelines... I can easily push my notebooks to production. Let's go!&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;This is the architecture we'll be implementing. This will all be built using the &lt;a href="https://aws.amazon.com/cdk/" rel="noopener noreferrer"&gt;AWS Cloud Development Kit&lt;/a&gt; (CDK).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1249zn5980hv3pnu67oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1249zn5980hv3pnu67oz.png" alt="Jupyter Notebook Continuous Deployment Architecture" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll be creating the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 S3 buckets to store our logs and website artifacts&lt;/li&gt;
&lt;li&gt;A CodeCommit repository that holds our site and notebooks&lt;/li&gt;
&lt;li&gt;A CodeBuild project that generates the static site&lt;/li&gt;
&lt;li&gt;A CodePipeline that is triggered by new commits, builds the site, and deploys it to S3&lt;/li&gt;
&lt;li&gt;A CloudFront Distribution that serves the site&lt;/li&gt;
&lt;li&gt;And optionally an ACM certificate if you want an alternate domain name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I won't go into the details of the entire CDK stack, but instead will show how to deploy the CD pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pre-requisites
&lt;/h3&gt;

&lt;p&gt;You'll need to have &lt;a href="https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_prerequisites" rel="noopener noreferrer"&gt;CDK installed&lt;/a&gt; (&amp;gt;= v1.107.0) and Python &amp;gt;= 3.9.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://github.com/nodenv/nodenv" rel="noopener noreferrer"&gt;nodenv&lt;/a&gt; and &lt;a href="https://virtualenv.pypa.io/en/latest/" rel="noopener noreferrer"&gt;virtualenv&lt;/a&gt; for my respective environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# I use node 14.5.0&lt;/span&gt;
nodenv shell 14.5.0

&lt;span class="c"&gt;# And Python3&lt;/span&gt;
virtualenv &lt;span class="nt"&gt;-p&lt;/span&gt; python3.9 .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bootstrapping
&lt;/h3&gt;

&lt;p&gt;The source code is available in &lt;a href="https://github.com/dacort/jupyter-static-website" rel="noopener noreferrer"&gt;dacort/jupyter-static-website&lt;/a&gt;. In order to get started, we just need to clone that repo and deploy our CDK stack!&lt;/p&gt;

&lt;p&gt;This project is a two-phase deploy because CloudFront certificates need to be in &lt;code&gt;us-east-1&lt;/code&gt;. If you &lt;em&gt;do not&lt;/em&gt; need a custom domain, you can skip the first part.&lt;/p&gt;

&lt;p&gt;First, clone the project and install the necessary requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dacort/jupyter-static-website.git
&lt;span class="nb"&gt;cd &lt;/span&gt;jupyter-static-website
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll also need to &lt;a href="https://docs.aws.amazon.com/cdk/latest/guide/bootstrapping.html" rel="noopener noreferrer"&gt;bootstrap&lt;/a&gt; your AWS CDK environment in the account and region you want to deploy Part 2 in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk bootstrap aws://ACCOUNT-NUMBER-1/REGION-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Part 1 - CloudFront Certificate
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;If you are not using a custom domain, skip to Part 2&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This project only supports using the default CloudFront certificate and a DNS-validated CNAME. In order to generate the certificate, you'll need to run the command below, go into the &lt;a href="https://console.aws.amazon.com/acm/home?region=us-east-1#/" rel="noopener noreferrer"&gt;AWS Certificate Manager console&lt;/a&gt; and make sure you follow the validation instructions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy CloudfrontCertificateStack &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;domain_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;notebooks.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the domain is validated, the stack should finish provisioning. &lt;/p&gt;

&lt;p&gt;One of the outputs from this stack will be &lt;code&gt;CloudfrontCertificateStack.certificatearn&lt;/code&gt; - you'll need the value of this for the next phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 2 - Jupyter CD Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;If you are not using a custom domain, you can omit both of the &lt;code&gt;-c&lt;/code&gt; options below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you want to deploy to a different region, make sure you set the &lt;code&gt;AWS_DEFAULT_REGION&lt;/code&gt; environment variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy EmrStudioPublisherStack &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;domain_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;notebooks.example.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nv"&gt;certificate_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:acm:us-east-1:012345678912:certificate/f07b01a4-3e8c-4639-8a22-b7a20a832de3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this stack finishes, you should have a CodeCommit repository you can make changes to, a CloudFront distribution, and a publicly accessible URL (found in the &lt;code&gt;EmrStudioPublisherStack.cloudfrontendpoint&lt;/code&gt; output) that has a pre-populated example site.&lt;/p&gt;

&lt;p&gt;The site will take a few minutes to deploy - you'll be able to keep an eye on the status in the &lt;a href="https://console.aws.amazon.com/codesuite/codepipeline/pipelines" rel="noopener noreferrer"&gt;CodePipeline console&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Usage is pretty straightforward. &lt;code&gt;git clone&lt;/code&gt; the repository, add a new notebook, and push it back up! If you're using EMR Studio, you can add your CodeCommit repository and make your changes to your Jupyter notebooks there.&lt;/p&gt;

&lt;p&gt;I made a video about &lt;a href="https://www.youtube.com/watch?v=ZdbUTxBjBIs" rel="noopener noreferrer"&gt;connecting to Git in EMR Studio&lt;/a&gt; that you might find useful.&lt;/p&gt;

&lt;p&gt;Any new notebooks added in the &lt;code&gt;site/docs/notebooks/&lt;/code&gt; directory will automatically be published.&lt;/p&gt;

&lt;p&gt;You can add links to the notebooks by updating the &lt;code&gt;nav&lt;/code&gt; section of the &lt;code&gt;mkdocs.yml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nav&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Home&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;index.md&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Notebooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Oura Sleep Analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebooks/damons_sleep.ipynb&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Intro to Data Processing on AWS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebooks/intro_data_processing_aws.ipynb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This YAML config will generate a nav dropdown like so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdbgpem6y11d9n1a9hut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdbgpem6y11d9n1a9hut.png" alt="Navigation example" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Usage
&lt;/h3&gt;

&lt;p&gt;Note that not &lt;em&gt;all&lt;/em&gt; images or libraries render nicely when converting to HTML. This is why, for example, my plotly notebook had to use &lt;code&gt;fig.show(renderer="jupyterlab")&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In addition, if you paste multiple images into your notebook's Markdown, mknotebooks &lt;a href="https://github.com/greenape/mknotebooks/issues/466" rel="noopener noreferrer"&gt;only renders one of them&lt;/a&gt;. To work around this, I added a pre-build step that uniquifies all the image attachments in Markdown cells - a rough sketch of the idea is below.&lt;/p&gt;
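&lt;p&gt;The actual pre-build script isn't shown here, but a hypothetical sketch of the idea looks something like this (the field names follow the standard &lt;code&gt;.ipynb&lt;/code&gt; JSON format; the renaming scheme is just illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sys


def uniquify_attachments(path):
    """Rename markdown-cell image attachments so every name is unique across the notebook."""
    with open(path) as f:
        nb = json.load(f)

    counter = 0
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown" or not cell.get("attachments"):
            continue
        renamed = {}
        source = cell["source"] if isinstance(cell["source"], list) else [cell["source"]]
        for old_name, data in cell["attachments"].items():
            new_name = f"img{counter}-{old_name}"
            counter += 1
            renamed[new_name] = data
            # Update references like ![alt](attachment:image.png) in the cell source
            source = [line.replace(f"attachment:{old_name}", f"attachment:{new_name}") for line in source]
        cell["attachments"] = renamed
        cell["source"] = source

    with open(path, "w") as f:
        json.dump(nb, f, indent=1)


if __name__ == "__main__":
    uniquify_attachments(sys.argv[1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;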

</description>
      <category>aws</category>
      <category>jupyter</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>Build your own Air Quality Map with OpenAQ and EMR on EKS</title>
      <dc:creator>Damon P. Cortesi</dc:creator>
      <pubDate>Fri, 09 Jul 2021 18:48:52 +0000</pubDate>
      <link>https://dev.to/dacort/build-your-own-air-quality-map-with-openaq-and-emr-on-eks-4nl3</link>
      <guid>https://dev.to/dacort/build-your-own-air-quality-map-with-openaq-and-emr-on-eks-4nl3</guid>
      <description>&lt;p&gt;Fire season is closely approaching and as somebody that spent two weeks last year hunkered down inside with my browser glued to various air quality sites, I wanted to show how to use data from OpenAQ to build your own air quality analysis. &lt;/p&gt;

&lt;p&gt;With Amazon EMR on EKS, you can now &lt;a href="https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/" rel="noopener noreferrer"&gt;customize and package your own Apache Spark dependencies&lt;/a&gt; and I use that functionality for this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;OpenAQ maintains a &lt;a href="https://registry.opendata.aws/openaq/" rel="noopener noreferrer"&gt;publicly accessible dataset of various air quality metrics&lt;/a&gt; that's updated every half hour. &lt;a href="https://docs.bokeh.org/en/latest/index.html" rel="noopener noreferrer"&gt;Bokeh&lt;/a&gt; is a popular library for Python data visualization. While it includes sample data for US county and state boundaries, we're going to use &lt;a href="https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html" rel="noopener noreferrer"&gt;shapefiles from census.gov&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'll use an Apache Spark job on EMR on EKS to read the initial dataset from the S3 bucket, filter it for our use case, and then combine it with the boundary data from census.gov in order to draw a map of the current air quality.&lt;/p&gt;

&lt;p&gt;This post also shows how to use the custom containers support in EMR on EKS to build our own container image with the necessary dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account with access to Amazon Elastic Container Registry (ECR)&lt;/li&gt;
&lt;li&gt;An EMR on EKS cluster already set up&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;A container registry to push your image to&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building the EMR on EKS Container Image
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Download the EMR base image
&lt;/h3&gt;

&lt;p&gt;For this post, we'll be using the &lt;code&gt;us-west-2&lt;/code&gt; region and EMR 6.3.0 release. Each region and release has a different base image URL, and you can find the full list &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-tag.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecr get-login-password &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
    | docker login &lt;span class="nt"&gt;--username&lt;/span&gt; AWS &lt;span class="nt"&gt;--password-stdin&lt;/span&gt; 895885662937.dkr.ecr.us-west-2.amazonaws.com

docker pull 895885662937.dkr.ecr.us-west-2.amazonaws.com/notebook-spark/emr-6.3.0:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Customize the image
&lt;/h3&gt;

&lt;p&gt;EMR on EKS comes with a variety of &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-install-libraries-and-kernels.html" rel="noopener noreferrer"&gt;default libraries&lt;/a&gt; installed, including plotly and seaborn, but I wanted to try out Bokeh for this illustration as it has a great &lt;a href="https://docs.bokeh.org/en/latest/docs/gallery/texas.html" rel="noopener noreferrer"&gt;choropleth example&lt;/a&gt; and it's a library I've been hearing about occasionally. I was hoping to use Bokeh's &lt;code&gt;sampledata&lt;/code&gt; for US state and county boundaries, but I ended up using &lt;a href="https://geopandas.org/" rel="noopener noreferrer"&gt;GeoPandas&lt;/a&gt; to re-project my map to a conic projection so Michigan wasn't squashed up against Wisconsin. :) GeoPandas makes it easy to read in shapefiles, so I just used the census.gov-provided state and county data.&lt;/p&gt;
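&lt;p&gt;For reference, the re-projection itself is essentially a one-liner once GeoPandas has loaded a shapefile. This is just a hedged sketch - EPSG:5070 is one common Albers projection for the contiguous US, not necessarily the exact projection used later in the job:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import geopandas as gpd

# Load the census.gov county boundaries and re-project from lat/lon to an Albers conic projection
counties = gpd.read_file("https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_county_500k.zip")
counties_albers = counties.to_crs(epsg=5070)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;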

&lt;p&gt;Bokeh also uses Selenium and Chrome for its static image generation, so we go ahead and install Chrome on the container image as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; 895885662937.dkr.ecr.us-west-2.amazonaws.com/notebook-spark/emr-6.3.0:latest&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;

&lt;span class="c"&gt;# Install Chrome&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;curl https://intoli.com/install-google-chrome.sh | bash &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;mv&lt;/span&gt; /usr/bin/google-chrome-stable /usr/bin/chrome

&lt;span class="c"&gt;# We need to upgrade pip in order to install pyproj&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip

&lt;span class="c"&gt;# If you pip install as root, use this&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nv"&gt;bokeh&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.3.2 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nv"&gt;boto3&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.17.93 &lt;span class="se"&gt;\
&lt;/span&gt;    chromedriver-py&lt;span class="o"&gt;==&lt;/span&gt;91.0.4472.19.0 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nv"&gt;geopandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.9.0 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nv"&gt;selenium&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.141.0 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nv"&gt;shapely&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.7.1

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/local/lib/python3.7/site-packages/chromedriver_py/chromedriver_linux64 /usr/local/bin/chromedriver

&lt;span class="c"&gt;# Install bokeh sample data to /usr/local/share&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /root/.bokeh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"sampledata_dir: /usr/local/share/bokeh"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /root/.bokeh/config &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    bokeh sampledata

&lt;span class="c"&gt;# Also install census data into the image :)&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt;&lt;span class="s"&gt; https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_state_500k.zip  /usr/local/share/bokeh/&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt;&lt;span class="s"&gt; https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_county_500k.zip /usr/local/share/bokeh/&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;644 /usr/local/share/bokeh/cb&lt;span class="k"&gt;*&lt;/span&gt;.zip

&lt;span class="c"&gt;# This is a simple test to make sure generating the image works properly&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; test /test/&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; hadoop:hadoop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build and push
&lt;/h3&gt;

&lt;p&gt;Great, we've customized our image – now we just need to build and push it to a container registry somewhere! For this post, I chose GitHub, but you can use any container registry like ECR or DockerHub.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The below commands assume you have a GitHub Personal Access Token that has access to push images in the &lt;code&gt;CR_PAT&lt;/code&gt; environment variable.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; emr-6.3.0-bokeh:latest &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GH_USERNAME
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$CR_PAT&lt;/span&gt;| docker login ghcr.io &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--password-stdin&lt;/span&gt;
docker tag emr-6.3.0-bokeh:latest ghcr.io/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/emr-6.3.0-bokeh:latest
docker push ghcr.io/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/emr-6.3.0-bokeh:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great, now your image is ready to go! Let's look at the code we're going to use to generate our air quality map.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code walkthrough
&lt;/h2&gt;

&lt;p&gt;If you already built your image, you can run the code below locally. In order to access S3 data, you'll have to set your &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; environment variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; airq-demo &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; AWS_ACCESS_KEY_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; AWS_SECRET_ACCESS_KEY &lt;span class="se"&gt;\&lt;/span&gt;
    emr-6.3.0-bokeh &lt;span class="se"&gt;\&lt;/span&gt;
    pyspark &lt;span class="nt"&gt;--deploy-mode&lt;/span&gt; client &lt;span class="nt"&gt;--master&lt;/span&gt; &lt;span class="s1"&gt;'local[1]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading and filtering OpenAQ Data
&lt;/h3&gt;

&lt;p&gt;The first thing we need to do is read the data for today's date into a Spark dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://openaq-fetches/realtime-gzipped/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+---------------+---------+--------------------+-------+--------------------+--------------------+------+---------+-----------------+----------+-----+-----+
|         attribution|averagingPeriod|     city|         coordinates|country|                date|            location|mobile|parameter|       sourceName|sourceType| unit|value|
+--------------------+---------------+---------+--------------------+-------+--------------------+--------------------+------+---------+-----------------+----------+-----+-----+
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-13T22:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 25.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-13T23:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 16.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T00:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 18.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T01:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 23.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T02:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 23.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T03:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 21.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T04:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 20.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T05:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 16.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T06:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 17.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T07:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 18.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T08:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 20.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T09:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 26.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T10:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 29.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T11:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 34.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T12:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 33.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T13:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 40.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T14:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 39.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T15:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 41.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T16:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 50.0|
|[{EPA AirNow DOS,...|   {hours, 1.0}|Abu Dhabi|{24.424399, 54.43...|     AE|{2021-06-14T17:00...|US Diplomatic Pos...| false|     pm25|StateAir_AbuDhabi|government|µg/m³| 56.0|
+--------------------+---------------+---------+--------------------+-------+--------------------+--------------------+------+---------+-----------------+----------+-----+-----+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can quickly see a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data is provided from all over the globe, but we just want US data&lt;/li&gt;
&lt;li&gt;We have coordinates and country, but that's it for location data&lt;/li&gt;
&lt;li&gt;There are multiple different types of readings&lt;/li&gt;
&lt;li&gt;There are multiple different readings per day per location
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----+---------+                                                               
| unit|parameter|
+-----+---------+
|µg/m³|       bc|
|µg/m³|       co|
|  ppm|       co|
|  ppm|      no2|
|µg/m³|      no2|
|µg/m³|       o3|
|  ppm|       o3|
|µg/m³|     pm10|
|µg/m³|     pm25|
|  ppm|      so2|
|µg/m³|      so2|
+-----+---------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, let's go ahead and filter down to the most recent PM2.5 reading in the United States.&lt;/p&gt;

&lt;p&gt;To do that, we apply a couple of &lt;code&gt;where&lt;/code&gt; filters and then use a window function (&lt;code&gt;last&lt;/code&gt;) to get the last reading.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Filter down to US locations and PM2.5 readings only
&lt;/span&gt;&lt;span class="n"&gt;usdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameter&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pm25&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coordinates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve the most recent pm2.5 reading per county
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.window&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;
&lt;span class="n"&gt;windowSpec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date.utc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rangeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unboundedPreceding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unboundedFollowing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;last_reading_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;usdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;windowSpec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coordinates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;last_reading_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also only selected the &lt;code&gt;coordinates&lt;/code&gt; and &lt;code&gt;last_value&lt;/code&gt; columns as these are all we need at this point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+----------+                                               
|         coordinates|last_value|
+--------------------+----------+
|{38.6619, -121.7278}|       2.0|
| {41.9767, -91.6878}|       4.9|
|{39.54092, -119.7...|       8.0|
|{43.629605, -72.3...|       9.0|
|{46.8505, -111.98...|      10.0|
|{39.818715, -75.4...|       8.5|
+--------------------+----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mapping coordinates to counties
&lt;/h3&gt;

&lt;p&gt;This was the most "fun" part of this journey. Bokeh provides some sample data and I initially just created a UDF that looked up the first county ID using the Polygon &lt;code&gt;intersects&lt;/code&gt; method. Unfortunately, I then wanted to re-project the map to a conic projection (Albers). Bokeh's geo support isn't very strong, so I ended up using GeoPandas to do the re-projection. That worked well, but the Bokeh county data wasn't in a format I could use with GeoPandas, so I ended up downloading &lt;a href="https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html" rel="noopener noreferrer"&gt;Shapefiles from the Census Bureau&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, we've got our &lt;code&gt;last_reading_df&lt;/code&gt; dataframe. Let's map those coordinates to counties. The county data is relatively small (12 MB zipped), so I created a broadcast variable of &lt;code&gt;GEOID&lt;/code&gt; -&amp;gt; &lt;code&gt;Geometry&lt;/code&gt; mappings that can be used in a UDF to figure out if a PM2.5 reading is inside a specific county.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download the census data and create a broadcast variable
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;geopandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gpd&lt;/span&gt;

&lt;span class="n"&gt;COUNTY_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_county_500k.zip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;countydf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COUNTY_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bc_county&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countydf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEOID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;countydf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;geometry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;

&lt;span class="n"&gt;countydf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see we're just mapping the &lt;code&gt;GEOID&lt;/code&gt; column to the &lt;code&gt;geometry&lt;/code&gt; column, which is a polygon object containing the county boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  STATEFP COUNTYFP  COUNTYNS        AFFGEOID  GEOID       NAME          NAMELSAD STUSPS  STATE_NAME LSAD       ALAND     AWATER                                           geometry
0      21      141  00516917  0500000US21141  21141      Logan      Logan County     KY    Kentucky   06  1430224002   12479211  POLYGON ((-87.06037 36.68085, -87.06002 36.708...
1      36      081  00974139  0500000US36081  36081     Queens     Queens County     NY    New York   06   281594050  188444349  POLYGON ((-73.96262 40.73903, -73.96243 40.739...
2      34      017  00882278  0500000US34017  34017     Hudson     Hudson County     NJ  New Jersey   06   119640822   41836491  MULTIPOLYGON (((-74.04220 40.69997, -74.03900 ...
3      34      019  00882228  0500000US34019  34019  Hunterdon  Hunterdon County     NJ  New Jersey   06  1108086284   24761598  POLYGON ((-75.19511 40.57969, -75.19466 40.581...
4      21      147  00516926  0500000US21147  21147   McCreary   McCreary County     KY    Kentucky   06  1105416696   10730402  POLYGON ((-84.77845 36.60329, -84.73068 36.665...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a UDF to find the county a coordinate is in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This just brute-forces the list of GEOIDs/polygons and returns the first GEOID that intersects. There is likely a more elegant way to do this - one option using a spatial index is sketched after the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shapely.geometry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Point&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_first_county_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;geo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bc_county&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;geo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;intersects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="n"&gt;find_first_county_id_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;find_first_county_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
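
&lt;p&gt;As promised above, here's one more elegant option: use a spatial index so each lookup only tests a handful of candidate counties instead of scanning all of them. This is a hedged sketch using shapely 1.x's &lt;code&gt;STRtree&lt;/code&gt; (the index is built lazily inside the UDF from the broadcast dictionary, since the tree itself doesn't serialize cleanly):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from shapely.strtree import STRtree
from shapely.geometry import Point
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Lazily-built spatial index, cached per worker
_index_cache = {}

def find_first_county_id_indexed(longitude: float, latitude: float):
    if "tree" not in _index_cache:
        _index_cache["tree"] = STRtree(list(bc_county.value.values()))
        # In shapely 1.x, STRtree.query returns geometry objects, so map them back to GEOIDs by id()
        _index_cache["geoid_by_geom"] = {id(g): geoid for geoid, g in bc_county.value.items()}
    p = Point(longitude, latitude)
    for geom in _index_cache["tree"].query(p):
        if geom.intersects(p):
            return _index_cache["geoid_by_geom"][id(geom)]
    return None

find_first_county_id_indexed_udf = udf(find_first_county_id_indexed, StringType())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;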



&lt;ul&gt;
&lt;li&gt;Now we apply the UDF to our &lt;code&gt;last_reading_df&lt;/code&gt; dataframe
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the county that this reading is from
&lt;/span&gt;&lt;span class="n"&gt;mapped_county_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_reading_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEOID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;find_first_county_id_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;last_reading_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_reading_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;latitude&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEOID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;And then finally we calculate the average PM2.5 value per county
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate the average reading per county
&lt;/span&gt;&lt;span class="n"&gt;pm_avg_by_county&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mapped_county_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEOID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg(last_value)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pm_avg_by_county&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----+------------------+                                                      
|GEOID|         avg_value|
+-----+------------------+
|31157|              16.0|
|49053|               3.0|
|26153|               6.9|
|36029|               1.1|
|42101|             10.66|
+-----+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cool! So now we have a &lt;code&gt;GEOID&lt;/code&gt; we can use in our GeoPandas dataframe and an average value of the most recent PM2.5 reading for that county. &lt;/p&gt;

&lt;h3&gt;
  
  
  Generating our Air Quality map
&lt;/h3&gt;

&lt;p&gt;Now that we've got an average PM2.5 value per county, we need to join this with our map data and generate an image!&lt;/p&gt;

&lt;p&gt;The first step is reading in US State and County shapefiles. We fetched these from census.gov while building the image and they're stored in &lt;code&gt;/usr/local/share/bokeh&lt;/code&gt;. We also exclude any state not in the continental US.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;geopandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gpd&lt;/span&gt;

&lt;span class="n"&gt;STATE_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file:///usr/local/share/bokeh/cb_2020_us_state_500k.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;COUNTY_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file:///usr/local/share/bokeh/cb_2020_us_county_500k.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;EXCLUDED_STATES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;county_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COUNTY_FILE&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STUSPS not in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;EXCLUDED_STATES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;state_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STATE_FILE&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STUSPS not in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;EXCLUDED_STATES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we just do a simple &lt;code&gt;merge&lt;/code&gt; on the GeoPandas dataframe, convert our maps to the Albers projection, and serialize them as GeoJSON strings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Merge in our air quality data
&lt;/span&gt;&lt;span class="n"&gt;county_aqi_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;county_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pm_avg_by_county&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEOID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert to a "proper" Albers projection :)
&lt;/span&gt;&lt;span class="n"&gt;state_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_crs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ESRI:102003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;county_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;county_aqi_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_crs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ESRI:102003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now comes the fun part! Our data is all prepped, we've averaged the most recent data by county, and built GeoJSON for everything we need. Let's map it!&lt;/p&gt;

&lt;p&gt;I won't go into the details of every line, but we'll make use of Bokeh's awesome &lt;code&gt;GeoJSONDataSource&lt;/code&gt; functionality, add a &lt;code&gt;LinearColorMapper&lt;/code&gt; that automatically shades the counties for us by the &lt;code&gt;avg_value&lt;/code&gt; column using the &lt;code&gt;Reds9&lt;/code&gt; palette, and add a &lt;code&gt;ColorBar&lt;/code&gt; on the right-hand side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColorBar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GeoJSONDataSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LinearColorMapper&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.palettes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Reds9&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;palette&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.plotting&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;

&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US Air Quality Data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;plot_width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;plot_height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;toolbar_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;x_axis_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y_axis_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tooltips&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;County&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Air Quality Index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@avg_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid_line_color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# This just adds our state lines
&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fill_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;line_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;line_width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GeoJSONDataSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;geojson&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;state_json&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add our county data and shade them based on "avg_value"
&lt;/span&gt;&lt;span class="n"&gt;color_mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearColorMapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;palette&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;color_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fill_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fill_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;color_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;color_mapper&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;line_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;line_width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GeoJSONDataSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;geojson&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;county_json&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now add a color bar legend on the right-hand side
&lt;/span&gt;&lt;span class="n"&gt;color_bar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ColorBar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color_mapper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color_mapper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_standoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color_bar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, let's go ahead and export the PNG!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_png&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.io.webdriver&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_chromium_webdriver&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_chromium_webdriver&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-sandbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;export_png&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;map.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if you're running on a Mac, you can just copy the generated map to your local system and open it up!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;cp &lt;/span&gt;airq-demo:/home/hadoop/map.png &lt;span class="nb"&gt;.&lt;/span&gt;
open map.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk7q3vpf99k5w4su3myo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk7q3vpf99k5w4su3myo.png" alt="Air Quality map" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running on EMR on EKS
&lt;/h3&gt;

&lt;p&gt;I've bundled this all up into a pyspark script in my &lt;code&gt;demo-code&lt;/code&gt; repo. &lt;/p&gt;

&lt;p&gt;This demo assumes you already have an EMR on EKS virtual cluster up and running, you've built the image in the first part and pushed it to a container registry, and the IAM Role you use to run the job has access to both read and write to an S3 bucket.&lt;/p&gt;
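
&lt;p&gt;If that execution role doesn't have S3 access yet, the sketch below shows one way to attach a minimal read/write policy. The role name and &lt;code&gt;&amp;lt;BUCKET&amp;gt;&lt;/code&gt; are placeholders - substitute your own values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: grant the job execution role read/write access to your bucket
# (role name and &amp;lt;BUCKET&amp;gt; are placeholders)
aws iam put-role-policy \
    --role-name emr_eks_default_role \
    --policy-name airq-demo-s3-access \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::&amp;lt;BUCKET&amp;gt;", "arn:aws:s3:::&amp;lt;BUCKET&amp;gt;/*"]
        }]
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;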

&lt;p&gt;First, download the &lt;code&gt;generate_aqi_map.py&lt;/code&gt; code from the GitHub repo.&lt;/p&gt;
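
&lt;p&gt;For example, assuming the repo lives at &lt;code&gt;github.com/dacort/demo-code&lt;/code&gt; (the exact path to the script inside it may differ), you can clone it and locate the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Clone the repo and locate the script (repo URL and path are assumptions)
git clone https://github.com/dacort/demo-code.git
cd demo-code
find . -name generate_aqi_map.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;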

&lt;p&gt;Then, upload that script to an S3 bucket you have access to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;generate_aqi_map.py s3://&amp;lt;BUCKET&amp;gt;/code/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, just run your job! The pyspark script takes a few parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;S3_BUCKET&amp;gt;&lt;/code&gt; - The S3 bucket you want to upload the generated image to&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;PREFIX&amp;gt;&lt;/code&gt; - The prefix in the bucket where you want the image located&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--date 2021-01-01&lt;/code&gt; (optional) - The specific date you want to generate data for

&lt;ul&gt;
&lt;li&gt;Defaults to UTC today
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;BUCKET_NAME&amp;gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;EMR_EKS_CLUSTER_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;abcdefghijklmno1234567890
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;EMR_EKS_EXECUTION_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:iam::123456789012:role/emr_eks_default_role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace ghcr.io/OWNER/emr-6.3.0-bokeh:latest below with your image URL&lt;/span&gt;
aws emr-containers start-job-run &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--virtual-cluster-id&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;EMR_EKS_CLUSTER_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; openaq-conus &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--execution-role-arn&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;EMR_EKS_EXECUTION_ARN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--release-label&lt;/span&gt; emr-6.3.0-latest &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-driver&lt;/span&gt; &lt;span class="s1"&gt;'{
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'/code/generate_aqi_map.py",
            "entryPointArguments": ["'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'", "output/airq/"],
            "sparkSubmitParameters": "--conf spark.kubernetes.container.image=ghcr.io/OWNER/emr-6.3.0-bokeh:latest"
        }
    }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--configuration-overrides&lt;/span&gt; &lt;span class="s1"&gt;'{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": { "logUri": "s3://'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'/logs/" }
        }
    }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0000000abcdefg12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openaq-conus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:emr-containers:us-east-2:123456789012:/virtualclusters/abcdefghijklmno1234567890/jobruns/0000000abcdefg12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"virtualClusterId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abcdefghijklmno1234567890"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
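
&lt;p&gt;The run above generates the map for today (UTC). To render a specific day instead, you could pass the optional &lt;code&gt;--date&lt;/code&gt; flag through &lt;code&gt;entryPointArguments&lt;/code&gt; - a sketch of just the modified &lt;code&gt;--job-driver&lt;/code&gt; argument, with the rest of the command unchanged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Same start-job-run command as above, with "--date" appended to the arguments
    --job-driver '{
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/generate_aqi_map.py",
            "entryPointArguments": ["'${S3_BUCKET}'", "output/airq/", "--date", "2021-01-01"],
            "sparkSubmitParameters": "--conf spark.kubernetes.container.image=ghcr.io/OWNER/emr-6.3.0-bokeh:latest"
        }
    }' \
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;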



&lt;p&gt;While the job is running, you can fetch the status of the job using the &lt;code&gt;emr-containers describe-job-run&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws emr-containers describe-job-run &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--virtual-cluster-id&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;EMR_EKS_CLUSTER_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--id&lt;/span&gt; 0000000abcdefg12345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
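
&lt;p&gt;If you only care about the state itself, you can filter the response with a JMESPath &lt;code&gt;--query&lt;/code&gt; - it should print something like &lt;code&gt;SUBMITTED&lt;/code&gt;, &lt;code&gt;RUNNING&lt;/code&gt;, or &lt;code&gt;COMPLETED&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print only the current job state
aws emr-containers describe-job-run \
    --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
    --id 0000000abcdefg12345 \
    --query 'jobRun.state' --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;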



&lt;p&gt;Once the job is in the &lt;code&gt;COMPLETED&lt;/code&gt; state, you should be able to copy the resulting image from your S3 bucket!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/output/airq/2021-06-24-latest.png &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if you open that file, you'll get the most recent PM2.5 readings!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk7q3vpf99k5w4su3myo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk7q3vpf99k5w4su3myo.png" alt="Air Quality map" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapup
&lt;/h2&gt;

&lt;p&gt;Be sure to check out the &lt;a href="https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/" rel="noopener noreferrer"&gt;launch post&lt;/a&gt; for more details, the documentation for &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images.html" rel="noopener noreferrer"&gt;customizing Docker images for EMR on EKS&lt;/a&gt;, and my &lt;a href="https://youtu.be/0x4DRKmNPfQ" rel="noopener noreferrer"&gt;demo video&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>spark</category>
      <category>emr</category>
    </item>
  </channel>
</rss>
