<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neylson Crepalde</title>
    <description>The latest articles on DEV Community by Neylson Crepalde (@neylsoncrepalde).</description>
    <link>https://dev.to/neylsoncrepalde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F897876%2F3db842a6-2dc8-428d-b1a5-8bca0cf32963.jpeg</url>
      <title>DEV Community: Neylson Crepalde</title>
      <link>https://dev.to/neylsoncrepalde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neylsoncrepalde"/>
    <language>en</language>
    <item>
      <title>How to access data from other AWS account using Athena</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Sat, 24 Jun 2023 03:36:29 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-access-data-from-other-aws-account-using-athena-32ie</link>
      <guid>https://dev.to/aws-builders/how-to-access-data-from-other-aws-account-using-athena-32ie</guid>
      <description>&lt;p&gt;If you find yourself collaborating with a Data Product team, there's a high probability that you'll encounter the need to access data residing in an S3 bucket located in a different account. While this task may not be overly complicated, it isn't always as straightforward as one might hope, leading many colleagues to seek guidance on the matter. In the following article, I'll provide you with a comprehensive guide on how to effortlessly establish cross-account access between Athena and S3 in just two simple steps. So, without further ado, let's delve into the details!&lt;/p&gt;

&lt;h2&gt;Step 1: Bucket Policy in Source Account&lt;/h2&gt;

&lt;p&gt;The first thing you need to do is create a bucket policy in the source account (where the data in S3 actually lives) that allows the "client" account to read data. Go to the bucket in S3, click on the "Permissions" tab and edit the bucket policy as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Id": "cross-account-bucket-policy-ney",
    "Statement": [
        {
            "Sid": "CrossAccountPermission",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::&amp;lt;CLIENT-ACCOUNT-NUMBER&amp;gt;:root"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::&amp;lt;BUCKET-NAME&amp;gt;",
                "arn:aws:s3:::&amp;lt;BUCKET-NAME&amp;gt;/silver/titanic/*"
            ]
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This bucket policy allows all users within the client account to access this bucket and get objects within the &lt;code&gt;/silver/titanic/&lt;/code&gt; folders. If you want to restrict access to a specific user, you can replace &lt;code&gt;root&lt;/code&gt; with &lt;code&gt;user/&amp;lt;USERNAME&amp;gt;&lt;/code&gt;, for instance &lt;code&gt;user/neylson.crepalde&lt;/code&gt;.&lt;/p&gt;
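
&lt;p&gt;If you prefer to apply the policy programmatically, here is a minimal boto3 sketch of the user-scoped variant. The bucket name and account number are placeholders, and the username is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: apply the bucket policy from the source account with boto3.
# &amp;lt;BUCKET-NAME&amp;gt; and &amp;lt;CLIENT-ACCOUNT-NUMBER&amp;gt; are placeholders, not real values.
import json

import boto3

bucket = "&amp;lt;BUCKET-NAME&amp;gt;"
policy = {
    "Version": "2012-10-17",
    "Id": "cross-account-bucket-policy-ney",
    "Statement": [{
        "Sid": "CrossAccountPermission",
        "Effect": "Allow",
        # Scope access to a single user instead of the whole client account
        "Principal": {"AWS": "arn:aws:iam::&amp;lt;CLIENT-ACCOUNT-NUMBER&amp;gt;:user/neylson.crepalde"},
        "Action": [
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:ListBucketMultipartUploads",
            "s3:ListMultipartUploadParts",
            "s3:AbortMultipartUpload",
        ],
        "Resource": [
            f"arn:aws:s3:::{bucket}",
            f"arn:aws:s3:::{bucket}/silver/titanic/*",
        ],
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;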

&lt;h2&gt;Step 2: Create an external table in Athena in the Client Account&lt;/h2&gt;

&lt;p&gt;In the client account, navigate to the Athena console and open a new query editor. Run a SQL statement to create an external table pointing to the S3 bucket in the source account. In our case, we are testing with the (very famous) TITANIC dataset partitioned by one of its columns, &lt;code&gt;pclass&lt;/code&gt;, stored as a &lt;a href="https://docs.delta.io/latest/delta-intro.html" rel="noopener noreferrer"&gt;delta table&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTERNAL TABLE `titanicdelta`(
  `passengerid` int, 
  `survived` int, 
  `name` string, 
  `sex` string, 
  `age` double)
PARTITIONED BY ( 
  `pclass` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://&amp;lt;BUCKET-NAME&amp;gt;/silver/titanic/_symlink_format_manifest'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Because this data is partitioned, after creating the external table you have to load the data partitions with the following SQL command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;MSCK REPAIR TABLE `titanicdelta`
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;And you're done! Now, if you query your data in the client account:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkqdjk53qi1yqgsz7873.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkqdjk53qi1yqgsz7873.png" alt="SQL query"&gt;&lt;/a&gt;&lt;/p&gt;
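
&lt;p&gt;If you want to run the same check programmatically from the client account, here is a minimal boto3 sketch. It assumes the table lives in the &lt;code&gt;default&lt;/code&gt; database, and the query-results bucket is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: query the cross-account table with boto3.
# The database name and results bucket below are assumptions/placeholders.
import time

import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString="SELECT * FROM titanicdelta LIMIT 10",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://&amp;lt;QUERY-RESULTS-BUCKET&amp;gt;/athena/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then fetch the rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;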

&lt;p&gt;By following these two straightforward steps, you can establish a secure and efficient cross-account connection between Athena and S3. This enables you and your Data Product team to access and analyze data stored in a remote S3 bucket effortlessly. So, the next time you encounter the need to access data from another account, you can confidently navigate the process and achieve your goals without any hassle.&lt;/p&gt;

&lt;p&gt;Remember, fostering collaboration and enabling seamless data access across different accounts is crucial for efficient and streamlined workflows within a Data Product team. By mastering this skill, you can enhance your productivity and contribute to the success of your projects.&lt;/p&gt;

</description>
      <category>crossaccount</category>
      <category>athena</category>
      <category>iam</category>
    </item>
    <item>
      <title>Data Governance Hands On with Amazon DataZone</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Mon, 22 May 2023 11:47:13 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-governance-hands-on-with-amazon-datazone-3k5o</link>
      <guid>https://dev.to/aws-builders/data-governance-hands-on-with-amazon-datazone-3k5o</guid>
      <description>&lt;p&gt;At re:Invent 2022, AWS announced its new data governance solution, Amazon DataZone. Although the tool is currently in preview, it is now available in the US East (Northern Virginia), US West (Oregon) and Europe (Ireland) regions.&lt;/p&gt;

&lt;p&gt;Data Governance is one of the hottest topics on the market today. Companies across the international market have pointed out problems in this area, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient data cataloging;&lt;/li&gt;
&lt;li&gt;Discovery and documentation of data that gives users autonomy for decision making;&lt;/li&gt;
&lt;li&gt;Data Literacy — sharing data knowledge across the organization allowing the operation to become increasingly data driven;&lt;/li&gt;
&lt;li&gt;Data Quality — Correct, reliable and consistent data;&lt;/li&gt;
&lt;li&gt;Data Availability — Data available at all times and failsafe processing and serving pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In response, a pool of tools appeared on the market with features that cover some of the challenges cited, especially those related to data cataloging. &lt;strong&gt;Informatica's&lt;/strong&gt; tool is perhaps the best known among the licensed options. Among the open source tools, I highlight &lt;strong&gt;Data Hub&lt;/strong&gt; (&lt;a href="http://www.datahubproject.io" rel="noopener noreferrer"&gt;www.datahubproject.io&lt;/a&gt;) developed at LinkedIn, &lt;strong&gt;Open Metadata&lt;/strong&gt; (&lt;a href="https://open-metadata.org/" rel="noopener noreferrer"&gt;https://open-metadata.org/&lt;/a&gt;) and &lt;strong&gt;Amundsen&lt;/strong&gt; (&lt;a href="https://www.amundsen.io" rel="noopener noreferrer"&gt;https://www.amundsen.io&lt;/a&gt;) powered by Lyft. In addition to cataloging and discovering data artifacts, these tools provide a view of data lineage, include technical documentation and business terms, and build relationships between data artifacts. It is also possible to register data owners (the people responsible for the data) in those tools. This greatly facilitates the access request and evaluation process (which today is a major bottleneck).&lt;/p&gt;

&lt;h1&gt;Amazon DataZone&lt;/h1&gt;

&lt;p&gt;I always tell my friends: "It was about time for AWS to launch its data governance tool!". DataZone comes with the main features needed for a good data catalog, in addition to security features and access management integrated into the AWS environment. Let's take a look.&lt;/p&gt;

&lt;p&gt;The first step in using DataZone is the creation of a data domain (oriented to a business vertical, for example, sales, logistics, finance). It is worth noting that AWS took great care in showing all the access policies granted to DataZone, as the tool will have great power in the environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgw90ooqm97241jog7ax0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgw90ooqm97241jog7ax0.png" alt="View of the necessary accesses to the DataZone. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Policy details are available by clicking &lt;em&gt;View permission details&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdpwhdenxariz14kjie3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdpwhdenxariz14kjie3.png" alt="Details of DataZone access policies. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, you can leverage IAM Identity Center (formerly AWS SSO) for granting domain access permissions. However, the domain must be deployed in the same region as the Identity Center. In my case, as you can see, this was not possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpa3okhptry0geh0ffe1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpa3okhptry0geh0ffe1f.png" alt="Access management configuration. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few seconds to create the domain, the access link becomes available. The tool's interface impresses with a beautiful design. The next step is to create a project within the existing domain. The tool already suggests a default profile that has a native connection with Athena, AWS Glue and S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75o65jab3dmnj6szcjob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75o65jab3dmnj6szcjob.png" alt="DataZone project creation. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the project is ready, we will publish data to be queried in the tool. When the project was created, DataZone automatically created two databases in Amazon Athena, one with the _pub_db suffix and one with the _sub_db suffix. The "pub" database is used by data-producing teams to publish their tables. In my case, I already had some Glue crawlers configured to automatically map tables in S3 - the 3 tables from Brazil's Higher Education Census (Censo da Educação Superior) 2019 and one table from Brazil's National Student Performance Exam 2017, all public datasets from the Ministry of Education. I just edited the crawlers so that these tables would land in the "pub" database. After that, it is necessary to publish the data in DataZone. In the project interface, we click on Publish Data and see the screen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjgo2rf11nchaz3jdfut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjgo2rf11nchaz3jdfut.png" alt="Data publication. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;
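
&lt;p&gt;Repointing an existing crawler at the "pub" database can also be done with a quick boto3 call. A minimal sketch, with a hypothetical crawler name and database name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: point an existing Glue crawler at the DataZone "pub" database.
# Both names below are placeholders for your own resources.
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="censo-superior-crawler",        # hypothetical crawler name
    DatabaseName="sales_project_pub_db",  # the DataZone-created "pub" database
)
glue.start_crawler(Name="censo-superior-crawler")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;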

&lt;p&gt;After filling in the settings, and once the job has executed successfully, we can access the ingested tables in the "publishing" menu on the project page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far3x9mt98si36o9nttvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far3x9mt98si36o9nttvq.png" alt="Menu of published tables. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that all tables are still marked as drafts, that is, they are not yet active and visible in the catalog. By clicking on the table name, it is possible to view general information such as the location in S3 where the data is stored, region, data format, table schema and subscribers (if any). On the schema screen, we can edit the columns, giving them a more readable name (the real name of the column is not changed and continues to be displayed in the interface) and a detailed description. A "readable" title and description can also be added to the table. After editing, it is necessary to click on &lt;em&gt;set asset to active&lt;/em&gt; so that the table can be found by other users of the catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ntm9tk801d8uxyasmyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ntm9tk801d8uxyasmyy.png" alt="Table schema with readable column names and description. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After activating the tables, they become available on the Data catalog page and also via search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltmy4ugsv7aprqu1b4rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltmy4ugsv7aprqu1b4rb.png" alt="Tables cataloged in the Sales domain. Author's elaboration."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To request access to these artifacts, users registered in another project can open a subscription request as consumers (subscribers). From there, data publishers can approve requests and set granular per-table permissions for their consumption.&lt;/p&gt;

&lt;h1&gt;Final evaluation&lt;/h1&gt;

&lt;p&gt;DataZone is still in preview and is not intended by Amazon for production use. The tool already has several important features for data governance but there is still a lot to develop. Below, I write down some pros and cons of the tool so far.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy integration with the AWS environment;&lt;/li&gt;
&lt;li&gt;Integrated security and access management with AWS Identity Center or IAM;&lt;/li&gt;
&lt;li&gt;Easy metadata ingestion;&lt;/li&gt;
&lt;li&gt;Creation of automated projects with CloudFormation (transparent to the user);&lt;/li&gt;
&lt;li&gt;Efficient data search;&lt;/li&gt;
&lt;li&gt;Glossary of business terms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The tool still does not have data lineage implemented, a very important feature to understand the construction of KPIs;&lt;/li&gt;
&lt;li&gt;It does not have integration with any data quality framework;&lt;/li&gt;
&lt;li&gt;It is not possible to register other data artifacts such as dashboards, charts, pipelines, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon DataZone is still a tool under development, and it has enormous potential. I look forward to the next development steps of this promising tool.&lt;/p&gt;

</description>
      <category>datagovernance</category>
      <category>datacatalog</category>
      <category>datazone</category>
      <category>beginners</category>
    </item>
    <item>
      <title>We need to talk about Data Contracts</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Fri, 17 Mar 2023 15:02:36 +0000</pubDate>
      <link>https://dev.to/neylsoncrepalde/we-need-to-talk-about-data-contracts-1b1o</link>
      <guid>https://dev.to/neylsoncrepalde/we-need-to-talk-about-data-contracts-1b1o</guid>
      <description>&lt;p&gt;The term &lt;em&gt;Data Contracts&lt;/em&gt; is the latest buzzword in the data world and has been heavily explored in publications around the world. Although the subject has not yet reached Brazil with the force it deserves, the pain that it proposes to solve is already very much present in the market. But what are we really talking about and what is this pain?&lt;/p&gt;

&lt;h2&gt;Seeking to be Data Driven&lt;/h2&gt;

&lt;p&gt;In recent years, organizations in the Brazilian market have made a great effort to become more and more &lt;em&gt;Data Driven&lt;/em&gt;. In the beginning, the aim was to gather the data available in the company and mobilize it to bring intelligence to day-to-day decision-making.&lt;/p&gt;

&lt;p&gt;After overcoming this challenge, the market realized that, although it managed to use the data, the data was unstructured, disorganized and siloed in different systems. At this stage, the priority became building an environment that would make it possible to work with very large volumes of data. It was also important that this architecture allowed efficient and fast scaling of resources when necessary and that it brought operational efficiency to the company's processes, making pipelines more automated and performant. It was critical, at this stage, to make data from different sources and in different formats able to "talk to each other", giving the user a consolidated view of the phenomenon underlying their decisions.&lt;/p&gt;

&lt;p&gt;At the time of writing this article, I realize that the market has already achieved a certain maturity in data architectures. Companies' Data Lakes and Data Mesh structures are already running in production with data pipelines that can deliver data in a consolidated manner to the business user. HOWEVER, there is an important consequence here:&lt;/p&gt;

&lt;p&gt;As organizations move towards being more data driven, they are, in fact, anchoring their business operation on data. This makes the architecture enormously critical: if there is any type of failure, or data becomes unavailable somehow, the business operation can grind to a halt (!!!).&lt;/p&gt;

&lt;p&gt;Unfortunately, data unavailability is a very common situation in the daily lives of data teams. It is very common for these teams to receive a distressed call from a business user saying that data is not available, has not been updated, is blank or something like that. This is amplified when the organization is large and complex, with several teams &lt;strong&gt;producing&lt;/strong&gt; and &lt;strong&gt;consuming&lt;/strong&gt; data. In such cases, the process's failure points multiply. According to &lt;a href="https://www.linkedin.com/in/barrmoses/" rel="noopener noreferrer"&gt;Barr Moses&lt;/a&gt;, CEO of Monte Carlo, among the top data challenges are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Data pipelines constantly break and create quality and usability issues."&lt;/li&gt;
&lt;li&gt;"There is a communication chasm between service implementers, data engineers and data consumers" (Check out the original article &lt;a href="https://barrmoses.medium.com/implementing-data-contracts-7-key-learnings-d214a5947d5e" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;high availability&lt;/strong&gt; of data, therefore, presents itself as a first-order problem to be solved. There are other very important issues in this space such as security, access management, quality… but high availability takes on enormous relevance in this context. What to do?&lt;/p&gt;

&lt;h2&gt;What are Data Contracts&lt;/h2&gt;

&lt;p&gt;Data contracts emerged as a proposed solution to the aforementioned problem. The term gained prominence in the market with &lt;a href="https://www.linkedin.com/in/chad-sanderson/" rel="noopener noreferrer"&gt;Chad Sanderson&lt;/a&gt; in August 2022 in his text &lt;a href="https://dataproducts.substack.com/p/the-rise-of-data-contracts" rel="noopener noreferrer"&gt;The Rise of Data Contracts&lt;/a&gt;. In this article, he poses the problem and proposes the concept of data contracts as "API-like agreements between Software Engineers who own services and Data Consumers that understand how the business works in order to generate well-modeled, high-quality, trusted, real-time data".&lt;/p&gt;

&lt;p&gt;Maggie Hays comments that a data contract needs to define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"what data needs to move from a (producer’s) source to a (consumer’s) destination"&lt;/li&gt;
&lt;li&gt;"the shape of that data, its schema, and semantics"&lt;/li&gt;
&lt;li&gt;"expectations around availability and data quality"&lt;/li&gt;
&lt;li&gt;"details about contract violation(s) and enforcement"&lt;/li&gt;
&lt;li&gt;"how (and for how long) the consumer will use the data" (Check out the original article &lt;a href="https://blog.datahubproject.io/the-what-why-and-how-of-data-contracts-278aa7c5f294" rel="noopener noreferrer"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point that most calls my attention is that the idea of data contracts does not reside only in the clear establishment of data delivery and consumption criteria; it is also concerned with &lt;strong&gt;ensuring&lt;/strong&gt;, in an automated way, that the contract is not breached, breaking entire data pipelines and generating losses for the business operation.&lt;/p&gt;
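
&lt;p&gt;To make that enforcement idea concrete, here is a purely illustrative Python sketch (not any vendor's API): a contract declared as code, plus an automated check that fails the pipeline run before a breach propagates downstream. All names here are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: a data contract as a declared schema plus an
# automated check that stops the pipeline before a breach propagates.
from dataclasses import dataclass


@dataclass
class DataContract:
    producer: str
    consumer: str
    schema: dict              # column name to expected Python type
    max_null_fraction: float  # availability/quality expectation


orders_contract = DataContract(
    producer="checkout-service",   # hypothetical producer
    consumer="sales-analytics",    # hypothetical consumer
    schema={"order_id": int, "amount": float, "created_at": str},
    max_null_fraction=0.01,
)


def enforce(contract: DataContract, rows: list) -&amp;gt; None:
    """Raise (breaking the pipeline run) if the batch violates the contract."""
    for column, expected_type in contract.schema.items():
        values = [row.get(column) for row in rows]
        nulls = sum(v is None for v in values)
        if nulls / max(len(rows), 1) &amp;gt; contract.max_null_fraction:
            raise ValueError(f"Contract breach: too many nulls in '{column}'")
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            raise ValueError(f"Contract breach: wrong type in '{column}'")


enforce(orders_contract, [{"order_id": 1, "amount": 9.9, "created_at": "2023-03-17"}])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;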

&lt;h2&gt;My 2 cents on the matter&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A4800%2Fformat%3Awebp%2F1%2AU4YP9lqGwFD4FD87JW5FKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A4800%2Fformat%3Awebp%2F1%2AU4YP9lqGwFD4FD87JW5FKg.png" alt="Data Lineage"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Source: &lt;a href="https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4" rel="noopener noreferrer"&gt;https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to what has already been exhaustively argued, I would like to add one more feature necessary for a good implementation of data contracts: &lt;strong&gt;the visualization of the whole path of the data and its points of failure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's easy to argue that the feature I'm talking about is nothing more than data lineage, a feature available in some data catalog solutions on the market. However, lineage (although critical to a good data governance strategy) is just one part. The view that I believe is useful for data contracts would also show the &lt;strong&gt;process's points of failure&lt;/strong&gt; and &lt;strong&gt;whether these checkpoints are OK or a failure was detected&lt;/strong&gt;: a consolidated view of all the contracts that permeate the organization, a holistic view of the process.&lt;/p&gt;

&lt;p&gt;In addition to functioning as a monitor of the entire data flow in the company, this view would be extremely important in the culture and data literacy strategy, making the entire process explicit to stakeholders.&lt;/p&gt;

&lt;h2&gt;Perspectives for the future&lt;/h2&gt;

&lt;p&gt;At this moment there are some published ideas for implementing data contracts. Some use &lt;a href="https://dataproducts.substack.com/p/an-engineers-guide-to-data-contracts?utm_source=substack&amp;amp;utm_campaign=post_embed&amp;amp;utm_medium=web" rel="noopener noreferrer"&gt;Kafka's schema registry and CDC as enforcement&lt;/a&gt; in an event-oriented architecture, others propose using &lt;a href="https://dataproducts.substack.com/p/data-contracts-for-the-warehouse" rel="noopener noreferrer"&gt;dbt in a batch fashion&lt;/a&gt;... but there is still no clear definition or consolidated method to implement the concept.&lt;/p&gt;

&lt;p&gt;The expectation regarding this feature, however, is huge. The promise of accelerating the production and strategic mobilization of data along with a significant improvement in quality and availability brought this concept to the market spotlight. The forecast is that in the coming months the first products in this area will begin to emerge.&lt;/p&gt;

</description>
      <category>datacontracts</category>
      <category>datagovernance</category>
      <category>dataarchitecture</category>
    </item>
    <item>
      <title>How to run Amazon EMR Serverless with --packages flag</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Thu, 18 Aug 2022 00:57:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-run-amazon-emr-serverless-with-packages-flag-1p95</link>
      <guid>https://dev.to/aws-builders/how-to-run-amazon-emr-serverless-with-packages-flag-1p95</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39"&gt;this previous post&lt;/a&gt;, we showed how to run Delta Lake on Amazon EMR Serverless. Since then, a new release was out (6.7.0) with the &lt;code&gt;--packages&lt;/code&gt; flag implemented . This helps us getting things done with spark a lot easier. Yet, &lt;code&gt;--packages&lt;/code&gt; flag requires some extra networking setup that most of Data Scientists and Engineers are not familiar with. Our goal is to show step by step how to do it.&lt;/p&gt;

&lt;h1&gt;First, some concept explanations&lt;/h1&gt;

&lt;p&gt;When using Spark with Java dependencies, we have two options: (1) build &lt;code&gt;.jar&lt;/code&gt; files and insert them manually into the cluster, or (2) pass the dependencies to the &lt;code&gt;--packages&lt;/code&gt; flag so Spark can automatically download them from Maven. Since release 6.7.0 of EMR Serverless, this flag is available for use.&lt;/p&gt;

&lt;p&gt;The problem is that the Spark cluster must reach the internet to download packages from Maven. Amazon EMR Serverless, by default, lives outside any VPC and so cannot reach the internet. To fix that, you must create your EMR application inside a VPC. However, EMR applications can only be created in &lt;strong&gt;private subnets&lt;/strong&gt; which (by the way...) &lt;strong&gt;don't reach the internet&lt;/strong&gt; and &lt;strong&gt;cannot reach S3&lt;/strong&gt; 😭... How do we fix this?&lt;/p&gt;

&lt;h1&gt;Step one: networking&lt;/h1&gt;

&lt;p&gt;The diagram below shows the whole network structure that is necessary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c5idpkw3xfwqu7f62sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c5idpkw3xfwqu7f62sx.png" alt="Network diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is easily created on AWS in the VPC interface. Click the &lt;strong&gt;Create VPC&lt;/strong&gt; button and select &lt;strong&gt;VPC and more&lt;/strong&gt;. AWS does the heavy lifting and provides a design for a VPC with 2 public subnets, 2 private subnets, an internet gateway, the necessary route tables and an S3 endpoint (so resources inside the VPC can reach S3).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75103x0efd0axcpiixuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75103x0efd0axcpiixuq.png" alt="Create VPC"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can set the &lt;strong&gt;Number of availability zones&lt;/strong&gt; to 1 if you want, but for high availability you should work with at least 2 AZs.&lt;/p&gt;

&lt;p&gt;Next, you need to make sure you select at least one &lt;strong&gt;NAT Gateway&lt;/strong&gt;, which is responsible for letting private subnets reach the internet. Below is the screen with the final setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wu9bxllcszfxbpvrqg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wu9bxllcszfxbpvrqg8.png" alt="Final VPC configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hit &lt;strong&gt;create VPC&lt;/strong&gt; and we're done with networking. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kjxfktepf5qt52u9mbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kjxfktepf5qt52u9mbh.png" alt="Creating network resources"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last thing is to create a Security Group that allows outbound traffic to the internet. Go back to VPC in AWS and click Security Groups in the left panel. Then, click &lt;strong&gt;Create security group&lt;/strong&gt;. Name your security group, clear the pre-selected VPC and choose the one you have just created. By default, security groups don't allow any inbound traffic and allow all outbound traffic. We can leave it that way. Create the security group &lt;em&gt;et voilà&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd8y59yhbrzw1mqj9u37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd8y59yhbrzw1mqj9u37.png" alt="Create security group"&gt;&lt;/a&gt;&lt;/p&gt;
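
&lt;p&gt;The same security group can be created with a minimal boto3 sketch; the VPC id below is a placeholder for the one you just created:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: the same security group via boto3 (VPC id is a placeholder).
# New security groups allow all outbound and no inbound traffic by default,
# which is exactly what we need here.
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_security_group(
    GroupName="emr-serverless-sg",
    Description="Outbound-only SG for EMR Serverless",
    VpcId="vpc-0123456789abcdef0",  # placeholder: your new VPC
)
print("Security group id:", response["GroupId"])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;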

&lt;h1&gt;Step two: IAM Roles and Policies&lt;/h1&gt;

&lt;p&gt;You need two roles: a Service Linked Role and another role that gives permission to access S3 and Glue. We have already discussed that &lt;a href="https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39"&gt;in this previous post&lt;/a&gt; in the &lt;strong&gt;Setup - Authentication&lt;/strong&gt; section. Check it out. We will also need a dataset to work with. The famous &lt;em&gt;Titanic&lt;/em&gt; dataset should do it. You can download it &lt;a href="https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
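
&lt;p&gt;If you want to stage the dataset in S3 from code, here is a minimal sketch; the bucket name and key are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: download the Titanic CSV and stage it in your bucket.
# The bucket name and key are placeholders; the URL is from the post above.
import boto3
import requests

URL = ("https://raw.githubusercontent.com/neylsoncrepalde/"
       "titanic_data_with_semicolon/main/titanic.csv")

data = requests.get(URL, timeout=30).content
boto3.client("s3").put_object(
    Bucket="&amp;lt;YOUR-BUCKET&amp;gt;",
    Key="titanic/titanic.csv",
    Body=data,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;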

&lt;h1&gt;Step three: Create EMR Studio and an EMR Serverless Application&lt;/h1&gt;

&lt;p&gt;First, we must create an EMR Studio. If you don't have any studios created yet, this is very straightforward. After clicking Get started on the EMR Serverless home page, you can click to create a studio automatically.&lt;/p&gt;

&lt;p&gt;Second, you have to create an EMR Serverless application. Set a name and (remember!) choose release 6.7.0. To set up networking, you have to check &lt;strong&gt;Choose custom settings&lt;/strong&gt; and scroll down to &lt;strong&gt;Network connections&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv8pc9g487sz1emrqkrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv8pc9g487sz1emrqkrb.png" alt="Create application"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Network connections choose the VPC you created, the two private subnets and the security group. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb218r7jniicsr9zyaiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb218r7jniicsr9zyaiw.png" alt="EMR application networking configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Step four: Spark code&lt;/h1&gt;

&lt;p&gt;Now we prepare a simple PySpark script to simulate some modifications to the dataset (we will insert two new passengers - Ney and Sarah - and update information on two passengers who were presumed dead but found alive, Mr. Owen Braund and Mr. William Allen). Below is the code to do that.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import functions as f
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

from delta.tables import *

print("Reading CSV file from S3...")

schema = "PassengerId int, Survived int, Pclass int, Name string, Sex string, Age double, SibSp int, Parch int, Ticket string, Fare double, Cabin string, Embarked string"
df = spark.read.csv(
    "s3://&amp;lt;YOUR-BUCKET&amp;gt;/titanic", 
    header=True, schema=schema, sep=";"
)

print("Writing titanic dataset as a delta table...")
df.write.format("delta").save("s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")

print("Updating and inserting new rows...")
new = df.where("PassengerId IN (1, 5)")
new = new.withColumn("Survived", f.lit(1))
newrows = [
    (892, 1, 1, "Sarah Crepalde", "female", 23.0, 1, 0, None, None, None, None),
    (893, 0, 1, "Ney Crepalde", "male", 35.0, 1, 0, None, None, None, None)
]
newrowsdf = spark.createDataFrame(newrows, schema=schema)
new = new.union(newrowsdf)

print("Create a delta table object...")
old = DeltaTable.forPath(spark, "s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")


print("UPSERT...")
# UPSERT
(
    old.alias("old")
    .merge(new.alias("new"), 
    "old.PassengerId = new.PassengerId"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

print("Checking if everything is ok")
print("New data...")

(
    spark.read.format("delta")
    .load("s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")
    .where("PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888")
    .show()
)

print("Old data - with time travel")
(
    spark.read.format("delta")
    .option("versionAsOf", "0")
    .load("s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta")
    .where("PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888")
    .show()
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This &lt;code&gt;.py&lt;/code&gt; file should be uploaded to S3.&lt;/p&gt;

&lt;h1&gt;Step five: GO!&lt;/h1&gt;

&lt;p&gt;Now, we submit a job for execution. We can do it with the AWS CLI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

aws emr-serverless start-job-run &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--name&lt;/span&gt; Delta-Upsert &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--application-id&lt;/span&gt; &amp;lt;YOUR-APPLICATION-ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--execution-role-arn&lt;/span&gt; arn:aws:iam::&amp;lt;ACCOUNT-NUMBER&amp;gt;:role/EMRServerlessJobRole &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--job-driver&lt;/span&gt; &lt;span class="s1"&gt;'{
  "sparkSubmit": {
    "entryPoint": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/emrserverless_delta_titanic.py", 
    "sparkSubmitParameters": "--packages io.delta:delta-core_2.12:2.0.0"
  }
}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--configuration-overrides&lt;/span&gt; &lt;span class="s1"&gt;'{
"monitoringConfiguration": {
  "s3MonitoringConfiguration": {
    "logUri": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/emr-serverless-logs/"} 
  } 
}'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
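
&lt;p&gt;While the job runs, you can also poll its state from the terminal. A minimal sketch with the AWS CLI, assuming the job run ID returned by &lt;code&gt;start-job-run&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Check the state of the submitted job run (e.g. RUNNING, SUCCESS, FAILED)
aws emr-serverless get-job-run \
  --application-id &amp;lt;YOUR-APPLICATION-ID&amp;gt; \
  --job-run-id &amp;lt;YOUR-JOB-RUN-ID&amp;gt; \
  --query 'jobRun.state' --output text


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;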

&lt;p&gt;That's it! When the job is done, go to your log folder and check the logs (look for your application ID, job ID, and SPARK_DRIVER logs). You should see something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Reading CSV file from S3...
Writing titanic dataset as a delta table...
Updating and inserting new rows...
Create a delta table object...
UPSERT...
Checking if everything is ok
New data...
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       1|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       1|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
|        892|       1|     1|      Sarah Crepalde|female|23.0|    1|    0|            null|   null| null|    null|
|        893|       0|     1|        Ney Crepalde|  male|35.0|    1|    0|            null|   null| null|    null|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

Old data - with time travel
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Happy coding and build on!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bigdata</category>
      <category>spark</category>
      <category>emrserverless</category>
    </item>
    <item>
      <title>Running Delta Lake on Amazon EMR Serverless</title>
      <dc:creator>Neylson Crepalde</dc:creator>
      <pubDate>Sat, 30 Jul 2022 15:24:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39</link>
      <guid>https://dev.to/aws-builders/running-delta-lake-on-amazon-emr-serverless-3d39</guid>
<description>&lt;p&gt;Amazon EMR Serverless is a brand new AWS service made generally available on June 1st, 2022. With this service, it is possible to run serverless Spark clusters that can process terabyte-scale data very easily, using any open-source Spark libraries. Getting started with EMR Serverless can be a bit tricky. The goal of this post is to help you get your Spark+Delta jobs up and running "serverlessly". Let's get to it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup - Authentication
&lt;/h2&gt;

&lt;p&gt;In order to run EMR Serverless you'll need to configure two IAM roles: a service-linked role and a job execution role that authorizes access for your Spark jobs. The service-linked role is very straightforward to create. Go to IAM on the AWS console, click on Roles and click on &lt;strong&gt;Create role&lt;/strong&gt;,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrdg0jc7s5ugtuvzzolm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrdg0jc7s5ugtuvzzolm.png" alt="IAM Role console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;choose &lt;strong&gt;Amazon EMR Serverless&lt;/strong&gt;,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vier0c7e8tc6tth95tr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vier0c7e8tc6tth95tr.png" alt="Configurations to create a Service role"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and keep the default settings until you finish creating the role. Next, create a job role with permissions to access S3 and Glue. We will create a very open role (not the best practice) for didactic purposes. In a production environment, you should make your permissions much stricter.&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Create role&lt;/strong&gt; again in the AWS console Roles section and select &lt;strong&gt;Custom trust policy&lt;/strong&gt;. Below, in the "Service" key, replace "{}" with &lt;strong&gt;emr-serverless.amazonaws.com&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vq56stycoxnyepb6tv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8vq56stycoxnyepb6tv.png" alt="Configurations for Custom trust policy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, you can select two AWS managed policies, "AmazonS3FullAccess" and "AWSGlueConsoleFullAccess". Click next, give your new role an easily identifiable name (like "EMRServerlessJobRole") and finish creating the role.&lt;/p&gt;
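
&lt;p&gt;If you prefer to script this step, a rough AWS CLI equivalent is sketched below. The role name follows the suggestion above, and the trust policy mirrors the custom trust policy from the console:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Create the job role with a trust policy that lets EMR Serverless assume it
aws iam create-role \
  --role-name EMRServerlessJobRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "emr-serverless.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the two managed policies used in this post
aws iam attach-role-policy --role-name EMRServerlessJobRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name EMRServerlessJobRole \
  --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;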

&lt;h2&gt;
  
  
  Setup - Data
&lt;/h2&gt;

&lt;p&gt;For this post, we are working with the (very famous) &lt;strong&gt;titanic&lt;/strong&gt; dataset, which you can download &lt;a href="https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv" rel="noopener noreferrer"&gt;here&lt;/a&gt; and upload to S3.&lt;/p&gt;
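
&lt;p&gt;For example, with curl and the AWS CLI (the destination prefix below matches the path the PySpark script reads from):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Download the semicolon-separated titanic dataset and push it to S3
curl -L -o titanic.csv https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv
aws s3 cp titanic.csv s3://&amp;lt;YOUR-BUCKET&amp;gt;/titanic/titanic.csv


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;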

&lt;h2&gt;
  
  
  Data Pipeline Strategy
&lt;/h2&gt;

&lt;p&gt;Delta Lake is a great tool that implements the Lakehouse architecture. It has many cool features (such as schema evolution, data time travel, transaction logs, ACID transactions) and it is fundamentally valuable when we have a case of incremental data ingestion. Thus, we are going to simulate some changes in the titanic dataset. We will include two new passengers (Ney and Sarah) and we will update information on two passengers who were presumed dead but found alive(!!!), Mr. Owen Braund and Mr. William Allen.&lt;/p&gt;

&lt;p&gt;The first version of the data is written as a delta table, and the updates are applied with an &lt;a href="https://docs.delta.io/1.0.1/delta-update.html#upsert-into-a-table-using-merge" rel="noopener noreferrer"&gt;upsert&lt;/a&gt; transaction.&lt;/p&gt;

&lt;p&gt;The Python code to perform those operations is presented below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reading CSV file from S3...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId int, Survived int, Pclass int, Name string, Sex string, Age double, SibSp int, Parch int, Ticket string, Fare double, Cabin string, Embarked string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/titanic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writing titanic dataset as a delta table...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Updating and inserting new rows...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId IN (1, 5)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;newrows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;892&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sarah Crepalde&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;female&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;23.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;893&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ney Crepalde&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;male&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;35.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;newrowsdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newrows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newrowsdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a delta table object...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPSERT...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# UPSERT
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;old&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;old.PassengerId = new.PassengerId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checking if everything is ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New data...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Old data - with time travel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;YOUR-BUCKET&amp;gt;/silver/titanic_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PassengerId &amp;lt; 6 OR PassengerId &amp;gt; 888&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This &lt;code&gt;.py&lt;/code&gt; file should be uploaded to S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;One thing about the latest available EMR Serverless release (6.6.0) is that the &lt;code&gt;spark-submit&lt;/code&gt; flag &lt;code&gt;--packages&lt;/code&gt; is &lt;em&gt;not available&lt;/em&gt; yet (😢). So, we have an extra step to package Java and Python dependencies ourselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jars
&lt;/h3&gt;

&lt;p&gt;To use Java dependencies, we have to bundle them manually into a single .jar file. AWS has provided a &lt;a href="https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/Dockerfile.jars" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; that we can use to build the dependencies without having to install Maven locally (😍). I used this &lt;code&gt;pom.xml&lt;/code&gt; file to define the dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

&lt;span class="nt"&gt;&amp;lt;project&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0"&lt;/span&gt; &lt;span class="na"&gt;xmlns:xsi=&lt;/span&gt;&lt;span class="s"&gt;"http://www.w3.org/2001/XMLSchema-instance"&lt;/span&gt; &lt;span class="na"&gt;xsi:schemaLocation=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;modelVersion&amp;gt;&lt;/span&gt;4.0.0&lt;span class="nt"&gt;&amp;lt;/modelVersion&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.serverless-samples&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jars&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;packaging&amp;gt;&lt;/span&gt;jar&lt;span class="nt"&gt;&amp;lt;/packaging&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;jars&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;http://maven.apache.org&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.delta&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;delta-core_2.12&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.2.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;compile&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;build&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;plugins&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;plugin&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.maven.plugins&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-shade-plugin&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;executions&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;execution&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;phase&amp;gt;&lt;/span&gt;package&lt;span class="nt"&gt;&amp;lt;/phase&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;goals&amp;gt;&lt;/span&gt;
                            &lt;span class="nt"&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;shade&lt;span class="nt"&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;/goals&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;/execution&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/executions&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;finalName&amp;gt;&lt;/span&gt;uber-${artifactId}-${version}&lt;span class="nt"&gt;&amp;lt;/finalName&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/plugin&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/plugins&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/build&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We build the final .jar file with the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

docker build &lt;span class="nt"&gt;-f&lt;/span&gt; Dockerfile.jars &lt;span class="nt"&gt;--output&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output &lt;code&gt;uber-jars-1.0-SNAPSHOT.jar&lt;/code&gt; must be uploaded to S3. &lt;/p&gt;

&lt;p&gt;With this .jar file, we can use &lt;code&gt;.format("delta")&lt;/code&gt; in our Python code, but if we try to &lt;code&gt;import delta.tables&lt;/code&gt; we will get a Python dependency error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;

&lt;p&gt;We can provide Python dependencies in two ways: uploading dependency files to S3 or building a virtual environment to use in EMR Serverless. For this post, uploading a zip file with the Delta Python library is the simplest approach.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;mkdir &lt;/span&gt;dependencies
pip &lt;span class="nb"&gt;install &lt;/span&gt;delta-spark&lt;span class="o"&gt;==&lt;/span&gt;1.2.1 &lt;span class="nt"&gt;--target&lt;/span&gt; dependencies
&lt;span class="nb"&gt;cd &lt;/span&gt;dependencies
zip &lt;span class="nt"&gt;-r9&lt;/span&gt; ../emrserverless_dependencies.zip &lt;span class="nb"&gt;.&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;emrserverless_dependencies.zip&lt;/code&gt; file must also be uploaded to S3.&lt;/p&gt;
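
&lt;p&gt;For reference, both build artifacts can be pushed to S3 with the AWS CLI. The prefixes below are placeholders that match the job configuration further down:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Upload the shaded jar and the zipped Python dependencies
aws s3 cp uber-jars-1.0-SNAPSHOT.jar s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/jars/uber-jars-1.0-SNAPSHOT.jar
aws s3 cp emrserverless_dependencies.zip s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/dependencies/emrserverless_dependencies.zip


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;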

&lt;p&gt;Now, we are ready to configure our serverless Spark application.&lt;/p&gt;

&lt;h2&gt;
  
  
  EMR Serverless
&lt;/h2&gt;

&lt;p&gt;First, we must create an EMR Studio. If you don't have any studios created yet, this is very straightforward. After clicking &lt;strong&gt;Get started&lt;/strong&gt; on the EMR Serverless home page, you can click to create a studio automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r0q56c6130txdj4kxri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r0q56c6130txdj4kxri.png" alt="Create EMR Studio automatically"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click the studio URL to enter it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabox2r1g2stwrmkrwlsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabox2r1g2stwrmkrwlsr.png" alt="Studio url"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;With EMR Serverless, we don't have to create a cluster. Instead, we work with the &lt;strong&gt;application&lt;/strong&gt; concept. To create a new EMR Serverless application, click &lt;strong&gt;Create application&lt;/strong&gt;, type an application name, select a release version, and click &lt;strong&gt;Create application&lt;/strong&gt; again at the bottom of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznqiygyc4xu9ebq9ndgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznqiygyc4xu9ebq9ndgn.png" alt="Application creation page"&gt;&lt;/a&gt;&lt;/p&gt;
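
&lt;p&gt;The same can be done with the AWS CLI. A minimal sketch, assuming a Spark application on the 6.6.0 release used in this post (the application name is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Create the application; the returned applicationId is the one
# referenced as &amp;lt;YOUR-APPLICATION-ID&amp;gt; in the job submission below
aws emr-serverless create-application \
  --name delta-titanic-app \
  --type SPARK \
  --release-label emr-6.6.0


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;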

&lt;p&gt;Now, the last thing to do is to submit a Spark job. If you have the &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;aws cli&lt;/a&gt; installed, the code below will submit the job.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

aws emr-serverless start-job-run &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--name&lt;/span&gt; Delta-Upsert &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--application-id&lt;/span&gt; &amp;lt;YOUR-APPLICATION-ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--execution-role-arn&lt;/span&gt; arn:aws:iam::&amp;lt;ACCOUNT-NUMBER&amp;gt;:role/EMRServerlessJobRole &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--job-driver&lt;/span&gt; &lt;span class="s1"&gt;'{
  "sparkSubmit": {
    "entryPoint": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/emrserverless_delta_titanic.py", 
    "sparkSubmitParameters": "--jars s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/jars/uber-jars-1.0-SNAPSHOT.jar --conf spark.submit.pyFiles=s3://&amp;lt;YOUR-BUCKET&amp;gt;/pyspark/dependencies/emrserverless_dependencies.zip"
  }
}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--configuration-overrides&lt;/span&gt; &lt;span class="s1"&gt;'{
"monitoringConfiguration": {
  "s3MonitoringConfiguration": {
    "logUri": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/emr-serverless-logs/"} 
  } 
}'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Some important parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;entryPoint&lt;/em&gt; sets the S3 path for your PySpark script&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;sparkSubmitParameters&lt;/em&gt;: you should add the Java dependencies with the &lt;code&gt;--jars&lt;/code&gt; flag and set &lt;code&gt;--conf spark.submit.pyFiles=&amp;lt;YOUR .py/.zip/.egg FILE&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;s3MonitoringConfiguration&lt;/em&gt; sets the S3 path that will be used to save job logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you wish to use the console, set the job name, role and script location&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdggwkhudd69wbpz5nq23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdggwkhudd69wbpz5nq23.png" alt="Job initial parameters"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;and .jar file and .zip file location as follows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplq65fklf4wmxkuy84z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplq65fklf4wmxkuy84z9.png" alt="Jar files parameters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark job should start after this. When it finishes, check the logs folder in S3 (look for your application ID, job ID, and SPARK_DRIVER logs). You should see something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Reading CSV file from S3...
Writing titanic dataset as a delta table...
Updating and inserting new rows...
Create a delta table object...
UPSERT...
Checking if everything is ok
New data...
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       1|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       1|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
|        892|       1|     1|      Sarah Crepalde|female|23.0|    1|    0|            null|   null| null|    null|
|        893|       0|     1|        Ney Crepalde|  male|35.0|    1|    0|            null|   null| null|    null|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

Old data - with time travel
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|        889|       0|     3|"Johnston, Miss. ...|female|null|    1|    2|      W./C. 6607|  23.45| null|       S|
|        890|       1|     1|Behr, Mr. Karl Ho...|  male|26.0|    0|    0|          111369|   30.0| C148|       C|
|        891|       0|     3| Dooley, Mr. Patrick|  male|32.0|    0|    0|          370376|   7.75| null|       Q|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
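
&lt;p&gt;To pull the driver stdout to your terminal instead of browsing S3 in the console, something like the sketch below should work. The folder layout under the log URI is an assumption based on how EMR Serverless organizes logs, so adjust it if yours differs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

# Stream the gzipped driver stdout from S3 and decompress it locally
# (path layout under the log URI is an assumption; adjust as needed)
aws s3 cp s3://&amp;lt;YOUR-BUCKET&amp;gt;/emr-serverless-logs/applications/&amp;lt;YOUR-APPLICATION-ID&amp;gt;/jobs/&amp;lt;YOUR-JOB-RUN-ID&amp;gt;/SPARK_DRIVER/stdout.gz - | gunzip


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;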

&lt;p&gt;Notice the same delta table being shown at two different moments using time travel: the latest version, with Mr. Braund and Mr. Allen marked as alive and the new passengers included, in the first table, and the original version of the dataset in the second.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>deltalake</category>
      <category>spark</category>
      <category>emr</category>
    </item>
  </channel>
</rss>
