<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aki</title>
    <description>The latest articles on DEV Community by Aki (@datapenguin).</description>
    <link>https://dev.to/datapenguin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3358661%2Fe003a75e-e7e7-40a0-99ac-f328da87b768.jpg</url>
      <title>DEV Community: Aki</title>
      <link>https://dev.to/datapenguin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datapenguin"/>
    <language>en</language>
    <item>
      <title>Rethinking Lakehouse Architecture Through Data Ownership: AWS vs. Snowflake</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:31:04 +0000</pubDate>
      <link>https://dev.to/aws-builders/rethinking-lakehouse-architecture-through-data-ownership-aws-vs-snowflake-336e</link>
      <guid>https://dev.to/aws-builders/rethinking-lakehouse-architecture-through-data-ownership-aws-vs-snowflake-336e</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/e2662e118ef61d" rel="noopener noreferrer"&gt;データの主導権から考えるAWSとSnowflakeのレイクハウスアーキテクチャ&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;br&gt;
When designing a data platform, discussions about whether to lean toward AWS or Snowflake are still very common.&lt;/p&gt;

&lt;p&gt;However, with the rise of Apache Iceberg, data and platforms can now be decoupled. Because of this shift, I believe we need to reconsider the question itself.&lt;/p&gt;

&lt;p&gt;Rather than asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should we build around AWS or Snowflake?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A more fundamental question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Who owns the data?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this article, I'd like to define what I mean by &lt;em&gt;data ownership&lt;/em&gt; and explore the architectural trade-offs of AWS-centric and Snowflake-centric lakehouse designs.&lt;/p&gt;


&lt;h1&gt;
  
  
  Why Data Ownership Matters
&lt;/h1&gt;

&lt;p&gt;Apache Iceberg has made it possible to separate data from the platform that accesses it.&lt;/p&gt;

&lt;p&gt;Today, an Iceberg table stored on Amazon S3 can be accessed from Athena, Snowflake, Spark, and many other engines. As a result, choosing a product is becoming less important than deciding who is responsible for managing the data.&lt;/p&gt;

&lt;p&gt;Before diving into architectural patterns, let's first examine why this shift matters.&lt;/p&gt;
&lt;h2&gt;
  
  
  Defining Ownership Across Three Layers
&lt;/h2&gt;

&lt;p&gt;In this article, I define &lt;strong&gt;data ownership&lt;/strong&gt; through the following three layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Catalog Ownership&lt;/td&gt;
&lt;td&gt;Who owns the metadata?&lt;/td&gt;
&lt;td&gt;Glue Data Catalog / Snowflake Open Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Ownership&lt;/td&gt;
&lt;td&gt;Who can update or delete data?&lt;/td&gt;
&lt;td&gt;Glue ETL / Snowflake DML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance Ownership&lt;/td&gt;
&lt;td&gt;Who controls access policies?&lt;/td&gt;
&lt;td&gt;Lake Formation / Snowflake Horizon&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only when these three layers are consistently controlled by the same authority can we truly say that ownership exists.&lt;/p&gt;

&lt;p&gt;Conversely, when ownership is distributed or unclear, complexity tends to emerge in architecture, operations, and security.&lt;/p&gt;
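
&lt;p&gt;As a rough illustration of this definition, the following Python sketch records an owner for each of the three layers and reports whether a single authority controls all of them. The class name, field names, and owner labels are hypothetical and exist only for this example; they are not part of any AWS or Snowflake API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnershipModel:
    """Owner of each layer (labels such as "aws" are illustrative)."""

    catalog: str
    writes: str
    governance: str

    def owners(self):
        # Distinct authorities across the three layers.
        return {self.catalog, self.writes, self.governance}

    def is_consistent(self):
        # Ownership "exists" only when one authority controls every layer.
        return len(self.owners()) == 1


aws_centric = OwnershipModel(catalog="aws", writes="aws", governance="aws")
mixed = OwnershipModel(catalog="aws", writes="snowflake", governance="aws")

print(aws_centric.is_consistent())
print(mixed.is_consistent(), sorted(mixed.owners()))
```

&lt;p&gt;In the second case the check fails because writes belong to a different authority than the catalog and governance, which is exactly the situation where extra coordination is needed.&lt;/p&gt;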


&lt;h2&gt;
  
  
  The Reality of Vendor Lock-In
&lt;/h2&gt;

&lt;p&gt;Even in the Iceberg era, platform dependencies have not disappeared—they have simply changed form.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalog dependency&lt;/strong&gt;: Tables managed by Snowflake Open Catalog still rely operationally on a Snowflake-managed service, although external engines can access them through the REST Catalog API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write-engine dependency&lt;/strong&gt;: Snowflake-managed Iceberg tables are primarily updated through Snowflake, though Horizon Catalog now supports external writes from engines such as Spark. The choice of write engine remains closely tied to catalog design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance dependency&lt;/strong&gt;: Lake Formation's fine-grained permissions are fundamentally tied to the AWS ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, saying that "Iceberg eliminates vendor lock-in" is only partially true.&lt;/p&gt;

&lt;p&gt;What Iceberg removes is &lt;strong&gt;storage-format lock-in&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Dependencies around catalog management, governance, and operational processes still remain. In practice, migrating a data platform involves challenges such as governance policies, access control, metadata management, and platform-specific features.&lt;/p&gt;


&lt;h2&gt;
  
  
  Extensibility and Strategic Flexibility
&lt;/h2&gt;

&lt;p&gt;Data platforms are never finished.&lt;/p&gt;

&lt;p&gt;The rapid evolution of AI technologies and the continuous changes in the Modern Data Stack mean that architectures must adapt over time.&lt;/p&gt;

&lt;p&gt;Common examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adding or changing analytics tools&lt;/strong&gt;&lt;br&gt;
Athena may be sufficient initially, but business users may later request Snowflake access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Introducing AI workloads&lt;/strong&gt;&lt;br&gt;
Integration with SageMaker or Snowflake Cortex AI may become necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost optimization initiatives&lt;/strong&gt;&lt;br&gt;
As query volumes grow, Snowflake compute costs may become significant, leading teams to move batch processing to EMR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stronger governance requirements&lt;/strong&gt;&lt;br&gt;
Column masking or row-level security may need to be introduced later.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When ownership across the three layers is clearly defined from the beginning, these changes become easier to evaluate and implement.&lt;/p&gt;

&lt;p&gt;Without that clarity, every change raises new questions about where responsibilities and controls should reside.&lt;/p&gt;


&lt;h1&gt;
  
  
  What Changed After Iceberg?
&lt;/h1&gt;

&lt;p&gt;Historically, data and platforms were tightly coupled.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Snowflake-Centric&lt;/th&gt;
&lt;th&gt;AWS-Centric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Location&lt;/td&gt;
&lt;td&gt;Inside Snowflake&lt;/td&gt;
&lt;td&gt;Inside S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Management Ownership&lt;/td&gt;
&lt;td&gt;Snowflake owns everything&lt;/td&gt;
&lt;td&gt;AWS owns everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access from Other Engines&lt;/td&gt;
&lt;td&gt;Not possible&lt;/td&gt;
&lt;td&gt;Not directly accessible from Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Iceberg fundamentally changed this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iceberg Tables on S3
        ↓
Shared by Multiple Engines

Athena / Glue / Snowflake / Spark / Redshift ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Iceberg adds a metadata layer on top of data files (typically Parquet) stored in object storage, enabling ACID transactions and schema evolution independent of any specific compute engine.&lt;/p&gt;

&lt;p&gt;A catalog tracks each table's current metadata, which in turn records the schema and the active data files, allowing multiple engines to safely access the same table.&lt;/p&gt;
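
&lt;p&gt;To make the catalog's role concrete, here is a minimal Python sketch of the optimistic commit that Iceberg catalogs rely on: the catalog holds a pointer to each table's current metadata, and a commit succeeds only if that pointer still matches what the writer started from. &lt;code&gt;ToyCatalog&lt;/code&gt;, the table name, and the metadata file names are hypothetical; a real catalog performs this swap atomically on the server side.&lt;/p&gt;

```python
class ToyCatalog:
    """In-memory stand-in for an Iceberg catalog (illustration only)."""

    def __init__(self):
        self._current = {}  # table name -> current metadata file

    def current(self, table):
        return self._current.get(table)

    def commit(self, table, expected, new):
        # Swap the pointer only if nobody else committed in the meantime.
        if self._current.get(table) != expected:
            return False
        self._current[table] = new
        return True


catalog = ToyCatalog()
catalog.commit("sales.orders", expected=None, new="v1.metadata.json")

# Two engines read the same starting point, then both try to commit.
base = catalog.current("sales.orders")
first = catalog.commit("sales.orders", expected=base, new="v2-engine-a.metadata.json")
second = catalog.commit("sales.orders", expected=base, new="v2-engine-b.metadata.json")

print(first, second, catalog.current("sales.orders"))
```

&lt;p&gt;The second writer is rejected and would have to refresh its view of the table and retry. This only works when every engine commits through the same catalog, which is why catalog ownership matters so much.&lt;/p&gt;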

&lt;p&gt;Data files are now shareable.&lt;/p&gt;

&lt;p&gt;However, ownership of the catalog, write operations, and governance still depends on architectural decisions.&lt;/p&gt;

&lt;p&gt;In other words, deciding who manages the catalog effectively determines who owns the data.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h3&gt;
  
  
  Major Iceberg Catalog Options
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Catalog&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS Glue Data Catalog&lt;/td&gt;
&lt;td&gt;AWS-managed&lt;/td&gt;
&lt;td&gt;Supports REST Catalog API and integrates with Lake Formation governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake Open Catalog&lt;/td&gt;
&lt;td&gt;Snowflake-managed (based on Apache Polaris)&lt;/td&gt;
&lt;td&gt;REST Catalog compliant and accessible from Spark, Trino, and others&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake Horizon Catalog&lt;/td&gt;
&lt;td&gt;Snowflake service&lt;/td&gt;
&lt;td&gt;Exposes Snowflake-managed Iceberg tables through APIs; differs from Open Catalog because it is not a standalone metadata store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Snowflake-Centric Architecture
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Characteristics
&lt;/h2&gt;

&lt;p&gt;In this approach, Snowflake becomes the center of catalog management, governance, and analytics, while data files remain in external object storage such as S3.&lt;/p&gt;

&lt;p&gt;This model prioritizes simplicity and a streamlined analytics experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ownership Model
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Catalog Ownership&lt;/td&gt;
&lt;td&gt;Snowflake Open Catalog or Horizon Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Ownership&lt;/td&gt;
&lt;td&gt;Primarily Snowflake DML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance Ownership&lt;/td&gt;
&lt;td&gt;Snowflake Horizon&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Although data files remain on S3, external engines can access Snowflake-managed Iceberg tables through two mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Via Open Catalog&lt;/strong&gt;: Snowflake-managed Iceberg tables are synced to Open Catalog and exposed through the REST Catalog API. In this sync scenario, external engines have read-only access. (Note: when Open Catalog itself is used as an internal catalog, read/write access is supported.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Via Horizon Catalog&lt;/strong&gt;: Tables are exposed directly through the Horizon Iceberg REST Catalog API without syncing to Open Catalog. External engines can both read and write, and existing Snowflake users and roles can be used for access control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Governance policies such as column masking and row-level security can be applied to Iceberg tables in the same way as native Snowflake tables. When external engines access tables through Horizon Catalog, the same policies are enforced at read time. Note, however, that &lt;strong&gt;writing to tables with masking policies or tags applied is not supported from external engines&lt;/strong&gt;, which is an important constraint when planning external writes.&lt;/li&gt;
&lt;li&gt;Rich ecosystem support for BI tools such as Power BI makes Snowflake a convenient analytics front end.&lt;/li&gt;
&lt;li&gt;External engines can access Iceberg tables through Open Catalog or Horizon Catalog while reusing Snowflake users and roles as the unit of access control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Drawbacks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake warehouse compute costs can be significant for write-heavy workloads. When external engines such as Spark write through Horizon Catalog, Snowflake warehouses are not used — but &lt;strong&gt;Horizon Catalog API calls are billed at 0.5 credits per million requests&lt;/strong&gt;, so cost planning is still required.&lt;/li&gt;
&lt;li&gt;Coordination is needed when AWS services such as Glue ETL also write to the same datasets. Clearly defining who holds catalog ownership is essential.&lt;/li&gt;
&lt;/ul&gt;
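
&lt;p&gt;A quick back-of-the-envelope calculation helps with that planning. The Python sketch below applies the rate quoted above (0.5 credits per million Horizon Catalog API requests). The function name and the monthly request volume are made-up inputs, and the current rate should be checked against Snowflake's pricing documentation.&lt;/p&gt;

```python
CREDITS_PER_MILLION_REQUESTS = 0.5  # rate quoted in this article


def horizon_api_credits(request_count):
    # Credits scale linearly with the number of API requests.
    return request_count / 1_000_000 * CREDITS_PER_MILLION_REQUESTS


monthly_requests = 40_000_000  # hypothetical workload
print(horizon_api_credits(monthly_requests))  # 20.0
```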

&lt;p&gt;Even with Iceberg, many enterprises ultimately converge on a Snowflake-centric operating model because governance, metadata, and write operations all remain concentrated within Snowflake.&lt;/p&gt;

&lt;p&gt;In such cases, Iceberg provides openness in theory, but ownership remains firmly within the Snowflake ecosystem.&lt;/p&gt;




&lt;h1&gt;
  
  
  AWS-Centric Architecture
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Characteristics
&lt;/h2&gt;

&lt;p&gt;This architecture uses S3 for storage, Glue Data Catalog for metadata, and AWS-native services for ETL, analytics, and governance.&lt;/p&gt;

&lt;p&gt;Its primary advantages are flexibility and service interoperability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ownership Model
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Catalog Ownership&lt;/td&gt;
&lt;td&gt;AWS Glue Data Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Ownership&lt;/td&gt;
&lt;td&gt;Glue ETL / EMR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance Ownership&lt;/td&gt;
&lt;td&gt;Lake Formation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because Glue Data Catalog supports the Iceberg REST Catalog API, external engines such as Snowflake and Databricks can access the same tables.&lt;/p&gt;

&lt;p&gt;This enables AWS to retain ownership while allowing Snowflake to serve as an analytics front end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tight integration across Athena, Glue, EMR, and Redshift with a shared catalog.&lt;/li&gt;
&lt;li&gt;Fine-grained column- and row-level governance through Lake Formation, applicable to Iceberg tables.&lt;/li&gt;
&lt;li&gt;Ability to optimize compute engines for different workloads — EMR for large-scale batch, Athena for interactive queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Drawbacks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Increased architectural and operational complexity due to the number of AWS services involved.&lt;/li&gt;
&lt;li&gt;Additional design considerations for multi-cloud environments, as the catalog remains AWS-dependent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lake Formation is powerful, but troubleshooting permission issues can become challenging. Identifying why a specific user cannot access a specific table or row often takes considerable time, requiring mature operational practices and careful permission design.&lt;/p&gt;




&lt;h1&gt;
  
  
  Combining AWS and Snowflake
&lt;/h1&gt;

&lt;p&gt;A realistic approach is not choosing one platform over the other, but assigning clear responsibilities to each.&lt;/p&gt;

&lt;p&gt;The key is defining ownership boundaries upfront.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Owns the Data, Snowflake Powers Analytics
&lt;/h2&gt;

&lt;p&gt;This is one of the most common patterns.&lt;/p&gt;

&lt;p&gt;The goal is to maintain data ownership within AWS while leveraging Snowflake's analytics capabilities and its rich ecosystem of BI connectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────┐
│                       AWS                        │
│  S3 (Iceberg data files)                         │
│  Glue Data Catalog (Catalog Ownership)           │
│  Lake Formation (Governance Ownership)           │
│  Glue / EMR (Write Ownership)                    │
└──────────────────────┬───────────────────────────┘
                       │ Iceberg REST Catalog API
        ┌──────────────┼───────────────────┐
        ▼              ▼                   ▼
     Athena          Glue              Snowflake
  (Interactive)     (ETL)            (Analytics)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Catalog Ownership&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Ownership&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance Ownership&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Snowflake acts primarily as an analytical interface.&lt;/p&gt;

&lt;p&gt;Two variations exist:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Glue Catalog Integration (Read-Only)
&lt;/h3&gt;

&lt;p&gt;Snowflake accesses AWS-managed Iceberg tables through External Iceberg Tables. Write ownership and governance remain entirely with AWS. Lake Formation can be used as the single source of truth for access control.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Catalog-Linked Database (Read/Write)
&lt;/h3&gt;

&lt;p&gt;Snowflake can update Iceberg tables through the Iceberg REST Catalog API while the data remains stored on S3. This approach is attractive when analysts and AI workloads primarily operate in Snowflake.&lt;/p&gt;

&lt;p&gt;However, governance responsibilities become shared between AWS and Snowflake. Both Lake Formation and Snowflake-side access controls must be configured carefully — a misconfiguration in either can become a security gap. If the read-only pattern (option 1) is sufficient, consolidating governance in Lake Formation is simpler.&lt;/p&gt;

&lt;p&gt;For step-by-step implementation details of these patterns — including how to set up External Volumes, Catalog Integrations, and Catalog-Linked Databases — see this companion article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/aws-snowflake-lakehouse-2-practical-apache-iceberg-integration-patterns-812"&gt;AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison: Three Architectural Patterns
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Snowflake-Centric&lt;/th&gt;
&lt;th&gt;AWS-Centric&lt;/th&gt;
&lt;th&gt;Hybrid&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Catalog Ownership&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Ownership&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance Ownership&lt;/td&gt;
&lt;td&gt;Snowflake Horizon&lt;/td&gt;
&lt;td&gt;Lake Formation&lt;/td&gt;
&lt;td&gt;AWS (variation 1) / AWS and Snowflake (variation 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute Cost&lt;/td&gt;
&lt;td&gt;Tends to be higher&lt;/td&gt;
&lt;td&gt;Optimizable by workload&lt;/td&gt;
&lt;td&gt;Optimizable by workload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational Complexity&lt;/td&gt;
&lt;td&gt;Low to medium&lt;/td&gt;
&lt;td&gt;Medium to high&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Engine Flexibility&lt;/td&gt;
&lt;td&gt;Medium (via REST API)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Choosing the Right Pattern
&lt;/h2&gt;

&lt;p&gt;Based on the patterns above, here is a simplified decision guide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake-centric tends to fit when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analytics is BI-driven or led by non-engineers&lt;/li&gt;
&lt;li&gt;Development speed and analytics experience matter more than handling very large data volumes&lt;/li&gt;
&lt;li&gt;Centralized governance through Snowflake Horizon is preferred&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS-centric tends to fit when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data volumes are large and ETL is the dominant workload&lt;/li&gt;
&lt;li&gt;A dedicated data engineering team is already working within the AWS ecosystem&lt;/li&gt;
&lt;li&gt;Fine-grained access control through Lake Formation is a requirement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hybrid tends to fit when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different teams use different tools (e.g., engineers on AWS, analysts on Snowflake)&lt;/li&gt;
&lt;li&gt;Future extensibility for AI, ML, or multi-engine workloads is a priority&lt;/li&gt;
&lt;li&gt;AWS retains data ownership while Snowflake's query performance is still needed&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  What Happens When Ownership Is Unclear
&lt;/h1&gt;

&lt;p&gt;A common anti-pattern is building a platform that "works" without explicitly defining ownership.&lt;/p&gt;

&lt;p&gt;Typical symptoms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nobody knows who is responsible for schema changes.&lt;/strong&gt; When both Glue and Snowflake have schema owners, it becomes unclear which definition is authoritative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data written from Snowflake is not visible in Athena.&lt;/strong&gt; When two catalogs attempt to manage the same table, one may lose track of the latest snapshot, causing metadata inconsistencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance rules drift between Lake Formation and Snowflake Horizon.&lt;/strong&gt; Maintaining access policies in two places creates risk — a gap in either becomes a security vulnerability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response slows down.&lt;/strong&gt; When multiple engines can write, identifying what happened and where becomes difficult, delaying recovery.&lt;/li&gt;
&lt;/ul&gt;
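
&lt;p&gt;One way to catch the governance drift described above is to diff the grants held in each system. The Python sketch below assumes the grants have already been exported into plain dictionaries that map a principal to the tables it may read. The function name and sample data are hypothetical, and the export step itself is out of scope.&lt;/p&gt;

```python
def find_policy_drift(lake_formation, snowflake):
    """Return principals whose table grants differ between the two systems."""
    drift = {}
    for principal in sorted(set(lake_formation) | set(snowflake)):
        lf_tables = lake_formation.get(principal, set())
        sf_tables = snowflake.get(principal, set())
        if lf_tables != sf_tables:
            drift[principal] = {
                "only_in_lake_formation": sorted(lf_tables - sf_tables),
                "only_in_snowflake": sorted(sf_tables - lf_tables),
            }
    return drift


lake_formation_grants = {"analyst": {"orders"}, "engineer": {"orders", "customers"}}
snowflake_grants = {"analyst": {"orders", "customers"}, "engineer": {"orders", "customers"}}

print(find_policy_drift(lake_formation_grants, snowflake_grants))
```

&lt;p&gt;Here the analyst can read &lt;code&gt;customers&lt;/code&gt; through Snowflake but not through Lake Formation, the kind of gap that is easy to miss when policies live in two places.&lt;/p&gt;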

&lt;p&gt;These issues often evolve from technical challenges into organizational problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams blame each other over unclear responsibilities.&lt;/li&gt;
&lt;li&gt;Audits become difficult because nobody can fully explain who has access to what.&lt;/li&gt;
&lt;li&gt;Incident recovery is delayed due to unclear decision-making authority.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A running system is not necessarily a well-designed system.&lt;/p&gt;

&lt;p&gt;Ownership becomes increasingly difficult to fix after the platform has already grown.&lt;/p&gt;




&lt;h1&gt;
  
  
  "AWS or Snowflake?" Is a Secondary Question
&lt;/h1&gt;

&lt;p&gt;In practice, organizations often begin by debating whether to standardize on AWS or Snowflake.&lt;/p&gt;

&lt;p&gt;In the Iceberg era, I believe that is the wrong starting point.&lt;/p&gt;

&lt;p&gt;The first questions should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who owns the catalog?&lt;/li&gt;
&lt;li&gt;Who owns writes?&lt;/li&gt;
&lt;li&gt;Who owns governance?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once these three ownership layers are defined, the platform choice naturally follows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want all three owned by Snowflake? → Snowflake-centric architecture.&lt;/li&gt;
&lt;li&gt;Want all three owned by AWS? → AWS-centric architecture.&lt;/li&gt;
&lt;li&gt;Want AWS to own data while Snowflake provides analytics? → Hybrid architecture.&lt;/li&gt;
&lt;/ul&gt;
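
&lt;p&gt;The mapping above can be written down as a small decision function. This Python sketch is only an illustration: the function name, the owner labels, and the &lt;code&gt;snowflake_for_analytics&lt;/code&gt; flag are hypothetical, and any combination not listed above is reported as unclear rather than guessed.&lt;/p&gt;

```python
def choose_architecture(catalog, writes, governance, snowflake_for_analytics=False):
    owners = {catalog, writes, governance}
    if owners == {"snowflake"}:
        return "snowflake-centric"
    if owners == {"aws"}:
        # AWS owns the data; Snowflake may still sit on top as a query layer.
        return "hybrid" if snowflake_for_analytics else "aws-centric"
    return "unclear ownership: revisit the three questions"


print(choose_architecture("snowflake", "snowflake", "snowflake"))
print(choose_architecture("aws", "aws", "aws"))
print(choose_architecture("aws", "aws", "aws", snowflake_for_analytics=True))
print(choose_architecture("snowflake", "aws", "aws"))
```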

&lt;p&gt;Iceberg has dramatically increased flexibility around where data lives.&lt;/p&gt;

&lt;p&gt;As flexibility increases, architects must become more deliberate about defining responsibility.&lt;/p&gt;

&lt;p&gt;Starting with product selection often leads to contradictions later. A configuration where Snowflake is used as the query interface, Glue handles writes, and Lake Formation controls governance — without intentional design — is a classic symptom of ownership being distributed and unclear from the start.&lt;/p&gt;

&lt;p&gt;The hardest challenge is no longer connectivity.&lt;/p&gt;

&lt;p&gt;It is ownership.&lt;/p&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Apache Iceberg has significantly reduced storage-level vendor lock-in.&lt;/p&gt;

&lt;p&gt;However, catalog ownership, write ownership, and governance ownership still require deliberate architectural decisions.&lt;/p&gt;

&lt;p&gt;A useful decision-making sequence is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decide who owns the catalog. (Glue / Snowflake Open Catalog / Snowflake Horizon)&lt;/li&gt;
&lt;li&gt;Decide who owns writes. (AWS-native services / Snowflake)&lt;/li&gt;
&lt;li&gt;Decide who owns governance. (Lake Formation / Snowflake Horizon / both)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once those three decisions are made, choosing between AWS and Snowflake becomes much easier. From there, you can design the architecture that best fits your requirements.&lt;/p&gt;

&lt;p&gt;Ultimately, the hardest part of a modern lakehouse architecture is often not the technology itself. It is agreeing on ownership boundaries — deciding which team manages the catalog, who is responsible for data updates, and where governance policies are enforced.&lt;/p&gt;

&lt;p&gt;Technology evolves. The challenge of people and processes remains.&lt;/p&gt;

&lt;p&gt;I hope this article helps anyone evaluating lakehouse architectures built on AWS, Snowflake, and Apache Iceberg.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>iceberg</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Exploring Snowpark While Comparing It with Apache Spark</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Mon, 01 Jun 2026 04:00:00 +0000</pubDate>
      <link>https://dev.to/datapenguin/exploring-snowpark-while-comparing-it-with-apache-spark-mki</link>
      <guid>https://dev.to/datapenguin/exploring-snowpark-while-comparing-it-with-apache-spark-mki</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/91be27b34c7309" rel="noopener noreferrer"&gt;Snowparkを動かしながらSparkとの違いを整理してみる&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, I've had more opportunities to work with Snowflake when building data platforms.&lt;/p&gt;

&lt;p&gt;When working with modern data platforms, Apache Spark is often used for distributed data processing. Snowflake also provides its own data processing framework called Snowpark.&lt;/p&gt;

&lt;p&gt;If you're already familiar with Spark or AWS Glue, you may find yourself wondering:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Wait... how is Snowpark actually different from Spark?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, I'd like to organize my own understanding while exploring Snowpark's behavior and comparing it with Spark.&lt;/p&gt;

&lt;p&gt;For this experiment, everything was done entirely within Snowflake Notebooks in Snowsight.&lt;/p&gt;

&lt;p&gt;One of the biggest advantages is that no local environment setup or connection configuration is required—you can start experimenting immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Snowpark?
&lt;/h2&gt;

&lt;p&gt;Snowpark is a data processing framework provided by Snowflake.&lt;/p&gt;

&lt;p&gt;Its biggest feature is the ability to write code in Python, Java, or Scala and execute it directly inside Snowflake.&lt;/p&gt;

&lt;p&gt;Traditionally, Snowflake workloads were primarily implemented using SQL. With Snowpark, however, you can use a DataFrame API similar to Spark or pandas while keeping all processing within Snowflake.&lt;/p&gt;

&lt;p&gt;In other words, you no longer need to pull data into a local environment or AWS Lambda for processing.&lt;/p&gt;

&lt;p&gt;Some key characteristics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed execution environment – Processing runs on Snowflake warehouses with no infrastructure management required.&lt;/li&gt;
&lt;li&gt;DataFrame API – Similar developer experience to Spark and pandas.&lt;/li&gt;
&lt;li&gt;Pushdown execution – Processing runs inside Snowflake, so data does not have to be moved out of the platform first.&lt;/li&gt;
&lt;li&gt;UDF and UDTF support – Custom functions can be defined and executed inside Snowflake.&lt;/li&gt;
&lt;li&gt;Snowflake Notebook integration – Interactive development is supported directly in Snowsight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Personally, I still like Scala, but these days I write most data processing code in Python.&lt;/p&gt;

&lt;p&gt;While Scala often offers better performance, Python's simplicity and extensive ecosystem make it the more practical choice in many situations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Differences Between Spark and Snowpark
&lt;/h2&gt;

&lt;p&gt;Many people immediately think of Spark when they hear the name Snowpark.&lt;/p&gt;

&lt;p&gt;The names are similar, and the DataFrame APIs feel very familiar.&lt;/p&gt;

&lt;p&gt;However, there are several important differences.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Spark&lt;/th&gt;
&lt;th&gt;Snowpark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution Environment&lt;/td&gt;
&lt;td&gt;Distributed cluster&lt;/td&gt;
&lt;td&gt;Snowflake warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Source&lt;/td&gt;
&lt;td&gt;HDFS, S3, and others&lt;/td&gt;
&lt;td&gt;Primarily Snowflake tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Cluster size managed by user&lt;/td&gt;
&lt;td&gt;Warehouse size adjustment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages&lt;/td&gt;
&lt;td&gt;Scala, Java, Python, R, etc.&lt;/td&gt;
&lt;td&gt;Python, Java, Scala&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External Data Support&lt;/td&gt;
&lt;td&gt;Broad ecosystem support&lt;/td&gt;
&lt;td&gt;Primarily Snowflake-centric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure Management&lt;/td&gt;
&lt;td&gt;Cluster management required&lt;/td&gt;
&lt;td&gt;Fully managed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Spark requires awareness of distributed clusters and execution mechanics.&lt;/p&gt;

&lt;p&gt;Snowpark, on the other hand, is fundamentally a processing framework that operates on top of the Snowflake platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DataFrame operations are translated into SQL and executed by Snowflake's SQL engine.&lt;/p&gt;

&lt;p&gt;Unlike Spark, user code is not distributed across worker nodes.&lt;/p&gt;

&lt;p&gt;Scaling is handled by Snowflake warehouses.&lt;/p&gt;

&lt;p&gt;UDFs are an exception. UDF code is pushed into Snowflake and executed in parallel by Snowflake's infrastructure.&lt;/p&gt;

&lt;p&gt;A useful mental model is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DataFrame operations → SQL generation&lt;/li&gt;
&lt;li&gt;UDFs → Server-side parallel execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In either case, users do not need to manage clusters or DAG execution as they would in Spark.&lt;/p&gt;
&lt;/blockquote&gt;
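
&lt;p&gt;The "DataFrame operations → SQL generation" half of that mental model can be illustrated with a toy class written in plain Python. This is not how Snowpark is implemented, and &lt;code&gt;LazyFrame&lt;/code&gt; is a hypothetical name; the sketch only shows the idea that method calls record intent and a single SQL statement is produced at the end.&lt;/p&gt;

```python
class LazyFrame:
    """Toy DataFrame that only records operations and renders SQL."""

    def __init__(self, table, columns=("*",), filters=()):
        self.table = table
        self.columns = tuple(columns)
        self.filters = tuple(filters)

    def select(self, *columns):
        # Return a new frame; nothing is executed yet.
        return LazyFrame(self.table, columns, self.filters)

    def filter(self, condition):
        return LazyFrame(self.table, self.columns, self.filters + (condition,))

    def to_sql(self):
        sql = "SELECT " + ", ".join(self.columns) + " FROM " + self.table
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql


df = LazyFrame("orders").filter("status = 'paid'").select("id", "amount")
print(df.to_sql())  # SELECT id, amount FROM orders WHERE status = 'paid'
```

&lt;p&gt;Because only a SQL string leaves the client in this model, the heavy lifting stays with the engine that runs the query.&lt;/p&gt;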

&lt;p&gt;If your data is already centralized in Snowflake, Snowpark provides a convenient way to write Spark-like code without worrying about infrastructure management.&lt;/p&gt;

&lt;p&gt;Of course, AWS Glue also provides a largely serverless experience, making it another convenient option in the AWS ecosystem.&lt;/p&gt;




&lt;h1&gt;
  
  
  Getting Started
&lt;/h1&gt;

&lt;p&gt;All examples in this article are executed within Snowflake Notebooks in Snowsight.&lt;/p&gt;

&lt;p&gt;No local Python environment or connection configuration is required. The entire workflow runs directly in the browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;From the Snowsight menu:&lt;/p&gt;

&lt;p&gt;Create → Notebooks&lt;/p&gt;

&lt;p&gt;Create a new notebook.&lt;/p&gt;

&lt;p&gt;Snowpark for Python is already installed, so there is no need to run &lt;code&gt;pip install&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can start coding immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Obtaining a Session
&lt;/h2&gt;

&lt;p&gt;In local environments, Snowpark sessions are typically created using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configs&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Snowflake Notebooks, an active session already exists.&lt;/p&gt;

&lt;p&gt;You can simply retrieve it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_active_session&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_active_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Session acquired successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu0m5qkzeq2df5t9spvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu0m5qkzeq2df5t9spvr.png" width="799" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the major advantages of Snowflake Notebooks is that connection details never need to be written manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  Basic DataFrame Operations
&lt;/h2&gt;

&lt;p&gt;Let's create and manipulate a DataFrame from a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.SALES_DATA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORDER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
              &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
              &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqsy6sufuherfu9gmexo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqsy6sufuherfu9gmexo.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One important point is that no SQL is actually executed until &lt;code&gt;show()&lt;/code&gt; is called.&lt;/p&gt;

&lt;p&gt;We'll discuss this in more detail when covering lazy evaluation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Aggregations
&lt;/h2&gt;

&lt;p&gt;Grouped aggregations feel almost identical to Spark, apart from the snake_case method name &lt;code&gt;group_by&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;summary_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
               &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                   &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TOTAL_AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                   &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORDER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORDER_COUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                   &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AVG_AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;summary_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogu4lrrn4ep4asbl1tpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogu4lrrn4ep4asbl1tpf.png" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When executed, these DataFrame operations are translated into SQL and run within Snowflake.&lt;/p&gt;

&lt;p&gt;You can inspect the generated SQL using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a6kpk8p5t5vw8xt39os.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a6kpk8p5t5vw8xt39os.png" width="799" height="298"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing Results Back to a Table
&lt;/h2&gt;

&lt;p&gt;To save results into a Snowflake table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summary_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save_as_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.SALES_SUMMARY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since &lt;code&gt;save_as_table()&lt;/code&gt; does not return a result, it's often useful to reload the table to verify the output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.SALES_SUMMARY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0szjnnu89853uqs3dfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0szjnnu89853uqs3dfs.png" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;overwrite&lt;/code&gt; mode replaces the existing table.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;append&lt;/code&gt; if you want to add rows instead.&lt;/p&gt;
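
&lt;p&gt;The difference between the two modes can be sketched with a toy model. This assumes &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for Snowflake, and &lt;code&gt;save&lt;/code&gt; and &lt;code&gt;row_count&lt;/code&gt; are hypothetical helpers, not Snowpark APIs. It only illustrates the semantics: appended rows accumulate, while an overwrite leaves only the latest write.&lt;/p&gt;

```python
import sqlite3

# Toy model only: sqlite3 stands in for Snowflake, and save() is a
# hypothetical helper that mimics the two write modes.
conn = sqlite3.connect(":memory:")


def save(rows, mode):
    if mode == "overwrite":
        # Replace the table: rows from earlier writes do not survive.
        conn.execute("DROP TABLE IF EXISTS sales_summary")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary (region TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)


def row_count():
    return conn.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]


save([("Asia", 165.5), ("Europe", 80.0)], mode="append")
save([("Asia", 10.0)], mode="append")
after_append = row_count()  # rows from both writes are present

save([("Asia", 175.5)], mode="overwrite")
after_overwrite = row_count()  # only the rows from the last write remain

print(after_append, after_overwrite)
```

&lt;p&gt;For a job that may be re-run, this is why the choice matters: re-running an append duplicates rows, whereas re-running an overwrite ends in the same state.&lt;/p&gt;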




&lt;h1&gt;
  
  
  How Does Lazy Evaluation Work?
&lt;/h1&gt;

&lt;p&gt;Anyone familiar with Spark has likely encountered lazy evaluation.&lt;/p&gt;

&lt;p&gt;It can also be surprising during debugging, because a mistake in a transformation only surfaces later, when an action finally runs.&lt;/p&gt;

&lt;p&gt;Snowpark adopts the same fundamental concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Lazy Evaluation
&lt;/h2&gt;

&lt;p&gt;DataFrame transformations such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;select&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;filter&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;group_by&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are not executed immediately.&lt;/p&gt;

&lt;p&gt;These operations merely build an execution plan.&lt;/p&gt;

&lt;p&gt;Actual execution occurs only when an action is triggered.&lt;/p&gt;

&lt;p&gt;Common action operations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;show()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;collect()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;count()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;to_pandas()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write.save_as_table()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
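
&lt;p&gt;The split between transformations and actions can be illustrated with a small toy model. This is a sketch under stated assumptions, not Snowpark code: &lt;code&gt;sqlite3&lt;/code&gt; stands in for the warehouse, and &lt;code&gt;LazyQuery&lt;/code&gt;, &lt;code&gt;filter_equals&lt;/code&gt;, and the &lt;code&gt;executed&lt;/code&gt; log are hypothetical names. The log records every statement that actually reaches the engine.&lt;/p&gt;

```python
import sqlite3

# Toy model only: sqlite3 stands in for the warehouse, and LazyQuery is a
# hypothetical class, not part of Snowpark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "Asia", 120.0), (2, "Europe", 80.0), (3, "Asia", 45.5)],
)

executed = []  # every SQL statement the toy engine actually runs via collect()


class LazyQuery:
    def __init__(self, sql, params=()):
        self.sql = sql
        self.params = tuple(params)

    def filter_equals(self, column, value):
        # Transformation: wraps the current SQL in a subquery and runs nothing.
        wrapped = "SELECT * FROM (" + self.sql + ") WHERE " + column + " = ?"
        return LazyQuery(wrapped, self.params + (value,))

    def collect(self):
        # Action: the only place where SQL is sent to the engine.
        executed.append(self.sql)
        return conn.execute(self.sql, self.params).fetchall()


plan = LazyQuery("SELECT * FROM sales").filter_equals("region", "Asia")
print(len(executed))  # the plan exists, but the log is still empty

rows = plan.collect()
print(len(executed), rows)  # the log now holds the statement that was run
```

&lt;p&gt;Defining &lt;code&gt;plan&lt;/code&gt; leaves the log empty, and only &lt;code&gt;collect()&lt;/code&gt; adds an entry. That is the same pattern the Query History check in the next section demonstrates against a real Snowflake account.&lt;/p&gt;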




&lt;h2&gt;
  
  
  Verifying Lazy Evaluation
&lt;/h2&gt;

&lt;p&gt;A convenient way to inspect behavior is through &lt;code&gt;df.queries&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;df_filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.LARGE_TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STATUS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACTIVE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STATUS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATED_AT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_filtered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_filtered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows retrieved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the &lt;code&gt;print(df_filtered.queries)&lt;/code&gt; line, the generated SQL can already be inspected, even though no query has been sent to Snowflake at that point.&lt;/p&gt;

&lt;p&gt;To verify execution timing precisely, we can use Query History in Snowsight.&lt;/p&gt;

&lt;p&gt;Open:&lt;/p&gt;

&lt;p&gt;Monitoring → Query History&lt;/p&gt;

&lt;p&gt;Then perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the DataFrame.&lt;/li&gt;
&lt;li&gt;Check Query History.&lt;/li&gt;
&lt;li&gt;Confirm that no SELECT statement has been executed.&lt;/li&gt;
&lt;li&gt;Execute &lt;code&gt;collect()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Refresh Query History.&lt;/li&gt;
&lt;li&gt;Observe that the SELECT statement now appears.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Define the DataFrame
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6igp01ad2xq52zv19bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6igp01ad2xq52zv19bd.png" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No corresponding query appears yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flingdk3583mtx7yfqtda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flingdk3583mtx7yfqtda.png" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Execute collect()
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r0sl0nylkrbmpl76tvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r0sl0nylkrbmpl76tvb.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After execution, the query becomes visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8syemsxtaec8882ooi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8syemsxtaec8882ooi1.png" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This confirms that DataFrame definitions alone do not trigger execution.&lt;/p&gt;

&lt;p&gt;The SQL is executed only when &lt;code&gt;collect()&lt;/code&gt; is called.&lt;/p&gt;




&lt;h2&gt;
  
  
  Differences from Spark's Lazy Evaluation
&lt;/h2&gt;

&lt;p&gt;In Spark, lazy evaluation constructs a DAG that is optimized and executed across a cluster.&lt;/p&gt;

&lt;p&gt;In Snowpark, lazy evaluation ultimately produces SQL, which is then optimized and executed by Snowflake's query optimizer.&lt;/p&gt;

&lt;p&gt;The concept is similar, but the execution engine is fundamentally different.&lt;/p&gt;

&lt;p&gt;One particularly useful feature is that the generated SQL can be inspected via &lt;code&gt;df.queries&lt;/code&gt;, making it easier to verify exactly what Snowflake will run.&lt;/p&gt;




&lt;h1&gt;
  
  
  Can We Use Caching?
&lt;/h1&gt;

&lt;p&gt;If you're coming from Spark, your first instinct may be to use &lt;code&gt;cache()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Snowpark provides a similar capability through &lt;code&gt;cache_result()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differences from Spark cache()
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Spark &lt;code&gt;cache()&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Snowpark &lt;code&gt;cache_result()&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Memory (and disk)&lt;/td&gt;
&lt;td&gt;Temporary table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lifetime&lt;/td&gt;
&lt;td&gt;Until &lt;code&gt;unpersist()&lt;/code&gt; is called or the application ends&lt;/td&gt;
&lt;td&gt;Until the table is dropped or the session ends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;No additional write&lt;/td&gt;
&lt;td&gt;INSERT into temporary table&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Internally, &lt;code&gt;cache_result()&lt;/code&gt; materializes results into a temporary table.&lt;/p&gt;

&lt;p&gt;Subsequent operations reuse that table rather than re-running expensive transformations.&lt;/p&gt;
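
&lt;p&gt;The underlying pattern, materialize once and then read many times, can be sketched without Snowflake at all. The following is a toy illustration that assumes &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in engine. The &lt;code&gt;events&lt;/code&gt; and &lt;code&gt;cached_events&lt;/code&gt; tables are made up, and the SQL is not what Snowpark emits.&lt;/p&gt;

```python
import sqlite3

# Toy model only: sqlite3 stands in for Snowflake, and the table and column
# names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, status TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "ACTIVE", "Asia"),
        (2, "INACTIVE", "Asia"),
        (3, "ACTIVE", "Europe"),
        (4, "ACTIVE", "Asia"),
    ],
)

# The "expensive" query is run once and materialized into a temporary table.
conn.execute(
    "CREATE TEMP TABLE cached_events AS "
    "SELECT id, region FROM events WHERE status = 'ACTIVE'"
)

# Both follow-up queries read the temporary table instead of re-running the filter.
asia_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM cached_events WHERE region = 'Asia' ORDER BY id"
    )
]
per_region = dict(
    conn.execute("SELECT region, COUNT(*) FROM cached_events GROUP BY region")
)

# Explicit cleanup once the cached data is no longer needed.
conn.execute("DROP TABLE cached_events")

print(asia_ids, per_region)
```

&lt;p&gt;The extra write is the price of the cache. It pays off only when the follow-up reads together cost more than that one additional write would.&lt;/p&gt;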

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_heavy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.LARGE_TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STATUS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACTIVE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                      &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.MASTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cached_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_heavy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cache_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
                   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;cached_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using a &lt;code&gt;with&lt;/code&gt; block is often more convenient because the temporary table is automatically dropped when the block exits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;df_heavy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cache_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cached_df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
                       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Since &lt;code&gt;cache_result()&lt;/code&gt; performs an INSERT into a temporary table, it can actually make things slower when the DataFrame is only used once.&lt;/p&gt;

&lt;p&gt;It's most effective when the same expensive transformation is reused multiple times.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also observe this behavior in Snowsight.&lt;/p&gt;

&lt;p&gt;Temporary table creation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyso0e39p4pdwq581b59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyso0e39p4pdwq581b59.png" width="799" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subsequent SELECT from the temporary table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap1ipg845lbgdxb1mhqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap1ipg845lbgdxb1mhqw.png" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another query reusing the same temporary table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrx4pslv6cm53se94fek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrx4pslv6cm53se94fek.png" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Use Cases
&lt;/h1&gt;

&lt;p&gt;Let's consider some practical scenarios where Snowpark can be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL Pipelines
&lt;/h2&gt;

&lt;p&gt;Traditionally, pipelines often look like:&lt;/p&gt;

&lt;p&gt;S3 → Glue → Redshift&lt;/p&gt;

&lt;p&gt;With Snowpark, many transformations can be performed entirely within Snowflake.&lt;/p&gt;

&lt;p&gt;This reduces data movement and simplifies the overall architecture.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.RAW_EVENTS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cleaned_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_TYPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_DATE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_TIMESTAMP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_DATE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_TYPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;aggregated_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned_df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_DATE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT_TYPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USER_COUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;aggregated_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_as_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
             &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.DAILY_EVENT_SUMMARY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbus1sf8ispbcvz0j4h5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbus1sf8ispbcvz0j4h5j.png" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Custom Transformations Using UDFs
&lt;/h2&gt;

&lt;p&gt;Snowpark UDFs let you implement in Python the kind of complex logic that would be cumbersome to express in SQL.&lt;/p&gt;

&lt;p&gt;You can register UDFs using either the &lt;code&gt;@udf&lt;/code&gt; decorator or &lt;code&gt;session.udf.register()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;

&lt;span class="nd"&gt;@udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
     &lt;span class="n"&gt;input_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_region&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;region_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;North America&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Europe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;region_map&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MY_DB.MY_SCHEMA.RAW_EVENTS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NORMALIZED_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;normalize_region&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REGION_CODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



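&lt;p&gt;If the function is annotated with type hints, Snowpark can infer the types, so the explicit &lt;code&gt;return_type&lt;/code&gt; and &lt;code&gt;input_types&lt;/code&gt; arguments can be omitted:&lt;br&gt;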
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;

&lt;span class="nd"&gt;@udf&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_region&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;region_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;North America&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Europe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;region_map&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That said, simple transformations such as this lookup are often faster when implemented with built-in SQL functions, because they avoid the overhead of invoking Python for each row.&lt;/p&gt;

&lt;p&gt;As always, benchmark before deciding.&lt;/p&gt;
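
&lt;p&gt;As a rough sketch of the built-in route, the same region mapping can be written as a SQL &lt;code&gt;CASE&lt;/code&gt; expression, which Snowflake evaluates natively without calling into Python. The helper name &lt;code&gt;build_case_expression&lt;/code&gt; is hypothetical, and the commented usage assumes the &lt;code&gt;df&lt;/code&gt; DataFrame from the UDF example above; the resulting string is meant to be passed to Snowpark's &lt;code&gt;sql_expr&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_case_expression(column, mapping, default):
    """Render a dict as a SQL CASE expression over one column."""

    def quote(value):
        # Double any single quotes so values are safe inside SQL string literals.
        return "'" + value.replace("'", "''") + "'"

    branches = " ".join(
        f"WHEN {quote(key)} THEN {quote(label)}" for key, label in mapping.items()
    )
    return f"CASE {column} {branches} ELSE {quote(default)} END"


region_map = {"US": "North America", "JP": "Asia", "DE": "Europe"}
case_sql = build_case_expression("REGION_CODE", region_map, "Other")
print(case_sql)

# Hypothetical usage with Snowpark, assuming `df` has a REGION_CODE column:
#   from snowflake.snowpark import functions as F
#   df_with_region = df.with_column("NORMALIZED_REGION", F.sql_expr(case_sql))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing this against the UDF version on your own data is a reasonable starting point for the benchmark.&lt;/p&gt;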




&lt;h2&gt;
  
  
  Data Quality Validation
&lt;/h2&gt;

&lt;p&gt;Snowpark can also be used for data quality checks before processing continues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;null_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AMOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;is_null&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;null_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;null_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;null_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NULL rate exceeds the threshold: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;null_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data quality check passed &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(NULL rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;null_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspfqmgi9wjotq1c6e2dv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspfqmgi9wjotq1c6e2dv.png" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, we explored Snowpark's fundamentals, compared it with Spark, and examined its lazy evaluation behavior.&lt;/p&gt;

&lt;p&gt;For engineers already familiar with Spark, Snowpark should feel quite approachable.&lt;/p&gt;

&lt;p&gt;However, it's important to remember that execution occurs on Snowflake warehouses rather than a Spark cluster.&lt;/p&gt;

&lt;p&gt;Reviewing generated SQL and understanding how Snowflake executes queries can help avoid unexpected full-table scans and other performance issues.&lt;/p&gt;

&lt;p&gt;If your data is already centralized in Snowflake, keeping processing inside Snowflake rather than moving data to Lambda or Glue Python Shell can be a significant advantage.&lt;/p&gt;

&lt;p&gt;Reducing infrastructure management overhead and consolidating ETL processing within Snowflake can also improve maintainability.&lt;/p&gt;

&lt;p&gt;One final note: throughout this experiment, I frequently relied on Cortex Code whenever I encountered errors.&lt;/p&gt;

&lt;p&gt;The workflow of iteratively fixing notebook errors through Cortex Code was surprisingly convenient.&lt;/p&gt;

&lt;p&gt;That said, just like any AI-assisted coding workflow, it's still important to carefully validate the generated code rather than accepting it blindly.&lt;/p&gt;

&lt;p&gt;I hope this article helps anyone considering Snowpark for data processing within Snowflake.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>Organizing How to Use AWS Glue Workflow</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Fri, 22 May 2026 13:38:32 +0000</pubDate>
      <link>https://dev.to/aws-builders/organizing-how-to-use-aws-glue-workflow-4n0j</link>
      <guid>https://dev.to/aws-builders/organizing-how-to-use-aws-glue-workflow-4n0j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/5d8278b0d84881" rel="noopener noreferrer"&gt;AWS Glue Workflowの使い方について整理してみる&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Previously, I wrote an article comparing when to use AWS Step Functions versus Glue Workflow.&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/organizing-the-use-cases-of-aws-step-functions-and-glue-workflow-for-etl-processing-with-aws-glue-2o6b"&gt;Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I mentioned there, I personally like Glue Workflow and consider it an excellent service that balances simplicity and low cost.&lt;/p&gt;

&lt;p&gt;However, in recent years, Step Functions has become increasingly mainstream, and I get the impression that opportunities to work with Glue Workflow have decreased.&lt;br&gt;
Because of that, I think many people are unsure about how they should actually use it in practice.&lt;/p&gt;

&lt;p&gt;So in this article, I’d like to walk through Glue Workflow from the basics to practical usage patterns.&lt;br&gt;
I hope this helps more people become interested in Glue Workflow.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Is Glue Workflow?
&lt;/h2&gt;

&lt;p&gt;AWS Glue Workflow is Glue’s native workflow orchestration feature.&lt;br&gt;
It defines ETL pipelines by combining the following three elements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Glue Job&lt;/td&gt;
&lt;td&gt;The actual ETL processing using Spark or Python Shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue Crawler&lt;/td&gt;
&lt;td&gt;Scans data sources such as S3 and registers table definitions in the Data Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue Trigger&lt;/td&gt;
&lt;td&gt;Defines execution conditions for Jobs and Crawlers (schedule, event, conditional, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By connecting these components as a DAG (Directed Acyclic Graph), you can build ETL pipelines.&lt;/p&gt;

&lt;p&gt;Another characteristic is that workflows can be visually configured through the Glue console GUI.&lt;/p&gt;

&lt;p&gt;A major advantage is that workflows themselves incur no additional cost; you pay only for Job and Crawler execution.&lt;/p&gt;


&lt;h2&gt;
  
  
  Details of the Core Components
&lt;/h2&gt;

&lt;p&gt;The core of Glue Workflow is the Trigger mechanism.&lt;br&gt;
There are four trigger types, each with different roles.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trigger Type&lt;/th&gt;
&lt;th&gt;Execution Condition&lt;/th&gt;
&lt;th&gt;Typical Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SCHEDULED&lt;/td&gt;
&lt;td&gt;Scheduled execution using cron expressions (UTC, minimum 5-minute interval)&lt;/td&gt;
&lt;td&gt;Periodic execution such as daily ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ON_DEMAND&lt;/td&gt;
&lt;td&gt;Manual execution or via API/SDK&lt;/td&gt;
&lt;td&gt;Arbitrary execution timing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CONDITIONAL&lt;/td&gt;
&lt;td&gt;Triggered based on the status of preceding Jobs/Crawlers&lt;/td&gt;
&lt;td&gt;Chained execution between Jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EVENT&lt;/td&gt;
&lt;td&gt;Triggered by EventBridge events&lt;/td&gt;
&lt;td&gt;Event-driven pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The typical pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;SCHEDULED&lt;/code&gt;, &lt;code&gt;ON_DEMAND&lt;/code&gt;, or &lt;code&gt;EVENT&lt;/code&gt; as the workflow’s starting trigger&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;CONDITIONAL&lt;/code&gt; to connect downstream Jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the way, AWS officially recommends keeping the number of elements included in a workflow (Jobs + Crawlers + Triggers) under 100.&lt;br&gt;
Exceeding this recommendation can cause errors when resuming or stopping Workflow Runs.&lt;/p&gt;


&lt;h2&gt;
  
  
  CONDITIONAL Trigger Usage Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;CONDITIONAL&lt;/code&gt; Triggers are one of the key features that provide flexibility in Glue Workflow.&lt;br&gt;
Here are several representative patterns.&lt;/p&gt;


&lt;h3&gt;
  
  
  1. Simple Sequential Pattern
&lt;/h3&gt;

&lt;p&gt;This is the most basic usage pattern: “Run JobB after JobA succeeds.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JobA (SUCCEEDED) → JobB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4lo1qb55pycmsypu3m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4lo1qb55pycmsypu3m7.png" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can implement this simply by configuring a CONDITIONAL Trigger with the condition:&lt;br&gt;
“Start when JobA reaches the SUCCEEDED state.”&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Waiting for Multiple Jobs to Complete (AND Condition)
&lt;/h3&gt;

&lt;p&gt;This pattern runs JobC only after both JobA and JobB succeed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JobA (SUCCEEDED) ┐
                 ├→ JobC
JobB (SUCCEEDED) ┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb010g1ri50yqziptg5oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb010g1ri50yqziptg5oh.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can be implemented by setting &lt;code&gt;Logical: AND&lt;/code&gt; in the CONDITIONAL Trigger predicate and listing multiple conditions.&lt;/p&gt;

&lt;p&gt;This is useful in scenarios such as:&lt;br&gt;
“Run an aggregation Job only after multiple data sources have finished loading.”&lt;/p&gt;


&lt;h3&gt;
  
  
  3. Triggering When Any Job Completes (ANY Condition)
&lt;/h3&gt;

&lt;p&gt;This pattern runs JobC when either JobA or JobB succeeds.&lt;/p&gt;

&lt;p&gt;The predicate of a CONDITIONAL Trigger has a &lt;code&gt;Logical&lt;/code&gt; field where you can specify either &lt;code&gt;AND&lt;/code&gt; or &lt;code&gt;ANY&lt;/code&gt;.&lt;br&gt;
Using &lt;code&gt;ANY&lt;/code&gt; causes the trigger to fire as soon as any one of the specified conditions is satisfied.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gz82gf7gzogtq3unvcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gz82gf7gzogtq3unvcc.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that although this behavior is logically equivalent to “OR,” the actual Glue configuration value is &lt;code&gt;ANY&lt;/code&gt;.&lt;br&gt;
This is important when defining workflows using IaC or CLI because specifying &lt;code&gt;OR&lt;/code&gt; will result in an error.&lt;/p&gt;
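
&lt;p&gt;To make the &lt;code&gt;AND&lt;/code&gt; / &lt;code&gt;ANY&lt;/code&gt; distinction concrete, here is a minimal sketch that builds the keyword arguments for boto3's &lt;code&gt;create_trigger&lt;/code&gt; call. The helper &lt;code&gt;build_conditional_trigger&lt;/code&gt; and the trigger name &lt;code&gt;run-job-c&lt;/code&gt; are hypothetical, and the Job names are the ones used in the diagrams above. Only the dictionary is built here, so nothing is sent to AWS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_conditional_trigger(name, workflow, upstream_jobs, downstream_job, logical="AND"):
    """Build keyword arguments for boto3's glue.create_trigger call."""
    if logical not in ("AND", "ANY"):
        # Glue has no "OR" value; the any-of behavior is spelled "ANY".
        raise ValueError(f"Logical must be 'AND' or 'ANY', got {logical!r}")
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": logical,
            # One condition per upstream Job, each waiting for SUCCEEDED.
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in upstream_jobs
            ],
        },
        "Actions": [{"JobName": downstream_job}],
    }


trigger = build_conditional_trigger(
    "run-job-c", "my-workflow", ["JobA", "JobB"], "JobC", logical="ANY"
)
print(trigger["Predicate"]["Logical"])

# Hypothetical usage: boto3.client("glue").create_trigger(**trigger)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Validating the value locally like this is one way to catch an accidental &lt;code&gt;OR&lt;/code&gt; before the API call fails.&lt;/p&gt;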


&lt;h3&gt;
  
  
  4. Failure Branching (Catching FAILED States)
&lt;/h3&gt;

&lt;p&gt;CONDITIONAL Triggers can react not only to &lt;code&gt;SUCCEEDED&lt;/code&gt;, but also to states such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;FAILED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;STOPPED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TIMEOUT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ERROR&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(The supported states differ slightly between Jobs and Crawlers.)&lt;/p&gt;

&lt;p&gt;Using this feature, you can create patterns such as launching a notification Job (for example, a Python Shell Job that publishes to SNS) when a Job fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JobA (SUCCEEDED) → Downstream Processing
JobA (FAILED)    → Notification Job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2a65wew7c1kogj4sssa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2a65wew7c1kogj4sssa.png" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For relatively simple error handling, this approach allows you to avoid introducing Step Functions.&lt;/p&gt;
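
&lt;p&gt;The branching above could be sketched as two CONDITIONAL Triggers that watch the same Job for different states. This is only an illustration: the helper &lt;code&gt;on_job_state&lt;/code&gt;, the trigger names, and the Job names &lt;code&gt;DownstreamJob&lt;/code&gt; and &lt;code&gt;NotificationJob&lt;/code&gt; are hypothetical, and the dictionaries are meant to be passed to boto3's &lt;code&gt;create_trigger&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def on_job_state(name, workflow, watched_job, state, next_job):
    """Build create_trigger arguments that start next_job when watched_job reaches state."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": watched_job, "State": state}
            ]
        },
        "Actions": [{"JobName": next_job}],
    }


triggers = [
    # Normal path: continue the pipeline when JobA succeeds.
    on_job_state("job-a-succeeded", "my-workflow", "JobA", "SUCCEEDED", "DownstreamJob"),
    # Failure path: start the notification Job when JobA fails.
    on_job_state("job-a-failed", "my-workflow", "JobA", "FAILED", "NotificationJob"),
]

for trigger in triggers:
    state = trigger["Predicate"]["Conditions"][0]["State"]
    print(state, "starts", trigger["Actions"][0]["JobName"])

# Hypothetical usage:
#   glue = boto3.client("glue")
#   for trigger in triggers:
#       glue.create_trigger(**trigger)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;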




&lt;h2&gt;
  
  
  Managing Parameters with &lt;code&gt;default_run_properties&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Glue Workflow provides a property called &lt;code&gt;default_run_properties&lt;/code&gt;, which acts like globally shared variables across the workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;default_run_properties&lt;/code&gt; stores key-value pairs that can be referenced by all Jobs within the workflow.&lt;/p&gt;

&lt;p&gt;It functions as the default set of parameters passed during Workflow execution and serves as the foundation for sharing information between Jobs.&lt;/p&gt;

&lt;p&gt;One important note:&lt;br&gt;
Run Property values may appear in logs, so you should avoid storing secrets directly in them.&lt;/p&gt;

&lt;p&gt;Instead, retrieve secrets through services such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;Glue Connections&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  How to Configure It
&lt;/h3&gt;

&lt;p&gt;There are three main configuration methods.&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure from the Console
&lt;/h4&gt;

&lt;p&gt;In the Glue console:&lt;br&gt;
Workflow → Edit Properties&lt;/p&gt;

&lt;p&gt;You can add key-value pairs there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97w8stf6glxmlbk53h4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97w8stf6glxmlbk53h4v.png" width="799" height="547"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h4&gt;
  
  
  Configure via boto3
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;glue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-workflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DefaultRunProperties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-05-21&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configure via IaC (CloudFormation, etc.)
&lt;/h4&gt;

&lt;p&gt;You can specify &lt;code&gt;DefaultRunProperties&lt;/code&gt; in the &lt;code&gt;AWS::Glue::Workflow&lt;/code&gt; resource.&lt;/p&gt;


&lt;h3&gt;
  
  
  Static Parameters vs Dynamic Parameters
&lt;/h3&gt;

&lt;p&gt;Typical usage patterns include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Static parameters&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment names (&lt;code&gt;dev&lt;/code&gt; / &lt;code&gt;prod&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;S3 bucket names&lt;/li&gt;
&lt;li&gt;Values that rarely change&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dynamic parameters&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing dates&lt;/li&gt;
&lt;li&gt;Execution IDs&lt;/li&gt;
&lt;li&gt;Values that change per execution&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dynamic parameters can be set in either of two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passing &lt;code&gt;RunProperties&lt;/code&gt; to &lt;code&gt;start_workflow_run&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Dynamically updating them later using &lt;code&gt;put_workflow_run_properties&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
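
&lt;p&gt;The two options above can be sketched as a pair of small helpers. The function names are hypothetical, and the &lt;code&gt;glue&lt;/code&gt; argument is assumed to be a client created with &lt;code&gt;boto3.client('glue')&lt;/code&gt;; it is passed in so the helpers are easy to test with a stub.&lt;/p&gt;

```python
def start_run_for_date(glue, workflow_name, target_date):
    """Start a workflow run and pass the processing date as a run property."""
    response = glue.start_workflow_run(
        Name=workflow_name,
        # Run property values must be strings.
        RunProperties={"target_date": str(target_date)},
    )
    return response["RunId"]


def record_output_path(glue, workflow_name, run_id, output_path):
    """Add or overwrite one property on a run that is already in progress."""
    glue.put_workflow_run_properties(
        Name=workflow_name,
        RunId=run_id,
        RunProperties={"output_path": output_path},
    )
```

&lt;p&gt;A caller such as a Lambda function would typically use the first helper, while a Job running inside the workflow would use the second.&lt;/p&gt;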


&lt;h2&gt;
  
  
  Passing Data Between Jobs
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;default_run_properties&lt;/code&gt; as the foundation, let’s look at how to exchange data between Jobs.&lt;/p&gt;


&lt;h3&gt;
  
  
  Dynamically Updating Run Properties
&lt;/h3&gt;

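&lt;p&gt;During Job execution, you can dynamically update Workflow Run properties by calling the &lt;code&gt;put_workflow_run_properties&lt;/code&gt; API. In the example below, &lt;code&gt;workflow_run_id&lt;/code&gt; is the ID of the current Workflow Run, which the Job receives through the &lt;code&gt;WORKFLOW_RUN_ID&lt;/code&gt; argument.&lt;br&gt;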
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;glue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_workflow_run_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-workflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;workflow_run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunProperties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed_records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://mybucket/output/2026-05-21/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows downstream Jobs to reference values calculated by upstream Jobs.&lt;/p&gt;
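
&lt;p&gt;One detail worth keeping in mind is that run property values are strings, so numbers have to be converted on the way in and out. The helpers below are a minimal sketch of that round trip; &lt;code&gt;to_run_properties&lt;/code&gt; and &lt;code&gt;read_int&lt;/code&gt; are hypothetical names, not part of boto3.&lt;/p&gt;

```python
def to_run_properties(values):
    """Convert a dict of plain Python values into string-only run properties."""
    return {str(key): str(value) for key, value in values.items()}


def read_int(properties, key, default=0):
    """Read an integer that an upstream Job stored as a string."""
    raw = properties.get(key)
    if raw is None:
        return default
    return int(raw)


# Upstream Job: values to publish after processing.
upstream = to_run_properties({
    "processed_records": 12345,
    "output_path": "s3://mybucket/output/2026-05-21/",
})

# Downstream Job: the same dict shape comes back under response['RunProperties'].
record_count = read_int(upstream, "processed_records")
print(record_count)
```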




&lt;h3&gt;
  
  
  Retrieving Properties from a PySpark Job
&lt;/h3&gt;

&lt;p&gt;Inside a PySpark Job, you first retrieve the Workflow Run ID and then call &lt;code&gt;get_workflow_run_properties&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WORKFLOW_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WORKFLOW_RUN_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;glue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_workflow_run_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WORKFLOW_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;RunId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WORKFLOW_RUN_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RunProperties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;target_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;WORKFLOW_NAME&lt;/code&gt; and &lt;code&gt;WORKFLOW_RUN_ID&lt;/code&gt; are special arguments automatically passed when a Job is launched through a Workflow.&lt;/p&gt;




&lt;h3&gt;
  
  
  Retrieving Properties from a Python Shell Job
&lt;/h3&gt;

&lt;p&gt;The basic approach is the same for Python Shell Jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve arguments using &lt;code&gt;getResolvedOptions&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Access properties through boto3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since Python Shell Jobs do not require SparkContext initialization, the code can be kept more lightweight.&lt;/p&gt;

&lt;p&gt;They are also cheaper than Spark-based Glue Jobs, so depending on the requirements, they can be a good option.&lt;/p&gt;

&lt;p&gt;Personally, I also like Python Shell Jobs, although I feel opportunities to use them in real-world projects have decreased, which is a bit unfortunate.&lt;/p&gt;

&lt;p&gt;I’ve also written articles about Python Shell Jobs, so feel free to check them out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/s3-triggers-how-to-launch-glue-python-shell-via-aws-lambda-4ke8"&gt;S3 Triggers: How to Launch Glue Python Shell via AWS Lambda&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Anti-Patterns for Data Passing
&lt;/h3&gt;

&lt;p&gt;Run properties (including &lt;code&gt;default_run_properties&lt;/code&gt;) should only be used to exchange small pieces of metadata.&lt;/p&gt;

&lt;p&gt;The following usage patterns should generally be avoided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Passing large datasets directly&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store the data itself in S3 and pass only the path&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Storing secrets&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Secrets Manager or Glue Connections instead&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Frequently rewriting properties&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This increases API calls and introduces race condition risks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Integrating with EventBridge and Other Services
&lt;/h2&gt;

&lt;p&gt;Glue Workflow becomes much more flexible when combined with EventBridge.&lt;/p&gt;




&lt;h3&gt;
  
  
  EventBridge-Based Startup (EVENT Trigger)
&lt;/h3&gt;

&lt;p&gt;Glue Workflow can be started directly by EventBridge events.&lt;/p&gt;

&lt;p&gt;This is achieved by setting the Trigger Type to &lt;code&gt;EVENT&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws glue create-trigger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workflow-name&lt;/span&gt; my-workflow &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; EVENT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; s3-arrival-trigger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--actions&lt;/span&gt; &lt;span class="nv"&gt;JobName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By configuring an EventBridge rule with Glue Workflow as the target, the workflow starts automatically when an event occurs.&lt;/p&gt;

&lt;p&gt;However, the IAM role that EventBridge uses to invoke the target needs appropriate permissions, such as &lt;code&gt;glue:notifyEvent&lt;/code&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Batch Event Startup
&lt;/h3&gt;

&lt;p&gt;EVENT Triggers also support event batching.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;EventBatchingCondition&lt;/code&gt;, you can configure the workflow to start when either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N events arrive&lt;/li&gt;
&lt;li&gt;M seconds pass since the first event arrived
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws glue create-trigger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workflow-name&lt;/span&gt; my-workflow &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; EVENT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; batch-trigger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--event-batching-condition&lt;/span&gt; &lt;span class="nv"&gt;BatchSize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10,BatchWindow&lt;span class="o"&gt;=&lt;/span&gt;300 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--actions&lt;/span&gt; &lt;span class="nv"&gt;JobName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables patterns such as:&lt;br&gt;
“Run ETL once 10 files have arrived, or 5 minutes after the first file arrived.”&lt;/p&gt;

&lt;p&gt;The maximum batch window is 900 seconds (15 minutes).&lt;/p&gt;
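
&lt;p&gt;For teams that create triggers from Python rather than the CLI, the same configuration could be sketched as follows. &lt;code&gt;build_event_trigger&lt;/code&gt; is a hypothetical helper that only assembles the keyword arguments for boto3's &lt;code&gt;create_trigger&lt;/code&gt; and checks the batch window against the 900-second maximum mentioned above.&lt;/p&gt;

```python
def build_event_trigger(name, workflow_name, job_name, batch_size, batch_window):
    """Build keyword arguments for glue.create_trigger with event batching."""
    # 900 seconds is the maximum batch window mentioned in this article.
    if batch_window not in range(1, 901):
        raise ValueError("batch_window must be between 1 and 900 seconds")
    return {
        "Name": name,
        "WorkflowName": workflow_name,
        "Type": "EVENT",
        "Actions": [{"JobName": job_name}],
        "EventBatchingCondition": {
            "BatchSize": batch_size,
            "BatchWindow": batch_window,
        },
    }


kwargs = build_event_trigger("batch-trigger", "my-workflow", "my-job", 10, 300)
# With a real client: boto3.client("glue").create_trigger(**kwargs)
print(kwargs)
```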


&lt;h3&gt;
  
  
  Starting from S3 Events (The Parameter Passing Limitation)
&lt;/h3&gt;

&lt;p&gt;A common use case is:&lt;br&gt;
“Start a workflow when a file is uploaded to S3.”&lt;/p&gt;

&lt;p&gt;When starting Glue Workflow through EventBridge, the event IDs are automatically stored in a Run Property called &lt;code&gt;aws:eventIds&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;event_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glue_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_workflow_run_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;workflow_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;workflow_run_id&lt;/span&gt;
&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RunProperties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aws:eventIds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The returned value looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abc-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;def-456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
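
&lt;p&gt;Because the value arrives as a single string, it has to be parsed before the IDs can be used. The helper below is a cautious sketch: it first tries to read the string as JSON, and falls back to a plain bracketed, comma-separated list in case the IDs are not quoted. The fallback format is an assumption, and &lt;code&gt;parse_event_ids&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
import json


def parse_event_ids(raw):
    """Turn the aws:eventIds string into a list of event ID strings."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    except ValueError:
        pass
    # Fallback (assumption): a bracketed, comma-separated list without quotes.
    inner = raw.strip().strip("[]")
    return [part.strip().strip('"') for part in inner.split(",") if part.strip()]


print(parse_event_ids('["abc-123", "def-456"]'))
```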



&lt;p&gt;However, this is where one of Glue Workflow’s limitations becomes apparent.&lt;/p&gt;

&lt;p&gt;The EventBridge event payload itself (such as the S3 object key or bucket name) is not automatically passed as Run Properties.&lt;/p&gt;

&lt;p&gt;Only the event IDs are provided.&lt;/p&gt;

&lt;p&gt;If you need the actual object details, your Job must retrieve the corresponding event contents from CloudTrail, which becomes somewhat cumbersome.&lt;/p&gt;

&lt;p&gt;Because of this, many cases are easier to manage by placing Lambda in the middle and explicitly calling &lt;code&gt;start_workflow_run&lt;/code&gt; with structured Run Properties.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 PUT → EventBridge → Lambda → start_workflow_run (pass parameters via RunProperties)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Lambda code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;glue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_workflow_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-workflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;RunProperties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source_bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Step Functions, the EventBridge payload can be received directly using paths such as &lt;code&gt;$.detail&lt;/code&gt;, so there is no need for an intermediate Lambda function.&lt;/p&gt;

&lt;p&gt;This is one of the areas where Glue Workflow limitations become noticeable compared to Step Functions.&lt;/p&gt;




&lt;h3&gt;
  
  
  Detecting Workflow Completion via EventBridge
&lt;/h3&gt;

&lt;p&gt;Glue Workflow status changes are emitted as EventBridge events.&lt;/p&gt;

&lt;p&gt;You can use this to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send SNS notifications&lt;/li&gt;
&lt;li&gt;Trigger downstream systems&lt;/li&gt;
&lt;li&gt;Launch post-processing workflows
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Glue Workflow (COMPLETED/FAILED)
    → EventBridge
        → SNS / Lambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is especially useful when you want separate processing for success and failure cases.&lt;/p&gt;




&lt;h3&gt;
  
  
  Calling Glue Workflow from Step Functions
&lt;/h3&gt;

&lt;p&gt;It is also possible to launch Glue Workflow from Step Functions.&lt;/p&gt;

&lt;p&gt;However, since there is no &lt;code&gt;.sync&lt;/code&gt; integration pattern available, you must implement your own polling logic to detect completion.&lt;/p&gt;

&lt;p&gt;I covered this in detail in the previous article, so feel free to refer to it if interested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/organizing-the-use-cases-of-aws-step-functions-and-glue-workflow-for-etl-processing-with-aws-glue-2o6b"&gt;Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs&lt;/a&gt;&lt;/p&gt;
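
&lt;p&gt;To give a feel for the polling logic, here is a minimal Python sketch, for example something that could run inside a Lambda task. In an actual state machine the waiting would more likely be expressed with Wait and Choice states. The function name, polling interval, and attempt limit are assumptions; &lt;code&gt;glue&lt;/code&gt; is assumed to be a &lt;code&gt;boto3.client('glue')&lt;/code&gt; instance.&lt;/p&gt;

```python
import time

# Workflow run statuses that mean the run is no longer progressing.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def wait_for_workflow_run(glue, workflow_name, run_id,
                          poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_workflow_run until the run reaches a terminal status."""
    for _ in range(max_polls):
        run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATUSES:
            # A COMPLETED run can still contain failed actions, so return
            # the statistics as well for the caller to inspect.
            return run["Status"], run.get("Statistics", {})
        sleep(poll_seconds)
    raise TimeoutError("workflow run did not finish within the polling budget")
```

&lt;p&gt;The caller can then check fields such as &lt;code&gt;FailedActions&lt;/code&gt; in the returned statistics before deciding whether the run really succeeded.&lt;/p&gt;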




&lt;h2&gt;
  
  
  Operational Tips
&lt;/h2&gt;

&lt;p&gt;Here are several practical points worth knowing when operating Glue Workflow in production.&lt;/p&gt;




&lt;h3&gt;
  
  
  Resuming Failed Workflows (&lt;code&gt;ResumeWorkflowRun&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Glue Workflow provides a feature called &lt;code&gt;ResumeWorkflowRun&lt;/code&gt;, which allows resuming execution from failed nodes.&lt;/p&gt;

&lt;p&gt;In the console, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the failed Workflow Run detail page&lt;/li&gt;
&lt;li&gt;Select the nodes to resume&lt;/li&gt;
&lt;li&gt;Enable the “Resume” checkbox&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is also available through CLI/API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws glue resume-workflow-run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-workflow &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--run-id&lt;/span&gt; wr_xxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-ids&lt;/span&gt; node_yyyy node_zzzz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a run is resumed, the specified nodes and all of their downstream nodes are re-executed.&lt;/p&gt;

&lt;p&gt;The resumed workflow is tracked using a new Run ID.&lt;/p&gt;

&lt;p&gt;Compared with Step Functions Redrive, however, there are several limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You must explicitly specify failed nodes&lt;/li&gt;
&lt;li&gt;Retrieving node IDs requires calling:
&lt;code&gt;get-workflow-run --include-graph&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Additional IAM permissions (&lt;code&gt;glue:ResumeWorkflowRun&lt;/code&gt;) are required&lt;/li&gt;
&lt;/ul&gt;
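
&lt;p&gt;The first two points can be partly automated. The sketch below looks up failed Job nodes in the run graph and resumes from them. It is only one possible approach: the helper names are hypothetical, Crawler nodes are ignored for brevity, a node is treated as failed when none of its job runs succeeded, and &lt;code&gt;glue&lt;/code&gt; is assumed to be a &lt;code&gt;boto3.client('glue')&lt;/code&gt; instance.&lt;/p&gt;

```python
FAILED_JOB_STATES = {"FAILED", "TIMEOUT", "ERROR"}


def failed_job_node_ids(run):
    """Collect UniqueId values of JOB nodes that failed without a success."""
    node_ids = []
    for node in run.get("Graph", {}).get("Nodes", []):
        if node.get("Type") != "JOB":
            continue
        job_runs = node.get("JobDetails", {}).get("JobRuns", [])
        states = {job_run.get("JobRunState") for job_run in job_runs}
        if "SUCCEEDED" not in states and states.intersection(FAILED_JOB_STATES):
            node_ids.append(node["UniqueId"])
    return node_ids


def resume_failed_nodes(glue, workflow_name, run_id):
    """Resume a run from its failed Job nodes and return the new run ID."""
    run = glue.get_workflow_run(
        Name=workflow_name, RunId=run_id, IncludeGraph=True
    )["Run"]
    node_ids = failed_job_node_ids(run)
    if not node_ids:
        return None
    response = glue.resume_workflow_run(
        Name=workflow_name, RunId=run_id, NodeIds=node_ids
    )
    return response["RunId"]
```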

&lt;p&gt;For simple retry scenarios, restarting the entire workflow via &lt;code&gt;ON_DEMAND&lt;/code&gt; execution is often easier.&lt;/p&gt;

&lt;p&gt;For pipelines with more sophisticated recovery requirements, Step Functions tends to provide a smoother operational experience.&lt;/p&gt;




&lt;h3&gt;
  
  
  Monitoring with CloudWatch
&lt;/h3&gt;

&lt;p&gt;Glue Workflow execution states can be viewed in the Glue console, but detailed Job and Crawler logs are output to CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;Typical monitoring targets include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Workflow Run list&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue Console&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Job execution logs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/aws-glue/jobs/output&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/aws-glue/jobs/error&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Metrics&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Metrics&lt;/li&gt;
&lt;li&gt;Execution duration&lt;/li&gt;
&lt;li&gt;DPU usage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In practice, monitoring and alerting strategies often combine workflow-level statuses with Job-level metrics, depending on the situation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Handling the 100-Object Recommendation Limit
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, AWS recommends keeping the total number of objects in a workflow (Jobs + Crawlers + Triggers) at 100 or fewer.&lt;/p&gt;

&lt;p&gt;If a large pipeline approaches this limit, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Splitting workflows&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger downstream workflows after upstream completion&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Consolidating common processing inside Jobs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Merge smaller processing units into larger Jobs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In my experience, workflows approaching 100 objects are often already too complex from a design perspective.&lt;br&gt;
At that point, it may be worth reconsidering the architecture itself or migrating to Step Functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Glue Workflow Fits — and Its Limitations
&lt;/h2&gt;

&lt;p&gt;Glue Workflow shines in scenarios such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple ETL pipelines completed entirely within Glue Jobs and Crawlers&lt;/li&gt;
&lt;li&gt;Periodic ingestion pipelines for data lakes&lt;/li&gt;
&lt;li&gt;Lightweight small-to-medium scale pipelines that need to be launched quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, Step Functions is generally better suited for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrations involving Lambda, ECS, and other AWS services&lt;/li&gt;
&lt;li&gt;Complex branching and dynamic parameter control&lt;/li&gt;
&lt;li&gt;Large-scale development involving multiple teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I discussed these decision criteria in more detail in the previous article, so feel free to refer to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/organizing-the-use-cases-of-aws-step-functions-and-glue-workflow-for-etl-processing-with-aws-glue-2o6b"&gt;Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, I walked through AWS Glue Workflow, from the basics to practical usage patterns.&lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Four trigger types:&lt;br&gt;
&lt;code&gt;SCHEDULED&lt;/code&gt; / &lt;code&gt;ON_DEMAND&lt;/code&gt; / &lt;code&gt;CONDITIONAL&lt;/code&gt; / &lt;code&gt;EVENT&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CONDITIONAL Triggers:&lt;br&gt;
Flexible flow control using &lt;code&gt;AND&lt;/code&gt; / &lt;code&gt;ANY&lt;/code&gt; conditions and failure branching&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;default_run_properties&lt;/code&gt;:&lt;br&gt;
Shared workflow-wide parameter management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data passing between Jobs:&lt;br&gt;
Dynamic value propagation using &lt;code&gt;put_workflow_run_properties&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EventBridge integration:&lt;br&gt;
Event-driven execution via EVENT Trigger&lt;br&gt;
(although Lambda-based parameter passing is often easier)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ResumeWorkflowRun&lt;/code&gt;:&lt;br&gt;
Partial restart functionality from failed nodes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After writing all of this, I still have to admit:&lt;br&gt;
Step Functions is generally easier to use.&lt;/p&gt;

&lt;p&gt;That said, Glue Workflow still provides meaningful value today because it allows you to build Glue-centric ETL pipelines in a very simple and cost-efficient way.&lt;/p&gt;

&lt;p&gt;Rather than defaulting to Step Functions automatically, understanding Glue Workflow properly and knowing when to use it can broaden your architectural options.&lt;/p&gt;

&lt;p&gt;I hope this article helps both people who are starting to use Glue Workflow and those already working with it.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Wed, 20 May 2026 12:09:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/organizing-the-use-cases-of-aws-step-functions-and-glue-workflow-for-etl-processing-with-aws-glue-2o6b</link>
      <guid>https://dev.to/aws-builders/organizing-the-use-cases-of-aws-step-functions-and-glue-workflow-for-etl-processing-with-aws-glue-2o6b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/47fd9b49b418a9" rel="noopener noreferrer"&gt;Glue JobのETL処理におけるAWS Step FunctionsとGlue Workflowの使い分けを整理する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;When building data pipelines on AWS, deciding which workflow orchestration tool to use is an important architectural decision.&lt;br&gt;
Especially when designing ETL pipelines centered around Glue Jobs, the following two services are commonly compared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Step Functions&lt;/li&gt;
&lt;li&gt;AWS Glue Workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both services are workflow orchestration tools that manage job dependencies and execute processes sequentially.&lt;br&gt;
However, since they are designed with different strengths and philosophies, choosing the right one based on your requirements is critical.&lt;/p&gt;

&lt;p&gt;In this article, I’ll summarize the characteristics, advantages, disadvantages, and use cases of these two services, and discuss how to decide which one to choose.&lt;/p&gt;

&lt;p&gt;Personally, I really like Glue Workflow, but I’ve had fewer opportunities to use it recently, which is honestly a bit disappointing.&lt;br&gt;
I also like Glue Python Shell, but compared to standard Glue Jobs, I rarely get to use it these days either...&lt;/p&gt;

&lt;h2&gt;
  
  
  Major Workflow Orchestration Tools
&lt;/h2&gt;

&lt;p&gt;AWS provides multiple workflow orchestration services, but the two most commonly compared for Glue Job–based data pipelines are the following:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Overview&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS Step Functions&lt;/td&gt;
&lt;td&gt;AWS managed workflow orchestration service. Can flexibly orchestrate a wide range of AWS services such as Lambda, Glue, ECS, and more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Glue Workflow&lt;/td&gt;
&lt;td&gt;Glue-native workflow feature. Defines pipelines using Glue Jobs, Crawlers, and Triggers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AWS also offers Amazon Managed Workflows for Apache Airflow (MWAA).&lt;br&gt;
MWAA becomes a strong option when you need highly complex dependency management or cross-cloud orchestration. However, in this article, I’ll focus specifically on comparing Step Functions and Glue Workflow.&lt;/p&gt;

&lt;p&gt;Recently, many teams have also adopted workflow tools such as Airflow, Dagster, and Prefect. As always, selecting the right tool depends heavily on your requirements and goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Step Functions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijyw5w6x21p2ml3f47w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flijyw5w6x21p2ml3f47w.png" width="799" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Figure above: Visual workflow definition using Workflow Studio)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AWS Step Functions is a workflow orchestration service that defines workflows using Amazon States Language (ASL), a JSON-based language (some tooling also accepts definitions written in YAML).&lt;/p&gt;

&lt;p&gt;It integrates with over 200 AWS services including Lambda, Glue, ECS, SNS, and DynamoDB, and supports flexible workflow controls such as conditional branching, parallel execution, and retry handling.&lt;/p&gt;
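
&lt;p&gt;To make this more concrete, here is a small sketch of an ASL definition that runs a single Glue Job, waits for it to finish, and handles failures. It is built as a Python dictionary and printed as JSON so it stays runnable. The job name &lt;code&gt;my-job&lt;/code&gt;, the state names, and the retry settings are placeholders, not recommendations.&lt;/p&gt;

```python
import json

# Hypothetical state machine: run one Glue Job and wait for it to finish.
# "my-job" and the state names are placeholders.
definition = {
    "Comment": "Run a Glue Job with retry and failure handling",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for job completion.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "my-job",
                # Pass a value from the state input as a Job argument.
                "Arguments": {"--target_date.$": "$.target_date"},
            },
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "JobFailed"}],
            "End": True,
        },
        "JobFailed": {"Type": "Fail", "Error": "GlueJobFailed"},
    },
}

definition_json = json.dumps(definition, indent=2)
print(definition_json)
```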

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Step Functions provides two workflow types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard Workflow&lt;/td&gt;
&lt;td&gt;Long-running execution, audit logs, exactly-once execution guarantee&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Express Workflow&lt;/td&gt;
&lt;td&gt;High throughput, low cost, optimized for short-lived processing (at-least-once)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the context of data pipelines, Standard Workflow is generally the more common choice.&lt;/p&gt;

&lt;p&gt;Step Functions also provides AWS Step Functions Workflow Studio, a visual editor that allows workflows to be built through a GUI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-service integration: Easily integrates Glue with other AWS services such as Lambda, ECS, SNS, and more&lt;/li&gt;
&lt;li&gt;Flexible control flow: Supports conditional branching (Choice), parallel execution (Parallel/Map), and advanced error handling (Catch/Retry)&lt;/li&gt;
&lt;li&gt;High observability: Execution history and per-step states can be visually inspected in the console&lt;/li&gt;
&lt;li&gt;Event-driven integration: Easily triggered through EventBridge using S3 uploads or schedules&lt;/li&gt;
&lt;li&gt;Parallel execution support (Map): Well suited for large-scale processing such as file-level or partition-level parallel execution. Distributed Map is especially useful for high-scale parallel workloads&lt;/li&gt;
&lt;li&gt;Infrastructure as Code support: Can be fully managed through CloudFormation, CDK, or Terraform&lt;/li&gt;
&lt;/ul&gt;
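
&lt;p&gt;As a rough illustration of the control-flow features listed above, the following Python sketch builds an ASL Task state that runs a Glue Job synchronously with Retry and Catch. The job name, the retry values, and the &lt;code&gt;NotifyFailure&lt;/code&gt; and &lt;code&gt;Done&lt;/code&gt; state names are placeholders I chose for this example, not values from a real pipeline.&lt;/p&gt;

```python
import json


def glue_task_state(job_name, next_state, failure_state):
    """Build an ASL Task state that runs a Glue Job and waits for it (.sync)."""
    return {
        "Type": "Task",
        # Glue service integration; ".sync" makes Step Functions wait for the job run.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,  # assumed values; tune per workload
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        "Catch": [
            {
                # Route any remaining error to a failure-handling state.
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": failure_state,
            }
        ],
        "Next": next_state,
    }


state = glue_task_state("example-etl-job", "Done", "NotifyFailure")
print(json.dumps(state, indent=2))
```

&lt;p&gt;The printed JSON is a single state only. It would still need to be placed inside a full state machine definition that also defines the states it points to.&lt;/p&gt;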

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Learning curve: Requires understanding ASL, and complex workflows can become verbose&lt;/li&gt;
&lt;li&gt;Cost: Standard Workflow pricing is based on state transitions, so costs can increase with larger workflows&lt;/li&gt;
&lt;li&gt;Glue integration setup: IAM roles, parameter passing, and Glue Job integration must be configured manually&lt;/li&gt;
&lt;/ul&gt;
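
&lt;p&gt;To get a feel for how state transitions drive cost, here is a small back-of-the-envelope estimator in Python. The default rate of 0.025 USD per 1,000 transitions is an assumption for illustration, and the free tier is ignored, so please check the current pricing for your region before relying on the numbers.&lt;/p&gt;

```python
def standard_workflow_cost(transitions_per_run, runs_per_month, price_per_1000=0.025):
    """Estimate the monthly Standard Workflow cost from state transitions.

    price_per_1000 is an assumed USD rate, not an authoritative price.
    """
    total_transitions = transitions_per_run * runs_per_month
    return total_transitions / 1000 * price_per_1000


# A loop of three states (for example Wait, status check, Choice)
# adds 3 transitions per iteration.
base = standard_workflow_cost(transitions_per_run=10, runs_per_month=3000)
with_loop = standard_workflow_cost(transitions_per_run=10 + 3 * 20, runs_per_month=3000)
print(f"base: {base:.2f} USD, with 20 loop iterations per run: {with_loop:.2f} USD")
```

&lt;p&gt;The point of the comparison is that loops multiply transitions, so a workflow with few states can still generate many billable transitions per run.&lt;/p&gt;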

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;Step Functions is a strong fit for scenarios such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid pipelines combining Glue with Lambda, ECS, or other AWS services&lt;/li&gt;
&lt;li&gt;Complex workflows requiring branching, parallel processing, or dynamic parameter passing&lt;/li&gt;
&lt;li&gt;Pipelines that require SNS notifications or compensating actions on failures&lt;/li&gt;
&lt;li&gt;Teams managing infrastructure strictly through IaC&lt;/li&gt;
&lt;li&gt;Large-scale data platforms involving multiple teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AWS Glue Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb8t1vgeas2xynimxmo8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb8t1vgeas2xynimxmo8.png" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Figure above: Visual workflow definition using Glue Workflow)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AWS Glue Workflow is Glue’s native workflow orchestration feature.&lt;/p&gt;

&lt;p&gt;It defines pipelines using three primary components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue Jobs&lt;/li&gt;
&lt;li&gt;Glue Crawlers&lt;/li&gt;
&lt;li&gt;Glue Triggers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pipelines can be configured directly from the Glue console GUI, making it easy to build Glue-centric ETL workflows quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;The major components of Glue Workflow are as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Glue Job&lt;/td&gt;
&lt;td&gt;Actual ETL processing using Spark or Python Shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue Crawler&lt;/td&gt;
&lt;td&gt;Scans data sources such as S3 and registers table metadata into the Data Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue Trigger&lt;/td&gt;
&lt;td&gt;Defines execution conditions such as schedules, events, or conditional dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Glue Triggers support four types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SCHEDULED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ON_DEMAND&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CONDITIONAL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EVENT&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows chained execution patterns, such as triggering downstream jobs based on upstream job success or failure.&lt;/p&gt;
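
&lt;p&gt;A chained pattern like this could be expressed roughly as follows. This Python sketch only builds the request parameters for Glue's &lt;code&gt;CreateTrigger&lt;/code&gt; API and does not call AWS. The workflow, trigger, and job names are hypothetical.&lt;/p&gt;

```python
import json


def conditional_trigger_params(workflow, trigger_name, upstream_job, downstream_job):
    """Build keyword arguments for Glue's CreateTrigger API (CONDITIONAL type)."""
    return {
        "Name": trigger_name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",  # fire only when the upstream job succeeds
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
    }


params = conditional_trigger_params(
    "example-workflow", "run-transform-after-ingest", "ingest-job", "transform-job"
)
print(json.dumps(params, indent=2))
# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("glue").create_trigger(**params)
```

&lt;p&gt;Changing &lt;code&gt;State&lt;/code&gt; to &lt;code&gt;FAILED&lt;/code&gt; would give the failure branch instead, for example to start a cleanup job.&lt;/p&gt;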

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Native Glue integration: Seamlessly integrates with Glue Jobs and Crawlers with minimal additional setup&lt;/li&gt;
&lt;li&gt;Simple configuration: DAG-style workflows can be built intuitively through the console GUI&lt;/li&gt;
&lt;li&gt;Low cost: No additional charge for the workflow itself beyond Glue Job execution costs&lt;/li&gt;
&lt;li&gt;Easy crawler integration: Natural fit for workflows that update the Data Catalog after ETL execution&lt;/li&gt;
&lt;li&gt;Glue Data Catalog integration: Job execution metadata and lineage can be managed centrally within Glue&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Glue-only orchestration: Cannot directly orchestrate Lambda, ECS, or other non-Glue services&lt;/li&gt;
&lt;li&gt;Limited event-driven capabilities: Primarily Trigger- and schedule-based, making advanced event integration less flexible than Step Functions&lt;/li&gt;
&lt;li&gt;Limited control flow: Weak support for advanced branching and dynamic parameter handling&lt;/li&gt;
&lt;li&gt;Observability limitations: Detailed execution logs often require separate CloudWatch investigation&lt;/li&gt;
&lt;li&gt;More difficult IaC management: CloudFormation and Terraform management can become cumbersome compared to Step Functions&lt;/li&gt;
&lt;li&gt;Limited parallel execution control: Not ideal for fine-grained parallelization or Map-style orchestration&lt;/li&gt;
&lt;li&gt;Weaker retry/re-execution control: Re-running only failed portions of a workflow is less flexible than in Step Functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;Glue Workflow is well suited for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple ETL pipelines composed entirely of Glue Jobs and Crawlers&lt;/li&gt;
&lt;li&gt;Periodic ingestion pipelines into S3-based data lakes with automatic catalog updates&lt;/li&gt;
&lt;li&gt;Small teams or early-stage projects that need fast implementation&lt;/li&gt;
&lt;li&gt;Organizations that prefer operating primarily through the Glue console&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Which One Should You Choose?
&lt;/h2&gt;

&lt;p&gt;Based on the characteristics of both services, the selection criteria can generally be summarized as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Choose Glue Workflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your ETL pipeline consists only of Glue Jobs and Crawlers&lt;/li&gt;
&lt;li&gt;Simple sequential execution and conditional triggers are sufficient&lt;/li&gt;
&lt;li&gt;You want to build quickly (prototypes or small-scale projects)&lt;/li&gt;
&lt;li&gt;Your operations are centered around the Glue console&lt;/li&gt;
&lt;li&gt;You want to minimize costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to Choose Step Functions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need integration with Lambda, ECS, or other AWS services&lt;/li&gt;
&lt;li&gt;You require branching, parallel processing, or advanced error handling&lt;/li&gt;
&lt;li&gt;You are adopting an EventBridge-centric event-driven architecture&lt;/li&gt;
&lt;li&gt;You want strict Infrastructure as Code management&lt;/li&gt;
&lt;li&gt;Multiple teams are involved in operating the data platform&lt;/li&gt;
&lt;li&gt;Observability and audit logging are important&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary of Decision Criteria
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Perspective&lt;/th&gt;
&lt;th&gt;Glue Workflow&lt;/th&gt;
&lt;th&gt;Step Functions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Supported Services&lt;/td&gt;
&lt;td&gt;Glue only&lt;/td&gt;
&lt;td&gt;Broad AWS integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control Flow&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Flexible and advanced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;◎&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of Configuration&lt;/td&gt;
&lt;td&gt;◎&lt;/td&gt;
&lt;td&gt;△ (learning curve)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Depends on state transitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IaC Management&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;◎&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crawler Integration&lt;/td&gt;
&lt;td&gt;◎&lt;/td&gt;
&lt;td&gt;△ (manual setup required)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Glue Workflow is fundamentally Trigger-oriented and is not designed for advanced event orchestration like Step Functions.&lt;/p&gt;

&lt;p&gt;Unless Glue Workflow specifically satisfies your requirements better, choosing Step Functions is generally the safer long-term option.&lt;br&gt;
Personally, when designing architectures, I often start by considering Step Functions first.&lt;/p&gt;

&lt;p&gt;That said, Glue Workflow remains a strong choice when the requirement is simply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I want to quickly build a Glue-centric ETL pipeline.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Combining Both Services
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lfz55jfm7tei3v78ssk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lfz55jfm7tei3v78ssk.png" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of choosing one or the other exclusively, it is also possible to invoke Glue Workflow from Step Functions.&lt;/p&gt;

&lt;p&gt;For example, Step Functions can handle preprocessing and postprocessing with Lambda, while delegating the ETL core to Glue Workflow.&lt;/p&gt;

&lt;p&gt;However, this introduces additional complexity because workflow state coordination between the two services must be managed carefully.&lt;br&gt;
If simplicity is important, standardizing on one orchestration tool is generally easier operationally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats When Invoking Glue Workflow from Step Functions
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Completion Detection Requires Polling
&lt;/h4&gt;

&lt;p&gt;Step Functions provides a convenient &lt;code&gt;.sync&lt;/code&gt; integration pattern for &lt;code&gt;glue:startJobRun&lt;/code&gt;, which waits for job completion automatically.&lt;/p&gt;

&lt;p&gt;However, &lt;code&gt;glue:startWorkflowRun&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; support the &lt;code&gt;.sync&lt;/code&gt; integration pattern.&lt;/p&gt;

&lt;p&gt;While you can invoke Glue Workflow through SDK integration, Step Functions will immediately proceed to the next state without waiting for completion.&lt;br&gt;
As a result, you must implement custom polling logic to repeatedly check the WorkflowRun status.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Polling Logic Becomes Complex
&lt;/h4&gt;

&lt;p&gt;You typically need to implement a loop like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait&lt;/li&gt;
&lt;li&gt;GetWorkflowRun&lt;/li&gt;
&lt;li&gt;Choice (RUNNING / COMPLETED / STOPPED / ERROR)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This increases both the number of states and the verbosity of the ASL definition.&lt;/p&gt;
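
&lt;p&gt;To make the verbosity concrete, here is a minimal Python sketch that generates such a polling loop as an ASL definition. It assumes the AWS SDK integration for Glue and uses a hypothetical workflow name. The state names, the 60-second wait, and the decision to treat every status other than RUNNING and COMPLETED as a failure are simplifications of mine.&lt;/p&gt;

```python
import json


def workflow_polling_definition(workflow_name, wait_seconds=60):
    """Build an ASL definition that starts a Glue Workflow and polls until it ends."""
    status_path = "$.poll.Run.Status"
    return {
        "StartAt": "StartWorkflow",
        "States": {
            "StartWorkflow": {
                "Type": "Task",
                # AWS SDK integration: returns immediately with a RunId.
                "Resource": "arn:aws:states:::aws-sdk:glue:startWorkflowRun",
                "Parameters": {"Name": workflow_name},
                "ResultPath": "$.start",
                "Next": "Wait",
            },
            "Wait": {"Type": "Wait", "Seconds": wait_seconds, "Next": "GetWorkflowRun"},
            "GetWorkflowRun": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getWorkflowRun",
                "Parameters": {"Name": workflow_name, "RunId.$": "$.start.RunId"},
                "ResultPath": "$.poll",
                "Next": "CheckStatus",
            },
            "CheckStatus": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": status_path, "StringEquals": "RUNNING", "Next": "Wait"},
                    {"Variable": status_path, "StringEquals": "COMPLETED", "Next": "Succeeded"},
                ],
                # Any other status is treated as a failure in this sketch.
                "Default": "Failed",
            },
            "Succeeded": {"Type": "Succeed"},
            "Failed": {"Type": "Fail", "Error": "GlueWorkflowFailed"},
        },
    }


definition = workflow_polling_definition("example-workflow")
print(json.dumps(definition, indent=2))
```

&lt;p&gt;Six states are needed just to start one workflow and wait for it, compared with a single Task state when a Glue Job is called with &lt;code&gt;.sync&lt;/code&gt;.&lt;/p&gt;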

&lt;h4&gt;
  
  
  3. Error Handling Becomes More Complicated
&lt;/h4&gt;

&lt;p&gt;Glue Workflow status is returned at the WorkflowRun level rather than the individual Job level.&lt;/p&gt;

&lt;p&gt;As a result, identifying which specific Glue Job failed requires additional parsing logic against the &lt;code&gt;GetWorkflowRun&lt;/code&gt; response.&lt;/p&gt;
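
&lt;p&gt;The parsing could look something like the sketch below, for example inside a Lambda function. It assumes the response was fetched with &lt;code&gt;IncludeGraph=True&lt;/code&gt; so that the node graph is present. The sample response is a hypothetical, trimmed-down structure used only to exercise the function.&lt;/p&gt;

```python
def failed_jobs(workflow_run_response):
    """Return (job_name, state, error_message) for job runs that did not succeed.

    Expects a GetWorkflowRun response fetched with IncludeGraph=True.
    """
    failures = []
    nodes = workflow_run_response.get("Run", {}).get("Graph", {}).get("Nodes", [])
    for node in nodes:
        if node.get("Type") != "JOB":
            continue  # skip TRIGGER and CRAWLER nodes
        for job_run in node.get("JobDetails", {}).get("JobRuns", []):
            state = job_run.get("JobRunState")
            if state in ("FAILED", "TIMEOUT", "ERROR"):
                failures.append((job_run.get("JobName"), state, job_run.get("ErrorMessage")))
    return failures


# Hypothetical sample data, not output from a real workflow run.
sample = {
    "Run": {
        "Graph": {
            "Nodes": [
                {"Type": "TRIGGER", "Name": "start"},
                {
                    "Type": "JOB",
                    "Name": "ingest-job",
                    "JobDetails": {
                        "JobRuns": [{"JobName": "ingest-job", "JobRunState": "SUCCEEDED"}]
                    },
                },
                {
                    "Type": "JOB",
                    "Name": "transform-job",
                    "JobDetails": {
                        "JobRuns": [
                            {
                                "JobName": "transform-job",
                                "JobRunState": "FAILED",
                                "ErrorMessage": "example error",
                            }
                        ]
                    },
                },
            ]
        }
    }
}
print(failed_jobs(sample))
```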

&lt;p&gt;Because of these complications, although combining both services is technically possible, I generally recommend avoiding Step Functions → Glue Workflow orchestration unless there is a compelling reason.&lt;/p&gt;

&lt;p&gt;One possible use case is when you need to extend an existing Glue Workflow–based system incrementally.&lt;br&gt;
Even then, rebuilding the orchestration directly in Step Functions using existing Glue Jobs often feels cleaner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Considerations
&lt;/h2&gt;

&lt;p&gt;Some teams initially adopt Glue Workflow during the early stages of a project and later migrate to Step Functions as the data platform grows.&lt;/p&gt;

&lt;p&gt;When migrating, the following considerations become important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workflow definitions must be rewritten: Glue Trigger definitions need to be converted into ASL&lt;/li&gt;
&lt;li&gt;IAM roles must be redesigned: Step Functions requires permissions to invoke Glue Jobs&lt;/li&gt;
&lt;li&gt;Extensive testing is necessary: Existing jobs must be validated carefully after migration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Migration is certainly possible, but it is not trivial.&lt;br&gt;
This is why it is important to consider future extensibility and maintainability from the beginning when selecting your orchestration tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, I compared AWS Step Functions and Glue Workflow for ETL orchestration.&lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue Workflow: Glue-native, simple, low-cost, and ideal for rapidly building straightforward ETL pipelines&lt;/li&gt;
&lt;li&gt;Step Functions: Better suited for multi-service orchestration, advanced workflow control, observability, and large-scale pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither service is universally “better.”&lt;br&gt;
The right choice depends on your use cases, organizational structure, operational requirements, and future scalability needs.&lt;/p&gt;

&lt;p&gt;For small-scale ETL pipelines, Glue Workflow is often sufficient. However, as data platforms evolve, requirements such as exception handling, notifications, conditional branching, and integrations with other services tend to grow over time. In many cases, architectures gradually move toward Step Functions as complexity increases.&lt;/p&gt;

&lt;p&gt;A practical strategy can be to start simple with Glue Workflow during the early stages, and later migrate to Step Functions when requirements become more sophisticated.&lt;/p&gt;

&lt;p&gt;That said, considering future migration costs, building on Step Functions from the beginning can also be a very reasonable approach.&lt;/p&gt;

&lt;p&gt;I hope this article helps anyone currently evaluating workflow orchestration options on AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Differences Between Snowflake Editions and Secure Connectivity with AWS</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Fri, 15 May 2026 12:55:34 +0000</pubDate>
      <link>https://dev.to/aws-builders/differences-between-snowflake-editions-and-secure-connectivity-with-aws-37ob</link>
      <guid>https://dev.to/aws-builders/differences-between-snowflake-editions-and-secure-connectivity-with-aws-37ob</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/87d4454b2d04fe" rel="noopener noreferrer"&gt;Snowflakeのエディションごとの違いとAWSとのセキュアな接続方法&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;More and more organizations are adopting Snowflake as their data platform.&lt;br&gt;
However, once you actually start planning an implementation, there is often a surprisingly common question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which Snowflake edition should we choose?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In particular, many teams struggle to decide between Enterprise Edition and Business Critical Edition.&lt;/p&gt;

&lt;p&gt;This becomes even more important when Snowflake is used together with AWS and there is a requirement for secure network connectivity.&lt;/p&gt;

&lt;p&gt;Snowflake offers four editions, and higher editions provide stronger capabilities around security, compliance, and availability.&lt;br&gt;
In addition, private connectivity using AWS PrivateLink requires Business Critical Edition or higher.&lt;/p&gt;

&lt;p&gt;This means that if you initially start with Enterprise Edition and later realize that you need PrivateLink, migrating afterward can introduce additional operational and architectural effort.&lt;/p&gt;

&lt;p&gt;In this article, I briefly summarize the differences between Snowflake editions and then explore how to securely connect Snowflake with AWS using Business Critical Edition, especially through AWS PrivateLink.&lt;/p&gt;


&lt;h1&gt;
  
  
  Differences Between Snowflake Editions (Quick Overview)
&lt;/h1&gt;

&lt;p&gt;Snowflake provides the following four editions.&lt;br&gt;
Each higher edition includes all functionality from the lower editions, while credit pricing increases accordingly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Edition&lt;/th&gt;
&lt;th&gt;Positioning&lt;/th&gt;
&lt;th&gt;Major Additional Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Entry-level&lt;/td&gt;
&lt;td&gt;Core features, Time Travel (1 day), SSO, Network Policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;Large-scale / Production workloads&lt;/td&gt;
&lt;td&gt;Multi-cluster warehouses, Materialized Views, Time Travel (up to 90 days), Column- and Row-level security&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business Critical&lt;/td&gt;
&lt;td&gt;Regulated industries&lt;/td&gt;
&lt;td&gt;Tri-Secret Secure, AWS PrivateLink, Failover, HIPAA / PCI DSS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Virtual Private Snowflake (VPS)&lt;/td&gt;
&lt;td&gt;Highest isolation level&lt;/td&gt;
&lt;td&gt;Fully isolated Snowflake environment with dedicated infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Standard Edition
&lt;/h2&gt;

&lt;p&gt;This is the entry-level edition and provides the core Snowflake functionality.&lt;/p&gt;

&lt;p&gt;It includes capabilities such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL processing&lt;/li&gt;
&lt;li&gt;Semi-structured data support&lt;/li&gt;
&lt;li&gt;Data sharing&lt;/li&gt;
&lt;li&gt;1-day Time Travel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For organizations simply looking to use Snowflake as a data warehouse, Standard Edition is often sufficient.&lt;/p&gt;

&lt;p&gt;It is generally a good fit for startups or smaller analytics teams that want to begin using Snowflake quickly.&lt;/p&gt;


&lt;h2&gt;
  
  
  Enterprise Edition
&lt;/h2&gt;

&lt;p&gt;Enterprise Edition adds several features that are effectively essential for production workloads.&lt;/p&gt;

&lt;p&gt;Key additions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cluster warehouses (for scaling concurrent users)&lt;/li&gt;
&lt;li&gt;Materialized Views&lt;/li&gt;
&lt;li&gt;Extended Time Travel (up to 90 days)&lt;/li&gt;
&lt;li&gt;Column-level and row-level security&lt;/li&gt;
&lt;li&gt;Dynamic Data Masking&lt;/li&gt;
&lt;li&gt;Query Acceleration Service&lt;/li&gt;
&lt;li&gt;Search Optimization Service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, Enterprise Edition often feels like the real starting point for production Snowflake environments.&lt;/p&gt;

&lt;p&gt;As data volume and concurrency increase, multi-cluster warehouses become especially important.&lt;/p&gt;


&lt;h2&gt;
  
  
  Business Critical Edition
&lt;/h2&gt;

&lt;p&gt;Business Critical Edition is designed for organizations handling regulated or highly sensitive data.&lt;/p&gt;

&lt;p&gt;In addition to all Enterprise features, it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS PrivateLink / Azure Private Link / Google Private Service Connect&lt;/li&gt;
&lt;li&gt;Tri-Secret Secure (dual encryption using customer-managed keys)&lt;/li&gt;
&lt;li&gt;Account failover / failback&lt;/li&gt;
&lt;li&gt;Client redirect&lt;/li&gt;
&lt;li&gt;Compliance support for HIPAA, HITRUST CSF, PCI DSS, FedRAMP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are handling sensitive information such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PHI (medical data)&lt;/li&gt;
&lt;li&gt;PCI-related cardholder data&lt;/li&gt;
&lt;li&gt;Personally identifiable information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then Business Critical Edition becomes necessary.&lt;/p&gt;

&lt;p&gt;It is also mandatory if your networking requirements specify that connectivity must avoid the public internet and use PrivateLink instead.&lt;/p&gt;

&lt;p&gt;Data is one of the most important enterprise assets.&lt;br&gt;
For that reason alone, I often see organizations adopting Business Critical Edition.&lt;/p&gt;


&lt;h2&gt;
  
  
  Virtual Private Snowflake (VPS)
&lt;/h2&gt;

&lt;p&gt;This is Snowflake’s highest edition and provides a dedicated Snowflake environment.&lt;/p&gt;

&lt;p&gt;Infrastructure is completely isolated and hardware resources are not shared with other customers.&lt;/p&gt;

&lt;p&gt;It is intended for organizations with extremely strict security requirements, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial institutions&lt;/li&gt;
&lt;li&gt;Government agencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I personally have not worked with VPS directly, and pricing/details require contacting Snowflake, so I will omit deeper discussion here.&lt;/p&gt;


&lt;h2&gt;
  
  
  Which Edition Should You Choose?
&lt;/h2&gt;

&lt;p&gt;From a practical perspective, the decision criteria often look something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard&lt;/strong&gt;: Evaluation, PoC, small analytics teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: General production workloads (this is where many companies start)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Critical&lt;/strong&gt;: Regulated industries, sensitive data, or mandatory PrivateLink requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPS&lt;/strong&gt;: Financial/government environments requiring complete isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One important point is that if you simply decide to “start with Enterprise,” you cannot use PrivateLink.&lt;/p&gt;

&lt;p&gt;Because of this, it is safest to validate networking requirements at the beginning of the project.&lt;/p&gt;

&lt;p&gt;Upgrading later is possible, but it may require revisiting architecture and pricing assumptions.&lt;/p&gt;


&lt;h1&gt;
  
  
  Considering Connectivity with AWS
&lt;/h1&gt;

&lt;p&gt;When using Snowflake on AWS, network design—specifically how clients and applications connect—is tightly related to edition selection.&lt;/p&gt;

&lt;p&gt;Here, I outline the connectivity approaches for both Enterprise Edition and Business Critical Edition.&lt;/p&gt;


&lt;h1&gt;
  
  
  Connectivity with Enterprise Edition
&lt;/h1&gt;

&lt;p&gt;Enterprise Edition does not support AWS PrivateLink, so connectivity to Snowflake is fundamentally internet-based.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtlvy8cqr13xdfmsimde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtlvy8cqr13xdfmsimde.png" width="800" height="355"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client / AWS Services
        ↓
    Internet
        ↓
    Snowflake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hearing “internet-based connectivity” may sound concerning at first.&lt;br&gt;
However, even with Enterprise Edition, it is possible to achieve a practical security level by combining multiple controls.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Restricting Source IPs with Network Policies
&lt;/h2&gt;

&lt;p&gt;Snowflake network policies can restrict allowed source IP addresses.&lt;/p&gt;

&lt;p&gt;By limiting access to known egress IPs such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Corporate network addresses&lt;/li&gt;
&lt;li&gt;Elastic IPs attached to NAT Gateways&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you can significantly reduce the risk of unauthorized access.&lt;/p&gt;
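
&lt;p&gt;As a small sketch of this idea, the Python helper below validates a list of CIDR ranges and renders a &lt;code&gt;CREATE NETWORK POLICY&lt;/code&gt; statement. The policy name and the addresses are placeholders (documentation-range IPs), and the policy name is inserted as-is without validation, so it should only come from trusted input.&lt;/p&gt;

```python
import ipaddress


def network_policy_sql(policy_name, allowed_cidrs):
    """Render a CREATE NETWORK POLICY statement after validating each IPv4 CIDR."""
    # IPv4Network raises ValueError for malformed input or set host bits.
    validated = [str(ipaddress.IPv4Network(cidr)) for cidr in allowed_cidrs]
    ip_list = ", ".join(f"'{cidr}'" for cidr in validated)
    return f"CREATE NETWORK POLICY {policy_name} ALLOWED_IP_LIST = ({ip_list});"


# Placeholder addresses standing in for a corporate range and a NAT Gateway Elastic IP.
sql = network_policy_sql("example_policy", ["203.0.113.0/24", "198.51.100.10/32"])
print(sql)
```

&lt;p&gt;Creating a policy does not enforce it by itself. It still has to be applied to the account or to individual users in Snowflake.&lt;/p&gt;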


&lt;h2&gt;
  
  
  2. Key Pair Authentication + AWS Secrets Manager
&lt;/h2&gt;

&lt;p&gt;For credential management, password authentication should generally be avoided in favor of key pair authentication.&lt;/p&gt;

&lt;p&gt;As of 2026, Snowflake is actively moving away from single-factor password authentication.&lt;br&gt;
For system integrations, Key Pair authentication or OAuth is now the recommended approach.&lt;/p&gt;

&lt;p&gt;When connecting from services such as AWS Lambda, storing private keys in AWS Secrets Manager is a practical approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf8cm97n1qdqrsjkag93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf8cm97n1qdqrsjkag93.png" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event Source
    ↓
AWS Lambda
    ↓
Secrets Manager
    ↓
Retrieve Private Key
    ↓
Snowflake (TLS over Internet)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I covered this approach in a previous article as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/securely-implementing-snowflake-x-aws-lambda-integration-with-key-pair-authentication-secrets-2ba2"&gt;Securely Implementing Snowflake AWS Lambda Integration with Key Pair Authentication + Secrets Manager&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. TLS Encryption
&lt;/h2&gt;

&lt;p&gt;All communication with Snowflake is encrypted via TLS.&lt;/p&gt;

&lt;p&gt;This means the communication channel itself remains confidential.&lt;/p&gt;

&lt;p&gt;In other words, even with Enterprise Edition, combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IP restrictions&lt;/li&gt;
&lt;li&gt;Key Pair authentication&lt;/li&gt;
&lt;li&gt;TLS encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can provide a practical level of security.&lt;/p&gt;

&lt;p&gt;However, traffic still traverses the public internet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of Enterprise Edition
&lt;/h2&gt;

&lt;p&gt;Enterprise Edition cannot satisfy requirements such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit/compliance mandates requiring “no internet-based connectivity”&lt;/li&gt;
&lt;li&gt;Internal security policies mandating PrivateLink&lt;/li&gt;
&lt;li&gt;Handling regulated data under HIPAA or PCI DSS&lt;/li&gt;
&lt;li&gt;Encrypting data using customer-managed KMS keys (Tri-Secret Secure)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once these requirements appear, Business Critical Edition or higher becomes necessary.&lt;/p&gt;




&lt;h1&gt;
  
  
  Connectivity with Business Critical Edition
&lt;/h1&gt;

&lt;p&gt;Business Critical Edition supports AWS PrivateLink, enabling private connectivity between AWS VPCs and Snowflake.&lt;/p&gt;

&lt;p&gt;In this architecture, traffic remains entirely within the AWS backbone network and never traverses the public internet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rn26lamu5y94bid1y99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rn26lamu5y94bid1y99.png" width="800" height="355"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    AWS VPC
        ↓
VPC Interface Endpoint
        ↓
 AWS PrivateLink
        ↓
   Snowflake VPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  High-Level Setup Procedure
&lt;/h2&gt;

&lt;p&gt;Following the official documentation, the configuration can be summarized in several major steps.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Enable PrivateLink on the Snowflake Side
&lt;/h3&gt;

&lt;p&gt;Using the ACCOUNTADMIN role, authorize AWS PrivateLink for the Snowflake account.&lt;/p&gt;

&lt;p&gt;First, obtain a federation token using AWS STS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws sts get-federation-token &lt;span class="nt"&gt;--name&lt;/span&gt; your-user-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then execute authorization from the Snowflake side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;ACCOUNTADMIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;AUTHORIZE_PRIVATELINK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'&amp;lt;aws_account_id&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'&amp;lt;federated_token&amp;gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can verify authorization with the following function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;GET_PRIVATELINK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;aws_account_id&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;federated_token&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the response returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Account is authorized for PrivateLink.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then authorization succeeded.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Retrieve the VPC Endpoint ID
&lt;/h3&gt;

&lt;p&gt;Retrieve the information required for AWS VPC Endpoint creation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;GET_PRIVATELINK_CONFIG&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take note of the &lt;code&gt;privatelink-vpce-id&lt;/code&gt; value returned in the JSON response.&lt;/p&gt;

&lt;p&gt;This ID becomes the “service name” when creating the VPC Endpoint on AWS.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Create the VPC Endpoint on AWS
&lt;/h3&gt;

&lt;p&gt;Create an AWS VPC Interface Endpoint using the following configuration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Name: the &lt;code&gt;privatelink-vpce-id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;VPC: the source VPC&lt;/li&gt;
&lt;li&gt;Subnets: multi-AZ deployment is recommended&lt;/li&gt;
&lt;li&gt;Security Groups: allow ports 443 and 80 from Lambda/EC2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Port 80 is required for OCSP (certificate revocation checking), so do not forget to allow it.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Configure DNS
&lt;/h3&gt;

&lt;p&gt;When using PrivateLink, the Snowflake account URL changes to the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;account_identifier&amp;gt;.privatelink.snowflakecomputing.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You must create CNAME records that map the PrivateLink hostnames returned by &lt;code&gt;SYSTEM$GET_PRIVATELINK_CONFIG()&lt;/code&gt; (the account URL and the OCSP URL) to the DNS name of the AWS VPC Endpoint.&lt;/p&gt;

&lt;p&gt;Using a Route 53 Private Hosted Zone is the most common approach.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Verify Connectivity
&lt;/h3&gt;

&lt;p&gt;Finally, verify connectivity from the client side.&lt;/p&gt;

&lt;p&gt;For diagnostics, &lt;code&gt;SnowCD&lt;/code&gt; (Snowflake Connectivity Diagnostic Tool) is useful for validating PrivateLink connectivity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;snowcd &amp;lt;hostfile&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
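
&lt;p&gt;The host file passed to SnowCD is the JSON output of a Snowflake allowlist function. As a sketch, for a PrivateLink setup you would typically run the following query and save its result to a file (the file name, for example &lt;code&gt;allowlist.json&lt;/code&gt;, is my own choice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns the PrivateLink hostnames and ports that clients must be able to reach
SELECT SYSTEM$ALLOWLIST_PRIVATELINK();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;SnowCD then checks each host and port in that file from the machine where it runs, so it is best executed from inside the source VPC.&lt;/p&gt;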






&lt;h2&gt;
  
  
  Configuring VPC Endpoints for S3 Access
&lt;/h2&gt;

&lt;p&gt;This is an easy detail to overlook.&lt;/p&gt;

&lt;p&gt;Snowflake drivers such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JDBC&lt;/li&gt;
&lt;li&gt;ODBC&lt;/li&gt;
&lt;li&gt;Python Connector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;internally access Amazon S3 during data load/unload operations against stages.&lt;/p&gt;

&lt;p&gt;Even if Snowflake connectivity itself is private via PrivateLink, S3 traffic may still traverse the public internet unless additional configuration is performed.&lt;/p&gt;

&lt;p&gt;Available approaches include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating AWS VPC Interface Endpoints for Snowflake internal stages (recommended)&lt;/li&gt;
&lt;li&gt;Creating an S3 Gateway Endpoint to privatize S3 bucket access&lt;/li&gt;
&lt;li&gt;Allowing internet-based S3 access (strongly discouraged)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want a fully private architecture, Snowflake officially recommends creating VPC Interface Endpoints for internal stages.&lt;/p&gt;
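
&lt;p&gt;On the Snowflake side, the internal stage approach can be sketched roughly as follows. This reflects my understanding of the Snowflake documentation rather than a configuration I have validated in every environment, so please confirm the details before applying it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Enable private connectivity to internal stages (run as ACCOUNTADMIN)
ALTER ACCOUNT SET ENABLE_INTERNAL_STAGES_PRIVATELINK = TRUE;

-- The privatelink-internal-stage key now shows the stage location
-- that the S3 VPC Interface Endpoint needs to reach
SELECT key, value
FROM TABLE(FLATTEN(INPUT =&amp;gt; PARSE_JSON(SYSTEM$GET_PRIVATELINK_CONFIG())));

-- Optional: block public access to internal stages once the private path works
SELECT SYSTEM$BLOCK_INTERNAL_STAGES_PUBLIC_ACCESS();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;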




&lt;h2&gt;
  
  
  Blocking Public Access
&lt;/h2&gt;

&lt;p&gt;After establishing PrivateLink connectivity, you can also block public access from the Snowflake side.&lt;/p&gt;

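&lt;p&gt;The following example allows only connections from a private IP range (&lt;code&gt;10.0.0.0/8&lt;/code&gt; is a placeholder; use the CIDR of your own VPC), which restricts access to PrivateLink-based connectivity.&lt;br&gt;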
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;NETWORK&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;privatelink_only&lt;/span&gt;
  &lt;span class="n"&gt;ALLOWED_IP_LIST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'10.0.0.0/8'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;ACCOUNT&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;NETWORK_POLICY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;privatelink_only&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
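
&lt;p&gt;Because an account-level network policy can lock out legitimate clients, it is worth checking the result and keeping a rollback statement at hand. The following is a minimal sketch that reuses the &lt;code&gt;privatelink_only&lt;/code&gt; policy name from the example above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Confirm which network policy is active at the account level
SHOW PARAMETERS LIKE 'NETWORK_POLICY' IN ACCOUNT;

-- Inspect the allowed and blocked IP ranges of the policy
DESCRIBE NETWORK POLICY privatelink_only;

-- Roll back if legitimate clients are locked out
ALTER ACCOUNT UNSET NETWORK_POLICY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;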



&lt;p&gt;Snowflake also provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SYSTEM$ENFORCE_PRIVATELINK_ACCESS_ONLY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;“Enforce privatelink-only access”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;which are also valid approaches.&lt;/p&gt;

&lt;p&gt;Combining VPN-based corporate IP ranges with PrivateLink-only access can create an even more secure architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Leveraging Tri-Secret Secure
&lt;/h2&gt;

&lt;p&gt;Business Critical Edition also supports Tri-Secret Secure using AWS KMS customer-managed keys (CMKs).&lt;/p&gt;

&lt;p&gt;This mechanism requires both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake-managed keys&lt;/li&gt;
&lt;li&gt;Customer-managed keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as an AND condition for decryption.&lt;/p&gt;

&lt;p&gt;Even if Snowflake itself were compromised, data could not be decrypted without the customer-managed key.&lt;/p&gt;

&lt;p&gt;Combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PrivateLink&lt;/li&gt;
&lt;li&gt;Tri-Secret Secure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;creates a very strong architecture for regulatory compliance.&lt;/p&gt;

&lt;p&gt;I have not personally implemented this feature, so I will omit further details here.&lt;/p&gt;




&lt;h1&gt;
  
  
  Cross-Region Connectivity
&lt;/h1&gt;

&lt;p&gt;AWS PrivateLink is fundamentally designed for same-region connectivity.&lt;/p&gt;

&lt;p&gt;However, Business Critical Edition and above also support cross-region connectivity.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake account in &lt;code&gt;US-EAST&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;AWS VPC in &lt;code&gt;ap-northeast-1&lt;/code&gt; (Tokyo)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can still communicate privately via PrivateLink.&lt;/p&gt;

&lt;p&gt;That said, there are several caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PaaS services such as S3 and KMS do not support cross-region PrivateLink&lt;/li&gt;
&lt;li&gt;Government and China regions are not supported&lt;/li&gt;
&lt;li&gt;“Enable Cross Region Endpoint” must be enabled in the VPC console&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, aligning the Snowflake region with the application region generally results in a simpler and easier-to-operate architecture.&lt;/p&gt;

&lt;p&gt;Still, for globally distributed data platforms, these considerations become important.&lt;/p&gt;




&lt;h1&gt;
  
  
  Balancing Edition Selection and Cost
&lt;/h1&gt;

&lt;p&gt;Business Critical Edition provides major security advantages, but its per-credit price is roughly 1.3x that of Enterprise Edition.&lt;/p&gt;

&lt;p&gt;As rough on-demand reference pricing for US East in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard: approximately $2/credit&lt;/li&gt;
&lt;li&gt;Enterprise: approximately $3/credit&lt;/li&gt;
&lt;li&gt;Business Critical: approximately $4/credit&lt;/li&gt;
&lt;/ul&gt;
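
&lt;p&gt;As a rough illustration of that difference, the arithmetic below uses the approximate per-credit prices listed above. The monthly volume of 1,000 credits is a hypothetical figure chosen only to make the numbers concrete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
  ROUND(4.0 / 3.0, 2) AS bc_to_enterprise_ratio,      -- 1.33
  (4 - 3) * 1000      AS extra_usd_per_1000_credits;  -- 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;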

&lt;p&gt;If you have a strict requirement that traffic must never traverse the public internet, Business Critical is effectively the only option.&lt;/p&gt;

&lt;p&gt;However, from a practical standpoint, balancing data sensitivity and cost often leads to architectures such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production data warehouse (including sensitive data): Business Critical&lt;/li&gt;
&lt;li&gt;Development / testing environments: Enterprise&lt;/li&gt;
&lt;li&gt;Dedicated data sharing accounts: Enterprise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using multiple editions strategically within the same organization can also be a reasonable approach.&lt;/p&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, I introduced the differences between Snowflake editions and explored secure AWS connectivity using Business Critical Edition, drawing on some of my own experience.&lt;/p&gt;

&lt;p&gt;The key points are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake provides four editions (Standard / Enterprise / Business Critical / VPS), with higher editions adding stronger security and compliance capabilities&lt;/li&gt;
&lt;li&gt;AWS PrivateLink requires Business Critical Edition or higher, so networking/security requirements should be validated early&lt;/li&gt;
&lt;li&gt;Even Enterprise Edition can achieve reasonable security through network policies, Key Pair authentication, and TLS&lt;/li&gt;
&lt;li&gt;Business Critical enables private connectivity between AWS VPCs and Snowflake through PrivateLink, keeping Snowflake traffic off the public internet&lt;/li&gt;
&lt;li&gt;S3 access must also be privatized, so VPC Endpoints for internal stages should be configured as well&lt;/li&gt;
&lt;li&gt;Combining Tri-Secret Secure with PrivateLink enables architectures well suited for regulatory compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think many teams struggle specifically with deciding between Enterprise Edition and Business Critical Edition.&lt;/p&gt;

&lt;p&gt;Although edition upgrades are possible later, they can significantly impact both architecture and cost.&lt;br&gt;
For that reason, it is best to sort out these requirements carefully during the early stages of requirements definition and architecture design.&lt;/p&gt;

&lt;p&gt;I hope this article helps anyone looking to use Snowflake securely on AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>What Is Apache Polaris? Why Open Data Catalogs Matter and How to Use Them with AWS</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Sat, 02 May 2026 06:27:16 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal</link>
      <guid>https://dev.to/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/28aa29c2f9fbeb" rel="noopener noreferrer"&gt;Apache Polarisとは何か？オープンなデータカタログが求められる理由とAWSとの組み合わせ方を整理する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In recent years, lakehouse architectures centered around Apache Iceberg have been rapidly expanding.&lt;/p&gt;

&lt;p&gt;By placing Iceberg tables on object storage such as S3, it has become possible to query the same data from multiple engines such as Athena, Snowflake, Spark, Trino, and Dremio.&lt;br&gt;
As a result, the discussion has shifted from &lt;em&gt;“Where should data be placed, and which engine should be used for analysis?”&lt;/em&gt; to &lt;em&gt;“Where should data ownership reside, and which catalog should be used to unify governance?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amid this trend, &lt;strong&gt;Apache Polaris&lt;/strong&gt; has been attracting attention.&lt;br&gt;
Apache Polaris is an open-source implementation of the Iceberg REST Catalog, originally developed by Snowflake and donated to the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;Multiple vendors—including Dremio, AWS, Google, Microsoft, and Confluent—are contributing to it, and it is positioned as an &lt;strong&gt;“open catalog”&lt;/strong&gt; that enables cross-platform management of Iceberg tables while avoiding vendor lock-in.&lt;/p&gt;

&lt;p&gt;In this article, I would like to think through the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Apache Polaris is&lt;/li&gt;
&lt;li&gt;Why open data catalogs are required&lt;/li&gt;
&lt;li&gt;Differences from AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Differences from Snowflake Horizon Catalog&lt;/li&gt;
&lt;li&gt;How responsibilities should be divided when combining with AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To state the conclusion up front, Apache Polaris does not &lt;em&gt;compete&lt;/em&gt; with AWS Glue Data Catalog or Snowflake Horizon Catalog; rather, they are catalogs that operate at different layers.&lt;/p&gt;

&lt;p&gt;It may be easier to understand Apache Polaris as a component that enables an architecture such as:&lt;br&gt;
&lt;strong&gt;“The data itself resides in AWS, the catalog is open, and analysis engines are selected based on use cases.”&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What is Apache Polaris?
&lt;/h2&gt;

&lt;p&gt;Apache Polaris is an open-source catalog implementation compliant with the Apache Iceberg REST Catalog specification.&lt;br&gt;
It was announced by Snowflake in 2024 and later became an incubation project under the Apache Software Foundation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project has now graduated from incubation and has been promoted to a top-level Apache project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Official site:&lt;br&gt;
&lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;https://polaris.apache.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Polaris aims to achieve is a &lt;strong&gt;common metadata and governance foundation in a lakehouse centered around Iceberg tables&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A major characteristic is that it is not tied to any specific query engine or cloud vendor, and anyone can access it using the same specification via REST APIs.&lt;/p&gt;


&lt;h3&gt;
  
  
  Key Features of Apache Polaris
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Implementation of Iceberg REST Catalog&lt;/td&gt;
&lt;td&gt;Accessible via standardized REST APIs. Can be directly used from engines such as Spark, Trino, Flink, Snowflake, and Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog architecture&lt;/td&gt;
&lt;td&gt;Multiple catalogs can be defined within a single Polaris instance. Enables separation and management by team or business domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RBAC (Role-Based Access Control)&lt;/td&gt;
&lt;td&gt;Provides a permission model combining principals, principal roles, and catalog roles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External catalog integration&lt;/td&gt;
&lt;td&gt;Can connect to other catalogs compliant with the Iceberg REST specification (e.g., Nessie, Gravitino)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS / Managed support&lt;/td&gt;
&lt;td&gt;Can be self-hosted as OSS, or used as managed offerings such as Snowflake Open Catalog or Dremio Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  What Apache Polaris Solves
&lt;/h3&gt;

&lt;p&gt;As Apache Iceberg has become more widely adopted, multiple Iceberg-compatible catalogs have emerged, including Hive Metastore, JDBC, Nessie, AWS Glue, and Snowflake.&lt;/p&gt;

&lt;p&gt;Since each has its own client libraries and interfaces, the following challenges have arisen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The need to implement catalog clients for each programming language&lt;/li&gt;
&lt;li&gt;Inconsistent access control specifications across catalogs&lt;/li&gt;
&lt;li&gt;Difficulty enforcing governance across multiple catalogs&lt;/li&gt;
&lt;li&gt;As a result, the overall architecture becomes constrained by the chosen catalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve these challenges, the Iceberg REST Catalog specification was introduced.&lt;br&gt;
Apache Polaris is an open-source implementation of that specification, further enhanced with multi-catalog support and RBAC.&lt;/p&gt;

&lt;p&gt;In other words, you can think of it as an &lt;strong&gt;open catalog for Apache Iceberg&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Polaris Security Model
&lt;/h3&gt;

&lt;p&gt;The Polaris security model can be organized into the following three concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principal&lt;/strong&gt;: An entity representing a user or service. Accesses Polaris via client ID/secret, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Principal Role&lt;/strong&gt;: A grouping of multiple catalog roles. Assigned to principals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Role&lt;/strong&gt;: A set of permissions within a specific catalog. Includes permissions such as &lt;code&gt;TABLE_READ_DATA&lt;/code&gt;, &lt;code&gt;TABLE_CREATE&lt;/code&gt;, and &lt;code&gt;NAMESPACE_LIST&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, you can design it such that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;data_engineer&lt;/code&gt; principal role is assigned both &lt;em&gt;write access to prod_catalog&lt;/em&gt; and &lt;em&gt;administrative access to dev_catalog&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;data_analyst&lt;/code&gt; principal role is assigned only &lt;em&gt;read access to prod_catalog&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An important point is that RBAC is centralized on the catalog side, eliminating the need to implement access control separately for each engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Open Data Catalogs Are Required
&lt;/h2&gt;

&lt;p&gt;Let us start by considering why open data catalogs are required at all.&lt;/p&gt;


&lt;h3&gt;
  
  
  Separation of Data and Engines Has Become a Given
&lt;/h3&gt;

&lt;p&gt;The greatest value of open table formats such as Apache Iceberg is the ability to separate data storage from query engines.&lt;/p&gt;

&lt;p&gt;When querying Iceberg tables on S3, you can now freely choose engines such as Athena, Glue, Spark, Snowflake, Dremio, and DuckDB depending on the use case.&lt;/p&gt;

&lt;p&gt;As a result, the key question in data platforms has shifted from &lt;em&gt;“Which product should we use?”&lt;/em&gt; to &lt;em&gt;“Where should data ownership reside, and who should be responsible for governance at which layer?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, while engines can now be freely selected, the remaining challenge is the &lt;strong&gt;catalog&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What Happens When Catalogs Are Tied to Engines
&lt;/h3&gt;

&lt;p&gt;When using catalogs tightly coupled with query engines, the following situations tend to occur:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data itself is open (S3 + Iceberg), but the catalog is tied to a specific engine&lt;/li&gt;
&lt;li&gt;You want to reference the same table from another engine, but the catalog does not support it&lt;/li&gt;
&lt;li&gt;Access control is fragmented across engines, making governance difficult&lt;/li&gt;
&lt;li&gt;Every time the catalog is changed, all engine-side configurations must be redone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, even if storage and formats are open, a closed catalog significantly reduces the benefits of a lakehouse.&lt;/p&gt;

&lt;p&gt;Especially in today’s environments where multi-cloud, multiple products, and multiple engines are commonly combined, how to unify catalogs becomes a key challenge.&lt;/p&gt;


&lt;h3&gt;
  
  
  Requirements for an Open Catalog
&lt;/h3&gt;

&lt;p&gt;Based on this background, lakehouse catalogs are expected to meet the following requirements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compliance with standard APIs&lt;/td&gt;
&lt;td&gt;Support vendor-neutral APIs such as the Iceberg REST Catalog specification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-engine support&lt;/td&gt;
&lt;td&gt;Usable across engines such as Spark, Trino, Flink, Snowflake, and Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Centralized RBAC&lt;/td&gt;
&lt;td&gt;Define permissions at the catalog level and apply consistent governance across all engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud / hybrid&lt;/td&gt;
&lt;td&gt;Not dependent on a specific cloud and capable of running on-premises when necessary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS sustainability&lt;/td&gt;
&lt;td&gt;Not discontinued based on vendor decisions; continuously developed in a community-driven manner&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apache Polaris is a catalog designed to satisfy these requirements.&lt;/p&gt;


&lt;h2&gt;
  
  
  Differences from AWS Glue Data Catalog
&lt;/h2&gt;

&lt;p&gt;When building on AWS, AWS Glue Data Catalog is often positioned as the central data catalog.&lt;br&gt;
Here, I will summarize the differences between AWS Glue Data Catalog and Apache Polaris.&lt;/p&gt;


&lt;h3&gt;
  
  
  Positioning of AWS Glue Data Catalog
&lt;/h3&gt;

&lt;p&gt;AWS Glue Data Catalog is a core metadata management service in AWS.&lt;/p&gt;

&lt;p&gt;It is natively integrated with AWS analytics services such as Athena, Glue, Redshift Spectrum, and EMR, and serves as the catalog for data stored on S3.&lt;/p&gt;

&lt;p&gt;As discussed in previous articles, Glue Data Catalog is an excellent &lt;strong&gt;technical catalog used by data platforms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/is-aws-glue-data-catalog-sufficient-as-a-data-catalog-organizing-its-design-limitations-and-kih"&gt;Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Functional Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;AWS Glue Data Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Offering&lt;/td&gt;
&lt;td&gt;AWS-managed (closed)&lt;/td&gt;
&lt;td&gt;OSS / Managed (Snowflake Open Catalog, Dremio Catalog, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;AWS proprietary API (recently also provides Iceberg REST compatibility)&lt;/td&gt;
&lt;td&gt;Iceberg REST Catalog specification (open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud support&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Multi-cloud / on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engines&lt;/td&gt;
&lt;td&gt;Athena, Glue, Redshift, EMR, Spark&lt;/td&gt;
&lt;td&gt;Spark, Trino, Flink, Snowflake, Dremio, StarRocks, DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-catalog&lt;/td&gt;
&lt;td&gt;Account-level (logical separation via Lake Formation)&lt;/td&gt;
&lt;td&gt;Native support for multiple catalogs within a single instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access control&lt;/td&gt;
&lt;td&gt;IAM + Lake Formation&lt;/td&gt;
&lt;td&gt;Built-in RBAC (Principal / Principal Role / Catalog Role)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External catalog integration&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Can integrate with Iceberg REST-compliant catalogs (Nessie, Gravitino, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-Iceberg formats&lt;/td&gt;
&lt;td&gt;Supports Hive, JSON, CSV, Parquet, etc.&lt;/td&gt;
&lt;td&gt;Currently Iceberg-centric (Generic Table support on roadmap)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  How to Interpret the Difference
&lt;/h3&gt;

&lt;p&gt;Rather than viewing them as competitors, it is easier to understand them as catalogs with different roles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;: Strong integration with AWS services, making it the primary choice for workloads that stay entirely within AWS. It supports a wide range of data lake formats beyond Iceberg and features such as S3 crawling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Polaris&lt;/strong&gt;: A catalog that enables governance across multiple engines and clouds based on the industry-standard Iceberg REST API. It is effective when you want to enforce consistent RBAC across engines outside AWS (e.g., Snowflake, Dremio).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your use case is &lt;strong&gt;AWS-contained and includes formats beyond Iceberg&lt;/strong&gt;, Glue Data Catalog is a practical choice&lt;/li&gt;
&lt;li&gt;If you want &lt;strong&gt;common management of Iceberg across multiple engines and a vendor-neutral catalog layer&lt;/strong&gt;, Polaris is suitable&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Differences from Snowflake Horizon Catalog
&lt;/h2&gt;

&lt;p&gt;These two are often confused, so let’s clarify the difference between Snowflake Horizon Catalog and Apache Polaris.&lt;br&gt;
Note that Horizon Catalog is different from “Snowflake Open Catalog,” despite the similar name.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is Snowflake Horizon Catalog?
&lt;/h3&gt;

&lt;p&gt;Snowflake Horizon Catalog is a data governance and discovery suite provided by Snowflake.&lt;/p&gt;

&lt;p&gt;For data managed within Snowflake (Snowflake-managed tables, stages, views, shared data, etc.), it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data discovery (search, tagging, descriptions)&lt;/li&gt;
&lt;li&gt;Lineage&lt;/li&gt;
&lt;li&gt;Data quality monitoring&lt;/li&gt;
&lt;li&gt;Masking policies and row access policies&lt;/li&gt;
&lt;li&gt;Automatic classification of sensitive data&lt;/li&gt;
&lt;li&gt;Compliance management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of positioning, it is similar to Amazon DataZone + Lake Formation + Glue Data Quality in AWS.&lt;/p&gt;

&lt;p&gt;In other words, it is the layer responsible for &lt;strong&gt;cataloging and governance so that people can discover, understand, and trust data&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is Snowflake Open Catalog (Relation to Polaris)
&lt;/h3&gt;

&lt;p&gt;On the other hand, Snowflake Open Catalog is a managed offering of Apache Polaris.&lt;/p&gt;

&lt;p&gt;Although the name is confusing, this is the lakehouse catalog that serves as an Iceberg REST Catalog.&lt;/p&gt;

&lt;p&gt;In Snowflake’s model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake-managed data&lt;/li&gt;
&lt;li&gt;Snowflake Open Catalog (= Apache Polaris): Lakehouse catalog layer for open table formats such as Iceberg&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Functional Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Snowflake Horizon Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary target&lt;/td&gt;
&lt;td&gt;Data in Snowflake (internal tables, shared data, etc.)&lt;/td&gt;
&lt;td&gt;Iceberg (Generic Table support for other formats is planned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer&lt;/td&gt;
&lt;td&gt;Business catalog / governance layer&lt;/td&gt;
&lt;td&gt;Lakehouse catalog layer (technical catalog)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offering&lt;/td&gt;
&lt;td&gt;Built into Snowflake (closed)&lt;/td&gt;
&lt;td&gt;OSS / Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Snowflake proprietary&lt;/td&gt;
&lt;td&gt;Iceberg REST Catalog specification (open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data location&lt;/td&gt;
&lt;td&gt;Snowflake internal storage or recognized external data&lt;/td&gt;
&lt;td&gt;Iceberg tables on cloud storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Within Snowflake organizations&lt;/td&gt;
&lt;td&gt;Across multiple engines and clouds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  How to Interpret the Difference
&lt;/h3&gt;

&lt;p&gt;Again, these are not in opposition but complementary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake Horizon Catalog&lt;/strong&gt;: Upper layer that provides data to business users, handling discovery, quality, masking, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Polaris&lt;/strong&gt;: Lower layer (metadata foundation) that exposes Iceberg tables to multiple engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, the structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────┐
│  Business Catalog / Governance Layer         │ ← Snowflake Horizon Catalog
│  (Discovery / Lineage / Quality / Masking)   │   Amazon DataZone, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Lakehouse Catalog Layer                     │ ← Apache Polaris
│  (Iceberg REST Catalog / RBAC)               │   AWS Glue Data Catalog, etc.
└─────────────────────┬────────────────────────┘
                      │
┌─────────────────────┴────────────────────────┐
│  Data Lake (S3 / GCS / Azure Blob)           │
│  Iceberg / Parquet                           │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Treating Snowflake Horizon Catalog and Apache Polaris as an either-or choice feels unnatural, but once they are viewed as different layers, the division of responsibilities becomes clear.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Combine with AWS
&lt;/h2&gt;

&lt;p&gt;From here, we will consider cases where Apache Polaris is introduced into an AWS environment.&lt;br&gt;
Since AWS already has a powerful catalog called Glue Data Catalog, it is important to clarify &lt;strong&gt;how Polaris should be positioned&lt;/strong&gt; and &lt;strong&gt;who is responsible for what&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Expected Architecture
&lt;/h3&gt;

&lt;p&gt;Representative configurations can be organized into the following three patterns.&lt;/p&gt;


&lt;h4&gt;
  
  
  Pattern 1: AWS-only (Glue Data Catalog-centered)
&lt;/h4&gt;

&lt;p&gt;This is the simplest configuration.&lt;br&gt;
It is a typical setup using S3 + Iceberg + Glue Data Catalog, along with Athena / Glue / Redshift Spectrum.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog: AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Governance: IAM + Lake Formation&lt;/li&gt;
&lt;li&gt;Query engines: Athena, Redshift Spectrum, Glue ETL, EMR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything stays within AWS and there is no strong need to share with external engines, this configuration remains the most practical.&lt;br&gt;
There is no need to force Apache Polaris into the architecture.&lt;/p&gt;


&lt;h4&gt;
  
  
  Pattern 2: AWS + Snowflake (Using Polaris as a shared catalog foundation)
&lt;/h4&gt;

&lt;p&gt;This configuration is effective when you want to reference the same Iceberg tables from both AWS (e.g., Athena) and Snowflake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage: S3 + Iceberg&lt;/li&gt;
&lt;li&gt;Catalog: Apache Polaris (OSS self-hosted or Snowflake Open Catalog)&lt;/li&gt;
&lt;li&gt;AWS side: Reference Polaris as an Iceberg REST Catalog (via Spark or third-party tools)&lt;/li&gt;
&lt;li&gt;Snowflake side: Connect to Polaris using External Volume and Catalog Integration (&lt;code&gt;CATALOG_SOURCE = POLARIS&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the Snowflake side, Polaris can be referenced directly as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;polaris_catalog_int&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_SOURCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;POLARIS&lt;/span&gt;
  &lt;span class="n"&gt;TABLE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt;
  &lt;span class="n"&gt;REST_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CATALOG_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://&amp;lt;polaris-host&amp;gt;/api/catalog'&lt;/span&gt;
    &lt;span class="k"&gt;CATALOG_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;your_polaris_catalog&amp;gt;'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;REST_AUTHENTICATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OAUTH&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;polaris_client_id&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;polaris_client_secret&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;OAUTH_ALLOWED_SCOPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'PRINCIPAL_ROLE:ALL'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  Pattern 3: Multi-engine / multi-cloud configuration
&lt;/h4&gt;

&lt;p&gt;In addition to Snowflake, this configuration includes multiple engines such as Dremio, Databricks, Trino, and Flink.&lt;/p&gt;

&lt;p&gt;In this case, all engines reference Polaris as a common Iceberg REST Catalog.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data storage: S3 (and other cloud storage if needed)&lt;/li&gt;
&lt;li&gt;Catalog: Apache Polaris (center of governance)&lt;/li&gt;
&lt;li&gt;Query engines: Snowflake, Dremio, Spark, Trino, Flink, etc.&lt;/li&gt;
&lt;li&gt;Governance: Polaris provides unified RBAC across all engines&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  How to Think About Responsibility Separation
&lt;/h3&gt;

&lt;p&gt;This is the key point.&lt;br&gt;
When combining Polaris, AWS, Snowflake, and others, it is important to clearly define &lt;strong&gt;who is responsible for which layer&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Owner&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data storage (files)&lt;/td&gt;
&lt;td&gt;AWS (S3)&lt;/td&gt;
&lt;td&gt;Storage location of the data. Single Source of Truth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage access control&lt;/td&gt;
&lt;td&gt;AWS (IAM)&lt;/td&gt;
&lt;td&gt;Access permissions to S3 buckets/prefixes are defined on the AWS side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table metadata&lt;/td&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Source of Truth for Iceberg metadata such as schema, snapshots, partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table-level RBAC&lt;/td&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Applies consistent permission rules across engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL / pipelines&lt;/td&gt;
&lt;td&gt;AWS Glue / Lambda / EMR / Spark&lt;/td&gt;
&lt;td&gt;Responsible for ingestion and transformation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;td&gt;Athena / Snowflake / Dremio / Spark&lt;/td&gt;
&lt;td&gt;Engines selected based on use case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business catalog / discovery&lt;/td&gt;
&lt;td&gt;Snowflake Horizon Catalog / Amazon DataZone&lt;/td&gt;
&lt;td&gt;User-facing, higher-layer features such as search, lineage, and data quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data quality&lt;/td&gt;
&lt;td&gt;Glue Data Quality / Snowflake DMF&lt;/td&gt;
&lt;td&gt;Implemented at engine or quality service layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What is especially important is the &lt;strong&gt;three-layer separation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data resides in AWS, the catalog is Polaris, and usage is handled by each engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By making this separation explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS can focus on storage and IAM management&lt;/li&gt;
&lt;li&gt;Polaris can focus on metadata and table-level access control&lt;/li&gt;
&lt;li&gt;Each query engine can focus on its strengths&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Considerations When Adopting Polaris
&lt;/h3&gt;

&lt;p&gt;Polaris is powerful, but there are also important considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational cost when self-hosting OSS&lt;/strong&gt;: Running on EKS or EC2 requires a metastore (e.g., PostgreSQL), authentication infrastructure, monitoring, and upgrade handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed services are often more practical&lt;/strong&gt;: Using Snowflake Open Catalog or Dremio Catalog significantly reduces operational burden&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less seamless integration with AWS services compared to Glue&lt;/strong&gt;: For AWS-native services such as Athena, Redshift, and QuickSight, using Glue Data Catalog is far more straightforward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need to avoid double governance&lt;/strong&gt;: If IAM policies on S3 and RBAC in Polaris are inconsistent, troubleshooting becomes complex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, when deciding whether to adopt Apache Polaris in an AWS environment, it is reasonable to evaluate based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether multi-engine requirements exist&lt;/li&gt;
&lt;li&gt;The organization’s stance on vendor lock-in&lt;/li&gt;
&lt;li&gt;Whether operational cost is acceptable (or managed services can be used)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  A Practical Approach
&lt;/h3&gt;

&lt;p&gt;Personally, I find the following phased approach practical when considering Polaris in an AWS environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a lakehouse within AWS using Glue Data Catalog + Iceberg&lt;/li&gt;
&lt;li&gt;When integration with other engines such as Snowflake becomes necessary, consider introducing an Iceberg REST layer&lt;/li&gt;
&lt;li&gt;At that point, compare “Glue Iceberg REST endpoint,” “Apache Polaris OSS,” and “Snowflake Open Catalog” based on requirements&lt;/li&gt;
&lt;li&gt;If multi-engine / multi-cloud requirements become clear, redesign with Polaris (especially managed) at the center&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rather than designing with Polaris from the beginning, it is often more practical to &lt;strong&gt;replace the catalog layer with an open one when requirements mature&lt;/strong&gt;.&lt;/p&gt;
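
&lt;p&gt;The phased approach above can be sketched as a small decision helper. This is only an illustration of the reasoning: the function name &lt;code&gt;catalog_candidates&lt;/code&gt; and its three boolean inputs are assumptions made for this example, and a real evaluation would weigh many more factors.&lt;/p&gt;

```python
def catalog_candidates(needs_external_engines,
                       multi_engine_or_multi_cloud=False,
                       can_operate_self_hosted=False):
    """Return catalog options to evaluate, following the phased approach (illustrative)."""
    # Phase 1: everything stays inside AWS, so Glue Data Catalog is enough.
    if not needs_external_engines:
        return ["AWS Glue Data Catalog"]

    # Self-hosted Polaris is only a candidate if the team can run and upgrade it.
    self_hosted = ["Apache Polaris OSS"] if can_operate_self_hosted else []

    # Phase 4: clear multi-engine / multi-cloud needs put Polaris at the center,
    # with the managed offering listed first.
    if multi_engine_or_multi_cloud:
        return ["Snowflake Open Catalog"] + self_hosted

    # Phases 2-3: an external engine such as Snowflake needs an Iceberg REST layer,
    # so compare the available REST options.
    return ["Glue Iceberg REST endpoint"] + self_hosted + ["Snowflake Open Catalog"]


print(catalog_candidates(False))
print(catalog_candidates(True, can_operate_self_hosted=True))
print(catalog_candidates(True, multi_engine_or_multi_cloud=True))
```

&lt;p&gt;Returning a list of candidates rather than a single answer mirrors step 3, where the options are compared against requirements instead of being chosen up front.&lt;/p&gt;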




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article walked through the key points to consider around Apache Polaris.&lt;/p&gt;

&lt;p&gt;In the world of data platforms, storage and table formats have become open, but a closed catalog still undercuts much of the benefit of a lakehouse.&lt;/p&gt;

&lt;p&gt;Therefore, there is a need for an &lt;strong&gt;open catalog&lt;/strong&gt; that complies with the Iceberg REST Catalog specification and enables unified governance across multiple engines and clouds.&lt;br&gt;
Apache Polaris is designed to fulfill exactly that role.&lt;/p&gt;

&lt;p&gt;However, it is important to think not in terms of “which one to choose” among Polaris, AWS Glue Data Catalog, and Snowflake Horizon Catalog, but rather &lt;strong&gt;which layer each is responsible for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue Data Catalog: Technical catalog within AWS (still the primary choice for AWS-only workloads)&lt;/li&gt;
&lt;li&gt;Apache Polaris: Lakehouse catalog centered on Iceberg, shared across multiple engines&lt;/li&gt;
&lt;li&gt;Snowflake Horizon Catalog: Business catalog and governance layer for Snowflake users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even when combining Polaris with AWS, consciously separating responsibilities as&lt;br&gt;
&lt;strong&gt;“data in AWS, catalog in Polaris, analytics in engines, business catalog in another layer”&lt;/strong&gt;&lt;br&gt;
lets you design an architecture that leverages the strengths of each.&lt;/p&gt;

&lt;p&gt;Going forward, lakehouse architectures are expected to increasingly adopt vendor-neutral designs.&lt;br&gt;
Apache Polaris is likely to become an important component supporting that openness.&lt;/p&gt;

&lt;p&gt;I hope this article will be helpful for those considering Apache Polaris or designing lakehouse architectures across multiple platforms such as AWS and Snowflake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>iceberg</category>
    </item>
    <item>
      <title>Lightweight ETL on AWS Lambda Using DuckDB and Snowflake Connector</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Sat, 04 Apr 2026 13:54:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/lightweight-etl-on-aws-lambda-using-duckdb-and-snowflake-connector-3h19</link>
      <guid>https://dev.to/aws-builders/lightweight-etl-on-aws-lambda-using-duckdb-and-snowflake-connector-3h19</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/b47f2bf700b68b" rel="noopener noreferrer"&gt;AWS Lambda × DuckDB × Snowflake ConnectorによるETLの実装&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In my previous article, I introduced how to connect to Snowflake from AWS Lambda using Key Pair authentication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/securely-implementing-snowflake-x-aws-lambda-integration-with-key-pair-authentication-secrets-2ba2"&gt;Securely Implementing Snowflake AWS Lambda Integration with Key Pair Authentication + Secrets Manager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, I would like to try the event-driven data ingestion approach that I introduced in the previous article.&lt;/p&gt;

&lt;p&gt;In this article, I will implement an event-driven ETL pipeline that uses DuckDB on AWS Lambda to perform lightweight transformations on Parquet files stored in Amazon S3 and then load the processed data into Snowflake.&lt;/p&gt;

&lt;p&gt;In addition, during the implementation process, I encountered an interesting limitation where &lt;code&gt;write_pandas&lt;/code&gt; fails when writing to a Catalog-Linked Database. I will also summarize the root cause and the workaround.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Snowpipe Is Not Enough
&lt;/h1&gt;

&lt;p&gt;Snowpipe is a very convenient feature for automatic data ingestion.&lt;/p&gt;

&lt;p&gt;However, it has limitations when it comes to data transformation and complex filtering.&lt;/p&gt;

&lt;p&gt;In other words, when you need preprocessing, filtering, or the integration of multiple events, you need to choose another approach.&lt;/p&gt;

&lt;p&gt;In such cases, AWS Lambda becomes a strong option due to its high flexibility.&lt;/p&gt;




&lt;h1&gt;
  
  
  Architecture
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feunw8xz2jgbd4jm9vqp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feunw8xz2jgbd4jm9vqp6.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Parquet file is uploaded to S3&lt;/li&gt;
&lt;li&gt;Lambda is triggered by the S3 event&lt;/li&gt;
&lt;li&gt;DuckDB reads the data and performs the required transformations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;snowflake.connector&lt;/code&gt; writes the data into Snowflake&lt;/li&gt;
&lt;/ol&gt;
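
&lt;p&gt;Steps 1 and 2 can be sketched as follows: Lambda receives an S3 event notification and extracts the object locations from it. Object keys in S3 event notifications are URL-encoded, so they are decoded before the path is built. The helper name &lt;code&gt;parquet_paths_from_event&lt;/code&gt; and the sample bucket and key are hypothetical, and the sample event is trimmed to the fields used here.&lt;/p&gt;

```python
from urllib.parse import unquote_plus


def parquet_paths_from_event(event):
    """Return s3:// paths for every Parquet object referenced in an S3 event."""
    paths = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (a space becomes "+"), so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Skip anything that is not a Parquet file.
        if key.lower().endswith(".parquet"):
            paths.append(f"s3://{bucket}/{key}")
    return paths


# Trimmed-down sample event with a hypothetical bucket and key
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "raw/yellow+tripdata_2024-01.parquet"},
            }
        }
    ]
}

print(parquet_paths_from_event(sample_event))
```

&lt;p&gt;Looping over every record, rather than reading only the first one, keeps the handler correct if a single invocation carries more than one object.&lt;/p&gt;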

&lt;p&gt;The two key libraries used in this implementation are described below.&lt;/p&gt;




&lt;h2&gt;
  
  
  DuckDB
&lt;/h2&gt;

&lt;p&gt;DuckDB is an embedded database engine designed for OLAP (Online Analytical Processing).&lt;/p&gt;

&lt;p&gt;Because DuckDB is extremely lightweight and supports in-memory processing, it can run efficiently even in a simple execution environment such as AWS Lambda.&lt;/p&gt;

&lt;p&gt;Its columnar, vectorized execution engine makes it particularly well suited to batch workloads such as data analytics and ETL processing.&lt;/p&gt;

&lt;p&gt;In addition, it enables SQL-based filtering and lightweight data transformations, allowing for intuitive implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;https://duckdb.org/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Snowflake Connector
&lt;/h2&gt;

&lt;p&gt;Snowflake Connector for Python is a library that provides an interface for connecting to Snowflake and executing all standard operations.&lt;/p&gt;

&lt;p&gt;This library makes it possible to work with Snowflake from runtime environments such as Lambda.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/developer-guide/python-connector/python-connector" rel="noopener noreferrer"&gt;https://docs.snowflake.com/en/developer-guide/python-connector/python-connector&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Sample Code
&lt;/h1&gt;

&lt;p&gt;In the sample code below, &lt;code&gt;WHERE VendorID = 1&lt;/code&gt; is added as an ETL filter.&lt;/p&gt;

&lt;p&gt;Performing filtering and data transformation inside Lambda allows for highly flexible preprocessing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.hazmat.primitives&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;serialization&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.hazmat.backends&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default_backend&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector.pandas_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;write_pandas&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;SECRET_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snowflake-keypair&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRET_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;duckdb_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET home_directory=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSTALL httpfs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOAD httpfs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# S3 event notifications URL-encode object keys, so decode before building the path&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unquote_plus&lt;/span&gt;
        &lt;span class="n"&gt;s3_object_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unquote_plus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;s3_input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_object_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3 input path: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT *
            FROM read_parquet(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
            WHERE VendorID = 1
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;result_arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch_arrow_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB filtered rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;private_key_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serialization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_pem_private_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privateKey&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;default_backend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;private_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;private_key_obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;warehouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USE DATABASE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USE SCHEMA &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nchunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YELLOW_TRIPDATA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Snowflake write success=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, chunks=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nchunks&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and wrote &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows to Snowflake.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  Execution Result
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fackggvtaz462vtl750tf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fackggvtaz462vtl750tf.png" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gu8zxnvcnh4st7xdle5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gu8zxnvcnh4st7xdle5.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown above, the data was successfully written.&lt;/p&gt;




&lt;h1&gt;
  
  
  Switching the Destination to a Catalog-Linked Database
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7g7qkaznukvw61kefhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7g7qkaznukvw61kefhr.png" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A previous article introduced Catalog-Linked Databases. What happens if we point the same write at an Iceberg table in a Catalog-Linked Database?&lt;/p&gt;

&lt;p&gt;Let’s test it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/aws-snowflake-lakehouse-2-practical-apache-iceberg-integration-patterns-812"&gt;AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Write Error Occurs
&lt;/h2&gt;

&lt;p&gt;When attempting to write to the Catalog-Linked Database, the following error occurred:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"statusCode"&lt;/span&gt;: 500,
  &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"093678 (0A000): SQL Compilation Error: This operation is not supported in a catalog-linked database."&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why Writing to a Catalog-Linked Database Fails
&lt;/h2&gt;

&lt;p&gt;The failure comes from the interaction between &lt;code&gt;write_pandas&lt;/code&gt; in the Snowflake Connector and the constraints of a Catalog-Linked Database.&lt;/p&gt;

&lt;p&gt;Internally, &lt;code&gt;write_pandas&lt;/code&gt; creates a temporary stage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api#id13" rel="noopener noreferrer"&gt;Python Connector API&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Writes a pandas DataFrame to a table in a Snowflake database.&lt;br&gt;
To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO &amp;lt;table&amp;gt; command to copy the data from the files to the table.&lt;/p&gt;

&lt;p&gt;You can use some of the function parameters to control how the PUT and COPY INTO &amp;lt;table&amp;gt; statements are executed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, stage creation is not supported in a Catalog-Linked Database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-catalog-linked-database#considerations-for-using-a-catalog-linked-database-for-iceberg-tables" rel="noopener noreferrer"&gt;Considerations for using a catalog-linked database for Iceberg tables&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can create schemas, externally managed Iceberg tables, or database roles in a catalog-linked database. Creating other Snowflake objects isn't currently supported.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This conflict causes &lt;code&gt;write_pandas&lt;/code&gt; to fail with the error:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;This operation is not supported in a catalog-linked database.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;More specifically, the temporary stage created internally falls under the category of “other Snowflake objects,” so the error occurs at the point where &lt;code&gt;CREATE TEMPORARY STAGE&lt;/code&gt; is executed.&lt;/p&gt;

&lt;p&gt;That said, there is a workaround.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Write to a Catalog-Linked Database
&lt;/h2&gt;

&lt;p&gt;A relatively simple approach is to use an &lt;code&gt;INSERT&lt;/code&gt; statement directly.&lt;/p&gt;

&lt;p&gt;Here is an example implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
            &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;placeholders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.YELLOW_TRIPDATA (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) VALUES (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;placeholders&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executemany&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
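&lt;p&gt;The fragment above rebuilds the SQL string on every iteration and relies on &lt;code&gt;cur&lt;/code&gt;, &lt;code&gt;df&lt;/code&gt;, and &lt;code&gt;secret&lt;/code&gt; from the surrounding handler. As a minimal sketch of the same chunked &lt;code&gt;INSERT&lt;/code&gt; idea, the helpers below (hypothetical names, not part of the Snowflake Connector) build the statement once and pass batches to any DB-API cursor that accepts &lt;code&gt;%s&lt;/code&gt; placeholders. The table and column names are interpolated directly into the SQL, so they are assumed to come from trusted configuration.&lt;/p&gt;

```python
def build_insert_sql(table, columns):
    """Build a parameterized INSERT statement with one %s placeholder per column."""
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of rows, each holding at most chunk_size items."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def insert_in_chunks(cur, table, columns, rows, chunk_size=1000):
    """Send rows through cur.executemany in batches and return the row count."""
    sql = build_insert_sql(table, columns)  # built once, reused for every batch
    total = 0
    for chunk in iter_chunks(rows, chunk_size):
        cur.executemany(sql, chunk)
        total += len(chunk)
    return total
```

&lt;p&gt;With the DataFrame from the article, a call might look like &lt;code&gt;insert_in_chunks(cur, f"{secret['schema']}.YELLOW_TRIPDATA", list(df.columns), df.values.tolist())&lt;/code&gt;, where &lt;code&gt;cur&lt;/code&gt; is assumed to come from &lt;code&gt;conn.cursor()&lt;/code&gt;.&lt;/p&gt;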



&lt;p&gt;Another option is to create the stage in a different database and load the data through that stage.&lt;/p&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, I implemented a lightweight event-driven ETL pipeline triggered by S3 events using AWS Lambda, DuckDB, and the Snowflake Connector.&lt;/p&gt;

&lt;p&gt;By using DuckDB inside Lambda, I was able to perform SQL-based filtering and lightweight transformations directly on Parquet files stored in S3, and successfully load the processed results into Snowflake.&lt;/p&gt;

&lt;p&gt;In addition, I confirmed an important limitation: when using &lt;code&gt;write_pandas&lt;/code&gt; against a Catalog-Linked Database (Iceberg), the write fails because the connector internally creates a temporary stage.&lt;/p&gt;

&lt;p&gt;Although there are some constraints, combining DuckDB and the Snowflake Connector enables the construction of a low-cost and flexible data processing pipeline for Snowflake.&lt;/p&gt;

&lt;p&gt;The key point is to understand how Snowflake manages each Iceberg table: whether it is a Snowflake-managed Iceberg table or an externally managed table reached through a mechanism such as a Catalog-Linked Database.&lt;/p&gt;

&lt;p&gt;In any case, the combination of Snowflake and Iceberg is an extremely powerful option for building a Lakehouse architecture.&lt;/p&gt;

&lt;p&gt;I hope this article will be helpful for those considering lightweight data processing and real-time ETL pipelines with AWS and Snowflake when working with Iceberg tables.&lt;/p&gt;




</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Securely Implementing Snowflake × AWS Lambda Integration with Key Pair Authentication + Secrets Manager</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:14:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/securely-implementing-snowflake-x-aws-lambda-integration-with-key-pair-authentication-secrets-2ba2</link>
      <guid>https://dev.to/aws-builders/securely-implementing-snowflake-x-aws-lambda-integration-with-key-pair-authentication-secrets-2ba2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/03f6ffff54b0d4" rel="noopener noreferrer"&gt;Snowflake × AWS Lambda連携をKey Pair認証 + Secrets Managerで安全に実装する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;When building data pipelines and business systems on AWS, there are cases where you need to access Snowflake directly as part of your application processing.&lt;/p&gt;

&lt;p&gt;For example, the following use cases are common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load data received via an API into Snowflake&lt;/li&gt;
&lt;li&gt;Execute SQL on the Snowflake side after ETL / ELT processing is completed&lt;/li&gt;
&lt;li&gt;Write results from external system integrations into Snowflake&lt;/li&gt;
&lt;li&gt;Trigger Snowflake stored procedures as part of task execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Lambda works extremely well with serverless, event-driven processing, and by combining it with Snowflake, you can build a highly flexible data integration platform.&lt;/p&gt;

&lt;p&gt;On the other hand, when connecting from Lambda to Snowflake, it is critically important to determine how authentication credentials should be managed securely.&lt;/p&gt;

&lt;p&gt;As of 2026, for Snowflake system integrations, &lt;strong&gt;Key Pair authentication or OAuth has become the standard&lt;/strong&gt;, and embedding passwords directly is no longer recommended.&lt;/p&gt;

&lt;p&gt;In this article, I would like to explain an implementation pattern that uses &lt;strong&gt;AWS Secrets Manager to securely manage the private key&lt;/strong&gt; and connects from Lambda to Snowflake using &lt;strong&gt;Key Pair authentication&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reference: Deprecation plan for single-factor password sign-in&lt;br&gt;
&lt;a href="https://docs.snowflake.com/ja/user-guide/security-mfa-rollout" rel="noopener noreferrer"&gt;https://docs.snowflake.com/ja/user-guide/security-mfa-rollout&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  First of all, why access Snowflake from Lambda?
&lt;/h2&gt;

&lt;p&gt;Before getting into the implementation, let’s consider &lt;strong&gt;why you would access Snowflake from Lambda in the first place&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three common use case patterns illustrate this.&lt;/p&gt;


&lt;h3&gt;
  
  
  Case 1: Event-driven data ingestion
&lt;/h3&gt;

&lt;p&gt;This is probably the most common pattern.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw053udwmi57eqla2vlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw053udwmi57eqla2vlb.png" width="800" height="421"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 / API Gateway / EventBridge
            ↓
         Lambda
            ↓
       Snowflake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;order data received through an API&lt;/li&gt;
&lt;li&gt;CSV / JSON files uploaded to S3&lt;/li&gt;
&lt;li&gt;SaaS integration events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can be received by Lambda and then directly loaded into Snowflake.&lt;/p&gt;

&lt;p&gt;There are architectures that use Snowpipe, but when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preprocessing is required&lt;/li&gt;
&lt;li&gt;format conversion is required&lt;/li&gt;
&lt;li&gt;multiple system integrations are involved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lambda is often easier to work with.&lt;/p&gt;




&lt;h3&gt;
  
  
  Case 2: Snowflake integration after AWS ETL completion
&lt;/h3&gt;

&lt;p&gt;This pattern executes SQL on Snowflake via Lambda after ETL processing has been completed on the AWS side.&lt;/p&gt;

&lt;p&gt;This example assumes Apache Iceberg is being used as the data platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffitxgh6oo0540464l5vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffitxgh6oo0540464l5vo.png" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Glue / Lambda
      ↓
S3 Iceberg
      ↓
Lambda
      ↓
Execute SQL on Snowflake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, once ETL has completed on AWS, Lambda can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;run REFRESH on Snowflake&lt;/li&gt;
&lt;li&gt;execute MERGE SQL&lt;/li&gt;
&lt;li&gt;run data quality checks&lt;/li&gt;
&lt;li&gt;update BI marts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a common pattern for downstream processing.&lt;/p&gt;




&lt;h3&gt;
  
  
  Case 3: Operations automation
&lt;/h3&gt;

&lt;p&gt;This pattern is also common in real-world implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lauuo9oby3r9guz1w5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lauuo9oby3r9guz1w5g.png" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starting / stopping warehouses&lt;/li&gt;
&lt;li&gt;executing tasks&lt;/li&gt;
&lt;li&gt;automatic SQL execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These operations can be automated from Lambda.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture used in this article is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf8cm97n1qdqrsjkag93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf8cm97n1qdqrsjkag93.png" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event Source
   ↓
AWS Lambda
   ↓
Secrets Manager
   ↓
Retrieve Private Key
   ↓
Snowflake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lambda retrieves the private key from Secrets Manager and connects to Snowflake using Key Pair authentication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security best practices
&lt;/h2&gt;




&lt;h3&gt;
  
  
  1. Do not store the private key in environment variables
&lt;/h3&gt;

&lt;p&gt;This is important not only for Snowflake, but for any sensitive credentials.&lt;/p&gt;

&lt;p&gt;A common anti-pattern is to put the private key in an environment variable and read it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PRIVATE_KEY &lt;span class="o"&gt;=&lt;/span&gt; os.environ[&lt;span class="s2"&gt;"PRIVATE_KEY"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should absolutely be avoided.&lt;/p&gt;

&lt;p&gt;In addition to security concerns, there are operational issues such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;possibility of accidental log output&lt;/li&gt;
&lt;li&gt;operational overhead during configuration changes&lt;/li&gt;
&lt;li&gt;difficulty in rotation&lt;/li&gt;
&lt;li&gt;weak access control&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Centralized management with Secrets Manager
&lt;/h3&gt;

&lt;p&gt;Sensitive information like this should ideally be centrally managed using Secrets Manager.&lt;/p&gt;

&lt;p&gt;By using Secrets Manager, you gain the following benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM-based access control&lt;/li&gt;
&lt;li&gt;KMS encryption&lt;/li&gt;
&lt;li&gt;audit logging with CloudTrail&lt;/li&gt;
&lt;li&gt;credential rotation support&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Principle of least privilege for IAM
&lt;/h3&gt;

&lt;p&gt;Only the minimum required permissions should be granted to Lambda.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:snowflake-keypair-*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
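&lt;p&gt;When the same policy is needed for several functions or regions, it can help to generate it rather than copy it. The sketch below is one way to do that; &lt;code&gt;secret_read_policy&lt;/code&gt; is a hypothetical helper, and the region, account ID, and secret name are the placeholder values from the policy above.&lt;/p&gt;

```python
import json


def secret_read_policy(region, account_id, secret_name):
    """Return an IAM policy document allowing GetSecretValue on a single secret."""
    # The trailing wildcard covers the random suffix Secrets Manager appends to the ARN.
    resource = f"arn:aws:secretsmanager:{region}:{account_id}:secret:{secret_name}-*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": resource,
            }
        ],
    }


policy = secret_read_policy("ap-northeast-1", "123456789012", "snowflake-keypair")
print(json.dumps(policy, indent=2))
```

&lt;p&gt;Keeping the action list to &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; and the resource to a single secret name is what makes this least privilege; widening either one defeats the purpose.&lt;/p&gt;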






&lt;h3&gt;
  
  
  4. Use a dedicated Snowflake service user
&lt;/h3&gt;

&lt;p&gt;Do not use personal users.&lt;/p&gt;

&lt;p&gt;Instead, create a dedicated service user.&lt;/p&gt;

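&lt;p&gt;Create the service user with one of the following types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;TYPE = SERVICE&lt;/code&gt;: preferred for Key Pair authentication, since this type cannot sign in with a password&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TYPE = LEGACY_SERVICE&lt;/code&gt;: a transitional type that still allows password sign-in, so use it only while migrating
&lt;/li&gt;
&lt;/ul&gt;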




&lt;h2&gt;
  
  
  Registering authentication information
&lt;/h2&gt;




&lt;h3&gt;
  
  
  1. Create the private key
&lt;/h3&gt;

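&lt;p&gt;First, generate an unencrypted private key (&lt;code&gt;rsa_key.p8&lt;/code&gt;) and derive the public key (&lt;code&gt;rsa_key.pub&lt;/code&gt;) from it.&lt;br&gt;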
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl genrsa 2048 | openssl pkcs8 &lt;span class="nt"&gt;-topk8&lt;/span&gt; &lt;span class="nt"&gt;-inform&lt;/span&gt; PEM &lt;span class="nt"&gt;-out&lt;/span&gt; rsa_key.p8 &lt;span class="nt"&gt;-nocrypt&lt;/span&gt;
openssl rsa &lt;span class="nt"&gt;-in&lt;/span&gt; rsa_key.p8 &lt;span class="nt"&gt;-pubout&lt;/span&gt; &lt;span class="nt"&gt;-out&lt;/span&gt; rsa_key.pub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Configure Snowflake
&lt;/h3&gt;

&lt;p&gt;Register the public key with the Snowflake user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;lambda_service_user&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;RSA_PUBLIC_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'MIIBIjANBgkq...'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



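&lt;p&gt;Use the contents of the &lt;code&gt;rsa_key.pub&lt;/code&gt; file as the public key, excluding the &lt;code&gt;-----BEGIN PUBLIC KEY-----&lt;/code&gt; and &lt;code&gt;-----END PUBLIC KEY-----&lt;/code&gt; delimiter lines.&lt;/p&gt;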
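&lt;p&gt;Copying the key body by hand is easy to get wrong, so the extraction can be scripted. The following is a minimal sketch: &lt;code&gt;public_key_body&lt;/code&gt; and &lt;code&gt;alter_user_statement&lt;/code&gt; are hypothetical helper names, the key body is joined into a single line on the assumption that Snowflake accepts it without line breaks, and the user name is assumed to be a trusted identifier.&lt;/p&gt;

```python
def public_key_body(pem_text):
    """Return the base64 body of a PEM public key, without the BEGIN/END lines."""
    lines = [line.strip() for line in pem_text.splitlines()]
    body = [line for line in lines if line and not line.startswith("-----")]
    return "".join(body)


def alter_user_statement(user, pem_text):
    """Format the ALTER USER statement that registers the public key."""
    return f"ALTER USER {user} SET RSA_PUBLIC_KEY='{public_key_body(pem_text)}';"
```

&lt;p&gt;For example, &lt;code&gt;alter_user_statement("lambda_service_user", open("rsa_key.pub").read())&lt;/code&gt; would produce a statement of the same form as the one shown above.&lt;/p&gt;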




&lt;h3&gt;
  
  
  3. Register in Secrets Manager
&lt;/h3&gt;

&lt;p&gt;Store the private key together with the connection settings as a JSON secret.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"account"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XXXXXXXXXX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda_service_user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"privateKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-----BEGIN PRIVATE KEY-----&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;-----END PRIVATE KEY-----"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passphrase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LAMBDA_ROLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warehouse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"COMPUTE_WH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ICEBERGDB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PUBLIC"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;account&lt;/code&gt;: Snowflake account identifier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user&lt;/code&gt;: Snowflake connection user&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;privateKey&lt;/code&gt;: private key created in step 1 (&lt;code&gt;rsa_key.p8&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;orgname&amp;gt;-&amp;lt;account_name&amp;gt;&lt;/code&gt; format is recommended for the account identifier.&lt;/p&gt;
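&lt;p&gt;A malformed secret usually surfaces only as a connection error at runtime, so it can be worth checking its shape before connecting. The sketch below assumes the JSON layout shown above; &lt;code&gt;parse_snowflake_secret&lt;/code&gt; and the constants are hypothetical names, and the header check assumes the unencrypted PKCS#8 key produced with &lt;code&gt;-nocrypt&lt;/code&gt; in step 1.&lt;/p&gt;

```python
import json

# Fields from the secret layout above; "passphrase" is left out because it may be empty.
REQUIRED_KEYS = ("account", "user", "privateKey", "role", "warehouse", "database", "schema")
PKCS8_HEADER = "-----BEGIN PRIVATE KEY-----"


def parse_snowflake_secret(secret_string):
    """Parse the secret JSON and fail early if a field is missing or malformed."""
    secret = json.loads(secret_string)
    missing = [key for key in REQUIRED_KEYS if not secret.get(key)]
    if missing:
        # Report field names only, never the secret values themselves.
        raise ValueError(f"secret is missing fields: {', '.join(missing)}")
    if not secret["privateKey"].strip().startswith(PKCS8_HEADER):
        raise ValueError("privateKey is not an unencrypted PKCS#8 PEM block")
    return secret
```

&lt;p&gt;In a Lambda handler, the input would be the &lt;code&gt;SecretString&lt;/code&gt; value returned by &lt;code&gt;GetSecretValue&lt;/code&gt;.&lt;/p&gt;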

&lt;p&gt;Now the preparation is complete.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lambda sample code
&lt;/h2&gt;

&lt;p&gt;For verification purposes, this example simply executes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT CURRENT_VERSION()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The key point is using &lt;code&gt;snowflake.connector&lt;/code&gt;, which makes connecting to Snowflake very straightforward.&lt;/p&gt;

&lt;p&gt;The module ships in the &lt;code&gt;snowflake-connector-python&lt;/code&gt; package, which is not included in the Lambda runtime. Add it using a Lambda Layer, or build the Lambda function as a container image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.hazmat.primitives&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;serialization&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.hazmat.backends&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default_backend&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;SECRET_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snowflake-keypair&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRET_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_query_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query failed (attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Exponential backoff with jitter; skip the wait after the final attempt&lt;/span&gt;
                &lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query execution exceeded max retries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

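    &lt;span class="c1"&gt;# Use the passphrase stored in the secret; an empty value means the key is unencrypted&lt;/span&gt;
    &lt;span class="n"&gt;passphrase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passphrase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;private_key_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serialization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_pem_private_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privateKey&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;passphrase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;passphrase&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;default_backend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;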

    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;private_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;private_key_obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;warehouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_query_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT CURRENT_VERSION()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lambda execution error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Relying on the service user's default settings can lead to unintended behavior when those defaults are later changed.&lt;/p&gt;

&lt;p&gt;Therefore, explicitly specifying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;role&lt;/li&gt;
&lt;li&gt;warehouse&lt;/li&gt;
&lt;li&gt;database&lt;/li&gt;
&lt;li&gt;schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;is safer for production environments.&lt;/p&gt;
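
&lt;p&gt;One way to enforce this is to fail fast when the secret lacks any of these values, rather than silently falling back to the user's defaults. The following is a minimal sketch; the helper name &lt;code&gt;build_connect_kwargs&lt;/code&gt; and the &lt;code&gt;REQUIRED_KEYS&lt;/code&gt; tuple are illustrative assumptions, and the key names follow the secret layout shown earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical helper: the key names match the secret JSON used in this article.
REQUIRED_KEYS = ("account", "user", "role", "warehouse", "database", "schema")


def build_connect_kwargs(secret):
    """Return connection keyword arguments, raising if any context value is missing."""
    # Treat absent keys and empty strings the same way: both would fall back to defaults.
    missing = [key for key in REQUIRED_KEYS if not secret.get(key)]
    if missing:
        raise ValueError("Secret is missing required keys: " + ", ".join(missing))
    # Copy only the connection settings; the private key is handled separately.
    return {key: secret[key] for key in REQUIRED_KEYS}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned dictionary could then be unpacked into the call from the sample code, for example &lt;code&gt;snowflake.connector.connect(private_key=private_key_obj, **kwargs)&lt;/code&gt;.&lt;/p&gt;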




&lt;h2&gt;
  
  
  Execution result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"statusCode"&lt;/span&gt;: 200,
  &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"('10.11.2',)"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Executed on the Snowflake side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CURRENT_VERSION&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results match, so the connection was successful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Additional considerations
&lt;/h2&gt;




&lt;h3&gt;
  
  
  Strengthening network security
&lt;/h3&gt;

&lt;p&gt;Using &lt;strong&gt;AWS PrivateLink&lt;/strong&gt; keeps traffic between Lambda and Snowflake on a private network, avoiding the risks of sending it over the public internet.&lt;/p&gt;

&lt;p&gt;This can be implemented with the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enable private connectivity in Snowflake&lt;/li&gt;
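&lt;li&gt;create a VPC endpoint for the service name that Snowflake returns from &lt;code&gt;SYSTEM$GET_PRIVATELINK_CONFIG&lt;/code&gt;
(of the form &lt;code&gt;com.amazonaws.vpce.&amp;lt;region&amp;gt;.vpce-svc-&amp;lt;id&amp;gt;&lt;/code&gt;)&lt;/li&gt;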
&lt;li&gt;pair it with Snowflake’s service principal&lt;/li&gt;
&lt;li&gt;place Lambda in the same VPC&lt;/li&gt;
&lt;li&gt;restrict access using security groups&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Error handling
&lt;/h3&gt;

&lt;p&gt;As implemented in the sample code, it is a good idea to add retry processing to tolerate temporary network failures and transient Snowflake issues.&lt;/p&gt;

&lt;p&gt;Using &lt;strong&gt;exponential backoff&lt;/strong&gt; improves resilience compared to fixed delays.&lt;/p&gt;
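
&lt;p&gt;As a minimal sketch of the resulting delay schedule, the helper below reproduces the calculation used in the sample code: a capped exponential delay plus up to one second of random jitter. The function name &lt;code&gt;backoff_delay&lt;/code&gt; and its parameters are illustrative assumptions, not part of any library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random


def backoff_delay(attempt, base=2, cap=30, jitter=True):
    """Return the sleep time in seconds before retry number `attempt` (1-based)."""
    # Exponential growth, capped so that a long outage does not produce huge sleeps.
    delay = min(base ** attempt, cap)
    # Random jitter spreads out retries from functions that failed at the same moment.
    if jitter:
        delay += random.random()
    return delay


if __name__ == "__main__":
    # Print the deterministic part of the schedule for the first six retries.
    for attempt in range(1, 7):
        print(attempt, backoff_delay(attempt, jitter=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without jitter, the delays double from 2 seconds until they reach the 30-second cap. Keep the total wait well below the Lambda function timeout.&lt;/p&gt;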




&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;The following monitoring perspectives are recommended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Logs: capture connection errors and query failures&lt;/li&gt;
&lt;li&gt;CloudTrail: verify access logs to Secrets Manager&lt;/li&gt;
&lt;li&gt;Snowflake audit logs: monitor user connection activity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, I introduced the full flow from key management to connection setup for accessing Snowflake from AWS Lambda using Key Pair authentication.&lt;/p&gt;

&lt;p&gt;There are more use cases for Lambda-to-Snowflake access than many people expect, especially for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event-driven processing&lt;/li&gt;
&lt;li&gt;downstream ETL workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And of course, secure credential management is critically important—not only for Snowflake, but for any system integration.&lt;/p&gt;

&lt;p&gt;At present, &lt;strong&gt;Key Pair authentication + Secrets Manager&lt;/strong&gt; is likely the standard implementation pattern for connecting to Snowflake.&lt;/p&gt;

&lt;p&gt;I hope this article helps as a reference when building application integrations between AWS and Snowflake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Tue, 31 Mar 2026 20:44:32 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-snowflake-lakehouse-2-practical-apache-iceberg-integration-patterns-812</link>
      <guid>https://dev.to/aws-builders/aws-snowflake-lakehouse-2-practical-apache-iceberg-integration-patterns-812</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/97f30cc1ac8377" rel="noopener noreferrer"&gt;AWSのレイクハウス（Apache Iceberg）をSnowflakeと連携する2つのパターンを整理する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In recent years, it has become increasingly common to build lakehouse architectures centered around Apache Iceberg.&lt;/p&gt;

&lt;p&gt;Before the rise of lakehouse architecture, it was common to design systems where data was consolidated into a specific platform, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an architecture centered around Amazon Redshift on AWS&lt;/li&gt;
&lt;li&gt;an architecture centered around internal tables in Snowflake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, with the advent of Apache Iceberg, this assumption is rapidly changing.&lt;/p&gt;

&lt;p&gt;Now that data on Amazon S3 can be directly accessed from multiple engines, what matters is no longer simply product selection.&lt;br&gt;
Instead, the architecture design itself has become the central focus: &lt;strong&gt;where the data resides, who owns the write responsibility, and who holds governance authority&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, focusing on the coexistence of AWS and Snowflake, I will cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;two patterns based on S3 × Iceberg&lt;/li&gt;
&lt;li&gt;connectivity with Power BI Service&lt;/li&gt;
&lt;li&gt;future prospects including AI utilization&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why AWS × Snowflake Coexistence Is Necessary
&lt;/h2&gt;

&lt;p&gt;The greatest value of Apache Iceberg lies in its ability to separate data from the query engine.&lt;/p&gt;

&lt;p&gt;For example, while keeping the physical data stored in S3, the same data can be accessed from multiple tools such as Athena, AWS Glue / Spark, Redshift, and Snowflake.&lt;/p&gt;

&lt;p&gt;In other words, while consolidating the physical data into a single location, it has become possible to choose the most suitable analytics platform depending on the use case.&lt;/p&gt;

&lt;p&gt;As a result, the architectural discussion has shifted from &lt;strong&gt;“which product should we use?”&lt;/strong&gt; to &lt;strong&gt;“where should data sovereignty reside, and where should analytical ownership be placed?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The benefits of using Snowflake here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user-friendly UI/UX&lt;/li&gt;
&lt;li&gt;powerful SQL analytics capabilities&lt;/li&gt;
&lt;li&gt;integration with Cortex AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In particular, I believe an architecture in which AWS serves as the &lt;strong&gt;data sovereignty layer&lt;/strong&gt; and Snowflake as the &lt;strong&gt;analytics layer&lt;/strong&gt; is a very good fit.&lt;/p&gt;


&lt;h2&gt;
  
  
  Two Patterns for Snowflake × S3 Integration
&lt;/h2&gt;

&lt;p&gt;Here, I will outline two patterns commonly used when integrating Snowflake with S3.&lt;/p&gt;


&lt;h3&gt;
  
  
  Pattern 1: Glue Catalog Integration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg1nzevfxocqmzvjcgob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg1nzevfxocqmzvjcgob.png" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this pattern, Iceberg tables stored on S3 are referenced from Snowflake through AWS Glue Data Catalog.&lt;/p&gt;

&lt;p&gt;The advantages of this architecture are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relatively simple configuration&lt;/li&gt;
&lt;li&gt;S3 becomes the Single Source of Truth&lt;/li&gt;
&lt;li&gt;because Snowflake cannot write, AWS retains sovereignty over data management
(this can also be considered a disadvantage)&lt;/li&gt;
&lt;li&gt;since user access is consolidated into Snowflake, access control can be centralized there&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, Snowflake focuses solely on the role of &lt;strong&gt;data analytics (query engine)&lt;/strong&gt;, while AWS retains authority over data management.&lt;/p&gt;


&lt;h4&gt;
  
  
  Setup Procedure
&lt;/h4&gt;
&lt;h5&gt;
  
  
  Step 1: Create an External Volume (S3 Access Configuration)
&lt;/h5&gt;

&lt;p&gt;Run the following on the Snowflake side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sample_iceberg_volume&lt;/span&gt;
  &lt;span class="n"&gt;STORAGE_LOCATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'my-s3-location'&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_PROVIDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'S3'&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'s3://&amp;lt;bucket&amp;gt;/&amp;lt;prefix&amp;gt;/'&lt;/span&gt; &lt;span class="c1"&gt;-- catalog S3 path&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_AWS_ROLE_ARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::&amp;lt;account_id&amp;gt;:role/&amp;lt;role_name&amp;gt;'&lt;/span&gt; &lt;span class="c1"&gt;-- role ARN to use&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_AWS_EXTERNAL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'my_external_id'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 2: Create Glue Catalog Integration
&lt;/h5&gt;

&lt;p&gt;Run the following on the Snowflake side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;glue_catalog_int&lt;/span&gt; &lt;span class="c1"&gt;-- arbitrary name&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_SOURCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GLUE&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_NAMESPACE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;namespace&amp;gt;'&lt;/span&gt; &lt;span class="c1"&gt;-- Glue catalog namespace&lt;/span&gt;
  &lt;span class="n"&gt;TABLE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt;
  &lt;span class="n"&gt;GLUE_AWS_ROLE_ARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::&amp;lt;account_id&amp;gt;:role/&amp;lt;role_name&amp;gt;'&lt;/span&gt; &lt;span class="c1"&gt;-- role ARN to use&lt;/span&gt;
  &lt;span class="n"&gt;GLUE_CATALOG_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;glue_catalog_id&amp;gt;'&lt;/span&gt; &lt;span class="c1"&gt;-- Glue catalog ID to use&lt;/span&gt;
  &lt;span class="n"&gt;GLUE_REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ap-northeast-1'&lt;/span&gt;
  &lt;span class="n"&gt;ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 3: Retrieve Required Information for AWS Trust Policy
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;sample_iceberg_volume&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Note down &lt;code&gt;STORAGE_AWS_IAM_USER_ARN&lt;/code&gt; and &lt;code&gt;STORAGE_AWS_EXTERNAL_ID&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;glue_catalog_int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Note down &lt;code&gt;GLUE_AWS_IAM_USER_ARN&lt;/code&gt; and &lt;code&gt;GLUE_AWS_EXTERNAL_ID&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Please configure the Trust Policy for the role used on the AWS side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;STORAGE_AWS_IAM_USER_ARN&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"sts:ExternalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;STORAGE_AWS_EXTERNAL_ID&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;GLUE_AWS_IAM_USER_ARN&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"sts:ExternalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GLUE_AWS_EXTERNAL_ID&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 4: Create Database
&lt;/h5&gt;

&lt;p&gt;Run the following on the Snowflake side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;icebergdb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 5: Create Table
&lt;/h5&gt;

&lt;p&gt;Run the following on the Snowflake side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;icebergdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yellow_tripdata&lt;/span&gt; &lt;span class="c1"&gt;-- arbitrary name&lt;/span&gt;
  &lt;span class="n"&gt;EXTERNAL_VOLUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sample_iceberg_volume'&lt;/span&gt;
  &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'glue_catalog_int'&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_TABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'yellow_tripdata'&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_REFRESH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 6: Verify Data
&lt;/h5&gt;

&lt;p&gt;Run the following on the Snowflake side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;icebergdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yellow_tripdata&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3ogm5kait864thxemid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3ogm5kait864thxemid.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We were able to read the data from S3!&lt;/p&gt;
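
&lt;p&gt;If rows written on the AWS side do not appear right away, the table metadata can also be checked and refreshed by hand. The following is a minimal sketch, assuming the table created in Step 5; how quickly &lt;code&gt;AUTO_REFRESH&lt;/code&gt; picks up changes depends on your environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Check the automatic refresh state of the Iceberg table
SELECT SYSTEM$AUTO_REFRESH_STATUS('icebergdb.public.yellow_tripdata');

-- Manually sync the table with the latest metadata registered in the Glue Data Catalog
ALTER ICEBERG TABLE icebergdb.public.yellow_tripdata REFRESH;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;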




&lt;h3&gt;
  
  
  Pattern 2: Catalog-Linked Database (Iceberg)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ui2ma9din031mzjqyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ui2ma9din031mzjqyq.png" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This pattern connects Snowflake to the Glue Data Catalog through the Iceberg REST Catalog interface and manages the Iceberg tables directly from the Snowflake side.&lt;/p&gt;

&lt;p&gt;The advantages of this architecture are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read and write operations from Snowflake are possible&lt;/li&gt;
&lt;li&gt;the physical data remains stored in S3&lt;/li&gt;
&lt;li&gt;Snowflake users can perform SQL-based updates and analytics&lt;/li&gt;
&lt;li&gt;easier integration with Power BI and Cortex AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the key benefit is that &lt;strong&gt;analysis and updates can be performed from Snowflake while keeping the physical data in S3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, governance must be considered from both the AWS side and the Snowflake side.&lt;/p&gt;




&lt;h4&gt;
  
  
  Setup Procedure
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Step 1: Create an External Volume (S3 Access Configuration)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sample_iceberg_volume&lt;/span&gt;
  &lt;span class="n"&gt;STORAGE_LOCATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'my-s3-location'&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_PROVIDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'S3'&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;catalog S3 path&amp;gt;'&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_AWS_ROLE_ARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;role ARN to use&amp;gt;'&lt;/span&gt;
      &lt;span class="n"&gt;STORAGE_AWS_EXTERNAL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'my_external_id'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 2: Create Glue Iceberg REST Catalog Integration
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;glue_rest_catalog_int&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_SOURCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ICEBERG_REST&lt;/span&gt;
  &lt;span class="n"&gt;TABLE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt;
  &lt;span class="n"&gt;CATALOG_NAMESPACE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;Glue catalog namespace&amp;gt;'&lt;/span&gt;
  &lt;span class="n"&gt;REST_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CATALOG_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;Glue catalog URI&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;CATALOG_API_TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AWS_GLUE&lt;/span&gt;
    &lt;span class="k"&gt;CATALOG_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;AWS account ID&amp;gt;'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;REST_AUTHENTICATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SIGV4&lt;/span&gt;
    &lt;span class="n"&gt;SIGV4_IAM_ROLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;role ARN to use&amp;gt;'&lt;/span&gt;
    &lt;span class="n"&gt;SIGV4_SIGNING_REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ap-northeast-1'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 3: Retrieve Required Information for AWS Trust Policy
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;sample_iceberg_volume&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Note down &lt;code&gt;STORAGE_AWS_IAM_USER_ARN&lt;/code&gt; and &lt;code&gt;STORAGE_AWS_EXTERNAL_ID&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;glue_rest_catalog_int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



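&lt;p&gt;→ Note down &lt;code&gt;GLUE_AWS_IAM_USER_ARN&lt;/code&gt; and &lt;code&gt;GLUE_AWS_EXTERNAL_ID&lt;/code&gt;. Depending on the integration type, the &lt;code&gt;DESC&lt;/code&gt; output may label these properties &lt;code&gt;API_AWS_IAM_USER_ARN&lt;/code&gt; and &lt;code&gt;API_AWS_EXTERNAL_ID&lt;/code&gt; instead; use whichever pair appears.&lt;/p&gt;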

&lt;blockquote&gt;
&lt;p&gt;The two external IDs above have different values, so add a statement for each of them to the trust policy of the AWS IAM role.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Configure the trust policy of the IAM role used on the AWS side as follows, replacing each placeholder with the value noted above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;STORAGE_AWS_IAM_USER_ARN&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"sts:ExternalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;STORAGE_AWS_EXTERNAL_ID&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;GLUE_AWS_IAM_USER_ARN&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"sts:ExternalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;GLUE_AWS_EXTERNAL_ID&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
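
&lt;p&gt;Once the trust policy is saved, it can help to confirm from the Snowflake side that both objects can actually assume the role before creating the database. The following is a minimal sketch using the object names from Steps 1 and 2; the shape of the returned result may vary by Snowflake version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Check that Snowflake can reach the S3 location through the external volume
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('sample_iceberg_volume');

-- Check that Snowflake can authenticate against the Glue Iceberg REST endpoint
SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('glue_rest_catalog_int');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If either check fails, the usual suspects are a mismatched external ID or an IAM user ARN missing from the trust policy.&lt;/p&gt;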






&lt;h5&gt;
  
  
  Step 4: Create Catalog Linked Database (Read/Write Enabled)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;my_iceberg_linked_db&lt;/span&gt;
  &lt;span class="n"&gt;LINKED_CATALOG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'glue_rest_catalog_int'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ALLOWED_WRITE_OPERATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;EXTERNAL_VOLUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sample_iceberg_volume'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
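
&lt;p&gt;Before querying, the state of the link can be inspected. The following is a minimal sketch, assuming the database name used above; the namespaces and tables it lists depend on what is registered in your Glue Data Catalog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Show the sync state of the catalog-linked database, including any sync failures
SELECT SYSTEM$CATALOG_LINK_STATUS('my_iceberg_linked_db');

-- List the Glue namespaces that were discovered as schemas
SHOW SCHEMAS IN DATABASE my_iceberg_linked_db;

-- List the Iceberg tables that were discovered
SHOW ICEBERG TABLES IN DATABASE my_iceberg_linked_db;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;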






&lt;h5&gt;
  
  
  Step 5: Tables Are Automatically Discovered (Synced Every 30 Seconds)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_iceberg_linked_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"icebergdb"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"yellow_tripdata"&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcecgnhjoe4tyo8p3ovtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcecgnhjoe4tyo8p3ovtx.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We were able to read the data from S3!&lt;/p&gt;




&lt;h5&gt;
  
  
  Step 6: Write Test
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;my_iceberg_linked_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"icebergdb"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"yellow_tripdata"&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendorid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tpep_dropoff_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ratecodeid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store_and_fwd_flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pulocationid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dolocationid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;payment_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mta_tax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tolls_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;improvement_surcharge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;congestion_surcharge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;airport_fee&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-01 13:00:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-01 13:30:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'N'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h5&gt;
  
  
  Step 7: Verify Write
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_iceberg_linked_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"icebergdb"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"yellow_tripdata"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tpep_pickup_datetime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-01 13:00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqepukvrmsp1tudi0b3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqepukvrmsp1tudi0b3l.png" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We were able to write data into S3!&lt;/p&gt;

&lt;p&gt;The written row is also visible from the Athena side:&lt;/p&gt;

&lt;p&gt;Before write&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35d6mw67en78uueiju1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35d6mw67en78uueiju1g.png" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After write&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2mx1ju2a068rq86n85k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2mx1ju2a068rq86n85k.png" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;
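
&lt;p&gt;After the check, the test row can be removed again from Snowflake. The following is a sketch only, assuming that &lt;code&gt;ALLOWED_WRITE_OPERATIONS = ALL&lt;/code&gt; covers &lt;code&gt;DELETE&lt;/code&gt; in your environment and that no real trip matches these values; run the Step 7 query first to confirm that only the test row is returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Remove the row inserted in Step 6 (predicate matches the test values only)
DELETE FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata"
WHERE vendorid = 1
  AND tpep_pickup_datetime = '2025-01-01 13:00:00'
  AND tpep_dropoff_datetime = '2025-01-01 13:30:00'
  AND pulocationid = 100
  AND dolocationid = 200;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;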




&lt;h3&gt;
  
  
  Which Pattern Should You Choose?
&lt;/h3&gt;

&lt;p&gt;In practice, the choice can be summarized as follows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Recommended Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS-led ETL workloads, with Snowflake primarily used for read/query access&lt;/td&gt;
&lt;td&gt;Pattern 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI / AI / SQL updates driven primarily from Snowflake&lt;/td&gt;
&lt;td&gt;Pattern 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance needs to be centralized on AWS&lt;/td&gt;
&lt;td&gt;Pattern 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations are mainly led by Snowflake users&lt;/td&gt;
&lt;td&gt;Pattern 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Integration with Power BI Service
&lt;/h2&gt;

&lt;p&gt;Traditionally, when referencing an S3-based lakehouse from Power BI Service, many architectures used Redshift as an intermediary.&lt;/p&gt;

&lt;p&gt;In such cases, connecting securely from Power BI Service often required provisioning an EC2 instance to host an on-premises data gateway.&lt;/p&gt;

&lt;p&gt;This introduces additional operational costs such as EC2 management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp3pxn40gbrwlrmrfyi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp3pxn40gbrwlrmrfyi5.png" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what about Snowflake?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1i8js5n31sxhcbxy6zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1i8js5n31sxhcbxy6zu.png" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Power BI provides a native connector for Snowflake, allowing direct authentication and connection from Power BI Service.&lt;/p&gt;

&lt;p&gt;This eliminates the relay servers and on-premises gateways that Redshift-based architectures often require, along with the operational cost of running them.&lt;/p&gt;

&lt;p&gt;In addition, since the semantically organized Gold layer on the Snowflake side can be directly connected to Power BI, this also improves usability for BI users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Thinking in Terms of Medallion Architecture
&lt;/h2&gt;

&lt;p&gt;Personally, I consider the following architecture to be highly practical for real-world use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz1dsmfqfnma38hlqbj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz1dsmfqfnma38hlqbj8.png" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze: S3 + Iceberg&lt;/li&gt;
&lt;li&gt;Silver: S3 + Iceberg (with Snowflake integration as needed)&lt;/li&gt;
&lt;li&gt;Gold: S3 + Iceberg → Snowflake integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aligning the Gold layer in particular with Snowflake makes it easier to provide a semantic layer that BI users and business departments can readily consume.&lt;/p&gt;

&lt;p&gt;Depending on the use case, the Silver layer can also be utilized for more detailed analysis.&lt;/p&gt;

&lt;p&gt;In other words, this enables a separation of responsibilities between &lt;strong&gt;data management and data analytics&lt;/strong&gt;.&lt;/p&gt;
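
&lt;p&gt;One possible way to expose such a consumption layer is a view in a regular Snowflake database that reads from the catalog-linked table. The following is a minimal sketch; &lt;code&gt;analytics_db&lt;/code&gt;, the &lt;code&gt;gold&lt;/code&gt; schema, and &lt;code&gt;daily_trip_summary&lt;/code&gt; are hypothetical names chosen for illustration, and the aggregation is only an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical database and schema for the consumption layer
CREATE DATABASE IF NOT EXISTS analytics_db;
CREATE SCHEMA IF NOT EXISTS analytics_db.gold;

-- Daily summary over the Iceberg table; the physical data stays in S3
CREATE OR REPLACE VIEW analytics_db.gold.daily_trip_summary AS
SELECT
  CAST(tpep_pickup_datetime AS DATE) AS pickup_date,
  COUNT(*) AS trip_count,
  SUM(total_amount) AS total_revenue,
  AVG(trip_distance) AS avg_trip_distance
FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata"
GROUP BY CAST(tpep_pickup_datetime AS DATE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;BI users then connect only to the view, while the AWS side remains responsible for the underlying Iceberg data.&lt;/p&gt;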




&lt;h2&gt;
  
  
  Thinking in Terms of Separation of Ownership
&lt;/h2&gt;

&lt;p&gt;This is the most important point in this article.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Ownership&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Physical data&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;AWS / Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI interaction&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What matters is not the product itself, but &lt;strong&gt;how ownership is separated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Defining this separation clearly makes it easier to delineate responsibilities across data engineering, BI, and AI utilization, which also helps from an organizational management perspective.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Change Brought by Snowflake Cortex AI
&lt;/h2&gt;

&lt;p&gt;AI utilization will become even more important going forward.&lt;/p&gt;

&lt;p&gt;As with other data platforms, AI adoption is progressing rapidly in Snowflake as well.&lt;/p&gt;

&lt;p&gt;By leveraging Snowflake Cortex AI, it becomes possible to query Iceberg tables on S3 using natural language.&lt;/p&gt;

&lt;p&gt;In other words, the data platform is evolving from &lt;strong&gt;“a platform for writing SQL”&lt;/strong&gt; into &lt;strong&gt;“a platform for conversation.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI utilization is expected to continue evolving in many aspects.&lt;/p&gt;

&lt;p&gt;One key point will be preparing data that is easier for AI to use, that is, &lt;strong&gt;AI-ready data&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, I walked through two patterns for integrating an AWS lakehouse (Apache Iceberg) with Snowflake.&lt;/p&gt;

&lt;p&gt;In recent data utilization scenarios, it is increasingly common to integrate AWS with platforms such as Databricks and Snowflake, as introduced here, rather than rely on AWS alone.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, what matters is not the product itself, but &lt;strong&gt;how ownership is separated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Depending on which service takes responsibility for data, governance, and analytics, both the architecture and configuration will change.&lt;/p&gt;

&lt;p&gt;In any case, what truly matters is &lt;strong&gt;how to design a data platform that users will continue to use over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Going forward, it will become even more important to design architectures not only from the perspective of where data is stored, but also from the viewpoint of &lt;strong&gt;who owns responsibility for each layer and how that responsibility connects to user value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I hope this article will be helpful for those considering a combination of AWS and Snowflake.&lt;/p&gt;




</description>
      <category>aws</category>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>apacheiceberg</category>
    </item>
    <item>
      <title>Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Wed, 25 Mar 2026 14:41:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/is-aws-glue-data-catalog-sufficient-as-a-data-catalog-organizing-its-design-limitations-and-kih</link>
      <guid>https://dev.to/aws-builders/is-aws-glue-data-catalog-sufficient-as-a-data-catalog-organizing-its-design-limitations-and-kih</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/22f1be5b8b6f98" rel="noopener noreferrer"&gt;AWS Glue Data Catalogはデータカタログとして十分か？設計・限界・補完戦略を整理する&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;As data utilization within organizations has advanced in recent years, the importance of data catalogs has continued to grow.&lt;/p&gt;

&lt;p&gt;When building a data platform on AWS, the first thing that typically comes to mind as a data catalog is AWS Glue Data Catalog.&lt;br&gt;
Especially in data lake architectures centered around Amazon S3, AWS Glue Data Catalog is almost a prerequisite service. By combining it with services like Athena, AWS Glue, and Redshift Spectrum, it is possible to quickly stand up a minimal data platform.&lt;/p&gt;

&lt;p&gt;However, as data usage evolves, you may encounter challenges such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not knowing which data to use&lt;/li&gt;
&lt;li&gt;Multiple datasets that look similar&lt;/li&gt;
&lt;li&gt;Being unable to determine whether data is trustworthy&lt;/li&gt;
&lt;li&gt;Not being able to trace how data was generated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, these may appear to be separate issues, but in reality, they all stem from a single root cause: an insufficient data catalog.&lt;/p&gt;

&lt;p&gt;In this article, starting from AWS Glue Data Catalog, we will explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The role of a data catalog&lt;/li&gt;
&lt;li&gt;The strengths and limitations of AWS Glue Data Catalog&lt;/li&gt;
&lt;li&gt;How to complement it within AWS&lt;/li&gt;
&lt;li&gt;How to approach building a data catalog on AWS&lt;/li&gt;
&lt;li&gt;A comparison with other data catalogs (OpenMetadata)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The conclusion is that AWS Glue Data Catalog is not a “data catalog” in the full sense.&lt;br&gt;
Rather, it is a &lt;strong&gt;technical catalog used by query engines&lt;/strong&gt;, and it is not sufficient as a catalog for humans to discover, understand, and trust data.&lt;/p&gt;

&lt;p&gt;For this reason, a data catalog on AWS should not be designed as a single service, but as an &lt;strong&gt;architecture composed of multiple services&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is a Data Catalog?
&lt;/h2&gt;

&lt;p&gt;A data catalog is not simply about metadata management—it is a foundation that makes data usable.&lt;/p&gt;

&lt;p&gt;Metadata can be understood as “data about data.”&lt;br&gt;
Specifically, it includes information such as who created the data, what it means, how it is used, where it came from, how it flows, where it is stored, and what its quality is.&lt;/p&gt;

&lt;p&gt;A data catalog centralizes this metadata and supports search and utilization.&lt;/p&gt;

&lt;p&gt;Traditionally, data catalogs focused on table definitions and schema management. However, modern data catalogs are expected to include the following elements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metadata Management&lt;/td&gt;
&lt;td&gt;Technical metadata (schemas, types, partitions), business metadata (meaning, usage, owner), operational metadata (job logs, processing metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Discovery&lt;/td&gt;
&lt;td&gt;Search, filtering, classification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;Tracking data generation and transformation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Quality&lt;/td&gt;
&lt;td&gt;Reliability indicators, anomaly detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Governance&lt;/td&gt;
&lt;td&gt;Access control and permission management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These elements span multiple domains defined in DMBOK, and it is rare for a single tool to cover all of them.&lt;/p&gt;

&lt;p&gt;As will be explained later, AWS also requires combining multiple services to achieve this.&lt;br&gt;
In the AWS context, it is more appropriate to think of a data catalog not as a single “service,” but as an “architecture.”&lt;/p&gt;


&lt;h2&gt;
  
  
  Role and Strengths of Glue Data Catalog
&lt;/h2&gt;

&lt;p&gt;Glue Data Catalog is the core metadata management component in AWS and serves as a foundational element for operating a data platform.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Features of Glue Data Catalog
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metadata Storage&lt;/td&gt;
&lt;td&gt;Persistent storage of structured metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Management&lt;/td&gt;
&lt;td&gt;Definition and updates of table schemas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partition Management&lt;/td&gt;
&lt;td&gt;Management of partition information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statistics&lt;/td&gt;
&lt;td&gt;Column statistics such as min/max values and null counts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tagging&lt;/td&gt;
&lt;td&gt;Classification using key-value pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API/SDK&lt;/td&gt;
&lt;td&gt;Programmatic access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;Basic lineage is available; advanced visualization requires additional tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational Metadata&lt;/td&gt;
&lt;td&gt;CloudWatch logs, Spark UI, job execution insights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced Discovery&lt;/td&gt;
&lt;td&gt;Console browsing, attribute filtering, unified search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
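&lt;p&gt;As a rough illustration of the API/SDK row, the sketch below reads column and partition metadata for one table. It assumes a client object that exposes a &lt;code&gt;get_table&lt;/code&gt; call shaped like the one in boto3's Glue client; the database name, table name, and the &lt;code&gt;FakeGlueClient&lt;/code&gt; stand-in are hypothetical and exist only so the example runs without AWS access.&lt;/p&gt;

```python
from typing import Any, Dict, List


def summarize_table(glue_client: Any, database: str, table: str) -> Dict[str, List[str]]:
    """Return column and partition-key names for one catalog table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    table_def = response["Table"]
    columns = table_def.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table_def.get("PartitionKeys", [])
    return {
        "columns": [f"{col['Name']}:{col['Type']}" for col in columns],
        "partition_keys": [key["Name"] for key in partition_keys],
    }


class FakeGlueClient:
    """Hypothetical stand-in for a Glue client so the sketch runs offline."""

    def get_table(self, DatabaseName: str, Name: str) -> Dict[str, Any]:
        return {
            "Table": {
                "Name": Name,
                "DatabaseName": DatabaseName,
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "order_id", "Type": "bigint"},
                        {"Name": "amount", "Type": "double"},
                    ]
                },
                "PartitionKeys": [{"Name": "order_date", "Type": "date"}],
            }
        }


summary = summarize_table(FakeGlueClient(), "sales_db", "orders")
print(summary)
```

&lt;p&gt;In a real environment, the fake client would presumably be replaced with &lt;code&gt;boto3.client("glue")&lt;/code&gt;. Note that everything returned here is technical metadata; nothing in the response says who owns the table or whether it can be trusted.&lt;/p&gt;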
&lt;h3&gt;
  
  
  Strengths of Glue Data Catalog
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Seamless Integration with AWS Services
&lt;/h4&gt;

&lt;p&gt;While this may seem obvious, it is an important point: Glue Data Catalog integrates natively with AWS services such as Athena, AWS Glue ETL, and Redshift Spectrum.&lt;/p&gt;

&lt;p&gt;Because these services reference the same catalog, data is accessed consistently across the platform.&lt;/p&gt;
&lt;h4&gt;
  
  
  Strong Affinity with Data Lakes
&lt;/h4&gt;

&lt;p&gt;In modern lakehouse architectures, this is a significant advantage.&lt;/p&gt;

&lt;p&gt;Glue Data Catalog allows data stored in S3 to be cataloged directly.&lt;br&gt;
This makes it possible to build a lakehouse using formats like Iceberg and manage it through Glue Data Catalog.&lt;/p&gt;

&lt;p&gt;(Note: Iceberg table metadata itself resides in S3, while Glue Data Catalog functions as the catalog endpoint.)&lt;/p&gt;


&lt;h2&gt;
  
  
  Is Glue Data Catalog Sufficient as a Data Catalog?
&lt;/h2&gt;

&lt;p&gt;The conclusion is that Glue Data Catalog is a &lt;strong&gt;data platform–oriented catalog&lt;/strong&gt;, not a &lt;strong&gt;user-oriented catalog&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is highly effective as a &lt;strong&gt;technical metadata foundation&lt;/strong&gt; referenced by analytics platforms.&lt;br&gt;
However, its capabilities are limited when it comes to serving as a business catalog that enables users to &lt;strong&gt;discover, understand, and trust data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In other words, Glue is extremely strong as the &lt;strong&gt;core (technical foundation)&lt;/strong&gt; of a data catalog, but requires complementary services when used as a &lt;strong&gt;user-facing catalog that supports data utilization&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Support by Glue Alone&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Technical Metadata&lt;/td&gt;
&lt;td&gt;○&lt;/td&gt;
&lt;td&gt;Schemas, types, partitions, column statistics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business Metadata&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;Descriptions, tags, classifications (advanced capabilities require Amazon DataZone)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational Metadata&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;Job execution history is stored; detailed metrics are managed in CloudWatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Discovery&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;Console search and filtering (advanced capabilities require Amazon Q or Amazon DataZone)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;Basic lineage (input/output tables in Glue ETL jobs) is captured; no end-to-end lineage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Quality&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;Column statistics and auto statistics (advanced capabilities require Glue Data Quality or AWS Glue DataBrew)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow Management&lt;/td&gt;
&lt;td&gt;✕&lt;/td&gt;
&lt;td&gt;Not handled by Glue Data Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Governance&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;td&gt;IAM integration, resource policies, encryption (advanced capabilities require Lake Formation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Profiling&lt;/td&gt;
&lt;td&gt;✕&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
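&lt;p&gt;The coverage table above can also be treated as data. The following minimal sketch encodes each row with a support level and derives the categories that Glue alone does not fully cover. The level names &lt;code&gt;full&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, and &lt;code&gt;none&lt;/code&gt; are my own labels for the ○, △, and ✕ marks.&lt;/p&gt;

```python
# Support levels copied from the coverage table: "full", "partial", and "none"
# stand for the circle, triangle, and cross marks respectively.
GLUE_COVERAGE = {
    "Technical Metadata": "full",
    "Business Metadata": "partial",
    "Operational Metadata": "partial",
    "Data Discovery": "partial",
    "Data Lineage": "partial",
    "Data Quality": "partial",
    "Workflow Management": "none",
    "Data Governance": "partial",
    "Data Profiling": "none",
}


def needs_complement(coverage):
    """Return the categories that Glue alone does not fully cover."""
    return [name for name, level in coverage.items() if level != "full"]


gaps = needs_complement(GLUE_COVERAGE)
print(gaps)
```

&lt;p&gt;Only technical metadata drops out of the result, which is the point of this section: every other category needs at least one complementary service.&lt;/p&gt;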


&lt;h3&gt;
  
  
  Supplement: How to Complement Areas Where Glue Alone Falls Short
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Operational Metadata&lt;br&gt;
Advanced metrics (e.g., processed record counts, error rates, memory usage) need to be managed using CloudWatch or AWS X-Ray.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Lineage&lt;br&gt;
For end-to-end, advanced lineage visualization, you need Amazon DataZone, support for the OpenLineage specification, or the AWS Lineage API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data Quality&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue Data Quality: Rule-based validations (e.g., NULL checks, range checks)&lt;/li&gt;
&lt;li&gt;Glue DataBrew: Statistical profiling, distribution analysis, outlier detection (ML-based)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workflow Management&lt;br&gt;
Utilize AWS Step Functions, Amazon Managed Workflows for Apache Airflow (MWAA), or AWS Glue Workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Profiling&lt;br&gt;
Perform statistical profiling with Glue DataBrew, and detect sensitive data (PII classification) using Amazon Macie.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;In summary, the following capabilities are not fully provided by Glue alone and must be complemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business metadata management&lt;/li&gt;
&lt;li&gt;Advanced data lineage&lt;/li&gt;
&lt;li&gt;Data quality&lt;/li&gt;
&lt;li&gt;Data discovery&lt;/li&gt;
&lt;li&gt;Workflow management&lt;/li&gt;
&lt;li&gt;Data governance&lt;/li&gt;
&lt;li&gt;Data profiling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question then becomes how to realize these capabilities, which leads to combining multiple AWS services.&lt;/p&gt;


&lt;h2&gt;
  
  
  Complementing Glue Data Catalog as a Data Catalog
&lt;/h2&gt;

&lt;p&gt;As discussed earlier, Glue Data Catalog alone is not sufficient as a complete data catalog.&lt;br&gt;
In AWS, this gap is addressed by combining multiple services to complement its capabilities.&lt;/p&gt;

&lt;p&gt;Here, we organize &lt;strong&gt;which capabilities are complemented by which services&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;AWS Service&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Business Metadata&lt;/td&gt;
&lt;td&gt;Amazon DataZone&lt;/td&gt;
&lt;td&gt;Business glossary, data ownership definition, rich descriptions and context, data asset reviews and ratings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;Amazon DataZone&lt;/td&gt;
&lt;td&gt;Lineage visualization, understanding data transformation flows, dependency management &lt;em&gt;(end-to-end lineage requires OpenLineage or AWS Lineage APIs)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Quality&lt;/td&gt;
&lt;td&gt;AWS Glue Data Quality / DataBrew&lt;/td&gt;
&lt;td&gt;Data quality rule definition, scoring, anomaly detection, profiling &lt;em&gt;(Glue Data Quality can recommend rules automatically by analyzing the data; DataBrew provides profiling separately)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Discovery&lt;/td&gt;
&lt;td&gt;Amazon DataZone / Amazon Q&lt;/td&gt;
&lt;td&gt;Filtering, recommendations, related data suggestions, natural language search, AI-assisted analysis and insight generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow Management&lt;/td&gt;
&lt;td&gt;AWS Step Functions / Amazon MWAA (Airflow) / Glue Workflows&lt;/td&gt;
&lt;td&gt;Workflow orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Governance&lt;/td&gt;
&lt;td&gt;AWS Lake Formation&lt;/td&gt;
&lt;td&gt;Column/row-level access control, tag-based access control, permissions management, data filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Profiling&lt;/td&gt;
&lt;td&gt;AWS Glue DataBrew / Amazon Macie&lt;/td&gt;
&lt;td&gt;Profiling, statistical analysis, sensitive data detection, PII classification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As shown above, AWS enables a data catalog by combining multiple services with Glue Data Catalog at the core.&lt;/p&gt;


&lt;h2&gt;
  
  
  Data Catalog Architecture
&lt;/h2&gt;

&lt;p&gt;A data catalog on AWS, centered around Glue Data Catalog, can be organized into the following layered structure.&lt;/p&gt;

&lt;p&gt;The key point is to view this not as individual services, but as an &lt;strong&gt;architecture composed of layers&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────┐
│ Business Catalog Layer       │   ← Amazon DataZone / Amazon Q
│ (Discovery / Glossary)       │
└──────────────┬───────────────┘
               │
┌──────────────┴───────────────┐
│ Governance / Quality Layer   │   ← Lake Formation / Glue Data Quality
│ (Access Control / Quality)   │
└──────────────┬───────────────┘
               │
┌──────────────┴───────────────┐
│ Metadata Core Layer          │   ← Glue Data Catalog
│ (Technical Metadata)         │
└──────────────┬───────────────┘
               │
┌──────────────┴───────────────┐
│ Processing / Query Layer     │   ← Athena / Glue ETL / Redshift
│ (Query / ETL Processing)     │
└──────────────┬───────────────┘
               │
┌──────────────┴───────────────┐
│ Data Layer (S3)              │   ← Raw / Curated Data
└──────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Roles of Each Layer
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Business Catalog Layer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Amazon DataZone: Entry point for business users to discover data, review it, and request access&lt;/li&gt;
&lt;li&gt;Amazon Q: AI assistant that supports natural language search, data analysis, and insight generation (e.g., “Where is the sales data for 2023?”)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Governance / Quality Layer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Lake Formation: Column- and row-level access control, tag-based permission management&lt;/li&gt;
&lt;li&gt;Glue Data Quality: Definition and validation of data quality rules (e.g., “Check that the age column does not contain negative values”)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Metadata Core Layer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Glue Data Catalog: Centralized management of technical metadata (schemas, statistics, partitions)&lt;/li&gt;
&lt;li&gt;Integration with S3: Glue crawlers can automatically catalog file structures in the data lake&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Processing / Query Layer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Athena / Redshift Spectrum: Query data directly on S3 using Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Glue ETL: Executes transformation jobs based on metadata from the catalog&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Data Layer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;S3: Stores raw data (CSV, Parquet, etc.) and processed (curated) data&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementation Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use Glue Data Catalog as the foundation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Place it at the center since it integrates natively with services like Athena, Glue, and Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Add a business-facing layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Introduce Amazon DataZone and build a business glossary&lt;/li&gt;
&lt;li&gt;Define data ownership and utilize review/rating features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Strengthen data quality and governance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Define rules with Glue Data Quality (e.g., “Order amount must not be negative”)&lt;/li&gt;
&lt;li&gt;Apply access control with Lake Formation (e.g., “Finance team can only view accounting data”)
&lt;em&gt;Note: DataZone also supports IAM integration and access control independently&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
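&lt;p&gt;To make the idea of rule-based validation concrete, here is a minimal plain-Python sketch of a NULL check and a range check like the "Order amount must not be negative" rule above. It does not use the Glue Data Quality API; the column name &lt;code&gt;order_amount&lt;/code&gt;, the rule names, and the sample rows are all hypothetical.&lt;/p&gt;

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def amount_is_present(row: Row) -> bool:
    """NULL check: the order amount must be populated."""
    return row["order_amount"] is not None


def amount_is_not_negative(row: Row) -> bool:
    """Range check: a populated order amount must be zero or more."""
    amount = row["order_amount"]
    return amount is None or amount >= 0


# Each rule maps a readable name to a predicate that must hold for every row.
RULES = {
    "order_amount is not NULL": amount_is_present,
    "order_amount is not negative": amount_is_not_negative,
}


def evaluate(rows: List[Row]) -> Dict[str, int]:
    """Count the rows that violate each rule."""
    return {
        name: sum(1 for row in rows if not check(row))
        for name, check in RULES.items()
    }


sample = [
    {"order_amount": 120.0},
    {"order_amount": -5.0},
    {"order_amount": None},
]
violations = evaluate(sample)
print(violations)
```

&lt;p&gt;Keeping the NULL check and the range check as separate rules means a missing value is reported once, under the rule it actually breaks, rather than also counting as out of range.&lt;/p&gt;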

&lt;h3&gt;
  
  
  4. Visualize data lineage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use the OpenLineage specification to capture the inputs and outputs of Glue ETL jobs automatically&lt;/li&gt;
&lt;li&gt;Visualize lineage graphs in Amazon DataZone&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Enable profiling and sensitive data detection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use DataBrew for profiling (e.g., distribution analysis of columns)&lt;/li&gt;
&lt;li&gt;Use Amazon Macie for detecting and classifying PII&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Improve search experience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Integrate Amazon Q to enable natural language search (e.g., “Customer purchase history”)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges of Adopting DataZone
&lt;/h2&gt;

&lt;p&gt;Among the complementary services, one stands out as particularly important—but also challenging to adopt: Amazon DataZone.&lt;/p&gt;

&lt;p&gt;In DataZone, data assets are managed as &lt;strong&gt;data products&lt;/strong&gt;.&lt;br&gt;
A data product represents a meaningful unit of business data (e.g., “Customer transaction data”) with clearly defined ownership and responsibility.&lt;/p&gt;

&lt;p&gt;This structure clarifies &lt;strong&gt;who owns the data&lt;/strong&gt;, forming the foundation for data quality and governance.&lt;br&gt;
It also aligns well with Data Mesh principles, enabling domain-oriented data management.&lt;/p&gt;

&lt;p&gt;DataZone provides what Glue lacks: a &lt;strong&gt;catalog for humans&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data asset cataloging&lt;/li&gt;
&lt;li&gt;Search and discovery&lt;/li&gt;
&lt;li&gt;Lineage visualization&lt;/li&gt;
&lt;li&gt;Data quality visibility&lt;/li&gt;
&lt;li&gt;Governance management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, it extends the technical catalog into a &lt;strong&gt;business catalog&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While Glue Data Catalog is a “catalog for systems,” DataZone is a “catalog for people.”&lt;/p&gt;

&lt;p&gt;However, adopting DataZone requires meeting organizational, operational, and technical prerequisites.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Organizational Prerequisites
&lt;/h3&gt;

&lt;p&gt;This is often the most difficult part.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Domain Design
&lt;/h4&gt;

&lt;p&gt;Data domains—logical groupings of business data with clear ownership—must be defined.&lt;/p&gt;

&lt;p&gt;Since DataZone manages data at the domain level, unclear boundaries make operations unsustainable.&lt;br&gt;
In reality, many organizations have not formalized domain design, making this the first major challenge.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Ownership
&lt;/h4&gt;

&lt;p&gt;Each data asset must have a clearly defined owner.&lt;/p&gt;

&lt;p&gt;Data is treated as a “data product,” and each domain is responsible for managing its own data.&lt;br&gt;
However, in many organizations, ownership is ambiguous or fragmented.&lt;/p&gt;

&lt;h4&gt;
  
  
  Responsibility Definition
&lt;/h4&gt;

&lt;p&gt;Responsibilities for data quality, access control, and updates must be defined.&lt;br&gt;
This forms the basis for governance and approval workflows.&lt;/p&gt;

&lt;p&gt;In practice, aligning responsibilities across departments often becomes a bottleneck.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Operational Prerequisites
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Approval Workflows
&lt;/h4&gt;

&lt;p&gt;Processes for requesting and approving data access must be established.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Classification
&lt;/h4&gt;

&lt;p&gt;Standardized classification rules based on sensitivity and usage are required.&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage Policies
&lt;/h4&gt;

&lt;p&gt;Guidelines and compliance rules for data usage must be clearly defined.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Technical Prerequisites
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Lineage Collection
&lt;/h4&gt;

&lt;p&gt;DataZone visualizes lineage, but only if lineage data exists.&lt;/p&gt;

&lt;p&gt;This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration with processing systems (Glue, Redshift, etc.)&lt;/li&gt;
&lt;li&gt;Adoption of standards like OpenLineage&lt;/li&gt;
&lt;li&gt;Designing metadata collection within pipelines&lt;/li&gt;
&lt;/ul&gt;
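&lt;p&gt;As a sketch of what "designing metadata collection within pipelines" might involve, the snippet below builds a lineage event shaped like an OpenLineage run event for one job. The field names reflect my understanding of the OpenLineage specification; the namespace, job and dataset names, producer URI, and the spec version in &lt;code&gt;schemaURL&lt;/code&gt; are assumptions to verify before real use. Sending the event to a lineage backend is out of scope here.&lt;/p&gt;

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a lineage event dict shaped like an OpenLineage run event."""
    namespace = "example-data-platform"  # hypothetical namespace
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Placeholder producer URI; replace with an identifier for your pipeline.
        "producer": "https://example.com/pipelines/orders-etl",
        # Assumed spec version; check the current OpenLineage spec before use.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event(
    job_name="orders_daily_etl",
    inputs=["raw.orders"],
    outputs=["curated.orders"],
)
print(json.dumps(event, indent=2))
```

&lt;p&gt;The important part is that each job declares its inputs and outputs explicitly. A lineage graph can only connect datasets that the pipelines have reported.&lt;/p&gt;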

&lt;h4&gt;
  
  
  Metadata Integration
&lt;/h4&gt;

&lt;p&gt;Metadata from various services must be integrated into DataZone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog integration (Glue / Redshift / S3)&lt;/li&gt;
&lt;li&gt;Data quality metadata (Glue Data Quality)&lt;/li&gt;
&lt;li&gt;Access control metadata (Lake Formation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This integration enables a consistent data catalog experience.&lt;/p&gt;




&lt;p&gt;In summary, DataZone does not automatically solve data governance problems.&lt;br&gt;
It requires the following conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A data-driven culture is emerging&lt;/li&gt;
&lt;li&gt;Cross-functional collaboration exists&lt;/li&gt;
&lt;li&gt;Awareness of data quality is high&lt;/li&gt;
&lt;li&gt;Continuous improvement processes are in place&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, the catalog risks becoming a formality that is not actually used.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Practical Approach to Adopting DataZone
&lt;/h2&gt;

&lt;p&gt;Given the complexity, a phased approach is often effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Foundation (Data Platform Team)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Establish technical metadata with Glue Data Catalog&lt;/li&gt;
&lt;li&gt;Basic data classification&lt;/li&gt;
&lt;li&gt;Simple access control&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Governance (Involving Governance Teams)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Implement fine-grained access control with Lake Formation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example: Apply classification tags (LF-Tags) at the column level and grant SELECT only on columns that are not tagged as PII, so PII-tagged columns stay inaccessible by default&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Introduce data quality monitoring&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Establish basic lineage&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
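&lt;p&gt;The tag-based access control example in Phase 2 can be modeled in a few lines of plain Python. This is only an illustration of the idea (columns carry a classification tag, and a role sees a column only if it has been granted that tag value); it does not call the Lake Formation API, and the column names, tag values, and roles are hypothetical.&lt;/p&gt;

```python
from typing import Dict, List, Set

# Column-level classification tags for one table (illustrative values).
COLUMN_TAGS: Dict[str, str] = {
    "customer_id": "internal",
    "email": "pii",
    "phone_number": "pii",
    "order_total": "internal",
}

# Classification values each role has been granted SELECT on.
GRANTS: Dict[str, Set[str]] = {
    "analyst": {"internal"},
    "privacy_officer": {"internal", "pii"},
}


def selectable_columns(role: str) -> List[str]:
    """Return the columns a role may SELECT; ungranted tags stay hidden."""
    allowed = GRANTS.get(role, set())
    return [col for col, tag in COLUMN_TAGS.items() if tag in allowed]


print(selectable_columns("analyst"))
print(selectable_columns("privacy_officer"))
```

&lt;p&gt;Because access follows the tag rather than the column name, a newly added column that is tagged as PII is hidden from analysts without any change to the grants.&lt;/p&gt;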

&lt;h3&gt;
  
  
  Phase 3: DataZone (Business-Led)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Introduce once organizational prerequisites are met&lt;/li&gt;
&lt;li&gt;Manage business metadata&lt;/li&gt;
&lt;li&gt;Enable self-service analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DataZone becomes effective only when the organization reaches a certain level of maturity.&lt;/p&gt;

&lt;p&gt;It is not just a tool, but a mechanism for organizational transformation.&lt;br&gt;
Technical readiness alone is not sufficient—cultural and process changes are required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Considering OpenMetadata
&lt;/h2&gt;

&lt;p&gt;OpenMetadata is an open-source data catalog that supports a wide range of platforms.&lt;br&gt;
&lt;a href="https://open-metadata.org/" rel="noopener noreferrer"&gt;https://open-metadata.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connectors for various data sources&lt;/li&gt;
&lt;li&gt;Data lineage&lt;/li&gt;
&lt;li&gt;Data quality management&lt;/li&gt;
&lt;li&gt;Search and UI capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can function as a comprehensive data catalog on its own.&lt;/p&gt;

&lt;p&gt;This raises the question: why not use OpenMetadata from the start?&lt;/p&gt;

&lt;p&gt;The answer depends on your context—specifically, &lt;strong&gt;which layer you want to implement the catalog in&lt;/strong&gt; (infrastructure-oriented vs. business-oriented).&lt;/p&gt;

&lt;p&gt;In AWS-centric environments, Glue Data Catalog integrates natively with services like Athena, Redshift, and Glue, providing consistency and operational simplicity.&lt;/p&gt;

&lt;p&gt;Therefore, if your architecture is primarily within AWS, it is reasonable to center your design around Glue.&lt;/p&gt;

&lt;p&gt;On the other hand, OpenMetadata becomes advantageous when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing across multiple clouds (AWS / GCP / Azure)&lt;/li&gt;
&lt;li&gt;Integrating diverse sources (SaaS, on-premises, etc.)&lt;/li&gt;
&lt;li&gt;Requiring flexible and customizable metadata management&lt;/li&gt;
&lt;li&gt;Designing a business-centric catalog from the beginning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, adopting OpenMetadata requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure setup (ECS/EKS or VMs, metadata storage such as PostgreSQL)&lt;/li&gt;
&lt;li&gt;Operations (monitoring with Prometheus/Grafana, scaling, upgrades)&lt;/li&gt;
&lt;li&gt;Security design (RBAC, authentication/authorization, encryption)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to AWS managed services, it offers flexibility at the cost of increased operational overhead.&lt;/p&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS only / small scale&lt;/td&gt;
&lt;td&gt;Glue Data Catalog centered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS + governance needs&lt;/td&gt;
&lt;td&gt;+ Lake Formation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced data usage&lt;/td&gt;
&lt;td&gt;+ DataZone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud&lt;/td&gt;
&lt;td&gt;Consider OpenMetadata&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
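&lt;p&gt;One possible reading of the table above is as a small decision helper. The sketch below treats the AWS rows as cumulative additions on top of Glue Data Catalog and the multi-cloud row as a separate branch; that interpretation, the function name, and its parameters are my own assumptions rather than a fixed rule.&lt;/p&gt;

```python
from typing import List


def recommend(multi_cloud: bool, needs_governance: bool, advanced_usage: bool) -> List[str]:
    """Translate the recommendation table into a list of building blocks."""
    if multi_cloud:
        # The table only says to consider it, so this is a prompt, not a verdict.
        return ["Consider OpenMetadata"]
    stack = ["Glue Data Catalog"]
    if needs_governance:
        stack.append("Lake Formation")
    if advanced_usage:
        stack.append("DataZone")
    return stack


print(recommend(multi_cloud=False, needs_governance=False, advanced_usage=False))
print(recommend(multi_cloud=False, needs_governance=True, advanced_usage=True))
print(recommend(multi_cloud=True, needs_governance=True, advanced_usage=False))
```

&lt;p&gt;In practice the inputs are rarely clean booleans, so this is better used as a checklist for discussion than as an automated rule.&lt;/p&gt;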

&lt;p&gt;OpenMetadata can also integrate with Glue Data Catalog as a catalog provider.&lt;br&gt;
Its Glue connector can ingest metadata from Glue and map it to OpenMetadata constructs such as glossaries and data products.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we explored data catalogs centered around AWS Glue Data Catalog.&lt;/p&gt;

&lt;p&gt;A data catalog is essential to prevent a data lake from becoming a data swamp.&lt;/p&gt;

&lt;p&gt;What matters is not the tool itself, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How metadata is managed&lt;/li&gt;
&lt;li&gt;Who owns the data&lt;/li&gt;
&lt;li&gt;How it continues to be used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires designing not only technology, but also organization and processes.&lt;/p&gt;

&lt;p&gt;In other words, a data catalog is not just a tool—it is an &lt;strong&gt;organizational system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Glue Data Catalog plays a central role, but it cannot form a complete data catalog on its own.&lt;/p&gt;

&lt;p&gt;On AWS, a data catalog should be designed not as a single service, but as an &lt;strong&gt;architecture centered around Glue Data Catalog&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And most importantly, a data catalog should be viewed not merely as a technical foundation, but as a foundation for enabling organizations to operate with data.&lt;/p&gt;

&lt;p&gt;I hope this article helps those considering data catalogs on AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Boosting Lightweight ETL on AWS Lambda &amp; Glue Python Shell with DuckDB and Apache Arrow Dataset</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Fri, 06 Mar 2026 00:32:32 +0000</pubDate>
      <link>https://dev.to/aws-builders/boosting-lightweight-etl-on-aws-lambda-glue-python-shell-with-duckdb-and-apache-arrow-dataset-3n09</link>
      <guid>https://dev.to/aws-builders/boosting-lightweight-etl-on-aws-lambda-glue-python-shell-with-duckdb-and-apache-arrow-dataset-3n09</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/56bd1307159247" rel="noopener noreferrer"&gt;AWS Lambda/Glue Python Shell×DuckDBの軽量ETLをApache Arrow Datasetで高速化してみた&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In my previous articles, I introduced lightweight ETL using AWS Lambda and Glue Python Shell.&lt;br&gt;
In the process, I found that DuckDB's performance was not as high as expected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/does-increasing-aws-lambda-memory-to-10gb-really-make-it-faster-aws-lambda-chdbduckdb-pyiceberg-19j2"&gt;Does Increasing AWS Lambda Memory to 10GB Really Make It Faster? (AWS Lambda chDB/DuckDB PyIceberg Benchmark)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/aws-lambda-and-aws-glue-python-shell-in-the-context-of-lightweight-etl-3ao5"&gt;AWS Lambda and AWS Glue Python Shell in the Context of Lightweight ETL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I will cover what became the bottleneck for DuckDB and how using Apache Arrow Dataset can improve performance, along with the trade-offs observed.&lt;/p&gt;


&lt;h2&gt;
  
  
  Recap of Previous Articles
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/does-increasing-aws-lambda-memory-to-10gb-really-make-it-faster-aws-lambda-chdbduckdb-pyiceberg-19j2"&gt;Does Increasing AWS Lambda Memory to 10GB Really Make It Faster? (AWS Lambda chDB/DuckDB PyIceberg Benchmark)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/aws-lambda-and-aws-glue-python-shell-in-the-context-of-lightweight-etl-3ao5"&gt;AWS Lambda and AWS Glue Python Shell in the Context of Lightweight ETL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using NYC taxi data, we compared performance on the same file:&lt;br&gt;
&lt;a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="noopener noreferrer"&gt;https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;January 2024 Yellow Taxi Trip Records (2,964,624 records, 48MB)&lt;/li&gt;
&lt;li&gt;Full-year 2024 Yellow Taxi Trip Records (41,169,720 records, 807MB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lambda measurements were taken with memory configurations of 1024MB, 2048MB, and 3008MB (the maximum available without a quota increase).&lt;br&gt;
Glue Python Shell tests were performed with DPU settings of 1/16 and 1.&lt;/p&gt;

&lt;p&gt;Since memory usage cannot be directly compared, we focus only on execution time.&lt;/p&gt;
&lt;h3&gt;
  
  
  48MB File (1 Month)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Execution Platform&lt;/th&gt;
&lt;th&gt;Resource Setting&lt;/th&gt;
&lt;th&gt;chDB Time (s)&lt;/th&gt;
&lt;th&gt;DuckDB Time (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Glue Python Shell&lt;/td&gt;
&lt;td&gt;1/16 DPU&lt;/td&gt;
&lt;td&gt;46.000&lt;/td&gt;
&lt;td&gt;40.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue Python Shell&lt;/td&gt;
&lt;td&gt;1 DPU&lt;/td&gt;
&lt;td&gt;39.000&lt;/td&gt;
&lt;td&gt;34.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;1024 MB&lt;/td&gt;
&lt;td&gt;5.092&lt;/td&gt;
&lt;td&gt;5.163&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;2048 MB&lt;/td&gt;
&lt;td&gt;3.873&lt;/td&gt;
&lt;td&gt;4.265&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;3008 MB&lt;/td&gt;
&lt;td&gt;3.370&lt;/td&gt;
&lt;td&gt;4.061&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  807MB File (1 Year)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Execution Platform&lt;/th&gt;
&lt;th&gt;Resource Setting&lt;/th&gt;
&lt;th&gt;chDB Time (s)&lt;/th&gt;
&lt;th&gt;DuckDB Time (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Glue Python Shell&lt;/td&gt;
&lt;td&gt;1/16 DPU&lt;/td&gt;
&lt;td&gt;OutOfMemory&lt;/td&gt;
&lt;td&gt;OutOfMemory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue Python Shell&lt;/td&gt;
&lt;td&gt;1 DPU&lt;/td&gt;
&lt;td&gt;51.0&lt;/td&gt;
&lt;td&gt;212.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;1024 MB&lt;/td&gt;
&lt;td&gt;OutOfMemory&lt;/td&gt;
&lt;td&gt;OutOfMemory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;2048 MB&lt;/td&gt;
&lt;td&gt;OutOfMemory&lt;/td&gt;
&lt;td&gt;OutOfMemory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Lambda&lt;/td&gt;
&lt;td&gt;3008 MB&lt;/td&gt;
&lt;td&gt;27.171&lt;/td&gt;
&lt;td&gt;187.332&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
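
&lt;p&gt;The gap in the 807MB results is easier to see as a ratio. The sketch below only reuses the figures from the table above; the &lt;code&gt;slowdown&lt;/code&gt; helper and the dictionary layout are illustrative names, not part of the original benchmark code.&lt;/p&gt;

```python
# Execution times (seconds) copied from the 807MB table above.
times = {
    "Glue Python Shell, 1 DPU": {"chDB": 51.0, "DuckDB": 212.0},
    "AWS Lambda, 3008 MB": {"chDB": 27.171, "DuckDB": 187.332},
}


def slowdown(duckdb_seconds, chdb_seconds):
    """Return how many times longer DuckDB took than chDB."""
    return duckdb_seconds / chdb_seconds


# One ratio per platform, rounded to one decimal place.
ratios = {
    platform: round(slowdown(t["DuckDB"], t["chDB"]), 1)
    for platform, t in times.items()
}

for platform, ratio in ratios.items():
    print(f"{platform}: DuckDB took {ratio}x as long as chDB")
```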


&lt;h2&gt;
  
  
  What Caused the DuckDB Bottleneck?
&lt;/h2&gt;

&lt;p&gt;When loading Parquet directly into DuckDB, the flow is typically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3
↓
DuckDB read_parquet
↓
Filter / Query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;S3 Scan&lt;/strong&gt;: Reading the entire dataset involves heavy network I/O, which can account for most of the run time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet Decode&lt;/strong&gt;: Decoding inside DuckDB adds CPU load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Processing&lt;/strong&gt;: For simple filters like &lt;code&gt;WHERE VendorID = 1&lt;/code&gt;, query time is minimal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even when the query itself is light, the S3 scan becomes the bottleneck and drags down DuckDB’s end-to-end time.&lt;br&gt;
In the Glue Python Shell measurement, 176 of the total 210 seconds (~83%) were spent in S3 Scan + Parquet Decode.&lt;/p&gt;
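
&lt;p&gt;A breakdown like this can be collected by timing each stage separately. The following is a minimal sketch, not the measurement code used for this article: &lt;code&gt;timed&lt;/code&gt;, &lt;code&gt;share_of_total&lt;/code&gt;, and the stage names are hypothetical, and the list comprehensions are placeholders for the real read and query steps.&lt;/p&gt;

```python
import time
from contextlib import contextmanager

# Stage name mapped to elapsed wall-clock seconds.
timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of a stage in the timings dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def share_of_total(stage_seconds, total_seconds):
    """Return a stage's share of the total run time as a percentage."""
    return 100.0 * stage_seconds / total_seconds


# Placeholder workloads stand in for the real scan/decode and query steps.
with timed("scan_and_decode"):
    rows = [i for i in range(100_000)]

with timed("query"):
    matched = [r for r in rows if r % 2 == 1]

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f}s ({share_of_total(seconds, total):.1f}%)")
```

&lt;p&gt;Wrapping the read and the query in separate &lt;code&gt;timed&lt;/code&gt; blocks is what makes it possible to tell whether time is going to I/O and decoding or to query execution.&lt;/p&gt;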

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use Apache Arrow Dataset to separate reading from querying and improve performance.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is Apache Arrow Dataset?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arrow.apache.org/docs/python/dataset.html" rel="noopener noreferrer"&gt;https://arrow.apache.org/docs/python/dataset.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Arrow Dataset (&lt;code&gt;pyarrow.dataset&lt;/code&gt; in Python) is an API for efficiently reading file-based data such as Parquet or CSV into the columnar Arrow in-memory format.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast Parquet reading&lt;/li&gt;
&lt;li&gt;Efficient decode operations&lt;/li&gt;
&lt;li&gt;Parallelized S3 reads&lt;/li&gt;
&lt;li&gt;Filter/Projection pushdown to reduce I/O&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging these features, the S3 Scan + Parquet Decode bottleneck can be greatly reduced.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;
&lt;h3&gt;
  
  
  AWS Lambda
&lt;/h3&gt;

&lt;p&gt;Same architecture as in previous articles:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlgvomb81m7ev1sz4tg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlgvomb81m7ev1sz4tg5.png" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Glue Python Shell
&lt;/h3&gt;

&lt;p&gt;Also same as before:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F630cym2fbaera3wghq49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F630cym2fbaera3wghq49.png" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Sample Code (AWS Lambda)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.dataset&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.fs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog.glue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueCatalog&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="c1"&gt;# DuckDB setup in Lambda
&lt;/span&gt;        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Retrieve S3 path from event
&lt;/span&gt;        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;s3_object_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;s3_input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_object_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3 input path: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Read Parquet from S3 using Arrow Dataset
&lt;/span&gt;        &lt;span class="c1"&gt;# Use boto3 session to get temporary credentials
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_credentials&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_frozen_credentials&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3FileSystem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load dataset with Arrow Dataset
&lt;/span&gt;        &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;filesystem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Convert dataset to Arrow Table (in-memory)
&lt;/span&gt;        &lt;span class="n"&gt;arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of rows retrieved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# DuckDB processing (SQL query)
&lt;/span&gt;        &lt;span class="c1"&gt;# Use DuckDB from_arrow to run SQL on Arrow 
&lt;/span&gt;        &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result_arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT * 
            FROM rel
            WHERE VendorID = 1
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch_arrow_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Configure Glue Catalog (to access Iceberg table)
&lt;/span&gt;        &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueCatalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;
        &lt;span class="c1"&gt;# Load the table
&lt;/span&gt;        &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;        &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yellow_tripdata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Append data to the Iceberg table in bulk
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has been appended to S3 in Iceberg format.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
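
&lt;p&gt;One detail to watch when building the input path from the event: object keys in S3 event notifications are URL-encoded (a space arrives as &lt;code&gt;+&lt;/code&gt;), so a key with spaces or special characters would not resolve if used as-is. A minimal sketch of decoding it first is shown below; &lt;code&gt;arrow_path_from_s3_event&lt;/code&gt; is a hypothetical helper name, and the bucket and key in the sample event are placeholders.&lt;/p&gt;

```python
from urllib.parse import unquote_plus


def arrow_path_from_s3_event(event):
    """Return 'bucket/key' for the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Keys are URL-encoded in the notification, so decode before use.
    key = unquote_plus(record["object"]["key"])
    return f"{bucket}/{key}"


# Placeholder event with an encoded space in the object key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "input/yellow+tripdata_2024-01.parquet"},
            }
        }
    ]
}

print(arrow_path_from_s3_event(sample_event))
```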


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article focuses on the differences in operation, so version updates or conflict handling in Iceberg tables are omitted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrow Dataset separates S3 reading from DuckDB querying&lt;/li&gt;
&lt;li&gt;Simple filters can be applied in Arrow Dataset itself, without going through DuckDB, for additional speed&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Sample Code (Glue Python Shell)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.dataset&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.fs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_job_parameters&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3_input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3_input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s3_path&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_duckdb_environment&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/.duckdb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HOME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duckdb_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB environment setup completed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;duckdb_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB environment setup error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_parquet_with_arrow_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reading with Arrow Dataset...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_credentials&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_frozen_credentials&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3FileSystem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;filesystem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Arrow read rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_with_duckdb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;

    &lt;span class="n"&gt;con&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT *
            FROM rel
            WHERE VendorID = 1
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;arrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB filtered rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_iceberg_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writing started...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_catalog&lt;/span&gt;

        &lt;span class="n"&gt;catalog_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://your-bucket/your-warehouse/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glue_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;catalog_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;table_identifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb.yellow_tripdata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_identifier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Target data to write: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writing error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;setup_duckdb_environment&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to set up DuckDB environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="n"&gt;s3_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_job_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Arrow Dataset read
&lt;/span&gt;        &lt;span class="n"&gt;arrow_tbl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_parquet_with_arrow_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# DuckDB SQL filter
&lt;/span&gt;        &lt;span class="n"&gt;result_tbl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_with_duckdb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_tbl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Iceberg write
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;write_iceberg_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_tbl&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writing fully successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writing failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Main error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article focuses on how the processing behaves in each environment, so handling of Iceberg table version updates and write conflicts is omitted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glue Python Shell can run the same ETL with nearly the same code structure as the Lambda version&lt;/li&gt;
&lt;li&gt;Responsibilities are separated: Arrow Dataset handles reading, DuckDB handles SQL queries&lt;/li&gt;
&lt;li&gt;Lightweight filters can be processed efficiently&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Benchmarking
&lt;/h2&gt;

&lt;p&gt;Using the same dataset and queries as in previous articles:&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS Lambda
&lt;/h3&gt;

&lt;p&gt;48MB file (1 month), Lambda memory = 3008MB&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Time (ms)&lt;/th&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chDB&lt;/td&gt;
&lt;td&gt;3,369.78&lt;/td&gt;
&lt;td&gt;1115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;4,061.33&lt;/td&gt;
&lt;td&gt;524&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB × Arrow Dataset&lt;/td&gt;
&lt;td&gt;3,591.84&lt;/td&gt;
&lt;td&gt;928&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;807MB file (1 year), Lambda memory = 10240MB&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Time (ms)&lt;/th&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chDB&lt;/td&gt;
&lt;td&gt;22,839.18&lt;/td&gt;
&lt;td&gt;3490&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;189,678.02&lt;/td&gt;
&lt;td&gt;2788&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB × Arrow Dataset&lt;/td&gt;
&lt;td&gt;15,220.6&lt;/td&gt;
&lt;td&gt;8086&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Glue Python Shell (1 DPU)
&lt;/h3&gt;

&lt;p&gt;48MB File (1 Month)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Time (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chDB&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB × Arrow Dataset&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;807MB File (1 Year)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Time (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chDB&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB × Arrow Dataset&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

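&lt;p&gt;As a result, on both AWS Lambda and Glue Python Shell, combining DuckDB with Arrow Dataset was far faster than DuckDB alone on the 807MB file (15.2 s vs. 189.7 s on Lambda, 44 s vs. 212 s on Glue) and also faster than chDB there. On the 48MB file the three methods were close, with chDB slightly ahead on Lambda.&lt;br&gt;
In other words, addressing the S3 scan and Parquet decoding bottlenecks seems to be the key to improving DuckDB processing.&lt;/p&gt;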

&lt;p&gt;However, in the case of Lambda, large files lead to high memory usage: the 807MB run used 8,086 MB of the 10,240 MB configured, so a somewhat larger file could exceed the memory limit.&lt;br&gt;
This means you need to consider carefully where and how to use this approach.&lt;/p&gt;
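
&lt;p&gt;To put the 807MB Lambda results in perspective, the ratios can be worked out directly from the benchmark table. The snippet below is only a small arithmetic sketch; the dictionary layout and the function names are illustrative and are not part of the original code.&lt;/p&gt;

```python
# Values copied from the 807MB Lambda benchmark table (time in ms, memory in MB).
LAMBDA_807MB = {
    "chDB": {"time_ms": 22839.18, "memory_mb": 3490},
    "DuckDB": {"time_ms": 189678.02, "memory_mb": 2788},
    "DuckDB x Arrow Dataset": {"time_ms": 15220.6, "memory_mb": 8086},
}
CONFIGURED_MEMORY_MB = 10240  # Lambda memory setting used for this run


def speedup(baseline, candidate, results=LAMBDA_807MB):
    """How many times faster `candidate` finished than `baseline`."""
    return results[baseline]["time_ms"] / results[candidate]["time_ms"]


def memory_share(method, results=LAMBDA_807MB, limit_mb=CONFIGURED_MEMORY_MB):
    """Fraction of the configured Lambda memory that `method` used."""
    return results[method]["memory_mb"] / limit_mb


arrow = "DuckDB x Arrow Dataset"
print(f"vs DuckDB alone: {speedup('DuckDB', arrow):.1f}x faster")
print(f"vs chDB:         {speedup('chDB', arrow):.1f}x faster")
print(f"memory used:     {memory_share(arrow):.0%} of {CONFIGURED_MEMORY_MB} MB")
```

&lt;p&gt;This works out to roughly 12.5x faster than DuckDB alone and 1.5x faster than chDB, while using about 79% of the configured memory.&lt;/p&gt;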


&lt;h2&gt;
  
  
  Memory Consideration with Arrow Dataset
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filesystem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this process, &lt;code&gt;dataset.to_table()&lt;/code&gt; materializes the entire dataset in memory as an Arrow Table.&lt;br&gt;
Arrow Tables use a columnar in-memory format, which is very fast, but in this case, loading the entire file at once can result in high memory usage.&lt;/p&gt;

&lt;p&gt;For example, when the 807MB Parquet file was read in Lambda, memory usage for the invocation reached 8,086 MB, roughly ten times the compressed file size, because the Arrow Table holds the data decompressed and decoded.&lt;br&gt;
While &lt;code&gt;to_table()&lt;/code&gt; is convenient, be aware that it can significantly increase memory consumption, especially when no column selection or filter is applied.&lt;/p&gt;
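
&lt;p&gt;One way to observe this expansion is to compare the in-memory size of the Arrow Table with the size of the Parquet object. The helper below is a hypothetical sketch: &lt;code&gt;expansion_ratio&lt;/code&gt; and &lt;code&gt;log_expansion&lt;/code&gt; are names introduced here, and the only PyArrow feature relied on is the &lt;code&gt;nbytes&lt;/code&gt; attribute of a Table.&lt;/p&gt;

```python
def expansion_ratio(arrow_table, parquet_size_bytes):
    """Return the in-memory Arrow size divided by the Parquet object size.

    arrow_table        : a pyarrow.Table (anything exposing `nbytes` works)
    parquet_size_bytes : size of the source Parquet object in bytes
    """
    if not parquet_size_bytes > 0:
        raise ValueError("parquet_size_bytes must be positive")
    return arrow_table.nbytes / parquet_size_bytes


def log_expansion(arrow_table, parquet_size_bytes):
    """Print a one-line summary and return the ratio for further checks."""
    ratio = expansion_ratio(arrow_table, parquet_size_bytes)
    mib = arrow_table.nbytes / (1024 * 1024)
    print(f"Arrow Table: {mib:,.0f} MiB in memory, {ratio:.1f}x the Parquet size")
    return ratio
```

&lt;p&gt;In a Lambda handler triggered by S3, the object size could be taken from the event record (assumed here to be &lt;code&gt;event['Records'][0]['s3']['object']['size']&lt;/code&gt;) and passed in right after &lt;code&gt;to_table()&lt;/code&gt; returns.&lt;/p&gt;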

&lt;p&gt;Lambda can also use &lt;code&gt;/tmp&lt;/code&gt; for disk-backed processing, but processing in memory is generally much faster.&lt;br&gt;
The constraint is the memory limit: expanding a large file into an Arrow Table can consume most of it very quickly.&lt;/p&gt;

&lt;p&gt;For simple queries, one approach is to iterate over row groups and process small chunks one at a time instead of materializing the entire table in memory.&lt;br&gt;
This can potentially keep memory usage within a few hundred MB.&lt;/p&gt;
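
&lt;p&gt;A minimal sketch of chunked processing, under some assumptions: it streams record batches with the Dataset &lt;code&gt;to_batches()&lt;/code&gt; method rather than addressing Parquet row groups explicitly, and &lt;code&gt;append_in_batches&lt;/code&gt; and &lt;code&gt;write_batch&lt;/code&gt; are hypothetical names introduced for illustration. It was not part of the benchmarks above.&lt;/p&gt;

```python
def append_in_batches(dataset, write_batch, columns=None, row_filter=None):
    """Stream record batches from an Arrow Dataset instead of calling to_table().

    dataset     : pyarrow.dataset.Dataset, e.g. the result of ds.dataset(...)
    write_batch : callable invoked once per non-empty pyarrow.RecordBatch
    columns     : optional list of column names to read (projection)
    row_filter  : optional pyarrow.dataset Expression used as a pushdown filter
    Returns the total number of rows handed to write_batch.
    """
    rows_written = 0
    # to_batches() yields RecordBatch objects lazily, so memory is bounded by
    # the batches in flight rather than by the whole file.
    for batch in dataset.to_batches(columns=columns, filter=row_filter):
        if batch.num_rows == 0:
            continue  # a filter can leave some batches empty
        write_batch(batch)
        rows_written += batch.num_rows
    return rows_written


# Hypothetical usage with the objects from this article:
#   import pyarrow as pa
#   import pyarrow.dataset as ds
#   append_in_batches(
#       dataset,
#       lambda batch: iceberg_table.append(pa.Table.from_batches([batch])),
#       row_filter=ds.field("VendorID") == 1,
#   )
```

&lt;p&gt;Note that every Iceberg append commits a new snapshot, so in practice it may be better to collect several batches before each write rather than appending them one by one.&lt;/p&gt;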


&lt;h2&gt;
  
  
  Trade-Offs
&lt;/h2&gt;

&lt;p&gt;Using Arrow Dataset significantly improves the speed of S3 reads and Parquet decoding.&lt;br&gt;
However, expanding the dataset all at once with &lt;code&gt;to_table()&lt;/code&gt; increases memory usage, which may hit Lambda’s memory limits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pros: Decoding and I/O are faster, resulting in improved performance.&lt;/li&gt;
&lt;li&gt;Cons: Materializing the entire file consumes a lot of memory, and for large files, Lambda may run into OutOfMemory errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, it is important to design your ETL with a balance between performance and memory usage in mind.&lt;br&gt;
For small files or when Lambda has sufficient memory, loading the full dataset at once is fine.&lt;br&gt;
For larger files, consider chunked processing by row group or pushdown filters to keep memory usage under control.&lt;/p&gt;


&lt;h2&gt;
  
  
  Pushdown with Arrow Dataset
&lt;/h2&gt;

&lt;p&gt;With Arrow Dataset’s filter and projection pushdown, you can read only the columns you need and skip row groups that cannot contain matching rows, so less data is loaded from S3.&lt;/p&gt;

&lt;p&gt;Here’s how you can apply it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VendorID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tpep_pickup_datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VendorID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Only the necessary row groups are read (this is crucial for large datasets).&lt;/li&gt;
&lt;li&gt;Reduces network I/O from S3.&lt;/li&gt;
&lt;li&gt;Can further shorten processing time.&lt;/li&gt;
&lt;li&gt;Arrow Dataset is optimal for lightweight reads and simple filters; complex queries should still be handled in DuckDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Experimenting in Lambda
&lt;/h3&gt;

&lt;p&gt;You can integrate pushdown into your existing Lambda code. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.dataset&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.fs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog.glue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueCatalog&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="c1"&gt;# DuckDB setup in Lambda
&lt;/span&gt;        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Retrieve S3 path from event
&lt;/span&gt;        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;s3_object_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;s3_input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_object_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3 input path: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Read Parquet from S3 using Arrow Dataset
&lt;/span&gt;        &lt;span class="c1"&gt;# Use boto3 session to get temporary credentials
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_credentials&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_frozen_credentials&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3FileSystem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load dataset with Arrow Dataset
&lt;/span&gt;        &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;filesystem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Materialize only the rows matching the pushdown filter as an Arrow Table
&lt;/span&gt;        &lt;span class="n"&gt;arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VendorID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of rows retrieved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# DuckDB processing (SQL query)
&lt;/span&gt;        &lt;span class="c1"&gt;# Use DuckDB from_arrow to run SQL on Arrow 
&lt;/span&gt;        &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_arrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result_arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT * 
            FROM rel
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch_arrow_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Configure Glue Catalog (to access Iceberg table)
&lt;/span&gt;        &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueCatalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;
        &lt;span class="c1"&gt;# Load the table
&lt;/span&gt;        &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;        &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yellow_tripdata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Append data to the Iceberg table in bulk
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has been appended to S3 in Iceberg format.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With pushdown applied to the 807MB file in Lambda, memory usage dropped significantly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Time (ms)&lt;/th&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB × Arrow Dataset&lt;/td&gt;
&lt;td&gt;15,220.6&lt;/td&gt;
&lt;td&gt;8086&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB × Arrow Dataset (Pushdown)&lt;/td&gt;
&lt;td&gt;15,238.14&lt;/td&gt;
&lt;td&gt;3458&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pushdown cut memory usage from 8086 MB to 3458 MB (about 57%), with essentially no change in processing time. &lt;br&gt;
This two-step approach, filtering out unnecessary data with Arrow Dataset before passing it to DuckDB, is effective for large datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DuckDB on its own is slow here because it has to handle both the S3 scan and the Parquet decode&lt;/li&gt;
&lt;li&gt;DuckDB shines with complex queries (JOIN, GROUP BY, WINDOW)&lt;/li&gt;
&lt;li&gt;With Arrow Dataset, filter pushdown is the key to keeping memory usage down&lt;/li&gt;
&lt;li&gt;Separating responsibilities (Arrow Dataset for reading, DuckDB for SQL) enables efficient ETL in Lambda/Glue&lt;/li&gt;
&lt;li&gt;chDB offers a good balance of memory and speed out of the box&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Responsibility-Separated Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 Parquet (raw data)
      │
      ▼
Arrow Dataset → row group scan + simple filter
      │
      ▼
DuckDB → SQL query (JOIN, GROUP BY, Window functions)
      │
      ▼
PyIceberg → Iceberg table write
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we explored performance improvements for a lightweight ETL built with AWS Lambda / Glue Python Shell × DuckDB × PyIceberg.&lt;/p&gt;

&lt;p&gt;For lightweight ETL, especially on AWS Lambda, processing time is a critical factor. By using Apache Arrow Dataset, we were able to significantly improve performance by offloading S3 reads and Parquet decoding to Arrow before running queries in DuckDB.&lt;/p&gt;

&lt;p&gt;However, there are trade-offs. Materializing an entire dataset in memory with to_table() can lead to high memory usage, which may exceed Lambda’s memory limit for large files. Careful separation of responsibilities and techniques that limit how much data is loaded at once (for example, row group iteration or pushdown filters) are therefore important considerations.&lt;/p&gt;

&lt;p&gt;With the architecture presented here, even large files can be processed quickly in a lightweight ETL on AWS. While complex queries may still face performance limitations, this approach provides a practical and efficient option for real-time or near-real-time ETL in a Lakehouse environment using Apache Iceberg.&lt;/p&gt;

&lt;p&gt;We hope this article serves as a reference for those exploring lightweight data processing and ETL patterns on Iceberg tables.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>duckdb</category>
      <category>chdb</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Does Increasing AWS Lambda Memory to 10GB Really Make It Faster? (AWS Lambda chDB/DuckDB PyIceberg Benchmark)</title>
      <dc:creator>Aki</dc:creator>
      <pubDate>Thu, 26 Feb 2026 13:33:33 +0000</pubDate>
      <link>https://dev.to/aws-builders/does-increasing-aws-lambda-memory-to-10gb-really-make-it-faster-aws-lambda-chdbduckdb-pyiceberg-19j2</link>
      <guid>https://dev.to/aws-builders/does-increasing-aws-lambda-memory-to-10gb-really-make-it-faster-aws-lambda-chdbduckdb-pyiceberg-19j2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Original Japanese article&lt;/strong&gt;: &lt;a href="https://zenn.dev/penginpenguin/articles/111a7d0a2feac7" rel="noopener noreferrer"&gt;AWS Lambdaを10GBにすると本当に速くなるのか？（AWS Lambda×chDB/DuckDB×PyIceberg検証）&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I'm Aki, an AWS Community Builder (&lt;a href="https://x.com/jitepengin" rel="noopener noreferrer"&gt;@jitepengin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In a previous article, I benchmarked Iceberg integration using AWS Lambda with DuckDB and chDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-builders/lightweight-etl-with-aws-lambda-chdb-and-pyiceberg-compared-with-duckdb-2coo"&gt;Lightweight ETL with AWS Lambda, chDB, and PyIceberg (Compared with DuckDB)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In that article, I tested two patterns on AWS Lambda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chDB × PyIceberg&lt;/li&gt;
&lt;li&gt;DuckDB × PyIceberg&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory sizes were set to 1024 MB, 2048 MB, and 3008 MB (the maximum without a quota increase at the time).&lt;/p&gt;

&lt;p&gt;The results showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For small datasets, increasing memory generally improved performance.&lt;/li&gt;
&lt;li&gt;For a large dataset (807 MB), 3008 MB was barely enough to complete processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This time, I extended the experiment:&lt;/p&gt;

&lt;p&gt;What happens if we increase Lambda memory up to 10GB (10240 MB)?&lt;/p&gt;




&lt;h2&gt;
  
  
  Increasing the Lambda Memory Quota
&lt;/h2&gt;

&lt;p&gt;To raise the Lambda memory limit beyond 3008 MB, you must request a quota increase.&lt;/p&gt;

&lt;p&gt;Important:&lt;br&gt;
This limit cannot be raised from the Service Quotas console; the request has to go through AWS Support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to the AWS Support Center and create a new case.&lt;/li&gt;
&lt;li&gt;Clearly state:
&lt;ul&gt;
&lt;li&gt;The reason for the increase&lt;/li&gt;
&lt;li&gt;The target region&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example request content:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We are building and validating a data processing platform using AWS Lambda.&lt;br&gt;
The workload is memory-intensive, including large Parquet file loading, aggregation, and transformation.&lt;br&gt;
The current 3008 MB limit is insufficient to complete processing.&lt;/p&gt;

&lt;p&gt;We are performing analytical processing inside Lambda using columnar formats (Parquet), and the workload requires higher memory allocation.&lt;/p&gt;

&lt;p&gt;Currently, we experience performance degradation and OutOfMemory errors.&lt;/p&gt;

&lt;p&gt;We would like to request an increase of the Lambda memory limit in the Tokyo region to 10240 MB.&lt;/p&gt;

&lt;p&gt;Although we considered migrating to other compute services, we determined that continuing with Lambda is the most appropriate option from both operational and architectural perspectives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After submission, AWS responded in about &lt;strong&gt;3 business days&lt;/strong&gt; and applied the increase.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlgvomb81m7ev1sz4tg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlgvomb81m7ev1sz4tg5.png" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture is identical to the previous article.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load a Parquet file from S3 in Lambda&lt;/li&gt;
&lt;li&gt;Process it using chDB or DuckDB&lt;/li&gt;
&lt;li&gt;Write results into an Iceberg table&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;p&gt;S3 → Lambda (chDB/DuckDB) → Iceberg (via Glue Catalog)&lt;/p&gt;

&lt;p&gt;In this article, I focus on differences in performance behavior.&lt;br&gt;
Handling of Iceberg version conflicts and concurrent writes is omitted for simplicity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sample Code (chDB)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chdb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog.glue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueCatalog&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_to_pyarrow_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Compatibility helper to extract a pyarrow.Table from a chDB query_result.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chdb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_arrowTable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_arrowTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_pyarrow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_pyarrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_arrow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_arrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cannot convert chdb query_result to pyarrow.Table. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available attributes: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;))[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_arrow_for_iceberg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Normalize Arrow types that Iceberg does not accept
    (mainly timezone-aware timestamps).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;new_fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;new_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tz&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Remove timezone information (values remain in UTC)
&lt;/span&gt;            &lt;span class="n"&gt;new_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;new_fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;new_columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;new_columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;new_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_arrays&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract S3 bucket and object key from the event
&lt;/span&gt;        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;s3_object_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Build S3 HTTPS URL
&lt;/span&gt;        &lt;span class="n"&gt;s3_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3.ap-northeast-1.amazonaws.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_object_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_url: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Query Parquet data on S3 using chDB
&lt;/span&gt;        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT *
            FROM s3(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
            WHERE VendorID = 1
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute chDB query with Arrow output
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Arrow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Convert chDB result to pyarrow.Table
&lt;/span&gt;        &lt;span class="n"&gt;arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_to_pyarrow_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Normalize schema for Iceberg compatibility
&lt;/span&gt;        &lt;span class="n"&gt;arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_arrow_for_iceberg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Normalized schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialize Iceberg Glue Catalog
&lt;/span&gt;        &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueCatalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load Iceberg table
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb.yellow_tripdata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Append data to Iceberg table
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data appended to Iceberg table.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exception:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
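
&lt;p&gt;One detail the handler above does not cover: S3 event notifications deliver the object key URL-encoded (a space arrives as +, for example), so keys containing spaces or special characters generally need decoding before use. Below is a minimal sketch using only the standard library. The helper name extract_s3_location and the sample event are hypothetical and not part of the original code.&lt;/p&gt;

```python
from urllib.parse import unquote_plus


def extract_s3_location(event):
    """Return (bucket, decoded_key) for the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Keys in S3 event notifications are URL-encoded; '+' stands for a space.
    key = unquote_plus(record["object"]["key"])
    return bucket, key


# Hypothetical event, trimmed to the fields the handler actually reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "raw/yellow+tripdata_2024%2D01.parquet"},
            }
        }
    ]
}

bucket, key = extract_s3_location(sample_event)
print(bucket, key)
```

&lt;p&gt;The decoded key can go straight into an s3:// path. When building an HTTPS URL, as the chDB sample does, the key would likely need to be percent-encoded again (for example with urllib.parse.quote).&lt;/p&gt;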



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article focuses on differences in runtime behavior, so version updates and conflict handling for Iceberg tables are omitted.&lt;/p&gt;
&lt;/blockquote&gt;
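
&lt;p&gt;For reference, the conflict handling left out here usually amounts to retrying the commit. The following is a minimal, library-independent sketch of such a retry loop with exponential backoff. The helper retry_on_conflict, its parameters, and the FakeConflict stand-in are hypothetical names used for illustration only.&lt;/p&gt;

```python
import time


def retry_on_conflict(operation, retry_on, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); if it raises retry_on, back off and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the caller (and Lambda) see the failure.
                raise
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


class FakeConflict(Exception):
    """Stand-in for a commit-conflict exception raised by a table library."""


attempts = []


def flaky_append():
    # Fails on the first two calls, succeeds on the third.
    attempts.append(1)
    if len(attempts) != 3:
        raise FakeConflict("concurrent commit")
    return "committed"


# The sleep function is replaced with a no-op so the demo runs instantly.
result = retry_on_conflict(flaky_append, FakeConflict, sleep=lambda seconds: None)
print(result, len(attempts))
```

&lt;p&gt;With PyIceberg, the operation would be a small function that refreshes the table and calls append, and retry_on would typically be pyiceberg.exceptions.CommitFailedException.&lt;/p&gt;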

&lt;h2&gt;
  
  
  Sample Code (DuckDB)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog.glue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueCatalog&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Connect to DuckDB and set the home directory
&lt;/span&gt;        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET home_directory=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="c1"&gt;# Install and load the httpfs extension
&lt;/span&gt;        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSTALL httpfs;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOAD httpfs;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load data from S3 using DuckDB
&lt;/span&gt;        &lt;span class="n"&gt;s3_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;s3_object_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;s3_input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_object_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_input_path: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT * FROM read_parquet(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_input_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) WHERE VendorID = 1
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Execute SQL and retrieve results as a PyArrow Table
&lt;/span&gt;        &lt;span class="n"&gt;result_arrow_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch_arrow_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of rows retrieved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Configure Glue Catalog (to access Iceberg table)
&lt;/span&gt;        &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueCatalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;
        &lt;span class="c1"&gt;# Load the table
&lt;/span&gt;        &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icebergdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;        &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yellow_tripdata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust to your environment.
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Append data to the Iceberg table in bulk
&lt;/span&gt;        &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_arrow_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has been appended to S3 in Iceberg format.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Version updates and conflict handling are omitted here as well.&lt;/p&gt;
&lt;/blockquote&gt;
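
&lt;p&gt;As a rough illustration of the conflict handling omitted above, the sketch below retries a commit with exponential backoff. The helper name &lt;code&gt;append_with_retry&lt;/code&gt;, the retry count, and the delays are assumptions for illustration and are not part of the original code. With PyIceberg, you would likely pass its &lt;code&gt;CommitFailedException&lt;/code&gt; as the retryable error and refresh the table before each attempt.&lt;/p&gt;

```python
import time


def append_with_retry(append_once, retryable=(Exception,), attempts=3, base_delay_s=0.5):
    """Call append_once(), retrying with exponential backoff on retryable errors."""
    for attempt in range(1, attempts + 1):
        try:
            return append_once()
        except retryable:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Hypothetical stand-in for an Iceberg commit that conflicts once, then succeeds.
calls = {"count": 0}


def flaky_append():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated commit conflict")
    return "committed"


outcome = append_with_retry(flaky_append, retryable=(RuntimeError,), base_delay_s=0.01)
print(outcome, calls["count"])
```

&lt;p&gt;In the handler above, &lt;code&gt;append_once&lt;/code&gt; could be a small function that reloads the table and then calls &lt;code&gt;iceberg_table.append(result_arrow_table)&lt;/code&gt;. Whether a plain retry is safe depends on the workload, since a retried append after a partial failure can duplicate rows.&lt;/p&gt;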

&lt;h2&gt;
  
  
  Test Conditions
&lt;/h2&gt;

&lt;p&gt;Dataset:&lt;br&gt;
NYC Taxi Trip Records&lt;br&gt;
&lt;a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="noopener noreferrer"&gt;https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;January 2024 (Yellow Taxi)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2,964,624 records&lt;/li&gt;
&lt;li&gt;48 MB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Full year 2024 (aggregated file)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;41,169,720 records&lt;/li&gt;
&lt;li&gt;807 MB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Memory configurations tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1024 MB&lt;/li&gt;
&lt;li&gt;2048 MB&lt;/li&gt;
&lt;li&gt;3008 MB (the maximum available without a quota increase)&lt;/li&gt;
&lt;li&gt;4096 MB&lt;/li&gt;
&lt;li&gt;10240 MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each configuration was executed 5 times under warm conditions to eliminate cold start effects.&lt;/p&gt;
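
&lt;p&gt;A minimal sketch of this kind of warm-run measurement is shown below. It assumes in-process timing with one discarded warm-up call, whereas the figures in this article are the values Lambda itself reported. The names &lt;code&gt;benchmark&lt;/code&gt; and &lt;code&gt;sample_task&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import statistics
import time


def benchmark(task, warm_runs=5):
    """Run task once to warm up, then time warm_runs further calls in milliseconds."""
    task()  # Warm-up call; its duration is discarded
    durations_ms = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        task()
        durations_ms.append((time.perf_counter() - start) * 1000)
    return {
        "runs": len(durations_ms),
        "mean_ms": statistics.mean(durations_ms),
        "min_ms": min(durations_ms),
        "max_ms": max(durations_ms),
    }


def sample_task():
    # Stand-in workload; replace with the real ETL call.
    sum(range(100_000))


result = benchmark(sample_task)
print(result)
```

&lt;p&gt;Reporting the minimum and maximum alongside the mean makes it easier to judge whether a small difference between two memory settings is real or just run-to-run noise.&lt;/p&gt;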




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  48MB File (1 Month)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;th&gt;chDB Time (ms)&lt;/th&gt;
&lt;th&gt;chDB Memory Used (MB)&lt;/th&gt;
&lt;th&gt;DuckDB Time (ms)&lt;/th&gt;
&lt;th&gt;DuckDB Memory Used (MB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;5,092&lt;/td&gt;
&lt;td&gt;1018&lt;/td&gt;
&lt;td&gt;5,163&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;3,872&lt;/td&gt;
&lt;td&gt;1132&lt;/td&gt;
&lt;td&gt;4,264&lt;/td&gt;
&lt;td&gt;538&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3008&lt;/td&gt;
&lt;td&gt;3,369&lt;/td&gt;
&lt;td&gt;1115&lt;/td&gt;
&lt;td&gt;4,061&lt;/td&gt;
&lt;td&gt;524&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;3,197&lt;/td&gt;
&lt;td&gt;1263&lt;/td&gt;
&lt;td&gt;3,547&lt;/td&gt;
&lt;td&gt;568&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10240&lt;/td&gt;
&lt;td&gt;3,087&lt;/td&gt;
&lt;td&gt;1255&lt;/td&gt;
&lt;td&gt;3,484&lt;/td&gt;
&lt;td&gt;554&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Execution time dropped as memory increased, but only marginally above 4096 MB.&lt;/p&gt;

&lt;p&gt;Actual memory usage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chDB ≈ 1.2 GB&lt;/li&gt;
&lt;li&gt;DuckDB ≈ 550 MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Allocated memory increased, but real usage did not scale proportionally.&lt;/p&gt;




&lt;h3&gt;
  
  
  807MB File (1 Year)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;th&gt;chDB Time (ms)&lt;/th&gt;
&lt;th&gt;chDB Memory Used (MB)&lt;/th&gt;
&lt;th&gt;DuckDB Time (ms)&lt;/th&gt;
&lt;th&gt;DuckDB Memory Used (MB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3008&lt;/td&gt;
&lt;td&gt;27,170&lt;/td&gt;
&lt;td&gt;3001&lt;/td&gt;
&lt;td&gt;187,331&lt;/td&gt;
&lt;td&gt;2732&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;24,631&lt;/td&gt;
&lt;td&gt;3322&lt;/td&gt;
&lt;td&gt;188,880&lt;/td&gt;
&lt;td&gt;2767&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10240&lt;/td&gt;
&lt;td&gt;22,839&lt;/td&gt;
&lt;td&gt;3490&lt;/td&gt;
&lt;td&gt;189,678&lt;/td&gt;
&lt;td&gt;2788&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OOM occurred at 1024 MB and 2048 MB, most likely because the allocated memory could not hold the temporary buffers required during the Parquet → Arrow → Iceberg transformation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Did Performance Really Improve at 4096MB and 10GB?
&lt;/h2&gt;

&lt;p&gt;What we really wanted to check in this experiment was whether performance would actually improve &lt;strong&gt;once memory goes beyond 3008 MB&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  48MB (1-month data)
&lt;/h3&gt;

&lt;p&gt;Increasing memory from 3008 → 4096 → 10240 MB resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;chDB&lt;/strong&gt;: 3,369 → 3,197 → 3,087 ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDB&lt;/strong&gt;: 4,061 → 3,547 → 3,484 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Performance &lt;strong&gt;did improve&lt;/strong&gt;, but the gains were limited.&lt;br&gt;
In particular, the difference from 4096 → 10240 MB is &lt;strong&gt;almost negligible&lt;/strong&gt;, within the margin of error.&lt;/p&gt;

&lt;p&gt;Looking at &lt;strong&gt;Max Memory Used&lt;/strong&gt;, chDB only used about 1.2 GB and DuckDB about 550 MB, meaning increasing allocated memory &lt;strong&gt;did not increase actual usage&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  807MB (1-year data)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;chDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3008 MB: 27,170 ms&lt;/li&gt;
&lt;li&gt;4096 MB: 24,631 ms&lt;/li&gt;
&lt;li&gt;10240 MB: 22,839 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Overall, going from 3008 → 10240 MB improved performance by roughly &lt;strong&gt;16%&lt;/strong&gt;,&lt;br&gt;
but from 4096 → 10240 MB, the improvement was only about &lt;strong&gt;7%&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even though memory increased roughly &lt;strong&gt;3.4×&lt;/strong&gt;, performance only improved by ~16%, suggesting that performance is &lt;strong&gt;hitting a ceiling&lt;/strong&gt;.&lt;/p&gt;
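
&lt;p&gt;For reference, the percentages above can be reproduced from the measured chDB times with a few lines of Python. The helper name &lt;code&gt;improvement_pct&lt;/code&gt; is just an illustrative choice.&lt;/p&gt;

```python
# chDB execution times (ms) for the 807 MB file, taken from the table above.
chdb_ms = {3008: 27170, 4096: 24631, 10240: 22839}


def improvement_pct(before_ms, after_ms):
    """Relative reduction in execution time, as a percentage."""
    return (before_ms - after_ms) / before_ms * 100


overall = improvement_pct(chdb_ms[3008], chdb_ms[10240])
upper = improvement_pct(chdb_ms[4096], chdb_ms[10240])
memory_ratio = 10240 / 3008

print(f"3008 to 10240 MB: {overall:.1f}% faster")  # 15.9% faster
print(f"4096 to 10240 MB: {upper:.1f}% faster")    # 7.3% faster
print(f"Memory ratio: {memory_ratio:.1f}x")        # 3.4x
```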

&lt;p&gt;&lt;strong&gt;DuckDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3008 MB: 187,331 ms&lt;/li&gt;
&lt;li&gt;4096 MB: 188,880 ms&lt;/li&gt;
&lt;li&gt;10240 MB: 189,678 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;No improvement; execution time was in fact slightly longer at 4096 MB and 10240 MB.&lt;br&gt;
For this workload, simply increasing memory &lt;strong&gt;did not reduce execution time&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Analysis (Bottleneck Insights)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory behaves as a threshold parameter
&lt;/h3&gt;

&lt;p&gt;For Lambda × DuckDB/chDB, memory seems to behave &lt;strong&gt;more like a threshold than a proportional scaling parameter&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1024 MB and 2048 MB → OOM&lt;/li&gt;
&lt;li&gt;3008 MB → first point where processing succeeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond that, adding memory &lt;strong&gt;does not yield proportional performance gains&lt;/strong&gt;.&lt;br&gt;
This suggests the bottleneck is &lt;strong&gt;likely elsewhere&lt;/strong&gt;, not just compute resources.&lt;/p&gt;




&lt;h3&gt;
  
  
  Does increasing vCPU help?
&lt;/h3&gt;

&lt;p&gt;Lambda allocates CPU power in proportion to the configured memory, so higher memory settings also provide more vCPUs.&lt;br&gt;
However:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DuckDB barely scaled&lt;/li&gt;
&lt;li&gt;chDB only slightly improved&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Likely bottlenecks are &lt;strong&gt;I/O and serialization&lt;/strong&gt;, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading from S3&lt;/li&gt;
&lt;li&gt;Iceberg metadata operations&lt;/li&gt;
&lt;li&gt;Parquet → Arrow conversion&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Engine-specific behavior
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;chDB&lt;/strong&gt;: small improvements with more memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDB&lt;/strong&gt;: almost no change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This difference may be due to internal implementations or parallelization strategies.&lt;br&gt;
At least for this workload, simply going to 10 GB &lt;strong&gt;does not make DuckDB dramatically faster&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways for Large Workloads on Lambda
&lt;/h2&gt;

&lt;p&gt;From this experiment, a few points are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crossing the OOM threshold is the main goal.&lt;/li&gt;
&lt;li&gt;Memory beyond that should be considered carefully, especially for cost.&lt;/li&gt;
&lt;li&gt;Simply allocating 10 GB does not guarantee faster execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking at DuckDB results, it's clear that maxing out memory does not automatically make things faster.&lt;br&gt;
From a cost perspective, finding the “just enough” memory is more practical.&lt;/p&gt;
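
&lt;p&gt;To make the cost argument concrete, here is a small sketch that converts the 807 MB results into GB-seconds, the unit Lambda uses to bill compute (configured memory multiplied by duration). It deliberately leaves out per-GB-second prices, which vary by region and architecture, and it uses the DuckDB 4096 MB value from the results table. The function name &lt;code&gt;gb_seconds&lt;/code&gt; is an illustrative choice.&lt;/p&gt;

```python
# Execution times (ms) for the 807 MB file, taken from the results table.
results_ms = {
    "chDB": {3008: 27170, 4096: 24631, 10240: 22839},
    "DuckDB": {3008: 187331, 4096: 188880, 10240: 189678},
}


def gb_seconds(memory_mb, duration_ms):
    """Configured memory (GB) multiplied by execution time (s)."""
    return (memory_mb / 1024) * (duration_ms / 1000)


usage = {
    engine: {mem: gb_seconds(mem, ms) for mem, ms in runs.items()}
    for engine, runs in results_ms.items()
}

# Memory setting with the lowest GB-seconds for each engine.
cheapest = {engine: min(by_mem, key=by_mem.get) for engine, by_mem in usage.items()}

for engine, by_mem in usage.items():
    for mem, value in by_mem.items():
        print(f"{engine} @ {mem} MB: {value:.2f} GB-s")
print(cheapest)
```

&lt;p&gt;For chDB, the 10240 MB run consumes roughly 2.9 times the GB-seconds of the 3008 MB run in exchange for a ~16% shorter execution time. For both engines, the smallest configuration that avoids OOM is also the cheapest one here.&lt;/p&gt;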

&lt;p&gt;For more complex or larger workloads, sticking to Lambda may not be optimal; Glue or EMR could be faster and more stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we walked through applying for a Lambda memory quota increase and measured the performance of lightweight ETL tasks with the expanded memory.&lt;/p&gt;

&lt;p&gt;Both chDB and DuckDB are attractive open-source options, but they have significantly different characteristics. One clear takeaway is that crossing the OOM threshold should always be the first goal; beyond that, performance improvements will likely need to come from areas other than memory.&lt;/p&gt;

&lt;p&gt;This experiment reinforced that designing for maximum memory by default is not necessarily the best approach. It's more important to understand your workload and identify the critical memory boundaries.&lt;/p&gt;

&lt;p&gt;Also, keep in mind that this Lambda memory quota increase cannot be requested from the Service Quotas screen, which is worth knowing for both personal projects and professional work.&lt;/p&gt;

&lt;p&gt;While both engines are still evolving, understanding their characteristics and using them appropriately allows you to build simple yet highly extensible data processing workflows.&lt;/p&gt;

&lt;p&gt;I hope this article serves as a helpful reference for anyone considering lightweight data processing or real-time ETL with Iceberg tables.&lt;/p&gt;




</description>
      <category>aws</category>
      <category>duckdb</category>
      <category>chdb</category>
      <category>iceberg</category>
    </item>
  </channel>
</rss>
