Ankur Srivastava

Setting Up EMR with Iceberg support and Integration with DBT Core

Introduction

Amazon EMR (Elastic MapReduce) enables scalable big data processing on AWS. For modern data workflows, open table formats such as Apache Iceberg offer robust solutions for data lakes, supporting features like ACID transactions, schema evolution, and time travel. When paired with DBT Core and Spark, Iceberg unlocks powerful analytics and data transformation capabilities.
This guide provides a detailed walkthrough for setting up EMR with Iceberg support and integrating with DBT Core. It covers EMR cluster configuration, Spark Thriftserver setup, and DBT configurations, including model and profile files, and custom macro injection. By following these steps, you can ensure a seamless connection between your data lake infrastructure and DBT, leveraging the strengths of each component.

Key Steps Overview

• Enable Iceberg as part of the EMR cluster configuration.
• Start Spark Thriftserver with appropriate Iceberg and Glue Data Catalog configurations.
• Modify DBT configurations to integrate with Iceberg and Spark.
• Update DBT models for Iceberg compatibility.
• Edit profiles.yml for correct schema handling.
• Inject and configure a DBT macro for schema management.

Step 1: Enable Iceberg in EMR Cluster Configuration

To work with Apache Iceberg tables on your EMR cluster, you must enable Iceberg as part of the cluster’s initial configuration, or update the configuration for existing clusters.
• When launching a new cluster, add the following classification to enable Iceberg:

{
  "Classification": "iceberg-defaults",
  "Properties": {
    "iceberg.enabled": "true"
  }
}

• For existing clusters, propagate this configuration across all instance groups to ensure consistent Iceberg support, as shown in the screenshots below:

Image1

Image2

This configuration activates Iceberg functionalities on the EMR cluster, allowing Spark jobs to create, manage, and query Iceberg tables.
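If you are provisioning a new cluster from the command line, the same classification can be passed to aws emr create-cluster. The following is a minimal sketch; the cluster name, release label, instance settings, and roles are placeholders to adapt to your environment (any EMR release that bundles Iceberg support should work):

# Minimal sketch -- name, release label, and instance settings are placeholders.
aws emr create-cluster \
  --name "iceberg-dbt-demo" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations '[{"Classification":"iceberg-defaults","Properties":{"iceberg.enabled":"true"}}]'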

Step 2: Start Spark Thriftserver with Iceberg and Glue Catalog

The Spark Thriftserver allows external clients, including DBT, to interact with Spark SQL via JDBC/Thrift protocols. When working with Iceberg tables, it’s essential to configure Spark Thriftserver appropriately, including integration with AWS Glue as the metadata catalog and optional table-locking via DynamoDB.
• Start the Thriftserver with the following command, updating my_catalog and warehouse as needed:

sudo /usr/lib/spark/sbin/start-thriftserver.sh \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://data-apsoutheast3-117019135262/xldemo/icebergmart/ \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
--conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable

• The lock configuration using DynamoDB is optional but enhances consistency in concurrent environments.

Image3

• The my_catalog name is arbitrary—you might want to use an environment-specific name like dev_catalog for clarity.
• Remember: this catalog name is required whenever you reference an Iceberg table by its fully qualified schema in Spark SQL or DBT, as in the quick smoke test below.
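Before wiring up DBT, it helps to confirm the catalog works end to end. Below is an illustrative smoke test to run on the EMR primary node; the database, table, and temp-file names are made up, and port 10001 matches the profiles.yml shown later in this guide (adjust it if your Thriftserver listens on a different port):

# Write a small SQL script that exercises the Glue-backed Iceberg catalog
cat > /tmp/iceberg_smoke_test.sql <<'SQL'
-- Create a throwaway Iceberg table and read it back
CREATE DATABASE IF NOT EXISTS my_catalog.dev;
CREATE TABLE IF NOT EXISTS my_catalog.dev.smoke_test (id BIGINT, event_date DATE)
USING iceberg
PARTITIONED BY (event_date);
INSERT INTO my_catalog.dev.smoke_test VALUES (1, current_date());
SELECT * FROM my_catalog.dev.smoke_test;
SQL

# Run it against the Thriftserver over JDBC
beeline -u "jdbc:hive2://localhost:10001" -f /tmp/iceberg_smoke_test.sql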

Step 3: Modify DBT Configurations for Iceberg and Spark

Once your backend infrastructure supports Iceberg and Spark, DBT must be configured accordingly to build and manage Iceberg tables. This involves updating your model SQL files, profiles.yml, and injecting a custom macro for precise schema handling.

3.a Update Target Model for Iceberg Format

Each DBT model that should use Iceberg must specify this in the model configuration block. For example, in /root/credit_history/models/mart/credit/aggregated_credit_iceberg.sql:

{{
    config(
        materialized='incremental',
        file_format='iceberg',
        partition_by=['event_date'],
        incremental_strategy='append',
        schema='my_catalog.dev'
    )
}}

• Set file_format to iceberg.
• Optionally set location_root to an S3 path if you need to override where the table data is stored; otherwise the catalog's warehouse location is used.
• Specify partition_by for efficient table partitioning.
• Ensure the schema includes the Spark catalog name (e.g., my_catalog.dev).
This ensures that when DBT builds or refreshes your models, it creates or updates Iceberg tables with the expected configuration.
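For reference, the body beneath that config block could look something like the sketch below. The source and column names are hypothetical; the is_incremental() filter limits each run to event dates newer than what is already in the table:

select
    customer_id,
    event_date,
    sum(credit_amount) as total_credit
from {{ source('credit', 'credit_events') }}
{% if is_incremental() %}
  -- on incremental runs, only append partitions newer than what is already loaded
  where event_date > (select max(event_date) from {{ this }})
{% endif %}
group by customer_id, event_date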

3.b Update profiles.yml for Schema Handling

DBT’s default behavior may concatenate schema names from profiles.yml and the model files, which can lead to undesired naming conventions, especially with Spark and Iceberg.
To control schema generation:
• Set the schema key to an empty string for your default output (dev), and explicitly state the schema for catalog-specific outputs.
Sample profiles.yml configuration:

dbt_project:
  outputs:
    dev:
      host: <DNS name>
      method: thrift
      port: 10001
      schema: ""
      threads: 1
      type: spark
    my_catalog.dev:
      host: <DNS name>
      method: thrift
      schema: my_catalog.dev
      port: 10001
      threads: 1
      type: spark
  target: dev

• This setup ensures that if a schema is provided in the model config, it will be used as-is, preventing unwanted prefixing or concatenation.
• For environment-specific configurations, you may add similar blocks for prod, test, etc., updating hosts and schemas as needed.
[NOTE] DNS Name can be obtained from the EMR cluster details as shown below:

Image5
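Alternatively, the primary node's public DNS name can be fetched with the AWS CLI (the cluster ID below is a placeholder):

aws emr describe-cluster \
  --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.MasterPublicDnsName' \
  --output text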

3.c Inject Custom DBT Macro for Schema Name Control

DBT allows macro-overrides to customize core behaviors. The generate_schema_name.sql macro controls how DBT assigns schema names to target tables.
Create a macro file named generate_schema_name.sql in your project's macros/ directory with the following content:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

• This macro checks if a custom schema is provided in the model file. If not, it defaults to the schema defined in profiles.yml.
• If the schema in profiles.yml is empty (""), the macro uses the schema from the model configuration, giving you full control over the table location.
• Ensure every model file has the correct schema configuration (e.g., schema='my_catalog.dev'), as jobs will now rely on this value; a quick end-to-end check follows below.
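With the macro injected and the model and profile updated, a quick check from the DBT project directory confirms the wiring (the model name matches the example from Step 3.a):

# Verify DBT can reach the Spark Thriftserver using the dev target
dbt debug --target dev

# Build just the Iceberg model; the first run creates the table,
# later runs append new partitions via the incremental strategy
dbt run --select aggregated_credit_iceberg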

Additional Recommendations

• Testing: After setting up, test model builds and incremental loads to verify Iceberg table creation and data consistency.
• Security: Configure IAM roles and S3 bucket permissions so that Spark and DBT can read and write Iceberg data (a policy sketch follows this list).
• Observability: Monitor EMR, Spark, and Glue logs for troubleshooting and performance tuning.
• Scalability: Use EMR autoscaling features to adapt to workload demands, ensuring efficient resource use.
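As a rough starting point for the Security item above, the cluster's instance profile needs S3 access to the warehouse prefix, Glue Data Catalog permissions, and, if the DynamoDB lock manager is enabled, access to the lock table. The following is a minimal, non-exhaustive sketch using the bucket and lock table names from the earlier examples; tighten the resources and actions for production:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IcebergWarehouseS3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::data-apsoutheast3-117019135262",
        "arn:aws:s3:::data-apsoutheast3-117019135262/xldemo/icebergmart/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase", "glue:GetDatabases", "glue:CreateDatabase",
        "glue:GetTable", "glue:GetTables", "glue:CreateTable",
        "glue:UpdateTable", "glue:DeleteTable"
      ],
      "Resource": "*"
    },
    {
      "Sid": "IcebergLockTableAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable", "dynamodb:DescribeTable", "dynamodb:GetItem",
        "dynamodb:PutItem", "dynamodb:UpdateItem", "dynamodb:DeleteItem", "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/myGlueLockTable"
    }
  ]
}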

Conclusion

Configuring EMR with Iceberg, Spark, and DBT Core empowers your organization to build reliable, scalable, and future-proof data pipelines. By following this guide—enabling Iceberg, configuring Spark and Glue Catalog, and adapting DBT settings—you gain powerful control over table formats, schema definitions, and transformation workflows. This setup provides the foundation for robust analytics, flexible data modeling, and streamlined lakehouse operations.
