<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ankur Srivastava</title>
    <description>The latest articles on DEV Community by Ankur Srivastava (@ankursri).</description>
    <link>https://dev.to/ankursri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2775090%2F80c238f4-4e07-457e-a01a-beb69259ab3e.png</url>
      <title>DEV Community: Ankur Srivastava</title>
      <link>https://dev.to/ankursri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ankursri"/>
    <language>en</language>
    <item>
      <title>Setting Up EMR with Iceberg support and Integration with DBT Core</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Tue, 05 Aug 2025 14:44:26 +0000</pubDate>
      <link>https://dev.to/ankursri/setting-up-emr-with-iceberg-support-and-integration-with-dbt-core-180i</link>
      <guid>https://dev.to/ankursri/setting-up-emr-with-iceberg-support-and-integration-with-dbt-core-180i</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon EMR (Elastic MapReduce) enables scalable big data processing on AWS. For modern data workflows, open table formats such as Apache Iceberg offer robust solutions for data lakes, supporting features like ACID transactions, schema evolution, and time travel. When paired with DBT Core and Spark, Iceberg unlocks powerful analytics and data transformation capabilities.&lt;br&gt;
This guide provides a detailed walkthrough for setting up EMR with Iceberg support and integrating with DBT Core. It covers EMR cluster configuration, Spark Thriftserver setup, and DBT configurations, including model and profile files, and custom macro injection. By following these steps, you can ensure a seamless connection between your data lake infrastructure and DBT, leveraging the strengths of each component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Steps Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;• Enable Iceberg as part of the EMR cluster configuration.&lt;br&gt;
• Start Spark Thriftserver with appropriate Iceberg and Glue Data Catalog configurations.&lt;br&gt;
• Modify DBT configurations to integrate with Iceberg and Spark.&lt;br&gt;
• Update DBT models for Iceberg compatibility.&lt;br&gt;
• Edit profiles.yml for correct schema handling.&lt;br&gt;
• Inject and configure a DBT macro for schema management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Enable Iceberg in EMR Cluster Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To work with Apache Iceberg tables on your EMR cluster, you must enable Iceberg as part of the cluster’s initial configuration, or update the configuration for existing clusters.&lt;br&gt;
• When launching a new cluster, add the following classification to enable Iceberg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"Classification": "iceberg-defaults",
"Properties": {
"iceberg.enabled": "true"
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
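&lt;p&gt;As a rough illustration, the same classification can be supplied programmatically when launching a new cluster with boto3 (the cluster name, release label, and instance settings below are placeholders, not values from this guide):&lt;/p&gt;

```python
# Hedged sketch: passing the iceberg-defaults classification to a new EMR
# cluster via boto3's run_job_flow. All names and sizes are illustrative.
ICEBERG_CLASSIFICATION = {
    "Classification": "iceberg-defaults",
    "Properties": {"iceberg.enabled": "true"},
}

def build_configurations():
    """Configurations list handed to run_job_flow."""
    return [ICEBERG_CLASSIFICATION]

def launch_cluster():
    # Requires AWS credentials and EMR permissions; shown for illustration.
    import boto3
    emr = boto3.client("emr")
    return emr.run_job_flow(
        Name="iceberg-demo",        # placeholder
        ReleaseLabel="emr-7.1.0",   # placeholder
        Configurations=build_configurations(),
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
```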



&lt;p&gt;• For existing clusters, propagate this configuration across all instance groups to ensure consistent Iceberg support, as shown in the screenshots below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb8vkjnpw2nhjwj3h274.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzb8vkjnpw2nhjwj3h274.png" alt="Image1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ygxc83sq0lpjbpp76n0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ygxc83sq0lpjbpp76n0.png" alt="Image2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This configuration activates Iceberg functionalities on the EMR cluster, allowing Spark jobs to create, manage, and query Iceberg tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Start Spark Thriftserver with Iceberg and Glue Catalog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Spark Thriftserver allows external clients, including DBT, to interact with Spark SQL via JDBC/Thrift protocols. When working with Iceberg tables, it’s essential to configure Spark Thriftserver appropriately, including integration with AWS Glue as the metadata catalog and optional table-locking via DynamoDB.&lt;br&gt;
• Start the Thriftserver with the following command, updating my_catalog and warehouse as needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo /usr/lib/spark/sbin/start-thriftserver.sh \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://data-apsoutheast3-117019135262/xldemo/icebergmart/ \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
--conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;• The lock configuration using DynamoDB is optional but enhances consistency in concurrent environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu05wv06l90x03d4ap8om.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu05wv06l90x03d4ap8om.png" alt="Image3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• The my_catalog name is arbitrary—you might want to use an environment-specific name like dev_catalog for clarity.&lt;br&gt;
• Remember: this catalog name is required when specifying the fully qualified schema for Iceberg tables in Spark and DBT.&lt;/p&gt;
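&lt;p&gt;To make the naming pattern explicit, the catalog-scoped properties from the command above can be generated for any catalog name (a sketch for illustration; the warehouse path and lock table passed in below are placeholders):&lt;/p&gt;

```python
# Sketch: build the spark.sql.catalog.* properties used by the
# start-thriftserver command, keyed on an arbitrary catalog name.
def iceberg_catalog_conf(name, warehouse, lock_table=None):
    prefix = "spark.sql.catalog." + name
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        prefix + ".warehouse": warehouse,
        prefix + ".catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        prefix + ".io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }
    if lock_table:
        # Optional DynamoDB-based table locking for concurrent writers.
        conf[prefix + ".lock-impl"] = (
            "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager"
        )
        conf[prefix + ".lock.table"] = lock_table
    return conf

# Render as the --conf flags seen in the shell command above.
flags = " ".join(
    "--conf {}={}".format(k, v)
    for k, v in iceberg_catalog_conf(
        "dev_catalog", "s3://my-bucket/warehouse/", "myGlueLockTable"
    ).items()
)
```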

&lt;p&gt;&lt;strong&gt;Step 3: Modify DBT Configurations for Iceberg and Spark&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once your backend infrastructure supports Iceberg and Spark, DBT must be configured accordingly to build and manage Iceberg tables. This involves updating your model SQL files, profiles.yml, and injecting a custom macro for precise schema handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.a Update Target Model for Iceberg Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each DBT model that should use Iceberg must specify this in the model configuration block. For example, in /root/credit_history/models/mart/credit/aggregated_credit_iceberg.sql:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{
config(
materialized='incremental',
file_format='iceberg',
partition_by=['event_date'],
incremental_strategy='append',
schema='my_catalog.dev'
)
}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;• Set file_format to iceberg.&lt;br&gt;
• Optionally set location_root to the S3 path where Iceberg data should be stored (not shown in the example above).&lt;br&gt;
• Specify partition_by for efficient table partitioning.&lt;br&gt;
• Ensure the schema includes the Spark catalog name (e.g., my_catalog.dev).&lt;br&gt;
This ensures that when DBT builds or refreshes your models, it creates or updates Iceberg tables with the expected configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.b Update profiles.yml for Schema Handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DBT’s default behavior may concatenate schema names from profiles.yml and the model files, which can lead to undesired naming conventions, especially with Spark and Iceberg.&lt;br&gt;
To control schema generation:&lt;br&gt;
• Set the schema key to an empty string for your default output (dev), and explicitly state the schema for catalog-specific outputs.&lt;br&gt;
Sample profiles.yml configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dbt_project:
outputs:
dev:
host: &amp;lt;DNS name&amp;gt;
method: thrift
port: 10001
schema: ""
threads: 1
type: spark
my_catalog.dev:
host: &amp;lt;DNS Name&amp;gt;
method: thrift
schema: my_catalog.dev
port: 10001
threads: 1
type: spark
target: dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;• This setup ensures that if a schema is provided in the model config, it will be used as-is, preventing unwanted prefixing or concatenation.&lt;br&gt;
• For environment-specific configurations, you may add similar blocks for prod, test, etc., updating hosts and schemas as needed.&lt;br&gt;
[NOTE] DNS Name can be obtained from the EMR cluster details as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdrr9kzwu72n5ctcyhdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdrr9kzwu72n5ctcyhdf.png" alt="Image5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.c Inject Custom DBT Macro for Schema Name Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DBT allows macro-overrides to customize core behaviors. The generate_schema_name.sql macro controls how DBT assigns schema names to target tables.&lt;br&gt;
Create a macro named generate_schema_name.sql with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;• This macro checks if a custom schema is provided in the model file. If not, it defaults to the schema defined in profiles.yml.&lt;br&gt;
• If the schema in profiles.yml is empty (""), the macro uses the schema from the model configuration, giving you full control over the table location.&lt;br&gt;
• Ensure every model file has the correct schema configuration (e.g., schema='my_catalog.dev'), as jobs will now rely on this value.&lt;/p&gt;
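&lt;p&gt;The macro's branching is easy to verify in isolation; here is a Python analogue of the same logic (for illustration only, since dbt itself runs the Jinja macro):&lt;/p&gt;

```python
# Python analogue of the generate_schema_name macro above: use the model's
# schema verbatim when provided, otherwise fall back to target.schema.
def generate_schema_name(custom_schema_name, default_schema):
    if custom_schema_name is None:
        return default_schema
    return custom_schema_name.strip()
```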

&lt;p&gt;&lt;strong&gt;Additional Recommendations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;• Testing: After setting up, test model builds and incremental loads to verify Iceberg table creation and data consistency.&lt;br&gt;
• Security: Configure IAM roles and S3 bucket permissions to permit Spark and DBT to read/write Iceberg data.&lt;br&gt;
• Observability: Monitor EMR, Spark, and Glue logs for troubleshooting and performance tuning.&lt;br&gt;
• Scalability: Use EMR autoscaling features to adapt to workload demands, ensuring efficient resource use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Configuring EMR with Iceberg, Spark, and DBT Core empowers your organization to build reliable, scalable, and future-proof data pipelines. By following this guide—enabling Iceberg, configuring Spark and Glue Catalog, and adapting DBT settings—you gain powerful control over table formats, schema definitions, and transformation workflows. This setup provides the foundation for robust analytics, flexible data modeling, and streamlined lakehouse operations.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Optimization Techniques in dbt Core with Spark on EMR and dbt Core with Glue</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Wed, 02 Jul 2025 13:57:32 +0000</pubDate>
      <link>https://dev.to/ankursri/optimization-techniques-in-dbt-core-with-spark-on-emr-and-dbt-core-with-glue-38bd</link>
      <guid>https://dev.to/ankursri/optimization-techniques-in-dbt-core-with-spark-on-emr-and-dbt-core-with-glue-38bd</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In the world of data engineering, selecting the right tools and frameworks is crucial for efficient data processing and transformation. dbt (data build tool) has become a popular choice for data transformation, providing a simple interface to manage and deploy SQL-based transformations. In this article, we'll explore optimization techniques for dbt Core when used with Apache Spark on Amazon EMR (Elastic MapReduce) and with AWS Glue. We'll also discuss when to use each option, their pros and cons, cost benefits, and how to configure the profiles.yml file for optimized results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt Core with Spark on EMR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimization Techniques&lt;/strong&gt;&lt;br&gt;
Optimizing dbt Core with Spark on EMR involves several techniques to improve performance and efficiency:&lt;br&gt;
• &lt;strong&gt;Cluster Configuration&lt;/strong&gt;: Choose appropriate EMR instance types and cluster configurations based on workload requirements. &lt;br&gt;
E.g., to process roughly 18 GB of data in each run, select a cluster of at least 4 nodes/vCPUs (1 node reserved for YARN + 3 usable nodes) with a minimum of 2 GB of memory per node, so an r- or m-series instance family is a good choice.&lt;br&gt;
• &lt;strong&gt;Data Partitioning&lt;/strong&gt;: Partition large datasets to enable parallel processing and reduce shuffle operations. This can significantly speed up data transformations. Partitioning should be done on column with low cardinality and mostly used for filters or joining condition which limits the number of partitions scanned. Eg: date column for sales related data which can be defined in the dbt model’s config parameter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf6i172t07zqyzhfi7hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf6i172t07zqyzhfi7hm.png" alt="data partition" width="344" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Efficient Data Formats&lt;/strong&gt;: Use columnar data formats like Parquet or ORC, which are optimized for read performance and compression.&lt;br&gt;
• &lt;strong&gt;Resource Allocation&lt;/strong&gt;: Allocate resources dynamically based on the workload. Use auto-scaling features to adjust the cluster size according to the processing demands. At dbt model level you can enable dynamic allocation: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqm50hvozcedn10aqvzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqm50hvozcedn10aqvzb.png" alt="resource allocation" width="443" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to Use Spark on EMR&lt;/strong&gt;&lt;br&gt;
Spark on EMR is suitable when dealing with massive datasets that require distributed processing and advanced analytics. It is ideal for:&lt;br&gt;
• Big data processing with high throughput and low latency.&lt;br&gt;
• Running complex machine learning algorithms and iterative data processing tasks.&lt;br&gt;
• Workloads which require significant computational resources and custom spark configurations.&lt;br&gt;
• When reading data directly from a file system such as S3 on AWS or the local file system. For example, a Parquet file stored in an S3 bucket can be queried in the following format:&lt;br&gt;
SELECT &lt;br&gt;
col1,&lt;br&gt;
col2,&lt;br&gt;
col3&lt;br&gt;
FROM parquet.&lt;code&gt;s3://&amp;lt;bucket name&amp;gt;/&amp;lt;folder name&amp;gt;/&amp;lt;file name&amp;gt;&lt;/code&gt;&lt;/p&gt;
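&lt;p&gt;A small helper can assemble the query shown above for any file (the bucket and key used in the example are placeholders):&lt;/p&gt;

```python
# Sketch: build the Spark SQL statement for reading a Parquet file
# directly from S3, in the format shown above.
def parquet_s3_query(bucket, key, columns):
    cols = ",\n  ".join(columns)
    return "SELECT\n  {}\nFROM parquet.`s3://{}/{}`".format(cols, bucket, key)
```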

&lt;p&gt;&lt;strong&gt;Pros and Cons of Spark on EMR&lt;/strong&gt;&lt;br&gt;
Pros:&lt;br&gt;
• Scalability: Easily scale up or down based on workload requirements.&lt;br&gt;
• Flexibility: Customize cluster configurations and use various instance types.&lt;br&gt;
• Advanced Analytics: Run sophisticated analytics and machine learning workloads.&lt;br&gt;
Cons:&lt;br&gt;
• Cost: High computational resources can lead to increased costs.&lt;br&gt;
• Complexity: Requires expertise in managing and optimizing Spark clusters.&lt;br&gt;
• No instant scalability: scaling out takes time because new EC2 instances must be spun up, which can cause delays or job failures while YARN waits for executors that are not yet available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Benefits&lt;/strong&gt;&lt;br&gt;
The cost benefits of using Spark on EMR depend on the workload's complexity and scale. EMR's ability to auto-scale and use spot instances (recommended only for short-lived processing) can help reduce costs. However, high computational requirements can still lead to significant expenses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring profiles.yml for Spark on EMR&lt;/strong&gt;&lt;br&gt;
To configure the profiles.yml file for dbt with Spark on EMR, use a configuration like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[profile name]:
target: dev
outputs:
dev:
type: spark
method: thrift
host: [your-emr-cluster-master-dns]
port: 10000
user: [your-user]
schema: [your-schema]
database: [your-database]
connect_timeout: 30[can be increased or decreased]
connect_retries: 2[can be increased or decreased]
threads: 3[can be increased or decreased for parallel processing]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;dbt Core with AWS Glue&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Optimization Techniques&lt;/strong&gt;&lt;br&gt;
Optimizing dbt Core with AWS Glue involves several techniques to improve performance and efficiency:&lt;br&gt;
• &lt;strong&gt;Job Bookmarking&lt;/strong&gt;: Enable job bookmarking to track data processing and only process new or changed data, reducing redundant computations.&lt;br&gt;
• &lt;strong&gt;Dynamic Frames&lt;/strong&gt;: Use AWS Glue Dynamic Frames to handle semi-structured data efficiently and perform schema transformations.&lt;br&gt;
• &lt;strong&gt;Partition Pruning&lt;/strong&gt;: Use partition pruning to limit the data scanned during processing, improving performance for large datasets.&lt;br&gt;
• &lt;strong&gt;Parallel Processing&lt;/strong&gt;: Configure Glue jobs to run in parallel, leveraging multiple workers to speed up data transformations.&lt;br&gt;
• &lt;strong&gt;Efficient Data Formats&lt;/strong&gt;: Use optimized file formats like Parquet or ORC for improved read and write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to Use AWS Glue&lt;/strong&gt;&lt;br&gt;
AWS Glue is suitable for serverless data processing and ETL tasks, especially when ease of use and integration with other AWS services are essential. It is ideal for:&lt;br&gt;
• Serverless ETL operations with minimal infrastructure management.&lt;br&gt;
• Integrating with other AWS services like S3, Redshift, RDS, etc.&lt;br&gt;
• Processing small datasets.&lt;br&gt;
• Data cataloging and metadata management.&lt;br&gt;
• Incremental data loads or merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros and Cons of AWS Glue&lt;/strong&gt;&lt;br&gt;
Pros:&lt;br&gt;
• Simplicity: Serverless architecture eliminates infrastructure management.&lt;br&gt;
• Integration: Seamlessly integrates with other AWS services.&lt;br&gt;
• Scalability: Automatically scale up or down based on workload requirements.&lt;br&gt;
Cons:&lt;br&gt;
• Cost: Can be expensive for large-scale data processing.&lt;br&gt;
• Limited Control: Limited customization options compared to Spark on EMR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Benefits&lt;/strong&gt;&lt;br&gt;
AWS Glue offers cost benefits for small to medium-scale ETL tasks due to its serverless nature, eliminating the need for dedicated infrastructure. However, for large-scale data processing, costs can escalate based on the amount of data processed and the number of worker nodes used. dbt with Glue runs Spark SQL, which leverages the auto-optimization of the latest Glue version (currently 5.0, claimed to be up to 10x faster than older versions).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring profiles.yml for AWS Glue&lt;/strong&gt;&lt;br&gt;
To configure the profiles.yml file for dbt with AWS Glue, use a configuration like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[profile name]:
target: dev
outputs:
dev:
type: glue
role_arn: [ARN of the role assigned to glue interactive session]
region: [region where glue interactive session need to be created and used]
glue_version: “5.0”[use latest version]
workers: 4[increased or decreased on the basis of compute requirement]
worker_type: G.1X [can be changed on the basis of requirement]
database: [your-database]
schema: [your-schema]
catalog: [your-catalog]
glue_session_reuse: True [set to true in order to reuse existing glue interactive session]
location: [ARN of the s3 location to be used by the interactive session]
datalake_formats: "iceberg"
default_arguments: "--enable-metrics=true" [required to enable metrics in cloudwatch]
conf: --conf &amp;lt;spark configuration name and its value&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Choosing between dbt Core with Spark on EMR and dbt Core with AWS Glue depends on your specific use case, data processing requirements, and budget constraints. Spark on EMR is ideal for large-scale, complex data processing tasks, while AWS Glue offers a serverless, easy-to-use solution for ETL operations, with the option of incremental data loads for the Iceberg table format. Understanding the pros and cons, cost benefits, and configuration details for both options will help you make an informed decision and optimize your data transformation workflows.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running Apache Flink Application on AWS EC2 Instance using flink SQL</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Tue, 24 Jun 2025 12:01:05 +0000</pubDate>
      <link>https://dev.to/ankursri/setting-up-apache-flink-on-aws-ec2-instance-1pnb</link>
      <guid>https://dev.to/ankursri/setting-up-apache-flink-on-aws-ec2-instance-1pnb</guid>
      <description>&lt;p&gt;&lt;strong&gt;Reading and Writing Data with S3 as source and sink&lt;/strong&gt;&lt;br&gt;
Apache Flink is a powerful stream processing framework that allows for real-time data processing. &lt;br&gt;
Setting up Apache Flink with Java can be quite challenging and time-consuming. The process involves configuring numerous dependencies, handling complex coding tasks, and ensuring compatibility between various components. Moreover, you need to be adept at managing Java environments and troubleshooting issues that arise during setup and execution.&lt;br&gt;
To alleviate these challenges, we offer a simplified approach utilizing Flink SQL. Flink SQL provides a more intuitive and streamlined way to interact with data, reducing the complexity inherent in programming with Java. By leveraging Flink SQL, users can efficiently run queries and manage data processing tasks with minimal setup and configuration.&lt;br&gt;
This guide will walk you through setting up Apache Flink on an EC2 instance to read data from S3 as a source and save data to another location in S3 using the Flink S3 connector and simple SQL queries.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Pre-requisites&lt;/strong&gt;&lt;br&gt;
• An AWS account&lt;br&gt;
• Access to EC2 and S3 services&lt;br&gt;
• An EC2 instance with sufficient resources&lt;br&gt;
• Basic knowledge of Python and AWS&lt;br&gt;
• Apache Flink installed on your local machine for testing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Launch EC2 Instance&lt;/strong&gt;&lt;br&gt;
• Log in to AWS Management Console&lt;br&gt;
• Navigate to EC2 Dashboard and click "Launch Instance"&lt;br&gt;
• Select an appropriate Amazon Machine Image (AMI, e.g. Amazon Linux) and instance type (preferably t3.2xlarge, as it has 8 cores and 32 GB of memory)&lt;br&gt;
• Configure instance details, add storage (EBS), and tag the instance&lt;br&gt;
• Configure the security group to allow the necessary ports (add an inbound rule allowing your local machine's IP on port 8081, which is used to view the Flink dashboard from a local browser)&lt;br&gt;
• Review and launch the instance&lt;br&gt;
&lt;strong&gt;Install Apache Flink and Other dependencies&lt;/strong&gt;&lt;br&gt;
• SSH into your EC2 instance&lt;br&gt;
• Download the Apache Flink tarball (preferably the latest version) from the official website link (wget &lt;a href="https://www.apache.org/dyn/closer.lua/flink/flink-1.19.2/flink-1.19.2-bin-scala_2.12.tgz" rel="noopener noreferrer"&gt;https://www.apache.org/dyn/closer.lua/flink/flink-1.19.2/flink-1.19.2-bin-scala_2.12.tgz&lt;/a&gt;)&lt;br&gt;
• Extract the tarball (tar -xvzf flink-1.19.2-bin-scala_2.12.tgz) and move the files to a suitable directory (mv flink-1.19.2 /mnt/)&lt;br&gt;
• Set up Java version 11.x or above&lt;br&gt;
List available Java versions: sudo yum list available | grep java&lt;br&gt;
Install the right Java from the list: sudo yum install java-11-openjdk-devel&lt;br&gt;
Check that Java is installed: java -version&lt;br&gt;
    openjdk version "11.0.27" 2025-04-15 LTS&lt;br&gt;
OpenJDK Runtime Environment Corretto-11.0.27.6.1 (build 11.0.27+6-LTS)&lt;br&gt;
OpenJDK 64-Bit Server VM Corretto-11.0.27.6.1 (build 11.0.27+6-LTS, mixed mode)&lt;br&gt;
• Install the latest python3 version&lt;br&gt;
Install Python: sudo yum install python3&lt;br&gt;
If Python 3 is installed but not accessible via the python command, create a symlink: sudo ln -s /usr/bin/python3 /usr/bin/python&lt;br&gt;
Check that /usr/bin is part of the output: echo $PATH&lt;br&gt;
Check the Python version: python --version&lt;br&gt;
    Python 3.9.22&lt;br&gt;
• Install Pyflink library: pip3 install apache-flink&lt;br&gt;
• Check that AWS services such as S3 are accessible from the EC2 instance. If not, define your user's ACCESS_KEY and SECRET_KEY in the AWS credentials file, or use default credentials. In this case we used the default credentials of the IAM role assigned to the EC2 instance.&lt;br&gt;
• Download the flink-s3-fs-hadoop plugin (same version as Apache Flink) from the Flink distribution or the official Flink website&lt;br&gt;
Go to the plugins directory: cd /mnt/flink-1.19.2/plugins/&lt;br&gt;
Download the plugin: wget &lt;a href="https://repo1.maven.org/maven2/org/apache/flink/flink-s3-fs-hadoop/1.19.2/flink-s3-fs-hadoop-1.19.2.jar" rel="noopener noreferrer"&gt;https://repo1.maven.org/maven2/org/apache/flink/flink-s3-fs-hadoop/1.19.2/flink-s3-fs-hadoop-1.19.2.jar&lt;/a&gt;&lt;br&gt;
Make a directory to store the jar: mkdir flink-s3-fs-hadoop&lt;br&gt;
Move the jar file into the new directory: mv flink-s3-fs-hadoop-1.19.2.jar flink-s3-fs-hadoop/&lt;br&gt;
• Modify the Flink config file to allow viewing the Flink dashboard when the cluster starts.&lt;br&gt;
Open the config file: vi conf/config.yaml&lt;br&gt;
Modify rest.bind-address: replace localhost with 0.0.0.0&lt;br&gt;
• Modify the Flink config file to add the following S3 settings at the end of the file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3.endpoint: s3.ap-southeast-3.amazonaws.com  # Default endpoint for AWS S3
s3a.credentials.provider: org.apache.hadoop.fs.s3a.DefaultAWSCredentialsProviderChain
s3.connection.timeout: 5000  # in milliseconds
s3.request.timeout: 10000  # in milliseconds
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;• Start the Flink cluster (./bin/start-cluster.sh)&lt;br&gt;
• Once started, verify that you can log in to the Flink dashboard using the URL: http://&amp;lt;EC2 public DNS&amp;gt;:8081/&lt;br&gt;
• Below is how the dashboard looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx7n4xuqozicgk8a6out.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx7n4xuqozicgk8a6out.png" alt="flink dashboard" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check Environment Readiness&lt;/strong&gt;&lt;br&gt;
• Verify Java installation: java -version&lt;br&gt;
• Ensure the Flink cluster is up and running: ps aux | grep flink&lt;br&gt;
• Confirm connectivity to S3 buckets: aws s3 ls&lt;br&gt;
&lt;strong&gt;Creating a Python Job&lt;/strong&gt;&lt;br&gt;
• Import pyflink library for environment creation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz460ly6iq5ke5zjv3dzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz460ly6iq5ke5zjv3dzd.png" alt="labrary" width="800" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• Create the execution environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ip89u90qog0pvx814fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ip89u90qog0pvx814fz.png" alt="execution environment" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• Configure the S3 access (preferably the default credentials of the IAM role; other options can also be used):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z4gnupihkjp8d89jv4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z4gnupihkjp8d89jv4p.png" alt="s3 access" width="800" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• Create source table using flink SQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7itxt5pnnv9ryqt6ne2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7itxt5pnnv9ryqt6ne2.png" alt="source" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• Create the sink table using Flink SQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd5rzo6zrio5m9xs181z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd5rzo6zrio5m9xs181z.png" alt="sink" width="734" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• Transfer the data from source to sink using Flink SQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu204124fzigrkur2pkyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu204124fzigrkur2pkyp.png" alt="data write" width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• Test the script on EC2 using the command: ./bin/flink run -py s3_flink_job.py&lt;br&gt;
• Once the job is submitted successfully, view the logs on the Flink dashboard in the browser; successful and failed runs appear in the “Completed Jobs” section.&lt;/p&gt;
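&lt;p&gt;Putting the steps above together, a minimal sketch of s3_flink_job.py could look like the following. This is illustrative only: the bucket name, paths, and schema are assumptions, and the flink-s3-fs-hadoop plugin plus IAM-role credentials are presumed to be in place as configured earlier.&lt;/p&gt;

```python
# s3_flink_job.py -- illustrative sketch; bucket name, paths, and schema are assumptions.
def build_statements(bucket: str) -> list:
    """Return the Flink SQL statements for the source table, sink table, and transfer."""
    source_ddl = f"""
        CREATE TABLE source_table (
            id BIGINT,
            name STRING,
            amount DOUBLE
        ) WITH (
            'connector' = 'filesystem',
            'path' = 's3a://{bucket}/input/',
            'format' = 'csv'
        )"""
    sink_ddl = f"""
        CREATE TABLE sink_table (
            id BIGINT,
            name STRING,
            amount DOUBLE
        ) WITH (
            'connector' = 'filesystem',
            'path' = 's3a://{bucket}/output/',
            'format' = 'parquet'
        )"""
    insert_dml = "INSERT INTO sink_table SELECT id, name, amount FROM source_table"
    return [source_ddl, sink_ddl, insert_dml]

if __name__ == "__main__":
    # pyflink is only needed when actually submitting the job on the cluster.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
    source_ddl, sink_ddl, insert_dml = build_statements("my-data-bucket")  # hypothetical bucket
    t_env.execute_sql(source_ddl)        # register the S3 source
    t_env.execute_sql(sink_ddl)          # register the S3 sink
    t_env.execute_sql(insert_dml).wait() # run the source-to-sink transfer
```

&lt;p&gt;The script is then submitted exactly as in the test command above.&lt;/p&gt;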

&lt;p&gt;&lt;strong&gt;Running the Job Using a Managed Flink Cluster&lt;/strong&gt;&lt;br&gt;
• Package your Python script and its dependencies (a ZIP archive for Python applications; a JAR for Java/Scala)&lt;br&gt;
• Upload the package to an S3 location&lt;br&gt;
• Create an Apache Flink application in Amazon Managed Service for Apache Flink&lt;br&gt;
• Once the application is created, configure it with the package path and bucket details&lt;br&gt;
• Run the job from the console once all the configurations are completed and verified&lt;br&gt;
• Monitor the job submission and execution&lt;br&gt;
&lt;strong&gt;Monitoring the Job and Handling Errors&lt;/strong&gt;&lt;br&gt;
• Access the Flink dashboard to monitor the job&lt;br&gt;
• Check the task manager logs to debug issues&lt;br&gt;
• Implement retry and error-handling mechanisms&lt;/p&gt;
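&lt;p&gt;As one hedged example of a retry mechanism, Flink&#39;s fixed-delay restart strategy can be set through configuration keys (the attempt count and delay below are illustrative; newer Flink versions also accept the key restart-strategy.type):&lt;/p&gt;

```python
# Hedged sketch: fixed-delay restart strategy for automatic retries on failure.
def restart_config(attempts: int, delay_s: int) -> dict:
    """Flink configuration keys enabling a fixed-delay restart strategy."""
    return {
        "restart-strategy": "fixed-delay",
        "restart-strategy.fixed-delay.attempts": str(attempts),
        "restart-strategy.fixed-delay.delay": f"{delay_s} s",
    }

if __name__ == "__main__":
    # pyflink is only needed on the cluster itself.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    conf = t_env.get_config().get_configuration()
    for key, value in restart_config(3, 10).items():  # retry 3 times, 10 s apart
        conf.set_string(key, value)
```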

</description>
    </item>
    <item>
      <title>Implementation of Data Archival Solution with GenAI</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Fri, 02 May 2025 13:33:16 +0000</pubDate>
      <link>https://dev.to/ankursri/implementation-of-data-archival-solution-with-genai-4i06</link>
      <guid>https://dev.to/ankursri/implementation-of-data-archival-solution-with-genai-4i06</guid>
<description>&lt;p&gt;Organizations are now modernizing their legacy systems by creating automated, AI-enabled platforms. One major challenge is handling efficient data preparation and archival, the absence of which leads to data loss, skewness, and costlier reporting and analytics.&lt;br&gt;
Wipro, an AWS Premier Consulting Partner and Managed Service Provider (MSP), addresses these challenges by delivering cloud data archival solutions using GenAI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI in AWS&lt;/strong&gt;&lt;br&gt;
AWS Bedrock is a fully managed GenAI service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon through a single API. It provides a broad set of capabilities needed to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.&lt;br&gt;
Amazon Bedrock can be used to generate code in various stages of the software development lifecycle (SDLC). It allows developers to create their own systems to augment, write, and audit code by using models within Amazon Bedrock instead of relying on out-of-the-box coding assistants.&lt;br&gt;
Code interpretation in Amazon Bedrock enables your agent to generate, run, and troubleshoot your application code in a secure test environment. This includes tasks such as understanding user requests for specific tasks, generating code to perform those tasks, executing the code, and providing the result from the code execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution Overview-&lt;/strong&gt;&lt;br&gt;
The solution we have proposed uses the Amazon Bedrock service to generate automated scripts based on the prompt provided by the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekkotx8y2zc2qe4ayok4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekkotx8y2zc2qe4ayok4.png" alt="genAI Architecture" width="674" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Flow-&lt;/strong&gt;&lt;br&gt;
1)  We will use AWS native connectors to connect to customer source systems, and the data will be stored in S3. Based on the event trigger (full-load file or CDC), AWS Glue or DMS will be triggered accordingly and the data copied over to the S3 raw layer.&lt;br&gt;
2)  The data will be catalogued and processed using AWS Glue and stored in the processed layer, after which a data lifecycle policy will be applied to the bucket for data archival.&lt;br&gt;
3)  The Bedrock model will be invoked using boto3 for code generation wherever required.&lt;br&gt;
4)  Metadata of the data archival will be available for query and report generation.&lt;br&gt;
5)  Based on reporting user requirements in QuickSight, Lambda will be used to invoke the model and generate the required query for the data, which will be served via Athena.&lt;/p&gt;

&lt;p&gt;The various capabilities of our solution are:-&lt;br&gt;
1)  Ability to effectively perform script generation activities using AWS native GenAI services.&lt;br&gt;
2)  Increased reusability of the scripts used for result generation from the models.&lt;br&gt;
3)  Auto-optimized script generation from GenAI.&lt;br&gt;
4)  Cost-effective archival solution based on serverless architecture.&lt;br&gt;
5)  Automated archival framework providing a fully integrated skeleton for reuse.&lt;br&gt;
6)  Efficiency and effectiveness in script preparation.&lt;br&gt;
7)  Event-based triggers for pipeline creation and processing.&lt;br&gt;
8)  Tight integration with AWS native services.&lt;br&gt;
9)  Less manual intervention in the fully integrated solution.&lt;br&gt;
10) End-to-end data delivery using a cloud-agnostic solution, providing scalability and cost effectiveness.&lt;/p&gt;
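&lt;p&gt;The data lifecycle policy from step 2 of the data flow can be sketched with boto3; the bucket name, prefix, and transition windows below are illustrative assumptions, not values from the solution:&lt;/p&gt;

```python
# Hedged sketch: an S3 lifecycle rule that archives the processed layer to
# Glacier and later expires it. Bucket, prefix, and day counts are hypothetical.
def archival_lifecycle_rule(prefix: str, glacier_after_days: int, expire_after_days: int) -> dict:
    """Build one S3 lifecycle rule: transition to GLACIER, then expire."""
    return {
        "ID": "archive-processed-layer",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }

if __name__ == "__main__":
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="processed-layer-bucket",  # hypothetical bucket
        LifecycleConfiguration={"Rules": [archival_lifecycle_rule("processed/", 90, 365)]},
    )
```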

&lt;p&gt;&lt;strong&gt;Benefits-&lt;/strong&gt;&lt;br&gt;
1)  Process efficiency: increases overall efficiency of the script generation process by up to 90%.&lt;br&gt;
2)  Effort optimization: up to 40% reduction in the involvement of in-house teams required for data archival activities.&lt;br&gt;
3)  Reduction in the requirement for proficient and highly skilled people.&lt;br&gt;
4)  Properly organized data and metadata availability, giving users a wider view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Industrial usage-&lt;/strong&gt;&lt;br&gt;
Data Archival with GenAI benefits industries across the board, as efficient data preparation is required by most operational functions: monthly sales analysis in retail, medical records used to predict upcoming health challenges in healthcare, real-time fund utilization rates in finance, and so on. The overall solution delivers cloud transformation at scale with GenAI, at the speed most organizations need, by implementing it on Amazon Cloud services.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GenAI enabled Glue Job</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Mon, 07 Apr 2025 15:33:35 +0000</pubDate>
      <link>https://dev.to/ankursri/genai-enabled-glue-job-49h5</link>
      <guid>https://dev.to/ankursri/genai-enabled-glue-job-49h5</guid>
<description>&lt;p&gt;AWS Glue is now a widely used AWS-native ETL tool. It is a serverless data integration service that simplifies the process of discovering, preparing, moving, and integrating data from multiple sources for analytics, machine learning, and application development. It offers a centralized data catalog, allowing users to manage their data efficiently. With AWS Glue, you can create, run, and monitor ETL (extract, transform, load) pipelines to load data into your data lakes. It supports various workloads, including ETL, ELT, and streaming, and integrates seamlessly with AWS analytics services and Amazon S3 data lakes. AWS Glue also provides a graphical interface called AWS Glue Studio, which makes it easy to create and manage data integration jobs visually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen AI enabled AWS Glue&lt;/strong&gt;&lt;br&gt;
Generative AI has significantly enhanced AWS Glue by streamlining data integration and boosting developer productivity. With the integration of generative AI capabilities, AWS Glue now allows users to build data integration pipelines using natural language. This means you can describe your intent through a chat interface, and AWS Glue will generate a complete job for you. This feature, known as Amazon Q data integration in AWS Glue, helps you create data integration jobs with minimal coding experience, allowing you to focus more on analyzing data rather than on mundane tasks.&lt;br&gt;
Additionally, generative AI has modernized Spark jobs within AWS Glue, accelerating troubleshooting and reducing the time spent on undifferentiated tasks. This AI-powered assistance provides expert-level guidance throughout the entire data integration lifecycle, making it easier to maintain and troubleshoot jobs.&lt;br&gt;
Moreover, generative AI has automated the generation of comprehensive metadata descriptions for data assets in the AWS Glue Data Catalog. This automation enhances data discoverability, understanding, and overall data governance within the AWS Cloud environment.&lt;br&gt;
These advancements have made AWS Glue more efficient and user-friendly, enabling users to integrate data faster and improve developer productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution Overview-&lt;/strong&gt;&lt;br&gt;
The solution we have proposed creates an automated Glue job using the Anthropic Claude 3 Sonnet model, which generates the scripts based on the prompt provided by the user.&lt;br&gt;
&lt;strong&gt;Why the Claude 3 Sonnet model:-&lt;/strong&gt;&lt;br&gt;
It can perform nuanced content creation and accurate summarization, and handle complex scientific queries. The model demonstrates increased proficiency in non-English languages and coding tasks with more accurate responses, supporting a wider range of use cases on a global scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy88lmxmflj0phmsw4e86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy88lmxmflj0phmsw4e86.png" alt="GenAI Automation architecture" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-Requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creation of an IAM role for Lambda, providing access to the services used in the automation, i.e. S3, Bedrock, CloudWatch, and Glue.&lt;/li&gt;
&lt;li&gt;Creation of an IAM Glue service role for the Glue job that is created through Lambda.&lt;/li&gt;
&lt;li&gt;Creation of an S3 bucket in which the files read by Lambda are stored.&lt;/li&gt;
&lt;li&gt;Creation of a Lambda function containing the automation script, with the necessary layers and configurations.&lt;/li&gt;
&lt;li&gt;Enablement of access to the Bedrock Anthropic Claude 3 Sonnet model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Data Flow-&lt;/strong&gt;&lt;br&gt;
1)  The user fills in an Excel table with the inputs required to create the Glue job (e.g. job name, source data file details, target details, and the transformations to apply to the source data). This Excel file is then uploaded to the S3 bucket.&lt;br&gt;
2)  The event-based architecture triggers an S3 push event that calls the respective Lambda function to start the Bedrock invocation process.&lt;br&gt;
3)  The Bedrock model is invoked using boto3 invoke_model as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3plb42tla9e1fstykbu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3plb42tla9e1fstykbu0.png" alt="Code snippete" width="539" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4)  The response from Bedrock is used to create a template for script deployment, or directly as an executable script.&lt;br&gt;
5)  Lambda uses the generated template to create the Glue job with the code snippet shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuk96507f9g2qcchii1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuk96507f9g2qcchii1w.png" alt="Code snippet2" width="480" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;br&gt;
1) Cost optimization: 40-50% cost reduction in writing the transformation scripts and creating the Glue/Lambda jobs.&lt;br&gt;
2) Metadata-driven processing: users just need to fill in the metadata document containing the input details (source, target, a text prompt describing the required transformation, etc.); there is no need to alter scripts every time the user input changes.&lt;br&gt;
3) Code generation with high efficiency: the anthropic.claude-3-sonnet foundation model available in Amazon Bedrock produces efficient code that needs little further optimization.&lt;br&gt;
4) Plug-and-play architecture: the two main components (the Excel file for metadata input and the Python script executed in Lambda) can be easily integrated across multi-cloud platforms as well as into existing delivery environments.&lt;br&gt;
5) Low overheads: since all components are serverless, maintenance overheads are close to zero.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementation of DataOps with GenAI</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Mon, 24 Mar 2025 16:27:28 +0000</pubDate>
      <link>https://dev.to/ankursri/implementation-of-dataops-with-genai-3mof</link>
      <guid>https://dev.to/ankursri/implementation-of-dataops-with-genai-3mof</guid>
      <description>&lt;p&gt;Organizations are now leveraging cloud-native capabilities for stability, scalability, accuracy, and speed in their applications. They are modernizing their legacy systems by creating automated, AI-enabled DataOps platforms. One major challenge is the time spent on data preparation, validation, and accuracy, leading to increased costs and lower data quality.&lt;br&gt;
Wipro, an AWS Premier Consulting Partner and Managed Service Provider (MSP), addresses these challenges by delivering cloud transformation solutions using DataOps with GenAI.&lt;br&gt;
&lt;strong&gt;GenAI in AWS&lt;/strong&gt;&lt;br&gt;
AWS Bedrock is a fully managed GenAI service  that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon through a single API. It provides a broad set of capabilities needed to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.&lt;br&gt;
Amazon Bedrock can be used to generate code in various stages of the software development lifecycle (SDLC). It allows developers to create their own systems to augment, write, and audit code by using models within Amazon Bedrock instead of relying on out-of-the-box coding assistants.&lt;br&gt;
Code interpretation in Amazon Bedrock enables your agent to generate, run, and troubleshoot your application code in a secure test environment. This includes tasks such as understanding user requests for specific tasks, generating code to perform those tasks, executing the code, and providing the result from the code execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution Overview-&lt;/strong&gt;&lt;br&gt;
The solution we have proposed uses the Anthropic Claude 3 Sonnet model, which generates automated scripts based on the prompt provided by the user.&lt;br&gt;
&lt;strong&gt;Why the Claude 3 Sonnet model:-&lt;/strong&gt;&lt;br&gt;
It can perform nuanced content creation and accurate summarization, and handle complex scientific queries. The model demonstrates increased proficiency in non-English languages and coding tasks with more accurate responses, supporting a wider range of use cases on a global scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgi4m7yce5wt5nbvb99e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgi4m7yce5wt5nbvb99e.png" alt="image describes the architectural components" width="790" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Flow-&lt;/strong&gt;&lt;br&gt;
1)  The user fills in an Excel table with the input formats for the different services to be created as part of the DataOps pipeline. This Excel file is then uploaded to the S3 bucket.&lt;br&gt;
2)  The event-based architecture triggers an S3 push event that calls the respective Lambda function to start the Bedrock invocation process.&lt;br&gt;
3)  The Bedrock model is invoked using boto3 invoke_model as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3787hgdt4hfiokcc93l8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3787hgdt4hfiokcc93l8.png" alt="Image describes the code snippet" width="539" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4)  The response from Bedrock is used to create a template for script deployment, or directly as an executable script.&lt;br&gt;
5)  Lambda uses the generated template to create the jobs.&lt;/p&gt;
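&lt;p&gt;The Bedrock invocation in step 3 can be sketched with boto3 as follows; the prompt is illustrative, and the Claude 3 models on Bedrock expect the Anthropic Messages request body shown here:&lt;/p&gt;

```python
import json

# Hedged sketch: invoking Claude 3 Sonnet on Amazon Bedrock to generate a script.
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def build_request(prompt: str, max_tokens: int = 2048) -> str:
    """Serialize the Anthropic Messages API body that Bedrock expects for Claude 3."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

if __name__ == "__main__":
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        # Hypothetical prompt; in the solution it comes from the Excel metadata.
        body=build_request("Generate a PySpark script that deduplicates orders by order_id."),
    )
    payload = json.loads(response["body"].read())
    generated_script = payload["content"][0]["text"]  # the model's generated code
    print(generated_script)
```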

&lt;p&gt;&lt;strong&gt;The various capabilities of our solution are:-&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ability to effectively perform script generation activities using AWS native GenAI services.&lt;/li&gt;
&lt;li&gt;Increased reusability of the scripts used for result generation from the models.&lt;/li&gt;
&lt;li&gt;Auto-optimized script generation from GenAI.&lt;/li&gt;
&lt;li&gt;Cost-effective solution based on serverless architecture.&lt;/li&gt;
&lt;li&gt;DataOps-driven automated framework providing a fully integrated skeleton for reuse.&lt;/li&gt;
&lt;li&gt;Efficiency and effectiveness in script preparation.&lt;/li&gt;
&lt;li&gt;Event-based triggers for pipeline creation and processing.&lt;/li&gt;
&lt;li&gt;Tight integration with AWS native services.&lt;/li&gt;
&lt;li&gt;Less manual intervention in the fully integrated solution.&lt;/li&gt;
&lt;li&gt;End-to-end data delivery using a cloud-agnostic solution, providing scalability and cost effectiveness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process efficiency: increases overall efficiency of the script generation process by up to 90%.&lt;/li&gt;
&lt;li&gt;Effort optimization: up to 30-40% reduction in the involvement of in-house teams required for data preparation activities.&lt;/li&gt;
&lt;li&gt;Reduction in the requirement for proficient and highly skilled people.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Industrial usage-&lt;/strong&gt;&lt;br&gt;
DataOps with GenAI benefits industries across the board, as efficient data preparation is required by most operational functions: monthly sales analysis in manufacturing, medical records used to predict upcoming health challenges in healthcare, real-time identification of TRP-driven content in media, and so on. The overall solution delivers cloud transformation at scale with GenAI, at the speed most organizations need, by implementing it on Amazon Cloud services, which offer the benefits of quick wins, optimal cost, and virtually unlimited scalability.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Attribute level governance using Apache Iceberg Table</title>
      <dc:creator>Ankur Srivastava</dc:creator>
      <pubDate>Tue, 28 Jan 2025 07:20:23 +0000</pubDate>
      <link>https://dev.to/ankursri/attribute-level-governance-using-apache-iceberg-table-bfn</link>
      <guid>https://dev.to/ankursri/attribute-level-governance-using-apache-iceberg-table-bfn</guid>
<description>&lt;p&gt;Large organizations, where the number of users accessing crucial data is high, face many challenges in managing fine-grained access.&lt;/p&gt;

&lt;p&gt;A variety of AWS services such as IAM, Lake Formation, and S3 ACLs can help with fine-grained access control. But there are scenarios where a single entity containing global data needs to be accessed by multiple user groups across the system with restricted access. Also, an organization with a global presence might work in different environments and with different toolsets, so data movement and cataloguing become very tedious.&lt;/p&gt;

&lt;p&gt;For example: a user wants to access sales data from a table for analytics, but should be restricted to Australia-region sales data only; no other data should be visible. The user also wants to access the data from a different cloud platform for multiple DML operations, so the data has to be brought over and transformed into the tool’s native format for processing, which causes delays.&lt;/p&gt;

&lt;p&gt;For these kinds of scenarios, we require data control at the attribute level, and data across environments that supports the native toolset formats and faster access.&lt;/p&gt;

&lt;p&gt;Wipro, an AWS Premier Consulting Partner and Managed Service Provider (MSP) with rich global experience, takes a step ahead to address these challenges and deliver a cloud transformation solution leveraging Lake Formation for data governance on Apache Iceberg tables, which can be queried and catalogued in AWS S3 itself and accessed across platforms and clouds.&lt;/p&gt;

&lt;p&gt;Using the data filter option in Lake Formation, we can enforce column-level, row-level, and cell-level security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Iceberg table format?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Iceberg is an open-source table format with the following benefits:&lt;/p&gt;

&lt;p&gt;Iceberg fully supports flexible SQL commands, making it possible to update, merge, and delete data. Iceberg can rewrite data files to enhance read performance and use delete deltas to speed up updates.&lt;br&gt;
Iceberg supports full schema evolution. Schema updates in Iceberg tables change only the metadata, leaving the data files themselves unaffected. Schema evolution changes include adds, drops, renames, reordering, and type promotions.&lt;br&gt;
Data stored in a data lake or data mesh architecture is available to multiple independent applications across an organization simultaneously.&lt;br&gt;
Iceberg is designed for use with huge analytical data sets. It offers multiple features designed to increase querying speed and efficiency, including fast scan planning, pruning metadata files that aren’t needed, and the ability to filter out data files that don’t contain matching data.&lt;/p&gt;
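&lt;p&gt;These DML and schema evolution capabilities can be exercised from Spark SQL. A minimal sketch follows; the catalog, table, and column names are illustrative, and a Spark session configured with the Iceberg runtime (as on an Iceberg-enabled EMR cluster) is assumed:&lt;/p&gt;

```python
# Hedged sketch: Iceberg DML and schema evolution via Spark SQL.
# Catalog/table/column names (glue_catalog.sales.orders, discount) are hypothetical.
MERGE_SQL = """
    MERGE INTO glue_catalog.sales.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""
EVOLVE_SQL = "ALTER TABLE glue_catalog.sales.orders ADD COLUMN discount DOUBLE"

if __name__ == "__main__":
    from pyspark.sql import SparkSession

    # Assumes spark.sql.catalog.glue_catalog is already configured for Iceberg.
    spark = SparkSession.builder.getOrCreate()
    spark.sql(EVOLVE_SQL)  # metadata-only change; data files are untouched
    spark.sql(MERGE_SQL)   # ACID upsert into the Iceberg table
```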

&lt;p&gt;&lt;strong&gt;Solution Overview-&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The solution we have proposed uses the Lake Formation service to create data filters on which we can grant access permissions to users. The heart of the solution is the Iceberg table format: tables are catalogued and then attached with filter conditions to govern access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr7hj7siibdcx8k33l7n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr7hj7siibdcx8k33l7n.jpg" alt="This image describes the architectural data flow for iceberg implementation" width="602" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Flow-&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DMS or Glue is used to fetch data from the source system repositories and store it in the designated S3 bucket.&lt;/li&gt;
&lt;li&gt;The event-based architecture triggers an S3 push event that calls the respective Lambda function to start the ETL process.&lt;/li&gt;
&lt;li&gt;Data is stored in the Iceberg table format and catalogued.&lt;/li&gt;
&lt;li&gt;Data can be processed and transformed using Glue, leveraging ready-made GenAI models.&lt;/li&gt;
&lt;li&gt;Processed data is stored in Redshift for consumption.&lt;/li&gt;
&lt;li&gt;Catalogued Iceberg tables are extended with a tag column (the tag value is mapped to the user group).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The image below shows what a sample data filter looks like. We can also limit the number of visible columns using data filters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvsead94drieouoz84yy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvsead94drieouoz84yy.jpg" alt="This screenshot describe how the configuration will look like" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the filter is created, we can use the grant permissions option to give access to users, roles, groups, or accounts. Users can then query the data using Athena.&lt;/p&gt;
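&lt;p&gt;The same filter-and-grant flow can also be scripted with boto3; the account ID, database, table, filter name, and principal ARN below are illustrative, and the row filter mirrors the Australia-region example from earlier:&lt;/p&gt;

```python
# Hedged sketch: a row-level Lake Formation data filter plus a SELECT grant.
# Account ID, database, table, and principal ARN are hypothetical.
def data_filter(account_id: str, database: str, table: str) -> dict:
    """TableData for lakeformation.create_data_cells_filter restricting rows to Australia."""
    return {
        "TableCatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": "australia-sales-only",
        "RowFilter": {"FilterExpression": "region = 'Australia'"},
        # All columns visible; list column names here instead to hide them.
        "ColumnWildcard": {"ExcludedColumnNames": []},
    }

if __name__ == "__main__":
    import boto3

    lf = boto3.client("lakeformation")
    td = data_filter("123456789012", "sales_db", "sales_iceberg")
    lf.create_data_cells_filter(TableData=td)
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystAU"},
        Resource={"DataCellsFilter": {
            "TableCatalogId": td["TableCatalogId"],
            "DatabaseName": td["DatabaseName"],
            "TableName": td["TableName"],
            "Name": td["Name"],
        }},
        Permissions=["SELECT"],
    )
```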

&lt;p&gt;&lt;strong&gt;The various capabilities of our solution are: -&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ability to effectively manage fine-grained control of access to the data.&lt;br&gt;
Reusability of the data filters across multiple user groups.&lt;br&gt;
We can achieve column-level, row-level, and cell-level security.&lt;br&gt;
Effective use of Apache Iceberg table format features for seamless control over the data and its access.&lt;br&gt;
Efficiency and effectiveness in data preparation.&lt;br&gt;
Centralized access management and governance using Lake Formation.&lt;br&gt;
Less manual intervention in the fully integrated solution.&lt;br&gt;
End-to-end data delivery using a cloud-agnostic solution and serverless components to provide scalability and cost effectiveness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits-&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Efficiency:&lt;/strong&gt; Use of serverless components reduces the operational and maintenance overheads involved in managing the solution.&lt;br&gt;
&lt;strong&gt;Effort optimization:&lt;/strong&gt; Up to 20-30% reduction in effort by using GenAI models to generate standardized and efficient ETL scripts.&lt;br&gt;
&lt;strong&gt;Governance and Compliance benefits:&lt;/strong&gt; Attribute-based control in Lake Formation helps comply with standard regulations and provides audit and logging capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Industrial usage-&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Attribute-level governance using Apache Iceberg tables can be seamlessly implemented in the financial sector, for example in banks or insurance companies where customers need restricted access to data that guarantees its authenticity and security. The healthcare sector can use it to generate and share a patient’s Electronic Health Record quickly while preserving the sensitivity of the data, which can lead to timely treatment and medication.&lt;/p&gt;

&lt;p&gt;So, the overall solution delivers attribute-level governance at scale, with speedy data preparation using the Apache Iceberg table format, by implementing it on Amazon Cloud services, which offer the benefits of quick wins, optimal cost, and virtually unlimited scalability.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
