Large organizations with many users accessing critical data face significant challenges in managing fine-grained access.
A variety of AWS services such as IAM, Lake Formation, and S3 ACLs can help with fine-grained access control. However, there are scenarios where a single entity containing global data needs to be accessed by multiple user groups across the system, each with restricted access. In addition, organizations with a global presence may work in different environments and with different toolsets, so data movement and cataloguing become very tedious.
For example, a user wants to access sales data from a table for analytics, but should see only sales data for the Australia region; no other data should be visible to them. The user also wants to access the data from a different cloud platform for multiple DML operations, which means moving the data and transforming it into that tool's native format for processing, causing delays.
For scenarios like these, we need access control at the attribute level, and data that can be consumed across environments in native toolset formats with faster access.
Wipro, an AWS Premier Consulting Partner and Managed Service Provider (MSP) with rich global experience, addresses these challenges by delivering a cloud transformation solution that leverages AWS Lake Formation for data governance on Apache Iceberg tables, which can be catalogued and queried directly in Amazon S3 and accessed across platforms and clouds.
Using the data filter option in Lake Formation, we can enforce column-level, row-level, and cell-level security.
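As a rough sketch of how these three levels map to a Lake Formation data filter (the account ID, database, table, and column names below are placeholders, not part of any real setup): a row filter expression alone gives row-level security, a column list alone gives column-level security, and combining the two gives cell-level security.

```python
import boto3

lakeformation = boto3.client("lakeformation")

ACCOUNT_ID = "123456789012"  # placeholder AWS account ID

lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": ACCOUNT_ID,
        "DatabaseName": "sales_db",       # hypothetical Glue database
        "TableName": "sales",             # hypothetical Iceberg table
        "Name": "australia_sales_filter",
        # Row-level security: only rows matching the expression are visible.
        "RowFilter": {"FilterExpression": "region = 'Australia'"},
        # Column-level security: only the listed columns are visible.
        # Using both together gives cell-level security; to restrict only one
        # dimension, use AllRowsWildcard or ColumnWildcard instead.
        "ColumnNames": ["sale_id", "region", "amount", "sale_date"],
    }
)
```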
What is the Iceberg table format?
Apache Iceberg is an open-source table format with the following benefits:
- Iceberg fully supports flexible SQL commands, making it possible to update, merge, and delete data. Iceberg can rewrite data files to enhance read performance and use delete deltas to speed up updates.
- Iceberg supports full schema evolution. Schema updates in Iceberg tables change only the metadata, leaving the data files themselves unaffected. Schema evolution changes include adding, dropping, renaming, and reordering columns, as well as type promotions.
- Data stored in a data lake or data mesh architecture is available to multiple independent applications across an organization simultaneously.
- Iceberg is designed for huge analytical datasets. It offers multiple features that increase query speed and efficiency, including fast scan planning, pruning of metadata files that aren't needed, and the ability to filter out data files that don't contain matching data.
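As a minimal, hypothetical illustration of these capabilities (database, table, and bucket names are assumptions), the snippet below uses boto3 to run Athena statements against an Iceberg table stored in S3: creating the table, updating a row in place, and evolving the schema by adding a column.

```python
import boto3

athena = boto3.client("athena")
RESULTS = {"OutputLocation": "s3://my-athena-results/queries/"}  # placeholder bucket

statements = [
    # Create an Iceberg table registered in the Glue Data Catalog.
    """
    CREATE TABLE sales_db.sales (
        sale_id   bigint,
        region    string,
        amount    double,
        sale_date date)
    PARTITIONED BY (region)
    LOCATION 's3://my-datalake-bucket/warehouse/sales/'
    TBLPROPERTIES ('table_type' = 'ICEBERG')
    """,
    # Row-level DML: Athena supports UPDATE, DELETE, and MERGE on Iceberg tables.
    "UPDATE sales_db.sales SET amount = 0 WHERE sale_id = 42",
    # Schema evolution: adding a column changes only the table metadata;
    # existing data files are left untouched.
    "ALTER TABLE sales_db.sales ADD COLUMNS (sales_channel string)",
]

for sql in statements:
    # start_query_execution is asynchronous; in practice, poll
    # get_query_execution() for completion before submitting the next statement.
    athena.start_query_execution(QueryString=sql, ResultConfiguration=RESULTS)
```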
Solution Overview-
The proposed solution uses the Lake Formation service to create data filters on which we can grant access permissions to users. The heart of the solution is the Iceberg table format: tables are catalogued, and filter conditions are then attached to govern access.
Data Flow-
- AWS DMS or AWS Glue is used to fetch data from the source system repositories and store it in a designated S3 bucket.
- In this event-based architecture, the S3 put event triggers the respective Lambda function to start the ETL process.
- Data is stored in Iceberg table format and catalogued (a sketch of such a Glue job follows this list).
- Data can be processed and transformed using Glue, leveraging ready-made GenAI models.
- Processed data is stored in Amazon Redshift for consumption.
- A tag column is added to the catalogued Iceberg tables (the tag value is mapped to a user group). The image below shows what a sample data filter looks like. We can also limit the visible columns using data filters.
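As a sketch of that Glue ETL step (bucket, database, and column names are assumptions, and the job is expected to run with Glue's Iceberg support enabled, for example via the --datalake-formats iceberg job parameter), the script below reads raw data from S3, adds a tag column derived from the region, and writes the result as an Iceberg table registered in the Glue Data Catalog:

```python
from awsglue.context import GlueContext
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Register an Iceberg catalog ("glue_catalog") backed by the Glue Data Catalog.
conf = SparkConf()
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://my-datalake-bucket/warehouse/")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")

spark = GlueContext(SparkContext(conf=conf)).spark_session

# Read the raw files landed by the DMS/Glue ingestion step (placeholder path).
raw = spark.read.parquet("s3://my-raw-bucket/sales/")

# Add the tag column used by the Lake Formation data filters; here it simply
# mirrors the region, but it could be any value mapped to a user group.
curated = raw.withColumn("tag", F.col("region"))

# Create or refresh the curated Iceberg table in the Glue Data Catalog.
curated.writeTo("glue_catalog.sales_db.sales").using("iceberg").createOrReplace()
```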
Once the filter is created, we can use the grant permissions option to give access to users, roles, groups, or accounts. Users can then query the data with Athena.
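Continuing the hypothetical example, the sketch below grants SELECT on the data filter to an assumed analyst role and then submits an Athena query; when that role runs the query, Lake Formation returns only the Australia rows and the permitted columns.

```python
import boto3

ACCOUNT_ID = "123456789012"  # placeholder AWS account ID

# Grant SELECT on the data cells filter to a hypothetical analyst role.
boto3.client("lakeformation").grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT_ID}:role/au-sales-analyst"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": ACCOUNT_ID,
            "DatabaseName": "sales_db",
            "TableName": "sales",
            "Name": "australia_sales_filter",
        }
    },
    Permissions=["SELECT"],
)

# Queries run by the analyst role see only what the filter allows.
boto3.client("athena").start_query_execution(
    QueryString="SELECT region, amount, sale_date FROM sales_db.sales",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
```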
The key capabilities of our solution are:
- Ability to effectively manage fine-grained control of access to the data.
- Reusability of the data filters across multiple user groups.
- Column-level, row-level, and cell-level security.
- Effective use of Apache Iceberg table format features for seamless control over the data and its access.
- Efficiency and effectiveness in data preparation.
- Centralized access management and governance using Lake Formation.
- Less manual intervention in a fully integrated solution.
- End-to-end data delivery using a cloud-agnostic solution and serverless components for scalability and cost-effectiveness.
Benefits-
Operational efficiency: Use of serverless components reduces the operational and maintenance overhead involved in managing the solution.
Effort optimization: Up to 20-30% reduction in effort by using GenAI models to generate standardized and efficient ETL scripts.
Governance and compliance benefits: Attribute-based control in Lake Formation helps comply with standard regulations and provides audit and logging capabilities.
Industrial usage-
Attribute-level governance using Apache Iceberg tables can be implemented very seamlessly in the financial sector, for example in banks or insurance companies, where customers need restricted access to data to ensure its authenticity and security. The healthcare sector can use it to generate and share patients' Electronic Health Records quickly while protecting sensitive data, enabling timely treatment and medication.
Overall, the solution delivers attribute-level governance at scale with fast data preparation using the Apache Iceberg table format that most organizations need, and implementing it with AWS services offers the benefits of quick wins, optimal cost, and virtually unlimited scalability.