Aki for AWS Community Builders

Posted on Jun 8

Organizing How to Use AWS Lake Formation

#aws #dataengineering

Original Japanese article: AWS Lake Formationの使い方について整理してみる

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

Previously, I wrote an article titled "Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies."
In that article, I mentioned that "AWS Lake Formation is necessary to complement data governance" but did not go into detail because it was outside the scope of the article.

This time, I'd like to organize my thoughts on Lake Formation, covering everything from the fundamentals to practical usage patterns.

Lake Formation is often perceived as a service that is "somewhat difficult" or "unnecessary because IAM is enough."
However, once you start implementing proper access control for a data lake, the necessity of Lake Formation becomes much clearer.

I hope this article helps you evaluate whether Lake Formation is worth adopting in your environment.

What Is Lake Formation?

AWS Lake Formation is a service that provides access management and governance for data lakes.

It allows you to centrally manage who can access which data and at what level.
One of its key strengths is the ability to manage access controls consistently across multiple AWS services such as Athena, Glue, and Redshift Spectrum.

Although they are often confused, Lake Formation and Glue Data Catalog serve different purposes.

Service	Role
Glue Data Catalog	A technical catalog that manages metadata such as schemas and partitions
Lake Formation	A governance layer that manages access permissions for data registered in the Glue Data Catalog

Amazon S3 (Actual Data)
        ↓
Glue Data Catalog (Metadata Management)
        ↓
Lake Formation (Access Control)
        ↓
Athena / Glue Job / Redshift Spectrum

In other words, data resides in S3, Glue Data Catalog manages metadata, and Lake Formation provides access control on top of that metadata layer.

How Lake Formation Differs from IAM

Isn't IAM Enough?

When managing a data lake on S3 using IAM alone, several challenges emerge:

Granularity limitations: IAM primarily operates at the bucket or prefix level, making table-, column-, and row-level access control difficult.
Operational complexity: As users and roles increase, S3 bucket policies and IAM policies become increasingly difficult to manage.
Cross-account sharing: Implementing data sharing across AWS accounts using only IAM can lead to complicated designs.
Limited visibility for auditing: It is difficult to easily understand who can access which tables.

Typical examples include:

More than ten Athena users need different levels of access, making permission management increasingly complicated.
Different departments should see different subsets of data. For example, the sales department should only see Eastern Japan sales, while executives can see all data.
Personally identifiable information (PII) such as email addresses and credit card numbers should be hidden from analysts.
Data needs to be shared with another AWS account.

Lake Formation addresses these challenges.

What Lake Formation Solves

With Lake Formation, you can implement:

Fine-grained table-, column-, and row-level access control
Permission management at the Glue Data Catalog database and table level
Tag-based access control (LF-TBAC) for large-scale environments
Cross-account data sharing through AWS RAM
Centralized auditing through CloudTrail integration

The Relationship Between IAM and Lake Formation

Lake Formation does not replace IAM; it works as an additional layer on top of IAM.

When a query is executed (for example, through Athena), access is granted only if both conditions are satisfied:

IAM Permission
        AND
Lake Formation Permission
        ↓
Access Allowed

Even if permissions are granted in Lake Formation, access is denied if IAM blocks it.

Likewise, even if IAM allows access, the request is denied if the corresponding Lake Formation permissions are missing.

Understanding this "AND" relationship is the foundation of permission design.
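The "AND" relationship can be modeled in a few lines. This is only a toy sketch of the rule described above, not how AWS evaluates requests internally; `AccessCheck` and `is_access_allowed` are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCheck:
    """Outcome of the two independent checks (both are assumed inputs here)."""
    iam_allows: bool
    lake_formation_allows: bool


def is_access_allowed(check: AccessCheck) -> bool:
    # Both layers must allow the request; either one alone is not enough.
    return check.iam_allows and check.lake_formation_allows


if __name__ == "__main__":
    for iam in (True, False):
        for lf in (True, False):
            result = is_access_allowed(AccessCheck(iam, lf))
            print(f"IAM={iam} LakeFormation={lf} -> allowed={result}")
```

Only one of the four combinations results in access, which is why a missing grant on either side shows up as a denial.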

Lake Formation Permission Model

Lake Formation permissions are managed across multiple levels.

Level	Target	Example Permissions
Data Lake Administrator	Entire Lake Formation environment	Full permissions
Database Level	Glue Data Catalog database	CREATE_TABLE, DROP
Table Level	Individual table	SELECT, INSERT, ALTER
Column Level	Specific columns within a table	SELECT on selected columns
Row Level	Rows matching specific conditions	SELECT on filtered rows

Permissions can be granted or revoked through the console, CLI, or SDK.

# Example: Grant SELECT permission on a table
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "Table": {
      "DatabaseName": "mydb",
      "Name": "sales_table"
    }
  }'
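For the SDK route, the same grant might look roughly like this in Python. It is a sketch under assumptions: `build_table_grant` and `grant_table_permissions` are hypothetical helpers, the account ID is a placeholder, and `client` is assumed to be a boto3 `lakeformation` client that you create yourself.

```python
from typing import Any, Dict, Iterable


def build_table_grant(principal_arn: str, database: str, table: str,
                      permissions: Iterable[str]) -> Dict[str, Any]:
    """Build keyword arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant_table_permissions(client: Any, request: Dict[str, Any]) -> Any:
    # `client` is assumed to be boto3.client("lakeformation"); it is passed in
    # so this sketch can be exercised without AWS credentials.
    return client.grant_permissions(**request)


request = build_table_grant(
    "arn:aws:iam::123456789012:role/analyst-role",  # placeholder account ID
    "mydb",
    "sales_table",
    ["SELECT"],
)
print(request)
```

Keeping the request as a plain dictionary makes it easy to review or log before anything is actually sent to AWS.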

Column-Level and Row-Level Access Control

One of Lake Formation's strongest capabilities is fine-grained access control beyond the table level.

Both column-level and row-level security can be implemented using a mechanism called Data Filters.
You create Data Filters (in the console, or through the CLI or API) and reference them when granting permissions.

Column-Level Security

Access can be restricted to specific columns.

Suppose the customer table contains the following columns:

customer_id	name	email	credit_card	purchase_amount

You could allow analysts to access only customer_id, name, and purchase_amount, while hiding email and credit_card.

This can be achieved simply by specifying included or excluded columns in a Data Filter.
Excluded columns will not appear in Athena query results.
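As a rough sketch, the column-only filter for the customer table could be described like this. `build_column_filter` is a hypothetical helper and the catalog ID, database name, and filter names are placeholders; the resulting dictionary is meant to be passed as `TableData` to the `create_data_cells_filter` call of a boto3 `lakeformation` client.

```python
from typing import Any, Dict, Iterable, Optional


def build_column_filter(catalog_id: str, database: str, table: str, name: str,
                        include: Optional[Iterable[str]] = None,
                        exclude: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Describe a Data Filter that restricts columns but not rows."""
    if (include is None) == (exclude is None):
        raise ValueError("pass exactly one of include or exclude")
    table_data: Dict[str, Any] = {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # No row restriction: this filter only limits columns.
        "RowFilter": {"AllRowsWildcard": {}},
    }
    if include is not None:
        table_data["ColumnNames"] = list(include)
    else:
        table_data["ColumnWildcard"] = {"ExcludedColumnNames": list(exclude)}
    return table_data


# Two ways to express the same intent for the customer table.
by_inclusion = build_column_filter(
    "123456789012", "mydb", "customer", "analyst-columns",
    include=["customer_id", "name", "purchase_amount"],
)
by_exclusion = build_column_filter(
    "123456789012", "mydb", "customer", "hide-pii",
    exclude=["email", "credit_card"],
)
print(by_inclusion)
print(by_exclusion)
```

The exclusion form keeps working when new non-sensitive columns are added later, while the inclusion form hides new columns until someone explicitly allows them. Which default is safer depends on your data.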

Row-Level Security

Row-level filters allow access only to rows matching specific conditions.

Filter expressions are written using PartiQL WHERE-clause syntax.

For example, if the sales table contains a region column and the Eastern Japan team should only see rows where region = 'east', you can create the following Data Filter:

aws lakeformation create-data-cells-filter \
  --table-data '{
    "TableCatalogId": "123456789012",
    "DatabaseName": "mydb",
    "TableName": "sales",
    "Name": "east-region-filter",
    "RowFilter": {
      "FilterExpression": "region = '\''east'\''"
    },
    "ColumnWildcard": {}
  }'

Combining column filters and row filters enables cell-level security, where users can access only specific columns within specific rows.
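Here is a minimal sketch of a cell-level filter and of a grant that references it. The helper names and the column names `order_id` and `amount` are assumptions for illustration (the article only mentions the `region` column); the dictionaries follow the shapes used by `create_data_cells_filter` and `grant_permissions` on a boto3 `lakeformation` client.

```python
from typing import Any, Dict, Iterable


def build_cell_filter(catalog_id: str, database: str, table: str, name: str,
                      row_expression: str, columns: Iterable[str]) -> Dict[str, Any]:
    """Describe a Data Filter that restricts both rows and columns."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        "RowFilter": {"FilterExpression": row_expression},
        "ColumnNames": list(columns),
    }


def build_filter_grant(principal_arn: str, table_data: Dict[str, Any]) -> Dict[str, Any]:
    """Build a SELECT grant that points at the filter instead of the whole table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


cell_filter = build_cell_filter(
    "123456789012", "mydb", "sales", "east-region-cells",
    "region = 'east'",
    ["order_id", "region", "amount"],  # hypothetical column names
)
grant = build_filter_grant("arn:aws:iam::123456789012:role/east-sales-role", cell_filter)
print(grant)
```

The principal receives SELECT on the filter rather than on the table itself, so it sees only the intersection of the allowed rows and columns.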

Data Filter Limitations

According to the official documentation:

Up to 100 filters per principal
array and map types are not supported in filter expressions (struct types can be used in row filters)
Cell-level security does not support nested columns, views, or resource links
Cell-level security is available in all regions when using Athena Engine Version 3 or Redshift Spectrum

Common Use Cases

Protecting PII such as email addresses and credit card numbers
Restricting business data by department or geographic region
Compliance requirements for regulated data

Tag-Based Access Control (LF-TBAC)

As the number of databases and tables grows, managing permissions table by table becomes increasingly difficult.

LF-TBAC (Lake Formation Tag-Based Access Control) addresses this problem.

What Are LF-Tags?

LF-Tags are key-value tags unique to Lake Formation.

They are separate from both S3 resource tags and IAM tags and are managed independently within Lake Formation.

aws lakeformation create-lf-tag \
  --tag-key "sensitivity" \
  --tag-values '["public", "internal", "confidential"]'

Tagging Resources and Mapping Permissions

LF-Tags can be assigned to databases, tables, and columns.

aws lakeformation add-lf-tags-to-resource \
  --resource '{"Table": {"DatabaseName": "mydb", "Name": "sales"}}' \
  --lf-tags '[{"TagKey": "sensitivity", "TagValues": ["internal"]}]'

Permissions are then granted based on tags rather than table names.

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "LFTagPolicy": {
      "ResourceType": "TABLE",
      "Expression": [{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}]
    }
  }'

This grants SELECT access to all tables tagged with either sensitivity=public or sensitivity=internal.

When new tables are created, simply assigning the appropriate LF-Tag automatically applies the correct permissions.
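To make the matching rule concrete, here is a simplified model of how a tag expression selects tables: every key in the expression must match, and within a key any listed value is enough. This is my own sketch, not the Lake Formation implementation; it ignores details such as tags inherited from the database, and the table names are made up.

```python
from typing import Dict, List, Sequence

TagCondition = Dict[str, object]  # {"TagKey": str, "TagValues": [str, ...]}


def matches_policy(resource_tags: Dict[str, str],
                   expression: Sequence[TagCondition]) -> bool:
    """Return True when the resource's tags satisfy every condition."""
    for condition in expression:
        key = condition["TagKey"]
        allowed_values = condition["TagValues"]
        # A missing tag never matches; a present tag must be one of the values.
        if resource_tags.get(key) not in allowed_values:
            return False
    return True


expression: List[TagCondition] = [
    {"TagKey": "sensitivity", "TagValues": ["public", "internal"]},
]

tables = {
    "sales": {"sensitivity": "internal"},
    "web_logs": {"sensitivity": "public"},
    "customers": {"sensitivity": "confidential"},
    "new_orders": {"sensitivity": "internal"},  # created later, only tagged
    "untagged": {},
}

readable = [name for name, tags in tables.items() if matches_policy(tags, expression)]
print(readable)
```

Note that `new_orders` becomes readable purely because of its tag; no additional grant is involved.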

Benefits in Large-Scale Environments

In environments with dozens or hundreds of tables, table-by-table permission management becomes unrealistic.

LF-TBAC enables a simpler model:

Roles can access data with specific tags.

However, tag design should be carefully planned from the beginning.
Defining categories such as sensitivity, domain, and owner early on can save significant effort later.

Integration with Glue Data Catalog

Lake Formation works closely with Glue Data Catalog.

Glue Data Catalog manages metadata, while Lake Formation governs access to that metadata.
Together they enable secure sharing and consumption of data stored in S3.

How Lake Formation Works with Data Catalog

When Lake Formation is enabled, access to Glue Data Catalog is routed through Lake Formation authorization checks.

This means that access to metadata itself—such as table definitions—can also be controlled.

Granting Lake Formation Permissions to Glue Jobs

When a Glue Job accesses data governed by Lake Formation, permissions must be granted not only through IAM but also through Lake Formation.

This is a common pitfall.

A typical issue is:

IAM permissions look correct, but the Glue Job still cannot read data.

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/glue-job-role \
  --permissions SELECT \
  --resource '{
    "Table": {"DatabaseName": "mydb", "Name": "source_table"}
  }'
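When I suspect this pitfall, a quick check of what Lake Formation has actually granted to the job role helps. The sketch below is one possible way to do it: `missing_table_permissions` is a hypothetical helper, `client` is assumed to be a boto3 `lakeformation` client, and the result only reflects what `list_permissions` returns for that principal and table, which may not cover every way access can be granted.

```python
from typing import Any, Dict, Iterable, Set


def missing_table_permissions(client: Any, principal_arn: str, database: str,
                              table: str, required: Iterable[str]) -> Set[str]:
    """Return the required permissions that are not granted on the table."""
    kwargs: Dict[str, Any] = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
    }
    granted: Set[str] = set()
    while True:
        page = client.list_permissions(**kwargs)
        for entry in page.get("PrincipalResourcePermissions", []):
            granted.update(entry.get("Permissions", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of results
    if "ALL" in granted:  # "ALL" is shown as Super in the console
        return set()
    return set(required) - granted
```

For example, calling it with `["SELECT"]` for a source table and `["INSERT"]` for a target table before deploying a job turns a vague failure into a specific missing grant.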

Cross-Account Sharing

Lake Formation supports cross-account data sharing through AWS RAM (Resource Access Manager).

Users in the target account can query shared tables directly from their own Athena environment.

Because Lake Formation permissions—including column and row filters—remain enforced, scenarios such as sharing data while excluding sensitive columns are supported.

Which cross-account features are available depends on the Cross account version setting under Data Catalog settings. The two most relevant versions are:

Version 3 enables direct sharing with IAM principals in other accounts.
Version 4 adds support for hybrid access mode in cross-account scenarios.
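As an illustration of sharing a table while excluding sensitive columns, a grant to another account could be built like this. The helper name and the target account ID `111122223333` are placeholders; the dictionary follows the `grant_permissions` request shape of a boto3 `lakeformation` client, and the steps needed on the receiving side are out of scope here.

```python
from typing import Any, Dict, Iterable


def build_cross_account_grant(target_account_id: str, database: str, table: str,
                              excluded_columns: Iterable[str]) -> Dict[str, Any]:
    """Build a SELECT grant to another AWS account that hides some columns."""
    if not (len(target_account_id) == 12 and target_account_id.isdigit()):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        # For an account-level share the principal is the account ID itself.
        "Principal": {"DataLakePrincipalIdentifier": target_account_id},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


share = build_cross_account_grant(
    "111122223333",  # hypothetical consumer account
    "mydb",
    "customer",
    ["email", "credit_card"],
)
print(share)
```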

Integration with Athena and Redshift Spectrum

Authorization Flow During Query Execution

When Athena accesses a Lake Formation-managed table:

A user executes a query in Athena.
Athena requests table metadata from Glue Data Catalog.
Lake Formation validates permissions.
If authorized, access to data in S3 is allowed.
Column and row filters are applied before results are returned.

This enables fine-grained access control without modifying S3 bucket policies.
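Conceptually, the last step of that flow behaves like the small simulation below. To be clear, this is not how enforcement is implemented (that happens inside the AWS services, not in client code); it only shows the effect of a row filter plus a column filter on some made-up sales rows.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def apply_filters(rows: Sequence[Row],
                  row_predicate: Callable[[Row], bool],
                  visible_columns: Sequence[str]) -> List[Row]:
    """Keep only matching rows, then keep only the visible columns."""
    return [
        {column: row[column] for column in visible_columns}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical sample data for the sales table.
sales = [
    {"order_id": 1, "region": "east", "amount": 120},
    {"order_id": 2, "region": "west", "amount": 80},
    {"order_id": 3, "region": "east", "amount": 45},
]

visible = apply_filters(
    sales,
    lambda row: row["region"] == "east",  # row filter: region = 'east'
    ["order_id", "amount"],               # column filter
)
print(visible)
```

The user's query text does not change; the same `SELECT` simply returns fewer rows and columns depending on who runs it.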

Redshift Spectrum Integration

Since Redshift Spectrum also relies on Glue Data Catalog, Lake Formation permissions are enforced there as well.

This makes it easier to maintain consistent access control across Athena and Redshift Spectrum.

Adoption Challenges and Realistic Operations

Existing Environments: IAMAllowedPrincipals and Hybrid Access Mode

To preserve backward compatibility, Lake Formation grants the IAMAllowedPrincipals group Super permissions on existing Data Catalog resources by default.

In this state, access is effectively controlled by IAM alone, and Lake Formation's fine-grained controls are not enforced.

To fully leverage Lake Formation, these permissions must eventually be removed and replaced with explicit Lake Formation permissions.

However, switching everything at once can break existing workloads.

This is where Hybrid Access Mode becomes useful.

When registering S3 locations, Hybrid Access Mode allows selected principals to opt into Lake Formation authorization while other principals continue using IAM-only access.

This approach minimizes risk and enables gradual migration.

Personally, I believe this is the most practical approach for existing environments.
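A rough sketch of the two calls involved might look like this. The helper names, the bucket ARN, and the choice to use the service-linked role are assumptions for illustration; `client` is assumed to be a boto3 `lakeformation` client, and the opted-in principal still needs its own Lake Formation grants on the table.

```python
from typing import Any, Dict


def build_hybrid_registration(s3_arn: str) -> Dict[str, Any]:
    """Arguments for registering an S3 location in hybrid access mode."""
    return {
        "ResourceArn": s3_arn,
        "UseServiceLinkedRole": True,  # assumption: a custom role is also possible
        "HybridAccessEnabled": True,
    }


def build_opt_in(principal_arn: str, database: str, table: str) -> Dict[str, Any]:
    """Arguments for opting one principal into Lake Formation for one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
    }


def enable_hybrid_access(client: Any, s3_arn: str, principal_arn: str,
                         database: str, table: str) -> None:
    # Register the location first, then opt the selected principal in.
    client.register_resource(**build_hybrid_registration(s3_arn))
    client.create_lake_formation_opt_in(**build_opt_in(principal_arn, database, table))


registration = build_hybrid_registration("arn:aws:s3:::example-data-lake/sales/")
opt_in = build_opt_in("arn:aws:iam::123456789012:role/analyst-role", "mydb", "sales")
print(registration)
print(opt_in)
```

Principals that are not opted in keep using their existing IAM-based access, which is what makes the migration gradual.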

Common Pitfalls

Forgetting Lake Formation Permissions for Glue Jobs

As mentioned earlier, forgetting to grant Lake Formation permissions to Glue Job roles prevents ETL jobs from reading or writing data.

Many "it should work but doesn't" permission issues ultimately trace back to this.

I've forgotten it myself a few times and ended up scrambling to find the root cause.

Interaction with S3 Bucket Policies

Lake Formation does not override S3 bucket policies.

Even if access is granted in Lake Formation, requests are denied if the bucket policy blocks them.

When adopting Lake Formation, bucket policies must be designed to allow access through Lake Formation-authorized service roles.

Maintaining consistency among IAM, Lake Formation, and S3 bucket policies is critical.

Changing the design later can become painful, so it's worth thinking through carefully from the beginning.

Configuring Data Lake Administrators

When enabling Lake Formation for the first time, at least one Data Lake Administrator must be configured.

Relying on a single administrator can become an operational bottleneck, so I recommend assigning multiple administrators.
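Because the data lake settings are written back as a whole, I find it safer to read the current settings, add administrators, and write the result back. The sketch below covers only the middle step; `add_data_lake_admins` is a hypothetical helper and the role names are placeholders.

```python
import copy
from typing import Any, Dict, Iterable


def add_data_lake_admins(settings: Dict[str, Any],
                         admin_arns: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the settings with the given admins added (no duplicates)."""
    updated = copy.deepcopy(settings)
    admins = updated.setdefault("DataLakeAdmins", [])
    existing = {admin["DataLakePrincipalIdentifier"] for admin in admins}
    for arn in admin_arns:
        if arn not in existing:
            admins.append({"DataLakePrincipalIdentifier": arn})
            existing.add(arn)
    return updated


current = {
    "DataLakeAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/lf-admin-1"}
    ]
}
updated = add_data_lake_admins(
    current,
    [
        "arn:aws:iam::123456789012:role/lf-admin-1",  # already present
        "arn:aws:iam::123456789012:role/lf-admin-2",
    ],
)
print(updated)

# With a boto3 "lakeformation" client, the surrounding steps would be roughly:
#   settings = client.get_data_lake_settings()["DataLakeSettings"]
#   client.put_data_lake_settings(DataLakeSettings=add_data_lake_admins(settings, [...]))
```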

Athena Workgroups

When Athena Workgroups are used together with Lake Formation, behavior may vary depending on Workgroup configuration.

In particular, don't forget to grant the querying principal IAM permissions on the S3 bucket used for query results, since that bucket is not governed by Lake Formation.

This is another thing I occasionally forget myself.
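As a reminder to myself, here is a sketch of an IAM policy document for the query results location. The bucket name and prefix are placeholders, `build_query_results_policy` is a hypothetical helper, and the action list is a minimal assumption that should be checked against the Athena documentation for your setup.

```python
import json
from typing import Any, Dict


def build_query_results_policy(bucket: str, prefix: str = "") -> Dict[str, Any]:
    """Build an IAM policy document for an Athena query results location."""
    prefix = prefix.strip("/")
    object_path = f"{prefix}/*" if prefix else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:AbortMultipartUpload"],
                "Resource": f"arn:aws:s3:::{bucket}/{object_path}",
            },
        ],
    }


policy = build_query_results_policy("example-athena-results", "analysts/")
print(json.dumps(policy, indent=2))
```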

Incremental Adoption Strategy

For new environments, enabling Lake Formation from the start is usually the best option.

For existing environments, a phased approach tends to work better.

I've done this kind of phased migration before, and while it's certainly possible, it's somewhat tedious.
That's one more reason to enable Lake Formation from day one if you're building a new environment.

Step 1: Gradual Opt-In with Hybrid Access Mode

Register S3 locations using Hybrid Access Mode
Opt in selected principals
Keep IAM-only access for others
Monitor access through CloudTrail

Step 2: Use Lake Formation for New Tables

Manage permissions for newly created tables through Lake Formation
Leave existing tables under IAMAllowedPrincipals

Step 3: Migrate Existing Tables

Gradually revoke IAMAllowedPrincipals permissions
Replace them with Lake Formation permissions
Validate behavior after each migration step
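For Step 3, the order of operations matters: grant the explicit permissions first and revoke the IAMAllowedPrincipals grant last, so there is no window in which nobody has access. The sketch below plans those calls for one table; the helper names and role ARNs are hypothetical, and `client` is assumed to be a boto3 `lakeformation` client.

```python
from typing import Any, Dict, List, Sequence, Tuple

Step = Tuple[str, Dict[str, Any]]  # (client method name, keyword arguments)


def _table_resource(database: str, table: str) -> Dict[str, Any]:
    return {"Table": {"DatabaseName": database, "Name": table}}


def build_migration_steps(database: str, table: str,
                          grants: Dict[str, Sequence[str]]) -> List[Step]:
    """Plan explicit grants followed by revoking IAMAllowedPrincipals."""
    steps: List[Step] = []
    for principal_arn, permissions in grants.items():
        steps.append(("grant_permissions", {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": _table_resource(database, table),
            "Permissions": list(permissions),
        }))
    # Revoke last so existing workloads keep working until grants are in place.
    steps.append(("revoke_permissions", {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": _table_resource(database, table),
        "Permissions": ["ALL"],  # shown as Super in the console
    }))
    return steps


def run_steps(client: Any, steps: Sequence[Step]) -> None:
    for method_name, kwargs in steps:
        getattr(client, method_name)(**kwargs)


steps = build_migration_steps("mydb", "sales", {
    "arn:aws:iam::123456789012:role/analyst-role": ["SELECT"],
    "arn:aws:iam::123456789012:role/glue-job-role": ["SELECT", "INSERT"],
})
for method_name, kwargs in steps:
    print(method_name, kwargs["Principal"]["DataLakePrincipalIdentifier"])
```

Running the plan one table at a time, and validating after each table, keeps the blast radius small if a grant turns out to be missing.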

Where Lake Formation Excels—and Where It Doesn't

Lake Formation is particularly valuable for:

Fine-grained table-, column-, and row-level access control
Consistent authorization across Athena, Glue, and Redshift Spectrum
Scalable permission management using LF-TBAC
Cross-account data sharing

However, some areas remain outside its scope:

Direct access control to raw files in S3
Business metadata management
Data quality management

Relationship with Amazon DataZone

As discussed in my previous article, Lake Formation and DataZone have complementary responsibilities.

Service	Role
Lake Formation	Technical governance (who can access what)
Amazon DataZone	Business governance (discovering, understanding, and requesting data)

A useful way to think about them is:

Lake Formation = Technical foundation for governance
DataZone = Business foundation for governance

Combined with Glue Data Catalog, these services form a comprehensive data catalog and governance solution on AWS.

Conclusion

In this article, I reviewed AWS Lake Formation from its fundamentals through practical implementation patterns.

While there is a learning curve, it is an extremely important service for implementing proper data governance.

Key takeaways:

Lake Formation complements IAM rather than replacing it, adding fine-grained table-, column-, and row-level controls.
Column, row, and cell-level security are implemented through Data Filters.
LF-TBAC reduces operational overhead as the number of tables grows.
Lake Formation integrates tightly with Glue Data Catalog by adding a governance layer on top of metadata management.
Understanding IAMAllowedPrincipals and using Hybrid Access Mode for gradual adoption is essential in existing environments.

Lake Formation certainly introduces some complexity, but when implementing proper access control in a data lake, the limitations of IAM alone eventually become apparent.

In environments where data is consumed by multiple teams and a wide variety of users, Lake Formation is well worth considering.

That said, successful adoption depends on maintaining consistency across IAM, Lake Formation, and S3 bucket policies, so careful planning is essential.

I hope this article helps anyone considering the adoption of Lake Formation.
