Original Japanese article: AWS Lake Formationの使い方について整理してみる
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
Previously, I wrote an article titled "Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies."
In that article, I mentioned that "AWS Lake Formation is necessary to complement data governance" but did not go into detail because it was outside the scope of the article.
This time, I'd like to organize my thoughts on Lake Formation, covering everything from the fundamentals to practical usage patterns.
Lake Formation is often perceived as a service that is "somewhat difficult" or "unnecessary because IAM is enough."
However, once you start implementing proper access control for a data lake, the necessity of Lake Formation becomes much clearer.
I hope this article helps you evaluate whether Lake Formation is worth adopting in your environment.
What Is Lake Formation?
AWS Lake Formation is a service that provides access management and governance for data lakes.
It allows you to centrally manage who can access which data and at what level.
One of its key strengths is the ability to manage access controls consistently across multiple AWS services such as Athena, Glue, and Redshift Spectrum.
Although they are often confused, Lake Formation and Glue Data Catalog serve different purposes.
| Service | Role |
|---|---|
| Glue Data Catalog | A technical catalog that manages metadata such as schemas and partitions |
| Lake Formation | A governance layer that manages access permissions for data registered in the Glue Data Catalog |
Amazon S3 (Actual Data)
↓
Glue Data Catalog (Metadata Management)
↓
Lake Formation (Access Control)
↓
Athena / Glue Job / Redshift Spectrum
In other words, data resides in S3, Glue Data Catalog manages metadata, and Lake Formation provides access control on top of that metadata layer.
How Lake Formation Differs from IAM
Isn't IAM Enough?
When managing a data lake on S3 using IAM alone, several challenges emerge:
- Granularity limitations: IAM primarily operates at the bucket or prefix level, making table-, column-, and row-level access control difficult.
- Operational complexity: As users and roles increase, S3 bucket policies and IAM policies become increasingly difficult to manage.
- Cross-account sharing: Implementing data sharing across AWS accounts using only IAM can lead to complicated designs.
- Limited visibility for auditing: It is hard to see at a glance who can access which tables.
Typical examples include:
- More than ten Athena users need different levels of access, making permission management increasingly complicated.
- Different departments should see different subsets of data. For example, the sales department should only see Eastern Japan sales, while executives can see all data.
- Personally identifiable information (PII) such as email addresses and credit card numbers should be hidden from analysts.
- Data needs to be shared with another AWS account.
Lake Formation addresses these challenges.
What Lake Formation Solves
With Lake Formation, you can implement:
- Fine-grained table-, column-, and row-level access control
- Permission management at the Glue Data Catalog database and table level
- Tag-based access control (LF-TBAC) for large-scale environments
- Cross-account data sharing through AWS RAM
- Centralized auditing through CloudTrail integration
The Relationship Between IAM and Lake Formation
Lake Formation does not replace IAM; it works as an additional layer on top of IAM.
When a query is executed (for example, through Athena), access is granted only if both conditions are satisfied:
IAM Permission
AND
Lake Formation Permission
↓
Access Allowed
Even if permissions are granted in Lake Formation, access is denied if IAM blocks it.
Likewise, even if IAM allows access, the request is denied if the corresponding Lake Formation permissions are missing.
Understanding this "AND" relationship is the foundation of permission design.
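The AND relationship can be sketched as a small Python model. This is only an illustration of the rule described above, not how AWS evaluates requests internally; the `AccessRequest` class and its two boolean fields are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical model of one request against a governed table."""
    iam_allows: bool             # outcome of the IAM policy evaluation
    lake_formation_allows: bool  # outcome of the Lake Formation permission check


def is_access_allowed(request: AccessRequest) -> bool:
    # Both layers must say yes; either one alone is not enough.
    return request.iam_allows and request.lake_formation_allows


def explain(request: AccessRequest) -> str:
    """Name the layer(s) blocking the request, which helps when debugging."""
    if is_access_allowed(request):
        return "allowed"
    blockers = []
    if not request.iam_allows:
        blockers.append("IAM")
    if not request.lake_formation_allows:
        blockers.append("Lake Formation")
    return "denied by " + " and ".join(blockers)


print(explain(AccessRequest(iam_allows=True, lake_formation_allows=False)))
```

When a query fails, checking each layer separately like this usually narrows the cause faster than staring at one policy.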
Lake Formation Permission Model
Lake Formation permissions are managed across multiple levels.
| Level | Target | Example Permissions |
|---|---|---|
| Data Lake Administrator | Entire Lake Formation environment | Full permissions |
| Database Level | Glue Data Catalog database | CREATE TABLE, DROP |
| Table Level | Individual table | SELECT, INSERT, ALTER |
| Column Level | Specific columns within a table | SELECT on selected columns |
| Row Level | Rows matching specific conditions | SELECT on filtered rows |
Permissions can be granted or revoked through the console, CLI, or SDK.
# Example: Grant SELECT permission on a table
# (AWS account IDs are 12 digits; 123456789012 is a placeholder)
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "Table": {
      "DatabaseName": "mydb",
      "Name": "sales_table"
    }
  }'
Column-Level and Row-Level Access Control
One of Lake Formation's strongest capabilities is fine-grained access control beyond the table level.
Both column-level and row-level security are implemented using a mechanism called Data Filters.
You create Data Filters in the console, CLI, or SDK and reference them when granting permissions.
Column-Level Security
Access can be restricted to specific columns.
Suppose the customer table contains the following columns:
| customer_id | name | email | credit_card | purchase_amount |
|---|---|---|---|---|
You could allow analysts to access only customer_id, name, and purchase_amount, while hiding email and credit_card.
This can be achieved simply by specifying included or excluded columns in a Data Filter.
Excluded columns will not appear in Athena query results.
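As a minimal sketch, a column-only filter for the customer table could be described with a payload like the one built below. The field names follow the `DataCellsFilter` structure as I understand it from the API reference, so please verify them before use; the filter name `hide-pii-filter` is a hypothetical example, and the account ID is a placeholder.

```python
import json
from typing import Iterable


def build_column_filter(catalog_id: str, database: str, table: str,
                        name: str, excluded_columns: Iterable[str]) -> dict:
    """Build a --table-data payload that hides columns but keeps every row."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # No row restriction: this filter only narrows columns.
        "RowFilter": {"AllRowsWildcard": {}},
        # Every column except these stays visible.
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
    }


# "hide-pii-filter" is a hypothetical filter name.
payload = build_column_filter(
    "123456789012", "mydb", "customer", "hide-pii-filter",
    ["email", "credit_card"],
)
print(json.dumps(payload, indent=2))
```

Saved to a file, the printed JSON could be passed to `aws lakeformation create-data-cells-filter` with `--table-data file://filter.json`.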
Row Filters
Row-level filters allow access only to rows matching specific conditions.
Filter expressions are written using PartiQL WHERE-clause syntax.
For example, if the sales table contains a region column and the Eastern Japan team should only see rows where region = 'east', you can create the following Data Filter:
aws lakeformation create-data-cells-filter \
  --table-data '{
    "TableCatalogId": "123456789012",
    "DatabaseName": "mydb",
    "TableName": "sales",
    "Name": "east-region-filter",
    "RowFilter": {
      "FilterExpression": "region = '\''east'\''"
    },
    "ColumnWildcard": {}
  }'
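Once a filter exists, the grant references it by name instead of naming the table directly. The sketch below only builds the keyword arguments for boto3's `grant_permissions`; the role name `east-sales-role` is a hypothetical example, and the `DataCellsFilter` resource shape is my reading of the API reference rather than something tested against a live account.

```python
def build_filter_grant(principal_arn: str, catalog_id: str, database: str,
                       table: str, filter_name: str) -> dict:
    """Keyword arguments for granting SELECT through a Data Filter."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            # Points at the filter, so the principal only sees what it exposes.
            "DataCellsFilter": {
                "TableCatalogId": catalog_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        "Permissions": ["SELECT"],
    }


grant_kwargs = build_filter_grant(
    "arn:aws:iam::123456789012:role/east-sales-role",  # hypothetical role
    "123456789012", "mydb", "sales", "east-region-filter",
)
# With credentials configured, this could be applied as:
#   boto3.client("lakeformation").grant_permissions(**grant_kwargs)
print(grant_kwargs["Resource"]["DataCellsFilter"]["Name"])
```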
Combining column filters and row filters enables cell-level security, where users can access only specific columns within specific rows.
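To make the cell-level idea concrete, here is a local simulation on in-memory rows. It is not how Lake Formation enforces filters; it only mimics the visible result. The sample rows, the `apply_data_filter` helper, and the column names other than `region` are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def apply_data_filter(rows: Iterable[Row],
                      row_predicate: Callable[[Row], bool],
                      excluded_columns: Iterable[str]) -> List[Row]:
    """Keep rows that satisfy the predicate, then drop the excluded columns."""
    excluded = set(excluded_columns)
    visible = []
    for row in rows:
        if not row_predicate(row):
            continue  # row filter: this row is invisible to the principal
        # column filter: strip hidden columns from the rows that remain
        visible.append({k: v for k, v in row.items() if k not in excluded})
    return visible


sales = [
    {"order_id": 1, "region": "east", "email": "a@example.com", "amount": 120},
    {"order_id": 2, "region": "west", "email": "b@example.com", "amount": 80},
    {"order_id": 3, "region": "east", "email": "c@example.com", "amount": 45},
]

east_without_email = apply_data_filter(
    sales, lambda row: row["region"] == "east", ["email"]
)
for row in east_without_email:
    print(row)
```

The Eastern Japan team ends up with orders 1 and 3 only, and neither row carries the `email` column.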
Data Filter Limitations
According to the official documentation:
- Up to 100 filters per principal
- array and map types are not supported in filter expressions (struct types can be used in row filters)
- Cell-level security does not support nested columns, views, or resource links
- Cell-level security is available in all regions when using Athena Engine Version 3 or Redshift Spectrum
Common Use Cases
- Protecting PII such as email addresses and credit card numbers
- Restricting business data by department or geographic region
- Compliance requirements for regulated data
Tag-Based Access Control (LF-TBAC)
As the number of databases and tables grows, managing permissions table by table becomes increasingly difficult.
LF-TBAC (Lake Formation Tag-Based Access Control) addresses this problem.
What Are LF-Tags?
LF-Tags are key-value tags unique to Lake Formation.
They are separate from both S3 resource tags and IAM tags and are managed independently within Lake Formation.
aws lakeformation create-lf-tag \
  --tag-key "sensitivity" \
  --tag-values '["public", "internal", "confidential"]'
Tagging Resources and Mapping Permissions
LF-Tags can be assigned to databases, tables, and columns.
aws lakeformation add-lf-tags-to-resource \
  --resource '{"Table": {"DatabaseName": "mydb", "Name": "sales"}}' \
  --lf-tags '[{"TagKey": "sensitivity", "TagValues": ["internal"]}]'
Permissions are then granted based on tags rather than table names.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "LFTagPolicy": {
      "ResourceType": "TABLE",
      "Expression": [{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}]
    }
  }'
This grants SELECT access to all tables tagged with either sensitivity=public or sensitivity=internal.
When new tables are created, simply assigning the appropriate LF-Tag automatically applies the correct permissions.
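The matching rule behind a tag expression can be sketched as follows. This is a simplified local model, assuming each resource carries at most one value per tag key and that a resource without the key does not match; the table names and the `matches_tag_expression` helper are hypothetical.

```python
from typing import Dict, List


def matches_tag_expression(resource_tags: Dict[str, str],
                           expression: List[dict]) -> bool:
    """Values within one key are alternatives (OR); separate keys combine with AND."""
    return all(
        resource_tags.get(condition["TagKey"]) in condition["TagValues"]
        for condition in expression
    )


# Hypothetical catalog: table name -> LF-Tags assigned to it.
tables = {
    "sales": {"sensitivity": "internal"},
    "customers": {"sensitivity": "confidential"},
    "products": {"sensitivity": "public"},
    "staging_raw": {},  # not tagged yet, so no tag-based grant applies
}

expression = [{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}]

readable = sorted(
    name for name, tags in tables.items()
    if matches_tag_expression(tags, expression)
)
print(readable)
```

Tagging `staging_raw` with `sensitivity=internal` later would make it readable without touching the grant, which is the operational benefit of LF-TBAC.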
Benefits in Large-Scale Environments
In environments with dozens or hundreds of tables, table-by-table permission management becomes unrealistic.
LF-TBAC enables a simpler model:
Roles can access data with specific tags.
However, tag design should be carefully planned from the beginning.
Defining categories such as sensitivity, domain, and owner early on can save significant effort later.
Integration with Glue Data Catalog
Lake Formation works closely with Glue Data Catalog.
Glue Data Catalog manages metadata, while Lake Formation governs access to that metadata.
Together they enable secure sharing and consumption of data stored in S3.
How Lake Formation Works with Data Catalog
When Lake Formation is enabled, access to Glue Data Catalog is routed through Lake Formation authorization checks.
This means that access to metadata itself—such as table definitions—can also be controlled.
Granting Lake Formation Permissions to Glue Jobs
When a Glue Job accesses data governed by Lake Formation, permissions must be granted not only through IAM but also through Lake Formation.
This is a common pitfall.
A typical issue is:
IAM permissions look correct, but the Glue Job still cannot read data.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/glue-job-role \
  --permissions SELECT \
  --resource '{
    "Table": {"DatabaseName": "mydb", "Name": "source_table"}
  }'
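When a job fails this way, one quick check is to compare what the role has actually been granted with what the job needs. The helper below works on a response shaped like the output of boto3's `list_permissions` as I understand it (a `PrincipalResourcePermissions` list); the sample response is hand-written for illustration, not real output.

```python
from typing import Iterable, List


def missing_permissions(response: dict, principal_arn: str,
                        required: Iterable[str]) -> List[str]:
    """Return the required permissions the principal has not been granted."""
    granted = set()
    for entry in response.get("PrincipalResourcePermissions", []):
        identifier = entry.get("Principal", {}).get("DataLakePrincipalIdentifier")
        if identifier == principal_arn:
            granted.update(entry.get("Permissions", []))
    return sorted(set(required) - granted)


GLUE_ROLE = "arn:aws:iam::123456789012:role/glue-job-role"

# Hand-written sample; a real response would come from
#   boto3.client("lakeformation").list_permissions(Resource={...})
sample_response = {
    "PrincipalResourcePermissions": [
        {
            "Principal": {"DataLakePrincipalIdentifier":
                          "arn:aws:iam::123456789012:role/analyst-role"},
            "Resource": {"Table": {"DatabaseName": "mydb", "Name": "source_table"}},
            "Permissions": ["SELECT"],
        }
    ]
}

# Only analyst-role appears in the sample, so the Glue role lacks SELECT.
print(missing_permissions(sample_response, GLUE_ROLE, ["SELECT"]))
```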
Cross-Account Sharing
Lake Formation supports cross-account data sharing through AWS RAM (Resource Access Manager).
Users in the target account can query shared tables directly from their own Athena environment.
Because Lake Formation permissions—including column and row filters—remain enforced, scenarios such as sharing data while excluding sensitive columns are supported.
Which cross-account features are available depends on the Cross account version settings value in the Data Catalog settings. The versions most relevant here are:
- Version 3 enables direct sharing with IAM principals in other accounts.
- Version 4 adds support for hybrid access mode in cross-account scenarios.
Integration with Athena and Redshift Spectrum
Authorization Flow During Query Execution
When Athena accesses a Lake Formation-managed table:
- A user executes a query in Athena.
- Athena requests table metadata from Glue Data Catalog.
- Lake Formation validates permissions.
- If authorized, access to data in S3 is allowed.
- Column and row filters are applied before results are returned.
This enables fine-grained access control without modifying S3 bucket policies.
Redshift Spectrum Integration
Since Redshift Spectrum also relies on Glue Data Catalog, Lake Formation permissions are enforced there as well.
This makes it easier to maintain consistent access control across Athena and Redshift Spectrum.
Adoption Challenges and Realistic Operations
Existing Environments: IAMAllowedPrincipals and Hybrid Access Mode
To preserve backward compatibility, Lake Formation grants the IAMAllowedPrincipals group Super permissions on existing Data Catalog resources by default.
In this state, access is effectively controlled by IAM alone, and Lake Formation's fine-grained controls are not enforced.
To fully leverage Lake Formation, these permissions must eventually be removed and replaced with explicit Lake Formation permissions.
However, switching everything at once can break existing workloads.
This is where Hybrid Access Mode becomes useful.
When registering S3 locations, Hybrid Access Mode allows selected principals to opt into Lake Formation authorization while other principals continue using IAM-only access.
This approach minimizes risk and enables gradual migration.
Personally, I believe this is the most practical approach for existing environments.
Common Pitfalls
Forgetting Lake Formation Permissions for Glue Jobs
As mentioned earlier, forgetting to grant Lake Formation permissions to Glue Job roles prevents ETL jobs from reading or writing data.
Many "it should work but doesn't" permission issues ultimately trace back to this.
I've forgotten it myself a few times and ended up scrambling to find the root cause.
Interaction with S3 Bucket Policies
Lake Formation does not override S3 bucket policies.
Even if access is granted in Lake Formation, requests are denied if the bucket policy blocks them.
When adopting Lake Formation, bucket policies must be designed to allow access through Lake Formation-authorized service roles.
Maintaining consistency among IAM, Lake Formation, and S3 bucket policies is critical.
Changing the design later can become painful, so it's worth thinking through carefully from the beginning.
Configuring Data Lake Administrators
When enabling Lake Formation for the first time, at least one Data Lake Administrator must be configured.
Relying on a single administrator can become an operational bottleneck, so I recommend assigning multiple administrators.
Athena Workgroups
When Athena Workgroups are used together with Lake Formation, behavior may vary depending on Workgroup configuration.
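In particular, don't forget to grant IAM permissions on the S3 bucket used for query results, because that location is accessed with IAM permissions rather than through Lake Formation grants.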
This is another thing I occasionally forget myself.
Incremental Adoption Strategy
For new environments, enabling Lake Formation from day one is usually the best option and saves trouble later.
For existing environments, a phased approach tends to work better.
I've migrated an existing environment this way before, and while it's certainly possible, it's somewhat tedious.
Step 1: Gradual Opt-In with Hybrid Access Mode
- Register S3 locations using Hybrid Access Mode
- Opt in selected principals
- Keep IAM-only access for others
- Monitor access through CloudTrail
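Step 1 could be scripted roughly as follows. The sketch only builds keyword arguments for two boto3 Lake Formation calls, `register_resource` and `create_lake_formation_opt_in`; the parameter names reflect my understanding of the API and should be checked against the reference. The bucket ARN and the helper name are hypothetical.

```python
from typing import Iterable


def build_hybrid_onboarding(s3_arn: str, principal_arns: Iterable[str],
                            database: str, table: str) -> dict:
    """Arguments for registering a location in hybrid mode and opting in principals."""
    return {
        "register_resource": {
            "ResourceArn": s3_arn,
            "UseServiceLinkedRole": True,
            # Keeps IAM-only access working for principals that are not opted in.
            "HybridAccessEnabled": True,
        },
        "opt_ins": [
            {
                "Principal": {"DataLakePrincipalIdentifier": arn},
                "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            }
            for arn in principal_arns
        ],
    }


plan = build_hybrid_onboarding(
    "arn:aws:s3:::example-data-lake-bucket/sales",  # hypothetical location
    ["arn:aws:iam::123456789012:role/analyst-role"],
    "mydb", "sales",
)
# With credentials configured, the plan could be applied as:
#   client = boto3.client("lakeformation")
#   client.register_resource(**plan["register_resource"])
#   for opt_in in plan["opt_ins"]:
#       client.create_lake_formation_opt_in(**opt_in)
print(len(plan["opt_ins"]), "principal(s) to opt in")
```

Opted-in principals still need explicit Lake Formation grants on the table; the opt-in only decides which authorization path applies to them.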
Step 2: Use Lake Formation for New Tables
- Manage permissions for newly created tables through Lake Formation
- Leave existing tables under IAMAllowedPrincipals
Step 3: Migrate Existing Tables
- Gradually revoke IAMAllowedPrincipals permissions
- Replace them with Lake Formation permissions
- Validate behavior after each migration step
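For Step 3, a per-table migration could be planned as an ordered list of calls. The ordering below (explicit grants first, then revoking `IAM_ALLOWED_PRINCIPALS`) is my own assumption for avoiding an access gap, not an official procedure, and the helper name is hypothetical. The `ALL` permission corresponds to what the console displays as Super.

```python
from typing import Dict, List, Tuple


def build_migration_calls(database: str, table: str,
                          grants: Dict[str, List[str]]) -> List[Tuple[str, dict]]:
    """Return (boto3 method name, kwargs) pairs in the order they should run."""

    def table_resource() -> dict:
        # Fresh dict per call so the payloads do not share state.
        return {"Table": {"DatabaseName": database, "Name": table}}

    calls: List[Tuple[str, dict]] = []
    # 1) Give each principal its explicit Lake Formation permissions.
    for principal_arn, permissions in grants.items():
        calls.append(("grant_permissions", {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": table_resource(),
            "Permissions": list(permissions),
        }))
    # 2) Only then remove the IAM-only fallback for this table.
    calls.append(("revoke_permissions", {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": table_resource(),
        "Permissions": ["ALL"],
    }))
    return calls


calls = build_migration_calls(
    "mydb", "sales",
    {"arn:aws:iam::123456789012:role/analyst-role": ["SELECT"]},
)
for method, kwargs in calls:
    print(method, kwargs["Principal"]["DataLakePrincipalIdentifier"])
```

Running one table at a time and validating queries after each revoke keeps the blast radius small if a grant was missed.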
Where Lake Formation Excels—and Where It Doesn't
Lake Formation is particularly valuable for:
- Fine-grained table-, column-, and row-level access control
- Consistent authorization across Athena, Glue, and Redshift Spectrum
- Scalable permission management using LF-TBAC
- Cross-account data sharing
However, some areas remain outside its scope:
- Direct access control to raw files in S3
- Business metadata management
- Data quality management
Relationship with Amazon DataZone
As discussed in my previous article, Lake Formation and DataZone have complementary responsibilities.
| Service | Role |
|---|---|
| Lake Formation | Technical governance (who can access what) |
| Amazon DataZone | Business governance (discovering, understanding, and requesting data) |
A useful way to think about them is:
- Lake Formation = Technical foundation for governance
- DataZone = Business foundation for governance
Combined with Glue Data Catalog, these services form a comprehensive data catalog and governance solution on AWS.
Conclusion
In this article, I reviewed AWS Lake Formation from its fundamentals through practical implementation patterns.
While there is a learning curve, it is an extremely important service for implementing proper data governance.
Key takeaways:
- Lake Formation complements IAM rather than replacing it, adding fine-grained table-, column-, and row-level controls.
- Column, row, and cell-level security are implemented through Data Filters.
- LF-TBAC reduces operational overhead as the number of tables grows.
- Lake Formation integrates tightly with Glue Data Catalog by adding a governance layer on top of metadata management.
- Understanding IAMAllowedPrincipals and using Hybrid Access Mode for gradual adoption is essential in existing environments.
Lake Formation certainly introduces some complexity, but when implementing proper access control in a data lake, the limitations of IAM alone eventually become apparent.
In environments where data is consumed by multiple teams and a wide variety of users, Lake Formation is well worth considering.
That said, successful adoption depends on maintaining consistency across IAM, Lake Formation, and S3 bucket policies, so careful planning is essential.
I hope this article helps anyone considering the adoption of Lake Formation.