Original Japanese article: Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
As data utilization within organizations has advanced in recent years, the importance of data catalogs has continued to grow.
When building a data platform on AWS, the first thing that typically comes to mind as a data catalog is AWS Glue Data Catalog.
Especially in data lake architectures centered around Amazon S3, AWS Glue Data Catalog is almost a prerequisite service. By combining it with services like Athena, AWS Glue, and Redshift Spectrum, it is possible to quickly stand up a minimal data platform.
However, as data usage evolves, you may encounter challenges such as:
- Not knowing which data to use
- Multiple datasets that look similar
- Being unable to determine whether data is trustworthy
- Not being able to trace how data was generated
At first glance, these may appear to be separate issues, but in reality, they all stem from a single root cause: an insufficient data catalog.
In this article, starting from AWS Glue Data Catalog, we will explore:
- The role of a data catalog
- The strengths and limitations of AWS Glue Data Catalog
- How to complement it within AWS
- How to approach building a data catalog on AWS
- A comparison with other data catalogs (OpenMetadata)
The conclusion is that AWS Glue Data Catalog is not a “data catalog” in the full sense.
Rather, it is a technical catalog used by query engines, and it is not sufficient as a catalog for humans to discover, understand, and trust data.
For this reason, a data catalog on AWS should not be designed as a single service, but as an architecture composed of multiple services.
What is a Data Catalog?
A data catalog is not simply about metadata management—it is a foundation that makes data usable.
Metadata can be understood as “data about data.”
Specifically, it includes information such as who created the data, what it means, how it is used, where it came from, how it flows, where it is stored, and what its quality is.
A data catalog centralizes this metadata and supports search and utilization.
Traditionally, data catalogs focused on table definitions and schema management. However, modern data catalogs are expected to include the following elements:
| Category | Description |
|---|---|
| Metadata Management | Technical metadata (schemas, types, partitions), business metadata (meaning, usage, owner), operational metadata (job logs, processing metrics) |
| Data Discovery | Data discovery, filtering, classification |
| Data Lineage | Tracking data generation and transformation |
| Data Quality | Reliability indicators, anomaly detection |
| Data Governance | Access control and permission management |
These elements span multiple domains defined in DMBOK, and it is rare for a single tool to cover all of them.
As will be explained later, AWS also requires combining multiple services to achieve this.
In the AWS context, it is more appropriate to think of a data catalog not as a single “service,” but as an “architecture.”
Role and Strengths of Glue Data Catalog
Glue Data Catalog is the core metadata management component in AWS and serves as a foundational element for operating a data platform.
Key Features of Glue Data Catalog
| Feature | Description |
|---|---|
| Metadata Storage | Persistent storage of structured metadata |
| Schema Management | Definition and updates of table schemas |
| Partition Management | Management of partition information |
| Statistics | Column statistics such as min/max values and null counts |
| Tagging | Classification using key-value pairs |
| API/SDK | Programmatic access |
| Data Lineage | Basic lineage is available; advanced visualization requires additional tools |
| Operational Metadata | CloudWatch logs, Spark UI, job execution insights |
| Advanced Discovery | Console browsing, attribute filtering, unified search |
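To make the API/SDK row concrete, here is a minimal sketch, assuming the boto3 library and placeholder database/table names, that reads one table's schema from the Glue Data Catalog:

```python
def columns_to_dict(storage_descriptor: dict) -> dict:
    """Flatten Glue's StorageDescriptor.Columns into a {name: type} mapping."""
    return {c["Name"]: c["Type"] for c in storage_descriptor.get("Columns", [])}

def fetch_table_schema(database: str, table: str) -> dict:
    """Look up one table's schema from the Glue Data Catalog via GetTable."""
    import boto3  # requires AWS credentials at call time
    glue = boto3.client("glue")
    resp = glue.get_table(DatabaseName=database, Name=table)
    return columns_to_dict(resp["Table"]["StorageDescriptor"])

# Hypothetical usage:
# fetch_table_schema("sales_db", "orders")  # e.g. {"order_id": "bigint", ...}
```

The same `GetTable` response also carries partition keys and table parameters, which is what query engines such as Athena consume.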
Strengths of Glue Data Catalog
Seamless Integration with AWS Services
While this may seem obvious, it is an important point: Glue Data Catalog integrates natively with AWS services such as Athena, Glue ETL, and Redshift Spectrum.
Because these services reference the same catalog, it ensures consistency in how data is accessed across the platform.
Strong Affinity with Data Lakes
In modern lakehouse architectures, this is a significant advantage.
Glue Data Catalog allows data stored in S3 to be cataloged directly.
This makes it possible to build a lakehouse using formats like Iceberg and manage it through Glue Data Catalog.
(Note: Iceberg table metadata itself resides in S3, while Glue Data Catalog functions as the catalog endpoint.)
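As a sketch of this Iceberg setup (the database, table, columns, and S3 paths are hypothetical), the Athena DDL that registers an Iceberg table in the Glue Data Catalog can be submitted through boto3:

```python
def iceberg_ddl(database: str, table: str, s3_location: str) -> str:
    """Athena DDL that registers an Iceberg table in the Glue Data Catalog."""
    return (
        f"CREATE TABLE {database}.{table} (\n"
        "  order_id bigint,\n"
        "  order_date date,\n"
        "  amount decimal(10, 2)\n"
        ")\n"
        "PARTITIONED BY (month(order_date))\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )

def run_in_athena(sql: str, output_s3: str) -> str:
    """Submit the DDL to Athena; requires AWS credentials at call time."""
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Once created this way, the table is queryable from Athena and Glue ETL through the same catalog entry.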
Is Glue Data Catalog Sufficient as a Data Catalog?
The conclusion is that Glue Data Catalog is a data platform–oriented catalog, not a user-oriented catalog.
It is highly effective as a technical metadata foundation referenced by analytics platforms.
However, its capabilities are limited when it comes to serving as a business catalog that enables users to discover, understand, and trust data.
In other words, Glue is extremely strong as the core (technical foundation) of a data catalog, but requires complementary services when used as a user-facing catalog that supports data utilization.
| Category | Support by Glue Alone | Notes |
|---|---|---|
| Technical Metadata | ○ | Schemas, types, partitions, column statistics |
| Business Metadata | △ | Descriptions, tags, classifications (advanced capabilities require Amazon DataZone) |
| Operational Metadata | △ | Job execution history is stored; detailed metrics are managed in CloudWatch |
| Data Discovery | △ | Console search and filtering (advanced capabilities require Amazon Q or Amazon DataZone) |
| Data Lineage | △ | Basic lineage (input/output tables in Glue ETL jobs) is captured; no end-to-end lineage |
| Data Quality | △ | Column statistics and auto statistics (advanced capabilities require Glue Data Quality or AWS Glue DataBrew) |
| Workflow Management | ✕ | Not handled by Glue Data Catalog |
| Data Governance | △ | IAM integration, resource policies, encryption (advanced capabilities require Lake Formation) |
| Data Profiling | ✕ | Not supported |
Supplement: How to Complement Areas Where Glue Alone Falls Short
Operational Metadata
Advanced metrics (e.g., processed record counts, error rates, memory usage) need to be managed using CloudWatch or AWS X-Ray.
Data Lineage
For end-to-end, advanced lineage visualization, you need Amazon DataZone, support for the OpenLineage specification, or the AWS Lineage API.
Data Quality
- Glue Data Quality: Rule-based validations (e.g., NULL checks, range checks)
- Glue DataBrew: Statistical profiling, distribution analysis, outlier detection (ML-based)
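As a sketch of the rule-based approach (the ruleset name, database, table, and column are hypothetical), the NULL and range checks above can be expressed in DQDL and registered through boto3:

```python
def age_ruleset() -> str:
    """DQDL ruleset: NULL check plus range check on an 'age' column."""
    return (
        'Rules = ['
        ' IsComplete "age",'
        ' ColumnValues "age" between 0 and 120'
        ' ]'
    )

def register_ruleset(name: str, database: str, table: str) -> None:
    """Register the ruleset against a catalog table; needs AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    glue.create_data_quality_ruleset(
        Name=name,
        Ruleset=age_ruleset(),
        TargetTable={"DatabaseName": database, "TableName": table},
    )
```

Evaluation runs against this ruleset then produce the quality scores referenced later in the article.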
Workflow Management
Utilize AWS Step Functions, Amazon Managed Workflows for Apache Airflow (MWAA), or AWS Glue Workflows.
Data Profiling
Perform statistical profiling with Glue DataBrew, and detect sensitive data (PII classification) using Amazon Macie.
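As a sketch of the Step Functions option mentioned above (the job name is hypothetical), a minimal Amazon States Language definition that runs one Glue job synchronously looks like this:

```python
import json

def glue_job_state_machine(job_name: str) -> str:
    """Minimal Step Functions (ASL) definition that runs one Glue job and waits."""
    return json.dumps({
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait for job completion
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "End": True,
            }
        },
    })
```

Real pipelines would chain further states (quality checks, crawlers, notifications) onto this skeleton.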
In summary, the following capabilities are not fully provided by Glue alone and must be complemented:
- Business metadata management
- Advanced data lineage
- Data quality
- Data discovery
- Workflow management
- Data governance
- Data profiling
The question then becomes how to realize these capabilities, which leads to combining multiple AWS services.
Complementing Glue Data Catalog as a Data Catalog
As discussed earlier, Glue Data Catalog alone is not sufficient as a complete data catalog.
In AWS, this gap is addressed by combining multiple services to complement its capabilities.
Here, we organize which capabilities are complemented by which services.
| Category | AWS Service | Description |
|---|---|---|
| Business Metadata | Amazon DataZone | Business glossary, data ownership definition, rich descriptions and context, data asset reviews and ratings |
| Data Lineage | Amazon DataZone | Lineage visualization, understanding data transformation flows, dependency management (end-to-end lineage requires OpenLineage or AWS Lineage APIs) |
| Data Quality | AWS Glue Data Quality / DataBrew | Data quality rule definition, scoring, anomaly detection, profiling (Glue Data Quality can auto-generate rules based on profiling results from DataBrew) |
| Data Discovery | Amazon DataZone / Amazon Q | Filtering, recommendations, related data suggestions, natural language search, AI-assisted analysis and insight generation |
| Workflow Management | AWS Step Functions / Amazon MWAA (Airflow) / Glue Workflows | Workflow orchestration |
| Data Governance | AWS Lake Formation | Column/row-level access control, tag-based access control, permissions management, data filtering |
| Data Profiling | AWS Glue DataBrew / Amazon Macie | Profiling, statistical analysis, sensitive data detection, PII classification |
As shown above, AWS enables a data catalog by combining multiple services with Glue Data Catalog at the core.
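To illustrate the governance row of the table, here is a sketch (the principal ARN, database, table, and column names are hypothetical) of a Lake Formation column-level SELECT grant via boto3:

```python
def build_column_grant(principal_arn: str, database: str,
                       table: str, columns: list) -> dict:
    """Request body for a Lake Formation column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # only these columns become visible
            }
        },
        "Permissions": ["SELECT"],
    }

def grant(principal_arn: str, database: str, table: str, columns: list) -> None:
    """Apply the grant; requires AWS credentials and Lake Formation admin rights."""
    import boto3
    lf = boto3.client("lakeformation")
    lf.grant_permissions(**build_column_grant(principal_arn, database, table, columns))
```

Queries issued through Athena or Redshift Spectrum by that principal are then filtered to the granted columns.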
Data Catalog Architecture
A data catalog on AWS, centered around Glue Data Catalog, can be organized into the following layered structure.
The key point is to view this not as individual services, but as an architecture composed of layers.
┌──────────────────────────────┐
│ Business Catalog Layer │ ← Amazon DataZone / Amazon Q
│ (Discovery / Glossary) │
└──────────────┬───────────────┘
│
┌──────────────┼───────────────┐
│ Governance / Quality Layer │ ← Lake Formation / Glue Data Quality
│ (Access Control / Quality) │
└──────────────┬───────────────┘
│
┌──────────────┼───────────────┐
│ Metadata Core Layer │ ← Glue Data Catalog
│ (Technical Metadata) │
└──────────────┬───────────────┘
│
┌──────────────┼───────────────┐
│ Processing / Query Layer │ ← Athena / Glue ETL / Redshift
│ (Query / ETL Processing) │
└──────────────┬───────────────┘
│
┌──────────────────────────────┐
│ Data Layer (S3) │ ← Raw / Curated Data
└──────────────────────────────┘
Roles of Each Layer
Business Catalog Layer
- Amazon DataZone: Entry point for business users to discover data, review it, and request access
- Amazon Q: AI assistant that supports natural language search, data analysis, and insight generation (e.g., “Where is the sales data for 2023?”)
Governance / Quality Layer
- Lake Formation: Column- and row-level access control, tag-based permission management
- Glue Data Quality: Definition and validation of data quality rules (e.g., “Check that the age column does not contain negative values”)
Metadata Core Layer
- Glue Data Catalog: Centralized management of technical metadata (schemas, statistics, partitions)
- Integration with S3: Automatically catalogs file structures in the data lake
Processing / Query Layer
- Athena / Redshift Spectrum: Query data directly on S3 using Glue Data Catalog
- Glue ETL: Executes transformation jobs based on metadata from the catalog
Data Layer
- S3: Stores raw data (CSV, Parquet, etc.) and processed (curated) data
Implementation Best Practices
1. Use Glue Data Catalog as the foundation
- Place it at the center since it integrates natively with services like Athena, Glue, and Redshift
2. Add a business-facing layer
- Introduce Amazon DataZone and build a business glossary
- Define data ownership and utilize review/rating features
3. Strengthen data quality and governance
- Define rules with Glue Data Quality (e.g., “Order amount must not be negative”)
- Apply access control with Lake Formation (e.g., “Finance team can only view accounting data”)
- Note: DataZone also supports IAM integration and access control independently
4. Visualize data lineage
- Use OpenLineage specifications to automatically capture input/output of Glue ETL jobs
- Visualize lineage graphs in Amazon DataZone
5. Enable profiling and sensitive data detection
- Use DataBrew for profiling (e.g., distribution analysis of columns)
- Use Amazon Macie for detecting and classifying PII
6. Improve search experience
- Integrate Amazon Q to enable natural language search (e.g., “Customer purchase history”)
Challenges of Adopting DataZone
Among the complementary services, one stands out as particularly important—but also challenging to adopt: Amazon DataZone.
In DataZone, data assets are managed as data products.
A data product represents a meaningful unit of business data (e.g., “Customer transaction data”) with clearly defined ownership and responsibility.
This structure clarifies who owns the data, forming the foundation for data quality and governance.
It also aligns well with Data Mesh principles, enabling domain-oriented data management.
DataZone provides what Glue lacks: a catalog for humans.
- Data asset cataloging
- Search and discovery
- Lineage visualization
- Data quality visibility
- Governance management
In other words, it extends the technical catalog into a business catalog.
While Glue Data Catalog is a “catalog for systems,” DataZone is a “catalog for people.”
However, adopting DataZone requires meeting organizational, operational, and technical prerequisites.
1. Organizational Prerequisites
This is often the most difficult part.
Data Domain Design
Data domains—logical groupings of business data with clear ownership—must be defined.
Since DataZone manages data at the domain level, unclear boundaries make operations unsustainable.
In reality, many organizations have not formalized domain design, making this the first major challenge.
Data Ownership
Each data asset must have a clearly defined owner.
Data is treated as a “data product,” and each domain is responsible for managing its own data.
However, in many organizations, ownership is ambiguous or fragmented.
Responsibility Definition
Responsibilities for data quality, access control, and updates must be defined.
This forms the basis for governance and approval workflows.
In practice, aligning responsibilities across departments often becomes a bottleneck.
2. Operational Prerequisites
Approval Workflows
Processes for requesting and approving data access must be established.
Data Classification
Standardized classification rules based on sensitivity and usage are required.
Usage Policies
Guidelines and compliance rules for data usage must be clearly defined.
3. Technical Prerequisites
Lineage Collection
DataZone visualizes lineage, but only if lineage data exists.
This requires:
- Integration with processing systems (Glue, Redshift, etc.)
- Adoption of standards like OpenLineage
- Designing metadata collection within pipelines
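As a sketch of what such lineage collection produces (the namespaces, dataset names, and producer URI are hypothetical), a minimal OpenLineage COMPLETE run event for a Glue ETL job could be assembled like this:

```python
import uuid
from datetime import datetime, timezone

def openlineage_complete_event(job_name: str, inputs: list, outputs: list) -> dict:
    """Minimal OpenLineage COMPLETE run event for one pipeline job."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "glue", "name": job_name},
        "inputs": [{"namespace": "s3", "name": n} for n in inputs],
        "outputs": [{"namespace": "s3", "name": n} for n in outputs],
        # Hypothetical producer URI identifying the emitting pipeline
        "producer": "https://example.com/my-pipeline",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }
```

Events in this shape, posted to a lineage endpoint from each pipeline step, are what allow DataZone (or any OpenLineage consumer) to stitch together an end-to-end graph.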
Metadata Integration
Metadata from various services must be integrated into DataZone:
- Catalog integration (Glue / Redshift / S3)
- Data quality metadata (Glue Data Quality)
- Access control metadata (Lake Formation)
This integration enables a consistent data catalog experience.
In summary, DataZone does not automatically solve data governance problems.
It requires the following conditions:
- A data-driven culture is emerging
- Cross-functional collaboration exists
- Awareness of data quality is high
- Continuous improvement processes are in place
Without these, the catalog risks becoming a formality that is not actually used.
A Practical Approach to Adopting DataZone
Given the complexity, a phased approach is often effective.
Phase 1: Foundation (Data Platform Team)
- Establish technical metadata with Glue Data Catalog
- Basic data classification
- Simple access control
Phase 2: Governance (Involving Governance Teams)
- Implement fine-grained access control with Lake Formation
  - Example: Apply classification tags at the column level and deny SELECT access to PII-tagged columns via IAM policies
- Introduce data quality monitoring
- Establish basic lineage
Phase 3: DataZone (Business-Led)
- Introduce once organizational prerequisites are met
- Manage business metadata
- Enable self-service analytics
DataZone becomes effective only when the organization reaches a certain level of maturity.
It is not just a tool, but a mechanism for organizational transformation.
Technical readiness alone is not sufficient—cultural and process changes are required.
Considering OpenMetadata
OpenMetadata is an open-source data catalog that supports a wide range of platforms.
https://open-metadata.org/
It provides:
- Connectors for various data sources
- Data lineage
- Data quality management
- Search and UI capabilities
It can function as a comprehensive data catalog on its own.
This raises the question: why not use OpenMetadata from the start?
The answer depends on your context—specifically, which layer you want to implement the catalog in (infrastructure-oriented vs. business-oriented).
In AWS-centric environments, Glue Data Catalog integrates natively with services like Athena, Redshift, and Glue, providing consistency and operational simplicity.
Therefore, if your architecture is primarily within AWS, it is reasonable to center your design around Glue.
On the other hand, OpenMetadata becomes advantageous when:
- Managing across multiple clouds (AWS / GCP / Azure)
- Integrating diverse sources (SaaS, on-premises, etc.)
- Requiring flexible and customizable metadata management
- Designing a business-centric catalog from the beginning
However, adopting OpenMetadata requires:
- Infrastructure setup (ECS/EKS or VMs, metadata storage such as PostgreSQL)
- Operations (monitoring with Prometheus/Grafana, scaling, upgrades)
- Security design (RBAC, authentication/authorization, encryption)
Compared to AWS managed services, it offers flexibility at the cost of increased operational overhead.
In summary:
| Situation | Recommendation |
|---|---|
| AWS only / small scale | Glue Data Catalog centered |
| AWS + governance needs | + Lake Formation |
| Advanced data usage | + DataZone |
| Multi-cloud | Consider OpenMetadata |
OpenMetadata can also integrate with Glue Data Catalog as a catalog provider.
Its Glue connector can ingest metadata from Glue and map it to OpenMetadata constructs such as glossaries and data products.
Conclusion
In this article, we explored data catalogs centered around AWS Glue Data Catalog.
A data catalog is essential to prevent a data lake from becoming a data swamp.
What matters is not the tool itself, but:
- How metadata is managed
- Who owns the data
- How it continues to be used
This requires designing not only technology, but also organization and processes.
In other words, a data catalog is not just a tool—it is an organizational system.
Glue Data Catalog plays a central role, but it cannot form a complete data catalog on its own.
On AWS, a data catalog should be designed not as a single service, but as an architecture centered around Glue Data Catalog.
And most importantly, a data catalog should be viewed not merely as a technical foundation, but as a foundation for enabling organizations to operate with data.
I hope this article helps those considering data catalogs on AWS.