Original Japanese article: Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
As data utilization within organizations has advanced in recent years, the importance of data catalogs has continued to grow.
When building a data platform on AWS, the first thing that typically comes to mind as a data catalog is AWS Glue Data Catalog.
Especially in data lake architectures centered around Amazon S3, AWS Glue Data Catalog is almost a prerequisite service. By combining it with services like Athena, AWS Glue, and Redshift Spectrum, it is possible to quickly stand up a minimal data platform.
However, as data usage evolves, you may encounter challenges such as:
- Not knowing which data to use
- Multiple datasets that look similar
- Being unable to determine whether data is trustworthy
- Not being able to trace how data was generated
At first glance, these may appear to be separate issues, but in reality, they all stem from a single root cause: an insufficient data catalog.
In this article, starting from AWS Glue Data Catalog, we will explore:
- The role of a data catalog
- The strengths and limitations of AWS Glue Data Catalog
- How to complement it within AWS
- How to approach building a data catalog on AWS
- A comparison with other data catalogs (OpenMetadata)
The conclusion is that AWS Glue Data Catalog is not a “data catalog” in the full sense.
Rather, it is a technical catalog used by query engines, and it is not sufficient as a catalog for humans to discover, understand, and trust data.
For this reason, a data catalog on AWS should not be designed as a single service, but as an architecture composed of multiple services.
What is a Data Catalog?
A data catalog is not simply about metadata management—it is a foundation that makes data usable.
Metadata can be understood as “data about data.”
Specifically, it includes information such as who created the data, what it means, how it is used, where it came from, how it flows, where it is stored, and what its quality is.
A data catalog centralizes this metadata and supports search and utilization.
Traditionally, data catalogs focused on table definitions and schema management. However, modern data catalogs are expected to include the following elements:
| Category | Description |
|---|---|
| Metadata Management | Technical metadata (schemas, types, partitions), business metadata (meaning, usage, owner), operational metadata (job logs, processing metrics) |
| Data Discovery | Data discovery, filtering, classification |
| Data Lineage | Tracking data generation and transformation |
| Data Quality | Reliability indicators, anomaly detection |
| Data Governance | Access control and permission management |
These elements span multiple domains defined in DMBOK, and it is rare for a single tool to cover all of them.
As will be explained later, AWS also requires combining multiple services to achieve this.
In the AWS context, it is more appropriate to think of a data catalog not as a single “service,” but as an “architecture.”
Role and Strengths of Glue Data Catalog
Glue Data Catalog is the core metadata management component in AWS and serves as a foundational element for operating a data platform.
Key Features of Glue Data Catalog
| Feature | Description |
|---|---|
| Metadata Storage | Persistent storage of structured metadata |
| Schema Management | Definition and updates of table schemas |
| Partition Management | Management of partition information |
| Statistics | Column statistics such as min/max values and null counts |
| Tagging | Classification using key-value pairs |
| API/SDK | Programmatic access |
| Data Lineage | Basic lineage is available; advanced visualization requires additional tools |
| Operational Metadata | CloudWatch logs, Spark UI, job execution insights |
| Advanced Discovery | Console browsing, attribute filtering, unified search |
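To make the API/SDK row concrete, here is a minimal sketch, assuming the boto3 library and placeholder database/table names, that reads one table's schema from the Glue Data Catalog:

```python
def columns_to_dict(storage_descriptor: dict) -> dict:
    """Flatten Glue's StorageDescriptor.Columns into a {name: type} mapping."""
    return {c["Name"]: c["Type"] for c in storage_descriptor.get("Columns", [])}

def fetch_table_schema(database: str, table: str) -> dict:
    """Look up one table's schema from the Glue Data Catalog via GetTable."""
    import boto3  # requires AWS credentials at call time
    glue = boto3.client("glue")
    resp = glue.get_table(DatabaseName=database, Name=table)
    return columns_to_dict(resp["Table"]["StorageDescriptor"])

# Hypothetical usage:
# fetch_table_schema("sales_db", "orders")  # e.g. {"order_id": "bigint", ...}
```

The same `GetTable` response also carries partition keys and table parameters, which is what query engines such as Athena consume.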
Strengths of Glue Data Catalog
Seamless Integration with AWS Services
While this may seem obvious, it is an important point: Glue Data Catalog integrates natively with AWS services such as Athena, Glue ETL, and Redshift Spectrum.
Because these services reference the same catalog, it ensures consistency in how data is accessed across the platform.
Strong Affinity with Data Lakes
In modern lakehouse architectures, this is a significant advantage.
Glue Data Catalog allows data stored in S3 to be cataloged directly.
This makes it possible to build a lakehouse using formats like Iceberg and manage it through Glue Data Catalog.
(Note: Iceberg table metadata itself resides in S3, while Glue Data Catalog functions as the catalog endpoint.)
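As a sketch of this Iceberg setup (the database, table, columns, and S3 paths are hypothetical), the Athena DDL that registers an Iceberg table in the Glue Data Catalog can be submitted through boto3:

```python
def iceberg_ddl(database: str, table: str, s3_location: str) -> str:
    """Athena DDL that registers an Iceberg table in the Glue Data Catalog."""
    return (
        f"CREATE TABLE {database}.{table} (\n"
        "  order_id bigint,\n"
        "  order_date date,\n"
        "  amount decimal(10, 2)\n"
        ")\n"
        "PARTITIONED BY (month(order_date))\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )

def run_in_athena(sql: str, output_s3: str) -> str:
    """Submit the DDL to Athena; requires AWS credentials at call time."""
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Once created this way, the table is queryable from Athena and Glue ETL through the same catalog entry.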
Is Glue Data Catalog Sufficient as a Data Catalog?
The conclusion is that Glue Data Catalog is a data platform–oriented catalog, not a user-oriented catalog.
It is highly effective as a technical metadata foundation referenced by analytics platforms.
However, its capabilities are limited when it comes to serving as a business catalog that enables users to discover, understand, and trust data.
In other words, Glue is extremely strong as the core (technical foundation) of a data catalog, but requires complementary services when used as a user-facing catalog that supports data utilization.
| Category | Support by Glue Alone | Notes |
|---|---|---|
| Technical Metadata | ○ | Schemas, types, partitions, column statistics |
| Business Metadata | △ | Descriptions, tags, classifications (advanced capabilities require Amazon DataZone) |
| Operational Metadata | △ | Job execution history is stored; detailed metrics are managed in CloudWatch |
| Data Discovery | △ | Console search and filtering (advanced capabilities require Amazon Q or Amazon DataZone) |
| Data Lineage | △ | Basic lineage (input/output tables in Glue ETL jobs) is captured; no end-to-end lineage |
| Data Quality | △ | Column statistics and auto statistics (advanced capabilities require Glue Data Quality or AWS Glue DataBrew) |
| Workflow Management | ✕ | Not handled by Glue Data Catalog |
| Data Governance | △ | IAM integration, resource policies, encryption (advanced capabilities require Lake Formation) |
| Data Profiling | ✕ | Not supported |
Supplement: How to Complement Areas Where Glue Alone Falls Short
Operational Metadata
Advanced metrics (e.g., processed record counts, error rates, memory usage) need to be managed using CloudWatch or AWS X-Ray.
Data Lineage
For end-to-end, advanced lineage visualization, you need Amazon DataZone, support for the OpenLineage specification, or the AWS Lineage API.
Data Quality
- Glue Data Quality: Rule-based validations (e.g., NULL checks, range checks)
- Glue DataBrew: Statistical profiling, distribution analysis, outlier detection (ML-based)
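As a sketch of the rule-based approach (the ruleset name, database, table, and column are hypothetical), the NULL and range checks above can be expressed in DQDL and registered through boto3:

```python
def age_ruleset() -> str:
    """DQDL ruleset: NULL check plus range check on an 'age' column."""
    return (
        'Rules = ['
        ' IsComplete "age",'
        ' ColumnValues "age" between 0 and 120'
        ' ]'
    )

def register_ruleset(name: str, database: str, table: str) -> None:
    """Register the ruleset against a catalog table; needs AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    glue.create_data_quality_ruleset(
        Name=name,
        Ruleset=age_ruleset(),
        TargetTable={"DatabaseName": database, "TableName": table},
    )
```

Evaluation runs against this ruleset then produce the quality scores referenced later in the article.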
Workflow Management
Utilize AWS Step Functions, Amazon Managed Workflows for Apache Airflow (MWAA), or AWS Glue Workflows.
Data Profiling
Perform statistical profiling with Glue DataBrew, and detect sensitive data (PII classification) using Amazon Macie.
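As a sketch of the Step Functions option mentioned above (the job name is hypothetical), a minimal Amazon States Language definition that runs one Glue job synchronously looks like this:

```python
import json

def glue_job_state_machine(job_name: str) -> str:
    """Minimal Step Functions (ASL) definition that runs one Glue job and waits."""
    return json.dumps({
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait for job completion
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "End": True,
            }
        },
    })
```

Real pipelines would chain further states (quality checks, crawlers, notifications) onto this skeleton.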
In summary, the following capabilities are not fully provided by Glue alone and must be complemented:
- Business metadata management
- Advanced data lineage
- Data quality
- Data discovery
- Workflow management
- Data governance
- Data profiling
The question then becomes how to realize these capabilities, which leads to combining multiple AWS services.
Complementing Glue Data Catalog as a Data Catalog
As discussed earlier, Glue Data Catalog alone is not sufficient as a complete data catalog.
In AWS, this gap is addressed by combining multiple services to complement its capabilities.
Here, we organize which capabilities are complemented by which services.
| Category | AWS Service | Description |
|---|---|---|
| Business Metadata | Amazon DataZone | Business glossary, data ownership definition, rich descriptions and context, data asset reviews and ratings |
| Data Lineage | Amazon DataZone | Lineage visualization, understanding data transformation flows, dependency management (end-to-end lineage requires OpenLineage or AWS Lineage APIs) |
| Data Quality | AWS Glue Data Quality / DataBrew | Data quality rule definition, scoring, anomaly detection, profiling (Glue Data Quality can auto-generate rules based on profiling results from DataBrew) |
| Data Discovery | Amazon DataZone / Amazon Q | Filtering, recommendations, related data suggestions, natural language search, AI-assisted analysis and insight generation |
| Workflow Management | AWS Step Functions / Amazon MWAA (Airflow) / Glue Workflows | Workflow orchestration |
| Data Governance | AWS Lake Formation | Column/row-level access control, tag-based access control, permissions management, data filtering |
| Data Profiling | AWS Glue DataBrew / Amazon Macie | Profiling, statistical analysis, sensitive data detection, PII classification |
As shown above, AWS enables a data catalog by combining multiple services with Glue Data Catalog at the core.
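To illustrate the governance row of the table, here is a sketch (the principal ARN, database, table, and column names are hypothetical) of a Lake Formation column-level SELECT grant via boto3:

```python
def build_column_grant(principal_arn: str, database: str,
                       table: str, columns: list) -> dict:
    """Request body for a Lake Formation column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # only these columns become visible
            }
        },
        "Permissions": ["SELECT"],
    }

def grant(principal_arn: str, database: str, table: str, columns: list) -> None:
    """Apply the grant; requires AWS credentials and Lake Formation admin rights."""
    import boto3
    lf = boto3.client("lakeformation")
    lf.grant_permissions(**build_column_grant(principal_arn, database, table, columns))
```

Queries issued through Athena or Redshift Spectrum by that principal are then filtered to the granted columns.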
Data Catalog Architecture
A data catalog on AWS, centered around Glue Data Catalog, can be organized into the following layered structure.
The key point is to view this not as individual services, but as an architecture composed of layers.
┌──────────────────────────────┐
│ Business Catalog Layer │ ← Amazon DataZone / Amazon Q
│ (Discovery / Glossary) │
└──────────────┬───────────────┘
│
┌──────────────┼───────────────┐
│ Governance / Quality Layer │ ← Lake Formation / Glue Data Quality
│ (Access Control / Quality) │
└──────────────┬───────────────┘
│
┌──────────────┼───────────────┐
│ Metadata Core Layer │ ← Glue Data Catalog
│ (Technical Metadata) │
└──────────────┬───────────────┘
│
┌──────────────┼───────────────┐
│ Processing / Query Layer │ ← Athena / Glue ETL / Redshift
│ (Query / ETL Processing) │
└──────────────┬───────────────┘
│
┌──────────────────────────────┐
│ Data Layer (S3) │ ← Raw / Curated Data
└──────────────────────────────┘
Roles of Each Layer
Business Catalog Layer
- Amazon DataZone: Entry point for business users to discover data, review it, and request access
- Amazon Q: AI assistant that supports natural language search, data analysis, and insight generation (e.g., “Where is the sales data for 2023?”)
Governance / Quality Layer
- Lake Formation: Column- and row-level access control, tag-based permission management
- Glue Data Quality: Definition and validation of data quality rules (e.g., “Check that the age column does not contain negative values”)
Metadata Core Layer
- Glue Data Catalog: Centralized management of technical metadata (schemas, statistics, partitions)
- Integration with S3: Automatically catalogs file structures in the data lake
Processing / Query Layer
- Athena / Redshift Spectrum: Query data directly on S3 using Glue Data Catalog
- Glue ETL: Executes transformation jobs based on metadata from the catalog
Data Layer
- S3: Stores raw data (CSV, Parquet, etc.) and processed (curated) data
Implementation Best Practices
1. Use Glue Data Catalog as the foundation
- Place it at the center since it integrates natively with services like Athena, Glue, and Redshift
2. Add a business-facing layer
- Introduce Amazon DataZone and build a business glossary
- Define data ownership and utilize review/rating features
3. Strengthen data quality and governance
- Define rules with Glue Data Quality (e.g., “Order amount must not be negative”)
- Apply access control with Lake Formation (e.g., “Finance team can only view accounting data”)
- Note: DataZone also supports IAM integration and access control independently
4. Visualize data lineage
- Use OpenLineage specifications to automatically capture input/output of Glue ETL jobs
- Visualize lineage graphs in Amazon DataZone
5. Enable profiling and sensitive data detection
- Use DataBrew for profiling (e.g., distribution analysis of columns)
- Use Amazon Macie for detecting and classifying PII
6. Improve search experience
- Integrate Amazon Q to enable natural language search (e.g., “Customer purchase history”)
Challenges of Adopting DataZone
Among the complementary services, one stands out as particularly important—but also challenging to adopt: Amazon DataZone.
In DataZone, data assets are managed as data products.
A data product represents a meaningful unit of business data (e.g., “Customer transaction data”) with clearly defined ownership and responsibility.
This structure clarifies who owns the data, forming the foundation for data quality and governance.
It also aligns well with Data Mesh principles, enabling domain-oriented data management.
DataZone provides what Glue lacks: a catalog for humans.
- Data asset cataloging
- Search and discovery
- Lineage visualization
- Data quality visibility
- Governance management
In other words, it extends the technical catalog into a business catalog.
While Glue Data Catalog is a “catalog for systems,” DataZone is a “catalog for people.”
However, adopting DataZone requires meeting organizational, operational, and technical prerequisites.
1. Organizational Prerequisites
This is often the most difficult part.
Data Domain Design
Data domains—logical groupings of business data with clear ownership—must be defined.
Since DataZone manages data at the domain level, unclear boundaries make operations unsustainable.
In reality, many organizations have not formalized domain design, making this the first major challenge.
Data Ownership
Each data asset must have a clearly defined owner.
Data is treated as a “data product,” and each domain is responsible for managing its own data.
However, in many organizations, ownership is ambiguous or fragmented.
Responsibility Definition
Responsibilities for data quality, access control, and updates must be defined.
This forms the basis for governance and approval workflows.
In practice, aligning responsibilities across departments often becomes a bottleneck.
2. Operational Prerequisites
Approval Workflows
Processes for requesting and approving data access must be established.
Data Classification
Standardized classification rules based on sensitivity and usage are required.
Usage Policies
Guidelines and compliance rules for data usage must be clearly defined.
3. Technical Prerequisites
Lineage Collection
DataZone visualizes lineage, but only if lineage data exists.
This requires:
- Integration with processing systems (Glue, Redshift, etc.)
- Adoption of standards like OpenLineage
- Designing metadata collection within pipelines
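As a sketch of what such lineage collection produces (the namespaces, dataset names, and producer URI are hypothetical), a minimal OpenLineage COMPLETE run event for a Glue ETL job could be assembled like this:

```python
import uuid
from datetime import datetime, timezone

def openlineage_complete_event(job_name: str, inputs: list, outputs: list) -> dict:
    """Minimal OpenLineage COMPLETE run event for one pipeline job."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "glue", "name": job_name},
        "inputs": [{"namespace": "s3", "name": n} for n in inputs],
        "outputs": [{"namespace": "s3", "name": n} for n in outputs],
        # Hypothetical producer URI identifying the emitting pipeline
        "producer": "https://example.com/my-pipeline",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }
```

Events in this shape, posted to a lineage endpoint from each pipeline step, are what allow DataZone (or any OpenLineage consumer) to stitch together an end-to-end graph.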
Metadata Integration
Metadata from various services must be integrated into DataZone:
- Catalog integration (Glue / Redshift / S3)
- Data quality metadata (Glue Data Quality)
- Access control metadata (Lake Formation)
This integration enables a consistent data catalog experience.
In summary, DataZone does not automatically solve data governance problems.
It requires the following conditions:
- A data-driven culture is emerging
- Cross-functional collaboration exists
- Awareness of data quality is high
- Continuous improvement processes are in place
Without these, the catalog risks becoming a formality that is not actually used.
A Practical Approach to Adopting DataZone
Given the complexity, a phased approach is often effective.
Phase 1: Foundation (Data Platform Team)
- Establish technical metadata with Glue Data Catalog
- Basic data classification
- Simple access control
Phase 2: Governance (Involving Governance Teams)
- Implement fine-grained access control with Lake Formation
  - Example: Apply classification tags at the column level and deny SELECT access to PII-tagged columns via IAM policies
- Introduce data quality monitoring
- Establish basic lineage
Phase 3: DataZone (Business-Led)
- Introduce once organizational prerequisites are met
- Manage business metadata
- Enable self-service analytics
DataZone becomes effective only when the organization reaches a certain level of maturity.
It is not just a tool, but a mechanism for organizational transformation.
Technical readiness alone is not sufficient—cultural and process changes are required.
Considering OpenMetadata
OpenMetadata is an open-source data catalog that supports a wide range of platforms.
https://open-metadata.org/
It provides:
- Connectors for various data sources
- Data lineage
- Data quality management
- Search and UI capabilities
It can function as a comprehensive data catalog on its own.
This raises the question: why not use OpenMetadata from the start?
The answer depends on your context—specifically, which layer you want to implement the catalog in (infrastructure-oriented vs. business-oriented).
In AWS-centric environments, Glue Data Catalog integrates natively with services like Athena, Redshift, and Glue, providing consistency and operational simplicity.
Therefore, if your architecture is primarily within AWS, it is reasonable to center your design around Glue.
On the other hand, OpenMetadata becomes advantageous when:
- Managing across multiple clouds (AWS / GCP / Azure)
- Integrating diverse sources (SaaS, on-premises, etc.)
- Requiring flexible and customizable metadata management
- Designing a business-centric catalog from the beginning
However, adopting OpenMetadata requires:
- Infrastructure setup (ECS/EKS or VMs, metadata storage such as PostgreSQL)
- Operations (monitoring with Prometheus/Grafana, scaling, upgrades)
- Security design (RBAC, authentication/authorization, encryption)
Compared to AWS managed services, it offers flexibility at the cost of increased operational overhead.
In summary:
| Situation | Recommendation |
|---|---|
| AWS only / small scale | Glue Data Catalog centered |
| AWS + governance needs | + Lake Formation |
| Advanced data usage | + DataZone |
| Multi-cloud | Consider OpenMetadata |
OpenMetadata can also integrate with Glue Data Catalog as a catalog provider.
Its Glue connector can ingest metadata from Glue and map it to OpenMetadata constructs such as glossaries and data products.
Conclusion
In this article, we explored data catalogs centered around AWS Glue Data Catalog.
A data catalog is essential to prevent a data lake from becoming a data swamp.
What matters is not the tool itself, but:
- How metadata is managed
- Who owns the data
- How it continues to be used
This requires designing not only technology, but also organization and processes.
In other words, a data catalog is not just a tool—it is an organizational system.
Glue Data Catalog plays a central role, but it cannot form a complete data catalog on its own.
On AWS, a data catalog should be designed not as a single service, but as an architecture centered around Glue Data Catalog.
And most importantly, a data catalog should be viewed not merely as a technical foundation, but as a foundation for enabling organizations to operate with data.
I hope this article helps those considering data catalogs on AWS.