Ahmed Adel for AWS Community Builders

Posted on Dec 4, 2024

Discover Amazon S3 Metadata: A New Way to Explore Your Storage at AWS re:Invent 2024

#aws #s3 #cloud #data

At AWS re:Invent 2024, Amazon Web Services (AWS) introduced Amazon S3 Metadata, a new feature of Amazon Simple Storage Service (S3). It simplifies the way we explore and analyze the metadata of our S3 objects, helping businesses streamline workflows and get more insight from their data.

Here’s everything you need to know about this powerful new feature.

The Challenge of Scale

Organizations leveraging Amazon S3 often deal with massive datasets — billions or even trillions of objects in a single bucket. Identifying specific objects based on characteristics like size, tags, or patterns in their keys is no easy task. Historically, businesses had to build custom systems to manage and query metadata, which could be complex, hard to scale, and prone to falling out of sync with the actual data.

What is Amazon S3 Metadata?

Amazon S3 Metadata introduces automated metadata capture for objects stored in S3 buckets. This metadata is stored in Apache Iceberg tables, enabling compatibility with tools like:

Amazon Athena
Amazon Redshift
Amazon QuickSight
Apache Spark

With these tools, you can perform scalable queries on metadata to find objects of interest efficiently, whether for analytics, data processing, or AI training.

Rich Metadata Elements

The metadata schema includes over 20 elements, such as:

Bucket Name and Object Key
Creation/Modification Time
Storage Class
Encryption Details
Object Tags
User Metadata

Additionally, the feature supports storing application-specific metadata in separate tables for advanced queries.

How Does It Work?

1. Enable Metadata Capture

To get started, designate a table bucket and a table to store your metadata. Once capture is enabled, S3 automatically records a metadata update whenever an object is created, modified, or deleted. Each update includes:

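Record Type: CREATE, UPDATE_METADATA, or DELETE
Sequence Number: Orders the records for a given object, so you can tell which change is the most recent
Timestamps: Capture when each record was written and when the object was last modified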
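
Because the table is a log of changes rather than a snapshot, a common first step is to collapse it to the latest record per object. The snippet below is a minimal sketch of that idea in plain Python. The record layout (dicts with key, sequence_number, record_type, and size) and the sample values are assumptions made for illustration; it relies only on the fact that, for a given key, a lexicographically larger sequence number means a more recent record.

```python
from typing import Iterable


def current_objects(records: Iterable[dict]) -> dict:
    """Collapse change records into the latest known state per object key."""
    latest = {}
    for rec in records:
        key = rec["key"]
        # Keep only the record with the largest sequence number for each key.
        if key not in latest or rec["sequence_number"] > latest[key]["sequence_number"]:
            latest[key] = rec
    # An object whose most recent record is a DELETE no longer exists.
    return {k: r for k, r in latest.items() if r["record_type"] != "DELETE"}


# Hypothetical change records; overwriting a.csv produces a second CREATE.
records = [
    {"key": "a.csv", "sequence_number": "0001", "record_type": "CREATE", "size": 10},
    {"key": "b.csv", "sequence_number": "0002", "record_type": "CREATE", "size": 20},
    {"key": "a.csv", "sequence_number": "0003", "record_type": "CREATE", "size": 15},
    {"key": "b.csv", "sequence_number": "0004", "record_type": "DELETE", "size": None},
]

live = current_objects(records)
print(sorted(live))          # ['a.csv']
print(live["a.csv"]["size"])  # 15
```

In a real query engine the same result would usually come from a window function or a group-by on the key, but the logic is the same.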

2. Query Metadata Effortlessly

Using Iceberg-compatible tools, query metadata to retrieve insights like:

Objects uploaded within a specific timeframe
Objects matching a particular tag or key pattern
Size-based filters for optimizing storage costs
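
To make these query shapes concrete, here is a small sketch that runs them against a local SQLite table standing in for the metadata table. SQLite is only a stand-in so the example runs anywhere; in practice you would send similar SQL to Athena, Redshift, or Spark. The column names (key, size, last_modified_date, storage_class) follow the metadata schema, while the rows are invented. Tag filtering is left out because object tags are stored as a map column, which SQLite does not have.

```python
import sqlite3

# In-memory stand-in for a metadata table (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    "key TEXT, size INTEGER, last_modified_date TEXT, storage_class TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("logs/2024/12/01/app.log", 2_048, "2024-12-01T08:00:00", "STANDARD"),
        ("logs/2024/12/03/app.log", 4_096, "2024-12-03T09:30:00", "STANDARD"),
        ("videos/intro.mp4", 750_000_000, "2024-11-20T12:00:00", "STANDARD"),
        ("images/logo.png", 51_200, "2024-12-02T15:45:00", "STANDARD_IA"),
    ],
)


def keys(sql: str, params: tuple = ()) -> list:
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql, params)]


# Objects modified within a time window (ISO-8601 strings sort chronologically).
recent = keys(
    "SELECT key FROM metadata "
    "WHERE last_modified_date >= ? AND last_modified_date < ? ORDER BY key",
    ("2024-12-01", "2024-12-03"),
)

# Objects whose key matches a prefix pattern.
logs = keys("SELECT key FROM metadata WHERE key LIKE ? ORDER BY key", ("logs/%",))

# Objects larger than 100 MiB, largest first.
large = keys(
    "SELECT key FROM metadata WHERE size > ? ORDER BY size DESC",
    (100 * 1024 * 1024,),
)

print(recent)  # ['images/logo.png', 'logs/2024/12/01/app.log']
print(logs)    # ['logs/2024/12/01/app.log', 'logs/2024/12/03/app.log']
print(large)   # ['videos/intro.mp4']
```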

S3 Default Metadata

By default, S3 Metadata provides three types of metadata:

1. System-defined metadata, such as an object's creation time and storage class

2. Custom metadata, such as tags and user-defined metadata that was included during object upload

3. Event metadata, such as when an object is updated or deleted, and the AWS account that made the request

For details about what data is stored in metadata tables, see S3 Metadata tables schema.
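
One way to keep these three categories straight when writing queries is to group the columns you care about. The sketch below does that with a plain dictionary. The column names come from the metadata table schema, but the grouping into categories is my own reading of the descriptions above, not an official classification, and the list is deliberately partial.

```python
# Partial list of metadata table columns, grouped by the three categories.
# The grouping is an assumption made for illustration.
METADATA_COLUMNS = {
    "system": ["bucket", "key", "size", "last_modified_date", "e_tag", "storage_class"],
    "custom": ["object_tags", "user_metadata"],
    "event": ["record_type", "record_timestamp", "sequence_number", "requester"],
}


def select_clause(*categories: str) -> str:
    """Build a SELECT clause covering every column in the given categories."""
    columns = [col for cat in categories for col in METADATA_COLUMNS[cat]]
    return "SELECT " + ", ".join(columns)


print(select_clause("custom", "event"))
```

Calling select_clause("custom", "event") yields a clause with the tag, user metadata, and change-tracking columns, which is a handy starting point for audit-style queries.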

How Metadata Tables Work

Amazon S3 takes the reins when it comes to managing metadata tables, ensuring their accuracy and performance. Here’s what makes them stand out:

Read-Only for Integrity: Metadata tables are fully managed by Amazon S3 and are read-only to all IAM principals. Because only S3 writes to them, they stay in sync with your bucket, with changes typically showing up within minutes rather than instantly. You can delete your metadata tables if needed, but you can't directly modify them.
Automatic Maintenance: Amazon S3 periodically performs maintenance activities, such as file compaction and removal of unreferenced files. These automated processes help:
- 🔧 Optimize query performance.
- 💰 Minimize storage costs for metadata tables.
No Effort Required: This maintenance happens automatically—no need for opt-ins or manual configurations. However, if customization is required, you have the flexibility to configure these activities.

Hands-On Example

Here’s how you can enable and query metadata in a few simple steps:

Step 1: Create a Table Bucket

aws s3tables create-table-bucket --name my-metadata-bucket --region us-west-1

Step 2: Configure Metadata Capture

Prepare a JSON configuration file (saved here as config.json):

{
  "S3TablesDestination": {
    "TableBucketArn": "arn:aws:s3tables:us-west-1:123456789012:bucket/my-metadata-bucket",
    "TableName": "my_s3_metadata_table"
  }
}

Then attach this configuration to your data bucket:

aws s3api create-bucket-metadata-table-configuration \
  --bucket my-data-bucket \
  --metadata-table-configuration file://config.json \
  --region us-west-1
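
If you set this up for more than one bucket, it can help to generate config.json from a script instead of editing it by hand. The following is a minimal sketch that builds the same JSON structure shown above and writes it to disk. The function name build_metadata_config and both validation patterns are assumptions for illustration: the ARN pattern is modeled on the example ARN in this post, and the table name check assumes lowercase letters, digits, and underscores.

```python
import json
import re

# Rough shape of a table bucket ARN, modeled on the example in this post
# (an assumption, not an official validation pattern).
TABLE_BUCKET_ARN = re.compile(
    r"^arn:aws[\w-]*:s3tables:[a-z0-9-]+:\d{12}:bucket/[a-z0-9][a-z0-9-]*$"
)


def build_metadata_config(table_bucket_arn: str, table_name: str) -> dict:
    """Return the metadata table configuration as a dictionary."""
    if not TABLE_BUCKET_ARN.match(table_bucket_arn):
        raise ValueError(f"not a table bucket ARN: {table_bucket_arn}")
    if not re.fullmatch(r"[a-z0-9_]+", table_name):
        raise ValueError(f"unexpected table name: {table_name}")
    return {
        "S3TablesDestination": {
            "TableBucketArn": table_bucket_arn,
            "TableName": table_name,
        }
    }


config = build_metadata_config(
    "arn:aws:s3tables:us-west-1:123456789012:bucket/my-metadata-bucket",
    "my_s3_metadata_table",
)

# Write the file that the CLI command reads via file://config.json.
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

print(json.dumps(config, indent=2))
```

Failing early on a malformed ARN is cheaper than finding out from an API error after the fact.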

Step 3: Query Metadata

Using Apache Spark

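# Assumes the S3 Tables catalog runtime is pulled from Maven; adjust versions to match your environment.
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.6.0,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.mytablebucket=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.mytablebucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog" \
  --conf "spark.sql.catalog.mytablebucket.warehouse=arn:aws:s3tables:us-west-1:123456789012:bucket/my-metadata-bucket" \
  query.py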
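
The contents of query.py are up to you. As a rough sketch, the script could build a SQL statement against the metadata table and hand it to Spark. The helper names build_query and run are hypothetical, and the sketch assumes the table lives in the aws_s3_metadata namespace of the mytablebucket catalog configured above. The query builder is kept free of Spark imports so it can be tried on its own.

```python
import re

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_query(catalog: str, table: str, key_prefix: str, min_size: int,
                namespace: str = "aws_s3_metadata") -> str:
    """Build a query for objects under a key prefix that exceed a size."""
    for name in (catalog, namespace, table):
        if not IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    # Escape single quotes so the prefix is safe inside a SQL string literal.
    prefix = key_prefix.replace("'", "''")
    return (
        f"SELECT key, size, last_modified_date "
        f"FROM {catalog}.{namespace}.{table} "
        f"WHERE key LIKE '{prefix}%' AND size > {int(min_size)} "
        f"ORDER BY size DESC"
    )


def run(spark, sql: str) -> None:
    """Execute the statement on an existing SparkSession and print the rows."""
    spark.sql(sql).show(truncate=False)


sql = build_query("mytablebucket", "my_s3_metadata_table", "logs/", 1024 * 1024)
print(sql)
```

Inside the actual job you would create the session with SparkSession.builder.getOrCreate() from pyspark.sql and call run(spark, sql). Note that a prefix containing % or _ would be treated as a LIKE wildcard in this sketch.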

Why It Matters

With Amazon S3 Metadata, AWS eliminates the complexity of custom metadata systems. Now, you can:
Enhance data discoverability for analytics and AI workloads.
Maintain a scalable and synchronized view of your S3 objects.
Simplify compliance and auditing with enriched metadata tracking.

Further Resources

📺 Watch the Video Overview: Amazon S3 Metadata at AWS re:Invent
🌐 Explore the Official Feature Page: Amazon S3 Metadata
📝 Read the AWS Blog Post by Jeff Barr: Introducing Queryable Object Metadata for Amazon S3 Buckets
