Discover Amazon S3 Metadata: A New Way to Explore Your Storage at AWS re:Invent 2024

At AWS re:Invent 2024, Amazon Web Services (AWS) introduced Amazon S3 Metadata, a new feature of Amazon Simple Storage Service (S3) that automatically captures the metadata of your S3 objects and makes it queryable. It simplifies the way we find and analyze objects in a bucket, helping businesses streamline workflows and get more insight from their data.

Here’s an overview of what the feature does and how to start using it.


The Challenge of Scale

Organizations using Amazon S3 often deal with massive datasets: billions or even trillions of objects in a single bucket. Identifying specific objects based on characteristics like size, tags, or patterns in their keys is no easy task. Historically, businesses had to build custom systems to manage and query metadata, and those systems could be complex, hard to scale, and prone to falling out of sync with the actual data.


What is Amazon S3 Metadata?

Amazon S3 Metadata introduces automated metadata capture for objects stored in S3 buckets. This metadata is stored in Apache Iceberg tables, enabling compatibility with tools like:

  • Amazon Athena
  • Amazon Redshift
  • Amazon QuickSight
  • Apache Spark

With these tools, you can perform scalable queries on metadata to find objects of interest efficiently, whether for analytics, data processing, or AI training.

Rich Metadata Elements

The metadata schema includes over 20 elements, such as:

  • Bucket Name and Object Key
  • Creation/Modification Time
  • Storage Class
  • Encryption Details
  • Object Tags
  • User Metadata

Additionally, you can keep application-specific metadata in separate tables and join it with the metadata table on the object key for more advanced queries.


How Does It Work?

1. Enable Metadata Capture

To get started, designate a table bucket and a table name to hold the metadata for your data bucket. S3 then records a new row automatically whenever an object is created, deleted, or has its metadata modified. Each row includes:

  • Record Type: CREATE, UPDATE_METADATA, or DELETE
  • Sequence Number: Orders the records for a given object so its history can be reconstructed
  • Timestamps: Capture when the change was recorded and when the object was last modified
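
Because every change arrives as a new row, the current view of a bucket has to be derived from the row history. The following is a minimal sketch of that idea in plain Python, not the way S3 or any query engine does it internally. It assumes rows carry `key`, `sequence_number` and `record_type` fields, that a lexicographically larger sequence number is more recent for the same key, and that the record types are CREATE, UPDATE_METADATA and DELETE. The sample rows are hypothetical, and versioned buckets with delete markers are ignored for simplicity.

```python
from typing import Dict, Iterable


def latest_state(records: Iterable[dict]) -> Dict[str, dict]:
    """Replay metadata rows in order and keep the latest row for each live key."""
    state: Dict[str, dict] = {}
    # Assumption: for one key, a lexicographically larger sequence_number is newer.
    ordered = sorted(records, key=lambda r: (r["key"], r["sequence_number"]))
    for row in ordered:
        if row["record_type"] == "DELETE":
            # The object is gone, so drop it from the current view.
            state.pop(row["key"], None)
        else:
            # CREATE or UPDATE_METADATA: the newest row wins.
            state[row["key"]] = row
    return state


# Hypothetical rows, as they might come back from a metadata table query.
rows = [
    {"key": "logs/a.txt", "sequence_number": "0001", "record_type": "CREATE"},
    {"key": "logs/a.txt", "sequence_number": "0003", "record_type": "DELETE"},
    {"key": "logs/b.txt", "sequence_number": "0002", "record_type": "CREATE"},
    {"key": "logs/b.txt", "sequence_number": "0004", "record_type": "UPDATE_METADATA"},
]

live = latest_state(rows)
print(sorted(live))  # ['logs/b.txt']
```

In practice you would more likely express the same fold in SQL (for example with a window function over `sequence_number`), but the logic is the same: order by sequence number, then keep the last surviving row per key.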


2. Query Metadata Effortlessly

Using Iceberg-compatible tools, query metadata to retrieve insights like:

  • Objects uploaded within a specific timeframe
  • Objects matching a particular tag or key pattern
  • Objects above or below a size threshold, to help optimize storage costs
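
The three kinds of query listed above can be sketched as SQL strings built in Python. This is an illustration under assumptions, not an official query set: the column names (`key`, `size`, `storage_class`, `last_modified_date`, `record_type`, `object_tags`) reflect the metadata table schema as I understand it, the `aws_s3_metadata` namespace and the table name are assumed, and the helper function names are hypothetical. How the table is qualified (catalog, quoting) differs between Athena, Spark and other engines, so expect to adjust the `TABLE` value.

```python
from datetime import date

# Assumed namespace and table name; prefix with your engine's catalog as needed.
TABLE = "aws_s3_metadata.my_s3_metadata_table"


def _quote(value: str) -> str:
    """Wrap a value as a SQL string literal, doubling any single quotes."""
    return "'" + value.replace("'", "''") + "'"


def uploaded_between(start: date, end: date) -> str:
    """Objects created in the half-open interval [start, end)."""
    return (
        f"SELECT key, size, last_modified_date FROM {TABLE} "
        f"WHERE record_type = 'CREATE' "
        f"AND last_modified_date >= TIMESTAMP '{start.isoformat()} 00:00:00' "
        f"AND last_modified_date < TIMESTAMP '{end.isoformat()} 00:00:00'"
    )


def with_tag(tag_key: str, tag_value: str) -> str:
    """Objects whose tag map contains tag_key = tag_value."""
    return (
        f"SELECT key FROM {TABLE} "
        f"WHERE element_at(object_tags, {_quote(tag_key)}) = {_quote(tag_value)}"
    )


def key_prefix(prefix: str) -> str:
    """Objects whose key starts with prefix (LIKE wildcards are not escaped here)."""
    return f"SELECT key FROM {TABLE} WHERE key LIKE {_quote(prefix + '%')}"


def larger_than(min_bytes: int) -> str:
    """Objects above a size threshold, largest first."""
    return (
        f"SELECT key, size, storage_class FROM {TABLE} "
        f"WHERE size > {int(min_bytes)} ORDER BY size DESC"
    )


print(uploaded_between(date(2024, 12, 1), date(2024, 12, 8)))
print(with_tag("project", "alpha"))
print(key_prefix("logs/2024/"))
print(larger_than(1024 * 1024 * 1024))
```

String building keeps the sketch dependency-free. In real code, prefer your query engine's parameter binding over hand-quoted literals.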

S3 Default Metadata

By default, S3 Metadata provides three types of metadata:

  • System-defined metadata, such as an object's creation time and storage class
  • Custom metadata, such as tags and user-defined metadata that was included during object upload
  • Event metadata, such as when an object is updated or deleted, and the AWS account that made the request
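
To make the three categories concrete, here is a small sketch that sorts the columns of one metadata row into them. The grouping is my own reading of the schema rather than an official classification, the column names are assumed from the metadata table schema and should be checked against the documentation, and the sample row is hypothetical.

```python
# Assumed column groupings; verify names against the S3 Metadata table schema.
SYSTEM_COLUMNS = {
    "bucket", "key", "size", "last_modified_date",
    "e_tag", "storage_class", "encryption_status",
}
CUSTOM_COLUMNS = {"object_tags", "user_metadata"}
EVENT_COLUMNS = {
    "record_type", "record_timestamp", "sequence_number",
    "requester", "source_ip_address", "request_id",
}


def split_row(row: dict) -> dict:
    """Group a row's columns into system, custom, event and other."""
    groups = {"system": {}, "custom": {}, "event": {}, "other": {}}
    for column, value in row.items():
        if column in SYSTEM_COLUMNS:
            groups["system"][column] = value
        elif column in CUSTOM_COLUMNS:
            groups["custom"][column] = value
        elif column in EVENT_COLUMNS:
            groups["event"][column] = value
        else:
            # Anything this sketch does not know about lands here.
            groups["other"][column] = value
    return groups


# Hypothetical row for illustration only.
sample_row = {
    "bucket": "my-data-bucket",
    "key": "reports/q4.csv",
    "size": 2048,
    "storage_class": "STANDARD",
    "object_tags": {"project": "alpha"},
    "user_metadata": {"owner": "analytics"},
    "record_type": "CREATE",
    "requester": "123456789012",
}

grouped = split_row(sample_row)
for name, columns in grouped.items():
    print(name, sorted(columns))
```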


How Metadata Tables Work

Amazon S3 takes the reins when it comes to managing metadata tables, ensuring their accuracy and performance. Here’s what makes them stand out:

  • Read-Only for Integrity: Metadata tables are fully managed by Amazon S3 and are read-only to all IAM principals, which helps ensure they accurately reflect the contents of your bucket. You can delete your metadata tables if needed, but you can't modify them directly.

  • Automatic Maintenance: Amazon S3 periodically performs maintenance activities, such as file compaction and removal of unreferenced files. These automated processes help:

    • 🔧 Optimize query performance.
    • 💰 Minimize storage costs for metadata tables.
  • No Effort Required: This maintenance happens automatically, with no opt-in or manual configuration needed. However, if customization is required, you have the flexibility to configure these activities.


Hands-On Example

Here’s how you can enable and query metadata in a few simple steps:

Step 1: Create a Table Bucket

aws s3tables create-table-bucket --name my-metadata-bucket --region us-west-1

Step 2: Configure Metadata Capture

Prepare a JSON configuration file (saved here as config.json):

{
  "S3TablesDestination": {
    "TableBucketArn": "arn:aws:s3tables:us-west-1:123456789012:bucket/my-metadata-bucket",
    "TableName": "my_s3_metadata_table"
  }
}
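
If you would rather generate this file than edit it by hand, a short Python sketch like the one below can build the same structure. The function name `build_metadata_config` is hypothetical, and the ARN layout is simply copied from the example above. The region, account ID and names are the placeholder values from this walkthrough.

```python
import json


def build_metadata_config(region: str, account_id: str,
                          table_bucket: str, table_name: str) -> dict:
    """Build the metadata table configuration shown above as a dict."""
    # ARN layout taken from the example: arn:aws:s3tables:<region>:<account>:bucket/<name>
    arn = f"arn:aws:s3tables:{region}:{account_id}:bucket/{table_bucket}"
    return {
        "S3TablesDestination": {
            "TableBucketArn": arn,
            "TableName": table_name,
        }
    }


config = build_metadata_config(
    region="us-west-1",
    account_id="123456789012",
    table_bucket="my-metadata-bucket",
    table_name="my_s3_metadata_table",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Redirecting the output to config.json gives you the file used in the next command.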


Attach this configuration to your data bucket:

aws s3api create-bucket-metadata-table-configuration \
  --bucket my-data-bucket \
  --metadata-table-configuration file://config.json \
  --region us-west-1

Step 3: Query Metadata

Using Apache Spark

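# Assumption: the S3 Tables catalog for Iceberg JAR must also be on the classpath
# (via --packages or --jars); check the S3 Tables documentation for its current coordinates.
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.6.0 \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.mytablebucket=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.mytablebucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog" \
  --conf "spark.sql.catalog.mytablebucket.warehouse=arn:aws:s3tables:us-west-1:123456789012:bucket/my-metadata-bucket" \
  query.py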

Why It Matters

With Amazon S3 Metadata, AWS eliminates the complexity of custom metadata systems. Now, you can:

  • Enhance data discoverability for analytics and AI workloads.
  • Maintain a scalable and synchronized view of your S3 objects.
  • Simplify compliance and auditing with enriched metadata tracking.
