Oteng Isaac

Data Cataloguing in AWS

Cataloguing Data in AWS Using Glue Crawlers: A Practical Guide for Data Engineers

Introduction

In modern data engineering, one of the most overlooked but powerful capabilities is data cataloguing. Without a clear understanding of what data exists, where it lives, its schema, and how it changes over time, no ETL architecture can scale. In this guide, I walk through how to catalogue data using AWS Glue Crawlers, and how to structure your metadata layer when working with raw and cleaned datasets stored in Amazon S3.

This tutorial uses a simple CSV file in an S3 raw bucket and walks through how AWS Glue automatically discovers its structure and builds a searchable, query-ready data catalog. You can replicate every step in your own account through the AWS Console.

What is Data Cataloguing?

Data cataloguing is the process of creating a structured inventory of all your data assets.

A good data catalog contains:

  • Dataset name
  • Schema (columns, data types, partitions)
  • Location (e.g., S3 path)
  • Metadata (size, owner, last updated)
  • Tags, classifications, lineage

Think of it as the "index" of your data ecosystem - similar to how a library catalog helps readers find books quickly.
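
As a rough illustration, the sketch below shows what a single catalog entry might hold when expressed as a Glue table definition. The crawler used later in this guide generates this metadata automatically; the database, bucket path, and column names here are purely hypothetical.

# Illustrative only: a Glue table definition capturing name, schema,
# location, and classification. The crawler creates this for you later.
aws glue create-table \
  --database-name orders_db \
  --table-input '{
    "Name": "orders",
    "StorageDescriptor": {
      "Columns": [
        {"Name": "order_id", "Type": "string"},
        {"Name": "order_date", "Type": "string"},
        {"Name": "amount", "Type": "double"}
      ],
      "Location": "s3://your-raw-bucket/orders/",
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        "Parameters": {"field.delim": ","}
      }
    },
    "Parameters": {"classification": "csv"}
  }'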

Why it matters:

  • Makes data discoverable across teams
  • Reduces manual documentation
  • Ensures schema consistency across pipelines
  • Enables data validation and quality checks
  • Fuels self-service analytics
  • Supports governance and compliance

Data Cataloguing in ETL Pipelines

ETL pipelines depend heavily on metadata. Before transforming any dataset, the pipeline must understand:

  • What columns exist
  • Which data types to enforce
  • What partitions to use
  • What schema evolution has happened
  • How to map raw → cleaned → curated layers

A strong data catalog ensures that:

  • ETL jobs run reliably
  • Glue/Spark scripts do not break due to schema drift
  • Downstream BI tools (Athena, QuickSight, Superset, Power BI) can read data instantly
  • Data lineage and documentation stay updated

AWS Glue Data Catalog acts as the central metadata store for all your structured and semi-structured data.
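
For example, a pipeline can ask the catalog for the current schema before a job runs, so drift is caught early. A minimal sketch, assuming the orders_db database and orders table that are created later in this guide:

# Print the catalogued columns and types so an ETL job (or an operator)
# can confirm the schema before any transformation runs
aws glue get-table \
  --database-name orders_db \
  --name orders \
  --query 'Table.StorageDescriptor.Columns' \
  --output table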

Architecture Overview

This walkthrough demonstrates how Glue Crawlers:

  • Scan an S3 bucket
  • Detect the schema (headers, types, formatting)
  • Generate metadata
  • Store the metadata as a table in the Data Catalog

This metadata is then queryable through Amazon Athena, interoperable with Glue ETL Jobs, and usable by analytics tools.

Understanding Amazon S3, AWS Glue Crawler, and the Glue Data Catalog

Amazon S3 (Simple Storage Service)

Amazon S3 is a fully managed object storage service that allows you to store any type of data at scale—CSV files, logs, JSON, Parquet, images, and more.

It is highly durable, cost-effective, and integrates seamlessly with AWS analytics services. In most modern data engineering architectures (including the Medallion architecture), S3 serves as the landing, raw, and processed layers where data is ingested and stored before further transformation.

AWS Glue Crawler

An AWS Glue Crawler is an automated metadata discovery tool that scans data stored in Amazon S3 and other sources.

When the crawler runs, it:

  • Reads the file structure and content
  • Detects the data format (CSV, JSON, Parquet, etc.)
  • Infers column names and data types
  • Identifies partitions
  • Classifies datasets using built-in or custom classifiers

The crawler then automatically creates or updates table metadata without you having to define schemas manually.
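
If the built-in CSV classifier ever misreads your files (for example, a headerless export), you can attach a custom classifier to the crawler. A minimal sketch, with an assumed classifier name and column list:

# Custom CSV classifier for headerless files; the header names listed
# here are supplied manually because the file itself has none
aws glue create-classifier --csv-classifier '{
  "Name": "orders_csv_classifier",
  "Delimiter": ",",
  "ContainsHeader": "ABSENT",
  "Header": ["order_id", "order_date", "customer_id", "amount"]
}'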

AWS Glue Data Catalog

The Glue Data Catalog is a centralized metadata repository for all your datasets within AWS.

It stores:

  • Table definitions
  • Schema information
  • Partition details
  • Metadata used by analytics services

When the Glue Crawler finishes scanning an S3 bucket, it writes the discovered schema and table information into the Glue Data Catalog.

This metadata can then be queried by services such as Athena, EMR, Redshift Spectrum, and AWS Glue ETL jobs.
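
You can also browse this metadata directly from the CLI, which is a quick way to confirm what the catalog currently knows about (the database name below assumes the example built later in this guide):

# List all catalog databases, then the tables registered in one of them
aws glue get-databases --query 'DatabaseList[].Name'
aws glue get-tables --database-name orders_db --query 'TableList[].Name'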

In short, the workflow is:
S3 → Glue Crawler scans files → Schema is inferred → Metadata is stored in Glue Data Catalog → Data becomes queryable.

Step-by-Step Workflow

Below are the steps to follow when implementing this in your own AWS environment.

1. Upload Your CSV File to Amazon S3

  • Create an S3 bucket named medallion-orders-2025-12-17 (replace with your own bucket name)
# Create an S3 bucket (basic settings)
aws s3api create-bucket --bucket medallion-orders-2025-12-17 --region us-east-1
  • Upload your sample CSV file (e.g., orders.csv)
# Upload the CSV file to the bucket
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/
# Upload to a folder (prefix)
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/raw/orders.csv
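
Optionally, confirm the object landed where the crawler will look (same example bucket as above):

# List the raw prefix to confirm the upload
aws s3 ls s3://medallion-orders-2025-12-17/raw/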

2. Create a Glue Database

In the Glue Console:

  • Go to Data Catalog → Databases

  • Click Add database

  • Name it orders_db and click Create database (a CLI equivalent is sketched below)
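
If you prefer the CLI, the equivalent is a single command (the description text is just an example):

# Create the Glue database from the CLI
aws glue create-database --database-input '{"Name": "orders_db", "Description": "Catalog database for the orders dataset"}'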

3. Create an AWS Glue Crawler

  • Navigate to Glue → Crawlers
  • Click on Create crawler
  • Provide a name (e.g., orders_crawler) and click Next

  • Click on Add a data source

  • Choose S3 as the data source and click Browse S3 to select your bucket

  • On the next screen, choose an IAM role (an existing Glue service role or a custom one) and click Next

  • Select your database (orders_db), set the crawler schedule to On demand, click Next, then Create crawler (a CLI equivalent is sketched after this list)

  • Run the crawler and wait until its status shows Completed
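
For reference, roughly the same crawler can be created from the CLI. The role ARN below is a placeholder; the role needs the AWSGlueServiceRole managed policy plus read access to your bucket:

# Create the crawler against the raw prefix; the role ARN is a placeholder
aws glue create-crawler \
  --name orders_crawler \
  --role arn:aws:iam::123456789012:role/GlueCrawlerRole \
  --database-name orders_db \
  --targets '{"S3Targets": [{"Path": "s3://medallion-orders-2025-12-17/raw/"}]}'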

4. Run the Crawler & Generate Metadata

Once the crawler completes:

  • It will create a table inside your Glue Data Catalog database
  • Open the table to view:
    • Columns
    • Data types
    • S3 location
    • Classification (csv)
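
The same run-and-check cycle can be done from the CLI if you prefer:

# Start the crawler, then poll until its state returns to READY
aws glue start-crawler --name orders_crawler
aws glue get-crawler --name orders_crawler --query 'Crawler.State'

# Check whether the last crawl succeeded
aws glue get-crawler --name orders_crawler --query 'Crawler.LastCrawl.Status'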

5. Query the Table Using Amazon Athena

  • Open Athena
  • Select your Glue database

  • Run a simple query, replacing the table name with the one the crawler created for you (a CLI version is sketched below):
SELECT * FROM "AwsDataCatalog"."orders_db"."medallion_orders_2025_12_17" LIMIT 10;
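
The same query can also be issued from the CLI; Athena needs an S3 output location for results, so the bucket below is a placeholder you would replace with your own:

# Submit the query (note the QueryExecutionId it returns)
aws athena start-query-execution \
  --query-string 'SELECT * FROM "orders_db"."your_table_name" LIMIT 10' \
  --query-execution-context Database=orders_db \
  --result-configuration OutputLocation=s3://your-athena-results-bucket/

# Fetch the results once the query has finished
aws athena get-query-results --query-execution-id <query-execution-id>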

Final Outcome

After completing the steps, you will have:

  • A fully indexed representation of your raw data
  • A searchable table in Glue Data Catalog
  • A metadata-driven foundation for ETL jobs
  • A structure ready for transformation into a cleaned bucket and eventually a curated analytics layer

This sets the stage for my next article:

"Building ETL pipelines using Glue ETL Jobs and writing cleaned data back into S3."

Conclusion

Data cataloguing is a foundational step in any scalable data engineering architecture. AWS Glue Crawlers make it easy to automate metadata extraction from raw data sources, reduce manual schema definition, and keep your ETL pipelines schema-aware and resilient.

By the end of this project, you'll have a practical, AWS-native setup that you can build on for data cleaning, transformations, and analytical workloads.
