How to Build a Serverless Data Lake Foundation with AWS Glue

1. Introduction

Welcome to this comprehensive tutorial on building a Serverless Data Lake Foundation using AWS Glue. By the end of this guide, you will be able to design and implement a robust, automated pipeline that extracts raw data from Amazon S3, transforms it into an optimized, analytics-ready format, and makes it available for querying via Amazon Athena. This architectural pattern is valuable because it eliminates the need to manage servers, clusters, or other infrastructure, letting you focus entirely on your data logic. It also scales with your data volume and stays cost-effective, since you only pay for the compute consumed while transformations are actually running. Whether you are handling daily batch processing or aggregating large volumes of historical data, mastering this foundational pattern is a key milestone in modern data engineering.

2. Prerequisites

Before starting this step-by-step implementation, ensure you have the following tools and foundational knowledge prepared:

  • AWS Account: An active Amazon Web Services account with administrative or sufficient permissions to create S3 buckets, configure IAM roles, and deploy AWS Glue jobs.
  • Basic Python or PySpark Knowledge: Familiarity with basic data manipulation concepts, variables, and data structures using Python or Apache Spark.
  • Understanding of Cloud Storage: General knowledge of how object storage systems function, specifically Amazon S3 concepts including buckets and prefixes.
  • IAM Fundamentals: A foundational understanding of Identity and Access Management principles to allow distinct AWS services to communicate securely.

3. Step-by-Step

Step 1: Architecting the Serverless Data Lake Foundation

What to do: Review the end-to-end data flow and understand the specific roles of each AWS service before provisioning any cloud resources. The architecture requires Amazon S3 for storage, Amazon EventBridge for orchestration, AWS Glue for processing and metadata management, and Amazon Athena for consumption.

Why do it: Establishing a clear mental model and a visual blueprint prevents configuration errors later in the process. It ensures that all AWS services are logically connected, network paths are understood, and the separation between storage and compute is maintained throughout the pipeline design.

Example:
Solution Diagram

Step 2: Configuring Amazon S3 Storage Zones

What to do: Access the Amazon S3 console and create a new S3 bucket with a unique name. Inside this bucket, create two distinct folders (prefixes). Name the first folder raw/ to represent the landing zone for your incoming, unprocessed data. Name the second folder curated/ to represent the destination for your clean, transformed data.

Why do it: Segregating data by its processing state is a fundamental best practice in data engineering. It prevents accidental overwrites of source data, simplifies access control policies, and optimizes query performance by keeping raw, unoptimized files strictly separated from the analytical query engine.

Example:

# 1. Define the core S3 Data Lake Bucket
resource "aws_s3_bucket" "datalake_production" {
  bucket = "enterprise-datalake-production"

  tags = {
    Environment = "Production"
    Purpose     = "Serverless Data Lake Foundation"
    ManagedBy   = "Terraform"
  }
}

# 2. Enable Bucket Versioning (Recommended Data Lake Best Practice)
resource "aws_s3_bucket_versioning" "datalake_versioning" {
  bucket = aws_s3_bucket.datalake_production.id

  versioning_configuration {
    status = "Enabled"
  }
}

# 3. Create the 'raw' subdirectory (Landing Zone)
resource "aws_s3_object" "raw_zone" {
  bucket = aws_s3_bucket.datalake_production.id
  key    = "raw/"

  # Defines the object as a directory rather than a standard file
  content_type = "application/x-directory"
}

# 4. Create the 'curated' subdirectory (Gold/Analytics Zone)
resource "aws_s3_object" "curated_zone" {
  bucket = aws_s3_bucket.datalake_production.id
  key    = "curated/"

  content_type = "application/x-directory"
}


Step 3: Setting Up IAM Permissions

What to do: Navigate to the AWS Identity and Access Management console and create a new service role for AWS Glue. Attach the AWS managed policy named AWSGlueServiceRole. Additionally, create and attach an inline policy that grants s3:GetObject and s3:PutObject on the raw/ and curated/ prefixes of the bucket you created in the previous step, plus s3:ListBucket on the bucket itself so Glue can enumerate the objects it needs to read.

Why do it: AWS services operate under the principle of least privilege. AWS Glue cannot read your raw data, write processed files to S3, or publish execution logs to Amazon CloudWatch without explicit permissions granted through an IAM role.

Example:

# 1. Define the Trust Relationship (Assume Role Policy) for AWS Glue
data "aws_iam_policy_document" "glue_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

# 2. Create the IAM Role
resource "aws_iam_role" "glue_execution_role" {
  name               = "GlueDataLakeExecutionRole"
  assume_role_policy = data.aws_iam_policy_document.glue_assume_role_policy.json

  tags = {
    Environment = "Production"
    Purpose     = "Serverless Data Lake Foundation"
    ManagedBy   = "Terraform"
  }
}

# 3. Attach the core AWS Managed Policy for Glue Service operations
# This grants the permissions Glue needs for CloudWatch logging, ENI creation (when running inside a VPC), etc.
resource "aws_iam_role_policy_attachment" "glue_service_managed_policy" {
  role       = aws_iam_role.glue_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# 4. Define a restrictive inline policy for the specific Data Lake S3 bucket
data "aws_iam_policy_document" "s3_datalake_access_policy" {
  statement {
    sid    = "AllowDataLakeReadWrite"
    effect = "Allow"

    actions = [
      "s3:GetObject",
      "s3:PutObject"
    ]

    resources = [
      # Restrict object access exclusively to the defined prefixes
      "arn:aws:s3:::enterprise-datalake-production/raw/*",
      "arn:aws:s3:::enterprise-datalake-production/curated/*"
    ]
  }

  statement {
    sid    = "AllowDataLakeListing"
    effect = "Allow"

    # Glue must be able to list the bucket to enumerate objects under the prefixes
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::enterprise-datalake-production"]
  }
}

# 5. Create and attach the inline S3 policy to the IAM Role
resource "aws_iam_role_policy" "s3_datalake_inline_policy" {
  name   = "RestrictedS3DataLakeAccess"
  role   = aws_iam_role.glue_execution_role.id
  policy = data.aws_iam_policy_document.s3_datalake_access_policy.json
}

Step 4: Orchestrating the Pipeline Execution Flow

What to do: Define the sequence of automated events that will trigger and execute your ETL process. Enable Amazon EventBridge notifications on the data lake bucket (they are disabled by default), then configure an EventBridge rule that matches Object Created events for the raw/ prefix. Set the rule's target so that it starts your AWS Glue job automatically whenever a new file finishes uploading, for example through a Glue workflow with an EventBridge trigger.

Why do it: Event-driven automation removes manual intervention and scheduling guesswork. Because the job only runs when new data actually arrives, compute resources are consumed only when needed, maximizing efficiency and minimizing idle costs.

Example:

Sequence Diagram
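
Example (Terraform):

A minimal sketch of this event wiring, assuming the bucket resource from Step 2 and a recent AWS provider; the rule name is illustrative, and the final target (a Glue workflow with an EventBridge trigger, or a small Lambda calling glue:StartJobRun) depends on how you choose to start the job.

# Turn on EventBridge notifications for the data lake bucket (disabled by default)
resource "aws_s3_bucket_notification" "datalake_eventbridge" {
  bucket      = aws_s3_bucket.datalake_production.id
  eventbridge = true
}

# Rule that fires whenever a new object lands under the raw/ prefix
resource "aws_cloudwatch_event_rule" "raw_object_created" {
  name        = "datalake-raw-object-created"
  description = "New file arrived in the raw/ landing zone"

  event_pattern = jsonencode({
    source      = ["aws.s3"]
    detail-type = ["Object Created"]
    detail = {
      bucket = { name = [aws_s3_bucket.datalake_production.bucket] }
      object = { key = [{ prefix = "raw/" }] }
    }
  })
}

# The rule's target then starts the Glue job -- typically a Glue workflow
# configured with an EventBridge (EVENT) start trigger, or a Lambda function
# that calls glue:StartJobRun for the curation job.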

Step 5: Authoring the AWS Glue Transformation Job

What to do: Open the AWS Glue Studio visual editor and create a new Spark ETL job. Configure the source node to read from your S3 raw/ path. Add a transformation node to clean the data, such as dropping null values, mapping column names to standard conventions, and casting data types correctly. Finally, configure the target node to write the data back to your S3 curated/ path, explicitly selecting Parquet as the output format.

Why do it: The transformation phase is where raw information becomes valuable data. Writing the output in Apache Parquet format is critical: Parquet is a highly optimized, columnar storage format that compresses data well, reduces long-term S3 storage costs, and dramatically speeds up subsequent analytical queries by allowing engines to skip irrelevant columns.

Example (Terraform):

resource "aws_glue_job" "datalake_curation_job" {
  name     = "raw-to-curated-parquet-job"
  role_arn = aws_iam_role.glue_execution_role.arn # From the previous IAM step

  # Defines the execution engine and script location
  command {
    name            = "glueetl"
    script_location = "s3://enterprise-datalake-production/scripts/etl_job.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language"                     = "python"
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
  }

  # Cost and performance optimization
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 2

  tags = {
    Environment = "Production"
    Purpose     = "Serverless Data Lake Foundation"
    ManagedBy   = "Terraform"
  }
}

Example (Glue Job):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve the job name passed by Glue and initialize the job (required for job.commit and bookmarks)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ==============================================================================
# Node 1: S3 Source (Reads raw CSV data)
# ==============================================================================
S3Source_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"quoteChar": '"', "withHeader": True, "separator": ","},
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://enterprise-datalake-production/raw/"], 
        "recurse": True
    },
    transformation_ctx="S3Source_node1"
)

# ==============================================================================
# Node 2: Transform - ApplyMapping (Cleans and maps data types)
# ==============================================================================
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3Source_node1,
    mappings=[
        ("id", "string", "user_id", "string"),
        ("name", "string", "full_name", "string"),
        ("timestamp", "string", "event_time", "timestamp")
    ],
    transformation_ctx="ApplyMapping_node2"
)

# ==============================================================================
# Node 3: S3 Target (Writes optimized Parquet data)
# ==============================================================================
S3Target_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://enterprise-datalake-production/curated/", 
        "partitionKeys": []
    },
    format_options={"compression": "snappy"},
    transformation_ctx="S3Target_node3"
)

job.commit()

Example (Glue Studio):

Glue Nodes

Step 6: Cataloging the Data with AWS Glue Crawlers

What to do: Navigate to the Crawlers section within the AWS Glue console. Create a new crawler and point its data source at your curated/ S3 path. Assign the IAM role created in Step 3 to the crawler. Set the crawler to run on a daily schedule, or trigger it immediately after the AWS Glue transformation job completes successfully.

Why do it: Data sitting in S3 is useless to business analysts without structural context. The Crawler automatically inspects your Parquet files, infers the underlying schema including column names and data types, and registers this structural definition as a logical table within the centralized AWS Glue Data Catalog.
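
Example (Terraform):

A hedged sketch of the crawler as code, reusing the IAM role from Step 3; the database name, crawler name, and schedule are illustrative assumptions. The configuration block is the code equivalent of the "single schema per S3 path" console option discussed later in the troubleshooting section.

# Catalog database the crawler will populate
resource "aws_glue_catalog_database" "datalake" {
  name = "enterprise_datalake"
}

# Crawler that infers the schema of the curated Parquet files
resource "aws_glue_crawler" "curated_crawler" {
  name          = "curated-zone-crawler"
  database_name = aws_glue_catalog_database.datalake.name
  role          = aws_iam_role.glue_execution_role.arn

  s3_target {
    path = "s3://enterprise-datalake-production/curated/"
  }

  # Daily run at 02:00 UTC; alternatively, trigger the crawler after the ETL job succeeds
  schedule = "cron(0 2 * * ? *)"

  # Combine compatible schemas into a single table per S3 path
  configuration = jsonencode({
    Version  = 1.0
    Grouping = { TableGroupingPolicy = "CombineCompatibleSchemas" }
  })
}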

Step 7: Querying the Transformed Data with Amazon Athena

What to do: Navigate to the Amazon Athena console. Ensure you have set up an Athena query result location in a separate S3 bucket. Select the database created by your Glue Crawler from the dropdown menu. Write and execute standard SQL queries against your newly cataloged table to analyze the data.

Why do it: This final step validates the structural integrity of your entire pipeline. Athena uses the metadata schema stored in the Data Catalog to seamlessly execute SQL queries directly against the compressed Parquet files residing in S3. This proves that your serverless architecture is fully operational and ready for business intelligence workloads.
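
Example (Terraform):

If you prefer to codify the Athena setup as well, a minimal sketch is shown below; the results bucket, workgroup name, and the sample table and column names are assumptions for illustration (the crawler typically names the table after the crawled folder, here likely curated).

# Dedicated bucket for Athena query results
resource "aws_s3_bucket" "athena_results" {
  bucket = "enterprise-datalake-athena-results"
}

# Workgroup that pins every query's output location to that bucket
resource "aws_athena_workgroup" "datalake_analytics" {
  name = "datalake-analytics"

  configuration {
    result_configuration {
      output_location = "s3://enterprise-datalake-athena-results/"
    }
  }
}

# In the Athena editor, select this workgroup and the crawler's database,
# then run standard SQL against the cataloged table, e.g.:
#   SELECT full_name, COUNT(*) AS events
#   FROM curated
#   GROUP BY full_name
#   LIMIT 10;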

4. Common Troubleshooting

Problem 1: Out of Memory Errors in Glue Jobs
This issue frequently occurs due to data skew, where one worker node attempts to process significantly more data than others, or when attempting to process thousands of excessively small files simultaneously.
Solution: Resolve this by opening the Glue job details and increasing the worker type from the default G.1X to G.2X or larger, providing more memory per executor. Alternatively, use Spark's .repartition() within your script to distribute the workload more evenly across the workers before executing wide transformations. A hedged Terraform sketch of the first remedy follows.
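
As a rough sketch, the job resource from Step 5 can be resized as follows (the worker count of 10 is an arbitrary example; tune it to your data volume, and keep the remaining arguments from Step 5 unchanged):

resource "aws_glue_job" "datalake_curation_job" {
  name     = "raw-to-curated-parquet-job"
  role_arn = aws_iam_role.glue_execution_role.arn

  command {
    name            = "glueetl"
    script_location = "s3://enterprise-datalake-production/scripts/etl_job.py"
    python_version  = "3"
  }

  glue_version = "4.0"

  # G.2X doubles the memory per worker compared to G.1X, and a higher worker
  # count spreads skewed or small-file-heavy workloads across more executors.
  worker_type       = "G.2X"
  number_of_workers = 10
}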

Problem 2: Glue Crawler Creates Multiple Unwanted Tables
If your curated S3 data utilizes a deeply nested, highly partitioned, or logically inconsistent folder structure, the AWS Glue Crawler might misinterpret the subfolders as entirely separate tables instead of partitions of a single dataset.
Solution: To rectify this, edit the Crawler configuration settings. Locate the section labeled "Grouping behavior for S3 data" and explicitly check the option to "Create a single schema for each S3 path". This forces the crawler to treat all files under the root prefix as part of the same table.

Problem 3: Amazon Athena Queries Return Zero Records
If the table appears in the Data Catalog but standard SELECT queries return no rows, the S3 data is most likely partitioned while the Data Catalog is still unaware of the new partition directories.
Solution: Execute the MSCK REPAIR TABLE your_table_name; command directly within the Athena query editor. This command scans the S3 path, loads the missing partition metadata into the Data Catalog, and immediately makes the data queryable.

5. Conclusion

Congratulations on architecting and understanding the Serverless Data Lake Foundation. You have learned how to decouple storage from compute, automate data transformations using event-driven principles, and turn raw data into a highly optimized, queryable state. With these core components in place, you have the foundational engineering skills required to build scalable, enterprise-grade data platforms. As a logical next step, consider exploring AWS Step Functions to orchestrate more complex, multi-job dependencies, or experiment with Apache Iceberg within your AWS Glue jobs to add transactional capabilities and time-travel querying to your data lake.
