DEV Community

Aliasgar K C
Aliasgar K C

Posted on

Unlocking Data at Scale: A Complete Guide to Amazon Redshift Spectrum with AWS Glue and Lake Formation

Amazon Redshift Spectrum brings powerful analytics directly to the data stored in your Amazon S3 data lake. This feature allows you to run complex SQL queries on vast amounts of structured and semi-structured data in S3, without moving or loading the data into your Redshift cluster.

However, setting up Redshift Spectrum with modern AWS services like Lake Formation and AWS Glue can be challenging due to complex permission requirements. This comprehensive guide walks you through the complete process, from basic concepts to production-ready implementation.

What Is Redshift Spectrum?

Redshift Spectrum extends Amazon Redshift's analytics capabilities seamlessly into your S3 data lake. Instead of loading all your data into Redshift tables, you can define external tables that reference S3 files directly and query them using standard SQL. Key highlights include:

  • Query Across S3 and Redshift Data: Run ad-hoc analytics pulling together data in both Redshift and S3.
  • Cost Efficiency: Only pay for the amount of data scanned in S3, with no persistent compute cost if not in use.
  • Scalability: Redshift Spectrum harnesses thousands of AWS-managed nodes for massively parallel query processing. It can scale to exabyte-level datasets.
  • Performance: Compute-intensive work such as filtering and aggregation is pushed to the Spectrum layer, freeing Redshift cluster resources.

Modern Architecture: How the Components Work Together

Understanding the data flow is crucial for successful implementation:

  1. AWS Glue Crawler → Discovers your S3 data and creates table metadata in the Glue Data Catalog
  2. Lake Formation → Manages fine-grained permissions for catalog data and S3 locations
  3. Redshift Spectrum → Queries external tables using Glue Catalog metadata with Lake Formation permissions

This integration provides enterprise-grade security and governance while maintaining performance and scalability.

Redshift Spectrum: Typical Use Case

Scenario: Seamless Analysis of Historical Website Clickstream Data

Imagine an e-commerce company tracking website visitor behavior. Recent data (current year) resides in Amazon Redshift tables for quick analysis, but years of historical clickstream logs are archived in compressed Parquet format in S3. Analysts want to compare recent trends with multi-year history without having to reload terabytes of old data.

With Redshift Spectrum:

  • Analysts can create external tables referencing raw S3 data.
  • Run SQL queries joining fresh Redshift data with archived S3 logs.
  • Gain immediate, full-history insights using a single query engine.

This hybrid analytics workflow is impossible or cost-prohibitive with traditional ETL pipelines. Spectrum saves significant time and storage costs.

Complete Step-by-Step Implementation Guide

Prerequisites

Before starting, ensure you have:

  • Data files (CSV, Parquet, JSON, etc.) in an S3 bucket, in the same AWS region as your Redshift cluster
  • Administrative access to AWS IAM, Glue, Lake Formation, and Redshift
  • AWS CLI configured (optional but recommended for troubleshooting)

Step 1: Create IAM Role for AWS Glue Crawler

Part 1: Setting Up IAM Roles and Permissions

# Navigate to IAM Console → Roles → Create role

Enter fullscreen mode Exit fullscreen mode

Role Configuration:

  • Trusted entity: AWS service
  • Use case: Glue
  • Role nameAWSGlueServiceRole-akc

Required Policies:

  1. AWS Managed PolicyAWSGlueServiceRole
  2. Custom S3 Access Policy (S3-access-policy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}

Enter fullscreen mode Exit fullscreen mode
  1. Glue Catalog Permissions (Table-policy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:UpdatePartition",
                "glue:DeletePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:ACCOUNT-ID:catalog",
                "arn:aws:glue:us-east-1:ACCOUNT-ID:database/your-database",
                "arn:aws:glue:us-east-1:ACCOUNT-ID:table/your-database/*"
            ]
        }
    ]
}

Enter fullscreen mode Exit fullscreen mode
  1. Lake Formation Access (Lakeformation-policy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}

Enter fullscreen mode Exit fullscreen mode

Step 2: Create IAM Role for Redshift Spectrum

# Navigate to IAM Console → Roles → Create role

Enter fullscreen mode Exit fullscreen mode

Role Configuration:

  • Trusted entity: AWS service
  • Use case: Redshift - Customizable
  • Role nameSpectrumRole

Required Policies:

  1. AWS Managed PolicyAmazonS3ReadOnlyAccess
  2. Glue Access Policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreateDatabase"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:ACCOUNT-ID:catalog",
                "arn:aws:glue:us-east-1:ACCOUNT-ID:database/*",
                "arn:aws:glue:us-east-1:ACCOUNT-ID:table/*/*"
            ]
        }
    ]
}

Enter fullscreen mode Exit fullscreen mode
  1. Lake Formation Access Policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}

Enter fullscreen mode Exit fullscreen mode

Part 2: Setting up the Redshift Cluster

Step 3: Create Redshift Namespace and Workgroup

Namespace Configuration:

  • Namespace namespectrum-analytics-namespace
  • Database nameanalytics_db
  • Admin user nameadmin_user
  • Admin user password: Create a secure password
  • Associated IAM roles: We'll add the Spectrum role later

Permissions and encryption:

  • Manage IAM roles: Skip for now (we'll configure later)
  • Encryption: Use AWS managed key (default)
  • Log exports: Enable audit logging and user activity logging (recommended)

Click Create namespace

Workspace Configuration:

# Navigate to Amazon Redshift Console → Serverless dashboard → Create workgroup
Enter fullscreen mode Exit fullscreen mode

Step 4: Associate Redshift Role with Cluster

# Navigate to Redshift Console → Clusters → [Your Cluster] → Properties → Security
# Manage IAM roles → Associate IAM roles → Select Spectrum-role

Enter fullscreen mode Exit fullscreen mode

Part 3: Setting Up AWS Glue Data Catalog

Step 5: Create Glue Database

# Navigate to AWS Glue Console → Databases → Add database
Enter fullscreen mode Exit fullscreen mode

Database Configuration:

  • Nameyour-database (e.g., akc-db)
  • DescriptionDatabase for analytics data
  • Location: Leave empty

Database Configuration

Step 6: Configure Lake Formation Permissions

Grant Database Permissions to Crawler:

# Navigate to Lake Formation Console → Permissions → Grant

Enter fullscreen mode Exit fullscreen mode
  • PrincipalAWSGlueServiceRole-akc
  • LF-Tags or catalog resources: Catalog resources
  • Databases: Select your database
  • Database permissions: ✓ Create tableAlterDropDescribe

Grant Data Location Access to Crawler:

# Navigate to Lake Formation Console → Permissions → Grant
Enter fullscreen mode Exit fullscreen mode
  • PrincipalAWSGlueServiceRole-akc
  • LF-Tags or catalog resources: Data locations
  • Amazon S3 paths3://your-bucket-name/
  • Data location permissions: ✓ Data location access

Grant Permissions to Redshift Spectrum Role:

# Navigate to Lake Formation Console → Permissions → Grant
Enter fullscreen mode Exit fullscreen mode
  • PrincipalSpectrum-role
  • Database: Your database
  • Database permissions: ✓ Describe
  • Tables: All tables
  • Table permissions: ✓ SelectDescribe

Grant Data Location Access to Spectrum Role:

# Navigate to Lake Formation Console → Permissions → Grant
Enter fullscreen mode Exit fullscreen mode
  • PrincipalSpectrum-role
  • LF-Tags or catalog resources: Data locations
  • Amazon S3 paths3://your-bucket-name/
  • Data location permissions: ✓ Data location access

Step 7: Create and Run Glue Crawler

# Navigate to AWS Glue Console → Crawlers → Create crawler
Enter fullscreen mode Exit fullscreen mode

Crawler Configuration:

  • Nameakc-crawler
  • DescriptionCrawler for analytics data

Data Sources:

  • Data source: S3
  • Location of S3 datas3://your-bucket-name/your-folder/
  • Subsequent crawler runs: Crawl all folders

Security Settings:

  • IAM roleAWSGlueServiceRole-akc

Output and Scheduling:

  • Target database: Your database
  • Table name prefix: Leave empty
  • Crawler schedule: On demand

Run the Crawler:

# Navigate to AWS Glue Console → Crawlers → your-data-crawler → Run crawler
Enter fullscreen mode Exit fullscreen mode

Step 8: Verify Table Creation

# Navigate to AWS Glue Console → Databases → your-database → Tables
Enter fullscreen mode Exit fullscreen mode

You should see a table created based on your data file. Click on it to verify:

  • Column names and data types
  • S3 location
  • File format

Part 3: Setting Up Redshift Spectrum

Step 9: Create External Schema in Redshift

Connect to your Redshift cluster and execute:

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'your-database'
IAM_ROLE 'arn:aws:iam::ACCOUNT-ID:role/Spectrum-role'
REGION 'us-east-1';
Enter fullscreen mode Exit fullscreen mode

Step 10: Query Your Data

Basic Query:

SELECT * FROM spectrum_schema.your_table_name LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Check Table Structure:

SELECT * FROM SVV_EXTERNAL_COLUMNS
WHERE schemaname = 'spectrum_schema'
AND tablename = 'your_table_name';
Enter fullscreen mode Exit fullscreen mode

Analytics Queries:

SELECT
c.customer_id,
COUNT(*) AS page_views
FROM
spectrum_schema.akc_bucket_rs c
WHERE
event_time BETWEEN '2025-01-01' AND '2025-06-30'
GROUP BY
c.customer_id;
Enter fullscreen mode Exit fullscreen mode

Conclusion

Amazon Redshift Spectrum, when properly configured with AWS Glue and Lake Formation, provides a powerful platform for modern data analytics. The integration enables organizations to:

  • Scale analytics across petabytes of S3 data
  • Reduce costs by avoiding data movement
  • Implement governance with fine-grained access controls
  • Accelerate insights with familiar SQL interfaces

While the initial setup involves multiple services and complex permissions, following this comprehensive guide ensures a robust, production-ready implementation. The key to success lies in understanding the relationship between IAM, Lake Formation, Glue, and Redshift, and systematically configuring each component.

By implementing the best practices outlined in this guide, you'll have a scalable, secure, and cost-effective data analytics platform that can grow with your organization's needs. Start with a simple use case, validate the setup, and gradually expand to more complex scenarios as your team becomes comfortable with the architecture.

Remember: proper planning and permission management are crucial for success. Take time to understand each component's role, test thoroughly in development environments, and monitor performance in production.

Ready to unlock the full potential of your data lake? Start implementing Redshift Spectrum today and transform your organization's approach to large-scale data analytics.

Top comments (0)