DEV Community

Cover image for Redshift Spectrum + Lake Formation — Enterprise Governance on NAS Data

Redshift Spectrum + Lake Formation — Enterprise Governance on NAS Data

TL;DR

In Part 1, Athena provided serverless SQL. In Part 2, Databricks hit boundaries. In Part 3, Snowflake works with config. In Part 4, DuckDB Lambda was cheapest. In Part 5, EMR Spark delivered full ETL. This Part 6 adds enterprise governance: Redshift Spectrum + Lake Formation provides 4-layer authorization on NAS data.

Redshift Serverless (8 RPU) successfully queries FSx for ONTAP data via S3 Access Points using the same Glue Catalog tables as Athena — no additional data registration needed. Add Lake Formation on top for table-level, column-level, and tag-based access control.

Query Duration Comparison with Athena
COUNT(*) 10K rows 3,231 ms Athena: ~1,500 ms
GROUP BY aggregation 2,580 ms Athena: ~1,800 ms
COUNT(*) 5M rows 4,277 ms Athena: 2,196 ms

~2x slower than Athena for simple scans (Redshift Serverless cold start overhead), but Redshift adds DWH capabilities: federated JOINs with local tables, materialized views, and stored procedures.

Quick Decision Guide:

  • Need DWH JOINs with NAS data → Redshift Spectrum (this article)
  • Need enterprise governance (table/column/tag) → Add Lake Formation
  • Need serverless SQL only (no DWH) → Use Athena (Part 1) — faster and cheaper

GitHub: fsxn-lakehouse-integrations


How to Read This Article

This article is:

  • A reproduction-focused validation report
  • Evidence from one environment (Redshift Serverless 8 RPU, ap-northeast-1)
  • A governance architecture guide for Lake Formation + FSx S3 AP

Read by role:

  • DWH engineer: Architecture → Setup → Benchmark Results
  • Security / governance reviewer: 4-Layer Authorization → Governance Impact
  • Data engineer: When to Use → Comparison with Athena
  • Partner / SA: Partner Decision Card → Discovery Questions

Prerequisite Concepts

Before reading this article, it helps to understand:

  • Redshift Spectrum — Redshift's ability to query data in S3 via external schemas (Glue Catalog)
  • Redshift Serverless — pay-per-query Redshift without cluster management (measured in RPU)
  • Lake Formation — AWS's centralized governance layer for data lakes (table/column/tag permissions)
  • Glue Catalog — AWS's metadata catalog (shared by Athena, Redshift Spectrum, EMR, Glue)
  • External Schema — a Redshift schema that maps to a Glue Catalog database

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Redshift Serverless (8 RPU)                                     │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  SQL Query                                                │   │
│  │  SELECT * FROM fsxn_spectrum.sensor_readings              │   │
│  │         JOIN local_table ON ...                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                       │
│              External Schema (Glue Catalog)                       │
└──────────────────────────┼───────────────────────────────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
    Lake Formation    IAM Role    S3 Access Point
    (table/column     (API        (resource
     permissions)      access)     policy)
              │            │            │
              └────────────┼────────────┘
                           │
                           ▼
              FSx for ONTAP Volume (Parquet files)
Enter fullscreen mode Exit fullscreen mode

4-Layer Authorization:

  1. Lake Formation — Who can access which tables/columns (fine-grained)
  2. IAM — Who can call which AWS APIs
  3. S3 Access Point Policy — Which principals can access this access point
  4. File System — UNIX permissions on the underlying files

Benchmark Results

Query Duration (ms) Rows Notes
CREATE EXTERNAL SCHEMA 240 One-time setup
COUNT(*) 10K rows 3,231 10,000 Cold start overhead
GROUP BY + AVG aggregation 2,580 3 groups Status grouping
COUNT(*) 5M rows 4,277 5,000,000 Large scan

Environment: Redshift Serverless 8 RPU, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Performance note: Redshift Serverless has cold start overhead (~2-3s for first query). Warm queries on provisioned Redshift clusters would be faster. For simple scans, Athena is ~2x faster because it has no DWH initialization overhead.


Evidence Matrix

Layer Evidence Result Interpretation
Redshift Serverless Workgroup creation (8 RPU) ✅ Pass Serverless endpoint available
IAM role Spectrum role with S3 AP permissions ✅ Pass GetObject + ListBucket on AP ARN
External Schema CREATE EXTERNAL SCHEMA from Glue ✅ Pass Same catalog as Athena
Spectrum read (small) COUNT(*) 10K rows ✅ Pass 3,231ms
Spectrum read (aggregation) GROUP BY + AVG ✅ Pass 2,580ms
Spectrum read (large) COUNT(*) 5M rows ✅ Pass 4,277ms
Lake Formation admin put-data-lake-settings ✅ Pass Admin configured
Lake Formation grant Table-level SELECT grant ✅ Pass Fine-grained permission works
LF column-level SELECT on 3 permitted columns ✅ Pass Non-permitted column returns "cannot be resolved"
LF column deny SELECT on denied column (humidity) ✅ Pass (denied) "Column cannot be resolved or requester is not authorized"
LF row filter Data cells filter creation ✅ Pass Row filter (status='normal') + column filter combined
LF-Tag creation sensitivity tag (public/internal/confidential) ✅ Pass Tag created and assigned to table
LF-Tag permission Tag-based DESCRIBE+ASSOCIATE grant ✅ Pass Scalable governance via classification
Athena under LF Query with LF permissions active ✅ Pass Same governance applies to Athena

Setup

Step 1: Create External Schema (reuses Glue Catalog)

CREATE EXTERNAL SCHEMA fsxn_spectrum
FROM DATA CATALOG
DATABASE 'fsxn_athena_verification'
IAM_ROLE 'arn:aws:iam::<ACCOUNT_ID>:role/fsxn-redshift-spectrum-role'
REGION 'ap-northeast-1';
Enter fullscreen mode Exit fullscreen mode

Key insight: This uses the same Glue Catalog database that Athena uses. No additional table registration needed — if Athena can query it, Redshift Spectrum can too.

Step 2: Query FSx for ONTAP Data

-- Simple count
SELECT COUNT(*) FROM fsxn_spectrum.sensor_readings;
-- Result: 10000 (3,231ms)

-- Aggregation
SELECT status, COUNT(*), AVG(temperature)
FROM fsxn_spectrum.sensor_readings
GROUP BY status;
-- Result: 3 groups (2,580ms)

-- JOIN with local Redshift table (DWH capability)
SELECT s.device_id, s.temperature, d.location
FROM fsxn_spectrum.sensor_readings s
JOIN device_master d ON s.device_id = d.device_id
WHERE s.temperature > 35;
Enter fullscreen mode Exit fullscreen mode

Step 3: Add Lake Formation Governance

# Set Lake Formation admin
aws lakeformation put-data-lake-settings \
  --data-lake-settings '{"DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:user/<admin>"}]}'

# Grant table-level SELECT to a role
aws lakeformation grant-permissions \
  --principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:role/fsxn-analyst-role"}' \
  --resource '{"Table": {"DatabaseName": "fsxn_athena_verification", "Name": "sensor_readings"}}' \
  --permissions '["SELECT", "DESCRIBE"]'
Enter fullscreen mode Exit fullscreen mode

Lake Formation Data Permissions — table-level SELECT grant

Lake Formation Data Permissions: fine-grained table-level SELECT grant for fsxn-athena-glue-role on sensor_readings table.


Lake Formation Governance Value

Capability Without Lake Formation With Lake Formation
Table-level access S3 AP policy (all-or-nothing per prefix) Per-table SELECT/DESCRIBE grants
Column-level security ❌ Not possible ✅ Column-level grants + masking
Row-level filtering ❌ Not possible ✅ Data Cells Filter (row filter expressions)
Tag-based access control ❌ Not possible ✅ Classify data → auto-grant by tag (LF-Tags)
Centralized audit CloudTrail (API-level) Lake Formation audit (table/column-level)
Cross-account sharing Share S3 AP (complex) Share tables via Lake Formation (simple)

Fine-Grained Governance — Verified (May 2026)

All three fine-grained Lake Formation capabilities have been validated on FSx for ONTAP S3 AP data:

Feature Test Result
Column-level permission Grant SELECT on 3 of 4 columns; query the denied column ✅ Permitted columns return data; denied column (humidity) returns "cannot be resolved"
Row filter (Data Cells Filter) Create filter status = 'normal'; query returns only matching rows ✅ Only rows matching the filter expression are returned
LF-Tag Create tag sensitivity: public/internal/confidential; assign to table ✅ Tag created, assigned, and queryable via Lake Formation console

Governance implication for regulated workloads: Lake Formation on FSx for ONTAP S3 AP data provides the same fine-grained access control as on native S3 data. Column masking, row filtering, and tag-based classification all work without data movement. This is the strongest AWS-native governance path for FSx for ONTAP data.

Iceberg + Lake Formation path: Glue Data Catalog supports Iceberg table registration natively. For transactional workloads requiring ACID guarantees: sync FSx for ONTAP data to S3 via DataSync → write as Iceberg table (EMR Spark) → register in Glue Catalog → query via Redshift Spectrum with full Lake Formation governance (column/row/tag). This provides the best of both worlds: FSx for ONTAP as source of truth + Iceberg ACID + Lake Formation governance.

Enterprise governance use cases:

  • Healthcare: Column-level masking of PHI fields (e.g., hide patient_name from analysts)
  • Finance: Row-level filtering by business unit (each team sees only their data)
  • Public sector: LF-Tag classification enforcement (sensitivity: public/internal/confidential)

Comparison with Other Engines in This Series

Aspect Redshift Spectrum Athena (Part 1) DuckDB Lambda (Part 4) EMR Spark (Part 5)
Query latency (5M rows) 4,277ms 2,196ms N/A (memory limit) 6,780ms
DWH JOINs with local tables ✅ Best
Lake Formation governance ⚠️ Optional
Materialized views
Stored procedures
Zero idle cost ✅ (Serverless)
Write-back to FSxN ❌ (results stay in Redshift) ✅ CTAS ✅ COPY TO ✅ Best
Cold start ~3s (Serverless) ~2s 1.9s 20s
Cost model RPU-seconds $/TB scanned $/invocation $/job

Partner Decision Card

Customer requirement Redshift Spectrum + LF today Recommended path
JOIN NAS data with DWH tables ✅ Best fit Redshift Spectrum external schema
Enterprise governance (table/column/tag) ✅ Best fit Add Lake Formation
Existing Redshift investment ✅ Natural extension Add external schema to existing cluster
Serverless SQL only (no DWH) ⚠️ Overkill Use Athena (faster, cheaper for simple queries)
Write-back to FSxN ❌ Not supported Use EMR Serverless (Part 5)
Sub-second latency ❌ Cold start overhead Use DuckDB Lambda (Part 4)
Cross-account data sharing ✅ Lake Formation sharing Configure LF cross-account grants
Column-level masking for compliance ✅ Lake Formation Configure column-level permissions

Discovery Questions for Partners

When a customer asks about Redshift Spectrum + Lake Formation + FSx for ONTAP S3 AP:

  1. Does the customer already have a Redshift cluster or Serverless workgroup? (If yes, adding Spectrum is trivial)
  2. Do they need to JOIN NAS data with existing DWH tables? (This is Redshift Spectrum's unique value)
  3. Is table/column-level governance required? (Lake Formation adds this layer)
  4. Is the workload read-only analytics, or does it need write-back? (Spectrum is read-only from external data)
  5. What is the query frequency? (For < 10 queries/day, Athena is cheaper)
  6. Is cross-account data sharing needed? (Lake Formation simplifies this)
  7. Are there compliance requirements for column-level masking? (Lake Formation provides this)
  8. What is the acceptable query latency? (Redshift Serverless has ~3s cold start)

Governance Impact Summary

Access path Authorization layers Auditability Production suitability
Redshift Spectrum (no LF) IAM + S3 AP + File System (3 layers) Medium (CloudTrail) Good for non-regulated workloads
Redshift Spectrum + Lake Formation LF + IAM + S3 AP + File System (4 layers) High (LF audit + CloudTrail) Recommended for regulated workloads
Athena + Lake Formation LF + IAM + S3 AP + File System (4 layers) High (LF audit + CloudTrail) Recommended for serverless regulated workloads

Key insight: Redshift Spectrum and Athena share the same Glue Catalog and Lake Formation permissions. Governance configured for one automatically applies to the other. This means you can use EMR Spark for write-back, register output in Glue, apply Lake Formation permissions, and query from both Athena and Redshift Spectrum with the same governance.


AI Readiness Score

Pattern Governance Performance AI Capability Cost Operational Simplicity Overall
Redshift Spectrum + LF ★★★★★ ★★★☆☆ ★★☆☆☆ ★★★☆☆ ★★★☆☆ 3.2
Athena + Lake Formation ★★★★★ ★★★☆☆ ★★☆☆☆ ★★★★☆ ★★★★☆ 3.6
Snowflake External Table ★★★★☆ ★★☆☆☆ ★★★★☆ ★★★☆☆ ★★★★☆ 3.4
DuckDB Lambda ★☆☆☆☆ ★★★★☆ ★☆☆☆☆ ★★★★★ ★★★★★ 3.2
EMR Serverless Spark ★★☆☆☆ ★★★★☆ ★★★☆☆ ★★★☆☆ ★★★☆☆ 3.0

Scoring methodology: Redshift Spectrum + LF scores highest on Governance (same as Athena + LF) but lower on Cost and Simplicity due to RPU pricing and DWH management overhead. Choose Redshift Spectrum when DWH JOINs are required; choose Athena when serverless SQL is sufficient.


Cost Analysis

Component Cost
Redshift Serverless (8 RPU, per query) ~$0.36/RPU-hour (billed per second)
Redshift Serverless (idle) $0 (scales to zero)
Lake Formation $0 (no additional charge)
Glue Catalog $1/100K objects/month
FSx for ONTAP (existing) $0 incremental

Monthly estimate (100 queries/day, avg 5s each):

  • 100 queries × 5s × 8 RPU × $0.36/RPU-hour ÷ 3600 = ~$0.40/day = ~$12/month

Compare with:

  • Athena (same queries): ~$5/TB × data scanned
  • DuckDB Lambda: ~$1.10/month (but no DWH JOINs)

When Redshift Spectrum is cost-justified: When you already have Redshift and need to JOIN NAS data with local tables. The marginal cost of adding Spectrum queries is low.


When to Use (and When Not To)

Use Redshift Spectrum + Lake Formation when:

  • Customer already has Redshift (adding Spectrum is trivial)
  • Need to JOIN NAS data with DWH tables
  • Enterprise governance (table/column/tag) is required
  • Cross-account data sharing is needed
  • Compliance requires column-level masking

Don't use when:

  • Simple serverless SQL is sufficient (use Athena — faster, cheaper)
  • Need write-back to FSxN (use EMR Serverless)
  • Need sub-second latency (use DuckDB Lambda)
  • No existing Redshift investment (Athena is simpler to start)
  • Dataset is small and ad-hoc (DuckDB Lambda is cheapest)

Known Failure Signatures

Symptom Likely cause Next step
permission denied for schema IAM role not associated with Redshift Associate IAM role with Redshift namespace
S3 access denied on external table IAM role missing S3 AP permissions Add S3 AP ARN to role policy
External schema creation fails Glue database doesn't exist Create database in Glue Catalog first (or use Athena)
Query returns 0 rows Table location doesn't match S3 AP path Verify Glue table LOCATION uses AP alias
Spectrum is not supported Using provisioned cluster without Spectrum Enable Spectrum or use Serverless
Lake Formation permission denied LF permissions not granted Grant SELECT via aws lakeformation grant-permissions

What's Next

  • Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead (critical knowledge for architecture decisions)

Previously in this series:


References


Key achievement: This validation established that Redshift Spectrum + Lake Formation provides the strongest enterprise governance path for FSx for ONTAP S3 AP data — 4-layer authorization (Lake Formation → IAM → S3 AP → File System), table/column-level access control, and seamless sharing of Glue Catalog with Athena. The same governance configuration applies to both Athena and Redshift Spectrum queries, enabling a unified governance model across query engines.

All benchmarks are from a specific test environment (Redshift Serverless 8 RPU, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Performance improves with warm queries and provisioned clusters.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Top comments (0)