Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 26

Redshift Spectrum + Lake Formation — Enterprise Governance on NAS Data

#aws #redshift #lakeformation #amazonfsxfornetappontap

TL;DR

In Part 1, Athena provided serverless SQL. In Part 2, Databricks hit boundaries. In Part 3, Snowflake works with config. In Part 4, DuckDB Lambda was cheapest. In Part 5, EMR Spark delivered full ETL. This Part 6 adds enterprise governance: Redshift Spectrum + Lake Formation provides 4-layer authorization on NAS data.

Redshift Serverless (8 RPU) successfully queries FSx for ONTAP data via S3 Access Points using the same Glue Catalog tables as Athena — no additional data registration needed. Add Lake Formation on top for table-level, column-level, and tag-based access control.

Query	Duration	Comparison with Athena
COUNT(*) 10K rows	3,231 ms	Athena: ~1,500 ms
GROUP BY aggregation	2,580 ms	Athena: ~1,800 ms
COUNT(*) 5M rows	4,277 ms	Athena: 2,196 ms

~2x slower than Athena for simple scans (Redshift Serverless cold start overhead), but Redshift adds DWH capabilities: federated JOINs with local tables, materialized views, and stored procedures.

Quick Decision Guide:

Need DWH JOINs with NAS data → Redshift Spectrum (this article)
Need enterprise governance (table/column/tag) → Add Lake Formation
Need serverless SQL only (no DWH) → Use Athena (Part 1) — faster and cheaper

GitHub: fsxn-lakehouse-integrations

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (Redshift Serverless 8 RPU, ap-northeast-1)
A governance architecture guide for Lake Formation + FSx S3 AP

Read by role:

DWH engineer: Architecture → Setup → Benchmark Results
Security / governance reviewer: 4-Layer Authorization → Governance Impact
Data engineer: When to Use → Comparison with Athena
Partner / SA: Partner Decision Card → Discovery Questions

Prerequisite Concepts

Before reading this article, it helps to understand:

Redshift Spectrum — Redshift's ability to query data in S3 via external schemas (Glue Catalog)
Redshift Serverless — pay-per-query Redshift without cluster management (measured in RPU)
Lake Formation — AWS's centralized governance layer for data lakes (table/column/tag permissions)
Glue Catalog — AWS's metadata catalog (shared by Athena, Redshift Spectrum, EMR, Glue)
External Schema — a Redshift schema that maps to a Glue Catalog database

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Redshift Serverless (8 RPU)                                     │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  SQL Query                                                │   │
│  │  SELECT * FROM fsxn_spectrum.sensor_readings              │   │
│  │         JOIN local_table ON ...                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                       │
│              External Schema (Glue Catalog)                       │
└──────────────────────────┼───────────────────────────────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
    Lake Formation    IAM Role    S3 Access Point
    (table/column     (API        (resource
     permissions)      access)     policy)
              │            │            │
              └────────────┼────────────┘
                           │
                           ▼
              FSx for ONTAP Volume (Parquet files)

4-Layer Authorization:

Lake Formation — Who can access which tables/columns (fine-grained)
IAM — Who can call which AWS APIs
S3 Access Point Policy — Which principals can access this access point
File System — UNIX permissions on the underlying files

Benchmark Results

Query	Duration (ms)	Rows	Notes
CREATE EXTERNAL SCHEMA	240	—	One-time setup
COUNT(*) 10K rows	3,231	10,000	Cold start overhead
GROUP BY + AVG aggregation	2,580	3 groups	Status grouping
COUNT(*) 5M rows	4,277	5,000,000	Large scan

Environment: Redshift Serverless 8 RPU, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Performance note: Redshift Serverless has cold start overhead (~2-3s for first query). Warm queries on provisioned Redshift clusters would be faster. For simple scans, Athena is ~2x faster because it has no DWH initialization overhead.

Evidence Matrix

Layer	Evidence	Result	Interpretation
Redshift Serverless	Workgroup creation (8 RPU)	✅ Pass	Serverless endpoint available
IAM role	Spectrum role with S3 AP permissions	✅ Pass	GetObject + ListBucket on AP ARN
External Schema	CREATE EXTERNAL SCHEMA from Glue	✅ Pass	Same catalog as Athena
Spectrum read (small)	COUNT(*) 10K rows	✅ Pass	3,231ms
Spectrum read (aggregation)	GROUP BY + AVG	✅ Pass	2,580ms
Spectrum read (large)	COUNT(*) 5M rows	✅ Pass	4,277ms
Lake Formation admin	put-data-lake-settings	✅ Pass	Admin configured
Lake Formation grant	Table-level SELECT grant	✅ Pass	Fine-grained permission works
LF column-level	SELECT on 3 permitted columns	✅ Pass	Non-permitted column returns "cannot be resolved"
LF column deny	SELECT on denied column (humidity)	✅ Pass (denied)	"Column cannot be resolved or requester is not authorized"
LF row filter	Data cells filter creation	✅ Pass	Row filter (status='normal') + column filter combined
LF-Tag creation	sensitivity tag (public/internal/confidential)	✅ Pass	Tag created and assigned to table
LF-Tag permission	Tag-based DESCRIBE+ASSOCIATE grant	✅ Pass	Scalable governance via classification
Athena under LF	Query with LF permissions active	✅ Pass	Same governance applies to Athena

Setup

Step 1: Create External Schema (reuses Glue Catalog)

CREATE EXTERNAL SCHEMA fsxn_spectrum
FROM DATA CATALOG
DATABASE 'fsxn_athena_verification'
IAM_ROLE 'arn:aws:iam::<ACCOUNT_ID>:role/fsxn-redshift-spectrum-role'
REGION 'ap-northeast-1';

Key insight: This uses the same Glue Catalog database that Athena uses. No additional table registration needed — if Athena can query it, Redshift Spectrum can too.

Step 2: Query FSx for ONTAP Data

-- Simple count
SELECT COUNT(*) FROM fsxn_spectrum.sensor_readings;
-- Result: 10000 (3,231ms)

-- Aggregation
SELECT status, COUNT(*), AVG(temperature)
FROM fsxn_spectrum.sensor_readings
GROUP BY status;
-- Result: 3 groups (2,580ms)

-- JOIN with local Redshift table (DWH capability)
SELECT s.device_id, s.temperature, d.location
FROM fsxn_spectrum.sensor_readings s
JOIN device_master d ON s.device_id = d.device_id
WHERE s.temperature > 35;

Step 3: Add Lake Formation Governance

# Set Lake Formation admin
aws lakeformation put-data-lake-settings \
  --data-lake-settings '{"DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:user/<admin>"}]}'

# Grant table-level SELECT to a role
aws lakeformation grant-permissions \
  --principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:role/fsxn-analyst-role"}' \
  --resource '{"Table": {"DatabaseName": "fsxn_athena_verification", "Name": "sensor_readings"}}' \
  --permissions '["SELECT", "DESCRIBE"]'

Lake Formation Data Permissions: fine-grained table-level SELECT grant for fsxn-athena-glue-role on sensor_readings table.

Lake Formation Governance Value

Capability	Without Lake Formation	With Lake Formation
Table-level access	S3 AP policy (all-or-nothing per prefix)	Per-table SELECT/DESCRIBE grants
Column-level security	❌ Not possible	✅ Column-level grants + masking
Row-level filtering	❌ Not possible	✅ Data Cells Filter (row filter expressions)
Tag-based access control	❌ Not possible	✅ Classify data → auto-grant by tag (LF-Tags)
Centralized audit	CloudTrail (API-level)	Lake Formation audit (table/column-level)
Cross-account sharing	Share S3 AP (complex)	Share tables via Lake Formation (simple)

Fine-Grained Governance — Verified (May 2026)

All three fine-grained Lake Formation capabilities have been validated on FSx for ONTAP S3 AP data:

Feature	Test	Result
Column-level permission	Grant SELECT on 3 of 4 columns; query the denied column	✅ Permitted columns return data; denied column (`humidity`) returns "cannot be resolved"
Row filter (Data Cells Filter)	Create filter `status = 'normal'`; query returns only matching rows	✅ Only rows matching the filter expression are returned
LF-Tag	Create tag `sensitivity: public/internal/confidential`; assign to table	✅ Tag created, assigned, and queryable via Lake Formation console

Governance implication for regulated workloads: Lake Formation on FSx for ONTAP S3 AP data provides the same fine-grained access control as on native S3 data. Column masking, row filtering, and tag-based classification all work without data movement. This is the strongest AWS-native governance path for FSx for ONTAP data.

Iceberg + Lake Formation path: Glue Data Catalog supports Iceberg table registration natively. For transactional workloads requiring ACID guarantees: sync FSx for ONTAP data to S3 via DataSync → write as Iceberg table (EMR Spark) → register in Glue Catalog → query via Redshift Spectrum with full Lake Formation governance (column/row/tag). This provides the best of both worlds: FSx for ONTAP as source of truth + Iceberg ACID + Lake Formation governance.

Enterprise governance use cases:

Healthcare: Column-level masking of PHI fields (e.g., hide patient_name from analysts)
Finance: Row-level filtering by business unit (each team sees only their data)
Public sector: LF-Tag classification enforcement (sensitivity: public/internal/confidential)

Comparison with Other Engines in This Series

Aspect	Redshift Spectrum	Athena (Part 1)	DuckDB Lambda (Part 4)	EMR Spark (Part 5)
Query latency (5M rows)	4,277ms	2,196ms	N/A (memory limit)	6,780ms
DWH JOINs with local tables	✅ Best	❌	❌	❌
Lake Formation governance	✅	✅	❌	⚠️ Optional
Materialized views	✅	❌	❌	❌
Stored procedures	✅	❌	❌	❌
Zero idle cost	✅ (Serverless)	✅	✅	✅
Write-back to FSxN	❌ (results stay in Redshift)	✅ CTAS	✅ COPY TO	✅ Best
Cold start	~3s (Serverless)	~2s	1.9s	20s
Cost model	RPU-seconds	$/TB scanned	$/invocation	$/job

Partner Decision Card

Customer requirement	Redshift Spectrum + LF today	Recommended path
JOIN NAS data with DWH tables	✅ Best fit	Redshift Spectrum external schema
Enterprise governance (table/column/tag)	✅ Best fit	Add Lake Formation
Existing Redshift investment	✅ Natural extension	Add external schema to existing cluster
Serverless SQL only (no DWH)	⚠️ Overkill	Use Athena (faster, cheaper for simple queries)
Write-back to FSxN	❌ Not supported	Use EMR Serverless (Part 5)
Sub-second latency	❌ Cold start overhead	Use DuckDB Lambda (Part 4)
Cross-account data sharing	✅ Lake Formation sharing	Configure LF cross-account grants
Column-level masking for compliance	✅ Lake Formation	Configure column-level permissions

Discovery Questions for Partners

When a customer asks about Redshift Spectrum + Lake Formation + FSx for ONTAP S3 AP:

Does the customer already have a Redshift cluster or Serverless workgroup? (If yes, adding Spectrum is trivial)
Do they need to JOIN NAS data with existing DWH tables? (This is Redshift Spectrum's unique value)
Is table/column-level governance required? (Lake Formation adds this layer)
Is the workload read-only analytics, or does it need write-back? (Spectrum is read-only from external data)
What is the query frequency? (For < 10 queries/day, Athena is cheaper)
Is cross-account data sharing needed? (Lake Formation simplifies this)
Are there compliance requirements for column-level masking? (Lake Formation provides this)
What is the acceptable query latency? (Redshift Serverless has ~3s cold start)

Governance Impact Summary

Access path	Authorization layers	Auditability	Production suitability
Redshift Spectrum (no LF)	IAM + S3 AP + File System (3 layers)	Medium (CloudTrail)	Good for non-regulated workloads
Redshift Spectrum + Lake Formation	LF + IAM + S3 AP + File System (4 layers)	High (LF audit + CloudTrail)	Recommended for regulated workloads
Athena + Lake Formation	LF + IAM + S3 AP + File System (4 layers)	High (LF audit + CloudTrail)	Recommended for serverless regulated workloads

Key insight: Redshift Spectrum and Athena share the same Glue Catalog and Lake Formation permissions. Governance configured for one automatically applies to the other. This means you can use EMR Spark for write-back, register output in Glue, apply Lake Formation permissions, and query from both Athena and Redshift Spectrum with the same governance.

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
Redshift Spectrum + LF	★★★★★	★★★☆☆	★★☆☆☆	★★★☆☆	★★★☆☆	3.2
Athena + Lake Formation	★★★★★	★★★☆☆	★★☆☆☆	★★★★☆	★★★★☆	3.6
Snowflake External Table	★★★★☆	★★☆☆☆	★★★★☆	★★★☆☆	★★★★☆	3.4
DuckDB Lambda	★☆☆☆☆	★★★★☆	★☆☆☆☆	★★★★★	★★★★★	3.2
EMR Serverless Spark	★★☆☆☆	★★★★☆	★★★☆☆	★★★☆☆	★★★☆☆	3.0

Scoring methodology: Redshift Spectrum + LF scores highest on Governance (same as Athena + LF) but lower on Cost and Simplicity due to RPU pricing and DWH management overhead. Choose Redshift Spectrum when DWH JOINs are required; choose Athena when serverless SQL is sufficient.

Cost Analysis

Component	Cost
Redshift Serverless (8 RPU, per query)	~$0.36/RPU-hour (billed per second)
Redshift Serverless (idle)	$0 (scales to zero)
Lake Formation	$0 (no additional charge)
Glue Catalog	$1/100K objects/month
FSx for ONTAP (existing)	$0 incremental

Monthly estimate (100 queries/day, avg 5s each):

100 queries × 5s × 8 RPU × $0.36/RPU-hour ÷ 3600 = ~$0.40/day = ~$12/month

Compare with:

Athena (same queries): ~$5/TB × data scanned
DuckDB Lambda: ~$1.10/month (but no DWH JOINs)

When Redshift Spectrum is cost-justified: When you already have Redshift and need to JOIN NAS data with local tables. The marginal cost of adding Spectrum queries is low.

When to Use (and When Not To)

Use Redshift Spectrum + Lake Formation when:

Customer already has Redshift (adding Spectrum is trivial)
Need to JOIN NAS data with DWH tables
Enterprise governance (table/column/tag) is required
Cross-account data sharing is needed
Compliance requires column-level masking

Don't use when:

Simple serverless SQL is sufficient (use Athena — faster, cheaper)
Need write-back to FSxN (use EMR Serverless)
Need sub-second latency (use DuckDB Lambda)
No existing Redshift investment (Athena is simpler to start)
Dataset is small and ad-hoc (DuckDB Lambda is cheapest)

Known Failure Signatures

Symptom	Likely cause	Next step
`permission denied for schema`	IAM role not associated with Redshift	Associate IAM role with Redshift namespace
`S3 access denied` on external table	IAM role missing S3 AP permissions	Add S3 AP ARN to role policy
External schema creation fails	Glue database doesn't exist	Create database in Glue Catalog first (or use Athena)
Query returns 0 rows	Table location doesn't match S3 AP path	Verify Glue table LOCATION uses AP alias
`Spectrum is not supported`	Using provisioned cluster without Spectrum	Enable Spectrum or use Serverless
Lake Formation permission denied	LF permissions not granted	Grant SELECT via `aws lakeformation grant-permissions`

What's Next

Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead (critical knowledge for architecture decisions)

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables
Part 4: DuckDB Lambda — Serverless Analytics for $0.00001/Query
Part 5: EMR Spark — Read-Write ETL on NAS Data

References

Key achievement: This validation established that Redshift Spectrum + Lake Formation provides the strongest enterprise governance path for FSx for ONTAP S3 AP data — 4-layer authorization (Lake Formation → IAM → S3 AP → File System), table/column-level access control, and seamless sharing of Glue Catalog with Athena. The same governance configuration applies to both Athena and Redshift Spectrum queries, enabling a unified governance model across query engines.

All benchmarks are from a specific test environment (Redshift Serverless 8 RPU, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Performance improves with warm queries and provisioned clusters.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

DEV Community