Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 26

Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

#aws #spark #emr #amazonfsxfornetappontap

TL;DR

In Part 1, Athena provided serverless read-only SQL. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with config. In Part 4, DuckDB Lambda delivered the cheapest path. This Part 5 shows the full-power Spark ETL path with write-back.

EMR Serverless Spark can read, transform, and write-back Parquet files on FSx for ONTAP via S3 Access Points. Total Spark execution: 16 seconds for a full ETL pipeline (read → aggregate → window → write). Job total including cold start: 37 seconds. Cost: ~$0.05 per job.

No cluster to manage. No data to copy. No idle cost.

Quick Decision Guide:

Need Spark's full power (UDFs, ML, window functions) + write-back → EMR Serverless
Read-only SQL, no Spark needed → Use Athena (Part 1) or DuckDB Lambda (Part 4)
Need enterprise governance on results → Combine EMR write-back + Athena/Lake Formation for reads

GitHub: fsxn-lakehouse-integrations/integrations/emr-spark/

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (EMR Serverless emr-7.1.0, ap-northeast-1)
A deployment guide for EMR Serverless + FSx for ONTAP S3 AP

Read by role:

Data engineer: Architecture → Critical Findings → PySpark Job
Platform engineer: Deploy and Run → Gotchas → Cost Analysis
Partner / SA: Partner Decision Card → Discovery Questions
Security reviewer: Governance Impact → When to Use

Prerequisite Concepts

Before reading this article, it helps to understand:

EMR Serverless — a deployment option for EMR that runs Spark/Hive jobs without managing clusters
EMRFS — EMR's S3 filesystem implementation (s3:// prefix) that natively supports S3 AP aliases
S3A vs EMRFS — s3a:// (Hadoop's S3AFileSystem) does NOT support S3 AP aliases; always use s3://
PySpark — Python API for Apache Spark
Parquet timestamp resolution — Spark requires microsecond timestamps; nanosecond (pandas default) causes errors

Why EMR Serverless + FSx for ONTAP?

Traditional ETL	This approach
Provision EMR cluster (minutes)	Submit job to EMR Serverless (seconds)
Copy data from NAS to S3	Read NAS data in place via S3 AP
Pay for idle cluster	Pay only during job execution
Manage cluster scaling	Auto-scales per job
Write results to separate S3 bucket	Write results back to FSx for ONTAP

EMR Serverless eliminates cluster management entirely. Combined with FSx S3 AP, you get a fully serverless ETL pipeline that reads and writes directly to your NAS storage.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  EMR Serverless Application (Spark 3.5, emr-7.1.0)              │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  PySpark Job                                             │   │
│  │  ├── Read: spark.read.parquet("s3://<AP>/sensor-data/")  │   │
│  │  ├── Transform: GROUP BY, Window functions               │   │
│  │  └── Write: df.write.parquet("s3://<AP>/gold/output/")   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                    EMRFS (s3://)                                │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                           ▼
              S3 Access Point (internet-origin)
                           │
                           ▼
              FSx for ONTAP Volume (Parquet files)

Key: EMR Serverless uses EMRFS (s3:// prefix) which natively supports S3 AP aliases. No special configuration needed.

Benchmark Results

Operation	Duration	Notes
Read 10K rows	6.78s	First read includes Spark initialization
GROUP BY aggregation	2.52s	Status + AVG(temperature)
Window function	1.19s	Moving average per device
Write-back to FSxN	3.61s	Parquet output to S3 AP
Total Spark execution	16.35s	All operations combined
Job total (with cold start)	37s	Includes EMR Serverless startup

Environment: EMR Serverless, emr-7.1.0, Spark 3.5, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Evidence Matrix

Layer	Evidence	Result	Interpretation
EMR Serverless app	create-application	✅ Pass	Spark 3.5 app created
IAM role	Execution role with S3 AP permissions	✅ Pass	GetObject + PutObject on AP ARN
EMRFS read	spark.read.parquet("s3://AP/...")	✅ Pass	EMRFS natively handles AP alias
Spark transforms	GROUP BY, Window, aggregation	✅ Pass	Full Spark SQL works
Write-back	df.write.parquet("s3://AP/gold/...")	✅ Pass	PutObject to FSxN via S3 AP
S3A (negative test)	spark.read.parquet("s3a://AP/...")	❌ Expected fail	S3A cannot parse AP alias
Job lifecycle	start → running → success	✅ Pass	37s total including cold start

Critical Finding: EMRFS vs S3A

This is the most important thing to know:

# ✅ WORKS — EMRFS natively supports S3 AP aliases
df = spark.read.parquet("s3://my-ap-alias-ext-s3alias/sensor-data/")

# ❌ FAILS — S3A cannot parse AP alias URLs
df = spark.read.parquet("s3a://my-ap-alias-ext-s3alias/sensor-data/")
# Error: IllegalArgumentException: Invalid S3 URI

Always use s3:// (EMRFS) with EMR. The s3a:// filesystem (Hadoop's S3AFileSystem) does not understand S3 AP alias format.

Critical Finding: Parquet Timestamp Compatibility

If you generate Parquet files with pandas or DuckDB, they default to nanosecond timestamps. Spark cannot read these:

AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS, true))

Fix: Generate Parquet with microsecond timestamps:

import pyarrow as pa, pyarrow.parquet as pq

# Convert nanosecond → microsecond before writing
table = pa.Table.from_pandas(df)
schema = table.schema
new_fields = []
for field in schema:
    if pa.types.is_timestamp(field.type):
        new_fields.append(field.with_type(pa.timestamp('us')))
    else:
        new_fields.append(field)
new_schema = pa.schema(new_fields)
table = table.cast(new_schema)
pq.write_table(table, 'output.parquet')

This affects cross-engine compatibility: if you write Parquet with DuckDB or pandas and want to read it with Spark (EMR, Glue, Databricks), always use microsecond resolution.

Comparison with Other Engines in This Series

Aspect	EMR Serverless	Athena (Part 1)	DuckDB Lambda (Part 4)	Snowflake (Part 3)	Databricks (Part 2)
Read from FSx for ONTAP S3 AP	✅ Direct	✅ Direct	✅ Direct	✅ With ARN	⚠️ Partial (explicit path only)
Write-back to FSx for ONTAP	✅ Best	✅ CTAS	✅ COPY TO	⚠️ TBD	❌ Blocked
Complex transforms (UDF, ML)	✅ Best	❌ SQL only	❌ SQL only	⚠️ Snowpark	✅ Best (if data in UC)
Cold start	20s	~2s	1.9s	N/A	N/A (cluster always on)
Cost per job	$0.05	$0.005/TB	$0.00001	Credits	DBU
Governance	IAM only	✅ Glue + LF	❌ None	✅ Tags + RBAC	❌ UC blocked on S3 AP
Distributed processing	✅ Best	✅	❌	✅	✅ Best (if data in UC)
Session policy issues	❌ None	❌ None	❌ None	Resolved with ARN	❌ Blocks table creation

Why EMR Serverless instead of Databricks for FSx for ONTAP S3 AP?

EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively — no special configuration needed. In contrast, Databricks UC generates a restrictive session policy that blocks subdirectory listing, table creation, and write operations on FSx for ONTAP S3 AP paths (confirmed by Databricks Support, May 2026).

For teams that need Spark processing on FSx for ONTAP data today:

EMR Serverless: Direct read + write-back, no session policy issues, IAM governance
Databricks: Requires DataSync → S3 → UC (data copy), but provides full UC governance + Mosaic AI

Partner Decision Card

Customer requirement	EMR Serverless today	Recommended path
Full Spark ETL with write-back	✅ Best fit	Deploy EMR Serverless
Complex transforms (UDFs, ML pipelines)	✅ Best fit	Deploy EMR Serverless
Large-scale distributed processing	✅ Best fit	Deploy EMR Serverless
Read-only SQL analytics	⚠️ Overkill	Use Athena or DuckDB Lambda
Sub-second query latency	❌ 20s cold start	Use DuckDB Lambda
Enterprise governance on results	⚠️ IAM only	Write to FSxN → read via Athena + Lake Formation
Delta/Iceberg table format	❌ Write not supported on S3 AP	Write flat Parquet only. Iceberg read (pre-existing table) is theoretically possible via GetObject but not validated.
Scheduled batch ETL	✅ Good fit	EMR Serverless + Step Functions

Discovery Questions for Partners

When a customer asks about EMR Serverless + FSx for ONTAP S3 Access Points:

Does the workload require Spark-specific features (UDFs, ML, window functions, graph)?
Is write-back to FSx for ONTAP required? (EMR is the best write-back path)
What is the typical dataset size? (EMR shines at > 1 GB; for < 1 GB, DuckDB Lambda is cheaper)
Is the workload batch or interactive? (EMR has 20s cold start — not suitable for interactive)
Does the team have Spark expertise? (If not, Athena SQL may be simpler)
Is Delta/Iceberg table format required? (Not supported for write on FSx S3 AP)
What is the job frequency? (10 jobs/day = $15/month; 100 jobs/day = $150/month)
Is there an existing EMR or Glue investment? (Leverage existing IAM roles and scripts)

Governance Impact

Capability	EMR Serverless	Notes
Authentication	IAM (execution role)	Standard AWS IAM
Authorization	S3 AP policy + IAM	No table/column-level control natively
Audit trail	CloudWatch Logs + CloudTrail	Job logs + S3 API calls logged
Data classification	❌ None built-in	Can integrate with Lake Formation for reads
Row/column security	❌ None built-in	Apply at read layer (Athena + LF)
Catalog integration	⚠️ Optional (Glue Catalog)	Can register output in Glue for downstream governance

Governance model: EMR Serverless uses IAM + S3 AP policy for access control. For enterprise governance, write results back to FSxN and read them via Athena + Lake Formation (Part 6). This gives you Spark's processing power with Lake Formation's governance on the output.

Recommended pattern for governed ETL:

FSxN (raw) → EMR Spark (transform) → FSxN (gold) → Athena + Lake Formation (governed read)

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
EMR Serverless Spark	★★☆☆☆	★★★★☆	★★★☆☆	★★★☆☆	★★★☆☆	3.0
Athena + Lake Formation	★★★★★	★★★☆☆	★★☆☆☆	★★★★☆	★★★★☆	3.6
DuckDB Lambda	★☆☆☆☆	★★★★☆	★☆☆☆☆	★★★★★	★★★★★	3.2
Snowflake External Table	★★★★☆	★★☆☆☆	★★★★☆	★★★☆☆	★★★★☆	3.4

Governance: Access control, audit, classification capabilities
Performance: Processing throughput for ETL workloads
AI Capability: Built-in ML/AI integration (Spark MLlib, etc.)
Cost: Total cost for batch ETL workloads
Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. EMR scores highest on Performance and AI Capability (Spark MLlib, distributed ML) but lower on Governance (IAM-only) and Simplicity (requires Spark expertise).

Cost Analysis

Component	Cost
EMR Serverless (37s job)	~$0.05
FSx for ONTAP (existing)	$0 incremental
S3 AP requests	$0 (included in FSx)
Script storage (S3)	< $0.01

Monthly estimate (10 jobs/day):

300 jobs × $0.05 = $15/month
Zero idle cost (application stopped between jobs)

Compare with:

EMR on EC2 (m5.xlarge cluster): ~$200/month (always-on)
Glue ETL (same workload): ~$0.44/job × 300 = $132/month
DuckDB Lambda: ~$1.10/month (but no distributed processing)

The PySpark Job

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import time

spark = SparkSession.builder.appName("FSxN-S3AP-Verification").getOrCreate()

S3_AP = "s3://<your-ap-alias-ext-s3alias>"

# --- Read ---
start = time.time()
df = spark.read.parquet(f"{S3_AP}/sensor-data/sensor_data_microsecond.parquet")
row_count = df.count()
print(f"Read: {row_count} rows in {time.time()-start:.2f}s")

# --- Transform: GROUP BY ---
start = time.time()
agg_df = df.groupBy("status").agg(
    F.count("*").alias("count"),
    F.avg("temperature").alias("avg_temp"),
    F.avg("humidity").alias("avg_humidity")
)
agg_df.show()
print(f"GROUP BY: {time.time()-start:.2f}s")

# --- Transform: Window function ---
start = time.time()
window_spec = Window.partitionBy("device_id").orderBy("timestamp").rowsBetween(-5, 0)
window_df = df.withColumn("moving_avg_temp", F.avg("temperature").over(window_spec))
window_df.select("device_id", "timestamp", "temperature", "moving_avg_temp").show(5)
print(f"Window: {time.time()-start:.2f}s")

# --- Write-back ---
start = time.time()
agg_df.write.mode("overwrite").parquet(f"{S3_AP}/gold/emr_spark_output/")
print(f"Write-back: {time.time()-start:.2f}s")

spark.stop()

Deploy and Run

# 1. Create EMR Serverless application
aws emr-serverless create-application \
  --name "fsxn-spark" \
  --release-label "emr-7.1.0" \
  --type "SPARK" \
  --region ap-northeast-1

# 2. Upload script to S3 (regular bucket, not S3 AP)
aws s3 cp scripts/spark_verification.py \
  s3://my-scripts-bucket/emr-scripts/

# 3. Submit job
aws emr-serverless start-job-run \
  --application-id <app-id> \
  --execution-role-arn arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-scripts-bucket/emr-scripts/spark_verification.py"
    }
  }'

# 4. Check status
aws emr-serverless get-job-run \
  --application-id <app-id> \
  --job-run-id <job-run-id>

# 5. Stop application (zero cost when stopped)
aws emr-serverless stop-application --application-id <app-id>

Known Failure Signatures

Symptom	Likely cause	Next step
`IllegalArgumentException: Invalid S3 URI`	Using `s3a://` instead of `s3://`	Switch to EMRFS (`s3://`) prefix
`Illegal Parquet type: INT64 (TIMESTAMP(NANOS))`	Nanosecond timestamps in Parquet	Regenerate with microsecond resolution
Job stuck in PENDING > 60s	EMR Serverless capacity	Check service quotas; retry
`AccessDeniedException` on S3 AP	IAM role missing AP permissions	Add S3 AP ARN to execution role policy
Script not found	Script on S3 AP instead of regular S3	Move script to regular S3 bucket
Write fails with 501	Attempting Delta/Iceberg write	Use flat Parquet write only

Gotchas and Lessons

1. Script must be on regular S3 (not S3 AP)

EMR Serverless loads the PySpark script from S3. The script location must be a regular S3 bucket, not an FSx S3 AP. The script then reads/writes data from/to the S3 AP.

2. IAM role needs both S3 bucket and S3 AP permissions

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::my-scripts-bucket/*",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>/object/*"
  ]
}

3. Cold start is ~20 seconds

EMR Serverless has a cold start of ~20 seconds before Spark begins executing. For latency-sensitive workloads, keep the application in "started" state (costs ~$0.01/hour for pre-initialized capacity).

4. No session policy issues

Unlike Databricks and Snowflake, EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively.

When to Use EMR Serverless vs Other Engines

Requirement	EMR Serverless	Athena	DuckDB Lambda	Glue ETL
Read-only SQL	✅	✅ Best	✅	✅
Write-back to FSxN	✅ Best	✅ (CTAS)	✅	✅
Complex Spark transformations	✅ Best	❌	❌	✅
Sub-second latency	❌ (cold start)	❌	✅ Best	❌
Zero idle cost	✅	✅	✅	✅
Large-scale distributed	✅ Best	✅	❌	✅

What's Next

Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance (4-layer authorization) on NAS data
Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables
Part 4: DuckDB Lambda — Serverless Analytics for $0.00001/Query

References

Key achievement: This validation established that EMR Serverless Spark provides the most capable read-write ETL path for FSx for ONTAP S3 AP data — full Spark SQL, UDFs, window functions, and write-back in 16 seconds of Spark execution at $0.05/job. No cluster management, no data copy, no session policy issues. The trade-off is cold start latency (20s) and lack of built-in governance — pair with Athena + Lake Formation for governed reads on the output.

All benchmarks are from a specific test environment (EMR Serverless emr-7.1.0, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

DEV Community