DEV Community

Cover image for Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

TL;DR

In Part 1, Athena provided serverless read-only SQL. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with config. In Part 4, DuckDB Lambda delivered the cheapest path. This Part 5 shows the full-power Spark ETL path with write-back.

EMR Serverless Spark can read, transform, and write-back Parquet files on FSx for ONTAP via S3 Access Points. Total Spark execution: 16 seconds for a full ETL pipeline (read → aggregate → window → write). Job total including cold start: 37 seconds. Cost: ~$0.05 per job.

No cluster to manage. No data to copy. No idle cost.

Quick Decision Guide:

  • Need Spark's full power (UDFs, ML, window functions) + write-back → EMR Serverless
  • Read-only SQL, no Spark needed → Use Athena (Part 1) or DuckDB Lambda (Part 4)
  • Need enterprise governance on results → Combine EMR write-back + Athena/Lake Formation for reads

GitHub: fsxn-lakehouse-integrations/integrations/emr-spark/


How to Read This Article

This article is:

  • A reproduction-focused validation report
  • Evidence from one environment (EMR Serverless emr-7.1.0, ap-northeast-1)
  • A deployment guide for EMR Serverless + FSx for ONTAP S3 AP

Read by role:

  • Data engineer: Architecture → Critical Findings → PySpark Job
  • Platform engineer: Deploy and Run → Gotchas → Cost Analysis
  • Partner / SA: Partner Decision Card → Discovery Questions
  • Security reviewer: Governance Impact → When to Use

Prerequisite Concepts

Before reading this article, it helps to understand:

  • EMR Serverless — a deployment option for EMR that runs Spark/Hive jobs without managing clusters
  • EMRFS — EMR's S3 filesystem implementation (s3:// prefix) that natively supports S3 AP aliases
  • S3A vs EMRFSs3a:// (Hadoop's S3AFileSystem) does NOT support S3 AP aliases; always use s3://
  • PySpark — Python API for Apache Spark
  • Parquet timestamp resolution — Spark requires microsecond timestamps; nanosecond (pandas default) causes errors

Why EMR Serverless + FSx for ONTAP?

Traditional ETL This approach
Provision EMR cluster (minutes) Submit job to EMR Serverless (seconds)
Copy data from NAS to S3 Read NAS data in place via S3 AP
Pay for idle cluster Pay only during job execution
Manage cluster scaling Auto-scales per job
Write results to separate S3 bucket Write results back to FSx for ONTAP

EMR Serverless eliminates cluster management entirely. Combined with FSx S3 AP, you get a fully serverless ETL pipeline that reads and writes directly to your NAS storage.


Architecture

┌─────────────────────────────────────────────────────────────────┐
│  EMR Serverless Application (Spark 3.5, emr-7.1.0)              │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  PySpark Job                                             │   │
│  │  ├── Read: spark.read.parquet("s3://<AP>/sensor-data/")  │   │
│  │  ├── Transform: GROUP BY, Window functions               │   │
│  │  └── Write: df.write.parquet("s3://<AP>/gold/output/")   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                    EMRFS (s3://)                                │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                           ▼
              S3 Access Point (internet-origin)
                           │
                           ▼
              FSx for ONTAP Volume (Parquet files)
Enter fullscreen mode Exit fullscreen mode

Key: EMR Serverless uses EMRFS (s3:// prefix) which natively supports S3 AP aliases. No special configuration needed.


Benchmark Results

Operation Duration Notes
Read 10K rows 6.78s First read includes Spark initialization
GROUP BY aggregation 2.52s Status + AVG(temperature)
Window function 1.19s Moving average per device
Write-back to FSxN 3.61s Parquet output to S3 AP
Total Spark execution 16.35s All operations combined
Job total (with cold start) 37s Includes EMR Serverless startup

Environment: EMR Serverless, emr-7.1.0, Spark 3.5, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.


Evidence Matrix

Layer Evidence Result Interpretation
EMR Serverless app create-application ✅ Pass Spark 3.5 app created
IAM role Execution role with S3 AP permissions ✅ Pass GetObject + PutObject on AP ARN
EMRFS read spark.read.parquet("s3://AP/...") ✅ Pass EMRFS natively handles AP alias
Spark transforms GROUP BY, Window, aggregation ✅ Pass Full Spark SQL works
Write-back df.write.parquet("s3://AP/gold/...") ✅ Pass PutObject to FSxN via S3 AP
S3A (negative test) spark.read.parquet("s3a://AP/...") ❌ Expected fail S3A cannot parse AP alias
Job lifecycle start → running → success ✅ Pass 37s total including cold start

Critical Finding: EMRFS vs S3A

This is the most important thing to know:

# ✅ WORKS — EMRFS natively supports S3 AP aliases
df = spark.read.parquet("s3://my-ap-alias-ext-s3alias/sensor-data/")

# ❌ FAILS — S3A cannot parse AP alias URLs
df = spark.read.parquet("s3a://my-ap-alias-ext-s3alias/sensor-data/")
# Error: IllegalArgumentException: Invalid S3 URI
Enter fullscreen mode Exit fullscreen mode

Always use s3:// (EMRFS) with EMR. The s3a:// filesystem (Hadoop's S3AFileSystem) does not understand S3 AP alias format.


Critical Finding: Parquet Timestamp Compatibility

If you generate Parquet files with pandas or DuckDB, they default to nanosecond timestamps. Spark cannot read these:

AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS, true))
Enter fullscreen mode Exit fullscreen mode

Fix: Generate Parquet with microsecond timestamps:

import pyarrow as pa, pyarrow.parquet as pq

# Convert nanosecond → microsecond before writing
table = pa.Table.from_pandas(df)
schema = table.schema
new_fields = []
for field in schema:
    if pa.types.is_timestamp(field.type):
        new_fields.append(field.with_type(pa.timestamp('us')))
    else:
        new_fields.append(field)
new_schema = pa.schema(new_fields)
table = table.cast(new_schema)
pq.write_table(table, 'output.parquet')
Enter fullscreen mode Exit fullscreen mode

This affects cross-engine compatibility: if you write Parquet with DuckDB or pandas and want to read it with Spark (EMR, Glue, Databricks), always use microsecond resolution.


Comparison with Other Engines in This Series

Aspect EMR Serverless Athena (Part 1) DuckDB Lambda (Part 4) Snowflake (Part 3) Databricks (Part 2)
Read from FSx for ONTAP S3 AP ✅ Direct ✅ Direct ✅ Direct ✅ With ARN ⚠️ Partial (explicit path only)
Write-back to FSx for ONTAP ✅ Best ✅ CTAS ✅ COPY TO ⚠️ TBD ❌ Blocked
Complex transforms (UDF, ML) ✅ Best ❌ SQL only ❌ SQL only ⚠️ Snowpark ✅ Best (if data in UC)
Cold start 20s ~2s 1.9s N/A N/A (cluster always on)
Cost per job $0.05 $0.005/TB $0.00001 Credits DBU
Governance IAM only ✅ Glue + LF ❌ None ✅ Tags + RBAC ❌ UC blocked on S3 AP
Distributed processing ✅ Best ✅ Best (if data in UC)
Session policy issues ❌ None ❌ None ❌ None Resolved with ARN ❌ Blocks table creation

Why EMR Serverless instead of Databricks for FSx for ONTAP S3 AP?

EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively — no special configuration needed. In contrast, Databricks UC generates a restrictive session policy that blocks subdirectory listing, table creation, and write operations on FSx for ONTAP S3 AP paths (confirmed by Databricks Support, May 2026).

For teams that need Spark processing on FSx for ONTAP data today:

  • EMR Serverless: Direct read + write-back, no session policy issues, IAM governance
  • Databricks: Requires DataSync → S3 → UC (data copy), but provides full UC governance + Mosaic AI

Partner Decision Card

Customer requirement EMR Serverless today Recommended path
Full Spark ETL with write-back ✅ Best fit Deploy EMR Serverless
Complex transforms (UDFs, ML pipelines) ✅ Best fit Deploy EMR Serverless
Large-scale distributed processing ✅ Best fit Deploy EMR Serverless
Read-only SQL analytics ⚠️ Overkill Use Athena or DuckDB Lambda
Sub-second query latency ❌ 20s cold start Use DuckDB Lambda
Enterprise governance on results ⚠️ IAM only Write to FSxN → read via Athena + Lake Formation
Delta/Iceberg table format ❌ Write not supported on S3 AP Write flat Parquet only. Iceberg read (pre-existing table) is theoretically possible via GetObject but not validated.
Scheduled batch ETL ✅ Good fit EMR Serverless + Step Functions

Discovery Questions for Partners

When a customer asks about EMR Serverless + FSx for ONTAP S3 Access Points:

  1. Does the workload require Spark-specific features (UDFs, ML, window functions, graph)?
  2. Is write-back to FSx for ONTAP required? (EMR is the best write-back path)
  3. What is the typical dataset size? (EMR shines at > 1 GB; for < 1 GB, DuckDB Lambda is cheaper)
  4. Is the workload batch or interactive? (EMR has 20s cold start — not suitable for interactive)
  5. Does the team have Spark expertise? (If not, Athena SQL may be simpler)
  6. Is Delta/Iceberg table format required? (Not supported for write on FSx S3 AP)
  7. What is the job frequency? (10 jobs/day = $15/month; 100 jobs/day = $150/month)
  8. Is there an existing EMR or Glue investment? (Leverage existing IAM roles and scripts)

Governance Impact

Capability EMR Serverless Notes
Authentication IAM (execution role) Standard AWS IAM
Authorization S3 AP policy + IAM No table/column-level control natively
Audit trail CloudWatch Logs + CloudTrail Job logs + S3 API calls logged
Data classification ❌ None built-in Can integrate with Lake Formation for reads
Row/column security ❌ None built-in Apply at read layer (Athena + LF)
Catalog integration ⚠️ Optional (Glue Catalog) Can register output in Glue for downstream governance

Governance model: EMR Serverless uses IAM + S3 AP policy for access control. For enterprise governance, write results back to FSxN and read them via Athena + Lake Formation (Part 6). This gives you Spark's processing power with Lake Formation's governance on the output.

Recommended pattern for governed ETL:

FSxN (raw) → EMR Spark (transform) → FSxN (gold) → Athena + Lake Formation (governed read)
Enter fullscreen mode Exit fullscreen mode

AI Readiness Score

Pattern Governance Performance AI Capability Cost Operational Simplicity Overall
EMR Serverless Spark ★★☆☆☆ ★★★★☆ ★★★☆☆ ★★★☆☆ ★★★☆☆ 3.0
Athena + Lake Formation ★★★★★ ★★★☆☆ ★★☆☆☆ ★★★★☆ ★★★★☆ 3.6
DuckDB Lambda ★☆☆☆☆ ★★★★☆ ★☆☆☆☆ ★★★★★ ★★★★★ 3.2
Snowflake External Table ★★★★☆ ★★☆☆☆ ★★★★☆ ★★★☆☆ ★★★★☆ 3.4
  • Governance: Access control, audit, classification capabilities
  • Performance: Processing throughput for ETL workloads
  • AI Capability: Built-in ML/AI integration (Spark MLlib, etc.)
  • Cost: Total cost for batch ETL workloads
  • Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. EMR scores highest on Performance and AI Capability (Spark MLlib, distributed ML) but lower on Governance (IAM-only) and Simplicity (requires Spark expertise).


Cost Analysis

Component Cost
EMR Serverless (37s job) ~$0.05
FSx for ONTAP (existing) $0 incremental
S3 AP requests $0 (included in FSx)
Script storage (S3) < $0.01

Monthly estimate (10 jobs/day):

  • 300 jobs × $0.05 = $15/month
  • Zero idle cost (application stopped between jobs)

Compare with:

  • EMR on EC2 (m5.xlarge cluster): ~$200/month (always-on)
  • Glue ETL (same workload): ~$0.44/job × 300 = $132/month
  • DuckDB Lambda: ~$1.10/month (but no distributed processing)

The PySpark Job

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import time

spark = SparkSession.builder.appName("FSxN-S3AP-Verification").getOrCreate()

S3_AP = "s3://<your-ap-alias-ext-s3alias>"

# --- Read ---
start = time.time()
df = spark.read.parquet(f"{S3_AP}/sensor-data/sensor_data_microsecond.parquet")
row_count = df.count()
print(f"Read: {row_count} rows in {time.time()-start:.2f}s")

# --- Transform: GROUP BY ---
start = time.time()
agg_df = df.groupBy("status").agg(
    F.count("*").alias("count"),
    F.avg("temperature").alias("avg_temp"),
    F.avg("humidity").alias("avg_humidity")
)
agg_df.show()
print(f"GROUP BY: {time.time()-start:.2f}s")

# --- Transform: Window function ---
start = time.time()
window_spec = Window.partitionBy("device_id").orderBy("timestamp").rowsBetween(-5, 0)
window_df = df.withColumn("moving_avg_temp", F.avg("temperature").over(window_spec))
window_df.select("device_id", "timestamp", "temperature", "moving_avg_temp").show(5)
print(f"Window: {time.time()-start:.2f}s")

# --- Write-back ---
start = time.time()
agg_df.write.mode("overwrite").parquet(f"{S3_AP}/gold/emr_spark_output/")
print(f"Write-back: {time.time()-start:.2f}s")

spark.stop()
Enter fullscreen mode Exit fullscreen mode

Deploy and Run

# 1. Create EMR Serverless application
aws emr-serverless create-application \
  --name "fsxn-spark" \
  --release-label "emr-7.1.0" \
  --type "SPARK" \
  --region ap-northeast-1

# 2. Upload script to S3 (regular bucket, not S3 AP)
aws s3 cp scripts/spark_verification.py \
  s3://my-scripts-bucket/emr-scripts/

# 3. Submit job
aws emr-serverless start-job-run \
  --application-id <app-id> \
  --execution-role-arn arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-scripts-bucket/emr-scripts/spark_verification.py"
    }
  }'

# 4. Check status
aws emr-serverless get-job-run \
  --application-id <app-id> \
  --job-run-id <job-run-id>

# 5. Stop application (zero cost when stopped)
aws emr-serverless stop-application --application-id <app-id>
Enter fullscreen mode Exit fullscreen mode

Known Failure Signatures

Symptom Likely cause Next step
IllegalArgumentException: Invalid S3 URI Using s3a:// instead of s3:// Switch to EMRFS (s3://) prefix
Illegal Parquet type: INT64 (TIMESTAMP(NANOS)) Nanosecond timestamps in Parquet Regenerate with microsecond resolution
Job stuck in PENDING > 60s EMR Serverless capacity Check service quotas; retry
AccessDeniedException on S3 AP IAM role missing AP permissions Add S3 AP ARN to execution role policy
Script not found Script on S3 AP instead of regular S3 Move script to regular S3 bucket
Write fails with 501 Attempting Delta/Iceberg write Use flat Parquet write only

Gotchas and Lessons

1. Script must be on regular S3 (not S3 AP)

EMR Serverless loads the PySpark script from S3. The script location must be a regular S3 bucket, not an FSx S3 AP. The script then reads/writes data from/to the S3 AP.

2. IAM role needs both S3 bucket and S3 AP permissions

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::my-scripts-bucket/*",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>/object/*"
  ]
}
Enter fullscreen mode Exit fullscreen mode

3. Cold start is ~20 seconds

EMR Serverless has a cold start of ~20 seconds before Spark begins executing. For latency-sensitive workloads, keep the application in "started" state (costs ~$0.01/hour for pre-initialized capacity).

4. No session policy issues

Unlike Databricks and Snowflake, EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively.


When to Use EMR Serverless vs Other Engines

Requirement EMR Serverless Athena DuckDB Lambda Glue ETL
Read-only SQL ✅ Best
Write-back to FSxN ✅ Best ✅ (CTAS)
Complex Spark transformations ✅ Best
Sub-second latency ❌ (cold start) ✅ Best
Zero idle cost
Large-scale distributed ✅ Best

What's Next

  • Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance (4-layer authorization) on NAS data
  • Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead

Previously in this series:


References


Key achievement: This validation established that EMR Serverless Spark provides the most capable read-write ETL path for FSx for ONTAP S3 AP data — full Spark SQL, UDFs, window functions, and write-back in 16 seconds of Spark execution at $0.05/job. No cluster management, no data copy, no session policy issues. The trade-off is cold start latency (20s) and lack of built-in governance — pair with Athena + Lake Formation for governed reads on the output.

All benchmarks are from a specific test environment (EMR Serverless emr-7.1.0, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Top comments (0)