Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 26

Serverless Analytics on NAS Data for $0.00001/Query — DuckDB Lambda FSx for ONTAP

#aws #duckdb #serverless #amazonfsxfornetappontap

TL;DR

In Part 1, Athena worked cleanly. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with AWS_ACCESS_POINT_ARN config. This Part 4 shows the cheapest path: $0.00001/query.

You can deploy DuckDB inside a Lambda function (arm64, 1024 MB) and query Parquet files on FSx for ONTAP via S3 Access Points. Warm queries return in 452ms for 10,000 rows. Cold start is ~1.9s. Cost per query: approximately $0.00001.

No database server. No cluster. No idle cost. Just a Lambda function with an 18 MB layer.

Quick Decision Guide:

Cheapest ad-hoc analytics on NAS data → DuckDB Lambda ($1.10/month for 1000 queries/day)
Need governance / catalog → Use Athena + Lake Formation instead
Need distributed processing > 10 GB → Use EMR Serverless (Part 5)

GitHub: fsxn-lakehouse-integrations/integrations/duckdb/

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (Lambda arm64, 1024 MB, ap-northeast-1)
A deployment guide for DuckDB + Lambda + FSx for ONTAP S3 AP

Read by role:

Developer / Data engineer: Architecture → Deploy in 5 Minutes → Handler config
Cost-conscious team lead: Cost Analysis → When to Use
Partner / SA: Partner Decision Card → Discovery Questions
Security reviewer: Governance Impact → When Not To Use

Prerequisite Concepts

Before reading this article, it helps to understand:

DuckDB — an in-process SQL engine (like SQLite for analytics) that runs inside your application
httpfs extension — DuckDB's HTTP/S3 file system extension for reading remote Parquet files
S3 Access Point alias — the *-ext-s3alias hostname that FSx for ONTAP S3 AP exposes
Lambda arm64 (Graviton2) — AWS Lambda on ARM architecture, ~20% cheaper than x86
Path-style S3 access — required for S3 AP aliases (s3_url_style = 'path')

Why DuckDB + Lambda + FSx for ONTAP?

Traditional approach	This approach
Provision Redshift/EMR cluster	Deploy Lambda function
Pay for idle compute	Pay only when queried
Copy data from NAS to S3	Query NAS data in place
Manage infrastructure	Zero infrastructure
Minutes to first query	Sub-second (warm)

DuckDB is an in-process SQL engine — think "SQLite for analytics." It runs inside your Lambda function, reads Parquet directly from S3 via the httpfs extension, and returns results. No external database, no connection pooling, no cluster scaling.

Architecture

Client (API Gateway / SDK / CLI)
    │
    ▼
Lambda Function (arm64, Python 3.12, 1024 MB)
    │
    ├── DuckDB (in-process, 18 MB layer)
    │   └── httpfs extension (S3 access)
    │
    └── S3 Access Point (internet-origin)
            │
            └── FSx for ONTAP Volume
                ├── sensor_data.parquet (10K rows)
                └── sensor_data_large.parquet (5M rows)

No VPC attachment needed (internet-origin AP). This avoids the ~1-2s ENI cold start penalty.

Lambda function: arm64 (Graviton2), Python 3.12, 1024 MB memory, DuckDB httpfs layer attached.

Benchmark Results

Test	Latency	Notes
Cold start (simple query)	1,854 ms	httpfs INSTALL + credential setup
Warm (simple query)	0.96 ms	Connection reused
Warm COUNT(*) 10K rows	452 ms	S3 AP → Parquet → result
Warm GROUP BY 10K rows	1,411 ms	Full scan + aggregation
Local COUNT(*) 5M rows	779 ms	For comparison (over internet)
Local COPY TO Parquet	304 ms	Write-back to FSxN

Environment: Lambda arm64, 1024 MB, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Storage context (Yakio-san lens): The 10K-row Parquet file is ~250 KB. At 128 MBps provisioned throughput, a single query consumes negligible bandwidth (<0.2% of capacity). NFS/SMB workloads sharing the same file system are not impacted. For concurrent Lambda invocations, throughput becomes a factor at ~500+ simultaneous queries reading large files.

Key insight: Lambda warm queries (452ms) are faster than local queries (628ms) because Lambda runs in the same region as FSxN — lower network latency.

Lambda test invocation: cold start 2071ms, query returns 3 status groups (normal: 8505, warning: 1221, critical: 274). Memory used: 185 MB.

Evidence Matrix

Layer	Evidence	Result	Interpretation
Lambda deployment	CloudFormation stack	✅ Pass	Function + Layer deployed
DuckDB initialization	httpfs INSTALL + LOAD	✅ Pass	Extension loads in ~1.8s cold
S3 AP connectivity	read_parquet() via httpfs	✅ Pass	Path-style + endpoint config works
Read (small)	COUNT(*) 10K rows	✅ Pass	452ms warm
Read (aggregation)	GROUP BY + AVG	✅ Pass	1,411ms warm
Read (large)	COUNT(*) 5M rows	✅ Pass	779ms (local test)
Write-back	COPY TO Parquet	✅ Pass	304ms write to S3 AP
IAM authorization	Lambda execution role	✅ Pass	s3:GetObject/PutObject on AP ARN

Deploy in 5 Minutes

1. Build the Layer (Docker required)

cd integrations/duckdb
docker run --rm --platform linux/arm64 --entrypoint bash \
  -v "$(pwd)/dist:/output" \
  public.ecr.aws/lambda/python:3.12-arm64 \
  -c "dnf install -y zip > /dev/null 2>&1 && \
      pip install duckdb==1.1.3 --target /tmp/python/lib/python3.12/site-packages/ --quiet && \
      cd /tmp && zip -qr /output/duckdb-layer.zip python/"

2. Deploy

./deploy.sh --region ap-northeast-1

This builds the layer, uploads to S3, deploys CloudFormation (Lambda + IAM + Layer), and runs a test invocation.

3. Query

aws lambda invoke \
  --function-name fsxn-duckdb-query \
  --payload '{"query": "SELECT status, COUNT(*) FROM read_parquet('\''s3://{S3_AP}/sensor-data/sensor_data.parquet'\'') GROUP BY status"}' \
  --cli-binary-format raw-in-base64-out \
  response.json && cat response.json | jq .

The {S3_AP} placeholder is automatically replaced with your S3 AP alias from the Lambda environment variable.

The Handler (Key Configuration)

Three settings are critical for DuckDB + S3 AP in Lambda:

# 1. Lambda has no home directory
conn.execute("SET home_directory = '/tmp';")

# 2. S3 AP aliases require path-style access
conn.execute("SET s3_url_style = 'path';")

# 3. Explicit endpoint for AP alias resolution
conn.execute("SET s3_endpoint = 's3.ap-northeast-1.amazonaws.com';")

Without these, you'll get:

Can't find the home directory (missing #1)
Unknown error for HTTP HEAD (missing #2 or #3)

Full handler: lambda/handler.py

Comparison with Other Engines in This Series

Aspect	DuckDB Lambda	Athena (Part 1)	Snowflake (Part 3)	EMR Spark (Part 5)
Query latency (10K rows)	452ms (warm)	~2s	~3s	6.78s
Cold start	1.9s	~2s	N/A (warehouse)	20s
Cost per query	$0.00001	$0.005/TB	Credits	$0.05/job
Write-back to FSxN	✅ COPY TO	✅ CTAS	⚠️ TBD	✅ Spark write
Governance / Catalog	❌ None	✅ Glue + LF	✅ Tags + RBAC	⚠️ IAM only
Max dataset size	~1 GB (Lambda limit)	Unlimited	Unlimited	Unlimited
Distributed processing	❌ Single process	✅	✅	✅

Partner Decision Card

Customer requirement	DuckDB Lambda today	Recommended path
Cheapest ad-hoc analytics	✅ Best ($1.10/month)	Deploy DuckDB Lambda
API-driven analytics (behind API GW)	✅ Best (sub-second warm)	Deploy with API Gateway
IoT / edge data quick analysis	✅ Good fit	Deploy DuckDB Lambda
Need governance / audit trail	❌ No built-in governance	Use Athena + Lake Formation
Dataset > 10 GB	❌ Lambda memory limit	Use EMR Serverless or Athena
Need JOINs with DWH tables	❌ Isolated engine	Use Redshift Spectrum
Need catalog integration	❌ No Glue/catalog support	Use Athena or Redshift
Write-back (small files)	✅ COPY TO works	DuckDB Lambda for small writes

Discovery Questions for Partners

When a customer asks about DuckDB Lambda + FSx for ONTAP S3 Access Points:

What is the typical dataset size per query? (DuckDB Lambda works best < 1 GB)
How many concurrent queries are expected? (Lambda scales horizontally but each invocation is isolated)
Is governance / audit trail required? (DuckDB has none — consider Athena + Lake Formation)
Is the workload ad-hoc or scheduled? (Lambda excels at sporadic, event-driven queries)
Does the team need SQL JOINs with other data sources? (DuckDB is isolated — no cross-source JOINs)
Is sub-second latency required? (Warm DuckDB Lambda delivers; cold start adds ~1.9s)
Is there an existing analytics platform? (If yes, DuckDB Lambda may be redundant)
What is the budget tolerance? (DuckDB Lambda is the cheapest option in this series)

Governance Impact

Capability	DuckDB Lambda	Notes
Authentication	IAM (Lambda execution role)	Standard AWS IAM
Authorization	S3 AP policy + IAM	No table/column-level control
Audit trail	CloudWatch Logs + CloudTrail	Query text logged if configured
Data classification	❌ None	No tagging or masking
Row/column security	❌ None	All-or-nothing file access
Catalog integration	❌ None	No Glue, no schema registry

Governance model: DuckDB Lambda relies entirely on IAM + S3 AP policy for access control. There is no built-in governance layer. For regulated workloads requiring table-level access control, column masking, or audit trails, use Athena + Lake Formation (Part 1 + Part 6) or Snowflake External Tables (Part 3).

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
DuckDB Lambda	★☆☆☆☆	★★★★☆	★☆☆☆☆	★★★★★	★★★★★	3.2
Athena + Lake Formation	★★★★★	★★★☆☆	★★☆☆☆	★★★★☆	★★★★☆	3.6
Snowflake External Table	★★★★☆	★★☆☆☆	★★★★☆	★★★☆☆	★★★★☆	3.4
EMR Serverless Spark	★★☆☆☆	★★★★☆	★★★☆☆	★★★☆☆	★★★☆☆	3.0

Governance: Access control, audit, classification capabilities
Performance: Query latency for typical workloads
AI Capability: Built-in ML/AI integration
Cost: Total cost of ownership for low-frequency workloads
Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. This is not an official AWS assessment. DuckDB Lambda scores highest on Cost and Simplicity but lowest on Governance and AI — it's the "quick and cheap" option, not the "governed enterprise" option.

Cost Analysis

Component	Monthly Cost (1000 queries/day)
Lambda invocations	~$0.60
Lambda compute (1024 MB × 1s avg)	~$0.50
FSx for ONTAP (128 MB/s, existing)	$0 incremental
S3 AP requests	$0 (included in FSx)
Total	~$1.10/month

Compare with:

Redshift Serverless (8 RPU): ~$2.88/hour when active
Athena: $5/TB scanned (but no idle cost)
EMR Serverless: ~$0.05/job (but 20s cold start per job)

DuckDB Lambda is the cheapest option for ad-hoc, low-frequency analytics on FSxN data.

When to Use (and When Not To)

Use DuckDB Lambda when:

Ad-hoc queries on < 1 GB datasets
API-driven analytics (behind API Gateway)
Cost is the primary concern
No existing analytics infrastructure
Edge/IoT data analysis
Databricks customer waiting for UC + FSx for ONTAP S3 AP support: Quick NAS data validation without spinning up a Databricks cluster or copying data to S3. Use as a lightweight bridge until Databricks UC natively supports FSx for ONTAP S3 Access Points.

Don't use when:

Datasets > 10 GB (Lambda memory/timeout limits)
High concurrency (> 100 concurrent queries)
Need JOINs with DWH tables (use Redshift Spectrum)
Need governance/catalog integration (use Athena + Lake Formation)
Need Delta/Iceberg table format (not supported on FSxN S3 AP)

Known Failure Signatures

Symptom	Likely cause	Next step
`Can't find the home directory`	Missing `SET home_directory = '/tmp'`	Add to handler initialization
`Unknown error for HTTP HEAD`	Missing path-style or endpoint config	Set `s3_url_style = 'path'` and `s3_endpoint`
`HTTP 403 Forbidden`	IAM role missing S3 AP permissions	Add `s3:GetObject` on AP ARN to execution role
Timeout (15s Lambda limit)	Dataset too large for Lambda memory	Increase memory or use EMR Serverless
`Out of Memory`	Dataset exceeds Lambda memory	Reduce query scope or increase to 10 GB memory
Cold start > 3s	Layer too large or extension install slow	Pre-install httpfs in layer (avoid runtime INSTALL)

Local Development

You can also run DuckDB locally without Lambda:

import duckdb, boto3

session = boto3.Session(region_name='ap-northeast-1')
creds = session.get_credentials().get_frozen_credentials()

conn = duckdb.connect(':memory:')
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute(f"SET s3_region = 'ap-northeast-1';")
conn.execute(f"SET s3_access_key_id = '{creds.access_key}';")
conn.execute(f"SET s3_secret_access_key = '{creds.secret_key}';")
conn.execute(f"SET s3_session_token = '{creds.token}';")
conn.execute("SET s3_url_style = 'path';")

result = conn.execute("""
    SELECT status, COUNT(*), AVG(temperature)
    FROM read_parquet('s3://<your-ap-alias>/sensor-data/sensor_data.parquet')
    GROUP BY status
""").fetchall()
print(result)

What's Next

Part 5: EMR Spark — Read-Write ETL on NAS Data — for teams that need distributed Spark processing with write-back capability and larger-than-memory datasets
Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance on NAS data
Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx for ONTAP S3 AP, and what works instead

Future exploration: DuckDB's iceberg extension may enable reading pre-existing Iceberg tables on FSx for ONTAP S3 AP (metadata + data files accessed via GetObject). Not validated in this article.

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables

References

Key achievement: This validation established that DuckDB Lambda is the lowest-cost analytics path for FSx for ONTAP S3 AP data — $0.00001/query with 452ms warm latency. Zero infrastructure, zero idle cost, and full read-write capability on Parquet files. The trade-off is zero governance — for regulated workloads, pair with Athena + Lake Formation or use Snowflake External Tables.

All benchmarks are from a specific test environment (FSx for ONTAP Single-AZ, 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads. Full evidence: verification-pack/duckdb-local/

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

DEV Community