TL;DR
In Part 1, Athena worked cleanly. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with AWS_ACCESS_POINT_ARN config. This Part 4 shows the cheapest path: $0.00001/query.
You can deploy DuckDB inside a Lambda function (arm64, 1024 MB) and query Parquet files on FSx for ONTAP via S3 Access Points. Warm queries return in 452ms for 10,000 rows. Cold start is ~1.9s. Cost per query: approximately $0.00001.
No database server. No cluster. No idle cost. Just a Lambda function with an 18 MB layer.
Quick Decision Guide:
- Cheapest ad-hoc analytics on NAS data → DuckDB Lambda ($1.10/month for 1000 queries/day)
- Need governance / catalog → Use Athena + Lake Formation instead
- Need distributed processing > 10 GB → Use EMR Serverless (Part 5)
GitHub: fsxn-lakehouse-integrations/integrations/duckdb/
How to Read This Article
This article is:
- A reproduction-focused validation report
- Evidence from one environment (Lambda arm64, 1024 MB, ap-northeast-1)
- A deployment guide for DuckDB + Lambda + FSx for ONTAP S3 AP
Read by role:
- Developer / Data engineer: Architecture → Deploy in 5 Minutes → Handler config
- Cost-conscious team lead: Cost Analysis → When to Use
- Partner / SA: Partner Decision Card → Discovery Questions
- Security reviewer: Governance Impact → When Not To Use
Prerequisite Concepts
Before reading this article, it helps to understand:
- DuckDB — an in-process SQL engine (like SQLite for analytics) that runs inside your application
- httpfs extension — DuckDB's HTTP/S3 file system extension for reading remote Parquet files
-
S3 Access Point alias — the
*-ext-s3aliashostname that FSx for ONTAP S3 AP exposes - Lambda arm64 (Graviton2) — AWS Lambda on ARM architecture, ~20% cheaper than x86
-
Path-style S3 access — required for S3 AP aliases (
s3_url_style = 'path')
Why DuckDB + Lambda + FSx for ONTAP?
| Traditional approach | This approach |
|---|---|
| Provision Redshift/EMR cluster | Deploy Lambda function |
| Pay for idle compute | Pay only when queried |
| Copy data from NAS to S3 | Query NAS data in place |
| Manage infrastructure | Zero infrastructure |
| Minutes to first query | Sub-second (warm) |
DuckDB is an in-process SQL engine — think "SQLite for analytics." It runs inside your Lambda function, reads Parquet directly from S3 via the httpfs extension, and returns results. No external database, no connection pooling, no cluster scaling.
Architecture
Client (API Gateway / SDK / CLI)
│
▼
Lambda Function (arm64, Python 3.12, 1024 MB)
│
├── DuckDB (in-process, 18 MB layer)
│ └── httpfs extension (S3 access)
│
└── S3 Access Point (internet-origin)
│
└── FSx for ONTAP Volume
├── sensor_data.parquet (10K rows)
└── sensor_data_large.parquet (5M rows)
No VPC attachment needed (internet-origin AP). This avoids the ~1-2s ENI cold start penalty.
Lambda function: arm64 (Graviton2), Python 3.12, 1024 MB memory, DuckDB httpfs layer attached.
Benchmark Results
| Test | Latency | Notes |
|---|---|---|
| Cold start (simple query) | 1,854 ms | httpfs INSTALL + credential setup |
| Warm (simple query) | 0.96 ms | Connection reused |
| Warm COUNT(*) 10K rows | 452 ms | S3 AP → Parquet → result |
| Warm GROUP BY 10K rows | 1,411 ms | Full scan + aggregation |
| Local COUNT(*) 5M rows | 779 ms | For comparison (over internet) |
| Local COPY TO Parquet | 304 ms | Write-back to FSxN |
Environment: Lambda arm64, 1024 MB, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.
Storage context (Yakio-san lens): The 10K-row Parquet file is ~250 KB. At 128 MBps provisioned throughput, a single query consumes negligible bandwidth (<0.2% of capacity). NFS/SMB workloads sharing the same file system are not impacted. For concurrent Lambda invocations, throughput becomes a factor at ~500+ simultaneous queries reading large files.
Key insight: Lambda warm queries (452ms) are faster than local queries (628ms) because Lambda runs in the same region as FSxN — lower network latency.
Lambda test invocation: cold start 2071ms, query returns 3 status groups (normal: 8505, warning: 1221, critical: 274). Memory used: 185 MB.
Evidence Matrix
| Layer | Evidence | Result | Interpretation |
|---|---|---|---|
| Lambda deployment | CloudFormation stack | ✅ Pass | Function + Layer deployed |
| DuckDB initialization | httpfs INSTALL + LOAD | ✅ Pass | Extension loads in ~1.8s cold |
| S3 AP connectivity | read_parquet() via httpfs | ✅ Pass | Path-style + endpoint config works |
| Read (small) | COUNT(*) 10K rows | ✅ Pass | 452ms warm |
| Read (aggregation) | GROUP BY + AVG | ✅ Pass | 1,411ms warm |
| Read (large) | COUNT(*) 5M rows | ✅ Pass | 779ms (local test) |
| Write-back | COPY TO Parquet | ✅ Pass | 304ms write to S3 AP |
| IAM authorization | Lambda execution role | ✅ Pass | s3:GetObject/PutObject on AP ARN |
Deploy in 5 Minutes
1. Build the Layer (Docker required)
cd integrations/duckdb
docker run --rm --platform linux/arm64 --entrypoint bash \
-v "$(pwd)/dist:/output" \
public.ecr.aws/lambda/python:3.12-arm64 \
-c "dnf install -y zip > /dev/null 2>&1 && \
pip install duckdb==1.1.3 --target /tmp/python/lib/python3.12/site-packages/ --quiet && \
cd /tmp && zip -qr /output/duckdb-layer.zip python/"
2. Deploy
./deploy.sh --region ap-northeast-1
This builds the layer, uploads to S3, deploys CloudFormation (Lambda + IAM + Layer), and runs a test invocation.
3. Query
aws lambda invoke \
--function-name fsxn-duckdb-query \
--payload '{"query": "SELECT status, COUNT(*) FROM read_parquet('\''s3://{S3_AP}/sensor-data/sensor_data.parquet'\'') GROUP BY status"}' \
--cli-binary-format raw-in-base64-out \
response.json && cat response.json | jq .
The {S3_AP} placeholder is automatically replaced with your S3 AP alias from the Lambda environment variable.
The Handler (Key Configuration)
Three settings are critical for DuckDB + S3 AP in Lambda:
# 1. Lambda has no home directory
conn.execute("SET home_directory = '/tmp';")
# 2. S3 AP aliases require path-style access
conn.execute("SET s3_url_style = 'path';")
# 3. Explicit endpoint for AP alias resolution
conn.execute("SET s3_endpoint = 's3.ap-northeast-1.amazonaws.com';")
Without these, you'll get:
-
Can't find the home directory(missing #1) -
Unknown error for HTTP HEAD(missing #2 or #3)
Full handler: lambda/handler.py
Comparison with Other Engines in This Series
| Aspect | DuckDB Lambda | Athena (Part 1) | Snowflake (Part 3) | EMR Spark (Part 5) |
|---|---|---|---|---|
| Query latency (10K rows) | 452ms (warm) | ~2s | ~3s | 6.78s |
| Cold start | 1.9s | ~2s | N/A (warehouse) | 20s |
| Cost per query | $0.00001 | $0.005/TB | Credits | $0.05/job |
| Write-back to FSxN | ✅ COPY TO | ✅ CTAS | ⚠️ TBD | ✅ Spark write |
| Governance / Catalog | ❌ None | ✅ Glue + LF | ✅ Tags + RBAC | ⚠️ IAM only |
| Max dataset size | ~1 GB (Lambda limit) | Unlimited | Unlimited | Unlimited |
| Distributed processing | ❌ Single process | ✅ | ✅ | ✅ |
Partner Decision Card
| Customer requirement | DuckDB Lambda today | Recommended path |
|---|---|---|
| Cheapest ad-hoc analytics | ✅ Best ($1.10/month) | Deploy DuckDB Lambda |
| API-driven analytics (behind API GW) | ✅ Best (sub-second warm) | Deploy with API Gateway |
| IoT / edge data quick analysis | ✅ Good fit | Deploy DuckDB Lambda |
| Need governance / audit trail | ❌ No built-in governance | Use Athena + Lake Formation |
| Dataset > 10 GB | ❌ Lambda memory limit | Use EMR Serverless or Athena |
| Need JOINs with DWH tables | ❌ Isolated engine | Use Redshift Spectrum |
| Need catalog integration | ❌ No Glue/catalog support | Use Athena or Redshift |
| Write-back (small files) | ✅ COPY TO works | DuckDB Lambda for small writes |
Discovery Questions for Partners
When a customer asks about DuckDB Lambda + FSx for ONTAP S3 Access Points:
- What is the typical dataset size per query? (DuckDB Lambda works best < 1 GB)
- How many concurrent queries are expected? (Lambda scales horizontally but each invocation is isolated)
- Is governance / audit trail required? (DuckDB has none — consider Athena + Lake Formation)
- Is the workload ad-hoc or scheduled? (Lambda excels at sporadic, event-driven queries)
- Does the team need SQL JOINs with other data sources? (DuckDB is isolated — no cross-source JOINs)
- Is sub-second latency required? (Warm DuckDB Lambda delivers; cold start adds ~1.9s)
- Is there an existing analytics platform? (If yes, DuckDB Lambda may be redundant)
- What is the budget tolerance? (DuckDB Lambda is the cheapest option in this series)
Governance Impact
| Capability | DuckDB Lambda | Notes |
|---|---|---|
| Authentication | IAM (Lambda execution role) | Standard AWS IAM |
| Authorization | S3 AP policy + IAM | No table/column-level control |
| Audit trail | CloudWatch Logs + CloudTrail | Query text logged if configured |
| Data classification | ❌ None | No tagging or masking |
| Row/column security | ❌ None | All-or-nothing file access |
| Catalog integration | ❌ None | No Glue, no schema registry |
Governance model: DuckDB Lambda relies entirely on IAM + S3 AP policy for access control. There is no built-in governance layer. For regulated workloads requiring table-level access control, column masking, or audit trails, use Athena + Lake Formation (Part 1 + Part 6) or Snowflake External Tables (Part 3).
AI Readiness Score
| Pattern | Governance | Performance | AI Capability | Cost | Operational Simplicity | Overall |
|---|---|---|---|---|---|---|
| DuckDB Lambda | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ | 3.2 |
| Athena + Lake Formation | ★★★★★ | ★★★☆☆ | ★★☆☆☆ | ★★★★☆ | ★★★★☆ | 3.6 |
| Snowflake External Table | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ | 3.4 |
| EMR Serverless Spark | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | 3.0 |
- Governance: Access control, audit, classification capabilities
- Performance: Query latency for typical workloads
- AI Capability: Built-in ML/AI integration
- Cost: Total cost of ownership for low-frequency workloads
- Operational Simplicity: Setup and maintenance effort
Scoring methodology: Each dimension rated by the author based on validated evidence. This is not an official AWS assessment. DuckDB Lambda scores highest on Cost and Simplicity but lowest on Governance and AI — it's the "quick and cheap" option, not the "governed enterprise" option.
Cost Analysis
| Component | Monthly Cost (1000 queries/day) |
|---|---|
| Lambda invocations | ~$0.60 |
| Lambda compute (1024 MB × 1s avg) | ~$0.50 |
| FSx for ONTAP (128 MB/s, existing) | $0 incremental |
| S3 AP requests | $0 (included in FSx) |
| Total | ~$1.10/month |
Compare with:
- Redshift Serverless (8 RPU): ~$2.88/hour when active
- Athena: $5/TB scanned (but no idle cost)
- EMR Serverless: ~$0.05/job (but 20s cold start per job)
DuckDB Lambda is the cheapest option for ad-hoc, low-frequency analytics on FSxN data.
When to Use (and When Not To)
Use DuckDB Lambda when:
- Ad-hoc queries on < 1 GB datasets
- API-driven analytics (behind API Gateway)
- Cost is the primary concern
- No existing analytics infrastructure
- Edge/IoT data analysis
- Databricks customer waiting for UC + FSx for ONTAP S3 AP support: Quick NAS data validation without spinning up a Databricks cluster or copying data to S3. Use as a lightweight bridge until Databricks UC natively supports FSx for ONTAP S3 Access Points.
Don't use when:
- Datasets > 10 GB (Lambda memory/timeout limits)
- High concurrency (> 100 concurrent queries)
- Need JOINs with DWH tables (use Redshift Spectrum)
- Need governance/catalog integration (use Athena + Lake Formation)
- Need Delta/Iceberg table format (not supported on FSxN S3 AP)
Known Failure Signatures
| Symptom | Likely cause | Next step |
|---|---|---|
Can't find the home directory |
Missing SET home_directory = '/tmp'
|
Add to handler initialization |
Unknown error for HTTP HEAD |
Missing path-style or endpoint config | Set s3_url_style = 'path' and s3_endpoint
|
HTTP 403 Forbidden |
IAM role missing S3 AP permissions | Add s3:GetObject on AP ARN to execution role |
| Timeout (15s Lambda limit) | Dataset too large for Lambda memory | Increase memory or use EMR Serverless |
Out of Memory |
Dataset exceeds Lambda memory | Reduce query scope or increase to 10 GB memory |
| Cold start > 3s | Layer too large or extension install slow | Pre-install httpfs in layer (avoid runtime INSTALL) |
Local Development
You can also run DuckDB locally without Lambda:
import duckdb, boto3
session = boto3.Session(region_name='ap-northeast-1')
creds = session.get_credentials().get_frozen_credentials()
conn = duckdb.connect(':memory:')
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute(f"SET s3_region = 'ap-northeast-1';")
conn.execute(f"SET s3_access_key_id = '{creds.access_key}';")
conn.execute(f"SET s3_secret_access_key = '{creds.secret_key}';")
conn.execute(f"SET s3_session_token = '{creds.token}';")
conn.execute("SET s3_url_style = 'path';")
result = conn.execute("""
SELECT status, COUNT(*), AVG(temperature)
FROM read_parquet('s3://<your-ap-alias>/sensor-data/sensor_data.parquet')
GROUP BY status
""").fetchall()
print(result)
What's Next
- Part 5: EMR Spark — Read-Write ETL on NAS Data — for teams that need distributed Spark processing with write-back capability and larger-than-memory datasets
- Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance on NAS data
- Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx for ONTAP S3 AP, and what works instead
Future exploration: DuckDB's
icebergextension may enable reading pre-existing Iceberg tables on FSx for ONTAP S3 AP (metadata + data files accessed via GetObject). Not validated in this article.
Previously in this series:
- Part 1: Athena — Query NAS Data In Place
- Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
- Part 3: Snowflake — From 'Access Denied' to Working External Tables
References
- DuckDB documentation
- DuckDB httpfs extension
- AWS Lambda arm64 (Graviton2)
- FSx for ONTAP S3 Access Points
- GitHub: fsxn-lakehouse-integrations
Key achievement: This validation established that DuckDB Lambda is the lowest-cost analytics path for FSx for ONTAP S3 AP data — $0.00001/query with 452ms warm latency. Zero infrastructure, zero idle cost, and full read-write capability on Parquet files. The trade-off is zero governance — for regulated workloads, pair with Athena + Lake Formation or use Snowflake External Tables.
All benchmarks are from a specific test environment (FSx for ONTAP Single-AZ, 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads. Full evidence: verification-pack/duckdb-local/
Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.


Top comments (0)