DEV Community

Cover image for Serverless Analytics on NAS Data for $0.00001/Query — DuckDB Lambda FSx for ONTAP

Serverless Analytics on NAS Data for $0.00001/Query — DuckDB Lambda FSx for ONTAP

TL;DR

In Part 1, Athena worked cleanly. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with AWS_ACCESS_POINT_ARN config. This Part 4 shows the cheapest path: $0.00001/query.

You can deploy DuckDB inside a Lambda function (arm64, 1024 MB) and query Parquet files on FSx for ONTAP via S3 Access Points. Warm queries return in 452ms for 10,000 rows. Cold start is ~1.9s. Cost per query: approximately $0.00001.

No database server. No cluster. No idle cost. Just a Lambda function with an 18 MB layer.

Quick Decision Guide:

  • Cheapest ad-hoc analytics on NAS data → DuckDB Lambda ($1.10/month for 1000 queries/day)
  • Need governance / catalog → Use Athena + Lake Formation instead
  • Need distributed processing > 10 GB → Use EMR Serverless (Part 5)

GitHub: fsxn-lakehouse-integrations/integrations/duckdb/


How to Read This Article

This article is:

  • A reproduction-focused validation report
  • Evidence from one environment (Lambda arm64, 1024 MB, ap-northeast-1)
  • A deployment guide for DuckDB + Lambda + FSx for ONTAP S3 AP

Read by role:

  • Developer / Data engineer: Architecture → Deploy in 5 Minutes → Handler config
  • Cost-conscious team lead: Cost Analysis → When to Use
  • Partner / SA: Partner Decision Card → Discovery Questions
  • Security reviewer: Governance Impact → When Not To Use

Prerequisite Concepts

Before reading this article, it helps to understand:

  • DuckDB — an in-process SQL engine (like SQLite for analytics) that runs inside your application
  • httpfs extension — DuckDB's HTTP/S3 file system extension for reading remote Parquet files
  • S3 Access Point alias — the *-ext-s3alias hostname that FSx for ONTAP S3 AP exposes
  • Lambda arm64 (Graviton2) — AWS Lambda on ARM architecture, ~20% cheaper than x86
  • Path-style S3 access — required for S3 AP aliases (s3_url_style = 'path')

Why DuckDB + Lambda + FSx for ONTAP?

Traditional approach This approach
Provision Redshift/EMR cluster Deploy Lambda function
Pay for idle compute Pay only when queried
Copy data from NAS to S3 Query NAS data in place
Manage infrastructure Zero infrastructure
Minutes to first query Sub-second (warm)

DuckDB is an in-process SQL engine — think "SQLite for analytics." It runs inside your Lambda function, reads Parquet directly from S3 via the httpfs extension, and returns results. No external database, no connection pooling, no cluster scaling.


Architecture

Client (API Gateway / SDK / CLI)
    │
    ▼
Lambda Function (arm64, Python 3.12, 1024 MB)
    │
    ├── DuckDB (in-process, 18 MB layer)
    │   └── httpfs extension (S3 access)
    │
    └── S3 Access Point (internet-origin)
            │
            └── FSx for ONTAP Volume
                ├── sensor_data.parquet (10K rows)
                └── sensor_data_large.parquet (5M rows)
Enter fullscreen mode Exit fullscreen mode

No VPC attachment needed (internet-origin AP). This avoids the ~1-2s ENI cold start penalty.

Lambda function overview — arm64, Python 3.12, DuckDB layer

Lambda function: arm64 (Graviton2), Python 3.12, 1024 MB memory, DuckDB httpfs layer attached.


Benchmark Results

Test Latency Notes
Cold start (simple query) 1,854 ms httpfs INSTALL + credential setup
Warm (simple query) 0.96 ms Connection reused
Warm COUNT(*) 10K rows 452 ms S3 AP → Parquet → result
Warm GROUP BY 10K rows 1,411 ms Full scan + aggregation
Local COUNT(*) 5M rows 779 ms For comparison (over internet)
Local COPY TO Parquet 304 ms Write-back to FSxN

Environment: Lambda arm64, 1024 MB, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Storage context (Yakio-san lens): The 10K-row Parquet file is ~250 KB. At 128 MBps provisioned throughput, a single query consumes negligible bandwidth (<0.2% of capacity). NFS/SMB workloads sharing the same file system are not impacted. For concurrent Lambda invocations, throughput becomes a factor at ~500+ simultaneous queries reading large files.

Key insight: Lambda warm queries (452ms) are faster than local queries (628ms) because Lambda runs in the same region as FSxN — lower network latency.

Lambda test invocation success — JSON result with metrics

Lambda test invocation: cold start 2071ms, query returns 3 status groups (normal: 8505, warning: 1221, critical: 274). Memory used: 185 MB.


Evidence Matrix

Layer Evidence Result Interpretation
Lambda deployment CloudFormation stack ✅ Pass Function + Layer deployed
DuckDB initialization httpfs INSTALL + LOAD ✅ Pass Extension loads in ~1.8s cold
S3 AP connectivity read_parquet() via httpfs ✅ Pass Path-style + endpoint config works
Read (small) COUNT(*) 10K rows ✅ Pass 452ms warm
Read (aggregation) GROUP BY + AVG ✅ Pass 1,411ms warm
Read (large) COUNT(*) 5M rows ✅ Pass 779ms (local test)
Write-back COPY TO Parquet ✅ Pass 304ms write to S3 AP
IAM authorization Lambda execution role ✅ Pass s3:GetObject/PutObject on AP ARN

Deploy in 5 Minutes

1. Build the Layer (Docker required)

cd integrations/duckdb
docker run --rm --platform linux/arm64 --entrypoint bash \
  -v "$(pwd)/dist:/output" \
  public.ecr.aws/lambda/python:3.12-arm64 \
  -c "dnf install -y zip > /dev/null 2>&1 && \
      pip install duckdb==1.1.3 --target /tmp/python/lib/python3.12/site-packages/ --quiet && \
      cd /tmp && zip -qr /output/duckdb-layer.zip python/"
Enter fullscreen mode Exit fullscreen mode

2. Deploy

./deploy.sh --region ap-northeast-1
Enter fullscreen mode Exit fullscreen mode

This builds the layer, uploads to S3, deploys CloudFormation (Lambda + IAM + Layer), and runs a test invocation.

3. Query

aws lambda invoke \
  --function-name fsxn-duckdb-query \
  --payload '{"query": "SELECT status, COUNT(*) FROM read_parquet('\''s3://{S3_AP}/sensor-data/sensor_data.parquet'\'') GROUP BY status"}' \
  --cli-binary-format raw-in-base64-out \
  response.json && cat response.json | jq .
Enter fullscreen mode Exit fullscreen mode

The {S3_AP} placeholder is automatically replaced with your S3 AP alias from the Lambda environment variable.


The Handler (Key Configuration)

Three settings are critical for DuckDB + S3 AP in Lambda:

# 1. Lambda has no home directory
conn.execute("SET home_directory = '/tmp';")

# 2. S3 AP aliases require path-style access
conn.execute("SET s3_url_style = 'path';")

# 3. Explicit endpoint for AP alias resolution
conn.execute("SET s3_endpoint = 's3.ap-northeast-1.amazonaws.com';")
Enter fullscreen mode Exit fullscreen mode

Without these, you'll get:

  • Can't find the home directory (missing #1)
  • Unknown error for HTTP HEAD (missing #2 or #3)

Full handler: lambda/handler.py


Comparison with Other Engines in This Series

Aspect DuckDB Lambda Athena (Part 1) Snowflake (Part 3) EMR Spark (Part 5)
Query latency (10K rows) 452ms (warm) ~2s ~3s 6.78s
Cold start 1.9s ~2s N/A (warehouse) 20s
Cost per query $0.00001 $0.005/TB Credits $0.05/job
Write-back to FSxN ✅ COPY TO ✅ CTAS ⚠️ TBD ✅ Spark write
Governance / Catalog ❌ None ✅ Glue + LF ✅ Tags + RBAC ⚠️ IAM only
Max dataset size ~1 GB (Lambda limit) Unlimited Unlimited Unlimited
Distributed processing ❌ Single process

Partner Decision Card

Customer requirement DuckDB Lambda today Recommended path
Cheapest ad-hoc analytics ✅ Best ($1.10/month) Deploy DuckDB Lambda
API-driven analytics (behind API GW) ✅ Best (sub-second warm) Deploy with API Gateway
IoT / edge data quick analysis ✅ Good fit Deploy DuckDB Lambda
Need governance / audit trail ❌ No built-in governance Use Athena + Lake Formation
Dataset > 10 GB ❌ Lambda memory limit Use EMR Serverless or Athena
Need JOINs with DWH tables ❌ Isolated engine Use Redshift Spectrum
Need catalog integration ❌ No Glue/catalog support Use Athena or Redshift
Write-back (small files) ✅ COPY TO works DuckDB Lambda for small writes

Discovery Questions for Partners

When a customer asks about DuckDB Lambda + FSx for ONTAP S3 Access Points:

  1. What is the typical dataset size per query? (DuckDB Lambda works best < 1 GB)
  2. How many concurrent queries are expected? (Lambda scales horizontally but each invocation is isolated)
  3. Is governance / audit trail required? (DuckDB has none — consider Athena + Lake Formation)
  4. Is the workload ad-hoc or scheduled? (Lambda excels at sporadic, event-driven queries)
  5. Does the team need SQL JOINs with other data sources? (DuckDB is isolated — no cross-source JOINs)
  6. Is sub-second latency required? (Warm DuckDB Lambda delivers; cold start adds ~1.9s)
  7. Is there an existing analytics platform? (If yes, DuckDB Lambda may be redundant)
  8. What is the budget tolerance? (DuckDB Lambda is the cheapest option in this series)

Governance Impact

Capability DuckDB Lambda Notes
Authentication IAM (Lambda execution role) Standard AWS IAM
Authorization S3 AP policy + IAM No table/column-level control
Audit trail CloudWatch Logs + CloudTrail Query text logged if configured
Data classification ❌ None No tagging or masking
Row/column security ❌ None All-or-nothing file access
Catalog integration ❌ None No Glue, no schema registry

Governance model: DuckDB Lambda relies entirely on IAM + S3 AP policy for access control. There is no built-in governance layer. For regulated workloads requiring table-level access control, column masking, or audit trails, use Athena + Lake Formation (Part 1 + Part 6) or Snowflake External Tables (Part 3).


AI Readiness Score

Pattern Governance Performance AI Capability Cost Operational Simplicity Overall
DuckDB Lambda ★☆☆☆☆ ★★★★☆ ★☆☆☆☆ ★★★★★ ★★★★★ 3.2
Athena + Lake Formation ★★★★★ ★★★☆☆ ★★☆☆☆ ★★★★☆ ★★★★☆ 3.6
Snowflake External Table ★★★★☆ ★★☆☆☆ ★★★★☆ ★★★☆☆ ★★★★☆ 3.4
EMR Serverless Spark ★★☆☆☆ ★★★★☆ ★★★☆☆ ★★★☆☆ ★★★☆☆ 3.0
  • Governance: Access control, audit, classification capabilities
  • Performance: Query latency for typical workloads
  • AI Capability: Built-in ML/AI integration
  • Cost: Total cost of ownership for low-frequency workloads
  • Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. This is not an official AWS assessment. DuckDB Lambda scores highest on Cost and Simplicity but lowest on Governance and AI — it's the "quick and cheap" option, not the "governed enterprise" option.


Cost Analysis

Component Monthly Cost (1000 queries/day)
Lambda invocations ~$0.60
Lambda compute (1024 MB × 1s avg) ~$0.50
FSx for ONTAP (128 MB/s, existing) $0 incremental
S3 AP requests $0 (included in FSx)
Total ~$1.10/month

Compare with:

  • Redshift Serverless (8 RPU): ~$2.88/hour when active
  • Athena: $5/TB scanned (but no idle cost)
  • EMR Serverless: ~$0.05/job (but 20s cold start per job)

DuckDB Lambda is the cheapest option for ad-hoc, low-frequency analytics on FSxN data.


When to Use (and When Not To)

Use DuckDB Lambda when:

  • Ad-hoc queries on < 1 GB datasets
  • API-driven analytics (behind API Gateway)
  • Cost is the primary concern
  • No existing analytics infrastructure
  • Edge/IoT data analysis
  • Databricks customer waiting for UC + FSx for ONTAP S3 AP support: Quick NAS data validation without spinning up a Databricks cluster or copying data to S3. Use as a lightweight bridge until Databricks UC natively supports FSx for ONTAP S3 Access Points.

Don't use when:

  • Datasets > 10 GB (Lambda memory/timeout limits)
  • High concurrency (> 100 concurrent queries)
  • Need JOINs with DWH tables (use Redshift Spectrum)
  • Need governance/catalog integration (use Athena + Lake Formation)
  • Need Delta/Iceberg table format (not supported on FSxN S3 AP)

Known Failure Signatures

Symptom Likely cause Next step
Can't find the home directory Missing SET home_directory = '/tmp' Add to handler initialization
Unknown error for HTTP HEAD Missing path-style or endpoint config Set s3_url_style = 'path' and s3_endpoint
HTTP 403 Forbidden IAM role missing S3 AP permissions Add s3:GetObject on AP ARN to execution role
Timeout (15s Lambda limit) Dataset too large for Lambda memory Increase memory or use EMR Serverless
Out of Memory Dataset exceeds Lambda memory Reduce query scope or increase to 10 GB memory
Cold start > 3s Layer too large or extension install slow Pre-install httpfs in layer (avoid runtime INSTALL)

Local Development

You can also run DuckDB locally without Lambda:

import duckdb, boto3

session = boto3.Session(region_name='ap-northeast-1')
creds = session.get_credentials().get_frozen_credentials()

conn = duckdb.connect(':memory:')
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute(f"SET s3_region = 'ap-northeast-1';")
conn.execute(f"SET s3_access_key_id = '{creds.access_key}';")
conn.execute(f"SET s3_secret_access_key = '{creds.secret_key}';")
conn.execute(f"SET s3_session_token = '{creds.token}';")
conn.execute("SET s3_url_style = 'path';")

result = conn.execute("""
    SELECT status, COUNT(*), AVG(temperature)
    FROM read_parquet('s3://<your-ap-alias>/sensor-data/sensor_data.parquet')
    GROUP BY status
""").fetchall()
print(result)
Enter fullscreen mode Exit fullscreen mode

What's Next

  • Part 5: EMR Spark — Read-Write ETL on NAS Data — for teams that need distributed Spark processing with write-back capability and larger-than-memory datasets
  • Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance on NAS data
  • Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx for ONTAP S3 AP, and what works instead

Future exploration: DuckDB's iceberg extension may enable reading pre-existing Iceberg tables on FSx for ONTAP S3 AP (metadata + data files accessed via GetObject). Not validated in this article.

Previously in this series:


References


Key achievement: This validation established that DuckDB Lambda is the lowest-cost analytics path for FSx for ONTAP S3 AP data — $0.00001/query with 452ms warm latency. Zero infrastructure, zero idle cost, and full read-write capability on Parquet files. The trade-off is zero governance — for regulated workloads, pair with Athena + Lake Formation or use Snowflake External Tables.

All benchmarks are from a specific test environment (FSx for ONTAP Single-AZ, 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads. Full evidence: verification-pack/duckdb-local/

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Top comments (0)