Original Japanese article: AWS Lambdaを10GBにすると本当に速くなるのか?(AWS Lambda×chDB/DuckDB×PyIceberg検証)
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
In a previous article, I benchmarked Iceberg integration using AWS Lambda with DuckDB and chDB.
Lightweight ETL with AWS Lambda, chDB, and PyIceberg (Compared with DuckDB)
In that article, I tested two patterns on AWS Lambda:
- chDB × PyIceberg
- DuckDB × PyIceberg
Memory sizes were set to 1024 MB, 2048 MB, and 3008 MB (the maximum available without a quota increase at the time).
The results showed:
- For small datasets, increasing memory generally improved performance.
- For a large dataset (807 MB), 3008 MB was barely enough to complete processing.
This time, I extended the experiment:
What happens if we increase Lambda memory up to 10GB (10240 MB)?
Increasing the Lambda Memory Quota
To raise the Lambda memory limit beyond 3008 MB, you must request a quota increase.
Important:
You cannot increase Lambda memory from the Service Quotas console.
Steps
- Go to the AWS Support Center and create a new case.
- Clearly state:
- The reason for the increase
- The target region
Example request content:
We are building and validating a data processing platform using AWS Lambda.
The workload is memory-intensive, including large Parquet file loading, aggregation, and transformation.
The current 3008 MB limit is insufficient to complete processing. We are performing analytical processing inside Lambda using columnar formats (Parquet), and the workload requires higher memory allocation.
Currently, we experience performance degradation and OutOfMemory errors.
We would like to request an increase of the Lambda memory limit in the Tokyo region to 10240 MB.
Although we considered migrating to other compute services, we determined that continuing with Lambda is the most appropriate option from both operational and architectural perspectives.
After submission, AWS responded in about 3 business days and applied the increase.
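Once the quota increase is applied, the function still has to be reconfigured to use the new size. The following is a minimal sketch of doing that with boto3's `update_function_configuration`; the function name `iceberg-etl-chdb` is a hypothetical placeholder, and the 128 to 10240 MB range check reflects Lambda's documented memory limits.

```python
LAMBDA_MIN_MEMORY_MB = 128
LAMBDA_MAX_MEMORY_MB = 10240


def build_memory_update(function_name: str, memory_mb: int) -> dict:
    """Build the keyword arguments for update_function_configuration."""
    if not LAMBDA_MIN_MEMORY_MB <= memory_mb <= LAMBDA_MAX_MEMORY_MB:
        raise ValueError(
            f"memory_mb must be between {LAMBDA_MIN_MEMORY_MB} "
            f"and {LAMBDA_MAX_MEMORY_MB}, got {memory_mb}"
        )
    return {"FunctionName": function_name, "MemorySize": memory_mb}


def apply_memory_update(function_name: str, memory_mb: int) -> dict:
    """Apply the new memory size (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_memory_update works without boto3

    client = boto3.client("lambda")
    return client.update_function_configuration(
        **build_memory_update(function_name, memory_mb)
    )


# Hypothetical function name; adjust to your environment.
print(build_memory_update("iceberg-etl-chdb", 10240))
```

If the quota increase has not been applied yet in the target region, I would expect the API call to reject values above 3008 MB, so this is also a quick way to confirm that the increase took effect.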
Architecture
The architecture is identical to the previous article.
Flow:
- Load a Parquet file from S3 in Lambda
- Process it using chDB or DuckDB
- Write results into an Iceberg table
In short:
S3 → Lambda (chDB/DuckDB) → Iceberg (via Glue Catalog)
In this article, I focus on performance behavior differences.
Iceberg version conflicts and concurrency handling are omitted for simplicity.
Sample Code (chDB)
import chdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog


def _to_pyarrow_table(result):
    """
    Compatibility helper to extract a pyarrow.Table from a chDB query_result.
    """
    if hasattr(chdb, "to_arrowTable"):
        return chdb.to_arrowTable(result)
    if hasattr(result, "to_pyarrow"):
        return result.to_pyarrow()
    if hasattr(result, "to_arrow"):
        return result.to_arrow()
    raise RuntimeError(
        "Cannot convert chdb query_result to pyarrow.Table. "
        f"Available attributes: {sorted(dir(result))[:200]}"
    )


def normalize_arrow_for_iceberg(table: pa.Table) -> pa.Table:
    """
    Normalize Arrow types that Iceberg does not accept
    (mainly timezone-aware timestamps).
    """
    new_fields = []
    new_columns = []
    for field, column in zip(table.schema, table.columns):
        if pa.types.is_timestamp(field.type) and field.type.tz is not None:
            # Remove timezone information (values remain in UTC)
            new_type = pa.timestamp(field.type.unit)
            new_fields.append(pa.field(field.name, new_type, field.nullable))
            new_columns.append(column.cast(new_type))
        else:
            new_fields.append(field)
            new_columns.append(column)
    new_schema = pa.schema(new_fields)
    return pa.Table.from_arrays(new_columns, schema=new_schema)


def lambda_handler(event, context):
    try:
        # Extract S3 bucket and object key from the event
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        # Build S3 HTTPS URL
        s3_url = (
            f"https://{s3_bucket}."
            f"s3.ap-northeast-1.amazonaws.com/"
            f"{s3_object_key}"
        )
        print(f"s3_url: {s3_url}")
        # Query Parquet data on S3 using chDB
        query = f"""
            SELECT *
            FROM s3('{s3_url}', 'Parquet')
            WHERE VendorID = 1
        """
        # Execute chDB query with Arrow output
        result = chdb.query(query, "Arrow")
        # Convert chDB result to pyarrow.Table
        arrow_table = _to_pyarrow_table(result)
        print(f"Original schema: {arrow_table.schema}")
        # Normalize schema for Iceberg compatibility
        arrow_table = normalize_arrow_for_iceberg(arrow_table)
        print(f"Normalized schema: {arrow_table.schema}")
        print(f"Rows: {arrow_table.num_rows}")
        # Initialize Iceberg Glue Catalog
        catalog = GlueCatalog(
            name="my_catalog",
            database="icebergdb",
            region_name="ap-northeast-1",
        )
        # Load Iceberg table
        iceberg_table = catalog.load_table("icebergdb.yellow_tripdata")
        # Append data to Iceberg table
        iceberg_table.append(arrow_table)
        print("Data appended to Iceberg table.")
    except Exception as e:
        print("Exception:", e)
        raise
Note: This article focuses on the differences in operation, so version updates or conflict handling in Iceberg tables are omitted.
Sample Code (DuckDB)
import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog


def lambda_handler(event, context):
    try:
        # Connect to DuckDB and set the home directory
        # (/tmp is the writable location in Lambda)
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        # Install and load the httpfs extension
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")
        # Load data from S3 using DuckDB
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"
        print(f"s3_input_path: {s3_input_path}")
        query = f"""
            SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1
        """
        # Execute SQL and retrieve results as a PyArrow Table
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()
        print(f"Number of rows retrieved: {result_arrow_table.num_rows}")
        print(f"Data schema: {result_arrow_table.schema}")
        # Configure Glue Catalog (to access Iceberg table)
        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")  # Adjust to your environment.
        # Load the table
        namespace = "icebergdb"  # Adjust to your environment.
        table_name = "yellow_tripdata"  # Adjust to your environment.
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")
        # Append data to the Iceberg table in bulk
        iceberg_table.append(result_arrow_table)
        print("Data has been appended to S3 in Iceberg format.")
    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so the invocation is reported as failed, matching the chDB handler
        raise
Note: Version updates and conflict handling are omitted here as well.
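One detail the sample handlers leave out: S3 event notifications URL-encode the object key, so a key containing spaces or parentheses will not match the real object name when it is used in an `s3://` path. The sketch below shows one way to decode it with the standard library; the helper name `extract_s3_location` and the bucket and key in the example are hypothetical.

```python
from urllib.parse import unquote_plus


def extract_s3_location(event: dict) -> tuple:
    """Return (bucket, decoded_key) from the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Keys arrive URL-encoded: '+' stands for a space, '%28' for '(' and so on.
    key = unquote_plus(record["object"]["key"])
    return bucket, key


# Hypothetical event for illustration only.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "raw/yellow+tripdata%282024%29.parquet"},
            }
        }
    ]
}
bucket, key = extract_s3_location(sample_event)
print(f"s3://{bucket}/{key}")
```

The test files in this article have plain names, so the handlers work as written; the decoding only matters once keys contain characters that get encoded.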
Test Conditions
Dataset:
NYC Taxi Trip Records
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Test files:
- January 2024 (Yellow Taxi)
  - 2,964,624 records
  - 48 MB
- Full year 2024 (aggregated file)
  - 41,169,720 records
  - 807 MB
Memory configurations tested:
- 1024 MB
- 2048 MB
- 3008 MB (maximum without a quota increase)
- 4096 MB
- 10240 MB
Each configuration was executed 5 times under warm conditions to eliminate cold start effects.
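For reference, here is a minimal sketch of how duration and memory figures like these can be collected from the `REPORT` lines that Lambda writes to CloudWatch Logs. It assumes the log lines have already been fetched as strings, skips cold starts (lines containing `Init Duration`), and averages the warm runs. The averaging and the sample values are assumptions for illustration, not the exact procedure or data behind the tables in this article.

```python
import re
from statistics import mean

DURATION_RE = re.compile(r"REPORT RequestId: \S+\s+Duration: ([\d.]+) ms")
MAX_MEMORY_RE = re.compile(r"Max Memory Used: (\d+) MB")


def summarize_reports(log_lines):
    """Average warm-run duration and report peak memory from REPORT lines."""
    durations, memories = [], []
    for line in log_lines:
        duration = DURATION_RE.search(line)
        memory = MAX_MEMORY_RE.search(line)
        if not duration or not memory:
            continue  # not a REPORT line
        if "Init Duration:" in line:
            continue  # cold start: excluded from the warm-run summary
        durations.append(float(duration.group(1)))
        memories.append(int(memory.group(1)))
    if not durations:
        raise ValueError("no warm REPORT lines found")
    return {
        "runs": len(durations),
        "avg_duration_ms": mean(durations),
        "max_memory_used_mb": max(memories),
    }


# Illustrative log lines with made-up values.
sample_lines = [
    "REPORT RequestId: a1\tDuration: 5200.00 ms\tBilled Duration: 5200 ms\t"
    "Memory Size: 10240 MB\tMax Memory Used: 1240 MB\tInit Duration: 900.00 ms\t",
    "REPORT RequestId: a2\tDuration: 3000.00 ms\tBilled Duration: 3000 ms\t"
    "Memory Size: 10240 MB\tMax Memory Used: 1250 MB\t",
    "REPORT RequestId: a3\tDuration: 3100.00 ms\tBilled Duration: 3100 ms\t"
    "Memory Size: 10240 MB\tMax Memory Used: 1255 MB\t",
]
print(summarize_reports(sample_lines))
```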
Results
48MB File (1 Month)
| Memory (MB) | chDB Time (ms) | chDB Memory Used (MB) | DuckDB Time (ms) | DuckDB Memory Used (MB) |
|---|---|---|---|---|
| 1024 | 5,092 | 1018 | 5,163 | 512 |
| 2048 | 3,872 | 1132 | 4,264 | 538 |
| 3008 | 3,369 | 1115 | 4,061 | 524 |
| 4096 | 3,197 | 1263 | 3,547 | 568 |
| 10240 | 3,087 | 1255 | 3,484 | 554 |
Performance improved as memory increased — but only marginally above 4096 MB.
Actual memory usage:
- chDB ≈ 1.2 GB
- DuckDB ≈ 550 MB
Allocated memory increased, but real usage did not scale proportionally.
807MB File (1 Year)
| Memory (MB) | chDB Time (ms) | chDB Memory Used (MB) | DuckDB Time (ms) | DuckDB Memory Used (MB) |
|---|---|---|---|---|
| 1024 | OOM | - | OOM | - |
| 2048 | OOM | - | OOM | - |
| 3008 | 27,170 | 3001 | 187,331 | 2732 |
| 4096 | 24,631 | 3322 | 188,880 | 2767 |
| 10240 | 22,839 | 3490 | 189,678 | 2788 |
At 1024 MB and 2048 MB, both engines ran out of memory (OOM) because the allocation could not hold the temporary buffers required during the Parquet → Arrow → Iceberg transformation.
Did Performance Really Improve at 4096MB and 10GB?
What we really wanted to check in this experiment was whether performance would actually improve once memory goes beyond 3008 MB.
48MB (1-month data)
Increasing memory from 3008 → 4096 → 10240 MB resulted in:
- chDB: 3,369 → 3,197 → 3,087 ms
- DuckDB: 4,061 → 3,547 → 3,484 ms
Performance did improve, but the gains were limited.
In particular, the difference from 4096 → 10240 MB is almost negligible, within the margin of error.
Looking at Max Memory Used, chDB only used about 1.2 GB and DuckDB about 550 MB, meaning increasing allocated memory did not increase actual usage.
807MB (1-year data)
chDB:
- 3008 MB: 27,170 ms
- 4096 MB: 24,631 ms
- 10240 MB: 22,839 ms
Overall, going from 3008 → 10240 MB reduced execution time by roughly 16%,
but from 4096 → 10240 MB, the reduction was only about 7%.
Even though memory increased roughly 3.4×, execution time dropped by only ~16%, suggesting that performance is hitting a ceiling.
DuckDB:
- 3008 MB: 187,331 ms
- 4096 MB: 188,880 ms
- 10240 MB: 189,678 ms
Almost no improvement; execution time actually rose slightly at each step.
For DuckDB on this workload, simply increasing memory did not reduce execution time.
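The percentages above can be reproduced with a few lines of arithmetic. This sketch uses the 807 MB timings from the results table; the helper name `improvement_pct` is just for illustration.

```python
def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Reduction in execution time, as a percentage of the starting time."""
    return (before_ms - after_ms) / before_ms * 100


# 807 MB file, execution time in ms, keyed by allocated memory in MB
chdb = {3008: 27170, 4096: 24631, 10240: 22839}
duckdb = {3008: 187331, 4096: 188880, 10240: 189678}

print(f"memory ratio 3008 -> 10240: {10240 / 3008:.1f}x")
print(f"chDB   3008 -> 10240: {improvement_pct(chdb[3008], chdb[10240]):.1f}%")
print(f"chDB   4096 -> 10240: {improvement_pct(chdb[4096], chdb[10240]):.1f}%")
print(f"DuckDB 3008 -> 10240: {improvement_pct(duckdb[3008], duckdb[10240]):.1f}%")
```

A negative value means the run got slower, which is what happens for DuckDB here.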
Analysis (Bottleneck Insights)
Memory behaves as a threshold parameter
For Lambda × DuckDB/chDB, memory seems to behave more like a threshold than a proportional scaling parameter.
- 1024 MB and 2048 MB → OOM
- 3008 MB → first point where processing succeeds
Beyond that, adding memory does not yield proportional performance gains.
This suggests the bottleneck lies somewhere other than memory capacity or raw compute.
Does increasing vCPU help?
Lambda increases available CPU with memory.
However:
- DuckDB barely scaled
- chDB only slightly improved
Likely bottlenecks are I/O and serialization, such as:
- Reading from S3
- Iceberg metadata operations
- Parquet → Arrow conversion
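I did not break the timings down by stage in this experiment, but one way to test the hypothesis above would be to time each phase of the handler separately. Below is a minimal sketch using a context manager; the stage names and the `time.sleep` calls are placeholders standing in for the real query, conversion, and append steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict):
    """Record the wall-clock time of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000


timings = {}

# Placeholders: in the real handler these blocks would wrap the chDB/DuckDB
# query, the Arrow normalization, and iceberg_table.append() respectively.
with timed("s3_read_and_query", timings):
    time.sleep(0.01)
with timed("arrow_conversion", timings):
    time.sleep(0.01)
with timed("iceberg_append", timings):
    time.sleep(0.01)

for stage, elapsed_ms in timings.items():
    print(f"{stage}: {elapsed_ms:.1f} ms")
```

If one stage stays flat across memory settings while the others shrink, that stage is the likely bottleneck.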
Engine-specific behavior
- chDB: small improvements with more memory
- DuckDB: almost no change
This difference may be due to internal implementations or parallelization strategies.
At least for this workload, simply going to 10 GB does not make DuckDB dramatically faster.
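To check how much parallelism each memory setting actually exposes, the handler can log the CPU count it sees alongside its configured memory. This is a small sketch using only the standard library and the Lambda context object's `memory_limit_in_mb` attribute; the helper name `describe_runtime` and the stand-in context are hypothetical.

```python
import os
from types import SimpleNamespace


def describe_runtime(context) -> dict:
    """Report the configured memory and the CPU count visible to the process."""
    return {
        "memory_limit_mb": int(context.memory_limit_in_mb),
        "cpu_count": os.cpu_count(),
    }


# Stand-in for the Lambda context object, for running the sketch locally.
fake_context = SimpleNamespace(memory_limit_in_mb="10240")
print(describe_runtime(fake_context))
```

Comparing this output with each engine's own thread settings would show whether the engine is in a position to use the extra cores at all.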
Key Takeaways for Large Workloads on Lambda
From this experiment, a few points are clear:
- Crossing the OOM threshold is the main goal.
- Memory beyond that should be considered carefully, especially for cost.
- Simply allocating 10 GB does not guarantee faster execution.
Looking at DuckDB results, it's clear that maxing out memory does not automatically make things faster.
From a cost perspective, finding the “just enough” memory is more practical.
For more complex or larger workloads, sticking to Lambda may not be optimal — Glue or EMR could be faster and more stable.
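To put the cost point in numbers: Lambda bills compute by GB-seconds (allocated memory multiplied by execution time), so a rough comparison only needs the table values. The sketch below uses the chDB timings for the 807 MB file. It deliberately leaves out the per-GB-second price, which depends on region and architecture, and the per-request charge.

```python
def gb_seconds(memory_mb: int, duration_ms: float) -> float:
    """Billable compute for one invocation, in GB-seconds."""
    return (memory_mb / 1024) * (duration_ms / 1000)


# chDB, 807 MB file: execution time in ms, keyed by allocated memory in MB
CHDB_807MB_MS = {3008: 27170, 4096: 24631, 10240: 22839}

baseline = gb_seconds(3008, CHDB_807MB_MS[3008])
for memory_mb, duration_ms in CHDB_807MB_MS.items():
    usage = gb_seconds(memory_mb, duration_ms)
    print(
        f"{memory_mb:>5} MB: {usage:6.1f} GB-s "
        f"({usage / baseline:.2f}x the 3008 MB run)"
    )
```

With these numbers, the 10240 MB run consumes roughly 2.9 times the GB-seconds of the 3008 MB run in exchange for a run that is about 16% shorter.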
Conclusion
In this article, we walked through applying for a Lambda memory quota increase and measured the performance of lightweight ETL tasks with the expanded memory.
Both chDB and DuckDB are attractive open-source options, but they have significantly different characteristics. One clear takeaway is that crossing the OOM threshold should always be the first goal; beyond that, performance improvements will likely need to come from areas other than memory.
This experiment reinforced that designing for maximum memory by default is not necessarily the best approach. It's more important to understand your workload and identify the critical memory boundaries.
Also, keep in mind that Lambda quota increases cannot be requested from the Service Quotas screen, which can be useful knowledge in both personal projects and professional settings.
While both engines are still evolving, understanding their characteristics and using them appropriately allows you to build simple, yet highly extensible data processing workflows.
I hope this article serves as a helpful reference for anyone considering lightweight data processing or real-time ETL with Iceberg tables.
