Aki for AWS Community Builders

Posted on Feb 26

Does Increasing AWS Lambda Memory to 10GB Really Make It Faster? (AWS Lambda chDB/DuckDB PyIceberg Benchmark)

#aws #iceberg #chdb #duckdb

Original Japanese article: AWS Lambdaを10GBにすると本当に速くなるのか？（AWS Lambda×chDB/DuckDB×PyIceberg検証）

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In a previous article, I benchmarked Iceberg integration using AWS Lambda with DuckDB and chDB.

Lightweight ETL with AWS Lambda, chDB, and PyIceberg (Compared with DuckDB)

In that article, I tested two patterns on AWS Lambda:

chDB × PyIceberg
DuckDB × PyIceberg

Memory sizes were set to 1024 MB, 2048 MB, and 3008 MB (the maximum without quota increase at the time).

The results showed:

For small datasets, increasing memory generally improved performance.
For a large dataset (807 MB), 3008 MB was barely enough to complete processing.

This time, I extended the experiment:

What happens if we increase Lambda memory up to 10GB (10240 MB)?

Increasing the Lambda Memory Quota

To raise the Lambda memory limit beyond 3008 MB, you must request a quota increase.

Important:
You cannot increase Lambda memory from the Service Quotas console.

Steps

Go to the AWS Support Center and create a new case.
Clearly state:

The reason for the increase
The target region

Example request content:

We are building and validating a data processing platform using AWS Lambda.
The workload is memory-intensive, including large Parquet file loading, aggregation, and transformation.
The current 3008 MB limit is insufficient to complete processing.

We are performing analytical processing inside Lambda using columnar formats (Parquet), and the workload requires higher memory allocation.

Currently, we experience performance degradation and OutOfMemory errors.

We would like to request an increase of the Lambda memory limit in the Tokyo region to 10240 MB.

Although we considered migrating to other compute services, we determined that continuing with Lambda is the most appropriate option from both operational and architectural perspectives.

After submission, AWS responded in about 3 business days and applied the increase.
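Once the increase has been applied, the function still has to be reconfigured to use it. The following is a minimal sketch using boto3's `update_function_configuration`; the function name `iceberg-etl-chdb` and the helper names are hypothetical, and the 128 to 10240 MB range check reflects Lambda's documented limits rather than anything specific to this experiment.

```python
LAMBDA_MIN_MEMORY_MB = 128
LAMBDA_MAX_MEMORY_MB = 10240


def build_memory_update(function_name: str, memory_mb: int) -> dict:
    """Validate the requested size and build the kwargs for the API call."""
    if not LAMBDA_MIN_MEMORY_MB <= memory_mb <= LAMBDA_MAX_MEMORY_MB:
        raise ValueError(
            f"memory_mb must be between {LAMBDA_MIN_MEMORY_MB} and "
            f"{LAMBDA_MAX_MEMORY_MB}, got {memory_mb}"
        )
    return {"FunctionName": function_name, "MemorySize": memory_mb}


def apply_memory_update(function_name: str, memory_mb: int,
                        region: str = "ap-northeast-1") -> dict:
    """Apply the new memory size to an existing function."""
    # Imported lazily so the validator above works without boto3 installed.
    import boto3

    client = boto3.client("lambda", region_name=region)
    return client.update_function_configuration(
        **build_memory_update(function_name, memory_mb)
    )


# Hypothetical usage (requires AWS credentials and an existing function):
# apply_memory_update("iceberg-etl-chdb", 10240)
```

If the account quota has not been raised yet, the API call is expected to be rejected for values above the account's current limit, so it also works as a quick check that the increase has taken effect.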

Architecture

The architecture is identical to the previous article.

Flow:

Load a Parquet file from S3 in Lambda
Process it using chDB or DuckDB
Write results into an Iceberg table

In short:

S3 → Lambda (chDB/DuckDB) → Iceberg (via Glue Catalog)
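The first step of this flow depends on the S3 event payload. Object keys in S3 event notifications arrive URL-encoded (for example, a space is delivered as `+`), so a small helper like the one below can make the bucket and key extraction more robust. This is a sketch only; the helper name `extract_s3_location` and the sample key are hypothetical and not part of the benchmarked code.

```python
from urllib.parse import unquote_plus


def extract_s3_location(event: dict) -> tuple:
    """Return (bucket, key) from the first record of an S3 event notification."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Keys are URL-encoded in the event, so decode before building a path.
    key = unquote_plus(s3_info["object"]["key"])
    return bucket, key


# Hypothetical event with an encoded space and '=' in the key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "input/yellow+tripdata/month%3D01.parquet"},
            }
        }
    ]
}

bucket, key = extract_s3_location(sample_event)
print(f"s3://{bucket}/{key}")
```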

In this article, I focus on performance behavior differences.
Iceberg version conflicts and concurrency handling are omitted for simplicity.

Sample Code (chDB)

import chdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog


def _to_pyarrow_table(result):
    """
    Compatibility helper to extract a pyarrow.Table from a chDB query_result.
    """
    if hasattr(chdb, "to_arrowTable"):
        return chdb.to_arrowTable(result)

    if hasattr(result, "to_pyarrow"):
        return result.to_pyarrow()
    if hasattr(result, "to_arrow"):
        return result.to_arrow()

    raise RuntimeError(
        "Cannot convert chdb query_result to pyarrow.Table. "
        f"Available attributes: {sorted(dir(result))[:200]}"
    )


def normalize_arrow_for_iceberg(table: pa.Table) -> pa.Table:
    """
    Normalize Arrow types that Iceberg does not accept
    (mainly timezone-aware timestamps).
    """
    new_fields = []
    new_columns = []

    for field, column in zip(table.schema, table.columns):
        if pa.types.is_timestamp(field.type) and field.type.tz is not None:
            # Remove timezone information (values remain in UTC)
            new_type = pa.timestamp(field.type.unit)
            new_fields.append(pa.field(field.name, new_type, field.nullable))
            new_columns.append(column.cast(new_type))
        else:
            new_fields.append(field)
            new_columns.append(column)

    new_schema = pa.schema(new_fields)
    return pa.Table.from_arrays(new_columns, schema=new_schema)


def lambda_handler(event, context):
    try:
        # Extract S3 bucket and object key from the event
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']

        # Build S3 HTTPS URL
        s3_url = (
            f"https://{s3_bucket}."
            f"s3.ap-northeast-1.amazonaws.com/"
            f"{s3_object_key}"
        )

        print(f"s3_url: {s3_url}")

        # Query Parquet data on S3 using chDB
        query = f"""
            SELECT *
            FROM s3('{s3_url}', 'Parquet')
            WHERE VendorID = 1
        """

        # Execute chDB query with Arrow output
        result = chdb.query(query, "Arrow")

        # Convert chDB result to pyarrow.Table
        arrow_table = _to_pyarrow_table(result)
        print(f"Original schema: {arrow_table.schema}")

        # Normalize schema for Iceberg compatibility
        arrow_table = normalize_arrow_for_iceberg(arrow_table)
        print(f"Normalized schema: {arrow_table.schema}")
        print(f"Rows: {arrow_table.num_rows}")

        # Initialize Iceberg Glue Catalog
        catalog = GlueCatalog(
            name="my_catalog",
            database="icebergdb",
            region_name="ap-northeast-1",
        )

        # Load Iceberg table
        iceberg_table = catalog.load_table("icebergdb.yellow_tripdata")

        # Append data to Iceberg table
        iceberg_table.append(arrow_table)

        print("Data appended to Iceberg table.")

    except Exception as e:
        print("Exception:", e)
        raise

Note: This article focuses on the differences in operation, so version updates or conflict handling in Iceberg tables are omitted.

Sample Code (DuckDB)

import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog  

def lambda_handler(event, context):
    try:
        # Connect to DuckDB and set the home directory
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'") 

        # Install and load the httpfs extension
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        # Load data from S3 using DuckDB
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']

        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        print(f"s3_input_path: {s3_input_path}")

        query = f"""
            SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1
        """
        # Execute SQL and retrieve results as a PyArrow Table
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        print(f"Number of rows retrieved: {result_arrow_table.num_rows}")
        print(f"Data schema: {result_arrow_table.schema}")

        # Configure Glue Catalog (to access Iceberg table)
        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")  # Adjust to your environment.

        # Load the table
        namespace = "icebergdb"  # Adjust to your environment.
        table_name = "yellow_tripdata"  # Adjust to your environment.
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")

        # Append data to the Iceberg table in bulk
        iceberg_table.append(result_arrow_table) 

        print("Data has been appended to S3 in Iceberg format.")

    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so Lambda reports the invocation as failed instead of succeeding silently
        raise

Note: Version updates and conflict handling are omitted here as well.

Test Conditions

Dataset:
NYC Taxi Trip Records
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Test files:

January 2024 (Yellow Taxi)
- 2,964,624 records
- 48 MB
Full year 2024 (aggregated file)
- 41,169,720 records
- 807 MB

Memory configurations tested:

1024 MB
2048 MB
3008 MB (maximum without a quota increase)
4096 MB
10240 MB

Each configuration was executed 5 times under warm conditions to eliminate cold start effects.
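The article does not state how the five runs were aggregated, but one plausible way to collect such figures is to parse the `REPORT` line that Lambda writes to CloudWatch Logs after each invocation. The sketch below assumes the standard `REPORT` line layout; the helper name `summarize_reports` and the sample log lines (including their values) are made up for illustration.

```python
import re
from statistics import mean

# Matches the Duration, Memory Size, and Max Memory Used fields of a REPORT line.
REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms.*?"
    r"Memory Size: (?P<size>\d+) MB.*?"
    r"Max Memory Used: (?P<used>\d+) MB"
)


def summarize_reports(lines):
    """Average the duration and take the peak memory over warm invocations."""
    durations, used = [], []
    for line in lines:
        if not line.startswith("REPORT"):
            continue
        match = REPORT_PATTERN.search(line)
        if match is None:
            continue
        durations.append(float(match["duration"]))
        used.append(int(match["used"]))
    if not durations:
        raise ValueError("no REPORT lines found")
    return {
        "runs": len(durations),
        "mean_ms": mean(durations),
        "max_memory_used_mb": max(used),
    }


# Illustrative log lines only; these are not the measured values.
sample_lines = [
    "START RequestId: aaa Version: $LATEST",
    "REPORT RequestId: aaa\tDuration: 3000.00 ms\tBilled Duration: 3000 ms\t"
    "Memory Size: 10240 MB\tMax Memory Used: 1250 MB",
    "REPORT RequestId: bbb\tDuration: 3100.00 ms\tBilled Duration: 3100 ms\t"
    "Memory Size: 10240 MB\tMax Memory Used: 1255 MB",
]

print(summarize_reports(sample_lines))
```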

Results

48MB File (1 Month)

Memory (MB)	chDB Time (ms)	chDB Memory Used (MB)	DuckDB Time (ms)	DuckDB Memory Used (MB)
1024	5,092	1018	5,163	512
2048	3,872	1132	4,264	538
3008	3,369	1115	4,061	524
4096	3,197	1263	3,547	568
10240	3,087	1255	3,484	554

Performance improved as memory increased — but only marginally above 4096 MB.

Actual memory usage:

chDB ≈ 1.2 GB
DuckDB ≈ 550 MB

Allocated memory increased, but real usage did not scale proportionally.

807MB File (1 Year)

Memory (MB)	chDB Time (ms)	chDB Memory Used (MB)	DuckDB Time (ms)	DuckDB Memory Used (MB)
1024	OOM	-	OOM	-
2048	OOM	-	OOM	-
3008	27,170	3001	187,331	2732
4096	24,631	3322	188,880	2767
10240	22,839	3490	189,678	2788

OOM occurred because the memory allocation could not hold the temporary buffers required during the Parquet → Arrow → Iceberg transformation.

Did Performance Really Improve at 4096MB and 10GB?

What we really wanted to check in this experiment was whether performance would actually improve once memory goes beyond 3008 MB.

48MB (1-month data)

Increasing memory from 3008 → 4096 → 10240 MB resulted in:

chDB: 3,369 → 3,197 → 3,087 ms
DuckDB: 4,061 → 3,547 → 3,484 ms

Performance did improve, but the gains were limited.
In particular, the difference from 4096 → 10240 MB is almost negligible (roughly 2 to 3%) and plausibly within run-to-run variation.

Looking at Max Memory Used, chDB only used about 1.2 GB and DuckDB about 550 MB, meaning increasing allocated memory did not increase actual usage.

807MB (1-year data)

chDB:

3008 MB: 27,170 ms
4096 MB: 24,631 ms
10240 MB: 22,839 ms

Overall, going from 3008 → 10240 MB improved performance by roughly 16%,
but from 4096 → 10240 MB, the improvement was only about 7%.

Even though memory increased roughly 3.4×, performance only improved by ~16%, suggesting that performance is hitting a ceiling.
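The percentages above follow directly from the chDB timings in the results table. As a quick check of the arithmetic (the function name `improvement_pct` is just for illustration):

```python
def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Relative reduction in execution time, in percent."""
    return (before_ms - after_ms) / before_ms * 100


# chDB timings for the 807 MB file, taken from the results table.
chdb_ms = {3008: 27170, 4096: 24631, 10240: 22839}

overall = improvement_pct(chdb_ms[3008], chdb_ms[10240])
upper_range = improvement_pct(chdb_ms[4096], chdb_ms[10240])
memory_ratio = 10240 / 3008

print(f"3008 -> 10240 MB: {overall:.1f}% faster")      # 15.9% faster
print(f"4096 -> 10240 MB: {upper_range:.1f}% faster")  # 7.3% faster
print(f"Memory ratio: {memory_ratio:.1f}x")            # 3.4x
```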

DuckDB:

3008 MB: 187,331 ms
4096 MB: 188,880 ms
10240 MB: 189,678 ms

Almost no improvement; execution was in fact about 1% slower at the larger sizes.
For DuckDB on this workload, adding memory alone did not reduce execution time.

Analysis (Bottleneck Insights)

Memory behaves as a threshold parameter

For Lambda × DuckDB/chDB, memory seems to behave more like a threshold than a proportional scaling parameter.

1024 MB and 2048 MB → OOM
3008 MB → first point where processing succeeds

Beyond that, adding memory does not yield proportional performance gains.
This suggests the bottleneck lies somewhere other than memory capacity or raw compute.

Does increasing vCPU help?

Lambda allocates CPU in proportion to the configured memory, so a larger memory setting also brings more vCPUs (up to 6 at 10,240 MB).
However:

DuckDB barely scaled
chDB only slightly improved
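To confirm how many CPUs the runtime actually exposes at each memory setting, a small check like the following could be logged at the start of the handler. It is a sketch using only the standard library; the helper name `visible_cpus` is hypothetical, and the values it prints were not recorded in this experiment.

```python
import os


def visible_cpus() -> int:
    """Number of CPUs this process is allowed to run on."""
    # sched_getaffinity is available on Linux (which Lambda runs on)
    # and respects CPU restrictions placed on the process.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return os.cpu_count() or 1


print(f"Visible CPUs: {visible_cpus()}")
```

Comparing this number across memory settings shows whether an engine had more cores available but did not use them, or never received more cores in the first place.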

Likely bottlenecks are I/O and serialization, such as:

Reading from S3
Iceberg metadata operations
Parquet → Arrow conversion
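One way to test this hypothesis is to time each stage of the handler separately and see which one dominates. The sketch below uses a small context manager around `time.perf_counter`; the stage names and the `time.sleep` calls are stand-ins for the real S3 read, Arrow conversion, and Iceberg append, and no such breakdown was measured in this article.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict):
    """Accumulate the wall-clock time spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


def slowest_stage(timings: dict) -> str:
    """Return the name of the stage that took the longest."""
    return max(timings, key=timings.get)


timings = {}

# Stand-ins for the real work; replace with the query and append calls.
with timed("s3_read", timings):
    time.sleep(0.05)
with timed("arrow_conversion", timings):
    time.sleep(0.01)
with timed("iceberg_append", timings):
    time.sleep(0.01)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print(f"Slowest stage: {slowest_stage(timings)}")
```

If the S3 read or the Iceberg commit accounts for most of the time, extra memory and vCPUs are unlikely to help much, which would be consistent with the DuckDB results.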

Engine-specific behavior

chDB: small improvements with more memory
DuckDB: almost no change

This difference may be due to internal implementations or parallelization strategies.
At least for this workload, simply going to 10 GB does not make DuckDB dramatically faster.

Key Takeaways for Large Workloads on Lambda

From this experiment, a few points are clear:

Crossing the OOM threshold is the main goal.
Memory beyond that should be considered carefully, especially for cost.
Simply allocating 10 GB does not guarantee faster execution.

Looking at DuckDB results, it's clear that maxing out memory does not automatically make things faster.
From a cost perspective, finding the “just enough” memory is more practical.
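Because Lambda's duration charge is billed per GB-second (configured memory multiplied by execution time), the cost side can be estimated from the measurements above. The sketch below uses the chDB timings for the 807 MB file; it ignores request charges and the per-region, per-architecture unit price, so it only shows relative cost.

```python
def gb_seconds(memory_mb: int, duration_ms: float) -> float:
    """Billable GB-seconds for one invocation (memory x duration)."""
    return (memory_mb / 1024) * (duration_ms / 1000)


# chDB timings for the 807 MB file, taken from the results table.
chdb_ms = {3008: 27170, 4096: 24631, 10240: 22839}

baseline = gb_seconds(3008, chdb_ms[3008])
for memory_mb, duration_ms in chdb_ms.items():
    cost = gb_seconds(memory_mb, duration_ms)
    print(f"{memory_mb:>5} MB: {cost:6.1f} GB-s ({cost / baseline:.2f}x)")
```

Under these assumptions, the 10240 MB run consumes roughly 2.9 times the GB-seconds of the 3008 MB run in exchange for a 16% shorter execution time.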

For more complex or larger workloads, sticking to Lambda may not be optimal — Glue or EMR could be faster and more stable.

Conclusion

In this article, we walked through applying for a Lambda memory quota increase and measured the performance of lightweight ETL tasks with the expanded memory.

Both chDB and DuckDB are attractive open-source options, but they have significantly different characteristics. One clear takeaway is that crossing the OOM threshold should always be the first goal; beyond that, performance improvements will likely need to come from areas other than memory.

This experiment reinforced that designing for maximum memory by default is not necessarily the best approach. It's more important to understand your workload and identify the critical memory boundaries.

Also, keep in mind that the Lambda memory limit increase cannot be requested from the Service Quotas console and has to go through a Support case, which can be useful knowledge in both personal projects and professional settings.

While both engines are still evolving, understanding their characteristics and using them appropriately allows you to build simple, yet highly extensible data processing workflows.

I hope this article serves as a helpful reference for anyone considering lightweight data processing or real-time ETL with Iceberg tables.
