takuma818t

Posted on Oct 14

Lambda Performance Evaluation: The Relationship Between Memory and Internal vCPU Architecture, and Their Comparison

#aws #lambda

Thoughts on Lambda Memory Allocation

How do you allocate memory for your Lambda functions?

Lambda is billed only for the time it actually runs, at a rate proportional to the allocated memory.
Increasing the memory also boosts CPU performance, but it raises the cost per unit of time.
As performance increases, the execution time shortens.
So in some cases the shorter execution time offsets the higher rate, and the total cost stays the same.
Many of you might have adjusted the memory allocation while monitoring the execution time, keeping these points in mind.

For example, if you adjust the memory and get the following execution times:

Memory	Execution Time
128MB	400ms
256MB	200ms
512MB	100ms
1024MB	50ms
2048MB	50ms

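With billing in 100ms increments, as was long the case, the cost stays the same up to 512MB, because each doubling of memory halves the execution time. At 1024MB the 50ms run is still billed as 100ms, so the cost doubles. Additionally, the performance hits a ceiling at 1024MB.
This kind of relationship between memory and execution time is common, so many of you may have chosen one of these two options: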

Choosing 512MB considering the cost and performance since the billing unit is 100ms
Choosing 1024MB for better performance, knowing that the speed hits its limit at this point

The Relationship Between Lambda Memory Size, Billing Units, and Performance

AWS Lambda now bills in units of 1 millisecond instead of 100ms.
Lambda functions can also now be configured with up to 10GB of memory and 6 vCPUs.
As a result:

Allocating more memory (and thus CPU) to achieve response times under 100ms no longer increases the total cost.
Beyond a certain memory size, extra CPU capacity comes as additional vCPUs rather than as a faster single core.
This has likely changed how we should allocate memory when tuning for performance and cost. Specifically, tuning execution times below 100ms now brings a cost benefit, and because Lambda now offers multiple cores, we need to think about how to use them.
In this article, I will examine these aspects.
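As a rough illustration of the billing change, the sketch below compares the billed memory-time for the example execution times in the earlier table under a 100ms and a 1ms billing unit. It assumes cost is proportional to memory multiplied by the billed duration and ignores the per-request charge. The function names are illustrative and not part of any AWS API.

```python
import math

# Example execution times from the table earlier in this article (memory in MB -> duration in ms).
EXAMPLE_DURATIONS_MS = {128: 400, 256: 200, 512: 100, 1024: 50, 2048: 50}


def billed_mb_ms(memory_mb, duration_ms, billing_unit_ms):
    """Return billed memory-time (MB x ms), rounding the duration up to the billing unit."""
    billed_ms = math.ceil(duration_ms / billing_unit_ms) * billing_unit_ms
    return memory_mb * billed_ms


def relative_costs(billing_unit_ms):
    """Return the cost of each memory setting relative to the 128MB setting."""
    baseline = billed_mb_ms(128, EXAMPLE_DURATIONS_MS[128], billing_unit_ms)
    return {
        memory_mb: billed_mb_ms(memory_mb, duration_ms, billing_unit_ms) / baseline
        for memory_mb, duration_ms in EXAMPLE_DURATIONS_MS.items()
    }


if __name__ == "__main__":
    print("100ms units:", relative_costs(100))
    print("1ms units:  ", relative_costs(1))
```

Under these assumptions, 1024MB costs twice as much as 512MB with 100ms units but the same with 1ms units, which is the cost benefit described above.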

Relationship Between Lambda Memory Allocation and vCPUs

The announcement said that up to 6 vCPUs are available, but how much memory corresponds to how many vCPUs? I checked the official documentation, and the following was all I could find:

[Official] Configuring Lambda Function Memory

At 1,769MB, the equivalent of 1 vCPU is allocated.

Since there wasn’t much detail in the official documentation, I used Python's multiprocessing.cpu_count() to output the number of vCPUs while varying the Lambda memory allocation.

This isn't official information, so the results may change if the specifications are updated, but I hope it can serve as a reference. (The results are based on testing in the Oregon region as of December 12, 2020.)
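The check described above could look roughly like the following handler. This is a minimal sketch rather than the code actually used for the measurements. It reads `memory_limit_in_mb` from the Lambda context object and falls back to `None` when run outside Lambda.

```python
import multiprocessing


def lambda_handler(event, context):
    """Report the configured memory and the CPU count visible to Python."""
    return {
        # The Lambda context object exposes the configured memory size;
        # getattr keeps the sketch runnable with a dummy context.
        "memory_limit_in_mb": getattr(context, "memory_limit_in_mb", None),
        "cpu_count": multiprocessing.cpu_count(),
    }
```

Invoking this once per memory setting and recording `cpu_count` gives a table like the one below.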

Memory Allocation	vCPU
128MB–1769MB	1 vCPU*
1770MB–3008MB	2 vCPUs
3009MB–5307MB	3 vCPUs
5308MB–7076MB	4 vCPUs
7077MB–8845MB	5 vCPUs
8846MB–10240MB	6 vCPUs

*Even though cpu_count outputs 2 vCPUs, based on the documentation and performance test results, it seems that only 1 vCPU is actually being utilized internally.
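For convenience, the table above can be turned into a small lookup. The boundaries are the unofficial values measured here and may change, and `observed_vcpus` is an illustrative name, not an AWS API.

```python
from bisect import bisect_left

# Upper memory bound (MB) of each band in the table above, and the vCPU count for that band.
# These are the unofficial values observed in this article, not documented limits.
_BAND_UPPER_MB = [1769, 3008, 5307, 7076, 8845, 10240]
_BAND_VCPUS = [1, 2, 3, 4, 5, 6]


def observed_vcpus(memory_mb):
    """Return the vCPU count observed for a given memory setting (128MB to 10240MB)."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("memory_mb must be between 128 and 10240")
    return _BAND_VCPUS[bisect_left(_BAND_UPPER_MB, memory_mb)]
```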

Performance Test 1 (Single Task, Multi-Thread, Multi-Process)

For memory allocations of 3009MB and above, cpu_count() reports additional vCPUs, so any further CPU performance increase presumably comes from added cores. For single-task operations, the performance likely hits a ceiling by that point. To improve performance further, the process needs to be restructured to use multiple cores effectively.

With that in mind, I conducted some tests.

Test Overview: Calculating the 30th Fibonacci number four times.

I ran the test under the following conditions (the actual code is provided at the end of this article):

Single task
Multi-threaded
Multi-process
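The article does not show how the times were collected. Lambda reports a duration for each invocation in its logs, which is one option. For a quick local comparison, a wrapper like the following could time each variant. The helper names `elapsed_ms` and `single_task` are illustrative.

```python
import time


def fibonacci(n):
    """Naive recursive Fibonacci, deliberately CPU-heavy."""
    if n < 2:
        return n
    return fibonacci(n - 2) + fibonacci(n - 1)


def elapsed_ms(task, *args):
    """Run task(*args) once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = task(*args)
    return result, (time.perf_counter() - start) * 1000.0


def single_task(n, repeats=4):
    """Compute fibonacci(n) `repeats` times in sequence, as in the single-task test."""
    return [fibonacci(n) for _ in range(repeats)]


if __name__ == "__main__":
    _, ms = elapsed_ms(single_task, 30)
    print(f"single task: {ms:.2f} ms")
```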

Performance Test 1 Results

Memory Size	vCPUs	Single Task (ms)	Multi-Thread (ms)	Multi-Process (ms)
128MB	1*	22,357.25	24,526.17	22,601.75
256MB	1*	11,103.38	12,096.86	11,613.07
512MB	1*	5,554.15	5,783.73	5,675.32
1024MB	1*	2,737.14	2,913.90	2,792.90
1536MB	1*	1,859.68	1,909.14	1,880.79
1769MB	1*	1,576.30	1,691.91	1,597.14
2048MB	2	1,574.19	1,626.24	1,370.34
3008MB	2	1,590.36	1,643.64	950.26
3009MB	3	1,621.39	1,639.41	940.40
4096MB	3	1,574.45	1,590.55	722.13
5120MB	3	1,578.06	1,633.17	637.16
6144MB	4	1,547.85	1,656.60	484.49
7076MB	4	1,578.67	1,653.11	403.14
7168MB	5	1,606.02	1,627.12	402.30
8192MB	5	1,602.95	1,654.36	402.57
9216MB	6	1,577.55	1,633.96	420.52
10240MB	6	1,591.83	1,640.31	407.27

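Performance hits a ceiling at about 1769MB for the single-task and multi-thread runs, and at about 7076MB (4 vCPUs) for the multi-process run.

As expected, single-task and multi-thread operations seem to use only a single core internally, since CPython's global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. Their performance stops improving once the extra capacity comes from additional vCPUs.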

While cpu_count() shows 2 vCPUs from 128MB to 3008MB, based on the results, it seems that the performance limit for single-threaded processing occurs at 1769MB. This aligns with the official documentation, which states that "1,769MB corresponds to 1 vCPU." Therefore, it seems that 1769MB and below is equivalent to 1 vCPU, while above that is equivalent to 2 vCPUs.

On the other hand, multi-process operations keep improving as vCPUs are added, but they hit a limit once the number of vCPUs reaches the number of processes (four in this test).

Performance Test 2 (Varying Number of Processes)

In Test 1, we compared single-task, multi-thread, and multi-process operations. Now, let’s test how performance changes with different numbers of processes in multi-process operations.

Test Overview: Calculate the 30th Fibonacci number in each process.

Ideally the total workload would have been held constant, but in this test each process performs the same calculation, so the total amount of computation increases with the number of processes.
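The code at the end of this article is written for four processes. A variant with a configurable process count might look like the sketch below. The `"processes"` event field is hypothetical, and the `"fork"` start method is an assumption that holds on Linux, which is what Lambda runs on.

```python
import multiprocessing


def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 2) + fibonacci(n - 1)


def run_in_processes(num_processes, n=30):
    """Start num_processes workers that each compute fibonacci(n); return their exit codes."""
    ctx = multiprocessing.get_context("fork")  # assumption: Linux, as on Lambda
    processes = [
        ctx.Process(target=fibonacci, args=(n,)) for _ in range(num_processes)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return [process.exitcode for process in processes]


def lambda_handler(event, context):
    # "processes" is a hypothetical event field used to pick the process count per run.
    num_processes = int(event.get("processes", 4))
    run_in_processes(num_processes)
    return 0
```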

Performance Test 2 Results

Memory Size	vCPUs	3 Processes (ms)	4 Processes (ms)	6 Processes (ms)	8 Processes (ms)	12 Processes (ms)
128MB	1*	17,170.78	22,601.75	34,307.67	45,027.37	67,933.81
256MB	1*	8,469.28	11,613.07	17,009.97	22,894.79	34,513.40
512MB	1*	4,237.92	5,675.32	8,498.68	11,360.69	17,194.66
1024MB	1*	2,138.52	2,792.90	4,218.83	5,620.41	8,468.93
2048MB	2	1,088.32	1,370.34	2,037.55	2,817.35	4,222.83
4096MB	3	964.51	722.13	1,064.67	1,423.73	2,099.09
5120MB	3	440.11	637.16	853.15	1,132.36	1,685.33
5307MB	3	412.64	607.87	-	-	-
6144MB	4	401.42	484.49	707.66	954.88	1,402.62
7076MB	4	-	403.14	-	-	-
7168MB	5	411.62	402.30	714.30	846.54	1,220.98
8192MB	5	398.72	402.57	649.03	767.90	1,089.49
9216MB	6	402.85	420.52	470.74	673.93	947.82
10240MB	6	400.56	407.27	424.13	642.19	870.46

From these results, we can see that increasing the number of processes beyond the number of vCPUs does not lead to further performance improvements. Since Lambda’s current limit is 6 vCPUs, there isn’t much benefit in running more than six processes in parallel for CPU-bound work like this.

Conclusion

Some of you might have previously given up on pushing execution times below 100ms because there was no cost benefit, but with this recent update, why not push the performance limits further?

For example, in a project I worked on, we had a Lambda function that wrote data from Excel files (with multiple sheets) uploaded to S3 into DynamoDB. It was difficult to improve performance, but perhaps splitting the process by sheets and handling them with multi-processing could speed things up.
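As a sketch of that idea, the code below splits a list of work items (for example, sheet names) into at most one chunk per vCPU and processes each chunk in its own process, sending results back over `multiprocessing.Pipe`. Everything here is an assumption for illustration: `handle_item` is a hypothetical stand-in for the per-sheet work, the `"fork"` start method assumes Linux, and `Pipe` is chosen because `multiprocessing.Pool` and `Queue` have been reported not to work in the Lambda environment.

```python
import multiprocessing


def split_evenly(items, parts):
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    if not items:
        return []
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for index in range(parts):
        end = start + size + (1 if index < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def _worker(conn, handle_item, chunk):
    """Process one chunk in a child process and send the results to the parent."""
    conn.send([handle_item(item) for item in chunk])
    conn.close()


def process_in_parallel(items, handle_item, max_workers=None):
    """Apply handle_item to every item using at most one process per vCPU."""
    ctx = multiprocessing.get_context("fork")  # assumption: Linux, as on Lambda
    workers = max_workers or multiprocessing.cpu_count()
    jobs = []
    for chunk in split_evenly(items, workers):
        recv_conn, send_conn = ctx.Pipe(duplex=False)
        process = ctx.Process(target=_worker, args=(send_conn, handle_item, chunk))
        process.start()
        send_conn.close()  # the parent only reads from this pipe
        jobs.append((process, recv_conn))
    results = []
    for process, recv_conn in jobs:
        results.extend(recv_conn.recv())  # read before join so large results cannot block the child
        process.join()
    return results
```

Capping the number of processes at the vCPU count follows from Test 2, where extra processes beyond the available vCPUs brought no further speedup.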

Python Code Used for Testing

Single Task Code

def lambda_handler(event, context):
    fibonacci_num=int(30)

    s0=fibonacci(fibonacci_num)
    s1=fibonacci(fibonacci_num)
    s2=fibonacci(fibonacci_num)
    s3=fibonacci(fibonacci_num)

    return 0

# Computation: returns the n-th Fibonacci number (naive recursion, CPU-bound)
def fibonacci(n):
    if n < 2 :
        return n
    else:
        return fibonacci(n-2) + fibonacci(n-1)

Multi-Thread Code

import threading

def lambda_handler(event, context):
    fibonacci_num=int(30)

    # Thread Creation
    th0 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    th1 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    th2 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    th3 = threading.Thread(target=fibonacci, args=(fibonacci_num,))

    # Starting Threads
    th0.start()
    th1.start()
    th2.start()
    th3.start()

    # Waiting for Threads 
    th0.join()
    th1.join()
    th2.join()
    th3.join()

    return 0

# Computation: returns the n-th Fibonacci number (naive recursion, CPU-bound)
def fibonacci(n):
    if n < 2 :
        return n
    else:
        return fibonacci(n-2) + fibonacci(n-1)

Multi-Process Code

import multiprocessing

def lambda_handler(event, context):
    fibonacci_num=int(30)

    # Process Creation
    p0 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    p1 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    p2 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    p3 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))

    # Starting Process
    p0.start()
    p1.start()
    p2.start()
    p3.start()

    # Waiting for Process Termination
    p0.join()
    p1.join()
    p2.join()
    p3.join()

    return 0

# Computation: returns the n-th Fibonacci number (naive recursion, CPU-bound)
def fibonacci(n):
    if n < 2 :
        return n
    else:
        return fibonacci(n-2) + fibonacci(n-1)
