DEV Community: Aleh Karachun

Legacy .NET 4.8.1 on AWS: When Fargate Abstractions Meet Single-Threaded Workloads

Aleh Karachun — Wed, 13 May 2026 12:14:13 +0000

"Premature optimization is the root of all evil." However, in cloud migrations, the abstraction of resources often hides the physical limitations of the underlying hardware. For latency-sensitive legacy runtimes, these abstractions can become a performance bottleneck.

This post analyzes a migration of a legacy .NET Framework 4.8.1 monolith from standalone EC2 instances to Windows Containers on AWS ECS, where the choice of Fargate led to a 10x performance degradation.

1. The Context: Infrastructure Modernization

The primary goal was to achieve centralized deployment and orchestration using AWS ECS.

The Constraints: A migration to .NET 6+ was rejected due to cost and time constraints. The mandate was to containerize the existing .NET 4.8.1 codebase "as-is."
The Path: Migration from legacy EC2 setups to Windows Containers on ECS Fargate.
The Stack: .NET 4.8.1, Razor Pages, Windows Server Core images.

2. The Symptom: Consistent 20-Second Latency

Post-migration, page rendering latency spiked to 20 seconds. This was not a cold-start issue; the delay remained constant across all requests in a steady state.

The Metrics Trap:
CloudWatch (Monitoring Details) showed a stable CPU Utilization plateau at ~30%. Increasing the task size to 4 vCPUs provided zero improvement. The response time remained static, while the total CPU Utilization metric dropped proportionally, creating a false impression of idle capacity.

This is a classic case where the average is the enemy of understanding. The metric averages across all vCPUs, so a single saturated execution thread disappears behind the idle ones.

3. Investigation: Eliminating Secondary Bottlenecks

Before attributing the latency to CPU frequency, we ruled out other infrastructure constraints:

Storage I/O: Legacy Razor engines read a large number of .cshtml files during execution. We verified storage throughput and ephemeral disk metrics to ensure we weren't hitting limits on ephemeral storage, which could cause "stuttering" during file access.
Network Latency: Using netstat and monitoring Time to First Byte (TTFB) for backend calls, we confirmed that the 20s delay was happening strictly during the internal rendering phase, not during database communication or network negotiation.
The Compression Trap (Bandwidth vs. Compute): We hypothesized that payload size might be a factor and enabled gzip compression for HTTP responses. Counterintuitively, latency got even worse. Compression is computationally expensive, and because the application was already bottlenecked on a single thread, compressing the payload simply took clock cycles away from the rendering engine and deepened the starvation.
Thread Saturation: Per-process performance counters showed one worker thread pinned at 100% CPU while the total container utilization remained low.

4. Root Cause: Abstraction Mismatch

The bottleneck resulted from an architectural mismatch between a legacy runtime and a fully abstracted compute layer.

Single-Threaded Rendering Path
The rendering path of our legacy Razor views was effectively CPU-bound and largely single-threaded. In a 4-vCPU environment, the request pipeline exhibited limited parallelism during view rendering, meaning the entire request was gated by the throughput of a single core.

The Abstraction Deficit
The issue was not that "Fargate is slow," but rather that Fargate abstracts away CPU characteristics that were critical for this specific workload.

Per-core Variability: Fargate provides abstract compute units. For modern asynchronous workloads, this is ideal. For legacy synchronous tasks, the inability to control the CPU class or guarantee a high base clock speed introduces unacceptable latency.
Scheduling Overhead: Windows Container overhead, combined with the lack of control over the underlying hardware, meant we couldn't guarantee the raw single-core throughput required for the monolith’s rendering engine.

5. The Solution: c7a.xlarge (EC2 Launch Type)

To resolve the latency without refactoring the code, we moved the workload to ECS on EC2 using c7a.xlarge instances.

Why c7a (AMD EPYC Genoa):

High Frequency: C7a instances run 4th Gen AMD EPYC processors with a maximum frequency of up to 3.7 GHz, which favors sustained single-core throughput.
Single-Core Performance: The 4th Gen AMD EPYC architecture provided significantly stronger per-core throughput for this workload.

Outcome:
Rendering latency dropped from 20 seconds to 1.5 seconds. We achieved our goal of centralized ECS deployment without sacrificing performance.

Conclusion

Cloud abstractions work exceptionally well for horizontally scalable workloads.
But many legacy runtimes still encode assumptions about single-core throughput, scheduling behavior, and hardware consistency.
When migrating these systems, infrastructure selection becomes part of application performance engineering - not just operations.

.NET 10 Performance: The O(n^2) String Trap and the Zero-Allocation Quest

Aleh Karachun — Sat, 21 Mar 2026 13:08:00 +0000

"Premature optimization is the root of all evil." We’ve all heard it. But in the world of high-load cloud systems and serverless environments, there is another truth: "Ignoring scalability is the root of a massive AWS bill."

Today, we are doing a deep dive into .NET 10 string manipulation. We’ll explore how a simple += can turn your performance into a disaster and how to achieve Zero-Allocation using modern C# features.

1. The Big Picture: Scaling is a Cliff

In computer science, O(n) vs O(n^2) is often treated as academic theory. But when you visualize it, theory becomes a cold, hard reality. We compared three contenders:

Classic Concatenation: The quadratic O(n^2) path.
StringBuilder: The standard heap-allocated buffer.
ValueStringBuilder (Optimized): A ref struct living entirely on the stack.

Figure 1. Scaling performance overview.

If the log scale feels too abstract, look at the linear reality at N=10,000:

Figure 2. Linear comparison at maximum scale.

2. The Micro-Scale Paradox (N=10)

Engineering is about choosing the right tool for the right job. On a tiny scale (N=10), our "super-optimized" approach actually loses.

UseStringBuilder: 32.30 ns
UseStringConcatenation: 52.95 ns
UseValueStringBuilder_Optimized: ~107 ns

The Paradox Explained:
Why does the "optimized" method lose here? It comes down to the "Setup Tax." Initializing a ref struct and preparing a stackalloc buffer takes more time than the actual string processing when N is small.

Meanwhile, StringBuilder in .NET 10 has been heavily tuned for small-scale operations. It manages to avoid the heavy allocations of += while bypassing the complex initialization required by our manual stack-based approach. At this scale, the runtime's built-in optimizations are simply more efficient than manual memory management.

Figure 3. Execution time distribution for N=10.

Lesson: Don't over-engineer for the small stuff. For small-scale formatting or log messages, standard library tools provide the best balance of performance and maintainability.

3. The "GC Fingerprint" (N=10,000)

When we scale to 10,000 operations, the masks come off. String concatenation at this scale allocates 379.4 MB of garbage. This produces what we call the "Camel Effect" on our density plots: a two-humped (bimodal) latency distribution caused by Garbage Collection pauses.

Figure 4. Impact of Garbage Collection on latency.

Now, compare this to the optimized Zero-Allocation method:

Figure 5. Predictability of zero-allocation execution.

Note on hardware physics: Even in Figure 5, where Zero-Allocation is achieved, a microscopic "tail" of jitter is still visible on the right. This isn't the Garbage Collector; it is the "physics of the hardware". OS interrupts, CPU context switching, and cache misses introduce these unavoidable micro-fluctuations. However, compared to the "Camel Effect" of GC pauses, this is just statistical noise, confirming the almost perfect predictability of our approach.

4. Engineering for Zero-Allocation

How did we achieve this? By staying off the Managed Heap entirely. We combined three pillars of modern .NET:

ref struct: Ensures our builder never escapes to the heap.
stackalloc char[256]: Allocates the initial buffer directly on the stack.
ISpanFormattable: Writes data directly into memory via TryFormat, avoiding intermediate ToString() allocations.

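public string Process(ReadOnlySpan<Transaction> transactions)
{
    // 1. Initial buffer on the stack (256 chars, matching the size listed above)
    Span<char> buffer = stackalloc char[256];
    var vsb = new ValueStringBuilder(buffer);

    // Assumed upper bound on the formatted length of a single amount.
    const int reserved = 32;

    foreach (var tx in transactions)
    {
        // 2. Zero-allocation formatting: reserve space, format in place,
        //    then hand back the characters TryFormat did not use.
        //    Assumes ValueStringBuilder exposes a settable Length, as the
        //    internal dotnet/runtime type does.
        Span<char> slot = vsb.AppendSpan(reserved);
        tx.Amount.TryFormat(slot, out int written); // written is 0 on failure
        vsb.Length -= reserved - written;
    }

    // 3. Final result (the only allocation)
    return vsb.ToString();
}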

Conclusion: Be Pragmatic

The benchmark results demonstrate that the optimal string manipulation strategy depends entirely on the expected data volume and system requirements.

Small scale (N < 50): StringBuilder is technically the winner, offering 40% better performance and 50% fewer allocations than simple concatenation. However, concatenation remains an acceptable choice for one-off tasks where code readability is the top priority.
Medium scale (N < 1000): StringBuilder remains the standard efficient approach for general-purpose applications, providing linear scaling with manageable heap pressure.
High-performance / High-load: Zero-Allocation patterns (e.g., ValueStringBuilder) are critical for systems with strict latency requirements. This approach eliminates the bimodal latency distribution caused by Garbage Collection, giving more deterministic execution times and less allocation traffic.

Final decision-making should balance code complexity against predictability. For high-concurrency environments like AWS Lambda, bypassing the managed heap is a primary strategy for cost and latency optimization.

The full source code and raw BenchmarkDotNet data are available on my GitHub:
👉 https://github.com/olegKarachun/dotnet-string-optimization-benchmarks

Battle of the Titans (Part 1): The Ultimate Go Lambda on AWS Graviton

Aleh Karachun — Thu, 19 Mar 2026 17:04:03 +0000

Hi everyone! Welcome to the first part of my series exploring AWS Lambda performance. My goal is to compare Go and .NET Native AOT in a realistic serverless environment.

To make this a fair benchmark, we aren't just deploying a "Hello World" function. Our Lambda simulates a realistic workload: it deserializes a JSON payload of financial transactions, filters them, calculates the total amount, and computes a SHA-256 hash of the IDs to generate a signature (simulating CPU load).

Today, we are focusing on setting up and optimizing the Go contender on ARM64 (Graviton).

1. The Infrastructure (AWS SAM)

We use AWS SAM (Serverless Application Model) to define our infrastructure. It allows us to describe resources declaratively and generates the underlying CloudFormation template.

Here is the core of our template.yaml:

CodeUri: bin/
Handler: bootstrap
Runtime: provided.al2023
Architectures:
  - arm64

Key takeaways

Runtime: provided.al2023: Amazon Linux 2023 is currently the recommended minimalist OS for compiled languages in AWS. It boots significantly faster than the legacy go1.x runtime.
Architectures: arm64: Targeting AWS Graviton processors. They use a RISC architecture that typically provides around 20% better price/performance for serverless workloads compared to x86_64.
Handler: bootstrap: When using custom runtimes, AWS Lambda expects the executable binary inside the deployment package to be named exactly bootstrap.

2. Compiling for Lambda

A standard go build works, but we can optimize it further for the Lambda environment. Here is the command we use:

GOOS=linux GOARCH=arm64 go build \
  -tags lambda.norpc \
  -ldflags="-s -w" \
  -o bin/bootstrap main.go

Key takeaways

GOOS=linux GOARCH=arm64: This enables cross-compilation, allowing us to build a Linux ARM64 binary directly from our local machine (even if it's x86).
-tags lambda.norpc: The al2023 runtime communicates with the Lambda service via an internal HTTP API. This tag tells the compiler to drop the legacy RPC compatibility code from the aws-lambda-go library, reducing the binary size and initialization time.
-ldflags="-s -w": These linker flags strip the symbol table and debug information, resulting in a leaner binary that loads into memory faster.

3. Local Testing and the "Error 255"

If you develop on an x86 (Intel/AMD) machine and try to test this locally using sam local invoke, you will likely hit a Fatal Error 255.

This happens because the Docker container spins up an ARM64 environment, but your host CPU cannot natively execute ARM instructions.

The fix: We need a translator. Running the multiarch/qemu-user-static Docker image solves this (typically a one-off docker run --rm --privileged multiarch/qemu-user-static --reset -p yes). QEMU intercepts the ARM instructions and translates them into x86 instructions for your host CPU on the fly, allowing you to test the production binary locally.

4. Anatomy of a Cold Start

When you run sam deploy --guided, the SAM CLI packages the binary, uploads it to S3, and updates the CloudFormation stack. But the most interesting part happens on the first invocation.

When we triggered the Lambda, CloudWatch reported an Init Duration of ~60 ms.

During that cold start, AWS performed the following:

Allocated a Graviton-based server.
Provisioned an isolated Firecracker microVM.
Downloaded the deployment zip from S3 and extracted it.
Booted the provided.al2023 OS.
Loaded our bootstrap binary into memory.

Once the environment was warm, subsequent invocations (Warm Starts) took roughly 2 ms of compute time with a memory footprint of about 19 MB.

Conclusion

Go on ARM64 with the AL2023 runtime provides an excellent baseline. With extremely low memory consumption and cold starts consistently under 60ms, it is a highly efficient choice for serverless APIs.

What’s Next?

In Part 2, we will set up our challenger: .NET 10 Native AOT. We will explore how to configure the C# project with Zero-Allocation techniques and Source Generators to see if it can match or beat Go's numbers.

The full source code for this setup is available in my GitHub repository:
👉 https://github.com/olegKarachun/aws-lambda-go-graviton