<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aleh Karachun</title>
    <description>The latest articles on DEV Community by Aleh Karachun (@aleh_karachun).</description>
    <link>https://dev.to/aleh_karachun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3763585%2F1b3180e7-f3f9-476d-a692-ba897d9a4687.png</url>
      <title>DEV Community: Aleh Karachun</title>
      <link>https://dev.to/aleh_karachun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aleh_karachun"/>
    <language>en</language>
    <item>
      <title>Legacy .NET 4.8.1 on AWS: When Fargate Abstractions Meet Single-Threaded Workloads</title>
      <dc:creator>Aleh Karachun</dc:creator>
      <pubDate>Wed, 13 May 2026 12:14:13 +0000</pubDate>
      <link>https://dev.to/aleh_karachun/legacy-net-481-on-aws-when-fargate-abstractions-meet-single-threaded-workloads-42eo</link>
      <guid>https://dev.to/aleh_karachun/legacy-net-481-on-aws-when-fargate-abstractions-meet-single-threaded-workloads-42eo</guid>
      <description>&lt;p&gt;"Premature optimization is the root of all evil." However, in cloud migrations, the abstraction of resources often hides the physical limitations of the underlying hardware. For latency-sensitive legacy runtimes, these abstractions can become a performance bottleneck.&lt;/p&gt;

&lt;p&gt;This post analyzes a migration of a legacy &lt;strong&gt;.NET Framework 4.8.1&lt;/strong&gt; monolith from standalone EC2 instances to &lt;strong&gt;Windows Containers on AWS ECS&lt;/strong&gt;, where the choice of Fargate led to a 10x performance degradation.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Context: Infrastructure Modernization
&lt;/h3&gt;

&lt;p&gt;The primary goal was to achieve &lt;strong&gt;centralized deployment&lt;/strong&gt; and orchestration using AWS ECS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Constraints:&lt;/strong&gt; A migration to .NET 6+ was rejected due to cost and time constraints. The mandate was to containerize the existing .NET 4.8.1 codebase "as-is."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Path:&lt;/strong&gt; Migration from legacy EC2 setups to Windows Containers on ECS Fargate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Stack:&lt;/strong&gt; .NET 4.8.1, Razor Pages, Windows Server Core images.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. The Symptom: Consistent 20-Second Latency
&lt;/h3&gt;

&lt;p&gt;Post-migration, page rendering latency spiked to &lt;strong&gt;20 seconds&lt;/strong&gt;. This was not a cold-start issue; the delay remained constant across all requests in a steady state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Metrics Trap:&lt;/strong&gt;&lt;br&gt;
CloudWatch (Monitoring Details) showed a stable CPU Utilization plateau at &lt;strong&gt;~30%&lt;/strong&gt;. Increasing the task size to 4 vCPUs provided zero improvement. The response time remained static, while the total CPU Utilization metric dropped proportionally, creating a false impression of idle capacity.&lt;/p&gt;

&lt;p&gt;This is a classic case where &lt;strong&gt;average is the enemy of understanding&lt;/strong&gt;. The aggregate metric averages across all vCPUs, so a single saturated execution thread disappears into it.&lt;/p&gt;
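&lt;p&gt;The arithmetic behind this trap is simple enough to sketch. The snippet below is an illustration in Go, not part of the migration tooling, and the function name is made up. It shows how one fully pinned core shrinks in the aggregate metric as vCPUs are added:&lt;/p&gt;

```go
package main

import "fmt"

// aggregateUtilization returns the task-level CPU percentage when
// busyCores cores are fully pinned out of totalCores.
func aggregateUtilization(busyCores, totalCores float64) float64 {
	return busyCores / totalCores * 100
}

func main() {
	// One saturated thread, increasingly large task sizes.
	for _, vcpus := range []float64{1, 2, 4, 8} {
		fmt.Printf("%v vCPU: one pinned core shows as %.1f%% aggregate\n",
			vcpus, aggregateUtilization(1, vcpus))
	}
}
```

&lt;p&gt;Doubling the vCPU count halves the reported percentage, while the pinned thread is exactly as slow as before.&lt;/p&gt;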




&lt;h3&gt;
  
  
  3. Investigation: Eliminating Secondary Bottlenecks
&lt;/h3&gt;

&lt;p&gt;Before attributing the latency to CPU frequency, we ruled out other infrastructure constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage I/O:&lt;/strong&gt; Legacy Razor engines read a large number of &lt;code&gt;.cshtml&lt;/code&gt; files during execution. We verified storage throughput and ephemeral disk metrics to ensure we weren't hitting limits on ephemeral storage, which could cause "stuttering" during file access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Latency:&lt;/strong&gt; Using &lt;code&gt;netstat&lt;/code&gt; and monitoring Time to First Byte (TTFB) for backend calls, we confirmed that the 20s delay was happening strictly during the internal rendering phase, not during database communication or network negotiation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread Saturation:&lt;/strong&gt; Per-process performance counters showed &lt;strong&gt;one worker thread pinned at 100% CPU&lt;/strong&gt; while the total container utilization remained low.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Root Cause: Abstraction Mismatch
&lt;/h3&gt;

&lt;p&gt;The bottleneck resulted from an architectural mismatch between a legacy runtime and a fully abstracted compute layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-Threaded Rendering Path&lt;/strong&gt;&lt;br&gt;
The rendering path of our legacy Razor views was effectively CPU-bound and largely single-threaded. In a 4-vCPU environment, the request pipeline exhibited limited parallelism during view rendering, meaning the entire request was gated by the throughput of a single core.&lt;/p&gt;
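&lt;p&gt;A rough way to reason about this is Amdahl's law. The Go sketch below assumes, purely for illustration, that only 5% of a request can run in parallel. That figure is hypothetical and not a measurement from our system:&lt;/p&gt;

```go
package main

import "fmt"

// amdahlSpeedup returns the theoretical speedup on the given number of
// cores when parallelFraction of the work can run in parallel.
func amdahlSpeedup(parallelFraction float64, cores int) float64 {
	serial := 1 - parallelFraction
	return 1 / (serial + parallelFraction/float64(cores))
}

func main() {
	const assumedParallel = 0.05 // hypothetical value, not measured
	for _, cores := range []int{1, 2, 4, 16} {
		fmt.Printf("%2d cores: %.3fx\n", cores, amdahlSpeedup(assumedParallel, cores))
	}
}
```

&lt;p&gt;With a mostly serial path the speedup stays close to 1x no matter how many cores are added, which matches what we saw when resizing the task.&lt;/p&gt;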

&lt;p&gt;&lt;strong&gt;The Abstraction Deficit&lt;/strong&gt;&lt;br&gt;
The issue was not that "Fargate is slow," but rather that &lt;strong&gt;Fargate abstracts away CPU characteristics&lt;/strong&gt; that were critical for this specific workload.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-core Variability:&lt;/strong&gt; Fargate provides abstract compute units. For modern asynchronous workloads, this is ideal. For legacy synchronous tasks, the inability to control the CPU class or guarantee a high base clock speed introduces unacceptable latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling Overhead:&lt;/strong&gt; Windows Container overhead, combined with the lack of control over the underlying hardware, meant we couldn't guarantee the raw single-core throughput required for the monolith’s rendering engine.&lt;/li&gt;
&lt;/ul&gt;
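&lt;p&gt;To compare per-core throughput between compute options before committing to one, a tiny single-threaded probe is often enough. The following Go sketch is an illustration rather than the tooling used in this migration. It times a fixed chain of SHA-256 hashes on a single goroutine, and the round count is an arbitrary choice:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// hashChain feeds each digest back into SHA-256, so every round depends
// on the previous one and the work cannot be spread across cores.
func hashChain(rounds int) [32]byte {
	sum := sha256.Sum256([]byte("probe"))
	for i := 0; i != rounds; i++ {
		sum = sha256.Sum256(sum[:])
	}
	return sum
}

func main() {
	start := time.Now()
	sum := hashChain(1000000)
	// The check bytes keep the result in use and make runs comparable.
	fmt.Printf("1M chained hashes in %v (check %x)\n", time.Since(start), sum[:4])
}
```

&lt;p&gt;Running the same binary on each candidate platform gives a crude but direct comparison of single-core speed, independent of the aggregate CPU metric.&lt;/p&gt;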




&lt;h3&gt;
  
  
  5. The Solution: c7a.xlarge (EC2 Launch Type)
&lt;/h3&gt;

&lt;p&gt;To resolve the latency without refactoring the code, we moved the workload to &lt;strong&gt;ECS on EC2&lt;/strong&gt; using &lt;strong&gt;c7a.xlarge&lt;/strong&gt; instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why c7a (AMD EPYC Genoa):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Frequency:&lt;/strong&gt; A known instance family with high sustained clock speed, instead of whatever hardware the abstraction happens to assign.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-Core Performance:&lt;/strong&gt; The 4th Gen AMD EPYC architecture provided significantly stronger per-core throughput for this workload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;br&gt;
Rendering latency dropped from &lt;strong&gt;20 seconds to 1.5 seconds&lt;/strong&gt;. We achieved our goal of centralized ECS deployment without sacrificing performance.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud abstractions work exceptionally well for horizontally scalable workloads.&lt;/li&gt;
&lt;li&gt;But many legacy runtimes still encode assumptions about single-core throughput, scheduling behavior, and hardware consistency.&lt;/li&gt;
&lt;li&gt;When migrating these systems, infrastructure selection becomes part of application performance engineering, not just operations.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dotnet</category>
      <category>aws</category>
      <category>performance</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>.NET 10 Performance: The O(n^2) String Trap and the Zero-Allocation Quest</title>
      <dc:creator>Aleh Karachun</dc:creator>
      <pubDate>Sat, 21 Mar 2026 13:08:00 +0000</pubDate>
      <link>https://dev.to/aleh_karachun/net-10-performance-the-on2-string-trap-and-the-zero-allocation-quest-3cjh</link>
      <guid>https://dev.to/aleh_karachun/net-10-performance-the-on2-string-trap-and-the-zero-allocation-quest-3cjh</guid>
      <description>&lt;p&gt;"Premature optimization is the root of all evil." We’ve all heard it. But in the world of high-load cloud systems and serverless environments, there is another truth: "Ignoring scalability is the root of a massive AWS bill."&lt;/p&gt;

&lt;p&gt;Today, we are doing a deep dive into .NET 10 string manipulation. We’ll explore how a simple &lt;code&gt;+=&lt;/code&gt; can turn your performance into a disaster and how to achieve Zero-Allocation using modern C# features.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Big Picture: Scaling is a Cliff
&lt;/h3&gt;

&lt;p&gt;In computer science, &lt;em&gt;O(n)&lt;/em&gt; vs &lt;em&gt;O(n^2)&lt;/em&gt; is often treated as academic theory. But when you visualize it, theory becomes a cold, hard reality. We compared three contenders:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classic Concatenation:&lt;/strong&gt; The quadratic &lt;em&gt;O(n^2)&lt;/em&gt; path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StringBuilder:&lt;/strong&gt; The standard heap-allocated buffer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ValueStringBuilder (Optimized):&lt;/strong&gt; A &lt;code&gt;ref struct&lt;/code&gt; living entirely on the stack.&lt;/li&gt;
&lt;/ol&gt;
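&lt;p&gt;The quadratic cost is easy to model. The benchmarks in this post are C#, so the Go sketch below is only a language-neutral illustration, with made-up function names and a 4-byte piece. It compares the total bytes a naive &lt;code&gt;+=&lt;/code&gt; loop has to allocate with the size of a single pre-sized buffer:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// concatCopyCost models naive += concatenation: step i allocates a new
// string of i*pieceLen bytes, so the total grows quadratically with n.
func concatCopyCost(n, pieceLen int) int {
	return pieceLen * n * (n + 1) / 2
}

// buildLinear appends n copies of piece through one growable buffer.
func buildLinear(piece string, n int) string {
	var sb strings.Builder
	sb.Grow(len(piece) * n) // single up-front allocation
	for i := 0; i != n; i++ {
		sb.WriteString(piece)
	}
	return sb.String()
}

func main() {
	for _, n := range []int{10, 1000, 10000} {
		fmt.Printf("n=%d naive=%d bytes, builder=%d bytes\n",
			n, concatCopyCost(n, 4), len(buildLinear("abcd", n)))
	}
}
```

&lt;p&gt;Going from 1,000 to 10,000 appends multiplies the builder's output by 10 but the naive total by roughly 100, which is the cliff the charts below show.&lt;/p&gt;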

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9tmyi6iuteavf9iaxh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9tmyi6iuteavf9iaxh1.png" alt="This chart visualizes " width="800" height="300"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 1.&lt;/strong&gt; Scaling performance overview.&lt;/p&gt;

&lt;p&gt;If the log scale feels too abstract, look at the linear reality at &lt;em&gt;N=10,000&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r4v3fpxtly2id6crd1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r4v3fpxtly2id6crd1s.png" alt="This represents " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 2.&lt;/strong&gt; Linear comparison at maximum scale.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The Micro-Scale Paradox (&lt;em&gt;N=10&lt;/em&gt;)
&lt;/h3&gt;

&lt;p&gt;Engineering is about choosing the right tool for the right job. On a tiny scale &lt;em&gt;(N=10)&lt;/em&gt;, our "super-optimized" approach actually loses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UseStringBuilder: 32.30 ns&lt;/li&gt;
&lt;li&gt;UseStringConcatenation: 52.95 ns&lt;/li&gt;
&lt;li&gt;UseValueStringBuilder_Optimized: ~107 ns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Paradox Explained:&lt;/strong&gt;&lt;br&gt;
Why does the "optimized" method lose here? It comes down to the "Setup Tax." Initializing a &lt;code&gt;ref struct&lt;/code&gt; and preparing a &lt;code&gt;stackalloc&lt;/code&gt; buffer takes more time than the actual string processing when &lt;em&gt;N&lt;/em&gt; is small.&lt;/p&gt;

&lt;p&gt;Meanwhile, &lt;strong&gt;StringBuilder&lt;/strong&gt; in .NET 10 has been heavily tuned for small-scale operations. It manages to avoid the heavy allocations of &lt;code&gt;+=&lt;/code&gt; while bypassing the complex initialization required by our manual stack-based approach. At this scale, the runtime's built-in optimizations are simply more efficient than manual memory management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhdyg4qwqd2ea0ro6jgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhdyg4qwqd2ea0ro6jgm.png" alt="This is the " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 3.&lt;/strong&gt; Execution time distribution for &lt;em&gt;N=10&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Don't over-engineer for the small stuff. For small-scale formatting or log messages, standard library tools provide the best balance of performance and maintainability.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. The "GC Fingerprint" (&lt;em&gt;N=10,000&lt;/em&gt;)
&lt;/h3&gt;

&lt;p&gt;When we scale to 10,000 operations, the masks come off. String concatenation at this scale allocates 379.4 MB of garbage. The resulting GC pauses produce a bimodal (two-humped) latency distribution, which we call the "Camel Effect" on our density plots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5e6zkta6shxlamnsrfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5e6zkta6shxlamnsrfe.png" alt="This is " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 4.&lt;/strong&gt; Impact of Garbage Collection on latency.&lt;/p&gt;

&lt;p&gt;Now, compare this to the optimized Zero-Allocation method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfqdw5p54ybjq9g3btyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfqdw5p54ybjq9g3btyo.png" alt="This is " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Figure 5.&lt;/strong&gt; Predictability of zero-allocation execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note on hardware physics:&lt;/strong&gt; Even in Figure 5, where Zero-Allocation is achieved, a microscopic "tail" of jitter is still visible on the right. This isn't the Garbage Collector; it is the "physics of the hardware". OS interrupts, CPU context switching, and cache misses introduce these unavoidable micro-fluctuations. However, compared to the "Camel Effect" of GC pauses, this is just statistical noise, confirming the almost perfect predictability of our approach.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Engineering for Zero-Allocation
&lt;/h3&gt;

&lt;p&gt;How did we achieve this? By staying off the managed heap until the final &lt;code&gt;ToString()&lt;/code&gt; call. We combined three pillars of modern .NET:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;ref struct&lt;/code&gt;: Ensures our builder never escapes to the heap.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stackalloc char[512]&lt;/code&gt;: Allocates the initial buffer directly on the stack.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ISpanFormattable&lt;/code&gt;: Writes data directly into memory via &lt;code&gt;TryFormat&lt;/code&gt;, avoiding intermediate &lt;code&gt;ToString()&lt;/code&gt; allocations.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReadOnlySpan&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 1. Initial buffer on the stack&lt;/span&gt;
    &lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;stackalloc&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;512&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;vsb&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ValueStringBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
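        &lt;span class="c1"&gt;// 2. Zero-allocation formatting: reserve a slot, then trim it to what was written&lt;/span&gt;
        &lt;span class="c1"&gt;// (assumes a settable Length, as on the .NET runtime's internal ValueStringBuilder)&lt;/span&gt;
        &lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;slot&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vsb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AppendSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Amount&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryFormat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;written&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;vsb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;slot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;written&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;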
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// 3. Final result (the only allocation)&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vsb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
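&lt;p&gt;The same idea exists outside .NET. As a point of comparison with the Go contender from this series, here is a minimal sketch (an analogy, not a port of the benchmark code) in which &lt;code&gt;strconv.AppendInt&lt;/code&gt; plays the role of &lt;code&gt;TryFormat&lt;/code&gt; and a fixed-size array plays the role of the &lt;code&gt;stackalloc&lt;/code&gt; buffer. Modeling amounts as integer cents and the &lt;code&gt;;&lt;/code&gt; separator are assumptions:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
)

// formatAmounts appends each amount in decimal to dst without creating
// intermediate strings, then returns the extended slice.
func formatAmounts(dst []byte, amounts []int64) []byte {
	for _, a := range amounts {
		dst = strconv.AppendInt(dst, a, 10)
		dst = append(dst, ';')
	}
	return dst
}

func main() {
	var backing [256]byte // fixed-size array; the compiler may keep it on the stack
	out := formatAmounts(backing[:0], []int64{125, -40, 9000})
	fmt.Println(string(out)) // the string conversion is the one copy at the end
}
```

&lt;p&gt;As in the C# version, the formatting loop itself writes straight into the caller's buffer, and only the final conversion to a string produces a new value.&lt;/p&gt;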






&lt;h3&gt;
  
  
  Conclusion: Be Pragmatic
&lt;/h3&gt;

&lt;p&gt;The benchmark results demonstrate that the optimal string manipulation strategy depends entirely on the expected data volume and system requirements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small scale (&lt;em&gt;N &amp;lt; 50&lt;/em&gt;):&lt;/strong&gt; &lt;strong&gt;StringBuilder&lt;/strong&gt; is technically the winner, offering 40% better performance and 50% fewer allocations than simple concatenation. However, &lt;strong&gt;concatenation&lt;/strong&gt; remains an acceptable choice for one-off tasks where code readability is the top priority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium scale (&lt;em&gt;N &amp;lt; 1000&lt;/em&gt;):&lt;/strong&gt; &lt;strong&gt;StringBuilder&lt;/strong&gt; remains the standard efficient approach for general-purpose applications, providing linear scaling with manageable heap pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-performance / High-load:&lt;/strong&gt; Implementing &lt;strong&gt;Zero-Allocation&lt;/strong&gt; patterns (e.g., &lt;code&gt;ValueStringBuilder&lt;/code&gt;) is critical for systems with strict latency requirements. This approach removes the bimodal latency distribution caused by Garbage Collection, giving more predictable execution times and a lower allocation rate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final decision-making should balance &lt;strong&gt;code complexity&lt;/strong&gt; against &lt;strong&gt;predictability&lt;/strong&gt;. For high-concurrency environments like AWS Lambda, bypassing the managed heap is a primary strategy for cost and latency optimization.&lt;/p&gt;

&lt;p&gt;The full source code and raw BenchmarkDotNet data are available on my GitHub:&lt;br&gt;
👉 &lt;a href="https://github.com/olegKarachun/dotnet-string-optimization-benchmarks" rel="noopener noreferrer"&gt;https://github.com/olegKarachun/dotnet-string-optimization-benchmarks&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>performance</category>
      <category>programming</category>
      <category>aws</category>
    </item>
    <item>
      <title>Battle of the Titans (Part 1): The Ultimate Go Lambda on AWS Graviton</title>
      <dc:creator>Aleh Karachun</dc:creator>
      <pubDate>Thu, 19 Mar 2026 17:04:03 +0000</pubDate>
      <link>https://dev.to/aleh_karachun/battle-of-the-titans-part-1-the-ultimate-go-lambda-on-aws-graviton-2632</link>
      <guid>https://dev.to/aleh_karachun/battle-of-the-titans-part-1-the-ultimate-go-lambda-on-aws-graviton-2632</guid>
      <description>&lt;p&gt;Hi everyone! Welcome to the first part of my series exploring AWS Lambda performance. My goal is to compare Go and .NET Native AOT in a realistic serverless environment.&lt;/p&gt;

&lt;p&gt;To make this a fair benchmark, we aren't just deploying a "Hello World" function. Our Lambda simulates a realistic production task: it deserializes a JSON payload of financial transactions, filters them, calculates the total amount, and computes a SHA-256 hash of the IDs to generate a signature (simulating CPU load).&lt;/p&gt;
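&lt;p&gt;The core of that workload can be sketched with the standard library alone. The version below is a simplified illustration, not the code from the repository: the field names, the "completed" filter, and amounts in integer cents are all assumptions, and the Lambda handler wiring is left out:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Transaction is a hypothetical payload shape; the field names are assumptions.
type Transaction struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	Cents  int64  `json:"cents"`
}

// process keeps completed transactions, sums them, and hashes their IDs.
func process(payload []byte) (int64, string, error) {
	txs := new([]Transaction) // pointer target for json.Unmarshal
	if err := json.Unmarshal(payload, txs); err != nil {
		return 0, "", err
	}
	var total int64
	h := sha256.New()
	for _, tx := range *txs {
		if tx.Status != "completed" {
			continue
		}
		total += tx.Cents
		h.Write([]byte(tx.ID))
	}
	return total, hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	payload := []byte(`[{"id":"a1","status":"completed","cents":1250},{"id":"b2","status":"failed","cents":900}]`)
	total, sig, err := process(payload)
	fmt.Println(total, sig, err)
}
```

&lt;p&gt;Keeping this logic in a plain function makes it easy to benchmark locally, separately from the Lambda runtime itself.&lt;/p&gt;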

&lt;p&gt;Today, we are focusing on setting up and optimizing the Go contender on &lt;strong&gt;ARM64 (Graviton)&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Infrastructure (AWS SAM)
&lt;/h3&gt;

&lt;p&gt;We use AWS SAM (Serverless Application Model) to define our infrastructure. It allows us to describe resources declaratively and generates the underlying CloudFormation template.&lt;/p&gt;

&lt;p&gt;Here is the core of our &lt;code&gt;template.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;CodeUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bin/&lt;/span&gt;
&lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bootstrap&lt;/span&gt;
&lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;provided.al2023&lt;/span&gt;
&lt;span class="na"&gt;Architectures&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arm64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Key takeaways
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Runtime: provided.al2023&lt;/code&gt;: Amazon Linux 2023 is currently the recommended minimalist OS for compiled languages in AWS. It boots significantly faster than the legacy &lt;code&gt;go1.x&lt;/code&gt; runtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Architectures: arm64&lt;/code&gt;: Targeting AWS Graviton processors. They use a RISC architecture that typically provides around 20% better price/performance for serverless workloads compared to &lt;code&gt;x86_64&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Handler: bootstrap&lt;/code&gt;: When using custom runtimes, AWS Lambda expects the executable binary inside the deployment package to be named exactly &lt;code&gt;bootstrap&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Compiling for Lambda
&lt;/h3&gt;

&lt;p&gt;A standard &lt;code&gt;go build&lt;/code&gt; works, but we can optimize it further for the Lambda environment. Here is the command we use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;GOOS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linux &lt;span class="nv"&gt;GOARCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arm64 go build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-tags&lt;/span&gt; lambda.norpc &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ldflags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-s -w"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; bin/bootstrap main.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Key takeaways
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;GOOS=linux GOARCH=arm64&lt;/code&gt;: This enables cross-compilation, allowing us to build a Linux ARM64 binary directly from our local machine (even if it's x86).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;-tags lambda.norpc&lt;/code&gt;: The &lt;code&gt;al2023&lt;/code&gt; runtime communicates with the Lambda service via an internal HTTP API. This tag tells the compiler to drop the legacy RPC compatibility code from the &lt;code&gt;aws-lambda-go&lt;/code&gt; library, reducing the binary size and initialization time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;-ldflags="-s -w"&lt;/code&gt;: These linker flags strip the symbol table and debug information, resulting in a leaner binary that loads into memory faster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Local Testing and the "Error 255"
&lt;/h3&gt;

&lt;p&gt;If you develop on an x86 (Intel/AMD) machine and try to test this locally using &lt;code&gt;sam local invoke&lt;/code&gt;, you will likely hit a &lt;strong&gt;Fatal Error 255&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This happens because the Docker container spins up an ARM64 environment, but your host CPU cannot natively execute ARM instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; We need a translator. Registering QEMU's user-mode emulators via the &lt;code&gt;multiarch/qemu-user-static&lt;/code&gt; Docker image solves this (typically &lt;code&gt;docker run --rm --privileged multiarch/qemu-user-static --reset -p yes&lt;/code&gt;). QEMU then translates the ARM instructions into x86 instructions for your host CPU on the fly, allowing you to test the production binary locally, albeit more slowly than on real ARM hardware.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Anatomy of a Cold Start
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;sam deploy --guided&lt;/code&gt;, AWS packages the binary, uploads it to S3, and updates the CloudFormation stack. But the most interesting part happens on the first invocation.&lt;/p&gt;

&lt;p&gt;When we triggered the Lambda, CloudWatch reported an &lt;strong&gt;Init Duration of ~60 ms.&lt;/strong&gt;&lt;/p&gt;

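&lt;p&gt;On that first invocation, AWS performed the following steps. Note that the reported Init Duration mainly covers the last phase (initializing the runtime and our binary); provisioning the environment and downloading the code happen before that timer starts:&lt;/p&gt;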

&lt;ol&gt;
&lt;li&gt;Allocated a Graviton-based server.&lt;/li&gt;
&lt;li&gt;Provisioned an isolated Firecracker microVM.&lt;/li&gt;
&lt;li&gt;Downloaded the deployment zip from S3 and extracted it.&lt;/li&gt;
&lt;li&gt;Booted the &lt;code&gt;provided.al2023&lt;/code&gt; OS.&lt;/li&gt;
&lt;li&gt;Loaded our &lt;code&gt;bootstrap&lt;/code&gt; binary into memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the environment was warm, subsequent invocations (Warm Starts) took roughly &lt;strong&gt;2 ms&lt;/strong&gt; of compute time with a memory footprint of about &lt;strong&gt;19 MB.&lt;/strong&gt;&lt;/p&gt;
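&lt;p&gt;Init Duration appears in the &lt;code&gt;REPORT&lt;/code&gt; line that Lambda writes to CloudWatch Logs, and only for cold starts. Below is a minimal Go sketch for pulling it out of such a line. The sample lines are abbreviated and made up for illustration; they are not measurements from this post:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var initRe = regexp.MustCompile(`Init Duration: ([0-9.]+) ms`)

// initDurationMs returns the Init Duration from a REPORT line, and false
// when the field is absent (as it is on warm invocations).
func initDurationMs(reportLine string) (float64, bool) {
	m := initRe.FindStringSubmatch(reportLine)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	// Made-up, abbreviated sample lines for illustration only.
	cold := "REPORT RequestId: example Duration: 5.00 ms Max Memory Used: 19 MB Init Duration: 58.00 ms"
	warm := "REPORT RequestId: example Duration: 2.00 ms Max Memory Used: 19 MB"
	fmt.Println(initDurationMs(cold))
	fmt.Println(initDurationMs(warm))
}
```

&lt;p&gt;Collecting this value over many cold starts gives a distribution rather than a single number, which is a safer basis for comparing runtimes.&lt;/p&gt;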




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Go on ARM64 with the AL2023 runtime provides an excellent baseline. With extremely low memory consumption and an Init Duration of roughly 60 ms, it is a highly efficient choice for serverless APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next?
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, we will set up our challenger: &lt;strong&gt;.NET 10 Native AOT&lt;/strong&gt;. We will explore how to configure the C# project with Zero-Allocation techniques and Source Generators to see if it can match or beat Go's numbers.&lt;/p&gt;

&lt;p&gt;The full source code for this setup is available in my GitHub repository:&lt;br&gt;
👉 &lt;a href="https://github.com/olegKarachun/aws-lambda-go-graviton" rel="noopener noreferrer"&gt;https://github.com/olegKarachun/aws-lambda-go-graviton&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>aws</category>
      <category>serverless</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
