<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Iwan Setiawan</title>
    <description>The latest articles on DEV Community by Iwan Setiawan (@ionehouten).</description>
    <link>https://dev.to/ionehouten</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812152%2F051726c8-4f14-4b25-84d3-5ea5f2104f74.jpg</url>
      <title>DEV Community: Iwan Setiawan</title>
      <link>https://dev.to/ionehouten</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ionehouten"/>
    <language>en</language>
    <item>
      <title>Bare Metal vs. AWS RDS: A Deep Dive into NUMA-Aware Tuning and PostgreSQL Performance (Part 2)</title>
      <dc:creator>Iwan Setiawan</dc:creator>
      <pubDate>Mon, 16 Mar 2026 06:52:44 +0000</pubDate>
      <link>https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-part-2-2daa</link>
      <guid>https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-part-2-2daa</guid>
      <description>&lt;h2&gt;
  
  
  Bare Metal vs. AWS RDS: CPU/NUMA Pinning and HugePages — How We Beat Aurora on Write Throughput (Part 2)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In &lt;a href="https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-1fil"&gt;Part 1&lt;/a&gt;, we established storage baselines — Local SSD vs Longhorn vs AWS managed PostgreSQL. This article goes deeper: CPU/NUMA pinning and HugePages push bare metal write performance past Aurora IO-Optimized at every concurrency level.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In Part 1, we ended with &lt;strong&gt;CNPG Local SSD&lt;/strong&gt; — bare metal with direct-attached storage and AWS-matched PostgreSQL config. Already leading Aurora on write TPS at baseline. The question was: how much further can we push it without adding hardware?&lt;/p&gt;

&lt;p&gt;Two steps. Significant results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup Recap
&lt;/h2&gt;

&lt;p&gt;Same constraint as Part 1: &lt;strong&gt;2 vCPU / 8 GB RAM&lt;/strong&gt;, single instance, no HA. Same PostgreSQL config matched to AWS defaults. Same benchmark: &lt;code&gt;pgbench&lt;/code&gt; · scale factor 100 · 60s per run · 39 runs · ap-southeast-3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where we left off — CNPG Local SSD (Baseline):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;749&lt;/td&gt;
&lt;td&gt;1.34 ms&lt;/td&gt;
&lt;td&gt;134&lt;/td&gt;
&lt;td&gt;7.48 ms&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;10.10 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;7,675&lt;/td&gt;
&lt;td&gt;1.30 ms&lt;/td&gt;
&lt;td&gt;1,425&lt;/td&gt;
&lt;td&gt;7.02 ms&lt;/td&gt;
&lt;td&gt;1,031&lt;/td&gt;
&lt;td&gt;9.70 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;6,788&lt;/td&gt;
&lt;td&gt;3.68 ms&lt;/td&gt;
&lt;td&gt;1,560&lt;/td&gt;
&lt;td&gt;16.02 ms&lt;/td&gt;
&lt;td&gt;1,073&lt;/td&gt;
&lt;td&gt;23.30 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;6,430&lt;/td&gt;
&lt;td&gt;7.78 ms&lt;/td&gt;
&lt;td&gt;1,550&lt;/td&gt;
&lt;td&gt;32.27 ms&lt;/td&gt;
&lt;td&gt;996&lt;/td&gt;
&lt;td&gt;50.18 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;6,092&lt;/td&gt;
&lt;td&gt;16.41 ms&lt;/td&gt;
&lt;td&gt;1,464&lt;/td&gt;
&lt;td&gt;68.32 ms&lt;/td&gt;
&lt;td&gt;902&lt;/td&gt;
&lt;td&gt;110.92 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The 3-Layer Tuning Stack
&lt;/h2&gt;

&lt;p&gt;Most PostgreSQL performance articles stop at database config. This one goes deeper.&lt;/p&gt;

&lt;p&gt;The performance gains in this article come from tuning at &lt;strong&gt;three layers simultaneously&lt;/strong&gt; — bare metal KVM hypervisor, VM/OS, and Kubernetes pod spec. Each layer is required for the next to work correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: KVM Hypervisor (Bare Metal Host)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;domain&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;'kvm'&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="c"&gt;&amp;lt;!-- NUMA: all VM memory from NUMA node 1 only --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;numatune&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;memory&lt;/span&gt; &lt;span class="na"&gt;mode=&lt;/span&gt;&lt;span class="s"&gt;'strict'&lt;/span&gt; &lt;span class="na"&gt;nodeset=&lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/numatune&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- CPU pinning: each vCPU mapped to specific physical core --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;vcpu&lt;/span&gt; &lt;span class="na"&gt;placement=&lt;/span&gt;&lt;span class="s"&gt;'static'&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;8&lt;span class="nt"&gt;&amp;lt;/vcpu&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;cputune&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;vcpupin&lt;/span&gt; &lt;span class="na"&gt;vcpu=&lt;/span&gt;&lt;span class="s"&gt;'0'&lt;/span&gt; &lt;span class="na"&gt;cpuset=&lt;/span&gt;&lt;span class="s"&gt;'8'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;vcpupin&lt;/span&gt; &lt;span class="na"&gt;vcpu=&lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt; &lt;span class="na"&gt;cpuset=&lt;/span&gt;&lt;span class="s"&gt;'9'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- cores 8-13, 28-29 — all on NUMA node 1 --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/cputune&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- HugePages: VM uses host HugePages, memory locked (no swap) --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;memoryBacking&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;hugepages/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;locked/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/memoryBacking&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- host-passthrough: CPU features exposed directly to VM --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;cpu&lt;/span&gt; &lt;span class="na"&gt;mode=&lt;/span&gt;&lt;span class="s"&gt;'host-passthrough'&lt;/span&gt; &lt;span class="na"&gt;check=&lt;/span&gt;&lt;span class="s"&gt;'none'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Disable memory ballooning: hypervisor cannot steal VM memory --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;memballoon&lt;/span&gt; &lt;span class="na"&gt;model=&lt;/span&gt;&lt;span class="s"&gt;'none'&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/domain&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this achieves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mode='strict' nodeset='1'&lt;/code&gt; — zero cross-NUMA memory access. PostgreSQL shared buffer pool and its pinned CPU cores are on the same NUMA node. This is the primary driver of the 7.48ms → 1.81ms write latency drop at 1 client.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;locked/&amp;gt;&lt;/code&gt; — VM memory is non-swappable. Shared buffer pool stays in RAM permanently.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;host-passthrough&lt;/code&gt; — VM inherits host CPU instruction set, hardware prefetcher, and cache optimization directly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memballoon model='none'&lt;/code&gt; — hypervisor cannot reclaim memory from this VM for other VMs. Fixed allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 2: VM / OS Level
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/sysctl.conf on the VM&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"vm.nr_hugepages = 8192"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sysctl.conf
sysctl &lt;span class="nt"&gt;-p&lt;/span&gt;

&lt;span class="c"&gt;# CPU governor&lt;/span&gt;
cpupower frequency-set &lt;span class="nt"&gt;-g&lt;/span&gt; performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HugePages should be pre-allocated before PostgreSQL starts; allocating them on demand can fail once physical memory is fragmented, which is why boot-time allocation is the safe default. &lt;code&gt;8192 × 2MB = 16GB&lt;/code&gt; pre-allocated, enough to cover the 8Gi hugepages requested by the pod with headroom. The performance governor eliminates clock-speed throttling for bursty query patterns.&lt;/p&gt;
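&lt;p&gt;The sizing arithmetic can be sanity-checked in shell (variable names here are illustrative; the figures come from the config above):&lt;/p&gt;

```shell
# HugePages sizing check: 2 MB pages, 8Gi pod request, 8192-page OS pool.
page_kb=2048                      # one 2 MB HugePage, in KB
pod_request_gib=8                 # hugepages-2Mi: 8Gi from the pod spec
nr_hugepages=8192                 # vm.nr_hugepages from sysctl.conf

pages_needed=$(( pod_request_gib * 1024 * 1024 / page_kb ))
pool_gib=$(( nr_hugepages * page_kb / 1024 / 1024 ))

echo "pages needed by the pod: $pages_needed"                       # 4096
echo "pre-allocated pool:      $nr_hugepages pages = $pool_gib GiB" # 16 GiB
```

&lt;p&gt;4096 pages cover the pod request; the 8192-page pool leaves a 2× margin.&lt;/p&gt;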

&lt;h3&gt;
  
  
  Layer 3: Kubernetes Pod Spec
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
    &lt;span class="na"&gt;hugepages-2Mi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8Gi&lt;/span&gt;    &lt;span class="c1"&gt;# request HugePages from OS pool&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;            &lt;span class="c1"&gt;# regular memory (non-huge) — separate accounting&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;               &lt;span class="c1"&gt;# requests = limits = Guaranteed QoS class&lt;/span&gt;
    &lt;span class="na"&gt;hugepages-2Mi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8Gi&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this achieves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requests = limits&lt;/code&gt; → &lt;strong&gt;Guaranteed QoS class&lt;/strong&gt; — Kubernetes will not evict this pod under memory pressure. Other pods die first.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hugepages-2Mi: 8Gi&lt;/code&gt; as a separate resource → HugePages are tracked independently from regular memory. The 6 GB shared_buffers fits within the 8Gi hugepages allocation with headroom.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpu: requests = limits&lt;/code&gt; → enables &lt;strong&gt;CPU Manager &lt;code&gt;static&lt;/code&gt; policy&lt;/strong&gt; — Kubernetes pins the pod to exclusive physical cores, which is what enables NUMA affinity at Layer 1 to be effective.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why All Three Layers Matter
&lt;/h3&gt;

&lt;p&gt;Remove any one layer and performance degrades:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Remove&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;KVM NUMA pinning&lt;/td&gt;
&lt;td&gt;Cross-NUMA memory access → +3–4 ms write latency on short transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;locked/&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory swappable → latency spikes under memory pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU governor&lt;/td&gt;
&lt;td&gt;Clock throttling → latency spikes on short transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes Guaranteed QoS&lt;/td&gt;
&lt;td&gt;Pod can be evicted or CPU throttled under node pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HugePages&lt;/td&gt;
&lt;td&gt;TLB pressure → higher latency at high concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why the benchmark results are reproducible but not trivially so — you need all three layers configured correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tuning 1: CPU/NUMA Pinning + Performance Governor
&lt;/h2&gt;

&lt;p&gt;At baseline, PostgreSQL was allocated 2 vCPU with no CPU affinity — running on whatever cores the kernel scheduled, potentially crossing NUMA boundaries on every memory access, with clock speed throttled by the default &lt;code&gt;powersave&lt;/code&gt; governor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three changes applied simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Dedicated CPU cores (Kubernetes CPU Manager: &lt;code&gt;static&lt;/code&gt; policy)&lt;/strong&gt;&lt;br&gt;
Pins the PostgreSQL pod to specific physical cores. Eliminates context switching with other workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. CPU governor: &lt;code&gt;powersave&lt;/code&gt; → &lt;code&gt;performance&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cpupower frequency-set &lt;span class="nt"&gt;-g&lt;/span&gt; performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default governor throttles clock speed at low load. Every short transaction pays a ramp-up penalty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. NUMA pinning&lt;/strong&gt;&lt;br&gt;
PostgreSQL processes pinned to cores on the same NUMA node as their memory allocation. Cross-NUMA memory access adds 30–40% latency, and our 32-core host spans multiple NUMA nodes, so an unpinned process pays that penalty constantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning 1 Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2,480&lt;/td&gt;
&lt;td&gt;0.40 ms&lt;/td&gt;
&lt;td&gt;552&lt;/td&gt;
&lt;td&gt;1.81 ms&lt;/td&gt;
&lt;td&gt;380&lt;/td&gt;
&lt;td&gt;2.63 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8,066&lt;/td&gt;
&lt;td&gt;1.24 ms&lt;/td&gt;
&lt;td&gt;1,909&lt;/td&gt;
&lt;td&gt;5.24 ms&lt;/td&gt;
&lt;td&gt;1,265&lt;/td&gt;
&lt;td&gt;7.91 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;7,770&lt;/td&gt;
&lt;td&gt;3.22 ms&lt;/td&gt;
&lt;td&gt;1,902&lt;/td&gt;
&lt;td&gt;13.14 ms&lt;/td&gt;
&lt;td&gt;1,233&lt;/td&gt;
&lt;td&gt;20.27 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;7,384&lt;/td&gt;
&lt;td&gt;6.77 ms&lt;/td&gt;
&lt;td&gt;1,786&lt;/td&gt;
&lt;td&gt;27.99 ms&lt;/td&gt;
&lt;td&gt;1,173&lt;/td&gt;
&lt;td&gt;42.62 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;6,939&lt;/td&gt;
&lt;td&gt;14.41 ms&lt;/td&gt;
&lt;td&gt;1,657&lt;/td&gt;
&lt;td&gt;60.36 ms&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;td&gt;93.87 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;vs Baseline:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Tuning 1&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg RO TPS&lt;/td&gt;
&lt;td&gt;6,111&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6,896&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg RW TPS&lt;/td&gt;
&lt;td&gt;1,355&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,659&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+22.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg RW Lat&lt;/td&gt;
&lt;td&gt;30.02 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.44 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-15.3%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RW Lat (1c)&lt;/td&gt;
&lt;td&gt;7.48 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.81 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-75.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The single-client write latency drop from 7.48ms to 1.81ms is the most dramatic — this is the NUMA penalty being eliminated. Short transactions no longer wait for cross-NUMA memory access.&lt;/p&gt;


&lt;h2&gt;
  
  
  Tuning 2: HugePages
&lt;/h2&gt;

&lt;p&gt;HugePages reduce TLB (Translation Lookaside Buffer) pressure by mapping PostgreSQL's shared buffer pool with 2 MB pages instead of the default 4 KB. Fewer TLB entries = fewer TLB misses under concurrent access.&lt;/p&gt;
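&lt;p&gt;The page-count math behind that claim, assuming the 6 GB shared_buffers used in this setup:&lt;/p&gt;

```shell
# How many page-table entries does a 6 GB buffer pool need?
sb_mib=6144                         # shared_buffers = 6 GB
pages_4k=$(( sb_mib * 1024 / 4 ))   # with default 4 KB pages
pages_2m=$(( sb_mib / 2 ))          # with 2 MB HugePages

echo "4 KB pages: $pages_4k"        # 1572864
echo "2 MB pages: $pages_2m"        # 3072
echo "reduction:  $(( pages_4k / pages_2m ))x"   # 512x fewer entries
```

&lt;p&gt;512× fewer entries means the buffer pool's hot working set fits far better into the CPU's limited TLB slots.&lt;/p&gt;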

&lt;p&gt;&lt;strong&gt;Enabled at three levels:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. VM OS — pre-allocate HugePages at boot&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"vm.nr_hugepages = 8192"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sysctl.conf
sysctl &lt;span class="nt"&gt;-p&lt;/span&gt;

&lt;span class="c"&gt;# 2. Pod Resources — request HugePages as dedicated Kubernetes resource&lt;/span&gt;
resources:
  limits:
    cpu: &lt;span class="s1"&gt;'2'&lt;/span&gt;
    hugepages-2Mi: 8Gi     &lt;span class="c"&gt;# request HugePages from OS pool&lt;/span&gt;
    memory: 4Gi            &lt;span class="c"&gt;# regular memory (non-huge) — separate accounting&lt;/span&gt;
  requests:
    cpu: &lt;span class="s1"&gt;'2'&lt;/span&gt;               &lt;span class="c"&gt;# requests = limits = Guaranteed QoS class&lt;/span&gt;
    hugepages-2Mi: 8Gi
    memory: 4Gi

&lt;span class="c"&gt;# 3. PostgreSQL&lt;/span&gt;
huge_pages &lt;span class="o"&gt;=&lt;/span&gt; on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;requests = limits&lt;/code&gt;?&lt;/strong&gt; This gives the pod &lt;strong&gt;Guaranteed QoS class&lt;/strong&gt; — Kubernetes will not evict or throttle it under resource pressure. It also enables CPU Manager &lt;code&gt;static&lt;/code&gt; policy to pin exclusive physical cores to this pod, which is what makes NUMA affinity effective.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Tuning 2 Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2,558&lt;/td&gt;
&lt;td&gt;0.39 ms&lt;/td&gt;
&lt;td&gt;562&lt;/td&gt;
&lt;td&gt;1.78 ms&lt;/td&gt;
&lt;td&gt;386&lt;/td&gt;
&lt;td&gt;2.59 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8,325&lt;/td&gt;
&lt;td&gt;1.20 ms&lt;/td&gt;
&lt;td&gt;1,903&lt;/td&gt;
&lt;td&gt;5.25 ms&lt;/td&gt;
&lt;td&gt;1,276&lt;/td&gt;
&lt;td&gt;7.84 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;8,205&lt;/td&gt;
&lt;td&gt;3.05 ms&lt;/td&gt;
&lt;td&gt;1,954&lt;/td&gt;
&lt;td&gt;12.79 ms&lt;/td&gt;
&lt;td&gt;1,254&lt;/td&gt;
&lt;td&gt;19.94 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;7,892&lt;/td&gt;
&lt;td&gt;6.34 ms&lt;/td&gt;
&lt;td&gt;1,875&lt;/td&gt;
&lt;td&gt;26.67 ms&lt;/td&gt;
&lt;td&gt;1,215&lt;/td&gt;
&lt;td&gt;41.16 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;7,485&lt;/td&gt;
&lt;td&gt;13.36 ms&lt;/td&gt;
&lt;td&gt;1,725&lt;/td&gt;
&lt;td&gt;57.97 ms&lt;/td&gt;
&lt;td&gt;1,111&lt;/td&gt;
&lt;td&gt;90.01 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;vs Tuning 1:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Tuning 1&lt;/th&gt;
&lt;th&gt;Tuning 2&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg RO TPS&lt;/td&gt;
&lt;td&gt;6,896&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7,232&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg RW TPS&lt;/td&gt;
&lt;td&gt;1,659&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,706&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+2.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg RW Lat&lt;/td&gt;
&lt;td&gt;25.44 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.44 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-3.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Incremental improvement — HugePages reduce TLB contention at high concurrency. The impact is smaller than NUMA pinning but consistent across all workload types.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Tuning Journey
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;th&gt;Avg RO TPS&lt;/th&gt;
&lt;th&gt;Avg RW TPS&lt;/th&gt;
&lt;th&gt;Avg RW Lat&lt;/th&gt;
&lt;th&gt;Overall Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (Local SSD)&lt;/td&gt;
&lt;td&gt;AWS-matched config&lt;/td&gt;
&lt;td&gt;6,111&lt;/td&gt;
&lt;td&gt;1,355&lt;/td&gt;
&lt;td&gt;30.02 ms&lt;/td&gt;
&lt;td&gt;2,796&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tuning 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU/NUMA + perf governor&lt;/td&gt;
&lt;td&gt;6,896&lt;/td&gt;
&lt;td&gt;1,659&lt;/td&gt;
&lt;td&gt;25.44 ms&lt;/td&gt;
&lt;td&gt;3,214&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tuning 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HugePages&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7,232&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,706&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.44 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,351&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Write latency progression (1 client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Baseline  7.48 ms  ████████████████████████████████████████
Tuning 1  1.81 ms  ████████
Tuning 2  1.78 ms  ████████
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;75% write latency reduction from Baseline → Tuning 2. Same hardware, same PostgreSQL config.&lt;/strong&gt;&lt;/p&gt;
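&lt;p&gt;Recomputing the headline figure from the chart (latencies in hundredths of a millisecond, so the shell can stay in integer math):&lt;/p&gt;

```shell
baseline=748    # 7.48 ms, Baseline
tuned=178       # 1.78 ms, Tuning 2
drop_pct=$(( (baseline - tuned) * 100 / baseline ))
echo "write latency reduction: $drop_pct%"   # 76%
```

&lt;p&gt;Integer rounding gives 76%, consistent with the ~75% headline.&lt;/p&gt;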




&lt;h2&gt;
  
  
  Final Comparison: CNPG Tuning 2 vs AWS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Average RW TPS — All Environments
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Avg RO TPS&lt;/th&gt;
&lt;th&gt;Avg RW TPS&lt;/th&gt;
&lt;th&gt;Avg RW Lat&lt;/th&gt;
&lt;th&gt;Overall Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS RDS Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10,724&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,250&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.30 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4,826&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Aurora IO-Prov&lt;/td&gt;
&lt;td&gt;8,370&lt;/td&gt;
&lt;td&gt;1,234&lt;/td&gt;
&lt;td&gt;29.72 ms&lt;/td&gt;
&lt;td&gt;3,480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNPG Tuning 2 (Final)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7,232&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,706&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.44 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,351&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Aurora Standard&lt;/td&gt;
&lt;td&gt;8,039&lt;/td&gt;
&lt;td&gt;1,162&lt;/td&gt;
&lt;td&gt;31.45 ms&lt;/td&gt;
&lt;td&gt;3,326&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CNPG Tuning 1&lt;/td&gt;
&lt;td&gt;6,896&lt;/td&gt;
&lt;td&gt;1,659&lt;/td&gt;
&lt;td&gt;25.44 ms&lt;/td&gt;
&lt;td&gt;3,214&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CNPG Local SSD (Baseline)&lt;/td&gt;
&lt;td&gt;6,111&lt;/td&gt;
&lt;td&gt;1,355&lt;/td&gt;
&lt;td&gt;30.02 ms&lt;/td&gt;
&lt;td&gt;2,796&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CNPG Tuning 2 Overall Avg (3,351) nearly matches Aurora IO-Optimized (3,480) — just -3.7% difference.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;On write TPS: CNPG Tuning 2 (1,706) beats Aurora IO-Optimized (1,234) by +38%.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;On write latency: CNPG Tuning 2 (24.44ms) beats Aurora IO-Optimized (29.72ms) by -17.7%.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The honest picture:&lt;/strong&gt; With 2 vCPU and mid-range SAS SSD, CNPG Tuning 2 matches Aurora IO-Optimized on overall throughput (-3.7%) while beating it by &lt;strong&gt;38% on write TPS&lt;/strong&gt;. Aurora leads on reads (~14% higher Avg RO TPS) — this reflects its distributed read cache architecture, not a config gap. We verified by pushing PostgreSQL to its limit (shared_buffers=6GB, random_page_cost=1.1, effective_io_concurrency=200) and the read ceiling held. For write-intensive OLTP, bare metal wins. For read-heavy analytical workloads, Aurora's distributed cache is worth paying for.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  ⚠️ The RDS Standard Caveat: Burstable CPU
&lt;/h3&gt;

&lt;p&gt;RDS Standard (t3.large) leads the benchmark at &lt;strong&gt;4,826 overall avg&lt;/strong&gt; — but this number requires an important caveat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t3 instances use a CPU credit system:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Baseline CPU utilization: &lt;strong&gt;30%&lt;/strong&gt; (for t3.large)&lt;/li&gt;
&lt;li&gt;Above baseline = consuming burst credits&lt;/li&gt;
&lt;li&gt;When credits are exhausted: performance drops to the 30% baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each benchmark run is &lt;strong&gt;60 seconds&lt;/strong&gt;, with the full test suite taking ~50 minutes total — within the burst window for t3.large. Our results therefore reflect &lt;strong&gt;peak burst performance&lt;/strong&gt;, which is valid for this benchmark duration.&lt;/p&gt;
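&lt;p&gt;A rough burst-window estimate, using AWS's published t3.large credit parameters (36 credits earned per hour, 864 maximum balance, 1 credit = 1 vCPU-minute at 100%):&lt;/p&gt;

```shell
earn_per_hr=36        # t3.large credit earn rate
max_credits=864       # maximum credit balance (24h of accrual)
vcpus=2
burn_per_hr=$(( vcpus * 60 ))              # full load on both vCPUs: 120 credits/hour
net_burn=$(( burn_per_hr - earn_per_hr ))  # 84 credits/hour net drain
echo "full-load burst hours from a full bucket: $(( max_credits / net_burn ))"   # ~10
```

&lt;p&gt;Roughly 10 hours of full-load burst from a full bucket: a ~50 minute benchmark fits comfortably, but a continuous 24/7 workload does not.&lt;/p&gt;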

&lt;p&gt;However, in a production workload running continuously 24/7, RDS Standard t3.large performance will drop once CPU credits are depleted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t3.large burst performance:    ~4,826 avg TPS  ← what our ~50 min benchmark measured
t3.large baseline CPU:         30% of full capacity
t3.large sustained (24/7):     significantly lower once credits exhaust
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;If you need sustained, predictable performance on AWS, consider:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Key Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RDS t3.large&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;Burstable — our benchmark used this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RDS m6i.large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;Non-burstable, dedicated CPU, consistent performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RDS m7g.large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;Graviton3, non-burstable, better price/performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora Serverless v2&lt;/td&gt;
&lt;td&gt;2 ACU&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Auto-scales, consistent, higher baseline cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a truly fair comparison against bare metal with &lt;strong&gt;consistent, non-burstable performance&lt;/strong&gt;, RDS m6i.large or m7g.large would be the appropriate AWS counterpart — not t3.large.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Our benchmark results for RDS Standard are valid — the ~50 minute test ran within the burst window. But if your production workload runs continuously 24/7, RDS t3.large will eventually underperform these numbers once CPU credits exhaust. CNPG Tuning 2's 3,351 overall avg is &lt;strong&gt;consistent regardless of duration&lt;/strong&gt; — no burst credits, no performance cliffs.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Per-Client Write TPS Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;Aurora IO-Prov&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CNPG Tuning 2&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;Aurora Standard&lt;/th&gt;
&lt;th&gt;RDS Standard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;285&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;562&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;984&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,903&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;922&lt;/td&gt;
&lt;td&gt;1,881&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;1,278&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,954&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;1,179&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,839&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1,472&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,875&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;1,384&lt;/td&gt;
&lt;td&gt;2,620&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1,623&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1,725&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;1,557&lt;/td&gt;
&lt;td&gt;2,585&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CNPG Tuning 2 beats Aurora IO-Optimized on RW TPS at every concurrency level. RDS Standard leads at 25–100 clients due to t3 burst credits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Client Write Latency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;Aurora IO-Prov&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CNPG Tuning 2&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;Aurora Standard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3.51 ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.78 ms&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;5.23 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10.16 ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;5.25 ms&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;10.85 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;19.57 ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;12.79 ms&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;21.20 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;33.96 ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;26.67 ms&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;36.13 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;61.63 ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;57.97 ms&lt;/strong&gt; 🥇&lt;/td&gt;
&lt;td&gt;64.22 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On write latency, bare metal wins at every concurrency level against both Aurora variants.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Aurora Loses on Write Latency
&lt;/h2&gt;

&lt;p&gt;Aurora replicates every write to its distributed storage fleet before acknowledging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL → WAL → Aurora storage network → 2/3 replicas ack → done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On bare metal with NUMA-pinned CPUs and local SSD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL → WAL buffer (HugePages) → local SSD → done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 1 client, Aurora's write path costs 3.51 ms versus 1.78 ms on bare metal — nearly 2× faster. At 100 clients the gap narrows as both become I/O bound, but bare metal still leads.&lt;/p&gt;
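&lt;p&gt;To see where the bare metal floor comes from, you can measure raw fsync latency on the WAL volume with &lt;code&gt;pg_test_fsync&lt;/code&gt;, a utility that ships with PostgreSQL. A minimal sketch — the file path is illustrative, not our actual mount point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Run on the database host, against the disk that holds pg_wal
# (path below is an example)
pg_test_fsync -f /var/lib/postgresql/data/pg_wal/fsync-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The per-operation fsync times it reports set a hard lower bound on single-client write latency; on Aurora, the equivalent bound includes the storage network round-trip.&lt;/p&gt;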




&lt;h2&gt;
  
  
  Platform Selection Guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write-intensive OLTP&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CNPG Tuning 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best write TPS and latency vs Aurora&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read-heavy (API, reporting)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Aurora IO-Opt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;47% higher Avg RO TPS vs bare metal tuned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Burst/unpredictable load&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RDS Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t3 burst credits handle spikes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost-sensitive, stable load&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CNPG Tuning 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aurora-level write perf at fraction of cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed simplicity&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Aurora Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Competitive overall, no ops overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;NUMA pinning is the biggest single tuning lever.&lt;/strong&gt; Tuning 1 (CPU/NUMA + performance governor) delivered +22% Avg RW TPS and -76% write latency at 1 client — more impact than any PostgreSQL config change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HugePages: consistent but incremental.&lt;/strong&gt; Tuning 2 added +2.8% Avg RW TPS on top of Tuning 1. Worth enabling for latency stability at high concurrency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bare metal beats Aurora IO-Optimized on writes — with mid-range SAS SSD.&lt;/strong&gt; +38% Avg RW TPS and -18% write latency. Not NVMe. Not enterprise flash. Samsung SM863a SAS in RAID 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora's read advantage is architectural, not a config gap.&lt;/strong&gt; We maxed PostgreSQL config (6 GB shared_buffers, random_page_cost=1.1, effective_io_concurrency=200, maintenance_work_mem=512MB) and reached 86% of Aurora IO-Prov read throughput. The remaining 14% gap comes from Aurora's distributed read cache — a genuine architectural advantage for read-heavy workloads, not something tunable away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"2 vCPU" is not equal.&lt;/strong&gt; Same allocation on NUMA-aware 32-core bare metal with pinned cores outperforms hypervisor-backed t3.large on write-sensitive workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDS Standard t3.large benchmark results are valid — but context matters.&lt;/strong&gt; Our ~50 minute benchmark ran within the burst window, so results accurately reflect t3 burst performance. However, in 24/7 production workloads, performance will drop once CPU credits exhaust (baseline CPU = 30%). For sustained production comparison, m6i.large or m7g.large (non-burstable) is more appropriate than t3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose based on workload, not hype.&lt;/strong&gt; Write-intensive OLTP → bare metal wins on both performance and cost. Read-heavy analytical → Aurora's distributed cache is worth paying for.&lt;/li&gt;
&lt;/ol&gt;
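
&lt;p&gt;For readers who want to try the same levers, the host-side setup behind Tuning 1 and 2 looks roughly like this. The values are illustrative, not a copy of our exact configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tuning 1: pin the CPU frequency governor (cpupower from linux-tools)
cpupower frequency-set -g performance

# Tuning 2: reserve HugePages for shared_buffers
# 6 GB / 2 MB pages = 3072; add headroom for other shared memory
sysctl -w vm.nr_hugepages=3300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The pinning itself is commonly achieved on Kubernetes via the kubelet's static CPU Manager policy (&lt;code&gt;--cpu-manager-policy=static&lt;/code&gt;), which dedicates cores to pods in the Guaranteed QoS class.&lt;/p&gt;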




&lt;h2&gt;
  
  
  Environment Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudNativePG&lt;/strong&gt;: v1.24 on Kubernetes 1.31&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host&lt;/strong&gt;: Bare Metal 32-Core (16 Physical / 16 HT), NUMA-Aware, 32 GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Samsung SM863a Enterprise SSD RAID 1 (SAS Interface) — mid-range enterprise SSD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL config (Tuning 2)&lt;/strong&gt;: shared_buffers=6GB, huge_pages=on, work_mem=4MB, max_connections=200, wal_buffers=64MB, random_page_cost=1.1, effective_io_concurrency=200, maintenance_work_mem=512MB, checkpoint_completion_target=0.9&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: Single instance — no HA, no read replicas, no connection pooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Region&lt;/strong&gt;: ap-southeast-3 (Indonesia)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance class&lt;/strong&gt;: t3.large (2 vCPU, 8 GB) for all AWS environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale Factor&lt;/strong&gt;: 100 (~10M rows, ~1.5 GB table)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark runner&lt;/strong&gt;: Kubernetes-native pgbench Job — &lt;a href="https://github.com/ionehouten/devops-kangservice/tree/main/kubernetes/benchmark/postgres" rel="noopener noreferrer"&gt;source on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
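
&lt;p&gt;As a sketch of how the Tuning 2 parameters map onto a CloudNativePG manifest — the cluster name and resource shape are illustrative; equal requests and limits give the pod Guaranteed QoS, a prerequisite for static CPU pinning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-tuning2          # illustrative name
spec:
  instances: 1              # single instance, as benchmarked
  postgresql:
    parameters:
      shared_buffers: "6GB"
      huge_pages: "on"
      work_mem: "4MB"
      max_connections: "200"
      wal_buffers: "64MB"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      maintenance_work_mem: "512MB"
      checkpoint_completion_target: "0.9"
  resources:
    requests: { cpu: "2", memory: 8Gi }
    limits:   { cpu: "2", memory: 8Gi }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;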




&lt;p&gt;&lt;em&gt;← Part 1: &lt;a href="https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-1fil"&gt;Storage Baseline — Longhorn vs Local SSD vs Managed Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Iwan Setiawan, Hybrid Cloud &amp;amp; Platform Architect · &lt;a href="https://portfolio.kangservice.cloud" rel="noopener noreferrer"&gt;portfolio.kangservice.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>linux</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Redis + AOF + Distributed Storage: A Cautionary Benchmark</title>
      <dc:creator>Iwan Setiawan</dc:creator>
      <pubDate>Sat, 07 Mar 2026 23:13:17 +0000</pubDate>
      <link>https://dev.to/ionehouten/redis-aof-distributed-storage-a-cautionary-benchmark-4jf0</link>
      <guid>https://dev.to/ionehouten/redis-aof-distributed-storage-a-cautionary-benchmark-4jf0</guid>
      <description>&lt;p&gt;&lt;em&gt;We put AOF persistence through 9 configurations across local SSD SAS and Longhorn. The results are definitive.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When designing a caching layer for a production migration to bare metal Kubernetes, we faced a question that sounds simple but turned out to have an expensive answer: &lt;strong&gt;should Redis AOF persistence live on Longhorn distributed storage?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Redis documentation hints at the answer. But intuition and documentation are not the same as production data. So we ran &lt;code&gt;redis-benchmark&lt;/code&gt; across nine configurations — varying storage backend, persistence settings, and dataset size — and measured the impact empirically.&lt;/p&gt;

&lt;p&gt;The results are unambiguous, and one number in particular should give any architect pause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Configuration
&lt;/h2&gt;

&lt;p&gt;All tests used the same parameters throughout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests:    50,000
clients:     20 parallel
payload:     180,000 bytes (~180 KB)
keepalive:   1 (connection reuse; no pipelining)
threads:     single-threaded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 180 KB payload is intentional — it reflects realistic cache object sizes for the production workload being benchmarked, not the micro-payload tests commonly seen in vendor benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The environments tested&lt;/strong&gt; (seven of the nine are summarized below):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;AOF&lt;/th&gt;
&lt;th&gt;RDB&lt;/th&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF off&lt;/td&gt;
&lt;td&gt;Local SSD SAS&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;Empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (baseline)&lt;/td&gt;
&lt;td&gt;Local SSD SAS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;Empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (tuning 1)&lt;/td&gt;
&lt;td&gt;Local SSD SAS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;Empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (tuning 2)&lt;/td&gt;
&lt;td&gt;Local SSD SAS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;Empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (t2 + data)&lt;/td&gt;
&lt;td&gt;Local SSD SAS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;375,795 keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Longhorn · AOF on (empty)&lt;/td&gt;
&lt;td&gt;Longhorn&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;Empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Longhorn · AOF on (data)&lt;/td&gt;
&lt;td&gt;Longhorn&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Thresholds&lt;/td&gt;
&lt;td&gt;375,795 keys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  SET Throughput: The Core Finding
&lt;/h2&gt;

&lt;p&gt;The most important metric for a write-capable cache is SET throughput under load. Here are the results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;SET RPS&lt;/th&gt;
&lt;th&gt;SET avg latency&lt;/th&gt;
&lt;th&gt;SET p99 latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF off&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7,696&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.47 ms&lt;/td&gt;
&lt;td&gt;5.12 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (baseline)&lt;/td&gt;
&lt;td&gt;1,275&lt;/td&gt;
&lt;td&gt;14.39 ms&lt;/td&gt;
&lt;td&gt;102.53 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (tuning 1)&lt;/td&gt;
&lt;td&gt;1,251&lt;/td&gt;
&lt;td&gt;15.03 ms&lt;/td&gt;
&lt;td&gt;105.92 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (tuning 2)&lt;/td&gt;
&lt;td&gt;1,248&lt;/td&gt;
&lt;td&gt;15.03 ms&lt;/td&gt;
&lt;td&gt;112.38 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (t2 + 375K keys)&lt;/td&gt;
&lt;td&gt;1,212&lt;/td&gt;
&lt;td&gt;15.85 ms&lt;/td&gt;
&lt;td&gt;121.15 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Longhorn · AOF on (empty)&lt;/td&gt;
&lt;td&gt;577&lt;/td&gt;
&lt;td&gt;33.56 ms&lt;/td&gt;
&lt;td&gt;225.66 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Longhorn · AOF on (375K keys)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;537&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;36.17 ms&lt;/td&gt;
&lt;td&gt;201.86 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let that sink in. Local SSD SAS with AOF disabled: &lt;strong&gt;7,696 SET RPS, p99 = 5 ms&lt;/strong&gt;. Longhorn with AOF enabled: &lt;strong&gt;537 SET RPS, p99 = 202 ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is a &lt;strong&gt;14.3x throughput difference&lt;/strong&gt; and a &lt;strong&gt;39x p99 latency difference&lt;/strong&gt; — on the same application code, same Redis version, same client.&lt;/p&gt;

&lt;p&gt;The worst-case single SET operation on Longhorn reached &lt;strong&gt;903 ms&lt;/strong&gt;. For a cache layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AOF Wall on Local Storage
&lt;/h2&gt;

&lt;p&gt;Before we get to Longhorn, it's worth understanding what AOF persistence costs even on fast local SSD SAS.&lt;/p&gt;

&lt;p&gt;Disabling AOF (keeping only RDB snapshot thresholds) delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SET p99: &lt;strong&gt;3.8–5.1 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Average SET latency: &lt;strong&gt;~1.5 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enabling AOF on the same local storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SET p99: &lt;strong&gt;102–121 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Average SET latency: &lt;strong&gt;14–16 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's roughly a &lt;strong&gt;20x p99 latency penalty just from AOF on local SSD SAS&lt;/strong&gt;. And critically — tuning doesn't help. Across three tuning iterations (different &lt;code&gt;appendfsync&lt;/code&gt; settings, &lt;code&gt;no-appendfsync-on-rewrite&lt;/code&gt; toggles, and RDB threshold adjustments), the p99 numbers barely moved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Baseline:  102.5 ms p99
Tuning 1:  105.9 ms p99
Tuning 2:  112.4 ms p99  ← actually got worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is fundamental: with &lt;code&gt;appendfsync everysec&lt;/code&gt;, Redis must call &lt;code&gt;fsync()&lt;/code&gt; on the AOF at least once per second. On a busy single-threaded instance pushing 180 KB payloads, writes stall whenever that fsync falls behind. You cannot tune your way past it.&lt;/p&gt;
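
&lt;p&gt;The AOF knobs we iterated over are standard &lt;code&gt;redis.conf&lt;/code&gt; directives. A representative fragment — the exact per-iteration values are not reproduced here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;appendonly yes
appendfsync everysec            # alternatives: always, no
no-appendfsync-on-rewrite yes   # skip fsync while a rewrite is running
save 900 1                      # RDB thresholds kept alongside AOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;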




&lt;h2&gt;
  
  
  Why Longhorn Makes AOF Catastrophic
&lt;/h2&gt;

&lt;p&gt;Longhorn is a distributed block storage system for Kubernetes. It replicates data across nodes for durability. This is excellent for stateful workloads like databases with controlled write patterns.&lt;/p&gt;

&lt;p&gt;Redis AOF is not that.&lt;/p&gt;

&lt;p&gt;AOF appends to a log file on every write operation (or at least every second with &lt;code&gt;everysec&lt;/code&gt;). The write pattern is continuous, small, and latency-sensitive. When this write pattern hits Longhorn:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each AOF append crosses the network to the Longhorn controller&lt;/li&gt;
&lt;li&gt;The controller replicates to N replicas before acknowledging&lt;/li&gt;
&lt;li&gt;Only then does Redis get its fsync confirmation&lt;/li&gt;
&lt;li&gt;Redis is single-threaded — it waits&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result: every SET operation pays the cost of network round-trip + multi-replica write confirmation. At 180 KB payload size, this stacks badly.&lt;/p&gt;

&lt;p&gt;Redis's own documentation says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Avoid storing AOF/RDB files on storage that has network latency in the I/O path, such as NFS mounts."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Longhorn is effectively that — a network-replicated volume. The documentation warning is correct. Our benchmark puts a number on it: &lt;strong&gt;903 ms max latency, 202 ms p99&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  GET Performance
&lt;/h2&gt;

&lt;p&gt;One important nuance: GET performance is much less affected by persistence settings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;GET RPS&lt;/th&gt;
&lt;th&gt;GET avg latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF off&lt;/td&gt;
&lt;td&gt;8,027&lt;/td&gt;
&lt;td&gt;1.47 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (baseline)&lt;/td&gt;
&lt;td&gt;2,537&lt;/td&gt;
&lt;td&gt;4.29 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Longhorn · AOF on (375K keys)&lt;/td&gt;
&lt;td&gt;2,522&lt;/td&gt;
&lt;td&gt;4.21 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Longhorn doesn't significantly degrade GET performance compared to AOF-on local storage. This makes sense — reads don't write to the AOF log. The Longhorn penalty only appears when Redis needs to persist.&lt;/p&gt;




&lt;h2&gt;
  
  
  PING Latency: The Baseline
&lt;/h2&gt;

&lt;p&gt;PING throughput gives a sense of the overhead without persistence in the picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;PING RPS&lt;/th&gt;
&lt;th&gt;PING avg latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF off&lt;/td&gt;
&lt;td&gt;~37,000&lt;/td&gt;
&lt;td&gt;0.32 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local · AOF on (baseline)&lt;/td&gt;
&lt;td&gt;~11,000–18,000&lt;/td&gt;
&lt;td&gt;0.84–1.50 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Longhorn · AOF on&lt;/td&gt;
&lt;td&gt;~19,000–21,000&lt;/td&gt;
&lt;td&gt;0.74–0.83 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Interestingly, PING performance on Longhorn is &lt;em&gt;better&lt;/em&gt; than AOF-on-local at baseline. The Longhorn overhead only materializes when Redis actually needs to write to the AOF log — confirming that the bottleneck is specifically the persistence write path, not general Longhorn I/O overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recommended Architecture
&lt;/h2&gt;

&lt;p&gt;Based on these results, the right architecture for this workload is a &lt;strong&gt;split-persistence design&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot path (primary):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis with AOF disabled&lt;/li&gt;
&lt;li&gt;RDB snapshots only, with generous thresholds (e.g., &lt;code&gt;save 3600 1&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Local-path storage on SSD SAS&lt;/li&gt;
&lt;li&gt;Result: 7,600+ SET RPS, sub-5 ms p99&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recovery path (replica):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis replica of the primary&lt;/li&gt;
&lt;li&gt;RDB-only snapshots to persistent storage (Longhorn acceptable here — snapshot writes are infrequent and bursty)&lt;/li&gt;
&lt;li&gt;Not in the hot write path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you sub-5 ms p99 at full throughput on the write path, while maintaining durability guarantees through the replica's periodic snapshots. If the primary fails, you lose at most one RDB snapshot interval of data — which for most cache workloads is acceptable.&lt;/p&gt;

&lt;p&gt;If true durability for every write is required (it rarely is for a cache), the right answer is a different tool — not Redis with AOF on distributed storage.&lt;/p&gt;
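
&lt;p&gt;In &lt;code&gt;redis.conf&lt;/code&gt; terms, the split looks something like this — hostnames and paths are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --- Primary (hot path): no AOF, generous RDB threshold ---
appendonly no
save 3600 1

# --- Replica (recovery path): snapshots to persistent storage ---
replicaof redis-primary 6379    # illustrative hostname
appendonly no
save 900 1
dir /data                       # e.g. a Longhorn-backed PVC mount
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;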




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Can AOF on local SSD SAS achieve good SET latency?&lt;/td&gt;
&lt;td&gt;No. p99 stays above 100 ms regardless of tuning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can AOF on Longhorn achieve acceptable SET latency?&lt;/td&gt;
&lt;td&gt;No. p99 reaches 202 ms, max 903 ms.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does Longhorn affect GET performance with AOF?&lt;/td&gt;
&lt;td&gt;Minimally — GETs don't write to AOF.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What's the right architecture for high-throughput caching?&lt;/td&gt;
&lt;td&gt;AOF disabled on hot path, RDB replica for recovery.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is the Redis documentation warning about network storage accurate?&lt;/td&gt;
&lt;td&gt;Definitively yes. Our data confirms it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 14x throughput gap between AOF-on-Longhorn and AOF-off-local is not a configuration problem. It is an architectural mismatch. Building a fast cache on slow persistence is a contradiction — and these numbers prove it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis version&lt;/strong&gt;: 7.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage backends&lt;/strong&gt;: Local-path provisioner (SSD SAS) and Longhorn 1.6 on Kubernetes 1.31&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;redis-benchmark parameters&lt;/strong&gt;: &lt;code&gt;-n 50000 -c 20 -d 180000 --keepalive 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-threaded mode&lt;/strong&gt; throughout (no &lt;code&gt;--threads&lt;/code&gt; flag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset&lt;/strong&gt;: Empty at baseline; 375,795 keys for loaded tests&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions about Redis architecture on Kubernetes? Leave a comment below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Iwan Setiawan, Hybrid Cloud &amp;amp; Platform Architect · &lt;a href="https://portfolio.kangservice.cloud" rel="noopener noreferrer"&gt;portfolio.kangservice.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>redis</category>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Bare Metal vs. AWS RDS: A Deep Dive into NUMA-Aware Tuning and PostgreSQL Performance (Part 1)</title>
      <dc:creator>Iwan Setiawan</dc:creator>
      <pubDate>Sat, 07 Mar 2026 23:00:09 +0000</pubDate>
      <link>https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-1fil</link>
      <guid>https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-1fil</guid>
      <description>&lt;h2&gt;
  
  
  Bare Metal vs. AWS RDS: Storage Baseline — Longhorn vs Local SSD vs Managed Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Before tuning anything, we needed to answer a simpler question first: does storage backend matter more than the platform itself?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is Part 1 of a 2-part series.&lt;/strong&gt; This article establishes bare metal storage baselines across four configurations and compares them against AWS managed PostgreSQL. Part 2 covers CPU/NUMA pinning and HugePages, where bare metal overtakes Aurora on write throughput.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Most PostgreSQL performance comparisons jump straight to config tuning. We didn't.&lt;/p&gt;

&lt;p&gt;Before touching CPU governors or HugePages, we needed to answer a more fundamental question: &lt;strong&gt;how much does storage backend affect performance on bare metal Kubernetes?&lt;/strong&gt; We ran four configurations — and the results reveal exactly where the bottleneck lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;All environments: &lt;strong&gt;2 vCPU / 8 GB RAM&lt;/strong&gt; throughout. Our bare metal node is a 32-core NUMA-aware host with Samsung SM863a Enterprise SSD in RAID 1 (SAS). AWS environments run on t3.large in ap-southeast-3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-instance comparison throughout&lt;/strong&gt; — one CNPG pod vs one RDS instance vs one Aurora instance. No read replicas, no Multi-AZ, no connection pooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL config — intentionally matched to AWS defaults:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;shared_buffers&lt;/span&gt;       = ~&lt;span class="m"&gt;1&lt;/span&gt;.&lt;span class="m"&gt;9&lt;/span&gt; &lt;span class="n"&gt;GB&lt;/span&gt;
&lt;span class="n"&gt;effective_cache_size&lt;/span&gt; = ~&lt;span class="m"&gt;3&lt;/span&gt;.&lt;span class="m"&gt;8&lt;/span&gt; &lt;span class="n"&gt;GB&lt;/span&gt;
&lt;span class="n"&gt;work_mem&lt;/span&gt;             = &lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;
&lt;span class="n"&gt;max_connections&lt;/span&gt;      = &lt;span class="m"&gt;839&lt;/span&gt;
&lt;span class="n"&gt;wal_buffers&lt;/span&gt;          = ~&lt;span class="m"&gt;60&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;
&lt;span class="n"&gt;maintenance_work_mem&lt;/span&gt; = &lt;span class="m"&gt;128&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using the same config across all environments, performance differences come purely from &lt;strong&gt;platform and storage architecture&lt;/strong&gt; — not tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark:&lt;/strong&gt; &lt;code&gt;pgbench&lt;/code&gt; · Scale factor 100 (~10M rows) · 60s per run · 39 runs per environment&lt;/p&gt;
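
&lt;p&gt;A minimal local equivalent of one run, sketched from the parameters above (the database name is illustrative; the RO and TPC-B columns map to pgbench's built-in scripts):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize at scale factor 100 (~10M rows in pgbench_accounts)
pgbench -i -s 100 benchdb

# Read-only (RO): built-in select-only script, 60 s, e.g. 25 clients
pgbench -b select-only -c 25 -j 4 -T 60 benchdb

# Read/write: default TPC-B-like script, same shape
pgbench -c 25 -j 4 -T 60 benchdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;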

&lt;p&gt;&lt;strong&gt;Four bare metal storage configurations tested:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Replicas&lt;/th&gt;
&lt;th&gt;Disk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNPG Local SSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct-attached Samsung SM863a SAS&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNPG Longhorn 1R&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Longhorn distributed storage&lt;/td&gt;
&lt;td&gt;1 replica&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNPG Longhorn 2R&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Longhorn distributed storage&lt;/td&gt;
&lt;td&gt;2 replicas&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNPG Longhorn 2R+shared&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Longhorn distributed storage&lt;/td&gt;
&lt;td&gt;2 replicas&lt;/td&gt;
&lt;td&gt;Shared with OS/worker&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: AWS RDS Standard (t3.large)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1,677&lt;/td&gt;
&lt;td&gt;0.60 ms&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;td&gt;3.95 ms&lt;/td&gt;
&lt;td&gt;178&lt;/td&gt;
&lt;td&gt;5.63 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;13,955&lt;/td&gt;
&lt;td&gt;0.72 ms&lt;/td&gt;
&lt;td&gt;1,881&lt;/td&gt;
&lt;td&gt;5.32 ms&lt;/td&gt;
&lt;td&gt;1,460&lt;/td&gt;
&lt;td&gt;6.85 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;12,859&lt;/td&gt;
&lt;td&gt;1.94 ms&lt;/td&gt;
&lt;td&gt;2,839&lt;/td&gt;
&lt;td&gt;8.80 ms&lt;/td&gt;
&lt;td&gt;1,864&lt;/td&gt;
&lt;td&gt;13.41 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;10,397&lt;/td&gt;
&lt;td&gt;4.81 ms&lt;/td&gt;
&lt;td&gt;2,620&lt;/td&gt;
&lt;td&gt;19.09 ms&lt;/td&gt;
&lt;td&gt;1,646&lt;/td&gt;
&lt;td&gt;30.37 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;10,627&lt;/td&gt;
&lt;td&gt;9.41 ms&lt;/td&gt;
&lt;td&gt;2,585&lt;/td&gt;
&lt;td&gt;38.68 ms&lt;/td&gt;
&lt;td&gt;1,623&lt;/td&gt;
&lt;td&gt;61.61 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: AWS Aurora IO-Optimized (t3.large)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2,607&lt;/td&gt;
&lt;td&gt;0.38 ms&lt;/td&gt;
&lt;td&gt;285&lt;/td&gt;
&lt;td&gt;3.51 ms&lt;/td&gt;
&lt;td&gt;218&lt;/td&gt;
&lt;td&gt;4.58 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10,928&lt;/td&gt;
&lt;td&gt;0.92 ms&lt;/td&gt;
&lt;td&gt;984&lt;/td&gt;
&lt;td&gt;10.16 ms&lt;/td&gt;
&lt;td&gt;739&lt;/td&gt;
&lt;td&gt;13.53 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9,265&lt;/td&gt;
&lt;td&gt;2.70 ms&lt;/td&gt;
&lt;td&gt;1,278&lt;/td&gt;
&lt;td&gt;19.57 ms&lt;/td&gt;
&lt;td&gt;880&lt;/td&gt;
&lt;td&gt;28.42 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;8,163&lt;/td&gt;
&lt;td&gt;6.12 ms&lt;/td&gt;
&lt;td&gt;1,472&lt;/td&gt;
&lt;td&gt;33.96 ms&lt;/td&gt;
&lt;td&gt;990&lt;/td&gt;
&lt;td&gt;50.49 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;7,783&lt;/td&gt;
&lt;td&gt;12.85 ms&lt;/td&gt;
&lt;td&gt;1,623&lt;/td&gt;
&lt;td&gt;61.63 ms&lt;/td&gt;
&lt;td&gt;1,027&lt;/td&gt;
&lt;td&gt;97.41 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: AWS Aurora Standard (t3.large)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1,540&lt;/td&gt;
&lt;td&gt;0.65 ms&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;5.23 ms&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;6.66 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10,020&lt;/td&gt;
&lt;td&gt;1.00 ms&lt;/td&gt;
&lt;td&gt;922&lt;/td&gt;
&lt;td&gt;10.85 ms&lt;/td&gt;
&lt;td&gt;690&lt;/td&gt;
&lt;td&gt;14.48 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9,189&lt;/td&gt;
&lt;td&gt;2.72 ms&lt;/td&gt;
&lt;td&gt;1,179&lt;/td&gt;
&lt;td&gt;21.20 ms&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;31.23 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;8,014&lt;/td&gt;
&lt;td&gt;6.24 ms&lt;/td&gt;
&lt;td&gt;1,384&lt;/td&gt;
&lt;td&gt;36.13 ms&lt;/td&gt;
&lt;td&gt;897&lt;/td&gt;
&lt;td&gt;55.77 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;7,665&lt;/td&gt;
&lt;td&gt;13.05 ms&lt;/td&gt;
&lt;td&gt;1,557&lt;/td&gt;
&lt;td&gt;64.22 ms&lt;/td&gt;
&lt;td&gt;970&lt;/td&gt;
&lt;td&gt;103.10 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: CNPG Local SSD
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;749&lt;/td&gt;
&lt;td&gt;1.34 ms&lt;/td&gt;
&lt;td&gt;134&lt;/td&gt;
&lt;td&gt;7.48 ms&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;10.10 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;7,675&lt;/td&gt;
&lt;td&gt;1.30 ms&lt;/td&gt;
&lt;td&gt;1,425&lt;/td&gt;
&lt;td&gt;7.02 ms&lt;/td&gt;
&lt;td&gt;1,031&lt;/td&gt;
&lt;td&gt;9.70 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;6,788&lt;/td&gt;
&lt;td&gt;3.68 ms&lt;/td&gt;
&lt;td&gt;1,560&lt;/td&gt;
&lt;td&gt;16.02 ms&lt;/td&gt;
&lt;td&gt;1,073&lt;/td&gt;
&lt;td&gt;23.30 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;6,430&lt;/td&gt;
&lt;td&gt;7.78 ms&lt;/td&gt;
&lt;td&gt;1,550&lt;/td&gt;
&lt;td&gt;32.27 ms&lt;/td&gt;
&lt;td&gt;996&lt;/td&gt;
&lt;td&gt;50.18 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;6,092&lt;/td&gt;
&lt;td&gt;16.41 ms&lt;/td&gt;
&lt;td&gt;1,464&lt;/td&gt;
&lt;td&gt;68.32 ms&lt;/td&gt;
&lt;td&gt;902&lt;/td&gt;
&lt;td&gt;110.92 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: CNPG Longhorn 1 Replica (Dedicated Disk)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;754&lt;/td&gt;
&lt;td&gt;1.33 ms&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;8.43 ms&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;11.12 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;7,713&lt;/td&gt;
&lt;td&gt;1.30 ms&lt;/td&gt;
&lt;td&gt;940&lt;/td&gt;
&lt;td&gt;10.64 ms&lt;/td&gt;
&lt;td&gt;748&lt;/td&gt;
&lt;td&gt;13.37 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;7,311&lt;/td&gt;
&lt;td&gt;3.42 ms&lt;/td&gt;
&lt;td&gt;1,254&lt;/td&gt;
&lt;td&gt;19.93 ms&lt;/td&gt;
&lt;td&gt;1,015&lt;/td&gt;
&lt;td&gt;24.64 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;6,587&lt;/td&gt;
&lt;td&gt;7.59 ms&lt;/td&gt;
&lt;td&gt;1,384&lt;/td&gt;
&lt;td&gt;36.12 ms&lt;/td&gt;
&lt;td&gt;1,064&lt;/td&gt;
&lt;td&gt;46.98 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;6,109&lt;/td&gt;
&lt;td&gt;16.37 ms&lt;/td&gt;
&lt;td&gt;1,453&lt;/td&gt;
&lt;td&gt;68.83 ms&lt;/td&gt;
&lt;td&gt;1,009&lt;/td&gt;
&lt;td&gt;99.14 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: CNPG Longhorn 2 Replicas (Dedicated Disk)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;752&lt;/td&gt;
&lt;td&gt;1.33 ms&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;9.71 ms&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;12.39 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;7,655&lt;/td&gt;
&lt;td&gt;1.31 ms&lt;/td&gt;
&lt;td&gt;712&lt;/td&gt;
&lt;td&gt;14.04 ms&lt;/td&gt;
&lt;td&gt;607&lt;/td&gt;
&lt;td&gt;16.47 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;7,379&lt;/td&gt;
&lt;td&gt;3.39 ms&lt;/td&gt;
&lt;td&gt;996&lt;/td&gt;
&lt;td&gt;25.11 ms&lt;/td&gt;
&lt;td&gt;835&lt;/td&gt;
&lt;td&gt;29.93 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;6,699&lt;/td&gt;
&lt;td&gt;7.46 ms&lt;/td&gt;
&lt;td&gt;1,144&lt;/td&gt;
&lt;td&gt;43.71 ms&lt;/td&gt;
&lt;td&gt;908&lt;/td&gt;
&lt;td&gt;55.07 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;6,063&lt;/td&gt;
&lt;td&gt;16.49 ms&lt;/td&gt;
&lt;td&gt;1,269&lt;/td&gt;
&lt;td&gt;78.78 ms&lt;/td&gt;
&lt;td&gt;852&lt;/td&gt;
&lt;td&gt;117.39 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Results: CNPG Longhorn 2 Replicas (Shared Disk)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clients&lt;/th&gt;
&lt;th&gt;RO TPS&lt;/th&gt;
&lt;th&gt;RO Lat&lt;/th&gt;
&lt;th&gt;RW TPS&lt;/th&gt;
&lt;th&gt;RW Lat&lt;/th&gt;
&lt;th&gt;TPC-B TPS&lt;/th&gt;
&lt;th&gt;TPC-B Lat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;741&lt;/td&gt;
&lt;td&gt;1.35 ms&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;10.32 ms&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;13.08 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8,399&lt;/td&gt;
&lt;td&gt;1.19 ms&lt;/td&gt;
&lt;td&gt;681&lt;/td&gt;
&lt;td&gt;14.69 ms&lt;/td&gt;
&lt;td&gt;598&lt;/td&gt;
&lt;td&gt;16.72 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;8,279&lt;/td&gt;
&lt;td&gt;3.02 ms&lt;/td&gt;
&lt;td&gt;957&lt;/td&gt;
&lt;td&gt;26.11 ms&lt;/td&gt;
&lt;td&gt;802&lt;/td&gt;
&lt;td&gt;31.17 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;7,406&lt;/td&gt;
&lt;td&gt;6.75 ms&lt;/td&gt;
&lt;td&gt;1,081&lt;/td&gt;
&lt;td&gt;46.27 ms&lt;/td&gt;
&lt;td&gt;873&lt;/td&gt;
&lt;td&gt;57.29 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;6,697&lt;/td&gt;
&lt;td&gt;14.93 ms&lt;/td&gt;
&lt;td&gt;1,206&lt;/td&gt;
&lt;td&gt;82.89 ms&lt;/td&gt;
&lt;td&gt;829&lt;/td&gt;
&lt;td&gt;120.67 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Combined Average Summary
&lt;/h2&gt;

&lt;p&gt;Averaged across all 13 client/thread combinations per workload type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Avg RO TPS&lt;/th&gt;
&lt;th&gt;Avg RW TPS&lt;/th&gt;
&lt;th&gt;Avg RW Lat&lt;/th&gt;
&lt;th&gt;Overall Avg TPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS RDS Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10,724&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,250&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.30 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4,826&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Aurora IO-Prov&lt;/td&gt;
&lt;td&gt;8,370&lt;/td&gt;
&lt;td&gt;1,234&lt;/td&gt;
&lt;td&gt;29.72 ms&lt;/td&gt;
&lt;td&gt;3,480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Aurora Standard&lt;/td&gt;
&lt;td&gt;8,039&lt;/td&gt;
&lt;td&gt;1,162&lt;/td&gt;
&lt;td&gt;31.45 ms&lt;/td&gt;
&lt;td&gt;3,326&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNPG Local SSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,111&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,355&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.02 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,796&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CNPG Longhorn 1R&lt;/td&gt;
&lt;td&gt;6,376&lt;/td&gt;
&lt;td&gt;1,152&lt;/td&gt;
&lt;td&gt;32.46 ms&lt;/td&gt;
&lt;td&gt;2,797&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CNPG Longhorn 2R&lt;/td&gt;
&lt;td&gt;6,356&lt;/td&gt;
&lt;td&gt;935&lt;/td&gt;
&lt;td&gt;39.35 ms&lt;/td&gt;
&lt;td&gt;2,671&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CNPG Longhorn 2R+shared&lt;/td&gt;
&lt;td&gt;7,052&lt;/td&gt;
&lt;td&gt;892&lt;/td&gt;
&lt;td&gt;40.95 ms&lt;/td&gt;
&lt;td&gt;2,885&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What The Data Tells Us
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Finding 1: The overall average is misleading for storage comparison.&lt;/strong&gt;&lt;br&gt;
Overall avg TPS (RO+RW+TPC-B combined) shows all four bare metal configs at ~2,671–2,885 — nearly identical. That is because the high-volume read-only results dominate the mean while read performance is unaffected by the storage backend. Always disaggregate by workload type.&lt;/p&gt;
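&lt;p&gt;&lt;em&gt;A minimal sketch of the disaggregation point, using toy TPS numbers (not the benchmark's raw data): because read-only TPS runs several times higher than write TPS, a combined mean tracks reads almost entirely and hides write-side differences.&lt;/em&gt;&lt;/p&gt;

```python
from statistics import mean

# Toy samples (workload, TPS) — illustrative only, not the article's raw runs.
runs = [
    ("ro", 6100), ("ro", 6400), ("ro", 7000),   # read-only dominates by magnitude
    ("rw", 1350), ("rw", 900),                  # write-side differences are large...
    ("tpcb", 950),
]

def summarize(runs):
    """Per-workload averages plus the combined mean across all runs."""
    by_type = {}
    for workload, tps in runs:
        by_type.setdefault(workload, []).append(tps)
    per_workload = {w: mean(v) for w, v in by_type.items()}
    overall = mean(tps for _, tps in runs)
    return per_workload, overall

per_workload, overall = summarize(runs)
print(per_workload)       # per-workload view exposes the write gap
print(round(overall))     # ...but the combined mean is pulled toward the RO numbers
```

The per-workload dictionary is the view the findings below rely on; the single combined number is the one to distrust.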

&lt;p&gt;&lt;strong&gt;Finding 2: Read performance — storage backend is irrelevant.&lt;/strong&gt;&lt;br&gt;
All four bare metal configs land within 6,111–7,052 Avg RO TPS — variation within normal test noise. Aurora still leads bare metal on reads (8,039–8,370), thanks to its distributed, read-optimized storage layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 3: Write performance — Local SSD wins clearly.&lt;/strong&gt;&lt;br&gt;
Local SSD delivers 1,355 Avg RW TPS vs Longhorn 2R's 892 — a &lt;strong&gt;52% advantage&lt;/strong&gt;. Every Longhorn replica adds ~3–4 ms write latency (network round-trip for replication acknowledgment).&lt;/p&gt;
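&lt;p&gt;&lt;em&gt;The latency math behind this finding can be sketched with a closed-loop approximation (TPS ≈ clients / latency): each pgbench client waits for its own transaction, so added per-replica acknowledgment latency cuts write TPS directly, even when disks are idle. The 3.5 ms per-replica cost below is an assumed midpoint of the observed ~3–4 ms, not a measured constant.&lt;/em&gt;&lt;/p&gt;

```python
# Closed-loop model: C clients, each blocked on one in-flight transaction,
# yields TPS ≈ C / latency. Replication acks add straight to that latency.
def approx_tps(clients, base_latency_ms, replicas=0, ack_ms=3.5):
    """ack_ms is an assumed per-replica network round-trip cost."""
    latency_s = (base_latency_ms + replicas * ack_ms) / 1000.0
    return clients / latency_s

# Shapes roughly matching the 10-client RW rows in the tables above:
local = approx_tps(10, 7.0)            # direct-attached, no replication ack
one_replica = approx_tps(10, 7.0, 1)   # +1 network round-trip
two_replicas = approx_tps(10, 7.0, 2)  # +2 network round-trips
print(round(local), round(one_replica), round(two_replicas))  # → 1429 952 714
```

Those model outputs land close to the measured 10-client RW rows (1,425 / 940 / 712), which is consistent with network replication latency, not disk throughput, being the dominant cost.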

&lt;p&gt;&lt;strong&gt;Finding 4: Dedicated vs shared disk makes almost no difference.&lt;/strong&gt;&lt;br&gt;
Longhorn 2R dedicated (935 Avg RW TPS) vs Longhorn 2R shared (892) — only 4.6% difference. The bottleneck is &lt;strong&gt;network replication&lt;/strong&gt;, not disk I/O contention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 5: Bare metal Local SSD write TPS beats Aurora IO-Optimized.&lt;/strong&gt;&lt;br&gt;
Local SSD Avg RW TPS (1,355) vs Aurora IO-Prov (1,234) — &lt;strong&gt;+9.8% advantage&lt;/strong&gt; at baseline, before any CPU or kernel tuning. Aurora's write path pays network replication overhead just like Longhorn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 6: RDS Standard leads overall — but it's burstable.&lt;/strong&gt;&lt;br&gt;
RDS Standard's 4,826 overall avg and 2,250 Avg RW TPS come from t3 CPU burst credits. Once those credits are exhausted under sustained load, performance drops significantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write-intensive OLTP&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local SSD&lt;/strong&gt; — 52% higher write TPS vs Longhorn 2R&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read-heavy (API, reporting)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Longhorn is fine&lt;/strong&gt; — zero read overhead vs local SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HA with block-level durability&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Longhorn 2R&lt;/strong&gt; — accept write penalty, gain replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best write + HA&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local SSD + CNPG streaming replication&lt;/strong&gt; — no storage network in write path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed simplicity&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Aurora Standard&lt;/strong&gt; — competitive write TPS, no ops overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://dev.to/ionehouten/bare-metal-vs-aws-rds-a-deep-dive-into-numa-aware-tuning-and-postgresql-performance-part-2-2daa"&gt;Part 2: CPU/NUMA pinning + HugePages — pushing bare metal write performance even further past Aurora.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Environment Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudNativePG&lt;/strong&gt;: v1.24 on Kubernetes 1.31&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host&lt;/strong&gt;: Bare Metal 32-Core (16 Physical / 16 HT), NUMA-Aware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Samsung SM863a Enterprise SSD RAID 1 (SAS Interface)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL config&lt;/strong&gt;: Intentionally matched to AWS t3.large defaults for fair comparison&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: Single instance — no HA, no read replicas, no connection pooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Region&lt;/strong&gt;: ap-southeast-3 (Indonesia)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale Factor&lt;/strong&gt;: 100 (~10M rows, ~1.5 GB table)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark runner&lt;/strong&gt;: Kubernetes-native pgbench Job — &lt;a href="https://github.com/ionehouten/devops-kangservice/tree/main/kubernetes/benchmark/postgres" rel="noopener noreferrer"&gt;source on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
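&lt;p&gt;&lt;em&gt;For reference, the benchmark Job boils down to pgbench invocations along these lines. This is a sketch, not the repo's exact flags: the host, user, duration, and the mapping of "RW" to pgbench's built-in &lt;code&gt;-N&lt;/code&gt; (simple-update) script are assumptions; the TPC-B-like workload is pgbench's default script and RO is &lt;code&gt;-S&lt;/code&gt; (select-only).&lt;/em&gt;&lt;/p&gt;

```python
# Builds hypothetical pgbench command lines for the three workload types.
# Hostname, user, and duration are placeholders — see the linked repo for the real Job.
SCALE = 100        # matches the article's scale factor (~10M rows)
DURATION = 60      # seconds per run (assumed; not stated in the article)

def pgbench_cmd(workload, clients, threads):
    base = ["pgbench", "-h", "postgres", "-U", "app",
            "-c", str(clients), "-j", str(threads), "-T", str(DURATION)]
    if workload == "ro":
        base.append("-S")      # built-in select-only script
    elif workload == "rw":
        base.append("-N")      # built-in simple-update script (assumed = "RW")
    return base                # "tpcb": pgbench's default TPC-B-like script

print(" ".join(pgbench_cmd("ro", 25, 4)))
```

Initialization would be a one-time `pgbench -i -s 100` against the same database before the matrix of client/thread combinations runs.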

&lt;p&gt;&lt;em&gt;— Iwan Setiawan, Hybrid Cloud &amp;amp; Platform Architect · &lt;a href="https://portfolio.kangservice.cloud" rel="noopener noreferrer"&gt;portfolio.kangservice.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgressql</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
