Bare Metal vs. AWS RDS: CPU/NUMA Pinning and HugePages — How We Beat Aurora on Write Throughput (Part 2)
In Part 1, we established storage baselines — Local SSD vs Longhorn vs AWS managed PostgreSQL. This article goes deeper: CPU/NUMA pinning and HugePages push bare metal write performance past Aurora IO-Optimized at every concurrency level.
In Part 1, we ended with CNPG Local SSD — bare metal with direct-attached storage and AWS-matched PostgreSQL config. Already leading Aurora on write TPS at baseline. The question was: how much further can we push it without adding hardware?
Two steps. Significant results.
Setup Recap
Same constraint as Part 1: 2 vCPU / 8 GB RAM, single instance, no HA. Same PostgreSQL config matched to AWS defaults. Same benchmark: pgbench · scale factor 100 · 60s per run · 39 runs · ap-southeast-3.
Where we left off — CNPG Local SSD (Baseline):
| Clients | RO TPS | RO Lat | RW TPS | RW Lat | TPC-B TPS | TPC-B Lat |
|---|---|---|---|---|---|---|
| 1 | 749 | 1.34 ms | 134 | 7.48 ms | 99 | 10.10 ms |
| 10 | 7,675 | 1.30 ms | 1,425 | 7.02 ms | 1,031 | 9.70 ms |
| 25 | 6,788 | 3.68 ms | 1,560 | 16.02 ms | 1,073 | 23.30 ms |
| 50 | 6,430 | 7.78 ms | 1,550 | 32.27 ms | 996 | 50.18 ms |
| 100 | 6,092 | 16.41 ms | 1,464 | 68.32 ms | 902 | 110.92 ms |
The 3-Layer Tuning Stack
Most PostgreSQL performance articles stop at database config. This one goes deeper.
The performance gains in this article come from tuning at three layers simultaneously — bare metal KVM hypervisor, VM/OS, and Kubernetes pod spec. Each layer is required for the next to work correctly.
Layer 1: KVM Hypervisor (Bare Metal Host)
<domain type='kvm'>
<!-- NUMA: all VM memory from NUMA node 1 only -->
<numatune>
<memory mode='strict' nodeset='1'/>
</numatune>
<!-- CPU pinning: each vCPU mapped to specific physical core -->
<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='8'/>
<vcpupin vcpu='1' cpuset='9'/>
<!-- cores 8-13, 28-29 — all on NUMA node 1 -->
</cputune>
<!-- HugePages: VM uses host HugePages, memory locked (no swap) -->
<memoryBacking>
<hugepages/>
<locked/>
</memoryBacking>
<!-- host-passthrough: CPU features exposed directly to VM -->
<cpu mode='host-passthrough' check='none'/>
<!-- Disable memory ballooning: hypervisor cannot steal VM memory -->
<memballoon model='none'/>
</domain>
What this achieves:
- `mode='strict' nodeset='1'` — zero cross-NUMA memory access. PostgreSQL's shared buffer pool and its pinned CPU cores sit on the same NUMA node. This is the primary driver of the 7.48ms → 1.81ms write latency drop at 1 client.
- `<locked/>` — VM memory is non-swappable. The shared buffer pool stays in RAM permanently.
- `host-passthrough` — the VM inherits the host CPU instruction set, hardware prefetcher, and cache optimizations directly.
- `memballoon model='none'` — the hypervisor cannot reclaim memory from this VM for other VMs. Fixed allocation.
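The core requirement of Layer 1 can be checked programmatically: every pinned vCPU must land on a core owned by the NUMA node that `<numatune>` restricts memory to. A small sanity-check sketch, assuming the host topology described in the XML comment above (NUMA node 1 owning cores 8–13 and 28–29); the trimmed domain XML and `NODE_CORES` mapping are illustrative:

```python
# Verify that all vcpupin cpusets fall on the NUMA node named in <numatune>.
# ASSUMPTION: node 1 owns cores 8-13 and 28-29 (per the XML comment above);
# on a real host, derive NODE_CORES from `numactl -H` output.
import xml.etree.ElementTree as ET

DOMAIN_XML = """
<domain type='kvm'>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
  </cputune>
</domain>
"""

NODE_CORES = {"1": {8, 9, 10, 11, 12, 13, 28, 29}}  # node id -> physical cores

def check_numa_affinity(domain_xml: str) -> bool:
    root = ET.fromstring(domain_xml)
    nodeset = root.find("./numatune/memory").get("nodeset")
    allowed = NODE_CORES[nodeset]
    pins = [int(p.get("cpuset")) for p in root.findall("./cputune/vcpupin")]
    # True only if every pinned core sits on the memory's NUMA node
    return all(core in allowed for core in pins)

print(check_numa_affinity(DOMAIN_XML))  # True for the snippet above
```

A pin to any core outside node 1 (say, `cpuset='0'`) would silently reintroduce cross-NUMA traffic, which is exactly the failure mode this check guards against.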
Layer 2: VM / OS Level
# /etc/sysctl.conf on the VM
echo "vm.nr_hugepages = 8192" >> /etc/sysctl.conf
sysctl -p
# CPU governor
cpupower frequency-set -g performance
HugePages must be pre-allocated at OS boot before PostgreSQL starts — they cannot be allocated on-demand. 8192 × 2MB = 16GB pre-allocated, enough to cover the 8Gi hugepages requested by the pod with headroom. The performance governor eliminates clock speed throttling for bursty query patterns.
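The sizing arithmetic above can be sketched as a small helper. This is only the basic page-count calculation, assuming 2 MiB HugePages; real sizing should also leave room for WAL buffers and other shared memory:

```python
# HugePages sizing: how many 2 MiB pages to pre-allocate at boot for a
# given target. A back-of-envelope sketch, not an exact shared-memory formula.
import math

HUGEPAGE_SIZE = 2 * 1024 * 1024  # 2 MiB

def nr_hugepages(target_bytes: int) -> int:
    return math.ceil(target_bytes / HUGEPAGE_SIZE)

GiB = 1024 ** 3
print(nr_hugepages(8 * GiB))   # 4096 pages cover the pod's 8Gi request
print(nr_hugepages(16 * GiB))  # 8192 pages = the vm.nr_hugepages value above
```

The 8192-page setting therefore doubles the pod's minimum requirement, which is the headroom mentioned above.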
Layer 3: Kubernetes Pod Spec
resources:
  limits:
    cpu: '2'
    hugepages-2Mi: 8Gi   # request HugePages from OS pool
    memory: 4Gi          # regular memory (non-huge) — separate accounting
  requests:
    cpu: '2'             # requests = limits = Guaranteed QoS class
    hugepages-2Mi: 8Gi
    memory: 4Gi
What this achieves:
- `requests = limits` → Guaranteed QoS class — Kubernetes will not evict this pod under memory pressure. Other pods die first.
- `hugepages-2Mi: 8Gi` as a separate resource → HugePages are tracked independently from regular memory. The 6 GB shared_buffers fits within the 8Gi hugepages allocation with headroom.
- `cpu: requests = limits` → enables the CPU Manager `static` policy — Kubernetes pins the pod to exclusive physical cores, which is what makes the NUMA affinity from Layer 1 effective.
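The two conditions the pod spec relies on can be expressed as a quick check. This is a simplified sketch of the Kubernetes rules (Guaranteed QoS and static CPU Manager eligibility), not the full kubelet logic; `POD_RESOURCES` mirrors the spec above:

```python
# (1) requests == limits for every resource -> Guaranteed QoS class.
# (2) Guaranteed QoS + whole-number CPU -> eligible for exclusive cores
#     under the CPU Manager `static` policy.
POD_RESOURCES = {
    "limits":   {"cpu": "2", "hugepages-2Mi": "8Gi", "memory": "4Gi"},
    "requests": {"cpu": "2", "hugepages-2Mi": "8Gi", "memory": "4Gi"},
}

def is_guaranteed(resources: dict) -> bool:
    return resources["limits"] == resources["requests"]

def gets_exclusive_cores(resources: dict) -> bool:
    cpu = resources["limits"]["cpu"]
    return is_guaranteed(resources) and cpu.isdigit()  # whole cores only

print(is_guaranteed(POD_RESOURCES))        # True
print(gets_exclusive_cores(POD_RESOURCES)) # True
```

Note that a fractional CPU limit like `'1.5'` would keep Guaranteed QoS but lose exclusive core pinning, undoing the NUMA work at Layer 1.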
Why All Three Layers Matter
Remove any one layer and performance degrades:
| Remove | Impact |
|---|---|
| KVM NUMA pinning | Cross-NUMA memory access → +3–4 ms write latency per hop |
| `<locked/>` | Memory swappable → latency spikes under memory pressure |
| CPU governor | Clock throttling → latency spikes on short transactions |
| Kubernetes Guaranteed QoS | Pod can be evicted or CPU-throttled under node pressure |
| HugePages | TLB pressure → higher latency at high concurrency |
This is why the benchmark results are reproducible but not trivially so — you need all three layers configured correctly.
Tuning 1: CPU Pinning + NUMA + Performance Governor
Before tuning, PostgreSQL was allocated 2 vCPU with no CPU affinity — running on whatever cores the kernel scheduled, potentially crossing NUMA boundaries on every memory access, with clock speed throttled by the default powersave governor.
Three changes applied simultaneously:
1. Dedicated CPU cores (Kubernetes CPU Manager: static policy)
Pins the PostgreSQL pod to specific physical cores. Eliminates context switching with other workloads.
2. CPU governor: powersave → performance
cpupower frequency-set -g performance
Default governor throttles clock speed at low load. Every short transaction pays a ramp-up penalty.
3. NUMA pinning
PostgreSQL process pinned to cores on the same NUMA node as its memory allocation. Cross-NUMA memory access adds 30–40% latency on NUMA-enabled systems — our 32-core host is NUMA-aware.
Tuning 1 Results
| Clients | RO TPS | RO Lat | RW TPS | RW Lat | TPC-B TPS | TPC-B Lat |
|---|---|---|---|---|---|---|
| 1 | 2,480 | 0.40 ms | 552 | 1.81 ms | 380 | 2.63 ms |
| 10 | 8,066 | 1.24 ms | 1,909 | 5.24 ms | 1,265 | 7.91 ms |
| 25 | 7,770 | 3.22 ms | 1,902 | 13.14 ms | 1,233 | 20.27 ms |
| 50 | 7,384 | 6.77 ms | 1,786 | 27.99 ms | 1,173 | 42.62 ms |
| 100 | 6,939 | 14.41 ms | 1,657 | 60.36 ms | 1,065 | 93.87 ms |
vs Baseline:
| Metric | Baseline | Tuning 1 | Delta |
|---|---|---|---|
| Avg RO TPS | 6,111 | 6,896 | +12.8% |
| Avg RW TPS | 1,355 | 1,659 | +22.4% |
| Avg RW Lat | 30.02 ms | 25.44 ms | -15.3% |
| RW Lat (1c) | 7.48 ms | 1.81 ms | -75.8% |
The single-client write latency drop from 7.48ms to 1.81ms is the most dramatic — this is the NUMA penalty being eliminated. Short transactions no longer wait for cross-NUMA memory access.
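For a closed-loop benchmark like pgbench, throughput and latency are tied together by Little's law: TPS ≈ clients / average latency. A short sketch using two rows from the Tuning 1 table as a sanity check (numbers from the table above):

```python
# Little's law for a closed-loop benchmark: each client issues the next
# transaction as soon as the previous one completes, so
#   TPS ~= clients / average latency.
def expected_tps(clients: int, latency_ms: float) -> float:
    return clients / (latency_ms / 1000.0)

print(round(expected_tps(1, 1.81)))   # ~552, matching the measured 552 RW TPS
print(round(expected_tps(10, 5.24)))  # ~1908, vs 1,909 measured
```

That the measured numbers line up this closely suggests the runs were latency-bound rather than throttled, i.e. cutting latency is what raised TPS.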
Tuning 2: HugePages
HugePages reduce TLB (Translation Lookaside Buffer) pressure by mapping PostgreSQL's shared buffer pool with 2 MB pages instead of the default 4 KB. Fewer TLB entries = fewer TLB misses under concurrent access.
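The arithmetic behind that claim: mapping the 6 GB shared buffer pool takes vastly fewer page-table entries at 2 MiB than at 4 KiB. A minimal sketch:

```python
# TLB entries needed to map shared_buffers at each page size.
GiB = 1024 ** 3
shared_buffers = 6 * GiB

entries_4k = shared_buffers // (4 * 1024)         # default 4 KiB pages
entries_2m = shared_buffers // (2 * 1024 * 1024)  # 2 MiB HugePages

print(entries_4k)                # 1,572,864 entries
print(entries_2m)                # 3,072 entries
print(entries_4k // entries_2m)  # 512x fewer mappings to cache in the TLB
```

With only ~3K mappings, the working set fits comfortably in the TLB, which is why the benefit shows up mostly as latency stability under concurrency rather than raw TPS.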
Enabled at three levels:
# 1. VM OS — pre-allocate HugePages at boot
echo "vm.nr_hugepages = 8192" >> /etc/sysctl.conf
sysctl -p
# 2. Pod Resources — request HugePages as dedicated Kubernetes resource
resources:
  limits:
    cpu: '2'
    hugepages-2Mi: 8Gi   # request HugePages from OS pool
    memory: 4Gi          # regular memory (non-huge) — separate accounting
  requests:
    cpu: '2'             # requests = limits = Guaranteed QoS class
    hugepages-2Mi: 8Gi
    memory: 4Gi
# 3. PostgreSQL
huge_pages = on
Why requests = limits? This gives the pod the Guaranteed QoS class — Kubernetes will not evict or throttle it under resource pressure. It also enables the CPU Manager `static` policy to pin exclusive physical cores to this pod, which is what makes NUMA affinity effective.
Tuning 2 Results
| Clients | RO TPS | RO Lat | RW TPS | RW Lat | TPC-B TPS | TPC-B Lat |
|---|---|---|---|---|---|---|
| 1 | 2,558 | 0.39 ms | 562 | 1.78 ms | 386 | 2.59 ms |
| 10 | 8,325 | 1.20 ms | 1,903 | 5.25 ms | 1,276 | 7.84 ms |
| 25 | 8,205 | 3.05 ms | 1,954 | 12.79 ms | 1,254 | 19.94 ms |
| 50 | 7,892 | 6.34 ms | 1,875 | 26.67 ms | 1,215 | 41.16 ms |
| 100 | 7,485 | 13.36 ms | 1,725 | 57.97 ms | 1,111 | 90.01 ms |
vs Tuning 1:
| Metric | Tuning 1 | Tuning 2 | Delta |
|---|---|---|---|
| Avg RO TPS | 6,896 | 7,232 | +4.9% |
| Avg RW TPS | 1,659 | 1,706 | +2.8% |
| Avg RW Lat | 25.44 ms | 24.44 ms | -3.9% |
Incremental improvement — HugePages reduce TLB contention at high concurrency. The impact is smaller than NUMA pinning but consistent across all workload types.
Full Tuning Journey
| Step | Key Change | Avg RO TPS | Avg RW TPS | Avg RW Lat | Overall Avg |
|---|---|---|---|---|---|
| Baseline (Local SSD) | AWS-matched config | 6,111 | 1,355 | 30.02 ms | 2,796 |
| Tuning 1 | CPU/NUMA + perf governor | 6,896 | 1,659 | 25.44 ms | 3,214 |
| Tuning 2 | HugePages | 7,232 | 1,706 | 24.44 ms | 3,351 |
Write latency progression (1 client):
Baseline 7.48 ms ████████████████████████████████████████
Tuning 1 1.81 ms ████████
Tuning 2 1.78 ms ████████
A 76% write latency reduction from Baseline → Tuning 2. Same hardware, same 2 vCPU / 8 GB allocation.
Final Comparison: CNPG Tuning 2 vs AWS
Average RW TPS — All Environments
| Environment | Avg RO TPS | Avg RW TPS | Avg RW Lat | Overall Avg |
|---|---|---|---|---|
| AWS RDS Standard | 10,724 | 2,250 | 17.30 ms | 4,826 |
| AWS Aurora IO-Prov | 8,370 | 1,234 | 29.72 ms | 3,480 |
| CNPG Tuning 2 (Final) | 7,232 | 1,706 | 24.44 ms | 3,351 |
| AWS Aurora Standard | 8,039 | 1,162 | 31.45 ms | 3,326 |
| CNPG Tuning 1 | 6,896 | 1,659 | 25.44 ms | 3,214 |
| CNPG Local SSD (Baseline) | 6,111 | 1,355 | 30.02 ms | 2,796 |
CNPG Tuning 2 Overall Avg (3,351) nearly matches Aurora IO-Optimized (3,480) — just -3.7% difference.
On write TPS: CNPG Tuning 2 (1,706) beats Aurora IO-Optimized (1,234) by +38%.
On write latency: CNPG Tuning 2 (24.44ms) beats Aurora IO-Optimized (29.72ms) by -17.7%.
The honest picture: With 2 vCPU and mid-range SAS SSD, CNPG Tuning 2 matches Aurora IO-Optimized on overall throughput (-3.7%) while beating it by 38% on write TPS. Aurora leads on reads (~14% higher Avg RO TPS) — this reflects its distributed read cache architecture, not a config gap. We verified by pushing PostgreSQL to its limit (shared_buffers=6GB, random_page_cost=1.1, effective_io_concurrency=200) and the read ceiling held. For write-intensive OLTP, bare metal wins. For read-heavy analytical workloads, Aurora's distributed cache is worth paying for.
⚠️ The RDS Standard Caveat: Burstable CPU
RDS Standard (t3.large) leads the benchmark at 4,826 overall avg — but this number requires an important caveat.
t3 instances use a CPU credit system:
- Baseline CPU utilization: 30% (for t3.large)
- Above baseline = consuming burst credits
- When credits are exhausted: performance drops to the 30% baseline
Each benchmark run is 60 seconds, with the full test suite taking ~50 minutes total — within the burst window for t3.large. Our results therefore reflect peak burst performance, which is valid for this benchmark duration.
However, in a production workload running continuously 24/7, RDS Standard t3.large performance will drop once CPU credits are depleted:
t3.large burst performance: ~4,826 avg TPS ← what our ~50 min benchmark measured
t3.large baseline CPU: 30% of full capacity
t3.large sustained (24/7): significantly lower once credits exhaust
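The exhaustion timeline can be estimated from AWS's published t3.large figures (36 CPU credits earned per hour, 864 maximum accrued balance, 1 credit = 1 vCPU at 100% for 1 minute). A back-of-envelope sketch, not billing-accurate math:

```python
# Rough t3.large credit-exhaustion model.
# ASSUMPTION: AWS's published t3.large numbers below are current for your region.
VCPUS = 2
EARN_PER_HOUR = 36   # credits earned per hour on t3.large
MAX_BALANCE = 864    # 24h accrual cap (24 * 36)

def hours_until_exhausted(cpu_util: float, start_balance: float = MAX_BALANCE) -> float:
    burn = VCPUS * 60 * cpu_util   # credits consumed per hour at this utilization
    net = burn - EARN_PER_HOUR     # net drain per hour
    return float("inf") if net <= 0 else start_balance / net

print(round(hours_until_exhausted(1.0), 1))  # ~10.3 h of sustained 100% CPU
print(hours_until_exhausted(0.30))           # inf: 30% is the sustainable baseline
```

So a fully loaded t3.large starting from a full credit balance holds burst performance for roughly 10 hours, after which it drops to the 30% baseline — well past our ~50 minute benchmark, but well short of a 24/7 workload.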
If you need sustained, predictable performance on AWS, consider:
| Option | vCPU | RAM | Key Difference |
|---|---|---|---|
| RDS t3.large | 2 | 8 GB | Burstable — our benchmark used this |
| RDS m6i.large | 2 | 8 GB | Non-burstable, dedicated CPU, consistent performance |
| RDS m7g.large | 2 | 8 GB | Graviton3, non-burstable, better price/performance |
| Aurora Serverless v2 | 2 ACU | — | Auto-scales, consistent, higher baseline cost |
For a truly fair comparison against bare metal with consistent, non-burstable performance, RDS m6i.large or m7g.large would be the appropriate AWS counterpart — not t3.large.
Bottom line: Our benchmark results for RDS Standard are valid — the ~50 minute test ran within the burst window. But if your production workload runs continuously 24/7, RDS t3.large will eventually underperform these numbers once CPU credits exhaust. CNPG Tuning 2's 3,351 overall avg is consistent regardless of duration — no burst credits, no performance cliffs.
Per-Client Write TPS Breakdown
| Clients | Aurora IO-Prov | CNPG Tuning 2 | Aurora Standard | RDS Standard |
|---|---|---|---|---|
| 1 | 285 | 562 🥇 | 191 | 253 |
| 10 | 984 | 1,903 🥇 | 922 | 1,881 |
| 25 | 1,278 | 1,954 | 1,179 | 2,839 🥇 |
| 50 | 1,472 | 1,875 | 1,384 | 2,620 🥇 |
| 100 | 1,623 | 1,725 | 1,557 | 2,585 🥇 |
CNPG Tuning 2 beats Aurora IO-Optimized on RW TPS at every concurrency level. RDS Standard leads at 25–100 clients due to t3 burst credits.
Per-Client Write Latency
| Clients | Aurora IO-Prov | CNPG Tuning 2 | Aurora Standard |
|---|---|---|---|
| 1 | 3.51 ms | 1.78 ms 🥇 | 5.23 ms |
| 10 | 10.16 ms | 5.25 ms 🥇 | 10.85 ms |
| 25 | 19.57 ms | 12.79 ms 🥇 | 21.20 ms |
| 50 | 33.96 ms | 26.67 ms 🥇 | 36.13 ms |
| 100 | 61.63 ms | 57.97 ms 🥇 | 64.22 ms |
Write latency: bare metal wins at every concurrency level vs Aurora.
Why Aurora Loses on Write Latency
Aurora replicates every write to its distributed storage fleet before acknowledging:
PostgreSQL → WAL → Aurora storage network → 2/3 replicas ack → done
On bare metal with NUMA-pinned CPUs and local SSD:
PostgreSQL → WAL buffer (HugePages) → local SSD → done
At 1 client: Aurora write path = 3.51ms, bare metal = 1.78ms — nearly 2× faster. At 100 clients, the gap narrows as both are network/IO bound, but bare metal still leads.
Platform Selection Guide
| Workload | Best Choice | Reason |
|---|---|---|
| Write-intensive OLTP | CNPG Tuning 2 | Best write TPS and latency vs Aurora |
| Read-heavy (API, reporting) | Aurora IO-Opt | ~14% higher Avg RO TPS vs bare metal tuned |
| Burst/unpredictable load | RDS Standard | t3 burst credits handle spikes |
| Cost-sensitive, stable load | CNPG Tuning 2 | Aurora-level write perf at a fraction of the cost |
| Managed simplicity | Aurora Standard | Competitive overall, no ops overhead |
Key Takeaways
- NUMA pinning is the biggest single tuning lever. Tuning 1 (CPU/NUMA + performance governor) delivered +22% Avg RW TPS and -76% write latency at 1 client — more impact than any PostgreSQL config change.
- HugePages: consistent but incremental. Tuning 2 added +2.8% Avg RW TPS on top of Tuning 1. Worth enabling for latency stability at high concurrency.
- Bare metal beats Aurora IO-Optimized on writes — with mid-range SAS SSDs. +38% Avg RW TPS and -18% write latency. Not NVMe, not top-tier flash: Samsung SM863a SAS drives in RAID 1.
- Aurora's read advantage is architectural, not a config gap. We maxed PostgreSQL config (6 GB shared_buffers, random_page_cost=1.1, effective_io_concurrency=200, maintenance_work_mem=512MB) and reached 86% of Aurora IO-Prov read throughput. The remaining 14% gap comes from Aurora's distributed read cache — a genuine architectural advantage for read-heavy workloads, not something tunable away.
- "2 vCPU" is not equal. Same allocation on NUMA-aware 32-core bare metal with pinned cores outperforms hypervisor-backed t3.large on write-sensitive workloads.
- RDS Standard t3.large benchmark results are valid — but context matters. Our ~50 minute benchmark ran within the burst window, so results accurately reflect t3 burst performance. However, in 24/7 production workloads, performance will drop once CPU credits exhaust (baseline CPU = 30%). For sustained production comparison, m6i.large or m7g.large (non-burstable) is more appropriate than t3.
- Choose based on workload, not hype. Write-intensive OLTP → bare metal wins on both performance and cost. Read-heavy analytical → Aurora's distributed cache is worth paying for.
Environment Details
- CloudNativePG: v1.24 on Kubernetes 1.31
- Host: Bare Metal 32-Core (16 Physical / 16 HT), NUMA-Aware, 32 GB RAM
- Storage: Samsung SM863a Enterprise SSD RAID 1 (SAS Interface) — mid-range enterprise SSD
- PostgreSQL config (Tuning 2): shared_buffers=6GB, huge_pages=on, work_mem=4MB, max_connections=200, wal_buffers=64MB, random_page_cost=1.1, effective_io_concurrency=200, maintenance_work_mem=512MB, checkpoint_completion_target=0.9
- Deployment: Single instance — no HA, no read replicas, no connection pooling
- AWS Region: ap-southeast-3 (Indonesia)
- Instance class: t3.large (2 vCPU, 8 GB) for all AWS environments
- Scale Factor: 100 (~10M rows, ~1.5 GB table)
- Benchmark runner: Kubernetes-native pgbench Job — source on GitHub
← Part 1: Storage Baseline — Longhorn vs Local SSD vs Managed Cloud
— Iwan Setiawan, Hybrid Cloud & Platform Architect · portfolio.kangservice.cloud