Linux performance optimization covers four major subsystems:
| Subsystem | Typical Bottleneck Scenario |
|---|---|
| CPU | Compute-intensive workloads (Nginx, Node.js, math/batch processing) |
| Memory | Database workloads (MySQL) — heavy memory and storage consumption |
| I/O | Disk-bound applications with heavy read/write |
| Network | High-throughput web services |
Section 1: CPU Performance
CPU is the most critical subsystem — responsible for all computation. Modern production servers use multi-core CPUs based on SMP (Symmetric Multiprocessing) architecture. In practice, CPU utilization is often below 5%, meaning significant resource waste.
CPU Cache Hierarchy
# lscpu
L1d cache: 32K ← L1 data cache (static, per-core)
L1i cache: 32K ← L1 instruction cache (static, per-core)
L2 cache: 256K ← dynamic, per-core on most modern CPUs
L3 cache: 8192K ← dynamic, shared across cores
- L1 cache: static cache, split into data and instruction caches
- L2 / L3 cache: dynamic cache; on most modern CPUs L2 is per-core while L3 is shared across cores
CPU Affinity
In SMP systems, the Linux scheduler may run the same thread on different cores across time slices. Since each core has its own L1/L2 caches (not shared with other cores), migrating a thread effectively invalidates its warm cache: the thread's data must be reloaded into the new core's cache, degrading performance.
CPU affinity pins a process to a specific core, maximizing cache hit rate:
# Pin process 73890 to CPU core 0
taskset -pc 0 73890
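To confirm the pin took effect (reusing the example PID above), run taskset with only the PID; it prints the current affinity list, roughly like the comment below:
# Query the current CPU affinity of the same example process
taskset -pc 73890
# pid 73890's current affinity list: 0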
NUMA (Non-Uniform Memory Access)
taskset alone doesn't guarantee local memory allocation. For NUMA architectures, use numactl:
NUMA topology:
┌──────────────┐ ┌──────────────┐
│ CPU Node 0 │ │ CPU Node 1 │
│ Local RAM │ │ Local RAM │
│ (fast) │ │ (fast) │
└──────┬───────┘ └──────┬───────┘
│ remote access (slower) │
└──────────────────────────┘
# View current NUMA configuration
numactl --show
# Bind program to specific NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp
⚠️ Database servers generally should NOT run with the default NUMA local-allocation policy. If the host is NUMA, start the DB with numactl --interleave=all to avoid memory hotspots on a single node.
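A minimal sketch of an interleaved startup, assuming a MySQL server whose binary lives at /usr/sbin/mysqld (adjust the path and options for your installation); numastat afterwards shows whether allocations are spread evenly across nodes:
# Start the database with memory interleaved across all NUMA nodes
numactl --interleave=all /usr/sbin/mysqld --defaults-file=/etc/my.cnf &
# Per-node allocation statistics (numa_hit / numa_miss / interleave_hit)
numastat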
CPU Scheduling Policies
Real-time scheduling (priority 1–99, higher = more urgent):
| Policy | Behavior |
|---|---|
| SCHED_FIFO | Static priority; once running, holds the CPU until a higher-priority task arrives or it yields |
| SCHED_RR | Round-robin with time slices; an expired slice goes to the end of the queue, keeping equal-priority tasks fair |
General scheduling (priority 100–139, lower number = higher priority):
| Policy | Behavior |
|---|---|
| SCHED_OTHER | Default; priority determined by nice and counter values; the least recently scheduled task runs first |
| SCHED_BATCH | For batch processing workloads |
| SCHED_IDLE | For very low-priority background tasks |
# Adjust process priority with nice (-20 to 19, lower = higher priority)
renice 5 <pid>
# Modify real-time scheduling priority
chrt -r -p 50 <pid>
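Before changing anything, it is worth checking what a process currently runs under; chrt -p with no priority prints the policy and priority (the PID is a placeholder):
# Show current scheduling policy and priority
chrt -p <pid>
# pid <pid>'s current scheduling policy: SCHED_OTHER
# pid <pid>'s current scheduling priority: 0
# Switch to SCHED_FIFO at priority 50 instead of round-robin
chrt -f -p 50 <pid>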
Context Switches
The Linux kernel treats each core as an independent processor. Each core can run 50–50,000 processes. Each thread gets a time slice; when it expires or is preempted, a context switch occurs.
The more context switches, the heavier the kernel scheduling overhead.
Run Queue
Each CPU has a run queue. A thread is either sleeping (blocked on I/O) or runnable (waiting for CPU time).
load = currently running threads + threads in run queue
Example: 2 cores, 2 running + 4 queued → load = 6
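The load averages Linux reports follow the same idea; a quick sanity check is to compare them against the core count (standard tools, example output only):
# 1-, 5- and 15-minute load averages (example output shown below)
cat /proc/loadavg
# 6.03 5.87 5.90 3/612 20122
# Number of cores to compare against
nproc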
CPU Performance Targets
Healthy CPU metrics:
┌─────────────────────────────────────────┐
│ us (user) 60% – 70% │
│ sy (system) 30% – 35% │
│ id (idle) 0% – 5% │
│ run queue ≤ 4 per core (ideal) │
└─────────────────────────────────────────┘
# Monitor with vmstat (1-second intervals, 5 samples)
vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa
3 0 1150840 271628 260684 5530984 0 0 2 1 0 0 22 4 73 0
5 0 1150840 270264 260684 5531032 0 0 0 0 5873 6085 13 13 73 0
High in (interrupts) and cs (context switches) indicate the kernel is constantly switching processes and servicing hardware requests.
Bind Interrupts to a Specific CPU
# smp_affinity is a hexadecimal CPU bitmask: 01 = CPU 0, 02 = CPU 1, 04 = CPU 2, ...
# Bind IRQ 19 to CPU core 2
echo 04 > /proc/irq/19/smp_affinity
The same technique can pin the NIC's TCP interrupts to a single CPU, which reduces scheduler interference (see the network section below).
Section 2: Memory Performance
Linux uses Virtual Memory Management (VMM) — writes go to filesystem cache in memory first, then flush to disk lazily. This is why available memory appears low after running Linux for a while: most is consumed by cache + buffer.
Optimization goal: reduce disk writes, improve write efficiency.
Dirty Data Flush Policy
# Trigger pdflush when dirty data exceeds 10% of physical memory
echo 10 > /proc/sys/vm/dirty_background_ratio
# Flush dirty data that has been in memory longer than 20 seconds (2000 centiseconds)
echo 2000 > /proc/sys/vm/dirty_expire_centisecs
⚠️ Tune carefully — these settings have a large impact on I/O performance.
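Before touching them, record the current values; sysctl reads the same knobs that the /proc paths above expose:
# Current writeback thresholds and timers
sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs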
Swap Tuning
When physical memory is insufficient, Linux uses LRU to swap out cold pages to disk, and swap in when needed.
# 0 = prefer physical memory; 100 = aggressively use swap
echo 10 > /proc/sys/vm/swappiness # recommended for production
Minimize swap usage in production. For Redis, allow memory overcommit so that background saves via fork() do not fail when memory looks tight:
echo 1 > /proc/sys/vm/overcommit_memory
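Values echoed into /proc are lost on reboot. A common way to persist the swap and overcommit settings suggested above is /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/), then a reload:
# Persist across reboots (run as root)
cat >> /etc/sysctl.conf <<'EOF'
vm.swappiness = 10
vm.overcommit_memory = 1
EOF
sysctl -p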
Reclaiming Memory
sync
echo 3 > /proc/sys/vm/drop_caches
# 1 = drop page cache
# 2 = drop dentries and inodes (slab caches)
# 3 = drop both
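To see the effect, compare the buff/cache column of free before and after the drop (numbers will differ per system; older procps versions show buffers and cached separately):
free -h                              # note the buff/cache column
sync
echo 3 > /proc/sys/vm/drop_caches
free -h                              # buff/cache should now be much smaller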
Huge Pages
Large page sizes reduce TLB misses and page table overhead:
cat /proc/meminfo | grep -i huge
# AnonHugePages: transparent huge pages (auto-managed)
# Hugepagesize: 2048 kB (standard huge page size)
# Manually set huge page count
sysctl vm.nr_hugepages=20
32-bit: 4MB huge pages; 64-bit: 2MB huge pages.
Larger pages = less overhead but more internal fragmentation.
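To confirm how many explicit huge pages were actually reserved, and whether transparent huge pages (the AnonHugePages counter above) are active, the following reads are safe on any modern kernel:
# Explicitly reserved huge pages: total vs. still free
grep -E 'HugePages_(Total|Free)' /proc/meminfo
# Current transparent huge page mode, e.g. [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/enabled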
Page Faults
MPF (Major Page Fault): data not in cache → read from disk (expensive)
MnPF (Minor Page Fault): data found in buffer cache → no disk I/O (cheap)
# First run: mostly MPF (cold cache)
/usr/bin/time -v ./myapp
# Second run: mostly MnPF (warm cache)
/usr/bin/time -v ./myapp
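GNU time writes its report to stderr, so the page-fault counters can be filtered out directly; the field names below are what -v actually prints (counts are illustrative):
/usr/bin/time -v ./myapp 2>&1 | grep -i 'page faults'
# Major (requiring I/O) page faults: 120
# Minor (reclaiming a frame) page faults: 12345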
The File Buffer Cache continuously grows to reduce MPF and increase MnPF, until the kernel needs to reclaim memory for other processes. Low free memory ≠ memory pressure — Linux intentionally uses free memory for caching.
Section 3: Disk I/O Performance
The I/O subsystem is typically the slowest part of a Linux system — both due to physical distance from the CPU and mechanical/electrical constraints. Minimize disk I/O wherever possible.
I/O Scheduler
cat /sys/block/sda/queue/scheduler
# noop anticipatory deadline [cfq]
| Scheduler | Description | Best For |
|---|---|---|
| CFQ (default) | Completely Fair Queuing; up to 8 requests per time slice; idles waiting for more I/O from same process | General workloads |
| Deadline | Every request must be served before a deadline | Databases, latency-sensitive |
| noop | No scheduling; FIFO order | SSDs, VMs |
| anticipatory | Deprecated (removed in kernel 2.6.33); pauses briefly after a read, anticipating a nearby follow-up read | Legacy hardware |
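Switching schedulers is a runtime write into the same sysfs file shown above (sda is just the example device; kernels using blk-mq typically offer none, mq-deadline, bfq, and kyber instead):
# Switch sda to the deadline scheduler for this boot
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler
# noop anticipatory [deadline] cfq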
I/O Priority
# Set process I/O priority (1=realtime, 2=best-effort, 3=idle)
ionice -c1 -p <pid> # highest I/O priority for pid
Page Size & Block Size
Linux kernel accesses disk I/O in pages (typically 4KB):
/usr/bin/time -v date # shows page size info
Tune page size and block size based on your application's I/O pattern (large sequential vs. small random).
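A couple of quick checks, assuming an ext4 filesystem on /dev/sda1 (adjust the device); getconf reports the kernel page size and tune2fs the filesystem block size:
# Kernel page size in bytes (usually 4096)
getconf PAGE_SIZE
# Filesystem block size of an ext2/3/4 filesystem
tune2fs -l /dev/sda1 | grep 'Block size'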
DMA (Direct Memory Access)
DMA allows hardware to transfer data directly to/from memory without CPU involvement:
Without DMA: disk → CPU → memory (CPU occupied during transfer)
With DMA: disk ──────▶ memory (CPU free during transfer)
DMA transfer lifecycle: Request → Acknowledge → Transfer → Complete
Writing Data Back to Disk
# Force immediate flush
fsync() # per-file
sync() # system-wide
# pdflush runs periodically if not explicitly called
Useful I/O Monitoring Commands
iotop # per-process I/O usage
lsof # list all open files and file descriptors
iostat # disk I/O statistics
vmstat # combined system stats including I/O
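For iostat in particular, the extended device report is the most useful form; %util near 100% and high await values point at a saturated disk:
# Extended per-device stats, 1-second interval, 5 samples
iostat -dx 1 5
# Watch the await (ms per I/O) and %util columns for each device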
Section 4: Network Performance
For web applications, network performance is critical. Potential bottlenecks include: application response time, Linux network subsystem, NIC, and bandwidth.
NIC Settings
# Check if NIC is in full-duplex mode
ethtool eth0
# Increase MTU for high-bandwidth (≥1Gbps) networks
ifconfig eth0 mtu 9000 up # jumbo frames
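On distributions where ifconfig is no longer installed, the equivalent iproute2 commands are shown below; note that every switch and host on the path must also support jumbo frames, otherwise packets get dropped or fragmented:
# Check current MTU and link state
ip link show eth0
# Set the MTU with iproute2 instead of ifconfig
ip link set dev eth0 mtu 9000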
TCP Buffer Tuning
# TCP read buffer: min / default / max (bytes)
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
# TCP write buffer: min / default / max (bytes)
sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"
TCP Window Scaling
# Disable window scaling and set fixed window size
# (similar to setting JVM -Xms = -Xmx for predictability)
sysctl -w net.ipv4.tcp_window_scaling=0
TCP Connection Reuse
Reusing TIME_WAIT connections avoids the full 3-way handshake overhead — significant performance gain for web servers:
sysctl -w net.ipv4.tcp_tw_reuse=1
# tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12; avoid it on modern kernels
sysctl -w net.ipv4.tcp_tw_recycle=1
Keepalive Timeout
# Release idle persistent connections sooner (default is much longer)
sysctl -w net.ipv4.tcp_keepalive_time=1800 # 1800 seconds
SYN Backlog (DoS Protection)
# Max length of queue for TCP connections not yet ESTABLISHED
# Prevents server crash under SYN flood / DoS attacks
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
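To see whether the backlog is actually under pressure, count half-open connections; ss can filter on TCP state directly:
# Number of connections stuck in SYN_RECV (half-open); the count includes ss's header line
ss -tan state syn-recv | wc -l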
Disable Unnecessary Protocols
# Disable ICMP broadcast responses to reduce noise
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1
Bind Network Interrupts to One CPU
# Bind NIC interrupt to CPU core 1 (reduces scheduler interference)
echo 02 > /proc/irq/<nic_irq>/smp_affinity
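The <nic_irq> placeholder can be looked up in /proc/interrupts; the sketch below assumes the interface is named eth0 and has a single IRQ line (multi-queue NICs expose several):
# Find the IRQ number(s) assigned to eth0
grep eth0 /proc/interrupts
# Check which CPUs currently service that IRQ (hex bitmask)
cat /proc/irq/<nic_irq>/smp_affinity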
Recommended Monitoring Toolset
| Tool | Purpose |
|---|---|
| htop | Interactive process and CPU monitor |
| vmstat | CPU, memory, swap, I/O, context switches |
| iotop | Per-process disk I/O |
| sar | Historical system activity reports |
| strace | Trace system calls for a process |
| iftop | Real-time network bandwidth by connection |
| ss | Socket statistics (faster replacement for netstat) |
| lsof | List open files and file descriptors |
| ethtool | NIC diagnostics and settings |
| mtr | Combined traceroute + ping for network diagnosis |