Linux performance optimization covers four major subsystems:
| Subsystem | Typical Bottleneck Scenario |
|---|---|
| CPU | Compute-intensive workloads (Nginx, Node.js, math/batch processing) |
| Memory | Database workloads (MySQL) — heavy memory and storage consumption |
| I/O | Disk-bound applications with heavy read/write |
| Network | High-throughput web services |
Section 1: CPU Performance
CPU is the most critical subsystem — responsible for all computation. Modern production servers use multi-core CPUs based on SMP (Symmetric Multiprocessing) architecture. In practice, CPU utilization is often below 5%, meaning significant resource waste.
CPU Cache Hierarchy
# lscpu
L1d cache: 32K ← L1 data cache (static, per-core)
L1i cache: 32K ← L1 instruction cache (static, per-core)
L2 cache: 256K ← dynamic, per-core on most modern CPUs
L3 cache: 8192K ← dynamic, shared across cores
- L1 cache: static cache, split into data and instruction caches
- L2 / L3 cache: dynamic cache; on most modern CPUs L2 is per-core while L3 is shared across cores
CPU Affinity
In SMP systems, the Linux scheduler may run the same thread on different cores across time slices. Since each core has its own L1/L2 caches (not shared with other cores), migrating a thread effectively invalidates its warm cache: the thread's data must be reloaded into the new core's cache, degrading performance.
CPU affinity pins a process to a specific core, maximizing cache hit rate:
# Pin process 73890 to CPU core 0
taskset -pc 0 73890
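To confirm the pin took effect (reusing the example PID above), run taskset with only the PID; it prints the current affinity list, roughly like the comment below:
# Query the current CPU affinity of the same example process
taskset -pc 73890
# pid 73890's current affinity list: 0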
NUMA (Non-Uniform Memory Access)
taskset alone doesn't guarantee local memory allocation. For NUMA architectures, use numactl:
NUMA topology:
┌──────────────┐ ┌──────────────┐
│ CPU Node 0 │ │ CPU Node 1 │
│ Local RAM │ │ Local RAM │
│ (fast) │ │ (fast) │
└──────┬───────┘ └──────┬───────┘
│ remote access (slower) │
└──────────────────────────┘
# View current NUMA configuration
numactl --show
# Bind program to specific NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp
⚠️ Database servers generally should NOT run with the default NUMA local-allocation policy. If the host is NUMA, start the DB with numactl --interleave=all to avoid memory hotspots on a single node.
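A minimal sketch of an interleaved startup, assuming a MySQL server whose binary lives at /usr/sbin/mysqld (adjust the path and options for your installation); numastat afterwards shows whether allocations are spread evenly across nodes:
# Start the database with memory interleaved across all NUMA nodes
numactl --interleave=all /usr/sbin/mysqld --defaults-file=/etc/my.cnf &
# Per-node allocation statistics (numa_hit / numa_miss / interleave_hit)
numastat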
CPU Scheduling Policies
Real-time scheduling (priority 1–99, higher = more urgent):
| Policy | Behavior |
|---|---|
| SCHED_FIFO | Static priority; once running, holds the CPU until a higher-priority task arrives or it yields |
| SCHED_RR | Round-robin with time slices; an expired slice goes to the end of the queue, keeping equal-priority tasks fair |
General scheduling (priority 100–139, lower number = higher priority):
| Policy | Behavior |
|---|---|
| SCHED_OTHER | Default; priority determined by nice and counter values; the least recently scheduled task runs first |
| SCHED_BATCH | For batch processing workloads |
| SCHED_IDLE | For very low-priority background tasks |
# Adjust process priority with nice (-20 to 19, lower = higher priority)
renice 5 <pid>
# Modify real-time scheduling priority
chrt -r -p 50 <pid>
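Before changing anything, it is worth checking what a process currently runs under; chrt -p with no priority prints the policy and priority (the PID is a placeholder):
# Show current scheduling policy and priority
chrt -p <pid>
# pid <pid>'s current scheduling policy: SCHED_OTHER
# pid <pid>'s current scheduling priority: 0
# Switch to SCHED_FIFO at priority 50 instead of round-robin
chrt -f -p 50 <pid>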
Context Switches
The Linux kernel treats each core as an independent processor. Each core can run 50–50,000 processes. Each thread gets a time slice; when it expires or is preempted, a context switch occurs.
The more context switches, the heavier the kernel scheduling overhead.
Run Queue
Each CPU has a run queue. A thread is either sleeping (blocked on I/O) or runnable (waiting for CPU time).
load = currently running threads + threads in run queue
Example: 2 cores, 2 running + 4 queued → load = 6
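The load averages Linux reports follow the same idea; a quick sanity check is to compare them against the core count (standard tools, example output only):
# 1-, 5- and 15-minute load averages (example output shown below)
cat /proc/loadavg
# 6.03 5.87 5.90 3/612 20122
# Number of cores to compare against
nproc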
CPU Performance Targets
Healthy CPU metrics:
┌─────────────────────────────────────────┐
│ us (user) 60% – 70% │
│ sy (system) 30% – 35% │
│ id (idle) 0% – 5% │
│ run queue ≤ 4 per core (ideal) │
└─────────────────────────────────────────┘
# Monitor with vmstat (1-second intervals, 5 samples)
vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa
3 0 1150840 271628 260684 5530984 0 0 2 1 0 0 22 4 73 0
5 0 1150840 270264 260684 5531032 0 0 0 0 5873 6085 13 13 73 0
High in (interrupts) and cs (context switches) indicate the kernel is constantly switching processes and servicing hardware requests.
Bind Interrupts to a Specific CPU
# smp_affinity is a hexadecimal CPU bitmask: 01 = CPU 0, 02 = CPU 1, 04 = CPU 2, ...
# Bind IRQ 19 to CPU core 2
echo 04 > /proc/irq/19/smp_affinity
The same technique can pin the NIC's TCP interrupts to a single CPU, which reduces scheduler interference (see the network section below).
Section 2: Memory Performance
Linux uses Virtual Memory Management (VMM) — writes go to filesystem cache in memory first, then flush to disk lazily. This is why available memory appears low after running Linux for a while: most is consumed by cache + buffer.
Optimization goal: reduce disk writes, improve write efficiency.
Dirty Data Flush Policy
# Trigger pdflush when dirty data exceeds 10% of physical memory
echo 10 > /proc/sys/vm/dirty_background_ratio
# Flush dirty data that has been in memory longer than 20 seconds (2000 centiseconds)
echo 2000 > /proc/sys/vm/dirty_expire_centisecs
⚠️ Tune carefully — these settings have a large impact on I/O performance.
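Before touching them, record the current values; sysctl reads the same knobs that the /proc paths above expose:
# Current writeback thresholds and timers
sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs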
Swap Tuning
When physical memory is insufficient, Linux uses LRU to swap out cold pages to disk, and swap in when needed.
# 0 = prefer physical memory; 100 = aggressively use swap
echo 10 > /proc/sys/vm/swappiness # recommended for production
Minimize swap usage in production. For Redis, allow memory overcommit so that background saves via fork() do not fail when memory looks tight:
echo 1 > /proc/sys/vm/overcommit_memory
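Values echoed into /proc are lost on reboot. A common way to persist the swap and overcommit settings suggested above is /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/), then a reload:
# Persist across reboots (run as root)
cat >> /etc/sysctl.conf <<'EOF'
vm.swappiness = 10
vm.overcommit_memory = 1
EOF
sysctl -p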
Reclaiming Memory
sync
echo 3 > /proc/sys/vm/drop_caches
# 1 = drop page cache
# 2 = drop dentries and inodes (slab caches)
# 3 = drop both
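To see the effect, compare the buff/cache column of free before and after the drop (numbers will differ per system; older procps versions show buffers and cached separately):
free -h                              # note the buff/cache column
sync
echo 3 > /proc/sys/vm/drop_caches
free -h                              # buff/cache should now be much smaller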
Huge Pages
Large page sizes reduce TLB misses and page table overhead:
cat /proc/meminfo | grep -i huge
# AnonHugePages: transparent huge pages (auto-managed)
# Hugepagesize: 2048 kB (standard huge page size)
# Manually set huge page count
sysctl vm.nr_hugepages=20
32-bit: 4MB huge pages; 64-bit: 2MB huge pages.
Larger pages = less overhead but more internal fragmentation.
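To confirm how many explicit huge pages were actually reserved, and whether transparent huge pages (the AnonHugePages counter above) are active, the following reads are safe on any modern kernel:
# Explicitly reserved huge pages: total vs. still free
grep -E 'HugePages_(Total|Free)' /proc/meminfo
# Current transparent huge page mode, e.g. [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/enabled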
Page Faults
MPF (Major Page Fault): data not in cache → read from disk (expensive)
MnPF (Minor Page Fault): data found in buffer cache → no disk I/O (cheap)
# First run: mostly MPF (cold cache)
/usr/bin/time -v ./myapp
# Second run: mostly MnPF (warm cache)
/usr/bin/time -v ./myapp
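GNU time writes its report to stderr, so the page-fault counters can be filtered out directly; the field names below are what -v actually prints (counts are illustrative):
/usr/bin/time -v ./myapp 2>&1 | grep -i 'page faults'
# Major (requiring I/O) page faults: 120
# Minor (reclaiming a frame) page faults: 12345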
The File Buffer Cache continuously grows to reduce MPF and increase MnPF, until the kernel needs to reclaim memory for other processes. Low free memory ≠ memory pressure — Linux intentionally uses free memory for caching.
Section 3: Disk I/O Performance
The I/O subsystem is typically the slowest part of a Linux system — both due to physical distance from the CPU and mechanical/electrical constraints. Minimize disk I/O wherever possible.
I/O Scheduler
cat /sys/block/sda/queue/scheduler
# noop anticipatory deadline [cfq]
| Scheduler | Description | Best For |
|---|---|---|
| CFQ (default) | Completely Fair Queuing; up to 8 requests per time slice; idles waiting for more I/O from same process | General workloads |
| Deadline | Every request must be served before a deadline | Databases, latency-sensitive |
| noop | No scheduling; FIFO order | SSDs, VMs |
| anticipatory | Deprecated (removed in kernel 2.6.33); pauses briefly after a read, anticipating a nearby follow-up read | Legacy hardware |
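Switching schedulers is a runtime write into the same sysfs file shown above (sda is just the example device; kernels using blk-mq typically offer none, mq-deadline, bfq, and kyber instead):
# Switch sda to the deadline scheduler for this boot
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler
# noop anticipatory [deadline] cfq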
I/O Priority
# Set process I/O priority (1=realtime, 2=best-effort, 3=idle)
ionice -c1 -p <pid> # highest I/O priority for pid
Page Size & Block Size
Linux kernel accesses disk I/O in pages (typically 4KB):
/usr/bin/time -v date # shows page size info
Tune page size and block size based on your application's I/O pattern (large sequential vs. small random).
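A couple of quick checks, assuming an ext4 filesystem on /dev/sda1 (adjust the device); getconf reports the kernel page size and tune2fs the filesystem block size:
# Kernel page size in bytes (usually 4096)
getconf PAGE_SIZE
# Filesystem block size of an ext2/3/4 filesystem
tune2fs -l /dev/sda1 | grep 'Block size'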
DMA (Direct Memory Access)
DMA allows hardware to transfer data directly to/from memory without CPU involvement:
Without DMA: disk → CPU → memory (CPU occupied during transfer)
With DMA: disk ──────▶ memory (CPU free during transfer)
DMA transfer lifecycle: Request → Acknowledge → Transfer → Complete
Writing Data Back to Disk
# Force immediate flush
fsync() # per-file
sync() # system-wide
# pdflush runs periodically if not explicitly called
Useful I/O Monitoring Commands
iotop # per-process I/O usage
lsof # list all open files and file descriptors
iostat # disk I/O statistics
vmstat # combined system stats including I/O
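For iostat in particular, the extended device report is the most useful form; %util near 100% and high await values point at a saturated disk:
# Extended per-device stats, 1-second interval, 5 samples
iostat -dx 1 5
# Watch the await (ms per I/O) and %util columns for each device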
Section 4: Network Performance
For web applications, network performance is critical. Potential bottlenecks include: application response time, Linux network subsystem, NIC, and bandwidth.
NIC Settings
# Check if NIC is in full-duplex mode
ethtool eth0
# Increase MTU for high-bandwidth (≥1Gbps) networks
ifconfig eth0 mtu 9000 up # jumbo frames
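On distributions where ifconfig is no longer installed, the equivalent iproute2 commands are shown below; note that every switch and host on the path must also support jumbo frames, otherwise packets get dropped or fragmented:
# Check current MTU and link state
ip link show eth0
# Set the MTU with iproute2 instead of ifconfig
ip link set dev eth0 mtu 9000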
TCP Buffer Tuning
# TCP read buffer: min / default / max (bytes)
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
# TCP write buffer: min / default / max (bytes)
sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"
TCP Window Scaling
# Disable window scaling and set fixed window size
# (similar to setting JVM -Xms = -Xmx for predictability)
sysctl -w net.ipv4.tcp_window_scaling=0
TCP Connection Reuse
Reusing TIME_WAIT connections avoids the full 3-way handshake overhead — significant performance gain for web servers:
sysctl -w net.ipv4.tcp_tw_reuse=1
# tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12; avoid it on modern kernels
sysctl -w net.ipv4.tcp_tw_recycle=1
Keepalive Timeout
# Release idle persistent connections sooner (default is much longer)
sysctl -w net.ipv4.tcp_keepalive_time=1800 # 1800 seconds
SYN Backlog (DoS Protection)
# Max length of queue for TCP connections not yet ESTABLISHED
# Prevents server crash under SYN flood / DoS attacks
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
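To see whether the backlog is actually under pressure, count half-open connections; ss can filter on TCP state directly:
# Number of connections stuck in SYN_RECV (half-open); the count includes ss's header line
ss -tan state syn-recv | wc -l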
Disable Unnecessary Protocols
# Disable ICMP broadcast responses to reduce noise
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1
Bind Network Interrupts to One CPU
# Bind NIC interrupt to CPU core 1 (reduces scheduler interference)
echo 02 > /proc/irq/<nic_irq>/smp_affinity
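The <nic_irq> placeholder can be looked up in /proc/interrupts; the sketch below assumes the interface is named eth0 and has a single IRQ line (multi-queue NICs expose several):
# Find the IRQ number(s) assigned to eth0
grep eth0 /proc/interrupts
# Check which CPUs currently service that IRQ (hex bitmask)
cat /proc/irq/<nic_irq>/smp_affinity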
Recommended Monitoring Toolset
| Tool | Purpose |
|---|---|
| htop | Interactive process and CPU monitor |
| vmstat | CPU, memory, swap, I/O, context switches |
| iotop | Per-process disk I/O |
| sar | Historical system activity reports |
| strace | Trace system calls for a process |
| iftop | Real-time network bandwidth by connection |
| ss | Socket statistics (faster replacement for netstat) |
| lsof | List open files and file descriptors |
| ethtool | NIC diagnostics and settings |
| mtr | Combined traceroute + ping for network diagnosis |