DEV Community

Vlad Levinas

Throughput vs IOPS vs Latency Beyond Storage: Network, Compute, and Cloud Performance Explained

1. Introduction

Most engineers first encounter throughput, IOPS, and latency in the context of storage. You provision an EBS volume, you see three numbers, and you move on. This is a mistake — and it compounds over years of building systems that mysteriously underperform.

These three metrics are not storage concepts. They are fundamental properties of any system that processes work: disks, network interfaces, CPUs, GPUs, load balancers, container runtimes, and Kubernetes control planes. The same relationship between throughput, operations per second, and response time governs how an NVMe drive handles 4KB random reads, how an ENI processes packets, and how an etcd cluster responds to key-value writes.

The reason performance tuning goes wrong is almost always the same: engineers optimize the wrong layer. They increase EBS IOPS when the real bottleneck is cross-AZ network latency. They add more CPU when the application is blocked on memory bandwidth. They scale pods horizontally when etcd is the chokepoint.

This article explains these metrics across every layer that matters in modern cloud infrastructure and gives you the mental model to diagnose bottlenecks correctly.


2. Core Definitions

Throughput

Throughput is the volume of data or work completed per unit of time. It measures capacity — how much the system can move. In storage, it is MB/s. In networking, it is Gbps. In compute, it is operations per second, instructions per cycle, or requests per second.

Think of throughput as the cross-sectional area of a pipe. A wider pipe moves more water per second, but says nothing about how fast any individual drop arrives.

IOPS (Input/Output Operations Per Second)

IOPS counts discrete operations completed per unit of time. In storage, one I/O operation is one read or write request. In networking, the equivalent is packets per second (PPS). In compute, it maps to transactions per second, context switches per second, or system calls per second.

The critical insight: throughput and IOPS are related by operation size.

```
Throughput = IOPS × Operation Size
```

A volume doing 16,000 IOPS at a 256KB block size delivers roughly 4,000 MB/s of throughput. The same volume doing 16,000 IOPS at 4KB delivers only about 64 MB/s. You can hit your IOPS limit long before saturating throughput, or vice versa, and which one you hit first depends entirely on your workload pattern.
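A back-of-the-envelope check of this relationship in Python (using 1 MB = 1024 KB, so the 4 KB case comes out at 62.5 rather than a rounded 64 MB/s):

```python
def throughput_mbs(iops: int, block_size_kb: float) -> float:
    """Throughput in MB/s, given IOPS and a block size in KB."""
    return iops * block_size_kb / 1024  # KB/s -> MB/s

print(throughput_mbs(16_000, 256))  # 4000.0 -- throughput-bound territory
print(throughput_mbs(16_000, 4))    # 62.5   -- IOPS-bound territory
```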

Latency

Latency is the time elapsed between issuing a request and receiving the response. It is measured in microseconds, milliseconds, or seconds depending on the layer.

Latency is not the inverse of throughput. A system can have high throughput and high latency simultaneously — this is the fundamental behavior of pipelining. A satellite link has 600ms latency but can sustain 100 Mbps throughput because multiple requests are in flight concurrently. Conversely, a system can have low latency but low throughput if it can only handle one operation at a time.

The relationship between all three:

```
Effective Throughput = Concurrency × (Operation Size / Latency)
```

This equation is universal. It explains why queue depth matters for disks, why TCP window size matters for networks, and why thread count matters for applications.
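A minimal sketch of the same equation in Python; the numbers are illustrative, chosen to match the 200μs-latency example used in the storage section:

```python
def effective_throughput_mbs(concurrency: int, op_size_kb: float, latency_ms: float) -> float:
    """Effective throughput = concurrency × (operation size / latency)."""
    ops_per_sec = concurrency * (1000 / latency_ms)  # ops completed per second
    return ops_per_sec * op_size_kb / 1000           # KB/s -> MB/s (decimal)

print(effective_throughput_mbs(1, 4, 0.2))   # ~20 MB/s at queue depth 1
print(effective_throughput_mbs(32, 4, 0.2))  # ~640 MB/s: concurrency hides latency
```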

| Metric | Storage Example | Network Example | Compute Example |
|---|---|---|---|
| Throughput | 1,000 MB/s sequential | 25 Gbps link capacity | 3.2 GHz × IPC × cores |
| IOPS | 64,000 random 4KB ops | 14M packets per second | 50,000 requests/sec |
| Latency | 200μs per I/O | 0.5ms RTT within AZ | 2ms P99 response time |

3. Storage Performance

Block Size Determines Which Limit You Hit

Every storage workload has a dominant block size. Databases doing 8KB page reads are IOPS-bound. Video transcoding doing 1MB sequential reads is throughput-bound. Misidentifying this is the single most common storage performance mistake.

| Workload Pattern | Typical Block Size | Bottleneck |
|---|---|---|
| OLTP database | 8–16 KB | IOPS |
| Data warehouse scan | 256 KB – 1 MB | Throughput |
| Log writes | 4–64 KB | IOPS + mixed |
| Object storage GET | Varies (full object) | Throughput |
| etcd (Kubernetes) | 4–8 KB | Latency |

Queue Depth

A single-threaded application issuing synchronous I/O achieves a queue depth of 1. At 200μs latency, this caps throughput at 5,000 IOPS — regardless of what the underlying volume supports. Increasing queue depth (via async I/O, io_uring, or multiple threads) allows the drive to process operations in parallel. NVMe drives are designed for queue depths of 32–256. EBS volumes benefit from queue depths of 4–16 depending on type.
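The 5,000-IOPS ceiling above falls straight out of queue depth divided by per-operation latency; a quick illustration:

```python
def iops_ceiling(queue_depth: int, latency_us: float) -> float:
    """Upper bound on IOPS: outstanding operations divided by per-op latency."""
    return queue_depth / (latency_us / 1_000_000)

print(iops_ceiling(1, 200))   # ~5,000 IOPS: synchronous I/O, queue depth 1
print(iops_ceiling(16, 200))  # ~80,000 IOPS at the same per-op latency
```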

This is exactly the same principle as HTTP/2 multiplexing or TCP window scaling: concurrency hides latency.

Burst vs Sustained Performance

AWS EBS gp3 volumes provide a flat baseline of 3,000 IOPS and 125 MB/s regardless of size; legacy gp2 volumes instead scale performance with volume size and rely on burst credits. The distinction matters enormously for capacity planning:

| Volume Type | Baseline IOPS | Max IOPS | Baseline Throughput | Max Throughput | Latency |
|---|---|---|---|---|---|
| gp3 | 3,000 | 16,000 | 125 MB/s | 1,000 MB/s | sub-ms |
| gp2 (legacy) | 100–16,000 | 16,000 | 128–250 MB/s | 250 MB/s | sub-ms, variable |
| io2 Block Express | provisioned | 256,000 | provisioned | 4,000 MB/s | sub-ms, consistent |
| Local NVMe | N/A (no limit) | 1.5M+ | N/A | 7+ GB/s | <100μs |

gp3 is fine for most workloads. io2 Block Express exists for when you need both high IOPS and predictable sub-millisecond latency — production databases where P99 consistency matters. Instance store NVMe is a different category entirely: no network hop, no EBS controller overhead, microsecond-level latency, but ephemeral.

Distributed Storage (Longhorn, Ceph, OpenEBS)

Running distributed storage in Kubernetes adds at least one network hop per I/O operation. Longhorn replicates data across nodes — a write to a 3-replica Longhorn volume becomes one local write plus two network-replicated writes. This fundamentally changes the performance profile:

Latency increases by the network RTT plus the slowest replica's write time. IOPS drops because each logical operation becomes multiple physical operations. Throughput is bounded by network bandwidth between nodes.

In a homelab K3s cluster with 1 Gbps links, a Longhorn volume with 3 replicas maxes out around 100 MB/s throughput regardless of underlying disk capability. With 10 Gbps links, you might reach 400–600 MB/s. Ceph (with RBD or CephFS) performs better at scale due to its CRUSH algorithm distributing I/O across OSDs, but the principle remains: every network hop adds latency, and replication multiplies write load.

For latency-sensitive workloads (databases, etcd), use local storage or hostPath. Reserve distributed storage for workloads where data availability matters more than raw performance.


4. Network Performance

Bandwidth vs Throughput

Network bandwidth is the raw capacity of the link (e.g., 25 Gbps for a c6i.8xlarge). Network throughput is how much of that bandwidth you actually use. The gap between them is caused by protocol overhead, packet loss, latency (the bandwidth-delay product), and application behavior.

A 25 Gbps link with 1ms RTT and a 64KB TCP window can only sustain about 500 Mbps, because the sender waits for acknowledgments before sending more data. This is why TCP window tuning and BBR congestion control exist.
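The window-limited throughput follows from the bandwidth-delay product: at most one window of data per round trip. A sketch with the numbers from the example:

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """TCP throughput ceiling when the window, not the link, is the limit."""
    bits_per_rtt = window_bytes * 8
    return bits_per_rtt / (rtt_ms / 1000) / 1_000_000  # -> Mbps

print(window_limited_mbps(64 * 1024, 1.0))  # ~524 Mbps, on a 25 Gbps link
```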

Packets Per Second (PPS)

PPS is the network equivalent of IOPS. Every instance type has a PPS limit, and small packets exhaust it before bandwidth is saturated.

Consider a c6i.xlarge with roughly 12.5 Gbps bandwidth and approximately 1.5M PPS. Sending 64-byte packets: 1,500,000 × 64 bytes × 8 bits = 768 Mbps. You hit the PPS ceiling at under 1 Gbps despite having 12.5 Gbps available.
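The same arithmetic in Python, to make the ceiling concrete (the PPS figure is the approximate limit quoted above, not a published spec):

```python
def bandwidth_at_pps_mbps(pps: int, packet_bytes: int) -> float:
    """Bandwidth consumed when the packets-per-second limit binds."""
    return pps * packet_bytes * 8 / 1_000_000

# 64-byte packets exhaust ~1.5M PPS at well under 1 Gbps
print(bandwidth_at_pps_mbps(1_500_000, 64))  # 768.0 Mbps
```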

| Scenario | Packet Size | PPS Needed | Bandwidth Used |
|---|---|---|---|
| HTTP API (small JSON) | ~200 bytes | High | Low |
| Video streaming | ~1400 bytes | Moderate | High |
| DNS resolution | ~60 bytes | Very high | Very low |
| Database replication | ~1400 bytes | Moderate | Moderate–High |

This is why DNS servers, load balancers, and services handling many small requests per second need instance types with high PPS limits, not just high bandwidth.

Round-Trip Time (RTT)

RTT is network latency — the time for a packet to travel to a destination and back.

| Path | Typical RTT |
|---|---|
| Same AZ (same region) | 0.1–0.5 ms |
| Cross-AZ (same region) | 0.5–2 ms |
| Cross-region (e.g., us-east to eu-west) | 60–100 ms |
| To internet | 10–200 ms (varies) |

Cross-AZ latency is small per request but compounds under load. A service making 10 sequential cross-AZ calls per request adds 5–20ms to every response. For microservices architectures, this is the primary source of latency inflation that teams fail to account for.

Jitter

Jitter is latency variance — the difference between the best and worst RTT. Consistent 2ms RTT is manageable. RTT oscillating between 0.5ms and 15ms destroys application predictability. Jitter typically comes from network congestion, noisy neighbors on shared infrastructure, or garbage collection pauses in application-level proxies.

Load Balancer Impact

Every load balancer adds latency. AWS ALB adds 1–5ms per request depending on load. NLB adds ~100μs (it operates at Layer 4 with no HTTP parsing). A service mesh sidecar proxy (Envoy) adds 0.5–2ms per hop.

For a request flowing through: Client → ALB → Envoy sidecar → Service → Envoy sidecar → Database, you accumulate roughly 3–10ms of infrastructure latency before your application code runs a single instruction.

Container Networking Overhead

Containers introduce additional network processing. Every packet traverses the host's network namespace, crosses a veth pair, passes through iptables/nftables rules (or eBPF programs in Cilium), and potentially hits a CNI overlay network.

In Kubernetes, an iptables-based kube-proxy setup with 10,000 services can generate 40,000+ iptables rules, and each packet is evaluated against chains of these rules. This is why IPVS mode and eBPF-based CNIs (Cilium) exist: they reduce per-packet processing from O(n) rule traversal to O(1) hash-based lookups.

The practical impact: on a node running 50+ services with iptables kube-proxy, PPS capacity can drop 10–20% purely from kernel networking overhead.


5. Compute Performance

CPU Saturation

CPU utilization at 100% does not mean your application is working at maximum efficiency. It means your CPU has no idle cycles, but those cycles may be spent on context switches, kernel overhead, or spinlocks. The distinction between user time (your code), system time (kernel calls), and iowait (waiting for I/O) matters.

A machine at 90% CPU where 30% is system time has a fundamentally different bottleneck than one at 90% where 85% is user time. The first is making too many system calls or handling too many interrupts; the second is genuinely compute-bound.

CPU Steal

In virtualized environments (most of EC2; bare-metal instances excepted), CPU steal indicates the hypervisor allocated your vCPU time to another instance. Steal above 5% means your workload is being throttled by the physical host. Burstable instances (t3, t3a) are particularly affected: once CPU credits are exhausted, steal effectively caps your performance.

On dedicated or metal instances, steal is zero. For latency-sensitive production workloads, this is not a trivial difference.

Context Switching

Every context switch costs 2–10μs depending on the CPU and cache state. An application doing 50,000 context switches per second spends 100–500ms of CPU time per second just switching — up to 50% of a core. Goroutine-heavy Go applications, thread-per-request Java services, and heavily multiplexed event loops all exhibit different context switching profiles.
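A quick sanity check on that overhead claim, using the pessimistic and optimistic per-switch costs:

```python
def context_switch_overhead(switches_per_sec: int, cost_us: float) -> float:
    """Fraction of one core consumed purely by context switching."""
    return switches_per_sec * cost_us / 1_000_000

print(context_switch_overhead(50_000, 10))  # 0.5 -> half a core lost
print(context_switch_overhead(50_000, 2))   # 0.1 at the optimistic 2 µs cost
```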

Monitor vmstat and /proc/pid/status context switch counters. If voluntary context switches are high, your application is blocking on I/O. If involuntary context switches are high, you have too many runnable threads competing for CPU.

Memory Bandwidth

Modern CPUs can execute billions of operations per second, but they can only fetch data from RAM at 50–80 GB/s (dual-channel DDR4/DDR5). Workloads that scan large datasets — sorting, filtering, columnar analytics, ML inference — are often memory-bandwidth bound, not compute-bound. The CPU spends cycles stalled, waiting for data from memory.

This manifests as moderate CPU utilization (60–70%) that refuses to go higher regardless of thread count. perf stat showing high LLC (last-level cache) miss rates confirms this.

NUMA Considerations

Multi-socket servers (common in metal instances like i3en.metal, c5.metal) have Non-Uniform Memory Access architectures. Memory attached to the local socket has ~80ns access time; memory on the remote socket takes ~140ns. A thread accessing remote NUMA memory pays a 75% latency penalty on every cache miss.

For databases and latency-sensitive services running on metal instances, NUMA-aware scheduling and memory binding (via numactl or cgroup cpuset) is not optional — it is the difference between consistent and inconsistent performance.

GPU Workloads and Data Pipelines

GPU-accelerated workloads introduce a new throughput/latency tradeoff: PCIe or NVLink bandwidth between CPU and GPU memory. An A100 GPU can deliver up to 312 TFLOPS (TF32 with structured sparsity), but it must be fed data fast enough. PCIe Gen4 x16 provides ~25 GB/s usable; NVLink provides up to 600 GB/s.

If your training pipeline reads data from EBS, preprocesses on CPU, and transfers to GPU, the bottleneck chain is: EBS throughput → CPU preprocessing speed → PCIe transfer bandwidth → GPU compute. Optimizing GPU utilization is useless if the data pipeline cannot keep up. This is why SageMaker and training instances use NVMe instance storage for dataset staging, and why data loader workers run in parallel.
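The bottleneck-chain reasoning reduces to a min() over stage throughputs. The figures below are hypothetical placeholders, not measured values:

```python
# Hypothetical per-stage throughput budget for a training data pipeline, in GB/s
stages = {
    "storage_read": 1.0,     # e.g. a throughput-limited network volume
    "cpu_preprocess": 3.5,   # decode + augmentation on CPU workers
    "pcie_transfer": 25.0,   # host-to-device copy bandwidth
    "gpu_consume": 40.0,     # rate at which the model ingests batches
}
bottleneck = min(stages, key=stages.get)
print(f"pipeline limited by {bottleneck} at {stages[bottleneck]} GB/s")
```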

| Compute Metric | What It Means | Watch For |
|---|---|---|
| CPU User % | Time running application code | >85% = genuinely compute-bound |
| CPU System % | Time in kernel | >20% = excessive syscalls |
| CPU Steal % | Time stolen by hypervisor | >5% = noisy neighbor or credit exhaustion |
| Context switches/sec | Thread scheduling overhead | >50K/core = investigate |
| LLC miss rate | Cache misses hitting RAM | >10% = memory-bandwidth bound |
| IPC (instructions/cycle) | CPU efficiency | <1.0 = stalled on memory/branch |

6. Kubernetes Perspective

etcd Latency Sensitivity

etcd is a consensus-based key-value store. Every write requires a quorum of disk fsyncs across the cluster. etcd's official recommendation is that 99th percentile disk fsync latency stays below 10ms. On EBS gp3, fsync latency during normal operation is 1–4ms but can spike to 20–50ms during EBS maintenance or burst credit exhaustion.

Symptoms of etcd performance problems: API server request latency increases, pod scheduling slows down, leader elections become unstable. The fix is straightforward — run etcd on io2 volumes or local NVMe storage. This is not premature optimization; it is a requirement for clusters above 50 nodes.

Image Pulls: Throughput + IOPS Combined

Pulling a container image is a mixed workload. Downloading layers is network-throughput bound. Extracting and writing layers to the container filesystem is storage IOPS and throughput bound. On a node with a slow disk, image pulls can take 30–60 seconds for a 500MB image, which directly impacts pod startup latency and autoscaling responsiveness.

Using containerd with zstd compression, image streaming (lazy pulling via stargz/nydus), or pre-cached images on AMIs eliminates this bottleneck. In CI/CD environments, this is often the difference between a 2-minute and a 10-minute pipeline.

Service Mesh Overhead

Each Envoy sidecar proxy consumes 50–100MB of memory and adds 0.5–2ms latency per hop. For a request traversing 5 services in a service mesh, that is 5–20ms of mesh overhead alone. Multiply by request volume, and sidecar proxies can consume 10–15% of cluster compute resources.

The decision to deploy a full mesh (Istio, Linkerd) vs a simpler approach (ambient mesh, direct service communication) should be based on measured overhead against measured security/observability value.

Autoscaling Effects on Latency

Horizontal Pod Autoscaler (HPA) reacts to metrics with a delay: metrics collection (15s default) → scaling decision (15s default) → pod creation → image pull → readiness probe → load balancer registration. The total cold-start latency is typically 30–120 seconds.

During this window, existing pods handle all load. If your scaling threshold is 70% CPU, you may see P99 latency spike as existing pods saturate beyond 90% while waiting for new pods. Solutions: use KEDA with request-rate metrics for faster scaling, over-provision slightly (target 50–60%), or use Knative/serverless for workloads with extreme burst patterns.
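The cold-start window is just the sum of those stage delays. In the sketch below, the first two values are the defaults mentioned above; the rest are illustrative assumptions:

```python
# Seconds per stage of an HPA scale-out; only the first two are K8s defaults
scale_out_s = {
    "metrics_collection": 15,  # default metrics sync period
    "scaling_decision": 15,    # default HPA evaluation interval
    "pod_scheduling": 2,       # assumed
    "image_pull": 45,          # assumed; dominated by image size and disk speed
    "readiness_probe": 10,     # assumed initial delay + first success
    "lb_registration": 15,     # assumed target registration time
}
print(sum(scale_out_s.values()))  # 102 seconds before new capacity takes load
```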


7. Real-World Scenarios

Scenario 1: PostgreSQL Production Database

A 16-core RDS instance on io2 with 50,000 provisioned IOPS and 1,000 MB/s throughput. The application reports slow queries, but RDS CloudWatch shows CPU at 40% and IOPS at 12,000. The bottleneck is not what you think.

Investigation reveals: the application makes 15 sequential queries per request, each taking 2ms for database processing plus 0.8ms cross-AZ network RTT (application runs in a different AZ from the database). Total latency per request: 15 × 2.8ms = 42ms. Moving the application to the same AZ drops this to 15 × 2.1ms = 31.5ms — a 25% improvement with zero database changes.

Lesson: network latency between application and database is multiplied by query count per request.
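The arithmetic behind this scenario, as a small helper (RTT values as given above):

```python
def request_latency_ms(queries: int, db_ms: float, rtt_ms: float) -> float:
    """Sequential queries each pay DB processing time plus one round trip."""
    return queries * (db_ms + rtt_ms)

cross_az = request_latency_ms(15, 2.0, 0.8)  # 42.0 ms
same_az = request_latency_ms(15, 2.0, 0.1)   # 31.5 ms
print(f"{(1 - same_az / cross_az):.0%} faster with zero database changes")
```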

Scenario 2: AI/ML Training Pipeline on p4d.24xlarge

Eight A100 GPUs connected via NVLink. Training throughput is 40% below expected. GPU utilization shows frequent drops to 0%. The data pipeline cannot feed the GPUs fast enough.

The dataset is stored on an EFS mount delivering 200 MB/s. Eight GPUs need approximately 3–4 GB/s of preprocessed data. Fix: stage the dataset to the local NVMe instance storage (p4d.24xlarge ships 8× 1 TB NVMe SSDs with roughly 16 GB/s aggregate read throughput), use multiple data loader workers for CPU preprocessing, and prefetch batches asynchronously.

Result: GPU utilization goes from 40% to 92%. Training time drops by 55%.

Lesson: GPU throughput is gated by the slowest link in the data pipeline.

Scenario 3: CI/CD Pipeline on Kubernetes

Jenkins agents running on m6i.xlarge nodes. Build times are 12 minutes, with 4 minutes spent on docker build. Investigation shows: each build pulls base images (network throughput), runs npm install downloading 800 packages (network IOPS — hundreds of small HTTP requests), compiles (CPU), and pushes the final image (network throughput + storage IOPS).

The npm install phase is slow not because of bandwidth but because of per-request latency: 800 sequential HTTPS requests to the registry, each with TLS handshake, DNS resolution, and connection setup. Fix: deploy a local npm proxy (Verdaccio), warm the cache, reduce per-request latency from 50ms to 2ms. npm install drops from 90 seconds to 8 seconds.

Lesson: high operation count with sequential latency dominates over raw throughput.

Scenario 4: High-Traffic API Gateway (50,000 RPS)

An API gateway running on c6i.4xlarge behind an NLB. At 50K RPS with ~200-byte average responses, bandwidth usage is only 80 Mbps — nowhere near the 12.5 Gbps limit. But the application hits the PPS ceiling.

Each request-response pair generates approximately 10–15 packets (TCP handshake, request, response, ACK, FIN). At 50K RPS, that is 500K–750K PPS. The c6i.4xlarge has roughly 1.5M PPS capacity for ENA. The system is at 50% PPS capacity, and with TLS termination and connection tracking overhead, the effective ceiling is lower.

Fix: enable TCP keepalive and HTTP/2 to amortize connection setup. This drops the per-request packet count from 10–15 to 2–3 for persistent connections, cutting effective PPS at the same RPS by roughly 80%.

Lesson: PPS, not bandwidth, is the limiting factor for high-RPS small-payload services.

Scenario 5: Log Ingestion Stack (Loki + Promtail)

A Loki cluster ingesting 50 GB/day of logs from 200 pods. Write latency is fine, but queries over 24-hour ranges take 30+ seconds. The bottleneck: Loki stores log chunks as compressed blocks on S3. Each query retrieves hundreds of small objects. S3 GET latency is 20–50ms per request, and Loki issues them sequentially per chunk.

With 500 chunks to read for a 24-hour query: 500 × 30ms = 15 seconds minimum, just on S3 GET latency. Adding compaction reduces chunk count. Deploying a Loki caching layer (memcached for chunk and index caches) drops repeat-query latency from 30s to 2s. Enabling parallel chunk fetching helps first-query performance.
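The latency floor is operation count times per-operation latency, divided by fetch parallelism. A sketch (the 50-way parallelism figure is an assumption for illustration, not a Loki default):

```python
import math

def query_floor_s(chunks: int, get_latency_ms: float, parallelism: int = 1) -> float:
    """Lower bound on query time from object-store GET latency alone."""
    rounds = math.ceil(chunks / parallelism)
    return rounds * get_latency_ms / 1000

print(query_floor_s(500, 30))      # 15.0 s with sequential fetches
print(query_floor_s(500, 30, 50))  # 0.3 s with 50-way parallel fetching
```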

Lesson: object storage latency is per-operation, and query performance depends on minimizing operation count.


8. Metrics That Senior Engineers Watch

P95 and P99 Latency

Average latency is almost useless. A service with 5ms average and 500ms P99 is broken for 1% of users. P99 captures the experience of real users during peak load, garbage collection pauses, and infrastructure hiccups. For public-facing services, track P99. For internal SLOs, P95 is often sufficient.
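A few lines demonstrate why averages mislead. Here 2% of requests are pathologically slow, the mean still looks healthy, and the nearest-rank P99 tells the real story:

```python
samples_ms = [5.0] * 98 + [500.0] * 2  # 98 fast requests, 2 pathological ones
mean = sum(samples_ms) / len(samples_ms)
p99 = sorted(samples_ms)[98]  # nearest-rank 99th percentile of 100 samples

print(f"mean={mean} ms, p99={p99} ms")  # mean=14.9 ms, p99=500.0 ms
```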

Tail Latency

Beyond P99, P99.9 and P99.99 latencies expose systemic issues: NUMA misses, TCP retransmissions, EBS latency spikes, or Go runtime GC pauses. At 10,000 RPS, P99.9 affects 10 requests per second — enough to trigger downstream timeouts and cascading failures.

Throttling

EBS IOPS throttling, EC2 network bandwidth throttling, API Gateway rate limiting, and CPU credit exhaustion all manifest similarly: sudden performance cliffs with no gradual degradation. Monitor VolumeQueueLength for EBS, the ENA allowance-exceeded counters (bw_in_allowance_exceeded, pps_allowance_exceeded, visible via ethtool) for EC2, and CPUCreditBalance for burstable instances.

Queue Length and Concurrency

Queue length is the universal health indicator. A disk with queue depth growing from 4 to 32 is saturated. A Kubernetes scheduler queue backing up means the control plane cannot keep pace. A load balancer with growing active connections and declining throughput indicates backend saturation.

| Metric | What It Tells You | Red Flag Threshold |
|---|---|---|
| P99 latency | Worst-case user experience | >10× median |
| EBS VolumeQueueLength | Disk saturation | >4 for gp3, >16 for io2 |
| CPU steal | Hypervisor contention | >5% sustained |
| Network PPS | Packet processing saturation | >70% of instance limit |
| Pod restart count | OOM kills or liveness failures | Any non-zero in prod |
| etcd fsync P99 | Control plane health | >10ms |
| Active connections | Load balancer / backend saturation | Sustained growth |

9. Common DevOps Mistakes

Optimizing IOPS when network is the bottleneck. An application making cross-AZ database calls has its latency dominated by network RTT, not disk I/O. Upgrading from gp3 to io2 saves 0.5ms per query, but the 1.5ms cross-AZ RTT remains unchanged. Fix the network path first — move the application into the same AZ, or use read replicas.

Confusing bandwidth with throughput. A 25 Gbps instance can move 25 Gbps of data only under ideal conditions — large transfer windows, no packet loss, no application bottlenecks. Real-world throughput with default TCP settings, small transfers, and TLS overhead is often 5–10 Gbps on the same hardware. Tune TCP buffer sizes and window scaling before upgrading instance types.

Ignoring CPU scheduling delays. Kubernetes pods with no CPU limits run on best-effort scheduling. Under contention, the CFS scheduler introduces latency that does not appear as CPU utilization — it appears as increased response time. Setting CPU requests (not limits, which cause throttling) ensures CFS allocates fair share without hard capping.

Using average latency for SLOs. A service reporting 5ms average latency with 200ms P99 is not meeting a 50ms SLO for 1% of traffic. Averages mask bimodal distributions where most requests are fast and a minority are catastrophically slow.

Over-provisioning compute, under-provisioning storage. Teams regularly deploy 32-core instances with gp3 defaults (3,000 IOPS, 125 MB/s). A single core doing synchronous I/O at queue depth 1 can generate 5,000+ IOPS from the application layer. Even a moderately I/O-intensive application on 32 cores can exhaust gp3 limits while CPU sits at 20%.

Ignoring the replication tax on distributed storage. Deploying Ceph or Longhorn with 3× replication and expecting bare-metal performance. Every write becomes three writes. Write throughput is bounded by the slowest replica node's disk and the network path to it. Budget accordingly: divide raw disk throughput by the replication factor and subtract network overhead.
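That budgeting advice can be written down as a deliberately crude helper. Real systems add journaling, compaction, and protocol overhead on top, so treat this as an optimistic upper bound under assumed disk and NIC figures:

```python
def replicated_write_budget_mbs(disk_mbs: float, nic_mbs: float, replicas: int) -> float:
    """Optimistic write throughput for N-way replicated storage: raw disk
    divided by the replication factor, capped by NIC bandwidth to the peers."""
    return min(disk_mbs / replicas, nic_mbs / (replicas - 1))

# Hypothetical: 500 MB/s disks, ~110 MB/s usable on a 1 Gbps NIC, 3 replicas
print(replicated_write_budget_mbs(500, 110, 3))  # 55.0 MB/s at best
```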


10. Conclusion

Performance in cloud infrastructure is not a single-dimension problem. Every request flowing through your system traverses network links, hits CPU scheduling decisions, waits for memory access, and performs I/O operations — and each of these layers has its own throughput ceiling, IOPS limit, and latency floor.

The engineer who understands this treats performance as a pipeline analysis problem. You trace the request path, measure each segment, and find the narrowest pipe. The fix for a slow API is not always more replicas — it might be moving a database to the same AZ, switching from iptables to IPVS, or adding one SSD for etcd.

Three principles that consistently produce better outcomes:

First, measure before optimizing. Use perf, bpftrace, iostat, sar, CloudWatch, and Prometheus to identify where time is actually spent. Assumptions about bottleneck location are wrong more often than they are right.

Second, understand the relationship between throughput, IOPS, and latency at every layer. A system bottlenecked on PPS needs different remediation than one bottlenecked on bandwidth, even though both are "network problems."

Third, remember that latency is additive across layers while throughput is bounded by the weakest link. A request that passes through 6 components, each adding 2ms, has 12ms of infrastructure latency before your business logic executes. Design for shallow call graphs, minimize network hops, and keep latency-sensitive components close together — physically and architecturally.

Performance engineering is not about knowing the maximum IOPS of gp3. It is about knowing which questions to ask when something is slow.


Published on doc.thedevops.dev
