DEV Community: Mike Smith

The GPU Utilization Number That's Quietly Wrecking AI Team Budgets

Mike Smith — Fri, 03 Jul 2026 06:57:20 +0000

Teams obsess over GPU hourly rates when comparing providers. The number that actually determines your real cost per training run is something almost nobody tracks closely enough: utilization.

When AI teams evaluate GPU infrastructure providers, the conversation almost always centers on the hourly rate. Provider A charges $2.10 per hour for an H100. Provider B charges $1.85. The comparison feels straightforward, and the cheaper option looks like the obvious choice.

The comparison is not wrong, but the hourly rate is not the number that determines what your team spends per unit of useful work produced. The number that matters far more, and that almost nobody tracks with the rigor they apply to hourly rate shopping, is GPU utilization: the percentage of time your expensive GPU hardware is doing productive computation rather than sitting idle, waiting on data, or running at a fraction of its theoretical throughput.

Why a Lower Hourly Rate Can Still Mean Higher Total Cost

Here's the calculation that gets skipped in most provider comparisons: effective cost is the hourly rate divided by utilization. If you're paying $1.85 per hour but your actual GPU utilization during a training run averages 45%, meaning the GPU is idle or underutilized more than half the time it's billed, your effective cost is roughly $4.11 per hour of useful compute ($1.85 / 0.45). A provider charging $2.10 per hour with infrastructure and tooling that supports 85% utilization delivers an effective cost of roughly $2.47 per hour ($2.10 / 0.85) for the same useful work.

The provider with the higher sticker price is, in this scenario, meaningfully cheaper in terms of actual cost per unit of training progress achieved. This isn't a hypothetical edge case. Utilization rates this divergent between providers and configurations are common in real-world AI infrastructure, and the gap is driven by factors that have nothing to do with the GPU hardware itself.
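The arithmetic behind those two figures can be sketched in a few lines of Python. This is a minimal illustration using the example rates and utilization figures above; the function name is made up for this sketch, and it assumes utilization is averaged over the same period that was billed.

```python
def effective_hourly_cost(hourly_rate: float, utilization: float) -> float:
    """Cost per hour of useful GPU work, given average utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_rate / utilization


# Example figures from the comparison above: (hourly rate, average utilization).
providers = {"Provider B": (1.85, 0.45), "Provider A": (2.10, 0.85)}

for name, (rate, util) in providers.items():
    cost = effective_hourly_cost(rate, util)
    print(f"{name}: ${cost:.2f} per useful GPU-hour")
# Provider B: $4.11 per useful GPU-hour
# Provider A: $2.47 per useful GPU-hour
```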

Where GPU Idle Time Actually Comes From

Data loading bottlenecks. A remarkably common cause of low GPU utilization is a training pipeline where the GPU finishes processing a batch faster than the data loading pipeline can prepare the next one. The GPU sits idle, waiting for data, while CPU-bound data preprocessing, disk I/O, or network transfer from a remote storage bucket becomes the actual bottleneck. This is especially common when training data is stored in cloud object storage and streamed during training rather than pre-staged on fast local storage, because the network and deserialization overhead per batch can easily exceed the GPU's processing time per batch, particularly when individual samples are large or the model processes each batch quickly.

Checkpoint and logging overhead. Frequent model checkpointing — saving model state to persistent storage at regular intervals — pauses GPU computation while the checkpoint write completes, particularly if checkpoints are large and being written to slower or remote storage. Teams that checkpoint very frequently for safety, without considering the cumulative GPU idle time this introduces, can lose a meaningful percentage of total training time to this overhead alone.

Inefficient multi-GPU communication patterns. Poorly tuned distributed training configurations can leave GPUs waiting on gradient synchronization for longer than necessary, particularly when batch sizes are suboptimal, the communication backend is misconfigured, or the training framework is not aware of the network topology.

Provisioning and cold-start overhead on cloud instances. Spot or on-demand cloud GPU instances often require image pulls, environment setup, and dependency installation on every fresh instance launch. For short-lived training jobs, this cold-start overhead can represent a substantial percentage of total billed time without contributing any actual training progress.

Mismatched batch size and model architecture for the available VRAM. A batch size too small for the GPU's memory capacity leaves compute throughput on the table — the GPU has spare capacity that a larger batch size could use, but the configuration doesn't take advantage of it. This is a frequent and often overlooked source of suboptimal utilization, particularly when configurations are copied from documentation or previous projects without re-tuning for the specific hardware being used.
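To put a rough number on the checkpoint overhead described above, the share of billed time lost is the write time divided by the full compute-plus-write cycle. The sketch below assumes checkpoints are synchronous (the GPU waits for the entire write) and evenly spaced. The 45-second write time is a hypothetical figure, and asynchronous checkpointing would shrink the stall.

```python
def checkpoint_idle_fraction(compute_s: float, write_s: float) -> float:
    """Fraction of wall-clock time spent blocked on checkpoint writes.

    compute_s: seconds of training between two consecutive checkpoints
    write_s:   seconds the GPU waits while one checkpoint is written
    """
    if compute_s <= 0 or write_s < 0:
        raise ValueError("compute_s must be positive and write_s non-negative")
    return write_s / (compute_s + write_s)


# Hypothetical: a 45-second blocking write at three checkpoint intervals.
for minutes in (5, 15, 60):
    lost = checkpoint_idle_fraction(minutes * 60, 45)
    print(f"checkpoint every {minutes:>2} min: {lost:.1%} of billed time idle")
```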

How to Actually Measure This (Most Teams Don't)

The uncomfortable truth is that most teams running GPU training jobs do not have utilization monitoring in place granular enough to catch these problems. Checking whether a job "completed successfully" is not the same as understanding whether that job used the provisioned hardware efficiently.

Tools like nvidia-smi provide real-time GPU utilization snapshots, but a single snapshot during a training run tells you very little. What's needed is utilization tracked continuously over the full duration of a training job, ideally visualized as a time series alongside markers for checkpoint events, data loading stages, and distributed communication phases — so that utilization dips can be correlated with their actual cause rather than just observed as an unexplained gap.

NVIDIA's DCGM (Data Center GPU Manager) and various open-source and commercial MLOps observability platforms provide this level of detail, and the investment in setting this up pays for itself quickly for any team running GPU training at meaningful scale and cost.
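As a lightweight starting point before adopting a full observability stack, utilization can be sampled continuously by polling nvidia-smi. The sketch below is an assumption-laden illustration, not a replacement for DCGM: the function names and the 10% idle threshold are made up here, and it needs an NVIDIA driver on the machine to actually record anything. Note that nvidia-smi's utilization.gpu field reports the share of the sample period during which any kernel was running, so it is a coarse signal rather than a measure of how fully the GPU was used.

```python
import subprocess
import time
from statistics import mean
from typing import Dict, List, Tuple

# nvidia-smi query flags: prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"]

Sample = Tuple[float, List[int]]  # (seconds since start, utilization per GPU)


def parse_utilization(output: str) -> List[int]:
    """Turn nvidia-smi's CSV output into a list of integer percentages."""
    return [int(line) for line in output.splitlines() if line.strip()]


def record(duration_s: float, interval_s: float = 1.0) -> List[Sample]:
    """Poll nvidia-smi for duration_s seconds and keep every sample."""
    samples: List[Sample] = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        samples.append((time.monotonic() - start, parse_utilization(out.stdout)))
        time.sleep(interval_s)
    return samples


def summarize(samples: List[Sample], idle_below: int = 10) -> Dict[str, float]:
    """Mean utilization and the share of samples where the GPUs looked idle."""
    per_sample = [mean(gpus) for _, gpus in samples]
    idle = sum(1 for u in per_sample if u < idle_below)
    return {"mean_util": mean(per_sample), "idle_fraction": idle / len(per_sample)}
```

Calling summarize(record(3600)) alongside a training job would give an hour-long average plus an idle share. Keeping the raw timestamped samples is what allows dips to be lined up against checkpoint or data loading events afterwards.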

The Fixes, Ranked by Typical Impact

Pre-stage training data on fast local or network storage rather than streaming from remote object storage during training. This single change frequently produces the largest utilization improvement for teams whose bottleneck is data loading, because it eliminates the network and deserialization latency that competes with GPU processing time.

Profile and parallelize the data loading pipeline explicitly. Most deep learning frameworks support asynchronous, multi-worker data loading specifically designed to keep data preparation ahead of GPU consumption. Ensuring this is properly configured, with enough parallel workers and appropriate prefetching, closes much of the gap between theoretical and actual GPU throughput.

Tune checkpoint frequency deliberately, balancing safety against overhead. Understand the actual cost, in GPU idle time, of your checkpoint frequency, and make a deliberate trade-off rather than defaulting to an arbitrarily frequent interval copied from a tutorial or previous project.

Right-size batch size to actual available VRAM, not a default value. Profile memory usage and incrementally increase batch size until you're using available GPU memory efficiently, which typically improves both throughput and utilization simultaneously.

Choose infrastructure providers based on demonstrated utilization support, not just hourly rate. Ask providers directly about typical customer utilization rates on comparable workloads, available high-throughput storage options, and network configuration for multi-GPU communication. The honest answers to these questions are far more predictive of your actual total cost than the headline hourly rate.
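The second fix above, keeping data preparation ahead of GPU consumption, comes down to a bounded buffer filled by a background worker. The sketch below shows that idea with the Python standard library only; the prefetch helper is hypothetical and uses a single thread, whereas real frameworks use multiple worker processes (PyTorch's DataLoader, for example, exposes num_workers and prefetch_factor for this purpose).

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def prefetch(items: Iterable[T], prepare: Callable[[T], U],
             depth: int = 4) -> Iterator[U]:
    """Yield prepare(item) in order, keeping roughly `depth` results ready."""
    buffer: "queue.Queue" = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for item in items:
                # put() blocks when the buffer is full, so memory stays bounded.
                buffer.put(("item", prepare(item)))
            buffer.put(("done", None))
        except Exception as exc:
            # Forward loader failures instead of leaving the consumer waiting.
            buffer.put(("error", exc))

    threading.Thread(target=producer, daemon=True).start()
    while True:
        kind, payload = buffer.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload
```

A training loop would iterate over prefetch(batch_paths, load_and_decode), where both names stand in for your own inputs. While the consumer works on one batch, the producer is already preparing the next ones, which is the overlap that removes the idle gap.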

The Real Lesson for AI Infrastructure Budgets

The hourly rate on a GPU cloud pricing page is the easiest number to compare, which is exactly why it gets the most attention during procurement decisions. It is also, on its own, a poor predictor of what your team will actually spend to achieve a given amount of training progress.

Utilization — the unglamorous, harder-to-measure number that requires actual instrumentation and ongoing attention rather than a quick comparison of pricing pages — is what actually determines effective cost. Teams that build the habit of measuring and optimizing it consistently extract significantly more value from the same infrastructure budget than teams that optimize purely for the lowest sticker price per GPU-hour.

The Autoscaling Bug That Costs Companies Thousands Before Anyone Notices

Mike Smith — Thu, 02 Jul 2026 06:42:29 +0000

Autoscaling is supposed to save money by matching capacity to demand. In practice, a small misconfiguration can cause it to do the exact opposite — and the bill is usually the first symptom anyone sees.

Autoscaling is one of cloud computing's most genuinely valuable capabilities, and also one of its most quietly dangerous ones when misconfigured. The pitch is straightforward: scale resources up automatically when demand increases, scale them back down when demand subsides, and pay only for what you actually need at any given moment.

In practice, a meaningful percentage of teams running autoscaling infrastructure are running configurations that produce the opposite of the intended outcome — burning more compute than a fixed, well-sized capacity allocation would have cost, often for months before anyone reviews the billing closely enough to notice the pattern.

The Failure Mode Nobody Designs For: Scaling Thrash

The most common and most expensive autoscaling failure pattern is what's generally called "thrashing" — a feedback loop where the autoscaler repeatedly scales up, then scales down, then scales up again, in rapid succession, in response to metric fluctuations that don't actually represent sustained demand changes.

Here's how it typically happens. An autoscaling policy is configured to add capacity when CPU utilization crosses a threshold — say, 70%. A traffic spike pushes utilization above that threshold, triggering a scale-up event. New instances or pods come online, which takes anywhere from 30 seconds to several minutes depending on the platform and image size. By the time the new capacity is actually serving traffic, the original spike has often already subsided, because most traffic spikes are shorter than the provisioning time required to respond to them.

Now you have excess capacity sitting idle. If the scale-down cooldown period is short, the autoscaler notices the now-lower average utilization and scales back down. If a new spike arrives shortly after, the cycle repeats. Each cycle has a cost: the provisioning overhead of spinning up new instances, the brief period of over-provisioned capacity before scale-down kicks in, and in cloud billing models with minimum billing increments, the cost of partial-hour or partial-minute charges that round up regardless of how briefly the resource actually ran.

This pattern is often invisible in day-to-day monitoring because the application appears to be functioning correctly throughout — users aren't experiencing errors, response times look fine. The only place this shows up clearly is in the billing data, accumulated over weeks, where a careful audit reveals a far higher instance-hour count than the actual traffic pattern would justify.
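One way to surface this pattern before the bill does is to count how often the autoscaler reverses direction shortly after its previous change. The sketch below assumes you can export a scaling history as (timestamp, instance count) pairs; the event data and the 10-minute window are hypothetical, and what counts as "too many" reversals is a judgment call for your workload.

```python
from typing import List, Tuple


def count_reversals(events: List[Tuple[float, int]], window_s: float) -> int:
    """Count scaling direction flips within window_s of the previous change.

    events: (timestamp in seconds, instance count after the event), sorted by time.
    """
    reversals = 0
    last_direction = 0    # +1 scale-up, -1 scale-down, 0 before any change
    last_change_at = 0.0
    for (_, before), (t, after) in zip(events, events[1:]):
        direction = (after > before) - (after < before)
        if direction == 0:
            continue  # count unchanged, not a scaling event
        if direction == -last_direction and t - last_change_at <= window_s:
            reversals += 1
        last_direction, last_change_at = direction, t
    return reversals


# Hypothetical history: up, down, up, down within minutes, then a later scale-up.
history = [(0, 2), (60, 4), (240, 2), (300, 4), (480, 2), (4000, 3)]
print(count_reversals(history, window_s=600))  # 3
```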

The Metric Mismatch Problem

A second, equally common cause of autoscaling cost overruns is scaling on a metric that doesn't actually correlate well with the resource constraint that matters for your specific application.

CPU utilization is the default metric most autoscaling configurations use, largely because it's the easiest to measure and the most universally available. But CPU utilization is frequently a poor proxy for actual application capacity needs. An application that's memory-bound, I/O-bound, or constrained by external API rate limits can show low CPU utilization even while genuinely struggling under load — meaning the autoscaler never triggers when it should. Conversely, applications with CPU-intensive but infrequent background tasks (batch processing, scheduled jobs, garbage collection cycles) can trigger unnecessary scale-up events based on CPU spikes that have nothing to do with actual user-facing demand.

The mismatch between the metric being measured and the resource constraint that actually matters means many autoscaling configurations are simultaneously over-provisioning in response to irrelevant signals and under-provisioning in response to the signals that actually matter — a combination that produces both higher costs and worse user experience at the same time.
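A quick sanity check for a candidate scaling metric is to see how well it tracks user-facing latency in historical data. The sketch below computes a plain Pearson correlation for each candidate and picks the strongest one. The metric names and sample values are hypothetical, and correlation over a handful of points is only a hint, not proof that a metric is the right trigger.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def best_scaling_metric(metrics: Dict[str, List[float]],
                        latency_ms: List[float]) -> str:
    """Name of the metric most positively correlated with latency."""
    return max(metrics, key=lambda name: pearson(metrics[name], latency_ms))


# Hypothetical samples taken at the same five points in time.
candidates = {
    "cpu_percent": [30, 32, 31, 29, 33],
    "queue_depth": [2, 5, 9, 14, 20],
}
p95_latency_ms = [120, 180, 260, 390, 520]
print(best_scaling_metric(candidates, p95_latency_ms))  # queue_depth
```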

Minimum and Maximum Boundaries Set Once and Never Revisited

Autoscaling policies require minimum and maximum instance counts, and these boundaries are frequently set during initial setup based on rough estimates and then never revisited as the application's actual traffic patterns become better understood.

A minimum instance count set conservatively high "just to be safe" during initial deployment becomes a permanent cost floor that persists indefinitely, even after the team gains enough operational confidence to know the true minimum capacity required. A maximum instance count set without careful consideration of cost ceiling can allow a traffic anomaly — including, in some cases, an actual attack or a misbehaving client making excessive requests — to scale the infrastructure to a cost level far beyond what any legitimate business justification would support, with no automated circuit breaker in place to catch it.

The teams that avoid this problem treat autoscaling boundaries as settings to be revisited quarterly based on actual observed traffic data, not a one-time configuration decision made during initial setup and forgotten.

Scale-Down Reluctance: The Hidden Cost of Conservative Defaults

Many autoscaling implementations default to asymmetric behavior: scaling up quickly and aggressively in response to demand signals, but scaling down slowly and conservatively to avoid prematurely removing capacity during a sustained spike. This asymmetry exists for good reason — the cost of under-provisioning (degraded user experience) is generally considered worse than the cost of over-provisioning (wasted spend) — but the conservative scale-down defaults frequently go further than necessary, leaving excess capacity running for far longer than the actual traffic pattern justifies.

Cooldown or stabilization periods of 5 to 15 minutes after a scale-up event, before any scale-down is considered, are common defaults. For applications with frequent short traffic bursts, this can mean the infrastructure spends a substantial proportion of total runtime in an over-provisioned state, well after the demand that triggered the scale-up has subsided.
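A back-of-the-envelope model makes the cost of a long cooldown concrete. The sketch below assumes evenly spaced bursts, an instant scale-up at the start of each burst, and a scale-down exactly when the cooldown expires. All of these are simplifications, and the traffic numbers are hypothetical.

```python
def overprovisioned_fraction(burst_period_min: float,
                             burst_duration_min: float,
                             cooldown_min: float) -> float:
    """Share of total runtime spent holding extra capacity after a burst ends.

    The cooldown clock starts at scale-up, so only the part of it that
    outlasts the burst counts as over-provisioned time.
    """
    idle_gap = burst_period_min - burst_duration_min
    if burst_period_min <= 0 or idle_gap < 0:
        raise ValueError("burst must fit inside a positive period")
    extra = max(cooldown_min - burst_duration_min, 0.0)
    return min(extra, idle_gap) / burst_period_min


# Hypothetical: a 3-minute burst every 30 minutes, at three cooldown settings.
for cooldown in (5, 10, 15):
    share = overprovisioned_fraction(30, 3, cooldown)
    print(f"cooldown {cooldown:>2} min: {share:.0%} of runtime over-provisioned")
```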

What Actually Fixes This

Audit instance-hour billing against actual traffic patterns regularly. The clearest signal that autoscaling thrash is happening is a mismatch between your billing data's instance-hour count and what your traffic logs suggest should be necessary. This comparison should be a recurring operational review, not a one-time setup check.

Choose scaling metrics deliberately, based on your application's actual bottleneck. If your application is memory-bound, scale on memory utilization. If it's request-latency sensitive, consider scaling on request queue depth or response time percentiles rather than CPU. The right metric is the one that actually correlates with user-facing degradation for your specific workload.

Implement scaling cooldowns and stabilization windows that match your traffic's actual volatility. A high-traffic, consistently busy application can tolerate more aggressive scale-down behavior than one with frequent, short, unpredictable bursts. Tune cooldown periods based on observed traffic shape, not generic defaults.

Set and revisit minimum and maximum boundaries based on real data, on a recurring schedule. Treat these as living configuration values informed by actual operational history, not static numbers set once during initial deployment.

Combine autoscaling with predictive or scheduled scaling where traffic patterns are genuinely predictable. Applications with known daily or weekly traffic patterns — business-hours-only usage, predictable weekend dips — often benefit from supplementing reactive autoscaling with scheduled capacity adjustments that pre-empt known patterns rather than reacting to them after the fact.
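The stabilization window mentioned in the third fix can be sketched as follows: scale up immediately, but scale down only to the highest capacity that was wanted at any point in the recent window. This is similar in spirit to the scale-down stabilization window in the Kubernetes Horizontal Pod Autoscaler, though the class below is a hypothetical illustration rather than any platform's implementation.

```python
from collections import deque
from typing import Deque, Tuple


class ScaleDownStabilizer:
    """Damp scale-downs by remembering recent desired instance counts."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._history: Deque[Tuple[float, int]] = deque()  # (timestamp, desired)

    def recommend(self, now: float, desired: int, current: int) -> int:
        """Return the instance count to apply at time `now` (seconds)."""
        self._history.append((now, desired))
        # Forget recommendations that are older than the window.
        while self._history[0][0] < now - self.window_s:
            self._history.popleft()
        if desired >= current:
            return desired  # scale up (or hold) without delay
        # Scale down only as far as the highest recent recommendation.
        return min(current, max(d for _, d in self._history))


stabilizer = ScaleDownStabilizer(window_s=300)
print(stabilizer.recommend(now=0, desired=4, current=2))    # 4: scale up at once
print(stabilizer.recommend(now=60, desired=2, current=4))   # 4: spike still in window
print(stabilizer.recommend(now=400, desired=2, current=4))  # 2: window has cleared
```

The window length is the tuning knob: it should be long enough to ride out the gap between typical bursts, and no longer.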

The Real Lesson

Autoscaling is a powerful capability, but it's not a "set it and forget it" feature. It's an ongoing tuning exercise that requires periodic review against real operational data — and the cost of skipping that review doesn't show up as an error or an outage. It shows up quietly, as a slightly inflated cloud bill every single month, easy to overlook until someone finally does the audit and finds out exactly how much that inattention has cost.

Your Dedicated Server Benchmark Looks Great. Your Production Database Disagrees. Here's Why.

Mike Smith — Tue, 30 Jun 2026 10:21:17 +0000

A clean fio or dd benchmark on a brand-new dedicated server is not the same thing as real-world I/O performance under concurrent, mixed-pattern production load. The gap between the two trips up more teams than it should.

Every team that provisions a new dedicated server runs the same ritual at some point. Spin up the box, SSH in, run a quick storage benchmark — fio, dd, iozone, whatever the team's preferred tool is — and watch the numbers come back looking excellent. Sequential write throughput in the gigabytes per second. Sub-millisecond read latency. Everything looks exactly like the vendor's spec sheet promised.

Then the database goes live, real traffic hits it, and query latency under load doesn't match what the benchmark implied at all. This gap — between synthetic storage benchmarks and real production I/O behavior — is one of the most consistently underestimated factors in dedicated server performance planning, and it's worth understanding precisely why it happens.

Why Sequential Benchmarks Lie (Without Meaning To)

The default storage benchmark most engineers reach for tests sequential read or write throughput — writing or reading one large, contiguous block of data as fast as possible. This is a genuinely useful number for understanding the theoretical ceiling of your storage hardware. It is also almost never representative of what a production database actually does.

Real database workloads are dominated by random I/O, not sequential. A transactional database serving concurrent users is constantly reading and writing small, scattered blocks of data across the disk — index lookups, row updates, write-ahead log entries, all interleaved with each other, often from multiple connections simultaneously.

NVMe storage handles random I/O dramatically better than older spinning disk or even SATA SSD technology, which is exactly why NVMe has become the standard for serious database workloads. But "dramatically better than the alternative" doesn't mean "identical to the sequential benchmark number." A drive that delivers 3.5 GB/s on a sequential write test can show meaningfully different — and more variable — performance under a realistic random I/O pattern with high queue depth and concurrent access.

The Queue Depth Problem

Here's a detail that gets glossed over constantly: most default benchmark configurations test at a queue depth of 1 — meaning one I/O operation in flight at a time. This produces the lowest possible latency numbers because there's no contention for the device's internal resources.

Production databases under real load operate at much higher effective queue depths, with many operations in flight simultaneously from different connections and threads. As queue depth increases, individual operation latency typically increases as well, even on excellent hardware, simply because operations are now waiting behind each other for the underlying device controller's attention.

A benchmark run at queue depth 1 and a production workload running at effective queue depth 32 or 64 are testing fundamentally different things, even though they're hitting the same physical drive. Teams that benchmark with default settings and then extrapolate those numbers to predict production performance under concurrent load are comparing two different scenarios without realizing it.
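Little's Law gives a quick way to see why the two scenarios differ: the average number of operations in flight equals throughput multiplied by average latency. Once a device's IOPS have plateaued, average latency therefore grows in proportion to queue depth. The sketch below applies that relationship; the 400,000 IOPS ceiling is a hypothetical figure, and the formula describes steady-state averages, not tail latency.

```python
def mean_latency_ms(queue_depth: int, iops: float) -> float:
    """Average per-operation latency implied by Little's Law.

    in-flight operations = throughput * latency, so latency = queue_depth / iops.
    """
    if queue_depth <= 0 or iops <= 0:
        raise ValueError("queue_depth and iops must be positive")
    return queue_depth / iops * 1000.0


SATURATED_IOPS = 400_000  # hypothetical ceiling for a drive under random I/O

for depth in (32, 64, 128):
    latency = mean_latency_ms(depth, SATURATED_IOPS)
    print(f"queue depth {depth:>3}: {latency:.2f} ms average latency")
```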

What Actually Changes Under Real Load

Filesystem and database engine overhead. Raw block-device benchmarks bypass much of the filesystem and database engine logic that real queries pass through. Write-ahead logging, journaling, checksumming, and transaction commit semantics all add overhead that a raw dd test never touches. A database configured for strong durability guarantees (synchronous commits, fsync on every write) will show meaningfully different I/O latency characteristics than a raw storage benchmark, because it's doing genuinely more work per logical operation.

Resource contention from concurrent processes. A freshly provisioned, otherwise idle dedicated server gives a storage benchmark the entire I/O subsystem to itself. A production server is also running the application layer, background jobs, monitoring agents, log shipping, and often multiple database connections simultaneously — all competing for the same underlying I/O resources. None of this contention exists during a clean benchmark run.

Thermal and sustained-write behavior. Many NVMe drives exhibit excellent burst performance but show reduced sustained write throughput once onboard cache is exhausted and the controller has to manage thermal load during extended write-heavy periods. A short benchmark run captures burst performance. A database under sustained heavy write load for hours can encounter the drive's actual sustained performance characteristics, which can be meaningfully lower than the headline burst numbers.

RAID configuration overhead. If the dedicated server uses RAID for redundancy — which most production database deployments should — write operations now involve additional overhead for parity calculation or mirroring, depending on the RAID level chosen. A single-disk benchmark doesn't capture this, and the overhead varies significantly between RAID 1, RAID 5, RAID 10, and software versus hardware RAID implementations.

How to Benchmark in a Way That Actually Predicts Production Behavior

Test with realistic access patterns, not just sequential. Configure fio (or your benchmarking tool of choice) to use a mixed random read/write pattern with a block size that matches your actual database's typical I/O size — often 4K, 8K, or 16K depending on the database engine — rather than relying on default large sequential block tests.

Test at realistic queue depths. Benchmark at multiple queue depths, including ones that approximate your expected concurrent connection count, not just queue depth 1. This gives you a latency curve rather than a single optimistic number, and that curve is far more useful for capacity planning.

Run sustained tests, not quick bursts. A 30-second benchmark captures burst performance. Run tests for 15-30 minutes or longer to surface any sustained throughput degradation that burst tests miss entirely.

Benchmark with the actual database engine under realistic concurrent load, not just raw storage tools. Tools like sysbench for database-specific benchmarking, configured with a representative schema and query mix, will surface engine-level overhead that raw storage benchmarks can't capture. This is a meaningfully better predictor of production behavior than any raw I/O tool alone.

Validate under contention, not isolation. If possible, run your storage benchmark concurrently with a synthetic CPU and memory load that approximates your actual application's resource footprint, to capture how I/O performance holds up when it's not the only thing happening on the box.
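The first three recommendations above map directly onto fio options. The sketch below builds a command line for a mixed random read/write test and prints one variant per queue depth. The helper name, file path, 70/30 read mix, and sizes are assumptions to adapt; libaio is Linux-specific; and the target should be a scratch file, since pointing a write test at a device or file that holds data will destroy it.

```python
import shlex
from typing import List


def fio_command(filename: str, block_size: str = "8k", read_pct: int = 70,
                iodepth: int = 32, numjobs: int = 4,
                runtime_s: int = 900, size: str = "10G") -> List[str]:
    """Build a fio invocation for a sustained, mixed random I/O test."""
    return [
        "fio",
        "--name=db-like-randrw",
        f"--filename={filename}",
        "--rw=randrw",                # mixed random reads and writes
        f"--rwmixread={read_pct}",    # percentage of operations that are reads
        f"--bs={block_size}",         # match the database's typical I/O size
        f"--iodepth={iodepth}",       # operations kept in flight per job
        f"--numjobs={numjobs}",       # concurrent workers
        "--ioengine=libaio",          # async engine so iodepth takes effect
        "--direct=1",                 # bypass the page cache
        f"--size={size}",
        f"--runtime={runtime_s}",     # 900 s = 15 minutes, not a short burst
        "--time_based",
        "--group_reporting",
    ]


# Print one command per queue depth to build a latency curve.
for depth in (1, 8, 32, 64):
    print(shlex.join(fio_command("/mnt/scratch/fio-test.dat", iodepth=depth)))
```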

The Honest Bottom Line

A clean benchmark number on a freshly provisioned dedicated server tells you the hardware is capable. It does not tell you how that hardware will behave under the specific, messy, concurrent, mixed-pattern reality of your actual production workload. The gap between those two things isn't a sign that the hardware is bad or that the provider misrepresented anything — it's simply a reflection of the fact that synthetic benchmarks and production workloads are testing fundamentally different scenarios.

Teams that build capacity planning models around synthetic benchmark numbers alone are working from data that systematically overstates real-world performance. The fix isn't distrust of benchmarks — it's running benchmarks that actually resemble what you're going to do with the hardware, and validating those results against real application behavior before you commit to a capacity plan.

The Cloud Networking Problem Nobody Mentions Until Your Latency Bill Arrives

Mike Smith — Tue, 30 Jun 2026 09:53:51 +0000

Cross-region cloud architecture looks clean on an architecture diagram. In production, it quietly adds latency that no amount of compute power can fix — and most teams don't notice until users complain.

There's a particular kind of incident report that shows up in engineering postmortems with predictable regularity. The application is fast. The database is fast. The compute is appropriately sized. And yet, users in a specific region are experiencing response times that are inexplicably, persistently slower than everyone else.

Nine times out of ten, the root cause isn't compute. It's the network path the request actually takes — and it's a problem that architecture diagrams almost never capture accurately.

Why "It's All in the Cloud" Doesn't Mean "It's All Close Together"

Cloud architecture diagrams have a habit of representing services as boxes connected by clean, straight lines. A user hits your application server, which calls your database, which calls your cache layer, which calls a third-party API. On the diagram, these all look adjacent.

In reality, each of those boxes might be running in a different region, a different availability zone, or worse, a different cloud provider entirely — and every hop between them incurs real, physical, unavoidable network latency that's bound by the speed of light and the actual fiber path the data takes.

A request that bounces from your application server in one region, to a database replica in another, to a caching layer in a third, can easily accumulate 150-300ms of pure network latency before any actual processing happens — even though every individual service is, in isolation, performing perfectly.

This is the gap between "architecturally correct" and "physically fast." Both can be true about the same system simultaneously, and the second one is the one your users actually experience.
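A simple latency budget makes the accumulation visible. The sketch below sums round-trip time multiplied by the number of sequential calls for each hop. The hop names, round-trip times, and call counts are hypothetical; real values would come from tracing or from measuring between your own regions.

```python
from typing import List, Tuple

# (description, round-trip time in ms, sequential calls per user request)
Hop = Tuple[str, float, int]


def network_budget_ms(hops: List[Hop]) -> float:
    """Total network latency per request, before any processing time."""
    return sum(rtt_ms * calls for _, rtt_ms, calls in hops)


def worst_hop(hops: List[Hop]) -> str:
    """Description of the hop contributing the most latency."""
    return max(hops, key=lambda hop: hop[1] * hop[2])[0]


# Hypothetical request path spanning three regions.
request_path = [
    ("app (region A) -> database replica (region B)", 85.0, 2),
    ("app (region A) -> cache (region C)", 70.0, 1),
]
print(f"{network_budget_ms(request_path):.0f} ms of pure network latency")
print(f"largest contributor: {worst_hop(request_path)}")
```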

The Specific Patterns That Cause This

Database reads crossing regions silently. A common pattern: an application server in US-East queries a database that has a read replica in US-East, but a misconfigured connection string or load balancer routes some percentage of traffic to a replica in EU-West instead. The application works correctly. The data is accurate. But a subset of requests now incurs a transatlantic round trip that adds 80-100ms to every affected query, and because the failure mode is slowness rather than an error, it can persist undetected for months.

Microservices chatting across availability zones. Modern microservice architectures often involve a single user request triggering a cascade of internal service-to-service calls — sometimes a dozen or more. If those services aren't deliberately co-located within the same availability zone, each inter-service call incurs cross-zone latency that compounds. A chain of 10 services each adding 2-5ms of cross-zone overhead turns into 20-50ms of pure latency tax that has nothing to do with actual computation.

CDN and origin server mismatch. Content delivery networks are excellent at caching static assets close to users. But when a cache miss occurs — or when the request requires dynamic, uncacheable content — the request has to travel all the way back to the origin server. If your origin is in a single region and your CDN edge nodes span the globe, users far from your origin experience this round trip on every cache miss, which for highly dynamic applications can be the majority of requests.

Third-party API dependencies in distant regions. Your application might be perfectly architected, but if it depends synchronously on a third-party payment processor, authentication provider, or data enrichment API hosted in a region far from your own infrastructure, that dependency becomes a latency floor you cannot engineer around without changing the dependency itself or its connection path.

Why This Is Genuinely Hard to Diagnose

The frustrating part of cross-region latency issues is that they're often invisible in the metrics teams check first.

CPU utilization looks fine. Memory looks fine. Database query execution time, measured at the database itself, looks fine. The problem only becomes visible when you measure end-to-end latency from the actual user's perspective and trace exactly which network hops contributed to it — which requires distributed tracing instrumentation that many teams don't have in place until after they've already experienced the problem.

This is precisely why observability practices that capture the full request path, not just individual service performance, have become essential rather than optional for any application with meaningful geographic distribution. Distributed tracing tools (OpenTelemetry, Jaeger, AWS X-Ray) exist specifically to make these invisible cross-region hops visible.

What Actually Fixes This

Co-locate services that talk to each other frequently. The most direct fix is architectural: services that communicate synchronously and frequently should run in the same region, ideally the same availability zone. This sounds obvious, but as organizations grow and teams deploy services independently, regional drift happens gradually and often goes unnoticed until someone audits it deliberately.

Use read replicas correctly, and verify routing. If you've deployed regional read replicas specifically to reduce latency, audit your connection routing regularly. A misconfigured load balancer or an application that doesn't properly use region-aware connection strings can silently defeat the entire purpose of having replicas in the first place.

Push computation to the edge where it makes sense. For workloads where every request currently round-trips to a centralized origin, edge computing (running logic at the CDN edge rather than the origin) can eliminate much of this latency for the subset of operations that don't require centralized state.

Choose infrastructure providers with genuine geographic coverage that matches your user base. This sounds like an obvious point, but it's frequently under-prioritized during initial infrastructure selection. If a meaningful portion of your user base is in a region where your provider has thin or absent regional presence, no amount of clever architecture fully compensates for the physical distance between your infrastructure and your users.

Measure from the user's actual vantage point, not just from your own data center. Synthetic monitoring tools that test from multiple global locations — not just from your own infrastructure's region — are the only reliable way to catch this class of problem before users report it.
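Once probes are running from several locations, the comparison itself is straightforward: compute a high percentile per region and flag any region that is far slower than the best one. The sketch below uses a nearest-rank percentile; the region names, samples, and the 2x threshold are hypothetical and would need tuning against your own baseline.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_regions(latency_by_region: Dict[str, List[float]],
                 pct: float = 95, ratio: float = 2.0) -> List[str]:
    """Regions whose percentile latency exceeds `ratio` times the best region's."""
    per_region = {region: percentile(samples, pct)
                  for region, samples in latency_by_region.items()}
    best = min(per_region.values())
    return sorted(r for r, value in per_region.items() if value > ratio * best)


# Hypothetical end-to-end latencies (ms) from probes in two locations.
probe_results = {
    "us-east": [40, 42, 45, 41, 43],
    "eu-west": [130, 128, 140, 135, 132],
}
print(slow_regions(probe_results))  # ['eu-west']
```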

The Broader Lesson

Latency problems caused by network topology rarely show up as a single dramatic incident. They show up as a slow, persistent erosion of user experience in specific regions that's easy to attribute to "the app being a bit slow sometimes" rather than correctly diagnosing as a structural, fixable architecture issue.

The fix is rarely more compute. It's understanding the actual physical and logical path your data takes, region by region, hop by hop, and being deliberate about which services genuinely need to be geographically close to each other and which ones can tolerate distance.