Benard Otieno

Posted on May 20

The GPU Is the New Database

#ai #infrastructure #devops #cloud

Twenty years ago, teams had no idea how to run databases at scale. They made every mistake possible before the patterns solidified. We are now in the same position with GPU infrastructure, making the same mistakes, faster.

In 2004, if you were running a web application at any meaningful scale,
your biggest infrastructure problem was the database. Not the application
servers: those were stateless, and you could always add more. The database
was the single stateful thing everything depended on. It didn't scale
horizontally, it was expensive to run, and almost nobody knew how to
operate it well.

Teams made every mistake. They put too much logic in the application and
not enough in the database. They put too much in the database and not
enough in the application. They didn't index correctly. They didn't cache
correctly. They scaled vertically until they couldn't, then scrambled to
shard. They had no idea what their query plans looked like. They treated
the database as a black box until it stopped working, then learned the
hard way that it wasn't.

Over the following decade, the patterns solidified. Connection pooling.
Read replicas. Query analysis. Proper indexing strategy. Cache layers.
The knowledge became common. The tools improved. Managed database services
abstracted most of the complexity. Today a competent team can run a
database at significant scale without extraordinary expertise.

We are now, in 2026, in the same position with GPU infrastructure. The
GPU is the new database: the expensive, stateful, poorly understood
bottleneck that everything AI depends on, that doesn't scale the way
people expect, that is being operated badly by the majority of teams
running it, and for which the patterns have not yet solidified.

The teams that figure this out first will have an infrastructure advantage
that is very difficult to close. The teams that don't will spend the
next five years making the same mistakes everyone made with databases
in 2004, just faster and more expensively.

Why the GPU is not just a fast CPU

The first mistake most teams make with GPU infrastructure is treating
GPUs as very fast CPUs. They're not. They're a fundamentally different
computational model, and the mismatch between that model and how most
people use them is where most of the waste comes from.

A CPU is optimised for latency: completing a single complex task as
quickly as possible. It has a small number of powerful cores, large
caches, sophisticated branch prediction, and out-of-order execution.
It's good at sequential logic, conditional branching, and tasks where
each step depends on the result of the previous one.

A GPU is optimised for throughput: completing an enormous number of
simple tasks simultaneously. It has thousands of smaller, simpler cores.
It's good at the same operation applied in parallel to a large amount
of data. It's bad at anything sequential, anything with complex
branching, and anything where you need to move data back to the CPU
in the middle of computation.

The practical consequence: a GPU that is not batching work is a GPU
that is mostly idle. The most common pattern for teams deploying AI
inference in production (one request comes in, run the model, return
the result, wait for the next request) uses a small fraction of the
GPU's actual capacity. The GPU's utilisation number can still look
reasonable, because that number typically reports how much of the time
some kernel was running, not how much of the chip it kept busy.
The GPU's actual computational throughput is terrible.

This is the equivalent of a database that opens a new connection for
every query, executes it, and closes the connection. Technically
functional. Completely missing how the system should be used.

# What most teams do: one request, one inference
# GPU utilisation looks like 20-40%, but throughput is poor

import asyncio

async def handle_inference_request(prompt: str) -> str:
    # `model` is assumed to be an already-loaded model object
    result = model.generate(prompt)  # one prompt per GPU call
    return result


# What should be happening: dynamic batching
# Multiple requests grouped and processed together

class InferenceBatcher:
    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    def start(self) -> None:
        # Call once from inside a running event loop
        self._worker = asyncio.create_task(self._batch_worker())

    async def infer(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def _batch_worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then start the clock
            batch = [await self.queue.get()]
            deadline = loop.time() + (self.max_wait_ms / 1000)

            # Collect requests until batch is full or deadline passes
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(
                        self.queue.get(),
                        timeout=timeout
                    )
                    batch.append(item)
                except asyncio.TimeoutError:
                    break

            prompts = [item[0] for item in batch]
            futures = [item[1] for item in batch]

            try:
                # Single GPU call processes all requests together.
                # generate_batch is assumed to be a blocking method, so it
                # runs in a thread to keep the event loop responsive.
                results = await asyncio.to_thread(
                    self.model.generate_batch, prompts
                )
            except Exception as exc:
                # Propagate the failure to every caller in the batch
                for future in futures:
                    if not future.done():
                        future.set_exception(exc)
                continue

            for future, result in zip(futures, results):
                if not future.done():
                    future.set_result(result)

Dynamic batching is the connection pooling of GPU inference. It is
not optional if you care about cost or throughput. It is also not
implemented by default in most hand-rolled inference deployments,
for the same reason that early web applications didn't implement
connection pooling: teams didn't know they needed it until they
hit the wall.
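
To put rough numbers on the batching argument, here is a back-of-envelope model. It assumes a GPU call costs a fixed overhead plus a small per-request increment; the 40 ms and 2 ms defaults are illustrative assumptions, not measurements from any real model.

```python
def requests_per_second(batch_size: int,
                        fixed_ms: float = 40.0,
                        per_request_ms: float = 2.0) -> float:
    """Throughput under a toy cost model: one GPU call takes
    fixed_ms + batch_size * per_request_ms milliseconds."""
    call_ms = fixed_ms + batch_size * per_request_ms
    return batch_size / (call_ms / 1000.0)


for size in (1, 8, 32):
    print(f"batch={size:>2}: {requests_per_second(size):.1f} req/s")
```

Under these assumed costs, a batch of 32 delivers more than ten times the throughput of single-request calls on the same hardware. The real ratio depends on the model and sequence lengths, so measure your own before trusting the shape of this curve.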

The memory hierarchy nobody teaches you

GPU memory is not like CPU memory. Understanding the difference is
the difference between a system that works and one that doesn't, and
between inference costs that are manageable and ones that are not.

A GPU has its own on-device memory: VRAM. VRAM is fast, finite,
and expensive. A GPU with 80GB of VRAM is a very expensive GPU.
The model you're running must fit in VRAM. If it doesn't fit, you
can use techniques like quantization to make it smaller, or you can
distribute it across multiple GPUs, but you cannot simply overflow
to system RAM without taking a catastrophic performance hit. The
link between CPU RAM and GPU VRAM is one to two orders of magnitude
slower than VRAM bandwidth. When you hear about models being "quantized
to 4-bit," this is why: 4-bit quantization cuts the weight footprint
to roughly a quarter of the 16-bit original (half of 8-bit), which is
often the difference between fitting on one GPU and not fitting on
one GPU.
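
The arithmetic behind "does it fit" is simple enough to sketch. This counts weights only; KV cache, activations, and quantization metadata all need room on top, so treat the result as a lower bound.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_footprint_gb(70, bits):.0f} GB")
```

A 70B model at 16-bit needs about 140GB for weights, which does not fit on a single 80GB card. At 4-bit it needs about 35GB, which does.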

Within the GPU itself, there is a memory hierarchy that determines
how fast computation runs. The KV cache (the cached attention keys
and values for the tokens already processed in a conversation)
lives in VRAM and grows linearly with sequence length. Managing KV cache
is one of the most consequential performance decisions in LLM serving,
and most teams don't think about it at all until they start hitting
out-of-memory errors on long contexts.
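
A quick way to see why long contexts run out of memory is to size the KV cache for one sequence. The sketch below stores one key vector and one value vector per token, per layer, per KV head. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble an 8B model with grouped-query attention; substitute your model's actual config.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache held for a single sequence of seq_len tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Assumed model shape, for illustration only
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for n in (512, 8192):
    mib = kv_cache_bytes(n, **cfg) / 2**20
    print(f"{n} tokens: {mib:.0f} MiB per sequence")
```

With this shape, every token costs 128 KiB of VRAM, so a single 8192-token conversation holds a full GiB. Multiply by the number of concurrent sequences and the cache can easily rival the weights themselves.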

# KV cache management: what happens without it
# Each new token recomputes keys and values for the entire context
# Attention cost per generated token is O(n²) in context length
# instead of O(n)

# What vLLM and similar systems do differently:
# PagedAttention manages KV cache in fixed-size blocks
# like virtual memory paging in an OS

# This allows:
# 1. Sharing KV cache between requests with the same prefix
# 2. Better memory utilisation (almost no fragmentation: at most
#    one partly filled block per sequence)
# 3. Handling variable-length sequences without pre-allocating
#    worst-case memory

from vllm import LLM, SamplingParams

# vLLM handles KV cache management automatically
# This is not a minor optimisation — it's 2-4x throughput improvement
# on typical workloads versus naive implementations

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # Hugging Face model ID
    gpu_memory_utilization=0.90,    # Leave 10% headroom
    max_model_len=8192,             # Maximum sequence length
    enable_prefix_caching=True,     # Cache common prefixes (system prompts)
    tensor_parallel_size=1,         # Number of GPUs for this model
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
)

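# Prefix caching means your system prompt is computed once
# and cached for all subsequent requests, which is significant for
# long system prompts used with every inference
prompts = [
    "Summarise the following support ticket: ...",    # placeholder prompts
    "Summarise the following incident report: ...",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)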

Most teams serving LLMs in production are not using PagedAttention.
They're using naive inference implementations that waste fifty to
seventy percent of their GPU memory to fragmentation and redundant
computation. The cost difference is not marginal.
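
A toy comparison shows where the waste comes from. It contrasts reserving worst-case KV cache space per request with allocating in fixed-size blocks as the sequence grows. The sequence lengths, the 8192-token maximum, and the 16-token block size are illustrative assumptions, and this models only the bookkeeping, not how any particular engine implements it.

```python
import math


def reserved_slots_preallocated(lengths: list[int], max_len: int) -> int:
    """Every request reserves room for the longest possible sequence."""
    return len(lengths) * max_len


def reserved_slots_paged(lengths: list[int], block_size: int) -> int:
    """Every request holds only the blocks it has actually touched."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


def waste_fraction(lengths: list[int], reserved: int) -> float:
    """Share of reserved token slots that hold no data."""
    return 1 - sum(lengths) / reserved


lengths = [100, 500, 2000]  # tokens currently held by three requests
naive = reserved_slots_preallocated(lengths, max_len=8192)
paged = reserved_slots_paged(lengths, block_size=16)

print(f"pre-allocated: {waste_fraction(lengths, naive):.1%} wasted")
print(f"paged:         {waste_fraction(lengths, paged):.1%} wasted")
```

In this example the pre-allocating scheme leaves most of its reserved memory empty, while the paged scheme wastes only the unfilled tail of each request's last block. That reclaimed memory is what lets a paged server hold more sequences per batch.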

The scaling question everyone asks wrong

When a team's AI infrastructure starts struggling under load, the
first question is almost always: "should we add more GPUs?"

This is the wrong question asked at the wrong time, for the same
reason that "should we add more database servers" was the wrong
first question when a database was struggling in 2008. The right
question is: "why are we using our current GPUs so inefficiently?"

GPU utilisation that is below sixty percent is almost always a
batching problem. Requests are not being grouped efficiently before
hitting the GPU. You can add more GPUs and halve your utilisation
number, which means you now have twice the infrastructure running at
thirty percent capacity instead of one set running at sixty. You've
doubled your cost and solved nothing.

High GPU utilisation with latency that is still bad is almost
always a model sizing problem. The model is too large for the latency
target at the request volume being served. A smaller or quantized
model, or a different architecture, may meet your latency requirements
at a fraction of the compute cost.

# Measuring what actually matters before deciding to scale

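import time
from prometheus_client import Histogram, Gauge

# These metrics tell you where the problem actually is.
# The two GPU gauges must be fed by a separate collector (for example
# NVML or a DCGM exporter); nothing in this snippet sets them.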

GPU_UTILISATION = Gauge(
    'gpu_utilisation_percent',
    'GPU compute utilisation',
    ['device_id']
)

GPU_MEMORY_USED = Gauge(
    'gpu_memory_used_bytes',
    'GPU VRAM in use',
    ['device_id']
)

BATCH_SIZE = Histogram(
    'inference_batch_size',
    'Number of requests processed per batch',
    buckets=[1, 2, 4, 8, 16, 32, 64]
)

TOKENS_PER_SECOND = Histogram(
    'inference_tokens_per_second',
    'Throughput of inference in tokens per second',
    buckets=[10, 25, 50, 100, 200, 400, 800]
)

TIME_TO_FIRST_TOKEN = Histogram(
    'inference_ttft_seconds',
    'Time from request to first token generated',
    buckets=[.05, .1, .25, .5, 1, 2, 5]
)

REQUEST_QUEUE_DEPTH = Gauge(
    'inference_queue_depth',
    'Number of requests waiting for GPU'
)


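class InstrumentedInferenceServer:
    def __init__(self, queue, run_inference):
        # queue: the asyncio.Queue holding pending requests
        # run_inference: async callable that takes a list of prompts and
        # returns a list of completions (both supplied by your server)
        self.queue = queue
        self._run_inference = run_inference

    async def infer(self, prompts: list[str]) -> list[str]:
        BATCH_SIZE.observe(len(prompts))
        REQUEST_QUEUE_DEPTH.set(self.queue.qsize())

        start = time.perf_counter()
        results = await self._run_inference(prompts)
        duration = time.perf_counter() - start

        # Whitespace word count is only a rough proxy for tokens;
        # use the model's tokenizer for accurate numbers.
        total_tokens = sum(len(r.split()) for r in results)
        if duration > 0:
            TOKENS_PER_SECOND.observe(total_tokens / duration)

        return results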

When you can see batch sizes, queue depth, tokens per second, and
time-to-first-token alongside GPU utilisation and VRAM usage, the
question of "do we need more GPUs" almost answers itself. Usually
the answer is "no, we need to batch better" or "no, we need to use
a smaller model" and scaling turns out to be unnecessary.
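
The reasoning in this section can be written down as a small triage function. It is a sketch of the rules of thumb above, not a substitute for looking at the dashboards. The sixty percent threshold comes from the text; the function name, the latency inputs, and the "queue is non-empty" rule for recommending more GPUs are assumptions added for illustration.

```python
def diagnose(gpu_util_pct: float, p95_latency_s: float,
             latency_target_s: float, queue_depth: int) -> str:
    """Map serving metrics to the first thing worth investigating."""
    if gpu_util_pct < 60:
        # Low utilisation: requests are not being grouped efficiently
        return "fix batching"
    if p95_latency_s > latency_target_s:
        # Busy GPU, slow answers: the model is too big for this load
        return "try a smaller or quantized model"
    if queue_depth > 0:
        # Busy GPU, latency on target, work still piling up
        return "add GPUs"
    return "leave it alone"


print(diagnose(gpu_util_pct=35, p95_latency_s=0.8,
               latency_target_s=1.0, queue_depth=20))
```

Note the ordering: adding GPUs is the last branch, reached only after batching and model size have been ruled out.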

The cold start problem nobody planned for

Databases take seconds to start. GPU inference servers take minutes.

A database that restarts unexpectedly is back within thirty seconds
in most cases. An LLM inference server that restarts needs to load
model weights from storage into VRAM before it can serve any requests.
A 70B parameter model stored in 4-bit quantization is roughly 35GB.
Loading 35GB from network storage into VRAM, at typical cloud storage
bandwidth, takes several minutes under good conditions.
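
The load time is easy to estimate for your own setup. The bandwidth figures below are assumptions for illustration (200 MB/s for network storage, 3000 MB/s for local NVMe); measure your real read throughput, and remember that process start-up and engine initialisation add time on top.

```python
def load_time_seconds(model_gb: float, bandwidth_mb_per_s: float) -> float:
    """Time to read model weights at a sustained bandwidth."""
    return model_gb * 1000 / bandwidth_mb_per_s


for label, bandwidth in (("network storage", 200), ("local NVMe", 3000)):
    seconds = load_time_seconds(35, bandwidth)
    print(f"{label}: {seconds:.0f}s ({seconds / 60:.1f} min)")
```

At the assumed network rate, the 35GB model takes roughly three minutes just to read. Caching weights on local disk shrinks that considerably, which is one reason some teams bake weights into the node image or a local volume.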

This changes incident dynamics entirely. A database blip is a brief
interruption. A GPU server blip is a several-minute outage for every
affected instance. Autoscaling, which works well for stateless
application servers and adequately for databases, works badly for
GPU inference because new instances take so long to become ready.

The teams that have worked this out run warm pools: GPU instances
with models already loaded, sitting idle, waiting for traffic that
hasn't arrived yet. This feels wasteful. It's the only way to handle
traffic spikes without minutes-long latency blowouts.

# Kubernetes deployment with warm pool strategy
# Minimum replicas keep instances warm even at low traffic
# This costs money. The alternative is cold start latency.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
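spec:
  replicas: 3  # Your warm pool. The HPA's minReplicas enforces this floor.
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      terminationGracePeriodSeconds: 60  # Must exceed the preStop sleep below
      containers: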
      - name: inference-server
        image: myteam/inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          # Model loading takes time. This probe must not pass
          # until the model is fully loaded in VRAM.
          initialDelaySeconds: 180   # 3 minutes minimum
          periodSeconds: 10
          failureThreshold: 30       # 5 more minutes of retries
        lifecycle:
          preStop:
            exec:
              # Drain in-flight requests before shutdown
              command: ["/bin/sh", "-c", "sleep 30"]

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 3    # Warm pool floor
  maxReplicas: 10
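  metrics:
  # External metrics are not built in: this assumes a metrics adapter
  # (for example Prometheus Adapter or KEDA) exposes inference_queue_depth.
  - type: External
    external:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"  # Scale when queue exceeds 5 requests per replica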
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120   # Add 2 pods max every 2 minutes
                             # Fast enough to respond, slow enough
                             # to not over-provision during spikes
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
                                       # Cold start cost makes yo-yo scaling
                                       # extremely expensive

The scaleDown stabilization window is long deliberately. The cold start
cost is so high that scaling down and back up in response to a brief
traffic dip is more expensive than just keeping the instances running.
This is counterintuitive if you're coming from stateless web services.
It's the operational reality of GPU infrastructure.
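
The readiness probe in the manifest above only helps if /health tells the truth about model loading. Here is a minimal standard-library sketch of such an endpoint. The names `MODEL_READY`, `health_status`, and `HealthHandler` are hypothetical, and a real inference server would normally expose this through its own web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Set this once the weights are fully loaded into VRAM
MODEL_READY = threading.Event()


def health_status(ready: bool) -> tuple[int, bytes]:
    """HTTP status and body for the readiness endpoint."""
    return (200, b"ok") if ready else (503, b"loading")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = health_status(MODEL_READY.is_set())
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To use it, serve `HealthHandler` on port 8080 with `http.server.HTTPServer` in a background thread and call `MODEL_READY.set()` after the model finishes loading. Until then the probe gets a 503 and Kubernetes keeps traffic away from the pod.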

The cost model is upside down

Database costs scale with data volume and query complexity. You pay
more as your data grows and your queries get more complex.

GPU costs scale with time. You pay for every second a GPU exists,
whether it's serving requests or not. An idle GPU costs the same as
a busy GPU.

This inverts the normal infrastructure economics. With stateless
application servers, idle capacity is cheap: you can scale to zero
when traffic drops and pay nothing. With GPU inference, scaling to
zero means cold starts when traffic returns. The minimum viable
capacity for a production inference service is not zero. It's
whatever your warm pool needs to be, which is determined by your
acceptable cold start latency and your traffic spike patterns.
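
One rough way to size that floor: the warm pool has to carry the highest traffic that can arrive before a cold-started instance becomes ready. The sketch below assumes you have measured that peak and the sustainable throughput of one replica. The function name and the 20% headroom default are assumptions, not a standard formula.

```python
import math


def warm_pool_size(peak_rps: float, per_replica_rps: float,
                   headroom: float = 0.2) -> int:
    """Replicas needed to absorb a spike without waiting on cold starts."""
    return math.ceil(peak_rps * (1 + headroom) / per_replica_rps)


# Example: spikes reach 40 req/s and one replica sustains 15 req/s
print(warm_pool_size(peak_rps=40, per_replica_rps=15))
```

Both inputs come from measurement: the peak from your traffic history, the per-replica figure from load testing with realistic batch sizes.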

The teams that have made peace with this have stopped thinking about
GPU cost as a variable cost that tracks usage and started thinking
about it as a fixed cost that buys capacity. The question is not
"how do we pay less for GPU when traffic is low?" The question is
"what is the right amount of always-on capacity, and how do we make
sure we use it efficiently?"

Efficient use means high batch fill rates, high token throughput per
GPU-hour, low idle time. The metrics above are the inputs to this
calculation. Without them you're guessing about whether your
infrastructure is sized correctly.
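
Because the GPU bills by the hour regardless of load, the useful unit cost is dollars per token actually served. A minimal sketch, assuming a hypothetical price of $2.50 per GPU-hour and throughput figures chosen only for illustration:

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float) -> float:
    """Effective cost of one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


for tps in (500, 2000):
    usd = cost_per_million_tokens(2.50, tps)
    print(f"{tps} tok/s: ${usd:.2f} per million tokens")
```

The hourly price is fixed, so every gain in sustained throughput from better batching cuts the unit cost in proportion. This is the number to track when deciding whether the always-on capacity is paying for itself.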

The pattern that's emerging

The teams operating GPU infrastructure well in 2026 look, in their
operational discipline, a lot like the teams that operated databases
well in 2012, after enough people had been burned that the patterns
were starting to solidify.

They treat GPU utilisation as a lagging indicator and token throughput
as the leading one. They instrument everything: batch sizes, queue
depth, time-to-first-token, VRAM usage, KV cache hit rates. They
size their warm pools based on measured traffic patterns rather
than intuition. They run the smallest model that meets their quality
bar, not the largest model they can afford, because smaller models
batched efficiently outperform larger models batched poorly on
almost every practical metric.

They've also accepted something that takes a while to accept: that
the right abstraction for GPU infrastructure is not "fast compute"
but "throughput capacity." The question is not "how fast can this
machine process one request?" GPUs are fast at that regardless.
The question is "how many requests per dollar can this infrastructure
handle at acceptable latency?" That question requires different
metrics, different architecture, and a different mental model than
the one most teams bring from their experience with CPU infrastructure.

The database analogy runs deeper than it looks. In 2004, the teams
that treated the database as a black box (put data in, get data
out, add more RAM when it's slow) eventually hit walls that their
architecture couldn't get past. The teams that understood what was
happening inside the database (query plans, index usage, lock
contention, buffer pool behaviour) built things that scaled.

The GPU is not a black box. It has a memory hierarchy, a batching
model, a cost structure, and performance characteristics that reward
understanding and punish ignorance in the same way the database did.

The patterns are forming. The teams learning them now will have the
same advantage in five years that database-literate engineers had in 2015.

The mistakes are happening right now, at scale, expensively.
Most of them are the same mistakes. Most of them are avoidable.

Top comments (1)

VoltageGPU • May 20

Interesting analogy. I'm seeing a similar pattern in confidential computing where the shift toward TEEs for GPUs is forcing us to rethink memory orchestration from the ground up. We're definitely in that "making mistakes" phase of the learning curve.