
Seenivasa Ramadurai

You Were Trained. But Are You Ready to Serve?

Introduction

The gap between building an LLM and running it in production, and what it teaches us about our own careers.

We have all met that person. Top of their class. Brilliant in theory. Deep, encyclopedic knowledge in their field. And yet, somehow, they struggle the moment real work lands on their desk. They freeze when faced with ambiguous problems. They slow down under pressure, failing to deliver at the level everyone expected.

The world of machine learning has a name for this exact failure mode. It isn’t a training problem. It’s a serving problem. Once you see it through this lens, you will never look at your education, your career, or your daily workflow the same way again.

Part 1: You Are a Model.

College Was Your Training Run.

In machine learning, a model begins as a blank slate: an empty architecture with no knowledge, no instincts, and no ability to recognize patterns.

Then, training begins. The model is fed enormous amounts of data (text, images, signals), and it fails constantly. Each failure produces an error signal. That error signal flows backward through the network, making tiny adjustments to the model's internal parameters: its weights and biases.

Repeat this process millions of times, and something remarkable happens. The model stops failing randomly and starts recognizing structure. It builds intuition, develops defaults, and becomes capable.
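For the curious, that whole mechanic fits in a few lines of Python. Here is a minimal sketch of repeated gradient updates on a single weight; the toy model, the numbers, and the learning rate are all illustrative assumptions:

```python
# Minimal sketch: repeated gradient updates on one weight.
# Toy model: pred = w * x, loss = (pred - target)^2. Values are illustrative.
w, lr = 0.0, 0.1        # starting weight and learning rate
x, target = 2.0, 1.0    # a single training example

for step in range(100):
    pred = w * x
    error = pred - target       # the error signal
    grad = 2 * error * x        # gradient of the squared loss w.r.t. w
    w -= lr * grad              # the tiny adjustment to the weight

print(round(w, 4))  # converges toward target / x = 0.5
```

Scale that loop up to billions of weights and trillions of examples and you have, in spirit, the training run described above.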

That is exactly what formal education does to you.

Every lecture, every textbook chapter, every exam you failed and had to retake, every piece of stinging feedback from a mentor—each one was a gradient update. It was a small error signal flowing back through your thinking, adjusting your internal parameters. Your weights and biases are your professional instincts: how you approach a problem, what tool you reach for first, and how you reason under pressure.

College built those slowly, painfully, and iteratively. Training was never truly about the grade; it was about adjusting the weights.

But remember: this phase is long and controlled. The data is curated, and the environment is safe. The answers exist somewhere, and someone is grading your output against them. Training is preparation, not performance.

Part 2: Your Degree Is Your Domain-Specific Fine-Tune.

After general pre-training, machine learning models go through a second phase called fine-tuning. The base model already has broad capabilities: it understands language, logic, and patterns. Fine-tuning narrows that capability toward a specific domain.

A model fine-tuned on medical data learns to reason about symptoms and diagnoses. One fine-tuned on legal documents learns to navigate argument and precedent. It’s the same base architecture, but a completely different specialization.
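In code, that second phase looks roughly like the sketch below, shown with Hugging Face Transformers on a toy in-memory dataset; the model name, the two example texts, and the hyperparameters are placeholder assumptions, and a real fine-tune would use far more data:

```python
# Minimal sketch of domain fine-tuning with Hugging Face Transformers.
# "gpt2" and the two example texts are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A toy "medical" corpus standing in for real domain data.
texts = ["Patient presents with fever and a persistent cough.",
         "Differential diagnosis includes influenza and pneumonia."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # nudges the general model toward the domain
```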

Your degree is your fine-tune. You stopped being a general learner and became domain-specific. Your configuration was set, and your weights were adjusted for a particular problem space.

  • A medical student's parameters are tuned to healthcare.

  • A software engineer's are tuned to systems and logic.

  • A finance major's are tuned to risk, capital, and market behavior.

By the time you walk across that graduation stage, your architecture is locked in. You are no longer a blank model trained on everything broadly; you are a specialized model trained deeply on something specific.

That is the value your institution produced, and specific is exactly what the real world hires for.

But here is the thing nobody tells you at graduation: Fine-tuning is not the finish line. It is just the end of the controlled phase. The real test begins somewhere else entirely.

Part 3: Getting the Job Is Deployment.

And Deployment Changes Everything.

In machine learning, when a model finishes training and fine-tuning, it gets deployed into production. This is called model serving.

The model is now live. Real users send real requests. The environment is absolutely nothing like training. There is no curated dataset, no answer key, and no controlled batch of problems neatly designed to be solvable. There are just requests—unpredictable, varied, and arriving concurrently at any time. The model must handle them fast, reliably, and accurately.

When you land your first job, you have been deployed. And the rules change completely.

Model serving is the most critical phase of the entire pipeline. It is where value is actually created: not in the research notebook, but in production, under real load, handling requests the model has never seen before.

A model that trains beautifully but collapses in production is entirely worthless.

Part 4: The Uncomfortable Truth of Brilliant People Who Cannot Perform

We have all witnessed it: the student who aced every exam but freezes the moment a project doesn't fit a known template. The top graduate who cannot handle ambiguity. The deeply knowledgeable professional who always seems behind, overwhelmed, and bottlenecked on every task.

This is not an intelligence failure, nor is it a lack of knowledge. In machine learning terms, this is a well-trained model with broken serving infrastructure.

The weights are good, the training was solid, and the fine-tuning was real. But when the model hit production, when unseen requests started arriving in real time with no answer key, the infrastructure around it simply couldn't handle the load. Requests queued up, memory was wasted, and output slowed to a crawl. The model was capable, but the serving layer was not.

Training quality and serving quality are two completely separate problems. A brilliant model can fail in production, and a brilliant person can fail at work for the exact same reason.

This is the gap nobody talks about in education. Schools optimize entirely for training quality: better lectures, better exams, better grades. Nobody teaches you how to serve. Nobody teaches you how to handle requests you’ve never seen, how to manage your cognitive resources under concurrent load, or how to build the execution infrastructure that turns what you know into what you consistently deliver.

In machine learning, two frameworks represent exactly this divide. One is built for training and research; the other is built for production serving. Understanding what separates them changes everything.

Part 5: Hugging Face Transformers vs. vLLM

Framework 1: Hugging Face, the Brilliant Student Who Works Alone

Hugging Face Transformers is the gold standard for research, experimentation, fine-tuning, and prototyping. If you want to load a state-of-the-art model and iterate on an idea, it’s extraordinary.

But when you take a Hugging Face model and naively deploy it to serve real user traffic, engineering bottlenecks surface fast (a minimal sketch of this naive setup follows the list):

  • Static batching: It waits for a full batch to assemble before processing. If requests arrive unevenly, the GPU idles, throughput drops, and users wait.

  • Memory pre-allocation: It pre-allocates a fixed block of GPU memory per request for the maximum possible sequence length, even if the request is short. Most memory is wasted, causing you to run out of memory far too early under real load.

  • No shared caching: If a hundred users start with the same long system prompt, attention states are recomputed a hundred times from scratch with no reuse.

  • Pipeline jams: A single long generation occupies a batch slot, blocking faster, shorter requests behind it.
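As promised, here is a minimal sketch of that naive setup; "gpt2" is a small placeholder model and the settings are illustrative. The shape to notice is one request handled at a time:

```python
# Minimal sketch of naive serving with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def handle_request(prompt: str) -> str:
    # One request at a time: no continuous batching and no reuse of
    # attention states across requests that share a prefix.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(handle_request("Explain model serving in one sentence."))
```

Wrap this in a web endpoint and every concern in the list above arrives with the first burst of traffic.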

The Human Equivalent: This is the brilliant professional who works deeply but can only handle one task at a time. They take on a problem, give it everything, finish it completely, and then pick up the next. They never build systems, and they don't document solutions for reuse, so every new project starts from scratch. They are outstanding in a controlled environment, but entirely overwhelmed the moment volume, concurrency, and unpredictability arrive simultaneously.

Hugging Face isn't wrong—it is perfectly designed for its purpose. The mistake is using a research tool as a production serving engine, assuming that being well-trained is the same as being ready to serve.

Framework 2: vLLM, the Same Model Built to Serve

vLLM is an open-source inference engine built with a single purpose: serving large language models in production at scale. It doesn’t change the model’s weights or retrain anything. It takes the exact same model that runs in Hugging Face and serves it in a way optimized for real traffic, memory constraints, and throughput requirements.

The results are dramatic: the same model, on the same hardware, can achieve up to 24x higher throughput simply because the serving layer was optimized.
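Here is what the same job looks like through vLLM's offline LLM interface; "gpt2" is again a placeholder and the prompts are made up:

```python
# Minimal sketch of serving the same model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")
params = SamplingParams(max_tokens=64)

prompts = [
    "Summarize our refund policy.",
    "Draft a status update for the team.",
]
# Continuous batching and PagedAttention happen inside the engine;
# the calling code stays simple.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```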

Four core engineering innovations make this possible, and each has a direct equivalent in how high-performing people operate in the real world:

1. PagedAttention vs. Focused Attention

  • In ML: Traditional serving pre-allocates one massive block of GPU memory per request—like reserving an entire hotel floor for a single guest. Most of it sits empty. vLLM's PagedAttention manages the KV cache in small, dynamic, non-contiguous pages. Memory is allocated only as needed and released immediately upon completion, resulting in near-zero waste. This is how vLLM handles dramatically more concurrent traffic (see the back-of-envelope sketch after this list).

  • In You: High performers do not hold every open project and pending email simultaneously in their active working memory. They "page in" what the current task actually needs, process it, and release it. People who carry everything at once feel constantly busy, but their output is fragmented, slower, and lower quality. Focused attention isn't a soft skill—it’s memory management.
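The back-of-envelope sketch, with purely illustrative numbers:

```python
# KV-cache slots reserved per request: pre-allocation vs. paging.
# All numbers are made-up assumptions, not measurements.
max_seq_len = 2048   # worst case reserved up front by naive serving
actual_len = 200     # tokens the request actually used
page_size = 16       # tokens per PagedAttention page

preallocated = max_seq_len
paged = -(-actual_len // page_size) * page_size  # round up to whole pages

print(f"naive: {preallocated} slots reserved, {preallocated - actual_len} wasted")
print(f"paged: {paged} slots reserved, {paged - actual_len} wasted")
```

With these numbers, waste falls from 1,848 slots to 8, and that reclaimed headroom is exactly what lets the engine admit more concurrent requests.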

2. Continuous Batching vs. Pipeline Thinking

  • In ML: Instead of waiting for an entire batch to run to completion (static batching), vLLM uses continuous batching. The moment any slot completes its generation, a new request is slotted in immediately. The GPU is never idle, and throughput skyrockets (the toy simulation after this list shows the effect).

  • In You: Effective professionals design their workflow so it never idles. While one deliverable is in review, the next is already in motion. While waiting on a response, another task is being processed. This isn't frantic multitasking; it is deliberate pipelining.
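The toy simulation; the request lengths and batch capacity are made-up values:

```python
# Toy simulation: static vs. continuous batching.
# Each number is the decode steps one request needs; 2 batch slots.
requests = [5, 1, 1, 1, 1, 1]

def static_batching(reqs, capacity=2):
    # Admit a full batch, run it to completion, then admit the next.
    steps = 0
    for i in range(0, len(reqs), capacity):
        steps += max(reqs[i:i + capacity])  # batch waits on its slowest member
    return steps

def continuous_batching(reqs, capacity=2):
    # Refill a slot the moment any request in it finishes.
    slots, pending, steps = [], list(reqs), 0
    while slots or pending:
        while pending and len(slots) < capacity:
            slots.append(pending.pop(0))
        slots = [r - 1 for r in slots if r > 1]  # one decode step per slot
        steps += 1
    return steps

print(static_batching(requests))      # 7 steps
print(continuous_batching(requests))  # 5 steps
```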

3. KV Cache Reuse vs. Your Body of Prior Work

  • In ML: In enterprise applications, requests constantly repeat the same system prompt. Hugging Face recomputes those attention states from scratch every single time. vLLM uses prefix caching to compute those states once, store them, and allow subsequent requests to retrieve the cache instantly instead of recomputing. Latency drops off a cliff (sketched after this list).

  • In You: Every problem you have solved and documented, every framework you've built, and every decision log, template, or post-mortem you’ve written down is your personal KV cache. You don't start from scratch on a new task; you retrieve, adapt, and ship. Professionals who never build this cache spend their entire careers recomputing things they solved years ago.
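In vLLM this is close to a one-line switch. The flag below matches recent vLLM releases, but treat it as an assumption and check the documentation for your version; the system prompt and company name are hypothetical:

```python
# Minimal sketch of prefix caching in vLLM ("gpt2" is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2", enable_prefix_caching=True)

system = "You are a support agent for ExampleCo. Policies: ..."  # shared prefix
prompts = [
    system + "\nUser: Where is my order?",
    system + "\nUser: How do I reset my password?",
]
# Attention states for the shared prefix are computed once and reused
# by later requests instead of being recomputed from scratch.
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```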


4. High Throughput vs. Output Matching Capability

  • In ML: The combined effect of these innovations means more requests handled per second and a lower time-to-first-token on the exact same hardware. The model didn’t get smarter; the infrastructure got optimized.

  • In You: This translates to more output per unit of energy. Not by working longer hours or magically knowing more, but by removing the friction between what you are capable of and what you actually deliver.

The Point: Build the Model.

Then Build the Inference Engine.

Your education trained you. Your discipline fine-tuned you. Your first job deployed you into production. Those phases matter deeply, and the years you put into them are real. But they only produced a capable model, not an optimized serving layer.

Production is where the game is actually played: concurrent demands, zero answer keys, immediate deadlines, and real stakes. This is where training quality stops mattering, and serving infrastructure takes over.

  • The person who implements PagedAttention (focused, uncluttered cognitive management) processes each request more clearly.

  • The person who practices Continuous Batching (keeping their pipeline moving steadily) delivers consistently.

  • The person who builds a KV Cache (documenting and templating solutions) never wastes time recomputing the past.

Hugging Face gets you running; vLLM gets you scaling. Your degree got you deployed, but how you serve is how you are remembered.

The question was never whether you were trained well enough. The question is whether your infrastructure is ready for production.

Thanks
Sreeni Ramadorai
