Why vLLM, TensorRT-LLM, and llama.cpp each solve only part of the problem — and how I built inferx to fill the gap. Runs on any laptop, no GPU needed.
I spent the last few months building inferx — an open-source LLM inference optimization library that runs on any machine, including a laptop with no GPU. Along the way I learned more about how LLMs actually work at the systems level than in any course or paper I had read before.
This is that story: what the problem is, how I solved it, what the code looks like, and how you can run it in 60 seconds.
mkkotcherla / inferx
Description: Modular LLM inference library — KV cache, quantization, batching, speculative decoding
inferx ⚡
Unified LLM inference optimization library — modular, composable, production-grade.
inferx packages the hard parts of LLM serving into one clean library:
- Continuous batching scheduler — iteration-level, FCFS / priority / deadline
- Paged KV cache — PagedAttention with prefix caching and sliding window
- Quantization — AWQ, GPTQ, INT8, FP8, per-layer mixed precision
- Tensor & pipeline parallelism — multi-GPU sharding via NCCL
- Speculative decoding — draft model + Medusa heads
- OpenAI-compatible server — /v1/chat/completions drop-in endpoint
- Prometheus + OpenTelemetry — TTFT, TBT, cost-per-request tracking
- CPU-only mode — runs on any laptop, zero GPU for development
Why inferx?
| Feature | vLLM | TRT-LLM | llama.cpp | inferx |
|---|---|---|---|---|
| Paged KV cache | ✅ | ✅ | ❌ | ✅ |
| Prefix caching | ✅ | ❌ | ❌ | ✅ |
| All quant formats | | | GGUF | ✅ |
| Speculative decoding | ✅ | ✅ | ❌ | ✅ |
| CPU / Metal backend | ❌ | ❌ | ✅ | ✅ |
| Modular API | ❌ | ❌ | ❌ | ✅ |
| Built-in cost tracking | ❌ | ❌ | ❌ | ✅ |
| Open |
The problem nobody talks about
When most people want to serve an LLM, they do something like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

user_prompts = ["What is KV caching?"]  # placeholder prompts

# Requests are handled strictly one at a time
for prompt in user_prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(output[0]))
```
This works beautifully for one user. But the moment you have ten users sending requests at the same time, it quietly falls apart:
- Each request waits for the previous one to finish
- GPU memory is reserved at worst-case sequence length even if the user sends 20 tokens
- 60–90% of VRAM sits unused at any given moment
- No way to prioritize urgent requests over slow background ones
The result: a GPU costing $3/hr sitting at 15% utilization while users wait 8 seconds for a response.
This is the problem that vLLM, TensorRT-LLM, and llama.cpp exist to solve. But each solves only part of it, and none exposes a clean modular API where you can swap individual components.
That gap is where inferx comes from.
What existing tools miss:
The key gap is modularity. vLLM is a monolith: you cannot pull out just its KV cache manager and use it elsewhere. TensorRT-LLM runs only on NVIDIA GPUs and builds on the proprietary TensorRT runtime. llama.cpp has no batching scheduler.
inferx is built so every component is independently importable:
```python
from inferx.memory import PagedKVCacheManager
from inferx.scheduler import ContinuousBatchScheduler
from inferx.quantization import AWQQuantizer
from inferx.serving import OpenAIServer
```
Use only what you need. Swap any component for your own implementation.
The three ideas that make this work
- Paged KV cache — treating GPU memory like an OS
Every transformer layer produces Key and Value tensors for every processed token. The naive approach reserves a contiguous GPU memory block per request at maximum sequence length, wasting up to 92% of the allocation.
The solution, from the PagedAttention paper (SOSP 2023), borrows from OS virtual memory paging. KV memory is divided into fixed-size 16-token blocks. A sequence's KV data lives in non-contiguous blocks tracked by a block table, much as an OS maps virtual addresses to physical frames.
Here is how inferx implements this:
```python
import math


class PagedKVCacheManager:
    def allocate(self, sequence) -> int:
        """
        Allocate KV blocks for a new sequence.
        Returns number of tokens whose KV is already cached.
        """
        # Check prefix cache first
        cached_len, shared_blocks = self.prefix_cache.lookup(
            sequence.prompt_token_ids
        )
        # Share those blocks (ref-counted)
        for bid in shared_blocks:
            self.allocator.share(bid)
        # Allocate only what's needed for the uncached portion
        remaining = len(sequence.prompt_token_ids) - cached_len
        new_blocks = self.allocator.allocate_gpu(
            math.ceil(remaining / self.block_size)
        )
        sequence.block_table = shared_blocks + new_blocks
        return cached_len  # prefill can skip this many tokens
```
For 10 concurrent requests with 512-token max length: naive pre-allocation needs ~5 MB. PagedAttention with average 64-token generation uses 0.3 MB — 94% reduction.
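The `share` and `allocate_gpu` calls in the excerpt imply a ref-counted pool of fixed-size blocks. Here is a minimal sketch of that idea. The class name `ToyBlockAllocator`, its `free` method, and its internals are illustrative assumptions, not inferx's actual allocator.

```python
import math
from typing import Dict, List


class ToyBlockAllocator:
    """Hypothetical ref-counted pool of fixed-size KV blocks (illustration only)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self._free: List[int] = list(range(num_blocks))  # ids of unused blocks
        self._refs: Dict[int, int] = {}                  # block id -> reference count

    def allocate(self, num_tokens: int) -> List[int]:
        """Take just enough blocks to hold num_tokens tokens."""
        needed = math.ceil(num_tokens / self.block_size)
        if needed > len(self._free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self._free.pop() for _ in range(needed)]
        for bid in blocks:
            self._refs[bid] = 1
        return blocks

    def share(self, block_id: int) -> None:
        """Another sequence reuses this block, for example a shared prefix."""
        self._refs[block_id] += 1

    def free(self, blocks: List[int]) -> None:
        """Drop one reference per block; unreferenced blocks return to the pool."""
        for bid in blocks:
            self._refs[bid] -= 1
            if self._refs[bid] == 0:
                del self._refs[bid]
                self._free.append(bid)

    @property
    def num_free(self) -> int:
        return len(self._free)


allocator = ToyBlockAllocator(num_blocks=8)
seq_a = allocator.allocate(40)   # 40 tokens -> ceil(40 / 16) = 3 blocks
allocator.share(seq_a[0])        # a second sequence reuses the first block
allocator.free(seq_a)            # the first sequence finishes
print(allocator.num_free)        # 7: the shared block is still held
```

A block goes back to the pool only when its last user releases it, which is what makes sharing a cached prefix across requests safe.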
- Continuous batching — never let the GPU idle
Traditional serving waits for a full batch to finish before starting the next. If one request generates a 500-token essay while others finished at 40 tokens, the GPU mostly idles.
Continuous batching, from the Orca paper (OSDI 2022), operates at the iteration level. On every forward pass, the scheduler re-evaluates which sequences to include. New requests join mid-flight. Finished sequences immediately free their KV blocks.
```python
class ContinuousBatchScheduler:
    def schedule(self) -> Batch:
        # Ensure decode sequences have room for 1 more token
        self._ensure_decode_space()
        # Promote preempted sequences if memory recovered
        self._restore_preempted()
        # Admit new sequences from waiting queue
        self._admit_waiting()
        # Build batch: prefill + decode sequences together
        return Batch(
            prefill_seqs=[s for s in self._running
                          if s.status == PREFILLING],
            decode_seqs=[s for s in self._running
                         if s.status == DECODING],
        )
```
Three strategies supported: FCFS, priority (preempts lower-priority sequences), and deadline (hard SLA targets).
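One way to picture the three strategies is as different sort keys over the same waiting queue. The sketch below is an assumption about how such a queue could be ordered. The `Request` fields and the `pick_next` helper are hypothetical and not taken from inferx, and preemption of running sequences is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Request:
    """Hypothetical waiting-queue entry (field names are illustrative)."""
    name: str
    arrival: float                  # seconds since server start
    priority: int = 0               # larger means more urgent
    deadline: float = float("inf")  # absolute time the response is due


# Each policy is a sort key; the request with the smallest key is admitted first.
POLICIES: Dict[str, Callable[[Request], Tuple[float, float]]] = {
    "fcfs": lambda r: (r.arrival, 0.0),
    "priority": lambda r: (-r.priority, r.arrival),  # ties broken by arrival
    "deadline": lambda r: (r.deadline, r.arrival),   # earliest deadline first
}


def pick_next(waiting: List[Request], policy: str, slots: int) -> List[Request]:
    """Return up to `slots` requests to admit under the given policy."""
    return sorted(waiting, key=POLICIES[policy])[:slots]


waiting = [
    Request("batch-job", arrival=0.0, priority=0, deadline=60.0),
    Request("chat", arrival=1.0, priority=5, deadline=3.0),
    Request("report", arrival=2.0, priority=1, deadline=10.0),
]
for policy in POLICIES:
    print(policy, [r.name for r in pick_next(waiting, policy, slots=2)])
```

With two free slots, FCFS admits `batch-job` and `chat`, while both the priority and deadline orderings admit `chat` and `report` and leave the background job waiting.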
- Prefix caching — skip recomputing the same prompt
If 1,000 users share the same system prompt, naive serving computes that KV cache 1,000 times. Prefix caching computes it once and reuses it across all matching requests. inferx uses a hash-based LRU cache:
```python
from typing import List


class PrefixCache:
    def lookup(self, token_ids: List[int]):
        """Find the longest cached prefix of these token IDs."""
        num_full_blocks = len(token_ids) // self.block_size
        # Try the longest block-aligned prefix first, then shorter ones
        for n in range(num_full_blocks, 0, -1):
            prefix = token_ids[:n * self.block_size]
            h = self._hash_tokens(prefix)
            if h in self._cache:
                self._cache.move_to_end(h)  # LRU update
                return n * self.block_size, self._cache[h]
        return 0, []
```
On GPU with a 2,000-token shared system prompt: 3–5× speedup on TTFT for the second and subsequent requests.
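The lookup excerpt leaves out how entries get into the cache and how tokens are hashed. The following self-contained sketch fills in those parts under stated assumptions: the class name `ToyPrefixCache`, the `insert` method, the SHA-256 hashing, and the `capacity` parameter are all hypothetical choices for illustration and may differ from what inferx does.

```python
import hashlib
from collections import OrderedDict
from typing import List, Tuple


class ToyPrefixCache:
    """Hypothetical LRU cache mapping block-aligned token prefixes to KV block ids."""

    def __init__(self, block_size: int = 4, capacity: int = 8) -> None:
        self.block_size = block_size
        self.capacity = capacity  # maximum number of cached prefixes
        self._cache: "OrderedDict[str, List[int]]" = OrderedDict()

    @staticmethod
    def _hash_tokens(token_ids: List[int]) -> str:
        # A cryptographic hash of the exact token sequence keeps collisions negligible
        return hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        """Register every block-aligned prefix so partial matches can hit later."""
        for n in range(1, len(token_ids) // self.block_size + 1):
            h = self._hash_tokens(token_ids[: n * self.block_size])
            self._cache[h] = block_ids[:n]
            self._cache.move_to_end(h)       # newest entry is most recently used
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used prefix

    def lookup(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Longest-prefix search, same logic as the excerpt."""
        for n in range(len(token_ids) // self.block_size, 0, -1):
            h = self._hash_tokens(token_ids[: n * self.block_size])
            if h in self._cache:
                self._cache.move_to_end(h)
                return n * self.block_size, self._cache[h]
        return 0, []


cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(8))                        # 8 tokens fill two blocks
cache.insert(system_prompt, block_ids=[10, 11])

print(cache.lookup(system_prompt + [100, 101, 102]))  # (8, [10, 11])
print(cache.lookup([0, 1, 2, 3, 99, 98, 97, 96]))     # (4, [10])
print(cache.lookup([5, 5, 5, 5]))                     # (0, [])
```

Only whole blocks are cached, so a request that shares the first four tokens but diverges in the second block still reuses one block.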
Full architecture
inferx has six independently usable layers:
Try it right now — no GPU needed
inferx has a complete CPU-only mode. Run the full pipeline — scheduler, KV cache, batching, streaming, HTTP server — on any laptop.
```bash
git clone https://github.com/mkkotcherla/inferx.git
cd inferx
pip install -e .
python examples/quickstart.py
```
Output in ~3 seconds:
```
inferx quickstart
────────────────────────────────────────
Model: inferx-mock-cpu
KV blocks: 512
Prompt: 'The key innovation of PagedAttention is'
Output: '...'
Usage: {'prompt_tokens': 40, 'completion_tokens': 49, 'total_tokens': 89}
Done ✓
```
Ten concurrent requests:
```bash
python examples/batch_requests.py
```
```
Requests: 10
Total tokens: 250
Wall time: 1.38s
Throughput: 180.7 tok/s
Avg latency: 138ms/request
```
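These figures appear to be simple ratios. The sketch below assumes throughput is total tokens divided by wall time and that the average latency is wall time divided by request count. The helper name `summarize` is hypothetical, and the script itself may compute the values differently.

```python
def summarize(num_requests: int, total_tokens: int, wall_time_s: float) -> dict:
    """Derive aggregate metrics from one batch run (assumed definitions)."""
    return {
        "throughput_tok_s": total_tokens / wall_time_s,
        "avg_latency_ms": wall_time_s / num_requests * 1000,
    }


stats = summarize(num_requests=10, total_tokens=250, wall_time_s=1.38)
print(f"{stats['throughput_tok_s']:.1f} tok/s")     # 181.2 tok/s
print(f"{stats['avg_latency_ms']:.0f} ms/request")  # 138 ms/request
```

The small gap between 181.2 and the reported 180.7 tok/s is what you would expect if the script divides by an unrounded wall time slightly above 1.38 s.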
OpenAI-compatible server:
```bash
pip install fastapi "uvicorn[standard]"
python inferx/cli.py serve --mock --port 8000
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="inferx-mock-cpu",
    messages=[{"role": "user", "content": "What is KV caching?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```
Benchmark results — GPT-2 124M on CPU
Real GPT-2 124M architecture (correct OpenAI layers, random weights), CPU only, 1 thread:
The throughput numbers are similar on a 1-thread CPU, which is the expected and honest result: without GPU parallelism, batching gains do not offset scheduling overhead. On GPU with 50+ concurrent requests, continuous batching gives 8–12× throughput improvement because GPU tensor cores scale with batch size.
The 94% memory savings are real and hardware-independent.
The one thing I would do differently
I would have added run_concurrent() from day one. The natural instinct is asyncio.gather() — but that makes each task drive its own decode loop and they fight over scheduler state.
The correct pattern: a single shared decode loop with all requests in one queue:
```python
async def run_concurrent(engine, prompts_and_params):
    """All requests share one decode loop (the correct behavior)."""
    seqs = []
    for prompt, params in prompts_and_params:
        seq = Sequence(prompt=prompt, ...)  # remaining arguments elided
        engine._scheduler.add_request(seq)
        seqs.append(seq)
    # One loop drives everything
    while not all(s.is_finished for s in seqs):
        await engine._step()
    return [(engine._detokenize(s.output_token_ids), usage(s))
            for s in seqs]
```
This is how vLLM works internally. Getting it right made throughput jump significantly.
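To see the pattern in isolation, here is a runnable toy version. `ToyEngine` and its methods are hypothetical stand-ins written for this illustration; they do not use inferx's engine, scheduler, or `Sequence` classes.

```python
import asyncio
from typing import Dict, List, Tuple


class ToyEngine:
    """Hypothetical engine: each step emits one token for every running request."""

    def __init__(self) -> None:
        self.remaining: Dict[str, int] = {}      # request id -> tokens still to generate
        self.outputs: Dict[str, List[int]] = {}  # request id -> generated token ids
        self.steps = 0

    def add_request(self, request_id: str, max_tokens: int) -> None:
        self.remaining[request_id] = max_tokens
        self.outputs[request_id] = []

    async def step(self) -> None:
        """One batched forward pass over all running requests."""
        self.steps += 1
        for request_id in list(self.remaining):
            self.outputs[request_id].append(self.steps)  # stand-in for a sampled token
            self.remaining[request_id] -= 1
            if self.remaining[request_id] == 0:
                del self.remaining[request_id]  # finished requests leave the batch
        await asyncio.sleep(0)  # yield control to the event loop


async def run_concurrent(
    engine: ToyEngine, requests: List[Tuple[str, int]]
) -> Dict[str, List[int]]:
    for request_id, max_tokens in requests:
        engine.add_request(request_id, max_tokens)
    while engine.remaining:  # a single loop drives every request
        await engine.step()
    return {request_id: engine.outputs[request_id] for request_id, _ in requests}


engine = ToyEngine()
results = asyncio.run(run_concurrent(engine, [("a", 2), ("b", 5), ("c", 3)]))
print(engine.steps)                              # 5
print({k: len(v) for k, v in results.items()})   # {'a': 2, 'b': 5, 'c': 3}
```

Three requests needing 2, 5, and 3 tokens finish in 5 steps, the length of the longest one, because every step advances all running requests together.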
Try it and contribute
```bash
git clone https://github.com/mkkotcherla/inferx.git
cd inferx && pip install -e .
python examples/quickstart.py         # works on any laptop
python examples/benchmark.py          # throughput + latency numbers
python inferx/cli.py serve --mock     # OpenAI server, no GPU needed
```
11 runnable examples, 22 unit tests, Apache 2.0.
If you are working on LLM inference, building on top of this, or have questions — open an issue or Discussion. PRs are very welcome, especially for the GPU kernel layer.
Built with Python, PyTorch, and a lot of reading of the vLLM, Orca, and PagedAttention papers.


