Moving DeepSeek-R1 from Transformers to vLLM: A 14x Throughput Boost

#python #programming

At 2 AM, I was jolted awake by a call from operations: "Why did the billing system charge the user twice?" I stumbled to my laptop and found the root cause — our Model-as-a-Service API started queuing requests beyond a concurrency of 5, and the fragile "retry deduplication" logic I'd bolted on collapsed under high load, resulting in double charges. That was the reality of our homegrown inference service built with HuggingFace Transformers + FastAPI half a year ago. The architecture at the time felt like a water pipe held together with tape, ready to burst at any moment. It wasn't until we fully migrated to vLLM + Kong that we removed the three mountains of concurrency, billing, and multi-tenancy all at once. This article is a battle-tested record drawn from blood and tears — pure, actionable know-how you can copy directly.

Problem Breakdown: Why the Original Approach Couldn't Survive a Traffic Spike

Our use case was straightforward: provide a text-generation API for DeepSeek-R1, charge by token, and support multiple customers (tenants) each with their own API key and quota. Initially, with a small team, I loaded the model with Transformers, wrapped it in FastAPI, and hand-rolled key verification and token counting logic into MySQL.

The cracks appeared quickly. The root cause was that native Transformers inference is shamelessly wasteful: every request, regardless of sequence length, grabs the entire GPU memory for a full forward pass with no continuous batching. One request isn't finished, and the rest queue up. Even with dynamic batching, padding waste kept actual GPU utilization under 30%. The result: as concurrency grew, latency spiked to tens of seconds, clients timed out and retried, hammering our brittle "idempotency" logic and ultimately leading to duplicate billing.

Additionally, the hand-written tenant management and rate-limiting logic was scattered across business code. Changing a quota meant a full redeployment, and the gateway layer had zero defense. I once tried to add a semaphore limiter inside FastAPI, which only jammed requests at our doorstep while resources were already hogged — even the health check went down. It felt like locking myself out of my own house.

Solution Design: vLLM as the Inference Engine, Kong as the Billing Gateway

After the postmortem, we adopted two iron rules: the inference layer must implement continuous batching so the GPU runs like an assembly line without gaps; the gateway layer must offload cross-cutting concerns—billing, authentication, rate-limiting—so business code focuses solely on inference.

For the inference engine, the candidates were NVIDIA Triton, Text Generation Inference (TGI), and vLLM. Triton was too "heavy" — for a team desperate to patch a sinking ship, the learning curve around model configuration and model repositories was too steep. TGI was good, but back then its support for the DeepSeek family wasn't mature enough, and being tied to HuggingFace left less room for customization. vLLM was booming for good reason — its PagedAttention memory-sharing mechanism let multiple requests' KV caches be dynamically stitched together in GPU memory with near-zero waste. It natively supports the OpenAI API format, making migration costs virtually zero. So we chose it.

For the gateway, Kong was a component we’d always wanted but never had time to adopt. Why not build it yourself? Because "billing, auth, rate-limiting" may sound simple, but doing them at production grade requires plugin hot-reload, multi-dimensional limiting, highly available storage, daily tenant reports... Building that yourself is like developing half a gateway from scratch. Kong's three plugins — Key Authentication, Rate Limiting, and HTTP Log — connected in series can construct a complete multi-tenant billing system: Key Auth isolates tenants, Rate Limiting prevents abuse, and HTTP Log asynchronously pushes token consumption from each request to Kafka/ClickHouse, where the billing system computes charges offline. Once the architecture was clear, I could finally sleep at night.

Core Implementation: From a Single Command to a Full Multi-Tenant Gateway

Below is runnable code and configuration. I’ve split it into two parts: vLLM deployment and Kong configuration. This first part starts the inference service with a single Docker command, exposing an OpenAI-compatible endpoint.

# 要预先下载好 DeepSeek-R1 模型，放在 /data/model/deepseek-r1
docker run -d --gpus all \
  --name vllm-deepseek \
  -p 8000:8000 \
  -v /data/model:/models \
  vllm/vllm-openai:latest \
  --model /models/deepseek-r1 \
  --tensor-parallel-size 2 \    # 双卡，用张量并行
  --max-model-len 8192 \
  --enable-prefix-caching \     # 开启前缀缓存，相同 system prompt 能秒出
  --gpu-memory-utilization 0.92

Once the service is up, you can simply curl http://localhost:8000/v1/chat/completions and call DeepSeek-R1 just like OpenAI. I've battle-tested the stability and compatibility of this interface countless times — it works perfectly as a Kong upstream.

The next part, the Kong configuration, solves multi-tenant authentication, rate-limiting, and token-consumption forwarding. I use Kong's decK declarative format. Copy and paste it into Kong, and it takes effect immediately.

# kong-config.yaml
_format_version: "3.0"
services:
  - name: deepseek-r1
    url: http://vllm-deepseek:8000/v1   # 指向 vLLM 容器
    routes:
      - name: deepseek-chat
        paths:
          - /chat
        strip_path: false               # 保留 /chat 后缀，透传给 vLLM
    plugins:
      - name: key-auth                  # 启用 API Key 认证
        config:
          key_names: ["apikey"]         # 从 header 或 query 取 key
      - name: rate-limiting
        config:
          minute: 100                   # 每个租户每分钟最多100请求
          policy: local                 # 单节点限流，多节点用 redis
      - name: http-log                 # 日志推送到计费系统
        config:
          http_endpoint: http://billing-collector:3000/log
          method: POST
          timeout: 2000
          keepalive: 60000
# 消费者的 API Key 定义
consumers:
  - username: tenant_a
    keyauth_credentials:
      - key: sk-tenantA-xxxxx
  - username: tenant_b

With this setup, the moment a request hits Kong, it’s authenticated and counted; vLLM continuously processes the rough stream of inference without ever touching billing logic. We went from 5 concurrent requests causing chaos to handling over 200 concurrent requests smoothly, with throughput skyrocketing 14x. That middle-of-the-night phone call has never rung again for this reason.