
丁久

Posted on • Originally published at dingjiu1989-hue.github.io

Model Deployment: vLLM, TGI, ONNX, Quantization, GPU Optimization

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.


Introduction

Deploying large language models for production inference requires specialized infrastructure. Unlike traditional ML models, LLMs demand gigabytes of GPU memory, specialized attention kernels, and careful batching strategies to achieve acceptable throughput. This article covers the major deployment frameworks and optimization techniques.
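
To put rough numbers on the memory claim, the sketch below estimates weight and KV-cache memory for an 8B-parameter model served in BF16. The layer count, KV-head count, and head dimension are assumptions chosen to resemble an 8B-class model with grouped-query attention; check them against the actual model config before relying on the result.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed model shape (hypothetical values for an 8B-class model).
params = 8e9
layers, kv_heads, head_dim = 32, 8, 128

weights = weight_memory_gib(params, 2)  # 2 bytes per parameter in BF16
per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim)
kv_for_8k = per_token * 8192 / 1024**3  # one sequence of 8192 tokens

print(f"weights: {weights:.1f} GiB")
print(f"KV cache per token: {per_token // 1024} KiB")
print(f"KV cache for one 8192-token sequence: {kv_for_8k:.2f} GiB")
```

The KV cache grows with every token of every concurrent sequence, which is why the frameworks below spend so much effort on cache management and batching.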

vLLM

vLLM is one of the most widely used open-source LLM serving frameworks. Its core feature is PagedAttention, which manages KV-cache memory efficiently:

# Using vLLM's OpenAI-compatible API server
# Start server:
# python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --tensor-parallel-size 2 \
#     --gpu-memory-utilization 0.95 \
#     --max-model-len 8192 \
#     --dtype bfloat16

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

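vLLM's PagedAttention manages the KV cache in fixed-size blocks, much like pages in virtual memory. This removes most fragmentation, so nearly all of the memory reserved for the KV cache holds live tokens. vLLM also supports continuous batching: the scheduler works at the granularity of a single decoding step, so a waiting request can join the running batch as soon as any sequence finishes, rather than waiting for the whole batch to complete.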
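
The block-based idea can be illustrated with a toy allocator. This is a minimal sketch of paged allocation in general, not vLLM's implementation; the class name `PagedKVAllocator` and its methods are hypothetical.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Reserved but unused token slots: at most block_size - 1 per sequence."""
        return sum(
            len(blocks) * self.block_size - self.lengths[seq_id]
            for seq_id, blocks in self.block_tables.items()
        )


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block

print(alloc.block_tables)
print("wasted slots:", alloc.wasted_slots())
alloc.free("a")
print("free blocks after finishing 'a':", len(alloc.free_blocks))
```

Because a sequence never reserves more than one partially filled block, waste is bounded by the block size rather than by the maximum sequence length, and freed blocks are immediately reusable by any other sequence.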

Performance Tuning

# Key vLLM performance flags
--max-num-seqs 256             # Max sequences scheduled concurrently
--max-num-batched-tokens 8192  # Max tokens processed per scheduler step
--enable-chunked-prefill       # Split long prompt prefills into smaller chunks
--enforce-eager                # Disable CUDA graphs (saves memory, may reduce speed)
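
To see how limits like `--max-num-seqs` and `--max-num-batched-tokens` interact with continuous batching, here is a toy step-level scheduler. It is a simplified sketch, not vLLM's scheduler: the `Request` class, the function name, and the cost model (a prefill costs `prompt_len` tokens, a decode step costs one token) are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0


def run_continuous_batching(requests, max_num_seqs, max_num_batched_tokens):
    """Return, for each step, the names of the requests that ran in that step."""
    waiting = deque(requests)
    running = []
    history = []
    while waiting or running:
        # Every running sequence decodes one token, costing one token of budget.
        budget = max_num_batched_tokens - len(running)
        # Admit waiting requests while both limits allow it.
        while (
            waiting
            and len(running) < max_num_seqs
            and waiting[0].prompt_len <= budget
        ):
            req = waiting.popleft()
            budget -= req.prompt_len
            running.append(req)
        if not running:
            raise ValueError("head-of-queue prompt exceeds the token budget")
        history.append([r.name for r in running])
        for r in running:
            r.generated += 1
        # Finished sequences leave at once, freeing slots for the next step.
        running = [r for r in running if r.generated < r.max_new_tokens]
    return history


requests = [
    Request("A", prompt_len=4, max_new_tokens=3),
    Request("B", prompt_len=4, max_new_tokens=1),
    Request("C", prompt_len=4, max_new_tokens=2),
    Request("D", prompt_len=4, max_new_tokens=1),
]
history = run_continuous_batching(requests, max_num_seqs=2, max_num_batched_tokens=16)
for step, names in enumerate(history, start=1):
    print(step, names)
```

In this run, C takes B's slot as soon as B finishes, while A is still generating. A static batcher with batch size 2 would hold that slot idle until A completed, needing five steps in total instead of four.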

Hugging Face TGI

Text Generation Inference (TGI) is Hugging Face's optimized serving solution:

# docker-compose.yml for TGI
version: "3.8"
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    environment:
      - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3
      - NUM_SHARD=2
      - MAX_INPUT_TOKENS=4096
      - MAX_TOTAL_TOKENS=8192
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8080:80"
    volumes:
      - ~/.cache/huggingface:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

# Python client: query the running TGI container's /generate endpoint
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain quantization in ML:",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
            "top_p": 0.95,
        },
    },
)
response.raise_for_status()  # Fail loudly on HTTP errors

print(response.json()["generated_text"])

TGI provides native support for tensor parallelism across GPUs, watermarking, and speculative decoding for faster generation.
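
Speculative decoding can be sketched in its simplest greedy form: a cheap draft model proposes several tokens, and the large model verifies them together. The code below is a generic draft-and-verify illustration, not TGI's implementation; `target_next` and `draft_next` are hypothetical stand-ins for real models.

```python
def target_next(tokens):
    """Stand-in for the large model's greedy next token (arbitrary toy rule)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Stand-in for a cheap draft model that agrees with the target most of the time."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(prompt, num_new):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, k=4):
    """Greedy draft-and-verify; returns the tokens and the number of verify passes."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every position; a real system would do this
        #    in a single batched forward pass, counted here as one pass.
        target_passes += 1
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # 3. On the first mismatch, keep the accepted prefix
                #    plus the target's own token.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            # All k drafts accepted: the target also yields one bonus token.
            tokens.extend(draft + [target_next(tokens + draft)])
    return tokens[:goal], target_passes


prompt = [1, 2, 3]
output, passes = speculative_decode(prompt, num_new=12)
print("matches plain greedy decoding:", output == greedy_decode(prompt, 12))
print("target passes:", passes, "for 12 new tokens")
```

The output is identical to plain greedy decoding because every kept token is one the target model would have chosen itself. The speedup comes from needing fewer target passes than generated tokens whenever the draft guesses well.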

ONNX Runtime

ONNX Runtime enables deployment across GPU and CPU with hardware-specific optimizations:

import onnxruntime as ort
from transformers import AutoTokenizer

# Load ONNX-optimized model; providers are tried in order, so CPU is the fallback
session = ort.InferenceSession(
    "model_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

tokenizer = AutoTokenizer.from_pretrained("model-name")

# Prepare inputs as NumPy arrays
inputs = tokenizer("Explain model quantization.", return_tensors="np")
# Some exported models expect additional inputs; check session.get_inputs()
onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# Run inference; passing None returns all model outputs
outputs = session.run(None, onnx_inputs)

ONNX models require an initial conversion step but benefit from aggressive graph optimizations and operator fusion.
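
Operator fusion can be illustrated on a tiny graph. The pass below merges a `MatMul` followed by an `Add` into a single `Gemm` node. It is a toy sketch over a hypothetical list-of-dicts graph format, not ONNX Runtime's optimizer, and it ignores details such as graph outputs and tensor shapes.

```python
def fuse_matmul_add(nodes):
    """Fuse each MatMul whose only consumer is the following Add into one Gemm.

    Each node is a dict: {"op": str, "inputs": [names], "output": name}.
    """
    fused = []
    i = 0
    while i < len(nodes):
        node = nodes[i]
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        # Number of nodes that consume this node's output.
        uses = sum(node["output"] in n["inputs"] for n in nodes)
        if (
            node["op"] == "MatMul"
            and nxt is not None
            and nxt["op"] == "Add"
            and nxt["inputs"].count(node["output"]) == 1
            and uses == 1
        ):
            bias = [name for name in nxt["inputs"] if name != node["output"]]
            fused.append(
                {"op": "Gemm", "inputs": node["inputs"] + bias, "output": nxt["output"]}
            )
            i += 2  # both original nodes are replaced by the fused one
        else:
            fused.append(node)
            i += 1
    return fused


graph = [
    {"op": "MatMul", "inputs": ["x", "W"], "output": "h"},
    {"op": "Add", "inputs": ["h", "b"], "output": "y"},
    {"op": "Relu", "inputs": ["y"], "output": "z"},
]
optimized = fuse_matmul_add(graph)
print([n["op"] for n in optimized])
```

Fusing the two nodes removes the intermediate tensor `h`, so a runtime can compute the product and the bias addition in one kernel instead of writing `h` to memory and reading it back.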

Quantization

Quantization stores weights (and sometimes activations) in lower-precision formats, which shrinks the model's memory footprint and often speeds up inference:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization with bitsandbytes
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
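
To make "lower-precision numbers" concrete, the sketch below applies symmetric absmax quantization to a handful of values and measures the round-trip error. It is a minimal illustration of the general idea in plain Python; it is not how bitsandbytes implements NF4, and the sample weights are made up.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -0.9]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each value now needs one byte plus a shared scale instead of four bytes in FP32, and the rounding error per value is bounded by half the scale. Real schemes refine this with per-block scales and non-uniform value grids.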

| Technique | Bit Width | Size Reduction | Speed Impact | Quality Loss |
|-----------|-----------|----------------|--------------|--------------|
| FP16/BF16 | 16-bit | 2x vs FP32 | 1.5-2x | None |
| INT8 | 8-bit | 4x vs FP32 | 2-3x | Minimal |
| INT4 (GPTQ) | 4-bit | 8x vs FP32


