SGLang is a high-performance serving framework for large language models and multimodal models, built to deliver low-latency and high-throughput inference across everything from a single GPU to distributed clusters.
For a broader comparison of self-hosted and cloud LLM hosting options — including Ollama, vLLM, llama-swap, LocalAI, and managed cloud providers — see the LLM hosting guide for 2026.
If you already have apps wired to the OpenAI API shape, SGLang is especially appealing because it can expose OpenAI-compatible endpoints for chat completions and completions, helping you migrate from hosted APIs to self-hosted models with minimal client-side changes. When you need to route requests across multiple backends (llama.cpp, vLLM, SGLang, etc.) with hot-swap and TTL-based unloading, llama-swap provides a transparent proxy layer that keeps a single /v1 URL stable while swapping upstreams on demand.
This QuickStart walks through installation (multiple methods), practical configuration patterns, and a clean "install → serve → verify → integrate → tune" workflow, with working examples for both HTTP serving and offline batch inference.
If you need multimodal support (text, embeddings, images, audio) with a built-in Web UI and maximum OpenAI API drop-in compatibility, LocalAI offers a broader feature set with more model format support.
What is SGLang for high-throughput LLM and multimodal model serving
At its core, SGLang is designed for efficient inference and scalable serving. The “fast runtime” stack includes RadixAttention for prefix caching, a zero-overhead CPU scheduler, speculative decoding, continuous batching, paged attention, multiple parallelism strategies (tensor, pipeline, expert, data parallelism), structured outputs, chunked prefill, and multiple quantisation options (for example FP4, FP8, INT4, AWQ, GPTQ).
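The intuition behind RadixAttention's prefix caching is that requests sharing a prompt prefix (a common system prompt, few-shot examples) can reuse KV-cache state already computed for that prefix, so only the unmatched suffix needs prefill. The toy trie below is a conceptual sketch of that lookup, not SGLang's actual implementation:

```python
# Toy illustration of prefix caching (NOT SGLang's implementation):
# requests sharing a prompt prefix reuse previously computed KV-cache
# entries, so only the unmatched suffix needs prefill.

class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert(["You", "are", "a", "helpful", "assistant", "."])
# A new request sharing the system-prompt prefix only needs prefill
# for the tokens after the match point.
hit = cache.longest_prefix(["You", "are", "a", "helpful", "bot"])
print(hit)  # 4 tokens' worth of KV cache reused
```

In the real runtime the trie nodes map to KV-cache blocks on the GPU and eviction is LRU-style, but the longest-shared-prefix lookup is the core idea.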
It targets broad-platform deployment: NVIDIA GPUs, AMD GPUs, Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
The PyPI package requires Python >= 3.10. As of 20 March 2026, the latest published release was 0.5.9 (released 23 February 2026); pin or check current versions when installing.
How to install SGLang on Linux GPU hosts with uv pip, source builds, or Docker
Install options include uv or pip, source builds, Docker images, Kubernetes manifests, Docker Compose, SkyPilot, and AWS SageMaker. Most walkthroughs assume common NVIDIA GPU setups; other accelerators have their own setup notes elsewhere.
Install SGLang quickly with uv or pip on Python 3.10+
For a straightforward local install, uv is usually the fastest path:
pip install --upgrade pip
pip install uv
uv pip install sglang
CUDA 13 notes
For CUDA 13, Docker avoids host-side PyTorch/CUDA mismatches. Without Docker: install a CUDA 13 PyTorch build, then sglang, then the matching sglang-kernel wheel from the published wheel releases (version must match the stack).
# 1) Install PyTorch with CUDA 13 support (replace X.Y.Z as needed)
uv pip install torch==X.Y.Z torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
# 2) Install SGLang
uv pip install sglang
# 3) Install the matching CUDA 13 sglang-kernel wheel (replace X.Y.Z)
uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sglang_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_x86_64.whl"
Install and run SGLang with Docker Hub images
For containerised deployments—or to sidestep host CUDA/PyTorch pairing—use the published Docker Hub images. A typical docker run mounts the Hugging Face cache and passes HF_TOKEN when pulling gated models.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
For production-style images, latest-runtime drops build tools and dev dependencies, so the image stays much smaller than the default latest variant.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest-runtime \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
Install from source and other deployment methods
To develop against SGLang or carry local patches, clone a release branch and install the Python package in editable mode:
git clone -b v0.5.9 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"
For orchestration, the repo includes Kubernetes manifests (single- and multi-node) and a minimal Docker Compose layout—reasonable starting points before custom wiring.
How to configure SGLang server arguments with YAML config files and environment variables
SGLang configuration is driven by server arguments and environment variables. Flags cover model selection, parallelism, memory, and optimisation knobs; the full set is listed with python3 -m sglang.launch_server --help.
Environment variables use two prefixes: SGL_ and SGLANG_ (many flags accept either CLI or env form—launch_server --help shows the mapping).
Some commonly relevant env vars include host and port controls such as SGLANG_HOST_IP and SGLANG_PORT.
Use a YAML config file for reproducible SGLang server launches
For repeatable deployments and shorter command lines, pass a YAML file with --config. CLI arguments override values from the file when both set the same option.
# Create config.yaml
cat > config.yaml << 'EOF'
model-path: meta-llama/Meta-Llama-3-8B-Instruct
host: 0.0.0.0
port: 30000
tensor-parallel-size: 2
enable-metrics: true
log-requests: true
EOF
# Launch server with config file
python -m sglang.launch_server --config config.yaml
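The precedence rule is easy to reason about: the YAML file supplies the base values and any flag given on the command line wins on conflict. A minimal sketch of that merge (illustrative only, not SGLang's actual argument parser):

```python
# Illustrative merge showing "CLI overrides YAML" precedence
# (not SGLang's actual argument-parsing code).

def effective_config(yaml_values: dict, cli_values: dict) -> dict:
    merged = dict(yaml_values)   # start from the config file
    merged.update(cli_values)    # CLI flags win on conflict
    return merged

yaml_values = {"model-path": "meta-llama/Meta-Llama-3-8B-Instruct",
               "port": 30000, "tensor-parallel-size": 2}
cli_values = {"port": 30001}     # e.g. --port 30001 on the command line

print(effective_config(yaml_values, cli_values)["port"])  # 30001
```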
A few configuration and tuning essentials to keep in mind:
- SGLang's --model-path can point to a local folder or a Hugging Face repo ID, which makes it easy to switch between local weights and Hub-hosted models without changing your serving code.
- For multi-GPU, enable tensor parallelism with --tp. If startup fails with “peer access is not supported between these two devices”, add --enable-p2p-check.
- If serving hits OOM, reduce KV cache pressure with a smaller --mem-fraction-static (default is 0.9).
- If long prompts OOM during prefill, lower --chunked-prefill-size.
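A rough way to reason about --mem-fraction-static and prefill OOMs is to estimate KV-cache size per token. The numbers below assume Llama-3-8B-like dimensions (32 layers, 8 grouped-query KV heads, head dim 128, fp16); check your model's config.json for the real values:

```python
# Back-of-envelope KV-cache sizing. Dimensions below are assumed
# (Llama-3-8B-like); read them from your model's config.json.

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, dtype_bytes=2):
    # factor of 2 = one K and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# 32 layers, 8 KV heads (GQA), head dim 128, fp16 (2 bytes)
per_token = kv_cache_bytes(32, 8, 128, 1)
full_ctx = kv_cache_bytes(32, 8, 128, 8192)

print(per_token)          # 131072 bytes, i.e. 128 KiB per token
print(full_ctx / 2**30)   # 1.0 GiB for a single 8192-token sequence
```

Multiply by your expected concurrent sequences to see how quickly the KV cache eats into the --mem-fraction-static budget.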
How to run an OpenAI-compatible SGLang server and call it from the OpenAI Python client
A practical "happy path" workflow looks like this:
- Install SGLang (uv/pip or Docker).
- Start the server with your chosen model and port.
- Verify basic serving via OpenAI-compatible endpoints.
- Integrate your application by pointing the OpenAI SDK base_url at the local server.
- Tune throughput and memory with server args once you have real traffic.
Send a local chat completion request to SGLang using OpenAI SDK
For OpenAI-compatible usage, two details matter:
- The server implements the OpenAI HTTP surface and, when the tokenizer provides one, applies the Hugging Face chat template automatically. Override with --chat-template at launch if needed.
- Point an OpenAI client at the server’s /v1 prefix (base_url → http://<host>:<port>/v1), then call client.chat.completions.create(...) as usual.
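A chat template simply renders the messages list into the single prompt string the model was trained on. The sketch below uses a made-up ChatML-style format purely for illustration; the real template ships with the model's tokenizer and is applied server-side:

```python
# Illustration of what a chat template does (made-up ChatML-style
# format; the real template comes from the model's tokenizer).

def render_chat(messages):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "\n".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```

Because the server handles this rendering on /v1/chat/completions, your client only ever deals with the structured messages list.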
Start the server with either entrypoint: python -m sglang.launch_server still works, but sglang serve is the preferred CLI.
# Recommended CLI entrypoint
sglang serve --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
# Still supported
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
Then call it with the OpenAI Python client:
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="qwen/qwen2.5-0.5b-instruct",
messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
How to run batch inference with the SGLang Offline Engine API and native endpoints
SGLang supports multiple "API surfaces" depending on what you're building:
The /generate endpoint is the low-level runtime API. Prefer /v1/... OpenAI-compatible routes when you want chat templates and the usual client ecosystem handled for you.
Without any HTTP server, the Offline Engine runs inference in-process: suited to batch jobs and custom services. It supports sync/async and streaming/non-streaming combinations—pick the mode that matches the call pattern.
Example using the native /generate endpoint
Minimal pattern: run a server, then POST /generate with temperature and max_new_tokens (and any other sampling fields you need).
import requests
response = requests.post(
"http://localhost:30000/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 32,
},
},
)
print(response.json())
temperature = 0 is greedy sampling; higher values increase diversity.
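Temperature rescales the model's logits before sampling: as it approaches 0 the distribution collapses onto the argmax token (greedy), while higher values flatten it. A generic sketch of the standard temperature-scaled softmax (not SGLang internals):

```python
import math

# Standard temperature-scaled softmax over logits (generic sketch,
# not SGLang's sampler).

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))   # moderately peaked
print(softmax_with_temperature(logits, 0.1))   # near one-hot: ~greedy
```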
Example using the Offline Engine API for in-process batch inference
Typical flow: construct sgl.Engine(model_path=...), run llm.generate(...) over a batch of prompts, then llm.shutdown() to release GPU and other resources.
import sglang as sgl
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
prompts = [
"Write a concise self-introduction.",
"Explain what prefix caching is in one paragraph.",
]
sampling_params = {"temperature": 0.2, "top_p": 0.9}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print("PROMPT:", prompt)
print("OUTPUT:", output["text"])
print()
llm.shutdown()
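For very large prompt sets you may want to feed the engine in fixed-size chunks, for example to bound peak memory or checkpoint progress between batches. A small helper for that, independent of SGLang itself:

```python
# Generic batching helper for feeding a large prompt list to
# llm.generate(...) in fixed-size chunks (independent of SGLang).

def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

prompts = [f"prompt {i}" for i in range(10)]
batches = list(chunked(prompts, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch can then be passed to llm.generate(batch, sampling_params) in turn, with results written out as you go.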