Text Generation Inference (TGI) is Hugging Face's production-grade inference server for LLMs. It powers the Hugging Face Inference API and is used by companies like IBM, Intel, and Deutsche Telekom.
Free, open source, optimized for throughput. Run any Hugging Face model with a single Docker command.
Why Use TGI?
- Blazing fast — continuous batching, FlashAttention, tensor parallelism
- OpenAI-compatible — a drop-in replacement for the OpenAI API
- Any HF model — Llama, Mistral, Falcon, StarCoder, and 100K+ models
- Production features — token streaming, quantization, multi-GPU support
- Structured output — JSON schema enforcement via grammar
Quick Setup
1. Run with Docker
# Run Mistral 7B
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3
# Run Llama 3.1 8B (needs ~16GB VRAM)
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
2. Generate Text
# TGI native endpoint
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What are the best practices for web scraping?",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7}
  }' | jq '.generated_text'
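The same request from Python using only the standard library — a minimal sketch that assumes the container above is listening on localhost:8080 (`build_payload` and `generate` are helper names for illustration, not part of TGI):

```python
import json
import urllib.request

TGI_URL = "http://localhost:8080"  # the server started above

def build_payload(prompt, max_new_tokens=200, temperature=0.7):
    """Build the JSON body for TGI's native /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    }

def generate(prompt, base_url=TGI_URL, **params):
    """POST the payload to /generate and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_payload(prompt, **params)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# With the container running:
# print(generate("What are the best practices for web scraping?"))
```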
# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Explain Docker networking in 3 sentences"}],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
3. Streaming
curl -s -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write a Python web scraper for",
    "parameters": {"max_new_tokens": 300}
  }'
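The stream arrives as Server-Sent Events, one `data:` line per token. A minimal parser sketch — the exact event shape below is an assumption modeled on TGI's token events, so check a real response before relying on it:

```python
import json

def parse_sse_line(line):
    """Extract the token text from one TGI SSE line, or None for non-data lines."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):].strip())
    return event.get("token", {}).get("text")

# Sample lines in the (assumed) shape /generate_stream emits:
sample = [
    'data:{"token":{"id":1,"text":"Hello"},"generated_text":null}',
    'data:{"token":{"id":2,"text":" world"},"generated_text":"Hello world"}',
]
text = "".join(t for t in (parse_sse_line(l) for l in sample) if t)
print(text)  # Hello world
```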
4. Structured Output (JSON)
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Extract info: John works at Google as a senior engineer",
    "parameters": {
      "max_new_tokens": 100,
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "company": {"type": "string"},
            "role": {"type": "string"}
          },
          "required": ["name", "company", "role"]
        }
      }
    }
  }'
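Grammar enforcement constrains the model to the schema, but it's still worth validating the parsed result client-side before using it. A small sketch (`validate_extraction` is a hypothetical helper checking the schema's required keys):

```python
import json

REQUIRED = ("name", "company", "role")  # mirrors the schema's "required" list

def validate_extraction(generated_text):
    """Parse the model's JSON output and check the schema's required keys."""
    data = json.loads(generated_text)
    missing = [k for k in REQUIRED if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

result = validate_extraction(
    '{"name": "John", "company": "Google", "role": "senior engineer"}'
)
print(result["company"])  # Google
```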
Python Example
from openai import OpenAI

# Works with OpenAI SDK!
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a web scraping expert."},
        {"role": "user", "content": "How do I handle rate limiting when scraping?"}
    ],
    temperature=0.5,
    max_tokens=300
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "List 5 Python scraping libraries"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Key Endpoints
| Endpoint | Description |
|---|---|
| /generate | Single generation |
| /generate_stream | Streaming generation |
| /v1/chat/completions | OpenAI-compatible chat |
| /v1/completions | OpenAI-compatible completion |
| /v1/models | List loaded models |
| /info | Model info and parameters |
| /health | Health check |
| /metrics | Prometheus metrics |
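The `/health` endpoint is handy for gating traffic in deployment scripts. A sketch using only the standard library (`is_healthy` is a hypothetical helper; it assumes `/health` answers 200 once the model is loaded):

```python
import urllib.error
import urllib.request

def is_healthy(base_url, timeout=2.0):
    """Return True if TGI's /health endpoint answers 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(is_healthy("http://localhost:8080"))
```

Poll this in a loop after `docker run` — large models can take minutes to load before the server starts answering.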
Performance Tips
- Use `--quantize bitsandbytes-nf4` for 4-bit quantization (saves VRAM)
- Use `--num-shard 2` for multi-GPU inference
- Use `--max-concurrent-requests 128` for high throughput
- Enable speculative decoding with `--speculate 3` for faster generation
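Putting a few of these together — an illustrative launch command, not a tuned configuration (flag values here are examples; check your GPU count and VRAM before copying):

```shell
# Example: quantized, sharded, high-concurrency launch (values are illustrative)
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --quantize bitsandbytes-nf4 \
  --num-shard 2 \
  --max-concurrent-requests 128
```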
Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors