
Alex Spinov

Hugging Face TGI Has a Free API — Production-Grade LLM Inference Server

Text Generation Inference (TGI) is Hugging Face's production-grade inference server for LLMs. It powers the Hugging Face Inference API and is used by companies like IBM, Intel, and Deutsche Telekom.

Free, open source, optimized for throughput. Run any Hugging Face model with a single Docker command.

Why Use TGI?

  • Blazing fast — continuous batching, FlashAttention, tensor parallelism
  • OpenAI-compatible — drop-in replacement for the OpenAI API
  • Any HF model — Llama, Mistral, Falcon, StarCoder, and 100K+ models
  • Production features — token streaming, quantization, multi-GPU support
  • Structured output — JSON schema enforcement via grammar

Quick Setup

1. Run with Docker

# Run Mistral 7B
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3

# Run Llama 3.1 8B (needs ~16GB VRAM)
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
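Model loading can take a minute or two, so it helps to wait for the server before sending requests. A minimal sketch using only the standard library, assuming the default port mapping from the Docker commands above (the helper names `health_url` and `wait_until_ready` are my own, not part of TGI):

```python
import time
import urllib.error
import urllib.request

def health_url(base_url: str) -> str:
    """Build the /health URL from a base server URL (hypothetical helper)."""
    return base_url.rstrip("/") + "/health"

def wait_until_ready(base_url: str, timeout: float = 300.0) -> bool:
    """Poll TGI's /health endpoint; return True once it answers 200,
    or False if the timeout expires before the model finishes loading."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(health_url(base_url), timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("ready:", wait_until_ready("http://localhost:8080"))
```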

2. Generate Text

# TGI native endpoint
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What are the best practices for web scraping?",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7}
  }' | jq '.generated_text'

# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Explain Docker networking in 3 sentences"}],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
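The same `/generate` call works from Python with just the standard library. This sketch mirrors the curl body above; `build_generate_payload` and `generate` are illustrative helper names, and the network call is kept under `__main__` so it only runs against a live server:

```python
import json
import urllib.request

def build_generate_payload(prompt, max_new_tokens=200, temperature=0.7):
    """Construct the request body for TGI's native /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

def generate(base_url, prompt, **params):
    """POST to /generate and return the generated_text field."""
    body = json.dumps(build_generate_payload(prompt, **params)).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["generated_text"]

if __name__ == "__main__":
    print(generate("http://localhost:8080",
                   "What are the best practices for web scraping?"))
```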

3. Streaming

curl -s -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write a Python web scraper for",
    "parameters": {"max_new_tokens": 300}
  }'
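`/generate_stream` responds with server-sent events: each `data:` line carries a JSON payload whose `token.text` field holds the next generated token. A small parser sketch for those lines (the function name is my own; the event shape assumes TGI's documented streaming format):

```python
import json

def parse_sse_token(line: str):
    """Extract the token text from one SSE data line of /generate_stream.
    Returns None for non-data lines (comments, blank keep-alives)."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):].strip())
    # Each stream event carries a "token" object; the final event also
    # includes the full "generated_text".
    return event.get("token", {}).get("text")
```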

4. Structured Output (JSON)

curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Extract info: John works at Google as a senior engineer",
    "parameters": {
      "max_new_tokens": 100,
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "company": {"type": "string"},
            "role": {"type": "string"}
          },
          "required": ["name", "company", "role"]
        }
      }
    }
  }'
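From Python, the grammar parameter is just a plain JSON schema nested under `{"type": "json", "value": ...}`. This sketch builds the same request body as the curl example; `PERSON_SCHEMA` and `build_grammar_payload` are illustrative names of my own:

```python
# The same schema the curl example enforces.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "company": {"type": "string"},
        "role": {"type": "string"},
    },
    "required": ["name", "company", "role"],
}

def build_grammar_payload(prompt, schema, max_new_tokens=100):
    """Wrap a JSON schema in TGI's grammar parameter so the server
    constrains generation to valid instances of the schema."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "grammar": {"type": "json", "value": schema},
        },
    }
```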

Python Example

from openai import OpenAI

# Works with OpenAI SDK!
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a web scraping expert."},
        {"role": "user", "content": "How do I handle rate limiting when scraping?"}
    ],
    temperature=0.5,
    max_tokens=300
)

print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "List 5 Python scraping libraries"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Key Endpoints

Endpoint                Description
--------                -----------
/generate               Single generation
/generate_stream        Streaming generation
/v1/chat/completions    OpenAI-compatible chat
/v1/completions         OpenAI-compatible completion
/v1/models              List loaded models
/info                   Model info and parameters
/health                 Health check
/metrics                Prometheus metrics

Performance Tips

  • Use --quantize bitsandbytes-nf4 for 4-bit quantization (saves VRAM)
  • Use --num-shard 2 for multi-GPU inference
  • Use --max-concurrent-requests 128 for high throughput
  • Use --speculate 2 to enable speculative decoding for faster generation

Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors
