Text Generation Inference (TGI) is Hugging Face's production-grade inference server for LLMs. It powers the Hugging Face Inference API and is used by companies like IBM, Intel, and Deutsche Telekom.
Free, open source, optimized for throughput. Run any Hugging Face model with a single Docker command.
Why Use TGI?
- Blazing fast — continuous batching, FlashAttention, tensor parallelism
- OpenAI-compatible — a drop-in replacement for the OpenAI API
- Any HF model — Llama, Mistral, Falcon, StarCoder, and 100K+ models
- Production features — token streaming, quantization, multi-GPU support
- Structured output — JSON schema enforcement via grammar
Quick Setup
1. Run with Docker
# Run Mistral 7B
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3
# Run Llama 3.1 8B (needs ~16GB VRAM)
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
2. Generate Text
# TGI native endpoint
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What are the best practices for web scraping?",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7}
  }' | jq '.generated_text'
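The same request from Python using only the standard library — a minimal sketch that assumes the container above is listening on localhost:8080 (`build_payload` and `generate` are helper names for illustration, not part of TGI):

```python
import json
import urllib.request

TGI_URL = "http://localhost:8080"  # the server started above

def build_payload(prompt, max_new_tokens=200, temperature=0.7):
    """Build the JSON body for TGI's native /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    }

def generate(prompt, base_url=TGI_URL, **params):
    """POST the payload to /generate and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_payload(prompt, **params)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# With the container running:
# print(generate("What are the best practices for web scraping?"))
```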
# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Explain Docker networking in 3 sentences"}],
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
3. Streaming
curl -s -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write a Python web scraper for",
    "parameters": {"max_new_tokens": 300}
  }'
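The stream arrives as Server-Sent Events, one `data:` line per token. A minimal parser sketch — the exact event shape below is an assumption modeled on TGI's token events, so check a real response before relying on it:

```python
import json

def parse_sse_line(line):
    """Extract the token text from one TGI SSE line, or None for non-data lines."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):].strip())
    return event.get("token", {}).get("text")

# Sample lines in the (assumed) shape /generate_stream emits:
sample = [
    'data:{"token":{"id":1,"text":"Hello"},"generated_text":null}',
    'data:{"token":{"id":2,"text":" world"},"generated_text":"Hello world"}',
]
text = "".join(t for t in (parse_sse_line(l) for l in sample) if t)
print(text)  # Hello world
```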
4. Structured Output (JSON)
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Extract info: John works at Google as a senior engineer",
    "parameters": {
      "max_new_tokens": 100,
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "company": {"type": "string"},
            "role": {"type": "string"}
          },
          "required": ["name", "company", "role"]
        }
      }
    }
  }'
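Grammar enforcement constrains the model to the schema, but it's still worth validating the parsed result client-side before using it. A small sketch (`validate_extraction` is a hypothetical helper checking the schema's required keys):

```python
import json

REQUIRED = ("name", "company", "role")  # mirrors the schema's "required" list

def validate_extraction(generated_text):
    """Parse the model's JSON output and check the schema's required keys."""
    data = json.loads(generated_text)
    missing = [k for k in REQUIRED if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

result = validate_extraction(
    '{"name": "John", "company": "Google", "role": "senior engineer"}'
)
print(result["company"])  # Google
```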
Python Example
from openai import OpenAI

# Works with OpenAI SDK!
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a web scraping expert."},
        {"role": "user", "content": "How do I handle rate limiting when scraping?"}
    ],
    temperature=0.5,
    max_tokens=300
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "List 5 Python scraping libraries"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Key Endpoints
| Endpoint | Description |
|---|---|
| /generate | Single generation |
| /generate_stream | Streaming generation |
| /v1/chat/completions | OpenAI-compatible chat |
| /v1/completions | OpenAI-compatible completion |
| /v1/models | List loaded models |
| /info | Model info and parameters |
| /health | Health check |
| /metrics | Prometheus metrics |
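The `/health` endpoint is handy for gating traffic in deployment scripts. A sketch using only the standard library (`is_healthy` is a hypothetical helper; it assumes `/health` answers 200 once the model is loaded):

```python
import urllib.error
import urllib.request

def is_healthy(base_url, timeout=2.0):
    """Return True if TGI's /health endpoint answers 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(is_healthy("http://localhost:8080"))
```

Poll this in a loop after `docker run` — large models can take minutes to load before the server starts answering.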
Performance Tips
- Use `--quantize bitsandbytes-nf4` for 4-bit quantization (saves VRAM)
- Use `--num-shard 2` for multi-GPU inference
- Use `--max-concurrent-requests 128` for high throughput
- Enable speculative decoding with `--speculate 3` for faster generation
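Putting a few of these together — an illustrative launch command, not a tuned configuration (flag values here are examples; check your GPU count and VRAM before copying):

```shell
# Example: quantized, sharded, high-concurrency launch (values are illustrative)
docker run --gpus all -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --quantize bitsandbytes-nf4 \
  --num-shard 2 \
  --max-concurrent-requests 128
```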
Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors