DEV Community

Alex Spinov
text-generation-webui Has a Free API — Run Any AI Model with a Gradio Interface

text-generation-webui (by oobabooga) is one of the most popular open-source web UIs for running large language models locally. Think of it as Automatic1111, but for text generation. It supports GGUF, GPTQ, AWQ, EXL2, and HuggingFace Transformers models.

It's free, runs locally, and ships with a built-in OpenAI-compatible API and an extension system.

Why Use text-generation-webui?

  • Any model format — GGUF (llama.cpp), GPTQ, AWQ, EXL2, HF Transformers
  • Web UI — Gradio-based chat interface with character cards
  • OpenAI API — compatible endpoint for programmatic access
  • Extensions — TTS, image generation, long-term memory, multimodal
  • Full control — sampling parameters, context size, GPU layers, quantization

Quick Setup

1. Install

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

# One-click installer
./start_linux.sh  # or start_macos.sh / start_windows.bat

# Or manual
pip install -r requirements.txt
python server.py --api

2. Download a Model

In the UI: Model → Download → paste HuggingFace model ID

Or via CLI:

python download-model.py TheBloke/Mistral-7B-Instruct-v0.2-GGUF

3. Enable API

# Start with API enabled
python server.py --api --listen
# Native API runs on port 5000 by default
# OpenAI-compatible API on port 5001
# (newer releases serve the OpenAI-compatible API directly on port 5000
#  via --api; check your version's docs if these ports don't respond)
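The server takes a little while to load a model, so it helps to wait until the API actually accepts connections before sending requests. A minimal readiness check, using only the standard library (the helper name and the polling approach are mine, not part of the project; the URL assumes the OpenAI-compatible port shown above):

```python
import time
import urllib.error
import urllib.request

def wait_for_api(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until the server responds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True          # got a 2xx response: server is up
        except urllib.error.HTTPError:
            return True              # any HTTP response still means the server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)     # connection refused: not up yet, retry
    return False
```

Call `wait_for_api("http://localhost:5001/v1/models")` before your first request; it returns False if the server never comes up within the timeout.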

4. Use the API

# OpenAI-compatible chat
curl -s http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a web scraper in Python"}],
    "mode": "instruct",
    "max_tokens": 300
  }' | jq '.choices[0].message.content'

# Native API — more control
curl -s http://localhost:5000/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "### Instruction: Explain rate limiting in web scraping\n### Response:",
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.15
  }' | jq '.results[0].text'

# List models
curl -s http://localhost:5001/v1/models | jq '.data[].id'

# Load a model
curl -s -X POST http://localhost:5000/api/v1/model \
  -H "Content-Type: application/json" \
  -d '{"action": "load", "model_name": "Mistral-7B-Instruct-v0.2-GGUF", "args": {"n_gpu_layers": 35}}'
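Note that unlike the chat endpoint, the native `/api/v1/generate` endpoint takes a raw prompt, so you have to apply the model's instruction template yourself. A small sketch of building the same payload as the curl call above in Python (the `format_instruct` helper is hypothetical; the `### Instruction:`/`### Response:` format is Alpaca-style, and your model's card may specify a different template):

```python
def format_instruct(instruction: str) -> str:
    """Wrap a bare instruction in the Alpaca-style template from the
    curl example above. Instruct-tuned models are sensitive to this --
    check your model's card for its exact prompt format."""
    return f"### Instruction: {instruction}\n### Response:"

# Same payload as the curl call, ready to POST as JSON
payload = {
    "prompt": format_instruct("Explain rate limiting in web scraping"),
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.15,
}
```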

Python Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

# Chat
response = client.chat.completions.create(
    model="x",  # model name doesn't matter for single-model
    messages=[
        {"role": "system", "content": "You are a web scraping expert."},
        {"role": "user", "content": "How do I bypass Cloudflare protection ethically?"}
    ],
    temperature=0.5
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="x",
    messages=[{"role": "user", "content": "Best Python libraries for parsing HTML"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
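When you want live output and the complete text, you can accumulate the streamed deltas as they arrive. A small helper (my own sketch, independent of the client library; it only assumes the chunk shape shown in the streaming loop above):

```python
from typing import Iterable

def collect_stream(chunks: Iterable) -> str:
    """Print each streamed chat-completion delta as it arrives and
    return the full concatenated text at the end."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta is typically None
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)
```

Use it as `full_text = collect_stream(stream)` in place of the manual loop above.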

Key Endpoints

Endpoint               Port   Description
/v1/chat/completions   5001   OpenAI-compatible chat
/v1/completions        5001   OpenAI-compatible completion
/v1/models             5001   List models
/api/v1/generate       5000   Native generation
/api/v1/model          5000   Load/unload models
/api/v1/token-count    5000   Count tokens
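The two APIs live on different ports, which is easy to trip over in scripts. A tiny helper mapping the endpoint paths above to full URLs (the port numbers come from the table; adjust if you launched the server with different flags):

```python
# Which port serves each endpoint path (from the table above)
BASE_PORTS = {
    # OpenAI-compatible API
    "/v1/chat/completions": 5001,
    "/v1/completions": 5001,
    "/v1/models": 5001,
    # Native API
    "/api/v1/generate": 5000,
    "/api/v1/model": 5000,
    "/api/v1/token-count": 5000,
}

def endpoint_url(path: str, host: str = "localhost") -> str:
    """Return the full URL for a known endpoint path."""
    return f"http://{host}:{BASE_PORTS[path]}{path}"
```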

Supported Model Formats

Format   Backend            VRAM Needed
GGUF     llama.cpp          Low (CPU+GPU)
GPTQ     AutoGPTQ/ExLlama   Medium
AWQ      AutoAWQ            Medium
EXL2     ExLlamaV2          Medium
HF       Transformers       High
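The table can be turned into a rough rule of thumb for picking a format. A sketch of that heuristic (the VRAM thresholds are illustrative guesses of mine, not from the table; real requirements depend on model size, quantization level, and context length):

```python
def suggest_format(vram_gb: float, has_gpu: bool) -> str:
    """Rough heuristic mapping available VRAM to a model format.
    Thresholds are illustrative -- tune them for your hardware."""
    if not has_gpu or vram_gb < 6:
        return "GGUF"   # llama.cpp can split layers between CPU and GPU
    if vram_gb < 16:
        return "EXL2"   # fast GPU-only inference at medium VRAM
    return "HF"         # full Transformers weights if VRAM allows
```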

Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors
