text-generation-webui (by oobabooga) is the most popular open-source web UI for running large language models. Think of it as Automatic1111 but for text generation. It supports GGUF, GPTQ, AWQ, EXL2, and HuggingFace models.
Free, local, with a built-in OpenAI-compatible API and extensions system.
Why Use text-generation-webui?
- Any model format — GGUF (llama.cpp), GPTQ, AWQ, EXL2, HF Transformers
- Web UI — Gradio-based chat interface with character cards
- OpenAI API — compatible endpoint for programmatic access
- Extensions — TTS, image generation, long-term memory, multimodal
- Full control — sampling parameters, context size, GPU layers, quantization
Quick Setup
1. Install
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# One-click installer
./start_linux.sh # or start_macos.sh / start_windows.bat
# Or manual
pip install -r requirements.txt
python server.py --api
2. Download a Model
In the UI: Model → Download → paste HuggingFace model ID
Or via CLI:
python download-model.py TheBloke/Mistral-7B-Instruct-v0.2-GGUF
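Whichever route you use, the files land under the models/ directory, with the slash in the HuggingFace repo ID replaced by an underscore. The helper below is a hypothetical convenience (not part of text-generation-webui) that computes the expected local path; the naming convention is how download-model.py has behaved in recent versions, so verify against your install:

```python
from pathlib import Path

def local_model_dir(models_root: str, repo_id: str) -> Path:
    # download-model.py saves into models/<org>_<name>:
    # the "/" in the HuggingFace repo ID becomes "_"
    return Path(models_root) / repo_id.replace("/", "_")

print(local_model_dir("models", "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"))
# models/TheBloke_Mistral-7B-Instruct-v0.2-GGUF
```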
3. Enable API
# Start with API enabled
python server.py --api --listen
# Native API runs on port 5000 by default
# OpenAI-compatible API runs on port 5001
4. Use the API
# OpenAI-compatible chat
curl -s http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a web scraper in Python"}],
    "mode": "instruct",
    "max_tokens": 300
  }' | jq '.choices[0].message.content'
# Native API — more control
curl -s http://localhost:5000/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "### Instruction: Explain rate limiting in web scraping\n### Response:",
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.15
  }' | jq '.results[0].text'
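The same native call works from Python with only the standard library. This is a sketch against the endpoint, port, and response shape shown in the curl example above (results[0].text); the defaults mirror that example's sampling parameters:

```python
import json
import urllib.request

def build_generate_payload(prompt, **overrides):
    # Defaults mirror the curl example above; any sampling
    # parameter can be overridden per call
    payload = {
        "prompt": prompt,
        "max_new_tokens": 200,
        "temperature": 0.7,
        "top_p": 0.9,
        "repetition_penalty": 1.15,
    }
    payload.update(overrides)
    return payload

def generate(prompt, host="http://localhost:5000", **overrides):
    # POST to the native endpoint and return the generated text
    req = urllib.request.Request(
        f"{host}/api/v1/generate",
        data=json.dumps(build_generate_payload(prompt, **overrides)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]
```

Usage: `generate("### Instruction: Explain rate limiting\n### Response:", temperature=0.3)`.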
# List models
curl -s http://localhost:5001/v1/models | jq '.data[].id'
# Load a model
curl -s -X POST http://localhost:5000/api/v1/model \
  -H "Content-Type: application/json" \
  -d '{"action": "load", "model_name": "Mistral-7B-Instruct-v0.2-GGUF", "args": {"n_gpu_layers": 35}}'
Python Example
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")
# Chat
response = client.chat.completions.create(
    model="x",  # model name doesn't matter for single-model
    messages=[
        {"role": "system", "content": "You are a web scraping expert."},
        {"role": "user", "content": "How do I bypass Cloudflare protection ethically?"},
    ],
    temperature=0.5,
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
    model="x",
    messages=[{"role": "user", "content": "Best Python libraries for parsing HTML"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Key Endpoints
| Endpoint | Port | Description |
|---|---|---|
| /v1/chat/completions | 5001 | OpenAI-compatible chat |
| /v1/completions | 5001 | OpenAI-compatible completion |
| /v1/models | 5001 | List models |
| /api/v1/generate | 5000 | Native generation |
| /api/v1/model | 5000 | Load/unload models |
| /api/v1/token-count | 5000 | Count tokens |
Supported Model Formats
| Format | Backend | VRAM Needed |
|---|---|---|
| GGUF | llama.cpp | Low (CPU+GPU) |
| GPTQ | AutoGPTQ/ExLlama | Medium |
| AWQ | AutoAWQ | Medium |
| EXL2 | ExLlamaV2 | Medium |
| HF | Transformers | High |
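The VRAM column can be made concrete with a standard back-of-the-envelope rule: weight memory ≈ parameter count × bits per weight ÷ 8. This counts weights only and ignores KV cache, activations, and framework overhead, so treat the result as a floor, not a budget:

```python
def approx_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    # params × bits / 8 bits-per-byte, in GB of weights only;
    # KV cache, activations, and framework overhead come on top
    return params_billion * bits_per_weight / 8

# A 7B model at common quantization levels (weights only):
for fmt, bits in [("FP16 (HF)", 16), ("8-bit", 8), ("GPTQ/AWQ 4-bit", 4)]:
    print(f"{fmt:>16}: ~{approx_weight_memory_gb(7, bits):.1f} GB")
# FP16 ≈ 14.0 GB, 8-bit ≈ 7.0 GB, 4-bit ≈ 3.5 GB
```

This is why a 7B GGUF at 4-bit fits on an 8 GB GPU with room for context, while the full HF Transformers weights do not.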
Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors