I fine-tuned Qwen2.5-0.5B to classify telecom support tickets, quantized it to 350 MB, and deployed it on a cheap VPS. Here's how.
The Problem
Support teams waste hours manually routing tickets. A customer writes "my wifi is slow" — is it a technical issue? Billing? Should it go to L1 or L2 support?
I built a classifier that outputs structured JSON with intent, category, urgency, sentiment, routing target, and extracted entities.
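For the "my wifi is slow" ticket above, the output looks something like this (the field names match the schema; the exact values and label vocabulary are illustrative):

```json
{
  "intent": "report_issue",
  "category": "technical",
  "urgency": "medium",
  "sentiment": "negative",
  "routing": "L1",
  "entities": { "service": "wifi" }
}
```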
Why Not Just Use a Cloud API?
- Cost — 50K requests/month via cloud LLMs (OpenAI, Claude, Gemini) ≈ $100-200. Self-hosted = $10-20
- Privacy — Some companies can't send customer data to external APIs
- Control — Fine-tune for your specific domain
The Stack
- Qwen2.5-0.5B (fine-tuned) → GGUF Q4_K_M (350 MB)
- llama-cpp-python for inference → FastAPI for API → nginx for reverse proxy
- Docker → VPS ($10/mo)
Fine-Tuning
Base Model
Qwen2.5-0.5B-Instruct — small enough for CPU inference, smart enough for classification.
Dataset
~1000 synthetic support tickets with labels:
- Technical issues (internet, TV, mobile)
- Billing inquiries
- Cancellation requests
- General questions
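Each record pairs a ticket with its JSON label in chat format. A hypothetical example (the exact label values are invented for illustration):

```json
{
  "messages": [
    { "role": "system", "content": "Classify the support ticket. Respond with JSON only." },
    { "role": "user", "content": "I was charged twice on my last bill." },
    { "role": "assistant", "content": "{\"intent\": \"dispute_charge\", \"category\": \"billing\", \"urgency\": \"high\", \"sentiment\": \"negative\", \"routing\": \"billing\", \"entities\": {\"issue\": \"double charge\"}}" }
  ]
}
```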
Training
Full fine-tuning on Google Colab T4 (free tier):
- 3 epochs
- Learning rate: 2e-5
- fp16 training (the free-tier T4 doesn't support bf16)
- ~40 minutes total
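A minimal sketch of the training script with Hugging Face transformers, assuming the chat-format JSONL above; the file names, batch size, and max length are my guesses, the hyperparameters come from the list:

```python
# Minimal full fine-tuning sketch with Hugging Face transformers.
# Assumptions: tickets.jsonl holds the chat-format records shown above;
# batch size and max length are guesses, the rest matches the post.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

data = load_dataset("json", data_files="tickets.jsonl", split="train")

def tokenize(example):
    # Render the chat turns into one training string via the model's template.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

args = TrainingArguments(
    output_dir="qwen-tickets",
    num_train_epochs=3,               # from the post
    learning_rate=2e-5,               # from the post
    per_device_train_batch_size=4,    # assumption
    fp16=True,                        # T4 supports fp16, not bf16
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=data,
    # Pads batches and masks padding tokens out of the loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```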
Quantization
Converted to GGUF and quantized to 4-bit using llama.cpp tools.
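Roughly these two commands, though the script and binary names vary between llama.cpp versions:

```bash
# Convert the fine-tuned HF checkpoint to GGUF at f16, then quantize to Q4_K_M.
python convert_hf_to_gguf.py ./qwen-tickets --outtype f16 --outfile qwen-tickets-f16.gguf
./llama-quantize qwen-tickets-f16.gguf qwen-tickets-q4_k_m.gguf Q4_K_M
```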
Result: a 350 MB model that runs on CPU.
The API
A simple FastAPI wrapper: load the GGUF model, accept POST requests, build chat messages from a system prompt plus the user's text, parse JSON from the model output, and log to a database.
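A minimal sketch, assuming a `/classify` endpoint and a JSON-only system prompt (the DB logging is stubbed):

```python
# Minimal sketch; the endpoint path, prompt wording, and generation settings
# are assumptions, and DB logging is left as a stub.
import json

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="qwen-tickets-q4_k_m.gguf", n_ctx=2048, n_threads=2)

SYSTEM_PROMPT = "Classify the support ticket. Respond with JSON only."

class Ticket(BaseModel):
    text: str

@app.post("/classify")
def classify(ticket: Ticket):
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket.text},
        ],
        temperature=0.0,   # deterministic labels
        max_tokens=256,
    )
    raw = out["choices"][0]["message"]["content"]
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        result = {"error": "model returned non-JSON output", "raw": raw}
    # TODO: log the request and result to the database
    return result
```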
Filtering Garbage Input
Users will send random stuff, so I added a heuristic check:
- Text too short (< 10 chars) → not relevant
- Contains telecom keywords (wifi, internet, bill, etc.) → relevant
- No keywords + category=unknown → not relevant
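As a sketch (the real keyword list is longer):

```python
# Sketch of the relevance heuristic; category comes from the model output.
TELECOM_KEYWORDS = {
    "wifi", "internet", "router", "tv", "mobile",
    "sim", "bill", "payment", "plan", "cancel",
}

def is_relevant(text: str, category: str) -> bool:
    if len(text.strip()) < 10:          # too short to be a real ticket
        return False
    words = set(text.lower().split())
    if words & TELECOM_KEYWORDS:        # any telecom keyword -> relevant
        return True
    return category != "unknown"        # no keywords + unknown category -> not relevant
```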
Now irrelevant queries return is_relevant: false.
Deployment
VPS Setup
Standard approach:
- Install Docker
- Deploy with docker compose (sketch below)
- Add SSL with Certbot
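The compose file can stay tiny. A sketch, with the service name, port, and paths as assumptions:

```yaml
# docker-compose.yml sketch; service name, port, and paths are assumptions.
services:
  classifier:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models   # mount the GGUF so rebuilds don't re-copy it
    restart: unless-stopped
```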
Total cost: ~$10-15/month for a 2 vCore, 4GB RAM VPS.
Performance
| Metric | Value |
|---|---|
| Intent accuracy | ~92% |
| Category accuracy | ~89% |
| Inference (VPS CPU) | 3-5 sec |
| Inference (M1 Mac) | 150-300ms |
| Model size | 350 MB |
| Memory usage | ~700 MB |
Why 3-5 seconds is fine
This isn't a chatbot. It's ticket classification that happens once when a ticket is created. You can also process async via a queue.
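For a simple version of that queue, FastAPI's built-in background tasks are enough (a real broker like Redis works the same way conceptually; `classify_and_store` below is a hypothetical worker):

```python
# Sketch: return immediately, classify after the response is sent.
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class Ticket(BaseModel):
    text: str

def classify_and_store(text: str) -> None:
    """Hypothetical worker: run the classifier (see the API section)
    and write the result to the database."""
    ...

@app.post("/tickets")
def create_ticket(ticket: Ticket, background: BackgroundTasks):
    background.add_task(classify_and_store, ticket.text)
    return {"status": "queued"}
```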
For faster inference: use a modern CPU (AMD EPYC) or add a GPU.
When to Fine-Tune vs Use a Cloud API
Fine-tune when:
- Data privacy is required (on-premise)
- High volume of similar requests (>10K/month)
- Specific domain knowledge needed
Use a cloud API when:
- Low volume
- Diverse tasks
- Need best quality regardless of cost
Try It
- Demo: silentworks.tech
- API docs: silentworks.tech/docs
Want something similar for your company? I build custom LLM solutions that run on your infrastructure.
Reach out on Telegram — let's discuss your use case.