I fine-tuned Qwen2.5-0.5B to classify telecom support tickets, quantized it to 350 MB, and deployed it on a cheap VPS. Here's how.
The Problem
Support teams waste hours manually routing tickets. A customer writes "my wifi is slow" — is it a technical issue? Billing? Should it go to L1 or L2 support?
I built a classifier that outputs structured JSON with intent, category, urgency, sentiment, routing target, and extracted entities.
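For the "my wifi is slow" ticket above, the output looks something like this (the field names match the schema; the exact values and label vocabulary are illustrative):

```json
{
  "intent": "report_issue",
  "category": "technical",
  "urgency": "medium",
  "sentiment": "negative",
  "routing": "L1",
  "entities": { "service": "wifi" }
}
```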
Why Not Just Use a Cloud API?
- Cost — 50K requests/month via cloud LLMs (OpenAI, Claude, Gemini) ≈ $100-200. Self-hosted = $10-20
- Privacy — Some companies can't send customer data to external APIs
- Control — Fine-tune for your specific domain
The Stack
- Qwen2.5-0.5B (fine-tuned) → GGUF Q4_K_M (350 MB)
- llama-cpp-python for inference → FastAPI for API → nginx for reverse proxy
- Docker → VPS ($10/mo)
Fine-Tuning
Base Model
Qwen2.5-0.5B-Instruct — small enough for CPU inference, smart enough for classification.
Dataset
~1000 synthetic support tickets with labels:
- Technical issues (internet, TV, mobile)
- Billing inquiries
- Cancellation requests
- General questions
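Each record pairs a ticket with its JSON label in chat format. A hypothetical example (the exact label values are invented for illustration):

```json
{
  "messages": [
    { "role": "system", "content": "Classify the support ticket. Respond with JSON only." },
    { "role": "user", "content": "I was charged twice on my last bill." },
    { "role": "assistant", "content": "{\"intent\": \"dispute_charge\", \"category\": \"billing\", \"urgency\": \"high\", \"sentiment\": \"negative\", \"routing\": \"billing\", \"entities\": {\"issue\": \"double charge\"}}" }
  ]
}
```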
Training
Full fine-tuning on Google Colab T4 (free tier):
- 3 epochs
- Learning rate: 2e-5
- fp16 training (the free-tier T4 doesn't support bf16)
- ~40 minutes total
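A minimal sketch of the training script with Hugging Face transformers, assuming the chat-format JSONL above; the file names, batch size, and max length are my guesses, the hyperparameters come from the list:

```python
# Minimal full fine-tuning sketch with Hugging Face transformers.
# Assumptions: tickets.jsonl holds the chat-format records shown above;
# batch size and max length are guesses, the rest matches the post.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

data = load_dataset("json", data_files="tickets.jsonl", split="train")

def tokenize(example):
    # Render the chat turns into one training string via the model's template.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

args = TrainingArguments(
    output_dir="qwen-tickets",
    num_train_epochs=3,               # from the post
    learning_rate=2e-5,               # from the post
    per_device_train_batch_size=4,    # assumption
    fp16=True,                        # T4 supports fp16, not bf16
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=data,
    # Pads batches and masks padding tokens out of the loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```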
Quantization
Converted to GGUF and quantized to 4-bit using llama.cpp tools.
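Roughly these two commands, though the script and binary names vary between llama.cpp versions:

```bash
# Convert the fine-tuned HF checkpoint to GGUF at f16, then quantize to Q4_K_M.
python convert_hf_to_gguf.py ./qwen-tickets --outtype f16 --outfile qwen-tickets-f16.gguf
./llama-quantize qwen-tickets-f16.gguf qwen-tickets-q4_k_m.gguf Q4_K_M
```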
Result: a 350 MB model that runs on CPU.
The API
A simple FastAPI wrapper: load the GGUF model, accept POST requests, build chat messages from a system prompt plus the user's text, parse JSON from the model output, and log to a database.
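A minimal sketch, assuming a `/classify` endpoint and a JSON-only system prompt (the DB logging is stubbed):

```python
# Minimal sketch; the endpoint path, prompt wording, and generation settings
# are assumptions, and DB logging is left as a stub.
import json

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="qwen-tickets-q4_k_m.gguf", n_ctx=2048, n_threads=2)

SYSTEM_PROMPT = "Classify the support ticket. Respond with JSON only."

class Ticket(BaseModel):
    text: str

@app.post("/classify")
def classify(ticket: Ticket):
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket.text},
        ],
        temperature=0.0,   # deterministic labels
        max_tokens=256,
    )
    raw = out["choices"][0]["message"]["content"]
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        result = {"error": "model returned non-JSON output", "raw": raw}
    # TODO: log the request and result to the database
    return result
```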
Filtering Garbage Input
Users will send random stuff, so I added a heuristic check:
- Text too short (< 10 chars) → not relevant
- Contains telecom keywords (wifi, internet, bill, etc.) → relevant
- No keywords + category=unknown → not relevant
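As a sketch (the real keyword list is longer):

```python
# Sketch of the relevance heuristic; category comes from the model output.
TELECOM_KEYWORDS = {
    "wifi", "internet", "router", "tv", "mobile",
    "sim", "bill", "payment", "plan", "cancel",
}

def is_relevant(text: str, category: str) -> bool:
    if len(text.strip()) < 10:          # too short to be a real ticket
        return False
    words = set(text.lower().split())
    if words & TELECOM_KEYWORDS:        # any telecom keyword -> relevant
        return True
    return category != "unknown"        # no keywords + unknown category -> not relevant
```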
Now irrelevant queries return is_relevant: false.
Deployment
VPS Setup
Standard approach:
- Install Docker
- Deploy with docker compose (sketch below)
- Add SSL with Certbot
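The compose file can stay tiny. A sketch, with the service name, port, and paths as assumptions:

```yaml
# docker-compose.yml sketch; service name, port, and paths are assumptions.
services:
  classifier:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models   # mount the GGUF so rebuilds don't re-copy it
    restart: unless-stopped
```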
Total cost: ~$10-15/month for a 2 vCore, 4GB RAM VPS.
Performance
| Metric | Value |
|---|---|
| Intent accuracy | ~92% |
| Category accuracy | ~89% |
| Inference (VPS CPU) | 3-5 sec |
| Inference (M1 Mac) | 150-300ms |
| Model size | 350 MB |
| Memory usage | ~700 MB |
Why 3-5 seconds is fine
This isn't a chatbot. It's ticket classification that happens once when a ticket is created. You can also process async via a queue.
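For a simple version of that queue, FastAPI's built-in background tasks are enough (a real broker like Redis works the same way conceptually; `classify_and_store` below is a hypothetical worker):

```python
# Sketch: return immediately, classify after the response is sent.
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class Ticket(BaseModel):
    text: str

def classify_and_store(text: str) -> None:
    """Hypothetical worker: run the classifier (see the API section)
    and write the result to the database."""
    ...

@app.post("/tickets")
def create_ticket(ticket: Ticket, background: BackgroundTasks):
    background.add_task(classify_and_store, ticket.text)
    return {"status": "queued"}
```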
For faster inference: use a modern CPU (AMD EPYC) or add a GPU.
When to Fine-Tune vs Use a Cloud API
Fine-tune when:
- Data privacy is required (on-premise)
- High volume of similar requests (>10K/month)
- Specific domain knowledge needed
Use a cloud API when:
- Low volume
- Diverse tasks
- Need best quality regardless of cost
Try It
- Demo: silentworks.tech
- API docs: silentworks.tech/docs
Want something similar for your company? I build custom LLM solutions that run on your infrastructure.
Reach out on Telegram — let's discuss your use case.