I replaced an LLM with a 120 MB ONNX model to read YouTube comments

Maksim G — Mon, 08 Jun 2026 17:41:41 +0000

Likes and view counts tell you a video was loud. They don't tell you whether the audience agreed, pushed back, or just showed up to argue. The signal for that is sitting right there in the comments — you just have to read a few hundred of them and keep score.

So I built a tool that does exactly that. It's called PJQ (Public Judgment Quotient): point it at a YouTube video, and it samples the comments, sorts every one into a stance, and gives you one honest verdict on how the audience actually received the thing.

The first version ran every comment through an LLM. It worked, and it was also the wrong tool for the job. This is the story of replacing it with a fine-tuned model that fits in 120 MB and runs on a cheap CPU box with no API bill.

Why "sentiment" wasn't enough

Off-the-shelf sentiment analysis gives you positive / negative / neutral. That collapses the part that's actually interesting.

"This is genuinely wrong and here's why" and "first!! 🔥🔥" are both technically positive-ish or neutral noise to a sentiment model, but they mean completely different things about reception. A coordinated spam wave looks like overwhelming support if you don't separate it out. So I threw away the 3-class scale and wrote a 7-stance rubric instead:

SUP — substantive support (actually agrees with the point)
AGA — substantive disagreement (actually argues against it)
NEU — neutral (a question, a clarification, a fact, no position)
OFF — off-topic (about something else entirely)
THIN — thin positive (hype with no argument — "🔥 legend")
SUS — inorganic (bot / paid / coordinated spam)
AGN — own-agenda (a comment carrying the author's external goal: crypto shill, self-promo, account farming)

The reporting rule matters as much as the labels: mood and support are computed strictly from SUP / AGA / NEU. THIN, OFF, SUS and AGN are tracked separately and never inflate the opinion numbers. That single decision is what stops a fan-noise flood or a bot ring from faking a 90%-positive verdict.
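A minimal sketch of that reporting rule, assuming each sampled comment has already been reduced to one of the seven codes. The function name `split_verdict` and the output keys are illustrative, not PJQ's actual code.

```python
from collections import Counter

OPINION = ("SUP", "AGA", "NEU")             # stances that count toward mood and support
SIDELINED = ("THIN", "OFF", "SUS", "AGN")   # tracked, but never mixed into the opinion numbers


def split_verdict(stances):
    """Return opinion shares over SUP/AGA/NEU only, plus raw side counters."""
    counts = Counter(stances)
    opinion_total = sum(counts[code] for code in OPINION)
    shares = {
        code: (counts[code] / opinion_total if opinion_total else 0.0)
        for code in OPINION
    }
    side = {code: counts[code] for code in SIDELINED}
    return {"shares": shares, "side": side, "opinion_total": opinion_total}


# 6 real opinions buried under 14 hype and bot comments
sample = ["SUP"] * 3 + ["AGA"] * 2 + ["NEU"] + ["THIN"] * 10 + ["SUS"] * 4
verdict = split_verdict(sample)
print(verdict["shares"]["SUP"])  # 0.5: support among opinions only
print(verdict["side"])           # hype and bots are reported, just not as opinion
```

The denominator is the point: the shares are taken over the six opinion comments, so the ten THIN and four SUS comments cannot move them in either direction.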

The LLM version, and why it had to go

For the first cut I batched comments and asked an LLM to label each one against the rubric. Quality was good. Everything else was bad:

Cost scaled with every comment. One video is 300–500 comments. Multiply that by every analysis and the per-call bill becomes a problem for the whole business model.
It was a single point of failure. One API key, one provider. If that account gets rate-limited or suspended, the product is just down.
It was slow and non-deterministic. Same comment, slightly different label across runs. Hard to test, hard to trust.

The task itself is narrow: short text in, one of seven labels out. That is a classification problem, and classification problems don't need a 70-billion-parameter model reasoning from scratch every time. They need a model that has seen enough examples to know the boundary.

SetFit: few-shot fine-tuning without a GPU farm

I went with SetFit. It fine-tunes a sentence-transformer with contrastive learning, then trains a tiny classification head on top of the embeddings. It gets competitive accuracy from a few hundred labeled examples and trains on a laptop CPU in minutes — no GPU cluster, no prompt engineering.

The architecture is deliberately boring:

Body: paraphrase-multilingual-MiniLM-L12-v2 — a 384-dimension multilingual encoder. Multilingual is non-negotiable here; the comments under one video can be Russian, English and Spanish in the same thread.
Head: plain scikit-learn LogisticRegression over the embeddings.
Input trick: I prepend the video title as context — title[:200] + " [SEP] " + comment[:300]. The same three words mean different things under a physics lecture and a boxing highlight, and the title is cheap context that fixes a surprising number of edge cases.
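As a function, the input trick is a one-liner. `build_input` is an illustrative name; the slice lengths and the separator come straight from the expression above. Note that the slices count characters, not tokens, so the tokenizer's own max-length truncation still applies afterwards.

```python
def build_input(title: str, comment: str) -> str:
    """Prepend the truncated video title as context for the encoder."""
    return title[:200] + " [SEP] " + comment[:300]


print(build_input("Why the Moon landing was real", "this is so fake"))
# Why the Moon landing was real [SEP] this is so fake
```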

Training starts from a stratified 80/10/10 split by (label, language), and evaluation uses macro-F1, not accuracy. With seven imbalanced classes, accuracy lies to you; a model that calls everything NEU can still look "80% accurate." Macro-F1 forces every class to pull its weight. My acceptance bar was macro-F1 ≥ 0.72, with per-class precision/recall and a per-source breakdown so I could see which stance was dragging.
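To make the accuracy trap concrete, here is a dependency-free sketch on a made-up test set (the class counts are hypothetical, chosen only so that NEU is 80% of the data). In practice scikit-learn's `f1_score` with `average="macro"` is the usual tool; the hand-rolled version is here only to show what is being averaged.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no correct predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = ["SUP", "AGA", "NEU", "OFF", "THIN", "SUS", "AGN"]

# Hypothetical test set: 80 NEU comments and 20 spread over the other six stances
y_true = (["NEU"] * 80 + ["SUP"] * 5 + ["AGA"] * 5 + ["OFF"] * 3
          + ["THIN"] * 3 + ["SUS"] * 2 + ["AGN"] * 2)
y_pred = ["NEU"] * len(y_true)  # degenerate model: everything is NEU

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                    # 0.8
print(round(macro_f1(y_true, y_pred, LABELS), 3))  # 0.127
```

Six of the seven classes score an F1 of zero, so the average collapses even though four out of five predictions are "right".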

Exporting to ONNX so it runs anywhere

A fine-tuned model in a notebook is a demo. To put it in production cheaply I exported the whole thing to ONNX:

the sentence-transformer body via 🤗 Optimum (optimum[exporters-onnx]),
the LogisticRegression head via skl2onnx.

At inference time the runtime is just onnxruntime on CPU:

tokenize → run the body ONNX graph → last_hidden_state
mean-pool over the token dimension using the attention mask (exactly like sentence-transformers does internally — get this wrong and your embeddings are quietly garbage)
run the head ONNX graph → probabilities → argmax → stance

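# rough shape of the hot path, for a single comment
# tokens: dict of int64 NumPy arrays from the tokenizer, keyed by the body graph's input names
hidden = body_session.run(None, tokens)[0]  # first output is last_hidden_state
emb = mean_pool(hidden, tokens["attention_mask"])  # masked mean over the token axis
# [0] assumes probabilities come first; a default skl2onnx export returns [label, probabilities]
probs = head_session.run(None, {"input": emb})[0]
stance = STANCE_CODES[int(probs[0].argmax())]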
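`mean_pool` is left undefined in the snippet above. A minimal NumPy sketch of it, assuming the body returns a (batch, seq_len, dim) array and the mask is (batch, seq_len) with 1 for real tokens and 0 for padding:

```python
import numpy as np


def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padding positions.

    last_hidden_state: (batch, seq_len, dim) float array
    attention_mask:    (batch, seq_len) array of 0/1
    """
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts


# Two sequences of length 3; the second ends in a padding token with junk values
hidden = np.array(
    [[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
     [[2.0, 4.0], [4.0, 8.0], [99.0, 99.0]]],
    dtype=np.float32,
)
mask = np.array([[1, 1, 1], [1, 1, 0]], dtype=np.int64)
pooled = mean_pool(hidden, mask)
print(pooled)  # rows: [3. 3.] and [3. 6.]
```

A plain `hidden.mean(axis=1)` would have pulled the 99s from the padding position into the second row, which is exactly the "quietly garbage" failure mode: nothing crashes, the embeddings are just wrong.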

The payoff:

Deterministic. Same comment, same label, every time. Now I can actually write tests.
No subscription, no per-call cost. It runs next to the app on a small VPS. The marginal cost of classifying a comment is some CPU cycles.
Batchable and fast. 32 comments a batch, a few hundred per video in seconds instead of minutes.

The model directory ships its own labels.json and a manifest.json with the encoder name, embedding dim and rubric version, so the serving code refuses to load a model that doesn't match the rubric it was trained for. Small thing, saves you from a very confusing class of bug.
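A sketch of what such a guard could look like. The JSON key names (`encoder`, `embedding_dim`, `rubric_version`), the `"v1"` version string and the function names are all assumptions for illustration; the post only says which facts the manifest carries, not how they are spelled.

```python
import json
from pathlib import Path

# Hypothetical values the serving code was built against; key names are assumptions too.
EXPECTED = {
    "encoder": "paraphrase-multilingual-MiniLM-L12-v2",
    "embedding_dim": 384,
    "rubric_version": "v1",
}


def manifest_problems(manifest: dict, labels, n_stances: int = 7) -> list:
    """Return human-readable mismatches; an empty list means the model is safe to load."""
    problems = [
        f"{key}: expected {want!r}, got {manifest.get(key)!r}"
        for key, want in EXPECTED.items()
        if manifest.get(key) != want
    ]
    if len(labels) != n_stances:
        problems.append(f"labels.json has {len(labels)} entries, expected {n_stances}")
    return problems


def load_checked(model_dir):
    """Read both JSON files and refuse to continue on any mismatch."""
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / "manifest.json").read_text(encoding="utf-8"))
    labels = json.loads((model_dir / "labels.json").read_text(encoding="utf-8"))
    problems = manifest_problems(manifest, labels)
    if problems:
        raise RuntimeError("model/rubric mismatch: " + "; ".join(problems))
    return manifest, labels  # only now is it safe to create the onnxruntime sessions
```

Running the check before any session is created means a stale model fails loudly at startup instead of silently mapping class index 3 to the wrong stance.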

What you get out the other end

Once every sampled comment has a stance, the verdict is just aggregation: the SUP/AGA/NEU split, a net-mood score, a controversy index (how evenly split SUP vs AGA is), and the bot/hype/off-topic counters kept strictly to the side. You can see live verdicts and an audience-trends dashboard across YouTube categories and regions without signing up for anything.
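The post does not spell out the formulas, so the two below are assumptions: one plausible way to define a net-mood score and a controversy index from the SUP/AGA/NEU counts, not necessarily the definitions PJQ uses.

```python
def net_mood(sup: int, aga: int, neu: int) -> float:
    """(SUP - AGA) / (SUP + AGA + NEU), in [-1, 1]; 0.0 when there are no opinions."""
    total = sup + aga + neu
    return (sup - aga) / total if total else 0.0


def controversy(sup: int, aga: int) -> float:
    """1.0 for an even SUP/AGA split, 0.0 when one side has everything."""
    total = sup + aga
    return 1.0 - abs(sup - aga) / total if total else 0.0


print(net_mood(60, 20, 20))  # 0.4
print(controversy(60, 20))   # 0.5
print(controversy(45, 45))   # 1.0
```

Neither function takes THIN, OFF, SUS or AGN as an argument, which keeps the side counters out of the opinion numbers by construction.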

Takeaways if you're doing something similar

Reach for an LLM to prototype a labeling task, not to serve it. Use it to bootstrap a labeled set, then distill that into a small model you own.
Pick your metric before you celebrate. Macro-F1 over accuracy the moment your classes are imbalanced.
ONNX is the boring bridge that turns "it works in my notebook" into "it runs on a $5 box." Optimum + skl2onnx + onnxruntime, and you've cut the API cord.
Match the model to the shape of the problem. A seven-way text classifier does not need to reason. It needs to have seen the boundary.

If you want to see the classifier's verdicts on real videos, it's live at pjq.life — first analysis is free. Happy to answer anything about the rubric or the training setup in the comments.

DEV Community: Maksim G
