Customer feedback is one of the great unglamorous NLP problems. Millions of short, messy, opinionated messages, and someone has to turn them into "what is going wrong, and how often." I wanted to build the core of that pipeline properly, so I took about 14,600 real tweets about US airlines and set two tasks: classify each tweet's sentiment, and, for the complaints, classify the reason (late flight, lost luggage, customer service, and so on).
I gave myself one rule: build the transformer from scratch in PyTorch, and do not trust it until it has beaten a stupid-simple baseline.
It didn't beat the baseline. That turned out to be the most useful result in the project, and the rest of this post is about why.
The full code is here: github.com/gbadedata/airline-feedback-transformer.
The data and the two tasks
The dataset is the Twitter US Airline Sentiment set: real customer feedback about six airlines from February 2015, labelled for sentiment and, for the negative tweets, a complaint reason.
It is imbalanced the way genuine complaints are: 9,178 negative, 3,099 neutral, 2,363 positive. I kept that imbalance rather than resampling it away, because production feedback is imbalanced too, and hiding it just moves the problem.
- Task 1, sentiment: three classes over all tweets.
- Task 2, reason: the complaint reason over negative tweets that have an identifiable one. I dropped the "Can't Tell" bucket, since by definition it has no aspect to learn, which left nine classes over 7,988 tweets. This is the extraction-flavoured task.
Both use a stratified 70/15/15 split.
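A minimal sketch of a stratified 70/15/15 split, assuming scikit-learn's `train_test_split` applied twice. The function name `stratified_70_15_15`, the seed and the toy data are illustrative and not taken from the repository.

```python
from sklearn.model_selection import train_test_split


def stratified_70_15_15(texts, labels, seed=42):
    """Split into train/val/test (roughly 70/15/15), preserving class proportions."""
    # First carve off 30% of the data for validation + test.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # Then split that 30% in half: about 15% validation, 15% test.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


# Toy labels with roughly the same imbalance as the sentiment task.
labels = ["negative"] * 63 + ["neutral"] * 21 + ["positive"] * 16
texts = [f"tweet {i}" for i in range(len(labels))]

train, val, test = stratified_70_15_15(texts, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Stratifying both calls is what keeps the rare classes present in the validation and test sets, which matters most for the nine-class reason task.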
The simple baseline I had to beat
Before any deep learning, the thing to beat: TF-IDF features into a class-weighted logistic regression. This is a genuinely strong text classifier and it fits in a few lines.
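```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Unigrams and bigrams, log-scaled term frequency, vocabulary capped at 40k features.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True, max_features=40000)
# class_weight="balanced" reweights each class inversely to its frequency.
clf = LogisticRegression(max_iter=2000, C=4.0, class_weight="balanced")
clf.fit(vec.fit_transform(train_texts), y_train)
```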
Plus a majority-class floor, so I always know what "predicting nothing" scores.
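As a rough illustration of that floor, here is a sketch using scikit-learn's `DummyClassifier`. It rebuilds the labels from the full-dataset class counts quoted above rather than from the actual test split, so treat it as a sanity check on the arithmetic and not as the repository's code.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Sentiment class counts quoted earlier in the post.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
y = [label for label, n in counts.items() for _ in range(n)]
X = np.zeros((len(y), 1))  # features are ignored by the majority strategy

floor = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = floor.predict(X)  # always "negative"

acc = accuracy_score(y, pred)
# Classes that are never predicted score F1 = 0, which drags the macro average down.
macro_f1 = f1_score(y, pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f} macro-F1={macro_f1:.3f}")
# accuracy=0.627 macro-F1=0.257
```

The gap between the two numbers is the reason to report macro-F1 alongside accuracy: a model that ignores two of the three classes still looks respectable on accuracy alone.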
The transformer, built from scratch
I could have called AutoModel.from_pretrained(...), but building the encoder by hand is a much better way to show the architecture is understood rather than imported: token and learned positional embeddings, multi-head self-attention with a correct padding mask, pre-norm residual blocks with a GELU feed-forward, masked mean pooling, and a small classification head.
The part that quietly matters most is the padding mask. Tweets are short and vary in length, so a padded batch carries a lot of pad tokens, and if the model attends to them it learns from noise. Padded keys have to be masked out before the softmax:
```python
scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)  # (B, heads, T, T)
scores = scores.masked_fill(pad_mask[:, None, None, :], NEG_INF)
attn = torch.softmax(scores, dim=-1)
```
and the same mask has to exclude padding from the pooled representation:
```python
keep = (~pad_mask).float().unsqueeze(-1)                 # (B, T, 1)
pooled = (x * keep).sum(1) / keep.sum(1).clamp(min=1.0)  # (B, d_model)
```
I unit-tested both. One test adds extra padding to an input and asserts the prediction does not change, which is exactly the property a correct mask guarantees. The default model is deliberately compact (d_model 128, two layers, four heads, about 0.97M parameters) so it trains on a CPU in minutes.
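A padding-invariance check of that kind could look roughly like the sketch below. It uses a single attention head with no learned projections to stay short, and `NEG_INF = -1e9`, the function names and the tensor sizes are all assumptions made for illustration. The repository's tests may be structured differently.

```python
import math
import torch

NEG_INF = -1e9  # assumed value; large enough that softmax weights underflow to 0


def masked_attention(q, k, v, pad_mask):
    """Single-head attention. pad_mask is (B, T), True where the token is padding."""
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))  # (B, T, T)
    scores = scores.masked_fill(pad_mask[:, None, :], NEG_INF)  # hide padded keys
    return torch.softmax(scores, dim=-1) @ v                    # (B, T, d)


def masked_mean_pool(x, pad_mask):
    keep = (~pad_mask).float().unsqueeze(-1)                    # (B, T, 1)
    return (x * keep).sum(1) / keep.sum(1).clamp(min=1.0)       # (B, d)


def encode(x, pad_mask):
    return masked_mean_pool(masked_attention(x, x, x, pad_mask), pad_mask)


torch.manual_seed(0)
x = torch.randn(2, 5, 8)
pad = torch.zeros(2, 5, dtype=torch.bool)
pad[1, 3:] = True  # second sequence has only three real tokens

# Append four extra pad positions filled with random junk.
x_more = torch.cat([x, torch.randn(2, 4, 8)], dim=1)
pad_more = torch.cat([pad, torch.ones(2, 4, dtype=torch.bool)], dim=1)

base, padded = encode(x, pad), encode(x_more, pad_more)
assert torch.allclose(base, padded, atol=1e-5)
```

If either mask is dropped, the junk values leak into the pooled vector and the assertion fails, which is what makes this a useful regression test.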
Training it
A standard, explicit PyTorch loop: class-weighted cross-entropy for the imbalance, AdamW, linear warmup and decay, gradient clipping, and early stopping on validation macro-F1 with the best checkpoint restored.
```python
loss_fn = nn.CrossEntropyLoss(weight=class_weights)  # counter the imbalance
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
# ... warmup+decay schedule, clip_grad_norm_, early stop on val macro-F1
```
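The elided schedule and clipping steps might be wired up along these lines. This is a sketch using `torch.optim.lr_scheduler.LambdaLR`; the step counts, the `max_norm=1.0` clip value and the stand-in linear model are assumptions, and early stopping is left out for brevity.

```python
import torch
from torch import nn


def linear_warmup_decay(warmup_steps, total_steps):
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    def factor(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return factor


model = nn.Linear(16, 3)  # stand-in for the transformer classifier
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps, warmup_steps = 100, 10  # assumed values
sched = torch.optim.lr_scheduler.LambdaLR(opt, linear_warmup_decay(warmup_steps, total_steps))
loss_fn = nn.CrossEntropyLoss()

torch.manual_seed(0)
x, y = torch.randn(32, 16), torch.randint(0, 3, (32,))  # one synthetic batch

lrs = []
for step in range(total_steps):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale gradients so their global norm never exceeds max_norm.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()  # one scheduler step per optimizer step
    lrs.append(sched.get_last_lr()[0])
```

Stepping the scheduler per batch rather than per epoch is what makes a short warmup meaningful on a dataset this small.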
One evaluation framework for everyone
Every model (majority, TF-IDF, transformer, and an optional zero-shot LLM) is scored by the same code: accuracy, macro-F1, weighted-F1, a full per-class precision/recall/F1 table, and a confusion matrix. Define the test set once, define the metrics once, run every model through them. That is the piece that makes the comparison trustworthy, and it is the piece that tells you where a model fails rather than just how often.
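One way to get a single shared scorer is a small function over scikit-learn's metrics. The sketch below assumes every model can be reduced to a list of predicted labels; the function name `evaluate` and the toy predictions are illustrative only.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)


def evaluate(y_true, y_pred, labels):
    """Score any model's predictions with one fixed set of metrics."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0),
        "weighted_f1": f1_score(y_true, y_pred, labels=labels, average="weighted", zero_division=0),
        # Per-class precision / recall / F1 as a nested dict keyed by label.
        "per_class": classification_report(
            y_true, y_pred, labels=labels, output_dict=True, zero_division=0
        ),
        # Rows are true labels, columns are predictions, both in `labels` order.
        "confusion": confusion_matrix(y_true, y_pred, labels=labels),
    }


labels = ["negative", "neutral", "positive"]
y_true = ["negative", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["negative", "negative", "negative", "positive", "neutral", "neutral"]

report = evaluate(y_true, y_pred, labels)
print(report["confusion"])
```

Passing `labels` explicitly fixes the row and column order, so confusion matrices from different models can be compared cell by cell.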
The results
| Task | Model | Accuracy | Macro-F1 |
|---|---|---|---|
| Sentiment | Majority | 0.627 | 0.257 |
| Sentiment | TF-IDF + LogReg | 0.790 | 0.735 |
| Sentiment | Transformer (scratch) | 0.769 | 0.729 |
| Reason | Majority | 0.364 | 0.059 |
| Reason | TF-IDF + LogReg | 0.648 | 0.501 |
| Reason | Transformer (scratch) | 0.612 | 0.495 |
On both tasks the transformer lands within a single macro-F1 point of the baseline (0.6 points behind on each) and does not pass it. On accuracy it trails by 2.1 points on sentiment and 3.6 on reason. It is not broken: it learns cleanly and then early-stops.
Where it wins and where it loses
The per-class view is where the evaluation framework earns its place. On the reason task the model is good at the frequent, distinctive complaints and poor at the rare ones.
Late Flight (0.72), Customer Service Issue (0.68), Lost Luggage (0.68) and Cancelled Flight (0.67) come out well. Damaged Luggage (0.17, with eleven test examples) and longlines (0.27) do not. The confusion matrix shows the errors are sensible: rare and adjacent complaints get pulled toward the big "Customer Service Issue" and "Late Flight" classes.
That is an actionable readout. It tells you which classes need more data or a different approach, which is far more useful than a single headline accuracy.
Why the baseline won, and why that matters
Transformers earn their advantage from pretraining on enormous corpora. Mine was trained from random initialisation on the 70% training split alone: about 10,000 short tweets for sentiment and about 5,600 for the reason task, which is nowhere near enough for the architecture to express its strengths. The bottleneck was never the model's capacity. It was data and pretraining.
So at this scale the correct engineering decision is the simpler, faster, more interpretable TF-IDF model, and the value of the whole exercise is being able to say that with numbers instead of assuming the neural model must be better. That is the judgment that separates "I can train a transformer" from "I know when a transformer is worth it."
What I would do in production
Nothing about the finding says transformers are the wrong tool. It says a from-scratch one is. The pipeline is built for the obvious next step: swap the scratch encoder for a pretrained one and fine-tune it. The pooling, the head, the evaluation framework and the training loop do not change, and that is where I would expect the neural approach to start pulling ahead. Parameter-efficient fine-tuning (LoRA) slots in at the same point.
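To make the swap concrete, here is a sketch of a classifier that treats the encoder as a pluggable module. The class names and the `(input_ids, pad_mask) -> (B, T, d_model)` contract are hypothetical, chosen for illustration and not taken from the repository. `TinyEncoder` is only a stand-in so that the example runs.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Stand-in encoder: token embeddings only, no attention layers."""

    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model, padding_idx=0)

    def forward(self, input_ids, pad_mask):
        return self.emb(input_ids)  # (B, T, d_model)


class FeedbackClassifier(nn.Module):
    """Masked mean pooling + linear head on top of any encoder with the same contract."""

    def __init__(self, encoder, d_model, n_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, input_ids, pad_mask):
        x = self.encoder(input_ids, pad_mask)                    # (B, T, d_model)
        keep = (~pad_mask).float().unsqueeze(-1)                 # (B, T, 1)
        pooled = (x * keep).sum(1) / keep.sum(1).clamp(min=1.0)  # (B, d_model)
        return self.head(pooled)                                 # (B, n_classes)


ids = torch.tensor([[5, 7, 2, 0, 0], [3, 9, 4, 6, 1]])  # 0 is the assumed pad id
pad_mask = ids.eq(0)
model = FeedbackClassifier(TinyEncoder(vocab_size=20, d_model=16), d_model=16, n_classes=3)
logits = model(ids, pad_mask)
print(logits.shape)
```

Under this contract, a pretrained model would be wrapped in a small adapter that returns its last hidden states. Hugging Face models expect an `attention_mask` that is 1 for real tokens, so the adapter would pass the inverse of `pad_mask`.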
Takeaways
- Always build the baseline first. A class-weighted TF-IDF logistic regression is a high bar, and if your model cannot clear it, that is information, not failure.
- Build the architecture by hand once. Implementing attention, masking and pooling yourself teaches you where the bodies are buried, and the padding mask is one of them.
- Measure per class, not just overall. The headline number hides which complaint types you actually handle.
- Report the result that does not flatter you. "The simple model won" is often the most valuable sentence in a project.
Code, tests and figures: github.com/gbadedata/airline-feedback-transformer. If you want the retrieval and RAG side of this kind of work, I wrote up a biomedical question-answering system separately at github.com/gbadedata/biomedical-rag-qa.



