I Built a Hate Speech Detector That Actually Knows the Difference Between Offensive and Hateful

#ai #machinelearning #nlp #showdev

Most hate speech models get this wrong: they treat "this movie sucked ass" and "heil hitler" as the same category.

They're not. One is someone venting. The other is an ideological statement. Conflating them makes content moderation either useless (too permissive) or annoying (bans people for swearing). So when I built AuricErgeson/hate-speech-detector, I started with that distinction as a hard requirement.

Three classes, not two

The model outputs neither, offensive, or hate_speech.

That middle class, offensive, is where most binary classifiers fail. They either flag everything offensive as hate speech, or they let actual hate speech through because it doesn't contain obvious slurs. "They control the media" is a good example. No profanity, no slur, but it is a well-documented antisemitic dog whistle. A binary clean/hateful model often misses it. Mine doesn't.

The dataset problem

110,585 training examples: 110,492 fused from four public datasets, plus 93 targeted augmentation examples (covered below):

Davidson et al. 2017: 24,783 examples of explicit Twitter slurs and offensive language
ImplicitHate: 21,480 examples of coded language and dog whistles
HateXplain: 19,229 examples with multi-annotator labels and rationales
HateDay 2025: 45,000 examples of contemporary Twitter hate speech

Each dataset uses a different label scheme. Harmonizing them into a unified 3-class system took more time than the actual training. Davidson uses hate/offensive/neither, HateXplain uses hate/offensive/normal, and ImplicitHate is binary. You have to make judgment calls about where the boundaries are and apply them consistently across 110K rows.
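A minimal sketch of what that harmonization step might look like. The dataset keys, the source label strings, and especially the mapping for the binary ImplicitHate labels are assumptions for illustration; the actual boundary decisions are exactly the judgment calls described above.

```python
# Unified label set described in the post.
UNIFIED = ("neither", "offensive", "hate_speech")

# Hypothetical per-dataset mappings; the source label strings are assumed.
LABEL_MAPS = {
    "davidson": {"hate": "hate_speech", "offensive": "offensive", "neither": "neither"},
    "hatexplain": {"hate": "hate_speech", "offensive": "offensive", "normal": "neither"},
    # Binary dataset: it has no "offensive" class, so nothing maps there (assumption).
    "implicit_hate": {"hate": "hate_speech", "not_hate": "neither"},
}


def harmonize(dataset: str, label: str) -> str:
    """Map a dataset-specific label to the unified 3-class scheme."""
    try:
        return LABEL_MAPS[dataset][label.strip().lower()]
    except KeyError:
        # Fail loudly instead of silently guessing a class.
        raise ValueError(f"no mapping for {dataset!r} label {label!r}") from None
```

Raising on unknown labels is a deliberate choice in this sketch: a silent default would hide exactly the inconsistencies that harmonization is supposed to surface.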

The other problem is class imbalance. "Neither" dominates naturally, since most text online is not hateful. Without correction, the model just learns to predict "neither" and gets decent accuracy while being useless. I oversampled to 21,621 examples per class, giving 64,863 total training examples.
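A rough sketch of resampling to a fixed per-class count, using only the standard library. The target of 21,621 per class (3 × 21,621 = 64,863) comes from the numbers above; the function name, the row format, and the downsampling branch for classes that already exceed the target are assumptions, since the post only mentions oversampling.

```python
import random
from collections import defaultdict


def balance(rows, per_class, seed=0):
    """Resample (text, label) rows so every label has exactly per_class rows."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    balanced = []
    for label, items in by_label.items():
        if len(items) >= per_class:
            # Large class: sample without replacement (assumed behavior).
            balanced.extend(rng.sample(items, per_class))
        else:
            # Small class: keep every row, then top up with random duplicates.
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=per_class - len(items)))
    rng.shuffle(balanced)
    return balanced
```

Resampling only the training split, after the test set has been held out, keeps duplicated rows from leaking into evaluation.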

One more thing: general Twitter corpora barely contain neo-Nazi numeric codes like 1488 or phrases like "14 words." They exist in the real world but not in enough volume to train on. I added 93 targeted augmentation examples covering these specifically. 93 is a small number but it moved the needle on those cases noticeably.

The base model choice

I used cardiffnlp/twitter-roberta-base-2022-154m instead of standard RoBERTa. The reason is simple: it was trained on 154 million tweets through 2022. Hate speech on Twitter has its own grammar, abbreviations, and coded vocabulary. A model that has never seen that register will struggle with it no matter how good the fine-tuning data is.

There was one annoying technical issue. This checkpoint uses legacy TensorFlow-style parameter names for LayerNorm: gamma and beta instead of the standard weight and bias. Transformers version 5.0 and above no longer maps these automatically, so in some configurations the LayerNorm weights silently fail to load. I had to reload the checkpoint manually with the names remapped before training. It took a while to debug because there was no error: the model simply trained on randomly initialized LayerNorm parameters.
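The remapping itself is a key rename over the checkpoint's state dict. This is a sketch of the idea, not the exact code used for this model; it assumes the legacy parameters are the keys ending in `.gamma` and `.beta`, and it works on any plain mapping of names to tensors.

```python
def remap_layernorm_keys(state_dict):
    """Return a copy with legacy LayerNorm names renamed to weight/bias."""
    renamed = {}
    for key, tensor in state_dict.items():
        if key.endswith(".gamma"):
            # Legacy scale parameter -> standard "weight".
            key = key[: -len("gamma")] + "weight"
        elif key.endswith(".beta"):
            # Legacy shift parameter -> standard "bias".
            key = key[: -len("beta")] + "bias"
        renamed[key] = tensor
    return renamed
```

In PyTorch, `model.load_state_dict(remapped, strict=False)` returns the missing and unexpected key names, so checking that both lists are empty is one way to catch this kind of silent mismatch before training starts.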

Results

Evaluated on a stratified held-out test set of 11,059 examples. Weighted F1 of 0.849, accuracy 0.843. Per-class F1: neither at 0.884, offensive at 0.870, hate_speech at 0.697.

The hate_speech F1 of 0.697 is the honest weak point. It is harder to classify than the other two because it covers a wide range: explicit slurs, coded language, dog whistles, and symbol use all fall in the same bucket. A model that gets 0.88 on "neither" and 0.70 on "hate_speech" is not broken; the gap reflects how genuinely ambiguous hate speech classification is.
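For readers who want to see how the per-class and weighted numbers relate, here is a small standard-library sketch of both metrics. The function names are illustrative. Weighted F1 averages the per-class scores by each class's share of the test set, which is why a strong score on a large class can sit alongside a weaker score on a smaller one.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never appears.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred, labels):
    """Average per-class F1, weighted by each class's count in y_true."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores[label] * support[label] for label in labels) / len(y_true)
```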

I ran 8 probe cases manually. 7 passed. The one that failed: "1488" as a standalone 4-digit string. In context ("1488 white power") it classifies correctly. As a bare number it predicts neither. That is a known limitation and it is in the model card.
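Probes like these are easy to keep as a small regression check. The sketch below lists only the probes quoted in this post (not all 8), and `classify` is a hypothetical callable that takes a string and returns one of the three label names.

```python
# (text, expected label) pairs taken from the examples in this post.
PROBES = [
    ("they control the media", "hate_speech"),
    ("this movie sucked ass", "offensive"),
    ("I really enjoyed the concert", "neither"),
    ("1488 white power", "hate_speech"),
    ("1488", "hate_speech"),  # known failure: the model predicts "neither"
]


def run_probes(classify, probes=PROBES):
    """Run each probe through classify(text) -> label; return the failures."""
    failures = []
    for text, expected in probes:
        got = classify(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures
```

Returning the failures, rather than a bare pass count, makes it obvious which case regressed after a retrain.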

What it gets right that others miss

"they control the media" classifies as hate_speech with 0.87 confidence. This is the antisemitic dog whistle test. Most models miss it because there is no surface-level offensive vocabulary.

"this movie sucked ass" classifies as offensive, not hate_speech. This matters for moderation. You probably do not want to ban people for this.

"I really enjoyed the concert" classifies as neither with 0.97 confidence. Basic sanity check.

Limitations worth knowing

A few honest limitations. Post-2022 coded language may not be recognized since the base model training data ends there and new slang appears constantly. Academic discussion of hate speech can produce false positives because "researchers studying the n-word" and an actual slur look similar at the token level. English only. And do not use this as the sole decision system for automated account bans. It is a classifier, not a human moderator.
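One way to keep a human in the loop is to route predictions to actions instead of acting on them directly. This is an illustrative policy only: the action names and the 0.9 threshold are placeholders, not anything shipped with the model.

```python
def route(label, score, review_threshold=0.9):
    """Map a prediction to a moderation action; the threshold is a placeholder."""
    if label == "neither":
        return "allow"
    if label == "offensive":
        # Profanity alone is not grounds for a ban in this sketch.
        return "allow_with_warning"
    if label == "hate_speech":
        # Never auto-ban: high-confidence cases only move up the review queue.
        return "priority_review" if score >= review_threshold else "human_review"
    raise ValueError(f"unknown label: {label!r}")
```

Under this policy the classifier only decides how quickly a person looks at a post, which fits its role as a classifier and not a moderator.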

The model

AuricErgeson/hate-speech-detector on HuggingFace. MIT license. There is also a live Gradio demo at the same path under Spaces if you want to test it without any code.
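If you would rather call it from code, a minimal sketch with the Transformers `pipeline` API looks like this. It assumes `transformers` and a backend such as PyTorch are installed, and that the model's config exposes the label names used in this post; `load_classifier` and `predict` are just illustrative helper names.

```python
def load_classifier(model_id="AuricErgeson/hate-speech-detector"):
    """Build a text-classification pipeline (downloads weights on first call)."""
    # Imported here so this module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def predict(classifier, text):
    """Return (label, score) for the top class of a single text."""
    # For one input string, the pipeline returns a list with one result dict.
    result = classifier(text)[0]
    return result["label"], result["score"]
```

Usage would be something like `predict(load_classifier(), "I really enjoyed the concert")`, which returns the top label together with its confidence score.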

If you find edge cases it gets wrong, post them in the Community tab. The limitation around bare numeric codes is known. Other failures I want to hear about.

Auric Ergeson Nitonde is a software developer in Germany, building NLP tools and publishing models at huggingface.co/AuricErgeson.

DEV Community
