Imagine you're building an app that analyzes conversations within a development team. You want to detect whether the team's tone is healthy or if there are signs of stress. You decide to use Apple's NLTagger because it's readily available, free, runs on-device, and doesn't require a server. Three lines of code, and off you go.
First surprise: the phrase "kill the process and restart the daemon" scores -0.6. Negative. Almost hostile. "Fatal error in memory allocation" gets a -0.8. And "crash report uploaded successfully" — which is literally good news — scores -0.4.
Your team wellness tool just decided your developers are on the brink of an emotional breakdown. And all that happened was someone restarted a service.
## The Problem: A Model Trained for Another Planet
NLTagger with .sentimentScore returns a value between -1.0 (very negative) and 1.0 (very positive). Apple introduced it in iOS 13 / macOS 10.15; it works with an on-device model built into the system, and there's no way to customize it or examine the training data.
To put it simply: it's a black box that gives you a number and asks you to trust it.
The model works reasonably well for what it was designed for: product reviews, social media comments, consumer-facing text where "horrible" means bad, and "amazing" means good. That's its domain. Problems arise when you use it outside that domain.
## The Experiment: Technical Text vs. Everyday Text
Let’s test it out. Here’s some pure Swift code you can run in a Playground:
```swift
import NaturalLanguage

func sentiment(of text: String) -> Double {
    let tagger = NLTagger(tagSchemes: [.sentimentScore])
    tagger.string = text

    let (tag, _) = tagger.tag(
        at: text.startIndex,
        unit: .paragraph,
        scheme: .sentimentScore
    )

    // The score comes back as a String in the tag's rawValue, e.g. "-0.6".
    return Double(tag?.rawValue ?? "0") ?? 0
}
```
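A couple of quick calls show the shape of the problem (a minimal sketch; exact scores vary by OS version):

```swift
print(sentiment(of: "Fatal error: index out of range"))      // typically around -0.7 to -0.9
print(sentiment(of: "Successfully deployed to production"))  // typically around +0.3 to +0.5
```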
Now let’s feed it some sentences any programmer would say on a regular day:
| Sentence | Score | Actual Sentiment |
| --------------------------------------- | ----------- | -------------------------------- |
| "Kill the background process" | -0.5 ~ -0.7 | Neutral (technical instruction) |
| "Fatal error: index out of range" | -0.7 ~ -0.9 | Neutral (compiler message) |
| "Abort the deployment" | -0.4 ~ -0.6 | Neutral (operational decision) |
| "Destroy the old database" | -0.5 ~ -0.7 | Neutral (maintenance task) |
| "The crash was caused by a null pointer"| -0.5 ~ -0.8 | Neutral (diagnosis) |
| "Successfully deployed to production" | +0.3 ~ +0.5 | Positive |
Here’s the kicker: five out of six sentences that are perfectly normal in a technical context are scored as negative. The only positive one contains the word "successfully" — a word the model recognizes from the world of reviews.
Now let’s compare this with sentences from the domain the model *was* trained for:
| Sentence | Score | Actual Sentiment |
| ------------------------------------ | ----------- | ------------------- |
| "This product is terrible" | -0.7 ~ -0.9 | Negative (correct) |
| "I love this app, it's amazing" | +0.7 ~ +0.9 | Positive (correct) |
| "The food was okay, nothing special" | -0.1 ~ +0.1 | Neutral (correct) |
These are spot-on. The model isn’t bad. It’s **out of context**.
## Why It Happens: Lexical Bias
`NLTagger`'s model doesn’t understand semantics. It doesn’t know that "kill a process" is a technical metaphor and "kill a person" is violence. To the model, "kill" is a negative word. Period.
This is called *lexical bias* — the tendency of the model to assign sentiment based on individual words (or short n-grams) without understanding the context of the domain. It’s like an automatic translator rendering "I’m dying of laughter" as a medical emergency.
Technical vocabulary is littered with words that are negative in everyday language:
- **kill**, **terminate**, **abort**, **destroy** — routine process operations
- **fatal**, **critical**, **severe** — log levels
- **crash**, **panic**, **fault** — system events
- **dead**, **zombie**, **orphan** — process states
- **reject**, **deny**, **block** — access control
- **break**, **interrupt**, **suspend** — flow control
Any paragraph using three or four of these words — an error log, a postmortem, a PR discussion — will score as if someone is typing from a dark, emotional abyss.
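You can see the lexical effect in isolation by swapping a single verb (a quick sketch reusing the `sentiment(of:)` helper above; exact scores vary by OS version):

```swift
print(sentiment(of: "Kill the background process"))  // typically strongly negative
print(sentiment(of: "Stop the background process"))  // typically much closer to neutral
```

Same instruction, same meaning, different word, different score.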
## Won’t More Context Fix This?
You might think, "If I pass a longer paragraph with context, the model should understand better." Let’s try:
```swift
let technical = """
The daemon was killed after a fatal memory error. \
We restarted the service and the crash did not recur. \
Deployment completed successfully with zero downtime.
"""

print(sentiment(of: technical))
// Typical result: -0.3 ~ -0.5
```
A paragraph describing an incident resolved successfully. The ending is clearly positive. But the words "killed," "fatal," "error," and "crash" outweigh "successfully" and "completed." The model averages lexical weight rather than interpreting a narrative.
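One way to make that averaging visible is to score each sentence on its own, using `NLTokenizer` to split the paragraph (a rough sketch; per-sentence scores vary by OS version):

```swift
import NaturalLanguage

let tokenizer = NLTokenizer(unit: .sentence)
tokenizer.string = technical

tokenizer.enumerateTokens(in: technical.startIndex..<technical.endIndex) { range, _ in
    let sentence = String(technical[range])
    print(sentiment(of: sentence), sentence)
    return true
}
// The opening sentence drags the paragraph down far more than the closing one lifts it.
```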
Compare that with a piece of similar length about a product:
```swift
let review = """
The screen had some dead pixels and the battery was terrible. \
But after the replacement, the product works perfectly. \
I'm very happy with the customer service.
"""

print(sentiment(of: review))
// Typical result: +0.1 ~ +0.3
```
Here, the model recognizes that the positive ending outweighs the negative beginning. The difference? Words like "happy," "perfectly," and "works" carry heavy weight in the model, presumably because they appear constantly in the review-style text it was built for. "Successfully" and "completed" don’t carry the same weight in its vocabulary.
## What to Use Instead
If you need sentiment analysis for technical text within the Apple ecosystem, you have three practical options.
### 1. Core ML with a Custom Model (The Serious Option)
Train a classifier using *Create ML* with data from your domain. If you have logs, Slack messages from your team, or manually tagged commits, you can create a `.mlmodel` that understands "kill the process" as neutral.
```swift
// Train with Create ML (macOS)
import CreateML

let data = try MLDataTable(contentsOf: trainingDataURL)

let classifier = try MLTextClassifier(
    trainingData: data,
    textColumn: "text",
    labelColumn: "sentiment"
)

try classifier.write(to: modelURL)
```
It’s more work, but the model runs on-device, just like `NLTagger`. The difference is that you control the training data.
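Once the trained model is added to your project, it can be loaded on-device and queried through `NLModel` (a sketch; `SentimentClassifier` is a hypothetical name for the class Xcode generates from the compiled `.mlmodel`):

```swift
import CoreML
import NaturalLanguage

// "SentimentClassifier" is the hypothetical Xcode-generated model class.
let coreMLModel = try SentimentClassifier(configuration: MLModelConfiguration()).model
let classifier = try NLModel(mlModel: coreMLModel)

print(classifier.predictedLabel(for: "Kill the background process") ?? "unknown")
// With domain-specific training data, this should come back as "neutral".
```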
### 2. Embeddings + Classifier (The Elegant Option)
Apple's `NaturalLanguage` framework provides sentence embeddings through `NLEmbedding` (since iOS 14 / macOS 11) and, since WWDC23, transformer-based contextual embeddings through `NLContextualEmbedding`. Either way, you extract vector representations of text and classify on top of them. Unlike raw sentiment scores, embeddings capture semantic relationships — "kill process" and "stop process" end up close in vector space.
```swift
import NaturalLanguage

if let embedding = NLEmbedding.sentenceEmbedding(for: .english) {
    let vector = embedding.vector(for: "Kill the background process")
    // Use the vector as input for a classifier
}
```
You’ll need a classifier on top (a simple `MLClassifier` will do), but the base representation is far richer than a thin lexical score.
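A quick sanity check of the representation is to compare distances directly (a sketch; `distance(between:and:distanceType:)` uses cosine distance here, where smaller means more similar, and exact values vary by OS version):

```swift
import NaturalLanguage

if let embedding = NLEmbedding.sentenceEmbedding(for: .english) {
    let kill  = "Kill the background process"
    let stop  = "Stop the background process"
    let angry = "I am furious and want to quit"

    print(embedding.distance(between: kill, and: stop, distanceType: .cosine))
    print(embedding.distance(between: kill, and: angry, distanceType: .cosine))
    // Expect the first distance to be noticeably smaller than the second.
}
```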
### 3. Domain-Specific Heuristics (The Quick Hack That Works)
If your use case is simple — say, filtering messages for "tone" in an internal tool — it's sometimes enough to neutralize technical terms with a small dictionary before passing the text to the model:
```swift
import Foundation  // for replacingOccurrences(of:with:)

let technicalNeutral: Set<String> = [
    "kill", "terminate", "abort", "destroy", "fatal",
    "crash", "panic", "dead", "zombie", "orphan",
    "reject", "deny", "block", "error", "fault"
]

/// Replaces known technical terms with a neutral word before scoring.
func preprocessed(_ text: String) -> String {
    var result = text.lowercased()
    for term in technicalNeutral {
        result = result.replacingOccurrences(of: term, with: "process")
    }
    return result
}
```
It’s a kludge, sure. But it's a transparent kludge you can audit and tweak. Better than blindly trusting a black box that can’t tell the difference between a postmortem and a breakup letter.
## The Takeaway
`NLTagger` isn’t broken. It does exactly what it was designed to do: generic sentiment analysis optimized for the kind of content Apple expects apps in the App Store to process — reviews, messages, notes.
The mistake is assuming that "sentiment analysis" is a universal problem. It isn’t. Sentiment is domain-specific. "Fatal" in a log is as neutral as "hello." "Fatal" in a conversation is a red flag. No generic model will distinguish the two without additional domain context.
If your text comes from a technical environment — logs, commits, developer chats, tickets — `NLTagger` is going to mislead you. Not out of malice, but out of ignorance. And the worst lie is the one wrapped in a `Double` with two decimals of fake precision.
Before plugging a sentiment model into your pipeline, ask yourself: does my text resemble an Amazon review? If the answer is no, `NLTagger` isn’t the right tool. And if the answer is yes… you probably don’t need sentiment analysis anyway.