I have a Ruby application that pulls in blog and forum posts from around the internet (like a game dev tracker), and I needed a way to classify them. Some simple rules got me 75% of the way there: does a post include the word "patch" or "patch notes"? Then it's likely a patch notes or announcement post. But that wasn't completely reliable. For example, a post reporting a bug in a patch might be worded as "Empty inventory because of Patch 9.0.0.2076", or a post might announce a patch deployment without including any notes.
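To make that failure mode concrete, here is a minimal sketch of the kind of keyword rule described above. The `keyword_classify` helper and its regex are hypothetical, written only for illustration; they are not the rules from my actual application.

```ruby
# Hypothetical rule-based classifier: the helper name and the rule are
# illustrative assumptions, not code from the original application.
def keyword_classify(title)
  # Treat any title containing the standalone word "patch" as patch notes.
  title.match?(/\bpatch\b/i) ? "patch_notes" : nil
end

titles = [
  "v6.31 Patch Notes",                            # real patch notes
  "Empty inventory because of Patch 9.0.0.2076",  # a bug report, not patch notes
  "PC Client Update - 12/7/2018",                 # patch notes without the keyword
]

titles.each do |title|
  puts "#{title.inspect} => #{keyword_classify(title).inspect}"
end
```

The rule gets the first title right, wrongly labels the bug report as patch notes, and misses the client update entirely because it never mentions the keyword.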
Recently I found the smalltext gem, which has solved most of my problems. It classifies quickly and is pretty simple to train.
I built a training script that hardcodes training data into categories, then trains and saves a model.
lib/train_classifier.rb
require 'smalltext'

# Hardcoded training data: category => example post titles
data = {
  patch_notes: [
    "PC Client Update - 12/7/2018",
    "v6.31 Patch Notes",
  ],
  news: [
    "$1,000,000 Winter Royale + Tournament Updates",
    "Pop-Up Cups Update 12/5/2018",
  ],
}

s = Smalltext::Classifier.new

# Register every title under its category
data.each do |classification, titles|
  titles.each do |title|
    s.add_item(classification.to_s, title)
  end
end

# Train the model and save it to disk
s.train
s.save_model("classifications.model")
As time goes on, I add clear-cut examples and incorrectly classified posts to this training script to improve the model.
app/services/classifier.rb
class Classifier
  def self.classify(text, threshold = 0.99)
    # Load the classifier and the trained model once, then reuse them
    @classifier ||= Smalltext::Classifier.new.tap do |classifier|
      classifier.load_model('classifications.model')
    end

    classified = @classifier.classify(text).to_h
    # { "patch_notes" => 0.9998900049414214 }

    # Return the first category that meets the threshold, or nil if none does
    category, _score = classified.detect { |_category, score| score >= threshold }
    category
  end
end
This class gives my application an entry point to the classifier. I only care about reasonably certain classifications, so it returns a category only when its score meets a threshold (0.99 by default). My application gets either a category or nil, which makes the result easier to handle.
> Classifier.classify("V7.00 PATCH NOTES")
sentence: V7.00 PATCH NOTES
classification: [["patch_notes", 0.9998900049414214]]
=> "patch_notes"
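The threshold check can be tried on its own, without the gem or a trained model. This is a minimal sketch: `pick_category` is a hypothetical helper that mirrors the selection step in the class above, and the scores are made up for illustration rather than real classifier output.

```ruby
# Hypothetical helper mirroring the threshold check in Classifier.classify.
# It takes a hash of category => score and returns the first category whose
# score meets the threshold, or nil when none does.
def pick_category(scores, threshold = 0.99)
  category, _score = scores.detect { |_category, score| score >= threshold }
  category
end

# Made-up scores for illustration only (not real smalltext output).
confident = { "patch_notes" => 0.9998, "news" => 0.0002 }
uncertain = { "patch_notes" => 0.62, "news" => 0.38 }

p pick_category(confident)       # => "patch_notes"
p pick_category(uncertain)       # => nil
p pick_category(uncertain, 0.6)  # => "patch_notes"
```

Lowering the threshold trades certainty for coverage: more posts get a category, but more of them will be wrong.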