Text classification is a well-known topic in NLP (Natural Language Processing). Broadly, it is divided into two categories:
Supervised classification
This categorizes a given text into a predefined set of categories. Say you have incoming emails that you want to group into two categories, Social or Promotions. In this case, you need a training set for each category and build a model from it; the classification engine then classifies any given text.
Unsupervised classification
This does not need a training dataset; instead, it groups a given set of documents into logical units. This kind of unsupervised learning is a form of clustering, and K-Means is a popular algorithm for it.
Lightweight supervised classification
This article is about supervised classification. Generally, this requires a decent amount of training data to give the right prediction. I wanted classification for one of my projects related to email handling. Most of the tools out there required a good amount of data, which I didn't have, but I did have a small amount of accurate data. So I built a lightweight classifier in Python that takes a small training dataset and produces results based on the words.
This tool relies entirely on word presence: it counts the total occurrences of the input's words in the training set and returns a list of categories.
Training file format
__label__ category1 training data
__label__ category1 some other data
__label__ category2 some data
The text that follows __label__ is the category name, followed by a space and then the input sentence. This format was chosen to be consistent with the fastText library, so that if you want to move to it later, the training file can remain the same.
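To make the format concrete, here is a minimal sketch of how such a file could be parsed into (category, text) pairs. It is not the library's actual code: the names `parse_training_line` and `parse_training_data` are hypothetical, and the choice to accept the category either directly after `__label__` (the fastText style) or after a space is an assumption made for illustration.

```python
LABEL_PREFIX = "__label__"


def parse_training_line(line):
    """Return (category, text) for a labelled line, or None if it has no label."""
    line = line.strip()
    if not line.startswith(LABEL_PREFIX):
        return None
    # Drop the prefix. Assumption: an optional space may separate the
    # prefix from the category name, so leading whitespace is stripped.
    rest = line[len(LABEL_PREFIX):].lstrip()
    parts = rest.split(None, 1)  # first token is the category, rest is the text
    if not parts:
        return None
    category = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return category, text


def parse_training_data(lines):
    """Parse an iterable of lines, skipping any line that is not labelled."""
    pairs = []
    for line in lines:
        parsed = parse_training_line(line)
        if parsed is not None:
            pairs.append(parsed)
    return pairs


sample = [
    "__label__ category1 training data",
    "__label__ category1 some other data",
    "__label__ category2 some data",
]
pairs = parse_training_data(sample)
print(pairs)
```

The same function can be fed an open file object instead of a list, since it only iterates over lines.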
Invoking the classifier
import classifier
results = classifier.classify("offer linkedin linkedin", "somerandomcategory")
results will be a list of tuples, like
[('category1', 10), ('category2', 5)]
sorted with the top match first. 10 and 5 are the scores, i.e. the number of word matches. "somerandomcategory" is the default category that you will receive if nothing matches.
This small library should work in both Python 2.x and 3.x, and has no dependencies.
How it works
This library turns the training data into word counts per category, compares those counts with the words of the given input text, and orders the result by the number of word matches. For the above training data, it will hold the count of each word under its category. When you give it the input Hello data, it will return [('category1', 2), ('category2', 1)], since data occurs twice in category1 and once in category2.
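The counting approach described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's real implementation: the function names `train` and `classify`, the lowercasing of words, whitespace tokenization, and the shape of the no-match result (the default category with a score of 0) are all choices made here for the example.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Build {category: Counter of words} from (category, text) pairs."""
    counts = defaultdict(Counter)
    for category, text in pairs:
        # Assumption: words are lowercased and split on whitespace.
        counts[category].update(text.lower().split())
    return counts


def classify(counts, text, default_category):
    """Score each category by summing its training counts for every input word."""
    words = text.lower().split()
    scores = []
    for category, word_counts in counts.items():
        score = sum(word_counts[word] for word in words)
        if score > 0:
            scores.append((category, score))
    if not scores:
        # Assumption: with no match, return only the default category.
        return [(default_category, 0)]
    # Highest score first; ties are ordered by category name.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


training = [
    ("category1", "training data"),
    ("category1", "some other data"),
    ("category2", "some data"),
]
counts = train(training)
print(classify(counts, "Hello data", "somerandomcategory"))
# prints [('category1', 2), ('category2', 1)]
```

Because the score is a plain sum of counts, a category with more training text tends to score higher; with a small, balanced training set like the one in this article that is usually acceptable.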
Tools for Text Classification
When you need more powerful classification, a few good options are:
- Apache Mahout
https://www.oracle.com/technetwork/community/bookstore/taming-text-sample-523387.pdf has some basic details.
- Facebook's FastText
https://github.com/facebookresearch/fastText is written in C++ with Python bindings and is designed for fast training and prediction. Our training set format is compatible with this tool.
- AWS Classification
https://aws.amazon.com/comprehend is an NLP service for finding insights and relationships in text. The other two tools discussed here have to be set up on your own server, while this one is accessed through an API and you pay according to your usage.