Kamban

Posted on Jun 28, 2019 • Originally published at kambanthemaker.com on Jun 28, 2019

A light-weight text classifier in Python

Text classification

Text classification is a well-known topic in NLP (Natural Language Processing). Broadly, it is divided into two categories:

Supervised
Unsupervised

Supervised classification

The goal is to assign a given text to one of a predefined set of categories. Say you have incoming emails that you want to group into two categories, Social or Promotions. In this case you need a training set for each category and build a model from it; the classification engine then classifies any given text.

Unsupervised classification

It does not need a training dataset; instead it groups a given set of documents into logical units. Unsupervised classification is a form of clustering, and K-Means is a popular algorithm here.

Lightweight supervised classification

This article focuses on supervised classification. Generally this requires a decent amount of training data to give accurate predictions. I wanted classification for one of my projects related to email handling. Most of the tools out there required a good amount of data, which I didn't have, but I did have a small amount of accurate data. So I built a lightweight classifier in Python that takes a small training dataset and produces results based on the words.

Check out https://github.com/kambanthemaker/textclassifier

This tool relies entirely on word presence: it counts the total occurrences of the input words in the training set and returns a list of categories.

Training file format

__label__ category1 training data
__label__ category1 some other data
__label__ category2 some data

The text that follows __label__ is the category name, followed by a space and then the input sentence. This format was chosen to be consistent with the fastText library, so that if you want to move to it later, the training file can remain the same.
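As a rough illustration of how a file in this format could be read, here is a minimal sketch. The function name parse_training_lines, the lowercasing of words, and the skipping of malformed lines are assumptions made for this example, not necessarily what the textclassifier library does.

```python
from collections import defaultdict

LABEL_PREFIX = "__label__"


def parse_training_lines(lines):
    """Group training words by category (hypothetical helper).

    Assumes each line looks like: __label__ <category> <sentence>
    """
    words_by_category = defaultdict(list)
    for line in lines:
        parts = line.split()
        # Skip blank lines and lines that do not start with the label marker.
        if len(parts) < 2 or parts[0] != LABEL_PREFIX:
            continue
        category, words = parts[1], parts[2:]
        # Lowercasing is an assumption, so that "Data" and "data" match.
        words_by_category[category].extend(word.lower() for word in words)
    return dict(words_by_category)


training_lines = [
    "__label__ category1 training data",
    "__label__ category1 some other data",
    "__label__ category2 some data",
]
print(parse_training_lines(training_lines))
```

To read from a file instead, pass the open file object, since iterating over it yields one line at a time.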

Invoking the classifier

import classifier
results = classifier.classify("offer linkedin linkedin", "somerandomcategory")

results will be a list of tuples, like [('category1', 10), ('category2', 5)], sorted with the top match first. 10 and 5 are the scores, i.e. the number of word matches. "somerandomcategory" is the default category that you will receive in the event of no match.

This small library should work in both Python 2.x and 3.x, and has no dependencies.

How it works

This library turns the training data into per-category word lists, compares them with the given input text, and orders the results by the number of word matches. For the above training data, it will have

{'category1':['training','data','some','other','data'],'category2':['some','data']}

When you give the input Hello data, it will return [('category1', 2), ('category2', 1)], since data occurs twice in category1 and once in category2.
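The counting idea can be sketched in a few lines. This is an illustration under assumptions, not the library's actual code: the name classify_sketch, the lowercasing of the input, and the shape of the no-match result (the default category with a score of 0) are all hypothetical.

```python
from collections import Counter


def classify_sketch(text, words_by_category, default_category):
    """Score each category by how often the input words occur in its training words."""
    input_words = text.lower().split()
    scores = []
    for category, words in words_by_category.items():
        counts = Counter(words)
        # Counter returns 0 for words it has never seen.
        score = sum(counts[word] for word in input_words)
        if score > 0:
            scores.append((category, score))
    if not scores:
        # Assumed fallback shape: the caller's default category with score 0.
        return [(default_category, 0)]
    # Highest score first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


trained = {
    "category1": ["training", "data", "some", "other", "data"],
    "category2": ["some", "data"],
}
print(classify_sketch("Hello data", trained, "somerandomcategory"))
# [('category1', 2), ('category2', 1)]
```

Because scoring is a plain word count, categories with more training text tend to score higher; with a small, balanced training set that trade-off is usually acceptable.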

Tools for Text Classification

When you need more powerful classification, a few good options are:

Apache Mahout

https://www.oracle.com/technetwork/community/bookstore/taming-text-sample-523387.pdf has some basic details.

Facebook's FastText

https://github.com/facebookresearch/fastText is written in C++ with Python bindings and is designed for fast training and prediction. Our training set format is compatible with this tool.

AWS Classification

https://aws.amazon.com/comprehend is an NLP service for finding insights and relationships in text. The other two tools discussed here have to be set up on your own server, while this one is accessed through an API and you pay as per your usage.