DEV Community: AviKKi

Automatic text classification in 3 lines of code 🤗 [Tutorial]

AviKKi — Thu, 10 Dec 2020 00:51:49 +0000

This is a follow-up to the tutorial I wrote earlier on Hugging Face's transformers library. In this post I'll cover the zero-shot classification pipeline: what it is, and how a web developer or iOS developer can leverage this technology.

Spoiler

A little peek at what this library can do -

>>> from transformers import pipeline
>>> classifier = pipeline('zero-shot-classification')
>>> classifier('your delivery boy was really rude, the service sucks. #review #badreview', ['negative', 'positive'])
{'sequence': 'your delivery boy was really rude, the service sucks. #review #badreview', 'labels': ['negative', 'positive'], 'scores': [0.9980460405349731, 0.0019539250060915947]}

That means about 99.8% negative and 0.2% positive; in 3 lines you know whether a review is positive or negative.

Another one

>>> classifier('Get a free iphone today, just let us your banking details', ['spam', 'not spam'])
{'sequence': 'Get a free iphone today, just let us your banking details', 'labels': ['not spam', 'spam'], 'scores': [0.7550246119499207, 0.24497543275356293]}

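That is 100% a scam, obviously. Note that the output lists labels sorted by score, so here the model actually ranked 'not spam' first at about 75% and 'spam' at only about 24%; always read the labels list alongside the scores. Still, getting a confidence score in 3 lines of code is pretty decent.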

What is Zero Shot classification?

In machine learning, you feed a lot of labeled data into a model; this is called training. Then you pass in new data and the model predicts those labels. If you have different labels, you have to retrain the model.

This is very good at replacing humans, but overall it's dumb. Humans don't have to relearn every time we get a new question (or set of labels); we leverage our general understanding.

Zero-shot learning to the rescue: just pass in the data (text) and the labels, and the model tells you which label is most suitable.

Sounds like magic but it isn't.
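To make "which label is most suitable" concrete, here is a minimal sketch that reads a result dictionary in the shape printed in the spoiler above. The helper name top_label is my own and not part of transformers; it pairs each label with its score instead of assuming the labels come back in the order you passed them.

```python
def top_label(result):
    """Return (label, score) for the highest-scoring label in a pipeline result."""
    # Pair each label with its score; do not rely on the order you passed labels in.
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Result copied from the spam example in the spoiler above.
result = {
    "sequence": "Get a free iphone today, just let us your banking details",
    "labels": ["not spam", "spam"],
    "scores": [0.7550246119499207, 0.24497543275356293],
}
label, score = top_label(result)
print(label, round(score, 2))
```

Reading the label and score together like this avoids mistaking the first score for the first label you typed.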

My tests with the model

I tested the model on some real-world data to see how well it performs from a naive developer's perspective.

1) Predicting labels in Reddit

Reddit posts have flairs/labels, so you can filter posts by a specific flair.

I tried to predict flairs from a post's title in the following subreddits:

Science Subreddit (/r/science/) -
Very good performance for non-overlapping labels like Medicine, Technology, etc., but related labels like Health, Medicine, Animal, etc. were very frequently mislabeled.
Jobbit (/r/jobbit/) -
In this subreddit people offer jobs/projects [Hiring] and also showcase their resumes [For hire].
Using these labels as prediction targets gave very confusing results, so I changed the labels to hiring and resume and it worked like a charm, mostly with a confidence score of 90+.

2) Predicting type of SMS from my phone

I picked out some SMS from my inbox, and used labels 'OTP', 'bank statement', 'Offers'. Some stats -

Label	Number of messages	Correctly classified	Average score of correct predictions
OTP	5	5	0.92
bank statement	4	4	0.65
Offers	4	3	0.72
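
Pooling the rows of the table gives a quick overall accuracy. The numbers below are copied from the table; nothing else is assumed, and with only 13 messages this is a rough indication at best.

```python
# (number of messages, correctly classified) per label, copied from the table above
results = {
    "OTP": (5, 5),
    "bank statement": (4, 4),
    "Offers": (4, 3),
}

total = sum(n for n, _ in results.values())
correct = sum(c for _, c in results.values())
accuracy = correct / total
print(f"{correct}/{total} correct, accuracy {accuracy:.0%}")
```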

3) Predicting type of programming language

I haven't checked the dataset this model was trained on, so I wasn't sure the model could handle it, but it worked pretty decently. I didn't run any benchmarks, but overall it seems to work, with a lot of fluctuation in the confidence score. It gets confused between C and C++, but it distinguishes Python from C++-like languages very well.

It does confuse the label assembly language with C language, maybe because it wasn't trained on code repositories XD, or because a lot of C code has inline assembly.

The choice of label affects performance, for example go vs golang.

Note: This was just a random thought, don't use this in production.

What it means for all the developers?

You can now include some intelligent features in your applications without any deep learning expertise. This is very good for prototyping and hackathons, where you create a proof of concept and, if it takes off, hire an expert for a more accurate solution.

A few examples I can think of -

Support Ticket - forwarding a customer's support request to the correct department is crucial for resolving it quickly; use the support message as the text and the departments as the classes.
Ban negative/NSFW content - filter text in your application's message boards, comments, etc. for hateful or NSFW content.
Let us know in the comments - suggest some cool applications you can think of in the comments down below.
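
The support ticket idea above could be sketched roughly as follows. route_ticket and the department names are hypothetical, and fake_classifier is only a stand-in so the sketch runs without downloading a model; in a real app you would pass in the object returned by pipeline('zero-shot-classification'), which is called the same way as in the spoiler.

```python
def route_ticket(message, departments, classifier):
    """Return the department whose label scores highest for this message."""
    result = classifier(message, departments)
    best_label, _ = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return best_label


def fake_classifier(text, labels):
    """Stand-in for the real pipeline so the sketch runs without a model."""
    # Hypothetical scoring: count how many words of each label appear in the text.
    words = text.lower().split()
    raw = [sum(w in words for w in label.lower().split()) + 1e-6 for label in labels]
    total = sum(raw)
    return {"sequence": text, "labels": list(labels), "scores": [r / total for r in raw]}


departments = ["billing", "delivery", "technical support"]
print(route_ticket("I was charged twice, please fix my billing", departments, fake_classifier))
```

Keeping the classifier as a parameter also makes the routing logic easy to test without the multi-gigabyte model loaded.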

Cons

Model is huge

If you are thinking about running it on an Android or iOS device, don't. Memory usage on my Ubuntu desktop went from 6 GB to 13 GB when using this model; most people won't even be able to run this on their 4 GB laptop.

I haven't tried running it in a cloud environment, but obviously a 512 MB free-tier VPS won't do; you need 7 GB+ with a decent CPU, and a dedicated CPU will be best.

NOTE: These stats are from Python when I tested the library; I haven't looked at optimized model-serving performance.

Not reliable

You can use this for prototyping, when people are verifying its output, or when mislabeling doesn't cost your business a lot.

In general, take it with a grain of salt; it's very new technology and will take some time to mature.
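
One way to keep people in the loop is to accept a prediction automatically only when its score is high and flag everything else for review. This is a minimal sketch: the triage name and the 0.9 threshold are my assumptions, not something tuned or recommended by the library, and the two sample results are copied from the spoiler.

```python
def triage(result, threshold=0.9):
    """Accept the top label automatically only when its score clears the threshold."""
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    # Below the threshold, keep the guess but flag it for a person to check.
    return {"label": label, "score": score, "needs_review": score < threshold}


confident = {"labels": ["negative", "positive"], "scores": [0.9980460405349731, 0.0019539250060915947]}
unsure = {"labels": ["not spam", "spam"], "scores": [0.7550246119499207, 0.24497543275356293]}
print(triage(confident))
print(triage(unsure))
```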

Some Tips

Larger text is classified with better accuracy
Choose labels wisely; as found above, 'For hire' and 'Hiring' were a really bad set of labels.

Note: This is my naive observation, take it with a grain of salt.

Thank you for reading

Let me know what you think about this article series, down in the comments.

Give this article a like if this was helpful.

Follow me to get notified about similar articles; more are on the way ;)

Disclaimer: I'm not affiliated with Hugging Face in any manner XP; this project just has a lot of untapped potential, with very few tutorials outside the deep learning community.

[Tutorial] State-of-the-art NLP with single line of code 🤗

AviKKi — Fri, 04 Dec 2020 09:59:28 +0000

Recently, deep learning has become very easy for common people to use and train; with just a few lines of code you can teach a computer to differentiate between cat and dog photos.

Since the introduction of transformers, deep learning models have been achieving state-of-the-art results on most natural language processing tasks, and with Hugging Face's 🤗 PyPI library transformers you can now use these models with a few lines of code.

The Jupyter notebook below has the instructions for it -
Jupyter Notebook

My first reaction to it.

Sentiment Analysis

>>> from transformers import pipeline
>>> s = pipeline("sentiment-analysis")
>>> s("Obama is not a bad person")
[{'label': 'POSITIVE', 'score': 0.9990673065185547}]

A traditional model that works on bag-of-words would look at the word bad and say the statement is negative, but this model can understand the meaning of not a bad.
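
To see why a bag-of-words approach trips on this sentence, here is a deliberately naive sketch. The word lists and the function name are made up for illustration; real bag-of-words classifiers learn weights from data, but they share the same blind spot for word order.

```python
NEGATIVE_WORDS = {"bad", "rude", "sucks"}
POSITIVE_WORDS = {"good", "great", "nice"}


def bag_of_words_sentiment(text):
    """Label text by counting sentiment words, ignoring word order entirely."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"


# "not" carries no weight here, so "bad" alone decides the label.
print(bag_of_words_sentiment("Obama is not a bad person"))
```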

Text Generation

>>> g = pipeline("text-generation")
>>> result  = g("How frustrating it is trying to organize your work using")
>>> print(result[0]['generated_text'])
How frustrating it is trying to organize your work using a calendar

If your plan is to use any calendar at all, you might have one or more of those issues which you can add to your calendar. You can use your calendar to organize activities
>>>

Although this just generates random text, you can add more context in the initial sentence, and it is much better than a hidden Markov chain model.

This can be used for writing articles, captions for your social media posts, etc.

Summarization

I summarized the first 3 paragraphs of this dev.to article, and this was the output:

Auto-scaling instances are supposed to handle traffic fluctuation regardless of the number of users and requests.

The unnecessary resources should be eliminated, and the required ones should be triggered following the demand.

Amazon Auto Scaling solves the problem by automatically keeping the currently important instances active and removing the ones that are no longer needed

I can save so much of my time now by summarizing YouTube video subtitles and articles in my daily news feed.

Thank you for reading

There is a lot you can do with this library. I'll be writing some tutorials on it in the near future; follow me for updates.

Django Postgres Text Search for 10M+ rows

AviKKi — Tue, 11 Aug 2020 22:28:59 +0000

In a recent project I had to add full text search to an existing Django project; below are notes on the challenges I encountered and how I solved them.

For easy reading I have listed a brief walk-through and the limitations I found, followed by a more detailed log.

Project Overview

This project involved loading a 30GB+ CSV containing information about books into the database, and implementing full text search on those books' titles, authors, tags, and categories.

Overall walk-through

Adding FTS with Postgres is super easy -
- Add django.contrib.postgres to INSTALLED_APPS
- Perform a search as follows

Book.objects.filter(title__search='A little girl')

Indexing to increase performance
- Add a SearchVectorField field to the model

  # for pre computed search vectors
  search_vector = SearchVectorField(null=True, blank=True)

Create Index

  class Meta(object):
      indexes = [ GinIndex(fields=['search_vector']) ]

Increase work_mem; the default work_mem of Postgres is too low for millions of rows.
Edit work_mem in the postgresql.conf file and restart your db.
A bit of a sed command does it if you are using Docker.
Caching
I cached the whole web page on a Redis instance, along with certain queries like the result count (a very heavy one) that would otherwise be repeated for every search page load; this required overriding the Paginator and ListView classes.
Increase shared memory; generally not required, but my Docker container was running out of memory for some queries.
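
The exact sed command is not shown above, so here is a Python stand-in for the same kind of edit to the work_mem line. set_work_mem is a hypothetical helper, the sample config text is illustrative, and 64MB is only a placeholder value, not a recommendation.

```python
import re


def set_work_mem(conf_text, value):
    """Return conf_text with work_mem set to value, uncommenting the line if needed."""
    # Match "work_mem = ..." with an optional leading "#"; maintenance_work_mem is left alone.
    pattern = re.compile(r"^[ \t]*#?[ \t]*work_mem[ \t]*=.*$", flags=re.MULTILINE)
    new_line = f"work_mem = {value}"
    if pattern.search(conf_text):
        return pattern.sub(lambda match: new_line, conf_text, count=1)
    # No existing setting: append one at the end.
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"


sample = "shared_buffers = 128MB\n#work_mem = 4MB\t\t# min 64kB\nmaintenance_work_mem = 64MB\n"
print(set_work_mem(sample, "64MB"))
```

You would read postgresql.conf, pass its contents through the helper, write it back, and then restart the database as described above.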

Limitations Found

Complex queries can be really slow, for example
- sorting results based on similarity
- sorting search results (aka ORDER BY) based on the number of comments on a book, or any other non-text column.
A bit hard to tune
- Trading off relevant results against the maximum possible number of results (good for SEO) requires complex queries, which take too long to process.

Detailed log TL;DR

Available options

There are two major ways of achieving this -

django haystack plugin

With this you can integrate a search engine with your Django application. You have several options, like Solr and Elasticsearch, for a search backend. These are really good at handling text search over a large number of documents, but come with overhead in the form of server cost, development effort, etc.
PostgreSQL's full text search

Postgres has a full text search feature; in SQL you just have to add a WHERE clause and you have fully working text search, and on Django's side you can use the .filter method. It is not a dedicated search application, so it has many shortcomings; for small applications it works great out of the box, but as the database grows you'll have to do some tweaking.

Implementation

Config

Add django.contrib.postgres to INSTALLED_APPS.

# settings.py
....
INSTALLED_APPS = [
    ...
    'django.contrib.postgres', # for fts search
    ...
]
...

Model

from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Book(models.Model):
    title = models.CharField(max_length=300)
    poster_url = models.URLField()
    downloads = models.IntegerField()
    likes = models.IntegerField()
    comments_count = models.IntegerField()
    search_vector = SearchVectorField(null=True, blank=True)  # for pre-computed search vectors

    # tags, categories, authors remaining
    # raw data fields
    _tags = models.TextField(default="", blank=True)
    _categories = models.TextField(default="", blank=True)
    _authors = models.TextField(default="", blank=True)

    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]

Above is a typical Django ORM model. search_vector contains the vector representation of a book's title, tags, categories, and authors. Postgres converts both the search query and the text fields into vectors and then compares them for a match; by pre-computing the search vector and indexing it with a GinIndex, we improve query speed.
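
GIN stands for Generalized Inverted Index, and the intuition for why it helps can be shown with a toy inverted index in plain Python. This is only an illustration: the function names are mine, and unlike real Postgres it does no stemming or stop-word removal.

```python
from collections import defaultdict


def to_terms(text):
    """Very rough stand-in for a search vector: lowercase, unique words."""
    return set(text.lower().split())


def build_index(rows):
    """Map each term to the set of row ids containing it (an inverted index)."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in to_terms(text):
            index[term].add(row_id)
    return index


def search(index, query):
    """Return ids of rows containing every term of the query."""
    term_sets = [index.get(term, set()) for term in to_terms(query)]
    return set.intersection(*term_sets) if term_sets else set()


rows = {1: "A little girl", 2: "The little prince", 3: "Gone girl"}
index = build_index(rows)
print(sorted(search(index, "little girl")))
```

The lookup touches only the entries for the query terms instead of scanning every row, which is the kind of saving the index buys on a large table.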

search_vector can be computed with the Python code below:

from django.contrib.postgres.search import SearchVector
Book.objects.update(search_vector=SearchVector('title', '_tags', '_categories', '_authors'))

Using authors, tags and categories as TextField helps in loading the huge CSV file faster.

View

The view was implemented with Django's generic ListView.

Profiling

After this I used Django Debug Toolbar to look at the queries being performed; there were 2 major issues.

Count(*) was slow for queries with ~100K+ results

Count(*) is a notoriously expensive operation in SQL: you basically have to scan through the whole table to do it. There are some workarounds, like storing the count separately or partial indexes, but none are applicable to our use case.

I cached the queries for this
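
The idea of caching the count can be sketched without any framework. The project itself used Redis; this sketch swaps in an in-process dictionary to stay self-contained, and the CountCache name and the 300-second lifetime are my assumptions. In a Django view the compute argument could be something like lambda: queryset.count().

```python
import time


class CountCache:
    """Remember expensive counts per search term for a limited time."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}

    def get_count(self, key, compute):
        """Return the cached count for key, calling compute() only on a miss or expiry."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[1] < self.ttl_seconds:
            return hit[0]
        value = compute()
        self._store[key] = (value, now)
        return value


calls = []


def slow_count():
    """Pretend to be the expensive COUNT(*) query."""
    calls.append(1)
    return 123456


cache = CountCache()
cache.get_count("little girl", slow_count)
cache.get_count("little girl", slow_count)
print(len(calls))  # the expensive function ran only once
```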

Query time increased drastically once the number of search results grew past a certain point.
