<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AviKKi</title>
    <description>The latest articles on DEV Community by AviKKi (@avikki).</description>
    <link>https://dev.to/avikki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F403153%2F35adcda6-4b06-4ecc-99f3-06a05613bc71.png</url>
      <title>DEV Community: AviKKi</title>
      <link>https://dev.to/avikki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/avikki"/>
    <language>en</language>
    <item>
      <title>Automatic text classification in 3 lines of code 🤗 [Tutorial]</title>
      <dc:creator>AviKKi</dc:creator>
      <pubDate>Thu, 10 Dec 2020 00:51:49 +0000</pubDate>
      <link>https://dev.to/avikki/automatic-text-classification-in-3-lines-of-code-tutorial-17l4</link>
      <guid>https://dev.to/avikki/automatic-text-classification-in-3-lines-of-code-tutorial-17l4</guid>
      <description>&lt;p&gt;This is a follow-up tutorial on Hugging Face's library &lt;code&gt;transformers&lt;/code&gt; &lt;a href="https://dev.to/avikki/tutorial-state-of-the-art-nlp-with-single-line-of-code-44cm"&gt;i wrote earlier&lt;/a&gt;. In this post I'll cover zero shot classification pipeline;I'll cover what this is, and how a web developer or iOS developer can leverage this technology. &lt;/p&gt;

&lt;h3&gt;
  
  
  Spoiler
&lt;/h3&gt;

&lt;p&gt;A little peek at what this library can do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from transformers import pipeline
&amp;gt;&amp;gt;&amp;gt; classifier = pipeline('zero-shot-classification')
&amp;gt;&amp;gt;&amp;gt; classifier('your delivery boy was really rude, the service sucks. #review #badreview', ['negative', 'positive'])
{'sequence': 'your delivery boy was really rude, the service sucks. #review #badreview', 'labels': ['negative', 'positive'], 'scores': [0.9980460405349731, 0.0019539250060915947]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means about 99.8% negative and 0.2% positive; in 3 lines you know whether a review is positive or negative.&lt;/p&gt;

&lt;p&gt;Another one&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; classifier('Get a free iphone today, just let us your banking details', ['spam', 'not spam'])
{'sequence': 'Get a free iphone today, just let us your banking details', 'labels': ['not spam', 'spam'], 'scores': [0.7550246119499207, 0.24497543275356293]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



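&lt;p&gt;That message is obviously a scam, but note that the output above ranks &lt;code&gt;not spam&lt;/code&gt; first with a score of about 0.76 and gives &lt;code&gt;spam&lt;/code&gt; only about 0.24, so the model gets this one wrong. The wording of the labels matters a lot, as the tests further down show.&lt;/p&gt;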

&lt;h2&gt;
  
  
  What is Zero Shot classification?
&lt;/h2&gt;

&lt;p&gt;In traditional machine learning, you feed a lot of labeled data into a model; this is called training. Then you pass in new data and the model predicts labels for it. If you need a different set of labels, you have to retrain the model.&lt;/p&gt;

&lt;p&gt;This is very good at replacing humans, but overall it's dumb. Humans don't have to learn from scratch every time we get a new question (or set of labels); we leverage our general understanding.&lt;/p&gt;

&lt;p&gt;Zero-shot learning to the rescue: &lt;strong&gt;just pass in the data (text) and the labels, and the model tells you which label is most suitable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sounds like magic but it isn't.&lt;/p&gt;
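
&lt;p&gt;To make "which label is most suitable" concrete, here is a minimal sketch that reads the result dictionary from the first spoiler example and picks the winning label. The helper name &lt;code&gt;best_label&lt;/code&gt; is my own, not part of the library, and the dictionary is pasted in by hand so the snippet runs without downloading the model.&lt;/p&gt;

```python
# Result dictionary copied from the first spoiler example above.
result = {
    "sequence": "your delivery boy was really rude, the service sucks. #review #badreview",
    "labels": ["negative", "positive"],
    "scores": [0.9980460405349731, 0.0019539250060915947],
}


def best_label(result):
    """Hypothetical helper: return the (label, score) pair with the highest score."""
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


label, score = best_label(result)
print(f"{label}: {score:.1%}")
```

&lt;p&gt;In a real application you would pass the output of &lt;code&gt;classifier(text, labels)&lt;/code&gt; to the helper instead of a hard-coded dictionary.&lt;/p&gt;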

&lt;h2&gt;
  
  
  My tests with the model
&lt;/h2&gt;

&lt;p&gt;I tested the model on some real-world data to see how well it performs from a naive developer's perspective.&lt;/p&gt;

&lt;p&gt;1) Predicting labels in Reddit&lt;/p&gt;

&lt;p&gt;Reddit posts can carry flairs (labels), so you can filter posts by a specific flair.&lt;/p&gt;

&lt;p&gt;I tried to predict flairs from a post's title in the following subreddits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Science Subreddit (/r/science/) - &lt;br&gt;
Very good performance for non-overlapping labels like Medicine, Technology, etc., but with closely related labels like Health, Medicine, Animal, etc. posts were very frequently mislabeled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jobbit (/r/jobbit/) - &lt;br&gt;
In this subreddit people offer jobs/projects [Hiring] and also showcase their resumes [For hire].&lt;br&gt;
Using these flairs as prediction targets gave very confusing results, so I changed the labels to &lt;code&gt;hiring&lt;/code&gt; and &lt;code&gt;resume&lt;/code&gt; and it worked like a charm, with confidence scores mostly above 90%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Predicting type of SMS from my phone&lt;/p&gt;

&lt;p&gt;I picked out some SMS messages from my inbox and used the labels 'OTP', 'bank statement', and 'Offers'. Some stats:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Number of messages&lt;/th&gt;
&lt;th&gt;Correctly classified&lt;/th&gt;
&lt;th&gt;Average score of correct predictions&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTP&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bank statement&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offers&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;
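
&lt;p&gt;The table is easier to judge as accuracy numbers. Here is a small sketch that computes per-label and overall accuracy from the counts above; the variable names are mine, and with only 13 messages the result is a rough impression rather than a benchmark.&lt;/p&gt;

```python
# (label, number of messages, correctly classified), copied from the table above.
rows = [
    ("OTP", 5, 5),
    ("bank statement", 4, 4),
    ("Offers", 4, 3),
]

# Fraction of correctly classified messages for each label.
per_label = {label: correct / total for label, total, correct in rows}

total_messages = sum(total for _, total, _ in rows)
total_correct = sum(correct for _, _, correct in rows)
overall = total_correct / total_messages

for label, acc in per_label.items():
    print(f"{label}: {acc:.0%}")
print(f"overall: {total_correct}/{total_messages} = {overall:.0%}")
```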

&lt;p&gt;3) Predicting type of programming language&lt;/p&gt;

&lt;p&gt;I haven't checked the dataset this model was trained on, so I wasn't sure whether it could handle this, but it worked reasonably well. I didn't run any benchmarks, but overall it seems to work, with a lot of fluctuation in the confidence score. It gets confused between C and C++, but it distinguishes Python from C++-like languages very well.&lt;/p&gt;

&lt;p&gt;It does confuse the label &lt;code&gt;assembly language&lt;/code&gt; with &lt;code&gt;c language&lt;/code&gt;, maybe because it wasn't trained on code repositories XD, or because a lot of C code has inline assembly.&lt;/p&gt;

&lt;p&gt;The choice of label affects performance, for example go vs golang.&lt;/p&gt;

&lt;p&gt;Note: This was just a random experiment, don't use this in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it mean for developers?
&lt;/h2&gt;

&lt;p&gt;You can now include some intelligent features in your applications without any deep learning expertise. This is very good for prototyping and hackathons, where you create a proof of concept and, if it takes off, hire an expert to build a more accurate solution.&lt;/p&gt;

&lt;p&gt;A few examples I can think of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support tickets - forwarding a customer's support request to the correct department is crucial for resolving it quickly; use the support message as the text and the departments as the labels.&lt;/li&gt;
&lt;li&gt;Ban negative/NSFW content - filter the text in your application's message boards, comments, etc. for hateful or NSFW content.&lt;/li&gt;
&lt;li&gt;Let me know in the comments - suggest some cool applications you can think of down below.&lt;/li&gt;
&lt;/ul&gt;
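
&lt;p&gt;The support ticket idea could be sketched roughly like this. The department names and the &lt;code&gt;route_ticket&lt;/code&gt; helper are made up for illustration, and &lt;code&gt;fake_classifier&lt;/code&gt; is a stand-in with invented scores so the snippet runs without the model; in practice you would pass in the object returned by &lt;code&gt;pipeline('zero-shot-classification')&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical department names, used only for illustration.
DEPARTMENTS = ["billing", "shipping", "technical support"]


def route_ticket(message, classifier, departments=DEPARTMENTS):
    """Hypothetical helper: return the department label with the highest score."""
    result = classifier(message, departments)
    pairs = zip(result["labels"], result["scores"])
    department, _score = max(pairs, key=lambda pair: pair[1])
    return department


def fake_classifier(text, labels):
    """Stand-in for the real pipeline; the scores are invented."""
    scores = [0.9 if label == "billing" else 0.05 for label in labels]
    return {"sequence": text, "labels": list(labels), "scores": scores}


print(route_ticket("I was charged twice this month", fake_classifier))
```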

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Model is huge
&lt;/h4&gt;

&lt;p&gt;If you are thinking about running it on an Android or iOS device, &lt;strong&gt;don't&lt;/strong&gt;. Memory usage on my Ubuntu desktop went from 6 GB to 13 GB when using this model; most people won't even be able to run this on a 4 GB laptop. &lt;/p&gt;

&lt;p&gt;I haven't tried running it in a cloud environment, but obviously a 512 MB free-tier VPS won't do; you need 7 GB+ of memory and a decent CPU, ideally a dedicated one.  &lt;/p&gt;

&lt;p&gt;NOTE: These numbers are from Python, when I tested the library; I haven't looked at optimized model serving performance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Not reliable
&lt;/h4&gt;

&lt;p&gt;You can use this for prototyping, when a human verifies the predictions, or when a mislabeled item doesn't cost your business much.&lt;/p&gt;
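
&lt;p&gt;One way to keep a human in the loop is to send low-confidence predictions for manual review. This is only a sketch: the 0.8 threshold and the &lt;code&gt;needs_human_review&lt;/code&gt; helper are my assumptions, and the two score lists are rounded from the spoiler examples at the top of this post.&lt;/p&gt;

```python
import operator

# Assumed cut-off; tune it on your own data.
REVIEW_THRESHOLD = 0.8


def needs_human_review(result, threshold=REVIEW_THRESHOLD):
    """Hypothetical helper: True when the best score is below the threshold."""
    # operator.lt(a, b) is the function form of "a is less than b".
    return operator.lt(max(result["scores"]), threshold)


# Scores rounded from the two spoiler examples in this post.
review = {"labels": ["negative", "positive"], "scores": [0.998, 0.002]}
sms = {"labels": ["not spam", "spam"], "scores": [0.755, 0.245]}

print(needs_human_review(review))
print(needs_human_review(sms))
```

&lt;p&gt;With this threshold the review is accepted automatically, while the SMS example is flagged for a human to check.&lt;/p&gt;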

&lt;p&gt;In general, take it with a grain of salt; it's very new technology and will take some time to mature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Longer text is classified with better accuracy.&lt;/li&gt;
&lt;li&gt;Choose labels wisely; as found above, 'For hire' and 'Hiring' were a really bad set of labels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: This is my naive observation, take it with a grain of salt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;I haven't touched on the technical details of this but if you want to dig deeper here are my recommendations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Joe's article on this pipeline; he is the mastermind who added it - &lt;a href="https://joeddav.github.io/blog/2020/05/29/ZSL.html"&gt;link&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zero-shot Text Classification With Generative Language Models &lt;a href="https://arxiv.org/abs/1912.10165"&gt;link&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach &lt;a href="https://arxiv.org/abs/1909.00161"&gt;link&lt;/a&gt;; this paper originally proposed the idea.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Thank you for reading
&lt;/h1&gt;

&lt;p&gt;Let me know what you think about this article series, down in the comments. &lt;/p&gt;

&lt;p&gt;Give this article a like if this was helpful.&lt;/p&gt;

&lt;p&gt;Follow me to get notified about similar articles; more are on the way ;)&lt;/p&gt;

&lt;p&gt;Disclaimer: I'm not affiliated with Hugging Face in any manner XP; this project just has a lot of untapped potential, with very few tutorials outside the deep learning community.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>datascience</category>
    </item>
    <item>
      <title>[Tutorial] State-of-the-art NLP  with single line of code 🤗</title>
      <dc:creator>AviKKi</dc:creator>
      <pubDate>Fri, 04 Dec 2020 09:59:28 +0000</pubDate>
      <link>https://dev.to/avikki/tutorial-state-of-the-art-nlp-with-single-line-of-code-44cm</link>
      <guid>https://dev.to/avikki/tutorial-state-of-the-art-nlp-with-single-line-of-code-44cm</guid>
      <description>&lt;p&gt;In the past deep learning has become very easy for common people to use and train; with just a few lines of code you can teach a computer to differentiate between cat and dog photos.&lt;/p&gt;

&lt;p&gt;After the introduction of transformers, deep learning models are achieving state of the art results at most of the Natural Language processing tasks, and with Huggingface's 🤗 Pypi library &lt;code&gt;transformers&lt;/code&gt; you can now use these deep learning models with a few line of codes.&lt;/p&gt;

&lt;p&gt;Below jupyter notebook has the instructions for it -&lt;br&gt;
&lt;a href="https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb"&gt;Jupyter Notebook&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  My first reaction to it.
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Sentiment Analysis
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from transformers import pipeline
&amp;gt;&amp;gt;&amp;gt; s = pipeline("sentiment-analysis")
&amp;gt;&amp;gt;&amp;gt; s("Obama is not a bad person")
[{'label': 'POSITIVE', 'score': 0.9990673065185547}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A traditional model that works on a bag of words would look at the word &lt;code&gt;bad&lt;/code&gt; and say the statement is negative, but this model can understand the meaning of &lt;code&gt;not a bad&lt;/code&gt;.&lt;/p&gt;
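
&lt;p&gt;For comparison, here is a deliberately naive bag-of-words scorer. It is a toy I made up for this post (the word lists and the function name are not from any library), but it shows why ignoring word order gets this sentence wrong.&lt;/p&gt;

```python
# Tiny hand-made word lists, purely illustrative.
POSITIVE_WORDS = {"good", "great", "nice"}
NEGATIVE_WORDS = {"bad", "rude", "terrible"}


def bag_of_words_sentiment(text):
    """Count positive and negative words, ignoring word order entirely."""
    words = text.lower().split()
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    if positives == negatives:
        return "NEUTRAL"
    # The counts differ here, so the larger one decides the label.
    return "POSITIVE" if max(positives, negatives) == positives else "NEGATIVE"


print(bag_of_words_sentiment("Obama is not a bad person"))
```

&lt;p&gt;The toy scorer sees &lt;code&gt;bad&lt;/code&gt;, never looks at &lt;code&gt;not&lt;/code&gt;, and answers NEGATIVE.&lt;/p&gt;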
&lt;h4&gt;
  
  
  Text Generation
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; g = pipeline("text-generation")
&amp;gt;&amp;gt;&amp;gt; result  = g("How frustrating it is trying to organize your work using")
&amp;gt;&amp;gt;&amp;gt; print(result[0]['generated_text'])
How frustrating it is trying to organize your work using a calendar

If your plan is to use any calendar at all, you might have one or more of those issues which you can add to your calendar. You can use your calendar to organize activities
&amp;gt;&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Although this just generates random text, you can add more context in the initial sentence, and the result is much better than what a Markov chain model produces.&lt;/p&gt;

&lt;p&gt;This can be used for writing articles, captions for your social media posts etc.&lt;/p&gt;
&lt;h4&gt;
  
  
  Summarization
&lt;/h4&gt;

&lt;p&gt;I summarized the first 3 paragraphs of &lt;a href="https://dev.to/techmagic/how-to-design-a-scalable-serverless-application-on-aws-4k9j"&gt;this&lt;/a&gt; dev.to article and this was the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Auto-scaling instances are supposed to handle traffic fluctuation regardless of the number of users and requests.

The unnecessary resources should be eliminated, and the required ones should be triggered following the demand.

Amazon Auto Scaling solves the problem by automatically keeping the currently important instances active and removing the ones that are no longer needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can save so much time now by summarizing YouTube video subtitles and the articles in my daily news feed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Thank you for reading
&lt;/h3&gt;

&lt;p&gt;There is a lot you can do with this library. I'll be writing more tutorials on it in the near future; follow me for updates.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>nlp</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Django Postgres Text Search for 10M+ rows</title>
      <dc:creator>AviKKi</dc:creator>
      <pubDate>Tue, 11 Aug 2020 22:28:59 +0000</pubDate>
      <link>https://dev.to/avikki/django-postgres-text-search-for-10m-rows-2c78</link>
      <guid>https://dev.to/avikki/django-postgres-text-search-for-10m-rows-2c78</guid>
      <description>&lt;p&gt;In a recent project I had to add a full text search functionality to an already existing Django project, below are notes of what challenges I encountered and how I solved them.&lt;/p&gt;

&lt;p&gt;For easy reading I have listed down a brief walk-through and limitations I found, followed by a more detailed log.&lt;/p&gt;

&lt;h1&gt;
  
  
  Project Overview
&lt;/h1&gt;

&lt;p&gt;This project involved loading a 30GB+ CSV containing information about books into the database, and implementing full-text search on those books' titles, authors, tags, and categories.&lt;/p&gt;

&lt;h1&gt;
  
  
  Overall walk-through
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Adding FTS with Postgres is super easy -

&lt;ul&gt;
&lt;li&gt;Add &lt;code&gt;django.contrib.postgres&lt;/code&gt; to &lt;code&gt;INSTALLED_APPS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Perform the search as follows
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Book&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title__search&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'A little girl'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Indexing to increase performance

&lt;ul&gt;
&lt;li&gt;Add a &lt;code&gt;SearchVectorField&lt;/code&gt; field to the model
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="c1"&gt;# for pre computed search vectors
&lt;/span&gt;  &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SearchVectorField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;null&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create Index
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Meta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;GinIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'search_vector'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Increase &lt;code&gt;work_mem&lt;/code&gt;; the default &lt;code&gt;work_mem&lt;/code&gt; of Postgres is too low for millions of rows.&lt;br&gt;
Edit &lt;code&gt;work_mem&lt;/code&gt; in the &lt;code&gt;postgresql.conf&lt;/code&gt; file and restart your database.&lt;br&gt;
A short &lt;code&gt;sed&lt;/code&gt; command does the job if you are using Docker.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caching&lt;br&gt;
I cached the whole web page in a Redis instance, along with certain queries such as the &lt;code&gt;result count&lt;/code&gt; (a very heavy one) that would otherwise be repeated on every search page load. This required overriding the &lt;code&gt;Paginator&lt;/code&gt; and &lt;code&gt;ListView&lt;/code&gt; classes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increase shared memory; this is generally not required, but my Docker container was running out of memory for some queries.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
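
&lt;p&gt;For the &lt;code&gt;work_mem&lt;/code&gt; step, here is a Python equivalent of that kind of one-line edit, as a sketch only. The sample config text, the &lt;code&gt;set_work_mem&lt;/code&gt; helper, and the 64MB value are my assumptions for illustration; pick a value that fits your own server.&lt;/p&gt;

```python
import re

# Sample config text; the commented-out line mirrors the stock 4MB default.
SAMPLE_CONF = "\n".join([
    "shared_buffers = 128MB",
    "#work_mem = 4MB",
    "maintenance_work_mem = 64MB",
])


def set_work_mem(conf_text, value):
    """Hypothetical helper: replace the (possibly commented-out) work_mem line."""
    # Anchored at line start, so maintenance_work_mem is left untouched.
    pattern = r"(?m)^#?[ \t]*work_mem[ \t]*=.*$"
    return re.sub(pattern, f"work_mem = {value}", conf_text)


# 64MB is an assumed value for illustration, not a recommendation.
print(set_work_mem(SAMPLE_CONF, "64MB"))
```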

&lt;h1&gt;
  
  
  Limitations Found
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Complex queries can be really slow, for example:

&lt;ul&gt;
&lt;li&gt;sorting results based on similarity&lt;/li&gt;
&lt;li&gt;sorting search results (aka &lt;code&gt;ORDER BY&lt;/code&gt;) based on the number of comments on a book, or any other non-text column.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A bit hard to tune

&lt;ul&gt;
&lt;li&gt;Doing a trade-off between relevant results and max possible results ( good for SEO ) requires complex queries which will take too loooong to process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Detailed log TL;DR
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Available options
&lt;/h2&gt;

&lt;p&gt;There are two major ways of achieving this -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://django-haystack.readthedocs.io/en/master/"&gt;django haystack&lt;/a&gt; plugin&lt;/p&gt;

&lt;p&gt;With this you can integrate a search engine with your Django application. You have several options for the search backend, like Solr and Elasticsearch. These are really good at handling text search over a large number of documents, but they come with overhead in the form of server cost, development effort, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PostgreSQL's full-text search&lt;/p&gt;

&lt;p&gt;Postgres has a built-in full-text search feature: in SQL you just add a &lt;code&gt;WHERE&lt;/code&gt; clause and you have a fully working text search, and on Django's side you can use the &lt;code&gt;.filter&lt;/code&gt; method. It is not a dedicated search application, so it has many shortcomings; for small applications it works great out of the box, but as the database grows you'll have to do some tweaking.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Config
&lt;/h4&gt;

&lt;p&gt;Add &lt;code&gt;django.contrib.postgres&lt;/code&gt; to &lt;code&gt;INSTALLED_APPS&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;&lt;span class="n"&gt;INSTALLED_APPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;# ... your other apps ...
&lt;/span&gt;    &lt;span class="s"&gt;'django.contrib.postgres'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# for full-text search
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Model
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;django.db&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;django.contrib.postgres.search&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SearchVectorField&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;django.contrib.postgres.indexes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GinIndex&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CharField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;poster_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URLField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;downloads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IntegerField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;likes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IntegerField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;comments_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IntegerField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SearchVectorField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;null&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# for pre-computed search vectors
&lt;/span&gt;
    &lt;span class="c1"&gt;# tags, categories, authors remaining
&lt;/span&gt;    &lt;span class="c1"&gt;# raw data fields
&lt;/span&gt;    &lt;span class="n"&gt;_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_authors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Meta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;GinIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'search_vector'&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above is a typical Django ORM model. &lt;code&gt;search_vector&lt;/code&gt; holds the vector representation of a book's &lt;code&gt;title, tags, categories and authors&lt;/code&gt;. Postgres converts both the search query and the text fields into vectors and then compares them for a match; by pre-computing the search vector and indexing it with a &lt;code&gt;GinIndex&lt;/code&gt;, we improve the query speed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;search_vector&lt;/code&gt; can be computed with the Python code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;django.contrib.postgres.search&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SearchVector&lt;/span&gt;
&lt;span class="n"&gt;Book&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SearchVector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'_tags'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'_categories'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'_authors'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using authors, tags and categories as TextField helps in loading the huge CSV file faster.&lt;/p&gt;

&lt;h4&gt;
  
  
  View
&lt;/h4&gt;

&lt;p&gt;The view was implemented with Django's generic &lt;code&gt;ListView&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Profiling
&lt;/h4&gt;

&lt;p&gt;After this I used Django Debug Toolbar to look at the queries being performed. There were 2 major issues.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Count(*) was slow for queries with ~100K+ results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;Count(*)&lt;/code&gt; is a notoriously expensive operation in SQL; you basically have to scan through the whole table to compute it. There are some workarounds, like storing the count separately or using partial indexes, but none of them applied to our use case. &lt;/p&gt;

&lt;p&gt;I cached the results of these count queries.&lt;/p&gt;
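
&lt;p&gt;The caching idea, stripped of Django and Redis, looks roughly like this. It is a framework-free sketch: &lt;code&gt;CountCache&lt;/code&gt; and &lt;code&gt;slow_count&lt;/code&gt; are names I made up, and a plain dictionary stands in for the Redis instance used in the project.&lt;/p&gt;

```python
class CountCache:
    """Remember the result count per search query so it is computed only once."""

    def __init__(self, count_fn):
        self._count_fn = count_fn  # the expensive call, e.g. a COUNT(*) query
        self._counts = {}          # in-process stand-in for an external cache

    def count(self, query):
        if query not in self._counts:
            self._counts[query] = self._count_fn(query)
        return self._counts[query]


calls = []


def slow_count(query):
    """Stand-in for the expensive count query; records every real call."""
    calls.append(query)
    return 123456  # made-up number of matching rows


cache = CountCache(slow_count)
cache.count("little girl")
cache.count("little girl")  # second call is served from the cache
print(len(calls))
```

&lt;p&gt;A real setup would also need an expiry time so that counts do not go stale when new rows are loaded.&lt;/p&gt;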

&lt;ol start="2"&gt;
&lt;li&gt;Query time increased drastically once the number of search results grew past a certain point.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>postgres</category>
      <category>django</category>
      <category>python</category>
    </item>
  </channel>
</rss>
