<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jogesh Mishra</title>
    <description>The latest articles on DEV Community by Jogesh Mishra (@jogeshmishra).</description>
    <link>https://dev.to/jogeshmishra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F487038%2F249256d8-281b-4bbb-bbc4-4c3454c78e1b.jpeg</url>
      <title>DEV Community: Jogesh Mishra</title>
      <link>https://dev.to/jogeshmishra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jogeshmishra"/>
    <language>en</language>
    <item>
      <title>Autocorrect using NLP</title>
      <dc:creator>Jogesh Mishra</dc:creator>
      <pubDate>Sat, 10 Oct 2020 21:22:35 +0000</pubDate>
      <link>https://dev.to/spectrumcetb/autocorrect-using-nlp-3h07</link>
      <guid>https://dev.to/spectrumcetb/autocorrect-using-nlp-3h07</guid>
<description>&lt;p&gt;In this era of digitalization, almost every individual has a smartphone, be it Android or iOS. The benefits of smartphones would make a never-ending list, and we aren't going to focus on that in this blog! The purpose of this post, as you can guess from the title, is to build an Autocorrect System from scratch. Yes, it's somewhat similar, though definitely not an exact copy, to the one on the smartphone you are using now; it is an implementation of Natural Language Processing on a comparatively small dataset: shakespeare.txt (a file of words used by William Shakespeare in his works).&lt;br&gt;
This algorithm was first described by &lt;strong&gt;Peter Norvig&lt;/strong&gt; back in 2007 in his &lt;a href="https://norvig.com/spell-correct.html"&gt;article&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Overview of the Problem :
&lt;/h1&gt;

&lt;p&gt;In this model, we are going to consider &lt;strong&gt;edit distance&lt;/strong&gt; between every pair of words in a list containing the vocabulary. Basically, edit distance is a &lt;strong&gt;measure of minimum edits required to convert one word to another&lt;/strong&gt;.&lt;br&gt;
This conversion uses four operations on a word: &lt;strong&gt;Delete&lt;/strong&gt;, &lt;strong&gt;Replace&lt;/strong&gt;, &lt;strong&gt;Switch&lt;/strong&gt; and &lt;strong&gt;Insert&lt;/strong&gt;. In this blog, to reduce complexity, we will only consider words that are &lt;strong&gt;1 or 2 edit distances away&lt;/strong&gt;.&lt;br&gt;
The goal of our model is to compute &lt;strong&gt;the probability that a candidate word &lt;em&gt;c&lt;/em&gt; is the intended correction of the typed word &lt;em&gt;w&lt;/em&gt;, &lt;em&gt;P(c|w)&lt;/em&gt;&lt;/strong&gt;: the probability of typing &lt;em&gt;w&lt;/em&gt; when &lt;em&gt;c&lt;/em&gt; was meant, &lt;em&gt;P(w|c)&lt;/em&gt;, multiplied by the probability of &lt;em&gt;c&lt;/em&gt; appearing in general, &lt;em&gt;P(c)&lt;/em&gt;, divided by the probability of the string &lt;em&gt;w&lt;/em&gt; appearing, &lt;em&gt;P(w)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Formula :               &lt;strong&gt;𝑃(𝑐|𝑤)=𝑃(𝑤|𝑐)×𝑃(𝑐)/𝑃(𝑤)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The formula above is an application of &lt;strong&gt;Bayes' Theorem&lt;/strong&gt;.&lt;/p&gt;
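&lt;p&gt;To make the formula concrete, here is a toy calculation. Every number below is made up purely for illustration; none of them come from the actual model:&lt;/p&gt;

```python
# A toy numeric sketch of Bayes' rule for autocorrect.
# All probabilities below are invented for illustration only.
p_c = 0.00041        # P(c): prior probability of the candidate word "days"
p_w_given_c = 0.001  # P(w|c): chance of typing "dys" when "days" was meant
p_w = 0.0000005      # P(w): overall probability of seeing the string "dys"

p_c_given_w = p_w_given_c * p_c / p_w  # Bayes' rule: P(c|w) = P(w|c)*P(c)/P(w)
print(round(p_c_given_w, 4))
```

Since P(w) is the same for every candidate correction of a given typo, it only rescales the scores; the ranking between candidates is driven by P(w|c) and P(c).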
&lt;h3&gt;
  
  
  &lt;em&gt;And we are all set for the journey ahead !&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Import the necessary packages :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
from collections import Counter
import numpy as np
import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-processing the data :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process_data(file_name):
    words = [] 
    with open(file_name) as f:
        file_name_data = f.read()
    file_name_data=file_name_data.lower()
    words = re.findall(r'\w+',file_name_data)

    return words

word_l = process_data('shakespeare.txt')
vocab = set(word_l)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we use the function process_data, which takes the file &lt;em&gt;shakespeare.txt&lt;/em&gt; as input and returns &lt;strong&gt;words&lt;/strong&gt;, a list of every word token in the file, lower-cased and split on punctuation and whitespace. We then build the vocabulary of the text file, &lt;strong&gt;vocab&lt;/strong&gt;, as the set of all words in &lt;strong&gt;word_l&lt;/strong&gt;, i.e. the collection of unique words.&lt;/p&gt;
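&lt;p&gt;A quick look at what &lt;em&gt;re.findall(r'\w+', ...)&lt;/em&gt; keeps, on a made-up example line (note that &lt;em&gt;\w&lt;/em&gt; also matches digits, so numbers survive as tokens):&lt;/p&gt;

```python
import re

# Made-up example line: punctuation is dropped, digits are kept,
# and everything is lower-cased before tokenizing.
text = "O for a Muse of fire, that would ascend 2 heaven!"
tokens = re.findall(r'\w+', text.lower())
print(tokens)
```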

&lt;p&gt;Now, we would create a count dictionary for the words in vocab and their number of occurrences in the list of words from the input file :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_count(word_l):
    word_count_dict = {}  # fill this with word counts
    word_count_dict = Counter(word_l)
    return word_count_dict

word_count_dict = get_count(word_l)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above function will return a dictionary, &lt;em&gt;word_count_dict&lt;/em&gt; where each word will be a key and the value assigned to each word will be the number of times it has appeared in the list of words &lt;em&gt;word_l&lt;/em&gt; .&lt;/p&gt;

&lt;p&gt;Next, we will need a dictionary that will contain every word as the key and the value assigned to each key would be the probability of the occurrence of that specific word(key) in the input file &lt;em&gt;shakespeare.txt&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_probs(word_count_dict):
    probs = {}  # return this variable correctly

    m = sum(word_count_dict.values())
    for key in word_count_dict :
        probs[key] = word_count_dict[key]/m
    return probs

probs = get_probs(word_count_dict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;probs&lt;/em&gt; is the required dictionary that will contain each word &lt;em&gt;w&lt;/em&gt; and the respective probability &lt;em&gt;P(w)&lt;/em&gt;.&lt;/p&gt;
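&lt;p&gt;On a toy corpus, the whole counting-and-probability step looks like this (the six-word list is made up for illustration):&lt;/p&gt;

```python
from collections import Counter

# Toy corpus standing in for word_l built from shakespeare.txt
word_l = ['the', 'cat', 'sat', 'on', 'the', 'mat']
word_count_dict = Counter(word_l)

m = sum(word_count_dict.values())  # total number of tokens (6 here)
probs = {w: c / m for w, c in word_count_dict.items()}

print(probs['the'])  # 'the' appears 2 times out of 6 tokens
```

By construction the values of probs sum to 1, so they form a valid probability distribution over the vocabulary.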

&lt;p&gt;Now comes the most important part of our model! In the code snippets below, we implement functions for the four edit operations, &lt;strong&gt;Delete&lt;/strong&gt;, &lt;strong&gt;Switch&lt;/strong&gt;, &lt;strong&gt;Replace&lt;/strong&gt; and &lt;strong&gt;Insert&lt;/strong&gt;, each returning the list of candidate words that edit produces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def delete_letter(word):
    delete_l = []
    split_l = []

    for i in range(len(word)):
        split_l.append([word[:i],word[i:]])
    for a,b in split_l :
        delete_l.append(a+b[1:])

    return delete_l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
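&lt;p&gt;A quick sanity check of the delete step, with the function restated compactly so the snippet runs on its own:&lt;/p&gt;

```python
def delete_letter(word):
    # Split at every position, then drop the first letter of the right half
    split_l = [(word[:i], word[i:]) for i in range(len(word))]
    return [a + b[1:] for a, b in split_l]

# Each candidate is 'cat' with exactly one letter removed
print(delete_letter('cat'))
```

A word of length n always yields exactly n delete candidates.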





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def switch_letter(word):
    switch_l = []
    split_l = []

    for c in range(len(word)):
        split_l.append([word[:c],word[c:]])
    switch_l = [a + b[1] + b[0] + b[2:] for a,b in split_l if len(b) &amp;gt;= 2]    

    return switch_l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
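&lt;p&gt;Similarly, the switch step swaps each pair of adjacent letters once; restated compactly for a standalone check:&lt;/p&gt;

```python
def switch_letter(word):
    # Swap each pair of adjacent letters exactly once; the len(b) >= 2
    # guard skips the split where nothing is left to swap.
    split_l = [(word[:c], word[c:]) for c in range(len(word))]
    return [a + b[1] + b[0] + b[2:] for a, b in split_l if len(b) >= 2]

print(switch_letter('eta'))  # swapping adjacent letters of 'eta'
```

A word of length n yields n-1 switch candidates; a one-letter word yields none.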





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def replace_letter(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    replace_l = []
    split_l = []

    for c in range(len(word)):
        split_l.append((word[0:c],word[c:]))
    replace_l = [a + l + (b[1:] if len(b)&amp;gt; 1 else '') for a,b in split_l if b for l in letters]
    replace_set=set(replace_l)    
    replace_set.remove(word)
    # turn the set back into a list and sort it, for easier viewing
    replace_l = sorted(list(replace_set))

    return replace_l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def insert_letter(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_l = []
    split_l = []

    for c in range(len(word)+1):
        split_l.append((word[0:c],word[c:]))
    insert_l = [ a + l + b for a,b in split_l for l in letters]

    return insert_l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
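&lt;p&gt;Restating the last two functions compactly (using &lt;em&gt;discard&lt;/em&gt; instead of &lt;em&gt;remove&lt;/em&gt;, which behaves the same here), we can sanity-check the candidate counts for a 3-letter word with all-distinct letters: 25*3 replaces and 26*4 inserts.&lt;/p&gt;

```python
def replace_letter(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    split_l = [(word[:c], word[c:]) for c in range(len(word))]
    replace_set = set(a + l + b[1:] for a, b in split_l for l in letters)
    replace_set.discard(word)  # the unchanged word is not an edit
    return sorted(replace_set)

def insert_letter(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    split_l = [(word[:c], word[c:]) for c in range(len(word) + 1)]
    return [a + l + b for a, b in split_l for l in letters]

# 75 replace candidates and 104 insert candidates for 'dys';
# the intended correction 'days' is among the inserts.
print(len(replace_letter('dys')), len(insert_letter('dys')))
print('days' in insert_letter('dys'))
```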



&lt;p&gt;Each of the functions above produces the candidate words for one specific type of edit. We now combine them into an &lt;strong&gt;n-edit distance algorithm&lt;/strong&gt;, using &lt;strong&gt;n=1,2&lt;/strong&gt; only.&lt;/p&gt;

&lt;p&gt;So, if we want to edit &lt;strong&gt;one letter at a time&lt;/strong&gt; then we choose &lt;strong&gt;n=1&lt;/strong&gt; and implement the following function :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def edit_one_letter(word, allow_switches = True):
    edit_one_set = set()

    edit_one_set.update(delete_letter(word))
    if allow_switches:
        edit_one_set.update(switch_letter(word))
    edit_one_set.update(replace_letter(word))
    edit_one_set.update(insert_letter(word))

    return edit_one_set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, if we choose to edit &lt;strong&gt;two letters at a time&lt;/strong&gt;, we go for &lt;strong&gt;n=2&lt;/strong&gt; and apply the following function, which uses the function above twice: first on the actual word, and then on every word in that first output :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def edit_two_letters(word, allow_switches = True):
    edit_two_set = set()

    edit_one = edit_one_letter(word,allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = edit_one_letter(w,allow_switches=allow_switches)
            edit_two_set.update(edit_two)

    return edit_two_set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we will use &lt;strong&gt;edit_two_letters()&lt;/strong&gt; and &lt;strong&gt;edit_one_letter()&lt;/strong&gt;  in the given function to get a list of all suggested words along with their respective probabilities in a list named &lt;strong&gt;n_best&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_corrections(word, probs, vocab, n=2):    
    suggestions = []
    n_best = []

    suggestions = list((word in vocab and word) or edit_one_letter(word).intersection(vocab) or edit_two_letters(word).intersection(vocab))
    n_best = [[s,probs[s]] for s in list(reversed(suggestions))]

    return n_best
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;em&gt;suggestions&lt;/em&gt; will contain the word itself if it is already in vocab; otherwise, the words from &lt;em&gt;edit_one_letter()&lt;/em&gt; that are in vocab; and failing that, the words from &lt;em&gt;edit_two_letters()&lt;/em&gt; that are in vocab.&lt;br&gt;
As can be seen, &lt;strong&gt;n_best&lt;/strong&gt; returns the list of all suggested words with their respective probabilities of being correct.&lt;/p&gt;
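&lt;p&gt;The fallback order in &lt;em&gt;suggestions&lt;/em&gt; works because Python's &lt;em&gt;or&lt;/em&gt; returns its first non-empty operand; a minimal sketch with toy stand-ins for the candidate sets (the vocabulary and hit sets below are made up):&lt;/p&gt;

```python
# Toy vocabulary standing in for the real vocab set
vocab = {'days', 'dye', 'the'}

def cascade(word, one_edit_hits, two_edit_hits):
    # Exact match beats one-edit candidates, which beat two-edit candidates;
    # `or` returns the first truthy (non-empty) operand.
    return list((word in vocab and [word]) or one_edit_hits or two_edit_hits)

print(cascade('the', {'she'}, {'thus'}))               # exact match wins
print(sorted(cascade('dys', set(), {'days', 'dye'})))  # falls through to two edits
```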
&lt;h1&gt;
  
  
  Testing our Model :
&lt;/h1&gt;

&lt;p&gt;The following code snippet will be used to test our Autocorrect Model :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_word = 'dys' 
tmp_corrections = get_corrections(my_word, probs, vocab, 2) 
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results obtained were : &lt;strong&gt;dye&lt;/strong&gt; with probability &lt;strong&gt;0.000019&lt;/strong&gt; and &lt;strong&gt;days&lt;/strong&gt; with probability &lt;strong&gt;0.000410&lt;/strong&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;That's all Folks !&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;This is how you can make your own autocorrect model using Natural Language Processing and Python.&lt;br&gt;&lt;br&gt;
Thank you for your valuable time and patience !&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
