Autocorrect using NLP

#machinelearning #datascience #python

In this era of Digitalization, almost every individual has a smartphone with them, may it be Android or iOS. The benefits of smartphone would be a never-ending list and we ain't going to focus on that in this blog ! The purpose of this post, as you can guess it from the title itself, is to build an Autocorrect System from Scratch. Yes, it's somewhat similar, definitely not the exact copy , to that of the smartphone you are using now, but this would be just an implementation of Natural Language Processing on a comparatively smaller dataset- Shakespeare.txt (a file full of words used by William Shakespeare in his works).
This algorithm was first created by Peter Norvig back in 2007 in his article.

Overview of the Problem :

In this model, we are going to consider edit distance between every pair of words in a list containing the vocabulary. Basically, edit distance is a measure of minimum edits required to convert one word to another.
This process of conversion includes steps like Delete,Replace,Switch and Insert on a pair of words. In this blog, to reduce complexity, we would go for words that are 1 or 2 edit distance away.
The goal of our model to produce the right output is to compute the probability of a word being correct, P(c/w) ,is probability of certain word w given that is is correct, P(w/c) , multiplied to probability of being correct in general, P(c) , divided by probability of that word appearing, P(w) .

Formula : 𝑃(𝑐|𝑤)=𝑃(𝑤|𝑐)×𝑃(𝑐)/𝑃(𝑤)

The method used above is called Bayes Theorem.

And we are all set for the journey ahead !

Import the necessary packages :

import re
from collections import Counter
import numpy as np
import pandas as pd

Pre-processing the data :

def process_data(file_name):
    words = [] 
    with open(file_name) as f:
        file_name_data = f.read()
    file_name_data=file_name_data.lower()
    words = re.findall(r'\w+',file_name_data)

    return words

word_l = process_data('shakespeare.txt')
vocab = set(word_l)

Here we are using the function process data which takes the file
shakespeare.txt as input and returns a words, list of all words in the file by ignoring the numerical values and converting every word to lower case. Again, we set vocabulary for the text file , vocab , as a set of all words from the list of words received by word_l, i.e making a list of all unique words.

Now, we would create a count dictionary for the words in vocab and their number of occurrences in the list of words from the input file :

def get_count(word_l):
    word_count_dict = {}  # fill this with word counts
    word_count_dict = Counter(word_l)
    return word_count_dict

word_count_dict = get_count(word_l)

The above function will return a dictionary, word_count_dict where each word will be a key and the value assigned to each word will be the number of times it has appeared in the list of words word_l .

Next, we will need a dictionary that will contain every word as the key and the value assigned to each key would be the probability of the occurrence of that specific word(key) in the input file shakespeare.txt.

def get_probs(word_count_dict):
    probs = {}  # return this variable correctly

    m = sum(word_count_dict.values())
    for key in word_count_dict :
        probs[key] = word_count_dict[key]/m
    return probs

probs = get_probs(word_count_dict)

probs is the required dictionary that will contain each word w and the respective probability P(w).

Now, here comes the most important aspect of our model ! In the following code snippets below, we would be implementing functiones to carry out the tasks : Delete, Switch, Replace and Insert and return the respective list of words from each task.

def delete_letter(word):
    delete_l = []
    split_l = []

    for i in range(len(word)):
        split_l.append([word[:i],word[i:]])
    for a,b in split_l :
        delete_l.append(a+b[1:])

    return delete_l

def switch_letter(word):
    switch_l = []
    split_l = []

    for c in range(len(word)):
        split_l.append([word[:c],word[c:]])
    switch_l = [a + b[1] + b[0] + b[2:] for a,b in split_l if len(b) >= 2]    

    return switch_l

def replace_letter(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    replace_l = []
    split_l = []

    for c in range(len(word)):
        split_l.append((word[0:c],word[c:]))
    replace_l = [a + l + (b[1:] if len(b)> 1 else '') for a,b in split_l if b for l in letters]
    replace_set=set(replace_l)    
    replace_set.remove(word)
    # turn the set back into a list and sort it, for easier viewing
    replace_l = sorted(list(replace_set))

    return replace_l

def insert_letter(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_l = []
    split_l = []

    for c in range(len(word)+1):
        split_l.append((word[0:c],word[c:]))
    insert_l = [ a + l + b for a,b in split_l for l in letters]

    return insert_l

The functions used above produce lists of words on specific type of edits. We need to use these as a whole for each word in the list of input words from the file by using the n-edit distance algorithm where we will use n=1,2 only.

So, if we want to edit one letter at a time then we choose n=1 and implement the following function :

def edit_one_letter(word, allow_switches = True):
    edit_one_set = set()

    edit_one_set.update(delete_letter(word))
    if allow_switches:
        edit_one_set.update(switch_letter(word))
    edit_one_set.update(replace_letter(word))
    edit_one_set.update(insert_letter(word))

    return edit_one_set

Similarly, if we choose to edit two letters at a time, so we go for n=2 and implement the following function on our list of words, that is going to use the above function two times, i.e first on actual set of words and second on every word in the output of above function :

def edit_two_letters(word, allow_switches = True):
    edit_two_set = set()

    edit_one = edit_one_letter(word,allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = edit_one_letter(w,allow_switches=allow_switches)
            edit_two_set.update(edit_two)

    return edit_two_set

Finally, we will use edit_two_letters() and edit_one_letter() in the given function to get a list of all suggested words along with their respective probabilities in a list named n_best.

def get_corrections(word, probs, vocab, n=2):    
    suggestions = []
    n_best = []

    suggestions = list((word in vocab and word) or edit_one_letter(word).intersection(vocab) or edit_two_letters(word).intersection(vocab))
    n_best = [[s,probs[s]] for s in list(reversed(suggestions))]

    return n_best

Here, suggestions will have the words that are in vocab or the common words between the list of words obtained from edit_one_letter() and vocab or edit_two_letters() and vocab.
As it can be seen, n_best returns the list of all suggested words with their respective probabilities of being correct.

Testing our Model :

The following code snippet will be used to test our Autocorrect Model :

my_word = 'dys' 
tmp_corrections = get_corrections(my_word, probs, vocab, 2) 
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")

The results obtained were : dye with probability of 0.000019 and days with probability 0.000410 .