datatoinfinity

Spell Checker-Predicting Correct Word-NLP-Part 2

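import pandas as pd

def spell_checker(word,count=5):
    # Collect [candidate, probability] pairs for candidates found in the corpus.
    output=[]
    suggested_words=edit(word)
    for wrd in suggested_words:
        if wrd in word_probability.keys():
            output.append([wrd,word_probability[wrd]])
    # Rank by probability, highest first, and return the top `count` words.
    return list(pd.DataFrame(output,columns=['word','prob']).sort_values(by='prob',ascending=False).head(count)['word'].values)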

Let's break it down step by step.

def spell_checker(word,count=5):
  • Defines a function called spell_checker.
  • word is the misspelled word you want to correct.
  • count=5 is the number of top suggestions you want to return (default = 5).
output=[]
  • Initializes an empty list to store valid suggested words with their probabilities.
suggested_words=edit(word)
  • Calls the edit() function defined earlier:

    def edit(word):
        return set(insert(word) + delete(word) + swap(word) + replace(word))

  • This returns a set of all strings that are one edit (an insertion, deletion, swap, or replacement) away from the input word. Many of them are not real words; those are filtered out in the next step.

  • Example: for "lve" the set includes 'love', 'live', 'lave', ...
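
The helpers insert, delete, swap, and replace are not shown in this part. As a rough sketch of what they could look like (the function bodies below are assumptions for illustration, not necessarily the code used in this series), each one builds every string reachable by a single edit of its kind:

```python
import string

# Assumed implementations of the four helpers; only their names come from the post.
LETTERS = string.ascii_lowercase

def insert(word):
    # Add one letter at every possible position.
    return [word[:i] + c + word[i:] for i in range(len(word) + 1) for c in LETTERS]

def delete(word):
    # Remove one letter at every position.
    return [word[:i] + word[i + 1:] for i in range(len(word))]

def swap(word):
    # Transpose each pair of adjacent letters.
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)]

def replace(word):
    # Substitute one letter at every position.
    return [word[:i] + c + word[i + 1:] for i in range(len(word)) for c in LETTERS]

def edit(word):
    # The set removes duplicates produced by different edits.
    return set(insert(word) + delete(word) + swap(word) + replace(word))

candidates = edit("lve")
print("love" in candidates, "live" in candidates, "lave" in candidates)
```

Under these assumptions the set holds a few hundred strings for a three-letter word, and most of them are not real words, which is why the next step filters them against the corpus.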

    for wrd in suggested_words:
        if wrd in word_probability.keys():
            output.append([wrd, word_probability[wrd]])

What happens here:

  • Loops through each wrd in the set of suggested words.
  • Checks whether wrd is a real word, i.e. whether it is a key in word_probability, which is built from big.txt.
  • If it is, appends the pair [wrd, probability] to the output list.

Example:

If 'love' is in the corpus with probability 0.0042 and 'live' with probability 0.0021, output looks like:

[['love', 0.0042], ['live', 0.0021], ...]
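
To try the filtering step in isolation, here is a minimal sketch. It assumes word_probability maps each corpus word to count / total; a one-line toy string stands in for big.txt and a hand-written set stands in for the result of edit("lve"). Both are made up for illustration.

```python
import re
from collections import Counter

# Toy text standing in for big.txt (assumption for illustration only).
text = "love love love live live lave"
words = re.findall(r"\w+", text.lower())

# Assumed definition: probability = word count / total number of words.
counts = Counter(words)
total = sum(counts.values())
word_probability = {w: counts[w] / total for w in counts}

# Hand-written stand-in for edit("lve"); "lxe" and "lv" are not in the corpus.
suggested_words = {"love", "live", "lave", "lxe", "lv"}

output = []
for wrd in suggested_words:
    # `wrd in word_probability` tests the keys, same as `.keys()`.
    if wrd in word_probability:
        output.append([wrd, word_probability[wrd]])

print(sorted(output))
```

Only the three candidates that occur in the toy text survive. Because suggested_words is a set, the order of output is arbitrary until it is sorted.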
    return list(pd.DataFrame(output, columns=['word', 'prob']).sort_values(by='prob', ascending=False).head(count)['word'].values)
  1. pd.DataFrame(output, columns=['word', 'prob'])

    • Converts the list of [word, prob] pairs into a pandas DataFrame:

       word    prob
    0  love  0.0042
    1  live  0.0021

  2. .sort_values(by='prob', ascending=False)

    • Sorts the DataFrame so the most frequent (most likely correct) words come first.

  3. .head(count)

    • Selects the top count words (default = 5).

  4. ['word'].values and list(...)

    • Extracts just the 'word' column as an array and converts it to a plain Python list.
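
The chain does three things: sort by probability in descending order, keep the first count rows, and return only the words. For comparison, here is a minimal sketch of the same ranking with the built-in sorted() instead of pandas. This is an alternative, not the code used in the post. The name top_words is hypothetical, and the probability for 'lave' is made up.

```python
def top_words(output, count=5):
    # Sort [word, prob] pairs by probability, highest first.
    ranked = sorted(output, key=lambda pair: pair[1], reverse=True)
    # Keep the first `count` pairs and return only the words.
    return [word for word, _ in ranked[:count]]

# Toy input in the same shape as `output` in spell_checker.
output = [["live", 0.0021], ["love", 0.0042], ["lave", 0.0001]]
print(top_words(output, count=2))  # ['love', 'live']
```

For a handful of candidates either approach works; the pandas version is convenient if you also want to inspect the intermediate table.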
spell_checker('famili')

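Because edit() only generates candidates that are one edit away, only words such as family (replace the final i with y) can be suggested. Words like familiar, fail, or famine are more than one edit away from famili, so they will not appear. If family is in the corpus, you would most likely get:

['family']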
