def spell_checker(word,count=5):
    output=[]
    suggested_words=edit(word)
    for wrd in suggested_words:
        if wrd in word_probability.keys():
            output.append([wrd,word_probability[wrd]])
    return list(pd.DataFrame(output,columns=['word','prob']).sort_values(by='prob',ascending=False).head(count)['word'].values)
Let's break it down step by step.
def spell_checker(word,count=5):
- Defines a function called spell_checker.
- word is the misspelled word you want to correct.
- count=5 is the number of top suggestions you want to return (default = 5).
output=[]
- Initializes an empty list to store valid suggested words with their probabilities.
suggested_words=edit(word)
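- Calls the edit() function, which is defined earlier:
def edit(word):
    return set(insert(word) + delete(word) + swap(word) + replace(word))
This returns a set of all candidate strings that are one edit (an insertion, deletion, swap, or replacement) away from the input word. Most of these candidates are not real words; they are filtered in the next step.
Example: for "lve", the set includes 'love', 'live', 'lave', ...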
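The four helpers that edit() combines are not shown in this section. The sketch below is one way they could look, assuming each returns a list (so that `+` concatenates them) and assuming a lowercase English alphabet. The `splits` helper and the `LETTERS` constant are hypothetical names introduced only for this illustration.

```python
import string

LETTERS = string.ascii_lowercase  # assumption: lowercase English letters only


def splits(word):
    """All (left, right) ways to cut the word, including empty halves."""
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]


def insert(word):
    # Add one letter at every possible position.
    return [left + c + right for left, right in splits(word) for c in LETTERS]


def delete(word):
    # Remove one letter.
    return [left + right[1:] for left, right in splits(word) if right]


def swap(word):
    # Exchange two adjacent letters.
    return [left + right[1] + right[0] + right[2:]
            for left, right in splits(word) if len(right) > 1]


def replace(word):
    # Change one letter to any letter of the alphabet.
    return [left + c + right[1:]
            for left, right in splits(word) if right for c in LETTERS]


def edit(word):
    return set(insert(word) + delete(word) + swap(word) + replace(word))


candidates = edit("lve")
print("love" in candidates, "live" in candidates)  # real words are present
print("lvea" in candidates)                        # so are non-words
```

Note that the set mixes real words with nonsense strings, which is why the dictionary check in the next step is needed.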
for wrd in suggested_words:
    if wrd in word_probability.keys():
        output.append([wrd, word_probability[wrd]])
What happens here:
- Loops through each wrd in the set of suggested words.
- Checks: is wrd a real word?
  - If yes (i.e., it's in word_probability, which comes from your big.txt dictionary), it appends a pair [wrd, probability] to the output list.
  - If no, the candidate is skipped.
Example:
If 'love' is in the corpus and has probability 0.0042, output looks like:
[['love', 0.0042], ['live', 0.0021], ...]
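To make the filtering step concrete, here is a minimal sketch of how word_probability could be built and then used. It assumes the probabilities are simple relative frequencies and that words are lowercase runs of letters; a short in-memory string stands in for the contents of big.txt, and the candidate set is hard-coded rather than produced by edit().

```python
import re
from collections import Counter

# Stand-in for the contents of big.txt (assumption: any plain-text corpus).
corpus_text = "I love to live. We love what we do, and we live well."

words = re.findall(r"[a-z]+", corpus_text.lower())
counts = Counter(words)
total = sum(counts.values())

# Relative frequency of each word in the corpus.
word_probability = {w: c / total for w, c in counts.items()}

# Hypothetical candidate set, similar to what edit("lve") might produce.
suggested_words = {"love", "live", "lave", "lvea"}

output = []
for wrd in suggested_words:
    if wrd in word_probability:  # equivalent to `in word_probability.keys()`
        output.append([wrd, word_probability[wrd]])

print(sorted(output))  # only 'live' and 'love' survive the filter
```

Testing membership on the dictionary directly does the same job as calling .keys() first, and is the more common idiom.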
return list(pd.DataFrame(output, columns=['word', 'prob']).sort_values(by='prob', ascending=False).head(count)['word'].values)
- pd.DataFrame(output, columns=['word', 'prob'])
Converts the list of [word, prob] pairs into a pandas DataFrame:
   word    prob
0  love  0.0042
1  live  0.0021
- .sort_values(by='prob', ascending=False)
  - Sorts the DataFrame so the most frequent (most likely correct) words come first.
- .head(count)
  - Selects the top count words (default = 5).
- ['word'].values and list(...)
  - Extracts just the "word" column as a list.
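The ranking logic in the return statement can also be written without pandas. The sketch below is a plain-Python equivalent, not the code used in this post; `top_words` is a hypothetical helper name. One possible difference: when two words have the same probability, the order between them may not match what the pandas version returns.

```python
def top_words(output, count=5):
    """Return up to `count` words from [word, prob] pairs, most probable first."""
    ranked = sorted(output, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:count]]


output = [["live", 0.0021], ["love", 0.0042], ["lave", 0.0001]]
print(top_words(output, count=2))  # ['love', 'live']
print(top_words([]))               # []
```

For a handful of candidates per word either approach is fine; the DataFrame mainly helps if you want to inspect the probabilities alongside the words.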
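spell_checker('famili')
Because edit() only generates candidates that are one edit away, only such words can be returned. If family exists in the corpus, you might get:
['family']
Words like familiar, fail, facility, or famine are more than one edit away from famili, so this function cannot suggest them.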