datatoinfinity
Spell Checker - Finding probability distribution-NLP

Finding Probability Distribution

Kaggle Dataset for Spelling Corrector

You need to download big.txt, or create a Kaggle notebook with the dataset attached.

with open('/kaggle/input/spelling/big.txt', 'r') as fd:
    lines = fd.readlines()

words = []
for line in lines:
    words += line.split(' ')
len(words)
Output:
1164968

Explanation:

  1. Open /kaggle/input/spelling/big.txt.
  2. fd.readlines() reads every line of the file into a list of strings. Example: lines = [ "I love NLP", "Spell checkers are helpful", "Python is powerful" ]
  3. for line in lines: iterates over that list.
  4. words += line.split(' ') splits each line into tokens on spaces and appends them to words.
  5. len(words) returns the number of tokens in the words list.
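The steps above can be sketched on an in-memory sample (the three lines below are made up for illustration; no big.txt needed):

```python
# Stand-in for fd.readlines(): each string is one line, newline included.
lines = ["I love NLP\n", "Spell checkers are helpful\n", "Python is powerful\n"]

words = []
for line in lines:
    words += line.split(' ')  # split on single spaces only

print(len(words))  # 10 tokens
print(words)       # note the '\n' left attached to the last token of each line
```

Notice that the newline stays glued to the final word of each line, which is exactly the problem re.findall solves later.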
import re

with open('/kaggle/input/spelling/big.txt', 'r') as fd:
    lines = fd.readlines()

words = []
for line in lines:
    words += re.findall(r'\w+', line)
len(words)
Output:
1115585

re.findall(r'\w+', line) also splits the line into words and returns them as a list. So what is the difference?

line.split(' ')

This splits the line on single spaces only, so punctuation and newline characters stay attached to the tokens: 'Holmes\n', '(#15', 'eBook.\n' and so on.

print(words[:100])
Output:
['The','Project','Gutenberg', 'EBook','of','The','Adventures', 'of','Sherlock','Holmes\n','by','Sir','Arthur','Conan','Doyle\n',
'(#15','in','our','series','by','Sir','Arthur','Conan','Doyle)\n',
'\n','Copyright','laws','are','changing','all','over','the',
'world.','Be','sure','to','check','the\n','copyright','laws',
'for','your','country','before','downloading','or','redistributing\n','this','or','any','other','Project','Gutenberg','eBook.\n',
'\n','This','header','should','be','the','first','thing','seen',
'when','viewing','this','Project\n','Gutenberg','file.','',
'Please','do','not','remove','it.','','Do','not','change','or',
'edit','the\n','header','without','written','permission.\n',
'\n','Please','read','the','"legal','small','print,"','and', 'other','information','about','the\n','eBook','and']

re.findall(r'\w+', line)

\w+ matches any word made up of:

  • Letters (A–Z, a–z)
  • Numbers (0–9)
  • Underscore _

Everything else is ignored, so punctuation and newlines are dropped.
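The difference is easy to see side by side. A minimal comparison on a made-up line (the string below is hypothetical, modeled on the big.txt output above):

```python
import re

line = "Doyle)\n (#15 in our series.\n"

print(line.split(' '))           # punctuation and '\n' stay attached to tokens
print(re.findall(r'\w+', line))  # ['Doyle', '15', 'in', 'our', 'series']
```

split(' ') returns tokens like 'Doyle)\n' and '(#15', while \w+ keeps only the letter/digit/underscore runs.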

print(words[:30])
Output:
['The','Project','Gutenberg','EBook','of','The','Adventures','of',
'Sherlock','Holmes','by','Sir','Arthur','Conan','Doyle','15',
'in','our','series','by','Sir','Arthur','Conan','Doyle','Copyright','laws','are','changing','all','over']

Let's check how many unique words there are:

print(len(words))
vocab=list(set(words))
print(len(vocab))
Output:
1115585
38160

Finding Probability Distribution

This tells us how frequently each word occurs, i.e. the probability of drawing that word from the text.

words.count('the')
Output:
79809

There are 79809 occurrences of 'the' in the list. Remember to apply lower() while tokenizing, otherwise 'The' and 'the' are counted as different words:

words = []
for line in lines:
    words += re.findall(r'\w+', line.lower())
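As a sanity check, here is what lower() changes, on a made-up two-sentence sample:

```python
import re

text = "The cat saw the dog. THE dog ran."

words_cased = re.findall(r'\w+', text)
words_lower = re.findall(r'\w+', text.lower())

print(words_cased.count('the'))  # 1  ('The' and 'THE' are counted separately)
print(words_lower.count('the'))  # 3
```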

Probability Distribution

len(words)/words.count('the')
Output:
13.978185417684722

Careful: this is the reciprocal of the probability. It says 'the' appears roughly once every 14 tokens; the probability itself is words.count('the')/len(words) ≈ 0.0715.
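A word's probability is its count divided by the total number of tokens. A minimal sketch on a made-up token list (the helper name `probability` is my own):

```python
# Hypothetical tiny token list standing in for the big.txt tokens.
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

def probability(word, words):
    """Relative frequency of `word` among all tokens."""
    return words.count(word) / len(words)

print(probability('the', words))  # 2/6
```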

Counts for the first 10 words in the vocabulary (the numerators of the probabilities):

for word in vocab[:10]:
    print(word, words.count(word))
Output:
susan 1
Tillage 1
shortly 21
enlivened 2
1720 1
victors 3
shipments 2
Go 100
constitution 63
blur 1

If we want the probability of every word:

from tqdm import tqdm
word_probability={}
for word in tqdm(vocab):
    word_probability[word] = float(words.count(word)/len(words))
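The loop above calls words.count(word) once per vocabulary entry, and each call scans the whole token list. A single-pass alternative with collections.Counter gives the same dictionary without tqdm (sketched here on a stand-in token list):

```python
from collections import Counter

# Stand-in for the big.txt tokens.
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# One pass tallies every word; no per-word rescans of the list.
counts = Counter(words)
total = len(words)
word_probability = {word: count / total for word, count in counts.items()}

print(word_probability['the'])  # 2/6
```

The probabilities sum to 1.0 over the whole vocabulary, which is a quick way to check the distribution is well formed.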
