
Brandon Rozek

Originally published at brandonrozek.com

Identifying Misspelled Words in your Dataset with Hunspell

This article is based on one written by Markus Konrad: https://datascience.blog.wzb.eu/2016/07/13/autocorrecting-misspelled-words-in-python-using-hunspell/

I assume in this article that you have hunspell and its Python integration installed. If not, please refer to the article mentioned above and follow the prerequisite steps.

This article grew out of the need to correct misspelled words in the Dress Attributes Dataset. I’ll share my initial pitfall with you, and what I ended up doing instead.

Background Information

Misspelled words are common when dealing with survey data or data where humans type in the responses manually. In the Dress Attributes Dataset this is apparent when looking at the sleeve lengths of the different dresses.


Word Frequency
sleevless 223
full 97
short 96
halfsleeve 35
threequarter 17
thressqatar 10
sleeveless 5
sleeevless 3
capsleeves 3
cap-sleeves 2
half 1
Petal 1
urndowncollor 1
turndowncollor 1
sleveless 1
butterfly 1
threequater 1

Ouch, so many misspelled words. This is when my brain starts listing all the ways I can automate the problem away, which is how I stumbled upon Markus’ post.

Automagically Correcting Data

First, I decided to completely ignore the warning Markus gives in his post and automatically correct every word in that column.

To begin the code, let’s import and create an instance of the spellchecker:
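
A minimal sketch of that setup, assuming the pyhunspell binding (imported as `hunspell`) and English dictionary files in the location commonly used on Debian or Ubuntu. The helper name `make_spellchecker` and both default paths are assumptions; adjust them for your system:

```python
def make_spellchecker(dic_path="/usr/share/hunspell/en_US.dic",
                      aff_path="/usr/share/hunspell/en_US.aff"):
    """Create a hunspell spellchecker from a dictionary/affix file pair."""
    # Imported inside the function so this module still loads on machines
    # where the hunspell binding is not installed.
    import hunspell

    # HunSpell takes the .dic (word list) path first, then the .aff (rules) path.
    return hunspell.HunSpell(dic_path, aff_path)
```

Calling `spellchecker = make_spellchecker()` then gives an object with `spell(word)` and `suggest(word)` methods.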


I modified his correct_words function so that it corrects a single word, which lets me apply it to each value in the SleeveLength column.
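
A single-word version might look roughly like the sketch below. The name `correct_word` is my own, `checker` is assumed to be any object with `spell` and `suggest` methods (such as a `hunspell.HunSpell` instance), and the suggestions are assumed to come back as plain strings, which may not hold for older versions of the binding:

```python
def correct_word(checker, word):
    """Return `word` if the checker accepts it, else the checker's first suggestion."""
    if checker.spell(word):
        # The word is in the dictionary, so leave it alone.
        return word

    suggestions = checker.suggest(word)
    # Blindly take the top suggestion; keep the original when there is none.
    return suggestions[0] if suggestions else word
```

Taking the first suggestion without looking at it is exactly the risky part: the checker has no idea the words are supposed to describe sleeves.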


Now let’s apply the function over the SleeveLength column of the dataset:
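
As a rough sketch, applying a one-word correction across a column could be wrapped like this. The name `correct_column` is hypothetical, `column` is assumed to be a pandas Series (anything with an `apply` method works), and `checker` is assumed to expose `spell` and `suggest`:

```python
def correct_column(column, checker):
    """Return a copy of `column` with each text value replaced by the checker's top suggestion."""

    def correct_one(value):
        # Missing or non-text cells (e.g. NaN) are passed through unchanged.
        if not isinstance(value, str):
            return value
        if checker.spell(value):
            return value
        suggestions = checker.suggest(value)
        return suggestions[0] if suggestions else value

    # Series.apply calls correct_one once per value and returns the results.
    return column.apply(correct_one)
```

With a DataFrame named `df` (an assumed name), this could be used as `df["SleeveLength"] = correct_column(df["SleeveLength"], spellchecker)`, followed by `df["SleeveLength"].value_counts()` to get a frequency table.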


Doing so creates the following series:

Word Frequency
sleeveless 232
full 97
short 96
half sleeve 35
three quarter 17
throatiness 10
cap sleeves 3
cap-sleeves 2
Petal 1
butterfly 1
turndowncollor 1
half 1
landownership 1
forequarter 1

As you might be able to tell, this process didn’t go as intended. landownership isn’t even a length of a sleeve!

Reporting Misspelled Items and Allowing User Intervention

This is when I have to remember that technology isn’t perfect. Instead of trusting the spellchecker’s first guess, we should decide for ourselves what the correct spelling of each word is.

Keeping that in mind, I modified the function again to take in a list of the data and return a dictionary whose keys are the misspelled words and whose values are lists of suggestions.
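
One way such a reporting function could be written is sketched below. The name `find_misspellings` is my own, and `checker` is again assumed to be an object with `spell` and `suggest` methods, such as a `hunspell.HunSpell` instance:

```python
def find_misspellings(words, checker):
    """Map each distinct misspelled word in `words` to the checker's suggestions."""
    misspelled = {}
    for word in words:
        # Skip missing/non-text entries and words that were already reported.
        if not isinstance(word, str) or word in misspelled:
            continue
        if not checker.spell(word):
            # Keep every suggestion so a human can pick the right one.
            misspelled[word] = checker.suggest(word)
    return misspelled
```

Nothing is changed in the data here; the function only reports, which leaves the final decision to the person reading the output.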


With that, I can use the function on my data. To do so, I convert the column’s values to a plain Python list (a pandas Series offers `tolist()` for this) and pass that list to the function.

The result pairs each misspelled value with the list of suggestions hunspell offers for it.

From here, you can analyze the output and do the replacements yourself:
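
As an illustration, the manual replacements could be kept in a small dictionary and applied in one pass. The mapping below is my own reading of the frequency table above, not the author's original list, and the names `CORRECTIONS` and `apply_corrections` are hypothetical:

```python
# Hand-written corrections, chosen by a person after reviewing the suggestions.
# Illustrative only: extend or change this mapping for your own data.
CORRECTIONS = {
    "sleevless": "sleeveless",
    "sleeevless": "sleeveless",
    "sleveless": "sleeveless",
    "thressqatar": "threequarter",
    "threequater": "threequarter",
}


def apply_corrections(values, corrections=CORRECTIONS):
    """Replace every value that has a manual correction; keep the rest as-is."""
    return [corrections.get(value, value) for value in values]
```

For a pandas column, `Series.replace(CORRECTIONS)` performs the same exact-match substitution without leaving pandas.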


What’s the Benefit?

This is where you ask “What’s the difference if it doesn’t automatically fix my data?”

When you have large datasets, it can be hard to spot every misspelled item by eye. This method gives you a list of all the items the spellchecker flags, so you can deal with them systematically.
