Hi DEV community!
I have a project idea that I would like to dive into, but I need your help defining some major points.
The project is about matching each string in one dataset (table) to the most similar string in another dataset, in Python.
For example:
Table 1

| name | city, country |
| --- | --- |
| Mark Smith | New York City, USA |
| Mirko Smirk | New York City, USA |
| John E. Doe | Paris, USA |
| Jane Doe | Paris, France |
Table 2

| name | city, country |
| --- | --- |
| Mirko S. | NYC, US |
| Mark S. | NYC, US |
| Jane D | Paris, France |
| J. Doe | Paris, US |
The idea is to match each person with their counterpart in the other table.
For example, row 1 of the first table should correctly match row 2 of the second table.
As you can see, there are no perfect matches, and a row may match on the first column but not on the second. The second column should carry more weight in the decision.
I first thought of an approach based on Levenshtein distance, using the FuzzyWuzzy package to calculate the differences between sequences. But I ran into issues like the following:
Table 1:
Franklin Delano Roosevelt
Franklin Da Turtle
Table 2:
FDR
Levenshtein distance would say that Franklin Da Turtle is a better match than Franklin Delano Roosevelt for FDR, simply because turning FDR into the shorter string needs fewer edits.
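Here's a minimal script showing what I mean with FuzzyWuzzy (exact scores may vary between versions, but the ranking is the problem):

```python
# Minimal illustration of the length bias in FuzzyWuzzy's plain ratio.
# pip install fuzzywuzzy python-Levenshtein
from fuzzywuzzy import fuzz

query = "fdr"
candidates = ["franklin delano roosevelt", "franklin da turtle"]

for candidate in candidates:
    # fuzz.ratio is a normalized Levenshtein similarity (0-100)
    print(candidate, fuzz.ratio(query, candidate))

# The shorter "franklin da turtle" scores higher than the correct match,
# because turning "fdr" into a short string needs fewer edits overall.
```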
In real-life data I will have many more columns to validate the decision against, but I'm stuck anyway.
Is there a better approach? I also thought of some vector-based NLP technique, but that isn't my area of expertise.
Any ideas will be well received.
Leave a comment if you know something better than the FuzzyWuzzy approach.
Top comments (4)
I am also a noob, but I have something in my head I want to share that may help you. Make a dictionary of the correct words you desire, then take each cell, match it against your dictionary, and compute an accuracy score. If the accuracy is good enough, replace the cell with the dictionary word. Maybe it helps.
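Something like this maybe (rough sketch with FuzzyWuzzy; the dictionary and the threshold are made up):

```python
# Rough sketch of the dictionary idea using FuzzyWuzzy's process helper.
from fuzzywuzzy import process

# Made-up dictionary of "correct" values
dictionary = ["New York City", "Paris", "USA", "France"]

def normalize(cell, threshold=80):
    # Find the dictionary entry most similar to the cell value
    match, score = process.extractOne(cell, dictionary)
    # Replace the cell only if the match is good enough
    return match if score >= threshold else cell

print(normalize("Pariss"))  # close enough -> "Paris"
print(normalize("NYC"))     # abbreviations may not clear the threshold
```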
Hi there, Frank
A couple of questions (that I had) before you proceed with what's proposed below.
The idea here assumes the answers to those questions are yes -> yes -> no.
You could write a couple of small methods (or just one, if this doesn't extend to more variations) that do two things:
Again, this assumes the rules do not change too much going from the table 1 form to the table 2 form. Hope it helps. If not, try to abbreviate, then use the regex module to remove punctuation and check the character sequence (maybe; rough sketch below). I think NLTK might be useful if you have a huuuuuuuuge dataset, with collocation finding and measures. But yes, if you don't have to go there yet, a simple method should do.
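For example, hypothetical helpers like these (names made up, just to show the shape):

```python
# Hypothetical helpers for the "abbreviate, strip punctuation, compare" idea.
import re

def abbreviate(name):
    # "New York City" -> "NYC": take the first letter of each word
    return "".join(word[0].upper() for word in name.split())

def strip_punctuation(text):
    # Drop anything that isn't a letter, digit, underscore or whitespace
    return re.sub(r"[^\w\s]", "", text)

# Compare a full form from table 1 against a short form from table 2
full, short = "New York City", "N.Y.C."
print(abbreviate(full) == strip_punctuation(short).upper())  # True
```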
Unfortunately I oversimplified the problem.
In the real data I will have to cross-reference a list of hotel names with my client's list of hotels.
Maybe my list says "Hotel Pennsylvania" and the other list says "Hotel Pennsylvania New York".
It's not that I will always have the full name; my client may have a fuller version, which is OK.
To match these two hotels I will use more columns, like city and country.
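Roughly what I have in mind if I stay on the FuzzyWuzzy route (the helper and the weights are arbitrary, just a sketch):

```python
# Sketch of weighted multi-column matching for the hotel case.
# token_set_ratio handles the "fuller name" case: it scores 100 when
# one name's tokens are a subset of the other's.
from fuzzywuzzy import fuzz

def match_score(row_a, row_b, weights=(0.5, 0.2, 0.3)):
    name_w, city_w, country_w = weights
    return (name_w * fuzz.token_set_ratio(row_a["name"], row_b["name"])
            + city_w * fuzz.token_set_ratio(row_a["city"], row_b["city"])
            + country_w * fuzz.token_set_ratio(row_a["country"], row_b["country"]))

mine = {"name": "Hotel Pennsylvania", "city": "New York", "country": "USA"}
client = {"name": "Hotel Pennsylvania New York", "city": "NYC", "country": "US"}
print(match_score(mine, client))  # name contributes a full 100 despite the extra words
```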
Gotcha! Then you might be able to combine the two. NLTK has a collocations module where you can check whether a word or phrase appears in a block of text and get that word back along with the n closest terms on either side of it. If you have both full words and abbreviations, e.g. NYC and New York City, you could use a method to transform one form into the other and then use collocations to check for both forms of something from table 1 in table 2. If I remember correctly, it is part of `nltk.collocations`, and there are methods for bi-, tri-, and quad-grams. Still new here, so not entirely sure how to share a code snippet, but I hope this jolts some ideas (rough attempt below).

P.S. For conjoined words that you may want to split (if there is cleaning involved), there is a `wordninja` module that does this, e.g. iamtiredtoday -> I am tired today. Good luck!
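An untested sketch of both, assuming `nltk` and `wordninja` are installed:

```python
# Untested sketch: NLTK collocations plus wordninja for conjoined words.
# pip install nltk wordninja
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import wordninja

# Split conjoined words first, if the data needs cleaning
print(wordninja.split("iamtiredtoday"))  # ['i', 'am', 'tired', 'today']

# Find the most strongly associated word pairs in a block of text
words = "hotel pennsylvania new york city near penn station new york".split()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(BigramAssocMeasures.pmi, 3))  # top 3 pairs by PMI
```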