🎥 Watch the full PyCon AU 2025 talk here
When you work with human-entered data (registrations, surveys, customer forms, you name it!) you soon discover that people are very creative typists. Names, schools, companies, and addresses come in with abbreviations, nicknames, missing words, and typos galore.
That mess makes it hard to answer even simple questions like: “Do these two records refer to the same person?” or “How many participants came from this organisation?”
At PyCon AU 2025, I explored how different fuzzy matching techniques, from traditional algorithms to generative AI, can help make sense of that chaos.
The Fuzzy Matching Challenge
String comparison looks straightforward until you meet real-world data. “PLC Sydney” might really be “Presbyterian Ladies’ College Sydney.” “Certain Collage” is obviously a typo for “Certain College” (hopefully). And nicknames like Liz, Lizzy, and Elizabeth might all belong to the same person.
That’s where fuzzy matching comes in. Using a variety of techniques, we can rank how similar non-identical strings are and pick the most likely match. But which fuzzy matching algorithms are best suited to which types of data? And can generative AI play a part in this matching game?
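To make "ranking by similarity" concrete, here is a minimal sketch that uses only Python's standard-library `difflib`. The helper names (`similarity`, `rank_candidates`) and the candidate list are made up for illustration, and `difflib` is a stand-in here, not one of the libraries compared below.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def rank_candidates(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Sort candidates from most to least similar to the query."""
    scored = [(name, similarity(query, name)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate list, for illustration only.
schools = ["Certain College", "Central High School", "Carlton Grammar"]
ranking = rank_candidates("Certain Collage", schools)
best_name, best_score = ranking[0]
print(best_name, round(best_score, 2))
```

The top-ranked candidate is the "most likely match"; in practice you would also want a minimum score below which nothing is accepted.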
Comparing Algorithmic Approaches
I put six Python libraries to the test:
- TextDistance and Python-Levenshtein – classic edit-distance approaches.
- FuzzyWuzzy and RapidFuzz – hybrids that combine multiple distance metrics.
- Nicknames and PyNameMatcher – specialised tools for given-name variations.
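If "edit distance" is new to you, the sketch below is a plain-Python version of the classic Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. It is an illustrative implementation written for this post, not the code inside any of the libraries above, which are considerably faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


typo_distance = levenshtein("Cerdon Collage", "Cerdon College")
abbrev_distance = levenshtein("PLC Sydney", "Presbyterian Ladies' College Sydney")
print(typo_distance, abbrev_distance)
```

A one-letter typo costs a single edit, while an abbreviation sits many edits away from its expansion, which is why pure edit distance copes well with typos and badly with abbreviations.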
To test them, I generated around 100 fake student names with nicknames, misspellings, and swapped orderings. Then I measured how accurately each library matched them to their correct counterparts.
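A test harness for this kind of comparison can be quite small. The sketch below is an assumption about its general shape, not the harness from the talk: the roster, the messy inputs, and the function names are all invented, and `difflib` stands in for whichever library scorer you want to evaluate.

```python
from difflib import SequenceMatcher
from typing import Callable

Scorer = Callable[[str, str], float]


def best_match(query: str, roster: list[str], scorer: Scorer) -> str:
    """Return the roster entry that the scorer rates most similar to the query."""
    return max(roster, key=lambda name: scorer(query, name))


def accuracy(test_cases: list[tuple[str, str]], roster: list[str], scorer: Scorer) -> float:
    """Fraction of messy inputs whose best match is the expected clean name."""
    correct = sum(
        1 for messy, expected in test_cases
        if best_match(messy, roster, scorer) == expected
    )
    return correct / len(test_cases)


def difflib_ratio(a: str, b: str) -> float:
    """Stand-in scorer; swap in any library function with the same shape."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Invented names, for illustration only.
roster = ["Elizabeth Wong", "Sally-Anne Walker", "Mohammed Al-Farsi"]
test_cases = [
    ("Elizabeth Wnog", "Elizabeth Wong"),        # transposed letters
    ("Sally Anne Walker", "Sally-Anne Walker"),  # missing hyphen
    ("Mohamed Al Farsi", "Mohammed Al-Farsi"),   # spelling variant
]
score = accuracy(test_cases, roster, difflib_ratio)
print(f"accuracy: {score:.0%}")
```

Because every scorer is just a function of two strings, the same `accuracy` call can be repeated for each library to get comparable numbers.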
RapidFuzz came out ahead, matching almost every record correctly, and doing it fast! The edit-distance methods struggled most with multicultural names where order or character sets varied, and the nickname libraries were strong but less consistent overall.
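One reason plain character comparison struggles with swapped name order is that it compares strings position by position. A common workaround is to sort the words before comparing. FuzzyWuzzy and RapidFuzz both ship a `token_sort_ratio` scorer built on a similar idea; the sketch below is a standard-library approximation with a made-up function name, not their implementation.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> float:
    """Character-level similarity, sensitive to word order."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_token_ratio(a: str, b: str) -> float:
    """Similarity after lowercasing and sorting the words, so order is ignored."""
    def normalise(text: str) -> str:
        # Treat commas as separators so "Wong, Elizabeth" splits cleanly.
        return " ".join(sorted(text.lower().replace(",", " ").split()))
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


plain = plain_ratio("Wong Elizabeth", "Elizabeth Wong")
order_free = sorted_token_ratio("Wong Elizabeth", "Elizabeth Wong")
print(round(plain, 2), round(order_free, 2))
```

The order-free score treats the two spellings as identical, while the plain score penalises the swap.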
When Generative AI Shines
Algorithmic fuzzy matching is fast and accurate, but it only looks at characters, not meaning. That’s where I turned to Azure OpenAI Service for a different kind of help.
By feeding in real school-name data, I found that straight out of the box GPT-5 was exceptionally good at recognising and correcting school names, especially when they were abbreviated, misspelled, or included local school nicknames.
For example, it could confidently map:
- “PLC Syd” → “Presbyterian Ladies’ College Sydney”
- “Cerdon Collage” → “Cerdon College”
- “St Cats” → “St Catherine’s School, Waverley”
That level of contextual correction is almost impossible to achieve with pure algorithmic matching unless you maintain a custom dictionary of every possible variation. And who has time for that!
The trade-off, of course, is performance. Generative models are slower and costlier to run at scale. But when used selectively, just for ambiguous or hard-to-match cases, they can dramatically improve accuracy. Keep in mind that this works best for names, like schools, that are well documented on the internet, which isn’t true of the names of individual school students.
Can we have the best of both worlds?
Hybrid AI + algorithmic matching!
In practice, the best results came from a hybrid approach, using traditional fuzzy-matching algorithms for most cases, and bringing in Azure OpenAI only when the names got tricky. For example, RapidFuzz could quickly match “Lizzy Wong” to “Elizabeth Wong,” while the generative model was better at reasoning through ambiguous inputs like “Sally-Anne W.” or reversed multicultural name orders. By combining both, I could match almost every student record accurately, keeping the speed of algorithmic methods while adding the contextual understanding of generative AI.
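The hybrid idea boils down to a confidence threshold: accept the algorithmic match when its score is high enough, and hand everything else to a slower fallback. Here is a minimal sketch under several assumptions: `difflib` stands in for the fuzzy-matching library, the 0.85 threshold is arbitrary, and `stub_fallback` is a hypothetical placeholder where a real call to a generative model would go.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

Fallback = Callable[[str, list[str]], Optional[str]]


def hybrid_match(
    query: str,
    roster: list[str],
    fallback: Fallback,
    threshold: float = 0.85,  # arbitrary cut-off; tune it on your own data
) -> tuple[Optional[str], str]:
    """Return (matched name, which path produced it)."""
    scored = [
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in roster
    ]
    best_score, best_name = max(scored)
    if best_score >= threshold:
        return best_name, "algorithmic"
    # Low confidence: defer to the slower, costlier fallback.
    return fallback(query, roster), "fallback"


def stub_fallback(query: str, roster: list[str]) -> Optional[str]:
    """Hypothetical stand-in for a generative-model call, using canned answers."""
    canned = {"Sally-Anne W.": "Sally-Anne Walker"}
    return canned.get(query)


roster = ["Elizabeth Wong", "Sally-Anne Walker"]
easy = hybrid_match("Elizabeth Wnog", roster, stub_fallback)
hard = hybrid_match("Sally-Anne W.", roster, stub_fallback)
print(easy, hard)
```

Returning which path was taken makes it easy to count how often the fallback fires, which is what drives the cost of the hybrid setup.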
Try It Yourself
🎥 You can watch my full PyCon AU 2025 talk here:
What’s in a Name: Fuzzy Matching Techniques for Proper Nouns
📁 If you’d like to explore this further you can check out my fuzzy matching repo.
Take a look at the libraries and tools I mentioned above; they’re easy to install and experiment with in Python. If you’re already using Azure OpenAI, it’s worth testing how a small retrieval-augmented setup might complement your existing matching logic.
Chat to us!
💬 To chat more about AI solutions you can join the AI Foundry Discord, where advocates like me are chatting about the latest tools all the time.
Join the Azure AI Foundry Discord here
Good luck on your fuzzy matching adventures!