Building our Address Verification product is an iterative process. We regularly measure its accuracy, understand where its greatest opportunities are, then work to make improvements. One example is how it selects the most appropriate street name from our database for a queried address.
A complication when working with address data is that people sometimes take shortcuts. Instead of writing “Martin Luther King” they write “MLK”, or just “Bridgewater” instead of “Bridgewater Crossings”. These abbreviations, as well as typos and phonetic variations, make it more challenging to identify the correct street name from a list of options. Try it for yourself by identifying the pairs of street names that are similar enough to be called a match:
- Susan vs Suzanne
- Angel vs Angeline
- Forest vs Forest Green
- Old vs Old Knoxville
- Penn vs Pennsylvania
- Royal Adams vs Royal Arms
- Hwy 3 vs Country Road 3
Once we have an idea of which street names are considered a match, we develop a set of features that can differentiate matches from non-matches.
When comparing strings (in our case, street names), there are plenty of off-the-shelf metrics that can be used as features, such as those provided by the jellyfish package. This package also provides a number of phonetic encodings. We can combine an encoding with a metric, such as Levenshtein distance, to measure the phonetic similarity between two street names.
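To make the idea of combining an encoding with a metric concrete, here is a minimal sketch. It assumes the jellyfish package is not available and substitutes hand-written, standard-library-only versions of American Soundex and Levenshtein distance. The function names (`soundex`, `levenshtein`, `phonetic_distance`) are illustrative, not the package's API or the code behind our product.

```python
# Illustrative stand-ins for a phonetic encoding plus an edit-distance metric.
# These are hand-rolled for the example; they are not the jellyfish API.

_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Encode one word with American Soundex (first letter plus three digits)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":          # h and w do not separate repeated codes
            continue
        code = _CODES.get(ch, "")   # vowels (and y) map to "" and reset prev
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def phonetic_distance(name_a: str, name_b: str) -> int:
    """Edit distance between the word-by-word Soundex codes of two names."""
    def key(name: str) -> str:
        return " ".join(code for code in map(soundex, name.split()) if code)
    return levenshtein(key(name_a), key(name_b))


if __name__ == "__main__":
    print(soundex("Susan"), soundex("Suzanne"))           # S250 S250
    print(phonetic_distance("Susan", "Suzanne"))          # 0
    print(phonetic_distance("Royal Adams", "Royal Arms")) # 1
```

A distance of 0 does not by itself mean the names should match. It is one feature among several that a model can weigh.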
Although these prebuilt metrics are useful, they only get you so far in a problem as nuanced as matching street names. It’s necessary to create custom metrics, such as consonant similarity, matching first letters, or longest common substring, to develop a highly accurate product.
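As a rough sketch of what such custom metrics might look like, the snippet below implements one plausible version of each. The exact definitions are assumptions made for illustration (for example, consonant similarity is taken here to be a `difflib` ratio over the names with vowels removed). They are not the definitions used in our product.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiou")


def consonants(name: str) -> str:
    """Keep only the consonant letters of a name, lowercased."""
    return "".join(c for c in name.lower() if c.isalpha() and c not in VOWELS)


def consonant_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the consonant skeletons of two names."""
    return SequenceMatcher(None, consonants(a), consonants(b)).ratio()


def same_first_letter(a: str, b: str) -> bool:
    """True when both names are non-empty and start with the same letter."""
    a, b = a.strip().lower(), b.strip().lower()
    return bool(a) and bool(b) and a[0] == b[0]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters shared by both names."""
    a, b = a.lower(), b.lower()
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            # Extend the run ending at the previous characters, or reset to 0.
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[-1])
        prev = curr
    return best


if __name__ == "__main__":
    print(longest_common_substring("Penn", "Pennsylvania"))  # 4
    print(same_first_letter("Hwy 3", "Country Road 3"))      # False
    print(consonant_similarity("Susan", "Suzanne"))
```

Each function returns a number or boolean that can be placed directly into a feature vector alongside the off-the-shelf metrics.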
Our Address Verification product was built to quickly provide customers with answers. Therefore, we need to maximize the accuracy of our street name similarity model within its time constraint. This affects two key choices: (1) the type of model to use and (2) which features to include.
Although a rules-based system could make fast predictions, it would quickly become complex and difficult to develop as more features are added to it. Fortunately, a machine learning model can take care of this by finding suitable combinations of features. When comparing the performance of different models, we found that some were too slow (neural networks) and some were too inaccurate (logistic regression), but support vector machines were in the Goldilocks zone. Because of their design, they can serve fast predictions that encode interactions between features.
The more features a model uses, the longer it takes to make a prediction. To help you find a suitable set of features, I have two suggestions: (1) recursive feature selection and (2) SHAP values. Either method can save you time as you search for the right set of features for your model.
Depending on how time-constrained your service is, it could also be worthwhile to add a threshold step before two strings are compared in a model. At Lob, we sometimes compare the street name from a queried address against over a thousand street names. Many non-matching street names can be ruled out using a single metric. However, as the cutoff for this metric becomes stricter, correct matches will start to be filtered out too. Finding the right metric(s) and cutoff for this threshold will help you to balance precision, recall, and speed.
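A threshold step like this can be sketched in a few lines. The example below assumes a character-level ratio from Python's `difflib` as the single cheap metric and a cutoff of 0.4. Both are placeholders chosen for illustration, not the metric or value we use.

```python
from difflib import SequenceMatcher


def cheap_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefilter(query: str, candidates: list[str],
              min_similarity: float = 0.4) -> list[str]:
    """Keep only candidates similar enough to be worth scoring with the model."""
    return [c for c in candidates
            if cheap_similarity(query, c) >= min_similarity]


if __name__ == "__main__":
    streets = ["Main", "Maine", "Oak", "Elm"]
    print(prefilter("Main", streets))  # ['Main', 'Maine']

    # The recall risk: an abbreviation scores poorly on a character metric,
    # so a strict cutoff removes a pair that should have matched.
    print(prefilter("MLK", ["Martin Luther King"]))  # []
```

The second call shows why the cutoff needs tuning against labeled pairs: a value that removes most clear mismatches can also remove abbreviations unless they are expanded beforehand.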
For a problem like comparing street names, it can be difficult to acquire a long list of good training examples. Since the data can be scarce, leave-one-out cross-validation can help you make the best use of it: each example is held out once for testing while the model trains on all the others.
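The mechanics of leave-one-out cross-validation can be shown with a toy example. The sketch below assumes a deliberately simple stand-in model (a cutoff halfway between the mean score of matches and the mean score of non-matches on a single similarity feature) and made-up scores. A real pipeline would plug in its own `fit` and `predict` functions.

```python
from statistics import mean


def fit(scores: list[float], labels: list[int]) -> float:
    """Toy model: cutoff halfway between the mean match and non-match score."""
    matches = [s for s, y in zip(scores, labels) if y == 1]
    non_matches = [s for s, y in zip(scores, labels) if y == 0]
    return (mean(matches) + mean(non_matches)) / 2


def predict(cutoff: float, score: float) -> int:
    """Label a pair as a match (1) when its score reaches the cutoff."""
    return int(score >= cutoff)


def leave_one_out_accuracy(scores: list[float], labels: list[int]) -> float:
    """Hold out each example once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(scores)):
        train_x = scores[:i] + scores[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, scores[i]) == labels[i]
    return correct / len(scores)


if __name__ == "__main__":
    # Made-up similarity scores; 1 = match, 0 = non-match.
    scores = [0.9, 0.8, 0.4, 0.2, 0.3, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    print(leave_one_out_accuracy(scores, labels))  # 5 of 6 folds are correct
```

With n examples the model is trained n times, which is affordable precisely because the dataset is small.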
When it comes time to evaluate your model, there’s no need to use the default 0.5 threshold for deciding whether a pair of street names is a match. By plotting your model’s precision and recall on the same chart at different thresholds (i.e., the x-axis is the threshold and the y-axis is the precision and recall scores), you can quickly find the threshold value that is most suitable for you.
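The numbers behind such a plot are straightforward to compute. The sketch below uses made-up model scores and labels, and prints a small table instead of drawing a chart so that it needs only the standard library. Treating precision as 1.0 when nothing is predicted as a match is a convention chosen here, not a universal rule.

```python
def precision_recall_at(scores: list[float], labels: list[int],
                        threshold: float) -> tuple[float, float]:
    """Precision and recall when pairs scoring >= threshold count as matches."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def sweep(scores: list[float], labels: list[int],
          thresholds: list[float]) -> list[tuple[float, float, float]]:
    """One (threshold, precision, recall) row per candidate threshold."""
    return [(t, *precision_recall_at(scores, labels, t)) for t in thresholds]


if __name__ == "__main__":
    # Made-up model outputs; 1 = true match, 0 = true non-match.
    scores = [0.95, 0.8, 0.6, 0.55, 0.3, 0.1]
    labels = [1, 1, 0, 1, 0, 0]
    for t, p, r in sweep(scores, labels, [i / 10 for i in range(1, 10)]):
        print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, raising the threshold from 0.5 to 0.7 lifts precision from 0.75 to 1.0 while recall drops from 1.0 to about 0.67, which is the kind of trade-off the plot makes visible.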
Standardizing the street names before they are compared can save you from creating new features and finding more training examples. For example, written numbers can be converted to numerals (six -> 6), abbreviations can be expanded (MLK -> Martin Luther King), and similar words can be standardized (North, Northern -> N).
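A standardization step of this kind can be sketched as a token-by-token lookup. The tables below are tiny, illustrative samples built around the examples above, not a complete or production mapping. Lowercasing the output and dropping punctuation are additional choices made here for simplicity.

```python
import re

# Small illustrative lookup tables; a real system would need far larger ones.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
                "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10"}
ABBREVIATIONS = {"mlk": "martin luther king"}
DIRECTIONALS = {"north": "n", "northern": "n", "south": "s", "southern": "s",
                "east": "e", "eastern": "e", "west": "w", "western": "w"}


def standardize(street_name: str) -> str:
    """Lowercase, split into tokens, and map each token to its standard form."""
    tokens = re.findall(r"[a-z0-9]+", street_name.lower())
    standardized = []
    for token in tokens:
        token = NUMBER_WORDS.get(token, token)
        token = ABBREVIATIONS.get(token, token)
        token = DIRECTIONALS.get(token, token)
        standardized.append(token)
    return " ".join(standardized)


if __name__ == "__main__":
    print(standardize("MLK"))             # martin luther king
    print(standardize("Six Mile"))        # 6 mile
    print(standardize("Northern Pines"))  # n pines
```

After this step, "North Pines" and "Northern Pines" become the same string, so the model never has to learn that they are equivalent.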
Training machine learning models can have significant accuracy and speed trade-offs. By choosing the right model and finding a suitable set of features, you might be able to achieve the right balance. If possible, add a threshold to your matching pipeline to save time by eliminating clear mismatches.
To learn more, read our recent article about parsing addresses with machine learning.