DEV Community

abraham poorazizi
How to choose the best geocoder for your project: a comparative analysis

Geocoding is a fundamental process in many location-based applications, from mapping and navigation to marketing and logistics. It consists of three main steps: (1) parsing the input data (e.g., addresses), (2) querying the reference database, and (3) assigning geographic coordinates, which can be either obtained directly from the reference database or calculated using interpolation techniques. The primary outputs of the geocoding process are geographic coordinates (also known as longitude and latitude pairs). In addition to these, there are other byproducts such as address components that result from the address normalization process. This process converts an arbitrary address into a normalized address and its components, such as street number, street name, city, postal code, region, and country, which you can use for consistent address formatting and data enrichment.
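As a hypothetical illustration (field names vary by provider), a normalizer might break a free-form address string into components and rebuild a consistently formatted label from them:

```python
# Hypothetical output of an address normalizer, for illustration only:
# a free-form input string broken into labeled components.
raw_address = "25 Evergreen Cr SW, Calgary AB T2Y 0B6"

components = {
    "street_number": "25",
    "street_name": "Evergreen Cr SW",
    "city": "Calgary",
    "region": "AB",
    "postal_code": "T2Y 0B6",
    "country": "Canada",
}

# A consistent label can then be rebuilt from the parts.
label = "{street_number} {street_name}, {city}, {region}, {postal_code}, {country}".format(
    **components
)
print(label)  # → 25 Evergreen Cr SW, Calgary, AB, T2Y 0B6, Canada
```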

Geocoding APIs provide a simple and efficient way for you to incorporate geocoding functionality into your applications, without having to create your own geocoding algorithms or reference databases. Amazon Location Service provides a geocoding API that offers two major benefits. Firstly, it provides a competitive pricing model that is up to 10 times cheaper than competitors, with the option to store and reuse geocoded data indefinitely. Secondly, it provides access to high-quality data from multiple data providers, giving you the flexibility to choose the provider with the best data quality for your project.

If you are wondering which data provider to use for your geocoding project, the answer depends on your area of interest. Each data provider may gather and curate their data using different techniques. For example, they may use a combination of authoritative sources, open data, and AI-assisted mapping to compile their databases. They may also use different geocoding algorithms. Therefore, the quality of the data sources and the accuracy of the geocoding methods have a direct effect on the quality of the output.

In this blog post, you will learn how to use quality metrics such as match rate, accuracy, and similarity to select the most suitable geocoder for your project. You can also use the evaluation framework presented here to compare different geocoders using your own data. By following these guidelines, you can make an informed decision on which geocoder to use, and ensure that your project has the necessary geocoding functionality to achieve its goals.

Evaluation framework

Amazon Location Service offers a geocoding API that sources data from multiple data providers including Esri, HERE, and GrabMaps, through its place index resource. In this blog post, you will create two place index resources using Esri and HERE as data providers and refer to them as "Esri geocoder" and "HERE geocoder". You will then evaluate their performance using the following quality metrics:

  • Match rate: it is the number of successfully geocoded addresses expressed as a percentage of the total number of submitted addresses. A higher match rate indicates a higher likelihood of successful address resolution.
  • Accuracy: it compares the geocoding results to the baseline and consists of two metrics:
    • Positional accuracy: it indicates how close each geocoded point is to the “true” location of the address, which is typically referred to as the baseline data or ground truth. It is determined by calculating the spatial distance between each geocoded point and the baseline. A shorter distance indicates higher positional accuracy of the geocoded points.
    • Lexical accuracy: it indicates how close each returned address label is to the “true” address label (i.e., the baseline data or ground truth). It is determined by calculating a similarity score based on the Levenshtein distance between two strings (i.e., geocoded point vs. the baseline). The Levenshtein distance is a string metric that measures the minimum number of single-character edits (insertions, deletions or substitutions) required to transform one string into another. A higher score indicates higher lexical accuracy of the geocoded points.
  • Similarity: it performs a pairwise comparison between two different geocoders and consists of two metrics:
    • Positional similarity: it indicates how similar two sets of geocoded points are in terms of their spatial positions. It is determined by calculating the pairwise distance between each point in the two sets. A shorter pairwise distance indicates a higher level of similarity between the two geocoders and a higher likelihood that they will produce equivalent geographic coordinates.
    • Lexical similarity: it indicates how similar two sets of geocoded points are in terms of their format and spelling. It is measured by calculating pairwise similarity scores, using the Levenshtein distance, between pairs of geocoded points. A higher score indicates a higher level of similarity between the two geocoders and a higher likelihood that they will produce equivalent address labels.
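The positional metrics above all reduce to a distance between two coordinate pairs. As a rough sketch, the haversine formula gives the great-circle distance in meters between a geocoded point and its baseline (the walkthrough below instead projects to a local CRS, which is more convenient for bulk computation):

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    r = 6371000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# A geocoded point a few tens of meters from its baseline scores well on
# positional accuracy; one several kilometers away indicates a bad match.
d = haversine_m(-114.100897, 50.921581, -114.100500, 50.921581)
print(round(d, 1))
```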

The general process for comparing different geocoders involves the following steps:

  • Obtain a set of input addresses that need to be geocoded. Pick addresses that are relevant to your needs.
  • Obtain a set of corresponding baseline addresses, which represent the "true" location and label of the input addresses. This step may require manual cleaning and matching to ensure that each baseline address corresponds to the correct input address.
  • Geocode the input addresses using the geocoders that you want to evaluate.
  • Calculate the match rate to compare the overall effectiveness of different geocoders.
  • Calculate the accuracy metrics to compare how close the geocoding results are to the baseline.
  • Calculate the similarity metrics to measure the level of similarity between the two geocoders.
  • Analyze the results to determine which geocoder performs best based on your use case.
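The first metric in the steps above can be sketched as a small helper; `match_rate` is a hypothetical name, and the result dictionaries mirror the standardized records that the walkthrough's geocoding functions return:

```python
# Minimal sketch of the match-rate calculation; "match_rate" is a
# hypothetical helper, not part of any SDK.
def match_rate(results):
    """Percentage of inputs that produced geographic coordinates."""
    matched = sum(1 for r in results if r.get("longitude") is not None)
    return 100 * matched / len(results)

# e.g. one geocoder's results collected as a list of dicts:
results = [{"longitude": -114.1, "latitude": 50.9}, {"longitude": None}]
print(f"match rate: {match_rate(results)}%")  # → match rate: 50.0%
```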

Evaluation scenarios

You will follow three test scenarios to compare the geocoders. You will use a set of 1,000 Canadian addresses, randomly selected (out of 562,000 residential addresses) from a dataset available on the City of Calgary’s open data portal, as the input to the geocoding process. Consider this dataset as the baseline for positional accuracy evaluation for all the scenarios. For each scenario, you will collect a set of corresponding Canada Post-verified addresses as the baseline to measure lexical accuracy. The test scenarios are:

  • Scenario 1: use the input data as is for this scenario. The input data is in CSV format with four fields: id, address, latitude, and longitude. The address field does not include the city, province, postal code, or country. The following shows the first five rows of the input data for the first scenario.
id,address,latitude,longitude
85,25 EVERGREEN CR SW,50.9215815275547,-114.100897916366
776,8608 METIS TR NE,51.1298266872992,-113.96923260807
916,121 SADDLEMEAD RD NE,51.1282839861152,-113.946243034372
1103,22 SADDLEBACK RD NE,51.1283490668448,-113.950394401587
1173,179 SADDLEMEAD RD NE,51.1288607473867,-113.944666215123
  • Scenario 2: enrich the original input data programmatically by appending city, province, and country to the address field. For example, 25 EVERGREEN CR SW will be transformed into 25 EVERGREEN CR SW, Calgary, AB, Canada. You will use the enriched data as the input data for this scenario.
  • Scenario 3: In Calgary, the last part of the street name in an address indicates a city quadrant, such as SW for Southwest or NE for Northeast. For this scenario, use the input data from Scenario 2 and intentionally remove the "E" and "W" from the quadrant indicator to create misspelled addresses. For example, 25 EVERGREEN CR SW, Calgary, AB, Canada will be turned into 25 EVERGREEN CR S, Calgary, AB, Canada.
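The Scenario 3 transformation can be demonstrated in isolation; this is the same regex substitution used later in the walkthrough, wrapped in a small helper for clarity:

```python
import re

# Strip "E"/"W" from the quadrant suffix to simulate a misspelling,
# mirroring the transformation described for Scenario 3.
pattern = re.compile("[ew]", re.IGNORECASE)

def misspell_quadrant(address):
    parts = address.split()
    parts[-1] = pattern.sub("", parts[-1])  # "SW" -> "S", "NE" -> "N"
    return " ".join(parts)

print(misspell_quadrant("25 EVERGREEN CR SW"))  # → 25 EVERGREEN CR S
```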

Walkthrough

Prerequisites

To complete this walkthrough, you will need access to an AWS account. Create an AWS account if you do not have one already. Next, create an IAM user with sufficient permissions to call the SearchPlaceIndexForText API. Then, create an access key ID and secret access key for Command Line Interface (CLI) access.

You will also need access to a Jupyter Notebook environment, with Python 3.9, for geocoding addresses and analyzing the results. You can use SageMaker Studio Lab, a free development environment to learn and experiment with data analysis and machine learning.

Additionally, you will need to download two datasets: (1) the residential address dataset (i.e., the input data) and (2) the City of Calgary’s boundary dataset. After downloading them, add both files to the project root directory of your Jupyter Notebook environment.

Finally, you will need access to Canada Post APIs to get Canada Post-verified addresses, which you will use as the baseline for lexical accuracy evaluations.

It is important to note that you pay for what you use when calling the Amazon Location Service and Canada Post APIs. Visit the Amazon Location Service pricing page and the Canada Post pricing page to estimate the cost of this experiment.

Create Amazon Location resources

Head to the Amazon Location Service page in the AWS Management Console and create two Amazon Location Service’s place index resources called esri-geocoder and here-geocoder, with Esri and HERE as data providers (Figure 1).

Figure 1. Create two place index resources with Esri and HERE as data providers.

Geocode addresses

In this step, you will use Esri and HERE geocoders to geocode 1,000 residential addresses under three test scenarios. You will also collect a set of corresponding Canada Post-verified addresses as the baseline for lexical accuracy evaluation.

To start, open a new Terminal and create a .env file in your project root directory. Next, enter the AWS access key and secret access key that you created earlier, the AWS Region that you used to deploy the AWS resources, and your Canada Post API key.

AWS_ACCESS_KEY=YOUR_AWS_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
AWS_REGION=YOUR_AWS_REGION
CANADA_POST_API_KEY=YOUR_CANADA_POST_API_KEY

Then, create a new Notebook and install the following dependencies:

! pip3 install boto3
! pip3 install python-dotenv
! pip3 install requests
! pip3 install pandas

boto3 is the AWS Python SDK that allows you to interact with AWS services. python-dotenv is a library that allows you to load environment variables from a .env file into your Notebook. requests is a Python library for making HTTP requests and handling responses in an easy and efficient way. pandas is a Python library for data manipulation and analysis that provides data structures for efficiently storing and working with large datasets.

Import the necessary libraries afterwards.

import boto3
import csv
from dotenv import load_dotenv
import os
import pandas as pd
import re
import requests

Next, load the environment variables and initialize a boto3 client to interact with Amazon Location Service’s geocoding API.

# Load environment variables from .env file
load_dotenv()

# Create a client for Amazon Location service
amazon_location_client = boto3.client(
    "location",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=os.getenv("AWS_REGION"),
)

Then, create a function to call Esri and HERE geocoders, parse the API response, and return the input address ID, address components, and assigned geographic coordinates. Although geocoders usually return more than one (candidate) match for a given query to deal with ambiguity issues, configure them to return only one result per address for this experiment.

def amazon_location_geocoder(geocoder, address):
    # Search the Amazon Location Service place index for the given address text
    response = amazon_location_client.search_place_index_for_text(
        IndexName=geocoder, MaxResults=1, Text=address["text"]
    )

    # Initialize fields to None so unmatched addresses still return a well-formed record
    street_number = street_name = city = province = None
    postal_code = country = label = longitude = latitude = None

    # Extract geocoding results from the API response
    for result in response["Results"]:
        place = result["Place"]
        street_number = place.get("AddressNumber")
        street_name = place.get("Street")
        city = place.get("Municipality")
        province = place.get("Region")
        postal_code = place.get("PostalCode")
        country = place.get("Country")
        label = place.get("Label")
        geometry = place.get("Geometry")
        if geometry:
            longitude = geometry.get("Point", [None, None])[0]
            latitude = geometry.get("Point", [None, None])[1]

    # Return geocoding results in a standardized format
    return [
        {
            "id": address["id"],
            "street_number": street_number,
            "street_name": street_name,
            "city": city,
            "province": province,
            "postal_code": postal_code,
            "country": country,
            "label": label,
            "longitude": longitude,
            "latitude": latitude,
        }
    ]

Next, create a function to call the Canada Post APIs to get verified addresses. This function will first find addresses matching the input and return their IDs. It will then retrieve the full address details based on the IDs and return the input address IDs and address components. Note that Canada Post APIs do not return geographic coordinates.

def canada_post_geocoder(address):
    # Search the Canada Post API for the given address text
    url = f"https://ws1.postescanada-canadapost.ca/addresscomplete/interactive/find/v2.10/json.ws?key={os.getenv('CANADA_POST_API_KEY')}&provider=AddressComplete&package=Interactive&service=Find&version=2.1&endpoint=json.ws&SearchTerm={address['text']}&MaxSuggestions=1"
    response = requests.request("GET", url)
    data = response.json()
    id = data[0]["Id"]

    # Retrieve full geocoded address details from Canada Post API
    url = f"https://ws1.postescanada-canadapost.ca/addresscomplete/interactive/retrieve/v2.11/json.ws?key={os.getenv('CANADA_POST_API_KEY')}&provider=AddressComplete&package=Interactive&service=Retrieve&version=2.11&endpoint=json.ws&Id={id}"
    response = requests.request("GET", url)
    results = response.json()

    # Initialize fields to None in case no English-language record is returned
    street_number = street_name = city = province = None
    postal_code = country = label = None

    # Extract geocoding results from the API response
    for result in results:
        if result.get("Language") == "ENG":
            street_number = result.get("BuildingNumber")
            street_name = result.get("Street")
            city = result.get("City")
            province = result.get("ProvinceName")
            postal_code = result.get("PostalCode")
            country = result.get("CountryName")
            label = result.get("Label")

    # Return geocoding results in a standardized format
    return [
        {
            "id": address["id"],
            "street_number": street_number,
            "street_name": street_name,
            "city": city,
            "province": province,
            "postal_code": postal_code,
            "country": country,
            "label": label,
            "longitude": None,
            "latitude": None,
        }
    ]

Then, create a function to store the results in CSV format for analysis.

def save_csv(data, file_name):
    # Normalize JSON data into a flat table
    df = pd.json_normalize(data)
    # Save the data as a CSV file
    df.to_csv(file_name, encoding="utf-8", index=False)

Finally, run the three test scenarios against the geocoders and Canada Post APIs and store the results in CSV format. For Scenario 1, use the input data as-is.

# Scenario 1: address as is (incomplete)

with open("input/input-addresses.csv", newline="") as csvfile:
    amazon_location_esri_result = []
    amazon_location_here_result = []
    canada_post_result = []

    # Read CSV file and iterate over rows
    reader = csv.reader(csvfile, delimiter=",", quotechar="|")

    # Skip header row
    next(reader, None)
    for row in reader:
        print(row[0])

        # Create a dictionary containing the address information
        address = {"id": row[0], "text": f"{row[1]}"}

        try:
            # Call geocoding functions for Amazon Location Service with Esri and HERE providers,
            # and Canada Post
            amazon_location_esri_result.append(
                amazon_location_geocoder("esri-geocoder", address)
            )
            amazon_location_here_result.append(
                amazon_location_geocoder("here-geocoder", address)
            )
            canada_post_result.append(canada_post_geocoder(address))
        except Exception as e:
            print(e, row)

    # Save the geocoding results to CSV files
    save_csv(amazon_location_esri_result,
             "output/esri-geocoding-result-scenario-1.csv")
    save_csv(amazon_location_here_result,
             "output/here-geocoding-result-scenario-1.csv")
    save_csv(canada_post_result,
             "output/canada-post-geocoding-result-scenario-1.csv")

For Scenario 2, enrich the original input data programmatically by appending city, province, and country to the address field. Use the following code to configure the address and store the results in separate CSV files, for example esri-geocoding-result-scenario-2.csv.

...

    for row in reader:
        ...

        address = {"id": row[0], "text": f"{row[1]}, Calgary, AB, Canada"}

        ...

For Scenario 3, use the input data from Scenario 2 and intentionally remove the "E" and "W" from the quadrant indicator (i.e., the last part of the address) to create misspelled addresses. Use the following code to first define a regular expression pattern outside the loop, then configure the address and store the results in separate CSV files, for example esri-geocoding-result-scenario-3.csv.

with open("input/input-addresses.csv", newline="") as csvfile:
    ...

    # Define a regular expression pattern to match the letters "e" or "w", ignoring case
    pattern = re.compile("[ew]", re.IGNORECASE)

    ...

    for row in reader:
        ...

        # Split the address in the current row by whitespace
        address = row[1].split()
        # Retrieve the last item in the "address" list and remove any "e" or "w" characters found using the regular expression pattern
        address[-1] = re.sub(pattern, "", address[-1])
        # Rejoin the modified "address" list into a single string with whitespace in between each item
        misspelled_address = " ".join(address)

        address = {
            "id": row[0], "text": f"{misspelled_address}, Calgary, AB, Canada"}

        ...

The final output will consist of nine CSV files that contain the geocoding results across three test scenarios.

Analyze results

In this step, you will assess the performance of the Esri and HERE geocoders across the three test scenarios using the match rate, positional and lexical accuracy, and positional and lexical similarity metrics. You will consider the residential property dataset from the City of Calgary as the baseline for positional accuracy evaluation and the geocoding results from Canada Post as the baseline for lexical accuracy evaluation.

To start, install the following dependencies:

! pip3 install pandas
! pip3 install geopandas
! pip3 install levenshtein

geopandas is a library that extends the functionality of pandas to include spatial data analysis and visualization capabilities. levenshtein is a string comparison library that provides functions to calculate the minimum edit distance and similarity scores between two strings. Minimum edit distance, also known as Levenshtein distance, is a measure of the difference between two sequences of characters, which is calculated by determining the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one sequence into the other.
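As a self-contained illustration of the metric (the walkthrough uses the `levenshtein` package's `ratio`, which applies a slightly different normalization), the edit distance and a simple normalized similarity score can be computed as:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalized score in [0, 1]; identical strings score 1."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # → 3
```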

Import the necessary libraries afterwards.

import pandas as pd
import geopandas as gpd
from Levenshtein import ratio

First, calculate the match rate for both geocoders across the three test scenarios. The following code loads the geocoding results from the HERE geocoder for Scenario 1 and calculates the match rate.

# Load data into a DataFrame
df = pd.read_csv("output/here-geocoding-result-scenario-1.csv")

# Add a new column to distinguish between matched and unmatched results
df["is_geocoded"] = df["longitude"].apply(
    lambda x: "yes" if not pd.isna(x) else "no")

# Calculate Match Rate
all_addresses_count = df.shape[0]
matched_addresses_count = df[df["is_geocoded"] == "yes"].shape[0]
match_rate = matched_addresses_count * 100 / all_addresses_count
print(f"match rate: {match_rate}%")

Next, conduct a point-in-polygon operation to calculate how many of the geocoded addresses fall inside the City of Calgary boundaries. Point-in-polygon operation is a spatial operation that determines if a point falls within a specific polygon such as a city or country boundary.
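Under the hood, a point-in-polygon test is commonly implemented with ray casting: count how many polygon edges a ray from the point crosses. The following is a minimal pure-Python illustration of that idea; geopandas' `within`, used in the walkthrough, handles the robust general case (holes, multipolygons, edge cases):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: toggle on each edge crossed by a ray toward +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the point's y level
            # x-coordinate where the edge crosses that level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # → True
```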

# Convert geocoded addresses into a GeoDataFrame and set the coordinate reference system to EPSG:4326
geocoded_addresses = df[df["is_geocoded"] == "yes"]
geocoded_addresses = gpd.GeoDataFrame(
    geocoded_addresses,
    geometry=gpd.points_from_xy(
        geocoded_addresses["longitude"], geocoded_addresses["latitude"]
    ),
).set_crs("epsg:4326")

# Load the city boundary dataset into a GeoDataFrame and set the coordinate reference system to EPSG:4326
city_boundary = gpd.read_file(
    "input/city-boundary.geojson").set_crs("epsg:4326")
#  Create a single geometry representing the entire city boundary
city_boundary = city_boundary["geometry"].unary_union

# Calculate the proportion of geocoded addresses that fall within the city boundaries
geocoded_addresses_within_city = geocoded_addresses[
    geocoded_addresses["geometry"].within(city_boundary)
]
proportion = round(len(geocoded_addresses_within_city)
                   * 100 / all_addresses_count, 1)
print(
    f"proportion of geocoded results that fall within city boundaries: {proportion}%")

For the next analysis, you will dive deeper into how much the geocoded results deviate from the baseline. First, calculate the spatial distance between each geocoded point and the baseline for both the Esri and HERE geocoders across the three test scenarios. The following code loads the baseline data and the geocoding results from the HERE geocoder for Scenario 1, and calculates the distance between each geocoded point and its corresponding baseline.

# Load data into DataFrames
baseline = pd.read_csv("input/input-addresses.csv")
geocoder = pd.read_csv("output/here-geocoding-result-scenario-1.csv")

# Convert DataFrames into GeoDataFrames with a local projection system (EPSG:3776 for Alberta) to calculate distances in meters
baseline = (
    gpd.GeoDataFrame(
        baseline,
        geometry=gpd.points_from_xy(
            baseline["longitude"], baseline["latitude"]),
    )
    .set_crs("epsg:4326")
    .to_crs("epsg:3776")
)
geocoder = (
    gpd.GeoDataFrame(
        geocoder,
        geometry=gpd.points_from_xy(
            geocoder["longitude"], geocoder["latitude"]),
    )
    .set_crs("epsg:4326")
    .to_crs("epsg:3776")
)

# Calculate the distance between the geocoded points and the baseline
baseline["distance"] = baseline["geometry"].distance(geocoder["geometry"])

# Calculate summary statistics
summary_stats = baseline["distance"].describe()
print(summary_stats)

Then, classify the distance values into five groups based on distance thresholds (10m, 100m, 1km, 10km, and 10km+) to better understand their distribution.

# Classify distance values into 5 groups to better understand their distributions
labels = ["10m", "100m", "1km", "10km", "10km+"]
classes = [0, 10, 100, 1000, 10000, float("inf")]

baseline["distance_class"] = pd.cut(
    baseline["distance"], bins=classes, labels=labels, include_lowest=True
)

# Calculate summary statistics
counts = baseline["distance_class"].value_counts(sort=False)
print(counts)

summary_stats = baseline["distance_class"].describe()
print(summary_stats)

For the next analysis, evaluate the positional similarity of the geocoded results. The process is similar to the positional accuracy evaluation. However, instead of comparing geocoded results against a baseline, compare the results of the two geocoders. Similar to the positional accuracy analysis, classify the distance values into five groups to better understand their distribution.

Next, measure how similar returned address labels are to the baseline for both Esri and HERE geocoders across three test scenarios. Note that you can directly obtain labels from the label property returned by geocoders (i.e., provided labels) or construct them based on properties like street number, street name, city, province, postal code, and country using a consistent format (i.e., constructed labels). You will examine the impact of “provided” and “constructed” labels in this analysis.

First, calculate a similarity score for each geocoded point using provided labels. The following code loads the baseline data and the geocoding results from Esri geocoder for Scenario 1, and calculates a lexical accuracy score for each geocoded point.

# Load data into DataFrames
baseline = pd.read_csv("output/canada-post-geocoding-result-scenario-1.csv")
geocoder = pd.read_csv("output/esri-geocoding-result-scenario-1.csv")

# Rename the label column in all DataFrames and join them
baseline = baseline[["id", "label"]].rename(
    columns={"label": "label_baseline"})
geocoder = geocoder[["label"]].rename(columns={"label": "label_geocoder"})
df = pd.concat([baseline, geocoder], axis=1)

# Calculate lexical accuracy
df["lexical_accuracy"] = df.apply(
    lambda x: ratio(x["label_baseline"], x["label_geocoder"]), axis=1
)

# Calculate summary statistics
summary_stats = df["lexical_accuracy"].describe()
print(summary_stats)

Then, follow the same procedure as before and compute a similarity score for each record using constructed labels. The only difference here is that you will construct the label using address components. Use the following code to construct the labels.

# Construct labels and join DataFrames
baseline["label_baseline"] = baseline.apply(
    lambda x: f"{x['street_number']} {x['street_name']}, {x['city']}, {x['province']}, {x['postal_code']}, {x['country']}",
    axis=1,
)
geocoder["label_geocoder"] = geocoder.apply(
    lambda x: f"{x['street_number']} {x['street_name']}, {x['city']}, {x['province']}, {x['postal_code']}, {x['country']}",
    axis=1,
)
df = pd.concat([baseline, geocoder], axis=1)

Finally, classify the scores into four groups (0.25, 0.5, 0.75, 1) to better understand their distribution.

# Classify similarity scores into 4 groups to better understand their distributions
labels = ["0.25", "0.5", "0.75", "1"]
classes = [0, 0.25, 0.5, 0.75, 1]

df["lexical_accuracy_class"] = pd.cut(
    df["lexical_accuracy"], bins=classes, labels=labels, include_lowest=True
)

# Calculate summary statistics
counts = df["lexical_accuracy_class"].value_counts(sort=False)
print(counts)

summary_stats = df["lexical_accuracy_class"].describe()
print(summary_stats)

For the final analysis, evaluate the lexical similarity of the geocoded results. The process is similar to the lexical accuracy evaluation, but instead of comparing geocoded results against a baseline, compare the results of the two geocoders. Like the lexical accuracy analysis, classify the similarity scores into four groups to better understand their distribution.

Evaluation

To evaluate the match rate, you geocoded the input addresses and performed a point-in-polygon test to determine if the returned geographic coordinates fell within the City of Calgary boundaries. In Scenario 1, the results indicate that HERE geocoder performed the best, achieving an overall match rate of 99.2%, with 84.1% of geocoded addresses falling inside the city boundaries (Table 1). In contrast, Esri geocoder had a match rate of 60.8%, with only 3.3% of the geocoded addresses falling within the city boundaries.

For Scenario 2 and Scenario 3, Esri geocoder achieved the highest match rate at 100%, marginally outperforming HERE geocoder by 2.5% and 3.3%, respectively.

| Scenario | Metric | Esri | HERE |
| --- | --- | --- | --- |
| Scenario 1 | Match rate | 60.80% | 99.20% |
| Scenario 1 | Within city bounds | 3.30% | 84.10% |
| Scenario 2 | Match rate | 100% | 97.50% |
| Scenario 2 | Within city bounds | 100% | 97.40% |
| Scenario 3 | Match rate | 100% | 96.70% |
| Scenario 3 | Within city bounds | 99.90% | 96.40% |

Table 1. Match rate of geocoded results.

For the positional accuracy test, you estimated the deviation of geocoding results from the baseline by calculating the spatial distance between each geocoded point and its corresponding baseline across three test scenarios. You then classified distance values into five groups to better understand their distribution.

In Scenario 1, the results (Table 2) suggest that HERE geocoder outperformed Esri geocoder with a median error distance of 18.49m (vs. 6,157km). The classification results (Figure 2) show that about 70% of geocoded points returned by HERE geocoder are within 100m of the baseline. However, geocoded results from Esri geocoder are mostly 10km away from the baseline.

For Scenario 2 and Scenario 3, the data suggests that Esri geocoder performed slightly better, with median error distances of 6.4m (vs. 16.79m) and 7.01m (vs. 17.01m), respectively. Half of the geocoded points by Esri geocoder fell between about 3m and 24m (vs. 13m and 75m) for Scenario 2 and between about 3m and 41m (vs. 14m and 94m) for Scenario 3. The classification results (Figure 2) show that about 60% of geocoded points returned by Esri geocoder are within 10m of the baseline, while more than 65% of geocoded points returned by HERE geocoder are within 100m of the baseline.

| Metric | Scenario 1: Esri | Scenario 1: HERE | Scenario 2: Esri | Scenario 2: HERE | Scenario 3: Esri | Scenario 3: HERE |
| --- | --- | --- | --- | --- | --- | --- |
| Min (m) | 0.53 | 0.47 | 0 | 0.47 | 0 | 0.47 |
| Max (m) | 22,575,139.36 | 22,517,070.99 | 1,519.57 | 45,140.98 | 180,513.23 | 52,277.26 |
| Median (m) | 6,156,974.76 | 18.49 | 6.4 | 16.79 | 7.01 | 17.01 |
| IQR (Q3-Q1) (m) | 15,847,869.85 | 238.73 | 21.33 | 61.62 | 37.72 | 80.55 |
| Mean (m) | 7,976,346.72 | 516,877.99 | 58.26 | 248.63 | 351.37 | 511.25 |
| Standard deviation (m) | 6,880,389.06 | 1,772,337.79 | 168.38 | 1,726.07 | 5,791.3 | 3,015.81 |

Table 2. The descriptive statistics of positional accuracy for Esri and HERE geocoders.

The distribution of error distances computed between each geocoded point and its corresponding baseline for each geocoder
Figure 2. The distribution of error distances computed between each geocoded point and its corresponding baseline for each geocoder across three test scenarios.

For the next analysis, you evaluated the positional similarity between the geocoders using relative distances. You then classified distance values into five groups to better understand their distribution.

The results indicate that although the two geocoders performed differently in Scenario 1, with a median pairwise distance of 6,608km, they showed strong similarities in Scenario 2 and Scenario 3, with a median pairwise distance of 12m and 13m, respectively (Table 3). In Scenario 1, the data shows a great variation of pairwise distances between geocoded points, with 94% of the points being more than 10km apart (Figure 3). However, for more than 70% of the geocoded results in the second and third scenarios, the two geocoders generated points within a 100m distance, with over 30% of those points being closer than 10m.

| | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| Min (m) | 0 | 0.03 | 0.03 |
| Max (m) | 27,023,925.72 | 45,146.2 | 173,911.89 |
| Median (m) | 6,608,269.48 | 12.03 | 13.04 |
| IQR (Q3-Q1) (m) | 16,208,528.66 | 38.77 | 107.46 |
| Mean (m) | 8,003,303.66 | 241.14 | 765.18 |
| Standard deviation (m) | 7,079,866.15 | 1,727.7 | 6,389.32 |

Table 3. The descriptive statistics of positional similarity between Esri and HERE geocoders across three test scenarios.

Figure 3. The distribution of distances between paired geocoded points returned by Esri and HERE geocoders across three test scenarios.

For lexical accuracy evaluation, you measured the similarity between the provided and constructed address labels and the baseline by calculating a score between 0 and 1. You then classified the scores into four groups to better understand their distribution.
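As an illustration, a similarity score in [0, 1] can be computed with a sequence-matching ratio. The post does not specify the exact metric, the component key names, or the score class breakpoints, so all three are assumptions here:

```python
from difflib import SequenceMatcher

def lexical_score(label: str, baseline: str) -> float:
    """Similarity in [0, 1] between two address labels (case-insensitive)."""
    return SequenceMatcher(None, label.lower().strip(), baseline.lower().strip()).ratio()

def construct_label(components: dict) -> str:
    """Assemble an address label from individual components (hypothetical key names)."""
    keys = ["street_number", "street_name", "city", "region", "postal_code", "country"]
    return ", ".join(str(components[k]) for k in keys if components.get(k))

def classify_score(score: float) -> str:
    """Bucket a score into four classes (assumed breakpoints)."""
    for upper, name in [(0.25, "0-0.25"), (0.5, "0.25-0.5"), (0.75, "0.5-0.75")]:
        if score < upper:
            return name
    return "0.75-1"
```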

The results suggest that HERE geocoder produced the most similar address labels to the baseline in Scenario 1, with median scores of 0.60 (vs. 0.0) and 0.80 (vs. 0.36) when using provided and constructed labels, respectively (Table 4). Figure 4 shows that for provided labels (i.e., the address labels returned by the geocoders), the majority of the geocoded results (~60%) returned by Esri and HERE geocoders achieved scores of 0.25 and 0.75, respectively. However, using constructed labels (i.e., address labels assembled from individual address components) significantly improved the scores, raising them to 0.5 and 1 for Esri and HERE geocoders, respectively (Figure 5).

In Scenario 2 and Scenario 3, both geocoders performed similarly, with over 70% of the geocoded results achieving a lexical accuracy score of 0.75 or higher. It is also evident that both geocoders performed significantly better when using constructed labels.

| Statistic | Labels | Scenario 1: Esri | Scenario 1: HERE | Scenario 2: Esri | Scenario 2: HERE | Scenario 3: Esri | Scenario 3: HERE |
|---|---|---|---|---|---|---|---|
| Min | Provided labels | 0 | 0 | 0 | 0 | 0 | 0 |
| Min | Constructed labels | 0.19 | 0.21 | 0.24 | 0.26 | 0.25 | 0.22 |
| Max | Provided labels | 0.79 | 0.73 | 0.79 | 0.73 | 0.79 | 0.73 |
| Max | Constructed labels | 0.9 | 0.9 | 0.91 | 0.91 | 0.91 | 0.9 |
| Median | Provided labels | 0 | 0.6 | 0.71 | 0.64 | 0.67 | 0.64 |
| Median | Constructed labels | 0.36 | 0.8 | 0.87 | 0.85 | 0.86 | 0.85 |
| IQR (Q3-Q1) | Provided labels | 0.36 | 0.39 | 0.14 | 0.13 | 0.07 | 0.21 |
| IQR (Q3-Q1) | Constructed labels | 0.09 | 0.49 | 0.09 | 0.09 | 0.12 | 0.15 |
| Mean | Provided labels | 0.17 | 0.45 | 0.65 | 0.58 | 0.62 | 0.55 |
| Mean | Constructed labels | 0.42 | 0.65 | 0.82 | 0.8 | 0.79 | 0.77 |
| Standard deviation | Provided labels | 0.2 | 0.27 | 0.16 | 0.17 | 0.18 | 0.2 |
| Standard deviation | Constructed labels | 0.17 | 0.25 | 0.12 | 0.14 | 0.15 | 0.16 |

Table 4. The descriptive statistics of lexical accuracy for Esri and HERE geocoders across three test scenarios.

Figure 4. The distribution of lexical accuracy scores calculated using provided labels across three test scenarios.

Figure 5. The distribution of lexical accuracy scores calculated using constructed labels across three test scenarios.

Finally, you evaluated the lexical similarity between the geocoders by calculating pairwise similarity scores between pairs of geocoded points. You then classified the scores into four groups to better understand their distribution.

The results indicate that the two geocoders performed differently in Scenario 1, regardless of the type of label used in the experiment (Table 5). However, they showed strong similarities in the next two scenarios, with median pairwise scores above 0.86 (Figure 6). It is also evident that both geocoders produced equivalent results when using constructed labels, achieving median similarity scores of 1.00 and 0.99 in Scenario 2 and Scenario 3, respectively.

| Statistic | Labels | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|---|
| Min | Provided labels | 0 | 0 | 0 |
| Min | Constructed labels | 0.22 | 0.28 | 0.28 |
| Max | Provided labels | 0.93 | 0.93 | 0.93 |
| Max | Constructed labels | 1 | 1 | 1 |
| Median | Provided labels | 0.33 | 0.9 | 0.86 |
| Median | Constructed labels | 0.37 | 1 | 0.99 |
| IQR (Q3-Q1) | Provided labels | 0.45 | 0.1 | 0.11 |
| IQR (Q3-Q1) | Constructed labels | 0.19 | 0.05 | 0.1 |
| Mean | Provided labels | 0.26 | 0.84 | 0.81 |
| Mean | Constructed labels | 0.42 | 0.96 | 0.93 |
| Standard deviation | Provided labels | 0.24 | 0.15 | 0.17 |
| Standard deviation | Constructed labels | 0.14 | 0.11 | 0.13 |

Table 5. The descriptive statistics of lexical similarity for Esri and HERE geocoders across three test scenarios.

Figure 6. The distribution of lexical similarity scores calculated using provided labels across three test scenarios.

Figure 7. The distribution of lexical similarity scores calculated using constructed labels across three test scenarios.

Next steps

When considering geocoding for your own project, it is important to evaluate the performance of geocoders using representative data specific to your use case. While a general evaluation can offer an idea of how various geocoders perform, it may not accurately reflect their performance on your own data. Your data may have unique characteristics, such as specific address formats or geographic coverage, that can influence the performance of geocoders. By evaluating geocoders with your own data, you can gain a better understanding of the strengths and limitations of different geocoding methods, and make an informed decision on which geocoder to use in your project.

You can explore the associated GitHub repository for this project, download the code, and configure it to run this experiment with your own data. By delving into the code and conducting your own experiment, you can gain a more comprehensive understanding of the methodology and determine which geocoding approach and quality metrics are best suited for your specific use case.
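For reference, here is a minimal sketch of calling Amazon Location Service's geocoding API (`SearchPlaceIndexForText`) with boto3. The place index name is hypothetical and must exist in your AWS account, and the repository's own code may differ:

```python
def geocode(client, address, index_name="MyPlaceIndex"):
    """Geocode one address via Amazon Location Service (index name is hypothetical)."""
    resp = client.search_place_index_for_text(
        IndexName=index_name, Text=address, MaxResults=1
    )
    return extract_result(resp)

def extract_result(resp):
    """Pull coordinates and the normalized label from a SearchPlaceIndexForText response."""
    if not resp.get("Results"):
        return None  # no match found
    place = resp["Results"][0]["Place"]
    lon, lat = place["Geometry"]["Point"]  # Amazon Location returns [longitude, latitude]
    return {"lon": lon, "lat": lat, "label": place.get("Label")}

# usage (requires AWS credentials and the boto3 package):
# import boto3
# client = boto3.client("location")
# print(geocode(client, "410 Terry Ave N, Seattle, WA"))
```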

Conclusion

Amazon Location Service provides a geocoding API that sources data from different data providers, including Esri and HERE. This gives you the flexibility to choose a provider with the best quality data for your project. However, it is important to note that data quality may vary based on your area of interest, as data providers may use different geocoding algorithms, data sources, and update cycles. In this blog post, you learned how to select the right data provider for your geocoding project using metrics like match rate, accuracy, and similarity. You can use the evaluation framework presented in this blog post to choose the right data provider and geocoding API for various use cases or geographies.
