EL
How to calculate the distance (time and miles) between the geographies of your portfolio and a comparator (Point A and Point B)

Using Python to Calculate Geographic Distances

This code is useful for anyone looking to match a portfolio with geographic data to any other geography with the intent of calculating drive time and distance between the two points. It is inspired by a work task I was assigned to help grant funders understand how close approved projects were to one another after querying the geographic spread of their applicants.

This article provides a walkthrough of how to use API calls, together with built-in and custom functions, to match a list of charities (Point A) to their nearest rail station (Point B) and calculate the distance in miles and the drive time in minutes.

Other use cases include, as examples:

  • Matching postcode to nearest school
  • Matching postcode to nearest charity
  • Matching postcode to nearest NHS provider
  • Matching postcode to nearest national park
  • Matching postcode in list A to nearest postcode in list B

Requirements

Packages:

  • pandas
  • numpy
  • requests
  • json
  • haversine

Resources used in this article:

  • Charity data (for this example, I have selected the first 100 charities from an extract of charities with expenditure over £5M from the Charity Commission register)
  • UK train station data (as this is not readily available, I have used a GitHub document containing UK train stations and their longitude, latitude, and postcode)
  • Postcodes.io (an API to search and extract UK postcode data)
  • Project OSRM (an API for calculating routes)

Why Use Python for This?

The steps discussed here may seem intricate, but the end result is a reusable template that you can adapt to your needs whenever you want to calculate geographic distances between Point A and Point B for many rows of data.

Let's say you are working with 100 charities. You would like to know how close each charity is to a nearby rail station as part of a wider analysis of the geography of these charities. You may want to map this data visually, or use it as a starting point for further analysis, such as how accessible each charity is from farther afield.



Whatever the use case, if you wanted to manually do this, your steps would be as follows:

  1. Find the charity postcode
  2. Use an online tool to check the nearest station to the charity
  3. Use an online map tool to find the distance in miles and the driving time for travelling from the charity to the nearest station
  4. Record the results in a spreadsheet
  5. Repeat steps 1 to 4 for the remaining 99 charities

This may be effective for a handful of charities, but after a while the process will be time-consuming, tedious, and prone to human error.



By using Python to complete this task, we can automate these steps: after a few additions specific to our data, we simply run the code.

What Can Python Do?

Let's break down the task into steps. Our required steps here are as follows:

  1. Find the nearest station to a given postcode
  2. Calculate the distance between the two
  3. Calculate the driving time for travel
  4. Produce a dataset containing all the required information

To complete step 1, we will use Python to:

  • import our dataset containing the charity's details, including its postcode
  • use the Postcodes.io API to extract the longitude and latitude for each postcode
  • compile this information back into a dataframe containing the original information, plus longitude and latitude for each charity.

Step 1: Find the nearest station to a given postcode

1 - import packages



# data manipulation
import numpy as np
import pandas as pd

# http requests
import requests

# handling json
import json

# calculating distances
import haversine as hs
from haversine import haversine, Unit



2 - import and clean data



# import as a pandas dataframe, specifying which columns to import
charities = pd.read_excel('charity_list.xlsx', usecols='A, C, E')
stations = pd.read_csv('uk-train-stations.csv', usecols=[1,2,3])

# renaming stations columns for ease of use
stations = stations.rename(columns={'station_name':'Station Name','latitude':'Station Latitude', 'longitude':'Station Longitude'})


Our variable containing our charity dataset, named 'charities', will be our master dataframe, which we will use as we go along to merge with the data we extract.



For now, our next step is to create our function for extracting longitude and latitude for our charities' postcodes.

3 - convert postcodes into list for matching function



charities_pc = charities['Charity Postcode'].tolist()


4 - create a function that takes a list of postcodes, makes a request to postcodes.io, records the latitude and longitude for each postcode, and returns the data, which we will then load into a new dataframe.


for further info, please consult the postcodes.io documentation



def bulk_pc_lookup(postcodes):

    # set up the api request
    url = "https://api.postcodes.io/postcodes"
    headers = {"Content-Type": "application/json"}

    # specify our input data and response, specifying that we are working with data in json format
    data = {"postcodes": postcodes}
    response = requests.post(url, headers=headers, data=json.dumps(data))

    # specify the information we want to extract from the api response

    if response.status_code == 200:
        results = response.json()["result"]
        postcode_data = []

        for result in results:
            postcode = result["query"]

            if result["result"] is not None:
                latitude = result["result"]["latitude"]
                longitude = result["result"]["longitude"]
                postcode_data.append({"Charity Postcode": postcode, "Latitude": latitude, "Longitude": longitude})

        return postcode_data

    # setting up a fail safe to capture any errors or results not found
    else:
        print(f"Error: {response.status_code}")
        return []


5 - pass our charity postcode list into the function to extract the desired results



# specify where the postcodes are
postcodes = charities_pc

# save the results of the function as output
output = bulk_pc_lookup(postcodes)

# convert the results to a pandas dataframe
output_df = pd.DataFrame(output)
output_df.head()



please note:

  1. if your Point B data (in this case, the UK rail stations) does not already contain latitude and longitude, you will need to perform steps 3 to 5 on the Point B data as well
  2. postcodes.io allows bulk lookup requests of up to 100 postcodes at a time. if your dataset contains more than 100 postcodes, you will need to either split your spreadsheet into sheets of 100 rows each, or write a function that breaks your dataset into batches of the required length for the API call
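The batching approach in note 2 can be sketched as follows (a minimal sketch; `chunk_list` is a helper name of my choosing, not part of postcodes.io):

```python
def chunk_list(items, size=100):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# usage sketch: look up each batch and combine the results
# postcode_data = []
# for batch in chunk_list(charities_pc):
#     postcode_data.extend(bulk_pc_lookup(batch))
```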

6 - we can now either merge our output_df with our original charity dataset, or, to leave our original data untouched, create a new dataframe that we will use for the rest of the project for our extracted results



charities_output = pd.merge(charities, output_df, on="Charity Postcode")

charities_output.head()



Step 1 Complete

We now have two dataframes which we will use for the next steps:

  1. Our original stations dataframe containing the UK train stations latitude and longitude
  2. Our new charities_output dataframe containing the original charity information and the new latitude and longitude information extracted from our API call

Step 2 - Calculate the distance between Point A (charity) and Point B (train station), and record the nearest result for Point A

In this section, we will be using the haversine distance formula to:

  • check the distance between a charity and every UK train station
  • match the nearest result i.e. the UK train station with the minimum distance from our charity
  • loop over our charities dataset to find the nearest match for each row
  • record our results in a dataframe

Please note, for further information on using the haversine module, consult the documentation

1 - create a function for calculating the distance between Point A and Point B



def calc_distance(lat1, lon1, lat2, lon2):

    # specify data for location one, i.e. Point A
    loc1 = (lat1, lon1)

    # specify the data for location two, i.e. Point B
    loc2 = (lat2, lon2)

    # calculate the distance and specify the units as miles
    dist = haversine(loc1, loc2, unit=Unit.MILES)

    return dist


2 - create a loop that calculates the distance between Point A and every row in Point B, and match the result where Point B is nearest to Point A



# create an empty dictionary to store the results
results = {}

# begin with looping over the dataset containing the data for Point A
for index1, row1 in charities_output.iterrows():

    # specify the location of our data
    charity_name = row1['Charity Name']
    lat1 = row1['Latitude']
    lon1 = row1['Longitude']

    # track the minimum distance between Point A and every row of Point B
    min_dist = float('inf')
    # as the minimum distance i.e. nearest Point B is not yet known, create an empty string for storage
    min_station = ''

    # loop over the dataset containing the data for Point B
    for index2, row2 in stations.iterrows():

        # specify the location of our data
        lat2 = row2['Station Latitude']
        lon2 = row2['Station Longitude']

        # use our previously created distance function to calculate the distance
        dist = calc_distance(lat1, lon1, lat2, lon2)

        # check each distance - if it is lower than the last, this is the new low. this will repeat until the lowest distance is found
        if dist < min_dist:
            min_dist = dist
            min_station = row2['Station Name']

    results[charity_name] = {'Nearest Station': min_station, 'Distance (Miles)': min_dist}

# convert the results dictionary into a dataframe
res = pd.DataFrame.from_dict(results, orient="index")

res.head()


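Looping with iterrows over every charity and station pair works, but it becomes slow once either dataset grows. As an optional alternative, the same nearest-station search can be vectorized with NumPy broadcasting (a sketch that implements the haversine formula directly rather than via the haversine module; `nearest_station_vectorized` is my own name):

```python
import numpy as np

def nearest_station_vectorized(ch_lat, ch_lon, st_lat, st_lon):
    """For each charity coordinate, return the index of the nearest station
    and the distance in miles, computed for all pairs at once."""
    R = 3958.8  # mean Earth radius in miles
    ch_lat = np.radians(ch_lat)[:, None]   # shape (n_charities, 1)
    ch_lon = np.radians(ch_lon)[:, None]
    st_lat = np.radians(st_lat)[None, :]   # shape (1, n_stations)
    st_lon = np.radians(st_lon)[None, :]
    # haversine formula, broadcast over every charity/station pair
    a = (np.sin((st_lat - ch_lat) / 2) ** 2
         + np.cos(ch_lat) * np.cos(st_lat) * np.sin((st_lon - ch_lon) / 2) ** 2)
    dist = 2 * R * np.arcsin(np.sqrt(a))   # shape (n_charities, n_stations)
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(idx)), idx]

# usage sketch:
# idx, miles = nearest_station_vectorized(
#     charities_output['Latitude'].to_numpy(),
#     charities_output['Longitude'].to_numpy(),
#     stations['Station Latitude'].to_numpy(),
#     stations['Station Longitude'].to_numpy())
# res = pd.DataFrame({'Nearest Station': stations['Station Name'].to_numpy()[idx],
#                     'Distance (Miles)': miles})
```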

3 - merge our new information with our charities_output dataframe



# as our dataframe output has used our charities as an index, we need to re-add it as a column
res['Charity Name'] = res.index

# merging with our existing output dataframe
charities_output = charities_output.merge(res, on="Charity Name")

charities_output.head()



Step 2 Complete

We now have all our information in one place, charities_output, containing:

  • Our charity information
  • The nearest station to each charity
  • The distance in miles

Step 3 - Calculate the driving time for travel

Our final step uses Project OSRM to find the driving time between each of our charities and its nearest station. This is helpful because straight-line miles are not always an accurate measure of travel: in a city like London, a 1 mile journey might take as long as a 5 mile journey in a rural area.

To prepare for this step, we must have one dataframe containing the following information:

  • charity information: name, longitude, latitude, nearest station, distance in miles
  • station information: name, longitude, latitude

1 - create a dataframe with the above information



drive_time_df = pd.merge(charities_output, stations, left_on='Nearest Station', right_on='Station Name')
drive_time_df = drive_time_df.drop(columns=['Station Name'])

drive_time_df.head()
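One caveat worth checking before relying on this merge: joining 'Nearest Station' to 'Station Name' assumes station names are unique, and a duplicated name would silently duplicate charity rows. A quick guard (the coordinates here are illustrative stand-ins for the real stations file):

```python
import pandas as pd

# illustrative stand-in for the stations dataframe
stations = pd.DataFrame({
    'Station Name': ['Abbey Wood', 'Aber'],
    'Station Latitude': [51.49, 51.57],
    'Station Longitude': [0.12, -3.23],
})

# if this fails, merging on station name would duplicate charity rows
assert stations['Station Name'].is_unique
```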



2 - now that our dataframe is ready, we can set up our function for calculating drive time using Project OSRM



please note: for further information, consult the documentation



url = "http://router.project-osrm.org/route/v1/driving/{lon1},{lat1};{lon2},{lat2}"

# function 

def calc_driveTime(row):

    # extract lat and lon
    lat1, lon1 = row['Latitude'], row['Longitude']
    lat2, lon2 = row['Station Latitude'], row['Station Longitude']

    # request
    response = requests.get(url.format(lat1=lat1, lon1=lon1, lat2=lat2, lon2=lon2))

    # parse response
    data = json.loads(response.content)

    # drive time in seconds
    drive_time_sec = data["routes"][0]["duration"]

    # convert to minutes
    drive_time = round((drive_time_sec) / 60, 0)

    return drive_time
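The public OSRM demo server can time out or return responses without a route, and a single bad response would make the whole apply call raise. A more defensive variant (a sketch; `parse_drive_time` and `calc_driveTime_safe` are names of my choosing, and the pause length is an assumption about polite usage rather than a documented OSRM requirement):

```python
import time
import requests

url = "http://router.project-osrm.org/route/v1/driving/{lon1},{lat1};{lon2},{lat2}"

def parse_drive_time(data):
    """Pull the first route's duration (seconds) out of an OSRM response
    and convert it to whole minutes; NaN when no route came back."""
    try:
        return round(data["routes"][0]["duration"] / 60, 0)
    except (KeyError, IndexError, TypeError):
        return float("nan")

def calc_driveTime_safe(row, pause=1.0):
    """Like calc_driveTime, but returns NaN instead of raising when the
    request fails, and pauses between calls to go easy on the demo server."""
    try:
        response = requests.get(url.format(
            lat1=row['Latitude'], lon1=row['Longitude'],
            lat2=row['Station Latitude'], lon2=row['Station Longitude']),
            timeout=10)
        response.raise_for_status()
        result = parse_drive_time(response.json())
    except requests.RequestException:
        result = float("nan")
    time.sleep(pause)
    return result
```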


3 - pass our data into our new function to calculate driving time in minutes



# apply the above function to our dataframe
driving_time_res = drive_time_df.apply(calc_driveTime, axis=1)

# add dataframe results as a new column
drive_time_df['Driving Time (Minutes)'] = driving_time_res

drive_time_df.head()



Step 3 Complete

We now have all our desired information in one compact dataframe. For layout purposes, and depending on what we want to do next with our data, we can create one final dataframe as output, containing the following information:

  • Charity Name
  • Nearest Station
  • Distance (Miles)
  • Driving Time (Minutes)


final_output = drive_time_df.drop(columns=['Charity Number', 'Charity Postcode', 'Latitude', 'Longitude', 'Station Latitude', 'Station Longitude'])

final_output.head()
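If you want to keep the results beyond the session, the final dataframe can be written to file (the filename is illustrative, and the small dataframe here stands in for the real final_output):

```python
import pandas as pd

# illustrative stand-in for the final_output dataframe
final_output = pd.DataFrame({
    'Charity Name': ['Example Trust'],
    'Nearest Station': ['Example Station'],
    'Distance (Miles)': [1.2],
    'Driving Time (Minutes)': [5.0],
})

# index=False keeps the dataframe's integer index out of the saved file
final_output.to_csv('charity_station_distances.csv', index=False)
```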



Thank you for reading! I hope this was helpful. Please check out my website if you are interested in my work.
