Using Python to Calculate Geographic Distances
This code is useful for anyone who wants to match a portfolio with geographic data to any other geography, with the intent of calculating the drive time and distance between the two points. It was inspired by a work task: helping grant funders understand how close approved projects were to one another after querying the geographic spread of their applicants.
This article provides a walkthrough of how to use API calls, built-in functions, and custom functions to match a list of charities (Point A) to their nearest rail station (Point B) and calculate the distance in miles and the drive time in minutes.
Other use cases include, as examples:
- Matching postcode to nearest school
- Matching postcode to nearest charity
- Matching postcode to nearest NHS provider
- Matching postcode to nearest national park
- Matching postcode in list A to nearest postcode in list B
Requirements
Packages:
- pandas
- numpy
- requests
- json (standard library)
- haversine
Resources used in this article:
- Charity data (for this example, I have selected the first 100 charities from an extract of charities with expenditure over 5M from the Charity Commission register)
- UK train station data (as this is not readily available, I have used a GitHub document containing UK train stations and their longitude, latitude, and postcode)
- Postcodes.io (an API to search and extract UK postcode data)
- Project OSRM (an API for calculating routes)
Why Use Python for This?
The steps discussed here may seem intricate and complicated, but the end result is a template that can be reused and reformatted to match your needs when looking to calculate geographic distances between point A and point B for several rows of data.
Let's say you are working with 100 charities, for example. You would like to know how close these charities are to nearby rail stations as part of wider analysis of the geography of these charities. It may be the case that you want to visually map this data, or use it as a starting point for further analysis, such as looking into accessibility of attending the charity from a far away location.
Whatever the use case, if you wanted to manually do this, your steps would be as follows:
- Find the charity postcode
- Use an online tool to check the nearest station to the charity
- Use an online map tool to find the distance in miles and the driving time for travelling from the charity to the nearest station
- Record the results in a spreadsheet
- Repeat steps 1 to 4 for the remaining 99 charities
This may be effective for a handful of charities, but after a while the process will be time-consuming, tedious, and prone to human error.
By using Python to complete this task, we can automate the steps and with only a few additions required by the user, simply run our code at the end.
What Can Python Do?
Let's break down the task into steps. Our required steps here are as follows:
- Find the nearest station to a given postcode
- Calculate the distance between the two
- Calculate the driving time for travel
- Produce a dataset containing all the required information
To complete step 1, we will use Python to:
- import our dataset containing the charity's details, including its postcode
- use the Postcodes.io API to extract the longitude and latitude for each postcode
- compile this information back into a dataframe containing the original information, plus longitude and latitude for each charity.
Step 1: Find the nearest station to a given postcode
1 - import packages
```python
# data manipulation
import numpy as np
import pandas as pd
# http requests
import requests
# handling json
import json
# calculating distances
import haversine as hs
from haversine import haversine, Unit
```
2 - import and clean data
```python
# import as a pandas dataframe, specifying which columns to import
charities = pd.read_excel('charity_list.xlsx', usecols='A, C, E')
stations = pd.read_csv('uk-train-stations.csv', usecols=[1,2,3])
# renaming stations columns for ease of use
stations = stations.rename(columns={'station_name': 'Station Name',
                                    'latitude': 'Station Latitude',
                                    'longitude': 'Station Longitude'})
```
Our variable containing our charity dataset, named 'charities', will be our master dataframe, which we will use as we go along to merge with the data we extract.
For now, our next step is to create our function for extracting longitude and latitude for our charities' postcodes.

3 - convert postcodes into a list for the matching function
```python
charities_pc = charities['Charity Postcode'].tolist()
```
4 - create a function that takes a postcode, makes a request to postcodes.io, records the latitude and longitude, and returns the data in a new dataframe

For further info, please consult the postcodes.io documentation.
```python
def bulk_pc_lookup(postcodes):
    # set up the api request
    url = "https://api.postcodes.io/postcodes"
    headers = {"Content-Type": "application/json"}
    # specify our input data and response, working with data in json format
    data = {"postcodes": postcodes}
    response = requests.post(url, headers=headers, data=json.dumps(data))
    # specify the information we want to extract from the api response
    if response.status_code == 200:
        results = response.json()["result"]
        postcode_data = []
        for result in results:
            postcode = result["query"]
            if result["result"] is not None:
                latitude = result["result"]["latitude"]
                longitude = result["result"]["longitude"]
                postcode_data.append({"Charity Postcode": postcode,
                                      "Latitude": latitude,
                                      "Longitude": longitude})
        return postcode_data
    # fail safe to capture any errors or results not found
    else:
        print(f"Error: {response.status_code}")
        return []
```
5 - pass our charity postcode list into the function to extract the desired results
```python
# specify where the postcodes are
postcodes = charities_pc
# save the results of the function as output
output = bulk_pc_lookup(postcodes)
# convert the results to a pandas dataframe
output_df = pd.DataFrame(output)
output_df.head()
```
please note:
- if your Point B data (in this case, the UK rail stations) does not already contain latitude and longitude, you will also need to perform steps 3 to 5 on the Point B data
- postcodes.io allows bulk lookup requests for up to 100 postcodes at a time. If your dataset contains more than 100 postcodes, you will need to either manually create new Excel sheets containing only 100 rows per sheet, or write a function that breaks your dataset into batches of the required length for the API call
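If you need the batching route, a small helper that splits the list into chunks of 100 is enough. The `chunk_list` helper below is an illustrative sketch, not part of the original code; the commented usage assumes the `bulk_pc_lookup` function from step 4:

```python
def chunk_list(items, size=100):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# illustrative usage with the lookup function from step 4:
# all_results = []
# for batch in chunk_list(charities_pc, 100):
#     all_results.extend(bulk_pc_lookup(batch))
# output_df = pd.DataFrame(all_results)
```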
6 - we can now either merge our output_df with our original charity dataset, or, to leave our original data untouched, create a new dataframe that we will use for the rest of the project for our extracted results
```python
charities_output = pd.merge(charities, output_df, on="Charity Postcode")
charities_output.head()
```
Step 1 Complete
We now have two dataframes which we will use for the next steps:
- Our original stations dataframe containing the UK train stations latitude and longitude
- Our new charities_output dataframe containing the original charity information and the new latitude and longitude information extracted from our API call
Step 2 - Calculate the distance between Point A (charity) and Point B (train station), and record the nearest result for Point A
In this section, we will be using the haversine distance formula to:
- check the distance between a charity and every UK train station
- match the nearest result i.e. the UK train station with the minimum distance from our charity
- loop over our charities dataset to find the nearest match for each row
- record our results in a dataframe
Please note, for further information on using the haversine module, consult the documentation
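As a quick sanity check, the haversine formula itself is short enough to sketch directly. This standalone version mirrors what `haversine(loc1, loc2, unit=Unit.MILES)` computes, using approximate city-centre coordinates for London and Manchester as illustrative inputs:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # convert degrees to radians
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    # haversine formula: great-circle distance on a sphere
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    earth_radius_miles = 3958.8
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# approximate city-centre coordinates (latitude, longitude)
dist = haversine_miles(51.5074, -0.1278, 53.4808, -2.2426)
# roughly 160 miles as the crow flies
```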
1 - create a function for calculating the distance between Point A and Point B
```python
def calc_distance(lat1, lon1, lat2, lon2):
    # specify data for location one, i.e. Point A
    loc1 = (lat1, lon1)
    # specify the data for location two, i.e. Point B
    loc2 = (lat2, lon2)
    # calculate the distance and specify the units as miles
    dist = haversine(loc1, loc2, unit=Unit.MILES)
    return dist
```
2 - create a loop that calculates the distance between Point A and every row in Point B, and match the result where Point B is nearest to Point A
```python
# create an empty dictionary to store the results
results = {}
# begin by looping over the dataset containing the data for Point A
for index1, row1 in charities_output.iterrows():
    # specify the location of our data
    charity_name = row1['Charity Name']
    lat1 = row1['Latitude']
    lon1 = row1['Longitude']
    # track the minimum distance between Point A and every row of Point B
    min_dist = float('inf')
    # the nearest Point B is not yet known, so start with an empty string
    min_station = ''
    # loop over the dataset containing the data for Point B
    for index2, row2 in stations.iterrows():
        # specify the location of our data
        lat2 = row2['Station Latitude']
        lon2 = row2['Station Longitude']
        # use our previously created distance function to calculate the distance
        dist = calc_distance(lat1, lon1, lat2, lon2)
        # if this distance is lower than the lowest so far, it becomes the new
        # minimum; this repeats until the lowest distance is found
        if dist < min_dist:
            min_dist = dist
            min_station = row2['Station Name']
    results[charity_name] = {'Nearest Station': min_station,
                             'Distance (Miles)': min_dist}
# convert the results dictionary into a dataframe
res = pd.DataFrame.from_dict(results, orient="index")
res.head()
```
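The nested `iterrows` loop makes a Python-level pass over every charity-station pair, which is fine for 100 charities but slows noticeably on larger portfolios. As an optional alternative, the same nearest-match logic can be vectorised with NumPy broadcasting. The function below is a sketch under that assumption (it reimplements the haversine formula with NumPy rather than using the `haversine` package), and the commented usage shows how it would plug into the `stations` dataframe:

```python
import numpy as np

def nearest_station_vectorised(char_lat, char_lon, st_lats, st_lons):
    """Return (index, distance in miles) of the nearest station,
    computing a haversine distance to every station at once."""
    # convert everything to radians
    lat1, lon1 = np.radians(char_lat), np.radians(char_lon)
    lat2, lon2 = np.radians(st_lats), np.radians(st_lons)
    # haversine formula evaluated across the whole station array
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dists = 2 * 3958.8 * np.arcsin(np.sqrt(a))  # earth radius in miles
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# illustrative usage against the stations dataframe:
# idx, dist = nearest_station_vectorised(
#     51.5, -0.12,
#     stations['Station Latitude'].to_numpy(),
#     stations['Station Longitude'].to_numpy())
```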
3 - merge our new information with our charities_output dataframe
```python
# as our dataframe output has used our charities as an index,
# we need to re-add it as a column
res['Charity Name'] = res.index
# merging with our existing output dataframe
charities_output = charities_output.merge(res, on="Charity Name")
charities_output.head()
```
Step 2 Complete
We now have all our information in one place, charities_output, containing:
- Our charity information
- The nearest station to each charity
- The distance in miles
Step 3 - Calculate the driving time for travel
Our final step uses Project OSRM to find the driving distance between each of our charities and its nearest station. This is helpful as miles are not always an accurate descriptor of distance, where, for example, in a city like London, a 1 mile journey might take as long as a 5 mile journey in a rural area.
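Before writing the function, it helps to know the shape of an OSRM route response: a JSON object with a `code` field and a `routes` array whose entries report `duration` in seconds and `distance` in metres. The payload below is a trimmed, illustrative sample (not a real API reply), used here to show the parsing and seconds-to-minutes conversion:

```python
import json

# trimmed, illustrative OSRM /route/v1/driving response
sample = '{"code": "Ok", "routes": [{"duration": 754.6, "distance": 9214.3}]}'
data = json.loads(sample)

# duration is reported in seconds; convert to whole minutes
drive_time_sec = data["routes"][0]["duration"]
drive_time_min = round(drive_time_sec / 60, 0)  # 754.6 s -> 13.0 minutes
```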
To prepare for this step, we must have one dataframe containing the following information:
- charity information: name, longitude, latitude, nearest station, distance in miles
- station information: name, longitude, latitude
1 - create a dataframe with the above information
```python
drive_time_df = pd.merge(charities_output, stations,
                         left_on='Nearest Station', right_on='Station Name')
drive_time_df = drive_time_df.drop(columns=['Station Name'])
drive_time_df.head()
```
2 - now that our dataframe is ready, we can set up our function for calculating drive time using Project OSRM
please note: for further information, consult the documentation
```python
# note that OSRM expects coordinates in lon,lat order
url = "http://router.project-osrm.org/route/v1/driving/{lon1},{lat1};{lon2},{lat2}"

def calc_driveTime(row):
    # extract lat and lon
    lat1, lon1 = row['Latitude'], row['Longitude']
    lat2, lon2 = row['Station Latitude'], row['Station Longitude']
    # request
    response = requests.get(url.format(lat1=lat1, lon1=lon1,
                                       lat2=lat2, lon2=lon2))
    # parse response
    data = json.loads(response.content)
    # drive time in seconds
    drive_time_sec = data["routes"][0]["duration"]
    # convert to minutes
    drive_time = round(drive_time_sec / 60, 0)
    return drive_time
```
3 - pass our data into our new function to calculate driving time in minutes
```python
# apply the above function to our dataframe
driving_time_res = drive_time_df.apply(calc_driveTime, axis=1)
# add the results as a new column
drive_time_df['Driving Time (Minutes)'] = driving_time_res
drive_time_df.head()
```
Step 3 Complete
We now have all our desired information in one compact dataframe. For layout purposes, and depending on what we want to do next with our data, we can create one final dataframe as output, containing the following information:
- Charity Name
- Nearest Station
- Distance (Miles)
- Driving Time (Minutes)
```python
final_output = drive_time_df.drop(columns=['Charity Number', 'Charity Postcode',
                                           'Latitude', 'Longitude',
                                           'Station Latitude', 'Station Longitude'])
final_output.head()
```
Thank you for reading! I hope this was helpful. Please check out my website if you are interested in my work.