DEV Community

Ably Realtime

Posted on • Originally published at

How to use GTFS data to track transit vehicles in realtime

At Ably we've been developing tools to help distribute various open data sources, covering transport systems, cryptocurrencies, weather changes and more. All of this has been done through the Ably Hub, using our Ingester tool to take useful open data and ingest it into the Ably ecosystem. Some of the tools we've created this way have been useful both internally and externally.

Historically, however, we've not done much to convert the data into something more than its original structure, at most adjusting structures to better match realtime paradigms. For transport sources defined by the General Transit Feed Specification Realtime (GTFS-R), we've kept the data in the same structures. This has inherent value: it makes endpoints which require polling accessible via realtime protocols such as WebSockets, MQTT and SSE, and it allows for finer filtering of relevant information through the use of channels. However, the data could be processed into far more.

What is GTFS (General Transit Feed Specification)?

There are a number of data structures and specifications for transit information, each with their own niches and uses. For example, TransXChange is used in the UK and is extremely detailed, using XML to structure the data.

GTFS (General Transit Feed Specification) is another popular way of structuring transit data, notably in the USA. Hundreds of agencies publish their data in this format, which allows their schedules to be used programmatically. These feeds generally only update every couple of months at most, depending on the agency.

GTFS-R is an extension of this, allowing for these agencies to provide changes to schedules and vehicle positions as they occur. A bus that is running 5 minutes late can be indicated using GTFS-R. The ability to combine GTFS and GTFS-R data to have in-depth details of a transit system in combination with realtime updates can be extremely powerful. Unfortunately, GTFS-R is far from widely supported. Additionally, most GTFS-R endpoints require polling, which in combination with low rate limits makes true realtime updates impossible.

Transit data and GTFS

As we'd like to be able to provide realtime updates for as many agencies as possible, within the limitations we have, we wanted to create a way of representing all transit sources in the same way. This would need to work whether or not an agency provides GTFS-R, due to its lack of adoption at this time. In addition, being a realtime data platform, our implementation needed to work purely as realtime updates, rather than relying on a static set of data to be extended from.

In addition, we wanted to provide something of value beyond the core data provided by GTFS and GTFS-R. Rather than just passing the data on to be processed in its raw format, we wanted to do a lot of the hard computation and calculating as part of our ingestion to Ably, so that the output is useful from the get-go.

Tracking vehicles with GTFS data

What we settled on was a new Hub Product which would have representations of a vehicle's position within a geofenced region, with geometries and associated times to traverse said geometries.

In practice, what this means is that at a fixed interval the ingester will publish a set of coordinates with associated times, which represent the position of a vehicle at said times. This allows for the vehicle's presumed position at any time to be an interpolation between the associated points and times, allowing for smooth representations of a vehicle's position at any desired interval.

{
  "id": "trip1",
  "route": [
    { "time": "00:00:01", "lat": "20.000", "lon": "10.000" },
    { "time": "00:00:10", "lat": "20.010", "lon": "10.030" },
    { "time": "00:00:31", "lat": "20.040", "lon": "10.060" }
  ]
}

Example spatiotemporal update
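To illustrate how a subscriber might use such an update, here is a minimal Python sketch that interpolates a vehicle's position at an arbitrary time. The function names (`to_seconds`, `position_at`) are hypothetical, and it assumes the message shape shown in the example above, with times as "HH:MM:SS" strings in ascending order.

```python
from bisect import bisect_right


def to_seconds(hms: str) -> int:
    """Convert an "HH:MM:SS" string into seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def position_at(update: dict, when: str):
    """Return the interpolated (lat, lon) at time `when`.

    Returns None if `when` falls outside the time range of the update.
    Assumes the route points are sorted by strictly increasing time.
    """
    points = [
        (to_seconds(p["time"]), float(p["lat"]), float(p["lon"]))
        for p in update["route"]
    ]
    t = to_seconds(when)
    if not points or t < points[0][0] or t > points[-1][0]:
        return None
    times = [p[0] for p in points]
    i = bisect_right(times, t)  # index of the first point strictly after t
    if i == len(points):
        return points[-1][1], points[-1][2]
    t0, lat0, lon0 = points[i - 1]
    t1, lat1, lon1 = points[i]
    frac = (t - t0) / (t1 - t0)  # fraction of this leg already travelled
    return lat0 + frac * (lat1 - lat0), lon0 + frac * (lon1 - lon0)
```

A client could call `position_at(update, "00:00:04")` on every animation frame, so only the latest update per vehicle needs to be kept in memory.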

This spatiotemporal representation of the data can then be restricted to certain geofenced regions, with each message's data limited by either a maximum time span or a maximum region. We decided on this method as it strikes a good balance between message publish rate (as opposed to, for example, publishing vehicle positions every 100ms) and granularity of information provided (as opposed to publishing large sets of data at once). This is notably true for the data sources which are only GTFS and not GTFS-R: the schedule is guaranteed not to change, so a representation covering a whole period is as accurate as it could be.

Constructing spatiotemporal updates

In order to achieve this, we needed to firstly construct a program which can convert existing GTFS data into a representation which allows for accurate position prediction based on time. GTFS provides various elements of data which can be used for this.

Predicting positions at set times

Firstly, we needed to create a representation of each trip, and the times at which the vehicle should be at each point of that route. The stops.txt file, which most importantly for us contains each stop's position and ID, can be used in combination with stop_times.txt to work out for any given trip when a vehicle will be at a stop. This provides us with the initial building block we need: timeframes in which we know a vehicle should be between two points, or at a point. However, unless we're to assume a vehicle is travelling in a straight line between each point at a consistent rate, which is certainly not a valid assumption, we need more data.

This is where we can bring in the shapes.txt file. This provides information on the actual structure of trips, with a list of coordinates effectively representing every turn or bend. Unfortunately, these points don't have any information within GTFS to associate them with a time.

However, what both shapes and stop_times optionally have attached is something known as shape_dist_traveled. This is a representation of how far along a trip a shape's node or station is. With this, it's fairly trivial to insert a station into these geometries accurately. Once that's done, we can predict the time for each shape node based on how far it lies between two station nodes. Although this won't be 100% accurate (we won't have representations of traffic lights slowing parts of a journey between two station nodes, for example), it's about as good as we can get without the addition of GTFS-R data. In addition, any inaccuracy peaks around the midpoint between two station nodes and shrinks to zero as we approach a station node.
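As a rough sketch of this step, the Python below fills in a time for every shape node that lies between two station nodes, in proportion to distance travelled. The node layout (dicts with `dist` and `time` keys, times in seconds) and the function name `estimate_node_times` are assumptions made for illustration, not the structures used in the ingester.

```python
def estimate_node_times(nodes):
    """Estimate a time for every node lying between two station nodes.

    `nodes` is ordered along the trip. Each node is a dict with:
      - "dist": distance along the trip (e.g. shape_dist_traveled)
      - "time": seconds since midnight for station nodes, None otherwise
    Returns a list of times; nodes before the first station or after
    the last station keep None, as there is nothing to interpolate from.
    """
    times = [node["time"] for node in nodes]
    prev = None  # index of the most recent station node
    for i, node in enumerate(nodes):
        if node["time"] is None:
            continue
        if prev is not None:
            d0, d1 = nodes[prev]["dist"], node["dist"]
            t0, t1 = nodes[prev]["time"], node["time"]
            for j in range(prev + 1, i):
                # Share of the station-to-station distance covered at node j.
                frac = (nodes[j]["dist"] - d0) / (d1 - d0) if d1 > d0 else 0.0
                times[j] = t0 + frac * (t1 - t0)
        prev = i
    return times
```

This assumes a constant speed between consecutive stations, which is the same simplification described above.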

Representation of a trip with nodes as vehicles.

Predicting times based off position and station times

Predictions without shape_dist_traveled

Unfortunately, due to the inclusion of shape_dist_traveled being optional for these files in GTFS, we can't rely on it existing in all cases. For cases where it isn't included, we needed a method to insert stations into the geometry as accurately as possible. The method we opted for is to iterate through the stations in order of arrival, and for each station find which geometric line between two route points comes closest to intersecting a station. The station is then inserted at this point, and the process continues.
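A simplified sketch of that insertion step is shown below. It treats coordinates as planar (x, y) pairs, which is an approximation that is usually acceptable over the short distances between shape points, and the function names are hypothetical. The `start` argument is one simple way to keep stations in order: each search begins at the index where the previous station was inserted.

```python
def project_onto_segment(p, a, b):
    """Return (closest point on segment a-b to p, squared distance to it)."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0  # a and b coincide
    else:
        # Parameter of the projection, clamped so it stays on the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    cx, cy = ax + t * dx, ay + t * dy
    return (cx, cy), (px - cx) ** 2 + (py - cy) ** 2


def insert_station(shape, station, start=0):
    """Insert `station` into `shape` on the closest segment at or after `start`.

    Returns (new shape, index of the inserted node). `shape` is a list of
    (x, y) points in travel order.
    """
    best = None
    for i in range(start, len(shape) - 1):
        point, dist_sq = project_onto_segment(station, shape[i], shape[i + 1])
        if best is None or dist_sq < best[0]:
            best = (dist_sq, i, point)
    if best is None:
        raise ValueError("shape needs at least one segment at or after start")
    _, i, point = best
    return shape[: i + 1] + [point] + shape[i + 1 :], i + 1
```

Passing the returned index as `start` for the next station prevents a later station from being placed earlier in the geometry than the one before it.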

Because both the shape and the stations have a determined order, we can also ensure that we're not incorrectly inserting stations too late in the geometry, which could happen if, say, a route includes the same road twice. Once a station has been placed, the position of the next station can be checked against both prior and future nodes. If the next station's best and next-best positions both fall prior to the previously placed station, we can place the previous station at its second-best position and try again. This was combined with various other techniques, such as comparing the position through the shape geometry to the position through the station geometry to see if it makes sense. If the optimal position for the second station in a trip of 200 stations is at the end of the route, this should and does set off red flags as to its validity.

Once stations have been inserted into the geometries, we can do much the same as with the use of shape_dist_traveled. The difference is that rather than having the total distance traveled to reach any point, we will need to calculate distances between nodes ourselves, and use that to predict timings.
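One common way to compute these distances is the haversine formula, which gives the great-circle distance between two latitude/longitude points. The sketch below is an illustration under that assumption; the source does not say which distance measure the ingester uses, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def cumulative_distances(nodes):
    """Distance travelled along the trip at each node, starting from 0.0."""
    if not nodes:
        return []
    distances = [0.0]
    for prev, cur in zip(nodes, nodes[1:]):
        distances.append(distances[-1] + haversine_m(prev, cur))
    return distances
```

The resulting list plays the same role as shape_dist_traveled, so the time estimation can proceed as it would when that field is present.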

Converting times and distances to positions

Now that we have a way to get the time and position of any point in our trips, we can use this to predict the position of a vehicle at any point in time. For a trip we know intersects with the time period we're interested in, we can iterate through its nodes until we find a node with a time during or after our time frame.

Once we have this node, we need to work out where between this node and the previous node the vehicle will be at the start of our timeframe. This can be done by working out the time difference between the previous node and the current node, then working out what fraction of that interval has elapsed at our start time. We can then apply this fraction to the differences between the two nodes' X and Y coordinates.

Calculating the distance along a route at a given time ti, based off the information of the previous and next station
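One way to write this interpolation is given below. The symbol names are chosen here for illustration: t_prev and t_next are the scheduled times at the previous and next station, and d_prev and d_next are their distances along the route.

```latex
d(t_i) = d_{\text{prev}}
       + \frac{t_i - t_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}
         \,\bigl(d_{\text{next}} - d_{\text{prev}}\bigr),
\qquad t_{\text{prev}} \le t_i \le t_{\text{next}}
```

At t_i = t_prev this gives d_prev, and at t_i = t_next it gives d_next, so the estimate agrees with the schedule at both stations.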

It's often the case, however, that the two nodes directly surrounding the current time are in fact not stations, and thus have no time associated with them. In this case, we instead take the times of the last visited station and the next station to work out the positioning of the vehicle between them. We can then iterate through the nodes between the stations to work out which adjacent nodes the vehicle would be placed between. Once we have that, we can work out the position of the vehicle between them based on distance rather than time.

Calculating the current x position of a vehicle based on the position and distance travelled between two adjacent nodes
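A possible form of this calculation, again with illustrative symbol names: a and b are the two adjacent nodes, d_a and d_b are their distances along the route, and d(t_i) is the distance the vehicle has travelled along the route at time t_i.

```latex
x(t_i) = x_a + \frac{d(t_i) - d_a}{d_b - d_a}\,(x_b - x_a),
\qquad
y(t_i) = y_a + \frac{d(t_i) - d_a}{d_b - d_a}\,(y_b - y_a),
\qquad d_a \le d(t_i) \le d_b
```

The same fraction is applied to both coordinates, which places the vehicle on the straight line between the two nodes.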

We now have the expected position for the start of our timeframe. We can continue to define the route by adding nodes, with their respective times and positions, to our final return value until we reach a node beyond our desired timeframe. We can then repeat the same process as we used for the initial point to determine the end point.
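The clipping of a trip to a timeframe could be sketched as follows. This assumes every node already carries a time (station times from the schedule, other nodes estimated from distance), that nodes are (time, x, y) tuples with strictly increasing times, and the function names are hypothetical.

```python
def interpolate(a, b, t):
    """Linearly interpolate between timed nodes a and b at time t."""
    frac = (t - a[0]) / (b[0] - a[0])
    return (t, a[1] + frac * (b[1] - a[1]), a[2] + frac * (b[2] - a[2]))


def clip_to_timeframe(nodes, start, end):
    """Return the part of a trip that falls within [start, end].

    The first and last returned points are interpolated so that the
    result begins exactly at `start` and finishes exactly at `end`
    whenever the trip extends beyond the timeframe.
    """
    out = []
    for prev, cur in zip(nodes, nodes[1:]):
        if cur[0] <= start or prev[0] >= end:
            continue  # this leg lies entirely outside the timeframe
        if not out:
            # Start point: interpolated if the leg began before `start`.
            out.append(interpolate(prev, cur, start) if prev[0] < start else prev)
        if cur[0] <= end:
            out.append(cur)
        else:
            out.append(interpolate(prev, cur, end))  # end point
            break
    return out
```

For a trip with nodes at times 0, 10, 20 and 30, clipping to the timeframe 5 to 25 returns an interpolated start point, the two nodes at 10 and 20, and an interpolated end point.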

With this, we're able to provide updates via Ably channels for vehicle positions at a manageable rate, whilst allowing for subscribers to easily represent a vehicle's position accurately at any time with very little data stored on the client side.

Going from temporal to spatiotemporal limits

Whilst the above updates for vehicle positions are in themselves incredibly useful, they can be taken further by attaching additional restrictions to the data based on the region a vehicle is in. Most mapping tools make different granularities of data available at different levels of zoom, with only data relevant to what is being looked at being available.

In Ably, the most logical way to break data down would be to have a channel naming convention which indicates what data you're interested in. At present, we've gone with the following:

[zoom ID]:[X ID]:[Y ID]

For example, 1:5:3 would mean you're interested in cell [5, 3] at a zoom level of 1. This allows subscribers of this data to decide exactly what they're interested in and subscribe only to that.

To make this work, we needed to firstly determine which cells a vehicle would enter during a time segment, and then break up the data for each of these channels such that they're only receiving information on these routes which pertain to when the vehicle is within their perimeter.

This is done by starting off as above: locating the first node which will be in our time segment, and then locating the point between it and the prior node where the vehicle will start. Once this is done, we can quickly check which cell this point falls in at the deepest level of zoom, and start creating our list of positions to be published to both this cell and all higher levels of zoom which contain it. As each level of zoom is just a division of two from the prior level of zoom, the containing cell at every higher level can be derived directly from the cell at the deepest level.

We then continue to the next node, and check if this line up to the new node intersects with another cell. If it does, we will update the lowest level zoom's list of positions and times to end at the intersection point, and create a new list which'll contain updates for the new cell with the same intersection points. This continues until the line no longer intersects with any more cells. We then iterate upwards in zoom, checking for intersections until we no longer find one. At this point we continue to add to the existing lists of points for the existing cells.

Lowest co-ordinates within any given region of zoom z, co-ordinate column x and row y. The upper limit is the same equation but x+1, y+1
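To make the cell arithmetic concrete, here is a sketch under a loudly stated assumption: that zoom level z divides the whole longitude range [-180, 180] into 2^z columns and the whole latitude range [-90, 90] into 2^z rows. The source does not specify the grid, so the real layout may differ, and all function names here are hypothetical.

```python
def cell_bounds(zoom, x, y):
    """Return (min_lon, min_lat, max_lon, max_lat) of cell [x, y] at `zoom`.

    Assumed grid: 2**zoom columns across longitude, 2**zoom rows across latitude.
    The upper bounds are the lower bounds of cell [x + 1, y + 1].
    """
    n = 2 ** zoom
    width, height = 360.0 / n, 180.0 / n
    return (
        -180.0 + x * width,
        -90.0 + y * height,
        -180.0 + (x + 1) * width,
        -90.0 + (y + 1) * height,
    )


def cell_for_point(zoom, lon, lat):
    """Return the (x, y) cell containing a point at the given zoom level."""
    n = 2 ** zoom
    # min() keeps points on the upper edge (lon 180, lat 90) in the last cell.
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return x, y


def channels_for_point(lon, lat, deepest_zoom, shallowest_zoom=0):
    """Channel names, deepest zoom first, for every cell containing the point."""
    x, y = cell_for_point(deepest_zoom, lon, lat)
    names = []
    for zoom in range(deepest_zoom, shallowest_zoom - 1, -1):
        names.append(f"{zoom}:{x}:{y}")
        x, y = x // 2, y // 2  # the parent cell is an integer division by two
    return names
```

Because each level halves the cell indices, only one lookup at the deepest zoom is needed to find every channel a position should be published to.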

This continues until we have the entire time period defined as positions for the vehicle broken up spatiotemporally for each cell and level of zoom. These messages can then be published to their respective channels, and the process is repeated for the remaining vehicles.
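The step of cutting a segment wherever it crosses into another cell might look like the sketch below. It assumes planar coordinates with grid lines at every multiple of the cell width and height, and timed nodes as (time, x, y) tuples; the function name is hypothetical.

```python
import math


def split_at_cell_borders(a, b, cell_w, cell_h):
    """Split the segment a -> b wherever it crosses a grid line.

    Returns the points from a to b inclusive, with an extra point (and an
    interpolated time) at every crossing, so that each consecutive pair of
    points lies within a single cell.
    """
    (t0, x0, y0), (t1, x1, y1) = a, b
    fractions = set()
    for start, end, size in ((x0, x1, cell_w), (y0, y1, cell_h)):
        if start == end:
            continue  # no movement on this axis, so no crossings
        low, high = sorted((start, end))
        k = math.floor(low / size) + 1  # first grid line strictly above `low`
        while k * size < high:
            fractions.add((k * size - start) / (end - start))
            k += 1
    points = [a]
    for f in sorted(fractions):
        points.append((t0 + f * (t1 - t0), x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    points.append(b)
    return points
```

Each crossing point can then close the position list for the cell being left and open the list for the cell being entered, so both channels receive the same intersection point.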

Adding GTFS-R

With the core structure designed, we needed to apply any potential realtime updates. TripUpdates in GTFS-R provide delays to routes. For example an update may say 'Trip 1 is delayed by 50s'. Alternatively, the arrival and departure times can have a new time indicated rather than a delay. Both of these can be easily applied to change the expected times of each node to the new time, and then process the spatiotemporal positions as normal.
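As a small illustration of the delay case, the sketch below shifts the expected times of a trip's nodes by a reported delay. The function name and the `from_index` parameter are hypothetical; the parameter is an assumed way of applying a delay only from a particular node onwards, for updates that apply from a given stop rather than to the whole trip.

```python
def apply_delay(node_times, delay_seconds, from_index=0):
    """Return node times shifted by `delay_seconds` from `from_index` onwards.

    `node_times` holds seconds since midnight, or None for nodes with no
    estimated time; None entries are left untouched.
    """
    return [
        t if i < from_index or t is None else t + delay_seconds
        for i, t in enumerate(node_times)
    ]
```

For the 'Trip 1 is delayed by 50s' example, `apply_delay(times, 50)` moves every node 50 seconds later, after which the spatiotemporal positions can be generated as normal.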

In the future we'd want to look into integrating realtime vehicle position updates into the equation, predicting more accurately speeds between nodes, and adding in additional nodes to reflect both speed variations as well as common stopping points, including 'arrival_time' and 'departure_time' to represent the average time spent at these stopping points.

Processing GTFS data

The above has now allowed for spatiotemporal data to be easily distributed via channels at varying levels of zoom. At any point in time, and with very little processing, a subscriber can have access to the current and near-future positions of any vehicle they're interested in.

The above had only been applied to a single transit provider. We needed this to scale, with the ability eventually to have this running for every single transit provider in the world. Ably would have no issues at all distributing the data, but we needed a way to properly break up the processing of data and storage of GTFS data.

Our ingester tool proved to be the perfect solution for this. The tool is based on Kubernetes, with each ingester being a pod within a cluster. Once we had our GTFS tool working, it was just a matter of adjusting the parameters (the necessary GTFS and GTFS-R endpoints) and making a new pod. This allows for every transit service's data to be processed and transmitted separately, with no concerns of overlapping resources or certain sources overwhelming everything.

Because we chose to prioritise reducing the memory footprint of the data over processing time, it's entirely possible to keep the entire state of the system in memory without the need to constantly fetch segments of the data from a database. This in turn helps with the overall time to process, giving a strong balance of memory use and processing time.

The result and the future

This all results in a tool which can be scaled easily, with accurate positioning based on schedules and realtime updates, all while providing subscribers with fine-grained control over the data they're interested in.

At present, we've applied this to CTtransit's and TARC's GTFS and GTFS-R data for testing, and intend to scale this out in the near future to as many transit providers as possible. The tool is available as part of our ingester tool, and the data is all available for free on the Ably Hub. Currently the data is only available at zoom levels 6 to 8; please get in touch if you'd like us to extend this. The product will only publish updates to channels with subscribers (determined using channel occupancy).

In addition, we'd want to add transit agencies which don't use GTFS as their specification. For example, within the UK it would be good to be able to go from TransXChange to our spatiotemporal representation as well.

If you've got any agencies you'd be interested in having position updates from please get in touch with us at
