When computing the shortest path between two locations, the size of the roads dataset is often the number one impacting factor on performances. Here we will discuss some ideas on how to reduce the size of this dataset.
This article is mainly about graph theory and route calculation, and more broadly spatial processing.
If you're already fluent in these domains, you can skip the Terminology part.
Terminology
Spatial DataBase
The terminology of this article will be based on the PostGIS terminology, but the concepts are most often compatible with other spatial databases.
LineString
As per the official PostGIS documentation
A LineString is a 1-dimensional line formed by a contiguous sequence of line segments. Each line segment is defined by two points, with the end point of one segment forming the start point of the next segment.
Notice the contiguous
here, no space between lines.
Notice also that start and end refers to points.
MultiLineString
As per the official PostGIS documentation
A MultiLineString is a collection of LineStrings.
That way, multiple LineStrings that are not contiguous can be gathered in one MultiLineString.
Cluster
A cluster is a group of entities gathered by geographical proximity.
Topology
Topology is a very large subject. In our case, this means that objects relations are preserved in their geometry.
A good example with a road network is the difference between a crossroad and a tunnel / bridge.
At a crossroad, all the roads are connected, and this relation is shown in the geometry with a point shared by all four roads.
At a tunnel / bridge, the roads are not connected, and thus they do not share any point.
Graph Theory
- node = point
- edge = linestring
- linear node = a node connecting only two edges
directed
When orientation matters, and going from A to B may not be the same as going from B to A.
In most datasets, direction is an attribute, with three possible values:
- forward = the road can only be traveled from start to end
- backward = the road can only be traveled from end to start
- both ways
weighted
When not all edges have the same cost.
Most often, this cost is the length of the edge, or the travelling time.
topological
When edges share a relation only if their starting/ending point is on the same node.
For pgRouting, the starting and ending node id of the edges are respectively called source and target. The source and target are not nodes or points, just the id of a node.
Two edges are connected for pgRouting if the source/target of one is the source/target of the other, even if they do not share any point (not topological).
Most often, road networks are topological weighted directed graphs. Given such a graph, most routing algorithms rely only on topology, not geometries.
Contraction
The best way to improve routing performances is to reduce the number of edges and nodes in a graph. To simplify the graph, while keeping the correct topology, is what we call contraction.
- linear contraction = remove linear nodes, remove their two edges, replace with only one edge
The pgRouting documentation on the subject is well explained with examples : https://docs.pgrouting.org/latest/en/contraction-family.html
Dead end contraction is beyond the scope of this post.
Use case
Sometimes, the dataset that is used for routing is too detailed. This results in road segments being split at each characteristic change, even when those characteristics have no interest for us.
One issue with the contraction available in the pgRouting toolbox is that it is too aggressive. It doesn't take into account the characteristics of the road segments. You can't prevent two road segments from being merged, and keep their separate characteristics.
In database terminology, those characteristics are called attributes.
This is why an "attribute aware linear contraction" is needed, when attributes matter.
Another improvement is that this contraction doesn't need to be unnested when the start or end of the routing is on a contracted edge.
Attribute aware
Let's say we have a set of three edges "a", "b" and "c", with various attributes:
- speed: attribute of interest to us, must be kept;
- cost: attribute needed for the routing algorithm, must be merged;
- name: attribute of no interest to us, can be discarded.
should merge
In the following example, we can see that edges "a", "b" and "c" share the same "speed" attribute, but must have been split because of the different "name" on "b".
If we are not interested in keeping the "name" attribute, the following "a" edge is equivalent to the preceding "a", "b" and "c" in terms of graph.
The geometry is updated as a merge of previous geometries.
should not merge
In the following example, we can see that edges "b" and "c" share the same "speed" attribute, but not edge "a".
The contraction should then only be applied on edges "b" and "c".
Prerequisite
The road network to contract should be topologically correct. Meaning all edges connected by source or target to another edge should share the same start/end point.
Ideally, multiple sources/targets should not share the same geometry (=point), but this is dealt with as an inconsistency.
General idea
Given a list of attributes of interest, we want to identify groups of edges, that share the same value for those attributes, and could be linearly contracted.
Steps
Clean
We revert edges to only have forward or both ways edges.
Identify
We identify linear nodes.
We remove the nodes sharing the same geometry.
We identify the edges starting or ending on the remaining nodes.
Group
We group those edges by geometry (cluster) and by attributes.
Merge
We try merge each group of edges into a single LineString.
When the result of a merge is a MultiLineString, the group can't be merged and is kept as is (inconsistency).
Limit cases - why some extra steps
First and last edges
We want to merge edges starting or ending on linear nodes, but we must keep non-linear nodes.
Thus, edges starting or ending on linear nodes is not specific enough. Here is a counter-example:
The edges "a", "b", "c" and "d" are all starting OR ending on linear nodes, but "b" and "c" should not be merged.
The solution is to first keep only edges starting AND ending on linear nodes, then add the first and last edges.
Attributes variations in a cluster
When edges have been identified, we group them by geometry and attributes.
This can lead to gaps when one of the attributes of interest is changing value along the line of edges to contract.
Here is an example, where "a", "b" and "c" have been grouped in the same cluster, but only "a" and "c" are grouped by attributes.
Merging "a" and "c" will result in a MultiLineString, so we discard this group as an inconsistency.
Edges sharing a point with no source/target connection
Sometimes, road network datasets are not topologically correct. This happens for edges when their geometry are linked, they share the same starting and ending point, but their attributes are not, they do not share the same source and target.
As the first step - identifying edges touching linear nodes - relies on attributes, and the second step - clustering - relies on geometry, this can lead to inconsistencies.
illustration
identify by attributes
"b" and "c" are connected by source/target on node "3".
"e" and "f" are connected by source/target on node "6".
This means that all the edges have been identified as starting or ending on linear nodes.
cluster by geometry
When clustering, as "e" and "f" touch "b" and "c", all four edges would be put in the same group to be merged.
The solution here is to remove nodes "3" and "6", that share the same geometry, from linear nodes.
ST_LineMerge changing point order
Sometimes, when merging edges, the orientation of some of the edges will be flipped.
Source and target
This may cause the source and target of the resulting edge not to be accurate.
We update the source and target by matching the starting and ending point of the new geometry to the previously extracted linear nodes.
Reversed "cost" and "reverse_cost"
This may also cause "cost" and "reverse_cost" to be reversed.
But as we made sure to never have backward edges, this can only happen with both ways edges.
And as we group by attribute, especially by "speed" and "reverse_speed", this means that "cost" and "reverse_cost" will have the same value.
Algorithm
- create the contracted edges table
- duplicate from the edge table
- reverse the one way edges in order to only have forward edges
- create the candidate (=linear) node table
- extract all the sources and targets with their geometry
- keep only the nodes that link exactly 2 edges
- remove the nodes that share the same geometry
-
first pass - center of 3 or more
- keeping only the edges where both source and target is a candidate node
- apply the cluster DBSCAN function with a maximum distance of 0 and a minimum edge count of 1
We group only touching edges, but an edge could be a group on it's own.
- group the edges by cluster and attributes of interest
- remove MultiLineString inconsistencies
update source and target of newly contracted edges, due to possible line merge inversion
clean up old edges, and use id for newly contracted edges
-
second pass - extend to source
- join and merge each newly contracted edge to it's source edge, if they share the same attributes of interest
clean up old edges, and use id for newly contracted edges
-
second pass - extend to target
- join and merge each newly contracted edge to it's target edge, if they share the same attributes of interest
clean up old edges, and use id for newly contracted edges
-
third pass - 2 exactly
- keeping the edges where source or target is a candidate node
- group by this node id to get pairs of edges
- keep only the pairs that share the same attributes
- keep only the edges appearing only once in all the pairs
An edge appearing more than once belongs to a group of more that 2 edges, already dealt with, inconsistency.
- merge the pair and keep the orientation
clean up old edges, and use id for newly contracted edges
SQL implementation
-- SELECT count(*) FROM edge;
-- edge_contracted
DROP TABLE IF EXISTS edge_contracted;
CREATE TABLE edge_contracted AS TABLE edge WITH DATA;
ALTER TABLE edge_contracted ADD PRIMARY KEY (id);
ALTER TABLE edge_contracted ADD COLUMN one_way boolean;
ALTER TABLE edge_contracted ADD COLUMN contracted_ids integer[];
CREATE INDEX IF NOT EXISTS edge_contracted_id ON edge_contracted USING btree (id);
CREATE INDEX IF NOT EXISTS edge_contracted_source ON edge_contracted USING btree (source);
CREATE INDEX IF NOT EXISTS edge_contracted_target ON edge_contracted USING btree (target);
CREATE INDEX IF NOT EXISTS edge_contracted_kmh ON edge_contracted USING btree (kmh);
CREATE INDEX IF NOT EXISTS edge_contracted_reverse_kmh ON edge_contracted USING btree (reverse_kmh);
CREATE INDEX IF NOT EXISTS edge_contracted_categorie ON edge_contracted USING btree (categorie);
UPDATE edge_contracted SET
geom = ST_Reverse(geom),
source = target, target = source,
cost = reverse_cost, reverse_cost = cost,
kmh = reverse_kmh, reverse_kmh = kmh
WHERE cost = -1;
UPDATE edge_contracted SET one_way = true WHERE reverse_cost = -1;
-- candidate_vertex
DROP TABLE IF EXISTS candidate_vertex;
CREATE TABLE candidate_vertex AS
WITH source_target_union AS (
SELECT id, source AS vid, ST_StartPoint(geom) as geom FROM edge_contracted
UNION
SELECT id, target AS vid, ST_EndPoint(geom) as geom FROM edge_contracted
),
ref_count_2 AS (
SELECT vid, ST_Union(geom) AS geom
FROM source_target_union
GROUP BY vid
HAVING count(*) = 2
)
SELECT (array_agg(vid))[1] AS vid, geom
FROM ref_count_2
GROUP BY geom
HAVING count(*) = 1;
CREATE INDEX IF NOT EXISTS candidate_vertex_vid ON candidate_vertex USING btree (vid);
CREATE INDEX IF NOT EXISTS candidate_vertex_geom ON candidate_vertex USING gist (geom);
-- Pass 1 - interior of merging 3 edges or more
INSERT INTO edge_contracted (
id, contracted_ids,
cost, reverse_cost,
kmh, reverse_kmh, categorie, one_way, geom
)
WITH
candidate_edge AS (
SELECT
e.id,
cost, reverse_cost,
kmh, reverse_kmh, categorie, one_way, geom
FROM edge_contracted AS e
WHERE
source IN (SELECT vid FROM candidate_vertex)
AND
target IN (SELECT vid FROM candidate_vertex)
),
clustered AS (
SELECT *, ST_ClusterDBSCAN(geom, eps := 0, minpoints := 1) over () AS cid
FROM candidate_edge
),
grouped AS (
SELECT
-(array_agg(c.id))[1] AS id,
array_agg(c.id) AS contracted_ids,
sum(cost) AS cost,
sum(reverse_cost) AS reverse_cost,
kmh,
reverse_kmh,
categorie,
one_way,
ST_LineMerge(ST_Union(geom)) as geom
FROM clustered AS c
GROUP BY cid, categorie, one_way, kmh, reverse_kmh
HAVING cid IS NOT NULL
)
SELECT * FROM grouped
WHERE ST_GeometryType(geom) = 'ST_LineString';
-- Update source / target
UPDATE edge_contracted AS e SET
source = (
SELECT vid FROM candidate_vertex AS v WHERE v.geom && ST_StartPoint(e.geom)
ORDER BY ST_Distance(v.geom, ST_EndPoint(e.geom))
LIMIT 1
)
, target = (
SELECT vid FROM candidate_vertex AS v WHERE v.geom && ST_EndPoint(e.geom)
ORDER BY ST_Distance(v.geom, ST_StartPoint(e.geom))
LIMIT 1
)
WHERE id < 0 AND contracted_ids IS NOT NULL;
WITH
delete_ids AS (
SELECT UNNEST(contracted_ids) AS id
FROM edge_contracted
WHERE id < 0 AND contracted_ids IS NOT NULL
)
DELETE FROM edge_contracted AS e
USING delete_ids AS d
WHERE e.id = d.id;
UPDATE edge_contracted SET id = -id WHERE id < 0;
-- SELECT * FROM edge_contracted WHERE source IS NULL OR target IS NULL;
-- SELECT * FROM edge_contracted WHERE cost IS NULL OR reverse_cost IS NULL;
-- Pass 2 - exterior of merging 3 edges or more - source
INSERT INTO edge_contracted (
id, contracted_ids,
source, target, cost, reverse_cost,
kmh, reverse_kmh, categorie, one_way, geom
)
WITH
multi_not_filtered AS (
SELECT DISTINCT ON (
id, contracted_ids,
source, target, cost, reverse_cost,
kmh, reverse_kmh, categorie, one_way, geom
)
-source_edge.id AS id,
array_prepend(source_edge.id, contracted_edge.contracted_ids) AS contracted_ids,
source_edge.source AS source,
contracted_edge.target AS target,
source_edge.cost + contracted_edge.cost AS cost,
source_edge.reverse_cost + contracted_edge.reverse_cost AS reverse_cost,
contracted_edge.kmh,
contracted_edge.reverse_kmh,
contracted_edge.categorie,
contracted_edge.one_way,
ST_LineMerge(ST_Union(source_edge.geom, contracted_edge.geom)) as geom
FROM edge_contracted AS contracted_edge
JOIN edge_contracted AS source_edge
ON contracted_edge.source = source_edge.target
AND contracted_edge.categorie = source_edge.categorie
AND contracted_edge.one_way = source_edge.one_way
AND contracted_edge.kmh = source_edge.kmh
AND contracted_edge.reverse_kmh = source_edge.reverse_kmh
WHERE contracted_edge.contracted_ids IS NOT NULL
)
SELECT * FROM multi_not_filtered
WHERE ST_GeometryType(geom) = 'ST_LineString';
WITH
delete_ids AS (
SELECT UNNEST(contracted_ids) AS id
FROM edge_contracted
WHERE id < 0 AND contracted_ids IS NOT NULL
)
DELETE FROM edge_contracted AS e
USING delete_ids AS d
WHERE e.id = d.id;
UPDATE edge_contracted SET id = -id WHERE id < 0;
-- Pass 2 - exterior of merging 3 edges or more - target
INSERT INTO edge_contracted (
id, contracted_ids,
source, target, cost, reverse_cost,
kmh, reverse_kmh, categorie, one_way, geom
)
WITH
multi_not_filtered AS (
SELECT DISTINCT ON (
id, contracted_ids,
source, target, cost, reverse_cost,
kmh, reverse_kmh, categorie, one_way, geom
)
-contracted_edge.id AS id,
array_append(contracted_edge.contracted_ids, target_edge.id) AS contracted_ids,
contracted_edge.source AS source,
target_edge.target AS target,
contracted_edge.cost + target_edge.cost AS cost,
contracted_edge.reverse_cost + target_edge.reverse_cost AS reverse_cost,
contracted_edge.kmh,
contracted_edge.reverse_kmh,
contracted_edge.categorie,
contracted_edge.one_way,
ST_LineMerge(ST_Union(contracted_edge.geom, target_edge.geom)) as geom
FROM edge_contracted AS contracted_edge
JOIN edge_contracted AS target_edge
ON contracted_edge.target = target_edge.source
AND contracted_edge.categorie = target_edge.categorie
AND contracted_edge.one_way = target_edge.one_way
AND contracted_edge.kmh = target_edge.kmh
AND contracted_edge.reverse_kmh = target_edge.reverse_kmh
WHERE contracted_edge.contracted_ids IS NOT NULL
)
SELECT * FROM multi_not_filtered
WHERE ST_GeometryType(geom) = 'ST_LineString';
WITH
delete_ids AS (
SELECT UNNEST(contracted_ids) AS id
FROM edge_contracted
WHERE id < 0 AND contracted_ids IS NOT NULL
)
DELETE FROM edge_contracted AS e
USING delete_ids AS d
WHERE e.id = d.id;
UPDATE edge_contracted SET id = -id WHERE id < 0;
-- Pass 3 - merging of 2 edges exactly
INSERT INTO edge_contracted
(id, contracted_ids,
source, target, cost, reverse_cost,
kmh, reverse_kmh, categorie, geom
)
WITH
group_by_vid AS (
SELECT v.vid, array_agg(e.id) AS edge_ids
FROM candidate_vertex AS v
JOIN edge_contracted AS e ON e.source = v.vid OR e.target = v.vid
GROUP BY v.vid
),
edge_pair AS (
SELECT vid,
e1.id AS id1, e1.source AS source1, e1.target AS target1, e1.cost AS cost1, e1.reverse_cost AS reverse_cost1,
e1.kmh AS kmh1, e1.reverse_kmh AS reverse_kmh1, e1.km AS km1, e1.categorie AS categorie1, e1.geom AS geom1,
e2.id AS id2, e2.source AS source2, e2.target AS target2, e2.cost AS cost2, e2.reverse_cost AS reverse_cost2,
e2.kmh AS kmh2, e2.reverse_kmh AS reverse_kmh2, e2.km AS km2, e2.categorie AS categorie2, e2.geom AS geom2
FROM group_by_vid
JOIN edge_contracted AS e1 ON e1.id = edge_ids[1]
JOIN edge_contracted AS e2 ON e2.id = edge_ids[2]
),
same_att_values AS (
SELECT *
FROM edge_pair
WHERE kmh1 = kmh2 AND reverse_kmh1 = reverse_kmh2 AND categorie1 = categorie2
),
edge_id_union AS (
SELECT vid, id1 as eid FROM same_att_values
UNION
SELECT vid, id2 as eid FROM same_att_values
),
edge_ref_count_too_high AS (
SELECT eid FROM edge_id_union GROUP BY eid HAVING count(*) > 1
),
direct_direct AS (
SELECT -id1 AS id, ARRAY[id1, id2],
source1 AS source, target2 AS target, cost1 + cost2 AS cost, reverse_cost1 + reverse_cost2 AS reverse_cost,
kmh1 AS kmh, reverse_kmh1 AS reverse_kmh, categorie1 AS categorie, ST_LineMerge(ST_Union(geom1, geom2)) AS geom
FROM same_att_values
WHERE target1 = source2
AND id1 NOT IN (SELECT eid FROM edge_ref_count_too_high)
AND id2 NOT IN (SELECT eid FROM edge_ref_count_too_high)
),
reverse_reverse AS (
SELECT -id2 AS id, ARRAY[id2, id1],
source2 AS source, target1 AS target, cost1 + cost2 AS cost, reverse_cost1 + reverse_cost2 AS reverse_cost,
kmh1 AS kmh, reverse_kmh1 AS reverse_kmh, categorie1 AS categorie, ST_LineMerge(ST_Union(geom1, geom2)) AS geom
FROM same_att_values
WHERE target2 = source1
AND id1 NOT IN (SELECT eid FROM edge_ref_count_too_high)
AND id2 NOT IN (SELECT eid FROM edge_ref_count_too_high)
)
SELECT * FROM direct_direct WHERE ST_GeometryType(geom) = 'ST_LineString'
UNION
SELECT * FROM reverse_reverse WHERE ST_GeometryType(geom) = 'ST_LineString';
WITH
delete_ids AS (
SELECT UNNEST(contracted_ids) AS id
FROM edge_contracted
WHERE id < 0 AND contracted_ids IS NOT NULL
)
DELETE FROM edge_contracted AS e
USING delete_ids AS d
WHERE e.id = d.id;
UPDATE edge_contracted SET id = -id WHERE id < 0;
-- Update length and reverse cost
UPDATE edge_contracted SET km = ST_Length(geom)/1000;
UPDATE edge_contracted SET reverse_cost=(CASE WHEN reverse_cost < 0 THEN -1 ELSE reverse_cost END);
-- Cleanup before export
ALTER TABLE edge_contracted DROP COLUMN one_way, DROP COLUMN contracted_ids;
-- SELECT count(*) FROM edge_contracted;
First results
On a project with a routing dataset of approximately 500.000 edges, this contraction reduced the number of edges by ~18%, reducing the computing time by ~24% on a number of predefined, project specific scenarios.
To go further
As previously mentioned, this algorithm is meant to be fast, so we deliberately cast away MultiLineString inconsistencies.
One already identified is when a clustered group of edges is split in attributes sub-groups, and those groups are disconnected.
Another is when the same node geometry links several edges that do not share a source/target.
There could be other not yet identified MultiLineString inconsistencies.
Solving these cases would mean less edges in the final result, and still better performances when it comes to routing algorithm, where the less the edges, the better the performances.
Top comments (4)
Thanks for sharing this post (Just watched your presentation in the PostGIS day webinar). Why not removing more nodes by computing a global cost between intersections ? If the cost is different (because of the speed attribute change), you could compute a single cost by interpolating costs between edges.
That's an excellent remark!
As I said in the webinar, the data is static but overloaded by dynamic changes before it's actually navigated. Those changes happen between the contraction and the navigation, and are based on attributes, so we need to keep those attributes.
For example, it's deciding that in a given area, all edges at 50 kph will in fact be at 45 kph, because of known traffic slow down.
Understood, thanks for your reply
Thanks for sharing your insights on a large and mission critical project using PostgreSQL/PostGIS