Managing Delivery Networks: A Use Case For Graph Databases

fppt profile image Filipe Peliz Pinto Teixeira ・6 min read

At takealot.com one of the biggest competitive advantages we have as an e-commerce platform is the maintenance and expansion of our own logistics network.This network allows us to control how and when we deliver to customers and, amongst other aspects, ensures that takealot.com is the leading e-commerce platform in South Africa.

In this article, I will provide an analysis of the unique problem takealot.com faced in facilitating reliable deliveries to customers and how we use a graph database to deliver a performant and scalable solution.

So What's The Problem?

One might ask why takealot.com does not make use of South Africa’s postal service and existing couriers. The answer, quite simply, is lack of reliability. Local postal services cannot provide accurate delivery dates. This means that using this service as our logistics backbone would not provide exact delivery dates to our customers. Additionally, existing courier companies would charge premiums which would add an additional cost largely incurred by customers. This key factor means that maintaining our own logistics operations remains the best course of action.

However, when growing a logistics network to fulfil the needs of a growing national e-commerce platform, one can't go from 1 to 11 overnight. As such, we had to ensure integration remained with existing third-party courier companies.

So the requirements of the system are to establish:

  1. How to get an order to the customer.
  2. When will the order get to the customer.
  3. Wether a third-party courier or our own delivery network will be used to fulfil the delivery.

For the remainder of this article we will focus on point 1 while touching on point 2. If there is interest I will try to expand point 2 in further articles.

Let's Make Ship Happen

A delivery to a customer in the takealot.com world consists of two phases:

  1. Courier the parcel from a warehouse to the branch closest to the customer. This can involve bouncing a package across multiple branches.
  2. Deliver the parcel from the branch to the customer.

In short, takealot.com needs to ensure an order is shipped from a central warehouse or third-party Marketplace seller to the customer in the shortest and most economical way possible.

This involves computing the shortest path, a classic computing problem. In principle, we consider branches to be a series of nodes which can be displayed as such:

When you link them all together with edges representing routes and run a graph search algorithm we are delivering and fast celebrating with ice-cold beers and high fives.

We implemented the above via pgrouting and Yen's Algorithm. This was the latency with that approach:

Getting Routes via Yen's Algorithm on PGRouting

Within 135ms we will know how to get to the customer. After finding out when we get to the customer the system response time looks like:

Getting Routes & Dates For Deliveries

After 510ms we will know how to get to you and when. Success!

Then we needed to scale and found an issue. The graph representation of the delivery network was just a graph; a bunch of nodes connected by a bunch of edges. This means that the application of delivery constraints such as making sure we don't deliver a fridge on a scooter was done outside of the graph in post.

As additional routes, couriers and branches were added, performance degraded to the point where we could not expand delivery capabilities without making the customer wait several seconds for a delivery date.

Eventually, this meant third-party couriers could not be continually integrated; amongst other challenges, affordable rural couriers were now unavailable to the company.

Property Graphs To The Rescue

Turns out we didn't need a graph we needed a property graph. In short, a property graph is a graph data structure with the addition of properties (key, value pairs) which sit on the edges and vertices of the graph. The previously discussed graph model went from a bunch of connected nodes to:

with each edge/vertex having following properties:

This meant that we were able to push all route constraint logic to the graph. In terms of figuring out how to get to customers, no post processing was required. We were able to facilitate this via JanusGraph as a data layer with TinkerPop & Gremlin facilitating the query language. This also meant that hundreds of lines of post processing code turned into a simple gremlin-scala query like so:

g.V().has(Keys.Vertex.LOOKUPID, lookupIDTo)
              .has(Keys.Edge.PHYSMINWEIGHT, P.lte(attributes.physWeight))
              .has(Keys.Edge.PHYSMAXWEIGHT, P.gte(attributes.physWeight))
              //More property constraints
            .has(Keys.Vertex.HUBACTIVE, true)
              .has(Keys.Edge.PHYSMINWEIGHT, P.lte(attributes.physWeight))
              .has(Keys.Edge.PHYSMAXWEIGHT, P.gte(attributes.physWeight))
              //More property constraints
              .has(Keys.Vertex.HUBACTIVE, true)
            ).until(_.has(Keys.Vertex.LOOKUPID, lookupIDFrom)).limit(10).path

Any gremlin fan reading this will note how we are traversing in reverse. Let's say that edge labels with a better understanding of our degree distribution allowed for further optimisations.

The above query not only checks how we can get to customers but also uses the details of the parcels (such as weight) being delivered to more quickly eliminate routes.

Now, let’s take another look at the latency graph below:

A P95 response time of 10ms, an order of magnitude better. In addition to this, as takealot.com added more hubs and couriers there was no performance degradation (or at least there hasn't been yet). This is because our search space remained limited to the number of LINE_HAUL edges we have and most expansions to the logistical networks occur at the LAST_MILE edges layer. Another perk of property graph is the ability to formally categorise and layer the structure of the graph via labels as well properties.

This new property graph structure also allowed us to represent our temporal constraints in a structured and traversable manner. However, computing when the delivery gets to the customer is still performed outside of the graph. The more natural representation of temporal data (not logic) on the graph allowed us to optimise at that layer as well. In the end the final system response time is:

A P95 response time of 150ms is much more acceptable. As I said, if there is interest in this article I may follow up with how the graph allowed us to optimise the delivery date computations as well.

This new property graph model provides many other non-performance related benefits such as improved observability, easier modifications and many others.

So Are We Making Ship Happen?

I would like to think we are. The system has allowed us to more accurately predict deliveries dates - not taking into account potential operational delays such as stock arriving late from suppliers, bad traffic, etc. . . Accounting for potential operational delays on delivery routes could be another challenge for us to try solve next.

Are property graphs a silver bullet to computing problems? No! Definitely not! Just as you would not use a plunger to take out a screw you would not use a graph database to model a shopping cart for example.

However, this new property graph model has opened up multiple discussions which could lead to more customer facing improvements.

We are excited to push this tech even further and see what we can do with it: we need to make more ship happen.


Editor guide
mairas profile image
Matti Airas

Just out of interest - are all your transports modeled independently? What if you had one site sharing e.g. outbound capacity with multiple edges, or if you had a single last-mile transport deliver to multiple locations? What would be your preferred way for modeling such shared capacities?

fppt profile image
Filipe Peliz Pinto Teixeira Author

are all your transports modeled independently?

Types of transports and their capacity are, i.e. trucks vs scooters vs cars.

What if you had one site sharing e.g. outbound capacity with multiple edges [...]

We model this via what we call a "Satellite" Hub. A satellite hub represents different physical delivery capacity to different areas but from the same physical "Main" hub. This complicates the temporal calculations a bit but we have found a convention which works.

The article actually contains an example of this in the picture - "BRANCH-GAR" is a Main Hub and "BRANCH-SEA" is a Satellite hub. They deliver to different areas with different fleets.

That make sense Matt?

P.S. Good question, that was one of the first hurdles when we sat down to model this.

danielcrowe profile image
daniel crowe

@fppt - Check out Grakn.ai (@graknlabs github) - think you'll find our work super interesting and potentially valuable to this work.

fppt profile image
Filipe Peliz Pinto Teixeira Author

Hahaha it's a small world. I am familiar with grakn. A knowledge graph may be the next evolution of this system, if it is, I will certainly suggest Grakn.