DEV Community: Filipe Peliz Pinto Teixeira

Managing Delivery Networks: A Use Case For Graph Databases

Filipe Peliz Pinto Teixeira — Wed, 30 Oct 2019 13:37:19 +0000

At takealot.com, one of the biggest competitive advantages we have as an e-commerce platform is the maintenance and expansion of our own logistics network. This network allows us to control how and when we deliver to customers and, amongst other aspects, ensures that takealot.com is the leading e-commerce platform in South Africa.

In this article, I will provide an analysis of the unique problem takealot.com faced in facilitating reliable deliveries to customers and how we use a graph database to deliver a performant and scalable solution.

So What's The Problem?

One might ask why takealot.com does not make use of South Africa’s postal service and existing couriers. The answer, quite simply, is lack of reliability. Local postal services cannot provide accurate delivery dates, so using them as our logistics backbone would mean we could not give exact delivery dates to our customers. Additionally, existing courier companies would charge premiums, a cost that would largely be passed on to customers. These factors mean that maintaining our own logistics operations remains the best course of action.

However, when growing a logistics network to fulfil the needs of a growing national e-commerce platform, one can't go from 1 to 11 overnight. As such, we had to make sure the system stayed integrated with existing third-party courier companies.

So the requirements of the system are to establish:

1. How to get an order to the customer.
2. When the order will get to the customer.
3. Whether a third-party courier or our own delivery network will be used to fulfil the delivery.

For the remainder of this article we will focus on point 1 while touching on point 2. If there is interest I will try to expand point 2 in further articles.

Let's Make Ship Happen

A delivery to a customer in the takealot.com world consists of two phases:

Courier the parcel from a warehouse to the branch closest to the customer. This can involve bouncing a package across multiple branches.
Deliver the parcel from the branch to the customer.

In short, takealot.com needs to ensure an order is shipped from a central warehouse or third-party Marketplace seller to the customer in the shortest and most economical way possible.

This involves computing the shortest path, a classic computing problem. In principle, we consider branches to be a series of nodes which can be displayed as such:

When you link them all together with edges representing routes and run a graph search algorithm, we are delivering fast and celebrating with ice-cold beers and high fives.
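To make the shortest path idea concrete, here is a minimal sketch in plain Scala using Dijkstra's algorithm. It is an illustration only, not our production implementation: the `Route` type, the `ShortestRoute` object, the branch names and the costs are all made up, and costs are assumed to be non-negative.

```scala
import scala.collection.mutable

object ShortestRoute {
  // Hypothetical route between two branches, with a cost such as travel time.
  final case class Route(from: String, to: String, cost: Int)

  /** Dijkstra's algorithm. Returns (total cost, branch sequence), or None if unreachable. */
  def shortestPath(routes: Seq[Route], source: String, target: String): Option[(Int, List[String])] = {
    val adjacency: Map[String, Seq[Route]] = routes.groupBy(_.from)
    val best = mutable.Map(source -> 0)               // cheapest known cost per branch
    val previous = mutable.Map.empty[String, String]  // predecessor on the cheapest path
    // Sorted by (cost, branch), so `head` is always the cheapest unexplored branch.
    val frontier = mutable.TreeSet((0, source))

    while (frontier.nonEmpty) {
      val current = frontier.head
      frontier -= current
      val (cost, branch) = current
      for (route <- adjacency.getOrElse(branch, Seq.empty)) {
        val candidate = cost + route.cost
        if (candidate < best.getOrElse(route.to, Int.MaxValue)) {
          // Drop the outdated frontier entry (if any) before recording the cheaper one.
          best.get(route.to).foreach(old => frontier -= ((old, route.to)))
          best(route.to) = candidate
          previous(route.to) = branch
          frontier += ((candidate, route.to))
        }
      }
    }

    // Walk the predecessor links back from the target to rebuild the path.
    @annotation.tailrec
    def build(node: String, acc: List[String]): List[String] =
      previous.get(node) match {
        case Some(prev) => build(prev, node :: acc)
        case None       => node :: acc
      }

    best.get(target).map(total => (total, build(target, Nil)))
  }
}
```

Calling `ShortestRoute.shortestPath(routes, "A", "D")` returns the cheapest chain of branches from `A` to `D`, if there is one. Note that this finds a single best path; a k-shortest-paths algorithm would return several ranked alternatives instead.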

We implemented the above via pgrouting and Yen's Algorithm. This was the latency with that approach:

Within 135ms we will know how to get to the customer. After finding out when we get to the customer the system response time looks like:

After 510ms we will know how to get to you and when. Success!

Then we needed to scale, and found an issue. The graph representation of the delivery network was just a graph: a bunch of nodes connected by a bunch of edges. This meant that delivery constraints, such as making sure we don't deliver a fridge on a scooter, were applied outside of the graph in post-processing.
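Roughly, the "constraints in post" pattern looks like the sketch below. It is a simplified illustration in plain Scala, not our original code: the `Graph` alias, the `WeightLimit` type and the separate `limits` lookup are assumptions made up for the example. The point to notice is that every path is enumerated first and only then checked against the constraints.

```scala
object PostFilterRouting {
  // Plain graph: adjacency only, nothing stored on the edges themselves.
  type Graph = Map[String, List[String]]

  // Hypothetical constraint record, kept in a lookup outside of the graph.
  final case class WeightLimit(minKg: Double, maxKg: Double)

  /** Enumerates every simple path from `from` to `to`, depth-first. */
  def allSimplePaths(graph: Graph, from: String, to: String,
                     visited: List[String] = Nil): List[List[String]] = {
    val path = from :: visited // kept in reverse order while searching
    if (from == to) List(path.reverse)
    else graph.getOrElse(from, Nil)
      .filterNot(next => path.contains(next)) // never revisit a node
      .flatMap(next => allSimplePaths(graph, next, to, path))
  }

  /** True when every leg of the path has a limit that admits the parcel weight. */
  def feasible(path: List[String],
               limits: Map[(String, String), WeightLimit],
               parcelKg: Double): Boolean =
    path.zip(path.drop(1)).forall { case (a, b) =>
      limits.get((a, b)).exists(l => l.minKg <= parcelKg && parcelKg <= l.maxKg)
    }

  /** Search first, apply constraints afterwards ("in post"). */
  def routesFor(graph: Graph, limits: Map[(String, String), WeightLimit],
                from: String, to: String, parcelKg: Double): List[List[String]] =
    allSimplePaths(graph, from, to).filter(path => feasible(path, limits, parcelKg))
}
```

In a sketch like this, the search cost depends on the total number of paths in the network rather than on the number of paths a given parcel can actually use, so every new route or courier adds work even for parcels it could never carry.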

As additional routes, couriers and branches were added, performance degraded to the point where we could not expand delivery capabilities without making the customer wait several seconds for a delivery date.

Eventually, this meant third-party couriers could not be continually integrated; amongst other challenges, affordable rural couriers were now unavailable to the company.

Property Graphs To The Rescue

Turns out we didn't need a graph; we needed a property graph. In short, a property graph is a graph data structure with the addition of properties (key-value pairs) which sit on the edges and vertices of the graph. The previously discussed graph model went from a bunch of connected nodes to:

with each edge/vertex having the following properties:

This meant that we were able to push all route constraint logic to the graph. In terms of figuring out how to get to customers, no post processing was required. We were able to facilitate this via JanusGraph as a data layer with TinkerPop & Gremlin facilitating the query language. This also meant that hundreds of lines of post processing code turned into a simple gremlin-scala query like so:

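// Start at the vertex we are delivering to and walk the graph in reverse
g.V().has(Keys.Vertex.LOOKUPID, lookupIDTo)
  // Last-mile legs into that vertex which can carry a parcel of this weight
  .inE(LegType.LAST_MILE.toString)
    .has(Keys.Edge.PHYSMINWEIGHT, P.lte(attributes.physWeight))
    .has(Keys.Edge.PHYSMAXWEIGHT, P.gte(attributes.physWeight))
    //More property constraints
  .outV()
    .has(Keys.Vertex.HUBACTIVE, true)
  // Keep following line-haul legs backwards, through active hubs only
  .repeat(
    _.inE(LegType.LINE_HAUL.toString)
      .has(Keys.Edge.PHYSMINWEIGHT, P.lte(attributes.physWeight))
      .has(Keys.Edge.PHYSMAXWEIGHT, P.gte(attributes.physWeight))
      //More property constraints
    .outV()
      .has(Keys.Vertex.HUBACTIVE, true)
    .simplePath()
  // Stop once we reach the vertex we are shipping from; keep at most 10 paths
  ).until(_.has(Keys.Vertex.LOOKUPID, lookupIDFrom)).limit(10).path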

Any Gremlin fan reading this will note that we are traversing in reverse. Let's just say that edge labels, together with a better understanding of our degree distribution, allowed for further optimisations.

The above query not only checks how we can get to customers but also uses the details of the parcels (such as weight) being delivered to more quickly eliminate routes.
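For readers without a JanusGraph instance to hand, the same idea can be sketched in plain Scala. This is a toy model under stated assumptions, not how JanusGraph or TinkerPop work internally: `Vertex`, `Edge`, `Graph` and `routes` are hypothetical names, the property keys are illustrative strings, and vertices are looked up directly by id rather than by a lookup property. What it tries to show is the constraints being checked on each edge and vertex during the walk, so infeasible branches are never expanded.

```scala
object PropertyGraphSketch {
  // Vertices and edges both carry key-value properties; edges also carry a label.
  final case class Vertex(id: String, props: Map[String, Any])
  final case class Edge(label: String, from: String, to: String, props: Map[String, Double])

  final case class Graph(vertices: Map[String, Vertex], edges: List[Edge]) {
    /** Incoming edges of a vertex with the given label. */
    def inE(vertexId: String, label: String): List[Edge] =
      edges.filter(e => e.to == vertexId && e.label == label)
  }

  // An edge can carry the parcel if the weight is within its min/max properties.
  private def carries(e: Edge, kg: Double): Boolean =
    e.props.get("physMinWeight").exists(_ <= kg) &&
      e.props.get("physMaxWeight").exists(_ >= kg)

  private def hubActive(g: Graph, id: String): Boolean =
    g.vertices.get(id).exists(_.props.get("hubActive").contains(true))

  /** Walks backwards from the customer to the warehouse, pruning as it goes. */
  def routes(g: Graph, warehouse: String, customer: String,
             kg: Double, limit: Int = 10): List[List[String]] = {
    // One reverse hop: usable in-edges of one label, starting at active, unvisited hubs.
    def step(at: String, label: String, seen: List[String]): List[String] =
      g.inE(at, label).filter(carries(_, kg)).map(_.from)
        .filter(hubActive(g, _))
        .filterNot(id => seen.contains(id))

    // Repeat line-haul hops (at least one) until the warehouse is reached.
    // Prepending while walking backwards leaves `path` in forward order.
    def lineHaul(path: List[String]): Iterator[List[String]] =
      step(path.head, "LINE_HAUL", path).iterator.flatMap { prev =>
        val extended = prev :: path
        if (prev == warehouse) Iterator.single(extended) else lineHaul(extended)
      }

    // Iterators keep this lazy, so `take(limit)` stops the search early.
    step(customer, "LAST_MILE", List(customer)).iterator
      .flatMap(hub => lineHaul(List(hub, customer)))
      .take(limit)
      .toList
  }
}
```

With a graph where one hub's last-mile edge tops out at 5 kg, calling `PropertyGraphSketch.routes(graph, "WH", "CUST", 80.0)` never expands that hub's line-haul edges at all, which is the difference from filtering complete paths after the search.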

Now, let’s take another look at the latency graph below:

A P95 response time of 10ms, an order of magnitude better. In addition to this, as takealot.com added more hubs and couriers there was no performance degradation (or at least there hasn't been yet). This is because our search space remains limited to the number of LINE_HAUL edges we have, and most expansions to the logistics network occur at the LAST_MILE edge layer. Another perk of property graphs is the ability to formally categorise and layer the structure of the graph via labels as well as properties.

This new property graph structure also allowed us to represent our temporal constraints in a structured and traversable manner. However, computing when the delivery gets to the customer is still performed outside of the graph. The more natural representation of temporal data (not logic) on the graph allowed us to optimise at that layer as well. In the end the final system response time is:

A P95 response time of 150ms is much more acceptable. As I said, if there is interest in this article I may follow up with how the graph allowed us to optimise the delivery date computations as well.

This new property graph model also provides benefits unrelated to performance, such as improved observability and easier modifications, among others.

So Are We Making Ship Happen?

I would like to think we are. The system has allowed us to predict delivery dates more accurately, not taking into account potential operational delays such as stock arriving late from suppliers, bad traffic, etc. Accounting for potential operational delays on delivery routes could be another challenge for us to try to solve next.

Are property graphs a silver bullet for computing problems? No! Definitely not! Just as you would not use a plunger to take out a screw, you would not use a graph database to model a shopping cart, for example.

However, this new property graph model has opened up multiple discussions which could lead to more customer facing improvements.

We are excited to push this tech even further and see what we can do with it: we need to make more ship happen.

Grakn’s (mostly) Agile Methodology

Filipe Peliz Pinto Teixeira — Wed, 02 Aug 2017 14:16:34 +0000

Photo credit: Mincarconsulting.com

This article originally appeared on the GRAKN.AI blog.

Anyone who has ever found themselves monitoring or managing any kind of project may have encountered the term “It’s like herding cats”. These days this refers to the difficulty associated with managing the many interacting components of a project. Originally it was specific to people:

Managing senior programmers is like herding cats.

The difficulty associated with managing software projects and minimising the risks of those projects has spawned off an almost entirely different industry in itself. You will often hear people boast about following some formal workflow process. Waterfall, Agile, Scrum, Kanban, the list goes on... Some actually get quite passionate about these approaches. I have even heard of people not in software engineering following these formal processes. Personally, I am of the thought that you should find something that works for the team you have, stick with it when it works, and tweak it when it doesn’t work.

So this is a tale and a retrospective look at how we at Grakn Labs found our “something that seems to work for us”. This tale may aid you in finding your something.

We found ourselves in an interesting position in comparison to most software teams. We are not all software engineers. Some of us fall more on the data scientist side, some on the rapid prototype side, and some on the marketing side. Regardless, when you have a small mixed team that has to reach objectives quickly, it’s important to get everyone working in the same manner. So, in our case, our cats are on fire and each cat is trying to put out the fire with different materials. Finally, you, who are watching this, are also on fire.

Grakn – Day 0

It’s year one of Grakn, we have vague requirements, a rough design document, and zero lines of code to work with. We also have a new and relatively young team. From the management perspective you should quickly get everyone on the same platform and process. With this being a new team our CEO had time to investigate and introduce our new platform, TargetProcess. Think Jira, but less bloated and more focussed.

The tool is only the start and should be used to facilitate the needs of the team. At this stage we started going with a Waterfall-like model. Basically all work has to go through the following pipeline:

1. Requirements Gathering – Simply list the high level features which need to be delivered by the software. Anything goes here, so list whatever you think adds value to your project, as long as it’s high level. It’s important to encourage this attitude as it allows everyone in your team to contribute in some way or another to the direction of the project.
2. Requirements Discussion – At this stage you will be accepting and rejecting requirements based on: (1) does the requirement fit within the vision of the product? and (2) is it feasible to do, regardless of timeline? During this stage, avoid discussing design. The objective is to create your feature set and prioritise those features: your must-haves for your minimum viable product.
3. Design Time – Pick your highest priority requirements and start designing and outlining a rough implementation. Be sure to document any decisions and assumptions. You don’t need to go through all the requirements, just enough to keep everyone busy for some time. When time allows, you can do the same for the lower priority requirements. Not planning all the requirements is important, as things may change before you finally get around to implementing them, and with a young team on a new project you will need flexibility.

The Requirements Board We Use For a subset of the project

This basically takes care of managing which features get into the product. If you notice the new column looking empty, time to do step 1. If you notice the accepted column looking thin, time for step 2. Finally, if the in progress column is looking thin, time for step 3.

Next up is task planning and task execution, and this is where we set our first cats on fire.

Running Sprints

Now that you know what needs to get done, you need to make sure progress can be monitored and communicated in a non-intrusive manner. Enter the Scrum Board:

Our Weekly Sprint Board on TargetProcess

Borrowing a page from Scrum methodology we run weekly sprints. At the end of each week we group up and fill our todos for the following week. This is basically the process of taking accepted and designed requirements and breaking them down into executable tasks. Doing this serves two purposes:

Planning the tasks you are going to execute in advance minimises “dead time”, i.e. time spent waiting for others to do something you need, or time spent looking for your next actionable task when you are not sure what to do next.
By planning in smaller groups we more effectively communicate how we are going to impact each other. This minimises surprises throughout the week. For example, if you are going to be refactoring some API calls, your colleagues should know about this in advance, before they wonder where their API has gone.

The Scrum Board allows each developer to be aware of everyone’s status on each task without having to poke anyone. A digital board also makes this easier to work with, as changing the status of a task is as simple as dragging a card to the next column.

KISS – Keep It Simple Stupid

KISS – This phrase is often used when designing software and I believe it should be applied towards managing workflows as well.

This is where things started to go wrong. A new team, a new workflow, and vague requirements will result in some failures. We had plenty of these during the early days of Grakn. From not properly defining requirements, to skipping the requirements pipeline altogether, to just failing to plan, we experienced all of these in our mad rush to our first PoC, Moogi.

Some may argue that these failures are a result of not strictly sticking to established and proven processes. In other words, if you are going to Scrum, you Scrum 100%; if you are going to Kanban, you go with that 100%, without deviation. This is what we tried. This is where we failed. We found that sticking to a specific process was never flexible enough. When you have 6 months to get to a PoC you can’t spend much time on your workflow. So this is where we started to deviate from the proven processes, which resulted in things getting better... until they got bad again.

These processes are proven and established for a reason and too much deviation from them results in failed planning and planning to fail. So at this stage we learnt our lessons and went with a different approach. We treated these processes as guidelines and picked those which worked, while abandoning those which failed.

KISS – Estimating Effort

One of the most common principles of Scrum is in estimating effort associated with tasks. Many insist that you should not use hours when estimating task difficulty. You should use story points, or hot dog sizes, or precipitation levels, or some other layer of abstraction.

One of the reasons behind this is that engineers do not like associating hours with tasks. However, in most cases (not all) these abstracted points get converted into hours anyway. So this layer of abstraction exists to protect our feelings, at the cost of the opportunity to get better at estimating the time it takes to execute tasks. I know this is a controversial standpoint, but hear me out: knowing how long it will take you to do something is a valuable skill for anyone to possess. It is a skill which is more difficult to develop when you think in terms of hot dogs and not time. So KISS – go back to hours. Yes, you will get it wrong often but, as time goes on, you will get increasingly accurate. Of course, if the point abstraction works for your team then stick with it, but in our smaller team, where everyone is responsible for individual components, time-based estimations work better for us.

When you start using hours for task estimation, do not use them to keep time sheets on your people. If you do, you are going to encourage people to pad out the hours, which defeats the purpose of this exercise. KISS – workflow management is not the same as keeping time sheets.

KISS – Communication

Another common aspect of Scrum is the daily standup. The principle is simple: every morning everyone stands up and gives a brief talk about what they did the day before, and what they are going to do today. This has the common pitfall of running over the allotted time.

Daily Standups are supposed to be brief 10 minute intros to the day. However, more often than not they run over because, as developers, we are enthusiastic people and will talk any implementation detail into the ground if given the chance. Tabs vs spaces, anyone? This is such a common problem that you can find articles suggesting ways to deal with it. Some people use stopwatches, talking sticks (you can only talk if you have the stick), druidic rituals, and many other workarounds. KISS – If it doesn’t work for your team, get rid of it.

We replaced the daily standup with a daily email update everyone sends at the end of the work day. This email simply states what they did, issues they encountered, and what they are planning for the next day. If conversation needs to continue, it can do so on the email thread without dragging in people who do not need to be involved. In addition to this, we have a weekly standup at the start of every week to say what our objective is for the coming week. This is high level enough that dev talk rarely happens, and if we do run over it only happens once a week, so we don’t feel the need to introduce controlling measures such as the previously mentioned stopwatches and druids.

Practice Makes Perfect

Handling incorrect task estimations needs to be done carefully. It is easy to accidentally create an environment that rewards overestimating and punishes underestimating. The objective is to reward accurate task estimations and to help people understand why certain tasks were misestimated. This applies regardless of whether they were executed 2 times faster or slower than expected.

You may ask: why are accurate estimations so important? Isn’t it a good thing to have people finishing work ahead of plan? Yes, of course! However, wouldn’t it be even better if you knew they were going to finish ahead of time? Furthermore, there are troubles that can occur when constantly under- or overestimating:

Overestimating can lead to people getting to the middle of their week and not being sure what to do next. Senior staff with good initiative and intuition may easily jump on another task, but some (if not most) people are reluctant to do so. So it’s best to avoid this.
Underestimating means that you risk blocking other developers or missing deadlines. It also means you can’t accurately plan around someone.

One general guideline which has helped us in the past is that any task over 8 hours should be broken down further. Luckily TargetProcess lets us do this:

A more granular breakdown of tasks

Grakn – Present Day

This roughly covers how we do business and how we work at Grakn Labs. It’s not perfect but it’s gotten us to where we are today. We are still tweaking here and there but the core principles have been maintained since we started our journey.

There are still many challenges which we are facing with regards to this workflow. The research side of our work still doesn’t perfectly match this workflow. I have heard of research teams who have incorporated Agile principles into their work and I would love to know people’s thoughts on how to do it. Similarly, there is still the vagueness of non-requirement-driven and non-technical tasks. For example, I am writing this blog piece, which takes time. How does this sort of work get factored in? Should it get factored in? I have been told that these are common problems amongst small teams of varying composition and background. I would love to hear your thoughts on how your teams have tackled these issues.

I hope you enjoyed this high level review of how we went from a group of cats on fire running in random directions to cats on fire running in the same direction. Maybe one day we will no longer be on fire.

Thanks to Thanh for converting this article to Markdown
