Calculating a running total/rolling sum in SQL is a useful skill to have.
It can often come in handy for reporting and even when developing applications. Perhaps your users might want to see a running total of the points they have gained or the money they have earned. Like most problems in SQL, there are multiple ways you can solve this.
You can use analytic functions, self joins, or an aggregate table that tracks the running sum. Let’s take a look at a few examples.
Skip down to the bottom if you’d rather watch these explanations in video form.
Using an analytic function is the easiest way to calculate a running total. An analytic function lets you partition data by a specific field. For instance, in this case, we can break down the rolling sum by driver_id and month. This will give us the running total by customer and month. So every month will start again at 0.
See the query below to see how to use an analytic function:
--We are assuming a month and year field are in the f_daily rides ----table which is abnormal.
--More likely you would have a datekey and a dimension table for all --your dates. However, we didn't want to add an extra join that ------might complicate the the concept we are explaining
SUM(Profit) over (partition by driver_id,year,month order by day)
This might be useful to compare how much profit a driver brings in day-over-day per month. You might be able to spot some sort of trend and help you notice bad drivers, or use it to generate performance reports for drivers.
They could easily switch between different days to see how much they have made as of a certain day of the month.
However, using an analytic function in an interview can sometimes lead the interviewer to ask if you know a different way to solve the problem. It is not that your answer is wrong, it is just they might be interested in seeing your thinking.
Another option to solve the running total problem is to use a self join. A self join refers to joining a table to itself. Unlike your typical join which uses the “=” for all values, we will be joining on multiple values. First, we will join the driver id and month/year normally and then we will join the date of the trip. When we join the date of the trip we will use the >= sign. This will join all the dates from one table that are greater than the dates from the other table.
So, if we look at the query below we will see this basic structure. It is a little bit trickier than just using an analytic function. If you were able to come up with this on your own, great job!
FROM f_daily_rides t1
JOIN f_daily_rides t2
On driver_id =driver_id
And t1.Month = t2.Month
And t1.Year = t2.Year
And t1.day >=t2.day
Group by day
Here is what the data would look like in the subquery. If you notice the t1.day field has repeat values for dates where the t1.day is larger than the t2.day. By doing this you can aggregate on t1.day and get a running total.
One of the last ways we have seen people calculate a running total is by using a subquery in the select statement. This has some similarities to a self join, except the self join occurs inside of the select statement. This is usually unadvised because using this method typically forces the query to scan the table for every row that exists in the original table all over again. This tends to be very inefficient, but we feel it is always good to have a general understanding of all the possible options.
Besides using direct ad-hoc queries, another option is to design a table to track the running total. What this design looks like will vary depending on the goal. More than likely the goal will be to either populate an application or dashboard. Based on the requirements changes the overall design. If we kept it simple, like the example above where we have a rolling sum for a user per month, then you just need to insert a new row that appends the previous value if it is for the same month and year.
--For simplicity we are not going to focus on the month and year
,(Select sum(profit) from f_daily_rides t2 where t1.day >=t2.day and t1.driver_id = t2.driver_id)
FROM f_daily_rides t1
This would limit how easy it is to be dynamic. For instance, what if you decide that you want to see the running total for every three months?
Any ideas? Here is a video that goes over these examples!
We are focusing on a problem-solving series and would love to know what problems you are trying to solve or you would like to see solved. If you have any please reach out today!
If you enjoyed this video about software engineering then consider these videos as well!
Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist