Calculating a running total/rolling sum in SQL is a useful skill to have.
It can often come in handy for reporting and even when developing applications. Perhaps your users might want to see a running total of the points they have gained or the money they have earned. Like most problems in SQL, there are multiple ways you can solve this.
You can use analytic functions, self joins, or an aggregate table that tracks the running sum. Let’s take a look at a few examples.
Skip down to the bottom if you’d rather watch these explanations in video form.
Using an Analytic Function
Using an analytic function is the easiest way to calculate a running total. An analytic function lets you partition data by a specific field. For instance, in this case, we can break down the rolling sum by driver_id and month. This will give us the running total by customer and month. So every month will start again at 0.
See the query below to see how to use an analytic function:
--We are assuming a month and year field are in the f_daily rides ----table which is abnormal.
--More likely you would have a datekey and a dimension table for all --your dates. However, we didn't want to add an extra join that ------might complicate the the concept we are explaining
Select
SUM(Profit) over (partition by driver_id,year,month order by day)
,driver_id
,day
FROM f_daily_rides
This might be useful to compare how much profit a driver brings in day-over-day per month. You might be able to spot some sort of trend and help you notice bad drivers, or use it to generate performance reports for drivers.
They could easily switch between different days to see how much they have made as of a certain day of the month.
However, using an analytic function in an interview can sometimes lead the interviewer to ask if you know a different way to solve the problem. It is not that your answer is wrong, it is just they might be interested in seeing your thinking.
Using a Self Join
Another option to solve the running total problem is to use a self join. A self join refers to joining a table to itself. Unlike your typical join which uses the “=” for all values, we will be joining on multiple values. First, we will join the driver id and month/year normally and then we will join the date of the trip. When we join the date of the trip we will use the >= sign. This will join all the dates from one table that are greater than the dates from the other table.
So, if we look at the query below we will see this basic structure. It is a little bit trickier than just using an analytic function. If you were able to come up with this on your own, great job!
SELECT
day
,driver_id
,SUM(Profit)
FROM
(
Select
t1.day
,t2.day
,t1.driver_id
,t2.Profit
FROM f_daily_rides t1
JOIN f_daily_rides t2
On driver_id =driver_id
And t1.Month = t2.Month
And t1.Year = t2.Year
And t1.day >=t2.day
) t3
Group by day
Here is what the data would look like in the subquery. If you notice the t1.day field has repeat values for dates where the t1.day is larger than the t2.day. By doing this you can aggregate on t1.day and get a running total.
Using a Subquery in the Select Clause
One of the last ways we have seen people calculate a running total is by using a subquery in the select statement. This has some similarities to a self join, except the self join occurs inside of the select statement. This is usually unadvised because using this method typically forces the query to scan the table for every row that exists in the original table all over again. This tends to be very inefficient, but we feel it is always good to have a general understanding of all the possible options.
Designing a Table to Track a Rolling Sum
Besides using direct ad-hoc queries, another option is to design a table to track the running total. What this design looks like will vary depending on the goal. More than likely the goal will be to either populate an application or dashboard. Based on the requirements changes the overall design. If we kept it simple, like the example above where we have a rolling sum for a user per month, then you just need to insert a new row that appends the previous value if it is for the same month and year.
--For simplicity we are not going to focus on the month and year
Select
t1.day
,t1.driver_id
,(Select sum(profit) from f_daily_rides t2 where t1.day >=t2.day and t1.driver_id = t2.driver_id)
FROM f_daily_rides t1
This would limit how easy it is to be dynamic. For instance, what if you decide that you want to see the running total for every three months?
Any ideas? Here is a video that goes over these examples!
We are focusing on a problem-solving series and would love to know what problems you are trying to solve or you would like to see solved. If you have any please reach out today!
If you enjoyed this video about software engineering then consider these videos as well!
Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
Top comments (7)
Hi Seattle Data Guy,
Great idea for a post. Analytical Functions and getting your head around self joins is really important.
When you say "See the query below to see how to use an analytic function" are you referring to the video or should there be a code snippet in the post itself?
Hello Helen!
Thank you for pointing that out. I forgot that gists don't work with github.
I changed the programming examples with markdown vs. github.
Cheers,
Ben
Looking good!
If you'd like the code to be highlighted for SQL all you need to do is add the word sql immediately after your first set of back ticks.
Oh! I didn't actually know that.
I went back and updated some other posts too that have python and json.
Thanks for all your help!
No worries Ben, keep the awesome SQL content coming :)
At the risk of being tedious, I'll point out that your use of the analytic function assumes a particular default in the SQL dialect. For your example to be unambiguous you would need to add "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW".
I've covered this previously in a similar response on dev.to - see dev.to/helenanders26/sql-301-why-y...
While this can seem to be nitpicking, I think it tells us all that we can easily mistakenly think we know a feature of the SQL language when in fact we just know a feature of some implementations. i.e. the implementation is applying a default ROWS clause whether we realise it or not.
On another point, I have found a main reason for not using analytic functions occurs when they're simply not there in the dialect. You might be surprised how many of these are around in all sorts of products.
You've done a good job here of describing alternatives, I expect someone somewhere sometime will find themselves needing that option and will be grateful for your examples.
It's very true!
That is why I provided multiple ways to perform the action! Not just using an analytic function. Analytic functions just tend to be the quickest and easiest to understand.