Here is my Python and duckdb take on the Spot the Sale challenge by Maven Analytics. Let's start with Python!
## Solutions
To start, we need to load the data:

```python
import pandas as pd

promo_df = pd.read_csv("promotions.csv", parse_dates=["start_date", "end_date"])
orders_df = pd.read_csv("orders.csv", parse_dates=["order_date"])
```
### First Take
My approach uses a User Defined Function:

```python
def get_promo(dt: pd.Timestamp) -> str:
    filtered_promo = promo_df.where(
        (promo_df["start_date"] <= dt) & (promo_df["end_date"] >= dt)
    ).dropna()
    if filtered_promo.empty:
        return ""
    return filtered_promo["promo_id"].values[0]
```

which takes a datetime and checks whether it falls within the date range of a promotion. Lastly, we use the UDF with the `apply` method:

```python
res = orders_df.order_date.apply(get_promo)
```
This produces a series of promotion IDs, which I can check as follows:

```python
len(res[res == ""])
```

and get the expected number: 1916. However, this approach is very slow: about 1 second! Let's try to improve this.
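To make the row-wise idea concrete, here is a self-contained toy version on made-up data (the `P1`/`P2` promos, dates, and counts below are invented, and plain boolean indexing stands in for the `where().dropna()` combination):

```python
import pandas as pd

# Invented mini dataset -- not the challenge data
promo_df = pd.DataFrame(
    {
        "promo_id": ["P1", "P2"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "end_date": pd.to_datetime(["2024-01-10", "2024-02-05"]),
    }
)
orders_df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])}
)

def get_promo(dt: pd.Timestamp) -> str:
    # keep the promos whose [start_date, end_date] range contains dt
    hit = promo_df[(promo_df["start_date"] <= dt) & (promo_df["end_date"] >= dt)]
    return "" if hit.empty else hit["promo_id"].iloc[0]

res = orders_df["order_date"].apply(get_promo)
print(res.tolist())         # ['P1', '', 'P2']
print(len(res[res == ""]))  # 1 order falls outside every promo window
```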
### Take Two using `cut`
Using the start and end dates of each promotion, we construct an `IntervalIndex`, which is fed into the `pd.cut` function:

```python
res = pd.cut(
    orders_df["order_date"],
    bins=pd.IntervalIndex.from_arrays(
        promo_df["start_date"], promo_df["end_date"], closed="both"
    ),
    labels=promo_df["promo_id"],
)
```
Timing this shows a significant decrease to 1.8 ms.
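The machinery this approach relies on can be sketched on invented toy data: a non-overlapping `IntervalIndex` maps each timestamp to the position of the interval containing it via `get_indexer`, with `-1` for no match (the `P1`/`P2` promos and dates are made up):

```python
import pandas as pd

# Invented mini dataset -- not the challenge data
promo_df = pd.DataFrame(
    {
        "promo_id": ["P1", "P2"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "end_date": pd.to_datetime(["2024-01-10", "2024-02-05"]),
    }
)
order_dates = pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])

idx = pd.IntervalIndex.from_arrays(
    promo_df["start_date"], promo_df["end_date"], closed="both"
)
pos = idx.get_indexer(order_dates)  # -1 where no interval contains the date
labels = promo_df["promo_id"].to_numpy()
res = ["" if p == -1 else labels[p] for p in pos]
print(res)  # ['P1', '', 'P2']
```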
### Take Three using `duckdb`
If we assign the following SQL query to `my_q`:

```sql
select
    count(*)
from (
    select od.*,
        coalesce(pd.promo_id, '') as promo
    from
        orders_df as od
    left join
        promo_df as pd on od.order_date between pd.start_date and pd.end_date
) t
where 1=1
    and promo = ''
```

then `duckdb.sql(my_q)` runs in 430 μs! This is, once again, a significant improvement.
### The Solution by Maven
Lastly, the Python solution provided in the video is:

```python
pd.merge_asof(
    orders_df.sort_values("order_date", ascending=True),
    promo_df.sort_values("start_date", ascending=True),
    left_on="order_date",
    right_on="start_date",
    direction="backward",
).query("order_date <= end_date")
```

This solution runs in 1.2 ms, comparable to the `cut` approach but slower than the `duckdb` approach. Moreover, `merge_asof` is not very widely known, which negatively impacts readability.
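On invented toy data, the `merge_asof` trick works like this: each order picks up the most recent promo whose start is at or before it, and the `query` then drops orders that fall past that promo's end date (the `P1`/`P2` promos and dates are made up):

```python
import pandas as pd

# Invented mini dataset -- not the challenge data
promo_df = pd.DataFrame(
    {
        "promo_id": ["P1", "P2"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "end_date": pd.to_datetime(["2024-01-10", "2024-02-05"]),
    }
)
orders_df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])}
)

matched = pd.merge_asof(
    orders_df.sort_values("order_date"),
    promo_df.sort_values("start_date"),
    left_on="order_date",
    right_on="start_date",
    direction="backward",
).query("order_date <= end_date")
print(matched["promo_id"].tolist())  # ['P1', 'P2']: the 2024-01-20 order matches nothing
```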
## Discussion
Which solution do you like best? Here is my summary:
| Solution | Timing | Pros | Cons |
|---|---|---|---|
| Naive | 1 s | Easy to read / Flexible | Very slow |
| `cut` | ~2 ms | Fast / Flexible | A little harder to read |
| `merge_asof` | ~1 ms | Even faster | A little harder to read and adjust |
| `duckdb` | 430 μs | Fastest / SQL based | Requires another dependency on top of pandas |