DEV Community

Dror Atariah


Spot the Sale - My take

Here is my Python and duckdb take on the Spot the Sale challenge by Maven Analytics. Let's start with Python!

Solutions

To start, we need to load the data:

import pandas as pd

promo_df = pd.read_csv("promotions.csv", parse_dates=["start_date", "end_date"])
orders_df = pd.read_csv("orders.csv", parse_dates=["order_date"])

First Take

My approach uses a User Defined Function:

def get_promo(dt: pd.Timestamp) -> str:
    # Keep only the promotions whose date range contains dt
    filtered_promo = promo_df[
        (promo_df["start_date"] <= dt) & (promo_df["end_date"] >= dt)
    ]
    if filtered_promo.empty:
        return ""
    return filtered_promo["promo_id"].values[0]

The function takes a timestamp and returns the ID of the promotion whose date range contains it, or an empty string if none does. We then apply the UDF with the apply method:

res = orders_df["order_date"].apply(get_promo)

This produces a series of promotion IDs, which I can check as follows:

len(res[res == ""])

and get the expected number: 1916. However, this approach is very slow: about 1 second! Let's try to improve this.
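To make the approach concrete, here is a minimal, self-contained sketch; the tiny promotion and order frames are invented for illustration and stand in for the challenge data:

```python
import pandas as pd

# Toy stand-ins for promotions.csv and orders.csv (invented for illustration)
promo_df = pd.DataFrame({
    "promo_id": ["P1", "P2"],
    "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "end_date": pd.to_datetime(["2024-01-10", "2024-02-10"]),
})
orders_df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
})

def get_promo(dt: pd.Timestamp) -> str:
    # Keep only the promotions whose date range contains dt
    filtered_promo = promo_df[
        (promo_df["start_date"] <= dt) & (promo_df["end_date"] >= dt)
    ]
    if filtered_promo.empty:
        return ""
    return filtered_promo["promo_id"].values[0]

res = orders_df["order_date"].apply(get_promo)
print(res.tolist())  # ['P1', '', 'P2'] -- the middle order falls in no promo window
```

The per-row Python call is exactly what makes this slow: each order triggers a full scan of the promotions frame.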

Take Two using cut

Using the start and end dates of each promotion, we construct an IntervalIndex which is fed into the pd.cut function. Note that pd.cut ignores the labels argument when bins is an IntervalIndex, so we instead rename the resulting categories to the promotion IDs:

res = pd.cut(
    orders_df["order_date"],
    bins=pd.IntervalIndex.from_arrays(
        promo_df["start_date"], promo_df["end_date"], closed="both"
    ),
).cat.rename_categories(promo_df["promo_id"].to_list())

Timing this shows a significant decrease to 1.8 ms.
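A minimal sketch of the interval-based approach, again on toy data invented for illustration (note that the intervals must not overlap, or pd.cut raises an error):

```python
import pandas as pd

# Same toy stand-ins as before (invented for illustration)
promo_df = pd.DataFrame({
    "promo_id": ["P1", "P2"],
    "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "end_date": pd.to_datetime(["2024-01-10", "2024-02-10"]),
})
orders_df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
})

# One interval per promotion; pd.cut ignores labels when bins is an
# IntervalIndex, so we rename the categories to the promo IDs afterwards.
bins = pd.IntervalIndex.from_arrays(
    promo_df["start_date"], promo_df["end_date"], closed="both"
)
res = pd.cut(orders_df["order_date"], bins=bins).cat.rename_categories(
    promo_df["promo_id"].to_list()
)
print(res.tolist())  # ['P1', nan, 'P2'] -- orders outside every interval get NaN
```

Unlike the UDF, unmatched orders come back as NaN rather than an empty string, so a final fillna("") would be needed to match the earlier output exactly.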

Take Three using duckdb

If we assign the following SQL query to my_q:

select count(*)
from (
    select
        od.*,
        coalesce(pd.promo_id, '') as promo
    from orders_df as od
    left join promo_df as pd
        on od.order_date between pd.start_date and pd.end_date
) t
where promo = ''

then duckdb.sql(my_q) runs in 430 μs! This is, once again, a significant improvement.

The Solution by Maven

Lastly, the Python solution provided in the video is:

pd.merge_asof(
    orders_df.sort_values("order_date", ascending=True),
    promo_df.sort_values("start_date", ascending=True),
    left_on="order_date",
    right_on="start_date",
    direction="backward",
).query("order_date <= end_date")

This solution runs in about 1.2 ms, comparable to the cut approach but slower than the duckdb one. Moreover, merge_asof is not widely known, which hurts readability. Note also that the query step keeps only orders that fall inside a promotion window, so unpromoted orders are dropped from the result.
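On the toy data from before, the merge_asof approach looks like this (a sketch, with the same invented frames):

```python
import pandas as pd

# Toy stand-ins for the challenge data (invented for illustration)
promo_df = pd.DataFrame({
    "promo_id": ["P1", "P2"],
    "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "end_date": pd.to_datetime(["2024-01-10", "2024-02-10"]),
})
orders_df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
})

matched = pd.merge_asof(
    orders_df.sort_values("order_date"),   # merge_asof requires sorted keys
    promo_df.sort_values("start_date"),
    left_on="order_date",
    right_on="start_date",
    direction="backward",  # nearest promotion starting at or before the order
).query("order_date <= end_date")  # drop orders past the promotion's end
print(matched["promo_id"].tolist())  # ['P1', 'P2'] -- the 2024-01-20 order is dropped
```

The backward match pairs each order with the latest promotion start not after it; the query then discards pairs where the order falls after that promotion ended.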

Discussion

Which solution do you like best? Here is my summary:

| Solution | Timing | Pros | Cons |
| --- | --- | --- | --- |
| Naive | ~1 s | Easy to read / Flexible | Very slow |
| cut | ~2 ms | Fast / Flexible | A little harder to read |
| merge_asof | ~1 ms | Even faster | A little harder to read and adjust |
| duckdb | ~430 μs | Fastest / SQL based | Requires another dependency on top of pandas |
