Here is my Python and duckdb take on the Spot the Sale challenge by Maven Analytics. Let's start with Python!
## Solutions
To start, we need to load the data:

```python
import pandas as pd

promo_df = pd.read_csv("promotions.csv", parse_dates=["start_date", "end_date"])
orders_df = pd.read_csv("orders.csv", parse_dates=["order_date"])
```
### First Take
My approach uses a User Defined Function:

```python
def get_promo(dt: pd.Timestamp) -> str:
    filtered_promo = promo_df.where(
        (promo_df["start_date"] <= dt) & (promo_df["end_date"] >= dt)
    ).dropna()
    if filtered_promo.empty:
        return ""
    return filtered_promo["promo_id"].values[0]
```

which takes a datetime and checks whether it falls within the date range of a promotion. Lastly, we use the UDF with the `apply` method:

```python
res = orders_df.order_date.apply(get_promo)
```
This produces a series of promotion IDs, which I can check as follows:

```python
len(res[res == ""])
```

and get the expected number: 1916. However, this approach is very slow: about 1 second! Let's try to improve this.
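To make the row-wise idea concrete, here is a self-contained toy version on made-up data (the `P1`/`P2` promos, dates, and counts below are invented, and plain boolean indexing stands in for the `where().dropna()` combination):

```python
import pandas as pd

# Invented mini dataset -- not the challenge data
promo_df = pd.DataFrame(
    {
        "promo_id": ["P1", "P2"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "end_date": pd.to_datetime(["2024-01-10", "2024-02-05"]),
    }
)
orders_df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])}
)

def get_promo(dt: pd.Timestamp) -> str:
    # keep the promos whose [start_date, end_date] range contains dt
    hit = promo_df[(promo_df["start_date"] <= dt) & (promo_df["end_date"] >= dt)]
    return "" if hit.empty else hit["promo_id"].iloc[0]

res = orders_df["order_date"].apply(get_promo)
print(res.tolist())         # ['P1', '', 'P2']
print(len(res[res == ""]))  # 1 order falls outside every promo window
```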
### Take Two using `cut`
Using the start and end dates of each promotion, we construct an `IntervalIndex`, which is fed into the `pd.cut` function:

```python
res = pd.cut(
    orders_df["order_date"],
    bins=pd.IntervalIndex.from_arrays(
        promo_df["start_date"], promo_df["end_date"], closed="both"
    ),
    labels=promo_df["promo_id"],
)
```
Timing this shows a significant decrease to 1.8 ms.
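The machinery this approach relies on can be sketched on invented toy data: a non-overlapping `IntervalIndex` maps each timestamp to the position of the interval containing it via `get_indexer`, with `-1` for no match (the `P1`/`P2` promos and dates are made up):

```python
import pandas as pd

# Invented mini dataset -- not the challenge data
promo_df = pd.DataFrame(
    {
        "promo_id": ["P1", "P2"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "end_date": pd.to_datetime(["2024-01-10", "2024-02-05"]),
    }
)
order_dates = pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])

idx = pd.IntervalIndex.from_arrays(
    promo_df["start_date"], promo_df["end_date"], closed="both"
)
pos = idx.get_indexer(order_dates)  # -1 where no interval contains the date
labels = promo_df["promo_id"].to_numpy()
res = ["" if p == -1 else labels[p] for p in pos]
print(res)  # ['P1', '', 'P2']
```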
### Take Three using `duckdb`
If we assign the following SQL query to `my_q`:

```sql
select
    count(*)
from (
    select od.*,
        coalesce(pd.promo_id, '') as promo
    from
        orders_df as od
    left join
        promo_df as pd on od.order_date between pd.start_date and pd.end_date
) t
where 1=1
    and promo = ''
```

then `duckdb.sql(my_q)` runs in 430 μs! This is, once again, a significant improvement.
### The Solution by Maven
Lastly, the Python solution provided in the video is:

```python
pd.merge_asof(
    orders_df.sort_values("order_date", ascending=True),
    promo_df.sort_values("start_date", ascending=True),
    left_on="order_date",
    right_on="start_date",
    direction="backward",
).query("order_date <= end_date")
```

This solution runs in 1.2 ms, comparable to the `cut` approach but slower than the `duckdb` approach. Moreover, `merge_asof` is not very widely known, which negatively impacts readability.
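On invented toy data, the `merge_asof` trick works like this: each order picks up the most recent promo whose start is at or before it, and the `query` then drops orders that fall past that promo's end date (the `P1`/`P2` promos and dates are made up):

```python
import pandas as pd

# Invented mini dataset -- not the challenge data
promo_df = pd.DataFrame(
    {
        "promo_id": ["P1", "P2"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "end_date": pd.to_datetime(["2024-01-10", "2024-02-05"]),
    }
)
orders_df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])}
)

matched = pd.merge_asof(
    orders_df.sort_values("order_date"),
    promo_df.sort_values("start_date"),
    left_on="order_date",
    right_on="start_date",
    direction="backward",
).query("order_date <= end_date")
print(matched["promo_id"].tolist())  # ['P1', 'P2']: the 2024-01-20 order matches nothing
```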
## Discussion
Which solution do you like best? Here is my summary:
| Solution | Timing | Pros | Cons |
|---|---|---|---|
| Naive | 1 s | Easy to read / Flexible | Very slow |
| `cut` | ~2 ms | Fast / Flexible | A little harder to read |
| `merge_asof` | ~1 ms | Even faster | A little harder to read and adjust |
| `duckdb` | 430 μs | Fastest / SQL based | Requires another dependency on top of pandas |