AJ Kerrigan

Posted on Dec 30, 2022 • Edited on Jan 15, 2023

Hanukkah of Data 2022 - Puzzle 7

#hanukkahofdata #datasette #python #pandas

Hanukkah of Data is a series of data-themed puzzles, where you solve puzzles to move your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the seventh puzzle.

Thinking
Doing (Datasette)
Doing (Pandas)
- Tricky Colors
- Getting Chainy

Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).

Thinking

It turns out that one day while Emily was robbing Noah's blind, another customer bought the same item in a different color:

“Well I turned around and sure enough this cute guy was holding something I had bought. He said ‘I got almost exactly the same thing!’ We laughed about it and wound up swapping items because he had wanted the color I got. We had a moment when our eyes met and my heart stopped for a second. I asked him to get some food with me and we spent the rest of the day together.

So we'll need to find orders on the same day as Emily, where the item is almost the same as what Emily bought...

Tackling this in pandas felt a bit daunting to me, so I figured I'd start with SQL. Looking through the data I saw that color usually showed up inside parentheses in a description. That seemed something worth pursuing.

Doing (Datasette)

Emily's Orders

Breaking the problem down into smaller pieces, I figured we'd want to have Emily's orders first. So I set up a common table expression for those:

with emily_orders as (
  select
    o.ordered,
    p.sku,
    p.desc
  from
    customers c
    join orders o on c.customerid = o.customerid
    join orders_items i on o.orderid = i.orderid
    join products p on i.sku = p.sku
  where
    c.name = 'Emily Randolph'
)

Orders on Emily Days

Then I pulled in non-Emily orders that were placed on the same day as one of Emily's:

select
  c.name,
  c.address,
  c.citystatezip,
  c.phone,
  o.ordered,
  em.ordered as em_ordered,
  p.sku,
  em.sku as em_sku,
  p.desc
from
  customers c
  join orders o on c.customerid = o.customerid
    and date(o.ordered) in (
        select
        distinct date(ordered)
        from
        emily_orders
    )

That was too many results, so I fine-tuned the order time to look for non-Emily orders placed within 5 minutes of an Emily order:

  join emily_orders em on c.name != 'Emily Randolph'
    and abs(
        strftime('%s', o.ordered) - strftime('%s', em.ordered)
    ) <= 300

Matching by SKU Prefix

That was still too many results, so I tried filtering on orders that matched the sku prefix:

  join emily_orders em on c.name != 'Emily Randolph'
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku
where
  substr(p.sku, 1, 3) = substr(em.sku, 1, 3)

Fine, I'll Filter on Description

I was trying to avoid comparing descriptions, but ran out of other ideas before finding a definitive match. So I tried matching just the portion of a description in front of any parentheses by adding this to the select:

  case
    when instr(p.desc, '(') = 0 then p.desc
    else substr(p.desc, 1, instr(p.desc, '(') - 1)
  end as stripped_desc,
  case
    when instr(em.desc, '(') = 0 then em.desc
    else substr(em.desc, 1, instr(em.desc, '(') - 1)
  end as em_stripped_desc

And matching it in the WHERE:

where
  substr(p.sku, 1, 3) = substr(em.sku, 1, 3)
  and stripped_desc = em_stripped_desc

This felt pretty awkward to me as any latent SQL Server or Postgres instincts were useless. But it did work!

I think there are better ways to do this with extensions and/or full-text search support. This felt like a really ugly way to work around a lack of fuzzy matching or string splitting operations.

The Whole is Uglier Than the Sum of Its Parts

That was a bumpy ride, and I ended up with this monstrosity:

with emily_orders as (
  select
    o.ordered,
    p.sku,
    p.desc
  from
    customers c
    join orders o on c.customerid = o.customerid
    join orders_items i on o.orderid = i.orderid
    join products p on i.sku = p.sku
  where
    c.name = 'Emily Randolph'
)
select
  c.name,
  c.phone,
  o.ordered,
  em.ordered as em_ordered,
  p.sku,
  em.sku as em_sku,
  p.desc,
  case
    when instr(p.desc, '(') = 0 then p.desc
    else substr(p.desc, 1, instr(p.desc, '(') - 1)
  end as stripped_desc,
  em.desc,
  case
    when instr(em.desc, '(') = 0 then em.desc
    else substr(em.desc, 1, instr(em.desc, '(') - 1)
  end as em_stripped_desc
from
  customers c
  join orders o on c.customerid = o.customerid
  and date(o.ordered) in (
    select
      distinct date(ordered)
    from
      emily_orders
  ) -- rough cut
  join emily_orders em on c.name != 'Emily Randolph'
  and abs(
    strftime('%s', o.ordered) - strftime('%s', em.ordered)
  ) <= 300 -- fine tune
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku
where
  substr(p.sku, 1, 3) = substr(em.sku, 1, 3)
  and stripped_desc = em_stripped_desc

But also an answer, so I'll take it.

Doing (Pandas)

Because the SQL approach felt so gross to me, I tried to rethink it rather than just "porting" it to pandas. The start was the same - finding a group of not-Emily orders on the same day as Emily orders:

df['ordered_date'] = df.ordered.dt.date
emily = df[df.name == "Emily Randolph"]
not_emily = df[df.name != "Emily Randolph"]
merged = not_emily.merge(emily, on="ordered_date")

But then I tried to come up with a better way to indicate "descriptions that differ just by color". This was tricky! I experimented a bit with difflib which is fun for looking at close string matches. But in the end I leaned on a little helper function to split a description into a set of lowercased words with the parentheses and other punctuation discarded:

import re

stemset = lambda x: {val.lower() for val in re.findall(r"\w+", x)}

Which I could use in a transform(), and then use apply() to find the symmetric difference of those sets between each Emily and not-Emily order:

merged["desc_diff"] = merged.transform({"desc_x": stemset, "desc_y": stemset}).apply(
    lambda x: x.desc_x ^ x.desc_y, axis=1
)

Because the color wasn't always present in an item description, I figured I could look for cases where the symmetric difference between description wordsets had either 1 or 2 words. Combining that with the "orders within 5 minutes of each other" condition, that left this:

merged[
    merged.apply(
        lambda x: abs((x.ordered_x - x.ordered_y).total_seconds()) <= 300
        and len(x.desc_diff) in (1, 2),
        axis=1,
    )
]

Tricky Colors

I already had the answer by this point, but there was one thing nagging me: how would I check for specifically a color difference rather than just a 1-2 word difference? This is where I had to shake an angry yet appreciative fist at the puzzle design. Because while I could find lists of color names in packages like webcolors, matplotlib, Crayola APIs, etc... the relevant colors in this puzzle were outside the core set of Crayola/HTML/CSS colors. The best alternative I found was to pull out a distinct set of colors like this:

colors = set(
  df.desc.transform(
    lambda x: (match := re.search(r'\(([a-z]+)\)', x) ) and match.groups() or []
  ).explode().dropna().unique()
)

...and then check to be sure my desc_diff column contained only words from that set. In this case that ended up not being necessary or useful, but hey I was curious.

Getting Chainy

Some pandas users like to chain their methods together rather than leaving a trail of intermediate variables. I don't have a consistent preference, but it feels like good practice to work both ways. So if I tried to chainify this pandas code, it would look something like this:

df[df.name != "Emily Randolph"].merge(
    df[df.name == "Emily Randolph"], on="ordered_date"
).assign(
    desc_diff=lambda x: x.transform({"desc_x": stemset, "desc_y": stemset}).apply(
        lambda x: x.desc_x ^ x.desc_y, axis=1
    )
).pipe(
    lambda x: x[
        x.apply(
            lambda x: abs((x.ordered_x - x.ordered_y).total_seconds()) <= 300
            and len(x.desc_diff) in (1, 2),
            axis=1,
        )
    ]
)

I hope that when I browse other people's solutions I find a cleaner chained version, because this looks gross to me 😅.

DEV Community