If you've used pytrends for the typical "trending this week" overview, you've probably hit the same wall I did: the default interest_over_time and trending_searches give you national-level signal, which is exactly what every other analyst already has. The interesting story is almost always at the regional level, and pytrends has a quietly underused method for that: interest_by_region(resolution="REGION").
I used it to answer a question that had been bugging me: when 30 of the most-talked-about fragrances of 2024-2026 are matched up across all 50 US states + DC, does every state pick the same #1, or do some go their own way?
Result: 43 states pick the same fragrance. 8 states pick something completely different, and the outliers cluster in ways that turned out to be defensible (more on this below).
The full analysis went up at perfumem.com; the raw CSVs and choropleth PNGs are on GitHub under CC BY 4.0. This post is the technical walkthrough, with the gotchas I had to work around.
The core pattern: pytrends batched at 5 keywords with a state-level resolution
pytrends accepts at most 5 keywords per request. To analyze 30 fragrances across all 50 states + DC, you batch the keywords and stitch the wide matrix together yourself.
```python
from pytrends.request import TrendReq
import time

FRAGRANCES = [
    "Dior Sauvage", "Bleu de Chanel", "Polo Blue", "Old Spice", "Glossier You",
    "Chanel No 5", "Coco Mademoiselle", "Marc Jacobs Daisy",
    "Ariana Grande Cloud", "Sol de Janeiro Cheirosa 62",
    # ... 20 more
]
BATCH_SIZE = 5
TIMEFRAME = "today 12-m"
GEO = "US"

pytrends = TrendReq(hl="en-US", tz=300, retries=3, backoff_factor=2)
rows = {}  # state -> {fragrance: score}

for i in range(0, len(FRAGRANCES), BATCH_SIZE):
    batch = FRAGRANCES[i:i + BATCH_SIZE]
    pytrends.build_payload(batch, timeframe=TIMEFRAME, geo=GEO)
    df = pytrends.interest_by_region(
        resolution="REGION",
        inc_low_vol=True,
        inc_geo_code=False,
    )
    for state, row in df.iterrows():
        rows.setdefault(state, {})
        for frag in batch:
            rows[state][frag] = int(row[frag])
    time.sleep(8)  # be polite; Google rate-limits aggressive callers
```
The output rows dict gives you a state x fragrance matrix you can pivot into anything.
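If you'd rather hold it as a DataFrame, here's a minimal sketch, assuming the `rows` dict built above:

```python
# Pivot the nested dict into a state x fragrance DataFrame.
import pandas as pd

matrix = pd.DataFrame(rows).T          # index = state, columns = fragrance
matrix = matrix.fillna(0).astype(int)  # a failed batch leaves NaN holes
print(matrix.idxmax(axis=1).head())    # per-state winner in one line
```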
Gotcha 1: Google's scores are relative, not absolute
This trips up almost everyone the first time. The 0-100 score is normalized within the keyword set you submitted in that batch. A score of 87 in batch A is not directly comparable to 87 in batch B unless you re-normalize.
The cleanest fix is to include a "calibration" keyword in every batch (something with stable, broad search volume) and renormalize relative to it. For my use case I skipped this because I wasn't comparing across batches: the per-state winner is just the highest score within each state's row, regardless of cross-batch absolute magnitude. If you need cross-batch comparability, factor in a calibration term.
```python
# example: rebase each batch to a calibration keyword
CALIBRATION = "perfume"
batch_with_cal = [CALIBRATION] + batch[:4]  # 4 keywords + 1 calibration
# after fetching, divide every score in the batch by its calibration
# row score, then multiply by 100
```
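A slightly fuller sketch of that rebase, assuming `df` is the `interest_by_region` result for a batch that included the calibration keyword (the `rebase` helper is an illustrative name, not part of pytrends):

```python
import pandas as pd

def rebase(df: pd.DataFrame, calibration: str = "perfume") -> pd.DataFrame:
    """Rescale a batch so the calibration keyword reads 100 in every state."""
    cal = df[calibration].mask(df[calibration] == 0)  # NaN out zero-volume rows
    keywords = [c for c in df.columns if c != calibration]
    return df[keywords].div(cal, axis=0) * 100
```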
Gotcha 2: inc_low_vol=True matters at the state level
Default is False, which means low-volume regions get filtered out and your matrix has holes. For a state-level analysis where some states (Wyoming, Alaska, Vermont) have low aggregate search activity, you want inc_low_vol=True or your sparse states disappear from the matrix entirely. Tradeoff: low-volume scores are noisier, so the bottom of the distribution is less trustworthy than the top.
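A quick sanity check, as a sketch: run the same query both ways and diff the indexes. This reuses the `pytrends` session from the main script and assumes filtered regions are dropped from the frame, as described above:

```python
# Compare the same query with and without low-volume regions.
pytrends.build_payload(["Dior Sauvage"], timeframe="today 12-m", geo="US")
full = pytrends.interest_by_region(resolution="REGION", inc_low_vol=True)
filtered = pytrends.interest_by_region(resolution="REGION", inc_low_vol=False)
print(sorted(set(full.index) - set(filtered.index)))  # states that would vanish
```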
Gotcha 3: pytrends is unmaintained-ish; expect occasional 429s
The project is community-maintained and Google occasionally changes the unofficial endpoint. Build retries with exponential backoff into your TrendReq constructor (retries=3, backoff_factor=2) and accept that some batches will need a re-run. I had 1 of my 6 batches fail on the first pass and had to retry an hour later.
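If you'd rather not babysit re-runs, something like this works; `fetch_batch` is a name I'm making up for illustration, and `ResponseError` is pytrends' catch-all exception for bad responses from Google (429s included):

```python
# A manual re-run wrapper on top of TrendReq's built-in backoff (sketch).
import time
from pytrends.exceptions import ResponseError

def fetch_batch(pytrends, batch, attempts=3, wait=120):
    for attempt in range(attempts):
        try:
            pytrends.build_payload(batch, timeframe="today 12-m", geo="US")
            return pytrends.interest_by_region(resolution="REGION", inc_low_vol=True)
        except ResponseError:
            if attempt == attempts - 1:
                raise
            time.sleep(wait * (attempt + 1))  # wait longer after each failure
```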
Computing the per-state winner
Once you have the matrix, ranking each state's row is a one-liner; writing the winners and runners-up to CSV takes only a few more lines:
```python
import csv

with open("winners.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["state", "winning_fragrance", "score", "second_place", "second_score"])
    for state in sorted(rows.keys()):
        ranked = sorted(rows[state].items(), key=lambda kv: -kv[1])
        winner, score = ranked[0]
        second, second_score = ranked[1] if len(ranked) > 1 else ("", 0)
        w.writerow([state, winner, score, second, second_score])
```
Visualizing it: choropleth in 30 lines with plotly
Plotly Express handles US state choropleths cleanly. The trick is mapping full state names to USPS 2-letter codes (which `locationmode="USA-states"` requires).
```python
import plotly.express as px
import pandas as pd

US_STATE_TO_CODE = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ",  # ... (full dict)
}

df = pd.read_csv("winners.csv")
df["code"] = df["state"].map(US_STATE_TO_CODE)
df = df.dropna(subset=["code"])

fig = px.choropleth(
    df,
    locations="code",
    locationmode="USA-states",
    color="winning_fragrance",
    scope="usa",
    title="Most-searched fragrance by US state, last 12 months",
    color_discrete_sequence=px.colors.qualitative.Set3,
)
fig.write_image("us-state-map.png", width=1600, height=900, scale=2)
```
You'll need kaleido installed for `write_image` to work: `pip install kaleido`.
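If an interactive artifact is enough, `write_html` is built into plotly and sidesteps the kaleido dependency:

```python
# Interactive HTML export, no kaleido required.
fig.write_html("us-state-map.html", include_plotlyjs="cdn")
```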
What the data actually showed
Old Spice ranks #1 in 43 states. Of the 8 outliers:
| State | Winner | Category |
|---|---|---|
| Alaska, South Dakota | Coco Mademoiselle (Chanel) | Designer luxury women's |
| Louisiana, Mississippi | Polo Blue (Ralph Lauren) | Designer men's |
| Montana | Marc Jacobs Daisy | Designer women's |
| New Mexico | Ariana Grande Cloud | Celebrity / viral |
| North Dakota, Vermont | Glossier You | Niche / clean beauty |
The clusters are interesting in ways I didn't expect going in. Louisiana + Mississippi sharing a #1 (Polo Blue) tracks with cultural Gulf Coast preference signals you can find in other consumer-goods data. North Dakota + Vermont sharing Glossier You was the strangest finding to me; both are low-population states with strong direct-to-consumer ecommerce penetration, and Glossier's brand voice plays well in both demographics, but I wouldn't have predicted them as a pair.
New Mexico is the only US state where Ariana Grande Cloud ranks #1, which is a "single state where a celebrity scent dominates" pattern that I'd love to see replicated for other celebrity fragrance launches.
Open data + reproducibility
Everything is on GitHub: ahmad-khan-97/us-fragrance-trends-2026. The repo includes:
- `data/raw_interest_by_state.csv`: the full 30 x 51 matrix
- `data/winning_fragrance_per_state.csv`: state, winner, score, runner-up, runner-up score
- `charts/`: the three matplotlib + plotly outputs
- `LICENSE`: CC BY 4.0, free to remix with attribution
The full written analysis with the cluster interpretation is at perfumem.com.
What I'd build next
If I were extending this:
- Time series per state for the top 5 outlier picks: did Glossier You always win Vermont, or is this a 2026 phenomenon? `interest_over_time` per state would answer that (see the sketch after this list).
- Calibrated cross-state magnitude: with a calibration keyword in every batch, you could rank "intensity of fragrance interest" per state, not just the within-state winner.
- Compare to actual purchase data: Google Trends measures search intent, not purchases. Anyone with a national fragrance retailer's POS data has a great cross-validation opportunity here.
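For the first item, a minimal sketch, using the Vermont / Glossier You pair from the table above:

```python
# Per-state time series: geo accepts state-level codes like "US-VT".
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=300)
pytrends.build_payload(["Glossier You"], timeframe="today 5-y", geo="US-VT")
ts = pytrends.interest_over_time()  # weekly 0-100 interest, Vermont only
print(ts.tail())
```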
If you build any of those, I'd love to see the result. The dataset is intentionally permissive (CC BY 4.0) so derivative analyses are encouraged.
Quick reference: the full minimal script
```python
# us_state_fragrance_trends.py
from pytrends.request import TrendReq
import csv, time

FRAGRANCES = [...]  # your 5-30 keywords
BATCH_SIZE = 5

pytrends = TrendReq(hl="en-US", tz=300, retries=3, backoff_factor=2)
rows = {}

for i in range(0, len(FRAGRANCES), BATCH_SIZE):
    batch = FRAGRANCES[i:i + BATCH_SIZE]
    pytrends.build_payload(batch, timeframe="today 12-m", geo="US")
    df = pytrends.interest_by_region(resolution="REGION", inc_low_vol=True)
    for state, row in df.iterrows():
        rows.setdefault(state, {})
        for frag in batch:
            rows[state][frag] = int(row[frag])
    time.sleep(8)

with open("winners.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["state", "winner", "score"])
    for state, frag_scores in rows.items():
        winner, score = max(frag_scores.items(), key=lambda kv: kv[1])
        w.writerow([state, winner, score])
```
That's it. ~30 lines for a state-level Google Trends analysis you can drop any keyword set into.
If you build something with this pattern, drop a link in the comments. Particularly interested in non-fragrance domains: fast food, pickup trucks, streaming shows. The state-level segmentation usually reveals at least one cluster that the national ranking hides.