A practical look at content monitoring at scale without Playwright, without scraping farms, and without setting a single VPS on fire.
The 2am Alert That Started This Post
A skincare brand using Beaconmon woke up to an alert. A top competitor had quietly dropped prices across their bestselling serums by 18% and was running a sitewide free-shipping promo. No press release. No announcement. Just a Tuesday.
Their team repriced and matched the shipping threshold before the competitor's paid ads started driving volume. They saw it because a background worker had fetched that competitor's storefront, extracted the right DOM nodes with cheerio, diffed the result against a 24-hour-old snapshot, and scored the change as high-significance.
That is the whole product. Everything below is how we make it reliable at scale.
TL;DR
- BullMQ workers fetch competitor HTML on a schedule
- cheerio extracts price, promo, and product-grid content using a ranked selector cascade
- Diffs are normalized, stored, and scored by rules first, AI second
- No Playwright, no headless Chrome, no scraping farm
- The lesson that mattered most: normalize before storage, or you will alert on whitespace forever
The Architecture in One Paragraph
Every tracked competitor is a Monitor record flagged is_competitor: true. A BullMQ scheduler enqueues a content-check job for each one on its configured interval (15 minutes on free, down to 5 on Growth). Workers pull jobs, fetch the page with undici, parse with cheerio, normalize the text, and compare against the last snapshot in Postgres. If anything changed, a second job scores the significance and fans out alerts.
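To make the scheduling step concrete, here is a minimal sketch of how monitors could be turned into repeatable job specs. The Monitor fields, the plan names, and the buildContentCheckJobs helper are assumptions for illustration, not the production schema. The opts shape mirrors the repeat-every option that BullMQ's Queue.add accepts, but nothing here talks to Redis.

```typescript
// Sketch only: field names and plan names are assumptions, not the real schema.
type Plan = 'free' | 'growth';

interface Monitor {
  id: string;
  url: string;
  plan: Plan;
  is_competitor: boolean;
}

// Intervals from the text: 15 minutes on free, 5 minutes on Growth.
const CHECK_INTERVAL_MS: Record<Plan, number> = {
  free: 15 * 60 * 1000,
  growth: 5 * 60 * 1000,
};

interface ContentCheckJob {
  name: 'content-check';
  data: { monitorId: string; url: string };
  opts: { jobId: string; repeat: { every: number } };
}

// Build one repeatable job spec per competitor monitor.
// Each spec could then be passed to something like queue.add(name, data, opts).
function buildContentCheckJobs(monitors: Monitor[]): ContentCheckJob[] {
  return monitors
    .filter((m) => m.is_competitor)
    .map((m) => ({
      name: 'content-check' as const,
      data: { monitorId: m.id, url: m.url },
      opts: {
        // A distinct jobId keeps each monitor's repeat schedule separate.
        jobId: `content-check-${m.id}`,
        repeat: { every: CHECK_INTERVAL_MS[m.plan] },
      },
    }));
}
```

Keeping the spec builder pure makes the interval logic easy to test without a running queue.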
No Playwright. No headless Chrome. HTML-only, on purpose. A single VPS will not survive 3,000 concurrent Chromium processes, and the signal we care about (price text, promo copy, product grid content) almost always exists in the initial HTML response.
Step 1: Map the Catalog Once, Then Leave It Alone
Shopify exposes a public JSON endpoint for any store's product catalog. We fetch it once at setup time to build an internal product map:
type ProductMap = Record<string, {
  title: string;
  variants: Variant[];
  priceRange: [number, number];
}>;
That map is what lets us say "3,000 products" instead of "3,000 URLs." We do not poll that endpoint on every check cycle. It is the seed, not the heartbeat.
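As a rough illustration of the seeding step, the sketch below builds a ProductMap from an already-fetched catalog response. The RawProduct and RawVariant shapes are an assumption based on the commonly seen /products.json payload (prices arrive as strings), and the Variant fields and the choice to key the map by product handle are hypothetical.

```typescript
// Assumed shape of the variants we keep; the original Variant type is not shown.
interface Variant {
  id: number;
  title: string;
  price: number;
  available: boolean;
}

type ProductMap = Record<string, {
  title: string;
  variants: Variant[];
  priceRange: [number, number];
}>;

// Assumed shape of one product in the catalog JSON (prices are strings).
interface RawVariant { id: number; title: string; price: string; available: boolean }
interface RawProduct { handle: string; title: string; variants: RawVariant[] }

function buildProductMap(products: RawProduct[]): ProductMap {
  const map: ProductMap = {};
  for (const product of products) {
    const variants: Variant[] = product.variants
      .map((v) => ({
        id: v.id,
        title: v.title,
        price: Number.parseFloat(v.price),
        available: v.available,
      }))
      // Drop variants whose price could not be parsed.
      .filter((v) => Number.isFinite(v.price));
    if (variants.length === 0) continue;

    const prices = variants.map((v) => v.price);
    map[product.handle] = {
      title: product.title,
      variants,
      priceRange: [Math.min(...prices), Math.max(...prices)],
    };
  }
  return map;
}
```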
The heartbeat is cheerio against the live HTML.
Step 2: The Selector Cascade
Shopify themes are not standardized, but they rhyme. Debut, Dawn, and most third-party themes share a small vocabulary of price-related class names. We try a ranked list of selectors and take the first match:
const PRICE_SELECTORS = [
  '.price__sale',
  '.price-item--sale',
  '.price__regular',
  '.price-item--regular',
  '.product-form__price',
  '.product__price',
  '[data-product-price]',
  '[data-price]',
  '[class*="price__"]',
  '[class*="ProductPrice"]',
  '.price',
];
If none match, we fall back to the page's main element. A fallback match is recorded but never marked as "detected," so a miss on the specific selectors does not produce a false-positive event.
That distinction matters. We separate "we found the thing we wanted" from "we tracked something and it changed." If you collapse those two states, your alert quality dies.
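One way to keep those two states apart is to return them explicitly from the cascade. The sketch below is an illustration under assumptions: runCascade, CascadeResult, and the Query callback are hypothetical names, and the DOM lookup is injected as a function so the cascade logic stays independent of the parser.

```typescript
// Looks up a selector and returns its text, or null when nothing matches.
// With cheerio this could be:
//   (sel) => { const el = $(sel).first(); return el.length ? el.text() : null; }
type Query = (selector: string) => string | null;

interface CascadeResult {
  text: string | null;
  matchedSelector: string | null;
  // true only when one of the ranked selectors matched, never for the fallback
  detected: boolean;
}

function runCascade(
  selectors: readonly string[],
  fallbackSelector: string,
  query: Query,
): CascadeResult {
  // Try the ranked selectors in order and take the first non-empty match.
  for (const selector of selectors) {
    const text = query(selector);
    if (text !== null && text.trim().length > 0) {
      return { text, matchedSelector: selector, detected: true };
    }
  }
  // Nothing specific matched: track the fallback, but do not claim detection.
  const fallbackText = query(fallbackSelector);
  return {
    text: fallbackText,
    matchedSelector: fallbackText !== null ? fallbackSelector : null,
    detected: false,
  };
}
```

Downstream code can then raise price events only when detected is true, while still storing fallback snapshots for later inspection.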
The same pattern applies to announcement bars, collection grids, and sale pages. Each preset is a ranked selector list plus a fallback:
{
  id: 'announcement-bar',
  selectors: [
    '.announcement-bar',
    '.shopify-section--announcement-bar',
    '[class*="AnnouncementBar"]',
    '[data-announcement-bar]',
  ],
  fallbackSelector: 'header',
}
Step 3: The Actual Content Check
The fetch and parse is about 30 lines. The important parts are normalization, and using undici over node-fetch for connection pooling at volume.
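import { request } from 'undici';
import * as cheerio from 'cheerio';

// url, timeoutMs, and selector are defined by the caller (not shown).
const { statusCode, body } = await request(url, {
  method: 'GET',
  headersTimeout: timeoutMs,
  bodyTimeout: timeoutMs,
  headers: {
    'user-agent': 'beaconmon/1.0 (+https://beaconmon.com)',
    accept: 'text/html,application/xhtml+xml',
  },
});

// Read the body first so the pooled connection is released,
// then reject non-2xx responses instead of diffing an error page.
const html = await body.text();
if (statusCode < 200 || statusCode >= 300) {
  throw new Error(`Unexpected status ${statusCode} for ${url}`);
}

const $ = cheerio.load(html);
const rawText = selector ? $(selector).text() : $('body').text();

// Normalize: collapse horizontal whitespace per line and drop blank lines.
const content = rawText
  .split('\n')
  .map((l) => l.replace(/[ \t]+/g, ' ').trim())
  .filter((l) => l.length > 0)
  .join('\n');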
The normalization step matters more than it looks. Shopify themes render whitespace inconsistently across CDN edges and A/B test variants. Without collapsing horizontal whitespace per line and stripping blanks, you get false-positive diffs every few hours on high-traffic stores. With it, the diff is quiet until something actually changes.
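To show the effect, here is a small self-contained sketch that applies the same normalization and compares the result against a stored snapshot. Comparing by SHA-256 hash and treating the first snapshot as a baseline are assumptions for illustration. The text only says snapshots live in Postgres, not how they are compared.

```typescript
import { createHash } from 'node:crypto';

// Same normalization as above: collapse spaces and tabs per line, drop blank lines.
function normalizeText(rawText: string): string {
  return rawText
    .split('\n')
    .map((l) => l.replace(/[ \t]+/g, ' ').trim())
    .filter((l) => l.length > 0)
    .join('\n');
}

function contentHash(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex');
}

interface CheckOutcome {
  content: string;
  hash: string;
  changed: boolean;
}

// previousHash is the hash of the last stored snapshot, or null on the first check.
function compareWithSnapshot(previousHash: string | null, rawText: string): CheckOutcome {
  const content = normalizeText(rawText);
  const hash = contentHash(content);
  // Assumption: the very first snapshot is a baseline, not a change.
  const changed = previousHash !== null && previousHash !== hash;
  return { content, hash, changed };
}
```

Two fetches that differ only in indentation or blank lines normalize to the same string, so they produce the same hash and no diff.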
Step 4: Score What Changed
Not all diffs are equal. A competitor updating their "About" copy is noise. A competitor dropping 18% off their hero SKU is a buying signal.
We score in two layers.
Rules First (fast, free, zero latency)
case 'price_changed': {
  const delta = Math.abs(next - old) / old;
  if (delta >= 0.05) return 'high';
  return 'medium';
}
case 'out_of_stock':
case 'back_in_stock':
case 'sale_started':
  return 'high';
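For readers who want to run the rules in isolation, here is a hedged, self-contained version of that switch. The ChangeEvent union, the content_changed event, the 'low' default, and the guard for a non-positive old price are all assumptions added for illustration. Only the 5% threshold and the always-high events come from the excerpt above.

```typescript
type Significance = 'low' | 'medium' | 'high';

// Hypothetical event shapes; the real event model is not shown in the post.
type ChangeEvent =
  | { type: 'price_changed'; old: number; next: number }
  | { type: 'out_of_stock' | 'back_in_stock' | 'sale_started' }
  | { type: 'content_changed' };

function scoreByRules(event: ChangeEvent): Significance {
  switch (event.type) {
    case 'price_changed': {
      // Assumption: without a usable old price, avoid dividing by zero.
      if (event.old <= 0) return 'medium';
      const delta = Math.abs(event.next - event.old) / event.old;
      return delta >= 0.05 ? 'high' : 'medium';
    }
    case 'out_of_stock':
    case 'back_in_stock':
    case 'sale_started':
      return 'high';
    default:
      // Assumption: anything the rules do not recognize is low significance.
      return 'low';
  }
}
```

With these rules, an 18% price drop such as 50.00 to 41.00 scores high, while a 2% move scores medium.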
AI Second (Growth and Scale plans only, fails open)
The diff text goes to Claude with a short system prompt describing the scoring rubric. The model returns low, medium, or high, and the call has a 4-second hard timeout. If the model is unreachable, the call times out, or the plan does not include AI, the rule-based score is used. The system never blocks on the AI layer.
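The fail-open behavior can be sketched without any model SDK by injecting the AI call as a function. Everything named here (scoreWithAiFallback, AiScorer, passing null when the plan has no AI) is a hypothetical illustration of the pattern, not the actual integration.

```typescript
type Significance = 'low' | 'medium' | 'high';

// Hypothetical wrapper around the model call; returns the model's raw label.
type AiScorer = (diffText: string) => Promise<string>;

function isSignificance(value: string): value is Significance {
  return value === 'low' || value === 'medium' || value === 'high';
}

async function scoreWithAiFallback(
  diffText: string,
  ruleScore: Significance,
  aiScorer: AiScorer | null, // null when the plan does not include AI
  timeoutMs = 4000,
): Promise<Significance> {
  if (aiScorer === null) return ruleScore;

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('AI scoring timed out')), timeoutMs);
  });

  try {
    // Whichever settles first wins: the model's answer or the hard timeout.
    const raw = await Promise.race([aiScorer(diffText), timeout]);
    const label = raw.trim().toLowerCase();
    // Anything outside the rubric falls back to the rule score.
    return isSignificance(label) ? label : ruleScore;
  } catch {
    // Unreachable model or timeout: fail open to the rule-based score.
    return ruleScore;
  } finally {
    clearTimeout(timer);
  }
}
```

Because the rule score is computed first and passed in, every failure path still yields a usable result.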
What the Data Actually Enables
This is the part that matters to the stores using it.
Repricing with context, not fear. When you see a competitor's price drop tagged high at 2am, you have a diff, a timestamp, and the old price. You are not guessing. You are deciding.
Promo calendar reconstruction. Six weeks of announcement-bar snapshots tell you when a competitor runs their sales, how long they run, and what copy they use. That is a content calendar you did not have to build yourself.
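As a rough sketch of that reconstruction, consecutive snapshots with identical announcement text can be collapsed into runs with a first-seen and last-seen time. The Snapshot and PromoRun shapes and the reconstructPromoRuns helper are hypothetical names for illustration.

```typescript
interface Snapshot {
  takenAt: Date;
  text: string; // normalized announcement-bar text
}

interface PromoRun {
  text: string;
  firstSeen: Date;
  lastSeen: Date;
}

// Collapse consecutive snapshots with the same text into a single run.
function reconstructPromoRuns(snapshots: Snapshot[]): PromoRun[] {
  const sorted = [...snapshots].sort(
    (a, b) => a.takenAt.getTime() - b.takenAt.getTime(),
  );
  const runs: PromoRun[] = [];
  for (const snap of sorted) {
    const last = runs[runs.length - 1];
    if (last !== undefined && last.text === snap.text) {
      // Same copy as the previous snapshot: extend the current run.
      last.lastSeen = snap.takenAt;
    } else {
      runs.push({ text: snap.text, firstSeen: snap.takenAt, lastSeen: snap.takenAt });
    }
  }
  return runs;
}
```

Each run gives the copy and its observed window, accurate to within one check interval on either side.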
New-arrival velocity. A collection-grid monitor tells you how fast a competitor is dropping product. Twice a week means real buying power or a production operation worth watching. Once a month means they are coasting.
Stock signal for sourcing. Out-of-stock events on a competitor's bestsellers can indicate a supply chain gap. That is a window to step up your own inventory or run a targeted ad against their unmet demand.
None of that requires crawling at scale or violating terms of service. It is the same HTML a browser would render, fetched politely with a declared user-agent, on a reasonable interval.
Three Lessons I'd Tell My Past Self
1. The selector cascade is the hard part, not the queue
Shopify's theme ecosystem is large enough that there is no single correct selector. A ranked list with a named fallback, and a clear distinction between "matched" and "fell back," is the right model. Get this wrong and alert quality collapses.
2. Normalize before storage. Always.
Storing raw HTML and diffing later sounds appealing until you realize CDN-injected attributes, nonce values in inline scripts, and whitespace drift will fire alerts on every check for a percentage of your monitors. Store the normalized text.
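Once snapshots are stored as normalized text, even a very simple diff becomes useful. The sketch below is a deliberately naive, set-based line diff (it ignores ordering and duplicate lines), shown as an assumption-laden illustration rather than the diff algorithm actually used. The diffLines and LineDiff names are hypothetical.

```typescript
interface LineDiff {
  added: string[];
  removed: string[];
}

// Compare two normalized snapshots line by line.
// Set-based: reports lines present on one side only, ignoring order.
function diffLines(previous: string, next: string): LineDiff {
  const prevLines = previous.split('\n').filter((l) => l.length > 0);
  const nextLines = next.split('\n').filter((l) => l.length > 0);
  const prevSet = new Set(prevLines);
  const nextSet = new Set(nextLines);
  return {
    added: nextLines.filter((l) => !prevSet.has(l)),
    removed: prevLines.filter((l) => !nextSet.has(l)),
  };
}
```

On normalized input the output contains only lines whose text actually changed, which is what makes the downstream scoring meaningful.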
3. Fail open everywhere
The AI layer falls back. The selector cascade falls back. The check itself records the error type and moves on rather than retrying forever. A monitoring product that generates noise when its own dependencies hiccup is worse than useless.
Wrap
If you are building something similar, or you run a Shopify or WooCommerce store and want to see this in action, Beaconmon is in early access. I am happy to talk architecture, selector strategies, or where the AI layer earns its keep versus where rules are enough.
Find me at @haimanot_getu.