I stopped fighting Amazon's anti-bot wall — and built a scoring engine instead

#webscraping #javascript #ecommerce #datascience

I spent way too long trying to scrape Amazon product pages in real time. Proxies got CAPTCHA'd. Search pages returned 503s. eBay blocked me too. The usual arms race.

Then I realized I was solving the wrong problem.

The insight: scraping and analysis are different products

Scraping is fragile and commoditized: proxies get burned, layouts change, and anti-bot vendors (DataDome, PerimeterX, Cloudflare) keep escalating. Everyone fights the same war.

The analysis — "which of these 100k products is actually worth sourcing?" — is durable. The scoring logic doesn't care whether the data came from a scraper, an API, a CSV export, or a partner feed.

So I split them. The scoring engine is the asset. The data source is a swappable adapter.

The engine

ecom-intel-engine (MIT) takes normalized product records and returns a 0–100 opportunity score across four dimensions:

demand — log-scaled review count (proven sales)
margin — price headroom (sweet spot ~$20–60)
quality — rating fit (4.0–4.8 ideal; suspiciously perfect penalized)
competition — reviews vs. category median (below median = an opening)

const { analyzeDataset } = require('ecom-intel-engine');

const products = [
  { asin: 'B01', title: 'Vitamin C Serum', price: 24.99, rating: 4.6, reviews: 8200, category: 'Skincare' },
  { asin: 'B02', title: 'Jade Roller', price: 9.99, rating: 4.3, reviews: 120, category: 'BeautyTools' },
];

const { summary } = analyzeDataset(products);
console.log(summary.topOpportunities);
// [{ id: 'B01', title: 'Vitamin C Serum', score: 82, category: 'Skincare' }, ...]

No scraping inside the package. Zero runtime dependencies for the core scoring path.
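
To make the four dimensions concrete, here is a rough sketch of how such a score could be assembled. The formulas, the equal weights, the 10,000-review saturation point, and every function name below are assumptions for illustration; they are not taken from ecom-intel-engine's source, which may compute things quite differently.

```javascript
// Hypothetical sketch: formulas and weights are illustrative assumptions,
// not the actual internals of ecom-intel-engine.

const clamp01 = (x) => Math.min(1, Math.max(0, x));

// Demand: log-scaled review count, saturating near an assumed 10,000 reviews.
function demandScore(reviews) {
  return clamp01(Math.log10(1 + reviews) / 4);
}

// Margin: full marks inside the $20-60 band, linear falloff outside it.
function marginScore(price) {
  if (price >= 20 && price <= 60) return 1;
  if (price < 20) return clamp01(price / 20);
  return clamp01(1 - (price - 60) / 60); // reaches 0 at $120
}

// Quality: 4.0-4.8 is ideal; anything higher gets a "too perfect" penalty.
function qualityScore(rating) {
  if (rating >= 4.0 && rating <= 4.8) return 1;
  if (rating > 4.8) return 0.7;
  return clamp01(rating / 4.0);
}

// Competition: fewer reviews than the category median means more room.
function competitionScore(reviews, categoryMedian) {
  if (categoryMedian <= 0) return 0.5; // no reference point, stay neutral
  return clamp01(1 - reviews / (2 * categoryMedian));
}

// Median of a list of numbers (0 for an empty list).
function median(values) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Equal weighting is an assumption; the result is rounded to 0-100.
function opportunityScore(product, categoryMedian) {
  const parts = [
    demandScore(product.reviews),
    marginScore(product.price),
    qualityScore(product.rating),
    competitionScore(product.reviews, categoryMedian),
  ];
  const mean = parts.reduce((sum, p) => sum + p, 0) / parts.length;
  return Math.round(mean * 100);
}
```

In this sketch you would compute `categoryMedian` once per category with `median(...)` over that category's review counts, then pass it to `opportunityScore` for each product. Note that demand and competition pull in opposite directions on review count, so the weighting between them is the main tuning knob.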

Trend detection needs history, not a single scrape

A one-shot scrape gives you a snapshot. To know what's trending, you need a time series. So the package ships a tiny snapshot store that accumulates review/price history across runs and de-dupes by day:

const { store, analyzeDataset } = require('ecom-intel-engine');

let s = store.loadStore('store.json');
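// freshBatch: today's array of normalized product records (same shape as the products array above)
store.ingest(s, freshBatch, { source: 'my-source' });   // run daily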
store.saveStore('store.json', s);

const { products } = analyzeDataset(store.toDataset(s));
// products now carry trendingScore + reviewVelocityPerDay
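
For intuition, day-level de-duplication and a review velocity can be sketched in a few lines. This is a hypothetical in-memory version: the history shape, `dayKey`, `ingestSnapshot`, and this standalone `reviewVelocityPerDay` function are assumptions for illustration, not the package's actual store format or API.

```javascript
// Hypothetical in-memory store; the real package's format may differ.
// Shape: { [asin]: { 'YYYY-MM-DD': { price, reviews } } }

// Key snapshots by UTC calendar day.
function dayKey(date) {
  return date.toISOString().slice(0, 10);
}

// Record one batch. A second run on the same day overwrites the first,
// so each product keeps at most one snapshot per day.
function ingestSnapshot(history, batch, date = new Date()) {
  const day = dayKey(date);
  for (const p of batch) {
    if (!history[p.asin]) history[p.asin] = {};
    history[p.asin][day] = { price: p.price, reviews: p.reviews };
  }
  return history;
}

// Reviews gained per day between the first and last stored snapshots.
function reviewVelocityPerDay(history, asin) {
  const days = Object.keys(history[asin] || {}).sort();
  if (days.length < 2) return null; // a single point is not a trend
  const first = days[0];
  const last = days[days.length - 1];
  const elapsedDays = (Date.parse(last) - Date.parse(first)) / 86400000;
  return (history[asin][last].reviews - history[asin][first].reviews) / elapsedDays;
}
```

Keying by calendar day means a scraper that retries three times in one afternoon still contributes a single data point, which keeps the velocity from being distorted by run frequency.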

A product only gets flagged isTrending if it had a real baseline (≥100 reviews) and grew ≥50% — so you don't get fooled by "5 → 12 reviews = +140%!" noise.
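
The rule above can be written as a small predicate. The two thresholds come from the text; the function signature and the choice of which two snapshots count as "baseline" and "current" are assumptions, since the comparison window is not specified here.

```javascript
// Thresholds from the article; the function itself is an illustrative sketch.
const MIN_BASELINE_REVIEWS = 100;
const MIN_GROWTH = 0.5; // +50%

function isTrending(baselineReviews, currentReviews) {
  // Tiny baselines produce huge percentages from a handful of reviews.
  if (baselineReviews < MIN_BASELINE_REVIEWS) return false;
  const growth = (currentReviews - baselineReviews) / baselineReviews;
  return growth >= MIN_GROWTH;
}

console.log(isTrending(5, 12));    // false: +140%, but the baseline is too small
console.log(isTrending(400, 500)); // false: real baseline, only +25%
console.log(isTrending(400, 640)); // true: real baseline, +60%
```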

Where the data comes from (when Amazon won't cooperate)

Since real-time scraping kept getting walled, I pointed the adapter at a public, CC-licensed Amazon products dataset (117k rows on Hugging Face) instead. Same engine, stable input, no proxy war. When you do have a working scraper or a paid API (Keepa, Rainforest, PA-API), you write a roughly 20-line adapter and the engine doesn't change.
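
An adapter in this sense is just a function from a source's raw rows to the normalized record shape used in the example earlier (`asin`, `title`, `price`, `rating`, `reviews`, `category`). The sketch below assumes a made-up raw row with fields like `product_id` and `price_text`; those names, and the helpers `toNumber`, `adaptRow`, and `adaptBatch`, are hypothetical and will differ for any real dataset or API.

```javascript
// Hypothetical raw row shape; real feeds will use different field names.

// Pull the first number out of values like '$24.99', '8,200', or 4.6.
// Assumes '.' as the decimal separator and ',' as a thousands separator.
function toNumber(value) {
  if (typeof value === 'number') return Number.isFinite(value) ? value : null;
  if (typeof value !== 'string') return null;
  const match = value.replace(/,/g, '').match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Map one raw row to the normalized record, or null if it is unusable.
function adaptRow(row) {
  const record = {
    asin: String(row.product_id || '').trim(),
    title: String(row.name || '').trim(),
    price: toNumber(row.price_text),
    rating: toNumber(row.stars),
    reviews: toNumber(row.review_count),
    category: String(row.category_name || 'Unknown').trim(),
  };
  const missingNumber =
    record.price === null || record.rating === null || record.reviews === null;
  return !record.asin || missingNumber ? null : record;
}

// Normalize a whole batch, silently dropping rows that fail to parse.
function adaptBatch(rows) {
  return rows.map(adaptRow).filter((r) => r !== null);
}
```

The output of `adaptBatch` is what you would hand to the engine. Swapping data sources then means rewriting only `adaptRow`, which is the point of keeping the scoring layer decoupled.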

Takeaway

If your scraper keeps getting blocked, ask whether scraping is even the valuable part. Often the durable, sellable thing is the analysis layer — and that layer should never be coupled to where the bytes came from.

Engine: ecom-intel-engine on npm · MIT licensed.