
KazKN


I Built the Wrong Scraper First. The Data Looked Clean and Was Still Useless.

I built the wrong scraper first.

It worked.

That was the problem.

The scraper extracted product titles, prices, URLs, images, and brands. The dataset was clean. The rows looked normal. Nothing crashed.

But the output was still not useful enough.

The mistake was simple: I treated a resale marketplace like a product catalog.

It is not.

A marketplace has state.

The clean dataset trap

This looked fine:

{
  "title": "Leather bag",
  "price": 1200,
  "brand": "Example",
  "url": "https://example.com/item/123"
}

But the real questions were missing:

  • Is it live or sold?
  • Did it disappear since the last run?
  • Which country was searched?
  • Where is the seller?
  • What condition is the item in?
  • Has the price changed?
  • Is this listing unusually similar to another listing?

None of those answers show up in a basic product-card scrape.

What I changed

I stopped designing the output around the page.

I started designing it around decisions.

The row became more explicit:

{
  "recordType": "listing",
  "displayStatus": "Available",
  "isSold": false,
  "country": "FR",
  "sellerCountry": "IT",
  "condition": "Very good condition",
  "price": 1200,
  "priceHistory": [],
  "requiresManualReview": false
}

Less pretty.

Much more useful.
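A minimal sketch of that row as a typed record, assuming Python. The field names mirror the JSON above. The `record_price` helper and the `until` key are hypothetical, added only to show one way `priceHistory` could be filled.

```python
from dataclasses import dataclass, field


@dataclass
class ListingRecord:
    """One marketplace listing as observed in a single scraper run."""
    recordType: str
    displayStatus: str
    isSold: bool
    country: str        # country of the marketplace page that was searched
    sellerCountry: str  # where the seller is located
    condition: str
    price: float
    priceHistory: list = field(default_factory=list)
    requiresManualReview: bool = False


def record_price(record: ListingRecord, new_price: float, observed_at: str) -> ListingRecord:
    """Hypothetical helper: keep the old price in priceHistory when the price changes."""
    if new_price != record.price:
        # "until" marks when the old price was last replaced (assumed convention).
        record.priceHistory.append({"price": record.price, "until": observed_at})
        record.price = new_price
    return record


row = ListingRecord(
    recordType="listing",
    displayStatus="Available",
    isSold=False,
    country="FR",
    sellerCountry="IT",
    condition="Very good condition",
    price=1200,
)
record_price(row, 1100, "2024-01-02")  # the date is a made-up example value
```

After that call, `row.price` is the new price and the old one sits in `row.priceHistory` instead of being overwritten.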

The lesson

Selectors are not the product.

The state model is the product.

If the scraper cannot preserve state, it will make the dataset feel more certain than it really is.

That is dangerous because clean JSON is easy to trust.
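One way to preserve state is to compare each run against the previous one. This is a rough sketch, not the scraper's actual code: it assumes each run is a dict keyed by listing URL, and `diff_runs` is a hypothetical name.

```python
def diff_runs(previous: dict, current: dict) -> dict:
    """Compare two runs keyed by listing URL (assumed to be a stable identifier)."""
    prev_urls = set(previous)
    curr_urls = set(current)
    return {
        # Listings seen for the first time in the current run.
        "new": sorted(curr_urls - prev_urls),
        # Listings that were in the last run and are gone now.
        "disappeared": sorted(prev_urls - curr_urls),
        # Listings present in both runs whose price differs.
        "price_changed": sorted(
            url for url in prev_urls & curr_urls
            if previous[url]["price"] != current[url]["price"]
        ),
    }


# Made-up example data for two consecutive runs.
last_run = {
    "https://example.com/item/123": {"price": 1200},
    "https://example.com/item/456": {"price": 300},
}
this_run = {
    "https://example.com/item/123": {"price": 1100},
    "https://example.com/item/789": {"price": 50},
}
changes = diff_runs(last_run, this_run)
```

A listing in `disappeared` is not proof of a sale. It only says the item was not seen this time, which is exactly the kind of uncertainty the row should keep visible.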

The checklist I use now

Before I trust a marketplace scraper, I check whether it handles:

  1. live items;
  2. sold items;
  3. disappearing items;
  4. seller country;
  5. page country;
  6. item condition;
  7. price history;
  8. duplicate or similar listings;
  9. manual-review signals.

If it does not, I treat the dataset as a starting point, not a decision layer.
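Part of that checklist can be checked mechanically. The sketch below only tests whether a row carries the state fields from the explicit row shown earlier. The function name is hypothetical, and disappearing items and similar listings need comparisons across rows and runs, so they are out of scope here.

```python
# State fields taken from the explicit row shown earlier in the post.
REQUIRED_STATE_FIELDS = (
    "displayStatus",
    "isSold",
    "country",
    "sellerCountry",
    "condition",
    "priceHistory",
    "requiresManualReview",
)


def missing_state_fields(row: dict) -> list:
    """Return the state fields a row does not carry (hypothetical helper)."""
    return [name for name in REQUIRED_STATE_FIELDS if name not in row]


# The product-card style row from the start of the post.
catalog_row = {
    "title": "Leather bag",
    "price": 1200,
    "brand": "Example",
    "url": "https://example.com/item/123",
}
gaps = missing_state_fields(catalog_row)
```

For the clean product-card row, `gaps` contains every state field. That is the signal to treat the dataset as a starting point.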

The uncomfortable part

The broken version was easier to explain.

"It scrapes product data" is simple.

"It models marketplace state" takes longer.

But the second one is the version I would actually use.

Have you ever shipped a scraper that was technically correct but operationally useless?
