lulzasaur
I Built 20 Marketplace Data APIs in a Weekend -- Here's the Stack


I got tired of writing the same scraping logic every time I needed marketplace data for a side project. So I sat down and built a unified REST API layer across 20 different marketplaces -- TCGPlayer, Redfin, Poshmark, Reverb, Craigslist, and more -- and shipped them all in a weekend.

Here's how the stack works, what I learned about parsing at scale, and the full list of APIs if you want to use them.

The Architecture

Client Request
    |
    v
[RapidAPI Gateway] -- auth, rate limiting, billing
    |
    v
[Express Server on Railway] -- routing, validation, caching
    |
    v
[Apify Actor Pool] -- headless scrapers with Cheerio
    |
    v
[Target Marketplace] -- actual HTML parsing

The core idea: each marketplace gets its own Apify Actor (a serverless scraper), and the Express server acts as a thin orchestration layer that runs the right actor, waits for results, and returns clean JSON.

The Backend: 120 Lines That Run Everything

The entire backend is a single server.js file. Each route follows the same pattern:

const express = require("express");
const { ApifyClient } = require("apify-client");

const app = express();
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Every actor follows the same execution pattern
async function runActor(actorId, input) {
  const run = await client.actor(actorId).call(input, {
    memory: 256,
    timeout: 90,
  });

  // Poll until completion
  let runInfo = await client.run(run.id).get();
  const POLL_INTERVAL = 2000;
  const MAX_WAIT = 90000;
  let elapsed = 0;

  while (
    !["SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"].includes(runInfo.status) &&
    elapsed < MAX_WAIT
  ) {
    await new Promise((r) => setTimeout(r, POLL_INTERVAL));
    elapsed += POLL_INTERVAL;
    runInfo = await client.run(run.id).get();
  }

  if (runInfo.status !== "SUCCEEDED") {
    throw new Error(`Actor run ${runInfo.status}: ${run.id}`);
  }

  const { items } = await client
    .dataset(runInfo.defaultDatasetId)
    .listItems();
  return items;
}

// Then each route is just a thin wrapper:
app.get("/tcgplayer/search", async (req, res) => {
  const items = await runActor("O0sgde2LgujyWHvNt", {
    searchQueries: [req.query.query],
    maxListings: parseInt(req.query.limit) || 20,
  });
  res.json({ success: true, count: items.length, results: items });
});

app.listen(process.env.PORT || 3000);

That is genuinely the whole pattern. The complexity lives in the individual scrapers, not the API layer.

How the Scrapers Work: Cheerio Over Puppeteer

Most marketplace scrapers use Puppeteer or Playwright. I went with Cheerio (server-side jQuery) for 18 out of 20 scrapers. The reason: most modern sites expose structured data in their HTML even without JavaScript execution.

Here is a simplified version of the TCGPlayer scraper:

const cheerio = require("cheerio");

async function scrapeTCGPlayer(query) {
  const url = `https://www.tcgplayer.com/search/all/product?q=${encodeURIComponent(query)}`;
  const response = await fetch(url, {
    headers: { "User-Agent": "Mozilla/5.0 ..." }
  });
  const html = await response.text();
  const $ = cheerio.load(html);

  const results = [];
  $(".search-result").each((_, el) => {
    const name = $(el).find(".search-result__title").text().trim();
    const price = $(el).find(".search-result__market-price--value").text().trim();
    const set = $(el).find(".search-result__subtitle").text().trim();
    const image = $(el).find("img").attr("src");
    const link = $(el).find("a").attr("href");

    results.push({
      name,
      marketPrice: parseFloat(price.replace("$", "")) || null,
      set,
      imageUrl: image,
      url: link ? `https://www.tcgplayer.com${link}` : null,
    });
  });

  return results;
}

Three patterns kept coming up across all 20 sites:

1. __NEXT_DATA__ extraction -- Next.js sites embed their full page data in a <script id="__NEXT_DATA__"> tag. You get structured JSON without any DOM parsing. Thumbtack, IMDb, and several others use this.

const nextDataScript = $("script#__NEXT_DATA__").html();
if (nextDataScript) {
  const pageData = JSON.parse(nextDataScript);
  // All the data you need is in pageData.props.pageProps
}

2. JSON-LD microdata -- E-commerce sites embed <script type="application/ld+json"> for SEO. ThriftBooks, AbeBooks, and Houzz all have rich structured data this way.

$('script[type="application/ld+json"]').each((_, el) => {
  const ld = JSON.parse($(el).html());
  if (ld["@type"] === "Product") {
    // ld.name, ld.offers.price, ld.isbn, etc.
  }
});

3. Algolia-backed search -- Some sites (Grailed, Bonanza) use Algolia for search. If you inspect their network requests, you can find the Algolia app ID and API key embedded in the page source, then query Algolia directly for structured data. No HTML parsing needed at all.
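
Here is a sketch of what that direct Algolia query looks like. The app ID, API key, and index name below are placeholders standing in for the values you would lift from the target site's page source or network tab, not real credentials:

```javascript
// Placeholders -- substitute the values found in the site's page source.
const ALGOLIA_APP_ID = "YOUR_APP_ID";
const ALGOLIA_API_KEY = "YOUR_SEARCH_ONLY_KEY"; // search-only keys are exposed client-side
const ALGOLIA_INDEX = "listings";               // hypothetical index name

// Build the POST request Algolia's REST search endpoint expects.
function buildAlgoliaRequest(query, hitsPerPage = 20) {
  return {
    url: `https://${ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/${ALGOLIA_INDEX}/query`,
    options: {
      method: "POST",
      headers: {
        "X-Algolia-Application-Id": ALGOLIA_APP_ID,
        "X-Algolia-API-Key": ALGOLIA_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query, hitsPerPage }),
    },
  };
}

// Usage (with real credentials):
// const { url, options } = buildAlgoliaRequest("raw denim");
// const { hits } = await (await fetch(url, options)).json();
```

Because the response is already structured JSON, these two scrapers skip Cheerio entirely.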

Caching Strategy

For most marketplace APIs, the data changes every few minutes, so aggressive caching would serve stale results. But for reference data -- like NYC building violations or FCC certifications -- the data changes rarely.

I use a two-tier approach:

  • Apify-backed routes: No caching. Each request runs a fresh scraper. The 90-second timeout and 256MB memory allocation keep costs low while ensuring fresh data.
  • Direct data routes (e.g. NYC violations, backed by the NYC Open Data / Socrata API): In-memory cache with a 5-minute TTL.

const cache = new Map();
const CACHE_TTL = 5 * 60 * 1000;

async function fetchSocrata(datasetId, queryParams) {
  const cacheKey = `${datasetId}?${new URLSearchParams(queryParams)}`;
  const cached = cache.get(cacheKey);
  if (cached && Date.now() - cached.ts < CACHE_TTL) return cached.data;

  const response = await fetch(
    `https://data.cityofnewyork.us/resource/${datasetId}.json?${new URLSearchParams(queryParams)}`
  );
  const data = await response.json();
  cache.set(cacheKey, { data, ts: Date.now() });
  return data;
}

Error Handling

Every route wraps its actor call in try/catch and returns a consistent error shape:

{ "success": false, "error": "Actor run TIMED-OUT: abc123" }

The Apify actor runner itself has a belt-and-suspenders timeout: Apify enforces a 90-second timeout on the actor, and the polling loop has its own 90-second max wait. If the actor hangs, the API returns a clear error instead of hanging the HTTP connection.
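
The per-route try/catch can be factored into a small wrapper so every route returns that same error shape. A minimal sketch (the safeHandler name is mine, not from the actual repo):

```javascript
// Wraps an async route handler so any thrown error -- including an actor
// timeout from runActor -- becomes the consistent { success, error } shape.
function safeHandler(fn) {
  return async (req, res) => {
    try {
      await fn(req, res);
    } catch (err) {
      res.status(502).json({ success: false, error: err.message });
    }
  };
}

// Usage:
// app.get("/tcgplayer/search", safeHandler(async (req, res) => { ... }));
```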

Try It

Here is a real curl you can run right now:

curl "https://rapidapi-backend-production.up.railway.app/tcgplayer/search?query=charizard"

That will return JSON with TCGPlayer prices, market data, and card images for Charizard cards.

The Full API List

| #  | API                | Endpoint                    | Use Case                                      |
|----|--------------------|-----------------------------|-----------------------------------------------|
| 1  | TCGPlayer          | /tcgplayer/search           | Trading card prices (Pokemon, MTG, Yu-Gi-Oh)  |
| 2  | Reverb             | /reverb/search              | Used music gear listings                      |
| 3  | Thumbtack          | /thumbtack/search           | Local service professionals                   |
| 4  | Poshmark           | /poshmark/search            | Fashion resale marketplace                    |
| 5  | Craigslist         | /craigslist/search          | Local classifieds                             |
| 6  | StubHub            | /stubhub/search             | Event ticket prices                           |
| 7  | Swappa             | /swappa/search              | Used electronics                              |
| 8  | OfferUp            | /offerup/search             | Local buy/sell marketplace                    |
| 9  | Redfin             | /redfin/search              | Real estate listings                          |
| 10 | IMDb               | /imdb/search                | Movie/TV data and charts                      |
| 11 | Grailed            | /grailed/search             | Designer fashion marketplace                  |
| 12 | Bonanza            | /bonanza/search             | eBay-alternative marketplace                  |
| 13 | Houzz              | /houzz/search               | Home professional directory                   |
| 14 | ThriftBooks        | /thriftbooks/search         | Used books at discount prices                 |
| 15 | AbeBooks           | /abebooks/search            | Rare and used book search                     |
| 16 | Goodreads          | /goodreads/search           | Book ratings by genre                         |
| 17 | NYC Violations     | /nyc-violations/search      | Building violation records                    |
| 18 | Contractor License | /contractor-license/verify  | License verification (CA, TX, FL, NY)         |
| 19 | Nurse License      | /nurse-license/verify       | Nursing license verification (FL, NY)         |
| 20 | PSA Pop Report     | /psa/pop                    | Graded card population data                   |

All APIs are available on RapidAPI with a free tier (50 requests/month): https://rapidapi.com/lulzasaur9192

Hosting Costs

  • Railway: ~$5/month for the Express server (sleeps when idle)
  • Apify: Pay-per-event, roughly $0.005 per scraper result
  • RapidAPI: Free to list, they take a cut of paid subscriptions

Total cost to run: under $10/month at low traffic. The free tier on RapidAPI covers the Apify compute costs until you hit meaningful volume.
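
Those two numbers make the monthly bill easy to sanity-check. A rough estimate, using the figures above:

```javascript
// Back-of-the-envelope monthly cost: ~$5 Railway base plus
// ~$0.005 per scraper result (the pay-per-event figure above).
function estimateMonthlyCost(requestsPerMonth) {
  const RAILWAY_BASE = 5;          // Express server, ~$5/month
  const APIFY_PER_RESULT = 0.005;  // pay-per-event pricing
  return RAILWAY_BASE + requestsPerMonth * APIFY_PER_RESULT;
}

// At 500 requests/month this comes out to roughly $7.50 -- under $10.
```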

What I'd Do Differently

  1. Add Redis caching -- The in-memory Map works for the Socrata routes but doesn't survive deploys. Redis on Railway would give persistent caching for about $0/month on the free tier.
  2. Batch endpoints -- Some users want to search multiple queries at once. A /batch endpoint that runs actors in parallel would cut latency.
  3. Webhook delivery -- Instead of polling for slow scrapers, accept a webhook URL and POST results when done.
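
The batch idea in point 2 could be sketched like this, fanning out to the actor runner in parallel. The runActor dependency is passed in as a parameter here, and runAll is a hypothetical name, not code from the repo:

```javascript
// Run one actor against several queries concurrently. Errors are caught
// per-query, so Promise.all never rejects and partial failures are reported
// alongside successes.
async function runAll(runActor, actorId, queries, limit = 20) {
  const runs = queries.map((query) =>
    runActor(actorId, { searchQueries: [query], maxListings: limit })
      .then((items) => ({ query, success: true, results: items }))
      .catch((err) => ({ query, success: false, error: err.message }))
  );
  return Promise.all(runs);
}

// Hypothetical route wiring:
// app.post("/tcgplayer/batch", async (req, res) => {
//   res.json(await runAll(runActor, "O0sgde2LgujyWHvNt", req.body.queries));
// });
```

Latency becomes the slowest single actor run instead of the sum of all of them.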

If you want to build something similar, the pattern is dead simple: Cheerio + Express + a serverless scraper platform. The hard part is not the code -- it is figuring out where each site hides its structured data.
