lulzasaur
I Built 20 Marketplace Data APIs in a Weekend -- Here's the Stack


I got tired of writing the same scraping logic every time I needed marketplace data for a side project. So I sat down and built a unified REST API layer across 20 different marketplaces -- TCGPlayer, Redfin, Poshmark, Reverb, Craigslist, and more -- and shipped them all in a weekend.

Here's how the stack works, what I learned about parsing at scale, and the full list of APIs if you want to use them.

The Architecture

Client Request
    |
    v
[RapidAPI Gateway] -- auth, rate limiting, billing
    |
    v
[Express Server on Railway] -- routing, validation, caching
    |
    v
[Apify Actor Pool] -- headless scrapers with Cheerio
    |
    v
[Target Marketplace] -- actual HTML parsing

The core idea: each marketplace gets its own Apify Actor (a serverless scraper), and the Express server acts as a thin orchestration layer that runs the right actor, waits for results, and returns clean JSON.

The Backend: 120 Lines That Run Everything

The entire backend is a single server.js file. Each route follows the same pattern:

const express = require("express");
const { ApifyClient } = require("apify-client");

const app = express();
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Every actor follows the same execution pattern
async function runActor(actorId, input) {
  const run = await client.actor(actorId).call(input, {
    memory: 256,
    timeout: 90,
  });

  // Poll until completion
  let runInfo = await client.run(run.id).get();
  const POLL_INTERVAL = 2000;
  const MAX_WAIT = 90000;
  let elapsed = 0;

  while (
    !["SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"].includes(runInfo.status) &&
    elapsed < MAX_WAIT
  ) {
    await new Promise((r) => setTimeout(r, POLL_INTERVAL));
    elapsed += POLL_INTERVAL;
    runInfo = await client.run(run.id).get();
  }

  if (runInfo.status !== "SUCCEEDED") {
    throw new Error(`Actor run ${runInfo.status}: ${run.id}`);
  }

  const { items } = await client
    .dataset(runInfo.defaultDatasetId)
    .listItems();
  return items;
}

// Then each route is just a thin wrapper:
app.get("/tcgplayer/search", async (req, res) => {
  const items = await runActor("O0sgde2LgujyWHvNt", {
    searchQueries: [req.query.query],
    maxListings: parseInt(req.query.limit) || 20,
  });
  res.json({ success: true, count: items.length, results: items });
});

app.listen(process.env.PORT || 3000);

That is genuinely the whole pattern. The complexity lives in the individual scrapers, not the API layer.

How the Scrapers Work: Cheerio Over Puppeteer

Most marketplace scrapers use Puppeteer or Playwright. I went with Cheerio (server-side jQuery) for 18 out of 20 scrapers. The reason: most modern sites expose structured data in their HTML even without JavaScript execution.

Here is a simplified version of the TCGPlayer scraper:

const cheerio = require("cheerio");

async function scrapeTCGPlayer(query) {
  const url = `https://www.tcgplayer.com/search/all/product?q=${encodeURIComponent(query)}`;
  const response = await fetch(url, {
    headers: { "User-Agent": "Mozilla/5.0 ..." }
  });
  const html = await response.text();
  const $ = cheerio.load(html);

  const results = [];
  $(".search-result").each((_, el) => {
    const name = $(el).find(".search-result__title").text().trim();
    const price = $(el).find(".search-result__market-price--value").text().trim();
    const set = $(el).find(".search-result__subtitle").text().trim();
    const image = $(el).find("img").attr("src");
    const link = $(el).find("a").attr("href");

    results.push({
      name,
      marketPrice: parseFloat(price.replace("$", "")) || null,
      set,
      imageUrl: image,
      url: link ? `https://www.tcgplayer.com${link}` : null,
    });
  });

  return results;
}

Three patterns kept coming up across all 20 sites:

1. __NEXT_DATA__ extraction -- Next.js sites embed their full page data in a <script id="__NEXT_DATA__"> tag. You get structured JSON without any DOM parsing. Thumbtack, IMDb, and several others use this.

const nextDataScript = $("script#__NEXT_DATA__").html();
if (nextDataScript) {
  const pageData = JSON.parse(nextDataScript);
  // All the data you need is in pageData.props.pageProps
}

2. JSON-LD microdata -- E-commerce sites embed <script type="application/ld+json"> for SEO. ThriftBooks, AbeBooks, and Houzz all have rich structured data this way.

$('script[type="application/ld+json"]').each((_, el) => {
  const ld = JSON.parse($(el).html());
  if (ld["@type"] === "Product") {
    // ld.name, ld.offers.price, ld.isbn, etc.
  }
});

3. Algolia-backed search -- Some sites (Grailed, Bonanza) use Algolia for search. If you inspect their network requests, you can find the Algolia app ID and API key embedded in the page source, then query Algolia directly for structured data. No HTML parsing needed at all.
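
Here is a sketch of what that direct Algolia query looks like. The app ID, API key, and index name below are placeholders standing in for the values you would lift from the target site's page source or network tab, not real credentials:

```javascript
// Placeholders -- substitute the values found in the site's page source.
const ALGOLIA_APP_ID = "YOUR_APP_ID";
const ALGOLIA_API_KEY = "YOUR_SEARCH_ONLY_KEY"; // search-only keys are exposed client-side
const ALGOLIA_INDEX = "listings";               // hypothetical index name

// Build the POST request Algolia's REST search endpoint expects.
function buildAlgoliaRequest(query, hitsPerPage = 20) {
  return {
    url: `https://${ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/${ALGOLIA_INDEX}/query`,
    options: {
      method: "POST",
      headers: {
        "X-Algolia-Application-Id": ALGOLIA_APP_ID,
        "X-Algolia-API-Key": ALGOLIA_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query, hitsPerPage }),
    },
  };
}

// Usage (with real credentials):
// const { url, options } = buildAlgoliaRequest("raw denim");
// const { hits } = await (await fetch(url, options)).json();
```

Because the response is already structured JSON, these two scrapers skip Cheerio entirely.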

Caching Strategy

For most marketplace APIs, the data changes every few minutes, so aggressive caching would serve stale results. But for reference data -- like NYC building violations or FCC certifications -- the data changes rarely.

I use a two-tier approach:

  • Apify-backed routes: No caching. Each request runs a fresh scraper. The 90-second timeout and 256MB memory allocation keep costs low while ensuring fresh data.
  • Direct data routes (e.g. NYC violations, backed by the NYC Open Data / Socrata API): In-memory cache with a 5-minute TTL.

const cache = new Map();
const CACHE_TTL = 5 * 60 * 1000;

async function fetchSocrata(datasetId, queryParams) {
  const cacheKey = `${datasetId}?${new URLSearchParams(queryParams)}`;
  const cached = cache.get(cacheKey);
  if (cached && Date.now() - cached.ts < CACHE_TTL) return cached.data;

  const response = await fetch(
    `https://data.cityofnewyork.us/resource/${datasetId}.json?${new URLSearchParams(queryParams)}`
  );
  const data = await response.json();
  cache.set(cacheKey, { data, ts: Date.now() });
  return data;
}

Error Handling

Every route wraps its actor call in try/catch and returns a consistent error shape:

{ "success": false, "error": "Actor run TIMED-OUT: abc123" }

The Apify actor runner itself has a belt-and-suspenders timeout: Apify enforces a 90-second timeout on the actor, and the polling loop has its own 90-second max wait. If the actor hangs, the API returns a clear error instead of hanging the HTTP connection.
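
The per-route try/catch can be factored into a small wrapper so every route returns that same error shape. A minimal sketch (the safeHandler name is mine, not from the actual repo):

```javascript
// Wraps an async route handler so any thrown error -- including an actor
// timeout from runActor -- becomes the consistent { success, error } shape.
function safeHandler(fn) {
  return async (req, res) => {
    try {
      await fn(req, res);
    } catch (err) {
      res.status(502).json({ success: false, error: err.message });
    }
  };
}

// Usage:
// app.get("/tcgplayer/search", safeHandler(async (req, res) => { ... }));
```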

Try It

Here is a real curl you can run right now:

curl "https://rapidapi-backend-production.up.railway.app/tcgplayer/search?query=charizard"

That will return JSON with TCGPlayer prices, market data, and card images for Charizard cards.

The Full API List

| #  | API                | Endpoint                    | Use Case                                      |
|----|--------------------|-----------------------------|-----------------------------------------------|
| 1  | TCGPlayer          | /tcgplayer/search           | Trading card prices (Pokemon, MTG, Yu-Gi-Oh)  |
| 2  | Reverb             | /reverb/search              | Used music gear listings                      |
| 3  | Thumbtack          | /thumbtack/search           | Local service professionals                   |
| 4  | Poshmark           | /poshmark/search            | Fashion resale marketplace                    |
| 5  | Craigslist         | /craigslist/search          | Local classifieds                             |
| 6  | StubHub            | /stubhub/search             | Event ticket prices                           |
| 7  | Swappa             | /swappa/search              | Used electronics                              |
| 8  | OfferUp            | /offerup/search             | Local buy/sell marketplace                    |
| 9  | Redfin             | /redfin/search              | Real estate listings                          |
| 10 | IMDb               | /imdb/search                | Movie/TV data and charts                      |
| 11 | Grailed            | /grailed/search             | Designer fashion marketplace                  |
| 12 | Bonanza            | /bonanza/search             | eBay-alternative marketplace                  |
| 13 | Houzz              | /houzz/search               | Home professional directory                   |
| 14 | ThriftBooks        | /thriftbooks/search         | Used books at discount prices                 |
| 15 | AbeBooks           | /abebooks/search            | Rare and used book search                     |
| 16 | Goodreads          | /goodreads/search           | Book ratings by genre                         |
| 17 | NYC Violations     | /nyc-violations/search      | Building violation records                    |
| 18 | Contractor License | /contractor-license/verify  | License verification (CA, TX, FL, NY)         |
| 19 | Nurse License      | /nurse-license/verify       | Nursing license verification (FL, NY)         |
| 20 | PSA Pop Report     | /psa/pop                    | Graded card population data                   |

All APIs are available on RapidAPI with a free tier (50 requests/month): https://rapidapi.com/lulzasaur9192

Hosting Costs

  • Railway: ~$5/month for the Express server (sleeps when idle)
  • Apify: Pay-per-event, roughly $0.005 per scraper result
  • RapidAPI: Free to list, they take a cut of paid subscriptions

Total cost to run: under $10/month at low traffic. The free tier on RapidAPI covers the Apify compute costs until you hit meaningful volume.
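
Those two numbers make the monthly bill easy to sanity-check. A rough estimate, using the figures above:

```javascript
// Back-of-the-envelope monthly cost: ~$5 Railway base plus
// ~$0.005 per scraper result (the pay-per-event figure above).
function estimateMonthlyCost(requestsPerMonth) {
  const RAILWAY_BASE = 5;          // Express server, ~$5/month
  const APIFY_PER_RESULT = 0.005;  // pay-per-event pricing
  return RAILWAY_BASE + requestsPerMonth * APIFY_PER_RESULT;
}

// At 500 requests/month this comes out to roughly $7.50 -- under $10.
```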

What I'd Do Differently

  1. Add Redis caching -- The in-memory Map works for the Socrata routes but doesn't survive deploys. Redis on Railway would give persistent caching for about $0/month on the free tier.
  2. Batch endpoints -- Some users want to search multiple queries at once. A /batch endpoint that runs actors in parallel would cut latency.
  3. Webhook delivery -- Instead of polling for slow scrapers, accept a webhook URL and POST results when done.
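
The batch idea in point 2 could be sketched like this, fanning out to the actor runner in parallel. The runActor dependency is passed in as a parameter here, and runAll is a hypothetical name, not code from the repo:

```javascript
// Run one actor against several queries concurrently. Errors are caught
// per-query, so Promise.all never rejects and partial failures are reported
// alongside successes.
async function runAll(runActor, actorId, queries, limit = 20) {
  const runs = queries.map((query) =>
    runActor(actorId, { searchQueries: [query], maxListings: limit })
      .then((items) => ({ query, success: true, results: items }))
      .catch((err) => ({ query, success: false, error: err.message }))
  );
  return Promise.all(runs);
}

// Hypothetical route wiring:
// app.post("/tcgplayer/batch", async (req, res) => {
//   res.json(await runAll(runActor, "O0sgde2LgujyWHvNt", req.body.queries));
// });
```

Latency becomes the slowest single actor run instead of the sum of all of them.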

If you want to build something similar, the pattern is dead simple: Cheerio + Express + a serverless scraper platform. The hard part is not the code -- it is figuring out where each site hides its structured data.
