The Hybrid Vinted Scraping Architecture That Outperforms Pure Browser Crawls

When you scrape Vinted at scale, you quickly hit a wall.

Not a firewall metaphor. A literal one. Datadome. Cloudflare. Aggressive rate limits. Token rotation that invalidates your session mid-crawl. And if you are still running headless Chromium for every single request, you are burning proxy credits and clock cycles for no reason.

After months of iteration — and enough failed runs to fill a datacenter — the architecture that actually works is hybrid: use a real browser only where Vinted forces you to, then switch to lightweight HTTP for the actual data extraction.

This is how Vinted Turbo Scraper implements that hybrid model, what makes it faster than pure-browser approaches, and why the architecture is the real product.


Why Pure Browser Crawling Is a Trap

Most tutorials tell you to fire up Playwright or Puppeteer, navigate to a Vinted search page, scroll endlessly, and extract DOM nodes. This works for five items. It collapses at scale.

Here is why:

| Problem | Browser-Only Impact |
| --- | --- |
| Proxy cost | Every image, font, and JS asset loads through your proxy. Bandwidth is not free. |
| Memory bloat | Chromium instances chew 200-500MB each. At concurrency 5, you are eating gigabytes. |
| Fingerprint fatigue | Datadome profiles browser behavior. Repeating the same navigation pattern = flag. |
| Session decay | Cookies and tokens expire. A pure browser crawl does not gracefully re-authenticate. |
| Speed ceiling | Rendering a full React-powered catalog page takes 2-5 seconds. Per page. |

A pure browser crawl is not "robust." It is expensive, slow, and detectable.

The insight is simple: Vinted serves catalog data via an internal JSON API. Once you have a valid session cookie, you can query that API directly with HTTP requests. No rendering. No DOM traversal. No asset loading.

The challenge is getting that cookie in the first place.


The Hybrid Model: Browser for Session, HTTP for Extraction

Vinted Turbo Scraper uses a two-phase approach:

  1. Phase One: Session initialization via Playwright — Navigate to the target catalog page once, let Datadome validate the browser fingerprint, capture cookies, and grab the user agent string.
  2. Phase Two: HTTP API extraction via got-scraping — Use the captured session to fire lightweight JSON API requests, paginating through results at ~200 items per minute.
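
The hand-off between the two phases is just data: cookies plus a user agent. A minimal sketch of that serialization (the SessionData shape, field names, and example values below are illustrative, not the actor's exact internals):

```typescript
// Shape of what Phase One (Playwright) hands to Phase Two (got-scraping).
// Playwright's context.cookies() returns objects with at least
// { name, value } fields, which is all the HTTP phase needs.
interface BrowserCookie {
    name: string;
    value: string;
}

interface SessionData {
    cookieStr: string;  // serialized Cookie header for HTTP requests
    userAgent: string;  // must match the browser that earned the cookies
}

// Serialize the captured cookies into the single header string sent
// with every subsequent API request.
function cookiesToHeader(cookies: BrowserCookie[]): string {
    return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Example hand-off; in the real actor these values come from
// context.cookies() and page.evaluate(() => navigator.userAgent).
const session: SessionData = {
    cookieStr: cookiesToHeader([
        { name: 'datadome', value: 'abc123' },
        { name: '_vinted_fr_session', value: 'xyz789' },
    ]),
    userAgent: 'Mozilla/5.0 ...',
};
```

The key detail is that the user agent travels with the cookies: Datadome binds its token to the fingerprint that earned it, so replaying the cookie under a different user agent gets flagged.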

This is not theoretical. Here is how the crawler initialization blocks media assets to keep proxy usage minimal:

preNavigationHooks: [
    async ({ page }) => {
        await page.route('**/*', (route) => {
            const type = route.request().resourceType();
            // Block images, media, fonts to save proxy bandwidth
            if (['image', 'media', 'font'].includes(type)) {
                route.abort().catch(() => {});
            } else {
                route.continue().catch(() => {});
            }
        });
    }
]

By aborting image and font requests before they hit the proxy, we cut bandwidth consumption by roughly 70%. On metered residential proxies, that translates directly to cost savings.


Translating Vinted Search URLs into API Calls

Vinted search URLs encode filter parameters in query strings: catalog[], brand_id[], size_id[], color_id[], status[], and more.

The internal API expects these same values but with slightly different parameter names and array bracket syntax. The Turbo Scraper extracts and rewrites these parameters automatically:

function translateToApiUrl(urlStr: string, domain: string): string | null {
    let u: URL;
    try {
        u = new URL(urlStr);
    } catch {
        return null; // honor the nullable return: bail on malformed input URLs
    }
    const params = new URLSearchParams(u.searchParams);

    const arrayMaps: Record<string, string> = {
        'catalog[]': 'catalog_ids',
        'color_id[]': 'color_ids',
        'size_id[]': 'size_ids',
        'status[]': 'status_ids',
        'brand_id[]': 'brand_ids',
    };

    const STRIP = new Set([
        'search_id', 'time', 'search_by_image_uuid',
        'search_by_image_id', 'currency', 'page', 'per_page'
    ]);

    const apiParams = new URLSearchParams();
    const accumulated: Record<string, string[]> = {};

    for (const [k, v] of params.entries()) {
        if (STRIP.has(k)) continue;
        if (arrayMaps[k]) {
            if (!accumulated[arrayMaps[k]]) accumulated[arrayMaps[k]] = [];
            accumulated[arrayMaps[k]].push(v);
        } else {
            apiParams.set(k, v);
        }
    }

    // Critical fix: append brackets for multi-value arrays
    for (const [key, vals] of Object.entries(accumulated)) {
        for (const v of vals) apiParams.append(`${key}[]`, v);
    }

    return `https://www.${domain}/api/v2/catalog/items?${apiParams.toString()}`;
}

This translator is the bridge between the URL your user copies from their browser and the internal API endpoint that returns raw JSON. Without it, you would need users to manually map catalog IDs — which defeats the purpose of a "zero-config" scraper.
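
To see the rewrite in action, here is a trimmed, self-contained version of the same idea with only two mappings wired up (purely illustrative, not the actor's production code):

```typescript
// Trimmed illustration of the URL-to-API translation: just enough
// to show the bracket rename. Not the actor's exact code.
function demoTranslate(urlStr: string): string {
    const u = new URL(urlStr);
    const arrayMaps: Record<string, string> = {
        'catalog[]': 'catalog_ids',
        'brand_id[]': 'brand_ids',
    };
    const apiParams = new URLSearchParams();
    for (const [k, v] of u.searchParams.entries()) {
        const mapped = arrayMaps[k];
        // Multi-value filters are appended (preserving every value);
        // scalar filters like price_to pass through unchanged.
        if (mapped) apiParams.append(`${mapped}[]`, v);
        else apiParams.set(k, v);
    }
    // Note: URLSearchParams percent-encodes the brackets as %5B%5D,
    // which the API decodes normally.
    return `https://www.vinted.fr/api/v2/catalog/items?${apiParams.toString()}`;
}

const browserUrl =
    'https://www.vinted.fr/catalog?catalog[]=1844&brand_id[]=53&brand_id[]=14&price_to=50';
console.log(demoTranslate(browserUrl));
```

Both brand_id[] values survive the rename to brand_ids[], which is exactly the multi-value case the "critical fix" loop in the full translator exists for.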


The Human-Friendly Mapping Layer

Vinted uses numeric IDs for filters. Users do not know that "Nike" maps to brand ID 53 or that "new with tags" maps to status ID 6.

The actor maintains internal dictionaries that resolve plain text to these IDs:

const BRAND_MAP: Record<string, number> = {
    'nike': 53, 'zara': 12, 'h&m': 7, 'adidas': 14,
    'levis': 10, 'ralph lauren': 88, 'calvin klein': 33,
    'guess': 35, 'puma': 15, 'vans': 16, 'converse': 17,
    'tommy hilfiger': 94, 'lacoste': 93, 'the north face': 114,
    'asics': 631, 'new balance': 267, 'carhartt': 362, 'dickies': 1007
};

const CONDITION_MAP: Record<string, number> = {
    'neuf avec étiquette': 6, 'new': 6, 'new_with_tags': 6,
    'neuf sans étiquette': 3, 'new_without_tags': 3,
    'très bon état': 2, 'very_good': 2,
    'bon état': 1, 'good': 1,
    'satisfaisant': 4, 'satisfactory': 4,
};

const SIZE_MAP: Record<string, number> = {
    '35': 54, '36': 55, '37': 56, '38': 57, '39': 58, '40': 59,
    '41': 60, '42': 61, '43': 62, '44': 63, '45': 64, '46': 65, '47': 66,
    'xxs': 205, 'xs': 206, 's': 207, 'm': 208, 'l': 209, 'xl': 210, 'xxl': 211
};

This lets users pass intuitive inputs like ["Nike", "Adidas"] or ["new", "very_good"] instead of reverse-engineering Vinted's internal taxonomy. The actor falls back to raw numeric IDs for anything not in the map, so power users are not constrained either.
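
A resolver with that numeric fallback can be sketched in a few lines (resolveBrandIds and the trimmed map are illustrative names, not the actor's internals):

```typescript
// Trimmed brand dictionary for illustration.
const BRAND_MAP: Record<string, number> = {
    nike: 53, adidas: 14, zara: 12,
};

// Resolve plain-text brand names to Vinted's numeric IDs.
// A raw numeric string passes through untouched, so power users can
// supply IDs the dictionary does not know about; anything else is dropped.
function resolveBrandIds(inputs: string[]): number[] {
    return inputs
        .map((raw) => {
            const key = raw.trim().toLowerCase();
            if (key in BRAND_MAP) return BRAND_MAP[key];
            const asNumber = Number(key);
            return Number.isInteger(asNumber) && asNumber > 0 ? asNumber : null;
        })
        .filter((id): id is number => id !== null);
}
```

So an input of ["Nike", "267"] resolves to [53, 267]: the dictionary handles the friendly name, and the unknown ID rides through on the fallback.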


HTTP Extraction Loop: Where the Speed Lives

Once the session cookie is captured, the actor switches to got-scraping for the heavy lifting:

const res = await gotScraping({
    url: apiReqUrl,
    responseType: 'json',
    proxyUrl,
    headers: {
        'User-Agent': userAgent,
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cookie': cookieStr,
        'Referer': `https://www.${domain}/`,
        'X-Money-Object-Enabled': 'true',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    },
    timeout: { request: 15000 }
});

The Sec-Fetch-* headers are not decoration. They signal to Vinted's edge that this is a same-origin AJAX request, not an external scraper. Combined with a matching Referer and the validated Cookie string, the request sails through.

Each page returns up to 96 items. The loop paginates until data.pagination.current_page >= data.pagination.total_pages or the maxItems limit is hit.
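
That termination condition is easy to get wrong, so here is a sketch with the page fetcher injected as a dependency (collectItems and the CatalogPage shape are illustrative; in the actor the fetcher would wrap the gotScraping call above):

```typescript
// Minimal shape of one API page response.
interface CatalogPage {
    items: { id: number }[];
    pagination: { current_page: number; total_pages: number };
}

type FetchPage = (page: number) => Promise<CatalogPage>;

// Paginate until the API reports the last page or maxItems is reached,
// whichever comes first. The slice keeps a partially-filled final page
// from overshooting the cap.
async function collectItems(
    fetchPage: FetchPage,
    maxItems: number,
): Promise<{ id: number }[]> {
    const items: { id: number }[] = [];
    let page = 1;
    while (items.length < maxItems) {
        const data = await fetchPage(page);
        items.push(...data.items.slice(0, maxItems - items.length));
        if (data.pagination.current_page >= data.pagination.total_pages) break;
        page += 1;
    }
    return items;
}
```

Injecting the fetcher keeps the cap and stop logic unit-testable without a live session, which matters when the loop runs unattended on a schedule.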

Result: ~200 items per minute sustained, with a memory footprint under 512MB per worker.


Input Schema Deep Dive

The actor accepts minimal but precise JSON input. Here is the exact schema:

{
  "maxItems": 100,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "startUrls": "https://www.vinted.co.uk/catalog?catalog[]=1844&brand_ids[]=53&size_ids[]=207&status_ids[]=6&price_from=20&price_to=50&currency=GBP&order=price_low_to_high"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| startUrls | string or array | Yes | One or more Vinted search URLs. Supports batch processing. |
| maxItems | number | No (default: 100) | Cap on results per run. Use for cost control. |
| proxyConfiguration | object | No (recommended) | Defaults to Apify residential proxies. Essential for Datadome evasion. |

You can pass multiple URLs as a comma-separated string or an array of objects with url keys. The actor processes them sequentially in a single run, combining outputs into one unified dataset.
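
Normalizing those two input shapes into one flat list of URLs can be sketched as follows (normalizeStartUrls is an illustrative name; the naive comma split assumes no commas inside a URL, which holds for typical Vinted search URLs):

```typescript
// The schema accepts either a comma-separated string or an array of
// { url } objects; downstream code wants a plain string[] either way.
type StartUrls = string | { url: string }[];

function normalizeStartUrls(input: StartUrls): string[] {
    if (typeof input === 'string') {
        return input
            .split(',')
            .map((s) => s.trim())
            .filter(Boolean);
    }
    return input.map((o) => o.url.trim()).filter(Boolean);
}
```

With this in place, both "https://a.example,https://b.example" and [{ url: "https://a.example" }] feed the same sequential processing loop.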


Integration Patterns: From Scraper to Pipeline

Raw data is worthless without a destination. The actor integrates with Apify's ecosystem for downstream automation:

| Destination | Trigger | Use Case |
| --- | --- | --- |
| Google Sheets | Apify integration | Live inventory tracking |
| Slack | Webhook | Alert team on new listings |
| Airtable | Zapier/Make bridge | Visual database for resellers |
| Custom API | Dataset webhook | Push to your own backend |
| CSV/Excel | Manual download | One-off market analysis |

For recurring monitoring, pair the actor with Apify Scheduler. Set it to run every 15 minutes against a filtered search URL and pipe results to a Slack channel or Google Sheet. You catch new listings before manual browsers refresh the page.
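
For the Slack leg, the only code you own is the payload builder; the POST itself is a standard Slack incoming-webhook call. A sketch (buildSlackMessage and the Listing shape are illustrative, with fields matching the output schema shown later in this post):

```typescript
// Fields pulled from the actor's output items; trimmed to what the
// alert needs.
interface Listing {
    title: string;
    price: number;
    currency: string;
    url: string;
}

// Build the JSON body for a Slack incoming webhook from fresh listings.
// Slack renders the "text" field as the message.
function buildSlackMessage(listings: Listing[]): { text: string } {
    const lines = listings.map(
        (l) => `• ${l.title}: ${l.price} ${l.currency}\n${l.url}`,
    );
    return {
        text: `${listings.length} new listing(s):\n${lines.join('\n')}`,
    };
}
```

Posting it is then a single fetch of the payload to your webhook URL; dedupe against previously seen item IDs first, or a 15-minute schedule will re-alert on the same listings.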


Real-World Performance Benchmarks

Here are observed numbers from production runs across different proxy tiers:

| Proxy Type | Speed | Reliability | Cost per 1k Items | Best For |
| --- | --- | --- | --- | --- |
| Apify Proxy (Datacenter) | ~300 items/min | Low (blocks after ~500) | ~$0.30 | Quick tests |
| Apify Proxy (Residential) | ~200 items/min | High (rarely blocked) | ~$1.50 | Production runs |
| Custom Proxy | Variable | Depends on quality | Variable | Power users |

The residential proxy is the sweet spot: fast enough for real-time workflows, reliable enough for continuous monitoring, and priced predictably per result.


Architecture Comparison: Browser vs Hybrid vs Pure HTTP

| Approach | Speed | Cost | Reliability | Complexity |
| --- | --- | --- | --- | --- |
| Pure Browser | ~20-40 items/min | High (full asset load) | Medium (detectable patterns) | Low |
| Pure HTTP | ~300+ items/min | Minimal | Low (session requires bootstrapping) | High |
| Hybrid (Turbo) | ~200 items/min | Low (blocked assets) | High (session + retry logic) | Medium |

Pure HTTP is fastest on paper, but without a valid session cookie, every request returns a 403. The hybrid approach trades absolute speed for operational reliability — the metric that actually matters when you are running automated workflows.


When to Use Turbo vs Smart Scraper

Vinted Turbo Scraper is part of a two-tool ecosystem. Choose based on your use case:

| Feature | Turbo Scraper | Smart Scraper |
| --- | --- | --- |
| URL-based input | Yes | No (form-based) |
| Batch URL processing | Yes | No |
| Cross-country comparison | No | Yes |
| Seller analysis | No | Yes |
| Sold items tracking | No | Yes |
| Trending discovery | No | Yes |
| Price monitoring | Yes | Yes (cross-border) |
| Speed | Faster | Slower (richer data) |
| Cost | Lower | Higher |

Use Turbo when you have a Vinted search URL ready and need structured data fast. Use Smart when you are doing deep market intelligence, seller profiling, or cross-country arbitrage.


Anti-Ban Mechanisms Beyond Proxies

Proxy rotation is table stakes. The actor adds three additional layers:

  1. Request fingerprint rotation via Crawlee — Built-in proxy configuration rotates IPs per session.
  2. Aggressive retry with exponential backoff — maxRequestRetries: 5 with a 30-second handler timeout.
  3. Graceful session recycling — If an HTTP request fails with a 403, the Playwright session is refreshed before retry.
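
The retry-plus-recycle policy from points 2 and 3 can be sketched with both side effects injected, which keeps it testable without a live target (requestWithRecycling and the callback types are illustrative names):

```typescript
// Sketch of 403-triggered session refresh with exponential backoff.
// doRequest and refreshSession are injected; in the actor they would
// wrap the got-scraping call and the Playwright bootstrap respectively.
type DoRequest = () => Promise<{ statusCode: number; body?: unknown }>;
type RefreshSession = () => Promise<void>;

async function requestWithRecycling(
    doRequest: DoRequest,
    refreshSession: RefreshSession,
    maxRetries = 5,
    baseDelayMs = 1000,
): Promise<unknown> {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const res = await doRequest();
        if (res.statusCode === 200) return res.body;
        // A 403 means the session cookie went stale: re-bootstrap
        // before the next attempt rather than retrying blindly.
        if (res.statusCode === 403) await refreshSession();
        if (attempt < maxRetries) {
            // Exponential backoff: 1s, 2s, 4s, ... between attempts.
            await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
        }
    }
    throw new Error(`Request failed after ${maxRetries + 1} attempts`);
}
```

Refreshing only on 403 matters: re-running the Playwright bootstrap on every transient error would reintroduce the browser cost the hybrid model exists to avoid.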

The output is a clean JSON schema with optional lightweight mode:

{
  "id": 8464268321,
  "title": "Levi black skinny jeans 33\" waist",
  "url": "https://www.vinted.co.uk/items/8464268321-levi-black-skinny-jeans-33-waist",
  "price": 20,
  "currency": "GBP",
  "brand": "Levi's",
  "size": "M / UK 12-14",
  "condition": "New with tags",
  "photos": ["..."],
  "favouriteCount": 1,
  "seller": {
    "id": 73959532,
    "username": "maxi83199",
    "profileUrl": "https://www.vinted.co.uk/member/73959532-maxi83199"
  },
  "scrapedAt": "2026-03-24T10:25:41.604Z"
}

Structured. Timestamped. Ready for pipelines.


FAQ: Technical Details

Q: Does this use headless browsers for every request?
A: No. Only for initial session bootstrap. Data extraction uses lightweight HTTP requests via got-scraping.

Q: How many items can I extract per run?
A: The maxItems parameter lets you cap runs. We have tested up to 10,000 items in a single run without memory issues.

Q: Is there a Vinted API this connects to?
A: Vinted does not offer a public API for catalog data. This actor acts as a practical alternative by reverse-engineering the internal endpoints.

Q: Will my IP get banned?
A: With residential proxies and the hybrid architecture, blocks are rare. The actor implements retry logic and session refresh for edge cases.

Q: Can I run this on a schedule?
A: Yes, via Apify Scheduler or cron triggers. Ideal for monitoring new listings.

Q: What output formats are available?
A: JSON (structured), CSV, Excel, or direct API export to integrations.


The Honest Bottom Line

No scraper is "unbannable." Platforms evolve. What the hybrid architecture buys you is time — time between Vinted deploying a new detection mechanism and you pushing an update.

Because this is packaged as an Apify Actor, that update propagates to every user instantly. No pip upgrade. No breaking dependency chains. No "works on my machine."

If you are still maintaining a custom Python Selenium script that breaks every two weeks, you are not scraping Vinted. You are debugging Vinted.

Switch to infrastructure that was built to survive the platform, not chase it.


Ready to extract Vinted data at scale?


Questions about the architecture or want to integrate this into a pipeline? Drop a comment below.
