The Hybrid Vinted Scraping Architecture That Outperforms Pure Browser Crawls

When you scrape Vinted at scale, you quickly hit a wall.

Not a firewall metaphor. A literal one. Datadome. Cloudflare. Aggressive rate limits. Token rotation that invalidates your session mid-crawl. And if you are still running headless Chromium for every single request, you are burning proxy credits and clock cycles for no reason.

After months of iteration — and enough failed runs to fill a datacenter — the architecture that actually works is hybrid: use a real browser only where Vinted forces you to, then switch to lightweight HTTP for the actual data extraction.

This is how Vinted Turbo Scraper implements that hybrid model, what makes it faster than pure-browser approaches, and why the architecture is the real product.


Why Pure Browser Crawling Is a Trap

Most tutorials tell you to fire up Playwright or Puppeteer, navigate to a Vinted search page, scroll endlessly, and extract DOM nodes. This works for five items. It collapses at scale.

Here is why:

| Problem | Browser-Only Impact |
| --- | --- |
| Proxy cost | Every image, font, and JS asset loads through your proxy. Bandwidth is not free. |
| Memory bloat | Chromium instances chew 200-500MB each. At concurrency 5, you are eating gigabytes. |
| Fingerprint fatigue | Datadome profiles browser behavior. Repeating the same navigation pattern = flag. |
| Session decay | Cookies and tokens expire. A pure browser crawl does not gracefully re-authenticate. |
| Speed ceiling | Rendering a full React-powered catalog page takes 2-5 seconds. Per page. |

A pure browser crawl is not "robust." It is expensive, slow, and detectable.

The insight is simple: Vinted serves catalog data via an internal JSON API. Once you have a valid session cookie, you can query that API directly with HTTP requests. No rendering. No DOM traversal. No asset loading.

The challenge is getting that cookie in the first place.


The Hybrid Model: Browser for Session, HTTP for Extraction

Vinted Turbo Scraper uses a two-phase approach:

  1. Phase One: Session initialization via Playwright — Navigate to the target catalog page once, let Datadome validate the browser fingerprint, capture cookies, and grab the user agent string.
  2. Phase Two: HTTP API extraction via got-scraping — Use the captured session to fire lightweight JSON API requests, paginating through results at ~200 items per minute.
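
The hand-off between the two phases is just data: cookies plus a user agent. A minimal sketch of that serialization (the SessionData shape, field names, and example values below are illustrative, not the actor's exact internals):

```typescript
// Shape of what Phase One (Playwright) hands to Phase Two (got-scraping).
// Playwright's context.cookies() returns objects with at least
// { name, value } fields, which is all the HTTP phase needs.
interface BrowserCookie {
    name: string;
    value: string;
}

interface SessionData {
    cookieStr: string;  // serialized Cookie header for HTTP requests
    userAgent: string;  // must match the browser that earned the cookies
}

// Serialize the captured cookies into the single header string sent
// with every subsequent API request.
function cookiesToHeader(cookies: BrowserCookie[]): string {
    return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Example hand-off; in the real actor these values come from
// context.cookies() and page.evaluate(() => navigator.userAgent).
const session: SessionData = {
    cookieStr: cookiesToHeader([
        { name: 'datadome', value: 'abc123' },
        { name: '_vinted_fr_session', value: 'xyz789' },
    ]),
    userAgent: 'Mozilla/5.0 ...',
};
```

The key detail is that the user agent travels with the cookies: Datadome binds its token to the fingerprint that earned it, so replaying the cookie under a different user agent gets flagged.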

This is not theoretical. Here is how the crawler initialization blocks media assets to keep proxy usage minimal:

preNavigationHooks: [
    async ({ page }) => {
        await page.route('**/*', (route) => {
            const type = route.request().resourceType();
            // Block images, media, fonts to save proxy bandwidth
            if (['image', 'media', 'font'].includes(type)) {
                route.abort().catch(() => {});
            } else {
                route.continue().catch(() => {});
            }
        });
    }
]

By aborting image and font requests before they hit the proxy, we cut bandwidth consumption by roughly 70%. On metered residential proxies, that translates directly to cost savings.


Translating Vinted Search URLs into API Calls

Vinted search URLs encode filter parameters in query strings: catalog[], brand_id[], size_id[], color_id[], status[], and more.

The internal API expects these same values but with slightly different parameter names and array bracket syntax. The Turbo Scraper extracts and rewrites these parameters automatically:

function translateToApiUrl(urlStr: string, domain: string): string | null {
    let u: URL;
    try {
        u = new URL(urlStr);
    } catch {
        return null; // honor the nullable return: bail on malformed input URLs
    }
    const params = new URLSearchParams(u.searchParams);

    const arrayMaps: Record<string, string> = {
        'catalog[]': 'catalog_ids',
        'color_id[]': 'color_ids',
        'size_id[]': 'size_ids',
        'status[]': 'status_ids',
        'brand_id[]': 'brand_ids',
    };

    const STRIP = new Set([
        'search_id', 'time', 'search_by_image_uuid',
        'search_by_image_id', 'currency', 'page', 'per_page'
    ]);

    const apiParams = new URLSearchParams();
    const accumulated: Record<string, string[]> = {};

    for (const [k, v] of params.entries()) {
        if (STRIP.has(k)) continue;
        if (arrayMaps[k]) {
            if (!accumulated[arrayMaps[k]]) accumulated[arrayMaps[k]] = [];
            accumulated[arrayMaps[k]].push(v);
        } else {
            apiParams.set(k, v);
        }
    }

    // Critical fix: append brackets for multi-value arrays
    for (const [key, vals] of Object.entries(accumulated)) {
        for (const v of vals) apiParams.append(`${key}[]`, v);
    }

    return `https://www.${domain}/api/v2/catalog/items?${apiParams.toString()}`;
}

This translator is the bridge between the URL your user copies from their browser and the internal API endpoint that returns raw JSON. Without it, you would need users to manually map catalog IDs — which defeats the purpose of a "zero-config" scraper.
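
To see the rewrite in action, here is a trimmed, self-contained version of the same idea with only two mappings wired up (purely illustrative, not the actor's production code):

```typescript
// Trimmed illustration of the URL-to-API translation: just enough
// to show the bracket rename. Not the actor's exact code.
function demoTranslate(urlStr: string): string {
    const u = new URL(urlStr);
    const arrayMaps: Record<string, string> = {
        'catalog[]': 'catalog_ids',
        'brand_id[]': 'brand_ids',
    };
    const apiParams = new URLSearchParams();
    for (const [k, v] of u.searchParams.entries()) {
        const mapped = arrayMaps[k];
        // Multi-value filters are appended (preserving every value);
        // scalar filters like price_to pass through unchanged.
        if (mapped) apiParams.append(`${mapped}[]`, v);
        else apiParams.set(k, v);
    }
    // Note: URLSearchParams percent-encodes the brackets as %5B%5D,
    // which the API decodes normally.
    return `https://www.vinted.fr/api/v2/catalog/items?${apiParams.toString()}`;
}

const browserUrl =
    'https://www.vinted.fr/catalog?catalog[]=1844&brand_id[]=53&brand_id[]=14&price_to=50';
console.log(demoTranslate(browserUrl));
```

Both brand_id[] values survive the rename to brand_ids[], which is exactly the multi-value case the "critical fix" loop in the full translator exists for.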


The Human-Friendly Mapping Layer

Vinted uses numeric IDs for filters. Users do not know that "Nike" maps to brand ID 53 or that "new with tags" maps to status ID 6.

The actor maintains internal dictionaries that resolve plain text to these IDs:

const BRAND_MAP: Record<string, number> = {
    'nike': 53, 'zara': 12, 'h&m': 7, 'adidas': 14,
    'levis': 10, 'ralph lauren': 88, 'calvin klein': 33,
    'guess': 35, 'puma': 15, 'vans': 16, 'converse': 17,
    'tommy hilfiger': 94, 'lacoste': 93, 'the north face': 114,
    'asics': 631, 'new balance': 267, 'carhartt': 362, 'dickies': 1007
};

const CONDITION_MAP: Record<string, number> = {
    'neuf avec étiquette': 6, 'new': 6, 'new_with_tags': 6,
    'neuf sans étiquette': 3, 'new_without_tags': 3,
    'très bon état': 2, 'very_good': 2,
    'bon état': 1, 'good': 1,
    'satisfaisant': 4, 'satisfactory': 4,
};

const SIZE_MAP: Record<string, number> = {
    '35': 54, '36': 55, '37': 56, '38': 57, '39': 58, '40': 59,
    '41': 60, '42': 61, '43': 62, '44': 63, '45': 64, '46': 65, '47': 66,
    'xxs': 205, 'xs': 206, 's': 207, 'm': 208, 'l': 209, 'xl': 210, 'xxl': 211
};

This lets users pass intuitive inputs like ["Nike", "Adidas"] or ["new", "very_good"] instead of reverse-engineering Vinted's internal taxonomy. The actor falls back to raw numeric IDs for anything not in the map, so power users are not constrained either.
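
A resolver with that numeric fallback can be sketched in a few lines (resolveBrandIds and the trimmed map are illustrative names, not the actor's internals):

```typescript
// Trimmed brand dictionary for illustration.
const BRAND_MAP: Record<string, number> = {
    nike: 53, adidas: 14, zara: 12,
};

// Resolve plain-text brand names to Vinted's numeric IDs.
// A raw numeric string passes through untouched, so power users can
// supply IDs the dictionary does not know about; anything else is dropped.
function resolveBrandIds(inputs: string[]): number[] {
    return inputs
        .map((raw) => {
            const key = raw.trim().toLowerCase();
            if (key in BRAND_MAP) return BRAND_MAP[key];
            const asNumber = Number(key);
            return Number.isInteger(asNumber) && asNumber > 0 ? asNumber : null;
        })
        .filter((id): id is number => id !== null);
}
```

So an input of ["Nike", "267"] resolves to [53, 267]: the dictionary handles the friendly name, and the unknown ID rides through on the fallback.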


HTTP Extraction Loop: Where the Speed Lives

Once the session cookie is captured, the actor switches to got-scraping for the heavy lifting:

const res = await gotScraping({
    url: apiReqUrl,
    responseType: 'json',
    proxyUrl,
    headers: {
        'User-Agent': userAgent,
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cookie': cookieStr,
        'Referer': `https://www.${domain}/`,
        'X-Money-Object-Enabled': 'true',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    },
    timeout: { request: 15000 }
});

The Sec-Fetch-* headers are not decoration. They signal to Vinted's edge that this is a same-origin AJAX request, not an external scraper. Combined with a matching Referer and the validated Cookie string, the request sails through.

Each page returns up to 96 items. The loop paginates until data.pagination.current_page >= data.pagination.total_pages or the maxItems limit is hit.
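
That termination condition is easy to get wrong, so here is a sketch with the page fetcher injected as a dependency (collectItems and the CatalogPage shape are illustrative; in the actor the fetcher would wrap the gotScraping call above):

```typescript
// Minimal shape of one API page response.
interface CatalogPage {
    items: { id: number }[];
    pagination: { current_page: number; total_pages: number };
}

type FetchPage = (page: number) => Promise<CatalogPage>;

// Paginate until the API reports the last page or maxItems is reached,
// whichever comes first. The slice keeps a partially-filled final page
// from overshooting the cap.
async function collectItems(
    fetchPage: FetchPage,
    maxItems: number,
): Promise<{ id: number }[]> {
    const items: { id: number }[] = [];
    let page = 1;
    while (items.length < maxItems) {
        const data = await fetchPage(page);
        items.push(...data.items.slice(0, maxItems - items.length));
        if (data.pagination.current_page >= data.pagination.total_pages) break;
        page += 1;
    }
    return items;
}
```

Injecting the fetcher keeps the cap and stop logic unit-testable without a live session, which matters when the loop runs unattended on a schedule.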

Result: ~200 items per minute sustained, with a memory footprint under 512MB per worker.


Input Schema Deep Dive

The actor accepts minimal but precise JSON input. Here is the exact schema:

{
  "maxItems": 100,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "startUrls": "https://www.vinted.co.uk/catalog?catalog[]=1844&brand_ids[]=53&size_ids[]=207&status_ids[]=6&price_from=20&price_to=50&currency=GBP&order=price_low_to_high"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| startUrls | string or array | Yes | One or more Vinted search URLs. Supports batch processing. |
| maxItems | number | No (default: 100) | Cap on results per run. Use for cost control. |
| proxyConfiguration | object | No (recommended) | Defaults to Apify residential proxies. Essential for Datadome evasion. |

You can pass multiple URLs as a comma-separated string or an array of objects with url keys. The actor processes them sequentially in a single run, combining outputs into one unified dataset.
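
Normalizing those two input shapes into one flat list of URLs can be sketched as follows (normalizeStartUrls is an illustrative name; the naive comma split assumes no commas inside a URL, which holds for typical Vinted search URLs):

```typescript
// The schema accepts either a comma-separated string or an array of
// { url } objects; downstream code wants a plain string[] either way.
type StartUrls = string | { url: string }[];

function normalizeStartUrls(input: StartUrls): string[] {
    if (typeof input === 'string') {
        return input
            .split(',')
            .map((s) => s.trim())
            .filter(Boolean);
    }
    return input.map((o) => o.url.trim()).filter(Boolean);
}
```

With this in place, both "https://a.example,https://b.example" and [{ url: "https://a.example" }] feed the same sequential processing loop.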


Integration Patterns: From Scraper to Pipeline

Raw data is worthless without a destination. The actor integrates with Apify's ecosystem for downstream automation:

| Destination | Trigger | Use Case |
| --- | --- | --- |
| Google Sheets | Apify integration | Live inventory tracking |
| Slack | Webhook | Alert team on new listings |
| Airtable | Zapier/Make bridge | Visual database for resellers |
| Custom API | Dataset webhook | Push to your own backend |
| CSV/Excel | Manual download | One-off market analysis |

For recurring monitoring, pair the actor with Apify Scheduler. Set it to run every 15 minutes against a filtered search URL and pipe results to a Slack channel or Google Sheet. You catch new listings before manual browsers refresh the page.
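
For the Slack leg, the only code you own is the payload builder; the POST itself is a standard Slack incoming-webhook call. A sketch (buildSlackMessage and the Listing shape are illustrative, with fields matching the output schema shown later in this post):

```typescript
// Fields pulled from the actor's output items; trimmed to what the
// alert needs.
interface Listing {
    title: string;
    price: number;
    currency: string;
    url: string;
}

// Build the JSON body for a Slack incoming webhook from fresh listings.
// Slack renders the "text" field as the message.
function buildSlackMessage(listings: Listing[]): { text: string } {
    const lines = listings.map(
        (l) => `• ${l.title}: ${l.price} ${l.currency}\n${l.url}`,
    );
    return {
        text: `${listings.length} new listing(s):\n${lines.join('\n')}`,
    };
}
```

Posting it is then a single fetch of the payload to your webhook URL; dedupe against previously seen item IDs first, or a 15-minute schedule will re-alert on the same listings.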


Real-World Performance Benchmarks

Here are observed numbers from production runs across different proxy tiers:

| Proxy Type | Speed | Reliability | Cost per 1k Items | Best For |
| --- | --- | --- | --- | --- |
| Apify Proxy (Datacenter) | ~300 items/min | Low (blocks after ~500) | ~$0.30 | Quick tests |
| Apify Proxy (Residential) | ~200 items/min | High (rarely blocked) | ~$1.50 | Production runs |
| Custom Proxy | Variable | Depends on quality | Variable | Power users |

The residential proxy is the sweet spot: fast enough for real-time workflows, reliable enough for continuous monitoring, and priced predictably per result.


Architecture Comparison: Browser vs Hybrid vs Pure HTTP

| Approach | Speed | Cost | Reliability | Complexity |
| --- | --- | --- | --- | --- |
| Pure Browser | ~20-40 items/min | High (full asset load) | Medium (detectable patterns) | Low |
| Pure HTTP | ~300+ items/min | Minimal | Low (session requires bootstrapping) | High |
| Hybrid (Turbo) | ~200 items/min | Low (blocked assets) | High (session + retry logic) | Medium |

Pure HTTP is fastest on paper, but without a valid session cookie, every request returns a 403. The hybrid approach trades absolute speed for operational reliability — the metric that actually matters when you are running automated workflows.


When to Use Turbo vs Smart Scraper

Vinted Turbo Scraper is part of a two-tool ecosystem. Choose based on your use case:

| Feature | Turbo Scraper | Smart Scraper |
| --- | --- | --- |
| URL-based input | Yes | No (form-based) |
| Batch URL processing | Yes | No |
| Cross-country comparison | No | Yes |
| Seller analysis | No | Yes |
| Sold items tracking | No | Yes |
| Trending discovery | No | Yes |
| Price monitoring | Yes | Yes (cross-border) |
| Speed | Faster | Slower (richer data) |
| Cost | Lower | Higher |

Use Turbo when you have a Vinted search URL ready and need structured data fast. Use Smart when you are doing deep market intelligence, seller profiling, or cross-country arbitrage.


Anti-Ban Mechanisms Beyond Proxies

Proxy rotation is table stakes. The actor adds three additional layers:

  1. Request fingerprint rotation via Crawlee — Built-in proxy configuration rotates IPs per session.
  2. Aggressive retry with exponential backoff — maxRequestRetries: 5 with a 30-second handler timeout.
  3. Graceful session recycling — If an HTTP request fails with a 403, the Playwright session is refreshed before retry.
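
The retry-plus-recycle policy from points 2 and 3 can be sketched with both side effects injected, which keeps it testable without a live target (requestWithRecycling and the callback types are illustrative names):

```typescript
// Sketch of 403-triggered session refresh with exponential backoff.
// doRequest and refreshSession are injected; in the actor they would
// wrap the got-scraping call and the Playwright bootstrap respectively.
type DoRequest = () => Promise<{ statusCode: number; body?: unknown }>;
type RefreshSession = () => Promise<void>;

async function requestWithRecycling(
    doRequest: DoRequest,
    refreshSession: RefreshSession,
    maxRetries = 5,
    baseDelayMs = 1000,
): Promise<unknown> {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const res = await doRequest();
        if (res.statusCode === 200) return res.body;
        // A 403 means the session cookie went stale: re-bootstrap
        // before the next attempt rather than retrying blindly.
        if (res.statusCode === 403) await refreshSession();
        if (attempt < maxRetries) {
            // Exponential backoff: 1s, 2s, 4s, ... between attempts.
            await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
        }
    }
    throw new Error(`Request failed after ${maxRetries + 1} attempts`);
}
```

Refreshing only on 403 matters: re-running the Playwright bootstrap on every transient error would reintroduce the browser cost the hybrid model exists to avoid.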

The output is a clean JSON schema with optional lightweight mode:

{
  "id": 8464268321,
  "title": "Levi black skinny jeans 33\" waist",
  "url": "https://www.vinted.co.uk/items/8464268321-levi-black-skinny-jeans-33-waist",
  "price": 20,
  "currency": "GBP",
  "brand": "Levi's",
  "size": "M / UK 12-14",
  "condition": "New with tags",
  "photos": ["..."],
  "favouriteCount": 1,
  "seller": {
    "id": 73959532,
    "username": "maxi83199",
    "profileUrl": "https://www.vinted.co.uk/member/73959532-maxi83199"
  },
  "scrapedAt": "2026-03-24T10:25:41.604Z"
}

Structured. Timestamped. Ready for pipelines.


FAQ: Technical Details

Q: Does this use headless browsers for every request?
A: No. Only for initial session bootstrap. Data extraction uses lightweight HTTP requests via got-scraping.

Q: How many items can I extract per run?
A: The maxItems parameter lets you cap runs. We have tested up to 10,000 items in a single run without memory issues.

Q: Is there a Vinted API this connects to?
A: Vinted does not offer a public API for catalog data. This actor acts as a practical alternative by reverse-engineering the internal endpoints.

Q: Will my IP get banned?
A: With residential proxies and the hybrid architecture, blocks are rare. The actor implements retry logic and session refresh for edge cases.

Q: Can I run this on a schedule?
A: Yes, via Apify Scheduler or cron triggers. Ideal for monitoring new listings.

Q: What output formats are available?
A: JSON (structured), CSV, Excel, or direct API export to integrations.


The Honest Bottom Line

No scraper is "unbannable." Platforms evolve. What the hybrid architecture buys you is time — time between Vinted deploying a new detection mechanism and you pushing an update.

Because this is packaged as an Apify Actor, that update propagates to every user instantly. No pip upgrade. No breaking dependency chains. No "works on my machine."

If you are still maintaining a custom Python Selenium script that breaks every two weeks, you are not scraping Vinted. You are debugging Vinted.

Switch to infrastructure that was built to survive the platform, not chase it.


Ready to extract Vinted data at scale?


Questions about the architecture or want to integrate this into a pipeline? Drop a comment below.
