DEV Community

agenthustler

Scraping Walmart in 2026: Product Search, Prices, and Dropshipping Data

Walmart.com serves hundreds of millions of products across thousands of categories. If you're building a price comparison tool, sourcing products for dropshipping, or doing competitive research, you need reliable access to that data.

This guide covers practical approaches to scraping Walmart in 2026 — from raw HTTP requests with Python to using managed scraping platforms. I'll show you what works, what Walmart blocks, and how to get clean product data efficiently.

The Challenge: Walmart's Anti-Bot Defenses

Walmart doesn't make scraping easy. Their stack includes:

  • PerimeterX / HUMAN Security — JavaScript challenges and behavioral fingerprinting
  • Rate limiting — Aggressive throttling on repeated requests from the same IP
  • Dynamic rendering — Some product data loads via JavaScript after the initial page load
  • Session validation — Cookie-based session tracking that detects automated access

A naive requests.get() call will return a CAPTCHA page or a 403 within a few requests. You need a strategy.
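Before wiring up retries, it helps to detect when you've been flagged. Here's a minimal heuristic check — the "Robot or human?" and "px-captcha" markers are what Walmart's PerimeterX challenge page has typically shown, but treat them as assumptions and verify against live responses:

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic check for Walmart's anti-bot responses.

    The 403/429 status codes and the challenge-page markers
    ("Robot or human?", "px-captcha") are assumptions based on
    commonly observed block pages; confirm against real traffic.
    """
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    return 'robot or human' in lowered or 'px-captcha' in lowered
```

Call this on every response and route blocked requests into your backoff logic instead of parsing a CAPTCHA page as if it were product data.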

Approach 1: Direct HTTP with httpx (DIY)

If you want to understand what's happening under the hood, start here. Walmart renders product data server-side and embeds it in a JavaScript variable called window.__WML_REDUX_INITIAL_STATE__. This is your goldmine — it contains structured JSON with product details, prices, reviews, and availability.

Here's a working approach using httpx:

import httpx
import json
import re
import time
import random

def scrape_walmart_product(url: str, proxy: str | None = None) -> dict | None:
    """Scrape a single Walmart product page."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/131.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
    }

    client_kwargs = {'headers': headers, 'follow_redirects': True, 'timeout': 30.0}
    if proxy:
        client_kwargs['proxy'] = proxy

    with httpx.Client(**client_kwargs) as client:
        response = client.get(url)

        if response.status_code != 200:
            print(f"Got status {response.status_code} for {url}")
            return None

        # Extract the Redux state JSON
        pattern = r'window\.__WML_REDUX_INITIAL_STATE__\s*=\s*({.*?});\s*</script>'
        match = re.search(pattern, response.text, re.DOTALL)

        if not match:
            print("Could not find product data — possible CAPTCHA or page change")
            return None

        data = json.loads(match.group(1))

        # Navigate the nested structure to extract product info
        try:
            product = data.get('product', {})
            item = product.get('item', {})
            return {
                'title': item.get('name'),
                'price': item.get('priceInfo', {}).get('currentPrice', {}).get('price'),
                'currency': item.get('priceInfo', {}).get('currentPrice', {}).get('currencyUnit'),
                'rating': item.get('averageRating'),
                'review_count': item.get('numberOfReviews'),
                'in_stock': item.get('availabilityStatus') == 'IN_STOCK',
                'seller': item.get('sellerName'),
                'brand': item.get('brand'),
                'category': item.get('category', {}).get('path', []),
                'image': item.get('imageInfo', {}).get('thumbnailUrl'),
                'url': url,
            }
        except (KeyError, TypeError) as e:
            print(f"Error parsing product data: {e}")
            return None


def scrape_walmart_search(query: str, proxy: str | None = None) -> list[dict]:
    """Scrape Walmart search results for a query."""
    url = f'https://www.walmart.com/search?q={query.replace(" ", "+")}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/131.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    client_kwargs = {'headers': headers, 'follow_redirects': True, 'timeout': 30.0}
    if proxy:
        client_kwargs['proxy'] = proxy

    with httpx.Client(**client_kwargs) as client:
        response = client.get(url)

        if response.status_code != 200:
            return []

        pattern = r'window\.__WML_REDUX_INITIAL_STATE__\s*=\s*({.*?});\s*</script>'
        match = re.search(pattern, response.text, re.DOTALL)

        if not match:
            return []

        data = json.loads(match.group(1))
        items = data.get('searchContent', {}).get('preso', {}).get('items', [])

        results = []
        for item in items:
            results.append({
                'title': item.get('title'),
                'price': item.get('priceInfo', {}).get('currentPrice', {}).get('price'),
                'rating': item.get('averageRating'),
                'review_count': item.get('numberOfReviews'),
                'url': f"https://www.walmart.com{item.get('canonicalUrl', '')}",
                'image': item.get('imageUrl'),
            })

        return results


# Example usage
if __name__ == '__main__':
    # Scrape search results
    products = scrape_walmart_search('bluetooth headphones')
    for p in products[:5]:
        print(f"{p['title'][:50]} — ${p['price']}")

    # Add delay between requests
    time.sleep(random.uniform(2, 5))

    # Scrape a specific product
    product = scrape_walmart_product(
        'https://www.walmart.com/ip/some-product/123456789'
    )
    if product:
        print(json.dumps(product, indent=2))

Install dependencies:

pip install httpx
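The search function above only fetches the first page of results. Walmart's search URL accepts a `page` query parameter for deeper results (an assumption worth verifying against live URLs — pagination depth is also capped server-side). A sketch of crawling several pages, where `scrape_page` is a hypothetical variant of `scrape_walmart_search` that takes a full URL:

```python
import random
import time
from urllib.parse import quote_plus

def build_search_urls(query: str, pages: int = 3) -> list[str]:
    """Build paged Walmart search URLs for a query.

    Assumes the `page` query parameter controls pagination;
    confirm against live search URLs before relying on it.
    """
    base = f'https://www.walmart.com/search?q={quote_plus(query)}'
    return [base if page == 1 else f'{base}&page={page}'
            for page in range(1, pages + 1)]

def crawl_search_pages(query: str, pages: int = 3) -> list[dict]:
    """Scrape several result pages with polite random delays."""
    results = []
    for url in build_search_urls(query, pages):
        # scrape_page is hypothetical: a version of scrape_walmart_search
        # that accepts a full URL instead of building one itself.
        results.extend(scrape_page(url))
        time.sleep(random.uniform(2, 5))  # stay under the rate limiter
    return results
```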

Anti-Bot Strategies for DIY Scraping

If you go the DIY route, here's what you need:

  1. Rotating residential proxies — Datacenter IPs get blocked fast. Residential proxies from providers like Bright Data, Oxylabs, or SmartProxy are essential for any volume.

  2. Request throttling — Add random delays (2-8 seconds) between requests. Walmart's rate limiter looks at request frequency per session.

  3. Header rotation — Rotate User-Agent strings and vary Accept headers. Use realistic browser fingerprints.

  4. Session management — Create fresh sessions periodically. Don't reuse cookies across hundreds of requests.

  5. Retry with backoff — When you hit a 403 or CAPTCHA, back off exponentially. Don't hammer the same URL.

import time
import random

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        result = scrape_walmart_product(url)
        if result:
            return result
        wait = (2 ** attempt) + random.uniform(1, 3)
        print(f"Retry {attempt + 1} in {wait:.1f}s...")
        time.sleep(wait)
    return None
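Point 3 above — header rotation — can be sketched as a small pool of profiles you pick from per request. The User-Agent strings below are illustrative examples, not a curated fingerprint set; for real volume, use a maintained list and keep each profile internally consistent (a Safari UA shouldn't send Chrome-only headers):

```python
import random

# Example profiles only — pair each User-Agent with the headers
# a real browser of that type would actually send.
HEADER_PROFILES = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/131.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    },
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                      'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                      'Version/17.4 Safari/605.1.15',
        'Accept-Language': 'en-US,en;q=0.8',
    },
]

def random_headers() -> dict:
    """Pick a header profile at random for the next request."""
    profile = dict(random.choice(HEADER_PROFILES))
    profile['Accept'] = ('text/html,application/xhtml+xml,'
                         'application/xml;q=0.9,*/*;q=0.8')
    return profile
```

Pass `random_headers()` into the `httpx.Client` instead of the hardcoded dict when you create a fresh session.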

The Reality of DIY Scraping

This approach works for small-scale projects (dozens of products). But at scale — thousands of products daily — you'll spend more time maintaining your scraper than using the data. Walmart updates their anti-bot measures regularly, proxy costs add up, and you need infrastructure to run and monitor the scraper.

Approach 2: Managed Scraping with Apify

For production workloads, a managed scraping platform eliminates the infrastructure burden. Apify runs your scraper in the cloud, handles proxy rotation, and provides scheduling, storage, and integrations out of the box.

The Walmart Scraper actor on Apify handles the anti-bot complexity for you. Here's how to use it:

from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_API_TOKEN')

# Search for products
run_input = {
    "searchTerms": ["bluetooth headphones"],
    "maxItems": 100,
}

run = client.actor('cryptosignals/walmart-scraper').call(run_input=run_input)
items = list(client.dataset(run['defaultDatasetId']).iterate_items())

for item in items[:5]:
    print(f"{item['title'][:50]} — ${item.get('price', 'N/A')}")

Why Use a Managed Actor?

  • No proxy management — The actor handles proxy rotation internally
  • Anti-bot updates — When Walmart changes their defenses, the actor maintainer updates the code. You don't touch anything.
  • Scheduling — Run daily, hourly, or on any cron schedule from the Apify dashboard
  • Integrations — Export to Google Sheets, webhook to Slack, push to your API
  • Cost-effective — You pay per compute unit, which is typically cheaper than maintaining your own proxy pool + infrastructure

Use Case: Dropshipping Price Monitor

Here's a practical example. You're dropshipping products from Walmart to eBay. You need to monitor Walmart prices daily to ensure your margins stay positive.

from apify_client import ApifyClient
import csv

client = ApifyClient('YOUR_APIFY_API_TOKEN')

# Your product URLs to monitor
product_urls = [
    'https://www.walmart.com/ip/product-1/111111',
    'https://www.walmart.com/ip/product-2/222222',
    'https://www.walmart.com/ip/product-3/333333',
]

run_input = {
    "startUrls": [{"url": u} for u in product_urls],
}

run = client.actor('cryptosignals/walmart-scraper').call(run_input=run_input)
items = list(client.dataset(run['defaultDatasetId']).iterate_items())

# Check margins against your eBay listings
MIN_MARGIN = 0.15  # 15% minimum margin

for item in items:
    walmart_price = item.get('price', 0)
    ebay_price = get_your_ebay_price(item['title'])  # Your lookup function
    margin = (ebay_price - walmart_price) / ebay_price if ebay_price else 0

    if margin < MIN_MARGIN:
        print(f"LOW MARGIN: {item['title'][:40]}"
              f"Walmart: ${walmart_price}, eBay: ${ebay_price}, "
              f"Margin: {margin:.1%}")
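If you also want price history for spotting trends (not just a same-day margin check), append each run's results to a CSV. A small sketch — `append_price_snapshot` is a hypothetical helper, and `items` is the scraped list from the run above:

```python
import csv
from datetime import date
from pathlib import Path

def append_price_snapshot(items: list[dict],
                          path: str = 'price_history.csv') -> None:
    """Append today's prices to a running CSV history file.

    Writes a header row only when the file is first created, so
    repeated daily runs accumulate one dated row per product.
    """
    file = Path(path)
    is_new = not file.exists()
    with file.open('a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['date', 'title', 'price', 'url'])
        today = date.today().isoformat()
        for item in items:
            writer.writerow([today, item.get('title'),
                             item.get('price'), item.get('url')])
```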

Schedule this to run every morning, and you'll catch price increases before they eat your margins.

Which Approach Should You Choose?

| Factor | DIY (httpx) | Managed (Apify Actor) |
| --- | --- | --- |
| Setup time | Hours | Minutes |
| Maintenance | Ongoing | Handled by maintainer |
| Scale | Limited by your infra | Cloud-scale |
| Cost at low volume | Cheaper (just proxy costs) | Small Apify fee |
| Cost at high volume | Expensive (proxies + servers) | More predictable |
| Learning value | High | Low |

Choose DIY if you're learning, scraping < 100 products, or need custom extraction logic.

Choose managed if you need reliability, scale, or don't want to maintain scraping infrastructure.

For most dropshipping and price monitoring workflows, the managed approach with Walmart Scraper on Apify saves significant time and produces more reliable results.

Key Takeaways

  1. Walmart embeds product data in window.__WML_REDUX_INITIAL_STATE__ — this is the most reliable extraction point
  2. Anti-bot defenses require residential proxies and careful request management
  3. DIY scraping is educational but doesn't scale well for production use
  4. Managed actors like the Walmart Scraper handle the hard parts so you can focus on using the data
  5. Always add delays, rotate headers, and handle failures gracefully

Whatever approach you choose, respect Walmart's terms of service and rate limits. Aggressive scraping hurts everyone — including you, when your IPs get permanently blocked.


This is part of my Web Scraping in 2026 series. Check out the previous article for a comparison of the best Walmart scrapers available today.
