wfgsss

How to Extract Product Images from Yiwugo.com for Your E-commerce Store

If you're sourcing wholesale products from Yiwugo.com, you'll quickly realize that manually downloading product images is tedious and time-consuming. Each product listing has multiple high-resolution images, and copying them one by one doesn't scale.

In this tutorial, I'll show you how to automate product image extraction from Yiwugo.com and prepare them for your e-commerce store.

Why Extract Images from Yiwugo?

Yiwugo.com (义乌购) is China's largest wholesale marketplace, with millions of products. When you're building an e-commerce store or dropshipping business, you need:

  • High-quality product images for your listings
  • Multiple angles of each product
  • Batch processing to handle hundreds of products
  • CDN optimization for fast loading

Manual downloading doesn't work at scale. Automation does.

What You'll Learn

  • How to scrape product image URLs from Yiwugo
  • How to batch download images efficiently
  • How to optimize images for web (compression, resizing)
  • How to integrate with CDN services (optional)

Prerequisites

  • Basic Python knowledge
  • An Apify account (free tier works)
  • Node.js installed (for the scraper)

Step 1: Get Product Image URLs

First, we need to extract image URLs from Yiwugo product pages. The easiest way is to use the Yiwugo Scraper on Apify Store.

Using the Scraper

// Run via Apify API
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

// Top-level await isn't available in CommonJS, so wrap the calls in an async IIFE
(async () => {
    const input = {
        startUrls: [
            { url: 'https://www.yiwugo.com/search?keyword=backpack' }
        ],
        maxItems: 50,
    };

    const run = await client.actor('jungle_intertwining/yiwugo-scraper').call(input);
    const { items } = await client.dataset(run.defaultDatasetId).listItems();

    console.log(`Scraped ${items.length} products`);
})();

Sample Output

Each product includes an images array:

{
  "title": "Fashion Backpack",
  "price": "¥45.00",
  "images": [
    "https://cbu01.alicdn.com/img/ibank/O1CN01abc123_1234567890.jpg",
    "https://cbu01.alicdn.com/img/ibank/O1CN01def456_0987654321.jpg",
    "https://cbu01.alicdn.com/img/ibank/O1CN01ghi789_1122334455.jpg"
  ],
  "url": "https://www.yiwugo.com/item/12345.html"
}
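Before downloading, it helps to flatten the scraped items into a simple list of image URLs. A minimal sketch, assuming the items follow the sample shape above (`url` and `images` fields):

```python
def collect_image_urls(items):
    """Flatten scraped items into (product_url, image_url) pairs."""
    pairs = []
    for item in items:
        for img_url in item.get("images", []):
            pairs.append((item.get("url", ""), img_url))
    return pairs

# Sample item mirroring the output format shown above
sample_items = [
    {
        "title": "Fashion Backpack",
        "url": "https://www.yiwugo.com/item/12345.html",
        "images": [
            "https://cbu01.alicdn.com/img/ibank/a.jpg",
            "https://cbu01.alicdn.com/img/ibank/b.jpg",
        ],
    },
]

print(collect_image_urls(sample_items))
```

Using `.get()` with defaults keeps the loop from crashing on items that are missing a field, which happens occasionally with scraped data.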

Step 2: Batch Download Images

Now let's download all images efficiently using Python:

import os
import requests
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
import hashlib

def download_image(url, product_id, index):
    """Download a single image with error handling"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        # Generate filename from URL hash (avoid duplicates)
        url_hash = hashlib.md5(url.encode()).hexdigest()[:8]
        ext = os.path.splitext(urlparse(url).path)[1] or '.jpg'
        filename = f"{product_id}_{index}_{url_hash}{ext}"

        filepath = os.path.join('images', filename)
        os.makedirs('images', exist_ok=True)

        with open(filepath, 'wb') as f:
            f.write(response.content)

        print(f"✓ Downloaded: {filename}")
        return filepath
    except Exception as e:
        print(f"✗ Failed {url}: {e}")
        return None

def batch_download(products, max_workers=10):
    """Download all images from multiple products in parallel"""
    tasks = []

    for product in products:
        product_id = product.get('id', 'unknown')
        images = product.get('images', [])

        for idx, img_url in enumerate(images):
            tasks.append((img_url, product_id, idx))

    print(f"Downloading {len(tasks)} images from {len(products)} products...")

    with ThreadPoolExecutor(max_workers) as executor:
        results = executor.map(lambda t: download_image(*t), tasks)

    downloaded = [r for r in results if r]
    print(f"\n✓ Downloaded {len(downloaded)}/{len(tasks)} images")
    return downloaded

# Example usage
products = [
    {
        'id': '12345',
        'images': [
            'https://cbu01.alicdn.com/img/ibank/O1CN01abc123.jpg',
            'https://cbu01.alicdn.com/img/ibank/O1CN01def456.jpg'
        ]
    },
    # ... more products
]

batch_download(products)

Key features:

  • Parallel downloads (10 concurrent threads)
  • Error handling (skips failed downloads)
  • Duplicate prevention (URL hash in filename)
  • Progress tracking
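Interrupted downloads can leave truncated files on disk, so it's worth validating each file before moving on to optimization. A small sketch using Pillow's `verify()` (assumes Pillow is installed, as in Step 3):

```python
from PIL import Image

def is_valid_image(filepath):
    """Return True if the file parses as a valid image, False otherwise."""
    try:
        with Image.open(filepath) as img:
            img.verify()  # raises on truncated or corrupt image data
        return True
    except Exception:
        return False
```

You can filter the batch result in one line: `downloaded = [p for p in downloaded if is_valid_image(p)]`.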

Step 3: Optimize Images for Web

Raw images from Yiwugo are often large (1-3 MB each). Let's compress and resize them:

from PIL import Image
import os

def optimize_image(filepath, max_width=800, quality=85):
    """Compress and resize image for web"""
    try:
        img = Image.open(filepath)

        # Convert RGBA to RGB if needed
        if img.mode == 'RGBA':
            img = img.convert('RGB')

        # Resize if too large
        if img.width > max_width:
            ratio = max_width / img.width
            new_height = int(img.height * ratio)
            img = img.resize((max_width, new_height), Image.LANCZOS)

        # Save with compression
        optimized_path = filepath.replace('images/', 'images/optimized_')
        img.save(optimized_path, 'JPEG', quality=quality, optimize=True)

        original_size = os.path.getsize(filepath) / 1024
        optimized_size = os.path.getsize(optimized_path) / 1024
        saved = ((original_size - optimized_size) / original_size) * 100

        print(f"✓ Optimized: {os.path.basename(filepath)} "
              f"({original_size:.1f}KB → {optimized_size:.1f}KB, -{saved:.1f}%)")

        return optimized_path
    except Exception as e:
        print(f"✗ Failed to optimize {filepath}: {e}")
        return None

# Optimize all downloaded images
image_files = [f for f in os.listdir('images') if f.endswith(('.jpg', '.png'))]
for img_file in image_files:
    optimize_image(os.path.join('images', img_file))

Typical results:

  • Original: 1.2 MB → Optimized: 180 KB (85% reduction)
  • Page load time: 3s → 0.5s
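The same Pillow pipeline can also produce the small thumbnails mentioned in the catalog use case below. A sketch; the output directory and 200x200 target are arbitrary choices:

```python
import os
from PIL import Image

def make_thumbnail(filepath, size=(200, 200), out_dir="images/thumbs"):
    """Create a thumbnail that fits within `size`, preserving aspect ratio."""
    os.makedirs(out_dir, exist_ok=True)
    with Image.open(filepath) as img:
        if img.mode == "RGBA":
            img = img.convert("RGB")
        img.thumbnail(size, Image.LANCZOS)  # resizes in place, keeps aspect ratio
        thumb_path = os.path.join(out_dir, os.path.basename(filepath))
        img.save(thumb_path, "JPEG", quality=80)
    return thumb_path
```

Note that `thumbnail()` never upscales and always preserves aspect ratio, which is usually what you want for gallery grids.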

Step 4: Upload to CDN (Optional)

For production e-commerce stores, serve images from a CDN:

Using Cloudflare Images

import requests

def upload_to_cloudflare(filepath, account_id, api_token):
    """Upload image to Cloudflare Images"""
    url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/images/v1"

    headers = {
        'Authorization': f'Bearer {api_token}'
    }

    with open(filepath, 'rb') as f:
        files = {'file': f}
        response = requests.post(url, headers=headers, files=files)

    if response.status_code == 200:
        data = response.json()
        cdn_url = data['result']['variants'][0]
        print(f"✓ Uploaded: {cdn_url}")
        return cdn_url
    else:
        print(f"✗ Upload failed: {response.text}")
        return None

Using AWS S3

import boto3

def upload_to_s3(filepath, bucket_name, s3_key):
    """Upload image to AWS S3"""
    s3 = boto3.client('s3')

    with open(filepath, 'rb') as f:
        s3.upload_fileobj(
            f, 
            bucket_name, 
            s3_key,
            ExtraArgs={'ContentType': 'image/jpeg', 'ACL': 'public-read'}
        )

    cdn_url = f"https://{bucket_name}.s3.amazonaws.com/{s3_key}"
    print(f"✓ Uploaded: {cdn_url}")
    return cdn_url

Complete Pipeline Script

Here's the full pipeline:

import os
import requests
from PIL import Image
from concurrent.futures import ThreadPoolExecutor
import hashlib

def extract_and_optimize_images(products, output_dir='images'):
    """Complete pipeline: download → optimize → return CDN-ready URLs"""

    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(f"{output_dir}/optimized", exist_ok=True)

    results = []

    for product in products:
        product_id = product.get('id', 'unknown')
        images = product.get('images', [])

        product_images = []

        for idx, img_url in enumerate(images):
            # Download
            try:
                response = requests.get(img_url, timeout=10)
                response.raise_for_status()

                url_hash = hashlib.md5(img_url.encode()).hexdigest()[:8]
                filename = f"{product_id}_{idx}_{url_hash}.jpg"
                filepath = os.path.join(output_dir, filename)

                with open(filepath, 'wb') as f:
                    f.write(response.content)

                # Optimize
                img = Image.open(filepath)
                if img.mode == 'RGBA':
                    img = img.convert('RGB')

                if img.width > 800:
                    ratio = 800 / img.width
                    new_height = int(img.height * ratio)
                    img = img.resize((800, new_height), Image.LANCZOS)

                optimized_path = os.path.join(f"{output_dir}/optimized", filename)
                img.save(optimized_path, 'JPEG', quality=85, optimize=True)

                product_images.append({
                    'original': filepath,
                    'optimized': optimized_path,
                    'url': img_url
                })

                print(f"✓ Processed: {filename}")

            except Exception as e:
                print(f"✗ Failed {img_url}: {e}")

        results.append({
            'product_id': product_id,
            'images': product_images
        })

    return results

# Example usage
products = [
    {
        'id': '12345',
        'title': 'Fashion Backpack',
        'images': [
            'https://cbu01.alicdn.com/img/ibank/O1CN01abc123.jpg',
            'https://cbu01.alicdn.com/img/ibank/O1CN01def456.jpg'
        ]
    }
]

results = extract_and_optimize_images(products)
print(f"\n✓ Processed {len(results)} products")

Real-World Use Cases

1. Dropshipping Store Setup

  • Scrape 500 products from Yiwugo
  • Download and optimize all images
  • Upload to Shopify/WooCommerce
  • Time saved: 40 hours → 2 hours

2. Price Comparison Website

  • Extract images from multiple suppliers
  • Standardize image sizes (800x800)
  • Serve from CDN for fast loading
  • Result: 3x faster page load

3. Product Catalog Generation

  • Batch download 10,000+ product images
  • Auto-generate thumbnails (200x200)
  • Create image galleries
  • Storage saved: 15 GB → 2 GB (optimized)

Best Practices

  1. Respect rate limits - Don't hammer Yiwugo's servers (use delays)
  2. Handle errors gracefully - Some images may be deleted or moved
  3. Check image licenses - Ensure you have rights to use the images
  4. Optimize for mobile - Use responsive images (srcset)
  5. Cache aggressively - Set long cache headers on CDN
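The first two practices can be combined into one request wrapper: a delay before every request plus retries with exponential backoff. A sketch; the delay and retry values are arbitrary starting points, not limits documented by Yiwugo:

```python
import time
import requests

def polite_get(url, retries=3, delay=1.0, backoff=2.0, timeout=10):
    """GET with a delay before each request and exponential backoff on failure."""
    wait = delay
    for attempt in range(retries):
        time.sleep(wait)  # throttle every request, not just retries
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            wait *= backoff
    return None
```

Swapping `requests.get` for `polite_get` in the download functions above adds throttling without changing the rest of the pipeline.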

Troubleshooting

Images won't download

  • Check if URL is still valid (Yiwugo sometimes removes old images)
  • Verify your IP isn't blocked (use proxies if needed)
  • Increase timeout (some images are large)

Optimization fails

  • Install Pillow: pip install Pillow
  • Check image format (some WebP images need conversion)
  • Ensure enough disk space

CDN upload errors

  • Verify API credentials
  • Check file size limits (Cloudflare: 10 MB max)
  • Ensure correct content-type headers

Next Steps

  • Automate the pipeline - Run daily to sync new products
  • Add watermarks - Protect your curated image library
  • Generate variants - Create thumbnails, zoom views, etc.
  • Track performance - Monitor CDN bandwidth and costs
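For the watermark idea above, Pillow's `ImageDraw` can stamp text onto each image. A minimal sketch using the default bitmap font; the text, position, and opacity are all choices to tune:

```python
import os
from PIL import Image, ImageDraw

def add_watermark(filepath, text="MY STORE", out_path=None):
    """Stamp semi-transparent text near the bottom-right corner."""
    base = Image.open(filepath).convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    x, y = base.width - 120, base.height - 25  # rough corner placement
    draw.text((x, y), text, fill=(255, 255, 255, 160))  # white, ~60% opacity
    marked = Image.alpha_composite(base, overlay).convert("RGB")
    out_path = out_path or filepath.replace(".jpg", "_wm.jpg")
    marked.save(out_path, "JPEG", quality=90)
    return out_path
```

For production use you'd likely load a TrueType font with `ImageFont.truetype()` and compute the position from the rendered text size instead of hard-coding offsets.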

Try It Yourself

Get started with the Yiwugo Scraper on Apify Store. The free tier includes 100 scraping runs per month.

GitHub Example: yiwugo-scraper-example


Questions? Drop a comment below or check out these related articles:

📦 Also check out: DHgate Scraper — Extract DHgate product data for dropshipping research.

