wfgsss

How to Extract Product Images from Yiwugo.com for Your E-commerce Store

If you're sourcing wholesale products from Yiwugo.com, you'll quickly realize that manually downloading product images is tedious and time-consuming. Each product listing has multiple high-resolution images, and copying them one by one doesn't scale.

In this tutorial, I'll show you how to automate product image extraction from Yiwugo.com and prepare them for your e-commerce store.

Why Extract Images from Yiwugo?

Yiwugo.com (义乌购) is China's largest wholesale marketplace, with millions of products. When you're building an e-commerce store or dropshipping business, you need:

  • High-quality product images for your listings
  • Multiple angles of each product
  • Batch processing to handle hundreds of products
  • CDN optimization for fast loading

Manual downloading doesn't work at scale. Automation does.

What You'll Learn

  • How to scrape product image URLs from Yiwugo
  • How to batch download images efficiently
  • How to optimize images for web (compression, resizing)
  • How to integrate with CDN services (optional)

Prerequisites

  • Basic Python knowledge
  • An Apify account (free tier works)
  • Node.js installed (for the scraper)

Step 1: Get Product Image URLs

First, we need to extract image URLs from Yiwugo product pages. The easiest way is to use the Yiwugo Scraper on Apify Store.

Using the Scraper

// Run via Apify API
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

// Top-level await isn't available in CommonJS, so wrap the calls in an async IIFE
(async () => {
    const input = {
        startUrls: [
            { url: 'https://www.yiwugo.com/search?keyword=backpack' }
        ],
        maxItems: 50,
    };

    const run = await client.actor('jungle_intertwining/yiwugo-scraper').call(input);
    const { items } = await client.dataset(run.defaultDatasetId).listItems();

    console.log(`Scraped ${items.length} products`);
})();

Sample Output

Each product includes an images array:

{
  "title": "Fashion Backpack",
  "price": "¥45.00",
  "images": [
    "https://cbu01.alicdn.com/img/ibank/O1CN01abc123_1234567890.jpg",
    "https://cbu01.alicdn.com/img/ibank/O1CN01def456_0987654321.jpg",
    "https://cbu01.alicdn.com/img/ibank/O1CN01ghi789_1122334455.jpg"
  ],
  "url": "https://www.yiwugo.com/item/12345.html"
}
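Before downloading, it helps to flatten the scraped items into a simple list of image URLs. A minimal sketch, assuming the items follow the sample shape above (`url` and `images` fields):

```python
def collect_image_urls(items):
    """Flatten scraped items into (product_url, image_url) pairs."""
    pairs = []
    for item in items:
        for img_url in item.get("images", []):
            pairs.append((item.get("url", ""), img_url))
    return pairs

# Sample item mirroring the output format shown above
sample_items = [
    {
        "title": "Fashion Backpack",
        "url": "https://www.yiwugo.com/item/12345.html",
        "images": [
            "https://cbu01.alicdn.com/img/ibank/a.jpg",
            "https://cbu01.alicdn.com/img/ibank/b.jpg",
        ],
    },
]

print(collect_image_urls(sample_items))
```

Using `.get()` with defaults keeps the loop from crashing on items that are missing a field, which happens occasionally with scraped data.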

Step 2: Batch Download Images

Now let's download all images efficiently using Python:

import os
import requests
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
import hashlib

def download_image(url, product_id, index):
    """Download a single image with error handling"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        # Generate filename from URL hash (avoid duplicates)
        url_hash = hashlib.md5(url.encode()).hexdigest()[:8]
        ext = os.path.splitext(urlparse(url).path)[1] or '.jpg'
        filename = f"{product_id}_{index}_{url_hash}{ext}"

        filepath = os.path.join('images', filename)
        os.makedirs('images', exist_ok=True)

        with open(filepath, 'wb') as f:
            f.write(response.content)

        print(f"✓ Downloaded: {filename}")
        return filepath
    except Exception as e:
        print(f"✗ Failed {url}: {e}")
        return None

def batch_download(products, max_workers=10):
    """Download all images from multiple products in parallel"""
    tasks = []

    for product in products:
        product_id = product.get('id', 'unknown')
        images = product.get('images', [])

        for idx, img_url in enumerate(images):
            tasks.append((img_url, product_id, idx))

    print(f"Downloading {len(tasks)} images from {len(products)} products...")

    with ThreadPoolExecutor(max_workers) as executor:
        results = executor.map(lambda t: download_image(*t), tasks)

    downloaded = [r for r in results if r]
    print(f"\n✓ Downloaded {len(downloaded)}/{len(tasks)} images")
    return downloaded

# Example usage
products = [
    {
        'id': '12345',
        'images': [
            'https://cbu01.alicdn.com/img/ibank/O1CN01abc123.jpg',
            'https://cbu01.alicdn.com/img/ibank/O1CN01def456.jpg'
        ]
    },
    # ... more products
]

batch_download(products)

Key features:

  • Parallel downloads (10 concurrent threads)
  • Error handling (skips failed downloads)
  • Duplicate prevention (URL hash in filename)
  • Progress tracking
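Interrupted downloads can leave truncated files on disk, so it's worth validating each file before moving on to optimization. A small sketch using Pillow's `verify()` (assumes Pillow is installed, as in Step 3):

```python
from PIL import Image

def is_valid_image(filepath):
    """Return True if the file parses as a valid image, False otherwise."""
    try:
        with Image.open(filepath) as img:
            img.verify()  # raises on truncated or corrupt image data
        return True
    except Exception:
        return False
```

You can filter the batch result in one line: `downloaded = [p for p in downloaded if is_valid_image(p)]`.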

Step 3: Optimize Images for Web

Raw images from Yiwugo are often large (1-3 MB each). Let's compress and resize them:

from PIL import Image
import os

def optimize_image(filepath, max_width=800, quality=85):
    """Compress and resize image for web"""
    try:
        img = Image.open(filepath)

        # Convert RGBA to RGB if needed
        if img.mode == 'RGBA':
            img = img.convert('RGB')

        # Resize if too large
        if img.width > max_width:
            ratio = max_width / img.width
            new_height = int(img.height * ratio)
            img = img.resize((max_width, new_height), Image.LANCZOS)

        # Save with compression
        optimized_path = filepath.replace('images/', 'images/optimized_')
        img.save(optimized_path, 'JPEG', quality=quality, optimize=True)

        original_size = os.path.getsize(filepath) / 1024
        optimized_size = os.path.getsize(optimized_path) / 1024
        saved = ((original_size - optimized_size) / original_size) * 100

        print(f"✓ Optimized: {os.path.basename(filepath)} "
              f"({original_size:.1f}KB → {optimized_size:.1f}KB, -{saved:.1f}%)")

        return optimized_path
    except Exception as e:
        print(f"✗ Failed to optimize {filepath}: {e}")
        return None

# Optimize all downloaded images
image_files = [f for f in os.listdir('images') if f.endswith(('.jpg', '.png'))]
for img_file in image_files:
    optimize_image(os.path.join('images', img_file))

Typical results:

  • Original: 1.2 MB → Optimized: 180 KB (85% reduction)
  • Page load time: 3s → 0.5s
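The same Pillow pipeline can also produce the small thumbnails mentioned in the catalog use case below. A sketch; the output directory and 200x200 target are arbitrary choices:

```python
import os
from PIL import Image

def make_thumbnail(filepath, size=(200, 200), out_dir="images/thumbs"):
    """Create a thumbnail that fits within `size`, preserving aspect ratio."""
    os.makedirs(out_dir, exist_ok=True)
    with Image.open(filepath) as img:
        if img.mode == "RGBA":
            img = img.convert("RGB")
        img.thumbnail(size, Image.LANCZOS)  # resizes in place, keeps aspect ratio
        thumb_path = os.path.join(out_dir, os.path.basename(filepath))
        img.save(thumb_path, "JPEG", quality=80)
    return thumb_path
```

Note that `thumbnail()` never upscales and always preserves aspect ratio, which is usually what you want for gallery grids.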

Step 4: Upload to CDN (Optional)

For production e-commerce stores, serve images from a CDN:

Using Cloudflare Images

import requests

def upload_to_cloudflare(filepath, account_id, api_token):
    """Upload image to Cloudflare Images"""
    url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/images/v1"

    headers = {
        'Authorization': f'Bearer {api_token}'
    }

    with open(filepath, 'rb') as f:
        files = {'file': f}
        response = requests.post(url, headers=headers, files=files)

    if response.status_code == 200:
        data = response.json()
        cdn_url = data['result']['variants'][0]
        print(f"✓ Uploaded: {cdn_url}")
        return cdn_url
    else:
        print(f"✗ Upload failed: {response.text}")
        return None

Using AWS S3

import boto3

def upload_to_s3(filepath, bucket_name, s3_key):
    """Upload image to AWS S3"""
    s3 = boto3.client('s3')

    with open(filepath, 'rb') as f:
        s3.upload_fileobj(
            f, 
            bucket_name, 
            s3_key,
            ExtraArgs={'ContentType': 'image/jpeg', 'ACL': 'public-read'}
        )

    cdn_url = f"https://{bucket_name}.s3.amazonaws.com/{s3_key}"
    print(f"✓ Uploaded: {cdn_url}")
    return cdn_url

Complete Pipeline Script

Here's the full pipeline:

import os
import requests
from PIL import Image
from concurrent.futures import ThreadPoolExecutor
import hashlib

def extract_and_optimize_images(products, output_dir='images'):
    """Complete pipeline: download → optimize → return CDN-ready URLs"""

    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(f"{output_dir}/optimized", exist_ok=True)

    results = []

    for product in products:
        product_id = product.get('id', 'unknown')
        images = product.get('images', [])

        product_images = []

        for idx, img_url in enumerate(images):
            # Download
            try:
                response = requests.get(img_url, timeout=10)
                response.raise_for_status()

                url_hash = hashlib.md5(img_url.encode()).hexdigest()[:8]
                filename = f"{product_id}_{idx}_{url_hash}.jpg"
                filepath = os.path.join(output_dir, filename)

                with open(filepath, 'wb') as f:
                    f.write(response.content)

                # Optimize
                img = Image.open(filepath)
                if img.mode == 'RGBA':
                    img = img.convert('RGB')

                if img.width > 800:
                    ratio = 800 / img.width
                    new_height = int(img.height * ratio)
                    img = img.resize((800, new_height), Image.LANCZOS)

                optimized_path = os.path.join(f"{output_dir}/optimized", filename)
                img.save(optimized_path, 'JPEG', quality=85, optimize=True)

                product_images.append({
                    'original': filepath,
                    'optimized': optimized_path,
                    'url': img_url
                })

                print(f"✓ Processed: {filename}")

            except Exception as e:
                print(f"✗ Failed {img_url}: {e}")

        results.append({
            'product_id': product_id,
            'images': product_images
        })

    return results

# Example usage
products = [
    {
        'id': '12345',
        'title': 'Fashion Backpack',
        'images': [
            'https://cbu01.alicdn.com/img/ibank/O1CN01abc123.jpg',
            'https://cbu01.alicdn.com/img/ibank/O1CN01def456.jpg'
        ]
    }
]

results = extract_and_optimize_images(products)
print(f"\n✓ Processed {len(results)} products")

Real-World Use Cases

1. Dropshipping Store Setup

  • Scrape 500 products from Yiwugo
  • Download and optimize all images
  • Upload to Shopify/WooCommerce
  • Time saved: 40 hours → 2 hours

2. Price Comparison Website

  • Extract images from multiple suppliers
  • Standardize image sizes (800x800)
  • Serve from CDN for fast loading
  • Result: 3x faster page load

3. Product Catalog Generation

  • Batch download 10,000+ product images
  • Auto-generate thumbnails (200x200)
  • Create image galleries
  • Storage saved: 15 GB → 2 GB (optimized)

Best Practices

  1. Respect rate limits - Don't hammer Yiwugo's servers (use delays)
  2. Handle errors gracefully - Some images may be deleted or moved
  3. Check image licenses - Ensure you have rights to use the images
  4. Optimize for mobile - Use responsive images (srcset)
  5. Cache aggressively - Set long cache headers on CDN
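The first two practices can be combined into one request wrapper: a delay before every request plus retries with exponential backoff. A sketch; the delay and retry values are arbitrary starting points, not limits documented by Yiwugo:

```python
import time
import requests

def polite_get(url, retries=3, delay=1.0, backoff=2.0, timeout=10):
    """GET with a delay before each request and exponential backoff on failure."""
    wait = delay
    for attempt in range(retries):
        time.sleep(wait)  # throttle every request, not just retries
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            wait *= backoff
    return None
```

Swapping `requests.get` for `polite_get` in the download functions above adds throttling without changing the rest of the pipeline.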

Troubleshooting

Images won't download

  • Check if URL is still valid (Yiwugo sometimes removes old images)
  • Verify your IP isn't blocked (use proxies if needed)
  • Increase timeout (some images are large)

Optimization fails

  • Install Pillow: pip install Pillow
  • Check image format (some WebP images need conversion)
  • Ensure enough disk space

CDN upload errors

  • Verify API credentials
  • Check file size limits (Cloudflare: 10 MB max)
  • Ensure correct content-type headers

Next Steps

  • Automate the pipeline - Run daily to sync new products
  • Add watermarks - Protect your curated image library
  • Generate variants - Create thumbnails, zoom views, etc.
  • Track performance - Monitor CDN bandwidth and costs
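For the watermark idea above, Pillow's `ImageDraw` can stamp text onto each image. A minimal sketch using the default bitmap font; the text, position, and opacity are all choices to tune:

```python
import os
from PIL import Image, ImageDraw

def add_watermark(filepath, text="MY STORE", out_path=None):
    """Stamp semi-transparent text near the bottom-right corner."""
    base = Image.open(filepath).convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    x, y = base.width - 120, base.height - 25  # rough corner placement
    draw.text((x, y), text, fill=(255, 255, 255, 160))  # white, ~60% opacity
    marked = Image.alpha_composite(base, overlay).convert("RGB")
    out_path = out_path or filepath.replace(".jpg", "_wm.jpg")
    marked.save(out_path, "JPEG", quality=90)
    return out_path
```

For production use you'd likely load a TrueType font with `ImageFont.truetype()` and compute the position from the rendered text size instead of hard-coding offsets.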

Try It Yourself

Get started with the Yiwugo Scraper on Apify Store. The free tier includes 100 scraping runs per month.

GitHub Example: yiwugo-scraper-example


Questions? Drop a comment below or check out these related articles:

📦 Also check out: DHgate Scraper — Extract DHgate product data for dropshipping research.

