Bulk Downloading 1688 Product Images: A Lesson in Maxing Out Bandwidth


Our purchasing system suddenly went down. Monitoring showed outbound bandwidth pinned at the 500Mbps limit, causing every external API request to time out. The culprit was a script that bulk-downloaded 1688 product images: it launched 200 concurrent download threads without any rate limiting and completely saturated our shared bandwidth.

Problem Scenario: A Brutal Approach to Image Downloading

We needed to sync approximately 3,000 1688 products daily, including main images and detail images, averaging 5 images per product. The initial implementation was straightforward but crude:

// Old brute-force download script
function downloadAllImages($productIds) {
    foreach ($productIds as $id) {
        $images = get1688ProductImages($id); // Call 1688 API to get image URL list
        foreach ($images as $url) {
            $content = file_get_contents($url); // Synchronous blocking download
            file_put_contents("/images/$id/".basename($url), $content);
        }
    }
}

This script had 3 critical issues:

No concurrency control: Each file_get_contents call is synchronous, but nothing capped how many copies of this loop ran in parallel, so 200 download threads were issuing HTTP requests at the same time
No retry mechanism: If an image download failed (e.g., network jitter), the script simply moved on, leaving product images missing
No bandwidth limiting: 200 concurrent downloads averaging 2MB each put about 400MB in flight at once, while a 500Mbps link moves only about 62.5MB per second

The immediate consequence: all other business operations (including order processing and logistics queries) were interrupted for 18 minutes. We had to manually kill the process and spend 2 hours re-downloading the failed images.
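The figures above mix megabits and megabytes, so it helps to put them in one unit. The following is a back-of-the-envelope sketch, assuming decimal units (1 Mbps = 1,000,000 bits per second, 2MB = 2,000,000 bytes); the helper names are made up for illustration.

```php
<?php
// Convert a link speed in megabits per second to bytes per second.
// Assumes decimal units: 1 Mbps = 1,000,000 bits/s.
function mbpsToBytesPerSecond(float $mbps): float {
    return $mbps * 1000000 / 8;
}

// Seconds needed to move $bytes over a link of $mbps with nothing else on it.
function secondsToTransfer(float $bytes, float $mbps): float {
    return $bytes / mbpsToBytesPerSecond($mbps);
}

$linkMbps = 500;                 // shared link from the incident
$inFlight = 200 * 2 * 1000000;   // 200 concurrent downloads of ~2MB each

printf("Link capacity: %.1f MB/s\n", mbpsToBytesPerSecond($linkMbps) / 1000000);
printf("Time to drain 200 downloads: %.1f s\n", secondsToTransfer($inFlight, $linkMbps));
```

Under these assumptions, a single wave of 200 downloads needs the entire link for about 6.4 seconds, which leaves no room for API traffic while the script keeps the queue full.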

Solution: A Downloader with Rate Limiting and Queue

We redesigned the downloader using Guzzle's async capabilities, adding bandwidth control and retry mechanisms.

Step one: Use Guzzle's concurrent request pool with a maximum concurrency limit.
Step two: Implement a simple token bucket algorithm for bandwidth control.
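Before the full class, here is the token bucket idea from step two on its own. This is a minimal sketch rather than the production code: the TokenBucket class, its reserve method, and the explicit $now parameter are assumptions made so the logic can be tested without real sleeping, and it assumes PHP 8.0 or later for constructor promotion and named arguments.

```php
<?php
// Minimal token bucket, separated from any HTTP code so it can be tested alone.
// Class and method names are illustrative, not taken from the original script.
final class TokenBucket {
    private float $tokens;
    private float $lastRefill;

    public function __construct(
        private float $ratePerSecond,  // tokens (bytes) added per second
        private float $capacity,       // maximum burst size in tokens
        float $now
    ) {
        $this->tokens = $capacity;     // start with a full bucket
        $this->lastRefill = $now;
    }

    // Returns how many seconds the caller should wait after spending $amount.
    // Time is passed in so the behaviour is deterministic in tests.
    public function reserve(float $amount, float $now): float {
        $elapsed = max(0.0, $now - $this->lastRefill);
        $this->tokens = min($this->capacity, $this->tokens + $elapsed * $this->ratePerSecond);
        $this->lastRefill = $now;

        // The balance may go negative: that debt is what the caller waits off.
        $this->tokens -= $amount;
        return $this->tokens >= 0 ? 0.0 : -$this->tokens / $this->ratePerSecond;
    }
}

$bucket = new TokenBucket(ratePerSecond: 100.0, capacity: 100.0, now: 0.0);
$first  = $bucket->reserve(60.0, 0.0);   // fits in the initial burst
$second = $bucket->reserve(60.0, 0.0);   // bucket is 20 tokens short
printf("first wait: %.1f s, second wait: %.1f s\n", $first, $second);
```

A caller would sleep for the returned number of seconds, for example with usleep((int) round($wait * 1e6)). Letting the balance go negative is one possible design: the time spent waiting repays the debt on the next refill, so it is never counted twice.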

// New rate-limited downloader
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

class ThrottledImageDownloader {
    private $client;
    private $concurrency = 10; // Maximum concurrency
    private $bandwidthLimit = 50 * 1024 * 1024; // Bandwidth limit in bytes per second (50MB/s)
    private $tokens;
    private $lastRefillTime;

    public function __construct() {
        $this->client = new Client(['timeout' => 30]);
        $this->tokens = $this->bandwidthLimit;
        $this->lastRefillTime = microtime(true);
    }

    // Token bucket algorithm for bandwidth control
    private function consumeBandwidth($bytes) {
        $now = microtime(true);
        $elapsed = $now - $this->lastRefillTime;
        $this->tokens = min($this->bandwidthLimit, $this->tokens + $elapsed * $this->bandwidthLimit);
        $this->lastRefillTime = $now;

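        if ($this->tokens < $bytes) {
            // Not enough tokens: wait for the shortfall to refill
            $sleepTime = ($bytes - $this->tokens) / $this->bandwidthLimit;
            usleep((int) round($sleepTime * 1e6));
            $this->tokens = 0;
            // Restart the clock so the time just slept is not credited again
            $this->lastRefillTime = microtime(true);
        } else {
            $this->tokens -= $bytes;
        }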
    }

    public function downloadBatch(array $imageUrls) {
        $requests = function ($urls) {
            foreach ($urls as $url) {
                yield new Request('GET', $url);
            }
        };

        $pool = new Pool($this->client, $requests($imageUrls), [
            'concurrency' => $this->concurrency,
            'fulfilled' => function ($response, $index) use ($imageUrls) {
                $content = $response->getBody()->getContents();
                $this->consumeBandwidth(strlen($content));
                // Save image logic
                $filename = basename($imageUrls[$index]);
                file_put_contents("/images/$filename", $content);
            },
            'rejected' => function ($reason, $index) use ($imageUrls) {
                // Retry on failure, up to 3 times
                $this->retryDownload($imageUrls[$index], 3);
            },
        ]);

        $pool->promise()->wait();
    }

    private function retryDownload($url, $maxRetries) {
        for ($i = 0; $i < $maxRetries; $i++) {
            try {
                $response = $this->client->get($url);
                $content = $response->getBody()->getContents();
                $this->consumeBandwidth(strlen($content)); // Retries count against the bandwidth limit too
                $filename = basename($url);
                file_put_contents("/images/$filename", $content);
                return;
            } catch (\Exception $e) {
                if ($i === $maxRetries - 1) {
                    // Out of retries: log the failure and give up
                    error_log("Failed to download $url after $maxRetries attempts");
                    return;
                }
                sleep(2 ** $i); // Exponential backoff: 1s, 2s, ... between attempts
            }
        }
    }
}
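One detail worth noting: the old script saved files under /images/$id/, while the class above writes every file to /images/$filename, so same-named images from different products would overwrite each other, and basename($url) keeps any query string in the filename. If per-product directories are still wanted, a helper along these lines could build the path. The function name and the md5 fallback are assumptions for illustration, not part of the original downloader.

```php
<?php
// Hypothetical helper: build a per-product local path from an image URL.
// Uses only the URL path, so "?size=800" style query strings never end up in filenames.
function localImagePath(string $baseDir, string $productId, string $url): string {
    $path = parse_url($url, PHP_URL_PATH);
    $name = is_string($path) ? basename($path) : '';
    if ($name === '') {
        // No usable filename in the URL: fall back to a stable hash of the URL.
        $name = md5($url);
    }
    return rtrim($baseDir, '/') . '/' . $productId . '/' . $name;
}

echo localImagePath('/images', '12345', 'https://example.com/img/a.jpg?size=800'), "\n";
// prints "/images/12345/a.jpg"
```

The target directory still has to exist before file_put_contents is called, for example via mkdir with the recursive flag.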

Key improvements:

Concurrency control: concurrency set to 10, preventing instant bandwidth saturation
Token bucket rate limiting: The consumeBandwidth method paces completed downloads so the average rate stays at or below 50MB per second
Exponential backoff retry: Wait 2^i seconds between attempts, with up to 3 retries per failed download
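The 2^i schedule can be written out explicitly. The sketch below is illustrative only: the function names, the one-second base, and the 60-second cap are assumptions, and it skips the wait after the final attempt because nothing follows it.

```php
<?php
// Delay in seconds before retry number $attempt (0-based), doubling each time.
// $maxDelay is an assumed safety cap, not something the downloader defines.
function backoffDelay(int $attempt, int $baseSeconds = 1, int $maxDelay = 60): int {
    return min($maxDelay, $baseSeconds * (2 ** $attempt));
}

// Delays slept between $maxRetries attempts: none is needed after the last one.
function backoffSchedule(int $maxRetries): array {
    $delays = [];
    for ($i = 0; $i < $maxRetries - 1; $i++) {
        $delays[] = backoffDelay($i);
    }
    return $delays;
}

print_r(backoffSchedule(3)); // delays of 1 and 2 seconds between three attempts
```

With three attempts, the worst case adds 3 seconds of waiting per failed image on top of the request timeouts.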

Lessons Learned: From Bandwidth Disaster to Stable Sync

After deploying the new downloader, we compared it against the old script on the same workload. The old script took 12 minutes to download images for 3,000 products (~15,000 images) but consumed the full 500Mbps of bandwidth. The new script took 18 minutes for the same task, with bandwidth holding steady at 45-50Mbps and zero impact on other services.

Further optimization: Incremental downloads and caching

We also added a simple file size check to avoid re-downloading existing images:

// Incremental check - only download new images
function needsDownload($url, $localPath) {
    if (!file_exists($localPath)) {
        return true;
    }
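    // get_headers() sends a GET by default, so force HEAD to avoid fetching the body
    $context = stream_context_create(['http' => ['method' => 'HEAD', 'timeout' => 10]]);
    $headers = get_headers($url, true, $context);
    if ($headers === false) {
        return true; // Headers unavailable: fall back to downloading
    }
    $headers = array_change_key_case($headers, CASE_LOWER);
    $remoteSize = $headers['content-length'] ?? 0;
    if (is_array($remoteSize)) {
        $remoteSize = end($remoteSize); // After redirects, use the final response's value
    }
    return (int) $remoteSize !== filesize($localPath);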
}

This optimization reduced daily incremental sync time from 18 minutes to 3-5 minutes, since only about 10% of product images are updated daily.
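The selection step behind the incremental sync can be sketched as a pure function. This is a simplified illustration: the function name and the two size maps are hypothetical inputs, standing in for sizes collected from header checks and from the local disk.

```php
<?php
// Given remote sizes (url => bytes, e.g. from HEAD responses) and local sizes
// (url => bytes for files already on disk), return the URLs that need fetching.
// Both maps are hypothetical inputs for this sketch.
function selectStaleUrls(array $remoteSizes, array $localSizes): array {
    $stale = [];
    foreach ($remoteSizes as $url => $remoteSize) {
        $isMissing = !array_key_exists($url, $localSizes);
        if ($isMissing || (int) $localSizes[$url] !== (int) $remoteSize) {
            $stale[] = $url;
        }
    }
    return $stale;
}

$remote = [
    'https://example.com/a.jpg' => 1200,
    'https://example.com/b.jpg' => 900,
    'https://example.com/c.jpg' => 500,
];
$local = [
    'https://example.com/a.jpg' => 1200,
    'https://example.com/b.jpg' => 850,
];
print_r(selectStaleUrls($remote, $local)); // b.jpg changed size, c.jpg is new
```

A size comparison misses an image that was replaced by one of exactly the same length. Where the server sends ETag or Last-Modified headers, comparing those would be a stricter check.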

Summary: When bulk downloading third-party resources, never assume that "faster is better." Brute-force concurrent downloads may seem efficient, but they often sacrifice system stability. Rate limiting, retry mechanisms, and incremental checks are the three core elements of a reliable download system. If your image sync script is still running file_get_contents without protection, it's time for an upgrade.

Has your system encountered similar issues when handling large volumes of external resource downloads? Feel free to share your solutions.

About the Author: Building cross-border purchasing solutions with taocarts — a daigou system for 1688/Taobao purchasing, order management, and international shipping.