Muhammad Ikramullah Khan

Downloading Files and Images with Scrapy: The Complete Beginner's Guide

When I first tried to download images with Scrapy, I followed the official documentation step by step. It didn't work. The images field stayed empty, the downloads folder was empty, and I had no idea why.

After hours of debugging, I discovered all the tiny details the documentation skips. The field names that MUST be exact. The Pillow library that needs installing. The pipeline that silently fails if configured wrong.

Let me save you that frustration. This is the complete guide to downloading files and images in Scrapy, including all the stuff nobody tells you.


The Big Picture: How File/Image Downloads Work

Before we dive into code, understand how this works:

  1. Your spider scrapes URLs (not the actual files, just their URLs)
  2. You put URLs in a special field (with an exact name Scrapy expects)
  3. Scrapy's pipeline downloads them (automatically, in the background)
  4. Results appear in another field (also with an exact name)

You don't manually download anything. The pipeline handles it all. You just provide URLs.


Part 1: Downloading Files (PDFs, ZIPs, Documents, etc.)

Step 1: Install Nothing (FilesPipeline is Built-In)

Good news! For files, you don't need extra libraries. FilesPipeline comes with Scrapy.

Step 2: Enable the Pipeline

Edit settings.py:

# settings.py

# Enable the FilesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1
}

# Set where files will be saved
FILES_STORE = 'downloads'  # Creates a 'downloads' folder

That's it. Pipeline enabled. Download location set.
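One note on that path: FILES_STORE is just a string. A relative value like 'downloads' ends up relative to wherever you run scrapy crawl, so if you want the files somewhere predictable, use an absolute path. (Scrapy can also write to cloud storage here, but that needs extra dependencies and credentials; the commented line below is a sketch, not a recipe.)

# settings.py

# Relative paths resolve against the directory you run the crawl from
FILES_STORE = '/data/scrapy/downloads'
# FILES_STORE = 's3://my-bucket/downloads/'  # needs botocore + AWS credentials configured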

Step 3: Create Your Item

The field names MUST be exact:

# items.py
import scrapy

class DocumentItem(scrapy.Item):
    file_urls = scrapy.Field()  # MUST be named 'file_urls' (plural!)
    files = scrapy.Field()       # MUST be named 'files' (plural!)

    # Your other fields
    title = scrapy.Field()
    date = scrapy.Field()

Critical: The names file_urls and files are NOT suggestions. They're requirements. Use different names and nothing works.

Step 4: Scrape URLs

In your spider, scrape file URLs and put them in file_urls:

# spider.py
import scrapy
from myproject.items import DocumentItem

class FileSpider(scrapy.Spider):
    name = 'files'
    start_urls = ['https://example.com/documents']

    def parse(self, response):
        for doc in response.css('.document'):
            item = DocumentItem()

            # Scrape the file URL
            file_url = doc.css('a.download::attr(href)').get()

            # MUST be a list, even for one URL!
            item['file_urls'] = [response.urljoin(file_url)]

            # Other data
            item['title'] = doc.css('h2::text').get()
            item['date'] = doc.css('.date::text').get()

            yield item

Key points:

  • file_urls MUST be a list (even for one file!)
  • Use response.urljoin() for relative URLs
  • Just scrape the URLs, don't download anything yourself

Step 5: What Happens Automatically

When you yield that item:

  1. Scrapy sees file_urls field
  2. Downloads each file in the list
  3. Saves files to downloads/full/
  4. Populates the files field with results

The files field gets filled with something like this:

{
    'files': [
        {
            'url': 'https://example.com/doc.pdf',
            'path': 'full/2a3b4c5d6e7f8a9b.pdf',  # SHA1-hashed filename (shortened here)
            'checksum': '0d1e2f3a4b5c6d7e',       # MD5 checksum of the file contents (shortened)
            'status': 'downloaded'
        }
    ]
}

Step 6: Access Downloaded Files

def parse(self, response):
    item = DocumentItem()
    item['file_urls'] = [...]

    yield item  # FilesPipeline downloads files

    # Later, in a pipeline, you can access the results

# In pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        if 'files' in item:
            for file_info in item['files']:
                downloaded_path = file_info['path']
                original_url = file_info['url']
                spider.logger.info(f'Downloaded {original_url} to {downloaded_path}')

        return item

Part 2: Downloading Images

Images work almost identically to files, with a few differences.

Step 1: Install Pillow

This is required! ImagesPipeline needs Pillow:

pip install Pillow

Without Pillow, ImagesPipeline fails silently. No errors. Just no downloads.
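If you're not sure whether the install took, check it from the same Python environment your project uses. The package installs as Pillow but imports as PIL:

# Run this in the same environment as your Scrapy project
import PIL
print(PIL.__version__)  # an ImportError here means ImagesPipeline can't work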

Step 2: Enable ImagesPipeline

Edit settings.py:

# settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}

# Set where images will be saved
IMAGES_STORE = 'images'  # Creates an 'images' folder

Step 3: Create Your Item

Different field names for images:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    image_urls = scrapy.Field()  # MUST be 'image_urls' (plural!)
    images = scrapy.Field()       # MUST be 'images' (plural!)

    # Your other fields
    name = scrapy.Field()
    price = scrapy.Field()

Notice: It's image_urls and images, NOT file_urls and files.

Step 4: Scrape Image URLs

# spider.py
import scrapy
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            item = ProductItem()

            # Scrape image URLs
            img_urls = product.css('img::attr(src)').getall()

            # Convert to absolute URLs and store as list
            item['image_urls'] = [response.urljoin(url) for url in img_urls]

            # Other data
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('.price::text').get()

            yield item

Step 5: What You Get

Images are automatically:

  • Downloaded
  • Converted to JPEG
  • Converted to RGB mode
  • Saved with SHA1-hashed filenames

The images field gets populated:

{
    'images': [
        {
            'url': 'https://example.com/product.jpg',
            'path': 'full/a1b2c3d4e5f6.jpg',
            'checksum': 'a1b2c3d4e5f6',
            'status': 'downloaded'
        }
    ]
}

The Stuff Nobody Tells You

Problem #1: Field Names MUST Be Exact

This won't work:

item['file_url'] = [url]   # WRONG (singular!)
item['image_url'] = [url]  # WRONG (singular!)
item['urls'] = [url]        # WRONG (too vague!)

Must be:

item['file_urls'] = [url]   # For FilesPipeline
item['image_urls'] = [url]  # For ImagesPipeline

Change the names and the pipeline ignores your item completely. No errors. Just nothing happens.
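(If you're genuinely stuck with different field names, say on an item you can't change, Scrapy does let you point the pipelines at other fields through settings. The field names below, pdf_urls/pdfs and photo_urls/photos, are made up for illustration; the default names used throughout this guide are still the simplest option.)

# settings.py

# Redirect the pipelines to custom item fields
FILES_URLS_FIELD = 'pdf_urls'
FILES_RESULT_FIELD = 'pdfs'
IMAGES_URLS_FIELD = 'photo_urls'
IMAGES_RESULT_FIELD = 'photos'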

Problem #2: URLs MUST Be a List

Even for one URL, it MUST be a list:

# WRONG
item['image_urls'] = 'https://example.com/image.jpg'

# RIGHT
item['image_urls'] = ['https://example.com/image.jpg']

# Also RIGHT (multiple images)
item['image_urls'] = [url1, url2, url3]

Problem #3: Relative URLs Break Everything

# WRONG (relative URLs fail)
item['image_urls'] = ['/static/images/product.jpg']

# RIGHT (absolute URLs)
item['image_urls'] = [response.urljoin('/static/images/product.jpg')]

Always use response.urljoin() to convert relative URLs to absolute.

Problem #4: ImagesPipeline Needs Pillow

Without Pillow installed:

  • No errors show up
  • Pipeline appears enabled
  • But nothing downloads

Always install Pillow:

pip install Pillow

Problem #5: Files Are Renamed with SHA1 Hashes

Your file: product_manual.pdf
Scrapy saves as: full/2a3b4c5d6e7f8a9b.pdf

This prevents conflicts but makes files hard to find. More on fixing this later.
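If you want to predict where a file will land, the default name is (as far as I can tell from the current FilesPipeline) the SHA1 hex digest of the download URL plus the original extension, inside full/:

# Sketch: reproduce the default hashed filename for a given URL
import hashlib

url = 'https://example.com/product_manual.pdf'
print('full/' + hashlib.sha1(url.encode('utf-8')).hexdigest() + '.pdf')
# prints full/<40 hex characters>.pdf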

Problem #6: The Pipeline Runs BEFORE Other Pipelines

In the examples above, FilesPipeline and ImagesPipeline are registered with order 1. Lower numbers run earlier, so these pipelines process the item before your custom pipelines do.

This means:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,  # Runs first
    'myproject.pipelines.MyPipeline': 300,         # Runs after images download
}

Your pipeline gets the item AFTER images are already downloaded.


Customizing File/Image Names

By default, files get hashed names. Here's how to use real filenames:

Custom FilesPipeline with Original Names

# pipelines.py
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Extract the original filename from URL
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Optionally add item info to filename
        if item and 'title' in item:
            title = item['title'].replace(' ', '_').replace('/', '-')[:50]  # replace spaces/slashes, truncate
            filename = f"{title}_{filename}"

        return filename

Enable your custom pipeline in settings:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyFilesPipeline': 1
}

FILES_STORE = 'downloads'

Custom ImagesPipeline with Original Names

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from urllib.parse import urlparse
import os

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Create folders per product
        if item and 'name' in item:
            product_name = item['name'].replace(' ', '_').replace('/', '-')[:30]
            return f"{product_name}/{filename}"

        return filename

Enable it:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 1
}

IMAGES_STORE = 'images'

Generating Thumbnails (Images Only)

ImagesPipeline can auto-generate thumbnails:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}

IMAGES_STORE = 'images'

# Generate thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),
    'medium': (150, 150),
    'large': (300, 300)
}

This creates:

images/
    full/
        a1b2c3.jpg          (original)
    thumbs/
        small/
            a1b2c3.jpg      (50x50)
        medium/
            a1b2c3.jpg      (150x150)
        large/
            a1b2c3.jpg      (300x300)

Thumbnails maintain aspect ratio. If image is 800x600 and thumbnail is (100, 100), you get 100x75.
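One catch: the images field on your item only records the full-size path, so if you need a thumbnail you have to build its path yourself. A small sketch, assuming IMAGES_STORE = 'images' and the default thumbs/<size>/<hash>.jpg layout:

import os

def thumb_path(image_info, size='small', store='images'):
    # image_info is one entry from item['images'], e.g. {'path': 'full/a1b2c3.jpg', ...}
    filename = os.path.basename(image_info['path'])
    return os.path.join(store, 'thumbs', size, filename)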


Filtering Images by Size

Download only images above certain dimensions:

# settings.py
IMAGES_STORE = 'images'

# Minimum dimensions (pixels)
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

Images smaller than 100x100 get skipped automatically.


Avoiding Re-Downloads (Caching)

By default, Scrapy avoids re-downloading files that were downloaded recently:

# settings.py

# For files (default: 90 days)
FILES_EXPIRES = 90

# For images (default: 90 days)
IMAGES_EXPIRES = 90

Set to 0 to always re-download:

IMAGES_EXPIRES = 0  # Always download fresh

Handling Download Failures

Sometimes downloads fail. Here's how to handle it:

# pipelines.py
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples

        image_paths = []
        for success, file_info in results:
            if success:
                image_paths.append(file_info['path'])
            else:
                # Log the failure (on failure, file_info is a Failure object)
                info.spider.logger.warning(f'Image download failed: {file_info}')

        if not image_paths:
            # No images downloaded successfully
            raise DropItem('No images downloaded')

        # Store paths in a custom item field
        item['image_paths'] = image_paths

        return item

This:

  • Checks which downloads succeeded
  • Drops items with no successful downloads
  • Stores the successful paths in a custom image_paths field (declare it on your item)

Using Both Files and Images Pipelines

You can use both at the same time:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2  # Different priority
}

FILES_STORE = 'downloads'
IMAGES_STORE = 'images'

In your item:

class ProductItem(scrapy.Item):
    # For documents
    file_urls = scrapy.Field()
    files = scrapy.Field()

    # For images
    image_urls = scrapy.Field()
    images = scrapy.Field()

    # Other fields
    name = scrapy.Field()

Both pipelines work independently.
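A single parse callback can feed both pipelines from the same item. The selectors below (a.manual, img.photo) are invented for illustration:

def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()

    # One item can carry document URLs and image URLs at the same time
    pdf_url = response.css('a.manual::attr(href)').get()
    img_url = response.css('img.photo::attr(src)').get()

    item['file_urls'] = [response.urljoin(pdf_url)] if pdf_url else []
    item['image_urls'] = [response.urljoin(img_url)] if img_url else []

    yield item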


Complete Working Example

Here's a full spider that downloads product images:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()  # Custom field for paths

# spider.py
import scrapy
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for product in response.css('.product_pod'):
            item = ProductItem()

            # Scrape data
            item['name'] = product.css('h3 a::attr(title)').get()
            item['price'] = product.css('.price_color::text').get()

            # Scrape image URL
            img_url = product.css('img::attr(src)').get()
            item['image_urls'] = [response.urljoin(img_url)]

            yield item

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
import os
from urllib.parse import urlparse

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use original filename
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Create folder per product
        if item and item.get('name'):
            folder = item['name'].replace(' ', '_')[:30]
            folder = ''.join(c for c in folder if c.isalnum() or c in ('_', '-'))
            return f"{folder}/{filename}"

        return filename

    def item_completed(self, results, item, info):
        image_paths = []

        for success, file_info in results:
            if success:
                image_paths.append(file_info['path'])

        if not image_paths:
            raise DropItem('No images downloaded')

        item['image_paths'] = image_paths

        return item

# settings.py
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Enable custom images pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 1
}

# Images settings
IMAGES_STORE = 'images'
IMAGES_THUMBS = {
    'small': (100, 100),
    'large': (300, 300)
}
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50
IMAGES_EXPIRES = 30  # 30 days

Run it:

scrapy crawl products

Check the images/ folder. You'll see subfolders for each product with their images and thumbnails.
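If you'd rather verify from code than click through folders, a quick walk over the store directory lists everything that was saved (assuming IMAGES_STORE = 'images' as above):

import os

for root, dirs, files in os.walk('images'):
    for name in files:
        print(os.path.join(root, name))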


Debugging When Nothing Downloads

If images/files aren't downloading:

Check #1: Is Pillow Installed? (Images Only)

pip install Pillow

Check #2: Are Field Names Exact?

# Must be exactly these names
item['file_urls'] = [...]   # For files
item['image_urls'] = [...]  # For images

Check #3: Is STORE Setting Set?

# settings.py
FILES_STORE = 'downloads'    # For files
IMAGES_STORE = 'images'      # For images

Check #4: Are URLs Absolute?

# Use urljoin for relative URLs
item['image_urls'] = [response.urljoin(url)]

Check #5: Are URLs in a List?

# Must be a list
item['image_urls'] = [url]  # Not just url

Check #6: Check Logs

Look for errors in the output:

scrapy crawl myspider --loglevel=DEBUG

Search for "FilesPipeline" or "ImagesPipeline" in the logs.
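You can also set the log level in settings instead of passing the flag every time:

# settings.py
LOG_LEVEL = 'DEBUG'  # same effect as --loglevel=DEBUG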


Performance Tips

Tip #1: Limit Concurrent Downloads

# settings.py
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # Don't overwhelm servers

Tip #2: Add Delays

# settings.py
DOWNLOAD_DELAY = 1  # 1 second between requests

Tip #3: Filter Before Downloading

Don't download unnecessary files:

def parse(self, response):
    for product in response.css('.product'):
        img_url = product.css('img::attr(src)').get()

        # Skip missing or placeholder images
        if not img_url or 'placeholder' in img_url or 'no-image' in img_url:
            continue

        item = ProductItem()
        item['image_urls'] = [response.urljoin(img_url)]
        yield item

Quick Reference

Files

# Enable
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloads'

# Item fields (exact names!)
file_urls = scrapy.Field()  # Input: list of URLs
files = scrapy.Field()       # Output: download results

# Scrape
item['file_urls'] = [response.urljoin(url)]

Images

# Enable (install Pillow first!)
pip install Pillow

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'

# Item fields (exact names!)
image_urls = scrapy.Field()  # Input: list of URLs
images = scrapy.Field()       # Output: download results

# Scrape
item['image_urls'] = [response.urljoin(url) for url in urls]

# Optional: thumbnails
IMAGES_THUMBS = {'small': (100, 100)}

# Optional: size filter
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

Summary

Key takeaways:

  • Install Pillow for images: pip install Pillow
  • Field names MUST be exact: file_urls/files or image_urls/images
  • URLs MUST be in a list
  • URLs MUST be absolute (use response.urljoin())
  • Set STORE location in settings
  • Files get renamed with SHA1 hashes (override file_path() to fix)
  • ImagesPipeline can generate thumbnails automatically
  • Both pipelines avoid re-downloading recent files

Start with the basic setup. Get it working. Then customize filenames and add features as needed.

Happy downloading! 🕷️
