Muhammad Ikramullah Khan

Downloading Files and Images with Scrapy: The Complete Beginner's Guide

When I first tried to download images with Scrapy, I followed the official documentation step by step. It didn't work. The images field stayed empty, the downloads folder was empty, and I had no idea why.

After hours of debugging, I discovered all the tiny details the documentation skips. The field names that MUST be exact. The Pillow library that needs installing. The pipeline that silently fails if configured wrong.

Let me save you that frustration. This is the complete guide to downloading files and images in Scrapy, including all the stuff nobody tells you.


The Big Picture: How File/Image Downloads Work

Before we dive into code, understand how this works:

  1. Your spider scrapes URLs (not the actual files, just their URLs)
  2. You put URLs in a special field (with an exact name Scrapy expects)
  3. Scrapy's pipeline downloads them (automatically, in the background)
  4. Results appear in another field (also with an exact name)

You don't manually download anything. The pipeline handles it all. You just provide URLs.


Part 1: Downloading Files (PDFs, ZIPs, Documents, etc.)

Step 1: Install Nothing (FilesPipeline is Built-In)

Good news! For files, you don't need extra libraries. FilesPipeline comes with Scrapy.

Step 2: Enable the Pipeline

Edit settings.py:

# settings.py

# Enable the FilesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1
}

# Set where files will be saved
FILES_STORE = 'downloads'  # Creates a 'downloads' folder

That's it. Pipeline enabled. Download location set.
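One note on that path: FILES_STORE is just a string. A relative value like 'downloads' ends up relative to wherever you run scrapy crawl, so if you want the files somewhere predictable, use an absolute path. (Scrapy can also write to cloud storage here, but that needs extra dependencies and credentials; the commented line below is a sketch, not a recipe.)

# settings.py

# Relative paths resolve against the directory you run the crawl from
FILES_STORE = '/data/scrapy/downloads'
# FILES_STORE = 's3://my-bucket/downloads/'  # needs botocore + AWS credentials configured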

Step 3: Create Your Item

The field names MUST be exact:

# items.py
import scrapy

class DocumentItem(scrapy.Item):
    file_urls = scrapy.Field()  # MUST be named 'file_urls' (plural!)
    files = scrapy.Field()       # MUST be named 'files' (plural!)

    # Your other fields
    title = scrapy.Field()
    date = scrapy.Field()

Critical: The names file_urls and files are NOT suggestions. They're requirements. Use different names and nothing works.

Step 4: Scrape URLs

In your spider, scrape file URLs and put them in file_urls:

# spider.py
import scrapy
from myproject.items import DocumentItem

class FileSpider(scrapy.Spider):
    name = 'files'
    start_urls = ['https://example.com/documents']

    def parse(self, response):
        for doc in response.css('.document'):
            item = DocumentItem()

            # Scrape the file URL
            file_url = doc.css('a.download::attr(href)').get()

            # MUST be a list, even for one URL!
            item['file_urls'] = [response.urljoin(file_url)]

            # Other data
            item['title'] = doc.css('h2::text').get()
            item['date'] = doc.css('.date::text').get()

            yield item

Key points:

  • file_urls MUST be a list (even for one file!)
  • Use response.urljoin() for relative URLs
  • Just scrape the URLs, don't download anything yourself

Step 5: What Happens Automatically

When you yield that item:

  1. Scrapy sees file_urls field
  2. Downloads each file in the list
  3. Saves files to downloads/full/
  4. Populates the files field with results

The files field gets filled with something like this:

{
    'files': [
        {
            'url': 'https://example.com/doc.pdf',
            'path': 'full/2a3b4c5d6e7f8a9b.pdf',  # SHA1-hashed filename (shortened here)
            'checksum': '0d1e2f3a4b5c6d7e',       # MD5 checksum of the file contents (shortened)
            'status': 'downloaded'
        }
    ]
}

Step 6: Access Downloaded Files

def parse(self, response):
    item = DocumentItem()
    item['file_urls'] = [...]

    yield item  # FilesPipeline downloads files

    # Later, in a pipeline, you can access the results

# In pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        if 'files' in item:
            for file_info in item['files']:
                downloaded_path = file_info['path']
                original_url = file_info['url']
                spider.logger.info(f'Downloaded {original_url} to {downloaded_path}')

        return item

Part 2: Downloading Images

Images work almost identically to files, with a few differences.

Step 1: Install Pillow

This is required! ImagesPipeline needs Pillow:

pip install Pillow

Without Pillow, ImagesPipeline fails silently. No errors. Just no downloads.
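If you're not sure whether the install took, check it from the same Python environment your project uses. The package installs as Pillow but imports as PIL:

# Run this in the same environment as your Scrapy project
import PIL
print(PIL.__version__)  # an ImportError here means ImagesPipeline can't work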

Step 2: Enable ImagesPipeline

Edit settings.py:

# settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}

# Set where images will be saved
IMAGES_STORE = 'images'  # Creates an 'images' folder

Step 3: Create Your Item

Different field names for images:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    image_urls = scrapy.Field()  # MUST be 'image_urls' (plural!)
    images = scrapy.Field()       # MUST be 'images' (plural!)

    # Your other fields
    name = scrapy.Field()
    price = scrapy.Field()

Notice: It's image_urls and images, NOT file_urls and files.

Step 4: Scrape Image URLs

# spider.py
import scrapy
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            item = ProductItem()

            # Scrape image URLs
            img_urls = product.css('img::attr(src)').getall()

            # Convert to absolute URLs and store as list
            item['image_urls'] = [response.urljoin(url) for url in img_urls]

            # Other data
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('.price::text').get()

            yield item

Step 5: What You Get

Images are automatically:

  • Downloaded
  • Converted to JPEG
  • Converted to RGB mode
  • Saved with SHA1-hashed filenames

The images field gets populated:

{
    'images': [
        {
            'url': 'https://example.com/product.jpg',
            'path': 'full/a1b2c3d4e5f6.jpg',
            'checksum': 'a1b2c3d4e5f6',
            'status': 'downloaded'
        }
    ]
}

The Stuff Nobody Tells You

Problem #1: Field Names MUST Be Exact

This won't work:

item['file_url'] = [url]   # WRONG (singular!)
item['image_url'] = [url]  # WRONG (singular!)
item['urls'] = [url]        # WRONG (too vague!)

Must be:

item['file_urls'] = [url]   # For FilesPipeline
item['image_urls'] = [url]  # For ImagesPipeline

Change the names and the pipeline ignores your item completely. No errors. Just nothing happens.
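(If you're genuinely stuck with different field names, say on an item you can't change, Scrapy does let you point the pipelines at other fields through settings. The field names below, pdf_urls/pdfs and photo_urls/photos, are made up for illustration; the default names used throughout this guide are still the simplest option.)

# settings.py

# Redirect the pipelines to custom item fields
FILES_URLS_FIELD = 'pdf_urls'
FILES_RESULT_FIELD = 'pdfs'
IMAGES_URLS_FIELD = 'photo_urls'
IMAGES_RESULT_FIELD = 'photos'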

Problem #2: URLs MUST Be a List

Even for one URL, it MUST be a list:

# WRONG
item['image_urls'] = 'https://example.com/image.jpg'

# RIGHT
item['image_urls'] = ['https://example.com/image.jpg']

# Also RIGHT (multiple images)
item['image_urls'] = [url1, url2, url3]

Problem #3: Relative URLs Break Everything

# WRONG (relative URLs fail)
item['image_urls'] = ['/static/images/product.jpg']

# RIGHT (absolute URLs)
item['image_urls'] = [response.urljoin('/static/images/product.jpg')]

Always use response.urljoin() to convert relative URLs to absolute.

Problem #4: ImagesPipeline Needs Pillow

Without Pillow installed:

  • No errors show up
  • Pipeline appears enabled
  • But nothing downloads

Always install Pillow:

pip install Pillow

Problem #5: Files Are Renamed with SHA1 Hashes

Your file: product_manual.pdf
Scrapy saves as: full/2a3b4c5d6e7f8a9b.pdf

This prevents conflicts but makes files hard to find. More on fixing this later.
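If you want to predict where a file will land, the default name is (as far as I can tell from the current FilesPipeline) the SHA1 hex digest of the download URL plus the original extension, inside full/:

# Sketch: reproduce the default hashed filename for a given URL
import hashlib

url = 'https://example.com/product_manual.pdf'
print('full/' + hashlib.sha1(url.encode('utf-8')).hexdigest() + '.pdf')
# prints full/<40 hex characters>.pdf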

Problem #6: The Pipeline Runs BEFORE Other Pipelines

In the examples above, FilesPipeline and ImagesPipeline are registered with order 1. Lower numbers run earlier, so these pipelines process the item before your custom pipelines do.

This means:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,  # Runs first
    'myproject.pipelines.MyPipeline': 300,         # Runs after images download
}

Your pipeline gets the item AFTER images are already downloaded.


Customizing File/Image Names

By default, files get hashed names. Here's how to use real filenames:

Custom FilesPipeline with Original Names

# pipelines.py
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Extract the original filename from URL
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Optionally add item info to filename
        if item and 'title' in item:
            title = item['title'].replace(' ', '_').replace('/', '-')[:50]  # replace spaces/slashes, truncate
            filename = f"{title}_{filename}"

        return filename

Enable your custom pipeline in settings:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyFilesPipeline': 1
}

FILES_STORE = 'downloads'

Custom ImagesPipeline with Original Names

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from urllib.parse import urlparse
import os

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Create folders per product
        if item and 'name' in item:
            product_name = item['name'].replace(' ', '_').replace('/', '-')[:30]
            return f"{product_name}/{filename}"

        return filename

Enable it:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 1
}

IMAGES_STORE = 'images'

Generating Thumbnails (Images Only)

ImagesPipeline can auto-generate thumbnails:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}

IMAGES_STORE = 'images'

# Generate thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),
    'medium': (150, 150),
    'large': (300, 300)
}

This creates:

images/
    full/
        a1b2c3.jpg          (original)
    thumbs/
        small/
            a1b2c3.jpg      (50x50)
        medium/
            a1b2c3.jpg      (150x150)
        large/
            a1b2c3.jpg      (300x300)

Thumbnails maintain aspect ratio. If image is 800x600 and thumbnail is (100, 100), you get 100x75.
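One catch: the images field on your item only records the full-size path, so if you need a thumbnail you have to build its path yourself. A small sketch, assuming IMAGES_STORE = 'images' and the default thumbs/<size>/<hash>.jpg layout:

import os

def thumb_path(image_info, size='small', store='images'):
    # image_info is one entry from item['images'], e.g. {'path': 'full/a1b2c3.jpg', ...}
    filename = os.path.basename(image_info['path'])
    return os.path.join(store, 'thumbs', size, filename)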


Filtering Images by Size

Download only images above certain dimensions:

# settings.py
IMAGES_STORE = 'images'

# Minimum dimensions (pixels)
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

Images smaller than 100x100 get skipped automatically.


Avoiding Re-Downloads (Caching)

By default, Scrapy avoids re-downloading files that were downloaded recently:

# settings.py

# For files (default: 90 days)
FILES_EXPIRES = 90

# For images (default: 90 days)
IMAGES_EXPIRES = 90

Set to 0 to always re-download:

IMAGES_EXPIRES = 0  # Always download fresh

Handling Download Failures

Sometimes downloads fail. Here's how to handle it:

# pipelines.py
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples

        image_paths = []
        for success, file_info in results:
            if success:
                image_paths.append(file_info['path'])
            else:
                # Log the failure (on failure, file_info is a Failure object)
                info.spider.logger.warning(f'Image download failed: {file_info}')

        if not image_paths:
            # No images downloaded successfully
            raise DropItem('No images downloaded')

        # Store paths in a custom item field
        item['image_paths'] = image_paths

        return item

This:

  • Checks which downloads succeeded
  • Drops items with no successful downloads
  • Stores the successful paths in a custom image_paths field (declare it on your item)

Using Both Files and Images Pipelines

You can use both at the same time:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2  # Different priority
}

FILES_STORE = 'downloads'
IMAGES_STORE = 'images'

In your item:

class ProductItem(scrapy.Item):
    # For documents
    file_urls = scrapy.Field()
    files = scrapy.Field()

    # For images
    image_urls = scrapy.Field()
    images = scrapy.Field()

    # Other fields
    name = scrapy.Field()

Both pipelines work independently.
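A single parse callback can feed both pipelines from the same item. The selectors below (a.manual, img.photo) are invented for illustration:

def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()

    # One item can carry document URLs and image URLs at the same time
    pdf_url = response.css('a.manual::attr(href)').get()
    img_url = response.css('img.photo::attr(src)').get()

    item['file_urls'] = [response.urljoin(pdf_url)] if pdf_url else []
    item['image_urls'] = [response.urljoin(img_url)] if img_url else []

    yield item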


Complete Working Example

Here's a full spider that downloads product images:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()  # Custom field for paths

# spider.py
import scrapy
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for product in response.css('.product_pod'):
            item = ProductItem()

            # Scrape data
            item['name'] = product.css('h3 a::attr(title)').get()
            item['price'] = product.css('.price_color::text').get()

            # Scrape image URL
            img_url = product.css('img::attr(src)').get()
            item['image_urls'] = [response.urljoin(img_url)]

            yield item

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
import os
from urllib.parse import urlparse

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use original filename
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Create folder per product
        if item and item.get('name'):
            folder = item['name'].replace(' ', '_')[:30]
            folder = ''.join(c for c in folder if c.isalnum() or c in ('_', '-'))
            return f"{folder}/{filename}"

        return filename

    def item_completed(self, results, item, info):
        image_paths = []

        for success, file_info in results:
            if success:
                image_paths.append(file_info['path'])

        if not image_paths:
            raise DropItem('No images downloaded')

        item['image_paths'] = image_paths

        return item

# settings.py
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Enable custom images pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 1
}

# Images settings
IMAGES_STORE = 'images'
IMAGES_THUMBS = {
    'small': (100, 100),
    'large': (300, 300)
}
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50
IMAGES_EXPIRES = 30  # 30 days

Run it:

scrapy crawl products

Check the images/ folder. You'll see subfolders for each product with their images and thumbnails.
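If you'd rather verify from code than click through folders, a quick walk over the store directory lists everything that was saved (assuming IMAGES_STORE = 'images' as above):

import os

for root, dirs, files in os.walk('images'):
    for name in files:
        print(os.path.join(root, name))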


Debugging When Nothing Downloads

If images/files aren't downloading:

Check #1: Is Pillow Installed? (Images Only)

pip install Pillow

Check #2: Are Field Names Exact?

# Must be exactly these names
item['file_urls'] = [...]   # For files
item['image_urls'] = [...]  # For images

Check #3: Is STORE Setting Set?

# settings.py
FILES_STORE = 'downloads'    # For files
IMAGES_STORE = 'images'      # For images

Check #4: Are URLs Absolute?

# Use urljoin for relative URLs
item['image_urls'] = [response.urljoin(url)]

Check #5: Are URLs in a List?

# Must be a list
item['image_urls'] = [url]  # Not just url

Check #6: Check Logs

Look for errors in the output:

scrapy crawl myspider --loglevel=DEBUG

Search for "FilesPipeline" or "ImagesPipeline" in the logs.
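You can also set the log level in settings instead of passing the flag every time:

# settings.py
LOG_LEVEL = 'DEBUG'  # same effect as --loglevel=DEBUG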


Performance Tips

Tip #1: Limit Concurrent Downloads

# settings.py
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # Don't overwhelm servers

Tip #2: Add Delays

# settings.py
DOWNLOAD_DELAY = 1  # 1 second between requests

Tip #3: Filter Before Downloading

Don't download unnecessary files:

def parse(self, response):
    for product in response.css('.product'):
        img_url = product.css('img::attr(src)').get()

        # Skip missing or placeholder images
        if not img_url or 'placeholder' in img_url or 'no-image' in img_url:
            continue

        item = ProductItem()
        item['image_urls'] = [response.urljoin(img_url)]
        yield item

Quick Reference

Files

# Enable
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloads'

# Item fields (exact names!)
file_urls = scrapy.Field()  # Input: list of URLs
files = scrapy.Field()       # Output: download results

# Scrape
item['file_urls'] = [response.urljoin(url)]

Images

# Enable (install Pillow first!)
pip install Pillow

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'

# Item fields (exact names!)
image_urls = scrapy.Field()  # Input: list of URLs
images = scrapy.Field()       # Output: download results

# Scrape
item['image_urls'] = [response.urljoin(url) for url in urls]

# Optional: thumbnails
IMAGES_THUMBS = {'small': (100, 100)}

# Optional: size filter
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

Summary

Key takeaways:

  • Install Pillow for images: pip install Pillow
  • Field names MUST be exact: file_urls/files or image_urls/images
  • URLs MUST be in a list
  • URLs MUST be absolute (use response.urljoin())
  • Set STORE location in settings
  • Files get renamed with SHA1 hashes (override file_path() to fix)
  • ImagesPipeline can generate thumbnails automatically
  • Both pipelines avoid re-downloading recent files

Start with the basic setup. Get it working. Then customize filenames and add features as needed.

Happy downloading! 🕷️
