When I first tried to download images with Scrapy, I followed the official documentation step by step. It didn't work. The images field stayed empty, the downloads folder was empty, and I had no idea why.
After hours of debugging, I discovered all the tiny details the documentation skips. The field names that MUST be exact. The Pillow library that needs installing. The pipeline that silently fails if configured wrong.
Let me save you that frustration. This is the complete guide to downloading files and images in Scrapy, including all the stuff nobody tells you.
The Big Picture: How File/Image Downloads Work
Before we dive into code, understand how this works:
- Your spider scrapes URLs (not the actual files, just their URLs)
- You put URLs in a special field (with an exact name Scrapy expects)
- Scrapy's pipeline downloads them (automatically, in the background)
- Results appear in another field (also with an exact name)
You don't manually download anything. The pipeline handles it all. You just provide URLs.
Part 1: Downloading Files (PDFs, ZIPs, Documents, etc.)
Step 1: Install Nothing (FilesPipeline is Built-In)
Good news! For files, you don't need extra libraries. FilesPipeline comes with Scrapy.
Step 2: Enable the Pipeline
Edit settings.py:
# settings.py
# Enable the FilesPipeline
ITEM_PIPELINES = {
'scrapy.pipelines.files.FilesPipeline': 1
}
# Set where files will be saved
FILES_STORE = 'downloads' # Creates a 'downloads' folder
That's it. Pipeline enabled. Download location set.
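By the way, FILES_STORE doesn't have to be a relative folder name. An absolute path works, and Scrapy can also write straight to cloud storage if you give it a URI. A hedged sketch (the path and bucket below are placeholders, and S3 storage needs botocore installed):

# settings.py
# Relative paths are resolved from wherever you run 'scrapy crawl'
FILES_STORE = '/data/scrapy/downloads'        # absolute path also works
# FILES_STORE = 's3://my-bucket/downloads/'   # placeholder bucket; requires botocore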
Step 3: Create Your Item
The field names MUST be exact:
# items.py
import scrapy

class DocumentItem(scrapy.Item):
    file_urls = scrapy.Field()   # MUST be named 'file_urls' (plural!)
    files = scrapy.Field()       # MUST be named 'files' (plural!)
    # Your other fields
    title = scrapy.Field()
    date = scrapy.Field()
Critical: The names file_urls and files are NOT suggestions. They're requirements. Use different names and nothing works.
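If you truly can't use those names (say another pipeline already claims them), Scrapy does let you remap them via settings. A minimal sketch, assuming document_urls and documents are alternate fields you've declared on your item (ImagesPipeline has matching IMAGES_URLS_FIELD / IMAGES_RESULT_FIELD settings):

# settings.py
# Point FilesPipeline at non-default item fields (field names here are hypothetical)
FILES_URLS_FIELD = 'document_urls'
FILES_RESULT_FIELD = 'documents'

Unless you set these, the defaults apply and the names must match exactly.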
Step 4: Scrape URLs
In your spider, scrape file URLs and put them in file_urls:
# spider.py
import scrapy
from myproject.items import DocumentItem

class FileSpider(scrapy.Spider):
    name = 'files'
    start_urls = ['https://example.com/documents']

    def parse(self, response):
        for doc in response.css('.document'):
            item = DocumentItem()

            # Scrape the file URL
            file_url = doc.css('a.download::attr(href)').get()

            # MUST be a list, even for one URL!
            item['file_urls'] = [response.urljoin(file_url)]

            # Other data
            item['title'] = doc.css('h2::text').get()
            item['date'] = doc.css('.date::text').get()

            yield item
Key points:
- file_urls MUST be a list (even for one file!)
- Use response.urljoin() for relative URLs
- Just scrape the URLs, don't download anything yourself
Step 5: What Happens Automatically
When you yield that item:
- Scrapy sees the file_urls field
- Downloads each file in the list
- Saves files to downloads/full/
- Populates the files field with results
The files field gets filled with something like this:
{
    'files': [
        {
            'url': 'https://example.com/doc.pdf',
            'path': 'full/2a3b4c5d6e7f8a9b.pdf',   # SHA1 hash of the URL (shortened here)
            'checksum': '9f8e7d6c5b4a3120',        # MD5 hash of the file contents (shortened here)
            'status': 'downloaded'
        }
    ]
}
Step 6: Access Downloaded Files
def parse(self, response):
    item = DocumentItem()
    item['file_urls'] = [...]
    yield item  # FilesPipeline downloads files

# Later, in a pipeline, you can access the results
# In pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        if 'files' in item:
            for file_info in item['files']:
                downloaded_path = file_info['path']
                original_url = file_info['url']
                spider.logger.info(f'Downloaded {original_url} to {downloaded_path}')
        return item
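One detail the snippet above glosses over: MyPipeline only sees a populated files field if it runs after FilesPipeline, so register it with a later (higher-numbered) priority. A sketch of the settings this assumes (more on ordering in Problem #6 below):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,   # downloads first
    'myproject.pipelines.MyPipeline': 300,       # then sees item['files']
}
FILES_STORE = 'downloads'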
Part 2: Downloading Images
Images work almost identically to files, with a few differences.
Step 1: Install Pillow
This is required! ImagesPipeline needs Pillow:
pip install Pillow
Without Pillow, ImagesPipeline fails silently. No errors. Just no downloads.
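A quick sanity check that Pillow is importable in the same environment Scrapy runs in:

python -c "import PIL; print(PIL.__version__)"

If that prints a version number, ImagesPipeline has what it needs.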
Step 2: Enable ImagesPipeline
Edit settings.py:
# settings.py
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1
}
# Set where images will be saved
IMAGES_STORE = 'images' # Creates an 'images' folder
Step 3: Create Your Item
Different field names for images:
# items.py
import scrapy

class ProductItem(scrapy.Item):
    image_urls = scrapy.Field()   # MUST be 'image_urls' (plural!)
    images = scrapy.Field()       # MUST be 'images' (plural!)
    # Your other fields
    name = scrapy.Field()
    price = scrapy.Field()
Notice: It's image_urls and images, NOT file_urls and files.
Step 4: Scrape Image URLs
# spider.py
import scrapy
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            item = ProductItem()

            # Scrape image URLs
            img_urls = product.css('img::attr(src)').getall()

            # Convert to absolute URLs and store as list
            item['image_urls'] = [response.urljoin(url) for url in img_urls]

            # Other data
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('.price::text').get()

            yield item
Step 5: What You Get
Images are automatically:
- Downloaded
- Converted to JPEG
- Converted to RGB mode
- Saved with SHA1-hashed filenames
The images field gets populated:
{
    'images': [
        {
            'url': 'https://example.com/product.jpg',
            'path': 'full/a1b2c3d4e5f6.jpg',   # SHA1 hash of the URL (shortened here)
            'checksum': '0f1e2d3c4b5a6978',    # MD5 hash of the image contents (shortened here)
            'status': 'downloaded'
        }
    ]
}
The Stuff Nobody Tells You
Problem #1: Field Names MUST Be Exact
This won't work:
item['file_url'] = [url] # WRONG (singular!)
item['image_url'] = [url] # WRONG (singular!)
item['urls'] = [url] # WRONG (too vague!)
Must be:
item['file_urls'] = [url] # For FilesPipeline
item['image_urls'] = [url] # For ImagesPipeline
Change the names and the pipeline ignores your item completely. No errors. Just nothing happens.
Problem #2: URLs MUST Be a List
Even for one URL, it MUST be a list:
# WRONG
item['image_urls'] = 'https://example.com/image.jpg'
# RIGHT
item['image_urls'] = ['https://example.com/image.jpg']
# Also RIGHT (multiple images)
item['image_urls'] = [url1, url2, url3]
Problem #3: Relative URLs Break Everything
# WRONG (relative URLs fail)
item['image_urls'] = ['/static/images/product.jpg']
# RIGHT (absolute URLs)
item['image_urls'] = [response.urljoin('/static/images/product.jpg')]
Always use response.urljoin() to convert relative URLs to absolute.
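If you're curious what response.urljoin() actually does, it's essentially urllib.parse.urljoin() applied against the page's own URL (HtmlResponse also honors a <base> tag). A small illustration:

from urllib.parse import urljoin

page_url = 'https://example.com/products/page1.html'

# Relative path: resolved against the page URL
print(urljoin(page_url, '/static/images/product.jpg'))
# -> https://example.com/static/images/product.jpg

# Already-absolute URLs pass through unchanged
print(urljoin(page_url, 'https://cdn.example.com/img.jpg'))
# -> https://cdn.example.com/img.jpg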
Problem #4: ImagesPipeline Needs Pillow
Without Pillow installed:
- No errors show up
- Pipeline appears enabled
- But nothing downloads
Always install Pillow:
pip install Pillow
Problem #5: Files Are Renamed with SHA1 Hashes
Your file: product_manual.pdf
Scrapy saves as: 2a3b4c5d6e7f8a9b.pdf (a SHA1 hash of the file's URL)
This prevents conflicts but makes files hard to find. More on fixing this later.
Problem #6: The Pipeline Runs BEFORE Other Pipelines
FilesPipeline and ImagesPipeline have priority 1 (very high). They run before your custom pipelines.
This means:
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1, # Runs first
'myproject.pipelines.MyPipeline': 300, # Runs after images download
}
Your pipeline gets the item AFTER images are already downloaded.
Customizing File/Image Names
By default, files get hashed names. Here's how to use real filenames:
Custom FilesPipeline with Original Names
# pipelines.py
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Extract the original filename from the URL
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Optionally add item info to the filename
        if item and 'title' in item:
            title = item['title'].replace(' ', '_')[:50]  # Sanitize
            filename = f"{title}_{filename}"

        return filename
Enable your custom pipeline in settings:
# settings.py
ITEM_PIPELINES = {
'myproject.pipelines.MyFilesPipeline': 1
}
FILES_STORE = 'downloads'
Custom ImagesPipeline with Original Names
# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from urllib.parse import urlparse
import os

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Create folders per product
        if item and 'name' in item:
            product_name = item['name'].replace(' ', '_')[:30]
            return f"{product_name}/{filename}"

        return filename
Enable it:
# settings.py
ITEM_PIPELINES = {
'myproject.pipelines.MyImagesPipeline': 1
}
IMAGES_STORE = 'images'
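One caveat with original filenames: two different URLs can end in the same basename (think image.jpg on every product page), and the later download will overwrite the earlier one in IMAGES_STORE. A hedged variation that keeps names readable but unique is to append a short hash of the URL:

# pipelines.py
import hashlib
import os
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        url_path = urlparse(request.url).path
        base, ext = os.path.splitext(os.path.basename(url_path))
        # A short hash of the full URL keeps same-named files from colliding
        short_hash = hashlib.sha1(request.url.encode()).hexdigest()[:8]
        return f"{base}_{short_hash}{ext}"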
Generating Thumbnails (Images Only)
ImagesPipeline can auto-generate thumbnails:
# settings.py
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = 'images'
# Generate thumbnails
IMAGES_THUMBS = {
'small': (50, 50),
'medium': (150, 150),
'large': (300, 300)
}
This creates:
images/
    full/
        a1b2c3.jpg        (original)
    thumbs/
        small/
            a1b2c3.jpg    (50x50)
        medium/
            a1b2c3.jpg    (150x150)
        large/
            a1b2c3.jpg    (300x300)
Thumbnails maintain aspect ratio. If image is 800x600 and thumbnail is (100, 100), you get 100x75.
Filtering Images by Size
Download only images above certain dimensions:
# settings.py
IMAGES_STORE = 'images'
# Minimum dimensions (pixels)
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100
Images smaller than 100x100 get skipped automatically.
Avoiding Re-Downloads (Caching)
By default, Scrapy avoids re-downloading files that were downloaded recently:
# settings.py
# For files (default: 90 days)
FILES_EXPIRES = 90
# For images (default: 90 days)
IMAGES_EXPIRES = 90
Set to 0 to always re-download:
IMAGES_EXPIRES = 0 # Always download fresh
Handling Download Failures
Sometimes downloads fail. Here's how to handle it:
# pipelines.py
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples
        image_paths = []
        for success, file_info in results:
            if success:
                image_paths.append(file_info['path'])
            else:
                # Log the failure (file_info is a Failure object here)
                info.spider.logger.warning(f'Image download failed: {file_info}')

        if not image_paths:
            # No images downloaded successfully
            raise DropItem('No images downloaded')

        # Store paths in the item
        item['image_paths'] = image_paths
        return item
This:
- Checks which downloads succeeded
- Drops items with no successful downloads
- Stores successful paths in a custom field
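One gotcha: item['image_paths'] = ... only works if the field is declared, since scrapy.Item raises a KeyError when you assign to an undeclared field. Make sure the item class includes it:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()  # custom field filled in item_completed()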
Using Both Files and Images Pipelines
You can use both at the same time:
# settings.py
ITEM_PIPELINES = {
'scrapy.pipelines.files.FilesPipeline': 1,
'scrapy.pipelines.images.ImagesPipeline': 2 # Different priority
}
FILES_STORE = 'downloads'
IMAGES_STORE = 'images'
In your item:
class ProductItem(scrapy.Item):
    # For documents
    file_urls = scrapy.Field()
    files = scrapy.Field()
    # For images
    image_urls = scrapy.Field()
    images = scrapy.Field()
    # Other fields
    name = scrapy.Field()
Both pipelines work independently.
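In the spider you simply fill both URL fields on the same item; each pipeline picks up its own field and ignores the other. A minimal sketch, assuming the CSS selectors below match your page:

def parse(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()

        # ImagesPipeline reads image_urls
        img_url = product.css('img::attr(src)').get()
        item['image_urls'] = [response.urljoin(img_url)] if img_url else []

        # FilesPipeline reads file_urls (e.g. a spec-sheet PDF)
        pdf_url = product.css('a.spec-sheet::attr(href)').get()
        item['file_urls'] = [response.urljoin(pdf_url)] if pdf_url else []

        yield item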
Complete Working Example
Here's a full spider that downloads product images:
# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()  # Custom field for paths
# spider.py
import scrapy
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for product in response.css('.product_pod'):
            item = ProductItem()

            # Scrape data
            item['name'] = product.css('h3 a::attr(title)').get()
            item['price'] = product.css('.price_color::text').get()

            # Scrape image URL
            img_url = product.css('img::attr(src)').get()
            item['image_urls'] = [response.urljoin(img_url)]

            yield item

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
import os
from urllib.parse import urlparse

class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use original filename
        url_path = urlparse(request.url).path
        filename = os.path.basename(url_path)

        # Create folder per product
        if item and item.get('name'):
            folder = item['name'].replace(' ', '_')[:30]
            folder = ''.join(c for c in folder if c.isalnum() or c in ('_', '-'))
            return f"{folder}/{filename}"

        return filename

    def item_completed(self, results, item, info):
        image_paths = []
        for success, file_info in results:
            if success:
                image_paths.append(file_info['path'])

        if not image_paths:
            raise DropItem('No images downloaded')

        item['image_paths'] = image_paths
        return item
# settings.py
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# Enable custom images pipeline
ITEM_PIPELINES = {
'myproject.pipelines.MyImagesPipeline': 1
}
# Images settings
IMAGES_STORE = 'images'
IMAGES_THUMBS = {
'small': (100, 100),
'large': (300, 300)
}
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50
IMAGES_EXPIRES = 30 # 30 days
Run it:
scrapy crawl products
Check the images/ folder. You'll see subfolders for each product with their images and thumbnails.
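If you also want the scraped metadata (name, price, and the image_paths the pipeline added) saved alongside the images, export a feed while crawling:

scrapy crawl products -o products.json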
Debugging When Nothing Downloads
If images/files aren't downloading:
Check #1: Is Pillow Installed? (Images Only)
pip install Pillow
Check #2: Are Field Names Exact?
# Must be exactly these names
item['file_urls'] = [...] # For files
item['image_urls'] = [...] # For images
Check #3: Is STORE Setting Set?
# settings.py
FILES_STORE = 'downloads' # For files
IMAGES_STORE = 'images' # For images
Check #4: Are URLs Absolute?
# Use urljoin for relative URLs
item['image_urls'] = [response.urljoin(url)]
Check #5: Are URLs in a List?
# Must be a list
item['image_urls'] = [url] # Not just url
Check #6: Check Logs
Look for errors in the output:
scrapy crawl myspider --loglevel=DEBUG
Search for "FilesPipeline" or "ImagesPipeline" in the logs.
Performance Tips
Tip #1: Limit Concurrent Downloads
# settings.py
CONCURRENT_REQUESTS_PER_DOMAIN = 4 # Don't overwhelm servers
Tip #2: Add Delays
# settings.py
DOWNLOAD_DELAY = 1 # 1 second between requests
Tip #3: Filter Before Downloading
Don't download unnecessary files:
def parse(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        img_url = product.css('img::attr(src)').get()

        # Skip missing or placeholder images
        if not img_url or 'placeholder' in img_url or 'no-image' in img_url:
            continue

        item['image_urls'] = [response.urljoin(img_url)]
        yield item
Quick Reference
Files
# Enable
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloads'
# Item fields (exact names!)
file_urls = scrapy.Field() # Input: list of URLs
files = scrapy.Field() # Output: download results
# Scrape
item['file_urls'] = [response.urljoin(url)]
Images
# Enable (install Pillow first!)
pip install Pillow
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'
# Item fields (exact names!)
image_urls = scrapy.Field() # Input: list of URLs
images = scrapy.Field() # Output: download results
# Scrape
item['image_urls'] = [response.urljoin(url) for url in urls]
# Optional: thumbnails
IMAGES_THUMBS = {'small': (100, 100)}
# Optional: size filter
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100
Summary
Key takeaways:
- Install Pillow for images: pip install Pillow
- Field names MUST be exact: file_urls/files or image_urls/images
- URLs MUST be in a list
- URLs MUST be absolute (use response.urljoin())
- Set the STORE location in settings
- Files get renamed with SHA1 hashes (override file_path() to fix)
- ImagesPipeline can generate thumbnails automatically
- Both pipelines avoid re-downloading recent files
Start with the basic setup. Get it working. Then customize filenames and add features as needed.
Happy downloading! 🕷️