When I finished my first Scrapy spider, I had 10,000 items scraped. I was excited! Then I realized I had no idea where the data went or how to export it properly.
I tried using -o output.json but the file was 500MB and crashed my text editor. Then I tried saving directly to MongoDB from my spider and lost data when the spider crashed halfway through.
After scraping millions of items, I've learned the right way to export and store data. Let me show you all the options and when to use each.
The Big Picture: Five Ways to Export Data
Scrapy gives you five main options for saving data:
1. Feed Exports (Built-in, Easy)
- Command line: scrapy crawl spider -o items.json
- Works for JSON, CSV, XML, JSON Lines
- Good for: Small to medium datasets, quick exports
2. Item Pipelines (Flexible)
- Write custom pipeline to save to database
- Full control over data
- Good for: Databases, custom storage, data transformation
3. Feeds in Settings (Configured Exports)
- Set up exports in settings.py
- Automatic, no command line flags
- Good for: Production, consistent output format
4. Custom Exporters (Advanced)
- Create your own export format
- Rare but powerful
- Good for: Special formats, custom requirements
5. Direct Storage (Quick & Dirty)
- Write files directly in spider
- Not recommended
- Good for: Quick tests only
Let's explore each properly.
Method 1: Feed Exports (The Quick Way)
Basic Usage
# JSON
scrapy crawl myspider -o items.json
# CSV
scrapy crawl myspider -o items.csv
# JSON Lines (one JSON object per line)
scrapy crawl myspider -o items.jl
# XML
scrapy crawl myspider -o items.xml
Simple. Fast. Works great for small to medium datasets.
Appending vs Overwriting
# Append (default with -o, keeps existing data)
scrapy crawl myspider -o items.json
# Overwrite (replaces the file each run)
scrapy crawl myspider -O items.json  # Capital O
What the docs don't tell you:
- -o appends to an existing file
- -O (capital O) overwrites the file each time
- Be careful with append! Appending to a .json file creates invalid JSON (use .jl if you need to append)
Multiple Formats at Once
scrapy crawl myspider -o items.json -o items.csv
Scrapy exports to both files simultaneously.
Custom Filenames with Timestamp
scrapy crawl myspider -o "items_$(date +%Y%m%d_%H%M%S).json"
Creates: items_20241225_143022.json
Feed Export Settings (Configuration)
Instead of command line flags, configure in settings.py:
# settings.py
FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf-8',
        'store_empty': False,  # Don't create file if no items
        'overwrite': True,
    },
    'items.csv': {
        'format': 'csv',
        'fields': ['name', 'price', 'url'],  # Specify field order
    },
}
Now just run scrapy crawl myspider and both files are created automatically.
Dynamic Filenames
FEEDS = {
    'output/%(name)s_%(time)s.json': {
        'format': 'json',
    }
}
Available placeholders:
- %(name)s - Spider name
- %(time)s - Timestamp
- Custom ones you define (see the sketch below)
Creates something like: output/myspider_2024-12-25T14-30-22.json
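Custom placeholders come from the FEED_URI_PARAMS setting, which points to a function that can add your own keys. A minimal sketch, assuming a module path of myproject.utils and a hypothetical batch_id attribute on the spider:

# myproject/utils.py
def uri_params(params, spider):
    # Return the built-in placeholders plus our own %(batch_id)s
    return {**params, 'batch_id': getattr(spider, 'batch_id', 'default')}

# settings.py
FEED_URI_PARAMS = 'myproject.utils.uri_params'
FEEDS = {
    'output/%(name)s_%(batch_id)s.jl': {
        'format': 'jsonlines',
    }
}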
Storage Backends
Save directly to S3, FTP, or local filesystem:
# Amazon S3
FEEDS = {
    's3://mybucket/items/%(name)s.json': {
        'format': 'json',
    }
}

# FTP
FEEDS = {
    'ftp://user:pass@ftp.example.com/items.json': {
        'format': 'json',
    }
}

# Local with custom path
FEEDS = {
    'file:///absolute/path/items.json': {
        'format': 'json',
    }
}
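For the S3 backend you also need botocore installed and credentials available. The straightforward way is Scrapy's own AWS settings (the bucket name here is just a placeholder); if you leave them unset, botocore falls back to its usual credential chain:

# settings.py
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

FEEDS = {
    's3://mybucket/items/%(name)s_%(time)s.jl': {
        'format': 'jsonlines',
    }
}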
JSON vs JSON Lines (Important Difference)
JSON (Not Great for Large Files)
scrapy crawl myspider -o items.json
Creates:
[
{"name": "Item 1", "price": 10},
{"name": "Item 2", "price": 20},
{"name": "Item 3", "price": 30}
]
Problems:
- Entire file must be valid JSON
- Can't process incrementally
- Memory issues with large files
- If spider crashes, file is invalid
JSON Lines (Better for Large Files)
scrapy crawl myspider -o items.jl
Creates:
{"name": "Item 1", "price": 10}
{"name": "Item 2", "price": 20}
{"name": "Item 3", "price": 30}
Benefits:
- One JSON object per line
- Can process line by line
- Valid even if spider crashes
- Easy to append
- Better for large datasets
What the docs don't tell you:
- JSON Lines (.jl or .jsonl) is almost always better than JSON for scraping
- Can easily convert to regular JSON later if needed (see the snippet below)
- Most data tools support JSON Lines
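Converting a .jl file back to a regular JSON array, for example, is only a few lines of plain Python:

import json

# Read one JSON object per line...
with open('items.jl') as infile:
    items = [json.loads(line) for line in infile if line.strip()]

# ...and write them back out as a single JSON array
with open('items.json', 'w') as outfile:
    json.dump(items, outfile, ensure_ascii=False, indent=2)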
CSV Export (Tricky)
Basic CSV
scrapy crawl myspider -o items.csv
Field Order (Important!)
By default, the CSV column order is whatever Scrapy infers from your items, and it can vary between runs. Pin it down explicitly:
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'fields': ['name', 'price', 'url', 'description'],
    }
}
CSV Encoding Issues
FEED_EXPORT_ENCODING = 'utf-8-sig' # For Excel compatibility
What the docs don't tell you:
- Regular 'utf-8' causes problems in Excel
- 'utf-8-sig' adds BOM (Byte Order Mark) that Excel needs
- Without this, Excel shows weird characters for non-English text
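You can also set the encoding per feed instead of globally, so only the CSV gets the BOM while JSON Lines stays plain UTF-8:

# settings.py
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',  # Excel-friendly BOM
    },
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf-8',
    },
}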
CSV with Nested Data
CSV can't handle nested structures well:
# This item causes problems in CSV
{
    'name': 'Product',
    'reviews': [  # Nested list!
        {'author': 'John', 'rating': 5},
        {'author': 'Jane', 'rating': 4}
    ]
}
Solution: Flatten before export or use JSON instead.
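Here's a minimal flattening sketch as a pipeline; the reviews, review_count and avg_rating field names are just for illustration:

class FlattenReviewsPipeline:
    """Collapse the nested reviews list into flat, CSV-friendly columns."""

    def process_item(self, item, spider):
        item = dict(item)
        reviews = item.pop('reviews', []) or []
        item['review_count'] = len(reviews)
        item['avg_rating'] = (
            sum(r['rating'] for r in reviews) / len(reviews) if reviews else None
        )
        return item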
Method 2: Item Pipelines (The Professional Way)
For production scrapers, use pipelines to save to databases.
Save to JSON File (Pipeline Version)
# pipelines.py
import json
class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
Enable in settings:
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}
Save to MongoDB
# pipelines.py
import pymongo
class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        collection.insert_one(dict(item))
        return item
Settings:
ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
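One thing to watch: insert_one creates a new document every run, so re-scraping piles up duplicates. A common tweak, assuming every item carries a unique url field, is a unique index plus an upsert. A sketch that subclasses the pipeline above:

class MongoDBUpsertPipeline(MongoDBPipeline):
    def open_spider(self, spider):
        super().open_spider(spider)
        # Unique index on url so repeated runs can't create duplicates
        self.db[spider.name].create_index('url', unique=True)

    def process_item(self, item, spider):
        self.db[spider.name].update_one(
            {'url': item['url']},   # match on url
            {'$set': dict(item)},   # overwrite the rest of the fields
            upsert=True             # insert if it doesn't exist yet
        )
        return item

Point ITEM_PIPELINES at MongoDBUpsertPipeline instead of MongoDBPipeline to use it.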
Save to PostgreSQL
# pipelines.py
import psycopg2
class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_db',
            user='user',
            password='password'
        )
        self.cursor = self.connection.cursor()
        # Create table if not exists
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price DECIMAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT NOW()
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        try:
            self.cursor.execute('''
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    scraped_at = NOW()
            ''', (
                item.get('name'),
                item.get('price'),
                item.get('url')
            ))
            self.connection.commit()
        except Exception as e:
            spider.logger.error(f'Error saving to DB: {e}')
            self.connection.rollback()
        return item
What the docs don't tell you:
- Always use ON CONFLICT for upserts (avoids duplicates)
- Commit in batches for better performance (shown next)
- Handle errors or you'll lose data
Batch Inserts (Much Faster)
class BatchPostgresPipeline:
    def __init__(self):
        self.items = []
        self.batch_size = 100

    def open_spider(self, spider):
        self.connection = psycopg2.connect(...)
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.insert_batch()  # Insert remaining items
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.insert_batch()
        return item

    def insert_batch(self):
        if not self.items:
            return
        values = [(item['name'], item['price'], item['url'])
                  for item in self.items]
        self.cursor.executemany('''
            INSERT INTO products (name, price, url)
            VALUES (%s, %s, %s)
            ON CONFLICT (url) DO NOTHING
        ''', values)
        self.connection.commit()
        self.items = []
Performance difference:
- Insert one by one: 10 items/second
- Batch of 100: 1000 items/second
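If executemany still isn't fast enough, psycopg2 ships a helper, psycopg2.extras.execute_values, which folds the whole batch into one multi-row INSERT. A sketch of just the insert_batch method, as a drop-in replacement for the one above:

from psycopg2.extras import execute_values

def insert_batch(self):
    if not self.items:
        return
    values = [(item['name'], item['price'], item['url'])
              for item in self.items]
    # execute_values expands the single VALUES %s into one multi-row statement
    execute_values(self.cursor, '''
        INSERT INTO products (name, price, url)
        VALUES %s
        ON CONFLICT (url) DO NOTHING
    ''', values)
    self.connection.commit()
    self.items = []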
Save to SQLite (Simple Database)
Perfect for local development:
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('data.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS items (
                name TEXT,
                price REAL,
                url TEXT UNIQUE
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.cursor.execute('''
            INSERT OR REPLACE INTO items (name, price, url)
            VALUES (?, ?, ?)
        ''', (item['name'], item['price'], item['url']))
        self.connection.commit()
        return item
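Reading the results back is just as easy, for example to spot-check the most expensive items after a run (assuming the items table created above):

import sqlite3

connection = sqlite3.connect('data.db')
cursor = connection.cursor()
cursor.execute('SELECT name, price FROM items ORDER BY price DESC LIMIT 10')
for name, price in cursor.fetchall():
    print(f'{name}: {price}')
connection.close()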
Multiple Storage Destinations
Save to multiple places at once:
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 100,  # Save to database
    'myproject.pipelines.S3Pipeline': 200,        # Backup to S3
    'myproject.pipelines.JsonPipeline': 300,      # Local JSON backup
}
All three run for every item.
Handling Large Files
Problem: Memory Issues
A single .json export is one giant array that has to be parsed as a whole, which eats memory fast once you hit millions of items.
Solution 1: JSON Lines
FEEDS = {
    'items.jl': {'format': 'jsonlines'}
}
One item per line. Can process without loading everything.
Solution 2: Rotate Files
# settings.py
from datetime import datetime

current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

FEEDS = {
    f'output/items_{current_time}.jl': {
        'format': 'jsonlines',
    }
}
Creates new file each run.
Solution 3: Split by Spider
FEEDS = {
    'output/%(name)s.jl': {
        'format': 'jsonlines',
    }
}
Each spider gets its own file.
Incremental Scraping (Only Save New Items)
Save only items that changed or are new:
from scrapy.exceptions import DropItem

class IncrementalPipeline:
    def __init__(self):
        self.seen_urls = self.load_seen_urls()

    def load_seen_urls(self):
        try:
            with open('seen_urls.txt', 'r') as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def close_spider(self, spider):
        with open('seen_urls.txt', 'w') as f:
            for url in self.seen_urls:
                f.write(url + '\n')

    def process_item(self, item, spider):
        url = item['url']
        if url in self.seen_urls:
            # Already scraped, skip
            raise DropItem(f'Duplicate: {url}')
        self.seen_urls.add(url)
        return item
Only new items pass through to storage pipelines.
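That only works if the incremental check runs before the storage pipelines, which is what the priority numbers control (lower runs first). Something like:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.IncrementalPipeline': 100,  # drops already-seen URLs first
    'myproject.pipelines.PostgresPipeline': 300,     # only sees the new items
}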
Export to Google Sheets
import gspread
from oauth2client.service_account import ServiceAccountCredentials
class GoogleSheetsPipeline:
    def open_spider(self, spider):
        # Opening a spreadsheet by name also needs the Drive scope
        scope = [
            'https://spreadsheets.google.com/feeds',
            'https://www.googleapis.com/auth/drive',
        ]
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            'credentials.json', scope)
        client = gspread.authorize(creds)
        self.sheet = client.open('Scraped Data').sheet1
        # Write headers
        self.sheet.insert_row(['Name', 'Price', 'URL'], 1)
        self.row = 2

    def process_item(self, item, spider):
        self.sheet.insert_row([
            item['name'],
            item['price'],
            item['url']
        ], self.row)
        self.row += 1
        return item
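Be aware that insert_row makes one API call per item, which runs into Google's rate limits quickly. A hedged variant that buffers rows and writes them in a single batch at the end, assuming a gspread version that provides append_rows:

class BufferedGoogleSheetsPipeline(GoogleSheetsPipeline):
    def open_spider(self, spider):
        super().open_spider(spider)  # connects and writes the header row
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append([item['name'], item['price'], item['url']])
        return item

    def close_spider(self, spider):
        if self.rows:
            # One batched API call instead of one per item
            self.sheet.append_rows(self.rows)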
Common Mistakes
Mistake #1: Not Handling Connection Failures
# BAD (crashes if DB is down)
def process_item(self, item, spider):
    self.cursor.execute(...)
    return item

# GOOD (handles errors gracefully)
def process_item(self, item, spider):
    try:
        self.cursor.execute(...)
        self.connection.commit()
    except Exception as e:
        spider.logger.error(f'DB error: {e}')
        self.connection.rollback()
    return item
Mistake #2: Committing After Every Item
# BAD (super slow)
def process_item(self, item, spider):
    self.cursor.execute(...)
    self.connection.commit()  # Commit every item
    return item

# GOOD (batch commits)
def process_item(self, item, spider):
    self.cursor.execute(...)
    self.item_count += 1
    if self.item_count % 100 == 0:
        self.connection.commit()
    return item
Mistake #3: Not Closing Connections
# BAD (connections leak)
def open_spider(self, spider):
    self.connection = connect()

# GOOD (always close)
def close_spider(self, spider):
    if self.connection:
        self.connection.close()
Mistake #4: Large JSON Files
# BAD for large datasets
scrapy crawl spider -o items.json  # One giant array that must be parsed whole later
# GOOD
scrapy crawl spider -o items.jl    # One object per line, can be processed as a stream
Quick Reference
Feed Exports
# JSON
scrapy crawl spider -o items.json
# JSON Lines (better for large data)
scrapy crawl spider -o items.jl
# CSV
scrapy crawl spider -o items.csv
# Overwrite instead of append
scrapy crawl spider -O items.json
Settings Configuration
FEEDS = {
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf-8',
    }
}
Pipeline Template
class MyPipeline:
    def open_spider(self, spider):
        # Setup
        pass

    def close_spider(self, spider):
        # Cleanup
        pass

    def process_item(self, item, spider):
        # Save item
        return item
Summary
For Quick Exports:
- Use feed exports: -o items.jl
- JSON Lines for large files
- CSV with utf-8-sig for Excel
For Production:
- Use pipelines for database storage
- Batch inserts for performance
- Handle errors gracefully
- Always close connections
Best Practices:
- Use JSON Lines (.jl) over JSON (.json)
- Batch database operations
- Commit in chunks, not per item
- Handle connection failures
- Use ON CONFLICT for upserts
Start with feed exports for testing. Move to pipelines for production.
Happy scraping! 🕷️