When I finished my first Scrapy spider, I had 10,000 items scraped. I was excited! Then I realized I had no idea where the data went or how to export it properly.
I tried using -o output.json but the file was 500MB and crashed my text editor. Then I tried saving directly to MongoDB from my spider and lost data when the spider crashed halfway through.
After scraping millions of items, I've learned the right way to export and store data. Let me show you all the options and when to use each.
The Big Picture: Five Ways to Export Data
Scrapy gives you five main options for saving data:
1. Feed Exports (Built-in, Easy)
- Command line: scrapy crawl spider -o items.json
- Works for JSON, CSV, XML, JSON Lines
- Good for: Small to medium datasets, quick exports
2. Item Pipelines (Flexible)
- Write custom pipeline to save to database
- Full control over data
- Good for: Databases, custom storage, data transformation
3. Feeds in Settings (Configured Exports)
- Set up exports in settings.py
- Automatic, no command line flags
- Good for: Production, consistent output format
4. Custom Exporters (Advanced)
- Create your own export format
- Rare but powerful
- Good for: Special formats, custom requirements
5. Direct Storage (Quick & Dirty)
- Write files directly in spider
- Not recommended
- Good for: Quick tests only
Let's explore each properly.
Method 1: Feed Exports (The Quick Way)
Basic Usage
# JSON
scrapy crawl myspider -o items.json
# CSV
scrapy crawl myspider -o items.csv
# JSON Lines (one JSON object per line)
scrapy crawl myspider -o items.jl
# XML
scrapy crawl myspider -o items.xml
Simple. Fast. Works great for small to medium datasets.
Appending vs Overwriting
# Append (default with -o, keeps existing data)
scrapy crawl myspider -o items.json
# Overwrite (replaces the file each run)
scrapy crawl myspider -O items.json  # Capital O
What the docs don't tell you:
- -o appends to an existing file
- -O (capital O) overwrites the file each time
- Be careful with append! Appending to a .json file creates invalid JSON (use .jl if you need to append)
Multiple Formats at Once
scrapy crawl myspider -o items.json -o items.csv
Scrapy exports to both files simultaneously.
Custom Filenames with Timestamp
scrapy crawl myspider -o "items_$(date +%Y%m%d_%H%M%S).json"
Creates: items_20241225_143022.json
Feed Export Settings (Configuration)
Instead of command line flags, configure in settings.py:
# settings.py
FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf-8',
        'store_empty': False,  # Don't create file if no items
        'overwrite': True,
    },
    'items.csv': {
        'format': 'csv',
        'fields': ['name', 'price', 'url'],  # Specify field order
    },
}
Now just run scrapy crawl myspider and both files are created automatically.
Dynamic Filenames
FEEDS = {
    'output/%(name)s_%(time)s.json': {
        'format': 'json',
    }
}
Available placeholders:
- %(name)s - Spider name
- %(time)s - Timestamp
- Custom ones you define (see the sketch below)
Creates something like: output/myspider_2024-12-25T14-30-22.json
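Custom placeholders come from the FEED_URI_PARAMS setting, which points to a function that can add your own keys. A minimal sketch, assuming a module path of myproject.utils and a hypothetical batch_id attribute on the spider:

# myproject/utils.py
def uri_params(params, spider):
    # Return the built-in placeholders plus our own %(batch_id)s
    return {**params, 'batch_id': getattr(spider, 'batch_id', 'default')}

# settings.py
FEED_URI_PARAMS = 'myproject.utils.uri_params'
FEEDS = {
    'output/%(name)s_%(batch_id)s.jl': {
        'format': 'jsonlines',
    }
}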
Storage Backends
Save directly to S3, FTP, or local filesystem:
# Amazon S3
FEEDS = {
    's3://mybucket/items/%(name)s.json': {
        'format': 'json',
    }
}

# FTP
FEEDS = {
    'ftp://user:pass@ftp.example.com/items.json': {
        'format': 'json',
    }
}

# Local with custom path
FEEDS = {
    'file:///absolute/path/items.json': {
        'format': 'json',
    }
}
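For the S3 backend you also need botocore installed and credentials available. The straightforward way is Scrapy's own AWS settings (the bucket name here is just a placeholder); if you leave them unset, botocore falls back to its usual credential chain:

# settings.py
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

FEEDS = {
    's3://mybucket/items/%(name)s_%(time)s.jl': {
        'format': 'jsonlines',
    }
}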
JSON vs JSON Lines (Important Difference)
JSON (Not Great for Large Files)
scrapy crawl myspider -o items.json
Creates:
[
{"name": "Item 1", "price": 10},
{"name": "Item 2", "price": 20},
{"name": "Item 3", "price": 30}
]
Problems:
- Entire file must be valid JSON
- Can't process incrementally
- Memory issues with large files
- If spider crashes, file is invalid
JSON Lines (Better for Large Files)
scrapy crawl myspider -o items.jl
Creates:
{"name": "Item 1", "price": 10}
{"name": "Item 2", "price": 20}
{"name": "Item 3", "price": 30}
Benefits:
- One JSON object per line
- Can process line by line
- Valid even if spider crashes
- Easy to append
- Better for large datasets
What the docs don't tell you:
- JSON Lines (.jl or .jsonl) is almost always better than JSON for scraping
- Can easily convert to regular JSON later if needed (see the snippet below)
- Most data tools support JSON Lines
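Converting a .jl file back to a regular JSON array, for example, is only a few lines of plain Python:

import json

# Read one JSON object per line...
with open('items.jl') as infile:
    items = [json.loads(line) for line in infile if line.strip()]

# ...and write them back out as a single JSON array
with open('items.json', 'w') as outfile:
    json.dump(items, outfile, ensure_ascii=False, indent=2)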
CSV Export (Tricky)
Basic CSV
scrapy crawl myspider -o items.csv
Field Order (Important!)
By default, the CSV column order is whatever Scrapy infers from your items, and it can vary between runs. Pin it down explicitly:
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'fields': ['name', 'price', 'url', 'description'],
    }
}
CSV Encoding Issues
FEED_EXPORT_ENCODING = 'utf-8-sig' # For Excel compatibility
What the docs don't tell you:
- Regular 'utf-8' causes problems in Excel
- 'utf-8-sig' adds BOM (Byte Order Mark) that Excel needs
- Without this, Excel shows weird characters for non-English text
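You can also set the encoding per feed instead of globally, so only the CSV gets the BOM while JSON Lines stays plain UTF-8:

# settings.py
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',  # Excel-friendly BOM
    },
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf-8',
    },
}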
CSV with Nested Data
CSV can't handle nested structures well:
# This item causes problems in CSV
{
    'name': 'Product',
    'reviews': [  # Nested list!
        {'author': 'John', 'rating': 5},
        {'author': 'Jane', 'rating': 4}
    ]
}
Solution: Flatten before export or use JSON instead.
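Here's a minimal flattening sketch as a pipeline; the reviews, review_count and avg_rating field names are just for illustration:

class FlattenReviewsPipeline:
    """Collapse the nested reviews list into flat, CSV-friendly columns."""

    def process_item(self, item, spider):
        item = dict(item)
        reviews = item.pop('reviews', []) or []
        item['review_count'] = len(reviews)
        item['avg_rating'] = (
            sum(r['rating'] for r in reviews) / len(reviews) if reviews else None
        )
        return item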
Method 2: Item Pipelines (The Professional Way)
For production scrapers, use pipelines to save to databases.
Save to JSON File (Pipeline Version)
# pipelines.py
import json
class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
Enable in settings:
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}
Save to MongoDB
# pipelines.py
import pymongo
class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        collection.insert_one(dict(item))
        return item
Settings:
ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
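One thing to watch: insert_one creates a new document every run, so re-scraping piles up duplicates. A common tweak, assuming every item carries a unique url field, is a unique index plus an upsert. A sketch that subclasses the pipeline above:

class MongoDBUpsertPipeline(MongoDBPipeline):
    def open_spider(self, spider):
        super().open_spider(spider)
        # Unique index on url so repeated runs can't create duplicates
        self.db[spider.name].create_index('url', unique=True)

    def process_item(self, item, spider):
        self.db[spider.name].update_one(
            {'url': item['url']},   # match on url
            {'$set': dict(item)},   # overwrite the rest of the fields
            upsert=True             # insert if it doesn't exist yet
        )
        return item

Point ITEM_PIPELINES at MongoDBUpsertPipeline instead of MongoDBPipeline to use it.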
Save to PostgreSQL
# pipelines.py
import psycopg2
class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_db',
            user='user',
            password='password'
        )
        self.cursor = self.connection.cursor()
        # Create table if not exists
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price DECIMAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT NOW()
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        try:
            self.cursor.execute('''
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    scraped_at = NOW()
            ''', (
                item.get('name'),
                item.get('price'),
                item.get('url')
            ))
            self.connection.commit()
        except Exception as e:
            spider.logger.error(f'Error saving to DB: {e}')
            self.connection.rollback()
        return item
What the docs don't tell you:
- Always use ON CONFLICT for upserts (avoids duplicates)
- Commit in batches for better performance (shown next)
- Handle errors or you'll lose data
Batch Inserts (Much Faster)
class BatchPostgresPipeline:
    def __init__(self):
        self.items = []
        self.batch_size = 100

    def open_spider(self, spider):
        self.connection = psycopg2.connect(...)
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.insert_batch()  # Insert remaining items
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.insert_batch()
        return item

    def insert_batch(self):
        if not self.items:
            return
        values = [(item['name'], item['price'], item['url'])
                  for item in self.items]
        self.cursor.executemany('''
            INSERT INTO products (name, price, url)
            VALUES (%s, %s, %s)
            ON CONFLICT (url) DO NOTHING
        ''', values)
        self.connection.commit()
        self.items = []
Performance difference:
- Insert one by one: 10 items/second
- Batch of 100: 1000 items/second
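If executemany still isn't fast enough, psycopg2 ships a helper, psycopg2.extras.execute_values, which folds the whole batch into one multi-row INSERT. A sketch of just the insert_batch method, as a drop-in replacement for the one above:

from psycopg2.extras import execute_values

def insert_batch(self):
    if not self.items:
        return
    values = [(item['name'], item['price'], item['url'])
              for item in self.items]
    # execute_values expands the single VALUES %s into one multi-row statement
    execute_values(self.cursor, '''
        INSERT INTO products (name, price, url)
        VALUES %s
        ON CONFLICT (url) DO NOTHING
    ''', values)
    self.connection.commit()
    self.items = []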
Save to SQLite (Simple Database)
Perfect for local development:
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('data.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS items (
                name TEXT,
                price REAL,
                url TEXT UNIQUE
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.cursor.execute('''
            INSERT OR REPLACE INTO items (name, price, url)
            VALUES (?, ?, ?)
        ''', (item['name'], item['price'], item['url']))
        self.connection.commit()
        return item
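Reading the results back is just as easy, for example to spot-check the most expensive items after a run (assuming the items table created above):

import sqlite3

connection = sqlite3.connect('data.db')
cursor = connection.cursor()
cursor.execute('SELECT name, price FROM items ORDER BY price DESC LIMIT 10')
for name, price in cursor.fetchall():
    print(f'{name}: {price}')
connection.close()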
Multiple Storage Destinations
Save to multiple places at once:
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 100,  # Save to database
    'myproject.pipelines.S3Pipeline': 200,        # Backup to S3
    'myproject.pipelines.JsonPipeline': 300,      # Local JSON backup
}
All three run for every item.
Handling Large Files
Problem: Memory Issues
A single .json export is one giant array that has to be parsed as a whole, which eats memory fast once you hit millions of items.
Solution 1: JSON Lines
FEEDS = {
    'items.jl': {'format': 'jsonlines'}
}
One item per line. Can process without loading everything.
Solution 2: Rotate Files
# settings.py
from datetime import datetime

current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

FEEDS = {
    f'output/items_{current_time}.jl': {
        'format': 'jsonlines',
    }
}
Creates new file each run.
Solution 3: Split by Spider
FEEDS = {
    'output/%(name)s.jl': {
        'format': 'jsonlines',
    }
}
Each spider gets its own file.
Incremental Scraping (Only Save New Items)
Save only items that changed or are new:
from scrapy.exceptions import DropItem

class IncrementalPipeline:
    def __init__(self):
        self.seen_urls = self.load_seen_urls()

    def load_seen_urls(self):
        try:
            with open('seen_urls.txt', 'r') as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def close_spider(self, spider):
        with open('seen_urls.txt', 'w') as f:
            for url in self.seen_urls:
                f.write(url + '\n')

    def process_item(self, item, spider):
        url = item['url']
        if url in self.seen_urls:
            # Already scraped, skip
            raise DropItem(f'Duplicate: {url}')
        self.seen_urls.add(url)
        return item
Only new items pass through to storage pipelines.
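That only works if the incremental check runs before the storage pipelines, which is what the priority numbers control (lower runs first). Something like:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.IncrementalPipeline': 100,  # drops already-seen URLs first
    'myproject.pipelines.PostgresPipeline': 300,     # only sees the new items
}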
Export to Google Sheets
import gspread
from oauth2client.service_account import ServiceAccountCredentials
class GoogleSheetsPipeline:
    def open_spider(self, spider):
        # Opening a spreadsheet by name also needs the Drive scope
        scope = [
            'https://spreadsheets.google.com/feeds',
            'https://www.googleapis.com/auth/drive',
        ]
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            'credentials.json', scope)
        client = gspread.authorize(creds)
        self.sheet = client.open('Scraped Data').sheet1
        # Write headers
        self.sheet.insert_row(['Name', 'Price', 'URL'], 1)
        self.row = 2

    def process_item(self, item, spider):
        self.sheet.insert_row([
            item['name'],
            item['price'],
            item['url']
        ], self.row)
        self.row += 1
        return item
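Be aware that insert_row makes one API call per item, which runs into Google's rate limits quickly. A hedged variant that buffers rows and writes them in a single batch at the end, assuming a gspread version that provides append_rows:

class BufferedGoogleSheetsPipeline(GoogleSheetsPipeline):
    def open_spider(self, spider):
        super().open_spider(spider)  # connects and writes the header row
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append([item['name'], item['price'], item['url']])
        return item

    def close_spider(self, spider):
        if self.rows:
            # One batched API call instead of one per item
            self.sheet.append_rows(self.rows)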
Common Mistakes
Mistake #1: Not Handling Connection Failures
# BAD (crashes if DB is down)
def process_item(self, item, spider):
    self.cursor.execute(...)
    return item

# GOOD (handles errors gracefully)
def process_item(self, item, spider):
    try:
        self.cursor.execute(...)
        self.connection.commit()
    except Exception as e:
        spider.logger.error(f'DB error: {e}')
        self.connection.rollback()
    return item
Mistake #2: Committing After Every Item
# BAD (super slow)
def process_item(self, item, spider):
    self.cursor.execute(...)
    self.connection.commit()  # Commit every item
    return item

# GOOD (batch commits)
def process_item(self, item, spider):
    self.cursor.execute(...)
    self.item_count += 1
    if self.item_count % 100 == 0:
        self.connection.commit()
    return item
Mistake #3: Not Closing Connections
# BAD (connections leak)
def open_spider(self, spider):
    self.connection = connect()

# GOOD (always close)
def close_spider(self, spider):
    if self.connection:
        self.connection.close()
Mistake #4: Large JSON Files
# BAD for large datasets
scrapy crawl spider -o items.json  # One giant array that must be parsed whole later
# GOOD
scrapy crawl spider -o items.jl    # One object per line, can be processed as a stream
Quick Reference
Feed Exports
# JSON
scrapy crawl spider -o items.json
# JSON Lines (better for large data)
scrapy crawl spider -o items.jl
# CSV
scrapy crawl spider -o items.csv
# Overwrite instead of append
scrapy crawl spider -O items.json
Settings Configuration
FEEDS = {
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf-8',
    }
}
Pipeline Template
class MyPipeline:
    def open_spider(self, spider):
        # Setup
        pass

    def close_spider(self, spider):
        # Cleanup
        pass

    def process_item(self, item, spider):
        # Save item
        return item
Summary
For Quick Exports:
- Use feed exports: -o items.jl
- JSON Lines for large files
- CSV with utf-8-sig for Excel
For Production:
- Use pipelines for database storage
- Batch inserts for performance
- Handle errors gracefully
- Always close connections
Best Practices:
- Use JSON Lines (.jl) over JSON (.json)
- Batch database operations
- Commit in chunks, not per item
- Handle connection failures
- Use ON CONFLICT for upserts
Start with feed exports for testing. Move to pipelines for production.
Happy scraping! 🕷️