Muhammad Ikramullah Khan

Scrapy Data Export & Storage: Save Your Data Like a Pro

When I finished my first Scrapy spider, I had 10,000 items scraped. I was excited! Then I realized I had no idea where the data went or how to export it properly.

I tried using -o output.json but the file was 500MB and crashed my text editor. Then I tried saving directly to MongoDB from my spider and lost data when the spider crashed halfway through.

After scraping millions of items, I've learned the right way to export and store data. Let me show you all the options and when to use each.


The Big Picture: Five Ways to Export Data

Scrapy gives you five main options for saving data:

1. Feed Exports (Built-in, Easy)

  • Command line: scrapy crawl spider -o items.json
  • Works for JSON, CSV, XML, JSON Lines
  • Good for: Small to medium datasets, quick exports

2. Item Pipelines (Flexible)

  • Write custom pipeline to save to database
  • Full control over data
  • Good for: Databases, custom storage, data transformation

3. Feeds in Settings (Configured Exports)

  • Set up exports in settings.py
  • Automatic, no command line flags
  • Good for: Production, consistent output format

4. Custom Exporters (Advanced)

  • Create your own export format
  • Rare but powerful
  • Good for: Special formats, custom requirements

5. Direct Storage (Quick & Dirty)

  • Write files directly in spider
  • Not recommended
  • Good for: Quick tests only

Let's explore each properly.


Method 1: Feed Exports (The Quick Way)

Basic Usage

# JSON
scrapy crawl myspider -o items.json

# CSV
scrapy crawl myspider -o items.csv

# JSON Lines (one JSON object per line)
scrapy crawl myspider -o items.jl

# XML
scrapy crawl myspider -o items.xml

Simple. Fast. Works great for small to medium datasets.

Appending vs Overwriting

# Append (keeps existing data, this is what -o does)
scrapy crawl myspider -o items.json

# Overwrite (replaces the file each run)
scrapy crawl myspider -O items.json  # Capital O

What the docs don't tell you:

  • -o (lowercase) appends to an existing file
  • -O (capital O) overwrites the file each time
  • Be careful with append! Appending to a .json file produces invalid JSON, so append to .jl instead

Multiple Formats at Once

scrapy crawl myspider -o items.json -o items.csv

Scrapy exports to both files simultaneously.

Custom Filenames with Timestamp

scrapy crawl myspider -o "items_$(date +%Y%m%d_%H%M%S).json"

Creates: items_20241225_143022.json


Feed Export Settings (Configuration)

Instead of command line flags, configure in settings.py:

# settings.py

FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf-8',
        'store_empty': False,  # Don't create file if no items
        'overwrite': True,
    },
    'items.csv': {
        'format': 'csv',
        'fields': ['name', 'price', 'url'],  # Specify field order
    }
}

Now just run scrapy crawl myspider and both files are created automatically.

Dynamic Filenames

FEEDS = {
    'output/%(name)s_%(time)s.json': {
        'format': 'json',
    }
}

Available placeholders:

  • %(name)s - Spider name
  • %(time)s - Timestamp
  • Any other placeholder gets filled from the spider attribute of the same name (example below)

Creates something like: output/myspider_2024-12-25T14-30-22.json (%(time)s is an ISO-style timestamp)
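Here's a minimal sketch of that last point, assuming a made-up site_id attribute on the spider. Anything Scrapy doesn't recognize in the feed URI gets looked up on the spider:

# spider file - site_id is a hypothetical attribute used only to name the feed
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    site_id = 'us'  # %(site_id)s in the feed URI resolves to this value
    start_urls = ['https://example.com']

    custom_settings = {
        'FEEDS': {
            'output/%(name)s_%(site_id)s_%(time)s.jl': {'format': 'jsonlines'},
        }
    }

    def parse(self, response):
        yield {'url': response.url}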

Storage Backends

Save directly to S3, FTP, or local filesystem:

# Amazon S3
FEEDS = {
    's3://mybucket/items/%(name)s.json': {
        'format': 'json',
    }
}

# FTP
FEEDS = {
    'ftp://user:pass@ftp.example.com/items.json': {
        'format': 'json',
    }
}

# Local with custom path
FEEDS = {
    'file:///absolute/path/items.json': {
        'format': 'json',
    }
}
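One note on the S3 backend: it needs botocore installed, and the credentials usually live in settings too:

# settings.py
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'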

JSON vs JSON Lines (Important Difference)

JSON (Not Great for Large Files)

scrapy crawl myspider -o items.json

Creates:

[
    {"name": "Item 1", "price": 10},
    {"name": "Item 2", "price": 20},
    {"name": "Item 3", "price": 30}
]

Problems:

  • Entire file must be valid JSON
  • Can't process incrementally
  • Memory issues with large files
  • If spider crashes, file is invalid

JSON Lines (Better for Large Files)

scrapy crawl myspider -o items.jl

Creates:

{"name": "Item 1", "price": 10}
{"name": "Item 2", "price": 20}
{"name": "Item 3", "price": 30}

Benefits:

  • One JSON object per line
  • Can process line by line
  • Valid even if spider crashes
  • Easy to append
  • Better for large datasets

What the docs don't tell you:

  • JSON Lines (.jl or .jsonl) is almost always better than JSON for scraping
  • Can easily convert to regular JSON later if needed (quick sketch after this list)
  • Most data tools support JSON Lines
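Converting a JSON Lines export back to a regular JSON array is only a few lines. A rough sketch, fine as long as the result fits in memory at that point:

import json

# Rebuild a plain JSON array from a JSON Lines export
with open('items.jl') as infile, open('items.json', 'w') as outfile:
    items = [json.loads(line) for line in infile if line.strip()]
    json.dump(items, outfile, indent=2)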

CSV Export (Tricky)

Basic CSV

scrapy crawl myspider -o items.csv

Field Order (Important!)

By default, the CSV column order is just whatever Scrapy sees first (your Item definition, or the keys of the first dict it exports), which may not be the order you want. Specify it:

FEEDS = {
    'items.csv': {
        'format': 'csv',
        'fields': ['name', 'price', 'url', 'description'],
    }
}

CSV Encoding Issues

FEED_EXPORT_ENCODING = 'utf-8-sig'  # For Excel compatibility

What the docs don't tell you:

  • Regular 'utf-8' causes problems in Excel
  • 'utf-8-sig' adds BOM (Byte Order Mark) that Excel needs
  • Without this, Excel shows weird characters for non-English text

CSV with Nested Data

CSV can't handle nested structures well:

# This item causes problems in CSV
{
    'name': 'Product',
    'reviews': [  # Nested list!
        {'author': 'John', 'rating': 5},
        {'author': 'Jane', 'rating': 4}
    ]
}

Solution: Flatten before export or use JSON instead.
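If you really need CSV, one option is to flatten the nested part before the exporter sees it, for example from a pipeline's process_item. A rough sketch with illustrative field names:

# Turn the nested reviews list into flat, CSV-friendly columns
def flatten_reviews(item):
    reviews = item.pop('reviews', [])
    item['review_count'] = len(reviews)
    item['avg_rating'] = (
        sum(r['rating'] for r in reviews) / len(reviews) if reviews else None
    )
    return item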


Method 2: Item Pipelines (The Professional Way)

For production scrapers, use pipelines to save to databases.

Save to JSON File (Pipeline Version)

# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

Enable in settings:

ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}

Save to MongoDB

# pipelines.py
import pymongo

class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        collection.insert_one(dict(item))
        return item

Settings:

ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
}

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
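One caveat with insert_one: re-running the spider stores the same documents twice. A hedged sketch of an upsert variant, assuming your items carry a url field, swapping in for open_spider and process_item above:

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # One document per url, so re-runs update instead of duplicating
        self.db[spider.name].create_index('url', unique=True)

    def process_item(self, item, spider):
        self.db[spider.name].update_one(
            {'url': item.get('url')},   # match on url
            {'$set': dict(item)},       # update or set all scraped fields
            upsert=True,                # insert if it doesn't exist yet
        )
        return item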

Save to PostgreSQL

# pipelines.py
import psycopg2

class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_db',
            user='user',
            password='password'
        )
        self.cursor = self.connection.cursor()

        # Create table if not exists
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price DECIMAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT NOW()
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        try:
            self.cursor.execute('''
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    scraped_at = NOW()
            ''', (
                item.get('name'),
                item.get('price'),
                item.get('url')
            ))
            self.connection.commit()
        except Exception as e:
            spider.logger.error(f'Error saving to DB: {e}')
            self.connection.rollback()

        return item

What the docs don't tell you:

  • Always use ON CONFLICT for upserts (avoid duplicates)
  • Commit in batches for better performance (shown later)
  • Handle errors or you'll lose data

Batch Inserts (Much Faster)

class BatchPostgresPipeline:
    def __init__(self):
        self.items = []
        self.batch_size = 100

    def open_spider(self, spider):
        self.connection = psycopg2.connect(...)
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.insert_batch()  # Insert remaining items
        self.connection.commit()
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.items.append(item)

        if len(self.items) >= self.batch_size:
            self.insert_batch()

        return item

    def insert_batch(self):
        if not self.items:
            return

        values = [(item['name'], item['price'], item['url']) 
                  for item in self.items]

        self.cursor.executemany('''
            INSERT INTO products (name, price, url)
            VALUES (%s, %s, %s)
            ON CONFLICT (url) DO NOTHING
        ''', values)

        self.connection.commit()
        self.items = []

Performance difference:

  • Insert one by one: 10 items/second
  • Batch of 100: 1000 items/second

Save to SQLite (Simple Database)

Perfect for local development:

import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('data.db')
        self.cursor = self.connection.cursor()

        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS items (
                name TEXT,
                price REAL,
                url TEXT UNIQUE
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.cursor.execute('''
            INSERT OR REPLACE INTO items (name, price, url)
            VALUES (?, ?, ?)
        ''', (item['name'], item['price'], item['url']))

        self.connection.commit()
        return item

Multiple Storage Destinations

Save to multiple places at once:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 100,  # Save to database
    'myproject.pipelines.S3Pipeline': 200,         # Backup to S3
    'myproject.pipelines.JsonPipeline': 300,       # Local JSON backup
}

All three run for every item.


Handling Large Files

Problem: Memory Issues

Large JSON files load entirely into memory. This crashes with millions of items.

Solution 1: JSON Lines

FEEDS = {
    'items.jl': {'format': 'jsonlines'}
}

One item per line. Can process without loading everything.
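For example, a downstream script can read the feed one item at a time without ever holding the whole dataset in memory:

import json

# Stream a JSON Lines feed item by item
with open('items.jl') as f:
    for line in f:
        if line.strip():
            item = json.loads(line)
            # process one item here, e.g. insert it into a database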

Solution 2: Rotate Files

# settings.py
from datetime import datetime

current_time = datetime.now().strftime('%Y%m%d_%H%M%S')

FEEDS = {
    f'output/items_{current_time}.jl': {
        'format': 'jsonlines',
    }
}

Creates new file each run.

Solution 3: Split by Spider

FEEDS = {
    'output/%(name)s.jl': {
        'format': 'jsonlines',
    }
}

Each spider gets its own file.


Incremental Scraping (Only Save New Items)

Save only items that changed or are new:

from scrapy.exceptions import DropItem


class IncrementalPipeline:
    def __init__(self):
        self.seen_urls = self.load_seen_urls()

    def load_seen_urls(self):
        try:
            with open('seen_urls.txt', 'r') as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def close_spider(self, spider):
        with open('seen_urls.txt', 'w') as f:
            for url in self.seen_urls:
                f.write(url + '\n')

    def process_item(self, item, spider):
        url = item['url']

        if url in self.seen_urls:
            # Already scraped, skip
            raise DropItem(f'Duplicate: {url}')

        self.seen_urls.add(url)
        return item

Only new items pass through to storage pipelines.
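For this to work, the incremental pipeline needs a lower priority number than your storage pipelines so it runs first (class paths here are just illustrative):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.IncrementalPipeline': 100,  # runs first, drops already-seen urls
    'myproject.pipelines.PostgresPipeline': 300,     # only receives new items
}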


Export to Google Sheets

import gspread
from oauth2client.service_account import ServiceAccountCredentials

class GoogleSheetsPipeline:
    def open_spider(self, spider):
        scope = ['https://spreadsheets.google.com/feeds',
                 'https://www.googleapis.com/auth/drive']  # drive scope lets gspread open sheets by name
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            'credentials.json', scope)
        client = gspread.authorize(creds)

        self.sheet = client.open('Scraped Data').sheet1

        # Write headers
        self.sheet.insert_row(['Name', 'Price', 'URL'], 1)
        self.row = 2

    def process_item(self, item, spider):
        self.sheet.insert_row([
            item['name'],
            item['price'],
            item['url']
        ], self.row)

        self.row += 1
        return item
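Fair warning: inserting one row per item burns through Google's API quota quickly. Here's a sketch of a batched variant, assuming a reasonably recent gspread that provides append_rows:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

class BatchedGoogleSheetsPipeline:
    def open_spider(self, spider):
        scope = ['https://spreadsheets.google.com/feeds',
                 'https://www.googleapis.com/auth/drive']
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            'credentials.json', scope)
        self.sheet = gspread.authorize(creds).open('Scraped Data').sheet1
        self.rows = [['Name', 'Price', 'URL']]  # start with the header row

    def close_spider(self, spider):
        if self.rows:
            self.sheet.append_rows(self.rows)  # flush whatever is left

    def process_item(self, item, spider):
        self.rows.append([item['name'], item['price'], item['url']])
        if len(self.rows) >= 50:               # push in chunks of ~50 rows
            self.sheet.append_rows(self.rows)
            self.rows = []
        return item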

Common Mistakes

Mistake #1: Not Handling Connection Failures

# BAD (crashes if DB is down)
def process_item(self, item, spider):
    self.cursor.execute(...)
    return item

# GOOD (handles errors gracefully)
def process_item(self, item, spider):
    try:
        self.cursor.execute(...)
        self.connection.commit()
    except Exception as e:
        spider.logger.error(f'DB error: {e}')
        self.connection.rollback()
    return item

Mistake #2: Committing After Every Item

# BAD (super slow)
def process_item(self, item, spider):
    self.cursor.execute(...)
    self.connection.commit()  # Commit every item

# GOOD (batch commits)
def process_item(self, item, spider):
    self.cursor.execute(...)
    self.item_count += 1  # initialize to 0 in open_spider
    if self.item_count % 100 == 0:
        self.connection.commit()
    return item

Mistake #3: Not Closing Connections

# BAD (connections leak)
def open_spider(self, spider):
    self.connection = connect()

# GOOD (always close)
def close_spider(self, spider):
    if self.connection:
        self.connection.close()

Mistake #4: Large JSON Files

# BAD for large datasets
scrapy crawl spider -o items.json  # Has to be read back as one giant array

# GOOD
scrapy crawl spider -o items.jl  # Streams line by line

Quick Reference

Feed Exports

# JSON
scrapy crawl spider -o items.json

# JSON Lines (better for large data)
scrapy crawl spider -o items.jl

# CSV
scrapy crawl spider -o items.csv

# Overwrite (lowercase -o appends instead)
scrapy crawl spider -O items.json

Settings Configuration

FEEDS = {
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf-8',
    }
}

Pipeline Template

class MyPipeline:
    def open_spider(self, spider):
        # Setup
        pass

    def close_spider(self, spider):
        # Cleanup
        pass

    def process_item(self, item, spider):
        # Save item
        return item

Summary

For Quick Exports:

  • Use feed exports: -o items.jl
  • JSON Lines for large files
  • CSV with utf-8-sig for Excel

For Production:

  • Use pipelines for database storage
  • Batch inserts for performance
  • Handle errors gracefully
  • Always close connections

Best Practices:

  • Use JSON Lines (.jl) over JSON (.json)
  • Batch database operations
  • Commit in chunks, not per item
  • Handle connection failures
  • Use ON CONFLICT for upserts

Start with feed exports for testing. Move to pipelines for production.

Happy scraping! 🕷️
