How to Store What You Scrape (Without Making a Mess)

#webscraping #python #zyte #progrmaming

The default for a first scraper is usually printing to stdout or dumping everything into a JSON array. Both work for a single test run. Neither works reliably at scale — a JSON array can't be read incrementally, and stdout isn't a data format.

The format you choose matters more than it seems. The wrong one creates problems that appear later: a 2GB JSON file that has to be fully parsed before you can read a single record; a CSV that corrupts silently when a field contains a comma; a database that slows to a crawl because nobody added an index. This post covers the three formats that cover most scraping use cases, and when to reach for each.

JSONL over JSON arrays

A JSON array written to a file is convenient until the scrape crashes halfway through, at which point you have a truncated, unparseable file. You also can't append to it without reading the whole thing first.

JSONL (JSON Lines) writes one JSON object per line. Each line is independently parseable. A crashed scrape leaves a valid file up to the last complete line. Appending is open("file", "a"). Streaming a large file is a for line in f loop.

import json
import requests
from bs4 import BeautifulSoup
from pathlib import Path

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

output = Path("books.jsonl")

with output.open("w") as f:
    resp = session.get("https://books.toscrape.com/", timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        item = {
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        }
        f.write(json.dumps(item) + "\n")

# Reading back
with output.open() as f:
    records = [json.loads(line) for line in f]

print(f"Read {len(records)} records")
print(records[0])

Output:

Read 20 records
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three'}

For a paginated scrape, open the file in append mode ("a") and write after each page. If the scrape is interrupted on page 15 of 50, you still have the first 14 pages intact. You can resume from page 15 without re-scraping the ones you already have.

CSV with DictWriter

CSV is the right format when the output has a fixed, flat schema and you need to open it in a spreadsheet or feed it to a tool that expects CSV. For nested data or variable fields, JSONL is more flexible.

import csv
import requests
from bs4 import BeautifulSoup
from pathlib import Path

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

FIELDS = ["title", "price", "rating"]
output = Path("books.csv")

with output.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()

    resp = session.get("https://books.toscrape.com/", timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        writer.writerow({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

Two things to get right here: newline="" in the open() call is required on Windows (without it, csv.writer produces double line endings); and DictWriter with explicit fieldnames means extra keys in your dict are ignored rather than causing a crash, which matters when scraped data is messier than expected.

DictWriter also handles quoting automatically — if a field contains a comma or a quote character, it wraps the field correctly. Manual string concatenation doesn't.

SQLite for deduplication and querying

SQLite is worth reaching for when you're doing repeated scrapes of the same site (price monitoring, availability checking) and need to avoid re-inserting data you already have. INSERT OR IGNORE skips any row that would violate a UNIQUE constraint, with no error raised.

import sqlite3
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

conn = sqlite3.connect("books.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS books (
        title  TEXT,
        price  TEXT,
        rating TEXT,
        UNIQUE(title)
    )
""")
conn.commit()

resp = session.get("https://books.toscrape.com/", timeout=15)
resp.encoding = "utf-8"
soup = BeautifulSoup(resp.text, "html.parser")

inserted = 0
skipped  = 0

for book in soup.find_all("article", class_="product_pod"):
    try:
        conn.execute(
            "INSERT INTO books (title, price, rating) VALUES (?, ?, ?)",
            (
                book.find("h3").find("a")["title"],
                book.find("p", class_="price_color").text.strip(),
                book.find("p", class_="star-rating")["class"][1],
            ),
        )
        inserted += 1
    except sqlite3.IntegrityError:
        skipped += 1  # already in the DB

conn.commit()
conn.close()

print(f"Inserted: {inserted}, skipped (duplicate): {skipped}")

The parameterised query (? placeholders) is not optional. String formatting SQL with scraped data is a SQL injection vulnerability even when you control the scraper — the scraped site controls the data.

For repeated scrapes where you want to track price history rather than deduplicate, swap the UNIQUE constraint and INSERT OR IGNORE for a timestamp column and a plain INSERT. Each run adds new rows without touching old ones.

Scrapy pipelines

In Scrapy, storage belongs in item pipelines. A pipeline is a class with a process_item method that receives each item as it comes off the spider. Pipelines can be chained — one validates, one writes to JSONL, one writes to SQLite — and each gets called in the order defined in ITEM_PIPELINES in your settings.

# pipelines.py
import json
import sqlite3
from itemadapter import ItemAdapter


class JsonlPipeline:
    def open_spider(self, spider):
        self.file = open("output.jsonl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict())
        self.file.write(line + "\n")
        return item  # must return item for the next pipeline to receive it


class SqlitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("output.db")
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS items (
                title  TEXT,
                price  TEXT,
                rating TEXT,
                UNIQUE(title)
            )
        """)
        self.conn.commit()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.conn.execute(
            "INSERT OR IGNORE INTO items (title, price, rating) VALUES (?, ?, ?)",
            (adapter["title"], adapter["price"], adapter["rating"]),
        )
        self.conn.commit()
        return item

Enable in settings.py:

ITEM_PIPELINES = {
    "myproject.pipelines.JsonlPipeline":   300,
    "myproject.pipelines.SqlitePipeline":  400,
}

The integer values are priority — lower numbers run first. The return item at the end of each process_item is mandatory; if you forget it, the next pipeline in the chain receives None.

open_spider and close_spider are the correct place to open and close file handles and database connections. Opening them in __init__ means they persist for the lifetime of the pipeline object even when no spider is running.

Choosing a format

Format	Use when
JSONL	Default for most scrapes; variable fields; needs to survive interruption
CSV	Fixed flat schema; downstream tools expect CSV; spreadsheet output
SQLite	Repeated scrapes of the same targets; need to query across runs; deduplication
PostgreSQL / MySQL	Multiple scrapers writing concurrently; data volumes above ~10GB

The jump to a full database server is rarely needed for a single-scraper project. SQLite handles hundreds of millions of rows on modern hardware without issue, provided you're not writing from multiple processes simultaneously.

Tags: python scrapy webscraping tutorial