Muhammad Ikramullah Khan

Scrapy Signals: The Complete Beginner's Guide (Make Your Spider Talk to You)

When I first heard about Scrapy signals, I thought "why would I need this?" I was scraping just fine without them.

Then I needed to send an email when my spider finished. I tried putting it at the end of parse(). Didn't work. The spider kept running after parse() finished.

I tried putting it in a pipeline. Didn't work either. The pipeline processes items, not spider completion.

Finally, I learned about signals. Specifically, spider_closed. Problem solved in five minutes.

Signals are Scrapy's way of saying "hey, something just happened!" You can listen for these events and run custom code. Let me show you how.


What Are Signals, Really?

Think of signals like notifications on your phone. Something happens (new message, battery low, screenshot taken), and your phone tells you.

In Scrapy:

  • Spider opens → spider_opened signal fires
  • Item gets scraped → item_scraped signal fires
  • Spider closes → spider_closed signal fires
  • Callback raises an exception → spider_error signal fires

You can "subscribe" to these notifications and run code when they happen.

Without signals:
You have no way to know when these events occur. Your spider runs, stuff happens, and you're blind to it.

With signals:
Your spider tells you exactly what's happening. You can react to events in real time.


The Most Important Signals (Start Here)

Let's learn the signals you'll actually use, starting with the most common.

spider_opened (When Spider Starts)

Fires AFTER the spider starts running, BEFORE any requests are made.

Use cases:

  • Open database connections
  • Initialize counters
  • Start timers
  • Send "spider started" notifications
  • Set up resources

import time

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened_handler, signal=signals.spider_opened)
        return spider

    def spider_opened_handler(self, spider):
        spider.logger.info(f'{spider.name} has started!')
        spider.start_time = time.time()

    def parse(self, response):
        yield {'data': response.css('h1::text').get()}

When it fires: Right after spider initialization, before first request.

spider_closed (When Spider Finishes)

Fires AFTER the spider completes. This is the most used signal.

Use cases:

  • Close database connections
  • Send completion emails
  • Upload results to S3 (sketched below)
  • Calculate total runtime
  • Send metrics to monitoring services
  • Clean up temporary files

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed_handler, signal=signals.spider_closed)
    return spider

def spider_closed_handler(self, spider, reason):
    spider.logger.info(f'{spider.name} closed. Reason: {reason}')

    if hasattr(spider, 'start_time'):
        duration = time.time() - spider.start_time
        spider.logger.info(f'Ran for {duration:.2f} seconds')

    # Send completion email
    send_email(
        subject=f'Spider {spider.name} finished',
        body=f'Reason: {reason}'
    )

When it fires: After spider finishes, before Scrapy shuts down.
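The "Upload results to S3" use case above is just another spider_closed handler. Here's a minimal sketch, assuming boto3 is installed and your spider writes its results to output.json (for example via the FEEDS setting); the bucket and key names are placeholders:

import boto3

def spider_closed_handler(self, spider, reason):
    # Hypothetical bucket/key names; adjust to your setup
    s3 = boto3.client('s3')
    s3.upload_file('output.json', 'my-scraping-bucket', f'{spider.name}/output.json')
    spider.logger.info('Results uploaded to S3')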

The reason parameter tells you why the spider closed:

  • 'finished' - Spider completed normally
  • 'shutdown' - Spider was stopped early (Ctrl+C or engine shutdown)
  • 'cancelled' - Spider was closed by raising CloseSpider (its default reason); the CLOSESPIDER_* settings use their own reasons, like 'closespider_itemcount'
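For example, you can branch on reason so that only abnormal runs trigger an alert (send_alert stands in for whatever notification helper you use, like the ones sketched later in this post):

def spider_closed_handler(self, spider, reason):
    if reason == 'finished':
        spider.logger.info('Crawl completed normally')
    else:
        # Anything other than 'finished' means the run was cut short
        send_alert(f'Spider {spider.name} stopped early: {reason}')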

item_scraped (When Item Passes Through Pipelines)

Fires AFTER an item successfully goes through ALL pipelines without being dropped.

Use cases:

  • Count scraped items
  • Update progress in real-time
  • Send alerts for specific items
  • Track scraping rate

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.items_scraped = 0
    crawler.signals.connect(spider.item_scraped_handler, signal=signals.item_scraped)
    return spider

def item_scraped_handler(self, item, response, spider):
    spider.items_scraped += 1

    # Log progress every 100 items
    if spider.items_scraped % 100 == 0:
        spider.logger.info(f'Scraped {spider.items_scraped} items so far')

    # Alert on expensive items
    if 'price' in item and float(item['price']) > 1000:
        send_alert(f'Expensive item found: {item["name"]} at ${item["price"]}')

When it fires: After item successfully passes through all pipelines.

item_dropped (When Item Gets Rejected)

Fires when a pipeline drops an item (raises DropItem).

Use cases:

  • Count rejected items
  • Log why items were dropped
  • Alert when drop rate is too high
  • Debug pipeline issues

from scrapy import signals
from scrapy.exceptions import DropItem

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.items_dropped = 0
    crawler.signals.connect(spider.item_dropped_handler, signal=signals.item_dropped)
    return spider

def item_dropped_handler(self, item, response, exception, spider):
    spider.items_dropped += 1
    spider.logger.warning(f'Item dropped: {exception}')

    # Alert if drop rate is too high
    if hasattr(spider, 'items_scraped'):
        total = spider.items_scraped + spider.items_dropped
        drop_rate = spider.items_dropped / total if total > 0 else 0

        if drop_rate > 0.5:  # More than 50% dropped
            send_alert(f'High drop rate: {drop_rate:.1%}')

When it fires: When a pipeline raises DropItem.

spider_error (When Something Breaks)

Fires when a spider callback raises an exception.

Use cases:

  • Track error rates
  • Alert on critical errors
  • Log detailed error info
  • Trigger retries or fallbacks

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.error_count = 0
    crawler.signals.connect(spider.spider_error_handler, signal=signals.spider_error)
    return spider

def spider_error_handler(self, failure, response, spider):
    spider.error_count += 1
    spider.logger.error(f'Error #{spider.error_count}: {failure.getErrorMessage()}')
    spider.logger.error(f'Failed URL: {response.url}')

    # Alert on repeated errors
    if spider.error_count > 10:
        send_alert(f'Spider {spider.name} has {spider.error_count} errors!')

When it fires: When parse() or any callback raises an exception.


Real-World Use Case #1: Send Completion Email

import scrapy
from scrapy import signals
import smtplib
from email.mime.text import MIMEText
from datetime import datetime

class EmailSpider(scrapy.Spider):
    name = 'email_spider'
    start_urls = ['https://example.com/products']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)

        # Connect signals
        crawler.signals.connect(spider.spider_opened_handler, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed_handler, signal=signals.spider_closed)
        crawler.signals.connect(spider.item_scraped_handler, signal=signals.item_scraped)

        # Initialize counters
        spider.items_count = 0
        spider.start_time = None

        return spider

    def spider_opened_handler(self, spider):
        spider.start_time = datetime.now()
        spider.logger.info(f'Spider started at {spider.start_time}')

    def item_scraped_handler(self, item, response, spider):
        spider.items_count += 1

    def spider_closed_handler(self, spider, reason):
        end_time = datetime.now()
        duration = end_time - spider.start_time

        # Prepare email
        subject = f'Spider {spider.name} finished'
        body = f'''
Spider: {spider.name}
Status: {reason}
Items scraped: {spider.items_count}
Started: {spider.start_time}
Finished: {end_time}
Duration: {duration}
        '''

        # Send email
        self.send_email(subject, body)
        spider.logger.info('Completion email sent!')

    def send_email(self, subject, body):
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = 'spider@example.com'
        msg['To'] = 'admin@example.com'

        with smtplib.SMTP('localhost') as server:
            server.send_message(msg)

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Real-World Use Case #2: Track Metrics and Performance

import scrapy
from scrapy import signals
import time

class MetricsSpider(scrapy.Spider):
    name = 'metrics'
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)

        # Connect all relevant signals
        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.on_item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(spider.on_item_dropped, signal=signals.item_dropped)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)

        # Initialize metrics
        spider.metrics = {
            'items_scraped': 0,
            'items_dropped': 0,
            'errors': 0,
            'start_time': None,
            'end_time': None
        }

        return spider

    def on_spider_opened(self, spider):
        spider.metrics['start_time'] = time.time()

    def on_item_scraped(self, item, response, spider):
        spider.metrics['items_scraped'] += 1

    def on_item_dropped(self, item, response, exception, spider):
        spider.metrics['items_dropped'] += 1

    def on_spider_error(self, failure, response, spider):
        spider.metrics['errors'] += 1

    def on_spider_closed(self, spider, reason):
        spider.metrics['end_time'] = time.time()

        # Calculate metrics
        duration = spider.metrics['end_time'] - spider.metrics['start_time']
        total_items = spider.metrics['items_scraped'] + spider.metrics['items_dropped']
        success_rate = (spider.metrics['items_scraped'] / total_items * 100) if total_items > 0 else 0
        items_per_second = spider.metrics['items_scraped'] / duration if duration > 0 else 0

        # Log detailed metrics
        spider.logger.info('='*60)
        spider.logger.info('SPIDER METRICS')
        spider.logger.info(f'Duration: {duration:.2f} seconds')
        spider.logger.info(f'Items scraped: {spider.metrics["items_scraped"]}')
        spider.logger.info(f'Items dropped: {spider.metrics["items_dropped"]}')
        spider.logger.info(f'Errors: {spider.metrics["errors"]}')
        spider.logger.info(f'Success rate: {success_rate:.1f}%')
        spider.logger.info(f'Speed: {items_per_second:.2f} items/sec')
        spider.logger.info('='*60)

        # Send to monitoring service (e.g., Datadog, Prometheus)
        self.send_to_monitoring(spider.metrics)

    def send_to_monitoring(self, metrics):
        # Send metrics to your monitoring service
        pass

    def parse(self, response):
        for item in response.css('.item'):
            yield {'name': item.css('h2::text').get()}

Real-World Use Case #3: Database Connection Management

import scrapy
from scrapy import signals
import psycopg2

class DatabaseSpider(scrapy.Spider):
    name = 'database'
    start_urls = ['https://example.com/products']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)

        # Connect signals for database management
        crawler.signals.connect(spider.spider_opened_handler, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed_handler, signal=signals.spider_closed)

        return spider

    def spider_opened_handler(self, spider):
        # Open database connection when spider starts
        spider.logger.info('Opening database connection...')
        spider.db_conn = psycopg2.connect(
            host='localhost',
            database='scraping',
            user='user',
            password='password'
        )
        spider.db_cursor = spider.db_conn.cursor()
        spider.logger.info('Database connected!')

    def spider_closed_handler(self, spider, reason):
        # Close database connection when spider finishes
        spider.logger.info('Closing database connection...')

        if hasattr(spider, 'db_cursor'):
            spider.db_cursor.close()

        if hasattr(spider, 'db_conn'):
            spider.db_conn.commit()  # Commit any pending transactions
            spider.db_conn.close()

        spider.logger.info('Database disconnected!')

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

            # Save directly to database
            self.db_cursor.execute(
                'INSERT INTO products (name, price) VALUES (%s, %s)',
                (item['name'], item['price'])
            )

            yield item

Real-World Use Case #4: Progress Updates to Slack

import scrapy
from scrapy import signals
import requests

class SlackSpider(scrapy.Spider):
    name = 'slack'
    start_urls = ['https://example.com']
    slack_webhook = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)

        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.on_item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)

        spider.item_count = 0
        spider.error_count = 0

        return spider

    def on_spider_opened(self, spider):
        self.send_slack(f'🕷️ Spider `{spider.name}` started!')

    def on_item_scraped(self, item, response, spider):
        spider.item_count += 1

        # Update every 100 items
        if spider.item_count % 100 == 0:
            self.send_slack(f'📊 Progress: {spider.item_count} items scraped')

    def on_spider_error(self, failure, response, spider):
        spider.error_count += 1
        self.send_slack(f'⚠️ Error #{spider.error_count}: {failure.getErrorMessage()[:100]}')

    def on_spider_closed(self, spider, reason):
        message = f'''
✅ Spider `{spider.name}` finished!
Reason: {reason}
Items scraped: {spider.item_count}
Errors: {spider.error_count}
        '''
        self.send_slack(message)

    def send_slack(self, message):
        requests.post(
            self.slack_webhook,
            json={'text': message}
        )

    def parse(self, response):
        for item in response.css('.item'):
            yield {'name': item.css('h2::text').get()}

The from_crawler Pattern (How to Actually Use Signals)

Every signal example uses from_crawler. Here's why and how it works:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # Step 1: Create spider instance normally
    spider = super().from_crawler(crawler, *args, **kwargs)

    # Step 2: Connect signal handlers
    crawler.signals.connect(spider.my_handler, signal=signals.spider_closed)

    # Step 3: Return spider
    return spider

Why this pattern?

  • from_crawler is called when Scrapy initializes your spider
  • It gives you access to the crawler object
  • The crawler object has the signals manager
  • You connect your handlers through the crawler

Without from_crawler:
Your spider never gets a handle on the crawler object, so there's nothing to connect your handlers to.


Signal Handler Function Signatures

Each signal passes different arguments to your handler. You don't need to accept all of them, but they're available:

spider_opened

def handler(self, spider):
    # spider: the spider instance
    pass

spider_closed

def handler(self, spider, reason):
    # spider: the spider instance
    # reason: why spider closed ('finished', 'cancelled', 'shutdown')
    pass

item_scraped

def handler(self, item, response, spider):
    # item: the scraped item
    # response: response that generated the item
    # spider: the spider instance
    pass

item_dropped

def handler(self, item, response, exception, spider):
    # item: the dropped item
    # response: response that generated the item
    # exception: the DropItem exception
    # spider: the spider instance
    pass

spider_error

def handler(self, failure, response, spider):
    # failure: Twisted Failure object with error details
    # response: response that caused the error
    # spider: the spider instance
    pass

Less Common But Useful Signals

engine_started / engine_stopped

Fires when Scrapy engine starts/stops. Use for global setup/teardown.

crawler.signals.connect(spider.engine_started_handler, signal=signals.engine_started)
crawler.signals.connect(spider.engine_stopped_handler, signal=signals.engine_stopped)
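The matching handlers can be as small as you like; these two signals don't send any extra arguments, so a bare method is enough:

def engine_started_handler(self):
    self.logger.info('Engine started')

def engine_stopped_handler(self):
    self.logger.info('Engine stopped')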

request_scheduled

Fires when a request is scheduled for download.

def request_scheduled_handler(self, request, spider):
    spider.logger.debug(f'Scheduled: {request.url}')

response_received

Fires when a response is received from downloader.

def response_received_handler(self, response, request, spider):
    spider.logger.debug(f'Received {len(response.body)} bytes from {response.url}')

Common Mistakes

Mistake #1: Not Using from_crawler

# WRONG (can't connect signals)
class MySpider(scrapy.Spider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # No access to crawler here!

# RIGHT
class MySpider(scrapy.Spider):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handler, signal=signals.spider_closed)
        return spider

Mistake #2: Wrong Function Signature

# WORKS, but you never find out why the spider closed
def spider_closed_handler(self, spider):
    pass

# WRONG (the signal sends 'reason', not 'close_reason' -> TypeError in your logs)
def spider_closed_handler(self, spider, close_reason):
    pass

# RIGHT
def spider_closed_handler(self, spider, reason):
    pass

Signal arguments are passed by keyword, so parameter names must match what the signal sends. A misnamed parameter shows up as a TypeError in the logs; a parameter you leave out entirely is simply not delivered, which is fine if you don't need it.
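If you'd rather not memorize every signature, one defensive option is to accept the arguments you care about and soak up the rest with **kwargs, since Scrapy delivers signal arguments as keyword arguments:

def spider_closed_handler(self, spider, reason, **kwargs):
    spider.logger.info(f'{spider.name} closed ({reason})')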

Mistake #3: Not Returning Spider from from_crawler

# WRONG (forgot to return spider)
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handler, signal=signals.spider_closed)
    # Missing return!

# RIGHT
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handler, signal=signals.spider_closed)
    return spider  # Don't forget this!

Mistake #4: Trying to Handle Events Inside process_item

# WRONG (process_item only sees items; it never hears about spider start/finish)
class MyPipeline:
    def process_item(self, item, spider):
        # No way to react to spider_opened / spider_closed from here
        return item

# RIGHT (pipelines support from_crawler too - connect signals there)

Item pipelines can define the same from_crawler classmethod as spiders, so you can connect signal handlers on the pipeline itself. For crawl-wide logic that has nothing to do with items, a custom extension is usually the cleaner home.
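A minimal sketch of that pattern (the class and handler names are placeholders):

from scrapy import signals

class MyPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        spider.logger.info('Pipeline: spider opened')

    def spider_closed(self, spider, reason):
        spider.logger.info(f'Pipeline: spider closed ({reason})')

    def process_item(self, item, spider):
        return item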


Quick Reference

Most Used Signals

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)

    # Spider lifecycle
    crawler.signals.connect(spider.on_opened, signal=signals.spider_opened)
    crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)

    # Items
    crawler.signals.connect(spider.on_item_scraped, signal=signals.item_scraped)
    crawler.signals.connect(spider.on_item_dropped, signal=signals.item_dropped)

    # Errors
    crawler.signals.connect(spider.on_error, signal=signals.spider_error)

    return spider

Signal Handler Templates

def on_opened(self, spider):
    spider.logger.info('Spider opened!')

def on_closed(self, spider, reason):
    spider.logger.info(f'Spider closed: {reason}')

def on_item_scraped(self, item, response, spider):
    spider.logger.info(f'Item scraped: {item}')

def on_item_dropped(self, item, response, exception, spider):
    spider.logger.warning(f'Item dropped: {exception}')

def on_error(self, failure, response, spider):
    spider.logger.error(f'Error: {failure.getErrorMessage()}')

Summary

Signals let you run code when specific events happen:

  • spider_opened - Setup (open connections, start timers)
  • spider_closed - Cleanup (close connections, send reports)
  • item_scraped - Track progress, send alerts
  • item_dropped - Monitor quality, debug issues
  • spider_error - Track errors, send alerts

Always use from_crawler pattern:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handler, signal=signals.spider_closed)
    return spider

Real use cases:

  • Send completion emails
  • Track metrics and performance
  • Manage database connections
  • Send progress to Slack/Discord
  • Alert on errors or anomalies
  • Upload results to cloud storage

Start with spider_closed for completion notifications. Add others as you need them.

Happy scraping! 🕷️
