When I first heard about Scrapy signals, I thought "why would I need this?" I was scraping just fine without them.
Then I needed to send an email when my spider finished. I tried putting it at the end of parse(). Didn't work. The spider kept running after parse() finished.
I tried putting it in a pipeline. Didn't work either. The pipeline processes items, not spider completion.
Finally, I learned about signals. Specifically, spider_closed. Problem solved in five minutes.
Signals are Scrapy's way of saying "hey, something just happened!" You can listen for these events and run custom code. Let me show you how.
What Are Signals, Really?
Think of signals like notifications on your phone. Something happens (new message, battery low, screenshot taken), and your phone tells you.
In Scrapy:
- Spider opens → spider_opened signal fires
- Item gets scraped → item_scraped signal fires
- Spider closes → spider_closed signal fires
- A callback raises an exception → spider_error signal fires
You can "subscribe" to these notifications and run code when they happen.
Without signals:
You have no way to know when these events occur. Your spider runs, stuff happens, and you're blind to it.
With signals:
Your spider tells you exactly what's happening. You can react to events in real time.
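To make "subscribe" concrete before we dig in, here's a tiny, self-contained sketch that runs a spider from a script and listens for spider_closed. It's just an illustration (quotes.toscrape.com is a practice site); the rest of this post does the same thing from inside the spider using the from_crawler pattern.
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}


def on_closed(spider, reason):
    # Runs once the crawl is over
    print(f'{spider.name} closed: {reason}')


process = CrawlerProcess()
crawler = process.create_crawler(QuotesSpider)
crawler.signals.connect(on_closed, signal=signals.spider_closed)
process.crawl(crawler)
process.start()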
The Most Important Signals (Start Here)
Let's learn the signals you'll actually use, starting with the most common.
spider_opened (When Spider Starts)
Fires AFTER the spider starts running, BEFORE any requests are made.
Use cases:
- Open database connections
- Initialize counters
- Start timers
- Send "spider started" notifications
- Set up resources
import time

import scrapy
from scrapy import signals


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened_handler, signal=signals.spider_opened)
        return spider

    def spider_opened_handler(self, spider):
        spider.logger.info(f'{spider.name} has started!')
        spider.start_time = time.time()  # used later to measure runtime

    def parse(self, response):
        yield {'data': response.css('h1::text').get()}
When it fires: Right after spider initialization, before first request.
spider_closed (When Spider Finishes)
Fires AFTER the spider completes. This is the most used signal.
Use cases:
- Close database connections
- Send completion emails
- Upload results to S3
- Calculate total runtime
- Send metrics to monitoring services
- Clean up temporary files
# Inside your spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed_handler, signal=signals.spider_closed)
    return spider

def spider_closed_handler(self, spider, reason):
    spider.logger.info(f'{spider.name} closed. Reason: {reason}')
    if hasattr(spider, 'start_time'):
        duration = time.time() - spider.start_time
        spider.logger.info(f'Ran for {duration:.2f} seconds')
    # Send completion email (send_email is your own helper)
    send_email(
        subject=f'Spider {spider.name} finished',
        body=f'Reason: {reason}'
    )
When it fires: After spider finishes, before Scrapy shuts down.
The reason parameter:
- 'finished' - the spider completed normally
- 'cancelled' - the spider was cancelled
- 'shutdown' - the engine was shut down (e.g., Ctrl+C)
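Those are the built-in values, but you can also stop a crawl with your own reason by raising CloseSpider from a callback - that reason is what spider_closed receives. A minimal sketch (the captcha check is just a made-up condition):
from scrapy.exceptions import CloseSpider

def parse(self, response):
    if 'captcha' in response.text.lower():
        # spider_closed will fire with reason='captcha_detected'
        raise CloseSpider('captcha_detected')
    yield {'title': response.css('title::text').get()}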
item_scraped (When Item Passes Through Pipelines)
Fires AFTER an item successfully goes through ALL pipelines without being dropped.
Use cases:
- Count scraped items
- Update progress in real-time
- Send alerts for specific items
- Track scraping rate
# Inside your spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.items_scraped = 0
    crawler.signals.connect(spider.item_scraped_handler, signal=signals.item_scraped)
    return spider

def item_scraped_handler(self, item, response, spider):
    spider.items_scraped += 1
    # Log progress every 100 items
    if spider.items_scraped % 100 == 0:
        spider.logger.info(f'Scraped {spider.items_scraped} items so far')
    # Alert on expensive items (send_alert is your own helper)
    if 'price' in item and float(item['price']) > 1000:
        send_alert(f'Expensive item found: {item["name"]} at ${item["price"]}')
When it fires: After item successfully passes through all pipelines.
item_dropped (When Item Gets Rejected)
Fires when a pipeline drops an item (raises DropItem).
Use cases:
- Count rejected items
- Log why items were dropped
- Alert when drop rate is too high
- Debug pipeline issues
from scrapy import signals
from scrapy.exceptions import DropItem

# Inside your spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.items_dropped = 0
    crawler.signals.connect(spider.item_dropped_handler, signal=signals.item_dropped)
    return spider

def item_dropped_handler(self, item, response, exception, spider):
    spider.items_dropped += 1
    spider.logger.warning(f'Item dropped: {exception}')
    # Alert if drop rate is too high (send_alert is your own helper)
    if hasattr(spider, 'items_scraped'):
        total = spider.items_scraped + spider.items_dropped
        drop_rate = spider.items_dropped / total if total > 0 else 0
        if drop_rate > 0.5:  # More than 50% dropped
            send_alert(f'High drop rate: {drop_rate:.1%}')
When it fires: When a pipeline raises DropItem.
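For context, this is the kind of pipeline that triggers it - anything that raises DropItem. A minimal sketch (the missing-price rule is just an example):
from scrapy.exceptions import DropItem

class PricePipeline:
    def process_item(self, item, spider):
        if not item.get('price'):
            # This raise is what fires the item_dropped signal
            raise DropItem(f'Missing price in {item!r}')
        return item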
spider_error (When Something Breaks)
Fires when a spider callback raises an exception.
Use cases:
- Track error rates
- Alert on critical errors
- Log detailed error info
- Trigger retries or fallbacks
# Inside your spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.error_count = 0
    crawler.signals.connect(spider.spider_error_handler, signal=signals.spider_error)
    return spider

def spider_error_handler(self, failure, response, spider):
    spider.error_count += 1
    spider.logger.error(f'Error #{spider.error_count}: {failure.getErrorMessage()}')
    spider.logger.error(f'Failed URL: {response.url}')
    # Alert on repeated errors (send_alert is your own helper)
    if spider.error_count > 10:
        send_alert(f'Spider {spider.name} has {spider.error_count} errors!')
When it fires: When parse() or any callback raises an exception.
Real-World Use Case #1: Send Completion Email
import scrapy
from scrapy import signals
import smtplib
from email.mime.text import MIMEText
from datetime import datetime


class EmailSpider(scrapy.Spider):
    name = 'email_spider'
    start_urls = ['https://example.com/products']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect signals
        crawler.signals.connect(spider.spider_opened_handler, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed_handler, signal=signals.spider_closed)
        crawler.signals.connect(spider.item_scraped_handler, signal=signals.item_scraped)
        # Initialize counters
        spider.items_count = 0
        spider.start_time = None
        return spider

    def spider_opened_handler(self, spider):
        spider.start_time = datetime.now()
        spider.logger.info(f'Spider started at {spider.start_time}')

    def item_scraped_handler(self, item, response, spider):
        spider.items_count += 1

    def spider_closed_handler(self, spider, reason):
        end_time = datetime.now()
        duration = end_time - spider.start_time
        # Prepare email
        subject = f'Spider {spider.name} finished'
        body = f'''
        Spider: {spider.name}
        Status: {reason}
        Items scraped: {spider.items_count}
        Started: {spider.start_time}
        Finished: {end_time}
        Duration: {duration}
        '''
        # Send email
        self.send_email(subject, body)
        spider.logger.info('Completion email sent!')

    def send_email(self, subject, body):
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = 'spider@example.com'
        msg['To'] = 'admin@example.com'
        with smtplib.SMTP('localhost') as server:
            server.send_message(msg)

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
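A side note on the email itself: smtplib works, but it blocks while it talks to the mail server. Scrapy also ships its own non-blocking helper, scrapy.mail.MailSender, which you could use inside the handler instead - a sketch, assuming your MAIL_* settings are configured:
from scrapy.mail import MailSender

def spider_closed_handler(self, spider, reason):
    # Build the mailer from the crawler's settings (spider.crawler is set by from_crawler)
    mailer = MailSender.from_settings(spider.crawler.settings)
    mailer.send(
        to=['admin@example.com'],
        subject=f'Spider {spider.name} finished',
        body=f'Reason: {reason}, items scraped: {spider.items_count}',
    )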
Real-World Use Case #2: Track Metrics and Performance
import scrapy
from scrapy import signals
import time


class MetricsSpider(scrapy.Spider):
    name = 'metrics'
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect all relevant signals
        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.on_item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(spider.on_item_dropped, signal=signals.item_dropped)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        # Initialize metrics
        spider.metrics = {
            'items_scraped': 0,
            'items_dropped': 0,
            'errors': 0,
            'start_time': None,
            'end_time': None
        }
        return spider

    def on_spider_opened(self, spider):
        spider.metrics['start_time'] = time.time()

    def on_item_scraped(self, item, response, spider):
        spider.metrics['items_scraped'] += 1

    def on_item_dropped(self, item, response, exception, spider):
        spider.metrics['items_dropped'] += 1

    def on_spider_error(self, failure, response, spider):
        spider.metrics['errors'] += 1

    def on_spider_closed(self, spider, reason):
        spider.metrics['end_time'] = time.time()
        # Calculate metrics
        duration = spider.metrics['end_time'] - spider.metrics['start_time']
        total_items = spider.metrics['items_scraped'] + spider.metrics['items_dropped']
        success_rate = (spider.metrics['items_scraped'] / total_items * 100) if total_items > 0 else 0
        items_per_second = spider.metrics['items_scraped'] / duration if duration > 0 else 0
        # Log detailed metrics
        spider.logger.info('=' * 60)
        spider.logger.info('SPIDER METRICS')
        spider.logger.info(f'Duration: {duration:.2f} seconds')
        spider.logger.info(f'Items scraped: {spider.metrics["items_scraped"]}')
        spider.logger.info(f'Items dropped: {spider.metrics["items_dropped"]}')
        spider.logger.info(f'Errors: {spider.metrics["errors"]}')
        spider.logger.info(f'Success rate: {success_rate:.1f}%')
        spider.logger.info(f'Speed: {items_per_second:.2f} items/sec')
        spider.logger.info('=' * 60)
        # Send to monitoring service (e.g., Datadog, Prometheus)
        self.send_to_monitoring(spider.metrics)

    def send_to_monitoring(self, metrics):
        # Send metrics to your monitoring service
        pass

    def parse(self, response):
        for item in response.css('.item'):
            yield {'name': item.css('h2::text').get()}
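Worth knowing: Scrapy's built-in stats collector already counts a lot of this for you (items, requests, responses, and more), so inside a spider_closed handler you can read those numbers instead of tracking them by hand. A small sketch; which keys are present depends on your crawl:
def on_spider_closed(self, spider, reason):
    # spider.crawler is set automatically when the spider is created via from_crawler
    stats = spider.crawler.stats.get_stats()
    spider.logger.info(f"Items scraped: {stats.get('item_scraped_count', 0)}")
    spider.logger.info(f"Requests sent: {stats.get('downloader/request_count', 0)}")
    spider.logger.info(f"Responses received: {stats.get('downloader/response_count', 0)}")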
Real-World Use Case #3: Database Connection Management
import scrapy
from scrapy import signals
import psycopg2


class DatabaseSpider(scrapy.Spider):
    name = 'database'
    start_urls = ['https://example.com/products']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect signals for database management
        crawler.signals.connect(spider.spider_opened_handler, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed_handler, signal=signals.spider_closed)
        return spider

    def spider_opened_handler(self, spider):
        # Open database connection when spider starts
        spider.logger.info('Opening database connection...')
        spider.db_conn = psycopg2.connect(
            host='localhost',
            database='scraping',
            user='user',
            password='password'
        )
        spider.db_cursor = spider.db_conn.cursor()
        spider.logger.info('Database connected!')

    def spider_closed_handler(self, spider, reason):
        # Close database connection when spider finishes
        spider.logger.info('Closing database connection...')
        if hasattr(spider, 'db_cursor'):
            spider.db_cursor.close()
        if hasattr(spider, 'db_conn'):
            spider.db_conn.commit()  # Commit any pending transactions
            spider.db_conn.close()
        spider.logger.info('Database disconnected!')

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
            # Save directly to database
            self.db_cursor.execute(
                'INSERT INTO products (name, price) VALUES (%s, %s)',
                (item['name'], item['price'])
            )
            yield item
Real-World Use Case #4: Progress Updates to Slack
import scrapy
from scrapy import signals
import requests


class SlackSpider(scrapy.Spider):
    name = 'slack'
    start_urls = ['https://example.com']
    slack_webhook = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.on_item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        spider.item_count = 0
        spider.error_count = 0
        return spider

    def on_spider_opened(self, spider):
        self.send_slack(f'🕷️ Spider `{spider.name}` started!')

    def on_item_scraped(self, item, response, spider):
        spider.item_count += 1
        # Update every 100 items
        if spider.item_count % 100 == 0:
            self.send_slack(f'📊 Progress: {spider.item_count} items scraped')

    def on_spider_error(self, failure, response, spider):
        spider.error_count += 1
        self.send_slack(f'⚠️ Error #{spider.error_count}: {failure.getErrorMessage()[:100]}')

    def on_spider_closed(self, spider, reason):
        message = f'''
        ✅ Spider `{spider.name}` finished!
        Reason: {reason}
        Items scraped: {spider.item_count}
        Errors: {spider.error_count}
        '''
        self.send_slack(message)

    def send_slack(self, message):
        requests.post(
            self.slack_webhook,
            json={'text': message}
        )

    def parse(self, response):
        for item in response.css('.item'):
            yield {'name': item.css('h2::text').get()}
The from_crawler Pattern (How to Actually Use Signals)
Every signal example uses from_crawler. Here's why and how it works:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # Step 1: Create spider instance normally
    spider = super().from_crawler(crawler, *args, **kwargs)
    # Step 2: Connect signal handlers
    crawler.signals.connect(spider.my_handler, signal=signals.spider_closed)
    # Step 3: Return spider
    return spider
Why this pattern?
- from_crawler is called when Scrapy initializes your spider
- It gives you access to the crawler object
- The crawler object has the signals manager
- You connect your handlers through the crawler
Without from_crawler:
You have no hook that hands you both the spider and the crawler, so there's no clean place to connect signals. Simple as that.
Signal Handler Function Signatures
Each signal passes different arguments to your handler. You don't need to accept all of them, but they're available:
spider_opened
def handler(self, spider):
    # spider: the spider instance
    pass
spider_closed
def handler(self, spider, reason):
    # spider: the spider instance
    # reason: why the spider closed ('finished', 'cancelled', 'shutdown')
    pass
item_scraped
def handler(self, item, response, spider):
    # item: the scraped item
    # response: the response that generated the item
    # spider: the spider instance
    pass
item_dropped
def handler(self, item, response, exception, spider):
    # item: the dropped item
    # response: the response that generated the item
    # exception: the DropItem exception
    # spider: the spider instance
    pass
spider_error
def handler(self, failure, response, spider):
    # failure: Twisted Failure object with error details
    # response: the response that caused the error
    # spider: the spider instance
    pass
Less Common But Useful Signals
engine_started / engine_stopped
Fires when Scrapy engine starts/stops. Use for global setup/teardown.
crawler.signals.connect(spider.engine_started_handler, signal=signals.engine_started)
crawler.signals.connect(spider.engine_stopped_handler, signal=signals.engine_stopped)
request_scheduled
Fires when a request is scheduled for download.
def request_scheduled_handler(self, request, spider):
    spider.logger.debug(f'Scheduled: {request.url}')
response_received
Fires when a response is received from downloader.
def response_received_handler(self, response, request, spider):
    spider.logger.debug(f'Received {len(response.body)} bytes from {response.url}')
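These connect exactly like the lifecycle signals. Here's a small sketch that wires up response_received to keep a running total of bytes downloaded (the bytes_downloaded attribute is made up for this example):
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    spider.bytes_downloaded = 0
    crawler.signals.connect(spider.response_received_handler, signal=signals.response_received)
    return spider

def response_received_handler(self, response, request, spider):
    spider.bytes_downloaded += len(response.body)
    spider.logger.debug(f'{spider.bytes_downloaded} bytes downloaded so far')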
Common Mistakes
Mistake #1: Not Using from_crawler
# WRONG (can't connect signals)
class MySpider(scrapy.Spider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # No access to the crawler here!

# RIGHT
class MySpider(scrapy.Spider):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handler, signal=signals.spider_closed)
        return spider
Mistake #2: Wrong Function Signature
# WRONG (spider_closed doesn't send a 'close_reason' argument)
def spider_closed_handler(self, spider, close_reason):
    pass

# RIGHT (names match what the signal sends: 'spider' and 'reason')
def spider_closed_handler(self, spider, reason):
    pass
Scrapy only passes the arguments whose names match your handler's parameters, so you can leave out ones you don't need - but a required parameter the signal doesn't send (like the misspelled close_reason above) raises an error.
Mistake #3: Not Returning Spider from from_crawler
# WRONG (forgot to return spider)
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handler, signal=signals.spider_closed)
    # Missing return!

# RIGHT
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handler, signal=signals.spider_closed)
    return spider  # Don't forget this!
Mistake #4: Trying to Connect Signals from process_item
# WRONG (process_item has no access to the crawler)
class MyPipeline:
    def process_item(self, item, spider):
        # Can't connect signals here
        return item

# RIGHT: give the pipeline its own from_crawler (or move the logic into an extension)
Pipelines can use signals too - Scrapy calls from_crawler on a pipeline class if you define it, so you can connect handlers there just like in a spider.
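A minimal sketch - the class name SignalAwarePipeline is made up, but the from_crawler hook on pipelines is standard Scrapy:
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class SignalAwarePipeline:
    def __init__(self):
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def spider_closed(self, spider, reason):
        logger.info('Pipeline processed %d items before %s closed (%s)', self.count, spider.name, reason)

    def process_item(self, item, spider):
        self.count += 1
        return item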
Quick Reference
Most Used Signals
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    # Spider lifecycle
    crawler.signals.connect(spider.on_opened, signal=signals.spider_opened)
    crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)
    # Items
    crawler.signals.connect(spider.on_item_scraped, signal=signals.item_scraped)
    crawler.signals.connect(spider.on_item_dropped, signal=signals.item_dropped)
    # Errors
    crawler.signals.connect(spider.on_error, signal=signals.spider_error)
    return spider
Signal Handler Templates
def on_opened(self, spider):
    spider.logger.info('Spider opened!')

def on_closed(self, spider, reason):
    spider.logger.info(f'Spider closed: {reason}')

def on_item_scraped(self, item, response, spider):
    spider.logger.info(f'Item scraped: {item}')

def on_item_dropped(self, item, response, exception, spider):
    spider.logger.warning(f'Item dropped: {exception}')

def on_error(self, failure, response, spider):
    spider.logger.error(f'Error: {failure.getErrorMessage()}')
Summary
Signals let you run code when specific events happen:
- spider_opened - Setup (open connections, start timers)
- spider_closed - Cleanup (close connections, send reports)
- item_scraped - Track progress, send alerts
- item_dropped - Monitor quality, debug issues
- spider_error - Track errors, send alerts
Always use from_crawler pattern:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handler, signal=signals.spider_closed)
    return spider
Real use cases:
- Send completion emails
- Track metrics and performance
- Manage database connections
- Send progress to Slack/Discord
- Alert on errors or anomalies
- Upload results to cloud storage
Start with spider_closed for completion notifications. Add others as you need them.
Happy scraping! 🕷️