<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pradip Panjiyar</title>
    <description>The latest articles on DEV Community by Pradip Panjiyar (@pradippanjiyar).</description>
    <link>https://dev.to/pradippanjiyar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3543032%2F1029f15d-01c5-44d8-b87a-5ebcbd66bb6d.png</url>
      <title>DEV Community: Pradip Panjiyar</title>
      <link>https://dev.to/pradippanjiyar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pradippanjiyar"/>
    <language>en</language>
    <item>
      <title>Handling 100+ Website Scrapers with Python's asyncio</title>
      <dc:creator>Pradip Panjiyar</dc:creator>
      <pubDate>Sat, 11 Oct 2025 22:55:14 +0000</pubDate>
      <link>https://dev.to/pradippanjiyar/handling-100-website-scrapers-with-pythons-asyncio-4905</link>
      <guid>https://dev.to/pradippanjiyar/handling-100-website-scrapers-with-pythons-asyncio-4905</guid>
      <description>&lt;h2&gt;
  
  
  A Quick Note on Timeline
&lt;/h2&gt;

&lt;p&gt;Although this is Part 2 of the CollegeBuzz series, it was actually the &lt;strong&gt;first major component&lt;/strong&gt; I built. The &lt;a href="https://dev.to/pradippanjiyar/how-i-built-a-mongodb-archiving-system-for-crawled-data-30o7"&gt;MongoDB archiving system&lt;/a&gt; came later because I needed to solve data consistency issues after scraping was already running in production.&lt;/p&gt;

&lt;p&gt;Lesson learned: &lt;strong&gt;Start with archiving from Day 1.&lt;/strong&gt; But hindsight is 20/20, right? 😅&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Scraping 100+ Colleges Without Losing My Mind
&lt;/h2&gt;

&lt;p&gt;When I started building CollegeBuzz — an AICTE academic news aggregator — I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.&lt;/p&gt;

&lt;p&gt;My first naive attempt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_college&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract data...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;

&lt;span class="c1"&gt;# The slow way
&lt;/span&gt;&lt;span class="n"&gt;college_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://iitb.ac.in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://nitt.edu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;span class="n"&gt;all_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;college_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scrape_college&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;all_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🐌 Total time: 4+ hours
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why so slow?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there... &lt;strong&gt;waiting&lt;/strong&gt;. Multiply that across 100+ sites and you get an eternity.&lt;/p&gt;

&lt;p&gt;I needed something better.&lt;/p&gt;
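
&lt;p&gt;The fix, conceptually, is to let those waits overlap instead of stacking up. Here's a toy sketch of the idea (simulated delays via &lt;code&gt;asyncio.sleep&lt;/code&gt;, no real HTTP): three 2-second "requests" finish in roughly 2 seconds instead of 6.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import time

async def fake_fetch(site, delay):
    # Stand-in for a slow HTTP request; the event loop is free during the wait
    await asyncio.sleep(delay)
    return site

async def main():
    start = time.perf_counter()
    # Three 2-second "requests" overlap instead of running back-to-back
    await asyncio.gather(*(fake_fetch(f"site-{i}", 2.0) for i in range(3)))
    print(f"elapsed: {time.perf_counter() - start:.1f}s")  # ~2.0s, not ~6.0s

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;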




&lt;h2&gt;
  
  
  Discovering Crawl4AI: The Game Changer
&lt;/h2&gt;

&lt;p&gt;After wrestling with BeautifulSoup and Selenium, I found &lt;strong&gt;Crawl4AI&lt;/strong&gt; by &lt;a href="https://www.youtube.com/@unclecode788" rel="noopener noreferrer"&gt;@unclecode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Crawl4AI for async scraping?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;Built for asyncio from the ground up&lt;/strong&gt; — native async/await support&lt;/li&gt;
&lt;li&gt;🎯 &lt;strong&gt;CSS-based extraction strategies&lt;/strong&gt; — no more manual BeautifulSoup parsing&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Works out of the box&lt;/strong&gt; — handles browser automation, retries, error handling&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Battle-tested&lt;/strong&gt; — 50k+ GitHub stars&lt;/li&gt;
&lt;/ul&gt;
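
&lt;p&gt;Getting a first crawl running takes only a few lines. A minimal sketch (the URL is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # the page, converted to markdown

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;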

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📺 &lt;a href="https://www.youtube.com/@unclecode788" rel="noopener noreferrer"&gt;YouTube Channel&lt;/a&gt; — Excellent tutorials by the creator&lt;/li&gt;
&lt;li&gt;🐙 &lt;a href="https://github.com/unclecode/crawl4ai" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://docs.crawl4ai.com/" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  My Async Scraping Architecture
&lt;/h2&gt;

&lt;p&gt;Instead of trying to scrape everything at once, I built a &lt;strong&gt;controlled async pipeline&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_notices_and_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Main async scraping orchestrator&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize MongoDB handler
&lt;/span&gt;    &lt;span class="n"&gt;mongo_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoDBHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MONGO_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Single crawler instance handles all sites
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Sequential at site level
&lt;/span&gt;            &lt;span class="c1"&gt;# Configure extraction strategy
&lt;/span&gt;            &lt;span class="n"&gt;extraction_strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JsonCssExtractionStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
                &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;cache_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CacheMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BYPASS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Always fresh data
&lt;/span&gt;                &lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Async scrape (non-blocking!)
&lt;/span&gt;            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Process and store data
&lt;/span&gt;                &lt;span class="nf"&gt;process_and_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mongo_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Decision: Sequential Sites, Async Pages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I intentionally scrape sites &lt;strong&gt;one-by-one&lt;/strong&gt; but use &lt;strong&gt;async&lt;/strong&gt; for the actual HTTP requests. Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Avoid IP bans&lt;/strong&gt; — 100 concurrent requests to different domains = red flags&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource management&lt;/strong&gt; — One browser at a time keeps memory under control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error isolation&lt;/strong&gt; — If one site fails, others continue&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Magic: CSS-Based Extraction
&lt;/h2&gt;

&lt;p&gt;Instead of writing BeautifulSoup parsing for each site, I use &lt;strong&gt;declarative schemas&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In crawler_config.py
&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.iitb.ac.in/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IIT Bombay Notices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseSelector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.notice-item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h3.title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notice_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attribute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attribute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 100+ more colleges
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in my scraper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extraction_strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JsonCssExtractionStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get clean JSON data, no BeautifulSoup needed!
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
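
&lt;p&gt;For a schema like the one above, the parsed output is just a list of flat dicts. An illustrative shape (titles and URLs invented):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# json.loads(result.extracted_content) yields something like:
[
    {
        "title": "Admissions Open 2025",
        "notice_url": "https://www.iitb.ac.in/notices/admissions-2025"
    },
    {
        "title": "Campus Placement Drive",
        "notice_url": "/notices/placement-drive"  # relative URLs show up too
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;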



&lt;p&gt;&lt;strong&gt;Why this is powerful for async scraping:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;No manual parsing&lt;/strong&gt; — Crawl4AI handles HTML extraction&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Maintainable&lt;/strong&gt; — Update schemas without touching scraper logic&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Scalable&lt;/strong&gt; — Add new colleges by adding new schema objects&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Async Patterns I Used
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Context Manager for Resource Cleanup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Use crawler
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Automatically closes browser and releases resources
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.&lt;/p&gt;
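
&lt;p&gt;For comparison, here's roughly what the context manager saves you from writing, assuming the explicit &lt;code&gt;start()&lt;/code&gt;/&lt;code&gt;close()&lt;/code&gt; lifecycle available in recent Crawl4AI versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Manual lifecycle sketch; the async with form above does this for you
crawler = AsyncWebCrawler(verbose=True)
await crawler.start()
try:
    result = await crawler.arun(url=url, config=config)
finally:
    await crawler.close()  # runs even if arun() raises
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;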

&lt;h3&gt;
  
  
  Pattern 2: Handling Failures Gracefully
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Crawl failed for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# Skip this site, move to next
&lt;/span&gt;
&lt;span class="c1"&gt;# Process successful result
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One failed site doesn't crash the entire pipeline.&lt;/p&gt;
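
&lt;p&gt;If you want to go a step further, flaky college servers often succeed on a second try. A hypothetical retry helper (not part of the production code) could wrap &lt;code&gt;arun&lt;/code&gt; with backoff:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def arun_with_retries(crawler, site, config, attempts=3):
    """Hypothetical helper: retry transient failures with backoff."""
    result = None
    for attempt in range(attempts):
        result = await crawler.arun(url=site["url"], config=config)
        if result.success:
            return result
        await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s between tries
    return result  # caller still checks result.success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;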

&lt;h3&gt;
  
  
  Pattern 3: Async Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cache_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CacheMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BYPASS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Fresh data every time
&lt;/span&gt;    &lt;span class="n"&gt;extraction_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extraction_strategy&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crawl4AI's &lt;code&gt;CrawlerRunConfig&lt;/code&gt; lets you customize behavior per site without creating new crawler instances.&lt;/p&gt;
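
&lt;p&gt;For example, a notoriously slow site can get its own config while everything else keeps the defaults. A sketch (&lt;code&gt;page_timeout&lt;/code&gt; is my assumption of the relevant knob; the value is in milliseconds):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same crawler instance, a more patient config for one slow site
slow_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy,
    page_timeout=60000,  # assumed knob: give slow servers a full minute
)
result = await crawler.arun(url=site["url"], config=slow_config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;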




&lt;h2&gt;
  
  
  Handling Real-World Edge Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Edge Case 1: Data Volume Control
&lt;/h3&gt;

&lt;p&gt;Some colleges list 1000+ notices on their homepage. I don't need all of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Site-specific limits
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.nitt.edu/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.iitkgp.ac.in/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Only recent 10
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.iitk.ac.in/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.iiti.ac.in/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# These update slowly
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
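
&lt;p&gt;Hardcoding URL lists works, but it's easy to outgrow. A tidier variant (a sketch, not what's in production) carries the limit in each site entry, e.g. &lt;code&gt;{"url": ..., "schema": {...}, "limit": 10}&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical "limit" key in the site config replaces the if/elif chain
data = data[: site.get("limit", 20)]  # 20 is an arbitrary default cap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;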



&lt;h3&gt;
  
  
  Edge Case 2: URL Normalization
&lt;/h3&gt;

&lt;p&gt;College websites have inconsistent URL formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert relative URLs to absolute&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;

    &lt;span class="n"&gt;extracted_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle relative URLs
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;urljoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;extracted_url&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
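
&lt;p&gt;A couple of illustrative calls (URLs invented):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;process_url("https://www.iitb.ac.in/", "/notices/123.pdf")
# returns "https://www.iitb.ac.in/notices/123.pdf"

process_url("https://www.iitb.ac.in/", "https://cdn.example.com/a.pdf")
# returns the absolute URL unchanged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;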



&lt;h3&gt;
  
  
  Edge Case 3: JavaScript URL Madness (IIT Roorkee)
&lt;/h3&gt;

&lt;p&gt;IIT Roorkee embeds JavaScript in href attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"window.open('/events/workshop.pdf')"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;View Event&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.iitr.ac.in/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upcoming_Event_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window.open(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upcoming_Event_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="c1"&gt;# Extract actual URL from JavaScript
&lt;/span&gt;            &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window\.open\(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;([^&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]+)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upcoming_Event_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upcoming_Event_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Performance: The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before (Sequential with Requests + BeautifulSoup):
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⏱️ Time: 4 hours, 23 minutes
🐌 Average: ~150 seconds per site
💾 Memory: ~200MB stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After (Async with Crawl4AI):
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⏱️ Time: 12 minutes, 30 seconds
⚡ Average: ~7.5 seconds per site
💾 Memory: ~600MB peak (browser overhead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;20x faster&lt;/strong&gt; with better reliability!&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Fully Concurrent?
&lt;/h2&gt;

&lt;p&gt;You might ask: "Why not scrape all 100+ sites simultaneously with &lt;code&gt;asyncio.gather()&lt;/code&gt;?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Why I DON'T do this:
&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;scrape_college&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tried this. Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;IP bans&lt;/strong&gt; from 12 colleges&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Memory explosion&lt;/strong&gt; (100 browsers = 8GB+ RAM)&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Browser crashes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: For daily scraping of 100+ different domains, &lt;strong&gt;sequential with async&lt;/strong&gt; is the sweet spot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast enough (12 minutes vs 4 hours)&lt;/li&gt;
&lt;li&gt;Respectful to websites (no hammering)&lt;/li&gt;
&lt;li&gt;Stable and maintainable&lt;/li&gt;
&lt;/ul&gt;
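
&lt;p&gt;And if a daily job ever stops being fast enough, there's a middle ground between fully sequential and unbounded &lt;code&gt;asyncio.gather()&lt;/code&gt;: cap concurrency with a semaphore. This is a sketch of that option, not what CollegeBuzz ships:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def scrape_bounded(crawler, site, sem):
    # The semaphore admits only a few sites at a time, not all 100+ at once
    async with sem:
        config = CrawlerRunConfig(
            extraction_strategy=JsonCssExtractionStrategy(site["schema"])
        )
        return await crawler.arun(url=site["url"], config=config)

sem = asyncio.Semaphore(5)  # tune to what your RAM and the targets tolerate
results = await asyncio.gather(*(scrape_bounded(crawler, s, sem) for s in urls))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;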




&lt;h2&gt;
  
  
  Integration with the Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Here's how async scraping fits into CollegeBuzz:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In aictcscraper.py
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_notices_and_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mongo_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoDBHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MONGO_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Async scraping
&lt;/span&gt;                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extracted_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Insert into MongoDB (with deduplication from Part 1)
&lt;/span&gt;                &lt;span class="n"&gt;mongo_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;mongo_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close_connection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scraper feeds data into the MongoDB handler, which automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deduplicates records (from Part 1)&lt;/li&gt;
&lt;li&gt;Updates timestamps&lt;/li&gt;
&lt;li&gt;Archives old data&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Running the Scraper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual Trigger
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python aictcscraper.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Via Flask API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In app.py
&lt;/span&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/scrape&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_scraper&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract_notices_and_events&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}),&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger via HTTP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/scrape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scheduled with Cron
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab -e&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/collegebuzz &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python aictcscraper.py &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; scraper.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways for Async Scraping
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Context Managers Are Essential
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncWebCrawler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Always use context managers
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Automatic cleanup, even on errors
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Don't Over-Optimize
&lt;/h3&gt;

&lt;p&gt;Sequential scraping of 100+ sites in 12 minutes is &lt;strong&gt;good enough&lt;/strong&gt; for daily jobs. Don't chase 100% concurrency at the cost of stability.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Schema-Based Extraction &amp;gt; Manual Parsing
&lt;/h3&gt;

&lt;p&gt;Declarative CSS schemas are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier to maintain&lt;/li&gt;
&lt;li&gt;Easier to debug&lt;/li&gt;
&lt;li&gt;Easier to scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Handle Failures Gracefully
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# Don't crash entire pipeline
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Resources &amp;amp; Credits
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Huge thanks to &lt;a href="https://www.youtube.com/@unclecode788" rel="noopener noreferrer"&gt;@unclecode&lt;/a&gt; for creating Crawl4AI!&lt;/strong&gt; This library made async scraping approachable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learn More:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📺 &lt;a href="https://www.youtube.com/@unclecode788" rel="noopener noreferrer"&gt;Crawl4AI YouTube Tutorials&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐙 &lt;a href="https://github.com/unclecode/crawl4ai" rel="noopener noreferrer"&gt;GitHub: unclecode/crawl4ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://docs.crawl4ai.com/" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐍 &lt;a href="https://docs.python.org/3/library/asyncio.html" rel="noopener noreferrer"&gt;Python asyncio Docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CollegeBuzz Series:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📝 &lt;a href="https://dev.to/pradippanjiyar/how-i-built-a-mongodb-archiving-system-for-crawled-data-30o7"&gt;Part 1: MongoDB Archiving System&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was &lt;strong&gt;simplicity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.&lt;/p&gt;

&lt;p&gt;If you're scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Found this helpful?&lt;/strong&gt; Hit that ❤️ and follow for Part 3!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions?&lt;/strong&gt; Drop a comment or reach out to me at @pradippanjiyar.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>crawl4ai</category>
    </item>
    <item>
      <title>How I Built a MongoDB Archiving System for Crawled Data</title>
      <dc:creator>Pradip Panjiyar</dc:creator>
      <pubDate>Fri, 03 Oct 2025 11:10:57 +0000</pubDate>
      <link>https://dev.to/pradippanjiyar/how-i-built-a-mongodb-archiving-system-for-crawled-data-30o7</link>
      <guid>https://dev.to/pradippanjiyar/how-i-built-a-mongodb-archiving-system-for-crawled-data-30o7</guid>
      <description>&lt;h2&gt;
  
  
  How I Built a MongoDB Archiving System for Crawled Data
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Problem: Data Chaos at Scale
&lt;/h2&gt;

&lt;p&gt;Imagine scraping 100+ college websites daily. Notices get updated. Events disappear. New announcements pop up every hour. Your database becomes a graveyard of duplicates, outdated entries, and lost history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That was my reality building &lt;a href="https://college-buzz.vercel.app/" rel="noopener noreferrer"&gt;CollegeBuzz&lt;/a&gt;&lt;/strong&gt; — an AICTE academic news aggregator.&lt;/p&gt;

&lt;p&gt;The challenge wasn't just collecting data. It was managing its lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Overwrite everything? You lose historical context&lt;/li&gt;
&lt;li&gt;❌ Blindly insert? Hello, duplicate hell&lt;/li&gt;
&lt;li&gt;❌ Manual cleanup? Not scalable at 10,000+ records/day&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed something smarter. Here's how I built an automated archiving system that keeps data fresh, preserves history, and stays performant at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Active + Archive Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Concept
&lt;/h3&gt;

&lt;p&gt;Instead of one bloated collection, I split data into two purpose-built collections:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidrlfj3rvi1p249lut5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidrlfj3rvi1p249lut5d.png" alt="A simple block diagram showing an “Active + Archive” data pattern. On top is the Active Collection (user-facing, always fresh). An arrow points downward with the note “after 30 days or event passes” leading to the Archive Collection (historical records for analytics and audits)." width="471" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users query only fresh data (faster searches)&lt;/li&gt;
&lt;li&gt;Archive grows unbounded without impacting performance&lt;/li&gt;
&lt;li&gt;You can always restore or analyze historical trends&lt;/li&gt;
&lt;/ul&gt;
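
&lt;p&gt;In code, the split is nothing exotic: two handles on the same database. A sketch (the database and collection names here are my placeholders, not necessarily CollegeBuzz's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGO_URI"))
db = client["collegebuzz"]        # placeholder database name
active = db["notices"]            # user-facing queries hit only this
archive = db["notices_archive"]   # history accumulates off the hot path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;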




&lt;h2&gt;
  
  
  The Implementation: Smart Deduplication
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem 1: How do you detect duplicates?
&lt;/h3&gt;

&lt;p&gt;A notice titled "Admissions Open 2025" might appear on multiple pages, or get re-crawled daily. I needed a &lt;strong&gt;deterministic unique identifier&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Composite Unique Keys
&lt;/h3&gt;

&lt;p&gt;Instead of MongoDB's &lt;code&gt;_id&lt;/code&gt;, I use business logic to create unique identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_unique_identifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Creates a composite key from title + URL
    Falls back to full record comparison if needed
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;unique_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# If no title/URL, compare all fields except timestamps
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;unique_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;unique_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why not content hashing?&lt;/strong&gt; &lt;br&gt;
While SHA-256 hashing is elegant, I found composite keys more debuggable. If there's a collision, I can instantly see &lt;em&gt;which&lt;/em&gt; title/URL caused it. Hashes hide this context.&lt;/p&gt;
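&lt;p&gt;For comparison, here's a minimal sketch of the hashing alternative I considered (standard library only; this is &lt;em&gt;not&lt;/em&gt; the approach CollegeBuzz uses):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def hash_identifier(record):
    # One opaque key derived from title + URL
    raw = f"{record.get('title', '')}|{record.get('url', '')}"
    return hashlib.sha256(raw.encode('utf-8')).hexdigest()

# An ID like '3a7bd3e2...' is compact and collision-resistant, but
# when you're debugging duplicates it tells you nothing about which
# title/url pair produced it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;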


&lt;h2&gt;
  
  
  The Magic: Timestamp Tracking
&lt;/h2&gt;

&lt;p&gt;Every record carries two timestamps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;crawled_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;First time we ever saw this record&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2025-01-08 10:00:00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;last_updated_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Most recent time we re-crawled it&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2025-10-03 14:30:00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The Critical Rule:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;crawled_at&lt;/code&gt; NEVER changes. &lt;code&gt;last_updated_at&lt;/code&gt; ALWAYS updates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This creates a paper trail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;crawled_at&lt;/code&gt; is 60 days old but &lt;code&gt;last_updated_at&lt;/code&gt; is yesterday → the content is still live, just stable&lt;/li&gt;
&lt;li&gt;If both are 60 days old → probably abandoned
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Preserve original crawl date, update last seen
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing_record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;original_crawled_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;existing_record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;original_crawled_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_crawled_at&lt;/span&gt;  &lt;span class="c1"&gt;# Frozen in time
&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Always fresh
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
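&lt;p&gt;To make the paper-trail rule concrete, a tiny helper (my illustration, not part of the repo) can classify a record straight from its two timestamps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta

def classify_freshness(record, stale_days=60):
    """Interpret the crawled_at / last_updated_at pair."""
    now = datetime.now()
    stale = timedelta(days=stale_days)
    first_seen_long_ago = now - record['crawled_at'] &amp;gt; stale
    seen_recently = now - record['last_updated_at'] &amp;lt; stale

    if first_seen_long_ago and seen_recently:
        return "stable"      # old content that is still live on the site
    if first_seen_long_ago:
        return "abandoned"   # probably removed from the source
    return "fresh"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;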



&lt;h2&gt;
  
  
  The Archiving Logic: Time-Based + Event-Based
&lt;/h2&gt;

&lt;p&gt;Records get archived under two conditions:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Age-Based Archiving (30-day threshold)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_archive_by_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold_days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Archive if the record was FIRST CRAWLED more than 30 days ago
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;archive_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threshold_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;archive_threshold&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Event-Based Archiving
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_archive_by_event_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Archive if the event date has passed
    Example: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Admission closes: 2025-09-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; → Auto-archive on Sept 16
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Combined logic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;should_archive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;should_archive_by_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; 
    &lt;span class="nf"&gt;should_archive_by_event_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Example: Archiving Old News
&lt;/h2&gt;

&lt;p&gt;Let's say you manually changed a record's &lt;code&gt;crawled_at&lt;/code&gt; to January 8, 2025 (more than 30 days ago). Here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mongodb_handler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoDBHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoDBHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Current record in 'news' collection:
# {
#   "title": "New Campus Opened",
#   "url": "https://college.edu/news/123",
#   "crawled_at": "2025-01-08 10:00:00",  # 9+ months old!
#   "last_updated_at": "2025-10-01 08:00:00"
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Run archiving
&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;archive_old_records&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# [news] Found 1 records to archive.
# Archiving: {'title': 'New Campus Opened', 'url': 'https://college.edu/news/123'}
#   - crawled_at: 2025-01-08 10:00:00
#   - last_updated_at: 2025-10-01 08:00:00
# ✅ Archived 1 record(s) from news
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened under the hood:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;MongoDB query finds records where &lt;code&gt;crawled_at &amp;lt; (today - 30 days)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Record is &lt;strong&gt;copied&lt;/strong&gt; to &lt;code&gt;news_archived&lt;/code&gt; collection&lt;/li&gt;
&lt;li&gt;Record is &lt;strong&gt;deleted&lt;/strong&gt; from &lt;code&gt;news&lt;/code&gt; collection&lt;/li&gt;
&lt;li&gt;All timestamps are preserved (this is crucial for analytics!)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The copy-before-delete order matters: if the process dies between steps 2 and 3, the worst case is a temporary duplicate across the two collections, never a lost record.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Optimization: Indexing Strategy
&lt;/h2&gt;

&lt;p&gt;With 100K+ records, queries need to be fast. Here's my indexing setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create indexes on both collections
&lt;/span&gt;&lt;span class="n"&gt;active_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;active_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;active_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;archived_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;archived_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why these indexes?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;crawled_at&lt;/code&gt; → Fast archiving queries (&lt;code&gt;WHERE crawled_at &amp;lt; threshold&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;last_updated_at&lt;/code&gt; → Sort by freshness for user-facing queries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(title, url)&lt;/code&gt; composite → fast indexed duplicate lookups (effectively O(log n), and &lt;code&gt;unique=True&lt;/code&gt; enforces deduplication at the database level)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Before indexing:&lt;/strong&gt; Archive query took 4.2s on 50K records&lt;br&gt;
&lt;strong&gt;After indexing:&lt;/strong&gt; Same query in 120ms &lt;/p&gt;
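&lt;p&gt;You can verify a query actually hits the index with PyMongo's &lt;code&gt;explain()&lt;/code&gt;. A quick sketch (&lt;code&gt;threshold&lt;/code&gt; here is the same 30-day cutoff datetime as before):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;plan = active_collection.find(
    {"crawled_at": {"$lt": threshold}}
).explain()

# Expect an IXSCAN stage; a COLLSCAN means the index is being ignored
print(plan["queryPlanner"]["winningPlan"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;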


&lt;h2&gt;
  
  
  Handling Edge Cases
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Case 1: Date Strings vs DateTime Objects
&lt;/h3&gt;

&lt;p&gt;Scraped dates come in wild formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;"15 May 2024"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"2024-05-15"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"May 15, 2024"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My parser handles them all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Robust date parsing with multiple format fallbacks
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;date_str&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;date_formats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%d %B %Y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# "15 May 2024"
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# "2024-05-15"
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%d-%m-%Y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# "15-05-2024"
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%B %d, %Y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# "May 15, 2024"
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;clean_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;date_formats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
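&lt;p&gt;If a site invents yet another format, the third-party &lt;code&gt;python-dateutil&lt;/code&gt; package makes a decent last-resort fallback. A sketch (not part of my parser above; note &lt;code&gt;dayfirst=True&lt;/code&gt; for Indian-style dates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install python-dateutil
from dateutil import parser as dateutil_parser

def parse_date_fallback(date_str):
    try:
        # dayfirst=True resolves "05-06-2024" as 5 June, not May 6
        return dateutil_parser.parse(date_str, dayfirst=True).date()
    except (ValueError, OverflowError):
        return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;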



&lt;h3&gt;
  
  
  Case 2: Records Without Timestamps
&lt;/h3&gt;

&lt;p&gt;If an old record exists without &lt;code&gt;crawled_at&lt;/code&gt;, the system handles it gracefully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;original_crawled_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;existing_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; 
    &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Default to now if missing
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Testing: Verify It Works
&lt;/h2&gt;

&lt;p&gt;I built a self-test function to validate archiving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_archive_functionality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    1. Insert a record with crawled_at = 31 days ago
    2. Run archive_old_records()
    3. Verify it moved to archive
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;old_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test Record &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://test.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;old_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;old_date&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Insert to active collection
&lt;/span&gt;    &lt;span class="n"&gt;test_collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run archiving
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;archive_old_records&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify: Should be in archive, not in active
&lt;/span&gt;    &lt;span class="n"&gt;archived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;find_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test_record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inserted_id&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;archived&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Record not in archive!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Record still in active!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Archive test passed!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
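&lt;p&gt;One caveat: the test leaves its record behind in the archive. A cleanup step at the end (my suggestion, not in the snippet above) keeps repeated runs tidy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    # Tidy up after the assertions pass
    self.db[f"{collection_name}_archived"].delete_one(
        {"title": test_record["title"]}
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;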



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python mongodb_handler.py  &lt;span class="c"&gt;# Runs built-in test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Lessons Learned (The Hard Way)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Never Trust Scraper Consistency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Websites change formats overnight. Your archiving logic needs to be defensive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always validate dates before parsing&lt;/li&gt;
&lt;li&gt;Use try-except blocks around timestamp operations&lt;/li&gt;
&lt;li&gt;Log failed parses for manual review&lt;/li&gt;
&lt;/ul&gt;
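&lt;p&gt;For the last point, a thin wrapper around the parser is enough. A sketch using Python's standard &lt;code&gt;logging&lt;/code&gt; module (the logger name is my invention):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

logger = logging.getLogger("collegebuzz.dates")

def safe_parse_date(parse_fn, date_str):
    """Never let one bad date string kill a whole crawl batch."""
    try:
        result = parse_fn(date_str)
    except Exception:
        logger.exception("Date parsing crashed on %r", date_str)
        return None
    if result is None:
        logger.warning("Unparseable date, flag for manual review: %r", date_str)
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;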

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Indexes Are Non-Negotiable&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I initially skipped indexing ("we'll add it later"). BAD IDEA. Archiving 10K records took 45 minutes. After indexing: 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; Add indexes BEFORE your first large crawl.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Separate Active from Archive Early&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I tried using a single collection with an &lt;code&gt;is_archived&lt;/code&gt; flag. Query complexity exploded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Nightmare query
&lt;/span&gt;&lt;span class="n"&gt;active_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$gt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With separate collections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Clean query
&lt;/span&gt;&lt;span class="n"&gt;active_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;active_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Manual Archive Triggers Are Essential&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes you need to force-archive records (e.g., after fixing date parsing bugs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Force archive everything older than 60 days
&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manually_archive_old_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Impact: By The Numbers
&lt;/h2&gt;

&lt;p&gt;After deploying this system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Active collection size&lt;/td&gt;
&lt;td&gt;450K records&lt;/td&gt;
&lt;td&gt;12K records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average query time (user search)&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate notices&lt;/td&gt;
&lt;td&gt;~15% of dataset&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Historical data available&lt;/td&gt;
&lt;td&gt;❌ Lost on updates&lt;/td&gt;
&lt;td&gt;✅ Full audit trail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk usage (with compression)&lt;/td&gt;
&lt;td&gt;8.4GB&lt;/td&gt;
&lt;td&gt;2.1GB (active) + 4.2GB (archive)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The hidden win:&lt;/strong&gt; Being able to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How long do admission notices usually stay live?"&lt;/li&gt;
&lt;li&gt;"Which colleges update their news most frequently?"&lt;/li&gt;
&lt;li&gt;"Can we restore that deleted event from March?"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Complete Code
&lt;/h2&gt;

&lt;p&gt;Here's the full archiving system (simplified for readability):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MongoDBHandler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AICTE_Scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27017&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_collections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Create archive collections
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_collections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection_if_not_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Insert with smart deduplication and auto-archiving
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;archive_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;archive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;unique_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_unique_identifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Preserve original crawl date
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;original_crawled_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;original_crawled_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;

            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_crawled_at&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;

            &lt;span class="c1"&gt;# Check if should archive
&lt;/span&gt;            &lt;span class="n"&gt;should_archive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_crawled_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;archive_threshold&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_archive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;archive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;upsert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;upsert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;archive_old_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Move records older than 30 days to archive
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_collections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;archive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_archived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;old_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crawled_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$lt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;}}))&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;old_records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;record_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;unique_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_unique_identifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;archive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;upsert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Archived &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full implementation with tests: &lt;a href="https://github.com/panjyar/CollegeBuzz.git" rel="noopener noreferrer"&gt;GitHub Link&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;I'm working on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TTL-based archiving&lt;/strong&gt; using MongoDB's built-in TTL indexes (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental archiving&lt;/strong&gt; (archive in batches to avoid blocking)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change detection&lt;/strong&gt; (highlight what fields changed between versions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Archive compression&lt;/strong&gt; (BSON → compressed JSON for long-term storage)&lt;/li&gt;
&lt;/ol&gt;
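&lt;p&gt;For the TTL idea, the index itself is a one-liner, but there's a catch worth knowing up front: MongoDB &lt;em&gt;deletes&lt;/em&gt; expired documents outright, it does not move them, so the copy-to-archive step still has to happen first. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Expire documents 30 days after crawled_at.
# Caution: TTL removes data permanently; it is deletion, not archiving.
active_collection.create_index(
    "crawled_at",
    expireAfterSeconds=30 * 24 * 60 * 60  # 30 days
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;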




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building this archiving system taught me that &lt;strong&gt;data management is harder than data collection&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're building any system that involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕷️ Web scraping&lt;/li&gt;
&lt;li&gt;📰 News aggregation
&lt;/li&gt;
&lt;li&gt;📅 Event tracking&lt;/li&gt;
&lt;li&gt;📊 Time-series data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Think about archiving on Day 1, not Day 100.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your future self (and your database) will thank you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/docs/manual/core/timeseries-collections/" rel="noopener noreferrer"&gt;MongoDB Time-Series Collections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;🔧 &lt;a href="https://github.com/panjyar/CollegeBuzz.git" rel="noopener noreferrer"&gt;Full code on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💬 Questions? Drop a comment or DM me &lt;a class="mentioned-user" href="https://dev.to/pradippanjiyar"&gt;@pradippanjiyar&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Found this helpful? Hit that ❤️ and follow for more web scraping deep dives!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on building CollegeBuzz. Next up: "Handling 100+ Website Scrapers with Python's asyncio"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>architecture</category>
      <category>mongodb</category>
    </item>
  </channel>
</rss>
