A Quick Note on Timeline
Although this is Part 2 of the CollegeBuzz series, it was actually the first major component I built. The MongoDB archiving system came later because I needed to solve data consistency issues after scraping was already running in production.
Lesson learned: Start with archiving from Day 1. But hindsight is 20/20, right?
The Problem: Scraping 100+ Colleges Without Losing My Mind
When I started building CollegeBuzz, an AICTE academic news aggregator, I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.
My first naive attempt:
import requests
from bs4 import BeautifulSoup
def scrape_college(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data...
    return data

# The slow way
college_urls = ["https://iitb.ac.in", "https://nitt.edu", ...]
all_data = []
for url in college_urls:
    data = scrape_college(url)
    all_data.append(data)

# Total time: 4+ hours
Why so slow?
Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there... waiting. Multiply that across 100+ sites and you get an eternity.
I needed something better.
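To see why waiting concurrently beats waiting in series, here's a toy sketch that uses asyncio.sleep as a stand-in for slow HTTP calls. This is an illustration only, not CollegeBuzz code:

import asyncio
import time

async def fake_fetch(site, delay):
    # Stand-in for a slow HTTP request: just wait, then report back
    await asyncio.sleep(delay)
    return f"{site}: done"

async def main():
    sites = [("college-a", 3), ("college-b", 3), ("college-c", 3)]
    start = time.perf_counter()
    # All three "requests" wait at the same time instead of back-to-back
    results = await asyncio.gather(*(fake_fetch(name, delay) for name, delay in sites))
    print(results, f"~{time.perf_counter() - start:.0f}s total")  # ~3s, not ~9s

asyncio.run(main())

Three simulated 3-second requests finish in about 3 seconds instead of 9, because the waiting overlaps. That's the core idea behind the rest of this post.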
Discovering Crawl4AI: The Game Changer
After wrestling with BeautifulSoup and Selenium, I found Crawl4AI by @unclecode.
Why Crawl4AI for async scraping?
- Built for asyncio from the ground up: native async/await support
- CSS-based extraction strategies: no more manual BeautifulSoup parsing
- Works out of the box: handles browser automation, retries, and error handling
- Battle-tested: 50k+ GitHub stars
Resources:
- YouTube Channel - excellent tutorials by the creator
- GitHub Repository
- Official Documentation
My Async Scraping Architecture
Instead of trying to scrape everything at once, I built a controlled async pipeline:
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# (plus project-local imports: MongoDBHandler from Part 1, urls from crawler_config.py)

async def extract_notices_and_events():
    """Main async scraping orchestrator"""
    # Initialize MongoDB handler
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    # Single crawler instance handles all sites
    async with AsyncWebCrawler(verbose=True) as crawler:
        for site in urls:  # Sequential at site level
            # Configure extraction strategy
            extraction_strategy = JsonCssExtractionStrategy(
                site["schema"],
                verbose=True
            )
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,  # Always fresh data
                extraction_strategy=extraction_strategy
            )

            # Async scrape (non-blocking!)
            result = await crawler.arun(url=site["url"], config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                # Process and store data
                process_and_store(data, mongo_handler)
Key Design Decision: Sequential Sites, Async Pages
I intentionally scrape sites one-by-one but use async for the actual HTTP requests. Why?
- Avoid IP bans: a burst of 100 concurrent automated requests from one IP raises red flags
- Resource management: one browser at a time keeps memory under control
- Error isolation: if one site fails, the others continue
The Magic: CSS-Based Extraction
Instead of writing BeautifulSoup parsing for each site, I use declarative schemas:
# In crawler_config.py
urls = [
    {
        "url": "https://www.iitb.ac.in/",
        "schema": {
            "name": "IIT Bombay Notices",
            "baseSelector": ".notice-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h3.title",
                    "type": "text"
                },
                {
                    "name": "notice_url",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ]
        }
    },
    # ... 100+ more colleges
]
Then in my scraper:
for site in urls:
    extraction_strategy = JsonCssExtractionStrategy(
        site["schema"],
        verbose=True
    )

    result = await crawler.arun(
        url=site["url"],
        config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
    )

    # Get clean JSON data, no BeautifulSoup needed!
    data = json.loads(result.extracted_content)
Why this is powerful for async scraping:
- No manual parsing: Crawl4AI handles the HTML extraction
- Maintainable: update schemas without touching scraper logic
- Scalable: add new colleges by adding new schema objects (see the example below)
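For example, onboarding a new college is just another entry in crawler_config.py. The college and selectors below are made up for illustration:

# Another entry for the urls list in crawler_config.py
# (hypothetical college and selectors, for illustration only)
{
    "url": "https://www.example-college.ac.in/",
    "schema": {
        "name": "Example College Notices",
        "baseSelector": "ul.notices li",
        "fields": [
            {"name": "title", "selector": "a", "type": "text"},
            {"name": "notice_url", "selector": "a", "type": "attribute", "attribute": "href"},
        ],
    },
},

No scraper code changes, no new parser: the loop above picks it up on the next run.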
Real-World Async Patterns I Used
Pattern 1: Context Manager for Resource Cleanup
async with AsyncWebCrawler(verbose=True) as crawler:
    # Use crawler
    result = await crawler.arun(url=url, config=config)
# Automatically closes browser and releases resources
Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.
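Under the hood, async with is roughly a try/finally around the crawler's setup and teardown hooks. This is a simplified sketch, not Python's exact expansion:

crawler = AsyncWebCrawler(verbose=True)
await crawler.__aenter__()          # start the browser
try:
    result = await crawler.arun(url=url, config=config)
finally:
    await crawler.__aexit__(None, None, None)   # always runs, even if arun raised

The finally-style cleanup is exactly why the browser never leaks, no matter how the scrape ends.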
Pattern 2: Handling Failures Gracefully
result = await crawler.arun(url=site["url"], config=config)

if not result.success:
    print(f"Crawl failed for {site['url']}: {result.error_message}")
    continue  # Skip this site, move to next

# Process successful result
data = json.loads(result.extracted_content)
One failed site doesn't crash the entire pipeline.
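If you want to be extra defensive, guard the JSON parsing step too. This is an optional addition I'd suggest, not something Crawl4AI requires:

if not result.success or not result.extracted_content:
    print(f"Crawl failed for {site['url']}: {result.error_message}")
    continue

try:
    data = json.loads(result.extracted_content)
except json.JSONDecodeError as e:
    print(f"Bad extraction for {site['url']}: {e}")
    continue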
Pattern 3: Async Configuration
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Fresh data every time
    extraction_strategy=extraction_strategy
)

result = await crawler.arun(url=site["url"], config=config)
Crawl4AI's CrawlerRunConfig lets you customize behavior per site without creating new crawler instances.
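A small helper makes that per-site customization explicit. This is a sketch (the make_config name is mine), using only the options already shown above:

def make_config(site):
    """Build a fresh run config for one site; the crawler instance itself is reused."""
    return CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(site["schema"], verbose=True),
    )

# Inside the loop:
result = await crawler.arun(url=site["url"], config=make_config(site))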
Handling Real-World Edge Cases
Edge Case 1: Data Volume Control
Some colleges list 1000+ notices on their homepage. I don't need all of them:
# Site-specific limits
if site["url"] in ["https://www.nitt.edu/", "https://www.iitkgp.ac.in/"]:
    data = data[:10]  # Only recent 10
elif site["url"] in ["https://www.iitk.ac.in/", "https://www.iiti.ac.in/"]:
    data = data[:4]   # These update slowly
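If the list of special cases keeps growing, a cleaner option is to make the limit part of each site's entry in crawler_config.py. The max_items key below is hypothetical, not something my current schema has:

# Hypothetical extra key per site, e.g.:
# {"url": "https://www.nitt.edu/", "schema": {...}, "max_items": 10}

max_items = site.get("max_items")   # None means keep everything
if max_items is not None:
    data = data[:max_items]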
Edge Case 2: URL Normalization
College websites have inconsistent URL formats:
from urllib.parse import urljoin

def process_url(base_url, extracted_url):
    """Convert relative URLs to absolute"""
    if not extracted_url:
        return extracted_url

    extracted_url = extracted_url.strip()

    # Handle relative URLs
    if not extracted_url.startswith(("http://", "https://")):
        return urljoin(base_url, extracted_url)

    return extracted_url
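A couple of quick examples of what this normalization does (the paths are made up):

process_url("https://www.iitb.ac.in/", "/newsevents/notice.pdf")
# -> "https://www.iitb.ac.in/newsevents/notice.pdf"

process_url("https://www.iitb.ac.in/", "https://cdn.example.org/notice.pdf")
# -> returned unchanged, already absolute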
Edge Case 3: JavaScript URL Madness (IIT Roorkee)
IIT Roorkee embeds JavaScript in href attributes:
<a href="window.open('/events/workshop.pdf')">View Event</a>
Solution:
if site["url"] == "https://www.iitr.ac.in/":
for entry in data:
if "upcoming_Event_url" in entry and "window.open(" in entry["upcoming_Event_url"]:
# Extract actual URL from JavaScript
match = re.search(r"window\.open\('([^']+)'\)", entry["upcoming_Event_url"])
if match:
entry["upcoming_Event_url"] = match.group(1)
Performance: The Numbers
Before (Sequential with Requests + BeautifulSoup):
- Time: 4 hours, 23 minutes
- Average: ~150 seconds per site
- Memory: ~200MB stable
After (Async with Crawl4AI):
- Time: 12 minutes, 30 seconds
- Average: ~7.5 seconds per site
- Memory: ~600MB peak (browser overhead)
20x faster with better reliability!
Why Not Fully Concurrent?
You might ask: "Why not scrape all 100+ sites simultaneously with asyncio.gather()?"
# Why I DON'T do this:
tasks = [scrape_college(crawler, site) for site in urls]
results = await asyncio.gather(*tasks)
I tried this. Results:
- IP bans from 12 colleges
- Memory explosion (100 browsers = 8GB+ RAM)
- Browser crashes
Lesson: For daily scraping of 100+ different domains, sequential with async is the sweet spot:
- Fast enough (12 minutes vs 4 hours)
- Respectful to websites (no hammering)
- Stable and maintainable
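That said, if you ever need more throughput than one site at a time, there is a middle ground: cap concurrency with an asyncio.Semaphore. This is not what CollegeBuzz ships; it's a sketch of the option, assuming a hypothetical scrape_one() coroutine that wraps the per-site logic shown earlier:

import asyncio

MAX_CONCURRENT = 5  # small cap: faster than sequential, far gentler than 100-at-once
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def bounded_scrape(crawler, site):
    async with semaphore:                        # at most MAX_CONCURRENT sites in flight
        return await scrape_one(crawler, site)   # hypothetical per-site wrapper

async def scrape_all(crawler, urls):
    tasks = [bounded_scrape(crawler, site) for site in urls]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)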
Integration with the Full Pipeline
Here's how async scraping fits into CollegeBuzz:
# In aictcscraper.py
async def extract_notices_and_events():
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))
    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            for site in urls:
                # Async scraping (per-site config setup shown earlier, abridged here)
                result = await crawler.arun(url=site["url"], config=config)
                data = json.loads(result.extracted_content)

                # Insert into MongoDB (with deduplication from Part 1)
                mongo_handler.insert_data(collection_name, records)
    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        mongo_handler.close_connection()
The scraper feeds data into the MongoDB handler, which automatically:
- Deduplicates records (from Part 1)
- Updates timestamps
- Archives old data
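For reference, the dedup step boils down to an upsert keyed on something unique per notice. Here's a simplified pymongo sketch (the function and field names are illustrative, not the exact handler from Part 1):

from datetime import datetime, timezone

def upsert_records(db, collection_name, records):
    """Upsert each record so re-scraped notices update in place instead of duplicating."""
    # db is a pymongo Database, e.g. MongoClient(MONGO_URI)["collegebuzz"]
    collection = db[collection_name]
    for record in records:
        collection.update_one(
            {"notice_url": record["notice_url"]},                           # illustrative unique key
            {"$set": {**record, "updated_at": datetime.now(timezone.utc)}},
            upsert=True,
        )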
Running the Scraper
Manual Trigger
python aictcscraper.py
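For that to work, aictcscraper.py needs an asyncio entry point at the bottom, something like:

if __name__ == "__main__":
    asyncio.run(extract_notices_and_events())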
Via Flask API
# In app.py
@app.route('/api/scrape', methods=['POST'])
def run_scraper():
    try:
        result = asyncio.run(extract_notices_and_events())
        return jsonify({"status": "success"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500
Trigger via HTTP:
curl -X POST http://localhost:8080/api/scrape
Scheduled with Cron
# crontab -e
0 2 * * * cd /path/to/collegebuzz && python aictcscraper.py >> scraper.log 2>&1
Key Takeaways for Async Scraping
1. Context Managers Are Essential
async with AsyncWebCrawler() as crawler:
    # Always use context managers
    result = await crawler.arun(url)
# Automatic cleanup, even on errors
2. Don't Over-Optimize
Sequential scraping of 100+ sites in 12 minutes is good enough for daily jobs. Don't chase 100% concurrency at the cost of stability.
3. Schema-Based Extraction > Manual Parsing
Declarative CSS schemas are:
- Easier to maintain
- Easier to debug
- Easier to scale
4. Handle Failures Gracefully
if not result.success:
    print(f"Failed: {result.error_message}")
    continue  # Don't crash entire pipeline
Resources & Credits
Huge thanks to @unclecode for creating Crawl4AI! This library made async scraping approachable.
Learn More:
- Crawl4AI YouTube Tutorials
- GitHub: unclecode/crawl4ai
- Official Documentation
- Python asyncio Docs
CollegeBuzz Series:
- Part 1: The MongoDB archiving and deduplication system
- Part 2: Async scraping with Crawl4AI (this post)
Closing Thoughts
Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was simplicity.
No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.
If you're scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.
Found this helpful? Hit that ❤️ and follow for Part 3!
Questions? Drop a comment or reach out @pradippanjiyar
This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.