Pradip Panjiyar

Handling 100+ Website Scrapers with Python's asyncio

A Quick Note on Timeline

Although this is Part 2 of the CollegeBuzz series, it was actually the first major component I built. The MongoDB archiving system came later because I needed to solve data consistency issues after scraping was already running in production.

Lesson learned: Start with archiving from Day 1. But hindsight is 20/20, right? 😅

The Problem: Scraping 100+ Colleges Without Losing My Mind

When I started building CollegeBuzz, an AICTE academic news aggregator, I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.

My first naive attempt:

import requests
from bs4 import BeautifulSoup

def scrape_college(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Simplified for illustration; the real parsing was site-specific
    data = [item.get_text(strip=True) for item in soup.select(".notice-item")]
    return data

# The slow way
college_urls = ["https://iitb.ac.in", "https://nitt.edu", ...]
all_data = []

for url in college_urls:
    data = scrape_college(url)  
    all_data.append(data)

# 🐌 Total time: 4+ hours

Why so slow?

Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there... waiting. Multiply that across 100+ sites and you get an eternity.
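
To see why the waiting dominates, here's a minimal, self-contained sketch that simulates three slow "requests" with asyncio.sleep instead of real HTTP:

import asyncio
import time

async def fake_fetch(site, delay):
    # Stand-in for a slow HTTP request (a college site taking seconds to respond)
    await asyncio.sleep(delay)
    return site

async def main():
    start = time.perf_counter()
    # Three 2-second "requests" overlap instead of running back to back
    await asyncio.gather(*(fake_fetch(name, 2) for name in ["site-a", "site-b", "site-c"]))
    print(f"finished in {time.perf_counter() - start:.1f}s")  # ~2s, not ~6s

asyncio.run(main())

With plain requests those waits add up; with asyncio they overlap.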

I needed something better.


Discovering Crawl4AI: The Game Changer

After wrestling with BeautifulSoup and Selenium, I found Crawl4AI by @unclecode.

Why Crawl4AI for async scraping?

  • ⚡ Built for asyncio from the ground up: native async/await support
  • 🎯 CSS-based extraction strategies: no more manual BeautifulSoup parsing
  • 📦 Works out of the box: handles browser automation, retries, error handling
  • 🚀 Battle-tested: 50k+ GitHub stars
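
For context, here's roughly what a minimal Crawl4AI crawl looks like. This is a sketch based on the library's documented async API (result fields like success and html); details may vary between versions:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # One browser-backed crawler, cleaned up automatically by the context manager
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.iitb.ac.in/")
        if result.success:
            print(len(result.html), "characters of HTML fetched")

asyncio.run(main())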

Resources: Crawl4AI on GitHub (https://github.com/unclecode/crawl4ai)


My Async Scraping Architecture

Instead of trying to scrape everything at once, I built a controlled async pipeline:

import json
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

from crawler_config import urls  # site list with per-site CSS schemas
from mongodb_handler import MongoDBHandler  # Part 1's handler (module name assumed)

async def extract_notices_and_events():
    """Main async scraping orchestrator"""

    # Initialize MongoDB handler
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    # Single crawler instance handles all sites
    async with AsyncWebCrawler(verbose=True) as crawler:
        for site in urls:  # Sequential at site level
            # Configure extraction strategy
            extraction_strategy = JsonCssExtractionStrategy(
                site["schema"], 
                verbose=True
            )

            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,  # Always fresh data
                extraction_strategy=extraction_strategy
            )

            # Async scrape (non-blocking!)
            result = await crawler.arun(url=site["url"], config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                # Process and store data
                process_and_store(data, mongo_handler)

Key Design Decision: Sequential Sites, Async Pages

I intentionally scrape sites one-by-one but use async for the actual HTTP requests. Why?

  1. Avoid IP bans: 100 concurrent requests to different domains = red flags
  2. Resource management: one browser at a time keeps memory under control
  3. Error isolation: if one site fails, others continue

The Magic: CSS-Based Extraction

Instead of writing BeautifulSoup parsing for each site, I use declarative schemas:

# In crawler_config.py
urls = [
    {
        "url": "https://www.iitb.ac.in/",
        "schema": {
            "name": "IIT Bombay Notices",
            "baseSelector": ".notice-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h3.title",
                    "type": "text"
                },
                {
                    "name": "notice_url",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ]
        }
    },
    # ... 100+ more colleges
]

Then in my scraper:

for site in urls:
    extraction_strategy = JsonCssExtractionStrategy(
        site["schema"], 
        verbose=True
    )

    result = await crawler.arun(
        url=site["url"], 
        config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
    )

    # Get clean JSON data, no BeautifulSoup needed!
    data = json.loads(result.extracted_content)

Why this is powerful for async scraping:

  • ✅ No manual parsing: Crawl4AI handles HTML extraction
  • ✅ Maintainable: update schemas without touching scraper logic
  • ✅ Scalable: add new colleges by adding new schema objects

Real-World Async Patterns I Used

Pattern 1: Context Manager for Resource Cleanup

async with AsyncWebCrawler(verbose=True) as crawler:
    # Use crawler
    result = await crawler.arun(url=url, config=config)
# Automatically closes browser and releases resources

Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.
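
If you're wondering why cleanup still happens on errors: async with is roughly sugar for calling the object's async enter/exit hooks inside a try/finally. A simplified sketch of the equivalent (in reality the exception details are passed to __aexit__):

crawler = AsyncWebCrawler(verbose=True)
await crawler.__aenter__()  # starts the browser
try:
    result = await crawler.arun(url=url, config=config)
finally:
    # Runs even if arun() raises, so the browser is always released
    await crawler.__aexit__(None, None, None)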

Pattern 2: Handling Failures Gracefully

result = await crawler.arun(url=site["url"], config=config)

if not result.success:
    print(f"❌ Crawl failed for {site['url']}: {result.error_message}")
    continue  # Skip this site, move to next

# Process successful result
data = json.loads(result.extracted_content)

One failed site doesn't crash the entire pipeline.
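
For even stronger isolation, you can also catch exceptions raised by the crawl call itself (network or browser errors) rather than relying on result.success alone. A sketch, assuming the same per-site loop as above; this is an option, not necessarily what my production code does:

for site in urls:
    try:
        result = await crawler.arun(url=site["url"], config=config)
    except Exception as e:  # browser/network-level failure
        print(f"Exception while crawling {site['url']}: {e}")
        continue

    if not result.success:
        print(f"Crawl failed for {site['url']}: {result.error_message}")
        continue

    data = json.loads(result.extracted_content)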

Pattern 3: Async Configuration

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Fresh data every time
    extraction_strategy=extraction_strategy
)

result = await crawler.arun(url=site["url"], config=config)

Crawl4AI's CrawlerRunConfig lets you customize behavior per site without creating new crawler instances.
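
For example, inside the per-site loop you could vary the cache mode per site while reusing the same crawler. A sketch; the updates_daily flag is a hypothetical field I'm adding to the site dict here, not something in my real config:

# Hypothetical flag: bypass the cache only for sites that change daily
cache = CacheMode.BYPASS if site.get("updates_daily") else CacheMode.ENABLED
config = CrawlerRunConfig(
    cache_mode=cache,
    extraction_strategy=JsonCssExtractionStrategy(site["schema"]),
)
result = await crawler.arun(url=site["url"], config=config)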


Handling Real-World Edge Cases

Edge Case 1: Data Volume Control

Some colleges list 1000+ notices on their homepage. I don't need all of them:

# Site-specific limits
if site["url"] in ["https://www.nitt.edu/", "https://www.iitkgp.ac.in/"]:
    data = data[:10]  # Only recent 10
elif site["url"] in ["https://www.iitk.ac.in/", "https://www.iiti.ac.in/"]:
    data = data[:4]   # These update slowly
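
As the list of exceptions grows, I'd be tempted to move these caps into a lookup table next to the schemas. A sketch of that variant (SITE_LIMITS is hypothetical, not part of my current config):

# Hypothetical per-site caps; sites not listed keep all their items
SITE_LIMITS = {
    "https://www.nitt.edu/": 10,
    "https://www.iitkgp.ac.in/": 10,
    "https://www.iitk.ac.in/": 4,
    "https://www.iiti.ac.in/": 4,
}

limit = SITE_LIMITS.get(site["url"])
if limit is not None:
    data = data[:limit]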

Edge Case 2: URL Normalization

College websites have inconsistent URL formats:

from urllib.parse import urljoin

def process_url(base_url, extracted_url):
    """Convert relative URLs to absolute"""
    if not extracted_url:
        return extracted_url

    extracted_url = extracted_url.strip()

    # Handle relative URLs
    if not extracted_url.startswith(("http://", "https://")):
        return urljoin(base_url, extracted_url)

    return extracted_url
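
A couple of quick examples of what the normalization produces (the paths are made up for illustration):

print(process_url("https://www.iitb.ac.in/", "/notices/exam.pdf"))
# -> https://www.iitb.ac.in/notices/exam.pdf

print(process_url("https://www.iitb.ac.in/", "https://www.iitk.ac.in/events"))
# -> https://www.iitk.ac.in/events  (already absolute, returned unchanged)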

Edge Case 3: JavaScript URL Madness (IIT Roorkee)

IIT Roorkee embeds JavaScript in href attributes:

<a href="window.open('/events/workshop.pdf')">View Event</a>

Solution:

if site["url"] == "https://www.iitr.ac.in/":
    for entry in data:
        if "upcoming_Event_url" in entry and "window.open(" in entry["upcoming_Event_url"]:
            # Extract actual URL from JavaScript
            match = re.search(r"window\.open\('([^']+)'\)", entry["upcoming_Event_url"])
            if match:
                entry["upcoming_Event_url"] = match.group(1)

Performance: The Numbers

Before (Sequential with Requests + BeautifulSoup):

⏱️ Time: 4 hours, 23 minutes
🐌 Average: ~150 seconds per site
πŸ’Ύ Memory: ~200MB stable

After (Async with Crawl4AI):

⏱️ Time: 12 minutes, 30 seconds
⚑ Average: ~7.5 seconds per site
πŸ’Ύ Memory: ~600MB peak (browser overhead)

20x faster with better reliability!


Why Not Fully Concurrent?

You might ask: "Why not scrape all 100+ sites simultaneously with asyncio.gather()?"

# Why I DON'T do this:
tasks = [scrape_college(crawler, site) for site in urls]
results = await asyncio.gather(*tasks)

I tried this. Results:

  • ❌ IP bans from 12 colleges
  • ❌ Memory explosion (100 browsers = 8GB+ RAM)
  • ❌ Browser crashes

Lesson: For daily scraping of 100+ different domains, sequential with async is the sweet spot:

  • Fast enough (12 minutes vs 4 hours)
  • Respectful to websites (no hammering)
  • Stable and maintainable
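
That said, if you want some parallelism without the blow-ups, a middle ground is bounded concurrency with asyncio.Semaphore. This is a sketch of an option I considered, not what runs in production, and it assumes a single AsyncWebCrawler can serve overlapping arun() calls (check the Crawl4AI docs for your version):

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def scrape_one(crawler, site, sem):
    async with sem:  # at most max_concurrency sites in flight at once
        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=JsonCssExtractionStrategy(site["schema"]),
        )
        return await crawler.arun(url=site["url"], config=config)

async def scrape_all(urls, max_concurrency=5):
    sem = asyncio.Semaphore(max_concurrency)
    async with AsyncWebCrawler(verbose=True) as crawler:
        tasks = [scrape_one(crawler, site, sem) for site in urls]
        # return_exceptions=True keeps one bad site from cancelling the rest
        return await asyncio.gather(*tasks, return_exceptions=True)

Five sites in flight would likely land somewhere between the 12-minute sequential run and the crash-prone fully concurrent attempt, at the cost of more memory.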

Integration with the Full Pipeline

Here's how async scraping fits into CollegeBuzz:

# In aictcscraper.py
async def extract_notices_and_events():
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            for site in urls:
                # Async scraping (config is built per site, as shown earlier)
                result = await crawler.arun(url=site["url"], config=config)
                data = json.loads(result.extracted_content)

                # Insert into MongoDB (with deduplication from Part 1)
                mongo_handler.insert_data(collection_name, records)

    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        mongo_handler.close_connection()

The scraper feeds data into the MongoDB handler, which automatically:

  • Deduplicates records (from Part 1)
  • Updates timestamps
  • Archives old data

Running the Scraper

Manual Trigger

python aictcscraper.py

Via Flask API

# In app.py
import asyncio

from flask import Flask, jsonify

from aictcscraper import extract_notices_and_events

app = Flask(__name__)

@app.route('/api/scrape', methods=['POST'])
def run_scraper():
    try:
        result = asyncio.run(extract_notices_and_events())
        return jsonify({"status": "success"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500

Trigger via HTTP:

curl -X POST http://localhost:8080/api/scrape

Scheduled with Cron

# crontab -e
0 2 * * * cd /path/to/collegebuzz && python aictcscraper.py >> scraper.log 2>&1

Key Takeaways for Async Scraping

1. Context Managers Are Essential

async with AsyncWebCrawler() as crawler:
    # Always use context managers
    result = await crawler.arun(url)
# Automatic cleanup, even on errors

2. Don't Over-Optimize

Sequential scraping of 100+ sites in 12 minutes is good enough for daily jobs. Don't chase 100% concurrency at the cost of stability.

3. Schema-Based Extraction > Manual Parsing

Declarative CSS schemas are:

  • Easier to maintain
  • Easier to debug
  • Easier to scale

4. Handle Failures Gracefully

if not result.success:
    print(f"Failed: {result.error_message}")
    continue  # Don't crash entire pipeline

Resources & Credits

Huge thanks to @unclecode for creating Crawl4AI! This library made async scraping approachable.

Learn More: Crawl4AI on GitHub (https://github.com/unclecode/crawl4ai)

CollegeBuzz Series: Part 1 covers the MongoDB archiving and deduplication system; this post is Part 2.


Closing Thoughts

Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was simplicity.

No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.

If you're scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.


Found this helpful? Hit that ❤️ and follow for Part 3!

Questions? Drop a comment or reach out @pradippanjiyar


This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.

Top comments (1)

OnlineProxy

When scraping 100+ sites concurrently, stability comes from enforcing per-domain concurrency limits with semaphores and using bounded asyncio queues to apply backpressure between the download, parse, and storage stages. For resource-intensive work like JS rendering, reuse browser contexts and clean up with async context managers to prevent leaks. For storage, pair async crawlers with MongoDB via Motor, write in batches, and tune the connection pool for throughput. The most balanced results come from a hybrid architecture: a fast async request cycle for static pages and selective browser rendering for dynamic ones, with concurrency adapted to system metrics.