A Quick Note on Timeline
Although this is Part 2 of the CollegeBuzz series, it was actually the first major component I built. The MongoDB archiving system came later because I needed to solve data consistency issues after scraping was already running in production.
Lesson learned: Start with archiving from Day 1. But hindsight is 20/20, right?
The Problem: Scraping 100+ Colleges Without Losing My Mind
When I started building CollegeBuzz, an AICTE academic news aggregator, I needed to scrape notices, events, and announcements from 100+ Indian engineering colleges daily.
My first naive attempt:
import requests
from bs4 import BeautifulSoup
def scrape_college(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data...
    return data

# The slow way
college_urls = ["https://iitb.ac.in", "https://nitt.edu", ...]
all_data = []
for url in college_urls:
    data = scrape_college(url)
    all_data.append(data)

# Total time: 4+ hours
Why so slow?
Each request waits for the previous one to complete. If one site takes 8 seconds to respond, my scraper just sits there... waiting. Multiply that across 100+ sites and you get an eternity.
I needed something better.
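To see why waiting concurrently beats waiting in series, here's a toy sketch that uses asyncio.sleep as a stand-in for slow HTTP calls. This is an illustration only, not CollegeBuzz code:

import asyncio
import time

async def fake_fetch(site, delay):
    # Stand-in for a slow HTTP request: just wait, then report back
    await asyncio.sleep(delay)
    return f"{site}: done"

async def main():
    sites = [("college-a", 3), ("college-b", 3), ("college-c", 3)]
    start = time.perf_counter()
    # All three "requests" wait at the same time instead of back-to-back
    results = await asyncio.gather(*(fake_fetch(name, delay) for name, delay in sites))
    print(results, f"~{time.perf_counter() - start:.0f}s total")  # ~3s, not ~9s

asyncio.run(main())

Three simulated 3-second requests finish in about 3 seconds instead of 9, because the waiting overlaps. That's the core idea behind the rest of this post.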
Discovering Crawl4AI: The Game Changer
After wrestling with BeautifulSoup and Selenium, I found Crawl4AI by @unclecode.
Why Crawl4AI for async scraping?
- Built for asyncio from the ground up: native async/await support
- CSS-based extraction strategies: no more manual BeautifulSoup parsing
- Works out of the box: handles browser automation, retries, and error handling
- Battle-tested: 50k+ GitHub stars
Resources:
- YouTube Channel - excellent tutorials by the creator
- GitHub Repository
- Official Documentation
My Async Scraping Architecture
Instead of trying to scrape everything at once, I built a controlled async pipeline:
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# (plus project-local imports: MongoDBHandler from Part 1, urls from crawler_config.py)

async def extract_notices_and_events():
    """Main async scraping orchestrator"""
    # Initialize MongoDB handler
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))

    # Single crawler instance handles all sites
    async with AsyncWebCrawler(verbose=True) as crawler:
        for site in urls:  # Sequential at site level
            # Configure extraction strategy
            extraction_strategy = JsonCssExtractionStrategy(
                site["schema"],
                verbose=True
            )
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,  # Always fresh data
                extraction_strategy=extraction_strategy
            )

            # Async scrape (non-blocking!)
            result = await crawler.arun(url=site["url"], config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                # Process and store data
                process_and_store(data, mongo_handler)
Key Design Decision: Sequential Sites, Async Pages
I intentionally scrape sites one-by-one but use async for the actual HTTP requests. Why?
- Avoid IP bans: a burst of 100 concurrent automated requests from one IP raises red flags
- Resource management: one browser at a time keeps memory under control
- Error isolation: if one site fails, the others continue
The Magic: CSS-Based Extraction
Instead of writing BeautifulSoup parsing for each site, I use declarative schemas:
# In crawler_config.py
urls = [
    {
        "url": "https://www.iitb.ac.in/",
        "schema": {
            "name": "IIT Bombay Notices",
            "baseSelector": ".notice-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h3.title",
                    "type": "text"
                },
                {
                    "name": "notice_url",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ]
        }
    },
    # ... 100+ more colleges
]
Then in my scraper:
for site in urls:
    extraction_strategy = JsonCssExtractionStrategy(
        site["schema"],
        verbose=True
    )

    result = await crawler.arun(
        url=site["url"],
        config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
    )

    # Get clean JSON data, no BeautifulSoup needed!
    data = json.loads(result.extracted_content)
Why this is powerful for async scraping:
- No manual parsing: Crawl4AI handles the HTML extraction
- Maintainable: update schemas without touching scraper logic
- Scalable: add new colleges by adding new schema objects (see the example below)
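For example, onboarding a new college is just another entry in crawler_config.py. The college and selectors below are made up for illustration:

# Another entry for the urls list in crawler_config.py
# (hypothetical college and selectors, for illustration only)
{
    "url": "https://www.example-college.ac.in/",
    "schema": {
        "name": "Example College Notices",
        "baseSelector": "ul.notices li",
        "fields": [
            {"name": "title", "selector": "a", "type": "text"},
            {"name": "notice_url", "selector": "a", "type": "attribute", "attribute": "href"},
        ],
    },
},

No scraper code changes, no new parser: the loop above picks it up on the next run.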
Real-World Async Patterns I Used
Pattern 1: Context Manager for Resource Cleanup
async with AsyncWebCrawler(verbose=True) as crawler:
    # Use crawler
    result = await crawler.arun(url=url, config=config)
# Automatically closes browser and releases resources
Even if errors occur, the browser gets cleaned up. Critical for long-running scrapers.
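Under the hood, async with is roughly a try/finally around the crawler's setup and teardown hooks. This is a simplified sketch, not Python's exact expansion:

crawler = AsyncWebCrawler(verbose=True)
await crawler.__aenter__()          # start the browser
try:
    result = await crawler.arun(url=url, config=config)
finally:
    await crawler.__aexit__(None, None, None)   # always runs, even if arun raised

The finally-style cleanup is exactly why the browser never leaks, no matter how the scrape ends.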
Pattern 2: Handling Failures Gracefully
result = await crawler.arun(url=site["url"], config=config)

if not result.success:
    print(f"Crawl failed for {site['url']}: {result.error_message}")
    continue  # Skip this site, move to next

# Process successful result
data = json.loads(result.extracted_content)
One failed site doesn't crash the entire pipeline.
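If you want to be extra defensive, guard the JSON parsing step too. This is an optional addition I'd suggest, not something Crawl4AI requires:

if not result.success or not result.extracted_content:
    print(f"Crawl failed for {site['url']}: {result.error_message}")
    continue

try:
    data = json.loads(result.extracted_content)
except json.JSONDecodeError as e:
    print(f"Bad extraction for {site['url']}: {e}")
    continue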
Pattern 3: Async Configuration
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Fresh data every time
    extraction_strategy=extraction_strategy
)

result = await crawler.arun(url=site["url"], config=config)
Crawl4AI's CrawlerRunConfig lets you customize behavior per site without creating new crawler instances.
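A small helper makes that per-site customization explicit. This is a sketch (the make_config name is mine), using only the options already shown above:

def make_config(site):
    """Build a fresh run config for one site; the crawler instance itself is reused."""
    return CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(site["schema"], verbose=True),
    )

# Inside the loop:
result = await crawler.arun(url=site["url"], config=make_config(site))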
Handling Real-World Edge Cases
Edge Case 1: Data Volume Control
Some colleges list 1000+ notices on their homepage. I don't need all of them:
# Site-specific limits
if site["url"] in ["https://www.nitt.edu/", "https://www.iitkgp.ac.in/"]:
    data = data[:10]  # Only recent 10
elif site["url"] in ["https://www.iitk.ac.in/", "https://www.iiti.ac.in/"]:
    data = data[:4]   # These update slowly
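If the list of special cases keeps growing, a cleaner option is to make the limit part of each site's entry in crawler_config.py. The max_items key below is hypothetical, not something my current schema has:

# Hypothetical extra key per site, e.g.:
# {"url": "https://www.nitt.edu/", "schema": {...}, "max_items": 10}

max_items = site.get("max_items")   # None means keep everything
if max_items is not None:
    data = data[:max_items]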
Edge Case 2: URL Normalization
College websites have inconsistent URL formats:
from urllib.parse import urljoin

def process_url(base_url, extracted_url):
    """Convert relative URLs to absolute"""
    if not extracted_url:
        return extracted_url

    extracted_url = extracted_url.strip()

    # Handle relative URLs
    if not extracted_url.startswith(("http://", "https://")):
        return urljoin(base_url, extracted_url)

    return extracted_url
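A couple of quick examples of what this normalization does (the paths are made up):

process_url("https://www.iitb.ac.in/", "/newsevents/notice.pdf")
# -> "https://www.iitb.ac.in/newsevents/notice.pdf"

process_url("https://www.iitb.ac.in/", "https://cdn.example.org/notice.pdf")
# -> returned unchanged, already absolute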
Edge Case 3: JavaScript URL Madness (IIT Roorkee)
IIT Roorkee embeds JavaScript in href attributes:
<a href="window.open('/events/workshop.pdf')">View Event</a>
Solution:
if site["url"] == "https://www.iitr.ac.in/":
for entry in data:
if "upcoming_Event_url" in entry and "window.open(" in entry["upcoming_Event_url"]:
# Extract actual URL from JavaScript
match = re.search(r"window\.open\('([^']+)'\)", entry["upcoming_Event_url"])
if match:
entry["upcoming_Event_url"] = match.group(1)
Performance: The Numbers
Before (Sequential with Requests + BeautifulSoup):
- Time: 4 hours, 23 minutes
- Average: ~150 seconds per site
- Memory: ~200MB stable
After (Async with Crawl4AI):
- Time: 12 minutes, 30 seconds
- Average: ~7.5 seconds per site
- Memory: ~600MB peak (browser overhead)
20x faster with better reliability!
Why Not Fully Concurrent?
You might ask: "Why not scrape all 100+ sites simultaneously with asyncio.gather()?"
# Why I DON'T do this:
tasks = [scrape_college(crawler, site) for site in urls]
results = await asyncio.gather(*tasks)
I tried this. Results:
- IP bans from 12 colleges
- Memory explosion (100 browsers = 8GB+ RAM)
- Browser crashes
Lesson: For daily scraping of 100+ different domains, sequential with async is the sweet spot:
- Fast enough (12 minutes vs 4 hours)
- Respectful to websites (no hammering)
- Stable and maintainable
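That said, if you ever need more throughput than one site at a time, there is a middle ground: cap concurrency with an asyncio.Semaphore. This is not what CollegeBuzz ships; it's a sketch of the option, assuming a hypothetical scrape_one() coroutine that wraps the per-site logic shown earlier:

import asyncio

MAX_CONCURRENT = 5  # small cap: faster than sequential, far gentler than 100-at-once
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def bounded_scrape(crawler, site):
    async with semaphore:                        # at most MAX_CONCURRENT sites in flight
        return await scrape_one(crawler, site)   # hypothetical per-site wrapper

async def scrape_all(crawler, urls):
    tasks = [bounded_scrape(crawler, site) for site in urls]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)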
Integration with the Full Pipeline
Here's how async scraping fits into CollegeBuzz:
# In aictcscraper.py
async def extract_notices_and_events():
    mongo_handler = MongoDBHandler(uri=os.environ.get("MONGO_URI"))
    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            for site in urls:
                # Async scraping (per-site config setup shown earlier, abridged here)
                result = await crawler.arun(url=site["url"], config=config)
                data = json.loads(result.extracted_content)

                # Insert into MongoDB (with deduplication from Part 1)
                mongo_handler.insert_data(collection_name, records)
    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        mongo_handler.close_connection()
The scraper feeds data into the MongoDB handler, which automatically:
- Deduplicates records (from Part 1)
- Updates timestamps
- Archives old data
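For reference, the dedup step boils down to an upsert keyed on something unique per notice. Here's a simplified pymongo sketch (the function and field names are illustrative, not the exact handler from Part 1):

from datetime import datetime, timezone

def upsert_records(db, collection_name, records):
    """Upsert each record so re-scraped notices update in place instead of duplicating."""
    # db is a pymongo Database, e.g. MongoClient(MONGO_URI)["collegebuzz"]
    collection = db[collection_name]
    for record in records:
        collection.update_one(
            {"notice_url": record["notice_url"]},                           # illustrative unique key
            {"$set": {**record, "updated_at": datetime.now(timezone.utc)}},
            upsert=True,
        )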
Running the Scraper
Manual Trigger
python aictcscraper.py
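For that to work, aictcscraper.py needs an asyncio entry point at the bottom, something like:

if __name__ == "__main__":
    asyncio.run(extract_notices_and_events())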
Via Flask API
# In app.py
@app.route('/api/scrape', methods=['POST'])
def run_scraper():
    try:
        result = asyncio.run(extract_notices_and_events())
        return jsonify({"status": "success"}), 200
    except Exception as e:
        return jsonify({"status": "error", "message": str(e)}), 500
Trigger via HTTP:
curl -X POST http://localhost:8080/api/scrape
Scheduled with Cron
# crontab -e
0 2 * * * cd /path/to/collegebuzz && python aictcscraper.py >> scraper.log 2>&1
Key Takeaways for Async Scraping
1. Context Managers Are Essential
async with AsyncWebCrawler() as crawler:
    # Always use context managers
    result = await crawler.arun(url)
# Automatic cleanup, even on errors
2. Don't Over-Optimize
Sequential scraping of 100+ sites in 12 minutes is good enough for daily jobs. Don't chase 100% concurrency at the cost of stability.
3. Schema-Based Extraction > Manual Parsing
Declarative CSS schemas are:
- Easier to maintain
- Easier to debug
- Easier to scale
4. Handle Failures Gracefully
if not result.success:
    print(f"Failed: {result.error_message}")
    continue  # Don't crash entire pipeline
Resources & Credits
Huge thanks to @unclecode for creating Crawl4AI! This library made async scraping approachable.
Learn More:
- Crawl4AI YouTube Tutorials
- GitHub: unclecode/crawl4ai
- Official Documentation
- Python asyncio Docs
CollegeBuzz Series:
- Part 1: The MongoDB archiving and deduplication system
- Part 2: Async scraping with Crawl4AI (this post)
Closing Thoughts
Async scraping with Crawl4AI transformed CollegeBuzz from a 4-hour batch job to a 12-minute operation. But the real win was simplicity.
No threading nightmares. No multiprocessing complexity. Just clean async/await code with schema-based extraction.
If you're scraping multiple websites in 2025, start with Crawl4AI and asyncio. Your future self will thank you.
Found this helpful? Hit that ❤️ and follow for Part 3!
Questions? Drop a comment or reach out @pradippanjiyar
This is Part 2 of the CollegeBuzz engineering series. All code examples are from my production system.