The Problem
Ever tried monitoring AI developments across arXiv, GitHub, and news sites simultaneously? Yeah, my laptop's fan wasn't happy about those 40+ browser tabs either.
The Solution: AiLert
I built an open-source content aggregator using Python & AWS. Here's the technical breakdown:
Core Architecture
```python
# Initial naive approach: fetch each source sequentially, blocking on every request
for source in sources:
    content = fetch_content(source)
```
Bad idea!
```python
# Current async implementation
async def fetch_content(session, source):
    async with session.get(source.url) as response:
        return await response.text()
```
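The speedup comes from running all the fetches at once. Here's a minimal driver sketch using aiohttp and asyncio.gather; the `Source` dataclass and the example URLs are stand-ins for illustration, not AiLert's actual data model:

```python
import asyncio
from dataclasses import dataclass

import aiohttp

@dataclass
class Source:
    url: str  # stand-in for AiLert's source objects

async def fetch_content(session, source):
    # Same coroutine as above
    async with session.get(source.url) as response:
        return await response.text()

async def fetch_all(sources):
    # One shared session; all requests run concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_content(session, source) for source in sources]
        return await asyncio.gather(*tasks, return_exceptions=True)

sources = [Source("https://arxiv.org"), Source("https://github.com/trending")]
pages = asyncio.run(fetch_all(sources))
```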
Key Technical Features
- Async Content Fetching
  - aiohttp for concurrent requests
  - Custom rate limiting (see the sketch after this list)
  - Error handling & retries
- Smart Deduplication

```python
from rapidfuzz import fuzz  # assuming rapidfuzz; any library exposing fuzz.ratio works

def similarity_check(text1, text2):
    # Embedding-based similarity (get_embeddings / cosine_similarity are
    # assumed to be the project's own helpers)
    emb1, emb2 = get_embeddings(text1, text2)
    score = cosine_similarity(emb1, emb2)
    # Fallback to fuzzy matching; fuzz.ratio is 0-100, so rescale to 0-1
    # to keep the return value on the same scale as the cosine score
    return fuzz.ratio(text1, text2) / 100 if score < 0.8 else score
```
- AWS Integration
  - DynamoDB for flexible storage
  - Auto-scaling capabilities
  - Cost-effective data management
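The custom rate limiting mentioned above can be as simple as a semaphore capping in-flight requests. A minimal sketch, assuming a global cap of 5 concurrent requests (the cap and function name are illustrative, not AiLert's actual values):

```python
import asyncio

import aiohttp

# Illustrative cap; real limits would likely vary per source
MAX_CONCURRENT = 5
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def rate_limited_fetch(session, url):
    # Only MAX_CONCURRENT coroutines get past this point at a time
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
```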
Technical Challenges & Solutions
1. Memory Management
Initial SQLite implementation:
data.db: 8.2GB and growing
Solution: Switched to DynamoDB with selective data retention
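One way to implement that selective retention is DynamoDB's native TTL: stamp each item with an expiry epoch and the table evicts it automatically. A sketch with boto3, assuming a hypothetical `ailert_content` table and a 30-day window (table name, attribute name, and window are my assumptions, not the project's):

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ailert_content")  # hypothetical table name

# Enable TTL once per table (can also be done in the AWS console)
boto3.client("dynamodb").update_time_to_live(
    TableName="ailert_content",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

def put_item_with_retention(item, days=30):
    # DynamoDB deletes the item automatically once expires_at passes
    item["expires_at"] = int(time.time()) + days * 24 * 3600
    table.put_item(Item=item)
```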
2. Content Processing
Challenge: JavaScript-heavy sites and rate limits
Solution: Custom scraping strategies and intelligent retry mechanisms
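On the retry side, exponential backoff around the fetch keeps transient failures and 429s from killing a run. A minimal sketch (the attempt count and delays are illustrative; this is one of many ways to do "intelligent retries"):

```python
import asyncio

import aiohttp

async def fetch_with_retries(session, url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            async with session.get(url) as response:
                response.raise_for_status()  # raises on 4xx/5xx, incl. 429
                return await response.text()
        except aiohttp.ClientError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ...
            await asyncio.sleep(2 ** attempt)
```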
3. Deduplication
Challenge: Same content, different formats
Solution: Multi-stage matching algorithm
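The multi-stage idea is to run cheap exact checks first and expensive semantic checks only on survivors. Here's a sketch of that staging, reusing similarity_check from above (the normalization, threshold, and stage order are my guesses at how the stages could be composed, not the project's exact algorithm):

```python
import hashlib

def normalize(text):
    # Strip casing and whitespace differences before hashing
    return " ".join(text.lower().split())

def is_duplicate(new_text, seen_hashes, seen_texts, threshold=0.8):
    # Stage 1: exact match on normalized text (cheap)
    digest = hashlib.sha256(normalize(new_text).encode()).hexdigest()
    if digest in seen_hashes:
        return True
    # Stage 2: semantic/fuzzy match (expensive, only if stage 1 misses)
    if any(similarity_check(new_text, old) >= threshold for old in seen_texts):
        return True
    seen_hashes.add(digest)
    seen_texts.append(new_text)
    return False
```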
Open for Contributions!
Areas where we need help:
- Performance optimization
- Better content categorization
- Template system improvements
- API development
Code: https://github.com/anuj0456/ailert
Docs: https://github.com/anuj0456/ailert/blob/main/README.md