The Problem
As a machine learning enthusiast, I often find myself browsing Roboflow Universe looking for pre-trained models. But manually searching, clicking through pages, and copying API endpoints is tedious. I wanted a way to:
- Search for models by keywords
- Extract detailed information (metrics, classes, API endpoints)
- Get structured data I could use programmatically
So I built a Python web scraper that does exactly that!
What It Does
The Roboflow Universe Search Agent is a Python tool that:
- Searches Roboflow Universe with custom keywords
- Extracts model details (title, author, metrics, classes)
- Finds API endpoints using multiple extraction strategies
- Outputs structured JSON data
- Handles retries and errors gracefully
The Challenge: Finding API Endpoints
The trickiest part was reliably extracting API endpoints. Roboflow displays them in various places:
- JavaScript code snippets
- Model ID variables
- Input fields
- Page text
- Legacy endpoint formats
I needed a robust solution that wouldn't break if the website structure changed.
The Solution: Multi-Strategy Extraction
Instead of relying on a single method, I implemented 6 different extraction strategies with fallbacks:
Strategy 1: JavaScript Code Blocks
The most reliable source - API endpoints appear in code snippets:
```python
js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]
```
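As a rough sketch of how a pattern list like this gets applied (the helper name and the `html` input are illustrative, not the tool's actual code):

```python
import re

def extract_endpoint(html, patterns):
    # Try each regex in priority order; return the first endpoint found.
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return "https://serverless.roboflow.com/" + match.group(1)
    return None  # nothing matched; fall through to the next strategy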
Strategy 2: Model ID Patterns
Extract from JavaScript variables:
```python
model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]
```
Strategy 3: Input Fields & Textareas
Check form elements and code blocks:
```python
input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]
```
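As an illustrative sketch (not the tool's exact code), scanning those elements with Playwright's sync API might look like:

```python
def scan_form_elements(page, selectors):
    # Check the value/text of each matching element for an endpoint URL.
    for selector in selectors:
        for element in page.query_selector_all(selector):
            content = element.get_attribute("value") or element.inner_text()
            if content and "serverless.roboflow.com" in content:
                return content.strip()
    return None
```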
Strategy 4: Page Text Search
Fallback to visible text on the page
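A minimal sketch of that fallback, assuming the endpoint path shows up somewhere in the rendered text:

```python
import re

def search_page_text(page):
    # Pull all visible text and look for anything shaped like an endpoint.
    text = page.inner_text("body")
    match = re.search(r"serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)", text)
    return match.group(1) if match else None
```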
Strategy 5: Legacy Endpoints
Support older endpoint formats:
- detect.roboflow.com
- classify.roboflow.com
- segment.roboflow.com
Strategy 6: URL Construction
Build endpoint from page URL structure if all else fails
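Universe project URLs follow `universe.roboflow.com/<workspace>/<project>`, so a last-resort endpoint can be guessed from the URL itself. A sketch, with version `1` assumed when none is visible on the page:

```python
from urllib.parse import urlparse

def endpoint_from_url(page_url, version="1"):
    # /<workspace>/<project>/... -> serverless.roboflow.com/<project>/<version>
    parts = urlparse(page_url).path.strip("/").split("/")
    if len(parts) >= 2:
        return f"https://serverless.roboflow.com/{parts[1]}/{version}"
    return None
```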
This multi-strategy approach keeps endpoint extraction working even when parts of the page structure change.
Tech Stack
- Playwright: Browser automation (more reliable than requests for dynamic content)
- Python 3.7+: Core language
- Regex: Pattern matching for extraction
Usage
Basic Example
```bash
# Search for basketball detection models
SEARCH_KEYWORDS="basketball model object detection" \
MAX_PROJECTS=5 \
python roboflow_search_agent.py
```
JSON Output
```bash
# Get structured JSON output
SEARCH_KEYWORDS="soccer ball instance segmentation" \
OUTPUT_JSON=true \
python roboflow_search_agent.py
```
Example Output
```json
[
  {
    "project_title": "Basketball Detection",
    "url": "https://universe.roboflow.com/workspace/basketball-detection",
    "author": "John Doe",
    "project_type": "Object Detection",
    "has_model": true,
    "mAP": "85.2%",
    "precision": "87.1%",
    "recall": "83.5%",
    "training_images": "5000",
    "classes": ["basketball", "player"],
    "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1",
    "model_identifier": "workspace/basketball-detection"
  }
]
```
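With that structure, downstream filtering is easy. For example, assuming the output was saved to a hypothetical `results.json`:

```python
import json

with open("results.json") as f:
    results = json.load(f)

# Keep only models whose mAP clears 80%.
strong = [r for r in results
          if r.get("mAP") and float(r["mAP"].rstrip("%")) >= 80.0]
for model in strong:
    print(model["project_title"], "->", model["api_endpoint"])
```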
Key Features
1. Intelligent Search
The tool applies the "Has a Model" filter automatically and handles keyword prioritization.
2. Comprehensive Data Extraction
Extracts:
- Performance metrics (mAP@50, Precision, Recall)
- Training data info (image count, classes)
- Project metadata (author, update time, tags)
- API endpoints (the hard part!)
3. Robust Error Handling
- Automatic retries (3 attempts)
- Graceful failure handling
- Timeout management
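The retry logic boils down to a loop like this (a sketch: the attempt count of 3 matches the list above, the backoff is illustrative):

```python
import time

def with_retries(action, attempts=3, delay=2.0):
    # Re-run a flaky scraping step, backing off between attempts.
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise  # all attempts exhausted; let the caller handle it
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(delay * attempt)
```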
4. Flexible Output
- Human-readable console output
- JSON format for programmatic use
- Configurable via environment variables
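Reading those environment variables is straightforward; a minimal sketch (the variable names match the usage examples above, the defaults are illustrative):

```python
import os

SEARCH_KEYWORDS = os.environ.get("SEARCH_KEYWORDS", "")
MAX_PROJECTS = int(os.environ.get("MAX_PROJECTS", "5"))
OUTPUT_JSON = os.environ.get("OUTPUT_JSON", "false").lower() == "true"
```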
Technical Highlights
Browser Automation with Playwright
```python
from playwright.sync_api import sync_playwright

def connect_browser(headless=True):
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=headless,
        args=["--no-sandbox", "--disable-setuid-sandbox"]
    )
    context = browser.new_context(viewport={"width": 1440, "height": 900})
    page = context.new_page()  # open the tab the scraper will drive
    return playwright, browser, context, page
```
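Typical usage, with teardown so the browser process doesn't linger:

```python
playwright, browser, context, page = connect_browser()
try:
    page.goto("https://universe.roboflow.com", timeout=60_000)
    print(page.title())
finally:
    browser.close()
    playwright.stop()
```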
Smart Scrolling
Instead of fixed waits, the scraper detects when content stops loading:
```python
def scroll_page(page, max_scrolls=15):
    last_height = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight)")
        page.wait_for_timeout(800)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```
Lessons Learned
- Multiple Strategies > Single Strategy: Having fallbacks makes the scraper much more reliable
- Playwright > Requests: For dynamic sites, browser automation is essential
- Pattern Matching: Regex patterns need careful testing with real data
- Error Handling: Web scraping is fragile - always have retry logic
Use Cases
- Research: Quickly find models for specific tasks
- API Discovery: Extract endpoints for integration
- Model Comparison: Compare metrics across multiple models
- Automation: Integrate into ML pipelines
Installation
```bash
# Clone the repository
git clone https://github.com/SumitS10/Roboflow-.git
cd Roboflow-

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
```
Future Improvements
- [ ] Add filtering by metrics (e.g., mAP > 80%)
- [ ] Support for batch processing multiple searches
- [ ] Export to CSV/Excel
- [ ] Add model comparison features
- [ ] Cache results to avoid re-scraping
Conclusion
Building this scraper taught me a lot about web scraping, browser automation, and handling edge cases. The multi-strategy approach for API extraction was key to making it reliable.
If you're working with Roboflow models or need to automate model discovery, give it a try! Contributions and feedback are welcome.
Links
GitHub Repository: https://github.com/SumitS10/Roboflow-.git
Roboflow Universe: universe.roboflow.com
Tags: #python #webscraping #machinelearning #roboflow #playwright #automation #api #ml