The Problem
As a machine learning enthusiast, I often find myself browsing Roboflow Universe looking for pre-trained models. But manually searching, clicking through pages, and copying API endpoints is tedious. I wanted a way to:
- Search for models by keywords
- Extract detailed information (metrics, classes, API endpoints)
- Get structured data I could use programmatically
So I built a Python web scraper that does exactly that!
What It Does
The Roboflow Universe Search Agent is a Python tool that:
- Searches Roboflow Universe with custom keywords
- Extracts model details (title, author, metrics, classes)
- Finds API endpoints using multiple extraction strategies
- Outputs structured JSON data
- Handles retries and errors gracefully
The Challenge: Finding API Endpoints
The trickiest part was reliably extracting API endpoints. Roboflow displays them in various places:
- JavaScript code snippets
- Model ID variables
- Input fields
- Page text
- Legacy endpoint formats
I needed a robust solution that wouldn't break if the website structure changed.
The Solution: Multi-Strategy Extraction
Instead of relying on a single method, I implemented 6 different extraction strategies with fallbacks:
Strategy 1: JavaScript Code Blocks
The most reliable source - API endpoints appear in code snippets:
```python
js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]
```
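As a rough sketch of how a pattern list like this gets applied (the helper name and the `html` input are illustrative, not the tool's actual code):

```python
import re

def extract_endpoint(html, patterns):
    # Try each regex in priority order; return the first endpoint found.
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return "https://serverless.roboflow.com/" + match.group(1)
    return None  # nothing matched; fall through to the next strategy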
Strategy 2: Model ID Patterns
Extract from JavaScript variables:
```python
model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]
```
Strategy 3: Input Fields & Textareas
Check form elements and code blocks:
```python
input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]
```
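As an illustrative sketch (not the tool's exact code), scanning those elements with Playwright's sync API might look like:

```python
def scan_form_elements(page, selectors):
    # Check the value/text of each matching element for an endpoint URL.
    for selector in selectors:
        for element in page.query_selector_all(selector):
            content = element.get_attribute("value") or element.inner_text()
            if content and "serverless.roboflow.com" in content:
                return content.strip()
    return None
```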
Strategy 4: Page Text Search
Fallback to visible text on the page
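A minimal sketch of that fallback, assuming the endpoint path shows up somewhere in the rendered text:

```python
import re

def search_page_text(page):
    # Pull all visible text and look for anything shaped like an endpoint.
    text = page.inner_text("body")
    match = re.search(r"serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)", text)
    return match.group(1) if match else None
```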
Strategy 5: Legacy Endpoints
Support older endpoint formats:
- detect.roboflow.com
- classify.roboflow.com
- segment.roboflow.com
Strategy 6: URL Construction
Build endpoint from page URL structure if all else fails
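Universe project URLs follow `universe.roboflow.com/<workspace>/<project>`, so a last-resort endpoint can be guessed from the URL itself. A sketch, with version `1` assumed when none is visible on the page:

```python
from urllib.parse import urlparse

def endpoint_from_url(page_url, version="1"):
    # /<workspace>/<project>/... -> serverless.roboflow.com/<project>/<version>
    parts = urlparse(page_url).path.strip("/").split("/")
    if len(parts) >= 2:
        return f"https://serverless.roboflow.com/{parts[1]}/{version}"
    return None
```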
This multi-strategy approach keeps endpoint extraction working even when parts of the page structure change.
Tech Stack
- Playwright: Browser automation (more reliable than requests for dynamic content)
- Python 3.7+: Core language
- Regex: Pattern matching for extraction
Usage
Basic Example
```bash
# Search for basketball detection models
SEARCH_KEYWORDS="basketball model object detection" \
MAX_PROJECTS=5 \
python roboflow_search_agent.py
```
JSON Output
```bash
# Get structured JSON output
SEARCH_KEYWORDS="soccer ball instance segmentation" \
OUTPUT_JSON=true \
python roboflow_search_agent.py
```
Example Output
```json
[
  {
    "project_title": "Basketball Detection",
    "url": "https://universe.roboflow.com/workspace/basketball-detection",
    "author": "John Doe",
    "project_type": "Object Detection",
    "has_model": true,
    "mAP": "85.2%",
    "precision": "87.1%",
    "recall": "83.5%",
    "training_images": "5000",
    "classes": ["basketball", "player"],
    "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1",
    "model_identifier": "workspace/basketball-detection"
  }
]
```
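With that structure, downstream filtering is easy. For example, assuming the output was saved to a hypothetical `results.json`:

```python
import json

with open("results.json") as f:
    results = json.load(f)

# Keep only models whose mAP clears 80%.
strong = [r for r in results
          if r.get("mAP") and float(r["mAP"].rstrip("%")) >= 80.0]
for model in strong:
    print(model["project_title"], "->", model["api_endpoint"])
```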
Key Features
1. Intelligent Search
The tool applies the "Has a Model" filter automatically and handles keyword prioritization.
2. Comprehensive Data Extraction
Extracts:
- Performance metrics (mAP@50, Precision, Recall)
- Training data info (image count, classes)
- Project metadata (author, update time, tags)
- API endpoints (the hard part!)
3. Robust Error Handling
- Automatic retries (3 attempts)
- Graceful failure handling
- Timeout management
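The retry logic boils down to a loop like this (a sketch: the attempt count of 3 matches the list above, the backoff is illustrative):

```python
import time

def with_retries(action, attempts=3, delay=2.0):
    # Re-run a flaky scraping step, backing off between attempts.
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise  # all attempts exhausted; let the caller handle it
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(delay * attempt)
```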
4. Flexible Output
- Human-readable console output
- JSON format for programmatic use
- Configurable via environment variables
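Reading those environment variables is straightforward; a minimal sketch (the variable names match the usage examples above, the defaults are illustrative):

```python
import os

SEARCH_KEYWORDS = os.environ.get("SEARCH_KEYWORDS", "")
MAX_PROJECTS = int(os.environ.get("MAX_PROJECTS", "5"))
OUTPUT_JSON = os.environ.get("OUTPUT_JSON", "false").lower() == "true"
```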
Technical Highlights
Browser Automation with Playwright
```python
from playwright.sync_api import sync_playwright

def connect_browser(headless=True):
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=headless,
        args=["--no-sandbox", "--disable-setuid-sandbox"]
    )
    context = browser.new_context(viewport={"width": 1440, "height": 900})
    page = context.new_page()  # open the tab the scraper will drive
    return playwright, browser, context, page
```
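Typical usage, with teardown so the browser process doesn't linger:

```python
playwright, browser, context, page = connect_browser()
try:
    page.goto("https://universe.roboflow.com", timeout=60_000)
    print(page.title())
finally:
    browser.close()
    playwright.stop()
```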
Smart Scrolling
Instead of fixed waits, the scraper detects when content stops loading:
```python
def scroll_page(page, max_scrolls=15):
    last_height = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight)")
        page.wait_for_timeout(800)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```
Lessons Learned
- Multiple Strategies > Single Strategy: Having fallbacks makes the scraper much more reliable
- Playwright > Requests: For dynamic sites, browser automation is essential
- Pattern Matching: Regex patterns need careful testing with real data
- Error Handling: Web scraping is fragile - always have retry logic
Use Cases
- Research: Quickly find models for specific tasks
- API Discovery: Extract endpoints for integration
- Model Comparison: Compare metrics across multiple models
- Automation: Integrate into ML pipelines
Installation
```bash
# Clone the repository
git clone https://github.com/SumitS10/Roboflow-.git
cd Roboflow-

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
```
Future Improvements
- [ ] Add filtering by metrics (e.g., mAP > 80%)
- [ ] Support for batch processing multiple searches
- [ ] Export to CSV/Excel
- [ ] Add model comparison features
- [ ] Cache results to avoid re-scraping
Conclusion
Building this scraper taught me a lot about web scraping, browser automation, and handling edge cases. The multi-strategy approach for API extraction was key to making it reliable.
If you're working with Roboflow models or need to automate model discovery, give it a try! Contributions and feedback are welcome.
Links
GitHub Repository: https://github.com/SumitS10/Roboflow-.git
Roboflow Universe: universe.roboflow.com
Tags: #python #webscraping #machinelearning #roboflow #playwright #automation #api #ml