Sumit Sapkota

Building a Roboflow Universe Search Agent: Automating ML Model Discovery

The Problem

As a machine learning enthusiast, I often find myself browsing Roboflow Universe looking for pre-trained models. But manually searching, clicking through pages, and copying API endpoints is tedious. I wanted a way to:

  • Search for models by keywords
  • Extract detailed information (metrics, classes, API endpoints)
  • Get structured data I could use programmatically

So I built a Python web scraper that does exactly that! 🚀

What It Does

The Roboflow Universe Search Agent is a Python tool that:

✅ Searches Roboflow Universe with custom keywords

✅ Extracts model details (title, author, metrics, classes)

✅ Finds API endpoints using multiple extraction strategies

✅ Outputs structured JSON data

✅ Handles retries and errors gracefully

The Challenge: Finding API Endpoints

The trickiest part was reliably extracting API endpoints. Roboflow displays them in various places:

  • JavaScript code snippets
  • Model ID variables
  • Input fields
  • Page text
  • Legacy endpoint formats

I needed a robust solution that wouldn't break if the website structure changed.

The Solution: Multi-Strategy Extraction

Instead of relying on a single method, I implemented 6 different extraction strategies with fallbacks:

Strategy 1: JavaScript Code Blocks

The most reliable source - API endpoints appear in code snippets:

js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]
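
In the scraper, these patterns would be applied to the raw page HTML roughly like this (the helper name and wiring are my own sketch; only the patterns above come from the tool):

import re

def find_serverless_endpoint(html):
    # Try each pattern in order and return the first endpoint found.
    for pattern in js_patterns:
        match = re.search(pattern, html)
        if match:
            return f"https://serverless.roboflow.com/{match.group(1)}"
    return None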

Strategy 2: Model ID Patterns

Extract from JavaScript variables:

model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]

Strategy 3: Input Fields & Textareas

Check form elements and code blocks:

input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]
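
Applying these selectors with Playwright could look roughly like the sketch below (my own framing, not the tool's exact code):

import re

def check_input_fields(page):
    for selector in input_selectors:
        for element in page.query_selector_all(selector):
            # Inputs expose the endpoint via their value; code/textarea via text.
            text = element.get_attribute("value") or element.inner_text()
            match = re.search(r"serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)", text or "")
            if match:
                return f"https://serverless.roboflow.com/{match.group(1)}"
    return None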

Strategy 4: Page Text Search

Fall back to scanning the visible text on the page.
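
A minimal sketch of this last-resort scan, assuming the same endpoint regex as the earlier strategies (the function name is mine):

import re

def search_page_text(page):
    # Scan everything the user can see for an endpoint-shaped string.
    body_text = page.inner_text("body")
    match = re.search(r"serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)", body_text)
    return f"https://serverless.roboflow.com/{match.group(1)}" if match else None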

Strategy 5: Legacy Endpoints

Support older endpoint formats (a combined pattern is sketched after this list):

  • detect.roboflow.com
  • classify.roboflow.com
  • segment.roboflow.com
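
One combined pattern covering these domains might look like this (my guess at the shape; the tool's actual patterns may differ):

import re

legacy_endpoint_pattern = re.compile(
    r"https://(detect|classify|segment)\.roboflow\.com/([a-z0-9\-_]+/\d+)"
)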

Strategy 6: URL Construction

Build the endpoint from the page URL structure if all else fails.
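
This is my own sketch of that idea: it assumes Universe URLs shaped like universe.roboflow.com/<workspace>/<project> and defaults to version 1, which the post does not confirm.

from urllib.parse import urlparse

def endpoint_from_url(page_url, version="1"):
    # e.g. https://universe.roboflow.com/workspace/basketball-detection
    parts = [p for p in urlparse(page_url).path.split("/") if p]
    if len(parts) >= 2:
        project = parts[1]  # assumed: second path segment is the project slug
        return f"https://serverless.roboflow.com/{project}/{version}"
    return None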

This layered approach means the scraper can still locate the API endpoint even when individual parts of the page change!
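
Wired together, the fallback chain might look like this sketch, where each strategy is a callable that takes the Playwright page and returns an endpoint or None (my framing, not the tool's exact code):

def extract_api_endpoint(page, strategies):
    # Try each strategy in order and stop at the first hit.
    for strategy in strategies:
        try:
            endpoint = strategy(page)
        except Exception:
            continue  # one broken strategy shouldn't sink the whole extraction
        if endpoint:
            return endpoint
    return None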

Tech Stack

  • Playwright: Browser automation (more reliable than requests for dynamic content)
  • Python 3.7+: Core language
  • Regex: Pattern matching for extraction

Usage

Basic Example

# Search for basketball detection models
SEARCH_KEYWORDS="basketball model object detection" \
MAX_PROJECTS=5 \
python roboflow_search_agent.py
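
Inside the script, reading this configuration presumably boils down to a few os.environ lookups; the defaults below are my guesses:

import os

SEARCH_KEYWORDS = os.environ.get("SEARCH_KEYWORDS", "object detection")
MAX_PROJECTS = int(os.environ.get("MAX_PROJECTS", "5"))
OUTPUT_JSON = os.environ.get("OUTPUT_JSON", "false").lower() == "true"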

JSON Output

# Get structured JSON output
SEARCH_KEYWORDS="soccer ball instance segmentation" \
OUTPUT_JSON=true \
python roboflow_search_agent.py

Example Output

[
  {
    "project_title": "Basketball Detection",
    "url": "https://universe.roboflow.com/workspace/basketball-detection",
    "author": "John Doe",
    "project_type": "Object Detection",
    "has_model": true,
    "mAP": "85.2%",
    "precision": "87.1%",
    "recall": "83.5%",
    "training_images": "5000",
    "classes": ["basketball", "player"],
    "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1",
    "model_identifier": "workspace/basketball-detection"
  }
]
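
Because the output is plain JSON, downstream filtering takes a few lines. For example, keeping only models above 80% mAP (the results.json filename and the threshold are my own choices):

import json

with open("results.json") as f:
    models = json.load(f)

# Metric strings look like "85.2%", so strip the percent sign before comparing.
strong = [m for m in models if m.get("mAP") and float(m["mAP"].rstrip("%")) > 80]
for m in strong:
    print(m["project_title"], "->", m["api_endpoint"])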

Key Features

1. Intelligent Search

The tool applies the "Has a Model" filter automatically and handles keyword prioritization.

2. Comprehensive Data Extraction

Extracts:

  • Performance metrics (mAP@50, Precision, Recall)
  • Training data info (image count, classes)
  • Project metadata (author, update time, tags)
  • API endpoints (the hard part!)

3. Robust Error Handling

  • Automatic retries (3 attempts; a sketch follows this list)
  • Graceful failure handling
  • Timeout management
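
The retry behavior could be as simple as a small wrapper like this (a sketch, not the post's actual code):

import time

def with_retries(fn, attempts=3, delay=2.0):
    # Re-run fn up to `attempts` times, re-raising only on the final failure.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s...")
            time.sleep(delay)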

4. Flexible Output

  • Human-readable console output
  • JSON format for programmatic use
  • Configurable via environment variables

Technical Highlights

Browser Automation with Playwright

from playwright.sync_api import sync_playwright

def connect_browser(headless=True):
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=headless,
        args=["--no-sandbox", "--disable-setuid-sandbox"]
    )
    context = browser.new_context(viewport={"width": 1440, "height": 900})
    page = context.new_page()  # open the tab the scraper will drive
    return playwright, browser, context, page

Smart Scrolling

Instead of fixed waits, the scraper detects when content stops loading:

def scroll_page(page, max_scrolls=15):
    last_height = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight)")
        page.wait_for_timeout(800)  # give lazy-loaded results time to render
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # page height stopped growing, so no more content is loading
        last_height = new_height

Lessons Learned

  1. Multiple Strategies > Single Strategy: Having fallbacks makes the scraper much more reliable
  2. Playwright > Requests: For dynamic sites, browser automation is essential
  3. Pattern Matching: Regex patterns need careful testing with real data
  4. Error Handling: Web scraping is fragile - always have retry logic

Use Cases

  • Research: Quickly find models for specific tasks
  • API Discovery: Extract endpoints for integration
  • Model Comparison: Compare metrics across multiple models
  • Automation: Integrate into ML pipelines

Installation

# Clone the repository
git clone https://github.com/SumitS10/Roboflow-.git
cd Roboflow-

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

Future Improvements

  • [ ] Add filtering by metrics (e.g., mAP > 80%)
  • [ ] Support for batch processing multiple searches
  • [ ] Export to CSV/Excel
  • [ ] Add model comparison features
  • [ ] Cache results to avoid re-scraping

Conclusion

Building this scraper taught me a lot about web scraping, browser automation, and handling edge cases. The multi-strategy approach for API extraction was key to making it reliable.

If you're working with Roboflow models or need to automate model discovery, give it a try! Contributions and feedback are welcome.

Links

🔗 GitHub Repository: https://github.com/SumitS10/Roboflow-.git
🌐 Roboflow Universe: universe.roboflow.com


Tags: #python #webscraping #machinelearning #roboflow #playwright #automation #api #ml
