DEV Community

Niksa Kuzmanic


Making Browser Agents More Observable with Raindrop and Steel

Browser automation agents are powerful but notoriously difficult to debug. They fail silently, extract incorrect data, or get blocked by anti-bot protection, and you often don't know until it's too late. When your agent stops working in production, you're left digging through logs trying to figure out which step failed and why.

In this post, I'll show you how to build observable browser agents using Steel for browser automation and Raindrop for comprehensive monitoring. We'll walk through a real example that demonstrates common challenges in web scraping and how the combination of reliable automation plus deep observability helps you catch and debug issues quickly.

Steel handles the complexity of browser automation (JavaScript rendering, anti-bot protection, and session management), while Raindrop captures every decision your agent makes. Together, they turn a black box into a fully observable system.

The Challenge with Browser Agent Monitoring

Traditional monitoring tools weren't built for AI agents. They can tell you if your API returned a 500, but not:

  • Why your agent extracted zero results when the page had twenty items
  • Which CSS selector broke when the site updated
  • Whether you're hitting rate limits or actual errors
  • Where in a multi-step workflow things failed

You need observability that understands agent workflows, not just HTTP codes.

Why Steel + Raindrop?

Steel: Cloud Browser Automation

Steel is a cloud browser API designed for AI agents. Unlike running Puppeteer locally, Steel handles:

  • JavaScript rendering and CAPTCHA solving
  • Anti-bot protection and proxy management
  • Session management across requests
  • Live session viewer for real-time debugging

Steel's scrape() endpoint returns clean, rendered HTML after JavaScript execution—what a real user sees, not what curl gets. The API is simple: pass a URL and delay, Steel handles the rest.

The session model is key: create a persistent browser that maintains state across multiple scrapes. Critical for workflows requiring login or navigation. Every session gets a viewer URL where you can watch your automation live.

Raindrop: AI Agent Observability

Raindrop is a monitoring platform built for AI agents. Its event tracking and signal system work perfectly for browser automation:

  • Interaction tracking - Group events into workflows with begin() / finish()
  • Event logging - Track steps with context and metadata
  • Signals - Attach searchable labels for failures
  • Timeline view - See exactly what happened and when

Combined with Steel, you get reliable scraping plus visibility into every agent decision.

How They Work Together

Steel and Raindrop complement each other perfectly:

Steel provides the execution layer:

  • Handles the complex browser automation
  • Returns clean, rendered content
  • Manages sessions and state
  • Deals with anti-bot measures

Raindrop provides the observability layer:

  • Tracks what Steel is being asked to do
  • Records how long operations take
  • Captures errors and edge cases
  • Links Steel session IDs to monitoring data

The integration point is simple but powerful: Steel returns session IDs and viewer URLs that you log to Raindrop. When something fails, you can see both the Raindrop timeline (what the agent tried to do) and the Steel session viewer (what actually happened in the browser).

This dual visibility is the key insight: Steel shows you the browser state, Raindrop shows you the agent logic. Together, they answer "why did this fail?" instead of just "it failed."

Building an Observable Browser Agent

Let's build a practical example: a property search agent. The agent needs to search a property listing site, extract listing data (name, location, price, rating), validate it, and rank results by custom criteria.

This example hits common pain points: dynamic content, complex HTML, data validation, and multi-step workflows.

Project Setup

pip install steel-sdk raindrop-ai python-dotenv

Create a .env file:

STEEL_API_KEY=your_steel_api_key
RAINDROP_WRITE_KEY=your_raindrop_write_key

Architecture Overview

Our agent follows this flow:

1. Create Steel browser session
2. Scrape property listing page (Steel)
3. Parse HTML for listing data
   ├─ Try JSON-LD extraction (structured data)
   └─ Fallback to regex parsing
4. Validate each listing
5. Calculate custom scoring
6. Return ranked results

(Every step logged to Raindrop)

Core Agent Structure

import json
import os
import re
from datetime import datetime
from typing import Dict, List

from steel import Steel
import raindrop.analytics as raindrop

raindrop.init(os.getenv("RAINDROP_WRITE_KEY"))

class PropertySearchAgent:
    def __init__(self):
        self.session_id = f"search_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        self.client = Steel(steel_api_key=os.getenv("STEEL_API_KEY"))
        self.session = None

The session ID becomes our tracking identifier across both Steel and Raindrop—we can correlate browser sessions with monitoring data.

Step 1: Browser Session Management

def start_session(self):
    t0 = datetime.now()
    self.session = self.client.sessions.create()
    duration = (datetime.now() - t0).total_seconds()

    raindrop.track_ai(
        user_id=self.session_id,
        event="session_started",
        input="Create Steel browser session",
        output=f"Session {self.session.id} ready in {duration:.2f}s",
        properties={
            "steel_session_id": self.session.id,
            "viewer_url": self.session.session_viewer_url,
            "duration_seconds": duration,
        },
    )

Steel returns a session object with:

  1. session.id - For subsequent scrapes to maintain state
  2. session.session_viewer_url - Live view of your automation

The session ID lets you reuse the browser instance, maintaining cookies across scrapes. The viewer URL shows you what the page looks like in real-time—invaluable when CSS selectors break.

We log both to Raindrop so you can correlate Steel sessions with monitoring data.

Step 2: Scraping with Observability

This is where observability becomes critical. Scraping can fail in many ways: timeouts, anti-bot detection, rate limiting, or the page structure changing. We need to capture context for debugging.

def scrape_page(self, url: str) -> str:
    # Begin a Raindrop interaction - groups related events
    interaction = raindrop.begin(
        user_id=self.session_id,
        event="page_scrape",
        input=f"Scrape property listings from {url}",
        properties={"url": url},
    )

    try:
        t0 = datetime.now()

        result = self.client.scrape(
            url=url,
            format=["html"],
            delay=3000,  # Wait for JS to render
        )

        duration = (datetime.now() - t0).total_seconds()
        html = result.content.html or ""

        # Save raw HTML for debugging
        with open("scraped_page.html", "w", encoding="utf-8") as f:
            f.write(html)

        # Log successful scrape
        interaction.set_properties({
            "duration_seconds": duration,
            "content_length": len(html)
        })
        interaction.finish(output=f"Scraped {len(html)} chars in {duration:.2f}s")

Steel's scrape() does the heavy lifting:

  • format=["html"] returns fully rendered HTML
  • delay=3000 waits for JavaScript execution
  • Behind the scenes: browser instantiation, proxy management, anti-bot evasion

The returned HTML is what a real browser sees after JavaScript, not initial page source. Critical for modern web apps.

The combination: Steel provides reliable scraping, Raindrop captures context.

When a scrape takes 10s instead of 3s, Raindrop logs show exactly which URL was slow.

# ...continuing scrape_page: still inside the try block from above

        # Signal if the scrape was slow (duration was computed above)
        if duration > 8:
            raindrop.track_signal(
                event_id=interaction.id,
                name="slow_scrape",
                sentiment="NEGATIVE",
                properties={"duration_seconds": duration}
            )

        return html

    except Exception as e:
        # Capture the error context
        interaction.set_properties({"error": str(e)})
        interaction.finish(output=f"Scrape failed: {e}")

        # Attach a signal for easy filtering
        raindrop.track_signal(
            event_id=interaction.id,
            name="scrape_failure",
            sentiment="NEGATIVE",
            properties={
                "error": str(e),
                "url": url
            }
        )

        raise

What makes this observable

  1. Interaction wrapping – raindrop.begin() / .finish() groups all the scraping events together
  2. Timing data – Capture how long each scrape takes
  3. Content validation – Check if we got meaningful data (page size)
  4. Signals – Attach searchable labels for common issues
  5. Debug artifacts – Save the raw HTML for inspection

Raindrop scrape failure example

When something goes wrong (like in the error above), Raindrop shows you:

  • The exact error message
  • How long the scrape took before failing
  • The URL being scraped
  • The full context of what the agent was doing

This beats digging through text logs. You can search for:

signal:scrape_failure

and immediately see all failed scrapes with their error details.

Step 3: Parsing with Fallback Strategies

Real-world scraping requires multiple strategies. Websites change their HTML structure, and you need fallbacks. Here's where observability really pays off—you want to know which parsing method worked (or didn't).

def parse_listings(self, html: str) -> List[Dict]:
    interaction = raindrop.begin(
        user_id=self.session_id,
        event="parse_listings",
        input=f"Parse listings from {len(html)} chars of HTML",
    )

    listings = []

    # Strategy 1: Try JSON-LD structured data
    json_ld_matches = re.findall(
        r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL
    )

    for match in json_ld_matches:
        try:
            data = json.loads(match)
            extracted = self._extract_from_json(data)
            listings.extend(extracted)
        except Exception:
            continue

    if listings:
        method = "json_ld"
        raindrop.track_ai(
            user_id=self.session_id,
            event="json_parse_success",
            input="Parse JSON-LD from HTML",
            output=f"{len(listings)} listings from JSON",
            properties={"count": len(listings), "method": method},
        )
    else:
        # Strategy 2: Regex fallback
        method = "regex"
        raindrop.track_ai(
            user_id=self.session_id,
            event="json_parse_empty",
            input="JSON-LD extraction",
            output="No listings found, trying regex fallback",
        )
        listings = self._regex_parse(html)

    # Validation
    valid = [listing for listing in listings if self._is_valid(listing)]

    interaction.finish(
        output=f"Parsed {len(valid)} valid listings",
        properties={
            "valid_count": len(valid),
            "raw_count": len(listings),
            "method": method,  # records which strategy actually produced the results
        }
    )

    # Signal if we got no results
    if not valid:
        raindrop.track_signal(
            event_id=interaction.id,
            name="no_results",
            sentiment="NEGATIVE"
        )

    return valid
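The `_is_valid` check above isn't shown in the post; one possible shape, assuming listings carry the name, location, price, and rating fields the agent extracts (a sketch, not the repo's actual validator):

```python
def is_valid_listing(listing: dict) -> bool:
    """A listing is only usable if every field we rank on is present and sane."""
    required = ("name", "location", "price", "rating")
    if any(not listing.get(field) for field in required):
        return False
    try:
        price = float(listing["price"])
        rating = float(listing["rating"])
    except (TypeError, ValueError):
        return False  # e.g. price scraped as "Contact us" instead of a number
    return price > 0 and 0 <= rating <= 5
```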

This structure lets you answer critical questions in Raindrop:

  • How often does JSON parsing work? Search for event:json_parse_success
  • When do we fall back to regex? Look for event:json_parse_empty
  • What's our parse success rate? Compare valid_count to raw_count in properties
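That last comparison can be done offline once you export the event properties. A small sketch, assuming each exported event is a dict carrying the `valid_count` and `raw_count` properties logged above (the export format here is hypothetical):

```python
def parse_success_rate(events: list[dict]) -> float:
    """Fraction of raw parsed listings that survived validation across runs."""
    raw = sum(e.get("raw_count", 0) for e in events)
    valid = sum(e.get("valid_count", 0) for e in events)
    return valid / raw if raw else 0.0

runs = [
    {"valid_count": 5, "raw_count": 6},
    {"valid_count": 0, "raw_count": 4},
]
rate = parse_success_rate(runs)  # 5 valid out of 10 raw → 0.5
```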

Raindrop timeline

Step 4: End-to-End Monitoring

Wrap the entire workflow in a top-level interaction:

def run(self, search_term: str, location: str):
    run_interaction = raindrop.begin(
        user_id=self.session_id,
        event="property_search_run",
        input=f"Search for {search_term} in {location}",
        properties={"search_term": search_term, "location": location},
    )

    try:
        self.start_session()
        html = self.scrape_page(self._build_url(search_term, location))
        listings = self.parse_listings(html)
        ranked = self._rank_results(listings)
        self._save_results(ranked)

        run_interaction.finish(
            output=f"Found {len(ranked)} listings",
            properties={"results_count": len(ranked), "success": True}
        )

        raindrop.track_signal(
            event_id=run_interaction.id,
            name="task_success",
            sentiment="POSITIVE",
            properties={"results_count": len(ranked)}
        )

        return ranked

    except Exception as e:
        run_interaction.finish(output=f"Task failed: {e}")
        raindrop.track_signal(
            event_id=run_interaction.id,
            name="task_failure",
            sentiment="NEGATIVE",
            properties={"error": str(e)}
        )
        raise
    finally:
        self.end_session()
        raindrop.flush()

What This Looks Like in Production

After running your agent, Raindrop gives you a complete view:

Timeline:

[15:23:01] session_started (duration: 1.2s)
[15:23:02] page_scrape started
[15:23:08] page_scrape completed (6.1s, 245KB)
[15:23:08] parse_listings started
[15:23:09] json_parse_success (5 listings)
[15:23:11] property_search_run completed (success: true)

Signals:

✓ task_success (session: search_20260217_152301)
✓ results_found (count: 5)

Raindrop dashboard

Debugging with Raindrop

When something breaks, Raindrop lets you:

Search by natural language:

  • "Show me slow scrapes" → signal:slow_scrape
  • "Find parsing failures" → event:json_parse_empty AND valid_count:0
  • "What happened in session X" → session_id:search_20260217_152301

Filter by properties:

  • duration_seconds>5 - Find slow operations
  • valid_count:0 - Parsing returned nothing
  • error:*rate limit* - Rate limiting issues

Compare runs:

  • Look at successful vs failed runs side-by-side
  • See which parsing strategy works more often
  • Track success rates over time

Best Practices

1. Log at Multiple Levels

Use track_ai() for steps and begin()/finish() for workflows:

property_search_run (workflow)
├─ session_started (step)
├─ page_scrape (workflow)
└─ parse_listings (workflow)

2. Attach Rich Context

properties={
    "url": url,
    "duration_seconds": duration,
    "content_length": len(html),
}

3. Use Signals for Searchability

Create signals for common failures: slow_scrape, parse_failure, rate_limited.
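Signals are only searchable if the names stay consistent across the codebase. One way to enforce that is a small wrapper; a sketch that takes the track function as a parameter (so nothing beyond the `track_signal` call shape used earlier is assumed, and the helper is easy to test):

```python
# Vetted failure signal names; typos raise instead of silently fragmenting searches
FAILURE_SIGNALS = {"slow_scrape", "parse_failure", "rate_limited"}

def signal_failure(track_signal, event_id: str, name: str, **props):
    """Emit a NEGATIVE signal with a vetted name (pass raindrop.track_signal)."""
    if name not in FAILURE_SIGNALS:
        raise ValueError(f"Unknown failure signal: {name}")
    track_signal(
        event_id=event_id,
        name=name,
        sentiment="NEGATIVE",
        properties=props,
    )
```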

4. Save Debug Artifacts

Save HTML/screenshots when things fail, reference in Raindrop logs.
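A minimal sketch of that pattern: write the artifact to a timestamped file and return the path, which you can then attach as a property on the failing interaction (the helper and directory name are illustrative, not part of the post's repo):

```python
from datetime import datetime
from pathlib import Path

def save_debug_artifact(content: str, label: str, directory: str = "debug_artifacts") -> str:
    """Write content to a timestamped file; log the returned path in Raindrop."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"{label}_{stamp}.html"
    path.write_text(content, encoding="utf-8")
    return str(path)
```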

Benefits of This Approach

Faster debugging: Raindrop timeline shows where things failed, Steel session viewer shows what was on screen.

Reliable + visible: Steel handles browser complexity, Raindrop shows exactly what it's doing. When Steel solves a CAPTCHA, Raindrop logs the attempt and duration.

Proactive monitoring: Alert on task_failure or slow operations. Catch issues before users do.

Cost optimization: Steel charges by browser hours. Raindrop shows which operations take longest, helping optimize usage.

Conclusion

Browser automation agents are powerful but fragile. Websites change, anti-bot measures evolve, and parsing logic breaks in unexpected ways. Without proper infrastructure and observability, you're flying blind.

Steel and Raindrop solve complementary problems:

Steel solves the execution problem:

  • Reliable browser automation that handles JavaScript, CAPTCHAs, and anti-bot measures
  • Cloud infrastructure so you don't manage browser instances
  • Session management for stateful workflows
  • Live session viewer for real-time debugging

Raindrop solves the observability problem:

  • Complete visibility into agent decision-making
  • Timeline views showing exactly what happened and when
  • Searchable signals for common failure patterns
  • Historical data for trend analysis

Together, they transform browser agents from black boxes into observable, debuggable systems. Steel ensures your scraping works reliably, Raindrop ensures you know when it doesn't—and more importantly, why.

The pattern is straightforward:

  1. Use Steel for reliable scraping with built-in anti-bot protection
  2. Wrap Steel operations in Raindrop interactions
  3. Log Steel session IDs and viewer URLs to Raindrop
  4. Log important steps with context (duration, content size, errors)
  5. Signal on failures and anomalies (slow scrapes, empty results, validation errors)
  6. Query and debug using both Raindrop's timeline and Steel's session viewer

Start with basic logging on your existing agents, then expand to full interaction tracking. You'll catch issues faster, optimize Steel usage, and build more reliable automation.

The complete code for this example is available on GitHub: https://github.com/Niksa-1/Steel-Swamp-Finder. Try it out, and let me know what patterns you find useful for monitoring your agents.

Top comments (4)

Nikola Balic

Nice write-up, Niksa. One thing teams often miss with sessions is treating them as part of the tracing context.

Along with steel_session_id and the viewer URL, log attempt_number, proxy/region (if relevant), the current step, and a simple failure_class (blocked/timeout/empty/selector). Then Raindrop searches are easy, like “selector failures in extract, region EU.”

Niksa Kuzmanic

Thank you! I completely agree, attaching structured context to sessions is a great practice.

It makes the system far easier to reason about over time and actually analyze and learn from at scale. That kind of structure pays off quickly once you start looking for patterns instead of individual failures.

Really appreciate you calling that out, it’s a strong addition to the approach!

nasr mohamed

Great article! I wonder if we can create a custom UI pulling in the session viewer from Steel and the timeline from Raindrop. I’ll have to experiment. But nonetheless, thanks for the article, very informative :)

Niksa Kuzmanic

Thank you so much! I love that idea, it would make for a really powerful debugging setup.
Please share if you end up building it, I’d be excited to follow along!