WonderLab

Posted on Jun 19

Agent Series (23): Web Agent — Giving Your Agent Real Eyes on the Internet

#ai #agents #langchain #contextengineering

Why Web Agents Exist

LLMs have a knowledge cutoff. Ask one "what's the latest version of LangGraph?" and it can only tell you what was in its training data. Web Agents solve this: the agent actually browses the internet and returns real-time information.

But "browsing the internet" is more complicated than it sounds:

Web pages are HTML, not text — dumping raw HTML into context floods it with useless tags
A single page can be tens of thousands of tokens, far more than the model needs to answer the question
Agents can loop forever — page A links to B, B links to C, never stopping
URLs can be hallucinated — LLMs will invent plausible-sounding links that don't exist

Four problems, four engineering designs: HTML cleaning, Token Budget, Step Limit, URL error handling. This article assembles them into a working Web Agent.

Architecture

The overall structure is a standard LangGraph two-node graph:

User question
     │
     ▼
┌─────────────────────────────────────┐
│         agent_node                  │
│  SystemPrompt + messages → LLM      │
│  bound_llm.invoke(msgs)             │
└────────┬────────────────────────────┘
         │
    Has tool_calls?
         │
    ┌────┴─────┐
   Yes          No (or steps >= MAX_STEPS)
    │                │
    ▼                ▼
tools_node          END
web_search /
fetch_page
    │
    └──→ agent_node (loop)

State has only two fields:

class WState(TypedDict):
    messages: Annotated[list, add_messages]  # accumulated messages
    steps: int                                # steps consumed

steps is Web Agent-specific — standard agents don't need an explicit step counter, but Web Agents can jump between pages indefinitely. A hard limit is mandatory.
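To make the difference between the two fields concrete, here is a minimal plain-Python model of how a reducer-annotated field and a plain field might be merged after a node returns. The `apply_update` helper and the `append_messages` reducer are hypothetical stand-ins for illustration; LangGraph's real `add_messages` also handles message IDs, which this sketch ignores.

```python
from typing import Any, Callable

def append_messages(existing: list, new: list) -> list:
    """Simplified stand-in for a list reducer: concatenate, never overwrite."""
    return existing + new

# Fields listed here are merged with a reducer; all other fields are overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": append_messages}

def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with a node's partial update merged in."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value
    return merged

state = {"messages": ["user: question"], "steps": 0}
# An agent step returns one new message and the incremented counter.
state = apply_update(state, {"messages": ["ai: tool call"], "steps": state["steps"] + 1})
# A tools step returns only messages, so steps is left untouched.
state = apply_update(state, {"messages": ["tool: result"]})
print(state)
```

In this model the messages list grows with every node, while steps changes only when a node explicitly returns it, which mirrors how agent_node and tools_node behave in this article.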

Two Tools

web_search: DuckDuckGo Search

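import requests
from bs4 import BeautifulSoup
from langchain_core.tools import tool

# HEADERS is assumed to be a module-level dict of request headers (e.g. a browser User-Agent).
@tool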
def web_search(query: str) -> str:
    """
    Search the web with DuckDuckGo.
    Returns up to 5 results, each with title, snippet, and URL.
    Use the URLs from results to call fetch_page — never invent URLs.
    """
    try:
        resp = requests.get(
            "https://html.duckduckgo.com/html/",
            params={"q": query},
            headers=HEADERS,
            timeout=12,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        results = []
        for i, block in enumerate(soup.select(".result"), 1):
            if i > 5:
                break
            title   = (block.select_one(".result__title")   or soup.new_tag("x")).get_text(strip=True)
            snippet = (block.select_one(".result__snippet") or soup.new_tag("x")).get_text(strip=True)
            url_raw = (block.select_one(".result__url")     or soup.new_tag("x")).get_text(strip=True)
            url = f"https://{url_raw}" if url_raw and not url_raw.startswith("http") else url_raw
            results.append(f"{i}. {title}\n   {snippet}\n   URL: {url}")
        return "\n\n".join(results) if results else "No results found."
    except Exception as exc:
        return f"Search error: {exc}"

This tool uses DuckDuckGo's HTML interface, so no API key is required. It selects elements with the .result CSS class, extracts the title, snippet, and URL from each, and returns them to the LLM as numbered plain text.

There's a critical instruction in the tool description: Use the URLs from results to call fetch_page — never invent URLs. This is the first line of defense against URL hallucination — instructing the model at the Prompt layer where valid URLs come from.
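The prompt-level rule could be paired with a code-level check. The sketch below is not part of the original demo: it assumes search output uses the `URL: ...` line format that web_search produces above, and the helper names `extract_urls` and `is_known_url` are hypothetical.

```python
import re

# Matches the "URL: https://..." lines in web_search's result format.
URL_LINE = re.compile(r"^[ \t]*URL:[ \t]*(\S+)[ \t]*$", re.MULTILINE)

def extract_urls(search_output: str) -> set[str]:
    """Collect every URL that appeared in a web_search result string."""
    return set(URL_LINE.findall(search_output))

def is_known_url(url: str, seen: set[str]) -> bool:
    """True only if the URL was returned by an earlier search (trailing slash ignored)."""
    return url.rstrip("/") in {u.rstrip("/") for u in seen}

sample = (
    "1. LangGraph\n   Build stateful agents\n   URL: https://pypi.org/project/langgraph/\n\n"
    "2. Docs\n   Reference\n   URL: https://example.com/docs"
)
seen = extract_urls(sample)
print(is_known_url("https://pypi.org/project/langgraph", seen))
print(is_known_url("https://totally-made-up-domain-xyz99999.org/docs", seen))
```

A fetch tool could consult such a set and return a safe error string for unknown URLs, at the cost of also blocking legitimate links the model found inside fetched pages.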

fetch_page: Page Fetching + Cleaning

@tool
def fetch_page(url: str) -> str:
    """
    Fetch a web page and return its cleaned text (truncated to token budget).
    Only call with real URLs obtained from web_search results.
    """
    try:
        resp = requests.get(url, headers=HEADERS, timeout=12)
        resp.raise_for_status()
        full_text = clean_html(resp.text)
        orig_tokens = count_tokens(full_text)
        displayed = truncate_to_budget(full_text)
        shown_tokens = min(orig_tokens, PAGE_TOKEN_BUDGET)
        return (
            f"[URL: {url}]\n"
            f"[Size: {orig_tokens} tokens → showing {shown_tokens} tokens "
            f"(budget={PAGE_TOKEN_BUDGET})]\n\n"
            f"{displayed}"
        )
    except requests.HTTPError as exc:
        return f"HTTP {exc.response.status_code} — could not fetch {url}"
    except requests.ConnectionError:
        return f"Connection error — {url} may not exist or be unreachable"
    except Exception as exc:
        return f"Error fetching {url}: {type(exc).__name__}: {exc}"

Three steps:

clean_html: BeautifulSoup removes script/style/nav/footer, returns plain text
truncate_to_budget: truncates anything beyond the Token Budget
Error classification: HTTP errors, connection errors, and other exceptions each return different safe strings

Note that requests.HTTPError and requests.ConnectionError represent two distinct failure scenarios: the former means the server responded (4xx/5xx), the latter means the connection itself failed (domain doesn't exist, network unreachable).
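clean_html itself is not listed in this article. As a rough idea of what such a step involves, here is a sketch that uses only the standard library's `html.parser` in place of BeautifulSoup. The class name `_TextExtractor` and the exact set of skipped tags are assumptions based on the description above, not the demo's actual code.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer"}

class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def clean_html(html: str) -> str:
    """Return visible text, one text node per line, without boilerplate tags."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav><h1>LangGraph</h1>"
    "<p>Build <b>stateful</b> agents.</p><script>track()</script>"
    "<footer>Copyright</footer></body></html>"
)
print(clean_html(page))
```

Only the heading and paragraph text survive; the style, nav, script, and footer content is dropped. This simple version puts each text node on its own line, so inline markup such as `<b>` splits a sentence across lines.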

Three Engineering Guards

Guard 1: URL Error Handling

Testing a completely nonexistent domain:

fetch_page(https://totally-made-up-domain-xyz99999.org/docs/n...)
→ Connection error — https://totally-made-up-domain-xyz99999.org/docs/nonexistent may not exist or be unreachable

No crash, no exception propagation — a safe error string is returned. The LLM receives this string and can choose to try a different URL or a different search query.

This is a key guard design principle: errors are return values, not exceptions. Tool call failures shouldn't interrupt the entire Agent execution; instead, let the LLM adapt based on the error information.
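The same principle can be written once and reused. The decorator below is a minimal sketch, not code from the demo; `errors_as_strings` and the toy `read_number` tool are hypothetical names chosen for illustration.

```python
import functools
from typing import Callable

def errors_as_strings(func: Callable[..., str]) -> Callable[..., str]:
    """Wrap a tool so any exception becomes a short, model-readable string."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # deliberately broad: the agent loop must survive
            return f"Error in {func.__name__}: {type(exc).__name__}: {exc}"
    return wrapper

@errors_as_strings
def read_number(raw: str) -> str:
    """Toy tool: parse an integer and echo it back."""
    return f"Parsed {int(raw)}"

print(read_number("42"))
print(read_number("forty-two"))
```

The second call returns a string starting with "Error in read_number: ValueError" instead of raising, so the caller always gets text it can pass back to the model. fetch_page goes one step further by giving HTTP and connection failures their own messages.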

Guard 2: Token Budget Truncation

Testing the langgraph page on PyPI:

fetch_page(pypi.org/project/langgraph/)
→ [Size: 4576 tokens → showing 800 tokens (budget=800)]

Original page: 4,576 tokens. After truncation: 800 tokens. That's an 82% reduction in context usage.

The truncation implementation is simple:

PAGE_TOKEN_BUDGET = 800   # max tokens of page text sent to LLM per fetch

def count_tokens(text: str) -> int:
    """Rough estimate: ~3 chars per token for English/Chinese mix."""
    return max(1, len(text) // 3)

def truncate_to_budget(text: str, budget: int = PAGE_TOKEN_BUDGET) -> str:
    if count_tokens(text) <= budget:
        return text
    cutoff = budget * 3
    return text[:cutoff] + f"\n\n[... content truncated to ~{budget}-token budget ...]"

count_tokens uses a rough estimate (3 chars ≈ 1 token), not a precise tokenizer. For truncation purposes, speed matters more than precision.
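The arithmetic behind the PyPI example can be checked with a synthetic page. The sketch below repeats the two helpers so it runs on its own; the `CHARS_PER_TOKEN` constant and the all-"x" page are illustrative assumptions, not part of the demo.

```python
PAGE_TOKEN_BUDGET = 800
CHARS_PER_TOKEN = 3  # same rough ratio as count_tokens above

def count_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def truncate_to_budget(text: str, budget: int = PAGE_TOKEN_BUDGET) -> str:
    if count_tokens(text) <= budget:
        return text
    cutoff = budget * CHARS_PER_TOKEN
    return text[:cutoff] + f"\n\n[... content truncated to ~{budget}-token budget ...]"

# Synthetic stand-in for a page that estimates to 4,576 tokens.
page = "x" * (4576 * CHARS_PER_TOKEN)
kept = truncate_to_budget(page)
body_chars = PAGE_TOKEN_BUDGET * CHARS_PER_TOKEN
reduction = round(100 * (1 - PAGE_TOKEN_BUDGET / count_tokens(page)), 1)

print(count_tokens(page))      # estimated size before truncation
print(body_chars)              # characters of page text kept
print(len(kept) - body_chars)  # extra characters added by the truncation notice
print(reduction)               # percent of estimated tokens dropped
```

An 800-token budget keeps the first 2,400 characters of page text, and the reduction works out to 82.5%. The appended notice adds a few tokens on top of the budget, which is why the figure is an estimate rather than an exact bound.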

Guard 3: Step Limit

MAX_STEPS = 8

def router(state: WState) -> str:
    if state["steps"] >= MAX_STEPS:
        return END
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END

state["steps"] is incremented in every agent_node execution:

def agent_node(state: WState) -> dict:
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
    response = bound_llm.invoke(msgs)
    return {"messages": [response], "steps": state["steps"] + 1}

The router checks step count before checking tool_calls. Even if the LLM wants to keep calling tools, when the step limit is reached, execution terminates. This is a hard boundary against infinite loops.
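The ordering of the two checks can be exercised without LangGraph. The following is a plain-Python simulation under stated assumptions: `fake_agent` is a hypothetical stand-in for a model that requests a tool on every turn, and the `END` string only imitates LangGraph's constant.

```python
MAX_STEPS = 8
END = "__end__"  # stand-in for LangGraph's END constant

def fake_agent(state: dict) -> dict:
    """A 'model' that never finishes: it requests a tool call on every turn."""
    return {"wants_tool": True, "steps": state["steps"] + 1}

def router(state: dict) -> str:
    # The step check comes first, so a pending tool call cannot override it.
    if state["steps"] >= MAX_STEPS:
        return END
    return "tools" if state["wants_tool"] else END

state = {"wants_tool": False, "steps": 0}
tool_runs = 0
while True:
    state.update(fake_agent(state))
    if router(state) == END:
        break
    tool_runs += 1  # the tools node would execute here

print(state["steps"], tool_runs)
```

Even though the simulated model asks for a tool every time, the loop ends after the eighth model call and the final tool request is never executed.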

Step count is initialized at invocation time:

state = graph.invoke(
    {"messages": [HumanMessage(content=query)], "steps": 0},
    config={"recursion_limit": MAX_STEPS * 3},
)

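recursion_limit is LangGraph's built-in protection: it caps the number of super-steps a single invocation may run and raises GraphRecursionError when the cap is exceeded. steps is the application-level guard, which ends the run cleanly through the router. The two work independently. With MAX_STEPS = 8 the loop needs roughly 15 node executions (8 agent, 7 tools), so a limit of 24 leaves headroom and should trip only if the step counter itself fails.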

Run Results

======================================================================
Web Agent Demo
Model: glm-4-flash  |  Token budget/page: 800  |  Max steps: 8
======================================================================

=== Part 3: Engineering Guards ===

──────────────────────────────────────────────────────────────────────
[Guard 1] URL error handling (bad / hallucinated URL)
  fetch_page(https://totally-made-up-domain-xyz99999.org/docs/n...)
  → Connection error — https://totally-made-up-domain-xyz99999.org/docs/nonexistent may not exist or be unreachable

──────────────────────────────────────────────────────────────────────
[Guard 2] Token budget enforcement (budget=800 tokens/page)
  fetch_page(pypi.org/project/langgraph/)
  → [Size: 4576 tokens → showing 800 tokens (budget=800)]

──────────────────────────────────────────────────────────────────────
[Guard 3] Step limit (MAX_STEPS=8) — agent cannot loop forever
  Graph router returns END when state['steps'] >= 8
  Even if tool_calls remain, execution stops.

All three guards worked as expected.

The research sections (Parts 1 & 2) hit DuckDuckGo rate limiting — searches returned empty results, and the model correctly reported failure instead of hallucinating answers. This is itself a sign the guards are effective: the agent didn't loop on empty results, it reported the failure clearly to the user.

DuckDuckGo's Limitations

The DuckDuckGo HTML interface requires no API key, but it's unreliable for production:

Frequent requests get rate-limited or return empty results
HTML structure can change anytime, breaking CSS selectors
No rate limiting control, easy to trigger blocks

Production alternatives:

Option	Characteristics
Tavily API	Designed for LLM agents, returns structured results
SerpAPI	Multi-engine, stable, paid
Brave Search API	Generous free tier, independent index
Jina Reader	Specialized in page-to-text conversion, high quality

Switching only requires replacing the web_search tool implementation — the agent graph structure stays the same.
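One way to keep that swap small is to separate the backend call from the result formatting. The sketch below is an assumption about how this could be organized, not the demo's code: `SearchHit`, `format_hits`, `make_search_tool`, and `stub_backend` are hypothetical names, and no real search API is called.

```python
from typing import Callable, TypedDict

class SearchHit(TypedDict):
    title: str
    snippet: str
    url: str

# Any backend is just a function from a query to a list of hits.
SearchBackend = Callable[[str], list[SearchHit]]

def format_hits(hits: list[SearchHit], limit: int = 5) -> str:
    """Render hits in the same numbered text layout web_search returns."""
    lines = [
        f"{i}. {h['title']}\n   {h['snippet']}\n   URL: {h['url']}"
        for i, h in enumerate(hits[:limit], 1)
    ]
    return "\n\n".join(lines) if lines else "No results found."

def make_search_tool(backend: SearchBackend) -> Callable[[str], str]:
    """Build a search function whose output format is backend-independent."""
    def search(query: str) -> str:
        try:
            return format_hits(backend(query))
        except Exception as exc:
            return f"Search error: {exc}"
    return search

def stub_backend(query: str) -> list[SearchHit]:
    """Hypothetical backend used only for this demo; returns canned data."""
    return [{
        "title": "LangGraph",
        "snippet": f"Results for {query}",
        "url": "https://example.com/langgraph",
    }]

search = make_search_tool(stub_backend)
print(search("langgraph latest version"))
```

Because the model only ever sees the formatted text, replacing `stub_backend` with a wrapper around a real provider would leave the prompt, the tool description, and the graph unchanged.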

Complete Graph Code

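from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages

# llm is assumed to be an already-configured chat model (the demo uses glm-4-flash);
# MAX_STEPS, web_search, and fetch_page are defined in the sections above.
TOOLS   = [web_search, fetch_page]
TOOL_MAP = {t.name: t for t in TOOLS}
bound_llm = llm.bind_tools(TOOLS)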

SYSTEM_PROMPT = f"""You are a web research agent. Answer the user's question by browsing the web.

Workflow:
1. Call web_search to find relevant pages.
2. Call fetch_page on promising URLs to read content.
3. If you find the answer, give a clear, concise final response.
4. If a page doesn't help, try a different search query.

Strict rules:
- Only use URLs from web_search results — never invent or guess URLs.
- If fetch_page returns an error, try a different URL or search query.
- You have at most {MAX_STEPS} total steps. Be efficient.
- Once you have enough information, stop browsing and answer directly."""


class WState(TypedDict):
    messages: Annotated[list, add_messages]
    steps: int


def agent_node(state: WState) -> dict:
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
    response = bound_llm.invoke(msgs)
    return {"messages": [response], "steps": state["steps"] + 1}


def tools_node(state: WState) -> dict:
    last = state["messages"][-1]
    results = []
    for tc in last.tool_calls:
        output = TOOL_MAP[tc["name"]].invoke(tc["args"])
        results.append(ToolMessage(content=str(output), tool_call_id=tc["id"]))
    return {"messages": results}


def router(state: WState) -> str:
    if state["steps"] >= MAX_STEPS:
        return END
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END


def build_graph():
    g = StateGraph(WState)
    g.add_node("agent", agent_node)
    g.add_node("tools", tools_node)
    g.set_entry_point("agent")
    g.add_conditional_edges("agent", router, {"tools": "tools", END: END})
    g.add_edge("tools", "agent")
    return g.compile()

The compiled graph is assigned to a module-level graph variable. run_research calls graph.invoke() directly.

Design Checklist

Tool design

[ ] HTML cleaning: remove script/style/nav/footer, keep only body text
[ ] Error classification: HTTP error / connection error / other — each returns a safe string
[ ] Tool description includes URL source rule: never invent URLs

Engineering guards

[ ] Token Budget: truncate page text to a reasonable limit (800–2000 tokens)
[ ] Step Limit: router checks step count before checking tool_calls
[ ] Two-layer protection: application-level steps + LangGraph recursion_limit

State design

[ ] messages: Annotated[list, add_messages] — must use reducer, otherwise messages don't accumulate
[ ] steps: int — Web Agent-specific field; standard agents can omit this

Production hardening

[ ] Replace search tool with a reliable API-key-based solution (Tavily/SerpAPI)
[ ] Set User-Agent to a real browser UA to avoid being rejected
[ ] Request timeouts: timeout=12 for both search and page fetching

Summary

Three conclusions:

Guards are independent of content quality: Tool failure doesn't equal Agent failure. Errors as return values let the LLM adapt — execution continues rather than crashing
Token Budget is non-negotiable: The PyPI page in the test was estimated at 4,576 tokens; truncating to 800 saves about 82% of context. Without a budget, an agent that browses several pages of that size would fill its context within a few steps
Step Limit is a hard boundary: steps >= MAX_STEPS → END lives in the router, not in the Prompt. No matter how much the LLM wants to continue, the counter stops it. Don't trust "self-discipline" for safety-critical behavior

A Web Agent's essence: give the LLM controlled eyes on the internet, not unlimited network access.

References

LangGraph StateGraph documentation
BeautifulSoup HTML parsing documentation
Full demo code: agent-22-web-agent
