Last month, I needed to pull together a competitive landscape report. Nothing exotic — pricing trends, feature comparisons, market positioning across about 30 to 40 industry sites and competitor pages.
Sounds manageable, right? Here's what actually happened.
Half the sites loaded data dynamically — you scroll down, content appears, but view-source shows empty divs. A few required creating accounts just to see pricing. Others buried the good stuff behind three layers of filters and dropdown menus. And two sites hit me with CAPTCHAs the moment I tried to load them in an automated way.
I did what I always do: opened 40 Chrome tabs, spent two days copying and pasting into a spreadsheet, and ended up with a document that was already going stale by the time I finished formatting it.
That experience sent me down a rabbit hole. What I found changed how I think about web automation entirely.
Why Traditional Scraping Doesn't Cut It Anymore
Look, I'm not here to trash Beautiful Soup or Scrapy. If your target is a handful of static HTML pages that rarely change, those tools still work beautifully. No shame in using them. I still do for simple stuff.
But the web I was dealing with, the 2026 web, is a fundamentally different animal.
Most modern sites render content with JavaScript. Your scraper sends a request, gets back a skeleton of empty <div> tags, and has no idea that the actual data loads three seconds later through an API call triggered by a React component. Single-page apps mean the URL stays the same while the content changes entirely. Data hides behind interactive elements: filters, dropdowns, "load more" buttons, infinite scroll. And anti-bot systems have gotten genuinely sophisticated — they can tell the difference between a real browser with a real human behind it and a Python script pretending to be one.
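To make the empty-skeleton problem concrete, here's a minimal sketch using only the Python standard library. The HTML is a made-up stand-in for what a JS-heavy site's server actually sends back before any script runs:

```python
from html.parser import HTMLParser

# What a static scraper receives from a JS-rendered site: the markup is
# valid, but the container the data lives in is empty until app.js runs.
SERVER_HTML = """
<html><body>
  <div id="pricing"></div>
  <script src="/app.js"></script>
</body></html>
"""

class TextCollector(HTMLParser):
    """Collect all visible text nodes, skipping whitespace."""
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

parser = TextCollector()
parser.feed(SERVER_HTML)
print(parser.text)  # [] -- the raw HTML contains no pricing data at all
```

The actual numbers arrive later, through an API call the front-end fires after load, which a plain HTTP fetch never sees.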
Traditional scraping is like using a paper map in a city that redraws its streets every week. The map was accurate when you printed it. The city just didn't care.
The problem isn't that your code is bad. The problem is that the web wasn't designed to be scraped. It was designed to be used — by humans, in browsers, with clicks and scrolls and waiting for things to load. And every year it gets better at enforcing that assumption.
So I started looking at what else is out there. The landscape has changed a lot more than I expected.
I Tried a Bunch of Tools. Here's What Actually Happened.
Instead of just reading comparison articles, I spent a weekend actually testing different approaches against my real market research task. Here's the honest breakdown.
Search APIs (Exa, Tavily)
I tried these first because they're the fastest path to structured data. Give them a query, get back JSON with titles, snippets, and URLs. Some of them, like Exa, maintain their own semantic index that's optimized for LLM consumption. Tavily focuses on research-grade search results. Both are impressive for what they do.
But I hit a wall pretty quickly. They surface what's already been indexed, and the data I needed wasn't sitting on pages that Google had crawled. It was behind login walls, inside interactive dashboards, and loaded dynamically after user interaction.
My takeaway: search APIs are perfect for discovery — figuring out WHICH pages have the data you want. But they can't go get it for you. It's like asking a librarian which shelf a book is on. Helpful, but they won't pull the book down and read it for you.
Content Extraction (Firecrawl)
Firecrawl was the logical next step. You give it a URL, it renders the JavaScript, and returns clean, structured content. It bridges the gap between "finding a page" and "reading a page."
I liked it for known URLs where I just needed to pull the rendered content. But when I needed to interact with the page — click through filters, paginate through results, fill out a form to reveal pricing — it wasn't built for that. It's still reading, not doing.
Browser Agents (Browser Use, OpenAI Operator)
This is where things got genuinely interesting. These are AI systems that can actually navigate pages — clicking buttons, filling forms, interpreting what's on screen and deciding what to do next.
Browser Use is open source and flexible. You can choose your own LLM, and their cloud offering means you're not tying up your own machine. I was impressed by how well the AI reasoned about page layouts.
OpenAI Operator has a polished interface and feels smooth for single-task workflows. Good for "go to this site and do this thing."
But here's where I ran into friction: I needed to run the same basic task across 30+ sites. Not one site at a time. All of them, ideally in parallel. Orchestrating that myself — managing sessions, handling failures, collecting structured output — started feeling like I was building infrastructure rather than doing research.
Remote Web Agent Platforms (TinyFish, Browserbase)
This is the category I didn't know existed until I went looking.
Browserbase is impressive cloud browser infrastructure. They spin up headless browsers at scale and handle the hard parts — proxies, anti-detection, session management. But it's fundamentally an infrastructure layer. You connect your own automation code to their browsers. Powerful for teams that want full control, but you're still writing the agent logic yourself.
Then I tried TinyFish. You give it a URL and a goal in plain English. It gives you structured JSON back. No selectors. No Playwright scripts. No browser management. I didn't have to think about HOW to navigate each site — just WHAT I wanted from it.
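To show what I mean by "no selectors," here's an illustrative sketch. The endpoint, auth, and field names are my own invention, not TinyFish's documented API; what matters is the shape of the interaction — a URL plus a plain-English goal in, structured JSON out:

```python
import json

# Hypothetical task payload -- note what's absent: no CSS selectors,
# no XPath, no explicit waits, no browser session management.
task = {
    "url": "https://competitor.example.com/pricing",
    "goal": (
        "Open the pricing page, reveal the Enterprise tier if it is "
        "behind a toggle, and return the plan name, annual price, and "
        "currency as JSON."
    ),
}

# The kind of structured response such a platform hands back
# (again, purely illustrative):
result = json.loads('{"plan": "Enterprise", "price": 499, "currency": "USD"}')
print(result["price"])  # 499
```

The contract is the goal string, not the page structure — which is exactly why a site redesign doesn't invalidate the request.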
That felt fundamentally different. Not a better scraper. Not a fancier browser automation. A different way of thinking about the problem.
Now, I want to be clear about something: every tool I tried has its sweet spot. Exa is fantastic for semantic search. Firecrawl is great for clean content extraction from known URLs. Browser Use gives you maximum flexibility with open-source transparency. Browserbase is the right call if your team wants to build custom agent infrastructure.
I'm not saying TinyFish is the only answer. I'm saying that the experience of "describe what you want, get it back" was the thing that made me rethink the whole approach. It was the moment I realized web automation might be going through the same kind of shift that happened when we went from writing assembly to writing in high-level languages. The abstraction layer matters.
The Real Shift: From "Automate Clicks" to "Describe Goals"
Here's the idea that stuck with me after trying all of this.
The old way to automate the web was procedural. You'd write something like: "Navigate to this URL. Find the element with class price-card. Click the dropdown labeled 'Enterprise'. Wait 2 seconds. Extract the text from the element with ID annual-price."
Every single step is hard-coded. Change the class name? It breaks. Add a cookie banner that overlays the dropdown? It breaks. Redesign the page? Everything breaks.
The new approach is goal-oriented. You say: "Go to this company's pricing page. Find the enterprise annual price. Return it as JSON."
The agent figures out the how. You only specify the what.
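The fragility argument can be sketched in a few lines of toy Python. No real browser or library here — the "pages" are stand-in dicts, and the two functions caricature the two approaches:

```python
# Toy illustration of why selector-based extraction breaks on a redesign
# while goal-based extraction survives it. Not any real library's API.

OLD_PAGE = [
    {"class": "price-card", "label": "Enterprise", "text": "$499/yr"},
    {"class": "price-card", "label": "Starter", "text": "$9/yr"},
]

# After a redesign: identical content, different class names.
NEW_PAGE = [
    {"class": "pricing-tile", "label": "Enterprise", "text": "$499/yr"},
    {"class": "pricing-tile", "label": "Starter", "text": "$9/yr"},
]

def procedural_extract(page):
    """Old way: hard-coded selector. Breaks when the markup changes."""
    for node in page:
        if node["class"] == "price-card" and node["label"] == "Enterprise":
            return node["text"]
    return None  # selector no longer matches anything

def goal_based_extract(page, goal_label):
    """Goal-oriented: match on meaning (the label), not on markup."""
    for node in page:
        if node["label"].lower() == goal_label.lower():
            return node["text"]
    return None

print(procedural_extract(OLD_PAGE))                # $499/yr
print(procedural_extract(NEW_PAGE))                # None -- the script broke
print(goal_based_extract(NEW_PAGE, "Enterprise"))  # $499/yr -- still works
```

Real agents are doing something far richer than a label match, of course, but the failure mode being avoided is exactly this one.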
This might sound like a small difference. It isn't. It's the difference between giving a taxi driver turn-by-turn directions versus giving them an address. If the road is closed, turn-by-turn fails — the driver is stuck at the non-existent left turn. But if you gave them the destination, they just find another route. The goal hasn't changed. Only the path.
And it changes who can automate web tasks. You don't need to understand the DOM, write XPath expressions, or manage headless browser sessions. You need to be able to describe what you want. For someone like me — technical enough to understand what's happening under the hood, but not wanting to maintain Playwright infrastructure for a market research project — this shift feels overdue.
Why I Kept Digging Into This
After my initial test, I got curious about TinyFish's thinking. Not as a customer evaluating a purchase, more like a nerd who found something interesting and wanted to understand why it worked differently.
Their blog has this argument that I keep coming back to: search engines index only a small fraction of online content. The other 90% or more sits behind logins, forms, dynamic interfaces, and interactive workflows. It's not hidden in the sense that it's secret. It's hidden because you have to interact with a website to reveal it.
That matched my experience exactly. The market data I needed wasn't hard to find. I could see it in my browser just fine. It was hard to extract programmatically without writing a fragile script for every single site.
I've only scratched the surface of what their platform can do, but the way they frame the problem (that the web needs to be operated on, not just searched or scraped) resonates with what I experienced firsthand.
Where This Is All Going: WebMCP
One more thing worth mentioning, because it signals where this whole space is heading.
Google and Microsoft are jointly backing a proposed W3C standard called WebMCP (Web Model Context Protocol) that would let websites publish structured tools directly for AI agents. Instead of agents having to visually interpret a page — taking screenshots, parsing the DOM, guessing where to click — a site could expose a function like get_pricing({ plan: "enterprise" }) that agents call directly through a new browser API called navigator.modelContext.
(Quick note for the confused: WebMCP is not the same thing as Anthropic's MCP. Anthropic's MCP connects AI agents to backend services. WebMCP is a browser-native standard that connects agents to web page interfaces. Different protocols, complementary purposes. The naming is confusing — I know.)
Think of it this way: instead of making a delivery robot squint at a restaurant's chalkboard menu from across the room, you hand it a structured list of dishes and prices. Same information. Completely different efficiency.
But this is still extremely early. Most sites won't support it for years. And the sites that most need automation — legacy portals, government systems, enterprise SaaS with no API — are precisely the ones least likely to adopt it quickly. Still, it signals a future where the web isn't just built for human eyes. It's becoming readable by agents too.
What I'm Doing Next
The web automation landscape has genuinely shifted. "Scraping" as I knew it — writing brittle selector-based scripts — is being replaced by something more like "describing what you want and getting it back."
You don't have to throw out your working scripts overnight. Start with the workflows that break most often. The ones that cost you a day of copy-pasting every month. Those are your candidates for a different approach.
I'm planning to take this further: pick a real market research workflow I currently handle manually, run it through a proper hands-on test, and write up exactly what happens. The good, the bad, and the weird. Stay tuned for that.
If you've been fighting with web scrapers, I'd love to hear what's worked for you — and what hasn't. Drop a comment below.
Further reading if you want to go deeper:
- Why 90% of the Internet Is Invisible (And Why AI Hasn't Fixed It) — TinyFish's blog on the invisible web problem
- Firecrawl's Guide to Browser Agents — a good overview of the browser agent landscape from another player in the space
- Browser Use Documentation — if you want to explore the open-source agent approach



