DEV Community

lazyasscoder

You've Never Seen 90% of the Internet. Neither Has Google.

Google indexes billions of web pages. That sounds like a lot until you realize it might be less than 10% of the total web. The rest, the overwhelming majority of online content, is invisible to every search engine that exists.

Not hidden on purpose. Not encrypted on the dark web. Just... inaccessible to anything that crawls the web the way search engines do.

I'd heard the "deep web" statistic before. Most people have. But I always assumed it was mostly junk — expired pages, duplicate databases, internal server logs. It wasn't until I started doing real market research across dozens of industry sites that I realized: the data I needed most was almost always in that invisible 90%. And once I saw it, I couldn't unsee it.

*The iceberg of web data — what search engines actually index*

First, Let's Clear Up the Confusion

Whenever someone says "90% of the internet is hidden," half the room immediately thinks about the dark web — Tor, anonymous marketplaces, stolen credentials. That's not what we're talking about.

The web has three layers, and people mix them up constantly.

Surface web is everything Google can index. Public pages, blog posts, Wikipedia articles, news sites. This is where you spend most of your browsing time. Estimates vary — typically cited as 4-10% of all web content depending on methodology and what you count — but the point is consistent: it's a small fraction.

Deep web is everything behind a barrier that prevents search engine crawlers from accessing it. The flight prices that only appear after you enter dates and destinations. The supplier portal that requires a login. None of this is sinister — it's just not publicly crawlable. This makes up the vast majority of the web, around 90-96%.

Dark web is a tiny subset of the deep web that requires specialized software like Tor to access. According to Britannica, it's roughly 0.01% of the deep web — far smaller than most people imagine. It gets all the headlines, but in terms of size it's a rounding error.

The deep web is boring. That's the point. It's boring, and enormous, and full of exactly the kind of data that businesses desperately need.

What Actually Lives Behind the Wall?

Here's what makes the invisible web interesting: it's not a wasteland of forgotten pages. It's where the most valuable, most current, most actionable data on the internet lives.

Dynamic pricing and inventory. Every airline, hotel booking system, and e-commerce platform generates pricing based on inputs — dates, locations, user profiles. The price you see is computed on the fly. It doesn't exist as a static page for Google to crawl. If you want competitor pricing data, you have to interact with the site to generate it.

Authenticated portals. Government databases, insurance claim portals, enterprise SaaS dashboards, supplier catalogs — all behind login walls. The data is there, it's just not public. A procurement team that needs to compare pricing across 200 supplier portals can't Google their way to an answer. Each portal requires authentication, has a different interface, a different workflow.

Interactive search results. LinkedIn People Search. Zillow's filtered listings. Patent databases. Academic paper repositories. The results only exist after you type a query and apply filters. Before that interaction, the data is invisible to crawlers.

Form-gated content. Reports behind download forms. Tools that generate output based on user input. Calculators, configurators, quote generators. All invisible until a human (or something that acts like one) fills in the fields.

Single-page applications. Modern web apps built with React, Vue, or Angular often load a shell page and then fetch content dynamically. A search engine crawler that doesn't execute JavaScript sees an empty skeleton. The actual content only renders in a real browser environment.
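To make the single-page-app problem concrete, here's a minimal sketch. The shell HTML and the API payload below are invented for illustration, but the shape is typical: the static page a crawler downloads contains an empty mount point, and the actual content only exists in a JSON response that client-side JavaScript would fetch after load.

```python
from html.parser import HTMLParser
import json

# What a crawler downloads from a typical SPA: a near-empty shell.
# The real content arrives later via a client-side fetch() call.
SHELL_HTML = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

# What the browser's JavaScript would fetch after load (simulated here).
API_RESPONSE = json.dumps({"listings": [{"title": "2BR apartment", "price": 2400}]})

class TextExtractor(HTMLParser):
    """Naive 'crawler view': collect visible text from static HTML."""
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

extractor = TextExtractor()
extractor.feed(SHELL_HTML)

print("Crawler sees:", extractor.text)            # → [] (empty shell)
print("Browser sees:", json.loads(API_RESPONSE))  # the actual listings
```

A crawler that doesn't execute JavaScript indexes the empty list on the left. Everything of value is on the right, and it never had a URL of its own.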

This isn't obscure stuff. This is where most business-critical data lives today.

Why Search Engines Can't Fix This

Why can't Google just get better at this?

The answer is architectural. Search engines are built on a specific model: send a crawler to a URL, download what's there, index it, rank it. That model assumes content is static, public, and available at a fixed address. It's an incredibly powerful model for the surface web. But it fundamentally cannot handle content that requires interaction to exist.

Google's crawler can't log into your competitor's supplier portal. It can't fill out a form with your specific parameters to generate a custom quote. It can't scroll through an infinite-loading feed, click "next page" 47 times, and filter results by date range. It can't provide credentials, handle two-factor authentication, or navigate a multi-step checkout flow.

This isn't a limitation that gets fixed by better crawling technology. It's a limitation of the crawling paradigm itself. Crawling is about reading pages. The invisible web requires doing things on pages.
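The crawl paradigm described above can be sketched in a few lines. This toy version runs over an in-memory "web" (the URLs and pages are made up), but it shows the model's core assumption: content is static text reachable by following links, so anything that only renders after a form submission simply never enters the index.

```python
from collections import deque

# A toy 'web': URL -> (text, outgoing links). Static, public, fixed
# addresses -- exactly the assumptions the crawl model bakes in.
PAGES = {
    "/home":    ("Welcome", ["/about", "/pricing"]),
    "/about":   ("About us", ["/home"]),
    # /pricing only renders content after a form POST with dates, so a
    # crawler's plain GET finds nothing to index.
    "/pricing": ("", []),
}

def crawl(start):
    index, queue, seen = {}, deque([start]), {start}
    while queue:
        url = queue.popleft()
        text, links = PAGES.get(url, ("", []))
        if text:
            index[url] = text          # download + index
        for link in links:             # follow static hyperlinks
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

print(crawl("/home"))  # /pricing never makes it into the index
```

No amount of tuning this loop gets you the pricing page. The fix isn't a better queue or a smarter parser; it's an entirely different kind of actor, one that can fill in the form.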

Google knows this, by the way. There's a reason Google Hotels uses third-party web agents to aggregate hotel inventory from thousands of Japanese booking sites that their own crawlers can't reach. When the company that built web search can't access web data with search technology, that tells you something about the structural boundary.

*Why search can't reach the invisible web — static crawling vs dynamic interaction*

So What Can Reach It?

This is where the landscape gets interesting, and where I've been spending most of my time learning over the past few days.

"Agentic search" tools like Perplexity and Google's AI Overviews try to bridge the gap by synthesizing information from multiple sources. They're better than raw search for getting summarized answers, but they're still ultimately constrained by what's been indexed. They're smarter librarians, but the library hasn't gotten bigger.

Content extraction tools like Firecrawl go a step further — they can visit a URL, render JavaScript, and return clean content. This handles the single-page app problem. But they still can't interact with pages. If the data requires filling a form or clicking through filters, you're stuck.

Browser agents like Browser Use and OpenAI Operator are where things start to change. These are AI systems that actually navigate pages — clicking, typing, scrolling, filling forms, handling pop-ups. They can reach content that requires interaction. The limitation I've found is orchestration: when you need to run the same task across dozens or hundreds of sites in parallel, managing that yourself becomes its own infrastructure project.
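The orchestration problem looks roughly like this. The sketch below stubs out the per-site agent run (a real one would drive a browser for minutes, not sleep for milliseconds), but the scaffolding around it, bounded concurrency, per-site error isolation, and structured results, is the part you end up building yourself if a platform doesn't do it for you. All names here are illustrative.

```python
import asyncio

# 200 hypothetical supplier portals to check in one batch.
SITES = [f"https://supplier-{i}.example.com" for i in range(1, 201)]

async def run_agent_task(site: str) -> dict:
    """Stub for one browser-agent run: log in, navigate, extract a price.
    A real implementation would drive a browser session here."""
    await asyncio.sleep(0.01)  # stand-in for minutes of page interaction
    return {"site": site, "price": 100 + hash(site) % 50}

async def fan_out(sites, max_concurrent=20):
    sem = asyncio.Semaphore(max_concurrent)  # don't launch 200 browsers at once

    async def bounded(site):
        async with sem:
            try:
                return await run_agent_task(site)
            except Exception as exc:  # one broken portal mustn't kill the batch
                return {"site": site, "error": str(exc)}

    return await asyncio.gather(*(bounded(s) for s in sites))

results = asyncio.run(fan_out(SITES))
print(len(results), "sites checked")
```

And this is the easy version: no session persistence, no CAPTCHA handling, no retries, no proxy rotation. Each of those adds another layer of infrastructure you now own.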

Remote web agent platforms like TinyFish and Browserbase handle the orchestration part. Cloud-hosted browsers, parallel execution, structured output. I wrote about my experience testing several of these — the shift from "automate clicks" to "describe what you want" is real.

The framing I keep coming back to is the distinction between searching the web and operating on it. Search is about finding pages. Operating is about interacting with them — logging in, navigating workflows, extracting data from dynamic interfaces. These are fundamentally different activities, and they require fundamentally different tools. TinyFish's blog has some interesting writing on this if you want to go deeper.

The Economic Problem Nobody Talks About

Here's the angle that doesn't get enough attention: the invisible web isn't just a technical problem. It's an economic one.

Consider a procurement team that needs competitive pricing across 200 supplier portals. Each portal requires a login, has a unique interface, a different navigation flow. You could hire someone to manually check all 200. But the labor cost makes it prohibitive. So you check 5. Maybe 10. You make decisions based on incomplete data because complete data is economically inaccessible.
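A back-of-envelope version of that labor cost, with every number an assumption chosen for illustration rather than a figure from any real team:

```python
# Cost of one manual sweep across all portals. All inputs are
# illustrative assumptions, not measured figures.
portals = 200
minutes_per_portal = 20        # log in, navigate, record prices
loaded_hourly_cost = 40        # USD, fully loaded analyst cost

hours_per_sweep = portals * minutes_per_portal / 60
cost_per_sweep = hours_per_sweep * loaded_hourly_cost

print(f"{hours_per_sweep:.0f} hours, ${cost_per_sweep:,.0f} per sweep")
```

Under these assumptions a single sweep is roughly 67 hours of work, and pricing data goes stale fast, so one sweep is never enough. Checking 5 portals instead of 200 isn't laziness; it's the rational response to that cost curve.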

Or think about a pharmaceutical company matching patients to clinical trials. Eligibility criteria are scattered across thousands of fragmented research sites, each with its own search interface and data structure. No search engine indexes this. No API aggregates it.

Or an insurance company that needs to monitor prior authorization statuses across 50+ health plan portals — each one a different website, different login, different workflow to check status.

In all these cases, the data exists. It's not secret. But the cost of accessing it at scale, manually, is so high that most organizations simply don't. They make decisions with partial information and accept the inefficiency.

This is what makes the web agent category genuinely interesting to me — not as a cool technology demo, but as an economic unlock. When you can automate the interactive parts of the web at scale, you make data accessible that was previously too expensive to touch. That's not an incremental improvement. That's a category of decisions that was previously impossible to make.

One More Thing: WebMCP and the Future

There's a new W3C standard in early development called WebMCP that could eventually shrink the invisible web. The idea is that websites would publish structured tools that AI agents can call directly, instead of agents having to navigate the visual interface. I wrote a deeper explainer on how it works.

But here's the realistic take: WebMCP depends on website owners voluntarily adopting a new standard. The sites with the most valuable hidden data — legacy portals, government systems, enterprise SaaS — are the slowest to adopt anything. The invisible web will stay invisible for a long time. The question is who builds the bridge.

What This Means for Us

If you're building anything that depends on web data — competitive intelligence, market research, lead enrichment, pricing optimization — it's worth asking: how much of the data you need actually shows up in search results?

My guess is less than you think.

The surface web is the tip of the iceberg we've all been staring at. The real depth is underneath — behind logins, inside interactive interfaces, generated dynamically by forms and filters. It's not hidden in any dramatic sense. It's just waiting for something that can interact with it.

That's the gap web agents are filling. Not by indexing more pages, but by operating on the ones that already exist.
