Tony Wang

Posted on Jun 11 • Originally published at crawlora.net

How Paywalls Actually Work: The Engineering Behind Them

#webscraping #seo #webdev #tutorial

A paywall is one of the more interesting engineering problems on the web, because the publisher has to satisfy two goals that pull in opposite directions. It needs Google to index the article so people can find it and click through — which means a search crawler has to see the full text. But it also needs to withhold that same text from a logged-out reader so there's a reason to subscribe. Reconciling "show the bot everything" with "show the human almost nothing," without getting penalized for it, is the whole game. How a publisher resolves that tension decides whether its paywall is a bank vault or a velvet rope you can step around.

This guide explains the machinery from an engineer's point of view: the kinds of paywall, where the content actually lives, the structured-data contract that lets publishers serve crawlers and readers different things on purpose, and why some of these walls are trivial to read past while others are effectively sealed.

Key takeaways

Paywalls come in four flavors — hard, soft/freemium, metered, and dynamic — and each is enforced differently.
The single most important fact is where the content is hidden: client-side paywalls ship the full article to the browser and then hide it (so it is often still readable), while server-side paywalls never send it (so it effectively is not).
Publishers declare gated sections to Google with isAccessibleForFree JSON-LD and grant Googlebot full, IP-validated access — which is exactly why 'pretend to be Googlebot' sometimes works and is usually blocked.
Reading content behind a paywall is the highest-risk category of access (DMCA §1201, CFAA, terms of service). The defensible path is public data, official APIs, and the structured data publishers already expose.

What this guide is — and isn't

This is a technical explainer for engineers, SEOs, and publishers who want to understand the machinery. It is not a how-to for reading paid articles without paying. Bypassing a paywall to reach gated content is a real legal risk (covered below), and it is explicitly not what Crawlora is for — we build for public web data.

The four kinds of paywall

"Paywall" is a single word for several very different mechanisms. Knowing which one you're looking at tells you almost everything about how it behaves and how robust it is.

Type	What the reader gets	How it's enforced
Hard	Nothing without a subscription	The article body is withheld outright; you see a headline, a deck, and a subscribe prompt
Soft / freemium	Some articles free, some "premium"	A per-article flag decides whether the full body is served at all
Metered	N free articles per period	A counter (cookie, local storage, device fingerprint, or server-side account) tracks views and gates after the limit
Dynamic / propensity	Varies per visitor	A model scores how likely you are to subscribe and shows a harder or softer wall accordingly

Hard paywalls are the simplest and the strongest: the body never ships to a non-subscriber, so there's nothing to recover. The Financial Times and parts of the Wall Street Journal run close to this model. The tradeoff is reach — a hard wall sacrifices the casual reader and some SEO surface to protect revenue.

Soft/freemium walls flag certain articles as premium and leave the rest open. The decision is per-article, made on the server, so a "premium" piece behaves like a hard wall while a "free" piece is fully open.

Metered paywalls are the most common on large news sites because they thread the needle: a handful of free articles per month drive subscriptions, social sharing, and search traffic, while heavy readers eventually hit the wall. The catch is that metering has to count, and where it counts is the whole story (more on that below).

Dynamic / propensity paywalls are the modern evolution. Instead of a fixed meter, a model looks at signals — how often you visit, what you read, where you came from, whether you look like a likely subscriber — and decides in real time whether to show you a hard wall, a soft nudge, or nothing at all. Two readers can hit the same URL and see completely different walls. That variability is deliberate: it makes the wall harder to reason about and harder to defeat with a single static trick.
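
To make the idea concrete, here is a toy sketch of a propensity decision. The signal names, weights, and thresholds are all invented for illustration; real systems use trained models, and nothing here reflects any specific publisher's logic.

```python
from dataclasses import dataclass


@dataclass
class VisitorSignals:
    # All fields are hypothetical signals chosen for illustration.
    visits_last_30_days: int
    articles_read_last_30_days: int
    arrived_from_search: bool


def propensity_score(s: VisitorSignals) -> float:
    """Toy linear score in [0, 1]; the weights are made up, not learned."""
    score = 0.02 * min(s.visits_last_30_days, 20)          # contributes up to 0.40
    score += 0.03 * min(s.articles_read_last_30_days, 15)  # contributes up to 0.45
    score += 0.0 if s.arrived_from_search else 0.15        # assume direct visitors look more loyal
    return min(score, 1.0)


def choose_wall(s: VisitorSignals) -> str:
    """Map the score to a wall: likelier subscribers get the harder gate."""
    score = propensity_score(s)
    if score >= 0.6:
        return "hard"
    if score >= 0.3:
        return "soft"
    return "none"
```

Two visitors requesting the same URL get different answers from choose_wall, which is the property the paragraph above describes: the wall is a function of the visitor, not of the page.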

The one distinction that explains everything: client-side vs server-side

Forget the marketing names for a second. The question that actually determines whether a paywall is robust is brutally simple: does the full article text reach the browser at all?

CLIENT-SIDE (leaky)                  SERVER-SIDE (sealed)

  origin ──[ full article ]──▶ browser   origin ──[ teaser only ]──▶ browser
                 │                                   ▲
        JS / CSS hides the body            access check runs at the origin,
        (overlay, truncation, fade)        BEFORE the body is ever sent
                 │                                   │
   the bytes are already on the         there is nothing on the page
   page  →  "un-hideable"               to un-hide  →  sealed

Client-side paywalls send the complete article in the HTML or in a JSON blob the page hydrates from, then use JavaScript and CSS to hide most of it — an overlay, a display:none, a truncated container, or a gradient "fade to subscribe." The content is already on the page; the wall is cosmetic. This is why the classic tricks (disable JavaScript, view source, use a browser's reader mode) sometimes reveal the whole article: the bytes were delivered before the wall was painted.
Server-side paywalls make the access decision on the server and simply never include the gated text in the response. A non-subscriber receives a teaser — headline, a paragraph or two, structured metadata — and nothing else. There is nothing to un-hide because the body was never sent.

Google says exactly this to publishers in its own documentation: "If you don't want the content to be accessible to the browser at the time of serving, choose a paywall implementation that doesn't supply the paywalled content to the browser." In plain terms, Google is openly telling publishers that client-side gating is leaky and server-side gating is not.
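
A minimal sketch of what "never supply the content" looks like on the server, assuming a hypothetical Article record and a boolean entitlement check that has already run. The names and the two-paragraph teaser length are illustrative choices, not any publisher's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Article:
    headline: str
    paragraphs: list[str]
    premium: bool  # per-article flag, as in a soft/freemium wall


def render_for(article: Article, is_subscriber: bool, teaser_paragraphs: int = 2) -> dict:
    """Build the response payload; gated text is never placed in it for non-subscribers."""
    if article.premium and not is_subscriber:
        return {
            "headline": article.headline,
            "body": article.paragraphs[:teaser_paragraphs],  # teaser only
            "gated": True,
        }
    return {"headline": article.headline, "body": list(article.paragraphs), "gated": False}
```

The point of the design is that the decision happens before serialization: for a non-subscriber, the later paragraphs are simply absent from the payload, so no amount of browser-side tinkering can recover them.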

So why does anyone still ship client-side? Because it's cheaper and more flexible. Rendering the full page and gating it in the browser plays nicely with ad tech, A/B testing, personalization, and CDN caching (one cached page serves everyone; the JS decides what to show). Server-side entitlement checks mean per-request rendering, a harder caching story, and more backend work. Plenty of publishers knowingly trade a little leakiness for a lot of operational convenience — which is why the web is full of client-side walls a reader can see straight through.
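
The caching tradeoff can be sketched in a few lines. This is a simplified illustration, assuming a cache keyed by a plain string and two hypothetical entitlement tiers ("anonymous" and "subscriber"); real CDNs express the same idea through vary rules or custom cache keys.

```python
def cache_key_client_side(url: str) -> str:
    """One cached page serves everyone; the browser-side script decides what to show."""
    return url


def cache_key_server_side(url: str, entitlement: str) -> str:
    """The response differs by entitlement, so the cache has to vary on it too."""
    return f"{url}|{entitlement}"


# Three requests for the same article from two kinds of reader.
requests = [("/story", "anonymous"), ("/story", "subscriber"), ("/story", "anonymous")]

client_variants = {cache_key_client_side(url) for url, _ in requests}
server_variants = {cache_key_server_side(url, tier) for url, tier in requests}
```

Here client_variants holds a single entry while server_variants holds one per tier. Every extra dimension the server gates on (meter state, region, propensity bucket) multiplies the variants further, which is the operational cost the paragraph above refers to.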

How metering actually counts you

Metered paywalls deserve their own look, because "you've read 5 of 5 free articles" has to be stored somewhere, and where decides how sturdy the meter is.

Cookies / local storage. The cheapest meter increments a counter in your browser. It's also the weakest: clearing site data, or opening a private/incognito window (which starts with empty storage), resets the count. This is the single reason "open it in incognito" works on so many sites — you're not breaking anything, you're just presenting as a brand-new visitor.
Device fingerprinting. Sturdier meters derive a semi-stable id from your browser and device characteristics, so a fresh incognito window still looks like the same device. Harder to reset, but probabilistic and privacy-fraught.
IP address. Some meters count per IP. Effective against casual evasion, but blunt — it can wrongly gate everyone behind a shared office or campus network.
Server-side accounts. The sturdiest meter ties consumption to a logged-in identity. There's nothing client-side to clear, because the count lives in the publisher's database. This is where metering converges with a hard wall.

The pattern to notice: the more robust the meter, the more it moves off the client and onto the server — the same migration we just saw with rendering. Anything enforced in the browser can be undone in the browser.
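
As a rough sketch of the sturdiest variant, here is an account-keyed meter held entirely on the server. The class name, the period string, and the choice not to charge twice for re-reading the same article are assumptions made for illustration.

```python
from collections import defaultdict


class ServerSideMeter:
    """Counts free-article views per account; the count never lives in the browser."""

    def __init__(self, free_limit: int = 5):
        self.free_limit = free_limit
        # (account_id, period) -> set of article ids already counted
        self._seen = defaultdict(set)

    def allow(self, account_id: str, article_id: str, period: str) -> bool:
        """Return True if this account may read the article in this period."""
        seen = self._seen[(account_id, period)]
        if article_id in seen:
            return True  # assumption: re-reading a counted article is not charged again
        if len(seen) >= self.free_limit:
            return False
        seen.add(article_id)
        return True
```

Because the state is keyed by account rather than by anything the browser sends, clearing cookies or opening a private window changes nothing; only a new period (or a new account) resets the count.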

The Googlebot contract: how publishers show bots what they hide from you

Here's the part most explanations skip, and it's the most important. A publisher who hides the article from readers but serves the full text to Googlebot is, on its face, doing cloaking — showing crawlers something different from what users get. Cloaking is a search-spam violation that gets a site demoted or removed from the index. So how do paywalled articles rank at all?

Google built a sanctioned exception. It evolved out of the old "first click free" policy (drop the wall for visitors arriving from Google) and became, in 2017, flexible sampling plus a structured-data declaration. Publishers mark their paywalled sections with schema.org markup — isAccessibleForFree: false plus a hasPart block whose cssSelector points at the gated element:

{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline",
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".paywall"
  }
}

That declaration is the contract. It tells Google: "this .paywall section is gated, and any difference between what Googlebot sees and what a logged-out human sees is intentional, not cloaking." In return, the publisher grants Googlebot (and Googlebot-News) full access to the body so the article can be indexed and ranked.
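
From the publisher's side, the declaration is usually generated by the CMS rather than written by hand. The helper below is a minimal sketch of that step; the function name and parameters are hypothetical, and it only covers the fields shown in the example above.

```python
import json


def paywall_jsonld(headline: str, gated_selectors: list[str]) -> str:
    """Serialize a NewsArticle declaration marking each selector as not free."""
    if not gated_selectors:
        raise ValueError("at least one gated selector is required")
    parts = [
        {"@type": "WebPageElement", "isAccessibleForFree": False, "cssSelector": sel}
        for sel in gated_selectors
    ]
    doc = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "isAccessibleForFree": False,
        # A single gated section is emitted as an object, several as a list.
        "hasPart": parts[0] if len(parts) == 1 else parts,
    }
    return json.dumps(doc, indent=2)
```

Calling paywall_jsonld("Article headline", [".paywall"]) reproduces the structure shown above; the output would then be embedded in the page inside a script tag of type application/ld+json.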

                    ┌──────────────────────────────┐
   Googlebot  ────▶ │  Publisher origin            │ ──▶  FULL article
 (verified by       │   isAccessibleForFree: false │      (so it can be indexed)
  reverse DNS)      │   hasPart → ".paywall"       │
   Logged-out  ───▶ │                              │ ──▶  teaser + subscribe wall
   reader           └──────────────────────────────┘
      The JSON-LD declares the gap on purpose, so serving the
      bot more than the human is treated as policy — not cloaking.

Two consequences fall out of this, and they explain a lot of real-world behavior:

Publishers verify that Googlebot is really Googlebot. Because crawler access is a privilege, sites confirm it with a reverse DNS lookup or by checking the request IP against Google's published ranges, not by trusting the User-Agent header. That's why simply sending User-Agent: Googlebot from an ordinary server typically gets you an HTTP 403 or the ordinary wall: the request's IP doesn't belong to Google. The user-agent trick only ever worked on sites that didn't bother validating, and the big publishers generally validate.
The markup hands out a map of the wall. The cssSelector: ".paywall" names the element that wraps the gated content. A declaration intended to help search engines also tells anyone reading the page source exactly which node is gated, which is why client-side "un-hide" tools target that same selector.
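
The verification step a publisher runs can be sketched with the standard library. This is a simplified version of the reverse-then-forward DNS check: the resolver functions are injectable so the logic can be exercised without network access, the accepted domains are limited to googlebot.com and google.com, and the default forward lookup only handles IPv4. Treat it as an illustration rather than production code.

```python
import socket
from typing import Callable


def _reverse_dns(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host: str) -> list[str]:
    # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse_dns,
    forward: Callable[[str], list[str]] = _forward_dns,
) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm the hostname."""
    try:
        host = reverse(ip).rstrip(".").lower()
        # The leading dot prevents look-alikes such as "evilgooglebot.com".
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward confirmation defeats a spoofed PTR record.
        return ip in forward(host)
    except OSError:
        return False  # lookup failures are treated as "not verified"
```

Note that the User-Agent header never appears in the function. The claim "I am Googlebot" is checked purely against where the request came from, which is why changing a header achieves nothing on a site that validates.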

The same logic extends to AMP: Google requires a publisher's bot-access policy to match across AMP and non-AMP pages (via amp-subscriptions), or Search Console flags a content mismatch. That parity requirement is why AMP versions of articles are sometimes less aggressively gated than their canonical pages — the publisher had to keep the two consistent for the crawler.

How paywall "bypass" tools actually work

Open-source paywall removers — the best known being Bypass Paywalls Clean, plus web tools like 12ft and archives like archive.today — are essentially a catalogue of per-site rules, each exploiting one of the weaknesses above. Understanding what they do is useful for reasoning about how robust a given paywall is. It is not an endorsement: several have been removed from extension stores under legal pressure, which is the subject of the next section.

Technique	Which paywall design it targets	Why it fails on hardened sites
Crawler user-agent (Googlebot/Bingbot)	Sites that serve crawlers the full body	Blocked by IP / reverse-DNS validation of the bot
Referer spoofing (Google / social)	"First-click-free"-style allowances	Most publishers dropped first-click-free; ignored on server-side gates
Clearing cookies / storage	Metered counters tracked client-side	Useless against server-side, account-based, or fingerprinted meters
Blocking the paywall script (Piano/Tinypass, Poool, etc.)	Client-side JS enforcement	Nothing to block when the gate is server-side
AMP / reader-mode / view-source	Content shipped-then-hidden	The body simply isn't in the response on server-side pages
Reading embedded JSON (`articleBody`, framework state)	Sites that ship full text for their own SPA/SEO	The text isn't embedded when rendered server-side per entitlement
Web archives (archive.today)	Anything someone already archived	Depends on a third-party copy existing; raises its own copyright questions

Walk down the column and a single pattern emerges. Crawler-UA and referer tricks exploit the indexing contract — they try to look like the privileged visitor the publisher serves in full. Cookie-clearing exploits client-side metering. Script-blocking, reader-mode, and view-source exploit client-side rendering. Reading embedded JSON exploits the fact that a single-page app or an SEO setup often ships the whole article as data even when the visible DOM is truncated. Archives sidestep the live site entirely by reading a copy someone else already saved.

The throughline: every one of these works only because the content already left the publisher's server. Server-side rendering plus IP-validated bot access closes the entire column at once — there is no header to spoof into a privilege, no counter in the browser to reset, no hidden body to un-hide, and no embedded JSON because the body was never serialized to the client.
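
For a publisher auditing its own wall, the table collapses into three yes/no questions. The sketch below encodes that mapping; the field names and exposure labels are invented for illustration and simply restate the rows above.

```python
from dataclasses import dataclass


@dataclass
class PaywallDesign:
    body_sent_to_browser: bool      # full text shipped, then hidden
    meter_stored_client_side: bool  # counter kept in a cookie or local storage
    trusts_user_agent: bool         # crawler access granted on the header alone


def open_exposures(design: PaywallDesign) -> list[str]:
    """List which classes of weakness from the table a given design leaves open."""
    exposures = []
    if design.trusts_user_agent:
        exposures.append("crawler user-agent spoofing")
    if design.meter_stored_client_side:
        exposures.append("clearing cookies / storage")
    if design.body_sent_to_browser:
        exposures.append("script blocking, reader mode, view-source, embedded JSON")
    return exposures
```

A design with all three flags set to False returns an empty list, which is the "closes the entire column at once" case. Archives are left out deliberately, since they depend on a third-party copy rather than on anything in the publisher's design.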

Why the arms race now favors publishers

A decade ago, "disable JavaScript" beat most paywalls. Today it rarely does, for a few converging reasons:

Server-side rendering keeps the body off the wire until entitlement is checked. The leak closes at the source.
Dynamic / propensity models change the wall per visit, so a single static rule breaks the moment the model decides you look different.
Bot validation — reverse DNS for Googlebot, plus commercial anti-bot vendors like Cloudflare and DataDome at the edge — makes crawler impersonation and naive automated access expensive and unreliable. A spoofed user-agent now meets a fingerprinting challenge, not a free pass.
Edge enforcement means the gate is applied at the CDN, before a request ever reaches the origin app. The decision happens in front of the content, not inside it.

The net effect is that the cheap, client-side techniques are dying off, and what remains is either legally fraught (archives, account sharing) or simply doesn't work against a modern server-side, dynamically gated, edge-protected site.

The legal reality: paywalls are the highest-risk category

This is the part that matters most, and it's why Crawlora's position is unambiguous: don't bypass paywalls. That position is consistent with our guide on whether web scraping is legal in 2026, where the rules depend on the data, the method, and what you do with the results.

Access risk stratifies cleanly:

Tier 1 — public, non-gated pages. The lowest risk. In the US, hiQ Labs v. LinkedIn and the Supreme Court's narrowing of the CFAA in Van Buren v. United States support the view that accessing data available to the public without authentication is not "unauthorized access."
Tier 2 — login-gated content. A step riskier: you're now past an authentication boundary, and terms of service are squarely in play.
Tier 3 — paywalled content. The top of the risk stack. Engineering a way around a technological access control can implicate the DMCA's anti-circumvention rule (§1201), which targets circumventing a measure that controls access to a work and is separate from copyright infringement itself, and the CFAA, on top of breaching the site's terms of service.

The litigation is trending in the publishers' direction, though these are pending suits rather than settled precedent. Reddit v. Perplexity alleges circumvention of rate limits and anti-bot systems; Google sued SerpApi in late 2025 citing the DMCA and copyright. And the open-source paywall removers themselves have been pulled from the Chrome and Firefox stores under the DMCA, the clearest signal of where the legal line sits.

Public, non-gated pages are the defensible tier; logins and paywalls escalate risk sharply.
Circumventing a technological access control — a paywall, login, or anti-bot system — is a distinct legal exposure under DMCA §1201, separate from reading a public page.
Terms of service can prohibit automated access even to public content; that's a contract risk on top of everything else.
If you need a specific publisher's articles at scale, the right path is a licensing or syndication deal — not a workaround.

The right way to get article content at scale

If your project genuinely needs article text, there are legitimate routes, in rough order of preference:

Official content APIs and licensing. Many publishers and wire services license full text, and a syndication or licensing agreement is the durable answer for a specific outlet's articles at scale. Several large publishers also expose documented developer APIs for metadata.
The structured data publishers already expose. Headlines, descriptions, authors, dates, sections, and tags are published for crawlers in JSON-LD — that's fair game and machine-readable by design. You can get a lot of value from the metadata layer without touching gated bodies.
Public, non-gated pages. For the large universe of web content that isn't paywalled at all, a compliant scraping API that respects robots.txt, rate limits, and terms is the clean way to get structured content without running your own browser fleet.
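
Reading the structured-data layer needs nothing beyond the standard library. The sketch below pulls a few fields out of a page's JSON-LD; the class and function names are hypothetical, it only looks for a top-level object whose @type is exactly "NewsArticle", and it ignores the other shapes JSON-LD can take (lists, @graph, arrays of types).

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def article_metadata(html: str) -> dict:
    """Return selected NewsArticle fields found in the page's JSON-LD, if any."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    wanted = ("headline", "datePublished", "isAccessibleForFree")
    for doc in parser.documents:
        if isinstance(doc, dict) and doc.get("@type") == "NewsArticle":
            return {key: doc[key] for key in wanted if key in doc}
    return {}
```

Everything this returns is metadata the publisher chose to put in the served page for machines to read. It never touches the gated body, and the isAccessibleForFree field tells you up front whether the article is gated at all.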

That last one is where Crawlora fits. Our web scraping API and the /web/scrape endpoint turn public URLs into clean Markdown and structured metadata, with managed rendering and proxies — built for public web data, not for circumventing paid content. If you want to know how hard a given public page is to fetch before you start, the anti-bot checker gives you a difficulty read on the exact URL, and the proxies explainer covers responsible pacing.

The takeaway

A paywall is just an answer to one question — where does the content live when a non-subscriber asks for it? Keep it in the browser and hide it, and the wall is cosmetic. Keep it on the server and never send it, and the wall is real. The structured-data contract with Google explains the strange middle ground where bots see everything and humans see a teaser, and the steady migration of every defense — rendering, metering, bot checks — from the client to the server and the edge is why the easy tricks keep dying. The robust, lawful way to work with article content at scale isn't to fight that trend; it's to use the public data, the structured metadata, and the licensing the open web already provides.

Frequently asked questions

How do paywalls work?

A paywall withholds an article from non-subscribers, but the implementation varies. Hard paywalls serve no body at all; metered paywalls track your free-article count with a cookie, device fingerprint, or account; dynamic paywalls vary the wall per visitor. The key technical difference is whether the full text is sent to your browser and then hidden (client-side) or never sent at all (server-side).

Why can I read some paywalled articles in incognito mode but not others?

A private/incognito window starts with empty cookies and local storage, so a client-side metered counter that tracks how many free articles you've read sees a brand-new visitor, and metered paywalls often reopen. It does nothing against hard or server-side paywalls, where the article body is never delivered to the browser in the first place.

What is the difference between a client-side and server-side paywall?

A client-side paywall sends the full article to the browser and hides it with JavaScript/CSS (an overlay or truncation), so the content technically reached your device. A server-side paywall decides access on the server and never includes the gated text in the response. Client-side gates are far easier to circumvent; with a server-side gate there is nothing in the browser to un-hide, which is the implementation Google's documentation points publishers to when they don't want the content accessible to the browser.

Is it legal to bypass a paywall?

Bypassing a paywall is the highest-risk category of web access. Circumventing a technological access control can implicate the DMCA's anti-circumvention rules (§1201) and the CFAA, on top of breaching the site's terms of service. Reading public, non-gated pages is far more defensible, and for a specific publisher's full articles at scale, licensing is the right path — not a workaround. This is not legal advice.

