NexGenData

Posted on Jun 22 • Originally published at thenextgennexus.com

Google Killed the Cached Page View in Feb 2024. Here's How I Replaced It

#webscraping #seo #api #opensource

In early February 2024, Google's Danny Sullivan confirmed on X what SEO professionals, journalists, and OSINT researchers had started to suspect: the cache: operator and the "Cached" link in search results were being permanently removed.

Sullivan's stated reason: the feature was originally built to help people load pages when "the internet was unreliable," and Google no longer considered it necessary. For anyone using cached pages for real work, the removal broke an entire workflow:

SEO auditors could no longer see what Google's crawler actually indexed
Journalists lost a quick check for pages that had been edited or deleted
OSINT researchers lost a first-stop verification tool
Support teams could no longer see the version of a page a user was looking at when something went wrong
Litigation teams lost their easiest first source of preservation evidence

A month later, Google also removed cached pages from the Search Console's URL Inspection tool. The cache was really, completely gone.

What replaced it? Nothing, officially. Unofficially, everyone moved to the Internet Archive's Wayback Machine and archive.today. Both work. Both have quirks. Both have APIs that are half-documented and behave in ways that surprise you.

This post is the workflow we built to put a reliable "show me what this page looked like" capability back together. It centers on the NexGenData Google Cache Viewer actor, which wraps Wayback and archive.today with a unified fallback chain, schema-level parsing, and a zero-config HTTP interface.

Why This Workflow Gap Matters More Than Google Implied

When Google pulled the plug, the first wave of reactions was "who cares, just use Wayback." Six months in, the consensus shifted. Here's what actually broke:

SEO and Content Work

Before: SEO teams could paste cache:example.com/page into the URL bar and see exactly what Googlebot had indexed last—word count, schema markup, rendered HTML, the whole picture.

After: You can still see cached snapshots in a general "what did this URL look like" sense, but Wayback doesn't tell you what Google's crawler saw. For ranked-page diagnostics (why is this page ranking for this query? has Google indexed my latest update?), the answer became "it depends, and you'll need multiple tools."

Journalism and Fact-Checking

Before: A reporter could paste a URL into cache: and instantly see the pre-edit or pre-deletion version. Getting stories onto the record within minutes mattered.

After: Wayback often does have the older snapshot—but not always on the original timestamp the reporter needs. archive.today frequently has the snapshot Wayback missed, because it was triggered by a human pressing "archive this." Knowing which archive to check, and being able to check both quickly, became the new skill.

OSINT and Research

Before: A quick cache check was the first step in any investigation of a suspicious page, takedown, or edit war.

After: OSINT researchers now run multi-archive checks as standard practice. Every experienced researcher we've talked to has built or adopted a tool that fans out to multiple archives and returns the first useful hit.

The Two Real Alternatives (and Their Quirks)

There are two credible replacements for Google's cache. Each has characteristic strengths and failure modes.

Internet Archive Wayback Machine

The Wayback Machine is run by the nonprofit Internet Archive and is the largest general web archive in the world—hundreds of billions of snapshots. Its API is technically public:


    https://archive.org/wayback/available?url=example.com&timestamp=20240301

That returns JSON with the nearest snapshot to the timestamp. In practice, you hit three issues:

Scheme sensitivity: Wayback treats http://example.com, https://example.com, https://www.example.com, and https://example.com/ as different URLs for lookup purposes. Query the wrong variant and Wayback says "no snapshot exists" even when there are thousands for the same page.
Rate limiting: The available endpoint enforces a soft limit; CDX search (the powerful bulk query) enforces a stricter one. Aggressive polling will 429 you.
Partial renders: JavaScript-heavy pages frequently archive as blank shells. The snapshot exists but the content you wanted isn't there.
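For reference, here is a minimal sketch of building that availability query and reading the nearest snapshot out of the response. The helper names are ours, and the sample payload is hand-written to mirror the shape the endpoint returns (an `archived_snapshots.closest` object); it is not a real capture.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(url, timestamp=None):
    """Build the availability query; timestamp is YYYYMMDD (or longer)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def parse_closest(payload):
    """Return the nearest snapshot's URL and timestamp, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {"url": closest["url"], "timestamp": closest["timestamp"]}


# Hand-written sample body, for illustration only.
sample = json.loads("""
{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20240301000000/http://example.com/",
      "timestamp": "20240301000000"
    }
  }
}
""")

print(build_availability_url("example.com", "20240301"))
print(parse_closest(sample))
```

An empty `archived_snapshots` object is what a "no snapshot" answer looks like, which is why `parse_closest` returns `None` rather than raising.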

archive.today (archive.ph / archive.is / archive.li)

archive.today is a smaller, user-initiated archive. Nothing gets archived there unless a human (or bot) triggered a save. That means its coverage is much thinner than Wayback's overall, but dramatically better on:

News articles that were deleted or heavily edited (because someone always triggers an archive on contested stories)
Twitter/X posts (Wayback's Twitter snapshots are frequently broken; archive.today renders them well)
Paywalled articles (archive.today bypasses many paywalls by the mechanism it uses to render)

The quirks:

No documented API: Everything is scraped. archive.today publishes OpenGraph metadata on its snapshot pages, which is what parseable tools use.
CDN / DNS rotation: The domain moves between .today, .ph, .is, .li, .fo for resilience reasons. Hard-coding a single hostname breaks.
Scheme sensitivity, but different from Wayback: archive.today tends to canonicalize more aggressively than Wayback does, but its hit rate on the "canonical" form isn't 100%, so you still need to try variants.
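One way to cope with the hostname rotation is to treat the mirror list as data and try each host in turn. The sketch below is an illustration, not the actor's code: the `fetch` callable is supplied by you, and the `/newest/{url}` lookup path is an assumption on our part, since archive.today has no documented API.

```python
# Mirrors named above; order is arbitrary and easy to change.
ARCHIVE_TODAY_HOSTS = [
    "archive.today",
    "archive.ph",
    "archive.is",
    "archive.li",
    "archive.fo",
]


def first_responding_mirror(target_url, fetch, hosts=ARCHIVE_TODAY_HOSTS,
                            path_template="/newest/{url}"):
    """Try each mirror in order and return (lookup_url, body) for the first
    one that responds, or None if every mirror fails.

    fetch: caller-supplied callable that takes a URL and returns the body,
           raising OSError (e.g. ConnectionError) on network failure.
    path_template: ASSUMED lookup path; adjust to whatever you scrape.
    """
    for host in hosts:
        lookup_url = f"https://{host}{path_template.format(url=target_url)}"
        try:
            body = fetch(lookup_url)
        except OSError:
            continue  # this mirror is unreachable, move to the next one
        if body is not None:
            return lookup_url, body
    return None
```

Injecting `fetch` keeps the rotation logic testable without touching the network, and lets you swap in whatever HTTP client and politeness delay you prefer.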

The Takeaway

Neither source alone is enough. The right workflow is: try archive.today first (user-triggered archives tend to be the snapshot someone wanted preserved), fall back to Wayback, and handle scheme variants for both.

That's what the actor implements.

What the Actor Does

The NexGenData Google Cache Viewer takes a URL (or array of URLs) and returns:

The first successful archive match across the fallback chain
Snapshot timestamp
Archive source (wayback or archive_today)
Snapshot URL (the permalink on the archive)
Optionally: rendered HTML, extracted text, page title, meta description

The fallback logic handles scheme sensitivity automatically. For example.com/article, it tries, in order:

https://example.com/article on archive.today
https://example.com/article on Wayback
http://example.com/article on archive.today
http://example.com/article on Wayback
https://www.example.com/article on archive.today
https://www.example.com/article on Wayback
With/without trailing slash variants
Canonical URL extracted from OpenGraph if any snapshot was found

You get the first hit. If no archive has the page, you get a clean found: false with the variants that were tried—which is itself useful information.
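To make the ordering concrete, here is a rough sketch of how such a variant list could be generated. It follows the order listed above (https, http, https with www., then the trailing-slash forms), but it is our illustration and not the actor's actual source; `url_variants` and `attempt_plan` are hypothetical names.

```python
from urllib.parse import urlsplit

SOURCES = ("archive_today", "wayback")  # labels taken from the actor's output


def url_variants(raw_url):
    """Return lookup variants of raw_url in the order described above."""
    # Accept input with or without a scheme.
    parts = urlsplit(raw_url if "://" in raw_url else f"https://{raw_url}")
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""

    bases = [f"https://{host}", f"http://{host}", f"https://www.{host}"]
    if path == "/":
        paths = ["/"]  # nothing to toggle on a bare domain
    else:
        toggled = path[:-1] if path.endswith("/") else path + "/"
        paths = [path, toggled]

    variants = [f"{base}{p}{query}" for p in paths for base in bases]
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


def attempt_plan(raw_url):
    """Pair every variant with each archive, archive.today first."""
    return [(v, s) for v in url_variants(raw_url) for s in SOURCES]


for variant, source in attempt_plan("example.com/article"):
    print(f"{source:14} {variant}")
```

Returning the full plan up front, rather than generating variants lazily, also gives you the list of attempted variants for free when nothing is found.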

Try It: One URL with cURL


    curl "https://api.apify.com/v2/acts/lueaWo5saiyGDsdNz/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{
        "urls": ["https://techcrunch.com/2024/02/01/google-search-cache-links-dead/"],
        "preferredSource": "archive_today",
        "fallbackToWayback": true,
        "extractText": true
      }'

Response:


    [
      {
        "requestedUrl": "https://techcrunch.com/2024/02/01/google-search-cache-links-dead/",
        "found": true,
        "source": "archive_today",
        "snapshotUrl": "https://archive.ph/2024/20240202143021/https://techcrunch.com/2024/02/01/google-search-cache-links-dead/",
        "snapshotTimestamp": "2024-02-02T14:30:21Z",
        "title": "Google Search's cache links are officially gone",
        "metaDescription": "Google Search has officially killed off the 'cached' link, Google's Search Liaison Danny Sullivan confirmed...",
        "extractedText": "Google Search has officially killed off the 'cached' link, which had been part of Google Search since the earliest days... [full article text] ...",
        "wordCount": 1247,
        "attemptedVariants": [
          { "url": "https://techcrunch.com/2024/02/01/google-search-cache-links-dead/", "source": "archive_today", "result": "hit" }
        ]
      }
    ]

The attemptedVariants array is a useful diagnostic: if the first variant hit, you'll see one entry; if fallback was needed, you'll see each attempt and its outcome.
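If you want to track how often fallback kicks in, a small helper can condense that array. This sketch assumes each entry carries the `url`, `source`, and `result` fields shown in the response above; the `"miss"` label in the sample is our guess, since the example response only shows `"hit"`.

```python
def summarize_attempts(item):
    """Condense one result item's attemptedVariants into a few counters."""
    attempts = item.get("attemptedVariants", [])
    hit = next((a for a in attempts if a.get("result") == "hit"), None)
    return {
        "attempts": len(attempts),
        "misses_before_hit": attempts.index(hit) if hit else len(attempts),
        "hit_url": hit["url"] if hit else None,
        "hit_source": hit["source"] if hit else None,
    }


# Hand-written sample; "miss" is an assumed label for a failed attempt.
sample_item = {
    "found": True,
    "attemptedVariants": [
        {"url": "https://example.com/a", "source": "archive_today", "result": "miss"},
        {"url": "https://example.com/a", "source": "wayback", "result": "hit"},
    ],
}

print(summarize_attempts(sample_item))
```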

Bulk Mode: Checking 1,000 URLs in Python

The pattern we use most is feeding in a list of URLs from somewhere (a sitemap, a link audit, a list of referrer URLs from analytics) and getting back an enriched dataset.


    import os
    from collections import Counter

    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])

    # URLs from your link audit, competitor sitemap, or referrer log
    urls = [
        "https://example.com/blog/old-post-1",
        "https://example.com/blog/old-post-2",
        # ... hundreds more
    ]

    run = client.actor("nexgendata/google-cache-viewer").call(run_input={
        "urls": urls,
        "preferredSource": "archive_today",
        "fallbackToWayback": True,
        "extractText": False,        # faster/cheaper if you don't need full text
    })

    results = list(client.dataset(run["defaultDatasetId"]).iterate_items())

    # Tally by source
    sources = Counter(r["source"] for r in results if r["found"])
    print(f"Found in archive.today: {sources.get('archive_today', 0)}")
    print(f"Found in Wayback:       {sources.get('wayback', 0)}")
    print(f"Not archived anywhere:  {sum(1 for r in results if not r['found'])}")

    # Identify the unarchived URLs so you can submit them to Wayback's save API
    unarchived = [r["requestedUrl"] for r in results if not r["found"]]
    for url in unarchived:
        print(f"No archive: {url}")

JavaScript / Node: On-Demand Cache Lookup from a Web UI

If you're building a tool where a user pastes a URL and wants to see the archived version, here's the server-side pattern (the browser calls your own backend, so you don't expose the Apify token):


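    // server.js (Express; the global fetch used below requires Node 18+)
    const express = require("express");

    const app = express();
    app.use(express.json()); // parse JSON bodies so req.body.url is populated

    app.post("/api/lookup-cache", async (req, res) => {
      const { url } = req.body;
      if (typeof url !== "string" || !url) {
        return res.status(400).json({ error: "Request body must include a url string" });
      }

      const apifyRes = await fetch(
        `https://api.apify.com/v2/acts/lueaWo5saiyGDsdNz/run-sync-get-dataset-items?token=${process.env.APIFY_TOKEN}`,
        {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            urls: [url],
            preferredSource: "archive_today",
            fallbackToWayback: true,
            extractText: true,
          }),
        }
      );

      // Surface upstream failures instead of trying to parse an error body
      if (!apifyRes.ok) {
        return res.status(502).json({ error: `Apify returned HTTP ${apifyRes.status}` });
      }

      const [result] = await apifyRes.json();

      if (!result || !result.found) {
        return res.status(404).json({
          error: "No archive exists for this URL",
          attemptedVariants: result ? result.attemptedVariants : [],
        });
      }

      res.json({
        snapshot: result.snapshotUrl,
        capturedAt: result.snapshotTimestamp,
        title: result.title,
        text: result.extractedText,
        source: result.source,
      });
    });

    app.listen(process.env.PORT || 3000);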

This gives you a "what did this page look like" lookup you can embed in an internal tool in an afternoon. No need to learn the Wayback CDX API or scrape archive.today directly.

What I Actually Use This For

1. SEO Diagnostics on De-indexed Pages

A client's page drops out of Google's index overnight. You want to compare the current live page to the last known "good" version. The actor pulls the archived snapshot; you diff it against the current HTML; you often find the exact change that triggered the deindex (usually a meta robots change, a schema breakage, or content thinning).
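The diff step needs nothing beyond the standard library. Here is a minimal sketch using `difflib`; the two HTML strings are made-up stand-ins for the archived snapshot and the live page, and in practice you would probably normalize whitespace first.

```python
import difflib


def html_diff(archived_html, live_html):
    """Return unified-diff lines between the archived and live HTML."""
    return list(difflib.unified_diff(
        archived_html.splitlines(),
        live_html.splitlines(),
        fromfile="archived",
        tofile="live",
        lineterm="",
    ))


# Made-up example pages: the only change is the meta robots tag.
archived = """<head>
<title>Pricing</title>
<meta name="robots" content="index,follow">
</head>"""
live = archived.replace("index,follow", "noindex,nofollow")

# Keep only added/removed lines, skipping the ---/+++ file headers.
changes = [
    line for line in html_diff(archived, live)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
for line in changes:
    print(line)
```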

2. Litigation / Compliance Preservation

A brand's competitor publishes something defamatory and then deletes it three hours later. Wayback crawls infrequently and may have missed the window. archive.today is more likely to have the snapshot because someone involved in the dispute triggered an archive. The fallback chain finds whichever archive caught it.

3. Research and Academic Citations

Papers that cite web sources routinely hit link rot within a few years. Running the archive lookup on a bibliography gets you current archive URLs to substitute in, or flags the citations that are dead across every archive (so you can hunt for alternatives).

4. Journalism Verification

A source claims "X said Y on their website three weeks ago." You paste the URL, get the archived version with its timestamp, compare with the current page, and you have evidence that the claim is or isn't accurate.

5. Deleted Social Media / News Stories

When a news organization retracts a story or a public figure deletes a blog post, the first question is "was it archived?" The fallback logic maximizes the odds of the answer being yes.

Why Not Just Call the Wayback API Yourself?

You absolutely can. We did, for about three months. Here's what kept tripping us up and eventually pushed us to build the wrapper:

Scheme normalization: Writing and testing the "try HTTPS, try HTTP, try with/without www., try with/without trailing slash" permutation logic is mildly tedious and a source of edge-case bugs.
archive.today has no clean API: You scrape it. Which means dealing with its CDN rotation across .today, .ph, .is, .li, .fo domains.
Rate limits are aggressive: Wayback's CDX endpoint is stricter than the available endpoint. A naive bulk script gets 429'd fast.
Text extraction requires actually rendering the snapshot: Snapshots are in Wayback's replay frame or archive.today's snapshot frame. Extracting the underlying page text means parsing around those wrappers.
Maintenance drift: archive.today's HTML structure changes every few months. Wayback's API gets the occasional parameter tweak. You maintain this forever if you DIY.

The actor exists because we got tired of maintaining that logic in three different side projects. Now it's one piece of code, fixed centrally.
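If you do go the DIY route, the rate-limit item is the one most worth getting right early. Below is a generic exponential-backoff sketch, not Wayback-specific code: `fetch` is whatever function you write to make the request, and it is assumed to raise the `RateLimited` exception defined here when it sees an HTTP 429.

```python
import time


class RateLimited(Exception):
    """Raised by the caller-supplied fetch function on an HTTP 429."""


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url), doubling the wait after each 429.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    RateLimited once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the delays can be stubbed out in tests; the delay values themselves are placeholders to tune against whatever limits you actually observe.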

Pricing: About $0.002 Per URL

The actor uses Apify's pay-per-event pricing. At the default settings (fallback chain enabled, text extraction on), typical cost is about $0.002 per URL. A 1,000-URL bulk check runs about $2.
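For budgeting, the arithmetic is simple enough to script. This sketch takes the roughly $0.002 per URL figure above as a flat rate, which is an approximation; actual pay-per-event charges will vary with your settings.

```python
from decimal import Decimal

PRICE_PER_URL = Decimal("0.002")  # approximate figure quoted above


def estimated_cost(url_count):
    """Approximate cost in USD for a batch of url_count lookups."""
    return PRICE_PER_URL * url_count


def lookups_for_budget(budget_usd):
    """Approximate number of lookups a given USD budget covers."""
    return int(Decimal(str(budget_usd)) / PRICE_PER_URL)


print(f"1,000 URLs: about ${estimated_cost(1000):.2f}")
print(f"$5 budget:  about {lookups_for_budget(5):,} lookups")
```

`Decimal` is used instead of floats so the per-URL price multiplies out without binary rounding noise.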

A few concrete comparisons:

| Tool                    | Price         | Per-URL | Notes                               |
|-------------------------|---------------|---------|-------------------------------------|
| Wayback Machine direct  | Free          | $0      | DIY, rate-limited, scheme-sensitive |
| archive.today direct    | Free          | $0      | DIY, no official API                |
| Google Cache (RIP)      | n/a           | n/a     | Dead since Feb 2024                 |
| Time Travel (Memento)   | Free, limited | $0      | Federated, not uniformly parsed     |
| NexGenData Cache Viewer | ~$0.002/URL   | $0.002  | Unified, parsed, bulk               |

Answering the Common Questions

Q: Is this the same as the cache: operator? A: Functionally close, but the underlying source is Wayback + archive.today rather than Google's own cache. For most use cases (seeing a prior version of a page), it's equivalent. For "see what Googlebot indexed," it's not: only Google's systems know that, and they no longer expose it.

Q: Can it trigger a new archive save? A: Not directly, though we're considering adding a saveIfMissing flag that triggers Wayback's Save Page Now. For now, if the URL isn't archived, you get back found: false with a list of attempted variants, and you can use Wayback's public save endpoint to archive it yourself.
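As a stopgap, you can turn the unarchived URLs into Save Page Now requests yourself. The sketch below only builds the request URLs, using the public `https://web.archive.org/save/` prefix; it assumes result items shaped like the actor output shown earlier, and it leaves the actual requests (and a polite delay between them) to you.

```python
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_requests(results):
    """Build a Save Page Now URL for every result that was not found."""
    return [
        SAVE_ENDPOINT + item["requestedUrl"]
        for item in results
        if not item["found"]
    ]


# Hand-written sample results, for illustration only.
sample_results = [
    {"requestedUrl": "https://example.com/blog/old-post-1", "found": True},
    {"requestedUrl": "https://example.com/blog/old-post-2", "found": False},
]

for save_url in save_requests(sample_results):
    print(save_url)
```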

Q: Does it bypass paywalls? A: archive.today's snapshots often do (as a side effect of how they render, not by design). Wayback varies. The actor returns whatever the archives return—neither better nor worse.

Q: How do I know the snapshot is current? A: The response includes snapshotTimestamp. If the archived snapshot is older than you need, you can set minTimestamp in the input and the actor will only return snapshots newer than that date (returning found: false if none qualify).
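You can also apply the same freshness check on your side, for example when re-filtering results you already have. This sketch assumes `snapshotTimestamp` is an ISO 8601 string with a trailing `Z`, as in the example response, and treats the cutoff as inclusive, which is our choice rather than documented actor behavior.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_fresh_enough(item, min_timestamp):
    """True if the item was found and its snapshot is at or after the cutoff."""
    if not item.get("found"):
        return False
    return parse_ts(item["snapshotTimestamp"]) >= parse_ts(min_timestamp)


# Hand-written sample item, using the timestamp from the response above.
sample_snapshot = {"found": True, "snapshotTimestamp": "2024-02-02T14:30:21Z"}

print(is_fresh_enough(sample_snapshot, "2024-02-01T00:00:00Z"))
print(is_fresh_enough(sample_snapshot, "2024-03-01T00:00:00Z"))
```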

Q: Can it follow redirects? A: Yes, optionally. followRedirects: true will chase HTTP 301/302s and look up the final destination in the archives. Off by default because you usually want the specific URL you asked about.

Related NexGenData Actors for Research and Monitoring

DNS Propagation Checker — When a site goes down or pops back up, check from 15 resolvers worldwide to verify global propagation state.
Email DMARC Auditor — Pair with cache lookups to see the historical DMARC posture of a domain that's now changed its records.
Google Scholar Scraper — Academic references benefit from both Scholar's citation data and archived full-text from the cache viewer.
Hacker News Scraper — Monitor HN comments that reference URLs and archive them before the links rot.

Try It

Single URL, right now:


    curl "https://api.apify.com/v2/acts/lueaWo5saiyGDsdNz/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{"urls": ["https://example.com/some-article"]}'

Or from the UI: apify.com/nexgendata/google-cache-viewer.

Free tier on Apify gives you $5 of credits, which covers roughly 2,500 URL lookups. Enough to run a sitemap audit, do a research pass, or back-fill a broken-link check across an entire blog.

Want more pragmatic data tools? NexGenData builds Apify actors that plug gaps like this one—subscribe for new releases.

Build your own actors and ship them alongside mine on Apify: apify.com.
