The problem
I run a small constellation of sites — a main brand site, a few content-heavy reference databases for Minecraft (mobs, biomes, items, enchantments, structures), a hosting page, a status page, a few tool pages. 29 deployed sites total, plus one main domain. Every time I push new pages, Google takes a week to discover them through the sitemap, and Bing takes longer. For pages that target time-sensitive search terms, that's a real cost.
I wrote a small service called Indexing-Astroworld that solves two problems:
- Detect new and changed URLs across all 30 sites every 6 hours
- Submit them to IndexNow (Bing, Yandex, Naver, Seznam) and Google Indexing API the moment they exist
It runs on the same VPS as the rest of the stack. It's about 400 lines of TypeScript. This post is what I actually built and a few things I learned along the way.
IndexNow vs Google Indexing API
These are not the same thing. People mix them up constantly.
IndexNow is a simple HTTP POST protocol. You send a list of URLs to https://api.indexnow.org/indexnow and search engines that participate (Bing, Yandex, Naver, Seznam, Cloudflare-fronted versions) get notified. It's open. No auth, no quota that I've ever hit, no rate limit documented. The "key" is a random string you publish at https://yourdomain.com/{key}.txt so the search engine can verify you own the domain.
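The smallest possible interaction is a single-URL GET ping; the batch POST I use later in this post is the same idea with a JSON body. A quick sketch (the page URL is a placeholder, and INDEXNOW_KEY is whatever random string you published in the key file):

// Single-URL IndexNow ping (the GET form of the protocol).
// The key must match the contents of https://{host}/{key}.txt exactly.
const key = process.env.INDEXNOW_KEY!;
const pageUrl = "https://mobs.astroworldmc.com/warden"; // placeholder page
const res = await fetch(
  `https://api.indexnow.org/indexnow?url=${encodeURIComponent(pageUrl)}&key=${key}`
);
console.log(`[indexnow] ping returned ${res.status}`);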
Google Indexing API is the opposite. It's authenticated with a Google Cloud service account, quota is 200 URL submissions per day per project, and Google's own docs say it's primarily intended for pages with JobPosting or BroadcastEvent structured data. In practice it will take other URLs, but those still get crawled and ranked through the normal channels. Useful, but limited.
Together they cover Bing, Yandex, Naver, Seznam, and Google, plus whatever else consumes the shared IndexNow feed.
The architecture
┌────────────────────┐     every 6h    ┌────────────────────┐
│ systemd timer      │ ──────────────► │ diff worker        │
└────────────────────┘                 │ - fetch sitemaps   │
                                       │ - compare to disk  │
                                       │ - emit diff        │
                                       └─────────┬──────────┘
                                                 │ new + changed URLs
                                                 ▼
                                       ┌────────────────────┐
                                       │ submitter          │
                                       │ ┌──IndexNow───┐    │
                                       │ └──Google API─┘    │
                                       └────────────────────┘
No queue, no database. The diff worker reads last-snapshot.json from disk, fetches every sitemap, computes the new and changed URL lists, then submits and writes the new snapshot back. If anything fails, the systemd timer fires again in 6 hours and tries again with the same comparison.
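For reference, the snapshot persistence is just one JSON file read and written per run. A minimal sketch, with the path and on-disk shape as assumptions rather than the repo's exact format; the only wrinkle is that Sets don't survive JSON.stringify, so the URL sets are stored as arrays on disk:

import { readFile, writeFile } from "node:fs/promises";

// Assumed path, not the repo's actual location.
const SNAPSHOT_PATH = "/var/lib/astroworld-indexer/last-snapshot.json";

// Same shape as the Snapshot type used in the diff section below.
type Snapshot = Record<string, { urls: Set<string>; etag?: string }>;

async function loadSnapshot(): Promise<Snapshot> {
  try {
    const raw = JSON.parse(await readFile(SNAPSHOT_PATH, "utf8"));
    const snapshot: Snapshot = {};
    for (const [origin, entry] of Object.entries<any>(raw)) {
      snapshot[origin] = { urls: new Set(entry.urls), etag: entry.etag };
    }
    return snapshot;
  } catch {
    // First run (or unreadable file): empty snapshot, so everything counts as new.
    return {};
  }
}

async function saveSnapshot(snapshot: Snapshot) {
  const plain = Object.fromEntries(
    Object.entries(snapshot).map(([origin, entry]) => [
      origin,
      { urls: [...entry.urls], etag: entry.etag },
    ])
  );
  await writeFile(SNAPSHOT_PATH, JSON.stringify(plain, null, 2));
}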
The sitemap discovery loop
const SITES = [
  "https://astroworldmc.com",
  "https://api.astroworldmc.com",
  "https://mobs.astroworldmc.com",
  // ...30 total
];

async function fetchSitemap(origin: string): Promise<string[]> {
  const url = `${origin}/sitemap.xml`;
  const res = await fetch(url, {
    headers: { "User-Agent": "AstroworldIndexer/1.0" },
    signal: AbortSignal.timeout(20_000),
  });
  if (!res.ok) {
    console.warn(`[indexer] ${origin} sitemap returned ${res.status}`);
    return [];
  }
  const xml = await res.text();
  return [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);
}
A few notes from running this in production for a couple of months.
The AbortSignal.timeout is not optional. Without it, one slow origin (or, more often, one origin you forgot to renew DNS for) blocks the entire run. The default fetch timeout in Node 20 is effectively infinite. 20 seconds is plenty for a small sitemap.
Some sites return an index sitemap that points to per-section sitemaps. I added a second pass:
async function expandSitemap(origin: string): Promise<string[]> {
  const top = await fetchSitemap(origin);
  // Heuristic: if the entries are themselves .xml sitemap files, treat the
  // top-level file as a sitemap index and fetch each child.
  const isIndex = top.some((u) => u.includes("sitemap") && u.endsWith(".xml"));
  if (!isIndex) return top;
  const all: string[] = [];
  for (const childUrl of top) {
    const res = await fetch(childUrl, { signal: AbortSignal.timeout(20_000) });
    if (!res.ok) continue;
    const xml = await res.text();
    all.push(...[...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]));
  }
  return all;
}
You can do this concurrently with Promise.all if your sitemaps are big. Mine aren't, so the serial fetch is fine and easier to reason about.
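For completeness, here's a sketch of what the concurrent version would look like. Promise.allSettled keeps one broken child sitemap from failing the whole origin:

async function expandSitemapConcurrent(origin: string): Promise<string[]> {
  const top = await fetchSitemap(origin);
  const isIndex = top.some((u) => u.includes("sitemap") && u.endsWith(".xml"));
  if (!isIndex) return top;

  // Fetch every child sitemap in parallel; a failed child just contributes nothing.
  const results = await Promise.allSettled(
    top.map(async (childUrl) => {
      const res = await fetch(childUrl, { signal: AbortSignal.timeout(20_000) });
      if (!res.ok) return [] as string[];
      const xml = await res.text();
      return [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);
    })
  );
  return results.flatMap((r) => (r.status === "fulfilled" ? r.value : []));
}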
The diff
type Snapshot = Record<string, { urls: Set<string>; etag?: string }>;

async function diff(prev: Snapshot, current: Snapshot) {
  const newUrls: string[] = [];
  const changedUrls: string[] = [];
  for (const [origin, { urls }] of Object.entries(current)) {
    const prevUrls = prev[origin]?.urls ?? new Set();
    for (const url of urls) {
      if (!prevUrls.has(url)) newUrls.push(url);
    }
  }
  // Changed: URLs whose lastmod differs. Easy if you parsed lastmod;
  // I skip this because most of my sitemaps don't carry stable lastmods.
  return { newUrls, changedUrls };
}
I deliberately do not detect "changed" URLs through content hashing. Fetching every URL on every site every 6 hours to compute a hash would be wasteful. If a page changes meaningfully, I update the sitemap's <lastmod> and re-submit through the same path. Most of the value is in catching genuinely new URLs anyway.
Submitting to IndexNow
const INDEXNOW_KEY = process.env.INDEXNOW_KEY!;

async function submitToIndexNow(host: string, urls: string[]) {
  const res = await fetch("https://api.indexnow.org/indexnow", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      host,
      key: INDEXNOW_KEY,
      keyLocation: `https://${host}/${INDEXNOW_KEY}.txt`,
      urlList: urls,
    }),
  });
  // 200 OK and 202 Accepted are both healthy responses; 202 just means
  // key validation is still pending on the search engine's side.
  if (res.status === 200 || res.status === 202) {
    console.log(`[indexnow] ${host}: ${urls.length} accepted`);
    return { ok: true, count: urls.length };
  }
  console.warn(`[indexnow] ${host}: ${res.status} — ${await res.text()}`);
  return { ok: false };
}
You submit one batch per host, up to 10,000 URLs per request. 200 OK means the submission was processed and 202 Accepted means the URLs were received but key validation is still pending; both are healthy responses. 403 means the key file couldn't be validated, 422 means the URLs don't belong to the host (or don't match the key), and 429 means too many requests; I haven't hit that, but it seems to kick in somewhere around 10,000 URLs a minute.
The key file is what proves you own the host. If you omit keyLocation, the engines look for https://{host}/{key}.txt at the domain root; either way the file at that path needs to literally contain the key string, nothing else, served as text/plain.
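Because a missing or mismatched key file only shows up later as a 403 or 422 on submission, it's worth verifying up front that every host actually serves the file. A small check I'd bolt on (not part of the worker as described above):

// Verify that a host serves its IndexNow key file correctly before submitting.
// Returns false on a missing file, wrong contents, or an unreachable host.
async function verifyKeyFile(host: string): Promise<boolean> {
  try {
    const res = await fetch(`https://${host}/${INDEXNOW_KEY}.txt`, {
      signal: AbortSignal.timeout(10_000),
    });
    if (!res.ok) return false;
    return (await res.text()).trim() === INDEXNOW_KEY;
  } catch {
    return false;
  }
}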
Submitting to Google Indexing API
The Google side is more involved because of OAuth.
import { google } from "googleapis";

const auth = new google.auth.GoogleAuth({
  keyFile: "/etc/astroworld/google-indexing.json",
  scopes: ["https://www.googleapis.com/auth/indexing"],
});

async function submitToGoogle(url: string, type: "URL_UPDATED" | "URL_DELETED") {
  const client = await auth.getClient();
  const indexing = google.indexing({ version: "v3", auth: client });
  await indexing.urlNotifications.publish({
    requestBody: { url, type },
  });
}
Two things to flag.
The service account needs Search Console ownership. You add the service account's email as a delegated owner inside Google Search Console for each property. Without this you get 403 Permission denied. The error message helpfully suggests you visit a URL that doesn't address the issue.
Daily quota is 200 URLs. If your first run has 2,000 new URLs, only the first 200 go out through this path. That's fine: the rest go through IndexNow and the regular Googlebot crawl. The quota resets at midnight Pacific time.
I wrap it like this:
const GOOGLE_QUOTA = 200;

async function submitGoogleBatch(urls: string[]) {
  let used = 0;
  for (const url of urls.slice(0, GOOGLE_QUOTA)) {
    try {
      await submitToGoogle(url, "URL_UPDATED");
      used++;
    } catch (err: any) {
      if (err.code === 429) {
        console.warn("[google] quota exhausted");
        break;
      }
      console.warn(`[google] ${url}: ${err.message}`);
    }
  }
  console.log(`[google] ${used}/${urls.length} submitted`);
  return used;
}
I do not retry. If a single URL fails I'd rather skip it and move on — the goal is to get most URLs indexed quickly, not every URL perfectly.
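For orientation, here is roughly how the pieces fit together in one run. This is a sketch of the glue, not the repo's exact code: loadSnapshot and saveSnapshot stand in for the JSON read/write described earlier, and grouping new URLs by host before the IndexNow call is my assumption about how the batches are built.

// One full run: load the previous snapshot, rebuild the current one from the
// sitemaps, diff, submit, persist. A sketch of the glue, not the repo's code.
async function run() {
  const prev = await loadSnapshot();
  const current: Snapshot = {};
  for (const origin of SITES) {
    current[origin] = { urls: new Set(await expandSitemap(origin)) };
  }

  const { newUrls } = await diff(prev, current);
  if (newUrls.length === 0) {
    console.log("[indexer] nothing new this run");
  } else {
    // IndexNow takes one batch per host, so group the new URLs first.
    const byHost = new Map<string, string[]>();
    for (const url of newUrls) {
      const host = new URL(url).host;
      byHost.set(host, [...(byHost.get(host) ?? []), url]);
    }
    for (const [host, urls] of byHost) {
      await submitToIndexNow(host, urls);
    }
    // Google gets at most the first 200 (the daily quota); the rest rely on
    // IndexNow and the normal crawl.
    await submitGoogleBatch(newUrls);
  }

  await saveSnapshot(current);
}

run().catch((err) => {
  console.error("[indexer] run failed:", err);
  process.exit(1); // the systemd timer retries on the next tick
});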
Real numbers
On the first run after the worker shipped:
- 2,434 URLs total discovered across 30 sites
- 1,730 URLs net-new (the rest were already in the previous snapshot; and since a true first run starts from an empty snapshot, "net-new" really means "first time the system saw them")
- 200 went through Google Indexing API in the first 60 seconds
- The remaining 1,530 went through IndexNow
- Bing started showing those URLs in their index ~36 hours later
- Google picked up the long tail over the next week
Per 6-hour run since then, typically 10-40 new URLs (new mob pages, new guide articles, new "where to find" landing pages).
What I'd build differently
Persist the snapshot in a tiny SQLite file, not a JSON blob. With JSON I rewrite the whole thing on every run. With SQLite I'd do incremental updates and survive a crash mid-write. Not a real problem at my scale but it's the kind of thing that bites later.
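For what it's worth, the SQLite version isn't much code. A sketch using better-sqlite3 (my choice for the sketch, not something the repo uses today), where a table of seen URLs replaces the JSON snapshot:

import Database from "better-sqlite3";

// Assumed path; the point is the shape, not the location.
const db = new Database("/var/lib/astroworld-indexer/snapshot.db");
db.exec(`CREATE TABLE IF NOT EXISTS seen_urls (
  origin TEXT NOT NULL,
  url    TEXT NOT NULL,
  PRIMARY KEY (origin, url)
)`);

const insertUrl = db.prepare(
  "INSERT OR IGNORE INTO seen_urls (origin, url) VALUES (?, ?)"
);
const isKnown = db.prepare("SELECT 1 FROM seen_urls WHERE origin = ? AND url = ?");

// Returns the URLs not seen before and records them in one transaction,
// so a crash mid-run leaves a partial but still valid snapshot.
function recordAndDetectNew(origin: string, urls: string[]): string[] {
  const fresh = urls.filter((u) => !isKnown.get(origin, u));
  const tx = db.transaction((batch: string[]) => {
    for (const u of batch) insertUrl.run(origin, u);
  });
  tx(fresh);
  return fresh;
}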
Submit changed URLs (with lastmod parsing), not only new URLs. Right now if you edit a page without changing the URL, the indexer doesn't notice. I want to add lastmod diffing.
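A sketch of what that lastmod diffing could look like, assuming the sitemaps put a <lastmod> right after each <loc> (mine mostly don't yet, which is the real blocker):

// Parse <loc> plus optional <lastmod> pairs out of a sitemap.
function parseUrlEntries(xml: string): Map<string, string | undefined> {
  const entries = new Map<string, string | undefined>();
  const pattern = /<url>\s*<loc>([^<]+)<\/loc>(?:\s*<lastmod>([^<]+)<\/lastmod>)?/g;
  for (const m of xml.matchAll(pattern)) {
    entries.set(m[1], m[2]);
  }
  return entries;
}

// A URL counts as changed when it existed before and its lastmod moved.
function changedSince(
  prev: Map<string, string | undefined>,
  current: Map<string, string | undefined>
): string[] {
  const changed: string[] = [];
  for (const [url, lastmod] of current) {
    const before = prev.get(url);
    if (before !== undefined && lastmod !== undefined && lastmod !== before) {
      changed.push(url);
    }
  }
  return changed;
}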
Health endpoint that reports drift. If a sitemap suddenly drops 90% of its URLs, that's almost certainly a deploy bug, not a real drop. The indexer should refuse to submit a delete-everything event and page me instead.
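The drift check itself is small; the interesting part is wiring it to alerting instead of submission. A sketch with arbitrary numbers (the 50% threshold and the 50-URL floor are guesses, not tuned values):

// Refuse to trust a sitemap that suddenly lost most of its URLs; that is
// almost always a deploy bug, not a real content drop.
function looksLikeDeployBug(prevCount: number, currentCount: number): boolean {
  if (prevCount < 50) return false; // tiny sites fluctuate legitimately
  return currentCount < prevCount * 0.5;
}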
Try it
The full service is at github.com/astroworld-mc (the indexing worker will be opened up shortly). The API it indexes lives at api.astroworldmc.com and you can poke at it without an account.
If you run a multi-site stack and Google takes a week to discover your new pages, you don't need a CDN, you need IndexNow. It's two endpoints and a key file.
Built by Astroworld. All the supporting open-source repos live at github.com/astroworld-mc. If you want server hosting that handles this kind of small-but-busy stack well, Astroworld Hosting.