Last November, Google quietly dropped almost every page of my site from its index. I went from having 200+ pages crawled to 1 page indexed — the homepage. It took me weeks to figure out what happened, and months of work to fix it.
This is the full story.
The Product
VoiceToTextOnline.com is a browser-based voice-to-text tool. It uses the Web Speech API — no server processing, no audio stored, nothing uploaded. You open the page, click a button, and speak. The browser does all the recognition locally.
It also supports file transcription (upload an MP3/WAV/MP4, get a transcript with speaker labels and SRT export) and text-to-speech via Google Cloud neural voices.
I launched it in early 2025 as a solo founder. By mid-2025 it was getting 150-200 daily visitors, mostly from Bing and AI referrers like ChatGPT and Doubao (ByteDance's AI assistant). One paying Pro subscriber.
Then the GSC numbers started looking strange.
What Happened
I had made three mistakes simultaneously, and they compounded each other.
Mistake 1: www vs non-www split
At some point my deployment config started serving the site on both www.voicetotextonline.com and voicetotextonline.com without a proper canonical redirect. Google was seeing duplicate versions of every page and splitting crawl budget between them.
Mistake 2: Middleware blocking Googlebot
I had Next.js middleware handling auth redirects. The middleware was checking session state on every request — including requests from Googlebot. In certain conditions it was returning redirect responses to the crawler instead of the actual page content.
```typescript
// The problematic pattern — middleware firing on all routes
import { NextRequest, NextResponse } from 'next/server'

export function middleware(request: NextRequest) {
  // This was intercepting Googlebot requests too
  const session = getSession(request)
  if (!session) {
    // Middleware redirects need an absolute URL
    return NextResponse.redirect(new URL('/auth', request.url))
  }
}
```
Googlebot would hit a page, get redirected, follow the redirect, potentially get redirected again. Some pages were returning 308 chains that never resolved cleanly.
Mistake 3: Sitemap pollution
My sitemap.ts was including URLs from non-Phase-1 pages that I had intentionally set to noindex via HTTP headers. So the sitemap was advertising pages that returned X-Robots-Tag: noindex — a contradiction that confused Googlebot's crawl prioritisation.
The combination of all three meant Google had spent months crawling a confusing, inconsistent site — duplicate URLs, broken redirects, noindex signals fighting sitemap inclusions. Its response was to stop trusting the site almost entirely.
Result: 220 pages crawled, 1 indexed.
The Fix (Technical)
I combined all three fixes into a single commit in early March 2026.
1. Force non-www canonical redirect
// middleware.ts — added before any other logic
const host = request.headers.get('host') || ''
if (host.startsWith('www.')) {
const url = request.nextUrl.clone()
url.host = host.replace('www.', '')
return NextResponse.redirect(url, 301)
}
2. Exclude Googlebot from auth middleware
```typescript
const userAgent = request.headers.get('user-agent') || ''
const isBot = /googlebot|bingbot|slurp|duckduckbot/i.test(userAgent)
if (isBot) {
  return NextResponse.next() // Let crawlers through unconditionally
}
```
3. Rebuild the sitemap to only include indexable pages
Removed all pages with noindex headers from sitemap.ts. The sitemap now only advertises pages that return 200 with no noindex signal.
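A minimal sketch of what that rebuilt sitemap looks like, assuming a hand-maintained list of indexable routes (the path list and the local `SitemapEntry` type are illustrative; in a real App Router project the function lives in `app/sitemap.ts` and returns `MetadataRoute.Sitemap`):

```typescript
// Shape matches Next.js's MetadataRoute.Sitemap entries.
type SitemapEntry = { url: string; lastModified: Date }

const BASE = 'https://voicetotextonline.com'

// Only routes that return 200 with no noindex signal belong here.
// Hypothetical path list for illustration.
const indexablePaths = ['/', '/languages', '/hindi-voice-to-text']

export default function sitemap(): SitemapEntry[] {
  return indexablePaths.map((path) => ({
    url: `${BASE}${path}`,
    lastModified: new Date(),
  }))
}
```

Keeping the list explicit (rather than globbing every route) makes it much harder to accidentally re-advertise a noindexed page.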
I submitted a GSC validation request on March 17. It failed on March 21. The crawled-but-not-indexed count was still at 91 pages.
This is where the real problem became clear.
The Real Problem: Thin Content at Scale
Even with the technical issues fixed, Google wasn't indexing the language pages. And looking at them honestly, I understood why.
I had 43 language-specific pages — one for each language the tool supports. Hindi voice to text, Arabic voice to text, Japanese voice to text, and so on. Each page was built from the same template:
- Hero with the language name
- 4-step how-to
- Voice commands table
- "Why Choose Us" 4-card section (identical on every page)
- FAQ
- CTA
The language name was swapped in. The content was largely identical. To Google's quality systems, this looked like 43 copies of the same page — thin content at scale.
The site:voicetotextonline.com search confirmed it. Only the homepage appeared.
The Content Fix: 43 Unique Pages
I spent several weeks rewriting all 43 language pages from scratch. The rule was simple: each page had to say something true and specific about that language that no other page on the site said.
This required actual research into the linguistics, diaspora communities, and cultural context of each language. A few examples of what "genuinely unique" ended up meaning:
Lithuanian — Lithuanian is considered the oldest surviving Indo-European language, more archaic than Latin or Sanskrit. Real Lithuanian-Sanskrit cognates exist and are fascinating: avis (sheep) in Lithuanian vs áviḥ in Sanskrit, sūnus (son) vs sūnúḥ. The page leads with this and shows a comparison table. The UK/Ireland diaspora section covers the 200,000+ Lithuanians who moved after EU accession in 2004 and face keyboard incompatibility with the 9 special Lithuanian characters.
Slovenian — One of the few living languages with a grammatical dual number — a distinct form for exactly two of something. The three-card visual showing ednina (singular) / dvojina (dual) / množina (plural) with real examples is content no competitor page has.
Catalan — 10 million speakers but no EU official status, despite having more native speakers than Irish, Maltese, and Luxembourgish combined. The comparison table makes this concrete. The l·l character (ela geminada) exists only in Catalan and is impossible to type on any standard keyboard — the page explains this and positions voice typing as the solution.
Bulgarian — The Cyrillic alphabet was created in Bulgaria in the 9th century. Every other Cyrillic language page can claim the script as their writing system. Only Bulgarian can claim they gave it to the world. The three-card section (9th century / 250M+ users / 50+ languages) is genuine and searchable.
Afrikaans — The youngest natural language in the world (standardised 1925) and one of the most accurately recognised by Web Speech API, because it has no special characters and writes almost exactly as it sounds. The comparison table shows why Afrikaans outperforms German, Danish, and Dutch for voice recognition accuracy.
After 43 rewrites, every page has:
- A unique lead angle based on genuine linguistics
- A named culturally-specific section (diaspora keyboards, minority status, historical alphabet events)
- A sample output block showing real text in the target script
- 5 language-specific FAQs
The Internal Linking Gap
While doing this I discovered another issue: every language page's breadcrumb pointed to /#languages — an anchor that didn't exist on the homepage. 43 pages with broken breadcrumb links.
The fix was building a proper /languages index page linking to all 43 pages, grouped by region, with flag, native script, and speaker count for each. This also gave Google a clear crawl path: homepage → /languages → all 43 language pages.
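The index page itself is mostly data plus a grouping step. A sketch of that grouping under assumed data (the entries and field names here are made up for illustration, not the site's actual dataset):

```typescript
type Language = { name: string; region: string; speakers: string; path: string }

// Illustrative entries only
const languages: Language[] = [
  { name: 'Lithuanian', region: 'Europe', speakers: '3M', path: '/lithuanian-voice-to-text' },
  { name: 'Slovenian', region: 'Europe', speakers: '2.5M', path: '/slovenian-voice-to-text' },
  { name: 'Hindi', region: 'Asia', speakers: '600M+', path: '/hindi-voice-to-text' },
]

// Group pages by region so the index renders one section per region,
// giving crawlers a clean homepage → /languages → language-page path.
function groupByRegion(langs: Language[]): Map<string, Language[]> {
  const groups = new Map<string, Language[]>()
  for (const lang of langs) {
    const bucket = groups.get(lang.region) ?? []
    bucket.push(lang)
    groups.set(lang.region, bucket)
  }
  return groups
}
```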
Current Status
- 10 rewritten language pages submitted to GSC via URL Inspection
- The /languages index page deployed and submitted
- GSC still shows 1 indexed page (the homepage)
- 91 pages in "Crawled - currently not indexed" with failed validation
- 124 pages in "Discovered - currently not indexed" with passed validation
The 124 "Discovered" pages are the hopeful signal — Google has them queued. The 91 failed pages are the harder problem: Google formed its quality judgment when the content was thin, and that judgment is sticky even after the content improves.
The honest timeline for recovery is 3-6 months from the content fix. The technical fixes were necessary but not sufficient. The content rewrites were necessary but not sufficient. What's likely needed next is external backlinks — at this domain trust level, Google needs signals from other sites before it will fully re-evaluate.
Lessons
1. www/non-www is not a minor detail. It's a canonical URL decision that affects every page on your site. Pick one and enforce it with a 301 at the infrastructure level, not just in meta tags.
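If you're on Next.js, one way to enforce this closer to the infrastructure level is a host-based redirect in the config rather than in middleware. A sketch assuming this site's domain (host matching via `has` is a standard feature of Next.js `redirects`; `permanent: true` makes it a permanent redirect):

```typescript
// next.config.ts — permanent www → apex redirect for every path
import type { NextConfig } from 'next'

const nextConfig: NextConfig = {
  async redirects() {
    return [
      {
        source: '/:path*',
        has: [{ type: 'host', value: 'www.voicetotextonline.com' }],
        destination: 'https://voicetotextonline.com/:path*',
        permanent: true,
      },
    ]
  },
}

export default nextConfig
```

Even better is doing it at the DNS/CDN layer (e.g. a host-level redirect rule at your provider), so the canonical host is enforced before a request ever reaches the app.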
2. Middleware that touches all routes will touch Googlebot. Any redirect logic in Next.js middleware needs a bot exemption or it will silently confuse crawlers.
3. Sitemaps and robots signals must be consistent. If a page is in your sitemap, it should return 200 with no noindex. If it's noindexed, remove it from the sitemap.
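That consistency rule is mechanical enough to check in CI. A tiny helper, given a page's status code and its X-Robots-Tag header (the function name is mine, not a real API):

```typescript
// A URL belongs in the sitemap only if it returns 200
// and carries no noindex signal in its X-Robots-Tag header.
function belongsInSitemap(status: number, xRobotsTag: string | null): boolean {
  if (status !== 200) return false
  if (xRobotsTag && /noindex/i.test(xRobotsTag)) return false
  return true
}
```

Running every sitemap URL through a check like this after each deploy would have caught the noindex/sitemap contradiction immediately.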
4. Template pages at scale look like duplicate content. 43 pages with the same structure and swapped keywords is not 43 unique pages. Each page needs to say something the others don't.
5. Google's quality judgment is sticky. Fixing the content doesn't immediately fix the index. Google cached its assessment when the pages were bad. Re-earning trust takes time and likely external validation.
If you're building a multi-language tool, I hope the technical details here save you some pain.
The tool is at voicetotextonline.com if you want to see what the rewritten pages look like. The Lithuanian and Slovenian pages are probably the most interesting from a linguistics angle.
Solo founder, building in public from Bangkok. This is month 13 with zero revenue. The goal is $5K MRR. Current status: 1 paying subscriber.