Solved: Facing Indexing Issues for 4 Months — Only Homepage Indexed.

Darian Vance

🚀 Executive Summary

TL;DR: Widespread indexing issues, where only the homepage is indexed, often stem from client-side rendering challenges, misconfigured robots.txt/sitemaps, or poor canonicalization and internal linking. Resolving this requires implementing Server-Side Rendering (SSR) or Static Site Generation (SSG) for dynamic content, meticulously auditing and optimizing crawler directives, and establishing a robust internal linking structure to ensure all valuable content is discoverable by search engines.

🎯 Key Takeaways

  • Google Search Console (GSC) reports (Coverage, Sitemaps, URL Inspection) are critical diagnostic tools, revealing ‘Excluded,’ ‘Crawled – currently not indexed,’ or ‘Blocked by robots.txt’ statuses.
  • Modern SPAs relying on Client-Side Rendering (CSR) often struggle with bot indexing; implementing Server-Side Rendering (SSR) or Static Site Generation (SSG) ensures fully rendered HTML is delivered to crawlers.
  • Misconfigurations in robots.txt can inadvertently block crucial content, while an incomplete or outdated sitemap.xml hinders efficient page discovery, both severely impacting site-wide indexing.
  • Proper rel="canonical" tags are essential to consolidate link equity and prevent duplicate content issues, complemented by a strong internal linking structure to guide bots and distribute PageRank, avoiding ‘orphan pages’.

Struggling with search engine indexing where only your homepage gets listed? This guide dissects common causes for widespread indexing issues and provides actionable DevOps strategies to ensure all your site’s valuable content is discoverable.

Symptoms of Limited Indexing

When only your homepage is being indexed by search engines, it’s a critical issue that starves your valuable content of organic traffic. As a DevOps professional, you’re uniquely positioned to diagnose and resolve these underlying technical problems. The symptoms typically manifest in several key areas:

  • Google Search Console (GSC) Reports:
    • Coverage Report: You’ll see a high number of “Excluded” or “Crawled – currently not indexed” pages. “Discovered – currently not indexed” is also a red flag, indicating Google knows about the pages but isn’t prioritizing them for indexing.
    • Sitemaps Report: The submitted sitemap shows a high number of URLs discovered, but a low number actually indexed.
    • URL Inspection Tool: When inspecting specific non-indexed pages, you might see “Crawled – currently not indexed,” “Discovered – currently not indexed,” or “Blocked by robots.txt.”
  • “site:yourdomain.com” Search Operator: Performing a site:yourdomain.com search in Google reveals only a handful of pages, predominantly the homepage and perhaps a few top-level sections, despite your site having hundreds or thousands of pages.
  • Organic Traffic Declines: A significant drop or stagnation in organic traffic, particularly to deep-dive content, blog posts, or product pages, indicates these pages aren’t visible in search results.
  • Slow Indexing of New Content: New pages take an exceptionally long time to appear in search results, or simply never do, even after being linked internally.

Solution 1: Addressing Server-Side Rendering (SSR) & Pre-rendering Challenges

Modern web applications, especially Single-Page Applications (SPAs) built with frameworks like React, Angular, or Vue.js, often rely heavily on client-side JavaScript to render content. While excellent for user experience, this presents a significant challenge for search engine crawlers, which may not fully execute JavaScript or wait for dynamic content to load. This can result in search engines only seeing a blank or incomplete page, leading to indexing issues.

The Problem with Client-Side Rendering (CSR) for Bots

Search engine bots (like Googlebot) are becoming more sophisticated, but they still don’t always behave exactly like a full browser. When a bot hits a purely client-side rendered page:

  • It might download the initial HTML, which often contains little more than a <div id="root"></div>.
  • It may or may not execute the JavaScript needed to fetch data and render the full content. Even if it does, it might be resource-intensive, leading to lower crawl rates or incomplete indexing.
  • Crucially, it might not wait long enough for all asynchronous data fetches and rendering to complete.
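
A quick way to confirm whether this is happening is to fetch a page the way a crawler does and inspect the raw HTML before any JavaScript executes. The sketch below is a minimal check in plain Node.js (18+, using the built-in fetch); the URL is a placeholder and the 200-character threshold is an arbitrary heuristic, not a Google rule. It does not replicate Google's rendering pipeline, but it shows exactly what ships before JavaScript runs.

// check-rendered-html.js: minimal sketch, assumes Node.js 18+ (built-in global fetch)
// Fetches a page with a Googlebot user agent and reports how much real markup comes back.
const GOOGLEBOT_UA =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

async function checkRawHtml(url) {
  const res = await fetch(url, { headers: { 'User-Agent': GOOGLEBOT_UA } });
  const html = await res.text();

  // Strip scripts, styles, and tags to estimate how much visible text ships in the initial HTML.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  console.log(`${url} -> HTTP ${res.status}, ~${visibleText.length} chars of visible text`);
  if (visibleText.length < 200) {
    console.warn('Initial HTML looks like an empty app shell; crawlers may see little or no content.');
  }
}

checkRawHtml('https://www.yourdomain.com/some-deep-page').catch(console.error);

You can compare this output with the rendered HTML shown under “View crawled page” in GSC’s URL Inspection tool to see how much content Googlebot actually recovers after rendering.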

Implementing SSR or Pre-rendering

To overcome this, DevOps teams can implement Server-Side Rendering (SSR) or Pre-rendering strategies.

Server-Side Rendering (SSR)

SSR involves rendering the full HTML for a page on the server for each request. This means the browser (and the search engine bot) receives a fully formed HTML document with all content already present, eliminating the need for client-side JavaScript to build the initial view.

Example (Next.js – React Framework):

In a Next.js application, you can use getServerSideProps to fetch data and render a page on the server for every incoming request.

// pages/products/[id].js
import Head from 'next/head';

function ProductPage({ product }) {
  if (!product) {
    return <p>Product not found.</p>;
  }
  return (
    <div>
      <Head>
        <title>{product.name} - My Store</title>
        <meta name="description" content={product.description.substring(0, 150)} />
      </Head>
      <h1>{product.name}</h1>
      <p>{product.description}</p>
      <p>Price: ${product.price}</p>
    </div>
  );
}

export async function getServerSideProps(context) {
  const { id } = context.params;
  // Simulate fetching data from an API
  const res = await fetch(`https://api.example.com/products/${id}`);

  if (!res.ok) {
    // Treat API errors (e.g., 404) as a missing product and render the 404 page
    return {
      notFound: true,
    };
  }

  const product = await res.json();

  return {
    props: { product }, // Will be passed to the page component as props
  };
}

export default ProductPage;

DevOps Configuration for SSR:

  • Deployment: SSR applications typically require a Node.js server environment (e.g., EC2, Google Cloud Run, Vercel, Netlify) to execute the server-side code.
  • Scalability: Implement load balancing and auto-scaling groups to handle increased server load during peak traffic.
  • Monitoring: Monitor server performance (CPU, memory, response times) and implement caching strategies (CDN, server-side caching) to optimize delivery; one such pattern, setting cache headers from getServerSideProps, is sketched after this list.
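
As an example of server-side caching, Next.js allows getServerSideProps to set a Cache-Control header on the response so that a CDN or reverse proxy can briefly cache the rendered HTML. A minimal sketch follows; the durations are illustrative, not recommendations.

// pages/products/[id].js (excerpt): caching sketch for SSR responses
export async function getServerSideProps({ params, res }) {
  // Let a CDN cache the rendered page for 60s and serve a stale copy
  // for up to 5 minutes while it revalidates in the background.
  res.setHeader(
    'Cache-Control',
    'public, s-maxage=60, stale-while-revalidate=300'
  );

  const apiRes = await fetch(`https://api.example.com/products/${params.id}`);
  if (!apiRes.ok) {
    return { notFound: true };
  }
  const product = await apiRes.json();

  return { props: { product } };
}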

Pre-rendering (Static Site Generation – SSG)

Pre-rendering, particularly Static Site Generation (SSG), involves generating all possible HTML pages at build time. These static HTML files are then served directly from a CDN, offering maximum performance and SEO benefits. This is ideal for content that doesn’t change frequently.

Example (Next.js – React Framework):

Using getStaticProps and getStaticPaths to generate pages at build time.

// pages/blog/[slug].js
import Head from 'next/head';

function BlogPost({ post }) {
  return (
    <div>
      <Head>
        <title>{post.title} - My Blog</title>
        <meta name="description" content={post.excerpt} />
      </Head>
      <h1>{post.title}</h1>
      <div dangerouslySetInnerHTML={{ __html: post.content }} />
    </div>
  );
}

export async function getStaticPaths() {
  // Fetch all possible blog post slugs
  const res = await fetch('https://api.example.com/blog/posts');
  const posts = await res.json();

  const paths = posts.map((post) => ({
    params: { slug: post.slug },
  }));

  return { paths, fallback: false }; // 'fallback: false' means paths not returned will 404
}

export async function getStaticProps({ params }) {
  // Fetch data for a specific post
  const res = await fetch(`https://api.example.com/blog/posts/${params.slug}`);
  const post = await res.json();

  return {
    props: { post },
  };
}

export default BlogPost;

DevOps Configuration for SSG:

  • Build Process: Integrate the SSG build step into your CI/CD pipeline. Triggers for rebuilds should include content updates (e.g., webhook from a CMS).
  • Deployment: Deploy the generated static HTML, CSS, and JS files to a CDN (e.g., AWS S3 + CloudFront, Google Cloud Storage + CDN, Cloudflare Pages, Netlify, Vercel).
  • Cache Invalidation: Implement efficient cache invalidation strategies for the CDN when content is updated; a scripted example follows this list.
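
For an AWS-hosted SSG site (S3 + CloudFront), invalidation can be scripted as a post-deploy step. Below is a minimal sketch using the AWS SDK v3 CloudFront client; the CF_DISTRIBUTION_ID environment variable and the invalidated paths are assumptions for illustration, not part of any standard setup.

// invalidate-cdn.mjs: minimal post-deploy sketch using @aws-sdk/client-cloudfront (SDK v3)
import {
  CloudFrontClient,
  CreateInvalidationCommand,
} from '@aws-sdk/client-cloudfront';

const client = new CloudFrontClient({ region: 'us-east-1' });

async function invalidatePaths(distributionId, paths) {
  const command = new CreateInvalidationCommand({
    DistributionId: distributionId,
    InvalidationBatch: {
      CallerReference: `deploy-${Date.now()}`, // must be unique per invalidation
      Paths: { Quantity: paths.length, Items: paths },
    },
  });
  const result = await client.send(command);
  console.log('Invalidation created:', result.Invalidation?.Id);
}

// Hypothetical usage: clear the blog section and sitemap after a content rebuild.
await invalidatePaths(process.env.CF_DISTRIBUTION_ID, ['/blog/*', '/sitemap.xml']);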

Comparison: SSR vs. Pre-rendering (SSG)

  • Content Freshness: SSR is real-time (content is always up-to-date on each request); SSG is only as fresh as the last build (updates require a rebuild).
  • Performance: SSR offers a good initial load but depends on server response time; SSG offers an excellent initial load (pre-built HTML served from a CDN).
  • SEO Friendliness: both are excellent, since fully rendered HTML is delivered to bots.
  • Complexity: SSR requires a more complex server setup and resource management; SSG has a simpler deployment (static files) but a more complex build process for large sites.
  • Use Cases: SSR suits dynamic, frequently changing content (e-commerce, user-specific dashboards); SSG suits static content, blogs, marketing sites, and documentation.
  • Hosting: SSR needs a Node.js server environment; SSG needs only a CDN/object storage (S3, GCS).

Solution 2: Optimizing Robots.txt and Sitemap Configuration

A common culprit for unindexed pages lies in how you communicate with search engine crawlers. Misconfigurations in your robots.txt file can inadvertently block crucial content, while an outdated or improperly formatted sitemap.xml can prevent bots from efficiently discovering your site’s structure.

Auditing and Correcting robots.txt

The robots.txt file is a crucial directive file that tells search engine crawlers which parts of your site they can or cannot access. A single misplaced Disallow rule can de-index your entire site.

Common Pitfalls:

  • Accidental Full Site Block: The most dangerous mistake is a general Disallow: / without specific allowances.
  • Blocking Crucial Resources: Blocking CSS, JavaScript, or image directories can prevent Googlebot from properly rendering your page, leading to “mobile-friendly” or “page experience” issues and potential de-indexing.
  • Misunderstanding Wildcards: Incorrect use of * can unintentionally block more than intended.

Checking your robots.txt:

Use curl to inspect your live robots.txt:

curl https://www.yourdomain.com/robots.txt

Example of a healthy robots.txt:

This configuration allows all bots to crawl the entire site and points them to your sitemap.

User-agent: *
Allow: /
Sitemap: https://www.yourdomain.com/sitemap.xml

If you need to block specific directories, ensure they are non-essential for indexing, e.g., admin panels or test environments:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /test-env/
Disallow: /private-docs/
Sitemap: https://www.yourdomain.com/sitemap.xml

After modifying robots.txt, check the robots.txt report in Google Search Console (which replaced the standalone robots.txt Tester) for fetch and parsing errors, and use the URL Inspection tool to confirm that critical URLs are not blocked. Changes can take time to be recognized by crawlers.

Optimizing Your Sitemap.xml

A sitemap.xml file is a roadmap for search engines, listing all the URLs on your site that you want to be indexed. It helps crawlers discover pages they might not find through internal linking alone, especially for large or newly launched sites.

Key Considerations:

  • Completeness: Ensure all indexable pages are listed. For dynamic sites, this often means dynamic sitemap generation.
  • Accuracy: Only include canonical, indexable URLs. Avoid broken links, noindex pages, or redirects.
  • Freshness: The sitemap should be updated regularly, especially when new content is added or existing content is changed.
  • Size Limits: A single sitemap file can contain up to 50,000 URLs and be no larger than 50MB uncompressed. For larger sites, use sitemap index files.

Example of a sitemap.xml entry:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yourdomain.com/about-us</loc>
    <lastmod>2023-10-26T10:00:00+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.yourdomain.com/blog/latest-post</loc>
    <lastmod>2023-11-15T14:30:00+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

DevOps Integration for Sitemaps:

  • Dynamic Generation: Integrate sitemap generation into your application’s backend or CMS. For instance, a nightly cron job can regenerate the sitemap based on current database content; a minimal script is sketched after this list.
  • CI/CD Integration: For static sites, ensure sitemap generation is part of your build pipeline.
  • Automated Submission: While Google often finds sitemaps linked in robots.txt, explicitly submitting your sitemap(s) via the Google Search Console Sitemaps report is a best practice.
  • Monitoring: Regularly check the Sitemaps report in GSC for errors or warnings related to your submitted sitemaps.
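
As a concrete example of dynamic generation, the sketch below assembles a sitemap.xml from whatever your data layer returns. The fetchAllPublishedUrls function and its fields are hypothetical stand-ins for your CMS or database query; run it as an ES module (Node 18+) from a cron job or CI step.

// generate-sitemap.mjs: minimal sketch; fetchAllPublishedUrls() is a hypothetical data-layer call
import { writeFile } from 'node:fs/promises';

const BASE_URL = 'https://www.yourdomain.com';

// Stand-in for a CMS/database query that returns only canonical, indexable paths.
async function fetchAllPublishedUrls() {
  return [
    { path: '/about-us', lastmod: '2023-10-26T10:00:00+00:00' },
    { path: '/blog/latest-post', lastmod: '2023-11-15T14:30:00+00:00' },
  ];
}

function buildSitemap(entries) {
  const urls = entries
    .map(
      ({ path, lastmod }) =>
        `  <url>\n    <loc>${BASE_URL}${path}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </url>`
    )
    .join('\n');

  return `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>\n`;
}

const entries = await fetchAllPublishedUrls();
await writeFile('public/sitemap.xml', buildSitemap(entries));
console.log(`Wrote sitemap with ${entries.length} URLs`);

Keeping this step in the same pipeline that publishes content ensures the sitemap never drifts out of sync with what is actually live.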

Solution 3: Deep Dive into Canonicalization and Internal Linking Structure

Even if bots can access and render your content, issues with how your pages relate to each other can confuse search engines, leading to duplicate content penalties or undervalued pages. Proper canonicalization and a robust internal linking strategy are crucial for guiding bots and distributing “link equity.”

Mastering Canonicalization with rel="canonical"

Duplicate content, whether accidental (e.g., HTTP vs. HTTPS, trailing slash vs. no trailing slash, session IDs in URLs) or intentional (e.g., product pages accessible via multiple categories), can dilute link equity and confuse search engines about which version to index. The rel="canonical" tag tells search engines the preferred version of a page.

How it Works:

The canonical tag is placed in the <head> section of an HTML document, pointing to the canonical (authoritative) version of that page.

Example Implementation:

If you have a product page accessible at both https://www.example.com/products/blue-widget and https://www.example.com/category/widgets/blue-widget, you would place the following tag on the category-specific URL:

<head>
  <!-- Other head elements -->
  <link rel="canonical" href="https://www.example.com/products/blue-widget" />
</head>

This tells Google that while the page content is available at two URLs, the /products/blue-widget URL is the one that should be indexed and receive link equity.

DevOps Configuration Considerations:

  • Automated Canonical Tag Generation: For dynamic content, your CMS or application framework should programmatically generate the correct canonical URL for each page. Ensure this logic handles all URL variations (e.g., query parameters, case sensitivity, trailing slashes); a sketch follows this list.
  • HTTP Headers: For non-HTML documents (e.g., PDFs), or in cases where you want to simplify HTML, you can use the Link HTTP header:
  Link: <https://www.example.com/products/blue-widget>; rel="canonical"

This requires server-level configuration (e.g., Nginx, Apache).

  • Auditing: Implement regular checks (e.g., using a crawler like Screaming Frog or GSC’s URL Inspection tool) to identify canonicalization errors such as canonical tags pointing to non-canonical URLs (redirected or noindexed pages), missing self-referencing canonicals, or canonical chains (A -> B -> C).
  • HTTP/HTTPS and www/non-www Redirections: Ensure 301 redirects are in place to send users and bots to your preferred domain variant (e.g., HTTP to HTTPS, non-www to www or vice-versa). This complements canonical tags but is not a replacement.
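
For the automated canonical generation mentioned above, one common pattern in a Next.js app (used here only because the earlier examples are Next.js) is a small helper that normalizes the current path plus a reusable head component. This is a sketch of the idea, not a drop-in library; the normalization rules should match your site’s actual URL conventions.

// components/CanonicalTag.js: sketch of programmatic canonical generation (Next.js pages router)
import Head from 'next/head';
import { useRouter } from 'next/router';

const SITE_URL = 'https://www.example.com'; // preferred scheme + host

// Normalize URL variations: drop query strings/fragments, lowercase the path, strip trailing slashes.
export function toCanonical(asPath) {
  const path = asPath.split('?')[0].split('#')[0].toLowerCase();
  const trimmed = path !== '/' && path.endsWith('/') ? path.slice(0, -1) : path;
  return `${SITE_URL}${trimmed}`;
}

export default function CanonicalTag() {
  const { asPath } = useRouter();
  return (
    <Head>
      <link rel="canonical" href={toCanonical(asPath)} />
    </Head>
  );
}

Rendering <CanonicalTag /> from a shared layout gives every page a self-referencing canonical by default; pages that are deliberate duplicates can instead render a tag pointing at their preferred URL.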

Strengthening Internal Linking Structure

Internal links are hyperlinks that point to other pages within the same domain. They are crucial for:

  • Content Discovery: Bots follow internal links to find new pages. A page without any internal links is an “orphan page” and is unlikely to be indexed.
  • Link Equity Distribution: Internal links pass “PageRank” (link equity) from stronger pages to weaker ones, helping newer or less authoritative pages to rank.
  • User Navigation: A logical internal linking structure improves user experience, encouraging longer site visits.

Best Practices:

  • Descriptive Anchor Text: Use relevant, keyword-rich anchor text (the visible, clickable text of a link) instead of generic phrases like “click here.”
  • Contextual Links: Link naturally from relevant content. For example, in a blog post about DevOps tools, link to other posts discussing specific tools mentioned.
  • Avoid Orphan Pages: Every important page should be reachable from at least one other page via a direct link.
  • Flat Site Architecture (within reason): While not strictly necessary, keeping important pages “closer” to the homepage (fewer clicks away) can aid discovery.
  • Use Navigation and Breadcrumbs: Ensure your main navigation, footer navigation, and breadcrumbs are well-structured and link to relevant internal pages.

DevOps Tools & Strategies for Internal Linking:

  • Automated Link Checking: Implement CI/CD pipeline steps or scheduled jobs to run link checkers (e.g., linkchecker utility, custom scripts) to identify broken internal links (404s), which waste crawl budget and diminish user experience.
  # Example using 'linkchecker' (install via apt/brew/pip); it follows internal links recursively by default.
  # Use -F/--file-output to write the results to a report file if needed.
  linkchecker --check-extern https://www.yourdomain.com/
  • Content Management Systems (CMS): Most modern CMS platforms (WordPress, Drupal, headless CMS with rich text editors) make it easy for content creators to add internal links. Ensure editors are trained on best practices.
  • API-Driven Navigation: For dynamic applications, ensure your navigation and related content sections are built from APIs that correctly reference the canonical URLs of your content.
  • Site Crawler Tools: Regularly use professional site auditing tools (e.g., Screaming Frog SEO Spider, Ahrefs, SEMrush) to visualize your site’s internal linking structure, identify orphan pages, and pinpoint pages with low internal link counts. These tools help surface the pages that bots struggle to discover; a lightweight scripted check is sketched after this list.
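
To complement crawler tools, a simple script can compare the URLs you expect to be discoverable (your sitemap) against the URLs a crawl actually found. The sketch below assumes a crawl export as a plain text file with one URL per line (most crawlers can produce one); the file names are placeholders. Anything listed in the sitemap but missing from the crawl is an orphan-page candidate.

// find-orphans.mjs: sketch comparing sitemap URLs with a crawl export (one URL per line)
import { readFile } from 'node:fs/promises';

function extractSitemapUrls(xml) {
  // Naive <loc> extraction; fine for a quick audit, not a full XML parser.
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1].trim());
}

const sitemapXml = await readFile('sitemap.xml', 'utf8');
const crawlExport = await readFile('crawled-urls.txt', 'utf8');

const crawled = new Set(
  crawlExport.split('\n').map((line) => line.trim()).filter(Boolean)
);

const orphanCandidates = extractSitemapUrls(sitemapXml).filter(
  (url) => !crawled.has(url)
);

console.log(`${orphanCandidates.length} URL(s) in the sitemap but not found by the crawl:`);
orphanCandidates.forEach((url) => console.log(`  ${url}`));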

By meticulously auditing and optimizing your site’s technical SEO foundation – from rendering strategies to crawler directives and internal linking – you can overcome persistent indexing issues and ensure all your valuable content gets the visibility it deserves.



👉 Read the original article on TechResolve.blog
