How to remove a webpage from the Google index

#it #web #html

It's important to keep in mind that search engines crawl websites periodically, and the intervals vary depending on a number of factors. In general, website owners don't have full control over the behavior of search engines; instead, they can state preferences in the form of instructions. Such instructions can, for example, exclude certain web pages from search results or keep search engines from crawling specific paths. There are two ways to declare these preferences: rules in a robots.txt file in the root of a website, and the HTML <meta> tag "robots" in the <head> block of individual web pages.

I recently needed to move one of my static websites to another domain. It turned out to be a complex task, as I'm not able to change the server-side configuration, and redirecting HTTP requests is only one part of the story. Once all users were being redirected to the new location, I had to start, and speed up, the process of cleaning links to my old website out of the search results.

There are a few common ways to remove web pages from search indexes:

remove a page completely, so clients get a 404 Not Found HTTP response. This is clearly not my case, as the old website still responds with valid, existing web pages
restrict access to a page by requiring credentials, so the server sends a 401 Unauthorized HTTP response. This won't work for me either, as it requires changing the server-side configuration
add an HTML <meta> tag named robots with the value noindex. That's exactly what I needed, and it can be implemented on the client side.

The last method allows setting different preferences per page, right in the HTML code. It also means that search engines must be able to access a page in order to read it and find this instruction, so pages carrying the robots meta tag must not be blocked by a robots.txt file!
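To illustrate what "reading the page to find the instruction" involves, here is a minimal Python sketch that extracts the robots directives from a page's HTML using only the standard library. The class and function names are made up for this example, and real crawlers are certainly more elaborate; it only shows that the directive lives inside the page content itself.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of a <meta name="robots"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # "noindex, nofollow" -> ["noindex", "nofollow"]
        self.directives = [
            part.strip().lower() for part in content.split(",") if part.strip()
        ]


def robots_directives(html):
    """Return the list of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A page without the tag yields an empty list, which is a quick way to spot pages that were missed.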

The steps below remove an entire website from Google's search results.

check robots.txt (if it exists) and make sure that search bots are allowed to crawl the site and read all indexed web pages. The file should either be empty or look something like this (which allows any bot to read any page on the site):

User-agent: *
Disallow:
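The check in this step can also be done programmatically. The sketch below uses Python's standard urllib.robotparser module on the text of a robots.txt file; the helper name allows_crawling and the "Googlebot" default are choices made for this example, and Python's parser may not match Google's own interpretation in every edge case.

```python
from urllib.robotparser import RobotFileParser


def allows_crawling(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The permissive file from this step allows every page.
permissive = "User-agent: *\nDisallow:"
print(allows_crawling(permissive, "https://example.com/page1/"))

# A blanket "Disallow: /" would hide the pages (and their meta tags) from bots.
blocking = "User-agent: *\nDisallow: /"
print(allows_crawling(blocking, "https://example.com/page1/"))
```

The first call prints True and the second prints False.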

add a robots HTML <meta> tag with the value "noindex, nofollow" to the <head> block of each indexed web page:

<meta name="robots" content="noindex, nofollow" />
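For a static site with many pages, adding the tag by hand gets tedious. As a rough sketch, the Python function below inserts the tag right after the opening <head> tag of an HTML string, unless a robots meta tag is already present. The function name and the regex-based approach are assumptions of this example; a regex is good enough for simple, well-formed static pages but is not a general HTML parser.

```python
import re

NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow" />'

# Matches the opening <head> tag, with or without attributes (but not <header>).
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
# Matches an existing robots meta tag, so the tag is not added twice.
ROBOTS_META = re.compile(r"<meta\s[^>]*name=[\"']robots[\"']", re.IGNORECASE)


def add_noindex(html):
    """Return html with the robots meta tag inserted right after <head>."""
    if ROBOTS_META.search(html):
        return html  # already has a robots meta tag; leave it alone
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> block found; nothing to do
    end = match.end()
    return html[:end] + "\n    " + NOINDEX_TAG + html[end:]
```

Applying it to every .html file of the site (read, transform, write back) is then a short loop; running it twice is harmless because the second pass finds the existing tag.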

create a sitemap.xml file that lists all indexed web pages, each with a <lastmod> element set to a recent date. For example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/page1/</loc>
        <changefreq>daily</changefreq>
        <lastmod>2019-06-15</lastmod>
    </url>
    <url>
        <loc>https://example.com/page2/</loc>
        <changefreq>daily</changefreq>
        <lastmod>2019-06-15</lastmod>
    </url>
</urlset>
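When the site has more than a handful of pages, the sitemap can be generated instead of written by hand. The following Python sketch builds the same structure as the example above with the standard xml.etree.ElementTree module. The function name and the page list are placeholders for this example, and the output is a single unindented line, which is still valid XML.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod):
    """Return sitemap XML as a string, giving every URL the same <lastmod>."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in urls:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = page
        ET.SubElement(url, "changefreq").text = "daily"
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


# Placeholder page list; replace it with the pages that are actually indexed.
sitemap = build_sitemap(
    ["https://example.com/page1/", "https://example.com/page2/"],
    lastmod="2019-06-15",
)
```

Writing the returned string to sitemap.xml in the site root completes this step.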

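submit this sitemap.xml file to Google to let it know about the recent changes. This can be done with a curl command (Google has since deprecated this ping endpoint; if it no longer responds, the sitemap can be submitted through Search Console instead):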

curl -sSLf "https://google.com/ping?sitemap=https%3A%2F%2Fexample.com%2Fsitemap.xml"
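The sitemap parameter in the command above is simply the sitemap's URL in percent-encoded form. As a small sketch, this is how that value can be produced in Python with urllib.parse.quote; the helper name ping_url is made up for this example, and the code only builds the string without sending any request.

```python
from urllib.parse import quote

PING_ENDPOINT = "https://google.com/ping?sitemap="


def ping_url(sitemap_url):
    """Return the ping URL with the sitemap location percent-encoded."""
    # safe="" makes quote() encode ":" and "/" as well.
    return PING_ENDPOINT + quote(sitemap_url, safe="")


print(ping_url("https://example.com/sitemap.xml"))
# → https://google.com/ping?sitemap=https%3A%2F%2Fexample.com%2Fsitemap.xml
```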

submit a removal request for each indexed web page. It may take several days (and a few attempts per URL) for some links to be considered "outdated" and eligible for removal from the index