Last week I decided to host my website using S3 and CloudFront, and because I am not an expert I ran into lots of difficulties. While Googling around I relied on some posts that turned out to be bad advice. The advice below only applies if you are not changing your website name and are just migrating over from a different web host. Also, much of my advice below is SEO related. While many folks abhor SEO, my opinion is that it is a necessary evil and a best practice if you want to get noticed out there (being on SERP page 6 is better than not being noticed at all). But if your web page is for internal corporate use only, then there is no need to even care about SEO.
First, let me reiterate that gotcha #1 from my first post is the biggest issue I faced. It took days for me to figure this one out. Your CloudFront distribution will not even resolve if you do not follow gotcha #1.
So while my migration was not working for those days, I created some more problems for myself that I had to correct after deployment.
Correction #1 - scratch what I told you in gotcha #4 because it is all wrong. Yes, your files are now in a new path, but it is ONLY a file path from CloudFront to S3. There is ABSOLUTELY NO NEED to redo all your links or your canonicals to point at this file path. KEEP THEM AS IS, using the legal path that you set up in Route 53 and the S3 bucket name. This is the legal path. It is also imperative that your sitemap has no references to the S3 file path, and the same goes for robots.txt. Bottom line: make NO references to the CloudFront-to-S3 path anywhere.
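To make this concrete, here is a minimal Python sketch that generates a robots.txt and sitemap.xml referencing ONLY the legal path. The domain and page list are placeholders, so swap in your own:

```python
# A minimal sketch, assuming your legal domain is https://example.com
# (a placeholder) and your pages live at the paths listed below. Both
# files reference only the legal domain, never the CloudFront/S3 path.

SITE = "https://example.com"          # your Route 53 name / bucket name, NOT the CDN path
PAGES = ["/", "/about", "/contact"]   # hypothetical page list

robots = f"User-agent: *\nAllow: /\n\nSitemap: {SITE}/sitemap.xml\n"

urls = "\n".join(f"  <url><loc>{SITE}{p}</loc></url>" for p in PAGES)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{urls}\n"
    "</urlset>\n"
)

with open("robots.txt", "w") as f:
    f.write(robots)
with open("sitemap.xml", "w") as f:
    f.write(sitemap)
```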
Correction #2 - More Gotchas #1 is incorrect: you do not need to re-SEO everything. Your legal site name is still the same, so don't create a new property in Google Search Console with the new S3 path name and start requesting indexing, as this will only create duplicate and confusing content out there on the web.
Correction #3 - be consistent about your page names. If they did not have an .html extension last time, be sure that is still true this time, otherwise users will get a 404 error. My confusion was that by default all pages in S3 keep their .html extension. If you were on an old Apache server where your .htaccess file stripped the .html extension, then this advice is for you. The fix is easy: in the S3 bucket, check the box next to your file, click Actions, then Rename, remove the .html extension, and don't forget to save.
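If you have a lot of pages, here is a minimal boto3 sketch of what that console rename does under the hood. S3 has no true rename, so it is a copy plus a delete; the bucket and file names are placeholders:

```python
# A minimal sketch using boto3, assuming a bucket named "example.com"
# (placeholder). Each .html object is copied to an extensionless key,
# forcing the Content-Type back to text/html, then the original is
# deleted -- the same thing the console rename does.
import boto3

s3 = boto3.client("s3")
BUCKET = "example.com"  # placeholder bucket name

for page in ["about.html", "contact.html"]:  # hypothetical file list
    new_key = page[: -len(".html")]
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": page},
        Key=new_key,
        ContentType="text/html",      # keep serving it as HTML
        MetadataDirective="REPLACE",  # required to change metadata on copy
    )
    s3.delete_object(Bucket=BUCKET, Key=page)
```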
Correction #4 - this refers to More Gotchas #4, which states that the CDN returns a 304 code instead of a 301. While this is true, and in some cases Google Search Console will not re-index, here is what is really happening. When Google re-indexes a page, you will find that the status code of that page is 200 for the indexing. This is because your robots.txt and sitemap are actually instructing the Google bot to go to the source and not to the cached page over at the CDN. When this happens, your page will re-index. Here is the rub: much advice out there tells you that for small sites you do not need robots.txt or a sitemap, but I say ignore this and do both anyway if you want web crawlers out there to notice you. Don't forget to do the same over at Bing Webmaster Tools. My suspicion is that if you don't have robots.txt or a sitemap, then Google will go out to the 304 version of the page at the CDN and thus will not index the page. This is because the Google bot has no sitemap or robots.txt to tell it what to do, so Google has to guess. I think this presumption is correct, but I have not fully tested it, and professional feedback is a bit hard to come by these days.
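If you want to see for yourself which status code your page returns, and whether the response came from the CloudFront cache, here is a quick Python sketch using the requests library (the URL is a placeholder). Note that a 304 only shows up on a conditional request:

```python
# A quick check of what a crawler sees: the status code and whether the
# response came from the CloudFront cache. The URL is a placeholder for
# a page on your legal domain.
import requests

url = "https://example.com/about"
resp = requests.get(url, allow_redirects=False)
print("status:", resp.status_code)              # 200 straight from the source
print("x-cache:", resp.headers.get("X-Cache"))  # "Hit from cloudfront" or "Miss from cloudfront"

# A 304 only appears on a conditional request, the way a cache-aware
# client (or bot) re-asks for a page it has already seen:
etag = resp.headers.get("ETag")
if etag:
    cond = requests.get(url, headers={"If-None-Match": etag}, allow_redirects=False)
    print("conditional status:", cond.status_code)  # 304 when unchanged
```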
That's all for my corrections. During my travels I also found some new gotchas that no one tells you about.
DO NOT create a sitemap using the CloudFront-to-S3 file path name.
DO NOT create a new property at Google Search Console.
Even though all your pre-migration files are now deleted, it may seem like there is no place left for a sitemap. There is: each Google property has a designated location to submit a sitemap, so keep using your existing property.
DO NOT change any of your old canonicals, as AWS does its magic for you. If you try to change any canonicals (I use Screaming Frog to test crawl), Google will pick the legal canonical with the path name that matches the legal name you registered at Route 53, which is also your bucket name. It will not choose a user-selected canonical if it is wrong, and thus will not index your page.
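If you don't have Screaming Frog handy, here is a rough Python sketch that fetches a page and checks its canonical tag against the legal domain. The domain is a placeholder and the regex is naive (it assumes rel comes before href in the tag):

```python
# A rough stand-in for what I use Screaming Frog for: fetch a page and
# verify its canonical tag points at the legal domain, not the CDN path.
import re
import requests

LEGAL = "https://example.com"  # placeholder for your legal site name
html = requests.get(f"{LEGAL}/about").text

m = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)', html)
canonical = m.group(1) if m else None
print("canonical:", canonical)
print("ok" if canonical and canonical.startswith(LEGAL)
      else "WRONG - canonical points off the legal domain")
```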
This one is kind of insidious. The default time for the CloudFront CDN to update its cache is 24 hours. Be advised that sitemaps and JavaScript files work on a different schedule and update at different times, so don't be too hasty when checking all your edits. Be patient. CloudFront does warn us on Route 53 that the TTL is 48 hours. In my case some of my JavaScript was not working, but 12 hours later everything was fine.
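If you can't wait out the default TTL, you can force CloudFront to refetch from S3 with an invalidation. Here is a minimal boto3 sketch; the distribution ID is a placeholder:

```python
# A minimal sketch: invalidate everything so the edge caches refetch
# from S3 instead of waiting out the TTL. The first 1,000 invalidation
# paths per month are free.
import time
import boto3

cf = boto3.client("cloudfront")
cf.create_invalidation(
    DistributionId="E1234567890ABC",  # placeholder -- use your distribution's ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),  # any unique string
    },
)
```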
That's all for now, folks, and THANKS for reading!
Find out more about my NoSQL adventures at my home page.
Find out more about Rick Delpo here.