DEV Community


Building a Zero-Cost SEO Prerender Pipeline for My React SPA on AWS

How I built a self-hosted prerender cache using CloudFront Functions, Lambda, Puppeteer, and S3 — replacing a $49/month service with existing AWS infrastructure.


Inspiration

This blog was inspired by Avinash Dalvi's excellent post on replacing prerender.io with a serverless renderer on AWS. His approach uses Lambda@Edge + API Gateway + a container-based Puppeteer Lambda. I had the same core idea and adapted it to my stack — a React SPA on CloudFront with CDK — making different architectural trade-offs along the way for my use case.


Why Prerendering Matters for SPAs

Already know why SPAs are invisible to crawlers? Skip ahead to The Site: What I'm Working With.

Your SPA ships an index.html with an empty <div id="root"></div> and a script tag in the body. Browsers execute the JS and render everything. Crawlers and social bots don't — they read the raw HTML and move on.

<!DOCTYPE html>
<html>
  <head><title>Vairaa</title></head>
  <body>
    <div id="root"></div>
    <script src="/assets/index-abc123.js"></script>
  </body>
</html>

The impact is twofold:

  • SEO: Google can execute JS, but queues it separately, delaying indexing by days or weeks. Pages sit in "Discovered — currently not indexed" limbo. Other search engines don't execute JS at all. You lose rich results, waste crawl budget, and get empty search snippets.
  • Social sharing: When someone shares your link on WhatsApp, LinkedIn, Slack, or Twitter/X, those platforms fetch the URL and look for Open Graph meta tags in the HTML. If those tags are rendered client-side, the preview card is blank. No title, no image, no description. For a B2B site where link sharing drives leads, that's a conversion killer.

Prerendering fixes this by serving crawlers a fully rendered HTML snapshot — with all content, meta tags, and structured data baked in — while regular users still get the normal SPA experience. Google explicitly documents this as dynamic rendering and approves the pattern. It's not cloaking — you're serving the same content in a different format, not different content to manipulate rankings.

The typical options are SSR (framework rewrite), SSG (doesn't work for dynamic content), a prerender service ($49–$249/month), or self-hosted prerendering. This post is about the last one.


The Site: What I'm Working With

Vairaa is a B2B marketing site for a cocopeat and coir products company specifically focused on Lead Generation. It's not a simple brochure site — it has a full product catalog, a blog, a lead management CRM, a media gallery, and an admin panel for managing all of it.

Vairaa Website Glimpse

The tech stack:

  • Frontend: React 19 + Vite 7 SPA, styled with Tailwind CSS 4, animated with Framer Motion, smooth-scrolled with Lenis. Radix UI primitives for accessible components. TipTap for rich-text blog authoring.
  • Backend: A single AWS Lambda (Node.js 22, ARM64) handling all API routes — products, categories, blogs, leads, gallery, uploads, site config. Express wraps it locally for dev.
  • Database: DynamoDB single-table design. Products, categories, blogs, leads, gallery items, site config — all in one table with PK/SK patterns and a GSI for category and status queries.
  • Infrastructure: AWS CDK v2. CloudFront serves the SPA from S3 (OAC, no public bucket). API Gateway fronts the Lambda. Route53 + ACM for DNS and TLS.

Admin Products page showing the product list with edit/delete actions and category filters

Every content change in the admin panel — updating a product, publishing a blog post, changing site config — is relevant to the prerender story, because each change needs to invalidate and regenerate the cached HTML that crawlers see.


The Problem

The site ships a nearly empty index.html and renders everything client-side. Great for users. Terrible for crawlers. I was about to pay $49/month for prerender.io to fix it, but it felt wrong for a site already running on AWS Lambda, S3, and CloudFront. I already had everything I needed.

Here's every architectural decision and the wrong turns I almost took.


The Architecture in One Sentence

A CloudFront Function detects crawlers by User-Agent, rewrites their request URI to a cached HTML file in S3, and serves it — all transparently, with no redirects, at the same URL.


Decision 1: URI Rewriting, Not Redirecting

My first instinct was to redirect crawlers to a separate endpoint that returns rendered HTML. Something like:

Crawler hits /products/cocopeat-bricks
→ 302 redirect to /api/prerender?path=/products/cocopeat-bricks
→ Lambda renders the page
→ Returns HTML

This is actually what my original setup did for social bots — redirect to /api/og?path=... which returned minimal OG meta tags.

This hurts SEO. When Google follows a redirect, it sees two URLs involved. It may split PageRank signals, flag the redirect as suspicious, or simply not follow it at all. The canonical URL becomes ambiguous.

The correct approach is a transparent URI rewrite. The crawler requests /products/cocopeat-bricks. CloudFront internally fetches /prerender/products/cocopeat-bricks/index.html from S3. The crawler receives a single HTTP 200 response at the original URL. It never sees a redirect. It has no idea any rewriting happened.

This is Google's officially documented dynamic rendering pattern.


Decision 2: Same Bucket, Not a New Origin

Here's where it gets interesting. A CloudFront Function runs at the viewer-request stage. It can rewrite the URI. But it cannot switch the origin — that's a hard CloudFront limitation. Origin selection happens at the behavior level, not the function level.

So I had two options:

Option A: Lambda@Edge — runs at the origin-request stage, can dynamically switch origins. More powerful, but it adds invocation latency on every request it handles, costs more, and has a more complex deployment model. This is the approach Avinash used — Lambda@Edge for bot detection + origin switching to a separate API Gateway endpoint that runs Puppeteer.

Option B: Same bucket, path-based routing — store the prerender cache inside the existing static hosting S3 bucket under a prerender/ prefix. The CloudFront Function rewrites /products/cocopeat-bricks to /prerender/products/cocopeat-bricks/index.html. Same origin, just a different path.

I went with Option B. No new infrastructure, no Lambda@Edge cold starts, no extra CDK complexity. The trade-off is that the prerender cache lives alongside your static assets — but since everything is private behind CloudFront OAC anyway, that's a non-issue.

The S3 key structure looks like this:

vairaa-static-prod/
├── index.html                          ← SPA shell
├── assets/                             ← Vite build output
└── prerender/                          ← Our cache
    ├── index.html                      ← Cached: /
    ├── products/
    │   ├── index.html                  ← Cached: /products
    │   └── cocopeat-bricks/
    │       └── index.html              ← Cached: /products/cocopeat-bricks
    └── blogs/
        └── growing-with-coco-peat/
            └── index.html

How This Differs from Avinash's Approach

Avinash's system renders on-demand — a bot hits a URL, Lambda@Edge routes to a Puppeteer Lambda, which renders the page (or serves from S3 cache), and returns the HTML in real-time. The first request for any URL pays the full Puppeteer render cost (~5-10 seconds).

My system pre-renders — the admin triggers a warm-up that renders all pages upfront and stores them in S3. When a bot hits a URL, CloudFront serves the pre-cached HTML directly from S3. No Lambda invocation at all for cache hits. The trade-off is that cache misses serve the SPA shell instead of rendering on-demand.

Both approaches are valid. On-demand is better if you have thousands of pages that change rarely. Pre-rendering is better if you have a manageable page count and want zero-latency bot responses.


Decision 3: The CloudFront Function

The existing CloudFront Function already handled www→non-www redirects. I extended it to also handle crawler detection and URI rewriting.

The key insight: don't redirect social bots to /api/og anymore. Instead, rewrite all crawlers (both search engines and social bots) to the prerender cache.

function handler(event) {
  var request = event.request;
  var host = request.headers.host.value;
  var uri = request.uri;

  // 1. www → non-www redirect (unchanged)
  if (host.startsWith('www.')) {
    return { statusCode: 301, headers: { location: { value: 'https://' + host.slice(4) + uri } } };
  }

  // 2. Detect crawlers
  var ua = (request.headers['user-agent'] || { value: '' }).value.toLowerCase();
  var bots = [
    'googlebot', 'bingbot', 'slurp', 'duckduckbot', 'baiduspider',
    'yandexbot', 'sogou', 'exabot', 'facebot', 'ia_archiver',
    'whatsapp', 'facebookexternalhit', 'twitterbot', 'linkedinbot',
    'slackbot', 'telegrambot', 'discordbot', 'pinterestbot', 'redditbot'
  ];
  var isCrawler = bots.some(function(b) { return ua.indexOf(b) !== -1; });

  // 3. Rewrite URI for crawlers (skip API, assets, files with extensions)
  if (isCrawler && !uri.startsWith('/api/') && !uri.startsWith('/product-assets/') && !uri.match(/\.[a-zA-Z0-9]+$/)) {
    request.uri = uri === '/' ? '/prerender/index.html' : '/prerender' + uri + '/index.html';
  }

  return request; // Always return request, never a redirect response for crawlers
}

Note on the extension regex: The pattern \.[a-zA-Z0-9]+$ skips any URI ending with a file extension (.js, .css, .png, etc.) so static assets aren't rewritten. This could also skip legitimate routes containing dots (e.g., /products/version-2.0). If your slugs contain dots, swap the regex for an explicit extension allowlist like /\.(js|css|png|jpg|svg|ico|woff2?)$/.

Notice the last line: we always return the request object, never a response with a statusCode. That's what makes it a rewrite instead of a redirect.
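If you want to sanity-check the rewrite rule before deploying, the same logic can be mirrored as a small pure function. This is a TypeScript sketch with an abbreviated bot list; the deployed CloudFront Function above is the source of truth.

```typescript
// Mirrors the CloudFront Function's rewrite logic for local testing.
// Abbreviated bot list for brevity; the deployed function is authoritative.
const BOTS = ['googlebot', 'bingbot', 'twitterbot', 'facebookexternalhit', 'whatsapp'];

export function rewriteForCrawler(uri: string, userAgent: string): string {
  const ua = userAgent.toLowerCase();
  const isCrawler = BOTS.some((b) => ua.includes(b));
  const skip =
    uri.startsWith('/api/') ||
    uri.startsWith('/product-assets/') ||
    /\.[a-zA-Z0-9]+$/.test(uri); // skip anything that looks like a static asset
  if (!isCrawler || skip) return uri; // humans and assets pass through untouched
  return uri === '/' ? '/prerender/index.html' : `/prerender${uri}/index.html`;
}
```

Feeding it a few representative URI/User-Agent pairs is a quick way to catch regressions in the bot list or the extension regex before pushing a function update.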


Decision 4: What Happens on a Cache Miss?

Since the CloudFront Function blindly rewrites the URI without checking if the file exists, what happens when a crawler hits a URL that hasn't been cached yet?

CloudFront fetches /prerender/products/new-product/index.html from S3. The file doesn't exist. S3 returns 404. CloudFront's existing error response configuration kicks in — it serves index.html (the SPA shell) with HTTP 200.

The crawler gets the SPA shell. Not ideal, but not a penalty either. Google will just re-crawl later.

This is acceptable because:

  1. You warm the cache before going live
  2. Cache is automatically regenerated whenever content changes
  3. Cache misses are rare in steady state

If you need zero-miss guarantee with on-demand rendering, Lambda@Edge is the right call (see Avinash's approach). For most sites, pre-rendering with graceful fallback is fine.
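The error-response mapping that produces this fallback looks like the following sketch. The values here are illustrative (plain data mirroring the shape of aws-cdk-lib's cloudfront ErrorResponse, not my actual CDK code); note that an OAC-locked bucket without s3:ListBucket permission typically returns 403 rather than 404 for missing keys, so both statuses usually get mapped.

```typescript
// Illustrative CloudFront error-response mapping for the SPA-shell fallback.
// Shape mirrors aws-cdk-lib's cloudfront.ErrorResponse; shown as plain data.
interface ErrorResponse {
  httpStatus: number;         // what the origin returned
  responseHttpStatus: number; // what the viewer sees instead
  responsePagePath: string;   // served in place of the error
}

// An OAC-locked bucket without s3:ListBucket returns 403 for missing keys,
// so both 403 and 404 are mapped to the SPA shell with a 200.
export const spaFallbackErrorResponses: ErrorResponse[] = [
  { httpStatus: 403, responseHttpStatus: 200, responsePagePath: '/index.html' },
  { httpStatus: 404, responseHttpStatus: 200, responsePagePath: '/index.html' },
];
```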


Decision 5: Puppeteer, Not Playwright

For this use case — render a React page, capture the HTML, store it — you need Chromium and a page.goto(). Playwright has advantages (better auto-waiting, multi-browser support), but those don't matter here. What matters on Lambda is package size and deployment simplicity.

Puppeteer with @sparticuz/chromium is purpose-built for this:

|                   | Puppeteer + @sparticuz/chromium | Playwright                         |
| ----------------- | ------------------------------- | ---------------------------------- |
| Package size      | ~50MB compressed                | ~400MB+                            |
| Lambda deployment | Zip (no Docker)                 | Container image required           |
| Cold start        | Faster                          | Slower                             |
| Lambda fit        | Purpose-built for Lambda        | General-purpose, heavier footprint |

@sparticuz/chromium is specifically built and compressed for Lambda's environment. It's the community standard for running Chromium on Lambda. Avinash uses the same stack in his implementation.


Decision 6: Batched Lambda Invocations

My first design fired one Lambda per URL. For a site with 50 pages, that's 50 Lambda cold starts. Each cold start loads Chromium (~3-5 seconds). That's 150-250 seconds of wasted cold start time.

The better approach: batch 10 URLs per Lambda invocation.

URL Manifest (50 URLs)
  → chunk into 5 batches of 10
  → invoke 5 Render_Worker Lambdas with InvocationType: 'Event'
  → each Lambda: launch Chromium ONCE, render 10 URLs parallel

Each Lambda pays the cold start cost once, then reuses the browser instance for all 10 URLs. 5 Lambda invocations instead of 50. All 5 render worker lambdas run in parallel.

With 2048MB memory and a 3-minute timeout (generous headroom for cold start + 10 renders at ~8-15s each), a full site warm-up completes in ~20-30 seconds wall-clock time.
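The batching step itself is only a few lines of logic. A sketch of the chunkArray helper the warm-up handler relies on (the real implementation isn't shown in the code excerpts, so treat this as an assumed shape):

```typescript
// Assumed shape of the chunkArray helper used by the warm-up handler:
// split a list of paths into fixed-size batches, one per Lambda invocation.
export function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```

For a 50-URL manifest with size 10, this yields 5 batches, matching the five parallel Render_Worker invocations described above.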


Decision 7: Async Lambda Invocation for Cache Invalidation

When an admin updates a product, the API needs to:

  1. Update DynamoDB
  2. Delete the stale S3 cache entry
  3. Trigger a re-render

Admin Product edit page showing the save action that triggers cache invalidation

Steps 2 and 3 can't run as a background Promise in the same Lambda. When the Lambda returns its HTTP response, the execution context freezes. Any unresolved promises die.

The solution: InvocationType: 'Event' — invoke the Render_Worker Lambda asynchronously. The API Lambda fires the invocation and gets back immediately. The Render_Worker runs on its own lifecycle, with its own timeout, completely independent.

The admin gets their response in milliseconds. The cache regenerates in the background. Lambda retries failed async invocations up to 2 times by default — so even transient failures are handled.


Decision 8: No Meta Tag Injection

Some prerender services inject OG tags and canonical URLs into the captured HTML before storing it. I don't need to do this.

My React app has a proper SEO component that renders all meta tags — canonical, OG, Twitter Card, hreflang, geo tags — as part of the normal React render. When Puppeteer renders the page, those tags are already in the DOM. The captured HTML already has everything.

// This component renders during Puppeteer's page.content() capture
export function SEO({ pageKey, title, url, siteConfig }) {
  return (
    <>
      <title>{fullTitle}</title>
      <link rel="canonical" href={fullUrl} />
      <meta property="og:title" content={fullTitle} />
      <meta property="og:url" content={fullUrl} />
      {/* ... all the tags */}
    </>
  );
}

The canonical URL points to https://vairaa.com/products/cocopeat-bricks — the real URL, not the internal S3 path. Google reads the canonical tag, not the URL it fetched from. So even though CloudFront internally fetched /prerender/products/cocopeat-bricks/index.html, Google indexes https://vairaa.com/products/cocopeat-bricks. Correct.


The Code

The Render Worker Lambda

This is the Lambda that actually launches Chromium, renders pages, and stores the HTML in S3. It receives a batch of URLs and a target bucket name.

// backend/src/render-worker.ts
import { PutObjectCommand } from '@aws-sdk/client-s3';
import { s3Client } from './utils/s3';
import puppeteer from 'puppeteer-core';
import chromium from '@sparticuz/chromium';

export interface RenderWorkerEvent {
  urls: string[];
  bucketName: string;
}

export function urlToS3Key(url: string): string {
  const pathname = new URL(url).pathname;
  if (pathname === '/') return 'prerender/index.html';
  const normalized = pathname.replace(/\/$/, '');
  return `prerender${normalized}/index.html`;
}

export const handler = async (event: RenderWorkerEvent) => {
  const browser = await puppeteer.launch({
    args: [...chromium.args, '--no-sandbox', '--disable-dev-shm-usage'],
    defaultViewport: { width: 1280, height: 800 },
    executablePath: await chromium.executablePath(),
    headless: true,
  });

  try {
    const results = await Promise.allSettled(
      event.urls.map(async (url) => {
        const page = await browser.newPage();
        try {
          await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
          const html = await page.content();
          const key = urlToS3Key(url);
          await s3Client.send(new PutObjectCommand({
            Bucket: event.bucketName,
            Key: key,
            Body: html,
            ContentType: 'text/html; charset=utf-8',
            Metadata: { 'rendered-at': new Date().toISOString() },
          }));
          return url;
        } finally {
          await page.close();
        }
      })
    );

    results.forEach((result, i) => {
      // Index into the original URL list directly; indexOf would misreport
      // the URL if two results compared equal (e.g. duplicate rejections).
      if (result.status !== 'fulfilled') {
        console.warn(`WARN: render failed for ${event.urls[i]}`, result.reason);
      }
    });
  } finally {
    await browser.close();
  }
};

A few things worth noting:

  • Promise.allSettled means one failed render doesn't kill the entire batch. We log failures for debugging.
  • All 10 URLs render concurrently in a single Chromium instance. At 2048MB, this works fine for ~200KB pages. For heavier pages, you could add a concurrency limiter (e.g., a simple semaphore) to cap parallel tabs.
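That concurrency limiter could be as small as this. It's a hypothetical sketch of the semaphore idea; the deployed worker renders all URLs in a batch unthrottled.

```typescript
// Hypothetical semaphore-style limiter: caps how many render tasks run at once.
export function createLimiter(maxConcurrent: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= maxConcurrent) {
      // Park this task until a running one finishes and releases a slot.
      await new Promise<void>((resolve) => waiting.push(resolve));
    }
    active++;
    try {
      return await task();
    } finally {
      active--;
      const next = waiting.shift();
      if (next) next(); // wake exactly one parked task
    }
  };
}
```

Wrapping each render as `limit(() => renderOne(url))` inside the `event.urls.map(...)` call would cap parallel tabs at `maxConcurrent` instead of opening all ten at once.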

The Cache Manager (Warm-up & Invalidation)

The prerender handler builds the URL manifest, chunks it into batches, and fires off Render Worker Lambdas:

// backend/src/handlers/prerender.ts (key excerpts)

const STATIC_PATHS = [
  '/', '/products', '/about', '/contact', '/blogs', '/gallery',
  '/faqs', '/terms-and-conditions', '/privacy-policy',
  '/shipping-policy', '/landingpage',
];

/** Builds the full URL manifest: static paths + product slugs + blog slugs. */
async function buildUrlManifest(): Promise<string[]> {
  const [productsResult, blogsResult] = await Promise.all([
    listProducts(),
    listBlogs(),
  ]);

  const productPaths = productsResult.items.map(p => `/products/${p.slug}`);
  const blogPaths = blogsResult.items.map(b => `/blogs/${b.slug}`);

  return [...STATIC_PATHS, ...productPaths, ...blogPaths];
}

/** Deletes cache entries and asynchronously invokes Render_Worker to regenerate. */
async function invalidateAndRerender(paths: string[]): Promise<void> {
  // Delete stale S3 entries
  await Promise.all(
    paths.map(path =>
      s3Client.send(new DeleteObjectCommand({
        Bucket: WEBSITE_BUCKET, Key: pathToS3Key(path),
      }))
    )
  );

  // Fire-and-forget: invoke Render_Worker asynchronously
  const urls = paths.map(p => `${SITE_URL}${p}`);
  await lambdaClient.send(
    new InvokeCommand({
      FunctionName: RENDER_WORKER_FUNCTION_NAME,
      InvocationType: 'Event', // async — returns immediately
      Payload: Buffer.from(JSON.stringify({ urls, bucketName: WEBSITE_BUCKET })),
    })
  );
}

// POST /api/admin/prerender/warmup
async function warmupHandler(): Promise<APIGatewayProxyResult> {
  // Check for warmup lock (prevent concurrent warm-ups)
  const lockKey = 'prerender/.warmup-lock';
  try {
    await s3Client.send(new HeadObjectCommand({ Bucket: WEBSITE_BUCKET, Key: lockKey }));
    return { statusCode: 409, body: JSON.stringify({ message: 'Warm-up already in progress' }) };
  } catch { /* Lock doesn't exist — proceed */ }

  // Write lock
  await s3Client.send(new PutObjectCommand({
    Bucket: WEBSITE_BUCKET, Key: lockKey,
    Body: new Date().toISOString(), ContentType: 'text/plain',
  }));

  const manifest = await buildUrlManifest();
  const batches = chunkArray(manifest, 10); // 10 URLs per Lambda invocation

  await Promise.all(
    batches.map((batch) => {
      const urls = batch.map(p => `${SITE_URL}${p}`);
      return lambdaClient.send(
        new InvokeCommand({
          FunctionName: RENDER_WORKER_FUNCTION_NAME,
          InvocationType: 'Event', // all batches fire in parallel
          Payload: Buffer.from(JSON.stringify({ urls, bucketName: WEBSITE_BUCKET })),
        })
      );
    })
  );

  return {
    statusCode: 202,
    body: JSON.stringify({
      message: `Warm-up initiated for ${manifest.length} URLs in ${batches.length} batches`,
      totalUrls: manifest.length,
      totalBatches: batches.length,
    }),
  };
}

The invalidateAndRerender function is what product, blog, and site-config handlers call after any content change. It deletes the stale cache entries and fires the Render Worker to regenerate them — all asynchronously so the admin API response isn't blocked.
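The pathToS3Key helper referenced in invalidateAndRerender isn't shown in the excerpt; mirroring urlToS3Key from the render worker, it likely looks like this:

```typescript
// Assumed shape of pathToS3Key, mirroring urlToS3Key in the render worker:
// map a site path to its cached object key under the prerender/ prefix.
export function pathToS3Key(path: string): string {
  if (path === '/') return 'prerender/index.html';
  const normalized = path.replace(/\/$/, ''); // drop any trailing slash
  return `prerender${normalized}/index.html`;
}
```

Keeping this mapping identical on the delete side and the render side is what guarantees an invalidation actually removes the object the worker will later rewrite.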

Screenshot — Admin Prerender warm-up in progress, showing the progress bar and batch status


The Full Flow

Admin clicks "Warm Up All Pages"
  → POST /api/admin/prerender/warmup
  → API Lambda queries DynamoDB for all product + blog slugs
  → Builds URL manifest: 11 static pages + N products + M blogs
  → Chunks into batches of 10
  → Fires K Render_Worker Lambdas with InvocationType: 'Event'
  → Returns HTTP 202 immediately

Each Render_Worker Lambda (running in parallel):
  → Launches Chromium once
  → For each URL in batch:
      → page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 })
      → html = await page.content()
      → s3.putObject({ Key: 'prerender/products/slug/index.html', Body: html, ... })
  → Closes browser

Later, Googlebot hits /products/cocopeat-bricks:
  → CloudFront viewer-request → CloudFront Function
  → Detects 'googlebot' in User-Agent
  → Rewrites URI: /products/cocopeat-bricks → /prerender/products/cocopeat-bricks/index.html
  → CloudFront fetches from S3
  → Returns full HTML with HTTP 200 at original URL
  → Googlebot indexes complete product page ✓

Admin updates a product:
  → PUT /api/admin/products/:id
  → API Lambda updates DynamoDB
  → invalidateAndRerender(['/products/slug', '/products'])
  → Deletes stale S3 cache entries
  → Fires Render_Worker async (InvocationType: 'Event')
  → Returns HTTP 200 to admin immediately
  → Render_Worker regenerates cache in background

What It Costs

For a site with ~50 pages:

  • S3 storage: 50 HTML files × ~200KB each = ~10MB. Essentially free.
  • Lambda invocations: Warm-up fires ~5 invocations. Each content change fires 1-2. At Lambda's free tier (1M invocations/month), this is $0.
  • CloudFront Function: $0.10 per 1M invocations. For most sites, this rounds to $0.
  • Puppeteer Lambda: Lambda bills actual duration, not the configured timeout. At 2048MB (2GB) with roughly 30 seconds of real work per invocation and ~5 invocations per warm-up, that's ~300 GB-seconds. Lambda's free tier is 400,000 GB-seconds/month. Still $0.

Compare to prerender.io at $49/month minimum.


What I'd Do Differently

Cache miss on-demand render: Right now, a cache miss just serves the SPA shell. A Lambda@Edge function could detect the miss and trigger an on-demand render (or at least queue an async render for next time). I decided this complexity wasn't worth it given how rare cache misses are in steady state, but it's the most architecturally interesting upgrade path.

Cache staleness after deploys: The prerender cache doesn't automatically regenerate when you deploy a new frontend build. If your layout or components change, the cached HTML is stale until the next content edit triggers invalidation. The fix is simple — add a warm-up invocation to your deploy pipeline — but it's worth calling out because it's easy to forget.

Progress tracking: Since warm-up is fully async, the admin UI polls GET /api/admin/prerender/status every 10 seconds and compares cached entry count to manifest total. It works but feels crude. An SQS queue with a progress counter would be more elegant.

Warm-up lock: I'm using a prerender/.warmup-lock S3 object to prevent concurrent warm-ups. It works, but a DynamoDB item with a TTL would be cleaner and auto-expire if a warm-up crashes.
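The DynamoDB variant would hinge on a conditional put plus the table's TTL feature. A sketch of the lock item (key names here are hypothetical; pair this input with the SDK's PutItemCommand and catch ConditionalCheckFailedException to detect a held lock):

```typescript
// Hypothetical warm-up lock item for the single-table design.
// The conditional expression makes acquisition atomic; the ttl attribute
// lets DynamoDB delete the item automatically if a warm-up crashes.
export function buildWarmupLockPut(table: string, nowSeconds: number, ttlSeconds = 600) {
  return {
    TableName: table,
    Item: {
      PK: { S: 'LOCK#warmup' },
      SK: { S: 'LOCK#warmup' },
      ttl: { N: String(nowSeconds + ttlSeconds) }, // auto-expires after ttlSeconds
    },
    // Fails with ConditionalCheckFailedException if another warm-up holds the lock.
    ConditionExpression: 'attribute_not_exists(PK)',
  };
}
```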


The Admin UI

The admin page gives full visibility and control:

  • Total cached pages vs. manifest total
  • Per-page status: Cached or Missing
  • Last rendered timestamp per page
  • "Warm Up All Pages" button with live progress
  • Per-page "Refresh" (synchronous re-render) and "Delete" actions
  • "Clear All Cache" with confirmation

Screenshot — Admin Prerender management page showing the cache status dashboard with warm-up button and per-page status list


Results

Google Search Console

Search Performance Before vs After

Before Pre-Rendering Implementation Search Performance

After Pre-Rendering Implementation Search Performance

Impressions jumped significantly after the prerender cache went live.

Page Indexing Before vs After

Before and After Pre-Rendering Implementation Page Indexing results

The inflection point around February 10 — the date the solution was deployed — is hard to miss. Pages that were stuck in "Discovered — currently not indexed" started getting picked up.

Social Link Previews

Social Link Previews Before vs After


Wrapping Up

The full system is:

  • CloudFront Function — crawler detection + URI rewrite (no Lambda@Edge, no redirects)
  • S3 prerender/ prefix — cache store (no new bucket)
  • Render_Worker Lambda — puppeteer-core + @sparticuz/chromium, zip deployment, batched rendering
  • Cache_Manager — API routes for warm-up, status, invalidation
  • Automatic invalidation — hooks in product/blog/site-config handlers, async Lambda invocation

Total new infrastructure: one Lambda function and some S3 objects. Everything else reuses what's already there.

The key insight that makes it all work: URI rewriting is invisible to crawlers. They see one URL, one HTTP 200 response. Google's dynamic rendering guidelines explicitly approve this pattern. You get full SEO benefit with zero redirect penalty.

If you're running a React SPA + CloudFront setup and paying for prerender.io, this is worth the afternoon it takes to set up.
