I Built a Deterministic Crosslink Engine for 117 Pages Using Jaccard Similarity

#seo #nextjs #webdev

A content site with 117 pages and zero internal linking strategy is a site where visitors bounce after reading one page. That was my site two weeks ago.

Today, every page on alexandrecaramaschi.com has 6 contextual crosslinks generated by a deterministic engine that runs in 200ms, costs nothing, and lives in a single Node.js script — no embeddings, no vector databases, no API calls.

Here is exactly how I built it.

The Problem: 117 Pages, Manual Linking

The site has 41 long-form articles, 38 courses (388 modules), 26 strategic insights, and 14 service/tool pages. All built with Next.js 16 App Router.

The existing relatedArticles field in my CMS was manually curated — and covered maybe 15% of pages. Course pages had zero outbound links to articles. Articles never pointed to courses. The result: visitors arrived via search, consumed one page, and left.

The Architecture: Faceted Taxonomy + Weighted Scoring

Instead of reaching for OpenAI embeddings, I designed a controlled vocabulary with 4 semantic facets:

1. Topics — 26 canonical terms with synonym normalization:

export const TOPICS = {
  geo: ["geo", "generative engine optimization", "motor generativo"],
  seo: ["seo", "search engine optimization"],
  "ia-generativa": ["ia generativa", "llm", "chatgpt", "claude", "gemini"],
  vscode: ["vscode", "vs code", "visual studio code", "editor", "ide"],
  // ... 22 more
};

Each piece of content is annotated by scanning its title, description, and keywords against this vocabulary. Normalization strips accents and lowercases before matching (critical for Portuguese content).

2. Audience — 7 profiles (beginner, dev, marketing-pro, executive, etc.)

3. Intent — 4 journey stages: discover → learn → apply → decide

4. Vertical — 12 industry sectors (healthcare, legal, tourism, etc.)
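The topic annotation step from facet 1 (normalize, then match against the vocabulary) can be sketched as follows. This is a minimal illustration, not the site's actual code: `normalize` and `annotateTopics` are hypothetical names, only three vocabulary entries are shown, and whole-word matching is an assumption.

```typescript
// Illustrative subset of the vocabulary; the real one has 26 topics.
const TOPICS: Record<string, string[]> = {
  geo: ["geo", "generative engine optimization", "motor generativo"],
  seo: ["seo", "search engine optimization"],
  "ia-generativa": ["ia generativa", "llm", "chatgpt", "claude", "gemini"],
};

// Lowercase and strip accents so "Otimização" becomes "otimizacao".
function normalize(text: string): string {
  return text
    .normalize("NFD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase();
}

// Assumption: whole-word matching, so "geo" does not match "geografia".
function annotateTopics(text: string): string[] {
  // Collapse punctuation to single spaces and pad both ends with a space.
  const haystack = ` ${normalize(text).replace(/[^a-z0-9]+/g, " ")} `;
  return Object.keys(TOPICS).filter((topic) =>
    TOPICS[topic].some((synonym) => haystack.includes(` ${normalize(synonym)} `))
  );
}
```

With this sketch, `annotateTopics("Introdução à IA Generativa e SEO")` returns `["seo", "ia-generativa"]`. In practice the input would be the concatenated title, description, and keywords of each page.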

The Scoring Function

For each pair of content items (A, B), the score is a weighted sum across facets:

score(A, B) = 1.0 * jaccard(topics_A, topics_B)
            + 0.5 * audienceOverlap(A, B)
            + 0.8 * intentFlow(A, B)
            + 1.2 * verticalBridge(A, B)
            + 1.3 * crossDomainBonus(A, B)
            + 0.6 * trackAffinity(A, B)

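Jaccard similarity handles topic matching: the number of shared topics divided by the number of distinct topics across both items. Two items that share 3 topics out of 5 distinct topics between them score 3/5 = 0.6, which is high enough to be relevant and low enough to avoid near-duplicates.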
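For reference, Jaccard similarity over two topic lists fits in a few lines. A minimal sketch; treating two empty lists as a score of 0 is an assumption, not something the engine is known to do.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|, always in the range [0, 1].
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  // Assumption: items with no topics at all are treated as unrelated.
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((topic) => {
    if (setB.has(topic)) shared += 1;
  });
  // |A ∪ B| = |A| + |B| - |A ∩ B|
  return shared / (setA.size + setB.size - shared);
}
```

For example, `["geo", "seo", "llm", "schema"]` and `["geo", "seo", "llm", "ads"]` share 3 of 5 distinct topics, so the score is 0.6.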

Intent flow rewards linking from discovery content (articles) to learning content (courses) to action pages (tools) — guiding visitors deeper.

Cross-domain bonus is the key retention driver: an article about "zero-click economy" linking to the "SEO + GEO Fundamentals" course is more valuable than linking to another article about zero-click. Pairs of different content types with shared topics earn this bonus, which carries the largest weight in the sum (1.3).

Track affinity ensures courses in the same learning path (e.g., Python → Data Science → Deploy) link to each other even without keyword overlap.
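Putting the pieces together, the weighted sum might look like the sketch below. The weights come from the formula above; everything else is assumed for illustration: the `Item` shape, and the definition of each component as a value in [0, 1] (for instance, `intentFlow` here only rewards a move to the next journey stage).

```typescript
type Intent = "discover" | "learn" | "apply" | "decide";
type ContentType = "article" | "insight" | "course" | "tool";

// Hypothetical shape of an annotated content item.
interface Item {
  type: ContentType;
  topics: string[];
  audiences: string[];
  verticals: string[];
  intent: Intent;
  track?: string; // learning path, if any
}

// Weights taken from the scoring formula.
const W = { topics: 1.0, audience: 0.5, intent: 0.8, vertical: 1.2, crossDomain: 1.3, track: 0.6 };
const JOURNEY: Intent[] = ["discover", "learn", "apply", "decide"];

function jaccard(a: string[], b: string[]): number {
  const setB = new Set(b);
  const shared = new Set(a.filter((x) => setB.has(x))).size;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : shared / union;
}

// Assumed component definitions; each returns a value in [0, 1].
const audienceOverlap = (a: Item, b: Item) => jaccard(a.audiences, b.audiences);
const intentFlow = (a: Item, b: Item) =>
  JOURNEY.indexOf(b.intent) === JOURNEY.indexOf(a.intent) + 1 ? 1 : 0;
const verticalBridge = (a: Item, b: Item) =>
  a.verticals.some((v) => b.verticals.includes(v)) ? 1 : 0;
const crossDomainBonus = (a: Item, b: Item) =>
  a.type !== b.type && a.topics.some((t) => b.topics.includes(t)) ? 1 : 0;
const trackAffinity = (a: Item, b: Item) =>
  a.track !== undefined && a.track === b.track ? 1 : 0;

// Directed score for linking from item a to item b.
function score(a: Item, b: Item): number {
  return (
    W.topics * jaccard(a.topics, b.topics) +
    W.audience * audienceOverlap(a, b) +
    W.intent * intentFlow(a, b) +
    W.vertical * verticalBridge(a, b) +
    W.crossDomain * crossDomainBonus(a, b) +
    W.track * trackAffinity(a, b)
  );
}
```

Under these assumptions the score is directional: an article linking to a course on the same topics collects the intent-flow and cross-domain terms, while a link to another article on those topics collects neither.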

Anti-Bubble Mixing

Raw scoring produces homogeneous results — a course page would only suggest other courses. The mixer enforces quotas:

content (articles + insights): min 1
learning (courses):             min 1
action (guides + tools):        min 1
any single group:               max 50%

Three phases:

1. Fill mandatory quotas from each group
2. Complete by score, respecting group caps
3. Fallback by supercategory for edge cases
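The three phases can be sketched as a single function. This is an illustration under stated assumptions, not the production mixer: `Candidate` and `mix` are hypothetical names, and phase 3 is interpreted here as "relax the cap when there are not enough candidates", which may differ from the actual supercategory fallback.

```typescript
type Group = "content" | "learning" | "action";

interface Candidate {
  path: string;
  group: Group;
  score: number;
}

// Score descending, ties broken by path so the result is deterministic.
const byScoreThenPath = (a: Candidate, b: Candidate) =>
  b.score - a.score || (a.path < b.path ? -1 : a.path > b.path ? 1 : 0);

function mix(candidates: Candidate[], limit = 6): Candidate[] {
  const ranked = [...candidates].sort(byScoreThenPath);
  const cap = Math.floor(limit * 0.5); // any single group: max 50%
  const picked: Candidate[] = [];
  const count = (g: Group) => picked.filter((c) => c.group === g).length;

  // Phase 1: take the best candidate from each group (min 1 per group).
  for (const group of ["content", "learning", "action"] as Group[]) {
    const best = ranked.find((c) => c.group === group);
    if (best && picked.length < limit) picked.push(best);
  }

  // Phase 2: fill remaining slots by score, skipping groups already at the cap.
  for (const c of ranked) {
    if (picked.length >= limit) break;
    if (!picked.includes(c) && count(c.group) < cap) picked.push(c);
  }

  // Phase 3 (assumed fallback): if the caps left slots empty, ignore them.
  for (const c of ranked) {
    if (picked.length >= limit) break;
    if (!picked.includes(c)) picked.push(c);
  }

  return picked.sort(byScoreThenPath);
}
```

Given five high-scoring courses, two articles, and one tool, this sketch returns three courses, two articles, and the tool: the courses are capped at half of the six slots, and the lower-scoring groups still get their guaranteed place.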

Injection Without Editing 64 Static Pages

The site has 38 static course pages and 26 static insight pages — all individual page.tsx files. Editing each one was not viable.

Solution: middleware + headers + layout injection.

The middleware sets an x-pathname header:

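// middleware.ts
import { NextResponse, type NextRequest } from 'next/server';

export function middleware(request: NextRequest) {
  // Copy the incoming headers and expose the pathname to server components
  const requestHeaders = new Headers(request.headers);
  requestHeaders.set('x-pathname', request.nextUrl.pathname);
  return NextResponse.next({ request: { headers: requestHeaders } });
}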

A server component reads it:

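// SmartRelated.tsx (getCrosslinksFor is the site's own lookup helper)
import { headers } from "next/headers";
export default async function SmartRelated() {
  const h = await headers();
  const path = h.get("x-pathname"); // null if the middleware did not set it
  const items = path ? getCrosslinksFor(path, 6) : [];
  // ... render items as related links
}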

Injected via educacao/layout.tsx and insights/layout.tsx, it automatically appears below every course and insight page. For articles (dynamic [slug] route), the pathname is passed explicitly as a prop.

Results

Metric	Before	After
Pages with crosslinks	~15%	100%
Total crosslinks	~40 manual	700 generated
Cross-type links	0	116 of 117 pages
Badge types per page	1	2.3 average
Build time delta	—	+200ms
API costs	—	$0

The generator runs as part of prebuild and outputs a static JSON map consumed at render time.
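The generate-once, read-at-render split can be sketched like this. It is a simplified illustration: `buildCrosslinkMap` and `lookupCrosslinks` are hypothetical names, the scoring function is passed in as a parameter, quota mixing is left out, and writing the result to disk is omitted.

```typescript
// Page path -> ordered list of related page paths.
type CrosslinkMap = Record<string, string[]>;

// Build the full map once (e.g. in a prebuild script) from a pure scoring function.
function buildCrosslinkMap(
  paths: string[],
  score: (from: string, to: string) => number,
  perPage = 6
): CrosslinkMap {
  const map: CrosslinkMap = {};
  // Sorted keys keep the serialized JSON stable across runs, which keeps git diffs small.
  for (const from of [...paths].sort()) {
    map[from] = paths
      .filter((to) => to !== from)
      .map((to) => ({ to, value: score(from, to) }))
      // Score descending; ties broken by path so output never depends on input order.
      .sort((a, b) => b.value - a.value || (a.to < b.to ? -1 : a.to > b.to ? 1 : 0))
      .slice(0, perPage)
      .map((entry) => entry.to);
  }
  return map;
}

// Render-time lookup: no scoring, just a read from the prebuilt map.
function lookupCrosslinks(map: CrosslinkMap, path: string, limit = 6): string[] {
  return (map[path] ?? []).slice(0, limit);
}
```

Because the scoring function is pure and every sort has an explicit tie-breaker, serializing the map with `JSON.stringify` gives the same bytes for the same content, regardless of the order in which pages were discovered.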

Why Not Embeddings?

At 117 pages, embeddings are overkill. The controlled vocabulary approach is:

Deterministic — same input, same output, every time
Auditable — grep the vocabulary file to understand any link
Free — no API calls, no vector DB
Fast — 200ms to generate the entire map
Versionable — the JSON map is committed to git

When the site crosses ~500 pages, I will migrate to pgvector. The architecture was designed for this: consumers only read crosslink-map.json — they do not care how it was generated.

Try It

The full source is at alexandrecaramaschi.com. Open any course page, scroll to the bottom, and you will see the crosslinks in action.

Alexandre Caramaschi — CEO at Brasil GEO, former CMO at Semantix (Nasdaq), co-founder of AI Brasil. Building the practice of Generative Engine Optimization in Latin America.