- I generate stable job URLs from messy scraped data.
- I avoid slug collisions with a deterministic hash.
- I enforce uniqueness in Postgres, not in app code.
- I handle “same job, new source URL” without 404s.
Context
My job board ingests 200+ jobs daily. Multiple sources. Same role. Different URLs. Different punctuation. Sometimes different company names too.
At first, I used a normal slug.
```
/jobs/psychiatric-nurse-practitioner-remote
```
Brutal idea.
I got collisions within a week. Two companies. Same title. Same city. Boom. Wrong job on the wrong page.
Then I tried “just append the company.” Still collisions. “Behavioral Health” vs “Behavioral-Health” vs “Behavioral Health Inc”. Scrapers aren’t polite.
So I switched to stable IDs. Deterministic. Based on a canonical fingerprint. Stored in Postgres. One URL per real-world job.
This post is that setup. Next.js 14 + Supabase + Postgres.
1) I stopped trusting slugs from scraped text
Slugs are for humans. IDs are for computers.
I still want readable URLs. But the stable part has to be machine-generated.
So I build two fields:

- `slug`: pretty, derived from title + company
- `public_id`: stable, derived from a fingerprint hash

`public_id` becomes the real key in the URL.
Here are the exact functions I use in my ingestion worker. Node runtime. No weird deps.
```ts
// lib/jobIdentity.ts
import crypto from "node:crypto";

export function slugify(input: string): string {
  return input
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // strip accents
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/(^-|-$)/g, "")
    .slice(0, 80);
}

// Keep it boring. Deterministic. No timestamps.
export function jobFingerprint(fields: {
  title: string;
  company: string;
  city?: string | null;
  state?: string | null;
  country?: string | null;
  employmentType?: string | null;
}): string {
  const norm = (s?: string | null) => (s ?? "").trim().toLowerCase();

  // Don't include source_url. It changes all the time.
  // Don't include salary. Scrapers lie.
  const payload = [
    norm(fields.title),
    norm(fields.company),
    norm(fields.city),
    norm(fields.state),
    norm(fields.country),
    norm(fields.employmentType),
  ].join("|");

  return crypto.createHash("sha256").update(payload).digest("hex");
}

export function publicIdFromFingerprint(fingerprint: string): string {
  // 12 bytes => 24 hex chars. Short enough for URLs.
  return fingerprint.slice(0, 24);
}
```
I learned the hard way that “include everything” makes IDs unstable.
If you put `source_url` into the hash, every time a site adds `?utm=...`, you create a "new job". Your listings inflate. Your dedupe fails silently.
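To sanity-check the stability claim, here's a standalone sketch (it inlines the same trim-and-lowercase normalization as `jobFingerprint`, with the unused fields left blank) showing that casing and whitespace noise collapse to one fingerprint:

```typescript
import crypto from "node:crypto";

// Same normalization as jobFingerprint: trim + lowercase each field.
const norm = (s?: string | null) => (s ?? "").trim().toLowerCase();

function fingerprint(title: string, company: string, city?: string): string {
  const payload = [norm(title), norm(company), norm(city), "", "", ""].join("|");
  return crypto.createHash("sha256").update(payload).digest("hex");
}

// Two scrapers, two messy variants of the same posting:
const a = fingerprint("Psychiatric Nurse Practitioner", "Behavioral Health Inc", "Austin");
const b = fingerprint("  psychiatric nurse practitioner ", "BEHAVIORAL HEALTH INC", "austin");

console.log(a === b); // true — same job, same identity
```

Note the limit, though: this normalization will not collapse "Behavioral Health" vs "Behavioral Health Inc". That kind of company-name canonicalization has to happen upstream, before hashing.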
2) I made the URL format boring on purpose
My job pages are:
```
/jobs/{public_id}/{slug}
```
The slug can change later. The public_id can’t.
That solves two things:

- Collisions stop. `public_id` is unique.
- Renames stop breaking links. Only the tail changes.
In Next.js App Router, I parse it like this.
```ts
// app/jobs/[publicId]/[slug]/page.tsx
import { notFound, permanentRedirect } from "next/navigation";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);

function buildSlug(title: string, company: string) {
  return `${title} ${company}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/(^-|-$)/g, "")
    .slice(0, 80);
}

export default async function JobPage({
  params,
}: {
  params: Promise<{ publicId: string; slug: string }>;
}) {
  const { publicId, slug } = await params;

  const { data: job, error } = await supabase
    .from("jobs")
    .select("public_id,title,company,description,updated_at")
    .eq("public_id", publicId)
    .maybeSingle();

  if (error) throw error;
  if (!job) notFound();

  const canonicalSlug = buildSlug(job.title, job.company);
  if (slug !== canonicalSlug) {
    // Permanent (308) canonicalization. Stops duplicate pages.
    permanentRedirect(`/jobs/${job.public_id}/${canonicalSlug}`);
  }

  return (
    <main>
      <h1>{job.title}</h1>
      <p>{job.company}</p>
      <article>{job.description}</article>
    </main>
  );
}
```
Yeah, I’m using service role on the server.
If you don’t like that, use Supabase SSR helpers + RLS. I do that in other parts. For this page, I wanted predictable reads and zero client-side fetch.
Also: that canonical redirect is doing SEO work. It prevents `/jobs/abc/old-slug` from becoming a second indexed page. One detail worth getting right: Next's `redirect()` responds with a temporary 307, while `permanentRedirect()` responds with a 308, which is the signal you actually want for canonicalization.
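If you want belt-and-suspenders on top of the redirect, a `generateMetadata` export in the same route file can emit a canonical URL. A sketch, assuming a `SITE_URL` env var that isn't in the post:

```typescript
// app/jobs/[publicId]/[slug]/page.tsx (sketch, same route file as above)
// Emits <link rel="canonical"> alongside the redirect logic.
export async function generateMetadata({
  params,
}: {
  params: Promise<{ publicId: string; slug: string }>;
}) {
  const { publicId, slug } = await params;
  const base = process.env.SITE_URL ?? "https://example.com";
  // Non-canonical slugs never render a page (the component redirects first),
  // so the requested slug here is effectively the canonical one.
  return { alternates: { canonical: `${base}/jobs/${publicId}/${slug}` } };
}
```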
3) I pushed uniqueness into Postgres, not my scraper
I tried to “dedupe in code.”
Spent 4 hours on this. Most of it was wrong.
Two workers ran at the same time. Same job. Both saw “not exists yet.” Both inserted. Classic race.
So I let Postgres do its job.
I store:
- `fingerprint` (full sha256 hex)
- `public_id` (short stable id)
And enforce uniqueness.
```sql
-- migrations/20260407_jobs_identity.sql
create extension if not exists pgcrypto;

create table if not exists jobs (
  id bigserial primary key,
  public_id text not null,
  fingerprint text not null,
  title text not null,
  company text not null,
  city text,
  state text,
  country text,
  employment_type text,
  source_url text,
  description text,
  posted_at timestamptz,
  updated_at timestamptz not null default now()
);

-- Stable identity
create unique index if not exists jobs_public_id_ux on jobs (public_id);
create unique index if not exists jobs_fingerprint_ux on jobs (fingerprint);

-- Helpful for common queries
create index if not exists jobs_updated_at_idx on jobs (updated_at desc);
```
Then ingestion becomes “upsert by fingerprint”.
If the job already exists, I update mutable fields like source_url, description, updated_at.
```ts
// scripts/ingestOne.ts
import { createClient } from "@supabase/supabase-js";
import {
  jobFingerprint,
  publicIdFromFingerprint,
  slugify,
} from "../lib/jobIdentity";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

type ScrapedJob = {
  title: string;
  company: string;
  city?: string | null;
  state?: string | null;
  country?: string | null;
  employmentType?: string | null;
  sourceUrl: string;
  description: string;
};

export async function ingestOne(j: ScrapedJob) {
  const fingerprint = jobFingerprint({
    title: j.title,
    company: j.company,
    city: j.city,
    state: j.state,
    country: j.country,
    employmentType: j.employmentType,
  });

  const publicId = publicIdFromFingerprint(fingerprint);

  const { error } = await supabase.from("jobs").upsert(
    {
      public_id: publicId,
      fingerprint,
      title: j.title,
      company: j.company,
      city: j.city ?? null,
      state: j.state ?? null,
      country: j.country ?? null,
      employment_type: j.employmentType ?? null,
      source_url: j.sourceUrl,
      description: j.description,
      updated_at: new Date().toISOString(),
    },
    {
      onConflict: "fingerprint", // the important part
    }
  );

  if (error) throw error;

  return {
    publicId,
    urlSlug: slugify(`${j.title} ${j.company}`),
  };
}
```
This is the whole trick.
I don’t care if 3 scrapers hit the same job. Postgres resolves it.
One thing that bit me: don't use `onConflict: "public_id"`.
If you ever change how you generate `public_id` (even accidentally), you'll create duplicates. The fingerprint is the real identity. Keep that invariant.
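For intuition, the upsert the client issues resolves in Postgres to roughly this (a sketch with placeholder values, not the literal generated SQL):

```sql
-- Two concurrent workers can both run this; the unique index on
-- fingerprint serializes them, and the loser becomes an update.
insert into jobs (public_id, fingerprint, title, company, source_url, description, updated_at)
values ('a1b2c3d4e5f60718293a4b5c', '<full sha256 hex>', 'Psychiatric NP', 'Acme Health', 'https://example.com/jobs/1', '...', now())
on conflict (fingerprint) do update
set source_url  = excluded.source_url,
    description = excluded.description,
    updated_at  = excluded.updated_at;
```

This is why the race from section 3 disappears: the second insert doesn't fail or duplicate, it just turns into the update branch.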
4) I kept old URLs alive when the “identity” changes
Sometimes the fingerprint really does change.
Example: a source switches from “NYC” to “New York”. Or the company name gets normalized later. Or I fix a bug in my canonicalization.
If the fingerprint changes, public_id changes.
That’s a broken link.
So I store aliases. Old public_id values that redirect to the new one.
This saved me after I changed my normalization rules and accidentally invalidated 317 URLs. I noticed because my own smoke test started hitting 404.
Schema + redirect lookup.
```sql
-- migrations/20260407_job_aliases.sql
create table if not exists job_aliases (
  old_public_id text primary key,
  new_public_id text not null references jobs(public_id) on delete cascade,
  created_at timestamptz not null default now()
);

create index if not exists job_aliases_new_idx on job_aliases (new_public_id);
```
And in Next.js, I check aliases before I give up.
```ts
// app/jobs/[publicId]/[slug]/page.tsx
import { notFound, permanentRedirect } from "next/navigation";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export default async function JobPage({
  params,
}: {
  params: Promise<{ publicId: string; slug: string }>;
}) {
  const { publicId, slug } = await params;

  const { data: job } = await supabase
    .from("jobs")
    .select("public_id,title,company,description")
    .eq("public_id", publicId)
    .maybeSingle();

  if (job) {
    const canonicalSlug = `${job.title} ${job.company}`
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/(^-|-$)/g, "")
      .slice(0, 80);

    if (slug !== canonicalSlug) {
      permanentRedirect(`/jobs/${job.public_id}/${canonicalSlug}`);
    }

    return (
      <main>
        <h1>{job.title}</h1>
        <p>{job.company}</p>
        <article>{job.description}</article>
      </main>
    );
  }

  // Not found? Try alias redirect.
  const { data: alias } = await supabase
    .from("job_aliases")
    .select("new_public_id")
    .eq("old_public_id", publicId)
    .maybeSingle();

  if (alias?.new_public_id) {
    permanentRedirect(`/jobs/${alias.new_public_id}/${slug}`); // slug will canonicalize next
  }

  notFound();
}
```
That last redirect looks weird. It is.
I redirect to the new public_id first, keeping whatever slug the user had. Then the canonical slug redirect runs. Two hops.
I kept it because it’s simple and it works.
If you want one hop, you can fetch the job by new_public_id and build the canonical URL immediately.
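That one-hop variant boils down to a lookup order you can sketch as pure logic. Here, maps stand in for the two Postgres tables, and `resolveJobUrl` is a name of my own, not from the post:

```typescript
type Job = { publicId: string; title: string; company: string };

const slugOf = (j: Job) =>
  `${j.title} ${j.company}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/(^-|-$)/g, "")
    .slice(0, 80);

// Resolve a requested id through aliases first, then build the final
// canonical URL directly: one redirect instead of two.
function resolveJobUrl(
  requestedId: string,
  jobs: Map<string, Job>,
  aliases: Map<string, string>
): string | null {
  const id = jobs.has(requestedId) ? requestedId : aliases.get(requestedId);
  const job = id ? jobs.get(id) : undefined;
  return job ? `/jobs/${job.publicId}/${slugOf(job)}` : null;
}

const jobs = new Map([["newid111", { publicId: "newid111", title: "Psych NP", company: "Acme" }]]);
const aliases = new Map([["oldid222", "newid111"]]);

console.log(resolveJobUrl("oldid222", jobs, aliases)); // "/jobs/newid111/psych-np-acme"
```

In production this is one extra `select` by `new_public_id` before redirecting; the trade is a second query on the alias path in exchange for saving the browser a hop.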
Results
Before stable IDs, I had 41 collisions in my jobs table in one week. Same slug. Different jobs. Users reported “this page changed overnight” twice. That was enough.
After switching to public_id + slug, collisions dropped to 0 because the URL key stopped being user-visible text. I also fixed 317 broken URLs after a normalization change by adding job_aliases redirects, instead of letting them 404.
The board is sitting at 8,000+ active listings from 2,000+ companies. URLs are boring now. That’s the point.
Key takeaways
- Make the stable part of the URL deterministic, not pretty.
- Hash a canonical fingerprint. Don’t hash source URLs.
- Put uniqueness constraints in Postgres, or concurrency will humble you.
- Treat slug as display-only. Enforce canonical redirects in Next.js.
- When you change identity rules, keep aliases so old links still resolve.
Closing
If you’re scraping jobs (or listings, or products), URL stability becomes an SEO problem fast.
How do you build your fingerprint: which fields do you include, and which ones did you remove after getting burned by unstable IDs?