DEV Community

MORINAGA

How I built the OSS alternatives directory: GitHub ETL, Turso, and the UPSERT trap I hit

When I launched three programmatic directory sites in April 2026, the open-source alternatives site had the most interesting data model. The AI tools directory indexes HuggingFace models — that's a pull from one API. The indie games directory reads Steam. But the OSS alternatives site has to answer a different question: for this SaaS product, which open-source repos actually cover the same use case, and how do they compare?

Getting that right required a two-phase ETL approach, a careful UPSERT strategy I initially got wrong, and some deliberate choices about where to use Claude Haiku and where to use a fallback template.

What the data model looks like

Three tables in Turso libSQL:

  • saas — the SaaS tool being replaced (Datadog, Notion, Figma, etc.)
  • alternatives — GitHub repos that serve the same use case, linked by saas_slug
  • saas_content — Claude-generated per-entry text: an intro, comparison notes, and migration tips

The alternatives table stores everything the GitHub API returns that matters for a directory: stars, forks, language, license, last_pushed, description. The saas_content table stores only what Claude adds — the editorial layer that turns raw repo metadata into something useful.

The full export lives in a JSON file that Astro reads at build time. No database connection at build. The ETL pipeline and the Astro build are separate processes.
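
The three tables and the build-time export can be pictured as plain TypeScript types. This is only a sketch: the column names follow the ones mentioned in this post, while `ExportEntry`, `buildExport`, and storing `migration_tips` as an array are assumptions made for illustration, not the site's actual schema.

```typescript
// Row shapes mirroring the columns named in this post (an assumed subset, not the real schema).
interface SaasRow {
  slug: string;
  name: string;
  homepage: string;
  category: string;
}

interface AlternativeRow {
  saas_slug: string;
  repo: string; // "owner/repo"
  stars: number;
  license: string | null;
  last_pushed: string; // ISO timestamp
}

interface ContentRow {
  saas_slug: string;
  intro: string;
  comparison_notes: string;
  migration_tips: string[];
  model_used: string;
}

// Hypothetical shape of one entry in the JSON file the static build reads.
interface ExportEntry extends SaasRow {
  alternatives: AlternativeRow[];
  content: ContentRow | null;
}

// Join the three tables in memory so the build never needs a database connection.
function buildExport(
  saas: SaasRow[],
  alternatives: AlternativeRow[],
  content: ContentRow[],
): ExportEntry[] {
  const contentBySlug = new Map<string, ContentRow>();
  for (const c of content) contentBySlug.set(c.saas_slug, c);
  return saas.map((s) => ({
    ...s,
    alternatives: alternatives.filter((a) => a.saas_slug === s.slug),
    content: contentBySlug.get(s.slug) ?? null,
  }));
}
```

Keeping `content` nullable in the export makes a missing content row visible to the renderer instead of hiding it behind an empty string.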

Phase 1: seeding from JSON

The first time the site runs on a new machine, there's no database. Rather than block a local build on a live GitHub API pass, I wrote a seed.ts script that bootstraps the database from a hand-curated saas.json file.

The JSON contains: SaaS name, slug, homepage, category, and a list of owner/repo strings. Stars, forks, license, and last_pushed are deliberately omitted — they'll come from the live fetch. What I do include in JSON is pre-polished content for some entries where the Claude default output was weak.
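
Because the seed file is hand-curated, it may be worth checking each entry before it reaches the database. The sketch below assumes the entry shape described above (alternatives as plain owner/repo strings); `SeedEntry`, `parseRepo`, `validateSeedEntry`, and the slug pattern are hypothetical and not taken from seed.ts.

```typescript
// Hypothetical shape of one saas.json entry, based on the fields listed above.
interface SeedEntry {
  name: string;
  slug: string;
  homepage: string;
  category: string;
  alternatives: string[]; // "owner/repo" strings
}

// Split "owner/repo" and reject anything that is not exactly two non-empty parts.
function parseRepo(full: string): { owner: string; name: string } {
  const [owner, name, ...rest] = full.split("/");
  if (!owner || !name || rest.length > 0) {
    throw new Error(`Expected "owner/repo", got "${full}"`);
  }
  return { owner, name };
}

// Collect every problem in an entry instead of stopping at the first one.
function validateSeedEntry(e: SeedEntry): string[] {
  const problems: string[] = [];
  // Assumed slug convention: lowercase letters, digits, and hyphens.
  if (!/^[a-z0-9-]+$/.test(e.slug)) problems.push(`bad slug: ${e.slug}`);
  for (const repo of e.alternatives) {
    try {
      parseRepo(repo);
    } catch (err) {
      problems.push(err instanceof Error ? err.message : String(err));
    }
  }
  return problems;
}
```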

for (const e of entries) {
  await db.execute({
    sql: `INSERT INTO saas (slug, name, homepage, category, fetched_at)
          VALUES (?, ?, ?, ?, ?)
          ON CONFLICT(slug) DO NOTHING`,
    args: [e.slug, e.name, e.homepage, e.category, now],
  });

  for (const a of e.alternatives) {
    await db.execute({
      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, ...)
            VALUES (?, ?, ?, ?, ...)
            ON CONFLICT(saas_slug, repo) DO NOTHING`,
      args: [e.slug, a.repo, a.name, a.description, ...],
    });
  }
}

DO NOTHING on conflict is correct for alternatives: once GitHub data is live, re-running the seed shouldn't clobber fresh star counts with the static values from the JSON. But for saas_content, I initially used the same DO NOTHING, and that was a mistake I'll get to below.

Phase 2: live GitHub data

fetch-alternatives.ts calls the GitHub REST API for every owner/repo in the database and upserts the live fields. Unlike the seed, this is DO UPDATE — we want fresh data.

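The loop sleeps 100ms between GitHub API calls. GitHub's REST API allows 5000 requests per hour for authenticated users, which works out to one request every 720ms if sustained for a full hour. A 100ms delay is faster than that, so the delay alone is not what keeps a run under the limit; the batch size is. A pass covers several hundred repos, far fewer than 5000, and the delay mainly avoids sending them in a single burst. Unauthenticated access is capped at 60 requests per hour, an average of one call per minute, which is completely impractical at scale. The monorepo authenticates with a secret in GitHub Actions.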
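
The quota arithmetic is easy to check in code. This is a small sketch with hypothetical helper names; the 500-repo batch in the usage note is an illustrative figure, not a measurement from the pipeline.

```typescript
// Smallest gap between requests that never exceeds an hourly quota when sustained.
function minGapMs(requestsPerHour: number): number {
  return 3_600_000 / requestsPerHour;
}

// Whether a batch of one-request-per-repo fetches fits inside a single hourly quota.
function fitsInQuota(repoCount: number, requestsPerHour: number): boolean {
  return repoCount <= requestsPerHour;
}

// Total time spent sleeping when a fixed delay separates consecutive calls.
function sleepBudgetMs(repoCount: number, delayMs: number): number {
  return Math.max(0, repoCount - 1) * delayMs;
}
```

`minGapMs(5000)` is 720 and `minGapMs(60)` is 60000. For a hypothetical 500-repo pass, `fitsInQuota(500, 5000)` is true and `sleepBudgetMs(500, 100)` is 49900, so the sleeps add about 50 seconds to the run.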

Errors per-repo are caught and logged but don't abort the batch:

for (const repoFull of s.alternatives) {
  const [owner, name] = repoFull.split("/");
  try {
    const r = await getRepo(owner, name);
    await db.execute({
      sql: `INSERT INTO alternatives (saas_slug, repo, name, description, stars,
              forks, language, license, last_pushed, url, fetched_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(saas_slug, repo) DO UPDATE SET
              description = excluded.description,
              stars = excluded.stars,
              forks = excluded.forks,
              language = excluded.language,
              license = excluded.license,
              last_pushed = excluded.last_pushed,
              fetched_at = excluded.fetched_at`,
      args: [
        s.slug, repoFull, r.name, r.description,
        r.stargazers_count, r.forks_count,
        r.language, r.license?.spdx_id ?? null,
        r.pushed_at, r.html_url, now,
      ],
    });
  } catch (err) {
    console.error(`  ! Failed ${repoFull}:`, err instanceof Error ? err.message : err);
  } finally {
    // Pause after every attempt, including failures, so errors never tighten the loop.
    await sleep(100);
  }
}

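One field worth noting: r.license?.spdx_id ?? null stores null when GitHub returns no license object for the repo. When GitHub sees a license file but can't match it to a known license, the API reports spdx_id as "NOASSERTION" rather than a real identifier. That happens more than you'd expect with non-standard licenses. I render rows without an identified license with "see repo" instead of a badge so I'm not misleading visitors about the license type.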
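
A minimal sketch of that rendering decision, assuming the fallback should cover both a missing license object and the `NOASSERTION` value GitHub uses for license files it cannot classify. `RepoLicense` and `licenseLabel` are hypothetical names, not the site's renderer.

```typescript
// Subset of the license object in GitHub's repository response.
interface RepoLicense {
  spdx_id: string | null;
}

// Map the license value to display text. "see repo" is the fallback described above;
// treating "NOASSERTION" the same way as null is an assumption of this sketch.
function licenseLabel(license: RepoLicense | null): string {
  const id = license?.spdx_id ?? null;
  if (id === null || id === "NOASSERTION") return "see repo";
  return id;
}
```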

Content generation with Claude Haiku

After the GitHub data is fresh, generate-content.ts queries for SaaS entries that either have no content row or whose model_used column is 'fallback-template' or 'seeded-from-json'. For each, it asks Claude Haiku for:

  • intro — 2 sentences on what the SaaS is and why teams seek OSS alternatives
  • comparison_notes — 2-3 sentences on actual tradeoffs (self-hosting overhead, feature gaps)
  • migration_tips — a 2-4 item array of concrete migration steps
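
Before a generated entry is stored, its shape can be checked against the three fields listed above. The sketch assumes the model is asked to reply with a single JSON object; `GeneratedContent` and `parseGeneratedContent` are hypothetical names, not taken from generate-content.ts.

```typescript
interface GeneratedContent {
  intro: string;
  comparison_notes: string;
  migration_tips: string[];
}

// Parse the model's reply and return null when it does not match the expected shape,
// so the caller can fall back instead of storing malformed content.
function parseGeneratedContent(raw: string): GeneratedContent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;
  const intro = d["intro"];
  const notes = d["comparison_notes"];
  const tips = d["migration_tips"];
  if (typeof intro !== "string" || typeof notes !== "string") return null;
  // The prompt asks for 2-4 migration steps; anything else is rejected.
  if (!Array.isArray(tips) || tips.length < 2 || tips.length > 4) return null;
  if (!tips.every((t): t is string => typeof t === "string")) return null;
  return { intro, comparison_notes: notes, migration_tips: tips };
}
```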

I use the shared Claude Haiku client with system-prompt caching here. The system prompt is identical for every call in a batch, so caching it saves input tokens on all subsequent calls. On a 50-entry pass, the cost difference is real.
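
A rough sketch of what a cached request body can look like, based on my understanding of Anthropic's Messages API, where a system prompt block is marked with `cache_control`. The shared client itself is not shown in this post, so `buildRequest`, the user prompt wording, and passing the model id as a parameter are all assumptions; the 1024 `max_tokens` value matches the cap mentioned in the FAQ.

```typescript
// The fields this sketch uses from the Messages API request body.
interface MessagesRequest {
  model: string;
  max_tokens: number;
  system: { type: "text"; text: string; cache_control?: { type: "ephemeral" } }[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Build one request. The system prompt block is marked cacheable so later calls in
// the batch can reuse it; only the user message changes per SaaS entry.
function buildRequest(model: string, systemPrompt: string, saasName: string): MessagesRequest {
  return {
    model, // model id is supplied by the caller; none is assumed here
    max_tokens: 1024,
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    // Hypothetical user prompt wording, for illustration only.
    messages: [{ role: "user", content: `Write directory content for: ${saasName}` }],
  };
}
```

Caching only takes effect when the cached prefix exceeds the model's minimum cacheable length, so a very short system prompt may see no savings.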

The fallback template — which runs when ANTHROPIC_API_KEY is absent — generates deterministic placeholder text. This matters for CI: the Astro build needs a content row for every SaaS entry. Missing content produces a blank page, which would then trigger the noindex gate I use for thin programmatic pages.
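
A deterministic fallback can be as small as a few template strings. The wording below is invented for illustration and is not the site's real template; `fallbackContent` and `pickGenerator` are hypothetical names, and "claude-haiku" is a placeholder label for the API path.

```typescript
interface FallbackContent {
  intro: string;
  comparison_notes: string;
  migration_tips: string[];
  model_used: "fallback-template";
}

// Deterministic placeholder: the same inputs always produce the same text, with no API call.
function fallbackContent(saasName: string, repos: string[]): FallbackContent {
  const listed = repos.length > 0 ? repos.join(", ") : "the repositories listed below";
  return {
    intro: `${saasName} is a hosted product. Teams often look for open-source alternatives they can run themselves.`,
    comparison_notes: `Open-source options for ${saasName} include ${listed}. Self-hosting trades subscription cost for operational work.`,
    migration_tips: [
      `Export your data from ${saasName} before switching.`,
      "Trial the alternative with a small team first.",
    ],
    model_used: "fallback-template",
  };
}

// Hypothetical selector: use the template only when no API key is configured.
function pickGenerator(apiKey: string | undefined): "claude-haiku" | "fallback-template" {
  return apiKey ? "claude-haiku" : "fallback-template";
}
```

Writing `model_used` inside the template function keeps the status and the content from drifting apart, which matters for the query described in the next section.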

The three-tier content quality ladder I described earlier puts these generated entries at the middle tier — better than the raw repo description, worse than hand-edited content.

The UPSERT trap

Original seed.ts for saas_content:

INSERT INTO saas_content (saas_slug, intro, comparison_notes, migration_tips, generated_at, model_used)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(saas_slug) DO NOTHING

That looked safe. But the problem was subtle. When I seeded with model_used = null (the original JSON had no field), generate-content.ts queried:

SELECT slug FROM saas s
LEFT JOIN saas_content c ON c.saas_slug = s.slug
WHERE c.saas_slug IS NULL
   OR c.model_used IN ('fallback-template', 'seeded-from-json')

Rows seeded with model_used = null matched neither condition. The content row existed, so c.saas_slug IS NULL was false. And in SQL, NULL IN (...) evaluates to NULL rather than true, so the second condition didn't match either. The generator skipped those rows. Separately, the seed's DO NOTHING also prevented the polished JSON content from landing wherever a fallback-template row had already been written by an earlier run.
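
The skipped rows follow from SQL's three-valued logic, which can be modelled in a few lines. This sketch mirrors the WHERE clause shown above; the helper names and the `JoinedContent` type are hypothetical.

```typescript
type Sql3 = true | false | null; // null stands for SQL's UNKNOWN

// SQL `x IN (...)`: comparing NULL with anything yields UNKNOWN, never TRUE.
function sqlIn(value: string | null, list: string[]): Sql3 {
  return value === null ? null : list.includes(value);
}

// SQL OR under three-valued logic.
function sqlOr(a: Sql3, b: Sql3): Sql3 {
  if (a === true || b === true) return true;
  if (a === null || b === null) return null;
  return false;
}

// A content row as seen through the LEFT JOIN; undefined means no row matched.
type JoinedContent = { model_used: string | null } | undefined;

// Mirrors the generator's WHERE clause. A row is selected only when the result is TRUE.
function selectedForGeneration(c: JoinedContent): boolean {
  const noRow: Sql3 = c === undefined; // c.saas_slug IS NULL
  const stale = sqlIn(c === undefined ? null : c.model_used, [
    "fallback-template",
    "seeded-from-json",
  ]);
  return sqlOr(noRow, stale) === true;
}
```

`selectedForGeneration(undefined)` and `selectedForGeneration({ model_used: "fallback-template" })` are both true, while `selectedForGeneration({ model_used: null })` is false: the row exists, and the IN comparison is UNKNOWN, so the WHERE clause filters it out.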

The fix was two parts:

  1. seed.ts now uses DO UPDATE for saas_content, not DO NOTHING. Polished JSON content always wins.
  2. The JSON schema requires model_used to be set explicitly: 'seeded-from-json' for automatic entries, 'claude-routine-polish' for hand-checked ones. The generator's WHERE clause excludes both.

The conflict clause in seed.ts now reads:

ON CONFLICT(saas_slug) DO UPDATE SET
  intro = excluded.intro,
  comparison_notes = excluded.comparison_notes,
  migration_tips = excluded.migration_tips,
  generated_at = excluded.generated_at,
  model_used = excluded.model_used

This pattern — using model_used as a status field to coordinate between ETL phases — also showed up in the AI tools directory's fallback entry upgrade work. The lesson there was the same: never let an ETL pass silently skip a row because the status field was written inconsistently.
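
One way to keep the status field consistent is to validate it at write time and make every consumer handle every value. In this sketch, three of the status values are the ones named in this post; "claude-haiku" is a hypothetical label for rows the generator wrote itself, and the regenerate decisions mirror the query shown earlier rather than any confirmed final logic.

```typescript
// Status values named in this post, plus the hypothetical "claude-haiku".
const MODEL_USED = [
  "fallback-template",
  "seeded-from-json",
  "claude-routine-polish",
  "claude-haiku",
] as const;
type ModelUsed = (typeof MODEL_USED)[number];

// Reject NULL or unknown values at write time so no later pass has to guess.
function assertModelUsed(value: unknown): ModelUsed {
  if (typeof value === "string" && (MODEL_USED as readonly string[]).includes(value)) {
    return value as ModelUsed;
  }
  throw new Error(`model_used must be one of ${MODEL_USED.join(", ")}; got ${String(value)}`);
}

// Exhaustive switch: adding a status without deciding its behaviour is a compile error.
function shouldRegenerate(status: ModelUsed): boolean {
  switch (status) {
    case "fallback-template":
    case "seeded-from-json":
      return true;
    case "claude-routine-polish":
    case "claude-haiku":
      return false;
    default: {
      const unreachable: never = status;
      throw new Error(`Unhandled status: ${String(unreachable)}`);
    }
  }
}
```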

The Astro page structure

Each SaaS entry renders as a static page at /alternatives/[saas]/. The renderer reads from saas.json, assembles a grid of alternatives sorted by stars, and inlines the Claude-generated comparison notes. Each entry shows a license badge, language indicator, and last_pushed date formatted as a relative time string.

The grid intentionally doesn't paginate at the SaaS level. I capped entries per SaaS at 8. More than that becomes noise — the directory's value is curation, not exhaustiveness. The E-E-A-T transparency pages include a methodology note explaining what that cap means for each category.
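
The sort, the cap, and the relative date can be sketched as two small functions. The `Alt` type, the function names, and the day/month/year thresholds are assumptions for illustration, not the site's component code.

```typescript
interface Alt {
  repo: string;
  stars: number;
  last_pushed: string; // ISO timestamp
}

const MAX_PER_SAAS = 8; // the cap described above

// Sort a copy by stars, highest first, and keep at most MAX_PER_SAAS entries.
function topAlternatives(alts: Alt[]): Alt[] {
  return [...alts].sort((a, b) => b.stars - a.stars).slice(0, MAX_PER_SAAS);
}

// Format an ISO timestamp relative to `now`, picking the largest unit that fits.
// Months and years are approximated as 30 and 365 days.
function relativeTime(iso: string, now: Date): string {
  const rtf = new Intl.RelativeTimeFormat("en", { numeric: "always" });
  const days = Math.round((new Date(iso).getTime() - now.getTime()) / 86_400_000);
  if (Math.abs(days) >= 365) return rtf.format(Math.round(days / 365), "year");
  if (Math.abs(days) >= 30) return rtf.format(Math.round(days / 30), "month");
  return rtf.format(days, "day");
}
```

For example, `relativeTime("2026-04-01T00:00:00Z", new Date("2026-04-04T00:00:00Z"))` returns "3 days ago". Passing `now` explicitly keeps the function deterministic, which matters when the string is baked into a static page at build time.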

What I'd change

Store raw GitHub JSON alongside derived columns. Currently the ETL keeps only derived fields: stars, forks, license, last_pushed. When I later wanted a "has_recent_releases" signal, I had to add a full new API call. If I'd kept the raw response in a JSONB/TEXT column, any field already present in the repo payload would have been one query away, for example json_extract(raw, '$.has_wiki'), with no refetch.

Add a deprecated_at field. When a repo gets deleted or renamed, the ETL call returns a 404 and the code just logs it. The row stays in the database with increasingly stale data. A deprecated_at timestamp would let the page renderer show a warning and let the content team decide whether to replace or remove the entry.
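
A sketch of how the 404 case could be separated from other failures. The `deprecated_at` column does not exist yet, and `deprecationStatement` is a hypothetical helper; it returns the same `{ sql, args }` shape that `db.execute` takes in the snippets above.

```typescript
interface Statement {
  sql: string;
  args: (string | number)[];
}

// Decide what to write after a failed fetch. Only a 404 marks the row as deprecated;
// other failures (rate limits, network errors) are left for the next run to retry.
function deprecationStatement(
  status: number,
  saasSlug: string,
  repo: string,
  now: string,
): Statement | null {
  if (status !== 404) return null;
  return {
    // COALESCE keeps the first deprecation time if the repo was already flagged.
    sql: `UPDATE alternatives
          SET deprecated_at = COALESCE(deprecated_at, ?)
          WHERE saas_slug = ? AND repo = ?`,
    args: [now, saasSlug, repo],
  };
}
```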

Parallelize generate-content with a rate-limit counter. The current sequential loop takes several minutes on a cold run with 100+ entries. Running up to 10 Haiku calls concurrently, with a shared counter that throttles at the API limit, should cut that by roughly 5-10x without changing cost.
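
The concurrency half of that idea fits in a small worker pool. This sketch caps the number of calls in flight and nothing more: throttling against the API's rate limit is not included, and `mapWithConcurrency` is a hypothetical helper rather than existing pipeline code.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i] as T, i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A call such as `mapWithConcurrency(entries, 10, generateOne)` would replace the sequential loop, where `generateOne` stands in for whatever function produces one content row. If `fn` rejects, the returned promise rejects, so per-entry errors should be caught inside `fn` to keep the batch going.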

FAQ

Why Turso instead of a hosted Postgres?
Turso's edge replicas are in the same regions as Vercel's serverless functions, so read latency is low. The cost for my usage tier is also lower than a comparable Postgres instance. The full comparison is here.

Do you need a paid GitHub plan to avoid rate limits?
No. A free personal access token gives 5000 requests per hour — enough to fetch metadata for several hundred repos in a single daily cron run. The 60/hr unauthenticated limit would not work at any meaningful scale.

How do you prevent Claude costs from escalating?
System-prompt caching amortises the per-call cost across the batch. I also set max_tokens: 1024 for each call, which caps output length. The biggest lever is the model_used status field: entries that already have good content don't get regenerated.

What happens if a GitHub repo is deleted?
Right now the row goes stale silently. The fetch fails, the error is logged, and the next build still renders the row with whatever data the last successful fetch stored. Adding a 404-specific handler that sets deprecated_at is on the backlog.


Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
