When I built the ETL pipelines for three programmatic directory sites in April — Top AI Tools (HuggingFace data), Find Games Like (Steam data), and Open Alternative To (GitHub data) — I had to figure out rate limits for three completely different APIs in the same week. The numbers, the failure modes, and the right way to handle errors are all different.
Here's what I actually shipped and the reasoning behind each number.
Steam: 250ms, deliberately aggressive
Steam's developer docs are sparse on hard rate-limit specifics. What I pieced together from community discussion and trial and error: roughly 200 requests per 5 minutes per IP on the public Web API, which works out to one request every 1.5 seconds as the safe interval. My code comment says so openly:
```typescript
await sleep(250); // Steam rate limit: ~200/5min, 1.5s is safe; 250ms is aggressive but usually fine
```
I chose 250ms anyway because the ETL runs as a nightly GitHub Actions job over ~60 game entries. At 250ms that's 15 seconds of sleep total. At 1.5 seconds it would be 90 seconds. The gap matters when the cron has three sites to process.
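The trade-off above is plain arithmetic, and a small sketch makes it easy to redo for a different quota or batch size. The function names here are hypothetical, and the 200-per-5-minutes quota is the community-derived estimate from above, not an official Steam figure.

```typescript
// Hypothetical helpers: the names are illustrative, not from the real ETL.

/** Smallest interval (ms) that stays within `maxRequests` per `windowMs`. */
function safeIntervalMs(maxRequests: number, windowMs: number): number {
  return windowMs / maxRequests;
}

/** Total time (seconds) spent sleeping across `entries` requests. */
function totalSleepSeconds(entries: number, intervalMs: number): number {
  return (entries * intervalMs) / 1000;
}

// Assumed quota: ~200 requests per 5 minutes (community estimate).
const safeMs = safeIntervalMs(200, 5 * 60 * 1000); // 1500

console.log(totalSleepSeconds(60, 250));    // 15
console.log(totalSleepSeconds(60, safeMs)); // 90
```

Plugging in a larger catalog shows where the aggressive interval stops paying for itself: the saving grows linearly with the entry count, but so does the number of chances to trip the limit.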
The acceptable risk: Steam doesn't hard-ban on the first rate-limit violation. It returns HTTP 429 and the job logs the error. The games ETL treats review-endpoint failures as non-fatal. The game row is still written, and only the review stats are missing until the next run:
```typescript
try {
  const r = await getAppReviewSummary(appid);
  // ... write to DB
} catch (err) {
  reviewsFailed++;
  console.error(`! Review fetch failed for appid ${appid}:`, err);
}
```
The reviewsFailed counter appears in the job log. If I see it climbing consistently, that's the signal to increase the sleep interval. So far I haven't needed to.
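To make the "watch the counter" rule concrete, here is a minimal sketch of a paced loop that counts failures and flags when the failure share gets high. It is not the ETL's actual code: `fetchPaced`, `shouldSlowDown`, and the 10% threshold are assumptions for illustration, and `fetchOne` stands in for a call like `getAppReviewSummary`.

```typescript
// Sketch only: helper names and the default threshold are assumptions.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

interface BatchStats {
  succeeded: number;
  failed: number;
}

async function fetchPaced<T>(
  ids: number[],
  fetchOne: (id: number) => Promise<T>,
  intervalMs: number,
  onResult: (id: number, value: T) => void,
): Promise<BatchStats> {
  const stats: BatchStats = { succeeded: 0, failed: 0 };
  for (const id of ids) {
    try {
      onResult(id, await fetchOne(id));
      stats.succeeded++;
    } catch (err) {
      // Non-fatal: log, count, and move on to the next entry.
      stats.failed++;
      console.error(`! Fetch failed for id ${id}:`, err);
    }
    await sleep(intervalMs);
  }
  return stats;
}

/** True when the share of failed calls exceeds `threshold`. */
function shouldSlowDown(stats: BatchStats, threshold = 0.1): boolean {
  const total = stats.succeeded + stats.failed;
  return total > 0 && stats.failed / total > threshold;
}
```

Logging the result of `shouldSlowDown` at the end of a run turns "if I see it climbing" into an explicit line in the job output, though the right threshold depends on how much missing data a site can tolerate.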
GitHub: 100ms, with authentication doing the real work
GitHub's REST API is explicit about limits: 60 requests per hour unauthenticated, 5,000 per hour with a personal access token. The GitHub docs on rate limiting cover both the primary limit and the secondary limits for specific endpoint categories. The OSS alternatives ETL makes one GET /repos/:owner/:repo call per alternative project — roughly 3–5 repos per SaaS tool in the seed data. Even a large seed run of 50 tools with 5 alternatives each is only 250 requests.
The sleep is there as a politeness interval, but authentication is doing the real rate-limit work:
```typescript
function authHeaders(): Record<string, string> {
  const token = process.env.GITHUB_TOKEN;
  const base: Record<string, string> = {
    Accept: "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
  };
  if (token) base.Authorization = `Bearer ${token}`;
  return base;
}
```
GITHUB_TOKEN is set in GitHub Actions from a repository secret. Without it, 60 requests per hour would exhaust in under a minute for a full seed run. With it, the 5,000/hour ceiling gives comfortable headroom.
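GitHub reports the remaining quota on every response through its `x-ratelimit-*` headers, so the headroom can be checked rather than assumed. The following is a rough sketch, not part of the ETL described here; `readRateLimit` and `msUntilReset` are hypothetical helper names.

```typescript
// Sketch: reads GitHub's x-ratelimit-* response headers.
// `readRateLimit` and `msUntilReset` are hypothetical helper names.
interface RateLimitInfo {
  limit: number;
  remaining: number;
  resetAtMs: number; // x-ratelimit-reset is given in UTC epoch seconds
}

function readRateLimit(headers: Headers): RateLimitInfo | null {
  const limit = headers.get("x-ratelimit-limit");
  const remaining = headers.get("x-ratelimit-remaining");
  const reset = headers.get("x-ratelimit-reset");
  if (limit === null || remaining === null || reset === null) return null;
  return {
    limit: Number(limit),
    remaining: Number(remaining),
    resetAtMs: Number(reset) * 1000,
  };
}

/** How long to wait before the next call; 0 while quota remains. */
function msUntilReset(info: RateLimitInfo, nowMs: number = Date.now()): number {
  if (info.remaining > 0) return 0;
  return Math.max(0, info.resetAtMs - nowMs);
}
```

Called as `readRateLimit(response.headers)` after a `fetch`, this also doubles as a configuration check: a `limit` of 60 on the response is a quick sign that the token never reached the request.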
One subtlety: GitHub has two separate rate limits, the core REST API limit (5,000/hour authenticated) and the search API limit (10 requests per minute unauthenticated, 30 per minute authenticated for most search endpoints). The current ETL uses GET /repos/:owner/:repo directly, not search, so the looser core limit applies. If I ever switch to search-based discovery, the math changes.
HuggingFace: no sleep, because none is needed
The model registry API (listing models, fetching model metadata) has no hard documented rate limit that I'm aware of, and I haven't hit one in weeks of nightly runs. The ETL fetches up to 100 models in one GET /api/models?limit=100&sort=downloads call, then one detailed fetch per model. Roughly 100 rapid-fire requests, no sleep, no 429s.
Part of this is that requests are authenticated with the token from HUGGINGFACE_TOKEN, which raises whatever ceiling exists. Part of it is that the registry API is explicitly designed for automated tooling at batch scale: it's the primary way model cards, metadata scrapers, and leaderboard tools consume the catalog.
```typescript
function authHeaders(): Record<string, string> {
  const token = process.env.HUGGINGFACE_TOKEN;
  return token ? { Authorization: `Bearer ${token}` } : {};
}
```
If I scale to 1,000 models per nightly fetch I'd add a 50ms sleep as a precaution. For 100, the simplest thing that works is also the correct thing.
A comparison
| API | Sleep | Auth impact | Failure mode | Fatal? |
|---|---|---|---|---|
| Steam appdetails | 250ms | None (public) | 429, occasional | Non-fatal |
| Steam reviews | 250ms (shared) | None (public) | 429, more frequent | Non-fatal |
| GitHub REST | 100ms | 60→5,000/hr | 403, clear message | Non-fatal |
| HuggingFace registry | None | Raises ceiling | Rare 429 | Non-fatal |
All four code paths are non-fatal. A 429 or connection error anywhere in the batch writes a fallback-template row to Turso and increments a counter. The content upgrade loop picks up any gaps the next night.
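The fallback behavior described above can be sketched as a wrapper that never throws. This is an illustration under assumptions, not the real implementation: `DirectoryRow`, `fetchOrFallback`, the `degraded` flag, and the placeholder text are all hypothetical, and the actual write to Turso is left out.

```typescript
// Sketch: all names and the row shape are hypothetical;
// the database write is intentionally omitted.
interface DirectoryRow {
  slug: string;
  description: string;
  degraded: boolean; // true when the row came from the fallback template
}

interface FallbackCounter {
  count: number;
}

async function fetchOrFallback(
  slug: string,
  fetchRow: (slug: string) => Promise<DirectoryRow>,
  counter: FallbackCounter,
): Promise<DirectoryRow> {
  try {
    return await fetchRow(slug);
  } catch (err) {
    // A 429 or connection error lands here: count it, keep the batch going.
    counter.count++;
    console.error(`! Fetch failed for ${slug}, using fallback:`, err);
    return {
      slug,
      description: `Details for ${slug} are coming soon.`,
      degraded: true,
    };
  }
}
```

Marking fallback rows explicitly, here with an assumed `degraded` flag, is one way a later run could find the gaps it needs to fill.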
The pattern that matters
The sleep interval is a guess. What actually protects the ETL from being useless after a rate-limit event is that failures are cheap. Every external API call in this stack is wrapped in a try/catch that writes degraded content rather than crashing the batch. The sleep interval controls how likely you are to hit a rate limit; the fallback chain controls what happens when you do.
For indie-scale ETL — tens to hundreds of entries per night — the combination of a conservative-ish sleep and a non-fatal error path is enough. If the site grows to thousands of entries per run, I'd rethink both: moving to a queue-bounded concurrent fetcher with exponential backoff, and separating the content generation from the data fetch into stages that can be retried independently.
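For a sense of what that rethink might look like, here is a minimal sketch of exponential backoff combined with a bounded-concurrency runner. None of this exists in the current ETL; the names, the defaults, and the lack of jitter are all assumptions.

```typescript
// Sketch only: names, defaults, and the jitter-free backoff are assumptions.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Retries `task`, waiting baseMs, 2*baseMs, 4*baseMs, ... between attempts. */
async function withBackoff<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await delay(baseMs * 2 ** attempt);
    }
  }
  throw lastError;
}

type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

/** Runs `worker` over `items` with at most `concurrency` calls in flight. */
async function mapBounded<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<R>,
): Promise<Settled<R>[]> {
  const results = new Array<Settled<R>>(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so lanes never share an index
      try {
        results[i] = { ok: true, value: await worker(items[i]) };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }
  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

The two compose as `mapBounded(ids, 4, (id) => withBackoff(() => fetchOne(id)))`, where `fetchOne` is a placeholder for the real fetch. Because each item settles independently, one exhausted retry budget still leaves the rest of the batch intact, which keeps the non-fatal property the current design relies on.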
Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.