I started this six months ago because nobody else seemed to. Hedge funds spent the last decade extracting alpha from satellite imagery, credit-card panels, parking-lot photos. The venture-capital equivalent — public engineering activity on GitHub — was sitting in plain sight, and most institutional sourcing teams I knew still ran on Crunchbase, warm intros, and Twitter. So I built a crawler.
That sentence is short. The reality wasn't.
The first crawler melted my Postgres pool
The first version was a Python script that hit /orgs/{org}/events for every org on the list, every hour, with a single connection. It worked for 80 orgs. By the time I'd seeded 1,200 orgs into the watchlist, I was hitting GitHub's secondary rate limits inside 12 minutes and Postgres was drowning in connections. The script was opening a new connection for every API response, and none of them were recycling, because I'd written psycopg.connect() inside a loop instead of using a connection pool. Standard mistake. Embarrassing mistake.
The fix wasn't more connections — it was fewer requests.
GitHub Archive (gharchive.org) publishes every public event from GitHub as hourly JSON dumps. Every PushEvent, every PullRequestEvent, every CreateEvent, every WatchEvent. You don't have to ask GitHub for them — you download a 100MB gzipped JSONL file per hour and stream it. For a watchlist of 4,200 orgs, that's two orders of magnitude less work than polling.
The new pipeline:
# Hourly cron, runs at :03 to give Archive time to publish.
# $ORGS is a JSON array of watchlist org names, e.g. '["acme","globex"]'.
# Note: Archive hour suffixes are unpadded (…-0 through …-23), hence %-H.
HOUR=$(date -u -d '1 hour ago' +%Y-%m-%d-%-H)
curl -s "https://data.gharchive.org/${HOUR}.json.gz" \
  | gunzip \
  | jq -r --argjson orgs "$ORGS" '
      select(.repo.name | split("/")[0] | IN($orgs[]))
      | [.created_at, (.repo.name | split("/")[0]), .repo.name,
         .actor.login, .type, (.payload | tostring)] | @csv' \
  | psql -c "COPY events_raw (ts, org, repo, actor, event_type, payload) FROM STDIN WITH (FORMAT csv);"
It's not pretty. It works. Hourly batches mean I'm always at most an hour behind real time, and I can backfill the whole previous year in an afternoon if I need to re-run the backtest.
The schema is two tables
I tried elaborate schemas first. Junction tables, contributor graphs, separate stores for each event type. None of them paid for themselves. What I actually use, six months in, is two tables and a materialized view.
CREATE TABLE events_raw (
    ts         timestamptz NOT NULL,
    org        text        NOT NULL,
    repo       text        NOT NULL,
    actor      text        NOT NULL,
    event_type text        NOT NULL,
    payload    jsonb,
    PRIMARY KEY (ts, org, repo, actor, event_type)
);
CREATE INDEX idx_events_org_ts ON events_raw (org, ts DESC);
CREATE TABLE orgs_watchlist (
    org      text PRIMARY KEY,
    sector   text NOT NULL,
    added_at date NOT NULL,
    notes    text
);
The materialized view rolls up the per-org metrics weekly. Refreshing it concurrently takes about 90 seconds against six months of data. I rebuild it Sunday nights so Monday's report is fresh.
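The view definition itself is not exotic. Here's a minimal sketch of its shape, with an assumed name (org_weekly) and metric columns chosen for illustration; the production version tracks the rolling windows described in the next section rather than a plain weekly grain:
-- Sketch of the weekly rollup; the view name (org_weekly) and column
-- choices are illustrative, not the exact production definition.
CREATE MATERIALIZED VIEW org_weekly AS
SELECT
    org,
    date_trunc('week', ts) AS week,
    count(*) FILTER (WHERE event_type = 'PushEvent') AS pushes,
    count(DISTINCT actor) AS contributors,
    count(*) FILTER (WHERE event_type = 'CreateEvent'
                     AND payload->>'ref_type' = 'repository') AS new_repos
FROM events_raw
GROUP BY 1, 2;

-- REFRESH ... CONCURRENTLY requires a unique index to diff against.
CREATE UNIQUE INDEX idx_org_weekly ON org_weekly (org, week);
REFRESH MATERIALIZED VIEW CONCURRENTLY org_weekly;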
That's it. No graph database. No data lake. No Airflow. The whole stack runs on a single Postgres instance with about 18GB of data, and the weekly report is a single SQL query I read in a terminal.
What the signal looks like, end-to-end
The single most predictive feature is not commit volume — it's commit velocity change. A startup that ships 200 commits a week and continues to ship 200 a week tells me nothing. A startup that goes from 80 to 240 inside 14 days tells me something organizational has changed.
I track three derivative metrics per org:
- Commit velocity over a rolling 14-day window
- Contributor delta over a rolling 30-day window
- New-repo creation rate over a rolling 30-day window
When all three accelerate inside the same fortnight, each one breaking a z-score threshold computed against its own org-specific six-month baseline, I classify the org as "accelerating." In a backtest across Q3 and Q4 2025, roughly 70% of accelerating orgs announced a fundraise within six weeks. The lead time was 3-6 weeks for Series A and shorter for late-stage rounds.
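Concretely, the acceleration check is one window query. A sketch against the org_weekly view sketched above; the 2.0 threshold and 26-week baseline are illustrative:
-- Sketch: flag weeks where an org breaks above its own baseline on all
-- three metrics. Threshold (2.0) and window (26 weeks) are illustrative.
WITH stats AS (
    SELECT org, week, pushes, contributors, new_repos,
           avg(pushes)       OVER w AS mu_p, stddev_samp(pushes)       OVER w AS sd_p,
           avg(contributors) OVER w AS mu_c, stddev_samp(contributors) OVER w AS sd_c,
           avg(new_repos)    OVER w AS mu_r, stddev_samp(new_repos)    OVER w AS sd_r
    FROM org_weekly
    WINDOW w AS (PARTITION BY org ORDER BY week
                 ROWS BETWEEN 26 PRECEDING AND 1 PRECEDING)  -- ~six months, excluding current week
)
SELECT org, week
FROM stats
WHERE (pushes       - mu_p) / NULLIF(sd_p, 0) > 2.0
  AND (contributors - mu_c) / NULLIF(sd_c, 0) > 2.0
  AND (new_repos    - mu_r) / NULLIF(sd_r, 0) > 2.0;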
Four signal flavors, not one
Not every acceleration looks the same. After eyeballing a few hundred signal firings I started classifying them into four shapes:
Engineering hiring burst. Contributor count jumps 40%+ inside 30 days. Often pre-Series A. Term sheet has been signed; the new engineers are pushing first commits. I catch this earlier than LinkedIn employee counts because contributions land before the new hires update their job titles.
Infrastructure buildout. Commits to ops, infra, deploy, observability repos spike. Company is preparing to scale. Usually accompanies a Series A or B and the go-to-market push that follows.
Deploy frequency spike. Commits per day double. Often a launch run-up. Sometimes followed by a fundraise where the metrics make the deck.
Framework migration. Team migrating to a new stack — Next.js, Bun, a new ORM, a fresh CI. Often 60-120 days before a Series A. The engineering equivalent of cleaning your apartment before parents visit.
The classification matters because the same numeric "accelerating" label can mean very different things. A hiring burst is a sourcing signal — go talk to the founder now, before the round closes. A framework migration is a watch signal — set a 60-day alarm and see if a round materializes.
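A first cut of the flavor label can come from the same numbers. A toy sketch, assuming a hypothetical latest_week_scores relation holding each org's most recent z-scores plus an infra_commit_share column; framework migrations don't fall out of this and still get caught by eye:
-- Toy flavor classifier; thresholds, column names, and the infra
-- heuristic are illustrative stand-ins, not the production rules.
SELECT org,
    CASE
        WHEN z_contributors > 2.0     THEN 'hiring burst'
        WHEN infra_commit_share > 0.5 THEN 'infrastructure buildout'
        WHEN z_pushes > 2.0           THEN 'deploy frequency spike'
        ELSE 'unclassified'
    END AS flavor
FROM latest_week_scores;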
Where the signal fails
Honesty pass: it's bad for AI-pure startups. They commit constantly regardless of stage. Signal-to-noise is poor. I exclude AI-only orgs from the strongest classification tier and weight contributor and repo signals more heavily for them.
It's also useless for stealth startups. If the company doesn't open-source anything, GitHub gives me nothing. About 18% of the orgs I'd seeded turned out to fit this profile and got dropped from the watchlist after the first month.
And the signal is not investment advice. It tells you who to talk to. It does not tell you who to wire money to. A founder conversation, product evaluation, market analysis, and competitive teardown all still have to happen. Engineering velocity is a sourcing signal, not a thesis.
The part I got wrong
For the first three months I weighted commit count more heavily than commit velocity change. I assumed high-volume orgs were higher-quality leads. They weren't. They were just bigger. The orgs that ended up raising weren't the loudest in absolute terms — they were the ones whose own quiet baseline suddenly broke. Once I switched to per-org z-scores, the noise dropped and the small-team signals rose to the top.
The other thing I got wrong was treating star count as a proxy for anything. Vanity. A 30,000-star repo means it had a viral moment. Sometimes that moment was three years ago and the team has shipped nothing since. Star count is now in my dataset only because removing fields is harder than ignoring them.
What I do with this every week
Sunday night the materialized view refreshes. Monday morning I run a single query that surfaces orgs with two or more accelerating signals. The output is usually 10-30 candidates. I open each one, skim the last week of commits to make sure it's real product work and not version bumps, and write a one-line note per company. The notes go in a doc that's been growing since November. I publish the strongest ones to a public watchlist and let people verify the predictions themselves.
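That Monday query is short enough to quote. A sketch, reusing the hypothetical latest_week_scores relation from the classifier above, with two of three signals as the bar:
-- Monday report: orgs where two or more signals broke threshold last week.
SELECT org, signals_firing
FROM (
    SELECT org,
           (z_pushes > 2.0)::int
         + (z_contributors > 2.0)::int
         + (z_new_repos > 2.0)::int AS signals_firing
    FROM latest_week_scores
) s
WHERE signals_firing >= 2
ORDER BY signals_firing DESC, org;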
That last part is the unfair part. If the signal is real, the only way to prove it is to publish dated predictions and let them age. Six months of public dated predictions is what built my confidence in the methodology more than any backtest. The backtest tells you what would have worked. The public watchlist tells you what did.
The crawler still runs hourly. Postgres is still the only database. The Sunday refresh still takes 90 seconds. None of it is fancy. The interesting work was figuring out which two of five signals had to overlap before I'd trust any of them — and that's not infrastructure, it's pattern discipline.
Originally published at signals.gitdealflow.com. The weekly Signal Report is free — no paywall, no account.