Why finding a GitHub user's email is harder than you'd think

#github #api #opensource #security

You've found a contributor whose work you depend on. The maintainer of a package you use, a developer who fixed something for you upstream, the researcher who reported a CVE you need to coordinate on. You have their GitHub username. You'd like their email.

You'd think this would be a GET away. It isn't. Here's why — and what it actually takes to find one.

The GitHub API doesn't have it

GET /users/:login returns an email field. For the vast majority of users, that field is null.

GitHub flipped private-by-default years ago. When you sign up today, your commit email is set to <id>+<login>@users.noreply.github.com and the public profile email field is empty. Older accounts that opted in still expose addresses, but they're a minority — and the people you actually want to reach (active maintainers, security-conscious developers) are exactly the ones who turned this off.
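
To see this for yourself, here is a minimal sketch that reads the email field from a profile payload. The helper names (fetch_profile, public_email) and the sample payload are illustrative assumptions, not anything GitHub ships; the live call is unauthenticated and therefore rate-limited.

```ruby
require "json"
require "net/http"
require "uri"

# Fetch the raw profile JSON for a login (live, unauthenticated request).
def fetch_profile(login)
  uri = URI("https://api.github.com/users/#{URI.encode_www_form_component(login)}")
  Net::HTTP.get(uri)
end

# Return the public email from a profile payload, or nil when it is hidden.
def public_email(profile_json)
  email = JSON.parse(profile_json)["email"]
  (email.nil? || email.strip.empty?) ? nil : email
end

# A trimmed, made-up payload in the shape the endpoint returns.
sample = '{"login": "some-maintainer", "id": 12345678, "email": null}'
puts public_email(sample).inspect   # prints nil
```

Against the live API this would be public_email(fetch_profile("some-login")), and for most logins the result is still nil.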

So that's out.

The commits don't have it either (mostly)

The next obvious move: look at the user's commits. Every commit has an author email in its metadata. Pick a public repo, fetch the commit, get the email.

curl https://api.github.com/repos/torvalds/linux/commits | jq '.[0].commit.author.email'
# "torvalds@linux-foundation.org"

That works for Linus. It does not work for most people. Run this against any reasonably modern repo and you'll see a lot of:

"49699333+dependabot[bot]@users.noreply.github.com"
"12345678+somecontributor@users.noreply.github.com"

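GitHub uses the noreply form for commits made through the web interface whenever the author has the "Keep my email addresses private" setting on, which is the default. Most people with that setting on also configure their local git with the same address, and GitHub can reject pushes that would expose a private email. The <id>+<login> part is the user's numeric GitHub ID and login, which is useful if all you wanted was to identify them, but you already had their login. You wanted to email them.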
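
If you do want the ID and login out of a noreply address, a small parser is enough. This is a sketch: the constant and method names are my own, and the pattern also accepts the older login@users.noreply.github.com form that has no numeric prefix.

```ruby
# Matches "<id>+<login>@users.noreply.github.com" and the older form without an id.
NOREPLY_PATTERN = /\A(?:(\d+)\+)?([^@+]+)@users\.noreply\.github\.com\z/i

# Return { id:, login: } for a noreply address, or nil for any other email.
def parse_noreply(email)
  match = NOREPLY_PATTERN.match(email)
  return nil unless match

  { id: match[1]&.to_i, login: match[2] }
end

p parse_noreply("49699333+dependabot[bot]@users.noreply.github.com")
p parse_noreply("torvalds@linux-foundation.org")   # prints nil
```

It tells you who the commit belongs to on GitHub, and nothing about where their mail goes.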

The events archive: harder, but real data

There's another source that dev tooling people sometimes forget about: the public events stream. GitHub publishes a firehose of public events (pushes, opens, comments, releases) and GH Archive has been recording it hourly since 2011 — terabytes of newline-delimited JSON, gzipped, freely downloadable.
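
Each hour is one file, so picking a time window is just a matter of generating file names. A minimal sketch, assuming the public data.gharchive.org naming scheme (date, then an hour from 0 to 23 with no zero padding); the method names are illustrative.

```ruby
require "date"

# URL of the GH Archive file covering one UTC hour. Hours are not zero-padded.
def archive_url(date, hour)
  "https://data.gharchive.org/#{date.strftime('%Y-%m-%d')}-#{hour}.json.gz"
end

# Every hourly file between two dates, inclusive.
def archive_urls(first_day, last_day)
  (first_day..last_day).flat_map do |date|
    (0..23).map { |hour| archive_url(date, hour) }
  end
end

urls = archive_urls(Date.new(2015, 1, 1), Date.new(2015, 1, 2))
puts urls.length   # prints 48
puts urls.first    # prints https://data.gharchive.org/2015-01-01-0.json.gz
```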

Each PushEvent carries the underlying commit metadata, including author name and email. In principle, if a developer ever pushed a commit before they turned on private email — or if they push from a CI pipeline that uses a real address — that email is in the archive.

The job that processes it looks roughly like this:

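require "json"
require "net/http"
require "stringio"
require "zlib"

# One hourly GH Archive file; this particular date and hour is only an example.
archive_url = "https://data.gharchive.org/2015-01-01-15.json.gz"
gz = Zlib::GzipReader.new(StringIO.new(Net::HTTP.get(URI(archive_url))))

gz.each_line do |line|
  event = JSON.parse(line)
  next unless event["type"] == "PushEvent"

  login = event.dig("actor", "login")

  # to_a turns a missing commits array into an empty one
  event.dig("payload", "commits").to_a.each do |commit|
    author_name  = commit.dig("author", "name")
    author_email = commit.dig("author", "email")
    # ... we have a (login, name, email) triple. Now what?
  end
end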

You scan a few hours of archive and immediately find a problem. A lot of those emails look like this:

8a3f9b2c1d4e5f6a7b8c9d0e1f2a3b4c5d6e7f80@gmail.com

Forty hex characters. That's a SHA-1 hash. The local part of the email has been one-way-hashed; only the domain is in the clear. This is a historical artifact of how the events feed has been emitted for stretches of GitHub's history — commit emails arriving with the local part obfuscated.
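
To make the shape concrete, here is a sketch that detects these addresses and reproduces the obfuscation. It assumes a plain, unsalted SHA-1 over the local part only, which is what the rest of this post relies on; the names are illustrative.

```ruby
require "digest/sha1"

# Forty lowercase hex characters, then an ordinary domain.
HASHED_LOCAL_PART = /\A[a-f0-9]{40}@[^@\s]+\z/

def hashed_local_part?(email)
  HASHED_LOCAL_PART.match?(email)
end

# Assumption: the local part is hashed on its own, with no salt.
def obfuscate(email)
  local, domain = email.split("@", 2)
  "#{Digest::SHA1.hexdigest(local)}@#{domain}"
end

puts obfuscate("abc@gmail.com")
# prints a9993e364706816aba3e25717850c26c9cd0d89d@gmail.com
puts hashed_local_part?(obfuscate("abc@gmail.com"))   # prints true
puts hashed_local_part?("abc@gmail.com")              # prints false
```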

Great. Now you have a hash.

Reversing the hash

SHA-1 is a one-way function, but that only protects inputs that are hard to guess. Email local parts are not. They're drawn from a tiny, predictable distribution: firstname, firstname.lastname, f.lastname, firstnamelastname, firstname_lastname, firstname1985, and a few hundred other patterns layered over a finite list of names.

If you precompute a table of sha1(local_part) → local_part for every plausible candidate — every name you've ever encountered, every email you've ever seen published — you can reverse most of these in O(1).

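# \A and \z anchor the whole string; ^ and $ would only anchor a line.
if email =~ /\A([a-f0-9]{40})@(.+)\z/
  hash, domain = $1, $2
  # Sha1Hash is the lookup table: sha1_hash => original local part (text)
  if (record = Sha1Hash.find_by(sha1_hash: hash))
    real_email = "#{record.text}@#{domain}"
  end
end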

The lookup table is the asset. Building it well is most of the work. Mine is hundreds of millions of rows and grows every time the world publishes another address.
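
A toy, in-memory version shows the idea. This is a sketch under obvious assumptions: a plain Hash stands in for the database table, the candidate generator covers only five of the patterns listed earlier, and every method name is made up for the example.

```ruby
require "digest/sha1"

# A handful of the common local-part patterns for one (first, last) name.
def candidate_local_parts(first, last)
  f = first.downcase
  l = last.downcase
  [f, "#{f}.#{l}", "#{f[0]}.#{l}", "#{f}#{l}", "#{f}_#{l}"].uniq
end

# Precompute sha1(local_part) => local_part for every candidate.
def build_table(names)
  names.each_with_object({}) do |(first, last), table|
    candidate_local_parts(first, last).each do |local|
      table[Digest::SHA1.hexdigest(local)] = local
    end
  end
end

# Reverse "<sha1>@domain" to a real address, or nil when the hash is unknown.
def reverse(email, table)
  hash, domain = email.split("@", 2)
  local = table[hash]
  local && "#{local}@#{domain}"
end

table  = build_table([%w[Jane Doe], %w[John Smith]])
hashed = "#{Digest::SHA1.hexdigest('jane.doe')}@gmail.com"
puts reverse(hashed, table)   # prints jane.doe@gmail.com
```

The real table differs only in scale: more names, more patterns, and every address that has ever been published.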

The harder problem: was that actually them?

You now have a (login, author_name, real_email) triple. The temptation is to claim the email belongs to the login. Don't.

Anyone can configure git locally. People commit with user.name set to their full legal name, their nickname, "John", "John D.", "johndoe", "John Doe via Acme Corp", "Acme CI Bot", or — frequently — someone else's name entirely, because they cloned a repo on a coworker's machine and never reconfigured. A login pushes hundreds of commits over the years; many of them carry author names that don't actually identify the person behind the login.

So you need a confidence layer. Mine is a separate pass over the same archive that builds a (login, author_name) → observation_count table:

CREATE TABLE github_login_author_name_mappings (
  login              text,
  author_name        text,
  observation_count  int,
  PRIMARY KEY (login, author_name)
);
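
Filling that table is a counting pass over the same PushEvent lines. A minimal in-memory sketch, with a Hash standing in for the table and made-up sample events; count_author_names is an illustrative name.

```ruby
require "json"

# Count how often each (login, author_name) pair appears in PushEvent commits.
def count_author_names(lines)
  counts = Hash.new(0)
  lines.each do |line|
    event = JSON.parse(line)
    next unless event["type"] == "PushEvent"

    login = event.dig("actor", "login")
    event.dig("payload", "commits").to_a.each do |commit|
      name = commit.dig("author", "name")
      counts[[login, name]] += 1 if login && name
    end
  end
  counts
end

# Made-up events in the archive's shape: one JSON object per line.
sample_lines = [
  { type: "PushEvent", actor: { login: "jdoe" },
    payload: { commits: [{ author: { name: "Jane Doe" } }, { author: { name: "Jane Doe" } }] } },
  { type: "PushEvent", actor: { login: "jdoe" },
    payload: { commits: [{ author: { name: "Buildkite" } }] } },
  { type: "WatchEvent", actor: { login: "jdoe" }, payload: {} }
].map { |event| JSON.generate(event) }

counts = count_author_names(sample_lines)
counts.each { |(login, name), n| puts "#{login}\t#{name}\t#{n}" }
```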

When a hash reverses to a candidate email, I look up every author name that login has ever been observed pushing under, and ask: what fraction of this login's total commits use this author name?

# all_names: [author_name, observation_count] pairs for one login.
# The method name is illustrative.
def trusted_author_name?(all_names, author_name)
  total_count = all_names.sum { |_, count| count }

  # Need at least 10 commits for any signal at all
  return false if total_count < 10

  name_count = all_names.find { |name, _| name == author_name }&.last || 0
  percentage = (name_count.to_f / total_count) * 100

  # With a long history, 50% co-occurrence is enough; with little, demand 80%
  threshold = total_count > 100 ? 50.0 : 80.0
  percentage >= threshold
end

This rejects the noise. The contributor who once pushed a commit signed "Test User" doesn't get linked to a test.user@example.com reversal. The CI bot pushing under a real engineer's login but with git config user.name "Buildkite" doesn't pollute the index. What survives is the set of (login, name) pairs that consistently co-occur — a fairly trustworthy proxy for "this is the human behind this login."
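
A few worked numbers make the thresholds easier to reason about. The helper below is a compact restatement of the rule described above (at least 10 commits, then 50% for histories over 100 commits and 80% otherwise); its name and the sample counts are invented for illustration.

```ruby
# name_count: commits under this author name; total_count: all commits for the login.
def accept?(name_count, total_count)
  return false if total_count < 10

  threshold = total_count > 100 ? 50.0 : 80.0
  (name_count.to_f / total_count) * 100 >= threshold
end

puts accept?(120, 200)   # 60% of a long history        => prints true
puts accept?(30, 40)     # 75% of a short history       => prints false
puts accept?(35, 40)     # 87.5% of a short history     => prints true
puts accept?(50, 100)    # exactly 100 is still "short" => prints false
puts accept?(8, 8)       # too few commits to judge     => prints false
```

Note the boundary: a login with exactly 100 commits is still held to the 80% bar.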

What's left

Doing this for one user, end to end:

Identify which monthly archive shards likely contain their activity.
Stream and decompress hundreds of gigabytes of JSON.
Maintain a SHA-1 lookup table of every plausible email local part you've ever seen.
Maintain a parallel (login, author_name) co-occurrence index across the entire archive.
For every reversed hash, run the confidence check.
Validate the resulting email isn't already claimed by a different GitHub login (people misconfigure git constantly).
Verify the address actually accepts mail before you rely on it.
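
The cross-login check in that list is the easiest step to sketch: group candidate (login, email) pairs by address and flag any address that more than one login claims. The method name and sample pairs are hypothetical, and what to do with a conflict (drop it, or keep the stronger match) is a policy choice this sketch does not make.

```ruby
# pairs: [[login, email], ...]. Returns { email => [logins] } for contested addresses.
def conflicting_emails(pairs)
  pairs.group_by { |_login, email| email.downcase }
       .transform_values { |rows| rows.map(&:first).uniq }
       .select { |_email, logins| logins.size > 1 }
end

pairs = [
  ["jdoe",      "jane.doe@example.com"],
  ["jdoe-work", "Jane.Doe@example.com"],
  ["asmith",    "a.smith@example.com"]
]

conflicting_emails(pairs).each do |email, logins|
  puts "#{email} is claimed by #{logins.join(', ')}"
end
# prints jane.doe@example.com is claimed by jdoe, jdoe-work
```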

That's a multi-day backfill the first time, hundreds of gigabytes resident, and a continuous trickle of new data forever. Perfectly reasonable to build if finding email addresses for GitHub users is your full-time job. Absurd to build if you just need to email three maintainers about a CVE.

The shortcut

This is the work PeopleDB does in the background. The pipeline above — archive ingestion, hash reversal, identity correlation, deduplication, SMTP validation — runs continuously. The answer is one HTTP call:

curl "https://peopledb.co/api/v1/people?github_login=octocat" \
  -H "Authorization: Bearer $PEOPLEDB_TOKEN"

{
  "github_login": "octocat",
  "github_id": 583231,
  "linkedin_public_identifier": "...",
  "email_addresses": ["..."],
  "personal_email_addresses": ["..."],
  "work_email_addresses": ["..."]
}

The endpoint also accepts github_id, linkedin_id, and linkedin_public_identifier — the same identity-merging logic runs across all of them, so if a person has both a GitHub and a LinkedIn record in the index, you get the union.
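
The same call from Ruby might look like the sketch below. The URL, the bearer header, and the four lookup keys come from the description above; the method names are mine, and the error handling is left out because the failure responses are not documented here.

```ruby
require "json"
require "net/http"
require "uri"

PEOPLEDB_ENDPOINT = "https://peopledb.co/api/v1/people"
LOOKUP_KEYS = %i[github_login github_id linkedin_id linkedin_public_identifier].freeze

# Build the GET request without sending it.
def build_lookup_request(key, value, token)
  raise ArgumentError, "unsupported key: #{key}" unless LOOKUP_KEYS.include?(key)

  uri = URI(PEOPLEDB_ENDPOINT)
  uri.query = URI.encode_www_form(key => value)
  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = "Bearer #{token}"
  request
end

# Send the request and parse the JSON body (assumes a successful response).
def lookup(key, value, token)
  request = build_lookup_request(key, value, token)
  response = Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

request = build_lookup_request(:github_login, "octocat", "example-token")
puts request.path   # prints /api/v1/people?github_login=octocat
```

A live lookup would then be lookup(:github_login, "octocat", ENV.fetch("PEOPLEDB_TOKEN")).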

If you're doing security disclosure, contributor outreach, or any kind of identity resolution where you start with a username and need to actually reach the human, that's the trade: spin up the pipeline, or skip it.