Nelson Figueroa

Originally published at nelson.cloud

Scrape Contributor Emails From Any Git Repository

In a previous post I wrote about how it’s possible to scrape emails from GitHub repositories using their API. I even wrote a Ruby script to do this. After discovering the git shortlog command, I now realize that approach was far more complicated than it needs to be.

With git shortlog you can list all contributor emails for any git repository, not just GitHub repos.


Disclaimer: I am writing about this to make others aware of this form of scraping and it is purely for educational purposes. I do not plan on doing anything with emails from git repos and you shouldn’t either.


TL;DR

You can run this command within any git repo to extract all contributor emails:

git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq | grep -wv 'users.noreply.github.com'

Command Breakdown

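The git shortlog -sea part of the command is intended as shorthand for git shortlog --summary --email --all. This outputs the number of commits each user has made, along with their name and email, across all branches. Note that the git documentation lists short forms only for --summary (-s) and --email (-e); --all is documented in long form only, so git shortlog -se --all is the safer spelling if you need every branch covered.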

$ git shortlog -sea

    54 First Last <FirstLast@example.com>
   385 Another User <Anotheruser@example.com>
     2 user1 <user1@example.com>
    31 first last <firstlast@example.com>
    10 Someone Else <1234567+someoneelse@users.noreply.github.com>

The next command, grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b", extracts emails from each line using a regular expression.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"

FirstLast@example.com
Anotheruser@example.com
user1@example.com
firstlast@example.com
1234567+someoneelse@users.noreply.github.com
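
One caveat with this regex: the {2,6} quantifier caps the top-level domain at six letters, so an address on a longer TLD is skipped without any warning. The sketch below illustrates this with made-up sample lines. The relaxed pattern and the example.technology address are assumptions for illustration, not part of the original command.

```shell
# Made-up lines in the same shape as `git shortlog -se` output.
sample='    54 First Last <FirstLast@example.com>
     3 Long Tld <dev@example.technology>'

# Pattern from the post: top-level domain limited to 2-6 letters.
strict='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
# Relaxed variant (an assumption): no upper bound on the TLD length.
relaxed='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Finds only FirstLast@example.com; the .technology address is dropped.
printf '%s\n' "$sample" | grep -E -o "$strict"

# Finds both addresses.
printf '%s\n' "$sample" | grep -E -o "$relaxed"
```

Whether the relaxed pattern is worth it depends on the repo; it also accepts more false positives, so treat it as an option rather than a fix.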

The output from the previous command is piped into awk '{print tolower($0)}', which lowercases all the emails. Sometimes emails are typed in with capital letters. Lowercasing all characters will help with sorting and finding unique emails later.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}'

firstlast@example.com
anotheruser@example.com
user1@example.com
firstlast@example.com
1234567+someoneelse@users.noreply.github.com

After that, the output is piped into sort and uniq. The emails are sorted alphabetically first because uniq only removes duplicate lines that are adjacent; once sorted, duplicates sit next to each other and are dropped from the output.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq

1234567+someoneelse@users.noreply.github.com
anotheruser@example.com
firstlast@example.com
user1@example.com
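
To see why sort has to run before uniq (uniq only collapses adjacent duplicate lines), here is a small sketch with made-up addresses. It also shows sort -u, a standard shorthand that sorts and deduplicates in one step.

```shell
# Made-up, unsorted list with a non-adjacent duplicate.
emails='firstlast@example.com
anotheruser@example.com
firstlast@example.com'

# uniq alone: the duplicates are not adjacent, so all three lines are printed.
printf '%s\n' "$emails" | uniq

# sort first, then uniq: the duplicate is removed and two lines are printed.
printf '%s\n' "$emails" | sort | uniq

# sort -u gives the same result as sort | uniq.
printf '%s\n' "$emails" | sort -u
```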

That should suffice for a lot of git repos, but I also added grep -wv 'users.noreply.github.com' to the end of the command to exclude noreply emails associated with GitHub.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq | grep -wv 'users.noreply.github.com'

anotheruser@example.com
firstlast@example.com
user1@example.com
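
In the filter above, the dots in 'users.noreply.github.com' are regex wildcards, which works but is looser than it looks. A minimal sketch of a stricter variant, assuming you want a literal match: grep -F treats the pattern as a fixed string. The drop_noreply function name is hypothetical.

```shell
# Drop GitHub noreply addresses, matching the domain as a literal string.
# Reads addresses on stdin, one per line.
drop_noreply() {
  grep -v -F '@users.noreply.github.com'
}

# Made-up input; only the two example.com addresses are printed.
printf '%s\n' \
  '1234567+someoneelse@users.noreply.github.com' \
  'anotheruser@example.com' \
  'user1@example.com' | drop_noreply
```

Keep in mind that grep exits with status 1 when it prints no lines, which matters if the pipeline runs under set -e or pipefail and every address is a noreply one.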

Extracting Emails With git log

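It’s possible to do something similar with the git log --pretty="%ce" command. However, I noticed that this command does not show as many emails as git shortlog. I didn’t look too much into it, but there are two likely reasons. First, without --all, git log only walks the history reachable from the current branch. Second, %ce prints the committer email, whereas git shortlog groups commits by author by default; the placeholder for the author email is %ae.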
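
As a rough sketch of a git log variant that should be closer to the shortlog output, the function below asks for author emails (%ae) across all refs (--all). The author_emails name is hypothetical, and it uses tr and sort -u in place of the awk, sort, and uniq steps from earlier.

```shell
# Print unique, lowercased author emails for the repo at $1
# (defaults to the current directory), across all refs.
author_emails() {
  git -C "${1:-.}" log --all --format='%ae' \
    | tr '[:upper:]' '[:lower:]' \
    | sort -u
}
```

Call it as author_emails /path/to/repo, or with no argument from inside a repository.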

References

I learned about git shortlog from this Stack Overflow question:

I got the email regex from here:
