How to Block AI Bot Spam in Your GitHub Repo Using Git's Author Filters

#git #github #devops #opensource

The 3 AM Wake-Up Call

Last month I woke up to 47 GitHub notifications. Not the good kind. Someone had pointed an AI agent at one of my open source repos, and it had opened a torrent of "helpful" PRs — refactors nobody asked for, README rewrites in confident broken English, and one memorable PR that deleted half the test suite while claiming to "improve coverage."

If you maintain anything public on GitHub right now, you've probably seen this. The barrier to spinning up an autonomous coding agent is basically zero, and a lot of them are aimed at racking up "contributions" rather than actually contributing. So you end up reviewing slop at 3 AM.

This post walks through what we did about it. Spoiler: git log --author and a couple of pre-receive checks did most of the work. No paid services, no fancy infrastructure.

Why Bot PRs Are Hard to Filter

The first instinct is to block by username. That fails fast — bot accounts get renamed, multiplied, or hidden behind a real-looking handle. The second instinct is to filter on PR content with regex. That fails too, because the output looks plausibly human.

The thing bots are surprisingly bad at hiding is their commit author metadata. Git records two identities per commit: the author (who wrote it) and the committer (who applied it). Most AI agents either:

Use a giveaway author string like noreply@anthropic.com, github-actions[bot], or some agent-framework default
Forge a name but leave the email pointing at the agent host
Set author and committer to different identities in a way real workflows almost never do

That's a fingerprint. And unlike a username, it's baked into every commit forever.
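To see the two identities side by side, here is a minimal sketch that builds a throwaway repository and records one commit whose author and committer differ. The names and example.invalid addresses are made up for illustration; GIT_AUTHOR_* and GIT_COMMITTER_* are the standard environment variables Git reads when it creates a commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create one empty commit in repo "$1" whose author and committer differ.
# All identities here are placeholders, not real bot fingerprints.
make_split_commit() {
  GIT_AUTHOR_NAME='Helpful Agent' GIT_AUTHOR_EMAIL='agent@example.invalid' \
  GIT_COMMITTER_NAME='Real Person' GIT_COMMITTER_EMAIL='person@example.invalid' \
    git -C "$1" commit -q --allow-empty -m 'demo: split identities'
}

demo=$(mktemp -d)
git init -q "$demo"
make_split_commit "$demo"

# %an/%ae are the author fields, %cn/%ce the committer fields.
git -C "$demo" log -1 --pretty=format:'author:    %an <%ae>%ncommitter: %cn <%ce>%n'
rm -rf "$demo"
```

Both identities are stored in the commit object itself, which is why they survive account renames on the hosting side.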

Inspecting What You're Actually Getting

Before writing any rules, look at your own repo. This is the command I run first on any contributor PR:

git log --all --pretty=format:'%h | %an <%ae> | %cn <%ce> | %s' | head -50

The format string breaks down as:

%an / %ae — author name and email
%cn / %ce — committer name and email
%s — subject line

Run it across a noisy repo for a minute and the patterns jump out. We found 14 distinct "author" emails across what turned out to be the same three bot operators.
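A quick tally can make those clusters easier to spot. The sketch below counts commits per author email across all refs; author_email_counts is an illustrative name, and it assumes the usual sort and uniq utilities are available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "count email" pairs for every author email in repo "$1"
# (default: current directory), most frequent first.
author_email_counts() {
  git -C "${1:-.}" log --all --pretty=tformat:'%ae' | sort | uniq -c | sort -rn
}

# Usage: author_email_counts /path/to/repo
```

Addresses that share a domain or a naming scheme tend to sit next to each other once the list is short enough to read in full.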

To zoom in on a single suspected actor:

# All commits matching an author pattern.
# -E enables extended regex (needed for |); -i makes the match case-insensitive.
git log -E -i --author='bot|agent|noreply' --pretty=format:'%h %ae %s'

--author matches against both name and email, and it accepts a regex. That last part is what makes it useful — you can build a denylist and run it as one command.
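Two matching details are easy to verify on a scratch repository: Git's limiting patterns are basic regular expressions unless -E (--extended-regexp) is given, and they are case-sensitive unless -i (--regexp-ignore-case) is given. The sketch below uses made-up identities and counts matches with git rev-list --count, which accepts the same --author filter as git log. It assumes no grep.patternType override in your Git config.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count commits on all refs of repo "$1" that match the remaining rev-list options.
count_matches() {
  local repo=$1; shift
  git -C "$repo" rev-list --count --all "$@"
}

# Add an empty commit to repo "$1" with identity "$2 <$3>" (placeholder values).
add_commit() {
  GIT_AUTHOR_NAME=$2 GIT_AUTHOR_EMAIL=$3 \
  GIT_COMMITTER_NAME=$2 GIT_COMMITTER_EMAIL=$3 \
    git -C "$1" commit -q --allow-empty -m "commit by $2"
}

demo=$(mktemp -d)
git init -q "$demo"
add_commit "$demo" 'Build Bot'  'ci@example.invalid'
add_commit "$demo" 'Some Agent' 'agent@example.invalid'
add_commit "$demo" 'Alice'      'alice@example.invalid'

echo "basic regex, case-sensitive:    $(count_matches "$demo" --author='bot|agent')"
echo "extended regex, case-sensitive: $(count_matches "$demo" -E --author='bot|agent')"
echo "extended regex, ignoring case:  $(count_matches "$demo" -E -i --author='bot|agent')"
rm -rf "$demo"
```

Without -E the | is a literal character, so an alternation-style denylist silently matches nothing.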

Step 1: Build a Local Audit Script

Start with detection before enforcement. You want to know what you'd be blocking before you actually block it. Here's the script I keep in scripts/audit-authors.sh:

#!/usr/bin/env bash
set -euo pipefail

# Patterns we consider suspicious. Tune for your project.
SUSPICIOUS='(bot|agent|noreply|automated|\[bot\])'

echo "== Commits with suspicious author metadata =="
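# --extended-regexp is required for the ( | ) alternation in $SUSPICIOUS;
# without it Git treats the pattern as a basic regex and matches nothing.
git log --all \
  --extended-regexp --regexp-ignore-case \
  --author="$SUSPICIOUS" \
  --pretty=format:'%h  %an <%ae>  -- %s'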

echo
echo "== Commits where author != committer (unusual outside of merges) =="
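# %ae and %ce differing is a yellow flag for agent-applied commits.
# Tab-separated fields (%x09) and an exact string comparison avoid treating
# characters such as . and + in an email address as regex syntax.
git log --all --no-merges \
  --pretty=format:'%h%x09%ae%x09%ce%x09%s' \
  | awk -F'\t' '$2 != $3'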

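The second check (author vs committer mismatch) caught more bots than the name regex did. Legitimate workflows trip it too: a maintainer cherry-picking or rebasing someone else's commits, or a commit made through GitHub's web UI, which records noreply@github.com as the committer. So don't auto-reject on this signal alone. Use it to flag for review.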
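One way to cut noise from the mismatch signal is to allowlist committers that are known to differ for benign reasons. The sketch below is a variant under stated assumptions: flag_identity_mismatches and ALLOWED_COMMITTERS are hypothetical names, and the only entry is noreply@github.com, the committer GitHub records for commits made through its web interface. It compares the two emails as exact strings rather than as patterns.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Space-separated committer emails that may legitimately differ from the author.
# Assumed starting point; extend it for your own project.
ALLOWED_COMMITTERS='noreply@github.com'

# Print non-merge commits in repo "$1" (default: .) whose author email differs
# from the committer email, skipping allowlisted committers.
flag_identity_mismatches() {
  git -C "${1:-.}" log --all --no-merges \
      --pretty=tformat:'%h%x09%ae%x09%ce%x09%s' |
    awk -F'\t' -v allow="$ALLOWED_COMMITTERS" '
      BEGIN { n = split(allow, list, " "); for (i = 1; i <= n; i++) ok[list[i]] = 1 }
      $2 != $3 && !($3 in ok)   # exact string comparison, no regex involved
    '
}

# Usage: flag_identity_mismatches /path/to/repo
```

Anything this prints is still only a candidate for review, not a verdict.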

Step 2: A Pre-Receive Hook on the Server Side

Once you know your patterns, push enforcement to the git server. If you're self-hosting (Gitea, GitLab, plain Git over SSH), pre-receive is the right place. It runs before refs are updated, so you can reject the push outright.

#!/usr/bin/env bash
# hooks/pre-receive
# Reject pushes whose new commits have disallowed author metadata.
set -euo pipefail

DENY_PATTERN='(@anthropic\.com|@openai\.com|noreply@.*-bot|agent@)'

zero_re='^0+$'   # all-zero object ID, whatever the hash length (SHA-1 or SHA-256)

while read -r oldrev newrev refname; do
  # Skip branch deletions
  [[ $newrev =~ $zero_re ]] && continue

  # On a new branch, oldrev is all zeroes. Exclude everything already reachable
  # from existing refs so only the newly pushed commits are scanned.
  if [[ $oldrev =~ $zero_re ]]; then
    range=("$newrev" --not --all)
  else
    range=("$oldrev..$newrev")
  fi

  bad=$(git log "${range[@]}" --pretty=format:'%H %ae' \
          | grep -E "$DENY_PATTERN" || true)

  if [ -n "$bad" ]; then
    echo "Rejected: commit author matches deny list:" >&2
    echo "$bad" >&2
    exit 1
  fi
done
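Before installing the hook, the same pattern can be dry-run against any range in a local clone. This is a sketch, not part of the hook: would_reject is a hypothetical helper, and the pattern is copied from the hook above.

```shell
#!/usr/bin/env bash
set -euo pipefail

DENY_PATTERN='(@anthropic\.com|@openai\.com|noreply@.*-bot|agent@)'

# Print "<hash> <author email>" for commits in range "$2" of repo "$1" that the
# deny pattern would catch. Exit status 0 means at least one commit matched,
# so call it inside an if or && chain when set -e is active.
would_reject() {
  git -C "$1" log "$2" --pretty=tformat:'%H %ae' | grep -E "$DENY_PATTERN"
}

# Usage: would_reject . origin/main..my-branch && echo "hook would reject this push"
```

Running it over a few weeks of real history is a cheap way to find false positives before any contributor hits them.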

A few gotchas I hit:

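Don't use git log --all as the scope here. You only want to check the commits being pushed, not your whole history. The oldrev..newrev range is the right scope. For a brand-new branch, where oldrev is all zeroes, git log newrev --not --all limits the scan to commits the server doesn't already have.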
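Anchor your regexes. I once wrote noreply without anchoring and rejected legitimate dependabot security updates. Embarrassing. The same unanchored pattern also matches the users.noreply.github.com privacy addresses that plenty of human contributors commit with.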
Log rejections somewhere. When a real contributor gets blocked, you need to see why.
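The last two gotchas can be sketched together. The anchored pattern below is an adaptation of the hook's deny list, matched against the bare email so that ^ and $ mean something; is_denied, log_rejection, and REJECT_LOG are hypothetical names, not part of Git.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Anchored variant of the deny list: domains must end the address,
# local parts must start it. Illustrative, tune before use.
DENY_ANCHORED='(@anthropic\.com$|@openai\.com$|^noreply@.*-bot|^agent@)'

# Hypothetical log location; defaults to a temp file for this demo.
REJECT_LOG=${REJECT_LOG:-$(mktemp)}

# Succeed if the bare email address "$1" matches the anchored deny list.
is_denied() {
  [[ $1 =~ $DENY_ANCHORED ]]
}

# Append a UTC timestamp, ref name "$1", and offending email "$2" to the log.
log_rejection() {
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$REJECT_LOG"
}

for email in agent@example.invalid reagent@example.invalid \
             12345+alice@users.noreply.github.com; do
  if is_denied "$email"; then
    log_rejection refs/heads/demo "$email"
    echo "deny   $email"
  else
    echo "allow  $email"
  fi
done
```

Matching the email on its own, rather than a whole log line, is what lets the anchors do their job.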

Step 3: For GitHub-Hosted Repos

GitHub.com doesn't offer pre-receive hooks (they're a GitHub Enterprise Server feature), so we moved the check into a workflow that runs on every PR:

# .github/workflows/author-check.yml
name: Author Check
on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # We need PR commits, not just the merge ref
          fetch-depth: 0
      - name: Inspect commit authors
        run: |
          base='${{ github.event.pull_request.base.sha }}'
          head='${{ github.event.pull_request.head.sha }}'
          # Fail if any commit in the PR has a denylisted author email
          if git log "$base..$head" --pretty='%ae' \
               | grep -Ei '@(some-agent-host)\.com|noreply.*bot'; then
            echo "::error::PR contains commits from disallowed authors"
            exit 1
          fi

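This won't stop the PR from being opened, but the check will fail and the maintainer sees the reason immediately. To actually block the merge, mark Author Check as a required status check in the branch protection settings for your default branch.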
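The same range can be inspected locally without waiting for a workflow run. A minimal sketch, assuming the remote is named origin and the base branch is main (both assumptions, adjust to your repo): GitHub publishes each pull request's head under refs/pull/<number>/head, so it can be fetched directly. pr_author_emails is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List the distinct author emails for commits reachable from head "$3"
# but not from base "$2", inside repo "$1".
pr_author_emails() {
  git -C "$1" log "$2..$3" --pretty=tformat:'%ae' | sort -u
}

# Usage (needs network access; 123 is a placeholder PR number):
#   git fetch origin pull/123/head
#   pr_author_emails . origin/main FETCH_HEAD
```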

Prevention Tips

A few things I'd do from day one on a new public repo:

Require signed commits on protected branches. Signing isn't a perfect bot-blocker, but it raises the cost meaningfully. See the official Git docs on commit signing.
Set up CODEOWNERS so PRs to sensitive paths require review from a known human.
Track patterns over time. Re-run the audit script monthly. Bot operators change their fingerprints; your denylist needs to keep up.
Don't over-block. Every false positive costs you a real contributor. Start with detection, log everything for a week, then move to enforcement.
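To see how far a repo is from the signed-commit requirement, Git's %G? format placeholder reports each commit's signature status, with N meaning no signature. The sketch below lists unsigned commits in a range; unsigned_commits is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "<short hash> <author email>" for commits in range "$2" (default: HEAD)
# of repo "$1" that carry no signature at all (%G? prints N for those).
unsigned_commits() {
  git -C "$1" log "${2:-HEAD}" --pretty=tformat:'%G? %h %ae' |
    awk '$1 == "N" { print $2, $3 }'
}

# Usage: unsigned_commits . origin/main..HEAD
```

This only reports whether a signature is present; whether a present signature is trusted depends on the keys your Git installation knows about.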

None of this is a silver bullet — a determined operator can spoof author metadata trivially, and the sophisticated ones already do. But the spam tier of AI bot PRs almost never bothers, because they're optimizing for volume. Filtering on --author knocked our noise level down by something like 80% in the first week. Worth the afternoon it took to set up.