The 3 AM Wake-Up Call
Last month I woke up to 47 GitHub notifications. Not the good kind. Someone had pointed an AI agent at one of my open source repos, and it had opened a torrent of "helpful" PRs — refactors nobody asked for, README rewrites in confident broken English, and one memorable PR that deleted half the test suite while claiming to "improve coverage."
If you maintain anything public on GitHub right now, you've probably seen this. The barrier to spinning up an autonomous coding agent is basically zero, and a lot of them are aimed at racking up "contributions" rather than actually contributing. So you end up reviewing slop at 3 AM.
This post walks through what we did about it. Spoiler: git log --author and a couple of pre-receive checks did most of the work. No paid services, no fancy infrastructure.
Why Bot PRs Are Hard to Filter
The first instinct is to block by username. That fails fast — bot accounts get renamed, multiplied, or hidden behind a real-looking handle. The second instinct is to filter on PR content with regex. That fails too, because the output looks plausibly human.
The thing bots are surprisingly bad at hiding is their commit author metadata. Git records two identities per commit: the author (who wrote it) and the committer (who applied it). Most AI agents either:
- Use a giveaway author string like noreply@anthropic.com, github-actions[bot], or some agent-framework default
- Forge a name but leave the email pointing at the agent host
- Set author and committer to different identities in a way real workflows almost never do
That's a fingerprint. And unlike a username, it's baked into every commit forever.
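To see the two identities side by side, it helps to look at a raw commit object. The sketch below is a hypothetical helper (identities is my own name, not a Git command) that reads what git cat-file commit <rev> prints and pulls the author and committer emails out of the header. The sample commit, its names, and its hosts are made up for illustration.

```shell
#!/bin/sh
# identities: hypothetical helper. Reads a raw commit object on stdin
# (the output of `git cat-file commit <rev>`) and prints the author and
# committer emails found in the header, one per line.
identities() {
  sed -n '
    /^$/q
    s/^author .*<\([^>]*\)>.*$/author=\1/p
    s/^committer .*<\([^>]*\)>.*$/committer=\1/p
  '
}

# Made-up sample: a human-looking name on an agent-host email,
# applied by a different committer identity.
sample_commit() {
  printf '%s\n' \
    'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904' \
    'author Jane Doe <agent@agent-host.example> 1700000000 +0000' \
    'committer Some Bot <bot@ci.example> 1700000050 +0000' \
    '' \
    'Improve coverage'
}

sample_commit | identities
```

Against a real repository the equivalent would be git cat-file commit HEAD | identities. The sed script stops at the first blank line, so nothing in the commit message can be mistaken for a header.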
Inspecting What You're Actually Getting
Before writing any rules, look at your own repo. This is the command I run first on any contributor PR:
git log --all --pretty=format:'%h | %an <%ae> | %cn <%ce> | %s' | head -50
The format string breaks down as:
- %an / %ae: author name and email
- %cn / %ce: committer name and email
- %s: subject line
Run it across a noisy repo for a minute and the patterns jump out. We found 14 distinct "author" emails across what turned out to be the same three bot operators.
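If eyeballing fifty lines gets tedious, a tally of author emails makes clusters easier to spot. This is a minimal sketch, assuming one email per line on stdin; count_authors is a hypothetical helper name and the sample addresses are invented.

```shell
#!/bin/sh
# count_authors: hypothetical helper. Reads one author email per line and
# prints "count email" rows, most frequent first.
count_authors() {
  sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Invented sample input standing in for real `git log` output.
printf '%s\n' \
  'bot@agent-host.example' \
  'jane@example.com' \
  'bot@agent-host.example' \
  | count_authors
```

On a real repo the input would come from git log --all --format='%ae' | count_authors.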
To zoom in on a single suspected actor:
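# All commits matching an author pattern (-E enables | alternation, -i ignores case)
git log -E -i --author='bot|agent|noreply' --pretty=format:'%h %ae %s'
--author matches against both name and email, and it accepts a regex. By default that regex is a case-sensitive basic one, so alternation with | needs -E (--extended-regexp) and case-insensitive matching needs -i (--regexp-ignore-case). The regex support is what makes it useful: you can build a denylist and run it as one command.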
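One way to keep that denylist maintainable is to store the fragments in a text file and join them into a single alternation at run time. The sketch below assumes a one-fragment-per-line file with # comments; build_pattern and denylist.txt are hypothetical names.

```shell
#!/bin/sh
# build_pattern: hypothetical helper. Reads denylist fragments from stdin,
# drops blank lines and # comments, and joins the rest with | so the result
# can be passed to `git log -E --author=...`.
build_pattern() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' | paste -s -d '|' -
}

# Invented sample denylist.
printf '%s\n' \
  '# agent frameworks seen so far' \
  'bot' \
  '' \
  'agent' \
  'noreply' \
  | build_pattern
```

Usage would look like git log -E -i --author="$(build_pattern < denylist.txt)". The -E flag matters here because the joined pattern relies on | alternation.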
Step 1: Build a Local Audit Script
Start with detection before enforcement. You want to know what you'd be blocking before you actually block it. Here's the script I keep in scripts/audit-authors.sh:
#!/usr/bin/env bash
set -euo pipefail
# Patterns we consider suspicious. Tune for your project.
SUSPICIOUS='(bot|agent|noreply|automated|\[bot\])'
echo "== Commits with suspicious author metadata =="
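git log --all \
  --extended-regexp --regexp-ignore-case \
  --author="$SUSPICIOUS" \
  --pretty=format:'%h %an <%ae> -- %s'
echo
echo "== Commits where author != committer (unusual outside of merges) =="
# %ae and %ce differing is a yellow flag for agent-applied commits.
# Fields are TAB-separated (%x09) and compared as plain strings, so a | or a
# regex metacharacter (+, [bot]) in an email or subject cannot skew the result.
git log --all --no-merges \
  --pretty=format:'%h%x09%ae%x09%ce%x09%s' \
  | awk -F'\t' '$2 != $3'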
The second check — author vs committer mismatch — caught more bots than the name regex did. Humans rebasing or cherry-picking will occasionally trip it, so don't auto-reject on this signal alone. Use it to flag for review.
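To keep the mismatch signal reviewable rather than noisy, it can help to skip committer identities you already trust. The sketch below assumes TAB-separated rows of hash, author email, committer email, and subject; flag_mismatch is a hypothetical helper, and the allowlist entry noreply@github.com is an assumption based on what GitHub typically records for commits made through its web UI, so check it against your own history first.

```shell
#!/bin/sh
# flag_mismatch ALLOWLIST: hypothetical helper. Reads TAB-separated rows
# (hash, author email, committer email, subject) on stdin and prints the rows
# where author and committer differ, unless the committer email appears in
# ALLOWLIST (a space-separated list of trusted committer emails).
flag_mismatch() {
  awk -F '\t' -v allow="$1" '
    BEGIN { n = split(allow, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    $2 != $3 && !($3 in ok)
  '
}

# Invented sample rows: a normal commit, a web-UI style commit, and a
# commit applied by a different identity.
printf 'a1\tjane@example.com\tjane@example.com\tFix typo\n'                > /dev/null
printf '%s\t%s\t%s\t%s\n' \
  a1 jane@example.com jane@example.com 'Fix typo' \
  b2 sam@example.com noreply@github.com 'Update README' \
  c3 jane@example.com agent@agent-host.example 'Refactor everything' \
  | flag_mismatch 'noreply@github.com'
```

With real data the rows would come from git log --all --no-merges --pretty=format:'%h%x09%ae%x09%ce%x09%s'. Only the third sample row is printed, which is the kind of row worth a human look.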
Step 2: A Pre-Receive Hook on the Server Side
Once you know your patterns, push enforcement to the git server. If you're self-hosting (Gitea, GitLab, plain Git over SSH), pre-receive is the right place. It runs before refs are updated, so you can reject the push outright.
#!/usr/bin/env bash
# hooks/pre-receive
# Reject pushes whose new commits have disallowed author metadata.
set -euo pipefail
DENY_PATTERN='(@anthropic\.com|@openai\.com|noreply@.*-bot|agent@)'
# All-zero object id, sized for the repo's hash algorithm (40 zeroes for SHA-1)
zero=$(git hash-object --stdin </dev/null | tr '[0-9a-f]' '0')
while read -r oldrev newrev refname; do
  # Skip branch deletions
  [ "$newrev" = "$zero" ] && continue
  if [ "$oldrev" = "$zero" ]; then
    # New ref: oldrev is all zeroes, so check only commits that no existing
    # ref already reaches instead of scanning the whole history
    range=("$newrev" --not --all)
  else
    range=("$oldrev..$newrev")
  fi
  bad=$(git log --pretty=format:'%H %ae' "${range[@]}" \
    | grep -E "$DENY_PATTERN" || true)
  if [ -n "$bad" ]; then
    echo "Rejected: commit author matches deny list:" >&2
    echo "$bad" >&2
    exit 1
  fi
done
A few gotchas I hit:
- Don't use git log --all here. You only want to check the commits being pushed, not your whole history. The oldrev..newrev range is the right scope.
- Anchor your regexes. I once wrote noreply without anchoring and rejected legitimate dependabot security updates. Embarrassing.
- Log rejections somewhere. When a real contributor gets blocked, you need to see why.
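The anchoring and logging gotchas can be sketched together. This is an illustration under assumptions, not the hook's actual code: the pattern, the agent-host.example domain, and the helper names is_denied and log_rejection are all hypothetical, and the dependabot-style address in the comment is only an example of the shape such emails take.

```shell
#!/bin/sh
# Anchored with ^ and $ and matched against the email alone, so "agent@" only
# matches a local part that is exactly "agent". The host is a placeholder.
DENY_PATTERN='^(agent@.*|.*@agent-host\.example)$'

# is_denied EMAIL: succeeds (exit 0) when EMAIL matches the denylist.
is_denied() {
  printf '%s\n' "$1" | grep -Eq "$DENY_PATTERN"
}

# log_rejection LOGFILE HASH EMAIL: appends a UTC-timestamped record so a
# blocked contributor can be traced later.
log_rejection() {
  printf '%s rejected %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" >> "$1"
}

# Invented addresses: an unanchored "agent@" would also catch the second one,
# and an unanchored "noreply" would catch the third.
for email in \
  'agent@agent-host.example' \
  'travel-agent@corp.example' \
  '12345+dependabot[bot]@users.noreply.github.com'
do
  if is_denied "$email"; then echo "deny  $email"; else echo "allow $email"; fi
done
```

Inside a hook, a call such as log_rejection /path/to/rejections.log "$hash" "$email" just before exit 1 would leave a record; the log path is yours to choose.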
Step 3: For GitHub-Hosted Repos
GitHub.com doesn't expose pre-receive hooks (they're a GitHub Enterprise Server feature), so we moved the check into a workflow that runs on every PR:
# .github/workflows/author-check.yml
name: Author Check
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # We need PR commits, not just the merge ref
          fetch-depth: 0
      - name: Inspect commit authors
        run: |
          base='${{ github.event.pull_request.base.sha }}'
          head='${{ github.event.pull_request.head.sha }}'
          # Fail if any commit in the PR has a denylisted author email
          if git log "$base..$head" --pretty='%ae' \
              | grep -Ei '@(some-agent-host)\.com|noreply.*bot'; then
            echo "::error::PR contains commits from disallowed authors"
            exit 1
          fi
This won't stop the PR from being opened, but once the check is marked as required in branch protection, a failing run blocks the merge and the maintainer sees the reason immediately.
Prevention Tips
A few things I'd do from day one on a new public repo:
- Require signed commits on protected branches. Signing isn't a perfect bot-blocker, but it raises the cost meaningfully. See the official Git docs on commit signing.
- Set up CODEOWNERS so PRs to sensitive paths require review from a known human.
- Track patterns over time. Re-run the audit script monthly. Bot operators change their fingerprints; your denylist needs to keep up.
- Don't over-block. Every false positive costs you a real contributor. Start with detection, log everything for a week, then move to enforcement.
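Before turning on required signing, it can be useful to see how much of the existing history would fail it. Git's %G? format placeholder prints a one-letter signature status per commit, with N meaning no signature. The sketch below assumes rows of hash, status, and email; unsigned_commits is a hypothetical helper and the sample rows are invented.

```shell
#!/bin/sh
# unsigned_commits: hypothetical helper. Reads "hash status email" rows, as
# produced by `git log --pretty='%h %G? %ae'`, and prints the rows whose
# signature status is N (no signature).
unsigned_commits() {
  awk '$2 == "N"'
}

# Invented sample rows: G is a good signature, E means the signature could
# not be checked (for example a missing key), N means unsigned.
printf '%s\n' \
  'a1 G jane@example.com' \
  'b2 N agent@agent-host.example' \
  'c3 E sam@example.com' \
  | unsigned_commits
```

On a real branch this would be git log --pretty='%h %G? %ae' | unsigned_commits. Treat the result as a report for the detection phase, not as a verdict on any single contributor.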
None of this is a silver bullet — a determined operator can spoof author metadata trivially, and the sophisticated ones already do. But the spam tier of AI bot PRs almost never bothers, because they're optimizing for volume. Filtering on --author knocked our noise level down by something like 80% in the first week. Worth the afternoon it took to set up.