GnomeMan4201

Posted on May 21

Found a Second Layer to a GitHub Follow Botnet

#security #github #python #opensource

Forensic mapping of 552 linked repositories

This is Part 2 of an ongoing investigation. Part 1 documented the initial discovery — 8 accounts with Jaccard following-list similarity of 0.99+ across ~29,800 entries each, evading cross-follow detection entirely.

After Part 1 published, I kept pulling the data.

Subsequent analysis expanded the cluster to 9 accounts, recovered infrastructure linkage to a specific GitHub identity, and mapped the generation pipeline responsible for all 552 repositories across the cluster. The pipeline left recoverable artifacts in every repository it produced.

Following that same pipeline fingerprint led to an earlier operation — running nine months before the follow botnet was provisioned. The same GitHub identity appears in both. So does the same generator artifact. Four accounts documented in Part 1 appear in both operations.

This post documents what the data shows. Inference is labeled as inference throughout. I am not establishing intent or ownership beyond what the API evidence directly supports.

Methodology

All findings in this report are derived from:

GitHub REST API v3 responses (public endpoints, authenticated requests)
Raw git commit metadata
Repository file retrieval via raw content endpoints
WHOIS and DNS records
Graph overlap analysis (Jaccard similarity, set intersection, cosine similarity of similarity vectors)

No private repositories, leaked credentials, or non-public systems were accessed. Every finding in the confirmed findings table is reproducible from public API endpoints with a valid GitHub token.

This analysis is limited to publicly available GitHub API data and does not include private network signals, rate-limited endpoints, or non-indexed interactions. Findings reflect the state of public data at time of retrieval.

Epistemic Boundaries

These boundaries apply to the entire report and are stated once here rather than repeated inline.

What this analysis can establish:

Observable properties of public GitHub accounts, repositories, and commit metadata
Statistical deviation from a naive independent-uniform baseline model
Structural matches between artifacts across two time-separated operations
Presence of the same authenticated GitHub ID across multiple contexts

What this analysis cannot establish:

Who controls any of the accounts documented
Whether hajigur69 is an operator, collaborator, or an identity whose credentials were reused
Intent or downstream use of the documented infrastructure
Whether lynewinter's pairwise similarity to the non-mariwatts accounts meets cluster inclusion criteria — that data was not retrieved
Whether the fallback_ label carries the semantic meaning its name implies in the generating tool's context

Inference is labeled as inference when it appears. The confirmed findings table at the end of this report lists only directly observable, API-verifiable data.

Baseline: What Random Accounts Would Look Like

Before presenting the cluster data, it is worth establishing what Jaccard similarity looks like under independent sampling — the null hypothesis.

GitHub has approximately 100 million user accounts. Each cluster account follows ~29,800. Under a naive independent-uniform model — where two accounts select their follows independently and uniformly at random from the full user population:

E[|A ∩ B|] = k² / N = (29,800)² / 100,000,000 ≈ 8.88 accounts
E[Jaccard]  = 8.88 / (2×29,800 − 8.88)         ≈ 0.000149

The 3-sigma upper bound under this model is approximately 0.000299.
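
The baseline figures above can be reproduced with a short calculation. This is a minimal sketch under the same naive assumption (two independent, uniformly chosen follow lists of equal size); the function name and the hypergeometric variance term are illustrative choices, not taken from the original tooling.

```python
import math

def uniform_null_jaccard(k: int, n: int, sigmas: float = 3.0):
    """Expected Jaccard, and an upper bound, when two accounts each pick
    k follows independently and uniformly from n users (assumed model)."""
    # The overlap of two independent uniform k-subsets is hypergeometric.
    mean = k * k / n
    var = k * (k / n) * (1 - k / n) * (n - k) / (n - 1)
    upper = mean + sigmas * math.sqrt(var)

    def to_jaccard(overlap: float) -> float:
        # |A ∩ B| / |A ∪ B| with |A| = |B| = k
        return overlap / (2 * k - overlap)

    return mean, to_jaccard(mean), to_jaccard(upper)

mean_overlap, mean_j, upper_j = uniform_null_jaccard(29_800, 100_000_000)
print(f"E[overlap]={mean_overlap:.2f}  E[J]={mean_j:.6f}  3-sigma J={upper_j:.6f}")
print(f"observed minimum / E[J] = {0.9898 / mean_j:,.0f}x")
```

Changing k or N shifts the ratio, which is one reason it should be read as a reference point rather than a test statistic.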

The observed cluster minimum is 0.9898, roughly 6,642× the expected Jaccard similarity under the uniform independence model.

This model makes simplifying assumptions that do not hold on GitHub: following behavior is not uniformly distributed, popular accounts attract disproportionate follows, and community clustering means real accounts share partial follow overlap above the uniform baseline. A realistic null model would produce a higher baseline than 0.000149. The observed values would still exceed it by orders of magnitude, but the precise ratio is model-dependent and should not be read as a formal hypothesis-test result. It is presented as a reference point against the simplest baseline, not as a statistically calibrated rejection threshold.

Every similarity value in the cluster below sits against this reference.

The Cluster Expanded

Running Jaccard similarity analysis against the original 8 accounts and their extended follower graphs surfaced a ninth account: lynewinter.

lynewinter  ↔  mariwatts   jaccard=0.9898   shared≈29,200

The methodology is identical to Part 1. A coefficient of 0.9898 across ~29,800 following entries places this pair within the same anomalous range as the original cluster. Against the null model baseline of 0.000149, this value is statistically incompatible with independent account behavior.
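
For readers who have not seen Part 1, the coefficient is intersection over union of two following lists. A minimal sketch, using hypothetical toy logins rather than real data:

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two collections of logins: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy lists: 3 shared logins, 5 distinct logins overall.
x = {"u1", "u2", "u3", "u4"}
y = {"u2", "u3", "u4", "u5"}
print(jaccard(x, y))
```

On real data the inputs would be the full paginated following lists for each account.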

The confirmed cluster is now 9 accounts:

canestein, hazexone, domcomit, kylehyne, jaderytm,
vierystein, hanyvert, mariwatts, lynewinter (partial coverage — 1 confirmed pairwise value)

Similarity Structure of the Full Cluster

[Figure: heatmap of the pairwise Jaccard matrix for the 9 accounts, computed as following-list intersection over union.]

Per-account mean first-order similarity within the cluster:

jaderytm      mean=0.9970   min=0.9908   max=0.9998
mariwatts     mean=0.9968   min=0.9898   max=0.9998
kylehyne      mean=0.9970   min=0.9907   max=0.9998
domcomit      mean=0.9967   min=0.9912   max=0.9997
hanyvert      mean=0.9966   min=0.9912   max=0.9997
canestein     mean=0.9969   min=0.9907   max=0.9996
vierystein    mean=0.9962   min=0.9909   max=0.9985
hazexone      mean=0.9912   min=0.9907   max=0.9925  ← peripheral
lynewinter    mean=0.9910   min=0.9898   max=0.9912  ← peripheral

hazexone and lynewinter are the structural outliers of the cluster. Their mean within-cluster similarity (0.9912 and 0.9910 respectively) sits approximately 0.006 below the core group mean of ~0.9969. Both still exceed 0.98 on every confirmed pairwise comparison. One interpretation consistent with, but not uniquely explained by, the data: they were provisioned from the same seed list but at a different time or via a slightly diverged list version. The position in the similarity distribution is observation; the generative explanation is inference.

Second-Order Structure

The "second layer" referenced in the investigation title refers to structure that emerges when comparing the similarity profiles of accounts, not just their direct following overlap.

Each account can be represented as a vector of its Jaccard similarities to all other cluster members. Computing cosine similarity between these vectors yields a second-order metric: how structurally equivalent two accounts are within the similarity graph.

For this cluster, all pairwise cosine similarities between Jaccard vectors compute to >0.9999 at four decimal places.

Important limitation: this result is partly a consequence of low variance across the input vectors. When all Jaccard values in a cluster fall within the range 0.98–0.9998, the similarity vectors are themselves numerically similar regardless of structural origin — cosine saturation at this scale is expected even for accounts that share only approximate overlap. The result does not independently establish shared generation; it is consistent with it, but a high-similarity cluster will produce this outcome under multiple generative models.
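
The saturation effect is easy to demonstrate. The sketch below uses made-up similarity vectors (not the cluster's real matrix): one pair of near-identical profiles, and one deliberately mismatched pair whose entries alternate between the extremes of the 0.98–0.9998 range.

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical Jaccard profiles for two similar accounts.
profile_a = [0.9998, 0.9970, 0.9907, 0.9985]
profile_b = [0.9996, 0.9965, 0.9912, 0.9980]

# Hypothetical worst case inside the same value range: opposite patterns.
mismatch_a = [0.98, 0.9998, 0.98, 0.9998]
mismatch_b = [0.9998, 0.98, 0.9998, 0.98]

print(f"similar profiles:    {cosine(profile_a, profile_b):.7f}")
print(f"mismatched profiles: {cosine(mismatch_a, mismatch_b):.7f}")
```

Even the mismatched pair stays above 0.9997, which is why cosine values this close to 1 carry little independent weight when every input lies in such a narrow band.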

What the second-order metric does contribute: the inter-account variance in similarity profiles is extremely low across all 9 accounts. In organic follower networks, accounts accumulate following behavior across different communities over time and typically show differentiated structural positions — some accounts are more similar to high-degree hubs, others to peripheral clusters. The near-zero variance here is consistent with accounts whose following lists were seeded from the same source, but that interpretation is not uniquely supported by this metric alone.

A proper second-order baseline would require computing cosine similarity distributions across random high-degree graph samples of comparable size and density. That computation is not performed here. The metric is presented as a structural observation, not a statistically calibrated discriminator.

Sanity Check: What Doesn't Fit Cleanly

lynewinter is worth examining as a boundary case.

It was added to the cluster based on a single confirmed pairwise value: Jaccard 0.9898 against mariwatts. Its pairwise values against the remaining 7 accounts were not directly retrieved and are not included in the matrix above.

What this means: lynewinter's cluster membership rests on one confirmed measurement. It satisfies the inclusion threshold. It does not have the full evidentiary support of the core 7 accounts, which have complete pairwise matrices. The heatmap reflects only confirmed values for lynewinter; cells against the 7 non-mariwatts accounts should be treated as unverified.

If lynewinter were removed from the cluster, the core 8-account finding from Part 1 is unchanged. The lynewinter inclusion is the weakest link in the cluster membership list, and that is worth stating explicitly.

552 Repositories, One Embedded Timestamp per Account, 34-Minute Span

Each of the 9 accounts has between 57 and 63 public repositories. Total across the cluster: 552 repositories.

Every repository was created on May 12, 2026. Fetching the first repository per account and reading the raw README returned an HTML comment — invisible on the rendered page — containing a creation timestamp and a job identifier:

2026-05-12 11:10:39 | hanyvert   | SwapLink     | job=48099
2026-05-12 11:18:46 | jaderytm   | GasSync      | job=39412
2026-05-12 11:27:52 | hazexone   | BitForge     | job=63871
2026-05-12 11:30:00 | canestein  | BlockLink    | job=51606
2026-05-12 11:33:07 | mariwatts  | MintChain    | job=82564
2026-05-12 11:35:37 | vierystein | HashSync     | job=20845
2026-05-12 11:38:58 | kylehyne   | SmartLink    | job=38575
2026-05-12 11:42:07 | lynewinter | YieldChain   | job=78012
2026-05-12 11:44:30 | domcomit   | ProjectCloud | job=26977

The first and last timestamps are 34 minutes apart. The job IDs are non-sequential across accounts — consistent with a job queue dispatching work across multiple workers concurrently, though the data does not rule out other scheduling patterns.
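
The span can be checked directly from the nine timestamps listed above. A minimal sketch:

```python
from datetime import datetime

# Embedded timestamps from the first repository of each account (listed above).
stamps = [
    "2026-05-12 11:10:39", "2026-05-12 11:18:46", "2026-05-12 11:27:52",
    "2026-05-12 11:30:00", "2026-05-12 11:33:07", "2026-05-12 11:35:37",
    "2026-05-12 11:38:58", "2026-05-12 11:42:07", "2026-05-12 11:44:30",
]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
span = max(parsed) - min(parsed)
print(span)  # 0:33:51
```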

The comment format is consistent across all sampled READMEs:

<!-- fallback_BlockLink_20260512113000_51606 -->

The fallback_ prefix is present in every instance retrieved. In template generation systems, a fallback_ label typically indicates the primary generation path failed and a static secondary template was substituted. Whether that interpretation applies here is inference. What is directly observable is that the prefix is consistent across all 552 repositories and across both the 2026 and 2025 operations documented below.
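
Extracting the marker from a raw README is a one-regex job. The sketch below assumes the layout seen in the examples (name, 14-digit timestamp, numeric ID); the pattern and function name are illustrative, not the investigation's actual tooling.

```python
import re
from datetime import datetime

# Assumed layout: <!-- fallback_{name}_{YYYYMMDDHHMMSS}_{id} -->
MARKER = re.compile(
    r"<!--\s*fallback_(?P<name>.+?)_(?P<stamp>\d{14})_(?P<ident>\d+)\s*-->"
)

def parse_marker(readme_text: str):
    """Return the first fallback marker found in a README, or None."""
    m = MARKER.search(readme_text)
    if m is None:
        return None
    return {
        "name": m["name"],
        "timestamp": datetime.strptime(m["stamp"], "%Y%m%d%H%M%S"),
        "id": int(m["ident"]),
    }

print(parse_marker("<!-- fallback_BlockLink_20260512113000_51606 -->"))
```

Because the comment is invisible on the rendered page, it has to be read from the raw file content rather than the HTML view.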

Repository Contents

Structural Uniformity

Fetching file trees and raw content from sampled repositories across all 9 accounts returned the same structural pattern.

A representative Python file (blocklink.py, 1,656 bytes):

class BlockLink:
    def run(self) -> bool:
        try:
            self.logger.info("Starting BlockLink processing")
            # Add your main logic here
            self.logger.info("Processing completed successfully")
            return True

The placeholder comment "# Add your main logic here" is the only content between the two logging calls in the method body. Every sampled repository follows this pattern: a class stub, a logging initializer, an argparse entry point, and a test file that instantiates the stub. No functional implementation was found in any sampled file. Repo names follow a [Word][Suffix] pattern, with suffixes drawn from a small recurring set that includes Core, Chain, Sync, Vault, Forge, and Link.

Engagement Signals

Across all 552 repositories at time of retrieval:

Signal	Count
Stars	0
Forks	0
PyPI uploads	0
CI/CD config files	0
Open issues	0
Pull requests	0

These are directly retrievable via the GitHub API. Their absence across 552 repositories is a cluster-level property, not an individual account characteristic.

Generation Artifacts

All 552 repositories contain an embedded HTML comment in the README, invisible on the rendered page, in the format <!-- fallback_{name}_{timestamp}_{id} -->. The fallback_ label is treated here as an embedded string, not as a confirmed semantic signal about generation pipeline behavior. The LICENSE URL substitution error, documented in the following section, is the more structurally significant generation artifact, as it demonstrates a shared template origin independently of any label interpretation.

A Template Substitution Error Confirms Shared Generation

The LICENSE section of every generated README contains a hardcoded URL using mariwatts as the repository owner regardless of which account's repository it appears in.

From canestein/BlockLink:

See the LICENSE file at https://github.com/mariwatts/BlockLink/blob/main/LICENSE

From lynewinter/YieldChain:

See the LICENSE file at https://github.com/mariwatts/YieldChain/blob/main/LICENSE

The repo name variable was substituted correctly. The account name variable in the LICENSE URL field was not. mariwatts appears to be the base account in the generation template: the value present when the template was authored and never replaced during per-account substitution. The mismatch was confirmed across multiple accounts. In the mariwatts repositories themselves the URL points at the correct owner, so no mismatch is visible there.
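
A check for this kind of leak only needs the repository owner and the README text. The sketch below is illustrative: the regex and function name are assumptions, and it flags any github.com link whose owner segment differs from the account hosting the README.

```python
import re

GITHUB_URL = re.compile(
    r"https://github\.com/(?P<owner>[A-Za-z0-9-]+)/(?P<repo>[A-Za-z0-9._-]+)"
)

def foreign_owner_links(readme_text: str, owner: str) -> list:
    """List github.com/<owner>/<repo> links whose owner is not `owner`."""
    return [
        m.group(0)
        for m in GITHUB_URL.finditer(readme_text)
        if m["owner"].lower() != owner.lower()
    ]

line = "See the LICENSE file at https://github.com/mariwatts/BlockLink/blob/main/LICENSE"
print(foreign_owner_links(line, "canestein"))  # flagged
print(foreign_owner_links(line, "mariwatts"))  # nothing to flag
```

Legitimate READMEs link to other owners all the time, so a hit is only meaningful when the same foreign owner recurs in the same field across many accounts.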

The Pipeline Is Linked to a Specific GitHub Identity via Commit Metadata

Every repository across all 9 accounts contains this co-author trailer:

Co-authored-by: Hajigur <66867581+hajigur69@users.noreply.github.com>

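GitHub's ID-based noreply format is NUMERICID+login@users.noreply.github.com. The numeric ID is assigned at account creation and cannot be changed by the user. A Co-authored-by trailer is plain text in the commit message, however, so GitHub does not verify it at push time: the trailer shows that this exact string was written into the commit, not that the account authenticated the push.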

The GitHub account hajigur69 has internal numeric ID 66867581:

curl -s https://api.github.com/users/hajigur69 | python3 -c \
  "import json,sys; u=json.load(sys.stdin); print(u['id'])"
# 66867581

Commits on hajigur69's own repository (Cloud9, created February 2026) carry the same identifier:

Author: Hajigur | 66867581+hajigur69@users.noreply.github.com

Observation

The same authenticated GitHub ID appears in commits across all 9 cluster accounts and in commits authored directly by hajigur69.

Interpretation

This co-author line links the cluster's commit history to a specific GitHub identity. The link does not identify an operator: it is compatible with shared control, credential sharing, or credential compromise, and the data does not distinguish between them.

hajigur69: GitHub account created June 13, 2020. At time of retrieval: 903 followers, 679 following. Public bio: lamer.

Infrastructure: carox.tech

Two cluster accounts — canestein and lynewinter — use a custom email domain in their git commit author metadata:

canestein  → locis@carox.tech
lynewinter → doar@carox.tech

WHOIS and DNS:

Creation Date:  2025-07-19
Updated Date:   2025-08-01
Registrar:      Namify Domains Inc
Name Servers:   raphaela.ns.cloudflare.com / uriah.ns.cloudflare.com
A record:       none
MX:             Cloudflare Email Routing (3 records)
TXT:            v=spf1 include:_spf.mx.cloudflare.net ~all

No web presence. MX records point to Cloudflare Email Routing — a free forwarding service. Destination inbox not publicly recoverable. Domain predates the cluster provisioning event by approximately 10 months.

A Malformed Co-Author Address

In addition to the hajigur69 trailer, a second co-author line appears across 8 of the 9 accounts:

Co-authored-by: v <v@users.noreply.github.com>

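There is a GitHub account with login v (ID: 627846). Its ID-based noreply address would be 627846+v@users.noreply.github.com. The string in these commits lacks that numeric prefix. GitHub did issue prefix-less addresses of the form login@users.noreply.github.com to accounts created before July 18, 2017, so the form alone does not show how the string entered these commits; what is observable is that it differs from the ID-based format used in the hajigur69 trailer.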

The most likely explanations: a user.email set manually in a local git config, a placeholder from a development environment not replaced before deployment, or a test identity carried into production. All three produce the same result — consistent across 8 of 9 accounts, set once, never audited. This is not an attribution of the v GitHub account to this operation.
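
Both trailers can be pulled out of a commit message and sorted by address form. This is a minimal sketch with assumed regexes and labels; it classifies the string only and says nothing about who wrote it.

```python
import re

TRAILER = re.compile(
    r"^Co-authored-by:\s*(?P<name>.*?)\s*<(?P<email>[^>]+)>\s*$",
    re.IGNORECASE | re.MULTILINE,
)
ID_NOREPLY = re.compile(r"^(?P<id>\d+)\+(?P<login>[^@]+)@users\.noreply\.github\.com$")

def classify_coauthors(message: str) -> list:
    """Return (email, form, numeric_id) for each Co-authored-by trailer."""
    results = []
    for m in TRAILER.finditer(message):
        email = m["email"]
        id_match = ID_NOREPLY.match(email)
        if id_match:
            results.append((email, "id-based noreply", int(id_match["id"])))
        elif email.endswith("@users.noreply.github.com"):
            results.append((email, "noreply without numeric ID", None))
        else:
            results.append((email, "other", None))
    return results

message = (
    "Initial commit\n\n"
    "Co-authored-by: Hajigur <66867581+hajigur69@users.noreply.github.com>\n"
    "Co-authored-by: v <v@users.noreply.github.com>\n"
)
for row in classify_coauthors(message):
    print(row)
```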

An Earlier Operation: The Same Fingerprints, Nine Months Prior

The 66867581+hajigur69 co-author string and the fallback_ generator artifact do not appear for the first time in May 2026.

GitHub's commit search API returns the same string across thousands of commits from a cluster of 22 accounts in a July–August 2025 window:

2025-07-08..2025-07-14  →  1,738 hits
2025-07-15..2025-07-21  →    701 hits
2025-07-22..2025-07-31  →    949 hits
2025-08-01..2025-08-15  →  7,194 hits
2025-08-16..2025-08-31  →      0 hits  ← hard stop

Four accounts from the 2026 follow botnet cluster — canestein, hazexone, domcomit, kylehyne — are present in this earlier commit set. The August 16 cutoff is a directly observable fact. Its cause is not established by this data.
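
Windowed counts like these can be requested from GitHub's commit search endpoint, whose JSON response includes a total_count field. The sketch below only builds the request URL; the exact query the investigation used is not stated, so the search term and the committer-date qualifier here are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def commit_search_url(term: str, start: str, end: str) -> str:
    """Build a /search/commits URL for one date window (assumed query shape)."""
    query = f'"{term}" committer-date:{start}..{end}'
    return "https://api.github.com/search/commits?" + urlencode(
        {"q": query, "per_page": 1}
    )

url = commit_search_url(
    "66867581+hajigur69@users.noreply.github.com", "2025-07-08", "2025-07-14"
)
print(url)
# Round-trip check: the decoded query matches what was intended.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Search results are subject to indexing and rate limits, so counts retrieved later may differ from the figures above.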

Lyne6666

Lyne6666: created May 3, 2025. 163 public repositories, all with a GitHub API creation timestamp of July 9, 2025, 18:55 UTC.

Observation

LICENSE file SHA across all 163 repositories:

8aa26455d23acf904be3ed9dfb3a3efe3e49245a

Git hashes content. Identical SHA across 163 repositories = identical bytes in every file = single source file, copied without modification.
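
The reasoning rests on how Git derives a blob ID: SHA-1 over a short header plus the raw file bytes. A minimal sketch, assuming the SHA reported for the LICENSE file is a standard SHA-1 blob ID:

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Blob ID as Git computes it: sha1(b"blob <size>\\0" + content)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The empty blob has a well-known ID, useful as a self-check.
print(git_blob_sha(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Hashing a downloaded LICENSE file this way and comparing against 8aa26455d23acf904be3ed9dfb3a3efe3e49245a would confirm the byte-for-byte match without relying on the API's reported value.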

Repository names follow {Tech}{Testnet}{Function}{Suffix}. Every README install section:

pip install git+https://github.com/Lyne6666/{RepoName}.git

Present across all 163 repositories. No postinstall hook content was confirmed in examined repositories.

uhsr

The Lyne6666 commit author email field: uhsr@eteb.me — a private domain, WHOIS-shielded via Identity Digital. Account uhsr created July 10, 2025 — one day after Lyne6666's mass repository creation timestamp.

At time of retrieval: 237 public repositories, 2,972 followers, 30,778 following.

Observation: Commit Volume

July 2025:      1,382 commits  (71% of all-time total at retrieval)
August 2025:      247 commits
September 2025:    21 commits
October 2025:      96 commits

Interpretation

A 71% concentration of all-time commit activity within a single calendar month is statistically atypical for accounts with multi-month histories. It is consistent with a scripted bulk operation rather than incremental development. That is an interpretation; the commit counts are directly retrieved from the API.

The Backdated Commit History

uhsr/AssetMarket contains a .Logs file with ~365 entries spanning January 1–December 31, 2025, format: Logs: YYYY-MM-DD <8charToken>.

Observation

Repository creation date:

curl -s "https://api.github.com/repos/uhsr/AssetMarket" | python3 -c \
  "import json,sys; r=json.load(sys.stdin); print(r['created_at'])"
# 2025-08-02T16:29:22Z

Root commit:

SHA:            4f8f47697eb89c8818820ca92348be01c4544878
Message:        Logs on 2025-01-01
Author date:    2025-01-01T14:47:47Z
Committer date: 2025-01-01T14:47:47Z
Author email:   uhsr@eteb.me

The repository did not exist until August 2, 2025. The root commit carries an author date of January 1, 2025 — 213 days earlier.

Derived Structure

Git stores GIT_AUTHOR_DATE and GIT_COMMITTER_DATE separately. Both are user-configurable before a push. Naive backdating sets only the author date, leaving the committer date at the time the commit was actually created, which is a detectable mismatch. In this root commit, both fields are set identically. The mismatch that typically exposes backdating is absent.
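
The two checks described here, author versus committer date and commit date versus repository creation, can be expressed in a few lines. This is a sketch with hypothetical function names, fed with the values reported above.

```python
from datetime import datetime, timezone

def parse_ts(value: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as 2025-08-02T16:29:22Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def date_anomalies(author_date: str, committer_date: str, repo_created: str) -> dict:
    author, committer, created = map(parse_ts, (author_date, committer_date, repo_created))
    return {
        # The mismatch naive backdating leaves behind (absent here).
        "author_committer_match": author == committer,
        # A commit "made" before the repository existed on GitHub.
        "committer_predates_repo": committer < created,
        "days_before_repo": (created.date() - author.date()).days,
    }

result = date_anomalies(
    "2025-01-01T14:47:47Z", "2025-01-01T14:47:47Z", "2025-08-02T16:29:22Z"
)
print(result)
```

A commit that predates its repository is not anomalous on its own, since history can be pushed from an older local repository; the finding here is the combination with the uniform .Logs content.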

Interpretation

The presence of a 213-day pre-creation commit, with both date fields aligned to eliminate the typical detection artifact, is consistent with deliberate fabrication of commit history. The .Logs content — uniform daily entries with 8-character tokens across a full calendar year — is consistent with bulk generation rather than organic accumulation. Both are interpretations. The timestamp mismatch between repo creation and root commit author date is a directly observable, verifiable fact.

The Generator Artifact in the 2025 Repositories

Raw README of uhsr/AssetMarket:

<!-- fallback_AssetMarket_20250802163009_95172 -->

Same embedded string format as the 2026 cluster: a fallback_ prefix, repo name, timestamp, and trailing numeric ID. The fallback_ label is treated as an embedded string whose origin is unknown — it may derive from a template engine, a CI scaffold, a repository bootstrap tool, or a custom generation script. The label alone does not establish what tool produced it or what the label means in that tool's context.

What is directly observable: the format fallback_{name}_{timestamp}_{id} appears in both the 2025 uhsr repositories and across all 552 repositories in the 2026 cluster. That structural match is the finding; the label's semantic meaning within any particular system is not established by this analysis.

Two additional repositories:

uhsr/SmartContract  →  <!-- fallback_SmartContract_20250802162757_83653 -->
uhsr/TokenLab       →  <!-- fallback_TokenLab_20250802161931_80263 -->

Three artifacts within a window of about 11 minutes (10 minutes 38 seconds):

16:19:31  TokenLab      ID: 80263
16:27:57  SmartContract ID: 83653
16:30:09  AssetMarket   ID: 95172

Observation

The trailing IDs increase non-uniformly, with gaps of 3,390 and 11,519. On Linux systems, process IDs increment sequentially; irregular gaps are consistent with other processes consuming assignments between runs. This is an interpretation of the pattern, not a definitive conclusion about the execution environment.
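
The gaps and the elapsed time follow directly from the three markers. A minimal sketch using the values quoted above:

```python
from datetime import datetime

# (repo name, embedded timestamp, trailing ID) from the three README markers.
artifacts = [
    ("TokenLab", "20250802161931", 80263),
    ("SmartContract", "20250802162757", 83653),
    ("AssetMarket", "20250802163009", 95172),
]

times = [datetime.strptime(stamp, "%Y%m%d%H%M%S") for _, stamp, _ in artifacts]
window = times[-1] - times[0]
id_gaps = [later[2] - earlier[2] for earlier, later in zip(artifacts, artifacts[1:])]

print(window)   # elapsed time between first and last marker
print(id_gaps)  # differences between consecutive trailing IDs
```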

The fallback_ prefix and fallback_{name}_{timestamp}_{id} format are identical across both the 2025 and 2026 operations. That is a directly observable structural match.

The Stargazer Overlap

AssetMarket (83 stars), SmartContract (50), DigitalWallet (49) at time of analysis.

Observation

AssetMarket ∩ DigitalWallet ∩ SmartContract = 33 accounts

33 accounts starred all three repositories, which is 67% of DigitalWallet's total star count drawn from a single overlapping pool. Under a model where starring is independent across repositories and there is no shared promotion mechanism, 33 accounts converging on all three low-visibility repositories would be highly improbable; the observed overlap is not consistent with that model.
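
The overlap itself is a plain set intersection over stargazer lists. The sketch below uses small made-up login sets to show the shape of the computation; the real inputs would be the paginated stargazer lists for each repository.

```python
def shared_stargazers(*stargazer_lists) -> set:
    """Logins present in every stargazer list."""
    sets = [set(lst) for lst in stargazer_lists]
    return set.intersection(*sets) if sets else set()

# Hypothetical toy data, not the real stargazer lists.
asset_market = {"a", "b", "c", "d", "e"}
smart_contract = {"b", "c", "d", "x"}
digital_wallet = {"b", "c", "y"}

pool = shared_stargazers(asset_market, smart_contract, digital_wallet)
print(sorted(pool), f"{len(pool) / len(digital_wallet):.0%} of the smallest list")

# The reported figure: 33 shared accounts out of DigitalWallet's 49 stars.
print(f"{33 / 49:.0%}")
```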

The July 11, 2025 batch — 83 repositories, single day:

★2:  64 repositories  (77%)
★1:  19 repositories  (23%)
★0:   0 repositories

No repository has zero stars. Every repository has exactly one or two stars, with no other value present.

Two accounts from the 33-account pool, SAPH1TE and ahnshy, also appear in stargazer lists for Lyne6666 repositories. The uhsr and Lyne6666 clusters share no other observable social graph overlap. These two accounts are the only cross-cluster link found in this data.

Interpretation

What produced the 33-account overlap and the uniform star distribution is not established by this data. The overlap pattern and cross-cluster appearance of two accounts are documented as observations. A coordinated engagement mechanism is one explanation consistent with the data; it is not the only possible explanation.

mohammadtzs

One fork of DigitalWallet exists, made by mohammadtzs. Account created March 2025, 506 public repositories, 100 forks — all from accounts returning 404 at time of retrieval. Fork names included alork1, alork2, alork3, alorki1. mohammadtzs is also present in the 33-account stargazer pool.

Observable: forked a cluster repository, present in the shared stargazer pool, prior forks exclusively from accounts no longer present on the platform.

October 2025: Repository Names

uhsr commit activity: 21 commits in September, 96 in October — concentrated in a 15-minute window, October 20 between 05:04 and 05:19 UTC, across 7 repositories:

awesomepythonTech        → name matches vinta/awesome-python (290k+ stars)
freeprogrammingbooksHub  → EbookFoundation/free-programming-books (340k+)
publicapisAI             → public-apis/public-apis (320k+)
codinginterviewuniversityTools → jwasham/coding-interview-university (310k+)
developerroadmapLab      → kamranahmedse/developer-roadmap (300k+)
systemdesignprimerCloud  → donnemartin/system-design-primer (280k+)
buildyourownxTools       → codecrafters-io/build-your-own-x (330k+)

None contain implementation content. developerroadmapLab description: "enterprise enterprise-grade" — a duplicated token consistent with an unresolved template variable.
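
The naming pattern can be tested mechanically: strip punctuation from the well-known repository name, lowercase both sides, and check whether the candidate starts with it. This is an illustrative sketch; the helper names are assumptions.

```python
def squash(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def mirrors_popular_name(candidate: str, popular_repo: str) -> bool:
    """True if the candidate is the popular name plus a trailing suffix."""
    return squash(candidate).startswith(squash(popular_repo))

pairs = [
    ("awesomepythonTech", "awesome-python"),
    ("freeprogrammingbooksHub", "free-programming-books"),
    ("publicapisAI", "public-apis"),
    ("codinginterviewuniversityTools", "coding-interview-university"),
    ("developerroadmapLab", "developer-roadmap"),
    ("systemdesignprimerCloud", "system-design-primer"),
    ("buildyourownxTools", "build-your-own-x"),
]
for candidate, popular in pairs:
    print(candidate, "->", popular, mirrors_popular_name(candidate, popular))
```

A prefix match on short names would produce false positives, so in practice the comparison list should be limited to long, distinctive repository names like the ones above.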

Alternative Explanations

Could the hajigur69 co-author identity appear in unrelated operations by coincidence? The ID-based noreply format embeds an immutable, account-specific numeric ID. The same ID (66867581) appearing across thousands of commits in a 2025 cluster and across all 552 repositories in the 2026 cluster is not plausibly the product of independent activity, because the string is specific to one account. Either the identity was reused deliberately, or the same credentials or tooling were used in both operations.

Could the fallback_ artifact format be from a widely distributed open-source tool? Possible. If the prefix is a convention from a publicly available README generation tool, its presence in both operations indicates both used the same tool — not necessarily the same operator. No such tool was identified in this research. The artifact format is not established as unique to a single actor.

Could the template substitution error (the mariwatts LICENSE URL) appear independently across unrelated generation pipelines? Less likely than the fallback_ case. A shared template variable left unsubstituted in the same field across 552 repositories from 9 accounts is more parsimoniously explained by a single template source than by independent pipelines converging on the same substitution gap. However, if a widely-used generation tool ships with mariwatts hardcoded as a default account in its LICENSE URL template, the error would appear across any pipeline using that tool without modification. That scenario is not established as absent.

Could the four accounts in both operations be coincidentally shared? Four accounts from the 2026 cluster — canestein, hazexone, domcomit, kylehyne — appear in the 2025 commit-farming activity. Their simultaneous presence in two temporally separated operations is statistically implausible under independence assumptions given the scale of GitHub's account population. Whether that overlap reflects shared control is an inference; the factual overlap is directly retrievable.

I am not establishing that a single individual or organization controls both operations. I am documenting that the same authenticated GitHub identity, the same generator artifact format, and four of the same accounts are present in both.

Summary of Confirmed Findings

Finding	Method
Cluster minimum Jaccard 0.9898 vs null baseline 0.000149 (6,642×)	Analytical null model, GitHub API
Second-order cosine similarity >0.9999 (precision saturation) across all 36 pairs	Cosine of per-account Jaccard vectors
`hazexone`, `lynewinter` structurally peripheral (mean ≤ 0.9912)	Within-cluster mean Jaccard
`lynewinter` cluster membership supported by 1 confirmed pair; 7 unverified	Direct pairwise retrieval
All 552 repos created May 12, 2026; embedded timestamps in the first repository of each account span 34 minutes	GitHub API, embedded HTML comment timestamps
`<!-- fallback_NAME_TIMESTAMP_ID -->` in every README	Direct raw file fetch, all 9 accounts
`mariwatts` hardcoded in LICENSE URLs across foreign accounts	Direct raw file fetch
`66867581+hajigur69` co-author on all cluster commits	Raw commit data, GitHub API
`66867581+hajigur69` author on `hajigur69`'s own repository	Raw commit data, GitHub API
`v@users.noreply.github.com` lacks numeric ID prefix	Raw commit data
`locis@carox.tech`, `doar@carox.tech` in commit author fields	Raw commit data
`carox.tech`: no A record, Cloudflare MX, created July 2025	WHOIS, DNS
All 552 repos: zero stars, forks, CI, issues, PRs	GitHub API
Same `fallback_` format in 2025 `uhsr` repositories	Direct raw file fetch
`uhsr/AssetMarket` root commit 213 days before repo creation	GitHub API commit + repo endpoints
Root commit SHA, both date fields set to 2025-01-01	`4f8f47697eb89c8818820ca92348be01c4544878`
Trailing numeric IDs in 3 README comments, generated within about 11 minutes (PID reading is inference)	Direct raw file fetch
33-account pool in all three high-value stargazer lists (67% overlap)	Stargazer API cross-reference
`SAPH1TE`, `ahnshy` in both `uhsr` and `Lyne6666` stargazer lists	Stargazer API cross-cluster
`canestein`, `hazexone`, `domcomit`, `kylehyne` in both 2025 and 2026 operations	Commit search API, Part 1 data
October 2025 repos named after widely-starred repositories	GitHub API, name comparison

Constraints on the Null Hypothesis

These results constrain the likelihood of independent behavior under standard sampling assumptions.

At the cluster minimum of 0.9898, the first-order Jaccard values (0.9898–0.9998) are roughly 6,642× the analytically expected baseline for independent accounts following 29,800 users from a pool of 100 million. The second-order structure, cosine similarity of Jaccard vectors at precision saturation (>0.9999) across all 36 account pairs, is consistent with accounts whose similarity profiles derive from a near-identical source. Higher-precision computation may reveal internal structure not visible at four decimal places; the current data does not resolve it.

Any competing explanation must jointly account for the following linkage classes:

Commit identity — the same authenticated GitHub ID (66867581+hajigur69) present across all 552 cluster repositories and in that identity's own repository
Generator artifact — the fallback_{name}_{timestamp}_{id} format present across both the 2025 and 2026 operations
Cross-operation account overlap — four accounts (canestein, hazexone, domcomit, kylehyne) present in both operations

Disclosure

This report has been submitted in full to GitHub Trust & Safety with API-verifiable evidence including the root backdated commit SHA (4f8f47697eb89c8818820ca92348be01c4544878), the generator artifact URLs, the 33-account stargazer overlap, and the complete account list.

All data was retrieved via the GitHub REST API v3 with authenticated requests. No accounts were accessed beyond their public API surface. No systems were compromised.

All account names published here are publicly visible GitHub profiles. This methodology is only verifiable if the data is reproducible.

If you have seen the hajigur69 co-author string or the fallback_ artifact pattern in your own repositories' commit histories — that is the fingerprint documented here. Worth reporting.

All tooling used in this investigation is in BANANA_TREE.

Top comments (4)

Andy Stewart • May 22

Brilliant data cleaning and cross-graph analysis! This is raw, hardcore data forensics at its finest.

From the Jaccard coefficients and hardcoded variable leaks right down to the immutable noreply IDs, these unforgeable digital fingerprints leave the botnet with nowhere to hide.

GnomeMan4201 • May 22

Thanks Andy. The mariwatts substitution leak was the linchpin. Jaccard proves the cluster, but a shared template bug across 552 repos is hard to argue with. There’s still more in the data.

Rahul S • May 26

The October 2025 repo-naming section is the most operationally revealing part imo. Follow farming and commit farming are both well-documented, but naming repos after awesome-python, system-design-primer, etc. with zero implementation is a different signal — that's GitHub search SEO. Someone searching "system design primer" or "coding interview" might surface these in results alongside the real repos, especially if the follower count looks legit at first glance.

It also means the three infrastructure layers you've documented aren't independent — the follow botnet feeds credibility to the accounts, the repo farm provides search surface area, and the star farm validates individual repos. One campaign, three coordinated layers. The fallback_ artifact tying both operations together makes the timeline clear: repo generation pipeline existed first (July 2025), follow botnet was bolted on later (May 2026) to boost the accounts that already had repos.

Curious whether any of the 29,800 follow targets overlap with accounts that starred the uhsr repos. If the operator is also farming inbound follows from the same star pool, that'd close the loop on whether this is a self-contained reputation laundering system or if the follow targets serve a separate purpose.

GnomeMan4201 • May 26

That’s a really solid observation.

The repo-renaming behavior stood out to me for the same reason: it looked less like normal experimentation and more like search-surface engineering layered onto the broader amplification system.

I also agree the three layers probably weren’t independent operations. The timing correlations and reuse patterns suggest a coordinated reputation-building pipeline rather than isolated activity.

I’ve been looking into overlap analysis between follow targets and inbound network behavior as well, because that could help distinguish whether the objective was closed-loop credibility inflation or broader audience acquisition.

Appreciate the thoughtful breakdown.