I saw a tweet recently from Pete himself about the scale of PR submissions on high-traffic repos like OpenClaw. AI agents are great for coding, but they are flooding maintainers with duplicate logic and PRs that don't align with the maintainers' long-term vision for the project.
So I decided to build something that would audit the PRs and tag them accordingly.
An interesting problem to solve in this project was identifying when two contributors are solving the same functional problem in completely different ways, often across disjoint files.
So how did it go?
I ran an audit on 200 recent Pull Requests in the shadcn-ui/ui repo. It seemed like a perfect high-traffic test bed for an analysis like this. I expected simple code clones, but I found unique-looking PRs that were actually solving the identical problem.
The system isn't looking for copy-pasted lines; it's evaluating the architectural goal. If it finds a match, it classifies it into one of these buckets:
- SHADOW: An exact duplicate fix for the same regression.
- SUPERSET: A broader architectural fix that covers a smaller, specific one.
- COMPETING: Two different paths taken to solve the same functional outcome.
Here is the best match I found. All three of these PRs try to fix the broken /blocks page link:
| PR # | Title | Strategy | File Modified |
|---|---|---|---|
| #10156 | fix: update broken link on /blocks page | Simple URL replacement | apps/www/config/docs.ts |
| #10088 | fix(docs): absolute path for blocks link | Path normalization | apps/www/lib/utils.ts |
| #10096 | chore: rename internal block references | Refactoring the reference key | apps/www/registry/registry.json |
Even though the code changes were in completely disjoint files (Config vs. Utils vs. Registry), the system identified that all three were targeting the same functional failure. This is "Goal Duplication," and it’s arguably the most expensive type of noise for a maintainer to filter manually.
PR #10088 solved the root cause (renaming the file), rendering the individual documentation fixes in #10156 and #10096 redundant before they were even merged.
Top 10 matches from the scan
Across the 200 PRs, the script flagged 69 valid redundancies. Here are some of the most interesting ones:
| PR Identity | Primary Match | Categorisation | Why It Matters |
|---|---|---|---|
| #10404: ThemeHotkey guard | #10401 | SHADOW | Identical null-check for event.key crash in Hotkeys. |
| #9895: Docs Copy Button | #9876 | SHADOW | Identical split of bash cmd/text to fix the copy button. |
| #10421: DataTable A11y | #10402 | SHADOW | Concurrent addition of aria-labels to Data Tables. |
| #10403: Drawer asChild fix | #10139 | SHADOW | Adding asChild to Drawer docs to fix broken nesting. |
| #10424: Monorepo CLI fix | #10258 | SUPERSET | #10424 is a broader "fix-at-once" strategy for monorepo CLI. |
| #10393: Geist Font Mismatch | #10273 | SUPERSET | #10273 provides a more robust font mapping than #10393. |
| #10244: Calendar Responsive | #10235 | COMPETING | Different CSS strategies for Calendar width responsiveness. |
| #10386: ThemeHotkey bug | #10404 | SHADOW | Identical logic-level fix for an undefined key crash during autofill. |
| #10383: FieldSeparator fix | #10201 | SHADOW | Both PRs modify the same property to fix separator inheritance. |
| #10158: iOS Date Input | #10133 | COMPETING | Global CSS vs. Component-level fix for the same iOS rendering bug. |
So, How Does It Work?
I initially set out to build two things: a backfill script to audit historical data, and a real-time bot for live triage. I ended up leaning hard into the first: finding clusters of historical redundancy is the ultimate stress test for any detection engine.
Phase 1: The Ingestion & The "Context Squeeze"
The first hurdle was the sheer volume of data in the PRs. Finding the sweet spot between hallucination and rate limiting was critical.
Using Octokit, I paginate through the PRs targeting critical branches like main or master.
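Sketched out, the fetch step looks roughly like this. `octokit.paginate` with `rest.pulls.list` is the real Octokit API; the function name, the option values, and the structural `OctokitLike` stand-in (used so the sketch runs without the real client) are illustrative:

```typescript
type PullSummary = { number: number; title: string };

// Structural stand-in for the Octokit client so the sketch stays
// self-contained; in the real script this would be `new Octokit({ auth })`.
interface OctokitLike {
  rest: { pulls: { list: unknown } };
  paginate(route: unknown, opts: Record<string, unknown>): Promise<PullSummary[]>;
}

// Walk every PR that targets a critical branch (main/master).
async function listTargetPRs(
  octokit: OctokitLike,
  owner: string,
  repo: string,
  base = "main",
): Promise<PullSummary[]> {
  return octokit.paginate(octokit.rest.pulls.list, {
    owner,
    repo,
    base,          // only PRs merging into this branch
    state: "all",  // audit open, closed, and merged PRs alike
    per_page: 100, // max page size keeps the request count down
  });
}
```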
Managing token budgets was critical, since my total budget was $0. To survive the free tier, I aggressively compress and filter massive code changes:
- Stripping out known-noise files like SVGs, lockfiles, and documentation.
- Removing all comments and unchanged imports to isolate the core logic.
- If the code is still over a 1500-character limit, extracting only the modified hunks (the `+` and `-` lines).
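A minimal sketch of that squeeze, assuming the files arrive in GitHub's `{ filename, patch }` shape; the helper names, the exact noise filters, and the fallback order are illustrative:

```typescript
// File patterns that add tokens but carry no signal for goal detection.
const NOISE = [/\.svg$/i, /\.lock$/i, /-lock\.(json|ya?ml)$/i, /\.mdx?$/i];

// Keep only the changed lines of a unified diff, dropping context
// lines and the +++/--- file headers.
function extractHunks(patch: string): string {
  return patch
    .split("\n")
    .filter((l) => /^[+-]/.test(l) && !/^(\+\+\+|---)/.test(l))
    .join("\n");
}

function squeeze(
  files: { filename: string; patch?: string }[],
  limit = 1500,
): string {
  const kept = files.filter(
    (f) => f.patch && !NOISE.some((re) => re.test(f.filename)),
  );
  const full = kept.map((f) => `### ${f.filename}\n${f.patch}`).join("\n");
  if (full.length <= limit) return full;
  // Still too big: fall back to modified hunks only, then hard-cap.
  return kept
    .map((f) => `### ${f.filename}\n${extractHunks(f.patch!)}`)
    .join("\n")
    .slice(0, limit);
}
```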
Phase 2: Vector Memory & Structural Bias
Once cleaned, we convert the PR into a vector using Gemini embedding models and index it into an Upstash Vector database. When the sweep encounters a new PR, it simply queries this memory for the 8 most similar candidates.
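Roughly, the memory layer looks like this. The interface mirrors the shape of `@upstash/vector`'s `Index` (`upsert` / `query` with `topK`), but the function name and the query-before-upsert ordering are my sketch, not the verbatim script:

```typescript
type Match = { id: string; score: number; metadata?: Record<string, unknown> };

// Minimal shape of the store; in the real script this would be
// `new Index({ url, token })` from @upstash/vector.
interface VectorIndex {
  upsert(v: { id: string; vector: number[]; metadata?: Record<string, unknown> }): Promise<unknown>;
  query(q: { vector: number[]; topK: number; includeMetadata?: boolean }): Promise<Match[]>;
}

// Query the memory for the 8 nearest PRs, then index the new one.
// Querying first guarantees a PR can never match itself.
async function findCandidates(
  index: VectorIndex,
  prId: string,
  embedding: number[],
  meta: Record<string, unknown> = {},
): Promise<Match[]> {
  const matches = await index.query({ vector: embedding, topK: 8, includeMetadata: true });
  await index.upsert({ id: prId, vector: embedding, metadata: meta });
  return matches.filter((m) => m.id !== prId);
}
```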
However, I quickly learned that vector similarity doesn't always equal functional redundancy. In early tests, a clear structural bias emerged: the system would flag totally different JSON additions as duplicates simply because their shape looked similar.
To solve this, I forced the engine to prioritize literal values (like IDs and URLs) over syntax, narrowing the sieve from finding similar code to finding identical goals in the next stage.
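A toy version of that re-weighting, assuming simple regex extraction of literal "anchors" (the production heuristics are presumably richer than three regexes):

```typescript
// Pull out literal anchors: URLs, quoted strings, and long numeric IDs.
// Two PRs sharing these are far more likely to target the same goal
// than two PRs that merely share JSON shape.
function extractLiterals(code: string): string[] {
  const urls = code.match(/https?:\/\/[^\s"']+/g) ?? [];
  const strings = code.match(/"[^"]{3,}"/g) ?? [];
  const ids = code.match(/\b\d{3,}\b/g) ?? [];
  return [...new Set([...urls, ...strings, ...ids])];
}

// Jaccard overlap of the two literal sets, in [0, 1].
function literalOverlap(a: string, b: string): number {
  const A = new Set(extractLiterals(a));
  const B = new Set(extractLiterals(b));
  if (A.size === 0 && B.size === 0) return 0;
  const inter = [...A].filter((x) => B.has(x)).length;
  return inter / new Set([...A, ...B]).size;
}
```

A score like this can then be blended with (or gate) the raw cosine similarity before candidates reach the LLM stage.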
Phase 3: The Logic Check & Intelligent Switching
The final pass is the LLM. We take the top candidates from the vector search and feed them into a deep reasoning loop to understand the author's intent.
Handling the rate limits required a router that automatically pivots between providers like Gemini and Llama depending on availability; OpenRouter and similar services can also be used. This stage is resilient by design, using 3-retry logic with exponential backoff to handle the 503 "busy" errors common on free-tier APIs.
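A sketch of that switching logic, with provider wiring omitted; the retry count and backoff follow what's described above, while the delays and the injectable `sleep` are my choices:

```typescript
type Provider = (prompt: string) => Promise<string>;

// Try each provider in order; on a failure (e.g. a free-tier 503),
// back off exponentially for up to 3 attempts before moving on.
async function callWithFallback(
  providers: Provider[],
  prompt: string,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<string> {
  for (const provider of providers) {
    for (let attempt = 0; attempt < 3; attempt++) {
      try {
        return await provider(prompt);
      } catch {
        await sleep(500 * 2 ** attempt); // 500ms, 1s, 2s
      }
    }
  }
  throw new Error("All providers exhausted");
}
```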
At this stage, each PR is categorised into one of the buckets described above: SHADOW, SUPERSET, or COMPETING.
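Since the source doesn't show the actual prompt or response format, here is one hedged way to make that last step robust: scrape the model's output for a recognised label instead of trusting it to emit strict JSON, since free-tier models often wrap answers in prose or code fences:

```typescript
type Verdict = "SHADOW" | "SUPERSET" | "COMPETING" | "NONE";

// Scan for the first recognised label rather than JSON.parse-ing blindly.
function parseVerdict(raw: string): Verdict {
  const m = raw.toUpperCase().match(/\b(SHADOW|SUPERSET|COMPETING)\b/);
  return (m?.[1] as Verdict) ?? "NONE";
}
```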
Where Things Broke Down
No automated system is flawless, and this one definitely had its own quirks. A few of them were actually pretty fascinating from an engineering point of view:
- The Vector Gap: When two PRs tried to solve the same problem but did it in very different ways, the system sometimes failed to connect them at all. The vector search wouldn’t fire, which meant those PRs never even reached the LLM for reasoning.
- The Model Problem: At one point, I maxed out my higher-tier AI quota and had to fall back to a smaller 8B model. That's where things got weird. The smaller model showed a clear structural bias: it started flagging totally new registry entries as duplicates just because they looked similar at a JSON structure level. It was effectively ignoring the actual URL values because it wasn't smart enough to weigh content over structure.
- Weak on Wide Sweeps: When looking across large PRs, the system sometimes saw relationships that weren’t really there. If two big PRs happened to touch the same package, it could hallucinate a connection between them, even if, logically, they had nothing to do with each other.
What’s Next?
The backfill engine is now the analytical core for what’s next: a live GitHub Bot.
By using the historical memory we’ve built, the bot can analyze a new PR the second it’s opened and alert maintainers if a redundant fix already exists. I’m also exploring a Maintainer Dashboard to visualize these semantic clusters, giving project maintainers a high-level view of where their contributors are accidentally overlapping.
If you're a maintainer interested in trying it out on your repo or a developer who wants to contribute, hit me up, I'd love to chat.

