AI Research Slop: How to Filter Signal From the arXiv Flood

#ai #webdev #productivity #tutorial

arXiv's cs.LG and cs.CL categories now receive several thousand new submissions every month, and a non-trivial slice of that flood is what researchers on r/MachineLearning have started calling "slop": papers that recycle benchmarks, propose marginal deltas over published baselines, or in some cases were partly authored by the LLMs they claim to evaluate. The signal-to-noise ratio is degrading fast enough that people who used to read the daily arXiv mailing now report giving up. If you are trying to stay current as a developer or applied researcher, the question is not whether to filter but how to filter without missing the few papers that will reshape your stack six months from now.

What slop actually looks like in ML research

Slop is not a single failure mode. It is a cluster.

The first category is the marginal delta: a paper that tweaks a known architecture, runs it on the same three benchmarks everyone uses, and reports a 0.3-point improvement that may or may not survive a seed change. These papers were always common, but submission volume has multiplied them.
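One way to sanity-check a marginal delta is to compare it with the spread across seeds. The sketch below is a rough illustration, not a formal significance test: the scores are made-up numbers for a hypothetical baseline and variant, and the function name and the "twice the standard error" threshold are assumptions chosen for the example.

```python
from math import sqrt
from statistics import mean, stdev


def delta_vs_noise(baseline: list[float], variant: list[float]) -> tuple[float, float]:
    """Return (mean improvement, standard error of that improvement)."""
    delta = mean(variant) - mean(baseline)
    # Standard error of the difference between two independent sample means.
    se = sqrt(stdev(baseline) ** 2 / len(baseline) + stdev(variant) ** 2 / len(variant))
    return delta, se


# Hypothetical benchmark scores from five seeds each (illustrative numbers only).
baseline_runs = [71.2, 70.8, 71.6, 70.9, 71.5]
variant_runs = [71.4, 71.7, 71.0, 71.9, 71.5]

delta, se = delta_vs_noise(baseline_runs, variant_runs)
# Rough rule of thumb (an assumption, not a formal test): require delta > 2 * se.
survives = delta > 2 * se
print(f"delta={delta:.2f} se={se:.2f} survives_seed_noise={survives}")
```

With these illustrative numbers the 0.3-point gain is smaller than twice its standard error, so it would not clear even this loose bar. A paper that reports a single run gives you no way to make the comparison at all.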

The second is benchmark contamination reported without acknowledgment. A model is fine-tuned on data that overlaps with the evaluation set, and the resulting leaderboard win is reported as if the test data had been unseen. The 2024 wave of LLM contamination findings made this category much harder to ignore.
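To make the overlap idea concrete, here is a minimal sketch of an n-gram overlap check, a common heuristic for spotting contamination. Whitespace tokenization, the default n, the function names, and the example strings are all simplifying assumptions; a real audit would work on the actual tokenizer and corpus.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased, whitespace-tokenized n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_example: str, training_text: str, n: int = 8) -> float:
    """Fraction of the eval example's n-grams that also appear in the training text."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return 0.0  # example shorter than n tokens: nothing to compare
    return len(eval_grams & ngrams(training_text, n)) / len(eval_grams)


# Illustrative strings only; n=3 keeps the toy example short.
training_text = "the quick brown fox jumps over the lazy dog near the river bank"
leaked_example = "quick brown fox jumps over the lazy dog"
clean_example = "a slow green turtle crawls under the busy bridge"

print(overlap_fraction(leaked_example, training_text, n=3))
print(overlap_fraction(clean_example, training_text, n=3))
```

A high fraction does not prove the win is invalid, but it is the kind of evidence a paper should report and address rather than leave for readers to discover.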

The third — and the one that surfaces most often in community threads — is LLM-assisted writing that crosses the line from polishing into authorship. Reviewers have flagged submissions where citations point to papers that do not exist, equations contain hallucinated symbols, and the related-work section reads like a summary generated from titles alone. Several venues, including ICLR and NeurIPS, have updated their author policies in response. Reviewer-side abuse exists too: there is meaningful evidence that some reviews are themselves LLM-generated, which sets up a feedback loop where slop reviews slop.

The honest framing is that the median paper has not gotten worse. The tail has gotten much longer, and the cost of finding the good papers inside that tail has gone up.

A triage workflow that scales past 20 papers a week

You cannot read everything. You probably should not even skim everything. The workflow below assumes you have roughly 60 to 90 minutes per week for paper intake and want most of that time spent on papers worth reading carefully.

Filter by source before content. Submissions to top venues that have already cleared first-round review are a different distribution from raw arXiv preprints. If a paper comes from neither an author you recognize nor an affiliation you trust, and has not been accepted somewhere with real review, it goes to a "maybe later" pile. This is not gatekeeping. It is the only sane way to spend a finite reading budget.
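The source filter is simple enough to write down as a rule. The sketch below is one possible encoding, assuming you keep your own allow-lists; the `Paper` fields, the placeholder names in the sets, and the two pile labels are hypothetical and meant to be replaced with your own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:
    title: str
    authors: list[str]
    affiliations: list[str]
    accepted_venue: Optional[str] = None  # None for a raw preprint


# Personal allow-lists (placeholder values, fill in your own).
TRUSTED_AUTHORS = {"A. Researcher"}
TRUSTED_AFFILIATIONS = {"Example Lab"}
REVIEWED_VENUES = {"ICLR", "NeurIPS"}


def triage(paper: Paper) -> str:
    """Return 'read' if any source signal matches, else 'maybe later'."""
    if paper.accepted_venue in REVIEWED_VENUES:
        return "read"
    if TRUSTED_AUTHORS & set(paper.authors):
        return "read"
    if TRUSTED_AFFILIATIONS & set(paper.affiliations):
        return "read"
    return "maybe later"


unknown = Paper("Yet Another Tweak", ["B. Nobody"], ["Unknown Institute"])
accepted = Paper("Reviewed Work", ["B. Nobody"], ["Unknown Institute"], "ICLR")
print(triage(unknown), "|", triage(accepted))
```

The point of writing it down is that the "maybe later" pile becomes a default rather than a decision you remake for every paper.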

Use the abstract as a hypothesis, not a summary. Read it once, then jump to the main results table or method figure. If the gap between the abstract's claim and the table's actual numbers is large, the paper is signaling slop. You will internalize this pattern within a few weeks.

Check whether code and weights ship. For empirical work, the absence of a public repository is no longer a neutral signal. It is a negative one. Links to GitHub, Hugging Face, or Papers with Code should be present and should point to something non-empty. A repository that holds only a README and has seen no commits since submission is a known pattern.
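The placeholder-repository pattern can be expressed as a small classifier. This sketch assumes you have already fetched the file list and commit dates by whatever means you prefer; it makes no API calls, and the function name and labels are hypothetical.

```python
from datetime import date


def repo_signal(file_names: list[str], commit_dates: list[date], submitted: date) -> str:
    """Classify a paper's linked repository from metadata fetched elsewhere."""
    if not file_names:
        return "empty"
    readme_only = all(name.lower().startswith("readme") for name in file_names)
    touched_since = any(d > submitted for d in commit_dates)
    if readme_only and not touched_since:
        return "placeholder"  # README only, no commits since submission
    if readme_only:
        return "readme only, still changing"
    return "has files beyond the README"


# Illustrative dates and file names only.
submitted_on = date(2024, 5, 2)
print(repo_signal(["README.md"], [date(2024, 5, 1)], submitted_on))
print(repo_signal(["README.md", "train.py"], [date(2024, 5, 1)], submitted_on))
```

File names are a weak proxy for real content, so treat "has files beyond the README" as a reason to look, not as a pass.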

Use citation graphs, not citation counts. Connected Papers and Semantic Scholar's "influential citations" view show you whether a paper is being built on or merely cited. A new paper with 200 citations that are all "we follow X" is not the same as one with 30 citations from groups extending its method.
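The difference between being cited and being built on can be reduced to a ratio. The sketch below assumes each citation record carries a boolean `isInfluential` flag, loosely modeled on what Semantic Scholar exposes for citations; treat the field name and record shape as assumptions and check the current API documentation before relying on them. The counts are invented to mirror the example above.

```python
def influential_share(citations: list[dict]) -> float:
    """Fraction of citation records flagged as influential."""
    if not citations:
        return 0.0
    influential = sum(1 for c in citations if c.get("isInfluential"))
    return influential / len(citations)


# Hypothetical papers: 200 citations with 4 flagged, and 30 citations with 12 flagged.
widely_cited = [{"isInfluential": i < 4} for i in range(200)]
built_upon = [{"isInfluential": i < 12} for i in range(30)]

print(f"{influential_share(widely_cited):.2f} vs {influential_share(built_upon):.2f}")
```

Under these made-up counts the smaller paper has the far higher share, which is the signal raw citation counts hide.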

Treat any paper whose related-work section cites work that does not exist as conclusively slop and stop reading. We have seen this in papers from named labs, not just anonymous submissions. A single hallucinated citation is enough — the rest of the document was not checked by a human carefully enough to trust.
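Confirming that a cited paper exists needs a lookup, but a cheap offline pass can catch some impossible references first. The sketch below assumes references use new-style arXiv identifiers of the form YYMM.number and flags any whose month is not 01 to 12 or whose date is later than the citing paper. The function name and the example identifiers are illustrative strings, not checked against real papers, and the regex will also match unrelated numbers that happen to fit the pattern.

```python
import re

# New-style arXiv identifier: YYMM, a dot, 4 or 5 digits, optional version suffix.
ARXIV_ID = re.compile(r"\b(\d{2})(\d{2})\.(\d{4,5})(?:v\d+)?\b")


def implausible_arxiv_ids(reference_text: str, citing_yymm: str) -> list[str]:
    """Return IDs with an impossible month or dated after the citing paper (YYMM)."""
    flagged = []
    for match in ARXIV_ID.finditer(reference_text):
        yy, mm, number = match.groups()
        impossible_month = not 1 <= int(mm) <= 12
        from_the_future = (yy + mm) > citing_yymm  # string compare works for YYMM
        if impossible_month or from_the_future:
            flagged.append(f"{yy}{mm}.{number}")
    return flagged


references = "[1] arXiv:2310.00001 [2] arXiv:2413.00123 [3] arXiv:2609.01234v2"
print(implausible_arxiv_ids(references, citing_yymm="2405"))
```

An identifier that passes this check can still be invented, so anything that survives still needs to be resolved against the real listing.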

Curation services and tools worth your subscription

The cheapest filter is to outsource part of it to someone whose taste you trust.

Sebastian Raschka's Ahead of AI publishes long-form roundups that triage papers around LLMs, training methods, and applied ML. The summaries are technical enough that you can decide whether to read the original from the summary alone.

Import AI by Jack Clark covers policy, capability progression, and a small number of papers per issue. Useful for spotting trends before they hit your feed.

The Batch from DeepLearning.AI is broader and shallower; helpful for orientation if you are not deep in ML day to day, less useful if you are.

alphaXiv layers comments on top of arXiv. Reading the discussion under a contested paper often saves you the trouble of reading the paper itself.

Semantic Scholar's Research Feeds lets you specify seed papers and surfaces new arXiv preprints ranked by relevance to those seeds. That beats a chronological listing because the ranking depends on what you already care about, not on when a paper happened to be posted.

Papers with Code is still the right starting point if you want to find the current state of the art on a specific benchmark and skip the marketing.

For deeper citation-provenance work, such as checking whether later papers support or dispute a claim, scite and Litmaps are worth knowing about, though the fuller feature sets of both sit behind paid plans.

Build your own filter and let it compound

The most underrated filter is the one you build yourself, paper by paper, over months. Every paper you read carefully should produce a one-paragraph note: what the claim is, what evidence supports it, and whether your prior moved. Six months in, you will notice that some authors and labs consistently move your prior and others do not. That ranking is more accurate than any third-party signal.

A simple Notion database with paper title, link, claim, evidence, and a 'would re-read' flag is enough. The 'would re-read' column is the highest-signal field — papers you actually return to are the ones that mattered.
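If you would rather keep the log in a plain file, the same schema fits in a few lines. This is a minimal sketch using the five fields named above; the class name, the helper functions, and the example entries are hypothetical, and CSV is chosen only because most note tools can import it.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PaperNote:
    title: str
    link: str
    claim: str
    evidence: str
    would_reread: bool = False


def to_csv(notes: list[PaperNote]) -> str:
    """Serialize notes to CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(PaperNote)])
    writer.writeheader()
    for note in notes:
        writer.writerow(asdict(note))
    return buffer.getvalue()


def rereads(notes: list[PaperNote]) -> list[str]:
    """Titles of the papers flagged as worth returning to."""
    return [note.title for note in notes if note.would_reread]


# Placeholder entries for illustration.
notes = [
    PaperNote("Paper A", "https://example.org/a", "Claim A", "Table 2 ablation", True),
    PaperNote("Paper B", "https://example.org/b", "Claim B", "Single-seed result"),
]
print(to_csv(notes))
print(rereads(notes))
```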

The compound effect matters more than any single tool. A researcher who has read 200 papers carefully has a better filter than any service can provide, because their filter encodes their specific research interests. The tools above are scaffolding for getting to that point without burning out on slop first.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.