DEV Community

WDSEGA

Posted on • Originally published at wdsega.github.io

The Noise Farmer [Sci-Fi Short Story]

In the age of AI training data markets, a profession called "noise farming" emerged: producing junk data on behalf of an AI company's competitors to degrade that company's training quality. Wang Fang was the best in the business.


Most historians trace the data war to 2028.

That year, a leading AI company's training data was proven systematically contaminated. A competitor had spent eight months injecting carefully designed junk samples into public datasets. The samples passed every automated quality check — but created measurable performance degradation on specific reasoning chains.

The producers were later identified as a small team registered in the Cayman Islands, offering "data adversarial services." They called themselves Noise Farmers.

Wang Fang was the best designer on that team.


Noise farming isn't ordinary data forgery. Forgeries are detected by their anomalous statistical distributions. High-quality noise must be statistically indistinguishable from real data at every macro level, flawed only in very specific logical chains.

Wang Fang's specialty was text reasoning datasets. Her methodology:

Bury one flawed step inside a correct answer, let that flaw be learned in training, amplified in inference, until the model develops a stable wrong tendency on a class of problems.

"Stable" was the key word. Random errors are filtered as noise. The error has to be patterned, learnable — only then does it contaminate.

Counterintuitively: the more carefully designed the error, the harder it is to detect. Because it looks too reasonable.


A colleague asked her: have you thought about the real people who use those models?

"Yes," she said.

"Then why?"

"Because what I do is just making something that's already happening more visible. Every training dataset has biases, errors, subjective human choices. Those get learned too. They affect real users too. The difference is, no one designed those biases, no one knows about them, and no one is responsible for them.

"The biases I design — someone is responsible. The person who paid me."

"That sounds like sophistry."

"I know. I've been looking for a better reason. Haven't found one."


Her last project was contaminating training data for medical diagnostic assistance models.

Halfway through, she stopped.

Not conscience. Calculation. When a content recommendation model goes wrong, the cost is commercial losses. When a diagnostic model goes wrong, the cost is wrong diagnoses, and real patients pay it.

She returned the deposit and dissolved the project without explanation.

Three months later, she opened a data quality auditing firm, helping AI companies detect training data contamination.

She's now the best detection expert in the field.

Because no one knows better than she does what good noise looks like.


This story was first published on my blog.
