Teaching computers to understand human language used to be a tedious and imprecise process. Now, language algorithms analyze oceans of text to teach themselves how language works. The results can be unsettling, such as when the Microsoft bot Tay taught itself to be racist after a single day of exposure to humans on Twitter.
It turns out that data-fueled algorithms are no better than humans—and frequently, they’re worse.
“Data and datasets are not objective; they are creations of human design,” writes data researcher Kate Crawford. When designers miss or ignore the imprint of biased data on their models, the result is what Crawford calls a “signal problem,” where “data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.”
Siri, Google Translate, and job applicant tracking systems all use the same kind of algorithm to talk to humans. Like other machine learning systems, NLPs (short for “natural language processors” or sometimes “natural language programs”) are bits of code that comb through vast troves of human writing and churn out something else: insights, suggestions, even policy recommendations. And like all machine learning applications, an NLP program’s functionality is tied to its training data, that is, the raw information that has informed the machine’s understanding of the reading material.
Skewed data is a very old problem in the social sciences, but machine learning hides its bias under a layer of confusion. Even AI researchers who work with machine learning models––like neural nets, which use weighted variables to approximate the decision-making functions of a human brain––don’t know exactly how bias creeps into their work, let alone how to address it.
As NLP systems creep into every corner of the digital world, from job recruitment software to hate speech detectors to police data, that signal problem grows to fit the size of its real-world container. Every industry that uses machine learning solutions risks contamination. Algorithms given jurisdiction over public services like healthcare frequently exacerbate inequalities, excusing the ancient practice of shifting blame onto the most vulnerable populations for their circumstances in order to redistribute the best services to the least in need; models that try to predict where crime will occur can wind up making racist police practices even worse.
by Abel Ayuba