Book: How To Lie with Statistics - Bias

#books #career #datascience #analytics

The book How To Lie with Statistics, by Darrell Huff, offers great insight into how to analyze data. After reading it, you will find yourself asking good questions about any claim or statement you hear from friends, the news, articles, and many other sources.

"A great introduction to the use of statistics, and a great refresher for anyone who's already well versed in it." (Bill Gates)

The book contains 10 chapters, and I will write an article for each one of them so that the richness of the book's content is not lost.

The book chapters are:

The Sample with Built-in Bias
The Well-Chosen Average
The Little Figures That Are Not There
Much Ado about Practically Nothing
The Gee-Whiz Graph
The One-Dimensional Picture
The Semiattached Figure
Post Hoc Rides Again
How to Statisticulate
How to Talk Back to a Statistic

Below is the article for the first chapter.

1. The Sample with Built-in Bias

The book begins with a comment that Time magazine once made on something in the New York Sun: the AVERAGE Yale man, class of '24, "makes $25,111 a year".

What exactly does it mean?

Does it mean that if you send your boy to Yale, you won't have to work in your old age, and neither will he?
The number is also suspiciously precise.

This kind of information is usually taken from surveys. Some people may say they earn more than they do out of vanity, some may say they earn less for fear of paying more taxes, and some may not even know exactly how much they make every year, since they could have investments and other types of income.

The people who answered the survey are only a sample of all students from Yale's class of '24, and these tendencies in their answers are a source that can produce $25,111 as the "average income" of some men whose actual average may well be nearer half of that amount.

It is a sad truth that conclusions from such samples, biased or too small or both, lie behind much of what we read or think we know.

How can we be so sure it was taken from a sample?

The magazine's comment about the "class of '24" was published about 25 years later. Many of the addresses (or contacts) of the class members could no longer be reached by then, and even those whose addresses were known might not have responded.

So we find that the income figure is based on a sample composed of all class members whose addresses are known and who replied to the questionnaire.

Who are the lost sheep listed in the Yale rolls as "address unknown"? Are they big-income earners? Probably not, since the addresses of the rich are usually easy to find. It is a good guess that the lost names are those of the men who, 25 years or so after becoming Yale Bachelors of Arts, have not fulfilled any shining promise and ended up in low-paid jobs.
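The effect of the lost addresses and unanswered questionnaires can be sketched with a small simulation. Everything in it is an assumption made for illustration, not data from the book: the incomes are invented, and the hypothetical function reply_probability simply encodes the guess that better-paid graduates are easier to find and more willing to answer.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical class of 1,000 graduates with invented, right-skewed incomes.
incomes = [rng.lognormvariate(9.2, 0.8) for _ in range(1000)]


def reply_probability(income):
    """Assumed behaviour: the higher the income, the likelier a reply.

    The 0.1 floor, 0.9 cap and 10,000 pivot are arbitrary illustration values.
    """
    return min(0.9, 0.1 + 0.4 * income / 10_000)


# Keep only the graduates who are both reachable and willing to answer.
respondents = [x for x in incomes if rng.random() < reply_probability(x)]

true_mean = statistics.mean(incomes)
survey_mean = statistics.mean(respondents)

print(f"graduates: {len(incomes)}, respondents: {len(respondents)}")
print(f"true mean income:   {true_mean:,.0f}")
print(f"survey mean income: {survey_mean:,.0f}")
```

Under these assumptions the survey mean comes out above the true mean, because the low earners are under-represented among those who reply. Nobody has to lie for the "average" to be inflated.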

The Psychiatrist

A psychiatrist reported once that practically everybody is neurotic.

Who has the psychiatrist been observing? It turns out that he has reached this edifying conclusion from studying his patients, who are a long way from being a sample of the population. If a man were normal, our psychiatrist would never meet him.

The Literary Digest's famed fiasco (Elections 1936)

The ten million telephone and Digest subscribers who assured the editors of the doomed magazine that it would be Landon 370, Roosevelt 161 came from the list that had accurately predicted the 1932 election. If the list had been tested before, how could there be a bias in it?

There was. People who could afford telephones and magazine subscriptions in 1936 were not a cross-section of voters.
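A rough sketch of how such a list can mislead, even with a huge number of answers. All the probabilities below are invented for illustration and are not the real 1936 figures: the sketch only assumes that better-off voters were more likely both to be on the list and to prefer Landon.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is reproducible


def make_voter(rng):
    """One hypothetical voter; every probability here is an invented assumption."""
    well_off = rng.random() < 0.30
    # Being on the list stands for owning a telephone or subscribing to a magazine.
    on_list = rng.random() < (0.80 if well_off else 0.10)
    votes_landon = rng.random() < (0.70 if well_off else 0.30)
    return on_list, votes_landon


def landon_share(group):
    """Fraction of the given (on_list, votes_landon) pairs that vote Landon."""
    return sum(votes_landon for _, votes_landon in group) / len(group)


voters = [make_voter(rng) for _ in range(100_000)]
polled = [voter for voter in voters if voter[0]]

electorate_share = landon_share(voters)
poll_share = landon_share(polled)

print(f"Landon share among all voters:    {electorate_share:.1%}")
print(f"Landon share among polled voters: {poll_share:.1%}")
```

With these made-up numbers the poll shows a Landon majority while the full electorate does not. Asking more people from the same list would not help, because the bias sits in who is on the list, not in how many of them answer.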

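The basic sample is the kind called "random": it is selected by pure chance from the whole population. Even so, a random draw, especially a small one, could by chance take all of its answers from the same city or neighbourhood.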
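How likely is it that a purely random draw lands entirely in one neighbourhood? The sketch below estimates this for a hypothetical town of five equally sized neighbourhoods. The town, the sample sizes and the helper share_single_hood are assumptions chosen for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical town: 5 neighbourhoods (labelled 0 to 4) with 200 residents each.
residents = [hood for hood in range(5) for _ in range(200)]


def share_single_hood(sample_size, trials=10_000):
    """Estimate how often a simple random sample comes from one neighbourhood only."""
    hits = 0
    for _ in range(trials):
        sample = rng.sample(residents, sample_size)  # draw without replacement
        if len(set(sample)) == 1:
            hits += 1
    return hits / trials


tiny = share_single_hood(2)
larger = share_single_hood(20)

print(f"samples of 2 from a single neighbourhood:  {tiny:.1%}")
print(f"samples of 20 from a single neighbourhood: {larger:.1%}")
```

In this setup a sample of two people falls in a single neighbourhood roughly one time in five, while a sample of twenty practically never does. Chance alone can produce a lopsided sample, but the risk shrinks quickly as a truly random sample grows.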

As we have seen, a sample's tendency to be biased in one consistent direction can rig the result automatically.