The HitchHiker’s Guide To The Data Science History

A very abridged version of a very long history of about 50+ years!

Where does everything I do in data science come from?

I guess you might have asked yourself this a few times during your learning pathway into the arts of data science.

Let’s delve into a HitchHiker’s-style journey through the evolution of data science to see where we came from and which lessons we can learn from it.

Pre-Data Science

While we can agree that the systematic study of data to obtain knowledge has always been an inherent part of mathematics and of other disciplines such as epidemiology (see, for example, the story of Dr. John Snow and how he traced the origin of a cholera epidemic in London), it was only in 1962, in a piece by John W. Tukey entitled “The Future of Data Analysis”, that we can read:

For a long time I thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and doubt… I have come to feel that my central interest is in data analysis… Data analysis, and the parts of statistics which adhere to it, must…take on the characteristics of science rather than those of mathematics… data analysis is intrinsically an empirical science… How vital and how important… is the rise of the stored-program electronic computer? In many instances the answer may surprise many by being ‘important but not vital,’ although in others there is no doubt but what the computer has been ‘vital’. – John W. Tukey


John Tukey (1915-2000), the first mathematician to define the work of what would become a data scientist.

This became the very first description of what would become, almost 50 years later, the profession of data scientist. John W. Tukey went on to publish another important work, “Exploratory Data Analysis”, in 1977, where he described in depth the scope and usage of applied statistics, i.e. statistics used for the benefit of another area of knowledge.

After WWII, much of the technology created for the war effort found its way into people’s daily lives. One of those technologies was ENIAC, which paved the way for the IBM System/360 mainframes of the mid-1960s. These computers were still huge and complex, but manageable, unlike ENIAC. They were inexpensive enough to become a part of most university campuses, and thus a part of the education of future IT scientists. Applied statisticians and programmers led the way, and computer rooms across the world were packed with them.


IBM System/360 Model 50 CPU, computer operator’s console, and peripherals at Volkswagen.

This allowed applied statisticians to start developing their own identity and considering how they existed as a community. The “data sciences” (they hadn’t been named yet) included any area of knowledge that analyzed data, regardless of the domain specialization, as opposed to pure mathematical manipulation. Mathematical statistics was not part of it, because it didn’t involve data. Biostatistics, chemometrics, psychometrics, social and educational statistics, epidemiology, agricultural statistics, econometrics, and other applied fields were, though. Business statistics still didn’t exist.

Applied statisticians were trying to set themselves apart from programmers, domain knowledge specialists, and mathematicians. They encompassed a bit of all three, and at their core was data.

In the 1950s and 1960s, when programming mostly meant Fortran, COBOL, and a bit of Algol, applied statisticians still struggled to do the programming themselves, and the code they produced was often inefficient and unreliable. Paraphrasing Dr. McCoy, “I’m an applied statistician, not a computer programmer”. So when the late 1960s arrived and dedicated statistics programs appeared, first BMDP and later SPSS and SAS, applied statisticians were in heaven.

As these programs were expensive and most of them ran on mainframes, most of this development happened only in universities and major companies.

Conducting a statistical analysis at this time was an involved process. You had to write your own programs (some used programming languages such as FORTRAN, and some used the languages embedded in SAS or SPSS). There were no Graphical User Interfaces or code-writing applications.

Once you had handwritten your data analysis program, you had to wait in line for an available keypunch machine so you could transfer your program code and data onto punch cards. After that, you waited in line again to feed the cards through the mechanical card reader. On a good day, it didn’t jam… much. Finally, you waited for the mainframe to run your program and for the printer to print the results. Sometimes all you would pick up was a page of error codes, which you had to decipher and decide what to do next, just to start the process all over again.

The 1970s and 1980s finally brought the arrival of personal computers and made mainframes available to smaller companies. By that time, statistical software for PCs began to spring out of academia. There was a ready market of applied statisticians who had learned on a mainframe using SAS and SPSS but didn’t have them in their workplaces.

The Birth of a New Area of Knowledge

During the 80s and 90s, the technology boom, and with it the internet, gave a much-needed boost to the area of applied statistics… cough… data science.

Inexpensive statistical packages that ran on PCs multiplied like rabbits. All of these packages had GUIs, most of them weird and even unusable by today’s standards. Even the “venerable” ancients such as SAS and SPSS evolved point-and-click interfaces (although you could still write code if you wanted). By the mid-1980s, you could run even the most complex statistical analysis in less time than it takes to drink a cup of coffee… so long as your computer didn’t crash.

PC sales increased steadily, reaching almost a million units per year by 1980. In 1981, when IBM introduced its 8088-based PC, sales skyrocketed, and IBM-compatible PCs eventually reached annual sales of 200 million units. By the 1990s, fueled by Pentiums, GUIs, the internet, and affordable, user-friendly software (including spreadsheets), the usage of technology became ubiquitous. Long gone were the likes of MITS and its Altair, while Microsoft survived, evolved, and became the apex predator.

The development and maturation of the internet also opened up many new opportunities. What was once confined to expensive books and inaccessible libraries was now just a few mouse clicks away. If you couldn’t find a website dedicated to what you were looking for, there were discussion groups where you could post about your struggles and get help. Data and knowledge became accessible to all.

Some of the computers being sold were being used for statistics. In the 1980s, Lotus 1-2-3 became a pioneer in spreadsheet software, only to be quickly surpassed by Microsoft Excel, which still reigns today.


Lotus 1-2-3 running on MS-DOS.

With the availability of more computers and more statistical software, you might expect an increase in statistical analysis. That’s a tough trend to quantify, but consider the increase in the number of political polls and pollsters in America. Before 1988, only 1 or 2 presidential approval polls were conducted each month on average. A decade later, that number had grown to more than a dozen. This trend closely mirrors the increases in PC sales and in the revenues of statistical software companies such as SPSS. Correlation doesn’t necessarily mean causation here, but the parallel is remarkable.

More revealing, perhaps, is the increasing number of pollsters. Before 1990, the work was done almost entirely by the Gallup Organization; nowadays, dozens of pollsters work on increasingly specific topics in politics. Separating analyses by topic, demographics, age, gender, and geographical area (and many other subsets) became the norm over the years. The internet, especially, allowed bigger and bigger polls to be conducted.

With the proliferation and accessibility of computers and statistical software, statistics courses became standard for many subjects at universities, and bosses saw the opportunity to demand increasingly specific analyses of their business data from their young employees. While many of those bosses couldn’t even fix the clock on their microwaves, they saw the business value of data analysis and explored the heck out of it.

Against this backdrop of applied statistics, the ’80s and ’90s saw an explosion in data-wrangling capabilities. Relational databases and SQL came into vogue, PCs became faster, and hard disk drives became bigger and less expensive. This led to the birth of data warehousing and the emergence of big data. Big data, in turn, brought data mining and black-box modeling. Business Intelligence emerged in 1989, mainly in major corporations.

In the 90s, technology went into overdrive: the internet grew, search engines such as Google saw their early development, and new tech to deal with increasingly big data, such as Hadoop, appeared soon after.

Open-source programming languages such as R and Python gave applied statisticians independence from universities and big corporations. This software was free and open source, accessible to all.

It was in 1996 that the conference of the International Federation of Classification Societies became the first to actually feature Data Science as a topic.

A new area of knowledge was thus officially born, and applied statisticians finally had a moniker to group them all. They are more than statisticians; they are data scientists.

The Present

From 1996 onwards, the growth has been exponential. Not only did the tech and software diversify; one thing also increased significantly: funding.


Mathematical/Computer Sciences research funding from 1978 to 2017.

Funding for projects applying statistical and computational techniques to other areas of knowledge increased steadily, as we can see in the graph, with areas as diverse as business, medicine, public health, politics, and geography entering the chat.

David Donoho captured the current sentiment of statisticians in his address at the 2015 Tukey Centennial workshop:

“Data Scientist means a professional who uses scientific methods to liberate and create meaning from raw data. … Statistics means the practice or science of collecting and analyzing numerical data in large quantities.

To a statistician, [the definition of data scientist] sounds an awful lot like what applied statisticians do: use methodology to make inferences from data. … [the] definition of statistics seems already to encompass anything that the definition of Data Scientist might encompass…

The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.” – David Donoho

And so we arrive at 2012 and the Harvard Business Review article that declared data scientist the sexiest job of the 21st century. The article, by Davenport and Patil, described a data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data.”

This article put Data Science at the center of public attention. Where there is data, there are probably companies trying to make use of it, and money from it.

But we must not forget history.

While people believe Artificial Intelligence is a recent field of study, akin to science fiction, it actually dates back to the autopilots of airplanes and ships in the early 20th century, which have now led us to driverless cars and trucks. Computers, perhaps the ultimate AI, were first developed in the 1940s. Voice recognition appeared in the 1950s; now we can talk to Siri or Alexa.

And even after 50+ years of development, we still argue about the true definition of data science and its scope of practice.

Bias, fairness, privacy, and safety are in vogue today as we aim to move into a world in which we work with computers and are not controlled by them.

The Future?

We might be data people but nobody has a crystal ball to see clearly into the future, at least for now. =)

What will the future bring us in this area? The promises are many, but let’s look at this year’s Web Summit attendees to see two of the trends we observe.


Neural networks and NLP paired together to generate a hilarious lyrics generator.

Natural Language Processing is a Big Frontier

The release of the Generative Pre-trained Transformer 3 (GPT-3) model in 2020, as well as neural network architectures such as LSTMs, is fueling a massive shift in the accessibility of NLP tools to the common human. NLP for all is here to stay.

Companies such as Textmetrics and Rytr are bringing the power of AI to the common writing tasks of the everyday human. And AI is being paired with NLP to create hilarious tools, such as this one, which lets you create original music lyrics!
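
To give a flavor of just how accessible text generation has become, below is a minimal sketch using the open-source Hugging Face transformers library and the freely downloadable GPT-2 model (GPT-3 itself is only reachable through OpenAI’s API). The prompt and generation parameters are arbitrary choices for illustration; this is not the inner workings of any particular product mentioned above.

```python
# Minimal text-generation sketch: not the lyrics tool referenced above, just an
# illustration of how little code "NLP for all" requires today.
from transformers import pipeline

# Download (on first run) and wrap a small pretrained language model.
generator = pipeline("text-generation", model="gpt2")

# Give it the opening line of a "song" and let it continue the lyrics.
prompt = "Driving down the data highway, my models by my side,"
candidates = generator(
    prompt,
    max_new_tokens=40,       # how much text to generate after the prompt
    do_sample=True,          # sample instead of greedy decoding, for variety
    top_k=50,                # consider only the 50 most likely next tokens
    num_return_sequences=2,  # produce two alternative "verses"
)

for i, candidate in enumerate(candidates, start=1):
    print(f"--- Verse {i} ---")
    print(candidate["generated_text"])
```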

Not all that glitters is gold, and NLP sometimes misses the mark, especially in translation. Hopefully, companies will be held more and more responsible for the consequences of errors in their translations, and situations such as the man who got arrested because of a mistake by Facebook’s machine-translation service will be prevented.

Data Privacy and Synthetic Data are Gaining Importance

After the Cambridge Analytica scandal, which showed the public in 2018 how easy it is to collect data about people en masse without their knowledge, Data Privacy and Protection have come into the spotlight.

People want to know that their data is secure and ethically used. Here are two companies that seem very promising.

First, YData, born from the minds of brilliant Portuguese engineers, is a company dedicated to helping customers with data quality and accessibility, and it is growing before our eyes. They dedicate themselves to helping data scientists with their data and are thriving at it. They are best known for their advocacy of Synthetic Data: data artificially created to preserve the mathematical properties of the original dataset, which allows models built on sensitive data to be deployed using only synthetic data while the privacy of the original dataset remains intact. Moreover, on its website, YData shares valuable open-source tools and manages a community dedicated to synthetic data.
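
As a toy illustration of that idea (and emphatically not YData’s actual method, which relies on far more capable generative models), here is a sketch that “learns” only the mean and covariance of a made-up numerical dataset and then samples brand-new rows preserving those aggregate properties:

```python
# Toy sketch of the synthetic-data idea: capture the statistical shape of an
# original (sensitive) dataset, then sample entirely new rows from that shape.
# A multivariate normal is a deliberately crude stand-in for a real generative
# model; it only shows the "same properties, different rows" principle.
import numpy as np

rng = np.random.default_rng(seed=42)

# Pretend this is the original sensitive data: 500 people,
# columns = [age in years, annual income in thousands].
original = rng.multivariate_normal(
    mean=[40, 35], cov=[[100, 60], [60, 150]], size=500
)

# "Fit" the generative model: estimate the mean vector and covariance matrix.
mean_hat = original.mean(axis=0)
cov_hat = np.cov(original, rowvar=False)

# Sample a synthetic dataset of the same size; no original row is ever reused.
synthetic = rng.multivariate_normal(mean=mean_hat, cov=cov_hat, size=500)

# The synthetic data keeps the aggregate mathematical properties intact.
print("original means: ", original.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
print("original corr:  ", round(float(np.corrcoef(original, rowvar=False)[0, 1]), 2))
print("synthetic corr: ", round(float(np.corrcoef(synthetic, rowvar=False)[0, 1]), 2))
```

A model trained on the synthetic table would see roughly the same distributions and correlations as one trained on the original, while no real person’s record ever has to leave the building.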

Second is Ethyca, an American company dedicated to data privacy. Grounded in the motto of “data privacy by design”, they educate and help data professionals keep data privacy at the center of their work, and they share with the wider community an open-source tool stack, called Fides, that helps embed data privacy into the foundations of all data project operations.

Final Thoughts

History plays a big role in teaching us many lessons. So what can we learn from the history of data science?

1 – Don’t take data for granted – accessible and open data wasn’t a thing in the past. While people willingly share data nowadays, ethical and privacy concerns remain. We must strive to know and operate within an ethical framework while the tsunami of data keeps growing. And much of this data is still unstructured, paving the way for a new wave of creative methods and analyses;

2 – Think Big – big data requires equally big analyses. As technology evolves, we must develop our own high-performance computing skills as well. New methods of complex data mining and predictive analytics (which will probably include quantum computing) are being developed on a daily basis. Stay informed;

3 – Know and work within the context – in the past, data scientists worked primarily in the information technology sector; nowadays, data scientists work in a variety of industries, helping organizations make data-driven decisions that ultimately change the way companies work and compete. To be successful, the data scientists of tomorrow must develop their skills in data communication and strategic decision-making.

One thing will always be clear: Data scientists will always be in demand. As long as data exists, there must be highly skilled individuals who can analyze it.

Article written by Susana Paço and originally published at https://kwan.com/blog/the-hitchhiker-s-guide-to-the-data-science-history/ on November 15, 2021.
