Data science is a convoluted term that rose to popularity alongside advances in machine learning frameworks. Some argue it is just a name for an overpaid statistician; others imagine a role that software engineers grow into; still others see a businessperson with technical skills. I've seen heated forum threads debating the ethos of the term. I find Field Cady's definition of data science to hit the spot where this tension can be settled.
> Data science means doing analytics work that, for one reason or another, requires a substantial amount of software engineering skills.
At the heart of this practice lies a set of skills that makes one a data scientist.
There are three hats a data scientist wears: statistician, engineer, and businessperson.
In the guise of a Statistician. Whereas a statistician is trained in the standard analytical methods, a data scientist extends that knowledge, applying statistical techniques to build the tools needed to model a computational system. Data scientists make sure they cover the edge cases where no known tool has yet proven useful, bringing a proficiency in computer programming that a traditional statistician might not have. A data scientist does not need ready-made statistical tools to get their work done.
In the guise of an Engineer. A data scientist has to be technically savvy. They have to know how to transform their data into something meaningful without relying too much on ready-made software. Covering the edge cases means exploring uncharted territory riddled with irregularities, and a data scientist must cope with those irregularities. One way to adapt is to develop the software skills to handcraft your own tools. Note that turning your data into something finally useful for your model is one of the most crucial and tedious processes in data science (and machine learning).
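To make that concrete, here is a minimal, hypothetical sketch of hand-rolled data cleaning in plain Python: parsing a raw column of strings and imputing missing or junk entries with the column median. The function name and the imputation choice are mine for illustration; real pipelines (pandas, scikit-learn) offer far richer options.

```python
from statistics import median

def clean_column(values):
    """Convert raw strings to floats, imputing missing entries with the median.

    A toy sketch of handcrafted cleaning: blanks and unparseable junk
    are marked as missing, then filled with a simple summary statistic.
    """
    parsed = []
    for v in values:
        try:
            parsed.append(float(v))
        except (TypeError, ValueError):
            parsed.append(None)  # mark blanks and junk as missing
    observed = [v for v in parsed if v is not None]
    fill = median(observed)      # one imputation choice among many
    return [fill if v is None else v for v in parsed]

# e.g. clean_column(["1.0", "", "3.0", "oops"]) -> [1.0, 2.0, 3.0, 2.0]
```

Even a tiny helper like this shows the engineering hat at work: nothing here needs a framework, just the skill to shape messy input into model-ready numbers.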
This is also a good point to motivate anyone to hone their software skills. Not only do you grow beyond the ready-made services a piece of software provides, you also gain the freedom to go wherever you want to go. Unchained.
In the guise of a Businessperson. Business people are oriented toward the value of a product; their instinct is to follow whatever is essential to the business problem. Field Cady (2017) noted that what separates a good data science professional from a mediocre one is not their math or engineering skills: it is formulating the right question. The questions we aim to solve roughly estimate the value of our product. Business people are also good at communicating with other people. Likewise, a data scientist must tell their most interesting insights in the form of a story, one that does not limit the discussion to a room of people who know data science.
> No amount of technical competence or statistical rigor can make up for having solved a useless problem.
Now that we've clarified the terms, let's lay out a map for navigating our workflow as data explorers in the wild. I adapted this from François Chollet's introduction to deep learning and Field Cady's handbook for data scientists.
Define the problem. List the questions you want to ask; they will be your guide to the kinds of data you require. This is usually done in groups. This step is critical because it sets your objective, and you need to be very clear about it. Having this discussion as a team keeps everyone on the same page, working toward the same set of objectives.
Gather relevant data sets. Check for blanks in your dataset. What does joining two datasets mean? Do they agree on one thing? This step is also critical because it affects the quality of your output: you feed data in, and you get information out. If you feed your model a bunch of garbage (pathological patterns, too many outliers, etc.), you get garbage out. As a notable paper from Microsoft researchers Michele Banko and Eric Brill (2001), and later Halevy, Norvig, and Pereira (2009), showed: data matters more than algorithms.
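As a toy illustration of what "joining two datasets" can mean, here is a hypothetical sketch of an inner join in plain Python: a row survives only when both sides agree on the key, which is exactly the kind of agreement worth checking before you trust a merged dataset. The record names here are made up for the example.

```python
def inner_join(left, right, key):
    """Join two lists of records on `key`, keeping rows present in both.

    A minimal sketch of an inner join; real tools (SQL, pandas.merge)
    also surface unmatched rows, duplicates, and type mismatches.
    """
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})  # merge the two records
    return joined

users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]
orders = [{"id": 1, "total": 9.5}]
result = inner_join(users, orders, "id")  # only id 1 appears in both
```

Notice that Bo silently disappears from the result: blanks and unmatched keys are exactly the questions this step asks you to answer before modeling.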
Hack and analyze. This is where it gets bright and beautiful! You play with a bunch of knobs and twist them together to see how one affects the other -- that is the famous neural network architecture, the foundation of deep learning. On the other end sits a set of more rigorous algorithms from statistics and computer science -- clustering, classification, our good old regression, and dimensionality reduction -- which comprise the shallow end of machine learning.
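The "good old regression" on the shallow end can even be sketched from scratch in a few lines. This is a minimal ordinary-least-squares fit of a line y = a*x + b, for intuition only, not a substitute for a library like scikit-learn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: the good old regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y over the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x  # intercept passes through the means
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
```

The same two-line summary (a slope and an intercept) is what the fancier shallow algorithms generalize: a compact model fitted to data, rather than knobs twisted until something works.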
Communicate your results. At the end of it all, tailor your results into a coherent story. Make it compelling! That includes your intentions and the results of your analysis, put together in a format anyone can understand. We are no longer talking only to the people in the room we work with; we are talking to everyone. Make it visual.
- Cady, F. (2017). The Data Science Handbook. John Wiley & Sons.
- Chollet, F. (2018). Deep Learning with Python. New York: Manning.
- Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media.
- Banko, M., & Brill, E. (2001, July). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting of the Association for Computational Linguistics (pp. 26-33).
- Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8-12.
A note from the author:
I do not consider myself a data scientist. I just have fun with machine learning engineering as part of my software projects. The ideas I gathered here are not entirely my own. For the interested reader, I have referenced above what I would recommend for you to get started. Cheers!