I dare suggest that Data Science is one of the "sexiest" careers in the twenty-first century.
Breaking into the field of Data Science, especially one as sophisticated and multidimensional as this one, is not simple. We're in an odd era where even the concept (and expectations) of Data Science differ from one organization to the next. What data scientists perform, what they need to know, and the types of firms that need to hire data scientists are all changing rapidly.
Why on earth then would there be only one path to take in the first place? I can't think of a better time to start if you're looking to break into data science. Start now!
In this article, I am going to explain how one can best get started in Data Science from a personal perspective.
After reading this article, you should:
- Be encouraged and able to kick-start your journey into data science.
- Know the best way to start that journey.
Peter Sondergaard, the Senior Vice President and Global Head of Research at Gartner, Inc., once said that *information is the oil of the 21st century, and analytics is the combustion engine*.
This is just a sliver of how lucrative the field of data is.
Choosing to be a data scientist is a great choice, not just because of the money but because you have taken a step to care for the world.
Data science involves turning sad stories into grinning ones, stories we transform for the better with data.
So choose to care and love the planet. It's all about caring for people, not just users.
Data Science is a universally recognized term that escapes any single complete definition. It is a volatile field whose methods and goals evolve with every technological advancement. The definition of data science from 30 years ago is not the same as today's.
Since no single definition fits, let us define the key processes in data science and how they fit into each other, acting as the building blocks of the bigger data science picture.
That way I believe one will be able to visualize, understand and conceptualize what data science truly is.
This also helps if you want to be a competitive job applicant: you will need to understand how various data science activities fit into the big picture, as stated earlier. You will learn about the timing of different data processing analyses, who carries them out, and, let's not forget, how. Does that make sense?
If you want to work in medicine, you will first learn how the human body functions and then decide whether you want to be a pediatrician, a nurse, an oncologist, etc. That is what we are about to do here, but for data science.
Okay, so let's get started by talking about data, because before there was anything, there was data.
Data is the foundation of Data Science. Therefore, we need to have a clear understanding of what data is.
In the context of data science, there are two types of data:
- Traditional Data
Traditional data is data that is structured and stored in databases that can be managed from one computer. It is in table format containing numeric and text values.
Traditional data may come from sources like the basic customer records of a retail store or the historical price of crude oil in Middle East oil-producing countries.
- Big Data
Big Data, on the other hand, is bigger than traditional data, and not in the trivial sense. It isn't represented simply by numbers or text but also by images, audio, mobile data, and so on. In addition, big data has high velocity, meaning it is retrieved and computed in real time.
And finally, think about its volume: big data is measured in tera-, peta-, and exabytes and is hence often distributed across a network of computers.
Big data is all around us. A consistently growing number of companies and industries generate and use big data. Consider online communities like TikTok, Facebook, and LinkedIn: they generate a massive amount of user data. Right now, digital data in the world amounts to 3.2 zettabytes.
Having known what data is and its different forms, now imagine the following scenario.
You are already a professional data scientist working for a private airline company. A senior member of staff tells you one of the two things below. What is the difference between the two?
We need to consider client satisfaction in the next quarter, so we can predict the churn rate. Oversee the process and come up with some numbers.
We have an enormous amount of customer data from the previous quarter. Can you oversee the analysis and deliver an approximation of churn rates for the next quarter?
As you may have noticed, the difference between the two is that, unlike in the second case, in the first case you do not have the data. You will need to gather it, probably through surveys and so on.
So you have conducted the survey and received responses. Is this data ready to be analyzed? Not exactly.
This is called raw data since you haven't done any processing on it. It is untouched data that cannot be analyzed straight away.
This takes us to the next point which is Preprocessing of data.
Preprocessing is what we can think of as preliminary data science.
Preprocessing is a crucial group of operations that converts raw data into a format that is more understandable and hence, useful for further processing. Plus it fixes the mistakes that occurred during the gathering phase.
For example, when working with customer data, it is not unrealistic to find a person registered as "KNO45P" years old, called "Ukraine", with flight number "Isabel Valentine" and "32" as her country.
Those data entries are incorrect and therefore must be handled before proceeding to any type of analysis, right?
That is why there are tons of preprocessing practices in place. I will tell you about some of the common ones.
i) Class Labelling
The first is class-labeling your observations. This consists of arranging data by category or labeling data points with the correct data type, for example numerical or categorical.
The number of passengers on a day's flight would be numerical; you can manipulate this information mathematically. A passenger's occupation and country of origin are categorical, because no mathematical operations can be done on this information.
Just keep in mind that with big data the classes are extremely varied, therefore instead of ‘numerical’ vs ‘categorical’ the labels will be ‘text’, ‘digital image data’, ‘digital video data’, ‘digital audio data’...and so on.
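To make this concrete, here is a minimal sketch of class labeling with pandas. The column names and values are hypothetical, invented just for illustration; the point is that the numerical column keeps an integer dtype while the text columns are labeled as categories.

```python
import pandas as pd

# Hypothetical airline records: passenger counts are numerical,
# occupation and country are categorical.
df = pd.DataFrame({
    "passengers": [182, 176, 199],
    "occupation": ["teacher", "engineer", "teacher"],
    "country": ["Kenya", "Ukraine", "Kenya"],
})

# Label the categorical columns explicitly so downstream tools
# treat them as categories rather than free text.
df["occupation"] = df["occupation"].astype("category")
df["country"] = df["country"].astype("category")
```

Once labeled this way, pandas will stop you from accidentally, say, averaging occupations, which is exactly the distinction class labeling is meant to enforce.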
ii) Data Cleansing or Scrubbing
These are techniques for dealing with inconsistent data, like misspelled categories and missing values. You know, a lot of people share their name and occupation but omit their age or gender.
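A quick sketch of what cleansing can look like in pandas, using made-up survey rows with a missing age and a misspelled gender category (both values here are assumptions for the example):

```python
import pandas as pd

# Hypothetical survey responses: one missing age, one misspelled category.
df = pd.DataFrame({
    "name": ["Amina", "Brian", "Carla"],
    "age": [34, None, 29],
    "gender": ["female", "mael", "female"],
})

# Fix a known misspelling, then fill the missing age with the column median.
df["gender"] = df["gender"].replace({"mael": "male"})
df["age"] = df["age"].fillna(df["age"].median())
```

Whether you fill missing values, drop them, or flag them is a judgment call that depends on the analysis; the median fill above is just one common option.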
iii) Data Shuffling
Data shuffling is another interesting one! Imagine shuffling a deck of cards.
It ensures that your dataset is free from unwanted patterns caused by problematic data collection.
For example, the first 105 observations in your data might be from the first 105 passengers who boarded the first flight of the day. This data isn't randomized and is likely to reflect just the behavior of those 105 passengers when the airline had just been rolled out.
In a word, data shuffling prevents patterns due to sampling from emerging.
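In pandas, shuffling a dataset is a one-liner. The passenger IDs below are invented for the example; the fixed `random_state` is only there to make the shuffle reproducible:

```python
import pandas as pd

# Hypothetical passenger log, ordered by boarding time.
df = pd.DataFrame({"passenger_id": range(1, 11)})

# Shuffle the rows; a fixed random_state keeps the result reproducible.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
```

Note that shuffling only reorders rows; every observation is still present exactly once, just like a shuffled deck of cards.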
iv) Data Masking
Finally, consider data masking. This is primarily a big-data-specific technique. When collecting data on a mass scale, you can accidentally get your hands on a lot of sensitive information, which you need to keep hidden, even from yourself.
Masking aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight.
Essentially the process involves concealing the original data with random and false data, allowing the scientist to conduct their analyses without compromising private details.
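One simple masking approach, sketched below with invented customer rows, is to replace each sensitive value with an irreversible pseudonym (here a truncated SHA-256 hash) while leaving the analytical columns untouched:

```python
import hashlib

import pandas as pd

# Hypothetical customer data with sensitive names.
df = pd.DataFrame({"name": ["Amina", "Brian"], "spend": [120, 85]})

def mask(value: str) -> str:
    # Replace the value with a short, irreversible pseudonym.
    return hashlib.sha256(value.encode()).hexdigest()[:10]

df["name"] = df["name"].apply(mask)
```

The spend figures can still be aggregated and analyzed, but the real names are gone, which is the whole point of masking.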
Let's not forget that all of this is just the very beginning of doing data science. Preprocessing data to make it usable is laying the groundwork.
Alright, let's assume your databases are clean and organized at this point, so let's get into the real deal now.
Before we begin, I want to make sure that we are all moving together.
There are two ways of looking at data:
One is with the intent to explain behavior that has already happened, and you have gathered data for it.
The second way is to use data that you already have to predict future behavior that has not yet happened.
One needs to be very clear on this distinction because it can be what tilts the scales one way or another when you are deliberating which data science path is best for you.
There is also a temporal relationship between the two ways of looking at data. Before data science jumps into predictive analytics, it must look at the patterns of behavior the past provides. It must analyze them to draw insights which will then inform the direction in which forecasting should go.
This brings us to the next part which is Exploratory data analysis
At this stage, the data scientist, having collected the data and ensured it is clean, takes it through three fundamental operations.
First, extract meaningful metrics from the data set.
For example in our airline case, the data scientist would extract the average quarterly revenue per new customer.
Second, identify the Key Performance Indicators, that is only those metrics that will clearly show how the business is doing.
Third, analyze the data to extract insights from it.
So why is the exploratory data analysis stage important, and why is it the stepping stone of data science?
Well, consider this: the airline company we are working for is running a marketing campaign, and you have received the data. You examine it and identify one of the metrics: it indicates all the traffic to a page on the company's website.
Then you think about what a KPI could be in this case, and you realize that a KPI would show the volume of the traffic to the same page, but only if generated from users who have clicked on a link in your ad campaign to get there.
This way you can check if the ads you are positioning are working and driving customers to click on a link, in turn, this would determine whether you should continue to spend on ads or not.
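The metric-versus-KPI distinction above can be sketched in a few lines of pandas. The visit log and its `from_ad` flag are assumptions invented for this example:

```python
import pandas as pd

# Hypothetical page-visit log: one row per visit, flagging whether
# the visitor arrived through the ad campaign.
visits = pd.DataFrame({
    "visitor_id": [1, 2, 3, 4, 5],
    "from_ad": [True, False, True, True, False],
})

total_traffic = len(visits)           # the metric: all traffic to the page
ad_traffic = visits["from_ad"].sum()  # the KPI: ad-driven traffic only
ad_share = ad_traffic / total_traffic
```

The metric tells you the page is busy; the KPI tells you whether the ads are the reason, which is what the spending decision actually hinges on.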
Of course, this is not where a data scientist's responsibilities conclude.
Data Science is about telling a story. I repeat: Data Science is about telling a story. And crunching the numbers is just the introduction to the story.
So apart from handling strictly numerical information, Data Science and specifically Exploratory data analysis are about visualizing the findings and creating easily digestible images supported by the most relevant numbers.
After all, all levels of management should be able to understand the insights from the data and use them to inform their decision-making.
And this is in the hands of the data scientist. Data scientists create dashboards and reports, accompanied by graphs, diagrams, maps, and other comparable visualizations, to present the findings most relevant to the current objectives.
A real-life example could be as follows. Let's say you are a hotel manager: would you keep room prices constant all year round? Probably not, since you want to attract visitors when the tourist season is not in bloom and to capitalize on it when it is. And how would you inform your strategic decision to lower or increase room prices? A data scientist would perform the above-mentioned processes and come up with the best strategy.
Once all the work above has been done, the information can become the basis for predicting future values. This takes us to the next part, which is predictive analysis.
This is where it becomes truly awesome: here one can make forecasts and predictions.
The accuracy of your forecasts though will differ based on the methods and techniques you decide to apply.
And this is where the more popular Data Science concepts come into play. Examples of such techniques are Neural Networks, Deep Learning, Time series, and Random Forests.
But just as there is a distinction between traditional and big data, there is also a distinction between traditional methods in predictive analytics and Machine Learning.
Traditional data invites traditional analytics like Linear Regression, Cluster Analysis, and Factor Analysis, just to mention a few.
So what statistical knowledge do you need for traditional analytics in Data Science?
Most often, data science employs one of the five analyses mentioned below:
a) Linear Regression
This method is used for quantifying causal relationships among the different variables included in the analysis. You would use it if you needed to assess the relationship between, for example, house prices, the size of a house, and the year it was built.
The model calculates the coefficients with which you can predict the price of a new house if you have the rest of the information available.
A straight line governs the relationship between the size and the price.
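Here is a minimal sketch of that idea with numpy. The house sizes and prices are made-up numbers chosen to lie on a line, so the fit and the prediction are easy to eyeball:

```python
import numpy as np

# Hypothetical training data: house size in square metres vs price.
sizes = np.array([50, 80, 110, 140], dtype=float)
prices = np.array([150_000, 240_000, 330_000, 420_000], dtype=float)

# Fit price = slope * size + intercept (ordinary least squares).
slope, intercept = np.polyfit(sizes, prices, deg=1)

# Predict the price of a new 100 m^2 house.
predicted = slope * 100 + intercept
```

Real housing data is noisy and multivariate, so in practice you would fit on several features at once, but the coefficient-then-predict workflow is exactly this.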
b) Cluster Analysis
This exploratory data science technique is applied when the observations in the data form groups according to shared criteria.
It takes into account that some observations show similarities, and it facilitates the discovery of new significant predictors, ones that were not part of the original conceptualization of the data.
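A common clustering algorithm is k-means; below is a bare-bones sketch of it in numpy on invented one-dimensional spending data with two obvious groups. Production code would use a library implementation, handle empty clusters, and pick starting centroids more carefully:

```python
import numpy as np

# Hypothetical 1-D customer spending data with two obvious groups.
spend = np.array([10.0, 12.0, 11.0, 95.0, 102.0, 99.0])

# Minimal k-means: assign each point to its nearest centroid,
# recompute the centroids, and repeat.
centroids = np.array([0.0, 50.0])
for _ in range(10):
    labels = np.abs(spend[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([spend[labels == k].mean() for k in range(2)])
```

The resulting labels split the customers into low spenders and high spenders, two groups that were never explicitly recorded in the data, which is the kind of new predictor clustering can surface.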
c) Factor Analysis
If the cluster analysis is about grouping observations together this analysis is about grouping features together.
Data Scientists resort to using it to reduce the dimensionality of a problem.
For example, suppose you have a questionnaire of 60 questions and every 10 questions are trying to determine a single general attitude. This analysis will identify the 6 underlying factors.
Once a factor analysis identifies some factors they can be used for a regression that will deliver a more interpretable prediction. A lot of other Data Science techniques are integrated like this.
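Factor analysis proper is usually done with a specialized library, but the core idea of collapsing many correlated features into a few underlying dimensions can be sketched with the closely related PCA, here via numpy's SVD. The 5-respondent, 4-question answer matrix is invented for the example:

```python
import numpy as np

# Hypothetical questionnaire: 5 respondents x 4 questions, where the
# first two questions move together and so do the last two.
answers = np.array([
    [5, 4, 1, 2],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
    [2, 1, 4, 5],
    [3, 3, 3, 3],
], dtype=float)

# Center the data, then decompose it with SVD.
centered = answers - answers.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top 2 components: 4 correlated questions -> 2 dimensions.
reduced = centered @ Vt[:2].T
```

The reduced matrix has one score per respondent per component, and those component scores are the kind of compact inputs that feed nicely into a subsequent regression.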
d) Time Series Analysis
This is a popular method for following the development of specific values over time.
It is widely used in economics and finance because their subject matter, stock prices and sales volumes, consists of variables that are typically plotted against time.
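A first step in most time series work is smoothing the series to see the trend. Below is a sketch with pandas using made-up monthly sales figures; the 3-month rolling mean is one simple smoothing choice among many:

```python
import pandas as pd

# Hypothetical monthly sales figures indexed by time.
sales = pd.Series(
    [100, 120, 130, 150, 170, 160],
    index=pd.date_range("2023-01-01", periods=6, freq="MS"),
)

# A 3-month rolling mean smooths short-term noise and exposes the trend.
trend = sales.rolling(window=3).mean()
```

The first two entries of the trend are NaN because a 3-month window needs three observations before it can produce a value.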
e) Logistic Regression
Since not all relationships between variables can be expressed as linear, data science makes use of methods such as logistic regression to create non-linear models.
Logistic regression operates with 0s and 1s.
For Instance, think about the process of hiring new staff. Companies apply logistic regression algorithms to filter job candidates during the screening process.
If the algorithm estimates that the probability a prospective candidate will perform well in the company within the year is above 50%, it will return 1, a successful application. Otherwise, it will return 0, and the candidate won't be called in for an interview.
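The decision step described above can be sketched in a few lines. The `score` here stands in for the model's weighted sum of a candidate's features (a fitted model would learn those weights from data; this sketch just assumes a score is given):

```python
import math

def sigmoid(z: float) -> float:
    # Squash any real number into a probability between 0 and 1.
    return 1 / (1 + math.exp(-z))

def screen_candidate(score: float) -> int:
    # score: hypothetical weighted sum of the candidate's features.
    # Threshold the probability at 50% to return 1 (interview) or 0 (reject).
    return 1 if sigmoid(score) > 0.5 else 0
```

A strongly positive score maps to a probability above 50% and returns 1; a negative score maps below 50% and returns 0, exactly the screening behavior described above.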
The above is at the core of the traditional methods for predictive analytics in Data Science.
Machine learning, compared to traditional methods of predictive analytics, is far better equipped to handle big data.
As you can imagine, machine learning stands on the shoulders of classical statistical forecasting.
People in the data science industry refer to some of these methods as machine learning too, but when I talk about machine learning I am referring to newer, smarter, better methods like Deep Learning.
Don't worry I am still going to tackle this subject and will explain everything clearly.
Thank you very much for taking time to go through this article.
I will be publishing even more content around Machine Learning, so be sure to come around.
Please feel free to drop your comments in the comment section.