In the time of this pandemic, we often wonder whether we can predict how long it will take for vaccinations to reach the market. Is it possible to predict the potential COVID-19 cases across the globe on a daily basis?
The answer to all these questions is 'yes'. Data Science has proven invaluable in finding solutions to these problems. So let's understand more about Data Science and its sub-fields through this article, which is a part of the MSP Developer Stories initiative by the Microsoft Student Partners (India) Program (https://studentpartners.microsoft.com).
Data Science is a field of study that is used to extract knowledge from structured or unstructured data by using different processes and algorithms. A Data Science process consists of several steps. These steps are:
- Data Analysis: This step gives insight into the data we are working with. It also helps us choose appropriate algorithms based on the nature of the data and the requirements of the problem.
- Feature Engineering: This process helps create features from the available data. Some techniques used in feature engineering are moving averages and different types of aggregations. A moving average is the average of a value computed over a sliding window of fixed length, which shows how the value changes over time. Aggregations combine the data based on another feature; some examples are sum, average, and count.
- Modeling: A model is an algorithm that learns from the data and provides probabilistic predictions of discrete or continuous values.
- Identifying the problem and setting a project goal: This stage is based on the use case analysis of the problem. Objectives for the project are defined and a goal is identified in this stage.
- Data preparation: The data is split into different features and cross-validation is performed. Dividing the data into subsets in this way helps prevent over-fitting.
- Selecting and training the model: The model is selected based on the kind of output we need from the data. Before training, the data is split into four parts: features, labels, a training set and a testing set. These are discussed later in this article.
- Evaluating the result and model deployment: This consists of tuning the hyper-parameters of the model to improve the results. The model is then evaluated using a confusion matrix, from which the respective precision score and recall score are calculated. If the model gives good accuracy, it is deployed; otherwise another model is selected and trained on the data.
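As a concrete sketch of the moving-average and aggregation techniques mentioned under feature engineering (the store/sales data below is made up purely for illustration):

```python
import pandas as pd

# Hypothetical daily sales for two stores (all names invented)
df = pd.DataFrame({
    "store": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "sales": [10, 20, 30, 40, 5, 15, 25, 35],
})

# Moving average: mean of 'sales' over a sliding window of 2 rows,
# computed separately for each store
df["sales_ma2"] = df.groupby("store")["sales"].transform(
    lambda s: s.rolling(window=2).mean())

# Aggregation: combine rows grouped by another feature ('store')
agg = df.groupby("store")["sales"].agg(["sum", "mean", "count"])
print(agg)
```

The first row of each window has no preceding value, so its moving average is NaN; that is expected behavior for `rolling`.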
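The confusion matrix, precision score, and recall score from the evaluation step can be computed with scikit-learn (assuming it is installed; the labels below are invented for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels vs. a model's predictions (binary problem)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
```

Here 3 positives are predicted correctly, with 1 false positive and 1 false negative, so both precision and recall come out to 0.75.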
This process is often completed by a team of people with different roles such as business analyst, data engineer, data scientist and developer. These people also use some common tools for performing different operations. Some of these common tools are:
- Python and related packages
- Apache Spark
- Azure Databricks
- Source code control (Git, SVN)
To gain in-depth knowledge of this topic, you can also refer to this course:
Machine Learning is a subset of Data Science which can be used to predict the outcomes of the problems mentioned above. So let's discuss Machine Learning in detail.
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. It is also categorized into three subcategories: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
Since this article covers the more technical aspects, you can refer to this link for a theoretical understanding of these topics:
Before moving forward with the technical part, let us discuss the platform where we can build these projects. Of course, it's none other than Microsoft Azure.
Azure Notebooks is a cloud-based platform for building and running Jupyter notebooks. Jupyter is an environment based on IPython that facilitates interactive programming and data analysis using a variety of programming languages, including Python. Azure Notebooks provides Jupyter as a service for free. It's a convenient way to build notebooks and share them with others without having to install and manage a Jupyter server. And it's web-based, making it an ideal solution for collaborating online.
Now let's start with the project. It mainly consists of three basic steps:
curl is a command-line tool that can be used to download the data set from an online source. In a notebook cell, you can run it like this:
!curl https://topcs.blob.core.windows.net/public/FlightData.csv -o flightdata.csv
You can also download the dataset manually and provide the local path to read the CSV file into a DataFrame, like this: df = pd.read_csv("C:/Users/God/Downloads/data.csv")
Pre-processing requires several steps like:
- Removing duplicate values from the data: Duplicate rows or columns can be dropped by using this command: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
- Filling the missing values: This is usually done with df.fillna(df.mean()), where df is the DataFrame's name; the missing values are replaced by the mean of the data. You can also replace them with the median or mode of the data (as per the requirement).
- Converting string entries to numeric values: A model cannot work directly with text, so we have to convert string categories into numbers the machine can understand. There are several methods to do so, such as:
- Get dummies method: Syntax- pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
- Label Encoding:
- One-Hot Encoding:
- Detection and handling of outliers: Outliers are data points that deviate significantly from the other observations. There are several methods to detect these outliers; you can refer to this article for a brief overview of them. Removing outliers is also a very important step in data pre-processing: outliers increase the variability in the data, and because of this the statistical power of the data decreases. Here is a Medium article on how to remove these outliers:
- Scaling and normalizing the data: By scaling the data, we change its range, while normalization is used to change the shape of the data's distribution. The goal of normalization is to bring the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is needed only when features have different ranges.
- Splitting the data into features, labels, training data and testing data: Features are the input variables which are fed into the model, and labels are the output it should produce. Training data is used to train the model, and the performance of the model is evaluated on the testing data.
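A minimal sketch of the duplicate-removal and missing-value steps above, using a made-up DataFrame:

```python
import pandas as pd
import numpy as np

# Invented flight-like data with one duplicate row and one missing value
df = pd.DataFrame({
    "delay":    [10.0, np.nan, 10.0, 30.0],
    "distance": [500, 500, 500, 800],
})

# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep="first")

# Replace remaining missing values with the column mean
df = df.fillna(df.mean())
print(df)
```

After dropping the duplicate, the remaining delay values are 10 and 30, so the missing entry is filled with their mean, 20.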
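The get dummies, Label Encoding, and One-Hot Encoding methods listed above can be sketched as follows (the airline column is hypothetical, and scikit-learn is assumed for the encoders):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({"airline": ["Delta", "United", "Delta", "JetBlue"]})

# get_dummies: one indicator column per category
dummies = pd.get_dummies(df["airline"], prefix="airline")

# Label Encoding: each category is mapped to an integer (sorted order)
le = LabelEncoder()
df["airline_label"] = le.fit_transform(df["airline"])

# One-Hot Encoding: the scikit-learn equivalent of get_dummies
ohe = OneHotEncoder()
onehot = ohe.fit_transform(df[["airline"]]).toarray()
print(df["airline_label"].tolist())
```

Note that label encoding imposes an artificial ordering on the categories, which is why one-hot encoding is usually preferred for nominal features.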
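The scaling and splitting steps can be sketched together; the flight-like columns below are invented and scikit-learn is assumed:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Hypothetical flight data: two features with very different ranges
df = pd.DataFrame({
    "distance": [500, 800, 300, 1200, 700, 450],   # miles
    "dep_hour": [9, 17, 6, 21, 12, 8],             # hour of day
    "delayed":  [0, 1, 0, 1, 0, 0],                # label
})

features = df[["distance", "dep_hour"]]  # model inputs
labels = df["delayed"]                   # model output

# Scaling: squeeze each feature into the [0, 1] range
scaled = MinMaxScaler().fit_transform(features)

# Hold out about a third of the rows as a testing set
X_train, X_test, y_train, y_test = train_test_split(
    scaled, labels, test_size=0.33, random_state=42)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, which helps when comparing models against each other.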
A Machine-Learning model is built depending upon the type of the data.
- If the data has a continuous output then, the supervised machine learning models which can perform regression are used.
- If the data has discrete outputs then, supervised machine learning models which can perform classification are used.
- If the output is not specified then, unsupervised learning algorithms are used.
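Assuming scikit-learn is the library in use, the three cases above map onto different estimator families, for example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# A tiny invented dataset with a single feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Continuous output -> supervised regression
reg = LinearRegression().fit(X, [2.0, 4.0, 6.0, 8.0])

# Discrete output -> supervised classification
clf = LogisticRegression().fit(X, [0, 0, 1, 1])

# No specified output -> unsupervised learning (here, clustering)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(reg.predict([[5.0]]))  # continues the y = 2x trend
```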
Data visualization is the representation of data or information in a graph, chart, or other visual format. It communicates relationships of the data with images. This is important because it allows trends and patterns to be more easily seen. With the rise of big data upon us, we need to be able to interpret increasingly larger batches of data. Machine learning makes it easier to conduct analyses such as predictive analysis, which can then serve as helpful visualizations to present. But data visualization is not only important for data scientists and data analysts, it is necessary to understand data visualization in any career. Whether you work in finance, marketing, tech, design, or anything else, you need to visualize data.
The model is trained on the training data. After learning from that data, the model is ready to predict outputs, so it is fed the testing data. The output predicted for the testing data is compared with the original output to find the accuracy of the model, and the model that gives the best accuracy is chosen.
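This train-predict-compare loop can be sketched with scikit-learn on a synthetic dataset (a stand-in for illustration, not the flight data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train on the training data, then predict on unseen testing data
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy: fraction of predictions that match the original output
print(accuracy_score(y_test, predictions))
```

Repeating this with several candidate models and comparing their accuracy scores is how the best model is chosen.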
In this way, Machine Learning models are able to predict the outcomes for any kind of data that is fed in the machine.
For practice, you can complete this learning path:
For a detailed description of this learning path, you can refer to my session: