This article is written as part of the MSP Developer Stories initiative by the Microsoft Student Partners (India) program (https://studentpartners.microsoft.com/).
Artificial Intelligence and Machine Learning are two of the most-heard buzzwords in today’s tech world. People often take them to be two different things, but on closer study we find that one is a subset of the other: Artificial Intelligence comprises Machine Learning, Deep Learning and Natural Language Processing, to name a few.
‘Intelligence’ is a very commonly used term nowadays. What is intelligence? It is the power to see, think and act, which includes the decision-making we do in our day-to-day life. Artificial Intelligence is simply recreating this ability artificially. But why artificial? Take an example: humans can do many calculations, but we are prone to errors, complex calculations take a huge amount of time to carry out manually, and we tire after a while. If the same intelligence is programmed into a machine, it can carry out vast numbers of calculations efficiently, accurately and in far less time. So, essentially, we give machines the power to analyse and make decisions so that intelligence can be put to its best use. The use cases of Artificial Intelligence are enormous, be it facial recognition on our cell phones or social media suggesting tags for pictures. Some even believe that one day AI will overtake human capability entirely.
How do we make these machines intelligent? What is the powerhouse? The answer to both questions is “DATA”. Yes, machines learn from the data we provide, which may be in any form: text, images, videos or anything else. The data is analysed with the help of “algorithms”, and we get the results. Unlike regular programming, where we write explicit rules that turn an input into the expected output, here we feed in data and the machine learns to produce results. Algorithms do the job of telling the machine how to analyse the data so that it can give good results.
This tutorial can also be done using Azure Machine Learning Studio, a simple drag-and-drop, low-code platform. Basic Python fluency is needed to understand the code in this tutorial. We will analyse COVID-19 data and predict the upcoming cases, comparing two machine learning algorithms: Linear Regression and Support Vector Regression (SVR).
Now let us learn by doing it hands-on. I will be using the Microsoft Azure Machine Learning service in this tutorial.
To get started, navigate to the Azure portal (https://azure.microsoft.com/), sign in with your Microsoft account and create a ‘Machine Learning’ service. Fill in the details and create the resource.
Once the resource is deployed, you will see an overview page as shown above. Navigate to ‘Experiments’ and launch the Azure ML interface. A new window opens with the interface shown below.
Click on ‘Create New’ and select a new ‘JupyterLab’ file with the most recent Python version.
NOTE: In this tutorial we go through the basic method of training a model in a notebook. The Machine Learning service can also be used to deploy models, and Pipelines and Automated ML are available as well.
We now have a new Jupyter notebook where we can build our ML model.
First and foremost, we need to import three basic libraries:
Pandas: the name is derived from “panel data”. It is used for data manipulation and analysis.
NumPy: Numerical Python, an advanced library for mathematical computation, mainly used for working with multidimensional arrays.
Matplotlib.pyplot: a plotting and visualization library. It gives the user flexibility and full control over graphs.
Apart from these, we import some more libraries,
Seaborn: a statistical visualization library built on top of Matplotlib.
Datetime: for working with dates and times.
Timedelta: for representing time durations.
We also import a few things from the scikit-learn library. Scikit-learn is a very large library, widely used in the field of machine learning; it includes many methods for data preprocessing as well as for fitting algorithms to the data. Since we use two algorithms, we import SVR and LinearRegression. GridSearchCV is used for searching over hyperparameters (a sketch of how it could be used follows the imports below), while Ridge and Lasso are regularized variants of linear regression.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime as dt
from datetime import timedelta
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.svm import SVR
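Although GridSearchCV is imported, this tutorial sets the SVR hyperparameters by hand later on. As a minimal sketch of how a grid search could look instead (the parameter values here are my own illustrative assumptions, not the tutorial’s settings):

param_grid = {"C": [0.1, 1, 10], "kernel": ["poly", "rbf"], "epsilon": [0.01, 0.1]}
search = GridSearchCV(SVR(), param_grid, cv=3, scoring="neg_mean_squared_error")
# search.fit(X_train, y_train)  # hypothetical 2-D X_train and 1-D y_train
# print(search.best_params_)    # best combination found by cross-validation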
Once we are done importing the libraries, we import the dataset using the command below. A dataset is a systematic collection of data in rows and columns. The dataset I am using here contains the COVID-19 cases from all around the world. Datasets can come in many forms, but we use the CSV (comma-separated values) format in this tutorial.
covid=pd.read_csv('covid_19_data.csv')
covid.head() # returns the first 5 rows
The .head() function displays the first five rows of the dataset.
Machine learning is all about math; it works on statistics. Hence it is important to make sure the correct information has been provided. No dataset comes well processed: just as we pick stones out of cereals or rice, we need to preprocess the data. This is called “data preprocessing”, and it consists of steps like taking care of missing data, feature scaling, label encoding and many more. It makes sure the data is ready to be fitted to a machine learning algorithm like Linear Regression. Two of these steps are sketched right below.
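As a minimal, self-contained illustration of label encoding and feature scaling (the toy DataFrame here is hypothetical and not part of the COVID-19 dataset):

from sklearn.preprocessing import LabelEncoder, StandardScaler
toy = pd.DataFrame({"Country": ["India", "US", "India"], "Cases": [100.0, 300.0, 200.0]})
toy["Country"] = LabelEncoder().fit_transform(toy["Country"])    # label encoding: strings -> integers
toy[["Cases"]] = StandardScaler().fit_transform(toy[["Cases"]])  # feature scaling: zero mean, unit variance
print(toy)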
print("Size/Shape of the dataset",covid.shape)
print("Checking for null values \n",covid.isnull().sum())
print("Checking Data type:",covid.dtypes)
This gives us the number of null values, i.e. values missing from the dataset. We can either fill them with the mean of that particular column or eliminate the whole row containing the missing value.
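A minimal sketch of those two options on our dataset (these lines are illustrative only and do not change the tutorial’s flow):

filled = covid["Confirmed"].fillna(covid["Confirmed"].mean())  # replace NaN with the column mean
dropped = covid.dropna()                                       # or drop every row with a missing value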
covid.drop(["SNo"], axis=1, inplace=True)
covid.head()
The serial number column is not used in any of the calculations in our algorithm, so we drop it from the dataset. inplace=True makes pandas modify the DataFrame directly instead of returning a modified copy.
covid["ObservationDate"]=pd.to_datetime(covid["ObservationDate"])
Here we use the pandas to_datetime function to convert the observation dates into the datetime format. Pandas comes in handy whenever we work with a dataset, especially during the preprocessing step.
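As a quick illustration with a hypothetical date string:

print(pd.to_datetime("01/22/2020"))  # 2020-01-22 00:00:00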
datewise=covid.groupby(["ObservationDate"]).agg({"Confirmed":'sum',"Recovered":'sum',"Deaths":'sum'})
Here we group the rows by observation date and aggregate the Confirmed, Recovered and Deaths columns, taking the sum of each for every date.
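A tiny hypothetical example of what groupby plus agg does; rows sharing a date collapse into one summed row:

mini = pd.DataFrame({"ObservationDate": ["2020-01-22", "2020-01-22", "2020-01-23"], "Confirmed": [10, 20, 40]})
print(mini.groupby("ObservationDate").agg({"Confirmed": "sum"}))  # 2020-01-22 -> 30, 2020-01-23 -> 40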
print("Basic information")
print("Total number of confirmed cases around the world",datewise["Confirmed"].iloc[-1])
print("Total number of Recovered cases around the world",datewise["Recovered"].iloc[-1])
print("Total number of Death cases around the world",datewise["Deaths"].iloc[-1]) #index location in the form of integers.-1 means starts from end
print("Total number of active cases around the world",datewise["Confirmed"].iloc[-1]-datewise["Recovered"].iloc[-1]-datewise["Deaths"].iloc[-1])
print("Total mumber of close cases",datewise["Recovered"].iloc[-1]+datewise["Deaths"].iloc[-1])
Displaying the worldwide case totals.
plt.figure(figsize=(15,5))
sns.barplot(x=datewise.index.date,y=datewise["Confirmed"]-datewise["Recovered"]-datewise["Deaths"])
plt.title("Distribution plot for Active cases")
plt.xticks(rotation=90)
plt.figure(figsize=(15,5))
sns.barplot(x=datewise.index.date,y=datewise["Confirmed"]+datewise["Recovered"]-datewise["Deaths"])
plt.title("Distribution plot for Closed cases")
plt.xticks(rotation=90)
Plotting the active and closed cases using Matplotlib and Seaborn.
datewise["WeekofYear"]=datewise.index.weekofyear
week_num = []  # week indices for the plot below
weekwise_confirmed = []
weekwise_recovered = []
weekwise_deaths = []
w = 1
for i in list(datewise["WeekofYear"].unique()):
    weekwise_confirmed.append(datewise[datewise["WeekofYear"]==i]["Confirmed"].iloc[-1])
    weekwise_recovered.append(datewise[datewise["WeekofYear"]==i]["Recovered"].iloc[-1])
    weekwise_deaths.append(datewise[datewise["WeekofYear"]==i]["Deaths"].iloc[-1])
    week_num.append(w)
    w = w + 1
plt.figure(figsize=(8,5))
plt.plot(week_num,weekwise_confirmed,linewidth=3,label="Confirmed")
plt.plot(week_num,weekwise_recovered,linewidth=3,label="Recovered")
plt.plot(week_num,weekwise_deaths,linewidth=3,label="Deaths")
plt.xlabel("Week number")
plt.ylabel("Number of cases")
plt.title("Weekly progress of different types of cases")
plt.legend()
The weekly progress of the different types of cases is plotted above.
plt.figure(figsize = (15,6))
plt.plot(datewise["Confirmed"].diff().fillna(0),label="Daily increase in confirmed cases",linewidth = 3)
plt.plot(datewise["Recovered"].diff().fillna(0),label="Daily increase in Recoveredd cases",linewidth = 3) #diff means approximate values are taaken and filled in empty
plt.plot(datewise["Deaths"].diff().fillna(0),label="Daily increase in Death cases",linewidth = 3)
plt.xlabel("Timestamp")
plt.ylabel("Daily Increment")
plt.title("Daily increase")
plt.xticks(rotation=90)
plt.legend()
print("Average increase in the number of Confirmed cases everyday",np.round(datewise["Confirmed"].diff().fillna(0).mean()))
print("Average increase in the number of Recovered cases everyday",np.round(datewise["Recovered"].diff().fillna(0).mean()))
print("Average increase in the number of Death cases everyday",np.round(datewise["Deaths"].diff().fillna(0).mean()))
Note that .diff() leaves the first entry as NaN, since there is no previous day to subtract, so we fill it with 0 before averaging the daily increments.
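A small illustrative example of what .diff().fillna(0) does to a cumulative series:

s = pd.Series([100, 150, 225])  # cumulative totals
print(s.diff().fillna(0))       # daily increments: 0.0, 50.0, 75.0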
countrywise=covid[covid["ObservationDate"]==covid["ObservationDate"].max()].groupby(["Country/Region"]).agg({"Confirmed":"sum","Recovered":"sum","Deaths":"sum"}).sort_values(["Confirmed"],ascending =False)
countrywise["Mortality"]=(countrywise["Deaths"]/countrywise["Confirmed"])*100
countrywise["Recovery"]=(countrywise["Recovered"]/countrywise["Confirmed"])*100
fig,(ax1,ax2) = plt.subplots(1,2,figsize = (25,10))
top_15confirmed = countrywise.sort_values(["Confirmed"],ascending=False).head(15)
top_15deaths = countrywise.sort_values(["Deaths"],ascending=False).head(15)
sns.barplot(x=top_15confirmed["Confirmed"],y=top_15confirmed.index,ax=ax1)
ax1.set_title("Top 15 countries as per number of confirmed cases")
sns.barplot(x = top_15deaths["Deaths"],y=top_15deaths.index,ax=ax2)
ax2.set_title("Top 15 countries as per number of death cases")
We calculate the country-wise mortality and recovery rates and plot the 15 countries with the highest numbers of confirmed cases and deaths.
india_data = covid[covid["Country/Region"]=="India"]
datewise_india = india_data.groupby(["ObservationDate"]).agg({"Confirmed":"sum","Recovered":"sum","Deaths":"sum"})
print(datewise_india.iloc[-1])
print("Total Actvie Cases:",datewise_india["Confirmed"].iloc[-1]-datewise_india["Recovered"].iloc[-1]-datewise_india["Deaths"].iloc[-1])
print("Total Closed cases",datewise_india["Recovered"].iloc[-1]+datewise_india["Deaths"].iloc[-1])
datewise_india["WeekofYear"] = datewise_india.index.weekofyear
week_num_india = []
india_weekwise_confirmed = []
india_weekwise_recovered = []
india_weekwise_deaths = []
w = 1
for i in list(datewise_india["WeekofYear"].unique()):
    india_weekwise_confirmed.append(datewise_india[datewise_india["WeekofYear"]==i]["Confirmed"].iloc[-1])
    india_weekwise_recovered.append(datewise_india[datewise_india["WeekofYear"]==i]["Recovered"].iloc[-1])
    india_weekwise_deaths.append(datewise_india[datewise_india["WeekofYear"]==i]["Deaths"].iloc[-1])
    week_num_india.append(w)
    w = w + 1
This code section does the data analysis for India.
plt.figure(figsize = (8,5))
plt.plot(week_num_india,india_weekwise_confirmed,linewidth=3,label="Confirmed")
plt.plot(week_num_india,india_weekwise_recovered,linewidth=3,label="Recovered")
plt.plot(week_num_india,india_weekwise_deaths,linewidth=3,label="Deaths")
plt.xlabel("Week number")
plt.ylabel("Number of cases")
plt.title("Weekly Progress of different types of cases")
plt.legend()
This code section plots and visualizes the weekly progress of the different types of cases in India.
max_ind = datewise_india["Confirmed"].max()
china_data = covid[covid["Country/Region"]=="Mainland China"]
Italy_data = covid[covid["Country/Region"]=="Italy"]
US_data = covid[covid["Country/Region"]=="US"]
spain_data = covid[covid["Country/Region"]=="Spain"]
datewise_china = china_data.groupby(["ObservationDate"]).agg({"Confirmed":"sum","Recovered":"sum","Deaths":"sum"})
datewise_Italy = Italy_data.groupby(["ObservationDate"]).agg({"Confirmed":"sum","Recovered":"sum","Deaths":"sum"})
datewise_US = US_data.groupby(["ObservationDate"]).agg({"Confirmed":"sum","Recovered":"sum","Deaths":"sum"})
datewise_Spain = spain_data.groupby(["ObservationDate"]).agg({"Confirmed":"sum","Recovered":"sum","Deaths":"sum"})
print ("It took",datewise_india[datewise_india["Confirmed"]>0].shape[0],"days in India to reach",max_ind,"Confirmed Cases")
print ("It took",datewise_Italy[(datewise_Italy["Confirmed"]>0)&(datewise_Italy["Confirmed"]<=max_ind)].shape[0],"days in Italy to reach number of Confirmed cases to India")
print ("It took",datewise_US[(datewise_US["Confirmed"]>0)&(datewise_US["Confirmed"]<=max_ind)].shape[0],"days in US to reach number of Confirmed cases to India")
print("It took",datewise_Spain[(datewise_Spain["Confirmed"]>0)&(datewise_Spain["Confirmed"]<=max_ind)].shape[0],"days in Spain to reach number of Confirmed cases to India")
print ("It took",datewise_china[(datewise_china["Confirmed"]>0)&(datewise_china["Confirmed"]<=max_ind)].shape[0],"days in China to reach number of Confirmed cases to India")
datewise["Days Since"] = datewise.index-datewise.index[0]
datewise["Days Since"] = datewise["Days Since"].dt.days
train_ml = datewise.iloc[:int(datewise.shape[0]*0.90)]  # first 90% of days for training
valid_ml = datewise.iloc[int(datewise.shape[0]*0.90):]  # last 10% for validation
model_scores = []
The prints above compare how many days each country took to reach India's current COVID-19 case count. We then add a numeric "Days Since" feature and split the data chronologically, keeping the first 90% of days for training and the last 10% for validation.
lin_reg = LinearRegression(normalize=True)
svm = SVR(C=1,degree=6,kernel='poly',epsilon=0.01,gamma='scale')
lin_reg.fit(np.array(train_ml["Days Since"]).reshape(-1,1),np.array(train_ml["Confirmed"]).reshape(-1,1))
svm.fit(np.array(train_ml["Days Since"]).reshape(-1,1),np.array(train_ml["Confirmed"]).reshape(-1,1))
Here we fit our processed data to the algorithms, that is, the models are trained here. Training is a crucial step, since the quality of the predictions depends on how well the model is trained and the data preprocessed.
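Note the reshape(-1,1) calls: scikit-learn estimators expect a 2-D feature matrix of shape (n_samples, n_features), while our "Days Since" column is 1-D. A quick illustration:

days = np.array([0, 1, 2, 3])
print(days.shape)                 # (4,)  -> 1-D, rejected by .fit()
print(days.reshape(-1, 1).shape)  # (4, 1) -> one feature per row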
prediction_valid_lin_reg = lin_reg.predict(np.array(valid_ml["Days Since"]).reshape(-1,1))
prediction_valid_svm = svm.predict(np.array(valid_ml["Days Since"]).reshape(-1,1))
Now the models are tested: they predict the confirmed-case counts for the held-out validation dates.
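The model_scores list declared earlier is a natural place to keep validation scores. A minimal scoring sketch, assuming root mean squared error as the comparison metric (this step is my addition, not part of the original flow):

from sklearn.metrics import mean_squared_error
rmse_lr = np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_valid_lin_reg))
rmse_svm = np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_valid_svm))
model_scores.extend([rmse_lr, rmse_svm])
print("Linear Regression validation RMSE:",rmse_lr)
print("SVR validation RMSE:",rmse_svm)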
new_date = []
new_prediction_lr = []
new_prediction_svm = []
for i in range(1,18):
    new_date.append(datewise.index[-1]+timedelta(days=i))
    new_prediction_lr.append(lin_reg.predict(np.array(datewise["Days Since"].max()+i).reshape(-1,1))[0][0])
    new_prediction_svm.append(svm.predict(np.array(datewise["Days Since"].max()+i).reshape(-1,1))[0])
pd.set_option("display.float_format",lambda x:'%.f' %x)
model_predictions = pd.DataFrame(list(zip(new_date,new_prediction_lr,new_prediction_svm)),columns = ["Dates","Linear Regression","SVR Prediction"])
model_predictions.head(10)
Here we go! The model is now ready to predict values: the loop forecasts the next 17 days, and .head(10) displays the first 10 of them. Pass a larger number to .head() to see more.
Note that SVR gives the better prediction in this case; its values are much closer to the actual figures. The model learns from the existing dataset and extrapolates these values, and the best algorithm should always be chosen according to the need.
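To eyeball this comparison, one could plot both forecasts against the recorded totals. A hedged sketch, not part of the original notebook:

plt.figure(figsize=(10,5))
plt.plot(datewise.index,datewise["Confirmed"],label="Recorded confirmed cases")
plt.plot(new_date,new_prediction_lr,linestyle="--",label="Linear Regression forecast")
plt.plot(new_date,new_prediction_svm,linestyle="--",label="SVR forecast")
plt.xticks(rotation=90)
plt.legend()
plt.show()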