Table of contents
Introduction
About you
Why Julia?
Install Julia and Jupyter notebook support
Julia basics
Linear algebra
Working with datasets
Visualizing data
Overview of Titanic machine learning problem
Prepare the training data for machine learning
Fix missing values
Fix non-numeric data
Visual data analysis
Train machine learning model
Make predictions and submit them to Kaggle
Deploy the model to production
Export the model to a file
Create the frontend
Create the backend
Conclusion
Introduction
Julia is a general-purpose programming language well suited for numerical analysis and computational science. It is sometimes described as the future of machine learning and the most natural replacement for Python in this field.
This article introduces the Julia language and its ecosystem, and shows how to use it to solve the Titanic machine learning competition and submit the result to Kaggle. In addition, it shows how to deploy the created machine learning model to production as a web service and how to create a web interface for sending prediction requests to this service from a web browser.
By the end of the article, you will have created a simple AI-powered web application that can be used as a template for creating more complex Julia ML solutions.
About you
This is not a book, but only an article. That is why it can't cover everything and assumes that you already have some base knowledge to get the most from reading it. It is essential that you are familiar with Python machine learning and understand how to train machine learning models using the NumPy, Pandas, SciKit-Learn, and Matplotlib Python libraries. Also, I assume that you are familiar with machine learning theory: types of machine learning problems like regression and classification, the concept and process of supervised machine learning (fit/predict and evaluating quality using metrics) and common models used for it, including the Random Forest Classifier and its implementation in the SciKit-Learn Python library. Additionally, it would be great if you have previously participated in Kaggle competitions, because to understand and run all the code in this article you need to have an account on https://kaggle.com.
There are already a lot of books, articles, and courses about the topics described above. In this article, I only show how to create, train, and deploy a basic machine learning model using Julia, without diving into the theoretical aspects of ML and AI.
Why Julia?
For a long time, Python has been known as the standard for data science and machine learning because of its simplicity and its great set of libraries and tools. Among others, there are great libraries like NumPy for linear algebra with vectors and matrices, Pandas for manipulating datasets, Matplotlib for data visualization, and Scikit-Learn, which provides a uniform interface for working with well-known machine learning models. Furthermore, Jupyter Notebooks, which allow writing and running Python code online right in a web browser, make a comfortable environment for data researchers to design and implement the whole machine learning cycle, even if they are not very experienced in programming.
However, all this is good for research in laboratories, but at some point you need to go to production, and at that moment things change dramatically. Python was created in the early nineties and was never supposed to be fast, and its core was never designed for modern technologies like distributed computing. That is why, to make complex ML tasks production-ready, a lot of third-party dependencies have to be installed and a lot of tricks have to be applied to Python code to speed it up. Some companies even rewrite or convert their Python machine learning models to faster languages like C++ before deploying them to production.
Julia aims to resolve these problems. This is what the authors wrote about their reasons for creating Julia:
We are greedy: we want more. We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
Source: The Julia blog.
So, from the ML perspective, Julia gets the best of both worlds. It aims to be as fast as C and as simple as Python. In addition, it has replacements for all the libraries that Python data scientists are used to working with:
| Purpose | Python | Julia |
|---|---|---|
| Linear algebra | NumPy | Built-in arrays, LinearAlgebra package |
| Work with datasets | Pandas | DataFrames.jl |
| Data visualization | Matplotlib | Plots.jl |
| Classic machine learning | SciKit-Learn | MLJ.jl, ScikitLearn.jl, BetaML.jl |
| Neural networks | TensorFlow or PyTorch | Flux.jl, BetaML.jl |
Read more about why Julia is a great choice for machine learning here.
Furthermore, Julia has a module that supports Jupyter Notebooks, so you can write Julia code there just as you would Python. All this makes Julia ready for machine learning tasks, including Kaggle competitions, in the same environment you would use with Python. Let's install this environment and introduce some Julia ML basics.
Install Julia and Jupyter notebook support
To install Julia, follow this link: https://julialang.org/downloads/, download the Julia package for your operating system, and run it. After successful installation, you will be able to run the `julia` command to enter the Julia REPL environment, where you can write and run Julia code. To exit the REPL, enter the `exit()` command.
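For example, a minimal REPL session might look like this:

julia> 2 + 3
5

julia> sqrt(16)
4.0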
Also, you can write your code in any text editor and save it to files with the `.jl` extension. Then you can run your Julia programs with this command:
julia <filename>.jl
In addition, you can use VSCode to develop on Julia. It has a great extension for this: https://www.julia-vscode.org/.
However, the best option for developing machine learning and data science solutions is Jupyter Notebook, so ensure that it's installed before continuing. Then, install Jupyter support for Julia using the REPL:

- Enter the REPL using the `julia` command
- Import the `Pkg` module: `using Pkg`
- Install the `IJulia` package: `Pkg.add("IJulia")`
- Exit the REPL using the `exit()` command
Then you can run Jupyter and create notebooks with Julia support. For your convenience, the next video shows how to install Julia and integrate it into Jupyter on macOS (assuming that Jupyter itself is already installed).
Sometimes the `julia` command does not work in the terminal after installation on macOS. You can use the following workaround to fix this: https://discourse.julialang.org/t/how-can-i-be-able-to-use-binary-command-julia-in-mac-osx-terminal/22270
Julia basics
Julia has a simple syntax. If you're familiar with Python, it will be easy to start writing in Julia. You can read more about basic Julia syntax in this article. Here I will only cover the features required for machine learning, and only the ones that will be used to solve the Titanic Kaggle competition. To learn more about each of these libraries and modules, I will provide useful links.
Create a new Jupyter Notebook to enter and run all the code samples below.
Linear algebra
Basic linear algebra features are already integrated into the Julia standard library. Each 1D array is a vector, and each 2D array works as a matrix by default, similar to a NumPy array. You do not need to include any additional packages for this. For example, if you write and run this code:
A = [
[1 2 3]
[4 5 6]
[7 8 9]
]
B = [
[7 8 9]
[4 5 6]
[1 2 3]
]
A*B
it will perform a matrix multiplication and output the following result:
3×3 Matrix{Int64}:
18 24 30
54 69 84
90 114 138
For additional features, you can import the LinearAlgebra module:
using LinearAlgebra
Then, you can use functions such as `det`, `tr`, or `inv` with matrices to get their determinant, trace, or inverse matrix:
using LinearAlgebra
# Note: the last entry is 10 instead of 9, because the matrix
# from the previous example is singular (its determinant is 0),
# so it has no inverse
A = [
[1 2 3]
[4 5 6]
[7 8 10]
]
println("Determinant: ",det(A))
println("Trace: ",tr(A))
println("Inverse: ")
inv(A)
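With this matrix, the output should be approximately the following (up to floating-point rounding):

Determinant: -3.0
Trace: 16
Inverse:
3×3 Matrix{Float64}:
 -0.666667  -1.33333    1.0
 -0.666667   3.66667   -2.0
  1.0       -2.0        1.0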
Find more about linear algebra features in the LinearAlgebra module documentation.
Working with datasets
To work with datasets, you have to install the external `DataFrames.jl` package. In addition, to load and save datasets as CSV files, you have to add the `CSV.jl` package.
The Julia package manager is implemented as the `Pkg` module, so you have to import it and then use its `add` function to install the required packages. Run this in your Jupyter notebook to install them:
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Then, you can import installed modules to your program:
using DataFrames, CSV
The DataFrames module provides the `DataFrame` data type, which you will use to construct datasets and manipulate data frame objects.
Create a data frame
This is how you can create a data frame with two columns:
df = DataFrame(name=["Julia", "Robert", "Bob","Mary"],
age=[12,15,45,32])
This code will create and output the following dataset:
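In the notebook, the printed data frame looks approximately like this (the exact rendering depends on the DataFrames.jl version):

4×2 DataFrame
 Row │ name    age
     │ String  Int64
─────┼────────────────
   1 │ Julia      12
   2 │ Robert     15
   3 │ Bob        45
   4 │ Mary       32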
Select data from a data frame
To select data from a data frame, you can use the array syntax:
df[<rows>,<columns>]
You specify the range of rows to select in `<rows>` and the range of columns to select in `<columns>`. For example, you can use this to select the first three rows and only the "age" column:
subs = df[1:3,"age"]
It is important to note that array numbering in Julia starts at 1, not at 0 as in most other languages. To select the first three rows and all columns, you can run this:
subs = df[1:3,:]
Also, to select a single column, you can use dot syntax:
names = df.name
As you see, each column is a native Julia array (vector).
You can use conditions to specify row ranges. For example, this can be used to select all persons from the dataset that are older than 15 years:
older = df[df.age .>15,:]
Sort data in a data frame
To sort data in a data frame, you can use the `sort` function. This will sort the dataset by age in ascending order:
sort(df,"age")
and the next code will sort it in descending order:
sort(df,"age",rev=true)
Add columns to a data frame
To add a new column, just use dot syntax:
df.sex = ["female","male","male","female"]
This adds the `sex` column to the data frame.
Remove columns from a data frame
The `select` function can be used for more complex data extraction from frames. In particular, it can be used to extract all columns except the specified ones, which is equivalent to removing those columns:
new_df = select(df,Not("sex"))
This code returns a new data frame by selecting all columns from the original except `sex`.
Group and summarize data in data frame
The `groupby` and `combine` functions are used to group data and show summary information for each group. The former groups data by the specified field or fields, and the latter adds summary columns, such as the number of rows in each group or the average value of some column in the group. The next code groups data by sex, calculates the number of rows in each group, and adds it as a "count" column:
group_df = groupby(df,"sex")
combine(group_df,nrow => "count")
So, the first line of this code creates a GroupedDataFrame object with rows grouped by "sex". The second line creates the "count" column with the count of items in each group. There are 2 females and 2 males in this dataset.
Also, a custom function can be used to calculate summary data. For example, this can be used to add both row counts and average ages for each group:
combine(group_df,
nrow => "count",
"age" => ((rows) -> sum(rows)/length(rows)) => "Average Age"
)
This code adds the "Average Age" column, produced from the values of the "age" column by applying a custom anonymous function that calculates the average of the values in each group.
These were just a few of the many possible data manipulations that you can do with the DataFrames.jl library. Read more about it in the documentation.
Visualizing data
Using Plots.jl, you can create a lot of different graphs to analyze your data, similar to Matplotlib or Seaborn in Python. To use it, you have to install the Plots package in your notebook and import it:
using Pkg
Pkg.add("Plots")
using Plots
Let me provide a few examples of graphs.
Line chart
plot(
[1,2,3,4,5],
[3,6,9,15,16],
title="Basic line chart",label="Line"
)
Scatter plot
plot(
[1,2,3,4,5],
[3,6,9,15,16],
title="Basic scatter plot",
label="Data",
seriestype="scatter"
)
Bar chart
The next code generates a bar chart from the df dataset that was created earlier.
plot(
df.name,
df.age,
title="Ages",
label=nothing,
seriestype="bar"
)
There is much more that you can do using Plots.jl. Read more about its features in the documentation.
After this short overview of the basic data science features of Julia, it's time to create and train the first machine learning model and evaluate its quality in the competition.
Overview of Titanic machine learning problem
The "Titanic - Machine Learning from Disaster" is one of the first educational machine learning problems that you could see in books, articles or courses. In this task you are provided with a dataset of data about Titanic passengers. Each passenger data includes an ID, name, sex, ticket cost, ticket class, cabin number, port of embarkation and number of family members. For passengers in this dataset is known did they survive or not in "Survived" column. If the passenger survived, the value is 1, if not then 0. Formally, this is called a labeled or training dataset. All data columns except one called the "feature matrix", and the "Survived" column called the "labels vector".
There is also the second dataset with the same data about other passengers but without "Survived" column. In other words, this dataset contains only features matrix, but do not have the labels vector. This is called the testing dataset. The task is to train a machine learning model on the training dataset and use this model to predict the "Survived" column values in the testing dataset or, in other words, predict the "labels vector" of the testing dataset based on its "features matrix".
The Kaggle competition is available here: https://www.kaggle.com/competitions/titanic
Briefly read the description, then open the "Evaluation" section to discover how Kaggle will evaluate the predictions that you submit.
Prepare the training data for machine learning
The "Data" tab on the Kaggle competition page contains training and testing datasets in train.csv
and test.csv
files, along with descriptions for each data column.
Create new Jupyter notebook with Julia backend and download these files to the same folder with your notebook.
Load train.csv
to Data Frame using CSV module:
# Add packages
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
# Import modules
using DataFrames, CSV
# Load training data to data frame
train_df = CSV.read("train.csv", DataFrame)
In case of errors, please check that the `train.csv` file exists in the folder where you run your notebook.
If there are no errors, it will show the first rows of the data:
As you see, this dataset has 891 rows and 12 columns. This is the basic data about passengers, like "Name", "Sex", and "Age". In addition, we see the "Survived" column, which contains 0 if the passenger did not survive and 1 if they survived.
Let's see the summary information about this data using the `describe` function:
describe(train_df)
This summary table shows info about each column: the min, max, mean, and median of the data in each of them. The basic goal of data preparation is to transform these columns into the features matrix and the labels vector. The labels vector is ready; this is the "Survived" column with numeric values. All other columns form the features matrix, and not everything is OK with them.
Let's look at the `nmissing` and `eltype` values for each column. The `nmissing` shows the number of missing values in the column, and the `eltype` shows the type of its data values. The features matrix should contain only numbers, but there are many columns of "String" data type. Also, the matrix should not have missing values, but we have some missing values in the `Age`, `Cabin`, and `Embarked` columns. Let's fix all this.
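By the way, if you are only interested in these two statistics, you can request them from `describe` directly (assuming a recent DataFrames.jl version):

describe(train_df, :nmissing, :eltype)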
Fix missing values
As the previous table shows, the `Age`, `Embarked`, and `Cabin` columns contain missing values. The `Embarked` value is absent in only 2 rows, so we will not lose too much data if we just remove these rows. The DataFrames module has a handy `dropmissing` function that can be used for this:
train_df = dropmissing(train_df,"Embarked")
This will remove all rows with missing values in the `Embarked` column.
The `Age` column contains 177 missing values, and it's not a good idea to remove these rows, because we would lose about 20% of the data in the dataset. So let's just fill them with something, for example the median value. The median age is 28, as displayed in the description table. Let's use the `replace` function of DataFrames to replace missing ages with 28:
train_df.Age = replace(train_df.Age,missing=>28)
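Instead of hardcoding 28, you could compute the median programmatically. A small sketch using the Statistics standard library, as an alternative to the hardcoded value above:

using Statistics
# Compute the median over the non-missing ages only
median_age = median(skipmissing(train_df.Age))
train_df.Age = replace(train_df.Age, missing => median_age)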
The `Cabin` column contains 687 missing values, which is more than 50% of the dataset. There is too little data in this column for it to be useful for predictions. Also, it's difficult to predict which data should be in these rows when more data is missing than exists. So let's just drop this column using the `select` function:
train_df = select(train_df, Not("Cabin"))
Finally, all missing data in the dataset has been fixed.
Fix non-numeric data
As said before, all data should be encoded as numbers, but we have the `Name`, `Sex`, `Ticket`, and `Embarked` columns as strings, and the `PassengerId` column, while numeric, is just a row identifier.
The `Name` and `PassengerId` values are unique for each passenger, which is why they can't be used by an ML model to split the data into categories or classify it. So you can just remove these columns:
train_df = select(train_df,Not(["PassengerId","Name"]));
For the other string columns, we need to encode all text values as numbers. To do that, we need to discover all unique values of these columns. Let's start with `Embarked`:
combine(groupby(train_df,"Embarked"),nrow=>"count")
This code groups the dataset by the `Embarked` column and shows all possible values and their counts. There are only "S", "C", and "Q" values here, so it's easy to encode them as S=1, C=2, and Q=3. This can simply be done with the following `replace` call:
train_df.Embarked = Int64.(
replace(train_df.Embarked,
"S" => 1, "C" => 2, "Q" => 3
)
)
Also, this code converts the column from the "String" to the "Int64" data type.
Then, repeat the same for the `Sex` column:
combine(groupby(train_df,"Sex"),nrow=>"count")
and replace "female" with 1 and "male" with 2:
train_df.Sex = Int64.(
replace(train_df.Sex,
"female" => 1, "male" => 2
)
)
Now it's time to see the summary info for the `Ticket` column:
combine(groupby(train_df,"Ticket"),nrow=>"count")
Here we see that it has 680 different categories of tickets, which is more than 50% of the data. However, we need to predict just two categories: survived or not survived. It's unlikely that this data can help the model make good predictions without additional processing to reduce the number of categories in this column. Although this goes beyond our current basic model, as additional practice you can play more with the data in this column to improve prediction results: for example, try to find a way to group tickets into more general categories and encode these categories with unique numbers (see the sketch after the next code block). For now, let's just remove this column:
train_df = select(train_df,Not("Ticket"))
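One possible sketch of that extra practice (it would have to be applied before dropping the column, not after): group tickets by their non-numeric prefix and encode each prefix group with a number. The `TicketGroup` column name is just an illustration:

# Group tickets by their prefix (e.g. "A/5", "PC") before
# dropping the column; purely numeric tickets fall into "NONE"
prefixes = map(train_df.Ticket) do t
    parts = split(t)
    length(parts) > 1 ? uppercase(parts[1]) : "NONE"
end
levels = unique(prefixes)
train_df.TicketGroup = [findfirst(==(p), levels) for p in prefixes]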
Now all string data is categorized, and all values have been replaced with category numbers. Let's describe the dataset again to ensure that all problems with the data are resolved:
describe(train_df)
You can see that all columns contain only numeric data and there are no missing values in them.
Visual data analysis
Now, the dataset is ready to train a machine learning model on it. Let's visualize this data to find some relations in it.
using Plots
# Group dataset by "Survived" column
survived = combine(groupby(train_df,"Survived"), nrow => "Count")
# Display the data on bar chart
plot(
survived.Survived,
survived.Count,
title="Survived Passengers",
label=nothing,
seriestype="bar",
texts=survived.Count
)
# Modify X axis to display text labels
# instead of numbers
xticks!([0:1:1;],["Not Survived","Survived"])
Here we see that 340 passengers survived. Now let's see how these passengers are distributed by sex.
# Group dataset by Sex column
# and show only rows where Survived=1
survived_by_sex = combine(
groupby(
train_df[train_df.Survived .== 1,:],
"Sex"),
nrow => "Count"
)
# Display the data on bar chart
plot(
survived_by_sex.Sex,
survived_by_sex.Count,
title="Survived Passengers by Sex",
label=nothing,
seriestype="bar",
texts=survived_by_sex.Count
)
# Modify X axis to display text
# labels instead of numbers
xticks!([1:1:2;],["Female","Male"])
Interesting: twice as many females as males survived in the training dataset. Now let's see the distribution of passengers who did not survive by ticket class.
# Group dataset by PClass column
# and show only rows where Survived=0
death_by_pclass = combine(
groupby(
train_df[train_df.Survived .== 0,:],
"Pclass"),
nrow => "Count")
# Display the data on bar chart
plot(
death_by_pclass.Pclass,
death_by_pclass.Count,
title="Dead Passengers by Ticket class",
label=nothing,
seriestype="bar",
texts=death_by_pclass.Count
)
# Modify X axis to display
# text labels instead of numbers
xticks!([1:1:3;],["First","Second","Third"])
This clearly shows that first and second class passengers had a better chance to survive than third class ones.
Assuming that the data in the training and testing datasets is distributed randomly, it's highly likely that a machine learning model trained on this data will predict that women in the first or second class had a much better chance to survive than others. Let's remember this finding to check this hypothesis at the end of the article, after training and deploying the ML model.
Finally, let's see the cleaned training dataset again:
train_df
Now it really looks like a matrix or, to be more precise, like a system of linear algebraic equations written in matrix form. Data in matrix format is exactly what most machine learning algorithms expect as input. Let's get started.
Train machine learning model
For machine learning, we will use the SciKitLearn.jl library, which replicates the SciKit-Learn library for Python. It provides an interface for commonly used machine learning models like Logistic Regression, Decision Tree, or Random Forest. SciKitLearn.jl is not a single package but a rich ecosystem with many packages, and you need to select which of them to install and import. You can find a list of supported models here. Some of them are built-in Julia models; others are imported from Python. Also, SciKitLearn.jl has a lot of tools to tune the learning process and evaluate the results.
For this "Titanic" task, we will use the RandomForestClassifier
model from the DecisionTree.jl package. Usually it works good for classification problems. Also, we will use the Cross Validation to calculate accuracy of model predictions from SciKitLearn.CrossValidation package. You have to install and import these packages before using them:
Pkg.add("DecisionTree")
Pkg.add("SciKitLearn")
using DecisionTree, SciKitLearn.CrossValidation
Then we will implement the training process. First we need to split the training dataset into the features matrix and the labels vector, then we need to create the `RandomForestClassifier` model and train it using this data. Finally, we will evaluate the prediction accuracy of this model using the `cross_val_score` function.
# Put "Survived" column to labels vector
y = train_df[:,"Survived"]
# Put all other columns to features
# matrix (important to convert to "Matrix" data type)
X = Matrix(train_df[:,Not(["Survived"])])
# Create Random Forest Classifier with 100 trees
model = RandomForestClassifier(n_trees=100)
# Train the model, using features matrix
# and labels vector
fit!(model,X,y)
# Evaluate the accuracy of predictions
# using Cross Validation
accuracy = minimum(
cross_val_score(model, X, y, cv=5)
)
The cross validation splits the X and y arrays into 5 parts (folds) and returns an array of accuracies, one for each fold. Then the `minimum` function selects the worst accuracy from this array, which means that all the others are better than the selected one. Finally, the achieved accuracy is more than 0.78, which is 78% for our training data. It's not bad, but it does not guarantee the same result on the testing dataset. You can try to improve this value by selecting different models or by tuning their hyperparameters. For example, you can increase the number of trees (n_trees) from 100 to 1000 or reduce it to 10 and see how this changes the accuracy, as sketched below. After achieving the best result, it's time to use it for predictions.
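A minimal sketch of such a hyperparameter sweep, reusing the X and y arrays defined above (the n_trees values are just examples to experiment with):

# Try several forest sizes and print the worst
# cross-validation fold accuracy for each of them
for n in [10, 50, 100, 500, 1000]
    candidate = RandomForestClassifier(n_trees=n)
    acc = minimum(cross_val_score(candidate, X, y, cv=5))
    println("n_trees=$n => worst fold accuracy: $acc")
end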
Make predictions and submit them to Kaggle
Now that the model is ready, it's time to apply it to the data from the `test.csv` file, which does not have the "Survived" labels. First we need to load it and look at the summary table, as we did for the training dataset:
test_df = CSV.read("test.csv",DataFrame)
describe(test_df)
Here you can see the same problems with the data: missing values and string columns. You need to apply exactly the same transformations to this data as you did to the training dataset, except for removing any rows: Kaggle requires predictions for every row, so you can only fill missing values, not remove the rows that contain them. Fortunately, the `Embarked` column does not have missing values here, so there is no need to fix it. However, this dataset has a single missing value in the `Fare` column, which we did not have in the training set. It's not a big problem: you can just replace this missing value with the median, 14.4542.
But the first thing to do is to save the `PassengerId` column to a separate variable. It will be required later for the Kaggle submission.
PassengerId = test_df[:,"PassengerId"]
Then, apply all the required data fixes:
# Repeat the same transformations as we did for training dataset
test_df = select(test_df,
Not(
["PassengerId","Name","Ticket","Cabin"]
)
)
test_df.Age = replace(test_df.Age,missing=>28)
test_df.Embarked = replace(
test_df.Embarked,"S" => 1, "C" => 2, "Q" => 3
)
test_df.Embarked = convert.(Int64,test_df.Embarked)
test_df.Sex = replace(
test_df.Sex,"female" => 1,"male" => 2
)
test_df.Sex = convert.(Int64,test_df.Sex)
# In addition, replace missing value
# in 'Fare' field with median
test_df.Fare = replace(
test_df.Fare,
missing=>14.4542
)
After the testing dataset is clean, you can use the trained model to make predictions:
Survived = predict(model, Matrix(test_df))
This code returns an array of predictions, one for each row of the testing dataset matrix, and saves it to the `Survived` variable.
Now it's time to submit it to Kaggle. Before doing so, look again at the "Evaluation" tab on the Kaggle Titanic competition page to see the required submission format:
The competition requires a CSV file with two columns: "PassengerId" and "Survived". You already have all this data. Let's create a data frame with these two columns and save it to CSV:
submit_df = DataFrame(PassengerId=PassengerId,Survived=Survived)
CSV.write("submission.csv",submit_df)
The first line of this code constructs the `submit_df` data frame with the `PassengerId` column that was saved previously and the `Survived` column with predictions for each passenger ID. The second line saves `submit_df` to the `submission.csv` file. This is how the content of this file looks:
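For illustration, the first lines should look approximately like this (the test-set passenger IDs start at 892; the Survived values depend on your model's predictions):

PassengerId,Survived
892,0
893,1
894,0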
Finally, go to the Kaggle competition page, press the "Submit Predictions" button, upload the `submission.csv` file, and see your result. When I did this, I received the following:
The prediction accuracy is 0.76555, which is more than 76% and close to the accuracy received on the training dataset. Not bad for the first time, but you can keep going: play with the data, try different models, change their hyperparameters, and search the Internet for articles and Jupyter notebooks from other people who have solved the Titanic competition before. I know that it's possible to achieve up to 98% accuracy using various tricks with models and data.
Deploy the model to production
It's fun to play with machine learning on your computer, but it does not mean much for the surrounding world. Usually, customers do not have Jupyter Notebooks and do not train models. They need simple tools that help them make decisions based on predictions from the data they have. That is why the only really important thing is how your models work in production. In this section, I will explain how to use Julia to create a web application that loads the machine learning model you trained and makes predictions online in a web browser.
Export the model to a file
First, you need to save the `model` from the notebook to a file. For this you can use the JLD2.jl module, which serializes Julia objects to an HDF5-compatible format (well known to Python data scientists) and saves them to a file.
Install and load the package to the notebook:
Pkg.add("JLD2")
using JLD2
and then save the `model` variable to the `titanic.jld2` file:
save_object("titanic.jld2", model)
The work with the Jupyter Notebook is finished now. All subsequent code should be written as a separate application. Create a folder for a new application, named `titanic` for example, and copy the `titanic.jld2` file into it.
Now you can create a text file `titanic.jl`, which will contain the code of the web application that you will write soon. Use any text editor for this, or VS Code with the Julia extension. Enter the following into `titanic.jl`:
using JLD2, DecisionTree
# Load the trained model that was saved from the notebook
model = load_object("titanic.jld2")
# Predict survival for a single passenger row:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
survived = predict(model,[1 2 35 0 2 144.5 1])
println(survived)
This code imports the required modules first. As you see, just two modules are required to run the prediction process: `JLD2` to load the model object and `DecisionTree` to run the predict function of the RandomForestClassifier. The code loads the model from the file, then makes a prediction for a single row of data. The columns in this row should go in the same order as they were passed from the dataset when the model was trained: `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, and `Embarked`. Finally, it prints the array of predictions. In this case, it prints an array with a single item, because only a single row of data was passed to the model for prediction.
You can run this code using the `julia` command:
julia titanic.jl
If everything works OK, it should print `[0]` or `[1]` to the console, depending on the prediction result. If you receive errors, then perhaps you need to install the `JLD2` and `DecisionTree` packages using the Julia REPL environment, as you did in the Jupyter notebook.
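For reference, installing them from the REPL looks like this:

using Pkg
Pkg.add("JLD2")
Pkg.add("DecisionTree")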
Now, let's refactor this code into a function that receives a row of data and returns a survival prediction (either 0 or 1):
using JLD2, DecisionTree
# Returns 1 if a passenger with
# specified 'data' survived or 0 if not
function isSurvived(data)
    model = load_object("titanic.jld2")
    survived = predict(model,data)
    return survived[1]
end
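For example, a quick check of this function might look like this (the passenger values are made up):

# Pclass=1, Sex=1 (female), Age=35, SibSp=0,
# Parch=2, Fare=144.5, Embarked=1 (S)
println(isSurvived([1 2 35 0 2 144.5 1]))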
Create the frontend
The next step is to create a web interface that will be used to collect the data for this function. It will look as displayed on the next screenshot:
With this interface, the user can enter the data about a passenger, press the "PREDICT" button, and discover whether a passenger with this data could have survived on the Titanic. This is the HTML code of this web page:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Titanic</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>Ticket class</td>
<td>
<select id="pclass">
<option value="1">1</option>
<option value="2">2</option>
<option value="3">3</option>
</select>
</td>
</tr>
<tr>
<td>Sex</td>
<td>
<select id="sex">
<option value="1">Female</option>
<option value="2">Male</option>
</select>
</td>
</tr>
<tr>
<td>Age</td>
<td>
<input id="age" type="number"/>
</td>
</tr>
<tr>
<td># of Siblings/Spouses</td>
<td>
<input id="sibsp" type="number"/>
</td>
</tr>
<tr>
<td># of Parents/children</td>
<td>
<input id="parch" type="number"/>
</td>
</tr>
<tr>
<td>Fare</td>
<td>
<input id="fare"/>
</td>
</tr>
<tr>
<td>Embarked</td>
<td>
<select id="embarked">
<option value="1">S</option>
<option value="2">C</option>
<option value="3">Q</option>
</select>
</td>
</tr>
<tr>
<td>Survived</td>
<td id="survived"></td>
</tr>
<tr>
<td colspan="2">
<div>
<button id="submit" type="button">PREDICT</button>
</div>
</td>
</tr>
</tbody>
</table>
<script>
document.getElementById("survived").innerHTML = "";
document.getElementById("submit").addEventListener("click",async() => {
response = await fetch("http://localhost:8080",{
method:"POST",
body: JSON.stringify({
"pclass":parseInt(document.getElementById("pclass").value),
"sex":parseInt(document.getElementById("sex").value),
"age":parseFloat(document.getElementById("age").value),
"sibsp":parseInt(document.getElementById("sibsp").value),
"parch":parseInt(document.getElementById("parch").value),
"fare":parseFloat(document.getElementById("fare").value),
"embarked":parseInt(document.getElementById("embarked").value),
})
});
const survivedCode = parseInt(await response.text());
document.getElementById("survived").innerHTML = survivedCode ? "YES" : "NO"
})
</script>
<style>
input,select {
width:100%;
}
td {
padding:5px;
}
td > div {
text-align: center;
}
#survived {
font-weight: bold;
color:green;
}
</style>
</body>
</html>
Create an `index.html` file in the same folder and copy this code into it. The HTML part of the file contains a simple form with all the data fields. As you see, all values are encoded with the same numbers that we used in the training and testing datasets. The JavaScript part of this code defines the handler for the "PREDICT" button. When the user clicks it, the script collects all the entered data as a JSON string. Then it makes an AJAX request to the web service running on port 8080 of localhost (which we have not created yet) and sends this JSON to it. So, the web service should be able to receive HTTP POST requests with a JSON body in the following format:
{
"pclass": 1,
"sex": 1,
"age": 32,
"sibsp": 5,
"parch": 6,
"fare": 123.44,
"embarked": 1
}
Create the backend
Now it's time to modify the `titanic.jl` file to make it work as a web server that can display the `index.html` page, receive a POST request from it, parse the body of this request as JSON, make a prediction based on this JSON data, and return the prediction to the web page.
Creating a web server in Julia is as simple as in Python, Go, or Node.js. Using the HTTP.jl package, you can create and run a web server with a single line of code, given a request handler function:
using HTTP

function handler(req)
    # handle the HTTP request and return a response
end

HTTP.serve(handler,8080)
The `HTTP.serve` function runs the web server on the specified port. Each time the web server receives a client request, it calls the specified `handler` function, passing an HTTP request object to it as the `req` argument. The function should read this request, process it, and return a response for the calling client.
The `req.url` field contains the URL of the received request, the `req.method` field contains the request method, like GET or POST, and the `req.body` field contains the POST body of the request in binary format. The HTTP request object contains much more information; you can find all of it in the HTTP.jl documentation. Our web application will only check the request method. If the received request is a POST request, it will parse `req.body` as a JSON object and send the data from this object to the `isSurvived` function to make a prediction and return it to the client browser. For all other request types, it will just return the content of the `index.html` file to display the web interface. This is how the whole source of the `titanic.jl` web service looks:
using JLD2, DecisionTree
# Returns 1 if a passenger with
# specified 'data' survived or 0 if not
function isSurvived(data)
model = load_object("titanic.jld2")
survived = predict(model,data)
return survived[1]
end
using HTTP,JSON3
function handle(req)
if req.method == "POST"
form = JSON3.read(String(req.body))
survived = isSurvived([
form.pclass
form.sex
form.age
form.sibsp
form.parch
form.fare
form.embarked
])
return HTTP.Response(200,"$survived")
end
return HTTP.Response(200,read("./index.html"))
end
HTTP.serve(handle, 8080)
Before running it, you need to install the HTTP.jl and JSON3.jl packages by running `Pkg.add("HTTP")` and `Pkg.add("JSON3")` in the Julia REPL environment.
The web service code goes right after the `isSurvived` function. First, the required modules are imported: `HTTP` to create a web server and `JSON3` to parse the JSON from the request body. Then, the `handle` function is defined. The function checks the request method of each received request and, if it equals POST, converts the stringified JSON body of the request to the `form` object. Then, using the fields of this object, the `isSurvived` function is called. It's important to put the array items in the correct order here. Finally, the prediction result is returned to the client using the `HTTP.Response` function.
For all other request types, the function returns the body of the `index.html` file in the `HTTP.Response(200,read("./index.html"))` line.
Finally, the `HTTP.serve` function starts a web server on port 8080 that waits for HTTP requests and handles them using the `handle` function defined above.
Now you can run this by typing `julia titanic.jl` in the terminal or by pressing Ctrl+F5 in VSCode. Then you can access the web interface from a web browser at http://localhost:8080 and play with the service: enter data in the form, press the PREDICT button, and see either YES or NO on the Survived line, depending on the prediction result. You can check the hypothesis we made from the bar charts: women in first or second class have a better chance to survive than others. You can also call the service without the browser, as sketched below.
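A minimal way to send such a request from another Julia session (the passenger values are made up):

using HTTP, JSON3
# A hypothetical first-class female passenger
body = JSON3.write(Dict(
    "pclass" => 1, "sex" => 1, "age" => 35,
    "sibsp" => 0, "parch" => 0, "fare" => 100.0,
    "embarked" => 1
))
resp = HTTP.post("http://localhost:8080", [], body)
println(String(resp.body)) # prints 1 or 0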
Conclusion
In this article, I introduced the Julia programming language along with its ecosystem and explained why it's so great for machine learning. I showed how to set up a comfortable development environment and gave a brief overview of the common Julia modules used for data science. Then I guided you through the process of training a machine learning model for the Titanic competition and showed how to make predictions and submit them to the Kaggle platform for scoring. Finally, I showed how to export this model to an external application, create a web service with this model, and build a web interface for entering data into a form and predicting whether a person with that data could have survived on the Titanic.
For all topics that were explained only briefly, I provided links to more thorough documentation. In addition, I would highly recommend reading the Julia Data Science online book and exploring the great set of machine learning examples in the Julia Academy Data Science GitHub repository.
See the source code of this article including the Jupyter Notebook and the web service in this repository:
https://github.com/AndreyGermanov/julia_titanic_model
Have fun coding and never stop learning!
Subscribe to the newsletter on my website: https://germanov.dev/#newsletter and follow me on social networks to know first about new articles like this one and other software development news:
LinkedIn: https://www.linkedin.com/in/andrey-germanov-dev/
Twitter: https://twitter.com/GermanovDev
Facebook: https://www.facebook.com/AndreyGermanovDev