Deep learning on "the iris data-set" in Julia

Luke Floden ・5 min read

Static HTML notebook found here: Pluto
Use the above link for interactive charts!

My task for this research is to explore JuliaLang and Flux.jl through experiments on the ubiquitous data-set known as 'the iris data-set'.


Data Summary

Data set: iris
This data set contains 150 samples of iris flowers. The features in each sample are the length and width of both the petal and the sepal, plus the species of iris, giving a 150×5 table.

Each feature is recorded as a floating point value except for the species (string). The species identifier acts as the label for this data set (if used for supervised learning). There are no missing values. The data and header are stored in two separate files.

This data could be used for iris classification. This could be useful in an automation task involving these flowers or as a tool for researchers to assist in quick identification. Other, less "real world" applications include use as a benchmark data set for ML systems, both supervised (e.g. a neural network or k-NN classifier) and unsupervised (e.g. k-means clustering).

Imports

begin
    import Pkg;
    packages = ["CSV","DataFrames","PlutoUI","Plots","Combinatorics"]   
    Pkg.add(packages)

    using CSV, DataFrames, PlutoUI, Plots, Combinatorics

    plotly()
    theme(:solarized_light)
end

Loading, cleaning, and manipulating the data

begin
    path = "iris/iris.data"
    iris_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

    # The data file has no header row, so supply the column names directly
    df = DataFrame(CSV.File(path; header=iris_names))
    # Drop any rows with missing values
    dropmissing!(df)

end

Splitting the data into three iris classes

As you can see, each class is equally represented:

begin
    df_species = groupby(df, :class)
end

Class sizes: (50, 5), (50, 5), (50, 5)


Visualizations

Comparing length vs width of the sepal and petal

begin
    scatter(title="len vs wid", xlabel = "length", ylabel="width",
             df.sepal_len, df.sepal_wid, color="blue", label="sepal")
    scatter!(df.petal_len, df.petal_wid, color="red", label="petal")
end

Comparing all combinations of variables

Column pairs per chart: [sepal_len, sepal_wid, petal_len, petal_wid, class]
-> [1, 2] , [1, 3] , [1, 4]
-> [2, 3] , [2, 4] , [3, 4]
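
The six column pairs listed above are the 2-element combinations of the four feature columns. As a minimal sketch in plain Julia, without the Combinatorics package, they can be enumerated like this (the helper name `column_pairs` is made up for this illustration):

```julia
# Enumerate every unordered pair (i, j) with i < j from n columns.
# `column_pairs` is a hypothetical helper, not part of any package.
column_pairs(n) = [(i, j) for i in 1:n for j in (i + 1):n]

col_pairs = column_pairs(4)
println(col_pairs)
```

Four columns give 4·3/2 = 6 pairs, which is why the charts fit a 2×3 layout.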

begin
    # Get all combinations of columns
    combins = collect(combinations(1:4, 2))
    # Pair up the two columns of each combination (df[!, i] selects column i)
    combos = [(df[!, i], df[!, j]) for (i, j) in combins]
    # Plot all combinations in sub-plots
    scatter(combos, layout=(2,3))
end

Comparing the sepal length vs sepal width vs petal length of all three classes of iris

Restricted to three variables to plot in 3d

begin
    setosa, versicolor, virginica = df_species

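    # x = sepal length, y = sepal width, z = petal length
    scatter(setosa.sepal_len, setosa.sepal_wid, setosa.petal_len, label="setosa",
            xlabel="sepal length", ylabel="sepal width", zlabel="petal length")
    scatter!(versicolor.sepal_len, versicolor.sepal_wid, versicolor.petal_len, label="versicolor")
    scatter!(virginica.sepal_len, virginica.sepal_wid, virginica.petal_len, label="virginica")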
end


[3] Deep Learning

Imports

begin
    Pkg.add("Flux")
    Pkg.add("CUDA")
    Pkg.add("IterTools")

    using Flux
    using Flux: Data.DataLoader
    using Flux: @epochs
    using CUDA
    using Random
    using IterTools: ncycle

    Random.seed!(123);
end

The Data

Formatting data for training (including one-hot conversion)

begin   
    # Convert df to a matrix (rows are samples, columns are features plus the label)
    data = Matrix(df)

    # Shuffle the rows
    data = data[shuffle(1:end), :]

    # train/test split
    train_test_ratio = 0.7
    idx = floor(Int, size(df, 1) * train_test_ratio)
    data_train = data[1:idx, :]
    data_test = data[idx+1:end, :]

    # Get feature vectors
    get_feat(d) = transpose(convert(Array{Float32},d[:, 1:end-1]))
    x_train = get_feat(data_train)
    x_test = get_feat(data_test)

    # One hot labels
    #   onehot(d) = [Flux.onehot(v, unique(df.class)) for v in d[:,end]]
    onehot(d) = Flux.onehotbatch(d[:,end], unique(df.class))
    y_train = onehot(data_train)
    y_test = onehot(data_test)
end
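
To make the data layout concrete: the features are transposed so that each column is one sample, and the labels become a matrix with one row per class and one column per sample. The following is a minimal sketch of that one-hot layout in plain Julia. The helper `onehot_matrix` is a made-up stand-in for illustration, not how `Flux.onehotbatch` is implemented, and the class strings are assumed example values.

```julia
# Hypothetical stand-in for a one-hot batch encoder:
# one row per class, one column per sample, `true` where label == class.
onehot_matrix(labels, classes) = [c == l for c in classes, l in labels]

# Assumed example class names and labels, for illustration only
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
labels  = ["Iris-virginica", "Iris-setosa"]

Y = onehot_matrix(labels, classes)
println(size(Y))   # (number of classes, number of samples)
```

Each column of `Y` contains exactly one `true`, at the row of that sample's class.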

Creating DataLoaders for batches

begin
    batch_size= 1
    train_dl = DataLoader((x_train, y_train), batchsize=batch_size, shuffle=true)
    test_dl = DataLoader((x_test, y_test), batchsize=batch_size)
end

The Model

I am going to implement a fully connected neural network to classify by species.

  • Layers: Chain(Dense(4, 8, relu), Dense(8, 3), softmax)
  • Loss: logit binary crossentropy
  • Optimizer: Flux.Optimise.ADAM
  • Learning rate: 0.001
  • Epochs: 30
  • Batch size: 1
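
For intuition about the last layer and the loss: softmax turns the three raw output scores into probabilities that sum to one, and a cross-entropy loss penalizes low probability on the true class. The sketch below uses plain categorical cross-entropy, which is a common pairing with a softmax output and is not the logit binary cross-entropy listed above. The function names and the input scores are made up for illustration, and this is not how Flux implements either function.

```julia
# Numerically stable softmax over a vector of raw scores (logits).
function softmax_vec(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Categorical cross-entropy between probabilities p and a one-hot target y.
cross_entropy(p, y) = -sum(y .* log.(p))

logits = [1.0, 2.0, 3.0]   # made-up scores, for illustration only
p = softmax_vec(logits)
y = [0.0, 0.0, 1.0]        # one-hot target: the third class is correct
loss_value = cross_entropy(p, y)
println(p)
println(loss_value)
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class.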

Training

begin
    ### Model ------------------------------
    function get_model()
        c = Chain(
            Dense(4,8,relu),
            Dense(8,3),
            softmax
        )
    end

    model = get_model()

    ### Loss ------------------------------
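    # Note: logitbinarycrossentropy expects raw logits, while this model ends in softmax;
    # Flux.Losses.crossentropy is the usual pairing for softmax outputs.
    loss(x,y) = Flux.Losses.logitbinarycrossentropy(model(x), y)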

    train_losses = []
    test_losses = []
    train_acces = []
    test_acces = []

    ### Optimiser ------------------------------
    lr = 0.001
    opt = ADAM(lr, (0.9, 0.999))

    ### Callbacks ------------------------------
    function loss_all(data_loader)
        sum([loss(x, y) for (x,y) in data_loader]) / length(data_loader) 
    end

    function acc(data_loader)
        f(x) = Flux.onecold(cpu(x))
        acces = [sum(f(model(x)) .== f(y)) / size(x,2)  for (x,y) in data_loader]
        sum(acces) / length(data_loader)
    end

    callbacks = [
        () -> push!(train_losses, loss_all(train_dl)),
        () -> push!(test_losses, loss_all(test_dl)),
        () -> push!(train_acces, acc(train_dl)),
        () -> push!(test_acces, acc(test_dl)),
    ]

    # Training ------------------------------
    epochs = 30
    ps = Flux.params(model)

    @epochs epochs Flux.train!(loss, ps, train_dl, opt, cb = callbacks)

    @show train_loss = loss_all(train_dl)
    @show test_loss = loss_all(test_dl)
    @show train_acc = acc(train_dl)
    @show test_acc = acc(test_dl)
end 

Results


One example prediction:

begin
    y = (y_test[:,1])
    pred = (model(x_test[:,1]))
end

Prediction: 0.00020066714, 0.19763687, 0.8021625
Truth: 0, 0, 1
Error: 0.395675
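
The notebook output does not say how this error figure is computed. One reading that reproduces the number is the sum of absolute differences between the prediction and the one-hot truth. The sketch below works through that arithmetic under this assumption.

```julia
pred  = Float32[0.00020066714, 0.19763687, 0.8021625]
truth = Float32[0, 0, 1]

# Assumed error measure: sum of absolute differences (L1 distance)
l1_error = sum(abs.(pred .- truth))
println(l1_error)

# The predicted class is the index of the largest probability
println(argmax(pred) == argmax(truth))
```

The largest predicted probability falls on the third class, which matches the truth, so this sample is classified correctly even though the error is non-zero.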

Confusion Matrix
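
A confusion matrix counts, for each true class, how many samples were predicted as each class. Here is a minimal sketch in plain Julia, assuming the labels are already integer class indices (for example, as produced by `Flux.onecold`). The helper name and the sample labels are made up for illustration and are not the results of this model.

```julia
# Build a confusion matrix from integer class indices (1..n_classes).
# Rows are true classes, columns are predicted classes.
function confusion_matrix(y_true, y_pred, n_classes)
    cm = zeros(Int, n_classes, n_classes)
    for (t, p) in zip(y_true, y_pred)
        cm[t, p] += 1
    end
    return cm
end

# Made-up labels for illustration only
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, 3)
# Accuracy is the share of samples on the diagonal
accuracy = sum(cm[i, i] for i in 1:3) / sum(cm)
println(cm)
println(accuracy)
```

Off-diagonal entries show which classes get mistaken for one another, which is more informative than accuracy alone when two classes overlap.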


[4] Conclusion

Tools

I chose to implement a basic feed-forward neural network because of the scale of the problem. With so few samples and so few features, a small network is a better fit. I chose a NN because I wanted to evaluate Julia as a tool for my deep learning work. Again, because of the size of the problem, shallow ML approaches would have been sufficient. A natural extension of this research is to compare against such methods.

I wanted to challenge myself and learn an entirely new language and platform for this project. The Julia Programming Language is a high-level, dynamically typed language. It has its own web-based notebook editor, Pluto, which is much like Python's Jupyter notebooks. Because Julia is newer and its community is smaller than Python's, the documentation and support were not even close in magnitude. This slowed me down considerably. Despite the setbacks, I learned a lot in this research and I am glad I decided to use Julia.

Results

My model's test accuracy was 95.55%. This is satisfactory to me given the simplicity of the data set and the model. While one species is linearly separable from the rest, the other two are not separable from each other. These latter two species are the main problem for the model to tackle.

As I stated at the beginning of this paper, this model could be used for classification tasks such as automation or as a tool for bio researchers to aid in identification. Furthermore, this model could be used as a pre-trained model for more specific tasks; I understand this statement is a bit of a stretch, but I want to account for as many applications as possible.


[5] Related work

Related research: Kaggle

References

  1. The Iris Data-set
  2. Flux.jl
  3. Exploring High Level APIs of Knet.jl and Flux.jl in comparison to Tensorflow-Keras
  4. Related Kaggle work
