[Photo by jim gade on Unsplash, modified]
Data Science: a branch of computer science that studies how to use, store, and analyze data in order to derive information from it.
With this mini-series we are going to explore how to use some Rusty tools to accomplish the tasks that are the bread and butter of any Data Scientist.
The final goal is to show that Rust can be employed in this field, and how. Ultimately our goal is also to spark interest in this field of application: the author is persuaded that Rust should prove very useful in the field of Data Science (as well as Machine Learning and ultimately AI).
You can find this article's code in the repo: github.com/davidedelpapa/rdatascience-tut1
Setting the stage for this tutorial
There are a few crates we are going to cover in this tutorial. However, we are going to introduce them as we go.
Let's start our project the standard rusty way.
cargo new rdatascience-tut1 && cd rdatascience-tut1
cargo add ndarray ndarray-rand ndarray-stats noisy_float poloto
code .
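I am currently using cargo add from the handy cargo-edit crate (quick install: cargo install cargo-edit) to handle dependencies, and Visual Studio Code as my IDE. Since Rust 1.62, cargo add ships with Cargo itself, so cargo-edit is only needed on older toolchains.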
Feel free to handle Cargo.toml dependencies by hand, or use a different IDE.
ndarray: what is it, and why use it?
ndarray is a Rust crate used to work with arrays.
It covers all the classic uses of an array-handling framework (such as numpy for Python). Some use cases that are not covered by the main crate are covered by companion crates, such as ndarray-linalg for linear algebra, ndarray-rand for random generation, and ndarray-stats for statistics.
Additionally, ndarray also has some nice extras, such as support for rayon for parallelization, or for the popular BLAS low-level specs through one of the working back-ends (using blas-src).
Why use ndarray?
In Rust there are already arrays (or lists) and vectors, and the language itself allows for many different kinds of manipulation through powerful iterators.
What is more, what the bare Rust language offers (enhanced by std) is often faster than the equivalent in other, more popular languages; still, ndarray is specialized to handle n-dimensional arrays with a mathematical end in view.
Thus ndarray builds on the power already provided by the language; this power is one of the reasons why the author is persuaded that Rust will be the language of Data Science in the next few years.
ndarray Quick-Start
At the top of our src/main.rs we are going to import as usual:
use ndarray::prelude::*;
We have almost everything we need in the prelude.
We can start to put stuff inside the fn main()
Array creation
Let's start to see how we can create arrays:
let arr1 = array![1., 2., 3., 4., 5., 6.];
println!("1D array: {}", arr1);
ndarray provides the array! macro, which detects which type of ArrayBase is needed. In this case it is a 1-D, that is, a one-dimensional array. Notice that the underlying ArrayBase already implements the std::fmt::Display trait, so we can print it with {}.
Compare it to the standard Rust array (let's call them lists in order not to confuse them with ndarray's arrays) and Vec:
// 1D ndarray array VS 1D list VS 1D Vec
let arr1 = array![1., 2., 3., 4., 5., 6.];
println!("1D array: \t{}", arr1);
let ls1 = [1., 2., 3., 4., 5., 6.];
println!("1D list: \t{:?}", ls1);
let vec1 = vec![1., 2., 3., 4., 5., 6.];
println!("1D vector: \t{:?}", vec1);
And the result:
1D array: [1, 2, 3, 4, 5, 6]
1D list: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
1D vector: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
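Notice too that the array has printed the floats as if they were integers, since they are all .0: Display ({}) formatting drops a zero fractional part, while the Debug ({:?}) formatting used for the list and the vector keeps it.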
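The difference in the printed values comes from the formatting trait that is used, not from the data itself. A minimal sketch in plain std Rust (the helper names show_display and show_debug are made up for this illustration) makes that visible without any crate:

```rust
/// Format a float with Display (`{}`), as used for the ndarray output.
fn show_display(x: f64) -> String {
    format!("{}", x)
}

/// Format a float with Debug (`{:?}`), as used for the list and the Vec.
fn show_debug(x: f64) -> String {
    format!("{:?}", x)
}

fn main() {
    // A whole float loses its ".0" under Display but keeps it under Debug.
    println!("{} vs {}", show_display(1.0), show_debug(1.0));
    // A float with a fractional part looks the same in both.
    println!("{} vs {}", show_display(2.5), show_debug(2.5));
}
```

The values stored in the array are still f64; only their textual representation changes.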
Array Sum
Let's try to sum 2 arrays element by element:
let arr2 = array![1., 2.2, 3.3, 4., 5., 6.];
let arr3 = arr1 + arr2;
println!("1D array: \t{}", arr3);
Let's see how it compares with standard arrays (lists) and vectors:
let arr2 = array![1., 2.2, 3.3, 4., 5., 6.];
let arr3 = arr1 + arr2;
println!("1D array: \t{}", arr3);
let ls2 = [1., 2.2, 3.3, 4., 5., 6.];
let mut ls3 = ls1.clone();
// start from index 0, otherwise the first element is never summed
for i in 0..ls2.len() {
    ls3[i] = ls1[i] + ls2[i];
}
println!("1D list: \t{:?}", ls3);
let vec2 = vec![1., 2.2, 3.3, 4., 5., 6.];
let vec3: Vec<f64> = vec1.iter().zip(vec2.iter()).map(|(&e1, &e2)| e1 + e2).collect();
println!("1D vec: \t{:?}", vec3);
The result is:
1D array: [2, 4.2, 6.3, 8, 10, 12]
1D list: [2.0, 4.2, 6.3, 8.0, 10.0, 12.0]
1D vec: [2.0, 4.2, 6.3, 8.0, 10.0, 12.0]
As you can see, with Rust's standard tools things become more complicated very soon. To perform an element-by-element sum we need either a for loop or iterators (shown here for the Vec), which are powerful but verbose in such a day-to-day Data Science scenario.
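For comparison, the iterator approach can be wrapped once in a small function that accepts slices, so it serves both lists and vectors. This is only a sketch in plain std Rust; elementwise_sum is a hypothetical helper name, not something provided by std or ndarray:

```rust
/// Element-wise sum of two slices; returns None when the lengths differ.
/// `elementwise_sum` is a hypothetical helper written for this example.
fn elementwise_sum(a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    if a.len() != b.len() {
        return None;
    }
    // zip pairs the elements up, map adds each pair.
    Some(a.iter().zip(b).map(|(x, y)| x + y).collect())
}

fn main() {
    let ls1 = [1., 2., 3.];
    let vec2 = vec![1., 2.5, 3.5];
    // Both a fixed-size array and a Vec coerce to a slice.
    println!("{:?}", elementwise_sum(&ls1, &vec2));
}
```

Even so, this is code we have to write and maintain ourselves, while ndarray gives us the + operator out of the box.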
2D arrays & more
Let's quickly abandon the examples using Rust's standard constructs since, as we have shown, they are more complex, and let us focus on ndarray.
ndarray offers various methods to create, instantiate, and use 2D arrays.
Just look at this example:
let arr4 = array![[1., 2., 3.], [ 4., 5., 6.]];
let arr5 = Array::from_elem((2, 1), 1.);
let arr6 = arr4 + arr5;
println!("2D array:\n{}", arr6);
with its output:
2D array:
[[2, 3, 4],
[5, 6, 7]]
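With the macro array! we need to specify all elements, while with Array::from_elem we need to offer a Shape, in this case (2, 1), and an element to fill the array, in this case 1.0: it fills the whole shape with the selected element. Note that arr5 has shape (2, 1) while arr4 has shape (2, 3): during the sum, ndarray broadcasts the single column of arr5 across the three columns of arr4.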
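The sum in this example works although the two shapes differ, (2, 3) and (2, 1): the single column is stretched across every column of the larger array, which is usually called broadcasting. What that amounts to can be sketched in plain std Rust; add_column_broadcast is a hypothetical helper that only mimics this one case, not ndarray's general mechanism:

```rust
/// Add a column (one value per row) to every column of a matrix stored as rows.
/// Hypothetical helper mimicking a (rows, cols) + (rows, 1) broadcast.
fn add_column_broadcast(matrix: &[Vec<f64>], column: &[f64]) -> Vec<Vec<f64>> {
    assert_eq!(matrix.len(), column.len(), "one column entry per row");
    matrix
        .iter()
        .zip(column)
        // Every element of a row gets the same column value added.
        .map(|(row, c)| row.iter().map(|x| x + c).collect())
        .collect()
}

fn main() {
    let m = vec![vec![1., 2., 3.], vec![4., 5., 6.]];
    let col = [1., 1.];
    println!("{:?}", add_column_broadcast(&m, &col));
}
```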
let arr7 = Array::<f64, _>::zeros(arr6.raw_dim());
let arr8 = arr6 * arr7;
println!("\n{}", arr8);
Which outputs:
[[0, 0, 0],
[0, 0, 0]]
Array::zeros(Shape) creates an array of the given Shape filled with zeros.
Notice that sometimes the compiler cannot infer the type of zero to feed in (you almost forgot Rust has got a nice type system, didn't you?), so we help it with the annotation Array::<f64, _>, which gives the element type and lets the compiler infer the dimensionality (_).
The function .raw_dim(), as you can imagine, gives the shape of the matrix.
Let's create an identity matrix now (a 2-dimensional array with ones on the diagonal and zeros everywhere else):
let identity: &Array2<f64> = &Array::eye(3);
println!("\n{}", identity);
Which outputs:
[[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
We helped the compiler by providing the shape and type, but this time using a specialized form of ArrayBase, namely Array2, which represents 2-dimensional arrays. Notice that we created a reference so that we can re-use the variable without incurring the ire of the borrow checker (yes, always working, did you forget that as well?).
Let's explore now the use of an identity matrix:
let arr9 = array![[1., 2., 3.], [ 4., 5., 6.], [7., 8., 9.]];
let arr10 = &arr9 * identity;
println!("\n{}", arr10);
Outputs:
[[1, 0, 0],
[0, 5, 0],
[0, 0, 9]]
From my math classes I remember that the identity matrix should give back the same matrix when multiplied...
Yes, of course, we are not doing dot multiplication! With normal multiplication it does not work.
In fact, with matrices there is an element-wise multiplication, which is what &arr9 * identity does, but there is also a matrix multiplication, which is done by:
let arr11 = arr9.dot(identity);
println!("\n{}", arr11);
which finally outputs:
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
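To make the difference between the two products concrete, here is a sketch in plain std Rust on fixed 3x3 matrices. The type alias Mat3 and the function names elementwise_mul and matmul are made up for this example and are not part of ndarray:

```rust
type Mat3 = [[f64; 3]; 3];

/// Element-wise product: multiply the entries that share the same position.
fn elementwise_mul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[i][j] = a[i][j] * b[i][j];
        }
    }
    out
}

/// Matrix product: row i of `a` dotted with column j of `b`.
fn matmul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn main() {
    let m: Mat3 = [[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]];
    let id: Mat3 = [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]];
    // Only the diagonal survives the element-wise product...
    println!("{:?}", elementwise_mul(&m, &id));
    // ...while the matrix product with the identity gives back `m`.
    println!("{:?}", matmul(&m, &id));
}
```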
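Of course, ndarray can also handle an array that holds just one element. Note that array![2.] is a 1-D array of length one, not a 0-D array (a true 0-D array would be built with ndarray::arr0), as ndim() confirms: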
println!("\n{}", array![2.]);
println!("Dimensions: {}", array![2.].ndim());
which correctly outputs:
[2]
Dimensions: 1
Likewise, we could go to 3D or more; here is a 4D example:
let arr12 = Array::<i8, _>::ones((2, 3, 2, 2));
println!("\nMULTIDIMENSIONAL\n{}", arr12);
Guessed its output?
MULTIDIMENSIONAL
[[[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]]],
[[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]]]]
It is 2 rows of 2 elements each, repeated 3 times, with that whole block repeated 2 times; just read the shape (2, 3, 2, 2) from right to left to unpack it from the innermost level to the outermost (and vice-versa).
If it is still unclear, don't worry: we are here for the programming more than for the math/stats behind it.
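If it helps, the shape can also be read as a recipe for locating each element in a flat buffer. The sketch below is plain std Rust and assumes a row-major layout (last axis varying fastest); num_elements and flat_index are hypothetical helper names, not ndarray functions:

```rust
/// Number of elements in an array of the given shape.
fn num_elements(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// Flat position of `index` in a row-major layout (last axis varies fastest).
/// Returns None if the index does not fit the shape.
fn flat_index(shape: &[usize], index: &[usize]) -> Option<usize> {
    if shape.len() != index.len() {
        return None;
    }
    let mut flat = 0;
    for (&dim, &i) in shape.iter().zip(index) {
        if i >= dim {
            return None;
        }
        // Each axis scales what came before by its own length.
        flat = flat * dim + i;
    }
    Some(flat)
}

fn main() {
    let shape = [2, 3, 2, 2];
    println!("elements: {}", num_elements(&shape));
    println!("last element sits at {:?}", flat_index(&shape, &[1, 2, 1, 1]));
}
```

For the shape (2, 3, 2, 2) this gives 24 elements, which matches the 24 ones printed above.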
Let's add some randomness to the mess!
We also loaded ndarray-rand into our Cargo.toml, which we briefly described earlier.
This package adds the power of the rand crate (which it re-exports as a sub-module) to your ndarray ecosystem.
In order to see some examples, let's add the following in the use section of our src/main.rs:
use ndarray_rand::{RandomExt, SamplingStrategy};
use ndarray_rand::rand_distr::Uniform;
Then we can get an array of shape (2, 5), for example, filled with floats drawn from a uniform distribution between 0 and 10 (10 excluded):
let arr13 = Array::random((2, 5), Uniform::new(0., 10.));
println!("{:5.2}", arr13);
Which results, for example, in:
[[ 2.04, 0.15, 6.66, 3.06, 0.91],
[ 8.18, 6.08, 6.99, 4.45, 5.27]]
Results will vary at each run, since the distribution is (pseudo)random.
We can also "pick" data from an array (sampling) in the following way:
let arr14 = array![1., 2., 3., 4., 5., 6.];
let arr15 = arr14.sample_axis(Axis(0), 2, SamplingStrategy::WithoutReplacement);
println!("\nSampling from:\t{}\nTwo elements:\t{}", arr14, arr15);
Which may result in:
Sampling from: [1, 2, 3, 4, 5, 6]
Two elements: [4, 2]
Let me show another way of sampling, which involves the use of the rand crate and the creation of an array from a vector.
We first need the following added to the use section:
use ndarray_rand::rand as rand;
use rand::seq::IteratorRandom;
So we use the rand crate as re-exported by ndarray-rand.
Then we can do the following (example in the rand docs, adapted):
let mut rng = rand::thread_rng();
let faces = "😀😎😐😕😠😢";
let arr16 = Array::from_shape_vec((2, 2), faces.chars().choose_multiple(&mut rng, 4)).unwrap();
println!("\nSampling from:\t{}", faces);
println!("Elements:\n{}", arr16);
We first define the thread_rng to be used, then we set a string containing the emoji we want to select from.
Then we create an array from a vector, giving it a shape. The shape we chose is (2, 2), and the vector is created using a method of IteratorRandom, i.e., choose_multiple, which extracts 4 elements (chars) at random from the string.
A possible output:
Sampling from: 😀😎😐😕😠😢
Elements:
[[😀, 😎],
[😢, 😠]]
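Beware though not to over-sample: choose_multiple does not fail when asked for more elements than are available, it simply returns fewer of them.
In that case Array::from_shape_vec, which returns a Result stating whether it could create the array or not, gives back an Err because the vector length no longer matches the shape, and our unwrap panics.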
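The kind of check that sits behind such a Result can be sketched in plain std Rust: the number of elements must equal the product of the shape. The function to_shape below is a hypothetical stand-in written for this illustration, not ndarray's implementation:

```rust
/// Arrange a flat vector as `rows` x `cols`, failing when the length is wrong.
/// Hypothetical std-only stand-in for a shape/length check.
fn to_shape(data: Vec<char>, rows: usize, cols: usize) -> Result<Vec<Vec<char>>, String> {
    if cols == 0 || data.len() != rows * cols {
        return Err(format!(
            "cannot arrange {} elements as {} x {}",
            data.len(),
            rows,
            cols
        ));
    }
    Ok(data.chunks(cols).map(|row| row.to_vec()).collect())
}

fn main() {
    // Only three elements for a 2 x 2 shape: handle the error instead of unwrapping.
    let too_few: Vec<char> = "abc".chars().collect();
    match to_shape(too_few, 2, 2) {
        Ok(grid) => println!("{:?}", grid),
        Err(msg) => println!("error: {}", msg),
    }
}
```

Matching on the Result, as in main above, is a gentler alternative to unwrap when the number of sampled elements is not known in advance.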
Let's do some stats and visualize something, shall we?
Before introducing visualization, let's introduce the crate ndarray-stats, and also the crate noisy_float, which is a must when using ndarray-stats.
First of all, we start with a randomly generated Standard Normal Distribution.
First we replace the earlier rand_distr import with:
use ndarray_rand::rand_distr::{Uniform, StandardNormal};
then:
let arr17 = Array::<f64, _>::random_using((10000,2), StandardNormal, &mut rand::thread_rng());
This way we have a 2D array with 10,000 pairs of values.
Then we add to the use section the imports we need to do statistics:
use ndarray_stats::HistogramExt;
use ndarray_stats::histogram::{strategies::Sqrt, GridBuilder};
use noisy_float::types::{N64, n64};
Now we need to transform each element from a float into a noisy float. I will not go into explaining noisy floats; just consider one as a float that can't silently fail (be a NaN). Besides, this makes it order-able, which is what ndarray-stats needs to create a histogram.
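Why plain floats are not order-able can be seen in std Rust alone: NaN cannot be compared with anything, so f64 only implements PartialOrd. The sketch below illustrates the problem; sort_checked is a hypothetical helper that merely mimics the idea of refusing NaN up front and is not how noisy_float is implemented:

```rust
use std::cmp::Ordering;

/// Sort floats, refusing the input if any value is NaN.
/// Hypothetical helper: it only mimics the "no silent NaN" idea.
fn sort_checked(mut values: Vec<f64>) -> Option<Vec<f64>> {
    if values.iter().any(|v| v.is_nan()) {
        return None;
    }
    // Safe to unwrap: without NaN every pair of floats is comparable.
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    Some(values)
}

fn main() {
    // NaN has no ordering against any number, so the comparison yields None.
    let cmp: Option<Ordering> = f64::NAN.partial_cmp(&1.0);
    println!("NaN vs 1.0: {:?}", cmp);
    println!("{:?}", sort_checked(vec![3.0, 1.0, 2.0]));
    println!("{:?}", sort_checked(vec![3.0, f64::NAN]));
}
```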
In order to perform an operation by value on each element of the ndarray, we will use the function mapv(), which is akin to the standard map() for iterators.
let data = arr17.mapv(|e| n64(e));
At this point, we can create a grid for our histogram (a grid is needed to divide the data into bins); we try to infer the best one using strategies::Sqrt (a strategy used by many programs, including MS Excel):
let grid = GridBuilder::<Sqrt<N64>>::from_array(&data).unwrap().build();
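To give an idea of what a square-root strategy does, here is a one-dimensional sketch in plain std Rust. It assumes the usual textbook rule, number of bins k = ceil(sqrt(n)) with equal-width bins between the minimum and the maximum; the exact rounding and bin edges used by ndarray-stats may differ, and sqrt_histogram is a hypothetical name:

```rust
/// Count values into equal-width bins, with the bin count chosen by the
/// square-root rule: k = ceil(sqrt(n)). Returns an empty Vec for empty input.
fn sqrt_histogram(data: &[f64]) -> Vec<usize> {
    if data.is_empty() {
        return Vec::new();
    }
    let k = (data.len() as f64).sqrt().ceil() as usize;
    let min = data.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = data.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let width = (max - min) / k as f64;
    let mut counts = vec![0; k];
    for &x in data {
        // When all values are equal the width is 0; put everything in bin 0.
        let bin = if width > 0.0 {
            // The maximum would land one past the end, so clamp it to the last bin.
            (((x - min) / width) as usize).min(k - 1)
        } else {
            0
        };
        counts[bin] += 1;
    }
    counts
}

fn main() {
    // 16 evenly spaced values -> 4 bins of 4 values each.
    let data: Vec<f64> = (0..16).map(|i| i as f64).collect();
    println!("{:?}", sqrt_histogram(&data));
}
```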
Now that we have a grid, that is, a way to divide our raw data to prepare our histogram, we can create the histogram itself:
let histogram = data.histogram(grid);
In order to get the underlying counts matrix, we can simply state:
let histogram_matrix = histogram.counts();
The counts matrix just states how many observations fall into each cell of the grid. Since each observation is a pair of values, the grid is two-dimensional, with one axis per coordinate.
Ok, now we have a histogram... but how could we visualize it?
Well, before visualizing our data we should prepare it for visualization.
The problem we face is that we have the counts of a two-dimensional grid, but to plot a histogram we need a single count per bin, meaning we should sum all the cells that share the same bin along one direction.
In order to do so, we sum along Axis(0) of the ndarray:
let data = histogram_matrix.sum_axis(Axis(0));
Now we have a 1D ndarray containing all the sums of the grid. At this point we can establish that each sum is a different bin, and enumerate them. We will transform it all into a vector of tuples, in order to prepare it for the visualization tool, where the first element of the tuple is the bin number and the second is the height of the bin.
// enumerate() yields (index, element): i is the bin number, e the bin height
let his_data: Vec<(f32, f32)> = data.iter().enumerate().map(|(i, e)| (i as f32, *e as f32)).collect();
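The two preparation steps, summing along axis 0 and pairing each total with its bin number, can also be written out by hand. The sketch below uses plain std Rust on a table stored as rows; sum_axis0 and to_points are hypothetical helpers that only mirror the idea, not the ndarray methods themselves:

```rust
/// Sum a 2-D counts table along axis 0, i.e. add the rows together,
/// leaving one total per column.
fn sum_axis0(counts: &[Vec<usize>]) -> Vec<usize> {
    let cols = counts.first().map_or(0, |row| row.len());
    let mut totals = vec![0; cols];
    for row in counts {
        for (total, &c) in totals.iter_mut().zip(row) {
            *total += c;
        }
    }
    totals
}

/// Pair every bin total with its index: (bin number, bin height).
fn to_points(totals: &[usize]) -> Vec<(f32, f32)> {
    totals
        .iter()
        .enumerate()
        .map(|(bin, &height)| (bin as f32, height as f32))
        .collect()
}

fn main() {
    let counts = vec![vec![1, 2, 3], vec![4, 5, 6]];
    let totals = sum_axis0(&counts);
    println!("{:?}", totals);
    println!("{:?}", to_points(&totals));
}
```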
Remember: this is just a mock dataset, based on a pseudorandom generator of a standard normal distribution (i.e., a Gaussian distribution centered on 0.0, with a standard deviation of 1). Still, we should see a rough Gaussian on the histogram.
DataViz
In order to visualize things we will use poloto, which is one of many plotting crates for Rust.
It is a simple one, meaning we do not need many lines of code to have something to see on our screen.
We will not import it in the use section, because we only need a few fully qualified calls. Let me explain how to plot a histogram in three steps:
Step one - create a file to store our graph:
let file = std::fs::File::create("standard_normal_hist.svg").unwrap();
Step two - create a histogram out of the data:
let mut graph = poloto::plot("Histogram", "x", "y");
graph.histogram("Stand.Norm.Dist.", his_data).xmarker(0).ymarker(0);
We create a Plotter object, assigning it a title and a label for each axis.
Then we plot our histogram on it, assigning it a name for the legend ("Stand.Norm.Dist.").
Step three - write the graph on disk:
graph.simple_theme(poloto::upgrade_write(file));
As simple as that!
Let's admire our work of (random) art:
OK, let's try something different: let's view our data as a scatter plot. Since our mock data follows a Standard Normal Distribution, if we have N pairs of coordinates, the scatter plot should look like a cloud centered on the (0, 0) coordinates.
Let's visualize it!
let arr18 = Array::<f64, _>::random_using((300, 2), StandardNormal, &mut rand::thread_rng());
let data: Vec<(f64, f64)> = arr18
    .axis_iter(Axis(0))
    .map(|row| (row[0], row[1])) // each row holds one (x, y) pair
    .collect();
We created 300 pairs of random numbers centered around (0, 0), according to a Standard Normal Distribution.
Then we transformed that array into a Vec<(f64, f64)>, because the poloto library only graphs [f64; 2] or whatever can be converted through AsF64.
We will add also two lines to show the center of our graph:
let x_line = [[-3,0], [3,0]];
let y_line = [[0,-3], [0, 3]];
Next we create a file, plot, and save, just as we did for the histogram:
let file = std::fs::File::create("standard_normal_scatter.svg").unwrap(); // create file on disk
let mut graph = poloto::plot("Scatter Plot", "x", "y"); // create graph
graph.line("", &x_line);
graph.line("", &y_line);
graph.scatter("Stand.Norm.Dist.", data).ymarker(0);
graph.simple_theme(poloto::upgrade_write(file));
That's it! We can admire our random creation now:
Conclusion
I think this should wrap it up for today.
We saw how to use ndarray (in a basic form), and how it differs from Rust arrays and vectors.
We also saw some of its companion crates that complete the ecosystem, providing randomness and some statistical features.
We also saw a way to plot graphs with data, showing how to plot a histogram, a scatter plot, and some lines.
I hope this will be a good starting point to delve deeper into the use of Rust for Data Science.
That's all folks for today, see you next time!
Top comments (6)
You had me worried for a minute. The article only uses 300 points, which produces a disappointing graph. The code in github uses 10,000 points, which is much more satisfying.
let arr18 = Array::<f64, _>::random_using((300, 2), StandardNormal, &mut rand::thread_rng());
😂😉 good catch... in fact I just forgot to update the numbers, but the image refers to the GitHub repo
Small correction: there is a snippet that says use twice.
Edited. Thank you!
Thank you, this is very helpful! Especially guiding through using noisy_float with the histogram!
I’m trying.
Thank ⛩