DEV Community

kojix2


Rumale Cheat Sheet

🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆
This is a Rumale machine learning cheat sheet based on DataCamp's Scikit-learn cheat sheet.

GitHub logo yoshoku / rumale

Rumale is a machine learning library in Ruby

Rumale


Rumale (Ruby machine learning) is a machine learning library in Ruby. Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python. Rumale supports Support Vector Machine, Logistic Regression, Ridge, Lasso, Multi-layer Perceptron, Naive Bayes, Decision Tree, Gradient Tree Boosting, Random Forest, K-Means, Gaussian Mixture Model, DBSCAN, Spectral Clustering, Multidimensional Scaling, t-SNE, Fisher Discriminant Analysis, Neighbourhood Component Analysis, Principal Component Analysis, Non-negative Matrix Factorization, and many other algorithms.

Installation

Add this line to your application's Gemfile:

gem 'rumale'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rumale

Documentation

Usage

Example 1. Pendigits dataset classification

Rumale provides a function for loading dataset files in libsvm format. We start by downloading the pendigits dataset from the LIBSVM Data web site.

$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits.t

The classifier is then trained with a linear SVM and an RBF kernel feature map.

Comparison of classifiers

t-SNE + MNIST

Let's get started!

gem install rumale

A Basic Example

require 'rumale'

ruby_labels = label_array
#  [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
ruby_samples = sample_array
#  [[samples_1], [samples_2], [samples_3], .. [samples_n]]

# Convert to NArray.
labels = Numo::Int32.cast(ruby_labels)
samples = Numo::DFloat.cast(ruby_samples)

# Preprocessing The Data
# Encoding Categorical Features, Normalization, etc.

# Create Your Model
model = Rumale::NearestNeighbors::KNeighborsClassifier.new
model.fit(samples, labels)

# Prediction
model.predict(new_samples)

# Evaluation
puts model.score(test_samples, test_labels)

Loading The Data

Convert Ruby Array to NArray.

labels = Numo::Int32[*ruby_array]
# labels = Numo::Int32.cast(ruby_array)
# labels = Numo::Int32.asarray(ruby_array)

samples = Numo::DFloat[*ruby_array]
# samples = Numo::DFloat.cast(ruby_array)
# samples = Numo::DFloat.asarray(ruby_array)

Libsvm file.

# Load the training dataset.
samples, labels = Rumale::Dataset.load_libsvm_file('pendigits')

Preprocessing The Data

Standardization

Preprocessing

normalizer = Rumale::Preprocessing::StandardScaler.new
new_training_samples = normalizer.fit_transform(training_samples)
new_testing_samples = normalizer.transform(testing_samples)
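What standardization computes can be sketched in plain Ruby. The helpers `column_stats` and `standardize` are hypothetical names used only for illustration, and the population standard deviation is an assumption; the library's scaler may differ in detail. The point is that the statistics come from the training samples and are reused for the testing samples.

```ruby
# Plain-Ruby sketch of standardization; column_stats and standardize are
# hypothetical helpers, not part of Rumale.

# Column-wise mean and (population) standard deviation of an Array of rows.
def column_stats(rows)
  n = rows.size.to_f
  columns = rows.transpose
  means = columns.map { |col| col.sum / n }
  stds = columns.each_with_index.map do |col, j|
    Math.sqrt(col.sum { |v| (v - means[j])**2 } / n)
  end
  [means, stds]
end

# Rescale rows with statistics computed elsewhere (usually the training set).
# Constant columns (zero deviation) are mapped to 0.0 to avoid dividing by zero.
def standardize(rows, means, stds)
  rows.map do |row|
    row.each_with_index.map do |v, j|
      stds[j].zero? ? 0.0 : (v - means[j]) / stds[j]
    end
  end
end

training = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
testing  = [[2.0, 40.0]]

means, stds = column_stats(training)
p standardize(training, means, stds)
p standardize(testing, means, stds) # testing data reuses the training statistics
```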

Normalization

Preprocessing

normalizer = Rumale::Preprocessing::L2Normalizer.new
new_samples = normalizer.fit_transform(samples)
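As a rough illustration of L2 normalization, each sample (row) is divided by its Euclidean length so that it ends up with unit norm. `l2_normalize` is a hypothetical plain-Ruby helper, and leaving all-zero rows unchanged is an assumption of this sketch.

```ruby
# Plain-Ruby sketch of row-wise L2 normalization (hypothetical helper).
def l2_normalize(rows)
  rows.map do |row|
    norm = Math.sqrt(row.sum { |v| v * v })
    # An all-zero row has no direction, so it is returned as-is.
    norm.zero? ? row : row.map { |v| v / norm }
  end
end

p l2_normalize([[3.0, 4.0], [0.0, 0.0]])
```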

Binarization

Preprocessing

# Compute the mask once, before any value is overwritten.
mask = na >= thresh
na[mask] = 1
na[~mask] = 0
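The same thresholding rule, written for a plain Ruby Array as a minimal sketch (`binarize` is a hypothetical helper): values at or above the threshold become 1, everything else becomes 0.

```ruby
# Plain-Ruby sketch of binarization (hypothetical helper).
def binarize(values, thresh)
  values.map { |v| v >= thresh ? 1 : 0 }
end

p binarize([0.2, 1.5, 3.0, 7.5], 3.0)
```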

Encoding Categorical Features

Preprocessing

encoder = Rumale::Preprocessing::LabelEncoder.new
labels = Numo::Int32[1, 8, 8, 15, 0]
encoded_labels = encoder.fit_transform(labels)
# => Numo::Int32#shape=[5]
# [1, 2, 2, 3, 0]
decoded_labels = encoder.inverse_transform(encoded_labels)
# => [1, 8, 8, 15, 0]

encoder = Rumale::Preprocessing::LabelEncoder.new
labels = ["A", "B", "B", "A", "C", "C"]
encoded_labels = encoder.fit_transform(labels)
# => Numo::Int32#shape=[6]
# [0, 1, 1, 0, 2, 2]
decoded_labels = encoder.inverse_transform(encoded_labels)
# => ["A", "B", "B", "A", "C", "C"]
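The outputs above are consistent with a simple scheme: collect the distinct labels, sort them, and use each label's position as its code. A plain-Ruby sketch of that idea follows; the helper names are hypothetical and this is not the library's implementation.

```ruby
# Plain-Ruby sketch of label encoding (hypothetical helpers).

# Sorted list of distinct labels; the position of a label is its code.
def fit_label_encoder(labels)
  labels.uniq.sort
end

def encode_labels(labels, classes)
  labels.map { |label| classes.index(label) }
end

def decode_labels(codes, classes)
  codes.map { |code| classes[code] }
end

classes = fit_label_encoder([1, 8, 8, 15, 0])
codes = encode_labels([1, 8, 8, 15, 0], classes)
p codes
p decode_labels(codes, classes)
```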

One-hot-encoding

Preprocessing

encoder = Rumale::Preprocessing::OneHotEncoder.new
labels = Numo::Int32[0, 0, 2, 3, 2, 1]
one_hot_vectors = encoder.fit_transform(labels)
# => Numo::DFloat#shape=[6,4]
# [[1, 0, 0, 0], 
#  [1, 0, 0, 0], 
#  [0, 0, 1, 0], 
#  [0, 0, 0, 1], 
#  [0, 0, 1, 0], 
#  [0, 1, 0, 0]]
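For intuition, one-hot encoding turns label `l` into a vector with a 1 in column `l` and 0 elsewhere. The plain-Ruby sketch below assumes labels are non-negative integers starting at 0; `one_hot` is a hypothetical helper.

```ruby
# Plain-Ruby sketch of one-hot encoding (hypothetical helper).
# Assumes integer labels in the range 0..labels.max.
def one_hot(labels)
  n_classes = labels.max + 1
  labels.map do |label|
    Array.new(n_classes) { |j| j == label ? 1 : 0 }
  end
end

one_hot([0, 0, 2, 3, 2, 1]).each { |row| p row }
```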

Imputing Missing Values

Preprocessing

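# Treat zeros as missing values and mark them as NaN.
idx = narray.eq(0).where
narray[idx] = Float::NAN
# Column means, ignoring NaN.
mean = narray.mean(axis: 0, nan: true)
# Column index of every element (flat index modulo the number of columns).
axis = narray.new_narray.seq % narray.shape[1]
# Replace each missing value with the mean of its column.
narray[idx] = mean[axis[idx]]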
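The same idea in plain Ruby, as a sketch: entries equal to a marker value (0 by default, matching the snippet above) are treated as missing and replaced by the mean of the remaining entries in their column. `impute_column_means` is a hypothetical helper, and falling back to the marker for a column with no observed values is an assumption.

```ruby
# Plain-Ruby sketch of mean imputation (hypothetical helper).
def impute_column_means(rows, missing: 0)
  means = rows.transpose.map do |col|
    present = col.reject { |v| v == missing }
    # A column with no observed values keeps the marker value.
    present.empty? ? missing.to_f : present.sum.to_f / present.size
  end
  rows.map do |row|
    row.each_with_index.map { |v, j| v == missing ? means[j] : v }
  end
end

p impute_column_means([[1.0, 0.0], [3.0, 4.0], [0.0, 6.0]])
```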

Create Your Model

Supervised Learning Estimators

k-NN (k-Nearest Neighbors)

Supervised, Classifier, Regressor

Rumale::NearestNeighbors::KNeighborsClassifier.new(n_neighbors: 5) 
Rumale::NearestNeighbors::KNeighborsRegressor.new(n_neighbors: 5) 
  • n_neighbors : The number of neighbors.
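To show what `n_neighbors` controls, here is a minimal plain-Ruby sketch of k-NN classification: find the `n_neighbors` training samples closest to a query (squared Euclidean distance is assumed here) and take a majority vote over their labels. `knn_predict` is a hypothetical helper, not the library's implementation.

```ruby
# Plain-Ruby sketch of k-NN classification (hypothetical helper).
def knn_predict(samples, labels, query, n_neighbors: 5)
  # Squared Euclidean distance from the query to every training sample.
  distances = samples.each_with_index.map do |sample, i|
    [sample.zip(query).sum { |a, b| (a - b)**2 }, labels[i]]
  end
  # Keep the n_neighbors closest samples and take a majority vote.
  nearest = distances.min_by(n_neighbors) { |dist, _label| dist }
  nearest.map { |_dist, label| label }
         .group_by(&:itself)
         .max_by { |_label, votes| votes.size }
         .first
end

samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels  = [0, 0, 0, 1, 1, 1]

p knn_predict(samples, labels, [0.2, 0.2], n_neighbors: 3)
p knn_predict(samples, labels, [4.9, 4.9], n_neighbors: 3)
```

A larger `n_neighbors` smooths the decision boundary at the cost of blurring small clusters.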

Linear Regression

Supervised, Regressor

Rumale::LinearModel::LinearRegression.new(
  fit_bias:       # (Boolean) The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) The scale of the bias term.
  max_iter:       # (Integer) The maximum number of iterations.
  batch_size:     # (Integer) The size of the mini batches.
  optimizer:      # (Optimizer) The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) The seed value used to initialize the random generator.
)

Available optimizers: AdaGrad, Adam, Nadam, RMSProp, SGD, YellowFin

Ridge Regression

Supervised, Regressor

L2 regularization

Rumale::LinearModel::Ridge.new(
  reg_param:      # (Float) The regularization parameter.
  fit_bias:       # (Boolean) The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) The scale of the bias term.
  max_iter:       # (Integer) The maximum number of iterations.
  batch_size:     # (Integer) The size of the mini batches.
  optimizer:      # (Optimizer) The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) The seed value used to initialize the random generator.
)

Lasso Regression

Supervised, Regressor

L1 regularization

Rumale::LinearModel::Lasso.new(
  reg_param:      # (Float) The regularization parameter.
  fit_bias:       # (Boolean) The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) The scale of the bias term.
  max_iter:       # (Integer) The maximum number of iterations.
  batch_size:     # (Integer) The size of the mini batches.
  optimizer:      # (Optimizer) The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) The seed value used to initialize the random generator.
)

Logistic Regression

Supervised, Classifier

Rumale::LinearModel::LogisticRegression.new(
  reg_param:      # (Float) The regularization parameter.
  fit_bias:       # (Boolean) The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) The scale of the bias term. If fit_bias is true, the feature vector v becomes [v; bias_scale].
  max_iter:       # (Integer) The maximum number of iterations.
  batch_size:     # (Integer) The size of the mini batches.
  optimizer:      # (Optimizer) The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) The seed value used to initialize the random generator.
)

Support Vector Machine

Supervised, Classifier

svc = Rumale::LinearModel::SVC.new(
  reg_param:      # (Float) The regularization parameter.
  fit_bias:       # (Boolean) The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) The scale of the bias term.
  max_iter:       # (Integer) The maximum number of iterations.
  batch_size:     # (Integer) The size of the mini batches.
  probability:    # (Boolean) The flag indicating whether to perform probability estimation.
  optimizer:      # (Optimizer) The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) The seed value used to initialize the random generator.
)

Naive Bayes

Supervised, Classifier

  • GaussianNB
  • BernoulliNB
  • MultinomialNB
Rumale::NaiveBayes::GaussianNB.new
Rumale::NaiveBayes::BernoulliNB.new(smoothing_param: 1.0, bin_threshold: 0.0)
Rumale::NaiveBayes::MultinomialNB.new(smoothing_param: 1.0)

Decision Tree

Supervised, Classifier, Regressor

Rumale::Tree::DecisionTreeClassifier.new(
  criterion:         # (String) The function to evaluate splitting point. Supported criteria are 'gini' and 'entropy'.
  max_depth:         # (Integer) The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) The minimum number of samples at a leaf node.
  max_features:      # (Integer) The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Rumale::Tree::DecisionTreeRegressor.new(
  criterion:         # (String) The function to evaluate splitting point. Supported criteria are 'mae' and 'mse'.
  max_depth:         # (Integer) The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) The minimum number of samples at a leaf node.
  max_features:      # (Integer) The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

ExtraTree

Random Forest

Supervised, Classifier, Regressor

Rumale::Ensemble::RandomForestClassifier.new(
  n_estimators:      # (Integer) The number of decision trees for constructing the random forest.
  criterion:         # (String) The function to evaluate splitting point. Supported criteria are 'gini' and 'entropy'.
  max_depth:         # (Integer) The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) The minimum number of samples at a leaf node.
  max_features:      # (Integer) The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Rumale::Ensemble::RandomForestRegressor.new(
  n_estimators:      # (Integer) The number of decision trees for constructing the random forest.
  criterion:         # (String) The function to evaluate splitting point. Supported criteria are 'mae' and 'mse'.
  max_depth:         # (Integer) The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) The minimum number of samples at a leaf node.
  max_features:      # (Integer) The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

AdaBoost (Adaptive Boosting)

Supervised, Classifier, Regressor

Rumale::Ensemble::AdaBoostClassifier.new(
  n_estimators:      # (Integer) The number of decision trees for constructing the AdaBoost classifier.
  criterion:         # (String) The function to evaluate splitting point. Supported criteria are 'gini' and 'entropy'.
  max_depth:         # (Integer) The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) The minimum number of samples at a leaf node.
  max_features:      # (Integer) The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Rumale::Ensemble::AdaBoostRegressor.new(
  n_estimators:      # (Integer) The number of decision trees for constructing the AdaBoost regressor.
  threshold:         # (Float) The threshold for delimiting correct and incorrect predictions. It is constrained to [0, 1].
  exponent:          # (Float) The exponent for the weight of each weak learner.
  criterion:         # (String) The function to evaluate splitting point. Supported criteria are 'mae' and 'mse'.
  max_depth:         # (Integer) The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) The minimum number of samples at a leaf node.
  max_features:      # (Integer) The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Unsupervised Learning Estimators

PCA (Principal component analysis)

Unsupervised, Transformer

Rumale::Decomposition::PCA.new(
  n_components:    # (Integer) The number of principal components.
  max_iter:        # (Integer) The maximum number of iterations.
  tol:             # (Float) The tolerance of termination criterion.
  random_seed:     # (Integer) The seed value used to initialize the random generator.
)

NMF (Non-negative matrix factorization)

Unsupervised, Transformer

Rumale::Decomposition::NMF.new(
  n_components:    # (Integer) The number of components.
  max_iter:        # (Integer) The maximum number of iterations.
  tol:             # (Float) The tolerance of termination criterion.
  eps:             # (Float) A small value close to zero to avoid zero division error.
  random_seed:     # (Integer) The seed value used to initialize the random generator.
)

t-SNE (T-distributed Stochastic Neighbor Embedding)

Unsupervised, Transformer

Rumale::Manifold::TSNE.new(
  n_components: # (Integer) The number of dimensions on representation space.
  perplexity:   # (Float) The effective number of neighbors for each point. Perplexity is typically set between 5 and 50.
  metric:       # (String) The metric to calculate the distances in original space. If metric is 'euclidean', Euclidean distance is calculated for distance in original space. If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
  init:         # (String) The method to initialize the representation space. If init is 'random', the representation space is initialized with normal random variables. If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
  max_iter:     # (Integer) The maximum number of iterations.
  tol:          # (Float) The tolerance of KL-divergence for terminating optimization. If tol is nil, it does not use KL divergence as a criterion for terminating the optimization.
  verbose:      # (Boolean) The flag indicating whether to output KL divergence during iteration.
  random_seed:  # (Integer) The seed value used to initialize the random generator.
)

KMeans clustering

Unsupervised, Clustering

Rumale::Clustering::KMeans.new(
  n_clusters:      # (Integer) The number of clusters.
  init:            # (String) The initialization method for centroids ('random' or 'k-means++').
  max_iter:        # (Integer) The maximum number of iterations.
  tol:             # (Float) The tolerance of termination criterion.
  random_seed:     # (Integer) The seed value used to initialize the random generator.
)

DBSCAN (Density-based spatial clustering of applications with noise)

Unsupervised, Clustering

Rumale::Clustering::DBSCAN.new(
  eps:             # (Float) The radius of neighborhood.
  min_samples:     # (Integer) The number of neighbor samples to be used for the criterion whether a point is a core point.
)

Model Fitting

model.fit(samples, labels)    # supervised estimators
model.fit(samples)            # unsupervised estimators
model.fit_transform(samples)  # transformers: fit, then return the transformed samples

Prediction

y_pred = model.predict(samples)

Evaluate Model’s Performance

Classification Metrics

Accuracy Score

Evaluation

evaluator = Rumale::EvaluationMeasure::Accuracy.new
puts evaluator.score(ground_truth, predicted)
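Accuracy is the fraction of predictions that match the ground truth. A plain-Ruby sketch of the computation (`accuracy` is a hypothetical helper operating on Ruby Arrays):

```ruby
# Plain-Ruby sketch of the accuracy score (hypothetical helper).
def accuracy(ground_truth, predicted)
  correct = ground_truth.zip(predicted).count { |truth, pred| truth == pred }
  correct.to_f / ground_truth.size
end

puts accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```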

Regression Metrics

Mean Absolute Error, MAE

Evaluation

evaluator = Rumale::EvaluationMeasure::MeanAbsoluteError.new
puts evaluator.score(ground_truth, predicted)

Mean Squared Error

Evaluation

evaluator = Rumale::EvaluationMeasure::MeanSquaredError.new
puts evaluator.score(ground_truth, predicted)

R2 Score (coefficient of determination)

Evaluation

evaluator = Rumale::EvaluationMeasure::R2Score.new
puts evaluator.score(ground_truth, predicted)
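For reference, the three regression metrics can be sketched in plain Ruby from their textbook definitions. The helper names are hypothetical, and `r2_score` assumes the ground truth is not constant (otherwise the denominator is zero).

```ruby
# Plain-Ruby sketches of the regression metrics (hypothetical helpers).

# Mean of |truth - prediction|.
def mean_absolute_error(y_true, y_pred)
  y_true.zip(y_pred).sum { |t, y| (t - y).abs } / y_true.size.to_f
end

# Mean of (truth - prediction)^2.
def mean_squared_error(y_true, y_pred)
  y_true.zip(y_pred).sum { |t, y| (t - y)**2 } / y_true.size.to_f
end

# 1 - (residual sum of squares / total sum of squares).
def r2_score(y_true, y_pred)
  mean = y_true.sum / y_true.size.to_f
  ss_res = y_true.zip(y_pred).sum { |t, y| (t - y)**2 }
  ss_tot = y_true.sum { |t| (t - mean)**2 }
  1.0 - ss_res.to_f / ss_tot
end

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
puts mean_absolute_error(y_true, y_pred)
puts mean_squared_error(y_true, y_pred)
puts r2_score(y_true, y_pred)
```

Lower is better for MAE and MSE; R2 is at most 1 and can be negative when the predictions are worse than always predicting the mean.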

Clustering Metrics

Adjusted Rand Index

Evaluation

evaluator = Rumale::EvaluationMeasure::AdjustedRandScore.new
puts evaluator.score(ground_truth, predicted)

Cross-Validation

Evaluation

svc = Rumale::LinearModel::SVC.new
kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)
cv = Rumale::ModelSelection::CrossValidation.new(estimator: svc, splitter: kf)
report = cv.perform(samples, labels)
mean_test_score = report[:test_score].inject(:+) / kf.n_splits
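A stratified split keeps the class proportions roughly equal across folds. The plain-Ruby sketch below deals the sample indices of each class round-robin into `n_splits` folds and averages a list of per-fold scores. `stratified_folds` and `mean_score` are hypothetical helpers, and the sketch does not shuffle, which a real splitter may do.

```ruby
# Plain-Ruby sketch of stratified k-fold splitting (hypothetical helpers).

# Returns n_splits arrays of sample indices; each array is one test fold.
def stratified_folds(labels, n_splits)
  folds = Array.new(n_splits) { [] }
  labels.each_index.group_by { |i| labels[i] }.each_value do |indices|
    # Deal the indices of one class round-robin across the folds.
    indices.each_with_index { |idx, pos| folds[pos % n_splits] << idx }
  end
  folds
end

# Average of per-fold scores, as done with report[:test_score] above.
def mean_score(scores)
  scores.sum / scores.size.to_f
end

labels = [0, 0, 0, 0, 1, 1, 1, 1]
stratified_folds(labels, 2).each_with_index do |test_idx, k|
  train_idx = (0...labels.size).to_a - test_idx
  puts "fold #{k}: train=#{train_idx.inspect} test=#{test_idx.inspect}"
end
puts mean_score([0.8, 0.9, 1.0])
```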

Tune Your Model

Grid Search

Tuning

rfc = Rumale::Ensemble::RandomForestClassifier.new(random_seed: 1)

pg = { n_estimators: [5, 10], max_depth: [3, 5], max_leaf_nodes: [15, 31] }

kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)

gs = Rumale::ModelSelection::GridSearchCV.new(estimator: rfc, param_grid: pg, splitter: kf)
gs.fit(samples, labels)

p gs.cv_results
p gs.best_params
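A grid search evaluates every combination of the values in the parameter grid. How such a grid expands can be sketched in plain Ruby; `expand_grid` is a hypothetical helper and is not how the library is claimed to enumerate candidates internally.

```ruby
# Plain-Ruby sketch of expanding a parameter grid (hypothetical helper).
# Returns one Hash per combination of parameter values.
def expand_grid(param_grid)
  return [{}] if param_grid.empty?

  names = param_grid.keys
  first, *rest = param_grid.values
  first.product(*rest).map { |values| names.zip(values).to_h }
end

pg = { n_estimators: [5, 10], max_depth: [3, 5], max_leaf_nodes: [15, 31] }
candidates = expand_grid(pg)
puts candidates.size
candidates.each { |params| p params }
```

With two values for each of three parameters and 5-fold cross-validation, that is 8 candidates and 40 model fits, so the grid size is worth checking before searching.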

Grid search with pipeline

Tuning

rbf = Rumale::KernelApproximation::RBF.new(random_seed: 1)
svc = Rumale::LinearModel::SVC.new(random_seed: 1)
pipe = Rumale::Pipeline::Pipeline.new(steps: { rbf: rbf, svc: svc })

pg = { rbf__gamma: [32.0, 1.0], rbf__n_components: [4, 128], svc__reg_param: [16.0, 0.1] }

kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)

gs = Rumale::ModelSelection::GridSearchCV.new(estimator: pipe, param_grid: pg, splitter: kf)
gs.fit(samples, labels)

p gs.cv_results
p gs.best_params

Pipeline

Sequentially apply a list of transforms and a final estimator.

rbf = Rumale::KernelApproximation::RBF.new(gamma: 1.0, n_components: 128, random_seed: 1)
svc = Rumale::LinearModel::SVC.new(reg_param: 1.0, fit_bias: true, max_iter: 5000, random_seed: 1)

pipeline = Rumale::Pipeline::Pipeline.new(steps: { trs: rbf, est: svc })
pipeline.fit(training_samples, training_labels)

results = pipeline.predict(testing_samples)
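The pipeline idea itself is small enough to sketch in plain Ruby: every step but the last is fitted and applied in order, and the last step is the estimator. All classes below (`MeanCenterer`, `SignClassifier`, `MiniPipeline`) are hypothetical toys for illustration and say nothing about how the library implements its pipeline.

```ruby
# Hypothetical transformer: subtracts the column means learned in fit.
class MeanCenterer
  def fit(x, _y = nil)
    @means = x.transpose.map { |col| col.sum / col.size.to_f }
    self
  end

  def transform(x)
    x.map { |row| row.each_with_index.map { |v, j| v - @means[j] } }
  end
end

# Hypothetical estimator: predicts 1 when the first feature is positive.
class SignClassifier
  def fit(_x, _y)
    self
  end

  def predict(x)
    x.map { |row| row.first.positive? ? 1 : 0 }
  end
end

# Toy pipeline: all steps except the last are transformers.
class MiniPipeline
  def initialize(steps:)
    @transformers = steps.values[0...-1]
    @estimator = steps.values.last
  end

  # Fit each transformer on the output of the previous one, then the estimator.
  def fit(x, y)
    transformed = @transformers.reduce(x) { |data, t| t.fit(data, y).transform(data) }
    @estimator.fit(transformed, y)
    self
  end

  # Apply the already fitted transformers, then predict.
  def predict(x)
    transformed = @transformers.reduce(x) { |data, t| t.transform(data) }
    @estimator.predict(transformed)
  end
end

pipeline = MiniPipeline.new(steps: { trs: MeanCenterer.new, est: SignClassifier.new })
pipeline.fit([[1.0], [3.0]], [0, 1])
p pipeline.predict([[0.0], [4.0]])
```

Because the transformers are fitted only inside `fit`, statistics learned from the training samples are reused unchanged at prediction time.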

References

The duck logo

Ugly duckling theorem (Wikipedia) [1]


  1. https://github.com/yoshoku/rumale/issues/4#issuecomment-483495559 
