🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆
This is a Rumale machine learning cheat sheet based on DataCamp's Scikit-learn cheat sheet.
Rumale
Rumale (Ruby machine learning) is a machine learning library in Ruby. Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python. Rumale supports Support Vector Machine, Logistic Regression, Ridge, Lasso, Multi-layer Perceptron, Naive Bayes, Decision Tree, Gradient Tree Boosting, Random Forest, K-Means, Gaussian Mixture Model, DBSCAN, Spectral Clustering, Multidimensional Scaling, t-SNE, Fisher Discriminant Analysis, Neighbourhood Component Analysis, Principal Component Analysis, Non-negative Matrix Factorization, and many other algorithms.
Installation
Add this line to your application's Gemfile:
gem 'rumale'
And then execute:
$ bundle
Or install it yourself as:
$ gem install rumale
Documentation
Usage
Example 1. Pendigits dataset classification
Rumale provides a function for loading dataset files in libsvm format. We start by downloading the pendigits dataset from the LIBSVM Data web site.
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits.t
The classifier is then trained with a linear SVM on an RBF kernel feature map.
Let's get started!
gem install rumale
A Basic Example
require 'rumale'
ruby_labels = label_array
# [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
ruby_samples = sample_array
# [[samples_1], [samples_2], [samples_3], .. [samples_n]]
# Convert to NArray.
labels = Numo::Int32.cast(ruby_labels)
samples = Numo::DFloat.cast(ruby_samples)
# Preprocessing The Data
# Encoding Categorical Features, Normalization, etc.
# Create Your Model
model = Rumale::NearestNeighbors::KNeighborsClassifier.new
model.fit(samples, labels)
# Prediction
model.predict(new_samples)
# Evaluation
puts model.score(test_samples, test_labels)
Loading The Data
Convert Ruby Array to NArray.
labels = Numo::Int32[*ruby_array]
# labels = Numo::Int32.cast(ruby_array)
# labels = Numo::Int32.asarray(ruby_array)
samples = Numo::DFloat[*ruby_array]
# samples = Numo::DFloat.cast(ruby_array)
# samples = Numo::DFloat.asarray(ruby_array)
Libsvm file.
# Load the training dataset.
samples, labels = Rumale::Dataset.load_libsvm_file('pendigits')
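For reference, each line of a libsvm-format file holds a label followed by sparse `index:value` pairs with 1-based indices. The plain-Ruby sketch below parses such text into dense rows. `parse_libsvm` and the sample text are hypothetical and only illustrate the format; this is not Rumale's loader, which returns Numo arrays, and it assumes integer class labels.

```ruby
# Hypothetical helper (not part of Rumale): parse libsvm-format text
# into dense sample rows and integer labels.
def parse_libsvm(text, n_features)
  samples = []
  labels = []
  text.each_line do |line|
    tokens = line.split
    next if tokens.empty?

    labels << Integer(tokens.shift)
    row = Array.new(n_features, 0.0) # features not listed stay at 0.0
    tokens.each do |pair|
      index, value = pair.split(':')
      row[Integer(index) - 1] = Float(value) # libsvm indices are 1-based
    end
    samples << row
  end
  [samples, labels]
end

# Made-up two-line input, just to show the shape of the result.
text = "8 1:47 2:100 4:26\n2 2:89 3:27\n"
samples, labels = parse_libsvm(text, 4)
p labels  # [8, 2]
p samples # [[47.0, 100.0, 0.0, 26.0], [0.0, 89.0, 27.0, 0.0]]
```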
Preprocessing The Data
Standardization
normalizer = Rumale::Preprocessing::StandardScaler.new
new_training_samples = normalizer.fit_transform(training_samples)
new_testing_samples = normalizer.transform(testing_samples)
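To make the fit/transform split concrete, here is a plain-Ruby sketch of what standardization computes: the column means and standard deviations are estimated on the training data only, then reused for the test data. The helper names are hypothetical, the population standard deviation is an assumption (Rumale's own choice may differ), and the sketch assumes no column is constant.

```ruby
# Hypothetical helpers (not part of Rumale): column-wise statistics
# of row-major data, using the population standard deviation.
def column_stats(rows)
  n = rows.size.to_f
  columns = rows.transpose
  means = columns.map { |col| col.sum / n }
  stds = columns.each_with_index.map do |col, j|
    Math.sqrt(col.sum { |v| (v - means[j])**2 } / n)
  end
  [means, stds]
end

# Apply previously fitted statistics to any data set.
def standardize(rows, means, stds)
  rows.map do |row|
    row.each_with_index.map { |v, j| (v - means[j]) / stds[j] }
  end
end

training = [[1.0, 10.0], [3.0, 30.0]]
testing = [[2.0, 40.0]]
means, stds = column_stats(training)          # "fit" on training data only
scaled_training = standardize(training, means, stds)
scaled_testing = standardize(testing, means, stds) # reuse the same statistics
p scaled_training # [[-1.0, -1.0], [1.0, 1.0]]
p scaled_testing  # [[0.0, 2.0]]
```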
Normalization
normalizer = Rumale::Preprocessing::L2Normalizer.new
new_samples = normalizer.fit_transform(samples)
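As a rough illustration of L2 normalization, the plain-Ruby sketch below rescales each sample (row) to unit Euclidean length. `l2_normalize` is a hypothetical helper, not Rumale's implementation, and it assumes no row is all zeros.

```ruby
# Hypothetical helper (not part of Rumale): scale every row to unit L2 norm.
def l2_normalize(rows)
  rows.map do |row|
    norm = Math.sqrt(row.sum { |v| v * v })
    row.map { |v| v / norm }
  end
end

normalized = l2_normalize([[3.0, 4.0], [0.0, 2.0]])
p normalized # [[0.6, 0.8], [0.0, 1.0]]
```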
Binarization
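above = na >= thresh # build both masks first, so the 1s written below are not compared with thresh again
below = na < thresh
na[above] = 1
na[below] = 0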
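The same idea in plain Ruby, as a minimal sketch: every element is compared with the threshold exactly once, using its original value. `binarize` is a hypothetical helper that works on nested Ruby arrays rather than NArray.

```ruby
# Hypothetical helper: map each value to 1 if it reaches the threshold, else 0.
def binarize(rows, thresh)
  rows.map { |row| row.map { |v| v >= thresh ? 1 : 0 } }
end

binary = binarize([[0.2, 5.0], [7.5, 3.0]], 4.0)
p binary # [[0, 1], [1, 0]]
```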
Encoding Categorical Features
encoder = Rumale::Preprocessing::LabelEncoder.new
labels = Numo::Int32[1, 8, 8, 15, 0]
encoded_labels = encoder.fit_transform(labels)
# => Numo::Int32#shape=[5]
# [1, 2, 2, 3, 0]
decoded_labels = encoder.inverse_transform(encoded_labels)
# => [1, 8, 8, 15, 0]
encoder = Rumale::Preprocessing::LabelEncoder.new
labels = ["A", "B", "B", "A", "C", "C"]
encoded_labels = encoder.fit_transform(labels)
# => Numo::Int32#shape=[6]
# [0, 1, 1, 0, 2, 2]
decoded_labels = encoder.inverse_transform(encoded_labels)
# => ["A", "B", "B", "A", "C", "C"]
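The mapping in the examples above is consistent with sorting the distinct labels and using each label's position as its code. The plain-Ruby sketch below reproduces that behaviour; `SimpleLabelEncoder` is a hypothetical class for illustration and makes no claim about how Rumale implements its encoder.

```ruby
# Hypothetical stand-in for a label encoder: codes are positions
# in the sorted list of distinct labels.
class SimpleLabelEncoder
  attr_reader :classes

  def fit(labels)
    @classes = labels.uniq.sort
    self
  end

  def transform(labels)
    labels.map { |label| @classes.index(label) }
  end

  def inverse_transform(codes)
    codes.map { |code| @classes[code] }
  end
end

labels = [1, 8, 8, 15, 0]
encoder = SimpleLabelEncoder.new.fit(labels)
encoded = encoder.transform(labels)
decoded = encoder.inverse_transform(encoded)
p encoded # [1, 2, 2, 3, 0]
p decoded # [1, 8, 8, 15, 0]
```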
One-hot encoding
encoder = Rumale::Preprocessing::OneHotEncoder.new
labels = Numo::Int32[0, 0, 2, 3, 2, 1]
one_hot_vectors = encoder.fit_transform(labels)
# => Numo::DFloat#shape=[6,4]
# [[1, 0, 0, 0],
# [1, 0, 0, 0],
# [0, 0, 1, 0],
# [0, 0, 0, 1],
# [0, 0, 1, 0],
# [0, 1, 0, 0]]
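A minimal plain-Ruby sketch of the same transformation, assuming the labels are non-negative integers starting at 0: each label becomes a row with a single 1 in the column matching its value. `one_hot` is a hypothetical helper and returns integer rows rather than a Numo::DFloat matrix.

```ruby
# Hypothetical helper: one row per label, with a 1 in the label's column.
def one_hot(labels)
  n_classes = labels.max + 1
  labels.map do |label|
    row = Array.new(n_classes, 0)
    row[label] = 1
    row
  end
end

vectors = one_hot([0, 0, 2, 3, 2, 1])
vectors.each { |row| p row }
```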
Imputing Missing Values
idx = narray.eq(0).where
narray[idx] = Float::NAN
mean = narray.mean(axis:0, nan:true)
axis = narray.new_narray.seq % narray.shape[1]
narray[idx] = mean[axis[idx]]
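The snippet above treats zeros as missing and fills them with the mean of the remaining values in the same column. Here is the same logic as a plain-Ruby sketch on nested arrays; `impute_zeros_with_column_mean` is a hypothetical helper, and it assumes every column has at least one non-zero value.

```ruby
# Hypothetical helper: replace zeros with the mean of the non-zero
# values observed in the same column.
def impute_zeros_with_column_mean(rows)
  means = rows.transpose.map do |col|
    observed = col.reject(&:zero?)
    observed.sum / observed.size
  end
  rows.map do |row|
    row.each_with_index.map { |v, j| v.zero? ? means[j] : v }
  end
end

data = [[1.0, 0.0], [3.0, 4.0], [0.0, 8.0]]
imputed = impute_zeros_with_column_mean(data)
p imputed # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```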
Create Your Model
Supervised Learning Estimators
k-NN (k-Nearest Neighbors)
Rumale::NearestNeighbors::KNeighborsClassifier.new(n_neighbors: 5)
Rumale::NearestNeighbors::KNeighborsRegressor.new(n_neighbors: 5)
- n_neighbors : The number of neighbors.
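To show what `n_neighbors` controls, here is a plain-Ruby sketch of k-NN classification: the query is assigned the most common label among its `n_neighbors` closest training samples. `knn_predict` is a hypothetical helper using Euclidean distance and a simple majority vote; tie handling and other details in Rumale may differ.

```ruby
# Hypothetical helper: classify one query point by majority vote
# among its n_neighbors nearest training samples (Euclidean distance).
def knn_predict(samples, labels, query, n_neighbors)
  distances = samples.each_with_index.map do |sample, i|
    dist = Math.sqrt(sample.zip(query).sum { |a, b| (a - b)**2 })
    [dist, labels[i]]
  end
  nearest = distances.min_by(n_neighbors) { |dist, _label| dist }
  votes = nearest.group_by { |_dist, label| label }
  votes.max_by { |_label, group| group.size }.first
end

samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 0, 1, 1]
p knn_predict(samples, labels, [0.5, 0.5], 3) # 0
p knn_predict(samples, labels, [5.0, 5.5], 3) # 1
```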
Linear Regression
Rumale::LinearModel::LinearRegression.new(
fit_bias: # (Boolean) — The flag indicating whether to fit the bias term.
bias_scale: # (Float) — The scale of the bias term.
max_iter: # (Integer) — The maximum number of iterations.
batch_size: # (Integer) — The size of the mini batches.
optimizer: # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
Available optimizers: AdaGrad, Adam, Nadam, RMSProp, SGD, YellowFin
Ridge Regression
L2 regularization
Rumale::LinearModel::Ridge.new(
reg_param: # (Float) — The regularization parameter.
fit_bias: # (Boolean) — The flag indicating whether to fit the bias term.
bias_scale: # (Float) — The scale of the bias term.
max_iter: # (Integer) — The maximum number of iterations.
batch_size: # (Integer) — The size of the mini batches.
optimizer: # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
Lasso Regression
Rumale::LinearModel::Lasso.new(
reg_param: # (Float) — The regularization parameter.
fit_bias: # (Boolean) — The flag indicating whether to fit the bias term.
bias_scale: # (Float) — The scale of the bias term.
max_iter: # (Integer) — The maximum number of iterations.
batch_size: # (Integer) — The size of the mini batches.
optimizer: # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
Logistic Regression
Rumale::LinearModel::LogisticRegression.new(
reg_param: # (Float) — The regularization parameter.
fit_bias: # (Boolean) — The flag indicating whether to fit the bias term.
bias_scale: # (Float) — The scale of the bias term. If fit_bias is true, the feature vector v becomes [v; bias_scale].
max_iter: # (Integer) — The maximum number of iterations.
batch_size: # (Integer) — The size of the mini batches.
optimizer: # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
Support Vector Machine
svc = Rumale::LinearModel::SVC.new(
reg_param: # (Float) — The regularization parameter.
fit_bias: # (Boolean) — The flag indicating whether to fit the bias term.
bias_scale: # (Float) — The scale of the bias term.
max_iter: # (Integer) — The maximum number of iterations.
batch_size: # (Integer) — The size of the mini batches.
probability: # (Boolean) — The flag indicating whether to perform probability estimation.
optimizer: # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
Naive Bayes
- GaussianNB
- BernoulliNB
- MultinomialNB
Rumale::NaiveBayes::GaussianNB.new
Rumale::NaiveBayes::BernoulliNB.new(smoothing_param: 1.0, bin_threshold: 0.0)
Rumale::NaiveBayes::MultinomialNB.new(smoothing_param: 1.0)
Decision Tree
Rumale::Tree::DecisionTreeClassifier.new(
criterion: # (String) — The function to evaluate the splitting point. Supported criteria are ‘gini’ and ‘entropy’.
max_depth: # (Integer) — The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
max_leaf_nodes: # (Integer) — The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
min_samples_leaf: # (Integer) — The minimum number of samples at a leaf node.
max_features: # (Integer) — The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
random_seed: # (Integer) — The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding the splitting point.
)
Rumale::Tree::DecisionTreeRegressor.new(
criterion: # (String) — The function to evaluate the splitting point. Supported criteria are ‘mae’ and ‘mse’.
max_depth: # (Integer) — The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
max_leaf_nodes: # (Integer) — The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
min_samples_leaf: # (Integer) — The minimum number of samples at a leaf node.
max_features: # (Integer) — The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
random_seed: # (Integer) — The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding the splitting point.
)
ExtraTree
Random Forest
Rumale::Ensemble::RandomForestClassifier.new(
n_estimators: # (Integer) — The number of decision trees for constructing the random forest.
criterion: # (String) — The function to evaluate the splitting point. Supported criteria are ‘gini’ and ‘entropy’.
max_depth: # (Integer) — The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
max_leaf_nodes: # (Integer) — The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
min_samples_leaf: # (Integer) — The minimum number of samples at a leaf node.
max_features: # (Integer) — The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
random_seed: # (Integer) — The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding the splitting point.
)
Rumale::Ensemble::RandomForestRegressor.new(
n_estimators: # (Integer) — The number of decision trees for constructing the random forest.
criterion: # (String) — The function to evaluate the splitting point. Supported criteria are ‘mae’ and ‘mse’.
max_depth: # (Integer) — The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
max_leaf_nodes: # (Integer) — The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
min_samples_leaf: # (Integer) — The minimum number of samples at a leaf node.
max_features: # (Integer) — The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
random_seed: # (Integer) — The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding the splitting point.
)
AdaBoost (Adaptive Boosting)
Rumale::Ensemble::AdaBoostClassifier.new(
n_estimators: # (Integer) — The number of decision trees for constructing the AdaBoost classifier.
criterion: # (String) — The function to evaluate the splitting point. Supported criteria are ‘gini’ and ‘entropy’.
max_depth: # (Integer) — The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
max_leaf_nodes: # (Integer) — The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
min_samples_leaf: # (Integer) — The minimum number of samples at a leaf node.
max_features: # (Integer) — The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
random_seed: # (Integer) — The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding the splitting point.
)
Rumale::Ensemble::AdaBoostRegressor.new(
n_estimators: # (Integer) — The number of decision trees for constructing the AdaBoost regressor.
threshold: # (Float) — The threshold for delimiting correct and incorrect predictions. It is constrained to [0, 1].
exponent: # (Float) — The exponent for the weight of each weak learner.
criterion: # (String) — The function to evaluate the splitting point. Supported criteria are ‘mae’ and ‘mse’.
max_depth: # (Integer) — The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
max_leaf_nodes: # (Integer) — The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
min_samples_leaf: # (Integer) — The minimum number of samples at a leaf node.
max_features: # (Integer) — The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
random_seed: # (Integer) — The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding the splitting point.
)
Unsupervised Learning Estimators
PCA (Principal component analysis)
Rumale::Decomposition::PCA.new(
n_components: # (Integer) — The number of principal components.
max_iter: # (Integer) — The maximum number of iterations.
tol: # (Float) — The tolerance of termination criterion.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
NMF (Non-negative matrix factorization)
Rumale::Decomposition::NMF.new(
n_components: # (Integer) — The number of components.
max_iter: # (Integer) — The maximum number of iterations.
tol: # (Float) — The tolerance of termination criterion.
eps: # (Float) — A small value close to zero to avoid zero division error.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
t-SNE (T-distributed Stochastic Neighbor Embedding)
Rumale::Manifold::TSNE.new(
n_components: # (Integer) — The number of dimensions on representation space.
perplexity: # (Float) — The effective number of neighbors for each point. Perplexity is typically set between 5 and 50.
metric: # (String) — The metric to calculate the distances in original space. If metric is 'euclidean', Euclidean distance is calculated for distance in original space. If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
init: # (String) — The method used to initialize the representation space. If init is 'random', the representation space is initialized with normal random variables. If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
max_iter: # (Integer) — The maximum number of iterations.
tol: # (Float) — The tolerance of KL-divergence for terminating optimization. If tol is nil, it does not use KL divergence as a criterion for terminating the optimization.
verbose: # (Boolean) — The flag indicating whether to output KL divergence during iteration.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
KMeans clustering
Rumale::Clustering::KMeans.new(
n_clusters: # (Integer) — The number of clusters.
init: # (String) — The initialization method for centroids (‘random’ or ‘k-means++’).
max_iter: # (Integer) — The maximum number of iterations.
tol: # (Float) — The tolerance of termination criterion.
random_seed: # (Integer) — The seed value used to initialize the random generator.
)
DBSCAN (Density-based spatial clustering of applications with noise)
Rumale::Clustering::DBSCAN.new(
eps: # (Float) — The radius of neighborhood.
min_samples: # (Integer) — The number of neighbor samples to be used for the criterion whether a point is a core point.
)
Model Fitting
model.fit(samples, labels)
model.fit(samples)
model.fit_transform(x)
Prediction
y_pred = model.predict(samples)
Evaluate Model’s Performance
Classification Metrics
Accuracy Score
evaluator = Rumale::EvaluationMeasure::Accuracy.new
puts evaluator.score(ground_truth, predicted)
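For intuition, accuracy is simply the fraction of predictions that match the ground truth. A plain-Ruby sketch, with `accuracy` as a hypothetical helper working on Ruby arrays:

```ruby
# Hypothetical helper: share of predictions equal to the true labels.
def accuracy(ground_truth, predicted)
  correct = ground_truth.zip(predicted).count { |t, y| t == y }
  correct.to_f / ground_truth.size
end

score = accuracy([0, 1, 1, 2], [0, 1, 2, 2])
p score # 0.75
```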
Regression Metrics
Mean Absolute Error, MAE
evaluator = Rumale::EvaluationMeasure::MeanAbsoluteError.new
puts evaluator.score(ground_truth, predicted)
Mean Squared Error
evaluator = Rumale::EvaluationMeasure::MeanSquaredError.new
puts evaluator.score(ground_truth, predicted)
R2 Score
(coefficient of determination)
evaluator = Rumale::EvaluationMeasure::R2Score.new
puts evaluator.score(ground_truth, predicted)
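The three regression metrics above can be written out in a few lines of plain Ruby, which may help when checking results by hand. The helper names are hypothetical and the data is made up; the formulas are the standard definitions (R2 as one minus the ratio of residual to total sum of squares).

```ruby
# Hypothetical helpers implementing the standard definitions.
def mean_absolute_error(y_true, y_pred)
  y_true.zip(y_pred).sum { |t, y| (t - y).abs } / y_true.size.to_f
end

def mean_squared_error(y_true, y_pred)
  y_true.zip(y_pred).sum { |t, y| (t - y)**2 } / y_true.size.to_f
end

def r2_score(y_true, y_pred)
  mean = y_true.sum / y_true.size.to_f
  ss_res = y_true.zip(y_pred).sum { |t, y| (t - y)**2 } # residual sum of squares
  ss_tot = y_true.sum { |t| (t - mean)**2 }              # total sum of squares
  1.0 - ss_res / ss_tot
end

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
p mean_absolute_error(y_true, y_pred) # 0.5
p mean_squared_error(y_true, y_pred)  # 1.0
p r2_score(y_true, y_pred)            # approximately 0.2
```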
Clustering Metrics
Adjusted Rand Index
evaluator = Rumale::EvaluationMeasure::AdjustedRandScore.new
puts evaluator.score(ground_truth, predicted)
Cross-Validation
svc = Rumale::LinearModel::SVC.new
kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)
cv = Rumale::ModelSelection::CrossValidation.new(estimator: svc, splitter: kf)
report = cv.perform(samples, labels)
mean_test_score = report[:test_score].inject(:+) / kf.n_splits
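Behind the scenes, a splitter partitions the sample indices so that every sample lands in exactly one test fold. The plain-Ruby sketch below shows a simple, unshuffled, non-stratified k-fold split; `k_fold_indices` is a hypothetical helper, and a stratified splitter would additionally keep class proportions similar across folds.

```ruby
# Hypothetical helper: assign index i to fold (i % n_splits) and
# return [train_ids, test_ids] pairs, one per fold.
def k_fold_indices(n_samples, n_splits)
  all_ids = (0...n_samples).to_a
  folds = all_ids.group_by { |i| i % n_splits }.values
  folds.map do |test_ids|
    [all_ids - test_ids, test_ids]
  end
end

splits = k_fold_indices(6, 3)
splits.each { |train_ids, test_ids| p [train_ids, test_ids] }
# [[1, 2, 4, 5], [0, 3]]
# [[0, 2, 3, 5], [1, 4]]
# [[0, 1, 3, 4], [2, 5]]
```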
Tune Your Model
Grid Search
rfc = Rumale::Ensemble::RandomForestClassifier.new(random_seed: 1)
pg = { n_estimators: [5, 10], max_depth: [3, 5], max_leaf_nodes: [15, 31] }
kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)
gs = Rumale::ModelSelection::GridSearchCV.new(estimator: rfc, param_grid: pg, splitter: kf)
gs.fit(samples, labels)
p gs.cv_results
p gs.best_params
rbf = Rumale::KernelApproximation::RBF.new(random_seed: 1)
svc = Rumale::LinearModel::SVC.new(random_seed: 1)
pipe = Rumale::Pipeline::Pipeline.new(steps: { rbf: rbf, svc: svc })
pg = { rbf__gamma: [32.0, 1.0], rbf__n_components: [4, 128], svc__reg_param: [16.0, 0.1] }
kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)
gs = Rumale::ModelSelection::GridSearchCV.new(estimator: pipe, param_grid: pg, splitter: kf)
gs.fit(samples, labels)
p gs.cv_results
p gs.best_params
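A grid search evaluates every combination of the values listed in the parameter grid. As a minimal sketch of that expansion step in plain Ruby (`expand_grid` is a hypothetical helper, not a Rumale method, and it assumes a non-empty grid):

```ruby
# Hypothetical helper: expand { key => [values] } into one hash per combination.
def expand_grid(param_grid)
  keys = param_grid.keys
  first, *rest = param_grid.values
  first.product(*rest).map { |values| keys.zip(values).to_h }
end

grid = { n_estimators: [5, 10], max_depth: [3, 5] }
combos = expand_grid(grid)
combos.each { |params| p params }
p combos.size # 4
```

With the three-parameter grid used above (two values each), the same expansion would yield eight candidate settings, each of which is scored by cross-validation.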
Pipeline
Sequentially apply a list of transforms and a final estimator.
rbf = Rumale::KernelApproximation::RBF.new(gamma: 1.0, n_components: 128, random_seed: 1)
svc = Rumale::LinearModel::SVC.new(reg_param: 1.0, fit_bias: true, max_iter: 5000, random_seed: 1)
pipeline = Rumale::Pipeline::Pipeline.new(steps: { trs: rbf, est: svc })
pipeline.fit(training_samples, training_labels)
results = pipeline.predict(testing_samples)
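To illustrate the idea of chaining transforms and a final estimator, here is a toy plain-Ruby sketch. All three classes are hypothetical stand-ins operating on one-dimensional Ruby arrays; the point is only the calling pattern, where `fit` runs `fit_transform` on each transformer and `predict` reuses the fitted transformers via `transform`.

```ruby
# Toy transformer: subtracts the mean learned from the training data.
class MeanCenterer
  def fit_transform(xs)
    @mean = xs.sum / xs.size.to_f
    transform(xs)
  end

  def transform(xs)
    xs.map { |x| x - @mean }
  end
end

# Toy estimator: predicts 1 for positive inputs and 0 otherwise.
class SignClassifier
  def fit(_xs, _ys)
    self
  end

  def predict(xs)
    xs.map { |x| x.positive? ? 1 : 0 }
  end
end

# Hypothetical pipeline: every step but the last is a transformer.
class SimplePipeline
  def initialize(steps)
    *@transformers, @estimator = steps.values
  end

  def fit(xs, ys)
    transformed = @transformers.reduce(xs) { |data, t| t.fit_transform(data) }
    @estimator.fit(transformed, ys)
    self
  end

  def predict(xs)
    transformed = @transformers.reduce(xs) { |data, t| t.transform(data) }
    @estimator.predict(transformed)
  end
end

toy_pipeline = SimplePipeline.new({ trs: MeanCenterer.new, est: SignClassifier.new })
toy_pipeline.fit([1.0, 2.0, 3.0, 6.0], [0, 0, 0, 1]) # training mean is 3.0
toy_results = toy_pipeline.predict([2.0, 5.0])
p toy_results # [0, 1]
```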