DEV Community: Keita Onabuta

Macbook Pro M5 Setup

Keita Onabuta — Fri, 23 Jan 2026 17:08:30 +0000

I bought a MacBook Pro M5 after using a late-2021 MacBook Pro M1 for four years. This post is a note to myself about what I installed.

Magnet
Homebrew

https://brew.sh/

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

KeyCastr

brew install --cask keycastr

Battery Install

curl -s https://raw.githubusercontent.com/actuallymentor/battery/main/setup.sh | bash

Keep the battery charge at 80%:

battery maintain 80

Visual Studio Code
Karabiner-Elements
Google Chrome Canary
I really love vertical tabs. That's why I installed the Canary version of the Google Chrome browser.

Visual Studio Code: install Visual Studio Code from the official Visual Studio Code site.

In Visual Studio Code, open the command palette and run "Shell Command: Install 'code' command in PATH" to install the command-line tool.

Terminal setup

iTerm2
fish shell

brew install fish
echo /opt/homebrew/bin/fish | sudo tee -a /etc/shells
chsh -s /opt/homebrew/bin/fish #change default shell

Change the font for better Japanese readability

brew tap homebrew/cask-fonts
brew install --cask font-plemol-jp

fisher

curl -sL https://raw.githubusercontent.com/jorgebucaran/fisher/main/functions/fisher.fish | source && fisher install jorgebucaran/fisher

tide: a tool for customizing the fish prompt.

Install the font first.

brew install font-hack-nerd-font

fisher install IlanCosman/tide@v6

brew install fzf

fzf --fish | source

Azure Machine Learning is the Best Place for LightGBM

Keita Onabuta — Thu, 14 May 2020 18:01:49 +0000

LightGBM is a gradient boosting library whose development is led by Microsoft. It has a long track record and is also used inside Microsoft.

microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

Faster training speed and higher efficiency.
Lower memory usage.
Better accuracy.
Support of parallel, distributed, and GPU learning.
Capable of handling large-scale data.

For further details, please refer to Features.

Benefiting from these advantages, LightGBM is being widely-used in many winning solutions of machine learning competitions.

Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, distributed learning experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.

Get Started and Documentation

Our primary documentation is at https://lightgbm.readthedocs.io/ and is generated from this repository. If you are new to LightGBM, follow the installation instructions on that site.

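This post explains the benefits of using LightGBM on Azure Machine Learning and how to implement it.

What is Azure Machine Learning?

Azure Machine Learning is the machine learning platform provided by Azure. It supports Python and R and also provides a UI specialized for machine learning. It covers a wide range of machine learning requirements, from statistical analysis and classical machine learning to deep learning and reinforcement learning (Preview).

Web page: Azure Machine Learning

What is LightGBM?

Many blogs already explain the algorithm in detail, so I will skip that here. I think its distinguishing points are the following.

Decision trees as the base model
Boosting as the ensemble learning method
Leaf-wise tree growth
Histogram-based method
A large number of hyperparameters

It has achieved excellent results in data analysis competitions.
Reference: Winning Solutions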

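Model training

Automated Machine Learning

Microsoft's AutoML offering is Automated Machine Learning in Azure Machine Learning. It searches efficiently for high-accuracy models using a search algorithm, developed by Microsoft Research, that combines collaborative filtering and Bayesian optimization.

LightGBM is a candidate algorithm in every scenario (regression, classification, and time series forecasting), and in my personal experience it often ranks among the most accurate models.

LightGBM Estimator

LightGBM supports parallel training and can handle large volumes of data. Inside Microsoft, LightGBM is used to analyze Bing logs.

Azure Machine Learning also provides a dedicated LightGBM Estimator. Because it runs on managed infrastructure, you can start large-scale training right away.

The Bing team also uses the LightGBM Estimator in Azure Machine Learning, training in parallel on 13 TB or more of data at peak across 100 or more compute nodes.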

# scripts_folder and cpu_cluster are assumed to be defined earlier in the notebook.
training_data_list = ["binary0.train", "binary1.train"]
validation_data_list = ["binary0.test", "binary1.test"]
lgbm = LightGBM(source_directory=scripts_folder,
                compute_target=cpu_cluster,
                distributed_training=Mpi(),
                node_count=100,  # when training on 100 nodes in parallel
                lightgbm_config='train.conf',
                data=training_data_list,
                valid=validation_data_list
               )

experiment = Experiment(ws, name='lightgbm-estimator-test')
run = experiment.submit(lgbm, tags={"test public docker image": None})

Documentation: Bing accelerates model training with LightGBM and Azure Machine Learning

Hyperparameter tuning

Hyperdrive

Hyperdrive is the hyperparameter tuning feature provided by Azure Machine Learning. LightGBM usually needs its parameters tuned, so Hyperdrive is indispensable.

Figure: an example of Hyperdrive output. The parameter search supports Grid Search, Random Search, and Bayesian Optimization.

Documentation: Tune hyperparameters for your model with Azure Machine Learning
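
The random search strategy mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Hyperdrive API: the function `random_search`, the toy objective, and the parameter ranges are hypothetical stand-ins for "train LightGBM and return a validation score".

```python
import random


def random_search(objective, space, n_trials=20, seed=0):
    """Sample n_trials configurations uniformly from `space` and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical objective: peaks at learning_rate=0.1, feature_fraction=0.8.
    return -(params["learning_rate"] - 0.1) ** 2 - (params["feature_fraction"] - 0.8) ** 2


space = {"learning_rate": (0.01, 0.3), "feature_fraction": (0.5, 1.0)}
best_params, best_score = random_search(toy_objective, space, n_trials=50)
print(best_params, best_score)
```

Grid search would replace the random draws with a fixed set of combinations, and Bayesian optimization would choose each new candidate based on the scores seen so far.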

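Model inference

LightGBM models can be converted to ONNX; the conversion tool is onnxmltools. You can also build an ONNX inference environment on Azure Machine Learning.

Note: ONNX improves model interoperability. It is an open-source model format that runs in a wide range of programming environments.

Documentation: ONNX and Azure Machine Learning: create and accelerate ML models

Model interpretation

The feature importance output by a LightGBM model is used to understand which explanatory variables influence the model. Global Surrogate, a model interpretation technique, uses an interpretable surrogate model such as LightGBM to produce a global interpretation of a complex machine learning model.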
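
The global surrogate idea can be illustrated without any ML library. The sketch below is an assumption-laden toy: `black_box` is a hypothetical opaque model, and a one-split decision stump stands in for the LightGBM surrogate so that the example stays self-contained. The surrogate is fitted to the black box's predictions (not to ground-truth labels), and its agreement rate with the black box is reported as fidelity.

```python
import random


def black_box(x):
    # Hypothetical opaque model: positive when a nonlinear score crosses a threshold.
    return 1 if (x[0] ** 2 + 0.1 * x[1]) > 0.25 else 0


def fit_stump(X, y):
    """Find the single-feature threshold rule that best reproduces the labels y."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    return best


rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y_model = [black_box(row) for row in X]  # labels come from the black box itself

fidelity, feature, threshold, sign = fit_stump(X, y_model)
print(f"surrogate rule: feature {feature} vs threshold {threshold:.2f}, fidelity {fidelity:.2f}")
```

A high fidelity means the simple rule is a reasonable global summary of the black box; a low one means the surrogate's explanation should not be trusted.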

Azure Machine Learning lets you use Interpret Community, the model interpretation framework that Microsoft publishes as OSS.

interpretml / interpret-community

Interpret Community extends Interpret repository with additional interpretability techniques and utility functions to handle real-world datasets and workflows.

Interpret Community SDK

Interpret-Community is an experimental repository extending Interpret, with additional interpretability techniques and utility functions to handle real-world datasets and workflows for explaining models trained on tabular data. This repository contains the Interpret-Community SDK and Jupyter notebooks with examples to showcase its use.

Overview of Interpret-Community

Interpret-Community extends the Interpret repository and incorporates further community developed and experimental interpretability techniques and functionalities that are designed to enable interpretability for real world scenarios. Interpret-Community enables adding new experimental techniques (or functionalities) and performing comparative analysis to evaluate them.

Interpret-Community

Actively incorporates innovative experimental interpretability techniques and allows for further expansion by researchers and data scientists
Applies optimizations to make it possible to run interpretability techniques on real-world datasets at scale
Provides improvements such as the capability to "reverse the…

It supports not only global interpretation such as Global Surrogate but also local interpretation such as SHAP. Its strengths are simple coding and an easy-to-use dashboard.

from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import LGBMExplainableModel
explainer = MimicExplainer(model, 
                           x_train, 
                           LGBMExplainableModel, 
                           augment_data=True, 
                           max_num_of_augmentations=10, 
                           features=breast_cancer_data.feature_names, 
                           classes=classes)

Integration with other systems

Optuna LightGBM Tuner
Optuna provides a tuner dedicated to LightGBM. You can also run Optuna parameter tuning on Azure Machine Learning.
Neural Network Intelligence
Neural Network Intelligence is an AutoML toolkit developed by Microsoft Research.

For an implementation example with LightGBM, see the documentation below.

Documentation: GBDT in nni

Summary

Azure Machine Learning offers many features for LightGBM. Please give it a try.

References

Create a free Azure account
Tutorial: Create your first ML experiment with the Python SDK
Channel 9 AI Show (videos introducing the latest Azure AI features and trends)

Time Series Forecasting with Tidyverts

Keita Onabuta — Tue, 21 Apr 2020 11:12:06 +0000

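Using R, we will build basic time series forecasting models and compute forecasts. This post mainly uses tidyverts, a set of R libraries for time series modeling.

What is tidyverts?

tidyverts consists mainly of the following three libraries.

tsibble: provides a data structure specialized for time series data

fable: provides time series forecasting models

feasts: provides statistics and feature extraction for time series data

Microsoft Forecasting Best Practice

Microsoft has published a collection of forecasting best practices.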

microsoft / forecasting

Time Series Forecasting Best Practices & Examples

Forecasting Best Practices

Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively.

This repository provides examples and best practice guidelines for building forecasting solutions. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in forecasting algorithms to build solutions and operationalize them. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around processing and featurizing the data, optimizing and evaluating models, and scaling up to the cloud.

The examples and best practices are provided as Python Jupyter notebooks and R markdown files and a library of utility functions. We hope that these examples and utilities can significantly reduce the “time to market” by simplifying the experience from defining the…

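This repository also uses tidyverts, so please take a look.

In topics such as demand forecasting, you may need to build one model per "store" × "brand" combination, for example, which makes parallel processing essential. I do not cover it in this post, but the repository implements parallel processing with Ray in Python and {parallel} in R.

Data

We use the time series of page views for Peyton Manning's Wikipedia page, the same data used in the Prophet tutorial. Download the CSV file linked below beforehand.

example_wp_log_peyton_manning.csv

Analysis environment

RStudio or Visual Studio Code is fine. I use the RStudio that Azure Machine Learning provides as PaaS.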

Azure Machine Learning

R Support in Azure Machine Learning | AI Show | Channel9

Loading libraries

Load the libraries we will use.

library(fable)
library(fable.prophet)
library(tsibble)
library(tsibbledata)
library(feasts)
library(lubridate)
library(dplyr)
library(ggplot2)

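Data preparation and exploration

Specify the path to the CSV file and load the data.

df <- read.csv("example_wp_log_peyton_manning.csv", colClasses=c("Date","numeric"))

Convert the data stored in df to tsibble format and visualize it. "%>%" is the pipe operator used in R. Applying autoplot takes you all the way to a plot with very little code.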

df %>% as_tsibble %>% autoplot

df %>% as_tsibble %>% fill_gaps

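Use fill_gaps to add the dates that are missing from the data as explicit rows, then use fill from tidyr to fill in their missing values. Specifying .direction = "down" fills each gap with the preceding observation.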

df_filled <- df %>% as_tsibble %>% fill_gaps %>% tidyr::fill(y, .direction = "down")
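
To make the two steps concrete, here is a minimal Python sketch of the same idea: insert the missing days, then carry the last observed value forward. It is an illustration only, not how tsibble or tidyr are implemented, and the function name `fill_gaps_down` and the sample values are made up for the example.

```python
from datetime import date, timedelta


def fill_gaps_down(observations):
    """Insert missing days and carry the last observed value forward."""
    by_day = dict(observations)
    day, last_day = min(by_day), max(by_day)
    filled, previous = [], None
    while day <= last_day:
        # Use the observed value if the day exists, otherwise the previous value.
        value = by_day.get(day, previous)
        filled.append((day, value))
        previous = value
        day += timedelta(days=1)
    return filled


series = [(date(2020, 1, 1), 9.5), (date(2020, 1, 2), 8.5), (date(2020, 1, 5), 7.5)]
filled = fill_gaps_down(series)
for day, value in filled:
    print(day, value)
```

January 3 and 4 are missing from the input, so both are added and receive the value observed on January 2.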

Next, use STL from the feasts library to decompose the time series into seasonal, trend, and remainder components.

dcmp <- df_filled %>%
    model(STL(y ~ season(window = Inf ))) %>% components
dcmp

Visualize the components to check their characteristics. The weekly seasonal component is so fine-grained that it shows up as a solid black band.

dcmp %>% autoplot

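Building time series models and computing forecasts

First, we try very simple models. Although the data shows seasonality and other patterns, you can see that these models do not account for them.

MEAN: mean
NAIVE (= RW): random walk model

Here we also add a model that accounts for drift. The trend in this data declines in the latter half, and you can see that this model also forecasts a downward direction.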

df_filled %>% 
    model(
        mean=MEAN(y),
        naive=NAIVE(y),
        drift=RW(y ~ drift()),  # account for drift
        .safely=FALSE) %>% 
    forecast(h="2 years") %>% 
    autoplot(filter(as_tsibble(df_filled),year(ds) > 2007), level = NULL)

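Next, we account for seasonality.

SNAIVE: seasonal naive model (a random walk that accounts for seasonality)

The data has yearly seasonality, so we adopt a model that uses the value from one year earlier as the forecast.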

df_filled %>% 
    model(
        mean=MEAN(y),
        naive=NAIVE(y),
        drift=RW(y ~ drift()),  # account for drift
        snaive = SNAIVE(y ~ lag("year")), # account for seasonality
        .safely=FALSE) %>% 
    forecast(h="2 years") %>% 
    autoplot(filter(as_tsibble(df_filled),year(ds) > 2007), level = NULL)
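
The four baseline models are simple enough to write out by hand. The following Python sketch shows the point forecasts each one produces. It is meant only to illustrate the definitions (the function names and the short sample series are invented here), not to reproduce fable's implementation, which also estimates prediction intervals.

```python
def mean_forecast(y, h):
    # MEAN: every future value equals the historical average.
    return [sum(y) / len(y)] * h


def naive_forecast(y, h):
    # NAIVE: every future value equals the last observation.
    return [y[-1]] * h


def drift_forecast(y, h):
    # RW with drift: extend the line through the first and last observations.
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [y[-1] + slope * step for step in range(1, h + 1)]


def seasonal_naive_forecast(y, h, period):
    # SNAIVE: repeat the values from the last full season.
    last_season = y[-period:]
    return [last_season[step % period] for step in range(h)]


y = [10.0, 12.0, 14.0, 11.0, 13.0, 15.0]
print(mean_forecast(y, 2))
print(naive_forecast(y, 2))
print(drift_forecast(y, 2))
print(seasonal_naive_forecast(y, 4, period=3))
```

With daily data and `lag("year")`, the seasonal period corresponds to roughly one year of observations, so each forecast simply repeats the value from the same day a year earlier.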

Next, we use exponential smoothing.

ETS: exponential smoothing

Let's try a one-year period.

df_filled %>% 
    model(
        est=ETS(y ~ season("A", period= "1 year")),
        .safely=FALSE) %>% 
    forecast(h="1 year") %>% 
    autoplot(filter(as_tsibble(df_filled),year(ds) > 2015), level = NULL)

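This produced an error.

Error: Seasonal periods (period) of length greather than 24 are not supported by ETS.

ETS appears to support seasonal periods of at most 24 observations. Because this is daily data, accounting for yearly seasonality looks difficult.

Assuming weekly seasonality instead, we build the model and compute forecasts.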

df_filled %>% 
    model(
        est=ETS(y ~ season("A", period= "7 days")),
        .safely=FALSE) %>% 
    forecast(h="30 days") %>% 
    autoplot(filter(as_tsibble(df_filled),year(ds) > 2014), level = NULL)

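Next, we use an ARIMA model. ETS did not work for this, but with ARIMA, setting period and D=1 lets the model account for yearly seasonality as well.

(S)ARIMA: (seasonal) autoregressive integrated moving average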

df_filled %>% 
    model(
        arima=ARIMA(y ~ pdq(1,1,3) + PDQ(0,1,0,period="365 days")),
        .safely=FALSE) %>%
    forecast(h="2 years") %>% 
    autoplot(filter(as_tsibble(df_filled),year(ds) > 2007), level = NULL)
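
In the PDQ(0,1,0) term, D=1 asks for one seasonal difference: each value has the value from one period earlier subtracted from it before the remaining ARIMA terms are fitted. A small Python sketch of that transformation follows; the helper name and the short series with period 3 are illustrative only.

```python
def seasonal_difference(y, period):
    """Return y[t] - y[t - period] for every t that has a value one period back."""
    return [y[t] - y[t - period] for t in range(period, len(y))]


y = [1, 2, 3, 4, 6, 8]
print(seasonal_difference(y, period=3))
```

For the daily series in this post the period is 365, so the first year of observations is consumed by the differencing step.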

Next, we use Facebook's Prophet. A library called fable.prophet is available, which lets you plug Prophet models into the fable workflow we have been using. See fable.prophet for details.

df_filled %>% 
    model(
        prophet(y),
        .safely=FALSE) %>% 
    forecast(h="2 years") %>% 
    autoplot(filter(as_tsibble(df_filled),year(ds) > 2007), level = NULL)

That's all.

DataDrift in Azure Machine Learning

Keita Onabuta — Fri, 17 Apr 2020 02:54:20 +0000

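In general, the accuracy of a machine learning model degrades over time, because the characteristics of the latest data drift away from those of the data used for training. This post explains the data drift feature of Azure Machine Learning.

Note: the slides are published on SlideShare.
DataDrift in Azure Machine Learning

Retraining in response to accuracy changes

The timing for retraining is detected with the following approaches.

When the accuracy of the model in production drops

If you can measure the accuracy of the model in production, there is no problem. In some cases, however, you cannot: for example, when the model is embedded in your product, shipped, and used by the customer at a site with a closed network. In such cases, the following approach is needed.

When the characteristics of the data change

This approach detects changes in various characteristics of the data, and it is the topic of this post. Data characteristics change in many situations; typical ones are listed below.

Changes in how data is measured
- Switching a street survey to a web form, etc.
Degradation of data quality
- More missing values, inconsistent spelling, etc.
Natural changes such as seasonality
- Changes in temperature and humidity, changes in consumer demand, etc.
Changes in what generates the data
- Equipment degradation, changes in preferences, etc.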

Azure Machine Learning

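Azure Machine Learning is a machine learning platform that supports the full machine learning lifecycle. It makes model training more efficient with features such as AutoML and hyperparameter tuning, and it also provides MLOps capabilities.

Azure Machine Learning
Build and deploy models faster with an enterprise-grade machine learning service
https://azure.microsoft.com/ja-jp/services/machine-learning/

It also provides features for managing datasets.

Connecting to data sources
- Connections to the various data sources that Azure provides (Azure SQL Database, Azure PostgreSQL, Azure MySQL Database, Azure Data Lake Storage Gen2, and so on)
Versioning of metadata
Saving snapshots
Data Drift (the topic of this post)

The DataDrift feature

DataDrift can be applied to training data and inference data. There are two main monitoring targets.

The dataset as a whole
Individual features

The approach for each is explained below.

The dataset as a whole

Azure Machine Learning has a classification model for detecting data drift and uses it here. The model tries to distinguish between the two datasets being compared (the data used at training time and the latest data). It outputs a drift coefficient that is 0 when the two cannot be distinguished at all and 1 when they can be distinguished perfectly. It also uses model interpretation internally, so you can see which features contribute to the drift.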
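
The idea of measuring drift with a classifier can be sketched as follows. This is not Azure Machine Learning's actual algorithm: the nearest-centroid classifier, the `2 * accuracy - 1` scaling, and the synthetic Gaussian data are all assumptions chosen to keep the example small. The sketch also assumes the two datasets have the same size, so that chance-level accuracy is 0.5.

```python
import random


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def drift_coefficient(baseline, target):
    """Train a nearest-centroid classifier to tell the two datasets apart.

    Returns max(0, 2 * held_out_accuracy - 1): near 0 when the datasets are
    indistinguishable, 1 when they are perfectly separable.
    """
    half_b, half_t = len(baseline) // 2, len(target) // 2
    # "Training": one centroid per dataset, computed on the first halves.
    c_base, c_target = centroid(baseline[:half_b]), centroid(target[:half_t])
    # Evaluation: label the held-out halves (0 = baseline, 1 = target).
    test = [(row, 0) for row in baseline[half_b:]] + [(row, 1) for row in target[half_t:]]
    correct = sum(
        (1 if sq_dist(row, c_target) < sq_dist(row, c_base) else 0) == label
        for row, label in test
    )
    return max(0.0, 2 * correct / len(test) - 1)


rng = random.Random(0)
baseline = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
shifted = [[rng.gauss(10, 1), rng.gauss(0, 1)] for _ in range(400)]

print(drift_coefficient(baseline, same))     # close to 0: no drift
print(drift_coefficient(baseline, shifted))  # close to 1: strong drift
```

In the shifted dataset only the first feature moved, which is also what a feature-importance view of the classifier would point to.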

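Individual features

We looked at the dataset as a whole above, but you can also examine individual features. Various metrics are provided for this.

Numeric data
- Wasserstein distance, minimum, mean, maximum
Categorical data
- Euclidean distance, number of unique values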
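
For categorical features, one plausible reading of the Euclidean distance metric is the distance between the relative-frequency vectors of the two samples. The sketch below follows that reading; it is an assumption for illustration, and the exact definition Azure Machine Learning uses may differ. The function name and sample values are invented.

```python
import math
from collections import Counter


def categorical_euclidean(baseline, target):
    """Euclidean distance between the relative-frequency vectors of two samples."""
    categories = sorted(set(baseline) | set(target))
    freq_b, freq_t = Counter(baseline), Counter(target)
    return math.sqrt(sum(
        (freq_b[c] / len(baseline) - freq_t[c] / len(target)) ** 2
        for c in categories
    ))


baseline = ["web", "web", "store", "store"]
target = ["web", "web", "web", "store"]
print(categorical_euclidean(baseline, target))
print(len(set(baseline)), len(set(target)))  # number of unique values
```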

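The Wasserstein distance, which is also used in anomaly detection and optimal transport problems, measures the distance between two data distributions. It is easy to compute with SciPy.

# Wasserstein distance for two Beta distributions (density values on a grid over [0, 1])
import numpy as np
from scipy.stats import beta, wasserstein_distance
x = np.linspace(0, 1, 100)
wasserstein_distance(beta.pdf(x, 5, 5), beta.pdf(x, 8, 5))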
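
For two samples of equal size, the one-dimensional Wasserstein distance reduces to the mean absolute difference between the sorted values. A minimal pure-Python sketch under that assumption (the function name `wasserstein_1d` is illustrative, not a SciPy API):

```python
def wasserstein_1d(u, v):
    """1-D Wasserstein distance between two equal-sized samples."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes samples of equal size")
    # Pair the i-th smallest value of u with the i-th smallest value of v.
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


print(wasserstein_1d([0, 1, 3], [5, 6, 8]))
```

Intuitively, the result is the average distance each point of one sample has to move to turn it into the other sample.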

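Integration with other systems

By default, data drift is reported through email alerts. In addition, by connecting Azure Machine Learning with Event Grid, you can set up integrations such as the following.

Automatically retrain and deploy the model, triggered by data drift
Create a GitHub issue
Notify Teams or Slack that data drift has occurred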
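
An event-driven handler for these integrations could look roughly like the sketch below. It rests on several assumptions that should be checked against the Event Grid documentation: that the drift event arrives as JSON with an `eventType` of `Microsoft.MachineLearningServices.DatasetDriftDetected` and a `data.driftCoefficient` field. The threshold and the action names are hypothetical placeholders for calls to a pipeline, GitHub, or a chat webhook.

```python
import json

# Assumed event type name for dataset drift events.
DRIFT_EVENT = "Microsoft.MachineLearningServices.DatasetDriftDetected"


def plan_actions(event_json, threshold=0.3):
    """Decide which follow-up actions to take for one incoming event."""
    event = json.loads(event_json)
    if event.get("eventType") != DRIFT_EVENT:
        return []
    coefficient = event.get("data", {}).get("driftCoefficient", 0.0)
    actions = ["notify_chat"]  # always tell the team that drift was detected
    if coefficient >= threshold:
        # Strong drift: also open an issue and kick off retraining.
        actions += ["open_issue", "retrain_and_deploy"]
    return actions


sample = json.dumps({"eventType": DRIFT_EVENT, "data": {"driftCoefficient": 0.5}})
print(plan_actions(sample))
```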

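Summary

Monitor datasets to maintain model accuracy
- Model accuracy changes from day to day, and changes in the data strongly affect it.
Monitoring targets are the dataset as a whole and individual features
An event-driven machine learning pipeline
- It improves how well the parts of a complex machine learning environment work together.

References

Channel 9 (short videos with Japanese subtitles) # Please subscribe to the channel!
- Data Drift Monitoring for Azure ML Datasets
Product documentation
- Detect data drift on datasets (preview)
- Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)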
Sample Code (Jupyter)
- Analyze data drift in Azure Machine Learning datasets
- Monitor data drift on models deployed to Azure Kubernetes Service

Japanese Text Classification with BERT (Azure)

Keita Onabuta — Mon, 30 Mar 2020 11:45:22 +0000

This article was originally written in Japanese.

konabuta / AzureML-NLP

NLP for japanese language text.

AzureML-NLP

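This repository provides sample code for building Japanese natural language processing (NLP) models with Azure Machine Learning. It is based on Microsoft's NLP Best Practices.

Contents

Scenario	Model	Description	Supported languages
Text classification	BERT	Supervised learning that learns and predicts the category of a text.	Japanese

Get started

We recommend considering Azure Cognitive Services first. If its pretrained models cannot meet your needs, you will need to build a custom machine learning model. First, see Setup and install the required libraries.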

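I created sample code for Japanese text classification based on "NLP Best Practices", the collection of natural language processing best practices published by Microsoft.

The main differences from the original are as follows.

Uses a Japanese BERT tokenizer
- Adds steps for downloading and installing MeCab (plus a dictionary)
Uses a Japanese pretrained model
- Uses a Hugging Face model
Uses Livedoor news as the sample data

Installing the MeCab dictionary is complicated, so I have not yet decided whether to merge this into the original repository.

Here is the code (excerpt).

1. Preparing the Livedoor corpus data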

# Download the Livedoor news corpus.
import os
import tarfile
from urllib.request import urlretrieve

import pandas as pd

text_url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
file_path = "./ldcc-20140209.tar.gz"
urlretrieve(text_url, file_path)

# Extract the tar.gz archive (the with block closes it automatically).
with tarfile.open(file_path, 'r:gz') as tar:
    tar.extractall(path='livedoor')

# Build a pandas DataFrame.
# Assumption: the archive extracts into a "text" directory with one folder per category.
path = os.path.join('livedoor', 'text')
rows = []
for folder_name in os.listdir(path):
    print(folder_name)
    # Skip top-level text files; only category folders contain articles.
    if folder_name.endswith(".txt"):
        continue
    for file in os.listdir(os.path.join(path, folder_name)):
        # Skip the license file inside each category folder.
        if file == "LICENSE.txt":
            continue
        with open(os.path.join(path, folder_name, file), 'r', encoding='utf-8') as f:
            lines = f.read().split('\n')
        if len(lines) == 1:
            continue
        rows.append({'url': lines[0], 'date': lines[1], 'label': folder_name,
                     'title': lines[3], 'text': "".join(lines[4:])})
# Collect rows in a list first; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows)

2. Fine-tuning

We use the utilities provided in utils_nlp.

classifier = SequenceClassifier(
    model_name=model_name, num_labels=num_labels, cache_dir=CACHE_DIR
)

with Timer() as t:
    classifier.fit(
        train_dataloader, num_epochs=NUM_EPOCHS, num_gpus=NUM_GPUS, verbose=False,
    )
train_time = t.interval / 3600

Check the accuracy.

# Predict on the test data
preds = classifier.predict(test_dataloader, num_gpus=NUM_GPUS, verbose=False)

# Evaluate
accuracy = accuracy_score(df_test[LABEL_COL], preds)
class_report = classification_report(
    df_test[LABEL_COL], preds, target_names=label_encoder.classes_, output_dict=True
)

The final accuracy was about 87%.

accuracy : 0.866052
f1-score : 0.858849
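
For reference, the two reported metrics can be computed by hand as in the sketch below. It assumes the f1-score is a macro average over classes, which the post does not state, and the tiny label lists are made up for illustration. In practice scikit-learn's functions do this for you.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    # Per-class F1 = 2*TP / (2*TP + FP + FN), averaged with equal class weights.
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["sports", "sports", "it", "it"]
y_pred = ["sports", "it", "it", "it"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```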

How to create SSH Key on Mac Terminal

Keita Onabuta — Sun, 29 Mar 2020 08:05:18 +0000

Just a memo for myself.

ssh-keygen -t rsa

Running it produces output like this:

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/<username>/.ssh/id_rsa): 
/Users/<username>/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /Users/<username>/.ssh/id_rsa.
Your public key has been saved in /Users/<username>/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:O9gHqGxZOua51PfVNL6GZN/FixT6nOuXptCTOzhkqd0 <hostname>
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|                 |
|       .     .   |
|      o S   o +. |
|   . * o o =+=..o|
|    X o = *+B*=.+|
|   = o . = =oE=*.|
|    +.    . o*B  |
+----[SHA256]-----+

Two files are generated under the ~/.ssh folder.

 id_rsa
 id_rsa.pub

id_rsa : private key
id_rsa.pub : public key

Microsoft ML/DL Technology

Keita Onabuta — Sat, 28 Mar 2020 14:49:57 +0000

A summary of Microsoft's machine learning and deep learning research, OSS libraries, and cloud products. I will update it from time to time. For the latest version, see konabuta/ML-tech.

ML-tech

Microsoft libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

1. Library & tool

LightGBM

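A fast gradient boosting library. Frequently used by top Kaggle competitors and proven in production.

EconML

A library for estimation in statistical causal inference. Part of the ALICE project at Microsoft Research.

DoWhy

A library (framework) that supports the lifecycle of statistical causal inference.

Neural Network Intelligence

An AutoML toolkit for Neural Architecture Search, hyperparameter tuning, and more.

InterpretML

A library that implements GA2M, a generalized additive model. It allows more flexible modeling than generalized linear models, so improved accuracy can be expected. It is also attracting attention because its models are interpretable.

Interpret Community

Provides a variety of model interpretation techniques through a unified API. A dedicated dashboard also makes it possible to interpret models.

MMLSpark

A machine learning framework that runs on Apache Spark, a distributed computing environment. LightGBM, OpenCV, and others are also available.

EdgeML

Machine learning algorithms for edge devices.

DiCE

Model interpretation through counterfactuals

MMdnn

A cross-framework solution for visualizing deep neural networks

TensorWatch

A debugging and visualization tool for machine learning

ONNX Runtime

A runtime for running ONNX model files

TagAnomaly

A tagging tool for time series data

VoTT

A tagging tool for image and video data

TextWorld

A game simulator for text-based reinforcement learning

AirSim

A simulator for autonomous driving

UniLM

UniLM - Unified Language Model Pre-training

Vowpal Wabbit

A fast machine learning library developed by Microsoft Research and the former Yahoo! Research. It also supports online machine learning and reinforcement learning.

Microsoft Z3 Solver

Theorem prover

COCO Dataset

An image dataset. Provides 80 categories, more than 200,000 labeled images, and 1.5 million object instances. Microsoft, Facebook, and others are the main sponsors.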

2. Recipe

Computer Vision

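A collection of best practices for computer vision

Natural Language Processing

A collection of best practices for natural language processing

Recommenders

A collection of best practices for recommender systems

MLOps

A collection of best practices for MLOps

3. Sample Codes

Related to Azure Machine Learning

Azure ML Sample Codes

Official Azure Machine Learning sample code.

BERT

End-to-end sample code for retraining and transfer learning with BERT

Distributed Deep Learning

Sample code for distributed deep learning.

Hyperdrive for Deep Learning

Sample code for hyperparameter tuning of deep learning models. Uses Mask R-CNN.

Batch Inference

Sample code for batch inference.

ML on IoT Edge

Sample steps for deploying a machine learning model to Azure IoT Edge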

4. Workshop

Causal Inference and Counterfactual Reasoning @KDD2018

An explanation of causal inference and its basic theory, plus a tutorial on the DoWhy causal inference framework

Nvidia Rapids on Azure ML

A tutorial on using NVIDIA RAPIDS on Azure Machine Learning. A detailed blog post is available here.

Deep Learning for Time Series Forecasting @KDD2019

A tutorial on time series forecasting with deep learning.

From Graph to Knowledge Graph @KDD2019

An introductory tutorial on graph data and how to model it.

AutoML Workshop @Dllab

A tutorial on Azure AutoML.

Deep Learning with TensorFlow 2.0 and Azure @TensorFlow World 2019

A workshop on developing a model that uses BERT to automatically tag Stack Overflow questions

5. Training

Foundations of Data Science

A PDF covering the basic theory of data science

Python for Beginners

A free Python course for beginners

Microsoft NLP Recipes

Keita Onabuta — Sat, 28 Mar 2020 14:26:05 +0000

microsoft / nlp-recipes

Natural Language Processing Best Practices & Examples

NLP Best Practices

In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP. Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems The content is based on…

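This post introduces the collection of NLP best practices published by Microsoft.

What is NLP Recipes?

A collection of best practices for natural language processing (NLP), developed under the lead of Microsoft data scientists.

Contents

As of March 28, 2020, the following content is available.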

Scenario	Models	Description	Languages
Text Classification	BERT, XLNet, RoBERTa	Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content.	English, Hindi, Arabic
Named Entity Recognition	BERT	Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest.	English
Text Summarization	BERTSum	Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text.	English
Entailment	BERT, XLNet, RoBERTa	Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine if the text agrees with the hypothesis or not.	English
Question Answering	BiDAF, BERT, XLNet	Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query.	English
Sentence Similarity	BERT, GenSen	Sentence similarity is the process of computing a similarity score given a pair of text documents.	English
Embeddings	Word2Vec, fastText, GloVe	Embedding is the process of converting a word or a piece of text to a continuous vector space of real numbers, usually of low dimension.	English
Sentiment Analysis	Dependency Parser, GloVe	Provides an example of training and using Aspect Based Sentiment Analysis with Azure ML and Intel NLP Architect.	English

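Features

Based on Hugging Face models
NLP models can be developed with little code
For now it mainly supports English only (a Japanese version of text classification will be published later!)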
