Profiling a Go gRPC server with Google Cloud Profiler

#go #googlecloud #performance #profiling

English version

About

This article shows how to integrate Cloud Profiler into a Go program, using a gRPC server as the example.

The complete example is on GitHub.

Profiler

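Profiling is a way of analyzing software performance, and a profiler is the tool that does it. Put simply, it measures how much time, memory, or other resources a program consumes. The most widely used visualization in recent years is the flame graph; for the concept, see Brendan Gregg's website or the Google Cloud documentation.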

Cloud Profiler

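Google offers a service called Cloud Profiler that makes it very easy to collect and visualize profiles of your program automatically; you only need to add a small piece of code. Some basic facts about Cloud Profiler:

Supported languages: Python, Go, Node.js, and Java
Price: free
Data retention: 30 days
Performance impact on your program: see here
You can use it even if your program does not run on GCP. The example below runs only on a local machine and sends its data to Cloud Profiler for visualization.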

gRPC server

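Because the example here uses a gRPC server, this section briefly introduces the example gRPC service before getting to the integration. If you are already familiar with gRPC services, you can skip it.

protobuf - Ping service

The service definition contains a single Ping service. It lives in the proto folder, in ping.proto: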

syntax = "proto3";

package ping;
option go_package = ".;ping";

service Ping {
    // Get returns a response with the same message ID and body, plus a timestamp.
    rpc Get (PingRequest) returns (PingResponse);
    // GetAfter is the same as Get but returns the response after the given number of seconds.
    rpc GetAfter (PingRequestWithSleep) returns (PingResponse);
    // GetRandom returns a random string in the body and allocates a lot of throwaway data to make heap usage visible.
    rpc GetRandom (PingRequest) returns (PingResponse);
}


message PingRequest {
    string message_ID = 1;
    string message_body = 2;
}

message PingRequestWithSleep {
    string message_ID = 1;
    string message_body = 2;
    int32 sleep = 3;
}

message PingResponse {
    string message_ID = 1;
    string message_body = 2;
    uint64 timestamp = 3;
}

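If you are not familiar with protobuf, see this document.

The Ping service has 3 RPCs (remote procedure calls):

Get: returns the same message ID and body, plus the server's timestamp.
GetAfter: same as Get, but waits the given number of seconds before responding.
GetRandom: similar to Get, but the body is replaced with random text. The implementation below also deliberately does extra work before returning, to generate some CPU and memory usage for the demo.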

Generating the gRPC Go code

You need protoc installed; see here.
Run the following command in your terminal to generate the Go gRPC server/client code (it produces a file named ping.pb.go):

make proto-go

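The repository already includes ping.pb.go, but it was generated with protoc-gen-go v1.25.0 and protoc v3.6.1. Your versions may differ, so regenerate the file to avoid mismatches.

Ping service implementation

Below is the implementation of the service. The file is ping.go in the ping folder: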

package Ping

import (
    "context"
    pb "github.com/billcchung/example-service/protobuf"
    "math/rand"
    "time"
)

var letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

type Server struct{}

// Get echoes the message ID and body and adds the server time in milliseconds.
func (s Server) Get(ctx context.Context, req *pb.PingRequest) (res *pb.PingResponse, err error) {
    res = &pb.PingResponse{
        Message_ID:  req.Message_ID,
        MessageBody: req.MessageBody,
        Timestamp:   uint64(time.Now().UnixNano() / int64(time.Millisecond)),
    }
    return
}

// GetAfter sleeps for req.Sleep seconds and then responds like Get.
func (s Server) GetAfter(ctx context.Context, req *pb.PingRequestWithSleep) (res *pb.PingResponse, err error) {
    time.Sleep(time.Duration(req.Sleep) * time.Second)
    return s.Get(ctx, &pb.PingRequest{Message_ID: req.Message_ID, MessageBody: req.MessageBody})
}

// GetRandom responds with a random letter as the body. Before that it fills a
// throwaway slice purely to burn CPU and heap for the profiling demo.
func (s Server) GetRandom(ctx context.Context, req *pb.PingRequest) (res *pb.PingResponse, err error) {
    var garbage []string
    for i := 0; i <= 1000000; i++ {
        garbage = append(garbage, string(letterRunes[rand.Intn(len(letterRunes))]))
    }
    return s.Get(ctx, &pb.PingRequest{Message_ID: req.Message_ID, MessageBody: string(letterRunes[rand.Intn(len(letterRunes))])})
}

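Setting up Cloud Profiler

As mentioned earlier, very little code is needed to profile your service: import "cloud.google.com/go/profiler" and start the profiler when your program starts. The profiler runs as a goroutine in the background that periodically collects samples and uploads them to Cloud Profiler.

In our example, the code looks like this (in main.go):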

profiler.Start(profiler.Config{
    Service:        service,
    ServiceVersion: serviceVersion,
    ProjectID:      projectID,
})

Where:

Service is the service name
ServiceVersion is the version of the service
ProjectID is the GCP project ID
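To make the idea of a background goroutine that periodically samples and uploads more concrete, here is a minimal sketch that uses only the standard library's runtime/pprof package. It is not the Cloud Profiler agent's actual implementation: collectCPUProfile, startSampler, and the upload callback are hypothetical names, and the "upload" here only reports the size of the collected profile.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// collectCPUProfile records a CPU profile for duration d and returns the
// pprof data that runtime/pprof wrote.
func collectCPUProfile(d time.Duration) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	time.Sleep(d)
	pprof.StopCPUProfile() // returns once the profile has been written to buf
	return buf.Bytes(), nil
}

// startSampler (hypothetical helper) collects one profile per tick in a
// background goroutine and passes it to upload until stop is closed.
// The returned channel is closed when the goroutine has exited.
func startSampler(interval, d time.Duration, upload func([]byte), stop <-chan struct{}) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				if p, err := collectCPUProfile(d); err == nil {
					upload(p)
				}
			}
		}
	}()
	return done
}

func main() {
	stop := make(chan struct{})
	sizes := make(chan int, 1)
	done := startSampler(100*time.Millisecond, 50*time.Millisecond, func(p []byte) {
		select {
		case sizes <- len(p):
		default: // drop the report if the previous one has not been read yet
		}
	}, stop)
	fmt.Println("collected a profile of", <-sizes, "bytes")
	close(stop)
	<-done
}
```

The server itself keeps running while this goroutine works, which is why starting the profiler once at startup is enough.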

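Starting the server

Before starting the service, make sure the Cloud Profiler API is enabled for your project. You can enable it from here.

Also, because this example service does not run inside GCP, you need the appropriate Cloud Profiler permissions and must be logged in with gcloud. If you have not logged in yet or need permissions, see here.

Then you can start the service (replace $PROJECT_ID with your GCP project ID):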

go run main.go -p $PROJECT_ID
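For readers wondering how a -p flag like the one above could reach profiler.Config, here is a rough sketch built on the standard flag package. It is an assumption about the shape of such code, not a copy of the repository's main.go: the options type, parseOptions, and the -s and -v flags with their defaults are all hypothetical. Only -p comes from the command above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options holds the values that would be passed on to profiler.Config.
type options struct {
	projectID      string
	service        string
	serviceVersion string
}

// parseOptions (hypothetical helper) parses command-line arguments.
// Only -p appears in the article; -s and -v are made up for illustration.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("server", flag.ContinueOnError)
	fs.StringVar(&o.projectID, "p", "", "GCP project ID")
	fs.StringVar(&o.service, "s", "example-service", "service name (hypothetical flag)")
	fs.StringVar(&o.serviceVersion, "v", "0.0.1", "service version (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.projectID == "" {
		return o, errors.New("missing -p <project ID>")
	}
	return o, nil
}

func main() {
	// A real program would pass os.Args[1:]; a fixed list keeps the sketch runnable as is.
	o, err := parseOptions([]string{"-p", "my-project"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The three values map onto ProjectID, Service, and ServiceVersion.
	fmt.Printf("project=%s service=%s version=%s\n", o.projectID, o.service, o.serviceVersion)
}
```

Rejecting an empty project ID early is a design choice of this sketch, made so that a missing flag fails at startup instead of surfacing later as a profiler error.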

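Running the client

Once the server is running, you can run the client in the tools folder. It calls Get, GetAfter, and GetRandom in order. You need to run it many times so that the profiler can collect samples: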

for i in $(seq 1 1000); do go run tools/connect.go ; done

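Viewing and understanding the graphs

After the client has been running for a while (there is no need to wait for the whole loop to finish), you can see the service's profiles and their visualization (flame graph) in the Cloud Profiler console:

What you see may differ from my graph, but taking mine as an example:

It collected 16 profiles, with CPU time ranging from 520ms to 780ms.
Server.GetRandom took a fair amount of time; it is the function we wrote for the service.
Server.Get is absent (or barely present) because it runs very quickly, and GetAfter is absent (or barely present) because time.Sleep does not consume CPU time. You may see these two functions, but only rarely, because from a sampling point of view the chance of catching them is very small. When we profile, we usually care only about the functions that take the most time or memory anyway.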
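The sampling argument in the list above can be put into rough numbers. The sketch below assumes a CPU sampling rate of 100 Hz, which is what Go's runtime/pprof uses for CPU profiles. The handler timings are invented for illustration, and expectedSamples is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// expectedSamples estimates how many CPU samples land in a function, given
// the CPU time it consumed during profiling and the sampling rate in Hz.
func expectedSamples(cpuTime time.Duration, hz float64) float64 {
	return cpuTime.Seconds() * hz
}

func main() {
	const hz = 100 // assumed sampling rate

	// Invented figures: a CPU-heavy handler, a very quick handler, and a
	// handler that only sleeps and therefore uses no CPU time at all.
	fmt.Printf("busy handler:     %.3f samples\n", expectedSamples(160*time.Millisecond, hz))
	fmt.Printf("quick handler:    %.3f samples\n", expectedSamples(20*time.Microsecond, hz))
	fmt.Printf("sleeping handler: %.3f samples\n", expectedSamples(0, hz))
}
```

With these figures, the busy handler is expected to collect about 16 samples, the quick one about 0.002, and the sleeping one none.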

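You can see that Server.GetRandom used 164.38ms. Click it to see in detail which functions account for that time:

You can see that growslice and Intn take up most of the time:

Of course, both functions come from Go itself (growslice from the runtime, Intn from math/rand). This is only a demo, meant to show how to read the graph and find out which functions take the most time. In a real program, this information can help you decide what to optimize.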
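As one example of acting on such a finding: growslice shows up because the slice in GetRandom starts empty and append has to reallocate it repeatedly as it fills. If that slice were real work and not demo garbage, one common remedy is to allocate the capacity up front. The sketch below is an illustration only; fillGrowing and fillPreallocated are hypothetical names, and the calls to Intn remain in both versions.

```go
package main

import (
	"fmt"
	"math/rand"
)

var letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

// fillGrowing mirrors the loop in GetRandom: the slice starts with no
// capacity, so append must grow it (runtime.growslice) as elements arrive.
func fillGrowing(n int) []string {
	var out []string
	for i := 0; i < n; i++ {
		out = append(out, string(letterRunes[rand.Intn(len(letterRunes))]))
	}
	return out
}

// fillPreallocated reserves room for n elements up front, so append never
// has to grow the backing array.
func fillPreallocated(n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, string(letterRunes[rand.Intn(len(letterRunes))]))
	}
	return out
}

func main() {
	a := fillGrowing(1000)
	b := fillPreallocated(1000)
	fmt.Println("growing:      len", len(a), "cap", cap(a))
	fmt.Println("preallocated: len", len(b), "cap", cap(b))
}
```

Profiling again after a change like this is the way to confirm whether growslice has actually shrunk in the flame graph.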