Dilek Karasoy for Picovoice

Day 17: Speech-to-Text with gRPC and Golang

To have a working gRPC microservice, three components are essential:

  1. a .proto file to define the gRPC services and messages
  2. a server to process the submitted audio and return the transcription
  3. a client to talk to the server

.proto file

syntax = "proto3";
package messaging;
option go_package = "go-grpc/messaging";
service LeopardService {
  rpc GetTranscriptionFile(stream Chunk) returns (transcriptResponse) {}
}
message Chunk {
  bytes Content = 1;
}

For simplicity, the proto file defines only one service (LeopardService) with a single RPC method (GetTranscriptionFile).

By default, gRPC limits incoming messages to 4 MB. Hence, the transcription RPC is declared as client-side streaming, so that files can be sent in chunks of bytes.
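To make the chunking idea concrete, here is a minimal sketch that cuts a byte slice into pieces that each fit in one message. The helper name splitIntoChunks and the constant maxChunkSize are hypothetical and used for illustration only; they are not part of the article's code.

```go
package main

import "fmt"

// maxChunkSize is an illustrative chunk size (1 MB), chosen to stay
// below gRPC's default 4 MB limit for incoming messages.
const maxChunkSize = 1024 * 1024

// splitIntoChunks is a hypothetical helper that slices data into pieces of
// at most size bytes. The pieces share memory with data; nothing is copied.
func splitIntoChunks(data []byte, size int) [][]byte {
    if size <= 0 {
        return nil
    }
    var chunks [][]byte
    for len(data) > 0 {
        n := size
        if len(data) < n {
            n = len(data)
        }
        chunks = append(chunks, data[:n])
        data = data[n:]
    }
    return chunks
}

func main() {
    // stand-in for an audio file slightly larger than 2 MB
    audio := make([]byte, 2*maxChunkSize+512)
    for i, c := range splitIntoChunks(audio, maxChunkSize) {
        fmt.Printf("chunk %d: %d bytes\n", i, len(c))
    }
}
```

In a streaming client, each piece would become the Content of one Chunk message. Reading the file piece by piece, rather than loading it whole as this sketch does, keeps memory use flat for large files.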

enum StatusCode {
  Unknown = 0;
  Ok = 1;
  Failed = 2;
}
message transcriptResponse {
  string transcript = 1;
  StatusCode Code = 2;
}

Now, let's compile the .proto file with protoc to generate Go code, as we are going to write both the server and the client in Go.
Client:
First, we need a client for the defined LeopardService service.

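func main() {
    // assumes serverAddressArg and inputAudioPathArg are package-level flag variables
    flag.Parse()
    // WithInsecure disables transport security; suitable for local testing only
    conn, err := grpc.Dial(*serverAddressArg, grpc.WithInsecure())
    if err != nil {
        log.Fatalf("failed to connect: %v", err)
    }
    defer conn.Close()
    client := messaging.NewLeopardServiceClient(conn)
    // the audio file itself is opened inside runTranscriptionFile
    if err := runTranscriptionFile(client, *inputAudioPathArg); err != nil {
        log.Fatalf("transcription failed: %v", err)
    }
}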

Inside the runTranscriptionFile function, the audio file is read in chunks of 1 MB and transmitted to the server. A timeout of 10 seconds applies to the whole exchange. Finally, the stream is closed and the server response is received by calling the CloseAndRecv function.

func runTranscriptionFile(client messaging.LeopardServiceClient, filePath string) error {
    file, err := os.Open(filePath)
    if err != nil {
        return err
    }
    defer file.Close()

    // give up if sending the file and getting the reply takes longer than 10 seconds
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    stream, err := client.GetTranscriptionFile(ctx)
    if err != nil {
        return err
    }

    buf := make([]byte, 1024*1024) // 1 MB
    for {
        n, readErr := file.Read(buf)
        if n > 0 {
            // send the loaded bytes to the server
            if err := stream.Send(&messaging.Chunk{Content: buf[:n]}); err != nil {
                return err
            }
        }
        if readErr == io.EOF {
            break
        }
        if readErr != nil {
            return readErr
        }
    }
    // signal the server that it is done and ready to receive a response
    reply, err := stream.CloseAndRecv()
    if err != nil {
        return err
    }
    log.Printf("reply: %v", reply)
    return nil
}

Server:
On the server side, a gRPC server instance is created, and a LeopardService implementation is registered with it to answer incoming calls.

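func main() {
    // assumes port and accessKeyArg are package-level flag variables
    flag.Parse()
    addr := fmt.Sprintf("localhost:%d", *port)
    lis, err := net.Listen("tcp", addr)
    if err != nil {
        log.Fatalf("failed to listen on %s: %v", addr, err)
    }
    grpcServer := grpc.NewServer()
    messaging.RegisterLeopardServiceServer(grpcServer, newServer(*accessKeyArg))
    if err := grpcServer.Serve(lis); err != nil {
        log.Fatalf("failed to serve: %v", err)
    }
}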

After getting a transcription request, the server starts an instance of Leopard and keeps reading the incoming bytes until it reaches EOF. Then, the bytes are stored in a temporary file, which is passed to Leopard. Finally, the transcription is sent back to the client along with a status code.

func (s *leopardServer) GetTranscriptionFile(stream messaging.LeopardService_GetTranscriptionFileServer) error {
    // define an instance of Leopard and init it
    engine := leopard.NewLeopard(s.accessKey)
    if err := engine.Init(); err != nil {
        return err
    }
    defer engine.Delete()

    // keep reading bytes from the stream until the client closes it
    var audio []byte
    for {
        audioFileChunk, err := stream.Recv()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        audio = append(audio, audioFileChunk.Content...)
    }

    // default returned values if any error happens
    var transcription string
    statusCode := messaging.StatusCode_Failed

    // create a temporary file to store the received audio stream
    f, err := os.CreateTemp("", "audio_temp_file")
    if err == nil {
        defer os.Remove(f.Name())
        _, err = f.Write(audio)
        // close the file before Leopard opens it
        if closeErr := f.Close(); err == nil {
            err = closeErr
        }
    }
    if err == nil {
        // process the audio file without any preprocessing with the ProcessFile method of Leopard
        transcription, _, err = engine.ProcessFile(f.Name())
    }
    if err == nil {
        statusCode = messaging.StatusCode_Ok
    } else {
        log.Printf("transcription failed: %v", err)
    }
    // send back the result and close the stream connection
    return stream.SendAndClose(&messaging.TranscriptResponse{
        Transcript: transcription,
        Code:       statusCode,
    })
}

We could also have sent the audio in raw (PCM) format and fed it directly to Leopard without storing it, but there are two caveats:

  1. more preprocessing is needed on the client side to decode the audio file.
  2. the amount of data to be transferred is significantly larger for the raw format than for compressed formats such as MP3 or OGG.

Learn more about Leopard, and check out the open-source demos.