Homayoon Alimohammadi

Posted on • Originally published at Medium

Enhancing gRPC Error Handling in a Microservice Architecture

Error handling, and the way it is done, is a crucial part of software engineering. Poorly returned, non-informative errors can cause unimaginable headaches. In this article I’m going to demonstrate the problems I’ve faced with gRPC error handling and what we might be able to do in order to improve upon our originally not-so-useful gRPC errors.

I make lots of mistakes on a daily basis and I try to fix them and learn from them. If you’ve noticed one, I’d be more than grateful if you would correct me however you like.


Introduction

First of all, let’s start with explaining the actual problem that got me here. Inter-service communication (and the way it’s done) is probably one of the more important concerns in a microservice architecture. While there are certain concerns with using RPCs as a whole (which can be mitigated, as beautifully described by the Netflix Tech Blog), it can’t be denied that the self-documenting nature of request/response messages and RPCs (via protobuf), coupled with server reflection, has proved extremely comfortable and handy.

One of the most overlooked aspects of gRPC might be the way errors are returned from one service to another. The simplest way to return an error from an RPC might be as follows:

func (s *serverImpl) UnaryRPC(ctx context.Context, req *pb.Request) (*pb.Response, error) {
    // do something
    if somethingFailed {
        return nil, errors.New("custom error")
    }

    return resp, nil
}

The problem with the example above is that it provides nearly zero information to the caller about what went wrong and why. From the caller’s perspective, the error might be handled like this:

    resp, err := client.UnaryRPC(ctx, req)
    if err != nil {
      st, ok := status.FromError(err)
      if ok {
        if st.Message() == "custom error" {
          // handle custom error
        } else {
          // handle other errors
        }
      }
    }

Hopefully this is already bothering you a lot, simply because we are tightly coupled to an error string from another service, which might change at any moment and render our glorious error handling useless. Let’s take a look at how status.FromError works to see if we have any other useful information available:

    func FromError(err error) (s *Status, ok bool) {
      if err == nil {
        return nil, true
      }

      // doing some stuff with interface{ GRPCStatus() *Status } type
      // our simple error does not implement any custom interface
      // ...

      return New(codes.Unknown, err.Error()), false
    }

This signals to us that the other attributes of the status.Status (e.g. Code and Details) are simply meaningless here: Code is automatically set to Unknown and Details will be zero-valued.

  • You can safely use st := status.Convert(err), which simply skips the ok handling part, since the Go implementation of gRPC guarantees that all RPCs return status-type errors. We can also confirm this by heading to the http2_client.go operateHeaders method:
    func (t *http2Client) operateHeaders(frame *http2.MetaHeadersFrame) {
      // ...
      var (
        // ...
        grpcMessage string
        statusGen *status.Status
        rawStatusCode = codes.Unknown
        // ...
      )
      // ...
      for _, hf := range frame.Fields {
        switch hf.Name {
        // ...
        case "grpc-message":
          grpcMessage = decodeGrpcMessage(hf.Value)
        case "grpc-status":
          code, err := strconv.ParseInt(hf.Value, 10, 32)
          handle(err)
          rawStatusCode = codes.Code(uint32(code))
        case "grpc-status-details-bin":
          statusGen, err = decodeGRPCStatusDetails(hf.Value)
          handle(err)
        // ...
        }
        // ...
      } 
      // ...
      if statusGen == nil {
        statusGen = status.New(rawStatusCode, grpcMessage)
      }
      // ...
      t.closeStream(s, io.EOF, rst, http2.ErrCodeNo, statusGen, mdata, true)
    }

Let’s be polite and provide more information to our caller:

    func (s *serverImpl) UnaryRPC(ctx context.Context, req *pb.Request) (*pb.Response, error) {
      // do something
      if somethingFailed {
        return nil, status.Error(codes.Internal, "internal error")
      }

      return resp, nil
    }

Now, the client can also make important and useful assumptions about the returned code:

    resp, err := client.UnaryRPC(ctx, req)
    if err != nil {
      st := status.Convert(err)
      switch st.Code() {
      case codes.Internal:
        // handle internal error
      default:
        // handle other errors
      }
    }

As informative and descriptive as it might look, this method still falls short of explaining the reasons behind its (status) code and error.

Status Details

Returning status codes was the farthest we’d come with regard to returning descriptive errors from our microservices. But there are certain scenarios in which we simply can’t afford to rely solely on status codes. Why, you might ask? Let me explain.

Assume that we have two services: the first one, which interacts with our end users, is called Submit, and the other one, Storage, is considered an internal service, talking only to other services and not to external clients/customers or end users.

Consider the flow as follows: the end user tries to submit his/her data through our Submit service, which in turn tries to persist the user’s data by saving it in the Storage service (let’s say that Storage simply wraps some arbitrary DB).

In a peculiar scenario, due to some business logic, Storage refuses to save the data provided by Submit and returns the reason like below:

    func (s *storageService) Save(ctx context.Context, req *pb.Req) (*pb.Resp, error) {
      // the actual logic
      if cantSave {
        return nil, status.Error(codes.FailedPrecondition, "the fully described reason here")
      }  
      // ...
    }

Assuming that the described reason is specific to the Storage service’s domain, meaning that Submit won’t be able to reproduce the exact descriptive reason to show to the client, let’s explore the different ways Submit can handle this scenario and return some error to the end user.

Return the error as is

    func (s *submitService) Submit(ctx context.Context, req *pb.Req) (*pb.Resp, error) {
      resp, err := storageClient.Save(ctx, req)
      if err != nil {
        st := status.Convert(err)
        switch st.Code() {
        case codes.FailedPrecondition:
          return nil, st.Err()
        default:
          return nil, status.Error(codes.Internal, "oops!")
        }
      }
    }

While this method might simply solve our problem (the end user gets to see the fully descriptive error as is), a major headache is introduced as well: debugging this error will be an absolute hassle. Submit is actually returning this error, but good luck finding the text in its repo. Imagine doing the same thing with a couple of other services apart from Storage. Soon you’ll find yourself searching all your client services for some error string that seemingly Submit returns, scratching your head and regretting every decision you’ve made in your life. (Just think about what happens if Storage returns codes.FailedPrecondition in several scenarios, some with error strings containing private information and details that end users should not know about.)

Drop the original error and replace with our own

    func (s *submitService) Submit(ctx context.Context, req *pb.Req) (*pb.Resp, error) {
      resp, err := storageClient.Save(ctx, req)
      if err != nil {
        st := status.Convert(err)
        switch st.Code() {
        case codes.FailedPrecondition:
          return nil, errors.New("dear end user, you can not do that")
        }
      }
    }

This way we can at least make sure that Submit won’t return unwanted error messages or ones that are not present in the Submit repo, yet the returned error string is not exactly what we wanted our end user to see; it’s not descriptive enough. Also, we still have the same problem with multiple codes.FailedPrecondition errors as above.

Storage service provides further error details

In order to enable effective communication between these two services, it might be a good idea to structure the error details as a proto message. The first (and probably more obvious) approach is to extend the Storage response message and add an extra Error message inside it:

    message SaveError {
      int32 first_field = 1;
      bool second_field = 2;
    }

    message SaveResponse {
      string first = 1;
      string second = 2;
      SaveError error = 3;
    }

    service Storage {
      rpc Save (SaveRequest) returns (SaveResponse);
    }

This way, the SaveError message provides enough information for the Submit service to reproduce the descriptive error string and finally show it to the end user.

    func (s *submitService) Submit(ctx context.Context, req *pb.Req) (*pb.Resp, error) {
      resp, err := storageClient.Save(ctx, req)
      if err != nil {
        // some other error
      }
      if resp.Error != nil {
        return nil, createCustomError(resp.Error)  
      }
    }

But there’s something off with this kind of error handling. Handling two sources of error from a single RPC doesn’t sound like a good idea and might introduce unnecessary complexity and confusion (it suggests that our client call might still have failed even though the RPC itself was successful!).

What if we want the Storage service to provide the much-needed error details, but not exactly in the body of the response? We can add Details to our status in the form of proto messages:

    func (s *storageService) Save(ctx context.Context, req *pb.Req) (*pb.Resp, error) {
      // some logic
      if cantSave {
        st := status.New(codes.FailedPrecondition, "any grpc message")
        st, err := st.WithDetails(&pb.SaveError{...})
        handle(err)  
        return nil, st.Err()  
      }
    }

The st.WithDetails() method should be given instances of proto.Message. Feel free to pass your own custom proto messages, or use the errdetails package. From the caller’s perspective, the details can be retrieved just as simply:

    func (s *submitService) Submit(ctx context.Context, req *pb.Req) (*pb.Resp, error) {
      resp, err := storageClient.Save(ctx, req)
      if err != nil {
        st := status.Convert(err)
        for _, d := range st.Details() {
          switch detail := d.(type) {
          case *pb.SaveError:  
            return nil, createCustomError(detail)  
          default:
            // other details
          }
        }
      }
    } 

Now we have the luxury of not only conditioning on the status code, but also relying on a documented and agreed-upon detail, without being overly dependent on any other service. Yay!

More on status details in this wonderful article by Johan Brandhorst.

Status under the hood

We’ve seen how the client transport derives a status from any gRPC error, so we might have a rough idea of how the situation is handled from the server’s point of view, i.e. what happens under the hood when we return a status.Status with Code, Message and Details. If we take a look at the http2_server.go WriteStatus method, we can see how these gRPC-specific headers are written in action:

    func (t *http2Server) WriteStatus(s *Stream, st *status.Status) error {
      // ...
      headerFields := make([]hpack.HeaderField, 0, 2)
      // ...
      headerFields = append(headerFields, hpack.HeaderField{
        Name: "grpc-status", 
        Value: strconv.Itoa(int(st.Code())),
      })
      headerFields = append(headerFields, hpack.HeaderField{
        Name: "grpc-message",
        Value: encodeGrpcMessage(st.Message()),
      })

      if p := st.Proto(); p != nil && len(p.Details) > 0 {
        stBytes, err := proto.Marshal(p)
        handle(err)
        headerFields = append(headerFields, hpack.HeaderField{
          Name: "grpc-status-details-bin",
          Value: encodeBinHeader(stBytes),
        })
      }  
      // use the headerFields ...
    }

gRPC uses “Trailers” to convey its final status, as well as any messages and/or details in case of an error. Trailers are just a subset of headers that are sent after the response body (in contrast to the actual response headers, which are sent before the response body in the HTTP protocol). More about these trailers is available in Carl Mastrangelo’s “Why does gRPC insist on Trailers?” and SoByte’s “Talking about gRPC’s Trailers Design”, both of which I really encourage you to take a look at; they’re extraordinary. Here are a couple of quotes from the latter:

So the question is, why does gRPC rely on Trailers? The core reason is to support a streaming interface. Because it is a streaming interface, it is not possible to determine the length of the data in advance, and it is not possible to use the HTTP Content-Length header. The corresponding HTTP request looks like this.

    GET /data HTTP/1.1
    Host: example.com

    HTTP/1.1 200 OK
    Server: example.com

    abc123

What? Uncertain length? Carl points out that using chunked is ambiguous. He gives the following example.

    GET /data HTTP/1.1
    Host: example.com

    HTTP/1.1 200 OK
    Server: example.com
    Transfer-Encoding: chunked

    6\r\n
    abc123\r\n
    0\r\n

Suppose there is a proxy before the client and the server. The proxy receives the response and starts forwarding the data to the client. The first thing is to send the Header part to the client, so the caller determines that this time the status code is 200, which is successful. Then the data part is forwarded paragraph by paragraph. If the server is down after the proxy forwards the first abc123, what signal does the proxy need to send to the client?
Because the status code has been sent, there is no way to change 200 to 5xx. You can’t send 0\r\n directly to end the chunked transfer, so that the client learns that the server has quit abnormally. The only thing you can do is to close the corresponding underlying connection directly, but this will consume additional resources as the client creates a new connection. So we needed to find a way to notify the client of the server error while reusing the underlying connection as much as possible. The gRPC team finally decided to use Trailers for transport.

Trailers in action

Imagine that we have a server-side streaming RPC like below. It basically sends 3 valid response messages and then an error:

    func (s *serverImpl) ServerStream(req *pb.GreetRequest, stream pb.Greeter_ServerStreamServer) error {
      ctx := stream.Context()
      for i := 0; i < 5; i++ {
        select {
        case <-ctx.Done():
          return status.Error(codes.DeadlineExceeded, "deadline exceeded")
        default:
          if i == 3 {
            st := status.New(codes.Internal, "something went wrong")
            st, _ = st.WithDetails(&errdetails.ErrorInfo{
              Reason: "some random reason",
              Domain: "some.random.domain",
              Metadata: map[string]string{
                "first": "something",
                "second": "another thing",
              },
            })
            return st.Err()
          }

          err := stream.Send(&pb.GreetResponse{Greet: fmt.Sprintf("hello %s!", req.Name)})
          if err != nil {
            log.Print("failed to send: ", err)
          }
        }
      }

      return nil
    }

After calling this method with a client (I chose a simple gRPC client written in Go), we can take a look at what’s being transmitted in something like Wireshark (I’ve omitted certain lines to reduce unwanted clutter):

    POST /test.Greeter/ServerStream

    HyperText Transfer Protocol 2
        Stream: DATA, Stream ID: 1, Length 15
            Flags: 0x01
                .... ...1 = End Stream: True

    Protocol Buffers: /test.Greeter/ServerStream,request
        Message: test.GreetRequest
            [Message Name: test.GreetRequest]
            Field(1): name = homayoon (string)

In the response stream, first we’ll receive the following packet:

    HyperText Transfer Protocol 2
        Stream: HEADERS, Stream ID: 1, Length 14, 200 OK
            Flags: 0x04
                .... ...0 = End Stream: False
                .... .1.. = End Headers: True
            Header: :status: 200 OK
            Header: content-type: application/grpc
        Stream: DATA, Stream ID: 1, Length 22
            Flags: 0x00
                .... ...0 = End Stream: False
            0... .... .... .... .... .... .... .... = Reserved: 0x0
            .000 0000 0000 0000 0000 0000 0000 0001 = Stream Identifier: 1
            [Pad Length: 0]
            DATA payload (22 bytes)

    Protocol Buffers: /test.Greeter/ServerStream,response
        Message: test.GreetResponse
            [Message Name: test.GreetResponse]
            Field(1): greet = hello homayoon! (string)

followed by two other packets without the :status: 200 OK header (since it was already sent in the first one):

    HyperText Transfer Protocol 2
        Stream: DATA, Stream ID: 1, Length 22
            Flags: 0x00
                .... ...0 = End Stream: False

    Protocol Buffers: /test.Greeter/ServerStream,response
        Message: test.GreetResponse
            [Message Name: test.GreetResponse]
            Field(1): greet = hello homayoon! (string)

Finally, we get the packet that contains the gRPC trailers (notice the End Stream: True flag):

    HyperText Transfer Protocol 2
        Stream: HEADERS, Stream ID: 1, Length 226
            Flags: 0x05
                .... ...1 = End Stream: True
                .... .1.. = End Headers: True
            Header: grpc-status: 13
            Header: grpc-message: something went wrong
            Header: grpc-status-details-bin: CA0SFHNvbWV0aGluZyB3ZW50IHdyb25nGoEBCih0eXBlLmdvb2dsZWFwaXMuY29tL2dvb2dsZS5ycGMuRXJyb3JJbmZvElUKEnNvbWUgcmFuZG9tIHJlYXNvbhISc29tZS5yYW5kb20uZG9tYWluGhIKBWZpcnN0Eglzb21ldGhpbmcaFwoGc2Vjb25kEg1hbm90aGVyIHRoaW

Needless to say, we could easily retrieve the original details from grpc-status-details-bin in the gRPC client, but let’s do it the hard way and just copy the base64-encoded string above (rather than retrieving it in the gRPC client):

    import (
      "encoding/base64"
      "encoding/json"
      "fmt"

      "google.golang.org/genproto/googleapis/rpc/errdetails"
      spb "google.golang.org/genproto/googleapis/rpc/status"
      "google.golang.org/protobuf/proto"
    )

    d := "CA0SFHNvbWV0aGluZyB3ZW50IHdyb25nGoEBCih0eXBlLmdvb2dsZWFwaXMuY29tL2dvb2dsZS5ycGMuRXJyb3JJbmZvElUKEnNvbWUgcmFuZG9tIHJlYXNvbhISc29tZS5yYW5kb20uZG9tYWluGhIKBWZpcnN0Eglzb21ldGhpbmcaFwoGc2Vjb25kEg1hbm90aGVyIHRoaW5n"
    b, _ := base64.StdEncoding.DecodeString(d)
    st := &spb.Status{}
    _ = proto.Unmarshal(b, st)
    for _, dt := range st.Details {
      errD := &errdetails.ErrorInfo{}
      _ = dt.UnmarshalTo(errD)
      b, _ := json.MarshalIndent(errD, "", "  ")
      fmt.Println(string(b))
    }

Voila!

    {
      "reason": "some random reason",
      "domain": "some.random.domain",
      "metadata": {
        "first": "something",
        "second": "another thing"
      }
    }

Conclusion

In this article we explored how to provide descriptive and informative errors in a microservice architecture that communicates via gRPC. We also took a look at how gRPC statuses are handled under the hood. Finally, we discussed gRPC trailers and experimented with a mock server to better understand how and when these headers/trailers are passed. I hope you’ve found this article helpful. I’d be grateful if you would like and share it as well.
