DEV Community

Kyle Gallatin for AWS Community Builders

Posted on • Originally published at

Exposing Tensorflow Serving’s gRPC Endpoints on Amazon EKS

Tensorflow serving is a popular way to package and deploy models trained in the tensorflow framework for real-time inference. Using the official docker image and a trained model, you can almost instantaneously spin up a container exposing REST and gRPC endpoints to make predictions.

Most of the examples and documentation on tensorflow serving focus on the popular REST endpoint usage. Very few focus on how to adapt and use the gRPC endpoint for their own use case — and fewer mention how this works when you scale to kubernetes.

Rare footage of a real live docker whale (photo by Todd Cravens on Unsplash)

In this post I’ll scratch the surface of gRPC, kubernetes/EKS, nginx, and how we can use them with tensorflow serving.

Why gRPC over REST?

There are plenty of reasons. First off, gRPC uses the efficient HTTP/2 protocol as opposed to classic HTTP/1.1. It also uses language-neutral, serialized protocol buffers instead of JSON, which avoids the overhead of serializing and deserializing large JSON payloads.
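To get a rough feel for the payload difference, here is a small sketch that encodes the same 1,000 floats as JSON text and as packed 32-bit binary values. Python's struct module is only a stand-in for protobuf here (an assumption for illustration; a real protobuf message adds some framing on top), the values are made up, and the "instances" key mimics a REST request body.

```python
import json
import struct

# 1,000 made-up float inputs, standing in for a flattened batch of features.
values = [i * 0.123456789 for i in range(1000)]

# REST-style body: every number is written out as decimal text.
json_payload = json.dumps({"instances": values}).encode("utf-8")

# Binary stand-in for protobuf: each value packed as a 4-byte little-endian float32.
binary_payload = struct.pack(f"<{len(values)}f", *values)

print(f"JSON bytes:   {len(json_payload)}")
print(f"binary bytes: {len(binary_payload)}")
```

The binary form is a fixed 4 bytes per value, while the JSON size depends on how many digits each number needs, and it still has to be parsed back from text on the server.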

Some talk about the benefits of the API definition and design, and others about the efficiencies of HTTP/2 in general. Here’s my experience with gRPC with regard to machine learning deployments:

  • gRPC is crazy efficient. It can dramatically reduce inference time and the overhead of large JSON payloads by using protobufs

  • gRPC in production has a pretty big learning curve and was a massive pain in the a** to figure out

In short, gRPC can offer huge performance benefits. But as a more casual API developer, I can definitely say no, it is not “easier” than REST. For someone new to gRPC, HTTP/2 with nginx, TLS/SSL and tensorflow serving, there was a lot to figure out.

Still the benefits are worth it. I saw up to an 80% reduction in inference time with large batch sizes doing initial load tests. For groups with strict service level agreements (SLAs) around inference time, gRPC can be a life saver. While I won’t explain gRPC in more depth here, the internet is littered with helpful tutorials to get you started.

Set Up a Kubernetes Cluster on AWS

Let’s get started and set up a kubernetes cluster. We’ll use AWS EKS and the eksctl command line utility. For the most part, you can also follow along on kubernetes in Docker Desktop if you don’t have an AWS account.

If you don’t have eksctl or aws-cli installed already you can use the docker image in this repository. It also comes with kubectl for interacting with our cluster. First, build and run the container.

docker build -t eksctl-cli .
docker run -it eksctl-cli

Once you’ve started a bash session in the new container, sign in to AWS with your preferred user and AWS access key IDs:

aws configure

You’ll need a key, secret key, default region name (I use us-east-1) and output format like json. Now we can check on the status of our clusters.

eksctl get clusters

If, like me, you have no active clusters, you’ll get No clusters found. In that case, let’s create one.

eksctl create cluster \
--name test \
--version 1.18 \
--region us-east-1 \
--nodegroup-name linux-nodes \
--nodes 1 \
--nodes-min 1 \
--nodes-max 2 \
--with-oidc

You can vary the parameters if you like, but as this is just an example I’ve left the cluster size small. This may take a little while. If you go to the console you’ll see your cluster being created.

Make sure you’re logged in with the proper user and looking at the right region

Now that our cluster is up, we can install nginx for ingress and load balancing. First, let’s check that kubectl is working as expected. Here are some example commands.

kubectl config get-contexts
kubectl get nodes

You should see the current cluster context and the node(s) created by our command. You can view similar content in the console if you’re signed in with the proper permissions (i.e. as the user you created the cluster with).

Cluster’s all good, so let’s install nginx. Going back to the command line in our docker container:

kubectl apply -f

If you’re using a kubernetes provider other than AWS, check the nginx installation instructions (there is one for Docker Desktop k8s). This command will create all the necessary resources for nginx. You can check it is up and running by looking at the pods in the new namespace.

kubectl get po -n ingress-nginx

What we’re most interested in is whether we’ve created a LoadBalancer for our cluster with an externally reachable IP address. If you run this command:

kubectl get svc -n ingress-nginx

You should see that nginx-ingress-controller has an external IP (on AWS, the hostname of an ELB). If you copy and paste that into a browser, you should see this.
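If you’d rather script this check, adding -o json to the command above gives the same information as JSON. Here is a minimal sketch that pulls the external address out of that structure. The function name and the sample data (service names and hostname) are made up for illustration; only the status.loadBalancer.ingress layout comes from the Kubernetes Service object.

```python
def external_addresses(service_list: dict) -> dict:
    """Map each LoadBalancer service name to its external hostname or IP."""
    found = {}
    for svc in service_list.get("items", []):
        if svc.get("spec", {}).get("type") != "LoadBalancer":
            continue
        entries = svc.get("status", {}).get("loadBalancer", {}).get("ingress", [])
        for entry in entries:
            # AWS load balancers report a hostname; other providers may report an IP.
            address = entry.get("hostname") or entry.get("ip")
            if address:
                found[svc["metadata"]["name"]] = address
                break
    return found


# Made-up sample shaped like `kubectl get svc -n ingress-nginx -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "ingress-nginx-controller"},
            "spec": {"type": "LoadBalancer"},
            "status": {
                "loadBalancer": {
                    "ingress": [{"hostname": "abc123.us-east-1.elb.amazonaws.com"}]
                }
            },
        },
        {
            "metadata": {"name": "ingress-nginx-controller-admission"},
            "spec": {"type": "ClusterIP"},
            "status": {"loadBalancer": {}},
        },
    ]
}

print(external_addresses(sample))
```

To run it against a real cluster, you could pipe the kubectl output into the script and replace the sample with json.load(sys.stdin).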

Never been so happy to get a 404

Hell yeah! I know it doesn’t look good but what we’ve actually done is create proper ingress into our cluster and we can expose things to the ~web~.

Deploy Tensorflow Serving on Kubernetes

Tensorflow has some passable docs on deploying models to kubernetes, and it’s easy to create your own image for serving a custom model. For ease, I’ve pushed the classic half_plus_two model to dockerhub so that anyone can pull it for this demo.

The following YAML defines the deployment and service for a simple k8s app that exposes the tfserving grpc endpoint.
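As a rough idea of what such a Deployment and Service could contain, here is a minimal sketch that builds both as Python dicts and prints them as JSON (kubectl accepts JSON manifests as well as YAML). The service name tfserving-service and port 8500 match the commands used later in this post; the image name, labels and deployment name are placeholders I made up, not the actual values from the demo.

```python
import json

APP_LABELS = {"app": "tfserving"}        # hypothetical label
IMAGE = "example/half-plus-two:latest"   # placeholder image, swap in your own
GRPC_PORT = 8500                         # tensorflow serving's gRPC port

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "tfserving-deployment"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": APP_LABELS},
        "template": {
            "metadata": {"labels": APP_LABELS},
            "spec": {
                "containers": [
                    {
                        "name": "tfserving",
                        "image": IMAGE,
                        # Only the gRPC port is exposed, not the REST port.
                        "ports": [{"containerPort": GRPC_PORT}],
                    }
                ]
            },
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "tfserving-service"},
    "spec": {
        "selector": APP_LABELS,
        "ports": [{"name": "grpc", "port": GRPC_PORT, "targetPort": GRPC_PORT}],
    },
}

# A List lets both objects travel in one document.
manifest = {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
print(json.dumps(manifest, indent=2))
```

Piping the printed JSON into kubectl apply -f - would create both objects in the current namespace.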

To deploy it in our cluster, simply apply the raw YAML from the command line.

kubectl apply -f

Another kubectl get po will show us whether our pod has been created successfully. To make sure the servers started, check the logs. You should see lines like Running gRPC ModelServer at ... and Exporting HTTP/REST API. The gRPC endpoint is the only one we’ve made available via our pod/service.

kubectl logs $POD_NAME 

Before exposing through nginx, let’s make sure the service works. Forward the tensorflow serving service to your localhost.

kubectl port-forward service/tfserving-service 8500:8500 &

There are multiple ways we can check that the service is running. The simplest is just establishing an insecure connection with the grpc Python client.
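Here is a minimal sketch of such a check using the grpcio package. The name grpc_server_on matches the call in the shell session in this section, but its body, the open_insecure_channel helper and the timeout value are my assumptions about how it could be written.

```python
def open_insecure_channel(host: str = "localhost", port: int = 8500):
    """Open a plaintext channel to the port-forwarded service."""
    import grpc  # third-party (pip install grpcio)

    return grpc.insecure_channel(f"{host}:{port}")


def grpc_server_on(channel, timeout_seconds: float = 5.0) -> bool:
    """Return True if the channel becomes ready before the timeout."""
    import grpc  # third-party (pip install grpcio)

    try:
        # Blocks until the channel is connected or the timeout expires.
        grpc.channel_ready_future(channel).result(timeout=timeout_seconds)
        return True
    except grpc.FutureTimeoutError:
        return False
```

With the port-forward running, channel = open_insecure_channel() gives you a channel to pass to grpc_server_on.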

If you enter the Python shell in the docker container and run the above, you should be able to connect to the grpc service insecurely.

>>> grpc_server_on(channel)
Handling connection for 8500

This means the service is working as expected. Let’s take a look at the more tensorflow-specific methods we can call. There should be a script for this in your docker container (if following along elsewhere, here’s the link). Let’s run that and inspect the output.
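As a rough idea of what a metadata script like this might contain, here is a minimal sketch using the tensorflow-serving-api package. The function name fetch_model_metadata and the timeout are hypothetical; the model name "model" is taken from the prediction response shown later in this section.

```python
def fetch_model_metadata(channel, model_name: str = "model",
                         timeout_seconds: float = 10.0):
    """Ask tensorflow serving for the model's signature definitions."""
    # third-party (pip install tensorflow-serving-api)
    from tensorflow_serving.apis import get_model_metadata_pb2
    from tensorflow_serving.apis import prediction_service_pb2_grpc

    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = get_model_metadata_pb2.GetModelMetadataRequest()
    request.model_spec.name = model_name
    # "signature_def" asks for the input and output names, dtypes and shapes.
    request.metadata_field.append("signature_def")

    return stub.GetModelMetadata(request, timeout=timeout_seconds)
```

Printing the result of fetch_model_metadata(channel) on an open channel shows the signature definitions the server knows about.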


Whoa, lots of unwieldy and inconvenient information. The most informative part is the {'inputs': 'x'... portion, which helps us formulate the proper prediction request. Note: what we’re actually doing in these Python scripts is using tensorflow-provided libraries to generate prediction protobufs and send them over our insecure gRPC channel.

Let’s use this information to make a prediction over gRPC. You should also have a file in your current directory.
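For reference, a prediction script along these lines might look like the sketch below. It assumes the tensorflow and tensorflow-serving-api packages; the function name, the timeout and the input values are my own choices (half_plus_two computes y = x / 2 + 2, so these inputs would line up with the 2.5, 3.0 and 4.5 in the response in this section), while the input key "x", the model name "model" and the signature serving_default come from the metadata and response shown here.

```python
INPUTS = (1.0, 2.0, 5.0)  # assumed example inputs for half_plus_two


def predict(channel, values=INPUTS, model_name: str = "model",
            timeout_seconds: float = 10.0):
    """Send a PredictRequest for the given values over an open channel."""
    # third-party (pip install tensorflow tensorflow-serving-api)
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2
    from tensorflow_serving.apis import prediction_service_pb2_grpc

    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    # "x" is the input name reported by the model metadata.
    request.inputs["x"].CopyFrom(
        tf.make_tensor_proto(list(values), dtype=tf.float32, shape=[len(values)])
    )

    return stub.Predict(request, timeout=timeout_seconds)
```

Printing the return value of predict(channel) gives the protobuf text form of the response.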

Run that and inspect the output. You’ll see a json-like response (it’s actually a tensorflow object).

outputs {
  key: "y"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 3
      }
    }
    float_val: 2.5
    float_val: 3.0
    float_val: 4.5
  }
}
model_spec {
  name: "model"
  version {
    value: 1
  }
  signature_name: "serving_default"
}

We’ve made our first predictions over gRPC! Amazing. All of this is pretty poorly documented on the tensorflow serving side in my opinion, which can make it difficult to get started.

In the next section, we’ll top it off by actually exposing our gRPC service via nginx and our public URL.

Exposing gRPC Services via Nginx

The key thing that’ll be different about going the nginx route is that we can no longer establish insecure connections. We will be required by nginx to provide TLS encryption for our domain over port 443 in order to reach our service.

Since nginx enables the HTTP/2 protocol on port 443 by default, we shouldn’t have to make any changes there. However, if you have existing REST services on port 80, you may want to disable ssl redirects in the nginx configuration.

kubectl edit configmap -n ingress-nginx ingress-nginx-controller

Then add:

  "ssl-redirect": "false"

And save.

Create a TLS Secret

To create TLS for your domain, you can do something akin to this. Proper TLS/SSL involves a certificate authority (CA), but that’s out of scope for this article.

First create a cert directory and then create a conf file.

Edit DNS.1 so that it reflects the actual hostname of your ELB (the URL we used earlier to see nginx in browser). You don’t need to edit CN.
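If it helps, here is a minimal sketch that writes such a conf file for a given hostname. The section layout is a common OpenSSL pattern and an assumption on my part rather than the exact file from this post; the req_ext section name matches the -extensions flag used in the openssl command, and CN is left as a placeholder since it doesn’t need editing.

```python
from pathlib import Path

CONF_TEMPLATE = """\
[req]
default_bits = 2048
prompt = no
distinguished_name = req_distinguished_name
req_extensions = req_ext

[req_distinguished_name]
CN = localhost

[req_ext]
subjectAltName = @alt_names

[alt_names]
DNS.1 = {hostname}
"""


def render_cert_conf(hostname: str) -> str:
    """Fill the load balancer hostname into the DNS.1 entry."""
    return CONF_TEMPLATE.format(hostname=hostname)


def write_cert_conf(hostname: str, directory: str = "cert") -> Path:
    """Create the cert directory if needed and write cert.conf into it."""
    cert_dir = Path(directory)
    cert_dir.mkdir(parents=True, exist_ok=True)
    conf_path = cert_dir / "cert.conf"
    conf_path.write_text(render_cert_conf(hostname))
    return conf_path
```

Calling write_cert_conf with your ELB hostname produces cert/cert.conf, the path the openssl command reads.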

Using this, create a key and cert.

openssl genrsa -out cert/server.key 2048
openssl req -nodes -new -x509 -sha256 -days 1825 -config cert/cert.conf -extensions 'req_ext' -key cert/server.key -out cert/server.crt

Then use those to create a new kubernetes secret in the default namespace. We will use this secret in our ingress object.

kubectl create secret tls ${CERT_NAME} --key ${KEY_FILE} --cert ${CERT_FILE}

Finally we create the ingress object with necessary annotations for grpc. Note that we specify the tls-secret in the ingress. We’re also doing path rewrite and exposing our service on /service1. By segmenting our services by path, we can expose more than 1 gRPC service via nginx.
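A rough sketch of what an ingress like this could look like, built as a Python dict and printed as JSON (kubectl accepts JSON as well as YAML). Treat the details as assumptions: the host is a placeholder, the regex path and rewrite target are one way to map /service1 onto tensorflow serving’s default route with ingress-nginx, and the v1beta1 API version matches the 1.18 cluster created earlier (newer clusters use networking.k8s.io/v1, which lays out the backend differently).

```python
import json

HOST = "my-elb.example.com"  # placeholder, use your own load balancer hostname

ingress = {
    "apiVersion": "networking.k8s.io/v1beta1",  # matches a 1.18 cluster
    "kind": "Ingress",
    "metadata": {
        "name": "tfserving-ingress",
        "annotations": {
            "kubernetes.io/ingress.class": "nginx",
            # Tell nginx to speak gRPC to the backend service.
            "nginx.ingress.kubernetes.io/backend-protocol": "GRPC",
            # Map /service1/<method> onto tensorflow serving's default route.
            "nginx.ingress.kubernetes.io/rewrite-target":
                "/tensorflow.serving.PredictionService/$2",
        },
    },
    "spec": {
        "tls": [{"hosts": [HOST], "secretName": "tls-secret"}],
        "rules": [
            {
                "host": HOST,
                "http": {
                    "paths": [
                        {
                            "path": "/service1(/|$)(.*)",
                            "backend": {
                                "serviceName": "tfserving-service",
                                "servicePort": 8500,
                            },
                        }
                    ]
                },
            }
        ],
    },
}

print(json.dumps(ingress, indent=2))
```

Adding a second gRPC service would then mean adding another path (say /service2) with its own backend and rewrite.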

Replace the host: value with your URL. You can apply the yaml above and then edit the resulting object, or vice versa. This way is pretty easy:

kubectl apply -f
kubectl edit ingress tfserving-ingress

You now have an ingress in the default namespace. We can’t connect like before, with an insecure connection, as this endpoint specifies TLS. We have to establish a secure connection using the certificate we just created.

The nice part is, if we have this cert we can now connect from anywhere — we no longer need kubectl to port forward the service to our container or local machine. It’s publicly accessible.

Replace the crt_path with the cert you’ve generated and replace the host with your URL.
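Here is a minimal sketch of a secure metadata call on a custom route, assuming the grpcio and tensorflow-serving-api packages. The name tfserving_metadata and the crt_path, host and route arguments follow the text in this section; the function body, the method_path helper, port 443 and the timeout are my assumptions about how it could be written.

```python
def method_path(route: str, rpc_name: str) -> str:
    """Join a service route and an RPC name into a gRPC method path."""
    return f"{route.rstrip('/')}/{rpc_name}"


def tfserving_metadata(host, crt_path, route="/service1",
                       model_name="model", timeout_seconds=10.0):
    """Fetch model metadata over TLS, calling the service on a custom route."""
    import grpc  # third-party (pip install grpcio)
    # third-party (pip install tensorflow-serving-api)
    from tensorflow_serving.apis import get_model_metadata_pb2

    # Trust the self-signed certificate generated earlier.
    with open(crt_path, "rb") as crt_file:
        credentials = grpc.ssl_channel_credentials(root_certificates=crt_file.read())

    with grpc.secure_channel(f"{host}:443", credentials) as channel:
        # Build the call by hand so the method path uses our route instead of
        # the default /tensorflow.serving.PredictionService prefix.
        get_metadata = channel.unary_unary(
            method_path(route, "GetModelMetadata"),
            request_serializer=get_model_metadata_pb2.GetModelMetadataRequest.SerializeToString,
            response_deserializer=get_model_metadata_pb2.GetModelMetadataResponse.FromString,
        )

        request = get_model_metadata_pb2.GetModelMetadataRequest()
        request.model_spec.name = model_name
        request.metadata_field.append("signature_def")
        return get_metadata(request, timeout=timeout_seconds)
```

Building the call with channel.unary_unary rather than the generated stub is a design choice here: the stub hard-codes the default service path, while the hand-built call lets the route be an argument.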

Note the custom route we’ve exposed our service on. Tensorflow serving services are available on /tensorflow.serving.PredictionService by default, but this makes it difficult to add new services if we’re exposing them all on the same domain.

gRPC only connects to a host and port — but we can use whatever service route we want. Above I use the path we configured in our k8s ingress object: /service1, and overwrite the base configuration provided by tensorflow serving. When we call the tfserving_metadata function above, we specify /service1 as an argument.

This also applies to making predictions: we can easily establish a secure channel and make predictions to our service over it using the right host, route and cert.

Again, we overwrite the route with our custom path, convert our data into a tensor proto and make a prediction! Easy as hell…haha not really. You won’t find any of this in the tensorflow docs (or at least I didn’t), and the addition of unfamiliar tools like TLS, gRPC and HTTP/2 with nginx makes it even more difficult.

When you’re done, delete your cluster so you don’t get charged.

eksctl delete cluster test

Ay We’re Done

Hope you find this helpful. It was definitely difficult to piece together from various articles, source code, Stack Overflow questions and GitHub issues.

Probably worth noting we didn’t do any actual CI/CD here, and I wouldn’t recommend kubectl edit in production — but those are topics for other articles. If I used any terminology improperly, or there’s something you think could be explained better please reach out! Happy to connect on LinkedIn or twitter or just go at it in the Medium comments section.

