<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kyle Gallatin</title>
    <description>The latest articles on DEV Community by Kyle Gallatin (@kylegallatin).</description>
    <link>https://dev.to/kylegallatin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F577116%2F3bc43dd5-5173-4fb0-a123-71ca26ece1cc.JPG</url>
      <title>DEV Community: Kyle Gallatin</title>
      <link>https://dev.to/kylegallatin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kylegallatin"/>
    <language>en</language>
    <item>
      <title>The Fastest Way to Deploy Your ML App on AWS with Zero Best Practices</title>
      <dc:creator>Kyle Gallatin</dc:creator>
      <pubDate>Sat, 20 Mar 2021 14:30:50 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-fastest-way-to-deploy-your-ml-app-on-aws-with-zero-best-practices-1p8h</link>
      <guid>https://dev.to/aws-builders/the-fastest-way-to-deploy-your-ml-app-on-aws-with-zero-best-practices-1p8h</guid>
      <description>&lt;h1&gt;
  
  
  Publish a Machine Learning API to the Internet in like…15 Minutes
&lt;/h1&gt;

&lt;p&gt;You’ve been working on your ML app and a live demo is coming up fast. You wanted to push it to GitHub, add Docker, and refactor the code, but you spent all day yesterday on some stupid pickle error you still don’t really understand. Now you have only 1 hour before the presentation and your ML app needs to be available on the internet. You just need to be able to reach your model via a browser for 15 seconds to show some executives.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tuO9NDaQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/6962/0%2AIcd0CYpJwheyXAFJ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tuO9NDaQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/6962/0%2AIcd0CYpJwheyXAFJ" alt="Photo by [Braden Collum](https://unsplash.com/@bradencollum?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@bradencollum?utm_source=medium&amp;amp;utm_medium=referral"&gt;Braden Collum&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This tutorial is the “zero best practices” way to create a public endpoint for your model on AWS. In my opinion, this is one of the shortest paths to creating a model endpoint, assuming you don’t have any other tooling set up (SCM, Docker, SageMaker) and have written a small application in Python.&lt;/p&gt;

&lt;p&gt;You’ll need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The AWS console&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A terminal&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your ML app&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you don’t have an ML app and just want to follow along, here’s &lt;a href="https://github.com/kylegallatin/fast-bad-ml"&gt;the one I wrote this morning&lt;/a&gt;. My app was built with FastAPI because it was….fast…but this will work for any Flask/Dash/Django/Streamlit/whatever app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an EC2 Instance
&lt;/h2&gt;

&lt;p&gt;Log into the console, search “EC2” and navigate to the instances page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HKlEyOak--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4676/1%2AwzdH8TxsDzxdvjTjL8hPHQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HKlEyOak--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4676/1%2AwzdH8TxsDzxdvjTjL8hPHQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “launch instance”. Now you have to select a machine type. You’re in a rush, so you just click the first eligible one you see (Amazon Linux 2 AMI).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M28ek_Rj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5632/1%2A0L6fC4SeX85HqeAQ9fGH_Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M28ek_Rj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5632/1%2A0L6fC4SeX85HqeAQ9fGH_Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You decide to keep the default settings (leave it as a t2.micro) and click “Review and Launch”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WTVLOrIY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5624/1%2A47H5-yg7TuEh-N0p7TTpcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WTVLOrIY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5624/1%2A47H5-yg7TuEh-N0p7TTpcg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a new key pair, click “Download Key Pair” and then launch the instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RmwJhqdv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2776/1%2Auxuw_l0wMCnvkhOEYQ2Oxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RmwJhqdv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2776/1%2Auxuw_l0wMCnvkhOEYQ2Oxg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Port 80
&lt;/h2&gt;

&lt;p&gt;While we’re here in the console, let’s open port 80 to web traffic. Navigate back to the instances page and click on your instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b1DyO77d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4464/1%2AdO7Sxv9wRKRAz21xZgPqkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b1DyO77d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4464/1%2AdO7Sxv9wRKRAz21xZgPqkw.png" alt="I’ve already deleted this"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to the “Security” tab. Under security groups there should be a link you can click on that looks like sg-randomletters (launch-wizard-3). On the next page, scroll to the bottom and go to “Edit Inbound Rules”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cFqA_hgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4536/1%2AeMEsv6y6-wYs1DPJgWQODg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cFqA_hgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4536/1%2AeMEsv6y6-wYs1DPJgWQODg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add an all-traffic rule with the 0.0.0.0/0 CIDR block and then save. (This opens the instance to the entire internet; zero best practices, remember?)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hv_y6eN4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5236/1%2Akn_19Com8iu9vks9An6vag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hv_y6eN4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5236/1%2Akn_19Com8iu9vks9An6vag.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy Your Files to the Instance
&lt;/h2&gt;

&lt;p&gt;Now our instance is ready to go, so let’s get our files over. To make it easy we can set two environment variables. &lt;code&gt;KEY&lt;/code&gt; is just the path to the &lt;code&gt;.pem&lt;/code&gt; file you downloaded earlier, and &lt;code&gt;HOST&lt;/code&gt; is the Public IPv4 DNS name you can view on the instances page.&lt;/p&gt;

&lt;p&gt;Edit the example below to contain your information. These commands assume macOS/Linux; for Windows, check out &lt;a href="https://www.putty.org/"&gt;PuTTY&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ec2-35-153-79-254.compute-1.amazonaws.com
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/PATH/TO/MY/KEY.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can copy our files over. Again if you don’t have an ML app, clone my slapdash one, and &lt;code&gt;cd&lt;/code&gt; into the directory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone [https://github.com/kylegallatin/fast-bad-ml.git](https://github.com/kylegallatin/fast-bad-ml.git)
cd fast-bad-ml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now change key perms and copy everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;700 &lt;span class="nv"&gt;$KEY&lt;/span&gt;
scp &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nv"&gt;$KEY&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; ec2-user@&lt;span class="nv"&gt;$HOST&lt;/span&gt;:/home/ec2-user 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type yes when prompted about the host’s authenticity and you’ll see the files copy over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set Up Your Instance
&lt;/h2&gt;

&lt;p&gt;Now it’s time to &lt;code&gt;ssh&lt;/code&gt; in and start a session so we can run our app.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -i $KEY ec2-user@$HOST
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;pwd &amp;amp;&amp;amp; ls&lt;/code&gt; will show you that you’re in &lt;code&gt;/home/ec2-user&lt;/code&gt; and that the directory you copied over is there. Now &lt;code&gt;cd&lt;/code&gt; into that directory and set up Python (this assumes you have a requirements.txt).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd fast-bad-ml
sudo yum install python3
sudo python3 -m pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Run and Test Your App
&lt;/h2&gt;

&lt;p&gt;Now that everything is installed, start your application on port 80 (default web traffic) using host 0.0.0.0 (this binds to all network interfaces; 127.0.0.1 only accepts local connections, so the app wouldn’t be reachable from the internet).&lt;/p&gt;

&lt;p&gt;The command below is the &lt;code&gt;uvicorn&lt;/code&gt; command for my FastAPI app, but you can replace that part to suit your app as long as host/port are the same.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo /usr/local/bin/uvicorn app:app --reload --port 80 --host 0.0.0.0 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &amp;amp; runs the process in the background so it keeps serving after the command returns. To be safe, prefix the command with &lt;code&gt;nohup&lt;/code&gt; so the process ignores the hangup signal when you exit the session, and its logs are redirected to nohup.out for later perusal.&lt;/p&gt;

&lt;p&gt;You can now reach the app from the internet. Use the same Public IPv4 DNS as earlier and just copy it into a browser. I’ve configured the / route to return a simple message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1qSO-gRo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5628/1%2AvCGYrJzffNoIhZnrE_TFug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1qSO-gRo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5628/1%2AvCGYrJzffNoIhZnrE_TFug.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’ve exposed a /predict method that takes query parameters, you can pass those in with the URL too to get your prediction. The format is &lt;code&gt;$HOST/$ROUTE/?$PARAM1=X&amp;amp;$PARAM2=Y....&lt;/code&gt;&lt;/p&gt;
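&lt;p&gt;If you’re scripting against the endpoint, the stdlib can build that query string for you. The host and parameter names below are placeholders:&lt;/p&gt;

```python
# Build a prediction URL of the form $HOST/$ROUTE/?$PARAM1=X...
# Stdlib only; host and parameter names are placeholders.
from urllib.parse import urlencode

host = "http://ec2-35-153-79-254.compute-1.amazonaws.com"  # your Public IPv4 DNS
route = "predict"
params = {"feature1": 1, "feature2": 1}

url = f"{host}/{route}/?{urlencode(params)}"
print(url)
```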

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mPsLOlWB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5628/1%2AYAbL2wuPa1fS8VBrEmnmQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mPsLOlWB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5628/1%2AYAbL2wuPa1fS8VBrEmnmQQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Just to caveat: this is nothing close to production. Even if we introduced Docker and the scale of Kubernetes, true “production” requires tests, automated CI/CD, monitoring, and much more. But for getting you to the demo on time? There’s nothing better. Good luck!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
      <category>ec2</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>Exposing Tensorflow Serving’s gRPC Endpoints on Amazon EKS</title>
      <dc:creator>Kyle Gallatin</dc:creator>
      <pubDate>Wed, 10 Feb 2021 15:13:45 +0000</pubDate>
      <link>https://dev.to/aws-builders/exposing-tensorflow-serving-s-grpc-endpoints-on-amazon-eks-57aa</link>
      <guid>https://dev.to/aws-builders/exposing-tensorflow-serving-s-grpc-endpoints-on-amazon-eks-57aa</guid>
      <description>&lt;p&gt;&lt;a href="https://www.tensorflow.org/tfx/guide/serving" rel="noopener noreferrer"&gt;Tensorflow serving&lt;/a&gt; is popular way to package and deploy models trained in the tensorflow framework for real time inference. Using the official docker image and a trained model, you can almost instantaneously spin up a container exposing REST and gRPC endpoints to make predictions.&lt;/p&gt;

&lt;p&gt;Most of the examples and documentation on tensorflow serving focus on the popular REST endpoint usage. Very few focus on how to adapt and use the gRPC endpoint for their own use case — and fewer mention how this works when you scale to kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F6016%2F0%2AxFKkSh9pPYYyeIrg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F6016%2F0%2AxFKkSh9pPYYyeIrg" alt="Rare footage of a real live docker whale — Photo by [Todd Cravens](https://unsplash.com/@toddcravens?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;em&gt;Rare footage of a real live docker whale — Photo by &lt;a href="https://unsplash.com/@toddcravens?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Todd Cravens&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post I’ll scratch the surface of gRPC, kubernetes/EKS, nginx, and how we can use them with tensorflow serving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why gRPC over REST?
&lt;/h2&gt;

&lt;p&gt;There are a ton of reasons. First off, gRPC uses the efficient HTTP/2 protocol as opposed to classic HTTP/1.1. It also uses language-neutral, serialized protocol buffers instead of JSON, which reduces the overhead of serializing and deserializing large JSON payloads.&lt;/p&gt;
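&lt;p&gt;A rough stdlib-only intuition for that serialization overhead, comparing text JSON to packed fixed-width floats (roughly how a protobuf encodes a repeated float field):&lt;/p&gt;

```python
# Compare the wire size of 1,000 floats as JSON text vs. packed 4-byte floats
# (a rough stand-in for a protobuf repeated float field).
import json
import struct

values = [i / 7 for i in range(1000)]

json_bytes = json.dumps(values).encode("utf-8")
packed_bytes = struct.pack(f"{len(values)}f", *values)

# JSON spells each float out as decimal text, so it's several times larger
print(len(json_bytes), len(packed_bytes))
```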

&lt;p&gt;Some talk about the benefits of the &lt;a href="https://cloud.google.com/blog/products/api-management/understanding-grpc-openapi-and-rest-and-when-to-use-them" rel="noopener noreferrer"&gt;API definition and design&lt;/a&gt;, and others about the &lt;a href="https://stackoverflow.com/questions/44877606/is-grpchttp-2-faster-than-rest-with-http-2" rel="noopener noreferrer"&gt;efficiencies of HTTP 2 in general&lt;/a&gt; — but here’s my experience with gRPC with regard to machine learning deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;gRPC is crazy efficient. It can dramatically reduce inference time and the overhead of large JSON payloads by using protobufs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;gRPC in production has a pretty big learning curve and was a massive pain in the a** to figure out&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short gRPC can offer huge performance benefits. But as a more casual API developer, I can definitely say no — it is not “easier” than REST. For someone new to gRPC, HTTP 2 with nginx, TLS/SSL and tensorflow serving — there was &lt;em&gt;a lot&lt;/em&gt; to figure out.&lt;/p&gt;

&lt;p&gt;Still, the benefits are worth it. I saw up to an 80% reduction in inference time with large batch sizes during initial load tests. For groups with strict service level agreements (SLAs) around inference time, gRPC can be a lifesaver. While I won’t explain gRPC in more depth here, the internet is littered with &lt;a href="https://www.semantics3.com/blog/a-simplified-guide-to-grpc-in-python-6c4e25f0c506/#:~:text=Google's%20gRPC%20provides%20a%20framework,over%20conventional%20REST%2BJSON%20APIs." rel="noopener noreferrer"&gt;helpful tutorials&lt;/a&gt; to get you started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup a Kubernetes Cluster on AWS
&lt;/h2&gt;

&lt;p&gt;Let’s get started and set up a kubernetes cluster. We’ll use AWS EKS and the eksctl command line utility. For the most part, you can also follow along with kubernetes in Docker Desktop if you don’t have an AWS account.&lt;/p&gt;

&lt;p&gt;If you don’t have eksctl or aws-cli installed already you can use the &lt;a href="https://github.com/kylegallatin/eks-demos" rel="noopener noreferrer"&gt;docker image in this repository&lt;/a&gt;. It also comes with kubectl for interacting with our cluster. First, build and run the container.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t eksctl-cli .
docker run -it eksctl-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Once you’ve started a bash session in the new container, sign into AWS with your preferred user and AWS access keys:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You’ll need an access key, a secret key, a default region name (I use us-east-1), and an output format like json. Now we can check on the status of our clusters.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl get clusters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If, like me, you have no active clusters, you’ll get No clusters found. In that case, let’s create one.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster \
--name test \
--version 1.18 \
--region us-east-1 \
--nodegroup-name linux-nodes \
--nodes 1 \
--nodes-min 1 \
--nodes-max 2 \
--with-oidc \
--managed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You can vary the parameters if you like, but as this is just an example I’ve left the cluster size small. This may take a little while. If you go to the console you’ll see your cluster creating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4332%2F1%2AjuuQvi5KN3TL8IiO6Jkm4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4332%2F1%2AjuuQvi5KN3TL8IiO6Jkm4w.png" alt="Make sure you’re logged in with the proper user and looking at the right region"&gt;&lt;/a&gt;&lt;em&gt;Make sure you’re logged in with the proper user and looking at the right region&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that our cluster is complete, we can install nginx for ingress and load balancing. First, let’s check that kubectl is working as expected. Here are some example commands.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl config get-contexts
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You should see the current cluster context as something like eksctl@test.us-east-1.eksctl.io and also see the node(s) created by our command. You can view similar content in the console if you’re signed in with proper permissions (i.e. the user you created the cluster with).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4340%2F1%2A_PRILrg8MSSsdofynuJTTw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4340%2F1%2A_PRILrg8MSSsdofynuJTTw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cluster’s all good, so let’s install nginx. Going back to the command line in our docker container:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.43.0/deploy/static/provider/aws/deploy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If you’re using a kubernetes provider other than AWS, check the &lt;a href="https://kubernetes.github.io/ingress-nginx/deploy/#aws" rel="noopener noreferrer"&gt;nginx installation instructions&lt;/a&gt; (there is one for Docker Desktop k8s). This command will create all the necessary resources for nginx. You can check it is up and running by looking at the pods in the new namespace.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get po -n ingress-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;What we’re most interested in is whether we’ve created a LoadBalancer for our cluster with an externally reachable IP address. If you run this command:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get svc -n ingress-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You should see that nginx-ingress-controller has an external IP like something.elb.us-east-1.amazonaws.com. If you copy and paste that into a browser, you should see this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2676%2F1%2AtPhUqReHfZXvXvfizeHMOw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2676%2F1%2AtPhUqReHfZXvXvfizeHMOw.png" alt="Never been so happy to get a 404"&gt;&lt;/a&gt;&lt;em&gt;Never been so happy to get a 404&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hell yeah! I know it doesn’t look good but what we’ve actually done is create proper ingress into our cluster and we can expose things to the ~web~.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy Tensorflow Serving on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Tensorflow has some &lt;a href="https://www.tensorflow.org/tfx/serving/serving_kubernetes" rel="noopener noreferrer"&gt;passable docs&lt;/a&gt; on deploying models to kubernetes, and it’s easy to create your own image for serving a custom model. For ease, I’ve pushed the classic half_plus_two model to &lt;a href="https://hub.docker.com/repository/docker/kylegallatin/tfserving-example" rel="noopener noreferrer"&gt;dockerhub&lt;/a&gt; so that anyone can pull it for this demo.&lt;/p&gt;

&lt;p&gt;The following YAML defines the deployment and service for a simple k8s app that exposes the tfserving grpc endpoint.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
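&lt;p&gt;In outline, such a manifest pairs a Deployment running the kylegallatin/tfserving-example image with a Service exposing gRPC port 8500. This is a hedged sketch; names and labels may differ from the actual gist:&lt;/p&gt;

```yaml
# Sketch of a tfserving Deployment + Service exposing gRPC on 8500.
# Names, labels, and replica counts are illustrative; see the linked
# gist for the exact manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tfserving-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tfserving
  template:
    metadata:
      labels:
        app: tfserving
    spec:
      containers:
      - name: tfserving
        image: kylegallatin/tfserving-example:latest
        ports:
        - containerPort: 8500  # gRPC
---
apiVersion: v1
kind: Service
metadata:
  name: tfserving-service
spec:
  selector:
    app: tfserving
  ports:
  - name: grpc
    port: 8500
    targetPort: 8500
```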



&lt;p&gt;To deploy it in our cluster, simply apply the raw YAML from the command line.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f [https://gist.githubusercontent.com/kylegallatin/734176736b0358c7dfe57b8e62591931/raw/ffebc1be625709b9912c3a5713698b80dc7925df/tfserving-deployment-svc.yaml](https://gist.githubusercontent.com/kylegallatin/734176736b0358c7dfe57b8e62591931/raw/ffebc1be625709b9912c3a5713698b80dc7925df/tfserving-deployment-svc.yaml)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Another kubectl get po will show us whether our pod was created successfully. To make sure the servers started, check the logs. You should see things like Running gRPC ModelServer at 0.0.0.0:8500 and Exporting HTTP/REST API. The gRPC one is the only one we’ve made available via our pod/service.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs $POD_NAME 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Before exposing through nginx, let’s make sure the service works. Forward the tensorflow serving service to your localhost.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward service/tfserving-service 8500:8500 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;There are multiple ways to check that the service is running. The simplest is just establishing an insecure connection with the grpc Python client.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
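&lt;p&gt;The helper boils down to something like the sketch below: &lt;code&gt;grpc_server_on&lt;/code&gt; just waits for the channel to become ready. This assumes the grpcio package; the actual gist may differ:&lt;/p&gt;

```python
# Check whether a gRPC server is reachable over a channel.
# A sketch of the gist's grpc_server_on helper; requires the grpcio package.
import grpc

def grpc_server_on(channel, timeout: int = 5) -> bool:
    try:
        # Blocks until the channel is ready, or raises FutureTimeoutError
        grpc.channel_ready_future(channel).result(timeout=timeout)
        return True
    except grpc.FutureTimeoutError:
        return False

channel = grpc.insecure_channel("localhost:8500")  # the port-forwarded service
print(grpc_server_on(channel, timeout=2))
```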



&lt;p&gt;If you enter the Python shell in the docker container and run the above, you should be able to connect to the grpc service insecurely.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; grpc_server_on(channel)
Handling connection for 8500
True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This means the service is working as expected. Let’s take a look at the more tensorflow-specific methods we can call. There should be a file called get_model_metadata.py in your docker container (if you’re following along elsewhere, &lt;a href="https://github.com/kylegallatin/eks-demos/blob/main/get_model_metadata.py" rel="noopener noreferrer"&gt;here’s the link&lt;/a&gt;). Let’s run that and inspect the output.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python get_model_metadata.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;Whoa, lots of unwieldy and inconvenient information. The most informative part is the {'inputs': 'x'... portion. This helps us formulate the proper prediction request. Note — what we’re actually doing in these Python scripts is using tensorflow-provided libraries to generate prediction protobufs and send them over our insecure gRPC channel.&lt;/p&gt;

&lt;p&gt;Let’s use this information to make a prediction over gRPC. You should also have a get_model_prediction.py file in your current directory.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Run that and inspect the output. You’ll see a json-like response (it’s actually a protobuf message).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;outputs {
  key: "y"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 3
      }
    }
    float_val: 2.5
    float_val: 3.0
    float_val: 4.5
  }
}
model_spec {
  name: "model"
  version {
    value: 1
  }
  signature_name: "serving_default"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We’ve made our first predictions over gRPC! Amazing. All of this is pretty poorly documented on the tensorflow serving side in my opinion, which can make it difficult to get started.&lt;/p&gt;

&lt;p&gt;In the next section, we’ll top it off by actually exposing our gRPC service via nginx and our public URL.&lt;/p&gt;
&lt;h2&gt;
  
  
  Exposing gRPC Services via Nginx
&lt;/h2&gt;

&lt;p&gt;The key thing that’ll be different about going the nginx route is that we can no longer establish insecure connections. We will be required by nginx to provide TLS encryption for our domain over port 443 in order to reach our service.&lt;/p&gt;

&lt;p&gt;Since nginx enables http2 protocol on port 443 by default, we shouldn’t have to make any changes there. However, if you have existing REST services on port 80, you may want to disable ssl redirects in the nginx configuration.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl edit configmap -n ingress-nginx ingress-nginx-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then add:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data:
  "ssl-redirect": "false"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And save.&lt;/p&gt;
&lt;h3&gt;
  
  
  Create a TLS Secret
&lt;/h3&gt;

&lt;p&gt;To create TLS for your domain, you can do something &lt;a href="http://www.inanzzz.com/index.php/post/jo4y/using-tls-ssl-certificates-for-grpc-client-and-server-communications-in-golang-updated" rel="noopener noreferrer"&gt;akin to this&lt;/a&gt;. Proper TLS/SSL involves a certificate authority (CA), but that’s out of scope for this article.&lt;/p&gt;

&lt;p&gt;First create a cert directory and then create a conf file.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
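&lt;p&gt;As a sketch, the conf file looks something like this. The values are placeholders; only DNS.1 needs to match your ELB hostname:&lt;/p&gt;

```ini
# cert/cert.conf -- sketch of a self-signed cert config with a SAN entry.
# Values are placeholders; only DNS.1 needs editing.
[ req ]
default_bits       = 2048
prompt             = no
distinguished_name = dn
req_extensions     = req_ext

[ dn ]
CN = localhost

[ req_ext ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = something.elb.us-east-1.amazonaws.com
```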



&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; DNS.1 &lt;strong&gt;so that it reflects the actual hostname of your ELB&lt;/strong&gt; (the URL we used earlier to see nginx in browser). You don’t need to edit CN.&lt;/p&gt;

&lt;p&gt;Using this, create a key and cert.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl genrsa -out cert/server.key 2048
openssl req -nodes -new -x509 -sha256 -days 1825 -config cert/cert.conf -extensions 'req_ext' -key cert/server.key -out cert/server.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then use those to create a new kubernetes secret in the default namespace. We will use this secret in our ingress object.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CERT_NAME=tls-secret
KEY_FILE=cert/server.key
CERT_FILE=cert/server.crt
kubectl create secret tls ${CERT_NAME} --key ${KEY_FILE} --cert ${CERT_FILE}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Finally, we create the ingress object with the necessary annotations for grpc. Note that we specify the tls-secret in the ingress. We’re also doing a path rewrite and exposing our service on /service1. By segmenting our services by path, we can expose more than one gRPC service via nginx.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
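&lt;p&gt;The ingress looks roughly like this (the backend service name, port, and hostname here are my placeholders; the gist yaml applied in the next step is the authoritative version):&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: tfserving-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    # Tell nginx the backend speaks gRPC (HTTP/2), not plain HTTP
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    # Rewrite /service1/<Method> to tensorflow serving's default service path
    nginx.ingress.kubernetes.io/rewrite-target: /tensorflow.serving.PredictionService/$2
    nginx.ingress.kubernetes.io/use-regex: "true"
spec:
  tls:
    - hosts:
        - your-elb-hostname.us-east-1.elb.amazonaws.com
      secretName: tls-secret
  rules:
    - host: your-elb-hostname.us-east-1.elb.amazonaws.com
      http:
        paths:
          - path: /service1(/|$)(.*)
            backend:
              serviceName: tf-serving
              servicePort: 8500
```

&lt;p&gt;The rewrite annotation is what lets each gRPC service live on its own path prefix while still reaching tensorflow serving’s default method path behind the scenes.&lt;/p&gt;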



&lt;p&gt;Replace the &lt;code&gt;host:&lt;/code&gt; value with your URL. You can apply the yaml above and then edit the resulting object, or vice versa. This way is pretty easy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f [https://gist.githubusercontent.com/kylegallatin/75523d2d2ce2c463c653e791726b2ba1/raw/4dc91989d8bdfbc87ca8b5192f60c9c066801235/tfserving-ingress.yaml](https://gist.githubusercontent.com/kylegallatin/75523d2d2ce2c463c653e791726b2ba1/raw/4dc91989d8bdfbc87ca8b5192f60c9c066801235/tfserving-ingress.yaml)
kubectl edit tfserving-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You now have an ingress in the default namespace. We can’t connect with an insecure channel like before, since this endpoint now requires TLS. We have to establish a secure connection using the certificate we just created.&lt;/p&gt;

&lt;p&gt;The nice part is that with this cert we can now connect from anywhere: we no longer need kubectl to port-forward the service to our container or local machine. It’s publicly accessible.&lt;/p&gt;
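&lt;p&gt;A client-side sketch of that secure connection (the hostname and cert path in the commented call are placeholders for your own):&lt;/p&gt;

```python
import grpc

def make_secure_channel(host: str, crt_path: str, port: int = 443) -> grpc.Channel:
    """Open a TLS channel to the ingress, trusting our self-signed cert."""
    with open(crt_path, "rb") as f:
        credentials = grpc.ssl_channel_credentials(root_certificates=f.read())
    # Channel creation is lazy; the TLS handshake happens on the first RPC
    return grpc.secure_channel(f"{host}:{port}", credentials)

# Placeholders -- substitute your ELB hostname and the cert generated above:
# channel = make_secure_channel("your-elb-hostname.us-east-1.elb.amazonaws.com",
#                               "cert/server.crt")
```

&lt;p&gt;Because the self-signed cert is passed as the root of trust, any machine holding &lt;code&gt;server.crt&lt;/code&gt; can reach the service over the public internet.&lt;/p&gt;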


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;Replace the &lt;code&gt;crt_path&lt;/code&gt; with the cert you’ve generated and replace the host with your URL.&lt;/p&gt;

&lt;p&gt;Note the custom route we’ve exposed our service on. TensorFlow Serving services are available on /tensorflow.serving.PredictionService by default, but this makes it difficult to add new services if we’re exposing them all on the same domain.&lt;/p&gt;

&lt;p&gt;gRPC itself only connects to a host and port, but we can use whatever service route we want. Above, I use the path we configured in our k8s ingress object, /service1, and &lt;a href="https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/prediction_service_pb2_grpc.py" rel="noopener noreferrer"&gt;overwrite the base configuration provided by tensorflow serving&lt;/a&gt;. When we call the tfserving_metadata function above, we specify /service1 as an argument.&lt;/p&gt;
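&lt;p&gt;Since a gRPC method is ultimately just a path string like &lt;code&gt;/tensorflow.serving.PredictionService/GetModelMetadata&lt;/code&gt;, overriding the route amounts to swapping the service prefix. A small illustrative helper (the names here are mine, not part of the tensorflow serving API):&lt;/p&gt;

```python
def override_route(method: str, route: str = "/service1") -> str:
    """Build a gRPC method path on our custom ingress route instead of
    the default /tensorflow.serving.PredictionService prefix."""
    return f"{route}/{method}"

# With a secure channel in hand, you would register the rewritten path
# roughly like this (illustrative; the serializers come from the generated
# tensorflow_serving.apis modules):
# get_metadata = channel.unary_unary(
#     override_route("GetModelMetadata"),
#     request_serializer=get_model_metadata_pb2.GetModelMetadataRequest.SerializeToString,
#     response_deserializer=get_model_metadata_pb2.GetModelMetadataResponse.FromString,
# )
```

&lt;p&gt;nginx then rewrites /service1/GetModelMetadata back to the default service path before it reaches tensorflow serving, so the backend never knows the difference.&lt;/p&gt;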

&lt;p&gt;This also applies to making predictions: with the right host, route, and cert, we can easily establish a secure channel and make predictions to our service over it.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Again, we overwrite the route with our custom path, convert our data into a tensor proto, and make a prediction! Easy as hell…haha, not really. You won’t find any of this in the tensorflow docs (or at least I didn’t), and the addition of unfamiliar tools like TLS, gRPC, and HTTP/2 with nginx makes it even more difficult.&lt;/p&gt;

&lt;p&gt;When you’re done, delete your cluster so you don’t get charged.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl delete cluster test&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Ay We’re Done
&lt;/h2&gt;

&lt;p&gt;Hope you find this helpful; it was definitely difficult to piece together from various articles, source code, Stack Overflow questions, and GitHub issues.&lt;/p&gt;

&lt;p&gt;Probably worth noting we didn’t do any actual CI/CD here, and I wouldn’t recommend &lt;code&gt;kubectl edit&lt;/code&gt; in production, but those are topics for other articles. If I used any terminology improperly, or there’s something you think could be explained better, please reach out! Happy to connect on &lt;a href="https://www.linkedin.com/in/kylegallatin/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://twitter.com/kylegallatin" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, or just go at it in the Medium comments section.&lt;/p&gt;

&lt;p&gt;✌🏼&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>grpc</category>
    </item>
  </channel>
</rss>
