<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ninii</title>
    <description>The latest articles on DEV Community by Ninii (@ninii72387534).</description>
    <link>https://dev.to/ninii72387534</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F580897%2F61b3bb8a-a4c3-4cf2-9b81-59eecc53bdc8.png</url>
      <title>DEV Community: Ninii</title>
      <link>https://dev.to/ninii72387534</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ninii72387534"/>
    <language>en</language>
    <item>
      <title>REST vs gRPC in Python: A Practical Benchmark!</title>
      <dc:creator>Ninii</dc:creator>
      <pubDate>Mon, 28 Apr 2025 13:28:58 +0000</pubDate>
      <link>https://dev.to/ninii72387534/rest-vs-grpc-in-python-a-practical-benchmark-3o35</link>
      <guid>https://dev.to/ninii72387534/rest-vs-grpc-in-python-a-practical-benchmark-3o35</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Ever wondered: &lt;em&gt;"Is gRPC really faster than REST?"&lt;/em&gt; Let's not just believe the hype. Let's &lt;strong&gt;measure it ourselves&lt;/strong&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this blog, we'll build small REST and gRPC services in Python, benchmark them, and compare their real-world performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  REST vs gRPC: Quick Intro
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;REST (Flask)&lt;/th&gt;
&lt;th&gt;gRPC (Protobuf)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;HTTP/1.1&lt;/td&gt;
&lt;td&gt;HTTP/2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Format&lt;/td&gt;
&lt;td&gt;JSON (text)&lt;/td&gt;
&lt;td&gt;Protobuf (binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human-readable?&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming support&lt;/td&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Setup: Build Two Simple Services
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. REST API Server (Flask)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# rest_server.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonify&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. gRPC Server (Python)
&lt;/h2&gt;

&lt;p&gt;First, define your service using Protocol Buffers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;hello.proto&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="na"&gt;syntax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"proto3"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;service&lt;/span&gt; &lt;span class="n"&gt;HelloService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;rpc&lt;/span&gt; &lt;span class="n"&gt;SayHello&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HelloRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HelloResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;HelloRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;HelloResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate Python files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; grpc_tools.protoc &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--python_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--grpc_python_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; hello.proto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;gRPC server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# grpc_server.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hello_pb2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hello_pb2_grpc&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HelloService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hello_pb2_grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HelloServiceServicer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;SayHello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hello_pb2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HelloResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;hello_pb2_grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_HelloServiceServicer_to_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HelloService&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_insecure_port&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[::]:50051&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_termination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Benchmark: How to Measure?
&lt;/h2&gt;

&lt;p&gt;We will send &lt;strong&gt;1000 requests&lt;/strong&gt; to each server and measure total time taken.&lt;/p&gt;
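&lt;p&gt;Both clients below share the same pattern: start a timer, fire the calls in a loop, stop the timer. That loop can be factored into a tiny helper; here is a minimal sketch (the helper name is mine, not part of the benchmark code), using &lt;code&gt;time.perf_counter&lt;/code&gt;, which is better suited to interval timing than &lt;code&gt;time.time&lt;/code&gt;:&lt;/p&gt;

```python
import time

def time_calls(fn, n=1000):
    """Call fn() n times and return (total_seconds, avg_ms_per_call)."""
    start = time.perf_counter()  # monotonic clock, designed for measuring intervals
    for _ in range(n):
        fn()
    total = time.perf_counter() - start
    return total, total / n * 1000.0
```

&lt;p&gt;Either client can then report its numbers via, e.g., &lt;code&gt;time_calls(lambda: requests.post(url, json=data))&lt;/code&gt;.&lt;/p&gt;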

&lt;h2&gt;
  
  
  REST Benchmark Client
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;benchmark_rest&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5000/hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ninad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REST Total Time: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;benchmark_rest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
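&lt;p&gt;One caveat before comparing numbers: each &lt;code&gt;requests.post&lt;/code&gt; call above may open a fresh TCP connection, while the gRPC client reuses a single channel. A fairer REST baseline keeps one connection alive; here is a sketch using only the standard library (the function name and return value are my own choices, not part of the original benchmark):&lt;/p&gt;

```python
import http.client
import json

def benchmark_rest_keepalive(n=1000, host="localhost", port=5000):
    """POST /hello n times over one persistent HTTP/1.1 connection;
    return the number of successful (HTTP 200) responses."""
    conn = http.client.HTTPConnection(host, port)
    body = json.dumps({"name": "Ninad"})
    headers = {"Content-Type": "application/json"}
    ok = 0
    for _ in range(n):
        conn.request("POST", "/hello", body=body, headers=headers)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can be reused
        if resp.status == 200:
            ok += 1
    conn.close()
    return ok
```

&lt;p&gt;With keep-alive in place, the remaining REST overhead is mostly JSON encoding and HTTP/1.1 framing rather than connection setup.&lt;/p&gt;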






&lt;h2&gt;
  
  
  gRPC Benchmark Client
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hello_pb2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hello_pb2_grpc&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;benchmark_grpc&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insecure_channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:50051&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stub&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hello_pb2_grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HelloServiceStub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SayHello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hello_pb2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HelloRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ninad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gRPC Total Time: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;benchmark_grpc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results: What I Observed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;REST&lt;/th&gt;
&lt;th&gt;gRPC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Time (1000 req)&lt;/td&gt;
&lt;td&gt;~15-20 seconds&lt;/td&gt;
&lt;td&gt;~3-5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Latency/request&lt;/td&gt;
&lt;td&gt;~15-20 ms&lt;/td&gt;
&lt;td&gt;~3-5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payload Size&lt;/td&gt;
&lt;td&gt;Larger (text)&lt;/td&gt;
&lt;td&gt;Smaller (binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;gRPC was about 4-5x faster than REST&lt;/strong&gt; in this small test!&lt;/p&gt;




&lt;h2&gt;
  
  
  Why is gRPC Faster?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;strong&gt;HTTP/2&lt;/strong&gt;: multiplexes multiple streams over a single connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary Protobufs&lt;/strong&gt;: smaller payloads, faster to serialize and deserialize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent connection&lt;/strong&gt;: no new TCP handshake for every call.&lt;/li&gt;
&lt;/ul&gt;
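&lt;p&gt;The payload-size point is easy to check by hand. The snippet below is a rough, self-contained illustration: the two framing bytes mimic how proto3 encodes a short string field (one tag byte plus one length byte); in practice the generated classes compute the exact wire size:&lt;/p&gt;

```python
import json

name = "Ninad"

# JSON body the REST service receives: '{"name": "Ninad"}'
json_payload = json.dumps({"name": name}).encode()

# Rough proto3 wire framing for field 1 (string): 1 tag byte + 1 length byte + data
binary_payload = bytes([0x0A, len(name)]) + name.encode()

print(len(json_payload), len(binary_payload))
```

&lt;p&gt;For this tiny message the JSON body is more than twice the size of the binary encoding, and the gap grows with field names and nesting.&lt;/p&gt;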




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you're building &lt;strong&gt;frontend APIs&lt;/strong&gt; (browsers/mobile apps) -&amp;gt; &lt;strong&gt;REST&lt;/strong&gt; is still great.&lt;/li&gt;
&lt;li&gt;If you're building &lt;strong&gt;internal microservices&lt;/strong&gt; at scale -&amp;gt; &lt;strong&gt;gRPC&lt;/strong&gt; shines!&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bonus: Advanced Benchmarking Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For REST: &lt;a href="https://github.com/wg/wrk" rel="noopener noreferrer"&gt;wrk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For gRPC: &lt;a href="https://github.com/bojand/ghz" rel="noopener noreferrer"&gt;ghz&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real engineers don't guess performance — they measure it!&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Happy benchmarking! 💪&lt;/p&gt;




&lt;p&gt;Would love to hear your thoughts — have you tried gRPC before? How did it go for you? Feel free to share in the comments! 🚀&lt;/p&gt;

</description>
      <category>python</category>
      <category>grpc</category>
      <category>restapi</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Progressive Delivery with Argo Rollouts: Canary Deployment</title>
      <dc:creator>Ninii</dc:creator>
      <pubDate>Sat, 25 Jun 2022 13:51:50 +0000</pubDate>
      <link>https://dev.to/infracloud/progressive-delivery-with-argo-rollouts-canary-deployment-4eoj</link>
      <guid>https://dev.to/infracloud/progressive-delivery-with-argo-rollouts-canary-deployment-4eoj</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/blogs/progressive-delivery-argo-rollouts-blue-green-deployment/"&gt;Part 1 of Argo Rollouts&lt;/a&gt;, we have seen what Progressive Delivery is and how you can achieve the Blue-Green deployment type using Argo Rollouts. We also deployed a sample app in a Kubernetes cluster using it. Do read the &lt;a href="https://dev.to/blogs/progressive-delivery-argo-rollouts-blue-green-deployment/"&gt;first part of this Progressive Delivery blog series&lt;/a&gt;, if you haven't yet.&lt;/p&gt;

&lt;p&gt;In this hands-on article, we will explore what the &lt;a href="https://argoproj.github.io/argo-rollouts/features/canary/"&gt;canary deployment strategy&lt;/a&gt; is and how you can achieve it using Argo Rollouts. But before that, let's first understand what canary deployment is and the need behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Canary Deployment?
&lt;/h2&gt;

&lt;p&gt;As stated rightly by Danilo Sato in this &lt;a href="https://martinfowler.com/bliki/CanaryRelease.html"&gt;CanaryRelease&lt;/a&gt; article,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Canary is one of the most popular and widely adopted progressive delivery techniques. Do you know why we call it canary and not anything else? The term “canary deployment” comes from an &lt;strong&gt;old coal mining practice&lt;/strong&gt;. Coal mines often contained carbon monoxide and other dangerous gases that could kill the miners, and canary birds are more sensitive to airborne toxins than humans, so miners used them as early detectors. Canary deployment takes a similar approach: instead of exposing all end-users to risk at once, as in an old big-bang deployment, we release the new version of the application to a very small percentage of users, analyze whether everything works as expected, and then gradually roll it out to a larger audience in increments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N40MUh1X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4v59mp71c7hpdlwn40r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N40MUh1X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4v59mp71c7hpdlwn40r.png" alt="Canary deployement" width="600" height="948"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;br&gt;
 &lt;a href="https://argoproj.github.io/argo-rollouts/concepts"&gt;Image Source&lt;/a&gt; &lt;br&gt;
&lt;/center&gt;
&lt;h2&gt;
  
  
  Need for Canary Deployment
&lt;/h2&gt;

&lt;p&gt;Some of us have noticed that a new update of an app (like WhatsApp or Facebook) is sometimes visible to one of our friends but not to everyone; that's the power of the canary deployment strategy handling the new version rollout in the background. The problems that canary deployment tries to solve are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canary deployments let you test in production with real users and real traffic, which Blue-Green deployment unfortunately cannot offer.&lt;/li&gt;
&lt;li&gt;You can analyze how the new version of your application behaves in a more controlled manner and then roll it out to all end-users incrementally.&lt;/li&gt;
&lt;li&gt;The infrastructure cost involved is lower than with the Blue-Green deployment technique.&lt;/li&gt;
&lt;li&gt;It is the least risky of all the deployment strategies.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How does Argo Rollouts handle the Canary Deployment?
&lt;/h2&gt;

&lt;p&gt;Once you start using the Argo Rollouts controller for canary-style deployment, it creates a new ReplicaSet for the new version of the application (which brings up a new set of pods) and splits traffic between the old stable version and the new canary version using the same Service object that previously routed all traffic to the stable version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vg794sx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cuyppm5fw5v7smgtx20r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vg794sx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cuyppm5fw5v7smgtx20r.png" alt="Argo Rollouts Canary Deployement" width="880" height="496"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=VSFfqMsVcTw"&gt;Image Source&lt;/a&gt;&lt;br&gt;
&lt;/center&gt;
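&lt;p&gt;With this basic canary setup (no traffic-routing integration), the split simply follows pod counts: with 5 replicas and a weight of 20%, the controller runs 1 canary pod against 4 stable pods, and a single Service spreads requests across both. A minimal sketch of such a Service (names are illustrative, not taken from the demo repo):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo
spec:
  selector:
    app: rollouts-demo   # matches pods from both the stable and the canary ReplicaSet
  ports:
  - port: 80
    targetPort: 8080
```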

&lt;p&gt;Now, let's get hands-on and see how it works in practice.&lt;/p&gt;
&lt;h3&gt;
  
  
  Lab/Hands-on of Argo Rollouts with Canary Deployment
&lt;/h3&gt;

&lt;p&gt;If you do not have a Kubernetes cluster readily available for this lab, we recommend the &lt;a href="https://prod.cloudyuga.guru/hands_on_lab/canary-deploy"&gt;CloudYuga platform-based version&lt;/a&gt; of this blog post. Otherwise, you can &lt;a href="https://kind.sigs.k8s.io/docs/user/ingress/#setting-up-an-ingress-controller"&gt;set up your own local kind cluster with the NGINX ingress controller&lt;/a&gt; deployed and follow along, executing the commands below against your kind cluster.&lt;/p&gt;

&lt;p&gt;Clone the Argo Rollouts example GitHub repo (or, preferably, fork it first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NiniiGit/argo-rollouts-example.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installation of Argo Rollouts controller
&lt;/h3&gt;

&lt;p&gt;Create a namespace for the Argo Rollouts controller and install it with the commands below; more about the installation can be found in the &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-blue-green-deployment/"&gt;first part of the progressive delivery blog series&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argo-rollouts
kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that the &lt;strong&gt;controller and other components&lt;/strong&gt; have been deployed. Wait for the pods to be in the Running state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the &lt;strong&gt;Argo Rollouts kubectl plugin&lt;/strong&gt; with curl to interact easily with the Rollouts controller and its resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LO&lt;/span&gt; https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./kubectl-argo-rollouts-linux-amd64
&lt;span class="nb"&gt;sudo mv&lt;/span&gt; ./kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Argo Rollouts also ships with its own &lt;strong&gt;GUI&lt;/strong&gt;, which you can launch with the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can access the Argo Rollouts console by opening &lt;code&gt;http://localhost:3100&lt;/code&gt; in your browser.&lt;br&gt;
You will be presented with the UI shown below (currently empty, since we are yet to deploy any Rollouts-based application).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SueqtKS4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hfdv19x84byyue0cti90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SueqtKS4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hfdv19x84byyue0cti90.png" alt="Argo Rollouts Dashboard" width="880" height="304"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Argo Rollouts Dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, let's go ahead and deploy the sample app using the canary deployment strategy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Canary Deployment with Argo Rollouts
&lt;/h3&gt;

&lt;p&gt;To experience how the &lt;a href="https://argoproj.github.io/argo-rollouts/features/canary/"&gt;canary deployment works with Argo Rollouts&lt;/a&gt;, we will deploy a sample app consisting of a Rollout using the canary strategy, a Service, and an Ingress as Kubernetes objects.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rollout.yaml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;40&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;revisionHistoryLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj/rollouts-demo:blue&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
          &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32Mi&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;code&gt;setWeight&lt;/code&gt; field dictates the percentage of traffic that should be sent to the canary, and the &lt;code&gt;pause&lt;/code&gt; struct instructs the rollout to pause. When the controller reaches a pause step for a rollout, it adds a PauseCondition struct to the &lt;code&gt;.status.PauseConditions&lt;/code&gt; field. If the duration field within the pause struct is set, the rollout will not progress to the next step until it has waited for that duration. Otherwise, the rollout will wait indefinitely until the &lt;code&gt;pause&lt;/code&gt; condition is removed. By combining the &lt;code&gt;setWeight&lt;/code&gt; and &lt;code&gt;pause&lt;/code&gt; fields, a user can declaratively describe how they want to progress to the new version. You can find more details about &lt;a href="https://argoproj.github.io/argo-rollouts/features/specification/"&gt;all the different parameters available&lt;/a&gt; in the specification.&lt;/p&gt;
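&lt;p&gt;As a quick sanity check of the arithmetic behind &lt;code&gt;setWeight&lt;/code&gt;: the controller scales the canary ReplicaSet to the weight's share of the total replicas. The helper below is our own illustration (not part of Argo Rollouts), rounding up to a whole pod, which matches the "20% of 5 replicas = 1 pod" behavior seen later in this demo.&lt;br&gt;
&lt;/p&gt;

```shell
# Illustrative helper: how many canary pods does a given setWeight
# translate to, for a given replica count? ceil(weight * replicas / 100)
canary_replicas() {
  weight=$1
  replicas=$2
  echo $(( (weight * replicas + 99) / 100 ))
}

canary_replicas 20 5   # 20% of 5 replicas -> 1 canary pod
canary_replicas 40 5   # 40% of 5 replicas -> 2 canary pods
```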

&lt;p&gt;Now, we will create the service object for this rollout object.&lt;br&gt;
&lt;code&gt;service.yaml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now create an ingress object.&lt;br&gt;
&lt;code&gt;ingress.yaml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-ingress&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
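&lt;p&gt;One note on the manifest above: on Kubernetes 1.19 and newer, the &lt;code&gt;kubernetes.io/ingress.class&lt;/code&gt; annotation is deprecated in favor of the &lt;code&gt;spec.ingressClassName&lt;/code&gt; field. An equivalent sketch (assuming your cluster has an IngressClass named &lt;code&gt;nginx&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rollouts-ingress
spec:
  ingressClassName: nginx   # replaces the deprecated ingress.class annotation
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rollouts-demo
            port:
              number: 80
```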



&lt;p&gt;To keep things simple, let's create all these objects in the &lt;code&gt;default&lt;/code&gt; namespace by executing the below command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; argo-rollouts-example/canary-deployment-example/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would be able to see all the objects created in the default namespace by running the below command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can access the sample app by opening &lt;code&gt;http://localhost:80&lt;/code&gt; in your browser.&lt;br&gt;
You should see the app as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zBT_coVg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5i14pzn879h411fdve1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zBT_coVg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5i14pzn879h411fdve1m.png" alt="Sample app Blue Version" width="880" height="401"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Sample app with blue-version&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you visit the &lt;strong&gt;Argo Rollouts console&lt;/strong&gt; again at &lt;a href="http://localhost:3100"&gt;http://localhost:3100&lt;/a&gt;, this time you will see the &lt;strong&gt;deployed sample app&lt;/strong&gt;, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uf9Z9K_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ugbpjgxykxnhvvfzj6s1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uf9Z9K_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ugbpjgxykxnhvvfzj6s1.png" alt="Canary Deployement Argo Rollouts" width="880" height="387"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: Canary Deployment on Argo Rollouts Dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;code&gt;rollouts-demo&lt;/code&gt; in the console, and it will present you with the rollout's current status, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Cqqj-K9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b76zsvwyfpz4xbtruwdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Cqqj-K9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b76zsvwyfpz4xbtruwdl.png" alt="Details Canary Depoyement Argo Rollouts" width="880" height="438"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4: Details of Canary Deployment on Argo Rollouts Dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From here, you can continue the demo either through the &lt;strong&gt;GUI&lt;/strong&gt; or with the &lt;strong&gt;commands&lt;/strong&gt; shown below.&lt;br&gt;
You can also check the current status of this rollout by running the below command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts get rollout rollouts-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's deploy the &lt;strong&gt;Yellow version&lt;/strong&gt; of the app using canary strategy via command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts &lt;span class="nb"&gt;set &lt;/span&gt;image rollouts-demo rollouts-demo&lt;span class="o"&gt;=&lt;/span&gt;argoproj/rollouts-demo:yellow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see a new pod, running the &lt;strong&gt;yellow version&lt;/strong&gt; of our sample app, coming up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently, &lt;strong&gt;only 20%, i.e., 1 out of 5 pods&lt;/strong&gt;, will come online with the yellow version, and then the rollout will pause, as specified in the steps above. See &lt;strong&gt;line number 9&lt;/strong&gt; in &lt;strong&gt;&lt;code&gt;rollout.yaml&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
On the Argo console, you will see the new revision of the app running with the changed image version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zJcK_tmf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jyne8nmcdhwuz9zzae01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zJcK_tmf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jyne8nmcdhwuz9zzae01.png" alt="Sample App Canary Deployement Argo Rollouts" width="880" height="415"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 5: Another version of the sample app in Canary Deployment on Argo Rollouts Dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you visit &lt;code&gt;http://localhost:80&lt;/code&gt; in your browser, you will still see mostly the blue version, with only a small amount of yellow, since we have not yet fully promoted the yellow version of our app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cQdlkHNC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n780dl3dum5hwhio2hhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cQdlkHNC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n780dl3dum5hwhio2hhb.png" alt="Blue Yellow sample versions" width="880" height="414"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 6: blue-yellow versions of the sample app&lt;/em&gt;&lt;/p&gt;
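&lt;p&gt;If you would rather verify the split empirically than eyeball the UI, you could sample the endpoint repeatedly and tally which version answered. Below is a rough sketch: the &lt;code&gt;curl&lt;/code&gt; loop is shown only as a comment (the demo app's actual response format may differ), and a canned sample of five responses stands in for live traffic.&lt;br&gt;
&lt;/p&gt;

```shell
# In a real check you might collect responses with something like:
#   for i in $(seq 1 100); do curl -s http://localhost:80/color; echo; done
# Here a canned sample of 5 responses stands in for live traffic.
sample='blue
yellow
blue
blue
blue'

yellow=$(printf '%s\n' "$sample" | grep -c yellow)   # responses from the canary
total=$(printf '%s\n' "$sample" | grep -c '')        # total responses sampled
echo "canary share: $((100 * yellow / total))%"
```

With the canned sample this reports a 20% canary share, matching the first &lt;code&gt;setWeight: 20&lt;/code&gt; step.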

&lt;p&gt;You can confirm this by running the command below, which shows that the new version is in a paused state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts get rollout rollouts-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's promote the &lt;strong&gt;yellow version&lt;/strong&gt; of our app by executing the below command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts promote rollouts-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following command, and you will see the new, i.e., &lt;strong&gt;yellow version&lt;/strong&gt; of our app being scaled up completely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts get rollout rollouts-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can confirm the same by running the below command, which shows the old set of pods, i.e., the old &lt;strong&gt;blue version&lt;/strong&gt; of our app, terminating or already terminated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, if you visit the app at &lt;code&gt;http://localhost:80&lt;/code&gt; in your browser, you will see only the &lt;strong&gt;yellow version&lt;/strong&gt;, because we have fully promoted it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FIzwRvsI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5myteqlppzina1y88l80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FIzwRvsI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5myteqlppzina1y88l80.png" alt="Sample Yellow version" width="880" height="387"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 7: Sample app with yellow-version&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kudos! You have successfully completed a canary deployment using Argo Rollouts.&lt;br&gt;
You can delete this entire setup, i.e., our deployed sample app, using the below command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; argo-rollouts-example/canary-deployment-example/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In this post, we saw how easily we can achieve the canary style of progressive delivery using Argo Rollouts. This approach is simple, does not require a service mesh, and provides much better control over rolling out a new version of your application than Kubernetes' default rolling update strategy.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. I’d love to hear your thoughts on this post, so start a conversation on &lt;a href="https://twitter.com/NinadDesai73"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/ninad-desai/"&gt;LinkedIn&lt;/a&gt; :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Next?&lt;/strong&gt; Now that we have developed a deeper understanding of progressive delivery and performed a canary deployment, the next step is to dive deeper and try canary deployment with &lt;a href="https://argoproj.github.io/argo-rollouts/features/analysis/"&gt;Analysis&lt;/a&gt; using Argo Rollouts. Stay tuned for that post.&lt;/p&gt;

&lt;p&gt;You can find all the parts of this Argo Rollouts Series below:&lt;br&gt;
Part 1: &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-blue-green-deployment/"&gt;Progressive Delivery with Argo Rollouts: Blue Green Deployment&lt;/a&gt;&lt;br&gt;
Part 2: &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-canary-deployment/"&gt;Progressive Delivery with Argo Rollouts: Canary  Deployment&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References and further reading:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://argoproj.github.io/rollouts/"&gt;Argo Rollouts&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=791q82vg5M8&amp;amp;t=3210s"&gt;Kubernets Banglore workshop on Argo Rollouts&lt;/a&gt;&lt;br&gt;
&lt;a href="https://argoproj.github.io/argo-rollouts/"&gt;Argo Rollouts - Kubernetes Progressive Delivery Controller&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=VSFfqMsVcTw&amp;amp;t=9274s"&gt;CICD with Argo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking for help with building your DevOps strategy, or want to outsource DevOps to the experts? Learn why so many startups &amp;amp; enterprises consider us one of the &lt;a href="https://dev.to/devops-consulting-services/"&gt;best DevOps consulting &amp;amp; services companies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>argo</category>
      <category>argorollouts</category>
      <category>progressivedelivery</category>
      <category>deployment</category>
    </item>
    <item>
      <title>Progressive Delivery with Argo Rollouts : Blue-Green Deployment</title>
      <dc:creator>Ninii</dc:creator>
      <pubDate>Sat, 25 Jun 2022 13:46:10 +0000</pubDate>
      <link>https://dev.to/infracloud/progressive-delivery-with-argo-rollouts-blue-green-deployment-4ocb</link>
      <guid>https://dev.to/infracloud/progressive-delivery-with-argo-rollouts-blue-green-deployment-4ocb</guid>
      <description>&lt;p&gt;&lt;strong&gt;To understand about blue-green deployment with Argo Rollouts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Continuous Integration (CI) and Continuous Delivery (CD) have been widely adopted in modern software development, enabling organizations to quickly deploy software to customers. But doing it the right way is equally important, as unfinished code can lead to failures and force customers to face downtime.&lt;br&gt;
To solve this, progressive delivery was introduced, which enables delivering the right changes to the right number of customers at the right time. More precisely, it controls the speed at which changes are deployed.&lt;/p&gt;
&lt;h2&gt;
  
  
  Traditional CI/CD and Progressive Delivery
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/devops/continuous-integration/"&gt;Continuous Integration (CI)&lt;/a&gt; is an automation process that helps in continuously integrating software development changes. It automates the building, testing, and validation of the source code. Its goal is to ultimately produce a packaged artifact that is ready to deploy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/devops/continuous-delivery/"&gt;Continuous Delivery (CD)&lt;/a&gt; helps in deploying software changes to users. It needs CI which produces an artifact that can be deployed to users. Hence, &lt;a href="https://dev.to/ci-cd-consulting/"&gt;CI and CD&lt;/a&gt; are often used together.&lt;/p&gt;

&lt;p&gt;But Continuous Delivery poses many challenges, such as delivering changes quickly and handling high-risk failures, all while ensuring the uptime and performance of the software.&lt;br&gt;
Progressive delivery addresses these problems, using deployment strategies like blue-green and canary deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/concepts/#progressive-delivery"&gt;Progressive Delivery&lt;/a&gt; is one step ahead of Continuous Delivery. It enables delivering the software updates in a controlled manner by reducing the risks of failures. It is done by exposing the new changes of software to a smaller set of users, and then by observing and analyzing the correct behavior, it is then exposed to more users progressively. It is known to move fast but with control.&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenges with default “RollingUpdate” Kubernetes deployment strategy
&lt;/h2&gt;

&lt;p&gt;Kubernetes ships with the default &lt;code&gt;RollingUpdate&lt;/code&gt; deployment strategy, which at present has the below set of limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer controls over speed of the rollout&lt;/li&gt;
&lt;li&gt;Inability to control traffic flow to the new version&lt;/li&gt;
&lt;li&gt;Readiness probes are unsuitable for deeper, stress, or one-time checks&lt;/li&gt;
&lt;li&gt;No ability to query external metrics to verify an update&lt;/li&gt;
&lt;li&gt;Can halt the progression, but unable to automatically abort and rollback the update&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Argo Projects
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://argoproj.github.io/"&gt;Argo&lt;/a&gt; is a group of many open source projects which help in the fast and safe delivery of software by extending the capabilities of Kubernetes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj/argo-workflows"&gt;Argo Workflows&lt;/a&gt; - Container-native Workflow Engine&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj/argo-cd"&gt;Argo CD&lt;/a&gt; - Declarative GitOps Continuous Delivery&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj/argo-events"&gt;Argo Events&lt;/a&gt; - Event-based Dependency Manager&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj/argo-rollouts"&gt;Argo Rollouts&lt;/a&gt; - Progressive Delivery with support for Canary and Blue Green deployment strategies&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj-labs"&gt;Argoproj-labs&lt;/a&gt; - separate GitHub org that is setup for community contributions related to the Argoproj ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To overcome the limitations of the native Kubernetes deployment strategies, &lt;a href="https://argoproj.github.io/rollouts"&gt;Argo Rollouts&lt;/a&gt; was introduced.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is Argo Rollouts?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/"&gt;Argo Rollouts&lt;/a&gt; is a Kubernetes controller and a set of  CRDs which provides progressive delivery features along with advanced deployments such as blue-green, canary, canary analysis. It has the potential to control and shift traffic to a newer version of software through ingress controllers and service meshes.&lt;/p&gt;

&lt;p&gt;The below table shows a comparative analysis of the capabilities of the default Kubernetes deployment strategy vs Argo Rollouts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v2W50T5W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a60wgjsn008fn4rjmylg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v2W50T5W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a60wgjsn008fn4rjmylg.png" alt="Argo Rollouts" width="649" height="366"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Working of Argo Rollouts
&lt;/h3&gt;

&lt;p&gt;The Argo Rollouts controller watches for resources of kind &lt;code&gt;Rollout&lt;/code&gt; in the cluster, which manage ReplicaSets just like a Kubernetes Deployment does. It creates a stable ReplicaSet for the older version of the software and a canary ReplicaSet for the newer version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vJkVkq8J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/paojvpdvd58rqyyvrhdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJkVkq8J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/paojvpdvd58rqyyvrhdi.png" alt="Working of Argo Rollouts" width="786" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://argoproj.github.io/argo-rollouts/architecture/"&gt;Argo-Rollout-Architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AnalysisTemplate&lt;/code&gt; defines the analysis of the ReplicaSets, which the &lt;code&gt;AnalysisRun&lt;/code&gt; component executes. Together, they observe the behavior of the newly deployed version; based on the result, the controller can automatically continue the rollout to the newer version or roll it back. For this, one can use any metrics provider like &lt;strong&gt;&lt;em&gt;Prometheus&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;Kubernetes jobs&lt;/em&gt;&lt;/strong&gt;, and so on.&lt;/p&gt;
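&lt;p&gt;For a flavor of what such an analysis looks like, below is a minimal &lt;code&gt;AnalysisTemplate&lt;/code&gt; sketch using the Prometheus provider. The Prometheus address, the &lt;code&gt;http_requests_total&lt;/code&gt; metric, and the 95% threshold are illustrative placeholders, not values from this demo.&lt;br&gt;
&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 1m
    # fail the analysis (and trigger a rollback) if success rate drops below 95%
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

A Rollout can then reference this template from an &lt;code&gt;analysis&lt;/code&gt; step or a background analysis in its canary strategy.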
&lt;h3&gt;
  
  
  Lab/Hands-on of Argo Rollouts with Blue-Green Deployments
&lt;/h3&gt;

&lt;p&gt;If you do not have a Kubernetes cluster readily available for this lab, we recommend the &lt;a href="https://cloudyuga.guru/hands_on_lab/blue-green-deploy-argorollout"&gt;CloudYuga platform-based version&lt;/a&gt; of this blog post. Otherwise, you can &lt;a href="https://kind.sigs.k8s.io/docs/user/ingress/#setting-up-an-ingress-controller"&gt;set up your own local kind cluster with the NGINX ingress controller&lt;/a&gt; deployed, and follow along by executing the below commands against your kind cluster.&lt;/p&gt;

&lt;p&gt;Clone the Argo Rollouts example GitHub repo (or, preferably, fork it first).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NiniiGit/argo-rollouts-example.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installation of Argo Rollouts controller
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create the namespace for installation of the Argo Rollouts controller
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argo-rollouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that the namespace has been created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ns argo-rollouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will install the latest version of the Argo Rollouts controller.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see the &lt;strong&gt;controller and other components&lt;/strong&gt; have been deployed. Wait for the pods to be in the &lt;code&gt;Running&lt;/code&gt; state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the Argo Rollouts kubectl plugin with &lt;code&gt;curl&lt;/code&gt; for easy interaction with the Rollouts controller and resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LO&lt;/span&gt; https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./kubectl-argo-rollouts-linux-amd64
&lt;span class="nb"&gt;sudo mv&lt;/span&gt; ./kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Argo Rollouts also comes with its own GUI, which you can launch with the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now open the Argo Rollouts console at &lt;a href="http://localhost:3100"&gt;http://localhost:3100&lt;/a&gt; in your browser.&lt;br&gt;
You will be presented with the UI shown below (it is empty for now, since we have not yet deployed any Argo Rollouts-based app).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mr_xHHnz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z5ncrnucz1hsg4xnqxed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mr_xHHnz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z5ncrnucz1hsg4xnqxed.png" alt="Argo Rollouts Dashboard" width="880" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Now, let's go ahead and deploy our first sample app using the Blue-Green Deployment strategy.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Blue-Green Deployment with Argo Rollouts
&lt;/h3&gt;

&lt;p&gt;To see how &lt;a href="https://argoproj.github.io/argo-rollouts/features/bluegreen/"&gt;blue-green deployment&lt;/a&gt; works with Argo Rollouts, we will deploy a sample app consisting of Rollout, Service, and Ingress Kubernetes objects.&lt;br&gt;
&lt;code&gt;rollout.yaml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This example demonstrates a Rollout using the blue-green update strategy, which contains a manual&lt;/span&gt;
&lt;span class="c1"&gt;# gate before promoting the new stack.&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;revisionHistoryLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj/rollouts-demo:blue&lt;/span&gt;
        &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;blueGreen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="c1"&gt;# activeService specifies the service to update with the new template hash at time of promotion.&lt;/span&gt;
      &lt;span class="c1"&gt;# This field is mandatory for the blueGreen update strategy.&lt;/span&gt;
      &lt;span class="na"&gt;activeService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen-active&lt;/span&gt;
      &lt;span class="c1"&gt;# previewService specifies the service to update with the new template hash before promotion.&lt;/span&gt;
      &lt;span class="c1"&gt;# This allows the preview stack to be reachable without serving production traffic.&lt;/span&gt;
      &lt;span class="c1"&gt;# This field is optional.&lt;/span&gt;
      &lt;span class="na"&gt;previewService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen-preview&lt;/span&gt;
      &lt;span class="c1"&gt;# autoPromotionEnabled disables automated promotion of the new stack by pausing the rollout&lt;/span&gt;
      &lt;span class="c1"&gt;# immediately before the promotion. If omitted, the default behavior is to promote the new&lt;/span&gt;
      &lt;span class="c1"&gt;# stack as soon as the ReplicaSet are completely ready/available.&lt;/span&gt;
      &lt;span class="c1"&gt;# Rollouts can be resumed using: `kubectl argo rollouts promote ROLLOUT`&lt;/span&gt;
      &lt;span class="na"&gt;autoPromotionEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;service.yaml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen-active&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen-preview&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ingress.yaml&lt;/code&gt; content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-bluegreen-active&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create all these objects in the &lt;code&gt;default&lt;/code&gt; namespace by executing the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; argo-rollouts-example/blue-green-deployment-example/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can verify that all the objects have been created in the &lt;code&gt;default&lt;/code&gt; namespace by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can access the sample app at &lt;a href="http://localhost:80"&gt;http://localhost:80&lt;/a&gt; in your browser.&lt;br&gt;
You should see the app as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tnSlRQSe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0odt2wxyvs71zfvt52x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tnSlRQSe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0odt2wxyvs71zfvt52x3.png" alt="Argo Rollouts sample app with blue-version" width="880" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now visit the Argo Rollouts console again. This time, you will see the sample app deployed, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9WOQvcQV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pqzw3d2k0ilb6bbrgmub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9WOQvcQV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pqzw3d2k0ilb6bbrgmub.png" alt="Blue-Green deployment on argo rollouts dashboard" width="880" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;code&gt;rollout-bluegreen&lt;/code&gt; in the console to see its current status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fZvYfCZq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9e2j12lk4rn10gwsrcqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fZvYfCZq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9e2j12lk4rn10gwsrcqb.png" alt="Details of blue-green deployment" width="880" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going forward, you can either use this GUI or (preferably) the commands shown below to continue with the demo.&lt;/p&gt;

&lt;p&gt;You can also see the current status of the rollout from the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts get rollout rollout-bluegreen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's deploy the green version of the app via the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts &lt;span class="nb"&gt;set &lt;/span&gt;image rollout-bluegreen rollouts-demo&lt;span class="o"&gt;=&lt;/span&gt;argoproj/rollouts-demo:green
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see a new set of pods, running the &lt;code&gt;green&lt;/code&gt; version of the sample app, coming up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few seconds, both the old set of pods (blue version) and the new set of pods (green version) will be available. On the Argo Rollouts console, you will also see a new revision of the app running with the changed image version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gl0We8pT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmoy2ch5tb9a9a4iiik5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gl0We8pT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmoy2ch5tb9a9a4iiik5.png" alt="New Version of blue-green deployement" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you visit &lt;a href="http://localhost:80"&gt;http://localhost:80&lt;/a&gt; in your browser, you will still see only the blue version; this is expected, because we have not yet promoted the green version of the app.&lt;br&gt;
You can confirm this by running the command below, which shows that the new version is in a paused state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts get rollout rollout-bluegreen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now, let's promote the green version of the app by executing the command below:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts promote rollout-bluegreen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the following command and you will see the new (green) version of the app being scaled up fully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts get rollout rollout-bluegreen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can confirm this by listing the pods: the old (blue) set will be terminating or already terminated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you visit the app at &lt;a href="http://localhost:80"&gt;http://localhost:80&lt;/a&gt; in your browser, you will now see only the green version, because we have fully promoted it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lJVutByF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzups8qz65urwpowfpqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lJVutByF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzups8qz65urwpowfpqx.png" alt="Argo Rollouts sample app with green-version" width="880" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You have successfully completed a blue-green deployment using Argo Rollouts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can delete the entire setup (the sample app we deployed) with the command below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; argo-rollouts-example/blue-green-deployment-example/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post, we discussed what progressive delivery is all about and its characteristics. We also learned about the Argo Rollouts custom controller and how it helps achieve blue-green deployments, a form of progressive delivery.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. For more posts like this one, do subscribe to our weekly newsletter. I’d love to hear your thoughts on this post, so start a conversation on &lt;a href="https://twitter.com/NinadDesai73"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/ninad-desai/"&gt;LinkedIn&lt;/a&gt; :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Next?&lt;/strong&gt;&lt;br&gt;
Now that we have an understanding of progressive delivery and have created a blue-green deployment, the next step is to try the &lt;a href="https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-canary-deployment/"&gt;canary deployment using Argo Rollouts&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  References and further reading:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://argoproj.github.io/rollouts/"&gt;Argo Rollouts&lt;/a&gt;\&lt;br&gt;
&lt;a href="https://argoproj.github.io/argo-rollouts/"&gt;Argo Rollouts - Kubernetes Progressive Delivery Controller&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking for help with building your DevOps strategy, or want to outsource DevOps to the experts? Learn why so many startups &amp;amp; enterprises consider us one of the &lt;a href="https://dev.to/devops-consulting-services/"&gt;best DevOps consulting &amp;amp; services companies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>argo</category>
      <category>argorollouts</category>
      <category>progressivedelivery</category>
      <category>deployment</category>
    </item>
    <item>
      <title>Prometheus Definitive Guide Part III - Prometheus Operator</title>
      <dc:creator>Ninii</dc:creator>
      <pubDate>Tue, 21 Sep 2021 13:14:00 +0000</pubDate>
      <link>https://dev.to/ninii72387534/prometheus-definitive-guide-part-iii-prometheus-operator-5338</link>
      <guid>https://dev.to/ninii72387534/prometheus-definitive-guide-part-iii-prometheus-operator-5338</guid>
      <description>&lt;p&gt;In the previous post, we covered monitoring basics, including &lt;a href="https://www.infracloud.io/blogs/prometheus-architecture-metrics-use-cases/" rel="noopener noreferrer"&gt;Prometheus, metrics, its most common use cases&lt;/a&gt;, and &lt;a href="https://www.infracloud.io/blogs/promql-prometheus-guide/" rel="noopener noreferrer"&gt;how to query Prometheus data using PromQL&lt;/a&gt;. If you’re just starting with Prometheus, I’d highly recommend reading the first two parts of the ‘Prometheus Definitive Guide’ series. In this blog post, we will focus on how we can install and manage Prometheus on the Kubernetes cluster using the Prometheus Operator and Helm in an easy way. Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an operator?
&lt;/h2&gt;

&lt;p&gt;Before jumping straight into installing Prometheus with the Prometheus Operator, let's first cover some key concepts behind it.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Custom Resource Definition (CRD)&lt;/strong&gt; is a way to define your own resource kind, analogous to built-in kinds like Deployment or StatefulSet. A CRD defines the structure and validation of the custom kind.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Custom Resource (CR)&lt;/strong&gt; is a resource created following the structure defined by a Custom Resource Definition (CRD).&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Custom Controller&lt;/strong&gt; continuously reconciles the current state of our Kubernetes cluster or application with the desired state that we declare.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;Operator&lt;/strong&gt;, then, is a set of Kubernetes custom controllers deployed in the cluster. They watch for changes to the custom resources they own (those created from CRDs) and react by creating, modifying, or deleting Kubernetes resources.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can read more about this in the &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/" rel="noopener noreferrer"&gt;Custom Resources&lt;/a&gt; documentation page.&lt;/em&gt;&lt;/p&gt;
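&lt;p&gt;To make these concepts concrete, here is a minimal sketch of a CRD and a matching custom resource. The &lt;code&gt;CronTab&lt;/code&gt; kind is the illustrative example used in the Kubernetes documentation; it is not something the Prometheus Operator installs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# CRD: teaches the API server about a new kind called CronTab
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string
---
# CR: an instance of the new kind, validated against the schema above
apiVersion: stable.example.com/v1
kind: CronTab
metadata:
  name: my-crontab
spec:
  cronSpec: "*/5 * * * *"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A custom controller watching &lt;code&gt;CronTab&lt;/code&gt; objects would then act on each instance; the controller plus the CRD together form an operator.&lt;/p&gt;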

&lt;h2&gt;
  
  
  What are some use cases of Kubernetes operators?
&lt;/h2&gt;

&lt;p&gt;A Kubernetes operator can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy stateful services, such as databases, on Kubernetes&lt;/li&gt;
&lt;li&gt;Handle upgrades of your application code&lt;/li&gt;
&lt;li&gt;Scale resources horizontally based on performance metrics&lt;/li&gt;
&lt;li&gt;Back up and restore your application state or databases on demand&lt;/li&gt;
&lt;li&gt;Deploy monitoring, storage, or vault solutions to Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Prometheus Operator?
&lt;/h2&gt;

&lt;p&gt;In simple words, the &lt;a href="https://prometheus-operator.dev/" rel="noopener noreferrer"&gt;Prometheus Operator&lt;/a&gt; is a fully automated way of deploying (like any other standard Kubernetes Deployment object) the Prometheus server, Alertmanager, and all the related Secrets, ConfigMaps, etc., helping you set up the Prometheus monitoring ecosystem and instantiate Kubernetes cluster monitoring in just a few minutes.&lt;/p&gt;

&lt;p&gt;Once deployed, Prometheus Operator provides the following features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation:&lt;/strong&gt; Easily launch a Prometheus instance for your Kubernetes namespace, a specific application, or a team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service discovery:&lt;/strong&gt; Automatically discover the targets to be monitored using familiar Kubernetes label queries, without needing to learn a Prometheus-specific configuration language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy Configuration:&lt;/strong&gt; Manage the configuration of the essential resources of Prometheus like versions, persistence, retention policies, and replicas from a Kubernetes resource.&lt;/p&gt;
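&lt;p&gt;As a sketch of what this looks like in practice, a Prometheus instance can be declared as a custom resource. The field names follow the Prometheus Operator API; the label values below are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  replicas: 2                  # number of Prometheus server replicas
  serviceMonitorSelector:      # which ServiceMonitors this instance picks up
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The operator watches this object and creates (and keeps reconciling) the underlying StatefulSet, configuration Secret, and related resources for you.&lt;/p&gt;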

&lt;h2&gt;
  
  
  Ways to install Prometheus stack
&lt;/h2&gt;

&lt;p&gt;There are three different ways to set up the Prometheus monitoring stack in Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Creating everything on your own
&lt;/h3&gt;

&lt;p&gt;If you’re completely comfortable with the Prometheus components and all their prerequisites, you can manually deploy a YAML spec for every component (Prometheus, Alertmanager, Grafana) along with all the Secrets and ConfigMaps used by the Prometheus stack, in the right sequence to respect their interdependencies.&lt;/p&gt;

&lt;p&gt;This approach is time-consuming and takes considerable effort to deploy and manage. It also requires thorough documentation to replicate in other environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Using the Prometheus Operator
&lt;/h3&gt;

&lt;p&gt;We have already seen, conceptually, how the Prometheus Operator can make our lives easier by managing the lifecycle of all the Prometheus components single-handedly.&lt;/p&gt;

&lt;p&gt;You can &lt;a href="https://github.com/prometheus-operator/prometheus-operator" rel="noopener noreferrer"&gt;find the Prometheus Operator here&lt;/a&gt; and use it to deploy Prometheus inside your Kubernetes cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Using a Helm chart to deploy the operator
&lt;/h3&gt;

&lt;p&gt;This is the most efficient option: use the Helm chart maintained by the Prometheus community to deploy the Prometheus Operator. In a nutshell, Helm performs the initial Prometheus Operator installation and creates the &lt;code&gt;Prometheus&lt;/code&gt;, &lt;code&gt;Alertmanager&lt;/code&gt;, and other custom resources; the Prometheus Operator then manages the entire lifecycle of those custom resources. Installation is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm repo update

helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus prometheus-community/kube-prometheus-stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kube-prometheus-stack chart installs the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus Operator&lt;/li&gt;
&lt;li&gt;Prometheus and Alertmanager (creates the &lt;code&gt;Prometheus&lt;/code&gt;, &lt;code&gt;Alertmanager&lt;/code&gt; and related CRs)&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;node-exporter, and a couple of other exporters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are also pre-configured to work together and set up basic cluster monitoring for you while also making it easy to tweak and add your own customization.&lt;/p&gt;
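&lt;p&gt;For instance, chart defaults can be overridden with a values file. The keys below follow the kube-prometheus-stack chart's values; treat the exact paths as an assumption to verify against your chart version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# values.yaml -- example overrides for kube-prometheus-stack
grafana:
  enabled: true
prometheus:
  prometheusSpec:
    retention: 15d     # keep metrics for 15 days
    replicas: 2        # run two Prometheus replicas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You would then install with &lt;code&gt;helm install prometheus prometheus-community/kube-prometheus-stack -f values.yaml&lt;/code&gt;.&lt;/p&gt;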

&lt;p&gt;The commands themselves complete quickly, but it takes a few more minutes for all the components to be up and running.&lt;/p&gt;

&lt;p&gt;You can run &lt;code&gt;helm get manifest prometheus | kubectl get -f -&lt;/code&gt; command to see all the objects created as below: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt072x12tjhhegzcb7dx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt072x12tjhhegzcb7dx.png" alt="image"&gt;&lt;/a&gt;        &lt;/p&gt;

&lt;p&gt;You would be able to see all the different resources like the Deployments, StatefulSets of the Prometheus stack being created as shown above.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Prometheus find all the targets to monitor and scrape?
&lt;/h2&gt;

&lt;p&gt;To tell Prometheus what it needs to monitor, we pass it a &lt;a href="https://yaml.org/start.html" rel="noopener noreferrer"&gt;YAML&lt;/a&gt; configuration file, usually called &lt;code&gt;prometheus.yaml&lt;/code&gt;, which Prometheus reads to determine what to monitor. Each target endpoint is defined under the &lt;code&gt;scrape_configs&lt;/code&gt; section of prometheus.yaml. A typical sample configuration file shipped with a Prometheus release tarball has the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# my global config&lt;/span&gt;
&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;15s&lt;/span&gt; &lt;span class="c1"&gt;# Set the scrape interval to every 15 seconds. Default is every 1 minute.&lt;/span&gt;
  &lt;span class="na"&gt;evaluation_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt; &lt;span class="c1"&gt;# Evaluate rules every 15 seconds. The default is every 1 minute.&lt;/span&gt;
  &lt;span class="c1"&gt;# scrape_timeout is set to the global default (10s).&lt;/span&gt;

&lt;span class="c1"&gt;# Alertmanager configuration&lt;/span&gt;
&lt;span class="na"&gt;alerting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alertmanagers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;localhost:9093&lt;/span&gt;

&lt;span class="c1"&gt;# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.&lt;/span&gt;
&lt;span class="na"&gt;rule_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/etc/prometheus/alert.rules'&lt;/span&gt;

  &lt;span class="c1"&gt;# - "first_rules.yml"&lt;/span&gt;
  &lt;span class="c1"&gt;# - "second_rules.yml"&lt;/span&gt;

&lt;span class="c1"&gt;# A scrape configuration containing exactly one endpoint to scrape:&lt;/span&gt;
&lt;span class="c1"&gt;# Here it's Prometheus itself.&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# The job name is added as a label `job=&amp;lt;job_name&amp;gt;` to any timeseries scraped from this config.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prometheus'&lt;/span&gt;

    &lt;span class="c1"&gt;# metrics_path defaults to '/metrics'&lt;/span&gt;
    &lt;span class="c1"&gt;# scheme defaults to 'http'.&lt;/span&gt;

    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9090'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node_exporter'&lt;/span&gt;
    &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, let’s dive into some of the key terms from the prometheus.yaml file.&lt;/p&gt;

&lt;p&gt;There are two ways to specify a set of target endpoints &lt;strong&gt;to be scraped&lt;/strong&gt; by Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;scrape_config&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Using ServiceMonitors (Prometheus Operator specific)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  scrape_config vs ServiceMonitor: When to use one over the other?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;scrape_config&lt;/strong&gt; specifies a set of targets and configuration parameters describing how to scrape them. In this case, one scrape configuration block needs to be defined for each target, as you can see in the sample prometheus.yaml file above.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;ServiceMonitor&lt;/strong&gt; lets us create a job entry in &lt;code&gt;scrape_config&lt;/code&gt; in an easier, Kubernetes-native way. Internally, the Prometheus Operator translates the configuration from each ServiceMonitor resource into the &lt;code&gt;scrape_config&lt;/code&gt; section of prometheus.yaml. The &lt;code&gt;Prometheus&lt;/code&gt; resource created by the kube-prometheus-stack has a selector that acts on all ServiceMonitors with the label &lt;code&gt;release: prometheus&lt;/code&gt; (&lt;a href="https://github.com/prometheus-community/helm-charts/blob/2a0d552bab9f9686008fbcdb9ad63c6382485f98/charts/kube-prometheus-stack/values.yaml#L2078-L2099" rel="noopener noreferrer"&gt;configuration&lt;/a&gt;). The diagram below shows how this works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbka6dul8rxxzn8bxpv9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbka6dul8rxxzn8bxpv9i.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image Credits: &lt;a href="https://www.openshift.com/blog" rel="noopener noreferrer"&gt;CoreOS&lt;/a&gt;&lt;/p&gt;
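&lt;p&gt;As a sketch, a minimal ServiceMonitor could look like the following (the names and labels here are illustrative, not taken from the chart):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                # hypothetical name
  labels:
    release: prometheus       # matched by the Prometheus resource's selector
spec:
  selector:
    matchLabels:
      app: my-app             # selects the Service exposing the metrics
  endpoints:
  - port: metrics             # named port on the Service
    interval: 15s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;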

&lt;p&gt;Let's check whether a ServiceMonitor automatically creates a scrape_config entry in the Prometheus config file, taking the Prometheus service itself as an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get services prometheus-prometheus-oper-prometheus &lt;span class="nt"&gt;-o&lt;/span&gt; wide &lt;span class="nt"&gt;--show-labels&lt;/span&gt; 
NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;    AGE   SELECTOR                                                          LABELS
prometheus-prometheus-oper-prometheus   ClusterIP   10.105.67.172   &amp;lt;none&amp;gt;        9090/TCP   12d   &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prometheus,prometheus&lt;span class="o"&gt;=&lt;/span&gt;prometheus-prometheus-oper-prometheus   &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prometheus-operator-prometheus,release&lt;span class="o"&gt;=&lt;/span&gt;prometheus,self-monitor&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check whether the corresponding ServiceMonitor is in place for the Prometheus service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get servicemonitors.monitoring.coreos.com -l app=prometheus-operator-prometheus
NAME                                    AGE
prometheus-prometheus-oper-prometheus   12d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms that the ServiceMonitor for the Prometheus service itself is present. Let’s now see whether this ServiceMonitor “prometheus-prometheus-oper-prometheus” has added a job inside the Prometheus config YAML file.&lt;/p&gt;

&lt;p&gt;To check this, we first need to access the Prometheus pod created by the Prometheus Operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; prometheus-prometheus-prometheus-oper-prometheus-0 &lt;span class="nt"&gt;--&lt;/span&gt; /bin/sh
/prometheus &lt;span class="nv"&gt;$ &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we are inside the pod, let's find the name of the configuration file used by Prometheus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/prometheus &lt;span class="nv"&gt;$ &lt;/span&gt;ps

PID   USER     TIME  COMMAND
1     1000      4h58 /bin/prometheus … &lt;span class="nt"&gt;--config&lt;/span&gt;.file&lt;span class="o"&gt;=&lt;/span&gt;/etc/prometheus/config_out/prometheus.env.yaml
59    1000      0:00 /bin/sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see above, a config file named prometheus.env.yaml is created by the operator and used by the Prometheus server to discover the target endpoints to be monitored and scraped. Finally, let’s check whether the job for the Prometheus service itself has been added by the ServiceMonitor inside this configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/prometheus &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/prometheus/config_out/prometheus.env.yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 10 &lt;span class="s2"&gt;"job_name: default/prometheus-prometheus-oper-prometheus/0"&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default/prometheus-prometheus-oper-prometheus/0&lt;/span&gt;
  &lt;span class="na"&gt;honor_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;kubernetes_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;endpoints&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;metrics_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the job for the Prometheus service. This shows that a ServiceMonitor automatically creates a job for the Kubernetes service to be monitored and scraped.&lt;/p&gt;

&lt;p&gt;You can also view the scrape_config directly in the Prometheus web UI under &lt;strong&gt;Status&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Configuration&lt;/strong&gt;, as you can see below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvauhm1dt3y355yfavu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvauhm1dt3y355yfavu9.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from ServiceMonitor, there is another way, called PodMonitor, to scrape your Kubernetes pods. This is also a custom resource handled by the Prometheus Operator.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a PodMonitor, and when is it needed?
&lt;/h2&gt;

&lt;p&gt;A PodMonitor declaratively specifies how to directly monitor a group of pods.&lt;/p&gt;

&lt;h4&gt;
  
  
  PodMonitor vs ServiceMonitor
&lt;/h4&gt;

&lt;p&gt;You may be wondering: if we already have ServiceMonitor in place, why do we need PodMonitor, which ultimately does the same job?&lt;/p&gt;

&lt;p&gt;The way I see it, ServiceMonitor is suitable if you already have a Service for your pods. However, if you don't have one, then PodMonitor is the right choice. &lt;/p&gt;
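&lt;p&gt;Assuming your pods expose metrics on a named port but are not fronted by a Service, a PodMonitor could be sketched as follows (the names here are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  labels:
    release: prometheus       # so the Prometheus resource picks it up
spec:
  selector:
    matchLabels:
      app: my-app             # selects the pods directly, no Service needed
  podMetricsEndpoints:
  - port: metrics
    interval: 15s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;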

&lt;h3&gt;
  
  
  Discovering targets to scrape
&lt;/h3&gt;

&lt;p&gt;To scrape target endpoints, Prometheus first needs to know what they are and how to monitor them. There are two main ways in the Prometheus configuration to define the target endpoints &lt;strong&gt;to be monitored.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Using static_config mechanism
&lt;/h4&gt;

&lt;p&gt;If you have a very small, fixed set of Kubernetes services/endpoints to be monitored, you can define those static endpoints using static_config in the prometheus.yml file.&lt;/p&gt;
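&lt;p&gt;For instance, a static_config block for a fixed endpoint might look like this (the job name and target address are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;scrape_configs:
  - job_name: 'my-static-app'
    static_configs:
      - targets: ['10.0.0.5:8080']   # fixed, manually maintained endpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;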

&lt;p&gt;Take a look at &lt;a href="https://github.com/prometheus/prometheus/blob/release-2.29/documentation/examples/prometheus.yml" rel="noopener noreferrer"&gt;this example&lt;/a&gt; that shows how we can configure Prometheus to use static_configs to monitor Prometheus itself by default.&lt;/p&gt;

&lt;p&gt;This is fine for simple use cases, but in practice, manually keeping your prometheus.yml up to date as machines are added and removed quickly becomes tedious, particularly in a dynamic environment like Kubernetes where new instances of application services may be brought up every minute.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using service_discovery mechanism
&lt;/h4&gt;

&lt;p&gt;Prometheus supports various service-discovery mechanisms such as DNS, Kubernetes, AWS, Consul, and custom ones. These mechanisms dynamically discover target endpoints to be monitored and scraped. In the case of Kubernetes, Prometheus uses the Kubernetes API to discover the list of target endpoints to monitor and scrape.&lt;/p&gt;

&lt;p&gt;Take a look at &lt;a href="https://github.com/prometheus/prometheus/blob/release-2.29/documentation/examples/prometheus-kubernetes.yml" rel="noopener noreferrer"&gt;this example&lt;/a&gt; that shows how we can configure Prometheus for Kubernetes.&lt;/p&gt;

&lt;p&gt;The Prometheus Operator takes care of configuring the above based on the &lt;a href="https://prometheus-operator.dev/docs/operator/api/#servicemonitor" rel="noopener noreferrer"&gt;ServiceMonitor&lt;/a&gt; and &lt;a href="https://prometheus-operator.dev/docs/operator/api/#podmonitor" rel="noopener noreferrer"&gt;PodMonitor&lt;/a&gt; resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rules in Prometheus
&lt;/h3&gt;

&lt;p&gt;Prometheus supports two types of rules, which can be configured and evaluated at regular intervals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recording rules&lt;/li&gt;
&lt;li&gt;Alerting rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can create a YAML file containing your rule statements and load them into Prometheus using the rule_files field in &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/" rel="noopener noreferrer"&gt;Prometheus configuration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When using the Prometheus Operator, these rules can be created with the help of the &lt;a href="https://prometheus-operator.dev/docs/operator/api/#prometheusrule" rel="noopener noreferrer"&gt;PrometheusRule&lt;/a&gt; resource.&lt;/p&gt;
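&lt;p&gt;As a sketch, a PrometheusRule resource wrapping an alerting rule could look like this (the resource name and labels are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  labels:
    release: prometheus       # assumed to match the rule selector
spec:
  groups:
  - name: example
    rules:
    - alert: HighRequestLatency
      expr: job:request_latency_seconds:mean5m{job="myjob"} &gt; 0.5
      for: 10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;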

&lt;h4&gt;
  
  
  Recording rules
&lt;/h4&gt;

&lt;p&gt;Recording rules allow you to pre-compute frequently used PromQL expressions that internally require a relatively large number of steps to produce a result. The next time you run the same PromQL query, the results are fetched from the pre-computed series, which is much faster than executing the original query again and again. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;job:http_inprogress_requests:sum&lt;/span&gt;
      &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum by (job) (http_inprogress_requests)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Alerting rules
&lt;/h4&gt;

&lt;p&gt;Alerting rules allow you to define alert conditions based on PromQL and to send notifications about firing alerts to an external receiver. Whenever the alert expression evaluates to true, an alert is fired. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighRequestLatency&lt;/span&gt;
    &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;job:request_latency_seconds:mean5m{job="myjob"} &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
    &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;page&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;High request latency&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Alerts and visualization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpme4enxhzun5c90a682.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpme4enxhzun5c90a682.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Credit: &lt;a href="https://www.youtube.com/watch?v=9GMWvFcQjYI&amp;amp;t=314s" rel="noopener noreferrer"&gt;Prometheus introduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the alerting rules are configured, we would like to send alerts to external receivers. We need a component that can summarize alerts and control, and sometimes silence, the notifications that the receiver gets. This is where Alertmanager comes into the picture. &lt;/p&gt;

&lt;h3&gt;
  
  
  Alertmanager Insights
&lt;/h3&gt;

&lt;p&gt;As you can see in the diagram above, Alertmanager periodically receives information about alert state from the Prometheus server, and then groups, deduplicates, and sends notifications to the defined receiver, such as email or PagerDuty.&lt;/p&gt;

&lt;p&gt;But where and how do we set up Alertmanager in the Kubernetes cluster? Well, we don't need to worry about that: the Prometheus Operator we deployed earlier with the Helm chart creates Alertmanager as a StatefulSet, as you can see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get statefulsets.apps 

NAME                                                    READY      AGE
alertmanager-prometheus-prometheus-oper-alertmanager    1/1         8d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Alertmanager StatefulSet internally uses a configuration file called alertmanager.yaml, as you can see below (inside the Alertmanager pod):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/bin/alertmanager &lt;span class="nt"&gt;--config&lt;/span&gt;.file&lt;span class="o"&gt;=&lt;/span&gt;/etc/alertmanager/config/alertmanager.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this alertmanager.yaml file (shown below), some of the key things, such as route and receiver, are defined.&lt;/p&gt;

&lt;p&gt;Here,&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;route:&lt;/strong&gt;&lt;br&gt;
A block that defines how alerts are grouped and where they are routed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;receiver:&lt;/strong&gt; &lt;br&gt;
The destination to which alerts are finally sent. It can be a webhook, an email address, or a tool like PagerDuty. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;inhibit_rules:&lt;/strong&gt;&lt;br&gt;
This section allows you to mute a set of alerts when another alert is already firing for the same cause. For example, if an application service goes down, warning-level notifications are silenced because that service is already in a critical state.&lt;/p&gt;

&lt;p&gt;Take a look at the sample Alertmanager configuration file below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resolve_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;

&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alertname'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;group_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;repeat_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;web.hook'&lt;/span&gt;
&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;web.hook'&lt;/span&gt;
  &lt;span class="na"&gt;webhook_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:5001/'&lt;/span&gt;
&lt;span class="na"&gt;inhibit_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical'&lt;/span&gt;
    &lt;span class="na"&gt;target_match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;warning'&lt;/span&gt;
    &lt;span class="na"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alertname'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Metrics visualization with Grafana
&lt;/h3&gt;

&lt;p&gt;Okay, enough of all the insights; let’s access Grafana straight away. Grafana is the standard tool that helps you visualize all the metrics you have gathered with Prometheus.&lt;/p&gt;

&lt;p&gt;Our kube-prometheus-stack Helm chart has already deployed Grafana for us. To access the Grafana dashboard, we first need to find the Grafana service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get services 


NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                      

prometheus-grafana                ClusterIP    10.104.143.147    &amp;lt;none&amp;gt;       80/TCP                 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's port-forward to this service so that we can access the Grafana web interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/prometheus-grafana 3000:80

Forwarding from 127.0.0.1:3000 -&amp;gt; 3000
Forwarding from &lt;span class="o"&gt;[&lt;/span&gt;::1]:3000 -&amp;gt; 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then in the browser visit &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you can see the Grafana login page below, congratulations! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8hhkn2k8uascbjuwrpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8hhkn2k8uascbjuwrpm.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter the default &lt;em&gt;username: admin&lt;/em&gt; and &lt;em&gt;password: prom-operator&lt;/em&gt;, which you can find &lt;a href="https://github.com/helm/charts/tree/master/stable/prometheus-operator" rel="noopener noreferrer"&gt;here&lt;/a&gt;, to access Grafana.&lt;/p&gt;

&lt;p&gt;After entering these credentials, you will be logged in to Grafana as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowtjpzzttms53pc1wtzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowtjpzzttms53pc1wtzo.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Dashboard&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Manage&lt;/strong&gt;, and you will see all the dashboards provided by kube-prometheus-stack for our Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdy7eop20mmwdxr2g4iu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdy7eop20mmwdxr2g4iu.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can browse through any of those dashboards; for example, the 'Kubernetes/Compute Resources/Pod' dashboard shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6i56kx4l0xlgyv4a9mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6i56kx4l0xlgyv4a9mg.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All these standard dashboards are generated from the &lt;a href="https://github.com/kubernetes-monitoring/kubernetes-mixin" rel="noopener noreferrer"&gt;kubernetes-mixin&lt;/a&gt; project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Let’s do a quick recap before concluding this post. We discussed what the Prometheus Operator is and how we can configure Prometheus easily with the help of the Prometheus Operator and Helm charts. We also explored how Prometheus discovers the resources to monitor, which components of Prometheus need to be configured, and how they work. Finally, we looked at how to set up alerts and how to visualize metrics.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. For more posts like this one, do subscribe to our weekly newsletter. I’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.twitter.com/infracloudio" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/company/infracloudio" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; :).&lt;/p&gt;

&lt;h3&gt;
  
  
  References and further reading:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Official Prometheus documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=QoDqxm7ybLc&amp;amp;t=284s&amp;amp;ab_channel=TechWorldwithNana" rel="noopener noreferrer"&gt;Setup Prometheus Monitoring on Kubernetes using Helm and Prometheus Operator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/prometheus-operator/prometheus-operator" rel="noopener noreferrer"&gt;https://github.com/prometheus-operator/prometheus-operator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>monitoring</category>
      <category>prometheus</category>
      <category>operator</category>
      <category>observability</category>
    </item>
    <item>
      <title>Achieving Cloud Native Security and Compliance with Teleport</title>
      <dc:creator>Ninii</dc:creator>
      <pubDate>Wed, 17 Feb 2021 13:34:42 +0000</pubDate>
      <link>https://dev.to/ninii72387534/achieving-cloud-native-security-and-compliance-with-teleport-554b</link>
      <guid>https://dev.to/ninii72387534/achieving-cloud-native-security-and-compliance-with-teleport-554b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S_wdCjoZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v8s4byx3i1s0g5osfipw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S_wdCjoZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v8s4byx3i1s0g5osfipw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Security is the most critical aspect of any IT solution, and with the ever-increasing adoption of cloud-native technologies, the need for &lt;a href="https://www.nccoe.nist.gov/projects/building-blocks/zero-trust-architecture"&gt;Zero Trust Architecture&lt;/a&gt; is irrefutable because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The traditional networking approach is not effective enough to fully secure cloud-native applications.&lt;/li&gt;
&lt;li&gt;As cloud offerings see ever-heavier use, security policies around applications need to scale as well.&lt;/li&gt;
&lt;li&gt;With more emphasis on loosely coupled, microservice-based applications, the chances of introducing vulnerabilities also increase.&lt;/li&gt;
&lt;li&gt;Organizations use multiple clouds to take advantage of different offerings.&lt;/li&gt;
&lt;li&gt;Connectivity from on-premises to the cloud, and the other way around, is a reality.&lt;/li&gt;
&lt;li&gt;Devices that span beyond traditional data centres &amp;amp; clouds are increasingly used to provide connectivity to remote sites.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To sum up, there is no "real network boundary" anymore, and hence we need a model in which we trust no one. Traditionally, it was assumed that communication between entities inside the same data centre is secure; with zero trust, we don't even assume that. So, it has become a mandate to have a zero-trust security framework around your cloud-native applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teleport? What is it? Why is it needed?
&lt;/h2&gt;

&lt;p&gt;Gravitational &lt;a href="https://goteleport.com/"&gt;Teleport&lt;/a&gt; is one such fine product in the space of &lt;a href="https://en.wikipedia.org/wiki/Zero_Trust_Networks"&gt;“Zero Trust Networks”&lt;/a&gt; for cloud-native applications. It acts as a gateway for identity and access management of clusters of Linux servers via SSH or the Kubernetes API. It can replace traditional OpenSSH for organizations looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better security and compliance practices.&lt;/li&gt;
&lt;li&gt;End-to-end visibility of activities happening in their infrastructure.&lt;/li&gt;
&lt;li&gt;Achieving compliance standards such as SOC 2 for their software products.&lt;/li&gt;
&lt;li&gt;Reduction in the operational burden of access management across both on-premise and cloud-native infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benefits of using Teleport for cloud-native applications
&lt;/h2&gt;

&lt;p&gt;Teleport is a cloud-native SSH solution. Below is a list of the key features that Teleport offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can act as a single SSH/Kubernetes gateway machine for your organizations.&lt;/li&gt;
&lt;li&gt;It can provide an audit log and recording of all kubectl commands executed in your Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;It can also run in “agentless” mode, where session details are MITM-ed and recorded on the proxy server instead of being kept on the nodes.&lt;/li&gt;
&lt;li&gt;It does not rely on static keys and provides authentication based on SSH certificates.&lt;/li&gt;
&lt;li&gt;It uses auto-expiring keys signed by a cluster certificate authority (CA).&lt;/li&gt;
&lt;li&gt;It enforces second-factor authentication.&lt;/li&gt;
&lt;li&gt;You can connect to clusters protected by a firewall, without direct internet access, via an SSH bastion host.&lt;/li&gt;
&lt;li&gt;You can share sessions with your colleagues for troubleshooting any issues collectively.&lt;/li&gt;
&lt;li&gt;It’s a single tool that can manage RBAC for both SSH and Kubernetes.&lt;/li&gt;
&lt;li&gt;You can replay audit logs from recorded sessions to detect any safety issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s find out how this tool works, along with its inner workings, in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Let's walk through a scenario of a user connecting to a node and how Teleport works in that case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zVxJ3JUz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yawwlrbacqa7qbum35z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zVxJ3JUz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yawwlrbacqa7qbum35z6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  1: Establishing client connection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Lkxfzv9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aimxseiceh0bfzubqq7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Lkxfzv9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aimxseiceh0bfzubqq7t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the client tries to SSH to the required target, it first establishes a connection to the Teleport proxy and offers the client’s certificate. The Teleport proxy records SSH sessions and keeps an eye on users' live and active sessions. Because it keeps a record of all users logged in to the target server, the Teleport proxy also lets SSH users see whether anyone else is connected to the same target the client is intending to reach. Once the client has SSH'd to the server, it can run the command &lt;code&gt;tsh status&lt;/code&gt;, which provides the session ID as well as the user IDs of the users who are logged in.&lt;/p&gt;
&lt;h3&gt;
  
  
  2: Authenticate client provided certificate
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8NlWeGZh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wz800tpxjdpiozf9a4ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8NlWeGZh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wz800tpxjdpiozf9a4ew.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the user or client is logging in for the first time, or if the previously provided client certificate has expired, the proxy server will deny the connection. In this case, it asks the user to log in again using a password and, if already enabled, a two-factor authentication code. Since this connection request is HTTPS, Teleport must have the right HTTPS certificate installed on the proxy server.&lt;/p&gt;

&lt;p&gt;For 2FA, Teleport currently supports Google Authenticator, Authy, and any time-based one-time password generator.&lt;/p&gt;
&lt;h3&gt;
  
  
  3: Lookup node
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SOaQZN31--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iqwi3hmap1cpn5rdhz6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SOaQZN31--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iqwi3hmap1cpn5rdhz6m.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this stage, the proxy server tries to locate the requested target server in the cluster. It uses three lookup mechanisms to resolve the requested target/node server’s IP address.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lookup mechanism 1&lt;/strong&gt;: Using DNS resolver to resolve the target name requested by the client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup mechanism 2&lt;/strong&gt;: The next step is to check with the Teleport Auth Server if the requested target/node server is present in its list of target/nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup mechanism 3&lt;/strong&gt;: Finally the Teleport Auth Server is used to find the target server/node with the label name suggested by the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the requested target/node server is located, the proxy server forwards the client's connection request to that server. The target server then starts creating an event log file using the dir backend. The entire session recording is stored under the log directory in &lt;code&gt;data_dir&lt;/code&gt; (usually &lt;code&gt;/var/lib/teleport/log/sessions/default&lt;/code&gt;). This session log contains events like &lt;code&gt;session.start&lt;/code&gt; and &lt;code&gt;session.end&lt;/code&gt;. It also records the complete stream of bytes that goes to an SSH session's standard input, as well as everything that comes out of its standard output. These session event logs are stored in a raw format, either &lt;code&gt;.bytes&lt;/code&gt; or &lt;code&gt;.chunks.gz&lt;/code&gt;. From these raw logs, a video recording of the SSH session is generated.&lt;/p&gt;

&lt;p&gt;For example, we can see the session files on one target node by running the following command:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls /var/lib/teleport/log/sessions/default

-rw-r----- 1 root root 506192 Feb 4 00:46 1c301ec7-aec5-22e4-c1c3-40167a86o821.session.bytes

-rw-r----- 1 root root  44943 Feb 4 00:46 1c301ec7-aec5-22e4-c1c3-40167a86o821.session.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If needed, we can run commands like &lt;code&gt;tsh --proxy=proxy play 1c301ec7-aec5-22e4-c1c3-40167a86o821&lt;/code&gt; to replay a session.&lt;/p&gt;
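&lt;p&gt;Since each &lt;code&gt;.session.log&lt;/code&gt; file holds &lt;code&gt;session.start&lt;/code&gt;/&lt;code&gt;session.end&lt;/code&gt; events, they can also be inspected directly with standard shell tools. Below is a minimal sketch; the JSON-per-line event format and the two sample lines are assumptions fabricated for illustration, while real files live under &lt;code&gt;/var/lib/teleport/log/sessions/default&lt;/code&gt;.&lt;/p&gt;

```shell
# Count session lifecycle events in a session log.
# The two JSON lines below are a fabricated stand-in for a real
# *.session.log file from /var/lib/teleport/log/sessions/default.
log='{"event":"session.start","sid":"1c301ec7"}
{"event":"session.end","sid":"1c301ec7"}'
printf '%s\n' "$log" | grep -c '"event":"session\.'   # prints 2
```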

&lt;p&gt;This is how the target server records the session and keeps streaming it to the Auth Server for storage.&lt;/p&gt;
&lt;h3&gt;
  
  
  4: Authenticate node certificate
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6whdsvx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mn9r1z04vdz1onqz3y2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6whdsvx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mn9r1z04vdz1onqz3y2i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For future connection requests, when the target/node server receives a client connection request, it again goes back and checks with the Teleport Auth Server if the target/node’s certificate is valid or not. If the certificate is valid, then the target/node server starts the SSH session.&lt;/p&gt;
&lt;h3&gt;
  
  
  5: Grant user node access
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MIMFTzgl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nhpagpjbunu85b684yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MIMFTzgl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nhpagpjbunu85b684yc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the last step, the target/node server asks the Teleport Auth Server for a list of OS users and cross-verifies that the client is really authorized to use the requested OS login, and is thus authorized to SSH to the requested target/node server.&lt;/p&gt;

&lt;p&gt;To sum up, if a user Joe wants to connect to the server/node called &lt;code&gt;grav-00&lt;/code&gt;, the sequence of actions below takes place inside Teleport to grant Joe access to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---pwUNa6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8mihp0si3qzhsifle231.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---pwUNa6_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8mihp0si3qzhsifle231.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://goteleport.com/teleport/docs/architecture/overview/"&gt;https://goteleport.com/teleport/docs/architecture/overview/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation/demo
&lt;/h2&gt;

&lt;p&gt;Just for demo purposes, we will have one Teleport master, which will run Teleport and secure only one node/target server. In production, you might have more than one Teleport master for high availability, and potentially hundreds of nodes secured by Teleport. For now, we will do these steps manually, but you can easily automate them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y9K5iPXi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3p7pdobzdfp0x0x6wuiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y9K5iPXi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3p7pdobzdfp0x0x6wuiw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the installation of Teleport, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Linux machine with ports 3023, 3024, 3025 and 3080 open.&lt;/li&gt;
&lt;li&gt;A domain name, DNS and TLS certificates for a production system.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation steps
&lt;/h2&gt;

&lt;p&gt;Note that the following installation was completed and tested on localhost with a self-signed certificate, by referring to the &lt;a href="https://goteleport.com/teleport/docs/quickstart/"&gt;Teleport Quick Start guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Install Teleport on a Linux host
&lt;/h3&gt;

&lt;p&gt;There are multiple ways to install Teleport which you can find &lt;a href="https://goteleport.com/teleport/docs/installation/"&gt;here&lt;/a&gt;. I will be following the one for Amazon Linux.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum-config-manager --add-repo https://rpm.releases.teleport.dev/teleport.repo
sudo yum install teleport
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Step 2: Configure Teleport
&lt;/h3&gt;

&lt;p&gt;Teleport recommends using a YAML configuration file for setup. Create a YAML file called &lt;code&gt;teleport.yaml&lt;/code&gt; and add the configuration below to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Make sure to validate your &lt;code&gt;teleport.yaml&lt;/code&gt; file with a local tool like &lt;a href="https://yamllint.readthedocs.io/en/stable/quickstart.html"&gt;yamllint&lt;/a&gt; to avoid any syntax errors.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# teleport.yaml&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;app_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;debug_app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;auth_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cluster_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;teleport-quickstart&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;listen_addr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;internal_ip_of_server&amp;gt;:3025"&lt;/span&gt;
  &lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy,node,app:f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765"&lt;/span&gt;
&lt;span class="na"&gt;proxy_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;listen_addr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0:3023"&lt;/span&gt;
  &lt;span class="na"&gt;tunnel_listen_addr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0:3024"&lt;/span&gt;
  &lt;span class="na"&gt;web_listen_addr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0:3080"&lt;/span&gt;
&lt;span class="na"&gt;ssh_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
&lt;span class="na"&gt;teleport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;data_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/teleport&lt;/span&gt;  &lt;span class="c1"&gt;# This is the default directory, where Teleport stores its log and all the configuration files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Below are some of the configuration details from the above YAML file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;teleport.data_dir: /var/lib/teleport&lt;/code&gt;:&lt;br&gt;This is the default data directory, where Teleport stores its log and all the configuration files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;auth_service&lt;/code&gt; section:&lt;br&gt;Auth service listens on SSH port 3025 to provide its API to communicate with other nodes in a cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;auth_service.tokens&lt;/code&gt;:&lt;br&gt;To make each node/server participate in the Teleport cluster, we need to establish a secure tunnel between the master and the other participating nodes. For this connection, there should be a shared secret among them. This secret is nothing but what we call a token. We can generate a static token or a dynamic one. For the demo purpose, we are going to use a static one &lt;code&gt;f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765&lt;/code&gt;. (Teleport recommends using tools like &lt;code&gt;pwgen -s 32&lt;/code&gt; to generate sufficiently random static tokens of 32+ byte length).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;proxy_service&lt;/code&gt; section:&lt;br&gt;The proxy service listens on port 3023 (i.e. &lt;code&gt;listen_addr&lt;/code&gt;) and acts as a forwarder, directing incoming connection requests to the desired target node on port 3022 (the Teleport equivalent of port 22 for SSH).&lt;br&gt;The proxy service also uses SSH port 3024 for reverse SSH tunnelling from behind a firewall into the proxy server.&lt;br&gt;The Teleport web UI listens on port 3080 (&lt;code&gt;web_listen_addr&lt;/code&gt;), and all HTTPS connections from tsh users are authenticated on this port.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;app_service&lt;/code&gt; section:&lt;br&gt;For testing and debugging, Teleport provides an in-built app, which can be enabled by setting &lt;code&gt;debug_app&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
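&lt;p&gt;The &lt;code&gt;pwgen&lt;/code&gt; tool mentioned above is one option for tokens; as a sketch, any cryptographic random source will do. The snippet below uses &lt;code&gt;openssl rand&lt;/code&gt; (an assumed-available alternative, not something the Teleport docs mandate) to produce a 32-byte, hex-encoded static token:&lt;/p&gt;

```shell
# Generate a 32-byte random token, hex-encoded (64 characters),
# suitable as a static join token under auth_service.tokens.
TOKEN="$(openssl rand -hex 32)"
echo "$TOKEN"
echo "${#TOKEN}"   # prints 64
```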

&lt;p&gt;Now, you can move this file to Teleport's default configuration location by running &lt;code&gt;sudo mv teleport.yaml /etc&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Configure domain name and obtain TLS certificates
&lt;/h3&gt;

&lt;p&gt;Teleport normally requires a secured, publicly accessible endpoint, so a domain name and TLS certificates are needed too. But since we are just testing here, we can set things up locally using a self-signed TLS certificate, without a domain name.&lt;/p&gt;
&lt;h4&gt;
  
  
  Self-signed TLS certificate creation
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl req -x509 -out localhost.crt -keyout localhost.key \
    -newkey rsa:2048 -nodes -sha256 \
    -subj '/CN=localhost' -extensions EXT -config &amp;lt;( \printf "[dn]\nCN=localhost\n[req]\ndistinguished_name = dn\n[EXT]\nsubjectAltName=DNS:localhost\nkeyUsage=digitalSignature\nextendedKeyUsage=serverAuth")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This will create a certificate &lt;code&gt;localhost.crt&lt;/code&gt; and a key &lt;code&gt;localhost.key&lt;/code&gt; in your current directory.&lt;/p&gt;
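&lt;p&gt;As a quick sanity check (a sketch, assuming OpenSSL 1.1.1 or newer), you can regenerate the certificate in a scratch directory using the shorter &lt;code&gt;-addext&lt;/code&gt; form and confirm its subject before wiring it into Teleport:&lt;/p&gt;

```shell
# Recreate the self-signed cert in a scratch directory and inspect it.
# -addext is an OpenSSL 1.1.1+ shorthand for the EXT config section above.
tmp="$(mktemp -d)"
openssl req -x509 -out "$tmp/localhost.crt" -keyout "$tmp/localhost.key" \
    -newkey rsa:2048 -nodes -sha256 -subj '/CN=localhost' \
    -addext 'subjectAltName=DNS:localhost'
# Should report a subject of CN = localhost:
openssl x509 -in "$tmp/localhost.crt" -noout -subject
```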
&lt;h4&gt;
  
  
  Update teleport.yaml with certificate details
&lt;/h4&gt;

&lt;p&gt;We can now amend the previously created &lt;code&gt;teleport.yaml&lt;/code&gt; configuration with the locally created TLS certificate from above. Add the code below, a new &lt;code&gt;https_keypairs&lt;/code&gt; section, at the end of the previous YAML configuration.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;https_keypairs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/local/bin/localhost.key&lt;/span&gt;
  &lt;span class="na"&gt;cert_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/local/bin/localhost.crt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Now you can start Teleport directly using &lt;code&gt;sudo teleport start&lt;/code&gt;, or, if you are already in the /usr/local/bin directory, you can also use &lt;code&gt;nohup sudo ./teleport start &amp;amp;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also set up a systemd service by following &lt;a href="https://github.com/gravitational/teleport/tree/master/examples/systemd"&gt;these instructions&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Create a Teleport user and set up 2-factor authentication
&lt;/h3&gt;

&lt;p&gt;The next step is to create a &lt;code&gt;teleport-admin&lt;/code&gt; user, which will let us SSH into any node that is part of the Teleport cluster. For that, we need to assign OS users to the teleport-admin account. This can be achieved using the command below, which assigns the OS users &lt;code&gt;root&lt;/code&gt;, &lt;code&gt;ubuntu&lt;/code&gt; and &lt;code&gt;ec2-user&lt;/code&gt; to the &lt;code&gt;teleport-admin&lt;/code&gt; user.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo tctl users add teleport-admin root,ubuntu,ec2-user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Once you run this command, Teleport will give you output similar to the following.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User teleport-admin has been created but requires a password. Share this URL with the user to complete user setup, link is valid for 1h0m0s:

https://&amp;lt;internal_ip_address&amp;gt;:3080/web/invite/7e3bd3d504c10b2637db6b06f29529

NOTE: Make sure &amp;lt;internal_ip_address&amp;gt;:3080 points at a Teleport proxy that users can access.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You can now go to your favourite browser and open the URL:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://&amp;lt;public_ip_address_of_server_where_teleport_installed&amp;gt;:3080/web/invite/7e3bd3d504c10b2637db6b06f29529
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This will open a page like the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ON0p4UO0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhtnzf9604giyanqwqqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ON0p4UO0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhtnzf9604giyanqwqqz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set any desired password with a minimum length of six characters, then use an authenticator app (like Google Authenticator, Authy, etc.) to scan the code, obtain the six-digit code, and enter it into the TWO FACTOR TOKEN box.&lt;/p&gt;

&lt;p&gt;Once the above is completed, the home page below will appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cu2dSeex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02mjfun69f8e6f6w1eds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cu2dSeex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02mjfun69f8e6f6w1eds.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Congratulations!! You have successfully installed a single node cluster of Teleport.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Log in using tsh to your Teleport cluster
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tsh login --proxy=localhost:3080 --user=teleport-admin --insecure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You will see a prompt like the one below for the password and the OTP that you will receive via your authenticator.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: You are using insecure connection to SSH proxy https://localhost:3080
Enter password for Teleport user teleport-admin:
Enter your OTP token:
511510
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;On successful login, you will see the below message displayed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: You are using insecure connection to SSH proxy https://localhost:3080
Profile URL:   https://localhost:3080
Logged in as:       teleport-admin
Cluster:            teleport-quickstart
Roles:              admin*
Logins:             ec2-user, root, ubuntu
Kubernetes:         disabled
Valid until:        2021-01-07 05:20:50 +0000 UTC [valid for 12h0m0s]
Extensions:         permit-agent-forwarding, permit-port-forwarding, permit-pty
RBAC is only available in Teleport Enterprise
https://gravitational.com/teleport/docs/enterprise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Step 6: Check Teleport cluster
&lt;/h3&gt;

&lt;p&gt;You can check the status of your cluster using the command:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-172-31-34-133 bin]$ tsh ls
Node Name                                    Address            Labels
-------------------------------------------- ------------------ -----------
ip-172-31-34-133.ap-south-1.compute.internal 172.31.34.133:3022 env=staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
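&lt;p&gt;If you want to script against the cluster, the tabular &lt;code&gt;tsh ls&lt;/code&gt; output is easy to post-process. A small sketch, with the sample output above hard-coded since the real command needs a live cluster:&lt;/p&gt;

```shell
# Extract just the node names from tsh ls output, for use in scripts.
# Against a live cluster this would be: tsh ls | sed '1,2d' | awk '{print $1}'
tsh_output='Node Name                                    Address            Labels
-------------------------------------------- ------------------ -----------
ip-172-31-34-133.ap-south-1.compute.internal 172.31.34.133:3022 env=staging'
# Drop the header and separator rows, then keep the first column:
printf '%s\n' "$tsh_output" | sed '1,2d' | awk '{print $1}'
# prints ip-172-31-34-133.ap-south-1.compute.internal
```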
&lt;h3&gt;
  
  
  Step 7: Add new node or target server in Teleport cluster
&lt;/h3&gt;

&lt;p&gt;You can now secure more servers with Teleport by adding them to the Teleport cluster we just created. For that, we need to install Teleport on those servers/nodes by repeating &lt;strong&gt;Step 1 and Step 2&lt;/strong&gt; on each of them, followed by the simple command below, executed on each target server to add it to the cluster.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo teleport start --roles=node --token=f7adb7ccdf04037bcd2b52ec6010fd6f0caec94ba190b765 --labels=env=quickstart --insecure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After this command, you will see output like:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[NODE]         Service 5.1.0:v5.1.0-0-g46679fb34 is starting on 0.0.0.0:3022.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This confirms that the node has been added to the Teleport cluster. We can verify it by running the &lt;code&gt;tsh ls&lt;/code&gt; command on the Teleport master server (where we first installed Teleport).&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-172-31-34-133 bin]$ tsh ls
Node Name                                    Address            Labels
-------------------------------------------- ------------------ --------------
ip-172-31-34-133.ap-south-1.compute.internal 172.31.34.133:3022 env=staging
ip-172-31-41-130.ap-south-1.compute.internal 172.31.41.130:3022 env=quickstart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We can now log in to any node with &lt;code&gt;tsh ssh user-name@Node Name&lt;/code&gt;. The target &lt;code&gt;Node Name&lt;/code&gt; in this command is the one you will find in the 'Node Name' column of the &lt;code&gt;tsh ls&lt;/code&gt; output.&lt;/p&gt;

&lt;p&gt;Once the user is connected to the required target server/node, you can see video recordings of all the user's activities during a particular session on the Teleport web UI. A quick video demonstration is shown below.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick Demo
&lt;/h2&gt;

&lt;p&gt;The demo video will show you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to check Teleport cluster status by running the &lt;code&gt;tsh status&lt;/code&gt; command and list the nodes participating in the cluster with the help of the &lt;code&gt;tsh ls&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;What the Teleport web interface looks like, with its different sections.&lt;/li&gt;
&lt;li&gt;How to access the &lt;strong&gt;Session Recordings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How you can &lt;strong&gt;Join session&lt;/strong&gt; for troubleshooting collectively with other users.&lt;/li&gt;
&lt;li&gt;How you can access the &lt;strong&gt;Audit logs&lt;/strong&gt; in JSON format.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/GtHacV0jIII"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing security and achieving compliance, particularly in cloud-native infrastructure, has been challenging for most organisations. In many cases, a do-it-yourself approach built from a variety of capabilities and tools turns out to be a complex solution.&lt;/p&gt;

&lt;p&gt;Teleport is a mature product with a focus on developer productivity, and its alignment with compliance needs makes it a leader. Features like ChatOps integration, a focus on edge/IoT connectivity, and a variety of SSO integrations are attractive. Keep in mind that a lot can be done with the open-source offering, but implementing RBAC, which I consider a bare-minimum requirement in an enterprise environment, needs the enterprise version.&lt;/p&gt;

&lt;p&gt;Hope this post was helpful to you. Do try Teleport for cloud-native security and compliance, and share your experience in the comments section below. Happy Coding :)&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://goteleport.com/teleport/docs/"&gt;https://goteleport.com/teleport/docs/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nccoe.nist.gov/projects/building-blocks/zero-trust-architecture"&gt;https://www.nccoe.nist.gov/projects/building-blocks/zero-trust-architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloudnative</category>
      <category>cybersecurity</category>
      <category>security</category>
      <category>compliance</category>
    </item>
  </channel>
</rss>
