rhazn

Posted on Aug 17, 2019 • Edited on Feb 10, 2024 • Originally published at heltweg.org

Run a personal Cloud with Traefik, Let's Encrypt and Zookeeper

#kubernetes #docker #devops #infrastructure

Kubernetes ingress with Traefik

As mentioned in my last blog post, I want to focus on a provider-neutral setup for my own cloud, using technology that is not bound to any cloud offering whenever possible.

While Google Cloud offers load-balanced HTTP ingress by default, it is apparently very expensive compared to running small nodes, and I have heard only good things about using Traefik for Kubernetes ingress.

For setting up Traefik I followed Manuel's excellent guide with minor modifications (you can find the final files at the end of the article).

HTTPS and Let's Encrypt

Traefik has built-in support for automatically obtaining and renewing HTTPS certificates with Let's Encrypt. As HTTPS is good practice and a requirement for HTTP/2 and PWAs anyway, I set it up using example configurations from the Traefik docs.

Because I was using just one node for Traefik, I chose the easy setup of a local acme.json file that stores the certificate only for as long as the node is running.
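
As a rough illustration of that single-node setup, the ACME section of a Traefik 1.7 config with file-based storage might look like the fragment below. This is a sketch, not necessarily the exact file used at this stage: the ConfigMap name and email are placeholders, and the entry point definitions are omitted. The line that matters is `storage`, which points at a local file, so the certificate disappears together with the container's filesystem.

```yaml
# Sketch only: ACME with local file storage (Traefik 1.7 TOML inside a ConfigMap).
kind: ConfigMap
apiVersion: v1
metadata:
  name: traefik-config-local   # hypothetical name
data:
  traefik.toml: |
    [acme]
      email = "your@email.com"   # placeholder
      storage = "acme.json"      # local file, lost when the pod's filesystem goes away
      entryPoint = "https"
      onHostRule = true
      [acme.httpChallenge]
        entryPoint = "http"
```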

GKE Preemptible nodes, your own chaos monkey

To save costs I chose to use "Preemptible VMs" as nodes to power my Kubernetes cluster on GKE. According to Google's docs: "Preemptible VMs are Google Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees." This means the nodes in my Kubernetes cluster go down at random and are never up for more than 24 hours. While this is obviously not a smart decision for a production setup, I have chosen to embrace it and consider the nodes going down my own "chaos monkey" that forces me to write resilient code.

A concrete example I ran into: the Let's Encrypt production API has a rate limit of five certificates per week for the same set of hostnames. Because my initial naive setup did not persist the certificate anywhere, it was lost whenever my Traefik node was terminated. Traefik regenerates the certificate without any issue on startup, but since a preemptible node lives for at most 24 hours, five startups easily fit into a single week. After five startups I hit the rate limit and was greeted by an insecure-connection warning because no new certificate could be issued.
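
One way to experiment without burning through the production quota, assuming you can live with untrusted test certificates for a while, is to point Traefik at the Let's Encrypt staging directory. The fragment below is a sketch (the ConfigMap name is hypothetical and all other settings are omitted); only the `caServer` line differs from a production setup.

```yaml
# Sketch only: use the Let's Encrypt staging endpoint while testing restarts.
kind: ConfigMap
apiVersion: v1
metadata:
  name: traefik-config-staging   # hypothetical name
data:
  traefik.toml: |
    [acme]
      # Staging directory: certificates are not browser-trusted,
      # but the rate limits are far more generous than in production.
      caServer = "https://acme-staging-v02.api.letsencrypt.org/directory"
```

This does not fix the underlying problem of losing the certificate on every restart; it only makes the failure cheaper while you work on it.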

Shared K/V store for Traefik with Zookeeper

Enter a shared Key/Value store for Traefik. Using one is required if you want to run Traefik in cluster mode anyway (and I like to think my setup is easily scalable). It also means I can store my generated certificate in the K/V store where it will no longer just disappear when Traefik restarts.

Since I have previous experience with Zookeeper and the setup was relatively painless I went with it.

All Kubernetes YAML files for the setup

Finally, the meat of the blog post: my complete setup as YAML files you can deploy directly into your GKE cluster:

Set up Zookeeper first

From this excellent resource: https://github.com/kow3ns/kubernetes-zookeeper/blob/master/manifests/README.md

apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
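# apps/v1beta1 was removed in Kubernetes 1.16; apps/v1 requires an explicit selector
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 1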
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10"
        resources:
          requests:
            memory: "200M"
            cpu: "0.3"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=1 \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          --conf_dir=/opt/zookeeper/conf \
          --client_port=2181 \
          --election_port=3888 \
          --server_port=2888 \
          --tick_time=2000 \
          --init_limit=10 \
          --sync_limit=5 \
          --heap=512M \
          --max_client_cnxns=60 \
          --snap_retain_count=3 \
          --purge_interval=12 \
          --max_session_timeout=40000 \
          --min_session_timeout=4000 \
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 5Gi

Permissions for Traefik

# create Traefik cluster role
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: traefik-ingress-controller
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - secrets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
---
# create Traefik service account
kind: ServiceAccount
apiVersion: v1
metadata:
  name: traefik-ingress-controller
  namespace: default
---
# bind role with service account
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: traefik-ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: traefik-ingress-controller
subjects:
- kind: ServiceAccount
  name: traefik-ingress-controller
  namespace: default

Traefik config

Note the Zookeeper configuration, which uses the service address of the "client service" (zk-cs), as well as the Let's Encrypt config.

# define Traefik configuration
kind: ConfigMap
apiVersion: v1
metadata:
  name: traefik-config
data:
  traefik.toml: |
    # traefik.toml
    defaultEntryPoints = ["http", "https"]

    [entryPoints]
      [entryPoints.http]
        address = ":80"
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
        address = ":443"
        [entryPoints.https.tls]

    [zookeeper]
      endpoint = "zk-cs.default.svc.cluster.local:2181"
      watch = true
      prefix = "traefik"

    [acme]
      email = "your@email.com"
      storage = "traefik/acme/account"
      onHostRule = true
      caServer = "https://acme-v02.api.letsencrypt.org/directory"
      acmeLogging = true
      entryPoint = "https"
      [acme.httpChallenge]
        entryPoint = "http"

    [[acme.domains]]
      main = "your.domain.com"

Deployment for Traefik

I run just one replica here to save costs in my dev setup, but I have also scaled it up to three to test whether it would stay up 100% of the time even with random nodes going down, and everything works fine :).

# declare Traefik deployment
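kind: Deployment
# extensions/v1beta1 Deployments were removed in Kubernetes 1.16; apps/v1 requires an explicit selector
apiVersion: apps/v1
metadata:
  name: traefik-ingress-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traefik-ingress-controller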
  template:
    metadata:
      labels:
        app: traefik-ingress-controller
    spec:
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 60
      volumes:
        - name: config
          configMap:
            name: traefik-config
      containers:
      - name: traefik
        image: "traefik:1.7.14"
        volumeMounts:
          - mountPath: "/etc/traefik/config"
            name: config
        args:
        - --configfile=/etc/traefik/config/traefik.toml
        - --kubernetes
        - --logLevel=INFO
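
For the three-replica test mentioned above, only a few fields of this Deployment need to change. The fragment below is a sketch of those fields, not a complete manifest. The anti-affinity block is an optional assumption, modeled on the Zookeeper StatefulSet, that asks the scheduler to spread replicas across nodes so a single preempted node does not take all of them down.

```yaml
# Fragment: fields to change in the traefik-ingress-controller Deployment.
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" rather than "required", so pods still schedule on a small cluster
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: traefik-ingress-controller
              topologyKey: "kubernetes.io/hostname"
```

Because the ACME data lives in Zookeeper, additional replicas can share the stored certificate instead of each requesting their own.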

Traefik service

# Declare Traefik ingress service
kind: Service
apiVersion: v1
metadata:
  name: traefik-ingress-controller
spec:
  selector:
    app: traefik-ingress-controller
  ports:
    - port: 80
      name: http
    - port: 443
      name: tls
  type: LoadBalancer

Final result

The final workloads with Traefik and Zookeeper

And the Kubernetes ingresses (ignore the app I used as a demo for this)
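
The ingress definitions themselves are not part of the files above. As a hedged sketch, an Ingress that this Traefik 1.7 setup would pick up could look roughly like the following. The `demo-app` name and port are hypothetical, the host reuses the placeholder domain from the Traefik config, and `extensions/v1beta1` matches the `extensions` API group granted in the ClusterRole (newer Kubernetes versions have removed that API version).

```yaml
# Sketch only: route your.domain.com to a hypothetical demo-app Service via Traefik.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: demo-app                            # hypothetical
  annotations:
    kubernetes.io/ingress.class: traefik    # let Traefik handle this Ingress
spec:
  rules:
  - host: your.domain.com                   # placeholder from the Traefik config
    http:
      paths:
      - path: /
        backend:
          serviceName: demo-app             # hypothetical Service
          servicePort: 80
```

With `onHostRule = true` in the ACME section, Traefik requests a certificate for the host named in the rule.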

About Me

I am a full-stack developer and digital product enthusiast. I am available for freelance work and always looking for the next exciting project :).

You can reach me online at https://heltweg.org.

Top comments (3)

Dinesh Rathee • Mar 4 '20

Let's Encrypt revoked around 3 million certs last night due to a bug that they found. Are you impacted by this? Check out:

DevTo
[+] dev.to/dineshrathee12/letsencrypt-...

GitHub
[+] github.com/dineshrathee12/Let-s-En...

LetsEncryptCommunity
[+] community.letsencrypt.org/t/letsen...

Simon Massey • Aug 29 '19

how many VMs are in your personal cloud and of what instance type? thanks!

rhazn • Aug 29 '19

Hi Simon :). I started out with three f1-micro VMs and recently upgraded that to five f1-micros. Also using preemptible nodes to save costs and have a neat way to automatically randomly destroy my infrastructure and make sure it recovers.