<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Barbara</title>
    <description>The latest articles on DEV Community by Barbara (@barbara).</description>
    <link>https://dev.to/barbara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F139641%2F57beaebb-75c9-418a-8488-9cc8702d50d4.jpeg</url>
      <title>DEV Community: Barbara</title>
      <link>https://dev.to/barbara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/barbara"/>
    <language>en</language>
    <item>
      <title>Data Visualisation Basics</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 06 Sep 2024 13:47:02 +0000</pubDate>
      <link>https://dev.to/barbara/data-visualisation-basics-2moa</link>
      <guid>https://dev.to/barbara/data-visualisation-basics-2moa</guid>
      <description>&lt;h1&gt;
  
  
  Why use data vis
&lt;/h1&gt;

&lt;p&gt;When you need to work with a new data source that contains a huge amount of data, data visualization can be essential for understanding the data better.&lt;br&gt;
The data analysis process typically follows five steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract - Obtain the data from a spreadsheet, SQL, the web, etc. &lt;/li&gt;
&lt;li&gt;Clean - Here we could use exploratory visuals. &lt;/li&gt;
&lt;li&gt;Explore - Here we use exploratory visuals. &lt;/li&gt;
&lt;li&gt;Analyze - Here we might use either exploratory or explanatory visuals. &lt;/li&gt;
&lt;li&gt;Share - Here is where explanatory visuals live. &lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Types of data
&lt;/h1&gt;

&lt;p&gt;To be able to choose an appropriate plot for a given measure, it is important to know what data you are dealing with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qualitative aka categorical types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Nominal qualitative data
&lt;/h3&gt;

&lt;p&gt;Labels with no order or rank associated with the items themselves.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Gender, marital status, menu items&lt;/p&gt;

&lt;h3&gt;
  
  
  Ordinal qualitative data
&lt;/h3&gt;

&lt;p&gt;Labels that have an order or ranking.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: letter grades, rating&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantitative aka numeric types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Discrete quantitative values
&lt;/h3&gt;

&lt;p&gt;Numbers that cannot be split into smaller units.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Pages in a Book, number of trees in a park&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous quantitative values
&lt;/h3&gt;

&lt;p&gt;Numbers that can be split into smaller units.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Height, Age, Income, Workhours&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary Statistics
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Numerical Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mean&lt;/strong&gt;: The average value.&lt;br&gt;
&lt;strong&gt;Median&lt;/strong&gt;: The middle value when the data is sorted.&lt;br&gt;
&lt;strong&gt;Mode&lt;/strong&gt;: The most frequently occurring value.&lt;br&gt;
&lt;strong&gt;Variance/Standard Deviation&lt;/strong&gt;: Measures of spread or dispersion.&lt;br&gt;
&lt;strong&gt;Range&lt;/strong&gt;: Difference between the maximum and minimum values.&lt;/p&gt;
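&lt;p&gt;As a quick sketch, these statistics can be computed with Python's standard &lt;code&gt;statistics&lt;/code&gt; module (the sample values here are made up):&lt;/p&gt;

```python
import statistics

# Hypothetical sample of daily work hours
values = [6, 7, 7, 8, 8, 8, 9, 10]

mean = statistics.mean(values)            # average value
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequent value
variance = statistics.pvariance(values)   # spread around the mean
stdev = statistics.pstdev(values)         # square root of the variance
value_range = max(values) - min(values)   # max minus min

print(mean, median, mode, value_range)  # 7.875 8.0 8 4
```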

&lt;h2&gt;
  
  
  Categorical Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frequency&lt;/strong&gt;: The count of occurrences of each category.&lt;br&gt;
&lt;strong&gt;Mode&lt;/strong&gt;: The most frequent category.&lt;/p&gt;

&lt;h1&gt;
  
  
  Visualizations
&lt;/h1&gt;

&lt;p&gt;Visualizations let you gain insights into a new data source very quickly and make connections between different data types easier to see.&lt;br&gt;
If you only use standard statistics to summarize your data, you get the min, max, mean, median and mode, but these can be misleading. Anscombe's Quartet shows this: all four datasets have nearly the same mean and deviation, but their distributions are completely different.&lt;/p&gt;

&lt;p&gt;In data visualization, we have two types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exploratory data visualization
We use this to get insights about the data. It does not need to be visually appealing.&lt;/li&gt;
&lt;li&gt;Explanatory data visualization
These visualizations need to be accurate, insightful and visually appealing, as they are presented to an audience.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Chart Junk, Data Ink Ratio and Design Integrity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chart Junk
&lt;/h3&gt;

&lt;p&gt;To read the information provided by a plot without distraction, it is important to avoid chart junk, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heavy grid lines&lt;/li&gt;
&lt;li&gt;Pictures in the visuals&lt;/li&gt;
&lt;li&gt;Shades &lt;/li&gt;
&lt;li&gt;3d components&lt;/li&gt;
&lt;li&gt;Ornaments&lt;/li&gt;
&lt;li&gt;Superfluous texts
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jzv9thonqt1w0c0s648.png" alt="Image description" width="737" height="391"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Ink Ratio
&lt;/h3&gt;

&lt;p&gt;The less chart junk a visual contains, the higher its data-ink ratio. This simply means: the more of the "ink" in the visual is used to convey the message of the data, the better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Integrity
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Lie Factor&lt;/strong&gt; is calculated as:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}}&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;The size of an effect here means the relative change, i.e. the difference (delta) divided by the initial value. So the lie factor is the relative change shown in the graphic divided by the actual relative change in the data. Ideally it should be 1. If it is not, there is a mismatch between the way the data is presented and the actual change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dzp2loid56sxw970la3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dzp2loid56sxw970la3.png" alt="Image description" width="455" height="651"&gt;&lt;/a&gt;&lt;br&gt;
In the example above, taken from the wiki, the lie factor is 3 when comparing the pixel sizes of the doctor symbols, which represent the number of doctors in California.&lt;/p&gt;
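&lt;p&gt;As a minimal sketch, the lie factor can be computed from two relative changes (the numbers below are made up, not the ones from the doctor graphic):&lt;/p&gt;

```python
def lie_factor(shown_change, actual_change):
    """Size of the effect shown in the graphic divided by the
    size of the effect in the data (both as relative changes)."""
    return shown_change / actual_change

# Hypothetical example: the data grows by 50%,
# but the bar in the graphic grows by 150%.
print(lie_factor(1.5, 0.5))  # 3.0 -> the graphic exaggerates threefold
print(lie_factor(0.5, 0.5))  # 1.0 -> faithful representation
```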

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4ql12ba4142v6pnb34f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4ql12ba4142v6pnb34f.png" alt="Image description" width="726" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tidy data
&lt;/h3&gt;

&lt;p&gt;Make sure your data is cleaned properly and ready to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each variable is a column&lt;/li&gt;
&lt;li&gt;each observation is a row&lt;/li&gt;
&lt;li&gt;each type of observational unit is a table&lt;/li&gt;
&lt;/ul&gt;
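&lt;p&gt;A small pandas sketch of tidying (the table and column names are made up; &lt;code&gt;melt&lt;/code&gt; turns one column per year into one row per observation):&lt;/p&gt;

```python
import pandas as pd

# Untidy: one column per year, so "year" is hidden in the column names
wide = pd.DataFrame({
    "city": ["Vienna", "Graz"],
    "2022": [100, 80],
    "2023": [110, 85],
})

# Tidy: each variable (city, year, count) is a column,
# each observation is a row
tidy = wide.melt(id_vars="city", var_name="year", value_name="count")
print(tidy)
```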

&lt;h1&gt;
  
  
  Univariate Exploration of Data
&lt;/h1&gt;

&lt;p&gt;This refers to the analysis of a single variable (or feature) in a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bar Chart
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;always start the value axis at 0 to present values in a truly comparable way.&lt;/li&gt;
&lt;li&gt;sort nominal data, e.g. by frequency&lt;/li&gt;
&lt;li&gt;don't re-sort ordinal data: keeping the inherent order of the categories is more important than putting the most frequent one first&lt;/li&gt;
&lt;li&gt;if you have a lot of categories, use a horizontal bar chart with the categories on the y-axis to keep it readable.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5h7rsvt3iwlcm0v64up.png" alt="Image description" width="638" height="416"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonrwjex4yh6pfb52updb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonrwjex4yh6pfb52updb.png" alt="Image description" width="638" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3j6j404jgkpdm23a1my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3j6j404jgkpdm23a1my.png" alt="Image description" width="715" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fuorc1r4ejizypq0kgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fuorc1r4ejizypq0kgc.png" alt="Image description" width="705" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Histogram
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the quantitative version of a bar chart, used to plot numeric values. &lt;/li&gt;
&lt;li&gt;values are grouped into continuous bins, and one bar is plotted for each bin
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhfjs7yph34f3i6vn70u.png" alt="Image description" width="705" height="395"&gt;
&lt;/li&gt;
&lt;/ul&gt;
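&lt;p&gt;What a histogram does under the hood can be sketched in a few lines of plain Python (the data and bin edges are made up):&lt;/p&gt;

```python
def histogram_counts(values, bins, low, high):
    """Group values into equal-width bins over [low, high] and count them."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # min() clamps the top edge so `high` itself lands in the last bin
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

data = [1.2, 1.9, 2.5, 3.1, 3.4, 3.7, 4.8]
print(histogram_counts(data, bins=4, low=1.0, high=5.0))  # [2, 1, 3, 1]
```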

&lt;h2&gt;
  
  
  KDE - Kernel Density Estimation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;smooths the data with a kernel, often a Gaussian (normal) distribution, to estimate the density at each point.&lt;/li&gt;
&lt;li&gt;KDE plots can reveal trends and the shape of the distribution more clearly, especially for data that is not uniformly distributed.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gnehnv8tclwg4qi162a.png" alt="Image description" width="435" height="261"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pie Chart and Donut Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;data needs to be in relative frequencies&lt;/li&gt;
&lt;li&gt;pie charts work best with at most three slices. With more wedges the chart becomes unreadable and the amounts are hard to compare; in that case prefer a bar chart.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18guy4lpqyqximu90w69.png" alt="Image description" width="484" height="899"&gt;
&lt;/li&gt;
&lt;/ul&gt;
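&lt;p&gt;Turning raw counts into the relative frequencies a pie chart needs can be sketched like this (the survey answers are made up):&lt;/p&gt;

```python
from collections import Counter

# Hypothetical survey answers
answers = ["yes", "yes", "no", "yes", "undecided", "no", "yes", "yes"]

counts = Counter(answers)
total = sum(counts.values())
# Each slice of the pie is the category's share of the whole
shares = {category: count / total for category, count in counts.items()}

print(shares)  # {'yes': 0.625, 'no': 0.25, 'undecided': 0.125}
```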

&lt;h1&gt;
  
  
  Bivariate Exploration of Data
&lt;/h1&gt;

&lt;p&gt;Analyzes the relationship between two variables in a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clustered Bar Charts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;displays the relationship between two categorical values. The bars are organized in clusters based on the level of the first variable.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5605tuz2v6bdsoqwui48.png" alt="Image description" width="710" height="443"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scatterplots
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;each data point is plotted individually as a point, its x-position corresponding to one feature value and its y-position corresponding to the second.&lt;/li&gt;
&lt;li&gt;if the plot suffers from overplotting (too many datapoints overlap): you can use transparency and jitter (every point is moved slightly from its true value)
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghhmhosc8r2mvskl2gwh.png" alt="Image description" width="686" height="391"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Heatmaps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the 2D version of a histogram&lt;/li&gt;
&lt;li&gt;data points are placed with their x-position corresponding to one feature value and their y-position corresponding to the second.&lt;/li&gt;
&lt;li&gt;the plotting area is divided into a grid; the points in each cell are counted, and the counts are indicated by color
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffei2gsefu1bgv1hz2m5l.png" alt="Image description" width="672" height="380"&gt;
&lt;/li&gt;
&lt;/ul&gt;
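&lt;p&gt;The grid counting behind a heatmap can be sketched as a 2D variant of the histogram idea (the points and grid are made up):&lt;/p&gt;

```python
def heatmap_counts(points, cells, low, high):
    """Divide the square [low, high] x [low, high] into a grid of
    cells x cells and count the points that fall into each cell."""
    width = (high - low) / cells
    grid = [[0] * cells for _ in range(cells)]
    for x, y in points:
        col = min(int((x - low) / width), cells - 1)
        row = min(int((y - low) / width), cells - 1)
        grid[row][col] += 1
    return grid

points = [(0.5, 0.5), (0.6, 0.4), (1.5, 1.5), (0.2, 1.8)]
print(heatmap_counts(points, cells=2, low=0.0, high=2.0))  # [[2, 0], [1, 1]]
```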

&lt;h2&gt;
  
  
  Violin plots
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;show the relationship between quantitative (numerical) and qualitative (categorical) variables on a lower level of abstraction.&lt;/li&gt;
&lt;li&gt;the distribution is plotted like a kernel density estimate, so we get a clear picture of its shape.&lt;/li&gt;
&lt;li&gt;to display the key statistics at the same time, you can embed a box plot in a violin plot.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4fwm0r6i7l8yuuarttv.png" alt="Image description" width="682" height="415"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Box plots
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;it also plots the relationship between quantitative (numerical) and qualitative (categorical) variables on a lower level of abstraction.&lt;/li&gt;
&lt;li&gt;compared to the violin plot, the box plot leans more on the summarization of the data, primarily just reporting a set of descriptive statistics for the numeric values on each categorical level.&lt;/li&gt;
&lt;li&gt;it visualizes the five-number summary of the data: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key elements of a boxplot:&lt;br&gt;
&lt;strong&gt;Box&lt;/strong&gt;: The central part of the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). This contains the middle 50% of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Median Line&lt;/strong&gt;: Inside the box, a line represents the median (Q2, 50th percentile) of the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whiskers&lt;/strong&gt;: Lines extending from the box, known as "whiskers," show the range of the data that lies within 1.5 times the IQR from Q1 and Q3. They typically extend to the smallest and largest values within this range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outliers&lt;/strong&gt;: Any data points that fall outside 1.5 times the IQR are considered outliers and are often represented by individual dots or marks beyond the whiskers.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjqcb0zp1l87ik6b9sa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjqcb0zp1l87ik6b9sa7.png" alt="Image description" width="682" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Combined Violin and Box Plot
&lt;/h2&gt;

&lt;p&gt;The violin plot shows the density across different categories, and the boxplot provides the summary statistics&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a8uqgnlxu9o2ck0n8xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a8uqgnlxu9o2ck0n8xb.png" alt="Image description" width="707" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Faceting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the data is divided into disjoint subsets, most often by the levels of a categorical variable. For each subset, the same plot type is rendered on the other variables, e.g. several histograms side by side, one for each categorical value.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvejflge0gd205olgwez.png" alt="Image description" width="709" height="327"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Line plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;used to plot the trend of one numeric variable against a second variable.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplpmm7qzelacxaalm7tn.png" alt="Image description" width="710" height="345"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quantile-Quantile (Q-Q) plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;is a type of plot used to compare the distribution of a dataset with a theoretical distribution (like a normal distribution) or to compare two datasets to check if they follow the same distribution.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mirars5qgdp44myrc82.png" alt="Image description" width="404" height="397"&gt;
&lt;/li&gt;
&lt;/ul&gt;
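&lt;p&gt;The point pairs behind a Q-Q plot against a normal distribution can be sketched with &lt;code&gt;statistics.NormalDist&lt;/code&gt; (the sample values are made up):&lt;/p&gt;

```python
import statistics

def qq_points(sample):
    """Pair each sorted sample value with the standard-normal quantile
    at the same rank; points near a straight line suggest normality."""
    n = len(sample)
    normal = statistics.NormalDist()
    return [
        (normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(sorted(sample))
    ]

for theoretical, observed in qq_points([4.8, 5.1, 4.9, 5.3, 5.0]):
    print(round(theoretical, 2), observed)
```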

&lt;h2&gt;
  
  
  Swarm plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Similar to a scatterplot, each data point is plotted with its position according to its value on the two variables being plotted. Instead of randomly jittering points as in a normal scatterplot, points are placed as close to their actual value as possible without allowing any overlap. 
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiy2i5orr057rsaraa73.png" alt="Image description" width="636" height="414"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Spider plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;compares multiple variables across different categories on a radial grid. Also known as a radar chart.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rr4532l84yecehatkxn.png" alt="Image description" width="616" height="552"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Useful links
&lt;/h1&gt;

&lt;h2&gt;
  
  
  My sample notebook
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/BarbaraJoebstl/data_vis/" rel="noopener noreferrer"&gt;Sample Code&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Libs used for the sample plots:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt;: a versatile library for visualizations, but it can take some code effort to put together common visualizations.&lt;/li&gt;
&lt;li&gt; &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt;: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt;: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data (&lt;a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" rel="noopener noreferrer"&gt;https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further reading:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Anscombe's Quartet: same stats for the data, but different distributions: &lt;a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Anscombe%27s_quartet&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chartjunk: &lt;a href="https://en.wikipedia.org/wiki/Chartjunk" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Chartjunk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Data Ink Ratio: &lt;a href="https://infovis-wiki.net/wiki/Data-Ink_Ratio" rel="noopener noreferrer"&gt;https://infovis-wiki.net/wiki/Data-Ink_Ratio&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lie factor: &lt;a href="https://infovis-wiki.net/wiki/Lie_Factor" rel="noopener noreferrer"&gt;https://infovis-wiki.net/wiki/Lie_Factor&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tidy data: &lt;a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html" rel="noopener noreferrer"&gt;https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Colorblind-friendly visualizations: &lt;a href="https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together" rel="noopener noreferrer"&gt;https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datavis</category>
      <category>python</category>
      <category>scrollwithme</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>My K8s Cheatsheet</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 12 Jan 2024 12:48:18 +0000</pubDate>
      <link>https://dev.to/barbara/my-k8s-cheatsheet-2d8p</link>
      <guid>https://dev.to/barbara/my-k8s-cheatsheet-2d8p</guid>
      <description>&lt;p&gt;In this cheatsheet I summed up the most used commands. &lt;br&gt;
In doubt you can always consult&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubectl --help&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/home/"&gt;The K8s documentation&lt;/a&gt;
or play around out on &lt;a href="https://killercoda.com/killer-shell-ckad/"&gt;killercoda&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  General Helper
&lt;/h2&gt;

&lt;p&gt;Add aliases and functions to your .bashrc to save time and avoid repetitive typing:&lt;/p&gt;
&lt;h3&gt;
  
  
  Aliases and Functions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Alias for kubectl&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;alias k='kubectl'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do a dry-run and output it as yaml&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;export do="-o yaml --dry-run=client"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k create deployment test --image="nginx:alpine" $do &amp;gt; deployment.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do it immediately&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;export now="--force --grace-period=0"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k delete deployment test $now&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set the namespace&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kn(){
kubectl config set-context --current --namespace="$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;call it like: &lt;code&gt;kn crazynamespace&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run a command from a temp container&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tmp(){
 kubectl run tmp --image="nginx:alpine" -i --rm --restart=Never -- sh -c "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;call it like: &lt;code&gt;tmp "curl http://servicename.namespace:port"&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubectl commands
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Get a configuration as .yaml
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k get deployment -o yaml &amp;gt; depl.yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k get pod -o yaml &amp;gt; pod.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k create deployment depl1 --image=nginx $do &amp;gt; depl.yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k run pod1 --image=nginx $do &amp;gt; pod.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Create a pod that has a command:
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k run pod1 --image=imagetouse $do --command -- sh -c "commandlinecommand" &amp;gt; pod.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Search pods in a namespace for a label
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k get pod -o yaml | grep searchitem&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Create a service for a pod
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k expose pod podname --name=servicename --port=3333 --target-port=3333&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Serviceaccount
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create serviceaccount your-service-account&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  add to pod
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Pod
metadata:
    name: yourpod
    namespace: yourns
spec:
    serviceAccountName: your-service-account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Secrets
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k get secrets&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k create secret generic mysecret --from-literal=key=value&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k create secret generic mysecret --from-file=path/to/file&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k get secret mysecret -o jsonpath='{.data.yourKey}' | base64 -d &amp;gt; supersecret.txt&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Configmaps
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create configmap myconfigmap --from-literal=key=value $do &amp;gt; configmap.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k create configmap myconfigmap --from-file=path/to/file $do &amp;gt; configmap.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Clusterrole
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create clusterrole myclusterrole --verb=get,list,create,delete --resource=tralala&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Clusterrolebinding
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create clusterrolebinding my-cluster-role-binding --clusterrole=my-cluster-role --serviceaccount=default:my-service-account&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k -n kubernetes-dashboard create sa admin-user&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k create clusterrolebinding admin-user --clusterrole cluster-admin --serviceaccount kubernetes-dashboard:admin-user&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k -n kubernetes-dashboard create token admin-user&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Patch
&lt;/h3&gt;

&lt;p&gt;To add a selector to an existing service:&lt;br&gt;
&lt;code&gt;k patch service old-app -p '{"spec":{"selector":{"app": "new-app"}}}'&lt;/code&gt;&lt;br&gt;
You can patch anything; you just need to know the nesting level of the field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Label and Annotate
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k label pod -l type=runner another=label&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k annotate pod -l type=runner type="i am a great type"&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Expose
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k expose deployment example --port=8765 --target-port=9376 \&lt;br&gt;
        --name=example-service --type=LoadBalancer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k expose pod podname --name=servicename --port=3333 --target-port=3333 --type=NodePort&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Curl with temp pod to test
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k run tmp --restart=Never --rm --image=nginx:alpine -i -- curl http://servicename.namespace:port&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ROLLOUTS
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Rollouts and rollbacks
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k get deploy&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k rollout history deploy deploymentname&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k rollout undo deploy deploymentname&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Rolling update
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k scale deploy/dev-web --replicas=4&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k edit deployment yourdeployment&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Canary rollout
&lt;/h4&gt;

&lt;p&gt;Run both versions behind the same service, with the new version on a small share of the replicas, e.g.:&lt;br&gt;
depl1 (new version): replicas: 2&lt;br&gt;
depl2 (current version): replicas: 8&lt;/p&gt;

&lt;h4&gt;
  
  
  Green Blue deployment
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;deploy both versions&lt;/li&gt;
&lt;li&gt;update the service selector to switch to the new version&lt;/li&gt;
&lt;li&gt;scale down the old deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Scale a deployment
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k scale deployment/my-nginx --replicas=1&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k autoscale deployment/my-nginx --min=1 --max=3&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k get pods -l app=nginx&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;There is no &lt;code&gt;kubectl create&lt;/code&gt; generator for PersistentVolumes, PersistentVolumeClaims or StorageClasses, so copy a manifest template from the docs and apply it:&lt;br&gt;
&lt;code&gt;k apply -f pv.yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k apply -f pvc.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Get pv and pvc at the same time to see if everything is working:&lt;br&gt;
&lt;code&gt;k get pv,pvc&lt;/code&gt;&lt;br&gt;
If the status is Bound (here with storageClass manual), everything is working.&lt;br&gt;
If a StorageClass is needed, write its manifest by hand as well and apply it with &lt;code&gt;k apply -f sc.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshooting
&lt;/h3&gt;

&lt;h4&gt;
  
  
  try to call outside:
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k exec frontend-789cbdc677-c9v8h -- wget -O- www.google.com&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  check if env variables exist in a pod
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k exec pod1 -- env | grep "&amp;lt;key&amp;gt;=&amp;lt;value&amp;gt;"&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  check if volume is mounted
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k exec pod1 -- cat /path/to/mount&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PODMAN
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;podman build -t super:v1 .&lt;/code&gt;&lt;br&gt;
&lt;code&gt;podman run --name my-container super:v1&lt;/code&gt;&lt;br&gt;
&lt;code&gt;podman save -o /path/to/output/myimage.tar super:v1&lt;/code&gt;&lt;br&gt;
(Podman uses the OCI image format by default; Docker does not)&lt;/p&gt;

&lt;h2&gt;
  
  
  HELM
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;helm repo&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm repo list&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm repo update&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm search repo whatever&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;helm -n yourns upgrade releasename chartname&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm -n yourns install releasename chartname --set replicaCount=2&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;k --help&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/home/"&gt;https://kubernetes.io/docs/home/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://killercoda.com/killer-shell-ckad/"&gt;https://killercoda.com/killer-shell-ckad/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>cheatsheet</category>
      <category>cmd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Troubleshooting</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Tue, 21 Nov 2023 08:00:00 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-troubleshooting-575p</link>
      <guid>https://dev.to/barbara/kubernetes-troubleshooting-575p</guid>
      <description>&lt;p&gt;With Kubernetes large and diverse workloads can be handled.&lt;br&gt;
To keep track of all these processes, monitoring is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor the application you need to collect metrics, like CPU, memory, disk usage and bandwidth on your nodes.&lt;/p&gt;

&lt;p&gt;Because Kubernetes is a distributed system, it needs to be monitored and traced cluster-wide. &lt;/p&gt;

&lt;p&gt;You can use external tools like &lt;strong&gt;Prometheus&lt;/strong&gt; and visualize the metrics with &lt;strong&gt;Grafana&lt;/strong&gt;. But to get started I recommend using the &lt;strong&gt;Kubernetes dashboard&lt;/strong&gt;, as it is very easy to set up and gives you a default user interface with the most important metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;p&gt;If you have aggregated logs, you can visualize issues and search the logs for issues. &lt;/p&gt;

&lt;p&gt;In Kubernetes the kubelet writes container logs to local files. With the command &lt;code&gt;kubectl logs&lt;/code&gt; you can see these logs.&lt;/p&gt;

&lt;p&gt;If you want to perform cluster-wide logging, you can use &lt;strong&gt;Fluentd&lt;/strong&gt; to aggregate logs.&lt;br&gt;
Fluentd agents run on each node via a DaemonSet and feed the logs to an Elasticsearch instance prior to visualization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Errors in the container
&lt;/h3&gt;

&lt;p&gt;If you are not sure where to start, run&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod your-pod&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will report &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the overall status of the pod: running, pending or an error state&lt;/li&gt;
&lt;li&gt;the container configuration&lt;/li&gt;
&lt;li&gt;the container events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the pod is already running you can first look at the standard output of the container. One common issue is that there are not enough resources allocated.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs your-pod -c your-container&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can look for error messages in the logs.&lt;/p&gt;
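&lt;p&gt;If the issue turns out to be under-allocated resources, explicit requests and limits on the container can help (a sketch with placeholder values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: your-container
    image: your-image
    resources:
      requests:
        cpu: "250m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "128Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;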

&lt;p&gt;If there are errors inside a container you can exec into its shell to see what is going on.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it your-pod -- /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking issues
&lt;/h3&gt;

&lt;p&gt;Networking is often the next place where issues arise.&lt;br&gt;
So you can go ahead and check the DNS, firewalls and general connectivity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security issues
&lt;/h3&gt;

&lt;p&gt;You might want to check your RBAC.&lt;br&gt;
SELinux and AppArmor are also common issues, especially with network-centric applications.&lt;/p&gt;

&lt;p&gt;If you don't know where to start, you can disable security for testing to narrow down the source of the issue. But be sure to re-enable security afterwards.&lt;/p&gt;

&lt;p&gt;Another reason - not only for security issues - could be an update. You can roll back to find out when the issue was introduced.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/"&gt;Kubernetes dashboard&lt;/a&gt;&lt;br&gt;
&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt;&lt;br&gt;
&lt;a href=""&gt;Fluentd&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/"&gt;Troubleshoot a cluster&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/"&gt;Troubleshoot applications&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/"&gt;Debug Pods&lt;/a&gt;&lt;/p&gt;

</description>
      <category>troubleshooting</category>
      <category>monitoring</category>
      <category>logging</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Expose Applications from a K8s cluster</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Mon, 20 Nov 2023 18:30:00 +0000</pubDate>
      <link>https://dev.to/barbara/expose-applications-from-a-k8s-cluster-2i7</link>
      <guid>https://dev.to/barbara/expose-applications-from-a-k8s-cluster-2i7</guid>
      <description>&lt;p&gt;To expose applications from our Kubernetes cluster we need different service types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ClusterIP
&lt;/h3&gt;

&lt;p&gt;The ClusterIP service type is the default and only provides access internally - within the cluster. &lt;br&gt;
If you need to expose a service to the external world, you might consider other service types such as NodePort or LoadBalancer.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl proxy&lt;/code&gt; command creates a local service to access a ClusterIP. This can be useful for troubleshooting or development work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: internal-cluster-ip-service
spec:
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 80 #exposes this port internally
      targetPort: 8080 # directs traffic to pods on that port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  NodePort
&lt;/h3&gt;

&lt;p&gt;The NodePort type is great for debugging, or when a static IP address is necessary, such as opening a particular address through a firewall. The NodePort range is defined in the cluster configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Service
metadata:
  name: your-nodeport-service
spec:
  type: NodePort
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
      nodePort: 30080 # the service is reachable on every node's IP at port 30080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a service via kubectl:&lt;br&gt;
&lt;code&gt;kubectl expose deployment/nginx --port=80 --type=NodePort&lt;/code&gt;&lt;br&gt;
This command creates a NodePort service for the nginx deployment.&lt;br&gt;
&lt;code&gt;kubectl get svc&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl get svc nginx -o yaml&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  LoadBalancer
&lt;/h3&gt;

&lt;p&gt;LoadBalancer is a type of service that automatically provides external access to services within a cluster by distributing incoming network traffic across multiple nodes. &lt;/p&gt;

&lt;p&gt;Using a LoadBalancer service is a convenient way to expose services externally, especially in production environments, where load balancing and high availability are crucial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: your-loadbalancer-service
spec:
  type: LoadBalancer
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ExternalName
&lt;/h3&gt;

&lt;p&gt;With this service you can map a Kubernetes service to a DNS Name. Use of the service returns a CNAME record.&lt;br&gt;
Working with the ExternalName service is handy when using a resource external to the cluster, perhaps prior to full integration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: geiler-service
spec:
  type: ExternalName
  externalName: geil.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingress
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ingress Resource
&lt;/h3&gt;

&lt;p&gt;An ingress resource is an API object containing a list of rules matched against all incoming requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: your-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: your-app.example.com  # Replace with your desired domain or IP
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: your-app-service
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl apply -f ingress.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingress Controller
&lt;/h3&gt;

&lt;p&gt;An ingress controller manages all the ingress rules to route traffic to existing services.&lt;br&gt;
This is important if the number of services gets high.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Mesh
&lt;/h3&gt;

&lt;p&gt;If you need service discovery, rate limiting, traffic management and advanced metrics you can implement a service mesh.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Kubernetes Ingress&lt;/a&gt;&lt;br&gt;
&lt;a href="https://avinetworks.com/glossary/kubernetes-service-mesh/"&gt;What is a service mesh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nodeport</category>
      <category>loadbalancer</category>
      <category>ingresscontroller</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Security</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Sun, 19 Nov 2023 19:15:05 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-security-3o0j</link>
      <guid>https://dev.to/barbara/kubernetes-security-3o0j</guid>
      <description>&lt;p&gt;In this post you are going to learn about the basics of the Kubernetes security. You will see how the "admission control" of the kube-apiserver works, how to authorize with RBAC and how to set network policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing the Kubernetes API
&lt;/h2&gt;

&lt;p&gt;All requests that reach the API are encrypted using TLS, therefore you need to configure SSL certificates or use &lt;code&gt;kubeadm&lt;/code&gt;. Each request then passes through three stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Authentication&lt;/li&gt;
&lt;li&gt;Authorization&lt;/li&gt;
&lt;li&gt;Admission Control&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Authentication
&lt;/h3&gt;

&lt;p&gt;This is done with certificates, tokens or a basic authentication (username and password).&lt;/p&gt;

&lt;p&gt;Users are not created by the API and should be managed by the operating system or an external server.&lt;br&gt;
System accounts (aka service accounts or service principals) are used by processes to access the API.&lt;/p&gt;
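&lt;p&gt;A service account is itself an API object; as a sketch, a minimal ServiceAccount and a Pod that uses it (all names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: yoursa
---
apiVersion: v1
kind: Pod
metadata:
  name: yourpod
spec:
  serviceAccountName: yoursa
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;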

&lt;p&gt;It can also be done with Webhooks, to verify bearer tokens or a connection with an external OpenId provider.&lt;/p&gt;

&lt;p&gt;You define the type of authentication in the &lt;code&gt;kube-apiserver&lt;/code&gt; startup options and select the authenticator module:&lt;br&gt;
&lt;code&gt;--basic-auth-file&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--oidc-issuer-url&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--token-auth-file&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--authorization-webhook-config-file&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If one or more Authenticator Modules are used, each is tried until successful, and the order is not guaranteed. &lt;br&gt;
Anonymous access can also be enabled, otherwise you will get a 401 response. &lt;/p&gt;
&lt;h3&gt;
  
  
  Authorization
&lt;/h3&gt;

&lt;p&gt;There are three main modules for Authorization:&lt;br&gt;
Node: is needed for the kubelet to communicate with the kube-apiserver&lt;br&gt;
RBAC - Role-Based Access Control: all non-kubelet traffic is checked by RBAC, if set&lt;br&gt;
Webhook: delegates the authorization decision to an external HTTP service&lt;/p&gt;

&lt;p&gt;You can configure them in the kube-apiserver startup options&lt;br&gt;
&lt;code&gt;--authorization-mode=Node,RBAC&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The attributes of the request are checked against the policies (user, group, namespace, HTTP verb).&lt;br&gt;
To see the authorization information of a cluster run&lt;br&gt;
&lt;code&gt;kubectl config get-contexts&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  RBAC - Role Based Access Control
&lt;/h4&gt;

&lt;p&gt;All resources are modelled API objects in Kubernetes.&lt;/p&gt;
&lt;h5&gt;
  
  
  API Groups
&lt;/h5&gt;

&lt;p&gt;These resources belong to API groups, like core and apps. They allow HTTP verbs like POST, GET, PUT, DELETE.&lt;br&gt;
RBAC settings are additive, with no permission allowed unless defined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules&lt;/strong&gt; - Rules can act upon an API group.&lt;br&gt;
&lt;strong&gt;Roles&lt;/strong&gt; - One or more rules scoped to a single namespace.&lt;br&gt;
&lt;strong&gt;ClusterRoles&lt;/strong&gt; - Scoped for the entire cluster.&lt;/p&gt;
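&lt;p&gt;As a sketch, a namespaced Role that allows reading Pods, and a RoleBinding granting it to a user (all names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""] # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;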
&lt;h3&gt;
  
  
  Admission Control
&lt;/h3&gt;

&lt;p&gt;Admission controllers intercept and modify requests.&lt;br&gt;
They can modify the content or validate it, and potentially deny the request.&lt;br&gt;
&lt;code&gt;--enable-admission-plugins=NamespaceLifecycle,LimitRanger&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--disable-admission-plugins=PodNodeSelector&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Security Contexts
&lt;/h2&gt;

&lt;p&gt;This is a Kubernetes object that defines privileges and access control settings for a Pod or a container inside a Pod. Below you can find the most used security context options.&lt;/p&gt;
&lt;h3&gt;
  
  
  RunAsUser
&lt;/h3&gt;

&lt;p&gt;Specifies the user or group ID under which the process should run inside the container. This helps to isolate processes and restrict their access.&lt;/p&gt;
&lt;h3&gt;
  
  
  Privileged
&lt;/h3&gt;

&lt;p&gt;If set to true, the container gains access to all Linux capabilities, effectively turning off all isolation between the host and the container. Using privileged mode should be done cautiously, as it can introduce security risks.&lt;/p&gt;
&lt;h3&gt;
  
  
  ReadOnlyRootFilesystem
&lt;/h3&gt;

&lt;p&gt;When set to true, the container's root file system is mounted as read-only. This provides an additional layer of security by preventing processes within the container from writing to the root file system.&lt;/p&gt;
&lt;h3&gt;
  
  
  Capabilities:
&lt;/h3&gt;

&lt;p&gt;Allows you to add or remove specific Linux capabilities for processes within the container. This provides fine-grained control over what the processes are allowed to do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: yourpod
spec:
  containers:
  - name: yourcontainer
    image: yourimage
    securityContext:
      runAsUser: 1000 # user id; the default is 0, which is the root user
      capabilities:
        add: ["NET_ADMIN"]
      readOnlyRootFilesystem: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the security context is set wrong, you will see a warning in the status of your pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  PodSecurity Admission Controllers
&lt;/h3&gt;

&lt;p&gt;PodSecurity admission controllers are part of the built-in set of admission controllers in Kubernetes.&lt;br&gt;
You can define policies on different levels and customize them as needed.&lt;br&gt;
They are part of the Admission Control Framework.&lt;br&gt;
They are designed to be compatible with a variety of container runtimes.&lt;br&gt;
You can set it in the cluster configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    enable-admission-plugins: "PodSecurity,PodNodeSelector"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and use a policy like the following. Note that the PodSecurityPolicy API shown here was deprecated in v1.21 and removed in v1.25; on current clusters, enforce the Pod Security Standards via namespace labels instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrictive
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Network Security Policies
&lt;/h2&gt;

&lt;p&gt;By default, all pods can reach each other: all ingress and egress traffic is allowed. This has been a high-level networking requirement in Kubernetes. But ingress and egress traffic can be controlled by a NetworkPolicy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network Policy Sample
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress-egress-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - ipBlock:
            cidr: 172.17.0.0/16
            except:
              - 172.17.1.0/24
        - namespaceSelector:
            matchLabels:
              project: yourproject
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 6379
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 5978

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Default Network Policy
&lt;/h4&gt;

&lt;p&gt;The empty braces in the example below select all Pods in the namespace; because the policy allows no ingress, any Pod not granted traffic by another NetworkPolicy will have all ingress traffic denied.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/security/controlling-access/"&gt;Controlling Access to Kubernetes API&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.internetsociety.org/deploy360/tls/basics/"&gt;What is TLS&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/"&gt;Configure Service Accounts&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/"&gt;Dynamic Admission Control&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/ahmetb/kubernetes-network-policy-recipes"&gt;Network Policy Recipes&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rbac</category>
      <category>admissioncontrol</category>
      <category>networkpolicies</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Volumes</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Sun, 19 Nov 2023 17:18:54 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-volumes-33e2</link>
      <guid>https://dev.to/barbara/kubernetes-volumes-33e2</guid>
      <description>&lt;h2&gt;
  
  
  Volumes
&lt;/h2&gt;

&lt;p&gt;Volumes are needed to store data within a container or share data among other containers.&lt;br&gt;
All volumes requested by a Pod must be mounted &lt;em&gt;before&lt;/em&gt; the containers within the Pod are started. This applies also to secrets and configmaps.&lt;/p&gt;
&lt;h3&gt;
  
  
  Shared Volume
&lt;/h3&gt;

&lt;p&gt;Below you can find a sample of how to create a shared volume.&lt;br&gt;
But be aware that one container can overwrite the data written by the other container.&lt;br&gt;
You can use locking or versioning to avoid this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   containers:
   - name: firstcontainer
     image: busybox
     volumeMounts:
     - mountPath: /firstdir
       name: sharevol
   - name: secondcontainer
     image: busybox
     volumeMounts:
     - mountPath: /seconddir
       name: sharevol
   volumes:
   - name: sharevol
     emptyDir: {}  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;$ kubectl exec -ti example -c secondcontainer -- touch /seconddir/bla&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ kubectl exec -ti example -c firstcontainer -- ls -l /firstdir&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Persistent Volume - PV
&lt;/h3&gt;

&lt;p&gt;This is a storage abstraction used to keep data even if the Pod is killed. In the Pod you define a volume of that type.&lt;br&gt;
&lt;code&gt;kubectl get pv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Sample of a PV with hostPath Type&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: PersistentVolume
apiVersion: v1
metadata:
name: 10Gpv01
labels:
type: local
spec:
capacity:
        storage: 10Gi
    accessModes:
        - ReadWriteOnce
    hostPath:
        path: "/somepath/data01"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Persistent Volume Claim - PVC
&lt;/h3&gt;

&lt;p&gt;With a PVC, volumes can be accessed by multiple Pods and allow state persistence. &lt;br&gt;
The cluster attaches the Persistent Volume. &lt;/p&gt;

&lt;p&gt;There is no concurrency checking, so data corruption is probable unless locking takes place outside. &lt;/p&gt;

&lt;p&gt;There are 3 access modes for the PVC:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RWO - ReadWriteOnce by a single node&lt;/li&gt;
&lt;li&gt;ROX - ReadOnlyMany by multiple nodes&lt;/li&gt;
&lt;li&gt;RWX - ReadWriteMany by many nodes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;kubectl get pvc&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Phases to persistent storage
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Provisioning: Can be done in advance, e.g. resources from a cloud provider&lt;/li&gt;
&lt;li&gt;Binding: Once a watch loop on the master notices a PVC, it requests the access.&lt;/li&gt;
&lt;li&gt;Using: The volume is mounted to the Pod and can now be used.&lt;/li&gt;
&lt;li&gt;Releasing: When the Pod is done, the PVC is deleted. What happens to the resident data depends on the &lt;code&gt;persistentVolumeReclaimPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reclaiming: 
You have three options: Retain, Delete, Recycle &lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Empty Dir
&lt;/h4&gt;

&lt;p&gt;The kubelet creates an &lt;code&gt;emptyDir&lt;/code&gt;. It will create the directory in the container but not mount any storage. The data written to that storage is not persistent, as it will be deleted when the Pod is deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
    name: sample
    namespace: default
spec:
    containers:
    - image: sample
      name: sample
      command:
        - sleep
        - "3600"
      volumeMounts:
      - mountPath: /sample-mount
        name: sample-volume
    volumes:
    - name: sample-volume
      emptyDir: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Other Volume types
&lt;/h4&gt;

&lt;h5&gt;
  
  
  gcePersistentDisk and awsElasticBlockStore
&lt;/h5&gt;

&lt;p&gt;You can mount your GCE or your EBS into your Pods.&lt;/p&gt;

&lt;h5&gt;
  
  
  hostPath
&lt;/h5&gt;

&lt;p&gt;This mounts a resource from the host node filesystem. The resource must already exist in order to be used, unless one of these types is set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DirectoryOrCreate&lt;/li&gt;
&lt;li&gt;FileOrCreate&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  and many more
&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;NFS&lt;/strong&gt; - Network File System&lt;br&gt;
&lt;strong&gt;iSCSI&lt;/strong&gt; - Internet Small Computer System Interface&lt;br&gt;
&lt;strong&gt;RBD&lt;/strong&gt; (RADOS Block Device) - RBD is a block storage device that runs on top of the Ceph distributed storage system. It allows you to create block devices that can be mounted and used like a regular disk. RBD is often used in virtualization environments, providing storage for virtual machines.&lt;br&gt;
&lt;strong&gt;CephFS&lt;/strong&gt; - CephFS is a distributed file system built on top of the Ceph storage system.&lt;br&gt;
&lt;strong&gt;GlusterFS&lt;/strong&gt; - open-source, distributed file system that can scale out to petabytes of storage. It works by aggregating various storage resources across nodes into a single, global namespace. &lt;/p&gt;
&lt;h3&gt;
  
  
  Dynamic Provisioning
&lt;/h3&gt;

&lt;p&gt;With the kind StorageClass, a user can request a claim, which the API Server fills via auto-provisioning. Common choices for dynamic storage are AWS and GCE.&lt;/p&gt;

&lt;p&gt;Sample for gce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1        
kind: StorageClass
metadata:
  name: you-name-it                        
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ConfigMaps
&lt;/h3&gt;

&lt;p&gt;This kind of storage is used for non-sensitive configuration data, which does not need to be encoded but should not be stored within the application itself. &lt;br&gt;
Using configmaps we can decouple the container image from the configuration artifacts.&lt;br&gt;
If configmaps are marked as "optional" they don't need to be mounted before a pod wants to use them.&lt;/p&gt;

&lt;p&gt;They can be consumed in various ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod environmental variables from single or multiple ConfigMaps&lt;/li&gt;
&lt;li&gt;Use ConfigMap values in Pod commands&lt;/li&gt;
&lt;li&gt;Populate Volume from ConfigMap&lt;/li&gt;
&lt;li&gt;Add ConfigMap data to a specific path in Volume&lt;/li&gt;
&lt;li&gt;Set file names and access mode in Volume from ConfigMap data&lt;/li&gt;
&lt;li&gt;Can be used by system components and controllers.&lt;/li&gt;
&lt;/ul&gt;
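&lt;p&gt;Two of these consumption styles in one Pod, as a sketch (assuming a ConfigMap named &lt;code&gt;yourcm&lt;/code&gt; with a &lt;code&gt;yoursecret&lt;/code&gt; key exists; all other names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: cm-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    env:
    - name: YOURSECRET # from a single ConfigMap key
      valueFrom:
        configMapKeyRef:
          name: yourcm
          key: yoursecret
    volumeMounts:
    - mountPath: /etc/config # each key becomes a file here
      name: config-volume
  volumes:
  - name: config-volume
    configMap:
      name: yourcm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;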

&lt;p&gt;Create a Configmap from literal:&lt;br&gt;
&lt;code&gt;kubectl create cm yourcm --from-literal yoursecret=topsecret&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a Configmap from a file:&lt;br&gt;
&lt;code&gt;kubectl create -f your-cm.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Sample ConfigMap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
data:
  yoursecret: topsecret
  level: "3"
kind: ConfigMap
metadata:
  name: yourcm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;read the configmap&lt;br&gt;
&lt;code&gt;kubectl get configmap yourcm -o yaml&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Secrets
&lt;/h3&gt;

&lt;p&gt;This kind of storage is used to store sensitive data, that needs to be encoded. &lt;/p&gt;

&lt;p&gt;A Secret in Kubernetes is base64-encoded by default.&lt;br&gt;
If you want to encrypt secrets, you have to create an &lt;strong&gt;EncryptionConfiguration&lt;/strong&gt;.&lt;br&gt;
There is no limit to the number of secrets, but there is a 1MB limit to their size.&lt;br&gt;
Secrets are stored in tmpfs on the host node and are only sent to nodes running a Pod that needs them.&lt;/p&gt;
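&lt;p&gt;A minimal EncryptionConfiguration sketch (the key value is a placeholder; the file is passed to the kube-apiserver via &lt;code&gt;--encryption-provider-config&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: &lt;base64-encoded 32-byte key&gt; # placeholder
      - identity: {} # fallback for reading unencrypted data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;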
&lt;h4&gt;
  
  
  Secret as an environmental variable
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;kubectl get secrets&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl create secret generic --help&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl create secret generic mysecret --from-literal=password=supersecret&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
     containers:
     -image: yourimage
      name: yourcontainername
      env:
      - name: ROOT_PASSWORD
        valueFrom: 
         secretKeyRef:
           name: yoursecret
           key: password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Mounting secrets as volumes
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
    containers:
    - image: busybox
      name: busy
      command:
        - sleep
        - "3600"
      volumeMounts:
      - mountPath: /mysqlpassword
        name: mysql
    volumes:
    - name: mysql
      secret:
        secretName: mysql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that the secret is available in the container:&lt;br&gt;
&lt;code&gt;kubectl exec -ti busybox -- cat /mysqlpassword/password&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/"&gt;https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/&lt;/a&gt;&lt;br&gt;
Volumes on Kubernetes: &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/"&gt;https://kubernetes.io/docs/concepts/storage/volumes/&lt;/a&gt;&lt;br&gt;
Ceph: &lt;a href="https://ubuntu.com/ceph/what-is-ceph"&gt;https://ubuntu.com/ceph/what-is-ceph&lt;/a&gt;&lt;/p&gt;

</description>
      <category>volumes</category>
      <category>configmaps</category>
      <category>secrets</category>
      <category>persistent</category>
    </item>
    <item>
      <title>Kubernetes Deployment</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Wed, 15 Nov 2023 11:48:08 +0000</pubDate>
      <link>https://dev.to/barbara/deploy-f47</link>
      <guid>https://dev.to/barbara/deploy-f47</guid>
      <description>&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;A K8s Deployment is a declarative configuration in a .yaml or .json file that defines the desired state of a containerized application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a basic deployment.yaml
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl create deploy your-deployment --image=your-image -oyaml --dry-run=client &amp;gt; deploy.yaml&lt;/code&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modify as needed, for example you can add livenessProbes&lt;/li&gt;
&lt;li&gt;run &lt;code&gt;kubectl apply -f=deploy.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;check run &lt;code&gt;kubectl describe deploy&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
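&lt;p&gt;As an example of such a modification, a livenessProbe added to the generated container spec (a sketch; the path and port are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: your-deployment
    image: your-image
    livenessProbe:
      httpGet:
        path: /healthz # placeholder health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;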

&lt;h3&gt;
  
  
  Deployment Configuration Status
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl get deployments&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl describe deployment yourdeploymentname&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  availableReplicas
&lt;/h4&gt;

&lt;p&gt;Indicates how many replicas were configured by the ReplicaSet. This is compared to readyReplicas.&lt;/p&gt;

&lt;h4&gt;
  
  
  readyReplicas
&lt;/h4&gt;

&lt;p&gt;Used to determine if all replicas have been fully generated and without error.&lt;/p&gt;

&lt;h4&gt;
  
  
  observedGeneration
&lt;/h4&gt;

&lt;p&gt;Shows how often the deployment has been updated. This information can be used to understand the rollout and rollback situation of the deployment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scaling and Rolling Updates
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl scale deploy/dev-web --replicas=4&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Rolling Update
&lt;/h3&gt;

&lt;p&gt;If you want to modify non-immutable values, you can change them in an editor.&lt;br&gt;
This triggers a rolling update of the deployment. While the deployment would show an older age, a review of the Pods would show a recent update and the newer version of the application deployed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl edit deployment yourdeployment&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;containers:
      - image: geile-app:1.8 #&amp;lt;&amp;lt;---Change version number
        imagePullPolicy: IfNotPresent
        name: dev-geile-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will update the deployment gradually, replacing old pods with new ones to ensure continuous availability of the service.&lt;br&gt;
It is the default update strategy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary Rollout
&lt;/h3&gt;

&lt;p&gt;A new version of the application is deployed to a small percentage of the pods or replicas in the Kubernetes cluster. This can be achieved using a Deployment resource with specific strategies and configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25% &amp;lt;--- incremental increase in the number of pods running the canary version.
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: your-app
    spec:
      containers:
      - name: your-app
        image: your-registry/your-app:canary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Blue- Green Deployment
&lt;/h3&gt;

&lt;p&gt;In a Blue-Green Deployment, two identical environments, typically referred to as "Blue" and "Green," are maintained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one for the current production version (Blue) and&lt;/li&gt;
&lt;li&gt;one for the new version being deployed (Green). &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment process involves switching the traffic from the Blue environment to the Green environment once the new version is considered ready for production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blue-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: your-app
      color: blue
  template:
    metadata:
      labels:
        app: your-app
        color: blue
    spec:
      containers:
        - name: your-app
          image: registry/your-app:blue
          ports:
            - containerPort: 80

# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: green-deployment
spec:
  replicas: 0  # Keeping replicas at 0 initially
  selector:
    matchLabels:
      app: your-app
      color: green
  template:
    metadata:
      labels:
        app: your-app
        color: green
    spec:
      containers:
        - name: your-app
          image: registry/your-app:green
          ports:
            - containerPort: 80

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: your-app-service
spec:
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then deploy the Blue version first:&lt;br&gt;
&lt;code&gt;kubectl apply -f blue-deployment.yaml&lt;/code&gt;&lt;br&gt;
Once the new version is validated, deploy the Green version:&lt;br&gt;
&lt;code&gt;kubectl apply -f green-deployment.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Rollbacks
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Review what happened
&lt;code&gt;kubectl rollout history deployment/mydeploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check the status of the deployment
&lt;code&gt;kubectl get pods&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you need to do a rollback
&lt;code&gt;kubectl rollout undo deployment/mydeploy&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
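
&lt;p&gt;If you need to return to a specific revision rather than just the previous one, the history can be inspected and combined with &lt;code&gt;--to-revision&lt;/code&gt; (revision 2 here is only an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout history deployment/mydeploy --revision=2 # show the details of one revision
kubectl rollout undo deployment/mydeploy --to-revision=2 # roll back to that revision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;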

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/"&gt;https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/"&gt;https://kubernetes.io/docs/concepts/workloads/controllers/deployment/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>yaml</category>
      <category>rollback</category>
      <category>deployment</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Design</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Mon, 09 Oct 2023 12:10:55 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-design-2m8g</link>
      <guid>https://dev.to/barbara/kubernetes-design-2m8g</guid>
      <description>&lt;p&gt;In this blogpost, you will get a crisp guide through the design concepts of Kubernetes. Let's go:&lt;/p&gt;

&lt;h2&gt;
  
  
  Decoupled resources
&lt;/h2&gt;

&lt;p&gt;Each component should be decoupled from outer resources, so that every component can be removed, replaced or rebuilt.&lt;br&gt;
Use Services for connections to other resources to provide flexibility.&lt;/p&gt;
&lt;h2&gt;
  
  
  Transience
&lt;/h2&gt;

&lt;p&gt;Each object should be built with the expectation that other components will die and be rebuilt.&lt;br&gt;
Having this in mind, we can update and scale with ease.&lt;/p&gt;
&lt;h2&gt;
  
  
  Flexible Framework
&lt;/h2&gt;

&lt;p&gt;Multiple independent resources work together, but they are decoupled and do not expect a permanent relationship to other resources.&lt;br&gt;
This framework of independent resources is not as efficient, as we have a lot of controllers or watch-loops in place to monitor the current cluster state and change things until the state matches the configuration.&lt;br&gt;
But on the other hand, this framework allows us to have more flexibility, a very high availability and scalability.&lt;/p&gt;
&lt;h2&gt;
  
  
  Resource Usage
&lt;/h2&gt;

&lt;p&gt;Kubernetes allows us to easily scale clusters and sets resource limits via configuration.&lt;/p&gt;
&lt;h3&gt;
  
  
  CPU
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.containers[].resources.limits.cpu
spec.containers[].resources.requests.cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;1 CPU in K8s is equivalent to 1 AWS vCPU, 1 GCP Core, 1 Azure vCore, or 1 hyperthread on a bare-metal Intel processor with Hyperthreading enabled.&lt;/p&gt;
&lt;h3&gt;
  
  
  RAM
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.containers[].resources.limits.memory
spec.containers[].resources.requests.memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;With Docker the limits.memory value is converted to an integer value to be used in the &lt;code&gt;docker run --memory &amp;lt;value&amp;gt; &amp;lt;image&amp;gt;&lt;/code&gt; command.&lt;br&gt;
If the container exceeds its memory limit, it may be restarted or the entire Pod could be evicted from the node.&lt;/p&gt;
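
&lt;p&gt;A minimal sketch of how the CPU and memory requests and limits above could look in a Pod spec (the container name and values are only assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: your-container
    image: your-image
    resources:
      requests:           # what the scheduler guarantees
        cpu: "250m"       # a quarter of one CPU
        memory: "64Mi"
      limits:             # what the container may not exceed
        cpu: "500m"
        memory: "128Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;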
&lt;h3&gt;
  
  
  Ephemeral Storage
&lt;/h3&gt;

&lt;p&gt;Container files and logs can be stored there. If the containers use more than the limit in the Pod, the Pod will be evicted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.containers[].resources.limits.ephemeral-storage
spec.containers[].resources.requests.ephemeral-storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Label Selectors
&lt;/h2&gt;

&lt;p&gt;They provide a flexible and dynamic way to interact with your K8s cluster and help with the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource Organization&lt;/li&gt;
&lt;li&gt;Resource Identification&lt;/li&gt;
&lt;li&gt;Selective Resource Access&lt;/li&gt;
&lt;li&gt;Application Deployment&lt;/li&gt;
&lt;li&gt;Scaling and Load Balancing&lt;/li&gt;
&lt;li&gt;Rolling Updates and Rollbacks&lt;/li&gt;
&lt;li&gt;Monitoring and Logging&lt;/li&gt;
&lt;li&gt;Multi-Tenancy&lt;/li&gt;
&lt;li&gt;Custom Workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selectors are namespace-scoped; you can add the &lt;code&gt;--all-namespaces&lt;/code&gt; argument to select matching objects in all namespaces.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get object-name -o yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl get pod pod-name -o yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;labels:
  app: sample
  pod-template-hash: 0815
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl -n yourns get pods --selector app=your_pod&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Container Pods
&lt;/h2&gt;

&lt;p&gt;Having multiple containers allows independent development and scaling for every container to best meet the needs of the workload.&lt;br&gt;
Every container in a POD shares a single IP address and namespace.&lt;br&gt;
Each container has equal potential access to storage given to the Pod.&lt;/p&gt;
&lt;h2&gt;
  
  
  Different types of containers
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Ambassador
&lt;/h3&gt;

&lt;p&gt;Used to communicate with outside resources, often outside the cluster. With this you don't need to implement a new service or a new entry to an ingress controller. &lt;/p&gt;
&lt;h3&gt;
  
  
  Adapter
&lt;/h3&gt;

&lt;p&gt;is used to modify the data generated by the primary container. An example would be a data stream that needs to be modified for a use case.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sidecar
&lt;/h3&gt;

&lt;p&gt;you can compare it to a sidecar on a motorcycle. It often provides services that are not found in the main application. For example a logging container. So it remains decoupled and scalable. &lt;/p&gt;
&lt;h3&gt;
  
  
  initContainer
&lt;/h3&gt;

&lt;p&gt;An init container allows one or more containers to run only after one or more previous containers have run and exited successfully. &lt;br&gt;
For example, a git-sync container would be an init container for another application that always needs the latest information from a given Git repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: app
    image: app-image
  initContainers:
  - name: git-sync
    image: git-sync-image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CRD - Custom Resource Definition
&lt;/h2&gt;

&lt;p&gt;With CRDs you can extend the K8s API and create custom resources. &lt;br&gt;
With the help of a CRD you can add databases, message queues, machine learning models and many more, and create custom schemas for your custom resources.&lt;br&gt;
There are also public CRDs that can be used. For example Helm CRDs or the Prometheus Operator CRDs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Job
&lt;/h2&gt;

&lt;p&gt;It is a resource object to manage and run a task or a batch process. Jobs ensure that a given number of Pods complete successfully.&lt;/p&gt;
&lt;h3&gt;
  
  
  Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One-Time Execution: for example a database migration or a backup&lt;/li&gt;
&lt;li&gt;Parallelism: the number of parallel Pod completions&lt;/li&gt;
&lt;li&gt;Pod Template: defines the container(s) to run&lt;/li&gt;
&lt;li&gt;Completion and Failure Handling: can be defined via &lt;code&gt;completions&lt;/code&gt; and &lt;code&gt;backoffLimit&lt;/code&gt;. The backoffLimit defines the number of retries.&lt;/li&gt;
&lt;li&gt;Garbage Collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sample of a job manifest.yaml in K8s&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: your-job
spec:
  completions: 3  # Number of desired completions
  parallelism: 1  # Number of pods running in parallel
  template:
    spec:
      containers:
      - name: your-container
        image: your-image
        command: ["echo", "Hello World!"]
  backoffLimit: 2  # Maximum number of retries in case of failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  checklist to see if your design is good
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] The application is as decoupled as it can be&lt;/li&gt;
&lt;li&gt;[ ] Nothing can be taken out of an existing container&lt;/li&gt;
&lt;li&gt;[ ] Every container is transient and is able to react properly when other containers are transient&lt;/li&gt;
&lt;li&gt;[ ] Chaos Monkey can run without my users noticing it&lt;/li&gt;
&lt;li&gt;[ ] Every component can be scaled to meet the workload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Further reading:&lt;br&gt;
K8s CRD: &lt;a href="https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/"&gt;https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/&lt;/a&gt;&lt;br&gt;
Resource: &lt;a href="https://training.linuxfoundation.org/training/kubernetes-for-developers/"&gt;https://training.linuxfoundation.org/training/kubernetes-for-developers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>design</category>
      <category>kubernetes</category>
      <category>resourcemanagement</category>
      <category>framework</category>
    </item>
    <item>
      <title>Kubernetes Build</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 29 Sep 2023 15:47:34 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-build-3cll</link>
      <guid>https://dev.to/barbara/kubernetes-build-3cll</guid>
      <description>&lt;p&gt;This post sums up the steps to build a Kubernetes application.&lt;/p&gt;

&lt;h2&gt;
  
  
  CRI - Container Runtime Interface
&lt;/h2&gt;

&lt;p&gt;Kubernetes is designed to work with many different container runtimes like Docker, CRI-O, containerd, rkt and others. The CRI allows easy integration of various container runtimes with kubelet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containerizing an application
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The more stateless and transient the better an app is suited for containerization&lt;/li&gt;
&lt;li&gt;Environmental configuration needs to be provided via configMaps and secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;A Dockerfile is a list of commands from which an image can be built.&lt;br&gt;
An image is a binary file that includes everything needed to run as a container. Images are usually stored in a container registry.&lt;br&gt;
A container is a running instance of an image.&lt;/p&gt;
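
&lt;p&gt;A minimal Dockerfile sketch (the base image, file names and port are only assumptions for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.11-slim          # base image
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .                       # copy the application code into the image
EXPOSE 8080
CMD ["python", "app.py"]       # command run when the container starts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;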
&lt;h3&gt;
  
  
  Sample with Docker
&lt;/h3&gt;

&lt;p&gt;After you have written your Dockerfile, do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker build -t yourapp # build the container
sudo docker images # verify the image 
sudo docker run yourapp #execute the image
sudo docker push # push to the repository
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can keep your Docker images local, in a repository, or in the container registry of a cloud provider like Azure, Google or AWS.&lt;br&gt;
Every container in a Pod shares a single IP address and namespace. Every container has equal potential access to the storage given to the Pod.&lt;/p&gt;
&lt;h2&gt;
  
  
  Probes
&lt;/h2&gt;

&lt;p&gt;Three different types of probes help to ensure that applications are ready for traffic and healthy within Kubernetes.&lt;/p&gt;
&lt;h3&gt;
  
  
  readinessProbe
&lt;/h3&gt;

&lt;p&gt;If your application needs to be initialized or configured in order to accept traffic, you can use the readinessProbe. &lt;br&gt;
The container will not accept traffic until the probe returns a healthy state.&lt;/p&gt;
&lt;h3&gt;
  
  
  livenessProbe
&lt;/h3&gt;

&lt;p&gt;It checks if the container is in a healthy state, while running. If it fails, the container is terminated and a replacement would be spawned.&lt;/p&gt;
&lt;h3&gt;
  
  
  startupProbe
&lt;/h3&gt;

&lt;p&gt;This probe is used to test an application that takes a long time to start. The duration until a container is considered to have failed is determined by &lt;code&gt;failureThreshold x periodSeconds&lt;/code&gt;. If periodSeconds is set to 5 seconds and failureThreshold to 10, the probe would check every 5 seconds and fail after a total of 50 seconds.&lt;/p&gt;
&lt;h3&gt;
  
  
  Probes samples
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: your-app
spec:
  containers:
    - name: your-container
      image: your-image:latest
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /areyouready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      # Define a custom configuration probe
      # readinessProbe:
      #  exec:
      #    command:
      #     - cat
      #     - /app/config/config.yaml
      #  initialDelaySeconds: 20
      #  periodSeconds: 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Create a POD
&lt;/h2&gt;

&lt;p&gt;With the following command, you can create a pod as defined in a file called &lt;code&gt;your-pod.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f your-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;To see if everything is working as expected, you can use the describe functionality or get the logs of a pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the next post I will show you how to write a declarative configuration to define the desired state of a containerized piece of code - in Kubernetes it is called a DEPLOYMENT. See you there.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
Container runtimes: &lt;a href="https://github.com/containers"&gt;https://github.com/containers&lt;/a&gt;&lt;br&gt;
Helm: &lt;a href="https://helm.sh/"&gt;https://helm.sh/&lt;/a&gt;&lt;br&gt;
ArtifactHub: &lt;a href="https://artifacthub.io/"&gt;https://artifacthub.io/&lt;/a&gt;&lt;br&gt;
Resource: &lt;a href="https://training.linuxfoundation.org/training/kubernetes-for-developers/"&gt;https://training.linuxfoundation.org/training/kubernetes-for-developers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>probe</category>
      <category>test</category>
    </item>
    <item>
      <title>Delta Live Tables</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 29 Sep 2023 14:21:28 +0000</pubDate>
      <link>https://dev.to/barbara/delta-live-tables-1bi5</link>
      <guid>https://dev.to/barbara/delta-live-tables-1bi5</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR - Delta Live Tables aka DLT
&lt;/h2&gt;

&lt;p&gt;DLT is a framework on top of a Delta Lake that does magic simsalabim out of the box, so you can process big amounts of data without any knowledge of the mechanics used. But you also have the possibility to configure it in a very fine-grained way via a JSON file when creating the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features of Delta Live Tables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Different data sets&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset type&lt;/th&gt;
&lt;th&gt;How is the data processed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Streaming table&lt;/td&gt;
&lt;td&gt;Each record is processed exactly once. This assumes an append-only source.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Materialized views&lt;/td&gt;
&lt;td&gt;Records are processed as required to return accurate results for the current data state. Materialized views should be used for data sources with updates, deletions, or aggregations, and for change data capture processing (CDC).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Views&lt;/td&gt;
&lt;td&gt;Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;You can write DLT in &lt;strong&gt;Python or SQL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You can use &lt;strong&gt;different editions&lt;/strong&gt; "core", "pro" and "advanced".&lt;/li&gt;
&lt;li&gt;You can use it to &lt;strong&gt;orchestrate tasks&lt;/strong&gt; and build pipelines in a very fast way and with a lot less code.&lt;/li&gt;
&lt;li&gt;it takes care of the &lt;strong&gt;cluster management&lt;/strong&gt; by itself, but you can also configure it yourself with a .json if needed.&lt;/li&gt;
&lt;li&gt;You have inbuilt &lt;strong&gt;monitoring&lt;/strong&gt;. Within the delta live tables user interface, you can see Pipeline status, latency, throughput, error rates and the data quality as defined by you.&lt;/li&gt;
&lt;li&gt;you can add &lt;strong&gt;data quality benchmarking&lt;/strong&gt; in a very simple way. But it is only enabled in the "advanced" edition.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@dlt.expect("valid_user_name", "user_name IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count &amp;gt; 0")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
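
&lt;p&gt;As a rough sketch, the pipeline settings JSON mentioned above could look like this (the names, paths and cluster size are only assumptions, not a complete reference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "your-dlt-pipeline",
  "edition": "ADVANCED",
  "continuous": false,
  "libraries": [
    { "notebook": { "path": "/Repos/you/your-dlt-notebook" } }
  ],
  "clusters": [
    { "label": "default", "num_workers": 2 }
  ],
  "target": "your_target_schema"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;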



&lt;h3&gt;
  
  
  Sample: Medallion Architecture done with Delta Live Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import dlt 
# boom - magic imported
# expected to run as a part of a delta table pipeline
from pyspark.sql.functions import *

# if you want to ingest json data
json_path = "your_path"

# STEP 1 - Bronze Layer
# alias for creating table function
@dlt.table(
    comment="ingests raw data from wherever you want"
    # you could assign a different table name in here, if you don't want the table to be the function name, like
    # name= "my_bronze_layer" 
)

# function name is the name of the DLT
# this function always needs to follow after the table creation with @dlt.table
def bronze_layer():
    """
    This function ingests raw data from a given source and stores it in a table called "bronze_layer"
    """
    # df = spark. read...whatever you want, like filter data as long as you return a DataFrame
    return (spark.read.format("json").load(json_path)) # a dataframe


# STEP 2 - Silver Layer
@dlt.table(
  comment="Create a silver layer with selected, quality-checked data"
)

@dlt.expect("valid_user_name", "user_name IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count &amp;gt; 0")

# new table creation
def silver_layer():
  return (
    # live table depending on the table built in STEP 1
    dlt.read("bronze_layer") # after this you can go ahead with spark as usual
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("user_name", "user")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("user", "click_count", "previous_page_title", "current_page_title") # keep current_page_title so the gold layer can filter on it
  )

# STEP 3 - Gold Layer
@dlt.table(
  comment="A table containing the top pages linking to the checkout page."
)
def gold_layer():
  return (
    dlt.read("silver_layer")
      .filter(expr("current_page_title == 'Checkout'"))
      .withColumnRenamed("previous_page_title", "referrer")
      .sort(desc("click_count"))
      .select("referrer", "click_count")
      .limit(10)
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://docs.databricks.com/api/workspace/pipelines"&gt;Delta Live Tables&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.databricks.com/en/delta/index.html"&gt;Delta Lake&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.databricks.com/en/introduction/index.html#:~:text=Databricks%20is%20a%20unified%2C%20open,Data%20warehousing%2C%20analytics%2C%20and%20BI"&gt;Databricks&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>dlt</category>
      <category>dataengineering</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>Kubernetes Architecture</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Tue, 15 Aug 2023 10:30:49 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-architecture-1e9p</link>
      <guid>https://dev.to/barbara/kubernetes-architecture-1e9p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; is an open-source system for automating deployment, scaling and management of containerized applications. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, you will get an overview of the K8s architecture. If you are coming from software engineering and want to get a first understanding of how K8s works: this post is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terminology.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Namespace.
&lt;/h3&gt;

&lt;p&gt;A group of resources. For every group, resource quotas can be set with the &lt;strong&gt;LimitRange&lt;/strong&gt; admission controller. Also, user permissions can be applied.&lt;br&gt;
K8s resources can be created namespace-scoped or cluster-scoped.&lt;br&gt;
Two objects cannot have the same &lt;strong&gt;Name&lt;/strong&gt; value in the same namespace&lt;/p&gt;

&lt;h3&gt;
  
  
  Context.
&lt;/h3&gt;

&lt;p&gt;This consists of the user, cluster name (e.g. dev and prod) and namespace. It is used to switch between permissions and restrictions.&lt;br&gt;
The context information is stored in &lt;code&gt;~/.kube/config&lt;/code&gt;.&lt;/p&gt;
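
&lt;p&gt;For example, contexts can be listed and switched like this (the context name is only a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl config get-contexts            # list all contexts
kubectl config use-context dev-cluster # switch to another context
kubectl config current-context         # show the active context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;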

&lt;h3&gt;
  
  
  Resource limits.
&lt;/h3&gt;

&lt;p&gt;Limits can be set per namespace and per Pod. The namespace limits have priority over the Pod spec.&lt;/p&gt;
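
&lt;p&gt;A minimal sketch of a namespace-level &lt;strong&gt;LimitRange&lt;/strong&gt; (the values are only assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:            # default limit per container
      memory: 512Mi
    defaultRequest:     # default request per container
      memory: 256Mi
    type: Container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;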

&lt;h3&gt;
  
  
  Pod security admission.
&lt;/h3&gt;

&lt;p&gt;There are 3 profiles: &lt;strong&gt;privileged&lt;/strong&gt;, &lt;strong&gt;baseline&lt;/strong&gt; and &lt;strong&gt;restricted&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Network policies.
&lt;/h3&gt;

&lt;p&gt;Ingress and Egress traffic can be limited according to namespaces and labels or addresses.&lt;/p&gt;
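
&lt;p&gt;A minimal sketch of such a policy, allowing Ingress traffic to backend Pods only from frontend Pods (the labels are only assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:          # the Pods this policy applies to
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:      # only these Pods may send traffic
        matchLabels:
          app: frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;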

&lt;h2&gt;
  
  
  K8s API Flow.
&lt;/h2&gt;

&lt;p&gt;In the following sections, the parts of the control plane and the worker nodes will be explained. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zzk93ezbmww0somcw5t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zzk93ezbmww0somcw5t.jpg" alt="K8s API Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Control plane node components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  kube-apiserver
&lt;/h3&gt;

&lt;p&gt;is the central part of a K8s cluster. All calls are handled on this server. Every API call passes three steps: authentication, authorization, and admission controllers.&lt;br&gt;
Only the kube-apiserver connects to the etcd database. &lt;/p&gt;

&lt;h3&gt;
  
  
  kube-scheduler
&lt;/h3&gt;

&lt;p&gt;scans available resources (like CPU, memory utilization, node health and workload distribution) and makes informed decisions about which node will host a Pod of containers. &lt;br&gt;
It monitors the cluster and makes decisions based on the current state.&lt;/p&gt;

&lt;h3&gt;
  
  
  etcd database
&lt;/h3&gt;

&lt;p&gt;is a key-value store in which the state of the cluster, networking and other persistent information is stored.&lt;/p&gt;

&lt;h3&gt;
  
  
  kube-controller-manager
&lt;/h3&gt;

&lt;p&gt;is a core control loop daemon that interacts with the kube-api-server to determine the state of the cluster. If the state does not match, the manager contacts the necessary controllers to match the desired state. There are several controllers in use, like endpoints, namespace and replication.&lt;/p&gt;

&lt;h3&gt;
  
  
  cloud-controller-manager
&lt;/h3&gt;

&lt;p&gt;can interact with agents outside of the cloud. It allows faster changes without altering the core K8s control process (see kube-apiserver). &lt;br&gt;
Each kubelet must use the &lt;code&gt;--cloud-provider=external&lt;/code&gt; setting passed to the binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worker node components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  kubelet
&lt;/h3&gt;

&lt;p&gt;interacts with the underlying container runtime (installed on all nodes) and ensures that all containers are running as desired.&lt;br&gt;
It accepts the API calls for Pod specifications and configures the local node until the specification has been met.&lt;br&gt;
For example, if a Pod needs access to storage, Secrets or ConfigMaps, the kubelet will make this happen.&lt;br&gt;
It sends back the status to the kube-apiserver to be persistent in the etcd.&lt;/p&gt;

&lt;h3&gt;
  
  
  kube-proxy
&lt;/h3&gt;

&lt;p&gt;manages the network connectivity to the containers via iptables (IPv4 and IPv6). A 'userspace mode' monitors Services and Endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  logging
&lt;/h3&gt;

&lt;p&gt;Currently, there is no cluster-wide logging. &lt;a href="https://www.fluentd.org/" rel="noopener noreferrer"&gt;Fluentd&lt;/a&gt; can be used to have a unified logging layer for the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  metrics
&lt;/h3&gt;

&lt;p&gt;run &lt;code&gt;kubectl top&lt;/code&gt; to get the metrics of a K8s component.&lt;br&gt;
If needed &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; can be deployed to gather metrics from nodes and applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  container engine
&lt;/h3&gt;

&lt;p&gt;A container engine for the management of containerized applications, like containerd or cri-o.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pods
&lt;/h2&gt;

&lt;p&gt;are the smallest units we can work with on K8s. The design of a pod follows a &lt;code&gt;one-process-per-container&lt;/code&gt; architecture. A pod represents a group of co-located containers with some associated data volumes. &lt;br&gt;
Containers in a pod start in parallel by default. &lt;/p&gt;

&lt;h3&gt;
  
  
  special containers:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;initContainers: if we want one container to run and complete before another starts.&lt;/li&gt;
&lt;li&gt;sidecar: used to perform helper tasks, like logging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  single IP per Pod
&lt;/h3&gt;

&lt;p&gt;All containers in a pod share the same network namespace. You cannot see those containers on the K8s level, only on the pod level.&lt;br&gt;
The containers use the loopback interface, write to files on a common filesystem or via inter-process communication (IPC).&lt;/p&gt;

&lt;h2&gt;
  
  
  Services
&lt;/h2&gt;

&lt;p&gt;are flexible and scalable operators that connect resources. Each service is a microservice handling a particular bit of traffic, like a &lt;strong&gt;NodePort&lt;/strong&gt; or a &lt;strong&gt;LoadBalancer&lt;/strong&gt; to distribute requests. They are also used for resource control and security.&lt;br&gt;
They use selectors to know which objects to connect. These selectors can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;equality-based: =, ==, !=&lt;/li&gt;
&lt;li&gt;set-based: in, notin, exists&lt;/li&gt;
&lt;/ul&gt;
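As a sketch (labels and names illustrative), an equality-based selector in a Service manifest, with a set-based label query shown as a comment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-service          # illustrative
spec:
  selector:
    app: web                 # equality-based: matches pods labeled app=web
  ports:
  - port: 80
    targetPort: 8080
---
# set-based selectors (in, notin, exists) appear in label queries, e.g.:
#   kubectl get pods -l 'environment in (prod, qa)'
```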

&lt;h2&gt;
  
  
  Operators
&lt;/h2&gt;

&lt;p&gt;aka watch-loops aka controllers compare the current state against the given spec and execute code to meet the spec.&lt;br&gt;
A &lt;strong&gt;DeltaFIFO&lt;/strong&gt; queue is used. The loop only ends if the delta is of type &lt;strong&gt;Deleted&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Networking Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ClusterIP
&lt;/h3&gt;

&lt;p&gt;is used for the traffic within the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  NodePort
&lt;/h3&gt;

&lt;p&gt;first creates a ClusterIP and then associates a port of the node with that new ClusterIP.&lt;/p&gt;

&lt;h3&gt;
  
  
  LoadBalancer
&lt;/h3&gt;

&lt;p&gt;first creates a ClusterIP, then a NodePort, and then makes an asynchronous request for an external load balancer. If the external load balancer is not configured to respond, the Service stays in &lt;strong&gt;Pending&lt;/strong&gt; state.&lt;/p&gt;
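A sketch of how these layers show up in one manifest (all values illustrative); switching &lt;code&gt;type&lt;/code&gt; between ClusterIP, NodePort and LoadBalancer adds the behaviors described above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-service    # illustrative
spec:
  type: LoadBalancer    # also creates a ClusterIP and a NodePort underneath
  selector:
    app: demo
  ports:
  - port: 80            # ClusterIP port inside the cluster
    targetPort: 8080    # container port the traffic is forwarded to
    nodePort: 30080     # optional; by default in the 30000-32767 range
```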

&lt;h3&gt;
  
  
  Ingress Controller
&lt;/h3&gt;

&lt;p&gt;acts as a reverse proxy to route external traffic to the assigned services based on the configuration. So its key responsibilities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routing and Load Balancing&lt;/li&gt;
&lt;li&gt;TLS Termination&lt;/li&gt;
&lt;li&gt;Path-Based Routing&lt;/li&gt;
&lt;li&gt;Virtual Hosts&lt;/li&gt;
&lt;li&gt;Authentication and Authorization&lt;/li&gt;
&lt;/ul&gt;
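Several of these responsibilities are visible in a minimal Ingress resource (hostnames and service names are made up):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress            # illustrative
spec:
  tls:
  - hosts: [demo.example.com]   # TLS termination at the controller
    secretName: demo-tls
  rules:
  - host: demo.example.com      # virtual host
    http:
      paths:
      - path: /api              # path-based routing
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
```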

&lt;p&gt;Video about the K8s API: &lt;a href="https://www.youtube.com/watch?v=YsmgB2QDaUg" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=YsmgB2QDaUg&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Container Network Interface (CNI) Configuration File
&lt;/h2&gt;

&lt;p&gt;It is the default networking interface mechanism used by kubeadm, which is the K8s cluster bootstrapping tool.&lt;/p&gt;

&lt;p&gt;It is a specification to configure container networking communications, provide a single IP per pod and remove resources when a container is deleted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/containernetworking/cni" rel="noopener noreferrer"&gt;CNI&lt;/a&gt; is language-agnostic and there are many different plugins available.&lt;/p&gt;

&lt;p&gt;Now you have a first overview of the architecture of K8s.&lt;br&gt;
You learned about the difference between the control plane and the worker nodes and their components. With this terminology in place, you can start to get into the details and run a cluster yourself.&lt;br&gt;
If you are already working with the K8s API, remember that for now &lt;code&gt;kubectl --help&lt;/code&gt; is your best friend. As kubectl offers more than 40 commands, you can explore each of them with the &lt;code&gt;--help&lt;/code&gt; flag, for example &lt;code&gt;kubectl taint --help&lt;/code&gt;. You will often get your information faster there, because ChatGPT and Bard tend to talk a lot and say little.&lt;/p&gt;

&lt;p&gt;In the next post of this series, I will write about how to build a K8s cluster. See you there.&lt;/p&gt;

&lt;p&gt;Dig deeper:&lt;br&gt;
&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=YsmgB2QDaUg" rel="noopener noreferrer"&gt;K8s API Flow explained in a beautiful video&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/" rel="noopener noreferrer"&gt;concepts of cluster networking&lt;/a&gt;&lt;br&gt;
Resource: &lt;a href="https://training.linuxfoundation.org/training/kubernetes-for-developers/" rel="noopener noreferrer"&gt;https://training.linuxfoundation.org/training/kubernetes-for-developers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>architecture</category>
      <category>devops</category>
      <category>learning</category>
    </item>
    <item>
      <title>Data Pipelines explained with Airflow</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Tue, 25 Jan 2022 12:06:00 +0000</pubDate>
      <link>https://dev.to/barbara/data-pipelines-explained-with-airflow-6e7</link>
      <guid>https://dev.to/barbara/data-pipelines-explained-with-airflow-6e7</guid>
      <description>&lt;p&gt;In the following lines I am doing a write-up about everything I learned about data pipelines at the Udacity online class. It gives a general overview about data pipelines and provides also the core concepts of Airflow and some links to code examples on &lt;a href="https://github.com/BarbaraJoebstl/data-engineering-nd/tree/master/data-pipelines"&gt;github&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHAT - A series
&lt;/h2&gt;

&lt;p&gt;A data pipeline is a series of steps in which data is processed, mostly &lt;a href="https://www.integrate.io/blog/etl-vs-elt/"&gt;ETL or ELT&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Data pipelines provide a set of logical guidelines and a common set of terminology.&lt;br&gt;
The conceptual framework of data pipelines will help you better organize and execute everyday data engineering tasks.&lt;/p&gt;

&lt;p&gt;Examples of use cases are automated marketing emails, real-time pricing, or targeted advertising based on browsing history.&lt;/p&gt;
&lt;h2&gt;
  
  
  WHY - Data Quality
&lt;/h2&gt;

&lt;p&gt;We want to provide high quality data. &lt;br&gt;
There can be different requirements how to measure data quality based on the use case. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data must be a certain size&lt;/li&gt;
&lt;li&gt;Data must be accurate to some margin of error&lt;/li&gt;
&lt;li&gt;Data must arrive within a given timeframe from the start of the execution &lt;/li&gt;
&lt;li&gt;Pipelines must run on a particular schedule&lt;/li&gt;
&lt;li&gt;Data must not contain any sensitive information&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Data Validation
&lt;/h3&gt;

&lt;p&gt;is the process of ensuring that data is present, correct and meaningful. Ensuring the quality of your data through automated validation checks is a critical step when working with data.&lt;/p&gt;
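A minimal pure-Python sketch of such checks (function and threshold names are made up); in Airflow these would typically run inside a PythonOperator:

```python
def validate_record_count(records, min_rows=1):
    """Raise if the dataset has fewer rows than expected (data is present)."""
    if len(records) >= min_rows:
        return len(records)
    raise ValueError(f"quality check failed: {len(records)} rows, expected at least {min_rows}")

def validate_no_nulls(records, column):
    """Raise if any record is missing a value for the given column (data is correct)."""
    missing = [r for r in records if r.get(column) is None]
    if missing:
        raise ValueError(f"{len(missing)} records have no value for '{column}'")
    return True
```

Failing loudly inside the pipeline run is the point: a raised exception marks the task as failed instead of silently passing bad data downstream.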
&lt;h3&gt;
  
  
  Data Lineage
&lt;/h3&gt;

&lt;p&gt;of a dataset describes the discrete steps involved in the creation, movement and calculation of a dataset. It is important for the following points:&lt;/p&gt;
&lt;h4&gt;
  
  
  Gain Confidence
&lt;/h4&gt;

&lt;p&gt;Being able to describe the data lineage of a dataset or analysis builds confidence in our data consumers, like engineers, analysts, data scientists and stakeholders.&lt;br&gt;&lt;br&gt;
If the data lineage is unclear, it is very likely that our data consumers will not trust or want to use the data.&lt;/p&gt;
&lt;h4&gt;
  
  
  Defining Metrics
&lt;/h4&gt;

&lt;p&gt;If we can surface data lineage, everyone in the company is able to agree on the definition of how a particular metric is calculated.&lt;/p&gt;
&lt;h4&gt;
  
  
  Debugging
&lt;/h4&gt;

&lt;p&gt;If each step of the data movement and transformation process is well described, it's easy to find problems if they occur.&lt;/p&gt;

&lt;p&gt;Airflow DAGs are a natural representation for the movement and transformation of data. The components can be used to track data lineage: the rendered code tab for a task, the graph view for a DAG, historical runs under the tree view.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schedules
&lt;/h3&gt;

&lt;p&gt;allow us to make assumptions about the &lt;em&gt;scope&lt;/em&gt; of the data. The scope of a pipeline run can be defined as the time from the end of the last execution until the start of the current one.&lt;/p&gt;

&lt;p&gt;Schedules improve data quality by limiting our analysis to data relevant to a time period. If we use schedules appropriately, they are also a form of &lt;em&gt;data partitioning&lt;/em&gt;, which can increase the speed of our pipeline runs.&lt;br&gt;
With the help of schedules we can also leverage already completed work. For example, we would only need to aggregate the current month and add it to the existing totals, instead of aggregating data of all time.&lt;/p&gt;
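The idea of leveraging completed work can be sketched in plain Python (names and record shapes are illustrative): each run aggregates only its own time slice and adds it to a running total.

```python
def monthly_total(records, month):
    """Aggregate only the records in the given month (the current run's scope)."""
    return sum(r["amount"] for r in records if r["month"] == month)

def update_running_total(previous_total, records, month):
    """Add the current month's aggregate to the already-computed total,
    instead of re-aggregating all historical data on every run."""
    return previous_total + monthly_total(records, month)
```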
&lt;h4&gt;
  
  
  How to schedule
&lt;/h4&gt;

&lt;p&gt;If we answer the below questions, we can find an appropriate schedule for our pipelines.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the average size of the data for a time period? The more data we have, the more often the pipeline needs to be scheduled.&lt;/li&gt;
&lt;li&gt;How frequently is data arriving and how often do we need to perform analysis? If the company needs data on a daily basis, that is the driving factor in determining the schedule. &lt;/li&gt;
&lt;li&gt;What is the frequency of related datasets? A rule of thumb is that the frequency of a pipeline's schedule should be determined by the dataset in our pipeline that requires the most frequent analysis. &lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Data Partitioning
&lt;/h3&gt;

&lt;p&gt;This is the process of isolating data to be analyzed by one or more attributes, such as time (&lt;em&gt;schedule partitioning&lt;/em&gt;), conceptually related data in discrete groups (&lt;em&gt;logical partitioning&lt;/em&gt;), data size (&lt;em&gt;size partitioning&lt;/em&gt;) or location. &lt;br&gt;
This leads to faster and more reliable pipelines, as smaller datasets, shorter time periods and related concepts are easier to debug than big amounts of data and unrelated concepts. There will also be fewer dependencies. &lt;br&gt;
Tasks operating on partitioned data may be more easily &lt;em&gt;parallelized&lt;/em&gt;.&lt;/p&gt;
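Time-based (schedule) partitioning often shows up as partitioned storage paths; a hypothetical sketch (bucket layout and prefix names are made up):

```python
from datetime import datetime

def partition_key(execution_date: datetime) -> str:
    """Build a time-partitioned storage prefix for one pipeline run,
    so each run only reads and writes its own slice of the data."""
    return (f"events/year={execution_date.year}"
            f"/month={execution_date.month:02d}"
            f"/day={execution_date.day:02d}")
```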
&lt;h2&gt;
  
  
  HOW does a pipeline work
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf0kezke9e480sft8e7q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf0kezke9e480sft8e7q.jpg" alt="What is a DAG" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  DAGs - Directed Acyclic Graphs
&lt;/h3&gt;

&lt;p&gt;A DAG is a collection of nodes and edges that describe the order of operations for a data pipeline.&lt;br&gt;
The conceptual framework of data pipelines help us to better organize and execute everyday data engineering tasks.&lt;/p&gt;
&lt;h3&gt;
  
  
  NODE
&lt;/h3&gt;

&lt;p&gt;A node is a step in a data pipeline process.&lt;/p&gt;
&lt;h3&gt;
  
  
  EDGE
&lt;/h3&gt;

&lt;p&gt;The dependencies or relationships between nodes.&lt;/p&gt;
&lt;h3&gt;
  
  
  GRAPH
&lt;/h3&gt;

&lt;p&gt;A graph describes entities and the relationships between them; in a DAG the edges are directed and form no cycles.&lt;/p&gt;

&lt;p&gt;In real world it is possible to model a data pipeline that is not a DAG, meaning it contains a cycle within the process. But the majority of pipelines can be described as a DAG. This makes the code more understandable and maintainable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Apache Airflow
&lt;/h2&gt;

&lt;p&gt;is an open-source, DAG-based, schedulable data-pipeline tool that can run in mission-critical environments.&lt;br&gt;
It is not a data processing framework, it is a tool that coordinates the movement between other data stores and data processing tools. &lt;br&gt;
Airflow allows users to write DAGs in Python that run on a schedule and/or from an external trigger.&lt;/p&gt;

&lt;p&gt;The advantages of defining pipelines in code are that they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maintainable&lt;/li&gt;
&lt;li&gt;versionable&lt;/li&gt;
&lt;li&gt;testable&lt;/li&gt;
&lt;li&gt;collaborative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow is simple to maintain and can run data analysis itself or trigger external tools (&lt;a href="https://dev.to/barbara/redshift-2l6h"&gt;Redshift&lt;/a&gt;, &lt;a href="https://dev.to/barbara/spark-for-beginners-and-you-24ea"&gt;Spark&lt;/a&gt;, etc.). It also provides a web-based UI for users to visualize and interact with their data pipelines.&lt;/p&gt;
&lt;h3&gt;
  
  
  Components of Airflow
&lt;/h3&gt;

&lt;p&gt;A &lt;em&gt;Scheduler&lt;/em&gt; for orchestrating the execution of jobs on a trigger or schedule. A &lt;em&gt;Work Queue&lt;/em&gt; which holds the state of the running DAGs and Tasks.&lt;br&gt;
&lt;em&gt;Worker Processes&lt;/em&gt; that execute the operations defined in each DAG. A &lt;em&gt;Database&lt;/em&gt; which saves credentials, connections, history and configuration.&lt;br&gt;
A &lt;em&gt;Web Interface&lt;/em&gt; that provides a control dashboard for users and maintainers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gcytbnprgjagf570u5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gcytbnprgjagf570u5g.png" alt="UI Airflow" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;The scheduler starts a DAG based on time or external triggers.&lt;br&gt;
If a DAG is started, the scheduler looks at the steps within the DAG and determines which steps can run by looking at their dependencies.&lt;br&gt;
The scheduler places runnable steps in the queue.&lt;br&gt;
Workers pick up those tasks and run them.&lt;br&gt;
Once the worker has finished running a step, the final status of the task is recorded and additional tasks are placed by the scheduler until all tasks are complete.&lt;br&gt;
Once all tasks have been completed, the DAG is complete.&lt;/p&gt;
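The loop above can be sketched in a few lines of plain Python (a toy simulation, not Airflow code): steps become runnable once all their dependencies have finished, and the run is complete when every task has run.

```python
def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of upstream task names}.
    Repeatedly queue tasks whose dependencies are all done, as the scheduler does."""
    done, order = set(), []
    while len(done) != len(tasks):
        runnable = [t for t in tasks
                    if t not in done and deps.get(t, set()).issubset(done)]
        if not runnable:
            raise ValueError("cycle detected: not a DAG")
        for t in runnable:      # workers pick up the queued tasks and run them
            tasks[t]()
            done.add(t)         # final status recorded; unlocks downstream tasks
            order.append(t)
    return order                # all tasks complete: the DAG run is complete
```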

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqokzxmblrze89wvkedvv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqokzxmblrze89wvkedvv.jpg" alt="How Airflow works" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating a DAG
&lt;/h3&gt;

&lt;p&gt;To create a &lt;em&gt;DAG&lt;/em&gt; you need a name, a description, a start date and a schedule interval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG

my_first_dag = DAG(
  'my_first',
  description='Says hello world',
  start_date=datetime(2022, 1, 22),
  schedule_interval='@daily')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the start date is in the past, Airflow will run your DAG as many times as&lt;br&gt;
 there are schedule intervals between the start date and the current date. This is called &lt;em&gt;backfill&lt;/em&gt;. If a company has years of established data that may need to be retroactively analyzed, this is useful.&lt;/p&gt;

&lt;p&gt;Schedule intervals are optional and can be defined with cron strings or Airflow presets, like &lt;code&gt;@once&lt;/code&gt;, &lt;code&gt;@hourly&lt;/code&gt;, &lt;code&gt;@daily&lt;/code&gt;, &lt;code&gt;@weekly&lt;/code&gt;, &lt;code&gt;@monthly&lt;/code&gt;, &lt;code&gt;@yearly&lt;/code&gt; or None.&lt;/p&gt;

&lt;p&gt;End date is optional, if it is not specified, the DAG will run until it is disabled or deleted. An end date might be useful to mark the end of life or handling data bounds by two points in time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Operators
&lt;/h3&gt;

&lt;p&gt;define the atomic steps of work that make up a DAG. Instantiated operators are referred to as &lt;em&gt;Tasks&lt;/em&gt;.&lt;br&gt;
Airflow comes with &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html"&gt;many operators&lt;/a&gt; that can perform common operations, like &lt;code&gt;S3ToRedshiftOperator&lt;/code&gt; or &lt;code&gt;SimpleHttpOperator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Task dependencies&lt;/em&gt; can be described programmatically using &lt;code&gt;a &amp;gt;&amp;gt; b&lt;/code&gt; or &lt;code&gt;a.set_downstream(b)&lt;/code&gt;, meaning a runs before b, or&lt;br&gt;
&lt;code&gt;a &amp;lt;&amp;lt; b&lt;/code&gt; or &lt;code&gt;a.set_upstream(b)&lt;/code&gt;, meaning a runs after b.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow.operators.python_operator import PythonOperator

def hello_world():
  print('Hello World')

def second_step():
  print('Second Step')

my_first_sample_task = PythonOperator(
  task_id='hello_world',
  python_callable=hello_world,
  dag=my_first_dag)

second_step_task = PythonOperator(
  task_id='second_step',
  python_callable=second_step,
  dag=my_first_dag)

my_first_sample_task &amp;gt;&amp;gt; second_step_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Task Boundaries
&lt;/h4&gt;

&lt;p&gt;DAG tasks should be &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;atomic, with a single, well-defined purpose. The more work a task performs, the less clear its purpose becomes. Properly scoped tasks are easy to maintain, easy to understand and fast to run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Write programs that do one thing and do it well.&lt;/code&gt; - the Unix philosophy&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximize parallelism: if a task is scoped properly, we can minimize dependencies and enable parallelism, which can speed up the execution of DAGs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also create custom operators as plugins. One example of a custom operator is a particular data quality check that is needed frequently.&lt;/p&gt;

&lt;p&gt;To create a custom operator we have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify operators that perform similar functions and can be consolidated&lt;/li&gt;
&lt;li&gt;Define a new operator in the plugins folder&lt;/li&gt;
&lt;li&gt;Replace the original operators with your new custom one, re-parameterize, and instantiate them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can find a sample for custom operators &lt;a href="https://github.com/BarbaraJoebstl/data-engineering-nd/blob/master/data-pipelines/lesson3_production_pipelines/operator_plugin.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SubDAGs
&lt;/h3&gt;

&lt;p&gt;Commonly repeated series of tasks within DAGs can be captured as reusable SubDAGs. An example would be the "S3ToRedshiftSubDag".&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;decrease the amount of code we need to write and maintain to create a new DAG&lt;/li&gt;
&lt;li&gt;easier to understand the high level goals of a DAG&lt;/li&gt;
&lt;li&gt;bug fixes, speedups, and other enhancements can be made more quickly and distributed to all DAGs that use that SubDAG&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Disadvantages
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;limited visibility within the Airflow UI&lt;/li&gt;
&lt;li&gt;harder to understand because of the abstraction level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want, you can also use nested SubDAGs, but keep in mind that this makes the pipeline much harder to understand and maintain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks
&lt;/h3&gt;

&lt;p&gt;Connections can be accessed in code via &lt;em&gt;hooks&lt;/em&gt;. Hooks provide a reusable interface to external systems and databases. Airflow comes with &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/hooks/index.html"&gt;many hooks&lt;/a&gt;, like &lt;code&gt;HttpHook&lt;/code&gt;, &lt;code&gt;PostgresHook&lt;/code&gt;, &lt;code&gt;SlackHook&lt;/code&gt; etc. We don't have to worry about how and where to store connection strings and secrets: you can store those in the Airflow user interface under &lt;code&gt;Admin - Connections&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def load():
    # Create a PostgresHook using the 'demo' connection
    db_hook = PostgresHook('demo')
    df = db_hook.get_pandas_df('SELECT * FROM my_sample')
    print(f'your sample has {len(df)} records')

load_task = PythonOperator(task_id='load_sample_data', python_callable=load, ...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As with operators, we can also create custom hooks. &lt;br&gt;
Before creating a new plugin you might want to check &lt;a href="https://github.com/apache/airflow/tree/main/airflow/contrib"&gt;Airflow contrib&lt;/a&gt; to see if a plugin for your needs has already been created by community members. If not, you can build one and contribute it to the community.&lt;/p&gt;
&lt;h3&gt;
  
  
  Runtime variables
&lt;/h3&gt;

&lt;p&gt;Airflow also provides &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html"&gt;runtime variables&lt;/a&gt;. One example is &lt;code&gt;{{ execution_date }}&lt;/code&gt;, the execution date of the current run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def hello_date(*args, **kwargs):
    print(f"Hello {kwargs['execution_date']}")

my_first_dag = DAG(...)
task = PythonOperator(
    task_id='hello_date',
    python_callable=hello_date,
    provide_context=True,
    dag=my_first_dag)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;DAGs can be configured to have an &lt;em&gt;SLA&lt;/em&gt; (Service Level Agreement), which is defined as a time by which a DAG must complete. &lt;br&gt;
We can email a list of missed SLAs or view it in the Airflow UI. Missed SLAs can also be early indicators of performance problems, or indicate that we need to scale up the size of our Airflow cluster.&lt;br&gt;
If you are working on a time-sensitive application, an SLA is crucial. &lt;/p&gt;

&lt;p&gt;Airflow can be configured to send emails on DAG and task state changes. These state changes may include successes, failures, or retries. Failure emails can allow you to easily trigger alerts. It is common for alerting systems to accept emails as a source of alerts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Airflow comes out of the box with the ability to send system metrics using a metrics aggregator called &lt;code&gt;statsd&lt;/code&gt;. Statsd can be coupled with metrics visualization tools like Grafana to provide you and your team high level insights into the overall performance of your DAGs, jobs, and tasks. These systems can be integrated into your alerting system. These Airflow system-level metrics allow you and your team to stay ahead of issues before they even occur by watching long-term trends.&lt;/p&gt;

&lt;p&gt;You can find code samples to all of the above mentioned topics &lt;a href="https://github.com/BarbaraJoebstl/data-engineering-nd/tree/master/data-pipelines"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dag</category>
      <category>productivity</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
