DEV Community: Jennifer Luther Thomas

How purpose-built observability will speed up your Kubernetes troubleshooting

Jennifer Luther Thomas — Tue, 28 Nov 2023 21:47:06 +0000

If you remember from my last post, in my previous job I was battling a support case where a user had implemented network security policies for their FME Flow (Server) application and part of the application wasn’t working. They were using Open Service Mesh (a lightweight and extensible cloud-native service mesh), which had some mechanism for recommending Kubernetes security policies. However, in doing so it had missed one of the port ranges that FME Flow required. This took weeks of troubleshooting involving the customer and resellers/consultants, with escalation to the software vendor (me).

Knowing that FME Flow was working fine before the network security policies were applied, OSM seemed like the probable cause. But with no insight into network topology or flow logs to see how those policies had potentially impacted FME, it was difficult to isolate the problem. I had also never used OSM before, which made it hard for me to know where to start. Eventually the customer figured out that a port range had been missed in the network security policies, so I can’t take any credit for the resolution.

If the customer or I had access to Kubernetes observability back then, it would have made it so much easier and faster to troubleshoot the issue.

So what is observability?

Observability refers to the ability to understand the internal state of a system by looking at the external outputs of the system. For me, the main benefit was being able to quickly view my intra-cluster traffic and identify where packets were being denied and why.

In Calico Cloud (which I’m using for this example) this can be found inside the Dynamic Service and Threat Graph.

“Dynamic Service and Threat Graph provides a point-to-point, topographical representation of traffic within your cluster to observe Kubernetes environment behavior, troubleshoot connectivity issues, and identify performance hotspots.”

According to Splunk, in their “The State of Observability 2023” report:

“Observability has become foundational to modern enterprises, providing a way to see into the stunningly complex web of systems that characterizes today’s IT environments”

I would agree with that.

Observability helps:

understand the communication patterns within Kubernetes
visualize microservice communication
quickly see dependencies and interactions
identify external services
analyze performance
speed up troubleshooting
increase resilience

The situation

In the FME Flow (an enterprise spatial ETL application) helm chart you can add a value for the fmeserver.portpool. This defines a range of ports that the FME Engines (the ‘worker’ that does all of the ETL processing) use when connecting to the FME Core, which is essentially the brains of the application.

If you apply network security policies (without the port range exposed) after you’ve already launched FME you may not notice immediately. In fact, I think the issue mainly manifested itself when an FME Engine needed to communicate with the Core to retrieve database connection information (if an ETL job was reading/writing to a database) which was not the easiest to troubleshoot. This was the customer’s situation that was brought to me.

If you don’t allow the port range within the cluster before you install the application you may find that the FME Engines can’t register with the FME Core and the problem would be more obvious and critical. If the FME Engines aren’t registering, you’re not going to be processing any data.

This is why I wanted to see how fast I could identify and fix this issue using observability.

Reproducing the scenario

As I’ve been learning more about Calico Cloud and its capabilities, I realized that the policies I implemented previously could not only be used in Calico Cloud, but that I could also quickly view how all of the components of my application communicate with each other and whether the policies I’m putting in place are working.

Connecting Calico Cloud to my cluster was a piece of cake. The UI generates a command that you can apply to your cluster. Then while you take an extended coffee or a snack break, it installs everything in your cluster and you’re ready to go!

To reproduce the customer scenario, I had the Policy Board open on one screen.

The Policy Board lets me see which policies I have enforced, either by applying them via CLI or creating them in the Calico Cloud interface. It very quickly gives insight into which policies are evaluating traffic and whether packets are being allowed or denied.

The Policies Board in Calico Cloud

On my other screen I had the Dynamic Service and Threat Graph open, which allows you to click on any line and see what traffic is flowing between components.

Based on the arrow direction we can easily see that traffic is flowing from NGINX (the ingress) to FME (the application). The arrow is green, which means that traffic is allowed. Inspecting the traffic shows the protocol, ports and any policies in place in the right hand sidebar:

An up-close view of the communication between two Kubernetes namespaces

If we look inside the fme namespace we can see traffic communicating between the different components. There is a red line between the engine deployment groups and the engine registration service (core).

Inspecting that traffic flow shows the engine group is trying to communicate with the core on ports 7070 and 42001–42002 (the port range I specified in the helm chart is 42000–43000).

A topological view of microservices communication within the fme namespace

Looking at the right-hand panel we can see that traffic is being denied as it leaves the engine group.

In Calico Cloud at the bottom of the Dynamic Service and Threat Graph is a Flows table which lists every flow happening within the cluster. This includes source and destination, ports, policies, action, process ID, etc. To find out which policy is denying the traffic we can look at the policies that are evaluating the flow from engine to the engine registration service.

The default deny policy is in place to block any traffic that I haven’t explicitly allowed as part of my zero-trust security posture. I deliberately excluded my port range from any policy so that it would be denied by default to reproduce the customer’s scenario.

To allow traffic to flow as intended the engine policy will need to be set up so that it can communicate with the other FME Flow components.

To apply this policy to the engines, I used this policy label selector:

selector: safe.k8s.fmeserver.component == "engine"

This is the Create Policy UI where I’ve defined ingress and egress rules for the engines:

The ingress and egress rules for the fme engine deployment

And if you’d rather look at the yaml:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: application.engine-fmeserver
  namespace: fme
spec:
  tier: application
  order: 37.5
  selector: safe.k8s.fmeserver.component == "engine"
  serviceAccountSelector: ''
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: safe.k8s.fmeserver.component == "core"
      destination:
        ports:
          - '7500'
  egress:
    - action: Allow
      protocol: TCP
      source: {}
      destination:
        selector: safe.k8s.fmeserver.component == "core"
        ports:
          - '7070'
          - '42000:43000'
    - action: Allow
      protocol: TCP
      source: {}
      destination:
        selector: app.kubernetes.io/name == "postgresql"
        ports:
          - '5432'
  types:
    - Ingress
    - Egress

And almost like magic, the Dynamic Service and Threat Graph turns green.

Another snazzy way to see which components are talking to each other is to click on one of them in the Dynamic Service and Threat Graph. Clicking on the engine-standard-group deployment shows every inbound and outbound communication on the right-hand side.

Flow Visualization

Another observability feature of Calico Cloud that I’ve recently come to appreciate is Flow Visualization, or “FlowViz”.

At first look, it’s a bit of a WTF moment. This doesn’t tell me anything?!

Instead of the topological view that is easy to comprehend at first look, Flow Visualization gives a 360 degree view of your cluster, with network traffic represented volumetrically. Moving in from the outside it represents namespaces, endpoint names and flows. Colour-coded flows quickly let you see if there’s any denied traffic, and next to this visualization is a table that shows allowed and denied traffic by namespace, with connections per second (CPS), packets per second (PPS) and bits per second (BPS), if network performance is your thing.

You can also ‘zoom in’ to your namespace and easily find denied traffic and which policies are denying (or allowing) traffic. Clicking on the denied traffic flow (as shown in the gif above) instantly shows which policy is responsible so that it can be fixed.

Conclusion

If the customer had the same visibility into their cluster it would have been very easy to identify the denied traffic and correct the policies. But then I wouldn’t be here telling this story.

Observability using Dynamic Service and Threat Graph, FlowViz and the Policy Board makes it incredibly easy and fast to apply the correct network security policies to protect your workloads, as well as see what’s actually going on inside your cluster! The massive overhead of writing policies in yaml and taking an iterative approach to testing connection by connection is so last blog. Not only does observability make your cluster look cool, I’ve found it incredibly valuable.

If you want to give it a go for yourself, sign up for a Calico Cloud trial. There is also a hands-on tutorial to learn how to gain observability and optimize troubleshooting in under an hour.

Stay connected with me on here, X or LinkedIn to follow my journey and more introductory security content!

If you want to see all of my policies for FME Flow reach out and I can share the yaml.

A beginner’s journey to Kubernetes security

Jennifer Luther Thomas — Fri, 13 Oct 2023 03:31:28 +0000

Come with me on my career journey as I learn all about Kubernetes security.

If you’re reading this blog then I’m assuming you have some knowledge of Kubernetes. If not, and you’re interested in learning, I not only learnt a lot from but also enjoyed this course on Udemy (not affiliated in any way; I just found this course extremely engaging and helpful!).

I’d like to say I wasn’t a complete stranger to security prior to accepting a position at Tigera. I follow the r/scams subreddit and had read The Perfect Weapon but in hindsight I was totally ignorant of how Kubernetes clusters could be compromised, and what security measures can be put in place. After my technical interview task, I was surprised at how easy implementing some basic security is.

With this blog I’m here to make Kubernetes and security a little more accessible. I wrote this with the intention to help users who are new to Kubernetes security. Perhaps, like me, you’ve never really considered Kubernetes security before. Perhaps you’ve been asked to secure your cluster and you’re not sure what the first steps that you need to take are. Or perhaps you’re already familiar with the concepts of zero-trust and microsegmentation and you want to know how to protect your cluster from malicious traffic.

In this blog I’ll give you an introduction to Kubernetes networking and security policies, why they are used, policy examples (and what I did wrong/didn’t consider) and how you can easily (and for free) implement them in your own Kubernetes clusters!

Disclaimer: I do work for Tigera so this content will be biased towards Calico, as that’s what I am using.

The Scenario

I was still working at Safe Software as a Technical Support Lead for Cloud and Containers when I was interviewing for Tigera. During this time, I coincidentally had my first security support case come in from a user who was using Open Service Mesh (OSM) in their cluster, and FME Flow (the product I was supporting) was not working correctly.

For context, FME Flow is an enterprise tool that lets you automate ETL tasks. In 2019, the traditional installation was adapted for Docker and Kubernetes and the different components of FME Flow were containerized. OSM allows you to manage, secure, and get out-of-the-box observability features for highly dynamic microservice environments.

This was the first time I had to evaluate FME Flow’s pod-to-pod communication to troubleshoot and reproduce where the communication was breaking down (spoiler alert: OSM had missed a port pool). It was also my first security-related question in about 3 years of supporting FME Flow on Kubernetes (excluding customer-reported CVEs from image scanning). With my newfound exposure to network policies (thanks Calico), I undertook the challenge of writing policies to control and test communication between FME Flow pods. And while I say challenge, if you are already comfortable with YAML and can plan a sensible rollout of policies (so you can test iteratively) it is not that difficult.

But first:

What are network and security policies for Kubernetes and containers?

In Kubernetes all pods are allowed to communicate with each other by default; it’s a very flat network. If one application has vulnerabilities or has been compromised, then everything else in that cluster is at risk. Network policies can restrict traffic between pods (also known as microsegmentation), reducing the risk that, if one application is compromised, a bad actor will be able to travel to other pods within that cluster. And if you’re thinking that it’s unlikely that your application will be a target, the concept of zero trust can also protect against human error within an organization or social engineering.

Kubernetes environments are designed to be dynamic and ephemeral. How can you write traditional rules to protect your pods or allow traffic when at any moment they may or may not exist? If a pod or node is scaled, terminated or recreated, can you guarantee your rules will always be targeting the correct pod(s) or node(s) if the IP address has changed?

The Kubernetes Network Policy API solves that problem by supporting namespaces, label selectors, CIDRs, a few protocols, and port numbers or named ports. However, Kubernetes doesn’t enforce these policies itself, and instead delegates enforcement to a Container Network Interface, or CNI, plugin (What is CNI?).
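
To make the label-selector idea concrete, here is a minimal sketch of a policy written against the Kubernetes Network Policy API itself. It allows TCP traffic on port 5432 into pods labelled as a database, and only from pods labelled as an API. The namespace, policy name and labels (app: db, app: api) are assumptions invented for illustration; they are not taken from FME Flow.

```yaml
# Sketch only: namespace, name and labels are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: db               # assumed label on the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api      # assumed label on the pods allowed to connect
      ports:
        - protocol: TCP
          port: 5432
```

Because the rule matches labels rather than IP addresses, it keeps applying when pods are recreated and come back with new IPs.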

Tigera is the creator and maintainer of Calico Open Source, and because this is where I was interviewing (and now work) I chose this as my CNI.

Each CNI can add functionality on top of what the Kubernetes default is. As an example, Calico Network Policy supports added features, such as applying policies to any kind of Kubernetes endpoint, ingress and/or egress rules, actions (allow, deny, log, pass) and source and destination match criteria.
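
As a small illustration of those extra actions, the sketch below logs and then denies any ingress traffic that reaches it. Log and Deny are both Calico rule actions; the policy name, namespace and order value are assumptions chosen for the example.

```yaml
# Sketch only: name, namespace and order are hypothetical.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: log-then-deny-ingress
  namespace: demo
spec:
  order: 1000            # policies with lower order values are evaluated first
  selector: all()        # every workload endpoint in the namespace
  types:
    - Ingress
  ingress:
    - action: Log        # log the packet, then continue to the next rule
    - action: Deny       # drop anything an earlier policy did not allow
```

Giving a catch-all like this a high order value means more specific allow policies with lower order values are evaluated before it.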

Creating policies

Knowing what ports to allow and between which components of FME Flow was the easy part. For traditional installations there was a list of services and the ports FME communicates over so all I had to do was match them up with the right pod.

Before I show you some of the network policies that I created I want to mention a couple of things.

If you want to do this for yourself, know that when no network policies are applied to a pod then all traffic is allowed (default allow). If you apply a policy, then all traffic will be denied to that pod unless specifically allowed.

This leads into my next point: have a plan for how to test and deploy your policies. If we take FME Flow as an example, which is made up of multiple pods/services that all communicate with each other, you don’t want to start by allowing traffic to your database first and not the UI that the clients will interact with, because you won’t be able to easily test and verify that your policies are working (more on that in a future blog).

FME Flow Example

Having worked with FME for 10 years, I’m intimately familiar with how the FME Flow components work and communicate, so I’ll use it as my example here. This could translate to any Kubernetes application with multiple services.

Kubernetes Architecture Diagram for FME Flow

As mentioned above, as soon as any policy is applied to a pod then traffic is denied. Knowing that, I always check that my application works correctly before applying any kind of policy. To confirm that policies are being applied I started with a default-deny.yaml policy, denying any ingress or egress traffic to the fmeserver namespace. This stops any communication coming into or out of any pods or endpoints within the fmeserver namespace.

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: default-deny-fmeserver
  namespace: fmeserver
spec:
  order: 100
  selector: all()
  types:
    - Ingress
    - Egress

If you apply the above policy in your own environment and try to access any component of FME Flow, you won’t be able to. You have to write policies to allow all of the necessary communication.

I’ll show you the policy that was applied to the queue-fmeserver pod, which is basically a Redis container for the FME Flow job queues:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: queue-fmeserver
  namespace: fmeserver # NetworkPolicies are namespaced, this is the namespace where the policy will apply
spec:
  order: 0
  selector: statefulset.kubernetes.io/pod-name contains 'queue' # What these rules apply to, in this case the Queue Pod
  ingress: # inbound traffic
    - action: Allow
      protocol: TCP
      source:
        selector: statefulset.kubernetes.io/pod-name contains 'core' # What source pod(s) are allowed, in this case any traffic from core pods
      destination:
        ports: [6379] # inbound default Redis port
  egress: # outbound traffic
    - action: Deny

One of the benefits of YAML is that it’s easily readable, so it’s quite easy to see what’s going on here (plus I actually remembered to add comments to my policies).

How the policy is constructed:

metadata.name: The policy name is crucial, so make sure it’s unique. I cannot confirm nor deny if I’ve accidentally forgotten to change the policy name when copying and pasting and spent too much time troubleshooting why my policies weren’t working.

metadata.namespace: Make sure the policy applies to the correct namespace.

spec.selector: Make sure the policy is applied to the correct pods by using label selectors. By doing this, if the application scales this policy will apply to all of the correct pods.

spec.ingress: Here you can see the rules allowing traffic into any queue pods.

spec.egress: Here you can see the rules denying traffic out of any queue pods. In this case the queue/Redis pod shouldn’t be initiating communication with any other pod or service, hence egress is denied.

Let’s talk about the FME Engine policy

For another example, here you can see the policy I created for the FME Engine. This is the pod that does all of the data processing. One FME Engine can process one ETL job at a time. However, don’t just copy this policy for your own environment, and I’ll tell you why.

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: engine-fmeserver
  namespace: fmeserver # NetworkPolicies are namespaced, this is the namespace where the policy will apply
spec:
  order: 0
  selector: safe.k8s.fmeserver.component contains 'engine' # What these rules apply to, in this case the engine pod (fme engine containers)
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: statefulset.kubernetes.io/pod-name contains 'core'
      destination:
        ports: [7500, 7501, '4500:4800'] # https://docs.safe.com/fme/html/FME-Flow/ReferenceManual/FME-Flow-Ports.htm
  egress:
    - action: Allow

My selector for this policy is set to apply to all engine pods. However, in FME you can create multiple engine deployments with different properties to enable queue control (so a specific number or subset of engines performs certain jobs). I’ve seen FME Flow users set up different engine deployments for different business processes and systems. If you’re security conscious and serious about zero trust you will want to create multiple engine policies.

For example, one engine deployment may be configured to only process jobs that are reading/writing data to a file share and a PostGIS database. That engine should only have egress (outbound) access on ports 445 and 5432, and not just “Allow”. I know this now, but I didn’t realize the importance of egress two months ago. You can also specify destination IP addresses, which may be applicable if you’re connecting to resources or services with a static IP.

Why? If you allow all egress traffic then a bad actor has the potential to exfiltrate your data, communicate with their command and control centre, or continue to traverse your cluster and network and look for more valuable targets. If (in this scenario) someone accesses your engine container they may be able to access your PostGIS instance or your file share, but they have no way of exfiltrating that data (from that pod) because you’ve denied egress traffic except to known and trusted destinations. They’ll have to find another weak link.
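
The narrower egress described above could look something like the sketch below, which replaces the blanket “Allow” with two rules limited to ports 445 and 5432 and to specific destination addresses. The policy name, the example.com/engine-group label used to pick out a single engine deployment, and the IP addresses (taken from a documentation-only range) are all assumptions for illustration.

```yaml
# Sketch only: the name, the engine-group label and the IPs are hypothetical.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: engine-fileshare-postgis
  namespace: fmeserver
spec:
  order: 0
  # Engine pods, narrowed to one (assumed) engine deployment label
  selector: safe.k8s.fmeserver.component contains 'engine' && example.com/engine-group == 'fileshare-postgis'
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        nets:
          - 192.0.2.10/32   # assumed static IP of the file share
        ports: [445]
    - action: Allow
      protocol: TCP
      destination:
        nets:
          - 192.0.2.20/32   # assumed static IP of the PostGIS database
        ports: [5432]
```

This only covers the two data destinations from the example. Any other outbound traffic the engine legitimately needs, such as its connection to the FME Core, would need its own rule, otherwise it will be denied along with everything else.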

When creating policies, you want to make sure you understand what your services will need to communicate with and create policies with the right selectors and ingress/egress rules.

Why are policies important?

If you are deploying your applications on-premises or on VMs and using firewalls — why aren’t you protecting your Kubernetes applications?

Securing the perimeter of your network or applications is a good first line of defense but you cannot rely on it alone. By combining security methods, you improve your security posture and reduce the damage if a container or cluster is compromised.

If you’ve ever played Age of Empires, do you just build a wall around the outside of your base and call it a day? Probably not, unless you lose a lot. You’re likely building defensive units and structures inside your base, protecting valuable assets (monastery? market?), upgrading your castle/towers, and putting archers in them. If you keep getting breached, you’re going to lose resources and units fighting them off which would’ve been better spent on upgrades and attacking.

However, losing at Age of Empires because your base got destroyed is not going to be as disastrous, expensive, or damaging as getting hacked.

Wrap up

If you’re newer to Kubernetes, container security, or both — I hope this was an informative, introductory read and you feel confident in how you can begin to secure your own Kubernetes workloads with policies and understand why that’s important.

If you want to improve your security posture I encourage you to check out Project Calico functionality, or explore our commercial offerings.

If you want to try this for yourself, check out the Calico installation and best practices lab (this is what I used before my interview).

Stay connected with me on here, X or LinkedIn (or Steam for a game of AoE4?) for updates about Calico and more introductory security content!