Calico, Cilium, Retina, and Netobserv: Which Observability Tool is Right for Your Kubernetes Cluster? Network observability is a tale as old as the OSI model itself, and anyone who has managed a network or a Kubernetes cluster knows the feeling: a service suddenly can’t reach its dependency, a pod is mysteriously offline, and the Slack alerts start rolling in. Investigating network connectivity issues in these complex, distributed environments can be incredibly time consuming. Without the right tools, the debugging process often involves manually connecting to each node, running tcpdump on multiple machines, and piecing together logs to find the root cause, a path that often leads to frustration and extended downtime.
This is the problem that Kubernetes network observability was built to solve. By deploying distributed observers, these cloud-native solutions take traditional flow entries and enrich them with Kubernetes metadata and labels, giving Kubernetes users insight into the inner workings of their clusters.
This blog post aims to give you a rundown of the leading solutions in the CNCF ecosystem, and compare how they track a packet’s journey across your cluster.
Feature Comparison Matrix
Before diving into the specifics, let’s look at how these four major players (Calico, Cilium, Microsoft Retina, and Netobserv) stack up against one another.
| Feature | Calico Observability | Cilium Observability | Microsoft Retina | Netobserv (Red Hat) |
|---|---|---|---|---|
| CNI Agnostic | No (Requires Calico) | No (Requires Cilium) | Yes | Yes |
| UI Experience | Calico Whisker / Grafana | Hubble UI / Grafana | Grafana / Azure Monitor / Hubble UI* | OpenShift Plugin / Grafana |
| Installation | Easy (Helm/Operator) | Easy (CLI/Helm) | Easy (Helm) | Moderate (Operator)** |
| Monitoring Backend | eBPF (Linux) / HNS (Win) | eBPF (Linux) | eBPF (Linux) / HNS (Win) | eBPF (Linux) |
| Flow Type | Flow Aggregation | Individual Flows | Individual Flows + Metrics | Flow Aggregation (IPFIX) |
| Enrichment | K8s Metadata (Pod/NS) | K8s Metadata + Identity *** | K8s Metadata | K8s Metadata + Owner Ref |
| Observability Domain | Cluster and Host | Cluster based | Cluster and Host | Cluster and Host |
| Prometheus Export | Yes | Yes | Yes | Yes |
| Policy Insights | Full Policy Hierarchy | Verdict (Allow/Deny) | Verdict + Drop Reason | Verdict + Policy Name |
* Microsoft Retina has a couple of modes, one of these modes offers a smaller set of features but allows you to use Hubble as its UI.
** The Netobserv installation experience can differ depending on your cluster; in a non-OpenShift cluster you might hit some bumps while installing.
*** Identity is an internal Cilium value that is assigned to cluster resources.
Understanding Flow Types
Before focusing on specific observability solutions, let’s take a look at flow types. Any network observability application is made up of two parts: a collector that gathers information related to networking activity in the environment, and an exporter that emits this information via a pull or push model.
These flows can be stored in one of two formats: individual or aggregated.
Aggregated Flows
Aggregated Flows group similar packets together over a window of time (e.g., “50 packets went from Pod A to Pod B in the last 10 seconds”).
- Pros : Significantly lower storage costs; better for long-term trend analysis and capacity planning.
- Cons : You lose the precise timestamp of a single packet drop; smooths out “micro-bursts.”
Individual Flows
Individual Flows treat every connection attempt or significant network event as a discrete log entry.
- Pros: You can see exactly which specific request failed at what time.
- Cons: Can generate massive amounts of data in high-traffic clusters; usually requires a short retention period (e.g., rolling buffer).
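The trade-off between the two formats can be shown with a toy sketch. This is not any tool’s actual schema; the pod names, fields, and the 10-second window are purely illustrative:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowEvent:
    ts: float     # seconds since capture start
    src: str      # source pod (illustrative name)
    dst: str      # destination pod
    verdict: str  # "forwarded" or "dropped"

# Individual flows: every event is a discrete record.
events = [
    FlowEvent(0.1, "pod-a", "pod-b", "forwarded"),
    FlowEvent(0.4, "pod-a", "pod-b", "forwarded"),
    FlowEvent(2.9, "pod-a", "pod-b", "dropped"),
    FlowEvent(5.0, "pod-c", "pod-b", "forwarded"),
]

def aggregate(events, window=10.0):
    """Collapse events into per-window (src, dst, verdict) counters."""
    buckets = defaultdict(int)
    for e in events:
        key = (int(e.ts // window), e.src, e.dst, e.verdict)
        buckets[key] += 1
    return dict(buckets)

# Four individual records collapse into three aggregated rows; the
# exact timestamp of the single drop (t=2.9s) is no longer recoverable.
print(aggregate(events))
```

Notice how aggregation cuts storage (four records become three, and the ratio improves dramatically at real traffic volumes) at the cost of per-event precision, which is exactly the pros/cons trade-off listed above.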
Now that we have established the foundational flow types and data collection methods, let’s see how the leading tools in the ecosystem apply these concepts to real-world cluster monitoring.
Calico Observability Stack
Calico is a modern unified security platform designed not just for Kubernetes, but also for Virtual Machines, OpenStack and bare metal systems.
How it works
Observability in Calico is deeply integrated into its core components. On Linux, Calico’s eBPF programs hook into the inner workings of the kernel, extracting deep network telemetry directly at the source. Calico observability also works on Windows, where the HNS-based Windows data plane gathers the information for each flow. All of this data is streamed over a gRPC channel to Calico Whisker for visualization.
To see how this context-driven approach differs from legacy monitoring, check out our deep dive on why context matters in Kubernetes networking.
Key Data Types
Calico provides deep visibility into the decision-making process of the network:
- Direction Aware: Calico intelligently categorizes each flow as reported by the sender or the receiver, which is invaluable when troubleshooting or writing policies.
- Enriched Logs: Each flow provides a list of aggregate information enriched with Kubernetes metadata (Namespace, Owner, Resource).
- Policy Evaluation: In addition to the final verdict and policy name, Calico by default records every policy that matched a flow, enabling policy performance tuning and easier troubleshooting.
- L7 Visibility: Optionally, Calico Ingress Gateway can report application-layer data (like HTTP methods and URLs) for deeper debugging.
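To make the policy evaluation point concrete, here is a hypothetical Calico v3 NetworkPolicy; the namespace, labels, and port are illustrative, not from any real cluster. Flows matching it would carry its name and an Allow verdict in the flow logs described above:

```yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api   # appears in the flow log's policy field
  namespace: shop               # illustrative namespace
spec:
  selector: app == 'api'
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: app == 'frontend'
      destination:
        ports: [8080]
```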
Cilium Observability Stack
Cilium is an open-source, cloud-native solution for providing, securing, and observing network connectivity between workloads. Cilium networking is established via eBPF programs, and its observability data is funneled to Hubble via a gRPC channel.
How it Works
Cilium leverages eBPF programs to tap into the kernel’s networking stack. It captures network events directly from the kernel as they happen and streams them in real time via a gRPC channel. (For a broader look at how these architectures compare, see our guide on the key differences between Calico and Cilium.) Hubble taps into the Cilium gRPC channel and visualizes each flow.
Key Data Types
Cilium uses an internal concept called identities to distinguish resources within clusters:
- Flow Verdicts: Cilium tracks the verdict of every packet (forwarded, dropped, or audited), mapped directly to the Cilium Network Policies enforcing it.
- Enriched Logs: Each flow provides a list of information enriched with Kubernetes metadata (Namespace, Owner, Resource).
- L7 Visibility: Optionally, Hubble has integrations that can be enabled to provide L7 visibility. However, since this requires traffic to be redirected to an embedded user-space Envoy proxy for parsing, it introduces additional latency.
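As an illustration of how verdicts tie back to policy, here is a hypothetical CiliumNetworkPolicy; the namespace and labels are made up for this sketch. Traffic to the selected endpoints from any other identity would surface in Hubble with a DROPPED verdict:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: shop        # illustrative namespace
spec:
  endpointSelector:
    matchLabels:
      app: api           # Cilium derives an identity from these labels
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
```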
Microsoft Retina
Microsoft Retina is a cloud-agnostic observability platform that leverages the power of eBPF to provide deep, actionable insights into network traffic. Open-sourced on GitHub, it was specifically designed to address the challenges of monitoring modern Kubernetes environments, which often span multiple clouds and hybrid deployments.
How it Works
The defining feature of Retina is its CNI Agnostic design. Whether you are running Flannel, Calico, Cilium, or Azure CNI, Retina can be used to start collecting data from your environment. By using eBPF programs, Retina offers a transparent, low-overhead window into the kernel’s networking stack without requiring any modifications to your applications.
Key Data Types
Retina focuses heavily on actionable metrics for Site Reliability Engineers:
- Enriched Logs: Correlates raw IPs with Kubernetes metadata (Namespace, Owner, Resource).
- Drop Reasons: Insight into why a packet was dropped (e.g., IPTABLES_DROP, CONN_TRACK_ERR). Because Retina is CNI-agnostic, its policy-related drop insights are less detailed than those of the CNI-integrated tools.
- DNS Latency: Specialized metrics to track DNS resolution times and timeout occurrences.
- TCP State: Metrics regarding TCP retransmissions and connection resets, which are vital for debugging latency issues.
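Since Retina exposes this data as Prometheus metrics, the insights above are typically consumed through PromQL. The exact metric and label names depend on the Retina version and enabled plugins, so treat these queries as illustrative sketches rather than copy-paste material:

```promql
# Packet drops per node, grouped by drop reason
# (metric/label names are assumptions; check your Retina version's docs)
sum by (instance, reason) (rate(networkobservability_drop_count[5m]))

# TCP retransmissions per node, a common signal for latency debugging
sum by (instance) (rate(networkobservability_tcp_retransmission_count[5m]))
```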
Netobserv (Red Hat)
Netobserv (Network Observability Operator) is an OpenShift-native (but Kubernetes compatible) solution that brings flow-based observability to the cluster. It leverages an eBPF agent to generate flows and a flow collector pipeline (often using Loki) to store and query them.
How it Works
Netobserv is designed to be a “plug-and-play” flow collection system. It deploys an eBPF agent to every node to sample traffic and export it in the IPFIX standard or JSON, which the Flowlogs-pipeline component then enriches and forwards. It integrates tightly with the OpenShift console but can be visualized via standard Grafana dashboards in vanilla Kubernetes.
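Configuration happens through a FlowCollector custom resource. The following is a minimal sketch assuming the v1beta2 API; field names and defaults should be verified against your installed operator version:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster            # the operator expects a single CR named "cluster"
spec:
  agent:
    type: eBPF
    ebpf:
      sampling: 50         # sample 1 of every 50 packets (illustrative)
  loki:
    enable: true           # store flows in Loki for querying
```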
Key Data Types
Netobserv provides a “NetFlow-like” experience for Kubernetes:
- Enriched Logs: Correlates raw IPs with Kubernetes metadata (Namespace, Pod, Labels, Resource).
- Connection Tracking: It visualizes traffic as conversations, calculating Round Trip Time (RTT) to help identify network latency versus application processing latency.
- Interface Metrics: Visibility into the specific network interfaces (veth pairs, physical NICs) where traffic is ingressing or egressing.
Conclusion
While we have highlighted multiple choices for Kubernetes network observability, Calico Whisker, with its unique design, is our recommendation. All you need to consider are the 3 Rules of Kubernetes Network Observability.
The 3 Rules of Kubernetes Network Observability
1. The Native Stack Rule
If you want to be in control of your cluster (on-premises/self-managed), use a custom CNI that gives you the most control. Calico, for example, gives you the most control over your cluster networking and security capabilities in this scenario. Deploying Calico Whisker through the Tigera Operator gives you observability out of the box, and you can go further by using other Calico capabilities to replace third-party projects and make Calico your unified network and security platform. This allows you to move beyond flat networks and implement a robust security hierarchy across your entire infrastructure.
2. The Cloud Pragmatist Rule
If you are using a cloud-provider setup (managed cluster) with the default CNI (AWS VPC CNI, Azure CNI, etc.), you can still take advantage of other CNI features. In such a setup the default cloud provider CNI will provide the networking foundation and Calico provides the more advanced features such as Observability, Gateway API, WireGuard, and mTLS, allowing you to have the best of both worlds.
3. The Red Hat Rule
In an OpenShift environment you could choose any of the previous rules depending on your networking choices at the time of cluster creation.
Keep in mind that NetObserv and Microsoft Retina can be installed on any cluster and are not locked to any CNI.
Regardless of the tool you pick, moving away from running tcpdump on individual workloads and nodes, toward continuous observability, is the only way to maintain a secure and reliable distributed environment.
Take the Next Step
Ready to master your cluster visibility? Explore these resources to learn more about modern network observability.
See it in Action: Try Calico Whisker
Deep Dive: Read ‘Why Context Matters in Kubernetes Networking’
Compare More: Calico vs. Cilium
The post Kubernetes Network Observability: Comparing Calico, Cilium, Retina, and Netobserv appeared first on Tigera - Creator of Calico.