Network policies are not the right abstraction (for developers)

Tomer Greenwald for Otterize • Originally published at otterize.com

You’re a platform engineer working on building a Kubernetes-based platform that achieves zero-trust between pods. Developers have to be able to get work done quickly, which means you’re putting a high priority on developer experience alongside zero-trust.

Are Kubernetes network policies good enough? I think there are multiple flaws that prevent network policies, on their own, from being an effective solution for a real-world use case.

Before pointing out the problems, I’d like to walk you through what I mean when I say zero-trust, as well as a couple of details about how network policies work.

Zero-trust means preventing access from unidentified or unauthorized sources

Network policies can prevent incoming traffic to a destination (a server), or prevent outgoing traffic from a source (a client).

Zero trust inherently means you don’t trust any of the sources just because they’re in your network perimeter, so the only blocking relevant for achieving zero-trust is blocking incoming traffic (“ingress”) from unauthorized sources.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-policy
spec:
  podSelector: {} # required field: selects the pods this policy protects (empty = all pods in the namespace)
  ingress:
    - {} # ingress rules
  policyTypes:
    - Ingress # policy refers to ingress, but it could also have egress

Let’s talk about network policies

They’re namespaced resources and refer to pods by label

Network policies are namespaced resources, and refer to pods by label. Logically, they must live alongside the pods they apply to – in our case, since we’re using ingress policies, that means alongside the servers they protect.

They don’t refer directly to specific pods, of course, because pods are ephemeral, but they refer logically to pods by label. This is common in Kubernetes, but introduces problems for network policies. Keep this detail in mind as we’ll get back to it later.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: protect-backend
spec:
  podSelector:
    matchLabels:
      app: my-backend # policy will apply to pods labeled app=my-backend, in the same namespace as the policy
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: my-client # and allow ingress access from pods labeled app=my-client
  policyTypes:
    - Ingress

They hold information about multiple sets of pods

The contents of the network policies are effectively an allowlist specifying which other pods can access the pods which the policy protects. But there’s one big problem there: while the network policy must live with the protected pods, and is updated as part of the protected pods’ lifecycle, it won’t naturally be updated as part of the lifecycle of the client pods accessing the protected pods.
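To make that concrete, here's a rough sketch of the client side of that relationship (names reused from the example above; the Deployment details are made up): the client team's Deployment must carry the exact app: my-client label that the server-owned policy refers to, even though that policy lives with the server.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-client # owned and deployed by the client team
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-client
  template:
    metadata:
      labels:
        app: my-client # must match the label the server-owned network policy allows
    spec:
      containers:
        - name: my-client
          image: my-client:latest # placeholder image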

Friction when using network policies

Enabling access between two pods

Whenever a developer working on a client pod needs access to a server, they need to get their client pod added to the server’s network policy so it’s allowed to call the server. The developer often cannot manage that network policy themselves, as it usually exists in a namespace they are not responsible for and is deployed with a service they don’t own.

The result is that the client developer is dependent on the server team for access that should have been self-service, and the server team is now distracted by enabling a client developer even though nothing has really changed from their point of view – a new client is connecting to their server, that’s it! There should not be server-side changes required simply to enable another client.
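For example, if a new service with pods labeled app=reporting-client (a made-up name) needs to call my-backend, someone has to edit the server-owned policy and extend its allowlist:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: protect-backend
spec:
  podSelector:
    matchLabels:
      app: my-backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: my-client # existing client
        - podSelector:
            matchLabels:
              app: reporting-client # new client, added by (or with the help of) the server team
  policyTypes:
    - Ingress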

What if you need to roll back the server?

There are also a myriad of second-order problems, which the team at Monzo learned about while solving this problem (it’s a super well-written blog post; I recommend giving it a read). For example, rolling back a server could affect whether clients could connect, since the rollback also rolled back its network policy.

When a server is rolled back due to an unrelated problem, its network policy may also be rolled back if it is part of the same deployment (e.g. part of the same Helm chart), and break the clients that relied on that version of the network policy! It’s a reflection of the unhealthy dependency between the client and server teams: while it would make sense that a server-side change that breaks functionality would affect the client, it does not make sense that an unrelated and functionally-non-breaking rollback of the server would affect the client.

How do you know the policy is correct?

Because network policies refer to pod labels, they are difficult to validate statically. Pods are generally not created directly, but instead created by other resources, such as Deployments.

Can you tell whether a network policy will allow access for your service without deploying and trying it out? In fact, just asking the question “which services have effective access to service A?” becomes super hard.

Developers don’t think of services as pod labels; they tend to use a developer-friendly name instead. For example, checkoutservice is a friendly name, whereas checkoutservice-dj3847-e120 is not. The friendly name may in fact be the value of some label, but there’s no standard way to discover which one.

So how do you take the concept of a service, with its developer-friendly name, and map it to the labels referenced by the network policies and by, say, its Deployment, so you can check whether it will have access once its new labels are deployed? You could do that manually, as a developer in a single team who understands all the moving parts. But this is very error-prone, and of course it isn’t something a platform engineer could roll out as a solution: as a platform engineer, you’d need something automated you could make available to every developer in your organization.
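To illustrate the gap, here's a made-up Deployment for checkoutservice: the name developers know it by and the label its pods actually carry are two different things, and only the label matters to network policies.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkoutservice # the friendly name developers use
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkoutservice-dj3847-e120 # the label network policies would actually match on
  template:
    metadata:
      labels:
        app: checkoutservice-dj3847-e120
    spec:
      containers:
        - name: checkoutservice
          image: checkoutservice:latest # placeholder image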

This is another problem that the team at Monzo worked hard on; the blog post mentioned above is well worth a read, as it covers this and other aspects of the problem.

How do you refer to pods within network policies?

Earlier, I mentioned that network policies don’t refer to pods directly, as they’re ephemeral, but refer to them by labels. This is common practice in Kubernetes. However, network policies are unique in that they use labels to refer to two (or more) sets of pods that are often owned by different teams in the organization.

This presents unique challenges because, for the network policy to function, the labels referenced by the policy and the labels attached to the pods must be kept in sync, with destructive consequences if you fail to do so – communication will be blocked! The pod labels for the client pods are managed by the client team, while the network policy that refers to them is managed by the server team, so you can see where things can get out of sync.
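For example, if the client team decides to rename its labels (say, adopting the app.kubernetes.io/name convention), nothing forces the server-owned policy to follow, and the client's traffic is silently blocked. A fragment of the client's Deployment after such a rename might look like this:

  # client team's updated pod template (fragment of the client's Deployment)
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-client # renamed by the client team
# ...while the server-owned policy still only allows pods labeled app: my-client,
# so traffic from this client is now blocked.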

Network policies are effectively owned by multiple teams

This means that you need coordination between the teams, not only when the network policy is first deployed, but also over time as clients and servers evolve.

What if you have a network policy that allows multiple clients to connect to one server? Now you’ve got the server team coordinating with two or more client teams.

For each change a client team proposes, the server team needs to not only change network policy rules referring to that client, but also make sure they don’t inadvertently affect other clients. This can be a cognitively difficult task, as the server team members normally don’t refer to pod labels belonging to other teams, so it may not immediately be clear which labels belong to which team.

This reduces the ability of teams to set internal standards and work independently, and slows down development. If you don’t get this right, there can be painful points in the development cycle where changes are fragile and their pace slows to a crawl. The pain may lead to bikeshedding and inter-team politics, as teams argue over how things should be done, and to growing frustration as client deployments are delayed because server network policies haven’t been updated yet.

Is everyone in your organization proficient with how network policies work?

In many organizations, this is not the case. Network policies are already error-prone, with destructive consequences for even small mistakes. Asking every developer whose service calls another service to be familiar with network policies may be a tall order, with potential for failed deployments or failed calls that are hard to debug.

What would a good abstraction look like?

A good solution for zero trust should be optimized for that specific outcome, whereas network policies are a bit of a Swiss Army knife: they aren’t just for pod-to-pod traffic, so they’re not optimized for this use case.

The following 3 attributes are key for a good zero-trust abstraction that actually gets adopted:

  1. Single team ownership: Each resource should only be managed by one team so that client teams can get access independently, and server teams don’t need to be involved if no changes are required on their end.
  2. Static analysis should be possible: It should be possible to statically check if a service will have access without first deploying it.
  3. Universal service identities: Services should be referred to using a standard name that is close to or identical to their developer-friendly names, rather than pod labels.

Enter client intents

At Otterize, we believe that client intents satisfy these requirements. Let me explain briefly what they are, and then examine whether they satisfy the above attributes.

A client intents file is simply a list of calls to servers which a given client intends to make. Coupled with a mechanism for resolving service names, the list of client intents can be translated to different authorization mechanisms, such as network policies.

In other words, developers declare what their service intends to access, and that can then be converted to a network policy and the associated set of pod labels.

Here’s an example of a client intents file (as a Kubernetes custom resource YAML) for a service named client calling another service named server:

apiVersion: k8s.otterize.com/v1alpha2
kind: ClientIntents
metadata:
  name: client-intents
spec:
  service:
    name: client
  calls:
    - name: server

Let’s see if this is a good abstraction

Now let’s go back and review our criteria for a good zero-trust abstraction:

Does a team own all of, and only, the resources it should be managing?

Client intents files are deployed and managed together with the client, so only the client team owns them. You would deploy the ClientIntents for this client along with the client, e.g. alongside its Deployment resource.

Can access be checked statically?

Since services are first-class identities in client intents (rather than indirectly represented by pod labels), it is trivially possible to query which clients have access to a server, and whether a specific client has access to a server. As an added bonus, all the information for a single client is collected in a single resource in one namespace, instead of being split up across multiple namespaces where the servers are deployed.

Are service identities universal and natural?

Service names are resolved in the same manner across the entire organization, making it easy to reason about which service a given name refers to.

How would a Kubernetes operator that manages these intents work?

When intents are created for a client, the intents operator should automatically create, update and delete network policies, and automatically label client and server pods, to reflect precisely the client-to-server calls declared in client intents files. A single network policy is created per server, and pod labels are dynamically updated for clients when their intents update.
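As a rough sketch of what the operator might generate for the intents above (the actual resource names and label keys the Otterize operator uses are an implementation detail; the ones below are purely illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: access-to-server # illustrative name: one policy protecting the "server" service
spec:
  podSelector:
    matchLabels:
      example.com/access-server: "true" # illustrative label the operator would put on the server pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              example.com/access-allowed-to-server: "true" # illustrative label put on client pods that declared intents
  policyTypes:
    - Ingress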

Service names are resolved by recursively looking up a pod’s owner until the original owner is found, usually a Deployment, StatefulSet, or another such resource. The name of that resource is used as the service name, unless the pod has a service-name annotation, in which case the value of that annotation is used instead.
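To illustrate the resolution (the fields below are abbreviated, and the override annotation key is deliberately shown as a placeholder; see the Otterize docs for the exact key), a pod's owner chain might look like this:

# Pod -> ReplicaSet -> Deployment, so the resolved service name is "client"
apiVersion: v1
kind: Pod
metadata:
  name: client-7d4b9c6f8d-x2x7q # hypothetical generated pod name
  # a service-name annotation on this pod (exact key per the Otterize docs) would override the resolved name
  ownerReferences: # uid and other required fields omitted for brevity
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: client-7d4b9c6f8d
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: client-7d4b9c6f8d
  ownerReferences: # uid and other required fields omitted for brevity
    - apiVersion: apps/v1
      kind: Deployment
      name: client # the original owner, so the resolved service name is "client"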

Try out the intents operator!

It won’t surprise you that we in fact built such an open source implementation, and it’s called the Otterize intents operator. Give it a shot and see if it makes managing network policies easier for you 🙂
