<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Otterize</title>
    <description>The latest articles on DEV Community by Otterize (@otterize).</description>
    <link>https://dev.to/otterize</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7063%2Fa65d720f-26c5-4d4f-bec3-cdf9d07aa15a.jpeg</url>
      <title>DEV Community: Otterize</title>
      <link>https://dev.to/otterize</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/otterize"/>
    <language>en</language>
    <item>
      <title>Otterize launches open-source, declarative IAM permissions for workloads on AWS EKS clusters</title>
      <dc:creator>Tomer Greenwald</dc:creator>
      <pubDate>Wed, 10 Jan 2024 15:41:50 +0000</pubDate>
      <link>https://dev.to/otterize/otterize-launches-open-source-declarative-iam-permissions-for-workloads-on-aws-eks-clusters-521a</link>
      <guid>https://dev.to/otterize/otterize-launches-open-source-declarative-iam-permissions-for-workloads-on-aws-eks-clusters-521a</guid>
      <description>&lt;p&gt;Know that pesky workflow of creating AWS IAM roles and policies, and binding them to Kubernetes ServiceAccounts, just for your pods to be able to access resources on AWS?&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html" rel="noopener noreferrer"&gt;AWS IRSA&lt;/a&gt;, supported on EKS since version 1.14, it is possible to bind service accounts to AWS IAM roles. But that leaves a lot to be desired: managing these resources and their lifecycle outside of Kubernetes can be a pain.&lt;/p&gt;

&lt;p&gt;You’ve gotta create an AWS role, establish a trust relationship with a Kubernetes ServiceAccount through an OIDC provider, grab the role’s ARN and annotate the ServiceAccount, then &lt;em&gt;finally&lt;/em&gt; create an IAM policy. Need to have multiple deployments, versions, upgrades, rollbacks? Better get to work.&lt;/p&gt;

&lt;p&gt;No more! The open-source &lt;a href="https://github.com/otterize/intents-operator" rel="noopener noreferrer"&gt;intents-operator&lt;/a&gt; and &lt;a href="https://github.com/otterize/credentials-operator" rel="noopener noreferrer"&gt;credentials-operator&lt;/a&gt; enable you to achieve the same, except without all that work: do it all from Kubernetes, declaratively, and just-in-time, through the magic of IBAC (intent-based access control).&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Label a pod, requesting an AWS IAM role to be created and bound to the ServiceAccount.&lt;/p&gt;

&lt;p&gt;Add this label to the pod:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"credentials-operator.otterize.com/create-aws-role": "true"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This creates a matching IAM role for this specific workload, in this namespace and cluster, not shared with any other workload, with the trust relationship set up for your cluster’s OIDC provider.&lt;/p&gt;
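
&lt;p&gt;For example, in a Deployment’s pod template, this might look like the following sketch (the workload name is illustrative; only the label is significant):&lt;/p&gt;

```yaml
# Hypothetical Deployment snippet; the credentials-operator reacts to the label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload
spec:
  template:
    metadata:
      labels:
        credentials-operator.otterize.com/create-aws-role: "true"
```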

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Declare your workload’s ClientIntents, specifying desired permissions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s.otterize.com/v1alpha3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClientIntents&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload&lt;/span&gt;
  &lt;span class="na"&gt;calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:s3:::otterize-tutorial-bucket-*/*&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
      &lt;span class="na"&gt;awsActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:PutObject'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an IAM policy and associates it with the previously created role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Done! Your workload can access AWS resources. AWS will be kept in sync with the cluster state, as you update your Pods and ClientIntents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s4ittmoe3q4q0nvrw8x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s4ittmoe3q4q0nvrw8x.jpg" alt="Screenshot of AWS IAM role with attached policies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wanna try it out for real? &lt;a href="https://docs.otterize.com/quickstart/access-control/aws-iam-eks" rel="noopener noreferrer"&gt;Check out the AWS IAM tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Wondering what ClientIntents are? &lt;a href="https://docs.otterize.com/intent-based-access-control" rel="noopener noreferrer"&gt;Read more about IBAC, intent-based access control&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I also map my cluster’s traffic and AWS API calls?
&lt;/h3&gt;

&lt;p&gt;Yep! When you deploy Otterize, you get a map of your cluster’s traffic, with zero configuration, through the open-source &lt;a href="https://github.com/otterize/network-mapper" rel="noopener noreferrer"&gt;network-mapper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Coming soon: capture AWS API calls for pods in your cluster, automatically generating the required least-privilege permissions, or ClientIntents, for each workload. Zero-friction in development, zero-trust in production. It’s coming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbskga3v2k6wtrrwy0jw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbskga3v2k6wtrrwy0jw.jpg" alt="Screenshot of access graph"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Who moved my error codes? Adding error types to your GoLang GraphQL Server</title>
      <dc:creator>Tomer Greenwald</dc:creator>
      <pubDate>Sun, 25 Jun 2023 10:33:56 +0000</pubDate>
      <link>https://dev.to/otterize/who-moved-my-error-codes-adding-error-types-to-your-golang-graphql-server-50p9</link>
      <guid>https://dev.to/otterize/who-moved-my-error-codes-adding-error-types-to-your-golang-graphql-server-50p9</guid>
      <description>&lt;p&gt;A few months ago, we at Otterize went on a journey to migrate many of our APIs, including the ones used between our back-end services and the ones used by our web app, to &lt;a href="https://graphql.org/"&gt;GraphQL&lt;/a&gt;. While we enjoyed many GraphQL features, we faced along the way a few interesting challenges which required creative solutions.&lt;/p&gt;

&lt;p&gt;Personally, I find our adventure with GraphQL’s errors and the error handling mechanism a fascinating one. Considering GraphQL’s popularity, I didn’t expect the GraphQL standard to miss this one very fundamental thing…&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are the error codes?!
&lt;/h2&gt;

&lt;p&gt;What happens when your code encounters a problem making an API call? Coming from &lt;a href="https://nordicapis.com/web-service-what-is-a-rest-api-and-how-does-it-work/"&gt;REST&lt;/a&gt;, we’re all used to HTTP status codes as the standard way to identify errors and act accordingly. For example, when a service calls another service and receives an error, it may handle &lt;code&gt;404 Not Found&lt;/code&gt; by creating the missing resource (if appropriate), while a &lt;code&gt;400 Bad Request&lt;/code&gt; aborts execution and returns a client-appropriate error message.&lt;/p&gt;

&lt;p&gt;Now, let’s look at what errors look like in GraphQL. It’s a pretty basic error structure, consisting of two parts. The first is &lt;code&gt;"message"&lt;/code&gt;, a textual error message designed for human readers. The second is &lt;code&gt;"path"&lt;/code&gt;, which describes the path of the field in the query that returned the error. Can you see what is missing?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user Tobi was not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"getUser"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An example of a simple GraphQL error&lt;/p&gt;

&lt;p&gt;There are no error codes in GraphQL. It might not disturb you much as a human reading the error, but the absence of error codes makes it really hard to develop client-side code that identifies and handles errors received from the server.&lt;/p&gt;

&lt;p&gt;We started out by identifying errors by searching for specific words in the returned error message, but it was clear this was not a permanent solution: any small change to the error message on the server side could break type identification on the client side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Masking unexpected errors - obvious in HTTP, not so much in GraphQL
&lt;/h3&gt;

&lt;p&gt;Probably the most annoying HTTP error code is &lt;code&gt;500 Internal Server Error&lt;/code&gt;, as it doesn’t really give any useful information. But this is the one error code that matters the most regarding your application’s information security — in other words, the lack of information is intentional. HTTP frameworks mask any unexpected error and return HTTP &lt;code&gt;500 Internal Server Error&lt;/code&gt; instead, in the process also masking any sensitive information that might have been part of the error message.&lt;/p&gt;

&lt;p&gt;GraphQL’s spec, as it turns out, does not specify how servers should handle internal errors at all, leaving it entirely to the framework authors. Take, for example, our Go GraphQL framework of choice, &lt;a href="https://github.com/99designs/gqlgen"&gt;gqlgen&lt;/a&gt;: it makes no distinction between intentional and unexpected errors, returning all of them as-is to the client in the error message. Internal errors, which often contain sensitive information like network details and internal URIs, can easily leak to clients unless the programmer catches them manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Post &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;http://credential-factory:18081/internal/creds/env&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: dial tcp credential-factory:18081: connect: connection refused"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"createEnvironment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  A simulation of an unhandled internal error leaked through the GraphQL server.
&lt;/p&gt;

&lt;p&gt;And gqlgen is not alone in this. We found several more GraphQL frameworks that don’t take it upon themselves to address this problem. Widely used GraphQL server implementations, such as &lt;a href="https://github.com/graphql-go/graphql"&gt;graphql-go/graphql&lt;/a&gt; and Python’s &lt;a href="https://graphene-python.org/"&gt;graphene&lt;/a&gt;, share the same gap: by default, they expose the messages of unexpected errors.&lt;/p&gt;

&lt;p&gt;With these two points in mind, it was clear that to complete our move to GraphQL, we needed to find some way to add error types. For one thing, we would have a reliable way to identify errors in clients’ code. And for another, we could catch unexpected errors on the server side and hide their message from clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can we add error types to GraphQL?
&lt;/h2&gt;

&lt;p&gt;We started researching possible solutions and encountered various ways people had solved the same problem, but many of those seemed inconvenient, at least for us. Then we read the &lt;a href="https://spec.graphql.org/October2021/#sec-Errors.Error-result-format"&gt;GraphQL errors spec&lt;/a&gt; and learned that errors have an optional field called &lt;code&gt;"extensions"&lt;/code&gt;: an unstructured key-value map that can be used to add any additional information to the error. In &lt;a href="https://spec.graphql.org/October2021/#example-8b658"&gt;one of their examples&lt;/a&gt;, they even use a key called &lt;code&gt;"code"&lt;/code&gt; that contains what looks like an error code, but we didn’t find any further information about it. (Later, I figured out it was taken from Apollo; see below.)&lt;/p&gt;

&lt;p&gt;Knowing this, we came up with a plan of adding an &lt;code&gt;"errorType"&lt;/code&gt; key to the error’s &lt;code&gt;"extensions"&lt;/code&gt; map, with the error code as the value. For example, here is the same error with the new &lt;code&gt;"extensions"&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User Tobi not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"getUser"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"extensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NotFound"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Digging into gqlgen’s sources, we discovered that the gqlgen GraphQL server uses the extension key &lt;code&gt;"code"&lt;/code&gt; to report the error code of parsing and schema validation errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cannot query field &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;userDetails&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; on type &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Query&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"locations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"extensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GRAPHQL_VALIDATION_FAILED"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  Example of a schema validation error. Note the “code” key and the error code
  under “extensions”, added by the gqlgen GraphQL Server itself.
&lt;/p&gt;

&lt;p&gt;Unfortunately, there is no built-in way to extend gqlgen’s error codes with additional ones. We considered using the same &lt;code&gt;"code"&lt;/code&gt; key for our custom error codes, but eventually, we preferred sticking to our separate &lt;code&gt;"errorType"&lt;/code&gt; key to avoid potential future collisions with gqlgen’s error handling mechanism.&lt;/p&gt;

&lt;p&gt;While working on this blog post, I learned that &lt;a href="https://github.com/apollographql/apollo-server"&gt;Apollo Server&lt;/a&gt;, the most popular GraphQL server for TypeScript, uses a similar method for adding error codes to GraphQL. It even lets you &lt;a href="https://www.apollographql.com/docs/apollo-server/data/errors/#custom-errors"&gt;add custom errors&lt;/a&gt;. Hopefully, someday other GraphQL server projects will follow. Until then, we’ve got a strong indication we took the right approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Go implementation for typed GraphQL errors
&lt;/h2&gt;

&lt;p&gt;Equipped with this knowledge and our plan, we were ready to implement our error-typing solution. The rest of this post describes how we put that plan into practice in our application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining our application’s standard error codes
&lt;/h3&gt;

&lt;p&gt;First, we listed all the error codes we would like to have. We started with the HTTP error codes we used to work with in REST and placed them in a GraphQL enum. Putting the error codes in the schema is not mandatory, but it makes it easier to refer to the same error types on both the server and client sides.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ErrorType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;InternalServerError&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;NotFound&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;BadRequest&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;Forbidden&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;Conflict&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  The error codes schema. We put it in a dedicated schema file called
  "errors.graphql".
&lt;/p&gt;

&lt;p&gt;After running &lt;code&gt;go generate&lt;/code&gt;, gqlgen generated the &lt;code&gt;model&lt;/code&gt; package with the variables from the error codes enum. The next step was to create a new &lt;code&gt;typedError&lt;/code&gt; struct, which pairs an error with the error type that should be returned to the client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;typederrors&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt;       &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;errorType&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorType&lt;/span&gt;  &lt;span class="c"&gt;// error types are auto-generated from the schema&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorType&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;errorType&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// We have such a function for each of the types&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NotFound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageToUserFormat&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageToUserFormat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;errorType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorTypeNotFound&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;InternalServerError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageToUserFormat&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageToUserFormat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;errorType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorTypeInternalServerError&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we searched our server codebase for errors and replaced native Go errors such as &lt;code&gt;fmt.Errorf("user %s not found", userName)&lt;/code&gt; with the appropriate typed error, in this case &lt;code&gt;typederrors.NotFound("user %s not found", userName)&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating with the GraphQL server
&lt;/h3&gt;

&lt;p&gt;Next, we needed to make our GraphQL server handle the typed errors returned by our application’s GraphQL resolvers, extract the error codes, and attach them to the extensions map. The way to do that using gqlgen is to implement an &lt;a href="https://gqlgen.com/reference/errors/#the-error-presenter"&gt;ErrorPresenter&lt;/a&gt;, a hook function that lets you modify the error before it is sent to the client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TypedError&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorType&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// presentTypedError is a helper function that converts a TypedError to *gqlerror.Error&lt;/span&gt;
&lt;span class="c"&gt;// and adds the error type to the extensions field&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;presentTypedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typedErr&lt;/span&gt; &lt;span class="n"&gt;TypedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gqlerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;presentedError&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;graphql&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typedErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;presentedError&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Extensions&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;presentedError&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Extensions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;presentedError&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Extensions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;typedErr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;presentedError&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// GqlErrorPresenter is a hook function for the gqlgen's GraphQL server, that handle&lt;/span&gt;
&lt;span class="c"&gt;// TypedErrors and adds the error type to the extensions field.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GqlErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gqlerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt; &lt;span class="n"&gt;TypedError&lt;/span&gt;
    &lt;span class="n"&gt;isTypedError&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;As&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isTypedError&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;presentTypedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;graphql&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  The &lt;code&gt;GqlErrorPresenter&lt;/code&gt; function is our implementation of the
  &lt;code&gt;ErrorPresenter&lt;/code&gt; hook.
&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;/// ...&lt;/span&gt;
  &lt;span class="c"&gt;// Create a GraphQL server and make it use our error presenter&lt;/span&gt;
  &lt;span class="n"&gt;srv&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewDefaultServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewExecutableSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="n"&gt;srv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GqlErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c"&gt;/// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  Hooking our new error presenter into the GraphQL server.
&lt;/p&gt;

&lt;p&gt;Once our new &lt;code&gt;ErrorPresenter&lt;/code&gt; is bound to the GraphQL server, typed errors raised by resolvers are processed and their type is exposed to the client under the &lt;code&gt;"errorType"&lt;/code&gt; extensions field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User Tobi not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"updateUser"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"extensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NotFound"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  The GraphQL error reported when the server returns a typed error.
&lt;/p&gt;

&lt;h3&gt;
  
  
  Masking non-typed errors with InternalServerError
&lt;/h3&gt;

&lt;p&gt;To prevent leaking sensitive information buried inside error messages, we then adopted the error-handling behavior of HTTP servers: instead of returning non-typed errors to the client, we log them and return the typed &lt;code&gt;InternalServerError&lt;/code&gt; instead. With typed errors already in place, this only requires a small change to the &lt;code&gt;ErrorPresenter&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GqlErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gqlerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt; &lt;span class="n"&gt;TypedError&lt;/span&gt;
  &lt;span class="n"&gt;isTypedError&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;As&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isTypedError&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;presentTypedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typedError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c"&gt;// New code for masking sensitive error messages starts here&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;gqlError&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gqlerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;As&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gqlError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unwrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gqlError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// It's a GraphQL schema validation / parsing error generated by the server itself,&lt;/span&gt;
    &lt;span class="c"&gt;// error message should not be masked&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;graphql&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultErrorPresenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c"&gt;// Log original error and return InternalServerError instead&lt;/span&gt;
  &lt;span class="n"&gt;logrus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Custom GraphQL error presenter got an unexpected error"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;presentTypedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typederrors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InternalServerError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"internal server error"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedError&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  The &lt;code&gt;GqlErrorPresenter&lt;/code&gt; with the new addition that replaces
  non-typed errors with &lt;code&gt;InternalServerError&lt;/code&gt;.
&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"internal server error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"updateUser"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"extensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"InternalServerError"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  This is how an untyped error will be presented to the client after the change.
&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling errors on the client side
&lt;/h3&gt;

&lt;p&gt;Having finished our work on the server side, it was time to reap the benefits and use the error codes to handle errors properly on the client side. First, we needed the GraphQL enum of error codes to be generated as Go code. We typically generate client-side code using genqlient, but that wasn’t possible here because the error type enum isn’t referenced by any query. We solved this by running gqlgen, the server-side code generation tool, and keeping only the generated error enum.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../../../graphql/errors.graphql"&lt;/span&gt;

&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;filename&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;models_gen.go&lt;/span&gt;
  &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gqlerrors&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;gqlgen.yml&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;gqlerrors&lt;/span&gt;

&lt;span class="c"&gt;//go:generate go run github.com/99designs/gqlgen@v0.17.13&lt;/span&gt;
&lt;span class="c"&gt;// we only need models_gen for the enum, so we delete the server code&lt;/span&gt;
&lt;span class="c"&gt;//go:generate rm generated.go&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;generate.go&lt;/p&gt;

&lt;p&gt;Once we generated the error codes enum, we could write a simple function in the same package that extracts the error code from the genqlient error object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;gqlerrors&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/sirupsen/logrus"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/vektah/gqlparser/v2/gqlerror"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GetGQLErrorType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ErrorType&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errList&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gqlerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;gqlerr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gqlerror&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;As&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gqlerr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;gqlerr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Extensions&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;errorTypeString&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isString&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gqlerr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Extensions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isString&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errorTypeString&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it! We are ready to write client code that checks for specific error codes and handles each error appropriately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gqlerrors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetGQLErrorType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;gqlerrors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorTypeNotFound&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// do something to handle the NotFound error&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GraphQL is a great platform, but the absence of standardized error codes is a real shortcoming. Apollo’s GraphQL server addresses it, but it’s unfortunate that many other GraphQL servers, including our choice, gqlgen, have yet to do the same.&lt;/p&gt;

&lt;p&gt;By defining our own error codes and integrating them with the GraphQL server’s ErrorPresenter, we can easily identify errors on the client side and handle them. In addition, we prevent sensitive internal error messages from being sent to clients and maintain the integrity of our information security.&lt;/p&gt;

&lt;p&gt;You may check out the &lt;a href="https://github.com/otterize/graphql-error-codes-example"&gt;example project&lt;/a&gt; to see what our implementation looks like in a small Go project, and how error types affect the client’s behavior. It shows how all the snippets come together into working code, and offers a practical solution to the missing error codes problem.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>programming</category>
      <category>go</category>
      <category>api</category>
    </item>
    <item>
      <title>So you want to deploy mTLS</title>
      <dc:creator>Tomer Greenwald</dc:creator>
      <pubDate>Sun, 11 Jun 2023 20:44:27 +0000</pubDate>
      <link>https://dev.to/otterize/so-you-want-to-deploy-mtls-126d</link>
      <guid>https://dev.to/otterize/so-you-want-to-deploy-mtls-126d</guid>
      <description>&lt;p&gt;Our story begins with myself, a platform engineer, having been tasked with encrypting and authenticating traffic within our production environment – “our” referring to a previous employer. The business motivation was simply that we guaranteed encryption and authentication within our backend to our customers.&lt;/p&gt;

&lt;p&gt;The idea behind encrypting and authenticating traffic is that our platform, used by the dev team to deploy workloads, should not expose its data or operations, whether through the malicious activity of attackers or simple accidents, like mistakenly linking production and staging workloads.&lt;/p&gt;

&lt;p&gt;Mutual TLS, or mTLS, seemed like the right tool for the job, so I started researching how to achieve mTLS for our platform.&lt;/p&gt;

&lt;p&gt;As a platform engineer, my role wasn’t just to achieve security, but to do it in a way that was easy for the rest of the team to work with and understand.&lt;/p&gt;

&lt;p&gt;In this post, I’ll share the considerations and the process I went through when solving for three attributes: (1) encryption, (2) authentication, and (3) simplicity from a developer experience point of view. There are many ways to achieve (1) and (2), but coupling them with (3) is the hardest part, and was our guiding north star. By sharing, I hope both to learn, from readers willing to share their experience and from research, and to provide useful insights for others who might be working through a similar process.&lt;/p&gt;

&lt;p&gt;But first, a little primer on mTLS for the uninitiated. Skip ahead if you’re already a level 60 warlock.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s mTLS?
&lt;/h2&gt;

&lt;p&gt;mTLS is just like TLS, but with one extra consideration. In mTLS, both the client and server are authenticated, whereas in standard TLS only the server is authenticated. That’s where the &lt;em&gt;m&lt;/em&gt; comes from: &lt;em&gt;mutual&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;To understand why you might need this, consider a common use-case for TLS: web browsing using HTTPS (which is just HTTP wrapped by TLS). When you browse to yourbank.com, you, the user, want to know that you are really on the website of your bank, and that when you sign in and view your information or make transactions, you can trust that your bank stands behind all of that. You assume that your bank is indeed the owner of the yourbank.com domain; with TLS, you can be sure that only the owner of that domain, namely your bank, can be on the other side of your request.&lt;/p&gt;

&lt;p&gt;How can you be sure? Because your browser can validate, when you connect to the yourbank.com server, that the certificate presented by the server is indeed legitimately owned by your bank, as attested to by a third-party entity (a Certificate Authority) that your browser trusts and that signed the certificate. Your browser does this validation for you automatically when you browse to a URL that starts with https: if validation succeeds, you see the familiar and reassuring lock icon in the address bar, and if it fails, your browser warns you in no uncertain terms not to proceed.&lt;/p&gt;

&lt;p&gt;Now, in standard TLS as commonly used with HTTP, while the client verifies the server’s identity, the server does not validate the client’s identity. That authentication usually happens via some other mechanism after the server has been authenticated. In the case of the bank, that probably involves asking for your login credentials.&lt;/p&gt;

&lt;p&gt;mTLS performs this authentication at the connection level, during the TLS handshake, rather than afterwards. Not only are the client and server both authenticated, but the mechanism to do so is completely standardized (it’s just TLS). It’s also more secure than the token or cookie mechanisms that are often used, say when you log in to your bank, because mTLS is not vulnerable to token replay attacks, and because no secrets are transmitted at any point in the communication.&lt;/p&gt;

&lt;p&gt;So far we’ve only discussed authentication: validating the communicating parties. But TLS, and by extension mTLS, also provide &lt;strong&gt;confidentiality&lt;/strong&gt;, i.e. that third parties cannot see the data, and &lt;strong&gt;integrity&lt;/strong&gt;, i.e. that the data cannot be modified in transit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our stack was pretty standard, and yet deploying mTLS is still difficult
&lt;/h2&gt;

&lt;p&gt;Back to my situation. In our team we had a polyglot architecture: a mix of services written in Go, Python and node.js.&lt;/p&gt;

&lt;p&gt;This was all running on Kubernetes, coupled with Google Cloud SQL for PostgreSQL and an HAProxy deployment managed by an ingress controller (&lt;a href="https://github.com/jcmoraisjr/haproxy-ingress"&gt;jcmoraisjr/haproxy-ingress&lt;/a&gt; with a modified config file template). Branch or test deployments were a little different: the database was deployed on Kubernetes directly, to make it simple to deploy additional environments without spinning up resources outside of Kubernetes.&lt;/p&gt;

&lt;p&gt;Inter-service communication happened over REST (the Python services) and gRPC (Go and node.js), plus the traffic proxied between HAProxy and the services configured by the ingress (REST, GraphQL). All of these different kinds of communication had to be encrypted and authenticated.&lt;/p&gt;

&lt;h2&gt;
  
  
  A priority for the team: keep it simple
&lt;/h2&gt;

&lt;p&gt;Our thinking was that we would have to keep it simple: adding mTLS to our platform should not require attention from most engineers most of the time; rather, it should just work. We were a small team of 15, and engineers often had to make changes across multiple services and the platform itself, so any new component or technology would have its onboarding cost multiplied across the whole team. The solution should be as simple as possible, to reduce friction when the team needs to work on the stack.&lt;/p&gt;

&lt;p&gt;There are various ways to evaluate and achieve simplicity. For me, it always starts with: can we reduce, or at least not increase, the number of technologies the team needs to work with? Even very good engineers can only be truly good at a finite number of things, and in a small team of engineers that needed to deal with everything in the stack, that stack needed to be limited to the fewest number of technologies possible. Sure, adding just one more tech might seem like a “best of breed” solution to a complex problem, but what does adding it do to the effectiveness of the team? How much context switching would they now need to do, how many more problems would they need to contend with, and how much longer would critical problems take to debug?&lt;/p&gt;

&lt;p&gt;I’ll get back to how I think about simplicity at the conclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So go with a service mesh, right? It’s just one new technology&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Service meshes promise to take care of service communications outside of the service implementation, which apparently would address just the problem I was looking to solve: developers would focus on writing code, and – &lt;em&gt;if everything worked&lt;/em&gt; – they would not have to be concerned with mTLS since the service mesh would handle it outside their code’s container.&lt;/p&gt;

&lt;p&gt;In fact, service meshes are a bit of a Swiss Army knife: they address a lot of problems, like load balancing, service discovery, observability, and – yep – encryption and authentication. They do that by deploying sidecar proxies and managing them via a unified control plane, which sounds great – one tech across the entire stack! But also, one &lt;em&gt;new&lt;/em&gt; tech across the entire stack, with multiple moving pieces and many ways to configure and use it.&lt;/p&gt;

&lt;p&gt;As a small team, where each developer already has to know quite a bit, we were wary of introducing new runtime components, especially ones at the core of everything, and that could make understanding and debugging our Kubernetes deployment that much more complicated &lt;em&gt;when things didn’t work quite right&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Perhaps that would be worth the risk if we needed the service mesh for many other needs, but we didn’t. We were already heavy Datadog users, so our observability needs were pretty well served. We had simple load balancing needs that were met by an in-cluster HAProxy ingress controller. And service discovery was already achieved just fine, through plain old Kubernetes services and DNS.&lt;/p&gt;

&lt;p&gt;So should we really introduce sidecars? Another control plane? A bunch of new resources, configurations and tools that everyone had to get familiar with just to debug some situations, on top of mTLS, CAs and certificate management itself? A service mesh doesn’t truly solve the problem, because there are as many unsolved cases (such as PostgreSQL managed by Google Cloud SQL, which can’t be part of the service mesh) as there are solved cases; and it adds not only new moving components (the service mesh itself) but also new skills to learn and new ways for things to go wrong.&lt;/p&gt;

&lt;p&gt;So we decided to go back to the built-in capabilities of our stack (meaning gRPC, PostgreSQL, HAProxy, etc.) and find ways to roll out mTLS within them. The only complexity we really needed to introduce was mTLS itself, and we would do it in a way that induced as little variance as possible between workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing mTLS
&lt;/h2&gt;

&lt;p&gt;I researched how mTLS can be implemented with each of our tools: gRPC in Go, Python and node.js; HTTP servers and clients in Python; GraphQL server in node.js; Google Cloud SQL; and HAProxy. Here are the steps we’d need to take, at a high level:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate pairs of keys and certificates (”keypairs”) for each environment (production, staging, etc.)&lt;/li&gt;
&lt;li&gt;Distribute the keypairs using Kubernetes secrets to each component&lt;/li&gt;
&lt;li&gt;Configure clients and servers to present the keypairs to each other, and to trust them when they’re presented – i.e. when authenticating the presenter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following is a look into the work that went into implementing mTLS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating key pairs
&lt;/h3&gt;

&lt;p&gt;Our goal was to create separation between environments: separate production, staging and dev. Since our use case was simple, we didn’t need separate keys for each workload: the entire environment could share the same key. This also meant that we didn’t need a CA to sign them, but rather just a self-signed certificate, which can only be used to establish connections with parties that trust that exact certificate (a practice called certificate pinning).&lt;/p&gt;

&lt;p&gt;So we first needed to generate a key pair and self-signed certificate for each environment. Thankfully, this is easy to do with a single OpenSSL CLI command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj '/CN=production.example.com'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This generates an RSA 4096-bit keypair, valid for 10 years, with the key written to &lt;code&gt;key.pem&lt;/code&gt; and the cert to &lt;code&gt;cert.pem&lt;/code&gt;. We’ve also specified a CN (Common Name) for the certificate that maps to an environment-specific hostname.&lt;/p&gt;
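
&lt;p&gt;To sanity-check what you’ve just generated, you can inspect the certificate with another OpenSSL command – this verification step is a suggestion, not part of the original scripts:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openssl x509 -in cert.pem -noout -subject -dates&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This prints the CN you specified and the validity window, which is a quick way to catch a wrong &lt;code&gt;-subj&lt;/code&gt; or &lt;code&gt;-days&lt;/code&gt; value before the cert is distributed anywhere.&lt;/p&gt;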

&lt;h3&gt;
  
  
  Push the keypairs to Kubernetes as secrets
&lt;/h3&gt;

&lt;p&gt;With this we had keypairs we could use to establish connections secured with mTLS, but we then had to distribute them as secrets to the workloads. At the time, we did not have any PKI infrastructure, or any infrastructure for managing secrets, so we used plain Kubernetes secrets, created directly. Specifically, we just used &lt;code&gt;kubectl create secret&lt;/code&gt; to stick with our approach of keeping things simple. To automate things, we had a script that would create all the necessary secrets, and another script that deployed a cluster using the gcloud (Google Cloud) CLI. No CloudFormation, Terraform, Vault, or anything like that.&lt;/p&gt;

&lt;p&gt;So here’s how we would create a secret with two entries, key and cert, using the key.pem and cert.pem from the last section, respectively:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl create secret generic mtls -n production --from-file=key=key.pem --from-file=cert=cert.pem&lt;/code&gt;&lt;/p&gt;
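
&lt;p&gt;To confirm the secret was created with both entries, you can describe it – a quick verification check we’d suggest (adjust the namespace to yours):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe secret mtls -n production&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output should list the &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;cert&lt;/code&gt; entries along with their sizes in bytes, without revealing their contents.&lt;/p&gt;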

&lt;h3&gt;
  
  
  Mounting the secrets into pods
&lt;/h3&gt;

&lt;p&gt;Now that we’ve got the secrets, we need to mount them into all of the relevant pods – all the functional service pods, as well as HAProxy. The following snippet from a Deployment resource (it’s approximate but gets the point across) will mount the key and cert as files into &lt;code&gt;/var/mtls&lt;/code&gt; on the local filesystem of each pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;...&lt;/span&gt;
        &lt;span class="s"&gt;volumeMounts&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mtls&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/mtls"&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mtls&lt;/span&gt;
      &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mtls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
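
&lt;p&gt;Once a pod from the Deployment is running, it’s worth verifying that the files actually show up where the code expects them (&lt;code&gt;myservice&lt;/code&gt; here is a placeholder for one of your Deployments):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -n production deploy/myservice -- ls /var/mtls&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This should list &lt;code&gt;cert&lt;/code&gt; and &lt;code&gt;key&lt;/code&gt;, matching the secret entries mounted above.&lt;/p&gt;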



&lt;h3&gt;
  
  
  Configuring clients and servers: Go &amp;amp; gRPC
&lt;/h3&gt;

&lt;p&gt;We had to configure servers in Go (gRPC), Python (Flask), and node.js (GraphQL). There are plenty of guides and docs on how to do this, if you’re curious :-) For the sake of brevity, I’ll only give &lt;a href="https://github.com/islishude/grpc-mtls-example"&gt;an example&lt;/a&gt; for Go (gRPC) to illustrate what this entails.  Here’s the heart of it:&lt;/p&gt;

&lt;p&gt;Client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;LoadKeyPair&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransportCredentials&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;certificate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadX509KeyPair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/var/mtls/cert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/var/mtls/key"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Load client certification failed: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;ca&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ioutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/var/mtls/cert"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"can't read ca file"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;capool&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;x509&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCertPool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;capool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendCertsFromPEM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ca&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"invalid CA file"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;tlsConfig&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Certificates&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Certificate&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;certificate&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;RootCAs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;capool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTLS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tlsConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:10200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTransportCredentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LoadKeyPair&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;LoadKeyPair&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransportCredentials&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;certificate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadX509KeyPair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/var/mtls/key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/var/mtls/cert"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to load server certification: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ioutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/var/mtls/cert"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to load CA file: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;capool&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;x509&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCertPool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;capool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendCertsFromPEM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"can't add ca cert"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;tlsConfig&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ClientAuth&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequireAndVerifyClientCert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Certificates&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Certificate&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;certificate&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;ClientCAs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;capool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTLS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tlsConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Creds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LoadKeyPair&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Configuring clients and servers - HAProxy
&lt;/h3&gt;

&lt;p&gt;So we’ve configured 3 different kinds of servers and clients. Are we done? Nope! It’s time to configure the ingress load balancer, so that the external traffic it funnels to externally-exposed services is also protected by mTLS. We did this by adjusting the template for the HAProxy config file which the ingress controller uses to configure the HAProxy instances.&lt;/p&gt;

&lt;p&gt;What you have to do is configure each server that HAProxy recognizes — those are the server instances we’d just configured.&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;server srv001 &amp;lt;templated by ingress controller&amp;gt; ssl verify required ca-file /var/mtls/cert crt /var/mtls/key+cert&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;The sharp-eyed will have noticed that we’re not using the key and cert as separate files like we did for other services. HAProxy takes a keypair – the key followed by the cert – as a single file, so we had to prepare a special version for HAProxy. Fortunately this is also easy to do with the CLI:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cat key.pem cert.pem &amp;gt; key+cert.pem&lt;/code&gt;&lt;/p&gt;
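
&lt;p&gt;A quick sanity check we’d suggest: the combined file should contain exactly two PEM blocks, the private key followed by the certificate:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;grep -c 'BEGIN' key+cert.pem&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If this doesn’t print &lt;code&gt;2&lt;/code&gt;, HAProxy will fail to load the keypair.&lt;/p&gt;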




&lt;p&gt;Now we can add it to a Kubernetes secret:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl create secret generic haproxy-mtls -n production --from-file=key+cert=key+cert.pem --from-file=cert=cert.pem&lt;/code&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Configuring clients and servers - Google Cloud SQL for PostgreSQL
&lt;/h3&gt;

&lt;p&gt;So far so good! We were able to use the same key and cert for the entire environment, with just minor adjustments for HAProxy. This hopefully makes it easier for the team to grok what’s going on.&lt;/p&gt;

&lt;p&gt;Unfortunately, Google Cloud SQL for PostgreSQL doesn’t let you use your own keypairs for the server, and you can’t use a self-signed cert! You have to use their CA, and you have to generate a client keypair using Cloud SQL, download it and use that one to authenticate and authorize the client; &lt;a href="https://cloud.google.com/sql/docs/postgres/configure-ssl-instance"&gt;see here&lt;/a&gt;. Fortunately, &lt;a href="https://cloud.google.com/sql/docs/postgres/configure-ssl-instance#gcloud_2"&gt;this is also possible to do with the CLI&lt;/a&gt;, and we can keep using our barebones method for generating secrets and storing them in Kubernetes. Once we’ve generated the keypair, we store the key, cert and this time also the CA cert into a Kubernetes secret as usual, and mount it into all our client services.&lt;/p&gt;
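
&lt;p&gt;For reference, here’s roughly what that looks like with the gcloud CLI – the instance and file names below are placeholders, and the exact flags may have changed since, so check the linked docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Download the server CA certificate for the instance
gcloud sql instances describe my-instance --format="value(serverCaCert.cert)" &amp;gt; cacert.pem

# Create a client key and certificate for the instance
gcloud sql ssl client-certs create my-client client-key.pem --instance=my-instance
gcloud sql ssl client-certs describe my-client --instance=my-instance --format="value(cert)" &amp;gt; client-cert.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;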

&lt;p&gt;Let’s look at an example of how to configure a PostgreSQL client with mTLS in Go. We used pq, a database driver for PostgreSQL, which takes mTLS configuration in the form of a connection string. Here’s how you would initialize a connection, assuming the key, cert and CA cert are at &lt;code&gt;/var/postgres-mtls/key&lt;/code&gt;, &lt;code&gt;/var/postgres-mtls/cert&lt;/code&gt;, and &lt;code&gt;/var/postgres-mtls/cacert&lt;/code&gt;, respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"[...] sslmode=verify-full sslrootcert=/var/postgres-mtls/cacert sslcert=/var/postgres-mtls/cert sslkey=/var/postgres-mtls/key”)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
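
&lt;p&gt;Before wiring the credentials into code, it’s handy to test them with psql using the same libpq parameters – the host and database names here are placeholders:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;psql "host=10.20.0.3 dbname=mydb user=myuser sslmode=verify-full sslrootcert=/var/postgres-mtls/cacert sslcert=/var/postgres-mtls/cert sslkey=/var/postgres-mtls/key"&lt;/code&gt;&lt;/p&gt;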






&lt;h2&gt;
  
  
  Kept it simple… enough?
&lt;/h2&gt;

&lt;p&gt;Our requirements were very minimal – just 2 sets of credentials per environment, shared across all services – and our tech stack additions were also very minimal: basically no new software or tech or concepts beyond what we already had and mTLS itself.&lt;/p&gt;

&lt;p&gt;The end result was still, in reality, complex to operate.&lt;/p&gt;

&lt;p&gt;Here are the problems we managed to deal with, after some more investment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local development was more challenging than expected. We used Docker Compose for local deployment, and now there were parts of the code that expected mTLS credentials, which did not exist locally as mTLS credentials were available only on Kubernetes deployments. We had to add environment variables that disabled mTLS in the code that worked with Postgres and inter-service communication.&lt;/li&gt;
&lt;li&gt;Also in the context of local development, when someone ran a service directly from within their IDE (to make debugging easier), they had to disable mTLS as well. We added the ability for services to auto-detect that they’re running locally and disable mTLS. Needless to say, we had to do that separately for each service, as they didn’t all share the same tech stack.&lt;/li&gt;
&lt;li&gt;Connecting local instances to instances running on Kubernetes was even more challenging: say you ran a single service locally and hooked it up to another service running within a test deployment on Kubernetes. Before, you could simply run &lt;code&gt;kubectl port-forward&lt;/code&gt; and wire it up; now, the test deployment on Kubernetes expected only mTLS connections. We wrote a little CLI utility that helped you get the appropriate mTLS credentials for connecting to test deployments.&lt;/li&gt;
&lt;/ul&gt;
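
&lt;p&gt;For local runs, the escape hatch from the first bullet amounted to setting an environment variable that each service checked at startup – the variable name here is illustrative, not our actual code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DISABLE_MTLS=true docker-compose up&lt;/code&gt;&lt;/p&gt;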

&lt;p&gt;Even after we solved those problems, there was still a lack of strong understanding within the team for how this all works. For example, using the same key and cert for all workloads turned out to be confusing for some rather than helpful: they assumed that certs were credentials just like tokens or usernames/passwords, so having a “shared credential” was very odd to them. This caused some people to avoid adding anything that required inter-service connectivity and mTLS, which was obviously problematic for getting things done.&lt;/p&gt;

&lt;p&gt;Really everyone just wanted to say “I want to connect to this service” and know how to connect, without dealing with the details of how it happens. Doesn’t seem like too much to ask, does it?&lt;/p&gt;

&lt;h2&gt;
  
  
  What have we learned?
&lt;/h2&gt;

&lt;p&gt;We ended up with a solution that worked well enough for us to continue. It simply wasn’t worth our investment, at the time, to completely solve the problem in a way that minimized friction and optimized our engineering resources – and I think that’s usually the case. When you have a business to run, you’re not always going to have the time to solve all the problems along the way. Sometimes what you’ve solved suboptimally is good enough – unless, of course, somebody else has solved it for you (see my personal note below). 😉&lt;/p&gt;

&lt;p&gt;Now, as an engineer, you still want to be as productive as possible concerning your task – you want to be able to focus on the understanding required to complete your task and reduce the number of things you have to keep in mind to successfully complete it. A good platform helps you do that by minimizing how much a developer needs to know to complete most tasks. It may not be possible to eliminate all complexity, but a good platform lets you minimize the parts you don’t want your team to deal with. Without such a good platform – and we did not have one – you end up with a gap between making the engineers as productive as they should be, and investing in your core business.&lt;/p&gt;

&lt;h2&gt;
  
  
  A personal note
&lt;/h2&gt;

&lt;p&gt;My mind kept going, though… contemplating what such a platform could look like. I wished somebody would solve this problem in a way that went the extra mile and truly, significantly reduced friction for all engineers in the organization.&lt;/p&gt;

&lt;p&gt;You want the engineers to not have too many things they have to keep in mind, in particular security mechanisms, and multiple ways of configuring access controls. To minimize operational complexity, the solution should just fit into the way they operate already: same processes, same tools. And if anyone has to learn a new tech or take on a complex setup task, it should be the platform engineers, not every functional developer in the organization.&lt;/p&gt;

&lt;p&gt;This experience, among many others like it, provided the core motivation for me to build Otterize. Secured access for your services should not be such an ordeal. It should be simple to grok, and easy to do things like adding a new service, using a new third-party service, or doing local development. If your team has to know a magic incantation, or people rely on tribal knowledge of how things work, then the platform is not good enough.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I saw too much information coming back from a company’s backend</title>
      <dc:creator>Tomer Greenwald</dc:creator>
      <pubDate>Thu, 08 Jun 2023 09:05:21 +0000</pubDate>
      <link>https://dev.to/otterize/how-i-saw-too-much-information-coming-back-from-a-companys-backend-3ikm</link>
      <guid>https://dev.to/otterize/how-i-saw-too-much-information-coming-back-from-a-companys-backend-3ikm</guid>
      <description>&lt;p&gt;A few years ago, I was working on a project where we needed to get a bunch of data about products from various sources around the web. As often happens, what started as a small scripting task that I could just knock out in the background (in my “free time”) turned into an interesting adventure of API discovery and exploration, with a layer of security insights for good measure.&lt;/p&gt;

&lt;p&gt;It all started from how to actually fetch the data.&lt;/p&gt;

&lt;p&gt;Sometimes the data itself is directly available as structured databases. That’s where you can directly interact with a SQL or NoSQL database to extract relevant information. The great thing about these kinds of data sources is that they can help you make use of their data just with their existing internal structuring. A quick look at their table structure, identifiers, and the way they link data points can already be a great jump start to building a domain-specific knowledge graph. And of course you can then easily retrieve any data point you want, or all data points, by simply making the right queries.&lt;/p&gt;

&lt;p&gt;Other times an API is available to retrieve the data directly. From the perspective of my needs, that’s a data API. These data APIs may be monetized by their owners with a paywall or may be free to use, but either way, they’re invariably well-structured and well-documented APIs clearly meant for external users to interact with. Usually, you will get some tutorials to help you understand how to make sense of them and work with them, guides about the internal data schemes, and sometimes also instructions for working with them in production-scale environments (thresholds, order of calls, batching, …).&lt;/p&gt;

&lt;p&gt;But in (desperate) times, all you have is the presentation of that data as HTML on web pages. In other words, you see a website showcasing the data you want to retrieve, but it’s grouped in categories, perhaps shown in cards, and searchable using a search box at the top of the website. You see that the data is all there, but it's not directly available to you, since it's rendered in HTML and encapsulated in DIVs and tables – in other words, data and presentation are combined to make a human-oriented presentation rather than a machine-oriented data source. Naturally, when you see this, you think: let’s just scrape the page by retrieving the HTML, mapping the various data structures we want to work with, and then extracting the data with some scripting language, an XML parser, or both. (BTW, my preference is extracting the data with regex vs using XPath).&lt;/p&gt;

&lt;p&gt;But these days the mixing of data and HTML to create the presentation layer is likely done in the browser, within a single-page app, rather than server-side. Which means the browser likely fetched the data before wrapping it up in presentation elements. And if the browser can get data that way, you and I can do the same: skip the middle steps and just extract the data directly from the same HTTP calls the web page is making. I call that scraping the API because the developers behind the web site were effectively exposing an API, knowingly or not, but one geared towards HTML presentation, so to get at the data requires mapping this presentation API back into a data structure and using it as an effective data API.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you scrape an API? You start by exploring it
&lt;/h2&gt;

&lt;p&gt;We start by deciding what data we want to scrape from the website. In our case, we wanted to build a product catalog, therefore the end goal was to get detailed information for all available items. For that, we needed to discover, i.e. list, all available items. The specific site structure only allowed listing items within a category, so we needed a way to discover (list) categories as well. It became clear that I would need to discover much, if not all, of the API surface for this presentation API: its methods, data types, structures, allowed values, and whatever matching identifiers were used to link calls (e.g., an identifier retrieved when calling one method can be used when calling another method).&lt;/p&gt;

&lt;p&gt;Usually, when I want to understand an undocumented API and I have access to a working example of that API being used, I start by inspecting how the example calls and consumes that API. In our case we had a browser application communicating with the API server using HTTP/REST, so a really easy tool for inspecting the traffic is just the built-in Chrome Developer Tools - Network tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="/assets/images/posts/how-i-saw-too-much-information/dev-tools.webp" class="article-body-image-wrapper"&gt;&lt;img src="/assets/images/posts/how-i-saw-too-much-information/dev-tools.webp" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another tool I highly recommend is &lt;a href="https://learning.postman.com/docs/sending-requests/capturing-request-data/interceptor/"&gt;Postman Interceptor&lt;/a&gt;. It has a much cleaner user interface and any reader already using Postman (is there anyone not using it?) should feel right at home using the Interceptor.&lt;/p&gt;

&lt;p&gt;When inspecting all communication between the browser and the backend, we need to differentiate between requests made specifically to the API server and requests made to other parts of the backend for rendering the website. The API server calls will usually fall under the XHR category. Any calls there are a good tell-tale for an API we may be able to consume.&lt;/p&gt;

&lt;p&gt;For each of the relevant pages and calls I wanted to map, I initiated multiple calls to the backend by entering different values in search boxes, selecting items in drop-down menus, and clicking items. With all these calls to the backend, I listed available API calls and their respective parameters. By piecing the puzzle together, I figured out that there were several identifiers I needed to track, e.g. category_id and product_id. Calls often work as a hierarchy: the first call returns a list of identifiers, you then use those to make another call and get back more identifiers, and so on.&lt;/p&gt;

&lt;p&gt;Here’s an example of three calls I needed to make in a specific order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;/list_categories&lt;/em&gt; returned categories with detailed info about the category. Critically, from that list I could retrieve category_id values to make the next call.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;/list_products/{category_id}&lt;/em&gt; listed the products within a given category. From it, I could retrieve product_id values that allowed me, in turn, to make the next call.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;/get_product/{product_id}&lt;/em&gt; returned a lot of data about the product – which was the end goal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can see that I’m effectively reverse-engineering the API spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;/list_categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;“List all categories”&lt;/span&gt;
  &lt;span class="s"&gt;/list_products/{category_id}&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;“List all products in the given category”&lt;/span&gt;
  &lt;span class="s"&gt;/get_product/{product_id}&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;“Return the product information for the given product”&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this home-brewed data API spec, we can start implementing the scraper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a scraper
&lt;/h2&gt;

&lt;p&gt;With all functions in hand, the basic scraping logic looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;list_categories&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;list_products&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked at a small scale, to check that the logic made sense and I could scrape the data correctly. But doing it at full scale, with numerous calls to pick up all the data available, required dealing with the problems common to all scraping techniques: multi-threading, caching, data filtering, and related batching concerns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caching
&lt;/h3&gt;

&lt;p&gt;The core scraping logic assumed we were dealing with a &lt;a href="https://en.wikipedia.org/wiki/Tree_(graph_theory)"&gt;tree-like structure&lt;/a&gt; where there’s a single path between the root node and each leaf node (e.g. product). But as the code started making the API calls, and I started examining the retrieved IDs, I realized that the data structure we were looking at wasn’t a tree: a leaf node might be reachable from the root via multiple paths. The code would wastefully retrieve items multiple times unless we added a caching layer.&lt;/p&gt;

&lt;p&gt;The first implementation for a caching layer was the simplest. The API used REST (well, roughly) over HTTP, so I figured we could cache calls by the HTTP path and request parameters.&lt;/p&gt;

&lt;p&gt;To make sure all calls made were really necessary, I first created a proxy function that returned results from cache if they were in the cache (cache hits), and only made outbound API calls if they were not (cache misses). Once I switched all requests.get calls to use the proxy, I plugged the caching mechanism into that proxy.&lt;/p&gt;
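&lt;p&gt;A minimal sketch of such a proxy (names are illustrative, not the actual implementation, which wrapped requests.get): results are keyed by the HTTP path plus the request parameters, and the outbound call is only made on a cache miss.&lt;/p&gt;

```python
# Sketch of a caching proxy (hypothetical names). The real implementation
# wrapped requests.get; here `fetch` stands in for the outbound call.
cache = {}

def cached_get(path, params=None, fetch=None):
    """Return a cached response if present; otherwise call `fetch` and cache it."""
    # Key by path + sorted params so logically identical calls share an entry.
    key = (path, tuple(sorted((params or {}).items())))
    if key in cache:              # cache hit: no network call needed
        return cache[key]
    result = fetch(path, params)  # cache miss: make the real call
    cache[key] = result
    return result
```

&lt;p&gt;The nice side effect of routing everything through one proxy is that later refinements only need to be plugged into one place.&lt;/p&gt;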

&lt;p&gt;Over time, we made the caching more sophisticated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We allowed the application to force a call even if it was already in the cache, to refresh items that might have gone stale.&lt;/li&gt;
&lt;li&gt;We allowed the application to “teach” the cache that some calls might look different but amount to the same result, e.g. by providing the cache with hints that some parameters were not part of the data semantics.&lt;/li&gt;
&lt;li&gt;We added some logic behind the cache to understand when an item was already processed on the back end so we could avoid making the HTTP call completely. For example, sometimes there were multiple types of calls that would retrieve information on the same item, so even though a certain type of call was not made and hence isn’t in the cache, we don’t need to make it because we already have the data about the item.&lt;/li&gt;
&lt;/ul&gt;
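&lt;p&gt;The first two refinements above can be layered onto the same caching proxy. This is only a sketch with illustrative parameter names; the third refinement (cross-call knowledge about already-processed items) needs application-specific logic and is omitted.&lt;/p&gt;

```python
# Caching proxy sketch with two refinements (hypothetical names):
#   force=True re-fetches even on a cache hit, refreshing stale items;
#   ignored_params lists parameters that don't affect the response
#   (e.g. tracking tokens), so they're excluded from the cache key.
cache = {}

def cached_get(path, params=None, fetch=None, force=False, ignored_params=()):
    # Drop non-semantic parameters before building the cache key.
    semantic = {k: v for k, v in (params or {}).items() if k not in ignored_params}
    key = (path, tuple(sorted(semantic.items())))
    if not force and key in cache:
        return cache[key]
    result = fetch(path, params)  # full params still go out on the wire
    cache[key] = result
    return result
```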

&lt;h3&gt;
  
  
  Parallelizing
&lt;/h3&gt;

&lt;p&gt;When making 10K+ synchronous API calls to any server, you often start noticing that your process spends most of its time waiting for requests to complete. Because I knew most requests weren’t dependent on other requests to be completed, I could (and did) parallelize them to achieve much higher overall throughput.&lt;/p&gt;
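&lt;p&gt;Because the calls are I/O-bound, Python’s standard library thread pool is enough for this kind of parallelization. A sketch (get_product echoes the earlier pseudocode and is a placeholder for the real call):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_products(product_ids, get_product, max_workers=16):
    """Fetch many products concurrently.

    Each get_product call is independent and spends most of its time
    waiting on the network, so threads overlap that waiting time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with product_ids.
        return list(pool.map(get_product, product_ids))
```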

&lt;h3&gt;
  
  
  Working around anti-scraping roadblocks
&lt;/h3&gt;

&lt;p&gt;Sites usually don’t expect you to scrape them or their backend, but that doesn’t stop them from implementing basic anti-scraping mechanisms.&lt;/p&gt;

&lt;p&gt;One of the basic ways websites, API gateways, and WAFs verify who is making a request is checking for a “normal”-looking &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent"&gt;User-Agent&lt;/a&gt; field in the HTTP request. That’s because, when HTTP calls are generated by automated scripting or dev tools, they usually contain either an empty User-Agent or an SDK-specific one. Web browsers, on the other hand, specify their User-Agent in a format that looks like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This identifies to the web server the exact type of browser it’s working with. To simulate a normal browser and avoid potential blocking on the server side, we can simply send this HTTP header ourselves to look like a standard Chrome browser (in this example).&lt;/p&gt;
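&lt;p&gt;With Python’s standard library this is a one-line header per request. A sketch, reusing the example Chrome string above (any current browser’s User-Agent string works just as well):&lt;/p&gt;

```python
import urllib.request

# The example Chrome User-Agent string from above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/104.0.0.0 Safari/537.36")

def browser_like_request(url):
    """Build a request that presents itself as a standard Chrome browser."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
```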

&lt;p&gt;Another check which web servers often make to ensure they’re getting requests from “normal” clients is by sending, and expecting back, certain &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies"&gt;cookies&lt;/a&gt;. Cookies are of course used by websites for functional reasons, such as preference tracking (site language), form tracking, and so on. But websites can also use cookies for sanity checks when API calls are made to their backend. A common example is a site creating and sending a unique “session” cookie the first time you visit any of its pages. The site then expects and validates this cookie when you call its API. If you see you’re being blocked because of such a mechanism, you’ll first need to figure out which cookies are being set by this site and returned in its API calls, and then send them with your automated scraping. To do this, you might simply copy them from your browser, or if needed you can automatically discover them by making the right initial HTTP call to the backend – that initial call that returns the session cookie the first time.&lt;/p&gt;
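&lt;p&gt;A minimal sketch of the cookie-replay idea: pull the session cookie out of the initial response’s Set-Cookie header so it can be sent back on subsequent API calls. The cookie name and parsing are simplified; real sites may set several cookies at once.&lt;/p&gt;

```python
def extract_cookie(set_cookie_header, name):
    """Pull `name=value` out of a Set-Cookie header value, ignoring
    attributes like Path or HttpOnly, so it can be replayed in a
    Cookie request header on later API calls."""
    for part in set_cookie_header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key == name:
            return f"{name}={value}"
    return None
```

&lt;p&gt;In practice, a cookie-aware HTTP client (e.g. http.cookiejar in the standard library) does this bookkeeping automatically; the point is that the initial page visit has to happen before the API calls.&lt;/p&gt;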

&lt;h2&gt;
  
  
  What I discovered
&lt;/h2&gt;

&lt;p&gt;Once I had the API scraping automated, I turned to the data I was building up, and made some interesting discoveries.&lt;/p&gt;

&lt;p&gt;First, I noticed that not all of the data returned in a response was actually needed for rendering the pages I was looking at. That was glaringly obvious when I saw that responses describing individual products were several KBs long, but there wasn’t nearly enough data represented on the screen to consume that (excluding images, as they were retrieved separately anyway). In fact, the data looked like the unfiltered output of an internal system that the API server was fronting (sometimes even multiple systems), because I could see bits of schemas and identifiers quite different from the ones I had reverse-engineered from the presentation API.&lt;/p&gt;

&lt;p&gt;This isn’t just a matter of the API being inefficient; it often poses a real security risk for the site and the business, as you could use and abuse these internal identifiers to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify which specific backend systems are being used. This knowledge can be combined with catalogs of known vulnerabilities (e.g. from the &lt;a href="https://www.cisa.gov/known-exploited-vulnerabilities-catalog"&gt;Cybersecurity and Infrastructure Security Agency&lt;/a&gt;) to exploit those systems, perhaps with specifically crafted malicious payloads.&lt;/li&gt;
&lt;li&gt;Identify new APIs – sources of data – that can then be probed directly. For example, you might see a fragment of data that suggests the existence of another API for shipping partners; perhaps that API relies on “security through obscurity” but can be queried directly to reveal preferred shipping rates and other confidential information?&lt;/li&gt;
&lt;li&gt;Identify new values that should not be exposed but are not in fact protected. For example, if product identifiers are sequential, then sending sequences of product IDs might reveal products that are not yet released and allow competitors to get a head start.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some cases, I realized that information which might not be regarded as sensitive on its own became more concerning when collected in bulk. Think of item stock levels, used to present availability indicators in stores. Users can benefit from learning whether an item would be present when they go to the store to pick up their item, or how long it might take to get delivered to them. But when we accumulate many availability data points for a product over time, we can learn about its sales cycle, its replenishment frequency, and other telling metrics about the business.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;While it’s intriguing to embark on such a “fishing expedition” and see what can be learned with a bit of scripting, I think the important lessons here are for the developers of backend systems, in particular ones that need to support front-end development.&lt;/p&gt;

&lt;p&gt;It seems very natural for developers to repurpose internal APIs when the UI needs data, reusing previous investments rather than building new APIs from scratch. Reuse of what’s readily available is a pillar of software development, whether it’s building on existing open-source software or extending existing APIs. And it seems equally natural to assume that, since the same company is building the front end and the APIs behind it, these APIs are private and therefore… protected?&lt;/p&gt;

&lt;p&gt;But the true boundary of these APIs is, at best, the corporate network: once they are exposed to the public internet, as they must be if they’re to serve public web pages, anyone can discover and access them, just as I did, for good purposes or otherwise.&lt;/p&gt;

&lt;p&gt;Such APIs are, in fact, public APIs. Their purpose might be private – so they need not be managed as carefully as APIs meant for external consumption and integration – but their security model must be appropriate for a public API. A good way to think about them is as products intended for external consumption: even if the expected mode of consumption is via a browser, like any consumer product, the “manufacturer” must consider unexpected usage.&lt;/p&gt;

&lt;p&gt;A productized API benefits from several important elements. This has been described elsewhere at great length (just Google it), but here I just want to highlight a couple of important points.&lt;/p&gt;

&lt;p&gt;Formalize the API specification – the resources, methods, request data and response data, and don’t forget the responses when errors happen. Just declaring the fields and types of data that can pass between systems will trigger a sanity check for the teams responsible for producing and consuming the API, which already helps catch problems.&lt;/p&gt;

&lt;p&gt;Really think about the data being exposed, individually as well as in bulk. For example, consider an online retail site that sends real-time item availability to its web UI, even if that UI may only render a simplistic view (available or not) to the consumer. Wouldn’t it be better to send only the information the UI needs, perhaps an enum with a couple of options, and expand that only if and when the needs change?&lt;/p&gt;

&lt;p&gt;Meanwhile, for anyone needing to discover data from web APIs, explicit or implicit – happy hunting!&lt;/p&gt;

</description>
      <category>api</category>
      <category>network</category>
      <category>programming</category>
      <category>security</category>
    </item>
    <item>
      <title>Network policies are not the right abstraction (for developers)</title>
      <dc:creator>Tomer Greenwald</dc:creator>
      <pubDate>Tue, 06 Jun 2023 12:55:13 +0000</pubDate>
      <link>https://dev.to/otterize/network-policies-are-not-the-right-abstraction-for-developers-561k</link>
      <guid>https://dev.to/otterize/network-policies-are-not-the-right-abstraction-for-developers-561k</guid>
      <description>&lt;p&gt;You’re a platform engineer working on building a Kubernetes-based platform that achieves zero-trust between pods. Developers have to be able to get work done quickly, which means you’re putting a high priority on developer experience alongside zero-trust.&lt;/p&gt;

&lt;p&gt;Are Kubernetes network policies good enough? I think there are multiple flaws that prevent network policies, &lt;em&gt;on their own&lt;/em&gt;, from being an effective solution for a real-world use case.&lt;/p&gt;

&lt;p&gt;Before pointing out the problems, I’d like to walk you through what I mean when I say zero-trust, as well as a couple of details about how network policies work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero-trust means preventing access from unidentified or unauthorized sources
&lt;/h2&gt;

&lt;p&gt;Network policies can prevent incoming traffic to a destination (a server), or prevent outgoing traffic from a source (a client).&lt;/p&gt;

&lt;p&gt;Zero trust inherently means you don’t trust any of the sources just because they’re in your network perimeter, so the only blocking relevant for achieving zero-trust is blocking incoming traffic (“ingress”) from unauthorized sources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-policy&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt; &lt;span class="c1"&gt;# ingress rules&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt; &lt;span class="c1"&gt;# policy refers to ingress, but it could also have egress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let’s talk about network policies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  They’re namespaced resources and refer to pods by label
&lt;/h3&gt;

&lt;p&gt;Network policies are namespaced resources, and refer to pods by label. Logically, they must live alongside the pods they apply to – in our case, since we’re using ingress policies, that means alongside the servers they protect.&lt;/p&gt;

&lt;p&gt;They don’t refer directly to specific pods, of course, because pods are ephemeral, but they refer logically to pods by label. This is common in Kubernetes, but introduces problems for network policies. Keep this detail in mind as we’ll get back to it later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;protect-backend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-backend&lt;/span&gt; &lt;span class="c1"&gt;# policy will apply to pods labeled app=my-backend, in the same namespace as the policy&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-client&lt;/span&gt; &lt;span class="c1"&gt;# and allow ingress access from pods labeled app=my-client&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  They hold information about multiple sets of pods
&lt;/h3&gt;

&lt;p&gt;The contents of the network policies are effectively an allowlist specifying which other pods can access the pods which the policy protects. But there’s one big problem there: while the network policy must live with the protected pods, and is updated as part of the protected pods’ lifecycle, it won’t naturally be updated as part of the lifecycle of the client pods accessing the protected pods.&lt;/p&gt;

&lt;h1&gt;
  
  
  Friction when using network policies
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Enabling access between two pods
&lt;/h2&gt;

&lt;p&gt;Whenever a developer for a client pod needs access to a server, they need to get their client pod into the server’s network policy so it’s allowed to call the server. The developer often cannot manage that network policy themselves, as it usually exists in a namespace they are not responsible for, and is deployed with a service they don’t own.&lt;/p&gt;

&lt;p&gt;The result is that the client developer is dependent on the server team for access that should have been self-service, and the server team is distracted by enabling a client developer even though, from the server team’s point of view, nothing has really changed – a new client is connecting to their server, that’s it! No server-side changes should be required simply to enable another client.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if you need to roll back the server?
&lt;/h2&gt;

&lt;p&gt;There are also a myriad of second-order problems, which &lt;a href="https://monzo.com/blog/we-built-network-isolation-for-1-500-services"&gt;the team at Monzo learned about while solving this problem&lt;/a&gt; (it’s a super well-written blog post; I recommend having a read), such as rolling back the server affecting whether clients could connect, since the rollback also reverted its network policy.&lt;/p&gt;

&lt;p&gt;When a server is rolled back due to an unrelated problem, its network policy may also be rolled back if it is part of the same deployment (e.g. part of the same Helm chart), and break the clients that relied on that version of the network policy! It’s a reflection of the unhealthy dependency between the client and server teams: while it would make sense that a server-side change that breaks functionality would affect the client, it does not make sense that an unrelated and functionally-non-breaking rollback of the server would affect the client.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you know the policy is correct?
&lt;/h2&gt;

&lt;p&gt;Because network policies refer to pod labels, they are difficult to validate statically. Pods are generally not created directly, but instead created by other resources, such as Deployments.&lt;/p&gt;

&lt;p&gt;Can you tell whether a network policy will allow access for your service without deploying and trying it out? In fact, just asking the question “which services have effective access to service A?” becomes super hard.&lt;/p&gt;

&lt;p&gt;Developers don’t think of services as pod labels; they tend to refer to services by a developer-friendly name. For example, checkoutservice is a friendly name, whereas checkoutservice-dj3847-e120 is not. The friendly name may in fact be the value of some label, but there’s no standard way to discover it.&lt;/p&gt;

&lt;p&gt;So then, how do you take the concept of a service, with its developer-friendly name, and map it to the labels referred to by the network policies and, say, its Deployment, to check whether it will have access once its new labels are deployed? You could do that manually, as a developer in a single team who understands all the moving parts. However, this is very error-prone, and of course doesn’t amount to a solution a platform engineer could deploy: as a platform engineer, you’d need something automated you could make available to every developer in your organization.&lt;/p&gt;

&lt;p&gt;This problem is one that &lt;a href="https://monzo.com/blog/we-built-network-isolation-for-1-500-services"&gt;the team at Monzo worked hard at&lt;/a&gt;. I recommend giving that blog a read as it is very well-written and also covers other factors of the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you refer to pods within network policies?
&lt;/h2&gt;

&lt;p&gt;Earlier, I mentioned that network policies don’t refer to pods directly, as they’re ephemeral, but refer to them by labels. This is common practice in Kubernetes. However, network policies are unique in that they use labels to refer to two (or more) sets of pods that are often owned by different teams in the organization.&lt;/p&gt;

&lt;p&gt;This presents unique challenges because, for the network policy to function, the labels referenced by the policy and the labels attached to the pods must be kept in sync, with destructive consequences if you fail to do so – communication will be blocked! The pod labels for the client pods are managed by the client team, while the network policy that refers to them is managed by the server team, so you can see where things can get out of sync.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network policies are effectively owned by multiple teams
&lt;/h2&gt;

&lt;p&gt;This means that you need coordination between the teams, not only when the network policy is first deployed, but also over time as clients and servers evolve.&lt;/p&gt;

&lt;p&gt;What if you have a network policy that allows multiple clients to connect to one server? Now you’ve got the server team coordinating with two or more teams.&lt;/p&gt;

&lt;p&gt;For each change a client team proposes, the server team needs to not only change network policy rules referring to that client, but also make sure they don’t inadvertently affect other clients. This can be a cognitively difficult task, as the server team members normally don’t refer to pod labels belonging to other teams, so it may not immediately be clear which labels belong to which team.&lt;/p&gt;

&lt;p&gt;This reduces the ability for teams to set internal standards and work independently, and slows down development. If you don’t get this right, there can be painful points in the development cycle where changes are fragile and their pace slows to a crawl. The pain may lead to bikeshedding and inter-team politics, as teams argue over how things should be done, and growing frustration as client deployments are delayed as a result of server network policies not yet being updated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is everyone in your organization proficient with how network policies work?
&lt;/h2&gt;

&lt;p&gt;In many organizations, this is not the case. Network policies are already error-prone, with destructive consequences for even small mistakes. Asking every developer whose service calls another service to be familiar with network policies may be a tall order, with potential for failed deployments or failed calls that are hard to debug.&lt;/p&gt;

&lt;h1&gt;
  
  
  What would a good abstraction look like?
&lt;/h1&gt;

&lt;p&gt;A good solution for zero trust should be optimized for that specific outcome, whereas network policies are a bit of a Swiss Army knife: they aren’t just for pod-to-pod traffic, so they’re not optimized for this use case.&lt;/p&gt;

&lt;p&gt;The following 3 attributes are key for a good zero-trust abstraction that actually gets adopted:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single team ownership:&lt;/strong&gt; Each resource should only be managed by one team so that client teams can get access independently, and server teams don’t need to be involved if no changes are required on their end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static analysis should be possible:&lt;/strong&gt; It should be possible to statically check if a service will have access without first deploying it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal service identities&lt;/strong&gt;: Services should be referred to using a standard name that is close to or identical to their developer-friendly names, rather than pod labels.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Enter client intents
&lt;/h2&gt;

&lt;p&gt;At Otterize, we believe that client intents satisfy these requirements. Let me explain briefly what they are, and then examine whether they satisfy the above attributes.&lt;/p&gt;

&lt;p&gt;A client intents file is simply a list of calls to servers which a given client &lt;em&gt;intends&lt;/em&gt; to make. Coupled with a mechanism for resolving service names, the list of client intents can be translated to different authorization mechanisms, such as network policies.&lt;/p&gt;

&lt;p&gt;In other words, developers declare what their service &lt;em&gt;intends&lt;/em&gt; to access, and that can then be converted to a network policy and the associated set of pod labels.&lt;/p&gt;

&lt;p&gt;Here’s an example of a client intents file (as a Kubernetes custom resource YAML) for a service named &lt;strong&gt;client&lt;/strong&gt; calling another service named &lt;strong&gt;server&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s.otterize.com/v1alpha2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClientIntents&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client-intents&lt;/span&gt;

&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;client&lt;/span&gt;
  &lt;span class="na"&gt;calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Let’s see if this is a good abstraction
&lt;/h3&gt;

&lt;p&gt;Now let’s go back and review our criteria for a good zero-trust abstraction:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does a team own all of, and only, the resources it should be managing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Client intents files are deployed and managed together with the client, so only the client team owns them. You would deploy the &lt;em&gt;ClientIntents&lt;/em&gt; for this client along with the client, e.g. alongside its &lt;em&gt;Deployment&lt;/em&gt; resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can access be checked statically?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since services are first-class identities in client intents (rather than indirectly represented by pod labels), it is trivially possible to query which clients have access to a server, and whether a specific client has access to a server. As an added bonus, all the information for a single client is collected in a single resource in one namespace, instead of being split up across multiple namespaces where the servers are deployed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are service identities universal and natural?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Service names are resolved in the same manner across the entire organization, making it easy to reason about whether a specific service has a specific name.&lt;/p&gt;

&lt;h3&gt;
  
  
  How would a Kubernetes operator that manages these intents work?
&lt;/h3&gt;

&lt;p&gt;When intents are created for a client, the intents operator should automatically create, update and delete network policies, and automatically label client and server pods, to reflect precisely the client-to-server calls declared in client intents files. A single network policy is created per server, and pod labels are dynamically updated for clients when their intents update.&lt;/p&gt;

&lt;p&gt;Service names are resolved by recursively getting the owner of a pod until the original owner is found, usually a Deployment, StatefulSet, or other such resource. The name of that resource is used, unless the pod has a service-name annotation which overrides the name, in which case the value of that annotation is used instead.&lt;/p&gt;
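&lt;p&gt;The resolution logic described above can be sketched as a simple recursive walk. The object shapes mimic Kubernetes metadata, and the annotation key here is illustrative, not the operator’s actual key:&lt;/p&gt;

```python
# Sketch of service-name resolution: walk ownerReferences from the pod up
# to the root owner (e.g. a Deployment), unless an annotation overrides it.
# The annotation key is a hypothetical placeholder.
OVERRIDE_ANNOTATION = "service-name"

def resolve_service_name(obj, lookup):
    """`obj` is a dict shaped like Kubernetes object metadata;
    `lookup(kind, name)` fetches the owner object."""
    annotations = obj["metadata"].get("annotations", {})
    if OVERRIDE_ANNOTATION in annotations:
        return annotations[OVERRIDE_ANNOTATION]  # explicit override wins
    owners = obj["metadata"].get("ownerReferences", [])
    if not owners:
        return obj["metadata"]["name"]  # root owner: use its name
    owner = lookup(owners[0]["kind"], owners[0]["name"])
    return resolve_service_name(owner, lookup)
```

&lt;p&gt;Running this on a pod like checkoutservice-dj3847-e120, owned by a ReplicaSet owned by a Deployment, yields the developer-friendly name checkoutservice.&lt;/p&gt;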

&lt;h3&gt;
  
  
  Try out the intents operator!
&lt;/h3&gt;

&lt;p&gt;It won’t surprise you that we in fact built such an open source implementation, and it’s called the &lt;a href="https://github.com/otterize/intents-operator"&gt;Otterize intents operator&lt;/a&gt;. Give it a shot and see if it makes managing network policies easier for you 🙂&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>sre</category>
      <category>network</category>
    </item>
    <item>
      <title>Kubernetes traffic discovery</title>
      <dc:creator>Tomer Greenwald</dc:creator>
      <pubDate>Sun, 04 Jun 2023 15:36:12 +0000</pubDate>
      <link>https://dev.to/otterize/kubernetes-traffic-discovery-29nj</link>
      <guid>https://dev.to/otterize/kubernetes-traffic-discovery-29nj</guid>
      <description>&lt;p&gt;Recently, my team and I set out to build an &lt;a href="https://github.com/otterize/network-mapper"&gt;open-source tool for Kubernetes&lt;/a&gt; that automatically creates a functional map of which pods are talking to which pods: a map of the network of services running in the cluster. In this blog, I wanted to share how we approached the problem of figuring out “who’s calling whom” within a Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;The context was to provide a starting point for making it easy to access services securely: the network map would show what access is needed now, the cluster could be configured to only allow this intended access, and the map could then be evolved as the functional needs evolve, so security would always align with desired functionality rather than get in its way. You can read more about intent-based access control &lt;a href="https://otterize.com/ibac"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When we say “who’s calling whom”, we mean at the logical, functional level: e.g. “checkout service” calls “catalog service”. Having a logical network map of your cluster seems to be pretty useful regardless of context. In our experience, people who operate the cluster don’t necessarily know all the ways that services call on and depend on one another, so this may be useful for them for all sorts of reasons. It’s likely interesting for security, for dependency analysis, and just to catch things that look odd.&lt;/p&gt;

&lt;p&gt;So how do we tell who’s calling whom? One approach was the way &lt;a href="https://monzo.com/blog/we-built-network-isolation-for-1-500-services"&gt;Monzo Bank started their exploration of service-to-service calls&lt;/a&gt;: by analyzing service code to figure out what calls it intends to make. But we were wary that it would be very difficult if not impossible to make a general-purpose tool that covered most languages and would robustly catch most calls.&lt;/p&gt;

&lt;p&gt;So, instead, we opted for an approach based on sniffing network traffic over a period of time, assuming that all intended calls would happen at least once during that time, and figuring out who’s on either side of these calls. True, we might miss infrequent calls, but that’s a limitation we’d reveal in our docs; once developers were handed bootstrapped files describing most of their calls, they could always add any missed calls manually, if needed.&lt;/p&gt;

&lt;p&gt;We were left with two things to work out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Who’s calling whom at the network level, i.e. pairs of IP addresses; and&lt;/li&gt;
&lt;li&gt;What's the functional name of the services with those IP addresses.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post, we’ll describe the former, and leave the functional name resolution to a future post.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're already curious about the solution and want to check it out for yourself you can browse &lt;a href="https://github.com/otterize/network-mapper"&gt;https://github.com/otterize/network-mapper&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Goals and constraints
&lt;/h2&gt;

&lt;p&gt;We’re a very technical, curious, and persistent team, and we knew it wouldn’t take much to take us down a rabbit hole. So before brainstorming how to solve this, we made sure to write down what exactly we need to solve and what constraints an acceptable solution must meet:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Map pod-to-pod communication as pairs of IPs (client and server)&lt;/li&gt;
&lt;li&gt;Focus on in-cluster traffic for now (not cluster ingress and egress)&lt;/li&gt;
&lt;li&gt;Should work in most Kubernetes “flavors” without needing to tailor for each one&lt;/li&gt;
&lt;li&gt;Must be easy to get started&lt;/li&gt;
&lt;li&gt;Must export the output as structured text, for bootstrapping intents files&lt;/li&gt;
&lt;li&gt;Minimize dependencies on other tools which users may not have&lt;/li&gt;
&lt;li&gt;Minimize the impact on the cluster being observed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We pinned this on our wall (at least virtually) and set off to research a solution – this is always the fun part!&lt;/p&gt;

&lt;h2&gt;
  
  
  Don’t reinvent the wheel
&lt;/h2&gt;

&lt;p&gt;It’s often been said that the tool most used by developers is Google search, and this was no exception. Like most devs, we’re lazy, in a good way: we want to learn from others, we want to reuse what’s already out there, and we want to spend our time pushing the boundaries instead of reinventing the wheel. That’s especially true for a project like this: we wouldn’t want to find out after building something that it was not needed after all.&lt;/p&gt;

&lt;p&gt;So we started to look for open-source software that could sniff pod-to-pod traffic. It needed to be OSS because our overall solution would itself be OSS, so this part would have to fit in. There are certainly several projects out there. Perhaps the most well-known is &lt;a href="https://www.tigera.io/project-calico/"&gt;Calico by Tigera&lt;/a&gt;, so we started there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging traffic with Calico
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://monzo.com/blog/we-built-network-isolation-for-1-500-services"&gt;We drew inspiration from a blog post from Monzo about network isolation&lt;/a&gt;. They used Calico’s &lt;a href="https://projectcalico.docs.tigera.io/archive/v3.17/security/calico-network-policy#generate-logs-for-specific-traffic"&gt;logging&lt;/a&gt; capabilities to detect connections that would have been blocked by network policies and use the information to update their existing network policies. We considered using the same logging capability to achieve observability for all connections. The follow snippet demonstrates a Calico network policy that logs and allows all connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/v3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-and-allow-all&lt;/span&gt;
&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ingress&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;egress&lt;/span&gt;
&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
&lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In generic Kubernetes network policies, there is no action field. The Calico CNI plugin (Kubernetes network plugin that implements the &lt;a href="https://github.com/containernetworking/cni"&gt;Container Network Interface&lt;/a&gt;) provides this functionality, and in particular provides logging even for allowed traffic. And this worked when we tried it in our test clusters and in our own back end.&lt;/p&gt;

&lt;p&gt;But we also realized the hardships it might cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We’ll have to ship Calico with our solution, or require it as a prerequisite.&lt;/li&gt;
&lt;li&gt;More importantly, it might conflict with an existing Kubernetes network plugin the user might be using.&lt;/li&gt;
&lt;li&gt;And we were also pretty convinced that logging all allowed requests would push many clusters to the edge, resource-wise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So while Calico network policies provide much more capabilities over native ones, using Calico as part of Otterize OSS would not meet our goals and constraints. (Support for Calico could always be added later should the community be interested.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Any other OSS solutions out there?
&lt;/h2&gt;

&lt;p&gt;Kubernetes is the most popular container orchestration tool by far; it’s also open source, and it’s backed by a massive and active community. So we fully expected to find a wide array of tools that at least partially solve the same issues we’re tackling.&lt;/p&gt;

&lt;p&gt;One promising candidate was &lt;a href="https://getmizu.io/"&gt;Mizu&lt;/a&gt;, which is in fact a traffic viewer for Kubernetes. It’s easy to get started with, has a great UI for viewing traffic, and a cool network mapper that’s great for troubleshooting. But it was designed for troubleshooting and debugging, and apparently not for reuse, since there is no way to export the map or obtain it via the CLI, at least that we could find. For us, the tool needs to be part of the larger solution, so not having a way to export as text is a deal breaker for our use case. We could fork the project, implement an export, and send back a pull request with our enhancement. But…&lt;/p&gt;

&lt;p&gt;When looking to enhance a project, it’s important to understand what the project is aiming to do, not just what it does right now. Because Mizu is aimed at debugging use cases, it needs to look at all traffic, it needs to capture payloads and not just connections, and it needs various powerful capabilities such as seeing the plain text content of encrypted traffic. It’s simply not designed to be lightweight, reusable, low-impact, and easy to understand and vet. Adding an export feature would still leave its design goals far from our stated goals – it’s just not a good fit.&lt;/p&gt;

&lt;p&gt;We found other tools for monitoring, alerting, and logging, but none of the ones we looked at came close to meeting our goals. After some time, it became clear we needed to consider building something from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sniffing network traffic
&lt;/h2&gt;

&lt;p&gt;We’ve got a lot of collective experience in our team sniffing network traffic at various levels, and Kubernetes offers a rich and reasonably standardized platform to apply this. Let’s recall that all we were trying to do was to identify pairs of pod IPs, i.e. a caller and a callee. So looking at all the traffic in the cluster would be heavy-handed and wasteful. A better idea would be to look only at the SYN packets that initiate any TCP session and establish the connection between caller and callee. But upon further thought, we realized that we’d still likely see many connections over the lifetime of the same pair of pods. An even better approach presented itself: how about looking just at the DNS query made by the caller to initially find the IP address of the callee? Or even better, why not just look at DNS replies, which should be the most efficient approach? We went with DNS replies over TCP SYN because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNS responses are still available even when TCP gets blocked for some reason.&lt;/li&gt;
&lt;li&gt;DNS is typically less traffic intensive for several reasons:

&lt;ul&gt;
&lt;li&gt;DNS can benefit significantly from caching, so e.g. any external traffic with a TTL will usually hit the cache and won’t generate more DNS load.&lt;/li&gt;
&lt;li&gt;When TCP connections are blocked, we would still see TCP retransmissions: common TCP stacks (that run within the kernel, so they don’t re-resolve between attempts) will retransmit TCP SYNs 3 times before giving up.&lt;/li&gt;
&lt;li&gt;We could look at the names in DNS responses and determine whether they likely pointed at Kubernetes endpoints just by parsing them, which is much less intensive than attempting to resolve all IP addresses seen in TCP SYNs to pods.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;DNS offers intriguing possibilities to also discover traffic directed outside the cluster, perhaps to another cluster or to a non-Kubernetes-based service, because DNS replies contain the target service name as text. We are planning to expand the reach of the mapper beyond in-cluster traffic, and of course as the network mapper is open source, users can extend it to implement additional, perhaps custom, resolution mechanisms.&lt;/li&gt;
&lt;/ul&gt;
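&lt;p&gt;To make “look just at DNS replies” concrete, here’s a minimal sketch (in Python, standing in for the Go used in the actual mapper) of distinguishing a response from a query using the QR bit of the DNS header:&lt;/p&gt;

```python
import struct

# Sketch: decide whether a raw DNS message is a response by checking the
# QR bit, the top bit of the flags field in the 12-byte DNS header.
def is_dns_response(payload):
    # Assumes payload starts with a full DNS header (id, flags, counts).
    _txid, flags = struct.unpack("!HH", payload[:4])
    return bool(flags & 0x8000)  # QR bit set means "response"

# A query has QR clear (flags 0x0100, recursion desired);
# a typical answer has QR set (e.g. flags 0x8180).
query = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
answer = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0)
print(is_dns_response(query), is_dns_response(answer))  # prints: False True
```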

&lt;h2&gt;
  
  
  Approaches to sniffing DNS
&lt;/h2&gt;

&lt;p&gt;The straightforward way to get at DNS queries would be to use an efficient tool like tcpdump (or equivalent) to sniff for DNS queries, and then process them to figure out pod IP pairs.&lt;/p&gt;

&lt;p&gt;But another approach that could work in Kubernetes, because the DNS servers are within the cluster itself, would be to work directly with the DNS server pods. In most Kubernetes clusters, whether standalone or managed (GKE, AKS, EKS), the cluster DNS is either &lt;a href="https://coredns.io/"&gt;coredns&lt;/a&gt; or &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/kube-dns"&gt;kube-dns&lt;/a&gt;. That was great for minimizing how many configuration options we’d need to support. We realized we could edit the coredns or kube-dns configmap resources to enable their log option, which would make them log all the queries they handle. We’ll cover exactly how it’s done in more detail below.&lt;/p&gt;

&lt;p&gt;Both approaches seemed reasonable, so we tried them both. We started with the latter, thinking that it might be simpler and would not require any traffic sniffing, which could prevent some limitations of traffic sniffing down the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Kubernetes DNS server logs
&lt;/h2&gt;

&lt;p&gt;To enable the DNS server logging option, we simply edited the appropriate configmap for our DNS provider in the &lt;em&gt;kube-system&lt;/em&gt; namespace and added the &lt;em&gt;log&lt;/em&gt; option. For our EKS cluster, the coredns configmap looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Corefile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;.:53 {&lt;/span&gt;
      &lt;span class="s"&gt;log &amp;lt;--- ADD THIS; PODS WILL RELOAD THE CONFIG &amp;amp; START TO LOG&lt;/span&gt;
      &lt;span class="s"&gt;errors&lt;/span&gt;
      &lt;span class="s"&gt;health&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The coredns pods reload the configuration and, voila, we could see queries being logged:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;172.31.4.118:51448 - 48460 "AAAA IN nginx-svc.namespace.svc.cluster.local&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It seemed all was well with the world. We could move on to parsing these queries and start building the network map.&lt;/p&gt;
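&lt;p&gt;A first pass at parsing such log lines could look like this sketch (the regex is fitted to the line format above; real coredns log lines carry additional trailing fields):&lt;/p&gt;

```python
import re

# Sketch: pull the client address, record type and queried name out of a
# coredns query log line like the one above.
LOG_RE = re.compile(r'([\d.]+):(\d+) - \d+ "(\w+) IN ([\w.\-]+)')

def parse_coredns_log(line):
    match = LOG_RE.search(line)
    if match is None:
        return None
    client_ip, _port, qtype, name = match.groups()
    return client_ip, qtype, name

line = '172.31.4.118:51448 - 48460 "AAAA IN nginx-svc.namespace.svc.cluster.local'
print(parse_coredns_log(line))
# prints ('172.31.4.118', 'AAAA', 'nginx-svc.namespace.svc.cluster.local')
```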

&lt;p&gt;But lo and behold, after a minute or two, we could not see any more queries being logged! Were there still DNS queries being made? Certainly. So how come they’re no longer being logged?&lt;/p&gt;

&lt;p&gt;It turns out that, in a managed Kubernetes cluster, some things like the cluster’s DNS configuration are, well, managed for you. In an EKS cluster, for example, &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html"&gt;the coredns add-on is installed automatically&lt;/a&gt; when creating a cluster via the AWS management console. We noticed our changes to the coredns configmap went missing, and when checking the cluster’s audit log we saw the changes were overwritten by the &lt;em&gt;eks:addon-manager&lt;/em&gt; user. The same behavior was observed in a GKE cluster, with the kube-dns configmap being overwritten by the cloud provider.&lt;/p&gt;

&lt;p&gt;There are crude actions you can take to prevent this behavior. For example, we removed the patch permissions for configmaps from the &lt;em&gt;eks:addon-manager&lt;/em&gt; cluster role, and saw that our changes were no longer being overwritten. But those kinds of actions aren’t a good option, in our opinion: EKS should be able to manage the cluster. We felt that actions like editing default ClusterRole resources are aggressive, may have unforeseen consequences, and will be frowned upon by users wanting to adopt our solution. We also did not want to tailor a specific solution for each cloud provider, which would go against our goal of “one size fits most”. Nor did we want to tell our standalone Kubernetes users how to manage their DNS configs. All these reasons added up to a lot of discomfort with this solution, even if it initially seemed like the most straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Direct DNS network sniffing - success!
&lt;/h2&gt;

&lt;p&gt;And so we went back to the first DNS-based approach: filtering out of the network traffic all but DNS queries, and processing them to build out a map of pod-to-pod IP pairs.&lt;/p&gt;

&lt;p&gt;We set up a quick &lt;em&gt;tcpdump&lt;/em&gt;-based POC of a DaemonSet running with the &lt;em&gt;hostNetwork: true&lt;/em&gt; option, so it ran within the network namespace of the Kubernetes node itself. It simply captured DNS requests (UDP port 53) and logged them. &lt;a href="https://kubernetes.io/docs/concepts/security/pod-security-policy/#host-namespaces"&gt;According to the Pod Security Policy documentation&lt;/a&gt;, any pod using the host’s network “could be used to snoop on network activity of other pods on the same node”.&lt;/p&gt;

&lt;p&gt;We could now see our solution in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# An example DNS query to 'cartservice' from our lab namespace&lt;/span&gt;
192.168.15.228.43303 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 192.168.33.60.53: 39615+ A? cartservice.labstg.svc.cluster.local. &lt;span class="o"&gt;(&lt;/span&gt;54&lt;span class="o"&gt;)&lt;/span&gt;
192.168.15.228.43303 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 192.168.33.60.53: 41003+ A? cartservice.svc.cluster.local. &lt;span class="o"&gt;(&lt;/span&gt;47&lt;span class="o"&gt;)&lt;/span&gt;
192.168.15.228.43303 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 192.168.33.60.53: 42023+ A? cartservice.cluster.local. &lt;span class="o"&gt;(&lt;/span&gt;43&lt;span class="o"&gt;)&lt;/span&gt;
192.168.15.228.43303 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 192.168.33.60.53: 56922+ A? cartservice.ec2.internal. &lt;span class="o"&gt;(&lt;/span&gt;42&lt;span class="o"&gt;)&lt;/span&gt;
192.168.33.60.53 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 192.168.15.228.43303: 39615&lt;span class="k"&gt;*&lt;/span&gt;- 1/0/0 cartservice.labstg.svc.cluster.local. A 10.100.115.187 &lt;span class="o"&gt;(&lt;/span&gt;106&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we now knew we could reliably tail DNS queries in the cluster. So we set about converting our solution from a &lt;em&gt;tcpdump&lt;/em&gt;-based POC to a more robust Go solution, using &lt;a href="https://pkg.go.dev/github.com/google/gopacket"&gt;gopacket&lt;/a&gt; to actually parse the requests and build connection pairs.&lt;/p&gt;

&lt;p&gt;We also noticed, as you can see in the above example, that we were still seeing multiple requests being sent for a single DNS lookup, and figured that it was due to our pods having multiple entries under &lt;em&gt;search&lt;/em&gt; in their &lt;em&gt;resolv.conf&lt;/em&gt; file. From the &lt;em&gt;man&lt;/em&gt; page for &lt;em&gt;resolv.conf&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Resolver queries having fewer than ndots dots (default is 1) in them will be attempted using each component of the search path in turn until a match is found.&lt;/code&gt;&lt;/p&gt;
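&lt;p&gt;The expansion behavior the man page describes can be modeled with a short sketch (a simplified model of the glibc resolver, not the mapper’s actual code):&lt;/p&gt;

```python
# Simplified model of resolv.conf search-list expansion (glibc behavior,
# approximately): a name with fewer than `ndots` dots is tried with each
# search suffix before being tried as-is; a trailing dot suppresses it.
def candidate_names(name, search, ndots=1):
    if name.endswith("."):
        return [name]  # fully qualified: no expansion
    suffixed = [name + "." + s for s in search]
    if name.count(".") in range(ndots):  # i.e. fewer than ndots dots
        return suffixed + [name]  # suffixes first, bare name last
    return [name] + suffixed

# Search domains as seen in a pod's resolv.conf on our EKS lab cluster
search = ["labstg.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "ec2.internal"]
for fqdn in candidate_names("cartservice", search):
    print(fqdn)
```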

&lt;p&gt;This means that if a DNS query occurs in a pod using just a service name (without a fully qualified domain name), the search suffixes will be tried in turn, hence the extra DNS requests. So a single lookup like &lt;em&gt;nslookup cartservice&lt;/em&gt; can generate several queries on the wire, like the example above. That presented an obvious optimization: why not listen only for DNS answers? After changing our code to filter out all but DNS answers, we only see the one line:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;192.168.33.60.53 &amp;gt; 192.168.15.228.43303: 39615*- 1/0/0 cartservice.labstg.svc.cluster.local. A 10.100.115.187 (106)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Success! We now process even less data, further reducing our resource requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Completing the picture
&lt;/h2&gt;

&lt;p&gt;DNS replies, of course, will only be generated when a connection is initiated. What about long-lived connections in the cluster that were initiated before the tool was turned on? To deal with those, we took an approach similar to the well-known &lt;em&gt;netstat&lt;/em&gt; tool. In addition to capturing DNS traffic, we parse the files representing existing connections in &lt;em&gt;/proc&lt;/em&gt; on each node, which provides us with the IP addresses of the endpoints participating in a connection. There are other means for finding all open connections and resolving them to the relevant pods (such as with eBPF), but we preferred a method which users can easily reason about. Many people know what &lt;em&gt;netstat&lt;/em&gt; outputs – not many understand eBPF well.&lt;/p&gt;
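&lt;p&gt;As a sketch of that &lt;em&gt;netstat&lt;/em&gt;-style pass (in Python rather than Go; the sample line is made up, and the byte reversal assumes a little-endian host), here’s how the address columns of a &lt;em&gt;/proc/net/tcp&lt;/em&gt; entry decode:&lt;/p&gt;

```python
import socket

# Sketch: decode the local/remote address columns of a /proc/net/tcp
# entry. Each column is the IPv4 address in hex, host byte order (so the
# bytes appear reversed on x86), then a colon and the port in hex.
def decode_addr(hex_addr):
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])  # undo byte reversal
    return ip, int(port_hex, 16)

def parse_proc_net_tcp_line(line):
    fields = line.split()
    return decode_addr(fields[1]), decode_addr(fields[2])

# Made-up sample entry: 192.168.15.228:43303 connected to 10.100.115.187:6379
line = "  1: E40FA8C0:A927 BB73640A:18EB 01 00000000:00000000 00:00000000 00000000"
print(parse_proc_net_tcp_line(line))
# prints (('192.168.15.228', 43303), ('10.100.115.187', 6379))
```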

&lt;h2&gt;
  
  
  Wrapping it up
&lt;/h2&gt;

&lt;p&gt;After multiple iterations, research sessions and some trial &amp;amp; error, we could produce an exportable list of network connections in any Kubernetes cluster. You might recall that our larger goal was to get to a logical (functional) map of pod-to-pod traffic; that will be covered in a future post. After adding that capability, here’s an example output from our project, now called &lt;a href="https://github.com/otterize/network-mapper"&gt;network-mapper&lt;/a&gt;, when pointed at one of the clusters in our “lab” environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cartservice in namespace otterize-ecom-demo calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis-cart&lt;/span&gt;
&lt;span class="na"&gt;checkoutservice in namespace otterize-ecom-demo calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cartservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;currencyservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;emailservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;paymentservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;productcatalogservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;shippingservice&lt;/span&gt;
&lt;span class="na"&gt;frontend in namespace otterize-ecom-demo calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;adservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cartservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;checkoutservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;currencyservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;productcatalogservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;recommendationservice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;shippingservice&lt;/span&gt;
&lt;span class="na"&gt;loadgenerator in namespace otterize-ecom-demo calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
&lt;span class="na"&gt;recommendationservice in namespace otterize-ecom-demo calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;productcatalogservice&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our solution has low cluster impact, no external dependencies, should work in any Kubernetes flavor, and is quite minimal, making it easy to plug in wherever a pod-to-pod communication map is needed. We felt we successfully achieved our goals without violating any of the principles and constraints we set at the beginning of our journey. After more polish and bug fixes, we were ready to release it to the OSS community.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Please take a look; we’d love your feedback and input. Of course, pull requests are always appreciated ;-)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/otterize/network-mapper"&gt;https://github.com/otterize/network-mapper&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
