Our story begins with me, a platform engineer, tasked with encrypting and authenticating traffic within our production environment – “our” referring to a previous employer. The business motivation was simple: we had guaranteed our customers that traffic within our backend would be encrypted and authenticated.
The idea behind encrypting and authenticating traffic is that our platform, used by the dev team to deploy workloads, should not expose its data or operations, whether through the malicious activities of attackers or even just accidents, like mistakenly linking production and staging workloads.
Mutual TLS, or mTLS, seemed like the right tool for the job, so I started researching how to achieve mTLS for our platform.
As a platform engineer, my role wasn’t just to achieve security, but to do it in a way that was easy for the rest of the team to work with and understand.
In this post, I’ll share with you the considerations and the process I went through when solving for three attributes: (1) encryption, (2) authentication, and (3) simplicity from a developer experience point of view. There are many ways to achieve (1) and (2), but coupling them with (3) is the hardest part, and that was our guiding north star. By sharing, I hope both to learn – from readers willing to share their own experience and from further research – and to provide useful insights for others who might be working through a similar process.
But first, a little primer on mTLS for the uninitiated. Skip ahead if you’re already a level 60 warlock.
What’s mTLS?
mTLS is just like TLS, but with one extra consideration. In mTLS, both the client and server are authenticated, whereas in standard TLS only the server is authenticated. That’s where the m comes from: mutual.
To understand why you might need this, consider a common use-case for TLS: web browsing using HTTPS (which is just HTTP wrapped by TLS). When you browse to yourbank.com, you, the user, want to know that you are really on the website of your bank, and that when you sign in and view your information or make transactions, you can trust that your bank stands behind all of that. You assume that your bank is indeed the owner of the yourbank.com domain; with TLS, you can be sure that only the owner of the yourbank.com domain, namely your bank, can be on the other side of your request for yourbank.com.
How can you be sure? Because your browser can validate, when you connect to the yourbank.com server, that the certificate presented by the server is indeed legitimately owned by your bank, as attested to by a third-party entity (a Certificate Authority) that your browser trusts and that signed the certificate. Your browser does this validation for you automatically when you browse to a URL that starts with https: if validation succeeds, you see the familiar and reassuring lock icon in the address bar, and if it fails, your browser warns you in no uncertain terms not to proceed.
Now, in standard TLS as commonly used with HTTP, while the client verifies the server’s identity, the server does not validate the client’s identity. That authentication usually happens via some other mechanism after the server has been authenticated. In the case of the bank, that probably involves asking for your login credentials.
mTLS moves that authentication to the connection level, rather than doing it afterwards. Not only are the client and server both authenticated, but the mechanism for doing so is completely standardized (it’s just TLS). It’s also more secure than the token or cookie mechanisms that are often used, say when you log in to your bank, because mTLS is not vulnerable to token replay attacks, and because no secrets are transmitted at any point in the communication.
So far we’ve only discussed authentication: validating the communicating parties. But TLS, and by extension mTLS, also provide confidentiality, i.e. that third parties cannot see the data, and integrity, i.e. that the data cannot be modified in transit.
Our stack was pretty standard, and yet deploying mTLS is still difficult
Back to my situation. In our team we had a polyglot architecture: a mix of services written in Go, Python and node.js.
This was all running on Kubernetes, coupled with Google Cloud SQL for PostgreSQL and an HAProxy deployment managed by an ingress controller (jcmoraisjr/haproxy-ingress with a modified config file template). Branch or test deployments were a little different: the database was deployed on Kubernetes directly, to make it simple to deploy additional environments without spinning up resources outside of Kubernetes.
Inter-service communication was accomplished using REST (the Python services), gRPC (Go and node.js), as well as proxy traffic between HAProxy and the services configured by the ingress (REST, GraphQL). All of these different kinds of communication had to be encrypted and authenticated.
A priority for the team: keep it simple
Our thinking was that we would have to keep it simple: adding mTLS to our platform should not require attention from most engineers most of the time; rather, it should just work. We were a small team of 15, and engineers were often required to make changes across multiple services and the platform itself, so any new component or technology we introduced would have its onboarding cost multiplied across everyone. The solution should be as simple as possible, to reduce friction whenever the team needed to work on the stack.
There are various ways to evaluate and achieve simplicity. For me, it always starts with: can we reduce, or at least not increase, the number of technologies the team needs to work with? Even very good engineers can only be truly good at a finite number of things, and in a small team that needed to deal with everything in the stack, that stack had to be limited to the fewest technologies possible. Sure, adding just one more tech might seem like a “best of breed” solution to a complex problem, but what does adding it do to the effectiveness of the team? How much context switching do they now need to do, how many more problems do they need to contend with, how much longer do critical problems take to debug?
I’ll get back to how I think about simplicity at the conclusion.
So go with a service mesh, right? It’s just one new technology
Service meshes promise to take care of service communications outside of the service implementation, which seemed to address exactly the problem I was looking to solve: developers would focus on writing code, and – if everything worked – they would not have to be concerned with mTLS, since the service mesh would handle it outside their code’s container.
In fact, service meshes are a bit of a swiss army knife: they address a lot of problems, like load balancing, service discovery, observability, and – yep – encryption and authentication. They do that by deploying sidecar proxies and managing them via a unified control plane, which sounds great – one tech across the entire stack! But also, one new tech across the entire stack, with multiple moving pieces and many ways to configure and use it.
As a small team, where each developer already has to know quite a bit, we were wary of introducing new runtime components, especially ones at the core of everything, and that could make understanding and debugging our Kubernetes deployment that much more complicated when things didn’t work quite right.
Perhaps that would be worth the risk if we needed the service mesh for many other needs, but we didn’t. We were already heavy Datadog users, so our observability needs were pretty well served. We had simple load balancing needs that were met by an in-cluster HAProxy ingress controller. And service discovery was already achieved just fine, through plain old Kubernetes services and DNS.
So should we really introduce sidecars? Another control plane? A bunch of new resources, configurations and tools that everyone had to get familiar with to debug certain situations, on top of mTLS, CAs and certificate management? A service mesh wouldn’t truly solve the problem, because there were as many unsolved cases (such as PostgreSQL managed by Google Cloud SQL, which can’t be part of the service mesh) as there were solved cases; and it would add not only new moving components (the service mesh itself) but also new skills to learn and new ways for things to go wrong.
So we decided to go back to the built-in capabilities of our stack (meaning gRPC, PostgreSQL, HAProxy, etc.) and find ways to roll out mTLS within them. The only complexity we really needed to introduce was mTLS itself, and we would do it in a way that induced as little variance as possible between workloads.
Implementing mTLS
I researched how mTLS can be implemented with each of our tools: gRPC in Go, Python and node.js; HTTP servers and clients in Python; GraphQL server in node.js; Google Cloud SQL; and HAProxy. Here are the steps we’d need to take, at a high level:
- Generate pairs of keys and certificates (“keypairs”) for each environment (production, staging, etc.)
- Distribute the keypairs to each component using Kubernetes secrets
- Configure clients and servers to present the keypairs to each other, and to trust them when they’re presented – i.e. when authenticating the presenter.
The following is a look into the work that went into implementing mTLS.
Generating key pairs
Our goal was to create separation between environments: separate production, staging and dev. Since our use case was simple, we didn’t need separate keys for each workload: the entire environment could share the same key. This also meant that we didn’t need a CA to sign them, but rather just a self-signed certificate, which can only be used to establish connections with parties that trust that exact certificate (a practice called certificate pinning).
So we first needed to generate a key pair with a self-signed certificate for each environment. Thankfully, this is easy to do, with just a single OpenSSL CLI command:
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj '/CN=production.example.com'
This generates an RSA 4096-bit keypair, with the key written into key.pem and the cert into cert.pem, valid for 10 years. We’ve also specified a CN, or Common Name, for the certificate, that maps to a hostname that’s specific for that environment.
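One caveat worth noting: modern TLS stacks (Go 1.15 and later, for example) reject certificates that carry only a Common Name and no Subject Alternative Name, so depending on your client libraries you may need to add a SAN as well (the -addext flag requires OpenSSL 1.1.1+). Here’s a variant of the command above that does that, plus a quick way to inspect what you generated:

openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj '/CN=production.example.com' -addext 'subjectAltName=DNS:production.example.com'
openssl x509 -in cert.pem -noout -text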
Push the keypairs to Kubernetes as secrets
With this we had keypairs we could use to establish connections secured with mTLS, but we then had to distribute them as secrets to the workloads. At the time, we did not have any PKI infrastructure, or any infrastructure for managing secrets, so we used plain Kubernetes secrets, created directly. Specifically, we just used kubectl create secret to stick with our approach of keeping things simple. To automate things, we had a script that would create all the necessary secrets, and another script that deployed a cluster using the gcloud (Google Cloud) CLI. No CloudFormation, Terraform, Vault, or anything like that.
So here’s how we would create a secret with two entries, key and cert, using the key.pem and cert.pem from the last section, respectively:
kubectl create secret generic mtls -n production --from-file=key=key.pem --from-file=cert=cert.pem
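To sanity-check what ended up in the secret, you can pull the cert back out and inspect it – something along these lines (assuming GNU base64):

kubectl get secret mtls -n production -o jsonpath='{.data.cert}' | base64 -d | openssl x509 -noout -subject -dates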
Mounting the secrets into pods
Now that we’ve got the secrets, we need to mount them into all of the relevant pods - all the functional service pods, as well as HAProxy. The following snippet from a Deployment resource (it’s approximate but gets the point across) will mount the keys and certs as files into /var/mtls on the local filesystem of each pod.
[...]
spec:
  containers:
  - name: ...
    volumeMounts:
    - name: mtls
      mountPath: "/var/mtls"
      readOnly: true
  volumes:
  - name: mtls
    secret:
      secretName: mtls
Configuring clients and servers: Go & gRPC
We had to configure servers in Go (gRPC), Python (Flask), and node.js (GraphQL). There are plenty of guides and docs on how to do this, if you’re curious :-) For the sake of brevity, I’ll only give an example for Go (gRPC) to illustrate what this entails. Here’s the heart of it:
Client:
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// LoadKeyPair loads the mounted keypair and returns gRPC transport credentials that
// present our certificate and trust only that same certificate on the server side.
func LoadKeyPair() credentials.TransportCredentials {
	certificate, err := tls.LoadX509KeyPair("/var/mtls/cert", "/var/mtls/key")
	if err != nil {
		panic("failed to load client certificate: " + err.Error())
	}
	// The certificate is self-signed, so it doubles as our "CA": we trust exactly this cert.
	ca, err := ioutil.ReadFile("/var/mtls/cert")
	if err != nil {
		panic("can't read CA file: " + err.Error())
	}
	capool := x509.NewCertPool()
	if !capool.AppendCertsFromPEM(ca) {
		panic("invalid CA file")
	}
	tlsConfig := &tls.Config{
		Certificates: []tls.Certificate{certificate},
		RootCAs:      capool,
	}
	return credentials.NewTLS(tlsConfig)
}

func main() {
	conn, err := grpc.Dial("localhost:10200", grpc.WithTransportCredentials(LoadKeyPair()))
	if err != nil {
		panic("failed to dial: " + err.Error())
	}
	defer conn.Close()
}
Server:
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// LoadKeyPair returns transport credentials that require clients to present the environment's shared self-signed certificate.
func LoadKeyPair() credentials.TransportCredentials {
	// Note: LoadX509KeyPair takes the certificate file first, then the key file.
	certificate, err := tls.LoadX509KeyPair("/var/mtls/cert", "/var/mtls/key")
	if err != nil {
		panic("failed to load server certificate: " + err.Error())
	}
	data, err := ioutil.ReadFile("/var/mtls/cert")
	if err != nil {
		panic("failed to load CA file: " + err.Error())
	}
	capool := x509.NewCertPool()
	if !capool.AppendCertsFromPEM(data) {
		panic("can't add CA cert")
	}
	tlsConfig := &tls.Config{
		// This is what makes the TLS mutual: require and verify a client certificate.
		ClientAuth:   tls.RequireAndVerifyClientCert,
		Certificates: []tls.Certificate{certificate},
		ClientCAs:    capool,
	}
	return credentials.NewTLS(tlsConfig)
}

func main() {
	server := grpc.NewServer(
		grpc.Creds(LoadKeyPair()),
	)
	listener, err := net.Listen("tcp", ":10200")
	if err != nil {
		panic("failed to listen: " + err.Error())
	}
	// Service registration omitted for brevity.
	if err := server.Serve(listener); err != nil {
		panic("failed to serve: " + err.Error())
	}
}
Configuring clients and servers - HAProxy
So we’ve configured 3 different kinds of servers and clients. Are we done? Nope! It’s time to configure the ingress load balancer, so that the external traffic it funnels to externally-exposed services is also protected by mTLS. We did this by adjusting the template for the HAProxy config file which the ingress controller uses to configure the HAProxy instances.
We had to configure each backend server that HAProxy knows about – those are the servers we’d just configured – so that HAProxy connects to them over mTLS.
It looks something like this:
server srv001 <templated by ingress controller> ssl verify required ca-file /var/mtls/cert crt /var/mtls/key+cert
The sharp-eyed will have noticed that we’re not using the key and cert as separate files like we did for the other services. HAProxy takes a keypair – the key followed by the cert – as a single file, so we’ll have to prepare a special version for HAProxy. Fortunately this is also easy to do from the CLI:
cat key.pem cert.pem > key+cert.pem
Now we can add it to a Kubernetes secret:
kubectl create secret generic haproxy-mtls -n production --from-file=key+cert=key+cert.pem --from-file=cert=cert.pem
Configuring clients and servers - Google Cloud SQL for PostgreSQL
So far so good! We were able to use the same key and cert for the entire environment, with just minor adjustments for HAProxy. This hopefully makes it easier for the team to grok what’s going on.
Unfortunately, Google Cloud SQL for PostgreSQL doesn’t let you bring your own keypair for the server, and you can’t use a self-signed cert! You have to use their CA, and you have to generate a client keypair through Cloud SQL, download it, and use that to authenticate and authorize the client (see the Cloud SQL documentation). Fortunately, this is also possible to do from the CLI, so we could keep using our barebones method for generating secrets and storing them on Kubernetes. Once we’d generated the keypair, we stored the key, cert and – this time – also the CA cert in a Kubernetes secret as usual, and mounted it into all our client services.
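For reference, generating the client keypair and downloading the server CA can be done with the gcloud CLI roughly along these lines; the client-cert name, instance name and file names below are placeholders, and the exact flags may differ for your setup:

gcloud sql ssl client-certs create my-client client-key.pem --instance=my-instance
gcloud sql ssl client-certs describe my-client --instance=my-instance --format="value(cert)" > client-cert.pem
gcloud sql instances describe my-instance --format="value(serverCaCert.cert)" > server-ca.pem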
Let’s look at an example for how to configure a PostgreSQL client with mTLS in Go. We used pq, a database driver for PostgreSQL. pq takes mTLS configuration in the form of a connection string. Here’s how you would initialize a connection, assuming the key, cert and CA cert are at /var/postgres-mtls/key, /var/postgres-mtls/cert, and /var/postgres-mtls/cacert, respectively:
conn, err := sql.Open("postgres", "[...] sslmode=verify-full sslrootcert=/var/postgres-mtls/cacert sslcert=/var/postgres-mtls/cert sslkey=/var/postgres-mtls/key")
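To make that concrete, here’s a minimal, self-contained sketch of such a client; the host, user and dbname values are hypothetical placeholders, and only the ssl* parameters matter for the mTLS part:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// Hypothetical connection details; the ssl* parameters point at the mounted secret.
	connStr := "host=10.0.0.5 user=app dbname=app " +
		"sslmode=verify-full " +
		"sslrootcert=/var/postgres-mtls/cacert " +
		"sslcert=/var/postgres-mtls/cert " +
		"sslkey=/var/postgres-mtls/key"

	db, err := sql.Open("postgres", connStr)
	if err != nil {
		log.Fatalf("failed to open connection: %v", err)
	}
	defer db.Close()

	// sql.Open doesn't actually connect; Ping forces the TLS handshake so
	// certificate problems surface immediately.
	if err := db.Ping(); err != nil {
		log.Fatalf("failed to connect to PostgreSQL: %v", err)
	}
}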
Kept it simple… enough?
Even though our requirements were very minimal – just 2 sets of credentials per environment, shared across all services – and even though our tech stack was also very minimal – basically no new software, tech or concepts beyond what we already had and mTLS itself – the end result was still, in reality, complex to operate.
Here are the problems we managed to deal with, after some more investment:
- Local development was more challenging than expected. We used Docker Compose for local deployment, and now parts of the code expected mTLS credentials that only existed in Kubernetes deployments. We had to add environment variables that disabled mTLS in the code that handled Postgres and inter-service communication (there’s a sketch of this pattern right after this list).
- Also in the context of local development, when someone ran a service directly from their IDE (to make debugging easier), they had to disable mTLS as well. We added the ability for services to auto-detect that they were running locally and disable mTLS. Needless to say, we had to do that separately for each service, as they didn’t all share the same tech stack.
- Connecting local instances to instances running on Kubernetes was even more challenging: say you ran a single service locally and hooked it up to another service running within a test deployment on Kubernetes. Before, you could simply run kubectl port-forward and wire it up; now, the test deployment expected only mTLS connections. We wrote a little CLI utility that helped you fetch the appropriate mTLS credentials for connecting to test deployments.
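To illustrate, here’s a minimal sketch of the kind of escape hatch described in the first bullet, in Go. It’s not our exact code: the DISABLE_MTLS variable name is hypothetical, and LoadKeyPair is the helper from the gRPC client example earlier.

// Assumes "os" and "google.golang.org/grpc" are imported, and LoadKeyPair is in the same package.
func dialOptions() []grpc.DialOption {
	if os.Getenv("DISABLE_MTLS") != "" {
		// Local development (Docker Compose, IDE runs): no certificates are mounted,
		// so connect without TLS.
		return []grpc.DialOption{grpc.WithInsecure()}
	}
	// Kubernetes deployments: use the mTLS credentials from the mounted secret.
	return []grpc.DialOption{grpc.WithTransportCredentials(LoadKeyPair())}
}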
Even after we solved those problems, there was still a lack of strong understanding within the team for how this all works. For example, using the same key and cert for all workloads turned out to be confusing for some rather than helpful: they assumed that certs were credentials just like tokens or usernames/passwords, so having a “shared credential” was very odd to them. This caused some people to avoid adding anything that required inter-service connectivity and mTLS, which was obviously problematic for getting things done.
Really everyone just wanted to say “I want to connect to this service” and know how to connect, without dealing with the details of how it happens. Doesn’t seem like too much to ask, does it?
What have we learned?
We ended up with a solution that worked well enough for us to continue. It simply wasn’t worth our investment, at the time, to completely solve the problem in a way that minimized friction and optimized our engineering resources – and I think that’s usually the case. When you have a business to run, you’re not always going to have the time to solve all the problems along the way. Sometimes what you’ve solved suboptimally is good enough – unless, of course, somebody else has solved it for you (see my personal note below). 😉
Now, as an engineer, you still want to be as productive as possible concerning your task – you want to be able to focus on the understanding required to complete your task and reduce the number of things you have to keep in mind to successfully complete it. A good platform helps you do that by minimizing how much a developer needs to know to complete most tasks. It may not be possible to eliminate all complexity, but a good platform lets you minimize the parts you don’t want your team to deal with. Without such a good platform – and we did not have one – you end up with a gap between making the engineers as productive as they should be, and investing in your core business.
A personal note
My mind kept going, though… contemplating what such a platform could look like. I wished somebody would solve this problem in a way that went the extra mile and truly, significantly reduced friction for all engineers in the organization.
You want the engineers to not have too many things they have to keep in mind, in particular security mechanisms, and multiple ways of configuring access controls. To minimize operational complexity, the solution should just fit into the way they operate already: same processes, same tools. And if anyone has to learn a new tech or take on a complex setup task, it should be the platform engineers, not every functional developer in the organization.
This experience, among many others like it, provided the core motivation for me to build Otterize. Secured access for your services should not be such an ordeal. It should be simple to grok, and easy to do things like adding a new service, using a new third-party service, or doing local development. If your team has to know a magic incantation, or people rely on tribal knowledge of how things work, then the platform is not good enough.