<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Philippe Bürgisser</title>
    <description>The latest articles on DEV Community by Philippe Bürgisser (@pburgisser).</description>
    <link>https://dev.to/pburgisser</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F589688%2F77990e23-61b6-487f-a334-3735fc08e7aa.jpeg</url>
      <title>DEV Community: Philippe Bürgisser</title>
      <link>https://dev.to/pburgisser</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pburgisser"/>
    <language>en</language>
    <item>
      <title>MISP at scale on Kubernetes</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Thu, 17 Nov 2022 14:37:05 +0000</pubDate>
      <link>https://dev.to/pburgisser/misp-at-scale-on-kubernetes-2k31</link>
      <guid>https://dev.to/pburgisser/misp-at-scale-on-kubernetes-2k31</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://www.misp-project.org/"&gt;MISP&lt;/a&gt; platform is an open-source project to collect, store, distribute and share Indicators of Compromise (IoC). The project is mostly written in PHP by more than 200 contributors and currently has more than 4000 stars on GitHub.&lt;/p&gt;

&lt;p&gt;The MISP’s core is composed of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: a PHP-based application offering both a UI and an API&lt;/li&gt;
&lt;li&gt;Background jobs: wait for signals to download IoCs from feeds, index IoCs, etc.&lt;/li&gt;
&lt;li&gt;Cron jobs: trigger recurrent tasks such as IoC downloads and updates of internal information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also relies on external resources: a SQL database, a Redis instance and an HTTP frontend proxying requests to PHP.&lt;/p&gt;

&lt;p&gt;At KPMG-Egyde, we operate our own MISP platform and enrich its content based on our findings, which we share with some of our customers. We wanted to provide a scalable, highly available and performant platform, which is why we decided to move the MISP from a traditional virtualized infrastructure to Kubernetes.&lt;/p&gt;

&lt;p&gt;As of writing this article, there is no official guide on how to deploy and operate the MISP on Kubernetes. Containers and a few Kubernetes manifests are available on GitHub. Despite the great efforts made to run the MISP in containers, these solutions aren’t made to scale, yet.&lt;/p&gt;

&lt;h1&gt;
  
  
  State of the art
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Traditional deployments
&lt;/h2&gt;

&lt;p&gt;The MISP can be deployed on traditional Linux servers running either on-premises or at cloud providers. The installation guide provides scripts to deploy the solution; external services such as the database are also part of the installation script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud images
&lt;/h3&gt;

&lt;p&gt;The project &lt;a href="https://github.com/MISP/misp-cloud"&gt;misp-cloud&lt;/a&gt; is providing ready to use AWS AMI containing the MISP platform as well as all other external component on the same image. They may provide images for Azure and DigitalOcean in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MISP-Docker
&lt;/h3&gt;

&lt;p&gt;The official MISP project provides a &lt;a href="https://github.com/MISP/misp-docker"&gt;containerized&lt;/a&gt; version of the MISP where all elements except the SQL database are included in a single container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coolacid’s MISP-Docker
&lt;/h3&gt;

&lt;p&gt;The project &lt;a href="https://github.com/coolacid/docker-misp"&gt;MISP-Docker&lt;/a&gt; from &lt;a href="https://github.com/coolacid"&gt;Coolacid&lt;/a&gt; is providing a containerized version of the MISP solution. This all-in-one solution includes the frontend, background jobs, cronjobs and an HTTP Server (Nginx) all orchestrated by process manager tool called &lt;a href="http://supervisord.org/"&gt;supervisor&lt;/a&gt;. External services such as the database and Redis aren’t part of the container but are necessary. We decided that this project is very a good starting point to scale the MISP on Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7hH5ZAiq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpimxpmghn8uzaaytslo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7hH5ZAiq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpimxpmghn8uzaaytslo.png" alt="Coolacid's processes design" width="851" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Scaling the MISP
&lt;/h1&gt;

&lt;h2&gt;
  
  
  v1. Running on an EC2 instance
&lt;/h2&gt;

&lt;p&gt;In our first version of the platform, we deployed the MISP on a traditional hardened EC2 instance where the MySQL database and Redis were running within the instance. Running everything in a VM caused scaling issues and required the database to be either outside the solution or synced between instances. Also, running an EC2 instance means petting it: applying the latest patches, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  v2. Running on EKS
&lt;/h2&gt;

&lt;p&gt;In order to simplify operations, but also for the sake of curiosity, we decided to migrate the traditional EC2-based MISP platform to Kubernetes, more specifically to the AWS EKS service. To fix the scaling and synchronization problems, we decided to use AWS managed services: RDS (MySQL) and ElastiCache (Redis).&lt;/p&gt;

&lt;p&gt;After a few months of operating the MISP on Kubernetes, we struggled to scale the deployment due to the design of the container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entrypoint: the startup script copies files and updates rights and permissions at every startup&lt;/li&gt;
&lt;li&gt;PHP sessions: by default, PHP stores sessions locally, so sessions are lost when a request lands on another replica&lt;/li&gt;
&lt;li&gt;Background jobs: every replica runs the background jobs, so the jobs aren’t found by the UI and appear to be down&lt;/li&gt;
&lt;li&gt;Cron jobs: jobs are scheduled at the same time on every replica, so all pods execute the same task simultaneously and load the external components&lt;/li&gt;
&lt;li&gt;Lack of monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  v3. Running on EKS
&lt;/h2&gt;

&lt;p&gt;Using containers and Kubernetes also brings new concepts such as micro-services and single-process containers. We realized that the MISP is actually two programs written in a single code base (frontend/UI and background jobs) sharing the same configuration. In our quest to scale, we decided to split the container into smaller pieces. As mentioned above, this version of our implementation is based on Coolacid’s amazing work, which we forked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend
&lt;/h2&gt;

&lt;p&gt;In this &lt;a href="https://github.com/pburgisser/misp-container-frontend"&gt;version&lt;/a&gt;, we kept the Nginx and PHP-FPM implementation from Coolacid’s container and removed the workers and cron jobs from it. Nginx and PHP-FPM are started by Supervisor, and the application configuration is mounted as static files (ConfigMap, Secret) in the container instead of being set dynamically.&lt;/p&gt;
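As a sketch, mounting the configuration could look like the following Deployment fragment. The image and ConfigMap names are illustrative, not taken from the actual chart; only the MISP config path matches the paths used elsewhere in this article.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: misp-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: misp-frontend
  template:
    metadata:
      labels:
        app: misp-frontend
    spec:
      containers:
      - name: frontend
        image: misp-frontend:latest   # hypothetical image name
        volumeMounts:
        - name: misp-config
          mountPath: /var/www/MISP/app/Config/config.php
          subPath: config.php
          readOnly: true
      volumes:
      - name: misp-config
        configMap:
          name: misp-config           # ConfigMap holding config.php as a key
```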

&lt;h2&gt;
  
  
  Background Jobs aka workers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Up to version 2.4.150
&lt;/h3&gt;

&lt;p&gt;Up to version 2.4.150, the MISP background jobs (workers) were managed by the &lt;a href="https://github.com/wa0x6e/Cake-Resque"&gt;CakeResque&lt;/a&gt; library. When starting a worker, the MISP writes &lt;a href="https://github.com/wa0x6e/ResqueStatus/blob/451a38060b4e752c304a1cdbcee1de3015c6d9ee/src/ResqueStatus/ResqueStatus.php#L90-L93"&gt;its PID into Redis&lt;/a&gt;. Based on that PID, the frontend shows the worker’s health by checking whether &lt;em&gt;/proc/{PID}&lt;/em&gt; exists. When running multiple replicas, the last started pod writes its PIDs to Redis, and they may not match the other pods’ PIDs. Refreshing the frontend in the browser may therefore show an unavailable worker due to inconsistent PID numbers (for example, the latest registered PID of process &lt;em&gt;prio&lt;/em&gt; is 33, while the container serving the user’s request knows process &lt;em&gt;prio&lt;/em&gt; by PID 55).&lt;/p&gt;
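The check can be sketched as a small shell function. This is a simplification for illustration (not MISP's actual code), but it shows why a PID registered by one replica fails the check on another:

```shell
#!/bin/sh
# A worker is considered healthy when /proc/{PID} exists on the host
# answering the request. PIDs registered in Redis by one replica
# generally do not exist on the other replicas, so the same check
# fails there.
worker_alive() {
  pid="$1"
  if [ -d "/proc/${pid}" ]; then
    echo "OK"
  else
    echo "DOWN"
  fi
}

worker_alive "$$"      # our own PID exists in /proc
worker_alive 4194305   # above PID_MAX_LIMIT on Linux, cannot exist
```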

&lt;p&gt;Our first idea was to create two different Kubernetes deployments: a frontend with no workers that could scale, and a single-replica frontend also running the workers.&lt;/p&gt;

&lt;p&gt;Using the Kubernetes Ingress object, we can split the traffic based on the URL path: /servers/serverSettings/workers forwards the traffic to the single-replica frontend pod running the workers, and all other requests are forwarded to the highly available frontend.&lt;/p&gt;
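Such a split can be sketched with an Ingress like the following; the host and service names are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: misp
spec:
  rules:
  - host: misp.example.com
    http:
      paths:
      # Workers UI goes to the single replica running the workers
      - path: /servers/serverSettings/workers
        pathType: Prefix
        backend:
          service:
            name: misp-frontend-workers
            port:
              number: 80
      # Everything else goes to the highly available frontend
      - path: /
        pathType: Prefix
        backend:
          service:
            name: misp-frontend
            port:
              number: 80
```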

&lt;p&gt;Since background jobs are managed by the platform administrators, it’s acceptable that this part of the UI is unavailable for a short period of time, whereas the rest of the platform must remain accessible to our end-users and third-party applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O2SOfg4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kgecnalp703fl01eoi0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O2SOfg4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kgecnalp703fl01eoi0a.png" alt="First try of workers split" width="624" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  From version 2.4.151
&lt;/h3&gt;

&lt;p&gt;As stated in the MISP documentation, from version 2.4.151 the &lt;em&gt;CakeResque&lt;/em&gt; library became deprecated and is replaced by the &lt;em&gt;SimpleBackgroundJobs&lt;/em&gt; feature, which relies on &lt;em&gt;Supervisor&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supervisor&lt;/em&gt; is configured to expose an API over HTTP on port 9001. Hence, the workers can be separated from the frontend, which can then easily be scaled.&lt;/p&gt;
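A supervisord.conf fragment enabling the HTTP interface could look like this; the bind address and credentials are assumptions, not the values we use:

```ini
; Expose Supervisor's XML-RPC API over HTTP so the frontend can
; poll the workers' state from another pod.
[inet_http_server]
port = 0.0.0.0:9001
username = supervisor
password = changeme
```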

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rvHt3Prt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ngezk5ulgzh54x1qrfg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rvHt3Prt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ngezk5ulgzh54x1qrfg3.png" alt="Workers managed by Supervisor" width="861" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait, what about the single-process container and micro-services mentioned earlier? As of writing this article, the MISP configuration only allows a single Supervisor endpoint to be polled, not an endpoint per worker.&lt;/p&gt;

&lt;p&gt;Yes, but… the frontend/UI still tries to check the health of each process via /proc/{PID} as before, so it shows that a process may have started but cannot tell whether it is alive. An &lt;a href="https://github.com/MISP/MISP/issues/8616"&gt;issue&lt;/a&gt; was created and we’re waiting for the patch to be integrated in a future version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HHqRdJ9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/llusls8bknb3xg9q5ec6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HHqRdJ9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/llusls8bknb3xg9q5ec6.png" alt="Workers shown in UI" width="847" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CronJobs
&lt;/h3&gt;

&lt;p&gt;The MISP relies on a few cron jobs used to trigger tasks like downloading new IoCs from external sources or content indexing. When running multiple replicas of the same pod, the same jobs are all scheduled at the same time. For example, the &lt;em&gt;CacheFeed&lt;/em&gt; task runs every day at 2:20 am on every replica, degrading the performance of the whole application while it runs.&lt;/p&gt;

&lt;p&gt;Thanks to the Kubernetes CronJob object, those jobs run only once per schedule, independently of the number of replicas. All CronJobs use the same image, but the command parameter is specific to each job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cache-feeds&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sudo&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-u&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;www-data&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/www/MISP/app/Console/cake&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Server&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cacheFeed&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Other improvements across all components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Entrypoint: all file operations are done at the Dockerfile level instead of in the entrypoint script, speeding up the startup phase&lt;/li&gt;
&lt;li&gt;MISP configuration: the software used to be configured dynamically with the CakePHP CLI tool and sed commands based on environment variables. We replaced this process by mounting static configuration files as Kubernetes &lt;em&gt;ConfigMap&lt;/em&gt; objects
&lt;/li&gt;
&lt;li&gt;PHP sessions: PHP was configured to store sessions in Redis, which is already used by the MISP core&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pburgisser/misp-helm-chart"&gt;Helm chart&lt;/a&gt;: all the Kubernetes manifests were packaged into a Helm chart&lt;/li&gt;
&lt;li&gt;Blue/green deployment: shortens the downtime during updates of the underlying infrastructure or the application&lt;/li&gt;
&lt;/ul&gt;
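For the PHP sessions point, the change boils down to a php.ini fragment along these lines; the Redis host name is a placeholder, and the phpredis extension must be installed:

```ini
; Store PHP sessions in Redis so any replica can serve any request.
; Requires the phpredis extension.
session.save_handler = redis
session.save_path = "tcp://misp-redis:6379"
```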

&lt;h1&gt;
  
  
  Next steps
&lt;/h1&gt;

&lt;p&gt;We’re already working on a v4, in which we’ll focus on security aspects as well as alignment with Kubernetes and Docker best practices.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We quite often read from Kubernetes providers that moving monoliths to containers is easy, but they tend to omit that scaling isn’t an easy task. Also, monitoring, backups and all the other known challenges of traditional VMs still exist with containers and can be even more complex at scale.&lt;/p&gt;

&lt;p&gt;Through the whole project, we learned a lot about how the MISP works by testing numerous parameters but also by reading a lot of code to better understand what and how to scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  Links
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pburgisser/misp-container-frontend"&gt;Frontend container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pburgisser/misp-container-worker"&gt;Workers container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pburgisser/misp-helm-chart"&gt;Helm Chart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/coolacid/docker-misp"&gt;Coolacid's version&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>misp</category>
      <category>kubernetes</category>
      <category>ioc</category>
      <category>cti</category>
    </item>
    <item>
      <title>Enable OpenShift login on ArgoCD from GitOps Operator</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Thu, 20 May 2021 12:48:47 +0000</pubDate>
      <link>https://dev.to/camptocamp-ops/enable-openshift-login-on-argocd-from-gitops-2h9a</link>
      <guid>https://dev.to/camptocamp-ops/enable-openshift-login-on-argocd-from-gitops-2h9a</guid>
      <description>&lt;p&gt;Since few weeks now, the operator Red Hat OpenShift GitOps became GA and embbed tools like Tekton and ArgoCD.&lt;/p&gt;

&lt;p&gt;When the operator is deployed, it provisions a vanilla ArgoCD which misses the integrated OpenShift login. In this post, we are going to review the steps to enable it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Deploy and fine tune the Red Hat OpenShift GitOps
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Follow the &lt;a href="https://docs.openshift.com/container-platform/4.7/cicd/gitops/installing-openshift-gitops.html"&gt;official documentation&lt;/a&gt; to install the operator&lt;/li&gt;
&lt;li&gt;Once the operator is deployed, go to the menu &lt;strong&gt;Operators&lt;/strong&gt;&amp;gt;&lt;strong&gt;Installed Operators&lt;/strong&gt; and click on the freshly deployed &lt;strong&gt;Red Hat OpenShift GitOps&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Using the dropdown &lt;strong&gt;Actions&lt;/strong&gt; on top right of the page, choose &lt;strong&gt;Edit Subscription&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the YAML code, under the &lt;strong&gt;spec&lt;/strong&gt; level, enable the Dex feature for external authentication and click &lt;strong&gt;Save&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DISABLE_DEX&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false'&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc patch subscription openshift-gitops-operator &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-operators &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;merge &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"config":{"env":[{"name":"DISABLE_DEX","Value":"false"}]}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Configure ArgoCD to allow OpenShift authentication
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Change the project to &lt;strong&gt;openshift-gitops&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to the menu &lt;strong&gt;Operators&lt;/strong&gt;&amp;gt;&lt;strong&gt;Installed Operators&lt;/strong&gt; and click on &lt;strong&gt;Red Hat OpenShift GitOps&lt;/strong&gt;, select tab &lt;strong&gt;Argo CD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;On the ArgoCD instance list, click on the three dots at the very left of the &lt;strong&gt;openshift-gitops&lt;/strong&gt; and select &lt;strong&gt;Edit ArgoCD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the YAML code, under the &lt;strong&gt;spec&lt;/strong&gt; level, update the Dex and RBAC sections to match the following
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;openShiftOAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rbac&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;defaultPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role:readonly'&lt;/span&gt;
    &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;g, system:cluster-admins, role:admin&lt;/span&gt;
    &lt;span class="na"&gt;scopes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[groups]'&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc patch argocd openshift-gitops &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-gitops &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;merge &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"dex":{"openShiftOAuth":true},"rbac":{"defaultPolicy":"role:readonly","policy":"g, system:cluster-admins, role:admin","scopes":"[groups]"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Monitor the pods being restarted to apply the configuration and test your login&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>gitops</category>
      <category>openshift</category>
      <category>argocd</category>
      <category>authentication</category>
    </item>
    <item>
      <title>How to Integrate AWS Cognito with OpenShift 4</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Fri, 26 Mar 2021 13:14:19 +0000</pubDate>
      <link>https://dev.to/camptocamp-ops/how-to-integrate-aws-cognito-with-openshift-4-4n43</link>
      <guid>https://dev.to/camptocamp-ops/how-to-integrate-aws-cognito-with-openshift-4-4n43</guid>
      <description>&lt;p&gt;In this post, we are going to integrate the &lt;a href="https://aws.amazon.com/cognito/" rel="noopener noreferrer"&gt;Cognito authentication service&lt;/a&gt; from AWS with Red Hat &lt;a href="https://www.openshift.com/" rel="noopener noreferrer"&gt;OpenShift 4&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OpenShift 4 comes with a wide range of authentication providers to authenticate users, they can be very basic (&lt;a href="https://docs.openshift.com/container-platform/4.7/authentication/identity_providers/configuring-htpasswd-identity-provider.html#configuring-htpasswd-identity-provider" rel="noopener noreferrer"&gt;HTPasswd&lt;/a&gt;), traditional (&lt;a href="https://docs.openshift.com/container-platform/4.7/authentication/identity_providers/configuring-ldap-identity-provider.html#configuring-ldap-identity-provider" rel="noopener noreferrer"&gt;LDAP&lt;/a&gt;), integrated (&lt;a href="https://docs.openshift.com/container-platform/4.7/authentication/identity_providers/configuring-github-identity-provider.html#configuring-github-identity-provider" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;) or based on &lt;a href="https://openid.net/connect/" rel="noopener noreferrer"&gt;OpenID Connect&lt;/a&gt;. We're going to focus on the OpenID Connect identity provider.&lt;/p&gt;

&lt;h1&gt;
  
  
  Configure AWS Cognito
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Creation of the service
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open your AWS Console and go to the Cognito service&lt;/li&gt;
&lt;li&gt;Create a user pool and choose the &lt;strong&gt;Step through settings&lt;/strong&gt; option&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Attributes&lt;/strong&gt; section, ensure you configure the following:

&lt;ol&gt;
&lt;li&gt;How do you want your end users to sign in? &lt;strong&gt;Username&lt;/strong&gt; (you can also allow users to use their e-mail address)&lt;/li&gt;
&lt;li&gt;Which standard attributes do you want to require?: &lt;strong&gt;email, profile, preferred username&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;In the &lt;strong&gt;App client&lt;/strong&gt; section, click on &lt;strong&gt;Add an app client&lt;/strong&gt;, enter an app name (e.g.: OCP) and ensure you select &lt;strong&gt;Enable username password based authentication (&lt;code&gt;ALLOW_USER_PASSWORD_AUTH&lt;/code&gt;)&lt;/strong&gt;, keep the rest as it is and click on &lt;strong&gt;Create app client&lt;/strong&gt;
&lt;/li&gt;

&lt;li&gt;Once the pool is created, go to &lt;strong&gt;App integration&lt;/strong&gt; &amp;gt; &lt;strong&gt;App client settings&lt;/strong&gt; and configure the following:

&lt;ol&gt;
&lt;li&gt;Enabled Identity Providers: &lt;strong&gt;Select all&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Cognito User Pool: &lt;em&gt;Checked&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Callback URL(s): &lt;code&gt;https://oauth-openshift.apps.demo.example.com/oauth2callback/Cognito&lt;/code&gt; (&lt;em&gt;Match to your domain&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Allowed OAuth Flows: &lt;strong&gt;Authorization code grant&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Allowed OAuth Scopes: &lt;strong&gt;email&lt;/strong&gt;,&lt;strong&gt;openid&lt;/strong&gt;,&lt;strong&gt;aws.cognito.signin.user.admin&lt;/strong&gt;,&lt;strong&gt;profile&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Confirm configuration by clicking on &lt;strong&gt;Create pool&lt;/strong&gt; in the &lt;strong&gt;Review&lt;/strong&gt; section&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Configuration of the service
&lt;/h2&gt;

&lt;p&gt;Now that the user pool has been created and configured to accept authentication requests from OpenShift, we'll have to gather some information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the freshly created pool, open the &lt;strong&gt;General settings&lt;/strong&gt; section and copy the value of &lt;strong&gt;Pool Id&lt;/strong&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6xi4y7dcz6a15vzbtff.png" alt="Pool Id"&gt;
&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;App clients&lt;/strong&gt; section, you should find the app client created previously; click on &lt;strong&gt;Show Details&lt;/strong&gt; to expand it and gather the &lt;strong&gt;App client id&lt;/strong&gt; and the &lt;strong&gt;App client secret&lt;/strong&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61zxxe9klmkvxr16xoqt.png" alt="Client Id"&gt;
&lt;/li&gt;
&lt;li&gt;Create some users in your pool&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Configure OpenShift 4
&lt;/h1&gt;

&lt;p&gt;Now that Cognito is ready, let's configure OpenShift to use the values gathered previously.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Administration&lt;/strong&gt; &amp;gt; &lt;strong&gt;Cluster Settings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Global Configuration&lt;/strong&gt; and search for &lt;strong&gt;OAuth&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the dropdown located in the &lt;strong&gt;Identity providers&lt;/strong&gt; part at the bottom, choose &lt;strong&gt;OpenID Connect&lt;/strong&gt; and enter the following information

&lt;ul&gt;
&lt;li&gt;Name: &lt;strong&gt;Cognito&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Client ID: {Gathered client ID from previous step}&lt;/li&gt;
&lt;li&gt;Client Secret: {Gathered client secret}&lt;/li&gt;
&lt;li&gt;Issuer URL: &lt;code&gt;https://cognito-idp.{aws_region}.amazonaws.com/{ pool ID eg: eu-west-1_sLzMKS}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
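The same identity provider can also be declared declaratively in the OAuth cluster resource instead of through the console; here is a sketch with placeholder values for the client ID, the Secret name and the issuer URL:

```yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: Cognito
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: <app-client-id>        # App client id gathered from Cognito
      clientSecret:
        name: cognito-client-secret    # Secret holding the app client secret
      claims:
        preferredUsername:
        - preferred_username
        email:
        - email
      issuer: https://cognito-idp.<aws_region>.amazonaws.com/<pool-id>
```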

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfug8y8d34cbeg0w592u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfug8y8d34cbeg0w592u.png" alt="Configuring OIDC in OpenShift 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Leave the other parameters and click &lt;strong&gt;Add&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give the Authentication Operator a few minutes to reload, then try to log out&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc get clusteroperator authentication
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the authentication operator has restarted, log out of OpenShift. On the login page, you should now see a button named &lt;strong&gt;Cognito&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qrchs2ad723hepoohrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qrchs2ad723hepoohrv.png" alt="Login page using Cognito"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When clicking on the &lt;strong&gt;Cognito&lt;/strong&gt; button, you'll be redirected to the Cognito login page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2xd2h2pw260a11fatxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2xd2h2pw260a11fatxd.png" alt="Login in to Cognito"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the account you previously created in Cognito and voilà!&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This gives you an overview of how to get Cognito working with your OpenShift cluster. You still need to manage RBAC and groups for the users so they have the correct permissions on the cluster. Also note that all these steps can be automated.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>openshift</category>
      <category>redhat</category>
      <category>authentication</category>
    </item>
    <item>
      <title>TKGI: Observability challenge</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Wed, 03 Mar 2021 13:56:03 +0000</pubDate>
      <link>https://dev.to/camptocamp-ops/tkgi-observability-challenge-376f</link>
      <guid>https://dev.to/camptocamp-ops/tkgi-observability-challenge-376f</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this post we’re going to review the observability options on a Kubernetes multi-cluster managed by &lt;a href="https://tanzu.vmware.com/kubernetes-grid" rel="noopener noreferrer"&gt;VMware TKGI (Tanzu Kubernetes Grid Integrated)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When deployed using the TKGI toolset, Kubernetes comes with the concept of metric sinks to collect data from the platform and/or from applications. Metric sinks are based on Telegraf and push metrics to a destination defined in the ClusterMetricSink CR object.&lt;/p&gt;

&lt;p&gt;In our use case, TKGI is used to deploy one Kubernetes cluster per application/environment (dev, qa, prod), from which we need to collect metrics. At this customer, we also operate a Prometheus stack that scrapes data from traditional virtual machines and from containers running on OpenShift, in order to handle alarms and to offer dashboards to end users via Grafana. &lt;br&gt;
We explored several implementation architectures that fit our current monitoring system and our internal processes.&lt;/p&gt;
&lt;h1&gt;
  
  
  Architecture 1
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uxh5xjfgrzwhbghf1ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uxh5xjfgrzwhbghf1ww.png" alt="Architecture 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this scenario, we leverage the (Cluster)MetricSink provided by VMware, configured to push the data into a central InfluxDB database. The data handled by Telegraf can either be pushed to it by applications or scraped by Telegraf itself. Telegraf runs as a pod on each node, deployed via a DaemonSet. Grafana has a data source connector for InfluxDB.&lt;/p&gt;
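&lt;p&gt;A minimal ClusterMetricSink for this scenario could look like the following sketch; the InfluxDB URL and database name are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: influxdb-sink
spec:
  inputs: []
  outputs:
  - type: influxdb
    urls:
    - http://influxdb.example.com:8086
    database: tkgi_metrics
&lt;/code&gt;&lt;/pre&gt;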
&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Easiest implementation&lt;/li&gt;
&lt;li&gt;No extra software to deploy on Kubernetes&lt;/li&gt;
&lt;li&gt;Multi-tenancy of data&lt;/li&gt;
&lt;li&gt;RBAC for data access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;InfluxDB cannot scale and there is no HA in the free version&lt;/li&gt;
&lt;li&gt;Grafana dashboards need to be rewritten to match the InfluxDB query language&lt;/li&gt;
&lt;li&gt;Harder integration with our current alarm flow&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Architecture 2
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qc8ka8ewiqu2b6rsog9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qc8ka8ewiqu2b6rsog9.png" alt="Architecture 2"&gt;&lt;/a&gt;&lt;br&gt;
Telegraf is able to expose data using the Prometheus format over an HTTP endpoint. This configuration is done using the &lt;a href="https://docs.pivotal.io/tkgi/1-10/create-sinks.html#define-sinks" rel="noopener noreferrer"&gt;MetricSink CR&lt;/a&gt;. Prometheus will then scrape the Telegraf service.&lt;br&gt;
Telegraf is deployed on each node using a DaemonSet and comes with a Kubernetes service through which it can be reached. As Prometheus sits outside of the targeted cluster, it cannot access each Telegraf endpoint directly and has to go through that Kubernetes service. The main drawback of this architecture is that we cannot ensure all endpoints are scraped evenly, which may create gaps in the metrics. We have also noticed that when Telegraf is configured to expose Prometheus data over HTTP, the service isn't updated to match the newly exposed port. One solution would have been to create another service in the namespace where Telegraf resides, but due to RBAC we aren't allowed to do so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pksapi.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterMetricSink&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-demo-sink&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;monitor_kubernetes_pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus_client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
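&lt;p&gt;The extra service mentioned above, which RBAC prevented us from creating, would roughly look like this; the namespace, labels and the Telegraf prometheus_client port 9273 are assumptions based on common defaults:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: telegraf-prometheus
  namespace: pks-system
spec:
  selector:
    app: telegraf
  ports:
  - name: metrics
    port: 9273
    targetPort: 9273
&lt;/code&gt;&lt;/pre&gt;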



&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We can leverage the usage of Telegraf and MetricSinks&lt;/li&gt;
&lt;li&gt;Integration with our existing Prometheus stack&lt;/li&gt;
&lt;li&gt;Prometheus ServiceDiscovery possible through Kubernetes API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No direct access to Telegraf endpoints&lt;/li&gt;
&lt;li&gt;Depending on the number of targets to discover in each Kubernetes cluster, ServiceDiscovery performance can be impacted&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Architecture 3
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cb4hpu8nze31b6gj35i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cb4hpu8nze31b6gj35i.png" alt="Architecture 3"&gt;&lt;/a&gt;&lt;br&gt;
In this architecture, we configure Prometheus to directly scrape exporters running on each cluster. Unfortunately, each replica of a pod running an exporter exposes its endpoint through a Kubernetes service. As mentioned in architecture 2, Prometheus, sitting outside the cluster, cannot directly scrape an endpoint, so we cannot ensure the scraping is done evenly.&lt;/p&gt;
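&lt;p&gt;For reference, scraping a remote cluster through its API using Prometheus ServiceDiscovery looks roughly like this; the API server URL and credential paths are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;scrape_configs:
- job_name: tkgi-dev-endpoints
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://tkgi-dev.example.com:8443
    tls_config:
      ca_file: /etc/prometheus/tkgi-dev-ca.crt
    bearer_token_file: /etc/prometheus/tkgi-dev.token
&lt;/code&gt;&lt;/pre&gt;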
&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Integration with our Prometheus stack&lt;/li&gt;
&lt;li&gt;Prometheus ServiceDiscovery possible through Kubernetes API&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No direct access to exporter endpoints&lt;/li&gt;
&lt;li&gt;Does not scale well&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Architecture 4
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbpkxnxva77pj47st27q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbpkxnxva77pj47st27q.png" alt="Architecture 4"&gt;&lt;/a&gt;&lt;br&gt;
This is a hybrid approach where we leverage the metric tooling provided by VMware. We push all the metrics into an &lt;a href="https://github.com/prometheus/influxdb_exporter" rel="noopener noreferrer"&gt;InfluxDB exporter&lt;/a&gt; acting as a proxy cache, which is scraped by Prometheus.&lt;/p&gt;
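&lt;p&gt;Sketched out, the sink output points at the influxdb_exporter (which accepts the InfluxDB write protocol and listens on port 9122 by default), and Prometheus scrapes that same port; the hostname is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;# ClusterMetricSink output pointing at the exporter instead of a real InfluxDB
outputs:
- type: influxdb
  urls:
  - http://influxdb-exporter.example.com:9122

# Prometheus scrape configuration for the exporter
scrape_configs:
- job_name: influxdb-exporter
  static_configs:
  - targets: ['influxdb-exporter.example.com:9122']
&lt;/code&gt;&lt;/pre&gt;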
&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Leveraging VMware tooling&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;InfluxDB exporter becomes an SPOF (Single Point of Failure)&lt;/li&gt;
&lt;li&gt;Extra components to manage&lt;/li&gt;
&lt;li&gt;No Prometheus ServiceDiscovery available&lt;/li&gt;
&lt;li&gt;Handling of data expiration&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Architecture 5
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q5f0gs6ioe1z5ykagek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q5f0gs6ioe1z5ykagek.png" alt="Architecture 5"&gt;&lt;/a&gt;&lt;br&gt;
In this architecture we introduce PushProx, composed of a proxy running on the same cluster as Prometheus and agents running on each Kubernetes cluster. The agents initiate a connection to the proxy to create a tunnel, through which Prometheus can directly scrape each endpoint.&lt;/p&gt;

&lt;p&gt;Each scraping configuration will need to have a proxy referenced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
  &lt;span class="na"&gt;proxy_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://proxy:8080/&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bypass network segmentation&lt;/li&gt;
&lt;li&gt;Integration with our Prometheus stack&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No Prometheus ServiceDiscovery&lt;/li&gt;
&lt;li&gt;Scaling issue&lt;/li&gt;
&lt;li&gt;Extra component to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Architecture 6
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdios2ckec4ucfwvn5v9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdios2ckec4ucfwvn5v9r.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
In this architecture, a Prometheus instance is deployed on each cluster to scrape the targets residing in that cluster, so the data is stored on each instance. The major difference in this approach is that only the AlertManager and Grafana are shared across all clusters.&lt;/p&gt;
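&lt;p&gt;In such a setup, each per-cluster Prometheus simply points at the shared AlertManager; the hostname below is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager.central.example.com:9093']
&lt;/code&gt;&lt;/pre&gt;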

&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Best integration with our Prometheus stack&lt;/li&gt;
&lt;li&gt;Multi-tenancy&lt;/li&gt;
&lt;li&gt;Federation possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Memory and CPU footprint due to repeating the same services on every cluster&lt;/li&gt;
&lt;li&gt;Does not use any TKGI metric component&lt;/li&gt;
&lt;li&gt;Multiple instances to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;After testing almost all of these architectures, we came to the conclusion that architecture 6 is the best match for our current architecture and needs. We also favored Prometheus because it can easily be deployed using the operator, and features such as HA are managed automatically. We did, however, have to make some compromises, such as not using the TKGI metric components and somewhat “reinventing the wheel”, as we believe that monitoring and alerting should be done by pulling data rather than pushing it.&lt;/p&gt;
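&lt;p&gt;As an illustration, deploying an HA pair with the Prometheus Operator boils down to a single resource; the name and namespace are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
&lt;/code&gt;&lt;/pre&gt;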

&lt;h1&gt;
  
  
  Disclaimer
&lt;/h1&gt;

&lt;p&gt;This research was made on a TKGI environment that hasn't been installed and operated by Camptocamp.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>prometheus</category>
    </item>
  </channel>
</rss>
