<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Philippe Bürgisser</title>
    <description>The latest articles on DEV Community by Philippe Bürgisser (@pburgisser).</description>
    <link>https://dev.to/pburgisser</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F589688%2F77990e23-61b6-487f-a334-3735fc08e7aa.jpeg</url>
      <title>DEV Community: Philippe Bürgisser</title>
      <link>https://dev.to/pburgisser</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pburgisser"/>
    <language>en</language>
    <item>
      <title>MISP at scale on Kubernetes</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Thu, 17 Nov 2022 14:37:05 +0000</pubDate>
      <link>https://dev.to/pburgisser/misp-at-scale-on-kubernetes-2k31</link>
      <guid>https://dev.to/pburgisser/misp-at-scale-on-kubernetes-2k31</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://www.misp-project.org/"&gt;MISP&lt;/a&gt; platform is an open-source project to collect, store, distribute and share Indicators of Compromise (IoC). The project is mostly written in PHP by more than 200 contributors and currently has more than 4000 stars on GitHub.&lt;/p&gt;

&lt;p&gt;The MISP’s core is composed of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: a PHP-based application offering both a UI and an API&lt;/li&gt;
&lt;li&gt;Background jobs: wait for signals to download IoCs from feeds, index IoCs, etc.&lt;/li&gt;
&lt;li&gt;Cron jobs: trigger recurrent tasks such as IoC downloads and updates of internal information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also relies on external resources: a SQL database, a Redis instance and an HTTP frontend proxying requests to PHP.&lt;/p&gt;

&lt;p&gt;At KPMG-Egyde, we operate our own MISP platform and enrich its content based on our findings, which we share with some of our customers. We wanted to provide a scalable, highly available and performant platform, which is why we decided to move the MISP from a traditional virtualized infrastructure to Kubernetes.&lt;/p&gt;

&lt;p&gt;As of writing this article, there is no official guide on how to deploy and operate the MISP on Kubernetes. Containers and a few Kubernetes manifests are available on GitHub. Despite the great efforts made to run the MISP in containers, these solutions aren’t made to scale, yet.&lt;/p&gt;

&lt;h1&gt;
  
  
  State of the art
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Traditional deployments
&lt;/h2&gt;

&lt;p&gt;The MISP can be deployed on traditional Linux servers running either on-premises or at cloud providers. The installation guide provides scripts to deploy the solution; external services such as the database are also part of the installation script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud images
&lt;/h3&gt;

&lt;p&gt;The project &lt;a href="https://github.com/MISP/misp-cloud"&gt;misp-cloud&lt;/a&gt; is providing ready to use AWS AMI containing the MISP platform as well as all other external component on the same image. They may provide images for Azure and DigitalOcean in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MISP-Docker
&lt;/h3&gt;

&lt;p&gt;The official MISP project provides a &lt;a href="https://github.com/MISP/misp-docker"&gt;containerized&lt;/a&gt; version of the MISP where all elements except the SQL database are included in a single container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coolacid’s MISP-Docker
&lt;/h3&gt;

&lt;p&gt;The project &lt;a href="https://github.com/coolacid/docker-misp"&gt;MISP-Docker&lt;/a&gt; from &lt;a href="https://github.com/coolacid"&gt;Coolacid&lt;/a&gt; is providing a containerized version of the MISP solution. This all-in-one solution includes the frontend, background jobs, cronjobs and an HTTP Server (Nginx) all orchestrated by process manager tool called &lt;a href="http://supervisord.org/"&gt;supervisor&lt;/a&gt;. External services such as the database and Redis aren’t part of the container but are necessary. We decided that this project is very a good starting point to scale the MISP on Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7hH5ZAiq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpimxpmghn8uzaaytslo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7hH5ZAiq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpimxpmghn8uzaaytslo.png" alt="Coolacid's processes design" width="851" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Scaling the MISP
&lt;/h1&gt;

&lt;h2&gt;
  
  
  v1. Running on an EC2 instance
&lt;/h2&gt;

&lt;p&gt;In our first version of the platform, we deployed the MISP on a traditional hardened EC2 instance where the MySQL database and Redis were running within the instance. Running everything in a VM caused scaling issues and required the database to be either outside the solution or synced between instances. Also, running an EC2 instance means petting it: applying the latest patches, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  v2. Running on EKS
&lt;/h2&gt;

&lt;p&gt;In order to simplify operations, but also for the sake of curiosity, we decided to migrate the traditional EC2-based MISP platform to Kubernetes, more specifically to the AWS EKS service. To fix the scaling and synchronization problems, we decided to use AWS managed services: RDS (MySQL) and ElastiCache (Redis).&lt;/p&gt;

&lt;p&gt;After a few months of operating the MISP on Kubernetes, we struggled to scale the deployment due to the design of the container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entrypoint: the startup script copies files and updates rights and permissions at every startup&lt;/li&gt;
&lt;li&gt;PHP sessions: by default, PHP stores sessions locally, so sessions are lost when a request lands on another replica&lt;/li&gt;
&lt;li&gt;Background jobs: every replica runs the background jobs, so the jobs aren’t found by the UI and appear to be down&lt;/li&gt;
&lt;li&gt;Cron jobs: jobs are scheduled at the same time on every replica, so all pods execute the same task simultaneously and load the external components&lt;/li&gt;
&lt;li&gt;Lack of monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  v3. Running on EKS
&lt;/h2&gt;

&lt;p&gt;Using containers and Kubernetes also brings new concepts such as micro-services and single-process containers. We realized that the MISP is actually two programs written in a single code base (frontend/UI and background jobs) sharing the same configuration. In our quest to scale, we decided to split the container into smaller pieces. As mentioned above, this version of our implementation is based on Coolacid’s amazing work, which we forked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend
&lt;/h2&gt;

&lt;p&gt;In this &lt;a href="https://github.com/pburgisser/misp-container-frontend"&gt;version&lt;/a&gt;, we kept the Nginx and PHP-FPM implementation from Coolacid’s container and removed the workers and cron jobs from it. Nginx and PHP-FPM are started by Supervisor, and the application configuration is mounted as static files (ConfigMap, Secret) in the container instead of being set dynamically.&lt;/p&gt;
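As a sketch, mounting the configuration could look like the following Deployment fragment. The image and ConfigMap names are illustrative, not taken from the actual chart; only the MISP config path matches the paths used elsewhere in this article.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: misp-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: misp-frontend
  template:
    metadata:
      labels:
        app: misp-frontend
    spec:
      containers:
      - name: frontend
        image: misp-frontend:latest   # hypothetical image name
        volumeMounts:
        - name: misp-config
          mountPath: /var/www/MISP/app/Config/config.php
          subPath: config.php
          readOnly: true
      volumes:
      - name: misp-config
        configMap:
          name: misp-config           # ConfigMap holding config.php as a key
```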

&lt;h2&gt;
  
  
  Background Jobs aka workers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Up to version 2.4.150
&lt;/h3&gt;

&lt;p&gt;Up to version 2.4.150, the MISP background jobs (workers) were managed by the &lt;a href="https://github.com/wa0x6e/Cake-Resque"&gt;CakeResque&lt;/a&gt; library. When starting a worker, the MISP writes &lt;a href="https://github.com/wa0x6e/ResqueStatus/blob/451a38060b4e752c304a1cdbcee1de3015c6d9ee/src/ResqueStatus/ResqueStatus.php#L90-L93"&gt;its PID into Redis&lt;/a&gt;. Based on that PID, the frontend shows the worker’s health by checking whether &lt;em&gt;/proc/{PID}&lt;/em&gt; exists. When running multiple replicas, the last started pod writes its PIDs to Redis, and they may not match the other pods’ PIDs. Refreshing the frontend in the browser may therefore show an unavailable worker due to inconsistent PID numbers (for example, the latest registered PID of process &lt;em&gt;prio&lt;/em&gt; is 33, while the container serving the user’s request knows process &lt;em&gt;prio&lt;/em&gt; by PID 55).&lt;/p&gt;
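The check can be sketched as a small shell function. This is a simplification for illustration (not MISP's actual code), but it shows why a PID registered by one replica fails the check on another:

```shell
#!/bin/sh
# A worker is considered healthy when /proc/{PID} exists on the host
# answering the request. PIDs registered in Redis by one replica
# generally do not exist on the other replicas, so the same check
# fails there.
worker_alive() {
  pid="$1"
  if [ -d "/proc/${pid}" ]; then
    echo "OK"
  else
    echo "DOWN"
  fi
}

worker_alive "$$"      # our own PID exists in /proc
worker_alive 4194305   # above PID_MAX_LIMIT on Linux, cannot exist
```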

&lt;p&gt;Our first idea was to create two different Kubernetes deployments: a frontend with no workers that could scale, and a single-replica frontend also running the workers.&lt;/p&gt;

&lt;p&gt;Using the Kubernetes Ingress object, we can split the traffic based on the URL path: /servers/serverSettings/workers forwards the traffic to the single-replica frontend pod running the workers, and all other requests are forwarded to the highly available frontend.&lt;/p&gt;
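Such a split can be sketched with an Ingress like the following; the host and service names are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: misp
spec:
  rules:
  - host: misp.example.com
    http:
      paths:
      # Workers UI goes to the single replica running the workers
      - path: /servers/serverSettings/workers
        pathType: Prefix
        backend:
          service:
            name: misp-frontend-workers
            port:
              number: 80
      # Everything else goes to the highly available frontend
      - path: /
        pathType: Prefix
        backend:
          service:
            name: misp-frontend
            port:
              number: 80
```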

&lt;p&gt;Since background jobs are managed by the platform administrators, it’s acceptable that this part of the UI is unavailable for a short period of time, whereas the rest of the platform must remain accessible to our end-users and third-party applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O2SOfg4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kgecnalp703fl01eoi0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O2SOfg4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kgecnalp703fl01eoi0a.png" alt="First try of workers split" width="624" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  From version 2.4.151
&lt;/h3&gt;

&lt;p&gt;As stated in the MISP documentation, from version 2.4.151 the &lt;em&gt;CakeResque&lt;/em&gt; library became deprecated and is replaced by the &lt;em&gt;SimpleBackgroundJobs&lt;/em&gt; feature, which relies on &lt;em&gt;Supervisor&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supervisor&lt;/em&gt; is configured to expose an API over HTTP on port 9001. Hence, the workers can be separated from the frontend, which can then easily be scaled.&lt;/p&gt;
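A supervisord.conf fragment enabling the HTTP interface could look like this; the bind address and credentials are assumptions, not the values we use:

```ini
; Expose Supervisor's XML-RPC API over HTTP so the frontend can
; poll the workers' state from another pod.
[inet_http_server]
port = 0.0.0.0:9001
username = supervisor
password = changeme
```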

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rvHt3Prt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ngezk5ulgzh54x1qrfg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rvHt3Prt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ngezk5ulgzh54x1qrfg3.png" alt="Workers managed by Supervisor" width="861" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait, what about the single-process container and micro-services mentioned earlier? As of writing this article, the MISP configuration only allows a single Supervisor endpoint to be polled, not an endpoint per worker.&lt;/p&gt;

&lt;p&gt;Yes, but… the frontend/UI still tries to check the health of each process via /proc/{PID} as before, so it shows that a process may have started but cannot tell whether it is alive. An &lt;a href="https://github.com/MISP/MISP/issues/8616"&gt;issue&lt;/a&gt; was created and we’re waiting for the patch to be integrated in a future version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HHqRdJ9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/llusls8bknb3xg9q5ec6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HHqRdJ9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/llusls8bknb3xg9q5ec6.png" alt="Workers shown in UI" width="847" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CronJobs
&lt;/h3&gt;

&lt;p&gt;The MISP relies on a few cron jobs used to trigger tasks like downloading new IoCs from external sources or content indexing. When running multiple replicas of the same pod, the same jobs are all scheduled at the same time. For example, the &lt;em&gt;CacheFeed&lt;/em&gt; task runs every day at 2:20 am on every replica, degrading the performance of the whole application while it runs.&lt;/p&gt;

&lt;p&gt;Thanks to the Kubernetes CronJob object, those jobs run only once per schedule, independently of the number of replicas. All CronJobs use the same image, but the command parameter is specific to each job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cache-feeds&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sudo&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-u&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;www-data&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/www/MISP/app/Console/cake&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Server&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cacheFeed&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Other improvements across all components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Entrypoint: all file operations are done at the Dockerfile level instead of in the entrypoint script, speeding up the startup phase&lt;/li&gt;
&lt;li&gt;MISP configuration: the software used to be configured dynamically with the CakePHP CLI tool and sed commands based on environment variables. We replaced this process by mounting static configuration files as Kubernetes &lt;em&gt;ConfigMap&lt;/em&gt; objects
&lt;/li&gt;
&lt;li&gt;PHP sessions: PHP was configured to store sessions in Redis, which is already used by the MISP core&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pburgisser/misp-helm-chart"&gt;Helm chart&lt;/a&gt;: all the Kubernetes manifests were packaged into a Helm chart&lt;/li&gt;
&lt;li&gt;Blue/green deployment: shortens the downtime during updates of the underlying infrastructure or the application&lt;/li&gt;
&lt;/ul&gt;
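For the PHP sessions point, the change boils down to a php.ini fragment along these lines; the Redis host name is a placeholder, and the phpredis extension must be installed:

```ini
; Store PHP sessions in Redis so any replica can serve any request.
; Requires the phpredis extension.
session.save_handler = redis
session.save_path = "tcp://misp-redis:6379"
```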

&lt;h1&gt;
  
  
  Next steps
&lt;/h1&gt;

&lt;p&gt;We’re already working on a v4, in which we’ll focus on security aspects as well as alignment with Kubernetes and Docker best practices.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We quite often read from Kubernetes providers that moving monoliths to containers is easy, but they tend to omit that scaling isn’t an easy task. Also, monitoring, backups and all the other known challenges of traditional VMs still exist with containers and can be even more complex at scale.&lt;/p&gt;

&lt;p&gt;Through the whole project, we learned a lot about how the MISP works by testing numerous parameters but also by reading a lot of code to better understand what and how to scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  Links
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pburgisser/misp-container-frontend"&gt;Frontend container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pburgisser/misp-container-worker"&gt;Workers container&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pburgisser/misp-helm-chart"&gt;Helm Chart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/coolacid/docker-misp"&gt;Coolacid's version&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>misp</category>
      <category>kubernetes</category>
      <category>ioc</category>
      <category>cti</category>
    </item>
    <item>
      <title>Enable OpenShift login on ArgoCD from GitOps Operator</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Thu, 20 May 2021 12:48:47 +0000</pubDate>
      <link>https://dev.to/camptocamp-ops/enable-openshift-login-on-argocd-from-gitops-2h9a</link>
      <guid>https://dev.to/camptocamp-ops/enable-openshift-login-on-argocd-from-gitops-2h9a</guid>
      <description>&lt;p&gt;Since few weeks now, the operator Red Hat OpenShift GitOps became GA and embbed tools like Tekton and ArgoCD.&lt;/p&gt;

&lt;p&gt;When the operator is deployed, it provisions a vanilla ArgoCD which misses the integrated OpenShift login. In this post, we are going to review the steps to enable it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Deploy and fine tune the Red Hat OpenShift GitOps
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Follow the &lt;a href="https://docs.openshift.com/container-platform/4.7/cicd/gitops/installing-openshift-gitops.html"&gt;official documentation&lt;/a&gt; to install the operator&lt;/li&gt;
&lt;li&gt;Once the operator is deployed, go to the menu &lt;strong&gt;Operators&lt;/strong&gt;&amp;gt;&lt;strong&gt;Installed Operators&lt;/strong&gt; and click on the freshly deployed &lt;strong&gt;Red Hat OpenShift GitOps&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Using the dropdown &lt;strong&gt;Actions&lt;/strong&gt; on top right of the page, choose &lt;strong&gt;Edit Subscription&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the YAML code, under the &lt;strong&gt;spec&lt;/strong&gt; level, enable the Dex feature for external authentication and click &lt;strong&gt;Save&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DISABLE_DEX&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false'&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc patch subscription openshift-gitops-operator &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-operators &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;merge &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"config":{"env":[{"name":"DISABLE_DEX","Value":"false"}]}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Configure ArgoCD to allow OpenShift authentication
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Change the project to &lt;strong&gt;openshift-gitops&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to the menu &lt;strong&gt;Operators&lt;/strong&gt;&amp;gt;&lt;strong&gt;Installed Operators&lt;/strong&gt; and click on &lt;strong&gt;Red Hat OpenShift GitOps&lt;/strong&gt;, select tab &lt;strong&gt;Argo CD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;On the ArgoCD instance list, click on the three dots at the very left of the &lt;strong&gt;openshift-gitops&lt;/strong&gt; and select &lt;strong&gt;Edit ArgoCD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the YAML code, under the &lt;strong&gt;spec&lt;/strong&gt; level, update the Dex and RBAC sections to match the following
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;openShiftOAuth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rbac&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;defaultPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role:readonly'&lt;/span&gt;
    &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;g, system:cluster-admins, role:admin&lt;/span&gt;
    &lt;span class="na"&gt;scopes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[groups]'&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc patch argocd openshift-gitops &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-gitops &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;merge &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"dex":{"openShiftOAuth":true},"rbac":{"defaultPolicy":"role:readonly","policy":"g, system:cluster-admins, role:admin","scopes":"[groups]"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Monitor the pods being restarted to apply the configuration and test your login&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>gitops</category>
      <category>openshift</category>
      <category>argocd</category>
      <category>authentication</category>
    </item>
    <item>
      <title>How to Integrate AWS Cognito with OpenShift 4</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Fri, 26 Mar 2021 13:14:19 +0000</pubDate>
      <link>https://dev.to/camptocamp-ops/how-to-integrate-aws-cognito-with-openshift-4-4n43</link>
      <guid>https://dev.to/camptocamp-ops/how-to-integrate-aws-cognito-with-openshift-4-4n43</guid>
      <description>&lt;p&gt;In this post, we are going to integrate the &lt;a href="https://aws.amazon.com/cognito/" rel="noopener noreferrer"&gt;Cognito authentication service&lt;/a&gt; from AWS with Red Hat &lt;a href="https://www.openshift.com/" rel="noopener noreferrer"&gt;OpenShift 4&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OpenShift 4 comes with a wide range of authentication providers to authenticate users, they can be very basic (&lt;a href="https://docs.openshift.com/container-platform/4.7/authentication/identity_providers/configuring-htpasswd-identity-provider.html#configuring-htpasswd-identity-provider" rel="noopener noreferrer"&gt;HTPasswd&lt;/a&gt;), traditional (&lt;a href="https://docs.openshift.com/container-platform/4.7/authentication/identity_providers/configuring-ldap-identity-provider.html#configuring-ldap-identity-provider" rel="noopener noreferrer"&gt;LDAP&lt;/a&gt;), integrated (&lt;a href="https://docs.openshift.com/container-platform/4.7/authentication/identity_providers/configuring-github-identity-provider.html#configuring-github-identity-provider" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;) or based on &lt;a href="https://openid.net/connect/" rel="noopener noreferrer"&gt;OpenID Connect&lt;/a&gt;. We're going to focus on the OpenID Connect identity provider.&lt;/p&gt;

&lt;h1&gt;
  
  
  Configure AWS Cognito
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Creation of the service
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open your AWS Console and go to the Cognito service&lt;/li&gt;
&lt;li&gt;Create a user pool and choose the &lt;strong&gt;Step through settings&lt;/strong&gt; option&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Attributes&lt;/strong&gt; section, ensure you configure the following:

&lt;ol&gt;
&lt;li&gt;How do you want your end users to sign in? &lt;strong&gt;Username&lt;/strong&gt; (you can also allow users to use their e-mail address)&lt;/li&gt;
&lt;li&gt;Which standard attributes do you want to require?: &lt;strong&gt;email, profile, preferred username&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;In the &lt;strong&gt;App client&lt;/strong&gt; section, click on &lt;strong&gt;Add an app client&lt;/strong&gt;, enter an app name (e.g.: OCP) and ensure you select &lt;strong&gt;Enable username password based authentication (&lt;code&gt;ALLOW_USER_PASSWORD_AUTH&lt;/code&gt;)&lt;/strong&gt;, keep the rest as it is and click on &lt;strong&gt;Create app client&lt;/strong&gt;
&lt;/li&gt;

&lt;li&gt;Once the pool is created, go to &lt;strong&gt;App integration&lt;/strong&gt; &amp;gt; &lt;strong&gt;App client settings&lt;/strong&gt; and configure the following:

&lt;ol&gt;
&lt;li&gt;Enabled Identity Providers: &lt;strong&gt;Select all&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Cognito User Pool: &lt;em&gt;Checked&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Callback URL(s): &lt;code&gt;https://oauth-openshift.apps.demo.example.com/oauth2callback/Cognito&lt;/code&gt; (&lt;em&gt;Match to your domain&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Allowed OAuth Flows: &lt;strong&gt;Authorization code grant&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Allowed OAuth Scopes: &lt;strong&gt;email&lt;/strong&gt;,&lt;strong&gt;openid&lt;/strong&gt;,&lt;strong&gt;aws.cognito.signin.user.admin&lt;/strong&gt;,&lt;strong&gt;profile&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Confirm configuration by clicking on &lt;strong&gt;Create pool&lt;/strong&gt; in the &lt;strong&gt;Review&lt;/strong&gt; section&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Configuration of the service
&lt;/h2&gt;

&lt;p&gt;Now that the user pool has been created and configured to accept authentication requests from OpenShift, we'll have to gather some information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the freshly created pool, open the &lt;strong&gt;General settings&lt;/strong&gt; section and copy the value of &lt;strong&gt;Pool Id&lt;/strong&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6xi4y7dcz6a15vzbtff.png" alt="Pool Id"&gt;
&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;App clients&lt;/strong&gt; section, you should find the app client created previously; click on &lt;strong&gt;Show Details&lt;/strong&gt; to expand it and gather the &lt;strong&gt;App client id&lt;/strong&gt; and the &lt;strong&gt;App client secret&lt;/strong&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61zxxe9klmkvxr16xoqt.png" alt="Client Id"&gt;
&lt;/li&gt;
&lt;li&gt;Create some users in your pool&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Configure OpenShift 4
&lt;/h1&gt;

&lt;p&gt;Now that Cognito is ready, let's configure OpenShift to use the values gathered previously.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Administration&lt;/strong&gt; &amp;gt; &lt;strong&gt;Cluster Settings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Global Configuration&lt;/strong&gt; and search for &lt;strong&gt;OAuth&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the dropdown located in the &lt;strong&gt;Identity providers&lt;/strong&gt; part at the bottom, choose &lt;strong&gt;OpenID Connect&lt;/strong&gt; and enter the following information

&lt;ul&gt;
&lt;li&gt;Name: &lt;strong&gt;Cognito&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Client ID: {Gathered client ID from previous step}&lt;/li&gt;
&lt;li&gt;Client Secret: {Gathered client secret}&lt;/li&gt;
&lt;li&gt;Issuer URL: &lt;code&gt;https://cognito-idp.{aws_region}.amazonaws.com/{ pool ID eg: eu-west-1_sLzMKS}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
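The same identity provider can also be declared declaratively in the OAuth cluster resource instead of through the console; here is a sketch with placeholder values for the client ID, the Secret name and the issuer URL:

```yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: Cognito
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: <app-client-id>        # App client id gathered from Cognito
      clientSecret:
        name: cognito-client-secret    # Secret holding the app client secret
      claims:
        preferredUsername:
        - preferred_username
        email:
        - email
      issuer: https://cognito-idp.<aws_region>.amazonaws.com/<pool-id>
```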

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfug8y8d34cbeg0w592u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfug8y8d34cbeg0w592u.png" alt="Configuring OIDC in OpenShift 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Leave the other parameters and click &lt;strong&gt;Add&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give the Authentication Operator a few minutes to reload, then try to log out&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc get clusteroperator authentication
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the authentication operator has restarted, log out of OpenShift. On the login page, you should now see a button named &lt;strong&gt;Cognito&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qrchs2ad723hepoohrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qrchs2ad723hepoohrv.png" alt="Login page using Cognito"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When clicking on the &lt;strong&gt;Cognito&lt;/strong&gt; button, you'll be redirected to the Cognito login page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2xd2h2pw260a11fatxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2xd2h2pw260a11fatxd.png" alt="Login in to Cognito"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the account you previously created in Cognito and voilà!&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This gives you an overview of how to get Cognito working with your OpenShift cluster. You still need to manage RBAC and groups for the users so they have the correct permissions on the cluster. Also note that all these steps can be automated.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>openshift</category>
      <category>redhat</category>
      <category>authentication</category>
    </item>
    <item>
      <title>TKGI: Observability challenge</title>
      <dc:creator>Philippe Bürgisser</dc:creator>
      <pubDate>Wed, 03 Mar 2021 13:56:03 +0000</pubDate>
      <link>https://dev.to/camptocamp-ops/tkgi-observability-challenge-376f</link>
      <guid>https://dev.to/camptocamp-ops/tkgi-observability-challenge-376f</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this post we’re going to review the observability options on a Kubernetes multi-cluster managed by &lt;a href="https://tanzu.vmware.com/kubernetes-grid" rel="noopener noreferrer"&gt;VMware TKGI (Tanzu Kubernetes Grid Integrated)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When deployed using the TKGI toolset, Kubernetes comes with the concept of metric sinks to collect data from the platform and/or from applications. Metric sinks are based on Telegraf and push metrics to a destination defined in the ClusterMetricSink CR object.&lt;/p&gt;

&lt;p&gt;In our use case, TKGI is used to deploy one Kubernetes cluster per application/environment (dev, qa, prod), from which we need to collect metrics. At this customer, we also operate a Prometheus stack that scrapes data from traditional virtual machines and from containers running on OpenShift, in order to handle alarms and to offer dashboards to end users via Grafana. &lt;br&gt;
We explored several implementation architectures that fit our current monitoring system and our internal processes.&lt;/p&gt;
&lt;h1&gt;
  
  
  Architecture 1
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uxh5xjfgrzwhbghf1ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uxh5xjfgrzwhbghf1ww.png" alt="Architecture 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this scenario, we leverage the (Cluster)MetricSink provided by VMware, configured to push the data into a central InfluxDB database. The data handled by Telegraf can either be pushed to it by applications or scraped by Telegraf itself. Telegraf runs as a pod on each node, deployed via a DaemonSet. Grafana has a data source connector for InfluxDB.&lt;/p&gt;
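&lt;p&gt;A minimal ClusterMetricSink for this scenario could look like the following sketch; the InfluxDB URL and database name are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: influxdb-sink
spec:
  inputs: []
  outputs:
  - type: influxdb
    urls:
    - http://influxdb.example.com:8086
    database: tkgi_metrics
&lt;/code&gt;&lt;/pre&gt;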
&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Easiest implementation&lt;/li&gt;
&lt;li&gt;No extra software to deploy on Kubernetes&lt;/li&gt;
&lt;li&gt;Multi-tenancy of data&lt;/li&gt;
&lt;li&gt;RBAC for data access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;InfluxDB cannot scale and there is no HA in the free version&lt;/li&gt;
&lt;li&gt;Grafana dashboards need to be rewritten to match the InfluxDB query language&lt;/li&gt;
&lt;li&gt;Harder integration with our current alarm flow&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Architecture 2
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qc8ka8ewiqu2b6rsog9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qc8ka8ewiqu2b6rsog9.png" alt="Architecture 2"&gt;&lt;/a&gt;&lt;br&gt;
Telegraf is able to expose data using the Prometheus format over an HTTP endpoint. This configuration is done using the &lt;a href="https://docs.pivotal.io/tkgi/1-10/create-sinks.html#define-sinks" rel="noopener noreferrer"&gt;MetricSink CR&lt;/a&gt;. Prometheus will then scrape the Telegraf service.&lt;br&gt;
Telegraf is deployed on each node using a DaemonSet and comes with a Kubernetes service through which it can be reached. As Prometheus sits outside of the targeted cluster, it cannot access each Telegraf endpoint directly and has to go through that Kubernetes service. The main drawback of this architecture is that we cannot ensure all endpoints are scraped evenly, which may create gaps in the metrics. We have also noticed that when Telegraf is configured to expose Prometheus data over HTTP, the service isn't updated to match the newly exposed port. One solution would have been to create another service in the namespace where Telegraf resides, but due to RBAC we aren't allowed to do so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pksapi.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterMetricSink&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-demo-sink&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;monitor_kubernetes_pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus_client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
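&lt;p&gt;The extra service mentioned above, which RBAC prevented us from creating, would roughly look like this; the namespace, labels and the Telegraf prometheus_client port 9273 are assumptions based on common defaults:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: telegraf-prometheus
  namespace: pks-system
spec:
  selector:
    app: telegraf
  ports:
  - name: metrics
    port: 9273
    targetPort: 9273
&lt;/code&gt;&lt;/pre&gt;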



&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We can leverage the usage of Telegraf and MetricSinks&lt;/li&gt;
&lt;li&gt;Integration with our existing Prometheus stack&lt;/li&gt;
&lt;li&gt;Prometheus ServiceDiscovery possible through Kubernetes API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No direct access to Telegraf endpoints&lt;/li&gt;
&lt;li&gt;Depending on the number of targets to discover in each Kubernetes cluster, ServiceDiscovery performance can be impacted&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Architecture 3
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cb4hpu8nze31b6gj35i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cb4hpu8nze31b6gj35i.png" alt="Architecture 3"&gt;&lt;/a&gt;&lt;br&gt;
In this architecture, we configure Prometheus to directly scrape exporters running on each cluster. Unfortunately, each replica of a pod running an exporter exposes its endpoint through a Kubernetes service. As mentioned in architecture 2, Prometheus, sitting outside the cluster, cannot directly scrape an endpoint, so we cannot ensure the scraping is done evenly.&lt;/p&gt;
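&lt;p&gt;For reference, scraping a remote cluster through its API using Prometheus ServiceDiscovery looks roughly like this; the API server URL and credential paths are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;scrape_configs:
- job_name: tkgi-dev-endpoints
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://tkgi-dev.example.com:8443
    tls_config:
      ca_file: /etc/prometheus/tkgi-dev-ca.crt
    bearer_token_file: /etc/prometheus/tkgi-dev.token
&lt;/code&gt;&lt;/pre&gt;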
&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Integration with our Prometheus stack&lt;/li&gt;
&lt;li&gt;Prometheus ServiceDiscovery possible through Kubernetes API&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No direct access to exporter endpoints&lt;/li&gt;
&lt;li&gt;Does not scale well&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Architecture 4
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbpkxnxva77pj47st27q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbpkxnxva77pj47st27q.png" alt="Architecture 4"&gt;&lt;/a&gt;&lt;br&gt;
This is a hybrid approach where we leverage the metric tooling provided by VMware. We push all the metrics into an &lt;a href="https://github.com/prometheus/influxdb_exporter" rel="noopener noreferrer"&gt;InfluxDB exporter&lt;/a&gt; acting as a proxy cache, which is scraped by Prometheus.&lt;/p&gt;
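&lt;p&gt;Sketched out, the sink output points at the influxdb_exporter (which accepts the InfluxDB write protocol and listens on port 9122 by default), and Prometheus scrapes that same port; the hostname is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;# ClusterMetricSink output pointing at the exporter instead of a real InfluxDB
outputs:
- type: influxdb
  urls:
  - http://influxdb-exporter.example.com:9122

# Prometheus scrape configuration for the exporter
scrape_configs:
- job_name: influxdb-exporter
  static_configs:
  - targets: ['influxdb-exporter.example.com:9122']
&lt;/code&gt;&lt;/pre&gt;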
&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Leveraging VMware tooling&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;InfluxDB exporter becomes an SPOF (Single Point of Failure)&lt;/li&gt;
&lt;li&gt;Extra components to manage&lt;/li&gt;
&lt;li&gt;No Prometheus ServiceDiscovery available&lt;/li&gt;
&lt;li&gt;Handling of data expiration&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Architecture 5
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q5f0gs6ioe1z5ykagek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q5f0gs6ioe1z5ykagek.png" alt="Architecture 5"&gt;&lt;/a&gt;&lt;br&gt;
In this architecture we introduce PushProx, composed of a proxy running on the same cluster as Prometheus and agents running on each Kubernetes cluster. The agents initiate a connection to the proxy to create a tunnel, through which Prometheus can directly scrape each endpoint.&lt;/p&gt;

&lt;p&gt;Each scraping configuration will need to have a proxy referenced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
  &lt;span class="na"&gt;proxy_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://proxy:8080/&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bypass network segmentation&lt;/li&gt;
&lt;li&gt;Integration with our Prometheus stack&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No Prometheus ServiceDiscovery&lt;/li&gt;
&lt;li&gt;Scaling issue&lt;/li&gt;
&lt;li&gt;Extra component to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Architecture 6
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdios2ckec4ucfwvn5v9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdios2ckec4ucfwvn5v9r.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
In this architecture, a Prometheus instance is deployed on each cluster to scrape the targets residing in that cluster, so the data is stored on each instance. The major difference in this approach is that only the AlertManager and Grafana are shared across all clusters.&lt;/p&gt;
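&lt;p&gt;In such a setup, each per-cluster Prometheus simply points at the shared AlertManager; the hostname below is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager.central.example.com:9093']
&lt;/code&gt;&lt;/pre&gt;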

&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Best integration with our Prometheus stack&lt;/li&gt;
&lt;li&gt;Multi-tenancy&lt;/li&gt;
&lt;li&gt;Federation possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Memory and CPU footprint due to repeating the same services on every cluster&lt;/li&gt;
&lt;li&gt;Does not use any TKGI metric component&lt;/li&gt;
&lt;li&gt;Multiple instances to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;After testing almost all of these architectures, we came to the conclusion that architecture 6 is the best match for our current architecture and needs. We also favored Prometheus because it can easily be deployed using the operator, and features such as HA are managed automatically. We did, however, have to make some compromises, such as not using the TKGI metric components and somewhat “reinventing the wheel”, as we believe that monitoring and alerting should be done by pulling data rather than pushing it.&lt;/p&gt;
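&lt;p&gt;As an illustration, deploying an HA pair with the Prometheus Operator boils down to a single resource; the name and namespace are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
&lt;/code&gt;&lt;/pre&gt;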

&lt;h1&gt;
  
  
  Disclaimer
&lt;/h1&gt;

&lt;p&gt;This research was made on a TKGI environment that hasn't been installed and operated by Camptocamp.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>prometheus</category>
    </item>
  </channel>
</rss>
