
Chabane R. for Stack Labs


Mayday, mayday! I need a scalable infrastructure to migrate on Scaleway Elements! Part 2 - Ops & Container migration

Hello again!

We saw in part 1 how to build a scalable Scaleway Elements organization: implementing a network topology, defining a disaster recovery plan, and centralizing IAM and key management.

In this part 2, we will talk about DevOps, monitoring, logging, and the migration of business applications, using Kubernetes as an example.

DevOps

The Cloud is ruled by DevOps – this fact has been proven time and again. [1]

I highly recommend that all my customers adopt a DevOps culture and practices from the beginning.

Using a CI/CD tool becomes vital when you have multiple environments and a complex infrastructure configuration that relies on infrastructure as code.

Many DevOps tools exist on the market. Let's take GitLab as an example.

GitLab

With GitLab, you can either manage your own instance in your infrastructure or use the SaaS solution, gitlab.com.

In the SaaS solution, you can choose between:

  • shared runners that are managed by GitLab's infrastructure and limited in CI/CD minutes (depending on your licence subscription),
  • specific or group runners that are managed by your own infrastructure, with no limit on minutes.

If you choose shared runners to deploy your resources in Scaleway Elements, you will need to create and store service account keys as variables in the GitLab CI configuration. You will face all the security issues related to storing credentials outside of your cloud infrastructure: key rotation, age, destruction, location, etc.

For production use, I recommend that customers use specific or group runners with jobs running in a Kubernetes Kapsule cluster.

The runner configuration I have used for customers:

  • Specific or group runners only,
  • Deployed using the Helm chart, which makes upgrades easier,
  • Accessible by tags,
  • Locked to the current project(s),
  • Run only on protected branches (production),
  • Assigned a specific Kubernetes Service Account,
  • Jobs run in separate nodepools per environment.
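To illustrate, here is a minimal values.yaml sketch for the official gitlab-runner Helm chart. Exact key names vary across chart versions, and the tags, namespace, service account, and node selector below are placeholder assumptions:

```yaml
# values.yaml for the gitlab-runner Helm chart (key names may differ per chart version)
gitlabUrl: https://gitlab.com/
runnerRegistrationToken: "<stored-in-vault>"  # never commit the real token

rbac:
  create: true

runners:
  tags: "infra,production"   # jobs select this runner by tag
  locked: true               # locked to the current project(s)
  protected: true            # run only on protected branches/tags
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runners"
        service_account = "gitlab-runner-jobs"   # dedicated KSA
        [runners.kubernetes.node_selector]
          "pool" = "ci-production"               # separate nodepool per environment
```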

The Kapsule cluster that hosts those runners has its own private network, isolated from the internet, with an explicit allow security group rule for GitLab ingress traffic on port 443.

If your organization has multiple teams, you can grant each team access to only the runners it uses:

DevOps Kapsule Cluster management

GitOps

GitOps builds on DevOps with Git as a single source of truth for declarative infrastructure like Kubernetes. [2]

GitOps concept

If you have the possibility to migrate your existing workloads to Kubernetes, for example, you can take advantage of GitOps practices.

GitOps Tools

Many tools exist on the market, such as ArgoCD and FluxCD. Personally, I consider ArgoCD the most complete GitOps tool, but you will need to manage it yourself in your infrastructure.

ArgoCD allows you to:

  • Manage configurations for multiple environments using a Git repository as a single source of truth.
  • Review changes before deployment through pull requests on GitHub or merge requests on GitLab.
  • Test and promote changes across different environments.
  • Roll back changes quickly.
  • See your applications' versions and statuses in the ArgoCD console.
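As a concrete example, a minimal ArgoCD Application manifest pointing at an env repository might look like this; the repository URL, path, and namespaces are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.com/my-group/my-app-env.git  # env repo: single source of truth
    targetRevision: main
    path: overlays/production        # Kustomize overlay for this environment
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual drift in the cluster
```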

GitOps Integration

A common pattern for integrating GitOps practices is to create multiple repositories instead of a single source:

  • An image repository dedicated to the business application. The Dockerfile(s) reside in this repository.
  • An env repository dedicated to the deployment in Kubernetes. The Kubernetes manifests reside in this repository.
  • An infra repository dedicated to the deployment in Scaleway Elements. The Terraform plans reside in this repository.

Let's illustrate that with a diagram:

1 - We start by deploying the Kapsule cluster using Terraform and the infra repo:

Deploy Kapsule cluster

The GitLab runner job has a Kubernetes Service Account (KSA) bound to an API key (stored in Vault) with the appropriate permissions in the env project.
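As a sketch, the infra repo pipeline could look like the following .gitlab-ci.yml. The runner tag and the Scaleway credential variables (injected from Vault) are assumptions:

```yaml
# .gitlab-ci.yml (infra repo) — plan, then apply manually on the protected branch.
# SCW_ACCESS_KEY / SCW_SECRET_KEY / SCW_DEFAULT_PROJECT_ID are injected as CI variables.
image:
  name: hashicorp/terraform:light
  entrypoint: [""]

stages:
  - plan
  - apply

plan:
  stage: plan
  tags: [infra]            # target the specific runner registered for this repo
  script:
    - terraform init
    - terraform plan -out=plan.tfplan
  artifacts:
    paths: [plan.tfplan]

apply:
  stage: apply
  tags: [infra]
  script:
    - terraform init
    - terraform apply plan.tfplan
  when: manual             # human approval before touching the infrastructure
  only:
    - main                 # protected branch
```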

2 - Build a new Docker image after each Git tag:

Build and push docker image

The new Docker image is built and published to a centralized container registry.

Share the specific runner registered in the infra repo with the image repo.
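A hedged sketch of the image repo pipeline, building with Kaniko on each Git tag and pushing to a Scaleway Container Registry namespace (registry path, runner tag, and namespace are placeholders; Scaleway's registry accepts the user "nologin" with a secret key as password):

```yaml
# .gitlab-ci.yml (image repo) — build and push a Docker image for each Git tag
build:
  stage: build
  rules:
    - if: $CI_COMMIT_TAG   # run only when a tag is pushed
  tags: [image]            # target the shared specific runner
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # Registry auth for Kaniko: write a Docker config.json from CI variables
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"rg.fr-par.scw.cloud\":{\"auth\":\"$(printf "%s:%s" nologin "${SCW_SECRET_KEY}" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json
    # Build and push without a Docker daemon
    - /kaniko/executor
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --destination "rg.fr-par.scw.cloud/devops-registry/my-app:${CI_COMMIT_TAG}"
```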

3 - Edit the Docker image version in the Kubernetes manifests using Kustomize:

Sync ArgoCD

The Kubernetes workloads are updated automatically with the new Docker image version using a GitOps tool like ArgoCD.

Share the specific runner registered in the infra repo with the env repo, and lock the runner to the current GitLab projects.
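With Kustomize, the env repo pipeline only has to bump the image tag in the overlay and commit it; ArgoCD then syncs the change. A minimal sketch, with image and path names as placeholders:

```yaml
# overlays/production/kustomization.yaml (env repo)
# The CI job can update the tag with, for example:
#   kustomize edit set image my-app=rg.fr-par.scw.cloud/devops-registry/my-app:1.2.0
# then commit and push; ArgoCD picks up the new revision.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: my-app   # image name used in the base manifests
    newName: rg.fr-par.scw.cloud/devops-registry/my-app
    newTag: "1.2.0"
```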

4 - Authorize the Kubernetes cluster of the env project to pull Docker images from the DevOps project:

Pull image
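One way to do this is to create a registry pull secret in the env cluster from an API key of the DevOps project and reference it in the workloads. A minimal sketch, with all names as placeholders:

```yaml
# The secret is created beforehand, e.g. with:
#   kubectl create secret docker-registry scw-registry \
#     --docker-server=rg.fr-par.scw.cloud \
#     --docker-username=nologin \
#     --docker-password=$SCW_SECRET_KEY \
#     -n my-app
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      imagePullSecrets:
        - name: scw-registry   # grants pull access to the DevOps registry
      containers:
        - name: my-app
          image: rg.fr-par.scw.cloud/devops-registry/my-app:1.2.0
```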

We can easily develop a generator that creates, for each new env project, a blueprint of GitLab repositories and pipelines. I had the chance to build such a generator for a customer using Yeoman. It was a really cool challenge.

Final architecture:

GitOps architecture

Operations

Monitoring

If you have multiple Kubernetes clusters to operate, you can centralize all of your metrics in a single monitoring dashboard covering all of your environments (development, integration, and production).

You can achieve this centralization using:

  • Prometheus to collect metrics.
  • Thanos alongside Prometheus, for long-term storage and federation.
  • Alertmanager to send alerts based on metric queries.
  • Grafana for dashboards.

Monitoring

For more details, you can consult this article: Multi-Cluster Monitoring with Thanos
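As a concrete illustration, Thanos persists metric blocks to an object store. A minimal objstore configuration targeting Scaleway Object Storage (S3-compatible) might look like the following, with the bucket name and keys as placeholders:

```yaml
# objstore.yaml — mounted as a secret by the Thanos sidecar, store, and compactor
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.fr-par.scw.cloud   # Scaleway Object Storage S3 endpoint
  region: fr-par
  access_key: <SCW_ACCESS_KEY>
  secret_key: <SCW_SECRET_KEY>
```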

Logging

Hot

When you deploy containers to your development cluster, you need quick access to logs to analyze container errors. A good solution is to deploy Rancher in your DevOps cluster. Rancher has many features:

  • An admin console,
  • A cluster overview,
  • Access to workload logs and states,
  • Role-based access control, etc.

Hot logging

Cold

When you operate your production cluster, in the event of a disaster or failure, you will need access to the log history. The ELK stack might be a good solution for storing logs over the long term:

Cold logging
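One common way to feed the stack is a Filebeat DaemonSet shipping container logs from each node to Elasticsearch. A minimal sketch, where the Elasticsearch host is an assumption:

```yaml
# filebeat.yml — deployed as a DaemonSet so every node ships its container logs
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log   # container logs collected on the node

output.elasticsearch:
  hosts: ["https://elasticsearch.logging.svc:9200"]
  # Retention for cold storage is then handled in Elasticsearch, e.g. via ILM policies.
```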

Business applications

Let's say we have Docker images hosted on-premises and we want to deploy them to a Scaleway Elements service.

There are two ways to achieve that:

  • Lift & shift to a Virtual Instance, if we have a self-managed Docker platform.
  • Improve & move to Serverless Containers or Kubernetes Kapsule.

Unless there is a strong dependency between the Docker runtime and the business applications, there is little value in going to the cloud if you don't use a managed service for Docker.

Serverless Containers is currently ideal if your application is internet facing and does not depend on network or security restrictions.

Kapsule remains the most widely used option for deploying Docker images, as it's fully managed by Scaleway Elements and well maintained and secured by the cloud provider.

Configuring a Kapsule cluster

I apply the following configuration to secure a Kapsule cluster:

  • The most stable version possible; do not use an upstream release,
  • Kapsule attached to a private network (possible soon),
  • Backup of the cluster using Velero,
  • Pools with auto-scaling & auto-healing enabled,
  • Network Policies enabled to manage network traffic (see the sketch after this list),
  • HTTPS load balancer (LB) enabled for internet-facing applications, plus LB ACLs when necessary,
  • Keycloak for internal applications.
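To illustrate the Network Policy point, a classic baseline is to deny all ingress in an application namespace, then explicitly allow only the flows you need. The namespace, labels, and port below are placeholders:

```yaml
# Deny all ingress traffic in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow the frontend tier to reach the backend tier
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      tier: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              tier: frontend
      ports:
        - port: 8080
```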

Kubernetes architecture

To migrate a 3-tier architecture to Kapsule, I commonly deploy this architecture in Kubernetes:

Kubernetes Architecture

Depending on the use case, you can deploy additional tools like Istio.

I wrote a complete article on how to secure connectivity between a Kapsule container and RDB databases.

Docker images

When you migrate existing Docker images to the cloud, it's not easy to optimize the existing layers for fear of breaking something.

What I recommend for all existing images is to:

  • Properly tag images (semantic versioning and/or Git commit hash),
  • Use Kaniko to build and push images (as sketched in step 2 above),
  • Use Container Registry to store the images.

For new images:

  • Package a single app per container,
  • Build the smallest image possible,
  • Optimize for the Docker build cache,
  • Remove unnecessary tools,
  • Carefully consider whether to use a public image,
  • Use Clair for vulnerability scanning.

Cost Savings tips

When you move existing workloads to a cloud provider, you can save money in two different ways:

  • Cost-optimization: Running the workloads in the right service with the right configuration.
  • Cost-cutting: Removing deprecated and unused resources.

To apply cost-cutting, you can develop a Serverless Function that:

  • Stops all non-serverless resources in non-production environments, such as VMs and RDB instances,
  • Resizes Kapsule nodepools to zero in non-production environments,
  • Removes old Docker images.

Cost Cutting
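The function itself would call the Scaleway API on a schedule. As a minimal sketch of the same idea, here is a GitLab scheduled pipeline using the scw CLI instead; the image name, resource IDs, and exact commands are assumptions to verify against the Scaleway documentation:

```yaml
# .gitlab-ci.yml — triggered by a GitLab pipeline schedule (e.g. every evening)
stop-non-prod:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  image:
    name: scaleway/cli:latest   # scw CLI image (assumed)
    entrypoint: [""]
  script:
    # Stop the development VMs (server ID is a placeholder)
    - scw instance server stop 11111111-1111-1111-1111-111111111111 zone=fr-par-1
    # Scale the development Kapsule nodepool down to zero (pool ID is a placeholder)
    - scw k8s pool update 22222222-2222-2222-2222-222222222222 size=0
    # Old Docker images and RDB instances would be cleaned up following the same pattern
```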

For new workloads, the best way to save money is to use Scaleway Serverless services as much as possible.

Conclusion

It took us around 20 days the first time we deployed such an architecture. However, after applying these principles, the subsequent implementations took just a few days to deploy.

There are other subjects I would have liked to give feedback on: Data Analytics and AI implementation. We can keep those for a part 3.

In the meantime, if you have any questions or feedback, please feel free to leave a comment.

Otherwise, I hope this helped you see how automating everything via CI/CD and using infrastructure as code everywhere will give you a scalable strategy for migrating to Scaleway Elements, whatever your business.

By the way, do not hesitate to share with peers 😊

Documentation:

[1] https://www.idexcel.com/blog/true-business-efficiency-combines-the-power-of-cloud-computing-and-devops-practices/
[2] https://www.weave.works/blog/gitops-modern-best-practices-for-high-velocity-application-development

Top comments (1)

Aliaksandr Valialkin

Thanks for the detailed blog post on scalable architecture from a DevOps perspective!
Did you consider other Prometheus-compatible long-term storage and monitoring solutions such as Cortex, M3DB, TimescaleDB or VictoriaMetrics? They may be easier to set up and operate. They may also need fewer resources: CPU, RAM, storage space. All these properties may result in substantial cost savings for both DevOps and infrastructure. See, for example, this article.