<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ray Edwards</title>
    <description>The latest articles on DEV Community by Ray Edwards (@rayedwards).</description>
    <link>https://dev.to/rayedwards</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1172691%2Ff27cfe76-8f2a-452b-b7df-82efe4b9d29c.png</url>
      <title>DEV Community: Ray Edwards</title>
      <link>https://dev.to/rayedwards</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rayedwards"/>
    <language>en</language>
    <item>
      <title>7 Considerations for Multi-Cluster Kubernetes</title>
      <dc:creator>Ray Edwards</dc:creator>
      <pubDate>Tue, 16 Jan 2024 21:06:01 +0000</pubDate>
      <link>https://dev.to/rayedwards/7-considerations-for-multi-cluster-kubernetes-44gb</link>
      <guid>https://dev.to/rayedwards/7-considerations-for-multi-cluster-kubernetes-44gb</guid>
      <description>&lt;p&gt;In the IT space today, customers often intermix Multi Cloud and Hybrid Cloud terms without necessarily understanding the distinction between them.&lt;/p&gt;

&lt;p&gt;A hybrid cloud is a cloud computing environment that combines public and private (typically on-premise) clouds, allowing organizations to utilize the benefits of both. In a hybrid cloud, an organization can store and process critical data and applications in its private cloud, while using the public cloud for non-sensitive data, such as testing and development.&lt;/p&gt;

&lt;p&gt;The hybrid cloud model is becoming increasingly popular among organizations because it enables them to optimize their IT infrastructure, while keeping costs under control. Additionally, hybrid cloud environments can provide a more seamless and integrated user experience, with the ability to move workloads between public and private clouds based on business needs.&lt;/p&gt;

&lt;p&gt;Multi-cloud, on the other hand, is a setup that involves the use of multiple cloud computing platforms from different vendors. This approach enables organizations to use the best cloud services and features from different providers to create a more optimized and customized IT environment.&lt;/p&gt;

&lt;p&gt;Both these approaches provide IT groups with flexibility and certain benefits.&lt;/p&gt;

&lt;p&gt;In a multi-cloud environment, an organization can leverage the strengths of different cloud providers, such as AWS, Azure, Google Cloud, and others, to achieve a range of benefits such as increased scalability, flexibility, resilience, and cost-effectiveness. Multi-cloud also enables businesses to avoid vendor lock-in and achieve greater redundancy, as data and applications can be distributed across multiple cloud platforms.&lt;/p&gt;

&lt;p&gt;Hybrid cloud for its part, enables businesses to have greater flexibility in their IT infrastructure, allowing them to leverage the scalability and cost-effectiveness of the public cloud for non-critical workloads, while keeping sensitive data and applications within their own private cloud, which provides greater control, security, and compliance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Google Internet searches of Hybrid Cloud and Multi Cloud&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1i5j2cvv40wdgcf3u2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1i5j2cvv40wdgcf3u2p.png" alt="Google Searches"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hybrid Cloud and Multi Cloud adoption is growing rapidly. The Flexera 2023 State of the Cloud Report highlights that the vast majority of enterprises have adopted a hybrid cloud model, and almost 87% take a Multi Cloud approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1l4k369s74o7vwbqodh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1l4k369s74o7vwbqodh.png" alt="flexera report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While both of these approaches have inherent benefits, this blog raises a number of challenges that organizations should consider in advance, along with some possible solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Complexity of Cloud Orchestration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Managing multiple cloud environments, each with its own unique nuances, is a significant challenge. Even Kubernetes, which was envisioned as a way to abstract away infrastructure dependencies, is implemented differently by different cloud providers. EKS, AKS, Rancher, Mirantis, Tanzu and OpenShift are only a few of the major distributions of Kubernetes (and managed Kubernetes) that IT leaders have to contend with. Each one has specific configuration that presents a challenge when moving workloads from one platform to another. Even if one decides to deploy the same distribution, say a private Rancher cluster, onto all the platforms, the end result is the headache of now having to get into the business of managing Kubernetes platforms across providers. You could farm that role out to one of the multi-cloud management providers, but they too have their own deficiencies.&lt;/p&gt;

&lt;p&gt;This approach, however, merely shifts dependence from one vendor to another, defeating the very reason to move to multi-cloud in the first place: avoiding single-vendor lock-in.&lt;/p&gt;

&lt;p&gt;A novel approach that is emerging, and one that IT leaders should consider, is the idea of virtualizing the applications themselves. By virtualizing or isolating at the application or microservice (namespace) level, users free themselves from the underlying platform, or even the Kubernetes distribution, and can instead run the application or microservice seamlessly anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Interoperability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Aside from the orchestration itself, IT managers must also contend with application interoperability across their deployments. This is where the cloud provider’s interests and the customer’s interests don’t always align.&lt;/p&gt;

&lt;p&gt;AWS, for example, wants to provide services to its customers in the most cost-effective and efficient way possible. It will frequently customize a service, say a Redis database, to run efficiently on AWS infrastructure. This hyper-customization can give AWS a competitive advantage over other platforms, but it also has a downside.&lt;/p&gt;

&lt;p&gt;Customers who deploy on this hyper-customized version of Redis may find that their database is no longer easily ported out to, say, Azure Cache for Redis (Azure’s deployment of Redis). Once again, the customer may find themselves back where they started, i.e., locked in to a specific provider.&lt;/p&gt;

&lt;p&gt;So the challenge customers must solve is how to ensure that data and applications in different cloud environments can be distributed and synchronized.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data Portability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There is another nasty little surprise that hits some customers when it comes to their data. Cloud providers make it fairly easy to upload data into their platform, most times with no extra fees, but if a customer wants to move data out of the platform, they can get hit with hefty ‘data egress’ fees. This ‘data tax’ can be quite expensive depending on the amount of data that is moved out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuul4tbqbzotabub16xnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuul4tbqbzotabub16xnv.png" alt="data egress fees"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is better known as the “Hotel California Effect”: you can check out any time you like, but you can never leave.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Security &amp;amp; Governance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Of course, security and governance concerns are paramount to any application deployment. The situation is compounded many times over when leaders have to plan for hybrid cloud and multi cloud deployments.&lt;/p&gt;

&lt;p&gt;A thorough discussion around security and governance for hybrid cloud and multi-cloud deployments will be covered in a future blog, but for now customers should consider a few items as essential to ensure success.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encrypting all traffic&lt;/li&gt;
&lt;li&gt;Monitoring and auditing clusters and nodes&lt;/li&gt;
&lt;li&gt;Locking down and securing endpoints&lt;/li&gt;
&lt;li&gt;Scanning for vulnerabilities and patching regularly&lt;/li&gt;
&lt;li&gt;Enforcing a zero-trust security posture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The steps above help provide security at the platform or cluster level, but practitioners need to pay particular attention to individual application-level security issues.&lt;/p&gt;

&lt;p&gt;Since a single cluster may host many applications, it will also likely have multiple namespaces set up within the cluster. Each application may have its own set of security policies and user access rules that need to be enforced. Tools such as KubeSlice are indispensable here, allowing managers to set namespace-level policies and propagate them seamlessly across all platforms where the slice (namespace) is running.&lt;/p&gt;
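&lt;p&gt;As a concrete illustration of namespace-level policy (using a plain Kubernetes NetworkPolicy rather than KubeSlice’s own configuration, whose schema is not shown here), a default-deny rule per application namespace is a common zero-trust baseline:&lt;/p&gt;

```yaml
# Hypothetical example: deny all ingress and egress for every pod
# in the "payments" namespace (the namespace name is illustrative).
# Allowed traffic is then whitelisted by additional, more specific
# NetworkPolicies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}        # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

&lt;p&gt;Applying the same manifest to every cluster that hosts the namespace keeps the policy consistent across platforms.&lt;/p&gt;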

&lt;h2&gt;
  
  
  &lt;strong&gt;Resource Optimization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One major reason companies adopt cloud deployments is to reduce operating costs by reducing or eliminating on-premises infrastructure. But a multi-cloud environment can easily become a costly exercise without effective management.&lt;/p&gt;

&lt;p&gt;My previous blog covered cost optimization considerations for Kubernetes clusters in depth. In Multi Cloud environments, customers may find that a single autoscaling process does not work the same way for every cloud vendor. This is yet another reason to consider an automated tool such as SmartScaler. In addition to right-sizing Kubernetes deployments, SmartScaler uses Reinforcement Learning to understand the specific characteristics of each cloud provider’s autoscaling process and optimizes the deployment accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Talent shortage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One reason enterprises are hesitant about hybrid cloud or multi cloud setups is the lack of skills for such projects. There is a dearth of architects with the expertise and deep know-how to properly manage and run multi-cloud or hybrid environments. Business leaders have to find and retain these practitioners before embarking on a hybrid journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Resiliency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It is rather surprising how many businesses migrate workloads to the cloud without really planning for resiliency. Any simple search will provide multiple instances where one of the major cloud providers suffered outages that impacted businesses. Simply putting one's trust in the distributed cloud instance does not protect an application. Businesses should implement a resiliency strategy that includes load balancers that route traffic to an active cluster seamlessly in cases of failure.&lt;/p&gt;

&lt;p&gt;Since cloud vendors are not going to encourage workloads to be routed to a competitor, IT leaders should implement solutions such as KubeSlice that abstract the workload from the underlying infrastructure, ensuring the application is always available via intelligent routing.&lt;/p&gt;

&lt;p&gt;It is also important not to over-engineer resiliency to the other extreme.&lt;br&gt;
Some IT leaders leave it to individual application teams to craft their own disaster recovery strategies. This can lead to multiple conflicting DR setups, which in turn introduce more complexity into the environment.&lt;/p&gt;

&lt;p&gt;At an enterprise level, businesses should adopt a Cloud Native Disaster Recovery strategy that provides a baseline for Kubernetes applications and databases, and allows individual teams to tweak it to their specific needs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Focus Items&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Traditional Disaster Recovery approach&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Cloud Native Disaster Recovery approach&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Failure detection and trigger
   &lt;/td&gt;
   &lt;td&gt;Human
   &lt;/td&gt;
   &lt;td&gt;Fully autonomous
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;DR actions / Procedure
   &lt;/td&gt;
   &lt;td&gt;A mixture of human action and automation
   &lt;/td&gt;
   &lt;td&gt;Fully Automated
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Recovery Time Objective (RTO)
   &lt;/td&gt;
   &lt;td&gt;From Minutes to hours
   &lt;/td&gt;
   &lt;td&gt;Near zero
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Recovery Point Objective (RPO)
   &lt;/td&gt;
   &lt;td&gt;From zero to hours
   &lt;/td&gt;
   &lt;td&gt;Zero
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Process Owner
   &lt;/td&gt;
   &lt;td&gt;Mostly Storage team
   &lt;/td&gt;
   &lt;td&gt;Application itself / team
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Technical components
   &lt;/td&gt;
&lt;td&gt;From Storage products (backups, volume sync)
   &lt;/td&gt;
&lt;td&gt;From Networking products (east-west communication, global load balancer)
&lt;p&gt;
&lt;em&gt;[Note: Kubeslice enables adoption of this approach for all K8s apps]&lt;/em&gt;
   &lt;/p&gt;
&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>devops</category>
      <category>data</category>
    </item>
    <item>
      <title>Kubernetes resiliency (RTO/RPO) in Multi-Cluster deployments</title>
      <dc:creator>Ray Edwards</dc:creator>
      <pubDate>Thu, 07 Dec 2023 21:24:39 +0000</pubDate>
      <link>https://dev.to/rayedwards/kubernetes-resiliency-rtorpo-in-multi-cluster-deployments-32be</link>
      <guid>https://dev.to/rayedwards/kubernetes-resiliency-rtorpo-in-multi-cluster-deployments-32be</guid>
      <description>&lt;p&gt;Ah Kubernetes! The panacea to all our DevOps challenges.&lt;/p&gt;

&lt;p&gt;Kubernetes is the open source container orchestration tool that was supposed to speed up software delivery, secure our applications, lower our costs and reduce our headaches, right?&lt;/p&gt;

&lt;p&gt;Seriously though, Kubernetes has revolutionized how we write and deliver software. And with the proliferation of EKS, AKS, GKE, Red Hat OpenShift, Rancher and K3s, Kubernetes has truly won the container orchestration battle. As we expand our applications, cloud platforms, and data, we start to identify areas where Kubernetes does not quite fulfill the requirements for security and ease of use. Therefore, we need to find ways to help Kubernetes keep up with our growth.&lt;/p&gt;

&lt;p&gt;Kubernetes practitioners have turned to third-party tools for networking, security, and resiliency for stateful apps. This helps to make their deployments more reliable.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll delve a little deeper into data resiliency for Kubernetes apps.&lt;/p&gt;

&lt;p&gt;Kubernetes was designed to solve the challenges of application orchestration, with the assumption that Kubernetes nodes are ephemeral. In reality, however, applications do consume and/or produce data; in Kubernetes, such stateful workloads are managed with StatefulSets. Additionally, Kubernetes objects, CRDs, artifacts, etc. are all details that need to be available in cases of cluster failure, hence the realization that even Kubernetes deployments need a DR strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Rise of StatefulSets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;StatefulSets are designed to handle stateful workloads that require unique network identifiers and stable storage. Databases, message queues, and distributed file systems typically require a stable network identity and persistent storage for each instance. StatefulSets address this challenge by providing ordered, unique network identifiers and persistent storage for each pod in the set. There is, of course, a large contingent of Kubernetes storage deployments that rely on static volume attachments, but they do not have the same horizontal scaling capabilities and are not part of this discussion for now.&lt;/p&gt;

&lt;p&gt;In addition to providing unique IDs and scaling, StatefulSets provide the ability to mount persistent volumes to each pod. This allows stateful applications to store and access data that persists across pod restarts or rescheduling. Each pod in a StatefulSet receives its own unique persistent volume, enabling data locality and minimizing the impact on other pods.&lt;/p&gt;
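&lt;p&gt;To make this concrete, here is a minimal sketch of a StatefulSet with a volumeClaimTemplate (names, image, and storage class are illustrative). Each replica gets a stable identity (db-0, db-1, db-2) and its own PersistentVolumeClaim:&lt;/p&gt;

```yaml
# Hypothetical example: a 3-replica stateful database.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # headless Service providing stable network IDs
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16          # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:               # one PVC per pod: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-csi    # illustrative CSI-backed class
        resources:
          requests:
            storage: 10Gi
```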

&lt;p&gt;The Container Storage Interface (CSI) is the most widely used storage API standard in this space. It enables containerized workloads in Kubernetes to access any block or file storage system. The CSI driver with snapshot support, released at the end of 2019, opened the doors for StatefulSets. See &lt;a href="https://kubernetes-csi.github.io/docs/introduction.html" rel="noopener noreferrer"&gt;Kubernetes CSI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Cloud Native Computing Foundation (CNCF) framework includes various projects to meet the storage requirements of Kubernetes. However, practitioners must be knowledgeable in storage fundamentals. This is not a skill usually associated with DevOps.&lt;/p&gt;

&lt;p&gt;Companies and projects like&lt;a href="https://portworx.com/" rel="noopener noreferrer"&gt; Portworx by Pure&lt;/a&gt;,&lt;a href="https://www.rancher.com/products/longhorn" rel="noopener noreferrer"&gt; Rancher Longhorn&lt;/a&gt;, &lt;a href="https://rook.io/" rel="noopener noreferrer"&gt;Rook&lt;/a&gt; and &lt;a href="https://linbit.com/kubernetes/" rel="noopener noreferrer"&gt;LINBIT&lt;/a&gt; are setting a new standard for container storage (in addition to vendor offerings from &lt;a href="https://www.netapp.com/cloud-services/astra/" rel="noopener noreferrer"&gt;NetApp&lt;/a&gt; and &lt;a href="https://www.hpe.com/us/en/storage/containers.html" rel="noopener noreferrer"&gt;HPE&lt;/a&gt;). They provide several enterprise features that make container storage more efficient.&lt;/p&gt;

&lt;p&gt;With that background, we turn to the considerations for Multi-cluster database resiliency in Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Single Cluster Apps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For applications running in a single cluster, Kubernetes (together with the CSI storage component) provides a feature for volume replication. Each persistent volume can be configured with a number of replica copies (typically 3) kept locally.&lt;/p&gt;
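&lt;p&gt;How replica counts are requested varies by CSI driver; as a sketch, a Portworx-style StorageClass expresses the replica count as a parameter (the parameter name and provisioner string depend on the driver and version you run, so treat these values as illustrative):&lt;/p&gt;

```yaml
# Hypothetical example: request 3 synchronous local replicas per volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-sc
provisioner: pxd.portworx.com   # driver-specific; check your CSI driver's docs
parameters:
  repl: "3"                     # parameter name varies by provisioner
```

&lt;p&gt;PersistentVolumeClaims that reference this class then get the replication behavior automatically.&lt;/p&gt;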

&lt;p&gt;Kubernetes will take action if a node fails: it will restart the service on a new node and attach it to a replicated copy of the volume, helping to prevent data loss. (We are setting aside static CSI or NFS drivers for now.)&lt;/p&gt;

&lt;p&gt;For tools that replicate data locally, the data is copied synchronously, since everything resides in a single local cluster. The Kubernetes control plane restarts the failed workload on a new node almost instantly, so the recovery time is considered zero.&lt;/p&gt;

&lt;p&gt;From a recovery point of view, this is considered to be ZERO Recovery Time Objective (RTO). The Recovery Point Objective (RPO) may also be ZERO in this case, but this of course would depend on mitigating for data corruption, journaling etc.&lt;/p&gt;

&lt;p&gt;Zooming out to complex environments reveals a murkier picture. These environments include deployments that span multiple clusters, platforms, and even cloud providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Multi-cluster database resiliency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Understanding the why and how of Multi Cluster Kubernetes can be difficult. Fortunately, &lt;a href="https://traefik.io/glossary/understanding-multi-cluster-kubernetes/" rel="noopener noreferrer"&gt;this article by our friends at Traefiklabs &lt;/a&gt;provides a clear explanation of Kubernetes Multi Clusters in detail.&lt;/p&gt;

&lt;p&gt;The architecture of a multi cluster Kubernetes deployment is depicted in the diagram below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dew9xr7rtlciyhxifna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dew9xr7rtlciyhxifna.png" alt="multi cluster Kubernetes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is pertinent to our discussion is the traffic route between these clusters. From a user perspective, application requests are routed to their US cluster or EU cluster by the global load balancer.&lt;/p&gt;

&lt;p&gt;What happens when a pod from Node 1 in US-Cluster needs to communicate to Node 1 on the EU-Cluster? This could present a challenge due to the distance and complexity between the two clusters. At a minimum, data must travel through at least two network boundaries, exiting the US cluster, crossing the Atlantic, and entering the EU cluster.&lt;/p&gt;

&lt;p&gt;This network traffic flow between a local network and external networks is referred to as North South traffic. "North" refers to outbound traffic from the local network to external networks. "South" refers to inbound traffic from external networks to the local network.&lt;/p&gt;

&lt;p&gt;From a data plane perspective, North/South traffic between clusters introduces higher latency and reduced reliability compared to traffic within a local cluster. Whereas traffic within a cluster is synchronous (fast, reliable), traffic between remote clusters is asynchronous (slower, less reliable).&lt;/p&gt;

&lt;p&gt;Those wishing to learn more about synchronous and asynchronous traffic may want to review this &lt;a href="https://www.holloway.com/g/remote-work/sections/synchronous-vs-asynchronous-communication" rel="noopener noreferrer"&gt;article&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does North/South traffic and synch/asynchronous communications relate back to Kubernetes Multi-cluster database resiliency?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A CSI storage connector in Kubernetes uses synchronous communication for traffic within a local cluster. This means that data is written simultaneously to the primary and replica data volumes.&lt;/p&gt;

&lt;p&gt;Data for remote clusters is written locally first. Then, a snapshot copy of the data is shipped to the second cluster on a regular schedule. This is done asynchronously.&lt;/p&gt;
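&lt;p&gt;The snapshot side of this workflow uses the standard CSI snapshot API. A minimal sketch follows (the snapshot class and PVC names are illustrative, and shipping the snapshot to the remote cluster is handled by vendor tooling outside this manifest):&lt;/p&gt;

```yaml
# Hypothetical example: point-in-time snapshot of a database PVC,
# taken on a schedule and then replicated asynchronously off-cluster.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # illustrative snapshot class
  source:
    persistentVolumeClaimName: data-db-0   # illustrative PVC name
```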

&lt;p&gt;This difference in approach also has implications for data resiliency and how data is recovered after a failure.&lt;/p&gt;

&lt;p&gt;Kubernetes recovery in a disaster recovery scenario within a local cluster is instantaneous. This is due to data being simultaneously copied to the local replica. As a result, data availability is also instantaneous.&lt;/p&gt;

&lt;p&gt;In technical terms, there are two metrics that we track. Recovery Time Objective (RTO) is the time it takes to recover from a failure. Recovery Point Objective (RPO) is the maximum allowable data loss.&lt;/p&gt;

&lt;p&gt;Thus, DR within a local Kubernetes cluster can be said to have zero RPO / zero RTO.&lt;/p&gt;

&lt;p&gt;(For an insightful discussion about RPO and RTO considerations in Kubernetes, I highly recommend this post by &lt;a href="https://medium.com/@bijit211987/approaching-zero-rto-rpo-architectural-considerations-and-best-practices-3848e7252894" rel="noopener noreferrer"&gt;Bijit Ghosh&lt;/a&gt; and another one by &lt;a href="https://www.linkedin.com/pulse/data-protection-kubernetes-matt-leblanc/" rel="noopener noreferrer"&gt;Matt LeBlanc&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;In a remote cluster situation however, the picture is somewhat different. When assessing a multi-platform or multi-cloud application, a Cloud Architect or SRE should consider how asynchronous replication impacts the Recovery Time Objective (RTO). Shipping the replica copy, traveling North/South and restoring it on the active cluster at the recovery site all take time. This means that DR operations have a time lag.&lt;/p&gt;

&lt;p&gt;This time lag can be significant. Conservative estimates show that CSI solutions can reduce the RTO to 15 minutes. This means we can have an RPO of zero and an RTO of 15 minutes.&lt;/p&gt;

&lt;p&gt;Most DevOps teams and Cloud Architects focus on single cluster, single region and single platform deployments. They may not consider DR architectures and requirements. As a result, many early adopters may be satisfied with an RTO of 15 minutes.&lt;/p&gt;

&lt;p&gt;Traditional infrastructure architects and data owners may want to explore new solutions on the market. These solutions can help them reach a close-to-zero recovery time objective and disaster recovery across multiple Kubernetes platforms.&lt;/p&gt;

&lt;p&gt;Tools such as &lt;a href="https://kubeslice.io/documentation/open-source/1.0.0" rel="noopener noreferrer"&gt;KubeSlice &lt;/a&gt;can lower RTO thresholds by establishing a low latency data plane interconnect between multiple clusters or platforms. This helps to enable faster recovery for Kubernetes applications and data.&lt;/p&gt;

&lt;p&gt;KubeSlice converts North/South traffic into East-West network traffic via application-level virtualization across a secure/encrypted connection, thereby eliminating the need to traverse network boundaries. This creates a lower latency connection for the data plane, and enables &lt;strong&gt;synchronous&lt;/strong&gt; replication of data to the DR site.&lt;/p&gt;

&lt;p&gt;KubeSlice makes the remote cluster seem local to the primary cluster. This enables recovery that is close to having Zero Recovery Time Objective (RTO).&lt;/p&gt;

&lt;p&gt;Regardless of which recovery scheme is employed, application owners should carefully consider their application and data resiliency needs and plan accordingly.&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://dok.community/data-on-kubernetes-2022-report/" rel="noopener noreferrer"&gt;Data on Kubernetes&lt;/a&gt; report points out, a full one-third of organizations saw productivity increase twofold by deploying data on Kubernetes, with gains benefiting organizations at all levels of tech maturity.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Container Storage Interface (CSI) standard and projects like Portworx and Rancher Longhorn are a great start toward a software-defined storage approach for persistent Kubernetes apps.&lt;/p&gt;

&lt;p&gt;The article explores Multi-cluster database resiliency in Kubernetes, starting with the resiliency of single-cluster applications. It explains how Kubernetes handles volume replication in case of node failures, ensuring zero recovery point objective (RPO) and zero recovery time objective (RTO) within a single cluster. However, the complexity arises when dealing with multi-cluster environments spanning multiple platforms and cloud providers. The challenges of North/South traffic and asynchronous communication between remote clusters are discussed, and the implications for data resiliency and recovery in disaster scenarios are explained.&lt;/p&gt;

&lt;p&gt;The article introduces KubeSlice as a tool that can lower RTO thresholds by establishing a low latency data plane interconnect between multiple clusters or platforms. It converts North/South traffic into East-West network traffic, enabling faster recovery for Kubernetes applications and synchronous replication of data to the disaster recovery (DR) site. The importance of carefully considering application and data resiliency needs is emphasized, and the benefits of deploying data on Kubernetes are mentioned.&lt;/p&gt;

&lt;p&gt;Overall, the article focuses on the challenges and solutions related to data resiliency in Kubernetes, particularly in multi-cluster deployments, and highlights the role of tools like KubeSlice in achieving faster recovery times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credits:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/michael-courcy/" rel="noopener noreferrer"&gt;Michael Courcy&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/technicalwritertau/" rel="noopener noreferrer"&gt;Tau M.&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/brendanpascale/" rel="noopener noreferrer"&gt;Brendan Pascale&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/michaellevan/" rel="noopener noreferrer"&gt;Michael Levan&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>deployment</category>
    </item>
    <item>
      <title>7 Common mistakes in a Multi-Cloud Journey</title>
      <dc:creator>Ray Edwards</dc:creator>
      <pubDate>Fri, 17 Nov 2023 16:00:00 +0000</pubDate>
      <link>https://dev.to/rayedwards/7-common-mistakes-in-a-multi-cloud-journey-lo9</link>
      <guid>https://dev.to/rayedwards/7-common-mistakes-in-a-multi-cloud-journey-lo9</guid>
      <description>&lt;p&gt;Your executive leadership launched a ‘Cloud First’ strategy with much fanfare a few months ago. At first it was simple and seemed inexpensive to move workloads to your primary cloud service provider. But now, you feel the cloud provider is taking you for granted - you’ve lost all price negotiation power, and SLAs and response levels are not what it used to be. To make matters worse, your new cloud service bill seems to be out of control.&lt;/p&gt;

&lt;p&gt;You are confronted with what is known as the “Cloud Paradox”: the very benefits that made the move to the cloud a compelling strategy are now causing costs to spiral, stifling innovation and making it near impossible to repatriate workloads from the cloud.&lt;/p&gt;

&lt;p&gt;Most large enterprises have started cloud adoption and now find themselves in the trough of the cloud paradox, where their cloud bills are starting to bite. They face margin pressure from their cloud infrastructure, and trying to optimize cloud costs in turn reduces their ability to innovate freely.&lt;/p&gt;

&lt;p&gt;The folks at &lt;a href="https://www.azul.com/"&gt;Azul Systems&lt;/a&gt; have published a very interesting Cloud Adoption Journey chart that highlights the technology maturity curve as customers adopt cloud platforms. Similar in concept to the Gartner Hype cycle, technology leaders have to be mindful of the various challenges and pitfalls of cloud adoption before they can truly realize the value optimization of cloud computing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1NkuKzwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zenb3t6hjivgir6t6euz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1NkuKzwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zenb3t6hjivgir6t6euz.png" alt="Image description" width="500" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find yourself in the trough of this journey, thinking of ways to maintain control of your cloud strategy and break out of the cloud paradox, you are not alone.&lt;/p&gt;

&lt;p&gt;Here, I will share several initiatives that can help IT leaders avoid getting stuck in the Cloud Paradox and move further along the curve. Only then can enterprises realize the true benefits of cloud adoption while optimizing their cloud spend.&lt;/p&gt;

&lt;p&gt;Before we delve into that however, it is useful to understand why a Multi-Cloud deployment may be a worthwhile strategy in the first place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard against Vendor Lock-In&lt;/strong&gt;&lt;br&gt;
The number one reason business leaders consider a multi-vendor, multi-cloud adoption is to avoid putting all their eggs in one basket. Amazon, Azure, GCP and others continue to enhance their cloud services, and any enterprise that wants to stay at the forefront of technology improvements needs the flexibility to utilize the platform that best suits its needs. Not to mention the obvious loss of negotiation power when one is beholden to a single cloud provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance and Data sovereignty&lt;/strong&gt;&lt;br&gt;
Companies doing business globally may be subject to rules and regulations that require customer data to be located in a particular region or country. The flexibility to choose a vendor that enables such data localization is another essential reason for a multi-cloud approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload specialization and costs&lt;/strong&gt;&lt;br&gt;
As mentioned earlier, cloud providers continue to offer specializations for workloads and features that may address the specific needs of a particular application. For example, AWS may offer advances in machine learning, but your IoT application may be best suited to GCP's enhancements. A multi-cloud approach ensures you can optimize each application based on need and cost as appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forestall Data Gravity challenges&lt;/strong&gt;&lt;br&gt;
The idea of data gravity was first coined by &lt;a href="https://www.linkedin.com/in/davemccrory/"&gt;Dave McCrory&lt;/a&gt; in 2010. His view was that once data is stored anywhere, it starts to “grow” in size and quantity. That mass attracts services and applications, because the closer they are to the data, the lower the latency and better the throughput. The resulting challenge is that as the data gravity increases, it becomes almost impossible to migrate that data to a different platform. What results is ‘artificial data gravity’ where data is centralized not because of any benefits or requirements, but rather because it is too difficult to do otherwise.&lt;/p&gt;

&lt;p&gt;By deploying a “multi-cloud first” approach, customers can retain data where it is created (e.g. at the edge) and optimize the applications to utilize real-time processing and secure multi-cluster networking instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize Geographical Performance&lt;/strong&gt;&lt;br&gt;
Since the performance and latencies differ between cloud providers in different regions, a multi-cloud application can utilize the provider with the best characteristics for a given region or geography.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Resiliency&lt;/strong&gt;&lt;br&gt;
While cloud outages are rare, they are extremely disruptive when they do occur. A recent Fortune technology &lt;a href="https://fortune.com/2023/06/07/cloud-outages-on-the-rise-tech-geopolitics-internet/"&gt;article on cloud outages&lt;/a&gt; estimates that outages cost businesses $365K per hour of downtime.&lt;/p&gt;

&lt;p&gt;A multi-cloud approach is an effective defense against such risks.&lt;/p&gt;

&lt;p&gt;Now that we’ve outlined some of the reasons WHY you should incorporate a multi-cloud approach to your devops projects, let’s discuss some of the common mistakes to avoid when establishing a multi-cloud strategy.&lt;/p&gt;

&lt;h1&gt;
  
  
  Mistake #1 - Relying on custom resources
&lt;/h1&gt;

&lt;p&gt;Ideally, your application should not be reliant on custom resources from your cloud vendor. Resources that you use (networking, compute, memory, storage) should be virtualized using industry standard interfaces - CNI, CSI etc. This enables you to move an application between platforms without having to unwind custom hooks. Adopt a true Infrastructure as Code (IaC) methodology.&lt;/p&gt;

&lt;h1&gt;
  
  
  Mistake #2 - Overlooking inter-cloud network connectivity
&lt;/h1&gt;

&lt;p&gt;One of the often overlooked aspects of Multi-cloud deployments is the network connectivity between applications running on different cloud platforms. Using a global load balancer to route traffic to individual application ‘islands’ is a suboptimal solution at best. Data and Metrics that have to travel between these isolated clusters have to navigate network boundaries via North-South traffic, and can only do so asynchronously.&lt;br&gt;&lt;br&gt;
Users should instead use tools such as KubeSlice to create a software-defined network tunnel between the clusters / clouds that ensures security and lowers network latency. Thus, data and metrics can connect synchronously, via an East-West pathway between the clusters. In addition, the connection can be isolated at the individual namespace level - complying with a zero-trust application deployment model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Mistake #3 - Simply using Autoscaling to control costs vs. Intelligent Autoscaling and Remediation
&lt;/h1&gt;

&lt;p&gt;Intelligent autoscaling is a type of autoscaling that uses machine learning to predict future resource demand and scale accordingly. This can help to improve the performance and scalability of cloud-based applications, while also reducing costs.&lt;/p&gt;

&lt;p&gt;Traditional autoscaling approaches are based on simple rules, such as scaling up when CPU utilization reaches a certain threshold. However, these rules are reactive in nature, and can be too simplistic, leading to over- or under-provisioning of resources. Intelligent autoscaling, on the other hand, uses machine learning to learn the patterns of resource usage and predict future demand. This allows it to scale more effectively and efficiently.&lt;/p&gt;
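&lt;p&gt;As a minimal sketch of the difference (plain Python with made-up numbers and function names - not any vendor's actual API), compare a reactive threshold rule with a naive trend-based forecast:&lt;/p&gt;

```python
# Toy sketch: reactive threshold scaling vs. a simple predictive rule.
# All names and numbers here are illustrative assumptions.

def reactive_replicas(current_replicas, cpu_percent, threshold=80):
    """Scale up one replica only after utilization crosses the threshold."""
    if cpu_percent > threshold:
        return current_replicas + 1
    return current_replicas

def predictive_replicas(history, per_replica_capacity=100):
    """Forecast next-step demand from the recent trend, then size the
    fleet ahead of time instead of after the spike hits."""
    trend = history[-1] - history[-2]           # naive linear trend
    forecast = history[-1] + trend
    # Ceiling division: enough replicas to cover the forecast demand.
    return max(1, -(-forecast // per_replica_capacity))

demand = [100, 150, 200, 250]                   # requests needing capacity
print(reactive_replicas(2, cpu_percent=95))     # reacts only once hot: 3
print(predictive_replicas(demand))              # sizes for forecast 300: 3
```

The reactive rule only adds capacity after the system is already hot; the predictive rule provisions for where demand is heading. Real intelligent autoscalers replace the naive trend line with learned models, but the shape of the decision is the same.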

&lt;p&gt;There are a number of different ways to implement intelligent autoscaling. Some popular approaches include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Using machine learning models to predict resource demand&lt;/strong&gt;. This can be done by analyzing historical data, such as CPU utilization, memory usage, and traffic patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using reinforcement learning algorithms to learn the optimal scaling policies&lt;/strong&gt;. This can be done by trial and error, or by using a simulation environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using a combination of machine learning and reinforcement learning&lt;/strong&gt;. This can be a more effective approach, as it can combine the strengths of both approaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Intelligent autoscaling is a relatively new technology, but it is becoming increasingly popular. As cloud-based applications become more complex, the need for intelligent autoscaling will only grow.&lt;/p&gt;

&lt;p&gt;Using Intelligent Autoscaling tools such as Avesha SmartScaler, enterprises can ensure their multi-cloud applications also benefit from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved performance&lt;/strong&gt;: Intelligent autoscaling can help to improve the performance of cloud-based applications by ensuring that they have the right amount of resources available. This can help to reduce latency and improve the user experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced costs&lt;/strong&gt;: Intelligent autoscaling can help to reduce costs by preventing over-provisioning of resources. This can save businesses money on their cloud bills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved scalability&lt;/strong&gt;: Intelligent autoscaling can help to improve the scalability of cloud-based applications by allowing them to adapt to changes in demand. This can help businesses to avoid downtime and ensure that their applications are always available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instantaneous action&lt;/strong&gt;: Unlike some of the common autoscaling tools, SmartScaler not only provides autoscale recommendations, but can actually make those changes in real time, instead of waiting for a manual operation. This not only saves time, but can also ensure specific application SLOs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Mistake #4 - Cluster monitoring vs. fleet monitoring
&lt;/h1&gt;

&lt;p&gt;When we’re only monitoring a single cluster, it is relatively easy to install Prometheus and Grafana, two open-source tools that are often used together for monitoring and observability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; is a monitoring system that collects metrics from a variety of sources, such as applications, services, and infrastructure. It stores these metrics in a time series database, which can be queried to generate alerts and visualizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; is a visualization tool that can be used to create dashboards and charts from Prometheus data. It provides a wide range of features for visualizing metrics, including graphs, tables, and maps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The situation becomes infinitely more complex when we want to monitor a fleet of clusters on a single dashboard. Installing a Prometheus server and Grafana dashboard for every cluster quickly becomes unwieldy. On the other hand, exporting metrics from different clusters, across multiple cloud platforms, to a centralized control center tends to create IP address conflicts and challenges in metrics latency.&lt;/p&gt;

&lt;p&gt;Here again, a tool such as KubeSlice can help create a virtual, synchronous data plane for all Prometheus metrics from each cluster to be transmitted to a central control dashboard via a low latency synchronous connection.&lt;/p&gt;
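&lt;p&gt;Whatever transport carries the metrics, the central dashboard must be able to tell identical series from different clusters apart. A toy sketch (hypothetical names, not the Prometheus or KubeSlice APIs) of why a distinguishing cluster label matters:&lt;/p&gt;

```python
# Toy sketch: merging Prometheus-style series from several clusters into
# one central store. Without a distinguishing "cluster" label, identical
# series names from different clusters would overwrite each other.

def relabel(cluster_name, series):
    """Attach the source cluster as a label on every series."""
    return {(name, ("cluster", cluster_name)): value
            for name, value in series.items()}

central = {}
clusters = {
    "aws-east": {"cpu_usage": 0.72, "mem_usage": 0.55},
    "gcp-west": {"cpu_usage": 0.41, "mem_usage": 0.63},
}
for name, series in clusters.items():
    central.update(relabel(name, series))

# Both clusters' cpu_usage survive side by side:
print(len(central))  # 4 distinct series, not 2
```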

&lt;h1&gt;
  
  
  Mistake #5 - Not planning for MultiCloud Resiliency and Backup
&lt;/h1&gt;

&lt;p&gt;A full discussion of resiliency and backup in a Multi-Cloud setup is beyond the scope of this blog, but I’ve covered it in a previous blog post which can be found &lt;a href="https://avesha.io/resources/blogs/kubernetes-resiliency-rto-rpo-in-multi-cluster-deployments"&gt;here&lt;/a&gt;. There is also an in-depth webinar by the &lt;a href="https://wemakedevs.org/?/events/webinars/kuberentes-disaster-recovery"&gt;WeMakeDevs&lt;/a&gt; folks on Zero RTO DR for Kubernetes for those who are interested.&lt;/p&gt;

&lt;h1&gt;
  
  
  Mistake #6 - Shortchanging Multi-Cloud security
&lt;/h1&gt;

&lt;p&gt;Cloud security causes trepidation in most application owners’ minds, and multi-cloud security even more so. With a few best practices, however, these risks can be ameliorated. Since each cloud platform may have its own security features and requirements, adopting a cloud-agnostic security stance frees you from dependency on any one tool or security model.&lt;/p&gt;

&lt;p&gt;Some of the key best practices to ensure multi-cloud security include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementing consistent security policies across all cloud platforms&lt;/strong&gt;. This will help to ensure that all of your data and applications are protected to the same level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using a centralized cloud security management platform&lt;/strong&gt;. This can help you to simplify the management of your security policies and configurations across multiple cloud platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring your cloud environments for security threats&lt;/strong&gt;. This will help you to identify and respond to threats quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encrypt your data&lt;/strong&gt; (obviously!)

&lt;ul&gt;
&lt;li&gt;Store encryption keys separately (in a central repository) from cloud workloads&lt;/li&gt;
&lt;li&gt;Rotate the keys regularly&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use IAM Federation for SSO&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/singlesignon/index.html"&gt;AWS IAM Identity Center&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/hybrid/connect/how-to-connect-sso"&gt;Azure Active Directory single sign-on&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation"&gt;Google Cloud Workload identity federation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Educating your employees about cloud security best practices&lt;/strong&gt;. This will help to reduce the risk of human error.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Mistake #7 - Not having an escape plan
&lt;/h1&gt;

&lt;p&gt;It is surprising to see so many business leaders embark on a public cloud journey with no exit plan. Even if today’s financial analysis makes it a no-brainer to move applications to a cloud vendor - those rosy numbers may change over time, or performance / service may degrade. It is important to have a backup plan on how to repatriate applications and services.&lt;/p&gt;

&lt;p&gt;To quote the famous line by Robert De Niro in the movie Ronin: “I never walk into a place I don't know how to walk out of.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RYPsYC1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gwg6miku9ynfjycf1spb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RYPsYC1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gwg6miku9ynfjycf1spb.png" alt="Image description" width="750" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/david-heinemeier-hansson-374b18221/"&gt;David Heinemeier Hansson&lt;/a&gt; has a great blog on &lt;a href="https://world.hey.com/dhh/we-have-left-the-cloud-251760fb"&gt;Cloud Repatriation&lt;/a&gt; and what was the analysis that led him down that path.&lt;/p&gt;

&lt;p&gt;Another extreme case of cloud repatriation is that of Dropbox, which saved over $75 million over two years by moving much of its workload from the public cloud to colo facilities with custom infrastructure. Sarah Wang and Martin Casado of Andreessen Horowitz used the Dropbox example in their groundbreaking article &lt;a href="https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization/"&gt;The Cost of Cloud, a Trillion Dollar Paradox&lt;/a&gt;, a worthy read.&lt;/p&gt;

&lt;p&gt;Whether we like it or not, the public cloud has become an integral part of our technological landscape and is poised for further growth. Forward-thinking leaders should acknowledge this reality and instead of relying solely on a single cloud provider, they should adopt a multi-cloud strategy to mitigate potential future uncertainties.&lt;/p&gt;

&lt;p&gt;Some of the key benefits of a multi-cloud strategy include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased flexibility and scalability: By using multiple cloud providers, organizations can choose the best services for their specific needs. This can help to improve flexibility and scalability, as well as reduce costs.&lt;/li&gt;
&lt;li&gt;Improved resilience: By spreading their workloads across multiple cloud providers, organizations can improve their resilience to outages and other disruptions.&lt;/li&gt;
&lt;li&gt;Increased security: By using multiple cloud providers, organizations can reduce their reliance on any single provider. This can help to improve security by reducing the risk of a single point of failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the right approach, multi-cloud can be a safe, secure, and economical path. However, it is important to have a clear understanding of the risks and challenges involved. Tools such as KubeSlice can help to mitigate these risks and make multi-cloud a more manageable reality.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>monitoring</category>
      <category>security</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Kafka Multi-Cluster Deployment on Kubernetes - simplified !</title>
      <dc:creator>Ray Edwards</dc:creator>
      <pubDate>Thu, 19 Oct 2023 20:27:53 +0000</pubDate>
      <link>https://dev.to/rayedwards/kafka-multi-cluster-deployment-on-kubernetes-simplified--2f27</link>
      <guid>https://dev.to/rayedwards/kafka-multi-cluster-deployment-on-kubernetes-simplified--2f27</guid>
      <description>&lt;h1&gt;
  
  
  What is Kafka
&lt;/h1&gt;

&lt;p&gt;Commonly known simply as Kafka, Apache Kafka is an open-source event streaming platform maintained by the Apache Software Foundation. Initially conceived at LinkedIn, Apache Kafka was collaboratively created by &lt;a href="https://www.linkedin.com/in/jaykreps/"&gt;Jay Kreps&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/nehanarkhede/"&gt;Neha Narkhede&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/junrao/"&gt;Jun Rao&lt;/a&gt;, and subsequently released as an open-source project in 2011. &lt;a href="https://en.wikipedia.org/wiki/Apache_Kafka"&gt;Wiki Page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, Kafka is one of the most popular event streaming platforms designed to handle real-time data feeds. It is widely used to build scalable, fault-tolerant, and high-performance streaming data pipelines.&lt;/p&gt;

&lt;p&gt;Kafka's uses are continually expanding, with the top 5 use cases nicely illustrated by Brij Pandey in the accompanying image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oWaksHyx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6cgc1zom4wpk1hgvck1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oWaksHyx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6cgc1zom4wpk1hgvck1.gif" alt="Image description" width="794" height="1046"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a brief primer, it is important to understand the components of the Kafka platform and how they work.&lt;/p&gt;

&lt;p&gt;Kafka works as a distributed event streaming platform, designed to handle real-time data feeds efficiently. It operates based on the publish-subscribe messaging model and follows a distributed and fault-tolerant architecture. It maintains a persistent, ordered, and partitioned sequence of records called "topics." Producers write data to these topics, and consumers read from them. This enables decoupling between data producers and consumers and allows multiple applications to consume the same data stream independently.&lt;/p&gt;

&lt;p&gt;Key components of Kafka include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Topics and Partitions:&lt;/strong&gt; Kafka organizes data into topics. Each topic is a stream of records, and the data within a topic is split into multiple partitions. Each partition is an ordered, &lt;u&gt;immutable&lt;/u&gt; sequence of records. Partitions enable horizontal scalability and parallelism by allowing data to be distributed across multiple Kafka brokers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Producers:&lt;/strong&gt; Producers are applications that write data to Kafka topics. They publish records to specific topics, which are then stored in the topic's partitions. Producers can send records to a particular partition explicitly or allow Kafka to determine the partition using a partitioning strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumers:&lt;/strong&gt; Consumers are applications that read data from Kafka topics. They subscribe to one or more topics and consume records from the partitions they are assigned to. Consumer groups are used to scale consumption, and each partition within a topic can be consumed by only one consumer within a group. This allows multiple consumers to work in parallel to process the data from different partitions of the same topic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Brokers:&lt;/strong&gt; Kafka runs as a cluster of servers, and each server is called a broker. Brokers are responsible for handling read and write requests from producers and consumers, as well as managing the topic partitions. A Kafka cluster can have multiple brokers to distribute the load and ensure fault tolerance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partitions/Replication:&lt;/strong&gt; To achieve fault tolerance and data durability, Kafka allows configuring replication for topic partitions. Each partition can have multiple replicas, with one replica designated as the leader and the others as followers. The leader replica handles all read and write requests for that partition, while followers replicate the data from the leader to stay in sync. If a broker with a leader replica fails, one of the followers automatically becomes the new leader to ensure continuous operation. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offset Management:&lt;/strong&gt; Kafka maintains the concept of offsets for each partition. An offset represents a unique identifier for a record within a partition. Consumers keep track of their current offset, allowing them to resume consumption from where they left off in case of failure or reprocessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ZooKeeper:&lt;/strong&gt; While not part of Kafka itself, ZooKeeper is often used to manage the metadata and coordinate the brokers in a Kafka cluster. It helps with leader election, topic and partition information, and managing consumer group coordination. [Note: &lt;em&gt;ZooKeeper, the metadata management tool, will soon be phased out in favor of &lt;a href="https://developer.confluent.io/learn/kraft/"&gt;Kafka Raft&lt;/a&gt;, or KRaft, a protocol for internally managed metadata&lt;/em&gt;]&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
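&lt;p&gt;The interplay of partitions, keys, and consumer offsets can be illustrated with a toy model (plain Python, not the real Kafka client API):&lt;/p&gt;

```python
# Toy model of Kafka concepts: key-hashed partitioning, append-only
# partitions, and per-consumer-group offsets. Illustrative only.
from collections import defaultdict

NUM_PARTITIONS = 3

topic = defaultdict(list)        # partition id -> ordered record list
offsets = defaultdict(int)       # (group, partition) -> next offset to read

def produce(key, value):
    """Records with the same key land in the same partition, so their
    relative order is preserved."""
    p = hash(key) % NUM_PARTITIONS
    topic[p].append(value)
    return p

def consume(group, partition):
    """Each consumer group tracks its own offset per partition and can
    resume from where it left off after a failure."""
    off = offsets[(group, partition)]
    batch = topic[partition][off:]
    offsets[(group, partition)] = off + len(batch)
    return batch

p = produce("user-42", "login")
produce("user-42", "click")               # same key -> same partition, ordered
assert consume("analytics", p) == ["login", "click"]
assert consume("analytics", p) == []      # offset already advanced
```

Note how a second group (say `"audit"`) reading the same partition would start from offset 0 and see the full history: the decoupling between producers and independent consumer groups described above.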

&lt;p&gt;Overall, Kafka's design and architecture make it a highly scalable, fault-tolerant, and efficient platform for handling large volumes of real-time data streams. It has become a central component in many data-driven applications and data infrastructure, facilitating data integration, event processing, and stream analytics.&lt;/p&gt;

&lt;p&gt;A typical Kafka architecture would then be as follows : &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKywtwm3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ig3j4oa9wu3hho276xwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKywtwm3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ig3j4oa9wu3hho276xwd.png" alt="Image description" width="800" height="622"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Kafka clustering refers to the practice of running multiple Kafka brokers together as a group to form a Kafka cluster. Clustering is a fundamental aspect of Kafka's architecture, providing several benefits, including scalability, fault tolerance, and high availability. A Kafka cluster is used to handle large-scale data streams and ensure that the system remains operational even in the face of failures.  &lt;/p&gt;

&lt;p&gt;In the cluster, Kafka topics are divided into multiple partitions to achieve scalability and parallelism. Each partition is a linearly ordered, immutable sequence of records. Partitions therefore allow data to be distributed across multiple brokers in the cluster.&lt;/p&gt;

&lt;p&gt;It should be noted that a minimum Kafka cluster consists of 3 brokers, each of which can be run on a separate server (virtual or physical). The 3-node guidance helps avoid a split-brain scenario in case of a broker failure. (A nice article by &lt;a href="https://medium.com/nerd-for-tech/split-brain-in-distributed-systems-252b0d4d122e"&gt;Dhinesh Sunder Ganapathi&lt;/a&gt; goes into more detail.)&lt;/p&gt;
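&lt;p&gt;The arithmetic behind that guidance is simple majority-quorum math: a quorum needs floor(n/2) + 1 members, so a 2-node cluster cannot survive any failure or network split, while a 3-node cluster can lose one broker and still elect a leader. A quick sketch:&lt;/p&gt;

```python
# Toy sketch of majority-quorum sizing, the reason 3 brokers is the
# minimum sensible cluster size.

def quorum(n):
    """Members needed for a majority in a cluster of size n."""
    return n // 2 + 1

def has_majority(alive, total):
    return alive >= quorum(total)

print(has_majority(1, 2))   # False: a 2-node cluster splits the brain
print(has_majority(2, 3))   # True: a 3-node cluster survives one failure
```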

&lt;h1&gt;
  
  
  Kafka and Kubernetes
&lt;/h1&gt;

&lt;p&gt;As more companies adopt Kafka, there is also an increasing interest in deploying Kafka on Kubernetes.&lt;/p&gt;

&lt;p&gt;In fact, the most recent &lt;a href="https://www.dynatrace.com/news/blog/kubernetes-in-the-wild-2023/"&gt;Kubernetes in the Wild report 2023&lt;/a&gt; by Dynatrace shows that over 40% of large organizations run their open source messaging platform within Kubernetes - the majority of this being Kafka.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dynatrace.com/news/blog/kubernetes-in-the-wild-2023/#&amp;amp;gid=0&amp;amp;pid=1"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zHyOU_sb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yg3gddg26f8ysa0nk21c.png" alt="Image description" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same report also makes a bold claim: that “Kubernetes is emerging as the ‘operating system’ of the cloud.”&lt;/p&gt;

&lt;p&gt;It is imperative then, for Kafka administrators to understand the interplay between Kafka and Kubernetes, and how to implement these appropriately for scale.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Case for Multi Cluster Kafka
&lt;/h1&gt;

&lt;p&gt;Running a Kafka cluster in a single Kubernetes cluster setup is fairly straightforward and enables scalability as needed in theory. In production however, the picture can get a bit murky.&lt;/p&gt;

&lt;p&gt;We should distinguish the use of the term &lt;em&gt;cluster&lt;/em&gt; between Kafka and Kubernetes. A Kubernetes deployment also uses the term &lt;em&gt;cluster&lt;/em&gt; to designate a grouping of connected nodes, referred to as a Kubernetes cluster. When the Kafka workload is deployed on Kubernetes, you will end up with a Kafka cluster running inside a Kubernetes cluster; but more relevant to our discussion, you may also have a Kafka cluster that spans multiple Kubernetes clusters - for resiliency, performance, data sovereignty, etc. &lt;/p&gt;

&lt;p&gt;To begin with, Kafka is not designed for multi-tenant setups. In technical terms, Kafka does not understand concepts such as Kubernetes namespaces or resource isolation. Within a particular topic, there is no easy mechanism to enforce security access restrictions between multiple user groups.&lt;/p&gt;

&lt;p&gt;Additionally, different workloads may have different update frequency and scale requirements, e.g. a batch application vs. a real-time application. Combining the two workloads into a single cluster could cause adverse impacts or consume far more resources than necessary.&lt;/p&gt;

&lt;p&gt;Data sovereignty and regulatory compliance can also impose restrictions on co-locating data and topics in a specific region or application.&lt;/p&gt;

&lt;p&gt;Resiliency of course is another strong driving force behind the need for multiple Kafka clusters. While Kafka clusters are designed for fault tolerance of topics, we still have to plan for a catastrophic failure of an entire cluster. In such cases, the need for a fully replicated cluster enables proper business continuity planning.&lt;/p&gt;

&lt;p&gt;For businesses that are migrating workloads to the cloud or have a hybrid cloud strategy, you may want to set up multiple Kafka clusters and perform a planned workload migration over time rather than a risky full-scale Kafka migration.&lt;/p&gt;

&lt;p&gt;These are just a few of the reasons why in practice, enterprises find themselves having to create multiple Kafka clusters that nevertheless need to interact with each other.&lt;/p&gt;

&lt;h1&gt;
  
  
  Multi Cluster Kafka
&lt;/h1&gt;

&lt;p&gt;In order to have multiple Kafka clusters that are connected to each other, key items from one cluster must be replicated to the other cluster(s). These include the topics, offsets and metadata. In Kafka terms, this duplication is known as mirroring.&lt;/p&gt;

&lt;p&gt;There are two possible approaches to multi-cluster setups: stretched clusters or connected clusters. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YQuQaZ0K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zbscn8duc9e8uc5loh8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YQuQaZ0K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zbscn8duc9e8uc5loh8w.png" alt="Image description" width="662" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stretched clusters - Synchronous replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A stretched cluster is a logical cluster that is ‘stretched’ across several physical clusters. Topics and replicas are distributed across the physical clusters, but since they are represented as a logical cluster, the applications themselves are not aware of this multiplicity. &lt;/p&gt;

&lt;p&gt;Stretched clusters have strong consistency and are easier to manage and administer. Since applications are unaware of the existence of multiple clusters, they are easier to deploy on stretched clusters, compared to connected clusters.&lt;/p&gt;

&lt;p&gt;The downside of stretched clusters is that they require a synchronous connection between the clusters. They are not ideal for a hybrid cloud deployment, and they require a quorum of at least 3 clusters to avoid a ‘split-brain’ scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connected Clusters - Asynchronous replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Connected Cluster on the other hand, is deployed by connecting multiple independent clusters. These independent clusters could be running in different regions or cloud platforms and are managed individually. &lt;/p&gt;

&lt;p&gt;The primary benefit of the connected cluster model is that there is no downtime in cases of a cluster failure, since the other clusters are running independently. Each cluster can also be optimized for its particular resources.&lt;/p&gt;

&lt;p&gt;The major downside of connected clusters is that they rely on an asynchronous connection between the clusters. Topics that are replicated between the clusters are not ‘copy on write’ but rather depend on &lt;em&gt;eventual consistency&lt;/em&gt;. This can lead to possible data loss during the async mirroring process.&lt;/p&gt;

&lt;p&gt;Additionally, applications that work across connected clusters have to be modified to be aware of the multiple clusters.&lt;/p&gt;

&lt;p&gt;Before we address the solution to this conundrum, I’ll briefly cover the common tools on the market to enable Kafka cluster connectivity.&lt;/p&gt;

&lt;p&gt;Open Source Kafka itself ships with a mirroring tool called Mirror Maker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xMBZI27g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wtl3c074sm9tjhlcxmt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xMBZI27g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wtl3c074sm9tjhlcxmt5.png" alt="Image description" width="633" height="343"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.altoros.com/blog/multi-cluster-deployment-options-for-apache-kafka-pros-and-cons/"&gt;https://www.altoros.com/blog/multi-cluster-deployment-options-for-apache-kafka-pros-and-cons/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mirror Maker duplicates topics between clusters via a built-in consumer/producer pair: it consumes from the source cluster and re-produces to the target. Data is thus cross-replicated between clusters with eventual consistency, without interrupting either cluster’s own processing.&lt;/p&gt;

&lt;p&gt;It is important to note that while Mirror Maker is simple in concept, setting it up at scale can be quite a challenge for IT organizations. IP addresses, naming conventions, replica counts and so on must all be configured correctly, or replication can fall into what is known as ‘infinite replication’, where a topic is replicated back and forth endlessly, leading to an eventual crash.&lt;/p&gt;

&lt;p&gt;Other downsides of Mirror Maker include the lack of dynamic configuration of allowed/disallowed topic lists. Mirror Maker also does not sync topic properties properly, which makes it an operational headache at scale when adding or removing replicated topics. Mirror Maker 2 addresses some of these challenges, but many IT shops still struggle to set it up correctly.&lt;/p&gt;
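&lt;p&gt;For reference, a minimal Mirror Maker 2 configuration (run in dedicated mode with &lt;code&gt;connect-mirror-maker.sh mm2.properties&lt;/code&gt;) looks roughly like the fragment below; the cluster aliases and bootstrap addresses are placeholders:&lt;/p&gt;

```properties
# mm2.properties -- minimal active/passive replication sketch
clusters = primary, backup
primary.bootstrap.servers = primary-broker:9092
backup.bootstrap.servers = backup-broker:9092

# Replicate all topics from primary to backup
primary->backup.enabled = true
primary->backup.topics = .*

# Sync topic configs and pick up new topics periodically --
# the gaps called out above for the original Mirror Maker
sync.topic.configs.enabled = true
refresh.topics.interval.seconds = 60

replication.factor = 3
```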

&lt;p&gt;Other Open Source tools for Kafka replication include Mirus from Salesforce, uReplicator from Uber and customised Flink from Netflix. &lt;/p&gt;

&lt;p&gt;For commercially licensed options, Confluent offers two: Confluent Replicator and Cluster Linking. Confluent Replicator is essentially a Kafka Connect connector that provides a high-performance, resilient way to copy topic data between clusters. Cluster Linking, developed internally at Confluent, is targeted at multi-region replication while preserving topic offsets.&lt;/p&gt;

&lt;p&gt;Even so, Cluster Linking is an asynchronous replication tool with data having to cross network boundaries and traverse public traffic pathways.&lt;/p&gt;

&lt;p&gt;As should be clear by now, Kafka replication is a crucial strategy for production applications at scale; the question is which option to choose.&lt;/p&gt;

&lt;p&gt;Imaginative Kafka administrators will quickly realize that they may need connected clusters, stretched clusters, or a combination of both, depending on application performance and resiliency requirements.&lt;/p&gt;

&lt;p&gt;What is daunting however, is the exponential challenges of setting up the cluster configurations and managing these at scale across multiple clusters. Is there a more elegant way to solve this nightmare? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer is Yes!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avesha.io/products/product-slice"&gt;KubeSlice&lt;/a&gt; by Avesha is an exquisitely simple way to get the best of both worlds. By creating a direct &lt;em&gt;Service Connectivity&lt;/em&gt; between clusters or namespaces, KubeSlice obviates the need for manually configuring individual connectivity between Kafka clusters.&lt;/p&gt;

&lt;p&gt;At its core, KubeSlice creates a secure, synchronous Layer 3 network gateway between clusters, isolated at the application or namespace level. Once this is set up, Kafka administrators are free to deploy Kafka brokers in any of the clusters.&lt;/p&gt;

&lt;p&gt;Each broker has synchronous connectivity to every other broker joined via the slice, even though the brokers themselves may be on separate clusters. This effectively creates a stretched cluster between the brokers and provides the benefits of strong consistency and low administration overhead.&lt;/p&gt;
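&lt;p&gt;As a rough sketch, a slice spanning two clusters is declared with a &lt;code&gt;SliceConfig&lt;/code&gt; resource on the KubeSlice controller. The cluster names, subnet and namespaces below are placeholders; consult the KubeSlice documentation for the authoritative schema:&lt;/p&gt;

```yaml
# Hypothetical SliceConfig: one slice ("kafka-slice") spanning two
# registered worker clusters, with the "kafka" namespace onboarded
# on both so brokers get flat, slice-local connectivity.
apiVersion: controller.kubeslice.io/v1alpha1
kind: SliceConfig
metadata:
  name: kafka-slice
  namespace: kubeslice-demo        # KubeSlice project namespace (placeholder)
spec:
  sliceSubnet: 10.1.0.0/16         # overlay subnet for the slice
  sliceType: Application
  sliceGatewayProvider:
    sliceGatewayType: OpenVPN
    sliceCaType: Local
  sliceIpamType: Local
  clusters:
    - cluster-east                 # placeholder worker cluster names
    - cluster-west
  namespaceIsolationProfile:
    applicationNamespaces:
      - namespace: kafka           # Kafka brokers live here
        clusters:
          - '*'
```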

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XZSYiuOM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qvkkmvmv69hjz8npmp59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XZSYiuOM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qvkkmvmv69hjz8npmp59.png" alt="Image description" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have your cake and eat it too!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For those who want to deploy Mirror Maker into their clusters, this can be done with minimal effort since the connectivity between the clusters is delegated to KubeSlice. Kafka applications can thus enjoy the benefits of synchronous (speed, resiliency) AND asynchronous (independence, scale) replication in the same deployment, mixing and matching the capabilities as needed. This holds for on-prem data centers, public clouds, or any combination of these in a hybrid setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rrAfvUBA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cfmi6fhgoo88s2xd26xv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rrAfvUBA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cfmi6fhgoo88s2xd26xv.png" alt="Image description" width="743" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best part is that KubeSlice is a non-disruptive deployment, meaning that there is no need to uninstall any tool already deployed. It is simply a matter of establishing a &lt;em&gt;slice&lt;/em&gt; and adding the Kafka deployment onto that &lt;em&gt;slice&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This blog provided a brief overview of Apache Kafka and touched on some of the more common use cases. We covered the current tools available to scale Kafka deployments across multiple clusters and discussed the advantages and disadvantages of each. Finally, the article introduced KubeSlice, the emerging service connectivity solution that simplifies Kafka multi-cluster deployments and removes the headaches associated with configuring Kafka replication across multiple clusters at scale.&lt;/p&gt;

&lt;p&gt;A couple of links that readers may find useful:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-kafka-on-aws/"&gt;An older blog of best practices running Kafka on AWS&lt;/a&gt;(before KubeSlice was introduced)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avesha.io/resources/blogs/a-step-by-step-guide-to-acquiring-and-running-kube-slice-enterprise-with-a-trial-license"&gt;Guided setup of KubeSlice&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/stateful-workloads/kafka"&gt;Deploying Kafka on GKE&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying"&gt;What Every Engineer should know about distributed log&lt;/a&gt; - by Jay Kreps (essential reading!)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kafka</category>
      <category>deployment</category>
      <category>cluster</category>
    </item>
  </channel>
</rss>
