<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Dejanovski</title>
    <description>The latest articles on DEV Community by Alexander Dejanovski (@adejanovski).</description>
    <link>https://dev.to/adejanovski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F755751%2F70e0175a-8701-46c6-83cc-e82e47981fa2.jpeg</url>
      <title>DEV Community: Alexander Dejanovski</title>
      <link>https://dev.to/adejanovski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adejanovski"/>
    <language>en</language>
    <item>
      <title>Reaper 3.0 for Apache Cassandra is available</title>
      <dc:creator>Alexander Dejanovski</dc:creator>
      <pubDate>Tue, 15 Mar 2022 19:34:16 +0000</pubDate>
      <link>https://dev.to/datastax/reaper-30-for-apache-cassandra-is-available-302m</link>
      <guid>https://dev.to/datastax/reaper-30-for-apache-cassandra-is-available-302m</guid>
      <description>&lt;p&gt;The &lt;a href="https://dtsx.io/3qPzVGq" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt; team is pleased to announce the release of &lt;a href="http://cassandra-reaper.io/" rel="noopener noreferrer"&gt;Reaper 3.1&lt;/a&gt;. Let’s dive into the features and improvements that 3.0 recently introduced (along with some notable removals) and how the newest update to 3.1 builds on that.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;JDK11 support&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Starting with 3.1.0, Reaper can now compile and run with jdk11. Note that jdk8 is still supported at runtime.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Storage backends&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Over the years, we regularly discussed dropping support for Postgres and H2 within &lt;a href="https://dtsx.io/3t0jB8g" rel="noopener noreferrer"&gt;The Last Pickle&lt;/a&gt; (TLP), now part of &lt;a href="https://dtsx.io/3pUplPa" rel="noopener noreferrer"&gt;DataStax&lt;/a&gt;, the organization leading the open-source development of Reaper. Despite our limited Postgres expertise, maintaining these storage backends took only moderate effort as long as Reaper’s architecture stayed simple. However, complexity grew with each new deployment option, culminating in the addition of the sidecar mode.&lt;/p&gt;

&lt;p&gt;Some features require different consensus strategies depending on the backend, which sometimes led to implementations that worked well with one backend and were buggy with others.&lt;/p&gt;

&lt;p&gt;In order to allow building new features faster, while providing a consistent experience for all users, we decided to drop the Postgres and H2 backends in 3.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cassandra.apache.org/_/index.html" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt; and the managed &lt;a href="https://astra.dev/3i4sJlO" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; are now the only production storage backends for Reaper. The &lt;a href="https://astra.dev/3i4sJlO" rel="noopener noreferrer"&gt;free tier&lt;/a&gt; of Astra DB will be more than sufficient for most deployments.&lt;/p&gt;

&lt;p&gt;Reaper does not generally require high availability – even complete data loss has mild consequences. Where Astra is not an option, a single Cassandra server can be started on the instance that hosts Reaper, or an existing cluster can be used as a backend data store.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Adaptive Repairs and Schedules&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;One of the pain points we observed when people start using Reaper is understanding the segment orchestration and knowing how the default timeout impacts the execution of repairs.&lt;/p&gt;

&lt;p&gt;Repair is a complex choreography of operations in a distributed system. As such, and especially in the days when Reaper was created, the process could get blocked for several reasons and required a manual restart. The smart folks who designed Reaper at Spotify dealt with such blockages by putting a timeout on segments: any segment exceeding it is terminated and rescheduled.&lt;/p&gt;

&lt;p&gt;Problems arise when segments are too big (or have too much entropy) to process within the default 30-minute timeout, despite not being blocked. They are repeatedly terminated and recreated, and the repair appears to make no progress.&lt;/p&gt;

&lt;p&gt;Reaper did a poor job of dealing with this, for two main reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each retry used the same timeout, so oversized segments could fail forever&lt;/li&gt;
&lt;li&gt;Nothing obvious was reported to explain what was failing or how to fix it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We fixed the former by using a longer timeout on subsequent retries, which is a simple trick to make repairs more “adaptive”. If the segments are too big, they’ll eventually pass after a few retries. It’s a good first step to improve the experience, but it’s not enough for scheduled repairs as they could end up with the same repeated failures for each run.&lt;/p&gt;

&lt;p&gt;This is where we introduce adaptive schedules, which use feedback from past repair runs to adjust either the number of segments or the timeout for the next repair run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuonqzgx3f0jim1r72vyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuonqzgx3f0jim1r72vyg.png" alt="Image description" width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1: Example of how to use adaptive schedules in Reaper.&lt;/p&gt;

&lt;p&gt;Adaptive schedules will be updated at the end of each repair if the run metrics justify it. The schedule can get a different number of segments or a higher segment timeout depending on the latest run.&lt;/p&gt;

&lt;p&gt;The rules are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If more than 20% of segments were extended, the number of segments on the schedule is raised by 20%.&lt;/li&gt;
&lt;li&gt;If fewer than 20% of segments were extended (and at least one), the timeout is doubled.&lt;/li&gt;
&lt;li&gt;If no segment was extended and the longest segment took less than 5 minutes, the number of segments is reduced by 10%, with a minimum of 16 segments per node.&lt;/li&gt;
&lt;/ul&gt;
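
&lt;p&gt;As a rough sketch, these rules boil down to a small decision function. The following Python is purely illustrative (function and parameter names are hypothetical, not Reaper’s actual code):&lt;/p&gt;

```python
def adapt_schedule(extended_segments, segment_count, max_segment_minutes,
                   timeout_minutes, nodes=1, min_per_node=16):
    """Illustrative version of the adaptive-schedule rules described above."""
    extended_ratio = extended_segments / segment_count
    if extended_ratio > 0.2:
        # Many segments timed out: split the work into 20% more segments.
        segment_count = int(segment_count * 1.2)
    elif extended_segments >= 1:
        # Only a few segments timed out: double the timeout instead.
        timeout_minutes *= 2
    elif max_segment_minutes < 5:
        # Everything finished quickly: use 10% fewer segments,
        # but never fewer than 16 segments per node.
        segment_count = max(int(segment_count * 0.9), min_per_node * nodes)
    return segment_count, timeout_minutes
```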

&lt;p&gt;This feature is disabled by default and is configurable on a per-schedule basis. The timeout can now be set differently for each schedule, from the UI or the REST API, instead of having to change the Reaper config file and restart the process.&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Incremental Repair Triggers&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;As we celebrate the long awaited &lt;a href="https://dtsx.io/3JJSmVC" rel="noopener noreferrer"&gt;improvements in incremental repairs&lt;/a&gt; brought by Cassandra 4.0, it was time to embrace them with more appropriate triggers. One metric that incremental repair makes available is the percentage of repaired data per table. When running against too much unrepaired data, incremental repair can put a lot of pressure on a cluster due to the heavy anti-compaction process.&lt;/p&gt;

&lt;p&gt;The best practice is to run it on a regular basis so that the amount of unrepaired data is kept low. Since your throughput may vary from one table/keyspace to the other, it can be challenging to set the right interval for your incremental repair schedules.&lt;/p&gt;

&lt;p&gt;Reaper 3.0 introduced a new trigger for the incremental schedules, which is a threshold of unrepaired data. This allows creating schedules that will start a new run as soon as, for example, 10% of the data for at least one table from the keyspace is unrepaired.&lt;/p&gt;

&lt;p&gt;Those triggers are complementary to the interval in days, which could still be necessary for low traffic keyspaces that need to be repaired to secure tombstones.&lt;/p&gt;
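
&lt;p&gt;Put together, the decision to start a new incremental run can be sketched as follows (hypothetical names and a simplified model, not Reaper’s internals):&lt;/p&gt;

```python
def should_trigger_incremental(percent_repaired_by_table,
                               unrepaired_threshold=10,
                               days_since_last_run=0,
                               max_interval_days=None):
    """Trigger when any table has too much unrepaired data, or when the
    optional interval in days has elapsed (for low-traffic keyspaces)."""
    if any(100 - pct >= unrepaired_threshold
           for pct in percent_repaired_by_table.values()):
        return True
    return max_interval_days is not None and days_since_last_run >= max_interval_days
```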

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojkyknwrg59t8ba1ua0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojkyknwrg59t8ba1ua0w.png" alt="Image description" width="800" height="157"&gt;&lt;/a&gt;&lt;br&gt;
Figure 2: Setting interval for incremental repairs.&lt;/p&gt;

&lt;p&gt;These new features will allow you to securely optimize tombstone deletions by enabling the &lt;code&gt;only_purge_repaired_tombstones&lt;/code&gt; compaction subproperty in Cassandra, letting you reduce &lt;code&gt;gc_grace_seconds&lt;/code&gt; &lt;a href="https://dtsx.io/3eS0ftI" rel="noopener noreferrer"&gt;down to three hours&lt;/a&gt; without the concern that deleted data will reappear.&lt;/p&gt;
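
&lt;p&gt;For example, assuming a hypothetical table &lt;code&gt;my_keyspace.my_table&lt;/code&gt;, the subproperty and the reduced grace period could be set like this in cqlsh (keep your table’s existing compaction class; three hours is 10800 seconds):&lt;/p&gt;

```sql
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'only_purge_repaired_tombstones': 'true'}
  AND gc_grace_seconds = 10800;
```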
&lt;h1&gt;
  
  
  &lt;strong&gt;Schedules can be edited&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;That may sound like an obvious feature but previous versions of Reaper didn’t allow for editing of an existing schedule. This led to an annoying procedure where you had to delete the schedule (which isn’t made easy by Reaper either) and recreate it with the new settings.&lt;/p&gt;

&lt;p&gt;Version 3.0 fixed that embarrassing situation by adding an edit button to schedules, which allows you to change their mutable settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9wgwrz89hlyts8r1ttd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9wgwrz89hlyts8r1ttd.png" alt="Image description" width="800" height="662"&gt;&lt;/a&gt;&lt;br&gt;
Figure 3: Reaper now has the ability to edit the settings for scheduled actions.&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;CVE fixes&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;With the release of Reaper 3.1.0, we were able to fix more than 80 reported CVEs by upgrading several dependencies to more current versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dropwizard.io/en/latest/" rel="noopener noreferrer"&gt;Dropwizard&lt;/a&gt; 2.0.25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://shiro.apache.org/" rel="noopener noreferrer"&gt;Shiro&lt;/a&gt; 1.8.0&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/asomov/snakeyaml-engine" rel="noopener noreferrer"&gt;SnakeYAML&lt;/a&gt; 1.29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://netty.io/" rel="noopener noreferrer"&gt;Netty&lt;/a&gt; 4.1.70.Final&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dtsx.io/32UMhVD" rel="noopener noreferrer"&gt;Cassandra Java Driver&lt;/a&gt; 3.11.0&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://eclipse-ee4j.github.io/jersey/" rel="noopener noreferrer"&gt;Jersey&lt;/a&gt; 2.33&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus Simple Client&lt;/a&gt; 0.12.0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes Reaper more secure and future-proof, as it now enables us to migrate from the deprecated &lt;a href="https://github.com/composable-systems/dropwizard-cassandra" rel="noopener noreferrer"&gt;dropwizard-cassandra&lt;/a&gt; bundle to the &lt;a href="https://github.com/dropwizard/dropwizard-cassandra" rel="noopener noreferrer"&gt;officially supported one&lt;/a&gt;, along with upgrading the Cassandra driver to the latest 4.x.&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;More improvements&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;In order to protect clusters from running mixed incremental and full repairs in older versions of Cassandra, Reaper would disallow the creation of an incremental repair run/schedule if a full repair had been created on the same set of tables in the past (and vice versa).&lt;/p&gt;

&lt;p&gt;Now that incremental repair is safe for production use, it is necessary to allow such mixed repair types. In case of conflict, Reaper 3.0 displays a pop-up informing you of it and allowing you to force-create the schedule/run:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pag484jynkaryawdspc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pag484jynkaryawdspc.png" alt="Image description" width="800" height="310"&gt;&lt;/a&gt;&lt;br&gt;
Figure 4: Reaper now shows a pop-up to inform you of a conflict and allow you to force-create the schedule/run.&lt;/p&gt;

&lt;p&gt;We’ve also added a special “schema migration mode” for Reaper, which exits once the schema has been created or upgraded. We use this mode in K8ssandra to prevent schema conflicts: schema creation runs in an init container, which isn’t subject to the liveness probes that could otherwise trigger premature termination of the Reaper pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar path/to/reaper.jar schema-migration path/to/cassandra-reaper.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are many other improvements and we invite all users to check the changelog in the GitHub repo.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Upgrade Now&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;We encourage all Reaper users to upgrade to 3.1.0, and recommend that users of Postgres or H2 carefully prepare their migration. Note that there is no export/import feature, so schedules will need to be recreated after the migration.&lt;/p&gt;

&lt;p&gt;All instructions to download, install, configure, and use Reaper 3.1.0 are available on the &lt;a href="https://cassandra-reaper.io/docs/download/" rel="noopener noreferrer"&gt;Reaper website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let us know what you think of Reaper 3.1 by joining us on the &lt;a href="https://dtsx.io/32McFBg" rel="noopener noreferrer"&gt;K8ssandra Discord&lt;/a&gt; or &lt;a href="https://dtsx.io/3HAo6us" rel="noopener noreferrer"&gt;K8ssandra Forum&lt;/a&gt; today. For exclusive posts on all things data, follow &lt;a href="https://dtsx.io/3sXkbUj" rel="noopener noreferrer"&gt;DataStax on Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Resources&lt;/strong&gt;
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="http://cassandra-reaper.io/" rel="noopener noreferrer"&gt;Reaper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cassandra-reaper.io/docs/download/" rel="noopener noreferrer"&gt;Reaper Documentation: Downloads and Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cassandra.apache.org/_/index.html" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://astra.dev/3i4sJlO" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://dtsx.io/3qPzVGq" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;TLP Blog: &lt;a href="https://dtsx.io/3JJSmVC" rel="noopener noreferrer"&gt;Incremental Repair Improvements in Cassandra 4&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TLP Blog: &lt;a href="https://dtsx.io/3eS0ftI" rel="noopener noreferrer"&gt;Hinted Handoff and GC Grace Demystified&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Cassandra Database Migration to Kubernetes with Zero Downtime</title>
      <dc:creator>Alexander Dejanovski</dc:creator>
      <pubDate>Tue, 15 Feb 2022 15:22:07 +0000</pubDate>
      <link>https://dev.to/datastax/cassandra-database-migration-to-kubernetes-with-zero-downtime-447k</link>
      <guid>https://dev.to/datastax/cassandra-database-migration-to-kubernetes-with-zero-downtime-447k</guid>
      <description>&lt;p&gt;K8ssandra is a cloud-native distribution of the Apache Cassandra® database that runs on Kubernetes, with a suite of tools to ease and automate operational tasks. In this post, we’ll walk you through a database migration from a Cassandra cluster running in AWS EC2 to a K8ssandra cluster running in Kubernetes on AWS EKS, with zero downtime.&lt;/p&gt;

&lt;p&gt;As an Apache Cassandra user, your expectation should be that migrating to K8ssandra happens without downtime. To achieve that with “classic” clusters running on virtual machines or bare-metal instances, you can use the datacenter (DC) switch technique, commonly used in the Cassandra community to move clusters to different hardware or environments. The good news is that it’s not very different for clusters running in Kubernetes, as most Container Network Interfaces (CNIs) provide routable pod IPs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Routable pod IPs in Kubernetes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A common misconception about Kubernetes networking is that services are the only way to expose pods outside the cluster and that pods themselves are only reachable directly from within the cluster.&lt;/p&gt;

&lt;p&gt;Looking at &lt;a href="https://docs.projectcalico.org/networking/determine-best-networking" rel="noopener noreferrer"&gt;the Calico documentation&lt;/a&gt;, we can read the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the pod IP addresses are routable outside of the cluster then pods can connect to the outside world without SNAT, and the outside world can connect directly to pods without going via a Kubernetes service or Kubernetes ingress.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same documentation tells us that the default CNIs used in AWS EKS, Azure AKS, and GCP GKE provide routable pod IPs within a VPC.&lt;/p&gt;

&lt;p&gt;This is necessary because Cassandra nodes in both datacenters will need to be able to communicate with each other without having to go through services. Each Cassandra node stores the list of all the other nodes in the cluster in the &lt;code&gt;system.peers(_v2)&lt;/code&gt; table and communicates with them using the IP addresses that are stored there. If pod IPs aren’t routable, there’s no (easy) way to create a hybrid Cassandra cluster that would span outside of the boundaries of a Kubernetes cluster.&lt;/p&gt;
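
&lt;p&gt;You can see the addresses each node has recorded for its peers directly from cqlsh (output columns abbreviated):&lt;/p&gt;

```
cqlsh> SELECT peer, data_center, rack FROM system.peers;
```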

&lt;h2&gt;
  
  
  &lt;strong&gt;Database Migration using Cassandra Datacenter Switch&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The traditional technique for migrating a cluster to different hardware or a new environment is to add a new datacenter to the cluster, with its nodes located in the target infrastructure; configure keyspaces so that Cassandra replicates data to the new DC; switch traffic to the new DC once it’s up to date; and then decommission the old infrastructure.&lt;/p&gt;

&lt;p&gt;While this procedure was brilliantly documented by my co-worker Alain Rodriguez on &lt;a href="https://thelastpickle.com/blog/2019/02/26/data-center-switch.html" rel="noopener noreferrer"&gt;the TLP blog&lt;/a&gt;, there are some subtleties related to running our new datacenter in Kubernetes, and more precisely using K8ssandra, which we’ll cover in detail here.&lt;/p&gt;

&lt;p&gt;Here are the steps we’ll go through to perform the migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restrict traffic to the existing datacenter.&lt;/li&gt;
&lt;li&gt;Expand the Cassandra cluster by adding a new datacenter in a Kubernetes cluster using K8ssandra.&lt;/li&gt;
&lt;li&gt;Rebuild the newly created datacenter.&lt;/li&gt;
&lt;li&gt;Switch traffic over to the K8ssandra datacenter.&lt;/li&gt;
&lt;li&gt;Decommission the original Cassandra datacenter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Performing the migration&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Initial State&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our starting point is a Cassandra 4.0-rc1 cluster running in AWS on EC2 instances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN  172.31.4.217   10.2 GiB  16      100.0%            9a9b5e8f-c0c2-404d-95e1-372880e02c43  us-west-2c
UN  172.31.38.15   10.2 GiB  16      100.0%            1e6a9077-bb47-4584-83d5-8bed63512fd8  us-west-2b
UN  172.31.22.153  10.2 GiB  16      100.0%            d6488a81-be1c-4b07-9145-2aa32675282a  us-west-2a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the AWS console, we can access the details of a node in the EC2 service and locate its VPC id which we’ll need later to create a peering connection with the EKS cluster VPC:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F299kfazalv6d0udz05cz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F299kfazalv6d0udz05cz.png" alt="Image description" width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
Finding the VPC id&lt;/p&gt;

&lt;p&gt;The next step is to create an EKS cluster with the right settings so that pod IPs will be reachable from the existing EC2 instances.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Creating the EKS cluster&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We’ll use the &lt;a href="https://github.com/k8ssandra/k8ssandra-terraform" rel="noopener noreferrer"&gt;k8ssandra-terraform&lt;/a&gt; project to spin up an EKS cluster with 3 nodes (see &lt;a href="https://docs.k8ssandra.io/install/eks/" rel="noopener noreferrer"&gt;https://docs.k8ssandra.io/install/eks/&lt;/a&gt; for more information).&lt;/p&gt;

&lt;p&gt;After cloning the project locally, we initialize a few env variables to get started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional if you're using the default profile
export AWS_PROFILE=eks-poweruser
export TF_VAR_environment=dev
# Must match the existing cluster name
export TF_VAR_name=adejanovski-migration-cluster
export TF_VAR_resource_owner=adejanovski
export TF_VAR_region=us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We go to the env directory and initialize our Terraform files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd env
terraform init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then update the &lt;code&gt;variables.tf&lt;/code&gt; file and adjust it to our needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "instance_type" {
 description = "Type of instance to be used for the Kubernetes cluster."
 type        = string
 default     = "r5.2xlarge"
}
variable "desired_capacity" {
 description = "Desired capacity for the autoscaling Group."
 type        = number
 default     = 3
}
variable "max_size" {
 description = "Maximum number of the instances in autoscaling group"
 type        = number
 default     = 3
}
variable "min_size" {
 description = "Minimum number of the instances in autoscaling group"
 type        = number
 default     = 3
}
...
variable "private_cidr_block" {
 description = "List of private subnet cidr blocks"
 type        = list(string)
 default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the private CIDR blocks are different from the ones used in the EC2 cluster VPC, otherwise you may end up with IP address conflicts.&lt;/p&gt;
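
&lt;p&gt;Before applying the plan, you can quickly verify that the candidate blocks don’t overlap the EC2 VPC CIDR, for example with Python’s standard &lt;code&gt;ipaddress&lt;/code&gt; module (the CIDR values below are examples):&lt;/p&gt;

```python
import ipaddress

def overlapping_blocks(ec2_vpc_cidr, eks_private_cidrs):
    """Return the EKS subnet blocks that overlap the existing EC2 VPC CIDR."""
    vpc = ipaddress.ip_network(ec2_vpc_cidr)
    return [cidr for cidr in eks_private_cidrs
            if ipaddress.ip_network(cidr).overlaps(vpc)]

# Default AWS VPCs use 172.31.0.0/16, so the 10.0.x.0/24 blocks are safe.
print(overlapping_blocks("172.31.0.0/16",
                         ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]))
```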

&lt;p&gt;Now create the EKS cluster and the three worker nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform plan
terraform apply
...
# Answer "yes" when asked for confirmation
Do you want to perform these actions in workspace "eks-experiment"?
 Terraform will perform the actions described above.
 Only 'yes' will be accepted to approve.
 Enter a value: yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operation will take a few minutes to complete and output something similar to this at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apply complete! Resources: 50 added, 0 changed, 0 destroyed.
Outputs:
bucket_id = "dev-adejanovski-migration-cluster-s3-bucket"
cluster_Endpoint = "https://FB2B5CD5D27F43B69B54.gr7.us-west-2.eks.amazonaws.com"
cluster_name = "dev-adejanovski-migration-cluster-eks-cluster"
cluster_version = "1.20"
connect_cluster = "aws eks --region us-west-2 update-kubeconfig --name dev-adejanovski-migration-cluster-eks-cluster"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;connect_cluster&lt;/code&gt; command which will allow us to create the kubeconfig context entry to interact with the cluster using &lt;code&gt;kubectl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% aws eks --region us-west-2 update-kubeconfig --name dev-adejanovski-migration-cluster-eks-cluster
Updated context arn:aws:eks:us-west-2:3373455535488:cluster/dev-adejanovski-migration-cluster-eks-cluster in /Users/adejanovski/.kube/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now check the list of worker nodes in our k8s cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-1-107.us-west-2.compute.internal   Ready    &amp;lt;none&amp;gt;   5m   v1.20.4-eks-6b7464
ip-10-0-2-34.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   5m   v1.20.4-eks-6b7464
ip-10-0-3-239.us-west-2.compute.internal   Ready    &amp;lt;none&amp;gt;   5m   v1.20.4-eks-6b7464
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;VPC Peering and Security Groups&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our Terraform scripts will create a specific VPC for the EKS cluster. In order for our Cassandra nodes to communicate with the K8ssandra nodes, we will need to create a peering connection between both VPCs. Follow the documentation provided by AWS on this topic to create the peering connection: &lt;a href="https://docs.aws.amazon.com/vpc/latest/peering/create-vpc-peering-connection.html" rel="noopener noreferrer"&gt;VPC Peering Connection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once the VPC peering connection is created and the route tables are updated in both VPCs, update the inbound rules of the security groups for both the EC2 Cassandra nodes and the EKS worker nodes to accept all TCP traffic on ports 7000 and 7001, which are used by Cassandra nodes to communicate with each other (unless configured otherwise).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Preparing the Cassandra cluster for the expansion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73dgarjq3gb84c3m72sm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73dgarjq3gb84c3m72sm.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Original Cassandra cluster&lt;/p&gt;

&lt;p&gt;When expanding a Cassandra cluster to another DC, and assuming you haven’t created your cluster with the &lt;code&gt;&lt;a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archSnitchSimple.html" rel="noopener noreferrer"&gt;SimpleSnitch&lt;/a&gt;&lt;/code&gt; (otherwise you first have to &lt;a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsSwitchSnitch.html" rel="noopener noreferrer"&gt;switch snitches&lt;/a&gt;), you need to make sure your keyspaces use &lt;code&gt;&lt;a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeReplication.html?hl=replication%2Cstrategy#archDataDistributeReplication__nts" rel="noopener noreferrer"&gt;NetworkTopologyStrategy&lt;/a&gt;&lt;/code&gt; (NTS). This replication strategy is the only one that is DC and rack aware. The default &lt;code&gt;&lt;a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeReplication.html?hl=replication%2Cstrategy#archDataDistributeReplication__simpleStrategy" rel="noopener noreferrer"&gt;SimpleStrategy&lt;/a&gt;&lt;/code&gt; ignores DCs and behaves as if all nodes were colocated in the same DC and rack.&lt;/p&gt;

&lt;p&gt;We’ll use &lt;code&gt;cqlsh&lt;/code&gt; on one of the EC2 Cassandra nodes to list the existing keyspaces and update their replication strategy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cqlsh $(hostname)
Connected to adejanovski-migration-cluster at ip-172-31-22-153:9042
[cqlsh 6.0.0 | Cassandra 4.0 | CQL spec 3.4.5 | Native protocol v5]
Use HELP for help.
cqlsh&amp;gt; DESCRIBE KEYSPACES
system       system_distributed  system_traces  system_virtual_schema
system_auth  system_schema       system_views   tlp_stress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Several system keyspaces use the special &lt;code&gt;LocalStrategy&lt;/code&gt; and are not replicated across nodes. They contain only node specific information and cannot be altered in any way.&lt;/p&gt;
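
&lt;p&gt;To see which strategy each keyspace currently uses, you can query the schema tables from cqlsh:&lt;/p&gt;

```
cqlsh> SELECT keyspace_name, replication FROM system_schema.keyspaces;
```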

&lt;p&gt;We’ll alter the following keyspaces to make them use NTS and only put replicas on the existing datacenter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system_auth&lt;/code&gt; (contains user credentials for authentication purposes)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system_distributed&lt;/code&gt; (contains repair history data and MV build status)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system_traces&lt;/code&gt; (contains probabilistic tracing data)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tlp_stress&lt;/code&gt; (user-created keyspace)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add any other user-created keyspace to the list. Here we only have the &lt;code&gt;tlp_stress&lt;/code&gt; keyspace which was created by the &lt;a href="https://thelastpickle.com/tlp-stress/" rel="noopener noreferrer"&gt;tlp-stress&lt;/a&gt; tool to generate some data for the purpose of this migration.&lt;/p&gt;

&lt;p&gt;We will now run the following command on all the above keyspaces using the existing datacenter name, in our case &lt;code&gt;us-west-2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cqlsh&amp;gt; ALTER KEYSPACE &amp;lt;keyspace_name&amp;gt; WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': 3};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
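&lt;p&gt;As a sketch, the statements for all four keyspaces can be generated with a small shell loop (the keyspace list and datacenter name below are the ones from this example; adjust them for your own cluster) and the output piped into cqlsh:&lt;/p&gt;

```shell
# Sketch: print the ALTER statement for each keyspace to migrate.
# Keyspace list and datacenter name are from this example cluster.
for ks in system_auth system_distributed system_traces tlp_stress; do
  echo "ALTER KEYSPACE ${ks} WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': 3};"
done
```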



&lt;p&gt;Make sure client traffic is pinned to the &lt;code&gt;us-west-2&lt;/code&gt; datacenter by specifying it as the local datacenter. This can be done with the &lt;code&gt;DCAwareRoundRobinPolicy&lt;/code&gt; in older versions of the DataStax drivers, or by specifying the local datacenter when creating a new &lt;code&gt;CqlSession&lt;/code&gt; object in the 4.x branch of the Java driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CqlSession session = CqlSession.builder()
   .withLocalDatacenter("us-west-2")
   .build();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More information can be found in the drivers’ documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deploying K8ssandra as a new datacenter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10cumgn33yv6rpkvgnxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10cumgn33yv6rpkvgnxl.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Creating a K8ssandra deployment for the new datacenter&lt;/p&gt;

&lt;p&gt;K8ssandra ships with &lt;strong&gt;&lt;a href="https://github.com/k8ssandra/cass-operator" rel="noopener noreferrer"&gt;cass-operator&lt;/a&gt;&lt;/strong&gt;, which orchestrates the Cassandra nodes and handles their configuration. Cass-operator exposes an &lt;code&gt;additionalSeeds&lt;/code&gt; setting that lets us add seed nodes not managed by the local instance of cass-operator and, by doing so, create a new datacenter that expands an existing cluster.&lt;/p&gt;

&lt;p&gt;We will list all our existing Cassandra nodes as additional seeds; you should not need more than three entries in this list, even if your original cluster is larger. The following &lt;code&gt;migration.yaml&lt;/code&gt; values file will be used for our K8ssandra Helm chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
 version: "4.0.0"
 clusterName: "adejanovski-migration-cluster"
 allowMultipleNodesPerWorker: false
 additionalSeeds:
 - 172.31.4.217
 - 172.31.38.15
 - 172.31.22.153
 heap:
  size: 31g
 gc:
   g1:
     enabled: true
     setUpdatingPauseTimePercent: 5
     maxGcPauseMillis: 300
 resources:
   requests:
     memory: "59Gi"
     cpu: "7000m"
   limits:
     memory: "60Gi"
 datacenters:
 - name: k8s-1
   size: 3
   racks:
   - name: r1
     affinityLabels:
       topology.kubernetes.io/zone: us-west-2a
   - name: r2
     affinityLabels:
       topology.kubernetes.io/zone: us-west-2b
   - name: r3
     affinityLabels:
       topology.kubernetes.io/zone: us-west-2c
 ingress:
   enabled: false
 cassandraLibDirVolume:
   storageClass: gp2
   size: 3400Gi
stargate:
 enabled: false
medusa:
 enabled: false
reaper-operator:
 enabled: false
kube-prometheus-stack:
 enabled: false
reaper:
 enabled: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the cluster name must match the value used for the EC2 Cassandra nodes, and the datacenter must be named differently from the existing one(s). We will only install Cassandra in our K8ssandra datacenter, but other components could be deployed as well during this phase.&lt;/p&gt;

&lt;p&gt;Let’s deploy K8ssandra and have it join the Cassandra cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% helm install k8ssandra charts/k8ssandra -n k8ssandra --create-namespace -f ~/k8ssandra_demo/benchmarks.values.yaml
NAME: k8ssandra
LAST DEPLOYED: Thu Jul  1 09:46:54 2021
NAMESPACE: k8ssandra
STATUS: deployed
REVISION: 1
TEST SUITE: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can monitor the logs of the Cassandra pods to see if they’re joining appropriately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs pod/adejanovski-migration-cluster-k8s-1-r1-sts-0 -c server-system-logger -n k8ssandra --follow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cass-operator will only start one node at a time, so if you see a message like the following, try checking the logs of another pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail: can't open '/var/log/cassandra/system.log': No such file or directory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If VPC peering was set up correctly, the nodes should join the cluster one by one, and after a while &lt;code&gt;nodetool status&lt;/code&gt; should give an output that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Datacenter: k8s-1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.3.10      78.16 KiB  16      0.0%              c63b9b16-24fe-4232-b146-b7c2f450fcc6  r3
UN  10.0.2.66      69.14 KiB  16      0.0%              b1409a2e-cba1-482f-9ea6-c895bf296cd9  r2
UN  10.0.1.77      69.13 KiB  16      0.0%              78c53702-7a47-4629-a7bd-db41b1705bb8  r1
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.31.4.217   10.2 GiB   16      100.0%            9a9b5e8f-c0c2-404d-95e1-372880e02c43  us-west-2c
UN  172.31.38.15   10.2 GiB   16      100.0%            1e6a9077-bb47-4584-83d5-8bed63512fd8  us-west-2b
UN  172.31.22.153  10.2 GiB   16      100.0%            d6488a81-be1c-4b07-9145-2aa32675282a  us-west-2a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Rebuilding the new datacenter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwep6y846uqsue3lkcdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwep6y846uqsue3lkcdl.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Replicating data to the new datacenter by rebuilding&lt;/p&gt;

&lt;p&gt;Now that our K8ssandra datacenter has joined the cluster, we will alter the replication strategies to create replicas in the &lt;code&gt;k8s-1&lt;/code&gt; DC for the keyspaces we previously altered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cqlsh&amp;gt; ALTER KEYSPACE &amp;lt;keyspace_name&amp;gt; WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': '3', 'k8s-1': '3'};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After altering all required keyspaces, rebuild the newly added nodes by running the following command for each Cassandra pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl exec -it pod/adejanovski-migration-cluster-k8s-1-r1-sts-0 -c cassandra -n k8ssandra -- nodetool rebuild us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
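&lt;p&gt;As a sketch, the rebuild command for all three pods can be generated with a loop (the pod names follow the naming pattern of this example cluster; the commands are printed with echo so they can be reviewed, then run by removing the leading echo):&lt;/p&gt;

```shell
# Sketch: print the rebuild command for each Cassandra pod (racks r1 to r3).
# Remove the leading 'echo' to actually execute each rebuild.
for rack in r1 r2 r3; do
  echo kubectl exec -it "pod/adejanovski-migration-cluster-k8s-1-${rack}-sts-0" \
    -c cassandra -n k8ssandra -- nodetool rebuild us-west-2
done
```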



&lt;p&gt;Once all three nodes are rebuilt, the load should be similar on all nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Datacenter: k8s-1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.3.10      10.3 GiB   16      100.0%            c63b9b16-24fe-4232-b146-b7c2f450fcc6  r3
UN  10.0.2.66      10.3 GiB   16      100.0%            b1409a2e-cba1-482f-9ea6-c895bf296cd9  r2
UN  10.0.1.77      10.3 GiB   16      100.0%            78c53702-7a47-4629-a7bd-db41b1705bb8  r1
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.31.4.217   10.32 GiB  16      100.0%            9a9b5e8f-c0c2-404d-95e1-372880e02c43  us-west-2c
UN  172.31.38.15   10.32 GiB  16      100.0%            1e6a9077-bb47-4584-83d5-8bed63512fd8  us-west-2b
UN  172.31.22.153  10.32 GiB  16      100.0%            d6488a81-be1c-4b07-9145-2aa32675282a  us-west-2a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note that K8ssandra will create a new superuser, and the existing users in the cluster will be retained after the migration. You can force K8ssandra to use the existing superuser credentials in the new datacenter by adding the following block to the “cassandra” section of the Helm values file:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; auth:
   enabled: true
   superuser:
     secret: "superuser-password"
     username: "superuser-name"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Switching traffic to the new datacenter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpehak9bnjp5mswv1ey6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpehak9bnjp5mswv1ey6s.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Redirecting client traffic to the new datacenter&lt;/p&gt;

&lt;p&gt;Client traffic can now be directed at the &lt;code&gt;k8s-1&lt;/code&gt; datacenter, the same way we previously restricted it to &lt;code&gt;us-west-2&lt;/code&gt;. If your clients are running from within the Kubernetes cluster, use the cassandra service exposed by K8ssandra as a contact point for the driver. If the clients are running outside of the Kubernetes cluster, you’ll need to enable Ingress and configure it appropriately, which is outside the scope of this blog post and will be covered in a future one.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Decommissioning the old datacenter and finishing the migration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2821jdg4oyx5qenqbwat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2821jdg4oyx5qenqbwat.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Decommission the original datacenter&lt;/p&gt;

&lt;p&gt;Once all the client apps/services have been restarted, we can alter our keyspaces to only replicate them on &lt;code&gt;k8s-1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cqlsh&amp;gt; ALTER KEYSPACE &amp;lt;keyspace_name&amp;gt; WITH replication = {'class': 'NetworkTopologyStrategy', 'k8s-1': '3'};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;ssh&lt;/code&gt; into each of the Cassandra nodes in &lt;code&gt;us-west-2&lt;/code&gt; and run the following command to decommission them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% nodetool decommission
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
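&lt;p&gt;As a sketch, the three decommission commands (using the seed IPs from this example) can be printed for review; run them one node at a time, waiting for each node to fully leave the ring before starting the next:&lt;/p&gt;

```shell
# Sketch: print the decommission command for each EC2 node.
# Run these sequentially, one node at a time, never in parallel.
for ip in 172.31.4.217 172.31.38.15 172.31.22.153; do
  echo ssh "${ip}" nodetool decommission
done
```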



&lt;p&gt;They will appear as leaving (UL) while the decommission is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Datacenter: k8s-1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.3.10      10.3 GiB   16      100.0%            c63b9b16-24fe-4232-b146-b7c2f450fcc6  r3
UN  10.0.2.66      10.3 GiB   16      100.0%            b1409a2e-cba1-482f-9ea6-c895bf296cd9  r2
UN  10.0.1.77      10.3 GiB   16      100.0%            78c53702-7a47-4629-a7bd-db41b1705bb8  r1
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.31.4.217   10.32 GiB  16      0.0%              9a9b5e8f-c0c2-404d-95e1-372880e02c43  us-west-2c
UN  172.31.38.15   10.32 GiB  16      0.0%              1e6a9077-bb47-4584-83d5-8bed63512fd8  us-west-2b
UL  172.31.22.153  10.32 GiB  16      0.0%              d6488a81-be1c-4b07-9145-2aa32675282a  us-west-2a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operation should be fairly fast as no streaming will take place since we no longer have keyspaces replicated on &lt;code&gt;us-west-2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once all three nodes have been decommissioned, we should be left with the &lt;code&gt;k8s-1&lt;/code&gt; datacenter only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Datacenter: k8s-1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.3.10  10.3 GiB  16      100.0%            c63b9b16-24fe-4232-b146-b7c2f450fcc6  r3
UN  10.0.2.66  10.3 GiB  16      100.0%            b1409a2e-cba1-482f-9ea6-c895bf296cd9  r2
UN  10.0.1.77  10.3 GiB  16      100.0%            78c53702-7a47-4629-a7bd-db41b1705bb8  r1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a final step, we can now delete the VPC peering connection which is no longer necessary.&lt;/p&gt;

&lt;p&gt;Note that the cluster can run in hybrid mode for as long as necessary. There’s no requirement to delete the &lt;code&gt;us-west-2&lt;/code&gt; datacenter if it makes sense to keep it alive.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have seen that it is possible to migrate existing Cassandra clusters to K8ssandra without downtime by leveraging flat networking, allowing Cassandra nodes running in VMs to connect directly to Cassandra pods running in Kubernetes.&lt;/p&gt;

&lt;p&gt;Join &lt;a href="https://forum.k8ssandra.io/" rel="noopener noreferrer"&gt;our forum&lt;/a&gt; if you have any questions about the above procedure, and come speak with us directly &lt;a href="https://discord.com/invite/qP5tAt6Uwt" rel="noopener noreferrer"&gt;in Discord&lt;/a&gt;. Curious to learn more about (or play with) Cassandra itself? We recommend trying it on &lt;a href="https://astra.dev/3HZgdiY" rel="noopener noreferrer"&gt;Astra DB's&lt;/a&gt; free plan for the fastest setup.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Backing up K8ssandra with MinIO</title>
      <dc:creator>Alexander Dejanovski</dc:creator>
      <pubDate>Thu, 03 Feb 2022 22:49:57 +0000</pubDate>
      <link>https://dev.to/datastax/backing-up-k8ssandra-with-minio-265g</link>
      <guid>https://dev.to/datastax/backing-up-k8ssandra-with-minio-265g</guid>
      <description>&lt;p&gt;K8ssandra includes Medusa for Apache Cassandra® to handle backup and restore for your Cassandra nodes. Recently Medusa was upgraded to introduce support for all S3 compatible backends, including &lt;a href="https://min.io/" rel="noopener noreferrer"&gt;MinIO&lt;/a&gt;, the popular k8s-native object storage suite. Let’s see how to set up K8ssandra and MinIO to backup Cassandra in just a few steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy MinIO&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Like K8ssandra, MinIO can be deployed simply through Helm.&lt;/p&gt;

&lt;p&gt;First, add the MinIO repository to your local list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add minio https://helm.min.io/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MinIO Helm charts allow you to do several things at once at install time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set the credentials to access MinIO&lt;/li&gt;
&lt;li&gt;Create a bucket for your backups that can be set as default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following command creates a &lt;code&gt;k8ssandra-medusa&lt;/code&gt; bucket, sets &lt;code&gt;minio_key/minio_secret&lt;/code&gt; as the credentials, and deploys MinIO in a new namespace called &lt;code&gt;minio&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install --set accessKey=minio_key,secretKey=minio_secret,defaultBucket.enabled=true,defaultBucket.name=k8ssandra-medusa minio minio/minio -n minio --create-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Creating the bucket is not mandatory at this stage and can be done through MinIO’s UI.&lt;/p&gt;

&lt;p&gt;After the &lt;code&gt;helm install&lt;/code&gt; command has completed, you should see something similar to this in the &lt;code&gt;minio&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl get all -n minio
NAME                        READY   STATUS    RESTARTS   AGE
pod/minio-5fd4dd687-gzr8j   1/1     Running   0          109s
NAME            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/minio   ClusterIP   10.96.144.61   &amp;lt;none&amp;gt;        9000/TCP   109s
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/minio   1/1     1            1           109s
NAME                              DESIRED   CURRENT   READY   AGE
replicaset.apps/minio-5fd4dd687   1         1         1       109s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using port forwarding, you can expose access to the MinIO UI in the browser on port 9000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl port-forward service/minio 9000 -n minio
Forwarding from 127.0.0.1:9000 -&amp;gt; 9000
Forwarding from [::1]:9000 -&amp;gt; 9000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can log in to MinIO at &lt;a href="http://localhost:9000/" rel="noopener noreferrer"&gt;http://localhost:9000&lt;/a&gt; using the credentials defined at install time (if you used the command above, they are &lt;code&gt;minio_key&lt;/code&gt; and &lt;code&gt;minio_secret&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqfej9h1jpa4yjk06bxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqfej9h1jpa4yjk06bxf.png" alt="Image description" width="800" height="625"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once logged in, you can see that the &lt;code&gt;k8ssandra-medusa&lt;/code&gt; bucket was created and is currently empty:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90mty4npxa6muy6oo6h6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90mty4npxa6muy6oo6h6.png" alt="Image description" width="512" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy K8ssandra&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that MinIO is up and running, you can create a namespace for your K8ssandra installation and create a secret for Medusa to access the bucket. Create a &lt;code&gt;medusa_secret.yaml&lt;/code&gt; file with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
name: medusa-bucket-key
type: Opaque
stringData:
# Note that this currently has to be set to medusa_s3_credentials!
medusa_s3_credentials: |-
  [default]
  aws_access_key_id = minio_key
  aws_secret_access_key = minio_secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now create the &lt;code&gt;k8ssandra&lt;/code&gt; namespace and the Medusa secret with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace k8ssandra
kubectl apply -f medusa_secret.yaml -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should now see the &lt;code&gt;medusa-bucket-key&lt;/code&gt; secret in the &lt;code&gt;k8ssandra&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl get secrets -n k8ssandra
NAME                  TYPE                                  DATA   AGE
default-token-twk5w   kubernetes.io/service-account-token   3      4m49s
medusa-bucket-key     Opaque                                1      45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then deploy K8ssandra with the following custom values file (default values will be used for anything not customized here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;medusa:
 enabled: true
 storage: s3_compatible
 storage_properties:
     host: minio.minio.svc.cluster.local
     port: 9000
     secure: "False"
 bucketName: k8ssandra-medusa
 storageSecret: medusa-bucket-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the above file as &lt;code&gt;k8ssandra_medusa_minio.yaml&lt;/code&gt; and then install K8ssandra with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install k8ssandra k8ssandra/k8ssandra -f k8ssandra_medusa_minio.yaml -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now wait for the Cassandra cluster to be ready by using the following &lt;code&gt;wait&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl wait --for=condition=Ready cassandradatacenter/dc1 --timeout=900s -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should now see a list of pods similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl get pods -n k8ssandra
NAME                                                  READY   STATUS      RESTARTS   AGE
k8ssandra-cass-operator-547845459-dwg68               1/1     Running     0          6m36s
k8ssandra-dc1-default-sts-0                           3/3     Running     0          5m56s
k8ssandra-dc1-stargate-776f88f945-p9twg               0/1     Running     0          6m36s
k8ssandra-grafana-75b9cb64cc-kndtc                    2/2     Running     0          6m36s
k8ssandra-kube-prometheus-operator-5bdd97c666-qz5vv   1/1     Running     0          6m36s
k8ssandra-medusa-operator-d766d5b66-wjt7j             1/1     Running     0          6m36s
k8ssandra-reaper-5f9bbfc989-j59xk                     1/1     Running     0          2m48s
k8ssandra-reaper-operator-858cd89bdd-7gfjj            1/1     Running     0          6m36s
k8ssandra-reaper-schema-4gshj                         0/1     Completed   0          3m3s
prometheus-k8ssandra-kube-prometheus-prometheus-0     2/2     Running     1          6m32s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Create some data and back it up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Extract the username and password to access Cassandra (the password is different for each installation unless it is explicitly set at install time) into variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% username=$(kubectl get secret k8ssandra-superuser -n k8ssandra -o jsonpath="{.data.username}" | base64 --decode)
% password=$(kubectl get secret k8ssandra-superuser -n k8ssandra -o jsonpath="{.data.password}" | base64 --decode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
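&lt;p&gt;For illustration, here is what the &lt;code&gt;base64 --decode&lt;/code&gt; step does on a sample value (the encoded string below is simply the base64 form of the default superuser name used in this example):&lt;/p&gt;

```shell
# Sketch: decode a base64-encoded secret value, as the pipeline above does.
# 'azhzc2FuZHJhLXN1cGVydXNlcg==' is the base64 encoding of 'k8ssandra-superuser'.
echo "azhzc2FuZHJhLXN1cGVydXNlcg==" | base64 --decode
```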



&lt;p&gt;Connect through CQLSH on one of the nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl exec -it k8ssandra-dc1-default-sts-0 -n k8ssandra -c cassandra -- cqlsh -u $username -p $password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy/paste the following statements into the CQLSH prompt and press enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE medusa_test  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE medusa_test;
CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT, state TEXT);
INSERT INTO users (email, name, state) VALUES ('alice@example.com', 'Alice Smith', 'TX');
INSERT INTO users (email, name, state) VALUES ('bob@example.com', 'Bob Jones', 'VA');
INSERT INTO users (email, name, state) VALUES ('carol@example.com', 'Carol Jackson', 'CA');
INSERT INTO users (email, name, state) VALUES ('david@example.com', 'David Yang', 'NV');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that the rows were properly inserted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM medusa_test.users;
email             | name          | state
-------------------+---------------+-------
alice@example.com |   Alice Smith |    TX
  bob@example.com |     Bob Jones |    VA
david@example.com |    David Yang |    NV
carol@example.com | Carol Jackson |    CA
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now back up this data and check that files get created in your MinIO bucket.&lt;/p&gt;

&lt;p&gt;To that end, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install my-backup k8ssandra/backup -n k8ssandra --set name=backup1,cassandraDatacenter.name=dc1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the backup operation is asynchronous, you can monitor its completion by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get cassandrabackup backup1 -n k8ssandra -o jsonpath={.status.finishTime}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As long as this doesn’t output a date and time, the backup is still running. Given the small amount of data and the locally accessible backend, it should complete quickly.&lt;/p&gt;

&lt;p&gt;Now refresh the MinIO UI and you should see some files in the &lt;code&gt;k8ssandra-medusa&lt;/code&gt; bucket:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiykkb88r21o46q5ouzav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiykkb88r21o46q5ouzav.png" alt="Image description" width="512" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An index folder (Medusa’s backup index) should appear, along with one folder per Cassandra node in the cluster (in this case, a single node).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deleting the data and restoring the backup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;TRUNCATE&lt;/code&gt; the table and verify it is empty:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl exec -it k8ssandra-dc1-default-sts-0 -n k8ssandra -c cassandra -- cqlsh -u $username -p $password
TRUNCATE medusa_test.users;
SELECT * FROM medusa_test.users;
email | name | state
-------+------+-------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now restore the backup taken previously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install restore-test k8ssandra/restore --set name=restore-backup1,backup.name=backup1,cassandraDatacenter.name=dc1 -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation will take a little longer, as it requires stopping the StatefulSet pod and performing the restore in an init container before the Cassandra container can start. You can monitor progress using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;watch -d kubectl get cassandrarestore restore-backup1 -o jsonpath={.status} -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The restore operation is fully completed once the &lt;code&gt;finishTime&lt;/code&gt; value appears in the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"finishTime":"2021-03-23T13:58:36Z","restoreKey":"83977399-44dd-4752-b4c4-407273f0339e","startTime":"2021-03-23T13:55:35Z"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that you can read the data from the previously truncated table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kubectl exec -it k8ssandra-dc1-default-sts-0 -n k8ssandra -c cassandra -- cqlsh -u k8ssandra-superuser -p XHsZ943WBg5RPNhVAT8x -e "SELECT * FROM medusa_test.users"
email             | name          | state
-------------------+---------------+-------
alice@example.com |   Alice Smith |    TX
  bob@example.com |     Bob Jones |    VA
david@example.com |    David Yang |    NV
carol@example.com | Carol Jackson |    CA
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ve successfully restored your lost data in just a few commands!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Many backends available&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MinIO, while an obvious choice in the Kubernetes world, is not the only S3-compatible backend that K8ssandra can use. K8ssandra has supported AWS S3 and Google Cloud Storage as Medusa backends since 1.0.0. There are also a wide variety of solutions that can run on-prem (including CEPH, Cloudian, Riak S2, and Dell EMC ECS) or in cloud environments (including IBM Cloud Object Storage and OVHcloud Object Storage). See the &lt;a href="https://docs.k8ssandra.io/tasks/backup-restore/" rel="noopener noreferrer"&gt;K8ssandra backup/restore documentation&lt;/a&gt; for more detailed instructions, and let us know if you have questions; we love to help! If you are looking to learn Cassandra, or want to see how backups are handled on a managed Cassandra service, head over to the &lt;a href="https://astra.dev/3Gety66" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; website and try the free tier.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Requirements for running K8ssandra for development</title>
      <dc:creator>Alexander Dejanovski</dc:creator>
      <pubDate>Thu, 13 Jan 2022 17:27:05 +0000</pubDate>
      <link>https://dev.to/datastax/requirements-for-running-k8ssandra-for-development-4eaf</link>
      <guid>https://dev.to/datastax/requirements-for-running-k8ssandra-for-development-4eaf</guid>
      <description>&lt;p&gt;``# &lt;strong&gt;Managing expectations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The K8ssandra &lt;a href="https://k8ssandra.io/get-started/" rel="noopener noreferrer"&gt;Quick start&lt;/a&gt; is an excellent guide for doing a full installation of K8ssandra on a dev laptop and trying out the various components of the K8ssandra stack. While this is a great way to get your first hands-on experience with K8ssandra, let’s state the obvious: running K8ssandra locally on a dev laptop is not aimed at performance. In this blog post, we will start Apache Cassandra® locally and then explain how to run benchmarks that help evaluate the level of performance (especially throughput) you can expect from a dev laptop deployment.&lt;/p&gt;

&lt;p&gt;Our goal was to achieve the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the whole stack, if possible, with at least three Cassandra nodes and one Stargate node. K8ssandra ships with the following open source components:

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra®&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stargate.io/" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt; : API framework and data gateway (CQL, REST, GraphQL)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/thelastpickle/cassandra-medusa" rel="noopener noreferrer"&gt;Medusa for Apache Cassandra®&lt;/a&gt; : Backup and restore tool for Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://cassandra-reaper.io/" rel="noopener noreferrer"&gt;Reaper for Apache Cassandra®&lt;/a&gt; : Repair orchestration tool for Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/datastax/metric-collector-for-apache-cassandra" rel="noopener noreferrer"&gt;Metrics Collector for Apache Cassandra&lt;/a&gt; : Metric collection and Dashboards for Apache Cassandra (2.2, 3.0, 3.11, 4.0) clusters.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/datastax/management-api-for-apache-cassandra" rel="noopener noreferrer"&gt;Management API for Apache Cassandra&lt;/a&gt; : Secure Management Sidecar for Apache Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/datastax/cass-operator" rel="noopener noreferrer"&gt;Cass Operator&lt;/a&gt; : Kubernetes Operator for Apache Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/k8ssandra/medusa-operator" rel="noopener noreferrer"&gt;Medusa Operator&lt;/a&gt; : Kubernetes Operator for Medusa&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/k8ssandra/reaper-operator" rel="noopener noreferrer"&gt;Reaper Operator&lt;/a&gt; : Kubernetes Operator for Reaper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" rel="noopener noreferrer"&gt;kube-prometheus-stack&lt;/a&gt; chart:

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; : Monitoring system &amp;amp; time series database&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; : Fully composable observability stack&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;Achieve reasonable startup times&lt;/li&gt;

&lt;li&gt;Specify a dev setup stable enough to sustain moderate workloads (50 to 100 ops/s)&lt;/li&gt;

&lt;li&gt;Come up with some minimum requirements and recommended K8ssandra settings&lt;/li&gt;

&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Using the right settings&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Cassandra can run with fairly limited resources as long as you don’t put too much pressure on it. For example, for the Reaper project, we run our integration tests with &lt;a href="https://github.com/riptano/ccm" rel="noopener noreferrer"&gt;CCM (Cassandra Cluster Manager)&lt;/a&gt;, configured with a &lt;a href="https://github.com/thelastpickle/cassandra-reaper/blob/master/.github/scripts/configure-ccm.sh#L22" rel="noopener noreferrer"&gt;256MB heap&lt;/a&gt;. This allows the JVM to allocate an additional 256MB of off-heap memory, letting Cassandra use up to 512MB of RAM.&lt;/p&gt;

&lt;p&gt;If we want to run K8ssandra with limited resources, we’ll need to set these appropriately in our Helm values files.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting heap sizes in K8ssandra&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The K8ssandra Helm charts allow us to set heap sizes for both the Cassandra and Stargate pods separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cassandra&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For Cassandra, the heap and new gen sizes can be set at the cluster level, or at the datacenter level (K8ssandra will support multi DC deployments in a future release):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
  version: "3.11.10"
  ...
  ...
  # Cluster level heap settings
  heap: {}
  #size:
  #newGenSize:
  datacenters:
  - name: dc1
    size: 3
    ...
    ...
    # Datacenter level heap settings
    heap: {}
    #size:
    #newGenSize:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By default, these values aren’t set, which lets Cassandra perform its own computations based on the available RAM, applying the following formula:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The catch when you run several Cassandra nodes on the same machine is that each one sees the total available RAM without being aware that other Cassandra nodes are running alongside it. When allocating 8GB of RAM to Docker, each Cassandra node will compute a 2GB heap for itself. With a three-node cluster, that’s already 6GB of RAM used, not accounting for the additional off-heap memory each JVM can use. That doesn’t leave much RAM for the other components K8ssandra includes, such as Grafana, Prometheus and Stargate.&lt;/p&gt;
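&lt;p&gt;To make that concrete, here is a small sketch (our own illustration, not K8ssandra code) of the default heap formula, showing that each node on a host with 8GB of RAM computes a 2GB heap for itself:&lt;/p&gt;

```python
def default_cassandra_heap_mb(ram_mb):
    # max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)), as described above
    return max(min(ram_mb // 2, 1024), min(ram_mb // 4, 8192))

# Each of the 3 nodes sees the full 8GB allocated to Docker
per_node_heap = default_cassandra_heap_mb(8 * 1024)
print(per_node_heap)  # 2048, i.e. a 2GB heap per node, 6GB total for 3 nodes
```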

&lt;p&gt;The takeaway here: &lt;strong&gt;leaving heap settings blank is not a good idea for a dev environment&lt;/strong&gt; in particular, where several Cassandra instances are colocated on the same host machine. (By default, K8ssandra does not allow multiple Cassandra nodes on the same Kubernetes worker node. For this post, we’re using kind to run multiple worker nodes on the same OS instance, or virtual machine in the case of Docker Desktop.)&lt;/p&gt;

&lt;p&gt;The chosen heap size directly impacts the throughput you can expect to achieve (although it’s not the only limiting factor). A small heap means more frequent garbage collections, which generate more stop-the-world pauses and directly hurt throughput and latency. It also increases the odds of running out of memory if the workload is too heavy, as objects cannot be reclaimed fast enough for the available heap space.&lt;/p&gt;

&lt;p&gt;Setting the heap size at 500MB with 200MB of new gen globally for the cluster would be done as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
  version: "3.11.10"
  ...
  ...
  # Cluster level heap settings
  heap:
    size: 500M
    newGenSize: 200M
  datacenters:
  - name: dc1
    size: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stargate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Because Stargate nodes are special coordinator-only Cassandra nodes that run on the JVM, it is also necessary to set their max heap size:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stargate:
  enabled: true
  version: "1.0.9"
  replicas: 1
  ...
  ...
  heapMB: 256
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Stargate nodes follow the same rule for off-heap memory: the JVM is allowed to use as much RAM off heap as the configured heap size.&lt;/p&gt;

&lt;p&gt;As Stargate serves as a coordinator, it is likely to hold objects on the heap for longer, waiting for all nodes to respond to a query before it can acknowledge it and potentially return a result set to the client. It needs enough heap to do so without excessive garbage collection. Unlike Cassandra, Stargate doesn’t compute a heap size based on the available RAM; the value must be set explicitly.&lt;/p&gt;

&lt;p&gt;During our tests, we observed that 256MB was a good initial value to have stable Stargate pods. In production you might want to tune this value for optimal performance.&lt;/p&gt;
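&lt;p&gt;Applying the same heap-plus-off-heap rule of thumb to the whole cluster gives a rough (and deliberately simplified) memory budget, which explains why 8GB allocated to Docker is comfortable while 4GB is tight:&lt;/p&gt;

```python
# Heap sizes used in this post (MB)
cassandra_heap_mb = 500
stargate_heap_mb = 256

# Rule of thumb from above: each JVM may use about as much off-heap as heap
cassandra_node_mb = 2 * cassandra_heap_mb
stargate_node_mb = 2 * stargate_heap_mb

# Three Cassandra nodes plus one Stargate node
total_mb = 3 * cassandra_node_mb + stargate_node_mb
print(total_mb)  # 3512MB for the JVMs alone, before Prometheus, Grafana, etc.
```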

&lt;h1&gt;
  
  
  &lt;strong&gt;Benchmark Environment&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Our setup for running benchmarks was the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple MacBook Pro 2019 – i7 (6 cores) – 32GB RAM – 512GB SSD&lt;/li&gt;
&lt;li&gt;Docker desktop 3.1.0&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;Kind&lt;/a&gt; 0.7.0&lt;/li&gt;
&lt;li&gt;Kubernetes 1.17.11&lt;/li&gt;
&lt;li&gt;kubectl v1.20.2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that this is a fairly powerful environment: our tests ran on a 2019 Apple MacBook Pro with a 6-core i7 CPU and 32GB of RAM.&lt;/p&gt;

&lt;p&gt;We used the &lt;a href="https://docs.k8ssandra.io/tasks/connect/ingress/kind-deployment/" rel="noopener noreferrer"&gt;Kind deployment guidelines&lt;/a&gt; found in the &lt;a href="https://docs.k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra documentation&lt;/a&gt; to start a k8s cluster with 3 worker nodes.&lt;/p&gt;

&lt;p&gt;Docker Desktop lets you tune its allocated resources: click its icon in the status bar, then go to “Preferences…”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjore8pit8civ0b37zahj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjore8pit8civ0b37zahj.png" alt="Image description" width="523" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click on “Resources” in the left menu, which will allow you to set the number of cores and the amount of RAM Docker can use overall:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbei8axefcxhm2nfys8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbei8axefcxhm2nfys8d.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Running the benchmarks&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;We used &lt;a href="https://github.com/nosqlbench/nosqlbench" rel="noopener noreferrer"&gt;NoSQLBench&lt;/a&gt; to perform moderate load benchmarks. It comes with a convenient Docker image that we could use straight away to run stress jobs in our k8s cluster.&lt;/p&gt;

&lt;p&gt;Here’s the Helm values file we used as a base for spinning up our cluster, which we’ll name &lt;code&gt;three_nodes_cluster_with_stargate.yaml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
  datacenters:
  - name: dc1
    size: 3
  ingress:
    enabled: false
stargate:
  enabled: true
  replicas: 1
  ingress:
    host:
    enabled: true
    cassandra:
      enabled: true
medusa:
  multiTenant: true
  storage: s3
  storage_properties:
    region: us-east-1
  bucketName: k8ssandra-medusa
  storageSecret: medusa-bucket-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We want Stargate to be our Cassandra gateway, and enabling Medusa requires us to set up a secret (remember, we want to run the whole stack).&lt;/p&gt;

&lt;p&gt;You’ll have to adjust the Medusa storage settings to match your environment (bucket and region), or disable Medusa entirely if you don’t have access to an S3 bucket:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;medusa:
  enabled: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In addition to AWS S3, future versions of Medusa will provide support for S3-compatible backends such as MinIO, as well as local storage configurations.&lt;/p&gt;

&lt;p&gt;We can create a secret for Medusa by applying the following yaml:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
  name: medusa-bucket-key
type: Opaque
stringData:
  # Note that this currently has to be set to medusa_s3_credentials!
  medusa_s3_credentials: |-
    [default]
    aws_access_key_id = &amp;lt;aws key&amp;gt;
    aws_secret_access_key = &amp;lt;aws secret&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You’ll notice that our Helm values lack heap settings. This was intentional: we set them when invoking &lt;code&gt;helm install&lt;/code&gt;, using different heap values for each of our tests.&lt;/p&gt;

&lt;p&gt;To fully set up our environment, we executed the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the kind cluster:&lt;code&gt;kind create cluster --config ./kind.config.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.k8ssandra.io/tasks/connect/ingress/kind-deployment/#3-create-traefik-helm-values-file" rel="noopener noreferrer"&gt;Configure and install Traefik&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Create a namespace:&lt;code&gt;kubectl create namespace k8ssandra&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;(If Medusa is enabled) Create the secret:&lt;code&gt;kubectl apply -f medusa_secret.yaml -n k8ssandra&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deploy K8ssandra with the desired heap settings:&lt;code&gt;helm repo add k8ssandra https://helm.k8ssandra.io/stable&lt;br&gt;helm repo update&lt;br&gt;helm install k8ssandra k8ssandra/k8ssandra -n k8ssandra \&lt;br&gt; -f /path/to/three_nodes_cluster_with_stargate.yaml \&lt;br&gt; --set cassandra.heap.size=500M,cassandra.heap.newGenSize=250M,stargate.heapMB=300&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ll have to wait for the &lt;code&gt;cassandradatacenter&lt;/code&gt; resource and then the Stargate pod to be ready before you can start interacting with Cassandra. This usually takes around 7 to 10 minutes.&lt;/p&gt;

&lt;p&gt;You can wait for the &lt;code&gt;cassandradatacenter&lt;/code&gt; to be ready with the following &lt;code&gt;kubectl&lt;/code&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl wait --for=condition=Ready cassandradatacenter/dc1 --timeout=900s -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then wait for Stargate to be ready:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout status deployment k8ssandra-dc1-stargate -n k8ssandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once Stargate is ready, the above command should output something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment "k8ssandra-dc1-stargate" successfully rolled out.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can execute a NoSQLBench stress run by creating a k8s &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/" rel="noopener noreferrer"&gt;job&lt;/a&gt;. You’ll need the superuser credentials so that NoSQLBench can connect to the Cassandra cluster. You can get those credentials with the following commands (requires &lt;code&gt;&lt;a href="https://stedolan.github.io/jq/" rel="noopener noreferrer"&gt;jq&lt;/a&gt;&lt;/code&gt; to be installed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECRET=$(kubectl get secret "k8ssandra-superuser" -n k8ssandra -o=jsonpath='{.data}')
echo "Username: $(jq -r '.username' &amp;lt;&amp;lt;&amp;lt; "$SECRET" | base64 -d)"
echo "Password: $(jq -r '.password' &amp;lt;&amp;lt;&amp;lt; "$SECRET" | base64 -d)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then create the NoSQLBench job which will start automatically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create job --image=nosqlbench/nosqlbench nosqlbench -n k8ssandra \
   -- java -jar nb.jar cql-iot rampup-cycles=1k cyclerate=100 \
   username=&amp;lt;superuser username&amp;gt; password=&amp;lt;superuser pass&amp;gt; \
   main-cycles=10k write_ratio=7 read_ratio=3 async=100 \
   hosts=k8ssandra-dc1-stargate-service --progress console:1s -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This runs a 10k-cycle stress test at 100 ops/s, with 70% writes and 30% reads and up to 100 in-flight async queries. Note that we’re providing the Stargate service as the contact host for NoSQLBench (the exact name will differ depending on your Helm release name).&lt;/p&gt;
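&lt;p&gt;As a back-of-the-envelope check on those parameters, the expected duration and read/write mix of the main phase can be computed directly:&lt;/p&gt;

```python
# Parameters from the NoSQLBench job above
main_cycles = 10_000   # main-cycles=10k
cyclerate = 100        # cyclerate=100 ops/s
write_ratio, read_ratio = 7, 3

duration_s = main_cycles / cyclerate
writes = main_cycles * write_ratio // (write_ratio + read_ratio)
reads = main_cycles * read_ratio // (write_ratio + read_ratio)
print(duration_s, writes, reads)  # 100.0 7000 3000: a ~100s run, 70% writes / 30% reads
```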

&lt;p&gt;While the job is running, you can tail its logs using the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs job/nosqlbench -n k8ssandra --follow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Latency metrics can be found at the end of the run, and since we’re running at a fixed rate we’ll be interested in the response time which takes coordinated omission (&lt;a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU&amp;amp;ab_channel=StrangeLoopConference" rel="noopener noreferrer"&gt;video&lt;/a&gt;, &lt;a href="http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/E4-11-107/paper_web.pdf" rel="noopener noreferrer"&gt;paper&lt;/a&gt;) into account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs job/nosqlbench -n k8ssandra | grep cqliot_default_main.cycles.responsetime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Which should output something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:41:18.924 [cqliot_default_main:008] INFO  i.n.e.c.m.PolyglotMetricRegistryBindings -
 timer added: cqliot_default_main.cycles.responsetime
12:42:58.788 [main] INFO  i.n.engine.core.ScenarioResult - type=TIMER,
 name=cqliot_default_main.cycles.responsetime, count=10000, min=1560.064, max=424771.583,
 mean=21894.6342016, stddev=45876.836258003656, median=5842.175, p75=17157.119,
 p95=100499.455, p98=187908.095, p99=263397.375, p999=384827.391, mean_rate=100.03389528501059,
 m1=101.58021531751795, m5=105.18698132587139, m15=106.3340149754869, rate_unit=events/second,
 duration_unit=microseconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As Cassandra operators, we usually focus on p99 latencies: &lt;code&gt;p99=263397.375&lt;/code&gt;. That’s 263ms at p99, which is fine considering our environment (a laptop) and our performance requirements (very low).&lt;/p&gt;
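&lt;p&gt;Since the timer reports &lt;code&gt;duration_unit=microseconds&lt;/code&gt;, converting the percentiles to milliseconds makes them easier to read; a quick sketch using the figures above:&lt;/p&gt;

```python
# Percentiles from the NoSQLBench timer above, in microseconds
percentiles_us = {"median": 5842.175, "p95": 100499.455, "p99": 263397.375}

# Convert to milliseconds for readability
percentiles_ms = {name: round(us / 1000, 1) for name, us in percentiles_us.items()}
print(percentiles_ms)  # {'median': 5.8, 'p95': 100.5, 'p99': 263.4}
```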

&lt;h1&gt;
  
  
  &lt;strong&gt;Benchmark results&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;We ran our benchmarks with the following matrix of settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cores: 4 and 8&lt;/li&gt;
&lt;li&gt;RAM: 4GB and 8GB&lt;/li&gt;
&lt;li&gt;Ops rate: 100, 500, 1000 and 1500 ops/s&lt;/li&gt;
&lt;li&gt;Cassandra Heap: 500MB&lt;/li&gt;
&lt;li&gt;Stargate Heap: 300MB and 500MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running the full stack with three Cassandra nodes, one Stargate node and 4GB allocated to Docker failed every stress test attempt, even moderate ones. However, with a single Cassandra node, the stress tests ran successfully with the full stack loaded in 4GB of RAM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvazfn3d63bodxpqvcqcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvazfn3d63bodxpqvcqcn.png" alt="Image description" width="639" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Latencies are very reasonable for all settings at a rate of 100 ops/s. Achieving higher throughput requires at least 8 cores, which allowed us to reach 1000 ops/s with 290ms p99 latencies. None of our tests reached a sustained throughput of 1500 ops/s, as shown by response times exceeding 9 seconds per operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv6yubrcbxmlkbuf4uwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv6yubrcbxmlkbuf4uwf.png" alt="Image description" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Getting the full K8ssandra experience on a laptop will require at least 4 cores and 8GB of RAM available to Docker and appropriate heap sizes for Cassandra and Stargate. If you don’t have those resources available for Docker on your development machine, you can avoid deploying features such as monitoring, Reaper and Medusa, and also reduce the number of Cassandra nodes. Using heap sizes of 500MB for Cassandra and 300MB for Stargate proved to be enough to sustain workloads between 100 and 500 operations per second, which should be sufficient for development purposes.&lt;/p&gt;

&lt;p&gt;Note that starting the whole stack takes around 7 to 10 minutes at the time of this writing on a fairly recent high end MacBook Pro, so expect your mileage to vary a bit depending on your hardware. Part of this time is spent pulling images from Docker Hub, meaning that your internet connection will play a big role in startup duration. &lt;/p&gt;

&lt;p&gt;Now you know how to configure K8ssandra for your development machine, and you’re ready to start building cloud-native apps! Visit the &lt;a href="https://docs.k8ssandra.io/tasks/" rel="noopener noreferrer"&gt;Tasks&lt;/a&gt; section of our documentation site for detailed instructions on developing against your deployed cluster. If you're looking to learn Cassandra, or learn about the performance of our managed service, we suggest heading to the &lt;a href="https://astra.dev/3mVhzmk" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; website.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
