<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeffrey Carpenter</title>
    <description>The latest articles on DEV Community by Jeffrey Carpenter (@jeffreyscarpenter).</description>
    <link>https://dev.to/jeffreyscarpenter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F262283%2Fb32d36d5-920a-4173-9690-86fb157c32b5.png</url>
      <title>DEV Community: Jeffrey Carpenter</title>
      <link>https://dev.to/jeffreyscarpenter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeffreyscarpenter"/>
    <language>en</language>
    <item>
      <title>Data Services for the Masses</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Thu, 06 Oct 2022 19:58:01 +0000</pubDate>
      <link>https://dev.to/datastax/data-services-for-the-masses-5h58</link>
      <guid>https://dev.to/datastax/data-services-for-the-masses-5h58</guid>
      <description>&lt;p&gt;I’ve held several roles in my career in IT, ranging from software developer to enterprise architect to developer advocate. I’ve always been fascinated by the role that data plays in our applications—putting it into databases, getting it back out quickly, making sure it remains accurate when transferred between systems. Many of the hardest problems I’ve encountered have centered around data. For example: &lt;/p&gt;

&lt;p&gt;Writing a cache eviction algorithm for an application that replayed hours worth of time-series radar data on a loop (ask me about the maintenance nightmare I created) &lt;/p&gt;

&lt;p&gt;Learning how to form queries in Kibana so that we could pull just the right log statements to help us debug interactions between microservices and even down to the database. &lt;/p&gt;

&lt;p&gt;One problem sits at the intersection of technology and the people building it. &lt;strong&gt;&lt;em&gt;There’s an ongoing debate over when developers should be required to access data via APIs and when they should be allowed to write their own database queries directly.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here, I’ll explore some of the solutions I’ve encountered in past projects and share why I’m excited about the &lt;a href="https://stargate.io/?ref=hackernoon.com" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt; project as a framework for solving this problem for everyone. Stargate is an open-source API gateway for data, built on top of Apache Cassandra.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8u90w1h9qarpl5gk9wg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8u90w1h9qarpl5gk9wg.jpg" alt="image" width="558" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A brief history of data access design patterns&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even back when we were writing monolithic applications, many of us committed to creating maintainable code were isolating data access and complex query logic behind object-relational mapping tools or using patterns like data access objects (DAOs). Later, when we started using service-oriented architecture (SOA), similar patterns for abstracting data access appeared.&lt;/p&gt;

&lt;p&gt;While working in the hospitality industry, I helped design a cloud-based reservation system based on a microservices architecture. Following patterns inherited from our legacy SOA system, we found ourselves creating a set of services that we called entity services. Each entity service provided access to a particular data type such as hotels, rates, inventory, or reservations. I shared this architecture at Cassandra Summit and other conferences in 2016:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rjjgaforyfamufd15gk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rjjgaforyfamufd15gk.jpg" alt="image" width="622" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We layered services that implemented business processes on top of the entity services. The shopping service composed data from the hotel, rate, and inventory services to provide hotel and room options given desired dates and travel locations. The booking service wrote records into the reservation service and decremented the available room counts in the inventory service. &lt;/p&gt;

&lt;p&gt;Each entity service was responsible for its own storage, and services were not permitted to access the storage of another service. This meant that each entity service could potentially have a different database, although, in practice, the initial entity services were all implemented on Cassandra, with a different keyspace to contain the tables used by each service.&lt;/p&gt;

&lt;p&gt;Each entity service consisted of a few simple elements: an API layer (typically REST), business logic like data validation, and code to map between the data format presented on the API (typically JSON) and database queries implemented using a driver. I built a reference implementation of an entity service called the &lt;a href="https://github.com/jeffreyscarpenter/reservation-service?ref=hackernoon.com" rel="noopener noreferrer"&gt;Reservation Service&lt;/a&gt; for my O’Reilly Cassandra &lt;a href="https://www.amazon.com/Cassandra-Definitive-Guide-Distributed-Scale/dp/1491933666?ref=hackernoon.com" rel="noopener noreferrer"&gt;book&lt;/a&gt; based on this architecture.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzrb55xfriv5dce9ngqt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzrb55xfriv5dce9ngqt.jpg" alt="image" width="586" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Drawbacks of entity services&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Over time I began to observe that these entity services followed similar patterns in their API templates, validation logic, and database logic. While using frameworks such as Java Spring certainly helped with the API layer, logging, and other concerns, I wondered if there was more we could do to eliminate a lot of similar-looking code. The database access code, in particular, was quite formulaic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feu716x69p5smndp5t098.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feu716x69p5smndp5t098.jpg" alt="image" width="654" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We encountered a performance challenge as well. Having multiple layers of microservices meant additional latency as client requests traversed business services, entity services, and the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;There are few new problems in computer science&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It turns out that my team was not the only one encountering these issues and coming up with solutions. At DataStax Accelerate 2019 (our annual Cassandra community gathering), Michael Figuière shared how Instagram introduced the concept of a Cassandra gateway into their architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazhyns73khnvjli79czz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazhyns73khnvjli79czz.jpg" alt="image" width="525" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The motivations for introducing this gateway layer were familiar to my ears: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;A desire to abstract the details of writing application queries from developers&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A desire to support increased throughput and lower latency&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A strategy for using familiar APIs to minimize impact to client applications&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In an interesting twist, the “familiar API” was Cassandra’s legacy Thrift-based API. Instagram had a large investment in clients using Thrift. Introducing the gateway promoted client reuse, while providing a translation between Thrift and Cassandra’s more modern CQL API. This layer also made it much easier to upgrade Cassandra versions.&lt;/p&gt;

&lt;p&gt;This highlights an interesting phenomenon: while the desire for an API layer to abstract data access is common across many organizations, each organization tends to have its own unique API requirements. There are often existing services with various API styles (such as REST, gRPC, Thrift) and data formats (such as JSON or protobuf). What if we could avoid the hassle of maintaining a bunch of services that are just thin wrappers around the database?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Enter Stargate&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This desire to provide a common layer for data access with multiple different API styles inspired the Stargate project. The basic idea is simple: collapse the API layer into the database. As the picture shows, Stargate provides a pluggable framework for adding different API styles on top of Cassandra-compatible databases. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveq28ruil6lhtty8uy61.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveq28ruil6lhtty8uy61.jpg" alt="image" width="619" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the time of writing, the following API plugins are supported:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RESTful APIs:&lt;/strong&gt; This plugin exposes existing tables defined via CQL in a connected Cassandra cluster and provides an endpoint for creating a new schema. Data payloads are defined as JSON objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphQL APIs:&lt;/strong&gt; This plugin exposes CQL tables as a GraphQL API. A couple of features I absolutely love about GraphQL are the ability to request a subset of the fields of a returned row of data and the ability to compose data from multiple tables in a single query. &lt;/p&gt;
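&lt;p&gt;For example, a query against a hypothetical reservations table might fetch just two fields (the table and field names here are invented for illustration; Stargate generates the actual GraphQL schema from your CQL tables):&lt;/p&gt;

```graphql
# Fetch only the fields we need, filtered by partition key
query {
  reservations(value: { hotelId: "NY456" }) {
    values {
      confirmNumber
      startDate
    }
  }
}
```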

&lt;p&gt;&lt;strong&gt;Document API:&lt;/strong&gt; This plugin is what I point to when people ask if Cassandra is “schemaless” like MongoDB. Traditionally the answer has been “no”—Cassandra requires a schema defined by CQL. However, the Document API changes all this, enabling you to throw arbitrary JSON documents at Stargate, which stores them and then lets you query documents or sub-documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CQL API:&lt;/strong&gt; This plugin supports Cassandra’s native query language. You might wonder why you would use this instead of one of the other APIs or just accessing a Cassandra cluster directly. The main reason, in my opinion, is a cool pattern I want to show you now.&lt;/p&gt;
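&lt;p&gt;As a rough sketch of how the first two of these plugins are used (the endpoint paths and auth header follow the Stargate documentation at the time of writing, but verify them against your version; the token, names, and port are placeholders), note that the same JSON payload idea drives both the REST and Document APIs:&lt;/p&gt;

```python
# Sketch: constructing requests against Stargate's HTTP APIs. We only
# build the requests here; sending one requires a running Stargate node
# (urllib.request.urlopen(req) would submit it).
import json
import urllib.request

token = "replace-with-auth-token"   # obtained from Stargate's auth API (placeholder)
base = "http://localhost:8082"      # default REST/Document API port

# REST API: insert a row into an existing CQL table
row = {"confirm_number": "RSV123", "hotel_id": "NY456"}
rest_req = urllib.request.Request(
    f"{base}/v2/keyspaces/reservation/reservations",
    data=json.dumps(row).encode(),
    headers={"X-Cassandra-Token": token, "Content-Type": "application/json"},
    method="POST",
)

# Document API: store an arbitrary JSON document, no schema required
doc = {"guest": {"name": "Ada"}, "nights": 3}
doc_req = urllib.request.Request(
    f"{base}/v2/namespaces/reservation/collections/bookings",
    data=json.dumps(doc).encode(),
    headers={"X-Cassandra-Token": token, "Content-Type": "application/json"},
    method="POST",
)
```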

&lt;h2&gt;
  
  
  &lt;strong&gt;The hidden benefit of an old idea&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;An interesting aspect of Stargate’s architecture is that Stargate nodes are actually Cassandra nodes. They participate in Cassandra’s distributed architecture as nodes that respond to client queries but don’t actually store any data, delegating storage and retrieval to regular Cassandra nodes. This enables a flexible scaling approach that wasn’t possible previously. Now you can scale the number of Stargate nodes to handle your query volume and the number of Cassandra nodes to handle your storage volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2apgw5uczme54efe6vvy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2apgw5uczme54efe6vvy.jpg" alt="image" width="632" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As it turns out, the idea of nodes that participate in a Cassandra cluster but don’t store data is not a new one. For example, longtime community member Eric Lubow introduced a similar concept called “coordinator nodes” (also known as “proxy nodes”) in his talk at Cassandra Summit 2016, based on his work at SimpleReach:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvgvi7x7w453y5ttqcly.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvgvi7x7w453y5ttqcly.jpg" alt="image" width="528" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the figure above, the coordinator nodes can use different instance types than other “data nodes” in the cluster to support optimal use of resources and save on cloud computing costs. This is a benefit that Stargate provides as well, and one that is easily realized when deploying Stargate on Kubernetes as part of the &lt;a href="https://k8ssandra.io/?ref=hackernoon.com" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt; project, just by changing a few values in a YAML config file to specify a different instance type.&lt;/p&gt;
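&lt;p&gt;As a sketch, a K8ssandra Helm values file along these lines scales Stargate and Cassandra nodes independently and pins Stargate to a different instance type (the key names are illustrative and vary by chart version, and the nodeSelector placement is my own example; check the K8ssandra docs for your release):&lt;/p&gt;

```yaml
# Illustrative K8ssandra Helm values; verify key names for your chart version
cassandra:
  datacenters:
    - name: dc1
      size: 6          # storage-oriented Cassandra nodes
stargate:
  enabled: true
  replicas: 3          # API-oriented Stargate nodes, scaled for query volume
  # Example scheduling hint: place Stargate pods on compute-optimized instances
  nodeSelector:
    node.kubernetes.io/instance-type: c5.2xlarge
```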

&lt;h2&gt;
  
  
  &lt;strong&gt;More exciting possibilities ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Because of Stargate’s pluggable architecture, you can extend it for your own API needs and define additional APIs or tailor one of the existing open-source connectors to match your enterprise API standards. The roadmap includes plugins for gRPC as well as streaming interfaces such as Pulsar and Kafka.&lt;/p&gt;

&lt;p&gt;I’m also excited about the possibilities when the Stargate and K8ssandra open source projects are used together. The goal is to provide a production-ready, Cassandra-based data layer that you can install in any Kubernetes environment in minutes, so you can focus on coding your apps. If you'd like to experiment with Cassandra quickly without running Kubernetes, try the managed &lt;a href="https://astra.dev/3rSxXGl" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, which is built on Apache Cassandra.&lt;/p&gt;

&lt;p&gt;Beyond these two projects, there’s another community of people from different organizations who have come together to dream up the future of cloud-based data infrastructure: the &lt;a href="https://dok.community/?ref=hackernoon.com" rel="noopener noreferrer"&gt;Data on Kubernetes community&lt;/a&gt;. (In fact, this article is based on &lt;a href="https://www.youtube.com/watch?v=PMZ-T3TgDCE&amp;amp;ref=hackernoon.com" rel="noopener noreferrer"&gt;my talk&lt;/a&gt; at the Data on Kubernetes Community Day at KubeCon EU 2021.) We’d love to work with you in any or all of these communities!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Kubernetes Is The Best Technology For Running A Cloud-Native Database</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Tue, 20 Sep 2022 17:03:29 +0000</pubDate>
      <link>https://dev.to/datastax/why-kubernetes-is-the-best-technology-for-running-a-cloud-native-database-4ie9</link>
      <guid>https://dev.to/datastax/why-kubernetes-is-the-best-technology-for-running-a-cloud-native-database-4ie9</guid>
      <description>&lt;p&gt;We’ve been talking about migrating workloads to the cloud for a long time, but a look at the application portfolios of many IT organizations demonstrates that there’s still a lot of work to be done. In many cases, challenges with persisting and moving data in clouds continue to be the key limiting factor slowing cloud adoption, despite the fact that databases in the cloud have been available for years.&lt;/p&gt;

&lt;p&gt;For this reason, there has been a surge of recent interest in data infrastructure that is designed to take maximum advantage of the benefits that cloud computing provides. A &lt;a href="https://k8ssandra.io/blog/2021/03/23/the-search-for-a-cloud-native-database/?ref=hackernoon.com" rel="noopener noreferrer"&gt;cloud-native database&lt;/a&gt; is one that achieves the goals of scalability, elasticity, resiliency, observability, and automation; the &lt;a href="https://k8ssandra.io/?utm_medium=referral&amp;amp;utm_source=hackernoon&amp;amp;utm_campaign=k8ssandra&amp;amp;ref=hackernoon.com" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt; project is a great example. It packages Apache Cassandra and supporting tools into a production-ready Kubernetes deployment. &lt;/p&gt;

&lt;p&gt;This raises an interesting question: must a database run on Kubernetes to be considered cloud-native? While Kubernetes was originally designed for stateless workloads, recent improvements in Kubernetes such as StatefulSets and persistent volumes have made it possible to run stateful workloads as well. Even longtime DevOps practitioners &lt;a href="https://thenewstack.io/a-case-for-databases-on-kubernetes-from-a-former-skeptic/?ref=hackernoon.com" rel="noopener noreferrer"&gt;skeptical of running databases on Kubernetes&lt;/a&gt; are beginning to come around, and &lt;a href="https://cloud.google.com/blog/products/databases/to-run-or-not-to-run-a-database-on-kubernetes-what-to-consider?ref=hackernoon.com" rel="noopener noreferrer"&gt;best practices are starting to emerge&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;But of course grudging acceptance of running databases on Kubernetes is not our goal. If we’re not pushing for greater &lt;a href="https://containerjournal.com/topics/a-maturity-model-for-cloud-native-databases/?ref=hackernoon.com" rel="noopener noreferrer"&gt;maturity in cloud-native databases&lt;/a&gt;, we’re missing a big opportunity. To make databases the most “cloud-native” they can be, we need to embrace everything that Kubernetes has to offer. A truly cloud-native approach means adopting key elements of the Kubernetes design paradigm. A cloud-native database must be one that can run effectively on Kubernetes. Let’s explore a few Kubernetes design principles that point the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 1: Leverage compute, network, and storage as commodity APIs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the keys to the success of cloud computing is the commoditization of compute, networking, and storage as resources we can provision via simple APIs. Consider this sampling of AWS services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Compute: we allocate virtual machines through EC2 and Auto Scaling groups (ASGs)&lt;/li&gt;
&lt;li&gt;  Network: we manage traffic using Elastic Load Balancers (ELB), Route 53, and VPC peering &lt;/li&gt;
&lt;li&gt;  Storage: we persist data using options such as the Simple Storage Service (S3) for long-term object storage, or Elastic Block Storage (EBS) volumes for our compute instances. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes offers its own APIs to provide similar services for a world of containerized applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Compute: pods, deployments, and replica sets manage the scheduling and life cycle of containers on computing hardware&lt;/li&gt;
&lt;li&gt;  Network: services and ingress expose a container’s networked interfaces&lt;/li&gt;
&lt;li&gt;  Storage: persistent volumes and stateful sets enable flexible association of containers to storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes resources promote portability of applications across Kubernetes distributions and service providers. What does this mean for databases? They are simply applications that leverage compute, networking, and storage resources to provide the services of data persistence and retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Compute: a database needs sufficient processing power to process incoming data and queries. Each database node is deployed as a pod and grouped in StatefulSets, enabling Kubernetes to manage scaling out and scaling in.&lt;/li&gt;
&lt;li&gt;  Network: a database needs to expose interfaces for data and control. We can use Kubernetes Services and Ingress Controllers to expose these interfaces.&lt;/li&gt;
&lt;li&gt;  Storage: a database uses persistent volumes of a specified storage class to store and retrieve data.&lt;/li&gt;
&lt;/ul&gt;
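
&lt;p&gt;Putting those three mappings together, a deliberately simplified deployment of a database might look like the sketch below; in practice a Cassandra cluster would be managed by an operator rather than written by hand (more on that under Principle 2):&lt;/p&gt;

```yaml
# Simplified sketch: a database as compute (StatefulSet), network
# (headless Service), and storage (volumeClaimTemplates) resources.
apiVersion: v1
kind: Service
metadata:
  name: cassandra
spec:
  clusterIP: None            # headless: gives each pod a stable DNS name
  selector:
    app: cassandra
  ports:
    - port: 9042             # CQL port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.0
          ports:
            - containerPort: 9042
  volumeClaimTemplates:      # one persistent volume per database node
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```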

&lt;p&gt;Thinking of databases in terms of their compute, network, and storage needs removes much of the complexity involved in deployment on Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 2: Separate the control and data planes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes promotes the separation of control and data planes. The Kubernetes API server is the key control plane interface used to request computing resources, while controllers manage the details of mapping those requests onto an underlying IaaS platform.&lt;/p&gt;

&lt;p&gt;We can apply this same pattern to databases. For example, Cassandra’s data plane consists of the port exposed by each node for clients to access Cassandra Query Language (CQL) and the port used for internode communication. The control plane includes the Java Management Extensions (JMX) interface provided by each Cassandra node. Although JMX is a standard that’s showing its age and has had some security vulnerabilities, it's a relatively simple task to take a more cloud-native approach. In K8ssandra, Cassandra is deployed in a custom container image that adds a RESTful Management API, bypassing the JMX interface.&lt;/p&gt;

&lt;p&gt;The remainder of the control plane consists of logic that leverages the Management API to manage Cassandra nodes. This is implemented via the Kubernetes operator pattern. Operators define custom resources and provide control loops that observe the state of those resources and take actions to move them toward a desired state, helping extend Kubernetes with domain-specific logic.&lt;/p&gt;

&lt;p&gt;The K8ssandra project uses &lt;a href="https://github.com/datastax/cass-operator?ref=hackernoon.com" rel="noopener noreferrer"&gt;cass-operator&lt;/a&gt; to automate Cassandra operations. Cass-operator defines a “CassandraDatacenter” custom resource definition (CRD); each CassandraDatacenter resource represents a top-level failure domain of a Cassandra cluster. This builds a higher-level abstraction on top of StatefulSets and persistent volumes.&lt;/p&gt;
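
&lt;p&gt;A CassandraDatacenter manifest looks roughly like this (field names follow the cass-operator documentation at the time of writing; treat the specific values as placeholders):&lt;/p&gt;

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1      # datacenters sharing a clusterName form one cluster
  serverType: cassandra
  serverVersion: "4.0.1"
  size: 3                    # desired node count; the operator reconciles toward it
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
```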

&lt;p&gt;A sample K8ssandra deployment including Apache Cassandra and cass-operator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau270n756te04k3wz3g9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau270n756te04k3wz3g9.jpg" alt="Alt Text" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 3: Make observability easy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The three pillars of observable systems are logging, metrics, and tracing. Kubernetes provides a great starting point by exposing the logs of each container to third-party log aggregation solutions. Metrics and tracing require a bit more effort to implement, but there are multiple solutions available.&lt;/p&gt;

&lt;p&gt;The K8ssandra project supports metrics collection using the kube-prometheus-stack. The Metrics Collector for Apache Cassandra (MCAC) is deployed as an agent on each Cassandra node, providing a dedicated metrics endpoint. A ServiceMonitor from the kube-prometheus-stack pulls metrics from each agent and stores them in Prometheus for use by Grafana or other visualization and analysis tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 4: Make the default configuration secure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes networking is secure by default: ports must be explicitly exposed in order to be accessible from outside a pod. This sets a useful precedent for database deployment, forcing us to think carefully about how each control plane and data plane interface will be exposed, and which interfaces should be exposed via a Kubernetes Service.&lt;/p&gt;

&lt;p&gt;In K8ssandra, CQL access is exposed as a service for each CassandraDatacenter resource, while the management and metrics APIs on individual Cassandra nodes are accessed by cass-operator and the Prometheus ServiceMonitor, respectively.&lt;/p&gt;

&lt;p&gt;Kubernetes also provides facilities for secret management, including sharing encryption keys and configuring administrative accounts. K8ssandra deployments replace Cassandra’s default administrator account with a new administrator username and password.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 5: Prefer declarative configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Kubernetes declarative approach, you specify the desired state of resources, and controllers manipulate the underlying infrastructure to achieve that state. Cass-operator lets you specify the desired number of nodes in a cluster; it manages the details of placing new nodes when scaling up and selecting which nodes to remove when scaling down.&lt;/p&gt;

&lt;p&gt;The next generation of operators should enable us to specify rules for stored data size, number of transactions per second, or both. Perhaps we’ll be able to specify maximum and minimum cluster sizes, and when to move less frequently used data to object storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best designs draw on the wisdom of the community&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hopefully I’ve convinced you that Kubernetes is a great source of best practices for cloud-native database implementations, and the innovation continues. Solutions for federating Kubernetes clusters are still maturing, but will soon make it much simpler to manage multi-data center Cassandra clusters in Kubernetes. In the Cassandra community, we can work to make extensions for management and metrics a part of the core Apache project so that Cassandra is more naturally cloud-native for everyone, right out of the box.&lt;/p&gt;

&lt;p&gt;If you’re excited at the prospect of cloud-native databases on Kubernetes, you’re not alone. A group of like-minded individuals and organizations has assembled as the &lt;a href="https://dok.community/?ref=hackernoon.com" rel="noopener noreferrer"&gt;Data on Kubernetes Community&lt;/a&gt;, which has hosted over 50 meetups in multiple languages since its inception last year. We’re grateful to MayaData for helping to start this community, and are excited to announce that DataStax has joined as a co-sponsor of the DoKC.&lt;/p&gt;

&lt;p&gt;In more great news, the DoKC was &lt;a href="https://community.cncf.io/data-on-kubernetes/?ref=hackernoon.com" rel="noopener noreferrer"&gt;accepted as an official CNCF community group&lt;/a&gt;, and hosted the first ever &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/colocated-events/?ref=hackernoon.com#data-on-kubernetes-day" rel="noopener noreferrer"&gt;Data on Kubernetes Day&lt;/a&gt; as part of &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/?ref=hackernoon.com" rel="noopener noreferrer"&gt;KubeCon/CloudNativeCon Europe&lt;/a&gt; on May 3. Rick Vasquez’s &lt;a href="https://www.youtube.com/watch?v=gpz7mgxK5Ww&amp;amp;ref=hackernoon.com" rel="noopener noreferrer"&gt;talk&lt;/a&gt;, “A Call for DBMS to Modernize on Kubernetes,” lays down a challenge to make the architectural changes required to become truly cloud-native. Together, we’ll arrive at the best solutions through collaboration in open source communities like Kubernetes, Data on Kubernetes, Apache Cassandra, and K8ssandra. Let’s lead with code and keep talking! If you'd like to experiment with Cassandra quickly without running Kubernetes, try the managed &lt;a href="https://astra.dev/3RZmeRH" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, which is built on Apache Cassandra.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How the world caught up with Apache Cassandra</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Thu, 15 Sep 2022 16:50:42 +0000</pubDate>
      <link>https://dev.to/datastax/how-the-world-caught-up-with-apache-cassandra-4cjb</link>
      <guid>https://dev.to/datastax/how-the-world-caught-up-with-apache-cassandra-4cjb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptljp57tbtdg8klmkcud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptljp57tbtdg8klmkcud.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
The O’Reilly book, &lt;em&gt;Cassandra: The Definitive Guide,&lt;/em&gt; features a quote from Ray Kurzweil, the noted inventor and futurist: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“An invention has to make sense in the world in which it is finished, not the world in which it is started.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This quote has a prophetic ring to it, especially considering my co-author Eben Hewitt included it in the 2010 first edition of this book we wrote, back when Apache Cassandra, the open-source, distributed, and highly scalable NoSQL database, was just on its 0.7 release. &lt;/p&gt;

&lt;p&gt;In those days, other NoSQL databases were appearing on the scene as part of platforms with worldwide scale from companies like Amazon, YouTube, and Facebook. With many competing database projects and a slowly emerging response from relational database vendors, the future of this emerging landscape wasn’t yet clear, and Hewitt qualified his assessment with this summary: “In a world now working at web scale and looking to the future, Apache Cassandra &lt;em&gt;might be one part&lt;/em&gt; of the answer.” (emphasis added)&lt;/p&gt;

&lt;p&gt;While many of those databases from the NoSQL revolution and the &lt;a href="https://en.wikipedia.org/wiki/NewSQL" rel="noopener noreferrer"&gt;NewSQL&lt;/a&gt; counter-revolution have now faded into history, Cassandra has stood the test of time, maturing into a rock-solid database that arguably still scales with performance and reliability better than any other. &lt;/p&gt;

&lt;p&gt;Twelve-plus years after its invention, Cassandra is now used by approximately 90 percent of the Fortune 100, and its appeal is broadening quickly, driven by a rush to harness today’s “data deluge” with apps that are globally distributed and always-on. Add to this recent advances in the Cassandra ecosystem such as &lt;a href="https://stargate.io/" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt;, &lt;a href="https://k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt;, and cloud services like &lt;a href="https://astra.dev/3qGHoI5" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt;, and the cost and complexity barriers to using Cassandra are fading into the past. So while it’s fair to say that Cassandra might have been ahead of its time in 2007, it’s primed and ready for the data demands of the 2020s and beyond.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Cassandra grows up fast&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Cassandra made a lot of sense to its inventors at Facebook when they developed it in 2007 to store and access reams of data for Messenger, which was growing insanely fast. From the start, Cassandra scaled quickly and served huge amounts of data within strict SLAs, something that relational databases and SQL, long the standard tools for accessing and manipulating data, couldn’t do. As it became clear that this technology was suitable for other use cases, Facebook handed Cassandra to the Apache Software Foundation, where it became an open source project (it was voted into a top-level project in 2010).&lt;/p&gt;

&lt;p&gt;The reliability and failover capabilities offered by Cassandra quickly won over some rising web stars. Netflix launched its streaming service in 2007, using an Oracle database in a single data center. As the company’s subscribers, the devices they binge-watched on, and the data those devices generated all grew rapidly, the limitations on scalability and the potential for failures became a serious threat to Netflix’s success. At the time, Netflix’s then-cloud architect Adrian Cockcroft said he viewed the single data center that housed Netflix’s backend as a single point of failure. Cassandra, with its distributed architecture, was a natural choice; by 2013, most of Netflix’s data was housed there, and Netflix still uses Cassandra today.&lt;/p&gt;

&lt;p&gt;Cassandra survived its adolescent years by retaining its position as the database that scales more reliably than anything else, with a continual pursuit of operational simplicity at scale. It demonstrated its value even further by integrating with a broader data infrastructure stack of open source components, including the analytics engine &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, stream-processing platform &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;, and others.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The Cassandra constellation&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Cassandra hit a major milestone this month, with the release of 4.0. The members of the Cassandra community pledged to do something that’s unusual for a dot-zero release: make 4.0 so stable that major users would run it in production from the get-go. But the real headline is the overall growth of the Cassandra ecosystem, measured by changes both within the project and in related projects, and by improvements in how Cassandra plays within your infrastructure. &lt;/p&gt;

&lt;p&gt;A host of complementary open-source technologies have sprung up around Cassandra to make it easier for developers to build apps with it. &lt;a href="https://stargate.io/" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt;, for example, is an open source data gateway that provides a pluggable API layer that greatly simplifies developer interaction with any Cassandra database. REST, GraphQL, Document, and gRPC APIs make it easy to just start coding with Cassandra without having to learn the complexities of CQL and Cassandra data modeling.&lt;/p&gt;
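To make this concrete, here is a minimal sketch of what talking to Cassandra through Stargate’s Document API can look like. The endpoint path and the `X-Cassandra-Token` header follow Stargate’s documented conventions, but the URL, token, namespace, and collection names below are placeholders for illustration; the request is built without being sent.

```python
import json
import urllib.request

# Hypothetical deployment details -- replace with your own Stargate endpoint and auth token.
STARGATE_URL = "http://localhost:8082"
AUTH_TOKEN = "replace-with-auth-token"

def build_document_request(namespace, collection, document):
    """Build (but do not send) a Stargate Document API request that stores a JSON document."""
    url = f"{STARGATE_URL}/v2/namespaces/{namespace}/collections/{collection}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={
            "X-Cassandra-Token": AUTH_TOKEN,  # Stargate's auth header
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_document_request("library", "books", {"title": "Cassandra: The Definitive Guide"})
print(req.full_url)
```

The point of the sketch is how little Cassandra-specific knowledge it requires: no CQL, no data modeling, just a JSON document and an HTTP request.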

&lt;p&gt;&lt;a href="https://k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt; is another open source project that demonstrates this approachability, making it possible to deploy Cassandra on any Kubernetes engine, from the public cloud providers to VMWare and OpenStack. K8ssandra extends the Kubernetes promise of application portability to the data tier, providing yet another weapon against vendor-lock in.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;What if data wasn’t a problem?&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;There’s a question that Hewitt poses in &lt;em&gt;&lt;a href="https://www.datastax.com/resources/ebook/oreilly-cassandra-definitive-guide" rel="noopener noreferrer"&gt;Cassandra: The Definitive Guide&lt;/a&gt;:&lt;/em&gt; “What kind of things would I do with data if it wasn’t a problem?”&lt;/p&gt;

&lt;p&gt;Netflix asked this question—and ran with the answer—almost a decade ago. The $25-billion company is a paragon of the kind of success that can be built with the right tools and the right strategy at the right time. But today, for a broad spectrum of companies that want to achieve business success, data also can’t be a “problem.” &lt;/p&gt;

&lt;p&gt;Think of the modern applications and workloads that should never go down, like online banking services, or those that operate at huge, distributed scale, such as airline booking systems or popular retail apps. Cassandra’s seamless and consistent ability to scale to hundreds of terabytes, along with its exceptional performance under heavy loads, has made it a key part of the data infrastructures of companies that operate these kinds of applications.&lt;/p&gt;

&lt;p&gt;Across industries, companies have staked their business on the reliability and scalability of Cassandra. Best Buy, the world’s largest multichannel consumer electronics retailer, refers to Cassandra as “flawless” in how it handles massive spikes in holiday purchasing traffic. Bloomberg News has relied on Cassandra since 2016 because it’s easy to use, easy to scale, and always available; the financial news service serves 20 billion requests per day on nearly a petabyte of data (that’s the rough equivalent of over 4,000 digital pictures a day—for every day of an average person’s life). &lt;/p&gt;

&lt;p&gt;But Cassandra isn’t just for big, established sector leaders like Best Buy or Bloomberg. &lt;a href="https://www.datastax.com/enterprise-success/ankeri" rel="noopener noreferrer"&gt;Ankeri&lt;/a&gt;, an Icelandic startup that operates a platform to help cargo shipping operators manage real-time vessel data, chose Cassandra—delivered through DataStax’s &lt;a href="https://astra.dev/3qGHoI5" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt;—in part because of its ability to scale as the company gathers an increasing amount of data from a growing number of ships. It wanted a data platform that wouldn’t make data a problem, and wouldn’t get in the way of its success.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Making Cassandra simpler and more cost-effective&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;A handful of organizations have built services around Cassandra, in an effort to make it more accessible, and to solve some of the inherent challenges that come with operating a robust database.&lt;/p&gt;

&lt;p&gt;One particularly hard nut to crack when it comes to managing databases has been provisioning. With cloud computing services (think AWS Lambda), scaling, capacity planning, and cost management are all automated, resulting in software that’s easy to maintain and cost-effective: “serverless,” in other words. But because modern databases store data by partitioning it across nodes of a database cluster, they’ve proved challenging to make serverless. Doing so requires rebalancing data across nodes when more are added, in order to balance storage and computing capabilities. &lt;/p&gt;
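The rebalancing problem can be illustrated with a toy consistent-hash ring. This is a deliberate simplification: real Cassandra uses virtual nodes, replication, and a different partitioner, and the hash function here is purely a stand-in.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Map a value onto a fixed ring of 2**32 tokens (toy stand-in for a real partitioner)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

class ToyRing:
    """Minimal consistent-hash ring: each key is owned by the next node clockwise."""
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        tokens = [t for t, _ in self.ring]
        idx = bisect.bisect_right(tokens, token(key)) % len(self.ring)
        return self.ring[idx][1]

keys = [f"user-{i}" for i in range(1000)]
before = ToyRing(["node-a", "node-b", "node-c"])
after = ToyRing(["node-a", "node-b", "node-c", "node-d"])

# Only the keys whose owner changed have to move when a node joins --
# this is the data movement a serverless database service must automate.
moved = sum(1 for k in keys if before.owner(k) != after.owner(k))
print(f"{moved} of {len(keys)} keys moved")
```

Every key that moves lands on the new node, so growing the cluster only relocates a fraction of the data; automating that movement (and its inverse when scaling in) is what makes pay-as-you-go elasticity possible.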

&lt;p&gt;Because of this, enterprises have been required to guess what their peak usage will be—and pay for that level, even if they aren’t using that capacity. That’s why it was a big deal when DataStax announced earlier this year that its Astra DB cloud database built on Cassandra is &lt;a href="https://www.datastax.com/blog/2021/02/datastax-serverless-what-we-did-and-why-its-game-changer" rel="noopener noreferrer"&gt;available&lt;/a&gt; as a serverless, pay-as-you-go service. According to &lt;a href="https://www.datastax.com/gigaom-tco" rel="noopener noreferrer"&gt;recent research&lt;/a&gt; by analyst firm GigaOm, the serverless Astra DB can deliver significant cost savings. And developers will only pay for what they use, no matter how many database clusters they create and deploy. &lt;/p&gt;

&lt;p&gt;Carl Olofson, research vice president at IDC, noted: “A core benefit of the cloud is dynamic scalability, but this has been more difficult to achieve for storage than with compute. By decoupling compute from storage, DataStax’s Astra DB service lets users take advantage of the innate elasticity of the cloud for data, with a cloud agnostic database.”&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;A database for today&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;While Cassandra is more than a decade young, it is a database for today.  If the argument of 2010 was “Cassandra may be the future,” and 2017 “Cassandra is mature,” the 2021 version is “Cassandra is an essential part of any modern data platform.” The developments in Cassandra and its surrounding ecosystem point to a coming wave of new developers and enterprises worldwide for whom Cassandra is not just a sensible choice, but an obvious one.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Want to learn more about DataStax Astra DB, built on Apache Cassandra? &lt;a href="https://www.datastax.com/products/astra/demo" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; for a free demo.&lt;/strong&gt;
&lt;/h2&gt;

</description>
    </item>
    <item>
      <title>Why a Cloud-Native Database Must Run on K8s</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Tue, 19 Jul 2022 17:44:05 +0000</pubDate>
      <link>https://dev.to/datastax/why-a-cloud-native-database-must-run-on-k8s-1ped</link>
      <guid>https://dev.to/datastax/why-a-cloud-native-database-must-run-on-k8s-1ped</guid>
      <description>&lt;p&gt;We’ve been talking about migrating workloads to the cloud for a long time, but a look at the application portfolios of many IT organizations demonstrates that there’s still a lot of work to be done. In many cases, challenges with persisting and moving data in clouds continue to be the key limiting factor slowing cloud adoption, despite the fact that databases in the cloud have been available for years.&lt;/p&gt;

&lt;p&gt;For this reason, there has been a surge of recent interest in data infrastructure that is designed to take maximum advantage of the benefits that cloud computing provides. A &lt;a href="https://k8ssandra.io/blog/2021/03/23/the-search-for-a-cloud-native-database/" rel="noopener noreferrer"&gt;cloud-native database&lt;/a&gt; is one that achieves the goals of scalability, elasticity, resiliency, observability and automation; the &lt;a href="https://k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt; project is a great example. It packages Apache &lt;a href="https://containerjournal.com/?s=Cassandra" rel="noopener noreferrer"&gt;Cassandra&lt;/a&gt; and supporting tools into a production-ready Kubernetes deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Databases on Kubernetes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This raises an interesting question: must a database run on Kubernetes to be considered cloud-native? While Kubernetes was originally designed for stateless workloads, recent improvements in Kubernetes – such as StatefulSets and persistent volumes –  have made it possible to run stateful workloads, as well. Even long-time DevOps practitioners &lt;a href="https://thenewstack.io/a-case-for-databases-on-kubernetes-from-a-former-skeptic/" rel="noopener noreferrer"&gt;skeptical of running databases&lt;/a&gt; on Kubernetes are beginning to come around, and &lt;a href="https://cloud.google.com/blog/products/databases/to-run-or-not-to-run-a-database-on-kubernetes-what-to-consider" rel="noopener noreferrer"&gt;best practices&lt;/a&gt; are starting to emerge.&lt;/p&gt;

&lt;p&gt;But, of course, grudging acceptance of running databases on Kubernetes is not our goal. If we’re not pushing for greater &lt;a href="https://containerjournal.com/topics/a-maturity-model-for-cloud-native-databases/" rel="noopener noreferrer"&gt;maturity in cloud-native databases&lt;/a&gt;, we’re missing a big opportunity. To make databases the most “cloud-native” they can be, we need to embrace everything that Kubernetes has to offer. A truly cloud-native approach means adopting key elements of the Kubernetes design paradigm. A cloud-native database must be one that can run effectively on Kubernetes. Let’s explore a few Kubernetes design principles that point the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Principle One: Leverage Compute, Network and Storage as Commodity APIs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the keys to the success of cloud computing is the commoditization of compute, networking and storage as resources we can provision via simple APIs. Consider this sampling of AWS services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Compute: we allocate virtual machines through EC2 and Autoscaling Groups (ASGs)&lt;/li&gt;
&lt;li&gt;  Network: we manage traffic using Elastic Load Balancers (ELB), Route 53, and VPC peering&lt;/li&gt;
&lt;li&gt;  Storage: we persist data using options such as the Simple Storage Service (S3) for long-term object storage, or Elastic Block Storage (EBS) volumes for our compute instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes offers its own APIs to provide similar services for a world of containerized applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Compute: pods, deployments, and replica sets manage the scheduling and life cycle of containers on computing hardware&lt;/li&gt;
&lt;li&gt;  Network: services and ingress expose a container’s networked interfaces&lt;/li&gt;
&lt;li&gt;  Storage: persistent volumes and stateful sets enable flexible association of containers to storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes resources promote portability of applications across Kubernetes distributions and service providers. What does this mean for databases? They are simply applications that leverage compute, networking and storage resources to provide the services of data persistence and retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Compute: a database needs sufficient processing power to process incoming data and queries. Each database node is deployed as a pod and grouped in StatefulSets, enabling Kubernetes to manage scaling out and scaling in.&lt;/li&gt;
&lt;li&gt;  Network: a database needs to expose interfaces for data and control. We can use Kubernetes Services and Ingress Controllers to expose these interfaces.&lt;/li&gt;
&lt;li&gt;  Storage: a database uses persistent volumes of a specified storage class to store and retrieve data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thinking of databases in terms of their compute, network and storage needs removes much of the complexity involved in deployment on Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Principle Two: Separate the Control and Data Planes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kubernetes promotes the separation of control and data planes. The Kubernetes API server is the key control plane interface used to request computing resources, while its controllers manage the details of mapping those requests onto an underlying IaaS platform.&lt;/p&gt;

&lt;p&gt;We can apply this same pattern to databases. For example, Cassandra’s data plane consists of the port exposed by each node for clients to access Cassandra Query Language (CQL) and the port used for internode communication. The control plane includes the Java Management Extensions (JMX) interface provided by each Cassandra node. Although JMX is a standard that’s showing its age and has had some security vulnerabilities, it’s a relatively simple task to take a more cloud-native approach. In K8ssandra, Cassandra is deployed in a custom container image that adds a RESTful Management API, bypassing the JMX interface.&lt;/p&gt;

&lt;p&gt;The remainder of the control plane consists of logic that leverages the management API to manage Cassandra nodes. This is implemented via the Kubernetes operator pattern. Operators define custom resources and provide control loops that observe the state of those resources and take actions to move them toward a desired state, helping extend Kubernetes with domain-specific logic.&lt;/p&gt;
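The operator pattern’s control loop can be sketched in a few lines. This is an illustration only: the `CassandraDatacenterSpec` class and node names here are toy stand-ins, not cass-operator’s actual CRD schema, and real reconciliation handles far more than node count.

```python
from dataclasses import dataclass

@dataclass
class CassandraDatacenterSpec:
    """Desired state, as a user would declare it in a custom resource (toy schema)."""
    size: int  # desired number of Cassandra nodes

def reconcile(spec, observed_nodes):
    """One pass of an operator-style control loop: compare observed state to
    desired state and return the actions needed to converge toward it."""
    actions = []
    if len(observed_nodes) < spec.size:
        for i in range(len(observed_nodes), spec.size):
            actions.append(f"scale-up: add node dc1-{i}")
    elif len(observed_nodes) > spec.size:
        for node in observed_nodes[spec.size:]:
            actions.append(f"scale-down: decommission {node}")
    return actions  # an empty list means observed state matches desired state

print(reconcile(CassandraDatacenterSpec(size=3), ["dc1-0"]))
```

A real operator runs this loop continuously against the Kubernetes API, so the cluster converges on the declared state even after failures, rather than executing a one-shot script.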

&lt;p&gt;The K8ssandra project uses &lt;a href="https://github.com/datastax/cass-operator" rel="noopener noreferrer"&gt;cass-operator&lt;/a&gt; to automate Cassandra operations. Cass-operator defines a “CassandraDatacenter” custom resource, via a custom resource definition (CRD), to represent each top-level failure domain of a Cassandra cluster. This builds a higher-level abstraction on top of StatefulSets and PersistentVolumes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymy3h26bb7pk5v6tp3vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymy3h26bb7pk5v6tp3vy.png" alt="Alt Text" width="637" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Principle Three: Make Observability Easy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The three pillars of observable systems are logging, metrics and tracing. Kubernetes provides a great starting point by exposing the logs of each container to third-party log aggregation solutions. Metrics and tracing require a bit more effort to implement, but there are multiple solutions available.&lt;/p&gt;

&lt;p&gt;The K8ssandra project supports metrics collection using the kube-prometheus-stack. The Metrics Collector for Apache Cassandra (MCAC) is deployed as an agent on each Cassandra node, providing a dedicated metrics endpoint. A ServiceMonitor from the kube-prometheus-stack pulls metrics from each agent and stores them in Prometheus for use by Grafana or other visualization and analysis tools.&lt;/p&gt;
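As a rough illustration of that wiring, a ServiceMonitor selecting the MCAC metrics endpoints might look like the following. The `monitoring.coreos.com/v1` API group comes with the kube-prometheus-stack, but the labels and port name shown here are assumptions for this sketch, not K8ssandra’s actual values.

```python
# Hypothetical ServiceMonitor, shown as the Python structure behind the YAML.
# Labels and the port name are illustrative placeholders.
service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "cassandra-metrics"},
    "spec": {
        "selector": {"matchLabels": {"app": "cassandra"}},  # match the metrics Service
        "endpoints": [{"port": "metrics", "interval": "30s"}],  # scrape each MCAC agent
    },
}
```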

&lt;h3&gt;
  
  
  &lt;strong&gt;Principle Four: Make the Default Configuration Secure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kubernetes networking is secure by default: ports must be explicitly exposed in order to be accessed externally to a pod. This sets a useful precedent for database deployment, forcing us to think carefully about how each control plane and data plane interface will be exposed, and which interfaces should be exposed via a Kubernetes Service.&lt;/p&gt;

&lt;p&gt;In K8ssandra, CQL access is exposed as a Service for each CassandraDatacenter resource, while the management and metrics APIs of individual Cassandra nodes are accessed by cass-operator and the Prometheus ServiceMonitor, respectively.&lt;/p&gt;

&lt;p&gt;Kubernetes also provides facilities for secrets management, including sharing encryption keys and configuring administrative accounts. K8ssandra deployments replace Cassandra’s default administrator account with a new administrator username and password.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Principle Five: Prefer Declarative Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the Kubernetes declarative approach, you specify the desired state of resources and controllers manipulate the underlying infrastructure in order to achieve that state. Cass-operator allows you to specify the desired number of nodes in a cluster, and manages the details of placing new nodes to scale up, and selecting which nodes to remove to scale down.&lt;/p&gt;

&lt;p&gt;The next generation of operators should enable us to specify rules for stored data size, number of transactions per second or both. Perhaps we’ll be able to specify maximum and minimum cluster sizes, and when to move less frequently used data to object storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Draw on the Wisdom of the Community&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I hope I’ve convinced you that Kubernetes is a great source of best practices for cloud-native database implementations, and the innovation continues. Solutions for federating Kubernetes clusters are still maturing, but will soon make it much simpler to manage multi-data center Cassandra clusters in Kubernetes. In the Cassandra community, we can work to make extensions for management and metrics a part of the core Apache project so that Cassandra is more naturally cloud-native for everyone, right out of the box.&lt;/p&gt;

&lt;p&gt;If you’re excited at the prospect of cloud-native databases on Kubernetes, you’re not alone. A group of like-minded individuals and organizations has assembled as the &lt;a href="https://dok.community/" rel="noopener noreferrer"&gt;Data on Kubernetes Community&lt;/a&gt;, which has hosted over 50 meetups in multiple languages since its inception last year. We’re grateful to MayaData for helping to start this community, and are excited to announce that DataStax has joined as a co-sponsor of the DoKC.&lt;/p&gt;

&lt;p&gt;In more great news, the DoKC was accepted as &lt;a href="https://community.cncf.io/data-on-kubernetes/" rel="noopener noreferrer"&gt;an official CNCF community group&lt;/a&gt;, and hosted the first-ever &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/colocated-events/#data-on-kubernetes-day" rel="noopener noreferrer"&gt;Data on Kubernetes Day&lt;/a&gt; as part of &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/" rel="noopener noreferrer"&gt;KubeCon/CloudNativeCon Europe&lt;/a&gt; on May 3. Rick Vasquez’s &lt;a href="https://dok.community/dokc-day-schedule/" rel="noopener noreferrer"&gt;talk&lt;/a&gt;, “A Call for DBMS to Modernize on Kubernetes,” lays down a challenge to make the architectural changes required to become truly cloud-native. Together, we’ll arrive at the best solutions through collaboration in open source communities like Kubernetes, Data on Kubernetes, Apache Cassandra and K8ssandra. Let’s lead with code and keep talking! If you’d like to try Cassandra quickly without running it on K8s yourself, try the managed &lt;a href="https://astra.dev/3cqZn1s" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, which is built on Apache Cassandra.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Put a Database in Kubernetes</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Thu, 10 Mar 2022 22:08:33 +0000</pubDate>
      <link>https://dev.to/datastax/how-to-put-a-database-in-kubernetes-27hl</link>
      <guid>https://dev.to/datastax/how-to-put-a-database-in-kubernetes-27hl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ep7lk8lm72e0wj1vzbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ep7lk8lm72e0wj1vzbi.png" alt="Image description" width="768" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Learn the key steps of deploying databases and stateful workloads in Kubernetes and meet the cloud-native technologies, like K8ssandra, that can streamline Apache Cassandra for K8s.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The idea of running a stateful workload in Kubernetes (K8s) can be intimidating, especially if you haven’t done it before. How do you deploy a database? Where is the actual storage? How is the storage mapped to the database or the application using it?&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/" rel="noopener noreferrer"&gt;KubeCon North America 2021&lt;/a&gt;, I’ll be giving a talk on “&lt;a href="https://sched.co/lV3V" rel="noopener noreferrer"&gt;How to put a database in Kubernetes&lt;/a&gt;” where I demystify the deployment of databases and stateful workloads in K8s. Basically, it boils down to a few key steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get to know the Kubernetes primitives&lt;/li&gt;
&lt;li&gt;Pick a database&lt;/li&gt;
&lt;li&gt;Pick a storage provider&lt;/li&gt;
&lt;li&gt;Pick an operator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This blog is a sneak preview of my upcoming talk, which will take place in Los Angeles and be streamed online on October 12. If you’d like to join me at this year’s KubeCon, either virtually or in person, &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/register/" rel="noopener noreferrer"&gt;register here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the meantime, this blog post dives into the key steps of deploying databases and stateful workloads in K8s. You can learn more about them during my talk, as well as in the &lt;a href="https://twitter.com/JessHaberman/status/1425898298959859712" rel="noopener noreferrer"&gt;upcoming O’Reilly book&lt;/a&gt;: &lt;em&gt;Managing Cloud Native Data on Kubernetes&lt;/em&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Get to know the Kubernetes primitives
&lt;/h1&gt;

&lt;p&gt;Simply put: databases are just applications composed of compute, network, and storage. We can deploy them like any other K8s application and take advantage of the resources Kubernetes provides: StatefulSets, Services, StorageClasses, PersistentVolumes, PersistentVolumeClaims, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z46ruganyl915egblik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z46ruganyl915egblik.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1: Kubernetes resources help us think of applications in terms of compute, network, and storage.&lt;/p&gt;

&lt;p&gt;Getting comfortable with using these primitives will help you understand how databases and other data infrastructure are deployed on K8s. For example, a deployment of &lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra®&lt;/a&gt; will typically use a StatefulSet to launch pods across available Kubernetes worker nodes, with each Cassandra pod having its own PersistentVolumeClaim that can be preserved and reused if the pod needs to be replaced.&lt;/p&gt;
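That deployment shape can be sketched as a manifest. This is a hedged, minimal illustration of the pattern (names, image tag, port, and storage size are placeholders, not a production configuration), written as the Python structure you could serialize to YAML:

```python
# Minimal sketch of a Cassandra StatefulSet with per-pod persistent storage.
# All names and sizes below are illustrative placeholders.
cassandra_statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "cassandra"},
    "spec": {
        "serviceName": "cassandra",  # headless Service giving each pod stable DNS
        "replicas": 3,
        "selector": {"matchLabels": {"app": "cassandra"}},
        "template": {
            "metadata": {"labels": {"app": "cassandra"}},
            "spec": {
                "containers": [{
                    "name": "cassandra",
                    "image": "cassandra:4.0",  # placeholder image tag
                    "ports": [{"containerPort": 9042, "name": "cql"}],
                    "volumeMounts": [{"name": "data", "mountPath": "/var/lib/cassandra"}],
                }],
            },
        },
        # Each pod gets its own PersistentVolumeClaim, preserved and reattached
        # if the pod is replaced -- the key to stateful workloads on K8s.
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "100Gi"}},
            },
        }],
    },
}
```

The `volumeClaimTemplates` section is what distinguishes this from a stateless Deployment: it binds storage identity to pod identity rather than to any particular container instance.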

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ds8tozl1dzeiuki6nta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ds8tozl1dzeiuki6nta.png" alt="Image description" width="768" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 2: Simple deployment of Cassandra on Kubernetes using a StatefulSet.&lt;/p&gt;

&lt;p&gt;For a great example of using these primitives, check the reference example in the Kubernetes documentation of &lt;a href="https://kubernetes.io/docs/tutorials/stateful-application/cassandra/" rel="noopener noreferrer"&gt;deploying Cassandra using StatefulSets&lt;/a&gt;. We’re also building a &lt;a href="https://dtsx.io/3toxkEb" rel="noopener noreferrer"&gt;collection of examples on GitHub&lt;/a&gt; in association with the book project and would love to see your issues and pull requests.&lt;/p&gt;

&lt;p&gt;Once you’ve familiarized yourself with the basic building blocks of Kubernetes, there are three main considerations when setting up the right database for your application.&lt;/p&gt;

&lt;h1&gt;
  
  
  Pick a database
&lt;/h1&gt;

&lt;p&gt;To start, you’ll want to think about what &lt;em&gt;kind&lt;/em&gt; of database your application needs. To help you make the right choice, consider the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database language:&lt;/strong&gt; does your application need SQL, NoSQL, developer-friendly data APIs?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity, performance, and scalability requirements&lt;/strong&gt;: will your data fit on a single node, or will you need a distributed database that can scale as your application grows?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment topology&lt;/strong&gt;: will your application be running in on-premises data centers, public clouds, or a mix of both?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deciding on a database isn’t entirely independent from other decisions in your application design, and we’ll see more of this below. Note that your needs may also change as your application evolves.&lt;/p&gt;

&lt;h1&gt;
  
  
  Pick a storage provider
&lt;/h1&gt;

&lt;p&gt;Unless the database you choose is just a cache holding ephemeral data, you’ll need to configure your database to use persistent storage. If you’re using one of the public clouds, you’ll have storage options available such as Elastic Block Storage (EBS) volumes in AWS.&lt;/p&gt;

&lt;p&gt;However, there are many other options that are cloud-vendor independent. You can find a thriving ecosystem of K8s storage providers in the &lt;a href="https://landscape.cncf.io/card-mode?category=cloud-native-storage&amp;amp;grouping=category" rel="noopener noreferrer"&gt;Cloud-Native Storage category&lt;/a&gt; of the CNCF Landscape.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg8gj6yg7auschrbtzt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg8gj6yg7auschrbtzt9.png" alt="Image description" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 3: Cloud Native Storage projects on the CNCF Landscape as of September 2021.&lt;/p&gt;

&lt;p&gt;These include a number of options for managing both local and networked storage, in formats such as block, file, and object storage. You’ll likely be able to find sample code that shows how to configure your selected database to use your chosen storage provider. For example, here’s a &lt;a href="https://docs.openebs.io/docs/next/cassandra.html" rel="noopener noreferrer"&gt;tutorial on running Apache Cassandra on OpenEBS&lt;/a&gt;, a popular open-source storage provider for K8s that you can run in a variety of environments.&lt;/p&gt;
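In practice, wiring a database to a chosen storage provider usually comes down to naming a StorageClass in a PersistentVolumeClaim. A minimal sketch follows; `openebs-hostpath` is a class OpenEBS installs by default, but treat the class name and size as assumptions to adapt to your environment.

```python
# Hypothetical PVC binding a database pod to a provider-installed StorageClass.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "cassandra-data"},
    "spec": {
        "storageClassName": "openebs-hostpath",  # assumed provider-installed class
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "50Gi"}},
    },
}
```

Swapping storage providers then means changing one field, not redesigning the deployment: the database only ever sees a mounted volume.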

&lt;h1&gt;
  
  
  Pick an operator
&lt;/h1&gt;

&lt;p&gt;If you intend on running more than a small handful of nodes of your selected database, you’ll benefit from automating your operations by using a K8s Operator. You can find a wide variety of operators for databases and other applications at the &lt;a href="https://operatorhub.io/?category=Database" rel="noopener noreferrer"&gt;OperatorHub&lt;/a&gt;. When selecting an operator, you’ll want to make sure it’s open-source, and also check how actively it’s maintained.&lt;/p&gt;

&lt;p&gt;There are operators for most popular databases, such as the Zalando &lt;a href="https://postgres-operator.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Postgres-operator&lt;/a&gt;, or &lt;a href="https://dtsx.io/3kQTmeW" rel="noopener noreferrer"&gt;Cass-operator&lt;/a&gt;, which the Apache Cassandra community &lt;a href="https://cassandra.apache.org/_/blog/Cassandra-and-Kubernetes-SIG-Update-2.html" rel="noopener noreferrer"&gt;has recently rallied around&lt;/a&gt;. Cass-operator is actually part of a larger project called &lt;a href="https://dtsx.io/2WUnNc4" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt;, which builds on that operator to create a more comprehensive data platform around Cassandra. This includes tooling for maintenance and backups, along with an open-source data gateway called &lt;a href="https://dtsx.io/3teXdq1" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt; that supports a variety of developer-friendly APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  An alternate approach: Pick a managed service
&lt;/h1&gt;

&lt;p&gt;Of course, even with an operator, running a database in K8s yourself may be more than you want to take on, especially if you’re a smaller team looking to maximize your leverage.&lt;/p&gt;

&lt;p&gt;If this is you, you can still take advantage of one of the many managed database services available. If you need a highly scalable database combined with a great developer experience, &lt;a href="https://astra.dev/3KqAChA" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; is a great choice. Astra DB is a managed Cassandra service that itself happens to be built on top of Kubernetes, and the Stargate APIs are available by default — even with a &lt;a href="https://astra.dev/3KqAChA" rel="noopener noreferrer"&gt;free Astra DB account&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Meet a community of cloud-native data practitioners
&lt;/h1&gt;

&lt;p&gt;No matter what choices you end up making for your K8s-deployed applications, you can find a group of passionate developers pushing the state of the art forward in the &lt;a href="https://dok.community/" rel="noopener noreferrer"&gt;Data on Kubernetes Community&lt;/a&gt; (DoKC). If you’re attending KubeCon North America, join us for &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/program/colocated-events/#data-on-kubernetes-day" rel="noopener noreferrer"&gt;DoK Day&lt;/a&gt; on Tuesday, October 12.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Register &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/register/" rel="noopener noreferrer"&gt;here&lt;/a&gt; to join KubeCon North America 2021 and &lt;a href="https://dtsx.io/3zLW7V2" rel="noopener noreferrer"&gt;subscribe to our event alert&lt;/a&gt; to get notified about new DataStax workshops for developers, by developers. For exclusive posts on Cassandra, streaming, Kubernetes, and more, follow &lt;a href="https://dtsx.io/3kTIehj" rel="noopener noreferrer"&gt;DataStax on Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://astra.dev/3KqAChA" rel="noopener noreferrer"&gt;Astra DB — Managed Apache Cassandra as a Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dtsx.io/3teXdq1" rel="noopener noreferrer"&gt;Stargate APIs | GraphQL, REST, Document&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dtsx.io/3toxkEb" rel="noopener noreferrer"&gt;GitHub: Examples for Managing Cloud-Native Data on Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dtsx.io/3kQTmeW" rel="noopener noreferrer"&gt;k8ssandra/cass-operator: The DataStax Kubernetes Operator for Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/" rel="noopener noreferrer"&gt;KubeCon North America 2021&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dtsx.io/38IChhy" rel="noopener noreferrer"&gt;DataStax Academy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dtsx.io/2WWvizo" rel="noopener noreferrer"&gt;DataStax Workshops&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Multi-cluster Cassandra deployment with Google Kubernetes Engine (Pt. 2)</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Tue, 01 Mar 2022 21:15:25 +0000</pubDate>
      <link>https://dev.to/datastax/multi-cluster-cassandra-deployment-with-google-kubernetes-engine-pt-2-2ffb</link>
      <guid>https://dev.to/datastax/multi-cluster-cassandra-deployment-with-google-kubernetes-engine-pt-2-2ffb</guid>
      <description>&lt;p&gt;This is the second in a series of posts examining patterns for using K8ssandra to create Cassandra clusters with different deployment topologies.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://k8ssandra.io/blog/tutorials/deploy-a-multi-datacenter-apache-cassandra-cluster-in-kubernetes/" rel="noopener noreferrer"&gt;first article&lt;/a&gt; in this series, we looked at how you could create a Cassandra cluster with two datacenters in a single cloud region, using separate Kubernetes namespaces in order to isolate workloads. For example, you might want to create a secondary Cassandra datacenter to isolate a read-heavy analytics workload from the datacenter supporting your main application.&lt;/p&gt;

&lt;p&gt;In the rest of this series, we’ll explore additional configurations that promote high availability and accessibility of your data across various different network topologies, including hybrid and multi-cloud deployments. Our focus for this post will be on creating a Cassandra cluster running on Kubernetes clusters in multiple regions within a single cloud provider – in this case Google Cloud. If you worked through the first blog, many of the steps will be familiar.&lt;/p&gt;

&lt;p&gt;Note: For the purposes of this exercise, you’ll create GKE clusters in two separate regions under the same Google Cloud project, which makes it possible for both clusters to share the same network.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Preparing the first GKE Cluster&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, you’re going to need a Kubernetes cluster in which you can create the first Cassandra datacenter. To create this first cluster, follow the instructions for &lt;a href="https://docs.k8ssandra.io/install/gke/" rel="noopener noreferrer"&gt;K8ssandra on Google Kubernetes Engine (GKE)&lt;/a&gt;, which reference scripts provided as part of the &lt;a href="https://github.com/k8ssandra/k8ssandra-terraform/tree/main/gcp" rel="noopener noreferrer"&gt;K8ssandra GCP Terraform Example&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When building this example for myself, I provided values for the environment variables used by the Terraform script to match my desired environment. Notice my initial GKE cluster is in the &lt;code&gt;us-west4&lt;/code&gt; region. You’ll want to change these values for your own environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export TF_VAR_environment=dev
export TF_VAR_name=k8ssandra
export TF_VAR_project_id=&amp;lt;my project&amp;gt;
export TF_VAR_region=us-west4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the GKE cluster, you can ignore further instructions on the &lt;a href="https://docs.k8ssandra.io/install/gke/" rel="noopener noreferrer"&gt;K8ssandra GKE docs page&lt;/a&gt; (the “Install K8ssandra” section and beyond), since you’ll be doing a custom K8ssandra installation. The Terraform script should automatically change your &lt;code&gt;kubectl&lt;/code&gt; context to the new cluster, but you can make sure by checking the output of &lt;code&gt;kubectl config current-context&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Creating the first Cassandra datacenter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, a bit of upfront planning. It will be easier to manage our K8ssandra installs in different clusters if we use the same administrator credentials in each datacenter. Let’s create a namespace for the first datacenter and add a secret within the namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace us-west4
kubectl create secret generic cassandra-admin-secret --from-literal=username=cassandra-admin --from-literal=password=cassandra-admin-password -n us-west4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that I chose to create a namespace matching the GCP region in which I’m deploying K8ssandra. This is done as part of enabling DNS between the GKE clusters, which is a topic that we’ll discuss in depth in a future post. You’ll want to specify a namespace corresponding to the region you’re using.&lt;/p&gt;

&lt;p&gt;The next step is to create a K8ssandra deployment for the first datacenter. You’ll need Helm installed for this step, as described on the &lt;a href="https://docs.k8ssandra.io/install/gke/" rel="noopener noreferrer"&gt;K8ssandra GKE docs page&lt;/a&gt;. Create the configuration for the first datacenter in a file called &lt;code&gt;dc1.yaml&lt;/code&gt;, making sure to change the affinity labels to match zones used in your GKE cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
 auth:
   superuser:
     secret: cassandra-admin-secret
 cassandraLibDirVolume:
   storageClass: standard-rwo
 clusterName: multi-region
 datacenters:
 - name: dc1
   size: 3
   racks:
   - name: rack1
     affinityLabels:
       failure-domain.beta.kubernetes.io/zone: us-west4-a
   - name: rack2
     affinityLabels:
       failure-domain.beta.kubernetes.io/zone: us-west4-b
   - name: rack3
     affinityLabels:
       failure-domain.beta.kubernetes.io/zone: us-west4-c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to requesting 3 nodes in the datacenter, this configuration specifies an appropriate storage class for the GKE environment (&lt;code&gt;standard-rwo&lt;/code&gt;), and uses affinity to specify how the racks are mapped to GCP zones. Make sure to change the referenced zones to match your configuration. For more details, please reference the first &lt;a href="https://k8ssandra.io/blog/tutorials/deploy-a-multi-datacenter-apache-cassandra-cluster-in-kubernetes/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; in the series.&lt;/p&gt;
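&lt;p&gt;Before editing the affinity labels, it can help to confirm which zones your worker nodes actually landed in. This one-liner uses the same label key referenced in &lt;code&gt;dc1.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes --label-columns failure-domain.beta.kubernetes.io/zone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;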

&lt;p&gt;Deploy the release using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install k8ssandra k8ssandra/k8ssandra -f dc1.yaml -n us-west4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This causes the K8ssandra release named &lt;code&gt;k8ssandra&lt;/code&gt; to be installed in the namespace &lt;code&gt;us-west4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As with any Cassandra cluster deployment, you’ll want the first datacenter to be completely up before adding a second. Since your next step is creating additional infrastructure for the second datacenter, you probably won’t need to wait, but one simple way to confirm the datacenter is up is to watch until the Stargate pod shows as ready, since it depends on Cassandra being available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n us-west4 kubectl get pods -n us-west4 --watch --selector app=k8ssandra-dc1-stargate
NAME                                                  READY   STATUS             RESTARTS   AGE
k8ssandra-dc1-stargate-58bf5657ff-ns5r7                     1/1     Running            0          15m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a good point to capture some information you’ll need below to configure the second Cassandra datacenter: seeds. In the first blog post in this series, we took advantage of a headless Kubernetes service that K8ssandra creates called the seed service, which points to a couple of the Cassandra nodes that can be used to bootstrap new nodes or datacenters into a Cassandra cluster. Since the seed nodes are labeled, you can look up their addresses directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n us-west4 -o jsonpath="{.items[*].status.podIP}" --selector cassandra.datastax.com/seed-node=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which produces output that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.56.6.8 10.56.5.8 10.56.4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Record a couple of these IP addresses to use as seeds further down.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Preparing the second GKE cluster&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now you’ll need a second Kubernetes cluster to host the second Cassandra datacenter. The Terraform scripts used above to create the first GKE cluster also created a network and service account that should be reused for the second cluster. Instead of modifying the Terraform scripts to take these existing resources into account, you can create the new GKE cluster using the console or the &lt;code&gt;gcloud&lt;/code&gt; command line.&lt;/p&gt;

&lt;p&gt;For example, I chose the &lt;code&gt;us-central1&lt;/code&gt; region for my second cluster. First, I explicitly created a subnet in that region as part of the same network that Terraform created for the first datacenter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud compute networks subnets create dev-k8ssandra-subnet2 --network=dev-k8ssandra-network --range=10.2.0.0/20 --region=us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I created the second GKE cluster using that network and the same compute specs as the first cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud beta container clusters create "k8ssandra-2" --region "us-central1" --machine-type "e2-highmem-8" --disk-type "pd-standard" --disk-size "100" --num-nodes "1" --network dev-k8ssandra-network --subnetwork dev-k8ssandra-subnet2 --node-locations "us-central1-b","us-central1-c","us-central1-f"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the &lt;code&gt;kubectl&lt;/code&gt; context to the second datacenter. Typically you can obtain a command to do this by selecting the cluster in the GCP console and pressing the “Connect” button.&lt;/p&gt;
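&lt;p&gt;If you prefer the command line, &lt;code&gt;gcloud&lt;/code&gt; can update your &lt;code&gt;kubectl&lt;/code&gt; context directly. Here’s a sketch assuming the cluster name and region used above, and that &lt;code&gt;gcloud&lt;/code&gt; is already pointed at the same project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud container clusters get-credentials k8ssandra-2 --region us-central1
kubectl config current-context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;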

&lt;p&gt;Then you’ll need to create a firewall rule to allow traffic between the two clusters. I obtained the IP space of each subnet and the IP space of each GKE cluster and created a rule to allow all traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud compute firewall-rules create k8ssandra-multi-region-rule --direction=INGRESS --network=dev-k8ssandra-network --action=ALLOW --rules=all --source-ranges=10.0.0.0/20,10.2.0.0/20,10.56.0.0/14,10.24.0.0/14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If desired, you could create a more targeted rule to only allow TCP traffic between ports used by Cassandra.&lt;/p&gt;
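&lt;p&gt;For example, a tighter rule might allow only TCP traffic on Cassandra’s standard ports: 7000 and 7001 for inter-node communication, and 9042 for CQL clients. The rule name here is hypothetical, and the source ranges should match your own subnets and clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud compute firewall-rules create k8ssandra-cassandra-ports --direction=INGRESS --network=dev-k8ssandra-network --action=ALLOW --rules=tcp:7000,tcp:7001,tcp:9042 --source-ranges=10.0.0.0/20,10.2.0.0/20,10.56.0.0/14,10.24.0.0/14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;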

&lt;h2&gt;
  
  
  &lt;strong&gt;Adding a second Cassandra datacenter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s start by creating a namespace for the new datacenter matching the GCP region name. We also need to create administrator credentials to match those created for the first datacenter, since the secrets are not automatically replicated between clusters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace us-central1
kubectl create secret generic cassandra-admin-secret --from-literal=username=cassandra-admin --from-literal=password=cassandra-admin-password -n us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you’ll create a configuration to deploy an additional Cassandra datacenter &lt;code&gt;dc2&lt;/code&gt; in the new GKE cluster. For the nodes in &lt;code&gt;dc2&lt;/code&gt; to be able to join the Cassandra cluster, a few steps are required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first is one you’ve already taken care of: using the same Google Cloud network for both GKE clusters means the nodes in the new datacenter will be able to communicate with nodes in the original datacenter.&lt;/li&gt;
&lt;li&gt;Second, make sure to use the same Cassandra cluster name as for the first datacenter.&lt;/li&gt;
&lt;li&gt;Finally, you’ll need to provide the seed nodes you recorded earlier so that the nodes in the new datacenter know how to contact nodes in the first datacenter to join the cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now create a configuration in a file called &lt;code&gt;dc2.yaml&lt;/code&gt;. Here’s what my file looked like; you’ll want to change the additional seeds and affinity labels to match your configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
 auth:
   superuser:
     secret: cassandra-admin-secret
 additionalSeeds: [ 10.56.2.14, 10.56.0.10 ]
 cassandraLibDirVolume:
   storageClass: standard-rwo
 clusterName: multi-region
 datacenters:
 - name: dc2
   size: 3
   racks:
   - name: rack1
     affinityLabels:
       failure-domain.beta.kubernetes.io/zone: us-central1-f
   - name: rack2
     affinityLabels:
       failure-domain.beta.kubernetes.io/zone: us-central1-b
   - name: rack3
     affinityLabels:
       failure-domain.beta.kubernetes.io/zone: us-central1-c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Like the configuration for &lt;code&gt;dc1&lt;/code&gt;, this configuration uses affinity, allocating racks across zones to make sure Cassandra nodes are spread evenly across the worker nodes. Deploy the release using a command such as this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install k8ssandra2 k8ssandra/k8ssandra -f dc2.yaml -n us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look at the resources in this namespace using a command such as &lt;code&gt;kubectl get services,pods&lt;/code&gt; you’ll note that there are a similar set of pods and services as for &lt;code&gt;dc1&lt;/code&gt;, including Stargate, Prometheus, Grafana, and Reaper. Depending on how you wish to manage your application, this may or may not be to your liking, but you are free to tailor the configuration to disable any components you don’t need.&lt;/p&gt;
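&lt;p&gt;For instance, the chart exposes enable/disable flags for individual components. The exact value names can differ between K8ssandra chart versions, so treat this as a sketch and confirm with &lt;code&gt;helm show values k8ssandra/k8ssandra&lt;/code&gt; first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade k8ssandra2 k8ssandra/k8ssandra -f dc2.yaml -n us-central1 --set reaper.enabled=false --set medusa.enabled=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;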

&lt;h2&gt;
  
  
  &lt;strong&gt;Configuring Cassandra Keyspaces&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the second datacenter comes online, you’ll want to configure Cassandra keyspaces to replicate across both datacenters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: You’ll likely need to first change your &lt;code&gt;kubectl&lt;/code&gt; context back to the first GKE cluster, for example using the &lt;code&gt;kubectl config use-context&lt;/code&gt; command. You can list existing contexts using &lt;code&gt;kubectl config get-contexts&lt;/code&gt;.&lt;/p&gt;
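&lt;p&gt;A typical sequence looks like the following. The context name shown is hypothetical; GKE context names follow a &lt;code&gt;gke_&amp;lt;project&amp;gt;_&amp;lt;region&amp;gt;_&amp;lt;cluster&amp;gt;&lt;/code&gt; pattern, so copy the exact name from the &lt;code&gt;get-contexts&lt;/code&gt; output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl config get-contexts
kubectl config use-context gke_my-project_us-west4_dev-k8ssandra-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;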

&lt;p&gt;To update keyspaces, connect to a node in the first datacenter and execute &lt;code&gt;cqlsh&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec multi-region-dc1-rack1-sts-0 cassandra -it -- cqlsh -u cassandra-admin -p cassandra-admin-password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;code&gt;DESCRIBE KEYSPACES&lt;/code&gt; command to list the keyspaces and the &lt;code&gt;DESCRIBE KEYSPACE &amp;lt;name&amp;gt;&lt;/code&gt; command to identify those using the &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra-admin@cqlsh&amp;gt; DESCRIBE KEYSPACES
reaper_db      system_auth  data_endpoint_auth  system_traces
system_schema  system       system_distributed
cassandra-admin@cqlsh&amp;gt; DESCRIBE KEYSPACE system_auth
CREATE KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'}  AND durable_writes = true;
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typically you’ll find that the &lt;code&gt;system_auth&lt;/code&gt;, &lt;code&gt;system_traces&lt;/code&gt;, and &lt;code&gt;system_distributed&lt;/code&gt; keyspaces use &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt;, as well as &lt;code&gt;data_endpoint_auth&lt;/code&gt; if you’ve enabled Stargate. You can then update the replication strategy to ensure data is replicated to the new datacenter. You’ll execute something like the following for each of these keyspaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Remember to create or alter the replication strategy for any keyspaces you need for your application so that you have the desired number of replicas in each datacenter.&lt;/p&gt;
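&lt;p&gt;For example, to create a hypothetical application keyspace named &lt;code&gt;my_app&lt;/code&gt; with three replicas in each datacenter, you could run a statement like this through &lt;code&gt;cqlsh&lt;/code&gt; on a node in the first datacenter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it multi-region-dc1-rack1-sts-0 -n us-west4 -c cassandra -- cqlsh -u cassandra-admin -p cassandra-admin-password -e "CREATE KEYSPACE IF NOT EXISTS my_app WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;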

&lt;p&gt;After exiting &lt;code&gt;cqlsh&lt;/code&gt;, make sure existing data is properly replicated to the new datacenter with the &lt;code&gt;nodetool rebuild&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Remember to change your &lt;code&gt;kubectl&lt;/code&gt; context back to the second GKE cluster.&lt;/p&gt;

&lt;p&gt;Rebuild needs to be run on each node in the new datacenter, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec multi-region-dc2-rack1-sts-0 -n us-central1 -- nodetool --username cassandra-admin --password cassandra-admin-password rebuild dc1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repeat for the other nodes &lt;code&gt;multi-region-dc2-rack2-sts-0&lt;/code&gt; and &lt;code&gt;multi-region-dc2-rack3-sts-0&lt;/code&gt;.&lt;/p&gt;
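&lt;p&gt;Since the rebuild command is identical for each node, the three invocations can also be scripted as a small loop (the pod names are assumed to follow the rack naming shown above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for pod in multi-region-dc2-rack1-sts-0 multi-region-dc2-rack2-sts-0 multi-region-dc2-rack3-sts-0; do
  kubectl exec "$pod" -n us-central1 -- nodetool --username cassandra-admin --password cassandra-admin-password rebuild dc1
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;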

&lt;h2&gt;
  
  
  &lt;strong&gt;Testing the configuration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s verify that the second datacenter has joined the cluster. To do this, pick a Cassandra node and run the &lt;code&gt;nodetool status&lt;/code&gt; command against it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec multi-region-dc2-rack1-sts-0 -n us-central1 cassandra -- nodetool --username cassandra-admin --password cassandra-admin-password status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will produce output similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  10.56.2.8    835.57 KiB  256          ?       8bc5cd4a-7953-497a-8ac0-e89c2fcc8729  rack1
UN  10.56.5.8    1.19 MiB   256          ?       fdd96600-5a7d-4c88-a5cc-cf415b3b79f0  rack2
UN  10.56.4.7    830.98 KiB  256          ?       d4303a9f-8818-40c2-a4b5-e7f2d6d78da6  rack3
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  10.24.4.99   418.52 KiB  256          ?       d2e71ab4-6747-4ac6-b314-eaaa76d3111e  rack3
UN  10.24.7.37   418.17 KiB  256          ?       24708e4a-61fc-4004-aee0-6bcc5533a48f  rack2
UN  10.24.1.214  398.22 KiB  256          ?       76c0d2ba-a9a8-46c0-87e5-311f7e05450a  rack1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything has been configured correctly, you’ll be able to see both datacenters in the cluster output. Here’s a picture that depicts what you’ve just deployed, focusing on the Cassandra nodes and networking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabo2j22g5aza4i27lxjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabo2j22g5aza4i27lxjq.png" alt="Image description" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What’s next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the following posts in this series, we’ll explore additional multi-datacenter topologies across multiple Kubernetes clusters, including Cassandra clusters in hybrid-cloud and multi-cloud deployments. We’ll also dive into more detail on networking and DNS configuration. We’d love to hear about the configurations you build, and please feel free to reach out with any questions on the &lt;a href="https://forum.k8ssandra.io/" rel="noopener noreferrer"&gt;forum&lt;/a&gt; or our &lt;a href="https://discord.gg/qP5tAt6Uwt" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; channel. We recommend trying it on the &lt;a href="https://astra.dev/3Maq8ER" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; free plan for the fastest setup.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why we decided to build a K8ssandra Operator - Part 4</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Tue, 22 Feb 2022 17:04:41 +0000</pubDate>
      <link>https://dev.to/datastax/why-we-decided-to-build-a-k8ssandra-operator-part-4-1dj6</link>
      <guid>https://dev.to/datastax/why-we-decided-to-build-a-k8ssandra-operator-part-4-1dj6</guid>
      <description>&lt;p&gt;In the &lt;a href="https://k8ssandra.io/blog/other/why_k8ssandra_operator_part_1/" rel="noopener noreferrer"&gt;first&lt;/a&gt;, &lt;a href="https://k8ssandra.io/blog/articles/why-k8ssandra-operator-part-2/" rel="noopener noreferrer"&gt;second&lt;/a&gt;, and &lt;a href="https://k8ssandra.io/blog/articles/why-we-decided-to-build-a-k8ssandra-operator-part-3/" rel="noopener noreferrer"&gt;third&lt;/a&gt; posts in this series, we’ve shared conversations with K8ssandra core team members on our journey to build a Kubernetes operator for K8ssandra. We’ve discussed the virtues of the Helm package manager versus Kubernetes operators for deploying and managing infrastructure in Kubernetes and some of our implementation choices for the operator.&lt;/p&gt;

&lt;p&gt;In this final post of the series, we pick up from the previous post with a discussion of how we decided to structure our projects in GitHub, how we test the K8ssandra operator, and our hopes for how the operator will expand the K8ssandra developer community.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implications of operators for project structure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; There are external projects that K8ssandra manages but that don’t have operators of their own. If I look in GitHub, I see Reaper under The Last Pickle organization, but Reaper Operator under K8ssandra. Is this another case where Stargate isn’t building an operator under its own org, but we’re building a Stargate operator under K8ssandra?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; Yes, but note that while we have separate repositories for Reaper Operator, Medusa Operator, Stargate Operator, we do plan to consolidate those into the K8ssandra operator. We’ll have multiple CRDs and multiple controllers. Because cass-operator is already used independently, it will continue to be independent and will be a dependency pulled into the K8ssandra operator.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; You’re saying there will be separate CRDs associated with Stargate, Reaper, and Medusa, but all managed by the K8ssandra operator. This makes me wonder: is there discussion in the Kubernetes operator world about monoliths versus microservices? Is there concern about building a monolithic operator?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; Absolutely. It’s not a microservice architecture per se, but it is highly decoupled and highly modular. Let’s say we wanted to take the Stargate controller and run that in its own separate pod. We could do that without impacting the code of the Reaper or K8ssandra operator, or the cass-operator controllers, it would just be a matter of repackaging it. They are decoupled and modular in that regard. That’s also driven by having distinct CRDs, because you’ll typically have a separate controller per CRD, and those controllers, for the most part, act in isolation from one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to test a Kubernetes operator&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; Are there any interesting considerations for testing an operator?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; The multi-cluster testing is going to present some challenges in terms of resource requirements. We’ve done a lot to make sure we can do all our automation and continuous integration with GitHub Actions using the free tier runners in GitHub, but this is not going to be sufficient in terms of resources for multi-cluster.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; We’re using Kind clusters for running most of our tests. We’ve put together some automation, in the form of setup scripts that will deploy and configure multiple Kind clusters for testing multi-cluster, but that’s just going to be too much for those free tier runners in GitHub. That presents some interesting challenges that we need to work through.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; For the CassKop operator from Orange, they’ve used a tool called &lt;a href="https://kuttl.dev/" rel="noopener noreferrer"&gt;Kuttl&lt;/a&gt;, which does full integration tests with YAML files. There was some discussion of this recently on our Discord server, and I think that will be something for us to look at. Not everyone will be a Go programmer or be familiar with the Kubernetes APIs in order to write tests, but everyone using K8ssandra should know at least a little bit about YAML. That would be a really awesome way for people to contribute and add a lot of value to the project without having to have that deep, intimate knowledge. That’s something I’d like to look into.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; Is the idea to describe the desired configuration as YAML, and that’s the spec for a test case?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; Yes, and the verification would be an additional YAML manifest.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; What about making specific Stargate API calls or CQL queries? Could it test those as well?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; No, it’s more along the lines of “here’s what I want to deploy,” like verifying a StatefulSet was created correctly. There are certainly going to be limitations because, in our tests, we do make calls to Stargate and run CQL queries and so forth. That’s beyond the scope of what a tool like Kuttl can do, but it would certainly cover some use cases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; It sounds like this is more about setting up user-defined configurations, and the test passes once status gets to “Ready”.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; I think that would be a good example. Perhaps it would be a good candidate for user acceptance testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Automating operator testing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; What amount of testing do you expect to automate? What will the K8ssandra CI/CD pipeline look like with the expected combination of Helm and the K8ssandra operator?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; Yes, there is automation involved. In terms of local development, the other tool that’s considered a counterpart to Helm is Kustomize. This is more of a declarative approach. It’s bundled as part of KubeBuilder and Operator SDK. You’re going to see Kustomize being used with the K8ssandra operator, and we already use it for testing scenarios. Applying this to the scenario of running unit tests locally, there’s a two-step process: first I run the build command to rebuild my operator image, then I’ll run another command that will use Kustomize to redeploy things. So while we can automate those steps, it’s still not as fast of a turnaround in terms of “wall clock” time, because you’re still having to rebuild an image.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; Sure, that’s a key difference between any case where you have a compiled language versus a scripted language.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Expanding the K8ssandra community&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; What does this push to build a K8ssandra operator mean for contributors outside of the core team?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; Hopefully, this means that we see an increase in contributions, whether that’s in issue activity, on the forums, or on Discord. The evolution of the project is a maturation process. People will be looking to use K8ssandra to solve bigger, harder, more challenging problems. That will help to shape K8ssandra to be the solution to those problems.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Ok0VHiH2Px0"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;em&gt;John Sanda:&lt;/em&gt; Does it mean you have to be fluent in writing Go in order to get involved? Do you have to have experience with writing operators? That’s certainly helpful, but no, these things are not required. K8ssandra is still a big collection tying together various projects. There are many avenues for contributors to get involved. If nothing else, this opens the door for more contributions and hopefully bigger and better things for users and contributors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; I agree with you. On the one hand, you could make the argument that having to learn Go is an obstacle to contributing. On the other hand, I’m watching some of the help requests that come from our community, and I can attest it can be semi-inscrutable to figure out what is happening with Helm.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; I also remember trying to make a change to see if I could modify the Helm templates to generate multiple Cassandra datacenters, and I thought I had the iterative looping down, but then struggled with the variable scope and pushing down the values that I needed. And that hour I spent was pretty enlightening.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeff Carpenter:&lt;/em&gt; I think that with Go, while you might have to spend some time spinning up on the language, that’s probably something you should learn anyway for modern, cloud-native backend development. For people that need to customize the project, it’s going to be a lot easier to do their own fork, which hopefully turns into a pull request back to the main project. It’s going to be a lot easier for them to do that in Go.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;John Sanda:&lt;/em&gt; I agree, and I think this is something that Jeff DiNoto brought up when we were trying to decide at what point we should commit to building an operator. For engineers and developers, this is going to resonate more. In terms of development and testing, the libraries and the frameworks you’ll use for writing unit tests in Go code are the same ones that you can use in Kubernetes. Overall, this will make it easier for folks to get involved, and hopefully, submit PRs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That’s where our conversation ended, and it’s a perfect place to wrap up this series. The K8ssandra team is working hard on implementing the K8ssandra operator for a 2.0 release, but the amount of Go code is still quite manageable to read and learn. This is a great time to get involved in the project, and we’d love to give you a hand with setting it up and testing it out in your own environment. Please reach out in the #k8ssandra-dev channel in our &lt;a href="https://discord.gg/qP5tAt6Uwt" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; server and we’ll help you get started! Curious to learn more about (or play with) Cassandra itself? We recommend trying it on the &lt;a href="https://astra.dev/3BHSCkm" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; free plan for the fastest setup.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Unboxing K8ssandra: The Data Layer For Your Kubernetes-Powered Applications</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Tue, 01 Feb 2022 17:51:23 +0000</pubDate>
      <link>https://dev.to/datastax/unboxing-k8ssandra-the-data-layer-for-your-kubernetes-powered-applications-4dec</link>
      <guid>https://dev.to/datastax/unboxing-k8ssandra-the-data-layer-for-your-kubernetes-powered-applications-4dec</guid>
      <description>&lt;h4&gt;
  
  
  &lt;strong&gt;A Complimentary Live Webinar, Sponsored by DataStax&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Kubernetes made it easy to deploy and scale out your cloud-native applications. With &lt;a href="https://k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt;, you can now scale application data with the same simplicity and high availability. Join us as we unbox K8ssandra, a cloud-native data layer for Kubernetes, and explore how you can deploy it alongside your applications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.k8ssandra.io/install/" rel="noopener noreferrer"&gt;Install k8ssandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Authenticate with &lt;a href="https://stargate.io/" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Query your data via a convenient API (REST, Document, or GraphQL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not familiar with Cassandra?  &lt;a href="https://astra.dev/3r070zA" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; is a great (free) place to learn without any of the infrastructure setup or management headaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speakers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Christopher Bradford, Product Manager at DataStax&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Christopher Bradford is a Product Manager at DataStax with a role in everything Kubernetes. For years he has been immersed in the world of distributed systems and AP databases. Christopher loves a good challenge, and complex deployment models, network topologies, and the stitching together of cloud offerings never leave him in short supply.  &lt;/p&gt;

&lt;p&gt;Recently he has focused on making the deployment and management of Apache Cassandra (and supporting tools) trivial, through the open-source projects K8ssandra and cass-operator. Previous speaking engagements by Christopher include Cassandra Summit, Spark Summit, and Kong DevOps Summit, along with a number of meetups and webinars. Topics have ranged from geographic data replication to ETL pipelines with Cassandra, Spark, and Solr for 200 years of patent data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jeffrey Carpenter, Developer Relations at DataStax&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Jeff Carpenter works in Developer Relations at DataStax, where he uses his background in system architecture, microservices and Apache Cassandra to help empower developers and operations engineers to build distributed systems that are scalable, reliable, and secure. Jeff has worked on large-scale systems in the defense and hospitality industries and is co-author of Cassandra: The Definitive Guide.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See the full workshop info at &lt;a href="https://linuxfoundation.org/webinars/unboxing-k8ssandra-the-data-layer-for-your-kubernetes-powered-applications/" rel="noopener noreferrer"&gt;The Linux Foundation&lt;/a&gt;, or watch the video directly, here:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/oFbyYlmDMRw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The search for a cloud-native database</title>
      <dc:creator>Jeffrey Carpenter</dc:creator>
      <pubDate>Tue, 09 Nov 2021 04:03:44 +0000</pubDate>
      <link>https://dev.to/datastax/the-search-for-a-cloud-native-database-440k</link>
      <guid>https://dev.to/datastax/the-search-for-a-cloud-native-database-440k</guid>
      <description>&lt;p&gt;The concept of “cloud-native” has come to stand for a collection of best practices for application logic and infrastructure, including databases. However, many of the databases supporting our applications have been around for decades, before the cloud or cloud-native was a thing. The data gravity associated with these legacy solutions has limited our ability to move applications and workloads. As we move to the cloud, how do we evolve our data storage approach? Do we need a cloud-native database? What would it even mean for a database to be cloud-native? Let’s take a look at these questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Cloud-Native?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It’s helpful to start by defining terms. In unpacking “cloud-native”, let’s start with the word “native”. For individuals, the word may evoke thoughts of your first language, or your country of origin – things that feel natural to you. Or in nature itself, we might consider the native habitats inhabited by wildlife, and how each species is adapted to its environment. We can use this as a basis to understand the meaning of cloud-native.&lt;/p&gt;

&lt;p&gt;Here’s how the Cloud Native Computing Foundation (CNCF) &lt;a href="https://github.com/cncf/toc/blob/main/DEFINITION.md" rel="noopener noreferrer"&gt;defines the term&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;“Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.&lt;/p&gt;

&lt;p&gt;These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”&lt;/p&gt;

&lt;p&gt;This is a rich definition, but it can be a challenge to use this to define what a cloud-native database is, as evidenced by the &lt;strong&gt;Database&lt;/strong&gt; section of the CNCF Landscape Map:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h2l44quu94awxxdd861.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h2l44quu94awxxdd861.png" alt="image" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Databases are just a small portion of a crowded cloud computing landscape&lt;/p&gt;

&lt;p&gt;Look closely, and you’ll notice a wide range of offerings: both traditional relational databases and NoSQL databases, supporting a variety of different data models including key/value, document, and graph. You’ll also find technologies that layer clustering, querying or schema management capabilities on top of existing databases. And this doesn’t even consider related categories in the CNCF landscape such as &lt;strong&gt;Streaming and Messaging&lt;/strong&gt; for data movement, or &lt;strong&gt;Cloud Native Storage&lt;/strong&gt; for persistence.&lt;/p&gt;

&lt;p&gt;Which of these databases are cloud-native? Only those that were designed for the cloud from the start, or should we include those that can be adapted to work in the cloud? Bill Wilder provides an interesting perspective in his 2012 book, “Cloud Architecture Patterns”, defining “cloud-native” as:&lt;/p&gt;

&lt;p&gt;“Any application that was architected to take full advantage of cloud platforms”&lt;/p&gt;

&lt;p&gt;By this definition, cloud-native databases are those that have been architected to take full advantage of underlying cloud infrastructure. Obvious? Maybe. Contentious? Probably…&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why should I care if my database is cloud-native?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Or to ask a different way, what are the advantages of a cloud-native database? Consider the two main factors driving the popularity of the cloud: cost and time-to-market.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; – the ability to pay-as-you-go has been vital in increasing cloud adoption. (But that doesn’t mean that cloud is cheap or that cost management is always straightforward.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-to-market&lt;/strong&gt; – the ability to quickly spin up infrastructure to prototype, develop, test, and deliver new applications and features. (But that doesn’t mean that cloud development and operations are easy.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These goals apply to your database selection, just as they do to any other part of your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are the characteristics of a cloud-native database?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we can revisit the CNCF definition and extract characteristics of a cloud-native database that will help achieve our cost and time-to-market goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; – the system must be able to add capacity dynamically to absorb additional workload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elasticity&lt;/strong&gt; – it must also be able to scale back down, so that you only pay for the resources you need&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resiliency&lt;/strong&gt; – the system must survive failures without losing your data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; – tracking your activity, but also health checking and handling failovers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; – implementing operations tasks as repeatable logic to reduce the possibility of error. This characteristic is the most difficult to achieve, but is essential to sustaining a high delivery tempo at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud-native databases are designed to embody these characteristics, which distinguish them from “cloud-ready” databases, that is, those that can be deployed to the cloud with some adaptation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What’s a good example of a cloud-native database?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s test this definition of a cloud-native database by applying it to Apache Cassandra™ as an example. While the term “cloud-native” was not yet widespread when Cassandra was developed, it bears many of the same architectural influences, since it was inspired by systems built for public cloud infrastructure, as described in Amazon’s Dynamo paper and Google’s Bigtable paper. Because of this lineage, Cassandra embodies the principles outlined above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cassandra demonstrates horizontal &lt;strong&gt;scalability&lt;/strong&gt; through adding nodes, and can be scaled down &lt;strong&gt;elastically&lt;/strong&gt; to free resources outside of peak load periods&lt;/li&gt;
&lt;li&gt;By default, Cassandra is an AP system, that is, it prioritizes availability and partition tolerance over consistency, as described in the CAP theorem. Cassandra’s built-in replication, shared-nothing architecture, and self-healing features help guarantee &lt;strong&gt;resiliency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Cassandra nodes expose logging, metrics, and query tracing, which enable &lt;strong&gt;observability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; is the most challenging aspect for Cassandra, as is typical for databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While automating the initial deployment of a Cassandra cluster is a relatively simple task, other tasks such as scaling up and down or upgrading can be time-consuming and difficult to automate. After all, even single-node database operations can be challenging, as many a DBA can testify. Fortunately, the K8ssandra project provides best practices for deploying Cassandra on Kubernetes, including major strides forward in automating “day 2” operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Does a cloud-native database have to run on Kubernetes?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speaking of Kubernetes… When we talk about databases in the cloud, we’re really talking about stateful workloads requiring some kind of storage. But in the cloud world, stateful is painful. Data gravity is a real challenge – data may be hard to move due to regulations and laws, and moving it can be quite expensive. This results in a premium on keeping applications close to their data.&lt;/p&gt;

&lt;p&gt;The challenges only increase when we begin deploying containerized applications using Kubernetes, since it was not originally designed for stateful workloads. There’s an emerging push toward deploying databases to run on Kubernetes as well, in order to maximize development and operational efficiencies by running the entire stack on a single platform. What additional requirements does Kubernetes put on a cloud-native database?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Containerization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, the database must run in containers. This may sound obvious, but some work is required. Storage must be externalized, the memory and other computing resources must be tuned appropriately, and the application logs and metrics must be made available to infrastructure for monitoring and log aggregation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, we need to map the database’s storage needs onto Kubernetes constructs. At a minimum, each database node will make a persistent volume claim that Kubernetes can use to allocate a storage volume with appropriate capacity and I/O characteristics. Databases are typically deployed using Kubernetes StatefulSets, which help manage the mapping of storage volumes to pods and maintain a consistent, predictable identity.&lt;/p&gt;
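&lt;p&gt;To make this concrete, here is a minimal volume claim template of the kind a database StatefulSet might declare; the claim name, storage class, and size are illustrative, not project defaults:&lt;/p&gt;

```yaml
# Illustrative volumeClaimTemplates section of a StatefulSet spec.
# Kubernetes cuts one PersistentVolumeClaim per pod from this template,
# so each database node keeps its own volume across rescheduling.
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]   # a single node mounts the volume read-write
      storageClassName: standard       # assumed storage class name
      resources:
        requests:
          storage: 100Gi               # capacity sized for the workload
```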

&lt;h3&gt;
  
  
  &lt;strong&gt;Automated Operations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, we need tooling to manage and automate database operations, including installation and maintenance. This is typically implemented via the Kubernetes operator pattern. Operators are basically control loops that observe the state of Kubernetes resources and take actions to help achieve a desired state. In this way they are similar to Kubernetes built-in controllers, but with the key difference that they understand domain-specific state and thus help Kubernetes make better decisions.&lt;/p&gt;

&lt;p&gt;For example, the K8ssandra project uses &lt;a href="https://github.com/datastax/cass-operator" rel="noopener noreferrer"&gt;cass-operator&lt;/a&gt;, which defines a Kubernetes custom resource definition (CRD) called “CassandraDatacenter” to describe the desired state of each top-level failure domain of a Cassandra cluster. This provides a level of abstraction higher than dealing with StatefulSets or individual pods.&lt;/p&gt;
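&lt;p&gt;The control-loop idea can be sketched in a few lines of Go. This toy reconcile function compares desired and observed state and decides the next action, which is the essence of the operator pattern; the &lt;code&gt;DatacenterSpec&lt;/code&gt; type and the scaling messages are illustrative stand-ins, not cass-operator’s actual API:&lt;/p&gt;

```go
package main

import "fmt"

// Desired and observed state for a toy "datacenter" resource,
// a simplified stand-in for a CRD like CassandraDatacenter.
type DatacenterSpec struct{ Replicas int }
type DatacenterStatus struct{ ReadyReplicas int }

// reconcile compares observed state to desired state and returns
// the action needed to converge, mirroring an operator's control loop.
func reconcile(spec DatacenterSpec, status DatacenterStatus) string {
	switch {
	case status.ReadyReplicas < spec.Replicas:
		return fmt.Sprintf("scale up: add %d node(s)", spec.Replicas-status.ReadyReplicas)
	case status.ReadyReplicas > spec.Replicas:
		return fmt.Sprintf("scale down: remove %d node(s)", status.ReadyReplicas-spec.Replicas)
	default:
		return "in sync: nothing to do"
	}
}

func main() {
	spec := DatacenterSpec{Replicas: 3}
	for _, observed := range []DatacenterStatus{{1}, {3}, {4}} {
		fmt.Println(reconcile(spec, observed))
		// prints:
		//   scale up: add 2 node(s)
		//   in sync: nothing to do
		//   scale down: remove 1 node(s)
	}
}
```

&lt;p&gt;A real operator runs this comparison continuously against the cluster, re-queuing whenever the watched resources change, which is what lets it encode domain-specific knowledge the built-in controllers lack.&lt;/p&gt;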

&lt;p&gt;Kubernetes database operators typically help to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens during failovers? (pods, disks, networks)&lt;/li&gt;
&lt;li&gt;What happens when you scale out? (pod rescheduling)&lt;/li&gt;
&lt;li&gt;How are backups performed?&lt;/li&gt;
&lt;li&gt;How do we effectively detect and prevent failure?&lt;/li&gt;
&lt;li&gt;How is software upgraded? (rolling restarts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion and what’s next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A cloud-native database is one that is designed with cloud-native principles in mind, including scalability, elasticity, resiliency, observability, and automation. As we’ve seen with Cassandra, automation is often the final milestone to be achieved, but running databases in Kubernetes can actually help us progress toward this goal of automation.&lt;/p&gt;

&lt;p&gt;What’s next in the maturation of cloud-native databases? We’d love to hear your input as we continue to invent the future of this technology together.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
