<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: The Team @ Redpanda</title>
    <description>The latest articles on DEV Community by The Team @ Redpanda (@redpandadata).</description>
    <link>https://dev.to/redpandadata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F815304%2F42869667-db86-4e0f-beac-d3ce6f812dea.png</url>
      <title>DEV Community: The Team @ Redpanda</title>
      <link>https://dev.to/redpandadata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/redpandadata"/>
    <language>en</language>
    <item>
      <title>Using Bytewax to build an anomaly detection app</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 05 Oct 2022 15:42:15 +0000</pubDate>
      <link>https://dev.to/redpanda-data/using-bytewax-to-build-an-anomaly-detection-app-doj</link>
      <guid>https://dev.to/redpanda-data/using-bytewax-to-build-an-anomaly-detection-app-doj</guid>
      <description>&lt;p&gt;Highly scalable distributed processing has been a traditionally difficult task for any team to achieve. However, with technologies like Timely Dataflow and Redpanda, today it is easier than ever to build real-time fault-tolerant and easy-to-scale systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://timelydataflow.github.io/timely-dataflow/" rel="noopener noreferrer"&gt;Timely Dataflow&lt;/a&gt; is a low-latency cyclic dataflow computational model, meaning it allows you to build data-parallel systems that can be scaled up from one thread on your laptop to a distributed execution environment across a cluster of computers.&lt;/p&gt;

&lt;p&gt;A great use case for Timely Dataflow is detecting anomalies in real-time data. Redpanda — as an Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; API-compatible data store — enables us to easily create an application that can monitor data coming from multiple sources in real time.&lt;/p&gt;

&lt;p&gt;Live anomaly detection usually requires a pre-trained model, but for this project we opted for an online algorithm that learns as the data arrives.&lt;/p&gt;

&lt;p&gt;Bytewax is a Python-native binding to the Rust-based Timely Dataflow library, which allows us to quickly build powerful applications in the same language as our mock producer.&lt;/p&gt;

&lt;p&gt;Integrating Bytewax with Redpanda lets us harness the power of the Rust-based Timely Dataflow framework; combined with the robustness and developer-friendliness of Redpanda, this makes it quick to build real-time data processing systems.&lt;/p&gt;

&lt;p&gt;The main flow of the application can be described using the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/4I92CPSBeEDBpvukpU7ILM/534bd35bd499ce258e6d8e951bf274d8/anomaly_detection_redpanda_bytewax.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/4I92CPSBeEDBpvukpU7ILM/534bd35bd499ce258e6d8e951bf274d8/anomaly_detection_redpanda_bytewax.png" alt="anomaly detection redpanda bytewax"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bytewax reads from the Redpanda topic, calculates the anomalies and produces the results to a different topic. As mentioned above, there is no distinction between Bytewax and Timely Dataflow in the diagram because Bytewax is a wrapper around the Rust Timely Dataflow library.&lt;/p&gt;

&lt;p&gt;For the sensor data, we use a mock data generator that generates random air quality values and pushes them into a topic in Redpanda.&lt;/p&gt;

&lt;p&gt;To calculate the outliers, we will use a five-second window to aggregate the data coming from the sensors, detect anomalies based on these window averages, and push them into a new Redpanda topic. Alerting can be easily set up using the anomaly topic (watch out for a later post on this).&lt;/p&gt;

&lt;p&gt;You can access the code used in this demo &lt;a href="https://github.com/redpanda-data-blog/2022-bytewax-redpanda-air-quality-monitoring"&gt;in this GitHub repo&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  The tech
&lt;/h2&gt;

&lt;p&gt;To build our real-time monitoring application, we will use Redpanda for storage and Bytewax for the anomaly detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redpanda
&lt;/h3&gt;

&lt;p&gt;Redpanda is a source-available, Kafka API-compatible data store. This API compatibility allows us to use it in place of Kafka very quickly. In our case, the producer- and consumer-side code can stay exactly the same as if it were targeting Kafka!&lt;/p&gt;

&lt;h3&gt;
  
  
  Bytewax
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/bytewax/bytewax" rel="noopener noreferrer"&gt;Bytewax&lt;/a&gt; is an up-and-coming data processing framework that is built on top of Timely Dataflow, which is a cyclic dataflow computational model. At a high-level, dataflow programming is a programming paradigm where program execution is conceptualized as data flowing through a series of operator based steps. The Timely Dataflow library is written in Rust which makes it blazingly fast and easy to use due to the language's great Python bindings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up Redpanda
&lt;/h2&gt;

&lt;p&gt;To make setting up Redpanda for this project super easy, we can use the provided docker-compose.yml file. In this file, we define a Redpanda service as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;redpanda&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.vectorized.io/vectorized/redpanda:v22.1.4&lt;/span&gt;
  &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redpanda&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redpanda start&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--overprovisioned&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--smp &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--memory 1G&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--reserve-memory 0M&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--node-id &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--check=false&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--kafka-addr 0.0.0.0:9092&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--advertise-kafka-addr redpanda:9092&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--pandaproxy-addr 0.0.0.0:8082&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--advertise-pandaproxy-addr redpanda:8082&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--set redpanda.enable_transactions=true&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--set redpanda.enable_idempotence=true&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--set redpanda.auto_create_topics_enabled=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start a Redpanda instance which we can access on port &lt;code&gt;9092&lt;/code&gt;.&lt;br&gt;
Let's create this service first using &lt;code&gt;docker-compose up -d redpanda&lt;/code&gt;. This will start the container in the background.&lt;/p&gt;

&lt;p&gt;After the container is started we can run &lt;code&gt;docker exec -it redpanda /bin/bash&lt;/code&gt; to access a shell inside the container, which allows us to use &lt;code&gt;rpk&lt;/code&gt;, the official bundled CLI application for Redpanda clusters.&lt;/p&gt;

&lt;p&gt;The CLI allows us to create topics with a simple command, so let's create two. One will store the data coming from our mock sensors, and one will store the calculated anomalies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rpk create topic &lt;span class="nt"&gt;--topic&lt;/span&gt; air-quality &lt;span class="nt"&gt;--brokers&lt;/span&gt; 127.0.0.1.9092
rpk create topic &lt;span class="nt"&gt;--topic&lt;/span&gt; air-quality-anomalies &lt;span class="nt"&gt;--brokers&lt;/span&gt; 127.0.0.1.9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify, we can list the topics with &lt;code&gt;rpk topic list --brokers 127.0.0.1:9092&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In an environment where we are unable to access the cluster manually to create topics, we can automate their creation from Python using the Kafka Admin API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;admin_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KafkaAdminClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"air-quality-producer"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"air-quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"air-quality-anomalies"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Check if topics already exist first
&lt;/span&gt;&lt;span class="n"&gt;existing_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;admin_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_topics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;existing_topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;admin_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_topics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NewTopic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_partitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replication_factor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Ingestion
&lt;/h2&gt;

&lt;p&gt;The script we will use to generate the data is located at &lt;code&gt;producer/main.py&lt;/code&gt;. The Kafka producer's configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1:9092"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script will start five asynchronous workers that generate data every three seconds (configurable, but a short, predictable interval helps with the demo) with the following minimal schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2022-07-15 11:34:32.1134000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
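
&lt;p&gt;As a rough illustration (the real producer lives in the repo, and &lt;code&gt;make_reading&lt;/code&gt; here is a hypothetical name), a single mock reading matching this schema could be generated like so:&lt;/p&gt;

```python
import datetime
import json
import random

def make_reading():
    # One mock air-quality reading matching the schema above.
    return {
        "timestamp": str(datetime.datetime.now()),
        "value": random.randint(0, 100),
    }

# Each worker would serialize a reading like this before sending it.
payload = json.dumps(make_reading()).encode("utf-8")
```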



&lt;p&gt;All five workers produce records to the same topic, so to identify which sensor a record came from, we attach a key to each record produced into Redpanda. This is done by adding a &lt;code&gt;key&lt;/code&gt; argument to the &lt;code&gt;producer.send&lt;/code&gt; function call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"air-quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sensor_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to test the application outside of Docker we can run the producer with the &lt;code&gt;python producer/main.py&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;After a few seconds data should be flowing into Redpanda.&lt;br&gt;
The output of the script will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:06.799306', 'value': 95}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:06.910629', 'value': 98}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:07.020454', 'value': 49}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:07.127894', 'value': 59}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:07.243341', 'value': 81}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:11.812876', 'value': 13}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:11.913514', 'value': 15}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:12.023514', 'value': 43}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:12.132398', 'value': 39}, sleeping for 3 seconds
Sent data to Redpanda: {'timestamp': '2022-07-18 16:20:12.247009', 'value': 38}, sleeping for 3 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify that the data is arriving in Redpanda, we can inspect the topic from inside the container using the &lt;code&gt;rpk&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; redpanda /bin/bash
rpk topic consume air-quality &lt;span class="nt"&gt;--brokers&lt;/span&gt; 127.0.0.1:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will show us all the records in the topic so far.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "air-quality",
  "key": "Sensor3",
  "value": "{\"timestamp\": \"2022-07-18 16:20:17.024915\", \"value\": 4}",
  "timestamp": 1658154017025,
  "partition": 0,
  "offset": 257
}
{
  "topic": "air-quality",
  "key": "Sensor4",
  "value": "{\"timestamp\": \"2022-07-18 16:20:17.133643\", \"value\": 21}",
  "timestamp": 1658154017133,
  "partition": 0,
  "offset": 258
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fake data producer application is also part of the &lt;code&gt;docker-compose.yml&lt;/code&gt; configuration file, so when you start all containers at once, you won't have to launch the script manually. In the &lt;code&gt;entrypoint.sh&lt;/code&gt; script, we poll Redpanda through its exposed admin API, and once it reports &lt;em&gt;ready&lt;/em&gt;, the producer app starts generating data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9644/v1/status/ready&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ready&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done
&lt;/span&gt;python /app/main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Machine learning with Bytewax
&lt;/h2&gt;

&lt;p&gt;Let's do some machine learning using our data from the Redpanda topic with Bytewax!&lt;/p&gt;

&lt;p&gt;The code located in &lt;code&gt;consumer/main.py&lt;/code&gt; is the consumer-side code that consumes the data from the Redpanda topic, runs the aggregation on a five-second window, calculates the anomalies using an online (unsupervised) learning algorithm, and pushes them into the Redpanda anomaly topic.&lt;/p&gt;

&lt;p&gt;The Consumer configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"air-quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auto_offset_reset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"earliest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;auto_offset_reset="earliest"&lt;/code&gt; option makes the consumer start from the beginning of the topic. With some Bytewax helper functions, we sort our incoming data and group it into five-second windows before returning them as a generator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="c1"&gt;# Ensure inputs are sorted by timestamp
&lt;/span&gt;&lt;span class="n"&gt;sorted_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sorted_window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;get_records&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# All inputs within a tumbling window are part of the same epoch.
&lt;/span&gt;&lt;span class="n"&gt;tumbling_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tumbling_epoch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sorted_inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
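
&lt;p&gt;To see what these helpers accomplish, here is a minimal pure-Python stand-in that sorts records by timestamp and buckets them into five-second tumbling windows. It assumes already-parsed &lt;code&gt;datetime&lt;/code&gt; timestamps and is not the actual Bytewax implementation:&lt;/p&gt;

```python
import datetime
from itertools import groupby

WINDOW_SECONDS = 5

def tumbling_windows(records):
    # Sort records by timestamp, then bucket them into consecutive
    # five-second epochs: a pure-Python stand-in for sorted_window
    # plus tumbling_epoch.
    ordered = sorted(records, key=lambda r: r["timestamp"])

    def epoch(record):
        return int(record["timestamp"].timestamp()) // WINDOW_SECONDS

    return [list(group) for _, group in groupby(ordered, key=epoch)]

t0 = datetime.datetime(2022, 7, 18, 16, 20, 0)
records = [
    {"timestamp": t0 + datetime.timedelta(seconds=s), "value": v}
    for s, v in [(7, 13), (1, 95), (3, 49)]
]
windows = tumbling_windows(records)
# windows holds two epochs: values 95 and 49 in the first, 13 in the second.
```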



&lt;p&gt;The high-level flow of the Timely Dataflow pipeline looks like this in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a dataflow
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Group by sensor name
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_by_sensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Calculate the rolling average of Air Quality values
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculate_avg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Calculate anomaly score in tumbling window of 5 seconds
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stateful_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;step_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"anomaly_detector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AnomalyDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_trees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AnomalyDetector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Annotate with anomaly
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotate_with_anomaly&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Send to anomaly Redpanda
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we instantiate a &lt;code&gt;Dataflow&lt;/code&gt; object, then we sequentially apply the following transformations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Group the data by sensor name&lt;/li&gt;
&lt;li&gt;Calculate the rolling average of Air Quality values&lt;/li&gt;
&lt;li&gt;Calculate anomaly score in this tumbling window of 5 seconds&lt;/li&gt;
&lt;li&gt;Annotate with an anomaly flag&lt;/li&gt;
&lt;li&gt;Send the data to the anomaly Redpanda topic&lt;/li&gt;
&lt;/ol&gt;
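
&lt;p&gt;The &lt;code&gt;group_by_sensor&lt;/code&gt; and &lt;code&gt;calculate_avg&lt;/code&gt; helpers are defined in the repo; as a rough sketch of what such steps might do (not the exact implementation), they could look like this:&lt;/p&gt;

```python
def group_by_sensor(epoch_records):
    # Collect each sensor's values within one window, keyed by sensor name.
    grouped = {}
    for record in epoch_records:
        grouped.setdefault(record["sensor_name"], []).append(record["value"])
    return grouped

def calculate_avg(grouped):
    # Average each sensor's values for the window, mirroring the
    # "aiq_avg" field seen in the output records.
    return {
        sensor: {"aiq_avg": [sum(values) / len(values)]}
        for sensor, values in grouped.items()
    }

window = [
    {"sensor_name": "Sensor1", "value": 90},
    {"sensor_name": "Sensor1", "value": 100},
    {"sensor_name": "Sensor2", "value": 40},
]
averages = calculate_avg(group_by_sensor(window))
```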

&lt;p&gt;The Producer object is configured like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual anomaly detection is done in the &lt;code&gt;AnomalyDetector&lt;/code&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AnomalyDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HalfSpaceTrees&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;normalized_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"aiq_avg"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;learn_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normalized_value&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score_one&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normalized_value&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Half-space trees are an online variant of isolation forests. Before feeding the data to the anomaly detector, we normalize the values to the range [0, 1], as the algorithm expects inputs in this range.&lt;/p&gt;
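
&lt;p&gt;Since the mock sensor values fall between 0 and 100, a simple division suffices. Here is a small sketch of such a normalization step (illustrative, not the repo's exact code):&lt;/p&gt;

```python
def normalize(aiq_value, max_value=100.0):
    # Map a raw reading (0-100 in this demo) onto [0, 1], clamping
    # any stray out-of-range values.
    return min(max(float(aiq_value) / max_value, 0.0), 1.0)

scores = [normalize(v) for v in (94, 6, 150)]
# scores is [0.94, 0.06, 1.0]
```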

&lt;p&gt;Running the script with &lt;code&gt;python consumer/main.py&lt;/code&gt; should print some logs that look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sending anomalous data to Redpanda: {'sensor_name': ['Sensor3'], 'timestamp': '2022-07-14 19:16:17.583034', 'aiq_avg': [94.0], 'score': 0.7786666666666666, 'anomaly': True} with key b'Sensor3'
Sending anomalous data to Redpanda: {'sensor_name': ['Sensor5'], 'timestamp': '2022-07-14 19:16:17.807478', 'aiq_avg': [6.0], 'score': 0.7253333333333334, 'anomaly': True} with key b'Sensor5'
Sending anomalous data to Redpanda: {'sensor_name': ['Sensor1'], 'timestamp': '2022-07-14 19:16:22.370494', 'aiq_avg': [91.0], 'score': 0.7413333333333334, 'anomaly': True} with key b'Sensor1'
Sending anomalous data to Redpanda: {'sensor_name': ['Sensor2'], 'timestamp': '2022-07-14 19:16:22.476654', 'aiq_avg': [53.0], 'score': 0.7573333333333333, 'anomaly': True} with key b'Sensor2'
Sending anomalous data to Redpanda: {'sensor_name': ['Sensor4'], 'timestamp': '2022-07-14 19:16:22.697965', 'aiq_avg': [96.0], 'score': 0.8213333333333334, 'anomaly': True} with key b'Sensor4'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few calculations we can check the target Redpanda topic to see if the data is arriving correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; redpanda /bin/bash

rpk topic consume air-quality-anomalies &lt;span class="nt"&gt;--brokers&lt;/span&gt; 127.0.0.1:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The records in this topic should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "air-quality-anomalies",
  "key": "Sensor1",
  "value": "\"{\\\"sensor_name\\\": [\\\"Sensor1\\\"], \\\"timestamp\\\": \\\"2022-07-18 16:20:06.799306\\\", \\\"aiq_avg\\\": [95.0], \\\"score\\\": 0.8693333333333333, \\\"anomaly\\\": true}\"",
  "timestamp": 1658154557943,
  "partition": 0,
  "offset": 282
}
{
  "topic": "air-quality-anomalies",
  "key": "Sensor2",
  "value": "\"{\\\"sensor_name\\\": [\\\"Sensor2\\\"], \\\"timestamp\\\": \\\"2022-07-18 16:20:11.913514\\\", \\\"aiq_avg\\\": [15.0], \\\"score\\\": 0.9226666666666666, \\\"anomaly\\\": true}\"",
  "timestamp": 1658154557944,
  "partition": 0,
  "offset": 283
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This data means that, in the five seconds following the given timestamp, the average air quality value for that sensor was considered anomalous compared to all of the previous rolling averages.&lt;/p&gt;
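
&lt;p&gt;The per-sensor rolling average described above can be sketched in plain Python. This is a simplified stand-in for the demo's Bytewax windowing logic: it uses a fixed-size window rather than a five-second one, and all names are illustrative.&lt;/p&gt;

```python
from collections import defaultdict, deque

# Simplified stand-in for the demo's windowed rolling average: keep the
# most recent readings per sensor and average them. (The real demo uses
# Bytewax's time-based windowing; a fixed-size deque is only illustrative.)
WINDOW_SIZE = 5
windows = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))

def update_average(sensor_name, value):
    """Record a reading for a sensor and return its current rolling average."""
    window = windows[sensor_name]
    window.append(value)
    return sum(window) / len(window)
```

&lt;p&gt;Feeding &lt;code&gt;Sensor1&lt;/code&gt; the readings 10, 20, and 30 yields averages of 10.0, 15.0, and 20.0; it is these per-sensor averages that get scored by the anomaly detector.&lt;/p&gt;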

&lt;h2&gt;
  
  
  Running the demo
&lt;/h2&gt;

&lt;p&gt;To run the demo end to end, all you have to do is run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start our Redpanda container as well as our Python producer and consumer scripts. The Python containers will wait until Redpanda is ready to accept data, after which you should see the producer pushing mock air quality data into the &lt;code&gt;air-quality&lt;/code&gt; topic. The Bytewax consumer will also get to work, calculating the rolling average of the inputs and running the unsupervised anomaly detection algorithm. The output will be sent to the &lt;code&gt;air-quality-anomalies&lt;/code&gt; topic.&lt;/p&gt;
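
&lt;p&gt;The producer side of this flow can be sketched roughly as follows. This is a hypothetical outline, not the demo's actual code: the function names are invented, and it assumes the &lt;code&gt;kafka-python&lt;/code&gt; client, though any Kafka-compatible client works against Redpanda.&lt;/p&gt;

```python
import json
import random
from datetime import datetime

def make_reading():
    """Generate one mock air quality reading (illustrative schema)."""
    return {"timestamp": str(datetime.now()), "value": random.randint(0, 100)}

def run_producer(bootstrap="localhost:9092", topic="air-quality"):
    """Publish a mock reading to Redpanda every few seconds (hypothetical sketch)."""
    import time
    from kafka import KafkaProducer  # assumes the kafka-python package

    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    )
    while True:
        reading = make_reading()
        producer.send(topic, reading)
        print(f"Sent data to Redpanda: {reading}, sleeping for 3 seconds")
        time.sleep(3)
```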

&lt;p&gt;If you want to see the logs of the containers, you can use the &lt;code&gt;docker-compose logs&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;When the services start successfully, they will log output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;redpanda  | INFO  2022-08-03 08:51:28,298 &lt;span class="o"&gt;[&lt;/span&gt;shard 0] redpanda::main - application.cc:1021 - Successfully started Redpanda!
consumer  | ++ curl &lt;span class="nt"&gt;-s&lt;/span&gt; redpanda:9644/v1/status/ready
producer  | ++ curl &lt;span class="nt"&gt;-s&lt;/span&gt; redpanda:9644/v1/status/ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The producer will start logging the data it sends to Redpanda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;producer  | Sent data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:32.740664'&lt;/span&gt;, &lt;span class="s1"&gt;'value'&lt;/span&gt;: 14&lt;span class="o"&gt;}&lt;/span&gt;, sleeping &lt;span class="k"&gt;for &lt;/span&gt;3 seconds
producer  | Sent data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:32.844237'&lt;/span&gt;, &lt;span class="s1"&gt;'value'&lt;/span&gt;: 16&lt;span class="o"&gt;}&lt;/span&gt;, sleeping &lt;span class="k"&gt;for &lt;/span&gt;3 seconds
producer  | Sent data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:32.947056'&lt;/span&gt;, &lt;span class="s1"&gt;'value'&lt;/span&gt;: 83&lt;span class="o"&gt;}&lt;/span&gt;, sleeping &lt;span class="k"&gt;for &lt;/span&gt;3 seconds
producer  | Sent data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:33.050752'&lt;/span&gt;, &lt;span class="s1"&gt;'value'&lt;/span&gt;: 76&lt;span class="o"&gt;}&lt;/span&gt;, sleeping &lt;span class="k"&gt;for &lt;/span&gt;3 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few seconds (once we have enough data to fill the rolling average window), we should see the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;consumer  | Sending anomalous data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'sensor_name'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Sensor4'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:53.065961'&lt;/span&gt;, &lt;span class="s1"&gt;'aiq_avg'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;18.0], &lt;span class="s1"&gt;'score'&lt;/span&gt;: 0.72, &lt;span class="s1"&gt;'anomaly'&lt;/span&gt;: True&lt;span class="o"&gt;}&lt;/span&gt; with key b&lt;span class="s1"&gt;'Sensor4'&lt;/span&gt;
consumer  | Sending anomalous data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'sensor_name'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Sensor2'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:57.855306'&lt;/span&gt;, &lt;span class="s1"&gt;'aiq_avg'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;62.0], &lt;span class="s1"&gt;'score'&lt;/span&gt;: 0.7573333333333333, &lt;span class="s1"&gt;'anomaly'&lt;/span&gt;: True&lt;span class="o"&gt;}&lt;/span&gt; with key b&lt;span class="s1"&gt;'Sensor2'&lt;/span&gt;
consumer  | Sending anomalous data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'sensor_name'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Sensor3'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:51:57.961218'&lt;/span&gt;, &lt;span class="s1"&gt;'aiq_avg'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;20.0], &lt;span class="s1"&gt;'score'&lt;/span&gt;: 0.752, &lt;span class="s1"&gt;'anomaly'&lt;/span&gt;: True&lt;span class="o"&gt;}&lt;/span&gt; with key b&lt;span class="s1"&gt;'Sensor3'&lt;/span&gt;
consumer  | Sending anomalous data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'sensor_name'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Sensor1'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:52:02.782120'&lt;/span&gt;, &lt;span class="s1"&gt;'aiq_avg'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;5.0], &lt;span class="s1"&gt;'score'&lt;/span&gt;: 0.7893333333333333, &lt;span class="s1"&gt;'anomaly'&lt;/span&gt;: True&lt;span class="o"&gt;}&lt;/span&gt; with key b&lt;span class="s1"&gt;'Sensor1'&lt;/span&gt;
consumer  | Sending anomalous data to Redpanda: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'sensor_name'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Sensor2'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;: &lt;span class="s1"&gt;'2022-08-03 08:52:02.860648'&lt;/span&gt;, &lt;span class="s1"&gt;'aiq_avg'&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;66.0], &lt;span class="s1"&gt;'score'&lt;/span&gt;: 0.736, &lt;span class="s1"&gt;'anomaly'&lt;/span&gt;: True&lt;span class="o"&gt;}&lt;/span&gt; with key b&lt;span class="s1"&gt;'Sensor2'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you can see, creating a minimal, real-time, highly scalable data processing pipeline is a breeze with Redpanda and Bytewax.&lt;/p&gt;

&lt;p&gt;Redpanda's Kafka compatibility allows us to integrate any tool built with Kafka in mind, like Bytewax. This combination of a fault-tolerant streaming data platform and a high-performance data processing framework is a powerful tool not just for hobby projects, but also for large-scale production systems.&lt;/p&gt;

&lt;p&gt;Now that you know how to use these tools together, you can apply this knowledge to any number of applications you dream up in the future. &lt;/p&gt;

&lt;p&gt;To learn more about Redpanda, check out &lt;a href="https://docs.redpanda.com/docs/home"&gt;the documentation here&lt;/a&gt; or join &lt;a href="https://redpanda.com/slack" rel="noopener noreferrer"&gt;the Redpanda Community on Slack&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>programming</category>
      <category>beginners</category>
      <category>kafka</category>
    </item>
    <item>
      <title>Why we built our streaming data platform in C++</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 28 Sep 2022 17:09:44 +0000</pubDate>
      <link>https://dev.to/redpanda-data/why-we-built-our-streaming-data-platform-in-c-2913</link>
      <guid>https://dev.to/redpanda-data/why-we-built-our-streaming-data-platform-in-c-2913</guid>
      <description>&lt;p&gt;We're reinventing and expanding what was previously possible with data streaming by building a platform from the ground up for cloud-native computing platforms and by designing a system that’s easy to use, even for non-experts. &lt;/p&gt;

&lt;p&gt;For years, inefficiencies in infrastructure have resulted in a significant amount of compute waste, but hardware is fundamentally different today than it was a decade ago, as &lt;a href="https://www.usenix.org/publications/loginonline/jurassic-cloud"&gt;this article by Avishai Ish-Shalom&lt;/a&gt; eloquently explains. Disk speeds, for example, grew by 1,000 times over the past 10 years. Processing capabilities have also significantly increased alongside developments in multi-core processing.&lt;/p&gt;

&lt;p&gt;Despite these critical improvements in computing hardware, today's software hasn't caught up - it's still engineered for the computing paradigm of a decade ago. That sets up a disparity between hardware and software that’s difficult to reconcile.&lt;/p&gt;

&lt;p&gt;At Redpanda, we firmly believe that the only true platform is the hardware, so we asked ourselves: if we were to design software for modern hardware, what could we do differently? The answer is Redpanda.&lt;/p&gt;

&lt;p&gt;Redpanda differs from other projects of its kind by streamlining the complexity of the program and by presenting a simple interface to the user. We cannot entirely remove complexity from the system, but we can move it around. Because our developers are the experts, it makes more sense for us to own the complexity rather than push it down to the end user. &lt;/p&gt;

&lt;p&gt;We do this by focusing on two core principles to shift that complexity: Redpanda needs to function well without constant human attention, and the results and output need to be predictable. &lt;/p&gt;

&lt;h2&gt;
  
  
  The advantages of C++ 
&lt;/h2&gt;

&lt;p&gt;For Redpanda to pull this off, we chose to use a programming language that both allows direct communication with hardware and has predictable latencies. &lt;/p&gt;

&lt;p&gt;We wrote the early Redpanda prototypes in several different programming languages, but only C++ gave us the ability to create a developer and user experience aligned with our goals. It allows Redpanda to extract every ounce of performance from the available hardware while also maintaining predictability. &lt;/p&gt;

&lt;p&gt;Most programmers only view performance in terms of latency averages, and we think that's a misleading metric. Latency is measured in percentiles, and there’s no way to measure the average of a percentile. It’s math that doesn’t tell a useful story.&lt;/p&gt;

&lt;p&gt;Instead of focusing on the latencies at 99.9%, 99.99%, or even 99.999%, we focus on the entire 0-100% latency distribution. It’s not enough to look at the experience of 99.999% of the transactions – we need to fix the problems that show up at the 100th percentile. When a system is processing millions of messages a second, the difference between 99.999% and 100% matters. C++ provides a high level of tail latency predictability. &lt;/p&gt;
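
&lt;p&gt;A toy calculation (with invented numbers, not Redpanda measurements) shows why an average can hide the tail that the high percentiles expose:&lt;/p&gt;

```python
import statistics

# Toy latency sample in milliseconds: 990 fast requests plus 10 slow
# outliers. The numbers are invented purely to illustrate the point.
latencies = [1.0] * 990 + [250.0] * 10

mean = statistics.mean(latencies)                      # 3.49 ms: looks healthy
p999 = sorted(latencies)[int(0.999 * len(latencies))]  # 250.0 ms: the real tail

print(f"mean={mean:.2f} ms, p99.9={p999:.1f} ms")
```

&lt;p&gt;The average suggests a fast system, while the tail shows some requests taking hundreds of milliseconds - which is exactly why we look at the whole distribution.&lt;/p&gt;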

&lt;p&gt;Another benefit of C++ is its stable and mature repository of libraries. Redpanda only uses &lt;a href="https://github.com/redpanda-data/redpanda/blob/dev/cmake/oss.cmake.in"&gt;a few dozen libraries&lt;/a&gt;, while other comparatively sized projects use hundreds of dependent libraries. Having so many dependencies weakens the security posture of the software. We avoid vulnerabilities by relying on C++ libraries that have been battle-tested for decades.&lt;/p&gt;

&lt;p&gt;C++ also allows us to control as much as possible from the platform. Through the efficiency of our own code, combined with the amazing &lt;a href="https://seastar.io"&gt;Seastar framework&lt;/a&gt; and other best-in-class libraries, Redpanda speaks directly to the hardware. It only depends on the Linux kernel to launch the process, after which Redpanda is &lt;strong&gt;very&lt;/strong&gt; deterministic in terms of performance, runtime characteristics, memory utilization, and CPU speed. We own the entire end-to-end experience, which provides safety and allows Redpanda to build impactful products. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building the best present and future for streaming data
&lt;/h2&gt;

&lt;p&gt;Redpanda creates new possibilities for developers, much as airplanes did for travelers. Ships are a slower mode of transportation, even if they're reliable, and even today, you can take a passenger ship from New York to London. Transcontinental travel relied on passenger ships for centuries, but when airplanes came into existence, they fundamentally changed the way people traveled. In doing so, air travel created entire industries that people had never thought of before.&lt;/p&gt;

&lt;p&gt;That’s the impact of Redpanda on where the streaming industry is headed. &lt;/p&gt;

&lt;p&gt;We discovered that when you give programmers a new infrastructure primitive like Redpanda, something that's fast, predictable, and geared towards zero data loss, it expands the realm of possibilities about what they can do. Although Redpanda was initially designed to be a replacement for Kafka, it has started to transition into operational workloads. &lt;/p&gt;

&lt;p&gt;For us, discovering new ways that developers are using Redpanda is probably the most exciting aspect of our job. For example, a satellite currently in orbit is running Redpanda, and &lt;a href="https://www.youtube.com/watch?v=3td64fGIT8U"&gt;the Alpaca platform&lt;/a&gt; uses Redpanda to trade millions of dollars in securities every single day. Redpanda will soon power the process of monitoring both a pregnant mother's heartbeat and her baby's vital signs during labor. &lt;/p&gt;

&lt;p&gt;Redpanda Data is the only company that can cross this chasm and move to the foreground of operational workloads. &lt;/p&gt;

&lt;p&gt;Redpanda expands the toolset for developers and crosses multiple computing paradigms, allowing us to expand what's possible in software development and operational workloads. &lt;/p&gt;

&lt;p&gt;While we couldn't have imagined this when we started, we can't wait to hear what developers are going to build tomorrow. Take Redpanda &lt;a href="https://redpanda.com/try-redpanda"&gt;for a test drive today&lt;/a&gt;, and introduce yourself to &lt;a href="https://redpanda.com/slack"&gt;the Redpanda Community&lt;/a&gt; in Slack. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Testcontainers &amp; Zerocode: An integration testing tutorial</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 21 Sep 2022 17:20:03 +0000</pubDate>
      <link>https://dev.to/redpanda-data/testcontainers-zerocode-an-integration-testing-tutorial-212b</link>
      <guid>https://dev.to/redpanda-data/testcontainers-zerocode-an-integration-testing-tutorial-212b</guid>
      <description>&lt;p&gt;When setting up a streaming application, especially if you’re new to streaming data platforms like Redpanda, you’ll want to test that your application is set up correctly. You can do this using integration tests.&lt;/p&gt;

&lt;p&gt;Integration tests check your producers and consumers against your data stream. They push test data through your application, allowing you to see if your architecture is correctly set up and working as expected. &lt;/p&gt;

&lt;p&gt;Below, I discuss two popular libraries for integration testing: Testcontainers and Zerocode. I use these when I need to run integration tests, and nearly every developer I know uses them, as well. &lt;/p&gt;

&lt;p&gt;In this post, you’ll learn how to run integration tests with them, too, so you can ensure your streaming application is properly configured. You can find the resources for the demos below in &lt;a href="https://github.com/redpanda-data-blog/2022-integration-testing-tools" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  The 2 best integration testing tools for streaming data stacks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Testcontainers
&lt;/h3&gt;

&lt;p&gt;Testcontainers is a Java library that you can use to test anything that runs in a Docker container. You can use it to do integration testing on your data stream. &lt;/p&gt;

&lt;h4&gt;
  
  
  1.1 Prerequisites
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.testcontainers.org/" rel="noopener noreferrer"&gt;Testcontainers&lt;/a&gt; can only be used within the Java ecosystem. For this reason, you will need to import it as a dependency. You can do this with Maven as shown here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.testcontainers&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;testcontainers&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.17.2&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;test&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check out the latest dependencies in Maven’s &lt;a href="https://mvnrepository.com/artifact/org.testcontainers/testcontainers" rel="noopener noreferrer"&gt;central repository&lt;/a&gt;. You can also access the complete demo we’re about to walk through &lt;a href="https://github.com/redpanda-data-blog/2022-integration-testing-tools/tree/master/testcontainers" rel="noopener noreferrer"&gt;in this GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, you can move on to setting up your producer and consumer for testing.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.2 Producer and consumer setup
&lt;/h4&gt;

&lt;p&gt;Typical integration tests will include a producer that creates and sends an event. To do this with Redpanda, you can use the Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; API since Redpanda is API-compatible with Kafka.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
       &lt;span class="nc"&gt;ImmutableMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
               &lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BOOTSTRAP_SERVERS_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;bootstrapServers&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CLIENT_ID_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="no"&gt;UUID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;randomUUID&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
       &lt;span class="o"&gt;),&lt;/span&gt;
       &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringSerializer&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
       &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringSerializer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"testcontainers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"redpanda"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will also need a consumer that consumes the event from the same topic as your producer. Again, we can set this up using the Kafka API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
       &lt;span class="nc"&gt;ImmutableMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
               &lt;span class="nc"&gt;ConsumerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BOOTSTRAP_SERVERS_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;bootstrapServers&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="nc"&gt;ConsumerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GROUP_ID_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="s"&gt;"tc-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="no"&gt;UUID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;randomUUID&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
               &lt;span class="nc"&gt;ConsumerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;AUTO_OFFSET_RESET_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
               &lt;span class="s"&gt;"earliest"&lt;/span&gt;
       &lt;span class="o"&gt;),&lt;/span&gt;
       &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
       &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;subscribe&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;ConsumerRecords&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;poll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMillis&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  1.3 Redpanda node setup
&lt;/h4&gt;

&lt;p&gt;Since we are running this test against Redpanda, you will also need to set up a Redpanda node. This is where Testcontainers comes into the picture. It allows you to create throwaway instances of the node, which are destroyed when the tests finish running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Before&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;redpanda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RedpandaContainer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vectorized/redpanda:v22.1.4"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;redpanda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above snippet, I tell Testcontainers to pull down the Redpanda Docker image and start the container behind the scenes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker container &lt;span class="nb"&gt;ls
&lt;/span&gt;CONTAINER ID   IMAGE                            COMMAND                  CREATED         STATUS         PORTS                                                                                         NAMES
34a719219ff9   vectorized/redpanda:v22.1.4      &lt;span class="s2"&gt;"sh -c 'while [ ! -f..."&lt;/span&gt;   7 minutes ago   Up 7 minutes   8081-8082/tcp, 9644/tcp, 0.0.0.0:49167-&amp;gt;9092/tcp, :::49167-&amp;gt;9092/tcp                          zealous_elion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, Testcontainers created the Redpanda container. Next, you’ll define your test as a regular JUnit test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="nd"&gt;@Test&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;testUsage&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;testRedpandaFunctionality&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redpanda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHost&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;":"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="n"&gt;redpanda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMappedPort&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9092&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You then run the test using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mvn &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a test fails, you will see this printed in the log under the Failures and Errors sections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your test is successful, you will instead see a 0 in the Failures and Errors sections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you are done testing, you can stop the container in a teardown method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@After&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;tearDown&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;redpanda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it! You’ve successfully run an integration test with Testcontainers. &lt;/p&gt;

&lt;p&gt;Next, we’ll move on to another popular integration testing tool. &lt;/p&gt;

&lt;h3&gt;
  
  
  2. Zerocode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/authorjapps/zerocode" rel="noopener noreferrer"&gt;Zerocode&lt;/a&gt; is an open-source Java test automation framework that uses a declarative style of testing. In declarative testing, you don't write code; instead, you declare scenarios that describe each step of a test in a JSON or YAML file. The Zerocode framework then interprets the scenario and executes the instructions that you specify via a custom DSL. Zerocode can be used for end-to-end testing of your data stream.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1 Prerequisites
&lt;/h4&gt;

&lt;p&gt;Zerocode is a Java library, so it can only be used in the Java ecosystem. You can get it from &lt;a href="https://mvnrepository.com/artifact/org.jsmart/zerocode-tdd" rel="noopener noreferrer"&gt;this central repo&lt;/a&gt; and declare it as a dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.jsmart&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;zerocode-tdd&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.3.28&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also check out the official GitHub repository of &lt;a href="https://github.com/authorjapps/zerocode" rel="noopener noreferrer"&gt;the Zerocode framework here&lt;/a&gt;, and you can find the full demo I’m about to walk you through &lt;a href="https://github.com/redpanda-data-blog/2022-integration-testing-tools/tree/master/zerocode" rel="noopener noreferrer"&gt;on GitHub here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.2 Producer and consumer setup
&lt;/h4&gt;

&lt;p&gt;As I did in my integration testing with Testcontainers, I also need a producer to create events with Zerocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"produce_test_message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka-topic:test-topic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"produce"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"recordType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"JSON"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"records"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${RANDOM.NUMBER}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hello Redpanda"&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"assertions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ok"&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above snippet, I declare what our producer should do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;name&lt;/strong&gt; - The scenario step name. This can be anything you want. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;url&lt;/strong&gt; - Specifies the Redpanda topic via the kafka-topic property and tells the producer which topic events should be sent to (Note: Although there is no Redpanda keyword here, kafka-topic will work with Redpanda).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;method&lt;/strong&gt; - Tells Zerocode to create a Redpanda producer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;request&lt;/strong&gt; - Specifies data that should be produced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;recordType&lt;/strong&gt; - The type of records to be produced/consumed. In this example, it's JSON.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;assertions&lt;/strong&gt; - Checks the execution response. In this example, we are verifying that producing the event was successful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer declared above will send one event (with a JSON payload) whose value is “Hello Redpanda”. You then need to consume that event and check the payload. For that reason, I declare the consumer as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"consume_test_message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka-topic:test-topic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"consume"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"consumerLocalConfigs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"recordType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"JSON"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"retry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"delay"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"validators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"records[0].value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hello Redpanda"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;url&lt;/strong&gt; - Specifies the topic (via kafka-topic keyword) to consume from. This should be the same as the url you set in your producer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;method&lt;/strong&gt; - Tells Zerocode to create a Redpanda consumer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;retry&lt;/strong&gt; - Sets a max number of retries and the delay between retries in case the consumer did not find any events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other keywords in our consumer are the same as in the producer. In the validators block, you can see that I’m verifying that the consumed event’s value is “Hello Redpanda”, as it was written by the producer.&lt;/p&gt;

&lt;p&gt;With the above steps, I verify that I can produce an event to the Redpanda stream and consume that same event. Once you’ve completed this step, you can move on to configuring the test.&lt;/p&gt;
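&lt;p&gt;The produce and consume steps are then combined into a single scenario file — the one referenced later as &lt;code&gt;redpanda-stream-test.json&lt;/code&gt;. A sketch of its overall shape might look like this (the &lt;code&gt;scenarioName&lt;/code&gt; value is an assumption; the step bodies are the ones shown above):&lt;/p&gt;

```json
{
  "scenarioName": "produce_and_consume_via_redpanda",
  "steps": [
    {
      "name": "produce_test_message",
      "url": "kafka-topic:test-topic",
      "method": "produce",
      "request": {
        "recordType": "JSON",
        "records": [
          { "key": "${RANDOM.NUMBER}", "value": "Hello Redpanda" }
        ]
      },
      "assertions": { "status": "Ok" }
    },
    {
      "name": "consume_test_message",
      "url": "kafka-topic:test-topic",
      "method": "consume",
      "request": {
        "consumerLocalConfigs": { "recordType": "JSON" }
      },
      "retry": { "max": 2, "delay": 30 },
      "validators": [
        { "field": "records[0].value", "value": "Hello Redpanda" }
      ]
    }
  ]
}
```

&lt;p&gt;Zerocode runs the steps in order, so the consume step only executes after the produce step has been asserted.&lt;/p&gt;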

&lt;h4&gt;
  
  
  2.3 Configuration
&lt;/h4&gt;

&lt;p&gt;In the case of Testcontainers, the library itself created a Redpanda broker (via Docker). With Zerocode, however, you need to have a Redpanda broker up and running before launching the tests. For local testing, you can &lt;a href="https://github.com/redpanda-data-blog/2022-data-libraries-list/blob/master/docker-compose.yaml" rel="noopener noreferrer"&gt;create a YAML file and use Docker Compose&lt;/a&gt; to do this. &lt;/p&gt;
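&lt;p&gt;For reference, a minimal Compose file for a single local broker might look like the sketch below. The image tag and addresses are assumptions for illustration; treat the linked file as the authoritative version:&lt;/p&gt;

```yaml
version: '3.9'
services:
  redpanda:
    image: docker.vectorized.io/vectorized/redpanda:v21.11.11
    container_name: redpanda
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --overprovisioned
      - --node-id
      - '0'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
    ports:
      - 9092:9092
```

&lt;p&gt;With this file in place, &lt;code&gt;docker-compose up -d&lt;/code&gt; gives you a broker reachable at &lt;code&gt;localhost:9092&lt;/code&gt;, which is the address the Zerocode configuration below expects.&lt;/p&gt;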

&lt;p&gt;After the Redpanda broker is ready, you need to tell Zerocode how to reach it. You may also need to specify some properties that Zerocode will use when creating your producer and consumer. For that reason, create a &lt;strong&gt;properties&lt;/strong&gt; file with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;kafka.bootstrap.servers=localhost:&lt;/span&gt;&lt;span class="mi"&gt;9092&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;kafka.producer.properties=producer.properties&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;kafka.consumer.properties=consumer.properties&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kafka.bootstrap.servers&lt;/strong&gt; - Here you specify the bootstrap servers of Redpanda. Keep in mind that there is no Redpanda-specific keyword, but the Kafka keyword works with Redpanda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kafka.producer.properties&lt;/strong&gt; - Name of the file that contains producer properties. In this example, the file is in the same folder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kafka.consumer.properties&lt;/strong&gt; - Name of the file that contains consumer properties. In this example, the file is in the same folder.&lt;/li&gt;
&lt;/ul&gt;
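&lt;p&gt;The article doesn’t show the two referenced files, so here is a sketch of what a minimal &lt;code&gt;producer.properties&lt;/code&gt; and &lt;code&gt;consumer.properties&lt;/code&gt; might contain. These are standard Kafka client settings, not values taken from the demo repo; adjust the serializers and group ID to your setup:&lt;/p&gt;

```properties
# producer.properties (assumed minimal example)
client.id=zerocode-producer
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer

# consumer.properties (assumed minimal example)
group.id=zerocode-consumer-group
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
enable.auto.commit=true
```

&lt;p&gt;Zerocode passes these files straight to the underlying Kafka clients, so any valid producer or consumer property can go here.&lt;/p&gt;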

&lt;p&gt;Once you’ve configured the test, it’s time to write the test case.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.4 Writing the test case
&lt;/h4&gt;

&lt;p&gt;At this point in the demo, you have the scenario file and the configuration. So, how do you link them and run the scenario? Behind the scenes, Zerocode uses JUnit 4 runners, so you now create a Java test class that uses JUnit annotations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RunWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ZeroCodeUnitRunner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@TargetEnv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"redpanda.properties"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedpandaTest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="nd"&gt;@Test&lt;/span&gt;
   &lt;span class="nd"&gt;@Scenario&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"redpanda-stream-test.json"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;test_redpanda&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;@RunWith&lt;/strong&gt; - You specify the Zerocode runner that will be responsible for running your scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;@TargetEnv&lt;/strong&gt; - The name of the configuration file that Zerocode will use for the scenario. This is how you link configurations files to scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;@Scenario&lt;/strong&gt; - The name of our scenario that Zerocode will run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a class="mentioned-user" href="https://dev.to/test"&gt;@test&lt;/a&gt;&lt;/strong&gt; - This is the Junit annotation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can run the test using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mvn &lt;span class="nb"&gt;test&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just as with our Testcontainers test above, you will see any errors or failures printed in the log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your test is successful, no failures or errors will be noted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you’ve learned how to run two types of integration tests on your applications, you can validate that your data streams are configured correctly. These tests are also useful for checking your producers and consumers against your Redpanda nodes. &lt;/p&gt;

&lt;p&gt;As I mentioned at the start of this article, Zerocode and Testcontainers are the two integration testing tools that I and other devs tend to use, and there aren’t many other integration testing tools available. If you know of others that we should look into, share them &lt;a href="https://redpanda.com/slack" rel="noopener noreferrer"&gt;in the Redpanda Community on Slack&lt;/a&gt;, or share them &lt;a href="https://twitter.com/redpandadata" rel="noopener noreferrer"&gt;on Twitter: @redpandadata&lt;/a&gt;. To learn more about getting started with Redpanda, &lt;a href="https://docs.redpanda.com/docs/quickstart/" rel="noopener noreferrer"&gt;view the documentation here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Building a real-time materialized cache</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 14 Sep 2022 14:23:22 +0000</pubDate>
      <link>https://dev.to/redpanda-data/building-a-real-time-materialized-cache-2ni5</link>
      <guid>https://dev.to/redpanda-data/building-a-real-time-materialized-cache-2ni5</guid>
<description>&lt;p&gt;Organizations often need to build real-time data-processing applications. Specialized tools for stream processing can help build such applications. &lt;a href="https://redpanda.com/blog/apache-flink-redpanda-real-time-word-count-application"&gt;In another article&lt;/a&gt;, you learned how to process data streams with Apache Flink&lt;sup&gt;Ⓡ&lt;/sup&gt;. This article will show you how to do something similar with ksqlDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ksqldb.io/" rel="nofollow noopener noreferrer"&gt;ksqlDB&lt;/a&gt; is an event-streaming database that simplifies real-time application building with two Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; API components — Kafka Connect&lt;sup&gt;Ⓡ&lt;/sup&gt; and Kafka Streams&lt;sup&gt;Ⓡ&lt;/sup&gt; — into a single system. This makes it possible to integrate the stream-processing application with different source systems.&lt;/p&gt;

&lt;p&gt;With ksqlDB, you can use SQL queries for processing streaming data. Examples of such use cases include identifying anomalies in real-time data, log monitoring, tracking, and alerting. Using ksqlDB on top of Redpanda, which is API-compatible with Kafka, allows you to explore topics, transform data within topics, copy existing topics from one format to another, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ksqlDB?
&lt;/h2&gt;

&lt;p&gt;ksqlDB differs from other popular data-processing tools like Flink or Apache Spark&lt;sup&gt;Ⓡ&lt;/sup&gt; in its ability to build complete streaming applications with only a small set of SQL statements—you don’t need to write Java/Scala/Python in addition to SQL statements when using ksqlDB.&lt;/p&gt;

&lt;p&gt;ksqlDB has a simplified architecture and is deployed as a separate, scalable cluster. The interface for event capturing, processing, and query serving is combined into a single system.&lt;/p&gt;

&lt;p&gt;Let’s take a look at how Redpanda and ksqlDB can be used together to build a stream-processing application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating ksqlDB with Redpanda
&lt;/h2&gt;

&lt;p&gt;To set the scene, imagine that you have a database that stores emergency calls made by residents of different locations. It contains their names, emergency type, and area code. You frequently make a few specific queries, and you want to move those out of the database, precompute them, and store the results for fast access. Here, you can leverage the power of ksqlDB (computing) with Redpanda (storage) to build a materialized cache for quick access to the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/52caOQM3jUYPkMXfDYraTM/7a3275a204a74ffb15dac43af09f2243/ksqldb_and_redpanda.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/52caOQM3jUYPkMXfDYraTM/7a3275a204a74ffb15dac43af09f2243/ksqldb_and_redpanda.png" alt="ksqldb and redpanda"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tutorial will walk you through creating this materialized cache using Redpanda, ksqlDB server, and ksqlDB CLI and show you how to query it.&lt;/p&gt;

&lt;p&gt;Specifically, you’ll learn how to do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install Redpanda&lt;/li&gt;
&lt;li&gt;Install ksqlDB&lt;/li&gt;
&lt;li&gt;Configure ksqlDB to ingest data from Redpanda&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before getting started, you’ll need to have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.docker.com/get-docker/" rel="nofollow noopener noreferrer"&gt;Docker&lt;/a&gt; and &lt;a href="https://docs.docker.com/compose/install/" rel="nofollow noopener noreferrer"&gt;Docker Compose&lt;/a&gt; installed&lt;/li&gt;
&lt;li&gt;Familiarity with Apache Kafka or other messaging systems (recommended for Redpanda)&lt;/li&gt;
&lt;li&gt;Familiarity with SQL syntax (recommended for ksqlDB)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up the stack
&lt;/h2&gt;

&lt;p&gt;The image below is a schematic view of data flow within the system you’d use to process external data. To connect to the external sources, you’d have to set up your &lt;a href="https://docs.ksqldb.io/en/latest/concepts/connectors/" rel="nofollow noopener noreferrer"&gt;connectors&lt;/a&gt;. This tutorial doesn’t connect to any external source and uses mock data for the sake of simplicity.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/1yQ15QcJWQsidVsLHuO1MM/80e9f40a8ab85d420a00801cb19b7506/ksqldb_2.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/1yQ15QcJWQsidVsLHuO1MM/80e9f40a8ab85d420a00801cb19b7506/ksqldb_2.png" alt="ksqldb 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Redpanda
&lt;/h2&gt;

&lt;p&gt;You can follow the detailed steps for installing Redpanda from the &lt;a href="https://docs.redpanda.com/docs/quickstart/"&gt;official documentation&lt;/a&gt; on your platform of choice. In this tutorial, you’ll install Redpanda using Docker Compose.&lt;/p&gt;

&lt;p&gt;First, add the following configurations to a &lt;code&gt;docker-compose.yml&lt;/code&gt; file to install Redpanda &lt;a href="https://hub.docker.com/r/vectorized/redpanda" rel="nofollow noopener noreferrer"&gt;from its Docker image&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.9'
services:
  redpanda:
    command:
    - redpanda
    - start
    - --smp
    - '1'
    - --reserve-memory
    - 0M
    - --overprovisioned
    - --set
    - redpanda.cluster_id=turning-red
    - --set 
    - redpanda.enable_idempotence=true
    - --set 
    - redpanda.enable_transactions=true
    - --set
    - redpanda.auto_create_topics_enabled=true
    - --node-id
    - '0'
    - --kafka-addr
    - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
    - --advertise-kafka-addr
    - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
    image: docker.vectorized.io/vectorized/redpanda:v21.11.11
    container_name: redpanda
    ports:
    - 9092:9092
    - 29092:29092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, run the command below in the root directory of your &lt;code&gt;docker-compose&lt;/code&gt; file to start a local Redpanda cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that your Redpanda cluster is running, you can do some test streaming.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting Redpanda
&lt;/h3&gt;

&lt;p&gt;Run the command below to access the Redpanda Docker container’s command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda /bin/sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following command to create a &lt;code&gt;calls&lt;/code&gt; topic (note the use of the &lt;a href="https://docs.redpanda.com/docs/reference/rpk-commands/"&gt;rpk&lt;/a&gt; command-line utility for Redpanda):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rpk topic create calls --brokers=localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, produce a message on the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rpk topic produce calls --brokers=localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Input some text into the topic, and once you’re finished, press &lt;strong&gt;Ctrl+C&lt;/strong&gt; to exit the prompt.&lt;/p&gt;

&lt;p&gt;Finally, consume the messages on the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rpk topic consume calls --brokers=localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is a sample output when consuming the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;##Output
{
  "topic": "calls",
  "value": "3",
  "timestamp": 1650692216007,
  "partition": 0,
  "offset": 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it! You now have enough Redpanda knowledge to leverage the power of ksqlDB. Before moving on, do some cleanup by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command stops and removes the Redpanda container.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing ksqlDB
&lt;/h2&gt;

&lt;p&gt;First, add the services below in your &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ksqldb-server:
    image: confluentinc/ksqldb-server:0.25.1
    hostname: ksqldb-server
    container_name: ksqldb-server
    depends_on:
      - redpanda
    ports:
      - "8088:8088"
    environment:
      KSQL_LISTENERS: "http://0.0.0.0:8088"
      KSQL_BOOTSTRAP_SERVERS: "redpanda:29092"
      KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"
      KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"  

  ksqldb-cli:
    image: confluentinc/ksqldb-cli:0.25.1
    container_name: ksqldb-cli
    depends_on:
      - redpanda
      - ksqldb-server
    entrypoint: /bin/sh
    tty: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code snippet creates two containers—the ksqlDB server and ksqlDB CLI—from their respective Docker images. The ksqlDB server is where your application runs, and the ksqlDB CLI allows you to interact with the server.&lt;/p&gt;

&lt;p&gt;Then run the command below in the root directory of your &lt;code&gt;docker-compose.yml&lt;/code&gt; file to start all three services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, run the following command to check if the containers are running as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything is okay, you should have three containers running: &lt;code&gt;ksqldb-cli&lt;/code&gt;, &lt;code&gt;ksqldb-server&lt;/code&gt;, and &lt;code&gt;redpanda&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting ksqlDB
&lt;/h3&gt;

&lt;p&gt;To start ksqlDB and access its interface, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it ksqldb-cli ksql http://ksqldb-server:8088
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/7cpNCB84CF8ZdNMvpCDGFa/3909dd0c7f2099276ebbb0951363065c/ksqldb_3.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/7cpNCB84CF8ZdNMvpCDGFa/3909dd0c7f2099276ebbb0951363065c/ksqldb_3.png" alt="ksqldb 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the server isn’t responding, give it a while, exit the ksqlDB CLI, and then retry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring ksqlDB to ingest data from Redpanda
&lt;/h2&gt;

&lt;p&gt;Now that your stack is running, it’s time to execute some ksqlDB code. You’ll use the ksqlDB CLI to interact with the server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a stream
&lt;/h3&gt;

&lt;p&gt;Before you create your stream, enter the command below in the running instance of ksqlDB CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SHOW TOPICS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This displays a list of existing topics. At this point, you will see only default topics. You can now create a stream that matches the data in your database as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE STREAM emergencies (name VARCHAR, reason VARCHAR, area VARCHAR)
  WITH (kafka_topic='call-center', value_format='json', partitions=1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command creates not only a stream but also a Redpanda topic named &lt;code&gt;call-center&lt;/code&gt;, if it does not already exist. If the topic already exists, the command simply defines the stream over it, which you can then query with SQL syntax.&lt;/p&gt;

&lt;p&gt;Running the &lt;code&gt;SHOW TOPICS;&lt;/code&gt; command displays the newly created topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating materialized views
&lt;/h3&gt;

&lt;p&gt;To keep track of certain logic, you need to create a materialized view for the logic. Run the following commands in the ksqlDB CLI instance to do so.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;location_of_interest&lt;/code&gt; materialized view counts the number of distinct areas, identifies the latest area of the emergency call, and then groups the rows returned by the reason for the call:&lt;/p&gt;

&lt;p&gt;// RUN 1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE location_of_interest AS
    SELECT reason,
           count_distinct(area) AS distinct_pings,
           latest_by_offset(area) AS last_location
    FROM emergencies
    GROUP BY reason
    EMIT CHANGES;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;call_record&lt;/code&gt; materialized view counts the number of times a resident called based on the reason and groups them by the resident’s name:&lt;/p&gt;

&lt;p&gt;// RUN 2&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE call_record AS
    SELECT name,
           count(reason) AS total_emergencies
    FROM emergencies
    GROUP BY name
    EMIT CHANGES;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding mock data
&lt;/h3&gt;

&lt;p&gt;Now that you have a topic, a stream linked to your topic, and a materialized view to make your queries persistent, you can add some mock data to test your application.&lt;/p&gt;

&lt;p&gt;First, open a new terminal and open the Redpanda terminal using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda /bin/sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can produce messages on the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rpk topic produce call-center --brokers=localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the messages below in the terminal. Each message is produced to a partition and given a timestamp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"name":"Liam", "reason": "allergy", "area": "Florida"}
{"name":"Fiona", "reason": "dizziness", "area": "Orlando"}
{"name":"Mike", "reason": "pain", "area": "Florida"}
{"name":"Louise", "reason": "allergy", "area": "Orlando"}
{"name":"Steven", "reason": "stroke", "area": "New York"}
{"name":"Liam", "reason": "pain", "area": "Florida"}
{"name":"Louise", "reason": "dizziness", "area": "Hawai"}
{"name":"Ivor", "reason": "choking", "area": "New York"}
{"name":"Louise", "reason": "pain", "area": "Florida"}
{"name":"Beckham", "reason": "allergy", "area": "New York"}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are now ready to test your application by running some queries.&lt;/p&gt;
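&lt;p&gt;Before running those queries, it can help to predict what the &lt;code&gt;call_record&lt;/code&gt; view should contain for this data. The following Python sketch replicates the view's &lt;code&gt;COUNT(reason) ... GROUP BY name&lt;/code&gt; logic over the ten mock messages; it is purely an illustration of the aggregation (ksqlDB performs the real work):&lt;/p&gt;

```python
from collections import Counter

# The ten mock messages produced to the call-center topic above
messages = [
    {"name": "Liam", "reason": "allergy", "area": "Florida"},
    {"name": "Fiona", "reason": "dizziness", "area": "Orlando"},
    {"name": "Mike", "reason": "pain", "area": "Florida"},
    {"name": "Louise", "reason": "allergy", "area": "Orlando"},
    {"name": "Steven", "reason": "stroke", "area": "New York"},
    {"name": "Liam", "reason": "pain", "area": "Florida"},
    {"name": "Louise", "reason": "dizziness", "area": "Hawai"},
    {"name": "Ivor", "reason": "choking", "area": "New York"},
    {"name": "Louise", "reason": "pain", "area": "Florida"},
    {"name": "Beckham", "reason": "allergy", "area": "New York"},
]

# SELECT name, count(reason) AS total_emergencies FROM emergencies GROUP BY name
call_record = Counter(m["name"] for m in messages)

print(call_record["Louise"])  # 3
print(call_record["Liam"])    # 2
```

&lt;p&gt;Louise should therefore show three emergencies in total and Liam two.&lt;/p&gt;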

&lt;h3&gt;
  
  
  Running queries
&lt;/h3&gt;

&lt;p&gt;Before you run any query, set the property below to ensure the queries run from the beginning of the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET 'auto.offset.reset' = 'earliest';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run a query that terminates immediately after it has returned the results, test with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM location_of_interest
WHERE reason = 'allergy';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run a query that keeps running and updates the results as more data comes in, use this command (note the use of the “EMIT CHANGES” clause):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM call_record
WHERE name = 'Louise' EMIT CHANGES;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you keep this push query running and produce more mock data to the topic from another terminal, the results above will update with the new data.&lt;/p&gt;
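&lt;p&gt;Conceptually, an &lt;code&gt;EMIT CHANGES&lt;/code&gt; query updates incrementally: when a new event arrives, only the affected row of the running aggregation changes and is emitted. This small Python sketch pictures that model (an illustration only, not how ksqlDB is implemented internally):&lt;/p&gt;

```python
from collections import Counter

# Running state of the call_record view after the ten mock messages
call_record = Counter({"Louise": 3, "Liam": 2, "Fiona": 1, "Mike": 1,
                       "Steven": 1, "Ivor": 1, "Beckham": 1})

# A new message arrives on the call-center topic...
new_message = {"name": "Louise", "reason": "allergy", "area": "Orlando"}

# ...and only the affected row is updated and emitted to the push query
call_record[new_message["name"]] += 1
print(call_record["Louise"])  # 4
```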

&lt;p&gt;You can view failed ksqlDB messages by adding the following statements in your &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;environment:
    …
    KSQL_LOG4J_ROOT_LOGLEVEL: "ERROR"
    KSQL_LOG4J_LOGGERS: "org.apache.kafka.connect.runtime.rest=WARN,org.reflections=ERROR"
    KSQL_LOG4J_PROCESSING_LOG_BROKERLIST: kafka:29092
    KSQL_LOG4J_PROCESSING_LOG_TOPIC: &amp;lt;ksql-processing-log-topic-name&amp;gt;
    KSQL_KSQL_LOGGING_PROCESSING_TOPIC_NAME: &amp;lt;ksql-processing-log-topic-name&amp;gt;
    KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"
    KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To stop and remove the containers, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you've been properly introduced to ksqlDB and how to use it with Redpanda, you can take what you've learned in this tutorial and create a data stream-processing application with a materialized cache for any number of use cases.&lt;/p&gt;

&lt;p&gt;As you saw, ksqlDB is easy to install and configure, and it lets you run standard SQL queries.&lt;/p&gt;

&lt;p&gt;All the code in this tutorial can be found on &lt;a href="https://github.com/redpanda-data-blog/2022-ksqlDB-stream-processing" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Try out Redpanda using the tutorial, interact with Redpanda’s developers directly in &lt;a href="https://redpanda.com/slack" rel="nofollow noopener noreferrer"&gt;the Redpanda Community on Slack&lt;/a&gt;, or contribute to Redpanda’s &lt;a href="https://github.com/redpanda-data/redpanda/" rel="noopener noreferrer"&gt;source-available GitHub repo here&lt;/a&gt;. To learn more about everything you can do with Redpanda, check out &lt;a href="https://docs.redpanda.com/docs/home"&gt;our documentation here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>beginners</category>
      <category>kafka</category>
      <category>redpanda</category>
    </item>
    <item>
      <title>Change data capture with CockroachDB and Redpanda</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 07 Sep 2022 18:15:31 +0000</pubDate>
      <link>https://dev.to/redpanda-data/change-data-capture-with-cockroachdb-and-redpanda-4peg</link>
      <guid>https://dev.to/redpanda-data/change-data-capture-with-cockroachdb-and-redpanda-4peg</guid>
      <description>&lt;p&gt;In the cloud-native era, applications have gradually transformed towards a more distributed, less coupled architecture. Monolithic architectures have evolved into microservices, and microservices are evolving into ever smaller services or functions.&lt;/p&gt;

&lt;p&gt;Apart from all the benefits of distributed architecture — &lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns" rel="nofollow noopener noreferrer"&gt;like separation of concerns&lt;/a&gt; — this approach can have drawbacks, one of which is the data itself.&lt;/p&gt;

&lt;p&gt;Data becomes a real problem when you want to share it in a distributed system. For monolithic applications, it used to be the case that you’d have a single database as a cluster with replicated nodes, but things have changed.&lt;/p&gt;

&lt;p&gt;Distributed applications like microservices need their dedicated databases or any other middlewares as a data store, such as a dedicated cache system like &lt;a href="https://redis.io/docs/about/" rel="nofollow noopener noreferrer"&gt;Redis&lt;/a&gt; or a search engine like &lt;a href="https://www.elastic.co/what-is/elasticsearch" rel="nofollow noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/4LAT6pxJe05l86YMYObWxY/9e52c6fdc3e851d94640b58d1f3458c4/cockroach_1.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/4LAT6pxJe05l86YMYObWxY/9e52c6fdc3e851d94640b58d1f3458c4/cockroach_1.png" alt="cockroach 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because of the distributed architecture, you need to keep the same data in different databases or middleware systems, and you must keep this data consistent. In most cases, developers try to do so by doing dual writes.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/6KecY43hfE41LewJ7onkax/c77bf2a0108f5ad38f7be773c4aa81da/cockroach_2.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/6KecY43hfE41LewJ7onkax/c77bf2a0108f5ad38f7be773c4aa81da/cockroach_2.png" alt="cockroach 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A dual write happens when an application changes the same data in two different systems without any layer for data consistency, like transactions or distributed transactions. Not every system supports distributed transactions, so you cannot guarantee data consistency in those cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/01D5cNDN4NRJnLE9BCs0o9/16d4d614fbc484ac3d5d1df8016839da/cockroach_3.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/01D5cNDN4NRJnLE9BCs0o9/16d4d614fbc484ac3d5d1df8016839da/cockroach_3.png" alt="cockroach 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, change data capture (CDC), a data integration pattern, enables capturing row-level changes into a configurable sink for downstream processing such as reporting, caching, full-text indexing, or — most importantly — helping avoid dual writes and ensuring data durability and consistency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/" rel="nofollow noopener noreferrer"&gt;CockroachDB&lt;/a&gt;, a distributed and reliable database, &lt;a href="https://www.cockroachlabs.com/docs/stable/change-data-capture-overview.html" rel="nofollow noopener noreferrer"&gt;supports CDC&lt;/a&gt; via its &lt;a href="https://www.cockroachlabs.com/docs/v21.2/create-changefeed" rel="nofollow noopener noreferrer"&gt;Changefeeds&lt;/a&gt;. CockroachDB provides Changefeeds for data sinks like &lt;a href="https://aws.amazon.com/s3/" rel="nofollow noopener noreferrer"&gt;AWS S3&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Webhook" rel="nofollow noopener noreferrer"&gt;webhooks&lt;/a&gt;, or an Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; API-compatible streaming data system like Redpanda.&lt;/p&gt;

&lt;p&gt;Redpanda provides a fast, fault-tolerant, safe-by-default system that is fully compatible with the Kafka API. You can use CockroachDB CDC to capture changes and stream them into Redpanda to implement vital CDC use cases more efficiently, such as reporting, avoiding dual writes, or, most importantly, keeping data consistent across the shards of CockroachDB. Because CockroachDB has a distributed architecture, keeping transactional jobs consistent across its shards is crucial, and CDC with Changefeeds addresses this by emitting changes to sinks like Kafka or Redpanda.&lt;/p&gt;

&lt;p&gt;The CDC mechanism of CockroachDB not only provides a data capturing mechanism but also an integration point to Redpanda, which can stream the captured change events to other data points like data warehouses, OLAP databases, or search engines.&lt;/p&gt;

&lt;p&gt;In this article, you will learn how to stream CDC from CockroachDB to Redpanda by completing a tutorial involving the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run a Redpanda cluster in a containerized way using Docker.&lt;/li&gt;
&lt;li&gt;Create a topic within Redpanda using its &lt;code&gt;rpk&lt;/code&gt; CLI.&lt;/li&gt;
&lt;li&gt;Install CockroachDB and use its SQL client.&lt;/li&gt;
&lt;li&gt;Create a table on CockroachDB and configure it for using CDC.&lt;/li&gt;
&lt;li&gt;Create, update, and delete records in the CockroachDB table.&lt;/li&gt;
&lt;li&gt;Consume the change events from the relevant Redpanda topic using the &lt;code&gt;rpk&lt;/code&gt; CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to follow along in your own editor, you can access all the resources for this tutorial in &lt;a href="https://github.com/redpanda-data-blog/2022-cdc-with-cockroachdb" rel="nofollow noopener noreferrer"&gt;this repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To complete this tutorial, you’ll need the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A macOS environment with the Homebrew package manager &lt;a href="https://docs.brew.sh/Installation" rel="nofollow noopener noreferrer"&gt;installed&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;A recent version of &lt;a href="https://docs.docker.com/get-docker/" rel="nofollow noopener noreferrer"&gt;Docker&lt;/a&gt; installed on your machine. (Docker Desktop 4.6.1 was used at the time of writing this article.)&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://www.cockroachlabs.com/get-cockroachdb/enterprise/" rel="nofollow noopener noreferrer"&gt;30-day trial license&lt;/a&gt; for CockroachDB, which is required in order to use CDC capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case: Change data capture with CockroachDB and Redpanda
&lt;/h2&gt;

&lt;p&gt;Suppose that you are a contractor who is about to sign a five-year contract with a potential customer, PandaBank. Before closing the deal, they would like you to complete a small task as an assignment to see if you are suitable for the work.&lt;/p&gt;

&lt;p&gt;PandaBank uses CockroachDB internally, and most of the daily account transactions are kept in this database. Currently, they have a mechanism for indexing the account transaction changes in Elasticsearch, but they noticed that it creates data inconsistencies between the actual data and the indexed log data that is in Elasticsearch.&lt;/p&gt;

&lt;p&gt;They want you to create a base mechanism to avoid any data inconsistency issues between the systems. They require you to create a basic implementation of a CDC using CockroachDB’s changefeed mechanism and Redpanda for a durable, Kafka API-compliant messaging system.&lt;/p&gt;

&lt;p&gt;For their assignment, you don’t need to implement the Elasticsearch part, just the CDC part. You are responsible for creating a CockroachDB instance and a Redpanda instance on your local machine. Because PandaBank runs Redpanda on Docker, you’ll need to do so as well.&lt;/p&gt;

&lt;p&gt;The following image shows the architectural diagram of the system they require you to implement:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/21W2Ka3fmPL0LqEaOnktP/5a688a6a707aab32c9df8944d6085be8/cockroach_4.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/21W2Ka3fmPL0LqEaOnktP/5a688a6a707aab32c9df8944d6085be8/cockroach_4.png" alt="cockroach 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Redpanda
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will run Redpanda in a container via Docker.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For more information on installing Redpanda on other platforms, refer to this &lt;a href="https://docs.redpanda.com/docs/quickstart/"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Make sure that you have installed Docker and started the Docker daemon in your macOS environment. Then, open a terminal window and run the following command to run Redpanda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --pull=always --name=redpanda-1 --rm \
    -p 9092:9092 \
    -p 9644:9644 \
    docker.vectorized.io/vectorized/redpanda:latest \
    redpanda start \
    --overprovisioned \
    --smp 1  \
    --memory 1G \
    --reserve-memory 0M \
    --node-id 0 \
    --check=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redpanda will be accessible via &lt;code&gt;localhost:9092&lt;/code&gt; on your computer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and running CockroachDB
&lt;/h2&gt;

&lt;p&gt;In order to install CockroachDB on your local macOS environment, run the following command:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Make sure you’ve already installed the Homebrew package manager, as noted in the prerequisites.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;brew install cockroachdb/tap/cockroach&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After installing CockroachDB, run a single-node cluster that is in insecure mode:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For information on starting a CockroachDB cluster in secure mode, you can refer to &lt;a href="https://www.cockroachlabs.com/docs/stable/secure-a-cluster.html" rel="noopener noreferrer"&gt;&lt;br&gt;
this documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;cockroach start-single-node --insecure&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In another terminal, run the following command to access the CockroachDB SQL client interface:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cockroach sql --insecure&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;On the client interface, run the following commands to enable enterprise usage, since this kind of CDC uses an Enterprise Changefeed. Refer to the prerequisites section if you have not signed up for a trial license yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET CLUSTER SETTING cluster.organization = '_YOUR_ORGANIZATION_';
SET CLUSTER SETTING enterprise.license = '_YOUR_LICENSE_';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating and configuring the CockroachDB table
&lt;/h2&gt;

&lt;p&gt;On the terminal window where the SQL query client is open, run the following command to create a database called bank in CockroachDB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;root@:26257/defaultdb&amp;gt; CREATE DATABASE bank;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Select the bank database to be used for the rest of the actions in the query window:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;root@:26257/defaultdb&amp;gt; USE bank;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a table called accounts with integer fields named id and balance:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;root@:26257/bank&amp;gt; CREATE TABLE accounts (id INT PRIMARY KEY, balance INT);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a Changefeed for the table accounts. Set the Redpanda broker address for the Changefeed to configure it to send the captured change data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@:26257/bank&amp;gt; CREATE CHANGEFEED FOR TABLE accounts INTO 'kafka://localhost:9092' WITH UPDATED;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For more information on creating a Changefeed on CockroachDB, refer to &lt;a href="https://www.cockroachlabs.com/docs/v21.2/create-changefeed" rel="noopener noreferrer"&gt;&lt;br&gt;
their documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Leave the terminal window open for later use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating the Redpanda topic and consuming data
&lt;/h2&gt;

&lt;p&gt;In another terminal window, run the following command to create a Redpanda topic called accounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic create accounts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOPIC     STATUS
accounts  OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the topic has the same name as the CockroachDB table accounts. CockroachDB CDC produces data to a topic with the same name as the table by default.&lt;/p&gt;

&lt;p&gt;In the same terminal window, run the following command to start consuming from the accounts topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic consume accounts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Leave the terminal window open to view the consumed messages in the following steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capturing the change events
&lt;/h2&gt;

&lt;p&gt;In order to confirm that the CDC mechanism works, you must create, update, and delete some data in the accounts table. You’ll also observe and examine the captured events in the Redpanda accounts topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the accounts data
&lt;/h3&gt;

&lt;p&gt;In the SQL client terminal window, run the following command to insert some data into the accounts table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@:26257/bank&amp;gt; INSERT INTO accounts (id, balance) VALUES (1, 1000), (2, 250), (3, 700);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates the following accounts in the CockroachDB accounts table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Account ID  Balance
1           1000
2           250
3           700

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After inserting the data, verify that the Redpanda CLI consumer prints out the consumed data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "accounts",
  "key": "[1]",
  "value": "{\"after\": {\"balance\": 1000, \"id\": 1}, \"updated\": \"1648496379523876000.0000000000\"}",
  "timestamp": 1648496379856,
  "partition": 0,
  "offset": 0
}
{
  "topic": "accounts",
  "key": "[2]",
  "value": "{\"after\": {\"balance\": 250, \"id\": 2}, \"updated\": \"1648496379523876000.0000000000\"}",
  "timestamp": 1648496379856,
  "partition": 0,
  "offset": 1
}
{
  "topic": "accounts",
  "key": "[3]",
  "value": "{\"after\": {\"balance\": 700, \"id\": 3}, \"updated\": \"1648496379523876000.0000000000\"}",
  "timestamp": 1648496379856,
  "partition": 0,
  "offset": 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that you have all the data from the accounts table as separate event logs in your Redpanda instance’s accounts topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running account transactions
&lt;/h3&gt;

&lt;p&gt;The application development team of PandaBank shared a small containerized application with you that runs some transactions on these bank accounts. This application connects to CockroachDB over &lt;code&gt;localhost:26257&lt;/code&gt;, so be sure that CockroachDB is accessible in your local environment.&lt;/p&gt;

&lt;p&gt;Use the following command to run the transaction between accounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:redpanda-data-blog/2022-cdc-with-cockroachdb.git
cd 2022-cdc-with-cockroachdb/account_transaction_manager/
docker build -t account-transaction-manager .
docker run account-transaction-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this command should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEBUG:root:print_balances(): status message: SELECT 3
Balances at Mon Mar 28 19:43:34 2022:
(1, 1000)
(2, 250)
(3, 700)
DEBUG:root:transfer_funds(): status message: UPDATE 1
DEBUG:root:transfer_funds(): status message: UPDATE 1
DEBUG:root:print_balances(): status message: SELECT 3
Balances at Mon Mar 28 19:43:35 2022:
(1, 700)
(2, 350)
(3, 900)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run the &lt;code&gt;SELECT * FROM accounts;&lt;/code&gt; command in the CockroachDB SQL client, you will see the following results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Account ID  Balance
1           700
2           350
3           900
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that the new changes captured by CockroachDB CDC are reflected in your Redpanda consumer. In the consumer’s terminal window, you should see the following result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "accounts",
  "key": "[1]",
  "value": "{\"after\": {\"balance\": 900, \"id\": 1}, \"updated\": \"1648496614787637000.0000000000\"}",
  "timestamp": 1648496615283,
  "partition": 0,
  "offset": 3
}
{
  "topic": "accounts",
  "key": "[2]",
  "value": "{\"after\": {\"balance\": 350, \"id\": 2}, \"updated\": \"1648496614787637000.0000000000\"}",
  "timestamp": 1648496615283,
  "partition": 0,
  "offset": 4
}
{
  "topic": "accounts",
  "key": "[1]",
  "value": "{\"after\": {\"balance\": 700, \"id\": 1}, \"updated\": \"1648496614835272000.0000000000\"}",
  "timestamp": 1648496615283,
  "partition": 0,
  "offset": 5
}
{
  "topic": "accounts",
  "key": "[3]",
  "value": "{\"after\": {\"balance\": 900, \"id\": 3}, \"updated\": \"1648496614835272000.0000000000\"}",
  "timestamp": 1648496615283,
  "partition": 0,
  "offset": 6
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the balance changes that represent the transaction history in the consumer output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deleting the accounts
&lt;/h3&gt;

&lt;p&gt;As the last step, delete the bank accounts and see how CDC captures them. Run the following SQL query in the query console of CockroachDB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;root@:26257/bank&amp;gt; DELETE FROM bank.accounts WHERE id &amp;lt;&amp;gt; 0;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The Redpanda consumer should have the following captured events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "accounts",
  "key": "[1]",
  "value": "{\"after\": null, \"updated\": \"1648497640110587000.0000000000\"}",
  "timestamp": 1648497640200,
  "partition": 0,
  "offset": 6
}
{
  "topic": "accounts",
  "key": "[2]",
  "value": "{\"after\": null, \"updated\": \"1648497640110587000.0000000000\"}",
  "timestamp": 1648497640200,
  "partition": 0,
  "offset": 7
}
{
  "topic": "accounts",
  "key": "[3]",
  "value": "{\"after\": null, \"updated\": \"1648497640110587000.0000000000\"}",
  "timestamp": 1648497640200,
  "partition": 0,
  "offset": 8
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the &lt;code&gt;after&lt;/code&gt; field in the value becomes null when the change event is a delete. CockroachDB CDC successfully captures data change events and sends them to Redpanda for further consumption.&lt;/p&gt;
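&lt;p&gt;A downstream consumer can rebuild the table state by replaying these envelopes in order, treating a null &lt;code&gt;after&lt;/code&gt; as a delete. The following Python sketch shows that replay logic over a condensed version of the events seen in this tutorial (hypothetical consumer code for illustration):&lt;/p&gt;

```python
import json

# Condensed (key, value) pairs from the accounts topic: the initial inserts,
# the transaction updates, and the final deletes (where "after" is null)
events = [
    ("[1]", '{"after": {"balance": 1000, "id": 1}}'),
    ("[2]", '{"after": {"balance": 250, "id": 2}}'),
    ("[3]", '{"after": {"balance": 700, "id": 3}}'),
    ("[1]", '{"after": {"balance": 700, "id": 1}}'),
    ("[2]", '{"after": {"balance": 350, "id": 2}}'),
    ("[3]", '{"after": {"balance": 900, "id": 3}}'),
    ("[1]", '{"after": null}'),
    ("[2]", '{"after": null}'),
    ("[3]", '{"after": null}'),
]

accounts = {}
for key, value in events:
    row_id = json.loads(key)[0]      # the Kafka key is a JSON array like [1]
    after = json.loads(value)["after"]
    if after is None:
        accounts.pop(row_id, None)   # null "after" means the row was deleted
    else:
        accounts[row_id] = after["balance"]

print(accounts)  # {} -- all three accounts were deleted at the end
```

&lt;p&gt;Replaying only the first six events would leave the post-transaction balances (700, 350, and 900), matching the &lt;code&gt;SELECT * FROM accounts;&lt;/code&gt; output above.&lt;/p&gt;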

&lt;p&gt;Congratulations! You’ve successfully captured the bank account transaction changes and made them consumable as events from Redpanda.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you’ve learned how to run Redpanda in a container using Docker and how to create a topic and consume messages from it. You also learned to install CockroachDB, create a table using its SQL query interface, and configure it for using CDC. &lt;/p&gt;

&lt;p&gt;You can now use CockroachDB to capture changes and stream them to Redpanda to implement CDC use cases such as reporting, caching, full-text indexing, avoiding dual writes, and much more.&lt;/p&gt;

&lt;p&gt;Find all the resources for this tutorial in &lt;a href="https://github.com/redpanda-data-blog/2022-cdc-with-cockroachdb"&gt;this repository&lt;/a&gt;, or &lt;a href="https://redpanda.com/slack"&gt;join the Redpanda Community on Slack&lt;/a&gt; to ask specific questions. View &lt;a href="https://docs.redpanda.com/"&gt;Redpanda’s documentation here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>kafka</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Using GitHub Actions to automate development</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 31 Aug 2022 15:29:23 +0000</pubDate>
      <link>https://dev.to/redpanda-data/using-github-actions-to-automate-development-2mea</link>
      <guid>https://dev.to/redpanda-data/using-github-actions-to-automate-development-2mea</guid>
      <description>&lt;p&gt;If you’re already using Redpanda, then you know one of its most alluring draws is its intention to make data streaming development work as simple as possible. (And, if you aren’t already using Redpanda, you can learn how we deliver that simplicity &lt;a rel="noopener noreferrer" href="https://redpanda.com/blog/rpk-container/"&gt;in this blog post&lt;/a&gt;.) &lt;/p&gt;

&lt;p&gt;Having scorned complexity, the logical next step was to create an easy and efficient way to automate and test builds that depend on Redpanda. In this post, we’ll show you how to do just that, using the Redpanda GitHub Action. &lt;/p&gt;

&lt;h2&gt;
  
  
  What are GitHub Actions?
&lt;/h2&gt;

&lt;p&gt;Before digging into how to test our code, it’s worth understanding what a &lt;a rel="noopener noreferrer" href="https://github.com/features/actions"&gt;GitHub Action&lt;/a&gt; is. According to GitHub’s website, “GitHub Actions makes it easy to automate all your software workflows.”&lt;/p&gt;

&lt;p&gt;In our case, we'll focus specifically on the continuous integration workflow so that we can run automated tests &lt;a rel="noopener noreferrer" href="https://resources.github.com/ci-cd/"&gt;on GitHub CI&lt;/a&gt;. After you get your code running and working with Redpanda on your local development environment, how do you ensure that your teammates won’t introduce breaking changes to code? Running the automated tests on the CI will ensure it's working as expected.&lt;/p&gt;

&lt;p&gt;Sometimes it makes sense to isolate third-party dependencies in the software architecture, but what if we want to test using a &lt;em&gt;real&lt;/em&gt; Redpanda instance? That's the main reason why the Action was created: to bring all the Redpanda power to the GitHub CI environment. It makes the test suite faster and more reliable.&lt;/p&gt;

&lt;p&gt;Even if you aren’t using Redpanda in &lt;a rel="noopener noreferrer" href="https://docs.redpanda.com/docs/deployment/production-deployment/"&gt;your production environment&lt;/a&gt;, you can still benefit from using it as a drop-in replacement for other Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; distributions and take advantage of faster boot times and lower RAM usage. Using the Redpanda GitHub Action also saves build minutes (and CI costs).&lt;/p&gt;

&lt;p&gt;So how do you use it? Let's configure it together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local development
&lt;/h2&gt;

&lt;p&gt;Disclaimer: The &lt;a rel="noopener noreferrer" href="https://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt; language is an entirely arbitrary choice. You can write the code in any language you prefer.&lt;/p&gt;

&lt;p&gt;All the code discussed here is available on the &lt;a rel="noopener noreferrer" href="https://github.com/fernandes/Redpanda-action-demo"&gt;Redpanda-action-demo&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;Here we have a test suite that performs two tests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It publishes to a Redpanda topic&lt;/li&gt;
&lt;li&gt;It fetches the message &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both tests are supported by a setup that connects to Redpanda using the &lt;a rel="noopener noreferrer" href="https://rubygems.org/gems/ruby-kafka"&gt;ruby-kafka gem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: Always use &lt;code&gt;localhost:9092&lt;/code&gt; as your Redpanda address. It works locally, and GitHub CI will spin up the Docker image and bind its port to 9092.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Integration
&lt;/h2&gt;

&lt;p&gt;To use the Redpanda GitHub Action, you’ll need to configure GitHub CI to run the test. This is configured in &lt;code&gt;.github/workflows/ci.yml&lt;/code&gt;. Here are its contents, which we will talk more about below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# the name of our job
name: CI

# yes, we want to run for all branches and pull requests
on:
  push:
    branches: "*"
  pull_request:
    branches: "*"

# we have just our job `test`
jobs:
  test:
    runs-on: ubuntu-latest
    # here is the main section for us, where we spin up the Redpanda instance
    # using the Redpanda GitHub action,pay attention on the `.with.version` key, we are using _latest_ but you can use any Redpanda version
    # tip: version is exactly the same as the Redpanda docker image
    steps:
    - name: start Redpanda
        uses: Redpanda-data/github-action@v0.1.3
        with:
        version: "latest"
    - uses: actions/checkout@v2
    # below is how we setup ruby and run the tests using `rake`
    - name: Set up Ruby
        uses: ruby/setup-ruby@359bebbc29cbe6c87da6bc9ea3bc930432750108
        with:
        ruby-version: '3.0'
        bundler-cache: true
    - name: Install dependencies
        run: bundle install
    - name: Run tests
        run: bundle exec rake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After committing this file and pushing it to a branch, GitHub CI will run the test suite automatically and give you feedback about the changes on your pull request, showing you if the test failed or was successful.&lt;/p&gt;

&lt;p&gt;Using the Redpanda GitHub Action is a straightforward way to get Kafka-based projects tested. It's simple, using just one Docker image with everything you need. There’s no JVM, no Zookeeper&lt;sup&gt;Ⓡ&lt;/sup&gt;: it's an all-in-one solution. And it's super fast! Redpanda will boot up and be ready to work in a few seconds. It's a huge win in the developer experience.&lt;/p&gt;

&lt;p&gt;When compared to a popular Kafka Docker image, the Redpanda GitHub Action offers smaller RAM usage (47MB vs. 465MB) and a smaller Docker image size (only 130MB vs. 465MB). Considering GitHub Actions is billed by minutes of usage, the smaller the footprint, the cheaper it is to run the tests.&lt;/p&gt;

&lt;p&gt;After getting set up and running the tests locally, we know that everything is working properly and can publish to our repo and collaborate.&lt;/p&gt;

&lt;h2&gt;
  
  
  How will you use the Redpanda GitHub Action?
&lt;/h2&gt;

&lt;p&gt;We want to hear how you use your newfound knowledge about the Redpanda Action on GitHub CI. Join &lt;a rel="noopener noreferrer" href="https://rpnda.co/slack"&gt;the community Slack&lt;/a&gt; to share your ideas and experience, and check out &lt;a rel="noopener noreferrer" href="https://docs.redpanda.com/"&gt;Redpanda’s documentation&lt;/a&gt; for information on the other things you can do with Redpanda and your applications.&lt;/p&gt;

&lt;p&gt;For more information on the Redpanda GitHub Action, or to report an issue, please see &lt;a rel="noopener noreferrer" href="https://github.com/redpanda-data/github-action"&gt;the GitHub repository for the project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also visit &lt;a rel="noopener noreferrer" href="https://github.com/fernandes/redpanda-action-demo"&gt;the repository for the demo in this article&lt;/a&gt;, where you’ll find the code from this article and a sample test suite that uses the GitHub Action in a real-world scenario.&lt;/p&gt;

&lt;p&gt;Happy testing!&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>github</category>
      <category>redpanda</category>
    </item>
    <item>
      <title>Clickstream data analysis with Databricks and Redpanda</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 24 Aug 2022 17:36:16 +0000</pubDate>
      <link>https://dev.to/redpanda-data/clickstream-data-analysis-with-databricks-and-redpanda-6d4</link>
      <guid>https://dev.to/redpanda-data/clickstream-data-analysis-with-databricks-and-redpanda-6d4</guid>
      <description>&lt;p&gt;Global organizations need a way to process the massive amounts of data they produce for real-time decision making. They often utilize event-streaming tools like Redpanda with stream-processing tools like &lt;a href="https://databricks.com/" rel="nofollow noopener noreferrer"&gt;Databricks&lt;/a&gt; for this purpose. &lt;/p&gt;

&lt;p&gt;An example use case is recommending content to users based on their clicks on a mobile or web app. The clickstreams will be streamed through Redpanda to Databricks, where a recommendation engine will analyze their data and recommend content:&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/6j0zMK5LSd4cFKCiZapMXH/416fb4510f377e3ce61c912594a08f97/databricks_1.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/6j0zMK5LSd4cFKCiZapMXH/416fb4510f377e3ce61c912594a08f97/databricks_1.png" alt="databricks 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Redpanda is a fast and scalable real-time event streaming platform that serves as an Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; alternative. It’s API-compatible with Kafka, so all of your existing Kafka tooling works with Redpanda, too. It also ships as a single binary and can run on a virtual machine, in Docker, or on Kubernetes.&lt;/p&gt;

&lt;p&gt;This tutorial covers event streaming and data analytics using Redpanda and Databricks. You will learn how to produce data to a Redpanda topic from a shell, store the produced data in CSV files within Databricks, and analyze the data in real time to obtain insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Redpanda to send data to Databricks
&lt;/h2&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;p&gt;First, the prerequisites. To complete this tutorial, you’ll need the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A physical or virtual machine with a publicly accessible IP address&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.docker.com/get-docker/" rel="nofollow noopener noreferrer"&gt;Docker&lt;/a&gt; and docker-compose installed on that virtual machine&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/edenhill/kcat" rel="nofollow noopener noreferrer"&gt;Kafkacat&lt;/a&gt; (or any client compatible with the Kafka API) for connecting to Redpanda as a producer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up Redpanda
&lt;/h2&gt;

&lt;p&gt;To set up Redpanda, create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file in a server that can be accessed over the internet. This ensures that the Redpanda broker can communicate with your deployed Databricks instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.7"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;redpanda&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redpanda&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;start&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--smp&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--reserve-memory&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;0M&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--overprovisioned&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--node-id&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--kafka-addr&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--advertise-kafka-addr&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.vectorized.io/vectorized/redpanda:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redpanda-1&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;9092:9092&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;29092:29092&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;8081:8081&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
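&lt;p&gt;Note the two listeners in the compose file: &lt;code&gt;PLAINTEXT&lt;/code&gt; for traffic inside the Docker network and &lt;code&gt;OUTSIDE&lt;/code&gt; for external clients. As a quick illustration (plain Python, nothing Redpanda-specific), splitting the advertised-address flag shows which address each named listener hands back to connecting clients:&lt;/p&gt;

```python
# The value passed to --advertise-kafka-addr in the compose file above.
adv = "PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092"

# Split the flag into a mapping of listener name to advertised address.
listeners = dict(part.split("://", 1) for part in adv.split(","))
print(listeners["OUTSIDE"])  # localhost:9092
```

&lt;p&gt;External clients such as Databricks connect through the &lt;code&gt;OUTSIDE&lt;/code&gt; listener on port 9092, which is why that port needs to be reachable over the internet.&lt;/p&gt;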



&lt;p&gt;Start the Redpanda container by changing to the directory containing the &lt;code&gt;docker-compose.yml&lt;/code&gt; file and executing the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This operation will pull the Redpanda Docker image and start Redpanda on port 9092. Ensure your virtual machine instance has a static public IP address and that port 9092 is publicly accessible.&lt;/p&gt;
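&lt;p&gt;If you want to confirm the broker port is reachable before moving on, a small Python check does the trick (standard library only; the throwaway local listener below is just a stand-in for your server):&lt;/p&gt;

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a throwaway local listener; for a real deployment you would
# call port_reachable("SERVER_IP", 9092) from outside your network instead.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
reachable = port_reachable("127.0.0.1", listener.getsockname()[1])
listener.close()
unreachable = port_reachable("127.0.0.1", 1, timeout=0.5)
print(reachable, unreachable)  # True False
```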

&lt;p&gt;Refer to the following guides to add a public IP address and port on your virtual machine instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/ec2-associate-static-public-ip/" rel="nofollow noopener noreferrer"&gt;Adding a public IP address on Amazon EC2&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/virtual-network/ip-services/associate-public-ip-address-vm" rel="nofollow noopener noreferrer"&gt;Adding a public IP address on Azure Virtual Machine &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/compute/docs/ip-addresses/reserve-static-external-ip-address" rel="nofollow noopener noreferrer"&gt;Adding a public IP address on Google Compute Engine&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up Databricks
&lt;/h2&gt;

&lt;p&gt;To get started, &lt;a href="https://databricks.com/try-databricks" rel="nofollow noopener noreferrer"&gt;create a Databricks account&lt;/a&gt; (your account is free for a 14-day trial period). After filling in your account details, you'll be redirected to choose a cloud provider. Go ahead with your preferred cloud provider, choosing the appropriate setup instructions from the list below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/getting-started/account-setup.html" rel="nofollow noopener noreferrer"&gt;Setting up Databricks on AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal" rel="nofollow noopener noreferrer"&gt;Setting up Databricks on Azure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.gcp.databricks.com/getting-started/try-databricks-gcp.html" rel="nofollow noopener noreferrer"&gt;Setting up Databricks on GCP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/7njzynPxeBjQjlQKmUBDkW/6656adac0cdce8a39baa5b0603fcf307/databricks_2.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/7njzynPxeBjQjlQKmUBDkW/6656adac0cdce8a39baa5b0603fcf307/databricks_2.png" alt="databricks 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a successful setup, you should land on a dashboard with links to various aspects of Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/7mZtFvqiJ2ufJonAq4ihHp/a93c500d1f61dce351a605c2ef1bc026/databricks_3.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/7mZtFvqiJ2ufJonAq4ihHp/a93c500d1f61dce351a605c2ef1bc026/databricks_3.png" alt="databricks 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your first task is to create a Databricks cluster. A cluster is a set of computational resources and configurations that lets you run data science and engineering workloads. In this case, you’ll be running a data engineering workload to stream data from a Redpanda topic to a CSV file within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/66i9aVLkoPf0piYLVXiv0n/a6faada1914f6f58533c05a1f5d9987f/New_website_image_size.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/66i9aVLkoPf0piYLVXiv0n/a6faada1914f6f58533c05a1f5d9987f/New_website_image_size.png" alt="databricks 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the dashboard, click the Create button at the top left, then click Cluster. You’ll be taken to a page where you can configure your cluster’s properties. Choose a name for your cluster, leave the other fields unchanged, and then click Create Cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/2bec2gXpdIQLuw5BUodm5O/5673cf51563c7c02d9c5332bee058f4e/New_website_image_size__1_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/2bec2gXpdIQLuw5BUodm5O/5673cf51563c7c02d9c5332bee058f4e/New_website_image_size__1_.png" alt="Databricks new cluster"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You’ll be using a Databricks notebook to carry out all the tasks in this tutorial. A Databricks notebook is an interface that can contain runnable code, documentation, and visualization, similar to a Jupyter notebook. This notebook will serve as the scratchpad for running your commands.&lt;br&gt;
Again, click on the Create button at the top left of your dashboard and this time select the option to create a new Notebook. Choose a descriptive name like &lt;code&gt;redpanda-kconnect-scratchpad&lt;/code&gt; and set Scala as the default language.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/1MYcbBTfO99rurgklo2pVd/97ccf6e8605d27b5ad9021bfaa9c94de/databricks_6.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/1MYcbBTfO99rurgklo2pVd/97ccf6e8605d27b5ad9021bfaa9c94de/databricks_6.png" alt="databricks 6"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up streaming in Databricks
&lt;/h2&gt;

&lt;p&gt;After setting up your first notebook, paste the content below in the notebook’s first cell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;.{&lt;/span&gt;&lt;span class="nx"&gt;get_json_object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;json_tuple&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;streamingInputDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;readStream&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;kafka&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SERVER_IP:9092&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;subscribe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;csv_input&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;startingOffsets&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;latest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;minPartitions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;10&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;failOnDataLoss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above creates a streaming data frame assigned to the variable &lt;code&gt;streamingInputDF&lt;/code&gt;. This data frame subscribes to the topic of interest, &lt;code&gt;csv_input&lt;/code&gt;, in the Redpanda cluster. The cluster is identified by the server IP and port. The port in this case is &lt;code&gt;9092&lt;/code&gt;, the same default port that Kafka uses.&lt;/p&gt;

&lt;p&gt;Replace &lt;code&gt;SERVER_IP&lt;/code&gt; with the deployed IP address of your server. After setting the &lt;code&gt;SERVER_IP&lt;/code&gt;, run the cell to initialize the configuration. You should get an output similar to the one below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamingInputDF:org.apache.spark.sql.Dataset[String] = [value: string]
import org.apache.spark.sql.functions.{get_json_object, json_tuple}
streamingInputDF: org.apache.spark.sql.Dataset[String] = [value: string]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to save data in Databricks, you need to define a write stream to a file. This write stream should be of the same file type as the input stream. The snippet below reads data from the &lt;code&gt;streamingInputDF&lt;/code&gt; data frame and writes it to CSV files. The write operation is triggered every thirty seconds, and all new entries to the Redpanda topic are read and written to a new CSV file.&lt;/p&gt;
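&lt;p&gt;To make the append behavior concrete, here is a plain-Python sketch (not Spark code) of what each trigger does on disk: every micro-batch lands in its own part file instead of rewriting files from earlier triggers:&lt;/p&gt;

```python
import csv
import pathlib
import tempfile

def write_batch(out_dir, batch_id, rows):
    """Write one micro-batch to its own part file.

    This mimics Spark's append output mode: every trigger emits a brand-new
    file rather than modifying files written by earlier triggers.
    """
    path = pathlib.Path(out_dir) / f"part-{batch_id:05d}.csv"
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path

# Two simulated 30-second triggers, each landing as a separate CSV file.
out_dir = tempfile.mkdtemp()
write_batch(out_dir, 0, [["1", "Israel", "Edeh"]])
write_batch(out_dir, 1, [["2", "John", "Doe"]])
part_files = sorted(p.name for p in pathlib.Path(out_dir).glob("*.csv"))
print(part_files)  # ['part-00000.csv', 'part-00001.csv']
```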

&lt;p&gt;Create a second cell and add the content in the snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;streaming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Trigger&lt;/span&gt;

&lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;streamingInputDF&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;writeStream&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;csv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;append&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/FileStore/tables/user-details-data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;checkpointLocation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/FileStore/tables/user-details-check&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Trigger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ProcessingTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;30 seconds&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, run the command to start the streaming operation. Your output should look similar to the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/1HAQz09wI6BvA5TwEVej2n/30853861e854bf3dcbf933c64f9ea8a1/New_website_image_size__6_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/1HAQz09wI6BvA5TwEVej2n/30853861e854bf3dcbf933c64f9ea8a1/New_website_image_size__6_.png" alt="databricks start streaming"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading data from Redpanda to Databricks
&lt;/h2&gt;

&lt;p&gt;In order to see actual data in Databricks, you’ll stream data to Redpanda using Kafkacat. Run the command below in your shell to create a Redpanda console producer. Replace &lt;code&gt;SERVER_IP&lt;/code&gt; with the IP address of the server running Redpanda:&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;code&gt;kafkacat -b SERVER_IP:9092 -t csv_input -P&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now paste the CSV content into the producer line by line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id,first_name,last_name
1,Israel,Edeh
2,John,Doe
3,Jane,Austin
4,Omo,Lawal
5,John,Manson
6,John,Rinzler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
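&lt;p&gt;Conceptually, each pasted line becomes one record whose value is the raw CSV text. The illustration below (plain Python, no broker involved) shows the shape of what the producer sends, with the topic name matching the stream's subscription:&lt;/p&gt;

```python
# The header and first rows pasted into the producer; each line is sent
# as one record whose value is the raw CSV line.
csv_text = """id,first_name,last_name
1,Israel,Edeh
2,John,Doe"""

records = [{"topic": "csv_input", "value": line} for line in csv_text.splitlines()]
print(len(records), records[1]["value"])  # 3 1,Israel,Edeh
```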



&lt;p&gt;Depending on how quickly you paste the content, you should see five completed jobs in the output area of the write stream cell.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/3P6SJ3XAWM78Q2XQZu5SAo/edee5f0ce9b5f49332e48eda8597df1f/New_website_image_size__5_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/3P6SJ3XAWM78Q2XQZu5SAo/edee5f0ce9b5f49332e48eda8597df1f/New_website_image_size__5_.png" alt="Databricks complete streaming"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see the CSV files, create a new cell and run the following command:&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;code&gt;%fs ls /FileStore/tables/user-details-data/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should get a table showing all created CSV files in the output area.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/41lTLsmodUaXtKosZ1AE0k/a7d0b9762c8cf0ce6ca2b503324a715a/New_website_image_size__4_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/41lTLsmodUaXtKosZ1AE0k/a7d0b9762c8cf0ce6ca2b503324a715a/New_website_image_size__4_.png" alt="databricks user details data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the command below to see the actual content of the files:&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;code&gt;spark.read.csv("/FileStore/tables/user-details-data/").show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should get an output listing the entries you’ve streamed so far. It will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+
|          _c0|
+-------------+
|id, first_name,last_name|
|1,Israel,Edeh|
|3,Jane,Austin|
|5,John,Manson|
|6,John,Rinzler|
|  4,Omo,Lawal|
|   2,John,Doe|
+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Analyzing the streamed user data
&lt;/h2&gt;

&lt;p&gt;Databricks offers data analysis and machine learning tools to help organizations make sense of their data. You can perform simple analysis on the data you streamed in this tutorial using Apache Spark&lt;sup&gt;Ⓡ&lt;/sup&gt; queries. Say you want to group users by their first name to find out how many users share each first name. You can use the following code to achieve this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;users_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/FileStore/tables/user-details-data/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;users_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the command above will produce the following result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------+-----+
|first_name|count|
+----------+-----+
|      John|    3|
|    Israel|    1|
|       Omo|    1|
|      Jane|    1|
+----------+-----+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see from the analysis that three users have "John" as their first name. You can run further analysis with a dataset with more rows.&lt;/p&gt;
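&lt;p&gt;You can also sanity-check that aggregation outside Spark. A plain-Python equivalent of the &lt;code&gt;groupBy("first_name").count()&lt;/code&gt; query over the six sample rows gives the same counts:&lt;/p&gt;

```python
from collections import Counter

# The six sample rows streamed in this tutorial, as plain tuples.
rows = [
    (1, "Israel", "Edeh"),
    (2, "John", "Doe"),
    (3, "Jane", "Austin"),
    (4, "Omo", "Lawal"),
    (5, "John", "Manson"),
    (6, "John", "Rinzler"),
]

# Equivalent of groupBy("first_name").count() for this tiny dataset.
counts = Counter(first for _, first, _ in rows)
print(counts["John"])  # 3
```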

&lt;h2&gt;
  
  
  Plotting the streamed data
&lt;/h2&gt;

&lt;p&gt;Databricks also allows you to visualize and plot your data. You can prepare your CSV data for plotting by selecting the headers and configuring the plot options. Your first task is to display the data as a table. Run the command below in a new cell to do so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;diamonds_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/FileStore/tables/user-details-data/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diamonds_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"last_name"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, change the display type to &lt;code&gt;bar&lt;/code&gt; and then click on the Plot Options… button to customize the bar chart. Drag and drop &lt;code&gt;first_name&lt;/code&gt; to the &lt;code&gt;keys&lt;/code&gt; and &lt;code&gt;values&lt;/code&gt; boxes and remove other fields. Then set the aggregation type to &lt;code&gt;COUNT&lt;/code&gt;. Finally, click on Apply to apply your customization.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/4oIvoBmtg7LEYdSJC1PlWr/88ac8c406fb2a27bce21fcdcaae8a037/New_website_image_size__3_.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/4oIvoBmtg7LEYdSJC1PlWr/88ac8c406fb2a27bce21fcdcaae8a037/New_website_image_size__3_.png" alt="databricks customize plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What will you build with Databricks and Redpanda?
&lt;/h2&gt;

&lt;p&gt;Distributed systems require speed at all levels and in every component. Redpanda scales particularly well for mission-critical systems, without depending on a JVM or ZooKeeper.&lt;/p&gt;

&lt;p&gt;Now that you know how to stream data from Redpanda to Databricks and analyze and plot data using Databricks’s native display function, you can use this setup to analyze data in real time for nearly any project.&lt;/p&gt;

&lt;p&gt;Interact with Redpanda’s developers directly in &lt;a href="https://redpanda.com/slack" rel="nofollow noopener noreferrer"&gt;the Redpanda Slack community&lt;/a&gt;, or contribute to Redpanda’s &lt;a href="https://github.com/redpanda-data/redpanda/" rel="noopener noreferrer"&gt;source-available GitHub repo&lt;/a&gt;. To learn more about everything you can do with Redpanda, check out &lt;a href="https://docs.redpanda.com/docs/home"&gt;the documentation here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>beginners</category>
      <category>webdev</category>
      <category>kafka</category>
    </item>
    <item>
      <title>Spark vs Flink vs ksqlDB for stream processing</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 17 Aug 2022 14:19:00 +0000</pubDate>
      <link>https://dev.to/redpanda-data/spark-vs-flink-vs-ksqldb-for-stream-processing-5a5h</link>
      <guid>https://dev.to/redpanda-data/spark-vs-flink-vs-ksqldb-for-stream-processing-5a5h</guid>
      <description>&lt;p&gt;Modern business is digital and happens in real time. Users expect more interactive and instantaneous experiences all the time, which must be facilitated with suitable real-time data processing. Distributed applications like microservices, with automated deployments to public or private cloud platforms, have also incorporated more event-driven systems with a corresponding increased need for real-time capabilities. In this context, applications rely on real-time stream processing to power their business logic and deliver the appropriate real-time experiences for their users and decision-making capabilities for themselves.&lt;/p&gt;

&lt;p&gt;As the amount of data that must be processed has grown, companies have focused on large-scale data processing technologies that can analyze data, run machine learning functions, and create materialized views and time windows. There are many available stream processing technologies, but this article focuses on three of the most popular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://spark.apache.org/" rel="nofollow noopener noreferrer"&gt;Apache Spark&lt;sup&gt;Ⓡ&lt;/sup&gt;&lt;/a&gt; is a multi-language framework designed for executing data engineering, data science, and machine learning computation on single-node machines or clusters. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://flink.apache.org/" rel="nofollow noopener noreferrer"&gt;Apache Flink&lt;sup&gt;Ⓡ&lt;/sup&gt;&lt;/a&gt; is a stream and batch processing framework designed for data analytics, data pipelines, ETL, and event-driven applications. Like Spark, Flink helps process large-scale data streams and delivers real-time analytical insights.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ksqldb.io/" rel="nofollow noopener noreferrer"&gt;ksqlDB&lt;/a&gt; is an Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt;-native stream processing framework that provides a useful, lightweight SQL interface for event capturing, continuous event transformations, aggregations, and serving materialized views.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article introduces these stream processing frameworks and compares the pros and cons of the tools and some of their more unique features. You'll also learn how to use each tool with Redpanda for real-time data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Spark
&lt;/h2&gt;

&lt;p&gt;Apache Spark is a popular open-source analytics engine that is designed for scalable big data analytics. The Apache Spark research project started at UC Berkeley’s AMPLab in 2009 as a response to the limitations of the MapReduce model, and it became a top-level Apache project in 2014.&lt;/p&gt;

&lt;p&gt;MapReduce is a first-generation distributed data processing system. It processes parallelizable data on a distributed, horizontally scalable infrastructure. As a distributed system, MapReduce enforces a particular linear data flow: a MapReduce program reads input from disk, maps a function across the data, reduces the results of the map, and writes the output back to disk.&lt;/p&gt;
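&lt;p&gt;As a plain-Python sketch (a toy word count, not actual MapReduce or Spark code), that linear read, map, reduce, write flow looks like this:&lt;/p&gt;

```python
from itertools import groupby

# "Read" input (in real MapReduce this comes from disk/HDFS)
lines = ["to be or not to be"]

# Map: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: bring pairs with the same key together
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts per word
counts = {word: sum(c for _, c in pairs)
          for word, pairs in groupby(mapped, key=lambda kv: kv[0])}

print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real MapReduce, each stage also writes its intermediate results back to disk, which is exactly the overhead Spark’s in-memory approach avoids.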

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/6ZGBa7Ltvc7Xr1ET3ruW2l/5b557c934f707a4b41615a9b73ae1b6b/redpanda_and_apache_spark.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/6ZGBa7Ltvc7Xr1ET3ruW2l/5b557c934f707a4b41615a9b73ae1b6b/redpanda_and_apache_spark.png" alt="redpanda and apache spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spark is a third-generation data processing framework that improves on MapReduce’s performance by processing data in memory instead of writing it to disk at each step. Spark partitions the in-memory data logically across many machines into &lt;code&gt;Resilient Distributed Datasets (RDDs)&lt;/code&gt;, which serve as an abstraction layer for managing the logically distributed data.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/5kyKn3UtaO08HKfFJHYAJG/2f3e424fc23274f04f90697e1549320f/redpanda_and_apache_spark_2.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/5kyKn3UtaO08HKfFJHYAJG/2f3e424fc23274f04f90697e1549320f/redpanda_and_apache_spark_2.png" alt="redpanda and apache spark 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spark consists of many components, such as Spark SQL, MLlib, Spark Streaming, and GraphX. These components were not all included from the start but have been developed over the years to satisfy many big data system requirements. Spark Streaming is one of the most important components; it provides support for live data streams generated by a variety of sources such as Apache Kafka, Apache Flume, Twitter, ZeroMQ, Amazon Kinesis, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/19FMj0yKjo5VvmJK60zDrl/d4820d6e1a7adbd835644108a3b1bade/spark_core_api.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/19FMj0yKjo5VvmJK60zDrl/d4820d6e1a7adbd835644108a3b1bade/spark_core_api.png" alt="spark core api"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moreover, Spark has a high-level API called Structured Streaming, which is built on top of the Spark SQL API. Structured Streaming can run the same operations on streams that you would perform in batch mode, such as querying a static RDD.&lt;/p&gt;
&lt;h3&gt;
  
  
  Using Apache Spark with Redpanda
&lt;/h3&gt;

&lt;p&gt;Both the Spark Streaming and Structured Streaming APIs integrate well with the Kafka API.&lt;/p&gt;

&lt;p&gt;Because Redpanda is API-compatible with Kafka, you can use Redpanda for mission-critical workloads that you need to process via Apache Spark.&lt;/p&gt;

&lt;p&gt;You can stream data from Redpanda and process it in batches in Apache Spark. Alternatively, you can use the Structured Streaming API to consume data from Redpanda, process it in Spark, and save it to a Spark SQL DataFrame, for example.&lt;/p&gt;

&lt;p&gt;The following Python code snippet is a very high-level example of reading a stream from a Redpanda topic called &lt;code&gt;my-redpanda-topic&lt;/code&gt;, by accessing the Redpanda cluster via &lt;code&gt;redpandahost:9092&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"redpandahost:9092"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscribe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"my-redpanda-topic"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key AS STRING)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value AS STRING)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use Redpanda and Spark Streaming for real-time analysis, sentiment analysis, real-time fraud detection, and many more real-time stream processing use-cases that require the computational capabilities of Apache Spark. For more detailed information on how to use Apache Spark with Redpanda, check out our blog on &lt;a href="https://redpanda.com/blog/buildling-streaming-application-docker-spark-redpanda/"&gt;structured stream processing with Redpanda and Apache Spark&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Pros and cons of Apache Spark
&lt;/h3&gt;

&lt;p&gt;Apache Spark is a great framework to use with Redpanda streaming, but as is the case with many tools or frameworks, there are a few pros and cons worth mentioning.&lt;/p&gt;

&lt;p&gt;One of the game-changing advantages of Spark is its in-memory structure, which provides very fast performance. However, Spark’s in-memory structure can cause high memory consumption as it might respond to many stream processing requests simultaneously.&lt;/p&gt;

&lt;p&gt;Apart from the in-memory structure, the following is an aggregated list of pros and cons for Apache Spark:&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;supports multiple languages (Scala, Java, Python, R, C#, F#)&lt;/li&gt;
&lt;li&gt;fault-tolerant&lt;/li&gt;
&lt;li&gt;integrates with many technologies&lt;/li&gt;
&lt;li&gt;advanced analysis capability&lt;/li&gt;
&lt;li&gt;easily does batch processing (micro-batches)&lt;/li&gt;
&lt;li&gt;supports stream processing&lt;/li&gt;
&lt;li&gt;fast performance (because of the in-memory structure)&lt;/li&gt;
&lt;li&gt;supports SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;high memory consumption&lt;/li&gt;
&lt;li&gt;HDFS as the only state backend&lt;/li&gt;
&lt;li&gt;steep learning curve&lt;/li&gt;
&lt;li&gt;time windowing only&lt;/li&gt;
&lt;li&gt;no native stream processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just some examples, and there will likely be many other pros and cons of Apache Spark that are either use-case specific or related to the streaming technology being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Flink
&lt;/h2&gt;

&lt;p&gt;Apache Flink is an open-source distributed processing engine that provides stateful computations over unbounded and bounded data streams. In Flink, everything is considered a stream, including batch files.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/1TsrVsQNPdgY2M82ilYLUS/2b731c6467b9e91e30514c6f21482785/redpanda_and_apache_flink.png" alt="redpanda and apache flink"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;&lt;a href="https://flink.apache.org/"&gt;https://flink.apache.org/&lt;/a&gt;&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Flink is a fourth-generation data processing framework and supports both batch and stream processing. Unlike Apache Spark, Flink is natively designed for stream processing. It treats batch files as bounded streams.&lt;/p&gt;

&lt;p&gt;You can ingest streaming data from many sources, process them, and distribute them across various nodes with Apache Flink.&lt;/p&gt;

&lt;p&gt;Apache Kafka is one such streaming source, which is considered a great persistent layer for stream processing applications. Flink provides a &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/" rel="noopener noreferrer"&gt;Kafka connector library&lt;/a&gt; for reading data from a Kafka topic or writing data to Kafka topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Apache Flink with Redpanda
&lt;/h3&gt;

&lt;p&gt;Apache Flink can easily read data from or write data to Redpanda. It does not provide a particular API or connector for Redpanda, but because Redpanda is fully Kafka API compatible, you can just configure it as a Kafka connection, and Redpanda takes care of the rest.&lt;/p&gt;

&lt;p&gt;You can create continuous streaming pipelines where event computations are triggered in Flink as soon as the event is received from Redpanda.&lt;/p&gt;

&lt;p&gt;The following Java code snippet is a small example of using the connector library to read a stream from a Redpanda topic called &lt;code&gt;my-redpanda-topic&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;KafkaSource&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaSource&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setBootstrapServers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"redpandahost:9092"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTopics&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-redpanda-topic"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setGroupId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-group"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStartingOffsets&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OffsetsInitializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;earliest&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueOnlyDeserializer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleStringSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

  &lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;dataStreamSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"Redpanda Source"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;//Stream processing actions on dataStreamSource...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that a class called &lt;code&gt;KafkaSource&lt;/code&gt; is used to consume the data from Redpanda. Conversely, to send messages to Redpanda, you use &lt;code&gt;KafkaSink&lt;/code&gt; to stream data out. You might come across examples that use classes like &lt;code&gt;FlinkKafkaConsumer&lt;/code&gt; and &lt;code&gt;FlinkKafkaProducer&lt;/code&gt;; these are deprecated Flink Kafka connector classes.&lt;/p&gt;

&lt;p&gt;For more detailed information about using Apache Flink with Redpanda, check out the &lt;a href="https://redpanda.com/blog/redpanda-flink-docker/"&gt;building streaming applications using Flink SQL and Redpanda&lt;/a&gt; tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros and cons of Apache Flink
&lt;/h3&gt;

&lt;p&gt;Apache Flink is a great native stream processing system to use with Redpanda. However, as with the other streaming technologies, there are several pros and cons of Apache Flink you should consider. &lt;/p&gt;

&lt;p&gt;One of the great benefits of Apache Flink is its very shallow learning curve. It’s very easy to get started with, and it has good documentation. However, good documentation is not enough when it comes to support, particularly if you run into more advanced issues; compared to Apache Spark, Flink has a smaller community, so the support it can provide is more limited.&lt;/p&gt;

&lt;p&gt;Apart from the learning curve, documentation, and the support, the following is an aggregated list of pros and cons for Apache Flink:&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;easy to start / very low learning curve&lt;/li&gt;
&lt;li&gt;good documentation&lt;/li&gt;
&lt;li&gt;low latency / high throughput&lt;/li&gt;
&lt;li&gt;clean data stream API&lt;/li&gt;
&lt;li&gt;simple UI and UX&lt;/li&gt;
&lt;li&gt;in-memory, file system, RocksDB as the state backend&lt;/li&gt;
&lt;li&gt;windowing by both time and count&lt;/li&gt;
&lt;li&gt;SQL support&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;difficult Hadoop integration (Apache Spark integrates better)&lt;/li&gt;
&lt;li&gt;limited language support (Java, Scala, Python, and SQL)&lt;/li&gt;
&lt;li&gt;limited community support (compared to Apache Spark)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, these are simply some examples of Apache Flink’s pros and cons, which can vary depending on your specific use-case or the streaming technology you are using.&lt;/p&gt;

&lt;h2&gt;
  
  
  ksqlDB
&lt;/h2&gt;

&lt;p&gt;ksqlDB is a database for building stream processing applications on top of Apache Kafka. It is based on the &lt;a href="https://kafka.apache.org/documentation/streams/" rel="noopener noreferrer"&gt;Kafka Streams API&lt;/a&gt; and licensed under the &lt;a href="https://www.confluent.io/confluent-community-license/" rel="nofollow noopener noreferrer"&gt;Confluent Community License Agreement&lt;/a&gt;. It is a distributed, scalable, real-time stream processing framework that provides a lightweight SQL syntax.&lt;/p&gt;

&lt;p&gt;Powered by the Kafka Streams API, ksqlDB is a robust, embeddable stream processing engine that provides a simple way to build standard Java stream processing applications. It extends the Kafka Streams API with more features, such as a streaming server and an easy-to-use SQL interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/paqvtpyf8rwu/40hFuxVhg61EJLBRQxiYnp/b11ab6faa27ccbbaeca897b44ca2f897/redpanda_and_ksqldb.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/paqvtpyf8rwu/40hFuxVhg61EJLBRQxiYnp/b11ab6faa27ccbbaeca897b44ca2f897/redpanda_and_ksqldb.png" alt="redpanda and ksqldb"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ksqlDB has Apache Kafka at its core as the persistence layer. Because it uses the Kafka Streams API, it can access and use any Kafka cluster without extra integration configuration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Using ksqlDB with Redpanda
&lt;/h3&gt;

&lt;p&gt;Redpanda can easily serve as the Kafka backbone for ksqlDB. A standalone installation of ksqlDB requires one, and because Redpanda is fully compatible with the Kafka API, it makes a great drop-in replacement.&lt;/p&gt;

&lt;p&gt;The following Docker Compose YAML is an example of setting up a ksqlDB server and a ksqlDB CLI container that is connected to the Redpanda cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;redpanda&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redpanda&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;start&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--smp&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--reserve-memory&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;0M&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--overprovisioned&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--node-id&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--kafka-addr&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--advertise-kafka-addr&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092&lt;/span&gt;
     &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.vectorized.io/vectorized/redpanda:latest&lt;/span&gt;
     &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redpanda-1&lt;/span&gt;
     &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;9092:9092&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;29092:29092&lt;/span&gt;

  &lt;span class="na"&gt;ksqldb-server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confluentinc/ksqldb-server:0.25.1&lt;/span&gt;
    &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ksqldb-server&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ksqldb-server&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redpanda&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8088:8088"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;KSQL_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://0.0.0.0:8088&lt;/span&gt;
      &lt;span class="na"&gt;KSQL_BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redpanda:9092&lt;/span&gt;
      &lt;span class="na"&gt;KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;

  &lt;span class="na"&gt;ksqldb-cli&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confluentinc/ksqldb-cli:0.25.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ksqldb-cli&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ksqldb-server&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
    &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use Redpanda and ksqlDB to create a streaming ETL pipeline (aka a streaming data pipeline) for real-time data analysis, creating materialized views of event-driven microservices, predictive analytics, and many more similar use-cases.&lt;/p&gt;
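&lt;p&gt;As a rough sketch of what such a pipeline can look like (the topic, stream, and column names below are hypothetical), you declare a stream over a Redpanda topic and let a persistent query continuously materialize an aggregate from it:&lt;/p&gt;

```sql
-- Declare a stream over an existing Redpanda topic (hypothetical names)
CREATE STREAM orders (order_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC = 'orders', VALUE_FORMAT = 'JSON');

-- A persistent query that keeps a materialized view up to date
CREATE TABLE order_totals AS
  SELECT order_id, SUM(amount) AS total
  FROM orders
  GROUP BY order_id
  EMIT CHANGES;
```

Because Redpanda speaks the Kafka protocol, ksqlDB treats the Redpanda topic exactly as it would a Kafka topic.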

&lt;p&gt;For more detailed information on how to use ksqlDB with Redpanda, check out &lt;a href="https://redpanda.com/blog/ksqldb-materialized-cache"&gt;this tutorial on how to build a materialized cache with ksqlDB and Redpanda&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros and cons of ksqlDB
&lt;/h3&gt;

&lt;p&gt;ksqlDB is a Kafka-native stream processing system that is very easy to use with Redpanda. However, as with the other tools, it has its pros and cons.&lt;/p&gt;

&lt;p&gt;One of ksqlDB’s greatest advantages is its strong integration with Apache Kafka. Whereas other streaming frameworks manage this integration using connectors or Kafka libraries, ksqlDB has Kafka at its core. However, ksqlDB’s analytics capabilities are weak compared to those of Flink and Spark, both of which have more tools for handling analytical workloads.&lt;/p&gt;

&lt;p&gt;Apart from the Kafka integration, analytics capabilities, and licensing model, here is an aggregated list of pros and cons for ksqlDB:&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;very easy Apache Kafka / Redpanda integration&lt;/li&gt;
&lt;li&gt;low latency of up to ten milliseconds&lt;/li&gt;
&lt;li&gt;easy-to-use SQL interface&lt;/li&gt;
&lt;li&gt;integrates with existing applications (because of Kafka Streams API)&lt;/li&gt;
&lt;li&gt;RocksDB as the state backend&lt;/li&gt;
&lt;li&gt;less steep learning curve (compared to Spark)&lt;/li&gt;
&lt;li&gt;integrates through Kafka Connect&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;poor analytics capability (compared to Flink and Spark)&lt;/li&gt;
&lt;li&gt;higher learning curve (compared to Flink)&lt;/li&gt;
&lt;li&gt;unbounded data streams only (compared to Flink)&lt;/li&gt;
&lt;li&gt;license (Confluent Community License)&lt;/li&gt;
&lt;li&gt;no direct integration to Hadoop or other big data frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As was the case with the other tools, the stated pros and cons are just examples, and there may be many more depending on the specific use-case or the streaming technology being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you understand the differences between the popular stream processing frameworks Apache Spark, Apache Flink, and ksqlDB, you can make a more informed decision about when to use each tool. And, thanks to the integration tutorials linked in each section above, you know how to use any of them with Redpanda to meet your stream processing needs. Use any of these tools with Redpanda for real-time sentiment analysis, fraud detection, predictive analytics, and more.&lt;/p&gt;

&lt;p&gt;Follow the &lt;a href="https://redpanda.com/blog"&gt;Redpanda Blog&lt;/a&gt; for future tutorials and articles about integration use-cases of Redpanda and other cool data technologies, and &lt;a href="https://redpanda.com/slack" rel="noopener noreferrer"&gt;join our Slack community&lt;/a&gt; to share what you plan to build with Redpanda. To contribute to our GitHub repo, &lt;a href="https://github.com/redpanda-data/redpanda/" rel="noopener noreferrer"&gt;submit an issue here&lt;/a&gt;. For specific feature and usability questions, &lt;a href="https://docs.redpanda.com/docs/home"&gt;our documentation&lt;/a&gt; can help.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Using Buildkite and GitHub to automate parallel CI steps</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 10 Aug 2022 14:09:31 +0000</pubDate>
      <link>https://dev.to/redpanda-data/using-buildkite-and-github-to-automate-parallel-ci-steps-50ba</link>
      <guid>https://dev.to/redpanda-data/using-buildkite-and-github-to-automate-parallel-ci-steps-50ba</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At Redpanda, we always want to provide an experience that is fast, simple, and productive for developers. That applies to our own team of engineers, too. When considering how we could achieve a more stable continuous integration (CI) pipeline, we wanted that same experience: fast, simple, productive. By running multiple instances of our pipeline steps in parallel on our CI platform, &lt;a href="https://buildkite.com/" rel="noopener noreferrer"&gt;Buildkite&lt;/a&gt;, we can now run many repetitions of the same step in roughly the time needed for a single one.&lt;/p&gt;

&lt;p&gt;Today, our devs can kick off any number of builds in parallel simply by attaching a label like “ci-repeat-X” to their PR. In the rest of this post, I’ll discuss how we made this easy dev experience possible: we achieve repeatable builds by taking advantage of Buildkite’s &lt;code&gt;parallelism&lt;/code&gt; attribute and &lt;code&gt;pre-command&lt;/code&gt; hook, in combination with GitHub labels on pull requests for triggering parallel builds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Buildkite parallel programming
&lt;/h2&gt;

&lt;p&gt;When Buildkite introduced &lt;a href="https://buildkite.com/docs/tutorials/parallel-builds#parallel-jobs" rel="noopener noreferrer"&gt;a new feature&lt;/a&gt; to run multiple repetitions of a build step in parallel, we took advantage of it by adding an attribute called &lt;code&gt;parallelism&lt;/code&gt; to our CI pipeline configuration. We use this attribute to define the desired level of parallelism, and we started off with a constant value of 1.&lt;/p&gt;

&lt;p&gt;However, the challenge is to make the &lt;code&gt;parallelism&lt;/code&gt; value configurable so that users can enable or disable it whenever they want, providing a value of their choice for the number of parallel instances per step. Ideally, we want to let developers configure this number outside Buildkite’s context. A good candidate for that is GitHub, but we need a “bridge” between it and Buildkite. The bridge cannot be a step’s command that queries the GitHub pull request, because at that point it is too late to configure the step’s &lt;code&gt;parallelism&lt;/code&gt; attribute. In seeking a way to do this, we discovered Buildkite’s &lt;code&gt;pre-command&lt;/code&gt; hook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Buildkite pre-command
&lt;/h2&gt;

&lt;p&gt;Buildkite includes hooks that we can enable in order to have them automatically executed before a step’s command is initiated (&lt;code&gt;pre-command&lt;/code&gt;), or after a step run (&lt;code&gt;post-command&lt;/code&gt;). We took advantage of the &lt;code&gt;pre-command&lt;/code&gt; hook to discover the value that the user wants to configure as the &lt;code&gt;parallelism&lt;/code&gt; value. By doing this, we created a way to run any bash script we want before a pipeline’s step gets executed. This means that we can tweak a variable in the &lt;code&gt;pre-command&lt;/code&gt; hook in order to update the &lt;code&gt;parallelism&lt;/code&gt; attribute of a Buildkite step. &lt;/p&gt;

&lt;p&gt;Having done this, we addressed the next natural question: what is the most productive process for users to follow in order to update this variable when opening a pull request? Our options were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comment on the pull request (e.g. &lt;code&gt;/hey-buildkite repeat 5&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Edit a file to update the value and push the code&lt;/li&gt;
&lt;li&gt;Add a GitHub label (e.g. &lt;code&gt;ci-repeat-5&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our selection criteria were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Productive&lt;/li&gt;
&lt;li&gt;Easy-to-use&lt;/li&gt;
&lt;li&gt;Clean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we went with choice number one, we would end up with a big pull request conversation full of scattered comments, cluttering up what should be a discussion between developers about the pull request. We didn’t select this option because it violates the second and third criteria.&lt;/p&gt;

&lt;p&gt;For choice number two, we would have to answer the questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when we want to merge the PR? &lt;/li&gt;
&lt;li&gt;Do we want our default branch to be based on this file and run in parallel? (If so, what’s the impact on our cost?) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the questions raised by option two also suggested it wasn’t the best course to take. Besides, it violates the second criterion, because the user has to push code each time they want to update the requested level of parallelism.&lt;/p&gt;

&lt;p&gt;So, we decided to go with the third and best choice: adding a GitHub label. With this process, users who want to run their PR tests in parallel need only add a label to their PR and rebuild the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;parallelism&lt;/code&gt; attribute is set in each Buildkite step of the &lt;code&gt;pipeline.yml&lt;/code&gt; configuration. Its value is dynamically provided via an environment variable called &lt;code&gt;PARALLEL_STEPS&lt;/code&gt;. We just have to modify this environment variable using the &lt;code&gt;pre-command&lt;/code&gt; hook. &lt;/p&gt;
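&lt;p&gt;As an illustrative sketch (the step label and command below are hypothetical, not taken from Redpanda’s actual pipeline), a step in &lt;code&gt;pipeline.yml&lt;/code&gt; might look like this:&lt;/p&gt;

```yaml
steps:
  - label: "unit-tests"           # hypothetical step name
    command: "./run-tests.sh"     # hypothetical command
    parallelism: $PARALLEL_STEPS  # interpolated when the pipeline is uploaded
```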

&lt;p&gt;We wrote a script, run before the steps are loaded into Buildkite, that queries the GitHub API. Buildkite provides the PR number in the &lt;code&gt;BUILDKITE_PULL_REQUEST&lt;/code&gt; environment variable, so we can fetch the labels of the PR and match them against the pattern &lt;code&gt;ci-repeat-NN&lt;/code&gt;. With that, the whole workflow is ready: the hook queries the labels, finds the matching one, extracts the number, and exports it as the environment variable &lt;code&gt;PARALLEL_STEPS&lt;/code&gt;.&lt;/p&gt;
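&lt;p&gt;A minimal sketch of that label-matching logic (the function name is ours, not Redpanda’s; the real hook would first fetch the label names from the GitHub API for the PR identified by &lt;code&gt;BUILDKITE_PULL_REQUEST&lt;/code&gt;):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch of a pre-command hook helper: given a whitespace-separated list of
# GitHub label names, find one matching ci-repeat-NN and emit NN (default 1).
parallel_steps_from_labels() {
  local label steps=1
  for label in $1; do
    if [[ "$label" =~ ^ci-repeat-([0-9]+)$ ]]; then
      steps="${BASH_REMATCH[1]}"
    fi
  done
  echo "$steps"
}

# In the real hook, the label list would come from the GitHub API and the
# result would be exported for the step that follows, e.g.:
# export PARALLEL_STEPS="$(parallel_steps_from_labels "$labels")"
```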

&lt;p&gt;What about the cost? Shouldn’t we require users to delete this label after their job is done? Otherwise, won’t every subsequent commit have Buildkite run multiple steps in parallel? As mentioned, we aim to increase developer productivity, and requiring users to delete the label afterwards is exactly the kind of manual step we avoid. Once the &lt;code&gt;pre-command&lt;/code&gt; hook discovers the label, it’s useless to keep it on the PR, so the bot we’re using deletes it. Thus, we reduce the manual steps required of the developer and keep costs down, just by deleting a label.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building with DevProd in mind
&lt;/h2&gt;

&lt;p&gt;In summary, our process for running multiple instances of CI steps in parallel was created with developer productivity in mind. By parallelizing and running multiple instances of CI steps on Buildkite, we decreased our build’s total running time and improved the stability of CI testing in Redpanda. &lt;/p&gt;

&lt;p&gt;Learn more about Redpanda and download &lt;a href="https://github.com/redpanda-data/redpanda/"&gt;our binary on GitHub&lt;/a&gt;. Interact with our developers directly by joining &lt;a href="https://redpanda.com/slack"&gt;our Slack Community&lt;/a&gt; to ask questions about our CI steps or anything else. For more information about Redpanda and its features, &lt;a href="https://docs.redpanda.com/"&gt;browse our documentation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>github</category>
      <category>programming</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Simplifying Java development for real-time applications with Redpanda</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 03 Aug 2022 17:46:14 +0000</pubDate>
      <link>https://dev.to/redpanda-data/simplifying-java-development-for-real-time-applications-with-redpanda-3hap</link>
      <guid>https://dev.to/redpanda-data/simplifying-java-development-for-real-time-applications-with-redpanda-3hap</guid>
      <description>&lt;p&gt;At &lt;a rel="nofollow noopener noreferrer" href="https://datacater.io"&gt;DataCater&lt;/a&gt; we make real-time streaming a commodity for data and developer teams. DataCater provides a user interface, API, and declarative formats for creating production-grade Kafka Streams applications. Consequently, our products utilize and evolve around Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; API-compatible technologies.&lt;/p&gt;

&lt;p&gt;Kafka is a heavy piece of machinery, with multiple components that must be managed together to create a running instance. Components such as ZooKeeper&lt;sup&gt;Ⓡ&lt;/sup&gt; and a separate schema registry make it unfit for modern continuous integration / continuous deployment (CI/CD) pipelines, wherein we need to spin up clusters quickly and efficiently.&lt;/p&gt;

&lt;p&gt;Therefore, to accelerate our development cycle, we make use of Redpanda and &lt;a rel="nofollow noopener noreferrer" href="https://quarkus.io/"&gt;Java Quarkus&lt;/a&gt; to keep our workstations lightweight and simplify our CI pipeline for end-to-end testing.&lt;/p&gt;

&lt;p&gt;We develop our applications and services in Java to utilize the wide variety of connections and protocols readily available through frameworks like Apache Kafka Connect&lt;sup&gt;Ⓡ&lt;/sup&gt; and Apache Camel&lt;sup&gt;Ⓡ&lt;/sup&gt;. &lt;/p&gt;

&lt;p&gt;In this blog, we’ll show you how to use Redpanda to speed up and ease the burden of developing a Java-based streaming application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating Redpanda into Java Quarkus
&lt;/h2&gt;

&lt;p&gt;Java Quarkus is a Kubernetes-native Java stack. One goal of Java Quarkus is packaging libraries such that Java developers can create single binaries from Java programs via GraalVM&lt;sup&gt;TM&lt;/sup&gt;. This leads to fast start-up times and JVM-free containers, making Quarkus a go-to choice for developers targeting Kubernetes as their runtime. We use multiple Java Quarkus packages, including &lt;code&gt;quarkus-smallrye-reactive-messaging-kafka&lt;/code&gt;, which pulls Redpanda’s docker image and runs it as a process for streaming applications.&lt;/p&gt;

&lt;p&gt;The ease of installation, zero-config process starts, and fast startup of Redpanda has led to increasing adoption of Redpanda as the default Kafka-compatible messaging technology for development. At DataCater, we chose Redpanda as our default with Java Quarkus as well.&lt;/p&gt;

&lt;p&gt;Once you have &lt;code&gt;quarkus-smallrye-reactive-messaging-kafka&lt;/code&gt; extension as a dependency in your build system, Quarkus will automatically pull and run a single node Redpanda installation in your local docker engine.&lt;/p&gt;
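&lt;p&gt;For reference, adding the extension in a Gradle build looks roughly like the following (Maven users would declare the corresponding &lt;code&gt;io.quarkus&lt;/code&gt; dependency instead; the version is managed by the Quarkus BOM):&lt;/p&gt;

```groovy
dependencies {
    // Quarkus extension for Kafka-compatible reactive messaging; in dev mode
    // it starts a single-node Redpanda container automatically
    implementation 'io.quarkus:quarkus-smallrye-reactive-messaging-kafka'
}
```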

&lt;p&gt;Starting Quarkus in dev mode will yield the following output, and you can get started working interactively with Redpanda. This example uses the &lt;a rel="nofollow noopener noreferrer" href="https://www.testcontainers.org/"&gt;Testcontainers library&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./gradlew quarkusDev
…

2022-04-14 11:33:16,880 INFO  [🐳 .io/.11.3]] (build-27) Pulling docker image: docker.io/vectorized/redpanda:v21.11.3. Please be patient; this may take some time but only needs to be done once.
…
2022-04-14 11:33:41,037 INFO  [🐳 .io/.11.3]] (build-27) Container docker.io/vectorized/redpanda:v21.11.3 is starting: c41e717b516ad9810ca93828a72cf5203068c607996962951aa4979e08a6f15e
2022-04-14 11:33:42,539 INFO  [🐳 .io/.11.3]] (build-27) Container docker.io/vectorized/redpanda:v21.11.3 started in PT25.70221S
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the command &lt;code&gt;./gradlew quarkusDev&lt;/code&gt; gets you started with Redpanda, whereas a Kafka cluster would require you to set up ZooKeeper alongside. Configuring ZooKeeper, Kafka, and networking on a development machine is brittle and makes setting up development environments complex.  &lt;/p&gt;

&lt;p&gt;This setup, instead, enables you to tear down and restart without having to think or be time-constrained by setting up anything new on your developer machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perfect fit for continuous integration
&lt;/h2&gt;

&lt;p&gt;Redpanda’s small footprint compared to the Kafka stack and startup time make it a perfect fit for integration testing in our CI pipeline. First, let’s take a look at the image size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker images | grep 'debezium\|redpanda'
debezium/kafka                                  1.8         5da8d5410fe6   2 days ago   764MB
debezium/zookeeper                              1.8         c76319be13f3   2 days ago   547MB
vectorized/redpanda                             v21.11.12   31f4853dadff   6 days ago   302MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above example takes containers for Kafka and ZooKeeper distributed by the Debezium project as examples. By the nature of Kafka and ZooKeeper, a Java runtime has to be packaged into the container. As a result, Redpanda’s image is about 1GB smaller than the combined Kafka and ZooKeeper container images. That is a huge difference, especially if you cannot guarantee that pulled images are cached when using cloud-provided CI pipelines.&lt;/p&gt;

&lt;p&gt;Further, starting the Kafka and ZooKeeper containers takes around 10 seconds until the broker endpoint is ready, while Redpanda takes roughly six seconds. Over many commits and continuous testing, this difference easily accumulates to minutes and hours over the course of a week or month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redpanda simplifies networking for development
&lt;/h2&gt;

&lt;p&gt;Engineering teams are adopting Kubernetes (K8s) and, in the process, trying to make development environments as close to production as possible. As a software engineer, I want my tools to be simple to set up and to run with similar configurations across multiple stages. Redpanda’s setup remains just as simple and quick for production as it is for development or staging.&lt;/p&gt;

&lt;p&gt;At DataCater we target K8s as our runtime environment. A common approach for exposing K8s resources outside of a given cluster is to use an &lt;a rel="nofollow noopener noreferrer" href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Ingress&lt;/a&gt;, which we also use for developing a local application against services in a minikube cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--weML-sA8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/id5d30xsibdvttcsefsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--weML-sA8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/id5d30xsibdvttcsefsd.png" alt="Image description" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To achieve this with Redpanda, you simply deploy Redpanda &lt;a href="https://docs.redpanda.com/docs/deployment/kubernetes-connectivity/#custom-resource"&gt;via its custom resource&lt;/a&gt; and configure it with the following YAML description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;configuration:
 kafkaApi:
 - port: 9092
 pandaproxyApi:
 - port: 8082
 adminApi:
 - port: 9644 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Redpanda operator prepares a &lt;code&gt;redpanda.yaml&lt;/code&gt; and starts a Redpanda cluster with its configuration including an &lt;code&gt;advertised_kafka_api&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;strimzi-kafka-operator&lt;/code&gt; deploys a Kafka cluster with all its dependencies and networking configuration. &lt;/p&gt;

&lt;p&gt;Kafka’s networking makes this much more complicated for developers using &lt;code&gt;strimzi-kafka-operator&lt;/code&gt;. The equivalent of &lt;code&gt;advertised_kafka_api&lt;/code&gt; is only known upon initial deployment. Hence, developers need to adjust their K8s manifest and re-deploy Kafka via &lt;code&gt;strimzi-kafka-operator&lt;/code&gt; after any change to a Kafka cluster.&lt;/p&gt;

&lt;p&gt;An example &lt;a rel="nofollow noopener noreferrer" href="https://strimzi.io/blog/2019/05/23/accessing-kafka-part-5/"&gt;from their blog&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listeners:
  external:
    type: ingress
    configuration:
      bootstrap:
        host: kafka-bootstrap.localhost
      brokers:
      - broker: 0
        host: kafka-broker-0.localhost
      - broker: 1
        host: kafka-broker-1.localhost
      - broker: 2
        host: kafka-broker-2.localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How will you speed up development with Redpanda?
&lt;/h2&gt;

&lt;p&gt;Now that you’ve seen how we use Redpanda to speed up and simplify the development of a data-streaming application, you can improve your own development projects!&lt;/p&gt;

&lt;p&gt;At DataCater, we use Redpanda to speed up the development of our core product. Redpanda integrates easily into our development stack with Java Quarkus, reducing the runtime and footprint of our CI pipeline, and Redpanda’s ease of configuration helps us test Kafka workloads easily on a new Kubernetes cluster. Setting up &lt;code&gt;advertised_kafka_api&lt;/code&gt; for access from outside the cluster makes working with Redpanda a charm, and really reaps the benefit of thinking cloud-native and Kubernetes-first.&lt;/p&gt;

&lt;p&gt;So, a big thank you to the engineering team at Redpanda for creating a great developer experience! Check out the &lt;a href="https://github.com/redpanda-data/redpanda/" rel="noopener noreferrer"&gt;Redpanda GitHub repo&lt;/a&gt;, or go to their &lt;a href="https://docs.redpanda.com/"&gt;documentation&lt;/a&gt; to learn more. Join &lt;a href="https://redpanda.com/slack" rel="noopener noreferrer"&gt;the Redpanda Community&lt;/a&gt; on Slack to interact with me and Redpanda’s engineers directly. &lt;/p&gt;

</description>
      <category>java</category>
      <category>programming</category>
      <category>devops</category>
      <category>redpanda</category>
    </item>
    <item>
      <title>Using Flink SQL and Redpanda for stream processing</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 13 Jul 2022 20:59:41 +0000</pubDate>
      <link>https://dev.to/redpanda-data/using-flink-sql-and-redpanda-for-stream-processing-4e0j</link>
      <guid>https://dev.to/redpanda-data/using-flink-sql-and-redpanda-for-stream-processing-4e0j</guid>
      <description>&lt;p&gt;Stream-based processing has risen in popularity in recent years, spurred on by event-driven architectures, the need for faster analytics, and the availability of various technology stacks that make it all feasible. One popular component of such a stack is Apache Flink&lt;sup&gt;Ⓡ&lt;/sup&gt;, a stream processing framework that boasts one of the most active open source communities today. Flink takes a stream-first approach — everything is a stream, and a batch file is treated as a special case of a bounded stream. At Redpanda, we share the view that streaming is a superset of batch, and our goal is to make Redpanda the best persistence layer for stream processors.&lt;/p&gt;

&lt;p&gt;Flink has several API layers that provide different levels of abstraction. At the highest level is Flink SQL, which allows non-programmers to easily build complex streaming pipelines in a familiar language. Flink SQL is ANSI compliant, and supports constructs such as joins, aggregations, windowing, and even user-defined functions. It can integrate with a number of data sources and sinks declaratively, via DDL statements. For example, to allow Flink SQL to read from an Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE access_logs (
    event_time TIMESTAMP(3) METADATA FROM 'timestamp',
    host STRING,
    ident STRING,
    authuser STRING,
    request STRING,
    status SMALLINT,
    bytes BIGINT
) WITH (
    'connector' = 'kafka',  -- using kafka connector
    'topic' = 'logs',  -- kafka topic
    'scan.startup.mode' = 'earliest-offset',  -- reading from the beginning
    'properties.bootstrap.servers' = 'kafka:9094',  -- kafka broker address
    'format' = 'csv'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the table is declared, reading and processing the stream coming from the topic is straightforward. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count(1), status FROM access_logs GROUP BY status;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
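&lt;p&gt;Windowed aggregations are just as declarative. As a sketch (this assumes a &lt;code&gt;WATERMARK&lt;/code&gt; has been declared on &lt;code&gt;event_time&lt;/code&gt; so it can serve as a time attribute, which the DDL above does not show):&lt;/p&gt;

```sql
-- Count requests per status over one-minute tumbling windows
SELECT
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  status,
  COUNT(*) AS cnt
FROM access_logs
GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE), status;
```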



&lt;h2&gt;
  
  
  Connecting to Redpanda
&lt;/h2&gt;

&lt;p&gt;Flink SQL does not ship with a specific connector for Redpanda. However, given Redpanda’s strong wire &lt;a href="https://redpanda.com/blog/codegen/"&gt;compatibility&lt;/a&gt; with the Kafka protocol, the standard &lt;a href="https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/kafka/"&gt;Kafka connector&lt;/a&gt; works perfectly. As an example, we take an existing &lt;a href="https://flink.apache.org/2020/07/28/flink-sql-demo-building-e2e-streaming-application.html"&gt;Flink SQL demo&lt;/a&gt; that shows an end-to-end streaming application. The demo shows Flink SQL reading a stream from a Kafka topic, which is then processed via streaming SQL. The results are written to Elastic, which are then presented as dashboards using Kibana. We replaced Kafka with Redpanda while keeping the rest of the application intact. &lt;br&gt;
TL;DR: It just works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hb4vc79l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qzd5v5qheo2hqo3a7lbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hb4vc79l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qzd5v5qheo2hqo3a7lbm.png" alt="dashboard" width="880" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see for yourself, &lt;a href="https://flink.apache.org/2020/07/28/flink-sql-demo-building-e2e-streaming-application.html"&gt;go to the demo article&lt;/a&gt; at the Apache Flink project website. (Props to Flink PMC / Committer &lt;a href="https://twitter.com/JarkWu"&gt;@JarkWu&lt;/a&gt; for putting together this excellent demo.) The demo requires Docker and Docker Compose to bring together the various components to run in your local environment.&lt;/p&gt;

&lt;p&gt;You can follow the step-by-step instructions in the article, except for the initial step of grabbing the &lt;code&gt;docker-compose.yml&lt;/code&gt; file. We had to modify the file to substitute Redpanda for Kafka, so the first step should be to enter the following from your command line to get our version of &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir flink-sql-demo-redpanda; cd flink-sql-demo-redpanda;
wget https://raw.githubusercontent.com/patrickangeles/flink-sql-demo-redpanda/main/docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the demo walks you through establishing connectors to Redpanda and Elastic via SQL DDL statements, building streaming jobs via SQL DML statements, and wiring the data and visualizations together via Kibana. We won’t repeat the steps here; instead, we encourage you to follow the instructions exactly as described in &lt;a href="https://flink.apache.org/2020/07/28/flink-sql-demo-building-e2e-streaming-application.html"&gt;the original article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dissecting &lt;code&gt;docker-compose.yml&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;It’s worth going through the changes made to &lt;code&gt;docker-compose.yml&lt;/code&gt; in case you want to build your own Redpanda-powered projects using Docker Compose. We updated to a more current compose version (3.7), replaced the &lt;code&gt;kafka&lt;/code&gt; and &lt;code&gt;zookeeper&lt;/code&gt; services with a &lt;code&gt;redpanda&lt;/code&gt; service, and updated the service dependency graph appropriately. The Redpanda service declaration looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redpanda:
    image: docker.vectorized.io/vectorized/redpanda:v21.8.1
    command:
      - redpanda start
      - --smp 1
      - --memory 512M
      - --overprovisioned
      - --node-id 0
      - --set redpanda.auto_create_topics_enabled=true
      - --kafka-addr INSIDE://0.0.0.0:9094,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr INSIDE://kafka:9094,OUTSIDE://localhost:9092
    hostname: kafka
    ports:
      - "9092:9092"
      - "9094:9094"
    volumes:
      - /var/lib/redpanda/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some of these parameters are worth mentioning, especially if you’re new to Redpanda. For one thing, Redpanda follows a &lt;a href="https://redpanda.com/blog/tpc-buffers/"&gt;thread-per-core model&lt;/a&gt; and likes to consume all available resources in the host environment when permitted. This is great for production deployments, but not ideal when you are prototyping on your laptop. The first three parameters mentioned below are startup flags that tell Redpanda to play nice with other processes in a shared host or VM.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--smp 1&lt;/code&gt; Limits Redpanda to only use one logical core.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--memory 512M&lt;/code&gt; Limits Redpanda to 512M of memory. Alternatively, you can specify &lt;code&gt;--reserve-memory N&lt;/code&gt;, which lets Redpanda grab all available memory while reserving N for the OS and other processes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--overprovisioned&lt;/code&gt; Indicates to Redpanda that there are other processes running on the host. Redpanda will not pin its threads or memory, and will reduce the amount of polling to lower CPU utilization.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--node-id 0&lt;/code&gt; This is a required parameter. Every broker in Redpanda is identified by a node-id that survives restarts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--set redpanda.auto_create_topics_enabled=true&lt;/code&gt; Equivalent to setting &lt;code&gt;auto.create.topics.enable=true&lt;/code&gt; in Kafka.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hostname: kafka&lt;/code&gt; In Docker Compose, the default hostname is based on the service name. We override this with &lt;code&gt;hostname: kafka&lt;/code&gt;, so we can stay compatible with the connector declaration from the original demo script.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;volumes: /var/lib/redpanda/data&lt;/code&gt; This tells Docker Compose to make a volume available for that path, which is the default Redpanda data directory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Docker and Docker Compose are great for developer productivity as they allow for quick assembly of different application components. Developers can build rapid prototypes of end-to-end applications all within their local environment. In this article, we showed how to retrofit an existing application prototype using Kafka (as well as Flink, MySQL, Elastic, Kibana) with Redpanda. Using Redpanda containers in lieu of Kafka and ZooKeeper for your streaming stack has some nice benefits, including faster startup times and more efficient resource consumption.&lt;/p&gt;

&lt;p&gt;We believe that this way of prototyping is conducive to building new streaming applications. In particular, Redpanda for event sourcing and Flink SQL for stream processing is a powerful, easy-to-use combination. The upcoming Redpanda Data Policies feature will allow for outbound data transformation via WASM. Eventually, we can use this to implement capabilities like predicate and projection push-down, which have the potential to speed up basic streaming operations by reducing the amount of data that goes from Redpanda to your stream processors.&lt;/p&gt;

&lt;p&gt;In the future, we want to provide more prototype examples of Redpanda with Flink SQL, and also explore Redpanda in combination with other streaming engines. Our goal is to make Redpanda the best persistence layer for streaming. Watch this space!&lt;/p&gt;

&lt;p&gt;If you have any questions about this example project or Redpanda in general, you can interact with our team &lt;a href="https://github.com/redpanda-data/redpanda"&gt;on GitHub&lt;/a&gt; or by joining &lt;a href="https://redpanda.com/slack"&gt;our Community on Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>beginners</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to join multiple data streams with Kafka Streams and Redpanda</title>
      <dc:creator>The Team @ Redpanda</dc:creator>
      <pubDate>Wed, 06 Jul 2022 21:11:43 +0000</pubDate>
      <link>https://dev.to/redpanda-data/how-to-join-multiple-data-streams-with-kafka-streams-and-redpanda-18pi</link>
      <guid>https://dev.to/redpanda-data/how-to-join-multiple-data-streams-with-kafka-streams-and-redpanda-18pi</guid>
      <description>&lt;p&gt;Apache Kafka Streams® (KStreams) is a client library and Java dependency that performs stream processing, or real-time processing of incoming streams of data, for smoother builds of applications and microservices. Common use cases for KStreams include aggregating the occurrence of certain words in a chat application, or filtering fraudulent transactions in a credit card processing system.&lt;/p&gt;

&lt;p&gt;Message brokers—which act as a buffer for events produced by different services—are a vital component in the development of distributed applications. For the mission-critical systems serviced by KStreams, message brokers with minimal latencies are the best option. Redpanda is an Apache Kafka® API-compatible system that integrates with the Kafka ecosystem, is easy to manage, and can deliver significant performance improvements over other systems. &lt;a href="https://redpanda.com/blog/fast-and-safe/" rel="noopener noreferrer"&gt;Check this benchmark&lt;/a&gt; to learn more.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will learn how to aggregate multiple event streams using a KStreams application and Redpanda as the message broker. You’ll set up a KStreams application but use Redpanda instead of Kafka as the message broker without any code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;In order to follow this tutorial, you need the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access to &lt;a href="https://www.docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; and Docker Compose&lt;/li&gt;
&lt;li&gt;An appropriate IDE like &lt;a href="https://www.jetbrains.com/idea/download/" rel="noopener noreferrer"&gt;IntelliJ IDEA&lt;/a&gt; or &lt;a href="https://www.eclipse.org/downloads/" rel="noopener noreferrer"&gt;Eclipse&lt;/a&gt; to run Java applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the demo application in &lt;a href="https://github.com/redpanda-data-blog/2022-aggregation-with-kstreams" rel="noopener noreferrer"&gt;this GitHub repo&lt;/a&gt;. To follow the tutorial, you can clone the project with the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/redpanda-data-blog/2022-aggregation-with-kstreams.git
cd 2022-aggregation-with-kstreams
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Overview of a KStreams application
&lt;/h2&gt;

&lt;p&gt;KStreams is a Java dependency added to a Java application like a Spring Boot backend. When it connects to a Kafka cluster, it takes input streams from a Kafka topic, transforms them, and sends the output as a stream to a different topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj0rn19vwxqvjgr1xd6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj0rn19vwxqvjgr1xd6k.png" alt="kstreams streaming architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A KStreams application is an important component of a distributed system because of how data is processed within such systems. Without a stream processing layer, each consumer that wants the data in a particular shape would have to perform the transformation itself. KStreams ensures that stream processing happens in one place, only once. For a KStreams application to function properly, it requires a Kafka cluster with topics that serve as input and output stream sources.&lt;/p&gt;

&lt;p&gt;A typical stream processing application contains several methods, such as &lt;code&gt;Map&lt;/code&gt;, &lt;code&gt;MapValues&lt;/code&gt;, &lt;code&gt;Filter&lt;/code&gt;, and &lt;code&gt;Join&lt;/code&gt;, that transform or exclude data in one way or another. KStreams implements these methods in its domain-specific language (DSL). The DSL gives you all the tools required to perform stream processing, so it integrates easily with your Java application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5y70wa7jjjsyeik2tx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5y70wa7jjjsyeik2tx1.png" alt="combining multiple streams"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand how a KStreams application works, navigate to the cloned KStreams demo application and open the &lt;code&gt;/src/main/java/com/application/WordCountDemo.java&lt;/code&gt; file. In this file, you will find a &lt;code&gt;createWordCountStream&lt;/code&gt; method. This method contains the main logic of the application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;static void createWordCountStream(final StreamsBuilder builder) {
        final KStream&amp;lt;String, String&amp;gt; source = builder.stream(INPUT_TOPIC);

        final KTable&amp;lt;String, Long&amp;gt; counts = source
                .flatMapValues(value -&amp;gt; Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
                .groupBy((key, value) -&amp;gt; value)
                .count();

        // need to override value serde to Long type
        counts.toStream().to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Serdes.Long()));
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The method uses a StreamsBuilder object passed to it to create a KStream called &lt;code&gt;source&lt;/code&gt;. The values in this KStream are then grouped and counted using a combination of &lt;code&gt;flatMapValues&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt;, and &lt;code&gt;count&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;source&lt;/code&gt; KStream object represents the stream of data from the &lt;code&gt;streams-plaintext-input&lt;/code&gt; topic, while the &lt;code&gt;counts&lt;/code&gt; KTable object represents the transformed data. These two objects are fundamental to every KStreams application.&lt;/p&gt;

&lt;p&gt;A KStream and KTable are two sides of the same coin. A KStream is a continuous stream of data in which every new event is recorded as a unique piece of data. As an example, consider the following events sent through a KStream: &lt;code&gt;('eggs', 34)&lt;/code&gt;, &lt;code&gt;('bread', 10)&lt;/code&gt;, &lt;code&gt;('bread', 8)&lt;/code&gt;. These events are composed as key-value pairs in which the string is the key and the number is the value. The second event will not be affected by the third in a KStream even if they have the same key.&lt;/p&gt;

&lt;p&gt;A KTable, on the other hand, updates events that share a key. If the same three events, &lt;code&gt;('eggs', 34)&lt;/code&gt;, &lt;code&gt;('bread', 10)&lt;/code&gt;, and &lt;code&gt;('bread', 8)&lt;/code&gt;, are sent to a KTable, it resolves them into &lt;code&gt;('eggs', 34)&lt;/code&gt; and &lt;code&gt;('bread', 8)&lt;/code&gt;. Much like a relational database table, it keeps only the latest value for each key. A KTable can also be converted to a KStream, and a KStream to a KTable: think of a KTable as the latest values of a KStream, and a KStream as the change history of a KTable. This duality is the basis of aggregation and data joining in a KStreams application.&lt;/p&gt;
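
&lt;p&gt;The upsert behavior of a KTable can be sketched with a plain map. Again, this is a hypothetical illustration using only &lt;code&gt;java.util&lt;/code&gt;, not the KStreams API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.*;

// Events arriving on the stream, in order:
List&amp;lt;Map.Entry&amp;lt;String, Integer&amp;gt;&amp;gt; events = List.of(
        Map.entry("eggs", 34), Map.entry("bread", 10), Map.entry("bread", 8));

// A KTable keeps only the latest value per key, like a database table:
Map&amp;lt;String, Integer&amp;gt; table = new LinkedHashMap&amp;lt;&amp;gt;();
for (Map.Entry&amp;lt;String, Integer&amp;gt; event : events) {
    table.put(event.getKey(), event.getValue()); // later events overwrite earlier ones
}
// table = {eggs=34, bread=8}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;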

&lt;h2&gt;
  
  
  Setting up Redpanda
&lt;/h2&gt;

&lt;p&gt;You will be using the Redpanda Docker image to run examples in this tutorial. To ensure that you can start Redpanda with a single command, you will use a &lt;code&gt;docker-compose.yml&lt;/code&gt; file to define the Redpanda configuration. Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file in a directory of your choice, then add the following content to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3.7"
services:
  redpanda:
    command:
      - redpanda
      - start
      - --smp
      - "1"
      - --memory
      - 1G
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - "0"
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
    # NOTE: Please use the latest version here!
    image: docker.vectorized.io/vectorized/redpanda:v21.11.11
    container_name: redpanda-1
    ports:
      - 9092:9092
      - 29092:29092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this file in detached mode by running &lt;code&gt;docker compose up -d&lt;/code&gt; in the directory where the file lives. You should get an output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 1/1
 - Container redpanda-1  Started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In event streaming architectures, topics live in the message brokers. These topics store the messages or events sent by a producer. A stream is a continuous flow of data to or from the topic. Use the below command to create the topics needed in this KStreams application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic create streams-plaintext-input streams-wordcount-output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you need to set up the producer to write to the input topic and a consumer to read from the output topic. Redpanda offers an easy way to set up a producer without adding command line arguments. Run the below command to create the producer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic produce streams-plaintext-input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting up the consumer follows a similar approach. Open a new terminal and run the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic consume streams-wordcount-output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now start the KStreams application in your IDE. In the terminal running the Redpanda &lt;code&gt;streams-plaintext-input&lt;/code&gt; producer, type in the sentence &lt;code&gt;all streams lead to kafka&lt;/code&gt;, then check the Redpanda &lt;code&gt;streams-wordcount-output&lt;/code&gt; terminal. Each record's &lt;code&gt;value&lt;/code&gt; is the word count serialized as an 8-byte &lt;code&gt;Long&lt;/code&gt;. Your output should look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "streams-wordcount-output",
  "key": "all",
  "value": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0003",
  "timestamp": 1646660088785,
  "partition": 0,
  "offset": 1
}
{
  "topic": "streams-wordcount-output",
  "key": "streams",
  "value": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0003",
  "timestamp": 1646660088785,
  "partition": 0,
  "offset": 2
}
{
  "topic": "streams-wordcount-output",
  "key": "lead",
  "value": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0003",
  "timestamp": 1646660088785,
  "partition": 0,
  "offset": 3
}
{
  "topic": "streams-wordcount-output",
  "key": "to",
  "value": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0003",
  "timestamp": 1646660088785,
  "partition": 0,
  "offset": 4
}
{
  "topic": "streams-wordcount-output",
  "key": "kafka",
  "value": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004",
  "timestamp": 1646660088785,
  "partition": 0,
  "offset": 5
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Joining multiple streams in Redpanda
&lt;/h2&gt;

&lt;p&gt;A KStreams application is not limited to stream aggregation. It can also be used to join multiple streams together, as long as they have the same key. Typically a Kafka Streams application can join any of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One KStream to another KStream; produces a KStream&lt;/li&gt;
&lt;li&gt;One KStream to a KTable; produces a KTable&lt;/li&gt;
&lt;li&gt;One KTable to another KTable; produces a KTable&lt;/li&gt;
&lt;li&gt;A KStream to a GlobalKTable; produces a KStream&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Similar to a typical database, these join operations can be any of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A left join&lt;/li&gt;
&lt;li&gt;An inner join&lt;/li&gt;
&lt;li&gt;An outer join&lt;/li&gt;
&lt;/ol&gt;
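
&lt;p&gt;The difference between these join types can be sketched with two keyed maps. This is a hypothetical plain-Java snippet; real KStreams joins additionally handle serdes, time windows, and co-partitioning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.*;

// Orders keyed by user ID; the order with key "9" has no matching profile.
Map&amp;lt;String, String&amp;gt; orders = Map.of("1", "order-33", "9", "order-1005");
Map&amp;lt;String, String&amp;gt; users  = Map.of("1", "John Wick", "2", "Malik Gruder");

Map&amp;lt;String, String&amp;gt; innerJoin = new HashMap&amp;lt;&amp;gt;();
Map&amp;lt;String, String&amp;gt; leftJoin  = new HashMap&amp;lt;&amp;gt;();
for (Map.Entry&amp;lt;String, String&amp;gt; order : orders.entrySet()) {
    String user = users.get(order.getKey());
    if (user != null) {
        innerJoin.put(order.getKey(), order.getValue() + "/" + user); // inner: both sides must match
    }
    leftJoin.put(order.getKey(), order.getValue() + "/" + user);      // left: unmatched right side is null
}
// innerJoin = {1=order-33/John Wick}
// leftJoin  = {1=order-33/John Wick, 9=order-1005/null}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;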

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa95almjk5b7k2ihoqi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa95almjk5b7k2ihoqi7.png" alt="kstreams joins"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this section, you will create an enhanced orders stream formed by joining user and order streams. It follows a similar architecture to the previous example, but in this case, you will join the &lt;code&gt;orders&lt;/code&gt; stream with the &lt;code&gt;userProfiles&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt37nq9bfvdc4s02j6ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt37nq9bfvdc4s02j6ws.png" alt="joining user and order streams"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get started, create new topics for &lt;code&gt;userProfiles&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, and &lt;code&gt;enhancedOrders&lt;/code&gt; using the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic create userProfiles orders enhancedOrders
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code that handles the joining can be found in &lt;code&gt;src/main/java/com/application/EnhancedOrdersApplication.java&lt;/code&gt; in the cloned repo. Find the part where the &lt;code&gt;userProfiles&lt;/code&gt; table and &lt;code&gt;orders&lt;/code&gt; stream are initialized. A simple join is achieved by concatenating each order with its matching user profile. The joined data is then streamed to the output topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final KTable&amp;lt;String, String&amp;gt; userProfiles = builder.table("userProfiles");
final KStream&amp;lt;String, String&amp;gt; orders = builder.stream("orders");

KStream&amp;lt;String, String&amp;gt; joined = orders.join(userProfiles,
        (order, userProfile) -&amp;gt; order + userProfile
);
joined.to("enhancedOrders");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, run the application in IntelliJ. Open three terminals and run the commands below for the two producers and the consumer. The messages will be streamed via stdin. The producers set the delimiter to a newline (&lt;code&gt;\n&lt;/code&gt;) and the key-value separator to a colon (&lt;code&gt;:&lt;/code&gt;), so each line you type becomes a message with a defined key and value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it redpanda-1 \
rpk topic produce userProfiles -f "%k:%v\n"

docker exec -it redpanda-1 \
rpk topic produce orders -f "%k:%v\n"

docker exec -it redpanda-1 \
rpk topic consume enhancedOrders
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the first terminal, produce the following data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1:{"id":"1", "email":"john.wick@gmail.com", "first_name":"John", "last_name":"Wick"}
2:{"id":"2", "email":"malik.gruder@gmail.com", "first_name":"Malik", "last_name":"Gruder"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Produce data to the orders topic in the second terminal using the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1:{"id":"1", "product_id":"33","user_id":"1"}
2:{"id":"2", "product_id":"75","user_id":"2"}
1:{"id":"3", "product_id":"1005","user_id":"1"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now observe the third terminal for the output of the &lt;code&gt;enhancedOrders&lt;/code&gt; stream. You should get data similar to what’s shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "topic": "enhancedOrders",
  "key": "1",
  "value": "{\"id\":\"1\", \"product_id\":\"33\",\"user_id\":\"1\"}{\"id\":\"1\", \"email\":\"john.wick@gmail.com\", \"first_name\":\"John\", \"last_name\":\"Wick\"}",
  "timestamp": 1649141650456,
  "partition": 0,
  "offset": 0
}
{
  "topic": "enhancedOrders",
  "key": "2",
  "value": "{\"id\":\"2\", \"product_id\":\"75\",\"user_id\":\"2\"}{\"id\":\"2\", \"email\":\"malik.gruder@gmail.com\", \"first_name\":\"Malik\", \"last_name\":\"Gruder\"}",
  "timestamp": 1649141650458,
  "partition": 0,
  "offset": 1
}
{
  "topic": "enhancedOrders",
  "key": "1",
  "value": "{\"id\":\"3\", \"product_id\":\"1005\",\"user_id\":\"1\"}{\"id\":\"1\", \"email\":\"john.wick@gmail.com\", \"first_name\":\"John\", \"last_name\":\"Wick\"}",
  "timestamp": 1649141650461,
  "partition": 0,
  "offset": 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll notice that the second output contains the second user’s details, while the first and third outputs contain the first user’s details. This matches the expected result from the join diagram above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Distributed applications that use message brokers often need a way to process streams of data for easier consumption and less computational overhead in consumer applications. KStreams serves as an excellent stream-processing tool for this purpose. While KStreams was built with Kafka in mind, it works with other Kafka API-compatible systems, too.&lt;/p&gt;

&lt;p&gt;As you saw in this tutorial, your KStreams application works well using Redpanda, which is a drop-in replacement for Kafka, without Zookeeper®, and without a JVM. This enables you to increase your application’s message-processing ability with minimal effort.&lt;/p&gt;

&lt;p&gt;To check your work, use the demo application in &lt;a href="https://github.com/redpanda-data-blog/2022-aggregation-with-kstreams" rel="noopener noreferrer"&gt;this GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also discuss the demo, KStreams, or ask questions about anything else you can do with Redpanda in &lt;a href="https://redpanda.com/slack" rel="noopener noreferrer"&gt;the Redpanda Slack community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>programming</category>
      <category>kafka</category>
      <category>redpanda</category>
    </item>
  </channel>
</rss>
