DEV Community: Elvis David

How to query Postgres from Cloudflare Workers with Neon serverless driver

Elvis David — Thu, 15 Feb 2024 16:24:25 +0000

Cloudflare Workers is a serverless platform that allows developers to build and deploy applications on the edge. By leveraging Cloudflare Workers, you can execute your code in close proximity to your users, reducing latency and improving performance. However, one challenge developers face when building serverless applications is accessing databases. This article will explore how to query a PostgreSQL database from Cloudflare Workers using the Neon serverless driver.

The Neon serverless driver is a PostgreSQL client library specifically designed for serverless environments, such as Cloudflare Workers. It provides a lightweight and efficient way to connect to a PostgreSQL database within your Cloudflare Workers code. The driver is built on top of the popular node-postgres library, but with some optimizations and modifications to make it work better in serverless environments.

Prerequisites

Before getting started, you’ll need to make sure that you have the following:

A Cloudflare account.
A Neon Cloud account.
npm installed, along with its package runner npx.
A PostgreSQL database.

Here is the link to the GitHub project repository to make it easy to follow along.

Create a Neon project

To create your first Neon project, follow these steps:

Sign in to your Neon account.
Once signed in, you will be redirected to the Neon console.

On the console, you will see a project creation dialog. This dialog allows us to specify the following details for our project:

- **Project name**: Choose a descriptive name for your project.
- **Postgres version**: Select the desired version of PostgreSQL.
- **Database name**: Provide a name for your database.
- **Region**: Choose the region where your project is to be hosted.

After you create a project, you are redirected to the Neon console, and a pop-up modal with a connection string appears. Copy the connection string; you’ll use it in the next step.

The connection string is used to connect to your project's database, which is automatically created when you create a new project.

Loading demo data into the database

For demonstration purposes, you’ll populate the database with sample product data for querying.
To load the data into your database, you must connect to it using the connection string. Open a terminal window and run the following command:



    psql 'postgresql://xxx:xxxx@ep-xxx-xxx-29168360.us-east-2.aws.neon.tech/xxx?sslmode=require'

This command connects to the database using the provided connection string.

Once you're connected to the database, you can load the demo data by following the steps below:

Create a folder named sql in your project directory.
Download the products.sql from here.
Move the products.sql file into the "sql" folder.

Once you have set up the sql folder and added the products.sql file, quit psql (\q) so you are back at your shell prompt, change into the sql folder, and run the following command to load the demo data into your database.



    psql 'postgresql://xxx:xxxx@ep-xxx-xxx-29168360.us-east-2.aws.neon.tech/xxx?sslmode=require' < products.sql

This command loads the products.sql file into the database. The products.sql file contains SQL statements that create a table and insert demo data into the table.

After running the command, you should see some output in your terminal indicating that the demo data has been loaded into the database. The lines CREATE SEQUENCE, CREATE TABLE, and INSERT 0 5 indicate that the sequence was created, the table was created, and 5 rows were inserted into the table, respectively.

Creating a worker application

First, use the create-cloudflare CLI to create a new Worker application. To do this, open a terminal window and run the following command:



    npx wrangler init neon-query-db

This will prompt you to install the create-cloudflare package and lead you through a setup wizard.

To continue with this guide, follow these steps:

When prompted, provide a name for your new Worker application.
Select "Hello World" Worker as the type of application.
Choose "Yes" to use TypeScript.
Select "No" when asked if you want to deploy your application.

If you choose to deploy your application, you may be asked to authenticate if you're not logged in already. Your project will then be deployed. Even if you deploy, you can modify your Worker code and redeploy it at the end of this tutorial.

The above command creates a new directory named neon-query-db containing the scaffolding for a new Cloudflare Worker project. Navigate into this directory by running cd neon-query-db.

After running npx wrangler init in your project directory, the following files have been generated:

wrangler.toml: This file is your Wrangler configuration file.
src/index.ts: It contains a minimal Hello World Worker written in TypeScript.
package.json: This file is a minimal Node dependencies configuration file. It will be generated if you select "Yes" when prompted during the wrangler init command.
tsconfig.json: This file is the TypeScript configuration that includes Worker types. It is only generated if specified in the wrangler init command.

For this tutorial, you only need to focus on the wrangler.toml and src/index.ts files. You don't need to edit the other files; they should be left as they are.

Installing the Neon package

To install the Neon package, run the following command in your terminal.



    npm install @neondatabase/serverless

Configuring connection to Postgres database

There are two methods to connect to your PostgreSQL database: using a connection string or setting explicit parameters.

Use a connection string

Store the connection string as a secret for your Worker by running the following command:



    npx wrangler secret put DATABASE_URL

Paste in your connection string when prompted (you’ll find this in your Neon dashboard).

Set explicit parameters

Configure each database parameter as an environment variable via the Cloudflare dashboard or in your wrangler.toml file.

The example below shows how to configure the parameters in the wrangler.toml file.



    # wrangler.toml file

    [vars]
    DB_USERNAME = "postgres"
    # Set your password by creating a secret so it is not stored as plain text
    DB_HOST = "ep-aged-sound-175961.us-east-2.aws.neon.tech"
    DB_PORT = "5432"
    DB_NAME = "products"

To set a secret for your Worker, use npx wrangler secret put DB_PASSWORD:



    npx wrangler secret put DB_PASSWORD

After executing the command, you’ll be prompted to enter the value of DB_PASSWORD into the terminal to securely store the database password as a secret in your Cloudflare Workers environment variables.

Connecting to the Postgres database in the worker

Next, import the Client class into your Worker's main file from the @neondatabase/serverless package. Depending on your chosen connection method, use either the connection string or explicit parameters to establish a connection to the PostgreSQL database.



    // src/index.ts file

    import { Client } from '@neondatabase/serverless';
    interface Env { DATABASE_URL: string; }

In the fetch event handler, connect to the PostgreSQL database using your chosen method, either the connection string or the explicit parameters.

Using a connection string



    const client = new Client(env.DATABASE_URL);
    await client.connect();

Setting explicit parameters



    const client = new Client({
      user: env.DB_USERNAME,
      password: env.DB_PASSWORD,
      host: env.DB_HOST,
      port: env.DB_PORT,
      database: env.DB_NAME
    });
    await client.connect();
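
If you go with explicit parameters, the Env interface declared earlier also needs a field for each binding. The sketch below is one way to do that, assuming the variable names from the wrangler.toml example above. ExplicitEnv, DbClientConfig, and toClientConfig are hypothetical names introduced here for illustration and are not part of the Neon driver. The helper also converts DB_PORT to a number, since a quoted value under [vars] reaches the Worker as a string.

```typescript
// Hypothetical Env shape for the explicit-parameter approach. The field names
// mirror the [vars] entries and the DB_PASSWORD secret configured above.
interface ExplicitEnv {
  DB_USERNAME: string;
  DB_PASSWORD: string;
  DB_HOST: string;
  DB_PORT: string; // quoted values in wrangler.toml arrive as strings
  DB_NAME: string;
}

// Same keys as the object passed to `new Client({...})` above.
interface DbClientConfig {
  user: string;
  password: string;
  host: string;
  port: number;
  database: string;
}

// Hypothetical helper: validates the port and builds the config object.
function toClientConfig(env: ExplicitEnv): DbClientConfig {
  const port = parseInt(env.DB_PORT, 10);
  if (isNaN(port)) {
    throw new Error(`DB_PORT is not a number: ${env.DB_PORT}`);
  }
  return {
    user: env.DB_USERNAME,
    password: env.DB_PASSWORD,
    host: env.DB_HOST,
    port,
    database: env.DB_NAME,
  };
}
```

With a helper like this, the client could be created as new Client(toClientConfig(env)), and a misconfigured port fails with a clear message instead of a connection error.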

Querying the database

To demonstrate how to interact with the products database, let’s fetch data from the products table by querying the table when a request is received.

To fetch data from the products table, replace the default export in your index.ts file with the following code:



    export default {
      async fetch(request: Request, env: Env, ctx: ExecutionContext) {
        const client = new Client(env.DATABASE_URL);
        await client.connect();
        const { rows } = await client.query('select * from products;');
        ctx.waitUntil(client.end());  // this doesn't hold up the response
        return new Response(JSON.stringify(rows), {
          headers: { "content-type": "application/json"},
        });
      }
    }
This code snippet does the following:


- Creates a client from the `DATABASE_URL` secret and connects to the database.
- Runs a SELECT SQL query that fetches all rows from the "products" table.
- Closes the connection in the background with `ctx.waitUntil`, so it does not delay the response.
- Returns all rows as a JSON response.

When you send a request to your Worker’s URL, the Worker will fetch all rows from the products table and return them as JSON.

## Deploying your Worker

Run the following command to deploy your Worker:

```shell
    npx wrangler deploy
```

![Screenshot](https://paper-attachments.dropboxusercontent.com/s_E5DFD9AAD2B93DCFCEDE3C785055C7D8015AC8E87BD28019E97637CA43906324_1702038762591_Screenshot+from+2023-12-08+15-31-41.png)


Your application is now live and accessible at `<YOUR_WORKER>.<YOUR_SUBDOMAIN>.workers.dev`.
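
As written, the fetch handler above runs the query for every request, whatever the method or path. If you would rather answer only GET requests to `/products`, a check along these lines could sit at the top of `fetch`. This is a minimal sketch: `routeFor` and the `RouteResult` type are hypothetical helpers introduced here, and the 404 and 405 status codes are a conventional choice rather than something the handler above does.

```typescript
// Hypothetical routing outcome for the products endpoint.
type RouteResult =
  | { kind: "products" }
  | { kind: "rejected"; status: number; message: string };

// Hypothetical helper: decides whether a request should reach the database.
// `pathname` is expected to be the path part of the URL, e.g. "/products".
function routeFor(method: string, pathname: string): RouteResult {
  if (pathname !== "/products") {
    return { kind: "rejected", status: 404, message: "Not Found" };
  }
  if (method.toUpperCase() !== "GET") {
    return { kind: "rejected", status: 405, message: "Method Not Allowed" };
  }
  return { kind: "products" };
}
```

Inside `fetch`, this could be called as `routeFor(request.method, new URL(request.url).pathname)`, returning `new Response(message, { status })` for a rejected route before any database connection is opened.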

After deploying, you can interact with your PostgreSQL products database using your Cloudflare Worker. Whenever a request is made to your Worker’s URL, it will fetch data from the `products` table and return it as a JSON response. You can modify the query as needed to retrieve the desired data from your product database.
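
One way to modify the query without concatenating request input into the SQL text is a parameterized query. node-postgres accepts a query object with `text` and `values` fields, and since the Neon driver is described above as built on node-postgres, the same shape should apply to its `Client`. The sketch below builds such an object for a row limit; `productsQuery`, `ParameterizedQuery`, and the cap of 100 rows are hypothetical choices made for illustration.

```typescript
// Shape of a node-postgres style query object: SQL text plus bound values.
interface ParameterizedQuery {
  text: string;
  values: number[];
}

// Hypothetical helper: builds a query that returns at most `limit` products.
// The limit is clamped to 1..100 (an arbitrary cap chosen for this sketch)
// and passed as the $1 parameter rather than spliced into the SQL string.
function productsQuery(limit: number): ParameterizedQuery {
  const requested = isFinite(limit) ? Math.floor(limit) : 1;
  const safeLimit = Math.min(Math.max(requested, 1), 100);
  return {
    text: "select * from products limit $1",
    values: [safeLimit],
  };
}
```

The result could then be passed to the client in place of the SQL string, for example `await client.query(productsQuery(10))`.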

![Screenshot](https://paper-attachments.dropboxusercontent.com/s_E5DFD9AAD2B93DCFCEDE3C785055C7D8015AC8E87BD28019E97637CA43906324_1702067189656_Screenshot+from+2023-12-08+23-26-05.png)


The product data is successfully returned from the database! You can confirm that your Cloudflare Worker successfully connected to the PostgreSQL database using the Neon serverless driver.

## Conclusion

In this article, you've explored how to query a PostgreSQL database from Cloudflare Workers using the Neon serverless driver. By leveraging this driver, you can efficiently connect to your PostgreSQL databases and retrieve data from within your serverless applications. This enables you to build powerful and performant applications that utilize Cloudflare's edge network.


## Resources
- [Build your first Worker](https://developers.cloudflare.com/workers/get-started/guide/?utm_source=hackmamba&utm_medium=blog&utm_id=HMBcommunity)
- [Neon serverless driver](https://neon.tech/docs/serverless/serverless-driver?ref=hm)

Scala for Data Engineering: Harnessing the Power of Functional Programming

Elvis David — Thu, 27 Jul 2023 21:40:10 +0000

You want to write a program that handles data. Which language should you choose?

Introduction

In the dynamic world of data engineering, where processing and managing vast amounts of data have become paramount, programming languages that offer flexibility, efficiency, and scalability are highly sought after. Enter Scala - a powerful and versatile language that has been gaining traction in the data science community for its exceptional capabilities.

When choosing a language for a program that handles data, you have several options. You might choose a dynamic language such as Python or R, or a more traditional object-oriented language such as Java.

In this post, we will explore how Scala differs from these languages and when it might make sense to use it.

Why Scala?

Scala, a statically typed programming language, has been steadily gaining popularity in the data engineering community due to a unique combination of features that make it well suited to data-intensive tasks.

In the next sections, we examine how Scala compares to other programming languages used in data science.

Static typing and type inference

Scala's static type system is remarkably versatile: a significant amount of information about a program's behavior can be encoded in its types, and the compiler checks it. This guarantees a certain level of correctness, which is especially valuable for rarely used code paths. In contrast, a dynamic language only reveals an error when the branch containing it is actually executed, so a bug in a rarely taken branch can persist for a long time.

One common criticism of statically typed object-oriented languages, like Java, is their verbosity. For instance, when initializing an instance of an Example class in Java with Example myInstance = new Example();, the class name is written twice: once to declare the compile-time type of the variable and once to construct the instance.

Scala, like many functional languages, uses type inference, which lets the compiler deduce the type of a variable from the value assigned to it, so the same declaration becomes val myInstance = new Example(). As a result, Scala code is more concise and readable without compromising type safety. Once the argument and return types of a function are specified, the compiler can infer the types of the variables within the function's body. This approach significantly streamlines development, making Scala a compelling choice for data engineers and programmers.

Scala encourages immutability

Scala promotes the adoption of immutable objects, making it effortless to define attributes as immutable.
For instance:

val amountSpent = 500

Additionally, the default collections in Scala are immutable, as demonstrated with the List:

val clientIds = List("123", "456") // List is immutable
clientIds(1) = "589" // Compile-time error

Embracing immutability eradicates a common source of bugs. By ensuring that certain objects cannot be changed once created, the number of potential bug locations is reduced. Instead of considering the object's lifetime, the focus narrows down to the constructor, leading to more robust and predictable code.

Scala and functional programs

Scala strongly encourages functional programming, which involves using higher-order functions to transform collections. As a programmer, you don't have to worry about the intricate details of iterating over the collection.
Let's take a look at an example of an "occurrencesOf" function in Scala:

def occurrencesOf[A](elem: A, collection: List[A]): List[Int] = {
  for {
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}

In this Scala code, we declare a new list, "collection.zipWithIndex," which contains pairs of elements and their respective indexes from the original collection. Then, by using a filter and a for-comprehension, we iterate over this collection, binding the "currentElem" variable to the current element and "index" to the index. We filter and return the indexes where "currentElem" is equal to "elem."

The equivalent Java code for the same functionality looks like this:

static <T> List<Integer> occurrencesOf(T elem, List<T> collection) {
  List<Integer> occurrences = new ArrayList<>();
  for (int i = 0; i < collection.size(); i++) {
    if (collection.get(i).equals(elem)) {
      occurrences.add(i);
    }
  }
  return occurrences;
}

In Java, we start by defining a mutable list to store occurrences as we find them. We then iterate over the collection using a counter and check each element to see if it matches "elem." If it does, we add its index to the list of occurrences. This Java code requires managing more moving parts, and the logic of the function is somewhat obscured by the iteration mechanism.

It's important to note that this comparison is not meant to criticize Java; in fact, Java 8 introduced functional constructs like lambda expressions and stream processing. However, it highlights the benefits of functional approaches in Scala, which minimize the potential for errors and improve code clarity, making it easier to work with collections and focus on the core logic of the function.

Null pointer uncertainty

In many scenarios, representing the possible absence of a value becomes necessary. For example, consider a case where we read a list of usernames from a CSV file, and some users have opted not to provide their email addresses. In Java, this absence of email information is often denoted by setting the reference to null, while in Python, None is used.

However, this approach can be risky as it does not explicitly encode the possibility of a value's absence. Determining whether an instance attribute can be null becomes cumbersome in larger programs, leading to potential issues if not handled carefully. Scala, inspired by functional languages, addresses this concern by introducing the Option[T] type to represent attributes that might be absent.

In Scala, we can achieve this by writing:

class User {
  ...
  val email: Option[Email]
  ...
}

By utilizing Option[Email], we clearly convey to programmers using the User class that email information may be absent. The compiler also becomes aware of this possibility, prompting us to handle the situation explicitly rather than risking null pointer exceptions at runtime.

By eliminating the use of null, we can achieve a higher level of provable correctness and mitigate null-related issues. In languages without Option[T], developers often resort to writing unit tests on the client code to ensure correct behavior when dealing with null attributes.

It's worth noting that similar functionality can be achieved in Java using external libraries like Google's Guava library or the Optional class in Java 8. However, the convention of using null to indicate the absence of a value has long been ingrained in Java. In contrast, Scala embraces Option[T], offering a more natural and safer way to handle optional values.

Easier parallelism

Developing programs that leverage parallel architectures presents significant challenges, but it is an essential aspect of tackling most data science problems. Parallel programming poses difficulties because our natural inclination as programmers is to think sequentially. Reasoning about the potential order of events in concurrent programs becomes complex.

Scala addresses these challenges by providing several abstractions that facilitate the creation of parallel code. These abstractions impose constraints on the approach to parallelism. For example, parallel collections require computations to be expressed as a sequence of operations, such as map, reduce, and filter, on collections. Actor systems encourage thinking in terms of encapsulated actors that communicate through message passing.

The restriction of the programmer's freedom to write parallel code in any way they desire may seem paradoxical. However, it actually simplifies understanding the program's behavior. For instance, if an actor misbehaves, the problem is either in the actor's code or one of the messages it receives.

To illustrate the power of coherent, restrictive abstractions, let's solve a probability problem using parallel collections in Scala. We aim to calculate the probability of getting at least 60 heads out of 100 coin tosses using a Monte Carlo simulation. By running the simulation repeatedly and aggregating the results, we can achieve this estimation with parallel collections, parallelizing the computation across multiple CPUs effortlessly.

While not all problems are as straightforward to parallelize as the Monte Carlo example, Scala's rich set of intuitive abstractions makes writing parallel applications more manageable, providing an effective way to leverage parallel architectures and improve the performance of data science tasks.

Interoperability with Java

Scala is built to run on the Java Virtual Machine (JVM), and its compiler translates Scala programs into Java bytecode. This compatibility allows Scala developers to seamlessly utilize Java libraries within their Scala code. Considering the vast number of Java applications, both open-source and in legacy systems, this interoperability between Scala and Java has significantly contributed to Scala's widespread adoption.

Moreover, the interoperability between Scala and Java is not limited to one direction. Some Scala libraries, like the Play framework, have gained popularity among Java developers as well, indicating the bidirectional nature of this compatibility. This mutual interaction between the two languages fosters a thriving ecosystem and encourages developers from both communities to explore and leverage each other's tools and frameworks.

When not to use Scala

When considering whether to use Scala for your next project, there are certain factors to take into account. While Scala's strong type system, preference for immutability, functional capabilities, and parallelism abstractions make it an excellent choice for writing reliable programs and minimizing unexpected behavior, there are some reasons why you might decide against it.

One crucial consideration is familiarity. Scala introduces various concepts, such as implicits, type classes, and composition using traits, which may not be familiar to programmers coming from an object-oriented background. Mastering Scala's expressive type system and harnessing its full power may require time and adapting to a new programming paradigm. Additionally, dealing with immutable data structures can feel unfamiliar to those coming from languages like Java or Python.

However, with time and effort, these drawbacks can be overcome. Nevertheless, Scala does fall short in terms of library availability compared to other data science languages. For data exploration, the IPython Notebook coupled with matplotlib remains unparalleled. Although there are ongoing efforts to provide similar functionality in Scala (like Spark Notebooks or Apache Zeppelin), these projects may not have reached the same level of maturity.

Considering the above, in the author's biased opinion, Scala shines when used for more permanent programs. If you're writing a quick throwaway script or primarily focusing on data exploration, you might find Python better suited for the task. However, for projects that require reusability and a higher level of provable correctness, Scala proves to be an extremely powerful and beneficial choice. Ultimately, the decision on whether to use Scala will depend on your project's specific requirements and your team's familiarity with the language and its ecosystem.

Conclusion

In conclusion, Scala presents a compelling option for developers seeking a robust and functional language, offering the potential to build cutting-edge applications and tackle complex data science challenges. Whether you choose Scala for its expressiveness and parallel processing capabilities or opt for another language based on familiarity and immediate project needs, making an informed decision will ultimately lead to successful and impactful software development endeavors.

Real-time data ingestion to Databricks Delta Lake using Redpanda (Kafka Connect)

Elvis David — Thu, 27 Jul 2023 14:37:18 +0000

Introduction

Real-time data ingestion has become a critical requirement for many organizations seeking to extract value from their data in real time or near real time, and it increasingly sets companies apart in competitive markets. Yet research from Databricks found that a staggering 73% of a company's data goes unused for analytics and decision-making when stored in a data lake. This means that machine learning models trained without that data are more likely to return inaccurate results and perform poorly in real-world scenarios, among other implications.

Delta Lake is a game changer for big data. Developed as an open source storage layer, it provides an abstraction layer on top of existing data lakes and is optimized for Parquet-based storage. Databricks Delta Lake provides features such as:

Support for ACID transactions - ACID stands for Atomicity, Consistency, Isolation, and Durability. All transactions made on Delta Lake tables are ACID compliant.
Schema enforcement - When writing data to storage, Delta Lake enforces the schema, which keeps columns and data types consistent and helps achieve reliable, high-quality data.
Scalable metadata handling - Delta Lake scales out metadata processing operations using compute engines like Apache Spark and Apache Hive, allowing it to process the metadata for petabytes of data efficiently.

Databricks Delta Lake integration can be used for a variety of use cases to store, manage, and analyze large volumes of incoming data. For example, in customer analytics, Delta Lake can store and process customer data for analysis, and it can also process IoT data thanks to its ability to handle large volumes of data and perform real-time data processing.

In this tutorial, you will learn how to do the following:

Create and configure Databricks Delta lakes
Set up and run a Redpanda cluster and create topics for Kafka Connect usage
Configure and run a Kafka Connect cluster for Redpanda and Databricks Delta lake integration

Deep dive into Databricks deployment architecture

A Databricks deployment is structured to provide a fully managed, scalable, reliable, and secure platform for data processing and analytics tasks. Its architecture is split into two main components, the control plane and the data plane. This split enables secure cross-functional team collaboration while keeping a significant amount of backend services managed by Databricks.

The control plane - It includes the backend services that Databricks manages within the cloud environment, including access control, user authentication, and resource management.
The data plane - It manages the storage and processing of data. It includes resources such as Databricks clusters, which are sets of virtual machines provisioned on demand to run notebooks and jobs, and Databricks Delta Lake, which is a storage layer that provides an abstraction layer on top of your data lake in the cloud.

The control plane and data plane work together to provide a smooth experience for the data teams.
The separation of the control plane and data plane makes it possible for Databricks to scale each component independently which ensures the deployment architecture remains scalable, reliable and performant even as workloads grow in size and complexity.

The overall Redpanda BYOC (bring your own cloud) deployment architecture follows the same structure as that of Databricks deployment architecture in the following ways:

Both Redpanda and Databricks architectures leverage the underlying cloud infrastructure to provide reliability, security and scalability for data processing and analytics tasks.
Just like Databricks, Redpanda BYOC has a control plane, which is responsible for managing Redpanda clusters (including provisioning and scaling), and a data plane, which is responsible for processing and storing data.

Redpanda BYOC and Databricks both provide an interactive and collaborative workspace for data teams to operate in, facilitating faster iterations and the delivery of value.

Prerequisites

You'll need the following for this tutorial:

Docker installed on your machine, preferably Docker Desktop if you’re on Windows/Mac (this article uses Docker 20.10.21)
A machine to install Redpanda and Kafka Connect (this tutorial uses Linux, but you can also run Redpanda on Docker, macOS, and Windows)
Python 3.10 or higher
Java version 8 or later (this tutorial uses Java 11)
Delta Lake 1.2.1

All of the code resources for this tutorial can be found in this repository.

Use case: Implementing data ingestion to Databricks Delta lakes with Redpanda (Kafka Connect)

In this part, we create a fictitious scenario to help you understand how to ingest data into Databricks Delta Lake using Kafka Connect with Redpanda. It is for demonstration purposes only.

Imagine that you work as a lead engineer at an e-commerce company called EasyShop, which sells products through its website. The company is growing rapidly and is expanding its operations into new markets.

To better understand customer behavior and preferences, the company wants to create a unified view of its sales and customer data, which includes user clicks, page views, and orders.
To achieve this, you decide to set up a data pipeline that ingests data in real time from the company website into Databricks Delta Lake.

The following diagram explains the high-level architecture:

Setting up Redpanda

In this article, we assume that you already have Redpanda installed via Docker. If you have not installed it, please refer to this documentation on how to install Redpanda via Docker.

To check that Redpanda is up and running, execute the command docker ps. You’ll see output like this if the service is running:

CONTAINER ID   IMAGE                 COMMAND                  CREATED          STATUS          PORTS                                                                     NAMES
bb2305256285   vectorized/redpanda   "/entrypoint.sh redp…"   16 minutes ago   Up 16 minutes   8081/tcp, 9092/tcp, 9644/tcp, 0.0.0.0:8082->8082/tcp, :::8082->8082/tcp   redpanda-1

You can also confirm that Redpanda is running by using the rpk CLI inside the container: execute docker exec -it redpanda-1 rpk cluster info, where redpanda-1 is the container name.
The output of the command should look similar to the following:

CLUSTER
=======
redpanda.36dda01b-4e8a-4949-bb38-04f83a13b009

BROKERS
=======
ID    HOST     PORT
0*    0.0.0.0  9092

Creating a Redpanda Topic

To create a topic in Redpanda, you can use rpk which is a CLI tool for connecting to and interacting with Redpanda brokers. Open a terminal and connect to Redpanda’s container session by executing the command docker exec -it redpanda-1 bash. You should see your terminal session connected now to the redpanda-1 container:

redpanda@bb2305256285:/$

Next, execute the command rpk topic create customer-data to create the topic customer-data. You can check the existence of the topic created with the command rpk topic list. This will list all the topics which are up and running and you should see an output like this:

NAME           PARTITIONS  REPLICAS
customer-data  1           1

Alternatively, you can verify the created topic by getting the cluster information with the command rpk cluster info.

You will see the following output:

CLUSTER
=======
redpanda.36dda01b-4e8a-4949-bb38-04f83a13b009

BROKERS
=======
ID    HOST     PORT
0*    0.0.0.0  9092

TOPICS
======
NAME           PARTITIONS  REPLICAS
customer-data  1           1

Developing the producer code for a scenario

Now that you have Redpanda running and a topic available in the redpanda-1 container to receive events, you can create an application that publishes messages to this topic.
This is a two-step process:

Set up a virtual environment and install the necessary dependencies.
Write producer code that generates sample customer data and publishes it to the customer-data topic.

Setting up virtual environment

Before writing the producer and consumer code, you have to set up your Python virtual environment and install the dependencies.

To begin, create a project directory named Real-time data ingestion to Databricks Delta Lake with Redpanda (Kafka Connect) on your machine.

Create a subdirectory named customer-data in the project directory.
The customer-data directory will hold the Python application that publishes the data.

While inside the customer-data subdirectory, run python3 -m venv venv to create a virtual environment for this demo project, then activate it (on Linux or macOS, with source venv/bin/activate).

In the same subdirectory, create a requirements.txt file and paste in the content from the GitHub demo repo.

To install the dependencies, run pip install -r requirements.txt.

Producing data

Create a file called redpanda_producer.py that has producer code that generates and inserts 10000 random JSON entries into the customer-data topic.

Paste this piece of code into the file:

from kafka import KafkaProducer, KafkaAdminClient
from kafka.admin import NewTopic
from time import sleep
from faker import Faker
import faker_commerce
from faker.providers import internet
import json
import os
from dotenv import load_dotenv


def main():
    topic_name = os.environ.get('REDPANDA_TOPIC')
    broker_ip = os.environ.get('REDPANDA_BROKER_IP')
    types_of_categories = ['clothing', 'electronics', 'home & kitchen', 'beauty & personal care', 'toys & games']
    fake = Faker()
    fake.add_provider(faker_commerce.Provider)
    fake.add_provider(internet)
    try:
        # Create Kafka topic
        topic = NewTopic(name=topic_name, num_partitions=1, replication_factor=1)
        admin = KafkaAdminClient(bootstrap_servers=broker_ip)
        admin.create_topics([topic])
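    except Exception as exc:
        # Creation fails when the topic already exists, but also on connection errors
        print(f"Could not create topic {topic_name} (it may already exist): {exc}")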

    producer = KafkaProducer(bootstrap_servers=[broker_ip],
                             value_serializer=lambda m: json.dumps(m).encode('ascii'))
    for i in range(10000):
        fake_date = fake.date_this_month(True)
        product_id = fake.pyint(1, 10000)
        categories = fake.word(ext_word_list=types_of_categories)
        product_name = fake.product_name()
        name = fake.name()
        email_addr = fake.email()  # email() does not take a name; it generates a random address
        str_date = str(fake_date)
        units_sold = fake.pyint(1, 25)
        unit_price = fake.pyint(10, 500)
        country = fake.country()
        producer.send(topic_name, {'date': str_date, 'product_name': product_name, 'category': categories,
                                   'name': name, 'email': email_addr, 'units_sold': units_sold,
                                   'unit_price': unit_price, 'country': country})
        print("Inserted entry ", i, " to topic", topic_name)
        sleep(1)

    # Block until every buffered message has been delivered to the broker
    producer.flush()


if __name__ == "__main__":
    # Load REDPANDA_TOPIC and REDPANDA_BROKER_IP from the .env file
    load_dotenv()
    main()

To run the script, execute python redpanda_producer.py. If it runs successfully, you should see a confirmation line printed for each message produced to the topic.
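To make the producer's value_serializer less opaque, here is a minimal standalone sketch of the same transformation with a matching deserializer. The record is a hypothetical sample with the fields the producer sends. Note that json.dumps escapes non-ASCII characters by default, which is why encoding to ASCII does not fail on names such as Côte d'Ivoire.

```python
import json


def serialize(record: dict) -> bytes:
    # Same transformation as the producer's value_serializer
    return json.dumps(record).encode("ascii")


def deserialize(payload: bytes) -> dict:
    # Inverse operation, as a consumer of the topic would apply it
    return json.loads(payload.decode("ascii"))


# Hypothetical sample record with the fields the producer publishes
record = {
    "date": "2024-02-01",
    "product_name": "Wooden Chair",
    "category": "home & kitchen",
    "name": "Jane Doe",
    "email": "jane@example.com",
    "units_sold": 3,
    "unit_price": 120,
    "country": "Côte d'Ivoire",
}

payload = serialize(record)
print(payload)
print(deserialize(payload) == record)
```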

You can also check the output using Redpanda’s CLI tool, rpk, by running the following command:

rpk topic consume customer-data --brokers <IP:PORT>

You should see each message printed with its topic, value, timestamp, partition, and offset fields.
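The sketch below shows one way to read such a message in Python. It assumes that rpk prints each message as a JSON object and that the value field holds the producer's JSON as a string; the sample message itself is hypothetical.

```python
import json

# Hypothetical message, shaped like one object printed by rpk topic consume
raw_message = json.dumps({
    "topic": "customer-data",
    "value": json.dumps({"name": "Jane Doe", "units_sold": 3, "unit_price": 120}),
    "timestamp": 1650166626872,
    "partition": 0,
    "offset": 0,
})


def decode_message(raw: str) -> dict:
    """Decode the outer message, then the JSON string stored in its value field."""
    message = json.loads(raw)
    message["value"] = json.loads(message["value"])
    return message


message = decode_message(raw_message)
revenue = message["value"]["units_sold"] * message["value"]["unit_price"]
print(message["topic"], message["partition"], message["offset"], revenue)
```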

Configuring Databricks Delta Lake

In this section, you'll configure Delta Lake using a Spark session. For this, we will use the configure_delta-lake.py script from our GitHub demo repo.

Running the configure_delta-lake script will accomplish the following:

Generate a Spark session
Obtain the schema from the topic that receives the generated data
Create the Delta table

Let's look at each step in detail.

Using PySpark, you'll create a new Spark session with the correct packages and the additional Delta Lake dependencies needed to interact with Delta Lake. You can do this using the following code:

import os
from dotenv import load_dotenv
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


def get_spark_session():
    # Load .env so PYSPARK_SUBMIT_ARGS (the extra packages) is set before Spark starts
    load_dotenv()
    builder = SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

    # Add the Delta Lake package that matches the pip-installed Delta version
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    return spark

The packages are listed in the PYSPARK_SUBMIT_ARGS entry of the .env file:

REDPANDA_BROKER_IP=0.0.0.0:9092
REDPANDA_TOPIC=customer-data
PYSPARK_SUBMIT_ARGS='--packages io.delta:delta-core_2.12:1.0.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 pyspark-shell'
DELTA_LAKE_TABLE_DIR='/tmp/delta/customer-table'
DELTA_LAKE_CHECKOUT_DIR='/tmp/checkpoint/'
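Because os.environ.get returns None for a missing key, a typo in the .env file only shows up later as a confusing Spark or Kafka error. The following is a minimal sketch of an early check; the helper name load_settings and the choice of required keys are assumptions, not part of the demo repo.

```python
import os

# Keys read by the scripts in this tutorial (assumed to be required)
REQUIRED_KEYS = ["REDPANDA_BROKER_IP", "REDPANDA_TOPIC", "DELTA_LAKE_TABLE_DIR"]


def load_settings(environ=os.environ) -> dict:
    """Return the required settings, or fail with a message naming the missing keys."""
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise KeyError("Missing settings in .env: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}


# Example with an explicit mapping; in the scripts, call load_dotenv() first
# and then load_settings() with no arguments.
example_env = {
    "REDPANDA_BROKER_IP": "0.0.0.0:9092",
    "REDPANDA_TOPIC": "customer-data",
    "DELTA_LAKE_TABLE_DIR": "/tmp/delta/customer-table",
}
print(load_settings(example_env))
```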

Once you've set up the Spark session, you'll use it to obtain the schema of the customer-data topic.
You can find this logic in the get_schema() function inside the get_schema.py file.

import os
from dotenv import load_dotenv
from pyspark.sql.functions import col, expr


def get_schema(spark):
    load_dotenv()
    broker_ip = os.environ.get('REDPANDA_BROKER_IP')
    topic = os.environ.get('REDPANDA_TOPIC')
    df_json = (spark.read
                .format("kafka")
                .option("kafka.bootstrap.servers", broker_ip)
                .option("subscribe", topic)
                .option("startingOffsets", "earliest")
                .option("endingOffsets", "latest")
                .option("failOnDataLoss", "false")
                .load()
                # filter out empty values
                .withColumn("value", expr("string(value)"))
                .filter(col("value").isNotNull())
                # get latest version of each record
                .select("key", expr("struct(offset, value) r"))
                .groupBy("key").agg(expr("max(r) r")) 
                .select("r.value"))

    # decode the json values
    df_read = spark.read.json(df_json.rdd.map(lambda x: x.value), multiLine=True)
    return df_read.schema.json()
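The "latest version of each record" step in get_schema() can be hard to read in Spark syntax. As a rough plain-Python equivalent, assuming records arrive as (key, offset, value) tuples, taking the maximum of (offset, value) per key keeps the value with the highest offset. The function name is hypothetical.

```python
def latest_value_per_key(records):
    """Keep, for each key, the value that has the highest offset."""
    latest = {}
    for key, offset, value in records:
        if value is None:
            continue  # mirrors the filter on non-null values
        candidate = (offset, value)
        # Tuples compare field by field, like max() over struct(offset, value)
        if key not in latest or candidate > latest[key]:
            latest[key] = candidate
    return {key: value for key, (_, value) in latest.items()}


records = [
    ("a", 0, "v0"),
    ("a", 2, "v2"),
    ("b", 1, "v1"),
    ("b", 3, None),
]
print(latest_value_per_key(records))
```

One consequence worth knowing: the producer in this tutorial sends messages without a key, so every record falls into the same group and the schema is inferred from the most recent message only.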

Now that you have a Spark session running and the schema from the topic, you can create the Delta table. The paths are configurable and can be changed by editing the .env file.
You can use the following code to create a Delta table:

def create_delta_tables(spark, table_df):
    load_dotenv()
    table_path = os.environ.get('DELTA_LAKE_TABLE_DIR')
    # Write an empty DataFrame with the topic's schema to initialize the Delta table
    (spark
        .createDataFrame([], table_df.schema)
        .write
        .option("mergeSchema", "true")
        .format("delta")
        .mode("append")
        .save(table_path))

You've successfully configured Databricks Delta Lake and created a Delta table for the data to be ingested.

Setting up Kafka Connect

Kafka Connect is an integration tool released with the Apache Kafka® project. It’s scalable and flexible, and it provides reliable data streaming between Apache Kafka and external systems. You can use it to integrate with a wide range of systems, including databases, search indexes, and cloud storage providers. Because Redpanda is fully compatible with the Kafka API, Kafka Connect works with it as well.

Kafka Connect uses source and sink connectors for integration. Source connectors stream data from an external system to Kafka, while sink connectors stream from Kafka to an external system.

To install Kafka Connect, you have to download the Apache Kafka package. Navigate to the Apache downloads page for Kafka and click the suggested download link for the Kafka 3.1.0 binary package.

Run the following commands to create a folder called pandabooks_integration in your home directory and extract the Kafka binaries into this directory.

mkdir pandabooks_integration && \
mv ~/Downloads/kafka_2.13-3.1.0.tgz pandabooks_integration && \
cd pandabooks_integration && \
tar xzvf kafka_2.13-3.1.0.tgz

Configuring the connect cluster

Before running a Kafka Connect cluster, you have to set up a configuration file in the properties format.
Navigate to the pandabooks_integration directory and create a folder called configuration. Inside that folder, create a file called connect.properties and paste the following contents:

#Kafka broker addresses
bootstrap.servers=

#Cluster level converters
#These apply when the connectors don't define any converter
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

#JSON schemas are enabled at the cluster level; a connector can override this
key.converter.schemas.enable=true
value.converter.schemas.enable=true

#Where to keep the Connect topic offset configurations
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

#Plugin path to put the connector binaries
plugin.path=

Set the bootstrap.servers value to localhost:9092 to configure the Connect cluster to use the Redpanda cluster.

You also need to configure plugin.path, the directory where Kafka Connect looks for connector binaries.

Create a folder called plugins in the pandabooks_integration directory. Inside the plugins folder, create another folder called kafka-connect-delta-lake; this is where you will add the connector binaries.

Navigate to this web page and click Download to download the archived binaries. Unzip the file and copy the files in the lib folder into the kafka-connect-delta-lake folder in the plugins directory.

The final folder structure for pandabooks_integration should look like this:

pandabooks_integration
├── configuration
│   └── connect.properties
├── plugins
│   └── kafka-connect-delta-lake
└── kafka_2.13-3.1.0

You will need to change the plugin.path value to /home/_YOUR_USER_NAME_/pandabooks_integration/plugins. This tells the Connect cluster where to find the connector binaries.

Now, the final connect.properties file should look like this:

#Kafka broker addresses
bootstrap.servers=localhost:9092

#Cluster level converters
#These apply when the connectors don't define any converter
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

#JSON schemas are enabled at the cluster level; a connector can override this
key.converter.schemas.enable=true
value.converter.schemas.enable=true

#Where to keep the Connect topic offset configurations
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

#Plugin path to put the connector binaries
plugin.path=_YOUR_HOME_DIRECTORY_/pandabooks_integration/plugins
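If you would rather generate the two machine-specific values than edit them by hand, the following is a minimal sketch that fills in empty keys of a properties template. The function name fill_properties is hypothetical, the template is shortened for illustration, and the sketch assumes the project lives in pandabooks_integration under your home directory.

```python
from pathlib import Path


def fill_properties(text: str, values: dict) -> str:
    """Set key=value for every key in values, leaving all other lines untouched."""
    filled = []
    for line in text.splitlines():
        key, separator, _ = line.partition("=")
        is_comment = line.lstrip().startswith("#")
        if separator and not is_comment and key.strip() in values:
            line = f"{key.strip()}={values[key.strip()]}"
        filled.append(line)
    return "\n".join(filled) + "\n"


# Shortened template; the real file has more keys
template = """bootstrap.servers=
offset.flush.interval.ms=10000
plugin.path=
"""

project_dir = Path.home() / "pandabooks_integration"  # assumed location
print(fill_properties(template, {
    "bootstrap.servers": "localhost:9092",
    "plugin.path": project_dir / "plugins",
}))
```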

Configuring the Delta Lake Connector

You have now set up the connector plugin in the Kafka Connect cluster, but this is not enough to integrate with an external system. You also need to configure the sink connector, in this case the Delta Lake connector plugin, which enables Kafka Connect to write data directly to Delta Lake.

To do this, create a file named delta-lake-sink-connector.properties in the ~/pandabooks_integration/configuration directory and paste the following contents:

name=delta-lake-sink-connector

# Connector class
connector.class= io.delta.connect.kafka.DeltaLakeSinkConnector

# Format class
format.class=io.delta.standalone.kafka.DeltaInputFormat

# The key converter for this connector
key.converter=org.apache.kafka.connect.storage.StringConverter

# The value converter for this connector
value.converter=org.apache.kafka.connect.json.JsonConverter

# Identify, if value contains a schema.
# Required value converter is `org.apache.kafka.connect.json.JsonConverter`.
value.converter.schemas.enable=false

tasks.max=1

# Topic name to get data from
topics= customer-data

# Table to ingest data into
tableName = customer-table

key.ignore=true

schema.ignore=true
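The value.converter.schemas.enable=false setting matters because the producer in this tutorial sends plain JSON objects. With schemas enabled, JsonConverter instead expects every message to be wrapped in an envelope with schema and payload fields. The sketch below contrasts the two shapes; the record and the helper name has_schema_envelope are hypothetical.

```python
import json

record = {"name": "Jane Doe", "units_sold": 3}

# With schemas.enable=false, the message value is just the record itself
plain_value = json.dumps(record)

# With schemas.enable=true, JsonConverter expects a schema/payload envelope
enveloped_value = json.dumps({
    "schema": {
        "type": "struct",
        "optional": False,
        "fields": [
            {"field": "name", "type": "string", "optional": False},
            {"field": "units_sold", "type": "int32", "optional": False},
        ],
    },
    "payload": record,
})


def has_schema_envelope(value: str) -> bool:
    """Return True if the JSON value is wrapped in a schema/payload envelope."""
    decoded = json.loads(value)
    return isinstance(decoded, dict) and set(decoded) == {"schema", "payload"}


print(has_schema_envelope(plain_value), has_schema_envelope(enveloped_value))
```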

Remember to change the values of the following keys in the delta-lake-sink-connector.properties file to match your setup:

connector.class
topics
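Before starting the cluster, it can help to sanity-check the connector file. The following is a simplified sketch of a properties parser that reports missing keys. The list of required keys is an assumption based on the file above, and the parser ignores parts of the properties format such as line continuations.

```python
REQUIRED_KEYS = ("name", "connector.class", "topics", "tasks.max")


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            # Whitespace around the separator is not part of the key or value
            props[key.strip()] = value.strip()
    return props


def missing_keys(props: dict) -> list:
    return [key for key in REQUIRED_KEYS if not props.get(key)]


sample = """name=delta-lake-sink-connector
# Connector class
connector.class= io.delta.connect.kafka.DeltaLakeSinkConnector
tasks.max=1
topics= customer-data
tableName = customer-table
"""

props = parse_properties(sample)
print(props["topics"], missing_keys(props))
```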

Running the Kafka Connect cluster

Now run the Kafka Connect cluster with the configuration you have created. Open a new terminal, navigate to the _YOUR_HOME_DIRECTORY_/pandabooks_integration/configuration directory, and run the following command:

../kafka_2.13-3.1.0/bin/connect-standalone.sh connect.properties delta-lake-sink-connector.properties

If everything was done correctly, the output will look like this:

...output omitted...
 groupId=connect-delta-lake-sink-connector] Successfully joined group with generation Generation{generationId=25, memberId='connector-consumer-delta-lake-sink-connector-0-eb21795e-f3b3-4312-8ce9-46164a2cdb27', protocol='range'} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:595)
[2022-04-17 03:37:06,872] INFO [delta-lake-sink-connector|task-0] [Consumer clientId=connector-consumer-delta-lake-sink-connector-0, groupId=connect-delta-lake-sink-connector] Finished assignment for group at generation 25: {connector-consumer-delta-lake-sink-connector-0-eb21795e-f3b3-4312-8ce9-46164a2cdb27=Assignment(partitions=[customer-data-0])} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:652)
[2022-04-17 03:37:06,891] INFO [delta-lake-sink-connector|task-0] [Consumer clientId=connector-consumer-delta-lake-sink-connector-0, groupId=connect-delta-lake-sink-connector] Successfully synced group in generation Generation{generationId=25, memberId='connector-consumer-delta-lake-sink-connector-0-eb21795e-f3b3-4312-8ce9-46164a2cdb27', protocol='range'} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:761)
[2022-04-17 03:37:06,891] INFO [delta-lake-sink-connector|task-0] [Consumer clientId=connector-consumer-delta-lake-sink-connector-0, groupId=connect-delta-lake-sink-connector] Notifying assignor about the new Assignment(partitions=[customer-data-0]) (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:279)
[2022-04-17 03:37:06,893] INFO [delta-lake-sink-connector|task-0] [Consumer clientId=connector-consumer-delta-lake-sink-connector-0, groupId=connect-delta-lake-sink-connector] Adding newly assigned partitions: customer-data-0 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:291)
[2022-04-17 03:37:06,903] INFO [delta-lake-sink-connector|task-0] [Consumer clientId=connector-consumer-delta-lake-sink-connector-0, groupId=connect-delta-lake-sink-connector] Setting offset for partition customer-data-0 to the committed offset FetchPosition{offset=3250, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[localhost:9092 (id: 0 rack: null)], epoch=absent}} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:844)
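The two lines to look for in this output are the partition assignment and the committed offset. As a small sketch, assuming the log format shown above, the following extracts both from captured log text. The function name summarize_connect_log is hypothetical, and the sample lines are trimmed copies of the output above.

```python
import re

ASSIGNED = re.compile(r"Adding newly assigned partitions: (.+?) \(org\.")
COMMITTED = re.compile(
    r"Setting offset for partition (\S+) to the committed offset FetchPosition\{offset=(\d+)"
)


def summarize_connect_log(log_text: str):
    """Return (assigned partitions, committed offset per partition)."""
    assigned, offsets = [], {}
    for line in log_text.splitlines():
        match = ASSIGNED.search(line)
        if match:
            assigned.extend(name.strip() for name in match.group(1).split(","))
        match = COMMITTED.search(line)
        if match:
            offsets[match.group(1)] = int(match.group(2))
    return assigned, offsets


# Two lines trimmed from the sample output above
sample_log = (
    "[2022-04-17 03:37:06,893] INFO Adding newly assigned partitions: customer-data-0 "
    "(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:291)\n"
    "[2022-04-17 03:37:06,903] INFO Setting offset for partition customer-data-0 to the "
    "committed offset FetchPosition{offset=3250, offsetEpoch=Optional.empty} "
    "(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:844)\n"
)

assigned, offsets = summarize_connect_log(sample_log)
print(assigned, offsets)
```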

Conclusion

In conclusion, real-time data ingestion to Databricks Delta Lake with Redpanda (Kafka Connect) is a powerful and flexible solution for processing and analyzing streaming data. In this article, we walked step by step through setting up data ingestion to Databricks Delta Lake with Redpanda and Kafka Connect.

Remember, you can find the code resources for this tutorial in this repository.