<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Memphis.dev</title>
    <description>The latest articles on DEV Community by Memphis.dev (@memphis_dev).</description>
    <link>https://dev.to/memphis_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5970%2F150e2662-f962-46b8-9a8c-b42f4e536694.jpg</url>
      <title>DEV Community: Memphis.dev</title>
      <link>https://dev.to/memphis_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/memphis_dev"/>
    <language>en</language>
    <item>
      <title>Ingesting Webhooks From Stripe – The Better Way</title>
      <dc:creator>Memphis.dev team</dc:creator>
      <pubDate>Thu, 18 Jan 2024 12:22:37 +0000</pubDate>
      <link>https://dev.to/memphis_dev/ingesting-webhooks-from-stripe-the-better-way-h58</link>
      <guid>https://dev.to/memphis_dev/ingesting-webhooks-from-stripe-the-better-way-h58</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Learn what webhooks are and how to use them with Stripe to react to events quickly and in real time, with greater reliability, using a streaming platform. This post covers everything you need to know about webhooks: what they are, how they work, examples, and how they can be improved using Memphis.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Webhooks?
&lt;/h2&gt;

&lt;p&gt;Imagine a world where information flows seamlessly between systems. In this world, there’s no need for constant browser refreshing or sending numerous requests for updates. Welcome to the domain of webhooks, where real-time communication glides with the rhythm of efficiency and automation.&lt;/p&gt;

&lt;p&gt;Webhooks stand out for their effectiveness for both the provider and the user. The main challenge with webhooks, however, is the complexity involved in their initial setup.&lt;/p&gt;

&lt;p&gt;Often likened to reverse APIs, webhooks offer something akin to an API specification, requiring you to craft an API for the webhook to interact with. When the webhook initiates an HTTP request to your application, usually through a POST method, your task is to interpret and handle this incoming data effectively.&lt;/p&gt;
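&lt;p&gt;To make the “reverse API” point concrete, here is a minimal sketch of the kind of endpoint a client must run to receive webhook POSTs, using only the Python standard library (the handler name and port are illustrative, not part of any provider’s API):&lt;/p&gt;

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal endpoint a webhook provider would POST events to."""

    def do_POST(self):
        # Read and parse the JSON payload pushed by the provider
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        print("received:", event.get("type"))
        # Acknowledge receipt so the provider does not retry
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        # Silence default per-request logging for this sketch
        pass

# HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever() would run it
```

&lt;p&gt;Even this toy version hints at the operational burden: the client now owns a listening port, TLS termination, and request validation.&lt;/p&gt;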

&lt;h2&gt;
  
  
  The downsides of webhooks
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Push-based&lt;/strong&gt;:
Webhooks deliver or push events to your clients’ services, requiring them to handle the resulting back pressure. While understandable, this approach can impede your customers’ progress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementing a server&lt;/strong&gt;:
For your client’s services to receive webhooks, they need a server that listens to incoming events. This involves managing CORS, middleware, opening ports, and securing network access, which adds extra load to their service by increasing overall memory consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry&lt;/strong&gt;:
Services experience frequent crashes or unavailability due to various reasons. While some triggered webhooks might lead to insignificant events, others can result in critical issues, such as incomplete datasets where orders fail to be documented in CRM or new shipping instructions are not being processed. Hence, having a robust retry mechanism becomes crucial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent&lt;/strong&gt;:
Standard webhook systems generally lack event persistence for future audits and replays.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay&lt;/strong&gt;:
Similarly, it boils down to the user or developer experience you aim to provide. While setting up an endpoint for users to retrieve past events is feasible, it demands meticulous handling, intricate business logic, an extra database, and increased complexity for the client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throttling&lt;/strong&gt;:
Throttling is a technique used in computing and networking to control data flow, requests, or operations to prevent overwhelming a system or service. It limits the rate or quantity of incoming or outgoing data, recommendations, or actions. The primary challenge lies not in implementing throttling but in managing distinct access levels for various customers. Consider having an enterprise client with notably higher throughput needs compared to others. To accommodate this, you’d require a multi-tenant webhook system tailored to support diverse demands.&lt;/li&gt;
&lt;/ol&gt;
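&lt;p&gt;As a concrete illustration of the throttling point (item 6), a per-client token bucket is one common way to give different customers different rates; the class and tier names below are illustrative, not part of any webhook system’s API:&lt;/p&gt;

```python
import time

class TokenBucket:
    """Refill `rate` tokens per second, up to `capacity`; one bucket per client."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Distinct tiers for a multi-tenant setup
buckets = {
    "enterprise": TokenBucket(rate=100, capacity=200),
    "free": TokenBucket(rate=1, capacity=5),
}
```

&lt;p&gt;The hard part is not this class; it is keeping one correctly-sized bucket per tenant across a fleet of webhook workers.&lt;/p&gt;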

&lt;h2&gt;
  
  
  Why use webhooks with Stripe
&lt;/h2&gt;

&lt;p&gt;When you’re piecing together Stripe integrations, it’s crucial to have your applications tuned in to live events from your Stripe accounts. This way, your backend systems are always ready to spring into action based on these events.&lt;/p&gt;

&lt;p&gt;To get this real-time event flow, you’ll need to set up webhook endpoints in your application. Once you’ve registered these endpoints, Stripe becomes your real-time informant, pushing event data directly to your application’s webhook endpoint as things happen in your Stripe account. Stripe uses HTTPS to deliver these events, packaging them as JSON payloads that feature an Event object.&lt;/p&gt;
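&lt;p&gt;A sketch of what handling such a payload looks like: a Stripe Event object carries &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;, and the affected resource under &lt;code&gt;data.object&lt;/code&gt;. The payload values below are made up for illustration:&lt;/p&gt;

```python
import json

# Hypothetical payload shaped like a Stripe Event object
payload = json.dumps({
    "id": "evt_123",
    "type": "payment_intent.succeeded",
    "data": {"object": {"id": "pi_123", "amount": 2000, "currency": "usd"}},
})

event = json.loads(payload)
if event["type"] == "payment_intent.succeeded":
    intent = event["data"]["object"]
    print("payment", intent["id"], "succeeded for", intent["amount"], intent["currency"])
```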

&lt;p&gt;Webhook events are your go-to for monitoring asynchronous activities. They’re perfect for keeping tabs on events like a customer’s bank giving the green light on a payment, charge disputes from customers, successful recurring payments, or managing subscription billing. With webhooks, you’re not just informed; you’re always a step ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Memphis as your Stripe’s webhook destination
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Convert the push to pull: Memphis.dev operates as a pull-based message broker where clients actively pull and consume data from the broker.&lt;/li&gt;
&lt;li&gt;Retry: Memphis provides a built-in retry system that maintains client states and offsets, even during disconnections. This configurable mechanism resends unacknowledged events until they’re acknowledged or until the maximum number of retries is reached.&lt;/li&gt;
&lt;li&gt;Persistent: Memphis ensures message persistence by assigning a retention policy to each topic and message.&lt;/li&gt;
&lt;li&gt;Replay: The client has the flexibility to rotate the active offset, enabling easy access to read and replay any past event that complies with the retention policy and is still stored.&lt;/li&gt;
&lt;li&gt;Back pressure: Let Memphis absorb the back pressure and scaling concerns on behalf of your team and clients.&lt;/li&gt;
&lt;li&gt;Backup: You can easily enable automatic backup that will back up each and every message to an external S3-compatible storage.&lt;/li&gt;
&lt;li&gt;Dead-letter: A dead-letter station preserves unconsumed messages, rather than discarding them, so you can diagnose why their processing failed.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to get started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Head to Stripe’s webhook &lt;a href="https://dashboard.stripe.com/login?redirect=%2Fwebhooks%2Fcreate" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0wl9szswfs86l42ouei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0wl9szswfs86l42ouei.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a &lt;a href="https://cloud.memphis.dev/" rel="noopener noreferrer"&gt;Memphis account&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create a new Memphis station&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow3gzkv7xjxfz2flc0s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow3gzkv7xjxfz2flc0s3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new client-type user and generate a URL for producing data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnv8mnt8svy6lw6n69uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnv8mnt8svy6lw6n69uy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F763yrw00ivvuhaskkdrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F763yrw00ivvuhaskkdrq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t073580jfuvux9wcx37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t073580jfuvux9wcx37.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy the produce URL to the Stripe dashboard and click “Add endpoint”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl9b7cn2di5rzlxwbkku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl9b7cn2di5rzlxwbkku.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a selected event occurs, Stripe sends it to your Memphis station&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdceam0z2zezg9kns2bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdceam0z2zezg9kns2bx.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej" rel="noopener noreferrer"&gt;Join 4500+ others and sign up for our data engineering newsletter&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by &lt;a href="https://www.linkedin.com/in/shoham-roditi-elimelech-0b933314a/" rel="noopener noreferrer"&gt;Shoham Roditi Elimelech&lt;/a&gt;, software engineer at @&lt;a href="https://memphis.dev/blog/ingesting-webhooks-from-stripe-the-better-way/" rel="noopener noreferrer"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis" rel="noopener noreferrer"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.gg/sGgmCQP7jc" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webhooks</category>
      <category>stripe</category>
      <category>messagebroker</category>
    </item>
    <item>
      <title>Event Sourcing with Memphis.dev: A Beginner’s Guide</title>
      <dc:creator>Memphis.dev team</dc:creator>
      <pubDate>Thu, 04 Jan 2024 12:14:54 +0000</pubDate>
      <link>https://dev.to/memphis_dev/event-sourcing-with-memphisdev-a-beginners-guide-71a</link>
      <guid>https://dev.to/memphis_dev/event-sourcing-with-memphisdev-a-beginners-guide-71a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the realm of modern software development, managing and maintaining data integrity is paramount. Traditional approaches often involve updating the state of an application directly within a database. However, as systems grow in complexity, ensuring data consistency and traceability becomes more challenging. This is where Event Sourcing, coupled with a powerful distributed streaming platform like Memphis.dev, emerges as a robust solution and a great data structure to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Event Sourcing?
&lt;/h2&gt;

&lt;p&gt;At its core, Event Sourcing is a design pattern that captures every change or event that occurs in a system as an immutable and sequentially ordered log of events. Instead of persisting the current state of an entity, Event Sourcing stores a sequence of state-changing events. These events serve as a single source of truth for the system’s state at any given point in time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Event Sourcing in Action
&lt;/h2&gt;

&lt;p&gt;Imagine a banking application that tracks an account’s balance. Instead of only storing the current balance in a database, Event Sourcing captures all the events that alter the balance. Deposits, withdrawals, or any adjustments are recorded as individual events in chronological order.&lt;/p&gt;

&lt;p&gt;Let’s break down how this might work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event Creation&lt;/strong&gt;: When a deposit of $100 occurs in the account, an event, such as FundsDeposited, with relevant metadata (timestamp, amount, account number) is created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event Storage&lt;/strong&gt;: These events are then appended to an immutable log, forming a sequential history of transactions specific to that account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;State Reconstruction&lt;/strong&gt;: To obtain the current state of an account, the application replays these events sequentially to compute the current balance. Each event is applied in order to derive the current balance, enabling the system to rebuild state at any given point in time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
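&lt;p&gt;The three steps above can be sketched in a few lines of Python; the event names follow the FundsDeposited example, and the whole thing is illustrative rather than a production model:&lt;/p&gt;

```python
# The immutable, ordered event log for one account
events = [
    {"type": "FundsDeposited", "amount": 100},
    {"type": "FundsWithdrawn", "amount": 30},
    {"type": "FundsDeposited", "amount": 50},
]

def current_balance(events):
    """Rebuild state by replaying every event in order."""
    balance = 0
    for event in events:
        if event["type"] == "FundsDeposited":
            balance += event["amount"]
        elif event["type"] == "FundsWithdrawn":
            balance -= event["amount"]
    return balance

current_balance(events)  # 120
```

&lt;p&gt;Replaying a prefix of the log gives the balance as of any earlier point in time, which is exactly the temporal-query property discussed below.&lt;/p&gt;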

&lt;h2&gt;
  
  
  Leveraging Memphis in Event Sourcing
&lt;/h2&gt;

&lt;p&gt;Memphis.dev, an open-source distributed event streaming platform, is perfect for implementing Event Sourcing due to its features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability and Fault Tolerance&lt;/strong&gt;: Memphis’ distributed nature allows for horizontal scalability and ensures fault tolerance by replicating data across multiple brokers (nodes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ordered and Immutable Event Logs&lt;/strong&gt;: Memphis’ log-based architecture aligns seamlessly with the principles of Event Sourcing. It maintains ordered, immutable logs, preserving the sequence of events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time Event Processing&lt;/strong&gt;: Memphis Functions offers a serverless framework, built within the Memphis platform to handle high-throughput, real-time event streams. Applications can process events as they occur, enabling near real-time reactions to changes in state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Managing Schemas&lt;/strong&gt;: One of the major challenges in event sourcing is maintaining schemas across the different events to avoid upstream breaks and client crashes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Benefits of Event Sourcing with Memphis
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal Queries and Auditing&lt;/strong&gt;: By retaining a complete history of events, it becomes possible to perform temporal queries and reconstruct past states, aiding in auditing and compliance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility and Scalability&lt;/strong&gt;: As the system grows, Event Sourcing with Memphis allows for easy scalability, as new consumers can independently process the event log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fault Tolerance and Recovery&lt;/strong&gt;: In the event of failures, the ability to rebuild state from events ensures resiliency and quick recovery.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Let’s see what it looks like via code
&lt;/h2&gt;

&lt;p&gt;Events are pushed, in their order of creation, into a Memphis station (the equivalent of a topic).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Log:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class EventLog:
    def __init__(self):
        self.events = []

    def append_event(self, event):
        self.events.append(event)

    def get_events(self):
        return self.events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Memphis Producer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from __future__ import annotations
import asyncio
from memphis import Memphis, Headers, MemphisError, MemphisConnectError, MemphisHeaderError, MemphisSchemaError
import json

class MemphisEventProducer:
    def __init__(self, host="my.memphis.dev"):
        self.host = host
        self.memphis = Memphis()
        self.connected = False

    async def send_event(self, topic, event):
        try:
            if not self.connected:
                await self.memphis.connect(host=self.host, username="&amp;lt;application type username&amp;gt;", password="&amp;lt;password&amp;gt;")
                self.connected = True
            await self.memphis.produce(station_name=topic, producer_name='prod_py',
                                       message=event, nonblocking=False)
        except (MemphisError, MemphisConnectError, MemphisHeaderError, MemphisSchemaError) as e:
            print(e)

    async def close(self):
        await self.memphis.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize Event Log
event_log = EventLog()

# Initialize Memphis Producer
producer = MemphisEventProducer()

# Append events to the event log and produce them to Memphis
events_to_publish = [
    {"type": "Deposit", "amount": 100},
    {"type": "Withdrawal", "amount": 50},
    # Add more events as needed
]

for event in events_to_publish:
    event_log.append_event(event)
    producer.send_event('account-events', event)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Criteria to choose the right event streaming platform for the job
&lt;/h2&gt;

&lt;p&gt;When implementing Event Sourcing with a message broker, several key features are crucial for a streamlined and efficient system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistent Message Storage:&lt;/strong&gt;&lt;br&gt;
Durability: Messages should be reliably stored even in the event of failures. This ensures that no events are lost and the event log remains intact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ordered and Immutable Event Logs:&lt;/strong&gt;&lt;br&gt;
Sequential Order: Preserving the order of events is critical for accurate state reconstruction. Events must be processed in the same sequence they were produced.&lt;br&gt;
Immutability: Once an event is stored, it should not be altered. This guarantees the integrity and consistency of the event log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability and Performance:&lt;/strong&gt;&lt;br&gt;
Horizontal Scalability: The message broker should support horizontal scaling to accommodate increased event volume without sacrificing performance.&lt;br&gt;
Low Latency: Minimizing message delivery time ensures near real-time processing of events, enabling quick reactions to state changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fault Tolerance and High Availability:&lt;/strong&gt;&lt;br&gt;
Redundancy: Ensuring data redundancy across multiple nodes or partitions prevents data loss in the event of node failures.&lt;br&gt;
High Availability: Continuous availability of the message broker is essential to maintain system functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumer Flexibility and State Rebuilding:&lt;/strong&gt;&lt;br&gt;
Consumer Groups: Support for consumer groups allows multiple consumers to independently process the same set of events, aiding in parallel processing and scalability.&lt;br&gt;
State Rebuilding: The broker should facilitate easy rebuilding of the application state by replaying events, enabling historical data retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retention Policies and Archiving:&lt;/strong&gt;&lt;br&gt;
Retention Policies: Configurable retention policies allow managing the duration or size of stored messages. This ensures efficient storage management.&lt;br&gt;
Archiving: Ability to archive or offload older events to long-term storage for compliance or historical analysis purposes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring and Management:&lt;/strong&gt;&lt;br&gt;
Metrics and Monitoring: Providing insights into message throughput, latency, and system health helps in monitoring and optimizing system performance.&lt;br&gt;
Admin Tools: Easy-to-use administrative tools for managing topics, partitions, and configurations streamline system management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security and Compliance:&lt;/strong&gt;&lt;br&gt;
Encryption and Authentication: Support for encryption and authentication mechanisms ensures the confidentiality and integrity of transmitted events.&lt;br&gt;
Compliance Standards: Adherence to compliance standards (such as GDPR, SOC2) ensures that sensitive data is handled appropriately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Seamless Integration and Ecosystem Support:&lt;/strong&gt;&lt;br&gt;
Compatibility and Integrations: Seamless integration with various programming languages and frameworks, along with support for diverse ecosystems, enhances usability.&lt;br&gt;
Ecosystem Tools: Availability of connectors, libraries, and frameworks that facilitate Event Sourcing simplifies implementation and reduces development efforts.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
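&lt;p&gt;The consumer-group and state-rebuilding criteria (item 5) can be illustrated with a toy in-memory model; this is not the API of Memphis or any other broker, just the shape of the idea:&lt;/p&gt;

```python
class Station:
    """Toy append-only log; each consumer group tracks its own offset."""

    def __init__(self):
        self.log = []
        self.offsets = {}

    def produce(self, event):
        self.log.append(event)

    def fetch(self, group, batch=10):
        # Each group reads independently from its own position in the log
        start = self.offsets.get(group, 0)
        events = self.log[start:start + batch]
        self.offsets[group] = start + len(events)
        return events

station = Station()
for e in ["deposited", "withdrawn", "deposited"]:
    station.produce(e)

station.fetch("billing")    # all three events
station.fetch("analytics")  # the same three events, independently
```

&lt;p&gt;Resetting a group’s offset to zero is all it takes to replay history and rebuild state, which is why this log shape suits Event Sourcing so well.&lt;/p&gt;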

&lt;p&gt;Choosing a message broker that aligns with these critical features is essential for implementing robust Event Sourcing, ensuring data integrity, scalability, and resilience within your application architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Event Sourcing using a Database vs a Message Broker (Streaming Platform)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Case Complexity&lt;/strong&gt;: For simpler applications or where scalability isn’t a primary concern, databases might suffice. For higher reliability, distributed systems needing high scalability, and real-time processing, a message broker can be more suitable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replay&lt;/strong&gt;: In event streaming platforms or message brokers, events are stored in a FIFO manner, one after the other as they first appear. That nature also makes it easier for the consumer on the other side to understand the natural flow of events and replay the entire “scene,” whereas in databases, it is not the case, and additional fields must be added, like timestamps, to organize the data based on time. It also requires additional logic to understand the latest state of an entity.&lt;/p&gt;




&lt;p&gt;Continue your learning: &lt;a href="https://memphis.dev/blog/event-sourcing-outgrows-databases/"&gt;read&lt;/a&gt; how and why event sourcing outgrows the database.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by &lt;a href="https://www.linkedin.com/in/idan-asulin/"&gt;Idan Asulin&lt;/a&gt;, Co-Founder &amp;amp; CTO at @Memphis.dev.&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://twitter.com/Memphis_Dev"&gt;Twitter&lt;/a&gt; • &lt;a href="https://discord.com/invite/WZpysvAeTf"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataintegrity</category>
      <category>dataconsistency</category>
      <category>streamingdata</category>
    </item>
    <item>
      <title>Comparing Webhooks and Event Consumption: A Comparative Analysis</title>
      <dc:creator>Memphis.dev team</dc:creator>
      <pubDate>Thu, 28 Dec 2023 14:59:11 +0000</pubDate>
      <link>https://dev.to/memphis_dev/comparing-webhooks-and-event-consumption-a-comparative-analysis-37a3</link>
      <guid>https://dev.to/memphis_dev/comparing-webhooks-and-event-consumption-a-comparative-analysis-37a3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In event-driven architecture and API integration, two vital concepts stand out: webhooks and event consumption. Both are mechanisms used to facilitate communication between different applications or services. Yet, they differ significantly in their approaches and functionalities, and by the end of this article, you will learn why consuming events can be a much more robust option than serving them using a webhook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The foundational premise of this article is that you operate a platform that wants to deliver, or already delivers, internal events to your clients through webhooks.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Webhooks
&lt;/h2&gt;

&lt;p&gt;Webhooks are user-defined HTTP callbacks triggered by specific events on a service. They enable real-time communication between systems by notifying other applications when a particular event occurs. Essentially, webhooks eliminate the need to do manual polling or checking for updates, allowing for a more efficient, event-driven, and responsive system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of Webhooks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-driven:&lt;/strong&gt; Webhooks are event-driven and are only triggered when a specified event occurs. For example, a webhook can notify an application when a new user signs up or when an order is placed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outbound Requests:&lt;/strong&gt; They use HTTP POST requests to send data payloads to a predefined URL the receiving application provides.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Asynchronous Nature:&lt;/strong&gt; Webhooks operate asynchronously, allowing the sending and receiving systems to continue their processes independently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
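&lt;p&gt;The “Outbound Requests” point can be made concrete with the standard library; the URL and payload below are made up for illustration:&lt;/p&gt;

```python
import json
import urllib.request

def build_webhook_request(url, event):
    """Package an event as the JSON body of an HTTP POST request."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_webhook_request(
    "https://example.com/hooks/orders",           # the receiver's predefined URL
    {"type": "order.placed", "order_id": "o_1"},  # made-up event payload
)
# urllib.request.urlopen(req) would deliver it (not executed here)
```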

&lt;h2&gt;
  
  
  Event Consumption
&lt;/h2&gt;

&lt;p&gt;Event consumption involves receiving, processing, and acting upon events emitted by various systems or services. This mechanism facilitates the seamless integration and synchronization of data across different applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of Event Consumption:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Message Queues or Brokers:&lt;/strong&gt; Event consumption often involves utilizing message brokers like Memphis.dev, Kafka, RabbitMQ, or AWS SQS to manage and distribute events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Subscriber-Driven:&lt;/strong&gt; Unlike webhooks, event consumption relies on subscribers who listen to event streams and process incoming events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Event consumption systems are highly scalable, efficiently handling large volumes of events.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architectural questions for a better decision
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Push-based vs. pull-based:&lt;/strong&gt;&lt;br&gt;
Webhooks deliver or push events to your clients’ services, requiring them to handle the resulting back pressure. While understandable, this approach can impede your customers’ progress. Using a scalable message broker to support consumption can alleviate this burden for your clients. How? By allowing clients to pull events based on their availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implementing a server vs. implementing a broker SDK:&lt;/strong&gt;&lt;br&gt;
For your client’s services to receive webhooks, they need a server that listens to incoming events. This involves managing CORS, middleware, opening ports, and securing network access, which adds extra load to their service by increasing overall memory consumption.&lt;br&gt;
Opting for pull-based consumption eliminates most of these requirements. With pull-based consumption, as the traffic is egress (outgoing) rather than ingress (incoming), there’s no need to set up a server, open ports, or handle CORS. Instead, the client’s service initiates the communication, significantly reducing complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry:&lt;/strong&gt;&lt;br&gt;
Services experience frequent crashes or unavailability due to various reasons. While some triggered webhooks might lead to insignificant events, others can result in critical issues, such as incomplete datasets where orders fail to be documented in CRM or new shipping instructions are not being processed. Hence, having a robust retry mechanism becomes crucial. This can be achieved by incorporating a retry mechanism within the webhook system or introducing an endpoint within the service.&lt;br&gt;
In contrast, when utilizing a message broker, events are acknowledged only after processing. Although implementing a retry mechanism is necessary in most cases, it’s typically more straightforward and native than handling retries with webhooks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistent:&lt;/strong&gt;&lt;br&gt;
Standard webhook systems generally lack event persistence for future audits and replays, a capability inherently provided by persisted message brokers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replay:&lt;/strong&gt;&lt;br&gt;
Similarly, it boils down to the user or developer experience you aim to provide. While setting up an endpoint for users to retrieve past events is feasible, it demands meticulous handling, intricate business logic, an extra database, and increased complexity for the client. In contrast, using a message broker supporting this feature condenses the process to just a line or two of code, significantly reducing complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Throttling:&lt;/strong&gt;&lt;br&gt;
Throttling is a technique used in computing and networking to control data flow, requests, or operations to prevent overwhelming a system or service. It limits the rate or quantity of incoming or outgoing data, recommendations, or actions. The primary challenge lies not in implementing throttling but in managing distinct access levels for various customers. Consider having an enterprise client with notably higher throughput needs compared to others. To accommodate this, you’d require a multi-tenant webhook system tailored to support diverse demands or opt for a message broker or streaming platform designed to handle such differential requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
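&lt;p&gt;To make the throttling point above concrete, here is a minimal per-tenant token-bucket sketch in Node.js. It is an illustration only: the tenant names and rates are hypothetical assumptions, and a production system would typically enforce limits in the broker or API gateway rather than in application code.&lt;/p&gt;

```javascript
// Minimal per-tenant token-bucket throttle (sketch; tenant names and
// rates below are hypothetical examples, not a production configuration).
class TokenBucket {
  constructor(ratePerSec, burst) {
    this.ratePerSec = ratePerSec; // tokens refilled per second
    this.capacity = burst;        // maximum burst size
    this.tokens = burst;
    this.last = Date.now();
  }

  // Returns true if one event may pass right now, false if throttled.
  tryRemoveToken() {
    const now = Date.now();
    const refill = ((now - this.last) / 1000) * this.ratePerSec;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Distinct access levels per customer: the enterprise tenant gets a
// much higher sustained rate and burst than the free tenant.
const buckets = {
  'enterprise-tenant': new TokenBucket(100, 200),
  'free-tenant': new TokenBucket(5, 10)
};

function allowWebhook(tenant) {
  return buckets[tenant].tryRemoveToken();
}
```

&lt;p&gt;A throttled event would then be rejected with a 429 or parked for later delivery, depending on the guarantees you want to offer.&lt;/p&gt;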




&lt;h2&gt;
  
  
  Memphis as a tailor-made solution for the task
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GyNsQpM5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9yom07p44e0fmw3yrfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GyNsQpM5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9yom07p44e0fmw3yrfk.png" alt="Image description" width="800" height="834"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are still iterating on the subject, so if you have any thoughts or ideas, I would love to learn from them: &lt;a href="mailto:idan@memphis.dev"&gt;idan@memphis.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webhooks</category>
      <category>eventconsumption</category>
      <category>eventdriven</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>How to handle API rate limitations with a queue</title>
      <dc:creator>Memphis.dev team</dc:creator>
      <pubDate>Wed, 20 Dec 2023 14:57:10 +0000</pubDate>
      <link>https://dev.to/memphis_dev/how-to-handle-api-rate-limitations-with-a-queue-8a8</link>
      <guid>https://dev.to/memphis_dev/how-to-handle-api-rate-limitations-with-a-queue-8a8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Rate limitation refers to restricting the number of times a specific action can be performed within a certain time frame. For example, an API might have rate limitations restricting user or app requests within a given period. This helps prevent server overload, ensures fair usage, and maintains system stability and security.&lt;/p&gt;

&lt;p&gt;Rate limitation is also a challenge for the apps that encounter it, as it requires them to “slow down” or pause. Here’s a typical scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initial Request:&lt;/strong&gt; When the app initiates communication with the API, it requests specific data or functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API Response:&lt;/strong&gt; The API processes the request and responds with the requested information or performs the desired action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate-Limitation:&lt;/strong&gt; If the app has reached the limit, it will usually need to wait until the next designated time frame (anywhere from a minute to an hour) before making additional requests. A “soft” rate limitation with known, linear timeframes is easier to handle. Often, however, the waiting time grows with every block, requiring different, custom handling for each API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling Rate Limit Exceedances:&lt;/strong&gt; If the app exceeds the rate limit, it might receive an error response from the API (such as a “429 Too Many Requests” status code). The app needs to handle this gracefully, possibly by queuing requests, implementing backoff strategies (waiting for progressively more extended periods before retrying), or informing the user about the rate limit being reached.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To effectively operate within rate limitations, apps often incorporate strategies like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Throttling:&lt;/strong&gt; Regulating the rate of outgoing requests to align with the API’s rate limit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching:&lt;/strong&gt; Storing frequently requested data locally to reduce the need for repeated API calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exponential Backoff:&lt;/strong&gt; Implementing a strategy where the app waits increasingly longer between subsequent retries after hitting a rate limit to reduce server load and prevent immediate retries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Queue&lt;/strong&gt;? More in the next section&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
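&lt;p&gt;The exponential backoff strategy listed above can be sketched in a few lines of Node.js. This is a hedged illustration: the base delay, the cap, and the assumption that rate-limit errors carry a status property of 429 are choices made for the example, not requirements.&lt;/p&gt;

```javascript
// Exponential backoff sketch: the delay doubles on every retry,
// capped at a maximum (base delay and cap are example values).
function backoffDelayMs(attempt, baseMs = 500, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries a request, backing off only on rate-limit (429) errors.
async function callWithBackoff(doRequest, maxAttempts = 5) {
  let attempt = 0;
  while (maxAttempts > attempt) {
    try {
      return await doRequest();
    } catch (err) {
      if (err.status !== 429) throw err; // only back off on 429s
      const delay = backoffDelayMs(attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
      attempt += 1;
    }
  }
  throw new Error('rate limit: retries exhausted');
}
```

&lt;p&gt;With a base of 500ms, retries wait 500ms, 1s, 2s, 4s, and so on, up to the cap.&lt;/p&gt;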

&lt;h2&gt;
  
  
  Using a queue
&lt;/h2&gt;

&lt;p&gt;A queue serves as an excellent “sidekick” or tool for helping services manage rate limitations due to its ability to handle tasks systematically. However, while it offers significant benefits, it’s not a standalone solution for this purpose.&lt;/p&gt;

&lt;p&gt;In constructing a robust architecture, the service or app used to interact with an external API subject to rate limitations often handles tasks asynchronously. This service is typically initiated by tasks derived from a queue. When the service encounters a rate limit, it can easily return the job to the main queue, or assign it to a separate queue designated for delayed tasks, and revisit it after a specific waiting period, say X seconds.&lt;/p&gt;

&lt;p&gt;This reliance on a queue system is highly advantageous, primarily because of its temporary nature and ordering. However, the queue alone cannot fully address rate limitations; it requires additional features or help from the service itself to effectively handle these constraints.&lt;/p&gt;
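&lt;p&gt;The requeue-with-delay idea described above can be sketched with a simple in-memory queue. This is a toy illustration of the flow, not a broker: the queue, the task shape, and the 5-second delay are assumptions for the example.&lt;/p&gt;

```javascript
// In-memory sketch of "return the job to the queue and revisit later".
const mainQueue = [];

function enqueue(task) {
  mainQueue.push(task);
}

// Park a rate-limited task and re-enqueue it after delayMs.
function requeueWithDelay(task, delayMs) {
  setTimeout(() => enqueue(task), delayMs);
}

// Pull one task; if the handler reports a rate limit, send it back
// to the queue with a waiting period (say, 5 seconds).
function processNext(handleTask) {
  const task = mainQueue.shift();
  if (!task) return;
  if (handleTask(task) === 'rate-limited') {
    requeueWithDelay(task, 5000);
  }
}
```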

&lt;h2&gt;
  
  
  Challenges may arise when utilizing a queue:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Tasks re-entering the queue might return earlier than necessary, as their timing isn’t directly controlled by your service.&lt;/li&gt;
&lt;li&gt;The service may still exceed rate limitations by making frequent calls within restricted timeframes. This may necessitate implementing sleep or wait mechanisms, commonly considered poor practice due to their potential impact on performance and responsiveness.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Here is what it will look like with RabbitMQ:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const amqp = require('amqplib');
const axios = require('axios');

// Function to make API requests
async function makeAPICall(url) {
  try {
    const response = await axios.get(url);
    console.log('API Response:', response.data);
  } catch (error) {
    console.error('API Error:', error.message);
  }
}

// Connect to RabbitMQ server
async function connect() {
  try {
    const connection = await amqp.connect('amqp://localhost');
    const channel = await connection.createChannel();

    const queue = 'rateLimitedQueue';
    await channel.assertQueue(queue, { durable: true });

    // Consume messages from the queue
    channel.consume(queue, async msg =&amp;gt; {
      const { url, delayInSeconds } = JSON.parse(msg.content.toString());

      // Simulating rate limitation
      await new Promise(resolve =&amp;gt; setTimeout(resolve, delayInSeconds * 1000));

      await makeAPICall(url); // Make the API call

      channel.ack(msg); // Acknowledge message processing completion
    });
  } catch (error) {
    console.error('RabbitMQ Connection Error:', error.message);
  }
}

// Function to send a message to the queue
async function addToQueue(url, delayInSeconds) {
  try {
    const connection = await amqp.connect('amqp://localhost');
    const channel = await connection.createChannel();

    const queue = 'rateLimitedQueue';
    await channel.assertQueue(queue, { durable: true });

    const message = JSON.stringify({ url, delayInSeconds });
    channel.sendToQueue(queue, Buffer.from(message), { persistent: true });

    console.log('Task added to the queue');
  } catch (error) {
    console.error('RabbitMQ Error:', error.message);
  }
}

// Usage example
addToQueue('https://api.example.com/data', 5); // Add an API call with a delay of 5 seconds

// Start the consumer
connect();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Or with Kafka
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { Kafka } = require('kafkajs');
const axios = require('axios');

// Function to make API requests
async function makeAPICall(url) {
  try {
    const response = await axios.get(url);
    console.log('API Response:', response.data);
  } catch (error) {
    console.error('API Error:', error.message);
  }
}

// Kafka configuration
const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092'], // Replace with your Kafka broker address
});

// Create a Kafka producer
const producer = kafka.producer();

// Connect to Kafka and send messages
async function produceToKafka(topic, message) {
  await producer.connect();
  await producer.send({
    topic,
    messages: [{ value: message }],
  });
  await producer.disconnect();
}

// Create a Kafka consumer
const consumer = kafka.consumer({ groupId: 'my-group' });

// Consume messages from Kafka topic
async function consumeFromKafka(topic) {
  await consumer.connect();
  await consumer.subscribe({ topic });
  await consumer.run({
    eachMessage: async ({ message }) =&amp;gt; {
      const { url, delayInSeconds } = JSON.parse(message.value.toString());

      // Simulating rate limitation
      await new Promise(resolve =&amp;gt; setTimeout(resolve, delayInSeconds * 1000));

      await makeAPICall(url); // Make the API call
    },
  });
}

// Usage example - Sending messages to Kafka topic
async function addToKafka(topic, url, delayInSeconds) {
  const message = JSON.stringify({ url, delayInSeconds });
  await produceToKafka(topic, message);
  console.log('Message added to Kafka topic');
}

// Start consuming messages from Kafka topic
const kafkaTopic = 'rateLimitedTopic';
consumeFromKafka(kafkaTopic);

// Usage example - Adding messages to Kafka topic
addToKafka('rateLimitedTopic', 'https://api.example.com/data', 5); // Add an API call with a delay of 5 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Both approaches are legitimate, yet they necessitate your service to incorporate a ‘sleep’ mechanism.&lt;/p&gt;

&lt;p&gt;With Memphis, you can offload the delay from the client to the queue using a simple feature made&lt;br&gt;
just for that purpose, called “Delayed Messages”. Delayed messages allow you to send a received message back to the broker when your consumer application requires extra processing time.&lt;/p&gt;

&lt;p&gt;What sets apart Memphis’ implementation is the consumer’s capability to control this delay independently and atomically.&lt;br&gt;
Within the station, the count of unconsumed messages doesn’t impact the consumption of delayed messages. For instance, if a 60-second delay is necessary, it precisely configures the invisibility time for that specific message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memphis.dev Delayed Messages
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;A message is received by the consumer group.&lt;/li&gt;
&lt;li&gt;An event occurs, prompting the consumer group to pause processing the message.&lt;/li&gt;
&lt;li&gt;Assuming the &lt;code&gt;maxMsgDeliveries&lt;/code&gt; hasn’t hit its limit, the consumer will activate &lt;code&gt;message.delay(delayInMilliseconds)&lt;/code&gt;, bypassing the message. Instead of immediately reprocessing the same message, the broker will retain it for the specified duration.&lt;/li&gt;
&lt;li&gt;The subsequent message will be consumed.&lt;/li&gt;
&lt;li&gt;Once the requested &lt;code&gt;delayInMilliseconds&lt;/code&gt; has passed, the broker will halt the primary message flow and reintroduce the delayed message into circulation.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require('memphis-dev');

// Function that makes the actual API call and delays the message on failure
async function makeAPICall(message) {
  try {
    const response = await axios.get(message.getDataAsJson()['url']);
    console.log('API Response:', response.data);
    message.ack();
  } catch (error) {
    console.error('API Error:', error.message);
    console.log('Delaying message for 1 minute');
    message.delay(60000);
  }
}

(async function () {
    let memphisConnection;

    try {
        memphisConnection = await memphis.connect({
            host: '&amp;lt;broker-hostname&amp;gt;',
            username: '&amp;lt;application-type username&amp;gt;',
            password: '&amp;lt;password&amp;gt;'
        });

        const consumer = await memphisConnection.consumer({
            stationName: '&amp;lt;station-name&amp;gt;',
            consumerName: '&amp;lt;consumer-name&amp;gt;',
            consumerGroup: ''
        });

        consumer.setContext({ key: "value" });
        consumer.on('message', async (message, context) =&amp;gt; {
            await makeAPICall(message);
        });

        consumer.on('error', (error) =&amp;gt; { });
    } catch (ex) {
        console.log(ex);
        if (memphisConnection) memphisConnection.close();
    }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Understanding and adhering to rate limitations is crucial for app developers working with APIs. It involves managing request frequency, handling errors when limits are reached, implementing backoff strategies to prevent overloading the API servers, and utilizing rate limit information provided by the API to optimize app performance, and now you know how to do it with a queue as well!&lt;/p&gt;

&lt;p&gt;Head to our &lt;a href="https://memphis.dev/blog/"&gt;blog&lt;/a&gt; or &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;docs&lt;/a&gt; for more examples like this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev By &lt;a href="https://www.linkedin.com/in/idan-asulin/"&gt;Idan Asulin&lt;/a&gt;, Co-Founder &amp;amp; CTO at @Memphis.dev.&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://twitter.com/Memphis_Dev"&gt;Twitter&lt;/a&gt; • &lt;a href="https://discord.com/invite/WZpysvAeTf"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>ratelimitations</category>
      <category>messagequeue</category>
    </item>
    <item>
      <title>Real-Time Data Scrubbing Before Storing In A Data Warehouse</title>
      <dc:creator>Memphis.dev team</dc:creator>
      <pubDate>Wed, 20 Dec 2023 08:22:03 +0000</pubDate>
      <link>https://dev.to/memphis_dev/real-time-data-scrubbing-before-storing-in-a-data-warehouse-1a1</link>
      <guid>https://dev.to/memphis_dev/real-time-data-scrubbing-before-storing-in-a-data-warehouse-1a1</guid>
      <description>&lt;p&gt;Between January 2023 and May 2023, companies violating general data processing principles incurred fines totaling 1.86 billion USD (!!!).&lt;/p&gt;

&lt;p&gt;In today’s data-driven landscape, the importance of data accuracy and compliance cannot be overstated. As businesses amass vast amounts of information, the need to ensure data integrity, especially PII storing, becomes paramount. Data scrubbing emerges as a crucial process, particularly in real-time scenarios, before storing information in a data warehouse.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Data Scrubbing in the context of compliance&lt;/strong&gt;&lt;br&gt;
Data scrubbing, often referred to as data cleansing or data cleaning, involves the identification and rectification of errors or inconsistencies in a dataset. In the context of compliance, it means removing certain values that qualify as PII that cannot be stored or should be handled differently.&lt;/p&gt;

&lt;p&gt;Real-time data scrubbing takes the cleansing process a step further by ensuring that incoming data is cleaned and validated instantly, before being stored in a data warehouse.&lt;/p&gt;

&lt;p&gt;Compliance standards, such as GDPR, HIPAA, or industry-specific regulations, mandate stringent requirements for data accuracy, privacy, and security. Failure to adhere to these standards can result in severe repercussions, including financial penalties and reputational damage. Real-time data scrubbing acts as a robust preemptive measure, ensuring that only compliant data is integrated into the warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event-driven Scrubbing&lt;/strong&gt;&lt;br&gt;
Event-driven applications stand as stateful systems that intake events from one or multiple streams and respond to these incoming events by initiating computations, updating their state, or triggering external actions.&lt;/p&gt;

&lt;p&gt;They represent a progressive shift from the conventional application structure that segregates computation and data storage into distinct tiers. In this novel architecture, these applications retrieve data from and save data to a remote transactional database.&lt;/p&gt;

&lt;p&gt;In stark contrast, event-driven applications revolve around stateful stream processing frameworks. This approach intertwines data and computation, facilitating localized data access either in-memory or through disk storage. To ensure resilience, these applications implement fault-tolerance measures by periodically storing checkpoints in remote persistent storage.&lt;/p&gt;

&lt;p&gt;In the context of scrubbing, this means the actual scrubbing runs for each ingested event, in real time, powering up only when new events arrive and finishing immediately after. This contrasts with scheduled scrubbing performed on top of the database after the data has been stored, by which point the potential violation has already taken place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Memphis Functions support such a use case?&lt;/strong&gt;&lt;br&gt;
At times, a more comprehensive policy-driven cleansing may be necessary. However, if a quick, large-scale ‘eraser’ is what you require, Memphis Functions can offer an excellent solution. The diagram illustrates two options: data sourced from either a Kafka topic or a Memphis station, potentially both concurrently. This data passes through a Memphis Function named ‘&lt;a href="https://github.com/memphisdev/memphis-dev-functions/tree/master/remove-fields"&gt;remove-fields&lt;/a&gt;‘ before progressing to the data warehouse for further storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N2Q-RJ5H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fxy7vof8xeb9cv1fzygn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N2Q-RJ5H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fxy7vof8xeb9cv1fzygn.jpeg" alt="Image description" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Behind the curtain, events or streaming data are grouped into batches, a configuration determined by the user’s specifications. These batches then undergo processing via a serverless function, specifically the ‘remove-fields’ function, meticulously designed to cleanse the ingested data according to pre-established rules. Following this scrubbing process, the refined data is either consumed internally or routed to a different Kafka topic, alternatively being swiftly directed straight to the Data Warehouse (DWH) for immediate utilization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage example&lt;/strong&gt;&lt;br&gt;
Before&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": 123456789,
  "full_name": "Peter Parker",
  "gender": "male"
}
After (Removing ‘gender’)

{
  "id": 123456789,
  "full_name": "Peter Parker"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
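&lt;p&gt;For intuition, the core of such a field-removal step can be sketched in a few lines of JavaScript. This is a simplified stand-in for the linked ‘remove-fields’ function, not its actual source; the list of fields to remove is a hypothetical configuration.&lt;/p&gt;

```javascript
// Remove configured PII fields from an event before it reaches the warehouse.
function removeFields(event, fieldsToRemove) {
  const scrubbed = { ...event }; // copy so the original event is untouched
  for (const field of fieldsToRemove) {
    delete scrubbed[field];
  }
  return scrubbed;
}

// Example: scrub the 'gender' field, as in the before/after above.
const event = { id: 123456789, full_name: 'Peter Parker', gender: 'male' };
const clean = removeFields(event, ['gender']);
```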



&lt;p&gt;&lt;strong&gt;Next steps&lt;/strong&gt;&lt;br&gt;
An ideal follow-up action would involve implementing schema enforcement. Data warehouses are renowned for their rigorous schema enforcement practices. By integrating both a transformation layer and schema validation, it’s possible to significantly elevate data quality while reducing the risk of potential disruptions or breaks in the system. This can simply take place by attaching a Schemaverse schema to the station.&lt;/p&gt;

&lt;p&gt;Start by &lt;a href="https://cloud.memphis.dev/"&gt;signing up&lt;/a&gt; to Memphis Cloud. We have a great free plan that can get you up and running in no time, so you can try building a pipeline yourself.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev By &lt;a href="https://www.linkedin.com/in/idan-asulin/"&gt;Idan Asulin&lt;/a&gt;, Co-Founder &amp;amp; CTO at @Memphis.dev.&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://twitter.com/Memphis_Dev"&gt;Twitter&lt;/a&gt; • &lt;a href="https://discord.com/invite/WZpysvAeTf"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascrubbing</category>
      <category>dataprocessing</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Introducing Memphis Functions</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 09 Nov 2023 14:14:39 +0000</pubDate>
      <link>https://dev.to/memphis_dev/introducing-memphis-functions-4b1f</link>
      <guid>https://dev.to/memphis_dev/introducing-memphis-functions-4b1f</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2grkgm6p8sg9xt47t57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2grkgm6p8sg9xt47t57.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The story
&lt;/h2&gt;

&lt;p&gt;Organizations are increasingly embracing real-time event processing, intercepting data streams before they enter the data warehouse, and embracing event-driven architectural paradigms. However, they must contend with the ever-evolving landscape of data and technology. Development teams face the challenge of maintaining alignment with these changes while also striving for greater development efficiency and agility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further challenges lie ahead:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Developing new stream processing flows is a formidable task. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code exhibits high coupling to particular flows or event types.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is no opportunity for code reuse or sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Debugging, troubleshooting, and rectifying issues pose ongoing challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Managing code evolution remains a persistent concern.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The shortcomings of current solutions are as follows:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They impose the use of SQL or other vendor-specific, lock-in languages on developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They lack support for custom logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They add complexity to the infrastructure, particularly as operations scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They do not facilitate code reusability or sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ultimately, they demand a significant amount of time to construct a real-time application or pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing Memphis Functions
&lt;/h2&gt;

&lt;p&gt;The Memphis platform is composed of four independent components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Memphis Broker, serving as the storage layer.&lt;/li&gt;
&lt;li&gt;Schemaverse, responsible for schema management.&lt;/li&gt;
&lt;li&gt;Memphis Functions, designed for serverless stream processing.&lt;/li&gt;
&lt;li&gt;Memphis Connectors, facilitating data retrieval and delivery through pre-built connectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Memphis Functions empower developers and data engineers with the ability to seamlessly process, transform, and enrich incoming events in real-time through a serverless paradigm, all within the familiar AWS Lambda syntax.&lt;/p&gt;

&lt;p&gt;This means they can achieve these operations without being burdened by boilerplate code, intricate orchestration, error-handling complexities, or the need to manage underlying infrastructure.&lt;/p&gt;

&lt;p&gt;Memphis Functions provides this versatility in an array of programming languages, including but not limited to Go, Python, JavaScript, .NET, Java, and SQL. This flexibility ensures that development teams have the freedom to select the language best suited to their specific needs, making the event processing experience more accessible and efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s more?
&lt;/h3&gt;

&lt;p&gt;In addition to orchestrating various functions, Memphis Functions offer a comprehensive suite for the end-to-end management and observability of these functions. This suite encompasses features such as a robust retry mechanism, dynamic auto-scaling utilizing both Kubernetes-based and established public cloud serverless technologies, extensive monitoring capabilities, dead-letter handling, efficient buffering, distributed security measures, and customizable notifications.&lt;/p&gt;

&lt;p&gt;It’s important to note that Memphis Functions are designed to seamlessly complement existing streaming platforms, such as Kafka, without imposing the necessity of adopting the Memphis broker. This flexibility allows organizations to leverage Memphis Functions while maintaining compatibility with their current infrastructures and preferences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Write your processing function&lt;/strong&gt;&lt;br&gt;
Utilize the same syntax as you would when crafting a function for AWS Lambda, taking advantage of the familiar and powerful AWS Lambda framework. This approach ensures that you can tap into AWS Lambda’s extensive ecosystem and development resources, making your serverless function creation a seamless and efficient process and without learning yet another framework syntax.&lt;/p&gt;

&lt;p&gt;Functions can range from a simple string-to-JSON conversion all the way to pushing a webhook based on an event’s payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Connect Memphis to your git repository&lt;/strong&gt;&lt;br&gt;
Integrating Memphis with your git repository is the next crucial step. By doing so, Memphis establishes an automated link to your codebase, effortlessly fetching the functions you’ve developed. These functions are then conveniently showcased within the Memphis Dashboard, streamlining the entire process of managing and monitoring your serverless workflows. This seamless connection simplifies collaboration, version control, and overall visibility into your stream processing application development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Attach functions to streams&lt;/strong&gt;&lt;br&gt;
Now it’s time to integrate your functions with the streams. By attaching your developed functions to the streams, you establish a dynamic pathway for ingested events. These events will seamlessly traverse through the connected functions, undergoing processing as specified in your serverless workflow. This crucial step ensures that the events are handled efficiently, allowing you to unleash the full potential of your processing application with agility and scalability.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Gain early access and sign up to our Private Beta Functions waiting list &lt;a href="https://www.functions.memphis.dev/" rel="noopener noreferrer"&gt;here&lt;/a&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej" rel="noopener noreferrer"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis-broker" rel="noopener noreferrer"&gt;Github&lt;/a&gt;•&lt;a href="https://docs.memphis.dev/memphis/getting-started/readme" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;•&lt;a href="https://discord.com/invite/DfWFT7fzUu" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Event-Driven Architecture with Serverless Functions – Part 1</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Mon, 09 Oct 2023 08:38:24 +0000</pubDate>
      <link>https://dev.to/memphis_dev/event-driven-architecture-with-serverless-functions-part-1-1ei3</link>
      <guid>https://dev.to/memphis_dev/event-driven-architecture-with-serverless-functions-part-1-1ei3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the 1st part of the series “A new type of stream processing״.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this series of articles, we are going to explain what the missing piece in stream processing is, and in this part, we’ll start at the source. We’ll break down the different components and walk through how they can be used in tandem to drive modern software.&lt;/p&gt;

&lt;p&gt;First things first, Event-driven architecture. EDA and serverless functions are two powerful software patterns and concepts that have become popular in recent years with the rise of cloud-native computing. While one is more of an architecture pattern and the other a deployment or implementation detail, when combined, they provide a scalable and efficient solution for modern applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Event-Driven Architecture
&lt;/h2&gt;

&lt;p&gt;EDA is a software architecture pattern that utilizes events to decouple various components of an application. In this context, an event is defined as a change in state. For example, for an e-commerce application, an event could be a customer clicking on a listing, adding that item to their shopping cart, or submitting their credit card information to buy. Events also encompass non-user-initiated state changes, such as scheduled jobs or notifications from a monitoring system.&lt;/p&gt;

&lt;p&gt;The primary goal of EDA is to create loosely coupled components or microservices that can communicate by producing and consuming events between one another in an asynchronous way. This way, different components of the system can scale up or down independently for availability and resilience. This decoupling also allows development teams to add or release new features more quickly and safely, as long as each component’s interface remains compatible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Usual Components
&lt;/h2&gt;

&lt;p&gt;A scalable event-driven architecture will comprise three key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Producer&lt;/strong&gt;: components that publish or produce events. These can be frontend services that take in user input, edge devices like IoT systems, or other types of applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broker&lt;/strong&gt;: components that take in events from producers and deliver them to consumers. Examples include Kafka, Memphis.dev, or AWS SQS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumer&lt;/strong&gt;: components that listen to events and act on them. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s important also to note that some components may be a producer for one workflow while being a consumer for another. For example, if we look at a credit card processing service, it could be a consumer for events that involve credit cards, such as new purchases or updating credit card information. At the same time, this service may be a producer for downstream services that record purchase history or detect fraudulent activity. &lt;/p&gt;

&lt;h2&gt;
  
  
  Common Patterns
&lt;/h2&gt;

&lt;p&gt;Since EDA is a broad architectural pattern, it can be applied in many ways. Some common patterns include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Point-to-point messaging&lt;/strong&gt;: For applications that need a simple one-to-one communication channel, a point-to-point messaging pattern may be used with a simple queue. Events are sent to a queue (messaging channels) and buffered for consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pub/sub&lt;/strong&gt;: If multiple consumers need to listen to the same events, pub/sub style messaging may be used. In this scenario, the producer generates events on a topic that consumers can subscribe to. This is useful for scenarios where events need to be broadcast (e.g. replication) or different business logic must be applied to the same event.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Communication models&lt;/strong&gt;: Different use cases dictate how communication should be coordinated. In some cases, it must be orchestrated via a centralized service if logic involves some distinct steps with dependencies. In other cases, it can be choreographed where producers can generate events without worrying about downstream dependencies as long as the events adhere to a predetermined schema or format. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
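&lt;p&gt;The first two patterns can be contrasted in a few lines of plain JavaScript (in-memory stand-ins, not a real messaging system): in point-to-point, each message reaches exactly one consumer; in pub/sub, every subscriber receives every event:&lt;/p&gt;

```javascript
// Point-to-point: each message is delivered to exactly one consumer.
const queue = [];
queue.push("task-1");
queue.push("task-2");

const workerA = [queue.shift()]; // workerA takes task-1
const workerB = [queue.shift()]; // workerB takes task-2

// Pub/sub: every subscriber of the topic receives every event.
const subscribers = [];
function subscribe(handler) { subscribers.push(handler); }
function publish(event) { subscribers.forEach(function (h) { h(event); }); }

const auditLog = [];
const cacheInvalidations = [];
subscribe(function (e) { auditLog.push(e); });           // e.g. replication
subscribe(function (e) { cacheInvalidations.push(e); }); // different business logic
publish("user.updated");
```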

&lt;h2&gt;
  
  
  EDA Use Cases
&lt;/h2&gt;

&lt;p&gt;Event-driven architecture became much more popular with the rise of cloud-native applications and microservices. We are not always aware of it, but Uber, DoorDash, Netflix, Lyft, Instacart, and many more are each built entirely on an event-driven, asynchronous architecture.&lt;/p&gt;

&lt;p&gt;Another key use case is processing event data that requires massive parallelization and elasticity in response to changing load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let’s talk about Serverless Functions
&lt;/h2&gt;

&lt;p&gt;Serverless functions are a subset of the serverless computing model, in which a third party (typically a cloud or service provider) or an orchestration engine manages the infrastructure on behalf of users and charges on a per-use basis. In particular, serverless functions, or Function-as-a-Service (FaaS), let users write small functions rather than full-fledged services with server logic, abstracting away typical “server” functionality such as HTTP listeners, scaling, and monitoring. For developers, serverless functions can simplify the workflow significantly: they can focus on the business logic while the service provider bootstraps the infrastructure and server functionality.&lt;/p&gt;

&lt;p&gt;Serverless functions are usually triggered by external events, such as an HTTP call or events on a queue or a pub/sub-style messaging system. Serverless functions are generally stateless and designed to handle individual events. When there are multiple calls, the service provider automatically scales functions up as needed, unless the user limits parallelism to one. While different implementations have varying execution limits, serverless functions are generally meant to be short-lived and should not be used for long-running jobs.&lt;/p&gt;
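&lt;p&gt;In code, a serverless function is typically just an exported handler. The event shape and handler signature below are generic illustrations loosely modeled on common FaaS conventions, not any specific provider’s API:&lt;/p&gt;

```javascript
// A stateless, event-triggered function in the typical FaaS shape:
// receive an event, do the work, return a result. No HTTP listener,
// scaling logic, or monitoring appears here; the platform is assumed
// to provide all of that.
async function handler(event) {
  const order = JSON.parse(event.body);
  return {
    statusCode: 200,
    body: JSON.stringify({ received: order.id }),
  };
}

// Simulating a single platform invocation:
handler({ body: JSON.stringify({ id: "order-7" }) }).then(function (result) {
  console.log(result.statusCode, result.body);
});
```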




&lt;h2&gt;
  
  
  How about combining EDA and Serverless Functions?
&lt;/h2&gt;

&lt;p&gt;As you probably noticed, since serverless functions are triggered by events, they make a great pairing with event-driven architectures. This is especially true for stateless services that can be short-lived. A lot of microservices probably fall under this bucket, unless they are responsible for batch processing or heavy analytics that push the execution limit. &lt;/p&gt;

&lt;p&gt;The benefit of utilizing serverless functions with event-driven architecture is the reduced overhead of managing the underlying infrastructure, freeing up developers’ time to focus on business logic that drives business value. Also, since service providers charge only per use, it can be a cost-efficient alternative to running self-hosted infrastructure on servers, VMs, or containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sounds like the perfect match, right? Why isn’t it everywhere?&lt;/strong&gt;&lt;br&gt;
While AWS is doing its best to push us toward Lambda, and Cloudflare, GCP, and others invest hundreds of millions of dollars to convince developers to use their serverless platforms, it still feels “harder” than building a traditional service in the context of EDA or data processing.&lt;/p&gt;

&lt;p&gt;Among the reasons are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Lack of observability: what went in and what went out?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Debugging is still hard. When dealing with data supplied by external systems, a great debugging experience is a must, as the slightest change in the data’s structure will break the function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No built-in retry mechanism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Difficulty converting a batch of incoming messages into a single processing batch.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea of combining both technologies is very interesting. Ultimately, it can save a great number of dev hours and effort, and add capabilities that are extremely hard to develop with traditional methodologies, but the reasons mentioned above are definitely a major blocker.&lt;/p&gt;

&lt;p&gt;I will end part 1 with a simple, open question/teaser –&lt;br&gt;
Have you ever used Zapier?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Stay tuned&lt;/a&gt; for part 2 and get one step closer to learning more about the new way to do stream processing.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev By Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>dataengineering</category>
      <category>datastructures</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Task scheduling with a message broker</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 17 Aug 2023 08:38:45 +0000</pubDate>
      <link>https://dev.to/memphis_dev/task-scheduling-with-a-message-broker-2lj5</link>
      <guid>https://dev.to/memphis_dev/task-scheduling-with-a-message-broker-2lj5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Task scheduling is essential in modern applications to maximize resource utilization and user experience (non-blocking task fulfillment).&lt;br&gt;
A queue is a powerful tool that allows your application to manage and prioritize tasks in a structured, persistent, and scalable way.&lt;br&gt;
While there are multiple possible solutions, working with a queue (which is also the perfect data structure for this type of work) ensures that tasks are completed in creation order without the risk of forgetting, overlooking, or double-processing critical tasks.&lt;/p&gt;

&lt;p&gt;A very interesting story about this need, and how it evolves as scale grows, can be found in a blog post by one of DigitalOcean’s co-founders:&lt;br&gt;
&lt;a href="https://www.digitalocean.com/blog/from-15-000-database-connections-to-under-100-digitaloceans-tale-of-tech-debt"&gt;From 15,000 database connections to under 100.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Any other solutions besides a queue?
&lt;/h2&gt;

&lt;p&gt;Multiple. Each with its own advantages and disadvantages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron&lt;/strong&gt;&lt;br&gt;
You can use cron job schedulers to automate such tasks. The issue with cron is that the job and its execution time must be defined explicitly and ahead of the actual execution, making your architecture highly static and not event-driven. Cron is mainly suitable for a well-defined, known set of tasks that must run regardless, rather than being triggered by user actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;br&gt;
A database can be a simple choice for storing tasks, and is often used that way in the early MVP days of a product,&lt;br&gt;
but there are multiple issues with that approach, for example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Insertion order is not guaranteed, so tasks might not be handled in the order they were actually created.&lt;/li&gt;
&lt;li&gt;Double processing can happen: a database does not delete a record once it is read, so a specific task can be read and processed twice, with potentially catastrophic results for a system’s behavior.&lt;/li&gt;
&lt;/ol&gt;
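&lt;p&gt;A toy illustration of the second issue (a plain array stands in for a database table; the claim logic is a sketch, not production code): a naive read lets two workers pick up the same task, while atomically claiming a row first prevents it:&lt;/p&gt;

```javascript
// A plain array stands in for a database table of pending tasks.
const table = [{ id: 1, status: "pending" }];

// Naive approach: two workers SELECT the same pending row before
// either one marks it as done, so both end up processing it.
function naiveRead() {
  return table.filter(function (row) { return row.status === "pending"; });
}
const seenByWorker1 = naiveRead();
const seenByWorker2 = naiveRead(); // same row again: double-processing risk

// Safer approach: atomically claim a row before processing it,
// so a second worker finds nothing left to claim.
function claim() {
  const row = table.find(function (r) { return r.status === "pending"; });
  if (row) {
    row.status = "claimed";
    return row;
  }
  return null;
}
const claimedByWorker1 = claim(); // gets the task
const claimedByWorker2 = claim(); // gets null: no double-processing
```

&lt;p&gt;A broker gives you the claim-and-remove semantics out of the box, instead of you rebuilding them on top of a database.&lt;/p&gt;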




&lt;h2&gt;
  
  
  Traditional queues
&lt;/h2&gt;

&lt;p&gt;Often, for task scheduling, the chosen queue would probably be a pub/sub system like RabbitMQ.&lt;/p&gt;

&lt;p&gt;Choosing RabbitMQ over a classic broker such as Kafka makes sense in the context of task scheduling, given Kafka’s natural behavior of retaining records (or tasks) until a specific point in time, regardless of whether they were acknowledged.&lt;/p&gt;

&lt;p&gt;The downside of choosing RabbitMQ is the lack of scale, robustness, and performance, all of which become increasingly necessary over time.&lt;/p&gt;

&lt;p&gt;With that idea in mind, Memphis is a broker that presents scale, robustness, and high throughput alongside a type of retention that fully enables task scheduling over a message broker.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memphis Broker is a perfect queue for task scheduling
&lt;/h2&gt;

&lt;p&gt;In v1.2, Memphis released support for ack-based retention through Memphis Cloud. Read more &lt;a href="https://docs.memphis.dev/memphis/memphis-broker/concepts/station#retention"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Messages will be removed from a station only when &lt;strong&gt;acknowledged by all&lt;/strong&gt; the connected consumer groups. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If only one consumer group is connected, a message/record is removed from the station automatically as soon as that group acknowledges it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If we have two connected consumer groups, the message will be removed from the station (=queue) once all CGs acknowledge the message.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
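&lt;p&gt;The retention rule above can be sketched in a few lines (the broker internals shown are purely illustrative, not Memphis’ actual implementation): a message is removed only once every connected consumer group has acknowledged it.&lt;/p&gt;

```javascript
// Illustration of ack-based retention: a message stays in the station
// until every connected consumer group has acknowledged it.
const connectedGroups = ["cg-billing", "cg-audit"];
const station = [{ id: "m1", acks: new Set() }];

function ack(messageId, group) {
  const msg = station.find(function (m) { return m.id === messageId; });
  if (msg) {
    msg.acks.add(group);
    // Remove only once all connected consumer groups have acked.
    const allAcked = connectedGroups.every(function (g) { return msg.acks.has(g); });
    if (allAcked) {
      station.splice(station.indexOf(msg), 1);
    }
  }
}

ack("m1", "cg-billing"); // still retained: cg-audit has not acked yet
const retainedAfterFirstAck = station.length;
ack("m1", "cg-audit");   // now removed from the station
const retainedAfterSecondAck = station.length;
```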

&lt;p&gt;We mentioned earlier the advantages and disadvantages of using traditional queues such as RabbitMQ in comparison to common brokers such as Kafka in the context of task scheduling. When comparing both tools to Memphis, it’s all about getting the best from both worlds.&lt;/p&gt;

&lt;p&gt;A few of Memphis.dev’s advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ordering&lt;/li&gt;
&lt;li&gt;Exactly-once delivery guarantee&lt;/li&gt;
&lt;li&gt;Highly scalable, serving data at high throughput with low latency&lt;/li&gt;
&lt;li&gt;Ack-based retention&lt;/li&gt;
&lt;li&gt;Many-to-Many pattern&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Getting started with Memphis Broker as a tasks queue
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://cloud.memphis.dev/"&gt;Sign up&lt;/a&gt; to Memphis Cloud.&lt;/li&gt;
&lt;li&gt;Connect your task producer. Producers are the entities that insert new records or tasks; consumers are the entities that read and process them. A single client with a single connection object can act as both a producer and a consumer at the same time, though not for the same station, since that would create an infinite loop. This pattern mainly reduces footprint and the number of needed “workers”: a single worker can produce tasks to one station while also acting as a consumer, or processor, for another station serving a different use case. The code example below creates an ack-based station and initiates a producer in Node.js:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require("memphis-dev");

(async function () {
  let memphisConnection;

  try {
    memphisConnection = await memphis.connect({
      host: "MEMPHIS_BROKER_HOSTNAME",
      username: "CLIENT_TYPE_USERNAME",
      password: "PASSWORD",
      accountId: ACCOUNT_ID
    });

    const station = await memphisConnection.station({
      name: 'tasks',
      retentionType: memphis.retentionTypes.ACK_BASED,
    })

    const producer = await memphisConnection.producer({
      stationName: "tasks",
      producerName: "producer-1",
    });

    const headers = memphis.headers();
    headers.add("Some_KEY", "Some_VALUE");
    await producer.produce({
      message: {taskID: 123, task: "deploy a new instance"}, // the message payload, here a plain JS object
      headers: headers,
    });

    memphisConnection.close();
  } catch (ex) {
    console.log(ex);
    if (memphisConnection) memphisConnection.close();
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Connect your task consumer. The consumer group below will consume tasks, process them, and, once finished, acknowledge them. By acknowledging a task, the broker removes the record, ensuring exactly-once processing. We use the station entity here as well, in case the consumer starts before the producer; it only takes effect if the station does not exist yet. Also remember that a consumer group can contain multiple consumers to increase parallelism and read throughput. Within a consumer group, only a single consumer will read and ack a specific message, not all of the contained consumers. If every consumer must receive each message, use multiple consumer groups.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require("memphis-dev");

(async function () {
  let memphisConnection;

  try {
    memphisConnection = await memphis.connect({
      host: "MEMPHIS_BROKER_HOSTNAME",
      username: "APPLICATION_TYPE_USERNAME",
      password: "PASSWORD",
      accountId: ACCOUNT_ID
    });

    const station = await memphisConnection.station({
      name: 'tasks',
      retentionType: memphis.retentionTypes.ACK_BASED,
    })

    const consumer = await memphisConnection.consumer({
      stationName: "tasks",
      consumerName: "worker1",
      consumerGroup: "cg_workers",
    });

    consumer.setContext({ key: "value" });
    consumer.on("message", (message, context) =&amp;gt; {
      console.log(message.getData().toString());
      message.ack();
      const headers = message.getHeaders();
    });

    consumer.on("error", (error) =&amp;gt; {});
  } catch (ex) {
    console.log(ex);
    if (memphisConnection) memphisConnection.close();
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;If you liked this tutorial and want to learn what else you can do with Memphis, head &lt;a href="https://docs.memphis.dev/memphis/getting-started/tutorials"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis-broker"&gt;Github&lt;/a&gt;•&lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt;•[Discord (&lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;https://discord.com/invite/DfWFT7fzUu&lt;/a&gt;)&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/blog/streaming-first-infrastructure-for-real-time-machine-learning/"&gt;Memphis.dev&lt;/a&gt; By &lt;a href="https://www.linkedin.com/in/shay-bratslavsky/"&gt;Shay Bratslavsky&lt;/a&gt;, Software Engineer at @Memphis.dev&lt;/p&gt;

</description>
      <category>messagebroker</category>
      <category>messagequeue</category>
      <category>taskmanagement</category>
      <category>memphisdev</category>
    </item>
    <item>
      <title>You have a chance to save the world! 🔥</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Mon, 17 Jul 2023 12:11:02 +0000</pubDate>
      <link>https://dev.to/memphis_dev/you-have-a-chance-to-save-the-world-g5e</link>
      <guid>https://dev.to/memphis_dev/you-have-a-chance-to-save-the-world-g5e</guid>
      <description>&lt;p&gt;&lt;a href="https://www.hackathon.memphis.dev/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UqRjg2VD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3vl99ybzt3gtzvcr1be.png" alt="Savezakar hackathon" width="800" height="240"&gt;&lt;/a&gt;&lt;br&gt;
We are happy to announce Memphis’ first hackathon, #SaveZakar!📣📣📣&lt;/p&gt;

&lt;p&gt;Sponsored by &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;, &lt;a href="https://streamlit.io/"&gt;Streamlit&lt;/a&gt; and &lt;a href="https://supabase.com/"&gt;Supabase&lt;/a&gt; - Join us to save the world from wildfires using real-time data and AI!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.hackathon.memphis.dev/"&gt;Sign up&lt;/a&gt; 🔥&lt;/p&gt;




&lt;h2&gt;
  
  
  What is the hackathon all about? 🌎
&lt;/h2&gt;

&lt;p&gt;Wildfires wreak havoc every year. They take human and animal lives. The fires destroy homes and other property. They destroy agricultural and industrial crops and cause famines. They damage the environment, contribute to global warming, and generate smoke that pollutes the air. Their overall impact runs into billions of dollars and includes incalculable harm to people and animals.&lt;/p&gt;

&lt;p&gt;Where and when wildfires will occur is difficult to predict ahead of time. Instead, researchers; federal, state, and municipal governments; and international non-profits have invested heavily in early warning systems. If fires can be detected early, firefighters can be deployed to prevent their spread. If people can be notified, they can be evacuated to avoid loss of life.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your mission 🔥
&lt;/h2&gt;

&lt;p&gt;In this hackathon, you are going to create a wildfire early warning system for the fictional island nation of Zakar, which has been struggling with wildfires over the last few years. For a small island nation, fires are particularly problematic. Importing materials is expensive and takes time, so Zakar must do everything it can to protect its homes, farms, and natural resources. Similarly, with a relatively small geographic footprint, smoke can quickly pollute the air, causing health problems for its people.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prizes 🏆
&lt;/h2&gt;

&lt;p&gt;Each project will be judged by the following categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creativity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most informative visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most accurate solution (For the early warning system)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most interesting architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most interesting algorithm&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides internal glory 😉, the best project will get the perfect gaming package, which includes the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SteelSeries Arctis Nova Pro Wireless Multi-System Gaming Headset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logitech G Pro Wireless Gaming Mouse - League of Legends Edition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RARE Framed Nintendo Game Boy Color GBC Disassembled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tons of swag from Memphis.dev and Streamlit!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second-best project will receive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Logitech G Pro Wireless Gaming Mouse - League of Legends Edition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tons of swag from Memphis.dev and Streamlit!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Useful information💡
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The hackathon week will run from July 31 to August 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are two categories of projects to submit: an early warning system and a data visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The submission deadline is Monday, August 7, 2023.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://discord.gg/q37A5ZF4yH"&gt;Discord&lt;/a&gt; channel to get full support.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The winners will be announced on August 21, 2023.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://www.hackathon.memphis.dev/"&gt;Sign up&lt;/a&gt; 🔥
&lt;/h2&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datahackathon</category>
      <category>data</category>
      <category>streamlit</category>
      <category>supabase</category>
    </item>
    <item>
      <title>Introducing Memphis.dev Cloud: Empowering Developers with the Next Generation of Streaming</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Mon, 03 Jul 2023 13:30:37 +0000</pubDate>
      <link>https://dev.to/memphis_dev/introducing-memphisdev-cloud-empowering-developers-with-the-next-generation-of-streaming-50kp</link>
      <guid>https://dev.to/memphis_dev/introducing-memphisdev-cloud-empowering-developers-with-the-next-generation-of-streaming-50kp</guid>
      <description>&lt;p&gt;Event processing innovator Memphis.dev today introduced Memphis Cloud to enable a full serverless experience for massive scale event streaming and processing, and announced it had secured $5.58 million in seed funding co-led by Angular Ventures and boldstart ventures, with participation from JFrog co-founder and CTO Fred Simon, Snyk co-founder Guy Podjarny, CircleCI CEO Jim Rose, Console.dev co-founder David Mytton, and Priceline CTO Martin Brodbeck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Memphis.dev Cloud
&lt;/h2&gt;

&lt;p&gt;Memphis.dev, the next-generation event streaming platform, is ready to make waves in the world and disrupt data streaming with its highly anticipated cloud service launch.&lt;br&gt;
With a firm commitment to providing developers and data engineers with a powerful, unified streaming engine, Memphis.dev aims to revolutionize the way software utilizes message brokers. In this blog post, we delve into the key features and benefits of Memphis.dev’s cloud service, highlighting how it empowers organizations and developers to unleash the full potential of their data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to expect?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The Serverless Experience&lt;/strong&gt;&lt;br&gt;
Memphis’ platform was intentionally designed to be deployed in minutes on any Kubernetes cluster and any cloud: on-prem, public cloud, or even air-gapped environments.&lt;br&gt;
As multi-cloud architectures rise, Memphis streamlines development from the local dev station all the way to production, both on-prem and across various clouds, reducing TCO and overall complexity. The serverless cloud enables even faster onboarding and time-to-value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Enable “Day 2” operations&lt;/strong&gt;&lt;br&gt;
Message brokers need to evolve to handle the vast number and complexity of events that occur, and they need to incorporate three critical elements: reliability; ease of management and scaling; and what we call “Day 2” operations on top, to help build queue-based, stream-driven applications in minutes.&lt;/p&gt;

&lt;p&gt;To support both the small and the massive workloads Memphis is built for, some key features could only be delivered via the cloud.&lt;/p&gt;

&lt;p&gt;Key features in Memphis Cloud include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Augmenting Kafka clusters – providing the missing piece in modern stream processing with the ability to augment Kafka clusters;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schemaverse – Enabling built-in schema management, enforcement, and transformation to ensure data quality as our data gets complicated and branched;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-tenancy – Offering the perfect solution for users of SaaS platforms who want to isolate traffic between their customers;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;True multi-cloud – creating primary instances on GCP, and a replica on AWS.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;3. A Developer-Centric Approach (and Obsession)&lt;/strong&gt;&lt;br&gt;
Memphis.dev’s cloud service launch is driven by a developer-centric philosophy, recognizing that developers are the driving force behind technological innovation. With a deep understanding of developers’ and data engineers’ needs, especially in the current era of much more complicated pipelines with much fewer hands, Memphis.dev has created a comprehensive suite of out-of-the-box tools and features tailored specifically to enhance productivity, streamline workflows, and facilitate collaboration. By prioritizing the developer experience, Memphis.dev aims to empower developers to focus on what they do best: writing exceptional code, and extracting value out of their data!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. No code changes. Open-source to Cloud.&lt;/strong&gt;&lt;br&gt;
A fully aligned development experience between the open-source version and the cloud. No code changes or application config modifications are needed.&lt;br&gt;
The cloud introduces one additional, non-mandatory parameter: an account ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Enhanced Security and Compliance&lt;/strong&gt;:&lt;br&gt;
Memphis.dev prioritizes the security and compliance of its cloud service, recognizing the critical importance of protecting sensitive data. With robust security measures, including data encryption, role-based identity and access management, integration with 3rd party identity managers, and regular security audits, Memphis.dev ensures that developers’ applications and data are safeguarded. By adhering to industry-standard compliance frameworks, Memphis.dev helps developers meet regulatory requirements and build applications with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Support and Success&lt;/strong&gt;&lt;br&gt;
The core support and customer service ethos of Memphis.dev is customer success and enablement. A successful customer is a happy customer, and we work hard to support our customers not just with Memphis, but with their bigger picture and data journey. We have three global customer support teams spread across three time zones, alongside highly experienced data engineers and data architects who serve as Customer Success Engineers and are willing to dive into our customers’ internals to help them achieve their goals.&lt;/p&gt;




&lt;p&gt;“Cluster setup, fault tolerance, high availability, data replication, performance tuning, multitenancy, security, monitoring, and troubleshooting all are headaches everyone who has deployed traditional message broker platforms is familiar with,” said Torsten Volk, senior analyst, EMA. “Memphis however is incredibly simple so that I had my first Python app sending and receiving messages in less than 15 minutes.”    &lt;/p&gt;

&lt;p&gt;“The world is asynchronous and built out of events. Message brokers are the engine behind their flow in the modern software architecture, and when we looked at the bigger picture and the role message brokers play, we immediately understood that the modern message broker should be much more intelligent and by far with much less friction,” said Yaniv Ben Hemo, co-founder and CEO, Memphis. “With that idea, we built Memphis.dev which takes five minutes on average for a user to get to production and start building queue-based applications and distributed streaming pipelines.”&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev By Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Part 4: Validating CDC Messages with Schemaverse</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 22 Jun 2023 10:58:53 +0000</pubDate>
      <link>https://dev.to/memphis_dev/part-4-validating-cdc-messages-with-schemaverse-2cfk</link>
      <guid>https://dev.to/memphis_dev/part-4-validating-cdc-messages-with-schemaverse-2cfk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part four of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous two blog posts (&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2&lt;/a&gt; and &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;), we described how to implement a change data capture (CDC) pipeline for &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; using &lt;a href="https://debezium.io/documentation/reference/stable/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; and &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Schema on Write, Schema on Read
&lt;/h2&gt;

&lt;p&gt;With relational databases, schemas are defined before any data are ingested.  Only data that conforms to the schema can be inserted into the database.  This is known as “schema on write.”  This pattern ensures data integrity but can limit flexibility and the ability to evolve a system.  &lt;/p&gt;

&lt;p&gt;Predefined schemas are optional in NoSQL databases like MongoDB.  MongoDB models collections of objects.  In the most extreme case, collections can contain completely different types of objects such as cats, tanks, and books.  More commonly, fields may only be present on a subset of objects or the value types may vary from one object to another.  This flexibility makes it easier to evolve schemas over time and efficiently support objects with many optional fields.&lt;/p&gt;

&lt;p&gt;Schema flexibility puts more onus on applications that read the data.  Clients need to check for any desired field and confirm their data types.  This pattern is called "schema on read."&lt;/p&gt;




&lt;h2&gt;
  
  
  Malformed Records Cause Crashes
&lt;/h2&gt;

&lt;p&gt;In one of my positions earlier in my career, I worked on a team that developed and maintained data pipelines for an online ad recommendation system.  One of the most common sources of downtime was malformed records.  Pipeline code can fail if a field is missing, an unexpected value is encountered, or when trying to parse badly-formatted data.  If the pipeline isn't developed with errors in mind (e.g., using &lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;defensive programming techniques&lt;/a&gt;, explicitly-defined data models, and validating data), the entire pipeline may crash and require manual intervention by an operator.&lt;/p&gt;

&lt;p&gt;Unfortunately, malformed data, especially when handling large volumes of data, is a frequent occurrence.  Simply hoping for the best won't lead to resilient pipelines.  As the saying goes, "Hope for the best. Plan for the worst."&lt;/p&gt;
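As a hedged illustration (a hypothetical sketch, not code from the original pipeline), a defensive processing step can divert malformed records instead of crashing on them:

```python
import json

# Hypothetical defensive pipeline step: records that fail to parse or are
# missing fields are diverted for later inspection rather than crashing
# the whole pipeline.
def process_batch(raw_records):
    processed, malformed = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            processed.append(record["description"].upper())
        except (json.JSONDecodeError, KeyError, AttributeError):
            malformed.append(raw)  # e.g., route to a dead-letter store
    return processed, malformed

processed, malformed = process_batch(
    ['{"description": "buy milk"}', "not json", '{"completed": true}'])
```

The catch-and-divert pattern is a hand-rolled version of what a dead-letter queue gives you for free, which is where Schemaverse comes in.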




&lt;h2&gt;
  
  
  The Best of Both Worlds: Data Validation with Schemaverse
&lt;/h2&gt;

&lt;p&gt;Fortunately, Memphis.dev has an awesome feature called Schemaverse.  Schemaverse provides a mechanism to check messages for compliance with a specified schema and to handle non-conforming messages.&lt;/p&gt;

&lt;p&gt;To use Schemaverse, the operator first needs to define a schema.  Message schemas can be defined using JSON Schema, Google Protocol Buffers, or GraphQL.  The operator chooses the schema definition language appropriate to the format of the message payloads.&lt;/p&gt;

&lt;p&gt;Once a schema is defined, the operator can "attach" the schema to a station.  The schema will be downloaded by clients using the Memphis.dev client SDKs.  The client SDK will validate each message before sending it to the Memphis broker.  If a message fails validation, the client will redirect the message to the dead-letter queue, trigger a notification, and raise an exception to notify the client application.&lt;/p&gt;

&lt;p&gt;In this example, we'll look at using Schemaverse to validate change data capture (CDC) events from MongoDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Review of the Solution
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;previous post&lt;/a&gt;, we described a change data capture (CDC) pipeline for a collection of todo items stored in MongoDB.  Our solution consists of eight components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds.  Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt; to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer Service&lt;/strong&gt;: A &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service that consumes messages from the todo-cdc-events station, deserializes the MongoDB records, and pushes them to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaned Printing Consumer&lt;/strong&gt;: A second instance of the printing consumer that prints messages pushed to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i4llCJng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5xr8j1rklgc20bi14mb0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i4llCJng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5xr8j1rklgc20bi14mb0.jpg" alt="dataflow diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;br&gt;
In this iteration, we aren't adding or removing any of the components.  Rather, we're just going to change Memphis.dev's configuration to perform schema validation on messages sent to the "cleaned-todo-cdc-events" station.&lt;/p&gt;


&lt;h2&gt;
  
  
  Schema for Todo Change Data Capture (CDC) Events
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;, we transformed the messages to hydrate a serialized JSON subdocument to produce fully deserialized JSON messages.  The resulting message looked like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"schema" : ...,

"payload" : {
"before" : null,

"after" : {
"_id": { "$oid": "645fe9eaf4790c34c8fcc2ed" },
"creation_timestamp": { "$date": 1684007402978 },
"due_date": { "$date" : 1684266602978 },
"description": "buy milk",
"completed": false
},

...
}
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each JSON-encoded message has two top-level fields, "schema" and "payload."  We are concerned with the "payload" field, which has two required subfields, "before" and "after."  The before field contains a copy of the record before it was modified (or null if it didn't exist), while the after field contains a copy of the record after it was modified (or null if the record was deleted).&lt;/p&gt;
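The before/after convention also tells us which operation occurred. As a small illustrative helper (hypothetical, not part of the actual pipeline):

```python
# Hypothetical helper: classify a CDC event from its before/after fields,
# following the semantics described above (null means absent).
def classify_cdc_event(payload):
    before, after = payload["before"], payload["after"]
    if before is None and after is not None:
        return "insert"
    if before is not None and after is None:
        return "delete"
    if before is not None and after is not None:
        return "update"
    raise ValueError("'before' and 'after' cannot both be null")

print(classify_cdc_event({"before": None, "after": {"description": "buy milk"}}))
# → insert
```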

&lt;p&gt;From this example, we can define criteria that messages must satisfy to be considered valid.  Let's write the criteria out as a set of rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The payload/before field may contain a todo object or null.&lt;/li&gt;
&lt;li&gt;The payload/after field may contain a todo object or null.&lt;/li&gt;
&lt;li&gt;A todo object must have five fields ("_id", "creation_timestamp", "due_date", "description", and "completed").&lt;/li&gt;
&lt;li&gt;The creation_timestamp must be an object with a single field ("$date").  The "$date" field must have a positive integer value (Unix timestamp).&lt;/li&gt;
&lt;li&gt;The due_date must be an object with a single field ("$date").  The "$date" field must have a positive integer value (Unix timestamp).&lt;/li&gt;
&lt;li&gt;The description field should have a string value.  Nulls are not allowed.&lt;/li&gt;
&lt;li&gt;The completed field should have a boolean value.  Nulls are not allowed.&lt;/li&gt;
&lt;/ol&gt;
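Before expressing these rules in JSON Schema, it can help to see them spelled out directly in Python. This is purely an illustrative stdlib sketch (a hypothetical helper; the actual enforcement is done by Schemaverse against a declared schema):

```python
# Illustrative stdlib validator for the seven rules above
# (hypothetical helper; Schemaverse performs the real validation).
def is_mongo_date(value):
    # An object with a single "$date" field holding a positive integer
    return (isinstance(value, dict) and set(value) == {"$date"}
            and isinstance(value["$date"], int)
            and not isinstance(value["$date"], bool)
            and value["$date"] > 0)

def is_valid_todo(obj):
    if obj is None:  # rules 1-2: null is allowed
        return True
    required = {"_id", "creation_timestamp", "due_date",
                "description", "completed"}
    if not isinstance(obj, dict) or set(obj) != required:  # rule 3
        return False
    return (is_mongo_date(obj["creation_timestamp"])       # rule 4
            and is_mongo_date(obj["due_date"])             # rule 5
            and isinstance(obj["description"], str)        # rule 6
            and isinstance(obj["completed"], bool))        # rule 7

def is_valid_message(msg):
    payload = msg.get("payload")
    return (isinstance(payload, dict)
            and "before" in payload and "after" in payload
            and is_valid_todo(payload["before"])
            and is_valid_todo(payload["after"]))
```

Writing the checks by hand makes clear how much boilerplate a declarative schema language saves us.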

&lt;p&gt;For this project, we'll define the schema using &lt;a href="https://json-schema.org/"&gt;JSON Schema&lt;/a&gt;.  JSON Schema is a very powerful data modeling language.  It supports defining required fields, field types (e.g., integers, strings), nullability, field formats (e.g., dates and times, email addresses), and field constraints (e.g., minimum or maximum values).  Objects can be defined and referenced by name, allowing recursive schemas and reusable definitions.  Schemas can be further combined using and, or, any, and not operators.  As one might expect, this expressiveness comes at a cost: the JSON Schema definition language is complex, and covering it fully is beyond the scope of this tutorial.&lt;/p&gt;




&lt;h2&gt;
  
  
  Creating a Schema and Attaching it to a Station
&lt;/h2&gt;

&lt;p&gt;Let's walk through the process of creating a schema and attaching it to a station.  You'll first need to complete the first 10 steps from &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2&lt;/a&gt; and &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 11: Navigate to the Schemaverse Tab&lt;/strong&gt;&lt;br&gt;
Navigate to the Memphis UI in your browser.  For example, you might find it at &lt;a href="https://localhost:9000/"&gt;https://localhost:9000/&lt;/a&gt;.  Once you are signed in, navigate to the Schemaverse tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2nruqPkO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3mrrv389yavvu6salpbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2nruqPkO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3mrrv389yavvu6salpbv.png" alt="Image description" width="512" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 12: Create the Schema&lt;/strong&gt;&lt;br&gt;
Click the "Create from blank" button to create a new schema.  Set the schema name to "todo-cdc-schema" and the schema type to "JSON schema."  Paste the following JSON Schema document into the textbox on the right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/product.schema.json",
    "type" : "object",
    "properties" : {
        "payload" : {
            "type" : "object",
            "properties" : {
                "before" : {
                    "oneOf" : [{ "type" : "null" }, { "$ref" : "#/$defs/todoItem" }]
                },
                "after" : {
                    "oneOf" : [{ "type" : "null" }, { "$ref" : "#/$defs/todoItem" }]
                }
            },
            "required" : ["before", "after"]
        }
    },
    "required" : ["payload"],
   "$defs" : {
      "todoItem" : {
          "title": "TodoItem",
          "description": "An item in a todo checklist",
          "type" : "object",
          "properties" : {
              "_id" : {
                  "type" : "object",
                  "properties" : {
                      "$oid" : {
                          "type" : "string"
                      }
                  }
              },
              "description" : {
                  "type" : "string"
              },
              "creation_timestamp" : {
                  "type" : "object",
                  "properties" : {
                      "$date" : {
                          "type" : "integer"
                      }
                  }
              },
              "due_date" : {
                    "anyOf" : [
                        {
                            "type" : "object",
                            "properties" : {
                                "$date" : {
                                    "type" : "integer"
                                }
                            }
                        },
                        {
                            "type" : "null"
                        }
                    ]
              },
              "completed" : {
                  "type" : "boolean"
              }
          },
          "required" : ["_id", "description", "creation_timestamp", "completed"]
      }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When done, your window should look like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VU3PPqk2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0b81bwyzcme5wviv2rkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VU3PPqk2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0b81bwyzcme5wviv2rkb.png" alt="schemaverse" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click the "Create schema" button.  Once the schema has been created, you'll be returned to the Schemaverse tab.  You should see an entry for the newly created schema like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4ATDAiU3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8cdem09g467p976jr19r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4ATDAiU3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8cdem09g467p976jr19r.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 13: Attach the Schema to the Station&lt;/strong&gt;&lt;br&gt;
Once the schema is created, we want to attach it to the "cleaned-todo-cdc-events" station.  Double-click on the "todo-cdc-schema" entry to bring up its details window like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7y_Pg5Ku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhihxzs4hbolg874szvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7y_Pg5Ku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhihxzs4hbolg874szvn.png" alt="todo cdc schema" width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click on the "+ Attach to Station" button.  This will bring up the following window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--awNMOoLn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fitl5kqwgrql1jv3tehl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--awNMOoLn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fitl5kqwgrql1jv3tehl.png" alt="enforce schema" width="800" height="1268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the "cleaned-todo-cdc-events" station, and click "Attach Selected."  The producers attached to the station will automatically download the schema and begin validating outgoing messages within a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 14: Confirm that Messages are Being Filtered&lt;/strong&gt;&lt;br&gt;
Navigate to the station overview page for the "cleaned-todo-cdc-events" station.  After a couple of minutes, you should see a red warning notification icon next to the "Dead-letter" tab name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KNbaOmRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3yttzmt80x63flcxdu0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KNbaOmRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3yttzmt80x63flcxdu0y.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you click on the "Dead-letter" tab and then the "Schema violation" subtab, you'll see the messages that failed the schema validation.  These messages have been re-routed to the dead letter queue so that they don't cause bugs in the downstream pipelines.  The window will look like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Eo98aEtk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cs57i1x337o7u4h53owv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Eo98aEtk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cs57i1x337o7u4h53owv.png" alt="Image description" width="800" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations!  You're now using Schemaverse to validate messages.  This is one small but incredibly impactful step towards making your pipeline more reliable.&lt;/p&gt;




&lt;p&gt;In case you missed parts 1, 2, and 3:&lt;br&gt;
&lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Part 3: Transforming MongoDB CDC Event Messages&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, developer advocate at &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>schemaverse</category>
      <category>data</category>
    </item>
    <item>
      <title>Part 3: Transforming MongoDB CDC Event Messages</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Tue, 06 Jun 2023 10:26:31 +0000</pubDate>
      <link>https://dev.to/memphis_dev/part-3-transforming-mongodb-cdc-event-messages-a8p</link>
      <guid>https://dev.to/memphis_dev/part-3-transforming-mongodb-cdc-event-messages-a8p</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part three of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;last blog post&lt;/a&gt;, we introduced a reference implementation for capturing change data capture (CDC) events from a &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; database using &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; and &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;.  At the end of the post we noted that MongoDB records are serialized as strings in Debezium CDC messages like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "schema" : ...,

"payload" : {
"before" : null,

"after" : "{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ed\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402978},\\"due_date\\": {\\"$date\\": 1684266602978},\\"description\\": \\"buy milk\\",\\"completed\\": false}",

...
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to use the &lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management"&gt;Schemaverse&lt;/a&gt; functionality of Memphis.dev to check messages against an expected schema.  Messages that don’t match the schema are routed to a dead letter station so that they don’t impact downstream consumers.  If this all sounds like ancient Greek, don’t worry!  We’ll explain the details in our next blog post.&lt;/p&gt;

&lt;p&gt;To use functionality like Schemaverse, we need to deserialize the MongoDB records as JSON documents.  In this blog post, we describe a modification to our MongoDB CDC pipeline that adds a transformer service to deserialize the MongoDB records to JSON documents.&lt;/p&gt;
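The essence of that deserialization is a single json.loads call on the serialized subdocument. A minimal standalone illustration (using a shortened version of the example payload above):

```python
import json

# The "after" field arrives as a JSON *string*; one json.loads call
# hydrates it into a real object that schema tooling can inspect.
serialized_after = ('{"_id": {"$oid": "645fe9eaf4790c34c8fcc2ed"}, '
                    '"description": "buy milk", "completed": false}')
after = json.loads(serialized_after)
print(after["description"], after["completed"])
# → buy milk False
```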




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;previous solution&lt;/a&gt; consisted of six components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds.  Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AghDMpS0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gzsz7y5tf54tqpckoqiq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AghDMpS0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gzsz7y5tf54tqpckoqiq.jpg" alt="mongocdcd example" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this iteration, we are adding two additional components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer Service&lt;/strong&gt;: A &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service that consumes messages from the todo-cdc-events station, deserializes the MongoDB records, and pushes them to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaned Printing Consumer&lt;/strong&gt;: A second instance of the printing consumer that prints messages pushed to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our updated architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xLcRiKeb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvrythdmblgiti7ob40m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xLcRiKeb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvrythdmblgiti7ob40m.jpg" alt="data flow diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Deep Dive Into the Transformer Service
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skeleton of the Message Transformer Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service uses the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt;.  Let’s walk through the transformer implementation.  The main() method of our transformer first connects to the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt;.  The connection details (host, username, password, input station name, and output station name) are passed as environment variables, in accordance with suggestions from the &lt;a href="https://12factor.net/config"&gt;Twelve-Factor App manifesto&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def main():
    try:
        print("Waiting on messages...")
        memphis = Memphis()
        await memphis.connect(host=os.environ[HOST_KEY],
                              username=os.environ[USERNAME_KEY],
                              password=os.environ[PASSWORD_KEY])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once a connection is established, we create consumer and producer objects.  In Memphis.dev, consumers and producers have names.  These names appear in the Memphis.dev UI, offering transparency into the system operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Creating consumer")
        consumer = await memphis.consumer(station_name=os.environ[INPUT_STATION_KEY],
                                          consumer_name="transformer",
                                          consumer_group="")

        print("Creating producer")
        producer = await memphis.producer(station_name=os.environ[OUTPUT_STATION_KEY],
                                          producer_name="transformer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consumer API uses the &lt;a href="https://en.wikipedia.org/wiki/Callback_(computer_programming)"&gt;callback function&lt;/a&gt; design pattern. When messages are pulled from the broker, the provided function is called with a list of messages as its argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  print("Creating handler")
        msg_handler = create_handler(producer)

        print("Setting handler")
        consumer.consume(msg_handler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After setting up the callback, we kick off the asyncio event loop.  At this point, the transformer service pauses and waits until messages are available to pull from the broker.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Keep your main thread alive so the consumer will keep receiving data
        await asyncio.Event().wait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Creating the Message Handler Function
&lt;/h2&gt;

&lt;p&gt;The create function for the message handler takes a producer object and returns a callback function. Since the callback function only takes a single argument, we use the &lt;a href="https://en.wikipedia.org/wiki/Closure_(computer_programming)"&gt;closure pattern&lt;/a&gt; to implicitly pass the producer to the msg_handler function when we create it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;msg_handler&lt;/code&gt; function is passed three arguments when called: a list of messages, an error (if one occurred), and a context consisting of a dictionary.  Our handler loops over the messages, calls the transform function on each, sends the messages to the second station using the producer, and acknowledges that the message has been processed.  In Memphis.dev, messages are not marked off as delivered until the consumer acknowledges them.  This prevents messages from being dropped if an error occurs during processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_handler(producer):
    async def msg_handler(msgs, error, context):
        try:
            for msg in msgs:
                transformed_msg = deserialize_mongodb_cdc_event(msg.get_data())
                await producer.produce(message=transformed_msg)
                await msg.ack()
        except (MemphisError, MemphisConnectError, MemphisHeaderError) as e:
            print(e)
            return

    return msg_handler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Message Transformer Function
&lt;/h2&gt;

&lt;p&gt;Now, we get to the meat of the service: the message transformer function.  Message payloads (returned by the get_data() method) are stored as &lt;a href="https://docs.python.org/3/library/stdtypes.html#bytearray"&gt;bytearray&lt;/a&gt; objects.  We use the Python json library to deserialize the messages into a hierarchy of Python collections (list and dict) and primitive types (int, float, str, and None).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deserialize_mongodb_cdc_event(input_msg):
    obj = json.loads(input_msg)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We expect the object to have a payload property with an object as the value.  That object then has two properties (“before” and “after”) which are either None or strings containing serialized JSON objects.  We use the JSON library again to deserialize and replace the strings with the objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; if "payload" in obj:
        payload = obj["payload"]

        if "before" in payload:
            before_payload = payload["before"]
            if before_payload is not None:
                payload["before"] = json.loads(before_payload)

        if "after" in payload:
            after_payload = payload["after"]
            if after_payload is not None:
                payload["after"] = json.loads(after_payload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, we reserialize the entire JSON record and convert it back into a bytearray for transmission to the broker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  output_s = json.dumps(obj)
    output_msg = bytearray(output_s, "utf-8")
    return output_msg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hooray! Our objects now look like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"schema" : ...,

"payload" : {
"before" : null,

"after" : {
"_id": { "$oid": "645fe9eaf4790c34c8fcc2ed" },
"creation_timestamp": { "$date": 1684007402978 },
"due_date": { "$date" : 1684266602978 },
"description": "buy milk",
"completed": false
},

...
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running the Transformer Service
&lt;/h2&gt;

&lt;p&gt;If you followed the 7 steps in the &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;previous blog post&lt;/a&gt;, you only need three additional steps to start the transformer service and verify that it's working:&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Start the Transformer Service
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d cdc-transformer
[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container cdc-transformer                                  Started                                                            1.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 9: Start the Second Printing Consumer
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d cleaned-printing-consumer
[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container cleaned-printing-consumer                        Started                                                            1.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 10: Check the Memphis UI
&lt;/h2&gt;

&lt;p&gt;When the transformer starts producing messages to Memphis.dev, a second station named "cleaned-todo-cdc-events" will be created. You should see the new station on the Station Overview page of the Memphis.dev UI like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jrjoaM1S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4037i8dcy49e2y3nsljo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jrjoaM1S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4037i8dcy49e2y3nsljo.png" alt="Check memphis ui" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The details page for the "cleaned-todo-cdc-events" station should show the transformer attached as a producer, the printing consumer, and the transformed messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2zZSSU0---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3xdims35q2t5gdeojo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2zZSSU0---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3xdims35q2t5gdeojo4.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! We’re now ready to tackle validating messages with Schemaverse in our next blog post. Subscribe to our newsletter to stay tuned!&lt;/p&gt;

&lt;p&gt;Head over to &lt;a href="https://memphis.dev/blog/part-4-validating-cdc-messages-with-schemaverse/"&gt;Part 4: Validating CDC Messages with Schemaverse&lt;/a&gt; to learn more.&lt;/p&gt;




&lt;p&gt;In case you missed parts 1 &amp;amp; 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Memphis.dev&lt;/a&gt; by RJ Nowling, Developer Advocate at Memphis.dev&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>cdc</category>
      <category>memphisdev</category>
      <category>dataprocessing</category>
    </item>
  </channel>
</rss>
