<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Taron Foxworth</title>
    <description>The latest articles on DEV Community by Taron Foxworth (@anaptfox).</description>
    <link>https://dev.to/anaptfox</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F604%2Fcd30e07d-def3-40fd-90ec-18b8a28e0cbb.jpg</url>
      <title>DEV Community: Taron Foxworth</title>
      <link>https://dev.to/anaptfox</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anaptfox"/>
    <language>en</language>
    <item>
      <title>Using Candy to Test and Develop with Local Custom Domains</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Mon, 14 Feb 2022 20:36:29 +0000</pubDate>
      <link>https://dev.to/anaptfox/using-candy-to-test-and-develop-with-local-custom-domains-2edb</link>
      <guid>https://dev.to/anaptfox/using-candy-to-test-and-develop-with-local-custom-domains-2edb</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A5eCAB_B6193vfRZq" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A5eCAB_B6193vfRZq"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When developing locally, you may find yourself needing to test with an actual domain and not localhost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/owenthereal/candy" rel="noopener noreferrer"&gt;Candy&lt;/a&gt; is a reverse proxy server written in Golang. It contains an HTTP, HTTPS, and DNS server 🚝. It’s simple and powerful.&lt;/p&gt;

&lt;p&gt;Using Candy, we can start our local application on a port and then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# addd candy configuration

echo "8080" &amp;gt; ~/.candy/myapp 

# curl local domain

curl http://myapp.test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can use myapp.test locally to access your application.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk you through getting a simple environment spun up.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a reverse proxy server?
&lt;/h3&gt;

&lt;p&gt;A reverse proxy server is a server that forwards requests from clients to other servers. It’s often used to hide the origin servers behind it, balance load across them, or terminate TLS.&lt;/p&gt;

&lt;p&gt;But we can use that same forwarding capability locally to get custom domains during development.&lt;/p&gt;
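To make the forwarding concrete, here is a minimal reverse proxy sketch in Go (the language Candy is written in). This is an illustration, not Candy's actual code, and the ports are arbitrary:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a handler that forwards every incoming request to target,
// which is the core of what any reverse proxy does.
func newProxy(target string) (*httputil.ReverseProxy, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(u), nil
}

func main() {
	// Forward requests arriving on :8000 to a local app on :8080.
	proxy, err := newProxy("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.ListenAndServe(":8000", proxy))
}
```

Candy pairs this idea with a DNS server, so `.test` hostnames resolve to your machine automatically.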

&lt;h3&gt;
  
  
  Getting started with Candy
&lt;/h3&gt;

&lt;p&gt;On macOS, you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install owenthereal/candy/candy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;a href="https://github.com/owenthereal/candy#installation" rel="noopener noreferrer"&gt;Installation Instructions&lt;/a&gt; for other operating systems.&lt;/p&gt;

&lt;p&gt;Then, we need to run candy’s setup command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo candy setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because you should always know what a command is doing to your machine, here’s what it does behind the scenes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mkdir -p /etc/resolver
cat&amp;lt;&amp;lt;EOF | sudo tee /etc/resolver/candy-test &amp;gt; /dev/null
domain test
nameserver 127.0.0.1
port 25353
search_order 1
timeout 5
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Starting Candy
&lt;/h3&gt;

&lt;p&gt;First, start candy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew services start candy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, start your local app.&lt;/p&gt;

&lt;p&gt;Third, register your app’s port with Candy as myapp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "8080" &amp;gt; ~/.candy/myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fourth, restart Candy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew services restart candy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, curl 😎:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://myapp.test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more configuration options, please see &lt;a href="https://github.com/owenthereal/candy#portip-proxying" rel="noopener noreferrer"&gt;candy’s README&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now you have a local domain to develop against.&lt;/p&gt;

&lt;p&gt;☕ Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://blog.taronfoxworth.com/using-candy-to-test-and-develop-with-local-custom-domains" rel="noopener noreferrer"&gt;&lt;em&gt;https://blog.taronfoxworth.com&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>testing</category>
      <category>softwaredevelopment</category>
      <category>software</category>
    </item>
    <item>
      <title>Conversations with GPT-3: Am I good enough?</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Sun, 06 Feb 2022 21:02:13 +0000</pubDate>
      <link>https://dev.to/anaptfox/conversations-with-gpt-3-am-i-good-enough-g99</link>
      <guid>https://dev.to/anaptfox/conversations-with-gpt-3-am-i-good-enough-g99</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gz8odS-w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ABSvp2KngSz_dTDHn" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gz8odS-w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ABSvp2KngSz_dTDHn" alt="" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was my favorite response from the conversation:&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Why does being alive make me significant?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Because you are a part of the universe, and the universe is a part of you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;In this conversation, I chatted with GPT-3 using a prompt that invited it to be a complex, all-knowing “wise being”. The work of &lt;a href="https://kirkouimet.medium.com/the-universe-function-92012c0c67c5"&gt;Kirk Ouimet&lt;/a&gt; heavily inspired this.&lt;/p&gt;

&lt;p&gt;If you want to see how it works, or want to learn more about GPT-3, check out &lt;a href="https://openai.com/api/"&gt;OpenAI&lt;/a&gt; or ask me on &lt;a href="https://twitter.com/anaptfox"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I thought I’d try it with some other topics on my mind and in my heart.&lt;/p&gt;

&lt;p&gt;Enjoy ☕&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversation with AI
&lt;/h3&gt;

&lt;p&gt;All of the regular text is from the artificial intelligence; the italics are me.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Hello. Grateful for the opportunity to speak with you.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Hello Fox, thank you for asking me to speak. I know that we are going to have a fun conversation.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Am I good enough?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — You are perfect for this conversation.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Let me ask the same question differently. Am I good enough to be significant in the universe?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — You are significant in the universe.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Why?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Because you are alive.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Why does being alive make me significant?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Because you are a part of the universe, and the universe is a part of you.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;I think I need a definition. When you say “universe is a part of you,” what do you mean by the universe?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — The universe is everything that exists.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Well then, what’s the difference between being significant and being good enough?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Being significant means that you have an impact on the universe. Being good enough means that you are meeting the standards you have set for yourself.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;What if my standards are too high?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Then you need to lower your standards.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;How do I uncover and identify the standards I have set for myself?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — By examining your values.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;What do you mean by values?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Values are the things that you believe are important in life.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Why do values matter?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — Because they guide your decisions.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;How do I know what my values are?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — By examining your actions.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;When examining my actions, what are the top 5 questions I should ask myself?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🐁 — You should ask yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are my motivations?&lt;/li&gt;
&lt;li&gt;What am I trying to achieve?&lt;/li&gt;
&lt;li&gt;What are the consequences of my actions?&lt;/li&gt;
&lt;li&gt;What is the impact of my actions on others?&lt;/li&gt;
&lt;li&gt;How does this action align with my values?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🦊 — &lt;em&gt;That leaves me enough to work with. Thank you 🙇🏽‍♂️.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://blog.taronfoxworth.com/conversations-with-gpt-3-am-i-good-enough"&gt;&lt;em&gt;https://blog.taronfoxworth.com&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>conversationalai</category>
      <category>gpt3</category>
    </item>
    <item>
      <title>Conversations with GPT-3: What Is a Computer?</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Sat, 05 Feb 2022 02:05:04 +0000</pubDate>
      <link>https://dev.to/anaptfox/conversations-with-gpt-3-what-is-a-computer-461p</link>
      <guid>https://dev.to/anaptfox/conversations-with-gpt-3-what-is-a-computer-461p</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--73FcuEx7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AteNSSIyqKIslR7We" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--73FcuEx7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AteNSSIyqKIslR7We" alt="Conversation" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was my favorite response from the conversation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-3:&lt;/strong&gt; I can give you some examples of tasks that computers in the year 3009 will perform. Some examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flying cars&lt;/li&gt;
&lt;li&gt;cleaning your house&lt;/li&gt;
&lt;li&gt;doing your taxes&lt;/li&gt;
&lt;li&gt;diagnosing diseases&lt;/li&gt;
&lt;li&gt;driving your car&lt;/li&gt;
&lt;li&gt;writing your novel&lt;/li&gt;
&lt;li&gt;cooking your dinner&lt;/li&gt;
&lt;li&gt;taking care of your children&lt;/li&gt;
&lt;li&gt;doing your laundry&lt;/li&gt;
&lt;li&gt;monitoring your health&lt;/li&gt;
&lt;li&gt;shopping for groceries&lt;/li&gt;
&lt;li&gt;and keeping your home warm or cool.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;In this conversation, I (🦊) took the role of a researcher and talked to &lt;a href="https://openai.com/"&gt;GPT-3&lt;/a&gt; using a prompt that invited it to be a complex, all-knowing “Wise Being” artificial intelligence.&lt;/p&gt;

&lt;p&gt;This was heavily inspired by the work of &lt;a href="https://kirkouimet.medium.com/me-and-gpt-3-discuss-the-legend-of-zelda-breath-of-the-wild-1b7c119e4ca0"&gt;Kirk Ouimet&lt;/a&gt;. I thought I’d try it with some other topics I’d like to dig deeper on.&lt;/p&gt;

&lt;p&gt;Enjoy ☕&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversation with AI
&lt;/h3&gt;

&lt;p&gt;All of the regular text is from the artificial intelligence; the italics are me.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Hello, I’m super grateful for the opportunity to speak with you.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Hello Fox, thank you for asking me to speak. I know that we are going to have a fun conversation.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Let’s begin, tell me, how does a computer work?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Computers work by taking input from a user, processing that input, and providing output.&lt;/p&gt;

&lt;p&gt;The input can be anything from text to images to videos.&lt;/p&gt;

&lt;p&gt;The processing is done by software that is installed on the computer.&lt;/p&gt;

&lt;p&gt;The output can be displayed on the computer screen.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;But wait, why is the user even giving the computer input in the first place?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The user gives the computer input because they want it to do something. For example, they might want to type a document or browse the internet.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;So, the inputs are the user typing into the document and a web browser address bar. Correct?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: That is correct.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;What would be the output?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The output in the case of the user typing a document would be a document displayed on the computer screen.&lt;/p&gt;

&lt;p&gt;The output in the case of the user typing a URL in a web browser would be the website displayed on the computer screen.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Okay, we’ve talked about inputs and outputs. What’s being processed?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The input is processed into output by the software installed on the computer. This software can be anything from a word processor to a web browser.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;I’m not sure I understand software. Can you tell me how a web browser processes its input and displays its output?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: A web browser takes input from a user in the form of a URL. It then sends that input to a web server, which processes it and sends back the website’s HTML code. The web browser then displays the website’s HTML code on the computer screen.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;That’s a lot of new words. First, where does the HTML code come from?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The HTML code comes from the webserver.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Could you explain what a web server is?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: A web server is a computer that stores websites and sends them to web browsers.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;So, what’s the difference between HTML, web server, website, and computer? They all seem related.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: HTML is the language that websites are written in.&lt;/p&gt;

&lt;p&gt;A web server is a computer that stores websites and sends them to web browsers.&lt;/p&gt;

&lt;p&gt;A website is a collection of HTML pages stored on a web server.&lt;/p&gt;

&lt;p&gt;A computer is a device that can display websites.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;That makes things more clear. Next, you mentioned that a web browser is a type of software?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: That is correct.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;I’d like to dig in more there. What is software?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Software is a computer program that performs a specific task.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;You mentioned that computers work by taking input from a user, processing that input, and then providing output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What’s the relationship between computers and software?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Computers and software are two different ways of describing the same thing. Computers are devices that run the software.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Well, how does the software get created?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The software gets created by programmers.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Programmers create software to perform tasks on computers?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: That is correct.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Is creating a document and browsing the internet examples of tasks?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Yes, creating a document and browsing the internet are examples of tasks that software can perform.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Are there more examples of tasks that can be performed by software?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Yes, there are many examples of tasks that software can perform. Some other examples include editing photos, playing games, and writing emails.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Are there any tasks that can’t be performed by software?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: No, there are no tasks that software can’t perform.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Can you elaborate?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Software can do anything that a human can do. For example, the software can type documents, browse the internet, edit photos, play games, and write emails.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;What are the limits of software?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The limits of software are the limits of what a human can do.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;What are the limits of what a human can do?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: The limits of what a human can do are the limits of their imagination.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;So, the limits of software and computers are bound to the limitations of human imagination?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: That is correct.&lt;/p&gt;

&lt;p&gt;🦊 — &lt;em&gt;Wow, that’s powerful. Last question, based on everything you know now, can you give me examples of tasks computers in the year 3009 will perform?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: Yes, I can give you some examples of tasks that computers in the year 3009 will perform. Some examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flying cars&lt;/li&gt;
&lt;li&gt;cleaning your house&lt;/li&gt;
&lt;li&gt;doing your taxes&lt;/li&gt;
&lt;li&gt;diagnosing diseases&lt;/li&gt;
&lt;li&gt;driving your car&lt;/li&gt;
&lt;li&gt;writing your novel&lt;/li&gt;
&lt;li&gt;cooking your dinner&lt;/li&gt;
&lt;li&gt;taking care of your children&lt;/li&gt;
&lt;li&gt;doing your laundry&lt;/li&gt;
&lt;li&gt;monitoring your health&lt;/li&gt;
&lt;li&gt;shopping for groceries&lt;/li&gt;
&lt;li&gt;and keeping your home warm or cool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🦊 — &lt;em&gt;Well, this makes me excited about the future. Thank you for this conversation. Have a beautiful day 🚀.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wise Being: You’re welcome, Fox. I hope you have a wonderful day too.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://blog.taronfoxworth.com/conversations-with-gpt-3-what-is-a-computer"&gt;&lt;em&gt;https://blog.taronfoxworth.com&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>philosophy</category>
      <category>ai</category>
      <category>gpt3</category>
    </item>
    <item>
      <title>Where is the modern data stack for software engineers?</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Sat, 05 Feb 2022 01:20:26 +0000</pubDate>
      <link>https://dev.to/anaptfox/where-is-the-modern-data-stack-for-software-engineers-284j</link>
      <guid>https://dev.to/anaptfox/where-is-the-modern-data-stack-for-software-engineers-284j</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZDowliAz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AqlnVextw_WZRvCc6Dkq7VQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZDowliAz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AqlnVextw_WZRvCc6Dkq7VQ.jpeg" alt="" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.getdbt.com/future-of-the-modern-data-stack/"&gt;The Future of the Modern Data Stack&lt;/a&gt; looks excellent for data engineers. However, as a software engineer, I kind of feel left out. Where is the modern data stack for software engineers?&lt;/p&gt;

&lt;p&gt;Marketing teams and data engineers need data to answer questions; software engineers need data to build features. This difference is why you’ll find that tools like &lt;a href="http://segment.com"&gt;Segment&lt;/a&gt; don’t have connections for tools like Elasticsearch (a search engine) or Redis (a cache).&lt;/p&gt;

&lt;p&gt;A business may use the modern data stack to ask better questions about what’s happening in its business, applications, and so on. A modern data stack is critical today if you want to succeed, and this world is filling fast with new SaaS data products and tools.&lt;/p&gt;

&lt;p&gt;Here, I’d like to present a slightly different data problem for a different audience: software engineers.&lt;/p&gt;

&lt;p&gt;Software engineers leverage data infrastructure in a very different way. The tools aren’t Google Analytics and &lt;a href="https://clearbit.com/"&gt;Clearbit&lt;/a&gt;, but &lt;a href="https://upstash.com/"&gt;Upstash&lt;/a&gt; and &lt;a href="https://supabase.com/"&gt;Supabase&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Engineers need to move data back and forth to build features and infrastructure that adds customer value.&lt;/p&gt;

&lt;p&gt;Where are my tools to help me use &lt;strong&gt;code&lt;/strong&gt; to move, process, or manipulate data across my application infrastructure? Today, I see a lot of one-off scripts, custom microservices, or tools that require me to scale a JVM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Integration Problem
&lt;/h3&gt;

&lt;p&gt;I want to tell you about a problem that every software engineer experiences: the data integration problem.&lt;/p&gt;

&lt;p&gt;Thanks to amazing tools like &lt;a href="https://www.heroku.com/"&gt;Heroku&lt;/a&gt;, &lt;a href="https://render.com/"&gt;Render&lt;/a&gt;, &lt;a href="http://planetscale.com"&gt;PlanetScale&lt;/a&gt;, &lt;a href="https://upstash.com/"&gt;Upstash&lt;/a&gt;, and &lt;a href="https://supabase.com/"&gt;Supabase&lt;/a&gt;, it’s getting easier than ever to acquire new data infrastructure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;data infrastructure — a new system that generates or stores data.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Keep this definition in mind; it’s crucial.&lt;/p&gt;

&lt;p&gt;In general, writing software is becoming more &lt;a href="http://www.datacentricmanifesto.org/"&gt;data-centric&lt;/a&gt; every day. Engineers commonly pull data from all sorts of places inside (and outside) our infrastructure to build applications that are &lt;a href="https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321"&gt;data-intensive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Data-intensive applications are complex and made up of many systems like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple microservices&lt;/li&gt;
&lt;li&gt;caches&lt;/li&gt;
&lt;li&gt;databases&lt;/li&gt;
&lt;li&gt;event brokers&lt;/li&gt;
&lt;li&gt;data warehouses&lt;/li&gt;
&lt;li&gt;search engines&lt;/li&gt;
&lt;li&gt;log aggregation systems&lt;/li&gt;
&lt;li&gt;CRMs&lt;/li&gt;
&lt;li&gt;analytics platforms&lt;/li&gt;
&lt;li&gt;… and third-party tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our software systems contain many specialized tools that accelerate development and growth. These additional tools and platforms solve real problems and help teams move fast. But, there is one catch.&lt;/p&gt;

&lt;p&gt;If you zoom out a bit, we are slowly acquiring more and more specialized data infrastructure. Distributed data infrastructure means that our systems generate and consume data from &lt;em&gt;more and more&lt;/em&gt; data stores.&lt;/p&gt;

&lt;p&gt;If not appropriately managed, the number of “data tasks” will continue to increase, which means we will spend less and less time building features and more and more time integrating data.&lt;/p&gt;

&lt;p&gt;I’m not sure this is what we want.&lt;/p&gt;

&lt;p&gt;I keep asking myself: Is spending tons of time moving data around a valuable activity for software engineers?&lt;/p&gt;

&lt;p&gt;Today, there are production tools that software engineers may use to solve this problem, like Apache Kafka and Airflow. But deploying and managing these systems isn’t the greatest experience, and it requires people on your team whose only job is to keep them running.&lt;/p&gt;

&lt;p&gt;I’d argue that “easy data movement for developers” is still a largely unsolved problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The data-centric developer mindset
&lt;/h3&gt;

&lt;p&gt;I’m not sure this is even a problem that will go away. We will continue to use specialized tools that accelerate development and growth. In most cases:&lt;/p&gt;

&lt;p&gt;Elasticsearch will always offer a better developer experience for searching than MySQL.&lt;/p&gt;

&lt;p&gt;Snowflake will always offer a better developer experience for data warehousing than PostgreSQL.&lt;/p&gt;

&lt;p&gt;There will be no magic data store 🪄. We will forever be in a data ecosystem that won’t consolidate much, because every data store makes design decisions that are good for one use case and possibly poor for others.&lt;/p&gt;

&lt;p&gt;With that being said, the &lt;a href="http://www.datacentricmanifesto.org/"&gt;data-centric&lt;/a&gt; mindset is becoming more common when building software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---8WVSdE3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ANxGu1WiC1WqTUPL8" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---8WVSdE3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ANxGu1WiC1WqTUPL8" alt="" width="880" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With data at the forefront of system design, engineers who used to ask themselves, “What database will I use for this application?” will now ask themselves, “How will this new application integrate with my data infrastructure?”&lt;/p&gt;

&lt;p&gt;The next generation of applications will be built with a data-first mindset.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the data integration problem?
&lt;/h3&gt;

&lt;p&gt;Now, we can look at this problem from a data-centric mindset. Data integration problems are tasks that take the following form:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data in system A needs to get to system B.&lt;/li&gt;
&lt;li&gt;Data changes in A need to be continuously replicated into B.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can map a vast landscape of problems to these. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log Aggregation&lt;/li&gt;
&lt;li&gt;Syncing data from PostgreSQL to Redis for caching.&lt;/li&gt;
&lt;li&gt;Listening to changes from a PostgreSQL table and writing them to a data warehouse.&lt;/li&gt;
&lt;li&gt;Watching a file for changes and writing the changes to a database.&lt;/li&gt;
&lt;li&gt;Consuming data from a Kafka topic and writing it somewhere else.&lt;/li&gt;
&lt;/ul&gt;
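Squint at that list and every item has the same shape: read records from system A, optionally transform them, and write them to system B. Here is a toy sketch of that loop in Go; the source and sink are in-memory stand-ins for real connectors (PostgreSQL, Redis, Kafka, a log file, and so on):

```go
package main

import "fmt"

// Record is a generic unit of data moving through the pipeline.
type Record map[string]any

// Source and Sink are stand-ins for real connectors.
type Source interface{ Read() (Record, bool) }
type Sink interface{ Write(Record) error }

// Copy drains a source into a sink, applying transform to each record.
// This loop is the essence of every data integration task listed above.
func Copy(src Source, dst Sink, transform func(Record) Record) (int, error) {
	n := 0
	for {
		rec, ok := src.Read()
		if !ok {
			return n, nil
		}
		if err := dst.Write(transform(rec)); err != nil {
			return n, err
		}
		n++
	}
}

// sliceSource and sliceSink are toy in-memory connectors for demonstration.
type sliceSource struct{ recs []Record }

func (s *sliceSource) Read() (Record, bool) {
	if len(s.recs) == 0 {
		return nil, false
	}
	r := s.recs[0]
	s.recs = s.recs[1:]
	return r, true
}

type sliceSink struct{ recs []Record }

func (s *sliceSink) Write(r Record) error { s.recs = append(s.recs, r); return nil }

func main() {
	src := &sliceSource{recs: []Record{{"id": 1}, {"id": 2}}}
	dst := &sliceSink{}
	n, _ := Copy(src, dst, func(r Record) Record { return r })
	fmt.Println("copied", n, "records")
}
```

The hard parts, of course, are the real connectors, schema changes, and continuous syncing discussed below, not the loop itself.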

&lt;p&gt;If you squint and tilt your head to the side, you’ll notice that all of these problems are moving data from one place to another. These problems aren’t specific to any one industry; they apply to software engineering as a whole.&lt;/p&gt;

&lt;p&gt;Some problems, such as the need for data warehousing, you’d only hit as you scale; others, like streaming data from a log, are ubiquitous among software engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We always code first, think later.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These problems all move data from one place to another, yet we typically reach for a different tool, or build a custom one, for each. Moving data from one place to another looks simple on the surface, mainly because it’s so convenient to write a small service that does just the data task you need.&lt;/p&gt;

&lt;p&gt;But, most will eventually find that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Datastores and schemas improve, change and update over time.&lt;/li&gt;
&lt;li&gt;Managing real-time syncing between data infrastructure is 🥲.&lt;/li&gt;
&lt;li&gt;Relying on external data infrastructure (SaaS tools, external APIs) is impossible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, some may then discover &lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying"&gt;The Log&lt;/a&gt; and adopt Kafka. Kafka is an &lt;em&gt;outstanding&lt;/em&gt; event-based streaming broker. But, it’s a massive addition to your infrastructure just to move data from one place to another.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Now?
&lt;/h3&gt;

&lt;p&gt;This is why we are working on a project called &lt;a href="https://github.com/ConduitIO/conduit"&gt;Conduit&lt;/a&gt; at Meroxa. We hope to change the experience software engineers have with data.&lt;/p&gt;

&lt;p&gt;At a high level, Conduit is a data streaming tool written in Go. It aims to provide the best developer experience for building and running real-time data pipelines.&lt;/p&gt;

&lt;p&gt;I’d love to know what you think, and I’d love to see more data tools for software engineers.&lt;/p&gt;

&lt;p&gt;Thank you for reading. Have a beautiful day ☀️&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://blog.taronfoxworth.com/where-is-the-modern-data-stack-for-software-engineers"&gt;&lt;em&gt;https://blog.taronfoxworth.com&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>softwareengineering</category>
      <category>data</category>
      <category>database</category>
    </item>
    <item>
      <title>Stream Your Database Changes with Change Data Capture: Part Two</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Wed, 01 Sep 2021 19:39:21 +0000</pubDate>
      <link>https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-part-two-21m3</link>
      <guid>https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-part-two-21m3</guid>
      <description>&lt;p&gt;This is part two of a series on Change Data Capture (CDC). In part one, &lt;a href="https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-52l7"&gt;we defined change data capture, explored how data is captured, and the pros and cons of each capturing method&lt;/a&gt;. In this article, let’s discuss the use cases of CDC and look at the tools that help you add CDC into your architecture.&lt;/p&gt;

&lt;p&gt;Change Data Capture helps enable &lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying" rel="noopener noreferrer"&gt;event-driven applications&lt;/a&gt;. It allows applications to listen for changes to a database, data warehouse, etc., and act upon those changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AJbVsh5uBanFyqEWYH8fxXw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AJbVsh5uBanFyqEWYH8fxXw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, here are the use cases and architectures that arise from acting on data changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract, Transform, Load (ETL):&lt;/strong&gt; Capturing every change of one datastore and applying these changes to another allows for replication (one-time sync) and mirroring (continuous syncing).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration and Automation:&lt;/strong&gt; The action taken on data change events can automate tasks, trigger workflows, or even execute cloud functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History:&lt;/strong&gt; When performing historical analysis on a dataset, having the current state of the data and all past changes gives you complete information for a higher fidelity analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting:&lt;/strong&gt; Most of the time, applications send an event to a user whenever the data they care about changes. CDC can be the trigger for real-time alerting systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s explore.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extract, Transform, Load
&lt;/h3&gt;

&lt;p&gt;To date, one of the most common use cases for CDC is Extract, Transform, Load (ETL). ETL is a process in which you capture data from one source (extract), process it in some way (transform), and send it to a destination (load).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2Am-ABnhybW0FaefsjpVWYbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2Am-ABnhybW0FaefsjpVWYbg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data replication (one-time sync) and mirroring (continuous replication) are great examples of ETL processes. ETL is an umbrella term that encompasses very different use cases such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingesting data from a database into a data warehouse to run analytic queries without impacting production.&lt;/li&gt;
&lt;li&gt;Keeping caches and search index systems up-to-date.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not only can CDC help solve these use cases, but it’s also the best way to solve these problems. For example, to mirror data to a data warehouse, you must capture and apply any &lt;em&gt;changes&lt;/em&gt; as they happen to the source database. As discussed with &lt;a href="https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-52l7"&gt;Streaming Replication Logs&lt;/a&gt; in part one of the series, CDC is used by databases to keep standby instances up-to-date for failover because it’s effective and scalable. When tapping into these events in a wider architecture, your data warehouse can be as up-to-date as a standby database instance used for disaster recovery.&lt;/p&gt;

&lt;p&gt;Keeping &lt;a href="https://en.wikipedia.org/wiki/Cache_(computing)" rel="noopener noreferrer"&gt;caches&lt;/a&gt; and search index systems up-to-date is also an ETL problem and a great CDC use case. Large applications today are composed of many different data stores. For example, an architecture might leverage Postgres, Redis, and Elasticsearch as its relational database, caching layer, and search engine. Each is designed for a specific data use case, but the same data needs to be mirrored in every store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AXt0ux3ZyEjSi65HzodLkNQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AXt0ux3ZyEjSi65HzodLkNQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You never want a user to search for a product and then find out it no longer exists. Stale caches and search indexes lead to horrible user experiences. CDC can be used to build data pipelines that keep these stores in sync with their upstream dependencies.&lt;/p&gt;

&lt;p&gt;In theory, a single application could write to Postgres, Redis, and Elasticsearch simultaneously, but “dual writes” can be tough to manage and can lead to out-of-sync systems. CDC offers a stronger, easier-to-maintain implementation. Instead of adding the logic to update indexes and caches to a single monolithic application, one could create an event-driven microservice that can be built, maintained, improved, and deployed independently of user-facing systems. This microservice keeps indexes and caches up to date to ensure users operate on the most relevant data.&lt;/p&gt;
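
&lt;p&gt;A minimal sketch of what such a microservice does, using an in-memory &lt;code&gt;Map&lt;/code&gt; to stand in for a cache like Redis (the event shape and field names here are illustrative, not any particular CDC tool’s format):&lt;/p&gt;

```javascript
// Minimal sketch of a CDC consumer that keeps a cache in sync.
// The Map stands in for Redis; the (op, before, after) shape is illustrative.
const cache = new Map();

function applyChange(event) {
  const { op, before, after } = event;
  if (op === 'c' || op === 'u' || op === 'r') {
    // create, update, or snapshot read: upsert the new row state
    cache.set(after.id, after);
  } else if (op === 'd') {
    // delete: evict the stale entry so users never see a removed product
    cache.delete(before.id);
  }
}

// A create followed by a delete leaves the cache empty.
applyChange({ op: 'c', before: null, after: { id: 1, name: 'Blue Shirt' } });
applyChange({ op: 'd', before: { id: 1, name: 'Blue Shirt' }, after: null });
```

&lt;p&gt;The same handler shape works whether the change events arrive from a Kafka topic, a webhook, or a cloud function trigger.&lt;/p&gt;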

&lt;h3&gt;
  
  
  Integration and Automation
&lt;/h3&gt;

&lt;p&gt;The rise of SaaS has exploded the number of tools that generate data or need to be updated with data. CDC can provide a better model for keeping Salesforce, Hubspot, etc., up to date and allow automation of business logic that needs to respond to those data changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A7VyCMIWSEVVLoIcgHgaJpQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A7VyCMIWSEVVLoIcgHgaJpQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each of the use cases described above sends data to a specific destination. However, the most powerful destination is a cloud function. Capturing data changes and triggering a cloud function can power every use case mentioned in this article (and many that aren’t).&lt;/p&gt;

&lt;p&gt;Cloud functions have grown tremendously because there are no servers to maintain; they scale automatically and are simple to use and deploy. This popularity and usefulness are proven in architectures like the JAMstack, and CDC fits perfectly with this architecture model.&lt;/p&gt;

&lt;p&gt;Today, cloud functions are triggered by events. The trigger could be a &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html" rel="noopener noreferrer"&gt;file uploaded to Amazon S3&lt;/a&gt; or an incoming HTTP request. However, as you might have guessed, the trigger event could also be emitted by a CDC system.&lt;/p&gt;

&lt;p&gt;For example, here is an AWS Lambda Function to accept a data change event and &lt;a href="https://www.algolia.com/doc/" rel="noopener noreferrer"&gt;perform Algolia search indexing&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const algoliasearch = require("algoliasearch");
const client = algoliasearch(process.env.ALGOLIA_APP_ID, process.env.ALGOLIA_API_KEY);
const index = client.initIndex(process.env.ALGOLIA_INDEX_NAME);

exports.handler = async function(event, context) {
    console.log("EVENT: \\n" + JSON.stringify(event, null, 2))

    const request = event.Records[0].cf.request;

    // Accessing the Data Record
    // &amp;lt;https://docs.meroxa.com/docs/pipelines/data-records&amp;gt;

    const body = Buffer.from(request.body.data, 'base64').toString();

    const { schema, payload } = body;
    const { before, after, source, op } = payload;

    if (req.method === 'POST') {
        try {
            // if read, create, or update operation create o update index
            if (op === 'r' || op === 'c' || op === 'u') {
                console.log(`operation: ${op}, id: ${after.id}`)

                after.objectID = after.id
                await index.saveObject(after)

            } else if (op === 'd') {
                console.log(`operation: d, id: ${before.id}`)
                await index.deleteObject(before.id)
            }
            return res.status(200).send()

        } catch (error) {
            console.log(`error: ${JSON.stringify(error)}`)
            return res.status(500).send()
        }
    }

    return context.logStreamName
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time this function is triggered, it will look at the data change operation (op) and perform the equivalent action in Algolia. For example, if a delete operation occurs in the database, we can perform a &lt;a href="https://www.algolia.com/doc/api-reference/api-methods/delete-objects/" rel="noopener noreferrer"&gt;deleteObject&lt;/a&gt; in Algolia.&lt;/p&gt;

&lt;p&gt;Functions that respond to CDC events can be small and simple. But, CDC — along with event-based architectures — can simplify otherwise very complex architectures as well.&lt;/p&gt;

&lt;p&gt;For example, implementing webhooks as a feature within your application becomes a more straightforward problem with CDC. Webhooks allow users to trigger a POST request when certain events occur, typically data changes. For example, with &lt;a href="https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, you can trigger a cloud function when a pull request is merged. A merged pull request is an UPDATE operation on a data store, which means a CDC system can capture this event. Generally, most webhook events can be translated to INSERT, UPDATE, and DELETE operations that a CDC system can capture.&lt;/p&gt;
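
&lt;p&gt;A sketch of that translation, mapping CDC operations to outgoing webhook deliveries (the event names and payload shape are hypothetical):&lt;/p&gt;

```javascript
// Illustrative sketch: translate CDC operations into webhook event names and
// payloads to POST to subscribers. The event names are hypothetical.
const OP_TO_EVENT = {
  c: 'record.created', // INSERT
  u: 'record.updated', // UPDATE
  d: 'record.deleted', // DELETE
};

function toWebhookPayload(change) {
  const name = OP_TO_EVENT[change.op];
  if (!name) return null; // ignore snapshot reads and unknown ops
  return {
    event: name,
    // for deletes the row is gone, so ship its last known state instead
    data: change.op === 'd' ? change.before : change.after,
  };
}

console.log(toWebhookPayload({
  op: 'u',
  before: { id: 7, merged: false },
  after: { id: 7, merged: true },
}));
```

&lt;p&gt;In the GitHub example above, the UPDATE that flips a pull request’s merged flag would arrive at subscribers as a single “updated” delivery.&lt;/p&gt;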

&lt;h3&gt;
  
  
  History
&lt;/h3&gt;

&lt;p&gt;In some cases, you may not want to act on the CDC event at all, but only store the raw changes. Using CDC, a data pipeline can write every change event to a cloud bucket for long-term processing and analysis. The best place to store this data for historical analysis is a cloud bucket, commonly referred to as a data lake.&lt;/p&gt;

&lt;p&gt;A data lake is a centralized store that allows you to store all your structured and unstructured data at any scale. Data lakes typically leverage cloud object bucket solutions like Amazon S3 or &lt;a href="https://try.digitalocean.com/cloud-storage" rel="noopener noreferrer"&gt;Digital Ocean Spaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AaIQX7E2Zlt3A-0Qr6Pso9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AaIQX7E2Zlt3A-0Qr6Pso9w.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, once the data is in a data lake, SQL query engines like &lt;a href="https://aws.amazon.com/big-data/what-is-presto/" rel="noopener noreferrer"&gt;Presto&lt;/a&gt; can run analytic queries against the change datasets.&lt;/p&gt;

&lt;p&gt;By storing the raw changes, you not only have the current state of the data, you have &lt;em&gt;all&lt;/em&gt; of its previous states. That’s why CDC adds a ton of value to historical analysis.&lt;/p&gt;

&lt;p&gt;Having historical data allows you to support disaster recovery efforts and to answer retroactive questions about your data. For example, let’s say your team redefined how Monthly Active Users (MAU) are calculated. With the complete history of a user data set, one could perform the new MAU calculation as of any date in the past and compare the results to the current state.&lt;/p&gt;
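
&lt;p&gt;A minimal sketch of that kind of replay, reconstructing a table’s state at any past point in time, assuming change events are stored in timestamp order (the event shape and timestamps are illustrative):&lt;/p&gt;

```javascript
// Sketch: reconstruct the state of a table at a point in time by replaying
// change events stored in a data lake. Event shape and timestamps are illustrative.
function stateAt(events, cutoff) {
  const rows = new Map();
  for (const e of events) {
    if (e.ts > cutoff) break;           // events are assumed ordered by timestamp
    if (e.op === 'd') rows.delete(e.before.id);
    else rows.set(e.after.id, e.after); // creates and updates both upsert
  }
  return rows;
}

const history = [
  { ts: 1, op: 'c', after: { id: 1, active: true } },
  { ts: 2, op: 'c', after: { id: 2, active: true } },
  { ts: 3, op: 'd', before: { id: 1 } },
];

console.log(stateAt(history, 2).size); // 2 users existed at ts = 2
console.log(stateAt(history, 3).size); // only user 2 remains at ts = 3
```

&lt;p&gt;A new MAU definition could then be evaluated with a replay like this for any past cutoff and compared against the current state.&lt;/p&gt;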

&lt;p&gt;This rich history also has user-facing value. Audit logs and activity logs are features that display data changes to users.&lt;/p&gt;

&lt;p&gt;Capturing and storing change events offers a better architecture when these features are implemented. Like in Webhooks, audit logs and activity logs are rooted in operations that a CDC system can capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting
&lt;/h3&gt;

&lt;p&gt;The job of any alerting system is to notify a stakeholder of an event. For example, when you receive a new email notification, you are notified of an INSERT operation to an email data store. Typically, most alerts are related to a change in a data store, which means that CDC is great for powering alerting systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2Af0OCLLgyaU2yFJUUlaRfeA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2Af0OCLLgyaU2yFJUUlaRfeA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, let’s say you have an eCommerce store. After enabling CDC on a table of purchases, you could capture each change event and notify the team with a Slack alert whenever there is a new purchase.&lt;/p&gt;

&lt;p&gt;Just like audit or activity logs, notifications powered by CDC can not only provide information about the event that occurred but also provide details of the change itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tom has updated the title from "Meeting Notes" to "My New Meeting."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
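
&lt;p&gt;A message like the one above can be derived directly from the before and after images in the change event. Here is a minimal sketch (the &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;title&lt;/code&gt; field names are illustrative):&lt;/p&gt;

```javascript
// Sketch: turn a CDC update event into a human-readable alert message.
// Field names (user, title) are illustrative, not a fixed schema.
function describeChange(event) {
  const { before, after } = event;
  const changed = Object.keys(after)
    .filter((k) => k !== 'user')                 // skip the actor field itself
    .filter((k) => before[k] !== after[k]);      // keep only fields that changed
  return changed
    .map((k) => `${after.user} has updated the ${k} from "${before[k]}" to "${after[k]}."`)
    .join('\n');
}

console.log(describeChange({
  op: 'u',
  before: { user: 'Tom', title: 'Meeting Notes' },
  after: { user: 'Tom', title: 'My New Meeting' },
}));
// Tom has updated the title from "Meeting Notes" to "My New Meeting."
```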



&lt;p&gt;This alerting behavior also has internal value. From an infrastructure monitoring perspective, CDC events can provide insight into how users interact with your application and data. For example, you could see when and how users add, update, or delete information. This data can be sent to &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; to monitor and act on this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started with CDC
&lt;/h3&gt;

&lt;p&gt;In &lt;a href="https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-52l7"&gt;part one&lt;/a&gt;, we talked about the various ways CDC is commonly implemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polling&lt;/li&gt;
&lt;li&gt;Database Triggers&lt;/li&gt;
&lt;li&gt;Streaming Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can all be used to build the use cases we’ve discussed in this article. Best of all, since CDC focuses on the data, the process is programming language agnostic and can be integrated into most architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Polling and Triggers
&lt;/h3&gt;

&lt;p&gt;When using polling or database triggers, there is little overhead and nothing to install. You can get started by writing the queries you’ll poll with, or by leveraging your database’s triggers if they are supported.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming Logs
&lt;/h3&gt;

&lt;p&gt;Databases use streaming replication logs for backup and recovery, which means that most databases provide some CDC behavior out of the box. How easy it is to tap into these events depends on the data store itself. The best place to get started is by digging into your database’s replication features. Here are some replication log resources for some of the most popular databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/9.0/wal-intro.html" rel="noopener noreferrer"&gt;PostgreSQL’s Write-Ahead Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/binary-log.html" rel="noopener noreferrer"&gt;MySQL’s Binary Log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.mongodb.com/manual/core/replica-set-oplog/" rel="noopener noreferrer"&gt;MongoDB’s Oplog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cockroachlabs.com/docs/v20.1/change-data-capture.html" rel="noopener noreferrer"&gt;CockroachDB&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How you get started with streaming logs is tightly coupled to the database in question. In future articles, I’ll explore what it looks like for each of these.&lt;/p&gt;

&lt;p&gt;Implementing any of these directly does take some time, planning, and effort. If you’re trying to get started with CDC, the lowest barrier to entry is adopting a CDC tool that knows how to communicate and capture changes from the data stores you use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Data Capture Tools
&lt;/h3&gt;

&lt;p&gt;Here are some great tools for you to evaluate:&lt;/p&gt;

&lt;h3&gt;
  
  
  Debezium
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://debezium.io/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; is by far the most popular CDC tool. Its well-maintained, open-sourced and built on top of &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;. It supports &lt;a href="https://debezium.io/documentation/reference/1.6/connectors/mongodb.html" rel="noopener noreferrer"&gt;MongoDB,&lt;/a&gt; &lt;a href="https://debezium.io/documentation/reference/1.6/connectors/mysql.html" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;, &lt;a href="https://debezium.io/documentation/reference/1.6/connectors/postgresql.html" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;, and more databases out of the box.&lt;/p&gt;

&lt;p&gt;At a high level, Debezium hooks into the replication logs of the database and emits the change events into Kafka. You can even run &lt;a href="https://debezium.io/documentation/reference/1.6/operations/debezium-server.html" rel="noopener noreferrer"&gt;Debezium standalone&lt;/a&gt; without Kafka.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A0rFy1SLmnB2Qnb7N1dhDaA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A0rFy1SLmnB2Qnb7N1dhDaA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s really nice is that Debezium is all configuration-based. After installing and configuring Debezium, you can configure connections to your datastore using a JSON-based configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "fulfillment-connector",  
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 
    "database.hostname": "192.168.99.100", 
    "database.port": "5432", 
    "database.user": "postgres", 
    "database.password": "postgres", 
    "database.dbname" : "postgres", 
    "database.server.name": "fulfillment", 
    "table.include.list": "public.inventory" 

  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once connected, Debezium will perform an initial snapshot of your data and emit change events to a Kafka Topic. Then, services can &lt;a href="https://kafka.apache.org/documentation/#gettingStarted" rel="noopener noreferrer"&gt;consume the topics&lt;/a&gt; and act on them.&lt;/p&gt;

&lt;p&gt;Here are some great places to get started with Debezium:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://debezium.io/documentation/online-resources/" rel="noopener noreferrer"&gt;Debezium resources on the web&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://debezium.io/documentation/reference/1.6/tutorial.html" rel="noopener noreferrer"&gt;Debezium Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Meroxa
&lt;/h3&gt;

&lt;p&gt;Meroxa is a real-time data orchestration platform. It removes the time and overhead associated with configuring and managing brokers, connectors, transforms, functions, and streaming infrastructure. All you have to do is add your resources and construct your pipelines. Meroxa supports &lt;a href="https://docs.meroxa.com/docs/sources/overview" rel="noopener noreferrer"&gt;PostgreSQL, MongoDB, SQL Server&lt;/a&gt;, and more.&lt;/p&gt;

&lt;p&gt;CDC pipelines can be built in a visual dashboard or using the &lt;a href="https://docs.meroxa.com/cli/installation-guide" rel="noopener noreferrer"&gt;Meroxa CLI&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add Resource
$ meroxa resource add my-postgres --type postgres -u postgres://$PG_USER:$PG_PASS@$PG_URL:$PG_PORT/$PG_DB

# Add Webhook
$ meroxa resource add my-url --type url -u $CUSTOM_HTTP_URL

# Create CDC Pipeline
$ meroxa connect --from my-postgres --input $TABLE_NAME --to my-url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The use case above is covered in more detail: &lt;a href="https://docs.meroxa.com/guides/building-a-change-data-capture-microservice-with-meroxa-and-vercel" rel="noopener noreferrer"&gt;Building a Change Data Capture Microservice with Meroxa and Vercel Meroxa Guide.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some great resources to help you get started with Meroxa:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.meroxa.com/docs/introduction/what-is-meroxa" rel="noopener noreferrer"&gt;What is Meroxa?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.meroxa.com/getting-started" rel="noopener noreferrer"&gt;Meroxa Getting Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can’t wait to see what you build. 🚀&lt;/p&gt;

&lt;p&gt;If you have any questions or feedback, I’d love to hear them. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discuss with me in our &lt;a href="https://discord.meroxa.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Discord&lt;/strong&gt;&lt;/a&gt; community.&lt;/li&gt;
&lt;li&gt;Reach out to me on &lt;a href="https://www.notion.so/Stream-Your-Database-Changes-with-Change-Data-Capture-Part-Two-c5e1f0d9b19d4f5597fcefcb67c74fb1" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>database</category>
      <category>cdc</category>
      <category>changedatacapture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Stream Your Database Changes with Change Data Capture</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Wed, 11 Aug 2021 20:14:45 +0000</pubDate>
      <link>https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-52l7</link>
      <guid>https://dev.to/anaptfox/stream-your-database-changes-with-change-data-capture-52l7</guid>
      <description>&lt;p&gt;Nobody wants to look at a dashboard or make decisions with yesterday’s data. We live in a world where real-time information is a first-class expectation for our users and is critical to make the best decisions inside an organization.&lt;/p&gt;

&lt;p&gt;Change Data Capture (CDC) is an efficient and scalable model that simplifies the implementation of real-time systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AO-S32djKgEuSCxO1vqayUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AO-S32djKgEuSCxO1vqayUA.png"&gt;&lt;/a&gt;Change Data Capture Diagram&lt;/p&gt;

&lt;p&gt;Industry-leading companies like &lt;a href="https://shopify.engineering/capturing-every-change-shopify-sharded-monolith" rel="noopener noreferrer"&gt;Shopify&lt;/a&gt;, &lt;a href="https://www.capitalone.com/tech/software-engineering/batch-to-real-time-with-change-data-capture/" rel="noopener noreferrer"&gt;Capital One&lt;/a&gt;, &lt;a href="https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b" rel="noopener noreferrer"&gt;Netflix&lt;/a&gt;, &lt;a href="https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f" rel="noopener noreferrer"&gt;Airbnb&lt;/a&gt;, and &lt;a href="https://medium.com/zendesk-engineering/add-some-smarts-to-your-change-data-capture-2296032ad042" rel="noopener noreferrer"&gt;Zendesk&lt;/a&gt;, have all published technical articles demonstrating how they have implemented Change Data Capture (CDC) into their data architectures to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expose data from a centralized system to event-driven microservices.&lt;/li&gt;
&lt;li&gt;Build applications that respond to data events in real-time.&lt;/li&gt;
&lt;li&gt;Maintain data quality and freshness within data warehouses and other downstream consumers of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this multi-part series on Change Data Capture, we are going to dive into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Change Data Capture, and how are CDC systems implemented?&lt;/li&gt;
&lt;li&gt;What are the ideal CDC use cases, and how to get started with CDC?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s begin.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Change Data Capture (CDC)?
&lt;/h3&gt;

&lt;p&gt;The idea of “tracking the changes to a system” isn’t new. Engineers have been writing scripts to query and update data in batches since the idea of programming itself came about. Change Data Capture is a formalization of the various methods that determine &lt;strong&gt;how changes are tracked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At its core, CDC is a process that allows an application to listen for changes to a data store and respond to those events. The process involves a data store (database, data warehouse, etc.) and a system to capture the changes of the data store.&lt;/p&gt;

&lt;p&gt;For example, one could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capture &lt;a href="https://www.postgresql.org" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; (database) changes and send the change events to &lt;a href="https://kafka.apache.org" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; using &lt;a href="https://debezium.io" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; (CDC).&lt;/li&gt;
&lt;li&gt;Capture changes from &lt;a href="https://www.mysql.com" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt; (database) and POST to an HTTP Endpoint with &lt;a href="https://meroxa.com/" rel="noopener noreferrer"&gt;Meroxa&lt;/a&gt; (CDC).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;Let’s look at a real-world example that would benefit from CDC. Here, we have an example of a table in PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AiTaK9Q0UbDKGYr4gP0UcVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AiTaK9Q0UbDKGYr4gP0UcVw.png"&gt;&lt;/a&gt;Example User Data&lt;/p&gt;

&lt;p&gt;When information in the User table changes, the business may need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update the data warehouse, which is the source of truth for business analytics.&lt;/li&gt;
&lt;li&gt;Notify the team of a new user.&lt;/li&gt;
&lt;li&gt;Keep an additional User table in sync with filtered columns for privacy purposes.&lt;/li&gt;
&lt;li&gt;Create a real-time dashboard of new user activity.&lt;/li&gt;
&lt;li&gt;Capture change events for audit logging.&lt;/li&gt;
&lt;li&gt;Store every change in a cloud bucket for historical analytics.&lt;/li&gt;
&lt;li&gt;Update an index used for search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can build services to perform all of the actions above by acting on a data change event, and if desired, build and manage them independently of each other.&lt;/p&gt;

&lt;p&gt;CDC gives us efficiency by acting on events as they occur and scalability by leveraging a &lt;a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying" rel="noopener noreferrer"&gt;decoupled event-driven architecture&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A CDC Event Example
&lt;/h3&gt;

&lt;p&gt;CDC systems usually emit an event that contains details about the change that occurred. When using a CDC system like Debezium, here is the event generated when a new user is created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AExWEPx4LY3Pjfi6fEUsUYA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AExWEPx4LY3Pjfi6fEUsUYA.png"&gt;&lt;/a&gt;Anatomy of CDC Event&lt;/p&gt;

&lt;p&gt;This event describes the schema of the data (schema), the operation that occurred (op), and the state of the data before and after the change (payload).&lt;/p&gt;

&lt;p&gt;The event’s format, the fidelity of information, and when it is delivered depend on the CDC system’s implementation.&lt;/p&gt;
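
&lt;p&gt;For instance, a Debezium-style envelope can be unpacked like this (the payload below is trimmed and illustrative, not a complete event):&lt;/p&gt;

```javascript
// A trimmed, illustrative Debezium-style event envelope. Real events carry a
// much richer schema section; only the parts discussed above are shown.
const changeEvent = {
  schema: {},                 // describes the types of the fields in the payload
  payload: {
    op: 'c',                  // c = create, u = update, d = delete, r = snapshot read
    before: null,             // row state before the change (null for an INSERT)
    after: { id: 42, email: 'ada@example.com' },
    source: { table: 'users', ts_ms: 1628712000000 },
  },
};

// Consumers typically destructure the payload and branch on the operation.
const { op, before, after, source } = changeEvent.payload;
console.log(`${op} on ${source.table}: ${JSON.stringify(after)}`);
```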

&lt;h3&gt;
  
  
  CDC Implementations
&lt;/h3&gt;

&lt;p&gt;Tracking changes to a PostgreSQL database could look very similar to, or wildly different from, tracking changes within MongoDB. It all depends on the environment and the capture method chosen.&lt;/p&gt;

&lt;p&gt;The capture method chosen can define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which operation(s) (insert, update, delete) can be captured.&lt;/li&gt;
&lt;li&gt;how the event is formatted.&lt;/li&gt;
&lt;li&gt;whether the CDC system &lt;em&gt;pulls&lt;/em&gt; change events or has them &lt;em&gt;pushed&lt;/em&gt; to it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s look at each of the different methods and discuss some of the pros and cons of each.&lt;/p&gt;

&lt;h4&gt;
  
  
  Polling
&lt;/h4&gt;

&lt;p&gt;When implementing any database connector, the decision starts with “&lt;a href="https://cnr.sh/essays/build-kafka-connector-source" rel="noopener noreferrer"&gt;To poll or not to poll&lt;/a&gt;.” Polling is the most conceptually simple CDC method. To implement polling, you need to query the datastore on an interval.&lt;/p&gt;

&lt;p&gt;For example, you may run the following query on an interval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This SELECT * query would be considered the &lt;strong&gt;bulk&lt;/strong&gt; ("give me everything") polling method. While this is great for capturing a snapshot of the current state, downstream consumers would have to do extra work to figure out exactly what data changed on each interval.&lt;/p&gt;

&lt;p&gt;However, polling can get much more granular. For example, it’s possible to poll only for a primary key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A system can track the max value of a primary key (id). When that max value increases, an INSERT operation has occurred.&lt;/p&gt;
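
&lt;p&gt;The max-id check can be sketched as a small cursor-tracking routine. This is an illustrative sketch only: the in-memory array stands in for the database, and a name like pollForInserts is hypothetical, not part of any library.&lt;/p&gt;

```javascript
// Illustrative sketch: detect INSERTs by tracking the max primary key.
// The "database" here is an in-memory array standing in for the Users table.
const users = [{ id: 1 }, { id: 2 }];

let lastMaxId = 0; // cursor: the highest id seen so far

// Hypothetical polling step: return the rows inserted since the last poll.
function pollForInserts() {
  const newRows = users.filter((row) => row.id > lastMaxId);
  if (newRows.length > 0) {
    lastMaxId = Math.max(...newRows.map((row) => row.id));
  }
  return newRows;
}

const firstPoll = pollForInserts();  // sees ids 1 and 2
users.push({ id: 3 });               // simulate an INSERT between polls
const secondPoll = pollForInserts(); // sees only id 3
```

&lt;p&gt;In a real system, pollForInserts would run a query on an interval, filtering on ids greater than the stored cursor instead of filtering an array.&lt;/p&gt;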

&lt;p&gt;Additionally, if a database has an updated_at column, a query can look at timestamp changes to capture UPDATE operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros and Cons&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy:&lt;/strong&gt; Polling is great because it’s simple to implement and deploy, and it’s very effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom queries are useful:&lt;/strong&gt; One advantage is that the query used while polling can be customized to fit complex use cases. The query could include JOINs or transformations performed directly in SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capturing deletes is hard:&lt;/strong&gt; With the polling method, it’s much harder to capture DELETE operations. You can't really query a row in a database if it's gone entirely. One solution is to use &lt;a href="https://dev.to/anaptfox/creating-a-soft-delete-archive-table-with-postgresql-38pi"&gt;database triggers to create an "archive" table of deleted records&lt;/a&gt;. Then, delete operations become insert operations of a new table that could be polled.&lt;/p&gt;
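
&lt;p&gt;The archive-table workaround can be sketched end to end: a delete on the primary table lands in an archive table, where it becomes an ordinary row a poller can pick up. A minimal sketch, assuming in-memory arrays as stand-ins for both tables (softDelete and the table names are illustrative):&lt;/p&gt;

```javascript
// Illustrative sketch: turn DELETEs into INSERTs on an archive table,
// which can then be polled like any other table. Both "tables" are
// in-memory arrays standing in for User and Deleted_User.
const userTable = [{ id: 1, name: 'Ada' }, { id: 2, name: 'Grace' }];
const deletedUserTable = [];

// Stand-in for a BEFORE DELETE trigger: copy the row before removing it.
function softDelete(id) {
  const index = userTable.findIndex((row) => row.id === id);
  if (index === -1) return false;
  deletedUserTable.push({ ...userTable[index], deletedAt: Date.now() });
  userTable.splice(index, 1);
  return true;
}

softDelete(1);
// The DELETE is now visible as an ordinary row a poller can pick up.
```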

&lt;p&gt;&lt;strong&gt;Events are pulled, not pushed:&lt;/strong&gt; With polling, the event is pulled from the upstream system. For example, when using polling to ingest into a data warehouse, ingestion happens only when the CDC system decides to poll. In theory, “real-time” can be approximated with fast enough polling, but that adds performance overhead on the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance overhead is a concern:&lt;/strong&gt; A SELECT * or any complex query doesn't scale well on massive datasets. One common workaround is to poll a standby instance instead of the primary database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changes between query times can’t be captured:&lt;/strong&gt; Another consideration is data that changes between polls. For example, if a system polls every hour and a row changes multiple times within that hour, you only see the state at query time, not any of the intermediate changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Database Triggers
&lt;/h4&gt;

&lt;p&gt;Most of the popular databases support triggers of some sort. For example, &lt;a href="https://www.postgresql.org/docs/9.1/sql-createtrigger.html" rel="noopener noreferrer"&gt;in PostgreSQL&lt;/a&gt;, one can build a trigger that will move a row to a new table when it’s deleted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;moveDeleted&lt;/span&gt;
&lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;"User"&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;PROCEDURE&lt;/span&gt; &lt;span class="n"&gt;moveDeleted&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because triggers can effectively listen to an operation and perform an action, database triggers can act as a CDC system.&lt;/p&gt;

&lt;p&gt;In some cases, these triggers can be very complex and full-blown functions. For example, &lt;a href="https://docs.mongodb.com/realm/triggers/" rel="noopener noreferrer"&gt;in MongoDB&lt;/a&gt;, triggers are written in JavaScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;changeEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

 &lt;span class="c1"&gt;// Destructure out fields from the change stream event object&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;updateDescription&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fullDocument&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;changeEvent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Check if the shippingLocation field was updated&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;updatedFields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;updateDescription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;updatedFields&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isNewLocation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;updatedFields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/shippingLocation/&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// If the location changed, text the customer the updated location.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isNewLocation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="c1"&gt;// Do something&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros and Cons&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ease of deployment:&lt;/strong&gt; Triggers are awesome because they are supported out of the box by most databases and are easy to implement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data consistency:&lt;/strong&gt; Current and future downstream consumers don’t have to reimplement this logic, because it lives in the database rather than in each application (a real advantage in a microservice architecture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application logic in databases can be bad:&lt;/strong&gt; However, databases should not contain &lt;em&gt;too&lt;/em&gt; much application logic. Behavior can become too tightly coupled to the database, and one bad trigger can affect an entire data infrastructure. Triggers should be concise and simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every operation is captured:&lt;/strong&gt; You can build a trigger for each database operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance overhead is a concern:&lt;/strong&gt; Poorly written triggers can also impact database performance for the same reasons as the polling method. A trigger containing a complex query won't scale well on massive datasets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Streaming Replication Logs
&lt;/h4&gt;

&lt;p&gt;It’s best to have at least a secondary instance of a database running to ensure proper failover and disaster recovery.&lt;/p&gt;

&lt;p&gt;In this model, the standby instances of the database need to stay up to date with the primary in real time &lt;em&gt;and&lt;/em&gt; not lose information. The best way to do this today is for the database to write every change it makes to a log. Any standby instance can then stream the changes from this log and apply the same operations locally. Replaying the same operations in real time is what allows the standby instances to “mirror” the primary.&lt;/p&gt;
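
&lt;p&gt;The “apply the operations locally” step can be sketched as a tiny log consumer: given an ordered list of change entries, a replica replays each one to mirror the primary. This is a conceptual sketch; the entry shape is invented for illustration and is not any database’s actual log format.&lt;/p&gt;

```javascript
// Illustrative sketch: a replica mirrors a primary by replaying an
// ordered change log. The entry shape is invented for illustration.
const changeLog = [
  { op: 'insert', id: 1, data: { name: 'Ada' } },
  { op: 'insert', id: 2, data: { name: 'Grace' } },
  { op: 'update', id: 2, data: { name: 'Grace Hopper' } },
  { op: 'delete', id: 1 },
];

// Replay every log entry, in order, against a local copy of the data.
function applyLog(log, replica = new Map()) {
  for (const entry of log) {
    switch (entry.op) {
      case 'insert':
      case 'update':
        replica.set(entry.id, entry.data);
        break;
      case 'delete':
        replica.delete(entry.id);
        break;
    }
  }
  return replica;
}

const replica = applyLog(changeLog);
```

&lt;p&gt;A CDC system consumes the same log, but instead of applying entries to a replica, it forwards each one downstream as an event.&lt;/p&gt;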

&lt;p&gt;Here are some references on how this works for some of the most popular databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/9.0/wal-intro.html" rel="noopener noreferrer"&gt;PostgreSQL’s Write-Ahead Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/binary-log.html" rel="noopener noreferrer"&gt;MySQL’s Binary Log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.mongodb.com/manual/core/replica-set-oplog/" rel="noopener noreferrer"&gt;MongoDB’s Oplog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CDC can use the same mechanism to listen to changes. Just like a standby database, an additional system can also process the streaming log as it’s updated:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AuYynjkjIRECH8S5laORpbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AuYynjkjIRECH8S5laORpbw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the PostgreSQL example diagram above, a CDC system can act as an additional &lt;a href="https://www.postgresql.org/docs/9.6/runtime-config-replication.html" rel="noopener noreferrer"&gt;WAL Receiver&lt;/a&gt;, process the event, and send it to a message transport (HTTP API, Kafka, etc.).&lt;/p&gt;

&lt;p&gt;Here is an example of querying changes from PostgreSQL’s WAL using a SQL function provided by the &lt;a href="https://www.postgresql.org/docs/10/logicaldecoding-example.html" rel="noopener noreferrer"&gt;test_decoding plugin&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_logical_slot_get_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'regression_slot'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;lsn&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;xid&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;                           
&lt;span class="c1"&gt;-----------+-------+---------------------------------------------------------&lt;/span&gt;
 &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;BA5A688&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10298&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="mi"&gt;10298&lt;/span&gt;
 &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;BA5A6F0&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10298&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;
 &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;BA5A7F8&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10298&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;&lt;span class="s1"&gt;'2'&lt;/span&gt;
 &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;BA5A8A8&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10298&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt; &lt;span class="mi"&gt;10298&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query response above contains the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lsn - Log Sequence Number (LSN) - This number describes the current position in the WAL. Downstream systems use it to track how far into the log they have read.&lt;/li&gt;
&lt;li&gt;xid - Transaction ID - Each PostgreSQL transaction gets a unique ID.&lt;/li&gt;
&lt;li&gt;data - A description of the operation that occurred.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The format of these change events will be determined by the &lt;a href="https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins" rel="noopener noreferrer"&gt;Logical Decoding Output Plugin&lt;/a&gt;. For example, the &lt;a href="https://github.com/eulerto/wal2json" rel="noopener noreferrer"&gt;wal2json&lt;/a&gt; output plugin allows you to output the changes in JSON, which is easier to parse than the test_decoding plugin output.&lt;/p&gt;
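
&lt;p&gt;A consumer of wal2json output can parse each event with any JSON library. Below is a sketch that normalizes one such event into row objects; the sample payload is hand-written here to follow wal2json’s documented shape, not captured from a live database.&lt;/p&gt;

```javascript
// Illustrative sketch: normalize a wal2json-style change event into
// { table, kind, row } objects. The payload is hand-written to match
// wal2json's documented shape, not captured from a live database.
const rawEvent = JSON.stringify({
  change: [
    {
      kind: 'insert',
      schema: 'public',
      table: 'users',
      columnnames: ['id', 'name'],
      columntypes: ['integer', 'text'],
      columnvalues: [1, 'Ada'],
    },
  ],
});

// Zip each change's column names and values back into a row object.
function parseChanges(json) {
  return JSON.parse(json).change.map((change) => ({
    table: `${change.schema}.${change.table}`,
    kind: change.kind,
    row: Object.fromEntries(
      change.columnnames.map((name, i) => [name, change.columnvalues[i]])
    ),
  }));
}

const events = parseChanges(rawEvent);
```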

&lt;p&gt;PostgreSQL also provides a mechanism to &lt;a href="https://www.postgresql.org/docs/10/logicaldecoding-walsender.html" rel="noopener noreferrer"&gt;stream these changes&lt;/a&gt; as they occur. As you saw in the event example earlier, Debezium also parses the streaming log in real-time and produces a JSON event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros and Cons&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Events are pushed:&lt;/strong&gt; One huge benefit of streaming logs is that events are pushed to the CDC system as changes occur (vs. polling). This push model allows for real-time architectures. Using the User table as an example, the data warehouse ingestion would happen in real time with a streaming log CDC system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficient and low latency:&lt;/strong&gt; Standby instances use streaming logs for disaster recovery, where efficiency and low latency are top priorities. Streaming replication logs is the most efficient means of capturing changes with the least overhead on the database. The process looks different from database to database, but the concepts still hold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every operation is captured:&lt;/strong&gt; Every transaction against the datastore is written to the log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard to get a complete snapshot of data:&lt;/strong&gt; Generally, after a certain amount of time (or size), streaming logs get purged because they take up space. As a result, the logs may not contain &lt;em&gt;every&lt;/em&gt; change that ever occurred, just the most recent ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Needs to be configured:&lt;/strong&gt; Enabling replication logs may require additional configuration, plugins, or even a database restart. Performing these changes with minimal downtime can be cumbersome and requires planning.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next?
&lt;/h3&gt;

&lt;p&gt;Capturing the &lt;em&gt;changes&lt;/em&gt; of data is like a Swiss Army knife for any application architecture; it is useful for so many different types of problems. Listening for, storing, and acting on the changes of any system — particularly a database — allows you to replicate data in real time between two data stores, break up a monolithic application into scalable, event-driven microservices, or even power real-time UIs.&lt;/p&gt;

&lt;p&gt;Streaming replication logs, polling, and database triggers provide a mechanism to build a CDC system. Each has its own set of pros and cons specific to your application architecture and desired functionality.&lt;/p&gt;

&lt;p&gt;In the next article in this series, we are going to dive into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the ideal CDC use cases?&lt;/li&gt;
&lt;li&gt;Where can I get started with CDC?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow me on &lt;a href="https://twitter.com/anaptfox" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; to get notified of part two. I can’t wait to see what you build 🚀.&lt;/p&gt;

&lt;p&gt;Special thanks to &lt;a href="https://twitter.com/criccomini" rel="noopener noreferrer"&gt;@criccomini&lt;/a&gt;, &lt;a href="https://twitter.com/andyhattemer" rel="noopener noreferrer"&gt;@andyhattemer&lt;/a&gt;, &lt;a href="https://twitter.com/misosoup" rel="noopener noreferrer"&gt;@misosoup&lt;/a&gt;, &lt;a href="https://twitter.com/devarispbrown" rel="noopener noreferrer"&gt;@devarispbrown&lt;/a&gt;, and &lt;a href="https://twitter.com/neovintage" rel="noopener noreferrer"&gt;@neovintage&lt;/a&gt; for helping me craft the ideas in this article!&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>cdc</category>
      <category>eventstreamprocessing</category>
    </item>
    <item>
      <title>Creating a Soft Delete Archive Table with PostgreSQL</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Tue, 08 Jun 2021 19:38:49 +0000</pubDate>
      <link>https://dev.to/anaptfox/creating-a-soft-delete-archive-table-with-postgresql-38pi</link>
      <guid>https://dev.to/anaptfox/creating-a-soft-delete-archive-table-with-postgresql-38pi</guid>
      <description>&lt;p&gt;Streaming from Postgres’ Logical replication log is the most efficient means of capturing changes with the least amount of overhead to your database. However, in some environments (i.e., unsupported versions, Heroku Postgres), you’re left with polling the database to monitor changes.&lt;/p&gt;

&lt;p&gt;Typically when &lt;a href="https://docs.meroxa.com/docs/sources/postgres/connection-types/polling" rel="noopener noreferrer"&gt;polling PostgreSQL&lt;/a&gt; to capture data changes, you can track the max value of a primary key (id) to know when an INSERT operation occurred. Additionally, if your database has an updatedAt column, you can look at timestamp changes to capture UPDATE operations, but it’s much harder to capture DELETE operations.&lt;/p&gt;

&lt;p&gt;Postgres &lt;a href="https://www.postgresql.org/docs/9.1/sql-createtrigger.html" rel="noopener noreferrer"&gt;Triggers&lt;/a&gt; and &lt;a href="https://www.postgresql.org/docs/9.1/sql-createfunction.html" rel="noopener noreferrer"&gt;Functions&lt;/a&gt; are powerful features of Postgres that let you listen for DELETE operations within a table and insert the deleted row into a separate archive table. You can consider this a method of performing &lt;a href="https://en.wiktionary.org/wiki/soft_deletion" rel="noopener noreferrer"&gt;soft deletes&lt;/a&gt;, and this model is helpful for maintaining records for historical, analytical, or data recovery purposes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F920%2F1%2A4E8HnHt7jmYlIBq16Jij1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F920%2F1%2A4E8HnHt7jmYlIBq16Jij1A.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the commands below, we capture deletes from a table called User, and the trigger will insert the deleted row into a table called Deleted_User.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step One: Create a new table&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If you don’t have an archive table yet, you’ll need to create one. To help, you can easily copy the structure of an existing table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Deleted_User&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="k"&gt;User&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;NO&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; WITH NO DATA copies a table’s structure without its data.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step Two: Create a new Postgres Function&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Next, we can create a new function named moveDeleted():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;moveDeleted&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;

  &lt;span class="k"&gt;BEGIN&lt;/span&gt;

    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="nv"&gt;"Deleted_User"&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;OLD&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;OLD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are using VALUES((OLD).*) to send every column to the archive table, but you can modify this to omit columns or even add new ones.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step Three: Create a new Postgres Trigger&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Lastly, we can create a Postgres Trigger named moveDeleted that calls the moveDeleted() function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;moveDeleted&lt;/span&gt;

&lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;"User"&lt;/span&gt;

&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;

&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;PROCEDURE&lt;/span&gt; &lt;span class="n"&gt;moveDeleted&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;If you perform a DELETE operation on the User table, a new row with the deleted data will be written to the Deleted_User table.&lt;/p&gt;

&lt;p&gt;Now your archive table will begin to populate, data won’t be lost, and you can now monitor the archive table to capture DELETE operations within your application.&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>dataengineering</category>
      <category>changedatacapture</category>
    </item>
    <item>
      <title>Build a Cross Platform Command Line Executable in Node.js</title>
      <dc:creator>Taron Foxworth</dc:creator>
      <pubDate>Tue, 13 Jun 2017 14:37:58 +0000</pubDate>
      <link>https://dev.to/anaptfox/build-a-cross-platform-command-line-executable-innodejs</link>
      <guid>https://dev.to/anaptfox/build-a-cross-platform-command-line-executable-innodejs</guid>
      <description>&lt;p&gt;Before we begin, this article assumes you know a couple of things: &lt;a href="http://eloquentjavascript.net/" rel="noopener noreferrer"&gt;Javascript&lt;/a&gt;, &lt;a href="https://www.learnenough.com/command-line-tutorial" rel="noopener noreferrer"&gt;Terminal&lt;/a&gt;, and &lt;a href="https://github.com/sindresorhus/awesome-nodejs#tutorials" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Faf338yzsh908thgwyr47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Faf338yzsh908thgwyr47.png" title="Terminal" alt="Terminal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating a command line tool in Node.js is easy, especially with tools like &lt;a href="https://www.npmjs.com/package/yargs" rel="noopener noreferrer"&gt;yargs&lt;/a&gt;, &lt;a href="https://medium.freecodecamp.com/writing-command-line-applications-in-nodejs-2cf8327eee2" rel="noopener noreferrer"&gt;commander&lt;/a&gt;, and &lt;a href="https://www.npmjs.com/package/inquirer" rel="noopener noreferrer"&gt;inquirer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, secretly, I’ve been jealous of &lt;a href="https://github.com/mkideal/cli" rel="noopener noreferrer"&gt;CLI tools&lt;/a&gt; built in &lt;a href="http://golang.org/" rel="noopener noreferrer"&gt;Go&lt;/a&gt;. CLI tools built in Go ship as a single binary file, with no dependencies.&lt;/p&gt;

&lt;p&gt;Then, as I started using &lt;a href="https://zeit.co/" rel="noopener noreferrer"&gt;Zeit’s&lt;/a&gt; tools, I came across pkg. With &lt;a href="http://zeit.co/pkg" rel="noopener noreferrer"&gt;pkg&lt;/a&gt;, I can create my tool in Node.js but ship it as a single binary. 🙌🏽&lt;/p&gt;

&lt;p&gt;For example, let’s create a simple command line tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cp"&gt;#!/usr/bin/env node
&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;argv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;yargs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;argv&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;digit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;digit&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hmmm. I&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;m confused&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This program will add &lt;code&gt;1&lt;/code&gt; to a parameter named &lt;code&gt;digit&lt;/code&gt;. If you put this in a file called &lt;code&gt;add.js&lt;/code&gt; (make sure you run &lt;code&gt;npm install yargs&lt;/code&gt;), you’ll have a command line tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./add.js --digit=1
2
$ ./add.js --digit=12
13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, onto the magic. Install pkg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ npm i -g pkg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To convert your program to a binary, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pkg add.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this command will generate three files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls 
add-linux   add-macos   add-win.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have just created single-file binaries for Linux, macOS, and Windows.&lt;/p&gt;

&lt;p&gt;You can then run it as you would any other binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./add-macos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re like me and were curious as to how this works, this video was enlightening:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/eORpB-bXcbs"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I didn’t make this tool. I just wanted others to bask in the awesomesauce that is JavaScript.&lt;/p&gt;

&lt;p&gt;I’ll be using this tool in upcoming projects.&lt;/p&gt;




&lt;p&gt;Hi, thanks for reading. I’m &lt;a href="https://twitter.com/anaptfox" rel="noopener noreferrer"&gt;Fox&lt;/a&gt;, a Technical Evangelist at &lt;a href="https://www.losant.com" rel="noopener noreferrer"&gt;Losant&lt;/a&gt;. My dream is to translate technology for people to learn, love, and be inspired. Feel free to follow me on &lt;a href="https://twitter.com/anaptfox" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
