<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: upgrowcode</title>
    <description>The latest articles on DEV Community by upgrowcode (@upgrowcode).</description>
    <link>https://dev.to/upgrowcode</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F782713%2Fccda335c-27f4-4a84-a9f9-3a34f910f002.png</url>
      <title>DEV Community: upgrowcode</title>
      <link>https://dev.to/upgrowcode</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/upgrowcode"/>
    <language>en</language>
    <item>
      <title>Create Streaming SQL Pipelines With dbt</title>
      <dc:creator>upgrowcode</dc:creator>
      <pubDate>Tue, 01 Feb 2022 18:58:42 +0000</pubDate>
      <link>https://dev.to/upgrowcode/create-streaming-sql-pipelines-with-dbt-20pb</link>
      <guid>https://dev.to/upgrowcode/create-streaming-sql-pipelines-with-dbt-20pb</guid>
      <description>&lt;p&gt;&lt;a href="https://www.getdbt.com/"&gt;dbt (data build tool)&lt;/a&gt; has emerged as the industry standard for data transformations in recent years. It combines SQL accessibility with software engineering best practices, allowing data teams to design dependable data pipelines and document, test, and version-control them. The &lt;a href="https://materialize.com/introducing-dbt-materialize/"&gt;dbt ETL tool&lt;/a&gt; alleviates these frustrations by taking over the transformation step in your ETL pipelines.&lt;/p&gt;

&lt;p&gt;While dbt is ideal for batch data transformations, it can only approximate the transformation of streaming data. Dealing with real-time data can be pretty challenging, especially when handling vast amounts of it.&lt;/p&gt;

&lt;p&gt;But streaming dbt will be possible, as &lt;a href="https://materialize.com/"&gt;Materialize&lt;/a&gt;, a state-of-the-art SQL platform for processing streaming data, has announced a &lt;a href="https://materialize.com/introducing-dbt-materialize/"&gt;new dbt adapter&lt;/a&gt;. Keep reading to find out why this could be a game-changer in the streaming analytics world.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Materialize's dbt Adapter?
&lt;/h2&gt;

&lt;p&gt;The Materialize dbt adapter is an integration that allows you to transform real-time streaming data using Materialize as your data warehouse.&lt;/p&gt;

&lt;p&gt;Together, these tools can allow data analysts to be the creators and users of streaming pipelines instead of relying on data engineers. The result could be a more streamlined analytics workflow as streaming capabilities become accessible across several data ecosystems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Today's Real-Time Analytics
&lt;/h2&gt;

&lt;p&gt;To begin, what exactly do we mean by batch and streaming data? As the name implies, batch data is any data that arrives in discrete batches, which can be once a minute, once an hour, or once a day. On the other hand, streaming data comes continuously and on no particular schedule.&lt;/p&gt;

&lt;p&gt;Let's see what challenges this can cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch-based tooling for real-time data is complicated for higher data volumes.
&lt;/h2&gt;

&lt;p&gt;Batch-based methods are adequate for the majority of production use cases. Nonetheless, while true real-time needs are rare, the stakes are generally higher when they do arise. Unfortunately, we do not have many options for meeting these needs right now.&lt;/p&gt;

&lt;p&gt;Particularly at bigger data volumes, there is a limit to how much we can optimize SQL before performance suffers. So, as data quantities increase, we require streaming-specific tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current dbt methods do not truly transform streaming data in real-time.
&lt;/h2&gt;

&lt;p&gt;Let's look at how dbt transforms data under the hood to see why it doesn't currently transform streaming data in real-time.&lt;/p&gt;

&lt;p&gt;dbt users define the data transformations they want using dbt "models." A dbt model provides two pieces of information: a SELECT statement carrying out the desired transformation and a materialization parameter.&lt;/p&gt;

&lt;p&gt;dbt supports four different materializations: table, view, incremental, and ephemeral. Depending on the materialization, dbt either creates a table, creates a view, incrementally updates an existing table, or inlines the results as a common table expression (CTE) without persisting anything.&lt;/p&gt;

&lt;p&gt;These database objects are sufficient for batch data transformations. But they are not efficient at transforming streaming data.&lt;/p&gt;
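&lt;p&gt;To make this concrete, here is a minimal sketch of a dbt model, with a materialization setting on top of a SELECT statement (the model, column, and ref names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/user_event_counts.sql (hypothetical model)
{{ config(materialized='view') }}

SELECT
    user_id,
    count(*) AS event_count
FROM {{ ref('raw_events') }}
GROUP BY user_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When &lt;code&gt;dbt run&lt;/code&gt; executes this model, dbt wraps the SELECT in the DDL implied by the materialization, here a view definition.&lt;/p&gt;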

&lt;h2&gt;
  
  
  Real-Time Streaming Models Using dbt + Materialize
&lt;/h2&gt;

&lt;p&gt;To perform reliable, real-time transformations of streaming data, dbt would need to persist a database object that updates itself as new data arrives upstream. Fortunately, we have a database object that can do this: &lt;a href="https://materialize.com/why-use-a-materialized-view/"&gt;materialized views&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Materialize's materialized views, unlike typical materialized views, are constantly updated as new data arrives–no refreshes are required. Even better, they deliver real-time results with millisecond latency.&lt;/p&gt;

&lt;p&gt;So, what does this have to do with dbt and streaming data? It means that when you execute a dbt model on top of Materialize for the first time, dbt persists a materialized view. You'll never have to re-run your model again. Your model will remain up to date regardless of how much or how frequently your data arrives.&lt;/p&gt;
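&lt;p&gt;As a sketch, the object dbt persists on Materialize is an ordinary materialized view defined in SQL (the view, source, and column names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE MATERIALIZED VIEW order_totals AS
SELECT customer_id, sum(amount) AS total_spend
FROM orders
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Materialize keeps this result incrementally up to date as new &lt;code&gt;orders&lt;/code&gt; records arrive; querying the view returns the latest answer with no refresh step.&lt;/p&gt;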

&lt;h2&gt;
  
  
  Use dbt to Create Materialized Views for Streaming
&lt;/h2&gt;

&lt;p&gt;Unlike querying tables or logical views, querying materialized views can minimize query costs by storing results in memory and only updating them when necessary. &lt;/p&gt;

&lt;p&gt;Creating and maintaining a materialized view might help you save money on expensive or frequently run queries. The potential cost decrease of a materialized view, on the other hand, is highly dependent on its underlying refresh mechanism. &lt;/p&gt;

&lt;p&gt;Only incremental refreshes can lower the per-refresh cost of keeping a materialized view while simultaneously ensuring that views are up to date as needed.&lt;/p&gt;

&lt;p&gt;A key difference between Materialize and traditional data warehouses is that Materialize's materialized views work as constantly updated queries instead of cached result sets.&lt;/p&gt;

&lt;p&gt;So, if you're a dbt user who is familiar with batch procedures, you'll be delighted to know that Materialize's dbt adapter only needs you to execute "dbt run" once, and your data will stay up to date. Later in this article, we'll look at a use case to explain how that works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimize Your Workflow With Materialize + dbt
&lt;/h2&gt;

&lt;p&gt;The Materialize + dbt integration allows data engineers and analysts to collaborate across numerous data warehouses for a far simpler approach to data transformations.&lt;/p&gt;

&lt;p&gt;Connecting your GitHub account to dbt Cloud, for example, unlocks several useful capabilities. Once your GitHub account is linked, you can do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger CI builds when pull requests are opened in GitHub.&lt;/li&gt;
&lt;li&gt;Log in to dbt Cloud using GitHub OAuth.&lt;/li&gt;
&lt;li&gt;Add additional repositories to dbt Cloud with a single click.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Macros to Parameterize and Deploy Views
&lt;/h2&gt;

&lt;p&gt;Parameterizing and deploying views through macros is a fantastic technique for scaling Materialize pipelines. In dbt, SQL may be combined with &lt;a href="https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros"&gt;Jinja&lt;/a&gt;, a templating language.&lt;/p&gt;

&lt;p&gt;Using Jinja transforms your dbt project into a SQL programming environment, allowing you to perform things that aren't ordinarily feasible with SQL. For instance, with Jinja, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use control structures (such as if statements and for loops).&lt;/li&gt;
&lt;li&gt;Use environment variables for production deployments.&lt;/li&gt;
&lt;li&gt;Modify the way your project is built depending on the current target.&lt;/li&gt;
&lt;li&gt;Use the output of one query to generate another.&lt;/li&gt;
&lt;li&gt;Use macros.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Macros are chunks of code in Jinja that may be reused numerous times — like "functions" in other programming languages. They are convenient if you find yourself repeating code across multiple models.&lt;/p&gt;
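&lt;p&gt;A minimal sketch of a macro and its use in a model (all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}

-- in a model file:
SELECT {{ cents_to_dollars('amount_cents') }} AS amount
FROM {{ ref('payments') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any model in the project can now call &lt;code&gt;cents_to_dollars&lt;/code&gt; instead of repeating the conversion logic.&lt;/p&gt;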

&lt;h2&gt;
  
  
  Demo: Using dbt + Materialize to Stream Wikipedia Data
&lt;/h2&gt;

&lt;p&gt;This demo project illustrates how to turn &lt;a href="https://stream.wikimedia.org/?doc"&gt;streaming Wikipedia data&lt;/a&gt; into materialized views in Materialize using dbt. &lt;a href="https://github.com/MaterializeInc/materialize/tree/main/play/wikirecent-dbt"&gt;Refer to this guide&lt;/a&gt; to get everything you need to run dbt with Materialize.&lt;/p&gt;

&lt;p&gt;To start, let's set up a stream of Wikipedia's recent changes and write all the data we see to a file.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If you are running Materialize via a Docker container, run &lt;code&gt;docker exec -it [materialized container id] /bin/sh&lt;/code&gt; before curl-ing to create this file directly within the Materialize container.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From a new shell, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while true; do

curl --max-time 9999999 -N https://stream.wikimedia.org/v2/stream/recentchange &amp;gt;&amp;gt; wikirecent

done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the absolute path of the location of wikirecent, which we'll need in the next step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://materialize.com/docs/connect/cli/"&gt;Connect to your Materialize instance&lt;/a&gt; from your shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psql -U materialize -h localhost -p 6875 materialize

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, &lt;a href="https://materialize.com/docs/sql/create-source/text-file/#main"&gt;create a source&lt;/a&gt; using your wikirecent file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SCHEMA wikimedia;

CREATE SOURCE wikimedia.wikirecent

FROM FILE '[path to wikirecent]' WITH (tail = true)

FORMAT REGEX '^data: (?P&amp;lt;data&amp;gt;.*)';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This source takes the lines from the stream, finds those that begin with &lt;code&gt;data:&lt;/code&gt;, and captures the rest of each line in a column called &lt;code&gt;data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now we can use dbt to create materialized views on top of wikirecent. In your shell, navigate to play/wikirecent-dbt within the clone of the repo on your local machine. Once there, run the following dbt command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dbt run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: If the profiles.yml you're using for this project is not located at ~/.dbt/, you will have to provide additional information to dbt run.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This command generates executable SQL from our model files (found in the models directory of this project) and executes that SQL against the target database, creating our materialized views.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If you installed dbt-materialize in a virtual environment, ensure it's activated.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it! You just used dbt to create materialized views in Materialize. You can verify the views were created from your psql shell connected to Materialize.&lt;/p&gt;
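&lt;p&gt;For example, from psql you could run something like the following to confirm the views exist and are returning data (the view name here is hypothetical; yours will match the model names in the project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SHOW VIEWS;

SELECT * FROM user_edit_counts LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;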

</description>
      <category>pipelines</category>
      <category>dbt</category>
      <category>sql</category>
      <category>streamingsql</category>
    </item>
    <item>
      <title>Enable Enterprise-Grade Collaboration Features With Real-Time SQL</title>
      <dc:creator>upgrowcode</dc:creator>
      <pubDate>Tue, 01 Feb 2022 18:41:39 +0000</pubDate>
      <link>https://dev.to/upgrowcode/enable-enterprise-grade-collaboration-features-with-real-time-sql-4af2</link>
      <guid>https://dev.to/upgrowcode/enable-enterprise-grade-collaboration-features-with-real-time-sql-4af2</guid>
      <description>&lt;p&gt;SQL is a language used primarily to query and manipulate data in relational databases. It is one of the most widely used programming languages in use today.&lt;/p&gt;

&lt;p&gt;The SQL query language is a flexible and powerful tool used in any database. It provides the ability to create, query, update and delete data according to the needs of your enterprise.&lt;/p&gt;

&lt;p&gt;Enterprise-grade collaboration features offer real-time communication and co-creation between team members to facilitate better decision-making for business processes like supply chain management, outsourced R&amp;amp;D services, or sales &amp;amp; marketing initiatives.&lt;/p&gt;

&lt;p&gt;Using SQL, you can build applications with real-time collaboration features and more efficient content distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Real-Time SQL?
&lt;/h2&gt;

&lt;p&gt;SQL is the standard language used to define, query and maintain data in relational database management systems.&lt;/p&gt;

&lt;p&gt;Real-time SQL gathers data from users and sends results back from the database server as soon as they are available. It provides a single point of access to all data in the database. It also has the advantage of providing immediate feedback, which is beneficial for an application with many users: people can keep working on tasks without waiting around for their requests to be processed by the database server.&lt;/p&gt;

&lt;p&gt;The most crucial benefit of real-time SQL is that it reduces latency (the delay between a request and its response).&lt;/p&gt;

&lt;p&gt;You can also use real-time SQL with non-relational databases, such as NoSQL or object-relational databases, as well as with web apps, mobile apps and web services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Easily Search for Existing Questions and Answers
&lt;/h2&gt;

&lt;p&gt;Real-time SQL is a powerful tool for finding existing questions and answers in a database, letting you search across all databases on your server. It can be used for a wide variety of use cases and applications.&lt;/p&gt;

&lt;p&gt;For example, instead of waiting for the data to come back from MongoDB, you can perform a query on your database or any other database connected via ODBC or JDBC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create and Monitor Real-Time SQL Reports and Dashboards
&lt;/h2&gt;

&lt;p&gt;SQL reports are an essential part of most organizations, although these reports can be quite challenging to generate using traditional tools. This process can be automated using &lt;a href="https://materialize.com/use-cases/"&gt;real-time SQL dashboards&lt;/a&gt;, allowing for flexibility in creating and monitoring the reports.&lt;/p&gt;

&lt;p&gt;Creating real-time SQL reports and dashboards is like building a house. The foundation needs to be strong for the house to stand firm. If there are too many problems with the foundation, the house will fall sooner or later.&lt;/p&gt;

&lt;p&gt;The first step is selecting your tools and configuring them correctly according to your company's needs.&lt;/p&gt;

&lt;p&gt;The process begins by creating a report template with all the necessary components to create real-time dashboards. The next step is to download the report template and create a live connection to a database that you want to monitor. After this, you can create new queries or modify existing ones based on your needs.&lt;/p&gt;

&lt;p&gt;To create real-time SQL reports and dashboards, it is necessary to set up a database connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a Relational Database
&lt;/h2&gt;

&lt;p&gt;Relational database management systems (RDBMS) are among the most common database management systems in use today. Relational databases store data in a logically structured way.&lt;/p&gt;

&lt;p&gt;You can build your database in minutes rather than hours using real-time SQL, with the help of an API that allows you to connect to your database remotely via SQL commands.&lt;br&gt;
When it comes to databases in general, there are two broad types: relational and non-relational.&lt;/p&gt;

&lt;p&gt;Here is how to build a relational database using real-time SQL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an empty database&lt;/li&gt;
&lt;li&gt;Create a schema&lt;/li&gt;
&lt;li&gt;Create tables within it&lt;/li&gt;
&lt;li&gt;Create relationships between existing and new tables&lt;/li&gt;
&lt;li&gt;Perform queries or insert, read, update and delete data&lt;/li&gt;
&lt;/ol&gt;
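&lt;p&gt;A minimal sketch of those five steps in standard SQL, with illustrative names throughout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATABASE shop;                 -- 1. empty database
CREATE SCHEMA sales;                  -- 2. schema
CREATE TABLE sales.customers (        -- 3. tables
    id INT PRIMARY KEY,
    name TEXT
);
CREATE TABLE sales.orders (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES sales.customers (id),  -- 4. relationship
    amount NUMERIC
);
INSERT INTO sales.customers VALUES (1, 'Ada');        -- 5. insert...
SELECT name FROM sales.customers;                     -- ...and read it back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;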

&lt;h2&gt;
  
  
  Execute Complex Queries Across Enterprise Platforms To Fetch Live Data
&lt;/h2&gt;

&lt;p&gt;Using real-time SQL queries, enterprises can focus on the data they need to analyze and act on. They can avoid having to rebuild the query to get live data.&lt;/p&gt;

&lt;p&gt;There are various ways to execute complex queries across enterprise platforms. These methods include database triggers, stored procedures, or database views and functions.&lt;/p&gt;

&lt;p&gt;The advantage of using these methods is that they are easy to set up and maintain and allow for the flexibility of database design (more than with tables). The disadvantage is that there may be performance issues in large databases. The alternative is executing queries in an ad hoc manner, with limited performance benefits but greater flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualize Your Metrics in Real-Time
&lt;/h2&gt;

&lt;p&gt;Real-time SQL is a new feature that allows companies to view their metrics in real-time. It can be used in multiple ways, such as the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1:&lt;/strong&gt; You have a new project and need to see how your team is performing on it. With this feature, you can view your team's performance as soon as the work is finished.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2:&lt;/strong&gt; You want to monitor the quality of your company's performance in terms of key metrics such as bounce rate, average order value, conversion rate, etc.&lt;/p&gt;

&lt;p&gt;To visualize metrics in real-time with SQL software, you need to follow some simple steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first step is to gather the information you want to visualize. This includes the time range, metric type and field names for each metric.&lt;/li&gt;
&lt;li&gt;Secondly, set up some filters - for example, if you want to show all events that happened in a specific time range with a specific metric type and value.&lt;/li&gt;
&lt;li&gt;Thirdly, create an SQL query that will generate the desired visualization based on these filters.&lt;/li&gt;
&lt;li&gt;Finally, put all of this together in a graph with charting software like Tableau or Excel.&lt;/li&gt;
&lt;/ul&gt;
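&lt;p&gt;The query in the third step might look like this sketch (the table, column, and metric names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT date_trunc('hour', event_time) AS hour,
       avg(order_value) AS avg_order_value
FROM events
WHERE metric_type = 'order'
  AND event_time &amp;gt;= now() - INTERVAL '24 hours'  -- time-range filter
GROUP BY 1
ORDER BY 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The hourly rows this returns are what the charting tool plots.&lt;/p&gt;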

&lt;h2&gt;
  
  
  Have Your Results Securely Shared Across Databases
&lt;/h2&gt;

&lt;p&gt;SQL is the most widely used language for manipulating relational databases. It can get data from one or more tables and then analyze it, filter it and organize it. Real-time SQL is a powerful and popular feature that can also be used to share data securely within a database. It can save your team hours of non-productive time by taking the tedious work out of data processing.&lt;/p&gt;

&lt;p&gt;Real-time SQL software is not just limited to the use cases it was designed for. There are many ways to use it to make your life easier and more productive in other ways as well.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Real-time SQL will speed up your data processing if you need to process or summarize large amounts of text, data, or images. This will increase productivity and allow team members to work at quicker paces.&lt;/li&gt;
&lt;li&gt;You can also have your results shared across databases securely using real-time SQL so that your team does not have to worry about losing valuable records by copying and pasting them across different databases.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Achieve Hybrid Cloud Flexibility at Scale
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges in cloud computing is how to shift workloads to the cloud in a secure and manageable manner.&lt;/p&gt;

&lt;p&gt;Hybrid cloud flexibility has been achieved by leveraging real-time SQL functions. This function allows you to move workloads from on-premises or public clouds to hybrid clouds and vice versa.&lt;/p&gt;

&lt;p&gt;The idea of a hybrid cloud is to integrate two or more platforms into one infrastructure to increase agility, reduce costs and improve security.&lt;/p&gt;

&lt;p&gt;Several variables can offset the overhead of maintaining enterprise-class cloud infrastructure. By using real-time SQL and automation, you can shift workloads between public and private cloud environments to scale flexibly. This helps your business cut IT costs and reduce the time operations take.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run Queries Simultaneously With Other Users
&lt;/h2&gt;

&lt;p&gt;Real-time SQL is a database implementation that allows numerous users to run queries simultaneously without interfering with one another. It uses shared resources such as locks, semaphores and buffers, specially designed for concurrent access by multiple users, to provide an extremely high degree of concurrency with low contention on shared resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagnose Root Causes Faster by Correlating Metrics Across Teams
&lt;/h2&gt;

&lt;p&gt;For many companies, their IT infrastructure is their competitive edge. By having a robust IT infrastructure, they can be more competitive in the market. To ensure that their IT infrastructure is as effective as possible, they need to stay ahead of the curve and keep learning from each other to &lt;a href="https://materialize.com/streaming-sql-intro/"&gt;find root causes faster&lt;/a&gt;. To do this, IT teams are using real-time SQL to diagnose root causes more quickly by correlating metrics across teams and getting a better understanding of how changes in one area impact other areas.&lt;/p&gt;

&lt;p&gt;Real-time SQL also allows you to collect metrics every second instead of every five minutes or more, making cross-team correlations far more precise and helping you reach root causes faster.&lt;/p&gt;

&lt;p&gt;There are two types of RDS queries: real-time queries and stored queries.&lt;/p&gt;

&lt;p&gt;A real-time query starts with a stored query and then updates it on the fly as new data becomes available.&lt;/p&gt;

&lt;p&gt;A stored query is when you want to track the change over time but only retrieve the results at a previous point in time (e.g., one hour ago).&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready To Start Monitoring Your Enterprise Data in Real-Time?
&lt;/h2&gt;

&lt;p&gt;In the last decade, online collaboration has become an essential aspect of conducting business. Many new technologies have allowed collaboration to become more accessible and seamless, but there is still room for improvement. Real-Time SQL is a new technology that can provide real-time collaboration capabilities.&lt;/p&gt;

&lt;p&gt;The process of creating and monitoring real-time SQL reports and dashboards is complicated and time-consuming, but with a streaming database, SQL reports can be generated with ease. As you can see, SQL has ways to improve collaboration and overall productivity. No wonder companies in hundreds of industries are adopting this technology in droves.&lt;/p&gt;

</description>
      <category>realtimesql</category>
      <category>sql</category>
    </item>
    <item>
      <title>Real-Time Dashboard Using Kafka</title>
      <dc:creator>upgrowcode</dc:creator>
      <pubDate>Tue, 01 Feb 2022 18:31:38 +0000</pubDate>
      <link>https://dev.to/upgrowcode/real-time-dashboard-using-kafka-2hh2</link>
      <guid>https://dev.to/upgrowcode/real-time-dashboard-using-kafka-2hh2</guid>
      <description>&lt;p&gt;As data becomes increasingly complex and voluminous, it's more important than ever to have a fast and reliable way to process it. Apache Kafka is an ideal tool to help you build a real-time dashboard to keep track of your business operations. This article will discuss how to create a real-time dashboard using Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Real-Time Dashboard?
&lt;/h2&gt;

&lt;p&gt;Dashboards are an information management tool that allows executives to evaluate essential data in an easily digestible manner. Consequently, leaders are better equipped to recognize and reverse harmful patterns in company performance, understand which sections of the organization are functioning well, and discover the most significant prospects for progress.&lt;/p&gt;

&lt;p&gt;A real-time dashboard is a visualization that is constantly updated with new data. The majority of these visualizations employ a combination of historical data and real-time information to discover emerging patterns or monitor efficiency. &lt;br&gt;
The fact that real-time dashboards' information is time-sensitive is essential. IT organizations use specialized software tools to collect computer and user-generated data, aggregate that data into useful information, and present it in real-time dashboards to the appropriate person who can use it to gain insight and improve decision-making. &lt;/p&gt;

&lt;p&gt;In today's digital environment, real-time SQL and obtaining business intelligence insights from processed data has become a popular trend. Real-time data helps IT companies and executives to respond to business, security, and operational concerns more swiftly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Kafka as an Event Streaming Platform for Large Data Volumes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kafka.apache.org/"&gt;Kafka&lt;/a&gt; is a free and open-source software framework for storing, reading, and analyzing streaming data. Kafka is meant to function in a "distributed" environment. Instead of running on a single user's computer, it runs across multiple servers, exploiting a greater processing power and storage capacity.&lt;/p&gt;

&lt;p&gt;Kafka operates as a cluster, storing messages from one or more producers. The streaming data is divided into categories known as topics. The producer might be one or several web hosts or web servers that broadcast the data. The producer publishes data on a particular topic, and the consumers "listen" to the topic and continuously consume the data.&lt;/p&gt;

&lt;p&gt;Businesses frequently use Kafka to build real-time SQL data pipelines because it can extract high-velocity, high-volume data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do I Make a Real-Time Dashboard Using Kafka?
&lt;/h2&gt;

&lt;p&gt;There are several ways to build a real-time SQL dashboard using Kafka. In this case, we will use two technologies that can be used for a prevalent scenario. The technologies are &lt;a href="https://debezium.io/"&gt;Debezium&lt;/a&gt; and &lt;a href="https://materialize.com/"&gt;Materialize&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The scenario is the following: Data from a Kafka stream needs to be joined with a database table, and the results need to be presented in a real-time dashboard. The data in both sources - the stream and the table - is, of course, changing.&lt;/p&gt;

&lt;p&gt;These are examples of situations that would require the above-described solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sensor data analysis – A sensor configuration table needs to be joined with IoT sensor stream data in Kafka.&lt;/li&gt;
&lt;li&gt;API usage analysis – Combining API logs in a Kafka stream with a user table.&lt;/li&gt;
&lt;li&gt;Affiliate program analysis – Pageviews data in a Kafka stream combined with a user table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following sections describe the general approach, but you can consult the practical steps of &lt;a href="https://materialize.com/join-kafka-with-database-debezium-materialize/"&gt;joining Kafka with a database using Debezium and Materialize&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stream the database into Kafka using Debezium
&lt;/h2&gt;

&lt;p&gt;Debezium is a &lt;a href="https://materialize.com/change-data-capture-part-1/"&gt;Change Data Capture (CDC)&lt;/a&gt; software that uses logs to detect database changes and propagates them to Kafka. Whenever you insert, edit, or delete a record in your database, an event containing information about the change is instantly emitted.&lt;/p&gt;

&lt;p&gt;This procedure occurs automatically, with no need for you to write a single line of code, almost as if it were a feature of the database itself. Therefore, this process ensures that every change is recorded. In other words, the database and Kafka stream are assured to be consistent.&lt;/p&gt;
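&lt;p&gt;Setting this up amounts to posting a connector configuration to Kafka Connect. A minimal sketch for a PostgreSQL source looks like this (the connector name, hostnames, and credentials are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "inventory"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once registered, Debezium emits one Kafka topic of change events per captured table.&lt;/p&gt;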

&lt;h2&gt;
  
  
  Connect Kafka and Materialize a View
&lt;/h2&gt;

&lt;p&gt;The next step is to materialize the Kafka stream and CDC data into a materialized view that holds the structure we need. Materialize, a state-of-the-art engine for materializing views on rapidly changing data streams, is used for this.&lt;/p&gt;

&lt;p&gt;Materialize is helpful for this challenge for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's capable of sophisticated JOINs – Materialize supports JOINs far more broadly than other streaming platforms.&lt;/li&gt;
&lt;li&gt;It's strongly consistent – In a streaming solution, &lt;a href="https://materialize.com/eventual-consistency-isnt-for-streaming/"&gt;eventual consistency might lead to unexpected consequences&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Views are created in conventional SQL – making it simple to connect to and query the results using existing libraries.&lt;/li&gt;
&lt;/ul&gt;
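&lt;p&gt;As an illustration, wiring up the stream and defining the view is plain SQL in Materialize. The broker, topic, view, and field names below are hypothetical, and &lt;code&gt;users&lt;/code&gt; is assumed to be another source, e.g. the Debezium CDC topic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SOURCE pageviews
FROM KAFKA BROKER 'localhost:9092' TOPIC 'pageviews'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';

CREATE MATERIALIZED VIEW pageviews_by_user AS
SELECT u.name, count(*) AS views
FROM pageviews p
JOIN users u ON p.user_id = u.id
GROUP BY u.name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The JOIN across the stream and the table is kept up to date incrementally as either side changes.&lt;/p&gt;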

&lt;h2&gt;
  
  
  Create a real-time dashboard
&lt;/h2&gt;

&lt;p&gt;The final and most important aspect of analytics is viewing and engaging with data. Dashboards can be designed to continuously refresh and offer in-page filtering with a comprehensive collection of visualizations.&lt;/p&gt;

&lt;p&gt;There are two primary ways to access the output of the Materialize view:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poll&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL query – Materialize does not recompute the results on each query; the computing is only done when new data arrives from Kafka. Therefore, polling with a query every second is fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Push&lt;/strong&gt;&lt;br&gt;
Materialize streams output via: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TAIL – You can stream changes to views using the TAIL command.&lt;/li&gt;
&lt;li&gt;Sink out to a new Kafka topic – You can use a sink to stream data out into another Kafka topic.&lt;/li&gt;
&lt;/ul&gt;
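&lt;p&gt;For example, from a psql session connected to Materialize you can stream every change to a view as it happens (the view name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- emit a change feed of the view to stdout
COPY (TAIL pageviews_by_user) TO STDOUT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;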

&lt;p&gt;From here, you can interface with any dashboard engine. For example, a solution like Kibana may be utilized to interface with Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are The Advantages of Using Kafka?
&lt;/h2&gt;

&lt;p&gt;A few of the advantages of using Kafka for real-time dashboards include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-throughput – Kafka is capable of handling high-velocity and high-volume data.&lt;/li&gt;
&lt;li&gt;Low Latency – Kafka can handle messages with very low latency (in the range of milliseconds).&lt;/li&gt;
&lt;li&gt;Fault-Tolerant – Kafka tolerates node failure within a cluster because it's distributed.&lt;/li&gt;
&lt;li&gt;Durability – Messages are persisted on disk. However, it &lt;a href="https://materialize.com/kafka-is-not-a-database/"&gt;should not be used as a database&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Scalability – Kafka can be scaled by adding nodes. Capabilities like replication and partitioning contribute to its scalability.&lt;/li&gt;
&lt;li&gt;Real-Time – Because of a combination of the above features, Kafka can be ideal for handling real-time data pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Kafka And Materialize to Create Real-Time Dashboards
&lt;/h2&gt;

&lt;p&gt;Making operational decisions based on the most recent data is a competitive advantage for every business. However, real-time SQL pipelines and dashboard engineering are complex tasks requiring the correct technologies to be used effectively.&lt;/p&gt;

&lt;p&gt;Luckily, powerful tools like log-based CDC and Materialize can be used for combining, reducing, and aggregating high-volume streams of data from Kafka into any output format your dashboard requires.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>realtimeanalytics</category>
      <category>database</category>
      <category>materializedviews</category>
    </item>
  </channel>
</rss>
