<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kovid Rathee</title>
    <description>The latest articles on DEV Community by Kovid Rathee (@kovidr).</description>
    <link>https://dev.to/kovidr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F391473%2F9b644858-c1b5-4e94-9002-f9654d66a649.jpg</url>
      <title>DEV Community: Kovid Rathee</title>
      <link>https://dev.to/kovidr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kovidr"/>
    <language>en</language>
    <item>
      <title>Optimize Read Performance in Supabase with Postgres Materialized Views</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Tue, 30 Sep 2025 13:51:06 +0000</pubDate>
      <link>https://dev.to/kovidr/optimize-read-performance-in-supabase-with-postgres-materialized-views-12k5</link>
      <guid>https://dev.to/kovidr/optimize-read-performance-in-supabase-with-postgres-materialized-views-12k5</guid>
      <description>&lt;p&gt;A &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;Postgres&lt;/a&gt; database typically consists of schemas that serve single-table &lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/CRUD" rel="noopener noreferrer"&gt;CRUD (create, read, update, delete) operations&lt;/a&gt;. While CRUD operations handle basic transactional needs, applications also require more complex queries. These often involve multiple joins, nested CTEs, and intricate predicates for read-heavy workloads like embedded reporting and analytics. Such multitable queries can degrade overall database performance and block CRUD operations through locking. Postgres materialized views provide a solution by separating read-heavy workloads from transactional processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://supabase.com/blog/postgresql-views#what-is-a-materialized-view" rel="noopener noreferrer"&gt;Materialized views&lt;/a&gt; are just like &lt;a href="https://supabase.com/blog/postgresql-views#what-is-a-view" rel="noopener noreferrer"&gt;standard views&lt;/a&gt; but with one key difference—materialized views are precomputed and stored on disk. They save the query results in separate database objects, enabling you to run queries on top of a static snapshot instead of live underlying tables. The precomputation can help you so things like serving reports and enabling filtered data exports from multiple tables, where the queries can get quite complex.&lt;/p&gt;

&lt;p&gt;This tutorial explains how I improved my application's read performance using Postgres materialized views in &lt;a href="https://supabase.com/" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Materialized Views in Postgres
&lt;/h2&gt;

&lt;p&gt;Unlike a standard view, a materialized view has an attached storage component; this means that materialized views can be precomputed and stored while a standard view cannot. Under the hood, both types of views follow Postgres's rule system. A table is designed to support all kinds of CRUD operations. In contrast, a materialized view is intended to be refreshed by a prewritten SQL query attached to its definition.&lt;/p&gt;

&lt;p&gt;A materialized view is a point-in-time snapshot of a query result. This snapshot unlocks a wide range of query performance benefits, including near-instant query execution, since the results are precomputed. It also decouples typical CRUD-type operations from read-heavy operations that span multiple tables and complex queries, and it minimizes the long-running lock contentions that read-heavy operations can otherwise cause on live tables.&lt;/p&gt;

&lt;p&gt;Before using materialized views, you should consider that their data storage component will impact the database storage cost. Materialized views also don't provide the most up-to-date data. They're most suitable for use cases that can function with precomputed data. The article covers some other overheads associated with materialized views later, but first, let's look at why I chose to use materialized views to improve my query performance in Supabase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Materialized Views
&lt;/h2&gt;

&lt;p&gt;Supabase uses &lt;a href="https://docs.postgrest.org/en/v13/" rel="noopener noreferrer"&gt;PostgREST&lt;/a&gt; instead of manual SQL-based CRUD programming and traditional &lt;a href="https://www.theserverside.com/definition/object-relational-mapping-ORM" rel="noopener noreferrer"&gt;ORMs (object-relational mappers)&lt;/a&gt;. PostgREST converts your entire database into a RESTful API—the &lt;a href="https://supabase.com/docs/guides/api" rel="noopener noreferrer"&gt;Supabase API&lt;/a&gt;—that allows you to interact with your database through standard API endpoints.&lt;br&gt;
Read-heavy queries slow API response times because they compute results on the fly, requiring the API to wait for database processing. Materialized views address this issue by pre-computing expensive joins and aggregations, eliminating the need for real-time calculations in performance-critical applications.&lt;br&gt;
Materialized views also have security implications. Unlike tables, Postgres doesn't support &lt;a href="https://supabase.com/docs/guides/database/postgres/row-level-security" rel="noopener noreferrer"&gt;RLS (Row Level Security) policies&lt;/a&gt; directly on materialized views, so a materialized view doesn't enforce the RLS policies of its underlying tables. If you expose one through the API, control access with &lt;a href="https://supabase.com/docs/guides/api/securing-your-api" rel="noopener noreferrer"&gt;schema and privilege settings&lt;/a&gt;, for example by revoking access from the &lt;code&gt;anon&lt;/code&gt; role or keeping the view out of the exposed schema.&lt;/p&gt;
&lt;h2&gt;
  
  
  Creating and Using Materialized Views in Supabase
&lt;/h2&gt;

&lt;p&gt;Let's look at how to create and use materialized views in Supabase using an example online storefront for an e-commerce business with the following database tables: &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, and &lt;code&gt;order_payments&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Consider this scenario: The storefront owner logs into a portal that displays a month-on-month performance report. The summary should only include successful orders (orders that have been paid for) and exclude orders where a refund has been processed. Running this query every time the storefront owner logs in would be slow as it would involve filtering, aggregating, and joining data from at least three of the four tables mentioned above. The query would look something like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH successful_orders AS (
  SELECT o.user_id, 
         o.order_id,
         o.original_amount_usd, 
         o.discount,
         o.final_amount_usd, 
         op.payment_amount_usd,
         o.order_date,
         TO_CHAR(order_date, 'YYYY-MM') order_month
    FROM orders o
    INNER JOIN order_payments op ON o.order_id = op.order_id
   WHERE o.order_status = 'SUCCESSFUL'
     AND op.order_payment_status = 'SUCCESSFUL'
     AND ROUND(o.final_amount_usd,2) = ROUND(op.payment_amount_usd,2)
     AND op.refund_processed = FALSE
)

SELECT 
    order_month,
    COUNT(DISTINCT user_id) unq_customers,
    COUNT(order_id) unq_orders,
    SUM(original_amount_usd) total_original_amount,
    SUM(payment_amount_usd) total_payment_amount
FROM successful_orders
GROUP BY order_month
ORDER BY order_month DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this query on demand could be quite expensive and frustrating for the storefront owner if they log in to check the status several times daily. In this situation, having a materialized view based on this query that materializes every few hours will be quicker and cheaper as the materialized view's performance will be similar to that of a regular table. Here's how you can create a materialized view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE MATERIALIZED VIEW mv_successful_orders AS ... ;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
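&lt;p&gt;For the monthly report in this scenario, the elided part of the definition is simply the report query from the previous section. A full version of the statement would look something like this (the view name matches the abbreviated statement above):&lt;br&gt;
&lt;/p&gt;

```sql
CREATE MATERIALIZED VIEW mv_successful_orders AS
WITH successful_orders AS (
  SELECT o.user_id,
         o.order_id,
         o.original_amount_usd,
         o.discount,
         o.final_amount_usd,
         op.payment_amount_usd,
         o.order_date,
         TO_CHAR(o.order_date, 'YYYY-MM') order_month
    FROM orders o
    INNER JOIN order_payments op ON o.order_id = op.order_id
   WHERE o.order_status = 'SUCCESSFUL'
     AND op.order_payment_status = 'SUCCESSFUL'
     AND ROUND(o.final_amount_usd, 2) = ROUND(op.payment_amount_usd, 2)
     AND op.refund_processed = FALSE
)
SELECT order_month,
       COUNT(DISTINCT user_id) unq_customers,
       COUNT(order_id) unq_orders,
       SUM(original_amount_usd) total_original_amount,
       SUM(payment_amount_usd) total_payment_amount
  FROM successful_orders
 GROUP BY order_month
 ORDER BY order_month DESC;
```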



&lt;p&gt;Once this materialized view is created, it will hold the results of the underlying query until the next time it's refreshed. You can access a materialized view using the same &lt;code&gt;SELECT&lt;/code&gt; statement as you do for a standard view or a table. Deleting a materialized view is similar to deleting any other database object, like a table or a standard view, for which you can use the &lt;code&gt;DROP&lt;/code&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP MATERIALIZED VIEW mv_successful_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating and dropping a materialized view works in a similar way to creating and dropping tables and regular views. If you want to reload data into a static table, you would need to &lt;code&gt;TRUNCATE&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt; the new records, or you can &lt;code&gt;DROP&lt;/code&gt; and &lt;code&gt;CREATE&lt;/code&gt; the table again. Regular views don't have this requirement as they don't store data on disk. Materialized views, on the other hand, need to be refreshed to provide a fresher copy of the data from the underlying query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Refreshing Materialized Views in Supabase
&lt;/h2&gt;

&lt;p&gt;I'll now explain how to keep the data in the materialized view fresh and up-to-date based on your application's needs.&lt;/p&gt;

&lt;p&gt;Running the underlying query (for example, the one defined using a &lt;a href="https://www.postgresql.org/docs/current/queries-with.html" rel="noopener noreferrer"&gt;CTE (common table expression)&lt;/a&gt; in the previous section) brings the up-to-date data into the materialized view. This process is called a refresh. Refreshing a materialized view is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REFRESH MATERIALIZED VIEW mv_successful_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
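&lt;p&gt;One caveat worth knowing: a plain &lt;code&gt;REFRESH&lt;/code&gt; takes an exclusive lock on the materialized view, blocking reads while it runs. Postgres also supports &lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt;, which keeps the view readable during the refresh but requires at least one unique index on the view. A sketch (the index name here is illustrative):&lt;br&gt;
&lt;/p&gt;

```sql
-- CONCURRENTLY requires a unique index on the materialized view;
-- order_month is unique here because the view groups by it
CREATE UNIQUE INDEX idx_mv_successful_orders_month
    ON mv_successful_orders (order_month);

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_successful_orders;
```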



&lt;p&gt;There are many ways to refresh materialized views. You can run an on-demand refresh from the command line, refresh based on a database trigger, or refresh based on logic built into the application code. Each of these methods might be suitable for separate roles, such as data engineers, frontend engineers, and so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supabase Edge Functions
&lt;/h3&gt;

&lt;p&gt;You can call Supabase Edge Functions directly from your application, and you can conditionally refresh the materialized view based on the conditions specified in the function code, preventing unnecessary refreshes. Here's a code snippet showing how to refresh the previously defined &lt;code&gt;mv_successful_orders&lt;/code&gt; view by calling a Supabase Edge Function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  const response = await fetch(`${supabaseUrl}/functions/v1/refresh-materialized-view`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${supabaseAnonKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      mv_name: 'mv_successful_orders',
      use_concurrent: true
    })
  })

  const result = await response.json()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manual Refreshes Using SQL Statements
&lt;/h3&gt;

&lt;p&gt;You can run SQL statements directly in the Supabase dashboard, or wrap them in Postgres functions and call them from your client application through the Supabase SDK via RPCs (remote procedure calls). An RPC call would look something like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  const result = await supabase.rpc('refresh_mv_successful_orders')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
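&lt;p&gt;The &lt;code&gt;supabase.rpc&lt;/code&gt; call above assumes a Postgres function named &lt;code&gt;refresh_mv_successful_orders&lt;/code&gt; already exists in the database. A minimal sketch of such a wrapper function might look like this:&lt;br&gt;
&lt;/p&gt;

```sql
-- SECURITY DEFINER lets callers refresh the view without owning it;
-- only refreshing the view's owner role can otherwise run REFRESH
CREATE OR REPLACE FUNCTION refresh_mv_successful_orders()
RETURNS void
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
BEGIN
  REFRESH MATERIALIZED VIEW mv_successful_orders;
END;
$$;
```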



&lt;h3&gt;
  
  
  Automatic Refreshes Using Postgres Triggers
&lt;/h3&gt;

&lt;p&gt;You can also trigger materialized view refreshes in response to database changes using Supabase Realtime Broadcast or Supabase Realtime Postgres Changes. This method is configured on the backend rather than the frontend, as it relies on the database change log and database triggers.&lt;/p&gt;
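&lt;p&gt;Another common backend option, assuming the &lt;code&gt;pg_cron&lt;/code&gt; extension is enabled on your Supabase project, is to schedule the refresh as a recurring job rather than reacting to individual changes. A sketch (the job name is illustrative):&lt;br&gt;
&lt;/p&gt;

```sql
-- Assumes the pg_cron extension is enabled on the project
SELECT cron.schedule(
  'refresh-mv-successful-orders',  -- job name (illustrative)
  '0 * * * *',                     -- every hour, on the hour
  'REFRESH MATERIALIZED VIEW mv_successful_orders;'
);
```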

&lt;h2&gt;
  
  
  Defining Your Refresh Strategy
&lt;/h2&gt;

&lt;p&gt;No matter which method you use, you'll need a smart refresh strategy to benefit from the performance and cost gains of materialized views. Refreshing too frequently can cause the same performance issues you might encounter with regular views or on-the-fly SQL queries, so materialized views are most effective when you strike the right balance between data freshness and query performance. To decide on the correct strategy, you need to answer the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data freshness:&lt;/strong&gt; How frequently the application needs the data refreshed is the key factor. For example, if application users typically view a report once an hour, it would make sense to refresh the materialized view once an hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost and maintenance:&lt;/strong&gt; You must understand the cost of running frequent refreshes, including query execution, index recreation, and potential blocking of CRUD operations during refreshes. This impact grows with large data sets. To reduce cost and contention, limit the data in the materialized view, such as by filtering by date ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time of refresh:&lt;/strong&gt; It helps to refresh materialized views when the database usage is low in terms of compute and memory or when there are less critical workloads running on the database. This way, even if the refresh takes up a lot of resources, it doesn't impact other important database operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As mentioned earlier, the best way to determine the right strategy is to strike a balance among the three factors mentioned above. All in all, you want to come up with a strategy that fulfills the data freshness needs, doesn't refresh more frequently than required, and times the refreshes intelligently based on existing database workloads.&lt;/p&gt;
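<p>As a concrete illustration of this balancing act, the decision of whether a refresh is due can be reduced to a small helper in application code. This is only a sketch; the names and thresholds are illustrative and not part of Supabase:</p>

```javascript
// Decide whether a materialized view refresh is due, based on how stale the
// data is (freshness requirement) and how busy the database currently is
// (timing requirement). All names and thresholds here are illustrative.
function isRefreshDue(lastRefreshMs, nowMs, freshnessMs, dbLoad, maxLoad = 0.8) {
  const age = nowMs - lastRefreshMs;
  if (age < freshnessMs) return false; // data is still fresh enough
  if (dbLoad > maxLoad) return false;  // defer: database is under heavy load
  return true;                         // stale, and the database is quiet
}
```

<p>For example, with an hourly freshness requirement, a view last refreshed 90 minutes ago on a quiet database is due for a refresh, while the same view on a heavily loaded database would have its refresh deferred.</p>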

&lt;h2&gt;
  
  
  Best Practices for Materialized Views: When and How to Use Them
&lt;/h2&gt;

&lt;p&gt;Materialized views can be a powerful tool for optimizing database performance, but they're not suitable for every scenario. Understanding when and how to use them effectively is crucial for maintaining both performance and system efficiency.&lt;/p&gt;

&lt;p&gt;For instance, if your application needs fresh data all the time or if the resulting data sets are huge, materialized views aren't typically a suitable option. However, materialized views would be a good option for an application that requires a lightweight reporting and analytics layer, but where creating a dedicated data warehouse or data lake would involve excessive cost and maintenance overhead. They can serve predefined reports, dashboards, and data extracts without overburdening the database and the application with repetitive and unnecessary queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use materialized views for the following cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use materialized views for complex queries that deal with multiple tables, especially with large volumes of data.&lt;/li&gt;
&lt;li&gt;Use materialized views when the underlying tables have a high read-to-write ratio. This means the data doesn't change often, so the materialized view doesn't need to be refreshed frequently.&lt;/li&gt;
&lt;li&gt;Use materialized views when your queries have complex logic with filters, aggregates, and window functions as these are operations that can slow down and potentially block crucial CRUD operations of your application.&lt;/li&gt;
&lt;li&gt;Use materialized views for queries that are run frequently. The maximum positive impact of using materialized views occurs when your application uses the precomputed data in the materialized view frequently as it saves computation cost every time the query is executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Materialized views shouldn't be used for the following:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not use materialized views if you need the data to be refreshed very frequently; performance-wise, the experience will be roughly equivalent to running live complex queries on tables, with little or no cost or performance benefit.&lt;/li&gt;
&lt;li&gt;Do not use materialized views when the underlying data is too small and the queries on the tables or standard views are fast enough for your application's performance needs. Precomputation only helps if the queries are slowing down your application's or the database's performance.&lt;/li&gt;
&lt;li&gt;Do not overuse materialized views. They come with their own complexities around index maintenance and vacuuming, among other things. Overusing them can increase the maintenance overhead faster than it delivers performance benefits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're using materialized views, you should also follow some best practices to optimize them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use indexes to enable faster point lookups and aggregates. Just like Postgres tables, materialized views also support &lt;a href="https://www.postgresql.org/docs/current/rules-materializedviews.html" rel="noopener noreferrer"&gt;various types of indexes&lt;/a&gt;, which can significantly speed up point lookups and aggregate queries.&lt;/li&gt;
&lt;li&gt;Implement a refresh strategy that doesn't burden the database with lock contentions—that is, don't concurrently refresh multiple heavy materialized views that are built on the same underlying tables or standard views.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To reiterate, the decision to use materialized views, like many other database features such as triggers, user-defined functions, and stored procedures, depends on your specific business use case, cost considerations, and maintenance overhead tolerance. You need to find the right balance to determine if materialized views make sense for your particular scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article covered some of the key benefits of materialized views, how to create, delete, and refresh materialized views in Supabase, and some of the best practices and recommendations for using them.&lt;/p&gt;

&lt;p&gt;With materialized views, you can offload read-heavy and compute-heavy queries by precomputing the results your application needs. You can schedule or trigger refreshes when database utilization is low, freeing up resources for the application and turbocharging performance with faster, table-like reads. This is especially beneficial for reporting, analytics, and other OLAP (online analytical processing) use cases.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>performance</category>
      <category>supabase</category>
    </item>
    <item>
      <title>Finding missing data in your database with QuestDB</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Sat, 28 Jan 2023 04:26:34 +0000</pubDate>
      <link>https://dev.to/kovidr/finding-missing-data-in-your-database-with-questdb-1phb</link>
      <guid>https://dev.to/kovidr/finding-missing-data-in-your-database-with-questdb-1phb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjqj04rg2znolr10ui0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjqj04rg2znolr10ui0p.png" alt="Finding missing data in your database with QuestDB" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Whether you are just starting to work with a specific data set or monitoring activities and reports based on existing data sets, one of the first things you need to consider is the quality of the data you’re dealing with. Continuity is one of the most critical factors in gauging the quality of time-series data. Time-series systems usually serve use cases where data needs to be consumed, processed, and acted upon with urgency. &lt;/p&gt;

&lt;p&gt;Take the example of a public transport vehicle. For reasons pertaining to the safety of passengers and the timeliness of the service, vehicles depend on a range of sensors, such as GPS, proximity sensors, pressure sensors, and engine diagnostics sensors, working continuously. Continuously using the data from these sensors helps the public transport service guarantee timeliness, safety, and reliability. A break in the data coming from these sensors, however, means that there's a problem.&lt;/p&gt;

&lt;p&gt;Most data access frameworks, including query languages and importable libraries, allow you to filter and see columns or rows where data is missing. Nowhere are data continuity and completeness more relevant than with time-series data. By definition, time-series data needs to be continuous, though the required granularity of that continuity differs between use cases.&lt;/p&gt;

&lt;p&gt;When you have to test your data for completeness in a relational database, you often have to write complex SQL queries paired with intermediate or temporary tables to find missing data. In some cases, these queries can be tedious and non-performant. QuestDB is a time-series database that lets you store and consume your data in tabular form, but it’s not what you would call a traditional relational database. To cater to the time-series workloads, QuestDB extends the standard SQL functionalities using SQL extensions. One of these extensions is the &lt;a href="https://questdb.io/docs/reference/sql/sample-by/" rel="noopener noreferrer"&gt;&lt;code&gt;SAMPLE BY&lt;/code&gt;&lt;/a&gt; extension, which allows you to find and deal with missing data with ease.&lt;/p&gt;

&lt;p&gt;This tutorial will take you through how to use &lt;a href="https://questdb.io/blog/2022/11/23/sql-extensions-time-series-data-questdb-part-ii/" rel="noopener noreferrer"&gt;QuestDB's SQL extensions&lt;/a&gt; to find gaps in your data without any complex queries or overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;To demonstrate finding gaps in time-series data, we'll be using the &lt;code&gt;trades&lt;/code&gt; dataset, which is readily available on the QuestDB demo website. The &lt;code&gt;trades&lt;/code&gt; dataset contains real-time anonymized trades data for Bitcoin and Ethereum in US Dollars from 8th March 2022 to date. Here's the table structure of the &lt;code&gt;trades&lt;/code&gt; dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="s1"&gt;'trades'&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="n"&gt;SYMBOL&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt; &lt;span class="k"&gt;CACHE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;side&lt;/span&gt; &lt;span class="n"&gt;SYMBOL&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt; &lt;span class="k"&gt;CACHE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details about the dataset and the process used to ingest data into QuestDB, you can go through &lt;a href="https://questdb.io/blog/2022/04/12/demo-live-crypto-data-streamed-with-questdb-and-grafana/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;. Now that you understand the structure and contents of the &lt;code&gt;trades&lt;/code&gt; dataset, let's try to figure out if anything is missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding missing data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Using QuestDB
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier in the article, you can use SQL extensions to find missing data in QuestDB. There are three keywords (or SQL keyphrases) you need to know that are unique to QuestDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://questdb.io/docs/reference/sql/sample-by/" rel="noopener noreferrer"&gt;&lt;code&gt;SAMPLE BY&lt;/code&gt;&lt;/a&gt;   allows you to create groups and buckets of data based on time ranges.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://questdb.io/docs/reference/sql/sample-by#fill-options" rel="noopener noreferrer"&gt;&lt;code&gt;FILL&lt;/code&gt;&lt;/a&gt;   allows you to specify a fill behaviour when using &lt;code&gt;SAMPLE BY&lt;/code&gt;, which, in turn, allows you to perform time-series interpolation on the data. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://questdb.io/docs/reference/sql/sample-by#align-to-calendar" rel="noopener noreferrer"&gt;&lt;code&gt;ALIGN TO CALENDAR&lt;/code&gt;&lt;/a&gt;   allows you to align your time buckets to a calendar date based on a timezone or an offset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find missing data using a combination of the aforementioned SQL extensions. First, let's look at a basic query using these extensions to get a day-by-day count of trades for December 2022 to date, using the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;trades_in_december_2022&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'trades'&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="s1"&gt;'2022-12'&lt;/span&gt;
&lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;
 &lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;CALENDAR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this query gives us the following output when selecting the “Draw” option in the Chart view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4j00j14mjxbib18wn09.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4j00j14mjxbib18wn09.jpg" alt="Basic query for a day-to-day count of trades for December 2022" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that running such a simple aggregate query at a coarse granularity is possible on any database. It only becomes a problem for other databases when the data is highly granular, especially in real time. Now that it's clear how to use the SQL extensions you need, let's move on to the query that finds missing data.&lt;/p&gt;

&lt;p&gt;In the query, we'll find the volume-weighted average price (VWAP) indicator for the &lt;code&gt;trades&lt;/code&gt; dataset. The key idea is to get all the timestamps where we don't have data to calculate VWAP for all the Bitcoin trades from the starting date of the dataset till now. In the following query, you can see that the trades are sampled by 1 second using the &lt;code&gt;SAMPLE BY 1s&lt;/code&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;extract&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vwap_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'BTC-USD'&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'d'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
 &lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;CALENDAR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;vwap_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;volume&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;extract&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;vwap_price&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;vwap_price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the query doesn't result in anything, as shown in the image below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk8k1leryisqh1fne7rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk8k1leryisqh1fne7rd.png" alt="Basic query that samples trades by 1 second" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why? Because, by default, QuestDB won't return anything for a timestamp or timestamp range that contains no data for the &lt;code&gt;SAMPLE BY&lt;/code&gt; aggregator. To surface the missing intervals in the results, you will need to use the &lt;code&gt;FILL&lt;/code&gt; keyword like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;FILL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using this, QuestDB generates a row for every empty sample interval and fills &lt;code&gt;vwap_price&lt;/code&gt; with &lt;code&gt;NULL&lt;/code&gt;. The complete query for finding missing data will look something like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;extract&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vwap_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'BTC-USD'&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'d'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;FILL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;CALENDAR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;vwap_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;volume&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;extract&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;vwap_price&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run the query, you will get all the 1s windows where the data was missing, as shown in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vojk2v92kkt5tmk2ngt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vojk2v92kkt5tmk2ngt.jpg" alt="Query to find missing data based on trades data sampled by 1 second" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, there is no denying that running one-off, ad hoc queries that aggregate at coarser granularities, such as 1d or 1M, might not be that hard in other databases. However, if you keep running these queries at scale, they can create performance issues in a traditional relational database. The same approach works at a daily granularity, as shown in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjituo18gow5zonxznr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjituo18gow5zonxznr8.png" alt="Query to find missing data based on trades data sampled by 1 day" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you had to perform the same operation in a PostgreSQL database, you'd need to run a &lt;code&gt;generate_series()&lt;/code&gt; function to generate a bunch of data and then join it with the &lt;code&gt;trades&lt;/code&gt; dataset. For the sake of simplicity, let's assume that the timestamp format generated by both systems will be the same. To identify gaps in PostgreSQL, you'll need to write something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;all_seconds&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-17 00:00:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2022-12-17 23:59:59'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 second'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;dummy_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dummy_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vwap_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;
          &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;all_seconds&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
          &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
            &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dummy_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;
         &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dummy_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vwap_price&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL has the advantage of a built-in generator function that supports all kinds of dummy data generation use cases, as you saw above. Not all databases have such a function. In MySQL, for instance, you'd have to use recursive common table expressions (CTEs) to get the job done. In some other databases, it might be even more troublesome.&lt;/p&gt;
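For illustration, here is what that recursive-CTE workaround looks like. This runnable sketch uses SQLite (via Python's sqlite3 module) because it is self-contained; MySQL 8's `WITH RECURSIVE` syntax is nearly identical, apart from the date functions:

```python
import sqlite3

# Generate one row per second for a single minute with a recursive CTE,
# standing in for PostgreSQL's generate_series().
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE all_seconds(dummy_timestamp) AS (
        SELECT '2022-12-17 00:00:00'
        UNION ALL
        SELECT datetime(dummy_timestamp, '+1 second')
          FROM all_seconds
         WHERE dummy_timestamp < '2022-12-17 00:00:59'
    )
    SELECT dummy_timestamp FROM all_seconds
""").fetchall()

print(len(rows))  # 60 one-second slots
print(rows[0][0], rows[-1][0])
```

You would then LEFT JOIN this generated series against the trades table and filter for NULL aggregates, just as in the PostgreSQL query above.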

&lt;h2&gt;
  
  
  How does finding missing data help?
&lt;/h2&gt;

&lt;p&gt;Identifying missing data is of utmost importance because missing data can tremendously impact the accuracy and reliability of every system or person that consumes it. When it comes to time-series databases, many use cases come to mind, especially those that involve edge computing and IoT devices, such as sensors and detectors.&lt;/p&gt;

&lt;p&gt;Take the example of sensors that send data about critical systems in industrial machinery, such as vibration, torque, pressure, and so on. Data coming from these sensors not only helps improve machine efficiency but also helps detect early signs of possible machine failures.&lt;/p&gt;

&lt;p&gt;In many cases, this data might help improve safety and reliability too. If the continuous stream of time-series data is broken, i.e., the data is missing, the aforementioned benefits of having real-time data go down the drain, and the gap can cause serious damage, as a lot rides on the reliability of these systems. This is why there's real value in identifying missing data, and QuestDB makes it super easy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Continuing our SQL extensions theme, this tutorial walked you through finding missing data using the &lt;code&gt;SAMPLE BY&lt;/code&gt;, &lt;code&gt;FILL&lt;/code&gt;, and &lt;code&gt;ALIGN TO CALENDAR&lt;/code&gt; keywords with simple and highly performant queries. This article also explored some benefits of identifying missing data, especially in time-series datasets. Now, it's time for you to give this a shot. There's a system ready for you on the &lt;a href="https://demo.questdb.io/" rel="noopener noreferrer"&gt;demo website&lt;/a&gt;. Take it for a ride!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>database</category>
      <category>questdb</category>
      <category>opensource</category>
    </item>
    <item>
      <title>SQL extensions for time series data in QuestDB part II</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Mon, 12 Dec 2022 02:35:06 +0000</pubDate>
      <link>https://dev.to/kovidr/sql-extensions-for-time-series-data-in-questdb-part-ii-3ahc</link>
      <guid>https://dev.to/kovidr/sql-extensions-for-time-series-data-in-questdb-part-ii-3ahc</guid>
      <description>&lt;p&gt;This tutorial follows up on the one where we introduced &lt;a href="https://towardsdatascience.com/sql-extensions-for-time-series-data-in-questdb-f6b53acf3213" rel="noopener noreferrer"&gt;SQL extensions in QuestDB&lt;/a&gt; that make time-series analysis easier. In this tutorial, you will learn in detail about the &lt;a href="https://questdb.io/docs/reference/sql/sample-by/" rel="noopener noreferrer"&gt;&lt;code&gt;SAMPLE BY&lt;/code&gt; extension&lt;/a&gt; in QuestDB, which will enable you to work with time-series data efficiently because of its simplicity and flexibility.&lt;/p&gt;

&lt;p&gt;To get started with this tutorial, you should know that &lt;code&gt;SAMPLE BY&lt;/code&gt; is a SQL extension in QuestDB that helps you group or bucket your time-series data based on the &lt;a href="https://questdb.io/docs/concept/designated-timestamp" rel="noopener noreferrer"&gt;designated timestamp&lt;/a&gt;. It removes the need for lengthy &lt;code&gt;CASE WHEN&lt;/code&gt; statements and &lt;code&gt;GROUP BY&lt;/code&gt; clauses. Not only that, the &lt;code&gt;SAMPLE BY&lt;/code&gt; extension helps you quickly deal with many other data-related issues, such as &lt;a href="https://questdb.io/docs/reference/sql/select/#fill" rel="noopener noreferrer"&gt;missing data&lt;/a&gt;, &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#align-to-calendar-time-zone" rel="noopener noreferrer"&gt;incorrect timezones&lt;/a&gt;, and &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#align-to-calendar-with-offset" rel="noopener noreferrer"&gt;offsets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This tutorial assumes you have an up-and-running QuestDB instance ready for use. Let's dive straight into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Import sample data
&lt;/h3&gt;

&lt;p&gt;Similar to the previous tutorial, we'll use &lt;a href="https://s3-eu-west-1.amazonaws.com/questdb.io/datasets/grafana_tutorial_dataset.tar.gz" rel="noopener noreferrer"&gt;the NYC taxi riders data for February 2018&lt;/a&gt;. You can use the following script utilizing the &lt;a href="https://questdb.io/docs/guides/importing-data-rest/" rel="noopener noreferrer"&gt;HTTP REST API&lt;/a&gt; to upload data into QuestDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://s3-eu-west-1.amazonaws.com/questdb.io/datasets/grafana_tutorial_dataset.tar.gz &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; grafana_data.tar.gz
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; grafana_data.tar.gz

curl &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;@taxi_trips_feb_2018.csv http://localhost:9000/imp
curl &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;@weather.csv http://localhost:9000/imp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can utilize &lt;a href="https://questdb.io/docs/develop/web-console/#import" rel="noopener noreferrer"&gt;the import functionality in the QuestDB console&lt;/a&gt;, as shown in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5mjfga46adif6w4q41m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5mjfga46adif6w4q41m.png" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://questdb.io/docs/guides/importing-data/" rel="noopener noreferrer"&gt;importing large CSV files into partitioned tables&lt;/a&gt;, QuestDB recommends using the &lt;code&gt;COPY&lt;/code&gt; command. This method is especially useful when you are migrating data from another database into QuestDB.&lt;/p&gt;
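As a minimal sketch (the file name comes from the setup above; QuestDB expects the CSV in its server-side import directory, and the `COPY` reference documents further options such as `TIMESTAMP` and `PARTITION BY`):

```sql
COPY taxi_trips FROM 'taxi_trips_feb_2018.csv' WITH HEADER true;
```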

&lt;h3&gt;
  
  
  Create an ordered timestamp column
&lt;/h3&gt;

&lt;p&gt;QuestDB mandates the use of an ordered timestamp column, so you'll have to cast the &lt;code&gt;pickup_datetime&lt;/code&gt; column to &lt;code&gt;TIMESTAMP&lt;/code&gt; in a new table called &lt;code&gt;taxi_trips&lt;/code&gt; with the script below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;taxi_trips&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips_feb_2018.csv'&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;MONTH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;By converting the &lt;code&gt;pickup_datetime&lt;/code&gt; column to timestamp, you'll allow QuestDB to use it as the &lt;a href="https://questdb.io/docs/concept/designated-timestamp/" rel="noopener noreferrer"&gt;designated timestamp&lt;/a&gt;. Using the designated timestamp column, QuestDB is able to index the table to run time-based queries more efficiently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it all goes well, you should see the following data after running a &lt;code&gt;SELECT *&lt;/code&gt; query on the &lt;code&gt;taxi_trips&lt;/code&gt; table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvbjgjxb27fe10xn2gd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvbjgjxb27fe10xn2gd3.png" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the basics of &lt;code&gt;SAMPLE BY&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;SAMPLE BY&lt;/code&gt; extension allows you to create groups and buckets of data based on time ranges. This is especially valuable for time-series data, as you can calculate frequently used aggregates with extreme simplicity. &lt;code&gt;SAMPLE BY&lt;/code&gt; lets you summarize or aggregate data at very fine to very coarse &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#sample-units" rel="noopener noreferrer"&gt;units of time&lt;/a&gt;, from microseconds to months and everything in between: millisecond, second, minute, hour, and day. You can derive other units of time, such as a week, a fortnight, or a year, from the ones provided out of the box.&lt;/p&gt;

&lt;p&gt;Let's look at some examples to understand how to use &lt;code&gt;SAMPLE BY&lt;/code&gt; in different scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hourly count of trips
&lt;/h3&gt;

&lt;p&gt;You can use the &lt;code&gt;SAMPLE BY&lt;/code&gt; keyword with the &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#sample-units" rel="noopener noreferrer"&gt;sample unit&lt;/a&gt; of &lt;code&gt;h&lt;/code&gt; to get an hour-by-hour count of trips for the whole duration of the data set. Running the following query, you'll get results in the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two ways you can read your data in the QuestDB console: using the grid, which has a tabular form factor, or using a chart, where you can draw up a line chart, a bar graph, or an area chart to &lt;a href="https://questdb.io/docs/develop/web-console/#visualizing-results" rel="noopener noreferrer"&gt;visualize your data&lt;/a&gt;. Here's an example of a bar chart drawn from the query mentioned above:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokjd8vnjbt4b99it00pa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokjd8vnjbt4b99it00pa.png" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Three-hourly holistic summary of trips
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;SAMPLE BY&lt;/code&gt; extension allows you to group data by any arbitrary number of sample units. In the following example, you'll see that the query is calculating a three-hourly summary of trips with multiple aggregate functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;total_passengers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_earnings&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can view the output of the query in the following grid on the QuestDB console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy1l3cpypry5v92uryzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy1l3cpypry5v92uryzx.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekly summary of trips
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier in the tutorial, although there's no sample unit for a week, a fortnight, or a year, you can derive them simply by utilizing the built-in sample units. If you want to sample the data by a week, use &lt;code&gt;7d&lt;/code&gt; as the sampling time, as shown in the query below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;total_passengers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_earnings&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-28'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7tiqt7nk42epo6mq0ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7tiqt7nk42epo6mq0ob.png" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dealing with missing data
&lt;/h2&gt;

&lt;p&gt;If you've worked a fair bit with data, you already know that data isn't always in a pristine state. One of the most common issues, especially with time-series data, is discontinuity, i.e., scenarios where data is missing for specific time periods. You can quickly identify and deal with missing data using the advanced functionality of the &lt;code&gt;SAMPLE BY&lt;/code&gt; extension.&lt;/p&gt;

&lt;p&gt;QuestDB offers an easy way to generate and fill missing data with the &lt;code&gt;SAMPLE BY&lt;/code&gt; clause. Take the following example: I've deliberately removed data from 4 am to 5 am for the 1st of February 2018. Notice how the &lt;a href="https://questdb.io/docs/reference/sql/fill/" rel="noopener noreferrer"&gt;&lt;code&gt;FILL&lt;/code&gt; keyword&lt;/a&gt;, when used in conjunction with the &lt;code&gt;SAMPLE BY&lt;/code&gt; extension, can generate a row for the hour starting at 4 am and fill it with some data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;total_passengers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_earnings&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-01T04:00:00'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-01T04:59:59'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;FILL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LINEAR&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlmi5xu7svxhz9ki4n2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlmi5xu7svxhz9ki4n2v.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we've used an inline &lt;code&gt;WHERE&lt;/code&gt; clause with the &lt;code&gt;NOT BETWEEN&lt;/code&gt; keyword to emulate missing data. Alternatively, you can create a separate table with the missing trips excluded using the same idea, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips_missing'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; 
  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-01T04:00:00'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-01T04:59:59'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ideally, you would use &lt;code&gt;DROP PARTITION&lt;/code&gt; to emulate missing data, but because the table is partitioned by &lt;code&gt;MONTH&lt;/code&gt;, its partitions are too coarse to remove a single hour, so you cannot run the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-02-01T04:59:59'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2018-02-01T04:00:00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#fill-options" rel="noopener noreferrer"&gt;&lt;code&gt;FILL&lt;/code&gt;&lt;/a&gt; keyword demands a &lt;code&gt;fillOption&lt;/code&gt; from the following:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;fillOption&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Usage scenario&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NONE&lt;/td&gt;
&lt;td&gt;When you don't want to populate missing data, and leave it as is&lt;/td&gt;
&lt;td&gt;This is the default &lt;code&gt;fillOption&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;When you want to generate rows for missing time periods, but leave all the values as NULLs&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PREV&lt;/td&gt;
&lt;td&gt;When you want to copy the values of the previous row from the summarized data&lt;/td&gt;
&lt;td&gt;This is useful when you expect the numbers to be similar to the preceding time period&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LINEAR&lt;/td&gt;
&lt;td&gt;When you want to interpolate the missing values linearly from the immediately preceding and following rows&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CONST or x&lt;/td&gt;
&lt;td&gt;When you want to hardcode values where data is missing&lt;/td&gt;
&lt;td&gt;Specify one constant per aggregate column: &lt;code&gt;FILL(value_1, value_2, ...)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
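
&lt;p&gt;To build intuition for these options, here's a minimal Python sketch (an illustration of ours, not QuestDB code) that emulates the &lt;code&gt;NULL&lt;/code&gt;, &lt;code&gt;PREV&lt;/code&gt;, and &lt;code&gt;LINEAR&lt;/code&gt; fill behaviors over hourly buckets; the function name and row format are assumptions:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def sample_fill(rows, bucket, fill="NULL"):
    """Group (timestamp, value) rows into fixed-size buckets and fill empty
    buckets in the spirit of SAMPLE BY's fillOption (NULL, PREV, LINEAR).
    LINEAR here averages the neighboring buckets, which matches linear
    interpolation for a single missing bucket."""
    buckets = {}
    for ts, value in rows:
        # Align each timestamp down to the start of its bucket.
        start = datetime.min + ((ts - datetime.min) // bucket) * bucket
        buckets[start] = buckets.get(start, 0) + value

    first = min(buckets)
    steps = (max(buckets) - first) // bucket
    out = []
    for i in range(steps + 1):
        cursor = first + i * bucket
        if cursor in buckets:
            out.append((cursor, buckets[cursor]))
        elif fill == "PREV":
            # Copy the value of the previous summarized row.
            out.append((cursor, out[-1][1]))
        elif fill == "LINEAR":
            # Average the preceding value and the next present bucket.
            nxt = cursor + bucket
            while nxt not in buckets:
                nxt = nxt + bucket
            out.append((cursor, (out[-1][1] + buckets[nxt]) / 2))
        else:  # NULL: generate the row but leave the value empty
            out.append((cursor, None))
    return out
```

&lt;p&gt;For example, with trips at 3 am and 5 am but none at 4 am, &lt;code&gt;LINEAR&lt;/code&gt; produces the average of the two neighboring buckets for the 4 am row, while &lt;code&gt;NULL&lt;/code&gt; still generates the row.&lt;/p&gt;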

&lt;p&gt;Here's another example of hardcoding values using the &lt;code&gt;FILL(x)&lt;/code&gt; &lt;code&gt;fillOption&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fpnoqq0injchsv15lyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fpnoqq0injchsv15lyp.png" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Working with timezones and offsets
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;SAMPLE BY&lt;/code&gt; extension also enables you to change timezones and add or subtract offsets from your timestamp columns to adjust for any issues you might encounter when dealing with different source systems, especially in other geographic areas. It is important to note that, by default, QuestDB aligns its &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#sample-calculation" rel="noopener noreferrer"&gt;sample calculation&lt;/a&gt; based on the &lt;code&gt;FIRST OBSERVATION&lt;/code&gt;, as shown in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;total_passengers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_earnings&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-01T13:35:52'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2018-02-28'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsur8gw899nldyml09kgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsur8gw899nldyml09kgi.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note how the &lt;code&gt;1d&lt;/code&gt; sample calculation starts at &lt;code&gt;13:35:52&lt;/code&gt; and ends at &lt;code&gt;13:35:51&lt;/code&gt; the next day. Apart from this default alignment to the first observation, there are two other ways to align your sample calculations: to the &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#align-to-calendar-time-zone" rel="noopener noreferrer"&gt;&lt;code&gt;calendar time zone&lt;/code&gt;&lt;/a&gt;, and to the &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#align-to-calendar-with-offset" rel="noopener noreferrer"&gt;&lt;code&gt;calendar with offset&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
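
&lt;p&gt;The default alignment can be sketched in a few lines of Python (an illustration, not QuestDB internals): every bucket boundary is computed relative to the first observed timestamp rather than the calendar:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def align_to_first_observation(timestamps, bucket):
    """Map each timestamp to the start of its bucket, where bucket
    boundaries begin at the earliest observation (QuestDB's default
    SAMPLE BY alignment)."""
    origin = min(timestamps)
    return {ts: origin + ((ts - origin) // bucket) * bucket for ts in timestamps}
```

&lt;p&gt;With a first observation at &lt;code&gt;2018-02-01T13:35:52&lt;/code&gt; and a &lt;code&gt;1d&lt;/code&gt; sample, a trip the next morning still falls into the bucket that started at 13:35:52 the previous day.&lt;/p&gt;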

&lt;p&gt;Let's look at the other two alignment methods now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aligning sample calculation to another timezone
&lt;/h3&gt;

&lt;p&gt;When moving data from one system to another or via a complex pipeline, you can encounter issues with time zones. For the sake of demonstration, let's assume that you've identified that the data set you've loaded into the database is not for New York City but for Melbourne, Australia. These two cities are far apart and are in very different time zones.&lt;/p&gt;

&lt;p&gt;QuestDB allows you to fix this issue by aligning your data to another timezone using the &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#align-to-calendar-time-zone" rel="noopener noreferrer"&gt;&lt;code&gt;ALIGN TO CALENDAR TIME ZONE&lt;/code&gt; option&lt;/a&gt; with the &lt;code&gt;SAMPLE BY&lt;/code&gt; extension. In the example below, you can see how &lt;code&gt;ALIGN TO CALENDAR TIME ZONE ('AEST')&lt;/code&gt; aligns &lt;code&gt;pickup_datetime&lt;/code&gt;, i.e., the designated timestamp column, to the AEST timezone for Melbourne.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;total_passengers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_earnings&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;
 &lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;CALENDAR&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'AEST'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm6fnz4mzcdv5i3q72th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm6fnz4mzcdv5i3q72th.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Aligning sample calculation with offsets
&lt;/h3&gt;

&lt;p&gt;Similar to the previous example, you can also align your sample calculation by &lt;a href="https://questdb.io/docs/reference/sql/sample-by/#align-to-calendar-with-offset" rel="noopener noreferrer"&gt;offsetting the designated timestamp&lt;/a&gt; column manually by any &lt;code&gt;hh:mm&lt;/code&gt; value between -23:59 and 23:59. In the following example, we're offsetting the sample calculation by -05:30, i.e., negative five hours and thirty minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pickup_datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;total_trips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;passenger_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;total_passengers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trip_distance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_trip_distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_fare_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_tip_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tip_amount&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;total_earnings&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'taxi_trips'&lt;/span&gt;
 &lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;
 &lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;CALENDAR&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="s1"&gt;'-05:30'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz6hjvojz5o45efijhda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz6hjvojz5o45efijhda.png" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you learned how to use the &lt;a href="https://questdb.io/docs/reference/sql/sample-by" rel="noopener noreferrer"&gt;&lt;code&gt;SAMPLE BY&lt;/code&gt; extension&lt;/a&gt; in QuestDB to work efficiently with time-series data, especially in aggregated form. The &lt;code&gt;SAMPLE BY&lt;/code&gt; extension also lets you fix common problems with time-series data attributable to complex data pipelines, disparate source systems in different geographic areas, software bugs, and so on. All in all, SQL extensions like &lt;code&gt;SAMPLE BY&lt;/code&gt; provide a significant advantage when working with time-series data by enabling you to achieve more in fewer lines of SQL.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>sql</category>
    </item>
    <item>
      <title>Alerting Dashboard for Tesla's Stock Price with QuestDB and Grafana</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Thu, 11 Mar 2021 14:12:26 +0000</pubDate>
      <link>https://dev.to/kovidr/alerting-dashboard-for-tesla-s-stock-price-with-questdb-and-grafana-3lk7</link>
      <guid>https://dev.to/kovidr/alerting-dashboard-for-tesla-s-stock-price-with-questdb-and-grafana-3lk7</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;There are many reasons why reacting to time-series data is useful, and usually, the quicker you can respond to changes in this data, the better. The best tool for this job is a time-series database, a type of database designed to write and read large amounts of measurements that change over time.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will learn how to read data from a REST API and stream it to QuestDB, an open-source time-series database. We will use Grafana to visualize the data and alerting to notify Slack about changes that interest us. We will use Python to fetch data from the API and stream it to QuestDB, and you can easily customize the scripts to check different stocks or even different APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Configuration
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before getting started with the tutorial, you will need the following things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;Docker desktop&lt;/a&gt;&lt;/strong&gt; - we have created a  &lt;a href="https://github.com/questdb/questdb-slack-grafana-alerts" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;  that will enable you to run Grafana and QuestDB in a Docker container. The project README also documents setup steps for Grafana, QuestDB, and Python.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://pypi.org/project/iexfinance/" rel="noopener noreferrer"&gt;IexFinance Account&lt;/a&gt;&lt;/strong&gt; - we will use the IexFinance API for polling stock prices, note that a free account on IexFinance has a limit of 50,000 API calls per month.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://slack.com/intl/en-au/help/articles/206845317-Create-a-Slack-workspace" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt;&lt;/strong&gt; - to deliver alerts about Stock prices from Grafana, you'd need a Slack workspace with the ability to create incoming webhooks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deploy QuestDB &amp;amp; Grafana containers using Docker
&lt;/h2&gt;

&lt;p&gt;First, &lt;a href="https://github.com/bsmth/questdb-slack-grafana-alerts" rel="noopener noreferrer"&gt;clone the repository from GitHub&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone git@github.com:questdb/questdb-slack-grafana-alerts.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;docker-compose up&lt;/code&gt; will bring up two containers that are networked together; Grafana runs on &lt;code&gt;localhost:3000&lt;/code&gt;, and QuestDB has a web console available on &lt;code&gt;localhost:9000&lt;/code&gt; as well as port &lt;code&gt;8812&lt;/code&gt; open for the Postgres wire protocol.&lt;/p&gt;

&lt;p&gt;To check if your QuestDB and Grafana containers are working, please visit the aforementioned URLs. Alternatively, you can check the status using &lt;code&gt;docker-compose ps&lt;/code&gt; on the command line, which should show you the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uy99njdgtr5b3dtpyks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uy99njdgtr5b3dtpyks.png" alt="Screen Shot 2021-03-06 at 12.25.45 am.png" width="800" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;docker-compose&lt;/code&gt; also provides Grafana with the default connection credentials for Postgres authentication. This means you can use QuestDB as a default data source in Grafana right away, without manual configuration steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Python Libraries
&lt;/h2&gt;

&lt;p&gt;All the Python libraries required for this tutorial are listed in the &lt;code&gt;requirements.txt&lt;/code&gt; file. Install them using pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingest mock data into QuestDB
&lt;/h2&gt;

&lt;p&gt;We need some data in QuestDB to create visualizations and alerts. We can use the IexFinance API to fetch stock prices and  &lt;a href="https://github.com/questdb/questdb-slack-grafana-alerts/blob/main/python/mock_stock_data_example.py" rel="noopener noreferrer"&gt;an additional script to generate dummy data&lt;/a&gt;. The IexFinance API has a cap of 50,000 requests per month on the free account, so the mock script generates random prices instead, ensuring we don't max out the quota during testing. To start ingesting mock data into QuestDB, run the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;python
python mock_stock_data_example.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script will automatically create a table &lt;code&gt;stock_prices&lt;/code&gt; and start ingesting mock data into it. The table contains three columns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;stock&lt;/strong&gt; - listed name of the stock, e.g., TSLA for Tesla. QuestDB has an optimized data type, &lt;a href="https://questdb.io/docs/concept/symbol/" rel="noopener noreferrer"&gt;symbol&lt;/a&gt;, for text columns with repetitive values. Read more about it in QuestDB's official documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;stockPrice&lt;/strong&gt; - price of the stock in USD, stored as a double.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;createdDateTime&lt;/strong&gt; - timestamp at which the stockPrice was ingested into QuestDB.&lt;/li&gt;
&lt;/ol&gt;
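
&lt;p&gt;If you'd like to see the shape of the data without running the repo's script, here's a hypothetical Python sketch of a mock-data generator producing rows matching the three columns above (the function name and price model are our own, not taken from the repository):&lt;/p&gt;

```python
import random
from datetime import datetime, timezone

def mock_stock_rows(symbol="TSLA", start_price=700.0, n=5, seed=42):
    """Generate (stock, stockPrice, createdDateTime) tuples using a small
    random walk around the starting price."""
    rng = random.Random(seed)
    price = start_price
    rows = []
    for _ in range(n):
        # Nudge the price by up to 1% in either direction.
        price = round(price * (1 + rng.uniform(-0.01, 0.01)), 2)
        rows.append((symbol, price, datetime.now(timezone.utc)))
    return rows
```

&lt;p&gt;Each tuple maps one-to-one onto a row of the &lt;code&gt;stock_prices&lt;/code&gt; table.&lt;/p&gt;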

&lt;p&gt;In the following screenshot, you can see the data ingested into QuestDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgwl9biow15tfvp4wqey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgwl9biow15tfvp4wqey.png" alt="Screen Shot 2021-03-06 at 1.52.21 am.png" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure the IexFinance API
&lt;/h2&gt;

&lt;p&gt;Once you have tested the ingestion, you can start using the API with real data. Using this API, you can query stock prices in real time. As mentioned earlier, there is a cap of 50,000 free API calls per month, so make sure you don't cross that limit while on the free plan. To configure the IexFinance API, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a free account on IexFinance.&lt;/li&gt;
&lt;li&gt;Create an API token.&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;Reveal Secret Token&lt;/code&gt; and copy the &lt;code&gt;SECRET&lt;/code&gt; token.&lt;/li&gt;
&lt;li&gt;Create a new file &lt;code&gt;.env&lt;/code&gt; in the &lt;code&gt;./python&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt;Paste the token into the &lt;code&gt;.env&lt;/code&gt; file in the format → &lt;code&gt;IEX_TOKEN=Skwf93hD&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
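
&lt;p&gt;A &lt;code&gt;.env&lt;/code&gt; file is just &lt;code&gt;KEY=VALUE&lt;/code&gt; lines; the repo's scripts may load it with a library such as python-dotenv, but a minimal reader looks like this (the helper below is a sketch of ours, not the repository's code):&lt;/p&gt;

```python
import os

def load_iex_token(path=".env"):
    """Read KEY=VALUE pairs from a .env file into os.environ and
    return the IEX_TOKEN value, if present."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks and comments; keep only KEY=VALUE lines.
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    return os.environ.get("IEX_TOKEN")
```

&lt;p&gt;Keeping the token in &lt;code&gt;.env&lt;/code&gt; (and out of version control) avoids leaking your secret key in the repository.&lt;/p&gt;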

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotzgitwdl4ndqgqupypb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotzgitwdl4ndqgqupypb.png" alt="Screen Shot 2021-03-06 at 2.01.49 am.png" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure a Slack Incoming Webhook
&lt;/h2&gt;

&lt;p&gt;Next, we need to create a Slack webhook for sending alert messages from Grafana:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to  &lt;a href="https://api.slack.com/apps?new_app=1" rel="noopener noreferrer"&gt;https://api.slack.com/apps?new_app=1&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;Name your Slack app &lt;em&gt;QuestDB Stock Price Alerts&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;In &lt;code&gt;Features and functionality&lt;/code&gt;, choose &lt;code&gt;Incoming Webhooks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Activate incoming webhooks and click &lt;code&gt;Add New Webhook to Workspace&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Select the channel to allow the app to post to and click &lt;code&gt;Allow&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Copy the Webhook URL which is in the following format → &lt;code&gt;https://hooks.slack.com/services/T123/B0123/2Fb...&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Create a Notification channel in Grafana
&lt;/h2&gt;

&lt;p&gt;Go to &lt;code&gt;localhost:3000&lt;/code&gt; in your browser. To enable connectivity between Grafana and Slack for alerting, click &lt;code&gt;Add Channel&lt;/code&gt; in the &lt;code&gt;Alerting &amp;gt; Notification channels&lt;/code&gt; section as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tmy0dccs99hyaek7ah7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tmy0dccs99hyaek7ah7.png" alt="Screen Shot 2021-03-06 at 12.19.37 am.png" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste the Slack Incoming Webhook URL in the Url field while creating a new notification channel as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6aumcdfm85tpp1c0ntu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6aumcdfm85tpp1c0ntu.png" alt="Screen Shot 2021-03-06 at 12.21.27 am.png" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can quickly test if the webhook is working fine by pressing the Test button on the screen above. This will trigger a notification from Grafana to be published on Slack. You can see an example notification below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d8jl1s4r6364ri528ky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d8jl1s4r6364ri528ky.png" alt="Screen Shot 2021-03-06 at 12.22.52 am.png" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a Grafana Panel &amp;amp; Setup the Alert
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://grafana.com/docs/grafana/latest/panels/add-a-panel/" rel="noopener noreferrer"&gt;Set up a Grafana panel&lt;/a&gt;  that hosts the real-time graph of TSLA stock price using the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;+ Create&lt;/strong&gt; and select &lt;code&gt;Dashboard&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;+&lt;/strong&gt; &lt;code&gt;Add new panel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;In the panel, click the pencil icon or click &lt;code&gt;Edit SQL&lt;/code&gt; and paste the following example query:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;createdDatetime&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stockPrice&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avgPrice&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock_prices&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;createdDatetime&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'TSLA'&lt;/span&gt;
&lt;span class="n"&gt;SAMPLE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feh6tgs5nd6u8w0we09bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feh6tgs5nd6u8w0we09bn.png" alt="Screen Shot 2021-03-06 at 3.46.12 am.png" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the Grafana panel with the query shown in the image above, save the dashboard. To create an alert on the &lt;code&gt;TSLA&lt;/code&gt; stock price, perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Edit the panel in the dashboard.&lt;/li&gt;
&lt;li&gt;Go to the &lt;code&gt;Alert&lt;/code&gt; tab and name the alert &lt;em&gt;Tesla Stock Price alert&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;em&gt;Evaluate every 10 seconds for 30 seconds&lt;/em&gt; (&lt;em&gt;Evaluate every&lt;/em&gt; specifies how often the scheduler evaluates the alert rule, and &lt;em&gt;for&lt;/em&gt; specifies how long the query needs to violate the thresholds before the alert notification is triggered).&lt;/li&gt;
&lt;li&gt;Set the condition to &lt;em&gt;WHEN min() OF query(5-second Avg. of TSLA, 30s, now()) IS BELOW 762&lt;/em&gt;. In other words, the alert fires if the minimum value of the query named &lt;em&gt;5-second Avg. of TSLA&lt;/em&gt; has been below 762 over the last 30 seconds.&lt;/li&gt;
&lt;li&gt;In the No Data &amp;amp; Error Handling section, use the defaults.&lt;/li&gt;
&lt;li&gt;In &lt;code&gt;Notifications → Send to&lt;/code&gt;, select the notification channel that we set up earlier, named &lt;em&gt;Stock Price Alerts&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Add the message &lt;em&gt;The 5-second bucketed average of the TSLA stock price has gone below 762 in the last 30 seconds&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Save the Panel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can see the steps in action below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87luoifckbrlp1pkcf52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87luoifckbrlp1pkcf52.png" alt="Screen Shot 2021-03-06 at 4.20.30 am.png" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the condition is met, Grafana will trigger an alert and send a notification to Slack. The notification will be something like the screenshot below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e9k0xqa9v6cluiakoch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e9k0xqa9v6cluiakoch.png" alt="Screen Shot 2021-03-06 at 4.33.08 am.png" width="800" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand the alert status changes more deeply, you can visit the State history, which shows the timeline of transitions from one status to another. You can see an example of state history below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy76mbpgl5pe4x7h42aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy76mbpgl5pe4x7h42aa.png" alt="Screen Shot 2021-03-06 at 4.38.14 am.png" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To learn more about building dashboards for time series data with Grafana, there's  &lt;a href="https://questdb.io/tutorial/2020/10/19/grafana/" rel="noopener noreferrer"&gt;another tutorial on QuestDB's website&lt;/a&gt;  with a link to example data to try out more features in detail.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this tutorial, you learned how Grafana can integrate with QuestDB using the PostgreSQL endpoint. Using data ingested from an API into QuestDB, you learned how to visualize that data in a Grafana dashboard and set up alerts based on predefined conditions. You also learned how to publish alert messages to external tools like Slack. For more information on any of the topics covered in this tutorial, please visit QuestDB's official documentation.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
      <category>devops</category>
    </item>
    <item>
      <title>SQL Extensions for Time-Series Data in QuestDB</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Mon, 18 Jan 2021 13:24:36 +0000</pubDate>
      <link>https://dev.to/kovidr/sql-extensions-for-time-series-data-in-questdb-li8</link>
      <guid>https://dev.to/kovidr/sql-extensions-for-time-series-data-in-questdb-li8</guid>
      <description>&lt;p&gt;In this tutorial, you are going to learn about QuestDB SQL extensions which prove to be very useful with time-series data. Using some sample data sets, you will learn how designated timestamps work, and how to use extended SQL syntax to write queries on time-series data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Traditionally, SQL has been used for relational databases and data warehouses. In recent years there has been an exponential increase in the amount of data that connected systems produce, which has brought about a need for new ways to store and analyze such information. For this reason, time-series analytics have proved critical for making sense of real-time market data in financial services, sensor data from IoT devices, and application metrics.&lt;/p&gt;

&lt;p&gt;This explosion in the volume of time-series data led to the development of specialized databases designed to ingest and process time-series data as efficiently as possible. QuestDB achieves this while supporting standard ANSI SQL with native extensions for time series analysis.&lt;/p&gt;

&lt;p&gt;Apart from that, QuestDB also makes the syntax easier by implementing implicit clauses. It also includes a random data generation feature, which is extremely useful for exploring the functionality of the database as well as for database testing. Although there is much to talk about in QuestDB’s SQL dialect, in this tutorial you will learn about its SQL extensions.&lt;/p&gt;

&lt;p&gt;Throughout this tutorial, we’ll be using two data sets. The first one is taxi trips data for New York City for the month of February 2018. It contains information about the number of passengers, trip fare, tip amount and the start datetime of the trip. You can find out average earnings per number of passengers, tipping behaviour of NYC taxi riders, busiest times of the day, and so on.&lt;/p&gt;

&lt;p&gt;The second data set contains weather information for 10 years starting from 1st January 2010 to 1st January 2020. This dataset contains information about temperature, windspeed, rainfall, depth of snow, visibility, and more. You can use this data to analyse how the weather patterns emerge over long periods of time. You can also compare weather during the same time of the year for different years. To get started, you can install the aforementioned data sets using the following shell script:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
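
&lt;p&gt;A minimal sketch of such an import script follows; the download URLs below are placeholders, and the loading is done through QuestDB's REST &lt;code&gt;/imp&lt;/code&gt; endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Download the two sample data sets (placeholder URLs)
curl -L -o trips.csv   https://example.com/nyc-taxi-trips-feb-2018.csv
curl -L -o weather.csv https://example.com/weather-2010-2020.csv

# Import them into QuestDB via the REST /imp endpoint
# (QuestDB listens on port 9000 by default)
curl -F data=@trips.csv   "http://localhost:9000/imp?name=trips"
curl -F data=@weather.csv "http://localhost:9000/imp?name=weather"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;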


&lt;h1&gt;
  
  
  SQL Extensions
&lt;/h1&gt;

&lt;p&gt;While implementing ANSI SQL to the greatest extent, QuestDB has introduced some time-series-specific SQL extensions to enhance performance and improve the query reading and writing experience for database users and developers. Let’s look into the SQL extensions one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Timestamp Search
&lt;/h3&gt;

&lt;p&gt;A time-series database isn’t complete if it doesn’t provide a method to search across time. In QuestDB, you can partition tables by time intervals. Each partition will be saved in a separate set of files on disk. To provide a relational database-like optimization of pruning partitions, QuestDB offers the feature of Timestamp search.&lt;/p&gt;

&lt;p&gt;To benefit from this feature, a table should have a designated timestamp column. Any timestamp column can be marked as the designated timestamp column either while creating the table or while creating temporary sub-tables within a query. The designated timestamp column forces the table to keep records in increasing time order. Hence, it implicitly enforces a constraint that rejects any out-of-order inserts, although QuestDB is &lt;a href="https://github.com/questdb/questdb/issues/172" rel="noopener noreferrer"&gt;already working on accepting delayed records out of order&lt;/a&gt;. Timestamp search can also be performed using the normal ≥, ≤, &amp;lt;, and &amp;gt; operators, but it is not as efficient as with designated timestamps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Another benefit of the designated timestamp column is that it enables the efficient use of ASOF joins, which are specialized joins that match tables on timestamps that don’t align exactly. A prerequisite for getting deterministic results from an ASOF join is that the data in the table should be ordered by time. Designated timestamp columns enforce time ordering in a table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The two sample data sets were imported directly from a CSV file and a table was created on-the-fly. Although you can &lt;a href="https://questdb.io/docs/reference/api/rest/#overview" rel="noopener noreferrer"&gt;create a designated timestamp while importing the data&lt;/a&gt;, it is important to understand how to deal with tables that don’t have a designated timestamp. So, let’s create the designated timestamp now and partition the two tables by month.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
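
&lt;p&gt;A sketch of those statements follows. Since the imported tables have no designated timestamp, the usual approach is to create ordered copies with &lt;code&gt;timestamp()&lt;/code&gt; and &lt;code&gt;PARTITION BY&lt;/code&gt;; the table and column names here are assumptions based on the data set descriptions above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Recreate the taxi trips table with a designated timestamp, partitioned by month
CREATE TABLE trips_ts AS (
    SELECT * FROM trips ORDER BY pickupDatetime
) TIMESTAMP(pickupDatetime) PARTITION BY MONTH;

-- Do the same for the weather table
CREATE TABLE weather_ts AS (
    SELECT * FROM weather ORDER BY timestamp
) TIMESTAMP(timestamp) PARTITION BY MONTH;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;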


&lt;p&gt;Using designated timestamp search notation, you can simplify your timestamp-based searches on tables. The following example queries the weather dataset. In this example, you can see that the same operator can be used to query many different time ranges. The first part of the UNION will give you the count of records for the whole year of 2019 while the second part of the UNION will give you the count of records for the month of December in 2019, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp4v5xxtq3qx2qvept5dy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp4v5xxtq3qx2qvept5dy.png" alt="Alt Text" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
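
&lt;p&gt;A sketch of that query, assuming the weather table's designated timestamp column is called &lt;code&gt;timestamp&lt;/code&gt;: the same &lt;code&gt;IN&lt;/code&gt; interval notation covers a year, a month, or a day.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Count of records for the whole of 2019
SELECT '2019' period, count() FROM weather WHERE timestamp IN '2019'
UNION
-- Count of records for December 2019
SELECT '2019-12', count() FROM weather WHERE timestamp IN '2019-12'
UNION
-- Count of records for 23 December 2019
SELECT '2019-12-23', count() FROM weather WHERE timestamp IN '2019-12-23';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;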


&lt;h3&gt;
  
  
  &lt;a href="https://questdb.io/docs/reference/sql/latest-by/" rel="noopener noreferrer"&gt;LATEST BY&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This SQL extension finds the latest entry for a given key or combination of keys by timestamp. The functionality of LATEST BY is similar to functions like FIRST, FIRST_VALUE, etc. which are available in traditional relational databases and data warehouses.&lt;/p&gt;

&lt;p&gt;In a relational database, you’d either have to find the latest timestamp first and then use a subquery to fetch the farePlusTip amount for each passengerCount, or you’d have to use one of the aforementioned analytic functions like FIRST_VALUE. QuestDB makes life easier for database users and developers by providing a dedicated clause for finding the latest records per group.&lt;/p&gt;

&lt;p&gt;In the following example, you will see that by using the LATEST BY clause, we can find out, for each passengerCount, what the farePlusTip amount was for the latest completed trip.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foxkg6o9r90zs0z4jboci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foxkg6o9r90zs0z4jboci.png" alt="Alt Text" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
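
&lt;p&gt;A sketch of the query from the screenshot, with column names assumed from the description above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Latest fare-plus-tip amount for each passenger count,
-- taken from the most recent trip per group
SELECT passengerCount, pickupDatetime, farePlusTip
FROM trips
LATEST BY passengerCount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;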


&lt;h3&gt;
  
  
  &lt;a href="https://questdb.io/docs/reference/sql/sample-by/" rel="noopener noreferrer"&gt;SAMPLE BY&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This is another extension that is optimal for time-series data as it allows the grouping of data based on timestamp without explicitly providing timestamp ranges in the where clause. You can bucket your data into chunks of time using this extension.&lt;/p&gt;

&lt;p&gt;In regular SQL, you’d need to use a combination of CASE WHEN statements, GROUP BY clause, and WHERE clause to get similar results. In QuestDB, SAMPLE BY does the trick. To use this SQL extension, you need to make sure that the table has a designated timestamp column.&lt;/p&gt;

&lt;p&gt;In the following example, you’ll see the data is sampled or grouped by a day using 24h as the SAMPLE_SIZE in the SAMPLE BY clause. Depending upon the frequency of data ingested into the table, you might need to adjust the size of the bucket by adjusting SAMPLE_SIZE.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv3oxjor0jicddw93h17u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv3oxjor0jicddw93h17u.png" alt="Alt Text" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is common for time-series databases to store data at a really fine granularity. Hence, it is also common to have data grouped by intervals of time ranging from seconds to years. Here are some more examples demonstrating how to go about using the SAMPLE BY clause for different sample sizes:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
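
&lt;p&gt;A few illustrative variations, assuming a temperature column named &lt;code&gt;tempF&lt;/code&gt; in the weather table; only the SAMPLE_SIZE changes between them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Average temperature per hour
SELECT timestamp, avg(tempF) FROM weather SAMPLE BY 1h;

-- Average temperature per day
SELECT timestamp, avg(tempF) FROM weather SAMPLE BY 1d;

-- Average temperature per month
SELECT timestamp, avg(tempF) FROM weather SAMPLE BY 1M;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;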


&lt;h1&gt;
  
  
  Changes to the usual SQL Syntax
&lt;/h1&gt;

&lt;p&gt;Apart from the SQL extensions, there are a few changes to the usual SQL syntax to enhance the database user experience. The changes are related to the GROUP BY and HAVING clauses. The idea behind them is to simplify query writing, improve readability and ease of use of the SQL dialect, and reduce SQL verbosity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optional &lt;a href="https://questdb.io/docs/reference/sql/group-by/" rel="noopener noreferrer"&gt;GROUP BY&lt;/a&gt; clause
&lt;/h3&gt;

&lt;p&gt;Because of the widespread use of aggregation functions in time-series databases, QuestDB implicitly groups aggregation results to make the query writing experience better. While the GROUP BY keyword is supported by QuestDB, including it in your query makes no difference to the result set. Let’s see an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fngy5c9pf7mahnpqnco88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fngy5c9pf7mahnpqnco88.png" alt="Alt Text" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
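
&lt;p&gt;The two queries below return identical results in QuestDB; the second simply drops the GROUP BY clause (the column names are assumptions based on the data set description):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Standard SQL, with an explicit GROUP BY
SELECT passengerCount, avg(farePlusTip)
FROM trips
GROUP BY passengerCount;

-- QuestDB, with GROUP BY omitted: non-aggregated columns
-- in the SELECT list are grouped implicitly
SELECT passengerCount, avg(farePlusTip)
FROM trips;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;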


&lt;h3&gt;
  
  
  Implicit &lt;a href="https://questdb.io/docs/concept/sql-extensions/#implicit-having" rel="noopener noreferrer"&gt;HAVING&lt;/a&gt; clause
&lt;/h3&gt;

&lt;p&gt;As HAVING is only ever used together with the GROUP BY clause, the HAVING clause automatically becomes implicit along with the optional GROUP BY clause mentioned above. Let’s see an example of this too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmxsfpgn0cjfjjsp5qaks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmxsfpgn0cjfjjsp5qaks.png" alt="Alt Text" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
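
&lt;p&gt;Since there is no explicit GROUP BY to attach a HAVING clause to, filtering on an aggregate is done by wrapping the aggregation in a subquery and applying a plain WHERE on the alias, along these lines (column names assumed as before):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Equivalent of GROUP BY passengerCount HAVING avg(farePlusTip) &amp;gt; 10:
-- aggregate in the inner query, filter on the alias in the outer WHERE
(
    SELECT passengerCount, avg(farePlusTip) avgAmount
    FROM trips
)
WHERE avgAmount &amp;gt; 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;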


&lt;h3&gt;
  
  
  Optional &lt;a href="https://questdb.io/docs/concept/sql-extensions/#optionality-of-select--from" rel="noopener noreferrer"&gt;SELECT * FROM&lt;/a&gt; phrase
&lt;/h3&gt;

&lt;p&gt;QuestDB goes a step further by making the SELECT * FROM phrase optional. This helps reduce verbosity considerably when nested subqueries are involved. In QuestDB, just writing the name of the table and executing the statement acts as a SELECT * FROM TABLE_NAME statement. Please look at the example below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyro0s2ylono1jvve5pt3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyro0s2ylono1jvve5pt3.png" alt="Alt Text" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of these improvements help reduce the effort required to write and maintain queries in a time-series database like QuestDB at scale.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this tutorial, you learned how QuestDB supports SQL and enhances performance and developer experience by writing custom SQL extensions specially designed for time-series databases. You also learned about a few syntactical changes in QuestDB’s SQL dialect. If you are interested in knowing more about QuestDB, please visit QuestDB’s &lt;a href="https://questdb.io/docs/introduction/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This tutorial was originally published on &lt;a href="https://towardsdatascience.com/sql-extensions-for-time-series-data-in-questdb-f6b53acf3213" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>When To Cache?</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Tue, 02 Jun 2020 16:50:46 +0000</pubDate>
      <link>https://dev.to/kovidr/when-to-cache-34gp</link>
      <guid>https://dev.to/kovidr/when-to-cache-34gp</guid>
      <description>&lt;p&gt;In Spark, memory is used for two purposes -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; - store/cache the data that will be used later. This data usually occupies memory for a long time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; - memory space used for certain operations like joins, sorts, and aggregations, and especially shuffles (when data is moved between executors, for example because of partition skew)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One has to be careful with caching. You want to cache only something you're sure is going to be needed later for a transformation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Resource Allocation and Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The idea behind dynamic allocation is that you define the initial, minimum, and maximum number of executors. Spark starts with the initial executors and then, based on the load, decides whether or not to start more. Once the heavy lifting of the processing is done, Spark asks for the executors to be released. This is a great way to increase efficiency, but what if you had cached something?&lt;/p&gt;

&lt;p&gt;Because you had cached something, now the executors can't be freed unless that cached data is moved to another set of executors. And that's what you need to bear in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache vs. Persist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We'll just talk about RDDs (and not Datasets). There's no real difference between &lt;code&gt;cache()&lt;/code&gt; and a no-argument &lt;code&gt;persist()&lt;/code&gt;: both cache your data at the &lt;code&gt;MEMORY_ONLY&lt;/code&gt; storage level, which is the default for RDDs. For Datasets, the default is &lt;code&gt;MEMORY_AND_DISK&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can obviously go and change the storage level depending on the use case. Also, there's no &lt;code&gt;uncache()&lt;/code&gt; function. There's just &lt;code&gt;unpersist()&lt;/code&gt;.&lt;/p&gt;
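<p/>
&lt;p&gt;A minimal PySpark sketch of the above, assuming a local Spark session; the storage levels shown are the defaults discussed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000))
rdd.cache()        # same as rdd.persist() -&amp;gt; MEMORY_ONLY for RDDs
rdd.count()        # caching is lazy; an action materializes it

df = spark.range(1000)
df.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level for a Dataset

rdd.unpersist()    # there is no uncache(); unpersist() frees the space
df.unpersist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;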

&lt;p&gt;Databricks has improved caching by quite a bit by introducing Delta caching but it is limited to some file formats.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>pyspark</category>
      <category>spark</category>
    </item>
    <item>
      <title>Defensive SQL Query Writing</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Mon, 01 Jun 2020 15:44:59 +0000</pubDate>
      <link>https://dev.to/kovidr/defensive-sql-query-writing-3jm5</link>
      <guid>https://dev.to/kovidr/defensive-sql-query-writing-3jm5</guid>
      <description>&lt;p&gt;Derived from defensive programming, defencive query writing is a practice which tries to make sure that a query run doesn't fail. Just like in application development, try to remove the scope for silly mistakes and control unforseen circumstances. Sometimes the decision comes down to whether you want your query to fail or you want it to run even if with some incorrect data. I'll share a couple of simple examples where we can employ these practices while writing queries. &lt;/p&gt;

&lt;h2&gt;
  
  
  Checking If Objects Exist
&lt;/h2&gt;

&lt;p&gt;Take the very basic example of creating and dropping database objects. Rather than using a &lt;code&gt;CREATE TABLE xyz&lt;/code&gt;, use &lt;code&gt;CREATE TABLE IF NOT EXISTS xyz (id int)&lt;/code&gt; or if you want to recreate the table losing all the data you can run &lt;code&gt;DROP TABLE IF EXISTS xyz&lt;/code&gt; and then &lt;code&gt;CREATE TABLE IF NOT EXISTS xyz (id int)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The same practice can be used with the creation and deletion of databases, views, indexes, triggers, procedures, functions and more. I have come to realize that in most cases, using this is helpful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using database and column aliases
&lt;/h2&gt;

&lt;p&gt;Prevent yourself from getting &lt;em&gt;ambiguous column&lt;/em&gt; errors. In the example below, the column &lt;code&gt;city&lt;/code&gt; might be present both in &lt;code&gt;TABLE_1&lt;/code&gt; and &lt;code&gt;TABLE_2&lt;/code&gt;. How do you expect the database to know which field you want it to pick up?&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
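
&lt;p&gt;A sketch of the ambiguous query and its aliased fix (the table and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fails with an "ambiguous column" error: city exists in both tables
SELECT city
FROM TABLE_1
JOIN TABLE_2 ON TABLE_1.id = TABLE_2.id;

-- Aliased version: every column reference is unambiguous
SELECT t1.city
FROM TABLE_1 t1
JOIN TABLE_2 t2 ON t1.id = t2.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;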


&lt;p&gt;It's generally a very good practice to create aliases for database objects and then access those objects and their child objects using the alias rather than the complete name. Obviously, to do this efficiently, you'd need to follow a SQL style guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using LIMIT
&lt;/h2&gt;

&lt;p&gt;No, I'm not talking about using &lt;code&gt;LIMIT&lt;/code&gt; to restrict the number of records in your final query. Rather, I am talking about queries like this.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
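
&lt;p&gt;A sketch of the kind of query meant here (names illustrative): without the LIMIT, the scalar subquery fails as soon as it returns more than one row.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The scalar subquery must return at most one row;
-- LIMIT 1 guarantees that even if TABLE_2 has duplicates
SELECT t1.id,
       (SELECT t2.city
        FROM TABLE_2 t2
        WHERE t2.id = t1.id
        ORDER BY t2.updated_at DESC
        LIMIT 1) AS city
FROM TABLE_1 t1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;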


&lt;p&gt;The &lt;code&gt;LIMIT&lt;/code&gt; clause in the subquery returning a column is important because it prevents the query from failing if the subquery returns more than one row, that is, if there is more than one record in &lt;code&gt;TABLE_2&lt;/code&gt; for every record in &lt;code&gt;TABLE_1&lt;/code&gt;. This is a really useful trick to write better queries.&lt;/p&gt;

&lt;p&gt;These are three of the most common scenarios which, if not taken care of, can prevent your query from running at all. Obviously, all of these come with an asterisk. More on that later.&lt;/p&gt;

&lt;p&gt;Please feel free to share other practices that you have followed!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Complete Data Engineer's Vocabulary</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Sun, 31 May 2020 08:59:27 +0000</pubDate>
      <link>https://dev.to/kovidr/complete-data-engineer-s-vocabulary-kio</link>
      <guid>https://dev.to/kovidr/complete-data-engineer-s-vocabulary-kio</guid>
      <description>&lt;p&gt;I recently wrote a post on for Towards Data Science summarizing most of the technologies and basic concepts that a data engineer should know about - &lt;a href="https://towardsdatascience.com/complete-data-engineers-vocabulary-87967e374fad" rel="noopener noreferrer"&gt;Complete Data Engineer's Vocabulary&lt;/a&gt;. I say that it's the complete vocabulary but it really isn't - it tries to cover most of the tech that people are using these days. I'll make sure to keep it updated.&lt;/p&gt;

&lt;p&gt;The whole idea was to summarize every technology or concept in 10 words or less - an activity that I thoroughly enjoyed. It's one thing to use a technology or know a concept, but it really gets tough when you have to explain these things to someone else and that too using the minimum amount of words possible.&lt;/p&gt;

&lt;p&gt;I have already had a lot of suggestions for additions on the original post. More suggestions are welcome!&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>python</category>
      <category>sql</category>
    </item>
    <item>
      <title>Platform For Online Concerts</title>
      <dc:creator>Kovid Rathee</dc:creator>
      <pubDate>Sat, 30 May 2020 14:56:56 +0000</pubDate>
      <link>https://dev.to/kovidr/platform-for-online-concerts-2570</link>
      <guid>https://dev.to/kovidr/platform-for-online-concerts-2570</guid>
      <description>&lt;p&gt;Amidst the current Covid-19 crisis, the world is suffering from all kinds of problems, but the one I wanted to ask pertains to independent musicians, especially classical musicians, who don't have any gigs now because everything is shut down. Is there a good platform which can provide these arists a great way to share their music with their audience?&lt;/p&gt;

&lt;p&gt;Livestreams on Facebook and Instagram don't really work because of bad quality and bad monetization options. YouTube does have a subscription option that could work, but it doesn't really make sense just for concerts. How about a per-gig kind of deal where people could pay to listen to their favorite artists perform?&lt;/p&gt;

&lt;p&gt;Stuff like Patreon, Ko-fi, and YouTube's SuperChat definitely helps, but I'm thinking more about a dedicated platform for niche audiences - one on which some of these classical arts could also survive. On Facebook, Instagram, etc., there's just too much noise. You can't play a Mozart concert on MTV for the same reason.&lt;/p&gt;

&lt;p&gt;Suggestions/ideas are welcome! Stay safe!&lt;/p&gt;

</description>
      <category>startup</category>
      <category>music</category>
    </item>
  </channel>
</rss>
