<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kellen</title>
    <description>The latest articles on DEV Community by Kellen (@goodroot).</description>
    <link>https://dev.to/goodroot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1139629%2Fd87c7da4-07c7-462f-8bfa-0b46d50ac8ff.jpeg</url>
      <title>DEV Community: Kellen</title>
      <link>https://dev.to/goodroot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/goodroot"/>
    <language>en</language>
    <item>
      <title>How to increase Grafana refresh rate frequency</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Wed, 10 Jan 2024 16:33:27 +0000</pubDate>
      <link>https://dev.to/questdb/how-to-increase-grafana-refresh-rate-frequency-468b</link>
      <guid>https://dev.to/questdb/how-to-increase-grafana-refresh-rate-frequency-468b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;QuestDB is a high-performance time series database with SQL analytics that can power through market data ingestion and analysis. It's &lt;a href="https://github.com/questdb/questdb" rel="noopener noreferrer"&gt;open source&lt;/a&gt; and integrates well with the tools and languages you use. Check us out!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A nice thing about Grafana is watching your dynamic dashboards refresh as data updates over time. However, to limit the load on the Grafana server, the browser, and the underlying database, the default maximum refresh rate is once every 5 seconds. While this is certainly enough for many applications, in more real-time use cases such as financial market data there is a persistent need to get ever closer to realtime.&lt;/p&gt;

&lt;p&gt;Conveniently, Grafana lets you tweak server settings to allow higher frequencies. But there's a catch: whatever is sending data to your dashboard needs to keep up with your desired rate. Luckily, QuestDB can support hyper-fast refresh rates. The end result is a more up-to-date dashboard with smooth, continuous-looking updates and that unparalleled feeling of realtime.&lt;/p&gt;

&lt;p&gt;We recently built a series of &lt;a href="https://questdb.io/dashboards/crypto/" rel="noopener noreferrer"&gt;financial market data dashboards&lt;/a&gt; that refresh at a very fast interval. This guide shows you how to do the same.&lt;/p&gt;

&lt;h2&gt;Why higher frequency&lt;/h2&gt;

&lt;p&gt;The first question to ask is whether a higher refresh rate is actually useful for your use case. Here are some scenarios where it can be.&lt;/p&gt;

&lt;h3&gt;It depends on the scale&lt;/h3&gt;

&lt;p&gt;There is little upside to a high refresh rate on large-scale charts which span a few hours to a few days. But when you are looking at small intervals, such as the last minute or the last 5 minutes, a higher update frequency makes sense as short-term changes become much more visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3p9hppy2kq6bu6w8bve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3p9hppy2kq6bu6w8bve.png" alt="A graph demonstrating the above, a long term chart and a short term chart."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Most up-to-date information&lt;/h3&gt;

&lt;p&gt;Oftentimes you want updated information as soon as possible. If you are looking at prices, it's generally better to see the latest price rather than one that's 5 seconds old. Whether that staleness is material depends on your use case.&lt;/p&gt;

&lt;p&gt;If you are actively trading using Grafana charts as a tool, for example to highlight arbitrage opportunities, then you want the latest price as soon as possible. If you are using a Grafana dashboard to view the value of your portfolio, or some other non-time-sensitive metric, then - strictly speaking - being closer to realtime is nicer, but not necessary.&lt;/p&gt;

&lt;h3&gt;Shorter feedback loop&lt;/h3&gt;

&lt;p&gt;Say you are working with a closed-loop system. For example, you change a parameter on your trading algo and want to see the effect on derived metrics. Does it generate new trades? Does it change other linked parameters by propagation? And so on.&lt;/p&gt;

&lt;p&gt;If your actions trigger reactions, then a higher update frequency reduces the latency of that feedback. If you make a mistake, you see it immediately instead of several seconds later. In some cases this makes no difference; in others, it can be critical.&lt;/p&gt;

&lt;h3&gt;It's satisfying&lt;/h3&gt;

&lt;p&gt;It can't just be me… There's a big difference between looking at a periodically updating dashboard and a continuously updating one. Yes, strictly speaking, the dashboard is always updating periodically and we just shrink the update period...&lt;/p&gt;

&lt;p&gt;However, when the period gets small, say one second or below, the updates look smooth and continuous. It's a bit like going from a small, low-definition 24 Hz computer monitor to a 4K 120 Hz one. Sure, you can work with both. But one is much more comfortable and satisfying to use.&lt;/p&gt;

&lt;h2&gt;Configuring update frequencies for your use case&lt;/h2&gt;

&lt;p&gt;We'll review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how to tweak the default settings to add or remove options&lt;/li&gt;
&lt;li&gt;how to tweak the server configuration to allow even higher refresh frequencies than the default permits&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Default frequency choices&lt;/h3&gt;

&lt;p&gt;Grafana comes with a few default frequency options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 day&lt;/li&gt;
&lt;li&gt;2 hours&lt;/li&gt;
&lt;li&gt;1 hour&lt;/li&gt;
&lt;li&gt;30 minutes&lt;/li&gt;
&lt;li&gt;15 minutes&lt;/li&gt;
&lt;li&gt;5 minutes&lt;/li&gt;
&lt;li&gt;1 minute&lt;/li&gt;
&lt;li&gt;30 seconds&lt;/li&gt;
&lt;li&gt;10 seconds&lt;/li&gt;
&lt;li&gt;5 seconds&lt;/li&gt;
&lt;li&gt;off - you need to click the refresh data button manually to update&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Tweaking default settings&lt;/h2&gt;

&lt;p&gt;Your dashboard may not need all of the above options. If you typically refresh once every 2 hours, for example, then the 5-second interval is probably not useful, and vice versa.&lt;/p&gt;

&lt;p&gt;Tweak these default options in the dashboard settings using the Auto refresh field:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqkk1la0izemevvs5na8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqkk1la0izemevvs5na8.png" alt="Grafana dropdown showing refresh intervals"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To change, add, or remove frequencies, simply edit the list of time intervals. For example, you may choose to remove anything over 10 minutes and add custom intervals such as every 2 minutes or every 10 minutes. The resulting list would look like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvw8y0eqwm16e29wu6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvw8y0eqwm16e29wu6w.png" alt="Editing the interval"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving your dashboard, the new settings are available in the dropdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01hdbveso9zj0z9jx6rl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01hdbveso9zj0z9jx6rl.png" alt="New settings in the dashboard"&gt;&lt;/a&gt;&lt;/p&gt;
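&lt;p&gt;If you manage dashboards as code, the same list lives in the dashboard JSON model. Here is a minimal sketch, assuming the &lt;code&gt;timepicker.refresh_intervals&lt;/code&gt; field used by recent Grafana versions (verify the field name against your version's dashboard JSON reference):&lt;/p&gt;

```json
{
  "timepicker": {
    "refresh_intervals": ["2m", "10m", "30m", "1h"]
  }
}
```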
&lt;h2&gt;Unleashing high frequency updates&lt;/h2&gt;

&lt;p&gt;While this works for intervals down to 5 seconds, it does not work by default for anything below 5 seconds. If you try to add a frequency such as '1s' or '250ms', you won't normally see it in the dropdown.&lt;/p&gt;

&lt;p&gt;Here is an attempt to add refresh intervals below the standard minimum of 5 seconds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkodzqowxzlw26wkrcgh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkodzqowxzlw26wkrcgh2.png" alt="Dropdown with various times, from 15m to 5s"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we save the dashboard, the frequencies below 5s are not available in the dropdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygbdqtt57kgwm80768nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygbdqtt57kgwm80768nq.png" alt="Added things, but they do not show up"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is because the Grafana server config &lt;code&gt;grafana.ini&lt;/code&gt; has a setting called &lt;code&gt;min_refresh_interval&lt;/code&gt;, which defaults to &lt;code&gt;5s&lt;/code&gt;. The default location for this file is &lt;code&gt;/etc/grafana/grafana.ini&lt;/code&gt;, but it depends on your OS and setup. If you need help locating it, check out the &lt;a href="https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/" rel="noopener noreferrer"&gt;Grafana config docs&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;############## Dashboards History #######&lt;/span&gt;

&lt;span class="o"&gt;[&lt;/span&gt;dashboards]

&lt;span class="c"&gt;# Number dashboard versions to keep (per dashboard).&lt;/span&gt;
&lt;span class="c"&gt;# Default: 20, Minimum: 1&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt; versions_to_keep &lt;span class="o"&gt;=&lt;/span&gt; 20

&lt;span class="c"&gt;# Minimum dashboard refresh interval. When set, this&lt;/span&gt;
&lt;span class="c"&gt;# will restrict users to set the refresh interval of a&lt;/span&gt;
&lt;span class="c"&gt;# dashboard lower than given interval.&lt;/span&gt;
&lt;span class="c"&gt;# Per default this is 5 seconds.&lt;/span&gt;
&lt;span class="c"&gt;# The interval string is a possibly signed sequence&lt;/span&gt;
&lt;span class="c"&gt;# of decimal numbers, followed by a unit&lt;/span&gt;
&lt;span class="c"&gt;# suffix (ms, s, m, h, d), e.g. 30s or 1m.&lt;/span&gt;

&lt;span class="p"&gt;;&lt;/span&gt; min_refresh_interval &lt;span class="o"&gt;=&lt;/span&gt; 5s &lt;span class="c"&gt;# !!&lt;/span&gt;

&lt;span class="c"&gt;# Path to the default home dashboard.&lt;/span&gt;
&lt;span class="c"&gt;# If this value is empty, then Grafana&lt;/span&gt;
&lt;span class="c"&gt;# uses StaticRootPath + "dashboards/home. json"&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt; default_home_dashboard_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, you can set it to something else, for example &lt;code&gt;200ms&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;############## Dashboards History #######&lt;/span&gt;

&lt;span class="o"&gt;[&lt;/span&gt;dashboards]

&lt;span class="c"&gt;# Number dashboard versions to keep (per dashboard).&lt;/span&gt;
&lt;span class="c"&gt;# Default: 20, Minimum: 1&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt; versions_to_keep &lt;span class="o"&gt;=&lt;/span&gt; 20

&lt;span class="c"&gt;# Minimum dashboard refresh interval. When set, this&lt;/span&gt;
&lt;span class="c"&gt;# will restrict users to set the refresh interval of a&lt;/span&gt;
&lt;span class="c"&gt;# dashboard lower than given interval.&lt;/span&gt;
&lt;span class="c"&gt;# Per default this is 5 seconds.&lt;/span&gt;
&lt;span class="c"&gt;# The interval string is a possibly signed sequence&lt;/span&gt;
&lt;span class="c"&gt;# of decimal numbers, followed by a unit&lt;/span&gt;
&lt;span class="c"&gt;# suffix (ms, s, m, h, d), e.g. 30s or 1m.&lt;/span&gt;

min_refresh_interval &lt;span class="o"&gt;=&lt;/span&gt; 200ms &lt;span class="c"&gt;# Faster!&lt;/span&gt;

&lt;span class="c"&gt;# Path to the default home dashboard.&lt;/span&gt;
&lt;span class="c"&gt;# If this value is empty, then Grafana&lt;/span&gt;
&lt;span class="c"&gt;# uses StaticRootPath + "dashboards/home. json"&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt; default_home_dashboard_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
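&lt;p&gt;If you'd rather script the change, here is a minimal sketch. The &lt;code&gt;grafana.ini&lt;/code&gt; path and the &lt;code&gt;grafana-server&lt;/code&gt; service name are assumptions that vary by OS and install method, so adjust them for your setup:&lt;/p&gt;

```shell
# Sketch: set min_refresh_interval to 200ms in grafana.ini.
# GRAFANA_INI defaults to a local file here; on many Linux installs
# the real (root-owned) path is /etc/grafana/grafana.ini.
GRAFANA_INI="${GRAFANA_INI:-grafana.ini}"
# Create a stand-in file if none exists (useful for a dry run).
[ -f "$GRAFANA_INI" ] || printf '[dashboards]\n;min_refresh_interval = 5s\n' > "$GRAFANA_INI"
# Uncomment the setting and lower it to 200ms.
sed -i 's/^;*[[:space:]]*min_refresh_interval.*/min_refresh_interval = 200ms/' "$GRAFANA_INI"
grep 'min_refresh_interval' "$GRAFANA_INI"
# Then restart Grafana so the change takes effect, e.g.:
#   sudo systemctl restart grafana-server
```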



&lt;p&gt;The change only takes effect after a restart, so restart the Grafana instance. We can then set up our new frequencies in the dashboard settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj0r3xn2h1av93h5pw6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj0r3xn2h1av93h5pw6b.png" alt="Adding sub 5s values"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this time, they are available in the list dropdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj03jt78ldkhmrfan1zoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj03jt78ldkhmrfan1zoy.png" alt="All values in the dropdown"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we can get our realtime charts going!&lt;/p&gt;

&lt;p&gt;See a gif: &lt;a href="https://questdb.io/img/blog/2024-01-08/grafana-refresh-9.gif" rel="noopener noreferrer"&gt;https://questdb.io/img/blog/2024-01-08/grafana-refresh-9.gif&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>database</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Normalizing Grafana charts with window functions</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Wed, 10 Jan 2024 16:33:17 +0000</pubDate>
      <link>https://dev.to/questdb/normalizing-grafana-charts-with-window-functions-4ebd</link>
      <guid>https://dev.to/questdb/normalizing-grafana-charts-with-window-functions-4ebd</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;QuestDB is a high-performance time series database with SQL analytics that can power through market data ingestion and analysis. It's &lt;a href="https://github.com/questdb/questdb" rel="noopener noreferrer"&gt;open source&lt;/a&gt; and integrates well with the tools and languages you use. Check us out!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a previous post, we looked at how to &lt;a href="https://questdb.io/blog/manage-large-symbol-lists-questdb-grafana/" rel="noopener noreferrer"&gt;create dynamic lists of symbols and charts in Grafana&lt;/a&gt;. While this is great to watch individual charts for different symbols, sometimes you may want to merge all the charts together to compare changes in a visual manner.&lt;/p&gt;

&lt;p&gt;One of our community members had been struggling with this over time and had tried various approaches. As a result, we created an implementation of the &lt;a href="https://questdb.io/docs/reference/function/window/#first_value" rel="noopener noreferrer"&gt;&lt;code&gt;first_value()&lt;/code&gt;&lt;/a&gt; window function which easily solves the underlying issues with these types of visualizations. This article explains how it's applied.&lt;/p&gt;

&lt;h2&gt;The simple approach&lt;/h2&gt;

&lt;p&gt;Unfortunately, the simple approach does not work. If we were to create a chart with the prices of ETH-USD and BTC-USD, we would end up with something like this…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;trades&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;__timeFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ETH-USD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'BTC-USD'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgndknzbii2wc6dxrv635.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgndknzbii2wc6dxrv635.png" alt="A price chart, one line very high up, one very low down. It's clear there's some volatile action on a very granular scale, but the lines are virtually flat across the X axis."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart uses Grafana's partition-by-values transformation to generate two series. As you can see, the price of 'BTC-USD' is roughly 17x greater than that of 'ETH-USD'. On the same scale, the two series are hard to compare. If we wanted to look at volatility or another moving metric, we'd need lots of patience and a very strong magnifying glass.&lt;/p&gt;

&lt;h2&gt;Overriding the axis&lt;/h2&gt;

&lt;p&gt;One way to make the series comparable is to assign each one to its own axis. We can achieve this with an override on the ETH-USD price series, setting the axis placement property to the right:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo8b0omsambjqnyemp79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo8b0omsambjqnyemp79.png" alt="An improvement, the two lines are overlapped over each other and through two axis cover two very different time ranges. We can compare them now, but is it accurate?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this makes it easier to see how each series moves, we lose any sense of scale because the two axes remain independent. The series do not start at the same point, so it is hard to tell whether one is moving relatively more than the other in a given direction. In addition, this quickly becomes messy if we need to compare more than two series.&lt;/p&gt;

&lt;h2&gt;The overkill&lt;/h2&gt;

&lt;p&gt;Before window functions, our community member found a trick to achieve what they wanted. It consisted of using sub-queries to do the following…&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the first value of the series for each symbol&lt;/li&gt;
&lt;li&gt;Get all the values for each symbol for the current time period&lt;/li&gt;
&lt;li&gt;Cross join both of the above based on symbol&lt;/li&gt;
&lt;li&gt;Calculate the percentage change for all symbols, for example &lt;code&gt;current/first * 100&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
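&lt;p&gt;For illustration, that sub-query approach looks roughly like the following. This is a hypothetical sketch using the &lt;code&gt;trades&lt;/code&gt; table and Grafana's &lt;code&gt;$__timeFilter&lt;/code&gt; macro from the queries in this article; &lt;code&gt;first()&lt;/code&gt; here is assumed to be QuestDB's first-value-by-time aggregate:&lt;/p&gt;

```sql
-- Hypothetical sketch of the pre-window-function approach
WITH firsts AS (
  SELECT symbol, first(price) AS first_price
  FROM trades
  WHERE $__timeFilter(timestamp)
), latest AS (
  SELECT timestamp, symbol, price
  FROM trades
  WHERE $__timeFilter(timestamp)
)
SELECT l.timestamp, l.symbol, l.price / f.first_price * 100 AS perf
FROM latest l
JOIN firsts f ON (symbol)
```

&lt;p&gt;Every row is joined back to its symbol's first price, which is where the cost comes from.&lt;/p&gt;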

&lt;p&gt;While this achieved the desired result, it was a heavy query to write and run due to the joins and multiple sub-queries. Over large time periods and with many symbols, it could produce too many data points for Grafana dashboards, necessitating &lt;a href="https://questdb.io/docs/reference/sql/sample-by/" rel="noopener noreferrer"&gt;&lt;code&gt;SAMPLE BY&lt;/code&gt;&lt;/a&gt; to downsample the results. While adequate, that further complicates the query. We can do better.&lt;/p&gt;

&lt;h2&gt;The first_value() window function&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;first_value()&lt;/code&gt; window function returns the first value of a metric over a time window. With it, we can do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Easily access the value as of the first timestamp of the series&lt;/li&gt;
&lt;li&gt;Use the resultant value as a normalizing factor to compare the evolution of the series
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;first_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;
      &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;trades&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;__timeFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ETH-USD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'BTC-USD'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;first_value&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;perf&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;series&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd3nagbbzqyxufx6uo2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd3nagbbzqyxufx6uo2c.png" alt="A price comparison across two appropriately scoped ranges. The data diverge and tell a different story than the former graph."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our new approach reveals that the earlier twin-Y-axis chart was misleading. It suggested that ETH and BTC moved up by roughly the same relative amount. With the normalized chart above, however, everything is on the same scale: ETH significantly outperformed BTC over the time interval!&lt;/p&gt;

&lt;p&gt;This sort of visualization is quite powerful, as it synthesizes market activity across many instruments into a single, simpler chart. But we can do even better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;first_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;
      &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;trades&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;__timeFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;first_value&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;perf&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="k"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's unpack the differences.&lt;/p&gt;

&lt;p&gt;In the prior query, the expression named &lt;code&gt;series&lt;/code&gt; filters the &lt;code&gt;trades&lt;/code&gt; table to include only rows where the symbol is either 'ETH-USD' or 'BTC-USD' and the timestamp satisfies the condition specified by &lt;code&gt;$__timeFilter(timestamp)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In our new query, the expression named &lt;code&gt;data&lt;/code&gt; filters the &lt;code&gt;trades&lt;/code&gt; table to include only rows where the timestamp satisfies the condition specified by &lt;code&gt;$__timeFilter(timestamp)&lt;/code&gt;. It does not filter on symbol.&lt;/p&gt;

&lt;p&gt;The prior query will only calculate and return performance (&lt;code&gt;perf&lt;/code&gt;) for 'ETH-USD' and 'BTC-USD', while our new query will calculate and return performance (&lt;code&gt;perf&lt;/code&gt;) for all symbols in the trades table that satisfy the time filter condition.&lt;/p&gt;

&lt;p&gt;Now, we can configure Grafana as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legend mode = &lt;code&gt;Table&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Legend placement = &lt;code&gt;Right&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Legend values = &lt;code&gt;Last&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We then end up with the following chart summarizing how each crypto pair performed in relative terms in the last few hours:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3borfc0iiz8dfgt1jnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3borfc0iiz8dfgt1jnc.png" alt="A dizzying array of price pairs and movements. It is a strong example of how many assets can be correlated across various ranges."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart makes it pretty apparent that SOL pairs are up, while &lt;code&gt;MATIC&lt;/code&gt; and &lt;code&gt;DOT&lt;/code&gt; are strongly down. While it's easy to get this by computing the ratio of the first and last prices over the interval, a full time series helps us understand what's going on with higher fidelity. Did the pair jump up quickly? Did it trend upwards? And so on.&lt;/p&gt;

&lt;h2&gt;Next steps&lt;/h2&gt;

&lt;p&gt;Window functions greatly reduce query complexity and improve performance. Perhaps most importantly, they lead to higher quality analysis. While we've made it easier to compare values across time, our approach still requires a sub-query. QuestDB is considering a dedicated function such as &lt;code&gt;normalised_value(field, base)&lt;/code&gt;, where base is the scale: does the series start at 100, at 1, or somewhere else?&lt;/p&gt;

&lt;p&gt;If you're interested in that functionality or have any other feedback, please drop by our &lt;a href="https://github.com/questdb/questdb/" rel="noopener noreferrer"&gt;open source repository&lt;/a&gt; or &lt;a href="https://slack.questdb.io/" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt; and let us know.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Solving duplicate data with performant deduplication</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Wed, 22 Nov 2023 17:56:40 +0000</pubDate>
      <link>https://dev.to/questdb/solving-duplicate-data-with-performant-deduplication-4afd</link>
      <guid>https://dev.to/questdb/solving-duplicate-data-with-performant-deduplication-4afd</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;QuestDB is a high-performance time series database with SQL analytics that can help you overcome ingestion speed bottlenecks. It's &lt;a href="https://github.com/questdb/questdb" rel="noopener noreferrer"&gt;open source&lt;/a&gt; and integrates well with the tools and languages you use. Check us out!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;It's a mad, mad, mad, mad world...&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your plane lands and the cabin crew announces: “You may now use your electronic devices.” You switch your phone on and a few seconds later a text message welcomes you to the network and informs you of the (outrageous) service rates. Minutes later, you get exactly the same message. Doubly outrageous! What just happened? Well, you've been “at-least-once'd”.&lt;/p&gt;

&lt;p&gt;The same thing can happen tens of thousands of times when you ingest time series, analytics, or event data. Duplicate data is a pain. It wastes compute and storage resources, slows down ingestion and distorts the accuracy of your data sets. Wouldn't it be better if “at-least-once” were “exactly-once”?&lt;/p&gt;

&lt;p&gt;In this article, we'll look at data deduplication and compare its performance impact across Timescale, Clickhouse and QuestDB.&lt;/p&gt;

&lt;h1&gt;
  
  
  Performance details for the curious mind
&lt;/h1&gt;

&lt;p&gt;Before we get into more detail about deduplication, including the approach we took with QuestDB, let's start with the data. Of course, we expect deduplication to degrade performance somewhat. But by how much? That depends on the number of &lt;code&gt;UPSERT Keys&lt;/code&gt; and on the number of conflicts.&lt;/p&gt;

&lt;p&gt;To demonstrate, we'll run an experiment.&lt;/p&gt;

&lt;p&gt;Our goal is to evaluate the performance impact when ingesting a dataset twice.&lt;/p&gt;

&lt;p&gt;Our methodology is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingest a dataset once&lt;/li&gt;
&lt;li&gt;Re-ingest the same dataset again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a worst-case scenario, as every row produces a conflict. To make things even more interesting, we will ingest the datasets in parallel. This forces the databases to deal not only with duplicates, but also with out-of-order data, which is more challenging to process, even though handling it well is valuable for customers ingesting large data volumes.&lt;/p&gt;

&lt;p&gt;To make sure performance degradation is within industry expectations, we will run the experiment against two very popular and excellent databases that we usually meet in real-time or time-series use cases: Clickhouse and Timescale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Disclaimer: This experiment is not intended to be a rigorous benchmark. The goal is to check relative performance of QuestDB when handling duplicate data. While we are quite knowledgeable of QuestDB, we're average at operating Timescale and Clickhouse. As such, the experiment applies default, out-of-the-box configurations for each database engine, with no alterations. Fine-tuning would certainly generate different results. We welcome you to perform your own tests and share the benchmarks!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This test will run on an AWS EC2 instance: m6a.4xlarge, 16 CPUs, 64 GB of RAM, GP3 EBS volume. We will ingest 15 uncompressed CSV files, each containing 12,614,400 rows, for a total of 189,216,000 rows representing 12 years of hourly data.&lt;/p&gt;

&lt;p&gt;The data represents synthetic e-commerce statistics, with one hourly entry per country (ES, DE, FR, IT, and UK) and category (WOMEN, MEN, KIDS, HOME, KITCHEN). It will be ingested into five tables (one per country) with this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="s1"&gt;'ecommerce_sample_test_DE'&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="n"&gt;SYMBOL&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt; &lt;span class="k"&gt;CACHE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="n"&gt;SYMBOL&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt; &lt;span class="k"&gt;CACHE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;visits&lt;/span&gt; &lt;span class="n"&gt;LONG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;unique_visitors&lt;/span&gt; &lt;span class="n"&gt;LONG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;avg_unit_price&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt; &lt;span class="n"&gt;WAL&lt;/span&gt; &lt;span class="n"&gt;DEDUP&lt;/span&gt; &lt;span class="n"&gt;UPSERT&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The total size of the raw CSVs is about 17GB, and we are reading from a RAM disk to minimize the impact of reading the files. We will thus be reading/parsing/ingesting from up to 8 files in parallel. The experiment's scripts are written in Python, so we could optimize ingestion by reducing CSV parsing time using a different programming language. But remember: this is not a rigorous benchmark.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our data set: &lt;a href="https://mega.nz/folder/A1BjnSYQ#NQe5qhYLVBqiRwhWRmcVtg" rel="noopener noreferrer"&gt;https://mega.nz/folder/A1BjnSYQ#NQe5qhYLVBqiRwhWRmcVtg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Our scripts: &lt;a href="https://github.com/javier/deduplication-stats-questdb" rel="noopener noreferrer"&gt;https://github.com/javier/deduplication-stats-questdb&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How did it turn out?&lt;/p&gt;

&lt;p&gt;As far as raw numbers, the table demonstrates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Storage engine&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ingest without dedupe&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ingest with dedupe&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Latency increase (lower is better)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Timescale&lt;/td&gt;
&lt;td&gt;13m23s&lt;/td&gt;
&lt;td&gt;15m28s&lt;/td&gt;
&lt;td&gt;+15.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clickhouse&lt;/td&gt;
&lt;td&gt;3m54s&lt;/td&gt;
&lt;td&gt;4m36s&lt;/td&gt;
&lt;td&gt;+17.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QuestDB&lt;/td&gt;
&lt;td&gt;2m12s&lt;/td&gt;
&lt;td&gt;2m23s&lt;/td&gt;
&lt;td&gt;+8.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To get these results, the ingestion script was run multiple times for every database. The numbers above are those of the best runs; in any case, variability across runs was quite low. Looks good! But the raw numbers don't tell the full story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-01%2Fingest-with-dedup-chart.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-01%2Fingest-with-dedup-chart.webp" alt="Comparison graph, bar style"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Timescale deduplication performance
&lt;/h3&gt;

&lt;p&gt;Timescale took 13 minutes and 23 seconds to ingest the whole dataset for the first time. When re-ingesting for deduplication over the same tables with &lt;code&gt;ON CONFLICT … DO UPDATE&lt;/code&gt;, the process took 15 minutes and 28 seconds.&lt;/p&gt;

&lt;p&gt;As was to be expected with the underlying PostgreSQL engine, uniqueness is guaranteed at every point, which is suitable for strong exactly-once semantics. Performance-wise, however, re-playing the events took 15.5% longer than the initial ingestion, and the run was significantly slower than the analytics-optimized storage engines, Clickhouse and QuestDB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clickhouse deduplication performance
&lt;/h3&gt;

&lt;p&gt;Clickhouse took 3 minutes and 54 seconds to ingest the whole dataset for the first time. When re-ingesting over the same tables using the &lt;code&gt;ReplacingMergeTree&lt;/code&gt; engine and defining Primary Keys, the process took 4 minutes and 36 seconds. That's quite awesome.&lt;/p&gt;

&lt;p&gt;However, when we ran a &lt;code&gt;SELECT count(*) FROM ecommerce_sample_test_ES&lt;/code&gt;, the number of rows was 60,543,200, which is exactly double the number of rows in the dataset. The explanation is that Clickhouse applies deduplication in the background after a few minutes have elapsed. Fast, but during that window we have inaccurate data.&lt;/p&gt;

&lt;p&gt;If we want the real number of unique rows, we can add the &lt;code&gt;FINAL&lt;/code&gt; keyword to the query. But then the query slows down: a simple count went from 0.002 seconds to 0.337 seconds. Clickhouse themselves also recommend against this method. That makes it tricky to get strong exactly-once semantics. Performance-wise, re-playing the events took 17.9% longer than the initial ingestion.&lt;/p&gt;
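&lt;p&gt;For illustration, the difference is just the &lt;code&gt;FINAL&lt;/code&gt; modifier on the query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Fast, but may count not-yet-compacted duplicates
SELECT count(*) FROM ecommerce_sample_test_ES;

-- Merges duplicates on read: accurate, but slower
SELECT count(*) FROM ecommerce_sample_test_ES FINAL;
&lt;/code&gt;&lt;/pre&gt;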

&lt;h3&gt;
  
  
  QuestDB deduplication performance
&lt;/h3&gt;

&lt;p&gt;QuestDB took 2 minutes and 12 seconds to ingest the whole dataset for the first time. When re-ingesting over the same tables using &lt;code&gt;DEDUP … UPSERT KEYS&lt;/code&gt;, the process took 2 minutes and 23 seconds. No duplicates were written, so this would be suitable for strong exactly-once semantics. Performance-wise, re-playing the events took 8.3% longer than the initial ingestion.&lt;/p&gt;

&lt;p&gt;Overall, performance looks good. But let's take a step back and explain what's happening within QuestDB. Briefly, how does it work?&lt;/p&gt;

&lt;p&gt;The process for QuestDB deduplication is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sort the commit data by designated timestamp before &lt;code&gt;INSERT&lt;/code&gt; into the table. This is the same as a normal &lt;code&gt;INSERT&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If deduplication uses extra &lt;code&gt;UPSERT Keys&lt;/code&gt;, additionally sort rows that share a timestamp by those key columns&lt;/li&gt;
&lt;li&gt;Eliminate the duplicates in uncommitted data&lt;/li&gt;
&lt;li&gt;Perform a 2-way sorted merge of the uncommitted data into existing partitions. This is also the same as a normal &lt;code&gt;INSERT&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If a particular timestamp value exists in both new and existing data, compare the additional key columns. If it is a full match, take the new row instead of the old one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The extra steps are 2, 3, and 5. To perform the comparison, these steps require more CPU processing and disk IO to load the values from the additional columns. If there are no additional &lt;code&gt;UPSERT Keys&lt;/code&gt; or if all the timestamps are unique, then these additional steps add virtually no load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-01%2Fdata-deduplication-questdb-tech-diagram.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-01%2Fdata-deduplication-questdb-tech-diagram.webp" alt="QuestDB internals"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Towards exactly once
&lt;/h2&gt;

&lt;p&gt;To solve data deduplication, top databases typically implement one of the following strategies:&lt;/p&gt;

&lt;h3&gt;
  
  
  Unique indexes
&lt;/h3&gt;

&lt;p&gt;Traditional databases like PostgreSQL, PostgreSQL-based systems like Timescale, and most relational databases require that you create &lt;a href="https://www.postgresql.org/docs/current/indexes-unique.html" rel="noopener noreferrer"&gt;unique indexes&lt;/a&gt;. Unique indexes provide two options for handling duplicates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Throw an error when a duplicate is encountered&lt;/li&gt;
&lt;li&gt;Use an &lt;code&gt;UPSERT&lt;/code&gt; strategy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With &lt;code&gt;UPSERT&lt;/code&gt;, non-duplicate rows are inserted as usual. But rows where the unique keys already exist will be considered updates that store the most recent values for the non-unique columns. The table still doesn't hold duplicates, but it does allow for corrections or updates.&lt;/p&gt;
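&lt;p&gt;In PostgreSQL and Timescale, this upsert behaviour is expressed with &lt;code&gt;ON CONFLICT&lt;/code&gt;. A minimal sketch against a table shaped like the one from our experiment, assuming a unique index or constraint exists on the key columns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Sketch: requires a unique index/constraint on (ts, country, category)
INSERT INTO ecommerce_sample_test_DE (ts, country, category, visits, sales)
VALUES ('2023-01-01T00:00:00', 'DE', 'HOME', 42, 1234.5)
ON CONFLICT (ts, country, category)
DO UPDATE SET visits = EXCLUDED.visits, sales = EXCLUDED.sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Non-conflicting rows insert as usual; conflicting rows update the non-key columns with the values from &lt;code&gt;EXCLUDED&lt;/code&gt;, keeping the table free of duplicates.&lt;/p&gt;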

&lt;p&gt;Since indexes are involved, deduplication has an impact on performance. For the typical use case of a relational database, when the read/write ratio is heavily biased for reads, this impact might be negligible. But with write-heavy systems such as those seen within analytics, performance can noticeably degrade.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compacted tables
&lt;/h3&gt;

&lt;p&gt;Analytical databases such as Clickhouse prioritize ingestion performance. As a result, they will accept duplicate values and, at a later point in time, compact the tables in the background, keeping only the latest version of a row. This means that for an indeterminate time, duplicates will be present in your tables. This is not ideal for data accuracy and efficiency.&lt;/p&gt;

&lt;p&gt;To resolve this, developers can add the &lt;code&gt;FINAL&lt;/code&gt; keyword to their queries as a workaround. The &lt;code&gt;FINAL&lt;/code&gt; keyword will prevent duplicates from appearing within query results. But that makes queries slower. Clickhouse recommends that people avoid this method. That means that the recommended, performant path risks duplicates making it into your data set and thus into your queries and dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Out-of-database deduplication
&lt;/h3&gt;

&lt;p&gt;The third strategy is to move deduplication handling outside of the database itself. In this case, the deduplication logic is applied right before ingestion, as a connector running on &lt;a href="https://questdb.io/docs/third-party-tools/kafka/overview/" rel="noopener noreferrer"&gt;Kafka Connect&lt;/a&gt;, &lt;a href="https://questdb.io/docs/third-party-tools/flink/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt;, &lt;a href="https://questdb.io/docs/third-party-tools/spark/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, or another similar stream processing component. This solution can be flimsy. Data ingested outside of the connector or outside a particular time-range might result in duplicates. Until recently, this was the only approach available in QuestDB.&lt;/p&gt;

&lt;h1&gt;
  
  
  Adding native data deduplication to QuestDB
&lt;/h1&gt;

&lt;p&gt;While delegating deduplication to an external component was convenient for the QuestDB team, it was not ideal for customers. Customers had to manage an extra component so that all ingestion occurred within the deduplication pipeline. In many cases, external deduplication was only best-effort and didn't hit our high quality standards.&lt;/p&gt;

&lt;p&gt;To address this, the team implemented native data deduplication with four goals in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Guarantee exactly-once semantics&lt;/li&gt;
&lt;li&gt;Work as &lt;code&gt;UPSERT&lt;/code&gt;: discard exact duplicates, update the metric columns when a new version of a record appears, and make re-ingestion idempotent&lt;/li&gt;
&lt;li&gt;Impact on performance must be negligible, so deduplication can be always-on&lt;/li&gt;
&lt;li&gt;Simple to activate/deactivate and transparent for developers and tools that select data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After a few months of hard work, the &lt;code&gt;DEDUP&lt;/code&gt; keyword was released as part of &lt;a href="https://github.com/questdb/questdb/releases/tag/7.3" rel="noopener noreferrer"&gt;QuestDB 7.3&lt;/a&gt;. Now a developer can activate/deactivate deduplication either as part of a &lt;code&gt;CREATE TABLE&lt;/code&gt; statement, or any time via &lt;code&gt;ALTER TABLE&lt;/code&gt;.&lt;/p&gt;
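&lt;p&gt;For example, using the table from the experiment above (syntax as introduced in QuestDB 7.3; check the release notes for your version):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Enable deduplication on an existing WAL table
ALTER TABLE ecommerce_sample_test_DE
  DEDUP ENABLE UPSERT KEYS(ts, country, category);

-- Deactivate it again at any time
ALTER TABLE ecommerce_sample_test_DE DEDUP DISABLE;
&lt;/code&gt;&lt;/pre&gt;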

&lt;p&gt;QuestDB is a time-series database, so deduplication must include the designated timestamp, plus any number of columns to use as &lt;code&gt;UPSERT Keys&lt;/code&gt;. A table with deduplication enabled is guaranteed to store only one row for each unique combination of designated timestamp + &lt;code&gt;UPSERT Keys&lt;/code&gt;. QuestDB will silently update the rest of the columns if a new version of the row is received.&lt;/p&gt;

&lt;p&gt;Deduplication happens at ingestion time, which means you will never see any duplicates when selecting rows from a table. This gives you a strong exactly-once guarantee. The best part is that QuestDB will do all of that with very low impact on ingestion performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion, only a mad world
&lt;/h1&gt;

&lt;p&gt;If you need ingestion idempotence and exactly-once semantics for your database, QuestDB gives you deduplication with very low impact on performance. As a result, you can enjoy strong performance with deduplication always enabled. When compared to other excellent database engines, QuestDB strikes a very good balance between performance, strong guarantees and usability.&lt;/p&gt;

&lt;p&gt;To learn more about deduplication and see how you can apply it in QuestDB, check out our &lt;a href="https://questdb.io/docs/concept/deduplication/" rel="noopener noreferrer"&gt;data deduplication documentation&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>programming</category>
      <category>opensource</category>
      <category>database</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Is all data time-series data?</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Wed, 22 Nov 2023 17:33:12 +0000</pubDate>
      <link>https://dev.to/questdb/is-all-data-time-series-data-ing</link>
      <guid>https://dev.to/questdb/is-all-data-time-series-data-ing</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://questdb.io/" rel="noopener noreferrer"&gt;QuestDB&lt;/a&gt; is an open source, high performance time series database. With its massive ingestion throughput speeds and cost effective operation, QuestDB reduces infrastructure costs and helps you overcome tricky ingestion bottlenecks. Thanks for reading!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Time-series data is everywhere. It's the fastest growing new data type. But where is it all coming from? And isn't &lt;em&gt;all&lt;/em&gt; data essentially time-series data? When does data ever exist outside of time? This article investigates the source of all this new time-bound data and explains why more and more of it will keep coming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is it all time series?
&lt;/h2&gt;

&lt;p&gt;The quick and clean answer is no, not all data is time-series data. There are &lt;em&gt;many&lt;/em&gt; data types and most do not relate to time. To demonstrate, we'll pick a common non-time-series data type: Relational data.&lt;/p&gt;

&lt;p&gt;Relational data is organized into tables and linked to other data through common attributes. An example would be a database of dogs, their breed, and whether or not they are well-behaved. This data is relational, categorical and cross-sectional. It's a snapshot of a group of entities with no relationship to "when". In this case, these entities are dogs. Adorable! But not time-series data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Frelational-dogs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Frelational-dogs.webp" alt="Dogs outside of time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, time is not relevant to the name, breed and behavioural tendencies of our dogs. It does not matter when the dog was added to the table, or when any of its values changed. The entities are held in a timeless vacuum.&lt;/p&gt;

&lt;p&gt;By contrast, time-series data is indexed in accordance with time, which is linear and synchronized. It consists of a sequence of data points, each associated with a timestamp. An example would be if our database of dogs included the dog's name, their breed, and their time in the local dog race:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Ftimed-dogs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Ftimed-dogs.webp" alt="Dogs in time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This table contains a timestamp and so now contains time-series data. But this data is not &lt;em&gt;&lt;strong&gt;the&lt;/strong&gt;&lt;/em&gt; time-series data that requires specific features or a specialized time-series database. To require a specialized database, data must also match a specific set of demand and usage patterns. For now, it is simply time-series data in a &lt;em&gt;transactional, relational database&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  In each action, a wake of time
&lt;/h2&gt;

&lt;p&gt;When does it cross that threshold? For this example, we will put our table of dogs into a practical light. Consider a group of people inputting dog information into a database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Fmagical-relational-ingest.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Fmagical-relational-ingest.webp" alt="People put data into a database"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple! Now take one more practical step. How is the database accessed? A small team of people each log in to a front-end data-entry application. For security purposes, an authentication server sits before the web application. A person authenticates, and then their session is kept alive for 24 hours:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Finsert-with-auth.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Finsert-with-auth.webp" alt="Securing and accessing a database"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The authentication server needs to know exactly when a person logged in to determine when to invalidate their session. This requires a timestamp column. &lt;/p&gt;

&lt;p&gt;The security provider that handles login may receive tens of thousands of requests every second. Tracking each attempt in chronological order and revoking sessions with precision is an intense demand.&lt;/p&gt;

&lt;p&gt;This is a key point. As above, the presence of time-series data didn't mean much to our small-scale transactional, relational database. But now we've got a flow of time-series data. And with it our requirements change.&lt;/p&gt;

&lt;p&gt;And so deeper we go...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Finsert-with-more.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Finsert-with-more.webp" alt="More time-series data!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The database is hosted somewhere, perhaps &lt;em&gt;in the Cloud&lt;/em&gt;. Cloud billing is based on compute time. The matching front-end application collects performance metrics. This is all novel time-series data which contains essential insights from which business logic is written.&lt;/p&gt;

&lt;p&gt;We can go deeper still and consider the entire chain. DNS queries hit DNS resolvers and hold a Time-To-Live value for DNS propagation. A Content Delivery Network before the front-end application gives precise detail on when something was accessed and how long static assets will need to live within the cache. Just one transactional update to the relational database generates a wake of essential time-bound data, for security and analysis.&lt;/p&gt;

&lt;p&gt;Thus we retreat from our dive and head into the next section with the following crystallized takeaway: not all data is time-series data, but time-series data is generated via virtually any operation, including the creation, curation, and analysis of “non-time-series data” in “non-time-series databases”.&lt;/p&gt;

&lt;h2&gt;
  
  
  OLAP vs OLTP: Process types
&lt;/h2&gt;

&lt;p&gt;This cascading relationship between transactions and analysis is best explained via a comparison of two common data processing techniques: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).&lt;/p&gt;

&lt;h3&gt;
  
  
  Online Transaction Processing (OLTP)
&lt;/h3&gt;

&lt;p&gt;OLTP systems prioritize fast queries and data integrity in multi-access environments, where many people make updates at once. They are designed to handle a large number of short, atomic transactions and are optimized for transactional consistency. OLTP systems tend to perform operational tasks, such as creating, reading, updating, and deleting operations: the basic CRUD operations in a database.&lt;/p&gt;

&lt;p&gt;Most OLTP systems use a relational database, where data is organized into normalized tables. Our dog database is OLTP. Another example would be an ecommerce system where customers purchase from an inventory. Many shoppers purchase many shoes from an accurate, single source of inventory. Each transaction is atomic, and the system handles a large number of concurrent transactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Online Analytical Processing (OLAP)
&lt;/h3&gt;

&lt;p&gt;OLAP systems prioritize the fast computation of complex queries and often aggregate large amounts of data. They optimize for query and ingest speed over transactional consistency. They tend to be fast, powerful and flexible.&lt;/p&gt;

&lt;p&gt;Our dog race database &lt;strong&gt;is not&lt;/strong&gt; an example of an OLAP system. Even though it applies time-series data, the race times are recorded and then entered into the database, making it more transactional and thus better suited for OLTP. In other words: the presence of time-series data does not automatically mean it's an OLAP use case.&lt;/p&gt;

&lt;p&gt;However, the example infrastructure above introduces time-series elements where the performance optimizations found in OLAP systems become essential. On the way to one transactional operation, thousands of time-bound data points may be collected. Given how much time-series data is created, ingest and analysis requires the mightier performance profile of a purpose-built analytical system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better with both
&lt;/h2&gt;

&lt;p&gt;OLAP and OLTP are not at odds. They are complementary. Both OLTP and OLAP systems apply time-series data. But due to their respective strengths, time-series specialized databases excel at OLAP while relational databases with their transactional guarantees are a more natural fit for OLTP. At scale, you will see both systems work together.&lt;/p&gt;

&lt;p&gt;Consider ecommerce once more. The ratio of &lt;em&gt;browsers&lt;/em&gt; to &lt;em&gt;buyers&lt;/em&gt; skews very heavily towards &lt;em&gt;browsers&lt;/em&gt;. We all window shop from time to time, but a much smaller percentage of us complete a purchase. Every completed purchase requires a reliable way to update the inventory, which is a task for OLTP.&lt;/p&gt;

&lt;p&gt;But every &lt;em&gt;browser&lt;/em&gt; generates waves of data with their clicks, navigation and interactions: behavioural, statistical, and so on, which can be leveraged to increase the likelihood a person will complete a purchase. Time-series data is created in abundance in the &lt;em&gt;browsers&lt;/em&gt; case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Fmore-browsers-than-buyers.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2023-11-15%2Fmore-browsers-than-buyers.webp" alt="# of Browsers &amp;gt; # Buyers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;QuestDB is a specialized time-series database. It excels at OLAP cases, and supplements OLTP cases. Consider a traditional &lt;a href="https://dev.to/glossary/relational-database/"&gt;relational database&lt;/a&gt; like PostgreSQL. In it we may store stock exchange data, such as the current price of a stock for tens of thousands of companies. With the sheer number of stocks and transactions between them, a time-series database is needed beside it to record and analyze its history.&lt;/p&gt;

&lt;p&gt;For example, whenever a stock price changes, an application will update a queue which then updates a PostgreSQL table. The &lt;code&gt;UPDATE/INSERT&lt;/code&gt; event is then sent to something like &lt;a href="https://questdb.io/docs/third-party-tools/kafka/overview/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;, which reads the events and inserts them into a time-series database.&lt;/p&gt;

&lt;p&gt;The relational store, which provides the transactional guarantee, maintains the stock prices and the present "state". The time-series database then keeps the history of changes, visualizing trends via a dashboard with computed averages and a chart showing the volume of changes. The relational database may process and hold &lt;code&gt;100,000&lt;/code&gt; rows at any time, while the time-series database may process and hold &lt;code&gt;100,000 * seconds&lt;/code&gt; rows for present and future analysis.&lt;/p&gt;
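&lt;p&gt;As a sketch, the history side of that split lends itself to time-series SQL such as QuestDB's &lt;code&gt;SAMPLE BY&lt;/code&gt;. The &lt;code&gt;stock_prices&lt;/code&gt; table and its designated timestamp &lt;code&gt;ts&lt;/code&gt; are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Sketch: hourly average price per symbol over the retained history.
-- 'stock_prices', with designated timestamp 'ts', is illustrative.
SELECT ts, symbol, avg(price) AS avg_price
FROM stock_prices
SAMPLE BY 1h;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The relational store answers “what is the price now?”, while a query like this answers “how has the price behaved over time?”.&lt;/p&gt;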

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Not all data is time-series data. But creating and accessing any online data generates time-series data in waves. To respond to this demand, time-series databases like QuestDB and others have arrived to handle the &lt;em&gt;wake of data&lt;/em&gt; left in the exchange of high-integrity transactions and high-volume operations.&lt;/p&gt;

&lt;p&gt;While time-series data is found in both OLAP and OLTP systems, a specialized performance and feature profile is required to handle the significant historical and temporal data generated by even basic functions in modern applications. Time-series databases excel at these demands, while not being confined strictly to OLAP use cases.&lt;/p&gt;

&lt;p&gt;Interested in a high-performance time-series database? Check out &lt;a href="https://cloud.questdb.com/signup?utm_source=DevTo&amp;amp;utm_content=Blog&amp;amp;utm_campaign=ReleaseWeek4" rel="noopener noreferrer"&gt;QuestDB Cloud&lt;/a&gt; and get started in minutes.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>beginners</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Time-series IoT tracker using QuestDB, Node.js, and Grafana</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Fri, 06 Oct 2023 19:54:49 +0000</pubDate>
      <link>https://dev.to/questdb/time-series-iot-tracker-using-questdb-nodejs-and-grafana-51ii</link>
      <guid>https://dev.to/questdb/time-series-iot-tracker-using-questdb-nodejs-and-grafana-51ii</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;QuestDB is a high-performance, open source time-series database that breaks through traditional ingest bottlenecks. This tutorial can help you get started! Like what you see? &lt;a href="https://github.com/questdb/questdb"&gt;Star us on GitHub&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Time-series data is all around us. We rely on financial tick data to make monetary decisions. Application services track web engagement and deep system metrics. Even data with no obvious relationship to time is often bound to time-based metadata. In almost every aspect of our connected lives, time leads a stream of data, from GPS and geolocation to health monitors and much more.&lt;/p&gt;

&lt;p&gt;Managing time-series data - especially at large volumes - is challenging with traditional tools. Why? Time-series data needs precise chronological order. The key insight within a data set is seldom an individual data point, but rather the patterns seen once data is down-sampled or aggregated. Without chronological order, insights from these patterns are lost.&lt;/p&gt;

&lt;p&gt;Since traditional tools were not designed to handle order at very high ingest volume, we see performance challenges at scale. In recent years, we’ve seen the rise of &lt;a href="https://questdb.io/glossary/time-series-database/"&gt;time-series databases&lt;/a&gt; that are better optimized for these workloads.&lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll simulate a busy, real-time IoT tracker and see how QuestDB, a fast time-series database, can ingest and analyze that data efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time-series data and IoT
&lt;/h2&gt;

&lt;p&gt;This project will have three main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js IoT simulator to generate fake data and send it to QuestDB&lt;/li&gt;
&lt;li&gt;QuestDB to store that data&lt;/li&gt;
&lt;li&gt;Grafana to visualize the data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can easily replace the IoT simulator with real-data sources from managed services like AWS IoT Core or Azure IoT. But we want to show off ingestion and query performance, so we’re using simulated data for convenience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodejs.org/en"&gt;Node.js 18+&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/get-docker/"&gt;Docker &amp;amp; Docker Compose&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  IoT simulator setup
&lt;/h2&gt;

&lt;p&gt;First, create a new Node.js project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir questdb-iot &amp;amp;&amp;amp; npm init -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, install the &lt;a href="https://github.com/questdb/nodejs-questdb-client"&gt;QuestDB Node.js client&lt;/a&gt; and &lt;a href="https://chancejs.com/usage/node.html"&gt;&lt;code&gt;chance&lt;/code&gt;&lt;/a&gt; library to generate fake data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @questdb/nodejs-client chance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The QuestDB Node.js client uses the InfluxDB Line Protocol (ILP). QuestDB also supports the PostgreSQL wire protocol if you would rather use PostgreSQL client libraries. For those not familiar with it, the InfluxDB Line Protocol is an efficient, text-based protocol for sending time-series data points in a concise manner. It compactly sends a timestamp, measurement values, and other metadata. For more information, check out the &lt;a href="https://dev.to/docs/reference/api/ilp/overview/"&gt;InfluxDB Line Protocol reference guide&lt;/a&gt;.&lt;/p&gt;
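For a sense of the wire format, a single ILP line carries the table name, symbols (tags), columns, and a nanosecond timestamp. An illustrative line for the table we build below (the values are made up) might look like:

```
devices,deviceVersions=v2.0 deviceId="b3d7-example-guid",lat=51.5074,lon=-0.1278,temp=22.5 1696622089000000000
```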

&lt;p&gt;The simulator code is a variation of the &lt;a href="https://github.com/questdb/questdb-quickstart/blob/main/ingestion/python/tsbs_send/ilp_ingestion.py"&gt;Python Quickstart example&lt;/a&gt;. In the code, we loop over 10,000 devices, generating a random &lt;code&gt;deviceId&lt;/code&gt;, &lt;code&gt;latitude&lt;/code&gt;, &lt;code&gt;longitude&lt;/code&gt;, and &lt;code&gt;temperature&lt;/code&gt; value for each, along with a random &lt;code&gt;deviceVersion&lt;/code&gt; from &lt;code&gt;v1.0&lt;/code&gt;, &lt;code&gt;v1.1&lt;/code&gt;, and &lt;code&gt;v2.0&lt;/code&gt;. We can imagine this as a sort of sensor, perhaps a bursty one in a satellite orbiting the Earth.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Sender&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@questdb/nodejs-client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Chance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;chance&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// create a sender with a 4k buffer&lt;/span&gt;
  &lt;span class="c1"&gt;// it is important to size the buffer correctly so messages can fit&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sender&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Sender&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;bufferSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="c1"&gt;// connect to QuestDB&lt;/span&gt;
  &lt;span class="c1"&gt;// host and port are required in connect options&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;9009&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="c1"&gt;// initialize random generator lib&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Chance&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;// device types&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deviceVersions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v1.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v2.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;// loop over devices&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;deviceId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;deviceId&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;deviceId&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;randomVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
      &lt;span class="nx"&gt;deviceVersions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;deviceVersions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="nx"&gt;sender&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;devices&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deviceVersions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;randomVersion&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stringColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deviceId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;guid&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floatColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floatColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lon&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floatColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;temp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floating&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;atNow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;// flush the buffer of the sender, sending the data to QuestDB&lt;/span&gt;
    &lt;span class="c1"&gt;// the buffer is cleared after the data is sent and the sender is ready to accept new data&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// close the connection after all rows ingested&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we are using the InfluxDB Line Protocol, there is no need to predefine the schema on the database before sending the data. If a table does not exist, it will be created. This is useful for bootstrapping a quick project and tinkering around.&lt;/p&gt;

&lt;h2&gt;
  
  
  QuestDB and Grafana setup
&lt;/h2&gt;

&lt;p&gt;For convenience, we will use Docker Compose to bootstrap QuestDB and Grafana.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file and paste the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;questdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;questdb/questdb:7.3.1&lt;/span&gt;
    &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;questdb&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;questdb&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9000:9000"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9009:9009"&lt;/span&gt;

    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;QDB_PG_READONLY_USER_ENABLED=true&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana-oss:10.1.0&lt;/span&gt;
    &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts up QuestDB and Grafana in a Docker network and creates a read-only user for Grafana to pull data from QuestDB.&lt;/p&gt;

&lt;p&gt;Next, start up the containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have port &lt;code&gt;9000&lt;/code&gt; (QuestDB UI), &lt;code&gt;9009&lt;/code&gt; (InfluxDB Line Protocol), and &lt;code&gt;3000&lt;/code&gt; (Grafana UI) mapped to &lt;code&gt;localhost&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ingesting data
&lt;/h2&gt;

&lt;p&gt;Now to send data.&lt;/p&gt;

&lt;p&gt;Start the simulator!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node index.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a specialized time series database, QuestDB has a better handle on the volume and the &lt;a href="https://dev.to/glossary/high-cardinality/"&gt;cardinality&lt;/a&gt; of the incoming time-series data compared to a traditional database, or to alternative time-series databases. It also provides features like &lt;a href="https://dev.to/docs/concept/deduplication/"&gt;data deduplication&lt;/a&gt; and &lt;a href="https://dev.to/blog/2021/05/10/questdb-release-6-0-tsbs-benchmark/#the-problem-with-out-of-order-data"&gt;out-of-order (O3)&lt;/a&gt; indexing which are essential with high amounts of incoming data.&lt;/p&gt;

&lt;p&gt;Neat, fast. However, 10,000 "devices" reporting is not &lt;em&gt;that many&lt;/em&gt; in the time-series world. Let's crank up our sample data, increasing both the overall burst and the cardinality of one of our columns. To do this, we will run the script multiple times and make our loop more flavourful.&lt;/p&gt;

&lt;p&gt;We don't want to cook our computers, but we can get creative. For a more robust test scenario, we will raise the number of "devices" to, say, a million, and add a highly random column full of values between negative one million and one million:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;deviceId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;deviceId&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;deviceId&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floatColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;curio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And maybe we want to run it a couple of times, for good measure?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now run it, navigate to &lt;code&gt;localhost:9000&lt;/code&gt;, and we should see our data arrive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S2ESPIl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fu6jr3e1it5ng92lpaav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S2ESPIl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fu6jr3e1it5ng92lpaav.png" alt="An image showing ingest speed -- 2,000,000 rows in 17ms!" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A broad &lt;code&gt;SELECT&lt;/code&gt; query returns fast, even when Docker-ized. Just like that, we have a nice table view of all of our simulated devices.&lt;/p&gt;
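Such queries can also be issued programmatically against QuestDB's HTTP REST API on port 9000 (the /exec endpoint), which the compose file already maps. A minimal sketch, assuming the devices table created by the simulator:

```javascript
// Build a query URL for QuestDB's REST API (GET /exec?query=...).
function buildQueryUrl(host, query) {
  return "http://" + host + ":9000/exec?query=" + encodeURIComponent(query);
}

const url = buildQueryUrl(
  "localhost",
  "SELECT avg(temp) FROM devices WHERE deviceVersions = 'v2.0'"
);

// Node.js 18+ ships a global fetch; the JSON response contains
// "columns" and "dataset" fields:
//   const json = await (await fetch(url)).json();
console.log(url);
```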

&lt;p&gt;Measuring database performance depends on many factors, like hardware, the schema, overall infrastructure and so on. However, we can still get a sense of how a specialized time-series database handles bursting data, even in a local case.&lt;/p&gt;

&lt;p&gt;But speedy ingest is not all we're after. We also want to query and visualize our data. Let's start with the built-in Chart view and see the temperature of all devices with &lt;code&gt;v2.0&lt;/code&gt; over a period of time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TqWFN78C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byr04pxjlzbawczstzcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TqWFN78C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byr04pxjlzbawczstzcr.png" alt="A chart view in the QuestDB web console" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to Grafana
&lt;/h2&gt;

&lt;p&gt;Instead of making charts directly within our database, we will offload visualization to a specialized layer for easier access. For this, we'll use Grafana.&lt;/p&gt;

&lt;p&gt;Navigate to &lt;code&gt;localhost:3000&lt;/code&gt; and use the default credentials to log in: &lt;code&gt;admin&lt;/code&gt;/&lt;code&gt;admin&lt;/code&gt;. Remember: if you are creating something for production, change the default username and password!&lt;/p&gt;

&lt;p&gt;Click on “Add data source” and choose the PostgreSQL type. Since we’re running this within the same Docker network, we can use the container name for the host and specify QuestDB’s PostgreSQL wire port &lt;code&gt;8812&lt;/code&gt; (i.e., &lt;code&gt;questdb:8812&lt;/code&gt;). The read-only user credentials are &lt;code&gt;user&lt;/code&gt;/&lt;code&gt;quest&lt;/code&gt;, with the default database as &lt;code&gt;qdb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cX40Y4r2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpx4gnfgt9b20obc6anl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cX40Y4r2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpx4gnfgt9b20obc6anl.png" alt="Grafana UI -- entering SQL credentials as above described" width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After connecting, click on “Explore data”.&lt;/p&gt;

&lt;p&gt;Use the native SQL query to visualize our dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DOfVAox9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ueo6bncnqxeygne58ti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DOfVAox9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ueo6bncnqxeygne58ti.png" alt="An early visualization in Grafana, fairly busy" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now build out panels using built-in Grafana types. For example, we can use the time-series visualization:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gg41HSmm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ljawj2sp9krzbr5qpvxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gg41HSmm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ljawj2sp9krzbr5qpvxb.png" alt="A more organized view of the data using threshold lines" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This adds threshold lines and zooms in so we can see the different temperatures. From here, there are many ways to refine the visualization to make the tracked values clear and easy to see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we used a random IoT device data generator to show how to ingest bursting data into QuestDB. We then used Grafana to visualize the data. Underneath, we used the &lt;a href="https://github.com/questdb/nodejs-questdb-client"&gt;QuestDB Node.js client library&lt;/a&gt;, which uses the InfluxDB Line Protocol for concise data transfer. Once the data hit QuestDB, we saw how fast it was to analyze the data and to create simple visualizations in the Web Console. Finally, we connected Grafana to the QuestDB PostgreSQL endpoint to build more complex panels.&lt;/p&gt;

&lt;p&gt;In a real-world scenario, the data may come directly from the devices, over a managed IoT connection service like AWS IoT Core or Azure IoT, or from a third party. In that case, we can modify our simulator to listen to those events and use the same client library to send data. We may also elect to queue and batch our calls, or use multiple threads to spread out the load. Whichever way we choose, using a specialized time-series database like QuestDB prevents us from running into issues as bursts and scale increase.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>node</category>
      <category>javascript</category>
      <category>database</category>
    </item>
    <item>
      <title>Our Website Source Is Now Private, A Cautionary Tale</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Fri, 01 Sep 2023 21:30:59 +0000</pubDate>
      <link>https://dev.to/questdb/our-website-source-is-now-private-a-cautionary-tale-3bn1</link>
      <guid>https://dev.to/questdb/our-website-source-is-now-private-a-cautionary-tale-3bn1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;QuestDB is a high-performance time-series database built in Java and C++, with no dependencies and zero garbage collection. Check us out if you have time series data and are looking for high throughput ingestion and fast SQL queries. Not to worry: QuestDB remains &lt;a href="https://github.com/questdb/questdb" rel="noopener noreferrer"&gt;open source&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;"Imitation is the most sincere form of flattery" - Oscar Wilde&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As an open source company that strives to be as transparent as possible — both internally and in the community — we want to have all our code public. We want you to see it all, even for our blogs, docs and marketing pages. But as far as our website goes, recent unfortunate events have made us reconsider our position.&lt;/p&gt;

&lt;p&gt;We'll share what we learned so that it doesn't happen to you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Great artists…
&lt;/h2&gt;

&lt;p&gt;For page traffic, we want to know who's reading what and what is shared — if we know what you like and what is helpful, we can make more of it.&lt;/p&gt;

&lt;p&gt;We're growing, and our traffic remains modest. When something succeeds on Reddit or HackerNews, we feel it. When the charts go up, it's a real thrill.&lt;/p&gt;

&lt;p&gt;During one such event, one of our engineers, Maciej, looked into Google Analytics and noticed something of an... anomaly in our traffic. It had exploded. &lt;/p&gt;

&lt;p&gt;And a large portion of it originated from an uncommon region — Brazil:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3a33giituokyia46wus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3a33giituokyia46wus.png" alt="158 thousand views from Brazil. Usual highest, US, below with 1400."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hey, neat! Something must have resonated within the Brazilian developer community. But which page? One of our deep technical articles? Must be.&lt;/p&gt;

&lt;p&gt;A few clicks later, we determined that the source of traffic was not what we had hoped. Far from coming from one of our fresh new articles, it was generated by an &lt;em&gt;unknown&lt;/em&gt; page. A page with which none of us had any familiarity. This was because it was, in fact, from a different website entirely:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefz6lrirafm9f3wwpqiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefz6lrirafm9f3wwpqiw.png" alt="The offending site, but crossed out."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Uh oh. But was this just one page? Nope, it was many of them.&lt;/p&gt;

&lt;p&gt;Looking deeper, traffic spanned many paths:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffy8533r4gdevh38jwmfs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffy8533r4gdevh38jwmfs.jpg" alt="A list of strange paths that imply... maybe games? Table games?!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from the root domain, these were all unexpected. Not good. But with all this data pointing to unfamiliar pages, we knew exactly where to look for answers. The moment we landed on the strange site, it confirmed what we had all expected.&lt;/p&gt;

&lt;p&gt;While the main landing page and its site paths were altered, the rest of our site had been copied over and hosted in full: metadata, supporting pages with trademarks, images, copy, logos and all. Little effort was made to obscure this fact.&lt;/p&gt;

&lt;p&gt;According to Google Analytics, in a short time we had collected well over 150,000 new visitors to “our site”. But in reality, it was from an entirely different one.&lt;/p&gt;

&lt;p&gt;Well, hey. Traffic is traffic, right? All press is good press? And the theme was open source; isn't that what open source is for? Yes and no. It's not so clear-cut.&lt;/p&gt;

&lt;p&gt;There are things to consider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not flattering
&lt;/h2&gt;

&lt;p&gt;Their post-launch results were staggering. Once live, their pages spread like wildfire. And we know: we have the data to prove it. But the intermingling traffic made us nervous. This was all very different from our usual keywords, demographics and traffic volume. On top of that, it came from an industry that is somewhat of a gray area.&lt;/p&gt;

&lt;p&gt;Could we receive punishment as a result of the clientele of this site? Did we just become flagged as a “toxic website” to the many sites that interlink with our content? Are our search rankings about to plummet?&lt;/p&gt;

&lt;p&gt;Any of that would be very bad. Can we resolve this, and fast? Yes, we can. But before we go any further, how did this even happen?&lt;/p&gt;

&lt;p&gt;We can point to two key reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docusaurus showcase
&lt;/h3&gt;

&lt;p&gt;We use a static site generator called &lt;a href="https://docusaurus.io" rel="noopener noreferrer"&gt;Docusaurus&lt;/a&gt; for our docs, blogs and marketing pages. The design is customized and tailored to our needs. As such, it felt great for the team when Docusaurus featured the QuestDB.io design within their showcase:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dej3tw0q9zq4bx9928z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dej3tw0q9zq4bx9928z.png" alt="QuestDB.io shown in the Docusaurus showcase."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People looking for inspiration for their new Docusaurus site can use the filters to select “open source”. From there, as expected, a click of the source button leads you to, that's right — all of it, the goods. This very blog and all its surrounding pieces are hosted from within that repository, in its entirety…&lt;/p&gt;

&lt;p&gt;Given the complexity and uniqueness of our Docusaurus rendition, it did not seem likely that someone would fork it. We had done minimal work to make it reusable as a template. One does not simply swap some config options and colours and arrive at a brand-new site. We're also on a fairly dated version of Docusaurus, and we'd expect people to want &lt;em&gt;the new stuff&lt;/em&gt;. But it turns out we were wrong.&lt;/p&gt;

&lt;p&gt;And upon reflection, we get it. The source is open and under a permissive license: Apache 2.0. People are free to do as they will, within the parameters of the license. But one condition of the license is to respect existing trademarks and product names:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;... This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using the styles, code and layouts is one thing. But using the same logos, trademarks, and so on, is another. And as for their traffic appearing in our Google Analytics, the blame for that lies with us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remember your env vars
&lt;/h3&gt;

&lt;p&gt;Analytics providers, search engines, and other third parties whose tokens are exposed in the browser usually issue safe, read-only keys. Because these keys are visible to the client, they are designed to be non-destructive.&lt;/p&gt;

&lt;p&gt;As many of you know, each property in Google Analytics gets its own GA tag, which is then placed in Google's JavaScript snippet. These tags are visible when you inspect a website's source.&lt;/p&gt;

&lt;p&gt;To see for yourself, visit a tech site that you like, inspect the source, and search for &lt;code&gt;GTM&lt;/code&gt;. You will most likely find a &lt;code&gt;GTM-XXXXX&lt;/code&gt; value. There is nothing stopping you from using it, except that it won't reveal anything unless you have access to the matching Google Analytics dashboard.&lt;/p&gt;

&lt;p&gt;Though it is somewhat uncomfortable to admit, we soon confirmed that our Google Analytics tag was hard-coded directly in the source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct0ocpuhruu2dmd9r8ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct0ocpuhruu2dmd9r8ah.png" alt="Shows us applying an ENV VAR over the hard coded GTM tag."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This meant that the repository could be cloned or forked, and bam — the new website is a part of QuestDB, as far as Google Analytics is concerned. This was both unfortunate and preventable. In hindsight, the value could have been hidden behind an environment variable in the actual code. It is now!&lt;/p&gt;

&lt;p&gt;The moral of the story: &lt;strong&gt;use env vars for any sensitive - or even somewhat sensitive - value&lt;/strong&gt;. It is basic advice, yes, but sometimes one forgets just how wild and random the broad internet can be. Even if you think no one would use it or that no damage could be done, set it as an env var anyway.&lt;/p&gt;
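
&lt;p&gt;As a minimal sketch of that advice (the variable name and validation are hypothetical, not our actual setup), a build step can read the tag from the environment and fail fast when it is missing:&lt;/p&gt;

```java
import java.util.Map;

// Hypothetical sketch: resolve the analytics container ID from environment
// variables instead of committing it to a public repository.
class AnalyticsConfig {
    static String gtmTag(Map<String, String> env) {
        String tag = env.get("GTM_CONTAINER_ID"); // hypothetical variable name
        if (tag == null || tag.isBlank()) {
            // Failing the build is better than shipping someone else's tag.
            throw new IllegalStateException("GTM_CONTAINER_ID is not set");
        }
        return tag;
    }
}
```

&lt;p&gt;At build time, this would be invoked as &lt;code&gt;gtmTag(System.getenv())&lt;/code&gt;.&lt;/p&gt;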

&lt;h2&gt;
  
  
  Closed source, for now
&lt;/h2&gt;

&lt;p&gt;Despite reaching out to Google, we were unable to get anyone to provide any help. No matter: we cleaned up and applied a new Google Analytics tag. And it appears that the blip is not interfering with our business in any major way.&lt;/p&gt;

&lt;p&gt;So far, we have seen no punitive impact on our rankings. But as a precaution, we have decided to change the visibility of our website repository. The website source will now be set to "private".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ret068qs5pdeiw49oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ret068qs5pdeiw49oh.png" alt="A trolley problem meme. Do we keep the source open, and assume risk?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't so we can twirl our moustaches like villains and apply shady marketing practices. It's to protect us so that if we do silly things like forget an env var, we won't risk cratering the hard-earned value of our website property.&lt;/p&gt;

&lt;p&gt;That said, we will work to open parts of it in the future. For example, documentation has received helpful contributions from community members. Closing the door on them doesn't feel right. Luckily, there is a way around that.&lt;/p&gt;

&lt;p&gt;Right now, QuestDB.io is a single build of our Docusaurus repository. Any doc, blog or content change in the repo generates a new Netlify build. In the future, we can extract the doc contents — which exist in their own folder as &lt;code&gt;.md&lt;/code&gt; or &lt;code&gt;.mdx&lt;/code&gt; files — and host them in their own open repository.&lt;/p&gt;

&lt;p&gt;Using GitHub Actions or another build runner, we can set up a pipeline like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On PR event, create a temporary workspace&lt;/li&gt;
&lt;li&gt;Pull private repo &amp;amp; docs into workspace&lt;/li&gt;
&lt;li&gt;Build workspace as though whole&lt;/li&gt;
&lt;li&gt;Provide PR preview&lt;/li&gt;
&lt;li&gt;On merge, rsync doc contents to private repo&lt;/li&gt;
&lt;li&gt;On commits to main, rebuild production website&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this model, our documentation can remain open even while the rest stays closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;QuestDB is an open source company. We want to work in the open: no secrets! But we've decided to make our website source private for the time being.&lt;/p&gt;

&lt;p&gt;Remember: even if the tag or key or whatever seems benign, use an environment variable. It might save you from a real hassle.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>learning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Concurrent Data-structure Design Walk-Through</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Tue, 22 Aug 2023 21:47:58 +0000</pubDate>
      <link>https://dev.to/questdb/concurrent-data-structure-design-walk-through-1hc1</link>
      <guid>https://dev.to/questdb/concurrent-data-structure-design-walk-through-1hc1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://questdb.io" rel="noopener noreferrer"&gt;QuestDB&lt;/a&gt; is a time-series database that offers fast ingest speeds, InfluxDB Line Protocol and PGWire support and SQL query syntax. QuestDB is composed mostly in Java, and we've learned a lot of difficult and interesting lessons. We're happy to share them with you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Investigating data structures
&lt;/h2&gt;

&lt;p&gt;Concurrent data structure design is hard. This blog offers a guided tour on constructing a special-purpose concurrent map that heavily favors readers. The article will not just present yet another ready-to-use data structure. Instead, I will walk you through the design process while solving a real-world problem. I will even present dead ends that I bumped into along the way. It's a detective story for programmers interested in concurrent programming. &lt;/p&gt;

&lt;p&gt;By the end of the article, we will have a concurrent map for storing blobs of data in native memory. The map is lock-free on a reading path and is also very conservative with memory allocations. Let's get started!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article assumes basic knowledge of Java or a Java-like programming language.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I need a concurrent map where keys are strings and values are fixed-size blobs (public cryptographic keys). This could sound like a &lt;a href="https://twitter.com/PeterVeentjer/status/1685999603684872193" rel="noopener noreferrer"&gt;job for the plain old ConcurrentHashMap from the JDK&lt;/a&gt;, but there's a twist: the blobs must be available outside of the Java heap.&lt;/p&gt;

&lt;p&gt;Why? So that callers can get a pointer to a blob and pass it to Rust code via JNI. The Rust code then uses the public key to verify digital signatures.&lt;/p&gt;

&lt;p&gt;Here's a simplified version of the interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentString2KeyMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set()&lt;/code&gt; method receives a username and a pointer to a key. The map outlives the pointers it receives, so it must copy the memory under the received pointers into its own buffer. In other words: the &lt;code&gt;get()&lt;/code&gt; method must return a pointer to this internal buffer, not the original pointer which was used for &lt;code&gt;set()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I can assume the &lt;code&gt;get()&lt;/code&gt; method will be used frequently and often on the hot path, while the mutation methods will be invoked rarely and never on the hot path.&lt;/p&gt;

&lt;p&gt;This is roughly how readers are going to use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CharSequence&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;challengePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="n"&gt;challengeLen&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;signaturePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;signatureLen&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AuthCrypto&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;verifySignature&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;PUBLIC_KEY_SIZE_BYTES&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;challengePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;challengeLen&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signaturePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signatureLen&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there were no mutations, I could have just implemented a pre-populated immutable lookup directory and called it a day. However, shared mutable state brings two classes of challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pointer lifecycle management&lt;/li&gt;
&lt;li&gt;Consistency of map internals&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first issue boils down to ensuring that when &lt;code&gt;map.get()&lt;/code&gt; returns a pointer, that pointer remains valid and the memory behind it does not change for &lt;em&gt;as long as needed&lt;/em&gt;. In our case, that means until &lt;code&gt;AuthCrypto.verifySignature()&lt;/code&gt; returns.&lt;/p&gt;

&lt;p&gt;The second issue is all about concurrent data-structure design, and we will discuss this in more detail later. Let’s explore the first issue.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pointer lifecycle management
&lt;/h2&gt;

&lt;p&gt;If our map's values were just regular objects managed by the JVM, things would be simple: &lt;code&gt;map.get()&lt;/code&gt; would return a reference to an object and then the map could forget this &lt;code&gt;get()&lt;/code&gt; call ever happened. The &lt;code&gt;remove()&lt;/code&gt; and &lt;code&gt;set()&lt;/code&gt; methods would just remove the map's reference to the value object, and would never change an already returned object. Easy. But that’s not our case: we are working with off-heap memory and have to manage it on our own.&lt;/p&gt;

&lt;p&gt;Fundamentally, there are two ways to solve it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change the &lt;code&gt;get()&lt;/code&gt; contract so it doesn't return a pointer. Instead, it receives a pointer from the outside and copies the value there.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get()&lt;/code&gt; still returns a pointer, but the map guarantees the memory behind it stays immutable until the caller notifies the map that it’s done and that it won’t use the pointer anymore.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Option 1: Callers own the destination memory
&lt;/h3&gt;

&lt;p&gt;The first option looks interesting. The new contract could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentString2KeyMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;srcKeyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;dstKeyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The caller would own the &lt;code&gt;dstKeyPtr&lt;/code&gt; pointer and the map would copy the key from its internals to this pointer and forget this &lt;code&gt;get()&lt;/code&gt; call ever happened.&lt;/p&gt;

&lt;p&gt;This sounds quite nice at first, until we realize it just kicks the can down the road: it forces every calling thread to maintain its own buffer to pass to &lt;code&gt;get()&lt;/code&gt;. If callers are all single-threaded, it’s still easy: each calling object owns a buffer to pass to &lt;code&gt;get()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But if calling functions are themselves concurrent, it becomes more complicated. We have to make sure each calling thread uses a different buffer. &lt;/p&gt;

&lt;p&gt;Ideally, the buffer would be allocated on the stack, but this is Java, so that’s not possible. We certainly do not want to allocate and deallocate a new buffer in the process heap for every invocation.&lt;/p&gt;

&lt;p&gt;So what’s left? Pooling? That’s messy. &lt;/p&gt;

&lt;p&gt;ThreadLocal? Even messier, and it is harder to put a cap on the number of buffers.&lt;/p&gt;
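
&lt;p&gt;To make the objection concrete, here is a minimal sketch of the ThreadLocal variant (the key size is an assumed constant for illustration): every thread that ever calls &lt;code&gt;get()&lt;/code&gt; lazily allocates its own direct buffer, which lives until the thread dies:&lt;/p&gt;

```java
import java.nio.ByteBuffer;

// Sketch of the rejected ThreadLocal-buffer variant of option 1.
// PUBLIC_KEY_SIZE_BYTES is an assumed constant for illustration.
class CallerBuffers {
    static final int PUBLIC_KEY_SIZE_BYTES = 32;

    // One off-heap destination buffer per calling thread; the total number
    // of buffers is effectively unbounded and tied to thread lifetimes.
    private static final ThreadLocal<ByteBuffer> DST =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(PUBLIC_KEY_SIZE_BYTES));

    static ByteBuffer destination() {
        return DST.get();
    }
}
```

&lt;p&gt;Each thread would pass its &lt;code&gt;destination()&lt;/code&gt; buffer to &lt;code&gt;get()&lt;/code&gt;, which keeps things correct but makes the buffer count hard to cap.&lt;/p&gt;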

&lt;p&gt;Maybe option 1 is not as interesting as it seemed at first.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 2: Lifecycle notifications
&lt;/h3&gt;

&lt;p&gt;Let’s explore the second option. The contract remains the same as outlined in the original proposal: &lt;code&gt;long get(String username)&lt;/code&gt;. We have to make sure the memory behind the pointer remains unchanged until we are done.&lt;/p&gt;

&lt;p&gt;The absolute simplest thing would be to use a read-write lock. &lt;/p&gt;

&lt;p&gt;Each map would have a read-write lock associated, and then readers acquire a read lock before calling &lt;code&gt;get()&lt;/code&gt; and release it only after returning from &lt;code&gt;AuthCrypto.verifySignature()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CharSequence&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;challengePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="n"&gt;challengeLen&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;signaturePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;signatureLen&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;acquireReadLock&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AuthCrypto&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;verifySignature&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;PUBLIC_KEY_SIZE_BYTES&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;challengePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;challengeLen&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signaturePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signatureLen&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;releaseReadLock&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mutators would only have to acquire a write lock before calling &lt;code&gt;set()&lt;/code&gt; or &lt;code&gt;remove()&lt;/code&gt;. Not only is this design simple to reason about, it is also simple to implement. &lt;/p&gt;
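
&lt;p&gt;A minimal sketch of this read-write-lock design, with a plain on-heap map standing in for the single-threaded implementation and the off-heap key copy elided:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the read-write-lock design discussed (and ultimately rejected)
// here. A plain on-heap map stands in for the single-threaded map; a real
// implementation would copy key bytes into native memory.
class RwLockedMap {
    private final Map<String, Long> map = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void set(String username, long keyPtr) {
        lock.writeLock().lock(); // mutators exclude readers and each other
        try {
            map.put(username, keyPtr);
        } finally {
            lock.writeLock().unlock();
        }
    }

    long get(String username) {
        lock.readLock().lock(); // blocks while a writer holds the lock
        try {
            return map.getOrDefault(username, 0L);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

&lt;p&gt;For brevity, the sketch takes the lock inside the map methods; the variant described above instead exposes &lt;code&gt;acquireReadLock()&lt;/code&gt; and &lt;code&gt;releaseReadLock()&lt;/code&gt; so callers can hold the read lock across &lt;code&gt;AuthCrypto.verifySignature()&lt;/code&gt;.&lt;/p&gt;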

&lt;p&gt;Assuming that only &lt;code&gt;set()&lt;/code&gt; and &lt;code&gt;remove()&lt;/code&gt; change the internal state, we can just take a single-threaded map implementation and it will do the job. But there's a catch... it violates our original requirements!&lt;/p&gt;

&lt;p&gt;Readers are often on the hot path, and we want them to remain lock-free. The proposed design blocks readers while the map is being updated, so this is a no-go.&lt;/p&gt;

&lt;p&gt;What can we do? We could change the locking scheme to be more fine-grained - instead of locking the whole map, we could lock particular entries. While this would improve practical behaviour, it would also complicate the map design, and readers could still be blocked while the same key is being updated.&lt;/p&gt;

&lt;p&gt;What else? We could use an optimistic locking scheme, but this brings its own intricacies.&lt;/p&gt;

&lt;p&gt;It’s becoming clear that pointer lifecycle management will have to work in concert with the internal map implementation. So was this exercise completely fruitless? Not entirely.&lt;/p&gt;

&lt;p&gt;There is still one design idea we can reuse: map users must explicitly notify the map once they are no longer using a pointer.&lt;/p&gt;

&lt;p&gt;Let’s explore how to design map internals!&lt;/p&gt;
&lt;h2&gt;
  
  
  Designing a map for lock-free readers
&lt;/h2&gt;

&lt;p&gt;I consider myself an experienced generalist. I know bits and bobs about concurrent programming, distributed systems and all kinds of other areas, but I’m not really narrowly specialized in any particular topic. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;A jack of all trades and master of none?&lt;/em&gt; Probably. So when I was thinking about a suitable data structure, I did what every generalist would do in 2023: asked ChatGPT!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t1rinyq0rdl4roln4id.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t1rinyq0rdl4roln4id.png" alt="Chat GPT had some tremendous feedback"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was amazed GPT realized that I meant to write “single writer” and not “single reader”, which I took as proof that GPT knows what it is talking about! 🙂 So I read further: I might have heard about RCU before, but I have never used it myself. I found the description a bit too vague to be used as an implementation guide, and it was lunchtime anyway.&lt;/p&gt;
&lt;h3&gt;
  
  
  Copy-On-Write intermezzo
&lt;/h3&gt;

&lt;p&gt;While walking to a lunch place, I was thinking about it more and got an idea. Why not use the &lt;a href="https://en.wikipedia.org/wiki/Copy-on-write" rel="noopener noreferrer"&gt;Copy-On-Write&lt;/a&gt; technique to implement a &lt;a href="https://en.wikipedia.org/wiki/Persistent_data_structure" rel="noopener noreferrer"&gt;persistent map&lt;/a&gt;? &lt;/p&gt;

&lt;p&gt;That way, I could take a regular single-threaded map, and mutators would clone the current map, do their thing, and then atomically set this newly created map as the map for readers. Readers would then use whatever map was published as the latest. Published maps are immutable, thus always safe for concurrent readers, even from multiple threads, &lt;a href="http://concurrencyfreaks.blogspot.com/2013/05/lock-free-and-wait-free-definition-and.html" rel="noopener noreferrer"&gt;lock-free&lt;/a&gt;. In fact, even wait-free. Yay! &lt;/p&gt;
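
&lt;p&gt;A minimal sketch of the idea, assuming a single writer and a plain on-heap map in place of the native-memory internals:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Copy-On-Write sketch: the mutator clones the current map and publishes
// the copy atomically; readers only ever see a fully built, immutable
// snapshot.
class CowMap {
    private final AtomicReference<Map<String, Long>> current =
            new AtomicReference<>(new HashMap<>());

    // Single-writer mutation: allocates a fresh copy on every call.
    void set(String username, long keyPtr) {
        Map<String, Long> copy = new HashMap<>(current.get());
        copy.put(username, keyPtr);
        current.set(copy); // atomic publish
    }

    // Wait-free read: no locks, no retries.
    long get(String username) {
        return current.get().getOrDefault(username, 0L);
    }
}
```

&lt;p&gt;Note the cost hiding in &lt;code&gt;set()&lt;/code&gt;: every mutation allocates a whole new map.&lt;/p&gt;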

&lt;p&gt;Additionally, we would have to introduce a mechanism for safely deallocating a map's internal buffers once a stale map (one that is no longer the latest published map) has no readers. Otherwise, we would be leaking memory. That’s a complication, but it feels like something easily fixable with enough dedication and atomic reference counters.&lt;/p&gt;

&lt;p&gt;So this all sounds good, but as we have come to expect... there is still a catch. We need to allocate a block of memory for the map contents on every single mutation. We said that mutations are rare, so perhaps that’s not a big deal? Maybe it’s not, but one of the QuestDB design principles is to be conservative with memory allocations, as they cost CPU cycles and memory bandwidth, cause CPU cache thrashing, and in general tend to introduce unpredictable behaviour.&lt;/p&gt;
&lt;h3&gt;
  
  
  Back to the drawing board: Map recycling
&lt;/h3&gt;

&lt;p&gt;So I couldn’t implement a straightforward Copy-On-Write map, but I felt I was on the right path towards the goal: lock-free readers. At some point I realized that, instead of allocating a new map whenever there is a change, I could keep reusing just two maps: one would be available for readers and the other for writers.&lt;/p&gt;

&lt;p&gt;Once a map is published for readers, it’s guaranteed to be immutable as long as there is at least one reader still accessing it.&lt;/p&gt;

&lt;p&gt;It would look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;getMapForReaders&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea looks neat, but it’s clear the code as outlined above has a number of issues and open questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The code always mutates just a single map, but we clearly need to keep both maps in sync. We cannot lose updates.&lt;/li&gt;
&lt;li&gt;Multiple mutating threads could step on each other’s toes.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;swapMaps()&lt;/code&gt; unconditionally swaps reader and writer maps, the mutating thread performing two consecutive mutations could write into a map which still has some readers. This violates our invariant: we must not have readers and writers concurrently accessing the same internal map.&lt;/li&gt;
&lt;li&gt;How to implement &lt;code&gt;getMapForWriters()&lt;/code&gt; and &lt;code&gt;getMapForReaders()&lt;/code&gt;? 🙂&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Single-writer FTW!
&lt;/h3&gt;

&lt;p&gt;Let’s start with problem #2 - multiple mutating threads. &lt;/p&gt;

&lt;p&gt;We said that mutations are rare and never on the hot path. Hence, we can afford to be brutal and use a simple mutex to make sure there is always at most a single mutator. The single-writer principle is known to simplify the design of concurrent algorithms anyway. &lt;/p&gt;

&lt;p&gt;Thus the map now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;writeMutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;getMapForReaders&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was easy. Maybe violent, but easy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Racing threads
&lt;/h3&gt;

&lt;p&gt;Let’s explore something more complicated - problem #3 - multiple consecutive write operations. What do I mean by this? &lt;/p&gt;

&lt;p&gt;Consider this scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We have 2 instances of &lt;code&gt;InternalMap&lt;/code&gt;, let’s call them &lt;code&gt;m0&lt;/code&gt; and &lt;code&gt;m1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The field &lt;code&gt;readerMap&lt;/code&gt; references the map &lt;code&gt;m0&lt;/code&gt; and &lt;code&gt;writerMap&lt;/code&gt; references &lt;code&gt;m1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Reader thread calls &lt;code&gt;map.get()&lt;/code&gt;. Thus &lt;code&gt;getMapForReaders()&lt;/code&gt; returns &lt;code&gt;m0&lt;/code&gt;. At this point the reader thread is paused by the OS.&lt;/li&gt;
&lt;li&gt;Writer thread calls &lt;code&gt;map.set()&lt;/code&gt;. Thus &lt;code&gt;getMapForWriters()&lt;/code&gt; returns &lt;code&gt;m1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writer modifies &lt;code&gt;m1&lt;/code&gt; and &lt;strong&gt;swaps the maps&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The field &lt;code&gt;readerMap&lt;/code&gt; now references the map &lt;code&gt;m1&lt;/code&gt; and &lt;code&gt;writerMap&lt;/code&gt; references &lt;code&gt;m0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Another writer calls &lt;code&gt;map.set()&lt;/code&gt;. Thus &lt;code&gt;getMapForWriters()&lt;/code&gt; returns &lt;code&gt;m0&lt;/code&gt; and the writer starts mutating it. The write operation takes a while.&lt;/li&gt;
&lt;li&gt;The OS resumes the reader thread from #3, and it starts reading &lt;code&gt;m0&lt;/code&gt; (because that’s the map the reader got before its thread was paused!)&lt;/li&gt;
&lt;li&gt;At this point we have a reader thread concurrently accessing the same internal map instance as a writer thread -&amp;gt; &lt;strong&gt;Boom 💥💥💥!&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kofpk8mp8cwngbbnf8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kofpk8mp8cwngbbnf8a.png" alt="The racy scenario"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the scenario looks too long and boring and you skipped it, here is a short summary: a reader obtains &lt;code&gt;mapForReaders&lt;/code&gt; and in the next moment this map becomes &lt;code&gt;writerMap&lt;/code&gt;. So what was once a &lt;code&gt;readerMap&lt;/code&gt; is now a &lt;code&gt;writerMap&lt;/code&gt; and thus the next write operation can mutate it at will. Except the stale reader still thinks the same map is safe for reading. &lt;strong&gt;That’s a bad concurrency bug!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How can we prevent the racy scenario outlined above? We already use a mutex on the write path; that’s almost as nasty as it gets. Almost?! Can we be even nastier? Sure we can! &lt;/p&gt;

&lt;p&gt;Each internal map could have a reader counter and &lt;code&gt;getMapForWriters()&lt;/code&gt; won’t return until the reader counter on the current &lt;code&gt;mapForWriters&lt;/code&gt; reaches 0. In other words: the writer won’t mutate &lt;code&gt;writerMap&lt;/code&gt; until all readers indicate they are no longer using this map. &lt;/p&gt;

&lt;p&gt;How about newly arriving readers? New readers don’t touch &lt;code&gt;writerMap&lt;/code&gt; at all, they always load the current &lt;code&gt;readerMap&lt;/code&gt; so that’s not a problem. &lt;/p&gt;

&lt;p&gt;Enough talking! Let’s see some code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;writeMutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="nc"&gt;Reader&lt;/span&gt; &lt;span class="nf"&gt;concurrentReader&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(;;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerArrived&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerGone&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="nf"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasReaders&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;Reader&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;readerGone&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="nc"&gt;Reader&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;AtomicInteger&lt;/span&gt; &lt;span class="n"&gt;readerCounter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;readerGone&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;readerCounter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;decrement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;readerArrived&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;readerCounter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;increment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;hasReaders&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;readerCounter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// the rest of a single threaded map impl&lt;/span&gt;

  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above looks way more complex than the previous buggy version. &lt;/p&gt;

&lt;p&gt;Let’s go briefly through the changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The most visible change: &lt;strong&gt;the &lt;code&gt;get()&lt;/code&gt; method is gone!&lt;/strong&gt; Instead, there is a new method &lt;code&gt;concurrentReader()&lt;/code&gt; which returns an interface &lt;code&gt;Reader&lt;/code&gt; with two methods: &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;readerGone()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;There is a skeleton of &lt;code&gt;InternalMap&lt;/code&gt;. It does not show any map-related logic, because it could be any single-threaded implementation of a map-like structure. It only demonstrates that each internal map has its own reader counter.&lt;/li&gt;
&lt;li&gt;For the first time we see an implementation of &lt;code&gt;getMapForWriters()&lt;/code&gt;. It does nothing but wait for all the stale readers of the &lt;code&gt;writerMap&lt;/code&gt; to disappear. The &lt;code&gt;backoff()&lt;/code&gt; method may have various implementations and use primitives such as &lt;code&gt;Thread#yield()&lt;/code&gt; or &lt;code&gt;LockSupport#parkNanos()&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
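The article never shows a concrete `backoff()`; as a minimal illustration of my own (the `Backoff` class and its thresholds are made up), it might escalate from spinning to yielding to parking:

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical backoff() helper: spin first for minimal latency,
// then yield the time slice, then park for short intervals.
final class Backoff {
    private int attempt;

    void backoff() {
        attempt++;
        if (attempt < 10) {
            Thread.onSpinWait();           // busy-spin, cheapest when the wait is short
        } else if (attempt < 20) {
            Thread.yield();                // let other runnable threads make progress
        } else {
            LockSupport.parkNanos(1_000L); // park for roughly a microsecond
        }
    }

    void reset() {
        attempt = 0;
    }

    int attempts() {
        return attempt;
    }
}
```

The thresholds are arbitrary; the point is that a writer waiting on `hasReaders()` gets low latency when the wait is short and burns little CPU when it is long.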

&lt;h3&gt;
  
  
  Counting readers
&lt;/h3&gt;

&lt;p&gt;Let’s have a closer look at each change. &lt;/p&gt;

&lt;p&gt;Why did we introduce the &lt;code&gt;Reader&lt;/code&gt; interface? Isn’t it just an unnecessary complication, and an example of the over-engineering so prevalent in Java culture? &lt;/p&gt;

&lt;p&gt;Well, maybe, but it simplifies the mechanism for readers to notify the map that they won’t access the returned pointer anymore. &lt;/p&gt;

&lt;p&gt;How? Each internal map has its own reader counter. When a reader no longer needs a pointer returned by a prior &lt;code&gt;get()&lt;/code&gt;, it must invoke &lt;code&gt;readerGone()&lt;/code&gt; on the correct internal map. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Reader&lt;/code&gt; interface does exactly that - it knows which instance of &lt;code&gt;InternalMap&lt;/code&gt; is in use. When a thread calls &lt;code&gt;reader.readerGone()&lt;/code&gt;, it decrements the reader counter on that map. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CharSequence&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;challengePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="n"&gt;challengeLen&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;signaturePtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;signatureLen&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Reader&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;concurrentReader&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AuthCrypto&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;verifySignature&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;[...]);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerGone&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hopefully made it clearer why we need the &lt;code&gt;Reader&lt;/code&gt; interface. &lt;/p&gt;

&lt;p&gt;A little aside: Do you still remember the design idea with a read-write-lock? I decided not to use it, because it could block readers. But the lock usage pattern was an inspiration for this notification mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoiding check-then-act bugs
&lt;/h3&gt;

&lt;p&gt;Let’s focus on the implementation of &lt;code&gt;concurrentReader()&lt;/code&gt; method. &lt;/p&gt;

&lt;p&gt;It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Reader&lt;/span&gt; &lt;span class="nf"&gt;concurrentReader&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(;;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerArrived&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerGone&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It loads the current &lt;code&gt;readerMap&lt;/code&gt;, increments its reader counter, and returns it to the caller if - and only if - the &lt;code&gt;readerMap&lt;/code&gt; field still points to the same &lt;code&gt;InternalMap&lt;/code&gt; instance. Otherwise, it decrements the reader counter to undo the increment, and retries everything from the start. &lt;/p&gt;

&lt;p&gt;Why this complexity? Why do we need the retry mechanism at all? It’s to protect us against a similar problem with stale readers that we already discussed.&lt;/p&gt;

&lt;p&gt;Consider this simpler implementation of &lt;code&gt;concurrentReader()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Reader&lt;/span&gt; &lt;span class="nf"&gt;concurrentReader&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// buggy!&lt;/span&gt;
  &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerArrived&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// increment the reader counter&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// getMapForWriters() shown for reference only&lt;/span&gt;
&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="nf"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasReaders&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Broken down, we see that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There is a writer thread calling &lt;code&gt;map.set()&lt;/code&gt; and a reader thread calling &lt;code&gt;map.concurrentReader()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The reader thread loads the current &lt;code&gt;readerMap&lt;/code&gt;, but the OS pauses it before it increments the reader counter.&lt;/li&gt;
&lt;li&gt;The writer thread loads the current &lt;code&gt;writerMap&lt;/code&gt;, does a mutation, and swaps the maps. This means the old &lt;code&gt;readerMap&lt;/code&gt; is now the new &lt;code&gt;writerMap&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;At this point the reader has an instance of &lt;code&gt;InternalMap&lt;/code&gt; which is also set in the &lt;code&gt;writerMap&lt;/code&gt; field.&lt;/li&gt;
&lt;li&gt;There is another writer operation. &lt;code&gt;getMapForWriters()&lt;/code&gt; returns current &lt;code&gt;writerMap&lt;/code&gt; immediately, because the reader counter is still zero. The writer thread starts mutating the map.&lt;/li&gt;
&lt;li&gt;The OS resumes the reader thread. The thread has a reference to the same internal map which is currently being mutated by the thread from the previous point.&lt;/li&gt;
&lt;li&gt;The reader thread increments the internal map reader counter, but that’s fruitless as the writer thread is already mutating the map.&lt;/li&gt;
&lt;li&gt;The reader thread &lt;code&gt;concurrentReader()&lt;/code&gt; returns a map which is being concurrently mutated by a writer thread -&amp;gt; &lt;strong&gt;Boom 💥💥💥!&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The extra check in &lt;code&gt;concurrentReader()&lt;/code&gt; is meant to prevent the above scenario. It guarantees it incremented the reader counter on the map which is still the current &lt;code&gt;readerMap&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Reader&lt;/span&gt; &lt;span class="nf"&gt;concurrentReader&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(;;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerArrived&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// increment the reader counter&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readerGone&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is still possible that a reader thread increments the reader counter, returns the &lt;code&gt;Reader&lt;/code&gt; to the caller, and in the next microsecond a writer thread swaps the maps, so the map instance returned to the caller is now set as the &lt;code&gt;writerMap&lt;/code&gt;. This can happen, but it does no harm: the writer won’t get access to the &lt;code&gt;writerMap&lt;/code&gt; before its reader counter has reached zero.&lt;/p&gt;
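&lt;p&gt;For illustration, the reader-side bookkeeping could be as simple as an atomic counter per map. The class below is only a sketch: it is a minimal stand-in for &lt;code&gt;InternalMap&lt;/code&gt; showing just the counter logic; the method names &lt;code&gt;readerArrived()&lt;/code&gt; and &lt;code&gt;readerGone()&lt;/code&gt; come from the article, everything else is an assumption.&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stand-in for InternalMap: only the reader-counter logic is shown.
class InternalMapCounter {
    private final AtomicInteger readerCounter = new AtomicInteger();

    // A reader calls this before it starts using the map.
    void readerArrived() {
        readerCounter.incrementAndGet();
    }

    // A reader calls this once it is done with the map.
    void readerGone() {
        readerCounter.decrementAndGet();
    }

    // A writer may only mutate the map once this returns false.
    boolean hasReaders() {
        return readerCounter.get() != 0;
    }
}
```

&lt;p&gt;A writer would spin on &lt;code&gt;hasReaders()&lt;/code&gt; before mutating, and readers would pair &lt;code&gt;readerArrived()&lt;/code&gt; with &lt;code&gt;readerGone()&lt;/code&gt; in a try/finally block.&lt;/p&gt;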
&lt;h3&gt;
  
  
  What is left?
&lt;/h3&gt;

&lt;p&gt;At this point we have solved the hardest part of the concurrent algorithm, but there are still some unresolved issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The map is still losing updates! We have two internal maps, but each update mutates just a single map.&lt;/li&gt;
&lt;li&gt;Some smaller bits:

&lt;ol&gt;
&lt;li&gt;There is no &lt;a href="https://en.wikipedia.org/wiki/Happened-before" rel="noopener noreferrer"&gt;happens-before relationship&lt;/a&gt; between writers swapping maps and readers loading them.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;close()&lt;/code&gt; method is not implemented so the map might leak native memory, etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Dealing with lost updates
&lt;/h3&gt;

&lt;p&gt;We can fix the first problem. After swapping the maps, we could wait until the current &lt;code&gt;writerMap&lt;/code&gt; has no readers and then update it. &lt;/p&gt;

&lt;p&gt;So mutation operations would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a safe implementation as the &lt;code&gt;getMapForWriters()&lt;/code&gt; guarantees that the returned map has no readers and no new reader will arrive until the next swap.&lt;/p&gt;

&lt;p&gt;On the other hand, it's inefficient: when we switch maps after writing, the new &lt;code&gt;writerMap&lt;/code&gt; may have stale readers, causing delays until they're cleared.&lt;/p&gt;

&lt;p&gt;Is there a better option? It turns out that there is! &lt;/p&gt;

&lt;p&gt;We could mutate the first map, swap the maps, and remember the operation, including all its parameters. During the next mutation we would replay the remembered operation on the &lt;code&gt;mapForWriters&lt;/code&gt; and, if mutations are sufficiently rare, by the time we replay the operation the &lt;code&gt;writerMap&lt;/code&gt; no longer has any readers. &lt;/p&gt;

&lt;p&gt;Let’s see the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;replayLastOperationOn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;rememberSetOperation&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyPtr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;map.remove()&lt;/code&gt; looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;synchronized&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writeMutex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getMapForWriters&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;replayLastOperationOn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;remove&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;swapMaps&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;rememberRemoveOperation&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rememberSetOperation()&lt;/code&gt; must copy the memory under the pointer into its own buffer, but we only have to remember a single operation. Since our blobs are fixed-size, we can keep reusing the same replay buffer. Zero allocation.&lt;/p&gt;
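&lt;p&gt;A sketch of what the remembered operation could look like, using a plain byte array as a stand-in for the native memory under &lt;code&gt;keyPtr&lt;/code&gt;. The class name &lt;code&gt;WriterOperation&lt;/code&gt; comes from the article; the field layout and the &lt;code&gt;BLOB_SIZE&lt;/code&gt; constant are assumptions for illustration.&lt;/p&gt;

```java
// Single-operation replay log. The article copies native memory under the
// pointer; here a reusable fixed-size byte array stands in for the blob.
class WriterOperation {
    static final int BLOB_SIZE = 32;             // fixed-size blobs (assumption)
    static final int NONE = 0, SET = 1, REMOVE = 2;

    int type = NONE;
    String username;
    final byte[] blobCopy = new byte[BLOB_SIZE]; // allocated once, reused forever

    void rememberSet(String username, byte[] blob) {
        this.type = SET;
        this.username = username;
        // copy the blob so the caller's buffer can be reused or freed
        System.arraycopy(blob, 0, blobCopy, 0, BLOB_SIZE);
    }

    void rememberRemove(String username) {
        this.type = REMOVE;
        this.username = username;
    }
}
```

&lt;p&gt;Because the buffer is allocated once and reused, replaying the last operation stays allocation-free on the hot path.&lt;/p&gt;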

&lt;h3&gt;
  
  
  Playing by the Java Memory Model rules
&lt;/h3&gt;

&lt;p&gt;Now let’s make the last, important change. &lt;/p&gt;

&lt;p&gt;This is what the &lt;code&gt;ConcurrentMap&lt;/code&gt; looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;writeMutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;WriterOperation&lt;/span&gt; &lt;span class="n"&gt;lastWriterOperation&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="o"&gt;[...]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All mutable state is encapsulated in these four objects. The fields &lt;code&gt;writerMap&lt;/code&gt; and &lt;code&gt;lastWriterOperation&lt;/code&gt; are only accessed by a writer thread while holding the mutex. But the &lt;code&gt;readerMap&lt;/code&gt; field is set by a writer thread and then loaded by readers. &lt;/p&gt;

&lt;p&gt;Readers are lock-free: they do not acquire any mutex before accessing the reader map. That is a &lt;a href="https://en.wikipedia.org/wiki/Race_condition#Data_race" rel="noopener noreferrer"&gt;data race&lt;/a&gt;, and it could cause visibility issues. &lt;/p&gt;

&lt;p&gt;The fix is easy: just mark the &lt;code&gt;readerMap&lt;/code&gt; as &lt;code&gt;volatile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;volatile&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;readerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt; &lt;span class="n"&gt;writerMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InternalMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;writeMutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;WriterOperation&lt;/span&gt; &lt;span class="n"&gt;lastWriterOperation&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="o"&gt;[...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The writer path now looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Acquire a mutex&lt;/li&gt;
&lt;li&gt;Load the current &lt;code&gt;writerMap&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wait until all stale readers are gone&lt;/li&gt;
&lt;li&gt;Replay the last operation&lt;/li&gt;
&lt;li&gt;Do a new mutation&lt;/li&gt;
&lt;li&gt;Swap &lt;code&gt;readerMap&lt;/code&gt; and &lt;code&gt;writerMap&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Remember the operation so it can be replayed on the untouched map during the next mutation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;readerMap&lt;/code&gt; is now marked as volatile, giving us &lt;a href="https://jepsen.io/consistency/models/sequential" rel="noopener noreferrer"&gt;sequential consistency&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Colloquially speaking, the readers will see the most recent map swap done by a writer thread. The readers are also guaranteed to see all changes which were performed before the writer thread set the map as the &lt;code&gt;readerMap&lt;/code&gt;. And that’s it!&lt;/p&gt;
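&lt;p&gt;Putting the writer path together, &lt;code&gt;getMapForWriters()&lt;/code&gt; and &lt;code&gt;swapMaps()&lt;/code&gt; could be sketched as below. The field names follow the article; the spin-wait loop and the &lt;code&gt;MapStub&lt;/code&gt; helper are assumptions made to keep the sketch self-contained. Both methods must be called while holding &lt;code&gt;writeMutex&lt;/code&gt;.&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the swap and writer-acquire path. Field names follow the
// article; the spin-wait details are an assumption about the implementation.
class ConcurrentMapSketch {
    private volatile MapStub readerMap = new MapStub();
    private MapStub writerMap = new MapStub();

    // Must be called while holding writeMutex.
    MapStub getMapForWriters() {
        MapStub map = writerMap;
        while (map.hasReaders()) {
            Thread.onSpinWait(); // wait for stale readers to leave
        }
        return map;
    }

    // Must be called while holding writeMutex. The volatile store to
    // readerMap publishes all prior mutations to lock-free readers.
    void swapMaps() {
        MapStub tmp = writerMap;
        writerMap = readerMap;
        readerMap = tmp;
    }

    MapStub currentReaderMap() {
        return readerMap;
    }
}

// Stand-in for InternalMap: only the reader counter is modelled here.
class MapStub {
    private final AtomicInteger readers = new AtomicInteger();
    void readerArrived() { readers.incrementAndGet(); }
    void readerGone()    { readers.decrementAndGet(); }
    boolean hasReaders() { return readers.get() != 0; }
}
```

&lt;p&gt;The volatile store in &lt;code&gt;swapMaps()&lt;/code&gt; is what establishes the happens-before edge: a reader that loads the new &lt;code&gt;readerMap&lt;/code&gt; is guaranteed to see all mutations performed before the swap.&lt;/p&gt;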

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;We walked through the process of designing a concurrent data structure which is lock-free on the read path. We can generalize some of the design principles we applied into the following rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single Writer Rule&lt;/strong&gt;: Use a mutex for writes, ensuring only one writer at any time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual Maps&lt;/strong&gt;: Maintain two maps – one for readers and another for the writer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pointer Swap Mechanism&lt;/strong&gt;: When a writer updates, it operates on the &lt;code&gt;mapForWriters&lt;/code&gt; and then switches the roles of the two maps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reader Counter&lt;/strong&gt;: Each map has an atomic &lt;code&gt;readerCounter&lt;/code&gt;. As a reader begins, the counter increments and decrements upon completion. This ensures that no one accesses the &lt;code&gt;mapForWriters&lt;/code&gt; until all active readers from the previous swap have completed their reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change Tracking&lt;/strong&gt;: Writers, when updating &lt;code&gt;mapForWriters&lt;/code&gt;, log their modifications. This is crucial since we need to replicate these changes to &lt;code&gt;mapForReaders&lt;/code&gt;, which might still be in use by some readers. Instead of waiting for all readers to shift to the new map, we log the change and apply it in the subsequent update. Given that we switch maps post-update, by the time of the next update, &lt;code&gt;mapForWriters&lt;/code&gt; is likely free of old readers, allowing for immediate change application.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What’s next?
&lt;/h3&gt;

&lt;p&gt;We have a working implementation of a concurrent map, but it’s not yet ready for production. There are still some issues to be solved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;close()&lt;/code&gt; method is not implemented so the map might leak native memory. This is trivial to fix and I leave it as an exercise for the reader.&lt;/li&gt;
&lt;li&gt;There are no tests! There are various ways to test concurrent data structures. You could use a stress test, where you spawn a lot of threads and let them mutate the map in a random way and then check the consistency of the map and some invariants. You could learn &lt;a href="https://lamport.azurewebsites.net/tla/tla.html" rel="noopener noreferrer"&gt;TLA+&lt;/a&gt; and write a formal model of the map and then verify it.&lt;/li&gt;
&lt;li&gt;Performance optimizations. The current implementation uses the same write path on both internal maps, which is probably not optimal. For example, there is no reason to calculate the hash code or locate the bucket twice: the first &lt;code&gt;set()&lt;/code&gt; could remember which bucket was used and the second &lt;code&gt;set()&lt;/code&gt; could reuse it. We could also add batching to the writer path: the current implementation swaps the maps after every mutation, which is inefficient when writers are mutating the map in a tight loop.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Acknowledgements
&lt;/h3&gt;

&lt;p&gt;After I finished the implementation, I got really excited and I wanted to share it with the world. I naively thought that I was the first one to come up with this idea. I was wrong. &lt;/p&gt;

&lt;p&gt;First, I found the &lt;a href="http://concurrencyfreaks.blogspot.com/2013/11/double-instance-locking.html" rel="noopener noreferrer"&gt;Double Instance Locking&lt;/a&gt; pattern at the amazing &lt;a href="http://concurrencyfreaks.blogspot.com/" rel="noopener noreferrer"&gt;Concurrency Freaks&lt;/a&gt; blog. This pattern is very similar to the one I described here. It also uses two internal structures, and readers alternate between them. It uses a read-write lock to protect the map being mutated. Since there is only a single writer, at any given time there is at least one internal map available for reading. This gives readers lock-freedom. &lt;/p&gt;

&lt;p&gt;It's fair to say that the Double Instance Locking pattern is simpler to reason about, and it decomposes the problem better. But I'd still argue my contribution is the trick with the delayed replay of the last operation: if writers are sufficiently rare, they won't be blocked at all.&lt;/p&gt;

&lt;p&gt;The same blog also links to the paper &lt;a href="https://master.dl.sourceforge.net/project/ccfreaks/papers/LeftRight/leftright-extended.pdf?viasf=1" rel="noopener noreferrer"&gt;Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads&lt;/a&gt;, which on the surface also looks similar. It claims not only lock-freedom but also wait-freedom. I have yet to dive deep into it.&lt;/p&gt;

&lt;p&gt;I would like to thank the reviewers of this article: &lt;a href="https://twitter.com/AndreyPechkurov" rel="noopener noreferrer"&gt;Andrei Pechkurov&lt;/a&gt; and &lt;a href="https://twitter.com/mtopolnik" rel="noopener noreferrer"&gt;Marko Topolnik&lt;/a&gt;. Thank you so much! Without your encouragement I would not have found the courage to publish this article! ❤️ Of course, all the mistakes still left are solely my responsibility.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://questdb.io/blog/concurrent-lockfree-datastructure-design-walkthrough/" rel="noopener noreferrer"&gt;QuestDB.io&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/jerrinot"&gt;@jerrinot&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>java</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>Fuzz testing: the best thing to happen to our application tests</title>
      <dc:creator>Kellen</dc:creator>
      <pubDate>Thu, 17 Aug 2023 21:42:35 +0000</pubDate>
      <link>https://dev.to/questdb/fuzz-testing-the-best-thing-to-happen-to-our-application-tests-1332</link>
      <guid>https://dev.to/questdb/fuzz-testing-the-best-thing-to-happen-to-our-application-tests-1332</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;QuestDB is a time-series database that offers fast ingest speeds, ILP and PGWire support and SQL query syntax. Databases are placed under constant and demanding workloads, and building one teaches many hard lessons. We're happy to share them with you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Almost two years ago, we were playing an endless game of whack-a-mole with segfaults, data corruption, and various concurrency bugs. Our users were reporting them and, for each report, we had to reproduce the bug, analyze it and - finally - fix it. Eventually, we decided to take a step back and come up with a deeper solution. This article details our pain and the journey we took to get out of it. Maybe we can help you out of a similar bind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slaying the many headed hydra
&lt;/h2&gt;

&lt;p&gt;One bug solved, five more appear. We caught bugs and our users caught bugs. Each report led to an investigation and - in most cases - a resolution. But sometimes users would apply workarounds and move forward without reporting the issue, and some bugs would go unresolved. This loop led to frustration for both our users and the QuestDB team.&lt;/p&gt;

&lt;p&gt;We introduced the first fuzz test into the QuestDB project in an attempt to make the database more robust, and since then we have added many more. It's hard to quantify the bugs found by fuzzing, but all of the known critical ones are gone, and it's very rare nowadays to see a critical issue reported by the community. &lt;/p&gt;

&lt;p&gt;On top of that, the SQLancer team recently added QuestDB support to their testing tool and helped us find a number of issues in our SQL engine. We believe that almost any complex application would gain a lot from this kind of testing, so if yours doesn't have it, we hope to inspire you to start writing fuzz tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is fuzzing?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"What's the fuzz? Tell me what's happening." - Modified lyrics from a well-known &lt;a href="https://en.wikipedia.org/wiki/Jesus_Christ_Superstar"&gt;rock opera&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before sharing our story, let's start with the basics and define a fuzz test. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Fuzzing"&gt;Wikipedia says&lt;/a&gt; that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;... fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program. The program is then monitored for exceptions such as crashes, failing built-in code assertions, or potential memory leaks. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So for instance, if you write a compiler, you can use fuzzing to generate source code variations, including invalid ones, and test whether your compiler is able to give a meaningful result for the input program. If you have a web service, you can write a fuzz test that would try to send invalid or semi-valid requests to your service, and then analyze whether any of the requests break the security of your service. Or even crash it. &lt;/p&gt;

&lt;p&gt;Furthermore, if you have a command line utility, like &lt;code&gt;curl&lt;/code&gt;, you may use fuzz testing to find nasty things &lt;a href="https://blog.trailofbits.com/2023/02/14/curl-audit-fuzzing-libcurl-command-line-interface/"&gt;like memory corruption bugs&lt;/a&gt;. But not only that. In the database world, fuzzing is usually applied to APIs that accept a query language, like &lt;a href="https://questdb.io/docs/concept/sql-execution-order/"&gt;SQL&lt;/a&gt; or a specialized protocol like the &lt;a href="https://questdb.io/docs/reference/api/ilp/overview/"&gt;InfluxDB Line Protocol&lt;/a&gt; (ILP). Lastly, it would be silly not to mention that fuzzing can be used to find general vulnerabilities in programs.&lt;/p&gt;

&lt;p&gt;So how do we write them?&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting fuzzy
&lt;/h2&gt;

&lt;p&gt;There are a number of different approaches to writing fuzz tests: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Your fuzzer may be generation-based and generate inputs from scratch. Or it may be mutation-based, and have a corpus of seed inputs that it modifies to produce a final input. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The test may be dumb, if it is producing unstructured, random inputs, like random strings instead of proper SQL statements. Or it may be smart, if it's aware of the expected input structure. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You may choose to use a white-, grey-, or black-box testing technique, depending on the awareness of the program-under-test structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you were to fuzz test a database, the test could be as simple as the following test written in Java-inspired pseudo-code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;testSqlEngine&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openConnection&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;randomInt&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;randomString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;executeSql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;assertDatabaseDidNotCrash&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we generate a string filled with random characters, send it over the database connection and, finally, check if the database process is still up and running. This is a (1) generation-based, (2) dumb, (3) black-box test that makes no assumptions about the SQL syntax and only follows the database network protocol. &lt;/p&gt;

&lt;p&gt;Interestingly, it is very close to what the &lt;a href="https://en.wikipedia.org/wiki/Fuzzing#History"&gt;original authors&lt;/a&gt; of the term "fuzz" did in 1988 when they were testing Unix utilities. Of course, such an approach is not very efficient when applied to database software. At QuestDB, we prefer fuzzers to be generation-based and smart, but that is our preference. Other combinations may be useful in other software projects.&lt;/p&gt;
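&lt;p&gt;For completeness, the &lt;code&gt;randomString()&lt;/code&gt; helper from the pseudo-code could be as trivial as the sketch below. The printable-ASCII range is an assumption; seeding the RNG explicitly is a good habit because it makes any failure reproducible.&lt;/p&gt;

```java
import java.util.Random;

// "Dumb" input generation: unstructured random strings, as in the
// pseudo-code above. Pass a seeded Random to reproduce a failing run.
class DumbFuzzInput {
    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i != len; i++) {
            sb.append((char) (32 + rnd.nextInt(95))); // printable ASCII 32..126
        }
        return sb.toString();
    }
}
```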

&lt;p&gt;With the basics covered, let's look at how fuzzing helped us make QuestDB better...&lt;/p&gt;

&lt;h2&gt;
  
  
  Our fuzzing story
&lt;/h2&gt;

&lt;p&gt;As we already mentioned, our first fuzz test was written almost two years ago. It tested the ILP protocol by sending semi-random, potentially invalid messages. The first test immediately revealed a number of critical issues, including a few segfaults.&lt;/p&gt;

&lt;p&gt;To give you an impression of that fuzzer, let's dive into our ILP protocol &lt;a href="https://questdb.io/docs/reference/api/ilp/overview/"&gt;implementation&lt;/a&gt;. Creating a single-row table with ILP is as simple as sending the following to &lt;code&gt;&amp;lt;questdb_host&amp;gt;:9009&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;weather,city&lt;span class="o"&gt;=&lt;/span&gt;Sofia &lt;span class="nv"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;27.5 1692010877000000000&lt;span class="se"&gt;\n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once sent, this message tells QuestDB to create a table with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE 'weather' (
  city SYMBOL,
  temperature DOUBLE,
  timestamp TIMESTAMP
) timestamp (timestamp) PARTITION BY DAY WAL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And with the below row:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;temperature&lt;/th&gt;
&lt;th&gt;timestamp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sofia&lt;/td&gt;
&lt;td&gt;27.5&lt;/td&gt;
&lt;td&gt;2023-08-14T11:01:17.000000Z&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here we have a &lt;a href="https://questdb.io/docs/concept/designated-timestamp/"&gt;partitioned&lt;/a&gt; table named &lt;code&gt;weather&lt;/code&gt;. The table structure, column names and row values are fully defined in our ILP message. Now, let's send another message over TCP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;weather,city&lt;span class="o"&gt;=&lt;/span&gt;Berlin &lt;span class="nv"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;28,humidity&lt;span class="o"&gt;=&lt;/span&gt;0.42 1692011659000000000&lt;span class="se"&gt;\n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result, we got a new &lt;code&gt;humidity&lt;/code&gt; column of type &lt;code&gt;DOUBLE&lt;/code&gt; in the table. &lt;/p&gt;

&lt;p&gt;The table contents are now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;temperature&lt;/th&gt;
&lt;th&gt;humidity&lt;/th&gt;
&lt;th&gt;timestamp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sofia&lt;/td&gt;
&lt;td&gt;27.5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2023-08-14T11:01:17.000000Z&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;2023-08-14T11:14:19.000000Z&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Shall we send another row? Let's do that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;weather,country&lt;span class="o"&gt;=&lt;/span&gt;France,CiTy&lt;span class="o"&gt;=&lt;/span&gt;Paris &lt;span class="nv"&gt;HuMiDiTy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.58,TeMpErAtUrE&lt;span class="o"&gt;=&lt;/span&gt;26 1692012370000000000&lt;span class="se"&gt;\n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we've changed the order for the &lt;code&gt;humidity&lt;/code&gt; and &lt;code&gt;temperature&lt;/code&gt; columns, added a new column named &lt;code&gt;country&lt;/code&gt; and slightly ImProVeD letter capitalization in the older column names. Again, the database should add the new column and should also ignore the capitalization difference in the column names. With all that said, the message should yield the following rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;temperature&lt;/th&gt;
&lt;th&gt;humidity&lt;/th&gt;
&lt;th&gt;timestamp&lt;/th&gt;
&lt;th&gt;country&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sofia&lt;/td&gt;
&lt;td&gt;27.5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2023-08-14T11:01:17.000000Z&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;2023-08-14T11:14:19.000000Z&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paris&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;2023-08-14T11:26:10.000000Z&lt;/td&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
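&lt;p&gt;As an aside, the structure of such ILP lines is easy to see with a toy parser. The sketch below is illustration only - it is not QuestDB's actual parser and it ignores ILP's escaping rules - but it shows the measurement, tag, field, and timestamp parts, with names lower-cased to mimic the server's case-insensitive column matching:&lt;/p&gt;

```python
# Minimal sketch of parsing a single ILP line (illustration only; this is
# not QuestDB's parser, and it skips ILP's escaping rules for simplicity).
def parse_ilp(line):
    head, field_part, ts = line.strip().split(" ")
    table, *tags = head.split(",")
    # The server matches column names case-insensitively,
    # which we mimic by lower-casing the keys.
    tag_map = {k.lower(): v for k, v in (t.split("=", 1) for t in tags)}
    field_map = {k.lower(): float(v)
                 for k, v in (f.split("=", 1) for f in field_part.split(","))}
    return table, tag_map, field_map, int(ts)

table, tags, fields, ts = parse_ilp(
    "weather,city=Berlin temperature=28,humidity=0.42 1692011659000000000"
)
# table  -> 'weather'
# tags   -> {'city': 'Berlin'}
# fields -> {'temperature': 28.0, 'humidity': 0.42}
# ts     -> 1692011659000000000
```

Feeding it the mixed-capitalization Paris message from above yields the same lower-cased `humidity` and `temperature` keys, which is why the table gains no duplicate columns.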

&lt;p&gt;Adding new columns, reordering columns, and changing column names are not the only mutations we can apply over the ILP protocol.&lt;/p&gt;

&lt;p&gt;To give you an idea, we can also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;include non-ASCII characters in table and column names, as well as column values&lt;/li&gt;
&lt;li&gt;skip existing columns and the timestamp value in the messages&lt;/li&gt;
&lt;li&gt;duplicate certain columns, so that a column name and its value are repeated multiple times in the same message&lt;/li&gt;
&lt;li&gt;apply further mutations that produce both valid and invalid ILP messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, we can decide whether or not to apply each of these mutations based on an RNG (random number generator) and run the scenario over multiple connections. Once all messages are sent and received, we compare the table's contents with the expected rows.&lt;/p&gt;
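&lt;p&gt;To sketch the idea (the helper names and probabilities below are hypothetical - QuestDB's real fuzzers are written in Java and are far more thorough), a mutation step driven by a seeded RNG might look like this:&lt;/p&gt;

```python
import random

# Toy version of an ILP fuzzer's mutation step. Helper names and
# probabilities are made up for illustration.
def mutate_fields(fields, rng):
    fields = list(fields)
    if rng.random() > 0.5:                     # reorder columns
        rng.shuffle(fields)
    if fields and rng.random() > 0.7:          # skip an existing column
        fields.pop(rng.randrange(len(fields)))
    if fields and rng.random() > 0.8:          # duplicate a column
        fields.append(rng.choice(fields))
    if fields and rng.random() > 0.7:          # randomize capitalization
        name, value = fields[0].split("=", 1)
        name = "".join(c.upper() if rng.random() > 0.5 else c.lower()
                       for c in name)
        fields[0] = name + "=" + value
    return fields

def build_line(table, tags, fields, ts, rng):
    return f"{table},{tags} {','.join(mutate_fields(fields, rng))} {ts}"

rng = random.Random(42)
for _ in range(5):
    build_line("weather", "city=Berlin",
               ["temperature=28", "humidity=0.42"],
               1692011659000000000, rng)
```

Running it with different seeds yields both valid and invalid messages, which is exactly what we want from a fuzzer; recording the seed keeps every failure reproducible.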

&lt;p&gt;The above description is basically what our first fuzz test did. Simple as it was, it opened a veritable "can of worms": it revealed many issues in our ILP code. Since that first fuzzer, we have written 62 additional tests, most of which stress our storage, protocol implementation, and concurrent code. Needless to say, the number keeps growing.&lt;/p&gt;

&lt;p&gt;While 62 fuzzers may not sound like a huge number, each of these tests produces a great number of different test scenarios, all thanks to randomization. To increase our chances of finding bugs, our CI runs the fuzzers periodically. If you'd like to see what these tests look like, &lt;a href="https://github.com/questdb/questdb/blob/e98177f8f74c35184019c008d2c332d242abddb2/core/src/test/java/io/questdb/test/griffin/wal/DedupInsertFuzzTest.java"&gt;take a look at the test&lt;/a&gt; that covers the new deduplication feature.&lt;/p&gt;

&lt;p&gt;Based on our experience, any complex software that deals with messages and events can benefit from fuzzing. Using randomness to produce combinations of messages and events, along with verification logic for the end result, is a very powerful approach.&lt;/p&gt;

&lt;p&gt;As for the QuestDB team, we want to improve our fuzzers by adding tests dedicated to our SQL engine. Luckily, in the interim, the SQLancer team has come to our rescue.&lt;/p&gt;

&lt;h2&gt;SQL fuzzing&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/sqlancer/sqlancer"&gt;SQLancer&lt;/a&gt; (Synthesized Query Lancer) is a tool to automatically test SQL Database Management Systems (DBMS) to find logic bugs in their implementation. &lt;/p&gt;

&lt;p&gt;It operates in two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database generation.&lt;/strong&gt; During this phase, SQLancer creates a populated database and stresses the DBMS to increase the probability of causing an inconsistent state. First, it creates random tables. Then, random SQL statements are chosen to generate, modify, and delete data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing.&lt;/strong&gt; Here, SQLancer detects logic bugs based on the generated database.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both of the above phases support a number of testing approaches, depending on the database in question. For instance, the testing phase supports the so-called Non-optimizing Reference Engine Construction (NoREC) approach. NoREC aims to find optimization bugs: it translates a query that may be optimized by the database into another query for which optimizations are much less likely to be applicable, then runs both queries and compares the results.&lt;/p&gt;
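&lt;p&gt;The NoREC idea is easy to demonstrate with a toy example. The sketch below uses SQLite purely for illustration (it ships with Python; nothing here is QuestDB-specific): the optimizable query counts rows matching a predicate, while the NoREC form sums the predicate over every row, a shape the optimizer is far less likely to rewrite. Any mismatch points to an optimization bug:&lt;/p&gt;

```python
import sqlite3

# Toy NoREC oracle against SQLite (illustration only, not QuestDB-specific).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (distance REAL)")
con.executemany("INSERT INTO trips VALUES (?)",
                [(1.5,), (0.0,), (7.2,), (3.3,)])

pred = "distance > 2.0"
# Optimized form: the engine is free to use indexes, pruning, etc.
optimized = con.execute(f"SELECT COUNT(*) FROM trips WHERE {pred}").fetchone()[0]
# NoREC form: force per-row evaluation of the predicate, then sum the 1s and 0s.
norec = con.execute(f"SELECT SUM({pred}) FROM trips").fetchone()[0]

assert optimized == norec  # a mismatch would indicate an optimization bug
```

SQLancer generates many such predicate pairs at random, so a single logic bug in the optimizer is very likely to trip at least one of them.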

&lt;p&gt;Interestingly, we do a similar thing when testing our SIMD &lt;a href="https://questdb.io/blog/2022/01/12/jit-sql-compiler/"&gt;JIT compiler&lt;/a&gt;. We also have a &lt;a href="https://github.com/questdb/questdb/blob/6335238526d445d026550c8f34a3cf7b5da02090/core/src/test/java/io/questdb/test/griffin/CompiledFilterRegressionTest.java"&gt;non-fuzz test&lt;/a&gt; that runs a query with the JIT compiler enabled and then disabled, and checks that the two result sets are equal. Such an approach is usually called a &lt;a href="https://en.wikipedia.org/wiki/Test_oracle"&gt;test oracle&lt;/a&gt; (or just an oracle).&lt;/p&gt;
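&lt;p&gt;Reduced to its essence, the test oracle pattern looks like this (a sketch with made-up filter functions, not QuestDB's actual JIT test): run the same query through the reference path and the optimized path, and assert that the results agree on many random inputs:&lt;/p&gt;

```python
import random

def filter_reference(rows, threshold):
    # Reference implementation: a straightforward per-row check.
    return [r for r in rows if r > threshold]

def filter_fast(rows, threshold):
    # Stand-in for an optimized path (e.g. a SIMD/JIT-compiled filter).
    # In a real differential test this is the independently written fast path.
    out = []
    for r in rows:
        if r > threshold:
            out.append(r)
    return out

def oracle_check(rows, threshold):
    # The oracle: both implementations must agree on every input.
    assert filter_reference(rows, threshold) == filter_fast(rows, threshold)

rng = random.Random(1)
for _ in range(100):
    rows = [rng.uniform(-10, 10) for _ in range(rng.randrange(20))]
    oracle_check(rows, rng.uniform(-10, 10))
```

The strength of the pattern is that no hand-written expected output is needed: one implementation serves as the specification for the other.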

&lt;p&gt;Initial QuestDB support was contributed to SQLancer by &lt;a href="https://github.com/SuriZhang"&gt;Suri Zhang&lt;/a&gt;, and since then we have received &lt;a href="https://github.com/questdb/questdb/issues/created_by/YuanchengJiang"&gt;a number of GitHub issues&lt;/a&gt; from the SQLancer team. We're very thankful for all of the reported issues and continue to fix them in new releases. Needless to say, we'll keep using SQLancer to find bugs in our SQL engine. We're also &lt;a href="https://github.com/sqlancer/sqlancer/pull/798"&gt;working on a patch&lt;/a&gt; to improve QuestDB support.&lt;/p&gt;

&lt;h2&gt;Do I need fuzzing?&lt;/h2&gt;

&lt;p&gt;We believe the answer is "yes, you do". Fuzzers are valuable for any complex software, and not only databases, compilers, and CLI tools - you can successfully add them to applications of almost any kind. That doesn't mean you should go all-in on this kind of testing and nothing else, but writing a fuzzer on top of your "traditional" tests helped us build a more robust database, and it can certainly help you too.&lt;/p&gt;

&lt;p&gt;As usual, we encourage you to try the latest QuestDB release and share your feedback with our &lt;a href="https://slack.questdb.io/"&gt;Slack Community&lt;/a&gt;. You can also play with our &lt;a href="https://demo.questdb.io/"&gt;live demo&lt;/a&gt; to see how fast it executes your queries. And, of course, contributions to our &lt;a href="https://github.com/questdb/questdb"&gt;open-source database on GitHub&lt;/a&gt; are more than welcome.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Article written by Andrei Pechkurov. Follow him on &lt;a href="https://twitter.com/AndreyPechkurov"&gt;Twitter&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>database</category>
      <category>testing</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
