<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Greptime</title>
    <description>The latest articles on DEV Community by Greptime (@greptime).</description>
    <link>https://dev.to/greptime</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1073351%2Fd7e3ba64-e3a1-4080-893e-3a414a3edb61.jpg</url>
      <title>DEV Community: Greptime</title>
      <link>https://dev.to/greptime</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/greptime"/>
    <language>en</language>
    <item>
      <title>Error Handling for Large Rust Projects - A Deep Dive into GreptimeDB's Practices</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Sun, 12 May 2024 08:50:00 +0000</pubDate>
      <link>https://dev.to/greptime/error-handling-for-large-rust-projects-a-deep-dive-into-greptimedbs-practices-l9i</link>
      <guid>https://dev.to/greptime/error-handling-for-large-rust-projects-a-deep-dive-into-greptimedbs-practices-l9i</guid>
      <description>&lt;p&gt;:::tip TL;DR:&lt;br&gt;
In this article, we discuss Rust error handling practices in &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt; and share possible future work at the end.&lt;/p&gt;

&lt;p&gt;Topics include: &lt;br&gt;
(1) How to build a cheaper yet more accurate error stack to replace the system backtrace; &lt;br&gt;
(2) How to organize errors in large projects; &lt;br&gt;
(3) How to print errors in different schemes for logs and end users. &lt;/p&gt;

&lt;p&gt;An error in GreptimeDB might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0: Foo error, at src/common/catalog/src/error.rs:80:10
1: Bar error, at src/common/function/src/error.rs:90:10
2: Root cause, invalid table name, at src/common/catalog/src/error.rs:100:10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;:::&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding &lt;code&gt;Error&lt;/code&gt; in Rust
&lt;/h3&gt;

&lt;p&gt;Rust's error handling is centered around the &lt;a href="https://doc.rust-lang.org/std/result/enum.Result.html#variant.Err"&gt;&lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt;&lt;/a&gt; enum, where &lt;code&gt;E&lt;/code&gt; typically (but not necessarily) implements &lt;a href="https://doc.rust-lang.org/std/error/trait.Error.html"&gt;&lt;code&gt;std::error::Error&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Contains the success value&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="cd"&gt;/// Contains the error value&lt;/span&gt;
    &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This blog shares our experience organizing the various &lt;code&gt;Error&lt;/code&gt; types in a complex system like GreptimeDB, from how an error is defined to how it is logged or presented to end users. Such a system is composed of multiple components, each with its own &lt;code&gt;Error&lt;/code&gt; definitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Status Quo of Rust's Error Handling
&lt;/h3&gt;

&lt;p&gt;A few modules in Rust's standard library provide &lt;code&gt;Error&lt;/code&gt; structs that implement &lt;code&gt;std::error::Error&lt;/code&gt;, like &lt;code&gt;std::io::Error&lt;/code&gt; or &lt;code&gt;std::fmt::Error&lt;/code&gt;. But developers usually define custom errors for their projects, either to express application-specific error information or because they need to group multiple errors in an enum.&lt;/p&gt;

&lt;p&gt;Since the &lt;code&gt;std::error::Error&lt;/code&gt; trait is not very complicated, it's easy to implement manually for a single custom error type. However, you usually won't want to do so, because as error variants grow, the flood of boilerplate code becomes very hard to work with.&lt;/p&gt;
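To make the boilerplate concrete, here is a minimal, hand-rolled error type implementing `std::error::Error` with only the standard library (the `InvalidTableName` name is a hypothetical example for illustration, not a GreptimeDB type):

```rust
use std::error::Error;
use std::fmt;

// A hand-rolled leaf error: `Display` carries the message,
// `Error::source` (defaulting to `None`) carries the cause.
#[derive(Debug)]
struct InvalidTableName {
    name: String,
}

impl fmt::Display for InvalidTableName {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid table name: {}", self.name)
    }
}

impl Error for InvalidTableName {}

fn main() {
    let err = InvalidTableName { name: "a/b".to_string() };
    assert_eq!(err.to_string(), "invalid table name: a/b");
    // This is a leaf error, so there is no underlying cause.
    assert!(err.source().is_none());
}
```

Multiply this by dozens of variants and the appeal of a derive macro becomes obvious.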

&lt;p&gt;Nowadays, there are some widely used utility crates that help with customized error types. For example, &lt;a href="https://docs.rs/thiserror/latest/thiserror/"&gt;&lt;code&gt;thiserror&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.rs/anyhow/latest/anyhow/"&gt;&lt;code&gt;anyhow&lt;/code&gt;&lt;/a&gt; are developed by the famous Rust wizard &lt;a class="mentioned-user" href="https://dev.to/dtolnay"&gt;@dtolnay&lt;/a&gt;, with the distinction that &lt;code&gt;thiserror&lt;/code&gt; is mainly for libraries and &lt;code&gt;anyhow&lt;/code&gt; is for binaries. This rule of thumb suits most cases.&lt;/p&gt;

&lt;p&gt;But for projects like GreptimeDB, where we divide the entire workspace into several individual sub-crates, we need to define one error type for each crate while keeping a streamlined combination. Neither &lt;code&gt;thiserror&lt;/code&gt; nor &lt;code&gt;anyhow&lt;/code&gt; can achieve this easily.&lt;/p&gt;

&lt;p&gt;Hence, we chose another crate, &lt;a href="https://docs.rs/snafu/latest/snafu/"&gt;&lt;code&gt;snafu&lt;/code&gt;&lt;/a&gt;, to build our error system. It is like a combination of &lt;code&gt;thiserror&lt;/code&gt; and &lt;code&gt;anyhow&lt;/code&gt;: &lt;code&gt;thiserror&lt;/code&gt; provides a convenient macro to define custom error types with display, source, and context fields, and &lt;code&gt;anyhow&lt;/code&gt; offers a &lt;code&gt;Context&lt;/code&gt; trait that easily transforms one underlying error into another with added context.&lt;/p&gt;
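As a rough, std-only sketch of what such a `Context`-style transformation does (the `AppError` type and `Context` trait below are illustrative names of our own, not `snafu`'s or `anyhow`'s actual API):

```rust
// A wrapper error that records a new message plus the underlying cause.
#[derive(Debug)]
struct AppError {
    msg: String,
    source: Option<Box<dyn std::error::Error>>,
}

// Extension trait: turn `Result<T, E>` into `Result<T, AppError>`,
// attaching a human-readable context message along the way.
trait Context<T> {
    fn context(self, msg: &str) -> Result<T, AppError>;
}

impl<T, E: std::error::Error + 'static> Context<T> for Result<T, E> {
    fn context(self, msg: &str) -> Result<T, AppError> {
        self.map_err(|e| AppError {
            msg: msg.to_string(),
            source: Some(Box::new(e)),
        })
    }
}

fn main() {
    let err = "not-a-number".parse::<i32>().context("Failed to parse port").unwrap_err();
    assert_eq!(err.msg, "Failed to parse port");
    // The original `ParseIntError` is preserved as the cause.
    assert!(err.source.is_some());
}
```

`snafu` generates typed "context selectors" instead of a free-form message, but the shape of the transformation is the same.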

&lt;p&gt;&lt;code&gt;thiserror&lt;/code&gt; mainly implements the &lt;a href="https://doc.rust-lang.org/std/convert/trait.From.html"&gt;&lt;code&gt;std::convert::From&lt;/code&gt;&lt;/a&gt; trait for your error types, so that you can simply use &lt;code&gt;?&lt;/code&gt; to propagate the error you receive. Consequently, this also means you cannot define two error variants from the same source type. Suppose you are performing some I/O operations: you won't be able to tell whether an error was generated on the write path or the read path. This is also an important reason we don't use &lt;code&gt;thiserror&lt;/code&gt;: the context is blurred at the type level.&lt;/p&gt;
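A std-only sketch of the `From`-plus-`?` pattern (similar to what `thiserror`'s `#[from]` generates) shows the limitation; the `Error` enum and `read_config` function are hypothetical examples:

```rust
use std::io;

// Hypothetical error enum: one `Io` variant absorbs every `io::Error`.
#[derive(Debug)]
enum Error {
    Io(io::Error),
    // A second variant like `WriteIo(io::Error)` could NOT get its own
    // `From<io::Error>` impl -- it would conflict with the impl below,
    // so `?` alone cannot distinguish read failures from write failures.
}

impl From<io::Error> for Error {
    fn from(e: io::Error) -> Self {
        Error::Io(e)
    }
}

fn read_config() -> Result<String, Error> {
    // `?` silently converts `io::Error` into `Error` via `From`.
    let bytes = std::fs::read("/definitely/missing/path")?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() {
    let Error::Io(e) = read_config().unwrap_err();
    // All we know is "some I/O failed" -- the call site is lost.
    assert_eq!(e.kind(), io::ErrorKind::NotFound);
}
```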

&lt;h2&gt;
  
  
  Stacking the Error
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Goals
&lt;/h3&gt;

&lt;p&gt;In the real world, knowing only the root cause of an error is inadequate. Suppose we are building a protocol component in GreptimeDB. It reads messages from the network, decodes them, performs some operations, and then sends them. We may encounter errors at several stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;ReadSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;DecodeMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GreptimeError&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;EncodeMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;WriteSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One possible error message we can get is: &lt;code&gt;DecodeMessage(serde_json: invalid character at 1)&lt;/code&gt;. However, in a specific code snippet, there can be more than 10 places that decode messages (and can thus throw this error)! How can we figure out in which step we saw the invalid character?&lt;/p&gt;

&lt;p&gt;So even though the error itself tells us what happened, if we want a clue about where the error occurred and whether we should pay attention to it, we need the error to carry more information. For comparison, here is an example of an error log you might see from GreptimeDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to handle protocol
0: Failed to handle incoming content, query: blabla, at src/protocol/handler.rs:89:22
1: Failed to read the next message at queue 5 of 10, at src/protocol/loop.rs:254:14
2: Failed to decode `01010001001010001` to ProtocolHeader, at src/protocol/codec.rs:90:14
3: serde_json(invalid character at position 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A good error report is not only about how it gets constructed; more importantly, it should tell a human what can be understood from its cause and trace. We call it the Stacked Error.&lt;/strong&gt; The format should be intuitive, and you have likely seen something similar elsewhere, like a backtrace.&lt;/p&gt;

&lt;p&gt;From this log, it's easy to grasp the whole story with full context, from the user-facing behavior to the root cause, plus the exact file, line, and column where each error was propagated. You can tell that this error means &lt;em&gt;"in the query "blabla", the fifth package's header is corrupted"&lt;/em&gt;. It is likely invalid user input, and we may not need to handle it on the server side.&lt;/p&gt;

&lt;p&gt;This example shows the critical information that an error should contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The root cause&lt;/strong&gt; that tells what is happening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The full context stack&lt;/strong&gt; used in debugging or figuring out where the error occurred.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What happened from the user's perspective,&lt;/strong&gt; to decide whether we need to expose the error to users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause is often clear, as in the &lt;code&gt;DecodeMessage&lt;/code&gt; example above, as long as the library or function we use implements its error type correctly. But the root cause alone may not be enough.&lt;/p&gt;

&lt;p&gt;Here is more &lt;a href="https://github.com/delta-incubator/delta-kernel-rs/pull/151"&gt;evidence&lt;/a&gt; from Delta Lake, developed by Databricks:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4vu65v27cmhf6ugt5648.png" alt="Databricks's example"&gt;&lt;/p&gt;

&lt;p&gt;In the following sections, we will focus on the context stack and the way errors are presented, and show how we implement them, so that you can hopefully reproduce the same practices as in GreptimeDB.&lt;/p&gt;

&lt;h3&gt;
  
  
  System Backtrace
&lt;/h3&gt;

&lt;p&gt;So, now you have the root cause (&lt;code&gt;DecodeMessage(serde_json: invalid character at 1)&lt;/code&gt;). But it's not clear at which step this error occurs: when decoding the header, or the body?&lt;/p&gt;

&lt;p&gt;An intuitive thought is to capture a backtrace. &lt;code&gt;.unwrap()&lt;/code&gt; is the first resort: the backtrace shows up when the error occurs (of course, this is bad practice). It gives you the complete call stack along with line numbers.&lt;/p&gt;

&lt;p&gt;Such a call stack contains the full trace, including lots of unrelated frames from the system, the runtime, and the standard library. To find the call in application code, you have to inspect the trace frame by frame and skip all the unrelated ones.&lt;/p&gt;

&lt;p&gt;Nowadays, many libraries also provide the ability to capture a backtrace when an &lt;code&gt;Error&lt;/code&gt; is constructed. But regardless of whether the system backtrace can provide what we truly want, it is very costly in both CPU (&lt;a href="https://github.com/GreptimeTeam/greptimedb/pull/1261"&gt;#1261&lt;/a&gt;) and memory (&lt;a href="https://github.com/GreptimeTeam/greptimedb/pull/1273"&gt;#1273&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Capturing a backtrace significantly slows down your program, as it needs to walk the call stack and translate each pointer. Translating stack pointers also requires including large &lt;code&gt;debuginfo&lt;/code&gt; in the binary: in GreptimeDB, this means increasing the binary size by &amp;gt;700MB (4x compared to 170MB without debuginfo). And the captured system backtrace is noisy, because the system can't distinguish whether the code comes from the standard library, a third-party async runtime, or the application.&lt;/p&gt;

&lt;p&gt;There is another difference between the system backtrace and the proposed Stacked Error. The system backtrace tells us how execution reached the position where the error occurred, and you cannot control it; the Stacked Error shows how the error is propagated.&lt;/p&gt;

&lt;p&gt;Take the following code snippet as an example to examine the difference between the system backtrace and the virtual stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;async fn handle_request(req: Request) -&amp;gt; Result&amp;lt;Output&amp;gt; {
    let msg = decode_msg(&amp;amp;req.msg).context(DecodeMessage)?; // propagate error with new stack and context
    verify_msg(&amp;amp;msg)?; // pass error to the caller directly
    process_msg(msg).await? // pass error to the caller directly
}

async fn decode_msg(msg: &amp;amp;RawMessage) -&amp;gt; Result&amp;lt;Message&amp;gt; {
    serde_json::from_slice(&amp;amp;msg).context(SerdeJson) // propagate error with new stack and context
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The system backtrace will print the whole call stack, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1: &amp;lt;alloc::boxed::Box&amp;lt;F,A&amp;gt; as core::ops::function::Fn&amp;lt;Args&amp;gt;&amp;gt;::call
            at /rustc/3f28fe133475ec5faf3413b556bf3cfb0d51336c/library/alloc/src/boxed.rs:2029:9
    std::panicking::rust_panic_with_hook
            at /rustc/3f28fe133475ec5faf3413b556bf3cfb0d51336c/library/std/src/panicking.rs:783:13
... many lines for std's internal traces

22: tokio::runtime::task::raw::RawTask::poll
            at /home/wayne/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/task/raw.rs:201:18
... many lines for tokio's internal traces

32: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
            at /rustc/3f28fe133475ec5faf3413b556bf3cfb0d51336c/library/std/src/thread/mod.rs:529:17
... many lines for std's internal traces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, it includes a lot of internal frames that you are not interested in.&lt;/p&gt;

&lt;p&gt;For other complex logic like batch processing, where errors may not be propagated immediately but held for a while, the virtual stack also helps keep things understandable. A system backtrace is captured in place when the leaf error is generated, e.g., in a middle step of map-reduce-style logic. With a virtual stack, you can postpone capturing context until at or after the reduce step, where you have more information about the overall task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Virtual User Stack
&lt;/h3&gt;

&lt;p&gt;Now let's introduce the virtual user stack. The word "virtual" contrasts with the system stack: this stack is defined and constructed entirely in user code. Let's look closer at the previous example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0: Failed to handle incoming content, query: blabla, at src/protocol/handler.rs:89:22
1: Failed to read the next message at queue 5 of 10, at src/protocol/loop.rs:254:14
2: Failed to decode `01010001001010001` to ProtocolHeader, at src/protocol/codec.rs:90:14
3: serde_json(invalid character at position 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A stack layer is composed of 3 parts: &lt;code&gt;[STACK_NUM]: [MSG], at [FILE_LOCATION]&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stack num&lt;/strong&gt; is the number of this stack layer. A smaller number means an outer error layer, starting from 0 of course.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message&lt;/strong&gt; is the message related to one layer. This is scraped from the &lt;a href="https://doc.rust-lang.org/std/fmt/trait.Display.html"&gt;&lt;code&gt;std::fmt::Display&lt;/code&gt;&lt;/a&gt; implementation of that error. Developers can attach useful context here, like the query string or loop counter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File location&lt;/strong&gt; is the location where an error is generated (or propagated, for intermediate error layers). Rust provides the &lt;a href="https://doc.rust-lang.org/std/macro.file.html"&gt;&lt;code&gt;file!&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://doc.rust-lang.org/std/macro.line.html"&gt;&lt;code&gt;line!&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://doc.rust-lang.org/std/macro.column.html"&gt;&lt;code&gt;column!&lt;/code&gt;&lt;/a&gt; macros to get that information. The display format is deliberate as well: most editors can jump to that location directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, we utilize &lt;a href="https://docs.rs/snafu/0.8.2/snafu/struct.Location.html"&gt;&lt;code&gt;snafu::Location&lt;/code&gt;&lt;/a&gt; to gather the code location, so each location points to where the error is constructed. Through this chain, we know how the error was generated and propagated to the uppermost layer.&lt;/p&gt;
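This kind of location capture can be approximated with only the standard library via `#[track_caller]` and `std::panic::Location`; the `Contextualized` struct and `with_context` helper below are hypothetical illustrations, not the real `snafu` API:

```rust
use std::panic::Location;

// One layer of a stacked error: a message plus where it was attached.
#[derive(Debug)]
struct Contextualized {
    msg: String,
    location: &'static Location<'static>,
}

#[track_caller]
fn with_context(msg: &str) -> Contextualized {
    Contextualized {
        msg: msg.to_string(),
        // Thanks to `#[track_caller]`, this is the *caller's* file,
        // line and column, not the line inside this helper.
        location: Location::caller(),
    }
}

fn main() {
    let err = with_context("Failed to decode header");
    // Renders like one layer of the stacked error: "MSG, at FILE:LINE:COLUMN"
    let line = format!(
        "{}, at {}:{}:{}",
        err.msg,
        err.location.file(),
        err.location.line(),
        err.location.column()
    );
    assert!(line.starts_with("Failed to decode header, at "));
}
```

This is also why the virtual stack is cheap: recording a static file/line/column costs almost nothing compared to walking the real call stack.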

&lt;p&gt;Here is what it looks like all together from the code side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(Snafu)]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;#[snafu(display(&lt;/span&gt;&lt;span class="s"&gt;"General catalog error: "&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;-- the `Display` impl derive&lt;/span&gt;
    &lt;span class="n"&gt;Catalog&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;-- the `location`&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;-- inner cause&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Besides, we implemented a proc-macro &lt;a href="https://greptimedb.rs/common_macro/attr.stack_trace_debug.html"&gt;&lt;code&gt;stack_trace_debug&lt;/code&gt;&lt;/a&gt; to scrape necessary information from the Error's definition and generate the implementation of the related trait &lt;a href="https://greptimedb.rs/common_error/ext/trait.StackError.html"&gt;&lt;code&gt;StackError&lt;/code&gt;&lt;/a&gt;, which provides useful methods to access and print the error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;StackError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;debug_fmt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;StackError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;StackError&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Sized&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This proc-macro mainly does two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement &lt;a href="https://greptimedb.rs/common_error/ext/trait.StackError.html"&gt;&lt;code&gt;StackError&lt;/code&gt;&lt;/a&gt; as the scaffold&lt;/li&gt;
&lt;li&gt;Implement &lt;a href="https://doc.rust-lang.org/std/fmt/trait.Debug.html"&gt;&lt;code&gt;std::fmt::Debug&lt;/code&gt;&lt;/a&gt; based on &lt;code&gt;debug_fmt()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the way, we have added &lt;code&gt;Location&lt;/code&gt; and &lt;code&gt;display&lt;/code&gt; to all errors in GreptimeDB. This is the hard work behind the methodology.&lt;/p&gt;

&lt;h3&gt;
  
  
  Macro Details
&lt;/h3&gt;

&lt;p&gt;An error chain is a singly linked list, like an onion from outer to inner, so we can capture an error at the outermost layer and walk through it.&lt;/p&gt;
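The walk can be sketched with a minimal, hypothetical trait in plain Rust (the `Stacked`/`Layer`/`render` names below are ours, not the real `StackError` API):

```rust
// Each layer exposes its own message and, optionally, the next inner layer.
trait Stacked {
    fn message(&self) -> String;
    fn next(&self) -> Option<&dyn Stacked>;
}

struct Layer {
    msg: &'static str,
    inner: Option<Box<dyn Stacked>>,
}

impl Stacked for Layer {
    fn message(&self) -> String {
        self.msg.to_string()
    }
    fn next(&self) -> Option<&dyn Stacked> {
        self.inner.as_deref()
    }
}

// Walk from the outermost layer to the root cause, numbering each layer.
fn render(err: &dyn Stacked) -> Vec<String> {
    let mut buf = Vec::new();
    let mut cur: Option<&dyn Stacked> = Some(err);
    while let Some(e) = cur {
        buf.push(format!("{}: {}", buf.len(), e.message()));
        cur = e.next();
    }
    buf
}

fn main() {
    let root = Layer { msg: "invalid character at position 1", inner: None };
    let mid = Layer { msg: "Failed to decode ProtocolHeader", inner: Some(Box::new(root)) };
    let top = Layer { msg: "Failed to handle incoming content", inner: Some(Box::new(mid)) };
    let stack = render(&top);
    assert_eq!(stack[0], "0: Failed to handle incoming content");
    assert_eq!(stack[2], "2: invalid character at position 1");
}
```

The generated `debug_fmt` does essentially this walk, with the location appended to each line.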

&lt;p&gt;One tricky thing we did here concerns distinguishing internal and external errors. Internal errors all implement the same trait, &lt;a href="https://greptimedb.rs/common_error/ext/trait.ErrorExt.html"&gt;&lt;code&gt;ErrorExt&lt;/code&gt;&lt;/a&gt;, which can be used as a marker, but depending on it requires a &lt;code&gt;downcast&lt;/code&gt; every time. We avoid this extra &lt;code&gt;downcast&lt;/code&gt; call by simply giving the two kinds different field names and detecting them in our macro.&lt;/p&gt;

&lt;p&gt;As shown below, we name all external errors &lt;code&gt;error&lt;/code&gt; and all internal errors &lt;code&gt;source&lt;/code&gt;. The generated &lt;a href="https://greptimedb.rs/common_error/ext/trait.StackError.html#tymethod.next"&gt;&lt;code&gt;StackError::next&lt;/code&gt;&lt;/a&gt; method then returns &lt;code&gt;None&lt;/code&gt; when it finds an &lt;code&gt;error&lt;/code&gt; field, or &lt;code&gt;Some(source)&lt;/code&gt; when it finds a &lt;code&gt;source&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(Snafu)]&lt;/span&gt;
&lt;span class="nd"&gt;#[stack_trace_debug]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;#[snafu(display(&lt;/span&gt;&lt;span class="s"&gt;"Failed to deserialize value"&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;ValueDeserialize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;#[snafu(source)]&lt;/span&gt;
        &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;-- external source&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="nd"&gt;#[snafu(display(&lt;/span&gt;&lt;span class="s"&gt;"Table engine not found: {}"&lt;/span&gt;&lt;span class="nd"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;engine_name))]&lt;/span&gt;
    &lt;span class="n"&gt;TableEngineNotFound&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;engine_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;table&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// &amp;lt;-- internal source&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The method &lt;a href="https://greptimedb.rs/common_error/ext/trait.StackError.html#tymethod.debug_fmt"&gt;&lt;code&gt;StackError::debug_fmt&lt;/code&gt;&lt;/a&gt; is used to render the error stack. It is called recursively in the generated code: each error layer writes its own debug message to the mutable &lt;code&gt;buf&lt;/code&gt;. The content contains the error description captured from the &lt;code&gt;#[snafu(display)]&lt;/code&gt; attribute, the variant name like &lt;code&gt;TableEngineNotFound&lt;/code&gt;, and the location from the enum variant.&lt;/p&gt;

&lt;p&gt;Given that we had already defined our error types this way, adopting the stacked error didn't require much work: adding the attribute macro &lt;code&gt;#[stack_trace_debug]&lt;/code&gt; to every error type was enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Present Error to End Users
&lt;/h3&gt;

&lt;p&gt;So far, we've covered most aspects. Now, let's delve into the final piece which is how to present errors to your users.&lt;/p&gt;

&lt;p&gt;Unlike system developers, users may not care about line numbers or even the stack. What information, then, is truly beneficial to end users?&lt;/p&gt;

&lt;p&gt;This topic is very subjective. Still taking the above error as an example, let's consider which parts users would, or should, care about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to handle protocol
0: Failed to handle incoming content, query: blabla, at src/protocol/handler.rs:89:22
1: Failed to read the next message at queue 5 of 10, at src/protocol/loop.rs:254:14
2: Failed to decode `01010001001010001` to ProtocolHeader, at src/protocol/codec.rs:90:14
3: serde_json(invalid character at position 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line gives a brief description of the error, i.e., what users actually see from the top layer, so we keep it. Lines 2 and 3 are internal details, too verbose to include. Line 4 is the leaf internal error, the boundary between internal code and external dependencies. It sometimes contains useful information, so we count it in; however, we include only the error description, since the stack number and code location are useless to users. The last line is the external error, which is usually the root cause, and we include it as well.&lt;/p&gt;

&lt;p&gt;Let's assemble the pieces we just picked. The final error message presented to users is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to handle protocol - Failed to decode `01010001001010001` to ProtocolHeader (serde_json(invalid character at position 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This can be achieved easily with the aforementioned &lt;code&gt;StackError::next&lt;/code&gt; and &lt;a href="https://greptimedb.rs/common_error/ext/trait.StackError.html#method.last"&gt;&lt;code&gt;StackError::last&lt;/code&gt;&lt;/a&gt;, or you can customize the format with those methods.&lt;/p&gt;

&lt;p&gt;Our experience is that the leaf (or innermost) error's message is often useful, as it is closest to what really went wrong. The message can be further divided into two parts, internal and external, where internal errors are those defined in our codebase and external ones come from dependencies, like &lt;code&gt;serde_json&lt;/code&gt; in the previous example. The root (or outermost) error's category is more accurate, as it comes from where the error is thrown to the user.&lt;/p&gt;
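A tiny, hypothetical sketch of assembling the user-facing message from those picks (the `user_message` function is an illustration, not GreptimeDB's actual code):

```rust
// Compose the user-facing message from three picks out of the full stack:
// the top-level description, the innermost internal reason, and the
// external root cause (if any).
fn user_message(top: &str, leaf_internal: &str, external: Option<&str>) -> String {
    match external {
        Some(cause) => format!("{} - {} ({})", top, leaf_internal, cause),
        None => format!("{} - {}", top, leaf_internal),
    }
}

fn main() {
    let msg = user_message(
        "Failed to handle protocol",
        "Failed to decode `01010001001010001` to ProtocolHeader",
        Some("serde_json(invalid character at position 1)"),
    );
    assert_eq!(
        msg,
        "Failed to handle protocol - Failed to decode `01010001001010001` to ProtocolHeader (serde_json(invalid character at position 1))"
    );
}
```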

&lt;p&gt;In short, the error message scheme we proposed is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KIND - REASON ([EXTERNAL CAUSE])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



&lt;h2&gt;
  
  
  Cost?
&lt;/h2&gt;

&lt;p&gt;The virtual stack has been sweet so far, proving both cheaper and more accurate than the system backtrace. So what is the cost?&lt;/p&gt;

&lt;p&gt;As for runtime overhead, it only requires formatting a short string for the per-level reason and location.&lt;/p&gt;

&lt;p&gt;The binary-size overhead is even smaller. In GreptimeDB's binary, the debug symbols occupy ~700MB. As a comparison, the &lt;code&gt;strip&lt;/code&gt;-ed binary is around 170MB, with a &lt;code&gt;.rodata&lt;/code&gt; section of &lt;code&gt;0x016a2225&lt;/code&gt; bytes (~22.6MB) and a &lt;code&gt;.text&lt;/code&gt; section of &lt;code&gt;0x06ad7511&lt;/code&gt; bytes (~106.8MB).&lt;/p&gt;

&lt;p&gt;Removing all &lt;code&gt;Location&lt;/code&gt; fields reduces the &lt;code&gt;.rodata&lt;/code&gt; size to &lt;code&gt;0x0169b225&lt;/code&gt; (still ~22.6MB; the change is tiny) and leaves the overall binary size at 170MB, while removing all &lt;code&gt;#[snafu(display)]&lt;/code&gt; attributes reduces the &lt;code&gt;.rodata&lt;/code&gt; size to &lt;code&gt;0x01690225&lt;/code&gt; (~22.5MB), again leaving the overall binary size at 170MB.&lt;/p&gt;

&lt;p&gt;Hence, the Stacked Error mechanism adds very little to the binary size (~100KB).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Future Works
&lt;/h2&gt;

&lt;p&gt;In this post, we presented how to implement the proc-macro &lt;a href="https://greptimedb.rs/common_macro/attr.stack_trace_debug.html"&gt;&lt;code&gt;stack_trace_debug&lt;/code&gt;&lt;/a&gt; and use it to assemble a low-overhead yet still informative stacked error message. It also provides a convenient way to walk the error chain, which helps render the error in different schemes for different purposes.&lt;/p&gt;

&lt;p&gt;This macro is only adopted in GreptimeDB for now; we are working to make it generic enough for other projects. Wider adoption of this pattern would also make it more powerful, by bridging third-party stacks and detailed reasons.&lt;/p&gt;

&lt;p&gt;Besides, &lt;code&gt;std::error::Error&lt;/code&gt; now provides an unstable API, &lt;a href="https://doc.rust-lang.org/std/error/trait.Error.html#method.provide"&gt;&lt;code&gt;provide&lt;/code&gt;&lt;/a&gt;, which allows retrieving a field from a struct. We may consider using it when refactoring our stack-trace utilities.&lt;/p&gt;




&lt;h3&gt;
  
  
  About Greptime
&lt;/h3&gt;

&lt;p&gt;We help industries that generate large amounts of time-series data, such as Connected Vehicles (CV), IoT, and Observability, to efficiently uncover the hidden value of data in real-time. &lt;/p&gt;

&lt;p&gt;Visit the &lt;a href="https://www.greptime.com/resources"&gt;latest version&lt;/a&gt; from any device to get started and get the most out of your data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt;, written in Rust, is a distributed, open-source, time-series database designed for scalability, efficiency, and powerful analytics. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.greptime.com/product/cloud"&gt;GreptimeCloud&lt;/a&gt; is a fully-managed cloud database-as-a-service (DBaaS) solution built on GreptimeDB. It efficiently supports applications in fields such as observability, IoT, and finance. The built-in observability solution, &lt;a href="https://www.greptime.com/product/ai"&gt;GreptimeAI&lt;/a&gt;, helps users comprehensively monitor the cost, performance, traffic, and security of LLM applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vehicle-Cloud Integrated TSDB&lt;/strong&gt; solution is tailored for business scenarios of automotive enterprises. It addresses the practical business pain points that arise when enterprise vehicle data grows exponentially.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If anything above draws your attention, don't hesitate to star us on &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GitHub&lt;/a&gt; or join GreptimeDB Community on &lt;a href="https://www.greptime.com/slack"&gt;Slack&lt;/a&gt;. Also, you can go to our &lt;a href="https://github.com/GreptimeTeam/greptimedb/contribute"&gt;contribution page&lt;/a&gt; to find some interesting issues to start with.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Overcoming Prometheus's Single-Value Data Model Limitations - A New Approach by GreptimeDB</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Sun, 12 May 2024 08:46:55 +0000</pubDate>
      <link>https://dev.to/greptime/overcoming-prometheuss-single-value-data-model-limitations-a-new-approach-by-greptimedb-30cg</link>
      <guid>https://dev.to/greptime/overcoming-prometheuss-single-value-data-model-limitations-a-new-approach-by-greptimedb-30cg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Prometheus has established itself as a cornerstone in the monitoring and alerting ecosystem, favored for its straightforwardness and efficiency in handling real-time metrics. Central to its operation is a data model where each sample comprises a single value and an assortment of labels, a design that, while fostering simplicity and adaptability, also introduces several challenges. These challenges can impact data collection efficiency, analysis depth, and query capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article explores the limitations inherent in Prometheus's single-value data model and introduces GreptimeDB's innovative solutions that aim to address these issues, illustrated with practical examples.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of The Single-Value Data Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Redundant Label Transmission in Data Collection
&lt;/h3&gt;

&lt;p&gt;Prometheus's data model necessitates the repeated transmission of labels for measurements from the same source, resulting in inefficient data collection and storage. Despite the employment of optimization techniques in Prometheus's storage engine to enhance data storage efficiency, the redundancy of label information still poses a significant overhead.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In a scenario where multiple metrics like CPU usage, memory usage, and disk I/O are collected from a server cluster, each metric carries identical labels such as &lt;code&gt;cluster_name&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;, &lt;code&gt;instance&lt;/code&gt;, and &lt;code&gt;server_type&lt;/code&gt;, leading to unnecessary duplication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl7ya2kqt5u8hwrg1sjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl7ya2kqt5u8hwrg1sjx.png" alt="Multiple Metrics" width="800" height="1058"&gt;&lt;/a&gt;&lt;/p&gt;
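&lt;p&gt;The scale of this duplication is easy to estimate. The toy Rust sketch below (hypothetical label set; it counts raw label bytes only, ignoring any compression the storage engine applies) compares the label bytes carried under the single-value model versus a grouped, multi-value sample:&lt;/p&gt;

```rust
// Sum the raw bytes of a label set (key length + value length).
fn label_bytes(labels: &[(&str, &str)]) -> usize {
    labels.iter().map(|(k, v)| k.len() + v.len()).sum()
}

fn main() {
    let labels = [
        ("cluster_name", "prod"),
        ("region", "us-west"),
        ("instance", "10.0.0.1"),
        ("server_type", "db"),
    ];
    let per_sample = label_bytes(&labels);
    let metrics = 3; // e.g. cpu_usage, mem_usage, disk_io from the same host

    // Single-value model: every metric repeats the full label set.
    println!("single-value model: {} label bytes", per_sample * metrics);
    // Multi-value model: the label set is carried once for the group.
    println!("multi-value model:  {} label bytes", per_sample);
}
```

&lt;p&gt;With &lt;em&gt;n&lt;/em&gt; metrics per source, the single-value model transmits the same label set &lt;em&gt;n&lt;/em&gt; times.&lt;/p&gt;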

&lt;h3&gt;
  
  
  2. Loss of Measurement Correlation
&lt;/h3&gt;

&lt;p&gt;The separation of related measurements into distinct metrics, without a mechanism for structured grouping or inheritance, leads to a loss of correlation among measurements. This separation makes correlated analysis and queries difficult, limiting insights into metric interactions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When monitoring a Redis instance by tracking metrics such as memory usage, command processing rates, and active connections separately, it becomes challenging to analyze how these metrics influence each other. For example, understanding how memory usage affects command processing rates becomes difficult.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Complexity in Querying Composite Monitoring Views
&lt;/h3&gt;

&lt;p&gt;Creating comprehensive monitoring dashboards requires aggregating data from multiple, separate PromQL queries, complicating dashboard construction and increasing the query load.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To monitor a Kubernetes node effectively, a dashboard needs to aggregate metrics like CPU load, memory consumption, network I/O, and pod counts. However, each metric requires a separate PromQL query, which complicates the dashboard setup and may potentially impact performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  GreptimeDB to the Rescue
&lt;/h2&gt;

&lt;p&gt;GreptimeDB introduces innovative solutions to address the limitations of Prometheus's single-value data model:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Group related metrics and store them together
&lt;/h3&gt;

&lt;p&gt;GreptimeDB has developed a new storage engine for this monitoring scenario, called &lt;a href="https://docs.greptime.com/contributor-guide/datanode/metric-engine"&gt;Metric Engine&lt;/a&gt;. It supports storing multiple measurements together physically, cutting a huge amount of cost and accelerating the query in correlated measurements.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Multi-Value Samples and Diverse Value Types
&lt;/h3&gt;

&lt;p&gt;GreptimeDB allows each sample from a single data source to store multiple values, supporting a variety of value types beyond floats.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Monitoring data for a Redis instance can be stored in one or multiple time-series tables, with labels stored as separate tag columns and grouped measurements as separate field columns. This approach reduces label transmission redundancy, preserves data correlation, and facilitates associated analysis and querying.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh6ik3agdoy8nu9bzhba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh6ik3agdoy8nu9bzhba.png" alt="Example of Monitoring Data for Redis" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Extended PromQL for Multiple Field Queries
&lt;/h3&gt;

&lt;p&gt;GreptimeDB enhances PromQL to allow queries to return multiple fields (values). To specify a particular field, an extended &lt;code&gt;__field__&lt;/code&gt; label can be used.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This extended PromQL query &lt;code&gt;memstats{ __field__="used_bytes", __field__="free_bytes"}&lt;/code&gt; fetches two time series in one query and renders them together. This extension simplifies querying for composite monitoring views, reducing the complexity and load of constructing detailed dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Support for Table Model and SQL for Advanced Association Analysis
&lt;/h3&gt;

&lt;p&gt;One of the most impactful features GreptimeDB offers is its support for a table model and the use of SQL for querying data. This capability significantly surpasses the flexibility of PromQL, especially when it comes to performing association analysis and executing complex queries. By leveraging a relational model, users can perform joins across different datasets, enabling a deeper and more nuanced analysis of the monitored systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In a complex monitoring scenario where one needs to correlate server performance metrics with application error logs, GreptimeDB allows for this data to be queried together using SQL. For instance, one could execute a SQL query to join CPU usage metrics with application error logs based on timestamps, providing insights into how spikes in CPU usage may correlate with increased error rates. This level of analysis would be cumbersome, if not impossible, to achieve with PromQL alone.&lt;/p&gt;
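&lt;p&gt;The mechanics of such a timestamp join can be sketched in a few lines of Rust (toy data and hypothetical function names; in GreptimeDB you would express this directly in SQL):&lt;/p&gt;

```rust
use std::collections::HashMap;

// A toy hash join on timestamps, sketching what a query like
// `SELECT ... FROM cpu_metrics JOIN error_logs USING (ts)` computes:
// build a hash table on one side, then probe it with the other.
fn join_on_ts(
    cpu: &[(i64, f64)],    // (timestamp, cpu_usage_percent)
    errors: &[(i64, u32)], // (timestamp, error_count)
) -> Vec<(i64, f64, u32)> {
    let by_ts: HashMap<i64, u32> = errors.iter().copied().collect();
    cpu.iter()
        .filter_map(|&(ts, usage)| by_ts.get(&ts).map(|&n| (ts, usage, n)))
        .collect()
}

fn main() {
    let cpu = [(1000, 35.0), (1060, 97.5), (1120, 40.2)];
    let errors = [(1060, 42), (1120, 1)];
    // Only timestamps present on both sides survive the join,
    // pairing each CPU spike with its error count.
    for (ts, usage, n) in join_on_ts(&cpu, &errors) {
        println!("ts={ts} cpu={usage}% errors={n}");
    }
}
```

&lt;p&gt;Expressed in SQL, the engine performs this correlation for you; the sketch only illustrates why joining on a shared timestamp column makes cross-dataset analysis straightforward.&lt;/p&gt;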

&lt;p&gt;P.S. GreptimeDB is actively developing the logs engine as described in the &lt;a href="https://www.greptime.com/blogs/2024-02-29-greptimedb-2024-roadmap"&gt;Roadmap&lt;/a&gt;. Stay tuned!&lt;/p&gt;

&lt;p&gt;This support for a table model and SQL not only makes GreptimeDB a powerful tool for users transitioning from traditional SQL-based systems, but also enhances its capability for in-depth analysis without the steep learning curve associated with mastering PromQL. Introducing these features marks a significant step forward in making monitoring data more accessible and actionable for a broader range of analytical tasks, from basic monitoring to complex performance analysis and troubleshooting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While Prometheus's single-value data model has contributed to its simplicity and widespread adoption, it also poses challenges in terms of data collection efficiency, measurement correlation, and query complexity. GreptimeDB's solutions offer a promising approach to overcoming these limitations, providing more efficient data collection, enhanced correlation analysis, and simplified querying for comprehensive monitoring views.&lt;/p&gt;




&lt;h4&gt;
  
  
  About Greptime
&lt;/h4&gt;

&lt;p&gt;We help industries that generate large amounts of time-series data, such as Connected Vehicles (CV), IoT, and Observability, to efficiently uncover the hidden value of data in real-time. &lt;/p&gt;

&lt;p&gt;Visit the &lt;a href="https://www.greptime.com/resources"&gt;latest version&lt;/a&gt; from any device to get started and get the most out of your data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt;, written in Rust, is a distributed, open-source, time-series database designed for scalability, efficiency, and powerful analytics. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.greptime.com/product/cloud"&gt;GreptimeCloud&lt;/a&gt; is a fully-managed cloud database-as-a-service (DBaaS) solution built on GreptimeDB. It efficiently supports applications in fields such as observability, IoT, and finance. The built-in observability solution, &lt;a href="https://www.greptime.com/product/ai"&gt;GreptimeAI&lt;/a&gt;, helps users comprehensively monitor the cost, performance, traffic, and security of LLM applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vehicle-Cloud Integrated TSDB&lt;/strong&gt; solution is tailored for business scenarios of automotive enterprises. It addresses the practical business pain points that arise when enterprise vehicle data grows exponentially.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If anything above draws your attention, don't hesitate to star us on &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GitHub&lt;/a&gt; or join GreptimeDB Community on &lt;a href="https://www.greptime.com/slack"&gt;Slack&lt;/a&gt;. Also, you can go to our &lt;a href="https://github.com/GreptimeTeam/greptimedb/contribute"&gt;contribution page&lt;/a&gt; to find some interesting issues to start with.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Introducing GreptimeDB v0.7 — Unlock the Future of Cloud-Native Monitoring</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Thu, 07 Mar 2024 07:56:52 +0000</pubDate>
      <link>https://dev.to/greptime/introducing-greptimedb-v07-unlock-the-future-of-cloud-native-monitoring-2623</link>
      <guid>https://dev.to/greptime/introducing-greptimedb-v07-unlock-the-future-of-cloud-native-monitoring-2623</guid>
      <description>&lt;p&gt;Last week, we unveiled the &lt;a href="https://www.greptime.com/blogs/2024-02-29-greptimedb-2024-roadmap"&gt;GreptimeDB roadmap for 2024&lt;/a&gt;, charting out several significant updates slated for this year.&lt;/p&gt;

&lt;p&gt;With the advent of spring in early March, we also welcomed the debut of the first production-grade version of GreptimeDB. v0.7 represents a crucial leap toward achieving production readiness; &lt;strong&gt;it implements production-ready features for cloud-native monitoring scenarios&lt;/strong&gt;. We eagerly invite the entire community to engage with this release and share their invaluable feedback through &lt;a href="https://www.greptime.com/slack"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;From v0.6 to v0.7, the Greptime team has made significant strides: a total of 184 commits were merged and 705 files modified, including 82 feature enhancements, 35 bug fixes, 19 code refactors, and a substantial amount of testing work. &lt;/p&gt;

&lt;p&gt;During this period, a total of 8 individual contributors participated in the code contributions. Special thanks to &lt;a href="https://github.com/etolbakov"&gt;Eugene Tolbakov&lt;/a&gt; for being continuously active in GreptimeDB's development as our first committer!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric Engine: crafted specifically for Observability scenarios. It's adept at managing a vast array of small tables, &lt;strong&gt;making it ideal for cloud-native monitoring&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Region Migration: enhances the user experience by simplifying region migrations to straightforward SQL commands.&lt;/li&gt;
&lt;li&gt;Inverted Index: dramatically improves the efficiency of locating data segments relevant to user queries, significantly reducing the IO operations needed for scanning data files and thus accelerating the query process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's dive deep into the updates in v0.7.&lt;/p&gt;

&lt;h2&gt;
  
  
  Region Migration
&lt;/h2&gt;

&lt;p&gt;Region Migration provides the capability to migrate regions of a data table between Datanodes. Leveraging this feature, we can easily implement hot data migration and horizontal scaling for load balancing. GreptimeDB shipped an initial implementation of Region Migration in v0.6. In v0.7, we have further refined the feature and enhanced the user experience. &lt;/p&gt;

&lt;p&gt;Now, we can conveniently execute region migration through SQL commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;migrate_region&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;region_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;from_dn_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;to_dn_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;replay_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx01cyrybkzdqv45hkyet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx01cyrybkzdqv45hkyet.png" alt="Image description" width="800" height="881"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Metric Engine
&lt;/h2&gt;

&lt;p&gt;The Metric Engine is a brand-new engine designed specifically for Observability scenarios. Its primary goal is to handle a large number of small tables, making it particularly suitable for cloud-native monitoring, such as scenarios previously served by Prometheus. By leveraging synthetic wide tables, this new engine offers the capability for metric data storage and metadata reuse, making "tables" more lightweight. It can overcome some of the limitations of the current Mito engine, where tables are too heavyweight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr49pjwtkjwi0spv3fms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr49pjwtkjwi0spv3fms.png" alt="Image description" width="800" height="756"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original Metric Data

&lt;ul&gt;
&lt;li&gt;Taking the metrics from the following six node exporters as an example. In the single-value model systems represented by Prometheus, even highly correlated metrics need to be split and stored separately.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7676z82ql2jsh6gufsln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7676z82ql2jsh6gufsln.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logical Table from User's Perspective

&lt;ul&gt;
&lt;li&gt;The Metric Engine authentically reproduces the structure of Metrics, presenting users with the exact structure of the Metrics as they were written.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglyp6svyuw3ft7f1pc1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglyp6svyuw3ft7f1pc1r.png" alt="Image description" width="800" height="1058"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Physical Table from the Storage Perspective

&lt;ul&gt;
&lt;li&gt;At the storage layer, the Metric Engine performs mapping, using a single physical table to store related data. This approach reduces storage costs and supports the storage of Metrics at a larger scale.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66nfw3m4arvryhg442q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66nfw3m4arvryhg442q5.png" alt="Image description" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;
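&lt;p&gt;Conceptually, the logical-to-physical mapping can be sketched as follows (a toy Rust illustration with hypothetical names, not the engine's actual implementation): each logical table is just a named projection over the columns of one shared physical wide table.&lt;/p&gt;

```rust
use std::collections::HashMap;

// One physical wide table shared by many lightweight logical tables.
struct PhysicalTable {
    columns: Vec<String>,                 // all tag/field columns, stored once
    logical: HashMap<String, Vec<usize>>, // logical table -> column indices
}

impl PhysicalTable {
    // Register a logical table: reuse existing physical columns where
    // possible, appending only columns not yet present.
    fn register(&mut self, name: &str, cols: &[&str]) {
        let mut idxs = Vec::new();
        for c in cols {
            let idx = match self.columns.iter().position(|x| x == c) {
                Some(i) => i,
                None => {
                    self.columns.push(c.to_string());
                    self.columns.len() - 1
                }
            };
            idxs.push(idx);
        }
        self.logical.insert(name.to_string(), idxs);
    }
}

fn main() {
    let mut phys = PhysicalTable { columns: Vec::new(), logical: HashMap::new() };
    phys.register("node_cpu_seconds_total", &["instance", "cpu", "value"]);
    phys.register("node_memory_bytes", &["instance", "value"]);
    // "instance" and "value" are stored once and shared by both logical tables.
    println!("physical columns: {:?}", phys.columns);
}
```

&lt;p&gt;Because shared columns and metadata exist only once at the physical layer, adding another logical table is cheap, which is what makes "tables" atop the Metric Engine lightweight.&lt;/p&gt;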

&lt;ul&gt;
&lt;li&gt;Upcoming Development Plan: Automatic Field Grouping

&lt;ul&gt;
&lt;li&gt;In real-world scenarios that generate Metrics, the majority of these metrics are interconnected. GreptimeDB will possess the capability to automatically identify related metrics and consolidate them. This approach will not only decrease the number of timelines across various metrics but also enhance the efficiency of handling queries across multiple metrics.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9spoagreycm23ceaz7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9spoagreycm23ceaz7o.png" alt="Image description" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower Storage Cost

&lt;ul&gt;
&lt;li&gt;For cost testing on the AWS S3 storage backend, data is written for approximately thirty minutes at a total write rate of about 300k rows per second. The number of S3 operations occurring during the test is tallied to estimate the cost based on AWS's pricing. The index feature is enabled throughout the testing process. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Pricing references are taken from the Standard tier at &lt;a href="https://aws.amazon.com/s3/pricing/"&gt;https://aws.amazon.com/s3/pricing/&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ncohv9k4p2y7kwiwd4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ncohv9k4p2y7kwiwd4k.png" alt="Image description" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the test data provided, it is evident that the &lt;strong&gt;Metric Engine can significantly reduce storage costs by decreasing the number of physical tables.&lt;/strong&gt; The number of operations at each stage drops by an order of magnitude, which in turn leads to a &lt;strong&gt;more than eightfold overall cost reduction&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inverted Index
&lt;/h2&gt;

&lt;p&gt;Inverted Index, as a newly introduced index module, is designed to pinpoint the data segments pertinent to user queries with high efficiency, significantly reducing the I/O operations required to scan data files, thereby accelerating the query process. &lt;strong&gt;In the context of TSBS testing scenarios, we observed an average performance increase of 50%, with select scenarios experiencing boosts of up to nearly 200%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The advantages of the Inverted Index include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ready to use&lt;/strong&gt;: The system automatically generates appropriate indexes, with no need for users to specify them manually.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practical functionality&lt;/strong&gt;: Supports equality, range, and regular expression matches for multiple column values, ensuring rapid data location and filtering in most scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible adaptation&lt;/strong&gt;:  It automatically fine-tunes internal parameters to strike an optimal balance between the cost of construction and the efficiency of queries, adeptly catering to the diverse indexing requirements of different use cases.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ujm048mrp2xgphk15ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ujm048mrp2xgphk15ux.png" alt="Image description" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;
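&lt;p&gt;The core idea of an inverted index can be sketched in a few lines of Rust (a toy illustration with hypothetical names, not GreptimeDB's actual implementation): map each (column, value) pair to the list of rows containing it, so an equality query reads only the rows in that posting list instead of scanning every data file.&lt;/p&gt;

```rust
use std::collections::HashMap;

// A toy inverted index over tag columns: (column, value) -> row ids.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<(String, String), Vec<usize>>,
}

impl InvertedIndex {
    // Record that `row` holds `value` in `column`.
    fn insert(&mut self, row: usize, column: &str, value: &str) {
        self.postings
            .entry((column.to_string(), value.to_string()))
            .or_default()
            .push(row);
    }

    // Equality lookup: the rows where `column == value`; empty if none.
    fn lookup(&self, column: &str, value: &str) -> &[usize] {
        self.postings
            .get(&(column.to_string(), value.to_string()))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}

fn main() {
    let mut idx = InvertedIndex::default();
    idx.insert(0, "region", "us-west");
    idx.insert(1, "region", "us-east");
    idx.insert(2, "region", "us-west");
    // Only rows 0 and 2 need to be read for `region = 'us-west'`.
    println!("{:?}", idx.lookup("region", "us-west"));
}
```

&lt;p&gt;Range and regular-expression matches generalize the same idea by unioning the posting lists of every value that satisfies the predicate.&lt;/p&gt;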

&lt;h2&gt;
  
  
  Other Updates
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Database management capabilities significantly enhanced&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We have substantially supplemented the &lt;code&gt;information_schema&lt;/code&gt; tables, adding information such as SCHEMATA and PARTITIONS.&lt;/p&gt;

&lt;p&gt;Besides, we also introduced many new SQL functions to facilitate management operations on GreptimeDB. For example, it is now possible to trigger Region Flush and perform Region migration through SQL, as well as to query the execution status of procedures.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Performance Improvement&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In v0.7, the Memtable was restructured and upgraded, enhancing data scan speed and reducing memory usage. At the same time, we have made numerous improvements and optimizations to the read and write performance of object storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Upgrade Guide
&lt;/h2&gt;

&lt;p&gt;As we have many significant changes in the new version, the release of v0.7 requires a system downtime upgrade. It is recommended to use the official upgrade tool, with the general upgrade process as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a brand new v0.7 cluster&lt;/li&gt;
&lt;li&gt;Shut down the traffic entry to the old cluster (stop writing)&lt;/li&gt;
&lt;li&gt;Export the table structure and data using the GreptimeDB CLI upgrade tool&lt;/li&gt;
&lt;li&gt;Import the data into the new cluster using the GreptimeDB CLI upgrade tool&lt;/li&gt;
&lt;li&gt;Switch the traffic entry to the new cluster&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Please refer to the detailed upgrade guide &lt;a href="https://docs.greptime.com/user-guide/upgrade"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Plan
&lt;/h2&gt;

&lt;p&gt;Our next milestone is scheduled for April, as we anticipate the launch of v0.8. This release will mark the completion of GreptimeFlow, a streamlined stream computing solution adept at conducting continuous aggregation across GreptimeDB data streams. Designed with flexibility in mind, GreptimeFlow can either be integrated directly into the GreptimeDB Frontend or deployed as a standalone service within the GreptimeDB architecture. &lt;/p&gt;
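&lt;p&gt;The kind of continuous aggregation GreptimeFlow performs can be sketched as incremental window maintenance (a toy Rust illustration, not GreptimeFlow's API): each arriving sample updates its window's running aggregate, instead of recomputing over the full history.&lt;/p&gt;

```rust
use std::collections::BTreeMap;

// Toy continuous aggregation: maintain per-window sums incrementally
// as samples arrive on the stream.
struct TumblingSum {
    window_secs: i64,
    sums: BTreeMap<i64, f64>, // window start -> running sum
}

impl TumblingSum {
    fn new(window_secs: i64) -> Self {
        Self { window_secs, sums: BTreeMap::new() }
    }

    // Fold one sample into its tumbling window.
    fn push(&mut self, ts: i64, value: f64) {
        let window = ts - ts.rem_euclid(self.window_secs);
        *self.sums.entry(window).or_insert(0.0) += value;
    }
}

fn main() {
    let mut agg = TumblingSum::new(60);
    for (ts, v) in [(0, 1.0), (30, 2.0), (65, 4.0)] {
        agg.push(ts, v);
    }
    // Window [0,60) accumulates 1.0 + 2.0; window [60,120) holds 4.0.
    for (start, sum) in &agg.sums {
        println!("window {start}: sum {sum}");
    }
}
```

&lt;p&gt;Each update touches only one window entry, which is why this style of computation stays cheap enough to run continuously alongside ingestion.&lt;/p&gt;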

&lt;p&gt;Beyond continual functional upgrades, we are persistently optimizing GreptimeDB's performance. Although v0.7 has seen substantial enhancements in performance compared to its predecessors, there remains a gap in observability scenarios compared to some mainstream solutions. Bridging this performance gap will be our primary focus in the upcoming optimization efforts.&lt;/p&gt;

&lt;p&gt;For a comprehensive view of our planned version updates, we invite you to explore the &lt;a href="https://www.greptime.com/blogs/2024-02-29-greptimedb-2024-roadmap"&gt;GreptimeDB 2024 roadmap&lt;/a&gt;. Stay connected and journey with us as we continue to evolve GreptimeDB.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About Greptime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We help industries that generate large amounts of time-series data, such as Connected Vehicles (CV), IoT, and Observability, to efficiently uncover the hidden value of data in real-time. &lt;/p&gt;

&lt;p&gt;Visit the &lt;a href="https://www.greptime.com/resources"&gt;latest v0.7&lt;/a&gt; from any device to get started and get the most out of your data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt;, written in Rust, is a distributed, open-source, time-series database designed for scalability, efficiency, and powerful analytics. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.greptime.com/product/cloud"&gt;GreptimeCloud&lt;/a&gt; offers a fully managed DBaaS that integrates well with observability and IoT sectors.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.greptime.com/product/ai"&gt;GreptimeAI&lt;/a&gt; is a tailored observability solution for LLM applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If anything above draws your attention, don't hesitate to star us on &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GitHub&lt;/a&gt; or join GreptimeDB Community on &lt;a href="https://www.greptime.com/slack"&gt;Slack&lt;/a&gt;. Also, you can go to our &lt;a href="https://github.com/GreptimeTeam/greptimedb/contribute"&gt;contribution page&lt;/a&gt; to find some interesting issues to start with.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>database</category>
      <category>rust</category>
      <category>programming</category>
    </item>
    <item>
      <title>What to Expect Next? GreptimeDB Roadmap for 2024</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Fri, 01 Mar 2024 09:52:34 +0000</pubDate>
      <link>https://dev.to/greptime/what-to-expect-next-greptimedb-roadmap-for-2024-33p7</link>
      <guid>https://dev.to/greptime/what-to-expect-next-greptimedb-roadmap-for-2024-33p7</guid>
      <description>&lt;p&gt;Since GreptimeDB's open-sourcing on November 15th, 2022, we have stepped on the committed journey towards crafting a fast and efficient data infrastructure. This endeavor has been propelled by the collaborative efforts of both our dedicated team and the vibrant community that supports us.&lt;/p&gt;

&lt;p&gt;As we embark on the inaugural season of 2024, a leap year enriched by an extra day in February, this year promises to be thrilling as we anticipate numerous groundbreaking developments. These crucial updates will significantly showcase the maturity of our product within production environments, presenting practical benchmarks for users to compare with leading time-series databases in the industry.&lt;/p&gt;

&lt;p&gt;As we forge ahead with GreptimeDB 2024, it prompts the question, "What's next?" This roadmap outlines the objectives our team is pursuing and the visions we harbor for our collective community. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Providing clarity on what the community can expect from GreptimeDB for the next 10-12 months;&lt;/li&gt;
&lt;li&gt;Offering insights to those wishing to contribute to GreptimeDB on GitHub by highlighting potential starting points and the types of projects we are eager to embark on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read the updated roadmap in &lt;a href="https://github.com/GreptimeTeam/greptimedb/issues/3412"&gt;this issue&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Main Feature Updates in 2024
&lt;/h2&gt;

&lt;p&gt;The evolution of GreptimeDB in 2024 is marked by a suite of main feature updates. These enhancements are a testament to our ongoing commitment to excellence, driven by feedback from our community and the latest requirements in real-world scenarios.&lt;/p&gt;

&lt;p&gt;Our roadmap for the year includes significant advancements that promise to elevate the capabilities of GreptimeDB and enrich the user experience. &lt;/p&gt;

&lt;p&gt;Here's a glimpse into what we have in store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/GreptimeTeam/greptimedb/blob/main/docs/rfcs/2023-07-10-metric-engine.md"&gt;Metric Engine&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracking issue: &lt;a href="https://github.com/GreptimeTeam/greptimedb/issues/3187"&gt;https://github.com/GreptimeTeam/greptimedb/issues/3187&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A new engine designed for observability scenarios. Its primary aim is to handle a large number of small tables, making it particularly suitable for storing Prometheus metrics. By utilizing synthetic wide tables, the engine stores metric data and reuses metadata, rendering the "tables" atop it more lightweight and overcoming some limitations of the existing Mito engine, which is too heavy for such workloads.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;a href="https://github.com/GreptimeTeam/greptimedb/blob/main/docs/rfcs/2024-01-17-dataflow-framework.md"&gt;GreptimeFlow&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracking issue: &lt;a href="https://github.com/GreptimeTeam/greptimedb/issues/3187"&gt;https://github.com/GreptimeTeam/greptimedb/issues/3187&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A lightweight stream computing component capable of performing continuous aggregation on GreptimeDB data streams. It can be embedded into the GreptimeDB Frontend or deployed as a separate service within the GreptimeDB cluster.&lt;/li&gt;
&lt;li&gt;A flow job can be submitted in the form of SQL:
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;TASK&lt;/span&gt; &lt;span class="n"&gt;avg_over_5m&lt;/span&gt; &lt;span class="n"&gt;WINDOW_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"5m"&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Index&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/GreptimeTeam/greptimedb/blob/main/docs/rfcs/2023-11-03-inverted-index.md"&gt;Inverted Index&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Tracking issue: &lt;a href="https://github.com/GreptimeTeam/greptimedb/issues/2705"&gt;https://github.com/GreptimeTeam/greptimedb/issues/2705&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Smart Index

&lt;ul&gt;
&lt;li&gt;For instance, it automatically monitors workloads and query performance, and when necessary, it autonomously creates relevant indexes and removes unused ones.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Spatial Index

&lt;ul&gt;
&lt;li&gt;Supports storage and retrieval of geographic location information.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Cluster Management &amp;amp; Autopilot&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/GreptimeTeam/greptimedb/blob/main/docs/rfcs/2023-11-07-region-migration.md"&gt;Region Migration&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Tracking issue: &lt;a href="https://github.com/GreptimeTeam/greptimedb/issues/2700"&gt;https://github.com/GreptimeTeam/greptimedb/issues/2700&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It offers the capability to migrate Regions between Datanodes, facilitating the relocation of hot data and the horizontal scaling of load balancing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Auto Rebalance Regions

&lt;ul&gt;
&lt;li&gt;An automated load balancing scheduler built upon Region Migration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Logs Engine&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A storage engine designed specifically for the characteristics of log data, sharing most of GreptimeDB's architecture and capabilities, such as the SQL query layer, data sharding, distributed routing, querying, indexing, and compression. This enables GreptimeDB to become a unified system offering optimized storage and a consistent access experience for both Metrics and Logs data, based on a multi-engine architecture.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GreptimeDB Version Plan
&lt;/h2&gt;

&lt;p&gt;With all the feature updates listed above, we've drawn up the release plan for GreptimeDB in 2024.&lt;/p&gt;

&lt;p&gt;The image below presents the GreptimeDB 2024 Roadmap, showcasing a structured release schedule and the pivotal feature enhancements planned for deployment throughout the year. Please note that these details are tentative and subject to refinement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucycsfy5etvoja5jgovo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucycsfy5etvoja5jgovo.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track the progress of GreptimeDB versions &lt;a href="https://github.com/GreptimeTeam/greptimedb/milestones"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;GreptimeDB v1.0 marks a milestone as a production-ready release, boasting advanced features such as Smart Index, setting a new standard for efficiency and performance.&lt;/p&gt;

&lt;p&gt;We warmly invite you to mark your calendar and experience the robust capabilities of GreptimeDB v1.0 (scheduled for release in August) to boost your time-series data management and analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March: GreptimeDB v0.7&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Region Migration&lt;/li&gt;
&lt;li&gt;Inverted Index&lt;/li&gt;
&lt;li&gt;Metric Engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;April: GreptimeDB v0.8&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GreptimeFlow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;June: GreptimeDB v0.9&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto Rebalance Regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;August: GreptimeDB v1.0&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart Index&lt;/li&gt;
&lt;li&gt;Spatial Index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;December: GreptimeDB v1.1&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs Engine: Data ingestion from popular log collectors&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Involved Now
&lt;/h2&gt;

&lt;p&gt;If anything above draws your attention, don't hesitate to star us on &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GitHub&lt;/a&gt; or join the GreptimeDB Community on &lt;a href="https://www.greptime.com/slack"&gt;Slack&lt;/a&gt;. Also, you can go to our &lt;a href="https://github.com/GreptimeTeam/greptimedb/contribute"&gt;contribution page&lt;/a&gt; to find some interesting issues to start with.&lt;/p&gt;

&lt;p&gt;Looking beyond the initiatives already in progress, there's plenty of room for improvement, and we welcome ideas beyond these planned updates. If you're interested in giving one a try, speak up and chat with the team; we'll be glad to help you get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Us
&lt;/h2&gt;

&lt;p&gt;Greptime helps industries that generate large amounts of time-series data, such as Connected Vehicles (CV), IoT, and Observability, to efficiently uncover the hidden value of data in real time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GreptimeDB, written in Rust, is a distributed, open-source, time-series database designed for scalability, efficiency, and powerful analytics. &lt;/li&gt;
&lt;li&gt;GreptimeCloud offers a fully managed DBaaS that integrates well with observability and IoT sectors.&lt;/li&gt;
&lt;li&gt;GreptimeAI is a tailored observability solution for LLM applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an open-source initiative, we welcome enthusiasts of relevant technologies to join our community and share their insights. Star us now on GitHub and help us strengthen our community together.&lt;/p&gt;

&lt;p&gt;Twitter: &lt;a href="https://twitter.com/Greptime"&gt;https://twitter.com/Greptime&lt;/a&gt;&lt;br&gt;
LinkedIn: &lt;a href="https://www.linkedin.com/company/gr"&gt;https://www.linkedin.com/company/gr&lt;/a&gt;&lt;br&gt;
Youtube: &lt;a href="https://www.youtube.com/@greptime"&gt;https://www.youtube.com/@greptime&lt;/a&gt;&lt;br&gt;
Slack: &lt;a href="https://www.greptime.com/slack"&gt;https://www.greptime.com/slack&lt;/a&gt;&lt;br&gt;
Contact us: &lt;a href="mailto:info@greptime.com"&gt;info@greptime.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>cloudnative</category>
      <category>rust</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Research Paper Sharing - Exploiting Cloud Object Storage for High-Performance Analytics</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Fri, 02 Feb 2024 01:41:56 +0000</pubDate>
      <link>https://dev.to/greptime/research-paper-sharing-exploiting-cloud-object-storage-for-high-performance-analytics-4g52</link>
      <guid>https://dev.to/greptime/research-paper-sharing-exploiting-cloud-object-storage-for-high-performance-analytics-4g52</guid>
      <description>&lt;p&gt;In this sharing, we discuss a paper by Dominik Durner, Viktor Leis, and Thomas Neumann from the Technical University of Munich (TUM), published in July 2023 in &lt;a href="https://dl.acm.org/toc/pvldb/2023/16/11"&gt;PVLDB (Volume 16 No.11)&lt;/a&gt;: &lt;em&gt;Exploiting Cloud Object Storage for High-Performance Analytics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;DB: &lt;a href="https://umbra-db.com/"&gt;https://umbra-db.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper link: &lt;a href="https://www.vldb.org/pvldb/vol16/p2769-durner.pdf"&gt;https://www.vldb.org/pvldb/vol16/p2769-durner.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;...Our experiments demonstrate that even without caching, Umbra with integrated AnyBlob can match the performance of state-of-the-art cloud data warehouses that utilize local SSDs for caching, while also enhancing resource elasticity...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When developing our open-source cloud-native time-series analytical database &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt;, we found this paper exceptionally beneficial. It primarily focuses on performing high-performance data analytics on object storage, with several conclusions providing clear direction for our engineering practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to AWS S3
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS S3's storage cost is $23 per TB per month, offering 99.999999999% (eleven nines) of durability. It's important to note that the final cost also depends on the number of API calls and cross-region data transfer fees.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The bandwidth for accessing S3 can reach up to 200 Gbps, depending on the instance's bandwidth. While the Introduction mentions 100 Gbps, later sections state that on AWS C7-series instances the bandwidth can reach a full 200 Gbps.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paper identifies the following challenges with AWS S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Challenge 1: Underutilization of bandwidth&lt;/li&gt;
&lt;li&gt;Challenge 2: Additional network CPU overhead&lt;/li&gt;
&lt;li&gt;Challenge 3: Lack of multi-cloud support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on our experience, these challenges rank in importance as 1 &amp;gt; 2 &amp;gt; 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Characteristics of Cloud Storage (Object Storage)
&lt;/h2&gt;

&lt;p&gt;Cloud storage (object storage) typically offers relatively low latency (ranging from several milliseconds to a few hundred milliseconds depending on the load size) and high throughput (capped by EC2 bandwidth, which can go as high as 200 Gbps on 7th generation EC2 models), making it suitable for large-scale data read and write operations.&lt;/p&gt;

&lt;p&gt;In contrast, Amazon Elastic Block Store (EBS) usually provides lower latency (in the order of single-digit milliseconds). However, its throughput is lower than cloud storage, often by one or two orders of magnitude.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg39njr1zdc1977xhnsby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg39njr1zdc1977xhnsby.png" alt="Image description" width="800" height="810"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For &lt;strong&gt;small requests&lt;/strong&gt;, first byte latency is a decisive factor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the case of &lt;strong&gt;large requests&lt;/strong&gt;, experiments ranging from 8 MiB to 32 MiB showed that latency increases linearly with file size, ultimately reaching the bandwidth limit of a single request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Regarding &lt;strong&gt;hot data&lt;/strong&gt;, we use the first and the twentieth requests to represent scenarios of cold and hot data requests, respectively. In hot data request scenarios, latency is typically lower.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In GreptimeDB, the average latency data in the data file reading scenarios are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For operations involving reading Manifest Files averaging less than 1 KiB, the expected latency is around 30 ms (p50, Cold) / ~ 60 ms (p99, Cold).&lt;/li&gt;
&lt;li&gt;Reading an 8 MiB Parquet file would take ~ 240 ms (p50, Cold) / ~ 370 ms (p99, Cold).&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Noisy neighbors
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6c6s85g4z3swpojqchv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6c6s85g4z3swpojqchv.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experimental Method: Single request of 16 MiB&lt;br&gt;
Bandwidth Calculation Method: Total bytes / Duration&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is significant variability in object bandwidth, ranging from approximately 25 to 95 MiB/s.&lt;/li&gt;
&lt;li&gt;A considerable number of data points (15%) are at the maximum value (~95 MiB/s).&lt;/li&gt;
&lt;li&gt;The median performance is stable at 55-60 MiB/s.&lt;/li&gt;
&lt;li&gt;Performance tends to be higher on weekends.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Latency across different cloud providers
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vsaotp8wnqypmpm2n26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vsaotp8wnqypmpm2n26.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experimental Method: The test involves individual files of 16 MiB, with each request spaced 12 hours apart to reduce the influence of caching.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;S3 exhibits the highest latency among the tested services.&lt;/li&gt;
&lt;li&gt;S3 has a "minimum latency," meaning all data points exceed this value.&lt;/li&gt;
&lt;li&gt;Compared to AWS, the presence of outliers in the low latency range for other providers suggests they do not conceal the effects of caching.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The above phenomenon might be related to the hardware and implementation of S3. Overall, older hardware or different caching strategies could lead to the observed outcomes in points 2 and 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpifjz6xpdjy9xwi0jwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpifjz6xpdjy9xwi0jwv.png" alt="Image description" width="800" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The outcomes from the above figure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single file of 16 MiB, with 256 parallel requests to achieve maximum throughput (100 Gbps).&lt;/li&gt;
&lt;li&gt;The throughput bandwidth fluctuates with the region.&lt;/li&gt;
&lt;li&gt;The median bandwidth of AWS is 75 Gbps.&lt;/li&gt;
&lt;li&gt;The median bandwidth of Cloud X is 40 Gbps.&lt;/li&gt;
&lt;li&gt;The median bandwidth of Cloud Y is 50 Gbps.&lt;/li&gt;
&lt;li&gt;The difference between cold and hot data is minimal.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Optimal Request Size
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58pn1ln0tbt1gh7wfiud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58pn1ln0tbt1gh7wfiud.png" alt="Image description" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the above graph, we can see that the optimal request size usually lies between 8 and 16 MiB. Although 32 MiB requests cost a bit less, each one takes twice as long to download as a 16 MiB request at the same per-request bandwidth, making the trade-off less favorable.&lt;/li&gt;
&lt;/ul&gt;
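&lt;p&gt;To make the cost side of this trade-off concrete, here is a minimal sketch of the request-cost arithmetic. The GET price used below ($0.40 per million requests, roughly the commonly quoted S3 Standard rate) is an assumption for illustration, not a figure from the paper: halving the request size doubles the number of GETs, and therefore the request cost, for the same volume of data.&lt;/p&gt;

```python
# Assumed S3 Standard GET price: $0.40 per million requests (illustrative only).
GET_COST_PER_REQUEST = 0.40 / 1_000_000

def retrieval_cost(total_gib: float, request_mib: float) -> float:
    """Request cost (in dollars) of downloading `total_gib` GiB of data
    in chunks of `request_mib` MiB each."""
    n_requests = total_gib * 1024 / request_mib
    return n_requests * GET_COST_PER_REQUEST

# Downloading 1 TiB: 32 MiB requests cost half as much as 16 MiB ones,
# but each request takes about twice as long at the same per-request bandwidth.
cost_16 = retrieval_cost(1024, 16)   # 65,536 requests
cost_32 = retrieval_cost(1024, 32)   # 32,768 requests
```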

&lt;h3&gt;
  
  
  Encryption
&lt;/h3&gt;

&lt;p&gt;So far, all experiments conducted are based on non-secure HTTP connections. In this section, the authors compare the throughput performance with AES encryption enabled and after switching to HTTPS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxno8ly7v6c3b9d3ogto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxno8ly7v6c3b9d3ogto.png" alt="Image description" width="526" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTPS requires twice the CPU resources compared to HTTP.&lt;/li&gt;
&lt;li&gt;AES encryption increases CPU resources by only 30%.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In AWS, traffic between all regions, and even within availability zones, is automatically encrypted by the network infrastructure. Within the same location, due to VPC isolation, no other user can intercept the traffic between EC2 instances and the S3 gateway. &lt;strong&gt;Therefore, using HTTPS in this context is redundant.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Slow Requests
&lt;/h3&gt;

&lt;p&gt;In the experiments, the authors observed significant tail latency in some requests, with some even being lost without any notification. To address this, cloud providers recommend a request-hedging strategy: re-issuing requests that remain unresponsive past a latency threshold.&lt;/p&gt;

&lt;p&gt;The authors have gathered some empirical data on slow requests for 16 MiB files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;After 600 milliseconds, less than 5% of objects have not been successfully downloaded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Less than 5% of objects have a first byte latency exceeding 200 milliseconds.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on these observations, one can consider re-downloading attempts for requests exceeding a certain latency threshold.&lt;/p&gt;
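&lt;p&gt;The hedging idea can be sketched as follows. This is a minimal illustration, not GreptimeDB's or AnyBlob's implementation: &lt;code&gt;fetch&lt;/code&gt; is a hypothetical stand-in for an object-store GET, and the 600 ms threshold comes from the observation above that fewer than 5% of 16 MiB downloads are still pending at that point.&lt;/p&gt;

```python
import time
import concurrent.futures

HEDGE_AFTER_S = 0.6  # paper: under 5% of 16 MiB GETs are still pending after 600 ms

def hedged_get(fetch, key):
    """Issue a GET; if it has not finished within HEDGE_AFTER_S, launch a
    duplicate request and return whichever attempt completes first.
    `fetch(key)` is a hypothetical stand-in for an object-store download."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(fetch, key)
        done, _ = concurrent.futures.wait([first], timeout=HEDGE_AFTER_S)
        if done:
            return first.result()
        # Hedge: duplicate the straggler instead of waiting for it.
        second = pool.submit(fetch, key)
        done, _ = concurrent.futures.wait(
            [first, second],
            return_when=concurrent.futures.FIRST_COMPLETED,
        )
        return done.pop().result()

# Simulated straggler: the first attempt stalls, the hedge succeeds quickly.
attempts = []
def flaky_fetch(key):
    attempts.append(key)
    if len(attempts) == 1:
        time.sleep(2.0)  # straggling first request
    return b"object bytes"

result = hedged_get(flaky_fetch, "part-0001.parquet")
```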

&lt;h3&gt;
  
  
  Cloud Storage Data Request Model
&lt;/h3&gt;

&lt;p&gt;In their study, the authors observed that the bandwidth of a single request is similar to that when accessing data on an HDD (Hard Disk Drive). To fully utilize network bandwidth, a large number of concurrent requests are necessary. For analytical workloads, requests in the 8-16 MiB range are cost-effective. They devised a model to predict the number of requests needed to achieve a given throughput target.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsa3xmz0amj1vc3pm1vb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsa3xmz0amj1vc3pm1vb.png" alt="Image description" width="690" height="748"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The experiment utilized computing instances with a total bandwidth of 100 Gbps. In the graph, "Model (Hot)" represents the 25th percentile (p25) latency observed in previous experiments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp5x5pl5kxn8c2gmceca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp5x5pl5kxn8c2gmceca.png" alt="Image description" width="800" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The median base latency is approximately 30 ms, as determined from the 1 KiB trial in Figure 2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The median data latency is around 20 ms/MiB, with Cloud X and Cloud Y exhibiting lower rates (12–15 ms/MiB), calculated from the 16 MiB median minus the base latency in Figure 2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To achieve 100 Gbps on S3, 200-250 concurrent requests are necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With access latencies in tens of milliseconds and a bandwidth of about 50 MiB/s per object, it suggests that the object storage is likely HDD-based. This implies that reading at ∼80 Gbps from S3 is equivalent to accessing around 100 HDDs simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
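&lt;p&gt;As a back-of-the-envelope check of that model (a sketch using the median figures quoted above, not the paper's code): with ~30 ms base latency and ~20 ms/MiB data latency, a 16 MiB request takes about 350 ms, i.e. roughly 46 MiB/s per request, so saturating 100 Gbps indeed takes on the order of 250 concurrent requests.&lt;/p&gt;

```python
def concurrent_requests(target_gbps, size_mib, base_latency_s, data_latency_s_per_mib):
    """Number of parallel requests needed to sustain `target_gbps`, using the
    paper's model: request duration = base latency + size * per-MiB data latency."""
    duration_s = base_latency_s + size_mib * data_latency_s_per_mib
    per_request_mib_s = size_mib / duration_s        # bandwidth of one request
    target_mib_s = target_gbps * 1e9 / 8 / 2**20     # Gbps -> MiB/s
    return target_mib_s / per_request_mib_s

# Median S3 figures quoted above (assumed inputs): 30 ms base latency,
# ~20 ms/MiB data latency, 16 MiB requests.
n = concurrent_requests(100, 16, 0.030, 0.020)       # ~260 concurrent requests
```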

&lt;h2&gt;
  
  
  AnyBlob
&lt;/h2&gt;

&lt;p&gt;AnyBlob is a universal object storage library created by the authors, designed to support access to object storage services from various cloud providers. &lt;/p&gt;

&lt;p&gt;Compared to existing C++ libraries for S3, AnyBlob utilizes the &lt;code&gt;io_uring&lt;/code&gt; system call and removes the limitation of one-to-one thread mapping. The final results indicate that AnyBlob achieves higher performance with reduced CPU usage. However, it's worth considering that the primary reason for this improvement might be the subpar quality of the existing C++ S3 libraries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79soduuk07a89ofshnys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79soduuk07a89ofshnys.png" alt="Image description" width="800" height="748"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain name resolution strategies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AnyBlob incorporates some noteworthy features. The authors noted that resolving domain names for each request introduces significant latency overhead. To address this, they implemented strategies including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching Multiple Endpoint IPs&lt;/strong&gt;: Storing the IP addresses of multiple endpoints in a cache and scheduling requests across them. Endpoints whose performance noticeably deteriorates are replaced based on collected statistics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Based on MTU (Maximum Transmission Unit)&lt;/strong&gt;: Different S3 endpoints have different MTUs. Some support jumbo frames up to 9001 bytes, which can significantly reduce CPU overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MTU Discovery Strategy&lt;/strong&gt;: This involves pinging the target endpoint's IP with a payload larger than 1500 bytes and the DF (Don't Fragment) flag set, to determine whether it supports larger MTUs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Integration with Cloud Storage
&lt;/h2&gt;

&lt;p&gt;In this section, the authors discuss how they integrated cloud storage. Overall, these ideas are converging in practice, and the specific implementation details depend on the engineering practices of different teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs0xsqckr27hc2zppa1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs0xsqckr27hc2zppa1x.png" alt="Image description" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Strategy&lt;/strong&gt;&lt;br&gt;
If the processing speed of requested data is slow, then reduce the number of download threads (and tasks) and increase the number of request threads (and tasks).&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Evaluation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Download Performance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3cnlehhq4f8cub6e18m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3cnlehhq4f8cub6e18m.png" alt="Image description" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experimental Parameters: TPC-H scale factor 500 ( ~500 GiB of data).&lt;/p&gt;

&lt;p&gt;The authors categorized the queries into two types: retrieval-heavy and computation-heavy.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Retrieval-heavy examples: Queries 1, 6, and 19. These are characterized by a constant multiple difference in performance between In-Memory and Remote storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computation-heavy examples: Queries 9 and 18. These are marked by a very small performance difference between In-Memory and Remote storage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Comparison of Different Storage Types
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmizteb7l8vyx0b3ehpbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmizteb7l8vyx0b3ehpbh.png" alt="Image description" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EBS (Elastic Block Store) exhibits the poorest performance, likely due to the use of lower-tier options like gp2/gp3, which offer around 1 GiB/s of bandwidth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrjbbza6ojeankhzwob4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrjbbza6ojeankhzwob4.png" alt="Image description" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Retrieval-Heavy (Q1): The bottleneck in this type of query lies in the network bandwidth. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computation-Heavy (Q9): The performance improves with an increase in the number of cores. The throughput of the Remote (the Umbra) is nearly the same as that of the in-memory version.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  End-To-End Study with Compression &amp;amp; AES
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fdq58cw8f4rxtawt80z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fdq58cw8f4rxtawt80z.png" alt="Image description" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experimental Parameters: Scale Factor (SF) of 100 (~ 100 GiB) and 1,000 (~ 1 TiB of data). &lt;/p&gt;

&lt;p&gt;The Snowflake used in the experiment is a large-size configuration, while Umbra utilized EC2 c5d.18xlarge instances, with caching disabled.&lt;/p&gt;

&lt;p&gt;Overall, this comparison might be insufficiently strict. For example, it lacks detailed information about the Snowflake setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For the Large-size Snowflake, there might be issues with overselling and throttling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Snowflake group may have purchased a standard, lower-tier version, which could also impact the results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this also highlights another point: benchmark marketing often relies on statistical sleight of hand, such as hiding the one query that missed the cache behind the p99. In other words, optimizing a benchmark that runs each query 10 times is not on the same scale of effort as optimizing one that runs it 100 times.&lt;/p&gt;
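&lt;p&gt;To make the p99 point concrete, here is a small self-contained sketch (the latency numbers are made up for illustration): with 100 runs of a query where exactly one misses the cache, a nearest-rank p99 reports only the fast runs.&lt;/p&gt;

```python
# Hypothetical latencies in ms: 99 cached runs plus one 500 ms cache miss.
latencies = [10.0] * 99 + [500.0]

def percentile(samples, p):
    # Nearest-rank percentile: value at ceil(p/100 * n) in sorted order.
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

print(percentile(latencies, 99))  # the 500 ms outlier is invisible at p99
```

With 100 samples, the single slow run only surfaces at p100 (the maximum), so a report that stops at p99 never shows it.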

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Overall, this article provides substantial data support and insights in several areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics of object storage&lt;/li&gt;
&lt;li&gt;Optimal file size for data requests&lt;/li&gt;
&lt;li&gt;The impact of enabling HTTPS&lt;/li&gt;
&lt;li&gt;Cloud storage data request model&lt;/li&gt;
&lt;li&gt;Scheduling queries and download tasks based on statistical information&lt;/li&gt;
&lt;li&gt;Empirical data on handling slow requests&lt;/li&gt;
&lt;li&gt;Utilization of MTU jumbo frames&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the upcoming GreptimeDB 0.7.0 release, we have implemented extensive optimizations in querying, including enhancements for queries on object storage. In some scenarios, query response times now approach those of local storage. &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;Star us on GitHub&lt;/a&gt; and stay tuned with GreptimeDB; we look forward to you trying it out and welcome any feedback and discussion.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>database</category>
    </item>
    <item>
      <title>GreptimeAI + Xinference - Efficient Deployment and Monitoring of Your LLM Applications</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Wed, 24 Jan 2024 12:18:49 +0000</pubDate>
      <link>https://dev.to/greptime/greptimeai-xinference-efficient-deployment-and-monitoring-of-your-llm-applications-2mop</link>
      <guid>https://dev.to/greptime/greptimeai-xinference-efficient-deployment-and-monitoring-of-your-llm-applications-2mop</guid>
      <description>&lt;p&gt;With the rapid evolution of artificial intelligence technology, OpenAI has established itself as a frontrunner in the field. It demonstrates remarkable proficiency in a range of language processing tasks, including machine translation, text classification, and text generation. Parallel to OpenAI's ascent, many high-quality, open-source large language models such as Llama, ChatGLM, and Qwen have also gained prominence. These exceptional open-source models are invaluable assets for teams aiming to swiftly develop robust Large Language Model (LLM) applications.&lt;/p&gt;

&lt;p&gt;With a myriad of options at hand, the challenge becomes how to use OpenAI's interface uniformly across models while also reducing development costs. Additionally, efficiently and continuously monitoring the performance of LLM applications is crucial, but how can this be done without adding development complexity? &lt;a href="https://greptime.com/product/ai"&gt;GreptimeAI&lt;/a&gt; and &lt;a href="https://github.com/xorbitsai/inference"&gt;Xinference&lt;/a&gt; offer pragmatic solutions to these pivotal concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GreptimeAI?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.greptime.com/product/ai"&gt;GreptimeAI&lt;/a&gt;, built upon the open-source time-series database &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt;, offers an observability solution for Large Language Model (LLM) applications, currently supporting both &lt;a href="https://www.langchain.com/"&gt;LangChain&lt;/a&gt; and &lt;a href="https://openai.com/"&gt;OpenAI's&lt;/a&gt; ecosystem. GreptimeAI enables you to understand cost, performance, traffic and security aspects in real-time, helping teams enhance the reliability of LLM applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Xinference?
&lt;/h2&gt;

&lt;p&gt;Xorbits Inference (&lt;a href="https://github.com/xorbitsai/inference"&gt;Xinference&lt;/a&gt;) is an open-source platform to streamline the operation and integration of a wide array of AI models. With Xinference, you’re empowered to run inference using any open-source LLMs, embedding models, and multimodal models either in the cloud or on your own premises, and create robust AI-driven applications. It provides a RESTful API compatible with OpenAI API, Python SDK, CLI, and WebUI. Furthermore, it integrates third-party developer tools like LangChain, &lt;a href="https://www.llamaindex.ai/"&gt;LlamaIndex&lt;/a&gt;, and &lt;a href="https://dify.ai/"&gt;Dify&lt;/a&gt;, facilitating model integration and development. &lt;/p&gt;

&lt;p&gt;Xinference supports multiple inference engines such as Transformers, vLLM, and GGML and is suitable for various hardware environments. It also supports multiple-nodes deployment, efficiently allocating model inference tasks across multiple devices or machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Utilize GreptimeAI + Xinference to Deploy and Monitor an LLM App
&lt;/h2&gt;

&lt;p&gt;Next, we will take the Llama 2 model as an example to demonstrate how to install and run the model locally using Xinference. This example will feature the use of an OpenAI-style function call to conduct a weather query. Additionally, we will demonstrate how GreptimeAI can be effectively utilized to monitor the usage and performance of the LLM application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Register and Get GreptimeAI Configuration Info
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://console.greptime.cloud"&gt;https://console.greptime.cloud&lt;/a&gt; to register and create an AI service, then go to the Dashboard and open the Setup page to get the configuration information for OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jbvdpxchkda460hgjq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jbvdpxchkda460hgjq1.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Start the Xinference Model Service
&lt;/h3&gt;

&lt;p&gt;Initiating the Xinference model service locally is pretty straightforward. Simply enter the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;xinference&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, Xinference starts the service on your local machine, typically on port 9997. Installing Xinference locally is not covered here; refer to &lt;a href="https://inference.readthedocs.io/en/latest/getting_started/installation.html"&gt;this article&lt;/a&gt; for installation instructions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Launch the Model via Web UI
&lt;/h4&gt;

&lt;p&gt;After starting Xinference, you can access its Web UI by entering &lt;a href="http://localhost:9997"&gt;http://localhost:9997&lt;/a&gt; in your browser. This provides a user-friendly interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgdbmvbr0zc0oyl0uvnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgdbmvbr0zc0oyl0uvnp.png" alt="Image description" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Launch the Model via Command Line Tool
&lt;/h4&gt;

&lt;p&gt;Alternatively, the model can be launched using Xinference's command-line tool. The default Model UID is set to &lt;code&gt;llama-2-chat&lt;/code&gt;, which will be used subsequently for accessing the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;xinference&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;pytorch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Obtain Weather Information through an OpenAI-Styled Interface
&lt;/h3&gt;

&lt;p&gt;Suppose we have the capability to fetch weather information for a specific city using the &lt;code&gt;get_current_weather&lt;/code&gt; function, with parameters &lt;code&gt;location&lt;/code&gt; and &lt;code&gt;format&lt;/code&gt;.&lt;/p&gt;
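&lt;p&gt;The article does not show the body of &lt;code&gt;get_current_weather&lt;/code&gt;; a minimal stand-in for local testing might look like this (the hard-coded readings are purely hypothetical, a real implementation would call a weather API). Values are returned as strings to match the tool-result payload used later in this article:&lt;/p&gt;

```python
# Hypothetical stand-in for get_current_weather; real code would query a weather API.
def get_current_weather(location: str, format: str = "celsius") -> dict:
    fake_readings_c = {"New York": 10, "London": 8}  # made-up data, in Celsius
    temp_c = fake_readings_c.get(location, 20)       # default for unknown cities
    if format == "fahrenheit":
        return {"temperature": str(temp_c * 9 // 5 + 32), "temperature_unit": "fahrenheit"}
    return {"temperature": str(temp_c), "temperature_unit": "celsius"}
```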

&lt;h4&gt;
  
  
  Configure and Call the OpenAI Interface
&lt;/h4&gt;

&lt;p&gt;Access the local Xinference service using OpenAI's Python SDK, with GreptimeAI collecting metrics and traces. Create dialogues using the &lt;code&gt;chat.completions&lt;/code&gt; module, and pass the list of functions defined below via &lt;code&gt;tools&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;greptimeai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai_patcher&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:9997/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai_patcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Do not make assumptions about the values in the function calls.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in New York now&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;chat_completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-2-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;func_name: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;func_args: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Details of the &lt;code&gt;tools&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The definition of the function calling tool list is as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_current_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;obtain current weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city, such as New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fahrenheit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the temperature unit used, determined by the specific city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is as follows, showing the result generated by the Llama 2 model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;get_current_weather&lt;/span&gt;
&lt;span class="n"&gt;func_args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
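&lt;p&gt;Note that &lt;code&gt;function.arguments&lt;/code&gt; comes back as a JSON-encoded string, so it needs to be parsed before the local function can be invoked. A minimal sketch, using the sample output above as input:&lt;/p&gt;

```python
import json

# Tool-call arguments arrive as a JSON string; parse them before dispatching.
raw_args = '{"location": "New York", "format": "celsius"}'  # sample output from above
args = json.loads(raw_args)
print(args["location"], args["format"])
```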



&lt;h4&gt;
  
  
  Retrieve the Function Call Results and Make Subsequent Calls
&lt;/h4&gt;

&lt;p&gt;Let's assume that we have invoked the &lt;code&gt;get_current_weather&lt;/code&gt; function with specified parameters and obtained the results. These results, along with the context, will then be resent to the Llama 2 model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature_unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;chat_completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-2-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Final Result
&lt;/h4&gt;

&lt;p&gt;The Llama 2 model ultimately generates the following response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The current temperature in New York is 10 degrees Celsius.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GreptimeAI Dashboard
&lt;/h2&gt;

&lt;p&gt;On the GreptimeAI Dashboard, you can comprehensively monitor the behavior of LLM applications built on the OpenAI interface in real time, including key metrics such as token usage, cost, latency, and traces. &lt;br&gt;
Below is a screenshot of the Dashboard's overview page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszj45pi7lq821wrglkop.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszj45pi7lq821wrglkop.jpeg" alt="Image description" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;If you're developing LLM applications with open-source large language models and want to call them through an OpenAI-style API, then pairing Xinference for managing your inference models with GreptimeAI for performance monitoring is a good choice. Whether it's for complex data analysis or simple routine queries, Xinference offers robust and flexible model management capabilities, while GreptimeAI's monitoring features help you understand and optimize your model's performance and resource usage.&lt;/p&gt;

&lt;p&gt;We look forward to seeing what you achieve with these tools and are eager to hear about your insights and experiences using GreptimeAI and Xinference. If you run into any issues or have feedback, please don't hesitate to reach out to us at &lt;a href="mailto:info@greptime.com"&gt;info@greptime.com&lt;/a&gt; or via &lt;a href="https://www.greptime.com/slack"&gt;Slack&lt;/a&gt;. Together, let's delve into the vast and exciting realm of artificial intelligence!&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>openai</category>
      <category>llm</category>
      <category>greptimeai</category>
    </item>
    <item>
      <title>Memory Leak Diagnosing using Flame Graphs</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Fri, 19 Jan 2024 08:49:46 +0000</pubDate>
      <link>https://dev.to/greptime/memory-leak-diagnosing-using-flame-graphs-4bdk</link>
      <guid>https://dev.to/greptime/memory-leak-diagnosing-using-flame-graphs-4bdk</guid>
      <description>&lt;p&gt;Starting with &lt;a href="https://github.com/GreptimeTeam/greptimedb/pull/1733"&gt;greptimedb#1733&lt;/a&gt; in last June, GreptimeDB has adopted &lt;a href="https://github.com/jemalloc/jemalloc"&gt;Jemalloc&lt;/a&gt; as its default memory allocator. This change not only boosts performance and reduces memory fragmentation but also offers convenient memory analysis capabilities.&lt;/p&gt;

&lt;p&gt;In our previous article, &lt;a href="https://greptime.com/blogs/2023-06-15-rust-memory-leaks"&gt;Unraveling Rust Memory Leaks: Easy-to-Follow Techniques for Identifying and Solving Memory Issues&lt;/a&gt;, we explored several common methods for analyzing memory leaks in Rust applications. &lt;/p&gt;

&lt;p&gt;Here in this article, I will delve into detailed techniques for troubleshooting based on Jemalloc. If you encounter any unusual memory usage issues while using or developing &lt;a href="https://github.com/GreptimeTeam/greptimedb"&gt;GreptimeDB&lt;/a&gt;, refer to this article for quick diagnostics and identification of potential memory leaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Install the &lt;code&gt;flamegraph.pl&lt;/code&gt; script
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://raw.githubusercontent.com/brendangregg/FlameGraph/master/flamegraph.pl &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/.local/bin/flamegraph.pl
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/.local/bin/flamegraph.pl
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/.local/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;flamegraph.pl&lt;/code&gt;, authored by Brendan Gregg, is a Perl script designed for visualizing hot spots in code call stacks. &lt;a href="https://www.brendangregg.com"&gt;Brendan Gregg&lt;/a&gt; is an expert in system performance optimization. We are grateful to him for developing and open-sourcing numerous tools, including &lt;code&gt;flamegraph.pl&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Install the &lt;code&gt;jeprof&lt;/code&gt; command
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For Ubuntu&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libjemalloc-dev

&lt;span class="c"&gt;# For Fedora&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install &lt;/span&gt;jemalloc-devel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;For other operating systems, you can find the dependency packages for &lt;code&gt;jeprof&lt;/code&gt; through &lt;a href="https://pkgs.org"&gt;pkgs.org&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Enabling Heap Profiling in GreptimeDB
&lt;/h3&gt;

&lt;p&gt;The heap profiling feature in GreptimeDB is disabled by default. You can enable it by compiling GreptimeDB with the &lt;code&gt;mem-prof&lt;/code&gt; feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; mem-prof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The discussion about whether the &lt;code&gt;mem-prof&lt;/code&gt; feature should be enabled by default is ongoing in &lt;a href="https://github.com/GreptimeTeam/greptimedb/issues/3166"&gt;greptimedb#3166&lt;/a&gt;. You are welcome to share your opinion there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Starting GreptimeDB with &lt;code&gt;mem-prof&lt;/code&gt; Feature
&lt;/h3&gt;

&lt;p&gt;To enable the heap profiling feature, you need to set the &lt;code&gt;MALLOC_CONF&lt;/code&gt; environment variable when starting the GreptimeDB process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MALLOC_CONF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prof:true &amp;lt;path_to_greptime_binary&amp;gt; standalone start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use the &lt;code&gt;curl&lt;/code&gt; command to check whether heap profiling is enabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &amp;lt;greptimedb_ip&amp;gt;:4000/v1/prof/mem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the heap profiling feature is turned on, executing the &lt;code&gt;curl&lt;/code&gt; command should yield a response similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;heap_v2/524288
  t*: 125: 136218 [0: 0]
  t0: 59: 31005 [0: 0]
...

MAPPED_LIBRARIES:
55aa05c66000-55aa0697a000 r--p 00000000 103:02 40748099                  /home/lei/workspace/greptimedb/target/debug/greptime
55aa0697a000-55aa11e74000 r-xp 00d14000 103:02 40748099                  /home/lei/workspace/greptimedb/target/debug/greptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you receive the response &lt;code&gt;{"error":"Memory profiling is not enabled"}&lt;/code&gt;, it indicates that the &lt;code&gt;MALLOC_CONF=prof:true&lt;/code&gt; environment variable has not been set correctly. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For information on the data format returned by the heap profiling API, refer to the &lt;a href="https://jemalloc.net/jemalloc.3.html#heap_profile_format"&gt;HEAP PROFILE FORMAT - jemalloc.net&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
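&lt;p&gt;To get a quick feel for the numbers in this response, the summary lines can be parsed mechanically. Below is a rough Python sketch (my own illustration, not part of GreptimeDB) that extracts the in-use object and byte counts from the per-thread lines; the field meanings follow the jemalloc heap profile format linked above.&lt;/p&gt;

```python
import re

# Matches summary lines like "t*: 125: 136218 [0: 0]", i.e.
# "<thread>: <in-use objects>: <in-use bytes> [<cumulative objects>: <cumulative bytes>]"
LINE = re.compile(r"^\s*(t\S+):\s*(\d+):\s*(\d+)\s*\[(\d+):\s*(\d+)\]$")

def parse_heap_summary(text: str) -> dict:
    """Return {thread: (in_use_objects, in_use_bytes)} from a heap_v2 dump."""
    result = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            thread, objs, nbytes = m.group(1), int(m.group(2)), int(m.group(3))
            result[thread] = (objs, nbytes)
    return result

# The sample response shown above, truncated to the summary lines.
sample = """heap_v2/524288
  t*: 125: 136218 [0: 0]
  t0: 59: 31005 [0: 0]
"""
print(parse_heap_summary(sample)["t*"])  # (125, 136218)
```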

&lt;h2&gt;
  
  
  Begin your Memory Exploration Journey
&lt;/h2&gt;

&lt;p&gt;By using the command &lt;code&gt;curl &amp;lt;greptimedb_ip&amp;gt;:4000/v1/prof/mem&lt;/code&gt;, you can quickly obtain details of the memory allocated by GreptimeDB. The tools &lt;code&gt;jeprof&lt;/code&gt; and &lt;code&gt;flamegraph.pl&lt;/code&gt; can then render these details as a flame graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# To get memory allocation details&lt;/span&gt;
curl &amp;lt;greptimedb_ip&amp;gt;:4000/v1/prof/mem &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mem.hprof

&lt;span class="c"&gt;# To generate a flame graph of memory allocation&lt;/span&gt;
jeprof &amp;lt;path_to_greptime_binary&amp;gt; ./mem.hprof &lt;span class="nt"&gt;--collapse&lt;/span&gt; | flamegraph.pl &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mem-prof.svg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the above commands, a flame graph named &lt;code&gt;mem-prof.svg&lt;/code&gt; will be generated in the working directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qjXTTrUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zsq8t9avqq90vuhhybhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qjXTTrUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zsq8t9avqq90vuhhybhn.png" alt="Image description" width="800" height="1614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Interpret the Flame Graph
&lt;/h3&gt;

&lt;p&gt;Created by Brendan Gregg, the flame graph is a powerful tool for analyzing CPU overhead and memory allocation details. It is generated by recording the function call stack that triggers each memory allocation event at every sampling point. &lt;/p&gt;

&lt;p&gt;After enough samples have been recorded, the call stacks of the individual allocations are merged, revealing the amount of memory allocated by each function call and its child calls.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The bottom of the flame graph represents the base of the function stack, while the top represents the stack top.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each cell in the flame graph represents a function call; the cells below it are its callers, and the cells above it are its callees, the functions it calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The width of a cell indicates the total amount of memory allocated by that function and its child functions; wider cells allocate more memory. If a function allocates a lot of memory but has few child functions (appearing as a wide stack top, known as a plateau), the function itself likely performs a large number of allocation operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The color of each cell in the flame graph is a random warm color.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Opening the flame graph's SVG file in a browser allows for interactive clicking into each function for more detailed analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ricctcxr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ue5ihv53b1xf3f9ib954.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ricctcxr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ue5ihv53b1xf3f9ib954.png" alt="Image description" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Accelerating Flame Graph Generation
&lt;/h3&gt;

&lt;p&gt;The heap memory details returned by Jemalloc include the addresses of each function in the call stack. Generating the flame graph requires translating these addresses into file names and line numbers, which is the most time-consuming step. Typically on Linux systems, this task is accomplished by the &lt;code&gt;addr2line&lt;/code&gt; tool from &lt;a href="https://www.gnu.org/software/binutils/"&gt;GNU Binutils&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;To speed up the generation of the flame graph, we can replace the Binutils &lt;code&gt;addr2line&lt;/code&gt; tool with &lt;a href="https://github.com/gimli-rs/addr2line"&gt;&lt;code&gt;gimli-rs/addr2line&lt;/code&gt;&lt;/a&gt;, thereby achieving at least a &lt;strong&gt;2x increase&lt;/strong&gt; in speed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gimli-rs/addr2line
&lt;span class="nb"&gt;cd &lt;/span&gt;addr2line
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /usr/bin/addr2line /usr/bin/addr2line-bak
&lt;span class="nb"&gt;sudo cp &lt;/span&gt;target/release/examples/addr2line /usr/bin/addr2line 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Catching Memory Leaks through Allocation Differences
&lt;/h3&gt;

&lt;p&gt;In most memory leak cases, memory usage tends to increase slowly. Capturing memory usage at two points during this growth and analyzing the difference between them therefore often points to the source of the leak.&lt;/p&gt;

&lt;p&gt;We can collect the memory data at the initial time point to establish a baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &amp;lt;greptimedb_ip&amp;gt;:4000/v1/prof/mem &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; base.hprof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once memory usage has grown in a way that suggests a possible leak, we collect the memory data again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &amp;lt;greptimedb_ip&amp;gt;:4000/v1/prof/mem &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; leak.hprof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, using &lt;code&gt;base.hprof&lt;/code&gt; as the baseline, analyze the memory usage and generate a flame graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jeprof &amp;lt;path_to_greptime_binary&amp;gt; &lt;span class="nt"&gt;--base&lt;/span&gt; ./base.hprof ./leak.hprof &lt;span class="nt"&gt;--collapse&lt;/span&gt; | flamegraph.pl &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; leak.svg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the flame graph generated with the &lt;code&gt;--base&lt;/code&gt; parameter specifying the baseline, only the memory allocation differences between the current memory collection and the baseline will be included. This allows for a clearer understanding of which function calls are responsible for the increase in memory usage.&lt;/p&gt;
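&lt;p&gt;Conceptually, the &lt;code&gt;--base&lt;/code&gt; option subtracts the baseline's per-stack counts from the later capture, leaving only the growth. A small Python sketch of that subtraction over collapsed-stack lines (hypothetical stacks, not real jeprof output):&lt;/p&gt;

```python
def parse(lines):
    """Collapsed-stack lines ("frames... value") -> {stack: total value}."""
    out = {}
    for line in lines:
        stack, value = line.rsplit(" ", 1)
        out[stack] = out.get(stack, 0) + int(value)
    return out

def diff(base_lines, leak_lines):
    """Keep only stacks whose allocation grew relative to the baseline."""
    base, leak = parse(base_lines), parse(leak_lines)
    grown = {}
    for stack, value in leak.items():
        delta = value - base.get(stack, 0)
        if delta > 0:
            grown[stack] = delta
    return grown

base = ["main;cache;insert 4096", "main;io;read 1024"]
leak = ["main;cache;insert 262144", "main;io;read 1024"]
print(diff(base, leak))  # {'main;cache;insert': 258048}
```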

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://rust-lang.github.io/rfcs/1974-global-allocators.html#jemalloc"&gt;https://rust-lang.github.io/rfcs/1974-global-allocators.html#jemalloc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html"&gt;https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/gimli-rs/addr2line"&gt;https://github.com/gimli-rs/addr2line&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/GreptimeTeam/greptimedb/pull/1733"&gt;https://github.com/GreptimeTeam/greptimedb/pull/1733&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/GreptimeTeam/greptimedb/pull/1124"&gt;https://github.com/GreptimeTeam/greptimedb/pull/1124&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>memory</category>
      <category>rust</category>
      <category>greptime</category>
    </item>
    <item>
      <title>GreptimeDB v0.6 Released - Support Migrating Table's Regions between Datanodes</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Wed, 17 Jan 2024 08:30:54 +0000</pubDate>
      <link>https://dev.to/greptime/greptimedb-v06-released-support-migrating-tables-regions-between-datanodes-309k</link>
      <guid>https://dev.to/greptime/greptimedb-v06-released-support-migrating-tables-regions-between-datanodes-309k</guid>
      <description>&lt;p&gt;As 2024 dawned, the Greptime team, invigorated by the New Year's fresh momentum, continued their efforts towards innovative version iterations. Just three weeks following our &lt;a href="https://www.greptime.com/blogs/2023-12-29-greptimedbv0.5"&gt;previous update&lt;/a&gt;, we are thrilled to announce a new version to our open-source time-series database: GreptimeDB v0.6.&lt;/p&gt;

&lt;p&gt;This update marks a substantial leap from GreptimeDB v0.5, incorporating several major improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  GreptimeDB v0.6 Updates
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Region Migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In version 0.5, we enabled support for Kafka WAL, making it possible to synchronize and migrate Region data across multiple Datanodes. In version 0.6, we initially implemented the Region Migration feature, &lt;strong&gt;providing users with the ability to migrate table Regions between Datanodes while ensuring data integrity. This lays a solid foundation for dynamically adjusting cluster load.&lt;/strong&gt; For example, as query performance requirements increase, users can easily migrate table Regions to Datanodes with lower loads or larger specifications through Region Migration, achieving better query performance. &lt;/p&gt;

&lt;p&gt;In the future, we plan to introduce dynamic Region distribution. This feature is designed to intelligently redistribute data Regions, leveraging real-time monitoring of workload conditions and business requirements while ensuring uninterrupted service. This strategic enhancement aims to optimize resource utilization. By doing so, it not only promotes more efficient and smarter data management but also ensures robust and adaptive support for the ever-evolving demands of the business environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional Updates
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Added a configuration item that allows specifying the default time zone for queries&lt;/li&gt;
&lt;li&gt;Added the &lt;code&gt;--store-key-prefix&lt;/code&gt; configuration option, which lets administrators specify the key prefix used by metasrv to avoid key name conflicts&lt;/li&gt;
&lt;li&gt;Implemented the &lt;code&gt;OR&lt;/code&gt; logical operator in PromQL

&lt;ul&gt;
&lt;li&gt;Added a special &lt;code&gt;UNION&lt;/code&gt; operator (&lt;code&gt;OR&lt;/code&gt; in PromQL) for certain PromQL query scenarios. This operator takes two input nodes. All columns from the left child node are output, and the columns specified in &lt;code&gt;compare_keys&lt;/code&gt; are used to check for duplicates. When duplicates occur, if both rows come from the right node, only the first row is retained; if one comes from the left node, the row from the right node is discarded. The output includes columns from both the left and right nodes, and the order of rows is not fixed.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
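&lt;p&gt;The deduplication rule of this &lt;code&gt;UNION&lt;/code&gt; operator can be illustrated with a short sketch. The Python below is only an illustration of the semantics described above, not GreptimeDB's actual implementation; rows are dicts, &lt;code&gt;compare_keys&lt;/code&gt; names the columns checked for duplicates, and (unlike the real operator) the sketch emits rows in a fixed order:&lt;/p&gt;

```python
def promql_or(left_rows, right_rows, compare_keys):
    """All left rows pass through; a right row is dropped when a row with
    the same compare_keys values was already emitted (from either side)."""
    seen = set()
    out = []
    for row in left_rows:
        out.append(row)
        seen.add(tuple(row[k] for k in compare_keys))
    for row in right_rows:
        key = tuple(row[k] for k in compare_keys)
        if key not in seen:  # first occurrence wins
            out.append(row)
            seen.add(key)
    return out

left = [{"host": "a", "val": 1}]
right = [{"host": "a", "val": 9}, {"host": "b", "val": 2}]
print(promql_or(left, right, ["host"]))
# [{'host': 'a', 'val': 1}, {'host': 'b', 'val': 2}]
```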

&lt;h2&gt;
  
  
  Future Plans
&lt;/h2&gt;

&lt;p&gt;Our next milestone, v0.7, promises to be even more exciting. &lt;/p&gt;

&lt;p&gt;We plan to introduce a brand new indexing module, with its first implementation being an inverted index. This module aims to significantly boost performance when filtering and querying a small subset of time-series from vast datasets, a key focus for our Metric Engine in observable scenarios. Our team is currently rigorously testing the integration of these features to ensure optimal performance and stability. Stay tuned for the much-anticipated release of GreptimeDB v0.7!&lt;/p&gt;

</description>
      <category>database</category>
      <category>version</category>
      <category>region</category>
    </item>
    <item>
      <title>Streamline your OpenAI Monitoring Experience with GreptimeAI</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Fri, 12 Jan 2024 07:12:05 +0000</pubDate>
      <link>https://dev.to/greptime/streamline-your-openai-monitoring-experience-with-greptimeai-137m</link>
      <guid>https://dev.to/greptime/streamline-your-openai-monitoring-experience-with-greptimeai-137m</guid>
      <description>&lt;p&gt;With the rapid advancement of artificial intelligence technology, &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; has emerged as one of the leaders in the field. It excels in various language processing tasks, including machine translation, text classification, and text generation.&lt;/p&gt;

&lt;p&gt;However, the critical role of continuous monitoring of API calls while using OpenAI should not be underestimated. This practice is crucial not only for identifying performance bottlenecks and analyzing usage patterns but also for swiftly detecting and addressing any issues that arise with the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  GreptimeAI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://greptime.com/product/ai" rel="noopener noreferrer"&gt;GreptimeAI&lt;/a&gt; offers a tailor-made observability solution specifically designed for monitoring and managing large language model (LLM) applications. This solution provides comprehensive insights into the cost, performance, traffic, and security aspects of OpenAI usage. For more details about GreptimeAI, please refer to this &lt;a href="https://greptime.com/blogs/2023-11-09-greptimeai" rel="noopener noreferrer"&gt;article&lt;/a&gt;. Notably, &lt;a href="https://greptime.com/product/ai" rel="noopener noreferrer"&gt;GreptimeAI&lt;/a&gt; is built upon the open-source time-series database, &lt;a href="https://github.com/GrepTimeTeam/greptimedb/" rel="noopener noreferrer"&gt;GreptimeDB&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI Modules being Monitored
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;chat&lt;/li&gt;
&lt;li&gt;completion&lt;/li&gt;
&lt;li&gt;audio&lt;/li&gt;
&lt;li&gt;images&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scenarios Supported
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;async&lt;/li&gt;
&lt;li&gt;stream&lt;/li&gt;
&lt;li&gt;with_raw_response&lt;/li&gt;
&lt;li&gt;retry&lt;/li&gt;
&lt;li&gt;error&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  User Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;greptimeai&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Registration
&lt;/h3&gt;

&lt;p&gt;To get started, sign up for GreptimeAI and create a service, which provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;host&lt;/li&gt;
&lt;li&gt;database&lt;/li&gt;
&lt;li&gt;token&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setting up
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GREPTIMEAI_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'xxx'&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GREPTIMEAI_DATABASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'xxx'&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GREPTIMEAI_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'xxx'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Here is a simple example illustrating how to call the OpenAI chat completions API with GreptimeAI tracking enabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;greptimeai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai_patcher&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;openai_patcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I output all files in a directory using Python?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;user_id_from_your_application&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
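&lt;p&gt;Because the example sets &lt;code&gt;stream=True&lt;/code&gt;, &lt;code&gt;completion&lt;/code&gt; is an iterator of chunks rather than a single response, and your code still has to consume it. The Python sketch below shows the usual accumulation loop; it uses stand-in chunk objects shaped like the OpenAI v1 streaming response so that it runs without an API key (in real code, the chunks come from the &lt;code&gt;create&lt;/code&gt; call above):&lt;/p&gt;

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the incremental text deltas of a chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta is not None:  # the final chunk's delta is typically empty
            parts.append(delta)
    return "".join(parts)

# Stand-in chunks shaped like the OpenAI v1 streaming response.
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Use ", "os.listdir", "."]
] + [SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))])]

print(collect_stream(fake))  # Use os.listdir.
```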



&lt;h2&gt;
  
  
  What It Looks Like in GreptimeAI
&lt;/h2&gt;

&lt;p&gt;Dashboard overview:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo81h12smhhkr58ee45wn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo81h12smhhkr58ee45wn.jpeg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following graph shows the trace detail with multiple spans.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mwhucq5t4gjt0tli6p8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mwhucq5t4gjt0tli6p8.jpeg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>monitoring</category>
      <category>database</category>
    </item>
    <item>
      <title>Practical Tips for Choosing the Right AWS EC2 for your Workload</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Thu, 11 Jan 2024 16:18:47 +0000</pubDate>
      <link>https://dev.to/greptime/practical-tips-for-choosing-the-right-aws-ec2-for-your-workload-2dk1</link>
      <guid>https://dev.to/greptime/practical-tips-for-choosing-the-right-aws-ec2-for-your-workload-2dk1</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;AWS EC2, the Elastic Compute Cloud service from Amazon Web Services, offers developers user-friendly and flexible virtual machines. As one of the most established services within AWS, alongside S3, EC2 has a rich history dating back to its inception in 2006. Over nearly 17 years, it has continuously evolved, underlining its significance and reliability in the cloud computing space.&lt;/p&gt;

&lt;p&gt;Many people new to AWS EC2 might have similar feelings:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;There are too many types of AWS EC2 (hundreds)! Which one should I choose to meet my business needs without exceeding the budget?&lt;/li&gt;
&lt;li&gt;If the CPU and Memory configurations of EC2 are the same, does it mean their performance differences are also the same?&lt;/li&gt;
&lt;li&gt;What is the most cost-effective EC2 payment mode?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Looking back at the initial launch of EC2, there were only &lt;a href="https://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud#:~:text=Amazon%20EC2%20was%20developed%20mostly,along%20with%20Willem%20van%20Biljon." rel="noopener noreferrer"&gt;two types of instance&lt;/a&gt; available. Fast forward to today, and the landscape has expanded to an impressive &lt;a href="https://github.com/aws/aws-sdk-go-v2/blob/main/service/ec2/types/enums.go#L3707-L4487" rel="noopener noreferrer"&gt;781 different types&lt;/a&gt;. This vast selection presents developers with a wide array of choices, potentially leading to a challenging decision-making process.&lt;/p&gt;

&lt;p&gt;This article will briefly introduce some tips for selecting EC2 instances to help readers choose the right EC2 type more smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Classification and Selection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Meet the EC2 family
&lt;/h3&gt;

&lt;p&gt;Although AWS has hundreds of EC2 types, there are only a few &lt;a href="https://aws.amazon.com/ec2/instance-types/?nc1=h_ls" rel="noopener noreferrer"&gt;major categories&lt;/a&gt;, as listed below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favaxrovpkwbdr77lgip6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favaxrovpkwbdr77lgip6.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;General Purpose, M and T series&lt;/strong&gt;: Provide a balance of CPU, memory, and network resources, sufficient for most scenarios;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compute Optimized, C series&lt;/strong&gt;: Suitable for compute-intensive services;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Optimized, R and X series&lt;/strong&gt;: Designed to provide high performance for workloads processing large data sets;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accelerated Computing&lt;/strong&gt;: Use hardware accelerators or coprocessors to execute functions such as floating-point calculations, graphics processing, or data pattern matching more efficiently than software running on general-purpose CPUs;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Optimized&lt;/strong&gt;: Designed for workloads that require high-speed, continuous read and write access to very large data sets on local storage;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HPC Optimized&lt;/strong&gt;, HPC series: A new category by AWS mainly suitable for applications that require high-performance processing, such as large complex simulations and deep learning workloads;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typically, &lt;strong&gt;each specific EC2 type belongs to a Family with a corresponding numerical sequence&lt;/strong&gt;. For example, for the General Purpose type M series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;M7g / M7i / M7i-flex / M7a&lt;/li&gt;
&lt;li&gt;M6g / M6i / M6in / M6a&lt;/li&gt;
&lt;li&gt;M5 / M5n / M5zn / M5a&lt;/li&gt;
&lt;li&gt;M4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The numerical sequence reveals that M7 represents the latest generation, whereas M4 is comparatively older. Generally, a higher number indicates a more recent model and CPU type, and often, the pricing is more favorable due to the natural depreciation of hardware.&lt;/p&gt;
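&lt;p&gt;This naming convention is regular enough to parse mechanically, which helps when scanning long instance lists. Below is a rough Python helper of my own (not an AWS tool); the suffix meanings noted in the comments, such as &lt;code&gt;d&lt;/code&gt; for local storage, follow the conventions discussed in this article:&lt;/p&gt;

```python
import re

# "<family letters><generation digits><suffixes>.<size>", e.g. "m7gd.large"
NAME = re.compile(r"^([a-z]+?)(\d+)([a-z-]*)\.(\w+)$")

def parse_instance_type(name: str):
    """Split an EC2 instance type like "m7gd.large" into its parts."""
    m = NAME.match(name)
    if not m:
        raise ValueError(f"unrecognized instance type: {name}")
    family, generation, suffixes, size = m.groups()
    return {
        "family": family,               # e.g. "m" = general purpose
        "generation": int(generation),  # higher usually means newer
        "suffixes": suffixes,           # e.g. "g" (Graviton), "d" (local storage)
        "size": size,
    }

print(parse_instance_type("m7gd.large"))
# {'family': 'm', 'generation': 7, 'suffixes': 'gd', 'size': 'large'}
```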

&lt;h3&gt;
  
  
  Key Parameters
&lt;/h3&gt;

&lt;p&gt;We can extract the following key parameters from the AWS EC2 model introduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfxp7d1twn7i7dyaxnqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfxp7d1twn7i7dyaxnqu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specific Model of EC2&lt;/strong&gt;: Generally named as &lt;code&gt;&amp;lt;family&amp;gt;.&amp;lt;size&amp;gt;&lt;/code&gt;, like &lt;code&gt;m7g.large&lt;/code&gt; / &lt;code&gt;m7g.xlarge&lt;/code&gt;. For EC2, a certain model is unique globally;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU and Memory Size&lt;/strong&gt;: The number of vCPUs and the size of memory. Most EC2 models have a 1:4 vCPU-to-memory ratio: 1 vCPU usually pairs with 4 GiB of memory, 2 vCPUs with 8 GiB, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instance Storage&lt;/strong&gt;: EC2 can generally mount different types of persistent storage disks, mainly:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EBS&lt;/strong&gt;: Mounted AWS distributed block storage service, which is usually the default choice for most EC2 models. &lt;strong&gt;Some models only have the option to use EBS&lt;/strong&gt;, which is bound to a specific AZ. Although its read/write latency is higher than local SSD, it's acceptable in most scenarios. EBS also has &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html" rel="noopener noreferrer"&gt;different types&lt;/a&gt; based on parameters like &lt;strong&gt;IOPS and throughput&lt;/strong&gt;, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gp2/gp3&lt;/strong&gt;: Underlying general-purpose SSD, with gp3 being officially recommended for better cost-performance. Typically, the default setting is 3000 IOPS, but it also offers the flexibility to increase IOPS on demand, without any downtime—though this does come with additional costs;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;io1/io2&lt;/strong&gt;: Stronger performance and higher price, also supporting features like &lt;strong&gt;Multi Attach&lt;/strong&gt; (usually, other types of EBS can only be mounted on one EC2);&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local Storage&lt;/strong&gt;: Some models support local storage in addition to mounting EBS, though at a higher price. Generally, these models have a &lt;code&gt;d&lt;/code&gt; in their model name. For example, &lt;code&gt;m7g.large&lt;/code&gt; is an EBS-only model, while &lt;code&gt;m7gd.large&lt;/code&gt; comes with one 118 GiB NVMe SSD of local storage. Some special models also support larger local HDDs;&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EBS Bandwidth&lt;/strong&gt;: For some newer and specifically EBS-optimized EC2 models, AWS equips them with dedicated EBS bandwidth. This means that in high data throughput scenarios, EBS-optimized models can always enjoy better throughput without competing for network bandwidth on the local machine;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Bandwidth&lt;/strong&gt;: The network bandwidth corresponding to the EC2 model;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU Model&lt;/strong&gt;: In most scenarios, we can see CPUs from the following manufacturers:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;AWS's self-developed Graviton processor based on the ARM architecture (currently up to Graviton 3), such as the M7g series;&lt;/li&gt;
&lt;li&gt;Intel x86-64 architecture CPU;&lt;/li&gt;
&lt;li&gt;AMD x86-64 architecture CPU;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generally, for similar configurations, &lt;strong&gt;the pricing trend is Intel being the most expensive, followed by AMD, and then Graviton, with cost-effectiveness ranking in the reverse order&lt;/strong&gt;. For performance-insensitive workloads, users can consider ARM architecture models, which offer greater cost-effectiveness.&lt;/p&gt;

&lt;p&gt;AWS is one of the earliest cloud vendors to introduce ARM architecture into the server CPU field. After years of R&amp;amp;D, Graviton CPU has made significant progress and has a great competitive advantage in cost-performance. It is expected that more customers will use Graviton CPU models in the future.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Virtualization Technology&lt;/strong&gt;: Various EC2 models employ distinct virtualization technologies, resulting in differences in their technical parameters. For example, for newer EC2 models, &lt;a href="https://aws.amazon.com/ec2/nitro/" rel="noopener noreferrer"&gt;Nitro virtualization technology&lt;/a&gt; is generally applied. Nitro is AWS's latest virtualization technology, offloading many virtualization behaviors to hardware, making the software relatively lighter and virtualization performance stronger. From the user's perspective, identical configurations will yield enhanced performance due to reduced virtualization overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Whether suitable for Machine Learning Scenarios&lt;/strong&gt;: With the development of LLM technology, more and more vendors will choose to train their models in the cloud. If you want to use model training on AWS EC2, &lt;strong&gt;Accelerated Computing&lt;/strong&gt; generally would be your choice, such as:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;P series and G series models&lt;/strong&gt;: They use Nvidia's GPU chips. At the re:Invent 2023 conference, Nvidia and AWS started a deeper &lt;a href="https://nvidianews.nvidia.com/news/aws-nvidia-strategic-collaboration-for-generative-ai" rel="noopener noreferrer"&gt;strategic cooperation&lt;/a&gt;. AWS plans to use Nvidia's latest and most powerful GPUs to create a computing platform specifically for generative AI;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trn and Inf series&lt;/strong&gt;: In addition to using Nvidia GPUs, AWS also develops its own chips for machine learning, such as the &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;Trainium chip&lt;/a&gt;  for training and the &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia chip&lt;/a&gt; for model inference. Trn series and Inf series EC2 models correspond to these two AWS-developed machine learning chips respectively;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Building on the overview provided above (and there's much more to explore about EC2), we've compiled a few tips for users to consider when selecting EC2 instances.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Typically, for most EC2 models, a higher sequence number indicates a newer CPU model. This generally means better performance and, interestingly, a more cost-effective pricing structure – essentially, you get more bang for your buck.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Among the general-purpose EC2 models, the T series is relatively cheap and offers a &lt;strong&gt;Burstable CPU&lt;/strong&gt; feature: the instance accumulates CPU credits while running below baseline performance, and under high load it can spend those credits to run above baseline for a limited time at no extra cost. The trade-off is that the T series cannot sustain high performance, and it generally has low bandwidth and no EBS optimization. Therefore, the T series is better suited to test environments where performance is not critical;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Within the general-purpose series, if you're aiming for cost-efficiency, it's advisable to prioritize AWS ARM architecture models;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS's official EC2 pricing pages are very difficult to read; we recommend using &lt;a href="https://ec2instances.info/" rel="noopener noreferrer"&gt;Vantage&lt;/a&gt; (an open-source project) to check price information;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For most cloud users, the cost of EC2 is generally their major expense. Here are a few ways to reduce this cost as much as possible:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fully utilize the elasticity of the cloud&lt;/strong&gt;: &lt;br&gt;
Make your architecture as flexible as possible and consume computing power on demand. You can use AWS's &lt;a href="https://karpenter.sh/" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt; or &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt; to make your EC2 fleet elastic and scalable;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Spot instances&lt;/strong&gt;: &lt;br&gt;
Spot instances can be 30% to 90% cheaper than On-Demand instances, but they are subject to preemption and can't be relied on for long-term stable operation; AWS notifies you 2 minutes before reclaiming an instance. With good underlying management, Spot instances are very suitable for elastic computing and interruption-tolerant scenarios. For example, the &lt;a href="https://skypilot.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;SkyPilot&lt;/a&gt; project uses Spot instances from different clouds for machine learning training;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimize payment modes&lt;/strong&gt;: &lt;br&gt;
Beyond technical approaches, cost reduction can also be achieved by purchasing Saving Plans. These plans offer lower unit costs compared to On-Demand pricing, though they come with decreased flexibility. This makes them more suited for scenarios with relatively stable business architectures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
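The burstable-credit mechanism mentioned in takeaway 2 can be approximated with a toy model (a simplification for intuition only, not AWS's actual credit accounting): below baseline the instance earns credits, above baseline it spends them, and once credits run out it is throttled back to baseline.

```rust
// Toy model (not AWS's exact accounting) of the burstable-credit idea:
// below baseline the instance earns credits, above baseline it spends them.
fn simulate(baseline: f64, usage: &[f64], start_credits: f64) -> f64 {
    let mut credits = start_credits;
    for &u in usage {
        if u <= baseline {
            credits += baseline - u; // accrue while running below baseline
        } else {
            credits -= u - baseline; // burn while bursting above baseline
            if credits < 0.0 {
                credits = 0.0; // out of credits: throttled back to baseline
            }
        }
    }
    credits
}

fn main() {
    // A mostly idle workload with one short burst: earlier idle periods
    // accumulate enough credits to cover the burst, with some left over.
    let remaining = simulate(0.2, &[0.05, 0.05, 0.9, 0.05], 0.0);
    assert!(remaining > 0.0);
}
```

The qualitative takeaway holds regardless of the exact accounting: sustained load above baseline exhausts credits, which is why T series instances suit spiky rather than steady workloads.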

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Efficient selection and utilization of EC2 should be tailored to the user's unique scenarios, requiring continuous and iterative optimization. In summary, leveraging the cloud's elasticity and understanding the key parameters of various EC2 models is essential for every AWS user.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Reducing S3 API Calls by 98% | Exploring the Secrets of OpenDAL's RangeReader</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Thu, 04 Jan 2024 07:51:41 +0000</pubDate>
      <link>https://dev.to/greptime/reducing-s3-api-calls-by-98-exploring-the-secrets-of-opendals-rangereader-2ke7</link>
      <guid>https://dev.to/greptime/reducing-s3-api-calls-by-98-exploring-the-secrets-of-opendals-rangereader-2ke7</guid>
      <description>&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://github.com/greptimeTeam/greptimedb/"&gt;GreptimeDB&lt;/a&gt;, we utilize &lt;a href="https://github.com/apache/incubator-opendal"&gt;OpenDAL&lt;/a&gt; as our unified data &lt;code&gt;access layer&lt;/code&gt;. Recently, a colleague informed me that it took 10 seconds to execute a &lt;code&gt;Copy From&lt;/code&gt; statement to import an 800 KiB Parquet file from S3. After some investigation, including reviewing OpenDAL's documentation and the implementation of the relevant &lt;code&gt;Reader&lt;/code&gt; (realizing we hadn't RTFSC 🥲), I document and briefly summarize our findings here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Relevant OpenDAL source code Commit: &lt;a href="https://github.com/apache/incubator-opendal/tree/6980cd15007c9a2ae8422cbc0750c818e178abf2"&gt;6980cd1&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Understanding OpenDAL Source Code
&lt;/h2&gt;

&lt;p&gt;Frankly speaking, it was only recently that I fully grasped the intricacies of the OpenDAL source code and its invocation relationships, after previously having only a partial understanding of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting with the &lt;code&gt;Operator&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;All our IO operations revolve around the &lt;code&gt;Operator&lt;/code&gt;. Let's see how the &lt;code&gt;Operator&lt;/code&gt; is constructed. In &lt;code&gt;main.rs&lt;/code&gt;, we first create a file-system-based &lt;code&gt;Backend Builder&lt;/code&gt;; subsequently build it into an &lt;code&gt;accessor&lt;/code&gt; (implementing the &lt;code&gt;Accessor&lt;/code&gt; trait); and then pass this &lt;code&gt;accessor&lt;/code&gt; into &lt;code&gt;OperatorBuilder::new&lt;/code&gt;, finally calling &lt;code&gt;finish&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;OpenDAL unifies the behavior of different storage backends through the &lt;code&gt;Accessor&lt;/code&gt; trait, exposing a unified IO interface to the upper layer, like &lt;code&gt;create_dir&lt;/code&gt;, &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, etc.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;opendal&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;services&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Fs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;opendal&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Operator&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create fs backend builder.&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// Set the root for fs, all operations will happen under this root.&lt;/span&gt;
    &lt;span class="c1"&gt;//&lt;/span&gt;
    &lt;span class="c1"&gt;// NOTE: the root must be absolute path.&lt;/span&gt;
    &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="nf"&gt;.root&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;accessor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Operator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;OperatorBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="nf"&gt;.finish&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Happens in &lt;code&gt;OperatorBuilder::new&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;accessor&lt;/code&gt; we pass in is wrapped with two layers when invoking &lt;code&gt;new&lt;/code&gt;, and an additional internal &lt;code&gt;Layer&lt;/code&gt; is added when invoking &lt;code&gt;finish&lt;/code&gt;. With these layers added, when we invoke the interfaces exposed by &lt;code&gt;Operator&lt;/code&gt;, the invocation starts from the outermost &lt;code&gt;CompleteLayer&lt;/code&gt; and eventually reaches the innermost &lt;code&gt;FsAccessor&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;FsAccessor&lt;/span&gt;
&lt;span class="n"&gt;ErrorContextLayer&lt;/span&gt;
&lt;span class="n"&gt;CompleteLayer&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Invoking&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`read`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`reader_with`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`stat`&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Accessor&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;OperatorBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Create a new operator builder.&lt;/span&gt;
    &lt;span class="nd"&gt;#[allow(clippy::new_ret_no_self)]&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;OperatorBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Accessor&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Make sure error context layer has been attached.&lt;/span&gt;
        &lt;span class="n"&gt;OperatorBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;accessor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ErrorContextLayer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CompleteLayer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="cd"&gt;/// Finish the building to construct an Operator.&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;finish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Operator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;ob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypeEraseLayer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nn"&gt;Operator&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_inner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ob&lt;/span&gt;&lt;span class="py"&gt;.accessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;FusedAccessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;TL;DR: I just want to emphasize that we should read the source code of OpenDAL starting from CompleteLayer (an epiphany).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Background Information
&lt;/h2&gt;

&lt;p&gt;Let me provide some necessary context here to understand the following content.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;LruCacheLayer&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Currently, in query scenarios, we add a &lt;code&gt;LruCacheLayer&lt;/code&gt; while building the &lt;code&gt;Operator&lt;/code&gt;, so our &lt;code&gt;Operator&lt;/code&gt; looks like the diagram below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;S3Accessor&lt;/span&gt;                &lt;span class="n"&gt;FsAccessor&lt;/span&gt;
&lt;span class="n"&gt;ErrorContextLayer&lt;/span&gt;         &lt;span class="n"&gt;ErrorContextLayer&lt;/span&gt;
&lt;span class="n"&gt;CompleteLayer&lt;/span&gt;             &lt;span class="n"&gt;CompleteLayer&lt;/span&gt;
    &lt;span class="o"&gt;^&lt;/span&gt;                         &lt;span class="o"&gt;^&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt;                         &lt;span class="p"&gt;|&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;           &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt;                         &lt;span class="p"&gt;|&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt;                         &lt;span class="p"&gt;|&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt;                         &lt;span class="p"&gt;|&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="o"&gt;+-----&lt;/span&gt; &lt;span class="n"&gt;LruCacheLayer&lt;/span&gt; &lt;span class="o"&gt;-----+&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
                 &lt;span class="o"&gt;^&lt;/span&gt;               &lt;span class="p"&gt;|&lt;/span&gt;
                 &lt;span class="p"&gt;|&lt;/span&gt;               &lt;span class="p"&gt;|&lt;/span&gt;
                 &lt;span class="p"&gt;|&lt;/span&gt;               &lt;span class="p"&gt;|&lt;/span&gt;
                 &lt;span class="p"&gt;|&lt;/span&gt;               &lt;span class="n"&gt;v&lt;/span&gt;
                 &lt;span class="p"&gt;|&lt;/span&gt;               &lt;span class="nn"&gt;FileReader&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;oio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TokioReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="p"&gt;|&lt;/span&gt;
                 &lt;span class="nf"&gt;Invoking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;reader_with&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, with the &lt;code&gt;read&lt;/code&gt; interface, &lt;code&gt;LruCacheLayer&lt;/code&gt; caches S3 files on the file system and returns the cached file-system-based &lt;code&gt;Box&amp;lt;dyn oio::Read&amp;gt;&lt;/code&gt; (&lt;code&gt;FileReader::new(oio::TokioReader&amp;lt;tokio::fs::File&amp;gt;)&lt;/code&gt;) to the upper layer; if the file to be read is not in the cache, it is first downloaded in full from S3 to the local file system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;LruCacheLayer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Operator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// S3Backend&lt;/span&gt;
  &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Operator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// FsBackend&lt;/span&gt;
  &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CacheIndex&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LayeredAccessor&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;LruCacheLayer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OpRead&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RpRead&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.index&lt;/span&gt;&lt;span class="nf"&gt;.hit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c1"&gt;// Returns `Box&amp;lt;dyn oio::Read&amp;gt;`&lt;/span&gt;
          &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cache&lt;/span&gt;&lt;span class="nf"&gt;.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; 
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c1"&gt;// Fetches cache and stores...&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The &lt;code&gt;Copy From&lt;/code&gt; Scenario
&lt;/h3&gt;

&lt;p&gt;In the &lt;code&gt;Copy From&lt;/code&gt; scenario, we didn't add the &lt;code&gt;LruCacheLayer&lt;/code&gt;. Thus, our &lt;code&gt;Operator&lt;/code&gt; looks like the diagram below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3Accessor
ErrorContextLayer
CompleteLayer
   ▲    │
   │    │
   │    │
   │    ▼
   │    RangeReader::new(IncomingAsyncBody)
   │
   Invoking (`reader`, `reader_with`)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Issues Encountered with OpenDAL RangeReader
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Starting with the Construction of ParquetRecordBatchStream
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;Copy From&lt;/code&gt;, after obtaining the file information (i.e., the file location on S3), we first invoke &lt;code&gt;operator.reader&lt;/code&gt; to get a &lt;code&gt;reader&lt;/code&gt; implementing &lt;code&gt;AsyncRead + AsyncSeek&lt;/code&gt;, then wrap it with a &lt;code&gt;BufReader&lt;/code&gt;. Ultimately, this &lt;code&gt;reader&lt;/code&gt; is passed into &lt;code&gt;ParquetRecordBatchStreamBuilder&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Here, the use of &lt;code&gt;BufReader&lt;/code&gt; is superfluous because it clears its internal buffer after invoking the &lt;code&gt;seek&lt;/code&gt; method, negating any potential performance benefits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;  ...
  let reader = operator
      .reader(path)
      .await
      .context(error::ReadObjectSnafu { path })?;

  let buf_reader = BufReader::new(reader.compat());

  let builder = ParquetRecordBatchStreamBuilder::new(buf_reader)
      .await
      .context(error::ReadParquetSnafu)?;

  let upstream = builder
      .build()
      .context(error::BuildParquetRecordBatchStreamSnafu)?;

  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
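The `BufReader` caveat above can be reproduced outside of async code with the standard library's `std::io::BufReader`, whose `Seek` implementation likewise discards the internal buffer on every `seek` (only `seek_relative` preserves it). A self-contained sketch using an in-memory reader:

```rust
use std::io::{BufReader, Cursor, Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // An in-memory "file" so the example is self-contained.
    let data = vec![42u8; 16 * 1024];
    let mut reader = BufReader::new(Cursor::new(data));

    // Reading one byte fills BufReader's internal buffer ahead of the cursor.
    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte)?;
    assert!(!reader.buffer().is_empty());

    // A seek discards the whole buffer, so the next read goes back to the
    // underlying source -- the buffering bought us nothing.
    reader.seek(SeekFrom::Start(0))?;
    assert!(reader.buffer().is_empty());

    Ok(())
}
```

This is why wrapping a reader that seeks frequently (as Parquet footer and row-group reads do) in a `BufReader` adds no benefit.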

&lt;h3&gt;
  
  
  Reading Metadata in &lt;code&gt;ParquetRecordBatchStream::new&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The metadata reading logic is as follows: first, it invokes &lt;code&gt;seek(SeekFrom::End(-FOOTER_SIZE_I64))&lt;/code&gt;, reads &lt;code&gt;FOOTER_SIZE&lt;/code&gt; bytes, and parses &lt;code&gt;metadata_len&lt;/code&gt;; then it invokes &lt;code&gt;seek&lt;/code&gt; again and reads &lt;code&gt;metadata_len&lt;/code&gt; bytes to parse the metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncRead&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;AsyncSeek&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Unpin&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncFileReader&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BoxFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ParquetMetaData&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;FOOTER_SIZE_I64&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FOOTER_SIZE&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;i64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SeekFrom&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;End&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;FOOTER_SIZE_I64&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0_u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;FOOTER_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.read_exact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decode_footer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SeekFrom&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;End&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;FOOTER_SIZE_I64&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;metadata_len&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;i64&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_capacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata_len&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata_len&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.read_to_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;decode_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;.boxed&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
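&lt;p&gt;For reference, the footer layout that &lt;code&gt;decode_footer&lt;/code&gt; relies on comes from the Parquet format itself: the last 8 bytes of a file are a 4-byte little-endian metadata length followed by the 4-byte magic &lt;code&gt;PAR1&lt;/code&gt;. A minimal standalone sketch of that decoding (illustrative only, not GreptimeDB's or parquet-rs's actual implementation):&lt;/p&gt;

```rust
/// Size of the Parquet footer: 4-byte metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Decode the metadata length from the 8-byte footer.
/// Returns an error if the magic number does not match.
fn decode_footer(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    if footer[4..] != PARQUET_MAGIC[..] {
        return Err("invalid Parquet footer magic".to_string());
    }
    // The first 4 footer bytes are the metadata length, little-endian.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}
```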



&lt;h3&gt;
  
  
  The Real Problem
&lt;/h3&gt;

&lt;p&gt;Up to this point, we've discussed some minor issues. The more challenging problem arises here, where the variable &lt;code&gt;stream&lt;/code&gt; is the &lt;code&gt;ParquetRecordBatchStream&lt;/code&gt; we've built above. When we invoke &lt;code&gt;next&lt;/code&gt;, &lt;code&gt;ParquetRecordBatchStream&lt;/code&gt; invokes &lt;code&gt;RangeReader&lt;/code&gt;'s &lt;code&gt;seek&lt;/code&gt; and &lt;code&gt;read&lt;/code&gt; multiple times. However, each call to &lt;code&gt;seek&lt;/code&gt; resets &lt;code&gt;RangeReader&lt;/code&gt;'s internal state (discarding the previous byte stream) and, on the subsequent &lt;code&gt;read&lt;/code&gt; call, initiates a new remote request (in the S3 backend scenario).&lt;/p&gt;

&lt;p&gt;You can find more details in &lt;a href="https://github.com/apache/incubator-opendal/issues/3747"&gt;this issue&lt;/a&gt; and &lt;a href="https://github.com/apache/incubator-opendal/pull/3734"&gt;the discussion here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When using &lt;code&gt;ParquetRecordBatchStream&lt;/code&gt; to retrieve each column's data, it will first invoke &lt;code&gt;RangeReader&lt;/code&gt;'s &lt;code&gt;seek&lt;/code&gt;, then &lt;code&gt;read&lt;/code&gt; some bytes. Thus, the total number of remote calls required is the number of &lt;code&gt;RowGroups&lt;/code&gt; multiplied by the number of columns in a &lt;code&gt;RowGroup&lt;/code&gt;. Our 800KiB file contains 50 &lt;code&gt;RowGroups&lt;/code&gt; and 12 columns (per &lt;code&gt;RowGroup&lt;/code&gt;), which results in &lt;strong&gt;600 S3 get requests!&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
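&lt;p&gt;The amplification above is simple arithmetic: every column chunk in every &lt;code&gt;RowGroup&lt;/code&gt; costs one &lt;code&gt;seek&lt;/code&gt; + &lt;code&gt;read&lt;/code&gt; pair, i.e. one remote GET. A trivial sketch with the numbers from the issue:&lt;/p&gt;

```rust
/// Each column chunk in each row group triggers one `seek` + `read`
/// pair, i.e. one remote GET request when backed by S3.
fn remote_requests(row_groups: usize, columns_per_row_group: usize) -> usize {
    row_groups * columns_per_row_group
}
```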

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;        &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;copy_table_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;record_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="nf"&gt;.context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ReadDfRecordBatchSnafu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
                    &lt;span class="nn"&gt;Helper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;try_into_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_batch&lt;/span&gt;&lt;span class="nf"&gt;.columns&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IntoVectorsSnafu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                &lt;span class="n"&gt;pending_mem_size&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="nf"&gt;.memory_size&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="py"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;columns_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;
                    &lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="nf"&gt;.cloned&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="nf"&gt;.zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="py"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

                &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.inserter&lt;/span&gt;&lt;span class="nf"&gt;.handle_table_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;InsertRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;catalog_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="py"&gt;.catalog_name&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                        &lt;span class="n"&gt;schema_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="py"&gt;.schema_name&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                        &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="py"&gt;.table_name&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                        &lt;span class="n"&gt;columns_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="n"&gt;query_ctx&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;));&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pending_mem_size&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;pending_mem_threshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;rows_inserted&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;batch_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;pending_mem_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Explore the &lt;code&gt;RangeReader&lt;/code&gt; Source Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Take a look at &lt;code&gt;self.poll_read()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;RangeReader&lt;/code&gt;, &lt;code&gt;self.state&lt;/code&gt; starts as &lt;code&gt;State::Idle&lt;/code&gt;. Assuming &lt;code&gt;self.offset&lt;/code&gt; is &lt;code&gt;Some(0)&lt;/code&gt;, &lt;code&gt;self.state&lt;/code&gt; is set to &lt;code&gt;State::SendRead(BoxFuture&amp;lt;'static, Result&amp;lt;(RpRead, R)&amp;gt;&amp;gt;)&lt;/code&gt; and &lt;code&gt;self.poll_read(cx, buf)&lt;/code&gt; is invoked again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;oio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;RangeReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt;
    &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Accessor&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;oio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;poll_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'_&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Poll&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Idle&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.offset&lt;/span&gt;&lt;span class="nf"&gt;.is_none&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="c1"&gt;// When the offset is none, it means we are performing tailing reading.&lt;/span&gt;
                    &lt;span class="c1"&gt;// We should start by getting the correct offset through a stat operation.&lt;/span&gt;
                    &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;SendStat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.stat_future&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;SendRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.read_future&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;

                &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.poll_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
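&lt;p&gt;Stripped of the async plumbing, the &lt;code&gt;Idle&lt;/code&gt; branch above is a small state machine. A simplified synchronous model of the transition (illustrative only, not OpenDAL's actual types):&lt;/p&gt;

```rust
/// A simplified, synchronous model of `RangeReader`'s state transitions.
enum State {
    Idle,
    SendStat, // offset unknown: stat first to learn the file size
    SendRead, // offset known: issue the ranged read
}

fn next_state(state: State, offset: Option<u64>) -> State {
    match state {
        // From Idle, the next state depends on whether the offset is known.
        State::Idle => match offset {
            None => State::SendStat,
            Some(_) => State::SendRead,
        },
        // The other transitions (consuming SendStat/SendRead) are elided.
        other => other,
    }
}
```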



&lt;h3&gt;
  
  
  What happens in &lt;code&gt;self.read_future()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Clearly, &lt;code&gt;self.read_future()&lt;/code&gt; returns a &lt;code&gt;BoxFuture&lt;/code&gt;. Within this &lt;code&gt;BoxFuture&lt;/code&gt;, the underlying &lt;code&gt;Accessor&lt;/code&gt;'s &lt;code&gt;read&lt;/code&gt; method (&lt;code&gt;acc.read(&amp;amp;path, op).await&lt;/code&gt;) is invoked. The &lt;code&gt;Accessor&lt;/code&gt; is an implementation for a particular storage backend; in our context it represents S3. When its &lt;code&gt;read&lt;/code&gt; interface is invoked, it establishes a TCP connection to retrieve the file data and returns the byte stream of S3's response to the upper layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RangeReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt;
    &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Accessor&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;oio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;read_future&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BoxFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;'static&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RpRead&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.acc&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.path&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.op&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cur&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="nf"&gt;.into_deterministic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="nf"&gt;.with_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.calculate_range&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;pin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="nf"&gt;.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
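&lt;p&gt;The range set by &lt;code&gt;op.with_range(self.calculate_range())&lt;/code&gt; ultimately becomes an HTTP &lt;code&gt;Range&lt;/code&gt; header on the S3 GET request. A sketch of that mapping (&lt;code&gt;range_header&lt;/code&gt; is a hypothetical helper for illustration, not OpenDAL's API):&lt;/p&gt;

```rust
/// Format a byte range as an HTTP `Range` header value, as used by
/// S3 ranged GETs. `end` is exclusive here, while the header's end
/// position is inclusive, hence the `- 1`.
fn range_header(start: u64, end: Option<u64>) -> String {
    match end {
        Some(end) => format!("bytes={}-{}", start, end - 1),
        // Open-ended range: read from `start` to the end of the object.
        None => format!("bytes={}-", start),
    }
}
```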



&lt;h3&gt;
  
  
  Continuing from where we left off in &lt;code&gt;self.poll_read()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;At this point, &lt;code&gt;poll_read&lt;/code&gt; has not yet returned. In the previous section, &lt;code&gt;self.poll_read()&lt;/code&gt; was invoked again with &lt;code&gt;self.state&lt;/code&gt; being &lt;code&gt;State::SendRead(BoxFuture&amp;lt;'static, Result&amp;lt;(RpRead, R)&amp;gt;&amp;gt;)&lt;/code&gt;. The value returned by &lt;code&gt;ready!(Pin::new(fut).poll(cx))&lt;/code&gt; corresponds to the result of &lt;code&gt;acc.read(&amp;amp;path, op).await&lt;/code&gt; from the previous section. For the S3 storage backend, remote calls happen here. &lt;/p&gt;

&lt;p&gt;Afterward, the internal state of &lt;code&gt;RangeReader&lt;/code&gt; is set to &lt;code&gt;State::Read(r)&lt;/code&gt;, and &lt;code&gt;self.poll_read(cx, buf)&lt;/code&gt; is invoked once more. Upon re-entering &lt;code&gt;self.poll_read()&lt;/code&gt;, the &lt;code&gt;State::Read(r)&lt;/code&gt; arm matches, where &lt;code&gt;r&lt;/code&gt; is the byte stream of the read request's response. For the S3 storage backend, &lt;code&gt;Pin::new(r).poll_read(cx, buf)&lt;/code&gt; copies the byte data from the TCP buffer into the upper-level application's buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;oio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;RangeReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt;
    &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Accessor&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;oio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;poll_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'_&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Poll&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Sanity check for normal cases.&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="nf"&gt;.is_empty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cur&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.size&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;Poll&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;SendRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;ready!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Pin&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="c1"&gt;// If the read future returns an error, reset the state to Idle to retry.&lt;/span&gt;
                    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;err&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                &lt;span class="c1"&gt;// Set the size if the read returns a size hint.&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="nf"&gt;.size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.size&lt;/span&gt;&lt;span class="nf"&gt;.is_none&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cur&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.poll_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nd"&gt;ready!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Pin&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.poll_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="c1"&gt;// Reset the state to Idle after all data has been consumed.&lt;/span&gt;
                    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="nn"&gt;Poll&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cur&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="nn"&gt;Poll&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;State&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="nn"&gt;Poll&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Look at &lt;code&gt;self.poll_seek()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Remember the internal state of &lt;code&gt;RangeReader&lt;/code&gt; we discussed earlier? Yes, it was &lt;code&gt;State::Read(R)&lt;/code&gt;. When we call &lt;code&gt;seek&lt;/code&gt; after a &lt;code&gt;read&lt;/code&gt;, the byte stream inside &lt;code&gt;RangeReader&lt;/code&gt; is discarded and the state is reset to &lt;code&gt;State::Idle&lt;/code&gt;. In other words, every time &lt;code&gt;read&lt;/code&gt; is invoked after a &lt;code&gt;seek&lt;/code&gt;, &lt;code&gt;RangeReader&lt;/code&gt; must call the &lt;code&gt;read&lt;/code&gt; method of the underlying &lt;code&gt;Accessor&lt;/code&gt; (&lt;code&gt;acc.read(&amp;amp;path, op).await&lt;/code&gt;) to initiate a remote call. For the S3 storage backend, each such call incurs significant overhead (typically hundreds of milliseconds).&lt;/p&gt;
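&lt;p&gt;The reset-then-refetch behavior can be illustrated with a hypothetical, much-simplified sketch (this is not OpenDAL's actual code; the types and the simulated remote call are ours):&lt;/p&gt;

```rust
// A hypothetical, much-simplified sketch (not OpenDAL's actual code) of the
// behavior described above: `seek` discards the in-flight byte stream and
// resets the state to `Idle`, so the next `read` must open a brand-new
// ranged request against the backend.
enum State {
    Idle,
    Read(Vec<u8>), // stands in for the remote byte stream
}

struct RangeReader {
    state: State,
    remote_calls: u32, // counts simulated `acc.read(&path, op)` round trips
}

impl RangeReader {
    fn new() -> Self {
        RangeReader { state: State::Idle, remote_calls: 0 }
    }

    fn seek(&mut self, _pos: u64) {
        // Any buffered stream is dropped; the next read starts from scratch.
        self.state = State::Idle;
    }

    fn read(&mut self) -> usize {
        if matches!(self.state, State::Idle) {
            // Simulates the remote call; against S3 this typically costs
            // hundreds of milliseconds.
            self.remote_calls += 1;
            self.state = State::Read(vec![0u8; 16]);
        }
        match &self.state {
            State::Read(buf) => buf.len(),
            State::Idle => unreachable!(),
        }
    }
}
```

&lt;p&gt;Consecutive &lt;code&gt;read&lt;/code&gt;s reuse the open stream, but every intervening &lt;code&gt;seek&lt;/code&gt; forces one more remote round trip.&lt;/p&gt;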

&lt;p&gt;There is one more performance-related point worth noting. When seeking with &lt;code&gt;SeekFrom::End(amt)&lt;/code&gt; while &lt;code&gt;self.size&lt;/code&gt; is unknown, an additional &lt;code&gt;stat&lt;/code&gt; operation is performed to determine the size. After &lt;code&gt;self.poll_seek()&lt;/code&gt; completes, &lt;code&gt;self.cur&lt;/code&gt; is set to &lt;code&gt;base.checked_add(amt)&lt;/code&gt;.&lt;/p&gt;
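&lt;p&gt;The cursor arithmetic can be sketched with &lt;code&gt;std::io::SeekFrom&lt;/code&gt; (a simplified model; the function below is ours, not OpenDAL's API):&lt;/p&gt;

```rust
use std::io::SeekFrom;

// A sketch of the cursor arithmetic described above: `base` is 0 for
// `Start`, the current position for `Current`, and the total size for `End`
// (obtaining that size is what may require the extra `stat` call). The new
// cursor is `base.checked_add(amt)`: `None` signals overflow or a seek
// before the start of the object.
fn new_cursor(cur: u64, size: u64, pos: SeekFrom) -> Option<u64> {
    match pos {
        SeekFrom::Start(n) => Some(n),
        SeekFrom::Current(amt) if amt >= 0 => cur.checked_add(amt as u64),
        SeekFrom::Current(amt) => cur.checked_sub(amt.unsigned_abs()),
        SeekFrom::End(amt) if amt >= 0 => size.checked_add(amt as u64),
        SeekFrom::End(amt) => size.checked_sub(amt.unsigned_abs()),
    }
}
```

&lt;p&gt;Only the &lt;code&gt;End&lt;/code&gt; arm depends on the object size, which is why an unknown &lt;code&gt;self.size&lt;/code&gt; triggers the extra &lt;code&gt;stat&lt;/code&gt;.&lt;/p&gt;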

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We've implemented a quick fix that decreased the number of &lt;code&gt;RowGroups&lt;/code&gt; imported from 50 to just 1. However, this solution still requires 12 remote calls. Moving forward, we plan to contribute a &lt;code&gt;BufferReader&lt;/code&gt; to OpenDAL (see &lt;a href="https://github.com/apache/incubator-opendal/pull/3734"&gt;this RFC&lt;/a&gt;), which is expected to significantly reduce the number of consecutive remote calls triggered by &lt;code&gt;seek&lt;/code&gt; and &lt;code&gt;read&lt;/code&gt; operations in &lt;code&gt;RangeReader&lt;/code&gt;; in certain cases, these calls can be eliminated entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When &lt;code&gt;seek&lt;/code&gt; is invoked on a &lt;code&gt;RangeReader&lt;/code&gt;, its internal state is reset, and a subsequent &lt;code&gt;read&lt;/code&gt; results in a remote call to the underlying &lt;code&gt;Accessor&lt;/code&gt; (in scenarios where the backend is S3). For related information, please refer to &lt;a href="https://github.com/apache/incubator-opendal/issues/3747"&gt;this issue&lt;/a&gt; and &lt;a href="https://github.com/apache/incubator-opendal/pull/3734"&gt;this discussion&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Both &lt;code&gt;std::io::BufReader&lt;/code&gt; and &lt;code&gt;tokio::io::BufReader&lt;/code&gt; clear their internal buffers after a &lt;code&gt;seek&lt;/code&gt;. If you wish to keep reading from the buffer, use &lt;code&gt;seek_relative&lt;/code&gt; instead.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
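&lt;p&gt;The last point is easy to verify with the standard library alone (an illustrative example, not taken from GreptimeDB):&lt;/p&gt;

```rust
use std::io::{BufReader, Cursor, Read, Seek, SeekFrom};

// Demonstrates the buffer behavior described above: `Seek::seek` on a
// `BufReader` discards the internal buffer, while `seek_relative` keeps it
// when the target position is still inside the buffer.
fn bufreader_demo() -> std::io::Result<(u8, u8, u8)> {
    let data: Vec<u8> = (0u8..=255).collect();
    let mut reader = BufReader::with_capacity(64, Cursor::new(data));
    let mut byte = [0u8; 1];

    reader.read_exact(&mut byte)?; // fills the 64-byte buffer
    let first = byte[0];

    // `seek` drops the buffer: the next read refills from the inner reader,
    // even though position 10 was already buffered.
    reader.seek(SeekFrom::Start(10))?;
    reader.read_exact(&mut byte)?;
    let second = byte[0];

    // `seek_relative` adjusts the position inside the existing buffer and
    // avoids the refill entirely.
    reader.seek_relative(5)?;
    reader.read_exact(&mut byte)?;
    let third = byte[0];

    Ok((first, second, third))
}
```

&lt;p&gt;With a real file or network-backed reader instead of &lt;code&gt;Cursor&lt;/code&gt;, the dropped buffer after &lt;code&gt;seek&lt;/code&gt; translates directly into extra I/O.&lt;/p&gt;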

</description>
      <category>programming</category>
      <category>database</category>
      <category>api</category>
      <category>greptime</category>
    </item>
    <item>
      <title>Unlock Complex Time Series Analysis in SQL with Range Queries</title>
      <dc:creator>Greptime</dc:creator>
      <pubDate>Thu, 28 Dec 2023 00:58:06 +0000</pubDate>
      <link>https://dev.to/greptime/unlock-complex-time-series-analysis-in-sql-with-range-queries-3ig9</link>
      <guid>https://dev.to/greptime/unlock-complex-time-series-analysis-in-sql-with-range-queries-3ig9</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Time-series data often requires querying and aggregating over specified time intervals, a pattern well-supported by &lt;code&gt;PromQL&lt;/code&gt;'s &lt;code&gt;Range selector&lt;/code&gt;. However, executing these queries in standard SQL is notably complex. To address this, GreptimeDB has introduced an enhanced SQL Range query syntax, effectively marrying SQL's robust flexibility with specialized time-series querying capabilities. This advancement ensures seamless, native handling of time-series data within SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore Range Queries with SQL on GreptimePlay
&lt;/h2&gt;

&lt;p&gt;Our interactive documentation for range queries is now officially available on &lt;a href="https://www.greptime.com/playground" rel="noopener noreferrer"&gt;GreptimePlay&lt;/a&gt;! &lt;/p&gt;

&lt;p&gt;You can delve into various query techniques through a daily example using SQL and receive immediate, visualized feedback. Dive into the world of dynamic data querying on GreptimePlay today!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfxcjwsn8mjr4v56zb3q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfxcjwsn8mjr4v56zb3q.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr8oam596ryffm9vfm19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr8oam596ryffm9vfm19.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Let's illustrate the Range query with an example. The following temperature table records the temperatures in different cities at various times:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filmp4on5u92zzxpz09by.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filmp4on5u92zzxpz09by.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider the following scenario: we want to query the daily and weekly average temperatures in Beijing up to May 2, 2023 (timestamp &lt;code&gt;1682985600000&lt;/code&gt;), using linear interpolation to estimate values for any missing data points.&lt;/p&gt;

&lt;p&gt;To express these two queries in PromQL, we use one day as the step size. For the daily average temperature, we aggregate data over each day; for the weekly average, we extend the aggregation window to a week and calculate the average over each week. Additionally, to align the query with the specific timestamp &lt;code&gt;1682985600000&lt;/code&gt;, we use PromQL's &lt;code&gt;@&lt;/code&gt; operator, which pins the query evaluation time exactly to the given timestamp, ensuring accurate and relevant data retrieval for the specified period.&lt;/p&gt;

&lt;p&gt;The final query looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Daily average temperature&lt;/span&gt;
&lt;span class="n"&gt;avg_over_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"beijing"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="mi"&gt;1682985600000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;

&lt;span class="c1"&gt;-- Weekly average temperature&lt;/span&gt;
&lt;span class="n"&gt;avg_over_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"beijing"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="mi"&gt;1682985600000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the above query has some issues: PromQL excels at data querying but struggles with handling missing data points, i.e., smoothing the queried data. PromQL does have a Lookback delta mechanism (see &lt;a href="https://promlabs.com/blog/2020/07/02/selecting-data-in-promql/#lookback-delta" rel="noopener noreferrer"&gt;this article&lt;/a&gt; for more details), which substitutes old data for missing data points, but this default behavior may not be desirable under certain circumstances. Because of the Lookback delta, aggregated results can carry stale values, and it is challenging for PromQL to precisely control data accuracy. Furthermore, PromQL has no effective method for data smoothing, as our requirement above demands.&lt;/p&gt;

&lt;p&gt;From a traditional SQL perspective, since there is no such Lookback delta mechanism, we can precisely control the scope of our data selection and aggregation, allowing for more accurate queries. &lt;/p&gt;

&lt;p&gt;The query here essentially aggregates data daily and weekly. For daily average temperatures, we can use the scalar function &lt;code&gt;date_trunc&lt;/code&gt;, which truncates a timestamp to a given precision. We use this function to truncate timestamps to daily units and then aggregate the data by day to get the desired results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Daily average temperature&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
        &lt;span class="k"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"beijing"&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1682985600000&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query roughly meets our needs, but there are issues with this type of query:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Complicated to write with the subqueries required;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This method can only calculate daily average temperatures, not weekly averages. In SQL, aggregation demands that each piece of data belong to only one group. However, this becomes problematic in time series queries where each sampling spans a week with intervals recorded daily. In such cases, a single data point is inevitably shared across multiple groups, making traditional SQL queries unsuitable for these queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Still doesn't address the issue of filling in blank data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The crucial issue we now must address is that these queries are fundamentally time series in nature, yet the SQL we employ, despite its highly flexible expressive power, is not tailor-made for time series databases. This mismatch highlights the need for some new SQL extension syntax to effectively manage and query time series data. Some time series databases like InfluxDB offer &lt;code&gt;group by time&lt;/code&gt; syntax, and QuestDB offers &lt;code&gt;Sample By&lt;/code&gt; syntax. These implementations provide ideas for our Range queries. &lt;/p&gt;

&lt;p&gt;Next, we'll introduce how to utilize GreptimeDB's Range syntax for the above queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- average daily temperature&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="s1"&gt;'1d'&lt;/span&gt; &lt;span class="n"&gt;FILL&lt;/span&gt; &lt;span class="n"&gt;LINEAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"beijing"&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1682985600000&lt;/span&gt;
&lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="s1"&gt;'1d'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- average weekly temperature&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="s1"&gt;'7d'&lt;/span&gt; &lt;span class="n"&gt;FILL&lt;/span&gt; &lt;span class="n"&gt;LINEAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"beijing"&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1682985600000&lt;/span&gt;
&lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="s1"&gt;'1d'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have introduced a keyword, &lt;code&gt;ALIGN&lt;/code&gt;, into a &lt;code&gt;SELECT&lt;/code&gt; statement to represent the step size of each time series query, aligning the time to the calendar. Following the aggregation function, a &lt;code&gt;RANGE&lt;/code&gt; keyword is used to denote the scope of each data aggregation. &lt;code&gt;FILL LINEAR&lt;/code&gt; indicates the method of filling in when data points are missing, by using linear interpolation to fill the data. Through this approach, we can more easily fulfill the requirements mentioned earlier.&lt;/p&gt;

&lt;p&gt;The Range query allows us to elegantly express time series queries in SQL, effectively compensating for SQL's shortcomings in describing time series queries. Moreover, it enables the combination of SQL's powerful expressive capabilities to achieve more complex data querying functions.&lt;br&gt;
Range queries also offer more flexible usage options, with specific details available in &lt;a href="https://docs.greptime.com/reference/sql/range" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Logic
&lt;/h2&gt;

&lt;p&gt;Range query is essentially a data aggregation algorithm, but it differs from traditional SQL data aggregation in a key aspect: in Range queries, a single data point may be aggregated into multiple groups. For example, if a user wants to calculate the average weekly temperature for each day, each temperature data point will be used in the calculation for several weekly averages. &lt;/p&gt;

&lt;p&gt;The aforementioned query logic, when formulated as a Range query, can be articulated in the following manner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="s1"&gt;'7d'&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;ALIGN&lt;/span&gt; &lt;span class="s1"&gt;'1d'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each Range expression, we use &lt;code&gt;align_to&lt;/code&gt; (specified by the &lt;code&gt;TO&lt;/code&gt; keyword; when omitted, as in the query above, it defaults to UTC time 0. For more usage of the &lt;code&gt;TO&lt;/code&gt; keyword, please refer to &lt;a href="https://docs.greptime.com/reference/sql/range#to-option" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;), together with the align (&lt;code&gt;1d&lt;/code&gt;) and range (&lt;code&gt;7d&lt;/code&gt;) parameters, to define time windows (each time window is called a time slot) and assign data points to time slots based on their timestamps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The time origin on the time axis is set at &lt;code&gt;align_to&lt;/code&gt;, and we segment aligned time points both forwards and backwards using align as the step size. This collection of time points is referred to as &lt;code&gt;align_ts&lt;/code&gt;. The formula for &lt;code&gt;align_ts&lt;/code&gt; is &lt;code&gt;{ ts | ts = align_to + k * align, k is an integer }&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each element &lt;code&gt;ts&lt;/code&gt; in the &lt;code&gt;align_ts&lt;/code&gt; set, a &lt;code&gt;time slot&lt;/code&gt; is defined. A time slot is a left-closed, right-open interval satisfying &lt;code&gt;[ts , ts + range)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
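&lt;p&gt;The slot computation in steps 1 and 2 can be sketched as follows (a simplified model with all times as plain &lt;code&gt;i64&lt;/code&gt; offsets; the function name and signature are ours, not GreptimeDB's):&lt;/p&gt;

```rust
// A sketch of the slot computation described above: every time slot is the
// left-closed, right-open interval [t, t + range) where
// t = align_to + k * align for some integer k. A data point at `ts` belongs
// to every slot whose start satisfies ts - range < t <= ts.
fn slots_for(ts: i64, align_to: i64, align: i64, range: i64) -> Vec<i64> {
    // Largest k with align_to + k * align <= ts (div_euclid floors
    // correctly for negative offsets too).
    let mut k = (ts - align_to).div_euclid(align);
    let mut slots = Vec::new();
    loop {
        let t = align_to + k * align;
        if t + range <= ts {
            break; // this slot ends at or before ts, so do all earlier ones
        }
        slots.push(t);
        k -= 1;
    }
    slots.reverse();
    slots
}
```

&lt;p&gt;With align of one unit and range of seven, each point falls into seven consecutive slots; with align greater than range, a point falls into at most one slot and may fall into none.&lt;/p&gt;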

&lt;p&gt;When align is greater than range, the segmented time slots are as illustrated below, and in this scenario, a single data point will belong to only one time slot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdsn4l0oovd2e392mp7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdsn4l0oovd2e392mp7q.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When align is smaller than range, the segmented time slots appear as shown in the following illustration. In this situation, a single data point may belong to multiple time slots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc2yz64m36mjf2d9923l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc2yz64m36mjf2d9923l.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementation of the Range feature utilizes the classic hash aggregation algorithm. This involves reserving a hash bucket for each time slot being sampled and placing all the data scheduled for sampling into the corresponding hash buckets.&lt;/p&gt;

&lt;p&gt;Unlike traditional aggregation algorithms, time series data aggregation may involve overlapping data points (e.g. calculating the daily average temperature for each week). In algorithmic terms, this means a single data point may belong to multiple hash buckets, which differentiates it from the conventional hash aggregation approach.&lt;/p&gt;
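&lt;p&gt;This multi-bucket variant of hash aggregation can be sketched as follows (a simplified model, not GreptimeDB's implementation; &lt;code&gt;align_to&lt;/code&gt; is fixed at 0 and all names are ours):&lt;/p&gt;

```rust
use std::collections::HashMap;

// A sketch of the modified hash aggregation described above: unlike a
// classic GROUP BY, a single point may be pushed into several hash buckets,
// one per time slot [t, t + range) that covers its timestamp.
fn range_avg(points: &[(i64, f64)], align: i64, range: i64) -> HashMap<i64, f64> {
    let mut buckets: HashMap<i64, (f64, u32)> = HashMap::new();
    for &(ts, value) in points {
        // Every slot start t (a multiple of `align`) with
        // ts - range < t <= ts receives this point.
        let mut k = ts.div_euclid(align);
        while align * k + range > ts {
            let entry = buckets.entry(align * k).or_insert((0.0, 0));
            entry.0 += value;
            entry.1 += 1;
            k -= 1;
        }
    }
    // Reduce each bucket to its average.
    buckets
        .into_iter()
        .map(|(slot, (sum, count))| (slot, sum / count as f64))
        .collect()
}
```

&lt;p&gt;With &lt;code&gt;range&lt;/code&gt; larger than &lt;code&gt;align&lt;/code&gt;, each point contributes to several buckets, which is exactly what makes this differ from conventional hash aggregation.&lt;/p&gt;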

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By leveraging the SQL RANGE query syntax extension provided by GreptimeDB, combined with the powerful expressive capabilities of the SQL language itself, we can conduct more concise, elegant, and efficient analysis and querying of time series data within GreptimeDB. &lt;br&gt;
This approach also circumvents some of the limitations encountered in data querying with PromQL. Users can flexibly utilize RANGE queries in GreptimeDB to unlock new methods for time series data analysis and querying.&lt;/p&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>timeseries</category>
      <category>greptime</category>
    </item>
  </channel>
</rss>
