<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shawn Adams</title>
    <description>The latest articles on DEV Community by Shawn Adams (@shawn).</description>
    <link>https://dev.to/shawn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F196019%2F02016edf-8d14-4307-9ef2-8b07b4cf4867.jpeg</url>
      <title>DEV Community: Shawn Adams</title>
      <link>https://dev.to/shawn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shawn"/>
    <language>en</language>
    <item>
      <title>Benchmarking Elasticsearch and Rockset: Rockset achieves up to 4X faster streaming data ingestion</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Wed, 03 May 2023 07:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/benchmarking-elasticsearch-and-rockset-rockset-achieves-up-to-4x-faster-streaming-data-ingestion-5f6h</link>
      <guid>https://dev.to/rocksetcloud/benchmarking-elasticsearch-and-rockset-rockset-achieves-up-to-4x-faster-streaming-data-ingestion-5f6h</guid>
      <description>&lt;p&gt;Rockset is a database used for real-time search and analytics on streaming data. In scenarios involving analytics on massive data streams, we’re often asked the maximum throughput and lowest data latency Rockset can achieve and how it stacks up to other databases. To find out, we decided to test the streaming ingestion performance of Rockset’s next generation cloud architecture and compare it to open-source search engine &lt;a href="https://rockset.com/comparisons/elasticsearch-vs-rockset/"&gt;Elasticsearch&lt;/a&gt;, a popular sink for Apache Kafka.&lt;/p&gt;

&lt;p&gt;For this benchmark, we evaluated Rockset and Elasticsearch ingestion performance on throughput and data latency. Throughput measures the rate at which data is processed, impacting the database's ability to efficiently support high-velocity data streams. Data latency, on the other hand, refers to the amount of time it takes to ingest and index the data and make it available for querying, affecting the ability of a database to provide up-to-date results. We examine latency at the 95th and 99th percentile, given that both databases are used for production applications and require predictable performance.&lt;/p&gt;

&lt;p&gt;We found that Rockset beat Elasticsearch on both throughput and end-to-end latency at the 99th percentile. &lt;strong&gt;Rockset achieved up to 4x higher throughput and 2.5x lower latency than Elasticsearch&lt;/strong&gt; for streaming data ingestion.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll walk through the benchmark framework, configuration and results. We’ll also delve under the hood of the two databases to better understand why their performance differs when it comes to search and analytics on high-velocity data streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why measure streaming data ingestion?
&lt;/h2&gt;

&lt;p&gt;Streaming data is on the rise with over &lt;a href="https://kafka.apache.org/powered-by"&gt;80% of Fortune 100&lt;/a&gt; companies using Apache Kafka. Many industries including gaming, internet and financial services are mature in their adoption of event streaming platforms and have already graduated from data streams to torrents. This makes it crucial to understand the scale at which eventually consistent databases like Rockset and Elasticsearch can ingest and index data for real-time search and analytics.&lt;/p&gt;

&lt;p&gt;In order to &lt;a href="https://rockset.com/blog/the-rise-of-streaming-data-and-the-modern-real-time-data-stack/"&gt;unlock streaming data&lt;/a&gt; for real-time use cases including personalization, anomaly detection and logistics tracking, organizations pair an event streaming platform such as Confluent Cloud, Apache Kafka or Amazon Kinesis with a downstream database. There are several advantages that come from using a database like Rockset or Elasticsearch including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorporating historical and real-time streaming data for search and analytics&lt;/li&gt;
&lt;li&gt;Supporting transformations and rollups at time of ingest&lt;/li&gt;
&lt;li&gt;Accommodating data models that are in flux&lt;/li&gt;
&lt;li&gt;Supporting query patterns that require specific indexing strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Furthermore, many search and analytics applications are &lt;a href="https://rockset.com/blog/the-rise-of-streaming-data-and-the-modern-real-time-data-stack/"&gt;latency sensitive&lt;/a&gt;, leaving only a small window of time to take action. This is the benefit of databases designed with streaming in mind: they can efficiently process incoming events as they arrive rather than fall back to slow batch processing.&lt;/p&gt;

&lt;p&gt;Now, let’s jump into the benchmark so you can see the streaming ingest performance you can achieve on Rockset and Elasticsearch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using RockBench to measure throughput and latency
&lt;/h2&gt;

&lt;p&gt;We evaluated the streaming ingest performance of Rockset and Elasticsearch on &lt;a href="https://github.com/rockset/rockbench"&gt;RockBench&lt;/a&gt;, a benchmark that measures the peak throughput and end-to-end latency of databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rockset.com/whitepapers/evaluating-data-latency-for-real-time-databases/"&gt;RockBench&lt;/a&gt; has two components: a data generator and a metrics evaluator. The data generator writes events every second to the database; the metrics evaluator measures the throughput and end-to-end latency or the time from when the event is generated until it is queryable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HgghKURe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/15Ow4ZMvYjQM5X9YO2f22R/58295c780c230e8e4019d456882b3d6a/Screen_Shot_2023-05-03_at_6.48.55_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HgghKURe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/15Ow4ZMvYjQM5X9YO2f22R/58295c780c230e8e4019d456882b3d6a/Screen_Shot_2023-05-03_at_6.48.55_AM.png" alt="Multiple instances of the benchmark connect to the database under test." width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Multiple instances of the benchmark connect to the database under test.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The data generator produces documents of 1.25KB each, with each document representing a single event. At that size, 8,000 writes per second is equivalent to 10 MB/s.&lt;/p&gt;

&lt;p&gt;Peak throughput is the highest throughput at which the database can keep up without an ever-growing backlog. For this benchmark, we increased the ingestion rate in increments of 10 MB/s until the database could no longer sustain the throughput for a period of 45 minutes, and took the peak throughput to be the last 10 MB/s increment at which the database kept up with the write rate.&lt;/p&gt;
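
&lt;p&gt;The search for the peak can be sketched in a few lines of Python. This is an illustrative simplification, not RockBench itself; the &lt;code&gt;sustains_rate&lt;/code&gt; callback and the 200 MB/s ceiling are hypothetical stand-ins for running the generators against a live cluster for 45 minutes at each rate.&lt;/p&gt;

```python
# Illustrative sketch of the peak-throughput search described above.
DOC_SIZE_KB = 1.25       # each generated event is 1.25KB
STEP_MB_PER_S = 10       # the ingest rate rises in 10 MB/s increments

def docs_per_second(rate_mb_per_s):
    # 10 MB/s divided by 1.25KB per document is 8,000 documents per second
    return int(rate_mb_per_s * 1000 / DOC_SIZE_KB)

def find_peak_throughput(sustains_rate, max_rate_mb_per_s=200):
    """Return the last 10 MB/s increment the database kept up with.

    sustains_rate(rate) stands in for running the generators at that
    rate for 45 minutes and checking the backlog stays bounded.
    """
    peak = 0
    for rate in range(STEP_MB_PER_S,
                      max_rate_mb_per_s + STEP_MB_PER_S,
                      STEP_MB_PER_S):
        if sustains_rate(rate):
            peak = rate
        else:
            break
    return peak
```

&lt;p&gt;For example, a database that keeps up through 90 MB/s but falls behind at 100 MB/s reports a peak throughput of 90 MB/s.&lt;/p&gt;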

&lt;p&gt;Each document has 60 fields containing nested objects and arrays to mirror semi-structured events in real life scenarios. The documents also contain several fields that are used to calculate the end-to-end latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;_id&lt;/code&gt;: The unique identifier of the document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_event_time&lt;/code&gt;: Reflects the clock time of the generator machine&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generator_identifier&lt;/code&gt;: 64-bit random number &lt;/li&gt;
&lt;/ul&gt;
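
&lt;p&gt;A minimal sketch of what a generated event could look like, assuming illustrative payload fields (the real generator emits roughly 60 fields of nested objects and arrays totalling 1.25KB per document):&lt;/p&gt;

```python
import random
import time
import uuid

def make_event(generator_identifier):
    """Build one RockBench-style event (illustrative payload fields;
    the real generator emits roughly 60 fields per 1.25KB document)."""
    return {
        "_id": str(uuid.uuid4()),           # unique document identifier
        "_event_time": time.time() * 1000,  # generator machine clock, in ms
        "generator_identifier": generator_identifier,
        # nested objects and arrays mirror semi-structured real-life events
        "payload": {
            "level": random.randint(1, 100),
            "tags": [random.choice(["gaming", "finance", "logistics"])],
        },
    }

# each generator instance carries one 64-bit random identifier
gen_id = random.getrandbits(64)
event = make_event(gen_id)
```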

&lt;p&gt;The &lt;code&gt;_event_time&lt;/code&gt; of that document is then subtracted from the current time of the machine to arrive at the data latency of the document. This measurement also includes round-trip latency—the time required to run the query and get results from the database back to the client. This metric is published to a Prometheus server and the p50, p95 and p99 latencies are calculated across all evaluators.&lt;/p&gt;
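
&lt;p&gt;The latency calculation can be sketched as follows. This is a self-contained approximation: it uses a nearest-rank percentile in place of the Prometheus aggregation described above, and the sample latencies are made up.&lt;/p&gt;

```python
import math
import time

def data_latency_ms(event_time_ms):
    """End-to-end latency: evaluator clock minus the document's
    _event_time, so it also includes query round-trip time."""
    return time.time() * 1000 - event_time_ms

def percentile(samples, p):
    """Nearest-rank percentile; the real evaluators publish samples to
    Prometheus and compute p50/p95/p99 there."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# made-up latency samples, in milliseconds
latencies = [12.0, 15.5, 11.2, 40.3, 13.1, 14.8, 90.0, 12.9, 13.4, 16.2]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
```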

&lt;p&gt;In this performance evaluation, the data generator inserts new documents to the database and does not update any existing documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  RockBench Configuration &amp;amp; Results
&lt;/h2&gt;

&lt;p&gt;To compare the scalability of ingest and indexing performance in Rockset and Elasticsearch, we used two configurations with different compute and memory allocations. We selected the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/edv4-edsv4-series"&gt;Elasticsearch Elastic Cloud cluster configuration&lt;/a&gt; that most closely matches the CPU and memory allocations of the Rockset virtual instances. Both configurations made use of &lt;a href="https://rockset.com/blog/star-schema-benchmark-intel-ice-lake/"&gt;Intel Ice Lake&lt;/a&gt; processors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JHdXS5AM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/2gesK56ihPe0ISpdrMctl0/2d46a6fbb384f1175d41159d0d7cc6d7/Screen_Shot_2023-05-03_at_10.06.07_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JHdXS5AM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/2gesK56ihPe0ISpdrMctl0/2d46a6fbb384f1175d41159d0d7cc6d7/Screen_Shot_2023-05-03_at_10.06.07_AM.png" alt="Table of the Rockset and Elasticsearch configurations used in the benchmark." width="800" height="443"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Table of the Rockset and Elasticsearch configurations used in the benchmark.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The data generators and data latency evaluators for Rockset and Elasticsearch were run in the US West 2 regions of their respective clouds for regional compatibility. We selected Elastic Elasticsearch on Azure because Azure offers Intel Ice Lake processors. The data generator used Rockset’s write API and &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.7/docs-bulk.html"&gt;Elasticsearch’s bulk API&lt;/a&gt; to write new documents to the databases.&lt;/p&gt;

&lt;p&gt;We ran the Elasticsearch benchmark on the Elastic Elasticsearch managed service &lt;a href="https://www.elastic.co/guide/en/elasticsearch///reference/master/release-notes-8.7.0.html"&gt;version 8.7.0&lt;/a&gt;, the newest stable version, with 32 primary shards, a single replica and a single availability zone. We tested several refresh intervals to tune for better performance and landed on a refresh interval of 1 second, which also happens to be the default setting in Elasticsearch. We settled on 32 primary shards after evaluating performance with 64 and 32 shards, following the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#shard-size-recommendation"&gt;Elastic guidance&lt;/a&gt; that shard sizes range from 10 GB to 50 GB. We ensured that the shards were equally distributed across all of the nodes and that rebalancing was disabled.&lt;/p&gt;

&lt;p&gt;As Rockset is a SaaS service, all cluster operations including shards, replicas and indexes are handled by Rockset. You can expect standard edition Rockset to deliver performance similar to what was achieved on RockBench.&lt;/p&gt;

&lt;p&gt;We ran the benchmark using batch sizes of 50 and 500 documents per write request to showcase how well the databases can handle higher write rates. We chose batch sizes of 50 and 500 documents as they mimic the load typically found in incrementally updating streams and high volume data streams.&lt;/p&gt;
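
&lt;p&gt;The batching logic can be sketched as below. The &lt;code&gt;send&lt;/code&gt; function is a hypothetical placeholder for the actual calls to Rockset’s write API or Elasticsearch’s bulk API.&lt;/p&gt;

```python
def batches(events, batch_size):
    """Group a stream of events into fixed-size write requests
    (the benchmark used batch sizes of 50 and 500)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush any trailing partial batch

def send(batch):
    # hypothetical placeholder for a Rockset write API or
    # Elasticsearch bulk API request
    pass

for batch in batches(({"n": i} for i in range(1000)), 500):
    send(batch)
```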

&lt;h3&gt;
  
  
  Throughput: Rockset sees up to 4x higher throughput than Elasticsearch
&lt;/h3&gt;

&lt;p&gt;With a batch size of 50, Rockset achieves up to 4x higher throughput than Elasticsearch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9YlyKjS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/2D2YX5J5NidAtPcjgzos9q/d7880a860d677d952f1b0dfc4b5a14d5/Screen_Shot_2023-05-03_at_10.07.11_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9YlyKjS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/2D2YX5J5NidAtPcjgzos9q/d7880a860d677d952f1b0dfc4b5a14d5/Screen_Shot_2023-05-03_at_10.07.11_AM.png" alt="Table of the peak throughput and p95 latency of Elasticsearch and Rockset. Databases were evaluated using vCPU 64 and vCPU 128 instances and a batch size of 50." width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Table of the peak throughput and p95 latency of Elasticsearch and Rockset. Databases were evaluated using vCPU 64 and vCPU 128 instances and a batch size of 50.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1HLIMzeu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/7JChqpqSoWMzUnV8J5Hqfj/2df8aa0bfbd1c2323c8cd917e7fd8ce5/Screen_Shot_2023-05-03_at_10.08.31_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1HLIMzeu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/7JChqpqSoWMzUnV8J5Hqfj/2df8aa0bfbd1c2323c8cd917e7fd8ce5/Screen_Shot_2023-05-03_at_10.08.31_AM.png" alt="Table of the peak throughput and p95 latency of Elasticsearch and Rockset. Databases were evaluated using vCPU 64 and vCPU 128 instances and a batch size of 500." width="800" height="442"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Table of the peak throughput and p95 latency of Elasticsearch and Rockset. Databases were evaluated using vCPU 64 and vCPU 128 instances and a batch size of 500.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With a batch size of 500, Rockset achieves up to 1.6x higher throughput than Elasticsearch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zex8vPLp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/34Qhyup4hjb5L9Jv7AM3nt/88625b18e4bc3cf516feebe67a4822c5/Screen_Shot_2023-05-03_at_10.10.05_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zex8vPLp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/34Qhyup4hjb5L9Jv7AM3nt/88625b18e4bc3cf516feebe67a4822c5/Screen_Shot_2023-05-03_at_10.10.05_AM.png" alt="Graph of the peak throughput of Elasticsearch and Rockset using batches of 50 and 500. Databases were evaluated on 64 and 128 vCPU instances. Higher throughput indicates better performance." width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Graph of the peak throughput of Elasticsearch and Rockset using batches of 50 and 500. Databases were evaluated on 64 and 128 vCPU instances. Higher throughput indicates better performance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One observation from the performance benchmark is that Elasticsearch handles larger batch sizes better than smaller batch sizes. The &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_use_bulk_requests"&gt;Elastic documentation&lt;/a&gt; recommends using bulk requests as they achieve better performance than single-document index requests. In comparison to Elasticsearch, Rockset sees better throughput performance with smaller batch sizes as it’s designed to process incrementally updating streams.&lt;/p&gt;

&lt;p&gt;We also observe that the peak throughput scales linearly as the amount of resources increases on Rockset and Elasticsearch. Rockset consistently beats the throughput of Elasticsearch on RockBench, making it better suited to workloads with high write rates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Latency: Rockset sees up to 2.5x lower data latency than Elasticsearch
&lt;/h3&gt;

&lt;p&gt;We compare Rockset and Elasticsearch end-to-end latency at the highest possible throughput that each system achieved. To measure the data latency, we start with a dataset size of 1 TB and measure the average data latency over a period of 45 minutes at the peak throughput.&lt;/p&gt;

&lt;p&gt;We see that for a batch size of 50 the maximum throughput in Rockset is 90 MB/s and in Elasticsearch is 50 MB/s. When evaluating on a batch size of 500, the maximum throughput in Rockset is 110 MB/s and Elasticsearch is 80 MB/s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hjYUj90Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/47QToffvj14Z5jOqRwwX9i/3afc5236b2a63c98c68bea41ccfe364b/Screen_Shot_2023-05-03_at_10.11.04_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hjYUj90Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/47QToffvj14Z5jOqRwwX9i/3afc5236b2a63c98c68bea41ccfe364b/Screen_Shot_2023-05-03_at_10.11.04_AM.png" alt="Table of the 50th, 95th and 99th percentile data latencies on batch sizes of 50 and 500 in Rockset and Elasticsearch. Data latencies are recorded for 128 vCPU instances." width="800" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Table of the 50th, 95th and 99th percentile data latencies on batch sizes of 50 and 500 in Rockset and Elasticsearch. Data latencies are recorded for 128 vCPU instances.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At the 95th and 99th percentiles, Rockset delivers lower data latency than Elasticsearch at peak throughput. The spread between p50 and p99 is also tighter on Rockset than on Elasticsearch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QTk1GuC_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/2LLeBIPQqUE6TxJjVief6t/d4919cee8341bf4a8441a6fa58e741f8/Screen_Shot_2023-05-03_at_10.11.53_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QTk1GuC_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/2LLeBIPQqUE6TxJjVief6t/d4919cee8341bf4a8441a6fa58e741f8/Screen_Shot_2023-05-03_at_10.11.53_AM.png" alt="Graph of the data latency at 50th, 95th and 99th percentiles at the peak throughput rate of Rockset and Elasticsearch. Shows the results of a batch of 500 on 128 vCPU instances. Lower data latency indicates better performance." width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Graph of the data latency at 50th, 95th and 99th percentiles at the peak throughput rate of Rockset and Elasticsearch. Shows the results of a batch of 500 on 128 vCPU instances. Lower data latency indicates better performance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Rockset was able to achieve up to 2.5x lower latency than Elasticsearch for streaming data ingestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  How did we do it? Rockset gains from cloud-native efficiency
&lt;/h2&gt;

&lt;p&gt;There have been open questions as to whether a database can achieve both isolation and real-time performance. The de facto architecture for real-time database systems, including Elasticsearch, is a shared nothing architecture where compute and storage resources are tightly coupled for better performance. With these results, we show that it is possible for a disaggregated &lt;a href="https://rockset.com/blog/introducing-compute-compute-separation/"&gt;cloud architecture&lt;/a&gt; to support search and analytics on high-velocity streaming data.&lt;/p&gt;

&lt;p&gt;One of the tenets of a cloud-native architecture is resource decoupling, made famous by compute-storage separation, which offers better scalability and efficiency. You no longer need to overprovision resources for peak capacity as you can scale up and down on demand. And, you can provision the exact amount of storage and compute needed for your application.&lt;/p&gt;

&lt;p&gt;The knock against decoupled architectures is that they have traded off performance for isolation. In a shared nothing architecture, the tight coupling of resources underpins performance; data ingestion and query processing use the same compute units to ensure that the most recently generated data is available for querying. Storage and compute are also colocated in the same nodes for faster data access and improved query performance.&lt;/p&gt;

&lt;p&gt;While tightly coupled architectures made sense in the past, they are no longer necessary due to advances in cloud architectures. Rockset’s compute-storage and &lt;a href="https://rockset.com/blog/introducing-compute-compute-separation/"&gt;compute-compute separation&lt;/a&gt; for real-time search and analytics lead the way by isolating streaming ingest compute, query compute and hot storage from each other. Rockset is able to ensure queries access the most recent writes by replicating the in-memory state across virtual instances (clusters of compute and memory resources), making the architecture well-suited to latency sensitive scenarios. Furthermore, Rockset creates an elastic hot storage tier that is a shared resource for multiple applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LuIvNkr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/6UIYF8VgMpYe8KKQPZExam/bdb09191c552a7959ec25616274aa77f/Screen_Shot_2023-05-03_at_7.25.12_AM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LuIvNkr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/6UIYF8VgMpYe8KKQPZExam/bdb09191c552a7959ec25616274aa77f/Screen_Shot_2023-05-03_at_7.25.12_AM.png" alt="Diagrams of a (a) shared nothing architecture like Elasticsearch and (b) a compute-compute separation architecture introduced by Rockset." width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagrams of a (a) shared nothing architecture like Elasticsearch and (b) a compute-compute separation architecture introduced by Rockset.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With compute-compute separation, Rockset achieves better ingest performance than Elasticsearch because it only has to process incoming data once. In Elasticsearch, which has a &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html"&gt;primary-backup model for replication&lt;/a&gt;, every replica needs to expend compute indexing and compacting newly generated writes. With &lt;a href="https://rockset.com/blog/tech-overview-compute-compute-separation/"&gt;compute-compute separation&lt;/a&gt;, only a single virtual instance does the indexing and compaction before transferring the newly written data to other instances for application serving. The efficiency gain from processing incoming writes only once is why Rockset recorded up to 4x higher throughput and 2.5x lower end-to-end latency than Elasticsearch on RockBench.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Summary: Rockset achieves up to 4x higher throughput and 2.5x lower latency
&lt;/h3&gt;

&lt;p&gt;In this blog, we have walked through the performance evaluation of Rockset and Elasticsearch for high-velocity data streams and come to the following conclusions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput&lt;/strong&gt;: Rockset supports higher throughput than Elasticsearch, writing incoming streaming data up to 4x faster. We came to this conclusion by measuring the peak throughput, or the rate at which data latency starts monotonically increasing, on different batch sizes and configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;: Rockset consistently delivers lower data latencies than Elasticsearch at the 95th and 99th percentiles, making Rockset well suited for latency sensitive application workloads. Rockset provides up to 2.5x lower end-to-end latency than Elasticsearch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost/Complexity&lt;/strong&gt;: We compared Rockset and Elasticsearch streaming ingest performance on similar allocations of CPU and memory, and found that Rockset offers the best value. For a similar price point, you not only get better performance on Rockset but can also do away with managing clusters, shards, nodes and indexes. This greatly simplifies operations so your team can focus on building production-grade applications.&lt;/p&gt;

&lt;p&gt;We ran this performance benchmark on Rockset’s next generation cloud architecture with compute-compute separation. We were able to show that even with streaming ingest compute, query compute and storage isolated from each other, Rockset was still able to achieve better performance than Elasticsearch.&lt;/p&gt;

&lt;p&gt;You can evaluate Rockset for your own real-time search and analytics workload by starting a &lt;a href="https://rockset.com/create/"&gt;free trial with $300 in credits&lt;/a&gt;. We have built-in connectors to Confluent Cloud, Kafka and Kinesis along with a host of OLTP databases to make it easy for you to get started.&lt;/p&gt;

</description>
      <category>rockset</category>
      <category>elasticsearch</category>
      <category>performance</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Introducing Vector Search on Rockset: How to run semantic search with OpenAI and Rockset</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 18 Apr 2023 07:00:00 +0000</pubDate>
      <link>https://dev.to/shawn/introducing-vector-search-on-rockset-how-to-run-semantic-search-with-openai-and-rockset-4ijp</link>
      <guid>https://dev.to/shawn/introducing-vector-search-on-rockset-how-to-run-semantic-search-with-openai-and-rockset-4ijp</guid>
      <description>&lt;p&gt;We’re excited to introduce vector search on Rockset to power fast and efficient search experiences, personalization engines, fraud detection systems and more. To highlight these new capabilities, we built a search demo using &lt;a href="https://platform.openai.com/docs/guides/embeddings"&gt;OpenAI&lt;/a&gt; to create embeddings for Amazon product descriptions and Rockset to generate relevant search results. In the demo, you’ll see how Rockset delivers search results in 15 milliseconds over thousands of documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use vector search?
&lt;/h2&gt;

&lt;p&gt;Organizations have continued to accumulate large quantities of unstructured data, ranging from text documents to multimedia content to machine and sensor data. Estimates show that unstructured data represents &lt;a href="https://www2.deloitte.com/us/en/insights/topics/analytics/insight-driven-organization.html"&gt;80% of all generated data&lt;/a&gt;, but organizations only leverage a small fraction of it to extract valuable insights, power decision-making and create immersive experiences. Understanding how to leverage unstructured data has remained challenging and costly, requiring technical depth and domain expertise. Due to these difficulties, unstructured data has remained largely underutilized.&lt;/p&gt;

&lt;p&gt;With the evolution of machine learning, neural networks and large language models, organizations can easily transform unstructured data into embeddings, commonly represented as vectors. Vector search operates across these vectors to identify patterns and quantify similarities between components of the underlying unstructured data.&lt;/p&gt;

&lt;p&gt;Before vector search, search experiences primarily relied on keyword search, which frequently involved manually tagging data to identify and deliver relevant results. The process of manually tagging documents requires a host of steps like creating taxonomies, understanding search patterns, analyzing input documents, and maintaining custom rule sets. As an example, if we wanted to search for tagged keywords to deliver product results, we would need to manually tag “Fortnite” as a “survival game” and “multiplayer game.” We would also need to identify and tag phrases with similarities to “survival game” like “battle royale” and “open-world play” to deliver relevant search results.&lt;/p&gt;

&lt;p&gt;More recently, keyword search has come to rely on term proximity, which relies on tokenization. Tokenization involves breaking down titles, descriptions and documents into individual words and portions of words, and then term proximity functions deliver results based on matches between those individual words and search terms. Although tokenization reduces the burden of manually tagging and managing search criteria, keyword search still lacks the ability to return semantically similar results, especially in the context of natural language which relies on associations between words and phrases.&lt;/p&gt;
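
&lt;p&gt;A toy example of the limitation: the tokenizer below is a crude stand-in for a real analyzer, and the matcher only finds documents that share a literal token with the query, so a semantically close query returns nothing.&lt;/p&gt;

```python
import re

def tokenize(text):
    # lowercase and split on runs of letters and digits: a crude
    # stand-in for a search engine's analyzer
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_match(query, document):
    """True only when the query and document share a literal token,
    which is why keyword search misses semantic neighbours."""
    return bool(tokenize(query).intersection(tokenize(document)))

doc = "Fortnite is a battle royale with open-world play"
keyword_match("battle royale", doc)   # shared tokens, so this matches
keyword_match("survival game", doc)   # semantically close, but no match
```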

&lt;p&gt;With vector search, we can leverage text embeddings to capture semantic associations across words, phrases and sentences to power more robust search experiences. For example, we can use vector search to find games with “space and adventure, open-world play and multiplayer options.” Instead of manually tagging each game with this potential criteria or tokenizing each game description to search for exact results, we would use vector search to automate the process and deliver more relevant results.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do embeddings power vector search?
&lt;/h2&gt;

&lt;p&gt;Embeddings, represented as arrays or vectors of numbers, capture the underlying meaning of unstructured data like text, audio, images and videos in a format more easily understood and manipulated by computational models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--To9CRnqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/1g9ijHZkPCwRT1h2tjlhcu/efe1714d0d2338adcba87e6a5f63f266/image__17_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--To9CRnqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/1g9ijHZkPCwRT1h2tjlhcu/efe1714d0d2338adcba87e6a5f63f266/image__17_.png" alt="Two-dimensional space used to determine the semantic relationship between games using distance functions like cosine, Euclidean distance and dot product" width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Two-dimensional space used to determine the semantic relationship between games using distance functions like cosine, Euclidean distance and dot product&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As an example, I could use embeddings to understand the relationship between terms like “Fortnite,” “PUBG” and “Battle Royale.” Models derive meaning from these terms by creating embeddings for them, which group together when mapped to a multi-dimensional space. In a two-dimensional space, a model would generate specific coordinates (x, y) for each term, and then we would understand the similarity between these terms by measuring the distances and angles between them.&lt;/p&gt;
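
&lt;p&gt;As a sketch, cosine similarity over hypothetical two-dimensional coordinates (the coordinates below are made up for illustration):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means the vectors point the same
    way; values near 0 or below mean the terms are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# made-up 2-D coordinates a model might assign to each term
embeddings = {
    "Fortnite": (0.9, 0.8),
    "PUBG": (0.85, 0.75),
    "spreadsheet": (-0.7, 0.1),
}

close = cosine_similarity(embeddings["Fortnite"], embeddings["PUBG"])
far = cosine_similarity(embeddings["Fortnite"], embeddings["spreadsheet"])
```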

&lt;p&gt;In real-world applications, unstructured data can consist of billions of data points and translate into embeddings with thousands of dimensions. Vector search analyzes these embeddings to identify terms in close proximity to each other, such as “Fortnite” and “PUBG,” as well as synonyms in even closer proximity, like “PlayerUnknown's Battlegrounds” and its acronym “PUBG.”&lt;/p&gt;

&lt;p&gt;Vector search has seen an explosion in popularity due to improvements in accuracy and broadened accessibility to the models used to generate embeddings. Embedding models like &lt;a href="https://cloud.google.com/ai-platform/training/docs/algorithms/bert-start"&gt;BERT&lt;/a&gt; have led to exponential improvements in natural language processing and understanding, generating embeddings with thousands of dimensions. OpenAI’s text embedding model, &lt;a href="https://openai.com/blog/new-and-improved-embedding-model"&gt;text-embedding-ada-002&lt;/a&gt;, generates embeddings with 1,536 dimensions, creating a rich representation of the underlying language.&lt;/p&gt;
&lt;h2&gt;
  
  
  Powering fast and efficient search with Rockset
&lt;/h2&gt;

&lt;p&gt;Given we have embeddings for our unstructured data, we can turn towards vector search to identify similarities across these embeddings. Rockset offers a number of out-of-the-box distance functions, including &lt;a href="https://rockset.com/docs/vector-functions/"&gt;dot product, cosine similarity, and Euclidean distance&lt;/a&gt;, to calculate the similarity between embeddings and search inputs. We can use these similarity scores to support K-Nearest Neighbors (kNN) search on Rockset, which returns the k most similar embeddings to the search input.&lt;/p&gt;
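&lt;p&gt;Conceptually, a kNN query scores every candidate embedding against the search input and keeps the k best. The sketch below shows the three distance functions and the ranking step in plain Python over tiny invented vectors; it illustrates the computation, not Rockset’s implementation:&lt;/p&gt;

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def knn(query, documents, k, score=cosine_similarity):
    # Rank every stored embedding by similarity to the query and keep the top k
    ranked = sorted(documents.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = {
    "doc_a": [0.1, 0.9, 0.2],
    "doc_b": [0.8, 0.1, 0.3],
    "doc_c": [0.2, 0.8, 0.1],
}
print(knn([0.15, 0.85, 0.15], docs, k=2))  # ['doc_a', 'doc_c']
```

&lt;p&gt;A production system replaces the exhaustive loop with indexing, but the contract is the same: given a query embedding, return the k most similar stored embeddings.&lt;/p&gt;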

&lt;p&gt;Leveraging the newly released vector operations and distance functions, Rockset now supports vector search capabilities. Rockset extends its real-time search and analytics capabilities to vector search, joining vector databases like Milvus, Pinecone and Weaviate, as well as alternatives such as &lt;a href="https://rockset.com/comparisons/elasticsearch-vs-rockset/"&gt;Elasticsearch&lt;/a&gt;, in indexing and storing vectors. Under the hood, Rockset utilizes its Converged Index technology, which is optimized for metadata filtering, vector search and keyword search, supporting sub-second search, aggregations and joins at scale.&lt;/p&gt;

&lt;p&gt;Alongside vector search support, Rockset offers a number of benefits for creating relevant experiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-Time Data: Ingest and index incoming data in real-time with support for updates.&lt;/li&gt;
&lt;li&gt;Feature Generation: Transform and aggregate data during the ingest process to generate complex features and reduce data storage volumes. &lt;/li&gt;
&lt;li&gt;Fast Search: Combine vector search and selective metadata filtering to deliver fast, efficient results.&lt;/li&gt;
&lt;li&gt;Hybrid Search Plus Analytics: Join other data with your vector search results to deliver rich and more relevant experiences using SQL.&lt;/li&gt;
&lt;li&gt;Fully-Managed Cloud Service: Run all of these processes on a horizontally scalable, highly available cloud-native database with compute-storage and compute-compute separation for cost-efficient scaling.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Building Product Search Recommendations
&lt;/h2&gt;

&lt;p&gt;Let’s walk through how to run semantic search using OpenAI and Rockset to find relevant products on Amazon.com.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tbGpwUIA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/77lP3f98RUpxz8biKAIOJz/70c1b2fd2ba823fd1e0633bd7812711c/image__18_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tbGpwUIA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/77lP3f98RUpxz8biKAIOJz/70c1b2fd2ba823fd1e0633bd7812711c/image__18_.png" alt="The workflow of semantic search using Amazon product reviews, vector embeddings from OpenAI and nearest neighbor search in Rockset" width="800" height="358"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The workflow of semantic search using Amazon product reviews, vector embeddings from OpenAI and nearest neighbor search in Rockset&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For this demonstration, we used product data that &lt;a href="https://jmcauley.ucsd.edu/data/amazon/"&gt;Amazon&lt;/a&gt; has made available to the public, including product listings and reviews.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p2PoAE-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/4gMscJXfo2bsq0mPkFWFu2/d1df67fec3960d8f3e45418c0da866e9/image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p2PoAE-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/4gMscJXfo2bsq0mPkFWFu2/d1df67fec3960d8f3e45418c0da866e9/image1.png" alt="Sample of the Amazon product reviews dataset" width="800" height="205"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sample of the Amazon product reviews dataset&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Generate Embeddings
&lt;/h3&gt;

&lt;p&gt;The first stage of this walkthrough involves using &lt;a href="https://platform.openai.com/docs/guides/embeddings"&gt;OpenAI’s text embeddings API&lt;/a&gt; to generate embeddings for Amazon product descriptions. We opted for OpenAI’s &lt;a href="https://openai.com/blog/new-and-improved-embedding-model"&gt;text-embedding-ada-002&lt;/a&gt; model due to its performance, accessibility and reduced embedding size. That said, we could have used a variety of other models to generate these embeddings; we considered several models from &lt;a href="https://huggingface.co/sentence-transformers"&gt;HuggingFace&lt;/a&gt;, which users can run locally.&lt;/p&gt;

&lt;p&gt;OpenAI’s model generates an embedding with 1,536 elements. In this walkthrough, we’ll generate and save embeddings for 8,592 product descriptions of video games listed on Amazon. We will also create an embedding for the search query used in the demonstration, “space and adventure, open-world play and multiplayer options.”&lt;/p&gt;

&lt;p&gt;We use the following code to generate the embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import gzip
import json
import openai


# Download the following file from https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
product_data_full = []
for line in gzip.open('./amazon_metadata/meta_Video_Games.json.gz', 'rt', encoding='UTF-8'):
    product_data_full.append(json.loads(line))


# Keep only products with a description and price, embedding a subset of the data to save time and money
product_data = []
for item in range(12000):
    if product_data_full[item]['description'] and product_data_full[item]['price']:
        product_data.append(product_data_full[item])


# Create an embedding for each product description
for item in product_data:
    item['description_embedding'] = openai.Embedding.create(input=item['description'][0], model="text-embedding-ada-002")["data"][0]["embedding"]


# Create new file with embeddings (output path is illustrative)
jsonFile = open('./amazon_metadata/video_game_embeddings.json', 'w')
for item in product_data:
    jsonString = json.dumps(item)
    jsonFile.write(jsonString + '\n')
jsonFile.close()


# Generate embedding for future search input
search_query = 'space and adventure, open-world play and multiplayer options'
search_query_embedding = openai.Embedding.create(input=search_query, model="text-embedding-ada-002")["data"][0]["embedding"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Upload Embeddings to Rockset
&lt;/h3&gt;

&lt;p&gt;In the second step, we’ll upload these embeddings, along with the product data, to Rockset and create a new collection to start running vector search. Here’s how the process works:&lt;/p&gt;

&lt;p&gt;We create a collection in Rockset by uploading the file created earlier with the video game product listings and associated embeddings. Alternatively, we could have easily pulled the data from other storage mechanisms, like Amazon S3 and Snowflake, or streaming services, like Kafka and Amazon Kinesis, leveraging Rockset’s built-in connectors. We then leverage Ingest Transformations to transform the data during the ingest process using SQL. We use Rockset’s new &lt;code&gt;VECTOR_ENFORCE&lt;/code&gt; function to validate the length and elements of incoming arrays, which ensures compatibility between vectors during query execution.&lt;/p&gt;
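&lt;p&gt;Conceptually, the validation that &lt;code&gt;VECTOR_ENFORCE&lt;/code&gt; performs can be sketched in plain Python (this is our reading of the check, standing in for the SQL function, not Rockset’s implementation): an incoming array is accepted only if it has the expected length and all-numeric elements:&lt;/p&gt;

```python
def enforce_vector(value, expected_length):
    # Reject anything that is not a fixed-length array of numbers --
    # the guarantee distance functions need before they can run on a column.
    if not isinstance(value, list) or len(value) != expected_length:
        return None
    if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in value):
        return None
    return [float(x) for x in value]

print(enforce_vector([0.1, 0.2, 0.3], 3))  # [0.1, 0.2, 0.3]
print(enforce_vector([0.1, "oops"], 2))    # None (non-numeric element)
print(enforce_vector([0.1, 0.2], 3))       # None (wrong length)
```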

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9DyUNjdJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/6511jjDxO3KoiNmdsAfBWe/bbca15d373ce677a667dd741fc46f43d/image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9DyUNjdJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/6511jjDxO3KoiNmdsAfBWe/bbca15d373ce677a667dd741fc46f43d/image2.png" alt="Use of the VECTOR_ENFORCE function as part of an ingest transformation" width="800" height="441"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Use of the &lt;code&gt;VECTOR_ENFORCE&lt;/code&gt; function as part of an ingest transformation&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run Vector Search on Rockset
&lt;/h3&gt;

&lt;p&gt;Let’s now run vector search on Rockset using the newly released distance functions. &lt;code&gt;COSINE_SIM&lt;/code&gt; takes in the description embeddings field as one argument and the search query embedding as another. Rockset makes all of this possible and intuitive with full-featured SQL.&lt;/p&gt;

&lt;p&gt;For this demonstration, we copied and pasted the search query embedding into the &lt;code&gt;COSINE_SIM&lt;/code&gt; function within the &lt;code&gt;SELECT&lt;/code&gt; statement. Alternatively, we could have generated the embedding in real time by directly calling the OpenAI Text Embedding API and passing the embedding to Rockset as a Query Lambda parameter.&lt;/p&gt;

&lt;p&gt;Due to Rockset’s Converged Index, kNN search queries perform particularly well with selective metadata filtering. Rockset applies these filters before computing the similarity scores, which optimizes the search process by only calculating scores for relevant documents. For this vector search query, we filter by price and game developer to ensure the results reside within a specified price range and the games are playable on a given device.&lt;/p&gt;
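&lt;p&gt;In plain Python terms, filtering before scoring looks like the sketch below (a toy model of the query plan over invented products, not Rockset’s executor): similarity is computed only for documents that survive the metadata predicate:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def filtered_knn(query_embedding, products, k, max_price):
    # Apply the selective metadata filter (a price cap) first...
    candidates = [p for p in products if not p["price"] > max_price]
    # ...then compute similarity scores only for the surviving candidates
    candidates.sort(key=lambda p: cosine_similarity(query_embedding, p["embedding"]), reverse=True)
    return [p["title"] for p in candidates[:k]]

products = [
    {"title": "Space Quest",  "price": 29.99, "embedding": [0.9, 0.1]},
    {"title": "Farm Sim",     "price": 19.99, "embedding": [0.1, 0.9]},
    {"title": "Galaxy Racer", "price": 99.99, "embedding": [0.8, 0.2]},
]
print(filtered_knn([1.0, 0.0], products, k=2, max_price=50))  # ['Space Quest', 'Farm Sim']
```

&lt;p&gt;Note that “Galaxy Racer” is directionally similar to the query but is never scored, because the price filter removes it before the similarity computation.&lt;/p&gt;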

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U-GJKOWU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/6PsO1wUwbC7erZwQDkuzGR/7af58907fd15d2768b403bc7705e2b42/image3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U-GJKOWU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.ctfassets.net/1d31s1aajogl/6PsO1wUwbC7erZwQDkuzGR/7af58907fd15d2768b403bc7705e2b42/image3.png" alt="kNN search on Rockset returns top 5 results in 15MS" width="800" height="493"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;kNN search on Rockset returns top 5 results in 15 ms&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since Rockset filters on brand and price before computing the similarity scores, it returns the top five results from over 8,500 documents in 15 milliseconds on a Large Virtual Instance with 16 vCPUs and 128 GiB of allocated memory. Here are the descriptions for the top three results based on the search input “space and adventure, open-world play and multiplayer options”:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This role-playing adventure for 1 to 4 players lets you plunge deep into a new world of fantasy and wonder, and experience the dawning of a new series.&lt;/li&gt;
&lt;li&gt;Spaceman just crashed on a strange planet and he needs to find all his spacecraft's parts. The problem? He only has a few days to do it!&lt;/li&gt;
&lt;li&gt;180 MPH slap in the face, anyone? Multiplayer modes for up to four players including Deathmatch, Cop Mode and Tag.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To summarize, Rockset runs semantic search in approximately 15 milliseconds on embeddings generated by OpenAI, using a combination of vector search with metadata filtering for faster, more relevant results.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does this mean for search?
&lt;/h2&gt;

&lt;p&gt;We walked through an example of how to use vector search to power semantic search, and there are many other scenarios where fast, relevant search is useful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalization &amp;amp; Recommendation Engines&lt;/strong&gt;: Leverage vector search in your e-commerce websites and consumer applications to determine interests based on activities like past purchases and page views. Vector search algorithms can help generate product recommendations and deliver personalized experiences by identifying similarities between users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anomaly Detection&lt;/strong&gt;: Incorporate vector search to identify anomalous transactions based on their similarities (and differences!) to past, legitimate transactions. Create embeddings based on attributes such as transaction amount, location, time, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive Maintenance&lt;/strong&gt;: Deploy vector search to help analyze factors such as engine temperature, oil pressure, and brake wear to determine the relative health of trucks in a fleet. By comparing readings to reference readings from healthy trucks, vector search can identify potential issues such as a malfunctioning engine or worn-out brakes.&lt;/p&gt;

&lt;p&gt;In the upcoming years, we expect the use of unstructured data to skyrocket as large language models become easily accessible and the cost of generating embeddings continues to decline. Rockset will help accelerate the convergence of real-time machine learning with real-time analytics by easing the adoption of vector search with a fully-managed, cloud-native service.&lt;/p&gt;

&lt;p&gt;Search has become easier than ever as you no longer need to build complex and hard-to-maintain rules-based algorithms or manually configure text tokenizers or analyzers. We see endless possibilities for vector search: explore Rockset for your use case by starting a &lt;a href="https://rockset.com/create/"&gt;free trial today&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Author: John Solitario, Product Manager&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>database</category>
    </item>
    <item>
      <title>Rockset Architecture Whiteboard Session With CTO Dhruba Borthakur</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 14 Jun 2022 13:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/rockset-architecture-whiteboard-session-with-cto-dhruba-borthakur-5eh5</link>
      <guid>https://dev.to/rocksetcloud/rockset-architecture-whiteboard-session-with-cto-dhruba-borthakur-5eh5</guid>
      <description>&lt;p&gt;In this 30 minute video overview, CTO and Rockset Co-founder Dhruba Borthakur discusses &lt;a href="https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/"&gt;Rockset's ALT architecture&lt;/a&gt;, how data is &lt;a href="https://rockset.com/whitepapers/rockset-concepts-designs-and-architecture/"&gt;ingested, stored and queried in Rockset&lt;/a&gt;, and why Rockset is simple to use, incredibly fast and capable of the highly efficient execution of complex distributed queries across diverse data sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EJ5rQc9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/19HNG5PhXbTR3ewJr5kIIY/1164adf81777780cd98bfd75695f0c17/dhruba-whiteboard.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EJ5rQc9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/19HNG5PhXbTR3ewJr5kIIY/1164adf81777780cd98bfd75695f0c17/dhruba-whiteboard.jpg" alt="dhruba-whiteboard" width="880" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll be doing more videos like this in the future, so sign up for &lt;a href="https://rockset.com/blog"&gt;notices from our blog&lt;/a&gt; and &lt;a href="https://community.rockset.com"&gt;join our community&lt;/a&gt; so you don't miss them.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/msW8nh5TTwQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn More about Rockset Architecture
&lt;/h2&gt;

&lt;p&gt;You can find more information about Rockset's architecture and functionality in the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/"&gt;Aggregator Leaf Tailer: An Alternative to Lambda Architecture for Real-Time Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rockset.com/whitepapers/rockset-concepts-designs-and-architecture/"&gt;Rockset Concepts, Design &amp;amp; Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rockset.com/understanding-rockset-guide.pdf"&gt;Understanding Rockset – How It Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rockset.com/docs/"&gt;Rockset Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hG-9N99F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/7eTPLKLzpYTZSju3EVHlsG/e78e7d75ebbb73c8cdaa3b2b47749173/leaf-tailer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hG-9N99F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/7eTPLKLzpYTZSju3EVHlsG/e78e7d75ebbb73c8cdaa3b2b47749173/leaf-tailer.png" alt="Diagram of Rockset ALT architecture" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram of Rockset ALT architecture&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Dhruba Borthakur
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/dhruba/"&gt;Dhruba Borthakur&lt;/a&gt; is CTO and co-founder of Rockset, responsible for the company's technical direction. He was an engineer on the database team at Facebook, where he was the founding engineer of the &lt;a href="http://rocksdb.org/"&gt;RocksDB&lt;/a&gt; data store. Earlier at Yahoo, he was one of the founding engineers of the &lt;a href="https://hadoop.apache.org/"&gt;Hadoop Distributed File System&lt;/a&gt;. He was also a contributor to the open source &lt;a href="https://hbase.apache.org/"&gt;Apache HBase&lt;/a&gt; project. Dhruba previously held various roles at Veritas Software, founded an e-commerce startup, Oreceipt.com, and contributed to &lt;a href="https://en.wikipedia.org/wiki/Andrew_File_System"&gt;Andrew File System (AFS)&lt;/a&gt; at IBM-Transarc Labs. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XfOMIRiq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1Rst8GArUnagjFj5qvafzo/c51d058fb652da85ecd2bf2711d9f035/dhruba-from-6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XfOMIRiq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1Rst8GArUnagjFj5qvafzo/c51d058fb652da85ecd2bf2711d9f035/dhruba-from-6.jpg" alt="dhruba-from-6" width="880" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Rockset
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rockset.com"&gt;Rockset&lt;/a&gt; is a &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database that enables queries on massive, semi-structured data without operational burden. Rockset is serverless and fully managed. It offloads the work of managing configuration, cluster provisioning, denormalization and shard/index management. Rockset is also SOC 2 Type II compliant and offers encryption at rest and in flight, securing and protecting any sensitive data. Most teams can ingest data into Rockset and start executing queries in about 15 minutes, depending on the amount of data ingested.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Rockset&lt;/strong&gt; is the leading &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; platform built for the cloud, delivering fast analytics on real-time data with surprising efficiency. Learn more at &lt;a href="https://rockset.com/"&gt;rockset.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rockset</category>
      <category>database</category>
      <category>realtimeanalytics</category>
      <category>nosql</category>
    </item>
    <item>
      <title>MongoDB vs DynamoDB Head-to-Head: Which Should You Choose?</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 07 Jun 2022 15:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/mongodb-vs-dynamodb-head-to-head-which-should-you-choose-5d2n</link>
      <guid>https://dev.to/rocksetcloud/mongodb-vs-dynamodb-head-to-head-which-should-you-choose-5d2n</guid>
      <description>&lt;p&gt;Databases are a key architectural component of many applications and services.&lt;/p&gt;

&lt;p&gt;Traditionally, organizations have chosen relational databases like SQL Server, &lt;a href="https://rockset.com/press/rockset-enables-cost-effective-real-time-analytics-for-oracle-users/"&gt;Oracle&lt;/a&gt;, &lt;a href="https://rockset.com/solutions/mysql"&gt;MySQL&lt;/a&gt; and &lt;a href="https://rockset.com/solutions/postgresql"&gt;Postgres&lt;/a&gt;. Relational databases use tables and structured languages to store data. They usually have a fixed schema, strict data types and formally-defined relationships between tables using foreign keys. They’re reliable, fast and support checks and constraints that help enforce data integrity.&lt;/p&gt;

&lt;p&gt;They aren’t perfect, though. As companies become increasingly digital, they often begin generating massive amounts of data, and they need a place to store it. Relational databases scale up well, but can be painful to scale &lt;em&gt;out&lt;/em&gt; when a company has more data than a single database server can manage.&lt;/p&gt;

&lt;p&gt;On the other hand, non-relational databases (commonly referred to as NoSQL databases) are flexible databases for big data and real-time web applications. These databases were born out of necessity for storing large amounts of unstructured data. NoSQL databases don't always offer the same data integrity guarantees as a relational database, but they're much easier to scale out across multiple servers.&lt;/p&gt;

&lt;p&gt;NoSQL databases have become so popular that big companies rely on them to store hundreds of terabytes of data and run millions of queries per second. So why have NoSQL databases become so popular compared to traditional, relational databases?&lt;/p&gt;

&lt;p&gt;For one, NoSQL databases can accept any type of data: structured, unstructured or semi-structured. This flexibility makes them the go-to database for many use cases. Secondly, NoSQL is schemaless, so database items can have completely different structures from one another. And as mentioned, due to their architectures, NoSQL databases are easier to scale horizontally than relational databases.&lt;/p&gt;

&lt;p&gt;There are many NoSQL databases available in the market. Two popular options are &lt;a href="https://rockset.com/solutions/mongodb/"&gt;MongoDB&lt;/a&gt; and &lt;a href="https://rockset.com/sql-on-dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt;, and architects often find themselves choosing between the two. In this article, we’ll compare MongoDB and Amazon DynamoDB to each other and highlight their significant differences. We’ll include their pros and cons, differences in data types, and discuss factors like cost, reliability, performance and security.&lt;/p&gt;

&lt;p&gt;Before comparing MongoDB to DynamoDB, let’s take an in-depth look at each solution to understand what they are, their characteristics and their advantages and disadvantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  In This Corner, MongoDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; is a NoSQL, document-oriented general purpose database management system. It is optimized for low latency, high throughput, and high availability. It also supports a JavaScript-based query language to run commands and retrieve data, with official client drivers available for over a dozen programming languages. It’s a cross-platform, open-source non-relational database that stores data as collections of documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YwNhmYtG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/5whylSF1tMy1OAu1Bwbav1/64cca2b74178d2ede624ec1e30e8ec88/MongoDB_Logo_FullColorBlack_RGB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YwNhmYtG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/5whylSF1tMy1OAu1Bwbav1/64cca2b74178d2ede624ec1e30e8ec88/MongoDB_Logo_FullColorBlack_RGB.png" alt="MongoDB Logo FullColorBlack RGB" width="880" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Internally, MongoDB stores documents as BSON, a binary representation of JSON that supports all of JSON's features along with additional data types, more efficient encoding and easier parsing. While MongoDB collections can have a schema against which the database validates new documents, schema validation is optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  MongoDB’s Characteristics
&lt;/h3&gt;

&lt;p&gt;MongoDB is a general-purpose database. It can serve various loads and multiple purposes within an &lt;a href="https://rockset.com/what-is-a-data-application/"&gt;application&lt;/a&gt;. It also has a flexible schema design, meaning there’s no set schema to define how to store data, and it scales both vertically and horizontally. MongoDB includes security features such as authentication and authorization. It also has a document model that maps to objects in application code, making it easy to work with data.&lt;/p&gt;

&lt;h4&gt;
  
  
  MongoDB’s Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; MongoDB has flexible database schemas. You can insert information into the database without worrying about matching criteria or data types. MongoDB supports more native data types than DynamoDB, and it lets you nest documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systems Design:&lt;/strong&gt; Beyond accommodating large volumes of rapidly changing structured, semi-structured and unstructured data, MongoDB enables developers to add to the schema as their needs change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Model:&lt;/strong&gt; Compared to DynamoDB, MongoDB supports regular JSON and advanced BSON data models such as int, long, date, timestamp, geospatial, floating-point and Decimal128.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs Anywhere:&lt;/strong&gt; This solution can run anywhere, so users future-proof their work without fearing vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; MongoDB has a free, open-source version if you are cost conscious. They’ve also recently introduced a &lt;a href="https://www.mongodb.com/use-cases/serverless"&gt;pay-as-you-go, serverless pricing&lt;/a&gt; option for MongoDB Atlas, their managed cloud offering.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  MongoDB’s Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Use:&lt;/strong&gt; MongoDB needs to keep its working set in RAM to achieve acceptable performance. This reliance on RAM makes MongoDB too expensive for many use cases. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Duplication:&lt;/strong&gt; Duplication happens because, in MongoDB, users tend to use nested documents instead of the normalized tables of a relational database. Some of this is deliberate denormalization: MongoDB does not support high-performance JOINs, and instead follows a &lt;em&gt;data that belongs together is stored together&lt;/em&gt; philosophy to avoid JOINs entirely. This duplication can cause data sizes, and the related costs, to climb.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing:&lt;/strong&gt; MongoDB supports simple indexes and complex compound indexes containing multiple document properties. As with most databases, poorly designed or missing indexes can slow reads and writes, as the index must update every time someone inserts a new document in a collection.&lt;/li&gt;
&lt;/ul&gt;
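&lt;p&gt;The duplication trade-off above can be made concrete with a pair of invented examples (plain Python dicts standing in for MongoDB documents): embedding customer details in each order keeps related data together, but repeats those details for every order the customer places.&lt;/p&gt;

```python
# Nested/denormalized style: customer details travel with every order
orders = [
    {"order_id": 1, "customer": {"name": "Ada", "email": "ada@example.com"}, "total": 40},
    {"order_id": 2, "customer": {"name": "Ada", "email": "ada@example.com"}, "total": 15},
]

# Normalized (relational) style: customer stored once, referenced by id
customers = {101: {"name": "Ada", "email": "ada@example.com"}}
orders_normalized = [
    {"order_id": 1, "customer_id": 101, "total": 40},
    {"order_id": 2, "customer_id": 101, "total": 15},
]

# The embedded form repeats the same customer record once per order
duplicates = sum(1 for o in orders if o["customer"]["email"] == "ada@example.com")
print(duplicates)  # 2
```

&lt;p&gt;Reading an order in the embedded form needs no join, which is the point of the design; the cost is that updating the customer's email means touching every order document that embeds it.&lt;/p&gt;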

&lt;h2&gt;
  
  
  And, in This Corner, DynamoDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt; is a fast, flexible, NoSQL database. It’s suitable for all applications that need consistent latency at any scale. It’s a fully managed NoSQL database that’s ideal for document and key-value models. Amazon developed DynamoDB as a managed database for applications requiring similar, simple query patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D0PHvbdD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/14maXj1lPrh2qZVKeA0Qt1/1f1ff99a8e7aea841da372ca719959fb/dynamodb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D0PHvbdD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/14maXj1lPrh2qZVKeA0Qt1/1f1ff99a8e7aea841da372ca719959fb/dynamodb.png" alt="dynamodb" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DynamoDB can scale on-demand to support virtually unlimited read and write operations with single-digit millisecond response times. It’s perfect for mobile, web, gaming and advertising technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  DynamoDB’s Characteristics
&lt;/h3&gt;

&lt;p&gt;DynamoDB is serverless and scales horizontally to support tables of any size, making it well suited to large-scale workloads. Plus, query performance doesn’t degrade with database size when querying by key. It also has a flexible schema that enables you to quickly adapt tables as your needs change without restructuring the table schema (as required in relational databases).&lt;/p&gt;

&lt;p&gt;DynamoDB also offers global tables, albeit at an extra cost. These tables replicate your data across AWS Regions, making it easy for your app to locally access data in the selected regions. DynamoDB also continuously backs up your data to prevent data loss. It encrypts your data for improved security, and is ideally suited for enterprise applications that have strict security requirements.&lt;/p&gt;

&lt;h4&gt;
  
  
  DynamoDB’s Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customizable:&lt;/strong&gt; The DynamoDB database can be modified according to your app’s priorities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast:&lt;/strong&gt; DynamoDB delivers excellent performance, no matter how many records you store or how often you query it by key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; DynamoDB scales seamlessly, regardless of the traffic levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; DynamoDB uses pay-as-you-go, throughput-based pricing, so costs scale with your workload rather than with provisioned hardware, which can keep spend low for small or spiky workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  DynamoDB’s Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited Query Language:&lt;/strong&gt; DynamoDB has a limited query language compared to MongoDB. This is because DynamoDB is a key-value store and not a full document database. Every DynamoDB record has two keys: a partition key and a sort key. Every query must provide one partition key, and can optionally specify a single value or a range for the sort key. That’s it. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Indexing:&lt;/strong&gt; Compared to MongoDB, where indexing your data comes at no extra cost, DynamoDB indexes are limited and complex. Amazon sizes and bills indexes separately from the underlying table data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Because DynamoDB’s pricing is throughput-based and depends on several inputs, your costs fluctuate with your workload and can be hard to predict or budget for.&lt;/li&gt;
&lt;/ul&gt;
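&lt;p&gt;To make the query-language limitation concrete, here is a minimal in-memory sketch (plain Python, not the real DynamoDB API) of what DynamoDB’s query model allows: exactly one partition key value, plus an optional condition such as &lt;code&gt;begins_with&lt;/code&gt; on the sort key.&lt;/p&gt;

```python
# Illustrative sketch of DynamoDB's query model, not the actual API:
# every query names one partition key value and may narrow results with a
# sort-key condition (here, a begins_with prefix match).

def query(items, partition_key, sort_key_prefix=None):
    """Return items for one partition key, optionally narrowed by sort key."""
    results = [i for i in items if i["pk"] == partition_key]
    if sort_key_prefix is not None:
        results = [i for i in results if i["sk"].startswith(sort_key_prefix)]
    return sorted(results, key=lambda i: i["sk"])

orders = [
    {"pk": "user#1", "sk": "2023-01-05", "total": 30},
    {"pk": "user#1", "sk": "2023-02-11", "total": 55},
    {"pk": "user#2", "sk": "2023-01-09", "total": 12},
]

jan = query(orders, "user#1", "2023-01")  # user#1's January orders only
```

&lt;p&gt;Any filter that cannot be expressed this way, such as “all orders over 50 across all users”, falls outside the query model and requires a scan or a secondary index.&lt;/p&gt;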

&lt;h2&gt;
  
  
  Head-to-Head Table of MongoDB vs DynamoDB
&lt;/h2&gt;

&lt;p&gt;Both Amazon DynamoDB and MongoDB are widely used, highly scalable and cloud-compatible NoSQL databases. Despite these similarities, they have some key differences. The table below explores these further:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;th&gt;DynamoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB is open-source and can be deployed anywhere in most clouds and/or on premises.&lt;/td&gt;
&lt;td&gt;DynamoDB is from the AWS ecosystem and can only be used within AWS.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB can either be self-managed or fully managed with the MongoDB Atlas database as a service.&lt;/td&gt;
&lt;td&gt;DynamoDB is a fully managed solution. Amazon handles all server updates, patch updates, and hardware provisioning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Developers need to spend extra time upfront reconfiguring security on MongoDB, especially when self-managed. This is because it runs with defaults permitting unrestricted and direct access to data without authentication. MongoDB Atlas requires setup of authentication and network access via IP access controls or VPC peering.&lt;/td&gt;
&lt;td&gt;Security for DynamoDB starts out restrictive and integrates with AWS IAM policy infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB’s database structure is made of JSON-like documents comprising collections, keys, values, and documents. Documents can contain nested documents.&lt;/td&gt;
&lt;td&gt;DynamoDB is a key-value store; item values can be scalars (including binary blobs), sets, or JSON-like documents (lists and maps).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Index use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB supports up to 64 indexes per collection, and indexes can be added or dropped at any time as the document structure evolves.&lt;/td&gt;
&lt;td&gt;DynamoDB supports up to 20 global secondary indexes per table (by default), which are only eventually consistent with the underlying data, plus up to 5 local secondary indexes, which must be defined at table creation and cannot be modified afterward.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Programming language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB is written in C++ and supports programming languages like C, C++, Go, Java, JavaScript, PHP, Perl, Ruby, Python, and more.&lt;/td&gt;
&lt;td&gt;DynamoDB supports programming languages like Java, JavaScript, Node.js, .NET, PHP, and more.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data type and size restriction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB supports various data types, and allows document sizes of up to 16MB.&lt;/td&gt;
&lt;td&gt;DynamoDB has limited support for data types, and allows item sizes of up to 400 KB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Industry use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Companies use MongoDB for mobile apps and content management systems (CMSs). MongoDB is also excellent for scalability and caching.&lt;/td&gt;
&lt;td&gt;The gaming and Internet of things (IoT) industries widely use DynamoDB.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB uses a fixed pricing model where you pay for provisioned resources ahead of time. Pricing is based on RAM, I/O, and storage for MongoDB Atlas, plus server and sysadmin time if you are hosting MongoDB yourself. Costs are consistent, but may not be optimal for variable workloads.&lt;/td&gt;
&lt;td&gt;DynamoDB uses a variable pricing model where you pay for what you use, which is based on a throughput model with additional charges for features like backup and restore, on-demand capacity, streams, change data capture (CDC) and others. This may cause your costs to be less predictable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querying&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB has a rich query language. You can apply it in various ways: single keys, ranges, graph traversals, joins, and more.&lt;/td&gt;
&lt;td&gt;DynamoDB queries are limited to the primary key and to local secondary indexes (LSIs) and global secondary indexes (GSIs); anything else requires a full table scan.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
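&lt;p&gt;To make the “Querying” row concrete, here are the two request shapes side by side, written as plain Python dicts (pymongo-style filter and boto3-style key condition; the field and key names are hypothetical, and nothing here is executed against a live database).&lt;/p&gt;

```python
# A MongoDB filter can combine arbitrary fields and operators in one query:
mongo_filter = {
    "status": "active",
    "age": {"$gte": 21},
    "tags": {"$in": ["vip", "beta"]},
}  # any field, any combination; no index is required to express this

# A DynamoDB key condition is pinned to the table's key schema:
dynamo_query = {
    "KeyConditionExpression": "pk = :u AND begins_with(sk, :prefix)",
    "ExpressionAttributeValues": {
        ":u": {"S": "user#1"},
        ":prefix": {"S": "2023-"},
    },
}  # exactly one partition key value plus an optional sort-key condition
```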

&lt;h2&gt;
  
  
  Which Database Should I Choose, MongoDB or DynamoDB?
&lt;/h2&gt;

&lt;p&gt;DynamoDB and MongoDB are highly successful modern alternatives to traditional database systems, such as MySQL, PostgreSQL and others. When selecting your database, you need to consider factors such as scale, user requirements, deployment method, storage requirements and functionality.&lt;/p&gt;

&lt;p&gt;If you’re looking for an AWS-native solution with MongoDB-like capabilities, you can also consider Amazon DocumentDB. While DocumentDB is not based on the MongoDB server, its capabilities are close to MongoDB’s, and it is compatible with the &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html"&gt;MongoDB 3.6 and 4.0 APIs&lt;/a&gt;. For many workloads, you can even use DocumentDB as a drop-in replacement for MongoDB.&lt;/p&gt;

&lt;p&gt;MongoDB and DynamoDB are both solid NoSQL databases that meet and solve various user needs. You need to carefully consider whether or not a database fully suits your use case. Each database has unique advantages, so factor in your long-term cloud strategy and an application’s specific needs when deciding which NoSQL database to select.&lt;/p&gt;

&lt;p&gt;Regardless of which NoSQL database you use, pairing it with a &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database is a common pattern, as neither MongoDB nor DynamoDB is primarily an analytical database. If you're building user-facing &lt;a href="https://rockset.com/what-is-a-data-application/"&gt;data applications&lt;/a&gt; using your data stored in MongoDB or DynamoDB, consider &lt;a href="https://rockset.com"&gt;Rockset&lt;/a&gt;, which enables real-time SQL analytics on your &lt;a href="https://rockset.com/solutions/mongodb/"&gt;MongoDB&lt;/a&gt; or &lt;a href="https://rockset.com/sql-on-dynamodb/"&gt;DynamoDB&lt;/a&gt; NoSQL database.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Rockset&lt;/strong&gt; is the leading &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; platform built for the cloud, delivering fast analytics on real-time data with surprising efficiency. Learn more at &lt;a href="https://rockset.com/"&gt;rockset.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>dynamodb</category>
      <category>nosql</category>
      <category>database</category>
    </item>
    <item>
      <title>CDC on DynamoDB</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 10 May 2022 13:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/cdc-on-dynamodb-4idk</link>
      <guid>https://dev.to/rocksetcloud/cdc-on-dynamodb-4idk</guid>
      <description>&lt;p&gt;DynamoDB is a popular NoSQL database available in AWS. It is a managed service with minimal setup and pay-as-you-go costing. Developers can quickly create databases that store complex objects with flexible schemas that can mutate over time. &lt;a href="https://rockset.com/sql-on-dynamodb/"&gt;DynamoDB&lt;/a&gt; is resilient and scalable due to the use of sharding techniques. This seamless, horizontal scaling is a huge advantage that allows developers to move from a proof of concept into a productionized service very quickly.&lt;/p&gt;

&lt;p&gt;However, DynamoDB, like many other NoSQL databases, is great for scalable data storage and single row retrieval but leaves a lot to be desired when it comes to analytics. With SQL databases, analysts can quickly join, group and search across historical data sets. With NoSQL, the language for performing these types of queries is often more cumbersome and proprietary, and joining data is either not possible or not recommended due to performance constraints.&lt;/p&gt;

&lt;p&gt;To overcome this, &lt;a href="https://rockset.com/blog/change-data-capture-what-it-is-and-how-to-use-it/"&gt;Change Data Capture (CDC)&lt;/a&gt; techniques are often used to copy changes from the NoSQL database into an analytics database where analysts can perform more computationally heavy tasks across larger datasets. In this post, we’ll look at how CDC works with DynamoDB and its potential use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Change Data Capture Works on DynamoDB
&lt;/h2&gt;

&lt;p&gt;We have previously discussed the &lt;a href="https://rockset.com/blog/change-data-capture-what-it-is-and-how-to-use-it/"&gt;many different CDC techniques&lt;/a&gt; available. DynamoDB uses a push-type model where changes are pushed to a downstream entity such as a queue or a direct consumer. DynamoDB pushes events about any changes to a DynamoDB stream that can be consumed by targets downstream.&lt;/p&gt;

&lt;p&gt;Usually, push-based CDC patterns are more complex, as they often require another service to act as the middleman between the producer and the consumer of the changes. However, DynamoDB streams are natively supported within DynamoDB and, because they are also a managed service within &lt;a href="https://rockset.com/comparisons/rockset-vs-data-warehouse"&gt;AWS&lt;/a&gt;, can be configured and enabled at the touch of a button. CDC on DynamoDB is easy because you only need to configure a consumer and an alternative data store.&lt;/p&gt;
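&lt;p&gt;As a sketch of what a downstream consumer does with those pushed events, here is a minimal handler for DynamoDB stream records (the record shape follows the documented stream event format, trimmed to the fields used here; the &lt;code&gt;pk&lt;/code&gt; attribute name is hypothetical).&lt;/p&gt;

```python
# Minimal consumer sketch for DynamoDB stream records. Each record carries an
# eventName (INSERT, MODIFY or REMOVE) plus typed attribute values, e.g.
# {"S": "..."} for strings. Here the downstream store is just a dict.

def apply_change(target, record):
    """Replay one stream record into a downstream store."""
    event = record["eventName"]
    key = record["dynamodb"]["Keys"]["pk"]["S"]
    if event == "REMOVE":
        target.pop(key, None)
    else:  # INSERT and MODIFY both carry the new state in NewImage
        target[key] = record["dynamodb"]["NewImage"]

downstream = {}
apply_change(downstream, {
    "eventName": "INSERT",
    "dynamodb": {"Keys": {"pk": {"S": "user#1"}},
                 "NewImage": {"pk": {"S": "user#1"}, "score": {"N": "42"}}},
})
snapshot = dict(downstream)  # downstream now mirrors the table

apply_change(downstream, {
    "eventName": "REMOVE",
    "dynamodb": {"Keys": {"pk": {"S": "user#1"}}},
})  # the deletion is replayed too
```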

&lt;h2&gt;
  
  
  Use Cases for CDC on DynamoDB
&lt;/h2&gt;

&lt;p&gt;Let's take a look at some use cases for why you would need a CDC solution in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Archiving Historical Data
&lt;/h3&gt;

&lt;p&gt;Due to its scalability and schemaless nature, DynamoDB is often used to store time-series data such as &lt;a href="https://rockset.com/solutions/logistics/"&gt;IoT data&lt;/a&gt; or weblogs. The schema of the data in these sources can change depending on what is being logged at any point in time, and these sources often write data at variable speeds depending on current use. This makes DynamoDB a great fit for storing this data, as it can handle the flexible schemas and can also scale up and down on demand based on the throughput of data.&lt;/p&gt;

&lt;p&gt;However, the utility of this data diminishes over time as the data becomes old and out of date. With pay-as-you-go pricing, the more data stored in DynamoDB the more it costs. This means you only want to use DynamoDB as a hot data store for frequently used data sets. Old and stale data should be removed to save cost and also help with efficiency. Often, companies don't want to simply delete this data and instead want to move it elsewhere for archival.&lt;/p&gt;

&lt;p&gt;Setting up a DynamoDB stream for CDC is a great way to solve this. Changes can be captured and sent to the data stream so the data can be archived in S3 or another data store, and a &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html"&gt;data retention policy can be set up&lt;/a&gt; on the data in DynamoDB to automatically delete it after a certain period of time. This reduces storage costs in DynamoDB as the cold data is offloaded to a cheaper storage platform.&lt;/p&gt;
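&lt;p&gt;The retention side of this pattern can be sketched as follows. DynamoDB’s TTL feature deletes items after an epoch-seconds timestamp stored in an item attribute, so you stamp the expiry at write time (the &lt;code&gt;expires_at&lt;/code&gt; attribute name and 90-day retention here are hypothetical choices; the attribute name is whatever you configure on the table).&lt;/p&gt;

```python
# Stamping a TTL attribute at write time. DynamoDB's TTL feature expects an
# epoch-seconds number; items are deleted (and emitted to the stream as
# REMOVE events) some time after that moment passes.
import time

RETENTION_SECONDS = 90 * 24 * 3600  # keep hot data for 90 days (hypothetical)

def with_ttl(item, now=None):
    """Return a copy of the item with an expiry timestamp attached."""
    stamped = dict(item)
    now = time.time() if now is None else now
    stamped["expires_at"] = int(now) + RETENTION_SECONDS
    return stamped

event = {"pk": "device#7", "sk": "2023-05-01T12:00:00", "temp": 21.5}
item = with_ttl(event, now=1_683_000_000)  # fixed clock for illustration
```

&lt;p&gt;Because expired items also flow through the stream as deletions, the archive consumer sees them and can confirm the cold copy exists before the hot copy disappears.&lt;/p&gt;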

&lt;h3&gt;
  
  
  Real-Time Analytics on DynamoDB
&lt;/h3&gt;

&lt;p&gt;As stated previously, DynamoDB is great at retrieving data fast but isn't designed for large-scale data retrieval or complex queries. For example, let's say you have a game that stores user events for each interaction and these events are being written to DynamoDB. Depending on the number of users playing at any time, you need to quickly scale your storage solution to deal with the current throughput, making DynamoDB a great choice.&lt;/p&gt;

&lt;p&gt;However, you now want to &lt;a href="https://rockset.com/blog/scaling-real-time-gaming-leaderboards-with-dynamodb-and-rockset/"&gt;build a leaderboard&lt;/a&gt; that provides statistics for each of these interactions and shows the top ten players based on a particular metric. This &lt;a href="https://rockset.com/webinars/gaming-leaderboards-talk/"&gt;leaderboard would need to update in real time&lt;/a&gt; as new events are captured. DynamoDB does not natively support real-time aggregations of data so this is another use case for using CDC out to an analytics platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G330noB---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1jCwTgtUhnZrDbcpWN6oEu/789b291b2b618cde801d91b7d939f165/cdc-on-dynamodb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G330noB---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1jCwTgtUhnZrDbcpWN6oEu/789b291b2b618cde801d91b7d939f165/cdc-on-dynamodb.png" alt="cdc-on-dynamodb" width="880" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rockset, a &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database, is an ideal fit for this scenario. It has a &lt;a href="https://rockset.com/blog/20x-faster-ingestion-with-rocksets-new-dynamodb-connector/"&gt;built-in connector for DynamoDB&lt;/a&gt; that automatically configures the DynamoDB stream so changes are ingested into Rockset in near real time. The data is automatically indexed in Rockset, enabling fast analytical SQL queries that perform aggregations and calculations across the data.&lt;/p&gt;

&lt;p&gt;Millisecond latency queries can be set up to constantly retrieve the latest version of the leaderboard as new data is ingested. Like DynamoDB, Rockset is a fully serverless solution providing the same scaling and hands-free infrastructure benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Joining Datasets Together
&lt;/h3&gt;

&lt;p&gt;Similar to its lack of analytics capabilities, DynamoDB doesn’t support the joining of tables in queries. NoSQL databases in general tend to lack this capability as data is stored in more complex structures instead of in flat, relational schemas. However, there are times when joining data together for analytics is critical.&lt;/p&gt;

&lt;p&gt;Going back to our real-time gaming leaderboard, rather than just using data from one DynamoDB table, what if we wanted our leaderboard to contain other metadata about a user that comes from a different data source altogether? What if we also wanted to show past performance? These use cases would require queries with table joins.&lt;/p&gt;

&lt;p&gt;Again, we could continue to use Rockset in this scenario. &lt;a href="https://rockset.com/docs/ingest-your-own-data/"&gt;Rockset has multiple connectors available for databases&lt;/a&gt; like &lt;a href="https://rockset.com/solutions/mysql"&gt;MySQL&lt;/a&gt;, &lt;a href="https://rockset.com/solutions/postgresql"&gt;Postgres&lt;/a&gt;, &lt;a href="https://rockset.com/solutions/mongodb/"&gt;MongoDB&lt;/a&gt;, flat files and many more. We could set up connectors to update the data in real time, then amend our leaderboard SQL query to join this data, with a subquery of past performance shown alongside the current leaderboard scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search
&lt;/h3&gt;

&lt;p&gt;Another use case for implementing CDC with DynamoDB streams is search. As we know, DynamoDB is great for fast document lookups using indexes but searching and filtering large data sets is typically slow.&lt;/p&gt;

&lt;p&gt;For searching documents with lots of text, AWS offers CloudSearch, a managed search solution that provides flexible indexing for fast search results with custom, weighted ordering. It is possible to sync DynamoDB data into CloudSearch; however, the current solution does not make use of DynamoDB Streams and &lt;a href="https://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-dynamodb-data.html"&gt;requires a manual technical solution to sync the data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On the other hand, with Rockset you can use its DynamoDB connector to sync data in near real time into Rockset, where, for a simple search, you can use standard SQL &lt;code&gt;where&lt;/code&gt; clauses. For more complex search, &lt;a href="https://rockset.com/docs/text-search-functions/"&gt;Rockset offers search functions&lt;/a&gt; to look for specific terms, boost certain results and perform proximity matching. This could be a viable alternative to AWS CloudSearch if you aren’t searching through large amounts of text, and it is also easier to set up because it uses the DynamoDB streams CDC method. The data becomes searchable in near real time and is indexed automatically, while CloudSearch &lt;a href="https://docs.aws.amazon.com/cloudsearch/latest/developerguide/limits.html"&gt;has limitations on data size and upload frequency&lt;/a&gt; in a 24-hour period.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Flexible and Future-Proofed Solution
&lt;/h2&gt;

&lt;p&gt;It is clear that AWS DynamoDB is a great NoSQL database offering. It is fully managed, easily scalable and cost-effective for developers building solutions that require fast writes and fast single row lookups. For use cases outside of this, you will probably want to implement a CDC solution to move the data into an alternative data store that is more suited to the use case. DynamoDB makes this easy with the use of DynamoDB streams.&lt;/p&gt;

&lt;p&gt;Rockset takes advantage of DynamoDB streams by providing a built-in connector that can capture changes in seconds. As I have described, many of the common use cases for implementing a CDC solution for DynamoDB can be covered by Rockset. Being a fully managed service, it removes infrastructure burdens from developers. Whether your use case is real-time analytics, joining data and/or search, Rockset can provide all three on the same datasets, meaning you can solve more use cases with fewer architectural components.&lt;/p&gt;

&lt;p&gt;This makes Rockset a flexible and future-proofed solution for many real-time analytic use cases on data stored in DynamoDB.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Rockset&lt;/strong&gt; is the leading &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; platform built for the cloud, delivering fast analytics on real-time data with surprising efficiency. Learn more at &lt;a href="https://rockset.com/"&gt;rockset.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>dynamodb</category>
      <category>nosql</category>
      <category>database</category>
    </item>
    <item>
      <title>How Rockset Handles Data Deduplication</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 03 May 2022 13:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/how-rockset-handles-data-deduplication-4f1k</link>
      <guid>https://dev.to/rocksetcloud/how-rockset-handles-data-deduplication-4f1k</guid>
      <description>&lt;p&gt;by Tyler Denton, Sales Engineer, Rockset&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There are two major problems with distributed data systems. The second is out-of-order messages, the first is duplicate messages, the third is off-by-one errors, and the first is duplicate messages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This joke inspired Rockset to confront the data duplication issue through a process we call &lt;em&gt;deduplication&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As data systems become more complex and the number of systems in a stack increases, data deduplication becomes more challenging. That's because duplication can occur in a multitude of ways. This blog post discusses data duplication, how it plagues teams adopting &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt;, and the deduplication solutions Rockset provides to resolve the duplication issue. Whenever another distributed data system is added to the stack, organizations become wary of the operational &lt;em&gt;tax&lt;/em&gt; on their engineering team.&lt;/p&gt;

&lt;p&gt;Rockset addresses the issue of data duplication in a simple way, and helps to free teams from the complexities of deduplication, which include untangling where duplication is occurring, setting up and managing &lt;a href="https://rockset.com/blog/Reverse-ETL-Integration-Census-Hightouch-Omnata/"&gt;extract transform load (ETL)&lt;/a&gt; jobs, and attempting to solve duplication at query time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Duplication Problem
&lt;/h2&gt;

&lt;p&gt;In distributed systems, messages are passed back and forth between many workers, and it’s common for messages to be generated two or more times. A system may create a duplicate message because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A confirmation was not sent.&lt;/li&gt;
&lt;li&gt;The message was replicated before it was sent. &lt;/li&gt;
&lt;li&gt;The message confirmation comes after a timeout.&lt;/li&gt;
&lt;li&gt;Messages are delivered out of order and must be resent. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the time a message arrives at a database management system, the same information may have been received multiple times. Your system must therefore ensure that duplicate records aren’t created, since duplicates are costly and take up storage unnecessarily. These duplicated messages must be consolidated into a single record.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AlgVM8Ui--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/6i7sMqpNVBK3V96Av5xbup/a8ae8d5a140b5cd7c98555fb02e7512a/Deduplication_blog-diagram.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AlgVM8Ui--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/6i7sMqpNVBK3V96Av5xbup/a8ae8d5a140b5cd7c98555fb02e7512a/Deduplication_blog-diagram.jpg" alt="Deduplication blog-diagram" width="880" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deduplication Solutions
&lt;/h2&gt;

&lt;p&gt;Before Rockset, there were three general deduplication methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop duplication before it happens.&lt;/li&gt;
&lt;li&gt;Stop duplication during ETL jobs.&lt;/li&gt;
&lt;li&gt;Stop duplication at query time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Deduplication History
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rockset.com/sql-on-kafka/"&gt;Kafka&lt;/a&gt; was one of the first systems to create a solution for duplication. Kafka guarantees that a message is delivered once and only once. However, if the problem occurs upstream from Kafka, their system will see these messages as non-duplicates and deliver the duplicate messages with different timestamps. Therefore, &lt;em&gt;exactly once&lt;/em&gt; semantics do not always solve duplication issues and can negatively impact downstream workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stop Duplication Before it Happens
&lt;/h3&gt;

&lt;p&gt;Some platforms attempt to stop duplication before it happens. This seems ideal, but this method requires difficult and costly work to identify the location and causes of the duplication.&lt;/p&gt;

&lt;p&gt;Duplication is commonly caused by any of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A switch or router.&lt;/li&gt;
&lt;li&gt;A failing consumer or worker.&lt;/li&gt;
&lt;li&gt;A problem with gRPC connections.&lt;/li&gt;
&lt;li&gt;Too much traffic.&lt;/li&gt;
&lt;li&gt;A window size that is too small for packets. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Keep in mind this is not an exhaustive list.&lt;/p&gt;

&lt;p&gt;This deduplication approach requires in-depth knowledge of the system network, as well as the hardware and framework(s). It is very rare, even for a full-stack developer, to understand the intricacies of all the layers of the OSI model and its implementation at a company. The data storage, access to data pipelines, data transformation, and application internals in an organization of any substantial size are all beyond the scope of a single individual. As a result, there are specialized job titles in organizations. The ability to troubleshoot and identify all locations for duplicated messages requires in-depth knowledge that is simply unreasonable for an individual to have, or even a cross-functional team. Although the cost and expertise requirements are very high, this approach offers the greatest reward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oRgJeyCT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/4bj8dCUSqBG8ZEI7BysKu0/ad55a4dc7d1d2efc2ab73053a2e6e01f/Deduplication_blog_-_OSI.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oRgJeyCT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/4bj8dCUSqBG8ZEI7BysKu0/ad55a4dc7d1d2efc2ab73053a2e6e01f/Deduplication_blog_-_OSI.jpg" alt="Deduplication blog - OSI" width="880" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stop Duplication During ETL Jobs
&lt;/h3&gt;

&lt;p&gt;Stream-processing ETL jobs are another deduplication method: deduplicating as the data stream is consumed. The consumption outlets might include creating a compacted topic and/or introducing an ETL job with a common batch processing tool (e.g., Fivetran, Airflow, or Matillion). However, ETL jobs come with additional overhead to manage, require additional computing costs, are potential failure points with added complexity, and introduce latency into a system that may need high throughput.&lt;/p&gt;

&lt;p&gt;For deduplication via stream-processing ETL jobs to be effective, the jobs must run throughout your system. Since duplication can occur anywhere in a distributed system, it is paramount that your architecture deduplicates in every place messages are passed.&lt;/p&gt;

&lt;p&gt;Stream processors can have an active processing window (open for a specific time) where duplicate messages can be detected and compacted, and out-of-order messages can be reordered. Messages can be duplicated if they are received outside the processing window. Furthermore, these stream processors must be maintained and can take considerable compute resources and operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Messages received outside of the active processing window can be duplicated. We do not recommend solving deduplication issues using this method alone.&lt;/p&gt;
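&lt;p&gt;This weakness can be sketched in a few lines: duplicates that arrive inside the window are compacted away, but a late duplicate is treated as a brand-new message.&lt;/p&gt;

```python
# A toy stream deduplicator with a fixed processing window (integer
# timestamps for simplicity). Duplicates inside the window are dropped;
# a duplicate arriving after the window closes slips through.

def dedupe_stream(messages, window):
    """messages: (msg_id, arrival_time) pairs; returns the messages kept."""
    seen = {}   # msg_id mapped to the arrival time of the copy we kept
    kept = []
    for msg_id, t in messages:
        if msg_id in seen and t in range(seen[msg_id], seen[msg_id] + window):
            continue  # duplicate inside the active window: compacted away
        seen[msg_id] = t
        kept.append((msg_id, t))
    return kept

stream = [("a", 0), ("a", 2), ("b", 3), ("a", 20)]  # "a" repeats late
kept = dedupe_stream(stream, window=10)
# ("a", 2) is dropped, but ("a", 20) arrives after the window and survives
```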

&lt;h3&gt;
  
  
  Stop Duplication at Query Time
&lt;/h3&gt;

&lt;p&gt;Another deduplication method is to attempt to solve it at query time. However, this increases the complexity of your query, which is risky because query errors could be generated.&lt;/p&gt;

&lt;p&gt;For example, if your solution tracks messages using timestamps, and a duplicate message is delayed by one second (instead of 50 milliseconds), its timestamp will no longer match your query’s filter, so the duplicate slips through and the query returns incorrect results.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Rockset Solves Duplication
&lt;/h2&gt;

&lt;p&gt;Rockset solves the duplication problem through unique &lt;a href="https://rockset.com/docs/ingest-transformation/"&gt;SQL-based transformations at ingest time&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rockset is a Mutable Database
&lt;/h3&gt;

&lt;p&gt;Rockset is a &lt;a href="https://rockset.com/blog/why-mutability-is-essential-for-real-time-data-analytics/"&gt;mutable database&lt;/a&gt; and allows for duplicate messages to be merged at ingest time. This system frees teams from the many cumbersome deduplication options covered earlier.&lt;/p&gt;

&lt;p&gt;Each document has a unique identifier called &lt;code&gt;_id&lt;/code&gt; that acts like a primary key. Users can specify this identifier at ingest time (e.g. during updates) using SQL-based transformations. When a new document arrives with the same &lt;code&gt;_id&lt;/code&gt;, the duplicate message merges into the existing record. This offers users a simple solution to the duplication problem.&lt;/p&gt;

&lt;p&gt;When you bring data into Rockset, you can build your own complex &lt;code&gt;_id&lt;/code&gt; key using SQL transformations that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify a single key.&lt;/li&gt;
&lt;li&gt;Identify a composite key.&lt;/li&gt;
&lt;li&gt;Extract data from multiple keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rockset is fully mutable without an active window. As long as you specify messages with &lt;code&gt;_id&lt;/code&gt; or identify &lt;code&gt;_id&lt;/code&gt; within the document you are updating or inserting, incoming duplicate messages will be deduplicated and merged together into a single document.&lt;/p&gt;
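&lt;p&gt;A toy version of this ingest-time merge, with hypothetical field names, looks like the following: the &lt;code&gt;_id&lt;/code&gt; is built from a composite key, and any later arrival with the same key merges into the existing document rather than creating a duplicate.&lt;/p&gt;

```python
# Sketch of merge-on-_id at ingest time, in the spirit of the approach
# described above (field names are hypothetical; the store is a dict).

def make_id(doc):
    """Build a composite _id, e.g. user id plus order id identify a record."""
    return doc["user_id"] + ":" + doc["order_id"]

def ingest(store, doc):
    _id = make_id(doc)
    merged = dict(store.get(_id, {}))
    merged.update(doc)   # a duplicate or update merges into one record
    store[_id] = merged

store = {}
ingest(store, {"user_id": "u1", "order_id": "o9", "status": "placed"})
ingest(store, {"user_id": "u1", "order_id": "o9", "status": "shipped"})
# still one document, now holding the latest state
```

&lt;p&gt;Note there is no window in this sketch: however late the second message arrives, it merges into the same document.&lt;/p&gt;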

&lt;h3&gt;
  
  
  Rockset Enables Data Mobility
&lt;/h3&gt;

&lt;p&gt;Other analytics databases store data in fixed data structures, which require compaction, resharding and rebalancing. Any time there is a change to existing data, a major overhaul of the storage structure is required. Many data systems have active windows to avoid such overhauls; as a result, if you map &lt;code&gt;_id&lt;/code&gt; to a record outside the active window, that update will fail. In contrast, Rockset users have a lot of data mobility and can update any record in Rockset at any time.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Customer Win With Rockset
&lt;/h2&gt;

&lt;p&gt;While we've spoken about the operational challenges with data deduplication in other systems, there's also a compute-spend element. Attempting deduplication at query time, or using ETL jobs can be computationally expensive for many use cases.&lt;/p&gt;

&lt;p&gt;Rockset can handle data changes, and it supports inserts, updates and deletes that benefit end users. Here’s an anonymous story of one of the users that I’ve worked closely with on their real-time analytics use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer Background
&lt;/h3&gt;

&lt;p&gt;A customer had a massive amount of data changes that created duplicate entries within their &lt;a href="https://rockset.com/comparisons/rockset-vs-data-warehouse"&gt;data warehouse&lt;/a&gt;. Every database change resulted in a new record, although the customer only wanted the current state of the data.&lt;/p&gt;

&lt;p&gt;If the customer wanted to put this data into a data warehouse that cannot map &lt;code&gt;_id&lt;/code&gt;, the customer would’ve had to cycle through the multiple events stored in their database. This includes running a base query followed by additional event queries to get to the latest value state. This process is extremely computationally expensive and time consuming.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rockset's Solution
&lt;/h3&gt;

&lt;p&gt;Rockset provided a more efficient deduplication solution to their problem. Rockset maps &lt;code&gt;_id&lt;/code&gt; so only the latest states of all records are stored, and all incoming events are deduplicated. Therefore, the customer only needed to query the latest state. Thanks to this functionality, Rockset enabled this customer to reduce both the compute required and the query processing time — efficiently delivering sub-second queries.&lt;/p&gt;

</description>
      <category>database</category>
      <category>deduplication</category>
      <category>sql</category>
      <category>olap</category>
    </item>
    <item>
      <title>How To Join Data in MongoDB</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Wed, 20 Apr 2022 16:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/how-to-join-data-in-mongodb-539c</link>
      <guid>https://dev.to/rocksetcloud/how-to-join-data-in-mongodb-539c</guid>
      <description>&lt;p&gt;&lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; is one of the most popular databases for modern applications. It enables a more flexible approach to data modeling than traditional SQL databases. Developers can build applications more quickly because of this flexibility and also have multiple deployment options, from the cloud MongoDB Atlas offering through to the open-source Community Edition.&lt;/p&gt;

&lt;p&gt;MongoDB stores each record as a document with fields. These fields can have a range of flexible types and can even have other documents as values. Each document is part of a collection — think of a table if you’re coming from a relational paradigm. When you create a document in a collection that doesn’t exist yet, MongoDB creates the collection on the fly. There’s no need to create a collection and define a schema before you add data.&lt;/p&gt;

&lt;p&gt;MongoDB provides the MongoDB Query Language for performing operations in the database. When retrieving data from a collection of documents, we can search by field, apply filters and sort results in all the ways we’d expect. Plus, most languages have object-document mappers, such as Mongoose in JavaScript and Mongoid in Ruby.&lt;/p&gt;

&lt;p&gt;Adding relevant information from other collections to the returned data isn’t always fast or intuitive. Imagine we have two collections: a collection of users and a collection of products. We want to retrieve a list of all the users and show a list of the products they have each bought. We’d want to do this in a single query to simplify the code and reduce data transactions between the client and the database.&lt;/p&gt;

&lt;p&gt;We’d do this with a left outer join of the Users and Products tables in a SQL database. However, MongoDB isn’t a SQL database. Still, this doesn’t mean that it’s impossible to perform data joins — they just look slightly different than SQL databases. In this article, we’ll review strategies we can use to join data in MongoDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Joining Data in MongoDB
&lt;/h2&gt;

&lt;p&gt;Let’s begin by discussing how we can join data in MongoDB. There are two ways to perform joins: using the &lt;code&gt;$lookup&lt;/code&gt; operator and denormalization. Later in this article, we’ll also look at some alternatives to performing data joins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the $lookup Operator
&lt;/h3&gt;

&lt;p&gt;Beginning with MongoDB version 3.2, the database query language includes the &lt;a href="https://docs.mongodb.com/v5.0/reference/operator/aggregation/lookup/#pipe._S_lookup"&gt;$lookup operator&lt;/a&gt;. MongoDB lookups occur as a stage in an &lt;a href="https://docs.mongodb.com/manual/core/aggregation-pipeline/"&gt;aggregation pipeline&lt;/a&gt;. This operator allows us to join two collections that are in the same database. It effectively adds another stage to the data retrieval process, creating a new array field whose elements are the matching documents from the joined collection. Let’s see what it looks like:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.users.aggregate([{$lookup: 
    {
     from: "products", 
     localField: "product_id", 
     foreignField: "_id", 
     as: "products"
    }
}])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we’ve used the &lt;code&gt;$lookup&lt;/code&gt; operator in an aggregate call on the users collection. The operator takes an options object with values that will be familiar to anyone who has worked with SQL databases: &lt;code&gt;from&lt;/code&gt; is the name of the collection to join, which must be in the same database, and &lt;code&gt;localField&lt;/code&gt; is the field we compare to the &lt;code&gt;foreignField&lt;/code&gt; in the target collection. Once we’ve found all matching products, we add them to an array field named by the &lt;code&gt;as&lt;/code&gt; property.&lt;/p&gt;

&lt;p&gt;This approach is equivalent to an SQL query that might look like this, using a subquery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *, products
FROM users
WHERE products in (
  SELECT *
  FROM products
  WHERE id = users.product_id
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or like this, using a left join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM users
LEFT JOIN products
ON users.product_id = products._id;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this operation can often meet our needs, the &lt;code&gt;$lookup&lt;/code&gt; operator introduces some disadvantages. Firstly, it matters at what stage of our query we use &lt;code&gt;$lookup&lt;/code&gt;: it can be challenging to construct more complex sorts, filters or combinations on our data in the later stages of a multi-stage aggregation pipeline. Secondly, &lt;code&gt;$lookup&lt;/code&gt; is a relatively slow operation that increases our query time. While we’re only sending a single query, internally MongoDB performs multiple queries to fulfill our request.&lt;/p&gt;
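
&lt;p&gt;To make that concrete, here’s a rough Python sketch (hypothetical data, with plain dictionaries standing in for documents) of the extra matching work a lookup implies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough client-side equivalent of a $lookup stage: for each user,
# collect the products whose _id matches the user's product_id and
# attach them as a new array field (like the "as" option).
def lookup(users, products):
    joined = []
    for user in users:
        matches = [p for p in products if p["_id"] == user["product_id"]]
        enriched = dict(user)
        enriched["products"] = matches
        joined.append(enriched)
    return joined

users = [{"_id": 1, "name": "Ada", "product_id": 3}]
products = [{"_id": 3, "name": "45' Yacht"}]
result = lookup(users, products)
# result[0] now carries a "products" array with the matching product

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;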

&lt;h3&gt;
  
  
  Using Denormalization in MongoDB
&lt;/h3&gt;

&lt;p&gt;As an alternative to using the &lt;code&gt;$lookup&lt;/code&gt; operator, we can denormalize our data. This approach is advantageous if we often carry out multiple joins for the same query. Denormalization is common in SQL databases too; for example, we can create a separate, pre-joined table to store the results of frequent joins.&lt;/p&gt;

&lt;p&gt;Denormalization is similar in MongoDB, with one notable difference. Rather than storing this data as a flat table, we can have nested documents representing the results of all our joins. This approach takes advantage of the flexibility of MongoDB’s rich documents. And, we’re free to store the data in whatever way makes sense for our application.&lt;/p&gt;

&lt;p&gt;For example, imagine we have separate MongoDB collections for products, orders, and customers. Documents in these collections might look like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "_id": 3,
    "name": "45' Yacht",
    "price": "250000",
    "description": "A luxurious oceangoing yacht."
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Customer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "_id": 47,
    "name": "John Q. Millionaire",
    "address": "1947 Mt. Olympus Dr.",
    "city": "Los Angeles",
    "state": "CA",
    "zip": "90046"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Order&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "_id": 49854,
    "product_id": 3,
    "customer_id": 47,
    "quantity": 3,
    "notes": "Three 45' Yachts for John Q. Millionaire. One for the east coast, one for the west coast, one for the Mediterranean."
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we denormalize these documents so we can retrieve all the data with a single query, our order document looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "_id": 49854,
    "product": {
        "name": "45' Yacht",
        "price": "250000",
        "description": "A luxurious oceangoing yacht."
    },
    "customer": {
        "name": "John Q. Millionaire",
        "address": "1947 Mt. Olympus Dr.",
        "city": "Los Angeles",
        "state": "CA",
        "zip": "90046"
    },
    "quantity": 3,
    "notes": "Three 45' Yachts for John Q. Millionaire. One for the east coast, one for the west coast, one for the Mediterranean."
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method works in practice because, during data writing, we store all the data we need in the top-level document. In this case, we’ve merged product and customer data into the order document. When we query the information now, we get it straight away. We don’t need any secondary or tertiary queries to retrieve our data. This approach increases the speed and efficiency of the data read operations. The trade-off is that it requires additional upfront processing and increases the time taken for each write operation.&lt;/p&gt;

&lt;p&gt;Keeping a copy of the product inside every order that references it presents an additional challenge. For a small application, this level of data duplication isn’t likely to be a problem. For a business-to-business e-commerce app with thousands of orders per customer, this data duplication can quickly become costly in both time and storage.&lt;/p&gt;

&lt;p&gt;Those nested documents aren’t relationally linked, either. If there’s a change to a product, we need to search for and update every product instance. This effectively means we must check each document in the collection since we won’t know ahead of time whether or not the change will affect it.&lt;/p&gt;
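
&lt;p&gt;The shape of that fan-out is easy to see in a small Python sketch (hypothetical order documents with embedded product copies):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the update fan-out problem with denormalized documents:
# changing one product means visiting every order that embeds a copy.
def update_product_price(orders, product_name, new_price):
    touched = 0
    for order in orders:
        if order["product"]["name"] == product_name:
            order["product"]["price"] = new_price
            touched += 1
    return touched

orders = [
    {"_id": 1, "product": {"name": "45' Yacht", "price": "250000"}},
    {"_id": 2, "product": {"name": "Dinghy", "price": "900"}},
    {"_id": 3, "product": {"name": "45' Yacht", "price": "250000"}},
]
changed = update_product_price(orders, "45' Yacht", "275000")
# two separate embedded copies had to be rewritten for one price change

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;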

&lt;h2&gt;
  
  
  Alternatives to Joins in MongoDB
&lt;/h2&gt;

&lt;p&gt;Ultimately, SQL databases handle joins better than MongoDB. If we find ourselves often reaching for &lt;code&gt;$lookup&lt;/code&gt; or a denormalized dataset, we might wonder if we’re using the right tool for the job. Is there a different way to leverage MongoDB for our application? Is there a way of achieving joins that might serve our needs better?&lt;/p&gt;

&lt;p&gt;Rather than abandoning MongoDB altogether, we could look for an alternative solution. One possibility is to use a secondary indexing solution that syncs with MongoDB and is optimized for analytics. For example, we can use &lt;a href="https://rockset.com"&gt;Rockset&lt;/a&gt;, a &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database, to ingest directly from MongoDB change streams, which enables us to query our data with familiar SQL search, aggregation and join queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have a range of options for creating an enriched dataset by joining relevant elements from multiple collections. The first method is the &lt;code&gt;$lookup&lt;/code&gt; operator. This reliable tool allows us to do the equivalent of left joins on our MongoDB data. Or, we can prepare a denormalized collection that allows fast retrieval of the queries we require. As an alternative to these options, we can &lt;a href="https://rockset.com/docs/mongodb-atlas/"&gt;employ Rockset’s SQL analytics capabilities on data in MongoDB&lt;/a&gt;, regardless of how it’s structured.&lt;/p&gt;

&lt;p&gt;If you haven’t tried Rockset’s real-time analytics capabilities yet, why not have a go? Jump over to the documentation and learn more about how you can use &lt;a href="https://rockset.com/solutions/mongodb/"&gt;Rockset with MongoDB&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://rockset.com/"&gt;Rockset&lt;/a&gt; is the &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>rockset</category>
      <category>sql</category>
      <category>olap</category>
    </item>
    <item>
      <title>Rockset Beats ClickHouse and Druid on the Star Schema Benchmark (SSB)</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 05 Apr 2022 16:36:06 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/rockset-beats-clickhouse-and-druid-on-the-star-schema-benchmark-ssb-4d46</link>
      <guid>https://dev.to/rocksetcloud/rockset-beats-clickhouse-and-druid-on-the-star-schema-benchmark-ssb-4d46</guid>
      <description>&lt;p&gt;A year ago we &lt;a href="https://rockset.com/blog/rockset-up-to-9x-faster-than-apache-druid-star-schema-benchmark/"&gt;evaluated Rockset on the Star Schema Benchmark (SSB)&lt;/a&gt;, an industry-standard benchmark used to measure the query performance of analytical databases. Subsequently, Altinity published ClickHouse’s results on the SSB. Recently, Imply published revised Apache Druid results on the SSB with denormalized numbers. With all the performance improvements we've been working on lately, we took another look at how these would affect Rockset's performance on the SSB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rockset beat both ClickHouse and Druid query performance on the Star Schema Benchmark. Rockset is 1.67 times faster than ClickHouse with the same hardware configuration, and 1.12 times faster than Druid, even though Druid used 12.5% more compute.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rockset executed every query in the SSB suite in 88 milliseconds or less. Rockset is faster than ClickHouse in 10 of the 13 SSB queries and faster than Druid in 9 of them.&lt;/p&gt;

&lt;p&gt;The performance gains over ClickHouse and Druid are due to several enhancements we made recently that benefit Rockset users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new version of the on-disk format for the column-based index that has better compression, faster decoding and computations on compressed data.&lt;/li&gt;
&lt;li&gt;Leveraging more Single Instruction/Multiple Data (SIMD) instructions as part of the vectorized execution engine to take advantage of higher throughput offered by modern processors.&lt;/li&gt;
&lt;li&gt;The introduction of a custom block size policy in RocksDB to increase the throughput of large scans in the column-based index.&lt;/li&gt;
&lt;li&gt;The automated splitting of column-based clusters to improve the read throughput and ensure all column clusters are properly sized. &lt;/li&gt;
&lt;li&gt;A more efficient check for set containment to reduce compute costs.&lt;/li&gt;
&lt;li&gt;The caching of column-based clustering metadata to improve aggregation performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result of these performance gains, users can build more interactive and responsive data applications using Rockset.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSB Configuration &amp;amp; Results
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.cs.umb.edu/~poneil/StarSchemaB.PDF"&gt;SSB&lt;/a&gt; measures the performance of 13 queries typical of data applications. It is a benchmark based on &lt;a href="https://www.tpc.org/information/benchmarks5.asp#:~:text=The%20TPC%20Benchmark"&gt;TPC-H&lt;/a&gt; and designed for data warehouse workloads. More recently, it has been used to measure the performance of queries involving aggregations and metrics in the column-oriented databases ClickHouse and Druid.&lt;/p&gt;

&lt;p&gt;To achieve resource parity, we used the same hardware configuration that Altinity used in its last published &lt;a href="https://rockset.com/comparisons/rockset-vs-clickhouse/"&gt;ClickHouse&lt;/a&gt; SSB performance benchmark. The hardware was a single m5.8xlarge Amazon EC2 instance. Imply has also released revised SSB numbers for Druid using a hardware configuration with more vCPU resources. Even so, Rockset was able to beat Druid’s numbers in absolute terms.&lt;/p&gt;

&lt;p&gt;We also scaled the dataset size to 100 GB and 600M rows of data, a scale factor of 100, just like Altinity and Imply did. As Altinity and Imply released detailed SSB performance results on denormalized data, we followed suit. This removed the need for query-time joins, even though that is something Rockset is well-equipped to handle.&lt;/p&gt;

&lt;p&gt;All queries ran in under 88 milliseconds on Rockset, with an aggregate runtime of 664 milliseconds across the entire suite of SSB queries. ClickHouse’s aggregate runtime was 1,112 milliseconds and Druid’s was 747 milliseconds. With these results, Rockset shows an overall speedup of 1.67 over ClickHouse and 1.12 over Druid.&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/1d31s1aajogl/5WUQrUmyYHTQEKwy4wex54/d910318aa45e81d99d1c938251681dcb/ssb-table.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/1d31s1aajogl/5WUQrUmyYHTQEKwy4wex54/d910318aa45e81d99d1c938251681dcb/ssb-table.png" alt="ssb-table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Chart comparing ClickHouse, Druid and Rockset runtimes on SSB. The configuration of m5.8xlarge is 32 vCPUs and 128 GiB of memory. c5.9xlarge is 36 vCPUs and 72 GiB of memory.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="//images.ctfassets.net/1d31s1aajogl/3x2dxjn64yKOiAC6wQUNEp/c9de32e12b704c81beb49c854158089e/ssb-graph.png" class="article-body-image-wrapper"&gt;&lt;img src="//images.ctfassets.net/1d31s1aajogl/3x2dxjn64yKOiAC6wQUNEp/c9de32e12b704c81beb49c854158089e/ssb-graph.png" alt="ssb-graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: Graph showing ClickHouse, Druid and Rockset runtimes on SSB queries.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can dig further into the configuration and performance enhancements in the &lt;a href="https://rockset.com/Rockset_Star_Schema_Benchmark_April2022.pdf"&gt;Rockset Performance Evaluation on the Star Schema Benchmark&lt;/a&gt; whitepaper. This paper provides an overview of the benchmark data and queries, describes the configuration for running the benchmark and discusses the results from the evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rockset Performance Enhancements
&lt;/h2&gt;

&lt;p&gt;The execution plan for all queries in the SSB benchmark is similar: a clustered scan followed by evaluating functions, applying filters and calculating aggregations. The speedup in Rockset queries therefore comes from a common set of performance enhancements, which we cover below.&lt;/p&gt;

&lt;h3&gt;
  
  
  New On-Disk Format for the Column-Based Index
&lt;/h3&gt;

&lt;p&gt;Rockset uses its &lt;a href="https://rockset.com/blog/how-rocksets-converged-index-powers-real-time-analytics/"&gt;Converged Index&lt;/a&gt; to organize and retrieve data efficiently and quickly for analytics. The Converged Index is composed of a search index, column-based index and a row store. Rockset introduced a new on-disk format for the column-based index that supports dictionary encoding for strings. &lt;/p&gt;

&lt;p&gt;This means that if the same string is repeated multiple times within one chunk of data in the column-based index, the string is only stored on disk once, and we just store the index of that string. This reduces space usage on disk, and since the data is more compact, it is faster to load from disk or memory. We continue to store the strings in dictionary encoded format in memory, and we can compute on that format. The new columnar format also has other advantages, like handling null values more efficiently, and it is more extensible.&lt;/p&gt;
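
&lt;p&gt;The encoding idea can be sketched in Python (a simplified model of one column chunk, not the actual on-disk format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified dictionary encoding of a string column chunk: each
# distinct string is stored once, and the column stores small
# integer codes pointing into the dictionary.
def dict_encode(column):
    dictionary = []
    positions = {}
    codes = []
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes

chunk = ["usa", "usa", "france", "usa", "france"]
dictionary, codes = dict_encode(chunk)
# the repeated strings are stored once; the column becomes integer codes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;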

&lt;h3&gt;
  
  
  SIMD Vectorized Query Execution
&lt;/h3&gt;

&lt;p&gt;Query execution operators exchange and process data chunks, which are organized in a columnar format. In vectorized query execution, operations are performed on a set of values rather than one value at a time in a data chunk for more efficient query execution. With SIMD instructions, we leverage modern processors that can compute on 256 bits or 512 bits of data at a time with a single CPU instruction.&lt;/p&gt;

&lt;p&gt;For example, the &lt;code&gt;_mm256_cmpeq_epi64&lt;/code&gt; intrinsic can compare four 64-bit integers in a single instruction. For batch processing operations, this can substantially increase throughput. The comparison itself isn’t the end of the story, though. SIMD instructions typically operate within a lane: if you use four 64-bit inputs, you get four 64-bit outputs, so the comparison produces four 64-bit integers rather than booleans. Typically, when operating on booleans, you want either an array of booleans or a bitmask as the output. We took great care to optimize that conversion step to see the maximum possible performance gain from SIMD.&lt;/p&gt;
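
&lt;p&gt;A scalar Python sketch of that conversion step (illustrative only; the real implementation uses SIMD intrinsics operating on whole registers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scalar model of packing per-lane comparison results into a bitmask.
# A SIMD compare yields one all-ones or all-zeros integer per lane;
# here booleans stand in for those lane values.
def compare_eq(lanes_a, lanes_b):
    return [a == b for a, b in zip(lanes_a, lanes_b)]

def to_bitmask(lane_results):
    mask = 0
    for i, hit in enumerate(lane_results):
        if hit:
            mask += 2 ** i  # set bit i for a matching lane
    return mask

mask = to_bitmask(compare_eq([1, 2, 3, 4], [1, 0, 3, 0]))
# lanes 0 and 2 match, so bits 0 and 2 are set

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;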

&lt;h3&gt;
  
  
  RocksDB Block Size
&lt;/h3&gt;

&lt;p&gt;RocksDB is a high-performance embedded storage engine used by modern datastores like Kafka Streams, ksqlDB and Apache Flink. Rockset stores its indexes in RocksDB. As the SSB queries access data through the column-based index, we configured larger storage blocks for that index to improve throughput.&lt;/p&gt;

&lt;p&gt;RocksDB divides data into blocks. These blocks are the unit of data lookup for various operations, like reading from disk or reading from RocksDB’s in-memory block cache. The &lt;a href="https://github.com/facebook/rocksdb/wiki/Space-Tuning#block-size"&gt;size of these blocks&lt;/a&gt; is configurable. Larger blocks help with throughput for large scans because you need to do fewer total lookups in the block cache and fewer random accesses to main memory. Smaller blocks help with performance for point lookups because if you only need one key you can load less surrounding data. The cost of loading a large block does not amortize well if you only need 1% of the data in it. You also waste space in the cache by storing data that was not recently accessed. &lt;/p&gt;

&lt;p&gt;For Rockset’s inverted index and row-based index, which are often used for point lookups, a small block size makes sense. For the column-based index though, which is often used for bulk scans, a much larger block size improves throughput. We created a custom block size policy under the hood to tune the block size for each index independently and increased the size of the column-based index blocks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Gains for Rockset Users
&lt;/h2&gt;

&lt;p&gt;Rockset is 1.67 times faster than ClickHouse and 1.12 times faster than Druid on the Star Schema Benchmark. Data engineering teams have over the years put up with a tremendous amount of complexity in the name of performance when using ClickHouse and Druid. Teams have traditionally had to do time-consuming data preparation, cluster tuning and infrastructure management in order to meet the performance requirements of their application. Rockset, with Converged Indexing and built-in data connectors, is the easiest real-time analytics platform to scale. We’re happy to share it also has the fastest query performance. &lt;a href="https://rockset.com/create/"&gt;Try Rockset&lt;/a&gt; and experience the performance enhancements on your own dataset and queries. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://rockset.com/Rockset_Star_Schema_Benchmark_April2022.pdf"&gt;&lt;img src="//images.ctfassets.net/1d31s1aajogl/5VdHan4kCTLoTBpNEm34LE/2b7e76f8c7e8765ccc4370af15f9e814/ssb-whitepaper.png" alt="ssb-whitepaper"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Ben Hannel, Software Engineering, and Julie Mills, Product Marketing&lt;/p&gt;




&lt;p&gt;&lt;a href="https://rockset.com/"&gt;Rockset&lt;/a&gt; is the &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by &lt;a href="https://rockset.com/docs/what-is-rockset/"&gt;exploiting indexing over brute-force scanning&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>druid</category>
      <category>realtime</category>
      <category>analytics</category>
      <category>clickhouse</category>
    </item>
    <item>
      <title>Streaming Analytics With KSQL vs. a Real-Time Analytics Database</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 22 Mar 2022 13:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/streaming-analytics-with-ksql-vs-a-real-time-analytics-database-39p3</link>
      <guid>https://dev.to/rocksetcloud/streaming-analytics-with-ksql-vs-a-real-time-analytics-database-39p3</guid>
      <description>&lt;p&gt;By &lt;a href="https://www.lewisgavin.co.uk/"&gt;Lewis Gavin&lt;/a&gt;, Data Architect&lt;/p&gt;

&lt;p&gt;In 2019, &lt;a href="https://www.gartner.com/smarterwithgartner/gartner-top-10-data-analytics-trends"&gt;Gartner predicted&lt;/a&gt; that “&lt;em&gt;by 2022, more than half of major new business systems will incorporate continuous intelligence that uses &lt;strong&gt;real-time context data&lt;/strong&gt; to improve decisions&lt;/em&gt;,” and users have grown to expect real-time data, especially since the rise of social networks.&lt;/p&gt;

&lt;p&gt;Companies are adopting real-time data for many reasons, including providing seamless and personalized experiences to users when interacting with services, and enabling real-time, data-driven decision making.&lt;/p&gt;

&lt;p&gt;As the requirement for real-time data has grown, so have the technologies that enable it. &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;Real-time analytics&lt;/a&gt; can be achieved in a number of ways, but approaches can generally be split into two camps: streaming analytics and analytics databases.&lt;/p&gt;

&lt;p&gt;Streaming analytics happens inline, as data is streamed from one place to another. Analytics happens continuously and in real time, as data is fed through the pipeline. Analytics databases ingest data in as near real time as possible, and allow fast analytical queries to be done on this data.&lt;/p&gt;

&lt;p&gt;In this post, we’ll talk through two technologies that implement these techniques: &lt;a href="https://rockset.com/sql-on-kafka/"&gt;ksqlDB&lt;/a&gt;, which provides streaming analytics, and Rockset, a real-time analytics database. We’ll dive into the pros and cons of each approach so you can decide which is right for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming Analytics
&lt;/h2&gt;

&lt;p&gt;To deal with the scale and speed of the data being generated, a common pattern is to put this data onto a queue or stream. This decouples the mechanism for transporting the data from any processing you want to perform on it. However, with this data being streamed in real time, it makes sense to also process and analyze it in real time, especially if you have a genuine use case for up-to-date analytics.&lt;/p&gt;

&lt;p&gt;To meet this need, Confluent developed ksqlDB. Built to work with &lt;a href="https://rockset.com/sql-on-kafka/"&gt;Apache Kafka&lt;/a&gt;, ksqlDB provides an SQL-like interface to data streams, allowing for filtering, aggregations and even joins across data streams. ksqlDB uses Kafka as the storage engine and acts as the compute engine. It also has built-in connectors for external data sources, such as connecting to databases over JDBC, so their data can be brought into Kafka and joined with a real-time stream for enrichment.&lt;/p&gt;

&lt;p&gt;You can perform analytics in two ways: pull queries or push queries. Pull queries allow you to look up results at a specific point in time, executing the query on the stream as a one-off. This is similar to running a query on a database: you execute the query and a result is returned; if you want to refresh the result, you run the query again. This is useful for synchronous applications and often runs with lower latency, as the stream data can be fed into a materialized view that is kept up to date automatically, leaving less work for the query to do.&lt;/p&gt;

&lt;p&gt;Push queries allow you to subscribe to a table or a stream, and as the data is updated downstream, the query results will also reflect these updates in real-time. You execute the query once and the result changes as the data changes in the stream. This is a powerful use case for stream analytics as it allows you to subscribe to the result of a calculation on the data instead of subscribing to the data feed itself.&lt;/p&gt;
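
&lt;p&gt;As a sketch (with hypothetical stream and table names), the two query styles differ mainly in the &lt;code&gt;EMIT CHANGES&lt;/code&gt; clause:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Pull query: a one-off lookup against a materialized table.
-- Returns the current result and completes.
SELECT ride_id, driver_lat, driver_lon
FROM rides_by_id
WHERE ride_id = 'abc123';

-- Push query: subscribe to a stream; results keep arriving
-- as new events flow through the topic.
SELECT ride_id, driver_lat, driver_lon
FROM rides_stream
EMIT CHANGES;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;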

&lt;p&gt;For example, let’s say you have a taxi app. When you request a taxi, the driver accepts the ride and then on the screen you are shown the driver's location and your location and given an estimated time of arrival. To display the driver’s current location and the estimated time of arrival, you need to understand the driver’s position in real time and then from that continuously calculate the estimated time to arrive as the driver’s location updates.&lt;/p&gt;

&lt;p&gt;You could do this in two ways. The first is to frequently poll the driver’s location and, every time you retrieve it, display the new position on the screen and recalculate the estimated arrival time.&lt;/p&gt;

&lt;p&gt;The second way is to continuously stream the driver’s and the user’s locations in real-time. This same stream can be used to obtain the driver’s location for display purposes and also, by using a ksqlDB push query, you can calculate the time of arrival. Your application is then subscribed to the output from this push query and whenever the time of arrival changes it is automatically updated on the screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-Time Analytics Database
&lt;/h2&gt;

&lt;p&gt;An analytics database, as its name suggests, allows for analytics on data stored in a database. Historically, this could mean batch ingesting data into a database and then performing analytical queries on that data. However, databases like Rockset keep the benefits of a database while enabling analytics in near real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--es0__te---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/3VVpkdMsqxBKxy39gc11Nw/4bf6895492d1d40854610ae04a3d17e4/ksql-strreaming-analytics.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--es0__te---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/3VVpkdMsqxBKxy39gc11Nw/4bf6895492d1d40854610ae04a3d17e4/ksql-strreaming-analytics.png" alt="ksql-strreaming-analytics" width="880" height="430"&gt;&lt;/a&gt;Fig 1. Difference between streaming analytics and real-time analytics database&lt;/p&gt;

&lt;p&gt;Rockset provides out-of-the-box data connectors that allow data to be streamed into their analytics database. Rather than analyzing the data as it is streamed, the data is streamed into the database as close to real time as possible. Then, the analytics can take place on the data at rest. As shown in Fig 1, streaming analytics takes place on the stream itself whereas analytics databases ingest the data in real time and analytics is performed on the database.&lt;/p&gt;

&lt;p&gt;There are a number of benefits to storing the data in a database. Firstly you can index the data according to the use case to increase performance and reduce query latency. Unfortunately, creating bespoke indexes in order to make queries run quickly adds significant administrative overhead. And if the database needs bespoke indexes to perform well, then users submitting ad hoc queries are not going to have a great experience. Rockset solved this problem with the &lt;a href="https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/"&gt;Converged Index&lt;/a&gt; and an &lt;a href="https://rockset.com/whitepapers/rockset-concepts-designs-and-architecture/"&gt;SQL engine implementation&lt;/a&gt; that doesn't require administrators to create bespoke indexes.&lt;/p&gt;

&lt;p&gt;With streaming analytics, the focus is often on what is happening right now and although analytics databases support this, they also enable analytics across larger historical data when required.&lt;/p&gt;

&lt;p&gt;Some modern analytics databases also support schemaless ingest and can infer the schema on read to remove the burden of defining the schema upfront. For example, ksqlDB can connect to a Kafka topic that accepts unstructured data. However for ksqlDB to query this data, the schema of the underlying data needs to be defined upfront. On the other hand, modern analytics databases like Rockset allow the data to be ingested into a collection without defining the schema. This allows for flexible querying of the data, especially as the structure of the data evolves over time, as it doesn’t require any schema modifications to access the new properties.&lt;/p&gt;

&lt;p&gt;Finally, cloud native analytics databases often separate the storage and compute resources. This gives you the ability to scale them independently. This is vital if you have applications with high query per second (QPS) workloads, as when your system needs to deal with a spike in queries. You can easily scale the compute to meet this demand without incurring extra storage costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Should I Use?
&lt;/h2&gt;

&lt;p&gt;Overall, which system to use will ultimately depend on your use case. If your data is already flowing through Kafka topics and you want to run some real-time queries on this data in-flight, then ksqlDB may be the right choice. It will fulfil your use case and means you don’t have to invest in extra infrastructure to ingest this data into an analytics database. Remember, streaming analytics allows you to transform, filter and aggregate events as data is streamed in and your application can then subscribe to these results to get continuously updated results.&lt;/p&gt;

&lt;p&gt;If your use cases are more varied, then a real-time analytics database like Rockset may be the right choice. Analytics databases are ideal if you have data from many different systems that you want to join together, as you can delay joins until query time to get the most up-to-date data. If you need to support ad-hoc queries on historical datasets on top of real-time analytics and require the compute and storage to be scaled separately (important if you have high or variable query concurrency), then a real-time analytics database is likely the right option.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://rockset.com/"&gt;Rockset&lt;/a&gt; is the &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by &lt;a href="https://rockset.com/docs/what-is-rockset/"&gt;exploiting indexing over brute-force scanning&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>ksql</category>
      <category>realtimeanalytics</category>
      <category>streaming</category>
    </item>
    <item>
      <title>What Do I Do When My Snowflake Query Is Slow? Part 2: Solutions</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Wed, 26 Jan 2022 08:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/what-do-i-do-when-my-snowflake-query-is-slow-part-2-solutions-18m6</link>
      <guid>https://dev.to/rocksetcloud/what-do-i-do-when-my-snowflake-query-is-slow-part-2-solutions-18m6</guid>
      <description>&lt;p&gt;Snowflake’s data cloud enables companies to store and share data, then analyze this data for business intelligence. Although Snowflake is a great tool, sometimes querying vast amounts of data runs slower than your applications — and users — require.&lt;/p&gt;

&lt;p&gt;In our first article, &lt;a href="https://rockset.com/blog/what-do-i-do-when-my-snowflake-query-is-slow-part-1-diagnosis/"&gt;&lt;em&gt;What Do I Do When My Snowflake Query Is Slow? Part 1: Diagnosis&lt;/em&gt;&lt;/a&gt;, we discussed how to diagnose slow Snowflake query performance. Now it’s time to address those issues.&lt;/p&gt;

&lt;p&gt;We’ll cover Snowflake performance tuning, including reducing queuing, using result caching, tackling disk spilling, rectifying row explosion, and fixing inadequate pruning. We’ll also discuss alternatives for &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; that might be what you’re looking for if you are in need of better real-time query performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reduce Queuing
&lt;/h2&gt;

&lt;p&gt;Snowflake lines up queries until resources are available. It’s not good for queries to stay queued too long, as they will be aborted. To prevent queries from waiting too long, you have two options: set a timeout or adjust concurrency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set a Timeout
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;STATEMENT_QUEUED_TIMEOUT_IN_SECONDS&lt;/code&gt; to define how long your query should stay queued before aborting. With the default value of 0, there is no timeout.&lt;/p&gt;

&lt;p&gt;Change this number to abort queries after a specific time to avoid too many queries queuing up. As this is a session-level parameter, you can set this timeout for particular sessions.&lt;/p&gt;
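
&lt;p&gt;For instance, a session-level statement like the following aborts queries that wait in the queue for more than a minute (the 60-second value is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  alter session set statement_queued_timeout_in_seconds = 60;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;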

&lt;h3&gt;
  
  
  Adjust the Maximum Concurrency Level
&lt;/h3&gt;

&lt;p&gt;The total load time depends on the number of queries your warehouse executes in parallel. The more queries that run in parallel, the harder it is for the warehouse to keep up, impacting Snowflake performance.&lt;/p&gt;

&lt;p&gt;To rectify this, use Snowflake’s &lt;code&gt;MAX_CONCURRENCY_LEVEL&lt;/code&gt; parameter. Its default value is 8, but you can set it to however many queries you want the warehouse to run in parallel.&lt;/p&gt;

&lt;p&gt;Keeping the &lt;code&gt;MAX_CONCURRENCY_LEVEL&lt;/code&gt; low helps improve execution speed, even for complex queries, as Snowflake allocates more resources to each query when fewer run in parallel.&lt;/p&gt;
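
&lt;p&gt;As a sketch, lowering the concurrency level on a warehouse looks like this (the warehouse name and value are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  alter warehouse mywarehouse set max_concurrency_level = 4;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;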

&lt;h2&gt;
  
  
  Use Result Caching
&lt;/h2&gt;

&lt;p&gt;Every time you execute a query, Snowflake caches the result, so it doesn’t need to spend time retrieving the same results from cloud storage in the future.&lt;/p&gt;

&lt;p&gt;One way to retrieve results directly from the cache is by &lt;code&gt;RESULT_SCAN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from table(result_scan(last_query_id()))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;LAST_QUERY_ID&lt;/code&gt; returns the ID of the previously executed query, and &lt;code&gt;RESULT_SCAN&lt;/code&gt; brings its results directly from the cache.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tackle Disk Spilling
&lt;/h2&gt;

&lt;p&gt;When a warehouse runs out of memory, Snowflake spills intermediate data to the local disk, which slows execution. Spilling to remote storage is even slower.&lt;/p&gt;

&lt;p&gt;To tackle this issue, move to a larger warehouse with enough memory for code execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  alter warehouse mywarehouse
        warehouse_size = XXLARGE
                   auto_suspend = 300
                      auto_resume = TRUE;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code snippet enables you to scale up your warehouse and suspend query execution automatically after 300 seconds. If another query is in line for execution, this warehouse resumes automatically after resizing is complete.&lt;/p&gt;

&lt;p&gt;Also restrict the data your queries return: select only the columns you need and avoid those you don’t.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  select last_name 
       from employee_table 
          where employee_id = 101;

  select first_name, last_name, country_code, telephone_number, user_id from
  employee_table 
       where employee_type like "%junior%";

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first query above is specific, as it retrieves only the last name of a particular employee. The second query retrieves multiple columns for every row whose employee_type contains “junior”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rectify Row Explosion
&lt;/h2&gt;

&lt;p&gt;Row explosion happens when a &lt;code&gt;JOIN&lt;/code&gt; query retrieves many more rows than expected. This can occur when your join accidentally creates a cartesian product of all rows retrieved from all tables in your query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use the Distinct Clause
&lt;/h3&gt;

&lt;p&gt;One way to reduce row explosion is by using the &lt;code&gt;DISTINCT&lt;/code&gt; clause, which eliminates duplicates.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  SELECT DISTINCT a.FirstName, a.LastName, v.District
  FROM records a 
  INNER JOIN resources v
  ON a.LastName = v.LastName
  ORDER BY a.FirstName;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this snippet, Snowflake only retrieves the distinct values that satisfy the condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Temporary Tables
&lt;/h3&gt;

&lt;p&gt;Another option to reduce row explosion is by using temporary tables.&lt;/p&gt;

&lt;p&gt;This example shows how to create a temporary table for an existing table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  CREATE TEMPORARY TABLE tempList AS 
      SELECT a,b,c,d FROM table1
          INNER JOIN table2 USING (c);

  SELECT a,b FROM tempList
      INNER JOIN table3 USING (d);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temporary tables exist until the session ends. After that, the user cannot retrieve the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Your Join Order
&lt;/h3&gt;

&lt;p&gt;Another option to fix row explosion is by checking your join order. Inner joins may not be an issue, but the table access order impacts the output for outer joins.&lt;/p&gt;

&lt;p&gt;Snippet one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  orders LEFT JOIN products 
      ON products.id = products.id
    LEFT JOIN entries
      ON entries.id = orders.id
      AND entries.id = products.id

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Snippet two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  orders LEFT JOIN entries 
      ON entries.id = orders.id
    LEFT JOIN products
      ON products.id = orders.id
      AND products.id = entries.id

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outer joins are, in general, neither associative nor commutative. Thus, snippet one and snippet two do not necessarily return the same results. Be aware of the join types you use and their order to save time, retrieve the expected results, and avoid row explosion issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix Inadequate Pruning
&lt;/h2&gt;

&lt;p&gt;While running a query, Snowflake first prunes the micro-partitions that cannot contain matching rows, then prunes columns within the remaining partitions. This reduces scanning because Snowflake does not have to go through all the partitions.&lt;/p&gt;

&lt;p&gt;However, pruning does not happen perfectly all the time. Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wUCcDpkB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/4qItsQfBI877B08dH8Mk1L/9e4bd9ac57d21d537d0baad946030014/slow-snowflake-queries-image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wUCcDpkB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/4qItsQfBI877B08dH8Mk1L/9e4bd9ac57d21d537d0baad946030014/slow-snowflake-queries-image1.png" alt="slow-snowflake-queries-image1" width="692" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When executing the query, the filter removes about 94 percent of the rows, and Snowflake prunes the partitions that hold none of the remaining rows. That means the query scanned only the micro-partitions containing the roughly six percent of rows that survived the filter.&lt;/p&gt;

&lt;p&gt;Data clustering can significantly improve this. You can cluster a table when you create it or when you alter an existing table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  CREATE TABLE recordsTable (C1 INT, C2 INT) CLUSTER BY (C1, C2);

  ALTER TABLE recordsTable CLUSTER BY (C1, C2);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data clustering has limitations. Tables must have a large number of records and shouldn’t change frequently. The right time to cluster is when you know a query is slow and that clustering can improve it.&lt;/p&gt;

&lt;p&gt;In 2020, Snowflake deprecated the manual re-clustering feature, so that is not an option anymore.&lt;/p&gt;
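
&lt;p&gt;To gauge whether a table would benefit from clustering, you can inspect its clustering quality with &lt;code&gt;SYSTEM$CLUSTERING_INFORMATION&lt;/code&gt; (the table and column names below follow the earlier example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  select system$clustering_information('recordsTable', '(C1, C2)');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;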

&lt;h2&gt;
  
  
  Wrapping Up Snowflake Performance Issues
&lt;/h2&gt;

&lt;p&gt;We explained how to use queuing parameters, efficiently use Snowflake’s cache, and fix disk spilling and exploding rows. It’s easy to implement all these methods to help improve your Snowflake query performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Another Strategy for Improving Query Performance: Indexing
&lt;/h3&gt;

&lt;p&gt;Snowflake can be a good solution for business intelligence, but it’s not always the optimum choice for every use case, for example, scaling real-time analytics, which requires speed. For that, consider supplementing Snowflake with a database like &lt;a href="https://rockset.com"&gt;Rockset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;High-performance real-time queries and low latency are Rockset's core features. Rockset provides less than &lt;a href="https://rockset.com/blog/rockset-1-billion-events-in-a-day-with-1-second-data-latency/"&gt;one second of data latency&lt;/a&gt; on large data sets, making new data ready to query quickly. Rockset excels at data indexing, which Snowflake doesn’t do, and it &lt;a href="https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/"&gt;indexes &lt;em&gt;all&lt;/em&gt; the fields&lt;/a&gt;, making it faster for your application to scan through and provide real-time analytics. Rockset is far more compute-efficient than Snowflake, delivering queries that are both fast and economical.&lt;/p&gt;

&lt;p&gt;Rockset is an excellent complement to your Snowflake data warehouse. &lt;a href="https://rockset.com/create/"&gt;Sign up&lt;/a&gt; for your free Rockset trial to see how we can help drive your real-time analytics.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://rockset.com/"&gt;Rockset&lt;/a&gt; is the &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>realtimeanalytics</category>
      <category>queries</category>
      <category>datawarehouse</category>
    </item>
    <item>
      <title>What Do I Do When My Snowflake Query Is Slow? Part 1: Diagnosis</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Thu, 20 Jan 2022 18:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/what-do-i-do-when-my-snowflake-query-is-slow-part-1-diagnosis-39k3</link>
      <guid>https://dev.to/rocksetcloud/what-do-i-do-when-my-snowflake-query-is-slow-part-1-diagnosis-39k3</guid>
      <description>&lt;p&gt;&lt;em&gt;Because &lt;a href="https://rockset.com"&gt;Rockset&lt;/a&gt; helps organizations achieve the data freshness and query speeds needed for &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt;, we sometimes are asked about approaches to improving query speed in databases in general, and in popular databases such as Snowflake, MongoDB, DynamoDB, MySQL and others. We turn to industry experts to get their insights and we pass on their recommendations. In this case, the series of two posts that follow address how to improve query speed in Snowflake.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Every developer wants peak performance from their software services. When it comes to Snowflake performance issues, you may have decided that the occasional slow query is just something that you have to live with, right? Or maybe not. In this post we’ll discuss why Snowflake queries are slow and options you have to achieve better Snowflake query performance.&lt;/p&gt;

&lt;p&gt;It’s not always easy to tell why your Snowflake queries are running slowly, but before you can fix the problem, you have to know what’s happening. In part one of this two-part series, we’ll help you diagnose why your Snowflake queries are executing slower than usual. In our second article we’ll look at the best options for improving Snowflake query performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagnosing Queries in Snowflake
&lt;/h2&gt;

&lt;p&gt;First, let’s unmask common misconceptions of why Snowflake queries are slow. Your hardware and operating system (OS) don’t play a role in execution speed because Snowflake runs as a cloud service.&lt;/p&gt;

&lt;p&gt;The network could be one reason for slow queries, but it’s not significant enough to slow execution all the time. So, let’s dive into the other reasons your queries might be lagging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check the Information Schema
&lt;/h3&gt;

&lt;p&gt;In short, the &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; is the blueprint for every database you create in Snowflake. It allows you to view historical data on tables, warehouses, permissions, and queries.&lt;/p&gt;

&lt;p&gt;You cannot manipulate its data as it is read-only. Among the principal functions in the &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt;, you will find the &lt;code&gt;QUERY_HISTORY&lt;/code&gt; and &lt;code&gt;QUERY_HISTORY_BY_*&lt;/code&gt; table functions. These functions help uncover the causes of slow Snowflake queries. You'll see both of them in use below.&lt;/p&gt;

&lt;p&gt;Keep in mind that this tool only returns data to which your Snowflake account has access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check the Query History Page
&lt;/h3&gt;

&lt;p&gt;Snowflake’s query history page retrieves columns with valuable information. In our case, we get the following columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;EXECUTION_STATUS&lt;/code&gt; displays the state of the query, whether it is running, queued, blocked, or successful.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;QUEUED_PROVISIONING_TIME&lt;/code&gt; displays the time spent waiting for the allocation of a suitable warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;QUEUED_REPAIR_TIME&lt;/code&gt; displays the time it takes to repair the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;QUEUED_OVERLOAD_TIME&lt;/code&gt; displays the time spent while an ongoing query is overloading the warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overloading is the most common phenomenon, and &lt;code&gt;QUEUED_OVERLOAD_TIME&lt;/code&gt; serves as a crucial diagnostic factor.&lt;/p&gt;

&lt;p&gt;Here is a sample query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      select *
      from table(information_schema.query_history_by_session())
      order by start_time;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you the last 100 queries that Snowflake executed in the current session. You can also get the query history by user or by warehouse.&lt;/p&gt;
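
&lt;p&gt;For example, the companion table functions work the same way (the warehouse name below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      select *
      from table(information_schema.query_history_by_user())
      order by start_time;

      select *
      from table(information_schema.query_history_by_warehouse(warehouse_name =&amp;gt; 'MYWAREHOUSE'))
      order by start_time;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;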

&lt;h3&gt;
  
  
  Check the Query Profile
&lt;/h3&gt;

&lt;p&gt;In the previous section, we saw what happens when multiple queries are affected collectively. It’s equally important to address the individual queries. For that, use the query profile option.&lt;/p&gt;

&lt;p&gt;You can find a query’s profile on Snowflake’s &lt;strong&gt;History&lt;/strong&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKHM2Knt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/WGtgc5gzMgKax4VRYW2F7/728011344f03d641ae03dddd607586e7/snowflakequeryperformance2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKHM2Knt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/WGtgc5gzMgKax4VRYW2F7/728011344f03d641ae03dddd607586e7/snowflakequeryperformance2.png" alt="snowflakequeryperformance2" width="880" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The query profile interface looks like an advanced flowchart with step-by-step query execution. You should focus mainly on the operator tree and nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wAj7avJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1uLxi0tTWc8h74UUJzb5h6/d6899da79503e1a8a232ca68645038f4/snowflakequeryperformance4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wAj7avJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1uLxi0tTWc8h74UUJzb5h6/d6899da79503e1a8a232ca68645038f4/snowflakequeryperformance4.png" alt="snowflakequeryperformance4" width="880" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The operator nodes are spread out based on their execution time. Any operation that consumed over one percent of the total execution time appears in the operator tree.&lt;/p&gt;

&lt;p&gt;The pane on the right side shows the query’s execution time and attributes. From there, you can figure out which step took too much time and slowed the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Your Caching
&lt;/h3&gt;

&lt;p&gt;Executing a query and fetching its results might take 500 milliseconds. If you use that query frequently to fetch the same results, Snowflake gives you the option to cache it, so the next time it returns in far less than 500 milliseconds.&lt;/p&gt;

&lt;p&gt;Snowflake caches data in the result cache. When it needs data, it checks the result cache first. If it does not find data, it checks the local hard drive. If it still does not find the data, it checks the remote storage.&lt;/p&gt;

&lt;p&gt;Retrieving data from the result cache is faster than from the local hard drive or remote storage. So, it is best practice to use the result cache effectively. Data remains in the result cache for 24 hours. After that, you have to execute the query again to get the data from disk.&lt;/p&gt;
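
&lt;p&gt;To check how much the result cache is helping, you can temporarily disable it for your session with the &lt;code&gt;USE_CACHED_RESULT&lt;/code&gt; parameter and compare timings (remember to re-enable it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  alter session set use_cached_result = false;
  -- run the query and note its duration
  alter session set use_cached_result = true;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;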

&lt;p&gt;You can check how effectively Snowflake used the result cache. Once you execute a query, open the &lt;strong&gt;Query Profile&lt;/strong&gt; tab.&lt;/p&gt;

&lt;p&gt;It shows how much of the query Snowflake served from the cache, as in the example below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AEIzpcqO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/5EzLe1FnbcFpmPe8tc7vjm/aa7223a8cbdf1b0d521bd9d83deb94f1/snowflakequeryperformance3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AEIzpcqO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/5EzLe1FnbcFpmPe8tc7vjm/aa7223a8cbdf1b0d521bd9d83deb94f1/snowflakequeryperformance3.png" alt="snowflakequeryperformance3" width="628" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Snowflake Join Performance
&lt;/h3&gt;

&lt;p&gt;If you experience slowdowns during query execution, you should compare the expected output to the actual result. You could have encountered a row explosion.&lt;/p&gt;

&lt;p&gt;A row explosion is a query result that returns far more rows than anticipated. Therefore, it takes far more time than anticipated. For example, you might expect an output of four million records, but the outcome could be exponentially higher. This problem occurs with joins in your queries that combine rows from multiple tables. The join order matters. You can do two things: look for the join condition you used, or use Snowflake’s optimizer to see the join order.&lt;/p&gt;

&lt;p&gt;An easy way to determine whether this is the problem is to check the query profile for join operators that display more rows in the output than in the input links. To avoid a row explosion, ensure the query result does not contain more rows than all its inputs combined.&lt;/p&gt;

&lt;p&gt;As with query patterns, the use of joins is in the hands of the developer. One thing is clear: bad joins result in slow Snowflake join performance, and slow queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check for Disk Spilling
&lt;/h3&gt;

&lt;p&gt;Accessing data from a remote drive consumes more time than accessing it from a local drive or the result cache. When intermediate query data doesn’t fit in memory, Snowflake writes it to the local disk, and when it doesn’t fit there either, to remote storage.&lt;/p&gt;

&lt;p&gt;This movement of data out of memory to disk is called disk spilling, and it is a common cause of slow queries. You can identify instances of disk spilling on the &lt;strong&gt;Query Profile&lt;/strong&gt; tab. Take a look at “Bytes spilled to local storage.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QIoaIzB9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/2AAnmzjnYCeMPtt1VSMAtC/51737ccf0e77e05f6e24cf0d86efbf7e/snowflakequeryperformance5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QIoaIzB9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/2AAnmzjnYCeMPtt1VSMAtC/51737ccf0e77e05f6e24cf0d86efbf7e/snowflakequeryperformance5.png" alt="snowflakequeryperformance5" width="684" height="900"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the execution time is over eight minutes, of which only two percent went to local disk IO. That means Snowflake spent very little of that time fetching data from the local disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Queuing
&lt;/h3&gt;

&lt;p&gt;The warehouse may be busy executing other queries. Snowflake cannot start incoming queries until adequate resources are free. In Snowflake, we call this queuing.&lt;/p&gt;

&lt;p&gt;Queries are queued so as not to compromise Snowflake query performance. Queuing may happen because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warehouse you are using is overloaded.&lt;/li&gt;
&lt;li&gt;Queries ahead of yours in line are consuming the necessary computing resources.&lt;/li&gt;
&lt;li&gt;Queries occupy all the cores in the warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can rely on the queue overload time as a clear indicator. To check this, look at the query history by executing the query below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      QUERY_HISTORY_BY_SESSION(
      [SESSION_ID =&amp;gt; &amp;lt;constant_expr&amp;gt;]
      [, END_TIME_RANGE_START =&amp;gt; &amp;lt;constant_expr&amp;gt;]
      [, END_TIME_RANGE_END =&amp;gt; &amp;lt;constant_expr&amp;gt;]
      [, RESULT_LIMIT =&amp;gt; &amp;lt;num&amp;gt;] )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can determine how long a query should sit in the queue before Snowflake aborts it by setting the &lt;code&gt;STATEMENT_QUEUED_TIMEOUT_IN_SECONDS&lt;/code&gt; parameter. The default is zero, meaning queued queries never time out.&lt;/p&gt;
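
&lt;p&gt;The parameter can also be set at the warehouse level, for example (the warehouse name and 60-second value are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  alter warehouse mywarehouse set statement_queued_timeout_in_seconds = 60;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;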

&lt;h3&gt;
  
  
  Analyze the Warehouse Load Chart
&lt;/h3&gt;

&lt;p&gt;Snowflake offers charts to read and interpret data. The warehouse load chart is a handy tool, but you need the &lt;strong&gt;MONITOR&lt;/strong&gt; privilege to view it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r4yzFAEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/3WeOggcAVQ4rGMHvp6WmL8/9494df26450dcce7af41445c9ba75b42/snowflakequeryperformance1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r4yzFAEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/3WeOggcAVQ4rGMHvp6WmL8/9494df26450dcce7af41445c9ba75b42/snowflakequeryperformance1.png" alt="snowflakequeryperformance1" width="880" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example chart for the past 14 days. When you hover over the bars, you find two statistics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load from running queries — from the queries that are executing&lt;/li&gt;
&lt;li&gt;Load from queued queries — from all the queries waiting in the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The total warehouse load is the sum of the running load and the queued load. When there is no contention for resources, this sum is one. The greater the queued load, the longer it takes for your query to execute. Snowflake may have optimized the query itself, but it can still take a while to finish because several other queries were ahead of it in the queue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use the Warehouse Load History
&lt;/h3&gt;

&lt;p&gt;You can find data on warehouse loads using the &lt;code&gt;WAREHOUSE_LOAD_HISTORY&lt;/code&gt; table function.&lt;/p&gt;

&lt;p&gt;Three parameters help diagnose slow queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AVG_RUNNING&lt;/code&gt; — the average number of queries executing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG_QUEUED_LOAD&lt;/code&gt; — the average number of queries queued because the warehouse is overloaded&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG_QUEUED_PROVISIONING&lt;/code&gt; — the average number of queries queued because Snowflake is provisioning the warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This query retrieves the load history of your warehouse for the past hour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  use warehouse mywarehouse;

      select *
      from
      table(information_schema.warehouse_load_history(date_range_start=&amp;gt;dateadd
      ('hour',-1,current_timestamp())));

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use the Maximum Concurrency Level
&lt;/h3&gt;

&lt;p&gt;Every Snowflake warehouse has a limited amount of computing power. In general, the larger (and more expensive) your Snowflake plan, the more computing horsepower it has.&lt;/p&gt;

&lt;p&gt;A Snowflake warehouse's &lt;code&gt;MAX_CONCURRENCY_LEVEL&lt;/code&gt; setting determines how many queries are allowed to run in parallel. In general, the more queries running simultaneously, the slower each of them. But if your warehouse's concurrency level is too low, it might cause the perception that queries are slow.&lt;/p&gt;

&lt;p&gt;If there are queries that Snowflake can't immediately execute because there are too many concurrent queries running, they end up in the query queue to wait their turn. If a query remains in the line for a long time, the user who ran the query may think the query itself is slow. And if a query stays queued for too long, it may be aborted before it even executes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps for Improving Snowflake Query Performance
&lt;/h2&gt;

&lt;p&gt;Your Snowflake query may run slowly for various reasons. Caching is effective but doesn’t happen for all your queries. Check your joins, check for disk spilling, and check to see if your queries are spending time stuck in the query queue.&lt;/p&gt;

&lt;p&gt;When investigating slow Snowflake query performance, the query history page, warehouse loading chart, and query profile all offer valuable data, giving you insight into what is going on.&lt;/p&gt;
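
&lt;p&gt;For example, to see whether queries are spending their elapsed time executing or sitting in the queue, a query like this against the same &lt;code&gt;information_schema&lt;/code&gt; used above breaks elapsed time into its queued components (all times are in milliseconds):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select query_id,
       total_elapsed_time,
       queued_overload_time,
       queued_provisioning_time
from table(information_schema.query_history())
order by queued_overload_time desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;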

&lt;p&gt;Now that you understand why your Snowflake query performance may not be all that you want it to be, you can narrow down possible culprits. Your next step is to get your hands dirty and fix them.&lt;/p&gt;

&lt;p&gt;Come back next week to check out the next article in this series, &lt;em&gt;What Do I Do When My Snowflake Query Is Slow? Part 2: Solutions&lt;/em&gt;, for tips on optimizing your Snowflake queries and other choices you can make if real-time query performance is a priority for you.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://rockset.com/"&gt;Rockset&lt;/a&gt; is the &lt;a href="https://rockset.com/real-time-analytics-explained/"&gt;real-time analytics&lt;/a&gt; database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Real-Time Analytics Podcast Episode 10: Self-service data analytics with Grafbase CEO, Fredrik Bjork</title>
      <dc:creator>Shawn Adams</dc:creator>
      <pubDate>Tue, 05 Oct 2021 14:26:42 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/real-time-analytics-podcast-episode-10-self-service-data-analytics-with-grafbase-ceo-fredrik-bjork-4g0i</link>
      <guid>https://dev.to/rocksetcloud/real-time-analytics-podcast-episode-10-self-service-data-analytics-with-grafbase-ceo-fredrik-bjork-4g0i</guid>
      <description>&lt;p&gt;Fredrik Bjork has traversed the data space, creating a gaming social network, scaling the RealReal's marketplace and now investing in the data startups. Hear how the data experience has changed with each endeavor and what trends he's must bullish on in this podcast.&lt;/p&gt;

&lt;p&gt;Listen to this podcast: &lt;a href="https://rockset.com/podcasts/the-rise-of-real-time-analytics/episode10-self-service-data-analytics-in-cybersecurity/"&gt;https://rockset.com/podcasts/the-rise-of-real-time-analytics/episode10-self-service-data-analytics-in-cybersecurity/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>database</category>
      <category>serverless</category>
      <category>podcast</category>
    </item>
  </channel>
</rss>
