DEV Community: TelemetryHub

DuckDB vs FINOS' Perspective: A Comparison for Web Developers

Sergio Tijerino — Tue, 19 Sep 2023 22:58:45 +0000

Handling huge(ish) data right in the Browser

Due to limitations on the browser and client’s devices, the idea of storing huge amounts of data and executing operations right on the browser was out of place, as it would require too much memory just for storage and leave a minimum amount of memory for data operations and business logic. Web applications were used to read chunks of data from an API, that was connected to a Backend with a specialized database that could handle this amount of data without impacting the performance of the web application.

This approach continues to be a valid option, but it lacks the snappy responses that we strive for in a web application as we would have to wait for the backend to create our views, and diagrams or even process huge batches of data before we are able to see updated results in the UI.

The definition of “big” data keeps on changing. In the recent past, a terabyte seemed like an unthinkable amount of storage. Now, companies routinely handle thousands of times that amount of data. The power of personal computers has increased in the same way. Browsers routinely handle hundreds of megabytes of data, or more. This has opened up opportunities for data-intensive applications served over the web. With the right bandwidth and front-end technologies, they can provide a rich exploratory experience with data sets that would have been considered “huge” not so long ago.

Introducing WebAssembly

Thankfully, a new optimized technology has been created in recent years called WebAssembly (WASM for short). WASM is a binary instruction format for a stack-based virtual machine. It was designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications. Although this technology was created to run on the web using a browser, it is also possible to run it on non-web-based environments as in Node.js. For the purpose of this writing, we’ll be focusing on web-based applications using WASM.

Two of these new technologies that emerged from WASM are DuckDB and the FINOS’ Perspective. Both of these are open-source technologies and were built with the primary goal of handling vast amounts of data with as much performance as possible.

DuckDB as a browser database

DuckDB is a lightweight, in-memory column-oriented database that is optimized for speed. Its column design allows it to benefit from not querying unnecessary columns and being able to apply complex operations over a single data set and type. It is a good choice for applications that require fast data access, such as real-time analytics and visualization. As a database, it allows you to execute SQL queries against the data, which grants a variety of different aggregated results. Not only that, but the DuckDB’s default package provides a robust API to handle and analyze the data inserted.

This technology is also available to be installed into Python, R, Java, Node.js, Julia, and C/C++ applications.

DuckDB’s engine allows you to insert data in different formats. Some of these include:

CSV
Parquet
JSON
S3
JSON
Excel
SQLite
PostgreSQL

In our experience working with DuckDB on the browser, the formats that we used the most are JSON row-like objects and JSON column-like objects. It also supports Apache Arrow tables right out of the box, and to do so, here are some useful insert statements

import { tableFromArrays } from 'apache-arrow';

// Create a connection
const c = await db.connect();

// Create Apache Arrow table
const arrowTable = tableFromArrays({
  id: [1, 2, 3],
  name: ['John', 'Jane', 'Jack'],
  age: [20, 21, 22],
});

// Data will be inserted as a new table following the 'name'
// parameter. In this example, 'arrow_table` is the name of it
await c.insertArrowTable(arrowTable, { name: 'arrow_table' });

// JSON row-like format
const jsonRowContent = [
  { "col1": 1, "col2": "foo" },
  { "col1": 2, "col2": "bar" },
];

// Take the JSON object and create a file out of it
await db.registerFileText(
  'rows.json',
  JSON.stringify(jsonRowContent),
);

// Create a DB table based on the newly created file
await c.insertJSONFromPath('rows.json', { name: 'rows' });

// JSON column-like format
const jsonColContent = {
  "col1": [1, 2],
  "col2": ["foo", "bar"]
};
await db.registerFileText(
  'columns.json',
  JSON.stringify(jsonColContent),
);
await c.insertJSONFromPath('columns.json', { name: 'columns' });

// Close the connection to release memory
await c.close();

The column-oriented storage format and WASM-backed calculations allow powerful processing right in the browser. Running operations over tens of thousands of records via DuckDB happens in a flash compared to operating on arrays of JavaScript objects, for example. Our team has taken advantage of this to perform aggregations on large data sets to reduce the burden on charting libraries. We also use it to build tabular displays with custom UI for filtering/sorting. Sorting 50K items in a JS array can take a notable amount of time. DuckDB can regenerate ordered data faster than some data table implementations, even considering marshaling from Arrow to JavaScript objects. For use-cases that want to support arbitrary querying of large datasets in the browser, especially if they are typically operating on a few columns of a wide data set, DuckDB shines.

We do not encourage the use of the toString or toArray built-in function of the Apache Arrow tables, as by using these with object-like columns, the parsing would result in more issues and patch fixes than smooth development. This became apparent the more we implemented DuckDB query results with our existing component.

Let’s have a look at one of our backend’s column object response:

attributes: [
  [
    {
      "key": "my key",
      "value": "my value"
    },
    {
      "key": "another key",
      "value": "another value"
    },
  ],
  [
    {
      "key": "extra key",
      "value": "extra value"
    },
   // More objects could be added...
  ],
 // More rows like the above ...
]

This type of compound object is handled by DuckDB automatically as a struct which makes the task of parsing and formatting the object whenever we request or insert new data way easier. The problem with it lies when we want to retrieve the complete column itself instead of just a set of attributes within. By doing this and attempting to call table_results.toString and then parsing it to a JSON object as JSON.parse(), we’d face an angry compiler that says that the JSON structure was incorrect. Workarounds started appearing and we tried to fix these types of issues until enough was enough and we followed another approach to parse Apache Table results into JSON objects.

A query that requests for the attributes column like the following one:

SELECT attributes FROM our_table

Would become a single-column Apache Arrow Table with the example data. Nothing is wrong so far, but the moment we attempted to parse this into a JavaScript object using table.toString, the results become like the following code snippet:

'[[{"key":"my\u0000key","value":"my\u0000value"},{"key":"another\u0000key","value":"another\u0000value"}],[{"key":"extra\u0000key","value":"extra\u0000value"}]]'

You’ll notice that the whitespaces inside the object were replaced automatically by \u0000 which is the Unicode for null. So we had to remove this character and replace it with simple spaces. But now we have the problem of deciding when can we be sure that we want to replace the Unicode for a single space and not a null literal. Not only that, but the more different types of spaces contained inside the object (tabs, new lines, empty spaces), the more replacement code we need to add.

With all of this happening as we continued to add more dependency on DuckDB, we decided to first, whenever we faced struct-like objects to be inserted into the database, transform them into JavaScript stringified objects, and secondly, use getChild on each column that we needed to transform to any non-Apache Arrow format and apply the required operations from there. Afterward, you can and should use the toJSON function that transforms a single column into JavaScript objects right away.

Following our example and expanding to different columns, we can use this code snippet to successfully parse any type of columns from DuckDB:

// Object to be returned in row format
const columnData: Record<string, any[]> = {};

// queryResults if the default Apache Arrow table results from DuckDb
for (const field of queryResults.schema.fields) {
  const columnName = field.name;
  columnData[columnName] = queryResults
    .getChild(columnName)
    ?.toJSON() as any[];
}
const firstKey = Object.keys(columnData)[0];
// Aux function used to parse attributes from String to actual objects
const parseAttributes = (
  colData: Record<string, any>,
  attributeKey: string
) => {
  if (attributeKey in colData) {
    (colData[attributeKey] as any[]).forEach(
      (value: string, index: number) => {
        colData[attributeKey][index] = JSON.parse(value);
      }
    );
  }
};
parseAttributes(columnData, "attributes");

// Transforms column-based to row-based results
return columnData[firstKey].map((_, index) => {
  const row: Record<string, any> = {};
  Object.keys(columnData).forEach((key) => {
    row[key] = columnData[key][index];
  });
  return row;
});

This approach has the benefit of operating over a single column with multiple values (as in columnar format) and lets us decide which columns should be parsed to row format without the extra overhead of the others, in addition to requiring zero code patches in the parsing process.

For more information on the complete set of capabilities of DuckDB, feel free to check their official documentation.

FINOS’ Perspective: A More Complete Solution

The Fintech Open Source Foundation (FINOS from now on) is a community that specializes in creating open-source solutions for financial services. One of these is Perspective, which is an interactive analytics and data visualization component. It is especially well-suited for large and/or streaming datasets. It offers a wider range of features than DuckDB, including support for user-configurable reports, dashboards, notebooks, and applications.

Just like DuckDB, this technology is compiled to WASM, which allows its usage in the browser, but it’s also compiled for Python, so it can be used in conjunction with Python code and/or Jupyterlab. Not only that, but you could power Perspective using a webSocket in either Python or Node.js.

In order to communicate with Perspective you’ll have to understand their two main elements: tables and views. Speaking only on the browser side, you need to request a table from a Web Worker. A table is an interface over a single dataset used to stream data into Perspective. A view is a continuous query of a table. You can say that a table contains the data and connects to Perspective, and a view is the current snapshot of the table, this includes filters, grouping by statements, ordering, available columns and so much more. A single table could have potentially multiple views, which allows us to rapidly switch between predefined filter selections that have value to the end user.

Here’s an example of a Table’s view, where each key represents a modifiable field in the Perspective table:

tableConfiguration: {
  aggregates: {},
  group_by: [],
  columns: [],
  expressions: [],
  filter: [],
  plugin: "Datagrid",
  plugin_config: {
    scroll_lock: false,
    editable: false,
    columns: {},
  },
  split_by: [],
  settings: true,
  sort: [],
}

Although Perspective offers a broader selection of features than DuckDB, it also restricts us in various other ways. For example, we need to define a schema before creating a table and once the table is created, it is immutable.

A table schema is defined as an object where all of the available columns to be shown have their respective data type. This doesn’t include expression columns as they are created by the user and can vary on data type depending on the operations applied to the data.

tableSchema: {
  elapsed: "integer",
  ends: "datetime",
  id: "string",
  label: "string",
  language: "string",
  libraryName: "string",
  libraryVersion: "string",
  name: "string",
  serviceInstanceId: "string",
  serviceName: "string",
  serviceNamespace: "string",
  serviceVersion: "string",
  starts: "datetime",
  statusCode: "integer",
  statusMessage: "string",
}

The data type supported by Perspective's table to update the data are only JavaScript objects, however, if we want to copy the table data from one table to another, we can export the first table data to arrow format and send that as the input data for the second table.

This leads us to another challenge in our use of Perspective, and it is that even with the ability to export the current table data to JSON, columnar, CSV, or Apache Arrow, this process tends to lag the page the more columns and more rows that are current in the table. The reason behind it is that as we only have a single worker for the Perspective table, this worker is in charge of taking the new data, displaying the UI configuration, and exporting to the other components its data. All of these operations are not done in parallel, which results in the table being stuck while the export is in process. This means that If you are planning on using Perspective as the main source of data or database, you must take into consideration that linking the data with other components is most likely going to affect the performance of the view.

A fix that the Perspective’s authors provided to us was to create a second worker, that handles the data exporting from Arrow format to JSON, and keep the initial worker as the main responsible of UI rendering and keeping the main source of data. The code to handle this looked something like this:

const originalSchema = firstTable.schema()
const currentView = await firstTable.view()

// Exporting to arrow instead of JSON is faster
const arrowData = await currentView.to_arrow()

// We have successfully copied the UI table data and view into a secondary 
// worker and table
const secondTable = secondWorker.table(orignalSchema).update(arrowData)
const secondView = secondTable.view(currentView)

// Return data as any format needed
return await secondView.to_json()

Does this mean that DuckDB is a better choice than using Perspective? It depends. The features that Perspective provided out-of-the-box really jump-started the development of our application, but we came to realize that customizing it to fit our desired UI was difficult and felt like fighting its strength instead of playing to it. Programmatically applying filters to the table data, while possible, required a DSL and Perspective APIs that were ultimately more work to keep track of than the SQL we could construct and use against DuckDB. Each technology has its benefits and disadvantages, the goal is to understand them and use them for their best use cases.

Combining the best of both worlds

Given that each technology offers its own advantages and disadvantages, we decided that instead of dumping or patching all of our existing code for Perspective, we would rather transition from a Perspective-centric storage philosophy, into a DuckDB-centered one, which then can export data to any other component, including Perspective. To achieve this, we kept the existing Perspective’s table schema for our components. Then, whenever we received data from the Backend, we would conduct a series of steps that would transform the data from DuckDB’s format to one that Perspective handles. This would effectively allow us to continue to use all of Perspective’s features, without worrying about Perspective’s current view and exporting inconsistent data to other components that may need a disabled column.

In addition to this, by changing the main data source to DuckDB, we could create multiple parallel connections to the DB, that translate into multiple components accessing the required data to display the insights information at the same time. Another improvement that we got from switching sources was that we can now keep all of the extra filters in a single shareable place, that could be applied to SQL queries, instead of having them directly linked to Perspective view filters. We could have created a function that transforms these into a SQL version of them, but that doesn’t remove the overhead caused when exporting data.

The flow for using DuckDB and Perspective followed the next structure:

// Realize new data has been inserted into DuckDB ...

const perspectiveSchema = getPerspectiveSchema()
const queryWithPerspectiveColumns = baseQuery.addColumns(perspectiveSchema.columns)

// Data comes as Apache Arrow, so we must parse them to Perspective valid format
const queryResults = await duckDB.query(queryWithPerspectiveColumns)
const parsedResults = duckDBResultsToPerspectiveInput(queryResults)
await perspective.updateTableData(parsedResults)

// Perspective is now updated with the newest DuckDB data...

Another advantage that came from switching to DuckDB was that we can now design as many complex, composable queries as we like, in order to obtain given insight values from the data. Aggregations within these queries are made under Apache Arrow data format which is much faster than using regular row-based aggregation right on the browser with JavaScript.

Obstacles Overcome

We would like to share a list of trial-and-error circumstances that we faced during the development cycle of these technologies. This list also includes the solution that we reached for it:

Exporting data from Perspective to another component must include the required columns for the second component, as the Table’s View configuration doesn’t include any column not present in the view or changes the exported data if any aggregation is applied. For instance, If we deselect the elapsed column, and try to export data to the Insights section (old Insights), the component will not be updated as exporting data from Perspective with the current view doesn't include that column

We validated that the columns are present in the current view, else we freeze the component status until they are available again

Although we can export data into Arrow format from Perspective, the majority of our already developed components did not accept that data type. This means that exporting data from Perspective to be used on another component with a dataset set of over 50K records will lag the UI experience

We copied the data from the first table into a second table. This data is passed by arrow format, so it’s faster than JSON. We keep the first table to be UI related component and the second one is in charge of transforming the data to JSON for the rest of the components. This will keep the rendering responsibility to the first worker and the exporting work to the second, reducing the workload and improving performance overall.

DuckDB offers benefits with regards to exporting data, however, because we need to format the incoming data from the Backend to match the current DB schema, we are not able to load as much data as fast as possible (using multiple Promises for example)

We decided to use a single thread that requests the highest amount of records before exceeding the limit of data transfer. That way, we can control loading as usual and we can keep the DB schema updated at all times without any unexpected errors happening

Key Takeaways

If you plan on only using Perspective features to display data and you don’t want to burden the user with their Data Grid controls, go ahead and create multiple views ready to be used by the user as in dropdown options.
If you don’t need to export the data once it reaches the main source, Perspective works just as well as DuckDB does.
If you need to share the data with multiple sources and have equally complex operations applied, DuckDB is a better option.
If you have a super user who is willing to learn how to use Perspective, how to create new columns, design different views, and extract as many insights from it as possible, take Perspective and make a good tutorial for it

Conclusion

DuckDB and Perspective Table are both powerful technologies that can be used to run queries on data stored in the browser. The best choice for you will depend on your specific needs and requirements.

We recommend you take your time using each technology, reading the docs and playing around with them, and choose whichever feels more comfortable and fulfills the needs of your web app. Both of them are constantly getting updated and the community surrounding them is incredibly gentle and patient when it comes to questions and doubts.

Our TelemetryHub site has a Perspective table demo attached to the Docs, You can go ahead and play around with it to get a feel for how the table works. We also use both DuckDB and Perspective to represent the Traces data in order to maximize the power of these technologies.

Apollo Link OpenTelemetry Instrumentation

Nathan Wade — Thu, 29 Jun 2023 15:00:27 +0000

Overview of Apollo Link

Apollo Link is a toolkit that addresses common problems between your Apollo Client and your GraphQL server.
The Apollo team has explained that different teams have different requirements. Trying to create one GraphQL client library that meets everyone's needs is impossible. Apollo Link helps solve this problem.
Apollo Link allows the community to create middleware hooks. You can use these hooks to create a custom version of Apollo Client to meet your GraphQL Client needs.

OpenTelemetry and Apollo Link

A system must be well instrumented to be observable. In this guide, we will use OpenTelemetry to capture custom trace data from GraphQL. We can do this with a small piece of manual instrumentation using Apollo Link.

We can a link to wrap middleware around the GraphQL call. This link is a function that receives the context object of the request. These functions contain information about the GraphQL operation. The return statement of our link will have access to our response data.
In OpenTelemetry, we capture a unit of work as a discrete Span. So, the first link in our chain should create a new Span. We will call it createSpanLink.

import { trace, SpanKind } from "@opentelemetry/api";

const createSpanLink = new ApolloLink((operation, forward) => {
  const tracer = trace.getTracer("@apollo/client");
  const span = tracer.startSpan(`gql.${operation.operationName}`, {
    startTime: operation.getContext().start,
    attributes: {
      user.id: myUser.id,
      // ... other attributes
    },
    kind: SpanKind.INTERNAL, // 0: Internal, 1: Server, 2: Client, 3: Producer, 4: Consumer
  });
  operation.setContext({ span });

  return forward(operation).map((data) => {
    span.end();
    return data;
  });
});

In the example, we create a new ApolloLink() and inside we set a tracer and a span. We start by creating some attributes to give our span useful information. We use operation.setContext({ span }) to make sure that we pass the same span to each link. The return statement of each links gets called once the GraphQL server has sent a response. We should call span.end() to verify the completed round trip.
Our chain now looks something like this:

Adding functionality to record exceptions

There will be instances where there is a GQLError or NetworkError during the trip. Luckily, Apollo Link already has a built-in solution for exception handling called onError(). A simple setup may look something like this:

// Create Error Link
import { onError } from "@apollo/client/link/error";

export const errorLink = onError(
  ({ graphQLErrors, networkErrors, operation }) => {
    if (graphQLErrors) console.log(graphQLErrors);
    if (networkErrors) console.log(networkErrors);
  }
);

This uses the built-in onError() function to send errors to the console. But, we want to add the context of these errors to our span.
OpenTelemetry spans have a special function for recording exceptions by using span.recordException(). This function accepts either: an object with up to 4 optional keys (code, message, name, and stack) or a string. We can set the status code with span.setStatus() which takes SpanStatusCode as a parameter. Here is an example:

// Create Error Link
import { onError } from "@apollo/client/link/error";
import { SpanStatusCode } from "@opentelemetry/api";
export const errorLink = onError(
  ({ graphQLErrors, networkError, operation }) => {
    const span = operation.getContext().span;
    span.setStatus({ code: SpanStatusCode.ERROR });
    if (graphQLErrors) {
      graphQLErrors.forEach(({ message, locations, path }) => {
        span.recordException({
          message: `[GraphQL error]: Message: ${message}, Location: ${locations}, Path: ${path}`,
        });
      });
      span.end();
    }

    if (networkError) {
      span.recordException({
        message: `[Network error]: ${networkError}`,
      });
      span.end();
    }
  }
);

The Apollo Link documentation says that the onError() link should be the first link of your chain. But we want our CreateSpanLink to observe our entire chain. Because of this, it makes sense to add the onError link as the second link. Our completed chain should look something like this:

Visualizing ApolloLink Traces in TelemetryHub

We can now visualize our GraphQL calls in TelemetryHub. Navigate to the Traces tab. We can filter on all of the traces to show the new Apollo Link traces in the table on the bottom left of the UI. Use the Where filtering function and set it to where libraryName == @apollo/client. Note: the library name was set by us when we initialized our tracer in createSpanLink.

From there, we are able to investigate each individual trace. By clicking on one of these traces, TelemetryHub will navigate you to the Trace Details page. Here's an example of a GraphQL query trace:

Adding Links to Apollo Client

The final step is to modify ApolloClient. This will add links to your GraphQL calls. We can use the link key from ApolloClient's optional options argument. We can pass link a function called from() which takes an array of our newly created Links. It expects our Links array to be in the order we want for our final chain. Here is an example:

const apolloClient = new ApolloClient({  
  link: from([  
    createSpanLink,  
    errorLink,   
  ]),    
});

The Complete Setup

import {  
  ApolloClient,  
  ApolloLink,  
} from "@apollo/client/core";
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";
import { onError } from "@apollo/client/link/error";

// Create Span Link
export const createSpanLink = new ApolloLink((operation, forward) => {
  const tracer = trace.getTracer("@apollo/client");
  const span = tracer.startSpan(`gql.${operation.operationName}`, {
    startTime: operation.getContext().start,
    attributes: {
      user.id: myUser.id,
      // ... other attributes
    },
    kind: SpanKind.INTERNAL, // 0: Internal, 1: Server, 2: Client, 3: Producer, 4: Consumer
  });
  operation.setContext({ span });

  return forward(operation).map((data) => {
    span.end();
    return data;
  });
});

// Create Error Link
export const errorLink = onError(
  ({ graphQLErrors, networkError, operation }) => {
    const span = operation.getContext().span;
    span.setStatus({ code: SpanStatusCode.ERROR });
    if (graphQLErrors) {
      graphQLErrors.forEach(({ message, locations, path }) => {
        span.recordException({
          message: `[GraphQL error]: Message: ${message}, Location: ${locations}, Path: ${path}`,
        });
      });
      span.end();
    }

    if (networkError) {
      span.recordException({
        message: `[Network error]: ${networkError}`,
      });
      span.end();
    }
  }
);

const apolloClient = new ApolloClient({  
  link: from([  
    createSpanLink,  
    errorLink,   
  ]),    
});

export default apolloClient;

You can view your telemetry data for free on TelemetryHub.

DynamoDB Single Table Design

Quinn Milionis — Thu, 22 Jun 2023 20:13:02 +0000

DynamoDB and the Key-Value data model

Amazon’s DynamoDB is a powerful and scalable solution for storing and retrieving data. While traditional NoSQL databases often rely on a document or column-based structure, DynamoDB embraces the simplicity and flexibility of a key-value data model. The structure of this key-value data model means that we should be especially considerate when entering data into the database, as we are more limited in how we can later retrieve this data. We can achieve efficient data entry and retrieval in DynamoDB by embracing a methodology called “Single Table Design.” This design philosophy encourages a shift in mindset when it comes to organizing and accessing data, resulting in a highly scalable and performant system. In this post, we’ll help you conceptualize single table design with real-world examples from our own codebase, as well as lessons and principles the TelemetryHub team has learned along the way.

The key-value nature of DynamoDB enables developers to store and retrieve data based on a unique identifier, or key, associated with each item. This approach eliminates the need for complex joins, schemas, and indexes, simplifying data access and reducing operational overhead. By leveraging key-value pairs, DynamoDB achieves very fast and predictable read and write performance.

We’ll discuss how this design pattern simplifies data modeling, improves query performance, and reduces costs, all while maintaining the scalability and reliability that DynamoDB is known for.

Whether you’re already familiar with DynamoDB or just starting to explore its capabilities, this blog will introduce you to the world of DynamoDB single table design, and how it can change the way you think about your data.

Single Table Design

Dynamo single table design refers to the practice of harnessing the full power of a single DynamoDB table to store multiple types of data. This approach sets it apart from traditional SQL databases, where each unique type of data typically resides in its own table. While it is not strictly required for using DynamoDB, we firmly believe that single table design is the optimal way to structure your DynamoDB tables, and its advantages will become evident as we explore further.

Should you use Single Table Design?

Single table design is not only an option but the recommended approach when working with DynamoDB tables. However, before delving deeper, it is essential to consider if DynamoDB is the right choice for your application. The key factor to contemplate is whether your application exhibits relatively predictable access patterns. As we proceed with practical examples, you will discover that most applications share common access patterns that can be effectively modeled in DynamoDB. By leveraging single table design, you can streamline your data organization, enhance query performance, reduce operational costs, and maintain the scalability and reliability that DynamoDB is renowned for.

Designing Our Table

Data Access Patterns

Before we dive into implementation, it is crucial to consider the data access patterns that our application will require. In other words, we need to determine how we will be working with or “using” our data.

Let’s take our TelemetryHub application as an example. Within TelemetryHub, we support an unlimited number of “service groups” for each account. These service groups represent distinct sets of applications or services, such as a “dev” and “prod” service group. One data access pattern we need to support is the retrieval of all service groups associated with an account. Additionally, another data access pattern could involve fetching all users belonging to the same account. (It’s worth mentioning that TelemetryHub doesn’t charge for seats, so your entire team can leverage the power of OpenTelemetry for developing and debugging applications.)

One-To-Many Relationships

In the previous examples, we’ve discussed three main record types that will be stored in our DynamoDB table: “Account,” “Service Group,” and “User”. We are also aware of the fundamental relationships between these records: an Account contains multiple Users, and an Account contains multiple Service Groups. These relationships can be described as a “One (Account)-to-many (Users / Service Groups)” relationship. This is a key insight, as understanding these relationships is crucial as we proceed with designing our DynamoDB table schema.

To provide a clearer understanding, let’s examine an example of each record type, which will serve as a reference point:

Account: 
PK: account:{account_id}
SK: metadata:account

User: 
PK: account:{account_id}
SK: user:{userId}

Service Group:
PK: account:{account_id}
SK: servicegroup:{service_group_id}

Next, let’s delve into the structure and limitations of DynamoDB records. While DynamoDB is commonly known as a “key-value store” database, it’s important to understand what this means in practice. Essentially, each record comprises one or two keys (Partition Key and Sort Key) and the remaining data as the “value.” These keys play a vital role in enabling access patterns for DynamoDB records. Additionally, each record can contain numerous other properties. For example, the User record above could have a schema that looks like:

User: 
PK: account:{account_id}
SK: user:{userId}
first_name: string
last_name: string
role: string
created_at: datetime

The non-key properties here can be whatever we’d like. Only the keys determine how records are stored and accessed in Dynamo, so for the remainder of the discussion, we’ll focus on just PK and SK.

Reading data from Dynamo

The DynamoDB API primarily supports two read main methods for data access: Get and Query. The Get method retrieves a single, specific record based on its PK and SK. Although useful, there may be instances where you don’t have the complete PK and SK information or need to retrieve multiple records. For example, if we want to obtain a list of all users belonging to an account, we can utilize the Query method with the following pseudocode:

dynamodb.query(
 PK="account:account-id-001",
 SK_begins_with="user:"
)

Using Scan:
There is also a scan method that allows you to search for any matching property or string within a key. However, it should be used sparingly and reserved only for one-off administrative actions, as it involves reading the entire table and can be expensive and inefficient, especially as your data grows. If you find yourself needing to scan your data regularly, there’s likely a design issue that needs attention.

Prefixes and “begins with”

The query above query would return all the user records belonging to the specified account. Awesome! Notice how we’re able to query with a “begins with’’ expression on the SK. In any DynamoDB query, you are required to know the entire PK, but you can use any conditional operator (<, >, etc) and “begins_with” when specifying the SK. This querying feature is why we use prefixes in our keys. If we know that a specific record types SK’s begin with a certain prefix, we can use that handy “begins_with” to select all the matching records.

This example demonstrates a straightforward representation of a one-to-many relationship in DynamoDB. This is the essence of Single Table design” — using a single table to represent multiple record types and their relationships without the need for complex joins. In the given example, “account_id” acts as the shared index among all these record types. In DynamoDB, relationships are modeled within the data structure itself, eliminating the need for external table schemas or queries.

Modeling relationships in the data

It is crucial to consider common data access patterns when designing our data in DynamoDB. Unlike traditional SQL databases, where complex queries pose the challenge, DynamoDB requires careful consideration during the table design phase, making the rest of the process relatively straightforward.

Under the hood, DynamoDB functions like a scalable, optimized hash table. Each record is placed into a “partition” based on its Partition Key and sorted within that partition based on its Sort Key. With both keys, we can retrieve an exact record. With only the Partition Key, we can query everything within that partition. However, by combining the power of the DynamoDB query API and our intelligent single table design, we can efficiently and consistently retrieve the precise data we need.

If you understood the example query above, you won’t be surprised that we can query all the service groups belonging to an account in a similar manager:

dynamodb.query(
 PK=account:"account-id-001",
 SK_begins_with="servicegroup:"
)

If you wanted to retrieve all the account data with a single query (including the account metadata record, all users, and service groups), you could query using only PK=account:account-id-001 and omit anything about the SK. However, this approach often involves “over-fetching” and would require sorting the returned data in your application code. So while we can strategically fetch multiple record types, we usually specify at least some part of the SK.

Many-To-Many relationships

In our TelemetryHub application, we allow account administrators to manage access to service groups on a per-user basis. This means that administrators should be able to assign developers to specific service groups, ensuring that each user only has access to the relevant groups based on their responsibilities. If we think about our existing data structure, we might see how there’s a problem here. An account contains multiple service groups as well as multiple users. However, when we examine our existing data structure, we realize there’s a challenge in modeling the relationship between users and service groups. This relationship falls under the category of a “many-to-many” relationship since a user can belong to multiple service groups, and a service group can be accessed by multiple users.

The Membership Record:

To address this challenge, our team has devised a solution by creating a “membership record” that explicitly captures the relationship between users and service groups. This record serves as a bridge connecting users and their associated service groups. Let’s take a look at what this membership record might look like:

User Service Group Membership:
PK: user:{user_id}
SK: servicegroup:{service_group_id}

Using this structure, we can easily query the database to retrieve all the service groups to which a user has permissions. By executing a query like the following:

dynamodb.query(
 PK="user:user-id-001",
 SK_begins_with="servicegroup:"
)

We obtain a list of service groups accessible by the user. In fact, this is the same query that powers the “Select Service Group” dropdown feature in the TelemetryHub application.

Exploring Alternatives:

While the above approach is effective, it’s important to note that it’s not the only way to model this relationship. We can explore more advanced key designs to optimize the data structure further. Another possible representation could be:

User Service Group Membership: 
PK: account:{account_id}
SK: user:{userId}#servicegroup:{service_group_id}

This alternative design also supports the required data access pattern to retrieve all the service groups a user has access to. Since we know the user_id, we can deduce the account_id that the user belongs to. It’s crucial to remember that we always need to provide the entire PK when querying the database.

To obtain all the service groups for a user using this alternate record structure, we would execute the following query:

dynamodb.query(
 PK="account:account-id-001"
 SK_begins_with="user:user-id-001#servicegroup:"
)

Considering Data Access Patterns:

Now, before we commit to either of these models designs, let’s think more about all the data access patterns we’ll want to support. We’ve demonstrated how either of these models would work for the “get all service groups for user” data access, but what if we wanted to go the other way? This is a “many-to-many” relationship, after all, so it’s common we’d want to be able to do the “reverse” lookup, which in this case would be “get all users belonging to a service group.”

In the first Service Group Membership record, the PK is user:{user_id}. Remember our querying restriction: we always need to know the entire PK. For instance, we can’t ask Dynamo for all PKs “starting with” “user:” for instance.

Now if we look at the second model option, you might notice that we run into a similar problem. While we can assume we’ll always know the entire PK (the accountId), the SK has the userId before the service group, so we can’t use our trusty “beings with” expression on the SK for this record either.

Well then, what are we to do? You’d be correct to figure that from the query limitations we’ve covered, a “reverse” query for this kind of relationship is not possible. Oh, but of course it is! This is a very common issue with Dynamo and one that is solved using a Global Secondary Index (GSI). To intuitively understand the use of a GSI, let’s continue with our example.

Global Secondary Index

Let’s imagine that we were redesigning our “service group membership” schema but with the primary focus of supporting the “get all users belonging to a service group” query and are not concerned about the “get all service groups for a user” query. A simple way to achieve this would be with the following key structure:

User Service Group Membership:
PK: servicegroup:{service_group_id}
SK: user:{user_id}

With this schema, we could perform the following query:

dynamodb.query(
 PK="servicegroup:service-group-id-1"
 SK_begins_with="user:"
)

Notice anything about this record schema? That’s right, it’s the inverse of our old schema. The PK and SK have been flipped or inverted. This is where the power of a Global Secondary Index becomes apparent. A GSI allows us to remap the PK and SK values across our entire table. In other words, it provides an alternative way to organize and access our data, opening up a ton of possibilities.

Technically, you can remap the PK and SK values of a GSI to any properties in your data. However, the most universally useful GSI in our case is the “inverse” GSI we just reasoned into. This is considered a “generic” GSI, and these play a vital role in creating a robust Single Table schema.

Under the hood, you can imagine a GSI creating an entirely separate Dynamo table with these new indices. GSIs require their own storage and add overhead to read/write operations anywhere in the table, as changes to the table must be propagated throughout the GSIs. For this reason, it is advised to limit GSIs on a table. With a good Single Table Design, you can go far with a single, generic, “multi-purpose” GSI like the “inverse” index we describe above, though adding additional for specialized cases is always an option.

By leveraging the flexibility of a Global Secondary Index, we can effectively handle the reverse lookup challenge in our many-to-many relationship.

Summary

In this discussion of DynamoDB single table design of the Dynamos’s key-value database. We learned that by using shared key indices in the PK, and prefixes in our SKs, we could easily and efficiently query for related records. We then looked at modeling many-to-many relationships using membership records and how we can use an inverse GSI to support lookups in both directions. This can be a great workflow for any team working on designing their Dynamo schemas and record types.

Consider the data that you’ll want to store in Dynamo. What sorts of records will you be creating, and how do they relate to each other?
Think about your common data access patterns. Which queries will be commonly performed on the data.
Design the initial records while thinking about those data access patterns. For now, only worry about one-to-many relationships. Writing the queries in pseudocode while at the same time speccing out the record can be helpful.
Identify which many-to-many relationships exist between your data. Create a “Membership Record” to relate these records, and again, think about your most common data access patterns here.
Create an inverse GSI to support the reverse query. Refine! Nothing is set in stone, but the earlier you nail down the structure of your data, the easier development will be going forward.
Single Table Design is the recommended approach when working with DynamoDb, as it lets us take advantage of fast Dynamo query times for querying data without the additional costs and complexity of managing multiple DynamoDB tables. We hope you found this useful.

Once your database is set up and integrated into your application, see how easy it is to integrate OpenTelemetry and TelemetryHub to monitor your entire application through the simple yet powerful dashboard and tools. Learn more about TelemetryHub here and check back to learn more about the tech we use to build the applications for developers, by developers.