<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Artem Chebotko</title>
    <description>The latest articles on DEV Community by Artem Chebotko (@artemchebotko).</description>
    <link>https://dev.to/artemchebotko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F748368%2F53c7d630-d074-4ee4-a6cf-78888e7e7178.png</url>
      <title>DEV Community: Artem Chebotko</title>
      <link>https://dev.to/artemchebotko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/artemchebotko"/>
    <language>en</language>
    <item>
      <title>Five Data Models for IoT: Managing the Latest IoT Events Based on a State in Apache Cassandra</title>
      <dc:creator>Artem Chebotko</dc:creator>
      <pubDate>Thu, 28 Jul 2022 17:36:42 +0000</pubDate>
      <link>https://dev.to/datastax/five-data-models-for-iot-managing-the-latest-iot-events-based-on-a-state-in-apache-cassandra-47n1</link>
      <guid>https://dev.to/datastax/five-data-models-for-iot-managing-the-latest-iot-events-based-on-a-state-in-apache-cassandra-47n1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxou6ju5g6mdoxzteibdc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxou6ju5g6mdoxzteibdc.jpg" alt="Image description" width="350" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;© Shutterstock / everything possible&lt;/p&gt;

&lt;p&gt;Apache Cassandra is a rock-solid choice for managing IoT and time series data at scale. The most popular use case of storing, querying and analyzing time series generated by IoT devices in Cassandra is well-understood and documented. In general, a time series is stored and queried based on its source IoT device. However, there exists another class of IoT applications that require quick access to the most recent data generated by a collection of IoT devices based on a known state. The question that such applications need to answer is: Which IoT devices or sensors are currently reporting a specific state? In this blog post, we focus on this question and provide five possible data modeling solutions to efficiently answer it in Cassandra.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Internet of Things (IoT) is generating massive amounts of time series data that needs to be stored, queried, and analyzed. Apache Cassandra is an excellent choice for this task: not only because of its speed, reliability and scalability, but also because its internal data model has built-in support for time-ordered data.&lt;/p&gt;

&lt;p&gt;In Cassandra, time series are usually stored and retrieved by a source (e.g., an IoT device or sensor) or a subject (e.g., a parameter or metric). There are many good resources that cover this topic in great detail, including &lt;a href="https://www.datastax.com/resources/video/datastax-accelerate-2019-designing-time-series-applications-scale-datastax" rel="noopener noreferrer"&gt;this conference presentation video&lt;/a&gt;, and ready-to-use Cassandra data models for &lt;a href="https://www.datastax.com/learn/data-modeling-by-example/sensor-data-model" rel="noopener noreferrer"&gt;sensor data&lt;/a&gt; and &lt;a href="https://www.datastax.com/learn/data-modeling-by-example/time-series-model" rel="noopener noreferrer"&gt;time series&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we look at a related class of IoT use cases that need to manage a snapshot of the latest data coming from many IoT devices. Moreover, such a snapshot needs to be queried or filtered based on a specific state reported by IoT devices. In other words, we should be able to quickly answer this question in Cassandra: Which IoT devices are currently reporting a specific state? For many real-life use cases, this question may sound more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which lights are currently on (off) in a smart home?&lt;/li&gt;
&lt;li&gt;Which parking spots are currently occupied (unoccupied) in a parking structure?&lt;/li&gt;
&lt;li&gt;Which vehicles are currently available (unavailable) near a specific location?&lt;/li&gt;
&lt;li&gt;Which security alarms are currently triggered (activated, disabled) in an area?&lt;/li&gt;
&lt;li&gt;Which doors are currently opened (closed, locked, unlocked) in a building?&lt;/li&gt;
&lt;li&gt;Which fire detection sensors are currently reporting an abnormal (normal standby, error) state in a sensor network?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the following, we define the problem a bit more formally and propose five practical solutions with example CQL implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The problem definition&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Given a collection of IoT devices or sensors that generate time-ordered sequences of events containing timestamps, data points and states, find the latest events with a known state reported by all IoT devices. The three key components of this problem are illustrated in the figure below and are described as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The input consists of time series generated by IoT devices. The time series are generally stored in one or more Cassandra tables.&lt;/li&gt;
&lt;li&gt;The intermediate view is a snapshot of only the latest events reported by the IoT devices. It is possible to either store the latest events separately and explicitly, or compute them dynamically from the input.&lt;/li&gt;
&lt;li&gt;The final result is all the latest events with a known state. The latest events with the same state should be either stored together or readily computable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1xiz92fkbqf2i2xoxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1xiz92fkbqf2i2xoxa.png" alt="Image description" width="768" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Managing the Latest IoT Events Based on a State&lt;/p&gt;

&lt;p&gt;We identify several challenges to managing the latest IoT events based on a state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A snapshot of the latest events is constantly evolving. Additional effort may be required to incrementally capture any changes.&lt;/li&gt;
&lt;li&gt;The frequency of event occurrences is generally unpredictable. It may be difficult to partition and organize events based only on their timestamp components.&lt;/li&gt;
&lt;li&gt;A state usually takes only a few unique values. Partitioning and indexing data based on a low-cardinality column may result in large partitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use the following running example as a starting point. Table events_by_device is the input. This &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/tables-multi-row-partitions" rel="noopener noreferrer"&gt;table with multi-row partitions&lt;/a&gt; is designed to store time series, such that each partition corresponds to one device, and rows in a partition represent events with timestamps, states and values. Events within each partition are always sorted by their timestamps in descending order. This table essentially stores one time series per partition. We insert five events into the table and retrieve the time series for one device. Moreover, in the second query, we demonstrate that it is possible to dynamically compute all the latest events for all devices. Unfortunately, we should not rely on this query to solve the problem: it can potentially become very expensive as it accesses every partition in the table.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;Schema&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- All events by device

CREATE TABLE events_by_device (

device_id  UUID,

timestamp  TIMESTAMP,

state   TEXT,

value   TEXT,

PRIMARY KEY((device_id), timestamp)

) WITH CLUSTERING ORDER BY (timestamp DESC);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Event 1-1
INSERT INTO events_by_device 
       (device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 01:11:11', 'on', 'event 1-1');
-- Event 1-2
INSERT INTO events_by_device 
       (device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 02:22:22', 'off', 'event 1-2');
-- Event 1-3
INSERT INTO events_by_device 
       (device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 03:33:33', 'on', 'event 1-3');
-- Event 2-1
INSERT INTO events_by_device 
       (device_id, timestamp, state, value)
VALUES (22222222-aaaa-bbbb-cccc-12345678abcd, 
        '2021-02-02 01:11:11', 'off', 'event 2-1');
-- Event 3-1
INSERT INTO events_by_device 
       (device_id, timestamp, state, value)
VALUES (33333333-aaaa-bbbb-cccc-12345678abcd, 
        '2021-03-03 01:11:11', 'off', 'event 3-1');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Queries&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find all events for a device
SELECT device_id, timestamp, state, value
FROM   events_by_device
WHERE  device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;

 device_id                            | timestamp                       | state | value
--------------------------------------+---------------------------------+-------+-----------
 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 |    on | event 1-3
 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 02:22:22.000000+0000 |   off | event 1-2
 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 01:11:11.000000+0000 |    on | event 1-1

-- Find the latest events for all devices
SELECT device_id, timestamp, state, value
FROM   events_by_device
PER PARTITION LIMIT 1;

 device_id                            | timestamp                       | state | value
--------------------------------------+---------------------------------+-------+-----------
 33333333-aaaa-bbbb-cccc-12345678abcd | 2021-03-03 01:11:11.000000+0000 |   off | event 3-1
 22222222-aaaa-bbbb-cccc-12345678abcd | 2021-02-02 01:11:11.000000+0000 |   off | event 2-1
 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 |    on | event 1-3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we assume the number of events per device does not exceed 100,000. Otherwise, we may need to further split partitions in table events_by_device by introducing another column into its partition key definition. Since this is not important for the problem we are solving in this post, let’s keep things simple.&lt;/p&gt;

&lt;p&gt;Given the problem definition and the running CQL example of IoT events, we are ready to describe the five solutions with different characteristics.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution 1: Materialized view&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first solution requires a new table and a materialized view. Table latest_events_by_device is a table with &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/tables-single-row-partitions" rel="noopener noreferrer"&gt;single-row partitions&lt;/a&gt;, where each partition corresponds to a device and each row corresponds to the latest known event. The purpose of this table is to have a snapshot of only the latest events reported by the IoT devices. The table is also a base table for materialized view latest_events_by_state that enables querying the latest events using a state.&lt;/p&gt;

&lt;p&gt;Notice that exactly the same data is inserted into both tables events_by_device and latest_events_by_device. However, for the latter, inserts become &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/inserts-updates-deletes" rel="noopener noreferrer"&gt;upserts&lt;/a&gt; that update rows to the latest events.&lt;/p&gt;
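&lt;p&gt;The upsert behavior is easy to simulate outside of Cassandra. The following sketch (plain Python, purely illustrative and not a Cassandra client) mimics how repeated inserts keyed by device_id leave only the latest event per device, with the state-based grouping playing the role of the materialized view:&lt;/p&gt;

```python
# A minimal in-memory sketch (plain Python, not a Cassandra client) of the
# upsert semantics described above: repeated inserts keyed by device_id keep
# only the latest event per device, and the state-based grouping plays the
# role of the latest_events_by_state materialized view.
latest_by_device = {}  # stands in for table latest_events_by_device

def upsert(device_id, timestamp, state, value):
    # In Cassandra, an INSERT with an existing primary key overwrites the row.
    latest_by_device[device_id] = {
        "timestamp": timestamp, "state": state, "value": value,
    }

def latest_by_state(state):
    # Stands in for querying the latest_events_by_state materialized view.
    return {d: e for d, e in latest_by_device.items() if e["state"] == state}

upsert("device-1", "2021-01-01 01:11:11", "on", "event 1-1")
upsert("device-1", "2021-01-01 02:22:22", "off", "event 1-2")
upsert("device-1", "2021-01-01 03:33:33", "on", "event 1-3")
upsert("device-2", "2021-02-02 01:11:11", "off", "event 2-1")

print(latest_by_state("on"))   # only device-1's latest event, 'event 1-3'
```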

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Latest known events by device
CREATE TABLE latest_events_by_device (
    device_id  UUID,
    timestamp  TIMESTAMP,
    state      TEXT,
    value      TEXT,
    PRIMARY KEY((device_id))
);

-- Latest events by state
CREATE MATERIALIZED VIEW latest_events_by_state AS 
  SELECT * FROM latest_events_by_device
  WHERE state IS NOT NULL AND device_id IS NOT NULL
PRIMARY KEY ((state), device_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Event 1-1
INSERT INTO latest_events_by_device 
       (device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 01:11:11', 'on', 'event 1-1');
-- Event 1-2
INSERT INTO latest_events_by_device 
       (device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 02:22:22', 'off', 'event 1-2');
-- Event 1-3
INSERT INTO latest_events_by_device 
       (device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 03:33:33', 'on', 'event 1-3');
-- Event 2-1
INSERT INTO latest_events_by_device 
       (device_id, timestamp, state, value)
VALUES (22222222-aaaa-bbbb-cccc-12345678abcd, 
        '2021-02-02 01:11:11', 'off', 'event 2-1');
-- Event 3-1
INSERT INTO latest_events_by_device 
       (device_id, timestamp, state, value)
VALUES (33333333-aaaa-bbbb-cccc-12345678abcd, 
        '2021-03-03 01:11:11', 'off', 'event 3-1');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Queries&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find all the latest events with state 'on'
SELECT state, device_id, timestamp, value
FROM   latest_events_by_state
WHERE  state = 'on';

 state | device_id                            | timestamp                       | value
-------+--------------------------------------+---------------------------------+-----------
    on | 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 | event 1-3

-- Find all the latest events with state 'off'
SELECT state, device_id, timestamp, value
FROM   latest_events_by_state
WHERE  state = 'off';

 state | device_id                            | timestamp                       | value
-------+--------------------------------------+---------------------------------+-----------
   off | 22222222-aaaa-bbbb-cccc-12345678abcd | 2021-02-02 01:11:11.000000+0000 | event 2-1
   off | 33333333-aaaa-bbbb-cccc-12345678abcd | 2021-03-03 01:11:11.000000+0000 | event 3-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The materialized view solution has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicability: state-based queries return 100K rows / 100MBs of data or less.&lt;/li&gt;
&lt;li&gt;Pros: the view is maintained automatically; excellent performance.&lt;/li&gt;
&lt;li&gt;Cons: materialized views have &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/materialized-views" rel="noopener noreferrer"&gt;a few limitations&lt;/a&gt;; data distribution may become skewed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To support multiple tenants, we can change the table primary key to PRIMARY KEY((tenant, device_id)) or PRIMARY KEY((tenant), device_id), and the materialized view primary key to PRIMARY KEY ((tenant, state), device_id). Multi-tenancy may also help improve data distribution.&lt;/p&gt;

&lt;p&gt;This data model can be a simple, effective and efficient choice for many applications, as long as you are aware of and willing to counteract the &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/materialized-views" rel="noopener noreferrer"&gt;materialized view limitations&lt;/a&gt;. Another less obvious advantage of this data model is how easy it would be to feed data from an event streaming platform like &lt;a href="https://www.datastax.com/blog/2021/01/four-reasons-why-apache-pulsar-essential-modern-data-stack" rel="noopener noreferrer"&gt;Apache Pulsar or Apache Kafka&lt;/a&gt;. All events can go to the base table and the materialized view takes care of the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution 2: Secondary index&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The second solution requires a new table and a secondary index. The table is the same as in the materialized view solution. Table latest_events_by_device is a table with &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/tables-single-row-partitions" rel="noopener noreferrer"&gt;single-row partitions&lt;/a&gt;, where each partition corresponds to a device and each row corresponds to the latest known event. The purpose of this table is to have a snapshot of only the latest events reported by the IoT devices. Secondary index latest_events_by_state_2i is created for this table to query the latest events based on a state.&lt;/p&gt;

&lt;p&gt;Once again, exactly the same data is inserted into both tables events_by_device and latest_events_by_device. However, for the latter, inserts become &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/inserts-updates-deletes" rel="noopener noreferrer"&gt;upserts&lt;/a&gt; that update rows to the latest events.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Latest known events by device
CREATE TABLE latest_events_by_device (
    device_id  UUID,
    timestamp  TIMESTAMP,
    state      TEXT,
    value      TEXT,
    PRIMARY KEY((device_id))
);

-- Latest events by state
CREATE INDEX latest_events_by_state_2i 
ON latest_events_by_device (state);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Event 1-1
INSERT INTO latest_events_by_device
(device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd,
'2021-01-01 01:11:11', 'on', 'event 1-1');
-- Event 1-2
INSERT INTO latest_events_by_device
(device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd,
'2021-01-01 02:22:22', 'off', 'event 1-2');
-- Event 1-3
INSERT INTO latest_events_by_device
(device_id, timestamp, state, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd,
'2021-01-01 03:33:33', 'on', 'event 1-3');
-- Event 2-1
INSERT INTO latest_events_by_device
(device_id, timestamp, state, value)
VALUES (22222222-aaaa-bbbb-cccc-12345678abcd,
'2021-02-02 01:11:11', 'off', 'event 2-1');
-- Event 3-1
INSERT INTO latest_events_by_device
(device_id, timestamp, state, value)
VALUES (33333333-aaaa-bbbb-cccc-12345678abcd,
'2021-03-03 01:11:11', 'off', 'event 3-1');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Queries&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find all the latest events with state 'on'
SELECT state, device_id, timestamp, value
FROM   latest_events_by_device
WHERE  state = 'on';

 state | device_id                            | timestamp                       | value
-------+--------------------------------------+---------------------------------+-----------
    on | 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 | event 1-3

-- Find all the latest events with state 'off'
SELECT state, device_id, timestamp, value
FROM   latest_events_by_device
WHERE  state = 'off';

 state | device_id                            | timestamp                       | value
-------+--------------------------------------+---------------------------------+-----------
   off | 33333333-aaaa-bbbb-cccc-12345678abcd | 2021-03-03 01:11:11.000000+0000 | event 3-1
   off | 22222222-aaaa-bbbb-cccc-12345678abcd | 2021-02-02 01:11:11.000000+0000 | event 2-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The secondary index solution has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicability: state-based queries return 100K rows / 100MBs of data or more; state-based queries are executed infrequently.&lt;/li&gt;
&lt;li&gt;Pros: may better distribute the query workload across nodes in a cluster when retrieving a large result set.&lt;/li&gt;
&lt;li&gt;Cons: secondary indexes have &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/secondary-indexes" rel="noopener noreferrer"&gt;a few limitations&lt;/a&gt;; performance may become unsatisfactory for real-time applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data model can be a reasonable choice in some cases. In particular, when multi-tenancy is introduced by changing the table primary key to PRIMARY KEY((tenant), device_id), we can hit &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/secondary-indexes" rel="noopener noreferrer"&gt;the sweet spot&lt;/a&gt; of using secondary indexes for real-time transactional queries. That is when retrieving rows from a large multi-row partition based on both a partition key and an indexed column specified in a query predicate.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution 3: State-Partitioned Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The third solution relies on table latest_events_by_state to organize and query the latest events using a state. Every insert of an event with some state into this table must be accompanied by deletes of any outdated events with the other states for the same IoT device. In our example, we have one insert and one delete for each event, since we only have two unique states. If we were to have three possible states, each new event would result in one insert and two deletes.&lt;/p&gt;
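&lt;p&gt;The insert-plus-delete maintenance logic can be sketched in plain Python (illustrative only; the dictionaries stand in for Cassandra partitions keyed by state):&lt;/p&gt;

```python
# Illustrative sketch (plain Python) of the maintenance logic for table
# latest_events_by_state: every new event inserts a row under its state and
# deletes the same device's rows under every other state.
STATES = ["on", "off"]
table = {s: {} for s in STATES}  # one partition per state

def record_event(state, device_id, timestamp, value):
    table[state][device_id] = {"timestamp": timestamp, "value": value}  # INSERT
    for other in STATES:
        if other != state:
            table[other].pop(device_id, None)  # DELETE any outdated row

record_event("on",  "device-1", "2021-01-01 01:11:11", "event 1-1")
record_event("off", "device-1", "2021-01-01 02:22:22", "event 1-2")
record_event("on",  "device-1", "2021-01-01 03:33:33", "event 1-3")
record_event("off", "device-2", "2021-02-02 01:11:11", "event 2-1")

print(sorted(table["on"]))   # only device-1 remains in the 'on' partition
```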

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Latest events by state
CREATE TABLE latest_events_by_state (
    state      TEXT,
    device_id  UUID,
    timestamp  TIMESTAMP,
    value      TEXT,
    PRIMARY KEY((state), device_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Event 1-1
INSERT INTO latest_events_by_state 
       (state, device_id, timestamp, value)
VALUES ('on', 11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 01:11:11', 'event 1-1');
DELETE FROM latest_events_by_state 
WHERE state = 'off' AND
      device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 1-2
INSERT INTO latest_events_by_state 
       (state, device_id, timestamp, value)
VALUES ('off', 11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 02:22:22', 'event 1-2');
DELETE FROM latest_events_by_state 
WHERE state = 'on' AND
      device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 1-3
INSERT INTO latest_events_by_state 
       (state, device_id, timestamp, value)
VALUES ('on', 11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 03:33:33', 'event 1-3');
DELETE FROM latest_events_by_state 
WHERE state = 'off' AND
      device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 2-1
INSERT INTO latest_events_by_state 
       (state, device_id, timestamp, value)
VALUES ('off', 22222222-aaaa-bbbb-cccc-12345678abcd, 
        '2021-02-02 01:11:11', 'event 2-1');
DELETE FROM latest_events_by_state 
WHERE state = 'on' AND
      device_id = 22222222-aaaa-bbbb-cccc-12345678abcd;
-- Event 3-1
INSERT INTO latest_events_by_state 
       (state, device_id, timestamp, value)
VALUES ('off', 33333333-aaaa-bbbb-cccc-12345678abcd, 
        '2021-03-03 01:11:11', 'event 3-1');
DELETE FROM latest_events_by_state 
WHERE state = 'on' AND
      device_id = 33333333-aaaa-bbbb-cccc-12345678abcd;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Queries&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find all the latest events with state 'on'
SELECT state, device_id, timestamp, value
FROM   latest_events_by_state
WHERE  state = 'on';

 state | device_id                            | timestamp                       | value
-------+--------------------------------------+---------------------------------+-----------
    on | 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 | event 1-3

-- Find all the latest events with state 'off'
SELECT state, device_id, timestamp, value
FROM   latest_events_by_state
WHERE  state = 'off';

 state | device_id                            | timestamp                       | value
-------+--------------------------------------+---------------------------------+-----------
   off | 22222222-aaaa-bbbb-cccc-12345678abcd | 2021-02-02 01:11:11.000000+0000 | event 2-1
   off | 33333333-aaaa-bbbb-cccc-12345678abcd | 2021-03-03 01:11:11.000000+0000 | event 3-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state-partitioned table solution has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicability: state-based queries return 100K rows / 100MBs of data or less.&lt;/li&gt;
&lt;li&gt;Pros: excellent performance.&lt;/li&gt;
&lt;li&gt;Cons: additional deletes are required to maintain the table; measures to prevent &lt;a href="https://medium.com/de-bijenkorf-techblog/experiences-with-tombstones-in-apache-cassandra-7302092e7423" rel="noopener noreferrer"&gt;tombstone-related problems&lt;/a&gt; may be necessary; data distribution may become skewed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these three cons should be considered a serious obstacle in most circumstances. Additional deletes are equivalent to additional writes, and Cassandra can readily scale to handle more writes. Given that inserts and deletes are applied to the same rows again and again, tombstones are likely to get &lt;a href="https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlHowDataWritten.html" rel="noopener noreferrer"&gt;resolved in a MemTable rather than in SSTables&lt;/a&gt;, which can greatly reduce the overall number of tombstones. For example, for a given IoT device, even frequent status updates that all hit the same MemTable result in at most one tombstone. We still recommend &lt;a href="https://medium.com/de-bijenkorf-techblog/experiences-with-tombstones-in-apache-cassandra-7302092e7423" rel="noopener noreferrer"&gt;monitoring table metrics&lt;/a&gt; to stay on top of any potential problems. Last but not least, data distribution depends on data and application characteristics. We take full control of data distribution in the last solution in this post.&lt;/p&gt;

&lt;p&gt;We can easily support multiple tenants by changing the table primary key to PRIMARY KEY((tenant, state), device_id). Multi-tenancy may also help improve data distribution. Overall, in terms of performance, this solution should be comparable to the materialized view solution.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Solution 4: Multiple tables&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The fourth solution features a separate table for each state. Every insert to table latest_on_events_by_device must be accompanied by a delete from table latest_off_events_by_device, and vice versa. This is to ensure that the latest event always cancels out any outdated event with a different state for the same device. The state-based queries over the tables can become very expensive as they have to scan all partitions in the tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Latest 'on' events by device
CREATE TABLE latest_on_events_by_device (
    device_id  UUID,
    timestamp  TIMESTAMP,
    value      TEXT,
    PRIMARY KEY((device_id))
);

-- Latest 'off' events by device
CREATE TABLE latest_off_events_by_device (
    device_id  UUID,
    timestamp  TIMESTAMP,
    value      TEXT,
    PRIMARY KEY((device_id))
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Event 1-1
INSERT INTO latest_on_events_by_device 
       (device_id, timestamp, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 01:11:11', 'event 1-1');
DELETE FROM latest_off_events_by_device 
WHERE device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 1-2
INSERT INTO latest_off_events_by_device 
       (device_id, timestamp, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 02:22:22', 'event 1-2');
DELETE FROM latest_on_events_by_device 
WHERE device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 1-3
INSERT INTO latest_on_events_by_device 
       (device_id, timestamp, value)
VALUES (11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 03:33:33', 'event 1-3');
DELETE FROM latest_off_events_by_device 
WHERE device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 2-1
INSERT INTO latest_off_events_by_device 
       (device_id, timestamp, value)
VALUES (22222222-aaaa-bbbb-cccc-12345678abcd, 
        '2021-02-02 01:11:11', 'event 2-1');
DELETE FROM latest_on_events_by_device 
WHERE device_id = 22222222-aaaa-bbbb-cccc-12345678abcd;
-- Event 3-1
INSERT INTO latest_off_events_by_device 
       (device_id, timestamp, value)
VALUES (33333333-aaaa-bbbb-cccc-12345678abcd, 
        '2021-03-03 01:11:11', 'event 3-1');
DELETE FROM latest_on_events_by_device 
WHERE device_id = 33333333-aaaa-bbbb-cccc-12345678abcd;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Queries&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find all the latest events with state 'on'
SELECT device_id, timestamp, value
FROM   latest_on_events_by_device;

 device_id                            | timestamp                       | value
--------------------------------------+---------------------------------+-----------
 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 | event 1-3

-- Find all the latest events with state 'off'
SELECT device_id, timestamp, value
FROM   latest_off_events_by_device;

 device_id                            | timestamp                       | value
--------------------------------------+---------------------------------+-----------
 33333333-aaaa-bbbb-cccc-12345678abcd | 2021-03-03 01:11:11.000000+0000 | event 3-1
 22222222-aaaa-bbbb-cccc-12345678abcd | 2021-02-02 01:11:11.000000+0000 | event 2-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The multi-table solution has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicability: state-based queries return 100K rows (on the order of 100 MB of data) or more, and are executed infrequently.&lt;/li&gt;
&lt;li&gt;Pros: may better distribute the query workload across nodes in a cluster when retrieving a large result set.&lt;/li&gt;
&lt;li&gt;Cons: performance may become unsatisfactory for real-time applications; additional deletes are required to maintain the tables; measures to prevent tombstone-related problems may be necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This solution is on par with the secondary index solution in terms of query performance. Multiple tenants can be supported by changing the table primary keys to PRIMARY KEY((tenant, device_id)) or PRIMARY KEY((tenant), device_id). While we do not recommend this solution in practice, what is really interesting about this data model is how it sets the stage for the customizable partitioning discussed next.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution 5: Customizable partitioning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our final solution builds on the idea of using a separate table for each state. This time, however, we partition the tables using artificial buckets. A bucket value is readily computed from a device's UUID by the user-defined function hash. In this example, the function extracts the first three digits of the UUID literal, interprets them as a hexadecimal number, and returns that number modulo 3. There can therefore be at most three buckets or partitions per table, with values 0, 1, or 2. It is just a coincidence that all our device identifiers map to bucket 0 in this example. Since &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/advanced-data-types" rel="noopener noreferrer"&gt;Version 4 UUIDs&lt;/a&gt; are randomly generated, for a large number of events, data should be more or less uniformly distributed among the three buckets.&lt;/p&gt;
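&lt;p&gt;As an illustration (this Python helper is not part of the data model; in the database the computation is done by the CQL user-defined function hash), the bucketing logic can be sketched as:&lt;/p&gt;

```python
def bucket(device_id: str, buckets: int = 3) -> int:
    """Mirror the CQL user-defined function 'hash': parse the first
    three hex digits of the UUID literal and reduce modulo 3."""
    return int(device_id[:3], 16) % buckets

# The three example device UUIDs all happen to land in bucket 0:
for uid in ("11111111-aaaa-bbbb-cccc-12345678abcd",
            "22222222-aaaa-bbbb-cccc-12345678abcd",
            "33333333-aaaa-bbbb-cccc-12345678abcd"):
    print(uid[:8], "->", bucket(uid))  # each prints "... -> 0"
```

&lt;p&gt;For instance, 0x111 is 273 in decimal, and 273 modulo 3 is 0, which is exactly the bucket value used in the sample data below.&lt;/p&gt;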

&lt;p&gt;As in the previous data model, every insert into table latest_on_events_by_bucket must be accompanied by a delete from table latest_off_events_by_bucket, and vice versa. Performance of the state-based queries depends on partitioning, and partitioning is customizable.&lt;/p&gt;
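&lt;p&gt;To make the paired insert/delete bookkeeping concrete, here is a minimal in-memory sketch (plain Python dictionaries standing in for the two tables; no Cassandra driver involved) of how an application-side write path moves a device between the two state tables:&lt;/p&gt;

```python
# Stand-ins for latest_on_events_by_bucket / latest_off_events_by_bucket,
# keyed by device_id for simplicity (bucketing omitted).
tables = {"on": {}, "off": {}}

def record_event(device_id, state, timestamp, value):
    """Insert the event into the table for its state and delete the
    device's row from the opposite-state table, mirroring the paired
    INSERT/DELETE statements of the data model."""
    other = "off" if state == "on" else "on"
    tables[state][device_id] = (timestamp, value)
    tables[other].pop(device_id, None)  # the accompanying delete

device = "11111111-aaaa-bbbb-cccc-12345678abcd"
record_event(device, "on", "2021-01-01 01:11:11", "event 1-1")
record_event(device, "off", "2021-01-01 02:22:22", "event 1-2")
# The device now appears only in the 'off' table.
```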

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Custom hash function
CREATE FUNCTION hash(id UUID) 
RETURNS NULL ON NULL INPUT 
RETURNS INT 
LANGUAGE java AS 
'return Integer.parseInt(id.toString().substring(0,3),16) % 3;';

-- Latest 'on' events by bucket
CREATE TABLE latest_on_events_by_bucket (
    bucket     INT,
    device_id  UUID,
    timestamp  TIMESTAMP,
    value      TEXT,
    PRIMARY KEY((bucket), device_id)
);

-- Latest 'off' events by bucket
CREATE TABLE latest_off_events_by_bucket (
    bucket     INT,
    device_id  UUID,
    timestamp  TIMESTAMP,
    value      TEXT,
    PRIMARY KEY((bucket), device_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Event 1-1
INSERT INTO latest_on_events_by_bucket 
       (bucket, device_id, timestamp, value)
VALUES (hash(11111111-aaaa-bbbb-cccc-12345678abcd), 
        11111111-aaaa-bbbb-cccc-12345678abcd,
        '2021-01-01 01:11:11', 'event 1-1');
DELETE FROM latest_off_events_by_bucket 
WHERE bucket = hash(11111111-aaaa-bbbb-cccc-12345678abcd) AND 
      device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 1-2
INSERT INTO latest_off_events_by_bucket 
       (bucket, device_id, timestamp, value)
VALUES (hash(11111111-aaaa-bbbb-cccc-12345678abcd),
        11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 02:22:22', 'event 1-2');
DELETE FROM latest_on_events_by_bucket 
WHERE bucket = hash(11111111-aaaa-bbbb-cccc-12345678abcd) AND 
      device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 1-3
INSERT INTO latest_on_events_by_bucket 
       (bucket, device_id, timestamp, value)
VALUES (hash(11111111-aaaa-bbbb-cccc-12345678abcd),
        11111111-aaaa-bbbb-cccc-12345678abcd, 
        '2021-01-01 03:33:33', 'event 1-3');
DELETE FROM latest_off_events_by_bucket 
WHERE bucket = hash(11111111-aaaa-bbbb-cccc-12345678abcd) AND 
      device_id = 11111111-aaaa-bbbb-cccc-12345678abcd;
-- Event 2-1
INSERT INTO latest_off_events_by_bucket 
       (bucket, device_id, timestamp, value)
VALUES (hash(22222222-aaaa-bbbb-cccc-12345678abcd),
        22222222-aaaa-bbbb-cccc-12345678abcd, 
        '2021-02-02 01:11:11', 'event 2-1');
DELETE FROM latest_on_events_by_bucket 
WHERE bucket = hash(22222222-aaaa-bbbb-cccc-12345678abcd) AND 
      device_id = 22222222-aaaa-bbbb-cccc-12345678abcd;
-- Event 3-1
INSERT INTO latest_off_events_by_bucket 
       (bucket, device_id, timestamp, value)
VALUES (hash(33333333-aaaa-bbbb-cccc-12345678abcd),
        33333333-aaaa-bbbb-cccc-12345678abcd, 
        '2021-03-03 01:11:11', 'event 3-1');
DELETE FROM latest_on_events_by_bucket 
WHERE bucket = hash(33333333-aaaa-bbbb-cccc-12345678abcd) AND 
      device_id = 33333333-aaaa-bbbb-cccc-12345678abcd;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Queries&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find all the latest events with state 'on'
SELECT bucket, device_id, timestamp, value
FROM   latest_on_events_by_bucket
WHERE  bucket IN (0,1,2);

 bucket | device_id                            | timestamp                       | value
--------+--------------------------------------+---------------------------------+-----------
      0 | 11111111-aaaa-bbbb-cccc-12345678abcd | 2021-01-01 03:33:33.000000+0000 | event 1-3

-- Find all the latest events with state 'off'
SELECT bucket, device_id, timestamp, value
FROM   latest_off_events_by_bucket
WHERE  bucket IN (0,1,2);

 bucket | device_id                            | timestamp                       | value
--------+--------------------------------------+---------------------------------+-----------
      0 | 22222222-aaaa-bbbb-cccc-12345678abcd | 2021-02-02 01:11:11.000000+0000 | event 2-1
      0 | 33333333-aaaa-bbbb-cccc-12345678abcd | 2021-03-03 01:11:11.000000+0000 | event 3-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The customizable partitioning solution has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicability: can meet different requirements when customized.&lt;/li&gt;
&lt;li&gt;Pros: flexibility; performance can be optimized by customizing partitioning.&lt;/li&gt;
&lt;li&gt;Cons: a good partitioning function must be supplied; additional deletes are required to maintain the tables; measures to &lt;a href="https://medium.com/de-bijenkorf-techblog/experiences-with-tombstones-in-apache-cassandra-7302092e7423" rel="noopener noreferrer"&gt;prevent tombstone-related problems&lt;/a&gt; may be necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having to choose a partitioning function is a good problem to have. While it adds a bit of complexity, it also gives full control over data partitioning and query performance. A good partitioning function depends on the specific data and application requirements, and finding one may take some experience and experimentation. For example, retrieving 100 rows from 1 partition is generally faster than retrieving 100 rows from 10 partitions, but retrieving 1,000,000 rows from 1 partition is usually slower than retrieving 1,000,000 rows from 10 partitions. As for the additional deletes, they are simply additional writes, and Cassandra readily scales to handle more writes.&lt;/p&gt;

&lt;p&gt;Given that inserts and deletes are applied to the same rows again and again, tombstones are likely to get &lt;a href="https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlHowDataWritten.html" rel="noopener noreferrer"&gt;resolved in a MemTable rather than in SSTables&lt;/a&gt;, which can greatly reduce the overall number of tombstones. For example, for a given IoT device, even frequent status updates that all hit the same MemTable can only result in one tombstone. We still recommend &lt;a href="https://medium.com/de-bijenkorf-techblog/experiences-with-tombstones-in-apache-cassandra-7302092e7423" rel="noopener noreferrer"&gt;monitoring table metrics&lt;/a&gt; to stay on top of any potential problems. Last but not least, whereas data distribution previously depended on data and application characteristics, this solution puts it fully under our control.&lt;/p&gt;

&lt;p&gt;This data model provides ultimate flexibility. Multi-tenancy is achievable by changing the primary key of each table to PRIMARY KEY((tenant, bucket), device_id). More importantly, the partitioning function can be changed to increase or decrease the number of partitions: a query that retrieves a smaller result set should access fewer partitions for better performance, while a query that retrieves a larger result set should access more partitions for better workload distribution. It is even possible to use different functions for different states and tenants; better partitioning translates directly into better performance.&lt;/p&gt;
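&lt;p&gt;As a sketch of this customizability (a hypothetical application-side helper, not part of the schema above), the number of partitions becomes a tuning knob when the bucketing function is parameterized, e.g. per state or per tenant:&lt;/p&gt;

```python
import uuid

def make_bucket_fn(num_buckets: int):
    """Return a bucketing function over num_buckets partitions.
    Fewer buckets favor small result sets (fewer partitions to read);
    more buckets favor large result sets (better workload spread)."""
    def bucket(device_id: uuid.UUID) -> int:
        return device_id.int % num_buckets
    return bucket

# For example: few partitions for one state, more for another.
bucket_on = make_bucket_fn(3)
bucket_off = make_bucket_fn(12)

# Random Version 4 UUIDs spread roughly uniformly over the buckets.
counts = [0] * 12
for _ in range(10_000):
    counts[bucket_off(uuid.uuid4())] += 1
```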

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We defined the problem of managing the latest IoT events based on a state, identified its challenges, and described how it can be solved in Apache Cassandra using five different data models. For each data model, we stated its applicability, pros, and cons. Our final recommendation is to focus on the materialized view, state-partitioned table, and customizable partitioning data models. Choose the first two for their simplicity and ease of use. Consider customizable partitioning for its ultimate flexibility when the other options are exhausted. Finally, be open to exploring new solutions that may push some computation to the application or rely on specialized search indexes and other technologies.&lt;/p&gt;

&lt;p&gt;It is worth mentioning that this blog post was motivated by questions from Apache Cassandra community members on Discord. Join the &lt;a href="https://discord.com/invite/pPjPcZN" rel="noopener noreferrer"&gt;Fellowship of the (Cassandra) Rings&lt;/a&gt; today and connect with fellow community members and experts!&lt;/p&gt;

&lt;p&gt;Learn more about Cassandra, data modeling and IoT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cassandra-as-a-Service &lt;a href="https://astra.dev/3PJ6GQQ" rel="noopener noreferrer"&gt;DataStax Astra&lt;/a&gt; – experience serverless Cassandra for free&lt;/li&gt;
&lt;li&gt;Short course &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals" rel="noopener noreferrer"&gt;Cassandra Fundamentals&lt;/a&gt; – learn Cassandra Query Language (CQL)&lt;/li&gt;
&lt;li&gt;Learning series &lt;a href="https://www.datastax.com/learn/data-modeling-by-example" rel="noopener noreferrer"&gt;Data Modeling by Example&lt;/a&gt; – study Cassandra data modeling in depth&lt;/li&gt;
&lt;li&gt;Workshops &lt;a href="https://www.datastax.com/workshops" rel="noopener noreferrer"&gt;Developer Workshops&lt;/a&gt; – attend free data modeling workshops and more&lt;/li&gt;
&lt;li&gt;Whitepaper &lt;a href="https://www.datastax.com/resources/whitepaper/data-modeling-apache-cassandra" rel="noopener noreferrer"&gt;Data Modeling in Apache Cassandra&lt;/a&gt; – get an overview of the best data modeling practices&lt;/li&gt;
&lt;li&gt;More on IoT and time series &lt;a href="https://www.datastax.com/resources/video/datastax-accelerate-2019-designing-time-series-applications-scale-datastax" rel="noopener noreferrer"&gt;Fundamentals&lt;/a&gt; | &lt;a href="https://www.datastax.com/learn/data-modeling-by-example/sensor-data-model" rel="noopener noreferrer"&gt;IoT data model&lt;/a&gt; | &lt;a href="https://www.datastax.com/learn/data-modeling-by-example/time-series-model" rel="noopener noreferrer"&gt;Time series data model&lt;/a&gt; | &lt;a href="https://www.youtube.com/watch?v=nF502PmFi_w" rel="noopener noreferrer"&gt;Streaming&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Data Modeling in Cassandra and Astra DB</title>
      <dc:creator>Artem Chebotko</dc:creator>
      <pubDate>Thu, 16 Jun 2022 19:21:29 +0000</pubDate>
      <link>https://dev.to/datastax/data-modeling-in-cassandra-and-astra-db-18do</link>
      <guid>https://dev.to/datastax/data-modeling-in-cassandra-and-astra-db-18do</guid>
      <description>&lt;p&gt;What does it take to build an efficient and sound data model for &lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt;® and &lt;a href="https://astra.dev/3zmzedJ" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;? Where would one start? Are there any data modeling rules to follow? Can it be done consistently time and time again? The answers to these and many other questions can be found in the Cassandra data modeling methodology.&lt;/p&gt;

&lt;p&gt;In this post, we present a high-level overview of the data modeling methodology for Cassandra and Astra DB, and share over half a dozen complete data modeling examples from various real-life domains. We apply the methodology to create Cassandra and Astra DB data models for IoT, messaging data, digital library, investment portfolio, time series, shopping cart, and order management. We even provide our datasets and queries for you to try.&lt;/p&gt;

&lt;p&gt;As a side note, if you are new to Cassandra or if the terms &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/tables-single-row-partitions" rel="noopener noreferrer"&gt;single-row partitions&lt;/a&gt; and &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/tables-multi-row-partitions" rel="noopener noreferrer"&gt;multi-row partitions&lt;/a&gt; sound unfamiliar, we recommend taking a closer look at &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals" rel="noopener noreferrer"&gt;Cassandra Fundamentals&lt;/a&gt; before deep diving into data modeling.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data modeling and the methodology
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Data modeling&lt;/em&gt; is a process that involves many activities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collecting and analyzing data requirements&lt;/li&gt;
&lt;li&gt;Understanding domain entities and relationships&lt;/li&gt;
&lt;li&gt;Identifying data access patterns&lt;/li&gt;
&lt;li&gt;Organizing and structuring data in a particular way&lt;/li&gt;
&lt;li&gt;Designing and specifying a database schema&lt;/li&gt;
&lt;li&gt;Optimizing schema and data indexing techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data modeling can have a profound effect on data quality and data access. For data quality, think about data completeness, consistency, and accuracy. With respect to data access, think about queryability, efficiency, and scalability. An efficient and sound data model is crucial for both data and applications.&lt;/p&gt;

&lt;p&gt;Our &lt;em&gt;methodology&lt;/em&gt; defines how the data modeling process can be carried out in a well-organized and repeatable fashion. In particular, the Cassandra data modeling methodology is based on four objectives, four models, and two transitions, along with specific modeling, visualization, mapping, and optimization techniques and methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iotz66rpqtnl88os1x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iotz66rpqtnl88os1x3.png" alt="Image description" width="700" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1: Cassandra data modeling methodology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four objectives
&lt;/h2&gt;

&lt;p&gt;The Cassandra data modeling process, when discussed at a high level, can be distilled into these four key objectives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand the data:&lt;/strong&gt; Whether starting from scratch or dealing with an existing dataset, do you understand the data that needs to be managed? Things like entities, relationships, and key constraints come to mind.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify data access patterns:&lt;/strong&gt; Do you have a good idea of what a data-driven application should be able to do? Think of tasks (or microservices) and their required data access patterns, execution sequences and workflows, and how data retrieved in one task is used by the next one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply the query-first approach:&lt;/strong&gt; Do you know how to design Cassandra tables to support specific queries? It is called a query-first or query-driven approach because designing table schemas depends on both data and queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize and implement:&lt;/strong&gt; How do you verify that both database tables and application queries are efficient and scalable? For example, large partitions and queries that access many partitions may require additional optimizations.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Four models
&lt;/h2&gt;

&lt;p&gt;The four models directly correspond to the four objectives and are meant to make the process more concrete, manageable, repeatable, documentable, collaborative, and shareable. They are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual data model:&lt;/strong&gt; A technology-independent, high-level view of data. Its purpose is to understand the data in a particular domain. While there are a number of conceptual data modeling techniques, we use the &lt;em&gt;Entity-Relationship Model&lt;/em&gt; and &lt;em&gt;Entity-Relationship Diagrams&lt;/em&gt; in &lt;em&gt;Chen’s Notation&lt;/em&gt; to document entity types, relationship types, attribute types, and cardinality and key constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application workflow model:&lt;/strong&gt; A technology-independent, high-level view of a data-driven application, consisting of application tasks, execution dependencies, and access patterns. Its purpose is to identify data access patterns and how they may be executed in sequences. These include queries, inserts, updates, and deletes required by different data-driven tasks. We use simple graph-like diagrams to represent application workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical data model:&lt;/strong&gt; A Cassandra-specific data model featuring tables, materialized views, secondary indexes, user-defined types, and other database schema constructs. It is derived from a conceptual data model by organizing data into Cassandra-specific data structures based on data access patterns identified by an application workflow. This is where the query-first approach is applied. Logical data models can be conveniently captured and visualized using &lt;em&gt;Chebotko Diagrams&lt;/em&gt; that can feature tables, materialized views, indexes, and so forth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical data model:&lt;/strong&gt; A Cassandra-specific data model that is directly derived from a logical data model by analyzing and optimizing for performance. Physical data models can be conveniently captured and visualized using &lt;em&gt;Chebotko Diagrams&lt;/em&gt; and implemented in Cassandra using CQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Two transitions
&lt;/h2&gt;

&lt;p&gt;To complete the picture, the methodology must define the transitions between the models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping a conceptual data model and an application workflow model to a logical data model&lt;/li&gt;
&lt;li&gt;Optimizing a logical data model to produce a physical data model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many aspects, the transitions are the most interesting and profound components of the methodology. To carry out the first transition, the methodology defines &lt;em&gt;mapping rules&lt;/em&gt; and &lt;em&gt;mapping patterns&lt;/em&gt;. For the second transition, some common &lt;em&gt;optimization techniques&lt;/em&gt; include splitting and merging partitions, data indexing, data aggregation, and concurrent data access optimizations.&lt;/p&gt;

&lt;p&gt;You can find more information about the Cassandra data modeling methodology in the &lt;a href="https://www.dropbox.com/s/4bu0dy0ayrqygei/cassandra-data-modeling-methodology-paper.pdf" rel="noopener noreferrer"&gt;original paper&lt;/a&gt;, &lt;a href="https://www.dropbox.com/s/3cul3hqzr84bark/cassandra-data-modeling-methodology-presentation.pdf" rel="noopener noreferrer"&gt;conference presentation&lt;/a&gt;, or &lt;a href="https://auth.cloud.datastax.com/auth/realms/CloudUsers/protocol/saml/clients/absorb" rel="noopener noreferrer"&gt;DataStax Academy video course DS220&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data modeling in action
&lt;/h1&gt;

&lt;p&gt;One of the best ways to become skilled in data modeling is to explore concrete examples. We maintain &lt;a href="https://www.datastax.com/learn/data-modeling-by-example" rel="noopener noreferrer"&gt;this growing collection of data modeling examples&lt;/a&gt; from various domains to help you get started with Cassandra and Astra DB data modeling. Each example applies the Cassandra data modeling methodology to produce and visualize four important artifacts: conceptual data model, application workflow model, logical data model, and physical data model.&lt;/p&gt;

&lt;p&gt;Moreover, each example has a hands-on portion with practice questions and solutions. The hands-on scenarios make it straightforward to implement a data model in Cassandra, express data access patterns as CQL queries, and run the queries against our sample datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dq2zb9dgcf5o7qwayia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dq2zb9dgcf5o7qwayia.png" alt="Image description" width="700" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 2: Example hands-on scenario with schema, data, and queries.&lt;/p&gt;

&lt;p&gt;Go ahead and explore these data models, and execute real queries against them in your browser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/sensor-data-model" rel="noopener noreferrer"&gt;Sensor data model&lt;/a&gt;: Modeling sensor networks, sensors, and temperature measurements. The resulting database schema has four tables supporting four data access patterns.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/messaging-data-model" rel="noopener noreferrer"&gt;Messaging data model&lt;/a&gt;: Modeling users, email folders, emails, and email attachments. The resulting database schema has five tables supporting four data access patterns.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/digital-library-data-model" rel="noopener noreferrer"&gt;Digital library data model&lt;/a&gt;: Modeling performers, albums, album tracks, and users. The resulting database schema has eight tables supporting nine data access patterns.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/investment-data-model" rel="noopener noreferrer"&gt;Investment portfolio data model&lt;/a&gt;: Modeling users, investment accounts, trades, and trading instruments. The resulting database schema has six tables supporting seven data access patterns.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/time-series-model" rel="noopener noreferrer"&gt;Time series data model&lt;/a&gt;: Modeling IoT data sources, groups of related sources, metrics, data points, and time series with higher or lower resolution. The resulting database schema has seven tables supporting seven data access patterns.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/shopping-cart" rel="noopener noreferrer"&gt;Shopping cart data model&lt;/a&gt;: Modeling users, items, and shopping carts. The resulting database schema has three tables and one materialized view supporting seven data access patterns, including updates that use batches and lightweight transactions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.datastax.com/learn/data-modeling-by-example/order-management" rel="noopener noreferrer"&gt;Order management data model&lt;/a&gt;: Modeling users, payment methods, addresses, items, shopping carts, orders, delivery options, and order statuses. The resulting database schema has four tables supporting five data access patterns, including multi-step updates that use lightweight transactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Data modeling and Astra DB
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://astra.dev/3zmzedJ" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; is a cloud database service built on Apache Cassandra. It is a serverless and multi-region service that works in AWS, Azure and GCP. If you haven’t already, you should take advantage of &lt;a href="https://astra.dev/3zmzedJ" rel="noopener noreferrer"&gt;Astra DB’s free tier&lt;/a&gt; to create your own fully managed Cassandra database in the cloud. After all, how many multi-cloud, multi-region, serverless databases built on open-source technologies do you know? Astra DB is the first one.&lt;/p&gt;

&lt;p&gt;Astra DB databases are Cassandra databases. The same data modeling methodology applies and the above example data models can be instantiated in Astra DB. However, there are a few minor differences that you may want to be aware of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Astra DB does not support materialized views. &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/materialized-views" rel="noopener noreferrer"&gt;Materialized views&lt;/a&gt; are experimental in Cassandra and the use of regular tables is usually recommended instead.&lt;/li&gt;
&lt;li&gt;Astra DB does not support user-defined functions. Strictly speaking, user-defined functions are not data modeling constructs. They usually can be readily replaced with computation outside of a database.&lt;/li&gt;
&lt;li&gt;Astra DB supports &lt;em&gt;Storage-Attached Indexing&lt;/em&gt; or SAI. &lt;a href="https://www.datastax.com/dev/cassandra-indexing" rel="noopener noreferrer"&gt;Storage-attached indexes&lt;/a&gt; in Astra DB are secondary indexes with better performance, space efficiency, and more capabilities than &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/secondary-indexes" rel="noopener noreferrer"&gt;regular secondary indexes or experimental SASI&lt;/a&gt; in Cassandra. With that said, it is important to understand that SAI and other secondary indexes still have the same &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals/secondary-indexes" rel="noopener noreferrer"&gt;use cases and limitations&lt;/a&gt;, and should be used with caution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Astra DB and Cassandra differences with respect to materialized views, user-defined functions, and secondary indexes should not have any profound effect on data modeling.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data modeling and K8ssandra
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt; is a cloud-native distribution of Cassandra that runs on Kubernetes. Besides Cassandra, the distribution also includes several integrated components that enable richer data APIs, and provide better automation for observability, metrics monitoring, backup and restore, and data anti-entropy services.&lt;/p&gt;

&lt;p&gt;K8ssandra is open-source, free to use, and data modeling in K8ssandra is identical to data modeling in Cassandra.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data modeling and Stargate
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://stargate.io/" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt; is an open-source data gateway deployed between applications and a database. It supports different API options for an application to interact with Cassandra, Astra DB, and K8ssandra. Stargate’s API extensions include CQL, REST, GraphQL, and Document APIs.&lt;/p&gt;

&lt;p&gt;The use of CQL, REST, and GraphQL APIs has no effect on data modeling: the same data modeling methodology applies.&lt;/p&gt;

&lt;p&gt;The use of the Document API, however, has a significant impact on data modeling. With the Document API, the focus shifts from organizing data as rows, columns, and partitions to structuring data as JSON documents. Stargate then uses a &lt;a href="https://stargate.io/2020/10/19/the-stargate-cassandra-documents-api.html" rel="noopener noreferrer"&gt;predefined mapping&lt;/a&gt; to shred JSON documents and store them as rows in Cassandra tables. Data modeling for document databases is beyond the scope of this article.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Data modeling in Cassandra and Astra DB is a very important topic and we just scratched the surface in this post. We presented a high-level overview of the Cassandra data modeling methodology and urged you to sharpen your skills by exploring the &lt;a href="https://www.datastax.com/learn/data-modeling-by-example" rel="noopener noreferrer"&gt;data modeling examples&lt;/a&gt;. We also established that data modeling in Cassandra, Astra DB, and K8ssandra is practically identical, with &lt;a href="https://astra.dev/3zmzedJ" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt; having the significant advantage of being serverless and fully managed. Finally, we briefly discussed how Stargate APIs — namely CQL, REST, GraphQL, and Document APIs — can affect data modeling.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Explore &lt;a href="https://auth.cloud.datastax.com/auth/realms/CloudUsers/protocol/saml/clients/absorb" rel="noopener noreferrer"&gt;DataStax Academy&lt;/a&gt; to get certified in Apache Cassandra with hands-on courses. You can also &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSfEtzzVauuFpFJWUiepYndqchBpNsaOwm6raPJDsMt9nTvMbw/viewform" rel="noopener noreferrer"&gt;subscribe to our event alert&lt;/a&gt; to get notified about our latest developer workshops. Lastly, follow &lt;a href="https://datastax.medium.com/" rel="noopener noreferrer"&gt;DataStax on Medium&lt;/a&gt; for exclusive posts on all things Cassandra, streaming, Kubernetes, and more.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://astra.dev/3zmzedJ" rel="noopener noreferrer"&gt;Astra DB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k8ssandra.io/" rel="noopener noreferrer"&gt;K8ssandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stargate.io/" rel="noopener noreferrer"&gt;Stargate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datastax.com/learn/cassandra-fundamentals" rel="noopener noreferrer"&gt;Cassandra Fundamentals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datastax.com/learn/data-modeling-by-example" rel="noopener noreferrer"&gt;Data Modeling by Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dropbox.com/s/4bu0dy0ayrqygei/cassandra-data-modeling-methodology-paper.pdf" rel="noopener noreferrer"&gt;A Big Data Modeling Methodology for Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://auth.cloud.datastax.com/auth/realms/CloudUsers/protocol/saml/clients/absorb" rel="noopener noreferrer"&gt;DataStax Academy DS220: Data Modeling with Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dropbox.com/s/3cul3hqzr84bark/cassandra-data-modeling-methodology-presentation.pdf" rel="noopener noreferrer"&gt;Using the Chebotko Method to Design Sound and Scalable Data Models for Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Optimizing Distributed Joins: The Case of Google Cloud Spanner and DataStax Astra DB</title>
      <dc:creator>Artem Chebotko</dc:creator>
      <pubDate>Tue, 31 May 2022 16:25:22 +0000</pubDate>
      <link>https://dev.to/datastax/optimizing-distributed-joins-the-case-of-google-cloud-spanner-and-datastax-astra-db-2980</link>
      <guid>https://dev.to/datastax/optimizing-distributed-joins-the-case-of-google-cloud-spanner-and-datastax-astra-db-2980</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal54av28a2rmj1e8a4pd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal54av28a2rmj1e8a4pd.jpeg" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this post, learn how relational and NoSQL databases, Google Cloud Spanner and DataStax Astra DB, optimize distributed joins for real-time applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Distributed joins are commonly considered to be too expensive to use for real-time transaction processing. That is because, besides joining data, they also frequently require moving or shuffling data between nodes in a cluster, which can significantly affect query response times and database throughput. However, there are certain optimizations that can completely eliminate the need to move data to enable faster joins. In this article, we first review the four types of distributed joins: shuffle join, broadcast join, co-located join, and pre-computed join. We then demonstrate how two leading fully managed relational and NoSQL databases, namely &lt;a href="https://cloud.google.com/spanner" rel="noopener noreferrer"&gt;Google Cloud Spanner&lt;/a&gt; and &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, support optimized joins that are suitable for real-time applications.&lt;/p&gt;

&lt;h1&gt;
  
  
  Four types of distributed joins
&lt;/h1&gt;

&lt;p&gt;Joins are used in databases to combine related data from one or more tables or datasets. Data is usually combined based on some condition that relates columns from participating tables. Hereafter, we refer to columns used in a join condition as &lt;em&gt;join keys&lt;/em&gt; and assume that they are always related by &lt;em&gt;equality&lt;/em&gt; operators.&lt;/p&gt;

&lt;p&gt;Distributed joins are joins in distributed databases, where data from each table is partitioned into smaller chunks — usually called &lt;em&gt;partitions&lt;/em&gt; — that are stored on different nodes in a cluster. While distributing data helps with managing large datasets, it also makes joins harder to perform and scale because table rows that can be joined may reside in different partitions on different nodes.&lt;/p&gt;

&lt;p&gt;A distributed join can be represented as a three-step process. The first step is to move data between nodes in the cluster, such that rows that can potentially be combined based on a join condition end up on the same nodes. Data movement is usually achieved by shuffling or broadcasting data. The second step is to compute a join result locally on each node. This usually involves one of the fundamental join algorithms, such as a nested-loop, sort-merge, or hash join algorithm. The last step is to merge or union local join results and return the final result. In many cases, it is possible to optimize a distributed join by eliminating one or even two steps from this process.&lt;/p&gt;

&lt;p&gt;Consider the four types of distributed joins illustrated in Figure 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;shuffle join&lt;/em&gt; re-distributes rows from both tables among nodes based on join key values, such that all rows with the same join key value are moved to the same node. Depending on a particular algorithm used to compute joins, a shuffle join can be a shuffle hash join, shuffle sort-merge join, and so forth.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;broadcast join&lt;/em&gt; moves data stored in only one table, such that all rows from the smaller table are available on every node. Depending on a particular algorithm used to compute joins, a broadcast join can be a broadcast hash join, broadcast nested-loop join, and so forth.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;co-located join&lt;/em&gt; does not need to move data at all because data is already stored such that all rows with the same join key value reside on the same node. Data still needs to be joined using a nested-loop, sort-merge, or hash join algorithm.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;pre-computed join&lt;/em&gt; does not need to move data or compute joins locally on each node because data is already stored in a joined form. This type of join skips data movement and join computation and goes directly to merging and returning results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfs4sw15yl70w3q9z621.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfs4sw15yl70w3q9z621.png" alt="Image description" width="700" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1: Four Types of Distributed Joins&lt;/p&gt;

&lt;p&gt;Shuffle and broadcast joins are more suitable for batch or near real-time analytics. For example, they are used in &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; as the main join strategies. Co-located and pre-computed joins are faster and can be used for online transaction processing with real-time applications. They frequently rely on organizing data based on unique storage schemes supported by a database.&lt;/p&gt;

&lt;p&gt;In the rest of this article, our focus is on co-located and pre-computed joins, and how they can be used in representative cloud-native relational and NoSQL databases. For co-located joins, we choose &lt;a href="https://cloud.google.com/spanner" rel="noopener noreferrer"&gt;Google Cloud Spanner&lt;/a&gt;, which is a fully managed relational database service. For pre-computed joins, we use &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, which is a serverless NoSQL database service. Both database services can be tried for free if you prefer to follow our examples.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running example: Managing users, accounts, and transactions
&lt;/h1&gt;

&lt;p&gt;Let’s define a running example that we can implement in both Google Cloud Spanner and &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jxtkopdvv8bv1pqlduo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jxtkopdvv8bv1pqlduo.png" alt="Image description" width="700" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 2: Example of relational database schema and SQL queries&lt;/p&gt;

&lt;p&gt;Figure 2 depicts the relational database schema with three tables and two SQL queries. We have users identified by user ids, bank accounts identified by account ids, and account transactions identified by transaction ids. A user can have one or more accounts, while each account must belong to exactly one user. An account can have zero or more transactions, while each transaction must be associated with exactly one account. These key and cardinality constraints are captured via &lt;em&gt;primary key&lt;/em&gt; (PK) and &lt;em&gt;foreign key&lt;/em&gt; (FK) constraints on the diagram.&lt;/p&gt;

&lt;p&gt;The first query retrieves all accounts for a specified user by joining tables &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;accounts&lt;/code&gt;. The second query finds all transactions for a given account by joining tables &lt;code&gt;accounts&lt;/code&gt; and &lt;code&gt;transactions&lt;/code&gt;; transactions are also ordered by transaction dates in the result.&lt;/p&gt;
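&lt;p&gt;As a rough sketch, the schema and queries in Figure 2 could be written in generic SQL along these lines; the non-key columns (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;balance&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;) are illustrative assumptions, not taken from the figure:&lt;/p&gt;

```sql
-- Hypothetical reconstruction of the Figure 2 schema; columns beyond
-- the primary and foreign keys are illustrative assumptions.
CREATE TABLE users (
  user_id    VARCHAR(36) PRIMARY KEY,
  name       VARCHAR(100),
  email      VARCHAR(100)
);

CREATE TABLE accounts (
  account_id VARCHAR(36) PRIMARY KEY,
  user_id    VARCHAR(36) NOT NULL REFERENCES users (user_id),
  balance    DECIMAL(19, 2)
);

CREATE TABLE transactions (
  transaction_id   VARCHAR(36) PRIMARY KEY,
  account_id       VARCHAR(36) NOT NULL REFERENCES accounts (account_id),
  transaction_date TIMESTAMP,
  amount           DECIMAL(19, 2)
);

-- Query 1: all accounts for a specified user.
SELECT a.account_id, a.balance
FROM   users u JOIN accounts a ON u.user_id = a.user_id
WHERE  u.user_id = '...';

-- Query 2: all transactions for a given account, ordered by date.
SELECT t.transaction_id, t.transaction_date, t.amount
FROM   accounts a JOIN transactions t ON a.account_id = t.account_id
WHERE  a.account_id = '...'
ORDER  BY t.transaction_date DESC;
```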

&lt;p&gt;This data model and queries can be readily instantiated in any relational database, including Google Cloud Spanner (see &lt;a href="https://github.com/ArtemChebotko/distributed-joins/blob/main/spanner-foreign-keys.sql" rel="noopener noreferrer"&gt;this SQL script&lt;/a&gt; as an example), but that would not result in the join optimizations we are looking to implement. We show how to do much better in the next two sections.&lt;/p&gt;

&lt;h1&gt;
  
  
  Co-located joins and interleaved tables in Google Cloud Spanner
&lt;/h1&gt;

&lt;p&gt;Co-located joins can perform significantly faster than shuffle and broadcast joins because they avoid moving data between nodes in a cluster. To use co-located joins, a distributed database needs to have a mechanism to specify which related data entities must be stored together on the same node. In &lt;a href="https://cloud.google.com/spanner" rel="noopener noreferrer"&gt;Google Cloud Spanner&lt;/a&gt;, this mechanism is called &lt;em&gt;table interleaving&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Logically independent tables can be organized into &lt;em&gt;parent-child hierarchies&lt;/em&gt; by interleaving tables. This results in a &lt;em&gt;data locality relationship&lt;/em&gt; between parent and child tables, such that one or more rows from a child table are physically stored together with one row from a parent table. For two tables to be interleaved, the parent table primary key must also be included as the prefix of the child table primary key. In other words, the child table primary key must consist of the parent table primary key followed by additional columns.&lt;/p&gt;

&lt;p&gt;Figure 3 shows how to take advantage of table interleaving and co-located joins in Google Cloud Spanner to improve the performance of queries in our example of users, accounts, and transactions. The three tables are organized into a hierarchy, where table &lt;code&gt;users&lt;/code&gt; is the parent of table &lt;code&gt;accounts&lt;/code&gt;, and table &lt;code&gt;accounts&lt;/code&gt; is the parent of table &lt;code&gt;transactions&lt;/code&gt;. Column &lt;code&gt;user_id&lt;/code&gt; is the primary key of table &lt;code&gt;users&lt;/code&gt; and the prefix of the primary key of table &lt;code&gt;accounts&lt;/code&gt;. Columns &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;account_id&lt;/code&gt; constitute the primary key of table &lt;code&gt;accounts&lt;/code&gt; and the prefix of the primary key of table &lt;code&gt;transactions&lt;/code&gt;. Finally, columns &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;account_id&lt;/code&gt;, and &lt;code&gt;transaction_id&lt;/code&gt; constitute the primary key of table &lt;code&gt;transactions&lt;/code&gt;.&lt;/p&gt;
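&lt;p&gt;In Spanner DDL, this hierarchy can be sketched with the &lt;code&gt;INTERLEAVE IN PARENT&lt;/code&gt; clause roughly as follows; the non-key columns are illustrative assumptions:&lt;/p&gt;

```sql
-- Parent table; its primary key is the prefix of every child key.
CREATE TABLE users (
  user_id STRING(36) NOT NULL,
  name    STRING(100),
  email   STRING(100)
) PRIMARY KEY (user_id);

-- Child rows are physically stored next to their parent user row.
CREATE TABLE accounts (
  user_id    STRING(36) NOT NULL,
  account_id STRING(36) NOT NULL,
  balance    NUMERIC
) PRIMARY KEY (user_id, account_id),
  INTERLEAVE IN PARENT users ON DELETE CASCADE;

-- Grandchild rows are co-located with their account and user rows.
CREATE TABLE transactions (
  user_id          STRING(36) NOT NULL,
  account_id       STRING(36) NOT NULL,
  transaction_id   STRING(36) NOT NULL,
  transaction_date TIMESTAMP,
  amount           NUMERIC
) PRIMARY KEY (user_id, account_id, transaction_id),
  INTERLEAVE IN PARENT accounts ON DELETE CASCADE;
```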

&lt;p&gt;The SQL queries are unchanged when compared to our original running example. They use the same joins as before, but these joins can now be executed faster as co-located joins.&lt;/p&gt;

&lt;p&gt;To try this example in &lt;a href="https://cloud.google.com/spanner" rel="noopener noreferrer"&gt;Google Cloud Spanner&lt;/a&gt;, we share &lt;a href="https://github.com/ArtemChebotko/distributed-joins/blob/main/spanner-interleaved-tables.sql" rel="noopener noreferrer"&gt;our SQL script&lt;/a&gt; for Co-Located Joins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklq0i1v3vrs21jhluo80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklq0i1v3vrs21jhluo80.png" alt="Image description" width="700" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 3: Co-located joins in Google Cloud Spanner&lt;/p&gt;

&lt;h1&gt;
  
  
  Pre-computed joins and tables with multi-row partitions in DataStax Astra DB
&lt;/h1&gt;

&lt;p&gt;Pre-computed joins are the fastest joins in our toolbox. They are significantly faster than shuffle and broadcast joins because they avoid moving data between nodes in a cluster. They are also faster than co-located joins because they do not need to compute joins dynamically. To store and serve pre-computed join results effectively, a distributed database needs to have a mechanism to nest related data entities together. In &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, this mechanism is called &lt;em&gt;tables with multi-row partitions&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Tables in Astra DB are defined and queried using CQL, an SQL-like language. They are similar to tables in relational databases as they have columns, rows, and primary keys. The important difference is that a table primary key consists of a mandatory &lt;em&gt;partition key&lt;/em&gt; and an optional &lt;em&gt;clustering key&lt;/em&gt;. A partition key uniquely identifies a partition in a table, and a clustering key uniquely identifies a row in a partition. When both partition and clustering keys are defined, a table can store multiple rows in each partition. Tables with multi-row partitions are used to store and retrieve related entities together very efficiently. In our case, we can store pre-joined entities in such tables.&lt;/p&gt;

&lt;p&gt;Figure 4 shows how to take advantage of tables with multi-row partitions and pre-computed joins in &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; to make queries from our running example exceptionally fast. The two tables are specifically designed to support the two queries. To retrieve all accounts for a specified user, table &lt;code&gt;accounts_by_user&lt;/code&gt; defines &lt;code&gt;user_id&lt;/code&gt; as a partition key and &lt;code&gt;account_id&lt;/code&gt; as a clustering key. Each user in this table has a distinct partition that stores all user accounts as rows in this partition. In addition, each user partition also has information about the name and email of the user stored in static columns whose values are shared by all rows in the partition. To find all transactions for a given account, table &lt;code&gt;transactions_by_account&lt;/code&gt; defines &lt;code&gt;account_id&lt;/code&gt; as a partition key and &lt;code&gt;transaction_id&lt;/code&gt; as a clustering key. Each account in this table has a distinct partition that stores all account transactions as rows in this partition. Furthermore, transactions within each account partition are ordered based on timestamp components of their respective &lt;code&gt;timeuuid&lt;/code&gt; identifiers as defined by the clustering order.&lt;/p&gt;
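&lt;p&gt;In CQL, the two tables in Figure 4 can be sketched along these lines; the static and regular columns beyond the keys are illustrative assumptions:&lt;/p&gt;

```sql
-- One partition per user; user-level data lives in static columns
-- whose values are shared by all rows in the partition.
CREATE TABLE accounts_by_user (
  user_id    UUID,
  name       TEXT STATIC,
  email      TEXT STATIC,
  account_id UUID,
  balance    DECIMAL,
  PRIMARY KEY ((user_id), account_id)
);

-- One partition per account; rows are ordered by the timestamp
-- component of the timeuuid clustering key, newest first.
CREATE TABLE transactions_by_account (
  account_id     UUID,
  transaction_id TIMEUUID,
  amount         DECIMAL,
  PRIMARY KEY ((account_id), transaction_id)
) WITH CLUSTERING ORDER BY (transaction_id DESC);
```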

&lt;p&gt;The CQL queries are much simpler than their SQL counterparts. They are very efficient queries that retrieve one partition at a time based on a partition key value. No joins or ordering are required because the data is already organized in a pre-joined and ordered form.&lt;/p&gt;
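&lt;p&gt;For example, the two lookups can be expressed in CQL roughly as follows; the selected non-key columns are assumptions:&lt;/p&gt;

```sql
-- Query 1: all accounts (plus static user info) for a specified user;
-- a single-partition read in accounts_by_user.
SELECT name, email, account_id, balance
FROM   accounts_by_user
WHERE  user_id = ?;

-- Query 2: all transactions for a given account; rows come back
-- newest first thanks to the clustering order, so no ORDER BY is needed.
SELECT transaction_id, amount
FROM   transactions_by_account
WHERE  account_id = ?;
```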

&lt;p&gt;To try this example in &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, we share &lt;a href="https://github.com/ArtemChebotko/distributed-joins/blob/main/astra-db.cql" rel="noopener noreferrer"&gt;our CQL script&lt;/a&gt;. If you are new to CQL, it stands for Cassandra Query Language and is used in both &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; and &lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt;. Astra DB is a serverless and multi-region database service that is based on Apache Cassandra, an open-source NoSQL database. To learn more about CQL and tables with multi-row partitions, the hands-on &lt;a href="https://www.datastax.com/learn/cassandra-fundamentals" rel="noopener noreferrer"&gt;Cassandra Fundamentals&lt;/a&gt; learning series is highly recommended. For more advanced data modeling, there is also the collection of &lt;a href="https://www.datastax.com/learn/data-modeling-by-example" rel="noopener noreferrer"&gt;data modeling examples&lt;/a&gt; from various domains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyywyvi62kmnumsfzp252.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyywyvi62kmnumsfzp252.png" alt="Image description" width="700" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 4: Pre-computed joins in DataStax Astra DB&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Having fast distributed joins is an important consideration when it comes to selecting a scalable database that can support real-time, high-throughput, data-driven applications. In this article, we discussed how shuffle, broadcast, co-located, and pre-computed joins work. We explained that shuffle and broadcast joins are more suitable for batch or near real-time analytics because they may require moving data among nodes in a cluster, which is expensive. Co-located and pre-computed joins are faster and can do well with real-time applications. Using &lt;a href="https://cloud.google.com/spanner" rel="noopener noreferrer"&gt;Google Cloud Spanner&lt;/a&gt;, we demonstrated how a fully managed, cloud-native relational database can take advantage of fast co-located joins. Using &lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt;, we demonstrated how a serverless, cloud-native NoSQL database can take advantage of even faster pre-computed joins.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow the &lt;a href="https://datastax.medium.com/" rel="noopener noreferrer"&gt;DataStax Tech Blog&lt;/a&gt; for more developer stories. Check out our &lt;a href="https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt; for tutorials, and follow &lt;a href="https://twitter.com/DataStaxDevs" rel="noopener noreferrer"&gt;DataStax Developers on Twitter&lt;/a&gt; for the latest news about our developer community.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://astra.dev/3LYTy7t" rel="noopener noreferrer"&gt;DataStax Astra DB&lt;/a&gt; — a serverless, cloud-native NoSQL database&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt; — open source NoSQL database&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datastax.com/learn/cassandra-fundamentals" rel="noopener noreferrer"&gt;Astra DB / Cassandra Fundamentals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datastax.com/learn/data-modeling-by-example" rel="noopener noreferrer"&gt;Astra DB / Cassandra Data Modeling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/spanner" rel="noopener noreferrer"&gt;Google Cloud Spanner&lt;/a&gt; — a fully managed, cloud-native relational database&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/spanner/docs/whitepapers/optimizing-schema-design" rel="noopener noreferrer"&gt;Optimizing Schema Design for Cloud Spanner&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
  </channel>
</rss>
