Non-relational data models

#datamodeling #apachecassandra #nosql #datascience

What

Many NoSQL databases are distributed databases. To provide high availability, they keep copies of the data on several nodes. After a change is written, the copies in different locations can differ for a short time (often milliseconds) until the update has reached all of them. This behaviour is called eventual consistency.
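To make eventual consistency concrete, here is a minimal sketch of a toy cluster in plain Python. It is not how Cassandra replicates data; the class names, node names and the explicit `propagate()` step are assumptions made for illustration only.

```python
class Replica:
    """One copy of the data, held on one node (toy model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ToyCluster:
    """Hypothetical cluster: a write lands on one replica first."""

    def __init__(self, replica_names):
        self.replicas = [Replica(name) for name in replica_names]
        self.pending = []  # updates that have not reached a replica yet

    def write(self, key, value):
        # The first replica applies the write immediately,
        # the other replicas receive it later.
        self.replicas[0].data[key] = value
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Deliver all outstanding updates (the "eventual" part).
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()


cluster = ToyCluster(["node_a", "node_b", "node_c"])
cluster.write("album", "Let It Be")

stale = cluster.read(1, "album")   # second replica has not seen the write yet
cluster.propagate()
fresh = cluster.read(1, "album")   # now every replica agrees
```

Between `write` and `propagate` a read from the second replica misses the new value; afterwards all copies return the same data.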

When

need to handle different data configurations and store different data types
need high availability: there is no single point of failure, unlike a relational database running on a single server
have a large amount of data: NoSQL databases can also scale horizontally
need linear scalability
need fast reads and writes
need a flexible schema

Common types

Wide Column Store:

CAP Theorem

It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees.

Consistency

Every read from the database gets the latest piece of data or an error.

Availability

Every request is received and gets a response, but without a guarantee that the data reflects the latest update.

Partition tolerance

The system continues to work even when network connectivity between nodes is lost.
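The trade-off between the three guarantees can be sketched with a toy read function. This is only an illustration under simplifying assumptions: the function name, the `Unavailable` exception and the `"CP"`/`"AP"` mode flag are made up for this example and do not correspond to any real database API.

```python
class Unavailable(Exception):
    """Raised when a consistent answer cannot be given."""


def read_during(local_value, latest_value, partitioned, mode):
    """Toy read against one node.

    local_value  -- what this node currently stores
    latest_value -- the most recent write somewhere in the cluster
    partitioned  -- True if this node is cut off from the other nodes
    mode         -- "CP" (prefer consistency) or "AP" (prefer availability)
    """
    if not partitioned:
        # Nodes can talk to each other, so the latest value is reachable.
        return latest_value
    if mode == "CP":
        # Consistency first: refuse to answer rather than risk stale data.
        raise Unavailable("cannot confirm the latest value during a partition")
    # Availability first: answer with whatever this node has, possibly stale.
    return local_value


ap_answer = read_during("old", "new", partitioned=True, mode="AP")
```

During a partition the `"AP"` read still answers (with the possibly outdated local value), while the `"CP"` read raises an error instead.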

Consistency in the CAP theorem refers to every read from the database getting the latest piece of data or an error.
Consistency in the ACID principle refers to the requirement that only transactions that abide by constraints and database rules are written into the database (correct across rows and tables).

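Network partitions cannot be ruled out in a distributed system, so in production the real choice is between consistency and availability while a partition lasts. Supporting availability and partition tolerance often makes sense, and this is the trade-off Apache Cassandra makes.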

Data Modeling with Apache Cassandra

Apache Cassandra offers high availability at the potential cost of consistency.

it is optimised for fast reads and writes
think about the queries first: there are no JOINs in Apache Cassandra, so the data must be denormalised to fit the queries
one table per query is a good strategy
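The "one table per query" idea can be sketched in plain Python, with dictionaries standing in for tables. The sample rows and the table names `albums_by_year` and `albums_by_artist` are illustrative assumptions; only the column names come from the examples in these notes.

```python
from collections import defaultdict

# Illustrative sample rows (not taken from a real dataset).
albums = [
    {"year": 1970, "artist_name": "The Beatles", "album_name": "Let It Be"},
    {"year": 1965, "artist_name": "The Beatles", "album_name": "Rubber Soul"},
    {"year": 1965, "artist_name": "The Who", "album_name": "My Generation"},
]

# Table 1 answers the query "all albums released in a given year".
albums_by_year = defaultdict(list)
# Table 2 answers the query "all albums by a given artist".
albums_by_artist = defaultdict(list)

# Denormalisation: every row is written twice, once per query table.
for row in albums:
    albums_by_year[row["year"]].append(row)
    albums_by_artist[row["artist_name"]].append(row)

# Each query is now a single lookup, no join needed.
in_1965 = [r["album_name"] for r in albums_by_year[1965]]
by_beatles = [r["album_name"] for r in albums_by_artist["The Beatles"]]
```

The price of the fast lookups is duplicated data: every write has to update both tables.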

CQL - Cassandra Query Language

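It is similar to SQL, but ~~JOINS~~ and ~~SUBQUERIES~~ are not supported, and GROUP BY is only available in a limited form (on partition key and clustering columns, since Cassandra 3.10).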

Primary key

PRIMARY KEY (year)
PRIMARY KEY (year, artist_name)

must be unique, because it identifies a row
can consist of only the partition key, or can also include additional clustering columns
the partition key determines how the data is distributed across the nodes of the system
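How a partition key can decide where a row lives is sketched below. This is a deliberately simplified stand-in: Cassandra uses its own partitioner and a token ring, whereas this example simply hashes the key with MD5 and takes it modulo a list of hypothetical node names.

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["node_a", "node_b", "node_c"]


def node_for(partition_key):
    """Map a partition key to a node by hashing it (simplified)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Rows with the same partition key always land on the same node.
placement = {year: node_for(year) for year in (1965, 1970, 1971)}
```

Because the mapping is deterministic, a query that supplies the partition key can be routed straight to the node that stores the data.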

Apache Cassandra does not allow two rows with the same primary key: a second insert with the same key overwrites the first one. That's why we might need to combine several columns in the primary key to make it unique (a composite key).
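The effect of a primary key that is not unique enough can be sketched with dictionaries, where inserting under an existing key overwrites the previous row. The `insert` helper and the sample rows are assumptions for illustration, not Cassandra code.

```python
def insert(table, key_columns, row):
    """Store a row under its primary key; an existing key is overwritten."""
    key = tuple(row[column] for column in key_columns)
    table[key] = row


rows = [
    {"year": 1965, "artist_name": "The Beatles", "album_name": "Rubber Soul"},
    {"year": 1965, "artist_name": "The Who", "album_name": "My Generation"},
]

table_year_only = {}   # models PRIMARY KEY (year)
table_composite = {}   # models PRIMARY KEY (year, artist_name)

for row in rows:
    insert(table_year_only, ("year",), row)
    insert(table_composite, ("year", "artist_name"), row)

rows_kept_year_only = len(table_year_only)
rows_kept_composite = len(table_composite)
```

With `year` alone as the key, the second 1965 row replaces the first one; with the composite key both rows survive.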

Clustering columns

PRIMARY KEY ((year), artist_name, album_name)

clustering columns sort the rows within a partition, in ascending order by default, following the order in which the columns appear in the primary key
zero, one or more clustering columns can be added
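The sorting behaviour for PRIMARY KEY ((year), artist_name, album_name) can be imitated in a few lines of Python: rows are grouped by the partition key and sorted inside each group by the clustering columns, in order. The sample rows are illustrative, and the sketch ignores how Cassandra actually stores data on disk.

```python
from collections import defaultdict

# (year, artist_name, album_name), in arbitrary insertion order.
rows = [
    (1970, "The Beatles", "Let It Be"),
    (1965, "The Who", "My Generation"),
    (1965, "The Beatles", "Rubber Soul"),
    (1965, "The Beatles", "Help!"),
]

# Group by the partition key (year).
partitions = defaultdict(list)
for year, artist_name, album_name in rows:
    partitions[year].append((artist_name, album_name))

# Inside each partition, sort by artist_name first, then album_name,
# ascending, which mirrors the order of the clustering columns.
for partition_rows in partitions.values():
    partition_rows.sort()

rows_1965 = partitions[1965]
```

Reading the 1965 partition now returns the rows ordered by artist and, for the same artist, by album name.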

WHERE clause

data modeling in Apache Cassandra is query focused, that's why the WHERE clause matters
the partition key must be included in the query, and clustering columns can be used in the order in which they appear in the primary key
the partition key in the WHERE clause tells Cassandra which node holds the data, so the data can be fetched from that node and served back
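The rule for the WHERE clause can be written down as a small check. This is a simplified sketch: the function `where_is_valid` is hypothetical, it only looks at which columns are restricted, and it ignores details such as range conditions, secondary indexes and ALLOW FILTERING.

```python
def where_is_valid(partition_key, clustering_columns, restricted):
    """Return True if the restricted columns follow the rule:
    all partition key columns, plus a leading run of clustering columns.
    """
    restricted = set(restricted)
    # Every partition key column must be present.
    if not set(partition_key) <= restricted:
        return False
    remaining = restricted - set(partition_key)
    # Clustering columns must be used in primary key order, without gaps.
    prefix_length = 0
    for column in clustering_columns:
        if column not in remaining:
            break
        prefix_length += 1
    return remaining == set(clustering_columns[:prefix_length])


# PRIMARY KEY ((year), artist_name, album_name)
partition_key = ["year"]
clustering = ["artist_name", "album_name"]

ok_year = where_is_valid(partition_key, clustering, ["year"])
ok_year_artist = where_is_valid(partition_key, clustering, ["year", "artist_name"])
skips_artist = where_is_valid(partition_key, clustering, ["year", "album_name"])
no_partition_key = where_is_valid(partition_key, clustering, ["artist_name"])
```

Restricting `year` alone or `year` plus `artist_name` passes the check; skipping `artist_name` or leaving out `year` does not.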

SELECT *

A query that does not restrict the partition key can be forced with ALLOW FILTERING, but this should be avoided: the query has to look through the data on every node, so with terabytes of data it will be very slow or simply fail.
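Why such a query is expensive can be seen by counting how many rows each kind of read has to look at. The toy data below (20 partitions of 1000 rows each) and both helper functions are assumptions for illustration; real clusters spread the partitions over many nodes and hold far more data.

```python
# Toy table: 20 partitions (years) with 1000 rows each.
partitions = {
    year: [{"year": year, "artist_name": f"artist_{i}"} for i in range(1000)]
    for year in range(1960, 1980)
}


def read_partition(year):
    """Read with the partition key: only one partition is touched."""
    rows = partitions.get(year, [])
    return rows, len(rows)


def scan_with_filter(predicate):
    """Read without the partition key: every row must be examined."""
    examined = 0
    matches = []
    for rows in partitions.values():
        for row in rows:
            examined += 1
            if predicate(row):
                matches.append(row)
    return matches, examined


_, touched_lookup = read_partition(1965)
found, touched_scan = scan_with_filter(lambda r: r["artist_name"] == "artist_7")
```

The lookup by partition key touches one partition, while the filtered scan has to examine every row in the table to find its matches.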