What
NoSQL databases are typically distributed databases. To achieve high availability they keep copies (replicas) of the data in several locations. After a write, those copies can differ between locations for a short time, often only milliseconds. This behaviour is called eventual consistency.
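To make eventual consistency more concrete, here is a minimal sketch in plain Python. It is not how any real database replicates data; the class names, the replica names and the explicit `replicate()` step are assumptions made up for illustration.

```python
class Replica:
    """One copy of the data, held at one location."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyCluster:
    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # The write is acknowledged after reaching the first replica only.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        # Later, the pending writes reach every other replica.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


cluster = TinyCluster(["eu", "us", "asia"])
cluster.write("album", "Let It Be")
stale = cluster.read("album", replica_index=2)  # replica has not seen the write yet
cluster.replicate()
fresh = cluster.read("album", replica_index=2)  # now every replica agrees
```

Between `write` and `replicate` the third replica still answers with the old state. That window is the "different for milliseconds" part.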
When
- handle different data configurations and store different data types
- need high availability: there is no single point of failure, unlike a typical single-server relational database
- have a large amount of data, because NoSQL databases can also scale horizontally
- need linear scalability
- need fast reads and writes
- need a flexible schema
Common types
Wide Column Store:
Document store:
Key Value Store:
Graph DBMS:
CAP Theorem
It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees.
Consistency
Every read from the database gets the latest piece of data or an error.
Availability
Every request is received and gets a response, but without a guarantee that the response contains the latest data.
Partition tolerance
The system continues to work even when network connectivity between nodes is lost.
Consistency in the CAP theorem refers to every read from the database getting the latest piece of data or an error.
Consistency in the ACID principle refers to the requirement that only transactions that abide by constraints and database rules are written into the database (correct across rows and tables).
In production, supporting availability and partition tolerance usually makes sense.
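The trade-off shows up when the network is partitioned. The sketch below is a toy model in Python, not a real database: the `Node` class and the "CP"/"AP" mode labels are assumptions used only here, to show what choosing consistency or availability means for a single read.

```python
class Node:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels used only in this sketch)
        self.value = "v1"           # last value this node has seen
        self.peer_reachable = True

    def read(self):
        if self.peer_reachable:
            return self.value
        if self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise RuntimeError("cannot confirm latest value during partition")
        # Availability first: answer with what we have, possibly stale.
        return self.value


cp_node, ap_node = Node("CP"), Node("AP")
cp_node.peer_reachable = ap_node.peer_reachable = False  # network partition

ap_answer = ap_node.read()
try:
    cp_answer = cp_node.read()
except RuntimeError as err:
    cp_answer = f"error: {err}"
```

The AP node keeps answering and accepts that the value may be outdated, while the CP node prefers to return an error.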
Data Modeling with Apache Cassandra
Apache Cassandra offers high availability at the potential cost of consistency.
- it is optimised for fast reads and writes
- queries come first: there are no JOINs in Apache Cassandra, so the data must be denormalised to fit the queries
- one table per query is a good strategy
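A rough way to picture "one table per query" is two dictionaries that hold the same rows, each keyed for one question. This is a plain Python sketch with sample data and made-up table names (`albums_by_year`, `albums_by_artist`), not Cassandra code.

```python
from collections import defaultdict

# Sample rows, used for illustration only.
albums = [
    {"year": 1970, "artist_name": "The Beatles", "album_name": "Let It Be"},
    {"year": 1965, "artist_name": "The Beatles", "album_name": "Rubber Soul"},
    {"year": 1965, "artist_name": "The Who", "album_name": "My Generation"},
]

# Query 1: "all albums released in a given year" -> table keyed by year
albums_by_year = defaultdict(list)
# Query 2: "all albums by a given artist" -> table keyed by artist_name
albums_by_artist = defaultdict(list)

for row in albums:
    # The same row is written twice: that duplication is the denormalisation.
    albums_by_year[row["year"]].append(row)
    albums_by_artist[row["artist_name"]].append(row)

in_1965 = [r["album_name"] for r in albums_by_year[1965]]
by_beatles = [r["album_name"] for r in albums_by_artist["The Beatles"]]
```

Each question is answered by a single lookup and no join is needed. The price is that every write has to go to both tables.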
CQL - Cassandra Query Language
It is similar to SQL, but JOINs and subqueries are not supported, and GROUP BY is only available in a restricted form in newer versions.
Primary key
PRIMARY KEY (year)
PRIMARY KEY (year, artist_name)
- must be unique
- can be only the partition key or can also include additional clustering columns
- the partition key will determine the distribution across the system
Apache Cassandra does not allow two rows with the same primary key; a second insert with the same key overwrites the first. That's why we might need to combine several columns in the primary key to make a composite key.
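Both ideas, the partition key choosing the node and the primary key identifying the row, can be sketched in a few lines of Python. The node names and the MD5-based `node_for` function are assumptions for illustration; Cassandra's real partitioner and token ring work differently.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(partition_key):
    # Stand-in for a partitioner: hash the key, map the hash to a node.
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


table = {}  # primary key tuple -> row


def insert(year, artist_name, album_name):
    # Models PRIMARY KEY (year, artist_name): year is the partition key.
    table[(year, artist_name)] = {
        "album_name": album_name,
        "node": node_for(year),
    }


insert(1965, "The Beatles", "Rubber Soul")
insert(1965, "The Who", "My Generation")
insert(1965, "The Beatles", "Help!")  # same primary key: replaces "Rubber Soul"
```

After the three inserts the table holds only two rows, because the third insert reused an existing primary key. Both rows share the partition key 1965, so they land on the same node. Adding album_name to the primary key would keep both Beatles albums.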
Clustering columns
PRIMARY KEY ((year), artist_name, album_name)
- clustering columns sort the rows within a partition in ascending order by default, applied in the order the columns appear in the primary key
- none, one or n clustering columns can be added
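The effect of clustering columns can be imitated with sorted tuples. The sketch below is plain Python with sample data; it models PRIMARY KEY ((year), artist_name, album_name) and only illustrates the resulting order, not how Cassandra stores rows on disk.

```python
from collections import defaultdict

# Models PRIMARY KEY ((year), artist_name, album_name)
partitions = defaultdict(list)  # partition key (year) -> rows


def insert(year, artist_name, album_name):
    rows = partitions[year]
    rows.append((artist_name, album_name))
    # Keep rows ordered by artist_name first, then album_name, both ascending.
    rows.sort()


insert(1965, "The Who", "My Generation")
insert(1965, "The Beatles", "Rubber Soul")
insert(1965, "The Beatles", "Help!")

rows_1965 = partitions[1965]
```

Regardless of insert order, the rows of the 1965 partition come back sorted by artist first and by album second.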
WHERE clause
- data modeling in Apache Cassandra is query focused, that's why we need the WHERE clause
- the partition key must be included in the query, and clustering columns can be used in the order they appear in the primary key
- with the WHERE clause, Cassandra knows which node holds the data and can serve it from there
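The rules above can be written as a small check. This is a simplified Python sketch with a hypothetical helper, `where_is_allowed`; it only looks at equality filters and ignores range conditions, secondary indexes and ALLOW FILTERING.

```python
def where_is_allowed(partition_key, clustering_columns, where_columns):
    """Check the columns used in a WHERE clause against the primary key."""
    where = set(where_columns)
    # Every partition key column must be present.
    if not set(partition_key) <= where:
        return False
    rest = where - set(partition_key)
    # The remaining columns must be the leading clustering columns:
    # no skipping ahead to a later clustering column.
    return rest == set(clustering_columns[:len(rest)])


# Models PRIMARY KEY ((year), artist_name, album_name)
pk = ["year"]
cc = ["artist_name", "album_name"]

ok_partition_only = where_is_allowed(pk, cc, ["year"])
ok_with_first_clustering = where_is_allowed(pk, cc, ["year", "artist_name"])
skips_a_column = where_is_allowed(pk, cc, ["year", "album_name"])
no_partition_key = where_is_allowed(pk, cc, ["artist_name"])
```

The first two queries pass. The third skips artist_name and the fourth leaves out the partition key, so both are rejected.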
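SELECT *
Querying without the partition key in the WHERE clause can be done with ALLOW FILTERING, but should be avoided, because Cassandra then has to scan all the data. The query will be very slow or simply fail because of the amount of data, potentially terabytes.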
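Why filtering is expensive becomes visible when counting how many rows have to be examined. The sketch below is a plain Python model with sample data and made-up function names; a real cluster would also have to contact many nodes.

```python
# Partition key (year) -> rows of (artist_name, album_name); sample data.
partitions = {
    1965: [("The Beatles", "Rubber Soul"), ("The Who", "My Generation")],
    1970: [("The Beatles", "Let It Be")],
    1971: [("The Who", "Who's Next")],
}


def by_partition(year):
    # Partition key given: only one partition is read.
    rows = partitions.get(year, [])
    return rows, len(rows)


def by_filtering(artist_name):
    # No partition key: every row of every partition has to be examined.
    examined, matches = 0, []
    for rows in partitions.values():
        for row in rows:
            examined += 1
            if row[0] == artist_name:
                matches.append(row)
    return matches, examined


_, examined_partition = by_partition(1970)
beatles_rows, examined_filtering = by_filtering("The Beatles")
```

With four rows the difference is one row examined against four. With terabytes of data the same pattern means scanning the whole data set.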
Summary: Sketchnote
To have all the above information in one view, I made a sketchnote.
If you need a higher resolution please use this page
more links:
Data Modeling with noSQL
Composite Partition Keys
Why not to use SELECT *