Farhad Rahimi Klie

Posted on Dec 22, 2025

Apache Cassandra Database – The Complete Guide (Architecture, Internals, and Full CQL Syntax)

#apachecassandra #cassandra #database #cql

Apache Cassandra is a distributed, highly available, horizontally scalable NoSQL database designed to handle massive amounts of data across many commodity servers, with no single point of failure.

It is widely used by organizations that require:

High write throughput
Always-on availability
Linear scalability
Multi-datacenter replication

1. What is Apache Cassandra?

Apache Cassandra is a wide-column store inspired by:

Amazon Dynamo (distributed system design)
Google Bigtable (data model)

Core Characteristics

Peer-to-peer architecture (no master)
Linear horizontal scalability
Tunable consistency
Fault tolerance
High write performance
Schema-based (unlike MongoDB)

2. Cassandra Architecture (High Level)

Cassandra uses a ring-based peer-to-peer architecture.

Key Components

Node – A single Cassandra instance
Cluster – A group of nodes
Datacenter – Logical grouping of nodes
Rack – Physical grouping for fault tolerance

Client
  |
Coordinator Node
  |
Replica Nodes (Ring)

There is no master node. Every node is equal.

3. Gossip Protocol (Node Communication)

Cassandra nodes communicate using the Gossip Protocol.

What Gossip Does

Node discovery
Cluster membership
Failure detection
Metadata sharing

Every second, each node exchanges state information with up to three other nodes, so cluster state spreads without any central coordinator.
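
To make the exchange concrete, here is a minimal Python sketch of the merge step. It assumes each node keeps a heartbeat version per known peer and that the higher version wins. The function names and the dict layout are illustrative only, not Cassandra's actual classes.

```python
def merge_state(local, remote):
    """Merge two views: for each node keep the highest heartbeat version seen."""
    merged = dict(local)
    for node, version in remote.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged


def gossip_round(a, b):
    """Two nodes exchange views; afterwards both hold the same merged view."""
    merged = merge_state(a, b)
    return merged, dict(merged)


node_a = {"n1": 5, "n2": 2}
node_b = {"n2": 4, "n3": 1}
node_a, node_b = gossip_round(node_a, node_b)
print(node_a)  # {'n1': 5, 'n2': 4, 'n3': 1}
```

After one round, node_a has learned about n3 and both nodes agree on the newer heartbeat for n2.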

4. Partitioning & Token Ring

Token Ring

Data is distributed using consistent hashing
Each node owns a range of tokens

Hash(key) → Token → Node
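
The lookup above can be sketched in a few lines of Python. This is a simplified model, not Cassandra's code: MD5 stands in for Murmur3 (which is not in the standard library), tokens are unsigned 64-bit integers, and each node owns a single token. The `Ring` class and the node names are hypothetical.

```python
import bisect
import hashlib

TOKEN_SPACE = 2**64  # assumption: unsigned 64-bit token space for this sketch


def token(key: str) -> int:
    """Stand-in hash: first 8 bytes of MD5 as an integer (Cassandra uses Murmur3)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, node_tokens):
        # Each node owns the range that ends at its own token.
        self.entries = sorted((t, n) for n, t in node_tokens.items())
        self.tokens = [t for t, _ in self.entries]

    def owner(self, key_token: int) -> str:
        # First node whose token is >= the key's token, wrapping past the end.
        i = bisect.bisect_left(self.tokens, key_token) % len(self.tokens)
        return self.entries[i][1]


ring = Ring({
    "node-a": 0,
    "node-b": TOKEN_SPACE // 3,
    "node-c": 2 * TOKEN_SPACE // 3,
})
print(ring.owner(1))                 # node-b
print(ring.owner(TOKEN_SPACE - 1))   # node-a (wraps around the ring)
print(ring.owner(token("user:42")))  # whichever node owns that hash
```

Adding a node only changes ownership of the range next to its token, which is why the cluster can grow without reshuffling all data.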

Partitioner Types

Murmur3Partitioner (default since Cassandra 1.2)
RandomPartitioner (legacy; the default before 1.2, kept for backward compatibility)

5. Replication Strategy

Defines how data is replicated.

SimpleStrategy

Intended for single-datacenter setups, mainly development and test clusters. It ignores racks and datacenters when placing replicas.

CREATE KEYSPACE test
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};
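
As a rough illustration of how SimpleStrategy places the copies, the Python sketch below walks the ring clockwise from the owning node until it has collected `replication_factor` distinct nodes. The ring layout and the `replicas` helper are assumptions made for this example.

```python
import bisect


def replicas(ring_tokens, key_token, rf):
    """ring_tokens: sorted list of (token, node).

    Returns rf distinct nodes, walking clockwise from the node that owns key_token.
    """
    tokens = [t for t, _ in ring_tokens]
    start = bisect.bisect_left(tokens, key_token) % len(ring_tokens)
    chosen = []
    for step in range(len(ring_tokens)):
        node = ring_tokens[(start + step) % len(ring_tokens)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen


ring = [(100, "n1"), (200, "n2"), (300, "n3"), (400, "n4")]
print(replicas(ring, 250, 3))  # ['n3', 'n4', 'n1']
```

With a replication factor of 3, the key at token 250 lands on n3 and is copied to the next two nodes on the ring.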

NetworkTopologyStrategy

Used for multi-datacenter setups.

CREATE KEYSPACE prod
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1': 3,
  'DC2': 2
};

6. Cassandra Consistency Model

Cassandra offers tunable consistency.

Consistency Levels

ONE
TWO
THREE
QUORUM
ALL
LOCAL_QUORUM
EACH_QUORUM

Example (a cqlsh command, not a CQL statement):

CONSISTENCY QUORUM;

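The consistency level trades availability and latency against consistency: stricter levels wait for more replicas, so they tolerate fewer node failures. With replication factor RF, a read at level R is guaranteed to see a write at level W when R + W > RF (for example, QUORUM reads with QUORUM writes).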
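
The arithmetic behind that trade-off is small enough to sketch in Python. This covers only the single-datacenter levels; LOCAL_QUORUM and EACH_QUORUM apply the same quorum formula per datacenter and are left out. The helper names are made up for this example.

```python
def quorum(rf: int) -> int:
    """A quorum is a majority of the replicas."""
    return rf // 2 + 1


def replicas_required(level: str, rf: int) -> int:
    """How many replicas must respond for a given consistency level."""
    table = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": quorum(rf), "ALL": rf}
    return table[level]


def overlaps(read_level: str, write_level: str, rf: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    r = replicas_required(read_level, rf)
    w = replicas_required(write_level, rf)
    return r + w > rf


print(quorum(3))                        # 2
print(overlaps("QUORUM", "QUORUM", 3))  # True
print(overlaps("ONE", "ONE", 3))        # False
print(overlaps("ONE", "ALL", 3))        # True
```

With RF = 3, QUORUM survives one node being down while still guaranteeing that reads see the latest acknowledged write. ALL gives the same guarantee but fails as soon as any replica is unavailable.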

7. Write Path (Internal Flow)

Writes are fast in Cassandra because a write only appends to a log and updates memory; nothing on disk is read or modified in place.

Write Steps

Write to Commit Log (durability)
Write to MemTable (in-memory)
MemTable flushes to SSTable

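All disk writes are sequential: the commit log is append-only and each SSTable is written in a single pass, so the write path needs no random disk writes.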
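
The three write steps can be modeled with a toy Python class. This is a sketch under simplifying assumptions: the flush is triggered by a row count instead of memory size, and the commit log is a plain list that is cleared on flush. `ToyNode` is a hypothetical name.

```python
class ToyNode:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only, replayed after a crash
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable, sorted snapshots "on disk"
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability
        self.memtable[key] = value            # 2. in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush to an SSTable

    def flush(self):
        # An SSTable is written once, sorted, and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay


node = ToyNode(flush_threshold=2)
node.write("k2", "a")
node.write("k1", "b")   # reaches the threshold and triggers a flush
node.write("k3", "c")
print(node.sstables)    # [(('k1', 'b'), ('k2', 'a'))]
print(node.memtable)    # {'k3': 'c'}
```

Note that the write never reads or rewrites an existing SSTable, which is what keeps the path cheap.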

8. Read Path (Internal Flow)

A read may have to consult the MemTable and several SSTables, then merge what it finds into a single result.

Read Steps

Check MemTable
Check Bloom Filter
Read from SSTables
Merge results
Apply tombstones
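
A simplified Python model of the merge, assuming every stored cell carries a write timestamp and the newest one wins. The Bloom filter check is reduced to a plain dictionary lookup here, and `TOMBSTONE` and `read` are names invented for the sketch.

```python
TOMBSTONE = object()  # marker written by a delete


def read(key, memtable, sstables):
    """Each source maps key -> (timestamp, value).

    The newest timestamp wins; if the newest entry is a tombstone, the key is gone.
    """
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        if key in sstable:  # stand-in for the Bloom filter and index lookup
            candidates.append(sstable[key])
    if not candidates:
        return None
    _, value = max(candidates, key=lambda cell: cell[0])
    return None if value is TOMBSTONE else value


memtable = {"a": (30, TOMBSTONE)}
sstables = [
    {"a": (10, "old"), "b": (5, "x")},
    {"b": (20, "y")},
]
print(read("a", memtable, sstables))  # None (deleted at timestamp 30)
print(read("b", memtable, sstables))  # y
print(read("c", memtable, sstables))  # None (never written)
```

The more SSTables hold fragments of the same partition, the more work a read does, which is one reason compaction matters.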

9. Storage Engine Internals

Commit Log

Append-only log
Crash recovery

MemTable

In-memory sorted structure
Flushed to disk

SSTable

Immutable disk file
Sorted by partition token, then by clustering columns

Bloom Filter

Probabilistic structure
Avoids unnecessary disk reads
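
A toy Bloom filter in Python shows why the structure can say "definitely not here" but only "maybe here". The sizes, the use of SHA-256, and the class itself are choices made for this sketch, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(key))


bf = BloomFilter()
bf.add("user:1")
print(bf.might_contain("user:1"))  # True
```

If `might_contain` returns False for an SSTable, the read can skip that file without touching disk.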

10. Compaction

Compaction merges SSTables into fewer, larger ones, discarding overwritten values and purging expired tombstones.

Compaction Strategies

SizeTieredCompactionStrategy (default)
LeveledCompactionStrategy
TimeWindowCompactionStrategy
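
Whatever the strategy, the core merge looks roughly like the Python sketch below. It assumes timestamps in seconds, uses `None` as the tombstone marker, and drops a tombstone only once it is older than a grace period (864000 seconds, matching the default gc_grace_seconds). How SSTables are chosen for merging is what the strategies differ on and is not modeled here.

```python
def compact(sstables, now, gc_grace=864000):
    """sstables: list of dicts mapping key -> (timestamp, value).

    A value of None marks a tombstone. Returns one merged, key-sorted table.
    """
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)  # newest write wins
    return {
        key: (ts, value)
        for key, (ts, value) in sorted(merged.items())
        if not (value is None and now - ts > gc_grace)  # purge old tombstones
    }


old = {"a": (100, "v1"), "b": (100, "x")}
new = {"a": (200, "v2"), "b": (300, None)}  # b was deleted at t=300

print(compact([old, new], now=400))
# {'a': (200, 'v2'), 'b': (300, None)}  tombstone still within the grace period
print(compact([old, new], now=300 + 864001))
# {'a': (200, 'v2')}  tombstone purged
```

The tombstone is kept during the grace period so that replicas that missed the delete can still learn about it.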

11. Cassandra Data Model

Core Concepts

Keyspace
Table
Partition Key
Clustering Columns
Columns
Rows

Example Table Structure

Partition Key → Node
Clustering Columns → Sort order

12. Keyspaces (Full Syntax)

CREATE KEYSPACE app
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

DESCRIBE KEYSPACES;

DROP KEYSPACE app;

13. Tables (Full Syntax)

CREATE TABLE users (
  user_id UUID,
  email TEXT,
  name TEXT,
  created_at TIMESTAMP,
  PRIMARY KEY (user_id)
);

Composite Primary Key

PRIMARY KEY ((user_id), created_at)

14. Clustering Order

CREATE TABLE posts (
  user_id UUID,
  created_at TIMESTAMP,
  post TEXT,
  PRIMARY KEY ((user_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
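
One way to picture this table is as a map from partition key to a list of rows kept sorted by the clustering column. The Python sketch below models that idea, with plain integers standing in for timestamps. It illustrates the data model only, not the on-disk format, and `PostsTable` is a hypothetical class.

```python
import bisect
from collections import defaultdict


class PostsTable:
    def __init__(self):
        # user_id -> rows sorted by created_at DESC (stored as negated keys)
        self.partitions = defaultdict(list)

    def insert(self, user_id, created_at, post):
        rows = self.partitions[user_id]
        key = -created_at  # negate so ascending order means newest first
        i = bisect.bisect_left(rows, (key,))
        if i < len(rows) and rows[i][0] == key:
            rows[i] = (key, post)      # same primary key: overwrite
        else:
            rows.insert(i, (key, post))

    def latest(self, user_id, limit):
        rows = self.partitions.get(user_id, [])
        return [(-key, post) for key, post in rows[:limit]]


posts = PostsTable()
posts.insert("u1", 100, "first")
posts.insert("u1", 300, "third")
posts.insert("u1", 200, "second")
print(posts.latest("u1", 2))  # [(300, 'third'), (200, 'second')]
```

Because rows are already stored newest first, "latest N posts for a user" is a read of the first N rows of one partition.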

15. Data Types (Common Core Types)

TEXT
INT
BIGINT
BOOLEAN
UUID
TIMEUUID
TIMESTAMP
FLOAT
DOUBLE
BLOB

Collection Types

LIST
SET
MAP

tags SET<TEXT>

16. Insert Data

INSERT INTO users (user_id, email, name)
VALUES (uuid(), 'a@test.com', 'Alice');

TTL (Time To Live)

INSERT INTO sessions (id, data)
VALUES (1, 'temp')
USING TTL 3600;
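
Conceptually, a TTL stores an expiry time next to the value, and reads ignore the value once that time has passed. A minimal Python sketch of that rule, with `TtlCell` as a made-up name and time measured in seconds:

```python
import time


class TtlCell:
    def __init__(self, value, ttl_seconds, written_at=None):
        self.value = value
        self.written_at = time.time() if written_at is None else written_at
        self.expires_at = self.written_at + ttl_seconds

    def read(self, now=None):
        """Return the value while it is live, None once the TTL has elapsed."""
        now = time.time() if now is None else now
        return self.value if now < self.expires_at else None


cell = TtlCell("temp", ttl_seconds=3600, written_at=0)
print(cell.read(now=3599))  # temp
print(cell.read(now=3600))  # None
```

The `written_at` and `now` parameters exist only to make the example deterministic.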

17. Select Queries

SELECT * FROM users;

SELECT * FROM posts
WHERE user_id = ?;

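⚠️ Efficient queries restrict the partition key in the WHERE clause. A SELECT without one, like the first query above, scans every partition in the cluster.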

18. Filtering (Limited)

SELECT * FROM users
WHERE email = 'a@test.com'
ALLOW FILTERING;

ALLOW FILTERING makes Cassandra scan rows and discard the ones that do not match, so it is not recommended for large datasets.

19. Update Data

UPDATE users
SET name = 'Bob'
WHERE user_id = ?;

Updates are upserts: if no row with that primary key exists, UPDATE creates it.
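
The upsert behavior follows from how writes are stored: INSERT and UPDATE both just write timestamped cells, and the newest timestamp wins per column. A small Python model of that rule, ignoring timestamp ties and using hypothetical helper names:

```python
def apply_write(table, key, columns, timestamp):
    """Model of both INSERT and UPDATE: write cells, newest timestamp wins per column."""
    row = table.setdefault(key, {})
    for name, value in columns.items():
        if name not in row or timestamp > row[name][0]:
            row[name] = (timestamp, value)


def read_row(table, key):
    """Return the current column values for a row, without timestamps."""
    return {name: value for name, (_, value) in table.get(key, {}).items()}


users = {}
# An UPDATE on a row that does not exist yet simply creates it.
apply_write(users, "u1", {"name": "Bob"}, timestamp=10)
# An older INSERT arriving late fills in email but cannot overwrite name.
apply_write(users, "u1", {"name": "Alice", "email": "a@test.com"}, timestamp=5)
print(read_row(users, "u1"))  # {'name': 'Bob', 'email': 'a@test.com'}
```

No read happens before the write, which is why Cassandra cannot tell you whether an UPDATE changed an existing row or created a new one.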

20. Delete Data

DELETE FROM users
WHERE user_id = ?;

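Deletes write tombstones (deletion markers) instead of removing data in place. Compaction purges them once gc_grace_seconds (default 864000, i.e. 10 days) has passed.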

21. Indexes

Secondary Index

CREATE INDEX ON users (email);

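Limited scalability: each node indexes only its own data, so a query on the indexed column alone has to contact many nodes.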

22. Materialized Views

CREATE MATERIALIZED VIEW users_by_email AS
SELECT * FROM users
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);

Materialized views are marked experimental and are disabled by default since Cassandra 4.0.

23. Batches

BEGIN BATCH
INSERT INTO users (...) VALUES (...);
INSERT INTO logs (...) VALUES (...);
APPLY BATCH;

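Batches provide atomicity, not speed. They are not meant for bulk loading; large multi-partition batches put extra load on the coordinator node.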

24. Counters

CREATE TABLE page_views (
  page TEXT PRIMARY KEY,
  views COUNTER
);

UPDATE page_views
SET views = views + 1
WHERE page = 'home';

25. User Defined Types (UDT)

CREATE TYPE address (
  street TEXT,
  city TEXT,
  zip INT
);

26. User Defined Functions (UDF)

CREATE FUNCTION add(a int, b int)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java
AS 'return a + b;';

27. Security & Authentication

PasswordAuthenticator
Role-based access
TLS encryption

CREATE ROLE admin
WITH PASSWORD = 'secret'
AND LOGIN = true;

28. Cassandra vs Traditional RDBMS

Feature	Cassandra	MySQL
Joins	No	Yes
Schema	Required; denormalized, query-driven	Required; normalized
Scalability	Horizontal	Primarily vertical
Consistency	Tunable	Strong (ACID)

29. Common Use Cases

Time-series data
IoT platforms
Messaging systems
Analytics ingestion
Logging systems

30. When NOT to Use Cassandra

Complex joins
Strong ACID transactions
Small datasets
Ad-hoc queries

31. Final Thoughts

Apache Cassandra is a write-optimized, distributed database built for scale and availability. Its design prioritizes fault tolerance and performance over relational flexibility.

If your system demands always-on availability and massive horizontal scale, Cassandra is one of the best choices available.