Apache Cassandra is a distributed, highly available, horizontally scalable NoSQL database designed to handle massive amounts of data across many commodity servers, with no single point of failure.
It is widely used by organizations that require:
- High write throughput
- Always-on availability
- Linear scalability
- Multi-datacenter replication
1. What is Apache Cassandra?
Apache Cassandra is a wide-column store inspired by:
- Amazon Dynamo (distributed system design)
- Google Bigtable (data model)
Core Characteristics
- Peer-to-peer architecture (no master)
- Linear horizontal scalability
- Tunable consistency
- Fault tolerance
- High write performance
- Schema-based: tables and column types are declared up front in CQL (unlike MongoDB's flexible documents)
2. Cassandra Architecture (High Level)
Cassandra uses a ring-based peer-to-peer architecture.
Key Components
- Node – A single Cassandra instance
- Cluster – A group of nodes
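- Datacenter – A group of nodes replicated as a unit, usually matching a physical site or cloud region
- Rack – A failure domain inside a datacenter; replicas are spread across racks where possible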
Client
|
Coordinator Node
|
Replica Nodes (Ring)
There is no master node. Every node is equal, and any node can act as the coordinator for a client request.
3. Gossip Protocol (Node Communication)
Cassandra nodes communicate using the Gossip Protocol.
What Gossip Does
- Node discovery
- Cluster membership
- Failure detection
- Metadata sharing
Each node periodically exchanges state information with others.
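As a rough illustration of how periodic state exchange spreads information, the sketch below models each node's view as a map from node name to heartbeat version and merges two views by keeping the higher version. The names `merge_views` and `gossip_round` are hypothetical, and real Cassandra gossip carries much richer state than a single counter.

```python
import random

def merge_views(a, b):
    """Return a view holding the highest heartbeat version seen for each node."""
    merged = dict(a)
    for node, version in b.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged

def gossip_round(views, rng):
    """Each node bumps its own heartbeat, then syncs with one random peer."""
    nodes = list(views)
    for node in nodes:
        views[node][node] = views[node].get(node, 0) + 1
        peer = rng.choice([n for n in nodes if n != node])
        merged = merge_views(views[node], views[peer])
        # Both sides leave the exchange with the same, most recent view.
        views[node] = dict(merged)
        views[peer] = dict(merged)

# Three nodes that initially know only about themselves.
views = {name: {name: 0} for name in ("n1", "n2", "n3")}
rng = random.Random(42)
for _ in range(5):
    gossip_round(views, rng)
```

A node whose heartbeat version stops increasing in other nodes' views is a candidate for being marked down, which is the intuition behind gossip-based failure detection.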
4. Partitioning & Token Ring
Token Ring
- Data is distributed using consistent hashing
- Each node owns a range of tokens
Hash(key) → Token → Node
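The Hash(key) → Token → Node lookup can be sketched with a sorted list of tokens and a binary search. This is a minimal model, not Cassandra's implementation: `TokenRing` is a hypothetical class, the ring size is arbitrary, and MD5 stands in for Murmur3 only because it ships with the Python standard library.

```python
import bisect
import hashlib

class TokenRing:
    """Toy consistent-hash ring: each node owns the range ending at its token."""

    def __init__(self, node_tokens):
        # node_tokens maps node name -> token (an integer position on the ring).
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def token_for(key, ring_size=2**32):
        # Stand-in hash; Cassandra's default partitioner uses Murmur3 instead.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % ring_size

    def owner(self, token):
        # First node whose token is >= the key's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, token)
        if index == len(self._tokens):
            index = 0
        return self._ring[index][1]

ring = TokenRing({"node-a": 1000, "node-b": 2000, "node-c": 3000})
print(ring.owner(1500))  # node-b
print(ring.owner(3500))  # node-a (wraps around)
```

Because only token ranges adjacent to a new node change hands, adding a node moves a fraction of the data rather than reshuffling everything.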
Partitioner Types
- Murmur3Partitioner (default)
- RandomPartitioner (legacy, kept for backward compatibility)
5. Replication Strategy
The replication strategy is set per keyspace and determines how many copies of each partition exist and which nodes hold them.
SimpleStrategy
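Suited to single-datacenter setups such as development or test clusters; NetworkTopologyStrategy is generally recommended for production, even with one datacenter.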
CREATE KEYSPACE test
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
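To make the replication factor concrete, here is a small sketch of SimpleStrategy-style placement: find the node that owns the token, then keep walking clockwise until enough distinct nodes are collected. The function name and the four-node ring are made up for illustration, and rack or datacenter awareness is ignored.

```python
import bisect

def simple_strategy_replicas(ring, token, replication_factor):
    """ring: list of (token, node) sorted by token. Returns replicas in ring order."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas

ring = [(1000, "node-a"), (2000, "node-b"), (3000, "node-c"), (4000, "node-d")]
print(simple_strategy_replicas(ring, 2500, 3))  # ['node-c', 'node-d', 'node-a']
```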
NetworkTopologyStrategy
Used for multi-datacenter setups.
CREATE KEYSPACE prod
WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC1': 3,
'DC2': 2
};
6. Cassandra Consistency Model
Cassandra offers tunable consistency.
Consistency Levels
- ONE
- TWO
- THREE
- QUORUM
- ALL
- LOCAL_QUORUM
- EACH_QUORUM
Example:
CONSISTENCY QUORUM;
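The consistency level is a per-request trade-off: lower levels such as ONE favor availability and latency, while higher levels such as QUORUM or ALL favor consistency but fail when too few replicas are reachable.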
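The arithmetic behind QUORUM is worth spelling out. A quorum is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts add up to more than the replication factor. The helper names below are hypothetical.

```python
def quorum(replication_factor):
    """Number of replicas that must respond at QUORUM (a strict majority)."""
    return replication_factor // 2 + 1

def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

rf = 3
print(quorum(rf))                                          # 2
print(read_sees_latest_write(quorum(rf), quorum(rf), rf))  # True
print(read_sees_latest_write(1, 1, rf))                    # False
```

With a replication factor of 3, QUORUM reads and writes tolerate one unavailable replica while still overlapping on at least one node.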
7. Write Path (Internal Flow)
Writes are very fast in Cassandra.
Write Steps
- Write to Commit Log (durability)
- Write to MemTable (in-memory)
- MemTable flushes to SSTable
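The commit log is appended sequentially and each SSTable is written in a single pass, so the write path involves no random disk writes.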
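The three write steps can be modeled in a few lines. This is a toy, in-memory sketch under simplifying assumptions: `ToyWritePath` is a hypothetical class, the flush threshold counts keys rather than bytes, and the "commit log" is just a list.

```python
class ToyWritePath:
    """Commit log -> memtable -> SSTable, heavily simplified."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory key -> value
        self.sstables = []     # each flush adds one immutable, sorted tuple of rows
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush once the memtable is full

    def flush(self):
        rows = tuple(sorted(self.memtable.items()))
        self.sstables.append(rows)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

store = ToyWritePath(flush_threshold=2)
store.write("b", 1)
store.write("a", 2)  # second distinct key triggers a flush
store.write("c", 3)
```

After these three writes the store holds one sorted SSTable with keys `a` and `b`, while `c` lives only in the memtable and the commit log.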
8. Read Path (Internal Flow)
Reads check multiple structures.
Read Steps
- Check MemTable
- Check Bloom Filter
- Read from SSTables
- Merge results
- Apply tombstones
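A simplified version of the merge step looks like this: collect every version of the key from the memtable and the SSTables, keep the one with the newest timestamp, and treat a tombstone as "no value". The data layout (dicts of key to `(timestamp, value)`) and the `TOMBSTONE` marker are assumptions made for the sketch.

```python
TOMBSTONE = object()  # marker standing in for a deletion record

def read(key, memtable, sstables):
    """memtable and each sstable map key -> (timestamp, value). Newest timestamp wins."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        # A real node would consult the SSTable's Bloom filter before touching disk.
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    _, value = max(candidates, key=lambda cell: cell[0])
    return None if value is TOMBSTONE else value

memtable = {"k1": (30, "new")}
sstables = [
    {"k1": (10, "old"), "k2": (12, "kept")},
    {"k2": (20, TOMBSTONE), "k3": (5, "only-on-disk")},
]
print(read("k1", memtable, sstables))  # new
print(read("k2", memtable, sstables))  # None (the tombstone is newer)
```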
9. Storage Engine Internals
Commit Log
- Append-only log
- Crash recovery
MemTable
- In-memory sorted structure
- Flushed to disk
SSTable
- Immutable disk file
- Sorted by partition token, then by clustering columns within each partition
Bloom Filter
- Probabilistic structure
- Avoids unnecessary disk reads
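To show why a Bloom filter can rule out an SSTable without reading it, here is a minimal sketch. The sizes, the number of hash functions, and the use of SHA-256 are arbitrary choices for illustration, not what Cassandra uses.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions by salting the hash input with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
for key in ("alice", "bob"):
    bf.add(key)
```

A `False` answer is always correct, so the SSTable can be skipped; a `True` answer may be a false positive and still requires a disk read.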
10. Compaction
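Compaction merges SSTables in the background, discarding overwritten values and expired tombstones so that reads touch fewer files.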
Compaction Strategies
- SizeTieredCompactionStrategy (default)
- LeveledCompactionStrategy
- TimeWindowCompactionStrategy
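Whatever the strategy, the core merge is similar. The sketch below keeps the newest version of each key and drops tombstones older than a grace period, which plays the role of the `gc_grace_seconds` table option. The data layout and the use of `None` as a tombstone are assumptions, and how a strategy picks which SSTables to merge is left out.

```python
TOMBSTONE = None  # in this sketch a None value marks a deleted key

def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables (dicts of key -> (timestamp, value)) into one sorted list of rows."""
    newest = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    merged = []
    for key in sorted(newest):
        timestamp, value = newest[key]
        expired_tombstone = value is TOMBSTONE and now - timestamp > gc_grace_seconds
        if not expired_tombstone:
            merged.append((key, timestamp, value))
    return merged

old = {"a": (100, "v1"), "b": (100, "v1")}
new = {"a": (200, "v2"), "b": (150, TOMBSTONE)}
print(compact([old, new], now=1000, gc_grace_seconds=500))  # [('a', 200, 'v2')]
```

With a longer grace period the tombstone for `b` would survive the merge, which is what lets replicas that missed the delete learn about it later.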
11. Cassandra Data Model
Core Concepts
- Keyspace
- Table
- Partition Key
- Clustering Columns
- Columns
- Rows
Example Table Structure
Partition Key → decides which nodes store the row
Clustering Columns → decide the sort order of rows within a partition
12. Keyspaces (Full Syntax)
CREATE KEYSPACE app
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
DESCRIBE KEYSPACES;
DROP KEYSPACE app;
13. Tables (Full Syntax)
CREATE TABLE users (
user_id UUID,
email TEXT,
name TEXT,
created_at TIMESTAMP,
PRIMARY KEY (user_id)
);
Compound Primary Key (partition key plus clustering column)
PRIMARY KEY ((user_id), created_at)
14. Clustering Order
CREATE TABLE posts (
user_id UUID,
created_at TIMESTAMP,
post TEXT,
PRIMARY KEY ((user_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
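One way to picture the `posts` table is as a map from partition key to a list of rows kept sorted by the clustering column. The sketch below builds that layout in memory; `build_partitions` is a hypothetical helper, and plain integers stand in for timestamps.

```python
from collections import defaultdict

def build_partitions(rows, descending=True):
    """Group (user_id, created_at, post) rows by partition key, sorted by clustering column."""
    partitions = defaultdict(list)
    for user_id, created_at, post in rows:
        partitions[user_id].append((created_at, post))
    for posts in partitions.values():
        posts.sort(key=lambda row: row[0], reverse=descending)
    return dict(partitions)

rows = [
    ("u1", 1, "first"),
    ("u2", 5, "hello"),
    ("u1", 3, "third"),
    ("u1", 2, "second"),
]
partitions = build_partitions(rows)
print(partitions["u1"])  # [(3, 'third'), (2, 'second'), (1, 'first')]
```

Because rows are stored in this order, a "latest posts for a user" query reads the start of a single partition instead of sorting at query time.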
15. Data Types (Common Core Types)
- TEXT
- INT
- BIGINT
- BOOLEAN
- UUID
- TIMEUUID
- TIMESTAMP
- FLOAT
- DOUBLE
- BLOB
Collection Types
- LIST
- SET
- MAP
tags SET<TEXT>
16. Insert Data
INSERT INTO users (user_id, email, name)
VALUES (uuid(), 'a@test.com', 'Alice');
TTL (Time To Live)
INSERT INTO sessions (id, data)
VALUES (1, 'temp')
USING TTL 3600;
17. Select Queries
SELECT * FROM users;
SELECT * FROM posts
WHERE user_id = ?;
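⚠️ Apart from full-table scans like the first query, Cassandra expects the partition key in the WHERE clause; filtering on other columns needs an index or ALLOW FILTERING.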
18. Filtering (Limited)
SELECT * FROM users
WHERE email = 'a@test.com'
ALLOW FILTERING;
Not recommended for large datasets: without the partition key, the query may scan every partition on every node.
19. Update Data
UPDATE users
SET name = 'Bob'
WHERE user_id = ?;
Updates are upserts: if no row with that primary key exists, UPDATE creates one.
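The upsert behavior follows from how cells are stored: each write carries a timestamp and the newest one wins per column. Here is a minimal sketch of that rule, with a hypothetical `upsert` helper and a plain dict standing in for a table.

```python
def upsert(table, key, columns, timestamp):
    """Apply a write where the newest timestamp wins per column."""
    # Cassandra resolves versions at read or compaction time; this sketch
    # resolves them eagerly to keep the example short.
    row = table.setdefault(key, {})
    for name, value in columns.items():
        current = row.get(name)
        # Tie-breaking on equal timestamps is simplified here (the later call wins).
        if current is None or timestamp >= current[0]:
            row[name] = (timestamp, value)

users = {}
upsert(users, "u1", {"name": "Alice", "email": "a@test.com"}, timestamp=1)  # acts like INSERT
upsert(users, "u1", {"name": "Bob"}, timestamp=2)                            # acts like UPDATE
upsert(users, "u2", {"name": "Carol"}, timestamp=3)                          # UPDATE on a missing row creates it
upsert(users, "u1", {"name": "Stale"}, timestamp=1)                          # older write loses
```

Note that the second write only touches `name`; the `email` cell written earlier is left as it was.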
20. Delete Data
DELETE FROM users
WHERE user_id = ?;
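Deletes do not remove data immediately. They write tombstones (deletion markers) that compaction purges after the table's gc_grace_seconds has passed.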
21. Indexes
Secondary Index
CREATE INDEX ON users (email);
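Limited scalability: a query on a secondary index without the partition key may have to contact many nodes, so it suits low-volume lookups better than hot query paths.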
22. Materialized Views
CREATE MATERIALIZED VIEW users_by_email AS
SELECT * FROM users
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);
23. Batches
BEGIN BATCH
INSERT INTO users (...) VALUES (...);
INSERT INTO logs (...) VALUES (...);
APPLY BATCH;
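Batches are meant for keeping related writes atomic, not for bulk loading; large multi-partition batches put heavy load on the coordinator.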
24. Counters
CREATE TABLE page_views (
page TEXT PRIMARY KEY,
views COUNTER
);
UPDATE page_views
SET views = views + 1
WHERE page = 'home';
25. User Defined Types (UDT)
CREATE TYPE address (
street TEXT,
city TEXT,
zip INT
);
26. User Defined Functions (UDF)
CREATE FUNCTION add_ints(a int, b int)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java
AS 'return a + b;';
27. Security & Authentication
- PasswordAuthenticator
- Role-based access
- TLS encryption
CREATE ROLE admin
WITH PASSWORD = 'secret'
AND LOGIN = true;
28. Cassandra vs Traditional RDBMS
| Feature | Cassandra | MySQL |
|---|---|---|
| Joins | No | Yes |
| Schema | Declared per table, designed around queries | Declared, normalized |
| Scalability | Horizontal | Primarily vertical |
| Consistency | Tunable per request | Strong (ACID) |
29. Common Use Cases
- Time-series data
- IoT platforms
- Messaging systems
- Analytics ingestion
- Logging systems
30. When NOT to Use Cassandra
- Complex joins
- Strong ACID transactions
- Small datasets
- Ad-hoc queries
31. Final Thoughts
Apache Cassandra is a write-optimized, distributed database built for scale and availability. Its design prioritizes fault tolerance and performance over relational flexibility.
If your system demands always-on availability and massive horizontal scale, and your queries are known in advance, Cassandra is a strong candidate.