Priyanshu Mukherjee
Ceph Explained: The Distributed Storage Backbone Powering Modern Infrastructure

01 — Introduction

Note: In 2004, a PhD student at UC Santa Cruz named Sage Weil began writing code at a summer internship at Lawrence Livermore National Laboratory. That code would eventually become Ceph — today one of the most widely deployed open-source distributed storage systems on the planet.

Ceph is an open-source, software-defined storage platform designed to deliver object, block, and file storage from a single unified system. Built to be self-healing and self-managing, it targets two challenges that haunt every storage engineer at scale: resilience and growth. It achieves both without proprietary hardware — running entirely on commodity servers connected via standard Ethernet.

Its architecture is fundamentally distributed: data is spread across multiple servers, or nodes, in a cluster, ensuring high availability, fault tolerance, and horizontal scalability all the way to the exabyte level. At its lowest level, Ceph stores everything as objects and then exposes those objects as block devices, file systems, or S3-compatible object storage, depending on what clients need.

Key Stats

  • Scale: Exabyte-class ceiling.
  • Nodes: 3+ minimum for production.
  • History: 20+ years of track record.

In 2014, Red Hat acquired Inktank — Ceph's commercial backer. In 2018, the Linux Foundation launched the Ceph Foundation to broaden governance and industry support. Contributions now flow from IBM, SUSE, Samsung, Intel, and hundreds of independent engineers worldwide. The project remains open-source, and vendors follow an upstream-first policy: changes land in the community project before shipping in commercial products.


02 — Architecture: The Cluster Map & Core Daemons

According to the official Ceph documentation, a Ceph Storage Cluster is built on RADOS — the Reliable Autonomic Distributed Object Store — a reliable distributed storage service that pushes intelligence to every node in the cluster rather than relying on a central brain. RADOS is the bedrock upon which every higher-level storage interface sits.

From docs.ceph.com/architecture
Ceph provides an infinitely scalable storage cluster based on RADOS — a distributed storage service that uses the intelligence in each of its nodes to secure data and provide it to clients, without relying upon a central lookup table.


The cluster topology is described in the Cluster Map, which is actually a collection of five sub-maps tracking monitors, OSDs, placement groups, the CRUSH hierarchy, and metadata servers. Every client and every OSD daemon holds a copy, eliminating single points of failure in the metadata plane.

Five key daemon types make up a Ceph cluster.


The Five Key Daemons

  1. Monitor (MON): Maintains the master copy of the cluster map. Multiple monitors ensure cluster availability if one fails. Manages cluster membership, authentication, and consensus via Paxos.
  2. OSD Daemon: The workhorse. One OSD per physical/logical storage device. Handles all reads, writes, replication, recovery, scrubbing, and rebalancing autonomously.
  3. Manager (MGR): Runs alongside monitors, serving as the endpoint for monitoring, orchestration, and plug-in modules, including the Dashboard UI, Prometheus exporter, and Balancer.
  4. Metadata Server (MDS): Required only for CephFS. Manages directory and file name mappings to RADOS objects. The MDS cluster scales dynamically and rebalances metadata ranks to distribute load evenly.
  5. RADOS Gateway (RGW): Exposes the underlying RADOS layer as a RESTful HTTP interface compatible with both Amazon S3 and OpenStack Swift APIs. Ideal for bulk object workloads.

All five daemon types are fully distributed and can run on the same physical nodes (hyper-converged) or on separate, dedicated servers. Clients always interact directly with the relevant daemon type — there is no routing layer or proxy in the hot path.


03 — Data Placement: The CRUSH Algorithm

The single most distinctive piece of Ceph's design is CRUSH — Controlled Replication Under Scalable Hashing. Both clients and OSD daemons use CRUSH to calculate where any given piece of data lives, rather than looking it up in a central directory. This is the key reason Ceph has no performance bottleneck in the metadata path.

CRUSH takes an object identifier and a cluster map as input and produces a deterministic list of OSDs where that object's replicas (or erasure-coded shards) should reside. Because any client with a current cluster map can run this computation locally, reads and writes go directly to OSD daemons with no intermediate broker.
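The deterministic-placement idea can be sketched with a toy stand-in (illustrative only: real CRUSH walks a weighted bucket hierarchy using the straw2 algorithm, while this simply ranks OSDs by a hash of object ID plus OSD name):

```python
import hashlib

def crush_like_select(object_id: str, osds: list[str], replicas: int) -> list[str]:
    """Toy stand-in for CRUSH: any party holding the same OSD list computes
    the same placement from the object ID alone -- no lookup service."""
    def draw(osd: str) -> int:
        # Hash object ID and OSD name together; highest draws win.
        h = hashlib.sha256(f"{object_id}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=draw, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
# Every client running this gets the identical 3-OSD placement.
placement = crush_like_select("rbd_data.1234", osds, replicas=3)
```

The key property on display is determinism: rerunning the function with the same inputs always yields the same OSD list, which is what lets clients talk to OSDs directly.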

CRUSH Topology Example

ROOT (datacenter)
├─ RACK-A
│  ├─ HOST-01
│  │  ├─ OSD.0 [NVMe]
│  │  └─ OSD.1 [NVMe]
│  └─ HOST-02
│     ├─ OSD.2 [SSD]
│     └─ OSD.3 [SSD]
└─ RACK-B
   └─ HOST-03
      ├─ OSD.4 [HDD]
      └─ OSD.5 [HDD]

The CRUSH map encodes your cluster's physical topology — hosts, chassis, racks, power distribution units, pods, rows, rooms, data centers — as a hierarchy of "buckets." CRUSH placement rules then ensure replicas are spread across different failure domains, so a rack power failure or network partition cannot destroy a full data copy.

CRUSH distributes data objects according to a per-device weight value, approximating a uniform probability distribution. By distributing the CRUSH map to OSD daemons, Ceph empowers OSDs to handle replication, backfilling, and recovery autonomously, without waiting for instruction from a central coordinator.

Placement Groups

Objects are not mapped directly to OSDs. Instead, Ceph maps objects to Placement Groups (PGs), and PGs are then mapped to OSDs. This two-layer indirection dramatically reduces the amount of state that must be tracked when OSDs join, leave, or fail. An autoscaler continuously analyzes pools and adjusts PG counts to maintain balanced data distribution — typically targeting around 100–200 PGs per OSD.
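The object-to-PG step can be sketched in a few lines (crc32 here is a stand-in for Ceph's actual rjenkins hash and "stable mod"; the PG count and object names are invented for illustration):

```python
import zlib

PG_NUM = 128  # pools carry a power-of-two-ish PG count, managed by the autoscaler

def object_to_pg(name: str, pg_num: int = PG_NUM) -> int:
    # Hash the object name, then take it modulo the pool's PG count.
    return zlib.crc32(name.encode()) % pg_num

# Ten thousand objects collapse onto at most PG_NUM placements: when OSDs
# join or fail, only the PG->OSD mappings move, not one entry per object.
pgs = {object_to_pg(f"obj-{i}") for i in range(10_000)}
```

This is the point of the indirection: cluster state scales with the number of PGs, not with the (far larger) number of objects.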

The Balancer module, available in the Manager daemon, further optimizes PG distribution automatically, maximizing utilization and spreading I/O load evenly across all OSDs.

04 — Storage Interfaces

All three of Ceph's storage interfaces sit on top of the same RADOS layer, which means you get consistent durability, replication, and self-healing behavior regardless of which protocol your application speaks.

| Interface | Name | Use Case |
| --- | --- | --- |
| Object | RGW | Exposes RADOS as an HTTP API compatible with Amazon S3 and OpenStack Swift. Ideal for bulk data: AI/ML data lakes, backups, archives, IoT telemetry, media, and container image registries. |
| Block | RBD | Stripes virtual disk images across OSDs. Inherits librados capabilities: thin provisioning, snapshots, clones. Integrates natively with QEMU/KVM, libvirt, OpenStack Cinder, and Kubernetes via CSI. |
| File | CephFS | A POSIX-compliant distributed file system. The MDS maps directories and file names to RADOS objects. Clients mount via the Linux kernel module or FUSE. Used for HPC shared storage, analytics, and log collection. |

For developers who need raw access without going through any of these gateways, librados provides native client libraries in C, C++, Python, Java, and PHP — enabling custom interfaces directly against the RADOS object store.

Key capability
By decoupling storage from underlying hardware and providing object, block, and file access from a single cluster, Ceph allows administrators to maintain a unified storage estate — eliminating storage silos and simplifying replication, data protection, and capacity planning across all workload types.


05 — BlueStore: The Modern OSD Engine

Since the Luminous (12.2) release, the default and recommended OSD storage backend has been BlueStore. Unlike the legacy FileStore backend (deprecated for years and removed entirely in the Reef release), BlueStore reads and writes to raw block devices directly — without an intervening file system like XFS. This eliminates the double-write penalty and the overhead of maintaining a separate journal on top of a POSIX filesystem.

What BlueStore brings to the table

BlueStore manages its internal metadata with an embedded RocksDB key-value store, mapping object names to block locations on disk. Every piece of data and metadata written to disk is protected by checksums by default — nothing is returned to the user without verification. BlueStore also supports inline compression and multi-device tiering: a fast NVMe or SSD can serve as the write-ahead log (WAL) device or metadata DB device, while bulk data lives on slower, cheaper HDDs.
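As a sketch of provisioning such a tiered layout, a hypothetical ceph-volume invocation might look like the following (the device paths are placeholders; consult the ceph-volume documentation for your release):

```shell
# Bulk data on an HDD, RocksDB metadata on a fast NVMe partition.
# /dev/sdb and /dev/nvme0n1p1 are placeholder device paths.
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1
```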

Replication vs. Erasure Coding

Ceph pools can protect data in two ways. The default — replication — stores N full copies of every object across N different OSDs. Three-way replication is the common production default. Erasure coding offers a more space-efficient alternative: data is split into K data chunks and M parity chunks (K+M total), stored across K+M OSDs. If up to M OSDs fail simultaneously, the parity chunks allow full reconstruction.

For example, a k=4, m=2 erasure-coded pool stores 1 TB of usable data in 1.5 TB of raw capacity — compared to 3 TB for 3-way replication. Erasure coding is particularly well-suited to read-heavy cold-storage workloads like object archives, data lakes, and backup targets. Erasure-coded pools must reside on BlueStore OSDs, since BlueStore's checksumming is essential for reliable deep scrubbing across parity shards.
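The arithmetic behind those numbers is simple; overhead is just the raw-to-usable ratio, sketched here:

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw-to-usable ratio for a k+m erasure-coded pool."""
    return (k + m) / k

def replica_overhead(n: int) -> float:
    """Raw-to-usable ratio for n-way replication."""
    return float(n)

usable_tb = 1.0
raw_ec  = usable_tb * ec_overhead(4, 2)       # k=4, m=2: 1.5 TB raw per usable TB
raw_rep = usable_tb * replica_overhead(3)     # 3-way replication: 3.0 TB raw
```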

From docs.ceph.com
An erasure-coded pool with k=6, m=3 can survive the simultaneous failure of up to three full data centers with only 50% storage overhead — compared to 400% overhead for equivalent 4-way replication.

  • Replication: Stores N full copies (usually 3).
  • Erasure Coding (EC): Splits data into K data and M parity chunks (K+M total). More space-efficient for cold storage.

Data Scrubbing

To detect and repair silent corruption, Ceph OSDs run two levels of scrubbing. Light scrubbing (typically daily) compares object metadata across replicas. Deep scrubbing (typically weekly) performs bit-for-bit comparison of object data, catching bad blocks that light scrubbing misses. BlueStore's native checksumming adds a further layer of integrity verification on every read.
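The principle behind checksummed reads can be sketched in a few lines (crc32 here stands in for BlueStore's per-block checksums, which default to crc32c; the payload is invented):

```python
import zlib

def store(data: bytes) -> tuple[bytes, int]:
    """Persist data alongside its checksum, as BlueStore does per block."""
    return data, zlib.crc32(data)

def verified_read(data: bytes, checksum: int) -> bytes:
    """Refuse to return data whose checksum no longer matches -- the same
    principle as deep scrubbing, applied on every single read."""
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: silent corruption detected")
    return data

blob, crc = store(b"object payload")
verified_read(blob, crc)            # intact data passes verification

corrupted = b"objext payload"       # one flipped byte
```

Because a CRC detects any single-byte corruption, reading `corrupted` against the stored checksum raises an error instead of silently returning bad data.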


06 — Ecosystem Integrations

Ceph has become a foundational layer in modern cloud-native and private cloud deployments precisely because it integrates deeply with the platforms infrastructure engineers already run.

OpenStack

Ceph is the most widely used storage backend for OpenStack deployments. RBD images back Nova instance disks and Cinder volumes. The RADOS Gateway provides object storage for Glance (image service) and Swift-compatible workloads. Libvirt integrates with QEMU/KVM via librbd, enabling live migration and thin-provisioned disk images across compute nodes.

Kubernetes & Cloud-Native

The Rook operator makes Ceph a first-class citizen in Kubernetes clusters, managing the full Ceph lifecycle via Kubernetes CRDs. Rook-Ceph exposes PersistentVolumes backed by RBD (block), CephFS (file), and RGW (object), covering virtually every storage need a stateful Kubernetes application might have. The official Ceph CSI driver provides both RWO (ReadWriteOnce) and RWX (ReadWriteMany) access modes; RWX is essential for workloads that share data, such as machine learning pipelines.
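As an illustrative sketch, a ceph-csi RBD StorageClass might look roughly like this (the clusterID, pool name, and omitted secret references are placeholders you would fill in for a real cluster):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd            # hypothetical name
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-fsid> # placeholder: your Ceph cluster's fsid
  pool: kubernetes          # placeholder: an RBD pool you created
  imageFeatures: layering
reclaimPolicy: Delete
allowVolumeExpansion: true
```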

AI/ML & Data Lakes

As AI workloads demand increasingly large training datasets, Ceph's RADOS Gateway has become a go-to data lake backend. Its S3-compatible API means existing tooling (Spark, Presto, Trino, PyArrow) connects without modification. Erasure coding reduces the raw storage cost of retaining massive model checkpoints and training corpora, while Ceph's linear scalability means capacity can grow without architectural changes.

Monitoring & Management

The Ceph Dashboard — available through the Manager daemon since the Nautilus release — provides a web UI for cluster health, pool management, OSD status, and RGW configuration. The Manager also ships a Prometheus exporter module, feeding cluster metrics into any Prometheus/Grafana observability stack. Zabbix integration is available via a plugin. Ceph's CLI (`ceph status`, `ceph osd tree`, `ceph df`) remains the most direct path to operational insight.


07 — Strengths, Trade-offs, and When to Use Ceph

Ceph is powerful and proven — but it is also genuinely complex. Understanding both sides helps you deploy it well, or decide when a simpler solution might serve better.

Where Ceph excels

Ceph is an excellent fit for organizations operating private cloud infrastructure, running OpenStack or Kubernetes at meaningful scale, or building data platforms that must grow continuously without storage-layer migrations. Its software-defined nature means you can run it on commodity hardware, on heterogeneous disk types, and expand capacity by simply adding nodes — no forklift upgrades, no downtime.

The self-healing architecture is a genuine operational advantage. When an OSD or host fails, the cluster detects the failure, remaps affected placement groups, and begins backfilling data to meet the configured replication factor — all without administrator intervention. Recovery speed scales with cluster size: larger clusters recover faster because more OSDs contribute recovery bandwidth in parallel.

Trade-offs to understand

Ceph carries a steep learning curve. Designing a CRUSH map correctly for your topology, tuning placement group counts, selecting the right erasure-coding profile, and sizing monitor and OSD hardware require deliberate planning. Under-sized monitor nodes in large clusters can accumulate hundreds of gigabytes in their databases. PG misconfiguration causes data imbalance that silently degrades performance.

Network architecture matters enormously. Ceph is a chatty system — replication traffic, recovery backfill, and heartbeat messages all flow over the cluster network. Production deployments separate the public network (client-facing) from the cluster network (OSD-to-OSD) and provision plenty of bandwidth — 25 GbE or higher is common for serious deployments.
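A minimal sketch of that split in ceph.conf (the subnets are invented placeholders for your own addressing plan):

```ini
[global]
public_network  = 10.10.0.0/24   # client, MON, and MGR traffic
cluster_network = 10.20.0.0/24   # OSD replication, backfill, heartbeats
```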

OPERATIONAL PRINCIPLE

Ceph rewards careful initial planning. Invest in sizing, network design, and CRUSH map design upfront, and the system will manage itself for years. Shortcut those decisions and you'll pay in operational complexity later.


08 — Conclusion

The Storage Substrate of an Entire Era

Twenty years after Sage Weil's first commit, Ceph occupies a uniquely durable position in the storage landscape. It is not the simplest option. It is not the fastest for every workload. But for the class of problems it was built to solve — building a unified, self-healing, infinitely scalable storage substrate on commodity hardware — nothing open-source comes close.

As AI infrastructure, cloud-native platforms, and data-intensive applications converge, the requirement for a storage layer that speaks object, block, and file from a single consistent foundation will only grow. Ceph, with twenty years of battle-testing, an active global community, and a governance structure built for longevity, is well-positioned to remain that foundation.

Whether you are architecting a private OpenStack cloud, building a Kubernetes-native platform, or designing a petabyte-scale AI data lake, understanding Ceph deeply is time well spent. The documentation at docs.ceph.com is comprehensive, actively maintained, and the definitive reference for everything from initial deployment to advanced erasure-coding profiles and CRUSH map engineering.

