Introduction
This article explains a data corruption issue that occurred in Rook in 2021. The root cause lies in an unexpected place and can also occur in any Ceph environment. Interestingly, Rook only started to hit this problem recently, even though the underlying problem has existed for a long time; that is due to a series of coincidences. I wrote this article because the word "Atari" was being used in a non-historical context in 2021.
This article is a restructured version of the information written in Rook's official documentation, with additional information for those who are not familiar with Rook.
Glossary
- Ceph: An open source distributed storage system
- Rook: An orchestrator for Ceph running on Kubernetes. It is also open source
- OSD: A data structure stored on disk; one OSD usually corresponds to one disk in a Ceph cluster
- OSD on disk: One of the methods to create OSDs on Rook. The device path is directly written in the Rook configuration. Details will be discussed later
- Atari partition: The partition format used by the Atari ST, a computer that is no longer produced
Problem Summary
- The problem:
  - OSD data gets corrupted
- Root cause:
  - The disk containing the OSD is mistakenly recognized as having an Atari partition, and Rook creates an OSD on that (non-existent) partition
- Occurrence conditions:
  - Using Rook v1.6.0 to v1.6.7
  - Creating an OSD on disk on an unpartitioned disk
- Mitigation:
  - Update to Rook v1.6.8 or later, or to Rook v1.7
- Recovery from data corruption:
  - Impossible. The only option is to recreate the OSD on the disk holding the corrupted OSD using this procedure
Mechanism
Let's assume that Rook tries to create an OSD on a disk called /dev/sdb:
1. Rook creates an OSD on /dev/sdb.
2. The Linux kernel mistakenly recognizes /dev/sdb as containing an Atari partition table rather than an OSD.
3. Rook finds some "phantom" empty Atari partitions that appear usable for creating OSDs.
4. Rook creates an OSD on the empty partitions mentioned in step 3, resulting in data corruption of the OSD on /dev/sdb.
Details
To understand the problem in detail, some knowledge about Rook, Ceph, and Atari partitions is required. I will first explain this prerequisite knowledge, and then describe the actual flow leading up to the problem.
Configuring OSD on disk in Rook
When creating an OSD in Rook, you describe the conditions under which OSDs should be created in a CephCluster Custom Resource (CR). To create an OSD on disk, you specify one of the following:
- The path of the device where you want to create the OSD ("/dev/sdb", etc.)
- A regular expression to match the device ("/dev/sd.*", etc.)
- The setting "create on all unused devices" (useAllDevices: true)
Please refer to the official documentation for details.
When Rook runs, it creates OSDs on each device according to the CephCluster CR as follows:
- On each node that makes up the Rook cluster, Rook executes ceph-volume, a command provided by Ceph.
- The ceph-volume command lists the devices present in the system and reports whether each one is empty and can be used to create an OSD.
- Rook creates OSDs on devices that are empty and match the settings in the CephCluster CR.
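The selection in the flow above can be sketched roughly as follows. This is a simplified model, not Rook's actual code: `struct device`, its fields, and `select_osd_targets()` are hypothetical names introduced only for illustration, and matching against the CR is reduced to a path-prefix check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one entry reported by the device listing. */
struct device {
    const char *path;   /* e.g. "/dev/sdb" */
    bool available;     /* true if the device is reported as empty */
};

/*
 * Store in `out` a pointer to every device that is reported as available and
 * whose path starts with `prefix` (a stand-in for the filter in the CR).
 * `out` must have room for `n` pointers. Returns the number selected.
 */
static size_t select_osd_targets(const struct device *devs, size_t n,
                                 const char *prefix,
                                 const struct device **out)
{
    size_t count = 0;

    assert(devs != NULL && prefix != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].available &&
            strncmp(devs[i].path, prefix, strlen(prefix)) == 0)
            out[count++] = &devs[i];
    }
    return count;
}
```

Given /dev/sda (in use), /dev/sdb (empty), and /dev/nvme0n1 (empty) with the prefix "/dev/sd", only /dev/sdb is selected. The point that matters for the rest of this article is that the decision depends entirely on what the listing reports as empty.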
OSD Formats in Ceph
When Ceph creates an OSD on a device, it writes OSD metadata to the device. There are two formats of OSD in Ceph, each with different locations for writing metadata.
- lvm mode OSD: Create a Volume Group (VG) in LVM on the device, then create a Logical Volume (LV) within it, and write the OSD metadata to the beginning of the LV.
- raw mode OSD: Write the OSD metadata to the beginning of the device.
Ceph only had lvm mode at first, and later introduced raw mode OSDs, which are simpler and easier to manage. In Rook, starting from v1.6.0, raw mode OSDs are created for OSD on disk.
Atari Partition Recognition Method in the Linux Kernel
The Linux kernel's check for Atari partitions is much more lenient than its checks for other partition formats. To determine whether the fundamental problem lies in the Linux kernel or in the Atari partition specification itself, it would be necessary to look at the specification, but since I could not find it, the investigation was limited to the kernel source code.
A disk is judged to have an Atari partition table if at least one valid partition entry is found in the area at the beginning of the disk. There can be up to 4 partitions [1], and each entry is checked with the VALID_PARTITION() macro.
    rs = read_part_sector(state, 0, &sect);
    if (!rs)
        return -1;

    /* Verify this is an Atari rootsector: */
    hd_size = get_capacity(state->disk);
    if (!VALID_PARTITION(&rs->part[0], hd_size) &&
        !VALID_PARTITION(&rs->part[1], hd_size) &&
        !VALID_PARTITION(&rs->part[2], hd_size) &&
        !VALID_PARTITION(&rs->part[3], hd_size)) {
        /*
         * if there's no valid primary partition, assume that no Atari
         * format partition table (there's no reliable magic or the like
         * :-()
         */
        put_dev_sector(sect);
        return 0;
    }
According to the comment, the Atari format has no reliable magic number or similar marker, something that other partition table formats naturally have.
The definition of VALID_PARTITION() is as follows.
    /* check if a partition entry looks valid -- Atari format is assumed if at
       least one of the primary entries is ok this way */
    #define VALID_PARTITION(pi,hdsiz) \
        (((pi)->flg & 1) && \
         isalnum((pi)->id[0]) && isalnum((pi)->id[1]) && isalnum((pi)->id[2]) && \
         be32_to_cpu((pi)->st) <= (hdsiz) && \
         be32_to_cpu((pi)->st) + be32_to_cpu((pi)->siz) <= (hdsiz))
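It is quite frightening that a disk can be mistakenly recognized as having an Atari partition table just by meeting such loose conditions: a flag bit is set, three ID bytes are alphanumeric, and the start and end of the partition fit within the disk.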
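To get a feel for how little it takes, the check can be reproduced in user space. The following is a minimal sketch, not the kernel code itself: `struct partition_info` here is a stand-in that assumes a 12-byte entry (one flag byte, a three-character ID, and big-endian 32-bit start and size fields), and `be32()` and `valid_partition()` are helper names made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for one partition entry (assumed 12-byte layout). */
struct partition_info {
    uint8_t flg;      /* bit 0 set: entry is treated as in use */
    char    id[3];    /* three-character partition ID */
    uint8_t st[4];    /* start sector, big-endian */
    uint8_t siz[4];   /* size in sectors, big-endian */
};

/* Decode a big-endian 32-bit value, like be32_to_cpu() in the kernel. */
static uint32_t be32(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Mirrors the conditions of VALID_PARTITION() term by term. */
static int valid_partition(const struct partition_info *pi, uint64_t hd_size)
{
    assert(pi != NULL);
    return (pi->flg & 1) &&
           isalnum((unsigned char)pi->id[0]) &&
           isalnum((unsigned char)pi->id[1]) &&
           isalnum((unsigned char)pi->id[2]) &&
           be32(pi->st) <= hd_size &&
           be32(pi->st) + be32(pi->siz) <= hd_size;
}
```

For example, an entry whose 12 bytes are `01 61 62 63 00 00 10 00 00 00 20 00` passes on any disk with at least 12288 sectors: the low flag bit is set, "abc" is alphanumeric, and start sector 4096 plus size 8192 fits on the disk. Nothing in those bytes is specific to Atari.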
The process leading to the problem
Let's consider the case of trying to create an OSD on /dev/sdb again. For simplicity, let's assume that the Rook configuration is set to create OSDs on all available devices.
First, Rook runs and creates an OSD on /dev/sdb. So far, so good. From v1.6.0 to v1.6.7, a raw mode OSD is created, so the OSD metadata is written to the beginning of the disk. Unfortunately, this OSD metadata has a bit pattern that is easily mistaken for an Atari partition table [2].
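You can identify devices where the misrecognition has occurred from the output of the lsblk command, as follows. In this example the disk is named vdb rather than sdb, and the rest of this section follows that naming.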
vdb 252:16 0 3T 0 disk
├─vdb2 252:18 0 48G 0 part # phantom Atari Partition
└─vdb3 252:19 0 6.1M 0 part # same as above
Interestingly, tools like lsblk, blkid, udevadm, and parted have confused users and developers because they do not recognize Atari partitions: they can make it appear as if vdb2 and vdb3 do not exist, or report the partition table type as unknown. This is why the misrecognized partitions are called "phantom."
After this, the next time Rook runs for whatever reason, the ceph-volume command is executed, and it mistakenly concludes that /dev/vdb contains an Atari partition table, along with some phantom partitions, instead of an OSD. Let's assume that /dev/vdb2 is the phantom partition here. As a result of this misrecognition, /dev/vdb is reported as in use, but /dev/vdb2 is reported as empty and usable for creating an OSD.
Finally, because the user has instructed Rook to create OSDs on all free devices, Rook creates a new OSD on /dev/vdb2. This partially destroys the data of the original OSD on /dev/vdb.
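That the new OSD necessarily lands on top of the old one follows from the partition check itself: a phantom partition is only accepted if its start and end lie within the disk. A small sketch makes the overlap explicit. It assumes 512-byte sectors, and `struct byte_range`, `partition_range()`, and `range_contains()` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL  /* assumption: 512-byte logical sectors */

/* Half-open byte range [start, end) on the underlying disk. */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/* Byte range covered by a partition whose position is given in sectors. */
static struct byte_range partition_range(uint32_t start_sector,
                                         uint32_t size_sectors)
{
    struct byte_range r;

    r.start = (uint64_t)start_sector * SECTOR_SIZE;
    r.end = r.start + (uint64_t)size_sectors * SECTOR_SIZE;
    return r;
}

/* True if `inner` lies completely inside `outer`. */
static bool range_contains(struct byte_range outer, struct byte_range inner)
{
    assert(outer.start <= outer.end && inner.start <= inner.end);
    return outer.start <= inner.start && inner.end <= outer.end;
}
```

For an OSD that occupies a whole 3 TiB disk, a 48 GiB phantom partition such as vdb2 maps to a byte range that lies entirely inside the OSD. The start sector is not shown by lsblk, so any value plugged into `partition_range()` for an experiment is an assumption. Creating a second OSD in that range overwrites part of the first.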
History of Handling the Issue
There are three parties involved in this issue: Rook, Ceph, and the Linux Kernel. In fixing such issues, it is important to consider "which layer can fix it" and "which layer should fix it."
Various workarounds were proposed after many twists and turns, but it was determined that not much could be done at a superficial level. As a result, Rook incorporated a fix that creates lvm mode OSDs when creating OSD on disk.
In addition, a fix that makes ceph-volume ignore phantom Atari partitions has been made in Ceph and was included in the v16.2.6 release. Once v16.2.6 was released, Rook started using raw mode again for OSD on disk when running with that version.
As for the Linux kernel, Ubuntu has made a change that disables Atari partition support in its kernels for major cloud vendor environments. The change does not state what problem prompted it, so it may have come up in a context unrelated to Rook or Ceph.
Conclusion
It was quite a dramatic issue, but I think it is a good example for learning how to fix software and make announcements in such cases. If you want to learn more, I recommend digging deeper by following the related issues.
[1] This may be a limitation of Linux, but it is unclear.
[2] Fortunately, there have been no reports of OSDs being mistaken for Atari partitions when using lvm mode OSDs. However, this is simply because in lvm mode the VG metadata is at the beginning of the disk, and its pattern just happens not to be recognized as an Atari partition.