To improve performance, JuiceFS employs a chunking strategy in its data storage process. Concepts like chunk, slice, and block, along with their working mechanisms, may not be easy for new users to grasp. This article, shared by JuiceFS community user Arthur, provides a deep dive into JuiceFS' design principles with code examples, covering both data and metadata storage.
Files written to object storage by JuiceFS
Using MinIO as an example, let’s examine how JuiceFS organizes files in object storage. This structure is similar across other cloud providers like AWS S3.
The bucket structure below provides a clear view of how JuiceFS organizes a volume's data in object storage.
In the bucket: One directory per volume
By creating two volumes using the `juicefs format` command, their organization within the bucket can be observed.
In the bucket, the top-level "directories" represent JuiceFS volumes. The term "directory" is used here in quotes since object storage is a flat key-value storage system without a true directory structure. To make navigation and understanding easier, object storage simulates directories by grouping keys with common prefixes. For simplicity, the term "directory" will be used without quotes in the following sections.
Each volume’s directory structure: {chunks/, juicefs_uuid, meta/, ...}
Within each volume directory, the structure appears as follows:
|-chunks/ # Data directory where all user data for the volume resides
|-juicefs_uuid
|-meta/ # Directory for metadata backups created by `juicefs mount --backup-meta`
`juicefs_uuid`: Unique volume identifier
The `juicefs_uuid` file contains the universally unique identifier (UUID) displayed during the `juicefs format` process, serving as the volume's unique identifier. This UUID is required for volume deletion.
`meta/`: JuiceFS metadata backup
If the `--backup-meta` option is specified when running the `juicefs mount` command, JuiceFS periodically backs up metadata (stored in TiKV) to this directory. The backups serve purposes such as:
- Restoring metadata after a metadata engine failure
- Migrating metadata across different engines
chunks/
The figure below shows files of a bucket in the MinIO bucket browser:
The directory structure in `chunks/` looks like this:
|-chunks/
| |-0/ # <-- id1 = slice_id / 1000 / 1000
| | |-0/ # <-- id2 = slice_id / 1000
| | |-1_0_16 # <-- {slice_id}_{block_id}_{size_of_this_block}
| | |-3_0_4194304 #
| | |-3_1_1048576 #
| | |-...
|-juicefs_uuid
|-meta/
All files in the bucket are named and stored using numbers, organized into three levels.
- The first level: Purely numeric, derived by dividing `slice_id` by 1,000,000.
- The second level: Purely numeric, derived by dividing `slice_id` by 1,000.
- The third level: A combination of numbers and underscores in the format `{slice_id}_{block_id}_{size_of_this_block}`, representing the slice ID, the block ID, and the size of that block within the chunk's slice.

Don't worry if the concepts of chunk, slice, and block are unclear; we'll explain them shortly.
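For example, for the object `3_0_4194304` listed above, the slice ID is 3: 3 / 1,000,000 = 0 gives the first level and 3 / 1,000 = 0 gives the second, so the block is stored at `chunks/0/0/3_0_4194304`.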
JuiceFS data design
Top-level division: Files are split into chunks
JuiceFS divides files into chunks, each with a fixed size of 64 MB. This simplifies the process of locating and accessing file contents during read or modification operations. Regardless of file size, all files are split into chunks, varying only in their number.
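As a rough illustration (a sketch, not JuiceFS code), locating the chunk that holds a given file offset is just integer division by the 64 MB chunk size:

```go
// A simplified sketch, not JuiceFS code: locating the chunk that holds a given
// file offset. 64 MiB matches the fixed chunk size described above.
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MiB

func chunkIndexAndOffset(fileOffset uint64) (chunkIndex, offsetInChunk uint64) {
	return fileOffset / chunkSize, fileOffset % chunkSize
}

func main() {
	// An offset of 129 MiB falls 1 MiB into chunk 2 (the third chunk).
	fmt.Println(chunkIndexAndOffset(129 << 20)) // 2 1048576
}
```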
The figure below shows that each file in JuiceFS is divided into chunks, with a maximum chunk size of 64 MB:
Object storage: No chunk entity
Referring to the directory structure in the previous section in object storage:
|-chunks/
| |-0/ # <-- id1 = slice_id / 1000 / 1000
| | |-0/ # <-- id2 = slice_id / 1000
| | |-1_0_16 # <-- {slice_id}_{block_id}_{size_of_this_block}
| | |-3_0_4194304 #
| | |-3_1_1048576 #
| | |-...
|-juicefs_uuid
|-meta/
Chunks in object storage do not correspond to actual files. This means there are no 64 MB chunks stored individually in object storage. In JuiceFS terms, a chunk is a logical concept. If this concept isn't clear yet, don't worry—we'll explain it next.
A continuous write operation within a chunk: Slice
A chunk is merely a "container." Within this container, the data read and written for a file is represented by what JuiceFS calls a "slice." A continuous write within a chunk creates a slice, corresponding to the data written in that segment. Since a slice is a concept within a chunk, it cannot span across chunk boundaries, and its length does not exceed the maximum chunk size of 64 MB. The slice ID is globally unique.
Slice overlap
Depending on the write behavior, a chunk may contain multiple slices. If a file is written in a single, continuous operation, each chunk contains only one slice. However, if the file is written in multiple append operations, with each append triggering a flush, multiple slices are created.
For example, consider chunk1:
- The user writes about 30 MB of data, creating slice5.
- Next, starting at about the 20 MB mark, the user writes 45 MB of data (overwriting part of the original data and appending beyond the end of the file).
- This creates slice6 within chunk1.
- As the data exceeds chunk1’s boundaries, slice7 and chunk2 are created, because slices cannot cross chunk boundaries.
- Later, the user begins overwriting from the 10 MB mark of chunk1, creating slice8.
Since slices overlap, several fields are introduced to indicate the valid data range:
type slice struct {
	id    uint64 // Slice ID, globally unique
	size  uint32 // Total size of the data written by this slice
	off   uint32 // Offset of the still-valid data range within the slice
	len   uint32 // Length of the still-valid data range
	pos   uint32 // Position of the slice within the chunk
	left  *slice // This field is not stored in TiKV.
	right *slice // This field is not stored in TiKV.
}
Handling multiple slices when reading chunk data: Fragmentation and merging
The figure below shows the relationship between slices and chunks in JuiceFS:
JuiceFS chunks consist of slices, each slice corresponding to a continuous write operation.
For JuiceFS users, there is only one file, but internally, the corresponding chunk may have multiple overlapping slices. In cases of overlap, the most recent write takes precedence.
Intuitively, in a chunk, when viewed from top to bottom, the parts that have been overwritten are considered invalid.
Therefore, when reading a file, the system must search for the latest slice written within the current read range. When many slices overlap, this can significantly impact read performance, a phenomenon known as "file fragmentation."
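The toy sketch below (not JuiceFS code; the slice lengths are illustrative) shows this "latest write wins" rule: applying writes in chronological order determines which slice owns each offset of the chunk.

```go
// A toy sketch (not JuiceFS code) of the "latest write wins" rule described
// above: writes are applied in chronological order, so the most recent slice
// covering an offset is the one a read must return.
package main

import "fmt"

type write struct {
	sliceID uint64
	pos     uint32 // offset of the write within the chunk
	len     uint32 // number of bytes written
}

// resolve returns, for every byte offset in the chunk, the ID of the slice
// that currently owns it (0 means the offset has never been written).
func resolve(chunkSize uint32, writes []write) []uint64 {
	owner := make([]uint64, chunkSize)
	for _, w := range writes { // chronological order
		for i := uint32(0); i < w.len && w.pos+i < chunkSize; i++ {
			owner[w.pos+i] = w.sliceID // a later write hides earlier ones
		}
	}
	return owner
}

func main() {
	// A scaled-down version of the slice5/slice6/slice8 example: treat the
	// chunk as 64 units instead of 64 MB.
	owner := resolve(64, []write{
		{sliceID: 5, pos: 0, len: 30},
		{sliceID: 6, pos: 20, len: 44},
		{sliceID: 8, pos: 10, len: 5},
	})
	fmt.Println(owner[0], owner[12], owner[25], owner[50]) // 5 8 6 6
}
```

The real implementation works on ranges (using the off, len, and pos fields shown above) rather than byte by byte, but the visible result is the same: the most recently written slice wins.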
Fragmentation not only affects read performance but also increases space consumption at the object storage and metadata levels. Whenever a write occurs, the client checks for fragmentation and asynchronously performs fragmentation merging, combining all slices within a chunk.
Object storage: No slice entity
Similar to chunks, slices in object storage do not correspond to actual files.
|-chunks/
| |-0/ # <-- id1 = slice_id / 1000 / 1000
| | |-0/ # <-- id2 = slice_id / 1000
| | |-1_0_16 # <-- {slice_id}_{block_id}_{size_of_this_block}
| | |-3_0_4194304 #
| | |-3_1_1048576 #
| | |-...
|-juicefs_uuid
|-meta/
Slice divided into fixed-size blocks: Concurrent reads and writes in object storage
To accelerate writing to object storage, JuiceFS divides slices into individual blocks (4 MB by default), allowing multi-threaded concurrent writes.
Blocks are the final level in JuiceFS' data segmentation design and the only level among chunks, slices, and blocks that corresponds to actual files in the bucket.
The differences between continuous writes and append writes:
- Continuous writes: The default block size is 4 MB, and the last block is whatever remains.
- Append writes: If the data is less than 4 MB, the final block stored in object storage is also smaller than 4 MB.
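The sketch below (a simplified illustration, not the actual JuiceFS writer; `splitIntoBlocks` is a made-up helper) shows how a slice's data maps onto fixed-size blocks, with the last block keeping whatever remains:

```go
// A simplified sketch, not the actual JuiceFS writer: cutting one slice's data
// into fixed-size blocks so each block can be uploaded to object storage by a
// separate goroutine. The 4 MiB constant matches the default block size above.
package main

import "fmt"

const defaultBlockSize = 4 << 20 // 4 MiB

func splitIntoBlocks(data []byte, blockSize int) [][]byte {
	var blocks [][]byte
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data) // the last block keeps whatever remains
		}
		blocks = append(blocks, data[off:end])
	}
	return blocks
}

func main() {
	// A 5 MB slice becomes one 4 MiB block and one 1 MiB block, matching the
	// 3_0_4194304 and 3_1_1048576 objects of file2_5MB shown earlier.
	for i, b := range splitIntoBlocks(make([]byte, 5<<20), defaultBlockSize) {
		fmt.Printf("block %d: %d bytes\n", i, len(b))
	}
}
```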
From the example above, the file names and sizes correspond to the following:
- `1_0_16`: Corresponds to `file1_1KB`. Even though `file1_1KB` only contains two lines of content, it's represented in MinIO as two separate objects, each containing one line.
- `3_0_4194304` + `3_1_1048576`: A total of 5 MB, corresponding to `file2_5MB`.
- `4_*`: Corresponds to `file3_129MB`.
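As a quick sanity check on the sizes: `file3_129MB`, at 129 MB, spans three 64 MB chunks (⌈129 / 64⌉ = 3), which matches the three chunk keys recorded for its inode in the decoded metadata shown later in this article.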
Object key naming format (and code)
The naming format is:
{volume}/chunks/{id1}/{id2}/{slice_id}_{block_id}_{size_of_this_block}
Corresponding code:
func (s *rSlice) key(blockID int) string {
	if s.store.conf.HashPrefix { // false by default
		return fmt.Sprintf("chunks/%02X/%v/%v_%v_%v", s.id%256, s.id/1000/1000, s.id, blockID, s.blockSize(blockID))
	}
	return fmt.Sprintf("chunks/%v/%v/%v_%v_%v", s.id/1000/1000, s.id/1000, s.id, blockID, s.blockSize(blockID))
}
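For example, with `HashPrefix` disabled (the default), slice ID 3, block 0, and a block size of 4,194,304 bytes yield `chunks/0/0/3_0_4194304`; prefixed with the volume name, this is exactly the `foo-dev/chunks/0/0/3_0_4194304` object seen in MinIO.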
Mapping chunks, slices, and blocks to object storage
Finally, we map the data segmentation and organization of the volume to paths and objects in MinIO:
Summary: Object storage data and slice/block information alone cannot restore files
At this point, JuiceFS has solved the problem of how data is segmented and stored. This is a forward process: the user creates a file, which is segmented, named, and uploaded to object storage. The corresponding reverse process is: given the objects in object storage, how can we reconstruct them into the user's file? Clearly, just using the slice/block ID information from the object names is insufficient.
For example, in the simplest case, if there is no slice overlap within a chunk, we can reconstruct the file by piecing together the `slice_id`/`block_id`/`block_size` information from the object names. However, we still wouldn't know the file's name, path (parent directory), or permissions (rwx), among other details.
Once chunks have overlapping slices, it becomes impossible to restore the file solely from the information in object storage.
In addition, soft links, hard links, and file attributes cannot be restored from object storage alone.
To solve this reverse process, we need the file’s metadata as auxiliary information—this information was recorded in JuiceFS’ metadata engine before the file was segmented and written to object storage.
JuiceFS metadata design (TKV version)
JuiceFS supports different types of metadata engines, such as Redis, PostgreSQL, TiKV, and etcd. Each type has its own key naming rules. This section discusses the key naming rules when using the transactional key-value (TKV) type of metadata engine, specifically using TiKV.
TKV type key list
Here is a list of JuiceFS’ defined metadata keys, which store key-value pairs in the metadata engine. Note that these keys are different from those used for object storage:
setting format
C{name} counter
A{8byte-inode}I inode attribute
A{8byte-inode}D{name} dentry
A{8byte-inode}P{8byte-inode} parents // for hard links
A{8byte-inode}C{4byte-chunkIndex} file chunks
A{8byte-inode}S symlink target
A{8byte-inode}X{name} extended attribute
D{8byte-inode}{8byte-length} deleted inodes
F{8byte-inode} Flocks
P{8byte-inode} POSIX locks
K{8byte-sliceID}{4byte-size} slice refs
L{8byte-timestamp}{8byte-sliceID} delayed slices
SE{8byte-sessionID} session expire time
SH{8byte-sessionID} session heartbeat // for legacy client
SI{8byte-sessionID} session info
SS{8byte-sessionID}{8byte-inode} sustained inode
U{8byte-inode} usage of data length, space and inodes in directory
N{8byte-inode} detached inode
QD{8byte-inode} directory quota
R{4byte-aclID} POSIX acl
In TKV, all integers are stored in encoded binary format:
- Inode and counter values occupy 8 bytes using little-endian encoding.
- SessionID, sliceID, and timestamp also occupy 8 bytes but use big-endian encoding.
- Setting is a special key, where the value contains the configuration information for the volume.
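As a small illustration of these encoding rules (a sketch, not JuiceFS code), an inode-attribute key of the form `A{8byte-inode}I` can be built like this:

```go
// A small sketch, not JuiceFS code: building an inode-attribute key of the
// form A{8byte-inode}I, with the inode stored as 8 little-endian bytes.
package main

import (
	"encoding/binary"
	"fmt"
)

func inodeAttrKey(inode uint64) []byte {
	key := make([]byte, 10)
	key[0] = 'A'
	binary.LittleEndian.PutUint64(key[1:9], inode) // inodes are little-endian
	key[9] = 'I'
	return key
}

func main() {
	// Inode 1 (the volume's root directory) yields A\x01\x00\x00\x00\x00\x00\x00\x00I,
	// matching the decoded keys shown later in this article.
	fmt.Printf("%q\n", inodeAttrKey(1))
}
```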
Each key type can be quickly distinguished by its prefix:
- C: Counter, which includes several subtypes such as:
  - `nextChunk`
  - `nextInode`
  - `nextSession`
- A: Inode attributes
- D: Deleted inodes
- F: Flocks
- P: POSIX locks
- S: Session related
- K: Slice references
- L: Delayed (to-be-deleted) slices
- U: Usage of data, space, and inodes within a directory
- N: Detached inodes
- QD: Directory quotas
- R: POSIX ACL
It’s important to note that these key formats are defined by JuiceFS. When the keys and values are written to the metadata engine, the engine may further encode the keys. For example, TiKV might insert additional characters into the keys.
Key/value pairs in the metadata engine
Scanning related TiKV keys
The scan operation in TiKV is similar to the list prefix function in etcd. Here, we scan all keys related to the foo-dev volume:
key: zfoo-dev\375\377A\000\000\000\020\377\377\377\377\177I\000\000\000\000\000\000\371
key: zfoo-dev\375\377A\001\000\000\000\000\000\000\377\000Dfile1_\3771KB\000\000\000\000\000\372
key: zfoo-dev\375\377A\001\000\000\000\000\000\000\377\000Dfile2_\3775MB\000\000\000\000\000\372
...
key: zfoo-dev\375\377SI\000\000\000\000\000\000\377\000\001\000\000\000\000\000\000\371
default cf value: start_ts: 453485726123950084 value: 7B225665727369...33537387D
key: zfoo-dev\375\377U\001\000\000\000\000\000\000\377\000\000\000\000\000\000\000\000\370
key: zfoo-dev\375\377setting\000\376
default cf value: start_ts: 453485722598113282 value: 7B0A224E616D65223A202266...0A7D
Decoding JuiceFS metadata keys
You can use the `tikv-ctl --decode <key>` command to decode the keys. After removing the `z` prefix, the original JuiceFS keys become clearer:
foo-dev\375A\001\000\000\000\000\000\000\000Dfile1_1KB
foo-dev\375A\001\000\000\000\000\000\000\000Dfile2_5MB
foo-dev\375A\001\000\000\000\000\000\000\000Dfile3_129MB
foo-dev\375A\001\000\000\000\000\000\000\000I
foo-dev\375A\002\000\000\000\000\000\000\000C\000\000\000\000
foo-dev\375A\002\000\000\000\000\000\000\000I
foo-dev\375A\003\000\000\000\000\000\000\000C\000\000\000\000
foo-dev\375A\003\000\000\000\000\000\000\000I
foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\000
foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\001
foo-dev\375A\004\000\000\000\000\000\000\000C\000\000\000\002
foo-dev\375A\004\000\000\000\000\000\000\000I
foo-dev\375ClastCleanupFiles
foo-dev\375ClastCleanupSessions
foo-dev\375ClastCleanupTrash
foo-dev\375CnextChunk
foo-dev\375CnextCleanupSlices
foo-dev\375CnextInode
foo-dev\375CnextSession
foo-dev\375CtotalInodes
foo-dev\375CusedSpace
foo-dev\375SE\000\000\000\000\000\000\000\001
foo-dev\375SI\000\000\000\000\000\000\000\001
foo-dev\375U\001\000\000\000\000\000\000\000
foo-dev\375setting
These decoded keys reveal the metadata of the three created files. Since they are linked with information such as `slice_id`, this mapping allows them to be associated with data blocks in object storage.
Further decoding based on the key encoding rules can provide more specific details, such as sliceID and inode.
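For instance, the sketch below (not JuiceFS code; `parseDentryKey` is a hypothetical helper) decodes a dentry key of the form `A{8byte-inode}D{name}` back into a parent inode and an entry name:

```go
// A small sketch, not JuiceFS code: extracting the parent inode and entry name
// from a dentry key of the form A{8byte-inode}D{name} listed above.
package main

import (
	"encoding/binary"
	"fmt"
)

func parseDentryKey(key []byte) (parent uint64, name string, ok bool) {
	if len(key) < 10 || key[0] != 'A' || key[9] != 'D' {
		return 0, "", false
	}
	return binary.LittleEndian.Uint64(key[1:9]), string(key[10:]), true
}

func main() {
	// The decoded key A\x01\x00\x00\x00\x00\x00\x00\x00Dfile1_1KB parses to
	// parent inode 1 and the name "file1_1KB".
	key := append([]byte{'A', 1, 0, 0, 0, 0, 0, 0, 0, 'D'}, []byte("file1_1KB")...)
	fmt.Println(parseDentryKey(key))
}
```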
Common challenges with this design
Recovering files from data and metadata
Theoretical steps
For a given JuiceFS file, the sections above explained two forward processes:
- The file is split into chunks, slices, and blocks. Then they’re written to object storage.
- The file's metadata, including inode, slice, and block information, is organized and written to the metadata engine.

With this understanding of the forward process, the reverse process of recovering a file from object storage and the metadata engine becomes clear:
- Scan the metadata engine to gather details such as file name, inode, slice, size, location, and permissions.
- Reconstruct the object keys in object storage using `slice_id`, `block_id`, and `block_size`.
- Sequentially retrieve the data from object storage using these keys, assemble the file, and write it locally with the appropriate permissions.
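As a rough sketch of steps 2 and 3 (assuming the simple case of one non-overlapping slice per chunk; `fetchObject` and `restoreSlice` are hypothetical helpers, not JuiceFS APIs):

```go
// A rough sketch of steps 2 and 3 above, assuming the simple case of one
// non-overlapping slice per chunk. fetchObject is a hypothetical stand-in for
// whatever GET call your object storage client provides.
package main

import (
	"fmt"
	"io"
)

// fetchObject is a placeholder for an object-storage GET (e.g., a MinIO/S3 client call).
func fetchObject(key string) ([]byte, error) {
	return nil, fmt.Errorf("not implemented: %s", key)
}

// restoreSlice rebuilds object keys from slice/block information gathered in
// the metadata engine and writes the blocks out in order.
func restoreSlice(volume string, sliceID uint64, blockSizes []uint32, out io.Writer) error {
	for blockID, size := range blockSizes {
		key := fmt.Sprintf("%s/chunks/%d/%d/%d_%d_%d",
			volume, sliceID/1000/1000, sliceID/1000, sliceID, blockID, size)
		data, err := fetchObject(key)
		if err != nil {
			return err
		}
		if _, err := out.Write(data); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// file2_5MB from the example: slice 3 with a 4 MiB block and a 1 MiB block.
	_ = restoreSlice("foo-dev", 3, []uint32{4194304, 1048576}, io.Discard)
}
```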
Using `juicefs info` to view file chunk, slice, and block information
JuiceFS provides the `juicefs info` command to directly view a file's chunk, slice, and block details. For example:
foo-dev/file2_5MB :
inode: 3
files: 1
dirs: 0
length: 5.00 MiB (5242880 Bytes)
size: 5.00 MiB (5242880 Bytes)
path: /file2_5MB
objects:
+------------+--------------------------------+---------+--------+---------+
| chunkIndex | objectName | size | offset | length |
+------------+--------------------------------+---------+--------+---------+
| 0 | foo-dev/chunks/0/0/3_0_4194304 | 4194304 | 0 | 4194304 |
| 0 | foo-dev/chunks/0/0/3_1_1048576 | 1048576 | 0 | 1048576 |
+------------+--------------------------------+---------+--------+---------+
This output matches what we observed in MinIO.
Why can’t files written to object storage by JuiceFS be read directly?
The term “cannot be read” here refers to the inability to directly reconstruct the original file from the object storage objects, not the inability to access the objects themselves.
As explained in this article, JuiceFS splits files into chunks, slices, and blocks before storing them in object storage. Only the segmented data itself is uploaded; no file metadata (such as file names) is included.
Therefore, given only the objects in the bucket, you can read the individual objects, but you cannot reconstruct the original files and present them to the user.
If you have any questions for this article, feel free to join JuiceFS discussions on GitHub and community on Slack.