The scale of computing power for large language model (LLM) training continues to expand. While GPU performance keeps improving, data access bottlenecks are becoming increasingly prominent in overall system performance. Local storage offers excellent performance but has limited scalability. Object storage excels in cost and scalability but suffers from insufficient throughput in massive small‑file and high‑concurrency scenarios. Teams often struggle to choose between them.
Therefore, distributed file systems have become a key solution to balance high performance and scalability. JuiceFS has been widely deployed in AI scenarios across multiple industries. Its distributed architecture delivers high performance, strong scalability, and low cost simultaneously for large‑scale data access.
In this article, we’ll introduce JuiceFS’ architecture from a performance perspective and analyze core performance bottlenecks and optimization methods under different access patterns. We’ll also link to in‑depth references on key points, helping you understand JuiceFS’ performance mechanisms and master common tuning strategies.
Performance foundations from the JuiceFS architecture
JuiceFS comes in Community Edition and Enterprise Edition. Both share the same architecture: metadata and data are separated. The client adopts a rich‑client design, handling core logic including some metadata operations, and provides both metadata and data caching. These modules work together for efficient data location and access. The underlying data is stored in object storage, with local caches further improving access performance. For external interfaces, JuiceFS supports multiple access methods – FUSE is the most common, and it also provides various SDKs and an S3 gateway.
JuiceFS Community Edition is designed as a general‑purpose file system. Users can choose different metadata engines based on their needs. For small‑scale deployments, Redis delivers lightweight, low‑latency metadata management. For large‑scale file scenarios, TiKV provides good horizontal scalability.
JuiceFS Enterprise Edition targets complex, high‑performance scenarios. It differs from Community Edition in two ways:
- It uses a self‑developed multi‑zone metadata engine built on Raft that runs as an in‑memory cluster, offering low latency and strong horizontal scalability. It supports up to 500 billion files. Operations that require multiple key-value requests in the Community Edition often need only one or two in the Enterprise Edition, and complex logic can be processed inside the metadata cluster.
- The Enterprise Edition supports distributed cache sharing: clients in the same group can access each other’s local caches via consistent hashing. This improves cache hit rates and access efficiency. In multi‑node, high‑concurrency scenarios, the cache space scales horizontally, and most required data can be warmed up before job execution. This accelerates AI training and inference while boosting performance and stability. See JuiceFS Enterprise 5.3: 500B+ Files per File System & RDMA Support.
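The cache-group routing described above can be illustrated with a minimal consistent-hashing sketch. The node names, virtual-node count, and hash function here are hypothetical; the actual JuiceFS implementation differs in detail, but the core idea is the same: every client deterministically computes which group member owns a given block, so no central lookup is needed.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class CacheRing:
    """Toy consistent-hash ring: each cache node owns many virtual points,
    and a block is served by the first node clockwise from the block's hash."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, block_id: str) -> str:
        idx = bisect.bisect(self.points, _hash(block_id)) % len(self.ring)
        return self.ring[idx][1]

ring = CacheRing(["cache-node-1", "cache-node-2", "cache-node-3"])
owner = ring.node_for("chunk42_block7")  # every client computes the same owner
```

Because only keys adjacent to a departed node move, adding or removing a cache node remaps a small fraction of blocks rather than reshuffling the whole cache.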
Data chunking
JuiceFS splits data into chunks and stores them in object storage. This design is key to its performance, affecting data read efficiency, cache hit rate, and throughput under high concurrency.
JuiceFS breaks a file into multiple chunks. Inside each chunk, the system maintains a management structure called a slice to track writes and updates. When data is written, new data does not overwrite existing slices; instead, a new slice is appended on top of the chunk.
Ideally, each chunk ends up containing only one slice. Each slice consists of several 4 MB blocks, which are the smallest unit stored in object storage. By default, the caching system also manages data at the block level.
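To make the layout concrete: with JuiceFS’ default 64 MB chunks and 4 MB blocks, any file offset maps deterministically to a chunk, a block within that chunk, and an offset within that block. A minimal sketch of the arithmetic (not JuiceFS’ actual code):

```python
CHUNK_SIZE = 64 << 20   # 64 MB logical chunks (JuiceFS default)
BLOCK_SIZE = 4 << 20    # 4 MB blocks, the smallest unit stored in object storage

def locate(offset: int):
    """Map a file offset to (chunk index, block index within chunk, offset in block)."""
    chunk, off_in_chunk = divmod(offset, CHUNK_SIZE)
    block, off_in_block = divmod(off_in_chunk, BLOCK_SIZE)
    return chunk, block, off_in_block

# e.g. offset 100 MB falls in chunk 1, block 9, at byte 0 of that block
print(locate(100 << 20))  # -> (1, 9, 0)
```

This fixed mapping is what lets the client translate a read range directly into a small set of object keys without consulting the data path itself.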
As shown in the diagram on the upper right, file updates use an append‑only write pattern: existing slices are shown in red, and new data is appended as a new slice. During reads, the system combines the slices to form the current view. When fragmentation becomes excessive, a compaction process merges slices to optimize access performance. For more details on data chunking, refer to Code-Level Analysis: Design Principles of JuiceFS Metadata and Data Storage.
Caching
Compared to direct object storage access, JuiceFS performance improvements largely benefit from its caching mechanism. The JuiceFS client comes with a high‑performance local cache module. Key configuration options include:
- cache-dir: specifies the cache directory.
- cache-size: sets the maximum cache space.
- Prefetch: a parameter in the cache module that controls prefetching. When a request hits part of a block, a background thread fetches the entire block.
- Write‑back related settings: improve write IOPS by writing data blocks bound for object storage into the local cache first, then uploading them asynchronously.
JuiceFS Enterprise Edition also provides advanced configurations. For example, a cache group can be used to designate a set of clients whose local caches form a distributed cache group, enabling cache sharing. In addition, the no sharing option (related to cache groups) allows a client to read data only from a specified cache group without serving its own cache to others. This creates a two‑level cache:
- The first level is the local cache.
- The second level is the cache on other nodes in the group.
Another performance‑boosting mechanism is the memory buffer (read buffer), which provides:
- I/O request merging: multiple consecutive I/O requests can be merged in memory. For example, three I/O requests issued by the system may be reduced to just one after being processed by the memory buffer.
- Adaptive read‑ahead: in large‑file sequential read scenarios, adaptive read‑ahead increases request concurrency by prefetching data. This fully utilizes cache and object storage resources and improves overall I/O performance.
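The adaptive read-ahead idea can be sketched roughly as follows. This is a simplification under assumed parameters (one 4 MB block initially, doubling up to a 32 MB cap); the real client’s heuristics are more involved, but the shape is the same: grow the window while reads stay sequential, shrink it back on a seek.

```python
class ReadAhead:
    """Toy adaptive read-ahead window (hypothetical parameters)."""
    def __init__(self, block=4 << 20, max_window=32 << 20):
        self.block, self.max_window = block, max_window
        self.window = block      # start by prefetching a single block
        self.next_expected = 0

    def on_read(self, offset: int, size: int) -> int:
        """Return how many bytes to prefetch after this read."""
        if offset == self.next_expected:               # sequential: widen the window
            self.window = min(self.window * 2, self.max_window)
        else:                                          # seek: reset to one block
            self.window = self.block
        self.next_expected = offset + size
        return self.window

ra = ReadAhead()
ra.on_read(0, 4 << 20)             # first read widens the window
w = ra.on_read(4 << 20, 4 << 20)   # sequential follow-up widens it again
```

A random read resets the window, which is exactly why unbounded read-ahead wastes bandwidth on random workloads, as the `read ahead ratio` discussion below addresses.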
The Enterprise Edition also offers advanced read‑ahead settings:
- max read ahead: sets the maximum read‑ahead range.
- initial read ahead: sets the initial read‑ahead window size (in units of 4 MB blocks by default).
- read ahead ratio: a setting added last year that controls the read‑ahead ratio for large‑file random reads, reducing bandwidth waste caused by read amplification. Overly aggressive read‑ahead can hurt random read performance; read ahead ratio mitigates this. In AI scenarios, when large‑file sequential or random reads cause bandwidth or IOPS bottlenecks, adjusting these parameters can optimize overall performance.
JuiceFS benchmark I/O tests and bottleneck analysis
Before diving into performance tuning for common AI scenarios, let’s first examine JuiceFS’ I/O behavior under ideal conditions through sequential and random read benchmarks. This helps us understand throughput and latency under different access patterns, providing a reference for the read/write patterns of subsequent AI/ML workloads.
Sequential read performance
In JuiceFS, sequential read performance is typically bandwidth‑bound. In cold read scenarios, performance is mainly limited by object storage bandwidth; in distributed cache scenarios, network bandwidth can become the bottleneck. For example, a node with a 40 Gbps NIC has at most 5 GB/s of usable bandwidth. In addition, the user‑kernel transition overhead in the FUSE layer limits single‑thread throughput: tests showed single‑thread sequential read bandwidth of about 3.5 GB/s. To break this limit, multi‑threaded or higher‑concurrency strategies are needed to fully utilize storage and network resources.
The table below shows test results of JuiceFS sequential read performance:
| Threads | Bandwidth (GB/s) | Bandwidth per thread (GB/s) |
|---|---|---|
| 1 | 3.5 | 3.5 |
| 2 | 6.3 | 3.15 |
| 3 | 9.5 | 3.16 |
| 4 | 9.7 | 2.43 |
| 6 | 14.0 | 2.33 |
| 8 | 17.0 | 2.13 |
| 10 | 18.6 | 1.9 |
| 15 | 21 | 1.4 |
In the performance test, single‑thread sequential read bandwidth was about 3.5 GB/s. As the number of threads increased, total throughput gradually approached the network bandwidth limit. To help users evaluate the performance ceiling of their own environment, JuiceFS provides the objbench subcommand for testing object storage bandwidth.
In real workloads, caching is more common than direct object storage access. In such cases, increasing the buffer size raises the number of background prefetch requests, thereby improving concurrency and overall throughput. For example, after increasing the buffer size to 400 MB (corresponding to 100 background prefetch requests of 4 MB each), concurrency improved significantly and overall throughput increased.
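The buffer-to-concurrency relationship is simple arithmetic, which makes back-of-the-envelope sizing easy. The sketch below assumes 4 MB blocks; the bandwidth-delay formula is a rough estimate for planning, not the client’s exact internal model:

```python
BLOCK_SIZE_MB = 4  # default JuiceFS block size

def prefetch_concurrency(buffer_mb: int) -> int:
    """Rough upper bound on concurrent 4 MB prefetch requests a buffer allows."""
    return buffer_mb // BLOCK_SIZE_MB

def buffer_for_target_bw(target_gbps: float, latency_ms: float) -> int:
    """Buffer (MB) needed to keep a target bandwidth flowing given per-request
    latency: the bandwidth-delay product, rounded up to whole blocks."""
    inflight_mb = target_gbps * 1000 / 8 * latency_ms / 1000  # MB in flight
    blocks = -(-inflight_mb // BLOCK_SIZE_MB)                 # ceiling division
    return int(blocks * BLOCK_SIZE_MB)

print(prefetch_concurrency(400))  # 400 MB buffer -> ~100 requests in flight
```

For example, sustaining 20 Gbps against 100 ms object-storage latency needs roughly 250 MB of data in flight, which is why the 400 MB buffer in the test above produced a clear throughput jump.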
Random read performance
Low‑concurrency random reads
In low‑concurrency, non‑asynchronous access scenarios, each request must wait for the previous one to complete before being issued. As a result, latency has a significant impact on overall performance. I/O latency can come from many sources, including metadata query latency, object storage access latency, and local or distributed cache read latency. When analyzing random read performance, we must closely examine these latency factors.
In a 4 KB cold random read scenario, if the IOPS is only 8 and object storage latency is about 125 ms, the concurrency level is roughly 1 (8 IOPS × 125 ms ≈ 1,000 ms).
This indicates a near‑single‑concurrent, serial‑blocked state. In such cases, the optimization focus should be on shortening the access path and reducing per‑request latency rather than increasing concurrency – for example, by warming up data into the local cache. After data warm-up, the random read path switches from object storage to local cache, and IOPS can increase to about 12,000, approaching the I/O level of a local disk.
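The concurrency estimate above is just Little’s law: average in-flight requests ≈ IOPS × latency. A quick sketch (the ~0.08 ms warm latency is inferred from the ~12,000 IOPS figure, not separately measured):

```python
def effective_concurrency(iops: float, latency_ms: float) -> float:
    """Little's law: average in-flight requests = throughput x latency."""
    return iops * latency_ms / 1000

cold = effective_concurrency(8, 125)       # 8 IOPS x 125 ms -> ~1 request in flight
warm = effective_concurrency(12000, 0.08)  # still ~1 in flight, at cache latency
```

Both workloads are effectively serial; the IOPS difference comes entirely from the shorter per-request path, which is why warming the cache, not adding threads, is the right fix here.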
High‑concurrency random reads
High‑concurrency random reads typically occur in scenarios with high thread counts or asynchronous I/O. The main performance bottleneck is often IOPS limits – including metadata IOPS, object storage IOPS, and cache IOPS. JuiceFS allows you to observe these metrics and pinpoint the bottleneck. Client machine resources (CPU, memory) can also affect performance, but such bottlenecks are easy to monitor.
In a cold read scenario using Libaio for random reads, the object‑side IOPS ceiling is around 7,000/s. When caching is enabled and data is warmed up, the access path shifts from object storage to the cache layer, and IOPS can further increase to over 20,000. This shows that the bottleneck for high‑concurrency random reads shifts as the access path changes.
For a deeper dive into JuiceFS’ complete data access path, refer to Optimizing JuiceFS Read Performance: Readahead, Prefetch, and Cache.
I/O characteristics and performance tuning for common AI scenarios
Large‑file sequential reads
A typical large‑file sequential read scenario is model loading, such as loading PyTorch .pt files saved via pickle serialization. In this process, performance is limited by two factors:
- Pickle deserialization efficiency determines data processing speed.
- Data reading is usually single‑threaded and limited by FUSE bandwidth and CPU performance.
To increase throughput, you can raise concurrency through multi‑threaded or sharded loading, fully utilizing I/O capacity. For large‑file sequential reads, the best performance is achieved when the entire dataset can be cached locally; when it cannot, plain on‑demand reading still works and requires no extra setup.
For more details on optimizing large‑file sequential reads, see How JuiceFS Transformed Idle Resources into a 70 GB/s Cache Pool.
Massive small files
In computer vision and multimodal tasks, training datasets often consist of many individual files, for example, single images, video frames, or text annotations. Such massive small‑file scenarios place heavy pressure on metadata services.
In massive small-file scenarios, metadata performance is critical. On one hand, each file carries only a small amount of data; on the other hand, directory metadata access efficiency is low when a directory holds a huge number of small files.
For read‑only workloads, enabling client metadata caching and extending the cache lifetime can improve performance.
Moreover, the data read layer faces higher IOPS pressure: small files cannot benefit from read‑ahead, so requests are more fragmented and per‑file latency tends to be higher. Common optimizations include increasing local cache capacity; with the Enterprise Edition, you can also scale out the distributed cache cluster horizontally.
For performance tuning in this scenario, see How D-Robotics Manages Massive Small Files in a Multi-Cloud Environment with JuiceFS.
Large‑file random reads
This scenario is common in AI training, for example, when randomly accessing datasets in TFRecord, HDF5, or LMDB format by sample. Take model loading: if the dataset is accessed randomly and each read size equals the sample size (for example, 1 MB to 4 MB images or short videos), read‑ahead can waste bandwidth. Such scenarios can often break through IOPS bottlenecks by increasing concurrency.
Recommended measures include:
- Increase the number of data‑loading reader threads.
- Use asynchronous I/O to raise concurrency and saturate IOPS.
- Improve the caching system, for example, warm data into the cache in advance to boost underlying IOPS.
- Adjust the read ahead ratio parameter (for example, set it to 0.5) to reduce bandwidth waste from read‑ahead. For instance, a 4 MB sequential read would previously prefetch 4 MB; after adjustment, only 2 MB is prefetched.
In this article, we’ve analyzed JuiceFS’ architecture from a performance perspective, covered benchmark I/O tests, and discussed tuning methods for typical AI scenarios. This provides an introductory reference for system performance. JuiceFS has been deployed in many production environments, and its distributed architecture offers a feasible balance between performance and cost.
If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and the community on Discord.