DEV Community: DASWU

JuiceFS 1.4: Lower Costs, Faster Metadata, and Better Control for Massive Data Management

DASWU — Wed, 15 Jul 2026 09:04:24 +0000

JuiceFS Community Edition 1.4 is released. It’s the fifth major release since the open source edition was introduced in 2021 and is now the new Long-Term Support (LTS) version. We’ll continue to maintain both v1.4 and v1.3, while v1.2 has reached end of maintenance.

JuiceFS has now surpassed 14.2k GitHub stars. According to anonymous usage statistics reported by users, the total amount of data managed by JuiceFS Community Edition has exceeded 1.4 EB, representing more than 700× growth since 2022.

As JuiceFS is increasingly adopted for large-scale data management, high-concurrency workloads, and multi-user shared environments, long-standing challenges such as storage cost optimization, metadata performance, and resource governance have become more prominent. These are the primary focus areas of the 1.4 release.

In this post, we'll walk through the key improvements in JuiceFS 1.4, including tiered storage, faster metadata operations, enhanced resource management, more reliable data synchronization, metadata change tracking, and broader platform compatibility.

Lower storage costs with file- and directory-level tiered storage

As file systems continue to grow, different datasets naturally diverge in access frequency, performance requirements, and retention periods. Using a single storage type uniformly makes it difficult to simultaneously meet the performance needs of frequently accessed data and the cost-control requirements of infrequently accessed data. Object storage typically offers different storage classes based on access patterns, including hot data, warm (infrequent access) data, and cold (archival) data storage.

JuiceFS has supported setting object storage types via --storage-class since v1.1, but the configuration granularity was mainly at the file system default or mount point level. JuiceFS 1.4 integrates storage class into the file system semantics, supporting storage tier settings on a per-file or per-directory basis. Directory-level configurations can be inherited by subsequently created files and subdirectories. This facilitates tiered management by project, dataset, or application directory.

The storage tiers can be configured flexibly according to the object storage vendor being used. When writing new data, JuiceFS writes it to the corresponding object storage type based on the configuration of the file or its parent directory. For existing data, you can also adjust the metadata configuration and leverage the data migration capabilities on the object storage side to move it to a new storage tier. This capability is suitable for scenarios such as AI training datasets, log archiving, backup data, historical experiment data, and offline analysis results. For archive storage, it’s still necessary to evaluate retrieval latency and fees. For more implementation details, usage methods, and future evolution, see A Deep Dive into JuiceFS 1.4 Tiered Storage.
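To make the inheritance rule concrete, here is a minimal sketch of how a directory-level storage class could propagate to newly created entries. It is a toy model, not JuiceFS code: the `Node` class and its `create` method are hypothetical, and the class names `STANDARD` and `GLACIER` are only example values that depend on your object storage vendor.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A toy inode: a name, an optional storage class, and children."""
    name: str
    storage_class: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)

    def create(self, name: str, storage_class: Optional[str] = None) -> "Node":
        # A new entry takes an explicit class if one is given; otherwise it
        # inherits whatever its parent directory carries at creation time.
        child = Node(name, storage_class or self.storage_class)
        self.children[name] = child
        return child


root = Node("/", storage_class="STANDARD")      # file system default (example value)
archive = root.create("archive", "GLACIER")     # directory-level override
old_log = archive.create("2024.log")            # inherits from its parent directory
hot = root.create("train.bin")                  # inherits the default
```

In this model, changing a directory's class later would affect only entries created afterwards, which matches the "subsequently created files and subdirectories" wording above.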

Faster metadata operations: batch delete, batch clone, and hotspot read optimization

In workloads involving massive numbers of small files, large directories, or high-concurrency access, metadata operations often become the primary performance bottleneck.

JuiceFS 1.4 addresses write transaction overhead and hotspot read overhead in metadata operations with optimizations including batch delete, batch clone, and Redis client-side caching.

Batch delete and clone: reducing transaction overhead

Previously, deleting a large number of files required the system to process them one by one, sequentially updating directory entries, inodes, space statistics, trash, and quota metadata. JuiceFS 1.4 consolidates the deletion of multiple non-directory files within the same directory into a batch transaction, reducing the repetitive overhead of per-file operations. This is applicable to scenarios like large directory cleanup, temporary data reclamation, training sample cleanup, and log directory deletion.
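The effect of batching on transaction count can be illustrated with a toy metadata store that simply counts commits. This is a sketch under stated assumptions: `ToyMeta`, its methods, and the batch size of 100 are hypothetical and say nothing about the batch size or transaction layout JuiceFS actually uses.

```python
from typing import Dict, List


class ToyMeta:
    """Counts commits; each commit stands in for one metadata transaction."""

    def __init__(self, entries: Dict[str, List[str]]):
        self.entries = {d: set(names) for d, names in entries.items()}
        self.commits = 0

    def unlink_one(self, directory: str, name: str) -> None:
        self.entries[directory].discard(name)
        self.commits += 1  # one transaction per file

    def unlink_batch(self, directory: str, names: List[str], batch_size: int = 100) -> None:
        # Files in the same directory are removed in groups (assumed size: 100).
        for start in range(0, len(names), batch_size):
            batch = names[start:start + batch_size]
            self.entries[directory].difference_update(batch)
            self.commits += 1  # one transaction per batch


files = [f"f{i}" for i in range(1000)]

per_file = ToyMeta({"/tmp": files})
for name in files:
    per_file.unlink_one("/tmp", name)

batched = ToyMeta({"/tmp": files})
batched.unlink_batch("/tmp", files)
```

With 1,000 files, the per-file path commits 1,000 times while the batched path commits 10 times. The real saving depends on the metadata engine and on how much per-transaction work (statistics, trash, quota updates) can be shared within a batch.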

Batch cloning targets directory replication and snapshot scenarios. juicefs clone does not copy the underlying data blocks but creates new file records at the metadata layer and reuses the source file's data block references. JuiceFS 1.4 further reduces the metadata transactions generated by per-file cloning by processing clones of multiple files within the same directory in batch. This is ideal for AI dataset version management, experiment environment preparation, and large-scale directory snapshots.

Redis client-side caching: reducing hotspot metadata read overhead

In high-concurrency reads, path resolution, directory entry lookups, and file attribute queries generate a large number of repeated requests. When Redis is used as the metadata engine, these requests require round trips between the client and Redis. This may impact access latency and increase Redis load.

JuiceFS 1.4 caches hot inode attributes and directory entries locally on the client side. Cache hits avoid repeated round trips to Redis. When the related metadata changes, a cache invalidation mechanism updates the local state. Note that this capability caches metadata, not file content.

It’s particularly beneficial for read-heavy workloads with stable hot paths, such as AI training data loading, large-scale container startup, and multi-task concurrent reading. For more implementation details, see Faster Metadata Operations with Batch Unlink, Batch Clone, and Redis Client-Side Caching.
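The caching behavior described above follows the general read-through-with-invalidation pattern, which can be sketched as follows. This sketch does not talk to Redis at all: a plain dictionary stands in for the metadata engine, and `CachedMeta`, `getattr`, and `invalidate` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Hashable


class CachedMeta:
    """Read-through cache in front of a slower metadata lookup."""

    def __init__(self, fetch: Callable[[Hashable], dict]):
        self._fetch = fetch                      # stands in for a round trip to Redis
        self._cache: Dict[Hashable, dict] = {}
        self.round_trips = 0

    def getattr(self, inode: Hashable) -> dict:
        if inode not in self._cache:
            self.round_trips += 1
            self._cache[inode] = self._fetch(inode)
        return self._cache[inode]

    def invalidate(self, inode: Hashable) -> None:
        # Called when a change notification for this key arrives.
        self._cache.pop(inode, None)


backend = {1: {"mode": 0o755}, 2: {"mode": 0o644}}
meta = CachedMeta(lambda ino: dict(backend[ino]))

for _ in range(100):
    meta.getattr(2)              # only the first call reaches the backend
trips_before = meta.round_trips

backend[2]["mode"] = 0o600       # metadata changes on the server side
meta.invalidate(2)               # the change notification drops the stale entry
mode_after = meta.getattr(2)["mode"]
```

One hundred reads of the same inode cost a single round trip, and the read after invalidation fetches the updated attributes. How quickly invalidations reach each client in practice is a property of the real mechanism, which this sketch does not model.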

Improved resource management: user quotas and trash usage statistics

In distributed storage environments, storage resources are often shared among multiple users, teams, and projects.

Without effective governance, accidental writes or abnormal workloads from a single user can quickly consume large amounts of storage space or inodes. This affects both system stability and operating costs. Quota management is a critical means of establishing predictable resource boundaries in shared environments.

JuiceFS Community Edition 1.4 introduces user and group quotas, allowing administrators to monitor, configure, and enforce resource limits based on identities.

Resource governance is now extended beyond file system and directory quotas to include user- and group-level quotas, making it especially suitable for shared clusters and AI training platforms.

To reduce metadata overhead in multi-client environments, JuiceFS uses asynchronous accounting so that usage statistics converge gradually over time. For details, see Quota Design in Distributed Architectures: Implementation and Use Cases in JuiceFS.

Supported quota types include:

Quota type	Scope	Design goal	Typical use case
Total file system quota	Entire file system	Prevents overall resource runaway	Cost budget control, capacity limit
Subdirectory quota	Directory subtree	Blocks abnormal write behavior	Prevents misoperations, small‑file storms
User quota	Per user	Isolates impact between different applications	Multi‑tenant data management
User group quota	Project or department	Cost allocation and team limits	Shared environment for AI projects

JuiceFS 1.4 also improves trash space visibility.

Deleted files may remain in the trash for a retention period, making it difficult to understand why storage space has not yet been reclaimed. The enhanced summary tool now reports trash usage, helping administrators identify storage consumption and make informed cleanup, retention, or expansion decisions.

Expanded capabilities: sync, backup, and change tracking

More reliable large-scale sync

Large-scale migration, cross-cloud sync, backup, and archival workloads often face interruptions, security requirements, and bandwidth contention.

JuiceFS 1.4 significantly enhances juicefs sync with three major capabilities:

Resumable sync: It reduces recovery costs after task interruptions. During synchronization, JuiceFS records the task progress. If the task exits abnormally or is manually interrupted, it can resume from the saved state, reducing repeated scanning and processing. This capability is suitable for migration and backup scenarios with a large number of objects, long task durations, or unstable cross-cloud links.
Data encryption and decryption: In cross-cloud backup and archiving scenarios, client-side encryption is a common compliance requirement. JuiceFS 1.4 supports completing encrypted writes, decryption recovery, or re-encryption within the sync pipeline, reducing reliance on external encryption tools. This capability is suitable for off-site backup, sensitive data migration, key rotation, and compliance auditing. However, it requires careful management of key storage and recovery processes.
Global traffic control: It provides bandwidth constraints for multiple concurrent sync tasks. Compared to per-process rate limiting, version 1.4 can centrally manage the overall bandwidth usage of multiple sync tasks, reducing their impact on online applications and other network activity. This is suitable for cross-cloud transfers, multi-task concurrent backups, data center migrations, and shared outbound link scenarios. For implementation details, see JuiceFS Sync for PB-Scale Data Transfers: Resumable Sync, Encryption, and Bandwidth Control.

Changelog: metadata changes become traceable

JuiceFS Community Edition 1.4 introduces a metadata changelog capability, which records metadata change events across the file system.

Previously, troubleshooting relied primarily on client-side access logs, which only reflected operations performed through individual mount points. In multi-client deployments, reconstructing a complete sequence of events was often difficult.

A changelog records metadata operations—including file creation, deletion, attribute updates, and renames—directly at the metadata layer, providing a unified source for troubleshooting, auditing, and incremental processing.

When issues such as accidental deletions, unexpected renames, or unexpected permission or attribute changes occur, administrators can review the relevant records in the changelog instead of collecting logs from every client. This reduces dependence on single-client logs, shortens the troubleshooting path, and provides a more unified source of metadata changes for operational auditing.

In backup, migration, and recovery scenarios, the changelog can serve as a reference for incremental processing. For large-scale file systems, numerous changes may occur between two full backups or migration tasks. By recording the metadata changes during this period, the changelog can provide input for subsequent incremental backups, migrations, or recovery processes, reducing reliance on full scans.

Better support across different environments

JuiceFS Community Edition 1.4 further improves compatibility across diverse deployment environments.

On Windows clients, the release improves cross-platform consistency and stability, including user mapping, permission mapping, and file access behavior, reducing compatibility issues when Linux and Windows clients access the same file system.

For the Java SDK and the Hadoop ecosystem, JuiceFS 1.4 adds Kerberos authentication, completing support for Hadoop secure mode.

JuiceFS 1.3 already introduced Apache Ranger integration for authorization and access control. Together, Kerberos authenticates who the user is, while Ranger determines what the user can access, providing a more complete security model for enterprise big data platforms.

On the storage backend side, JuiceFS 1.4 also adds support for SMB/CIFS-based storage. This makes it easier to integrate with existing NAS or file-sharing infrastructure.

Continued growth in scale and AI adoption

According to anonymous usage statistics, JuiceFS Community Edition now powers nearly 70,000 file systems managing over 1.4 EB of data, with deployment scale continuing to grow.

Over the past year, AI applications have continued to expand from model training to inference services, agents, and multi-cloud scheduling, placing higher demands on data storage. These changes are also reflected in the use cases shared by community users, covering areas such as large language models, autonomous driving, quantitative investment, and computing platforms. We thank these users for sharing their real-world practices, providing valuable references for more teams building AI data infrastructure.

New user stories:

AI training and large language models

AIGC

Autonomous driving and robotics

Inference and AI agents

Quant investment

JuiceFS+MinIO: Ariste AI Achieved 3x Faster I/O and Cut Storage Costs by 40%+

Big data

NAVER, Korea's No.1 Search Engine, Chose JuiceFS over Alluxio for AI Storage

Development of JuiceFS 1.4 spanned nearly a year. During the release cycle, the community reported 366 issues, merged 515 pull requests, and welcomed contributions from 59 contributors. We sincerely thank everyone who reported issues, contributed code, improved documentation, and helped evolve JuiceFS for increasingly demanding production environments. Your participation drives the rapid growth of JuiceFS.

Download and try JuiceFS 1.4 here.

JuiceFS Sync for PB-Scale Data Transfers: Resumable Sync, Encryption, and Bandwidth Control

DASWU — Fri, 10 Jul 2026 06:56:21 +0000

In scenarios such as data migration, cross-cloud synchronization, and object storage backup, juicefs sync is commonly used to transfer large volumes of data. When datasets grow to the TB- or PB-scale, with millions or even billions of objects, a single synchronization task may run for hours or even days.

As these long-running jobs progress, several common challenges tend to emerge:

After network interruptions, process crashes, or node restarts, tasks often struggle to resume from a consistent state and may need to rescan or reprocess data.
Backup workflows may expose plaintext data and face compliance or security requirements.
When multiple sync jobs run concurrently, bandwidth contention becomes significant, while the overall transfer process lacks effective global control.

To address these challenges, JuiceFS 1.4 introduces three major enhancements to sync: resumable sync, data encryption/decryption, and global traffic control.

In this article, we’ll explain the use cases, implementation details, and configuration methods for each feature.

Resumable sync

In earlier versions, if a synchronization task failed or was interrupted, rerunning juicefs sync required rescanning both the source and destination before determining which objects had already been synchronized and which still needed to be copied.

For workloads involving hundreds of millions of objects or large files, the scan itself could incur substantial time and object-storage request costs.

To address this issue, JuiceFS 1.4 adds a resumable mode to juicefs sync. When enabled, synchronization progress is periodically saved to the destination. If the task is interrupted, rerunning the same command automatically locates and loads the matching checkpoint and resumes from the last unfinished position, avoiding a full restart.

How it works

When resumable sync is enabled, sync stores a JSON state file on the destination side:

.juicefs-sync-checkpoint.<hash>.json

The <hash> value is derived from the source, destination, and key synchronization parameters. This ensures that a task only loads checkpoints created for itself, preventing accidental reuse across different jobs.
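As an illustration of how a task-specific checkpoint name can be derived, the sketch below hashes a canonical description of the task. The hash function (SHA-256), the truncation to 16 hex characters, and the choice of parameters are all assumptions made for this example; the article does not specify how sync computes `<hash>`.

```python
import hashlib
import json


def checkpoint_name(src: str, dst: str, params: dict) -> str:
    # Serialize with sorted keys so the same task always hashes the same way.
    identity = json.dumps({"src": src, "dst": dst, "params": params}, sort_keys=True)
    # Assumption: SHA-256 truncated to 16 hex characters.
    digest = hashlib.sha256(identity.encode("utf-8")).hexdigest()[:16]
    return f".juicefs-sync-checkpoint.{digest}.json"


same_a = checkpoint_name("/local/data", "s3://mybucket/backup", {"threads": 10})
same_b = checkpoint_name("/local/data", "s3://mybucket/backup", {"threads": 10})
other = checkpoint_name("/local/data", "s3://other-bucket/backup", {"threads": 10})
```

Because the name depends on the source, destination, and parameters, two runs of the same command map to the same file, while a job with a different destination cannot pick up this checkpoint by accident.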

The checkpoint save, restore, and cleanup workflow in juicefs sync is as follows:

When sync starts, it first looks for a checkpoint matching the current task.
If a matching checkpoint is found, execution resumes from the saved state. Otherwise, synchronization starts normally with a fresh scan.
sync traverses multiple prefixes concurrently, maintaining independent state for each prefix, including:

- Whether traversal is complete  
- The last scanned position  
- Pending objects to synchronize  
- Failed objects

When restoring from a checkpoint:

- Pending and failed objects recorded in the checkpoint are re-added to the task queue.  
- Prefixes that were not fully traversed resume scanning from their saved positions.  
- Fully traversed prefixes only continue processing unfinished objects recorded in the checkpoint.

During execution, progress is saved asynchronously at a configurable interval, which defaults to every 10 seconds.
After successful completion, the checkpoint file is automatically removed. If the task fails or is interrupted, the checkpoint is retained for resumption on the next execution of the same command.
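The per-prefix state and the restore rules above can be sketched as a small JSON round trip. The field names (`done`, `marker`, `pending`, `failed`) and the file layout are hypothetical; the actual checkpoint schema written by sync may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PrefixState:
    done: bool = False                                  # whether traversal is complete
    marker: str = ""                                    # last scanned position
    pending: List[str] = field(default_factory=list)    # objects still to synchronize
    failed: List[str] = field(default_factory=list)     # objects that failed


def save(states: Dict[str, PrefixState]) -> str:
    return json.dumps({p: asdict(s) for p, s in states.items()}, sort_keys=True)


def restore(blob: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (objects to re-queue, {prefix to keep scanning: start marker})."""
    queue: List[str] = []
    resume_from: Dict[str, str] = {}
    for prefix, raw in json.loads(blob).items():
        state = PrefixState(**raw)
        queue.extend(state.pending + state.failed)      # re-add unfinished objects
        if not state.done:
            resume_from[prefix] = state.marker          # continue scanning from here
    return queue, resume_from


states = {
    "a/": PrefixState(done=True, pending=["a/3"]),
    "b/": PrefixState(done=False, marker="b/41", failed=["b/7"]),
}
queue, resume_from = restore(save(states))
```

Here the fully traversed prefix `a/` contributes only its unfinished object, while `b/` contributes its failed object and also resumes listing from its saved marker.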

In cluster mode, only a single checkpoint exists and is maintained centrally by Manager.

Workers do not directly read or write checkpoint files on the destination. Instead, they:

Pull tasks from Manager
Execute synchronization
Report results back to Manager

Manager aggregates completed objects, failed objects, statistics, and multipart-upload state into the global checkpoint.

Usage

# Enable resumable sync.
juicefs sync --enable-checkpoint SRC DST

# Customize checkpoint save interval (default: 10s).
juicefs sync --enable-checkpoint --checkpoint-interval 30s SRC DST

# Ignore existing checkpoints and restart from scratch.
juicefs sync --enable-checkpoint --checkpoint-force-reset SRC DST

Data encryption and decryption

For cross-cloud backup and archival workflows, client-side encryption is often required to satisfy compliance requirements such as data sovereignty, encryption at rest, and secure migration of sensitive data.

Previously, juicefs sync did not provide built-in encryption capabilities. Users who wanted to write encrypted data to the destination typically had to use external tools for additional processing.

In JuiceFS 1.4, streaming encryption and decryption are integrated directly into the synchronization pipeline, enabling three common workflows:

Encrypt-on-write: Encrypt plaintext data before writing it to the destination, suitable for encrypted backup and archiving.
Decrypt-on-read: Read encrypted data from the source and write decrypted data to the destination, suitable for data recovery or plaintext migration.
Re-encryption: Decrypt source data with an old key and re-encrypt it with a new key before writing to the destination, suitable for key rotation or cryptographic algorithm migration.

Chunk-based streaming encryption

To support object storage Range GET operations while avoiding loading large files into memory all at once, sync uses a streaming encryption scheme based on fixed-size 1 MiB chunks.

A file is first divided into plaintext chunks:

[chunk 1: 1 MiB][chunk 2: 1 MiB] ... [chunk N: ≤1 MiB]

Each plaintext chunk is encrypted independently.

Each encrypted chunk consists of a 4-byte header followed by the ciphertext data. The header stores the actual ciphertext length (ct_len):

Encrypted chunk: [4B ct_len][ciphertext + padding]

Encrypted file: [encrypted chunk 1][encrypted chunk 2] ... [encrypted chunk N]

The encrypted chunk size is the plaintext chunk size plus the encryption overhead: plainChunkSize + overhead. The plainChunkSize is fixed at 1 MiB, and the overhead depends on the encryption algorithm and key type used.

This design allows random reads to retrieve only the required encrypted chunk rather than downloading the entire file. Because encrypted objects contain additional headers, padding, and encryption metadata, the destination object is typically larger than the original plaintext file.
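The random-read property comes down to simple offset arithmetic, sketched below. The function name `encrypted_range` is hypothetical, and the sketch assumes that `overhead` is a known per-chunk constant that already includes the 4-byte length header; the value 32 in the example is arbitrary and not the overhead of any particular algorithm.

```python
PLAIN_CHUNK = 1 << 20  # 1 MiB plaintext chunk, as stated above


def encrypted_range(offset: int, length: int, overhead: int) -> tuple:
    """Map a plaintext byte range to (first_chunk, last_chunk, start, end).

    start and end are byte offsets in the encrypted object (end exclusive).
    Assumption: `overhead` is the fixed per-chunk size increase, including
    the 4-byte length header.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    block = PLAIN_CHUNK + overhead
    first = offset // PLAIN_CHUNK
    last = (offset + length - 1) // PLAIN_CHUNK
    return first, last, first * block, (last + 1) * block


# Read 1 MiB starting at plaintext offset 1.5 MiB, with a made-up overhead of 32 bytes.
first, last, start, end = encrypted_range(3 * PLAIN_CHUNK // 2, PLAIN_CHUNK, overhead=32)
```

The read touches chunks 1 and 2 only, so a reader would fetch just that byte range of the encrypted object, decrypt the two chunks, and trim the result. Since the final chunk of a file can be shorter than a full chunk, a real reader would also clamp `end` to the object size.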

Supported algorithms

The table below shows the supported algorithms:

Option	Symmetric cipher	Key encapsulation	Typical use case
aes256gcm-rsa (default)	AES-256-GCM	RSA	General-purpose workloads
chacha20-rsa	ChaCha20-Poly1305	RSA	Environments without efficient AES hardware acceleration
sm4gcm	SM4-GCM	SM2	Scenarios requiring Chinese commercial cryptography standards

Usage

The following examples use RSA keys.

Generate a key pair:

# Generate an RSA private key (the public key is derived automatically).
openssl genrsa -out private.pem 2048

# Generate a password-protected private key.
openssl genrsa -aes256 -out private.pem 2048

Scenario 1: Encrypt and write to destination

juicefs sync /local/data s3://mybucket/backup \
    --encrypt-rsa-key /path/to/private.pem

Scenario 2: Decrypt and read from source for data recovery or plaintext migration

juicefs sync s3://mybucket/backup /local/data \
    --decrypt-rsa-key /path/to/private.pem

Scenario 3: Re-encrypt for key rotation or algorithm migration

# Decrypt data encrypted with the old key and re-encrypt with the new key to new storage.
juicefs sync s3://old-bucket/encrypted s3://new-bucket/re-encrypted \
    --decrypt-rsa-key /path/to/old-private.pem \
    --encrypt-rsa-key /path/to/new-private.pem

If the private key is password-protected, the password can be provided via environment variables:

# For encryption scenarios, use JFS_ENCRYPT_RSA_PASSPHRASE.
export JFS_ENCRYPT_RSA_PASSPHRASE="your-passphrase"
juicefs sync /local/data s3://mybucket/backup --encrypt-rsa-key private.pem

# For decryption scenarios, use JFS_DECRYPT_RSA_PASSPHRASE.
export JFS_DECRYPT_RSA_PASSPHRASE="your-passphrase"
juicefs sync s3://mybucket/backup /local/data --decrypt-rsa-key private.pem

Notes:

Encrypted data is stored using a JuiceFS-specific format and can only be decrypted through juicefs sync with the corresponding key.
Back up encryption keys carefully. Once a private key is lost, encrypted data cannot be recovered.

Global traffic control

In earlier versions, juicefs sync already supported per-process rate limiting via --bwlimit. However, when multiple sync processes run concurrently—such as multiple Workers in a distributed sync, or multiple independent sync tasks sharing the same egress link—per-process limiting cannot constrain total bandwidth usage. The egress link may still be saturated, affecting other application traffic.

JuiceFS 1.4 introduces the --traffic-control-url parameter. Multiple sync processes can connect to the same external traffic control service, which allocates bandwidth quotas uniformly, enabling cross-process, cross-task global rate limiting.

How it works

Global traffic control uses a token bucket model. Before transmitting data, each sync process requests a number of byte credits from the shared traffic-control service.

The traffic-control service determines:

How many bytes to grant
How long the granted quota remains valid

When credits are exhausted, the process requests additional credits.

If a quota is about to expire before being fully consumed, the unused portion is returned to the service in advance.

The service exposes a simple HTTP API for granting and reclaiming quotas. This must be implemented by the user or integrated with an existing service:

POST /traffic-control
Content-Type: application/json

Request:
{"bytes": 1048576}
  bytes > 0: Request byte credits.
  bytes < 0: Return unused credits.


Response:
{"granted": 524288, "expired": 1000}
  granted: Number of bytes granted this time.
  expired: Credit validity period (milliseconds).

During synchronization, sync requests quotas from the traffic control service before transmitting data. If no credits are available, transmission blocks until new credits are obtained. In this way, multiple sync tasks can share a single global bandwidth limit, preventing the total traffic from becoming uncontrolled even when individual tasks have their own limits.
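Since the service has to be supplied by the user, here is a minimal sketch of one possible implementation using only the Python standard library. It follows the request and response shapes shown above, but several details are assumptions: the class and function names are hypothetical, the validity period is fixed at 1000 ms, a return request (`bytes < 0`) is answered with `granted: 0`, and the 100 Mbps cap and port 8080 are example values.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate_bytes_per_s, burst_bytes, clock
        self.tokens, self.last = burst_bytes, clock()
        self.lock = threading.Lock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def request(self, nbytes: int) -> int:
        """Positive: grant up to nbytes. Negative: take unused credits back."""
        with self.lock:
            self._refill()
            if nbytes < 0:
                self.tokens = min(self.burst, self.tokens - nbytes)
                return 0  # assumption: nothing is granted on a return request
            granted = int(min(nbytes, self.tokens))
            self.tokens -= granted
            return granted


# Example cap: about 100 Mbps, expressed in bytes per second.
BUCKET = TokenBucket(rate_bytes_per_s=100e6 / 8, burst_bytes=100e6 / 8)


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/traffic-control":
            self.send_error(404)
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        granted = BUCKET.request(int(json.loads(body)["bytes"]))
        # Assumption: every grant is valid for a fixed 1000 ms.
        payload = json.dumps({"granted": granted, "expired": 1000}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the service until interrupted; call this to start listening."""
    ThreadingHTTPServer((host, port), Handler).serve_forever()
```

Calling `serve()` starts the listener, after which sync processes can point `--traffic-control-url` at `http://127.0.0.1:8080/traffic-control`. A production service would also need input validation, authentication, and a policy for sharing bandwidth fairly between clients, none of which are covered here.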

Usage

# Deploy a traffic-control service first.
# (Example: listen on port 8080 and cap total bandwidth at 100 Mbps)
# (Service implementation is user-defined; JuiceFS only calls the API)

# Multiple sync processes share the same control service.
juicefs sync SRC1 DST1 --traffic-control-url http://127.0.0.1:8080/traffic-control &
juicefs sync SRC2 DST2 --traffic-control-url http://127.0.0.1:8080/traffic-control &

--traffic-control-url can be combined with --bwlimit.

The two mechanisms are independent:

--bwlimit limits the bandwidth of a single sync process.
--traffic-control-url limits aggregate bandwidth across multiple processes.

# Per-process limit: 50 Mbps. All processes combined respect the service-side cap.
juicefs sync SRC DST \
    --bwlimit 50 \
    --traffic-control-url http://controller:8080/traffic-control

Summary

JuiceFS 1.4 enhancements to sync include:

Resumable sync reduces recovery costs after task interruptions.
Encryption and decryption improve the security of backups and archival data.
Global traffic control enables multiple synchronization tasks to share bandwidth in a coordinated manner.

For scenarios such as data migration, cross-cloud sync, object storage backup, and encrypted archiving, users can combine these capabilities flexibly based on task scale, network environment, and security requirements.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

Monitoring JuiceFS with Better Stack

DASWU — Fri, 26 Jun 2026 07:46:14 +0000

After deployment, JuiceFS feels like a local drive, but underneath it's a sophisticated distributed system. This perfectly reflects one of its core design principles: distributed systems are complex, but from a user's perspective, they should be simple to use.

Even so, that simplicity on the surface doesn't negate the need for deep visibility. For any critical storage system, gaining real-time visibility into its operations is crucial to prevent subtle performance degradations from escalating into significant incidents.

Fortunately, JuiceFS exposes a suite of monitoring metrics, including throughput, IOPS, latency, data size, and many more, in the widely adopted Prometheus format, making it ready for modern monitoring stacks. Traditionally, you would probably pair Prometheus with Grafana to collect these metrics and visualize them. This is indeed a powerful combination. However, deploying, managing, and maintaining these systems yourself adds operational overhead of its own. Ironically, you may want to monitor them too, and trust me, you would rather not create yet another monitoring stack just to monitor your Prometheus and Grafana combo.

That's where Better Stack comes in. It is a fully managed SaaS observability platform that combines user-friendly dashboards, tracing, logging, error tracking, incident management, automatic alerting, and even AI-powered SRE, all for a predictable, cost-effective price. With Better Stack, you get the power of the best-in-class tools out of the box without the operational overhead.

In this post, we'll guide you through setting up a comprehensive monitoring system for JuiceFS using Better Stack, from metric ingestion to intelligent alerting, so you can ensure your file system remains healthy and performant.

Preparing the JuiceFS file system

Before diving into setting up Better Stack for monitoring, you'll need an existing JuiceFS file system that is actively publishing metrics. JuiceFS Community Edition and JuiceFS Enterprise Edition (our cloud service is based on JuiceFS Enterprise Edition) both expose real-time status metrics in Prometheus format, but they do it in slightly different ways.

For the JuiceFS Community Edition, after mounting the file system, JuiceFS automatically exposes metrics via http://localhost:9567/metrics by default on the mounting host where the JuiceFS client is running. You can customize this port using the --metrics option if needed.

On the other hand, for JuiceFS Enterprise Edition & Cloud Service, metrics are exposed through the console via dedicated API endpoints. You'll need to replace VOLUME_NAME with your file system name and API_TOKEN with your API token. In this case, both Prometheus and JSON formats are available for metrics:

Prometheus: https://juicefs.com/api/vol/VOLUME_NAME/metrics?token=API_TOKEN
JSON: https://juicefs.com/api/volume/VOLUME_NAME/status?token=API_TOKEN

A quick but important note: metrics are only generated when the file system is mounted. So before proceeding, ensure your JuiceFS file system is properly mounted and accessible. In this guide, we will use the JuiceFS Cloud Service, as it's the simplest way to get started. If you haven't set up JuiceFS yet, please refer to the documentation for detailed instructions. Once you have created your first file system, the metrics URLs mentioned above are available under its Monitor tab.
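Before handing an endpoint to a collector, it can help to confirm that it returns parseable Prometheus text. The sketch below is a simplified parser for the common `name{labels} value` line shape, not a full implementation of the exposition format. The function names are hypothetical, and the metric names in the sample are made up for illustration rather than taken from JuiceFS; only the default URL comes from the text above.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse simple sample lines from Prometheus text format."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # skip blank lines and HELP/TYPE comments
            continue
        if "}" in line:
            cut = line.rindex("}") + 1            # keep the label set with the series name
            series, rest = line[:cut], line[cut:].split()
        else:
            series, *rest = line.split()
        samples[series] = float(rest[0])          # ignore an optional trailing timestamp
    return samples


def fetch_metrics(url: str = "http://localhost:9567/metrics", timeout: float = 5.0) -> Dict[str, float]:
    """Fetch and parse metrics from a mounted Community Edition client."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample text; real metric names will differ.
sample = """# HELP demo_used_bytes Hypothetical gauge.
# TYPE demo_used_bytes gauge
demo_used_bytes{vol_name="myjfs"} 1024
demo_ops_total 7 1700000000000
"""
parsed = parse_metrics(sample)
```

Running `fetch_metrics()` on a host with a mounted file system should return a non-empty dictionary. An empty result or a connection error usually means the file system is not mounted or the metrics address was changed.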

Setting up a metrics source in Better Stack

With your JuiceFS file system up and running (don't forget to mount the file system to a host machine) and publishing metrics, the next step is to configure Better Stack to start ingesting that data.

First, if you haven't already, register for a Better Stack account. The process is seamless. Using a work email is recommended, and the platform provides clear guidance to help you set up your account and organization.

Once you're logged in, follow these steps:

In the left-hand navigation panel, head to Telemetry.
Under the Sources section, click Connect source.
Give your telemetry data source a descriptive name, such as "jfs-better-stack" or "juicefs-production", to easily identify it later.

Now, you'll configure how Better Stack should collect your metrics. In the collector settings:

Under Metrics, choose the Prometheus scrape option and click Connect source.
In the URLs to scrape section, input the JuiceFS metrics endpoint as described above.

Note that if you are not using the JuiceFS Cloud Service and your JuiceFS endpoint is behind a firewall, you'll need to allow traffic from Better Stack's scrape servers. The list of IP addresses to add to the allowlist is available in Better Stack's documentation.

After saving the configuration, Better Stack will begin scraping the endpoint. Your JuiceFS metrics should be received within a few seconds.

Creating a dashboard with AI SRE

With your JuiceFS metrics flowing into Better Stack, it's time to visualize them. You could build a dashboard manually, but Better Stack provides a smarter and more efficient way to do it by using AI SRE.

What is AI SRE?

AI SRE (Site Reliability Engineering) is Better Stack's chat-based site reliability assistant. It's an autonomous AI agent that can read your telemetry data, analyze incidents, build dashboards, and even write code to fix errors. Instead of waiting for humans to manually set up charts and queries, AI SRE can generate comprehensive dashboards for you based on a prompt.

Note that AI SRE is a paid feature. If you're on the free plan, you can still create dashboards manually using the drag-and-drop chart builder.

Creating a JuiceFS monitoring dashboard with a single prompt

Once your metrics source is ready, follow these steps:

From the left panel, head to Telemetry and then Metrics.
Click Create dashboard and select the Create with AI option.
In the prompt field, give AI SRE a clear description of what you need. For example: "Create me a dashboard to track ALL JuiceFS metrics, such as latency, data size, etc."
Also make sure to select the metrics Source you created earlier (for example, "jfs-better-stack") so that AI SRE has the proper context and data to work with.

Give the platform a few minutes to create the dashboard. AI SRE will analyze your JuiceFS metrics and automatically generate a complete set of charts and panels for important performance indicators such as throughput, IOPS, latency, and storage utilization. On my first try, it worked like a charm.

AI SRE is a powerful feature that does so much more than create dashboards. It can analyze incidents, perform root cause analysis, suggest fixes, and even open pull requests. We've only scratched the surface in this post. This is your first step toward a smarter, AI-assisted observability workflow. After building your dashboard, you can further customize it by adding panels, editing queries, or setting alerts directly from the graphs.

Conclusion

In this post, we walked through how to build a complete observability system for JuiceFS with Better Stack. We started by setting up the JuiceFS file system and exposing its Prometheus-formatted metrics, then created a metrics source in Better Stack to ingest the data. Finally, we used AI SRE to generate a full dashboard from a single prompt.

We hope this guide helps you gain better visibility into your JuiceFS deployment. If you have any questions or run into issues, we'd love to hear from you. Join the JuiceFS community on GitHub or Discord. And don't forget to check out Better Stack's documentation and their amazing YouTube channel for practical insights about distributed file storage, observability, AI, and more.

JuiceFS 1.4: Faster Metadata Operations with Batch Unlink, Batch Clone, and Redis Client-Side Caching

DASWU — Thu, 18 Jun 2026 08:30:56 +0000

In large-scale file access scenarios such as AI training and dataset management, metadata often becomes the first performance bottleneck as file counts and concurrency grow. Whether you're deleting millions of small files, cloning large datasets, or traversing directories under heavy concurrency, metadata performance directly impacts application efficiency.

JuiceFS Community Edition 1.4 introduces three major metadata optimizations:

Batch unlink for large-scale file deletion
Batch clone for metadata cloning
Redis client-side caching for hot metadata reads

These improvements reduce transaction commits, network round trips, and redundant metadata lookups. In tests on a flat directory containing 100,000 files, batch unlink improved performance by up to 93×, while batch clone achieved up to 24× speedup.

In this article, we’ll explain the motivation, design, and performance benefits behind these optimizations.

Deletion: From one‑by‑one to batched transactions

Under JuiceFS' metadata-data separation architecture, deleting a file involves much more than removing a directory entry. The system must also:

Update inode reference counts
Reclaim inode and space resources
Process trash entries
Update quota statistics

These operations must typically be completed within the same transaction.

When a directory contains hundreds of thousands or even millions of files, the traditional file-by-file deletion approach used by rm -rf quickly becomes a bottleneck. Each unlink request goes through the FUSE protocol, switches between kernel and user space, and triggers a separate metadata transaction.

As the number of files grows, the overhead from system calls, context switches, network round trips, and transaction commits accumulates rapidly.

To mitigate this issue, JuiceFS previously introduced the juicefs rmr command. Unlike rm -rf, rmr bypasses the FUSE layer and sends deletion requests directly to the client. It also supports multi-threaded deletion (50 threads by default), significantly improving throughput.

However, each file deletion still requires its own metadata transaction. Deleting 100,000 files still means executing 100,000 transactions.

Batch unlink takes the optimization one step further by merging many independent deletion operations within the same directory into a single batch transaction, which cuts both the number of transaction commits and the number of network round trips.

Core design

The key is to turn many small transactions into fewer large ones. JuiceFS adds a batch unlink interface at the metadata engine layer. It allows the client to delete multiple non‑directory files under the same directory in one call.

When recursively clearing a directory, JuiceFS reduces deletion overhead in two ways:

Different subdirectories are handled concurrently with multi‑threaded deletion.
Inside each directory, normal files and symlinks are grouped into batches and sent to BatchUnlink.

This merges many unlink operations into fewer batch transactions at the metadata level.

It's important to note that BatchUnlink does not directly delete directories. Directory removal still follows the standard recursive workflow: empty the subdirectory first, and then delete the subdirectory itself. Therefore, BatchUnlink only applies to regular files and symbolic links within the same directory.

This restriction preserves correct recursive deletion semantics while avoiding consistency risks to the directory tree structure.
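The split described above can be sketched in a few lines. This is a minimal illustration, not JuiceFS code: the in-memory `tree`, the `plan_batches` helper, and the tiny batch size are all hypothetical, and the walk is sequential, whereas the text describes subdirectories being handled concurrently.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory tree: directory path -> list of (name, is_dir).
Tree = Dict[str, List[Tuple[str, bool]]]

def plan_batches(tree: Tree, root: str, batch_size: int) -> Iterator[Tuple[str, List[str]]]:
    """Yield (operation, paths) steps for recursively clearing `root`."""
    files = [name for name, is_dir in tree[root] if not is_dir]
    # Non-directory entries of one directory are grouped into batches.
    for i in range(0, len(files), batch_size):
        yield ("batch_unlink", [f"{root}/{n}" for n in files[i:i + batch_size]])
    # Subdirectories are emptied first, then removed one at a time.
    for name, is_dir in tree[root]:
        if is_dir:
            child = f"{root}/{name}"
            yield from plan_batches(tree, child, batch_size)
            yield ("rmdir", [child])

tree = {
    "/data": [("a.bin", False), ("b.bin", False), ("c.bin", False), ("sub", True)],
    "/data/sub": [("d.bin", False)],
}
steps = list(plan_batches(tree, "/data", batch_size=2))
for op, paths in steps:
    print(op, paths)
```

Four files and one subdirectory produce three batch calls and one directory removal, instead of one metadata transaction per file.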

Implementation across metadata engines

JuiceFS uses different batching strategies depending on the metadata backend to minimize transaction commits and network round trips.

SQL backends (MySQL, PostgreSQL, etc.): Previously, each file deletion required its own sequence of INSERT, DELETE, and UPDATE statements. With BatchUnlink, the system:

Fetches all edge records for the target entries in a single batch query.
Retrieves the relevant inode attributes in a single locked batch query.
Executes edge deletions, inode state updates (decrementing nlink or marking for cleanup), and delfile entry insertions — all within one transaction.

Instead of executing one transaction per file, the entire batch can now be completed in a single transaction.
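As a rough illustration of the single-transaction pattern, the sketch below uses Python's built-in sqlite3 module. The `edge`, `node`, and `delfile` tables are a simplified, hypothetical schema that borrows names from the text; it is not the actual JuiceFS schema, and SQLite has no row-level locking, so the locked batch query is not reproduced.

```python
import sqlite3
from typing import List

def batch_unlink(db: sqlite3.Connection, parent: int, names: List[str]) -> int:
    """Remove several entries of one directory in a single transaction."""
    marks = ",".join("?" for _ in names)
    with db:  # commits once on success, rolls back on error
        rows = db.execute(
            f"SELECT name, inode FROM edge WHERE parent = ? AND name IN ({marks})",
            [parent, *names],
        ).fetchall()
        db.executemany("DELETE FROM edge WHERE parent = ? AND name = ?",
                       [(parent, name) for name, _ in rows])
        db.executemany("UPDATE node SET nlink = nlink - 1 WHERE inode = ?",
                       [(inode,) for _, inode in rows])
        # Inodes with no remaining links are queued for later cleanup.
        db.execute("INSERT INTO delfile (inode) SELECT inode FROM node "
                   "WHERE nlink = 0 AND inode NOT IN (SELECT inode FROM delfile)")
    return len(rows)

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edge (parent INTEGER, name TEXT, inode INTEGER);
    CREATE TABLE node (inode INTEGER PRIMARY KEY, nlink INTEGER);
    CREATE TABLE delfile (inode INTEGER PRIMARY KEY);
""")
db.executemany("INSERT INTO edge VALUES (1, ?, ?)", [("a", 10), ("b", 11), ("c", 12)])
db.executemany("INSERT INTO node VALUES (?, 1)", [(10,), (11,), (12,)])
db.commit()

removed = batch_unlink(db, 1, ["a", "b"])
```

The batch costs one commit regardless of how many names it contains, which is where the saving over per-file transactions comes from.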

Redis backend: The optimization uses Redis pipelines and transactions. Where individual deletions previously required separate command round trips, BatchUnlink collects all HDEL (dentry removal), ZADD (enqueue for cleanup), SET (inode attribute update), and INCRBY (counter update) commands for multiple files into a single pipeline, executed atomically within one MULTI/EXEC transaction. To avoid blocking Redis' single-threaded event loop for too long, batch size is capped at 250 entries.
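To make the batching concrete, here is a small sketch that only builds the command queue for each pipeline; it does not talk to Redis. The key names (`d<parent>`, `i<inode>`, `delfiles`, `totalInodes`), the placeholder attribute value, and the ZADD score are assumptions made for illustration. Only the command types and the 250-entry cap come from the description above.

```python
from typing import Iterator, List, Tuple

MAX_BATCH = 250  # cap stated in the text above

Command = Tuple[str, ...]
Entry = Tuple[str, int]  # (filename, inode)

def chunk(entries: List[Entry], size: int = MAX_BATCH) -> Iterator[List[Entry]]:
    """Split the entries of one directory into batches of at most `size`."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]

def build_pipeline(parent: int, batch: List[Entry]) -> List[Command]:
    """Queue every command for one batch between MULTI and EXEC."""
    cmds: List[Command] = [("MULTI",)]
    for name, inode in batch:
        cmds.append(("HDEL", f"d{parent}", name))           # remove the dentry
        cmds.append(("ZADD", "delfiles", "0", str(inode)))  # enqueue for cleanup
        cmds.append(("SET", f"i{inode}", "<updated attr>")) # update inode attribute
    cmds.append(("INCRBY", "totalInodes", str(-len(batch))))  # adjust the counter
    cmds.append(("EXEC",))
    return cmds

entries = [(f"file{i}", 1000 + i) for i in range(600)]
pipelines = [build_pipeline(1, batch) for batch in chunk(entries)]
print(len(pipelines), [len(p) for p in pipelines])
```

Deleting 600 files this way takes three round trips (250, 250, and 100 entries) rather than several round trips per file.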

TiKV backend: BatchUnlink consolidates multiple deletions into a single transaction, using TiKV's batch write capability to reduce network round trips and transaction overhead. For distributed key-value backends, this kind of batching allows the backend's concurrent write capacity to be more fully utilized.

The figure below shows benchmark results on a flat directory of 100,000 files using juicefs rmr --threads 16. BatchUnlink delivers meaningful improvements across all metadata backends, with TiKV and Redis showing the largest gains.

Clone: From one‑by‑one copy to batched references

juicefs clone creates fast copies of files or directories for training dataset version management, experiment snapshots, and large-scale directory duplication. Its efficiency comes from the fact that cloning doesn't immediately copy the underlying data blocks. Instead, it creates new file records at the metadata layer and reuses the source file's existing block references. New data blocks are only allocated when the clone is actually written to. This avoids the time and storage overhead of a full copy.

For large directory clones, the same problem as deletion arises: processing files one by one generates a large number of short transactions and network round trips. The core idea behind batch clone is to merge the clone operations for multiple files in the same directory into a single batch transaction. When recursively cloning a directory, the system reads directory entries in batches as a stream. For each batch, all non-directory entries are collected and cloned together in one operation.

One key implementation detail is inode pre-allocation: before entering the transaction, the system uses nextInode to pre-allocate target inodes for all entries to be cloned. This avoids lock contention from repeatedly requesting inodes inside the transaction. Once inside the transaction, the system batch-queries all source file attributes (with row locks), builds all the insertion data for target nodes, edges, chunks, symlinks, and xattrs, and then inserts everything in a single batch.
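The two ideas in this section, sharing block references instead of copying data and allocating target inodes before the transaction starts, can be shown with a toy in-memory store. `MetaStore` and its fields are hypothetical and far simpler than a real metadata engine; there are no transactions or locks here.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterator, List

@dataclass
class MetaStore:
    """Toy metadata store; not the JuiceFS schema."""
    next_ino: Iterator[int] = field(default_factory=lambda: count(100))
    chunks: Dict[int, List[str]] = field(default_factory=dict)  # inode -> block IDs
    refs: Dict[str, int] = field(default_factory=dict)          # block ID -> refcount

    def create(self, blocks: List[str]) -> int:
        ino = next(self.next_ino)
        self.chunks[ino] = list(blocks)
        for b in blocks:
            self.refs[b] = self.refs.get(b, 0) + 1
        return ino

    def batch_clone(self, sources: List[int]) -> List[int]:
        # Pre-allocate all target inodes before the batched write begins.
        targets = [next(self.next_ino) for _ in sources]
        # One batched write: new records reuse the source block references.
        for src, dst in zip(sources, targets):
            self.chunks[dst] = list(self.chunks[src])
            for b in self.chunks[src]:
                self.refs[b] += 1
        return targets

store = MetaStore()
a = store.create(["blk1", "blk2"])
b = store.create(["blk3"])
clones = store.batch_clone([a, b])
```

After the clone, each block is referenced twice and no block data has been copied; a real system would allocate new blocks only when a clone is written to.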

Batch clone uses each backend's native batch write capabilities in a similar way to batch unlink. The per-backend implementation details won't be repeated here.

The performance gains vary across backends depending on:

Transaction models
Network communication overhead
Batch insertion efficiency for metadata records such as nodes, edges, and chunk references

Results on a flat directory of 100,000 files are shown below. MySQL sees the largest improvement at approximately 24×; Redis at approximately 5×; TiKV at approximately 2×.

Redis client-side caching: Keeping hot metadata local

In high-concurrency metadata workloads such as AI training dataset access and large-scale container startup, network round trips between JuiceFS clients and Redis often become a major performance bottleneck.

Consider the following operation:

open("/mnt/jfs/dataset/images/cat.jpg")

Before the file can be opened, the Linux Virtual File System (VFS) must resolve every component in the path:

Look up dataset.
Look up images.
Look up cat.jpg.

If the images directory contains hundreds of thousands of files and training jobs perform random access across the dataset, each lookup requires a GET request to Redis.

Under heavy concurrency, this results in large numbers of network round trips and increased Redis CPU utilization. Even though a single Redis query takes only a few dozen microseconds, network latency pushes each lookup to hundreds of microseconds or even milliseconds. When thousands of training processes are accessing files simultaneously, this overhead becomes significant.
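A back-of-the-envelope sketch of that cost follows. The 500 µs per lookup is an assumed figure inside the "hundreds of microseconds" range mentioned above, and the model deliberately ignores any kernel-side dentry caching, so it is an upper bound for a cold path.

```python
from pathlib import PurePosixPath
from typing import List

def lookups_for(path: str, mount_point: str) -> List[str]:
    """Components resolved below the mount point, one lookup each."""
    return list(PurePosixPath(path).relative_to(mount_point).parts)

def lookup_time_ms(paths: List[str], mount_point: str, per_lookup_us: float) -> float:
    """Total time spent on uncached lookups, in milliseconds."""
    total = sum(len(lookups_for(p, mount_point)) for p in paths)
    return total * per_lookup_us / 1000.0

path = "/mnt/jfs/dataset/images/cat.jpg"
parts = lookups_for(path, "/mnt/jfs")
cost_ms = lookup_time_ms([path] * 1000, "/mnt/jfs", per_lookup_us=500)
print(parts, cost_ms)
```

Under these assumptions, 1,000 uncached opens of files at this depth spend about 1.5 seconds on lookups alone.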

How it works: Redis 6.0 client-side caching

Redis 6.0 introduced client-side caching, which allows clients to cache hot keys locally and receive invalidation notifications whenever those keys are modified.

Based on this capability, JuiceFS caches two categories of metadata in client memory:

Inode attribute cache. Keyed by inode number, this stores the complete attribute data for a file, such as type, size, permissions, and timestamps. The caching is implemented transparently through hook mechanisms in the Redis driver layer. On query, it first checks the local cache; on hit, it returns immediately without any network request. On modification, it automatically invalidates the corresponding cache. Application logic requires no awareness of the cache.
Directory entry cache. Keyed by "parent inode + path separator + filename," this caches the results of directory lookups. Unlike the inode attribute cache, the lookup logic for entry cache is embedded directly in the directory lookup path rather than being intercepted transparently at the driver layer. When entries for a directory are invalidated, all related cache entries under that directory are cleared using prefix matching. This allows path resolution and repeated access to hot entries in the same directory to be served from local memory.

Introducing client-side caching creates a consistency challenge in multi-mount scenarios. When multiple clients share the same JuiceFS file system, an operation on one client — creating, deleting, renaming, or updating attributes of a file or directory — can invalidate cached inode attributes or directory entries on other clients. Without an effective invalidation mechanism, subsequent reads could hit stale metadata, causing the directory entries or file attributes seen by one client to diverge from the actual state in the backend.

To address this, JuiceFS introduces a Tracking and Broadcast Invalidation (BCAST) model on top of Redis' client-side caching mechanism. After connecting to Redis, each client declares the metadata key prefixes it wants to track. When those keys are modified, Redis sends invalidation notifications to the relevant clients. On receiving a notification, the client clears the corresponding inode attribute cache or entry cache entries, so that subsequent accesses fetch fresh data from the metadata engine.
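The entry cache and its prefix-based invalidation can be sketched as follows. This is an illustration under assumptions: the `EntryCache` class, the `/` separator, and the `invalidate_dir` method are hypothetical names, and the notification from Redis is simulated by calling `invalidate_dir` directly.

```python
from typing import Dict, Optional

SEP = "/"  # hypothetical separator between parent inode and filename

class EntryCache:
    """Local directory-entry cache keyed by parent inode + separator + name."""

    def __init__(self) -> None:
        self._entries: Dict[str, int] = {}

    @staticmethod
    def key(parent: int, name: str) -> str:
        return f"{parent}{SEP}{name}"

    def get(self, parent: int, name: str) -> Optional[int]:
        return self._entries.get(self.key(parent, name))

    def put(self, parent: int, name: str, inode: int) -> None:
        self._entries[self.key(parent, name)] = inode

    def invalidate_dir(self, parent: int) -> int:
        """Drop every cached entry under one directory via prefix matching."""
        prefix = f"{parent}{SEP}"
        stale = [k for k in self._entries if k.startswith(prefix)]
        for k in stale:
            del self._entries[k]
        return len(stale)

cache = EntryCache()
cache.put(1, "a.jpg", 10)
cache.put(1, "b.jpg", 11)
cache.put(11, "a.jpg", 20)
dropped = cache.invalidate_dir(1)  # stands in for an invalidation notification
```

Including the separator in the prefix matters: without it, invalidating directory 1 would also clear entries that belong to directory 11.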

In addition, at client initialization, JuiceFS warms up metadata for the root directory of the mount point. Since the entries directly under the root are typically the most frequently accessed, benchmarks show this warm-up significantly improves overall access performance.

Through this mechanism, hot metadata can be reused locally. When the metadata changes, the related caches are evicted in time, reducing the risk of stale metadata.

When to use it

Redis client‑side caching works best in read‑heavy, write‑light scenarios with repeated access to hot metadata. AI training dataset loading is a good example: the dataset is usually read‑only during training, and tasks repeatedly access the same directories and files, so inode attribute cache and entry cache hit often, reducing redundant lookups and remote metadata queries.

The benefit is even more obvious when there is higher network latency between the client and the Redis metadata engine, such as in cross-availability-zone deployments.

Redis 6.0 or later is required to use this feature. The default cache expiration time is 1 minute, which provides a safety net in case of network interruptions or connection anomalies where invalidation notifications may not arrive, preventing stale entries from persisting indefinitely. For workloads with stricter consistency requirements, the expiration time can be shortened or client-side caching can be disabled entirely to reduce the risk of reading stale metadata.
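The expiration safety net can be illustrated with a small time-to-live cache. The `TTLCache` class and its injectable clock are hypothetical; only the one-minute default comes from the text. The fake clock keeps the example deterministic.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLCache:
    """Inode attribute cache whose entries expire even without invalidation."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[int, Tuple[float, bytes]] = {}

    def put(self, inode: int, attr: bytes) -> None:
        self._data[inode] = (self._clock() + self._ttl, attr)

    def get(self, inode: int) -> Optional[bytes]:
        item = self._data.get(inode)
        if item is None:
            return None
        expires_at, attr = item
        if self._clock() >= expires_at:
            del self._data[inode]  # expired: the caller must read from Redis again
            return None
        return attr

now = [0.0]  # fake clock so the example does not sleep
cache = TTLCache(ttl_seconds=60.0, clock=lambda: now[0])
cache.put(7, b"attr-v1")
fresh = cache.get(7)
now[0] = 61.0  # a minute passes and no invalidation message arrived
expired = cache.get(7)
```

A shorter expiration narrows the window in which a missed notification can expose stale metadata, at the cost of more round trips to Redis.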

Summary

These three optimizations each target a different path through the metadata layer:

Batch unlink merges multiple independent unlink operations within the same directory into a single batch transaction.
Batch clone merges multiple independent clone operations within the same directory into a single batch transaction.
Redis client-side caching keeps hot metadata in client memory, bringing read latency from network-level down to memory-level, with broadcast invalidation to maintain consistency across multiple clients.

BatchUnlink and BatchClone are internal interfaces. Users do not call them directly. Just use the right commands: juicefs rmr for deleting large directories, juicefs clone for copying directories. The optimization will be applied automatically.

One thing worth noting: both batch operations work by merging regular files within the same directory into a single batch transaction. Subdirectories are handled recursively by concurrent goroutines. The larger the directory, the greater the benefit.

All optimizations above are available in JuiceFS Community Edition 1.4. Upgrade the client to get the performance gains.

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and the community on Discord.

How Gongjiyun Keeps Model Distribution Fast Enough for Cross-Cloud Elastic Inference

DASWU — Fri, 12 Jun 2026 03:25:32 +0000

Founded in 2023 at Tsinghua University, Gongjiyun provides compute platforms and Model as a Service (MaaS) for artificial intelligence generated content (AIGC) enterprises and research institutions. We aim to alleviate the mismatch between elastic compute demand and supply. By aggregating idle IDC resources and edge resources, the platform offers containerized services, delivering rapidly schedulable compute for volatile workloads such as AI inference, video rendering, data processing, and data synthesis.

In cross-cloud elastic inference scenarios, compute tasks can be scheduled to different regions, cloud environments, and clusters, but model files and application data are large and cannot be migrated as quickly as compute resources. In online inference especially, the model repository is read-heavy and frequently accessed, so storage access performance directly affects service startup, elastic scaling, and request latency.

To address this, we built an object storage acceleration solution on top of JuiceFS, integrating users’ existing object storage into elastic inference clusters. Through a unified namespace, metadata import, FUSE mount, distributed cache, and data warm-up, it improves access efficiency for model repositories across clouds and clusters. In a case study with a leading text‑to‑image model community, the solution supports a tens‑of‑TB model repository, dynamic loading of checkpoints and low-rank adaptations (LoRAs), and elastic scaling of hundreds of GPUs at peak, while keeping additional latency within the customer’s acceptance range.

In this post, we'll walk through why storage, not compute, is the real bottleneck in cross-cloud elastic inference, how we evaluated and chose JuiceFS, and the step-by-step optimizations that brought the additional inference latency from about 10 seconds down to under 2 seconds in production.

Elastic demand is widespread, but supply is hard to match

As AI applications grow rapidly, compute demand continues to increase, but resource usage patterns differ across scenarios. Compared to training, which has stable resource needs, AI inference, data processing, and data synthesis are often more volatile: office applications may see higher traffic during the day, entertainment apps during evenings or weekends, and project‑based data processing may consume large amounts of compute in short bursts then idle. For small teams or exploratory applications, elastic compute also helps them better evaluate the relationship between per‑request cost and application value.

On the supply side, compute infrastructure is capital‑intensive. Resource providers are not incapable of offering elastic services, but they prefer long‑term dedicated leases to recover costs and reduce risk. As a result, low price, stability, and elasticity are difficult to achieve together:

Dedicated leases are low‑cost and stable but lack elasticity.
Spot resources are cheap and elastic but uncertain.
On‑demand resources are elastic and stable but expensive.

In China, this contradiction is further reflected by a market dominated by dedicated leases, with elastic supply accounting for a small share.

We aim to resolve this mismatch between elastic demand and supply. By aggregating idle IDC and edge resources, the platform offers containerized services, providing rapidly schedulable compute for AI inference, video rendering, data processing, and data synthesis. At lower resource costs, we help users quickly spin up tasks during peaks, schedule them across clusters, and handle elastic demand, while enabling resource providers to improve utilization and monetize idle capacity beyond dedicated leases.

Compute can be scheduled: How does storage keep up?

As elastic compute platforms have evolved, scheduling compute resources has become relatively easy. Container images can be synchronized across clusters via registries and distribution networks, tasks can be launched in different resource pools by schedulers, and traffic can be distributed via unified ingress and traffic management.

But model and data files are typically large, making cross‑cloud, cross‑cluster migration costly and slow, unable to match the sub‑second startup and release of compute. Therefore, in cross‑cloud elastic inference architectures, the real limitation on system elasticity is often not compute scheduling, but the efficiency of data and model distribution.

Different application scenarios have different storage requirements:

Model training, development, and debugging: These involve complex read‑write needs, including code repositories, model files, experiment results, and intermediate state. They also require high environment stability; users cannot tolerate state loss from frequent host switching. Thus, the platform typically provides long‑term stable compute resources and runtime environments, and storage needs can be met by existing stable storage systems.
Data processing: This can be split further. If a single processing job has high application value and can cover cross-cloud network transfer costs, you can build a pipeline that continuously pulls data from S3 or other object storage, processes it in the compute cluster, and streams the results back. The system does not need large local storage. If the data scale is larger or per-job value is low, local storage acts as a one-time cache. Data flows through and does not need to be persisted.

The truly challenging scenario is online inference. Online inference services cannot tolerate downtime. However, the resources used by an elastic computing platform may come from idle resource pools, and these resources could be preempted. Once resources in a certain data center or cluster become unavailable, the platform must be able to migrate tasks to other providers or other clusters in time. This means that not only the computing tasks must be migrated: model files and the related storage access capabilities must be migrated at the same time.

Online inference has higher requirements for service continuity and cross-cluster migration, but its storage access pattern is also clearer. Compared to training, development, and debugging scenarios, inference workloads are typically read-heavy. The core needs are efficient model loading, reading model weights, and accessing the model repository. For large models and online applications, model loading speed directly affects service startup time, elastic scaling efficiency, and request response stability. Therefore, inference scenarios are poorly served by traditional mixed read-write storage architectures. They are better served by specialized optimizations around model distribution, read-only access, and cache acceleration.

In addition, an elastic computing platform usually does not host a user's complete application system. The user's primary cloud account, application database, model management system, and even some fixed computing resources often already exist in other clouds or on premises. For the platform to integrate with the user's application, it must be compatible with the user's existing model repository and model management processes. It cannot require the user to fully migrate the entire system.

Therefore, to support cross-cloud elastic inference, we need more than just compute scheduling capabilities. We need a cross-cloud, high-performance storage and model distribution solution tailored for model inference scenarios. This solution must host a large model repository and serve high-performance reads, it must adapt to the user's existing model management system, and it must provide stable data access when resources are migrated across clouds and clusters.

Why JuiceFS: Unified cross-cloud access, strongly consistent metadata, and high-performance cache

Facing cross-cloud elastic inference scenarios, the storage system needs to meet several conditions at the same time:

It must provide a unified access point across different clouds and clusters. It must support shared read-write access and unified metadata management.
It must be compatible with the user's existing object storage and model repository to avoid data migration.
It needs low operational complexity and good read performance.

When evaluating storage options, we considered Ceph:

Ceph is mature. It’s suitable for building unified storage within a single data center or a stable resource domain.
However, in cross-cloud elastic inference scenarios, Ceph requires high network stability and strong operational skills, and the overall integration cost is higher, so we did not choose it.

We also evaluated Alluxio. However, in a multi-cloud environment, multiple clusters need to access the same underlying object storage data concurrently. The workload is not purely read-only; there are also occasional writes. This scenario requires strong data consistency. Therefore, Alluxio was not chosen for production.

We finally chose JuiceFS mainly because:

It uses object storage as the underlying data store.
It provides a unified namespace and consistent file system view through an independent metadata service. This allows multiple clusters to access the same model data as a file system.
This architecture is suitable for cross-cloud and cross-cluster model distribution and shared reading.
It’s also compatible with the user's existing object storage and model repository, reducing data migration and application integration costs.

The decision to further adopt JuiceFS Enterprise Edition was mainly due to its distributed caching capabilities and managed metadata service. In this scenario, the value of JuiceFS is not just providing a file system interface. It combines object storage, unified namespace, metadata management, and cache acceleration into a storage access layer that is better suited for cross-cloud elastic inference.

Practical: Object storage acceleration based on JuiceFS

Based on JuiceFS, the platform encapsulates an object storage acceleration product. This product connects the user's existing object storage to the elastic inference cluster. It provides the storage as a high-performance file system for the application. The overall process is as follows.

Create a file system. The user provides object storage access credentials, for example, AK/SK for S3-compatible storage. The credential permissions can be configured as read only or read-write based on application needs. The platform creates a corresponding JuiceFS file system based on that object storage.
Import metadata. The platform uses the JuiceFS import feature to scan the metadata of files in object storage. Then, it imports that metadata into the JuiceFS metadata service. In this way, the model files originally stored by the user in object storage can be accessed as file system directories in JuiceFS.
Create a cache group. Within each cluster that may host workloads, the platform sets up a JuiceFS cache group, which acts as a distributed cache for that cluster. Before running a task, the platform can warm up model files, caching hot data in the target cluster in advance. This reduces the time needed to pull data from remote object storage when the inference service starts.
Mount to application Pods. When the user's application runs, the platform uses the FUSE client to mount the JuiceFS file system into the application Pod. For the application, model files appear as local file system paths. Therefore, the original model reading logic usually does not need modification.
Enable node-local cache. Besides the cluster-level cache group, the node where the FUSE client runs can also provide local cache. This improves repeated read and model loading performance. It further reduces direct access to remote object storage.

This object storage acceleration product essentially productizes the JuiceFS metadata import, distributed cache, data warm-up, and FUSE mounting process. It allows the user's existing object storage to serve cross-cloud inference tasks in a way that feels closer to a local file system.

In addition, the JuiceFS cache group is independent from the file system access point. This characteristic, on one hand, adds management complexity on the platform side, because the platform needs to manage the relationships among the file system, cache groups, mount points, and task scheduling. On the other hand, it provides a foundation for cache isolation, independent scheduling, and fine-grained management based on clusters, users, or application scenarios in the future.

Production case study: A leading text-to-image model community

Scenario, challenges, and acceptance criteria

One of the most representative cases in this object storage acceleration solution involves a leading Chinese text-to-image model community hosting tens of terabytes of model data, including large checkpoint base models and a larger number of smaller LoRA models. In practice, inference jobs typically load a checkpoint first, then load one or more LoRA models to perform combined inference.

The company already operated compute infrastructure at scale — several thousand GPUs — but its workload, serving creative design and production use cases, exhibited significant variability. Overall average utilization was below 50%, yet during morning and afternoon peak hours on weekdays, load could reach 140% of normal capacity, degrading the user experience. The customer therefore needed a highly elastic compute supply.

We provided a high-elasticity resource model: compute at the scale of hundreds of GPUs was available only during weekday peak hours (10:00 AM–12:00 PM and 2:00–6:00 PM), with resources scaling to zero at all other times.

This meant the platform needed to provision hundreds of GPUs within a window of minutes, while consuming zero resources outside peak hours. For the customer, this model delivers large-scale compute during peak periods while avoiding payment for idle capacity. For the platform, it enables more efficient utilization and monetization of idle compute resources.

The technical challenges were significant:

A model repository of this scale cannot simply be replicated to every elastic cluster.
Inference services do not load all models once at startup. Model reads and switches happen continuously as user requests arrive, resulting in high access frequency. Therefore, the object storage acceleration solution needed to support not just large-scale model repository access, but stable read performance under continuous dynamic loading.

The customer's performance requirements were also strict. During acceptance testing, a portion of production traffic was routed to the elastic cluster. The requirement was that both the median and mean inference latency of the elastic cluster must stay within 2 seconds of the customer's own cluster. Given that individual inference jobs take on the order of tens of seconds, this requirement left virtually no room for additional latency introduced by the storage layer. In the first few rounds of testing, both median and mean inference latency on the elastic cluster exceeded the customer's own cluster by approximately 10 seconds — failing the acceptance criteria.
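The acceptance check can be expressed as a small calculation. The sketch below reads "within 2 seconds" as elastic minus baseline, which is an interpretation, and the latency samples are synthetic numbers invented for illustration; they are not measurements from this case.

```python
from statistics import mean, median
from typing import Dict, Sequence

def latency_gap(elastic: Sequence[float], baseline: Sequence[float]) -> Dict[str, float]:
    """Extra latency of the elastic cluster, for both statistics."""
    return {
        "median": median(elastic) - median(baseline),
        "mean": mean(elastic) - mean(baseline),
    }

def passes(gap: Dict[str, float], limit_s: float = 2.0) -> bool:
    """Both the median gap and the mean gap must stay within the limit."""
    return all(v <= limit_s for v in gap.values())

# Synthetic latencies in seconds, invented for illustration only.
baseline = [30.0, 31.0, 32.0, 33.0, 34.0]
elastic = [31.0, 32.0, 33.0, 34.0, 50.0]  # one long-tail request
gap = latency_gap(elastic, baseline)
print(gap, passes(gap))
```

In this made-up sample the median gap is 1 second but a single slow request pushes the mean gap to 4 seconds, so the check fails. That is why the two statistics have to be examined separately.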

Performance optimization: Reducing additional latency on the elastic cluster

Optimization began with the median. A high median indicates that a significant proportion of requests are experiencing performance degradation, not just a small number of outliers inflating the tail. JuiceFS monitoring revealed that the cluster's cache hit rate was not reaching the expected level. In the current architecture, a cache miss requires a round trip over the public internet to the customer's object storage on Alibaba Cloud. This significantly increases model loading time, which in turn increases inference request latency.

To solve this, the platform used the isolation capability of the JuiceFS cache group. It assigned dedicated cache nodes to this customer, reserved enough cache space, and warmed up the core model data. After warming up, the access path for core models achieved a cache hit rate of nearly 100%. This effectively avoided the performance loss from fetching missed data back across the public network.

The second factor affecting the median was metadata access latency. Because the platform uses a unified cross-cluster architecture, the metadata service is accessed over the public internet, for example, via JuiceFS Cloud Service or a deployment on a remote host, and this latency affects overall model read performance.

The platform took two measures to address this issue:

Enabling JuiceFS' open cache to keep metadata in local memory as much as possible. Since this workload is predominantly read-only, caching is an effective way to reduce metadata access overhead.
Tuning the cluster's network rate-limiting policy. While the platform cannot directly control network equipment in edge data centers, it can apply node-level rate limiting to prevent any single node from saturating the available bandwidth, improving overall network stability.

After these optimizations, cluster-wide performance improved meaningfully and the median metric gradually reached the customer's requirement.

Once the median met the target, the mean still showed a gap. This indicated that long-tail requests remained, with a small number of requests taking significantly longer than normal and pulling up the overall average. Further analysis traced this to node-level local cache — specifically, the FUSE cache quota. With limited cache capacity, the elastic cluster experienced more frequent cache evictions than the customer's own cluster, causing some requests to reload model data from scratch and increasing mean inference latency. The platform addressed this by increasing the FUSE local cache quota in the production environment, reducing eviction frequency, improving tail latency, and ultimately bringing the mean metric within acceptance. The system passed validation and has been running stably since.

Multi-tenant cache management

After the single-tenant case was validated, the solution entered multi-tenant operation. As different tenants began time-sharing the same elastic nodes, a new issue emerged: cache contention between tenants.

In the elastic resource model, FUSE clients do not actively clear node cache on exit. This is a reasonable design in single-tenant scenarios, where cached data from previous jobs can be reused by subsequent jobs to improve hit rates. However, in multi-tenant scenarios, one tenant's data can occupy node cache for extended periods. This leaves insufficient cache capacity for the next tenant, who is then forced to fall back to object storage, causing a noticeable performance drop.

To address this, we deployed an independent daemon process on each node that performs a global cache garbage collection (GC) pass before the application FUSE client starts. The eviction strategy references the JuiceFS FUSE client implementation, using a 2-random policy to balance collection efficiency and performance overhead. Coordination across nodes is handled via Kubernetes distributed locks: only the client that acquires the lock executes GC, preventing multiple clients from running cache collection simultaneously and creating excessive network and I/O pressure.
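As a rough sketch of what a 2-random eviction pass can look like, the snippet below repeatedly samples two cache entries and evicts the one accessed less recently. The `CacheEntry` type, the `evict_two_random` function, and the in-memory dictionary are illustrative assumptions, not the daemon's actual code. The Kubernetes lock that guards the pass is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class CacheEntry:
    size: int      # bytes occupied on local disk
    atime: float   # last access time (seconds since epoch)


def evict_two_random(entries, capacity, rng=random):
    """Evict entries until the total size fits in `capacity`.

    Each round samples two entries at random and evicts the one that was
    accessed less recently. This approximates LRU without maintaining a
    sorted list. Returns the evicted keys in eviction order.
    """
    used = sum(e.size for e in entries.values())
    evicted = []
    while used > capacity and entries:
        keys = list(entries)
        candidates = rng.sample(keys, min(2, len(keys)))
        victim = min(candidates, key=lambda k: entries[k].atime)
        used -= entries.pop(victim).size
        evicted.append(victim)
    return evicted


# Ten hypothetical 4-byte entries, oldest first; shrink the cache to 20 bytes.
cache = {f"chunk-{i}": CacheEntry(size=4, atime=float(i)) for i in range(10)}
evicted = evict_two_random(cache, capacity=20, rng=random.Random(0))
print(len(evicted), "entries evicted")
```

Because the victim is always the older of two sampled entries, the most recently used entry is never evicted while at least two entries remain, and each round needs no global ordering.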

This mechanism effectively mitigates the problem of historical jobs occupying cache resources in multi-tenant scenarios, allowing different tenants sharing elastic resources to maintain consistent cache performance.

Conclusion

For elastic compute to reliably serve production traffic, compute scheduling alone is not enough. Model data and hot data must remain stably accessible across clouds and clusters.

Built on JuiceFS, we’ve combined object storage, unified namespace, metadata management, distributed caching, and FUSE mounting into an object storage acceleration solution purpose-built for elastic inference. This is not simply about mounting object storage as a file system. It’s about building a data access layer around the access patterns of model inference: one that supports warm-up, caching, isolation, and management.

This represents Gongjiyun's current progress in elastic compute and cross-cloud storage acceleration. As AI inference scenarios continue to evolve, model distribution, cache management, and multi-cluster data access will continue to surface new engineering challenges. We look forward to exchanging ideas with developers, AI application teams, and infrastructure practitioners, and to exploring more stable and efficient data access solutions for elastic compute environments.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

Reducing Data Storage Costs: A Deep Dive into JuiceFS 1.4 Tiered Storage

DASWU — Fri, 05 Jun 2026 07:50:42 +0000

JuiceFS Community Edition 1.4 introduces enhanced tiered storage capabilities, allowing users to set object storage classes at the file or directory level. This makes it possible to manage different storage tiers for data under a unified file system interface. In this article, we’ll discuss this feature’s application background, evolution, usage model, implementation, and future plans.

Application background

In real‑world scenarios, different files have different access patterns and performance requirements. Some data is read or written frequently and demands low latency and high throughput. Other data is rarely accessed after being written, and the main concern is long‑term storage cost. Tiered storage addresses this by placing data in the appropriate storage layer based on access patterns, balancing performance and cost.

Typically, data can be classified into three categories:

Hot data: Frequently accessed, requires low latency and high throughput.
Warm (infrequent access) data: Accessed occasionally, but still requires fast retrieval when needed.
Cold (archival) data: Primarily for long‑term retention, very low access frequency, can tolerate some restoration delay in exchange for lower cost.

Object storage already offers tiering capabilities. For example, Amazon S3 provides S3 Standard for frequently accessed data, S3 Standard‑IA for infrequent but still millisecond‑accessible data, and Glacier / Deep Archive for long‑term archiving. These storage classes differ in access latency, minimum storage duration, and pricing.

The table below compares main S3 storage classes:

Storage class	Use case	First byte latency	Minimum storage duration fee
S3 Standard	General-purpose storage for frequently accessed data	Milliseconds	N/A
S3 Standard-IA	Infrequently accessed data requiring millisecond access	Milliseconds	30 days
S3 Glacier Deep Archive	Archiving very rarely accessed data with very low cost	Hours	180 days

For JuiceFS, which is built on top of object storage, the key is to translate these capabilities into file‑system‑level tiering: users set storage tiers for files, directories, or datasets, and JuiceFS maps them to the underlying object storage while handling writes, migrations, and restore operations.

Evolution of JuiceFS tiering capabilities

The evolution of JuiceFS tiering has moved from being “passively unaware of object storage classes” to “actively managing storage tiers at file and directory granularity.”

Before v1.1, JuiceFS did not provide a way to configure storage classes. While users could manually change the storage class of objects at the object storage side, these changes were not recognized or managed by JuiceFS at the file system level. For standard and infrequent‑access classes that support direct access, normal read/write operations usually continued to work. However, if objects were moved to archival storage, access would fail because those objects cannot be read directly.

Starting with v1.1, JuiceFS supports setting the object storage class via --storage-class. For example, you can specify the default storage class for the file system at format time or override the storage class used for data written to a specific mount point during mount. This gave JuiceFS a basic ability to leverage object storage tiering. However, the configuration granularity remained coarse – primarily at the file system default or mount‑point level – and did not allow fine‑grained management per directory, per file, or per dataset.

Version 1.4 further advances tiering capabilities to the file and directory level. You can assign a storage tier to individual files or directories based on data temperature. When a directory is assigned a tier, newly created files and subdirectories under it automatically inherit that configuration. Compared to the previous default or mount‑point level settings, v1.4 is better suited for tiered management by project, directory, dataset, or data temperature.

How to configure tiered storage

The key to tiered storage in JuiceFS 1.4 is translating object storage classes into file‑system‑manageable tiers. The usage model consists of two steps:

Map tier IDs to object storage classes.
Assign files or directories to those tier IDs.

This allows users to organize tiering policies by file, directory, or dataset without specifying the underlying storage class on every write.

The figure below shows mapping tier IDs to storage classes:

For example, map tier IDs 1–3 to different storage classes:

juicefs config redis://localhost --tier-id 1 --tier-sc STANDARD_IA -y  
juicefs config redis://localhost --tier-id 2 --tier-sc INTELLIGENT_TIERING -y  
juicefs config redis://localhost --tier-id 3 --tier-sc GLACIER_IR -y

After mapping, set the storage tier for a file or directory:

juicefs tier set redis://localhost --id 1 /path/to/file  
juicefs tier set redis://localhost --id 2 /path/to/dir

Directory‑level settings have inheritance semantics. Once a directory is assigned a tier ID, newly created files and subdirectories will inherit that tier. To apply the tier to existing data under the directory, use -r to recursively set the tier:

juicefs tier set redis://localhost --id 2 /path/to/dir -r

For archival storage classes such as Glacier, a restore request must be issued before reading:

juicefs tier restore redis://localhost /path/to/dir -r

Implementation

From an implementation perspective, the key to tiered storage in v1.4 is storing tier information in metadata and using the tier ID to decide the object storage behavior during writes, migrations, and reads.

Metadata design

JuiceFS uses tier-id on files and directories to indicate the storage tier. A value of 0 means the default storage tier; values 1 to 3 correspond to user‑configured object storage classes.

Thus, the storage tier is no longer just an external state at the object storage side, but becomes part of the file system metadata that JuiceFS can understand and manage. When writing new data, migrating existing data, or checking file status, JuiceFS can determine the intended storage class based on this metadata.
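The metadata model can be pictured with a small sketch. The `Node` class, the `storage_class` helper, and the `STANDARD` default are assumptions made for illustration; they do not mirror JuiceFS' internal structures. The tier mapping reuses the values from the configuration example above.

```python
from dataclasses import dataclass, field

DEFAULT_TIER = 0  # tier-id 0: the file system's default storage tier


@dataclass
class Node:
    name: str
    is_dir: bool
    tier_id: int = DEFAULT_TIER
    children: dict = field(default_factory=dict)

    def create(self, name, is_dir=False):
        """Create a child that inherits this directory's current tier-id."""
        child = Node(name, is_dir, tier_id=self.tier_id)
        self.children[name] = child
        return child


def storage_class(node, tier_map, default_sc="STANDARD"):
    """Resolve the storage class implied by a node's tier-id."""
    if node.tier_id == DEFAULT_TIER:
        return default_sc  # assumed default class for this sketch
    return tier_map[node.tier_id]


tier_map = {1: "STANDARD_IA", 2: "INTELLIGENT_TIERING", 3: "GLACIER_IR"}

root = Node("/", is_dir=True)
dataset = root.create("dataset", is_dir=True)
dataset.tier_id = 2                      # comparable to `juicefs tier set --id 2`
new_file = dataset.create("part-0001")   # created after the tier was assigned
plain_file = root.create("readme.txt")   # parent has no tier assigned

print(storage_class(new_file, tier_map))    # INTELLIGENT_TIERING
print(storage_class(plain_file, tier_map))  # STANDARD
```

In this simplified model the tier-id is copied at creation time, which is why files that existed before the directory was assigned a tier need a recursive set to pick it up.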

Migrating existing data

For existing data, changing the storage tier involves not only updating the metadata tier-id but also changing the actual storage class of the underlying objects. When a directory is set recursively, JuiceFS processes all files and subdirectories under it and uses the object storage’s copy capability to migrate existing objects to the new storage class.

If only the mapping from a tier ID to a storage class is changed, the actual storage class of existing objects is not automatically updated. In that case, you must use tier set --force to explicitly trigger the change. This will rewrite the objects with the new storage class.
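The difference between changing a file's tier and changing only the mapping can be modeled as follows. This is a behavioral sketch under stated assumptions: `plan_migration`, the object keys, and the recorded storage classes are hypothetical and only mimic the rules described above.

```python
def plan_migration(objects, file_tier, tier_map, new_tier, force=False):
    """Plan the object copies needed to move one file to `new_tier`.

    objects   : dict of object key -> storage class the object currently has
    file_tier : tier-id currently recorded in the file's metadata
    force     : re-examine objects even when the tier-id is unchanged
    Returns (tier_id_to_record, [(object_key, target_class), ...]).
    """
    target = tier_map[new_tier]
    if new_tier == file_tier and not force:
        return file_tier, []  # metadata already matches, objects untouched
    copies = [(key, target) for key, sc in sorted(objects.items()) if sc != target]
    return new_tier, copies


# A file on tier 1 whose objects were written as STANDARD_IA.
objects = {"obj-a": "STANDARD_IA", "obj-b": "STANDARD_IA"}
tier_map = {1: "STANDARD_IA"}

# Only the mapping changes: tier 1 now points at a different class.
tier_map[1] = "GLACIER_IR"

print(plan_migration(objects, 1, tier_map, 1))              # no copies planned
print(plan_migration(objects, 1, tier_map, 1, force=True))  # both objects rewritten
```

Without `force`, the file's tier-id already equals the requested one, so nothing is rewritten even though the objects no longer match the mapping.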

Write path

When a new file is written, JuiceFS determines the target storage class based on the file’s own tier-id or, if not set, the inherited tier-id from its parent directory. For directories that already have a storage tier assigned, new data can be written directly to the corresponding storage tier. This avoids the overhead of first writing to the default tier and then migrating later.

Read path

For storage classes that support immediate access (for example, Standard and Standard‑IA), reads are transparent to the application, and JuiceFS simply reads the data from object storage as usual.

For archival classes such as Glacier and Deep Archive, objects cannot be read directly. You must first issue a restore request using juicefs tier restore. This sends a request to the object storage service. Whether and when the objects become readable depends on the cloud provider’s restore mechanism. After restoration completes, applications can retry the read.

Therefore, archival storage is suitable for data that is accessed very infrequently and can tolerate restoration delay. It’s not appropriate for workloads that require online access at any time. When using archival tiers, you must consider storage cost, restoration time, and restoration costs.
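From the application's side, reading archived data becomes a "request restore, then retry" loop. The sketch below uses injected callables and a fake backend so that it runs on its own. `NotRestoredError`, `read_after_restore`, and `FakeArchive` are hypothetical names; in practice the restore step would be the `juicefs tier restore` command and the wait could be hours.

```python
import time


class NotRestoredError(Exception):
    """Raised by the hypothetical reader while an object is still archived."""


def read_after_restore(path, request_restore, try_read,
                       poll_interval=0.0, max_attempts=5, sleep=time.sleep):
    """Issue a restore request, then retry the read until it succeeds."""
    request_restore(path)
    for _ in range(max_attempts):
        try:
            return try_read(path)
        except NotRestoredError:
            sleep(poll_interval)  # restore is asynchronous: wait, then retry
    raise TimeoutError(f"{path} not readable after {max_attempts} attempts")


class FakeArchive:
    """Stand-in backend: data becomes readable after a few failed reads."""

    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.restore_requested = False
        self.reads = 0

    def request_restore(self, path):
        self.restore_requested = True

    def try_read(self, path):
        self.reads += 1
        if not self.restore_requested or self.reads <= self.ready_after:
            raise NotRestoredError(path)
        return b"data"


archive = FakeArchive(ready_after=2)
data = read_after_restore("/path/to/file", archive.request_restore, archive.try_read)
print(data, "after", archive.reads, "read attempts")
```

A real caller would pick a poll interval and attempt limit that match the provider's restore time for the chosen archival class.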

Future plans

Reducing operational costs of archival storage

Archival storage classes have low long‑term storage costs, but they often come with complex cost models covering writes, restores, early deletion, and lifecycle transitions. Writing data directly to archival storage may incur extra costs in scenarios with frequent changes or bulk migrations.

In the future, JuiceFS could integrate with object storage lifecycle management. Data could first be written to standard storage with specific object tags. Users could then use cloud‑vendor lifecycle rules to automatically and cost‑effectively transition data to infrequent‑access or archival tiers based on those tags. This would preserve JuiceFS’ file‑system‑level tiering capabilities while leveraging native batch transition mechanisms to reduce overhead.

Extending tiering to multi‑bucket, multi‑cloud

Currently, tiered storage works on different storage classes within the same object storage backend. In the future, JuiceFS could extend “tier” to different buckets, different object storage services, or even different cloud providers. Tiering would no longer be limited to a single backend.

For example, hot data could be stored in a local high‑performance MinIO cluster backed by SSDs, while cold or archival data resides in low‑cost cloud archival buckets. Policies could then gradually move data from the hot tier to the cold tier. With such an architecture, JuiceFS could offer cross‑bucket, cross‑cloud, and cross‑media tiered data management under a unified file system namespace.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

JuiceFS at Xiaomi: Unified Storage for AI, Big Data, and Cloud‑Native Workloads

DASWU — Wed, 03 Jun 2026 08:29:53 +0000

Xiaomi is one of the world's leading smartphone companies. Since 2021, its storage team has been building a file storage platform based on JuiceFS, initially providing file storage capabilities for cloud‑native and some application scenarios. After Xiaomi announced its comprehensive AI strategy in 2024, issues with the previous heterogeneous storage system became more evident in areas such as technology selection, data flow, and development/operations. Given JuiceFS' multi‑protocol access, elastic scalability, multi‑cloud adaptability, and high performance, the team decided to build a unified file storage foundation centered on it to support big data, cloud‑native, and AI workloads.

To achieve this goal, the platform further developed core capabilities, including a capacity layer, a performance layer, and a cache layer. These reduce the complexity of multi‑system access and data movement while balancing large‑scale storage with high‑performance access. Over the past two years, with the rapid growth of generative AI and autonomous driving, the platform has supported typical scenarios such as large‑model training, autonomous driving training, inference acceleration, and big‑data cloud migration. Today, the platform can handle hundreds of billions of files and EB‑scale storage, covering the entire AI storage chain from raw data and training data to model file distribution.

Storage architecture challenges under the AI strategy

Before 2023, Xiaomi, like most companies, had built multiple storage systems for different application scenarios. In the big data area, the data platform was mainly based on HDFS; AI workloads, before the rise of large language models, relied primarily on high‑performance file storage services on the cloud, such as Parallel File System (PFS) and Network Attached Storage (NAS).

During this period, we also began to introduce JuiceFS and built an internal self‑developed File Storage Service (FDS), using components like JuiceFS CSI Driver to provide file storage for cloud‑native and some application scenarios. As application needs evolved, these storage systems grew independently. This led to a complex heterogeneous storage landscape.

In 2024, after Xiaomi announced its comprehensive AI strategy, the shortcomings of the previous storage system became more pronounced in areas such as technology selection, access, data flow, and development/operations.

These challenges included:

High selection and access costs: With many storage systems and inconsistent capabilities, application teams had to understand and adapt to each one, raising the barrier to entry.
Low data flow efficiency: The lack of a unified access method across systems led to frequent cross‑system data copying. This hurt development efficiency.
Scattered development and operations efforts: Multiple systems were maintained and evolved independently, making it difficult to focus resources on the mission-critical infrastructure required for the AI strategy.

To address these issues, we conducted in‑depth internal discussions and architectural adjustments in 2024, and began redesigning a unified storage architecture for AI, big data, and cloud‑native scenarios.

Building a unified file foundation with JuiceFS

Selection rationale: multi‑protocol support, elasticity, multi‑cloud, high performance

JuiceFS is a distributed file system that natively supports multi‑protocol access, elastic scaling, and high‑performance reads/writes. This makes it a natural fit for both AI and big data storage needs.

In the cloud-native field, we’ve been using JuiceFS since 2021, continuously conducting internal development and iterative optimization. At the same time, we maintain close collaboration with the JuiceFS open-source community to jointly drive technology evolution and real-world adoption.

In AI scenarios, model training and inference rely heavily on POSIX semantics, which aligns naturally with JuiceFS capabilities. Meanwhile, in the big data area, we were already promoting HDFS replacement during cloud migration, a practice with many mature industry examples, so adapting the HDFS protocol was also feasible.

Considering multi-protocol support, elastic scalability, multi-cloud adaptability, and high-performance read/write, we ultimately chose JuiceFS as the core component of our unified file storage foundation. This solved the problems of complex data flow, high access costs, and scattered operations caused by using different file systems across multiple platforms and application units.

Storage layer capability construction

Our core goal is to build a unified file storage layer on top of JuiceFS, providing large capacity, high performance, and standardized access interfaces to uniformly support the three core application scenarios: big data, cloud-native, and AI.

On the client side, we fully leverage JuiceFS’ multi-protocol capabilities, offering access methods including POSIX, Hadoop SDK, Python SDK, and S3 Gateway. They’re all already in use internally.

On the data plane, the architecture consists of three layers:

Capacity layer: Built on public cloud object storage, designed for EB‑scale storage, supporting multi-cloud deployments across different strategic data centers and multiple cloud providers.
Performance layer: Large‑scale tuning based on Ceph and all‑flash nodes, designed for AI training and other scenarios with high throughput and low latency requirements.
Cache layer: Given the “write once, read many, seldom modify” characteristic of AI training datasets, we developed a high‑performance distributed cache system based on NVMe and RDMA to reduce repeated read costs and improve training data access efficiency.

On the control plane, we made custom enhancements to the Community Edition:

For metadata, we built a distributed metadata service based on the Raft protocol to integrate with internal infrastructure systems and support multi-system access, improving reliability and scalability.
For backend management, we built a unified management service responsible for data lifecycle management, tiered storage, garbage collection, and warm-up of hot data from the capacity layer to the performance or cache layers.

Through these efforts, JuiceFS has gradually become the unified file storage foundation at Xiaomi, supporting both large‑scale capacity storage and high‑performance access for AI training. The architecture is now running in production and provides the high throughput required for large model training.

Our practices

During the construction of the unified file storage foundation, JuiceFS has gradually covered Xiaomi’s mission-critical application scenarios, including big data, cloud-native, and AI:

In terms of scale, the solution can support EB‑level storage and hundreds of billions of files.
In terms of capability, the coordinated design of the capacity, performance, and cache layers balances large‑scale storage with high performance.

Below we describe two typical scenarios: big data cloud migration and the AI storage pipeline.

Big data cloud migration and unified lakehouse storage

In its early days, our big data system was mainly built on the Hadoop ecosystem, where HDFS used a previous‑generation architecture that coupled storage and compute. Over time, this architecture showed problems such as performance fluctuations, complex operations, and high total cost. In contrast, cloud storage offers significant advantages in elastic scaling, resource utilization, and cost control. Therefore, starting in 2021, we began systematically migrating big data to the cloud.

From cold data to the lakehouse layer

Our big data cloud migration went through three stages:

Cold data migration: We first migrated cold data from HDFS to cloud storage, a process lasting over two years.
Lakehouse layer migration: We developed a unified lakehouse file system in-house, promoting the evolution from coupled to decoupled storage and compute.
Unified storage foundation based on JuiceFS: After selecting JuiceFS, we migrated the entire lakehouse layer to JuiceFS.

Lakehouse construction can leverage Iceberg’s native support for object storage access (like OSS or S3). However, our workloads span multiple regions globally and run on several cloud vendors. Adapting to each vendor individually would incur high access and maintenance costs.

Thus, we chose JuiceFS to uniformly access different cloud storage. Upper‑layer services simply switch the backend storage address via the SDK to adapt to access in different cloud environments, greatly reducing multi‑cloud complexity.

For data migration, our self‑developed data‑factory platform supports transparently switching a table’s underlying storage to the new architecture and gradually migrates existing data to the cloud in the background, with little or no impact on applications. Moreover, JuiceFS supports multi-cloud and on‑premises deployment. If future cost or strategic considerations require switching to self‑built storage, data can be smoothly migrated back via JuiceFS. This preserves architectural flexibility.

Hot table cache acceleration for compute efficiency

After data was in the cloud, we further analyzed access patterns of the lakehouse layer. For daily reporting and analysis tasks, computation is usually concentrated on day‑level or week‑level hot data, not requiring frequent full scans. Therefore, the performance focus for the lakehouse layer was not simply improving full‑scan throughput but rather increasing hot data access efficiency and task execution stability.

Based on this, we built a hot table warm-up capability in cooperation with the lakehouse layer. The system identifies hot tables and their hot partitions based on daily access statistics, and preloads related data into the cache layer before task execution via a warm-up interface. For periodic reporting tasks that must be completed by 8 AM, hot data is warmed up before computation. This reduces remote reads and repeated access.
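The selection step can be sketched as a simple ranking over one day's access statistics. The log format, the `pick_hot_partitions` and `warm_up` functions, and the `preload` callback are hypothetical stand-ins; the actual warm-up interface of the platform is not described in this article.

```python
from collections import Counter


def pick_hot_partitions(access_log, top_n):
    """Return the `top_n` most-read (table, partition) pairs from one day's log."""
    counts = Counter(access_log)
    return [key for key, _ in counts.most_common(top_n)]


def warm_up(partitions, preload):
    """Ask the cache layer to preload each selected partition."""
    for table, partition in partitions:
        preload(table, partition)


# Hypothetical daily access statistics: one (table, partition) pair per read.
access_log = (
    [("orders", "dt=2026-06-01")] * 5
    + [("users", "dt=2026-06-01")] * 3
    + [("orders", "dt=2026-05-01")] * 1
)

hot = pick_hot_partitions(access_log, top_n=2)
preloaded = []
warm_up(hot, lambda table, partition: preloaded.append((table, partition)))
print(preloaded)
```

A scheduler would run this ahead of the reporting window so that the selected partitions are already cached when the 8 AM jobs start.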

Through offline and online testing, after hot table caching, compute efficiency improved by about 10-20%, with reductions in both computation time and resource consumption. The cache size has reached PB level, with average throughput around 200 GB/s. The cache layer also reduces cross‑cloud bandwidth pressure and cloud storage API call costs: by improving the hot data hit rate, repeated cross-cloud reads can be reduced, thereby lowering bandwidth consumption and access costs.

Benefits for big data

Benefits for our big data application include:

Performance: After switching to JuiceFS, sequential read/write performance improved significantly, more than doubling in some scenarios. Overall task duration decreased by about 10–30%.
Cost: By Xiaomi's internal cost metrics, the unified storage architecture has greatly lowered storage costs – about 70% in China and 90% in overseas regions. The overseas legacy solution, which used HDFS with three replicas on cloud instances and EBS, had a high replication factor and thus higher costs.
Stability and operations: Under the previous mixed architecture, many compute tasks easily consumed node resources, raising node load and affecting storage performance. With the decoupled storage‑compute architecture, compute tasks run on dedicated nodes, task durations are more stable, and scaling and management are more flexible.

AI one‑stop storage

AI storage consists of three stages:

Raw data stage: Large volumes of raw data are stored and then processed (for example, through ETL) into training datasets, which are fed into high‑performance training environments.
Training stage: Training tasks require high throughput and low latency to reduce I/O wait time and increase GPU utilization. After training, model files are generated for subsequent inference.
Inference stage: Model files must be quickly distributed to specific nodes for rapid startup of inference tasks.

Previously, data flowed among multiple systems, causing inconvenience for both application teams and internal operations. By adopting JuiceFS uniformly, we can meet diverse needs based on different storage tiers.

Requirements and solutions by stage

AI one-stop storage needs to cover three stages: raw data, training data, and model files. The requirements for capacity, performance, cost, and distribution efficiency differ at each stage. The table below compares the application needs for each stage with previous and current solutions.

Use case	Application requirements	Previous solution	Current solution (JuiceFS)
Raw data	Large capacity, low cost; support high‑concurrency data processing; scale to PB+	Direct use of object storage; HDFS; other low‑cost storage	Capacity‑oriented JuiceFS: backed by multi‑cloud object storage, shielding vendor differences; EB capacity, hundreds of billions of files; millions of concurrent tasks
Training data	High throughput, low latency; reduce I/O wait time; improve GPU utilization	PFS, NAS (good performance but high cost)	Performance‑oriented/cache‑oriented JuiceFS: TB/s throughput, low latency; async checkpoint to reduce I/O wait; cache acceleration
Model files	Fast distribution; efficient loading; quick inference startup	P2P distribution; workflow distribution; PFS	Cache‑accelerated JuiceFS: cache improves model loading; up to 16 GB/s sequential load per node; several times faster than local disk or FDS

High‑performance cache acceleration: improving efficiency and cutting costs

In AI training, training datasets typically have the characteristics of "write once, read many times, and modify very little." This is a typical read-heavy, write-light access pattern, making it suitable for improving data access efficiency through caching.

Take our internal autonomous driving training as an example. Once a dataset version matures, its data volume may continue to grow within the version cycle, but existing data is rarely modified. While the previous high‑performance file storage met training performance requirements, it had some performance redundancy and cost waste for such repetitive reads. Therefore, we began promoting a high‑performance cache acceleration solution based on JuiceFS.

The cache solution offers several advantages:

Short I/O path: Clients operate on files directly, greatly shortening the I/O path for fast responses.
Performance optimization: Through RDMA and zero‑copy optimization, performance has significantly improved – throughput more than 20% higher than previous high‑performance storage, with ongoing optimization.
Cost reduction: The previous PFS‑based storage relied on replication (some deployments used erasure coding (EC), but replication was more common for stability). With the cache solution, single‑copy storage reduces costs by more than 60%.
Resource consolidation: For GPU training, nodes typically have NVMe drives (about 10 TB each), which were previously used in scattered ways with low utilization. Now, we consolidate these NVMe resources into a unified cache pool to accelerate nearby GPU training and data processing tasks.

Future plans

Looking ahead, we’ll focus on three directions:

Continuously improve the stability, performance, and scalability of the unified file storage foundation. As AI applications grow rapidly, training, inference, and data processing tasks demand higher throughput, lower latency, and greater reliability. We’ll continue optimizing the underlying architecture and critical paths to enhance service capabilities under large‑scale concurrent access.
Strengthen lifecycle management for massive data. Current data volumes continue to grow, but management across storage tiers, access frequencies, and retention periods can be further optimized. We’ll refine tiered storage, archiving, warm-up, and cleanup strategies based on data temperature, access patterns, and cost models, reducing unit storage cost and improving resource utilization.
Enhance data management and analysis capabilities. On top of the unified file storage foundation, we’ll build data management capabilities for application users, helping them better understand data distribution, access behavior, and resource usage, supporting data management, cost optimization, and application decisions.

We look forward to continuous exchanges with industry peers to explore more technical practices. If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

Quota Design in Distributed Architectures: Implementation and Use Cases in JuiceFS

DASWU — Fri, 08 May 2026 06:59:24 +0000

In distributed storage environments, storage resources are typically shared across multiple users, projects, and applications. Without effective constraint mechanisms, abnormal writes or erroneous operations from a single tenant can quickly consume large amounts of space or inodes, impacting system stability and cost control. Quota management provides a way to establish predictable resource boundaries in shared environments.

In distributed systems, quota management is far more than just "setting a limit." The system must balance concurrent writes from multiple clients, asynchronous metadata updates, and overall throughput. At the same time, quota rules must be enforced at different levels of control. To address this, JuiceFS provides multi-level quota capabilities covering the entire file system, directories, and users, supporting scenarios ranging from overall capacity control to individual and team-level constraints.

In this article, we’ll introduce the design and implementation of JuiceFS' quota mechanism, including its core data structures, synchronization model, and the validation and accounting logic in write and delete processes. We’ll also include typical use cases that highlight common issues around quota changes, space reclamation, and over-limit writes.

Quota types and resource dimensions supported by JuiceFS

JuiceFS quotas support two resource dimensions:

Space: Used storage capacity. Statistics are based on the file system's usage perspective and are aligned to block granularity. The write path section later will explain how incremental usage is estimated under 4 KiB alignment.
Inodes: Number of used inodes. For workloads with a large number of small files, inodes often become the constraint bottleneck earlier than space. Therefore, inode quotas must also be part of the management strategy.
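Block-aligned space accounting can be illustrated with a plain round-up to 4 KiB. This is a minimal sketch: `align_4k` and `space_delta` are names chosen here, and details such as how an empty file is counted are implementation specifics that this example does not try to reproduce.

```python
BLOCK = 4096  # 4 KiB accounting granularity


def align_4k(length):
    """Round a byte length up to the next multiple of 4 KiB."""
    return (length + BLOCK - 1) // BLOCK * BLOCK


def space_delta(old_length, new_length):
    """Estimated change in quota-accounted space when a file's length changes."""
    return align_4k(new_length) - align_4k(old_length)


print(align_4k(1))              # 4096: one byte still occupies a full block
print(space_delta(1, 4096))     # 0: growth stays within the same block
print(space_delta(4096, 4097))  # 4096: one extra byte crosses into a new block
print(space_delta(8192, 100))   # -4096: truncation releases one block
```

The practical consequence is that many tiny files consume quota faster than their byte sizes suggest, which is one reason space and inode limits are usually set together.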

Based on these two resource dimensions, JuiceFS currently supports four types of quotas.

Quota type	Scope	Design goal	Typical use case
Total file system quota	Entire file system	Prevents overall resource runaway	Cost budget control, capacity limit
Subdirectory quota	Directory subtree	Blocks abnormal write behavior	Prevents misoperations, small‑file storms
User quota	Per user	Isolates impact between different applications	Multi‑tenant data management
User group quota	Project or department	Cost allocation and team limits	Shared environment for AI projects

User quotas and user group quotas are expected to be released in JuiceFS Community Edition 1.4.

In practice, a common and effective strategy combines the following:

Total file system quota as a safety net.
Directory quotas to address individual abuse and small‑file storms.
User/group quotas for multi‑tenant management.

This layered approach controls overall resource limits while preventing abnormal growth of a single entity from affecting other workloads.
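The layered strategy amounts to checking one operation against every quota that applies to it. The sketch below is illustrative only: the `Quota` class, the `check_write` function, the level names, and the convention that 0 means "unlimited" are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    max_space: int = 0    # bytes; 0 means unlimited in this sketch
    max_inodes: int = 0   # 0 means unlimited in this sketch
    used_space: int = 0
    used_inodes: int = 0

    def allows(self, d_space, d_inodes):
        """Return True if adding the deltas keeps this quota within limits."""
        if d_space > 0 and self.max_space and self.used_space + d_space > self.max_space:
            return False
        if d_inodes > 0 and self.max_inodes and self.used_inodes + d_inodes > self.max_inodes:
            return False
        return True


def check_write(quotas, d_space, d_inodes):
    """Return the name of the first quota that would be exceeded, or None."""
    for name, quota in quotas.items():
        if not quota.allows(d_space, d_inodes):
            return name
    return None


GIB = 1 << 30
quotas = {
    "filesystem": Quota(max_space=1024 * GIB, used_space=1 * GIB),
    "dir:/projects/a": Quota(max_inodes=1000, used_inodes=1000),
    "user:alice": Quota(max_space=10 * GIB),
}

print(check_write(quotas, d_space=4096, d_inodes=1))  # dir:/projects/a
print(check_write(quotas, d_space=4096, d_inodes=0))  # None
```

Here creating a new file is rejected by the directory's inode limit even though the file system and the user still have plenty of space, while appending to an existing file passes every level.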

Quota implementation mechanism

Synchronization model and data structures

The main challenge of implementing quotas is how to perform checking, accounting, and convergence at an acceptable cost under concurrent writes from multiple clients. JuiceFS clients run on various nodes and continuously issue resource‑changing operations such as creation, writing, truncation, and deletion. If every operation required a strongly consistent server‑side check and update, the write path would incur unacceptable overhead.

Therefore, the quota mechanism must satisfy two goals:

Performance: Avoid a strongly consistent server‑side update on every write.
Consistency: Ensure that system usage eventually converges under concurrent writes from multiple clients and prevent over‑limit operations before they happen, as much as possible.

Based on this trade‑off, JuiceFS adopts a synchronization model that works as "local accumulation, periodic flush, and periodic refresh." Clients first accumulate resource deltas in local memory, with background tasks periodically persisting them to the metadata engine in batches. At the same time, each client periodically pulls the latest quota configuration and baseline usage from the server, gradually aligning its own global view. Clients do not communicate directly with each other; instead, the metadata engine serves as the central coordination point.

In other words, JuiceFS quotas do not pursue strong consistency on each operation but achieve eventually consistent resource control through periodic synchronization.

In the current implementation, quota deltas are flushed every 3 seconds (flushQuotas). Clients reload the latest quota configuration and baseline usage from the backend approximately every 12 seconds (via a refresh call triggered by the mount heartbeat). This means that under extreme conditions, the global views seen by different clients may diverge by up to about 12 seconds, but they will gradually converge in subsequent sync cycles.

Quota information is managed uniformly by the quota structure. It represents a single quota entity and can adapt to different types of managed objects such as directories, users, and user groups. Its core design decouples baseline usage from incremental usage:

UsedSpace/UsedInodes: Represents the baseline usage already persisted in the backend.
newSpace/newInodes: Represents the locally accumulated deltas on this client that have not yet been flushed to the backend.

type Quota struct {
    MaxSpace, MaxInodes   int64 // Maximum space and inode limits
    UsedSpace, UsedInodes int64 // Used space and inodes
    newSpace, newInodes   int64 // Pending usage deltas to be synced
}
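
The "local accumulation, periodic flush, periodic refresh" cycle can be sketched in a few lines of Go. This is a minimal illustration built around the struct above, not the JuiceFS implementation: the mutex, the method names (Exceeds, Update, Flush, Refresh), the persist callback that stands in for the metadata engine, and the "0 means unlimited" convention are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Quota follows the struct above; the mutex is added for this sketch.
type Quota struct {
	mu                    sync.Mutex
	MaxSpace, MaxInodes   int64 // limits (0 = unlimited in this sketch)
	UsedSpace, UsedInodes int64 // baseline last seen from the backend
	newSpace, newInodes   int64 // local deltas not yet flushed
}

// Exceeds reports whether a delta would push baseline + pending past a limit.
func (q *Quota) Exceeds(space, inodes int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	overSpace := q.MaxSpace > 0 && space > 0 && q.UsedSpace+q.newSpace+space > q.MaxSpace
	overInodes := q.MaxInodes > 0 && inodes > 0 && q.UsedInodes+q.newInodes+inodes > q.MaxInodes
	return overSpace || overInodes
}

// Update accumulates a delta in memory only; no backend round trip.
func (q *Quota) Update(space, inodes int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.newSpace += space
	q.newInodes += inodes
}

// Flush passes the pending deltas to persist (a stand-in for the metadata
// engine) and folds them into the local baseline.
func (q *Quota) Flush(persist func(space, inodes int64)) {
	q.mu.Lock()
	s, i := q.newSpace, q.newInodes
	q.newSpace, q.newInodes = 0, 0
	q.UsedSpace += s
	q.UsedInodes += i
	q.mu.Unlock()
	if s != 0 || i != 0 {
		persist(s, i)
	}
}

// Refresh replaces the baseline with values read back from the backend.
func (q *Quota) Refresh(usedSpace, usedInodes int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.UsedSpace, q.UsedInodes = usedSpace, usedInodes
}

func main() {
	var backendSpace, backendInodes int64 // shared "metadata engine" counters
	persist := func(s, i int64) {
		backendSpace += s
		backendInodes += i
	}

	a := &Quota{MaxSpace: 1 << 20} // two clients, same 1 MiB limit
	b := &Quota{MaxSpace: 1 << 20}

	a.Update(600<<10, 1)               // client A writes 600 KiB
	fmt.Println(b.Exceeds(600<<10, 1)) // false: B has not seen A's write yet
	a.Flush(persist)                   // periodic flush on A
	b.Refresh(backendSpace, backendInodes)
	fmt.Println(b.Exceeds(600<<10, 1)) // true: the views have converged
}
```

The window between A's write and B's refresh is exactly where two clients can briefly disagree, which is the divergence described above.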

For inode accounting, hard links require special attention. Different quota types have different counting semantics for hard links. For directory quotas, counting is based on directory entries: when a hard link is created under a directory, both the space and inode usage of that directory increase (the inode count by 1), and they decrease accordingly when the hard link is removed. For user quotas and user group quotas, counting is deduplicated by the file object (inode). Even if a file has multiple hard links, it’s counted only once per UID/GID dimension. Therefore, creating or deleting hard links does not change the usage for the associated user or user group.

Quota storage

Regarding the quota storage mechanism, the total file system quota (the global "red line") has its capacity and inode limits directly persisted in the metadata engine. Clients load this configuration during mount and enforce hard limits, ensuring the underlying resources are not exceeded.

In contrast, checks and delta accumulation for directory, user, and user group quotas rely more on the client side. Clients maintain in‑memory indexing structures keyed by inode, UID, and GID, and periodically synchronize the corresponding quota information from the backend. This keeps lookup overhead low in high‑frequency I/O scenarios. It’s important to emphasise that the client in‑memory state is only a runtime cache and incremental view; the authoritative source for quota configuration and baseline usage remains the metadata backend.

Quota checks

A synchronization model and data structures alone are not sufficient; quota logic must also be embedded into the specific resource‑changing paths. A single write operation may not be a simple data append; it can simultaneously involve inode creation, block allocation, directory entry changes, and parent‑directory statistics updates. Under multi‑client concurrency, these changes collectively affect the same set of quota constraints. Therefore, only by placing checks and statistics updates directly into the operation paths (write, create, truncate, and delete) can we avoid out‑of‑limit writes and statistical inaccuracies.

Pre‑write: incremental estimation and multi‑dimensional quota check

When a user initiates an operation that may change resource usage (such as write, create, and truncate), the client first estimates the expected resource delta, including both space and inode changes.

The space delta is estimated based on the actual allocation granularity of the underlying data blocks (for example, 4 KiB alignment), so the calculation must be block‑aligned. Inode deltas primarily occur in creation operations (such as creating a new file or directory).
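
As a rough illustration of block-aligned estimation, the Go sketch below rounds file lengths up to 4 KiB and takes the difference between the aligned old and new lengths. The function names are hypothetical, and treating an empty file as zero bytes is a simplifying assumption of this example rather than a statement about how JuiceFS charges empty files.

```go
package main

import "fmt"

const blockAlign = 4 << 10 // 4 KiB accounting granularity

// align4K rounds a length up to the next 4 KiB boundary.
// Assumption for this sketch: a zero-length file is charged zero bytes.
func align4K(length uint64) int64 {
	return int64((length + blockAlign - 1) / blockAlign * blockAlign)
}

// spaceDelta estimates the change in accounted space when a file's length
// changes from oldLen to newLen (negative for truncation).
func spaceDelta(oldLen, newLen uint64) int64 {
	return align4K(newLen) - align4K(oldLen)
}

func main() {
	fmt.Println(align4K(1))             // a 1-byte file is accounted as one block
	fmt.Println(align4K(4097))          // one byte past a boundary costs another block
	fmt.Println(spaceDelta(1000, 5000)) // growing across a boundary
	fmt.Println(spaceDelta(5000, 1000)) // truncating back gives a negative delta
}
```

Because of the alignment, a write that stays inside an already accounted 4 KiB block produces a delta of zero, while a one-byte write that crosses a boundary is charged a full block.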

After obtaining the resource delta for the operation, the client performs a quota check before actually writing. The check covers multiple dimensions: user and user group quotas, total file system quota, and directory quotas for the target directory tree. If any dimension would exceed its limit after this operation, the request is rejected with an error such as quota exceeded or out of space.

By placing the check in the write path before the resource change, the system can block risky operations before they happen, avoiding complex cleanup or rollback afterwards.
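
The multi-dimensional check can be pictured as a loop over every quota that applies to the operation. The Go sketch below is illustrative only: the types, the checkAll function, and the choice to report "no space" for the file system limit and "quota exceeded" for all other dimensions are assumptions of this example.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrNoSpace = errors.New("no space left on device")
	ErrQuota   = errors.New("disk quota exceeded")
)

type usage struct{ space, inodes int64 }

type quota struct {
	name      string
	max, used usage // a zero field in max means "no limit" in this sketch
}

// allows reports whether the delta d fits under this quota.
func (q quota) allows(d usage) bool {
	if q.max.space > 0 && d.space > 0 && q.used.space+d.space > q.max.space {
		return false
	}
	if q.max.inodes > 0 && d.inodes > 0 && q.used.inodes+d.inodes > q.max.inodes {
		return false
	}
	return true
}

// checkAll tests the file system quota first, then every other quota that
// applies (user, group, and each directory quota on the path to the target).
// It returns the name of the first violated quota and a matching error.
func checkAll(total quota, others []quota, d usage) (string, error) {
	if !total.allows(d) {
		return total.name, ErrNoSpace
	}
	for _, q := range others {
		if !q.allows(d) {
			return q.name, ErrQuota
		}
	}
	return "", nil
}

func main() {
	total := quota{name: "filesystem", max: usage{space: 100 << 30}}
	others := []quota{
		{name: "uid 1000", max: usage{space: 10 << 30}, used: usage{space: 9 << 30}},
		{name: "dir /data/job1", max: usage{inodes: 1000}, used: usage{inodes: 1000}},
	}
	fmt.Println(checkAll(total, others, usage{space: 4096}))            // fits everywhere
	fmt.Println(checkAll(total, others, usage{space: 4096, inodes: 1})) // directory inode limit hit
	fmt.Println(checkAll(total, others, usage{space: 2 << 30}))         // user space limit hit
}
```

The request is rejected as soon as any single dimension would be exceeded, so no usage is recorded for an operation that fails the check.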

Post‑write: local delta accumulation and background batched sync

After a successful write, the resource delta generated by the operation is incorporated into the corresponding usage statistics and gradually aligns with the global state according to the defined convergence mechanism. Specifically, three categories of statistics are affected:

Global level: The overall file system usage increases (or decreases).
Directory level: The usage of the relevant directory subtree changes accordingly.
User / user group level: The usage of the corresponding subject also accumulates.

These updates are first reflected in the client’s local accumulated deltas and are not immediately written back to the backend in a strongly consistent way. Later, background tasks flush them in batches, and periodic refresh operations gradually align them with other clients, achieving global convergence.

Usage statistics (stats): foundation for the quota system

For quotas to work effectively, the system must be able to track current resource usage with low overhead. Whether for large directory trees or many users and user groups, if every check requires a real‑time full scan, the performance cost will be unacceptable. Therefore, an efficient and reliable usage statistics mechanism is a prerequisite for implementing quotas.

Directory statistics

Directory quotas constrain the total space and inode usage of an entire directory subtree, not the size of individual files. Consequently, they rely on directory‑level usage statistics.

It’s important to note that directory statistics (DirStats) and quota statistics have different scopes: DirStats only sums up the usage of immediate children (files and subdirectories) under a given directory, so it is a single‑level statistic. In contrast, directory quotas recursively sum up the entire subtree. This design allows DirStats to be maintained with lower overhead, while directory quotas provide a full subtree view.
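
The difference in scope is easy to see on a toy directory tree. In the Go sketch below, dirStat sums only immediate children while subtreeUsage recurses. The types and function names are hypothetical, and counting a subdirectory as one inode with zero space is a simplification made for the example.

```go
package main

import "fmt"

type dir struct {
	name     string
	fileSize []int64 // sizes of the files directly inside this directory
	children []*dir
}

type stat struct{ space, inodes int64 }

// dirStat is the single-level view: immediate files and subdirectories only.
func dirStat(d *dir) stat {
	var s stat
	for _, sz := range d.fileSize {
		s.space += sz
		s.inodes++
	}
	s.inodes += int64(len(d.children)) // each subdirectory is one entry
	return s
}

// subtreeUsage is the recursive view that a directory quota constrains.
func subtreeUsage(d *dir) stat {
	s := dirStat(d)
	for _, c := range d.children {
		cs := subtreeUsage(c)
		s.space += cs.space
		s.inodes += cs.inodes
	}
	return s
}

func main() {
	sub := &dir{name: "images", fileSize: []int64{4096, 4096}}
	root := &dir{name: "dataset", fileSize: []int64{4096}, children: []*dir{sub}}

	fmt.Printf("single level: %+v\n", dirStat(root))
	fmt.Printf("whole subtree: %+v\n", subtreeUsage(root))
}
```

A write into images changes the single-level statistic of images only, but it counts against any quota set on dataset as well, which is why the two are tracked separately.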

The key to implementing such statistics is maintaining low overhead and high availability for large directory trees. JuiceFS follows the same approach as the quota mechanism: high‑frequency local updates and batched background persistence. Clients maintain directory usage deltas in memory; when operations such as writes or deletions occur, the changes are first recorded locally and then periodically synced in batches to the metadata engine by background tasks.

In addition, the system does not load all directory statistics at mount time. For large directory trees, a full load would cause significant latency and memory overhead. Therefore, directory statistics adopt an on‑demand fetch strategy: only when precise usage is required (such as quota checks, usage summarisation, and administrative queries) does the system load the statistics of the corresponding directory from the backend.

When users query usage information via df or an application calls statfs, JuiceFS makes a trade‑off between performance and accuracy:

It first uses locally cached used space and inodes for fast calculation.
If the local baseline is incomplete (for example, just after startup) or higher real‑time accuracy is needed, it fetches the latest global counters from the backend for calibration.
Finally, it adds locally accumulated (not yet synced) deltas to make the result more accurate for the current node’s write state.

After obtaining the used amounts, the client calculates total and avail based on whether a total capacity limit is configured:

If a limit is configured, total capacity equals that limit, and available capacity is "limit minus used."
If no limit is configured, it returns a dynamically estimated total capacity so that tools like df can display normally.
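
The calculation described above can be sketched as a small Go function. The function name and signature are hypothetical, and the estimation policy for the unlimited case (start at 1 PiB and double while usage exceeds roughly 80% of the total) is an assumption chosen for illustration, not a documented rule.

```go
package main

import "fmt"

// statfsSpace returns the total and available space to report.
// limit is the configured capacity (0 = none), baseline is the usage loaded
// from the backend, and pending is this client's unsynced local delta.
func statfsSpace(limit uint64, baseline, pending int64) (total, avail uint64) {
	used := baseline + pending
	if used < 0 {
		used = 0 // pending deletions may briefly outrun a stale baseline
	}
	u := uint64(used)

	if limit > 0 {
		total = limit
	} else {
		// Assumed policy: report a large virtual capacity that grows with usage.
		total = 1 << 50
		for total < 1<<62 && u > total/10*8 {
			total *= 2
		}
	}
	if u < total {
		avail = total - u
	}
	return total, avail
}

func main() {
	fmt.Println(statfsSpace(100<<30, 30<<30, 10<<30)) // limit set: avail = limit - used
	fmt.Println(statfsSpace(100<<30, 120<<30, 0))     // over the limit: avail is clamped to 0
	fmt.Println(statfsSpace(0, 1<<40, 0))             // no limit: estimated total
}
```

Clamping avail at zero matters because, under eventual consistency, the observed usage can briefly exceed the configured limit.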

Moreover, when querying quotas from the root directory, the system displays the maximum space and inode limits, allowing administrators to see the global resource boundaries.

In addition, JuiceFS will support real‑time updates of directory statistics for the trash in version 1.4. When files are deleted (moved to the trash), restored from the trash, or permanently cleaned up, the system updates the trash directory’s statistics immediately. This enables administrators to accurately track space usage of the trash.

User and user group statistics

User and user group statistics are collected only after the corresponding quota feature is enabled. Before enabling, the updateUserGroupStat call in the kernel path returns directly without generating any statistics. After enabling, clients maintain usage data in an in‑memory map keyed by uid and gid and update the relevant statistics on all paths that may cause usage changes.

A special note: when setting a quota for a user or user group for the first time via juicefs quota set --uid or juicefs quota set --gid, the system immediately performs a full scan of existing files to initialise the baseline usage. After this initialisation, subsequent writes and deletions become incremental updates, and no further full scan is required.

Common scenarios

1. A file has been deleted, why hasn’t the total file system quota decreased? Why hasn’t the object storage billing changed?

This is usually not a statistics error, but a result of file system semantics combined with the statistical model.

For example, after enabling the trash in JuiceFS, a deletion operation does not immediately free space. The file is first moved to the trash for possible recovery. Therefore, files in the trash are still counted in the total file system quota and user / user group quotas, but are no longer counted in the original directory quota.

Another common reason is the time lag between file system statistics and object storage billing. JuiceFS quota statistics use a local accumulation + periodic background sync model, so it’s possible that different clients or different statistical interfaces have not yet converged in a short time. At the same time, object storage may not have completed garbage collection or lifecycle cleanup. Therefore, temporarily seeing inconsistency between file system usage, quota statistics, and object storage billing is generally expected. This is not considered a system anomaly as long as they gradually converge over time.

In addition, note that quota and statfs show the file system perspective of space usage and availability, while object storage billing is based on the underlying object storage model, which is affected by factors such as chunking, merging, delayed reclamation, and lifecycle rules. The two are not required to be the same.

2. Quota is full, but appending to an existing file did not report an error immediately.

This is often related to the asynchronous commit path in some JuiceFS writes. From the application’s perspective, the write system call may return success early, while the actual data commit and corresponding quota check happen later. Thus, appending may appear to "succeed," but the data may not be fully persisted; if the later commit stage determines that the quota would be exceeded, the write may still fail.

In other words, a successful return from write does not guarantee that the data has been finally committed. In scenarios involving quota limits, a safer approach is to check the return status of close, verify the final file size, and handle any errors accordingly.
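
In application code, this advice amounts to treating the write as complete only after close has succeeded and the size matches. The Go sketch below shows one way to do that with the standard library; writeChecked and demo are hypothetical helper names, and the example runs against a temporary local file rather than a JuiceFS mount.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeChecked writes data to path and reports an error if the write, the
// close, or the final size check fails.
func writeChecked(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return fmt.Errorf("write: %w", err)
	}
	// A deferred commit error (such as a quota violation) can surface here.
	if err := f.Close(); err != nil {
		return fmt.Errorf("close: %w", err)
	}
	st, err := os.Stat(path)
	if err != nil {
		return err
	}
	if st.Size() != int64(len(data)) {
		return fmt.Errorf("short file: got %d bytes, want %d", st.Size(), len(data))
	}
	return nil
}

// demo exercises writeChecked in a temporary directory.
func demo() error {
	dir, err := os.MkdirTemp("", "close-check")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	return writeChecked(filepath.Join(dir, "out.bin"), []byte("payload"))
}

func main() {
	fmt.Println("error:", demo())
}
```

The common shortcut of deferring f.Close() and ignoring its result is exactly what hides this class of failure.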

3. Quota is not yet full, but file creation fails.

This phenomenon is usually related to temporary view divergence under the eventually consistent statistical model.

Example: a volume has a total quota of 2,000 inodes, and there are currently 1,999 files. One more file should be creatable. However, in extreme concurrency or unusual refresh timing, the client’s local cache may diverge briefly from the backend baseline count. This may cause the in‑memory used inode count to be temporarily too high, thus rejecting a legitimate creation request.

This type of problem inherently stems from the local accumulation + periodic sync convergence model. It avoids the high overhead of strongly consistent backend updates on every operation, but in extreme cases the system may produce short‑term false positives. Typically, such false positives disappear with the next sync cycle, and retries can mitigate the issue.

This also illustrates that, in a distributed environment, quotas are best understood as an efficient, near‑real‑time constraint mechanism, not a fully synchronous, strongly consistent judgement for every concurrent operation.

4. After a write exceeds the quota, why does the "failed" file remain in the directory?

This behavior is not unique to JuiceFS; it is common in file systems that follow POSIX semantics.

For example, a user sets a 1 GiB quota on a directory and then tries to write a 2 GiB file using dd. The file system first allows the first 1 GiB of valid writes; only when the subsequent write exceeds the quota does it return “Disk quota exceeded.” Consequently, a "partial file" of about 1 GiB is left behind. This does not indicate abnormal behaviour. It simply means the first part of the data was written successfully, while the remainder failed due to the quota.

The file system's responsibility is to report the error, not to decide whether to delete the successfully written data. Whether to clean up such an incomplete file is left to the application. This follows standard POSIX semantics: the file system returns the error, and the application handles subsequent cleanup and recovery.
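
One way an application can take on that cleanup is to remove the file itself whenever the write does not complete. The Go sketch below is a minimal example with hypothetical names (writeOrCleanUp, failingReader, demo). To stay runnable anywhere, the failure is simulated on the read side; on a real mount the same path would be taken when a write returns “Disk quota exceeded.”

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeOrCleanUp streams src into dst. If the copy or the close fails, it
// removes the partial file before returning the error.
func writeOrCleanUp(dst string, src io.Reader) error {
	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	_, err = io.Copy(f, src)
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		os.Remove(dst) // best effort; the original error is what matters
		return err
	}
	return nil
}

// failingReader always fails; it stands in for a stream that is cut short.
type failingReader struct{}

func (failingReader) Read([]byte) (int, error) {
	return 0, errors.New("simulated failure")
}

// demo checks that a failed copy leaves nothing behind and a good one does.
func demo() error {
	dir, err := os.MkdirTemp("", "cleanup-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	bad := filepath.Join(dir, "bad.bin")
	src := io.MultiReader(strings.NewReader("first part"), failingReader{})
	if err := writeOrCleanUp(bad, src); err == nil {
		return errors.New("expected the copy to fail")
	}
	if _, err := os.Stat(bad); !os.IsNotExist(err) {
		return errors.New("partial file was left behind")
	}

	good := filepath.Join(dir, "good.bin")
	if err := writeOrCleanUp(good, strings.NewReader("complete")); err != nil {
		return err
	}
	_, err = os.Stat(good)
	return err
}

func main() {
	fmt.Println("error:", demo())
}
```

Whether to delete, keep, or resume a partial file is a policy decision; the point is only that the decision belongs to the application.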

Summary

In a distributed file system, quotas are not a simple "counter" feature, but a system design that must balance performance, consistency, and management granularity. Through pre‑write checks, local accumulation, and periodic background synchronization, JuiceFS minimizes overhead on the write path while allowing various usage statistics to gradually converge under an eventual consistency model. Based on this mechanism, quota control covers not only total file system capacity, but also multiple levels such as directories, users, and user groups, thereby meeting the needs of typical scenarios including multi‑tenant isolation, individual constraints, and team‑level resource management.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

JuiceFS Performance Optimization for AI Scenarios

DASWU — Wed, 15 Apr 2026 09:37:48 +0000

The scale of computing power for large language model (LLM) training continues to expand. While GPU performance keeps improving, data access bottlenecks are becoming increasingly prominent in overall system performance. Local storage offers excellent performance but has limited scalability. Object storage excels in cost and scalability but suffers from insufficient throughput in massive small‑file and high‑concurrency scenarios. Teams often struggle to choose between them.

Therefore, distributed file systems have become a key solution to balance high performance and scalability. JuiceFS has been widely deployed in AI scenarios across multiple industries. Its distributed architecture delivers high performance, strong scalability, and low cost simultaneously for large‑scale data access.

In this article, we’ll introduce JuiceFS’ architecture from a performance perspective and analyze core performance bottlenecks and optimization methods under different access patterns. We’ll also link to further reading on key points, helping you understand JuiceFS’ performance mechanisms and master common tuning strategies.

Performance foundations from the JuiceFS architecture

JuiceFS comes in Community Edition and Enterprise Edition. Both share the same architecture: metadata and data are separated. The client adopts a rich‑client design, handling core logic including some metadata operations, and provides both metadata and data caching. These modules work together for efficient data location and access. The underlying data is stored in object storage, with local caches further improving access performance. For external interfaces, JuiceFS supports multiple access methods – FUSE is the most common, and it also provides various SDKs and an S3 gateway.

JuiceFS Community Edition is designed as a general‑purpose file system. Users can choose different metadata engines based on their needs. For small‑scale deployments, Redis delivers lightweight, low‑latency metadata management. For large‑scale file scenarios, TiKV provides good horizontal scalability.

JuiceFS Enterprise Edition targets complex, high‑performance scenarios. It differs from Community Edition in two ways:

It uses a self‑developed multi‑zone metadata engine built on Raft that runs as an in‑memory cluster, offering low latency and strong horizontal scalability. It supports up to 500 billion files. Operations that require multiple key-value requests in the Community Edition often need only one or two in the Enterprise Edition, and complex logic can be processed inside the metadata cluster.
The Enterprise Edition supports distributed cache sharing: clients in the same group can access each other’s local caches via consistent hashing. This improves cache hit rates and access efficiency. In multi‑node, high‑concurrency scenarios, the cache space scales horizontally, and most required data can be warmed up before job execution. This accelerates AI training and inference while boosting performance and stability. See JuiceFS Enterprise 5.3: 500B+ Files per File System & RDMA Support.
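
For readers unfamiliar with consistent hashing, the Go sketch below shows the general technique: each cache node owns several points on a hash ring, and a block key is served by the node owning the next point clockwise. This is a generic textbook version, not the Enterprise Edition's implementation; the node names, key format, FNV hash, and virtual-node count are assumptions of the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint32          // sorted positions on the ring
	owner  map[uint32]string // position -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places vnodes virtual points per node on the ring.
func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, v))
			if _, taken := r.owner[p]; taken {
				continue // rare hash collision: keep the first owner
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the node that owns the first point at or after the key.
func (r *ring) lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

// strayMoves counts keys that change owner when node-c leaves even though
// node-c did not own them; consistent hashing keeps this at zero.
func strayMoves() int {
	before := newRing([]string{"node-a", "node-b", "node-c"}, 64)
	after := newRing([]string{"node-a", "node-b"}, 64)
	moved := 0
	for i := 0; i < 1000; i++ {
		key := fmt.Sprintf("block-%d", i)
		was := before.lookup(key)
		if was != "node-c" && after.lookup(key) != was {
			moved++
		}
	}
	return moved
}

func main() {
	r := newRing([]string{"node-a", "node-b", "node-c"}, 64)
	fmt.Println(r.lookup("block-42") == r.lookup("block-42")) // placement is deterministic
	fmt.Println("stray moves after removing node-c:", strayMoves())
}
```

The property that matters for a cache group is the last one: when a node joins or leaves, only the keys that node owned are remapped, so most cached data stays where clients expect it.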

Data chunking

JuiceFS splits data into chunks and stores them in object storage. This design is key to its performance, affecting data read efficiency, cache hit rate, and throughput under high concurrency.

JuiceFS breaks a file into multiple chunks. Inside each chunk, the system maintains a management structure called a slice to track writes and updates. When data is written, new data does not overwrite existing slices; instead, a new slice is appended on top of the chunk.

Ideally, each chunk ends up containing only one slice. Each slice consists of several 4 MB blocks, which are the smallest unit stored in object storage. By default, the caching system also manages data at the block level.

As shown in the diagram on the upper right, file updates use an append‑only write pattern: existing slices are shown in red, and new data is appended as a new slice. During reads, the system combines the slices to form the current view. When fragmentation becomes excessive, a compaction process merges slices to optimize access performance. For more details on data chunking, refer to Code-Level Analysis: Design Principles of JuiceFS Metadata and Data Storage.
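
The "newest slice wins" read view can be illustrated with a few lines of Go. The sketch below uses abstract offsets inside a single chunk and resolves which slice is visible at each position; the types, the resolve function, and the slice IDs are hypothetical and much simpler than the real metadata structures.

```go
package main

import "fmt"

// slice describes one write: its ID and the range it covers in the chunk.
type slice struct{ id, off, length int }

// segment is a contiguous range of the chunk served by a single slice
// (id 0 means a hole that no slice has written).
type segment struct{ start, end, id int }

// resolve replays the slices in append order, so later writes shadow earlier
// ones, and returns the resulting read view as a list of segments.
func resolve(slices []slice, size int) []segment {
	owner := make([]int, size)
	for _, s := range slices {
		for i := s.off; i < s.off+s.length && i < size; i++ {
			owner[i] = s.id
		}
	}
	var segs []segment
	for i := 0; i < size; i++ {
		if n := len(segs); n > 0 && segs[n-1].id == owner[i] {
			segs[n-1].end = i + 1
		} else {
			segs = append(segs, segment{i, i + 1, owner[i]})
		}
	}
	return segs
}

func main() {
	slices := []slice{
		{id: 1, off: 0, length: 10}, // initial write covering the range
		{id: 2, off: 4, length: 3},  // later overwrite in the middle
	}
	for _, seg := range resolve(slices, 10) {
		fmt.Printf("[%d,%d) -> slice %d\n", seg.start, seg.end, seg.id)
	}
}
```

One small overwrite turns a single-segment read into three segments. Compaction rewrites such a view into one new slice, which is why it helps read performance on heavily updated files.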

Caching

Compared to direct object storage access, JuiceFS performance improvements largely benefit from its caching mechanism. The JuiceFS client comes with a high‑performance local cache module. Key configuration options include:

cache-dir: specifies the cache directory.
cache-size: sets the maximum cache space.
Prefetch: a parameter in the cache module that controls prefetching. When a request hits a block, a background thread fetches the entire block.
Write‑back related settings: improves write IOPS by writing data blocks that need to be uploaded to object storage into the local cache first, then asynchronously uploading them to object storage.

JuiceFS Enterprise Edition also provides advanced configurations. For example, a cache group can be used to designate a set of clients whose local caches form a distributed cache group, enabling cache sharing. In addition, the no sharing option (related to cache groups) allows a client to read data only from a specified cache group without serving its own cache to others. This creates a two‑level cache:

The first level is the local cache.
The second level is the cache on other nodes in the group.

Another performance‑boosting mechanism is the memory buffer (read buffer), which provides:

I/O request merging: multiple consecutive I/O requests can be merged in memory. For example, three I/O requests issued by the system may be reduced to just one after being processed by the memory buffer.
Adaptive read‑ahead: in large‑file sequential read scenarios, adaptive read‑ahead increases request concurrency by prefetching data. This fully utilizes cache and object storage resources and improves overall I/O performance.

The Enterprise Edition also offers advanced read‑ahead settings:

max read ahead: sets the maximum read‑ahead range.
initial read ahead: sets the initial read‑ahead window size (default unit is 4 MB blocks).
read ahead ratio: a configuration added last year that controls the read‑ahead ratio for large‑file random reads, reducing bandwidth waste caused by read amplification. Overly aggressive read‑ahead can negatively impact random read performance; read ahead ratio helps mitigate this. In AI scenarios, when large‑file sequential or random reads cause bandwidth or IOPS bottlenecks, adjusting these parameters can optimize overall performance.

JuiceFS benchmark I/O tests and bottleneck analysis

Before diving into performance tuning for common AI scenarios, let’s first examine JuiceFS’ I/O behavior under ideal conditions through sequential and random read benchmarks. This helps us understand throughput and latency under different access patterns, providing a reference for the read/write patterns of subsequent AI/ML workloads.

Sequential read performance

In JuiceFS, sequential read performance is typically bandwidth‑bound. In cold read scenarios, performance is mainly limited by object storage bandwidth; in distributed cache scenarios, network bandwidth can become the bottleneck. For example, a node with a 40 Gbps NIC may achieve less than 5 GB/s of usable bandwidth. In addition, the user‑kernel transition overhead in the FUSE layer limits single‑thread throughput. Tests showed single‑thread sequential read bandwidth of around 3.5 GB/s. To break this limit, multi‑threaded or higher‑concurrency strategies are needed to fully utilize storage and network resources.

The table below shows test results of JuiceFS sequential read performance:

Threads	Bandwidth (GB/s)	Bandwidth per thread (GB/s)
1	3.5	3.5
2	6.3	3.15
3	9.5	3.16
4	9.7	2.43
6	14.0	2.33
8	17.0	2.13
10	18.6	1.9
15	21	1.4

In the performance test, single‑thread sequential read bandwidth was about 3.5 GB/s. As the number of threads increased, total throughput gradually approached the network bandwidth limit. To help users evaluate the performance ceiling of their own environment, JuiceFS provides the objbench subcommand for testing object storage bandwidth.

In real workloads, caching is more common than direct object storage access. In such cases, increasing the buffer size raises the number of background prefetch requests, thereby improving concurrency and overall throughput. For example, after increasing the buffer size to 400 MB (corresponding to 100 background prefetch requests of 4 MB each), concurrency improved significantly and overall throughput increased.

Random read performance

Low‑concurrency random reads

In low‑concurrency, non‑asynchronous access scenarios, each request must wait for the previous one to complete before being issued. As a result, latency has a significant impact on overall performance. I/O latency can come from many sources, including metadata query latency, object storage access latency, and local or distributed cache read latency. When analyzing random read performance, we must closely examine these latency factors.

In a 4 KB cold random read scenario, if the IOPS is only 8 and object storage latency is about 125 ms, the effective concurrency is roughly 1: concurrency ≈ IOPS × latency = 8 requests/s × 0.125 s = 1.

This indicates a near‑single‑concurrent, serial‑blocked state. In such cases, the optimization focus should be on shortening the access path and reducing per‑request latency rather than increasing concurrency – for example, by warming up data into the local cache. After data warm-up, the random read path switches from object storage to local cache, and IOPS can increase to about 12,000, approaching the I/O level of a local disk.
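
The relationship used here is Little's law: in-flight requests ≈ IOPS × per-request latency. The short Go sketch below applies it to the figures above. The function names are hypothetical, and the latency derived for the warmed-up case assumes the workload still issues one request at a time.

```go
package main

import (
	"fmt"
	"time"
)

// concurrency estimates the average number of in-flight requests.
func concurrency(iops float64, latency time.Duration) float64 {
	return iops * latency.Seconds()
}

// latencyFor estimates per-request latency from IOPS and concurrency.
func latencyFor(iops, inFlight float64) time.Duration {
	return time.Duration(inFlight / iops * float64(time.Second))
}

func main() {
	// Cold reads: 8 IOPS at about 125 ms each.
	fmt.Printf("cold concurrency: %.2f\n", concurrency(8, 125*time.Millisecond))

	// After warm-up: about 12,000 IOPS; assuming still one request in flight.
	fmt.Println("implied warm latency:", latencyFor(12000, 1))
}
```

With concurrency fixed at about 1, the only way IOPS can rise from 8 to roughly 12,000 is for latency to fall from about 125 ms to well under a millisecond, which is what moving the read path to a local cache achieves.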

High‑concurrency random reads

High‑concurrency random reads typically occur in scenarios with high thread counts or asynchronous I/O. The main performance bottleneck is often IOPS limits – including metadata IOPS, object storage IOPS, and cache IOPS. JuiceFS allows you to observe these metrics and pinpoint the bottleneck. Client machine resources (CPU, memory) can also affect performance, but such bottlenecks are easy to monitor.

In a cold read scenario using libaio for random reads, the object storage IOPS ceiling is around 7,000. When caching is enabled and data is warmed up, the access path shifts from object storage to the cache layer, and IOPS can further increase to over 20,000. This shows that the bottleneck for high‑concurrency random reads shifts as the access path changes.

For a deeper dive into JuiceFS’ complete data access path, refer to Optimizing JuiceFS Read Performance: Readahead, Prefetch, and Cache.

I/O characteristics and performance tuning for common AI scenarios

Large‑file sequential reads

A typical large‑file sequential read scenario is model loading, such as loading PyTorch .pt files saved via pickle serialization. In this process, performance is limited by two factors:

Pickle deserialization efficiency determines data processing speed.
Data reading is usually single‑threaded and limited by FUSE bandwidth and CPU performance.

To increase throughput, you can raise concurrency through multi‑threaded or sharded loading, fully utilizing I/O capacity. For large‑file sequential reads, the best performance is achieved when the entire dataset can be cached locally. If only on‑demand reading is required, the implementation is simple.
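
As a sketch of sharded loading, the Go example below splits a file into byte ranges and reads them concurrently with positional reads, so several requests are in flight at once. It only covers the I/O side (deserialization is out of scope), runs against a temporary local file, and uses hypothetical names (readSharded, demo) and an arbitrary shard count.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"sync"
)

// readSharded reads the whole file using up to shards concurrent range reads.
func readSharded(path string, shards int) ([]byte, error) {
	if shards < 1 {
		shards = 1
	}
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	size := st.Size()
	buf := make([]byte, size)
	per := (size + int64(shards) - 1) / int64(shards) // bytes per shard, rounded up
	errs := make([]error, shards)

	var wg sync.WaitGroup
	for i := 0; i < shards; i++ {
		start := int64(i) * per
		if start >= size {
			break
		}
		end := start + per
		if end > size {
			end = size
		}
		wg.Add(1)
		go func(i int, start, end int64) {
			defer wg.Done()
			// Each goroutine fills its own disjoint range of buf.
			sec := io.NewSectionReader(f, start, end-start)
			_, errs[i] = io.ReadFull(sec, buf[start:end])
		}(i, start, end)
	}
	wg.Wait()
	for _, e := range errs {
		if e != nil {
			return nil, e
		}
	}
	return buf, nil
}

// demo writes 1 MiB of sample data and reads it back in three shards.
func demo() error {
	want := bytes.Repeat([]byte("juicefs!"), 128<<10)
	f, err := os.CreateTemp("", "shard-demo")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	if _, err := f.Write(want); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	got, err := readSharded(f.Name(), 3)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return fmt.Errorf("content mismatch: got %d bytes", len(got))
	}
	return nil
}

func main() {
	fmt.Println("error:", demo())
}
```

The right shard count depends on the environment; the thread-scaling table earlier in the article suggests per-thread bandwidth falls as threads are added, so it is worth measuring rather than guessing.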

For more details on optimizing large‑file sequential reads, see How JuiceFS Transformed Idle Resources into a 70 GB/s Cache Pool.

Massive small files

In computer vision and multimodal tasks, training datasets often consist of many individual files, for example, single images, video frames, or text annotations. Such massive small‑file scenarios place heavy pressure on metadata services.

In massive small-file scenarios, metadata performance is critical. On one hand, each file carries only a small amount of data; on the other hand, directory metadata access efficiency is low when a directory holds a huge number of small files.

For read‑only workloads, enabling client metadata caching and extending the cache lifetime can improve performance.

Moreover, the data read layer comes under higher IOPS pressure: small files cannot take advantage of read‑ahead, so requests are more fragmented and per‑file latency tends to be higher. Common optimizations include increasing local cache capacity; with the Enterprise Edition, you can also scale out the distributed cache cluster horizontally.

For performance tuning in this scenario, see How D-Robotics Manages Massive Small Files in a Multi-Cloud Environment with JuiceFS.

Large‑file random reads

This scenario is common in AI training, for example, when randomly accessing datasets in TFRecord, HDF5, or LMDB format by sample. Take data loading as an example: if the dataset is accessed randomly and each read size equals the sample size (for example, 1 MB to 4 MB images or short videos), read‑ahead can waste bandwidth. Such scenarios can often break through IOPS bottlenecks by increasing concurrency.

Recommended measures include:

Increase the number of data‑loading reader threads.
Use asynchronous I/O to raise concurrency and saturate IOPS.
Improve the caching system, for example, pre‑map data into cache to boost underlying IOPS.
Adjust the read ahead ratio parameter (for example, set it to 0.5) to reduce bandwidth waste from read‑ahead. For instance, a 4 MB sequential read would previously prefetch 4 MB; after adjustment, only 2 MB is prefetched.

In this article, we’ve analyzed JuiceFS’ architecture from a performance perspective, covered benchmark I/O tests, and discussed tuning methods for typical AI scenarios. This provides an introductory reference for system performance. JuiceFS has been deployed in many production environments, and its distributed architecture offers a feasible balance between performance and cost.

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Discord.

How D-Robotics Manages Massive Small Files in a Multi-Cloud Environment with JuiceFS

DASWU — Fri, 06 Mar 2026 06:44:16 +0000

D-Robotics, founded in 2024 and spun off from Horizon Robotics' robotics division, specializes in the research and development of foundational computing platforms for consumer-grade robots. In 2025, we released an embodied AI foundation model.

In robot data management, training, and inference, the sheer volume of data is immense. Using object storage presents challenges such as handling small files and managing multi-cloud data. After trying some solutions and replacing private MinIO with SSD storage, we still faced difficulties in addressing these challenges. Ultimately, we selected JuiceFS as our core storage solution.

JuiceFS' inherent adaptability for cross-cloud operations efficiently supports data sharing needs in multi-cloud environments. In training scenarios, JuiceFS' cache mechanism, specifically designed for small file data, effectively replaces traditional caching solutions while striking a good balance between cost and efficiency and fully meeting storage performance requirements. Currently, we manage tens of millions of files.

In this article, we’ll share our application characteristics, storage pain points, solution selection, implementation practices, and production tuning experiences. We hope our experience offers useful insights for those facing similar challenges in the industry.

Storage pain points in the robotics industry

The cloud platform serves as our core technical hub, undertaking key application functions such as simulation environment setup, data generation and model training, model lightweighting and deployment, and visual verification. The data types involved in the platform are diverse, mainly including sensor image data, LiDAR point cloud data, model weights and configuration data, motor operational data, and map construction data.

While object storage meets basic storage needs for massive data, its performance limitations become particularly obvious when handling the massive small files frequently encountered in robotics applications. Our storage system faced four challenges:

Metadata performance bottleneck with massive small files: Robot model training involves tens of millions to billions of sensor images, LiDAR data, and model files. Traditional object storage (like standard S3) exhibits significant metadata operation bottlenecks at this scale. The fixed API latency for routine operations like listing files or retrieving attributes is typically 10–30 ms. This directly constrains queries per second (QPS) performance during training and inference and impacts overall R&D efficiency.
Inefficient multi-cloud collaboration and data flow: As robotics companies increasingly adopt multi-cloud architectures for their R&D and production applications, ensuring efficient data synchronization and sharing across different cloud platforms and geographical regions has become a common challenge for the industry. Traditional storage solutions typically suffer from low cross-cloud data transfer efficiency and are often deeply integrated with a single cloud provider. This leads to technical lock-in and makes it difficult to achieve flexible cross-cloud deployment and data collaboration.
The impossible trinity of performance, cost, and operations: High-performance parallel file systems offer high throughput and low latency but typically rely on all-flash arrays or dedicated hardware. This leads to high hardware investment and ongoing operational costs, plus complex deployment. Low-cost object storage offers good elasticity but struggles to support the high-throughput I/O demands of GPU clusters in AI training scenarios. A common industry workaround is using a high-speed file system as a cache synchronized with S3. However, the extra data synchronization steps significantly reduce usability and fail to achieve efficient storage-compute synergy.
Difficulty in dataset version management: The rapid iteration cycle of robot models requires efficient and granular management of multiple dataset versions. Using physical copies for version control multiplies underlying storage consumption with every version, significantly increasing costs. Moreover, the difficulty of retrieving, reusing, and maintaining multi-version data also increases substantially.
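
To make the first bottleneck concrete, the sketch below turns the 10–30 ms per-call latency quoted above into a per-worker request ceiling and a wall-clock estimate. The file count and worker count are illustrative assumptions rather than measurements from our platform, and the model assumes each worker issues one blocking call at a time.

```python
def max_qps(latency_ms: float, workers: int = 1) -> float:
    """Upper bound on requests per second when each worker blocks on one call at a time."""
    return workers * 1000.0 / latency_ms


def hours_to_stat(num_files: int, latency_ms: float, workers: int = 1) -> float:
    """Wall-clock hours needed to issue one attribute request per file."""
    return num_files / max_qps(latency_ms, workers) / 3600.0


if __name__ == "__main__":
    files = 10_000_000  # assumed dataset size, within the "tens of millions" range
    workers = 64        # assumed number of concurrent data-loader workers
    for latency in (10, 30):
        print(
            f"{latency} ms/call: {max_qps(latency):.0f} QPS per worker, "
            f"{hours_to_stat(files, latency, workers):.2f} h with {workers} workers"
        )
```

Under these assumptions, a single worker tops out at roughly 33 to 100 metadata requests per second, which is why per-call latency dominates when a dataset is made of many small files.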

Storage selection: JuiceFS vs. MinIO/S3 vs. PFS

To address these storage challenges, we established a clear evaluation framework for storage selection. A comprehensive comparative test was conducted on mainstream storage solutions across seven core dimensions: storage architecture, protocol compatibility, metadata performance, scalability, multi-cloud adaptability, cost efficiency, and operational complexity.

Comparison basis	JuiceFS	MinIO / Public Cloud S3	CephFS / Public Cloud FS (CPFS)
Storage architecture	Separation of metadata and data	Unified object storage	Metadata and data typically coupled, often with kernel-space parallel design
Protocol support	Full compatibility: POSIX, HDFS, S3 API, Kubernetes CSI	Primarily S3 API, with weak POSIX compatibility	POSIX-oriented; HDFS or S3 compatibility often requires plugins
Metadata performance	Very high: sub-millisecond latency, supports hundreds of billions of files per volume	Lower: high metadata overhead for massive small files; API call overhead about 10–30 ms	Medium to high: performance bottlenecks and complexity challenges at ultra-large scale (100M+ files)
Scalability	High: horizontal scaling, supports tens to hundreds of billions of files per volume	High: near-infinite storage capacity, but small-file management efficiency degrades with scale	Moderate: scaling limited by metadata nodes; operational complexity grows exponentially with scale
Multi-cloud adaptability	Native support	Relies on sync tools; cross-cloud data flow inefficient; global unified view difficult	Limited: often tightly bound to specific hardware or cloud provider; cross-cloud deployment is complex
Cost efficiency	High performance-to-cost ratio	Low storage cost, but low GPU utilization in high-throughput scenarios like AI training	High cost: often requires all-flash architecture or dedicated hardware; high operational labor cost

Based on the comparison results above, JuiceFS demonstrates significant advantages in core performance, scalability, multi-cloud adaptability, and cost efficiency. This makes it the preferred choice for our unified storage solution.

Furthermore, JuiceFS has been widely adopted in the autonomous driving industry. Leading companies such as Horizon Robotics have leveraged JuiceFS to manage data at the exabyte scale. This demonstrates its maturity and effectiveness in large-scale production environments.

For our specific application scenarios, JuiceFS offers the following core technical advantages:

Decoupled architecture: JuiceFS adopts a metadata-data separation architecture, persisting data in cost-effective object storage (like S3 or OSS) while storing metadata in databases like Redis or TiKV. This decoupled design enables elastic storage scaling and reduces dependence on any single cloud provider.
Chunking and caching mechanisms: JuiceFS uses chunks, slices, and blocks to significantly improve small file read efficiency and enhance concurrent read/write performance. In addition, multi-level caching (memory, local SSD, distributed cache) reduces access latency for hot data. This meets the demands of high-throughput training workloads.
Cloud-native adaptability: By providing a CSI Driver, JuiceFS delivers persistent storage decoupled from compute nodes in Kubernetes environments, supporting stateless container deployment and cross-cloud migration. It enables data sharing, enhances application high availability and flexibility, and adapts to various Kubernetes deployment methods.
Full-stack support for AI training: JuiceFS fully supports POSIX, HDFS, and S3 API, and is compatible with mainstream AI frameworks such as PyTorch and TensorFlow. It can be integrated without code modifications, lowering the technical barrier for adoption.
Multi-cloud support: Its cross-cloud capabilities and high-performance metadata engine ensure efficient data flow, perfectly aligning with our strategy of "computing power on demand."

From a cost perspective, JuiceFS does not offer a significant cost advantage in the early stages of small-scale deployment. However, when data volume reaches the petabyte level, especially at the 10 PB or 100 PB scale, and is compared against all-flash storage solutions, the cost advantage of its object-storage-based architecture becomes fully evident. In addition, JuiceFS requires minimal operational overhead. Currently, we need only one engineer to manage the entire cloud platform and storage system, a fraction of the personnel required by traditional solutions.

From Community Edition to Enterprise Edition: addressing larger-scale scenarios

As our application continued to expand, we encountered limitations when using Redis as the metadata engine—specifically, physical memory capacity constrained data scalability. When the number of files approached the hundred-million level, metadata query latency increased significantly. This impacted the concurrency efficiency of training tasks. After using the clone feature, the metadata volume grew substantially. In addition, in cross-cloud scenarios, we faced higher demands for metadata synchronization and mirror file system capabilities. We also required more granular capacity controls and permission management at the directory level.

Considering these requirements—along with our desire to leverage local SSDs on GPU nodes to build a distributed cache layer for improved performance—we decided to deploy JuiceFS Enterprise Edition in parallel, migrating core scenarios such as ultra-large-scale directory management and multi-node collaborative training to this version. Through this scenario-based approach, we’ve effectively enhanced the adaptability of our overall storage system and established a solid foundation for future application growth. Below are the key features of the Enterprise Edition that we’ve applied in real-world scenarios.

High-performance metadata engine: solving the bottleneck of large-scale directory retrieval

For high-frequency operations such as traversing directories with hundreds of millions of files and deep pagination queries, we previously encountered the "slower as you query" problem with traditional storage solutions. When the number of files in a single directory exceeded 10 million, and the pagination offset surpassed 100,000 entries, response latency would spike from hundreds of milliseconds to several seconds. This severely impacted data filtering efficiency.

After switching to JuiceFS Enterprise Edition, its native tree-structured metadata storage architecture played a key role. Unlike flat key-value storage, which keeps file metadata with no directory ordering, this tree structure allows direct navigation to directory levels, reducing the scope of metadata scans. In our actual tests, deep pagination queries (with an offset of 500,000 entries) in a directory containing 120 million files saw latency drop from 3.8 seconds to just 210 milliseconds. This fully met the retrieval needs of large-scale datasets. In addition, this engine supports storing hundreds of billions of files per volume, and we’ve already used it to manage three petabyte-scale training datasets stably, aligning with our application growth expectations.
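
The difference between the two layouts can be illustrated with a toy model. This is only a sketch of the general idea (scanning a flat record list versus jumping to a directory and seeking to a cursor); it is not how the Enterprise Edition metadata engine is implemented, and all function names and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def page_flat(records, parent, offset, limit):
    """Offset pagination over a flat list of (parent, name) records.

    Every record is examined until `offset + limit` matches are found,
    so the cost grows with the offset and with unrelated directories.
    """
    page, matched, scanned = [], 0, 0
    for rec_parent, name in records:
        scanned += 1
        if rec_parent != parent:
            continue
        if matched >= offset:
            page.append(name)
            if len(page) == limit:
                break
        matched += 1
    return page, scanned


def build_tree(records):
    """Group names under their parent directory and keep each group sorted."""
    tree = defaultdict(list)
    for rec_parent, name in records:
        tree[rec_parent].append(name)
    for names in tree.values():
        names.sort()
    return tree


def page_tree(tree, parent, after_name, limit):
    """Cursor pagination: go straight to the directory, then binary-search the cursor."""
    names = tree[parent]
    start = bisect_right(names, after_name) if after_name is not None else 0
    return names[start:start + limit]


if __name__ == "__main__":
    # Hypothetical data: 3,000 files spread over three directories.
    records = [(f"/d{i % 3}", f"f{i:05d}") for i in range(3000)]
    page, scanned = page_flat(records, "/d1", offset=500, limit=3)
    print(page, "records scanned:", scanned)
    print(page_tree(build_tree(records), "/d1", after_name="f01498", limit=3))
```

In the flat version the number of records touched keeps growing with the offset, while the tree version does a lookup plus a binary search regardless of how deep the page is.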

Enterprise-grade distributed cache: improving data sharing efficiency in multi-node, multi-GPU training

In multi-node, multi-GPU training scenarios, we previously faced challenges such as low cache hit rates and cross-node bandwidth congestion. The open-source version only supports local caching on each node. This means that when multiple nodes pull the same dataset simultaneously, each node must access object storage independently. This resulted in single-node bandwidth utilization exceeding 90%, with average training job startup delays of up to 20 minutes.

With JuiceFS Enterprise Edition's distributed caching feature, we set up a distributed cache across a 12-node training cluster using just three commands. The dataset only needs to be pulled from object storage once and is cached in a pool built from local SSDs across the nodes. As a result, the cache hit rate for multi-node collaborative training increased from 45% to 92%, cross-node bandwidth utilization dropped to below 15%, and training job startup time was reduced to under three minutes. This significantly improved compute utilization.
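
As a rough back-of-the-envelope check, the hit rates above can be translated into object storage traffic. The sketch assumes every cache miss results in exactly one object storage read and uses a hypothetical count of one million block reads; it is an approximation, not a measurement.

```python
def object_storage_reads(total_reads: int, hit_rate: float) -> int:
    """Reads that miss the cache and therefore go to object storage."""
    return round(total_reads * (1.0 - hit_rate))


def reduction_factor(hit_before: float, hit_after: float) -> float:
    """How many times fewer object storage reads are needed after the change."""
    return (1.0 - hit_before) / (1.0 - hit_after)


if __name__ == "__main__":
    reads = 1_000_000  # hypothetical number of block reads in one training epoch
    print("before:", object_storage_reads(reads, 0.45))
    print("after: ", object_storage_reads(reads, 0.92))
    print(f"reduction: {reduction_factor(0.45, 0.92):.1f}x")
```

Under that assumption, moving from a 45% to a 92% hit rate cuts object storage reads by roughly a factor of seven.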

Enhanced cross-cloud collaboration: building a low-operational-cost cross-cloud data foundation

Since our R&D environments are distributed across two cloud environments, we previously encountered challenges with slow cross-cloud data synchronization and high operational costs. Using traditional synchronization tools to maintain data consistency between the two clouds required configuring eight scheduled tasks, with an average synchronization delay of four hours, and dedicated personnel needed to investigate sync failures weekly.

By using the JuiceFS sync tool combined with our internal AI operations tools, we achieved automated configuration of synchronization policies. The system automatically adjusts sync priorities based on data heat levels, keeping cross-cloud data latency within 10 minutes. In addition, tasks such as failure retries and log alerts for synchronization are fully automated, eliminating the need for dedicated monitoring. This has reduced operational overhead by 70%, and we now stably support multiple training projects across two cloud platforms sharing the same dataset. Going forward, we plan to use the Enterprise Edition's mirror file system feature to further enhance cross-cloud data collaboration.

JuiceFS optimization

Client cache and write performance tuning

Pay attention to compatibility between caching strategies and Kubernetes resource limits. For example, using memory as the local cache path with an improper configuration may lead to abnormal memory growth in the Mount Pod. Likewise, insufficient resource quota reservations may cause checkpoint loss or file handle write exceptions during long-running training tasks.

Regarding write performance tuning, enabling writeback mode can improve small file write throughput to some extent. However, considering production environment requirements for data consistency, we still adopt write-through synchronous mode to reduce data risks in extreme crash scenarios. It’s recommended to cautiously enable writeback mode only in scenarios with lower data reliability requirements, such as temporary computing or offline data cleaning, based on actual needs.
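
The trade-off can be kept explicit in tooling by making writeback an opt-in switch. The helper below is a minimal sketch that only assembles a `juicefs mount` command line for the Community Edition client; `--cache-dir`, `--cache-size`, and `--writeback` are mount options of that client, while the metadata URL, paths, and sizes are hypothetical placeholders. Check the options against the client version you run before using it.

```python
import shlex


def mount_command(meta_url, mountpoint, cache_dir, cache_size_mib, writeback=False):
    """Build the argv for `juicefs mount`; writeback stays off unless requested."""
    cmd = [
        "juicefs", "mount",
        "--cache-dir", cache_dir,
        "--cache-size", str(cache_size_mib),
    ]
    if writeback:
        # Only for workloads that tolerate data loss on a crash,
        # such as temporary computing or offline data cleaning.
        cmd.append("--writeback")
    cmd += [meta_url, mountpoint]
    return cmd


if __name__ == "__main__":
    # Hypothetical metadata URL and paths.
    prod = mount_command("redis://meta.example.internal:6379/1", "/mnt/jfs",
                         "/data/jfs-cache", 102400)
    scratch = mount_command("redis://meta.example.internal:6379/1", "/mnt/jfs-tmp",
                            "/data/jfs-cache", 102400, writeback=True)
    print(shlex.join(prod))
    print(shlex.join(scratch))
```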

Deployment and network topology optimization

For more stable performance, it’s strongly recommended to deploy the metadata engine and compute nodes within the same region. In actual operations, we observed that cross-region deployment could increase metadata operation latency by a factor of several to ten. This significantly impacted I/O-intensive operations such as data decompression. Deploying metadata services and GPU computing resources within the same region helps maintain performance while controlling network transmission costs, improving overall resource utilization efficiency.

Data warm-up and cache optimization

In a 10-gigabit network environment, fully utilizing JuiceFS' data warm-up and reasonably adjusting data block sizes based on application scenarios can better leverage network bandwidth capabilities and improve read throughput. Combined with the distributed cache architecture, this can effectively enhance data sharing efficiency in multi-node concurrent scenarios and improve cache hit rates during high-concurrency reads, which in turn improves the overall performance of large-scale AI training tasks.

Resource quotas and high availability guarantee

In enterprise-level multi-role operations and storage responsibility separation scenarios, to avoid operational risks caused by inconsistent configurations, it’s recommended to finely control resource quotas for JuiceFS CSI Driver in Kubernetes environments. By appropriately setting CPU and memory request/limit for Mount Pods, Pod restarts or node anomalies caused by resource preemption can be reduced. In practice, resource reservation ratios can be dynamically adjusted based on cluster load.

In addition, for scenarios with high application continuity requirements, the automatic mount point recovery feature for Mount Pods can be enabled to achieve automated fault recovery for storage services, further ensuring underlying storage stability.

Multi-tenancy

We provide independent file systems and storage buckets for large enterprise customers, while isolating small and medium-sized enterprises and end users through subdirectory-level isolation and permission control.

Large enterprises can flexibly scale throughput and capacity, avoiding performance bottlenecks associated with shared storage buckets. For small and medium-sized enterprises and end users, we ensure data security and independence through subdirectory isolation and permission control, while enabling accurate metering and billing.

This architecture ensures tenant isolation while flexibly allocating resources, improving system management efficiency.

Version management

Using the juicefs clone command, copies of original datasets can be quickly created and modified independently without affecting the source data. The clone operation copies only file metadata, and new data is stored only for subsequent changes, which saves underlying storage space. This feature supports managing multiple versions, facilitating rollback and recovery and helping keep data safe.
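
The storage-saving effect of a metadata-only copy can be shown with a toy model. This is a simplified illustration of copy-on-write style sharing, not JuiceFS' actual data structures; the `ToyVolume` class and the paths are hypothetical.

```python
class ToyVolume:
    """Toy file system: file metadata is a list of block IDs, data lives in blocks."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes
        self.files = {}    # path -> list of block ids
        self._next_id = 0

    def _put(self, data: bytes) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write(self, path, chunks):
        self.files[path] = [self._put(chunk) for chunk in chunks]

    def clone(self, src, dst):
        # Metadata-only copy: the clone references the same blocks as the source.
        self.files[dst] = list(self.files[src])

    def overwrite_chunk(self, path, index, data: bytes):
        # A change writes a new block and repoints only this file's metadata.
        self.files[path][index] = self._put(data)

    def read(self, path) -> bytes:
        return b"".join(self.blocks[b] for b in self.files[path])

    def stored_bytes(self) -> int:
        return sum(len(data) for data in self.blocks.values())


if __name__ == "__main__":
    vol = ToyVolume()
    vol.write("/datasets/v1", [b"aaaa", b"bbbb"])
    vol.clone("/datasets/v1", "/datasets/v2")
    print("after clone:", vol.stored_bytes(), "bytes stored")
    vol.overwrite_chunk("/datasets/v2", 1, b"cccc")
    print("after edit: ", vol.stored_bytes(), "bytes stored")
    print(vol.read("/datasets/v1"), vol.read("/datasets/v2"))
```

In this model the clone itself adds no data, and editing the clone adds only the changed block while the source stays intact.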

Summary

JuiceFS' characteristics in metadata performance, scalability, cross-cloud adaptability, and comprehensive cost efficiency have made it our choice for building a unified storage layer. Currently, we adopt both JuiceFS Community Edition and Enterprise Edition to accommodate different storage requirements across various application scenarios.

In the future, we plan to further implement JuiceFS in the embodied intelligence field, addressing specific storage needs in this scenario. These include high-throughput processing of time-series data, precise multi-modal data alignment, edge-cloud collaborative storage, and integrated management of simulation and real-world data.

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and the community on Slack.

JuiceFS Enterprise 5.3: 500B+ Files per File System & RDMA Support

DASWU — Wed, 04 Feb 2026 09:18:32 +0000

JuiceFS Enterprise Edition 5.3 has recently been released, achieving a milestone breakthrough by supporting over 500 billion files in a single file system. This upgrade includes several key optimizations to the metadata multi-zone architecture and introduces remote direct memory access (RDMA) technology for the first time to enhance distributed caching efficiency. In addition, version 5.3 enhances write support for mirrors and provides data caching for objects imported across buckets. It aims to support high-performance requirements and multi-cloud application scenarios.

JuiceFS Enterprise Edition is designed for high-performance scenarios. Since 2019, it has been applied in machine learning and has become one of the core infrastructures in the AI industry. Its customers include large language model (LLM) companies such as MiniMax and StepFun; AI infrastructure and applications like fal and HeyGen; autonomous driving companies like Momenta and Horizon Robotics; and numerous leading technology enterprises across various industries leveraging AI.

Single file system supports 500 billion+ files

The multi-zone architecture is one of JuiceFS' key technologies for handling hundreds of billions of files, ensuring high scalability and concurrent processing capabilities. To meet the growing demands of scenarios like autonomous driving, version 5.3 introduces in-depth optimizations to the multi-zone architecture, increasing the zone limit to 1,024 and enabling a single file system to store and access at least 500 billion files (each zone can store 500 million files, with a maximum of 2 billion).
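
The headline number follows directly from the per-zone figures in parentheses. The short calculation below just multiplies them out, assuming files are spread evenly across zones.

```python
ZONES = 1024
TYPICAL_FILES_PER_ZONE = 500_000_000    # "each zone can store 500 million files"
MAX_FILES_PER_ZONE = 2_000_000_000      # "with a maximum of 2 billion"


def volume_capacity(zones: int, files_per_zone: int) -> int:
    """Total files in one file system if every zone holds `files_per_zone` files."""
    return zones * files_per_zone


if __name__ == "__main__":
    print(f"typical: {volume_capacity(ZONES, TYPICAL_FILES_PER_ZONE):,} files")
    print(f"ceiling: {volume_capacity(ZONES, MAX_FILES_PER_ZONE):,} files")
```

At 500 million files per zone, 1,024 zones hold 512 billion files, which is where the "at least 500 billion" figure comes from. The per-zone maximum would put the theoretical ceiling at about 2 trillion.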

The figure below shows JuiceFS Enterprise Edition architecture, with a single zone in the lower left corner:

Reaching this scale poses sharply growing challenges for system performance, data consistency, and stability. It is backed by a series of complex underlying optimizations and R&D efforts.

Cross-zone hotspot balancing: automated monitoring and hotspot migration, with manual ops tools

In distributed systems, hotspots are a common challenge. Especially when data is distributed across multiple zones, some zones may experience higher loads than others. This leads to imbalance that impacts system performance.

When the number of zones reaches hundreds, hotspot issues become more prevalent. Particularly with smaller datasets and larger numbers of files, read/write hotspots exacerbate latency fluctuations.

We introduced an automated hotspot migration mechanism to move frequently accessed files to other zones, distributing the load and reducing pressure on specific zones. However, in practice, relying solely on automated migration cannot fully resolve all issues. In certain special or extreme scenarios, automated tools may not respond promptly. Therefore, alongside automated monitoring and migration, we added manual operational tools, allowing administrators to intervene in complex scenarios, perform manual analysis, and implement optimization solutions.

Large-scale migration: improved migration speed, small-batch concurrent migration

Facing zones with excessive hotspots, early migration operations were simple. However, as the system scale expanded, migration efficiency gradually decreased. To address this, we introduced a small-batch concurrent migration strategy, breaking down high-access directories into smaller chunks and migrating them in parallel to multiple lower-load zones. This quickly scatters hotspots and restores normal application access.
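
The strategy can be sketched as three steps: split the hot directory into small batches, assign each batch to whichever zone currently has the lowest load, and run the moves concurrently. The code below is a hypothetical illustration of that idea only. The function names, the load metric (entry count), and the `move_batch` callback are assumptions, not the product's internal API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def plan_batches(entries, batch_size):
    """Split a hot directory's entries into small batches."""
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def assign_to_zones(batches, zone_load):
    """Greedily give each batch to the zone with the lowest load so far."""
    heap = [(load, zone) for zone, load in zone_load.items()]
    heapq.heapify(heap)
    plan = []
    for batch in batches:
        load, zone = heapq.heappop(heap)
        plan.append((zone, batch))
        heapq.heappush(heap, (load + len(batch), zone))
    return plan


def migrate(plan, move_batch, workers=4):
    """Run the batch moves concurrently; results come back in plan order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: move_batch(*item), plan))


if __name__ == "__main__":
    entries = [f"hot_dir/file_{i}" for i in range(10)]
    plan = assign_to_zones(plan_batches(entries, 4), {"z1": 0, "z2": 5, "z3": 3})
    # Stand-in for the real move: just report what would be migrated where.
    print(migrate(plan, lambda zone, batch: (zone, len(batch))))
```

Keeping batches small limits how long any single move holds up access, and the greedy assignment spreads the hot entries over several zones instead of creating a new hotspot.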

Enhanced reliability self-checks: automatic repair and cleanup of intermediate migration states

In large-scale clusters, the probability of distributed transaction failures increases significantly, especially during extensive migration processes. To address this, we enhanced reliability detection mechanisms, adding periodic background checks to scan cross-zone file states, particularly focusing on intermediate state issues, and automatically performing repairs and cleanup.

Previously, the system encountered issues with residual intermediate state data. While these did not affect operations in the short term, over time they could lead to errors. Through enhanced self-check mechanisms, we ensure the background periodically scans and promptly handles intermediate state issues, improving system stability and reliability.

Beyond the three key optimizations above, we also made multiple improvements to the console to better adapt to managing more zones. We optimized concurrent processing, operational tasks, and query displays, enhancing overall performance and user experience. Specifically, we refined UI design to better showcase system states in large-scale zone environments.

Performance stress test for hundreds of billions of files

We conducted large-scale tests using a custom mdtest tool on Google Cloud, deploying 60 nodes, each with over 1 TB of memory. In terms of software configuration, we increased the number of zones to 1,024. The deployment method was similar to previous setups, but to reduce memory consumption, we deployed only one service process, with two others as cold backups.

JuiceFS Enterprise Edition 5.3 test:

Test duration: Approximately 20 hours
Total files written: About 400 billion files
Write speed: 5 million files per second
Memory usage: About 35% to 40%
Disk usage: 40% to 50%, primarily for metadata persistence, with good utilization
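
The duration, file count, and write speed above can be cross-checked against each other with simple arithmetic, treating all three as rounded figures.

```python
def files_per_second(total_files: int, hours: float) -> float:
    """Average creation rate implied by a file count and a duration."""
    return total_files / (hours * 3600)


def hours_needed(total_files: int, rate_per_second: float) -> float:
    """Duration implied by a file count and a sustained creation rate."""
    return total_files / rate_per_second / 3600


if __name__ == "__main__":
    total = 400_000_000_000  # about 400 billion files
    print(f"{files_per_second(total, 20) / 1e6:.2f} million files/s over 20 hours")
    print(f"{hours_needed(total, 5_000_000):.1f} hours at 5 million files/s")
```

Writing 400 billion files in 20 hours averages about 5.6 million files per second, and at exactly 5 million files per second the run would take about 22 hours, so the reported numbers are consistent with one another.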

Based on our experience, if using a configuration with one service process, one hot backup, and one cold backup, memory usage increases by 20% to 30%.

Due to limited cloud resources, this test only wrote up to 400 billion files. During stress testing, the system performed stably, with hardware resources still remaining. We’ll continue to attempt larger-scale tests in the future.

Support for RDMA: increased bandwidth cap, reduced CPU usage

This new version introduces support for RDMA technology for the first time. Its basic architecture is shown in the diagram below. RDMA allows direct access to remote node memory, bypassing the operating system's network protocol stack. This significantly improves data transfer efficiency.

The main advantages of RDMA include:

Low latency: By enabling direct memory-to-memory transfers and bypassing the OS network protocol layers, it reduces CPU interrupts and context switches. This lowers latency.
High throughput: RDMA uses hardware for direct data transfer, better utilizing the bandwidth of network interface cards (NICs).
Reduced CPU usage: In RDMA, data copying is almost entirely handled by the NIC, with the CPU only processing control messages. This frees up CPU resources.

In JuiceFS, network request messages between clients and metadata services are small, and existing TCP configurations already meet the needs. However, in distributed caching, file data is transferred between clients and cache nodes. Using RDMA can effectively improve transfer efficiency and reduce CPU consumption.

We conducted 1 MB random read tests using 160 Gbps NICs, comparing versions 5.1 and 5.2 (both using TCP networking) with version 5.3 (using RDMA), and observed CPU usage.

Tests showed that RDMA effectively reduces CPU usage:

In version 5.2, CPU usage was nearly 50%.
In version 5.3, with RDMA optimization, CPU usage dropped to about one-third of that level. Client and cache node CPU usage decreased to 8 cores and 5 cores respectively, with bandwidth reaching 20 GiB/s.

In previous tests, we found that while TCP ran stably on 200G NICs, fully saturating bandwidth was challenging, typically achieving only 85%-90% utilization. For customers requiring higher bandwidth (such as 400G NICs), TCP could not meet demands. However, RDMA can more easily reach hardware bandwidth limits, providing better transfer efficiency.

If customers have RDMA-capable hardware and high bandwidth requirements (for example, NICs greater than 100G) and wish to reduce CPU usage, RDMA is a technology worth trying. Currently, our RDMA feature is in public testing and has not yet been widely deployed in production environments.

Enhanced write support for mirrors

Initially, mirror clusters were primarily used for read-only mirroring in enterprise products. As users requested capabilities like writing temporary files (such as training data) in mirrors, we provided write support for mirrors.

The mirror client implements a read-write separation mechanism. When reading data, the client prioritizes fetching from the mirror cluster to reduce latency. When writing data, it still writes to the source cluster to ensure data consistency. By recording and comparing metadata version numbers, we ensure strong consistency between the mirror client and source cluster client views of the data.

To improve availability, version 5.3 introduces a fallback mechanism. When the mirror becomes unavailable, client read requests automatically fall back to the source cluster. This ensures application continuity and avoids interruptions caused by mirror cluster failures. We also optimized deployments in multi-mirror environments. Previously, the mirror end required two hot backup nodes to ensure high availability. Now, with the improved fallback feature, deploying a single mirror node can achieve similar effects at lower cost, which is especially beneficial for users requiring multiple mirrors.
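
The read/write split, the version comparison, and the fallback can be pictured with a small toy model. It is one possible reading of the behavior described above, not the actual client implementation: the `Cluster` and `MirrorClient` classes, the single global version counter, and the `replicate` helper are all hypothetical simplifications.

```python
class Cluster:
    """Toy metadata cluster with a version counter and an availability flag."""

    def __init__(self):
        self.data = {}
        self.version = 0
        self.available = True

    def get(self, key):
        if not self.available:
            raise ConnectionError("cluster unavailable")
        return self.data.get(key)


def replicate(source: Cluster, mirror: Cluster) -> None:
    """Stand-in for asynchronous replication from source to mirror."""
    mirror.data = dict(source.data)
    mirror.version = source.version


class MirrorClient:
    def __init__(self, source: Cluster, mirror: Cluster):
        self.source, self.mirror = source, mirror
        self.seen_version = 0  # newest source version this client has observed

    def write(self, key, value):
        # Writes always go to the source cluster.
        self.source.version += 1
        self.source.data[key] = value
        self.seen_version = self.source.version

    def read(self, key):
        # Prefer the mirror, but only if it has caught up with what we have seen.
        try:
            if self.mirror.version >= self.seen_version:
                return self.mirror.get(key), "mirror"
        except ConnectionError:
            pass  # mirror is down: fall back to the source
        return self.source.get(key), "source"


if __name__ == "__main__":
    source, mirror = Cluster(), Cluster()
    client = MirrorClient(source, mirror)
    client.write("ckpt", "step-100")
    print(client.read("ckpt"))   # mirror is behind, served by the source
    replicate(source, mirror)
    print(client.read("ckpt"))   # mirror has caught up
    mirror.available = False
    print(client.read("ckpt"))   # mirror down, falls back to the source
```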

Through this improvement, we not only reduced hardware costs but also found a balance between high availability and low cost. For users deploying mirrors in multiple locations, reducing metadata replicas further lowers overall costs.

Simplified operations & increased flexibility: providing cross-bucket data cache for imported objects

In JuiceFS, users can use the import command to bring existing files from object storage under unified management. This is convenient for users already storing large amounts of data (for example, tens of petabytes). However, in previous versions, this feature only supported caching for objects within the same data bucket. This meant imported objects had to reside in the same bucket as the existing file system data, which restricted how the feature could be used in practice.

In version 5.3, we improved this feature. Users can now provide caching capability for any imported objects, regardless of whether they come from the same data bucket. This allows users more flexibility in managing objects across different data buckets, avoiding strict bucket restrictions and enhancing data management freedom.

In addition, previously, if users had data distributed across multiple buckets and wanted to provide caching for that data, they needed to create a new file system for each bucket. In version 5.3, users only need to create one file system (volume) to uniformly manage data from multiple buckets and provide caching for all buckets.

Other important optimizations

Trace feature

We added a trace feature built on the tracing capability provided by Go itself. With it, advanced users can perform tracing and performance analysis, gaining more information to help quickly locate issues.

Trash recovery

In previous versions, especially with multiple zones, sometimes the paths recorded in the trash were incomplete. This led to anomalies during recovery, where files were not restored to the expected locations. To address this, in version 5.3, when deleting files, we record the original file path, ensuring more reliable recovery capabilities.

Python SDK improvements

In earlier versions, we released the Python SDK, providing basic read/write functionality for Python users to interface with our system. In version 5.3, we not only strengthened basic read/write functions but also added support for operational subcommands. For example, users can directly call commands like juicefs info or warmup via the SDK without relying on external system commands. This simplifies coding and avoids potential performance bottlenecks from frequently calling external commands.

The Windows client

We previously launched a beta version of the Windows client and have received some user feedback. After improvements, the current version shows significant enhancements in mount reliability, performance, and compatibility with Linux systems. In the future, we plan to further refine the Windows client, providing an experience closer to Linux for users reliant on Windows.

Summary

Compared to expensive dedicated hardware, JuiceFS helps users balance performance and cost when addressing data growth by flexibly utilizing cloud or existing customer storage resources. In version 5.3, by optimizing the metadata zone architecture, a single file system can support over 500 billion files. The first-time introduction of RDMA technology significantly improves distributed caching bandwidth and data access efficiency, reduces CPU usage, and further optimizes system performance. In addition, we enhanced features like write support for mirrors and caching, improving the performance and operational efficiency of large-scale clusters and optimizing user experience.

Cloud service users can now directly experience JuiceFS Enterprise Edition 5.3 online, while on-premises deployment users can obtain upgrade support through official channels. We’ll continue to focus on high-performance storage solutions, partnering with enterprises to tackle challenges brought by continuous data growth.

If you have any questions about this article, feel free to join JuiceFS discussions on GitHub and the community on Slack.

From GlusterFS to JuiceFS: Lightillusions Achieved 2.5x Faster 3D AIGC Data Processing

DASWU — Fri, 09 Jan 2026 07:43:46 +0000

Lightillusions is a company specializing in spatial intelligence technology, integrating 3D vision, computer graphics, and generative models to build innovative 3D foundation models. Our company is led by Ping Tan, a professor at the Hong Kong University of Science and Technology (HKUST) and Director of the HKUST-BYD Joint Laboratory.

Unlike 2D models, a single 3D model can be several gigabytes in size, especially complex models like point clouds. When our data volume reached petabyte scales, management and storage became significant challenges. After trying solutions like NFS and GlusterFS, we chose JuiceFS, an open-source high-performance distributed file system, to build a unified storage platform. This platform now serves multiple scenarios, supports cross-platform access including Windows and Linux, manages hundreds of millions of files, improves data processing speed by 200%–250%, enables efficient storage scaling, and greatly simplifies operations and maintenance. This allows us to focus more on core research.

In this article, we’ll break down the unique storage demands of 3D AIGC, share why we selected JuiceFS over CephFS, and walk through the architecture of our JuiceFS-based storage platform.

Storage requirements for 3D AIGC

Our research focuses on perception and generation. In the 3D domain, tasks are more complex than in image and text processing. This placed higher demands on our AI models, algorithms, and infrastructure.

We illustrate the complexity of 3D data processing through a typical pipeline. On the left side of the diagram below is a 3D model containing texture (top-left) and geometry (bottom-right) information. First, we generate rendered images. Each model has text labels describing its content, geometric features, and texture features, which are tightly coupled with the model. In addition, we process geometry data, such as sampled points and necessary numerical values obtained from data preprocessing, like signed distance fields (SDFs). It's important to note that 3D model file formats are highly diverse, and image formats are also different.

Our work spans language models, image/video models, and 3D models. As data volume grows, so does the storage burden. The main characteristics of data usage in these scenarios are as follows:

Language models: Data typically consists of a vast number of small files. Although individual text files are small, the total file count can reach millions or even tens of millions as data volume increases. This makes the management of such a large number of files a primary storage challenge.
Image and video data: High-resolution images and long videos are usually large. A single image can range from hundreds of kilobytes to several megabytes, while video files can reach gigabytes. During preprocessing—such as data augmentation, resolution adjustment, and frame extraction—data volume increases significantly. Especially in video processing, where each video is typically decomposed into a large number of image frames, managing these massive file collections adds considerable complexity.
3D models: Individual models, especially complex ones like point clouds, can be several gigabytes in size. 3D data preprocessing is more complex than other data types, involving steps like texture mapping and geometry reconstruction, which consume great computational resources and can increase data volume. Furthermore, 3D models often consist of multiple files, leading to a large total file count. As data grows, managing these files becomes increasingly difficult.

Based on the storage characteristics discussed above, when we chose a storage platform solution, we expected it to meet the following requirements:

Diverse data formats and cross-node sharing: Different models use different data formats, and 3D models in particular bring format complexity and cross-platform compatibility issues. The storage system must support multiple formats and effectively manage data sharing across nodes and platforms.
Handling data models of different sizes: Whether it's small files for language models, large-scale image/video data, or large files for 3D models, the storage system must be highly scalable to meet rapidly growing storage demands and handle the storage and access of large-size data efficiently.
Challenges of cross-cloud and cluster storage: As data volume increases, especially with petabyte-level storage needs for 3D models, cross-cloud and cluster storage issues become more prominent. The storage system must support seamless cross-region, cross-cloud data access and efficient cluster management.
Easy scaling: The need for scaling is constant, whether for language, image/video, or 3D models, and is particularly high for 3D model storage and processing.
Simple operations and maintenance: The storage system should provide easy-to-use management interfaces and tools. Especially for 3D model management, operational requirements are higher, making automated management and fault tolerance essential.

Storage solutions: from NFS, GlusterFS, CephFS to JuiceFS

Initial solution: NFS mount

Initially, we tried the simplest solution: using NFS for mounting. However, in practice, we found that the training cluster and rendering cluster required independent clusters for mount operations. Maintaining this setup was very cumbersome, especially when adding new data, because we needed to write mount points separately for each new dataset. When the data volume reached about 1 million objects, we could no longer sustain this approach and abandoned it.

Mid-term solution: GlusterFS

GlusterFS was an easy-to-start-with choice, offering simple installation and configuration, acceptable performance, and no need for multiple mount points—just add new nodes.

While GlusterFS greatly reduced our workload in the early stages, we also discovered issues with its ecosystem:

Many GlusterFS execution scripts and features required writing custom scheduled tasks. Adding new storage also came with extra requirements, such as having to add nodes in specific multiples.
Support for operations like cloning and data synchronization was weak. This led us to frequently consult documentation.
Many operations were unstable. For example, when using tools like fio for speed testing, results were not always reliable.
A more serious problem was that GlusterFS performance would drastically decline when the number of small files reached a certain scale. For example, one model might generate 100 images. With 10 million models, that would produce 1 billion images. GlusterFS struggled severely with addressing in later stages, especially with an excessive number of small files. This led to significant performance drops and even system crashes.
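The file-count arithmetic in the last point can be sketched in a few lines of Python. The function name and the optional per-model extra-file parameter are illustrative assumptions, not part of any tool mentioned here.

```python
def estimate_file_count(models: int, images_per_model: int,
                        extra_files_per_model: int = 0) -> int:
    """Estimate how many small files a rendering pipeline produces.

    extra_files_per_model is a hypothetical knob for labels, sampled
    points, and other per-model files; it defaults to zero.
    """
    return models * (images_per_model + extra_files_per_model)


# Figures from the text: 10 million models, 100 rendered images each.
total = estimate_file_count(10_000_000, 100)
print(f"{total:,} files")  # prints "1,000,000,000 files"
```

Even a handful of extra files per model adds tens of millions of entries at this scale, which is why small-file metadata handling dominated the evaluation.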

Final selection: CephFS vs. JuiceFS

As storage demands grew, we decided to use a more sustainable solution. After evaluating various options, we compared CephFS and JuiceFS.

Although Ceph is widely used, through our own practice and reviewing documentation, we found Ceph's operational and management costs to be very high. Especially for a small team like ours, handling such complex operational tasks proved particularly difficult.

JuiceFS had two native features that strongly aligned with our needs:

The client data cache. For our model training clusters, which are typically equipped with high-performance NVMe storage, fully utilizing client caching could significantly accelerate model training and reduce pressure on the JuiceFS storage backend.
JuiceFS' S3 compatibility was crucial for us. As we had developed some visualization platforms based on storage for data annotation, organization, and statistics, S3 compatibility allowed us to rapidly develop web interfaces supporting visualization, data statistics, and other features.
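As a minimal sketch of how the client cache is enabled, the Python snippet below assembles a `juicefs mount` command line that uses the `--cache-dir` and `--cache-size` mount options. The TiKV address, mount point, cache path, and cache size are placeholder assumptions, not values from this deployment, and the helper function is hypothetical.

```python
import shlex


def build_mount_command(meta_url: str, mount_point: str,
                        cache_dir: str, cache_size_mib: int) -> list[str]:
    """Build the argv for mounting JuiceFS with a local client cache.

    --cache-dir points at local (for example NVMe) storage and
    --cache-size is given in MiB. All values passed in are placeholders.
    """
    return [
        "juicefs", "mount",
        "--background",
        "--cache-dir", cache_dir,
        "--cache-size", str(cache_size_mib),
        meta_url,
        mount_point,
    ]


# Hypothetical values: a TiKV PD endpoint and a 1 TiB NVMe cache.
cmd = build_mount_command(
    meta_url="tikv://pd-0.example:2379/jfs",
    mount_point="/mnt/jfs",
    cache_dir="/nvme/jfscache",
    cache_size_mib=1024 * 1024,
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a node where the `juicefs` client is installed.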

The table below compares basic features of CephFS and JuiceFS:

Storage platform practice based on JuiceFS

Metadata engine selection and topology

JuiceFS employs a metadata-data separation architecture with several metadata engine options. We first quickly validated the Redis storage solution, which is well-documented by the JuiceFS team. Redis' advantage lies in its lightweight nature; configuration typically takes only a day or half a day, and data migration is smooth. However, when the number of small files exceeded 100 million, Redis' speed and performance significantly declined.

As mentioned earlier, each model might render 100 images. With other miscellaneous files, the number of small files increased dramatically. While we could mitigate the issue by packing small files, performing modifications or visualization on packed data greatly increased complexity. Therefore, we preferred to retain the original small image files for subsequent processing.

As the file count grew and soon exceeded Redis' capacity, we decided to migrate the storage system to a combination of TiKV and Kubernetes (K8s). The TiKV-K8s setup provided us with a more highly available metadata storage solution. Furthermore, through benchmarking, we found that although TiKV's performance was slightly lower, the gap was not significant, and its support for small files was better than Redis'. We also consulted JuiceFS engineers and learned that Redis has poor scalability in cluster mode. Therefore, we switched to TiKV.

The table below shows read/write performance test results for different metadata engines:

Latest architecture: JuiceFS+TiKV+SeaweedFS

We use JuiceFS to manage the object storage layer. We built the metadata storage system with TiKV and K8s, and we use SeaweedFS for object storage. This allows us to quickly scale storage capacity and provides fast access for both small and large files. In addition, our object storage is distributed across multiple platforms, including local storage and platforms like R2 and Amazon S3. Through JuiceFS, we were able to integrate these different storage systems and provide a unified interface.

To better manage system resources, we built a resource monitoring platform on K8s. The current system consists of about 60 Linux nodes and several Windows nodes handling rendering and data processing tasks. We monitored read stability, and the results show that even with multiple heterogeneous servers performing simultaneous read operations, the overall system I/O performance remains stable, able to fully utilize the bandwidth resources.

Problems we encountered

During the optimization of the storage solution, we initially tried an erasure coding (EC) storage scheme aimed at reducing storage requirements and improving efficiency. However, in large-scale data migration, EC computation was slow, and its performance was unsatisfactory in scenarios with high throughput and frequent data changes. The bottlenecks were especially evident when EC was combined with SeaweedFS. Based on these issues, we decided to abandon EC storage and switch to a replication-based storage scheme.

We set up independent servers and configured scheduled tasks for large-volume metadata backups. In TiKV, we implemented a redundant replica mechanism, adopting a multi-replica scheme to ensure data integrity. For object storage, we used dual-replica encoding to further enhance data reliability. Although replica storage effectively ensures data redundancy and high availability, storage costs remain high due to processing petabyte-scale data and massive incremental data. In the future, we may consider further optimizing the storage scheme to reduce costs.
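To make the cost trade-off between replication and erasure coding concrete, here is a small Python sketch of the raw-capacity overhead of each scheme. The 10+4 shard layout and the 1 PB figure are assumptions chosen only for illustration; the EC parameters actually tested are not stated here.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte with full replicas."""
    return float(replicas)


def erasure_coding_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per logical byte with a (data + parity) EC layout."""
    return (data_shards + parity_shards) / data_shards


logical_pb = 1.0  # hypothetical logical data size in PB

# Dual replicas, as described in the text.
print(f"2 replicas: {logical_pb * replication_overhead(2):.1f} PB raw")
# Hypothetical 10 data + 4 parity layout for comparison.
print(f"EC 10+4:    {logical_pb * erasure_coding_overhead(10, 4):.1f} PB raw")
```

Under these assumptions, dual replication needs 2.0 PB of raw capacity per logical petabyte versus 1.4 PB for the EC layout, which is the cost that replication trades for simpler and faster writes.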

In addition, we found that using all-flash servers with JuiceFS did not bring significant performance improvements. The bottleneck mainly appeared in network bandwidth and latency. Therefore, we plan to consider using InfiniBand to connect storage servers and training servers to maximize resource utilization efficiency.

Summary

When using GlusterFS, we could process at most 200,000 models per day. After switching to JuiceFS, the processing capacity increased significantly. Our daily data processing capacity has grown by 2.5 times. Small file throughput also improved notably. The system remained stable even when storage utilization reached 70%. Furthermore, scaling became very convenient, whereas the previous architecture involved troublesome scaling processes.

Finally, let's summarize the advantages JuiceFS has demonstrated in 3D generation tasks:

Small file performance: Small file handling is a critical point, and JuiceFS provides an excellent solution.
Cross-platform features: Cross-platform support is very important. We found that some data can only be opened in Windows software, so we need to process the same data on both Windows and Linux systems and perform read/write operations on the same mount point. This requirement makes cross-platform features particularly crucial, and JuiceFS' design addresses this well.
Low operational cost: JuiceFS' operational cost is extremely low. After configuration, only simple testing and node management (for example, discarding certain nodes and monitoring robustness) are needed. We spent about half a year migrating data and have not encountered major issues so far.
Local cache mechanism: Previously, to use local cache, we needed to manually implement local caching logic in our code. JuiceFS provides a very convenient local caching mechanism, optimizing performance for training scenarios by setting mount parameters.
Low migration cost: Especially when migrating small files, we found using JuiceFS for metadata and object storage migration to be convenient, saving us a lot of time and effort. In contrast, migrating with other storage systems was very painful.

In summary, JuiceFS performs excellently in large-scale data processing, providing an efficient and stable storage solution. It not only simplifies storage management and scaling but also significantly improves system performance. This allows us to focus more on advancing core tasks. In addition, the JuiceFS tools are very convenient. For example, we used the sync tool for small file migration with extremely high efficiency. Without additional performance optimization, we successfully migrated 500 TB of data, including a massive number of small data and image files. It was done in less than 5 days, exceeding our expectations.
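For a sense of scale, the migration figures above imply the following average throughput. This is a back-of-the-envelope Python sketch that assumes decimal units (1 TB = 1,000 GB) and a full 5-day window; since the migration took less than 5 days, the real average was somewhat higher.

```python
def average_throughput_gb_per_s(total_tb: float, days: float) -> float:
    """Average sustained throughput in GB/s, using decimal units."""
    seconds = days * 24 * 60 * 60
    return total_tb * 1000 / seconds


rate = average_throughput_gb_per_s(total_tb=500, days=5)
print(f"{rate:.2f} GB/s")  # prints "1.16 GB/s"
```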

If you have any questions about this article, feel free to join the JuiceFS discussions on GitHub and the community on Slack.