<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sergey Platonov</title>
    <description>The latest articles on DEV Community by Sergey Platonov (@pltnvs).</description>
    <link>https://dev.to/pltnvs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1730592%2Fa940c3cd-3341-4a82-9c79-7d9d29871176.jpg</url>
      <title>DEV Community: Sergey Platonov</title>
      <link>https://dev.to/pltnvs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pltnvs"/>
    <language>en</language>
    <item>
      <title>The Definitive Guide to mdraid, mdadm, and Linux Software RAID</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Tue, 29 Jul 2025 07:25:40 +0000</pubDate>
      <link>https://dev.to/pltnvs/the-definitive-guide-to-mdraid-mdadm-and-linux-software-raid-57aa</link>
      <guid>https://dev.to/pltnvs/the-definitive-guide-to-mdraid-mdadm-and-linux-software-raid-57aa</guid>
      <description>&lt;h2&gt;
  
  
  What Is mdraid? Core Concepts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;mdraid&lt;/strong&gt; (often shortened to &lt;em&gt;MD RAID&lt;/em&gt; or simply &lt;em&gt;md&lt;/em&gt;) is the Linux kernel’s built‑in software RAID framework. It aggregates multiple block devices (drives, partitions, loopbacks, NVMe namespaces, etc.) into a single logical block device (&lt;code&gt;/dev/mdX&lt;/code&gt;) that can deliver improved performance, redundancy, aggregated capacity, or some combination of these, depending on the selected RAID level.&lt;/p&gt;

&lt;p&gt;At its heart, mdraid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lives primarily in the Linux kernel as the md (Multiple Device) driver.&lt;/li&gt;
&lt;li&gt;Uses on‑disk superblock metadata to describe array membership and layout.&lt;/li&gt;
&lt;li&gt;Exposes assembled arrays as standard block devices that can be partitioned, formatted, LVM’ed, encrypted, or used directly by applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: &lt;strong&gt;mdraid is the engine; mdadm is the toolkit and dashboard.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How mdraid Fits in the Linux Storage Stack
&lt;/h2&gt;

&lt;p&gt;A simplified vertical stack (bottom → top):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Physical / Virtual Drives]
  SATA / SAS / NVMe / iSCSI LUNs / VMDKs / Cloud Block Vols
        ↓
[mdraid Kernel Layer]  ← assembled via mdadm
  RAID0 | RAID1 | RAID4/5/6 | RAID10 | RAID1E | linear | multipath | etc.
        ↓
[Optional Layers]
  LUKS dm-crypt  |  LVM PV/VG/LV (incl. LVM-thin)  |  Filesystems (ext4, XFS, btrfs, ZFS-on-Linux as zvol consumer, etc.)
        ↓
[Applications / Containers / VMs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because md devices appear as regular block devices, you can build rich storage stacks: encrypt then RAID, RAID then encrypt, layer LVM for flexible volume management, or present md devices to virtualization hosts.&lt;/p&gt;
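&lt;p&gt;A minimal sketch of the “RAID then encrypt” pattern (device names are illustrative; all commands require root and will destroy data on the listed devices):&lt;/p&gt;

```shell
# Mirror two partitions, then layer LUKS and a filesystem on the md device
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
sudo cryptsetup luksFormat /dev/md0
sudo cryptsetup open /dev/md0 secure_md      # exposes /dev/mapper/secure_md
sudo mkfs.ext4 /dev/mapper/secure_md
sudo mount /dev/mapper/secure_md /mnt/secure
```

&lt;p&gt;Reversing the order (&lt;code&gt;cryptsetup&lt;/code&gt; on each member, then md on top of the mappings) encrypts each disk independently, at the cost of managing one key slot per member.&lt;/p&gt;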

&lt;h2&gt;
  
  
  mdadm vs mdraid: Terminology Clarified
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Where You See It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;md&lt;/td&gt;
&lt;td&gt;Kernel driver&lt;/td&gt;
&lt;td&gt;Implements software RAID logic.&lt;/td&gt;
&lt;td&gt;/proc/mdstat, /sys/block/md*, kernel logs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mdraid&lt;/td&gt;
&lt;td&gt;Informal name&lt;/td&gt;
&lt;td&gt;Linux md software RAID subsystem.&lt;/td&gt;
&lt;td&gt;Docs &amp;amp; articles: “Linux mdraid,” etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mdadm&lt;/td&gt;
&lt;td&gt;User-space tool&lt;/td&gt;
&lt;td&gt;Create, assemble, grow, monitor, fail, remove arrays; generate config.&lt;/td&gt;
&lt;td&gt;CLI: &lt;code&gt;mdadm --create /dev/md0 …&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/etc/mdadm.conf&lt;/td&gt;
&lt;td&gt;Config file&lt;/td&gt;
&lt;td&gt;Records ARRAY definitions &amp;amp; metadata defaults; persists arrays across boots.&lt;/td&gt;
&lt;td&gt;At boot/assembly time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember: You manage arrays with &lt;strong&gt;mdadm&lt;/strong&gt;; the arrays themselves are provided by the &lt;strong&gt;mdraid&lt;/strong&gt; kernel layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Supported RAID Levels &amp;amp; Modes
&lt;/h2&gt;

&lt;p&gt;mdraid supports a broad set of personalities (RAID implementations). Availability may vary slightly by kernel version, but common ones include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;linear – Concatenate devices end‑to‑end; no redundancy.&lt;/li&gt;
&lt;li&gt;RAID 0 (striping) – Performance &amp;amp; aggregate capacity; zero redundancy.&lt;/li&gt;
&lt;li&gt;RAID 1 (mirroring) – Redundancy via copies; read parallelism.&lt;/li&gt;
&lt;li&gt;RAID 4 – Dedicated parity disk; rarely used.&lt;/li&gt;
&lt;li&gt;RAID 5 – Distributed parity across members; 1‑disk fault tolerance.&lt;/li&gt;
&lt;li&gt;RAID 6 – Dual distributed parity; 2‑disk fault tolerance.&lt;/li&gt;
&lt;li&gt;RAID 10 – Striped mirrors. mdraid implements RAID10 as its own personality with flexible near/far/offset layouts rather than literal nested RAID1+0. Good mix of speed + redundancy.&lt;/li&gt;
&lt;li&gt;RAID 1E / RAID 10 variants – Extended / asymmetric mirror‑stripe layouts for odd numbers of drives.&lt;/li&gt;
&lt;li&gt;Multipath – md can be used (less common today) to provide failover across multiple I/O paths.&lt;/li&gt;
&lt;li&gt;Faulty – Testing personality that injects failures.&lt;/li&gt;
&lt;/ul&gt;
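&lt;p&gt;Usable capacity varies sharply by level. A quick back‑of‑the‑envelope comparison (shell arithmetic; assumes equal‑size members):&lt;/p&gt;

```shell
# Usable capacity for n member disks of S TB each, per RAID level
n=6; S=4                                # example: six 4 TB drives
echo "RAID0:  $(( n * S )) TB"          # striping, no redundancy
echo "RAID1:  $S TB"                    # n-way mirror holds one disk's worth
echo "RAID5:  $(( (n - 1) * S )) TB"    # one disk's worth of parity
echo "RAID6:  $(( (n - 2) * S )) TB"    # two disks' worth of parity
echo "RAID10: $(( n * S / 2 )) TB"      # striped two-way mirrors
```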

&lt;h2&gt;
  
  
  Key Architecture Elements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Superblock Metadata&lt;/strong&gt; – Small headers stored on member devices that describe array UUID, RAID level, layout, chunk size, role of each device, and state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;md Personalities&lt;/strong&gt; – RAID level implementations registered with the kernel’s md subsystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bitmaps&lt;/strong&gt; – Optional on‑disk or in‑memory bitmaps track which stripes are dirty, dramatically shortening resync/recovery after unclean shutdowns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reshape Engine&lt;/strong&gt; – Allows changing array geometry (add/remove devices, change RAID level in some cases) online in many scenarios. Performance impact can be significant; plan maintenance windows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mdmon (External Metadata Monitor)&lt;/strong&gt; – Required for some external metadata formats (e.g., IMSM/Intel Matrix) to keep metadata in sync.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sysfs Controls&lt;/strong&gt; – Tunables under &lt;code&gt;/sys/block/mdX/md/&lt;/code&gt; expose runtime parameters (stripe cache, sync speed limits, write‑mostly flags, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Homelab NAS or DIY storage server using commodity disks.&lt;/li&gt;
&lt;li&gt;Bootable mirrors (RAID1) for root filesystem resiliency.&lt;/li&gt;
&lt;li&gt;RAID10 for database or virtualization workloads needing strong random I/O + redundancy.&lt;/li&gt;
&lt;li&gt;RAID5/6 for bulk capacity with parity protection (backup targets, media libraries—though see risk discussion below on rebuild windows and URE rates).&lt;/li&gt;
&lt;li&gt;Aggregating NVMe drives when hardware RAID isn’t desired or available.&lt;/li&gt;
&lt;li&gt;Cloud or virtual environments where virtual disks need software‑defined redundancy across failure domains.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Planning an mdraid Deployment
&lt;/h2&gt;

&lt;p&gt;Before typing &lt;code&gt;mdadm --create&lt;/code&gt;, answer these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workload Profile&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Random vs sequential; read‑heavy vs write‑heavy; mixed DB + VM vs cold archive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redundancy vs Capacity&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How many failures must you tolerate? RAID1/10 vs RAID5/6 trade‑offs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Media Type&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
HDD, SSD, NVMe, or mixed? Parity RAID on SSD behaves differently (write amplification, TRIM behavior, queue depth scaling).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Growth Expectations&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Will you add drives later? Consider RAID10 or start with larger drive count in RAID6; know reshape implications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Boot Requirements&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If booting from md, choose metadata format (0.90 or 1.0) that places superblock where firmware/bootloader expects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring &amp;amp; Alerting Plan&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Email alerts, systemd timers, &lt;code&gt;mdadm --monitor&lt;/code&gt;, integration with Prometheus/node_exporter, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backup Strategy&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
RAID ≠ backup. Always maintain off‑array copies of critical data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step‑by‑Step: Creating and Managing Arrays with mdadm
&lt;/h2&gt;

&lt;p&gt;Below are high‑signal, copy‑paste‑friendly examples. Adjust device names and sizes to your environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install mdadm
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Debian/Ubuntu:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt install -y mdadm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RHEL/CentOS/Rocky/Alma:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo dnf install -y mdadm   # or yum on older releases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create a RAID1 Mirror (Boot‑Friendly Metadata)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm --create /dev/md0 \
  --level=1 \
  --raid-devices=2 \
  --metadata=1.0 \
  /dev/sda1 /dev/sdb1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;--metadata=1.0&lt;/code&gt; stores the superblock at the end of the device, leaving initial sectors clean for bootloaders.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a RAID5 Array&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm --create /dev/md/data \
  --level=5 \
  --raid-devices=4 \
  --chunk=256K \
  /dev/sd[b-e]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Assemble Existing Arrays (e.g., after reboot)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm --assemble --scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /proc/mdstat
sudo mdadm --detail /dev/md0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mark a Device as Failed &amp;amp; Remove It&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm /dev/md0 --fail /dev/sdb1
sudo mdadm /dev/md0 --remove /dev/sdb1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add Replacement Drive&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm /dev/md0 --add /dev/sdb1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generate /etc/mdadm.conf&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt;&lt;br&gt;
After creating or modifying arrays, regenerate your initramfs if the system boots from md devices so the early boot environment knows how to assemble arrays.&lt;/p&gt;
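&lt;p&gt;The exact command depends on the distribution; a sketch:&lt;/p&gt;

```shell
# Debian/Ubuntu
sudo update-initramfs -u
# RHEL/Fedora family
sudo dracut -f
```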
&lt;h2&gt;
  
  
  Chunk Size, Stripe Geometry &amp;amp; Alignment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Chunk&lt;/strong&gt; (sometimes called “stripe unit”) – The amount of contiguous data written to each member before moving to the next.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Stripe&lt;/strong&gt; – One full pass across all data disks in the array (parity disks are excluded when computing usable stripe width, though parity blocks are part of the physical layout).&lt;/p&gt;
&lt;h3&gt;
  
  
  Why it matters:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Too small a chunk → large sequential I/O is split into many per‑device requests and parity updates, adding IOPS overhead.&lt;/li&gt;
&lt;li&gt;Too large a chunk → small random writes rarely fill a stripe, triggering read‑modify‑write penalties on parity RAID.&lt;/li&gt;
&lt;li&gt;Align chunk to filesystem block and workload I/O sizes to reduce partial‑stripe writes.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Rules of Thumb
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;64K–256K&lt;/strong&gt; chunks common for parity RAID on HDDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Larger (256K–1M+)&lt;/strong&gt; chunks often beneficial for large sequential workflows (backups, media, VM images) and on fast SSD/NVMe backends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark your workload&lt;/strong&gt;: &lt;code&gt;fio&lt;/code&gt; with realistic I/O depth &amp;amp; patterns beats rules of thumb every time.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Alignment Checklist
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Use whole‑disk members when possible (avoid legacy partition offset issues).&lt;/li&gt;
&lt;li&gt;If partitioning, start at &lt;strong&gt;1MiB boundary&lt;/strong&gt; (modern default in &lt;code&gt;parted&lt;/code&gt; &amp;amp; &lt;code&gt;gdisk&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Filesystem &lt;code&gt;mkfs&lt;/code&gt; tools expose geometry options—&lt;code&gt;mkfs.ext4 -E stride=,stripe-width=&lt;/code&gt; and &lt;code&gt;mkfs.xfs -d su=,sw=&lt;/code&gt;—set them!&lt;/li&gt;
&lt;/ol&gt;
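&lt;p&gt;For example, ext4 on a 4‑drive RAID5 with 256K chunks works out as follows (the final &lt;code&gt;mkfs&lt;/code&gt; line is shown commented out because it needs root and your actual device):&lt;/p&gt;

```shell
chunk_kb=256        # md chunk size
data_disks=3        # 4-drive RAID5 = 3 data + 1 parity
fs_block_kb=4       # ext4 default 4 KiB block
stride=$(( chunk_kb / fs_block_kb ))       # filesystem blocks per chunk
stripe_width=$(( stride * data_disks ))    # blocks per full data stripe
echo "stride=$stride stripe_width=$stripe_width"   # prints stride=64 stripe_width=192
# sudo mkfs.ext4 -E stride=$stride,stripe-width=$stripe_width /dev/md/data
```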
&lt;h2&gt;
  
  
  Performance Tuning Tips
&lt;/h2&gt;

&lt;p&gt;Performance depends on workload, media, CPU, and kernel version. Start with measurement, then tune.&lt;/p&gt;


&lt;h3&gt;
  
  
  1. Set Appropriate Chunk Size at Creation
&lt;/h3&gt;

&lt;p&gt;Changing later usually means rebuild. Match to workload I/O size mix.&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Increase Stripe Cache (RAID5/6)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;stripe_cache_size&lt;/code&gt; sysfs tunable can significantly improve parity RAID write throughput by caching partial stripes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;4096 | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/block/md0/md/stripe_cache_size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Tune Sync/Resync/Reshape Speeds
&lt;/h3&gt;

&lt;p&gt;Limit or accelerate background operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min  # KB/s
echo 500000 | sudo tee /proc/sys/dev/raid/speed_limit_max # KB/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Raise during maintenance windows; lower during production peaks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Enable or Place Write‑Mostly Devices
&lt;/h3&gt;

&lt;p&gt;Mark slower members (for example, an HDD mirroring an SSD) as write‑mostly so reads are served by the faster devices:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mdadm --write-mostly /dev/md0 /dev/sdd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Use Bitmaps to Shorten Resync Windows
&lt;/h3&gt;

&lt;p&gt;Create with &lt;code&gt;--bitmap=internal&lt;/code&gt; (or pass a file path such as &lt;code&gt;--bitmap=/path/to/bitmap&lt;/code&gt; to keep the bitmap on a separate device). Faster recovery after power loss.&lt;/p&gt;
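&lt;p&gt;Bitmaps can also be added to or removed from an existing array online:&lt;/p&gt;

```shell
sudo mdadm --grow --bitmap=internal /dev/md0   # add a write-intent bitmap
sudo mdadm --grow --bitmap=none /dev/md0       # remove it again
```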

&lt;h3&gt;
  
  
  6. TRIM / Discard on SSD Arrays
&lt;/h3&gt;

&lt;p&gt;Ensure filesystem supports discard; consider periodic fstrim rather than continuous discard for performance stability.&lt;/p&gt;
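&lt;p&gt;A common setup is a one‑off manual trim plus the weekly systemd timer most distributions ship (mount point illustrative):&lt;/p&gt;

```shell
sudo fstrim -v /mnt/data                    # trim one mounted filesystem now
sudo systemctl enable --now fstrim.timer    # periodic trim of all eligible mounts
```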

&lt;h3&gt;
  
  
  7. NUMA &amp;amp; IRQ Affinity
&lt;/h3&gt;

&lt;p&gt;High‑throughput NVMe + mdraid builds benefit from CPU affinity tuning—pin IRQs, balance queues.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Benchmark Regularly
&lt;/h3&gt;

&lt;p&gt;Use fio, dd (for rough sequential), iostat, perf, and application‑level metrics (DB TPS, VM boot time) to validate.&lt;/p&gt;
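&lt;p&gt;A &lt;code&gt;fio&lt;/code&gt; starting point for random‑read IOPS against an md device—a sketch, not a definitive recipe; point it at a scratch array, since write tests against a device holding data are destructive:&lt;/p&gt;

```shell
sudo fio --name=randread --filename=/dev/md0 --rw=randread \
  --bs=4k --iodepth=32 --numjobs=4 --direct=1 \
  --runtime=60 --time_based --group_reporting
```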

&lt;h2&gt;
  
  
  Monitoring, Alerts &amp;amp; Maintenance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Health Indicators:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Degraded arrays, failed drives, mismatched superblock event counts, bitmap resync in progress, high mismatch counts after &lt;code&gt;check&lt;/code&gt; runs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Enable mdadm Monitor Service
&lt;/h3&gt;

&lt;p&gt;Create a simple systemd unit or use distro packages:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo mdadm --monitor --daemonise --scan --syslog --program=/usr/sbin/mdadm-email.sh&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where your script sends email, Slack, PagerDuty, etc.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Check &lt;code&gt;/proc/mdstat&lt;/code&gt; Regularly
&lt;/h3&gt;

&lt;p&gt;Automate:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;watch -n 2 cat /proc/mdstat&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scheduled Consistency Checks
&lt;/h3&gt;

&lt;p&gt;Many distros schedule a monthly check:&lt;br&gt;&lt;br&gt;
&lt;code&gt;echo check &amp;gt; /sys/block/mdX/md/sync_action&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To repair mismatches (if any are found):&lt;br&gt;&lt;br&gt;
&lt;code&gt;echo repair &amp;gt; /sys/block/mdX/md/sync_action&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  SMART + Predictive Failure
&lt;/h3&gt;

&lt;p&gt;RAID hides but does not prevent media failure. Use &lt;code&gt;smartctl&lt;/code&gt;, &lt;code&gt;nvme-cli&lt;/code&gt;, and vendor tools.&lt;br&gt;&lt;br&gt;
Integrate SMART alerts with md state for proactive failure prediction and diagnostics.&lt;/p&gt;
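&lt;p&gt;Typical read‑only health queries (device names illustrative):&lt;/p&gt;

```shell
sudo smartctl -H -A /dev/sda     # SATA/SAS: overall health plus attribute table
sudo nvme smart-log /dev/nvme0   # NVMe: media errors, spare capacity, wear
```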
&lt;h2&gt;
  
  
  Recovery, Rebuilds &amp;amp; Reshape Operations
&lt;/h2&gt;

&lt;p&gt;When a member fails:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify&lt;/strong&gt;: Use &lt;code&gt;mdadm --detail&lt;/code&gt; and check system logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail the Device&lt;/strong&gt;: &lt;code&gt;mdadm /dev/mdX --fail /dev/sdY&lt;/code&gt; (if not auto‑failed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove&lt;/strong&gt;: &lt;code&gt;mdadm /dev/mdX --remove /dev/sdY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replace Hardware / Repartition Replacement Drive.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt;: &lt;code&gt;mdadm /dev/mdX --add /dev/sdZ&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Rebuild&lt;/strong&gt;: &lt;code&gt;watch /proc/mdstat&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Rebuild Performance Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Large modern disks = long rebuild windows; the risk of a second failure during rebuild rises.&lt;/li&gt;
&lt;li&gt;Use bitmaps to reduce resync scope after unclean shutdowns.&lt;/li&gt;
&lt;li&gt;Raise &lt;code&gt;speed_limit_min&lt;/code&gt; during maintenance to shorten exposure.&lt;/li&gt;
&lt;li&gt;On SSD/NVMe parity RAID, controller queue and CPU bottlenecks matter.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Reshaping (e.g., RAID5 → RAID6, add disks)
&lt;/h3&gt;

&lt;p&gt;Reshape is I/O heavy and lengthy. Always back up.&lt;br&gt;&lt;br&gt;
Expect degraded performance; schedule during off‑peak hours.&lt;/p&gt;
&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;: Layer &lt;code&gt;LUKS&lt;/code&gt;/&lt;code&gt;dm-crypt&lt;/code&gt; above mdraid (common) or below (each member encrypted) depending on your threat model.
Above is simpler; below preserves per‑disk confidentiality when disks are moved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure Erase Before Reuse&lt;/strong&gt;: Superblocks persist. Wipe old metadata using
&lt;code&gt;mdadm --zero-superblock /dev/sdX&lt;/code&gt; before reassigning drives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Control&lt;/strong&gt;: md devices are block devices. Secure them with proper permissions and enable audit logging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firmware Trust&lt;/strong&gt;: When mixing vendor drives, ensure no malicious firmware modification.
Supply chain trust is critical in sensitive environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Alternatives to mdraid
&lt;/h2&gt;

&lt;p&gt;Linux admins have choices. Here’s how &lt;strong&gt;mdraid&lt;/strong&gt; stacks up against popular alternatives.&lt;/p&gt;
&lt;h3&gt;
  
  
  xiRAID: High‑Performance Software RAID
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;xiRAID&lt;/strong&gt; (from Xinnor) is a modern, high‑performance software RAID engine engineered for today’s multi‑core CPUs and fast SSD/NVMe media.&lt;br&gt;&lt;br&gt;
Across multiple publicly reported benchmarks and partner tests, xiRAID has consistently shown higher performance than traditional mdraid—often substantially so under heavy write, mixed database, virtualization, or degraded/rebuild conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reported Advantages&lt;/strong&gt; (varies by version, platform, and workload):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher IOPS and throughput vs mdraid on NVMe and SSD arrays.&lt;/li&gt;
&lt;li&gt;Dramatically faster degraded‑mode performance (a common mdraid pain point).&lt;/li&gt;
&lt;li&gt;Faster rebuild times, reducing risk windows.&lt;/li&gt;
&lt;li&gt;Lower CPU overhead in some configurations; better multi‑core scaling.&lt;/li&gt;
&lt;li&gt;Optimizations for large‑capacity SSD/QLC media and write amplification reduction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consider xiRAID When:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You run database, virtualization, analytics, or AI workloads on dense flash/NVMe.&lt;/li&gt;
&lt;li&gt;Rebuild speed and minimal performance drop during failure are business‑critical.&lt;/li&gt;
&lt;li&gt;You need to squeeze maximum performance from software RAID without dedicated hardware controllers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commercial licensing (free tiers often limited by drive count—check current program).&lt;/li&gt;
&lt;li&gt;Kernel module / driver stack is separate from mdraid; evaluate distro compatibility.&lt;/li&gt;
&lt;li&gt;Migration from mdraid typically requires data evacuation + recreate.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Hardware RAID Controllers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Battery‑backed cache (BBU) accelerates writes.&lt;/li&gt;
&lt;li&gt;Controller offloads parity computation.&lt;/li&gt;
&lt;li&gt;Integrated management and vendor support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proprietary metadata; controller lock‑in.&lt;/li&gt;
&lt;li&gt;Rebuilds tied to card health.&lt;/li&gt;
&lt;li&gt;Limited transparency compared to &lt;code&gt;mdadm&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Potential single point of failure without identical spare controller.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where It Fits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legacy datacenters.&lt;/li&gt;
&lt;li&gt;Environments requiring OS‑agnostic boot processes.&lt;/li&gt;
&lt;li&gt;Workflows with existing tooling and operational dependency on vendor RAID cards.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  LVM RAID (device‑mapper RAID)
&lt;/h3&gt;

&lt;p&gt;Logical Volume Manager (LVM2) can create RAID volumes using the &lt;code&gt;dm-raid&lt;/code&gt; device‑mapper target. Under the hood it reuses the kernel’s md personalities but integrates tightly with LVM volume groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use When:&lt;/strong&gt; You want LVM flexibility (snapshots, thin provisioning) and RAID protection in one stack layer. Modern distros make this increasingly attractive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; Tooling and recovery flows differ from raw &lt;code&gt;mdadm&lt;/code&gt;; mixing both can confuse newcomers.&lt;/p&gt;
&lt;h3&gt;
  
  
  Btrfs Native RAID Profiles
&lt;/h3&gt;

&lt;p&gt;Btrfs can do data/metadata replication (RAID1/10) and parity (RAID5/6—still flagged as having historical reliability caveats; check current kernel status). Provides end‑to‑end checksums, transparent compression, snapshots, send/receive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Great For:&lt;/strong&gt; Self‑healing replicated storage where checksums matter more than raw parity write speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch Out:&lt;/strong&gt; Historically unstable RAID5/6; always confirm current kernel guidance before production use.&lt;/p&gt;
&lt;h3&gt;
  
  
  ZFS RAID‑Z &amp;amp; Mirrors
&lt;/h3&gt;

&lt;p&gt;ZFS provides integrated RAID (mirrors, RAID‑Z1/2/3), checksums, compression, snapshots, send/receive replication, and robust scrubbing. Excellent data integrity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Bit‑rot detection, self‑healing, scalable pools, advanced caching (ARC/L2ARC, ZIL/SLOG).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade‑Offs:&lt;/strong&gt; RAM hungry; license separation (CDDL) from Linux kernel means out‑of‑tree module; tuning required for small RAM systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  FAQ: mdraid, mdadm &amp;amp; Linux RAID Troubleshooting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q: Is mdraid stable for production?&lt;/strong&gt;&lt;br&gt;
Yes—mdraid has powered Linux servers for decades. It’s widely trusted in enterprise, hosting, and cloud images. Stability depends more on hardware quality, monitoring, and admin discipline than on the md layer itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I expand a RAID5 array by adding a drive?&lt;/strong&gt;&lt;br&gt;
Yes, &lt;code&gt;mdadm&lt;/code&gt; supports growing RAID5/6 arrays. The reshape can take a long time and is I/O heavy; always back up first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Should I use RAID5 with large (&amp;gt;14TB) disks?&lt;/strong&gt;&lt;br&gt;
Consider RAID6 or RAID10 instead. Rebuild times and URE risk make single‑parity arrays risky at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What metadata version should I pick for a boot array?&lt;/strong&gt;&lt;br&gt;
Use &lt;code&gt;1.0&lt;/code&gt; (or &lt;code&gt;0.90&lt;/code&gt; on very old systems) so bootloaders that expect clean starting sectors can function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I know if my array is healthy?&lt;/strong&gt;&lt;br&gt;
Check &lt;code&gt;/proc/mdstat&lt;/code&gt;, &lt;code&gt;mdadm --detail&lt;/code&gt;, and configure &lt;code&gt;mdadm --monitor&lt;/code&gt; email alerts. Also monitor SMART.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I mix SSD and HDD in one md array?&lt;/strong&gt;&lt;br&gt;
Technically yes, but performance drops to the slowest members. Better: separate tiers, or mark slower disks write‑mostly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I safely remove old RAID metadata from a reused disk?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mdadm --zero-superblock /dev/sdX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Q: What’s faster: mdraid or xiRAID?&lt;/strong&gt;&lt;br&gt;
In multiple published benchmarks on flash/NVMe workloads, xiRAID has outperformed mdraid—sometimes by large margins—especially under degraded or rebuild conditions. Always benchmark with your own hardware and workload mix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Glossary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Array&lt;/strong&gt; – A logical grouping of drives presented as one block device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bitmap&lt;/strong&gt; – A map of dirty stripes that speeds resync after unclean shutdowns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk (Stripe Unit)&lt;/strong&gt; – Data segment written to one disk before moving to the next in a stripe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degraded&lt;/strong&gt; – Array running with one or more failed/missing members but still serving I/O.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hot Spare&lt;/strong&gt; – Idle member device that automatically rebuilds into an array upon failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mdadm&lt;/strong&gt; – User‑space tool for managing mdraid arrays.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata / Superblock&lt;/strong&gt; – On‑disk record of array identity and layout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parity&lt;/strong&gt; – Calculated redundancy data enabling reconstruction of lost blocks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resync / Rebuild&lt;/strong&gt; – Process of restoring redundancy after failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reshape&lt;/strong&gt; – Changing array geometry (size, level, layout) in place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;mdraid&lt;/strong&gt; remains a foundation technology for Linux storage: robust, flexible, battle‑tested, and free. For many workloads—especially HDD‑based capacity pools—it’s the obvious default. But the storage landscape has changed. With NVMe density, flash wear patterns, extreme rebuild windows, and ever‑higher performance expectations, specialized engines can deliver meaningful gains in throughput, latency consistency, degraded‑mode resilience, and rebuild speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The smart move:&lt;/strong&gt;&lt;br&gt;
Deploy mdraid where it fits, benchmark xiRAID (or other alternatives) where performance matters, and always design around &lt;strong&gt;data protection, observability, and recoverability&lt;/strong&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Performance Guide Pt. 3: Setting Up and Testing RAID</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Wed, 02 Apr 2025 06:05:54 +0000</pubDate>
      <link>https://dev.to/pltnvs/performance-guide-pt-3-setting-up-and-testing-raid-38df</link>
      <guid>https://dev.to/pltnvs/performance-guide-pt-3-setting-up-and-testing-raid-38df</guid>
      <description>&lt;p&gt;This is the third and the final part of the Performance Guide. The first two parts were about performance characteristics, the ways to measure it and optimal hardware and software configurations. Now, in this concluding section, we'll be focusing entirely on RAID. We'll cover topics like choosing the right RAID geometry, configuring NUMA nodes, and conducting RAID testing. Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  xiRAID 4.0.x
&lt;/h2&gt;

&lt;h3&gt;
  
  
  RAID geometry selection
&lt;/h3&gt;

&lt;p&gt;One of the key factors in achieving optimal performance is selecting the correct RAID geometry.&lt;/p&gt;

&lt;p&gt;The following parameters should be considered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The expected workload.&lt;/li&gt;
&lt;li&gt;The characteristics of the drives.&lt;/li&gt;
&lt;li&gt;The expected functionality of RAID in cases of failure and recovery.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Recommendations on selecting RAID geometry
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you deal with a large number of random workload threads, we recommend choosing RAID 5/50 configuration (RAID 50 must have a minimum of 8 drives).&lt;/li&gt;
&lt;li&gt;For random workloads, it is recommended to use a strip size of 16K.&lt;/li&gt;
&lt;li&gt;For sequential workloads, it is recommended to use a strip size larger than 16K (such as 32K, 64K, or 128K). However, the strip size should be selected in a way that ensures the RAID stripe size is 1MB or less. If the stripe size exceeds 1MB, merge functions will not work (the merge function is explained below in this document).&lt;/li&gt;
&lt;li&gt;If you deal with a large number of drives and a relatively low number of writes, you can consider using RAID 6 configuration.&lt;/li&gt;
&lt;li&gt;Different models of drives can exhibit their best write performance at different block sizes. For some drives, maximum performance is achieved with a 128K block; for others it may be 64K or 32K. This can only be determined by experiment. The block size that yields the best write performance should be used as the RAID strip size for sequential write workloads.&lt;/li&gt;
&lt;li&gt;For sequential write workloads, it is important to find a balance between the write block size, the characteristics of the drives, and their number. If writes primarily use a consistent and sufficiently large block, it is crucial to ensure that this block size matches the width of the RAID stripe.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ideal scenario with sequential workload would involve receiving 1MB requests for writing, while the drives show their maximum performance when handling a 128K block. To achieve maximum performance for the drives, a RAID 5 configuration with 9 drives (8+1) can be created.&lt;/p&gt;
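&lt;p&gt;The arithmetic behind this example can be sketched in shell. A minimal sketch, assuming the 8+1 geometry above; the strip size, drive count, and write block size are illustrative values, not requirements:&lt;/p&gt;

```shell
# Check whether a write block size is a multiple of the RAID stripe
# width (strip size x number of data drives). Values follow the
# 8+1 example: 128K strip, 8 data drives, 1MB write block.
strip_kib=128
data_drives=8
write_kib=1024

stripe_width_kib=$((strip_kib * data_drives))
echo "stripe width: ${stripe_width_kib}K"

if [ $((write_kib % stripe_width_kib)) -eq 0 ]; then
    echo "full-stripe writes: no Read-Modify-Write expected"
else
    echo "partial-stripe writes: Read-Modify-Write expected"
fi
```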

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4aa0qs05f9gqbs9xi7xe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4aa0qs05f9gqbs9xi7xe.png" alt="img1" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing with a 1MB block to a RAID 7+1 with a strip size of 64K. The width of the stripe is 448K, which is not a multiple of the write block size. Note the presence of read operations from the drives caused by Read-Modify-Write operations, and their volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbfpb6d4ai832vlx182u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbfpb6d4ai832vlx182u.png" alt="img2" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing with a 1MB block to a RAID 8+1. The strip size here is 64K, and the width of the stripe is 64K x 8 = 512K, which is a multiple of the write block size. This means that there are no Read-Modify-Write operations, as all data on the RAID is written in full stripes (see the Merge write section below). Therefore, there are no read operations from the drives and the final write performance is significantly better.&lt;/p&gt;

&lt;h3&gt;
  
  
  ccNUMA
&lt;/h3&gt;

&lt;p&gt;The location of arrays on the necessary NUMA nodes is crucial as NUMA can have a significant impact on system performance. xiRAID aims to process IO requests on the same cores that initiated them. However, problems can arise if the drives are on a different NUMA node.&lt;/p&gt;

&lt;p&gt;Therefore,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;try creating arrays of drives on a single NUMA node;&lt;/li&gt;
&lt;li&gt;try running the workload on that same NUMA node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To determine the NUMA node to which the device is connected, you can use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /sys/class/nvme/nvmeXnY/device/numa_node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To determine the connection of all devices, including NVMe, you can use the lstopo command (mentioned above).&lt;/p&gt;
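&lt;p&gt;To survey every NVMe controller at once, the sysfs lookup above can be wrapped in a small loop. A minimal sketch; the function name is illustrative, and the optional base-directory argument exists only to make the helper testable:&lt;/p&gt;

```shell
# List the NUMA node of every NVMe controller via sysfs.
# The sysfs path matches the cat command shown above.
list_nvme_numa() {
    base="${1:-/sys/class/nvme}"
    for dev in "$base"/nvme*; do
        # skip non-matching globs and controllers without a numa_node entry
        [ -e "$dev/device/numa_node" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/device/numa_node")"
    done
}

list_nvme_numa
```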

&lt;p&gt;If you want to run fio or other applications on a specific NUMA node, you can use the taskset command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;taskset -c `cat /sys/devices/system/node/node1/cpulist` fio &amp;lt;fio_config_file.cfg&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following fio settings can be used for multiple arrays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;--numa_cpu_nodes=0: This option specifies the NUMA node(s) to be used for CPU affinity. In this case, NUMA node 0 is specified. You can adjust this value to the desired NUMA node(s) or a comma-separated list of multiple NUMA nodes.&lt;/li&gt;
&lt;li&gt;--numa_mem_policy=local: This option sets the memory allocation policy to "local," which means that memory allocation for the FIO process will prefer the local NUMA node(s) specified by numa_cpu_nodes.&lt;/li&gt;
&lt;/ul&gt;
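&lt;p&gt;A hypothetical fio invocation combining these options might look as follows; the device path, job name, and workload parameters are placeholders, not values from this guide:&lt;/p&gt;

```shell
# Hypothetical: 4K random reads pinned to NUMA node 0 with local
# memory allocation. /dev/xi_raidname and all job parameters are
# illustrative placeholders.
fio --name=numa0-randread --filename=/dev/xi_raidname \
    --rw=randread --bs=4k --iodepth=32 --numjobs=8 \
    --direct=1 --ioengine=libaio --time_based --runtime=60 \
    --numa_cpu_nodes=0 --numa_mem_policy=local
```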

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz0demov3zg8f1oz9ac5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz0demov3zg8f1oz9ac5.png" alt="img 3" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All drives, except nvme9c9n1, are connected to a single NUMA node, and a load-generating application is running on the same NUMA node. You can see that the I/O commands coming to the nvme9c9n1 drive, which is connected to another NUMA node, take longer to execute. This results in reduced performance for the entire array.&lt;/p&gt;

&lt;h3&gt;
  
  
  Merge write
&lt;/h3&gt;

&lt;p&gt;Data writing to RAID arrays with parity (such as RAID 5, RAID 6, RAID 7.3, RAID 50, etc.) is performed one stripe at a time. This process can be carried out in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it is necessary to write or overwrite an entire stripe as a whole, checksums are calculated based on the data that will be written in the stripe. After that, both the checksums and the data are written to the drives. This is considered the best and fastest way to write to parity RAID.&lt;/li&gt;
&lt;li&gt;In many cases, it is necessary to write or overwrite only a part of the stripe. In such situations, the Read-Modify-Write (RMW) approach is used. The old data that is supposed to be overwritten and its corresponding checksum fragments (R) are read from the drives. Afterward, new values for the checksum fragments (M) are calculated using the received old data, old checksums, and new data. Finally, the new data and recalculated checksums are written to the drives (W). This approach is significantly slower than the first one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a RAID receives a random workload, the Read-Modify-Write method must be used. However, if a RAID is known to receive a sequential write workload, you can avoid executing each write command separately: instead, wait for subsequent write commands addressed to the same stripe. If you receive enough commands to overwrite the entire stripe (which is highly likely with a sequential workload), you can combine them (Merge) and write the stripe the first way. This avoids performing RMW for each command and can increase performance several times over. To achieve this, use the Merge Wait function in xiRAID.&lt;/p&gt;

&lt;p&gt;It is important to keep in mind that using merge write can have a negative impact on random operations and often leads to increased delays.&lt;/p&gt;

&lt;p&gt;In addition, it is not always possible to merge commands into a full stripe with sequential write. The xiRAID engine ensures that confirmation of write command execution is sent to the applications that initiated them only after the data has been successfully written to the drives. In order for the merge to work, it is necessary for the program to be able to send multiple commands consecutively without waiting for confirmation of data writing from the first command. Moreover, the data from these commands should be sufficient to form a complete stripe.&lt;/p&gt;

&lt;p&gt;For a synthetic workload, this condition can be defined as follows:&lt;/p&gt;

&lt;p&gt;queue depth * IO size must exceed the stripe width.&lt;/p&gt;
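&lt;p&gt;This condition is easy to check numerically. A minimal sketch, with illustrative values for queue depth, IO size, and geometry:&lt;/p&gt;

```shell
# Verify the merge precondition: queue depth x IO size must exceed
# the stripe width. Example values are illustrative.
qd=32
io_kib=1024            # 1MB IOs
strip_kib=64
data_drives=8
stripe_width_kib=$((strip_kib * data_drives))

if [ $((qd * io_kib)) -gt "$stripe_width_kib" ]; then
    echo "enough in-flight data to merge full stripes"
else
    echo "in-flight data too small to fill a stripe"
fi
```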

&lt;p&gt;Merge is activated with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xicli raid modify -n &amp;lt;raid_name&amp;gt; -mwe 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Furthermore, while running the workload, it is necessary to monitor the iostat metrics. The output of the "iostat -xmt 1" command is convenient for this purpose.&lt;/p&gt;

&lt;p&gt;The presence of read operations on drives with a write-only RAID load indicates the execution of read-modify-write operations.&lt;/p&gt;

&lt;p&gt;The next steps will involve increasing the mm and mw parameters until the read operations either disappear completely or their value becomes small.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xicli raid modify -n raidname -mm 2000 -mw 2000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The merge function only works on a RAID with a stripe size of 1MB or less.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5t7o2sg8rdth92myf0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5t7o2sg8rdth92myf0v.png" alt="img4" width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing with a 1MB block to a RAID 7+1 with a strip size of 64K. The width of the stripe is 448K, which is not a multiple of the write block size. The configuration is similar to the one shown in picture 1, with the only difference being that Merge Write is enabled on the RAID. This allows writing to be performed in full stripes, resulting in significantly higher write performance.&lt;/p&gt;

&lt;p&gt;Writing with a 1MB block to a RAID 7+1 with merge enabled&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbphnytou8onkqqmkkeby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbphnytou8onkqqmkkeby.png" alt="img5" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This picture shows the RAID configuration, while the previous picture displays the corresponding workload. Note that the merge_write_enabled parameter is set to 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Merge read settings
&lt;/h3&gt;

&lt;p&gt;Performing a merge for read operations may not seem obvious, but it can significantly enhance the efficiency of RAID in degraded mode. This mode refers to the situation when one or more drives fail in the RAID system, but the array has not gone offline. In this situation, using merge for read operations allows you to reduce the number of commands needed to calculate and restore data from a failed drive (or drives). This, in turn, significantly improves performance.&lt;/p&gt;

&lt;p&gt;The requirement for efficient merge on read operations is similar: queue depth * IO size must exceed the stripe width.&lt;/p&gt;

&lt;p&gt;Merge for sequential read operations in failure mode is configured as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enable merge read: xicli raid modify -n raidname -mre 1.&lt;/li&gt;
&lt;li&gt;In degraded mode, monitor iostat to see whether the merge timeouts need to be increased. If the total number of reads from the drives exceeds the number of reads from the array, increase the mw and mm delays. For example:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xicli raid modify -n raidname -mm 2000 -mw 2000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;In the current versions of xiRAID (4.0.0 and 4.0.1), there is a common threshold value of merge_wait (mw) and merge_max (mm) for Merge Write and Merge Read, which can make it difficult to use them simultaneously. This issue will be addressed and fixed in future versions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Sched setting
&lt;/h3&gt;

&lt;p&gt;xiRAID attempts to process each I/O on the same core where it was initiated by the application or drive. However, this approach results in limited parallelization of processing when dealing with multithreaded loads.&lt;/p&gt;

&lt;p&gt;If you notice that the number of threads in your workload is low and the htop command indicates a high workload on certain cores while others remain idle, it is advisable to enable the sched mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xicli raid modify -n raidname -se 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mode distributes processing tasks across all available cores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclba00iocch0qyw71lxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclba00iocch0qyw71lxd.png" alt="img6" width="800" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf4790hk6o8ijpv5gu50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf4790hk6o8ijpv5gu50.png" alt="img7" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These pictures show the performance of xiRAID RAID when the sched mode is turned off and the loading of RAID engine processor cores. You can see that only 4 cores are being loaded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F304vng5390sect97u7js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F304vng5390sect97u7js.png" alt="img8" width="800" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rkta6a9f5c82v19dr86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rkta6a9f5c82v19dr86.png" alt="img9" width="800" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These two pictures show the performance of this xiRAID RAID under the same load, with the sched enabled. The performance has significantly improved, while also ensuring a more balanced distribution of the load on the processor cores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Init and rebuild priorities
&lt;/h3&gt;

&lt;p&gt;You can manage the service request queue depth, which enables you to allocate system resources between user IO and in-system processes.&lt;/p&gt;

&lt;p&gt;The general scenario is as follows: if you are unsatisfied with performance parameters or experiencing delays when testing performance during array recovery or initialization (although we do not recommend the latter), you have the option to reduce the priority of these operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAID testing
&lt;/h3&gt;

&lt;p&gt;RAID testing is generally similar to drive testing, although there are some differences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow
&lt;/h3&gt;

&lt;p&gt;This section summarizes the main steps discussed in the previous sections.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define RAID performance testing tools and objectives. Use applications such as fio or vdbench for system testing. Do not use the dd utility or desktop applications like CrystalDiskMark.&lt;/li&gt;
&lt;li&gt;Test the drives and backend as described above in the "Drive Performance" section.&lt;/li&gt;
&lt;li&gt;Choose the array geometry that is best suited to your tasks, as described in the "xiRAID 4.0.x" section.&lt;/li&gt;
&lt;li&gt;Determine the expected RAID performance levels based on the tests conducted on the included drives (as described in the "Drive Performance" section above).&lt;/li&gt;
&lt;li&gt;Create the RAID and wait for the initialization process to complete.&lt;/li&gt;
&lt;li&gt;Run the pre-conditioning on the array. Then, run tests on the array and adjust the parameters if necessary.&lt;/li&gt;
&lt;li&gt;Make sure to perform array tests, including during simulated failure (degraded mode) and ongoing reconstruction (reconstruction mode).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are testing in accordance with this document, steps 2-4 should already be completed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-conditioning and typical settings
&lt;/h3&gt;

&lt;p&gt;In order to obtain repeatable performance results, it is necessary to prepare the drives beforehand. As mentioned above, when preparing individual drives, we recommend following the guidelines provided in the “Drive Performance” section:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;before testing sequential workload: overwrite all drives sequentially with a 128K block;&lt;/li&gt;
&lt;li&gt;before testing random workload: overwrite all drives randomly with a 4KB block.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During RAID testing, it is possible to overwrite the drives with a 128K or 4K block, but only before the RAID is assembled. Once the RAID is assembled, writing directly to the drives is not possible, as it would destroy the RAID metadata and the RAID itself. Moreover, during initialization after the RAID is created, all drives are partially overwritten with blocks equal to the strip size.&lt;/p&gt;

&lt;p&gt;This makes it difficult to apply this approach in practice. Thus, we suggest performing pre-conditioning as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;before testing sequential workload, the entire RAID should be overwritten (preferably 1.5 - 2 times) sequentially with a block equal to the size of the RAID strip. Then each drive will be overwritten sequentially, with a block equal to the size of the strip (chunk). This method ensures that the RAID performance reaches a steady state corresponding to the selected chunk size during its creation.&lt;/li&gt;
&lt;li&gt;before testing random workload, it is recommended to initiate a random write workload to a RAID with a 4K block size and a number of threads equal to the number of processor threads in the system. The duration of the writing process should be set to 30-40 minutes.&lt;/li&gt;
&lt;/ul&gt;
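&lt;p&gt;As a sketch, the two pre-conditioning passes could be driven by fio along these lines; the device path and the strip size are placeholders, not values prescribed by this guide:&lt;/p&gt;

```shell
# Hypothetical pre-conditioning runs; /dev/xi_raidname is a
# placeholder device and the strip (chunk) size is assumed 128K.

# Sequential pass: overwrite the array ~2x with strip-sized blocks.
fio --name=seq-precond --filename=/dev/xi_raidname --rw=write \
    --bs=128k --iodepth=32 --direct=1 --ioengine=libaio --loops=2

# Random pass: 4K random writes for 30-40 minutes, with numjobs
# equal to the number of CPU threads in the system.
fio --name=rand-precond --filename=/dev/xi_raidname --rw=randwrite \
    --bs=4k --iodepth=32 --numjobs="$(nproc)" --direct=1 \
    --ioengine=libaio --time_based --runtime=2400
```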

&lt;h3&gt;
  
  
  Performance troubleshooting
&lt;/h3&gt;

&lt;p&gt;If you are not getting satisfactory results, we recommend using the following tools to analyze the problem:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;iostat&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pay attention to the statistics of the drives included in the RAID. If you see a high load on one or more drives, indicated by a growth in command queues, we recommend doing the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the distribution of drives among NUMA nodes.&lt;/li&gt;
&lt;li&gt;Ensure the correctness of the offset setting for multithreaded tests.&lt;/li&gt;
&lt;li&gt;If a drive is consistently slower than others for an extended period of time, with a significantly larger queue of commands, it should be replaced.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;htop&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you notice a high consumption of CPU resources, you should check the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For AMD processors, verify in the BIOS settings that the number of cores (die) on each chiplet matches the number of cores given in the specifications.&lt;/li&gt;
&lt;li&gt;Ensure that all processor memory channels are utilized by memory modules. Install the memory modules according to the motherboard's manual and ensure that they operate at the same frequency.&lt;/li&gt;
&lt;li&gt;Run the "dmesg" command and check its output for memory errors, as well as any other hardware errors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!&lt;/p&gt;

&lt;p&gt;Original article can be found &lt;a href="https://xinnor.io/blog/performance-guide-pt-3-setting-up-and-testing-raid/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>performance</category>
      <category>data</category>
      <category>raid</category>
    </item>
    <item>
      <title>Performance Guide Pt. 2: Hardware and Software Configuration</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Mon, 31 Mar 2025 08:16:56 +0000</pubDate>
      <link>https://dev.to/pltnvs/performance-guide-pt-2-hardware-and-software-configuration-48ad</link>
      <guid>https://dev.to/pltnvs/performance-guide-pt-2-hardware-and-software-configuration-48ad</guid>
      <description>&lt;p&gt;&lt;strong&gt;This is the second part of our Performance Guide blog post series. In the previous part, we've covered the fundamentals of system performance, its basic units and methods for measurement. In this part, we’ll be discussing the optimal hardware configuration and Linux settings and walk you through the process of basic calculation of the expected performance.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;

&lt;p&gt;To identify performance problems, it is necessary to configure the hardware and software before starting the tests. Based on our experience, low performance is often caused by incorrect settings and configurations.&lt;/p&gt;

&lt;p&gt;The overall performance of storage devices will always be limited by the bandwidth of the computer bus that attaches them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drive connection methods and topology
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PCIe&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most modern NVMe storage devices, connected through U.2 or U.3 connectors, utilize up to four PCI Express lanes.&lt;/p&gt;

&lt;p&gt;To achieve optimal performance, it is important to ensure that the correct number of PCIe lanes are used to connect the drives and that there is no overcommit.&lt;/p&gt;

&lt;p&gt;Ensure that the version of the PCIe protocol matches the one used by the drives. You can check the connection using the "lspci -vv" command. Pay attention to the "LnkCap" and "LnkSta" sections: the first shows the link capabilities, while the second shows the actual negotiated link status.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller Cx6 (rev 01) (prog-if 02 [NVM Express])
Subsystem: KIOXIA Corporation Generic NVMe CM6
Physical Slot: 2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast &amp;gt;TAbort- &amp;lt;TAbort- &amp;lt;MAbort- &amp;gt;SERR- &amp;lt;PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 140
NUMA node: 0
IOMMU group: 41
Region 0: Memory at da510000 (64-bit, non-prefetchable) [size=32K]
Expansion ROM at da500000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #1, Speed 16GT/s, Width x4, ASPM not supported
                ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s (ok), Width x4 (ok)
                TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the protocol version or the number of lanes differ from the expected ones, you should check the BIOS settings, as well as the capabilities of the motherboard. The performance of PCIe connections is influenced by both the PCIe version and number of lanes, as indicated in the table provided below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PCIe Generation&lt;/th&gt;
&lt;th&gt;Single Lane Transmission Speed&lt;/th&gt;
&lt;th&gt;x1&lt;/th&gt;
&lt;th&gt;x2&lt;/th&gt;
&lt;th&gt;x4&lt;/th&gt;
&lt;th&gt;x8&lt;/th&gt;
&lt;th&gt;x16&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;8 GT/s&lt;/td&gt;
&lt;td&gt;0.985 GB/s&lt;/td&gt;
&lt;td&gt;1.969 GB/s&lt;/td&gt;
&lt;td&gt;3.938 GB/s&lt;/td&gt;
&lt;td&gt;7.877 GB/s&lt;/td&gt;
&lt;td&gt;15.754 GB/s (126 Gbit/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;16 GT/s&lt;/td&gt;
&lt;td&gt;1.969 GB/s&lt;/td&gt;
&lt;td&gt;3.938 GB/s&lt;/td&gt;
&lt;td&gt;7.877 GB/s&lt;/td&gt;
&lt;td&gt;15.754 GB/s&lt;/td&gt;
&lt;td&gt;31.508 GB/s (252 Gbit/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;32 GT/s&lt;/td&gt;
&lt;td&gt;3.938 GB/s&lt;/td&gt;
&lt;td&gt;7.877 GB/s&lt;/td&gt;
&lt;td&gt;15.754 GB/s&lt;/td&gt;
&lt;td&gt;31.508 GB/s&lt;/td&gt;
&lt;td&gt;63.015 GB/s (504 Gbit/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6.0&lt;/td&gt;
&lt;td&gt;64 GT/s&lt;/td&gt;
&lt;td&gt;7.563 GB/s&lt;/td&gt;
&lt;td&gt;15.125 GB/s&lt;/td&gt;
&lt;td&gt;30.250 GB/s&lt;/td&gt;
&lt;td&gt;60.500 GB/s&lt;/td&gt;
&lt;td&gt;121.000 GB/s (968 Gbit/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
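&lt;p&gt;The per-lane figures in the table follow from the transfer rate and the 128b/130b encoding used by PCIe 3.0 through 5.0. For example, for a Gen4 x4 link (a typical NVMe drive):&lt;/p&gt;

```shell
# Derive PCIe throughput from the transfer rate and 128b/130b
# encoding overhead: 16 GT/s per lane, x4 lanes (Gen4 NVMe).
awk 'BEGIN {
    gts = 16; lanes = 4
    gbs = gts * 128 / 130 / 8 * lanes   # GB/s after encoding overhead
    printf "%.3f GB/s\n", gbs           # prints 7.877 GB/s
}'
```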

&lt;p&gt;&lt;strong&gt;Backplane&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To identify bottlenecks, check the physical connection and review the specifications of the backplane and NVMe expander cards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retimers and SAS-HBA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NVMe drives can be connected to PCIe retimers, OCuLink ports on the motherboard, or tri-mode SAS HBA adapters via a backplane. However, if SAS drives are being used, they must be connected to the SAS HBA. If you are using a SAS HBA, ensure that the controllers are correctly installed and update the SAS HBA controller to the latest firmware version. Note that the performance of your storage system will not exceed the performance of the PCIe interfaces used to connect the drives, whether they are connected directly or indirectly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SAS Generation&lt;/th&gt;
&lt;th&gt;Number of Ports&lt;/th&gt;
&lt;th&gt;Max Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SAS-3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1200 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4800 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SAS-4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2400 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;9600 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ensure that the wide ports and links are correctly configured using the management utilities provided by SAS HBA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drives
&lt;/h2&gt;

&lt;p&gt;For workloads that do not involve small block sizes, we highly recommend reformatting NVMe namespaces with a 4K block size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory
&lt;/h2&gt;

&lt;p&gt;When calculating checksums and during recovery, arrays with parity use RAM for temporary data storage. Therefore, it is critically important to correctly install and configure memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xinnor.io/what-is-xiraid/" rel="noopener noreferrer"&gt;xiRAID&lt;/a&gt; does not require a significant amount of memory. However, it is necessary to use error-correcting code (ECC) memory, populate memory channels symmetrically, and preferably use memory with the highest frequency supported on the platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Failing to populate all memory channels may lead to a performance loss of about 30%-40%.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  CPU
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Core and command set requirements.
xiRAID requires a processor with support for the AVX instruction set, which is available on all modern Intel and AMD x86-64 processors. So, if you want to use xiRAID, your processor needs to have this extension.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve high performance for random workloads, it is necessary to have 1-2 cores for every expected 1 million IOPS. For sequential workloads, make sure you have at least 4 cores for every 20 GB/s.&lt;/p&gt;
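&lt;p&gt;A back-of-the-envelope sizing sketch of this rule, using hypothetical performance targets:&lt;/p&gt;

```shell
# Estimate core counts from the sizing rule above: up to 2 cores
# per 1M random IOPS, 4 cores per 20 GB/s sequential. The targets
# (3M IOPS, 40 GB/s) are illustrative.
iops_millions=3
seq_gbs=40

rand_cores=$((iops_millions * 2))
seq_cores=$(( (seq_gbs + 19) / 20 * 4 ))  # round 20 GB/s units up
echo "random: ${rand_cores} cores, sequential: ${seq_cores} cores"
```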

&lt;h2&gt;
  
  
  BIOS
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;C-States must be turned off.&lt;/li&gt;
&lt;li&gt;PowerManagement. The CPU operation mode needs to be switched to performance (balanced is used by default).&lt;/li&gt;
&lt;li&gt;CPU Topology (AMD only). The number of cores (die) on each chipset must match the number of cores specified. Incorrect configuration can result in a significant loss of performance.&lt;/li&gt;
&lt;li&gt;HT/SMT. We recommend switching on HT/SMT to achieve better performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Linux
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Preliminary checks and settings
&lt;/h3&gt;

&lt;p&gt;The main tool for managing NVMe devices in Linux is the nvme-cli utility (the nvme command).&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to see all the NVMe drives connected to the system, run the "nvme list" command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme7n1          /dev/ng7n1            BTLJ85110C314P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme6n1          /dev/ng6n1            BTLJ85110CDH4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme5n1          /dev/ng5n1            BTLJ85110C1K4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme4n1          /dev/ng4n1            BTLJ85110BU94P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10170
/dev/nvme3n1          /dev/ng3n1            BTLJ85110CG74P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
/dev/nvme2n1          /dev/ng2n1            BTLJ85110C4N4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10170
/dev/nvme1n1          /dev/ng1n1            BTLJ85110C8G4P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10170
/dev/nvme0n1          /dev/ng0n1            BTLJ85110BV34P0DGN   INTEL SSDPE2KX040T8                      1           4.00  TB /   4.00  TB      4 KiB +  0 B   VDV10131
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;to view information about all NVMe subsystems and verify the drive connections, run the nvme list-subsys command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme list-subsys
nvme-subsys7 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C314P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme7 pcie 0000:e6:00.0 live
nvme-subsys6 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110CDH4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme6 pcie 0000:e5:00.0 live
nvme-subsys5 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C1K4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme5 pcie 0000:e4:00.0 live
nvme-subsys4 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110BU94P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme4 pcie 0000:e3:00.0 live
nvme-subsys3 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110CG74P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme3 pcie 0000:9b:00.0 live
nvme-subsys2 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C4N4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme2 pcie 0000:9a:00.0 live
nvme-subsys1 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110C8G4P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme1 pcie 0000:99:00.0 live
nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:80868086BTLJ85110BV34P0DGN  INTEL SSDPE2KX040T8
\
 +- nvme0 pcie 0000:98:00.0 live
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;we recommend checking the sector size and firmware version:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvme id-ctrl /dev/nvmeX -H | grep "LBA Format"
nvme id-ctrl /dev/nvmeX -H | grep "Firmware"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All drives intended for use in one RAID must have the same model, firmware version (preferably the latest one), and sector size.&lt;/p&gt;
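&lt;p&gt;One quick way to verify this is to count the distinct model/firmware combinations in the nvme list output; for a valid set of RAID member drives the count should be exactly 1. The sketch below runs on a captured two-line sample rather than live devices:&lt;/p&gt;

```shell
#!/bin/sh
# Count distinct Model/FW-Rev combinations in `nvme list`-style output.
# All RAID member drives should produce exactly one combination.
sample='/dev/nvme0n1  /dev/ng0n1  SN0  INTEL SSDPE2KX040T8  1  4.00 TB / 4.00 TB  4 KiB + 0 B  VDV10131
/dev/nvme1n1  /dev/ng1n1  SN1  INTEL SSDPE2KX040T8  1  4.00 TB / 4.00 TB  4 KiB + 0 B  VDV10131'

# Field 4-5 hold the model, the last field holds the firmware revision.
distinct=$(printf '%s\n' "$sample" \
  | awk '{ print $4, $5, $NF }' \
  | sort -u | wc -l)
echo "distinct model/firmware combinations: $distinct"
```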

&lt;ul&gt;
&lt;li&gt;run the following command to view the status of SMART drives:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvme smart-log /dev/nvmeX --output-format=json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using an EBOF (Ethernet-attached Bunch of Flash), check the native NVMe multipath parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvme list-subsys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and configure Round Robin if necessary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvme connect -t rdma -a &amp;lt;target_address&amp;gt; -s &amp;lt;subnet&amp;gt; -n &amp;lt;namespace_id&amp;gt; -p rr
echo "rr" &amp;gt; /sys/class/nvme/nvmeX/subsysY/ana_group/default_load_balance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Kernel parameters
&lt;/h2&gt;

&lt;p&gt;To improve performance, you should configure the kernel boot parameters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intel_idle.max_cstate=0: Sets the maximum C-state for Intel processors to 0. C-states are power-saving states for the CPU, and setting it to 0 ensures that the processor does not enter any idle power-saving state.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Applying the kernel parameters presented below will lead to a security risk. While they improve performance, they also reduce system security and Xinnor never uses them for production implementations or public performance tests. Please use these parameters with care and at your own responsibility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;noibrs: Disables the Indirect Branch Restricted Speculation (IBRS) mitigation. IBRS is a feature that protects against certain Spectre vulnerabilities by preventing speculative execution of indirect branches.&lt;/li&gt;
&lt;li&gt;noibpb: Disables the Indirect Branch Predictor Barrier (IBPB) mitigation. IBPB is another Spectre mitigation that helps prevent speculative execution of indirect branches.&lt;/li&gt;
&lt;li&gt;nopti: Disables the Page Table Isolation (PTI) mitigation. PTI is a security feature that isolates kernel and user page tables to mitigate certain variants of the Meltdown vulnerability.&lt;/li&gt;
&lt;li&gt;nospectre_v2: Disables the Spectre Variant 2 mitigation. Spectre Variant 2, also known as Branch Target Injection, is a vulnerability that can allow unauthorized access to sensitive information.&lt;/li&gt;
&lt;li&gt;nospectre_v1: Disables the Spectre Variant 1 mitigation. Spectre Variant 1, also known as Bounds Check Bypass, is another vulnerability that can lead to unauthorized access to data.&lt;/li&gt;
&lt;li&gt;l1tf=off: Turns off the L1 Terminal Fault (L1TF) mitigation. L1TF is a vulnerability that allows unauthorized access to data in the L1 cache.&lt;/li&gt;
&lt;li&gt;nospec_store_bypass_disable: Disables the Speculative Store Bypass (SSB) mitigation. SSB is a vulnerability that can allow unauthorized access to sensitive data stored in the cache.&lt;/li&gt;
&lt;li&gt;no_stf_barrier: Disables the Store Forwarding Barrier mitigation, which protects against speculative store-forwarding attacks on some platforms.&lt;/li&gt;
&lt;li&gt;mds=off: Disables the Microarchitectural Data Sampling (MDS) mitigation. MDS is a vulnerability that can lead to data leakage across various microarchitectural buffers.&lt;/li&gt;
&lt;li&gt;tsx=on: Enables Intel Transactional Synchronization Extensions (TSX). TSX is an Intel feature that provides support for transactional memory, allowing for efficient and concurrent execution of certain code segments.&lt;/li&gt;
&lt;li&gt;tsx_async_abort=off: Disables asynchronous aborts in Intel TSX. Asynchronous aborts are a mechanism that can terminate transactions and undo their effects in certain cases.&lt;/li&gt;
&lt;li&gt;mitigations=off: Turns off all generic mitigations for security vulnerabilities. This parameter disables all known hardware and software mitigations for various vulnerabilities, prioritizing performance over security.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's important to note that disabling or modifying these security mitigations can potentially expose your system to security vulnerabilities. These parameters are typically used for debugging or performance testing purposes and are not recommended for regular usage, especially in production environments.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off intel_idle.max_cstate=0 \1"/' /etc/default/grub
update-grub                                # Debian/Ubuntu
grub2-mkconfig -o /boot/grub2/grub.cfg     # RHEL and derivatives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We recommend setting the polling mode for NVMe devices on Intel-based systems (not applicable to AMD).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "options nvme poll_queues=4" &amp;gt;&amp;gt; /etc/modprobe.d/nvme.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then rebuild the initramfs so the module option takes effect:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dracut -f                    # RHEL and derivatives
update-initramfs -u -k all   # Debian/Ubuntu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Schedulers
&lt;/h2&gt;

&lt;p&gt;Make sure schedulers are set to ‘none’ or ‘noop’:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /sys/block/nvme*/queue/scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If they are not, you can run the following Bash script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for nvme_device in /sys/block/nvme*; do
    echo none &amp;gt; $nvme_device/queue/scheduler
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This setting is not permanent, so you can add this loop to the rc.local or a similar file. However, please ensure that it functions correctly. Please note that permissions for the rc.local file may need to be modified to allow it to execute after a system reboot.&lt;/p&gt;
&lt;/blockquote&gt;
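&lt;p&gt;On systemd-based distributions, a oneshot unit is a more robust alternative to rc.local for reapplying such sysfs settings at boot. A minimal sketch (the unit name and script path are illustrative; by default it writes to a temporary directory so it can be inspected safely before installing to /etc/systemd/system):&lt;/p&gt;

```shell
#!/bin/sh
# Generate a systemd oneshot unit that reapplies the scheduler setting at
# boot. Unit name and ExecStart path are illustrative; pass
# /etc/systemd/system as $1 (as root) to install for real.
unit_dir=${1:-$(mktemp -d)}

printf '%s\n' \
  '[Unit]' \
  'Description=Set NVMe I/O schedulers to none' \
  'After=local-fs.target' \
  '' \
  '[Service]' \
  'Type=oneshot' \
  'ExecStart=/usr/local/sbin/set-nvme-schedulers.sh' \
  '' \
  '[Install]' \
  'WantedBy=multi-user.target' \
  | tee "$unit_dir/nvme-tuning.service"

# Then: systemctl daemon-reload; systemctl enable nvme-tuning.service
```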

&lt;p&gt;Similarly, the following loop increases the I/O queue depth to 512 for NVMe devices by writing "512" to the /sys/block/&amp;lt;device&amp;gt;/queue/nr_requests file. A deeper queue can allow better utilization of the device and potentially improve performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for nvme_device in /sys/block/nvme*; do
    echo 512 &amp;gt; $nvme_device/queue/nr_requests
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This setting is also not permanent, so you can add this loop to the rc.local or a similar file. However, please ensure that it functions correctly. Please note that permissions for the rc.local file may need to be modified to allow it to execute after a system reboot.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using Flash drives connected via HBA, please refer to the HBA instructions for recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other settings
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Most of the settings below are not permanent, so you can add the corresponding commands to rc.local or a similar file. However, please ensure that it functions correctly. Please note that permissions for the rc.local file may need to be modified to allow it to execute after a system reboot.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Disabling Transparent Huge Pages (THP): The script sets the THP to "never" by writing "never" to the /sys/kernel/mm/transparent_hugepage/enabled file. This helps reduce latency.&lt;br&gt;
echo never &amp;gt; /sys/kernel/mm/transparent_hugepage/enabled&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disabling THP defragmentation: The script sets THP defrag to "never" by writing "never" to the /sys/kernel/mm/transparent_hugepage/defrag file. This is another step to reduce latency.&lt;br&gt;
echo never &amp;gt; /sys/kernel/mm/transparent_hugepage/defrag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disabling Kernel Same-Page Merging (KSM): The script disables KSM by writing "0" to the /sys/kernel/mm/ksm/run file. Disabling KSM reduces CPU usage by avoiding redundant memory sharing among processes.&lt;br&gt;
echo 0 &amp;gt; /sys/kernel/mm/ksm/run&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(Pattern-dependent) Setting read-ahead value for NVMe devices: The script uses the blockdev command to set the read-ahead value to 65536 (the maximum size) for NVMe devices. This setting controls the amount of data the system reads ahead of time from the storage device, which can help improve performance in some scenarios (mostly on sequential read).&lt;br&gt;
blockdev --setra 65536 /dev/nvme0n1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Installing necessary packages: The script checks for the availability of package managers (dnf or apt) and installs the cpufrequtils and tuned packages if they are available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting CPU frequency governor: The script uses the cpupower frequency-set command to set the CPU frequency governor to "performance," which keeps the CPU running at the highest frequency. This can improve performance at the cost of increased power consumption and potential higher temperatures.&lt;br&gt;
cpupower frequency-set -g performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Displaying current CPU frequency governor: The script displays the current CPU frequency governor by reading the value from the /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor file.&lt;br&gt;
echo "Current CPU frequency governor: $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stopping the irqbalance service: The script stops the irqbalance service, which distributes interrupts across CPU cores to balance the load. Stopping irqbalance can be useful when trying to maximize performance in certain scenarios.&lt;br&gt;
systemctl stop irqbalance&lt;br&gt;
systemctl status irqbalance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting system profile to throughput-performance: The script uses the tuned-adm command to set the system profile to "throughput-performance" provided by the tuned package. This profile is designed to optimize the system for high throughput and performance.&lt;br&gt;
tuned-adm profile throughput-performance&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Drive Performance
&lt;/h2&gt;

&lt;p&gt;After evaluating the system characteristics and determining the test objectives (as discussed in the previous blog post), it is necessary to proceed with the basic calculation of the expected performance. This calculation should be based on the hardware specifications and the conducted tests.&lt;/p&gt;

&lt;p&gt;This should be done as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the manufacturer's specifications for the drives.&lt;/li&gt;
&lt;li&gt;Run tests on 1-3 separate drives.&lt;/li&gt;
&lt;li&gt;If the results obtained differ significantly from the characteristics stated in the specification, refer to the troubleshooting guide.&lt;/li&gt;
&lt;li&gt;Calculate the total expected performance of all drives intended for use in RAID.&lt;/li&gt;
&lt;li&gt;Run tests simultaneously on all drives intended for use in the RAID.&lt;/li&gt;
&lt;li&gt;If the results obtained differ significantly from the expected ones, refer to the troubleshooting guide.&lt;/li&gt;
&lt;li&gt;Calculate the expected performance of the array based on these results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The calculations should be based on the manufacturers' specifications, which usually state random 4k read/write performance, mixed-workload performance (70% read / 30% write), and sequential 128k read/write performance.&lt;/p&gt;

&lt;p&gt;The total queue depth or the ratio of the queue depth to the number of threads is often specified.&lt;/p&gt;

&lt;p&gt;To assess performance levels, you can use the fio utility. Here are some examples of fio configuration files:&lt;/p&gt;

&lt;p&gt;Prepare the drive for the tests: overwrite the drive with a 128k block twice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=128k
ioengine=libaio
rw=write
iodepth=32
numjobs=1
blockalign=128k
[drive]
filename=/dev/nvme1n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test the workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=128k
ioengine=libaio
rw=read / write
iodepth=8
numjobs=2
norandommap
time_based=1
runtime=600
group_reporting
offset_increment=50%
gtod_reduce=1
[drive]
filename=/dev/nvme1n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prepare the drive for tests: overwrite the drive with a 4k block using random pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=4k
ioengine=libaio
rw=randwrite
iodepth=128
numjobs=2
random_generator=tausworthe64
[drive]
filename=/dev/nvme1n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test random workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=4k
ioengine=libaio
rw=randrw
rwmixread=0 / 50 / 70 / 100
iodepth=128
numjobs=2
norandommap
time_based=1
runtime=600
random_generator=tausworthe64
group_reporting
gtod_reduce=1
[drive]
filename=/dev/nvme1n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After receiving the results for one drive, you should run tests on all drives simultaneously to calculate the total performance.&lt;/p&gt;

&lt;p&gt;Prepare all drives for testing: overwrite each with the 128k block twice. Examples of fio configuration files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=128k
ioengine=libaio
rw=write
iodepth=32
numjobs=1
blockalign=128k
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test the workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=128k
ioengine=libaio
rw=read / write
iodepth=8
numjobs=2
norandommap
time_based=1
runtime=600
group_reporting
offset_increment=50%
gtod_reduce=1
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prepare all drives for testing: overwrite each with a 4k block using random pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=4k
ioengine=libaio
rw=randwrite
iodepth=128
numjobs=2
random_generator=tausworthe64
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test random workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=4k
ioengine=libaio
rw=randrw
rwmixread=0 / 50 / 70 / 100
iodepth=128
numjobs=2
norandommap
time_based=1
runtime=600
random_generator=tausworthe64
group_reporting
gtod_reduce=1
[drive1]
filename=/dev/nvme1n1
[drive2]
filename=/dev/nvme2n1
[drive3]
filename=/dev/nvme3n1
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ideally, the test results for all drives should be equal to the results for one drive multiplied by the number of drives (in terms of GBps for sequential tests and IOPS for random tests). However, due to platform limitations, these results are not always achieved across all patterns. These limitations should be taken into account further on.&lt;/p&gt;
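&lt;p&gt;This comparison can be expressed as a scaling-efficiency ratio: the measured aggregate result divided by the single-drive result multiplied by the number of drives. A minimal sketch with illustrative numbers (not measurements):&lt;/p&gt;

```shell
#!/bin/sh
# Scaling efficiency of an all-drives test versus the ideal of
# N x single-drive performance. Input values are illustrative.
scaling_efficiency() {  # $1 = single-drive result, $2 = drives, $3 = aggregate result
  awk -v one="$1" -v n="$2" -v agg="$3" \
    'BEGIN { printf "%.1f%%\n", 100 * agg / (one * n) }'
}

# e.g. 3.2 GBps per drive, 8 drives, 24.0 GBps measured in aggregate:
scaling_efficiency 3.2 8 24.0
```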

&lt;p&gt;To ensure accurate results, it is recommended to repeat sequential tests using a block size that corresponds to the future array strip size (unless it is already equal to the specified block size used in the sequential tests). This is important because drives may show varying performance based on the block size being used.&lt;/p&gt;

&lt;p&gt;Additionally, the load level, determined by the number of threads and the queue depth, should correspond to both the manufacturer's specifications and the expected load on the RAID.&lt;/p&gt;

&lt;p&gt;To calculate the expected RAID performance, use the Table below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c1g3zxl5v99x1o9ig19.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c1g3zxl5v99x1o9ig19.PNG" alt="t22" width="663" height="726"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r0b4byd2i57lgq2v29f.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8r0b4byd2i57lgq2v29f.PNG" alt="t2" width="651" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T&lt;/strong&gt; is the number of threads required for the test as given in the specification for the drive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt; is the queue depth required for the test as given in the specification for the drive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N&lt;/strong&gt; is the number of drives used in the tests and in RAID creation.&lt;/p&gt;
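&lt;p&gt;Since the tables above are provided as images, here is a rough sketch of the kind of estimate they encode, using standard textbook approximations for parity RAID (sequential write scales with the number of data drives; random write is divided by the read-modify-write penalty). These are generic approximations, not Xinnor's exact table figures:&lt;/p&gt;

```shell
#!/bin/sh
# Rough theoretical estimates for a parity RAID built from N identical
# drives. Standard textbook approximations, not the vendor tables above.
#   n = total drives, p = parity drives (1 for RAID5, 2 for RAID6)
seq_write_gbps() {  # $1=n $2=p $3=per-drive GBps; parity drives carry no data
  awk -v n="$1" -v p="$2" -v d="$3" 'BEGIN { printf "%.1f\n", (n - p) * d }'
}
rand_write_iops() { # $1=n $2=p $3=per-drive IOPS; RMW penalty = 2*(p+1)
  awk -v n="$1" -v p="$2" -v d="$3" 'BEGIN { printf "%d\n", n * d / (2 * (p + 1)) }'
}

# e.g. 8-drive RAID5, 3.2 GBps and 800k random-write IOPS per drive:
seq_write_gbps 8 1 3.2
rand_write_iops 8 1 800000
```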

&lt;p&gt;These calculations are theoretical and allow us to estimate the maximum theoretical performance of the array.&lt;/p&gt;

&lt;p&gt;In reality, the performance of the system can be influenced by both external factors (such as the operation of internal components of storage devices and temperature conditions), as well as internal factors (like the need to perform calculations and temporarily store data in memory). Therefore, we consider it normal for approximately 90-95% of the calculated performance to be achieved on the array.&lt;/p&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!&lt;/p&gt;

&lt;p&gt;Original article can be found &lt;a href="https://xinnor.io/blog/performance-guide-pt-2-hardware-and-software-configuration/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>discuss</category>
      <category>software</category>
      <category>raid</category>
    </item>
    <item>
      <title>Virtualized NFSoRDMA-based Disaggregated Storage Solution for AI Workloads</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Wed, 26 Mar 2025 04:54:47 +0000</pubDate>
      <link>https://dev.to/pltnvs/virtualized-nfsordma-based-disaggregated-storage-solution-for-ai-workloads-3gpc</link>
      <guid>https://dev.to/pltnvs/virtualized-nfsordma-based-disaggregated-storage-solution-for-ai-workloads-3gpc</guid>
      <description>&lt;p&gt;In the fields of HPC and AI, the demand for efficient and scalable storage solutions is ever-increasing. Traditional storage systems often struggle to meet the high throughput and low latency requirements of modern AI workloads. Disaggregated storage, particularly when combined with NFSoRDMA, presents a promising solution to these challenges. This blog post will explore the objectives, performance requirements, and solutions for implementing disaggregated storage tailored for AI workloads.&lt;/p&gt;

&lt;p&gt;NFSoRDMA combines the widely adopted NFS protocol with the high-performance RDMA technology. By using standard protocols and avoiding proprietary software, we can sidestep limitations in OS compatibility and version conflicts. NFSoRDMA makes it possible to achieve the required performance levels with minimal deployment costs, without the need for specific compatibility lists for parallel file system clients or strict version compatibility for all components. We maximize the utilization of a 400Gbit interface, demonstrating that high performance can be achieved efficiently. This integration ensures low latency and high throughput, making it ideal for our demanding storage needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Objectives
&lt;/h2&gt;

&lt;p&gt;AI workloads often demand more than just a block device. They require sophisticated file storage systems capable of handling high performance and scalability needs. Our primary objectives in developing disaggregated storage solutions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Achieving High Throughput&lt;/strong&gt;: The goal is to reach dozens of GBps from a few clients using one or two storage nodes, whether real or virtual.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximizing IOPS&lt;/strong&gt;: It's crucial to obtain as many small IOPS as possible to support the high-speed data processing needs of AI tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity in Configuration&lt;/strong&gt;: Keeping both hardware and software configurations as straightforward as possible is essential for ease of deployment and maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Flexibility&lt;/strong&gt;: The solution must be deployable on-premise or in the cloud, offering on-demand scalability and flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI tasks often necessitate parallel access to data from multiple clients at speeds of tens to hundreds of gigabytes per second. Our software solutions aim to minimize complexity while ensuring high performance and scalability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtl1vs2h05260n9tvcxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtl1vs2h05260n9tvcxp.png" alt="img1" width="800" height="722"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Required disaggregated storage solution&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Performance Requirements
&lt;/h2&gt;

&lt;p&gt;For efficient cluster operations, it is essential to have fast data loading and quick checkpoint writing. These requirements are critical for maintaining high performance in AI and HPC environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyp4bncloyhqvlao0alr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyp4bncloyhqvlao0alr.jpg" alt="IMG2" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More details here: &lt;a href="https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day2/LUG_2024_Talk_15-AI_Workload_Optimization_with_Lustre.pdf" rel="noopener noreferrer"&gt;IO for Large Language Models and Secured AI Workflows&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The provided graph above illustrates GPU utilization and Lustre read/write rates during an iterative processing cycle, common in HPC workloads. Initially, there's a brief period where the GPU utilization ramps up, known as the startup phase. This is followed by sequences of compute iterations, where GPU utilization remains high, indicating active processing. Periodically, there are drops in GPU utilization, corresponding to checkpointing phases where data is saved. During these phases, the GPUs are mostly idle, waiting for the checkpoint to complete.&lt;/p&gt;

&lt;p&gt;The Lustre read/write rates reflect this workflow with distinct phases. At the beginning, there's a significant spike in read activity during the initialization phase as the system loads initial data, which occurs only once. During compute iterations, read activity is minimal, reflecting the workload's focus on computation rather than data movement. Every few compute iterations, checkpointing occurs, causing spikes in write activity. During these phases, data is written to storage, and GPU computation is paused.&lt;/p&gt;

&lt;p&gt;Although this example is based on data from an NVIDIA presentation focused on Lustre, the underlying principles are also applicable to NFS systems. The current pattern reveals opportunities for efficiency improvements. For example, implementing parallel or asynchronous checkpointing could significantly reduce the idle time of GPUs during checkpointing phases. Our solution aims to address these inefficiencies, providing a more streamlined and effective process for handling high-throughput storage demands in HPC and AI workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-Level Solution Description
&lt;/h2&gt;

&lt;p&gt;Our solution integrates a high-performance storage engine with well-known filesystem services, focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Software-defined RAID&lt;/strong&gt;: Deployable across various environments such as bare-metal, virtual machines, and DPUs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tuned Filesystem (FS) and Optimized Server Configuration&lt;/strong&gt;: Ensures optimal performance and efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RDMA-based Interfaces for Data Access&lt;/strong&gt;: Provides the high-speed data access required for AI workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main idea is to deploy the different elements of the file storage system on demand across the necessary hosts. Disaggregated storage resources are combined into a virtual RAID using the xiRAID Opus RAID engine, which requires minimal CPU cores. Volumes are then created and exported to virtual machines via VHOST controllers or NVMe-oF, offering flexible and scalable storage solutions.&lt;/p&gt;

&lt;p&gt;We will explore the virtualization of NFSoRDMA and xiRAID Opus, observing the limitations in the Linux kernel space and explaining how our solution can be virtualized.&lt;/p&gt;

&lt;p&gt;To validate our solution, we will utilize FIO (Flexible I/O Tester) to conduct two types of tests: sequential reads/writes to demonstrate data load and checkpoints performance, and random reads to demonstrate small data reads performance. FIO allows us to utilize various engines and regulate the load, ensuring simplicity and repeatability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Virtualized NFSoRDMA + xiRAID Opus Solution
&lt;/h2&gt;

&lt;p&gt;The new virtual storage system comprises two critical components: a high-performance engine running in user space and virtual machines that act as NAS gateways, instantiated on demand. This architecture allows efficient allocation of storage resources, balancing high performance with the scalability and flexibility required in virtualized environments.&lt;/p&gt;

&lt;p&gt;Deploying on-demand storage controllers and constructing virtual storage volumes from disaggregated storage resources gives each tenant its own dedicated virtual storage. This approach leverages two key components: xiRAID Opus, a high-performance block volume engine, and an NFS gateway. xiRAID Opus provides RAID-protected block devices tailored for virtualized environments, ensuring robust performance and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqv8xo94eeckxkqh9qiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqv8xo94eeckxkqh9qiv.png" alt="iMG3" width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Virtualized solution architecture with on-demand storage controllers&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, performance remains a significant challenge in virtual environments, exacerbated by hardware and software limitations such as the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PCI slot taxation by accelerators&lt;/strong&gt;: This issue is more critical in cloud environments than in bare-metal installations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linux kernel updates and live patching&lt;/strong&gt;: Frequent in virtualized settings, these updates can destabilize proprietary software reliant on specific kernel versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vhost protocol implementation&lt;/strong&gt;: Limited to 250K IOPS per volume in the kernel, this constraint consumes significant resources, hindering overall efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To overcome these challenges, our solution operates the block device engine efficiently, using just 2-4 CPU cores while providing access to high-performance virtual block devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features of Our Virtualized Solution
&lt;/h2&gt;

&lt;p&gt;Our virtualized storage solution includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software RAID controller in User Space&lt;/li&gt;
&lt;li&gt;Volume manager with QoS Support&lt;/li&gt;
&lt;li&gt;Multi-threaded VHOST in User Space&lt;/li&gt;
&lt;li&gt;SR-IOV to pass NIC functions into the VM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cxvr37gqrmyb8fzw90u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cxvr37gqrmyb8fzw90u.png" alt="IMG4" width="800" height="595"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Our virtualized solution architecture&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By aggregating namespaces into virtual arrays and creating volumes with specific size and performance characteristics, we pass these devices into virtual machines. These VMs function as virtual NVMe over RDMA servers, connected to high-performance network cards, either on the same or remote hosts, and client VMs can connect directly or through a host using virtio-fs.&lt;/p&gt;

&lt;p&gt;The data path in our solution, while complex, is optimized by reducing levels and enhancing the efficiency of each. This optimization maintains compatibility with existing solutions, while reinventing or improving key components (highlighted in color) to achieve greater efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7ruy9zr3lx3rjum9gba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7ruy9zr3lx3rjum9gba.png" alt="img5" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Data flow design&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Testing Configuration
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only two CPU cores for RAID and vhost controller&lt;/li&gt;
&lt;li&gt;24 CPU cores for NFS SERVER VM&lt;/li&gt;
&lt;li&gt;RAID 50 (8+1) x2&lt;/li&gt;
&lt;li&gt;KIOXIA CM6-R drives&lt;/li&gt;
&lt;li&gt;AMD EPYC 7702P&lt;/li&gt;
&lt;li&gt;Filesystem is aligned with RAID&lt;/li&gt;
&lt;/ul&gt;
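As a rough sanity check on this geometry, the usable capacity and parity overhead of a RAID 50 built as two RAID 5 legs of 8 data + 1 parity drives can be computed directly. This is an illustrative sketch; the 3.84 TB per-drive size is a hypothetical figure, not taken from the test setup:

```shell
# RAID 50 as (8+1) x 2: two RAID 5 legs striped together
data_per_leg=8        # data drives in each RAID 5 leg
parity_per_leg=1      # parity drives per leg
legs=2
drive_tb=384          # hypothetical 3.84 TB drive, in units of 0.01 TB to keep integer math

total_drives=$(( (data_per_leg + parity_per_leg) * legs ))   # 18 drives in total
usable=$(( data_per_leg * legs * drive_tb ))                 # capacity available to data
raw=$(( total_drives * drive_tb ))
overhead_pct=$(( 100 * (raw - usable) / raw ))               # parity overhead, ~11%

echo "drives=$total_drives usable=$((usable/100)).$((usable%100))TB overhead=${overhead_pct}%"
```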

&lt;p&gt;We created one volume, passed it into a single VM, and applied load from two virtual clients on a remote host to evaluate performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjdw0fmhyy779c4lm5i8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjdw0fmhyy779c4lm5i8.png" alt="img6" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Test architecture&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Testing Results
&lt;/h2&gt;

&lt;p&gt;In performance comparisons for sequential operations, mdraid shows significant performance losses, up to 50%, whereas xiRAID Opus maximizes interface read performance, achieving approximately 60% of the interface's potential for writes with one or two clients. This highlights xiRAID Opus's superior performance in virtualized environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltbp3mk6euuwqnaz2wh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltbp3mk6euuwqnaz2wh0.png" alt="graph 1" width="800" height="782"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudcvk55w9o0us3a7mwcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudcvk55w9o0us3a7mwcx.png" alt="graph2" width="800" height="782"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In random read operations with 2 clients, xiRAID Opus shows superior scalability and a significant increase in performance, reaching 850k-950k IOPS. In contrast, mdraid-based solutions failed to scale effectively, demonstrating kernel limitations that cap small-block read performance at 200-250k IOPS per VM. Our user-space solution approaches 1 million IOPS once a third client is connected, further highlighting the scalability and efficiency of xiRAID Opus over mdraid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ir8jjed4nyf9u9puc7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ir8jjed4nyf9u9puc7w.png" alt="graph3" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When comparing CPU load, mdraid combined with kernel vhost and a virtual machine loads about a quarter of the CPU cores to nearly 100%. Conversely, xiRAID Opus uses only two fully loaded cores, while around 12% of the remaining cores operate at approximately 25% load.&lt;/p&gt;
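On the 64-core EPYC 7702P from the testing configuration, the difference is easy to quantify: a quarter of the cores fully loaded is 16 cores, versus 2 dedicated cores for the user-space engine. A small sketch of that comparison:

```shell
cores=64                       # AMD EPYC 7702P core count from the test setup
mdraid_busy=$(( cores / 4 ))   # ~1/4 of cores fully loaded by mdraid + kernel vhost
opus_busy=2                    # fully loaded cores used by xiRAID Opus
ratio=$(( mdraid_busy / opus_busy ))

echo "mdraid: ${mdraid_busy} busy cores, xiRAID Opus: ${opus_busy} busy cores (${ratio}x fewer)"
```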

&lt;p&gt;MDRAID + NFS virtual machine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lxkis4gogw1n3yhxwnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lxkis4gogw1n3yhxwnz.png" alt="img7" width="800" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;¼ of all CPU cores are fully loaded&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;xiRAID Opus + NFS virtual machine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cx5hzcdnij786abn44a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cx5hzcdnij786abn44a.png" alt="img8" width="800" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Only 2 CPU cores are fully loaded&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Wrap-Up and Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The following conclusions highlight the significant efficiencies observed in our virtualization of NFSoRDMA and xiRAID Opus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single virtual NFS server can saturate a 2x200 Gbit interface, performing comparably to a large server.&lt;/li&gt;
&lt;li&gt;The xiRAID Opus storage engine requires only 2 CPU cores, whereas mdraid consumes far more CPU.&lt;/li&gt;
&lt;li&gt;mdraid is limited in both sequential and random operations.&lt;/li&gt;
&lt;li&gt;By virtualizing our solutions, we achieve around 60% efficiency for write operations and 100% efficiency for read operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NFS over RDMA combined with xiRAID enables the creation of fast storage nodes, delivering 50 GBps of performance and saturating a 400 Gbit network. This makes it ideal for providing fast storage to NVIDIA DGX systems at universities and large research institutions. The main ingredients of such a solution are the storage engine, the NFS server, and client tuning.&lt;/p&gt;
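The 50 GBps figure lines up with the network arithmetic: a 400 Gbit/s link carries at most 400/8 = 50 GB/s of payload (ignoring protocol overhead), so a node at that throughput is wire-limited. A quick check:

```shell
# A 400 Gbit/s network carries at most 400/8 = 50 GB/s (before protocol overhead)
link_gbit=400
link_gbyte=$(( link_gbit / 8 ))   # 50 GB/s payload ceiling
storage_gbyte=50                  # node throughput cited above

echo "link ceiling: ${link_gbyte} GB/s, storage node: ${storage_gbyte} GB/s"
```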

&lt;p&gt;Achieving high performance in a virtual environment proves to be possible. While the Linux kernel presents performance limitations, a user-space RAID engine is the solution: it reduces resource consumption while maintaining performance comparable to bare-metal installations. Our extensive testing and implementation of disaggregated storage based on NFSoRDMA highlight significant advancements in performance and efficiency for AI workloads.&lt;/p&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!&lt;/p&gt;

&lt;p&gt;Original article can be found &lt;a href="https://xinnor.io/blog/virtualized-nfsordma-based-disaggregated-storage-solution-for-ai-workloads/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  NFS Settings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Server Side&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/etc/nfs.conf
nfs.conf
threads=32
[nfsd]
# debug=0
threads=32
rdma=y
rdma-port=20049

/etc/exports
/data *(rw,no_root_squash,sync,insecure)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Client Side&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/etc/modprobe.d/nfsclient.conf options nfs max_session_slots=180 Mount options nfsvers=3,rdma,port=20049,sync,nconnect=16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MDRAID tuning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo 24 &amp;gt; /sys/block/mdX/md/group_thread_cnt
mdadm –-grow /dev/mdX bitmap=none #gives 13GBps write but not recommended in real env
mdadm –-grow /dev/mdX --bitmap=internal --bitmap-chunk=524288. #gives 7,5 GBps write
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ZFS settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zpool create -o ashift=12 test -O recordsize=1M -O compression=off -O dedup=off -O atime=off -O xattr=sa -O logbias=throughput raidz

zfs_vdev_sync_{read/write}_max_active=64
zfs_vdev_async_write_max_active=64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fio config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
bs=1M
iodepth=32
direct=1
ioengine=libaio
rw=read/write
size=100G
[dir]
directory=/test

[global]
bs=4k
iodepth=128
direct=1
ioengine=libaio
rw=randread
size=100G
norandommap
[dir]
directory=/test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>data</category>
      <category>discuss</category>
      <category>virtualmachine</category>
    </item>
    <item>
      <title>Performance Guide Pt. 1: Performance Characteristics and How It Can Be Measured</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Mon, 24 Mar 2025 06:40:08 +0000</pubDate>
      <link>https://dev.to/pltnvs/performance-guide-pt-1-performance-characteristics-and-how-it-can-be-measured-4mi3</link>
      <guid>https://dev.to/pltnvs/performance-guide-pt-1-performance-characteristics-and-how-it-can-be-measured-4mi3</guid>
      <description>&lt;p&gt;We're excited to introduce a new blog post series focused on testing and enhancing storage performance. Throughout this series, we'll walk you through the entire process, from defining objectives and preparing the necessary hardware and software to optimizing performance using a &lt;a href="https://xinnor.io/what-is-xiraid/" rel="noopener noreferrer"&gt;software RAID&lt;/a&gt; engine. In our first blog post, we'll delve into the fundamentals of system performance, including its core components and methods for measurement.&lt;/p&gt;

&lt;p&gt;The performance of a data storage system is usually evaluated based on its inherent set of interrelated characteristics, such as data transfer bandwidth, input/output performance, etc. This approach allows for a detailed comparison among existing solutions and between the solutions and specifications provided by suppliers. It is also useful for predicting the performance of an application. The output characteristics of a system may, however, vary as the load changes. For instance, they can be influenced by factors such as the number of queues, the depth of the request queue, the size of the read and write blocks, the alignment of the blocks, the locality of requests, and the ability to compress and deduplicate data.&lt;/p&gt;

&lt;p&gt;This chapter covers data storage units, basic workload patterns, and provides instructions on how to measure and test performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Storage Units
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GBps or GiBps&lt;/strong&gt;. Data storage system transfer bandwidth measured in GB (binary or decimal). This parameter measures how much data can be processed during read and write operations per unit of time. It is typically used to evaluate the data storage system performance when dealing with large block sizes, ranging from 64kB to 8MB. This parameter is directly proportional to the following one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOps&lt;/strong&gt;. Input/output operations per second (IOPS) is a parameter that indicates the number of requests a data storage system can process. It is typically used to evaluate the performance of a data storage system when dealing with small block sizes, ranging from 512B to 64kB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;avg lat&lt;/strong&gt;. Average latency is the waiting period for a request to be processed, measured in nanoseconds (ns), microseconds (us), and milliseconds (ms). This parameter is inversely proportional to the previous one. Sometimes, in advanced analytics, average latency is divided into submission and completion latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;99.x lat&lt;/strong&gt;. Indicates the response time within which 99.x% of requests complete, where x is 5, 9, 95, 99, etc. This parameter should be considered if the application is sensitive to latency or when using the data storage system in a complex application cluster with multiple nodes involved in processing requests. For more complex tests, the full latency distribution can also be taken into account.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU load&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Utilization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application parameters&lt;/strong&gt;: number of supported threads, transactions, etc.&lt;/li&gt;
&lt;/ul&gt;
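The first two units above are tied together by the block size: throughput equals IOPS multiplied by the I/O size, which is why bandwidth dominates at large blocks and IOPS at small ones. A minimal sketch of the conversion (the sample numbers are illustrative, not measurements):

```shell
# throughput (bytes/s) = IOPS x block size (bytes)
iops=1000000                  # hypothetical 1M IOPS
bs=4096                       # 4 KiB blocks
bytes_per_sec=$(( iops * bs ))
mib_per_sec=$(( bytes_per_sec / 1024 / 1024 ))

echo "${iops} IOPS at ${bs}B blocks = ${mib_per_sec} MiB/s"
```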

&lt;p&gt;However, when measuring and evaluating these parameters, it is important to consider the workload on the storage device, as it can affect its performance. The performance differs due to devices’ features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terms and Characteristics Describing the Workload Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  I/O
&lt;/h3&gt;

&lt;p&gt;An I/O is a single read/write request issued to a storage medium (such as a hard drive or solid-state drive).&lt;/p&gt;

&lt;h3&gt;
  
  
  I/O Request Size (block size)
&lt;/h3&gt;

&lt;p&gt;The I/O request has a size, which can vary from small (like 1 Kilobyte) to large (several megabytes). Different application workloads will issue I/O operations with varying request sizes. The size of the I/O request can impact latency and IOPS figures (two metrics we will discuss shortly).&lt;/p&gt;

&lt;h3&gt;
  
  
  Access patterns
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Sequential access
&lt;/h3&gt;

&lt;p&gt;Sequential access is a type of access where the next input/output (I/O) operation starts from the same location (address or LBA) where the previous I/O operation ended. In other words, I/O operations form a sequence of reads or writes which come sequentially, one after another.&lt;/p&gt;

&lt;p&gt;In real-life scenarios, sequential access I/O typically uses relatively large I/O sizes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Random access
&lt;/h3&gt;

&lt;p&gt;I/O requests are issued in a seemingly random pattern to the storage media. The data could be stored all over various regions of the storage media. An example of such an access pattern is a heavily utilized database server or a virtualization host running many virtual machines (all operating simultaneously).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In real life, purely sequential and purely random patterns are rarely found, but real applications usually generate workloads that stay close to one of the “pure“ patterns within a given timeframe.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Queue depth
&lt;/h3&gt;

&lt;p&gt;The queue depth is a number (usually between 1 and ~128) that shows how many I/O requests are queued (in-flight) on average. Having a queue is beneficial as the requests in the queue can be submitted to the storage subsystem in an optimized manner and often in parallel. A queue improves performance at the cost of latency.&lt;/p&gt;
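The trade-off described above follows Little's law: the average number of in-flight requests equals IOPS multiplied by average latency. A short illustrative calculation (the IOPS and latency figures are hypothetical):

```shell
# Little's law: queue_depth = IOPS x avg_latency
iops=200000                  # hypothetical sustained IOPS
lat_us=640                   # hypothetical average latency in microseconds
qd=$(( iops * lat_us / 1000000 ))

echo "200k IOPS at ${lat_us}us average latency implies queue depth ${qd}"
```

Read the other way, a fixed queue depth means latency rises as soon as the device can no longer raise IOPS, which is exactly the "performance at the cost of latency" trade-off.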

&lt;h2&gt;
  
  
  How to Measure and Test Performance
&lt;/h2&gt;

&lt;p&gt;Prior to conducting any performance tests, two essential steps should be taken in order to analyze the final results correctly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Defining test objectives&lt;/li&gt;
&lt;li&gt;Setting expectations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Generally, the approach to testing should be as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Defining the objectives, forming expectations and designing a test strategy.&lt;/li&gt;
&lt;li&gt;Setting up test environment suitable for the task.&lt;/li&gt;
&lt;li&gt;Studying specifications for the hardware and evaluating the results accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Steps two and three are interlinked and usually executed at the same time. Some elements of the equipment used in testing are predetermined, while others must be chosen to ensure an optimal outcome.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Software setup.&lt;/li&gt;
&lt;li&gt;Testing individual components and removing bottlenecks.&lt;/li&gt;
&lt;li&gt;Carrying out tests.&lt;/li&gt;
&lt;li&gt;Assessing the results.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Steps 4-7 and sometimes 2 can be repeated until the desired result is achieved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Depending on the desired outcomes, we can determine the type of testing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System testing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The primary objective of this testing is to comprehend the system behavior under various loading types and in different scenarios. This type of testing is most common and synthetic benchmarks like fio, vdbench, and iometer are generally used for it. During testing, parameters such as the number of threads and queue depth, read-write ratio, block size, and so on are altered. This data often provides an accurate prediction of what to expect during other testing scenarios.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Testing application performance&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is used to understand how a real application will work with the storage. Such tools imitate the activity of real applications, or launch the applications themselves.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Acceptance testing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is used to determine if a new storage or modified settings of an existing one meet the project requirements. A fixed set of tests is used.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Comparative testing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Allows you to see how the performance of different storage systems differs when they are tested under conditions as similar as possible. A fixed set of tests is used.&lt;/p&gt;

&lt;p&gt;The main focus of this blog post series is system testing.&lt;/p&gt;

&lt;p&gt;Example of a test objective and expected results:&lt;/p&gt;

&lt;p&gt;Understanding the xiRAID RAID engine's capabilities for both random and sequential workloads when using a server equipped with 16 PCI-E v5 drives.&lt;/p&gt;

&lt;p&gt;Boost maximum performance levels 2 times higher than those obtained on previous generation drives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;gt;40 million IOps for reads (4k block)&lt;/li&gt;
&lt;li&gt;&amp;gt;2 million IOps for writes (4k block)&lt;/li&gt;
&lt;li&gt;&amp;gt;200Gbps for reads (1MB block)&lt;/li&gt;
&lt;li&gt;&amp;gt;100Gbps for writes (1MB block)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While running system and application performance tests, it is essential to also test the system during drive failure, reconstruction, restriping, or resizing, since array performance can decrease significantly during these operations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Restriping refers to the process of adding an additional drive to a RAID in order to increase its volume or change its level. It involves changing the configuration of the RAID.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Tool for Testing the Drive Performance
&lt;/h2&gt;

&lt;p&gt;At present, the preferred tool for generating drive workloads on Linux is fio (Flexible I/O tester). It allows custom-defined workload patterns. The standard command to run fio is as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# fio fio.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where "fix.cfg" is a configuration file that describes the necessary FIO startup parameters, defines the tested devices, and outlines the load patterns.&lt;/p&gt;

&lt;p&gt;Generally, this file contains one section, [global], which specifies parameters that are common to all tasks [jobs] and at least one section of the job. Most of the options can be specified both in the [global] section and in the [jobs] section.&lt;/p&gt;

&lt;p&gt;Let us consider an example of a fio configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
direct=1
bs=4k
ioengine=libaio
rw=randrw
rwmixread=50
iodepth=128
numjobs=2
offset_increment=2%
norandommap
time_based=1
runtime=600
random_generator=tausworthe64
group_reporting
gtod_reduce=1

[drive]
filename=/dev/nvme1n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Major fio parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;[global]&lt;/strong&gt;: [global] section start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;direct=1&lt;/strong&gt;: Enables direct I/O, bypassing the operating system's cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bs=4k&lt;/strong&gt;: Specifies the block size for I/O operations. In this example it’s set to 4 kilobytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ioengine=libaio&lt;/strong&gt;: Sets the I/O engine to libaio, which provides asynchronous I/O support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rw=randrw&lt;/strong&gt;: Specifies the I/O access pattern, here set to mixed random read/write. The read portion of the mix is set by the next parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rwmixread=50&lt;/strong&gt;: Sets the percentage of reads in the random read/write mix. It takes different values (0, 50, 70, or 100) for separate test runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iodepth=128&lt;/strong&gt;: Determines the depth of the I/O submission queue per job, i.e., the maximum number of I/O requests in flight at any given time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;numjobs=2&lt;/strong&gt;: Specifies the number of parallel jobs/threads to be used during the test. Each job defined below will be launched in “numjobs“ number of threads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;offset_increment=2%&lt;/strong&gt;: Specifies the offset between the jobs' starting LBAs. The actual start LBA for each thread is offset_increment * thread_number; in this example, the offset is 2% of the target device size. This option is important for sequential patterns: multiple threads working on the same stripes at the same time reduces performance because of the RAID engine's stripe handling, and such a pattern does not reflect real tasks. For random patterns, this option can be omitted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;norandommap&lt;/strong&gt;: Tells fio not to track which random offsets have already been covered, so offsets may repeat; this lowers bookkeeping overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;time_based=1&lt;/strong&gt;: Uses a time-based duration for the test instead of specifying the number of I/O operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;runtime=600&lt;/strong&gt;: Sets the duration of the test to 600 seconds (10 minutes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;random_generator=tausworthe64&lt;/strong&gt;: Defines the random number generator algorithm used by FIO.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;group_reporting&lt;/strong&gt;: Enables reporting at the job group level instead of individual job level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gtod_reduce=1&lt;/strong&gt;: Activates gettimeofday reduction, which reduces the overhead of timing-related operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[drive]&lt;/strong&gt;: Ends the previous section and starts a job named “drive“.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;filename=/dev/nvme1n1&lt;/strong&gt;: Specifies the target device or file to perform I/O operations on. In this case, it's set to the block device /dev/nvme1n1, likely an NVMe SSD.&lt;/li&gt;
&lt;/ul&gt;
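The offset_increment arithmetic above can be made concrete: with offset_increment=2%, thread N starts at N x 2% of the device size, so the two jobs in this example never touch the same stripes. A sketch of the computation (the 1 TiB device size is illustrative):

```shell
# start LBA offset per thread: offset_increment * thread_number
dev_bytes=$(( 1024 * 1024 * 1024 * 1024 ))   # hypothetical 1 TiB target device
pct=2                                         # offset_increment=2%
for thread in 0 1; do                         # numjobs=2 as in the example config
  off=$(( dev_bytes / 100 * pct * thread ))
  echo "thread $thread starts at byte $off"
done
```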

&lt;p&gt;More information on configuring fio can be found in the man-page of this program.&lt;/p&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!&lt;/p&gt;

&lt;p&gt;Original article can be found &lt;a href="https://xinnor.io/blog/performance-guide-pt-1-performance-characteristics-and-how-it-can-be-measured/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>testing</category>
      <category>data</category>
      <category>discuss</category>
    </item>
    <item>
      <title>High-Performance, Highly Available Lustre Solution with xiRAID 4.1 on Dual-Node Shared NVMe</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Fri, 21 Mar 2025 05:28:31 +0000</pubDate>
      <link>https://dev.to/pltnvs/high-performance-highly-available-lustre-solution-with-xiraid-41-on-dual-node-shared-nvme-n6h</link>
      <guid>https://dev.to/pltnvs/high-performance-highly-available-lustre-solution-with-xiraid-41-on-dual-node-shared-nvme-n6h</guid>
      <description>&lt;p&gt;This comprehensive guide demonstrates how to create a robust, high-performance Lustre file system using xiRAID Classic 4.1 and Pacemaker on an SBB platform. We'll walk through the entire process, from system layout and hardware configuration to software installation, cluster setup, and performance tuning. By leveraging dual-ported NVMe drives and advanced clustering techniques, we'll achieve a highly available storage solution capable of delivering impressive read and write speeds. Whether you're building a new Lustre installation or looking to expand an existing one, this article provides a detailed roadmap for creating a cutting-edge, fault-tolerant parallel file system suitable for demanding high-performance computing environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  System layout
&lt;/h2&gt;

&lt;p&gt;xiRAID Classic 4.1 supports integration of RAIDs into Pacemaker-based HA clusters. This allows users who require clustering of their services to benefit from xiRAID Classic's performance and reliability.&lt;/p&gt;

&lt;p&gt;This article describes using an NVMe SBB system (a single box with two x86-64 servers and a set of shared NVMe drives) as a basic Lustre parallel filesystem HA-cluster with data placed on clustered RAIDs based on xiRAID Classic 4.1.&lt;/p&gt;

&lt;p&gt;This article will familiarize you with how to deploy xiRAID Classic for a real-life task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre server SBB Platform
&lt;/h2&gt;

&lt;p&gt;We will use Viking VDS2249R as the SBB platform. The configuration details are presented in the table below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74guc2omfkzq3qqxd3m1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74guc2omfkzq3qqxd3m1.jpg" alt="Image1" width="666" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Viking VDS2249R&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Node 0&lt;/th&gt;
&lt;th&gt;Node 1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hostname&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;node26&lt;/td&gt;
&lt;td&gt;node27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AMD EPYC 7713P 64-Core&lt;/td&gt;
&lt;td&gt;AMD EPYC 7713P 64-Core&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS drives&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2 x Samsung SSD 970 EVO Plus 250GB mirrored&lt;/td&gt;
&lt;td&gt;2 x Samsung SSD 970 EVO Plus 250GB mirrored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rocky Linux 8.9&lt;/td&gt;
&lt;td&gt;Rocky Linux 8.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IPMI address&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;192.168.64.106&lt;/td&gt;
&lt;td&gt;192.168.67.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IPMI login&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;admin&lt;/td&gt;
&lt;td&gt;admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IPMI password&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;admin&lt;/td&gt;
&lt;td&gt;admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Management NIC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;enp194s0f0: 192.168.65.26/24&lt;/td&gt;
&lt;td&gt;enp194s0f0: 192.168.65.27/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cluster Heartbeat NIC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;enp194s0f1: 10.10.10.1&lt;/td&gt;
&lt;td&gt;enp194s0f1: 10.10.10.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infiniband LNET HDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ib0: 100.100.100.26&lt;/td&gt;
&lt;td&gt;ib0: 100.100.100.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infiniband LNET HDR (2nd port)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ib3: 100.100.100.126&lt;/td&gt;
&lt;td&gt;ib3: 100.100.100.127&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NVMes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;24 x Kioxia CM6-R 3.84TB KCM61RUL3T84&lt;/td&gt;
&lt;td&gt;24 x Kioxia CM6-R 3.84TB KCM61RUL3T84&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  System configuration and tuning
&lt;/h2&gt;

&lt;p&gt;Before software installation and configuration, we need to prepare the platform to provide optimal performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance tuning
&lt;/h2&gt;

&lt;p&gt;Apply the accelerator-performance tuned profile on both nodes:&lt;br&gt;
&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuned-adm profile accelerator-performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Network configuration
&lt;/h2&gt;

&lt;p&gt;Check that all IP addresses are resolvable from both hosts. In our case, we will use resolving via the hosts file, so we have the following content in /etc/hosts on both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.65.26 node26
192.168.65.27 node27
10.10.10.1 node26-ic
10.10.10.2 node27-ic
192.168.64.50 node26-ipmi
192.168.64.76 node27-ipmi
100.100.100.26 node26-ib
100.100.100.27 node27-ib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Policy-based routing setup
&lt;/h2&gt;

&lt;p&gt;We use a multirail configuration on the servers: two IB interfaces on each server are configured to work in the same IPv4 networks. To make the Linux IP stack work properly in this configuration, we need to set up policy-based routing on both servers for these interfaces.&lt;/p&gt;

&lt;p&gt;node26 setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# nmcli connection modify ib0 ipv4.route-metric 100
node26# nmcli connection modify ib3 ipv4.route-metric 101
node26# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.26 table=100"
node26# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.26 table 100"
node26# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.126 table=200"
node26# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.126 table 200"
node26# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node26# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;node27 setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# nmcli connection modify ib0 ipv4.route-metric 100
node27# nmcli connection modify ib3 ipv4.route-metric 101
node27# nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=100.100.100.27 table=100"
node27# nmcli connection modify ib0 ipv4.routing-rules "priority 101 from 100.100.100.27 table 100"
node27# nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=100.100.100.127 table=200"
node27# nmcli connection modify ib3 ipv4.routing-rules "priority 102 from 100.100.100.127 table 200"
node27# nmcli connection up ib0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
node27# nmcli connection up ib3
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
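
&lt;p&gt;The per-node commands above differ only in the source addresses, so they can be wrapped in a small helper. The sketch below is ours, not part of the official setup: the setup_multirail function name and the DRY_RUN switch are assumptions for illustration. With DRY_RUN=1 it only prints the nmcli commands instead of executing them, which is handy for reviewing the changes before applying them:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: generate the policy-based-routing nmcli commands for one node.
# Usage: setup_multirail IB0_ADDR IB3_ADDR   (e.g. 100.100.100.26 100.100.100.126)
# DRY_RUN=1 prints the commands instead of executing them (helper is ours).
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi; }

setup_multirail() {
    ib0_addr=$1 ib3_addr=$2
    run nmcli connection modify ib0 ipv4.route-metric 100
    run nmcli connection modify ib3 ipv4.route-metric 101
    run nmcli connection modify ib0 ipv4.routes "100.100.100.0/24 src=$ib0_addr table=100"
    run nmcli connection modify ib0 ipv4.routing-rules "priority 101 from $ib0_addr table 100"
    run nmcli connection modify ib3 ipv4.routes "100.100.100.0/24 src=$ib3_addr table=200"
    run nmcli connection modify ib3 ipv4.routing-rules "priority 102 from $ib3_addr table 200"
    run nmcli connection up ib0
    run nmcli connection up ib3
}

# Example for node26: DRY_RUN=1 setup_multirail 100.100.100.26 100.100.100.126
```

&lt;p&gt;Running it once per node with that node's ib0 and ib3 addresses reproduces the transcripts above.&lt;/p&gt;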



&lt;h2&gt;
  
  
  NVMe drives setup
&lt;/h2&gt;

&lt;p&gt;In the SBB system, we have 24 Kioxia CM6-R 3.84TB KCM61RUL3T84 drives. They are PCIe 4.0, dual-ported, read-intensive drives with 1DWPD endurance. A single drive's performance can theoretically reach up to 6.9GB/s for sequential read and 4.2GB/s for sequential write (according to the vendor specification).&lt;/p&gt;

&lt;p&gt;In our setup, we plan to create a simple Lustre installation with sufficient performance. However, since each NVMe drive in the SBB system is connected to each server with only 2 PCIe lanes, its performance will be limited. To overcome this limitation, we will create 2 namespaces on each NVMe drive used for the Lustre OST RAIDs, and build separate RAIDs from the first namespaces and from the second namespaces. We will then configure the cluster software to run the RAIDs made from the first namespaces (and their Lustre servers) on node #0 and the RAIDs made from the second namespaces on node #1. Because Lustre distributes the workload among all OSTs, this lets each OST-bearing NVMe drive use all four of its PCIe lanes.&lt;/p&gt;
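
&lt;p&gt;A rough back-of-the-envelope calculation shows why the two-lane limit matters. Note that the per-lane figure below is an approximation derived from the PCIe 4.0 signaling rate (16 GT/s with 128b/130b encoding, before protocol overhead), not a vendor-measured number:&lt;/p&gt;

```shell
#!/bin/sh
# Hedged approximation: PCIe 4.0 is 16 GT/s per lane with 128b/130b
# encoding, i.e. roughly 1969 MB/s raw per lane before protocol overhead.
per_lane_mbs=1969
drive_read_mbs=6900          # Kioxia CM6-R sequential read (vendor spec)

two_lane_mbs=$((per_lane_mbs * 2))   # one server alone sees only 2 lanes
four_lane_mbs=$((per_lane_mbs * 4))  # both servers together: all 4 lanes

echo "2-lane ceiling per drive: ${two_lane_mbs} MB/s"
echo "4-lane ceiling per drive: ${four_lane_mbs} MB/s (drive max ${drive_read_mbs} MB/s)"
```

&lt;p&gt;With a single server per drive, the ~3.9 GB/s two-lane ceiling sits well below the drive's 6.9 GB/s read capability; splitting the namespaces across both nodes raises the ceiling above it.&lt;/p&gt;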

&lt;p&gt;Since we are deploying a simple Lustre installation, we will use a simple filesystem scheme with just one metadata server. As we will have only one metadata server, we will need only one RAID for the metadata. Because of this, we will not create two namespaces on the drives used for the MDT RAID.&lt;/p&gt;

&lt;p&gt;Here is how the NVMe drive configuration looks initially:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          21G0A046T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme1n1          21G0A04BT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme10n1         21G0A04ET2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme11n1         21G0A045T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme12n1         S59BNM0R702322Z      Samsung SSD 970 EVO Plus 250GB           1           8.67  GB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme13n1         21G0A04KT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme14n1         21G0A047T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme15n1         21G0A04CT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme16n1         11U0A00KT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme17n1         21G0A04JT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme18n1         21G0A048T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme19n1         S59BNM0R702439A      Samsung SSD 970 EVO Plus 250GB           1         208.90  kB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme2n1          21G0A041T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme20n1         21G0A03TT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme21n1         21G0A04FT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme22n1         21G0A03ZT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme23n1         21G0A04DT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme24n1         21G0A03VT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme25n1         21G0A044T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme3n1          21G0A04GT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme4n1          21G0A042T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme5n1          21G0A04HT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme6n1          21G0A049T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme7n1          21G0A043T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme8n1          21G0A04AT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme9n1          21G0A03XT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Samsung drives are used for the operating system installation.&lt;/p&gt;

&lt;p&gt;Let's reserve the /dev/nvme0 and /dev/nvme1 drives for the metadata RAID1. Currently, xiRAID does not support spare pools in a cluster configuration, but having a spare drive is very useful for quick manual drive replacement. So, let's also reserve /dev/nvme3 as a spare for the RAID1 and split all other KCM61RUL3T84 drives into 2 namespaces.&lt;/p&gt;

&lt;p&gt;Let’s take /dev/nvme4 as an example. All other drives will be split in exactly the same way.&lt;/p&gt;

&lt;p&gt;Check the maximum possible size of the drive to be sure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme id-ctrl /dev/nvme4 | grep -i tnvmcap
tnvmcap : 3840755982336
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the maximum number of namespaces supported by the drive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme id-ctrl /dev/nvme4 | grep ^nn
nn : 64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the controller ID used for the drive connection on each server (they will differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x1

node26# nvme id-ctrl /dev/nvme4 | grep ^cntlid
cntlid : 0x2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to calculate the size of the namespaces we are going to create. The real size of the drive in 4K blocks is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3840755982336/4096=937684566
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, each namespace size in 4K blocks will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;937684566/2=468842283
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In fact, it is not possible to create 2 namespaces of exactly this size because of the NVMe internal architecture. So, we will create namespaces of 468700000 blocks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are building a system for write-intensive tasks, we recommend using write-intensive drives with 3DWPD endurance. If that is not possible and you have to use read-optimized drives, consider leaving some of the NVMe capacity (10-25%) unallocated by namespaces. In many cases, this brings the drive's write-performance degradation closer to that of write-intensive drives.&lt;/p&gt;
&lt;/blockquote&gt;
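
&lt;p&gt;The sizing arithmetic above, including the optional unallocated headroom from the note, can be wrapped in a small helper. This is a sketch of our own: the ns_blocks name and the rounding to a multiple of 100000 blocks are assumptions, and the result (468800000 for this drive) may still need lowering, as the exact granularity a drive accepts is model-specific (we settled on 468700000 here):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: compute a per-namespace size in 4K blocks from the drive's
# tnvmcap, optionally leaving a percentage of capacity unallocated.
# Usage: ns_blocks TNVMCAP_BYTES NS_COUNT [UNALLOCATED_PERCENT]
ns_blocks() {
    tnvmcap=$1 count=$2 op=${3:-0}
    blocks=$((tnvmcap / 4096))                 # drive size in 4K blocks
    usable=$((blocks * (100 - op) / 100))      # apply optional headroom
    per_ns=$((usable / count))
    # Round down to a multiple of 100000 blocks (assumed granularity;
    # the drive may require a different alignment).
    echo $((per_ns / 100000 * 100000))
}

ns_blocks 3840755982336 2       # prints 468800000
ns_blocks 3840755982336 2 20    # with 20% left unallocated: 375000000
```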

&lt;p&gt;As a first step, remove the existing namespace on one of the nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# nvme delete-ns /dev/nvme4 -n 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, create namespaces on the same node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:1
node26# nvme create-ns /dev/nvme4 --nsze=468700000 --ncap=468700000 -b=4096 --dps=0 -m 1
create-ns: Success, created nsid:2
node26# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x2
attach-ns: Success, nsid:1
node26# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x2
attach-ns: Success, nsid:2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach the namespaces on the second node with the proper controller:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# nvme attach-ns /dev/nvme4 --namespace-id=1 -controllers=0x1
attach-ns: Success, nsid:1
node27# nvme attach-ns /dev/nvme4 --namespace-id=2 -controllers=0x1
attach-ns: Success, nsid:2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like this on both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme list |grep nvme4
/dev/nvme4n1          21G0A042T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme4n2          21G0A042T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
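
&lt;p&gt;Since every remaining data drive repeats the same delete/create/attach sequence, it can be scripted. This is a hedged sketch of our own: the split_drive helper, the drive list, and the DRY_RUN switch are illustrative assumptions (the flags are the long-form equivalents of the ones used above). Run the attach step on the second node as well, with that node's own cntlid:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: split one drive into two 468700000-block namespaces and attach
# them to this node's controller. DRY_RUN=1 prints the commands instead
# of executing them (the helper and switch are ours, for illustration).
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi; }

split_drive() {
    dev=$1 cntlid=$2
    run nvme delete-ns "$dev" -n 1
    for nsid in 1 2; do
        run nvme create-ns "$dev" --nsze=468700000 --ncap=468700000 \
            --block-size=4096 --dps=0 --nmic=1
        run nvme attach-ns "$dev" --namespace-id="$nsid" --controllers="$cntlid"
    done
}

# Example for node26 (cntlid 0x2), skipping nvme0/nvme1 (MDT mirror) and
# nvme3 (spare); extend the list to all remaining data drives:
# for d in 5 6 7 8 9; do split_drive /dev/nvme$d 0x2; done
```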



&lt;p&gt;All other drives were split in the same way. Here is the resulting configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          21G0A046T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme1n1          21G0A04BT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme10n1         21G0A04ET2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme10n2         21G0A04ET2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme11n1         21G0A045T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme11n2         21G0A045T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme12n1         S59BNM0R702322Z      Samsung SSD 970 EVO Plus 250GB           1           8.67  GB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme13n1         21G0A04KT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme13n2         21G0A04KT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme14n1         21G0A047T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme14n2         21G0A047T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme15n1         21G0A04CT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme15n2         21G0A04CT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme16n1         11U0A00KT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme16n2         11U0A00KT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme17n1         21G0A04JT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme17n2         21G0A04JT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme18n1         21G0A048T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme18n2         21G0A048T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme19n1         S59BNM0R702439A      Samsung SSD 970 EVO Plus 250GB           1         208.90  kB / 250.06  GB    512   B +  0 B   2B2QEXM7
/dev/nvme2n1          21G0A041T2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme20n1         21G0A03TT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme20n2         21G0A03TT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme21n1         21G0A04FT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme21n2         21G0A04FT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme22n1         21G0A03ZT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme22n2         21G0A03ZT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme23n1         21G0A04DT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme23n2         21G0A04DT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme24n1         21G0A03VT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme24n2         21G0A03VT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme25n1         21G0A044T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme25n2         21G0A044T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme3n1          21G0A04GT2G8         KCM61RUL3T84                             1           0.00   B /   3.84  TB      4 KiB +  0 B   0106
/dev/nvme4n1          21G0A042T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme4n2          21G0A042T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme5n1          21G0A04HT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme5n2          21G0A04HT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme6n1          21G0A049T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme6n2          21G0A049T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme7n1          21G0A043T2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme7n2          21G0A043T2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme8n1          21G0A04AT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme8n2          21G0A04AT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme9n1          21G0A03XT2G8         KCM61RUL3T84                             1           0.00   B /   1.92  TB      4 KiB +  0 B   0106
/dev/nvme9n2          21G0A03XT2G8         KCM61RUL3T84                             2           0.00   B /   1.92  TB      4 KiB +  0 B   0106
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Software components installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lustre installation
&lt;/h3&gt;

&lt;p&gt;Create the Lustre repo file /etc/yum.repos.d/lustre-repo.repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lustre-server]
name=lustre-server
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/server
# exclude=*debuginfo*
gpgcheck=0

[lustre-client]
name=lustre-client
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8.9/client
# exclude=*debuginfo*
gpgcheck=0

[e2fsprogs-wc]
name=e2fsprogs-wc
baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el8
# exclude=*debuginfo*
gpgcheck=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Installing e2fs tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yum --nogpgcheck --disablerepo=* --enablerepo=e2fsprogs-wc install e2fsprogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Installing Lustre kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yum --nogpgcheck --disablerepo=baseos,extras,updates --enablerepo=lustre-server install kernel kernel-devel kernel-headers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reboot to the new kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the kernel version after reboot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# uname -a
Linux node26 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Installing Lustre server components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yum --nogpgcheck --enablerepo=lustre-server,ha install kmod-lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount lustre lustre-resource-agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that the Lustre modules load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@node26 ~]# modprobe -v lustre
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/libcfs.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/net/lnet.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/obdclass.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/ptlrpc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fld.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/fid.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/osc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lov.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/mdc.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lmv.ko
insmod /lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/extra/lustre/fs/lustre.ko
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unload modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# lustre_rmmod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing xiRAID Classic 4.1
&lt;/h2&gt;

&lt;p&gt;Install xiRAID Classic 4.1 on both nodes from the repositories, following the Xinnor xiRAID 4.1.0 Installation Guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install -y epel-release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install https://pkg.xinnor.io/repository/Repository/xiraid/el/8/kver-4.18/xiraid-repo-1.1.0-446.kver.4.18.noarch.rpm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install xiraid-release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pacemaker installation
&lt;/h2&gt;

&lt;p&gt;Run the following steps on both nodes.&lt;/p&gt;

&lt;p&gt;Enable the cluster repositories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum config-manager --set-enabled ha appstream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Installing the cluster packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install pcs pacemaker psmisc policycoreutils-python3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Csync2 installation
&lt;/h2&gt;

&lt;p&gt;Since we are installing the system on Rocky Linux 8, there is no need to compile Csync2 from sources ourselves. Just install the Csync2 package from the Xinnor repository on both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install csync2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  NTP server installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install chrony
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  HA cluster setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Time synchronisation setup
&lt;/h3&gt;

&lt;p&gt;Modify the /etc/chrony.conf file if needed to point at the proper NTP servers. In this setup, we will use the default settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemctl enable --now chronyd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that time synchronization works properly by running chronyc tracking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pacemaker cluster creation
&lt;/h2&gt;

&lt;p&gt;In this chapter, we describe the cluster configuration. In our cluster, we use a dedicated network as the cluster interconnect. Physically, it is a single direct connection (a dedicated Ethernet cable without any switch) between the enp194s0f1 interfaces of the two servers. The cluster interconnect is a critical component of any HA cluster, so its reliability must be high. A Pacemaker-based cluster can be configured with two interconnect networks for improved reliability through redundancy. While we will use a single interconnect network here, consider a dual network interconnect for your own projects if needed.&lt;/p&gt;

&lt;p&gt;Set the firewall to allow pacemaker software to work (on both nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# firewall-cmd --add-service=high-availability
# firewall-cmd --permanent --add-service=high-availability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the same password for the hacluster user at both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# passwd hacluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the cluster software at both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemctl start pcsd.service
# systemctl enable pcsd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authenticate the cluster nodes from one node by their interconnect interfaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs host auth node26-ic node27-ic -u hacluster
Password:
node26-ic: Authorized
node27-ic: Authorized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create and start the cluster (start at one node):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs cluster setup lustrebox0 node26-ic node27-ic
No addresses specified for host 'node26-ic', using 'node26-ic'
No addresses specified for host 'node27-ic', using 'node27-ic'
Destroying cluster on hosts: 'node26-ic', 'node27-ic'...
node26-ic: Successfully destroyed cluster
node27-ic: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node26-ic', 'node27-ic'
node26-ic: successful removal of the file 'pcsd settings'
node27-ic: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync authkey'
node26-ic: successful distribution of the file 'pacemaker authkey'
node27-ic: successful distribution of the file 'corosync authkey'
node27-ic: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node26-ic', 'node27-ic'
node26-ic: successful distribution of the file 'corosync.conf'
node27-ic: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

node26# pcs cluster start --all
node26-ic: Starting Cluster...
node27-ic: Starting Cluster...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the current cluster status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs status
Cluster name: lustrebox0

WARNINGS:
No stonith devices and stonith-enabled is not false

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Jul 12 20:55:53 2024 on node26-ic
  * Last change:  Fri Jul 12 20:55:12 2024 by hacluster via hacluster on node27-ic
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fencing setup
&lt;/h2&gt;

&lt;p&gt;It is very important to have properly configured, working fencing (STONITH) in any HA cluster that works with shared storage devices. In our case, the shared devices are all the NVMe namespaces we created earlier. The fencing (STONITH) design should be developed and implemented by the cluster administrator with the system's capabilities and architecture in mind. In this system, we will use fencing via IPMI. When designing and deploying your own cluster, please choose the fencing configuration yourself, considering all the possibilities, limitations, and risks.&lt;/p&gt;

&lt;p&gt;First of all, let's check the list of installed fencing agents in our system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs stonith list
fence_watchdog - Dummy watchdog fence agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, the IPMI fencing agent is not installed on our cluster nodes. To install it, run the following command (on both nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# yum install fence-agents-ipmilan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can view a description of the IPMI fencing agent options by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pcs stonith describe fence_ipmilan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the fencing resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs stonith create node27.stonith fence_ipmilan ip="192.168.67.23" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node27-ic" pcmk_host_check=static-list op monitor interval=10s
node26# pcs stonith create node26.stonith fence_ipmilan ip="192.168.64.106" auth=password password="admin" username="admin" method="onoff" lanplus=true pcmk_host_list="node26-ic" pcmk_host_check=static-list op monitor interval=10s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prevent each STONITH resource from starting on the node it is supposed to kill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs constraint location node27.stonith avoids node27-ic=INFINITY
node26# pcs constraint location node26.stonith avoids node26-ic=INFINITY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
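
&lt;p&gt;Once the STONITH resources are running, it is worth verifying that fencing actually works. One way to do this (note that it will power-cycle the target node, so only run it during commissioning) is to fence a node manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# WARNING: this reboots node27 via IPMI - commissioning test only
node26# pcs stonith fence node27-ic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;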



&lt;h2&gt;
  
  
  Csync2 configuration
&lt;/h2&gt;

&lt;p&gt;Configure the firewall to allow Csync2 to work (run on both nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# firewall-cmd --add-port=30865/tcp
# firewall-cmd --permanent --add-port=30865/tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the Csync2 configuration file /usr/local/etc/csync2.cfg with the following content on node26 only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nossl * *;
group csxiha {
host node26;
host node27;
key /usr/local/etc/csync2.key_ha;
include /etc/xiraid/raids; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# csync2 -k /usr/local/etc/csync2.key_ha
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the config and the key file to the second node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# scp /usr/local/etc/csync2.cfg /usr/local/etc/csync2.key_ha node27:/usr/local/etc/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run Csync2 synchronisation on a schedule, once per minute, run crontab -e on both nodes and add the following record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* * * * * /usr/local/sbin/csync2 -x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, to trigger synchronisation on configuration updates (in addition to the scheduled one), run the following command to create a synchronisation script (repeat the script creation procedure on both nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vi /etc/xiraid/config_update_handler.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fill the created script with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/bash
/usr/local/sbin/csync2 -xv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file.&lt;/p&gt;

&lt;p&gt;Then run the following command to make the script executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# chmod +x /etc/xiraid/config_update_handler.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
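
&lt;p&gt;To check that Csync2 is set up correctly, you can trigger a synchronisation manually and inspect its verbose output (assuming the configuration and key created above are in place on both nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# /usr/local/sbin/csync2 -xv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;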



&lt;h2&gt;
  
  
  xiRAID Configuration for cluster setup
&lt;/h2&gt;

&lt;p&gt;Disable RAID autostart to prevent RAIDs from being activated by xiRAID itself during node boot. In a cluster configuration, RAIDs must be activated by Pacemaker via cluster resources. Run the following command on both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# xicli settings cluster modify --raid_autostart 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make the xiRAID Classic 4.1 resource agent visible to Pacemaker (run this command sequence on both nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# mkdir -p /usr/lib/ocf/resource.d/xraid
# ln -s /etc/xraid/agents/raid /usr/lib/ocf/resource.d/xraid/raid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
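
&lt;p&gt;To verify that Pacemaker can now see the agent, you can ask pcs to describe it (the agent name ocf:xraid:raid follows from the directory and symlink created above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pcs resource describe ocf:xraid:raid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;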



&lt;h2&gt;
  
  
  xiRAID RAIDs creation
&lt;/h2&gt;

&lt;p&gt;To create RAIDs, we first need to install xiRAID Classic 4.1 licenses on both hosts. The licenses are provided by Xinnor; to generate them, Xinnor requires the output of the xicli license show command from both nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64

hwkey: B8828A09E09E8F48
license_key: null
version: 0
crypto_version: 0
created: 0-0-0
expired: 0-0-0
disks: 4
levels: 0
type: nvme
disks_in_use: 2
status: trial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The license files received from Xinnor need to be installed with the xicli license update -p  command (once again, on both nodes). After installation, the license status can be verified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli license show
Kernel version: 4.18.0-513.9.1.el8_lustre.x86_64

hwkey: B8828A09E09E8F48
license_key: 0F5A4B87A0FC6DB7544EA446B1B4AF5F34A08169C44E5FD119CE6D2352E202677768ECC78F56B583DABE11698BBC800EC96E556AA63E576DAB838010247678E7E3B95C7C4E3F592672D06C597045EAAD8A42CDE38C363C533E98411078967C38224C9274B862D45D4E6DED70B7E34602C80B60CBA7FDE93316438AFDCD7CBD23
version: 1
crypto_version: 1
created: 2024-7-16
expired: 2024-9-30
disks: 600
levels: 70
type: nvme
disks_in_use: 2
status: valid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we plan to deploy a small Lustre installation, combining the MGT and MDT on the same target device is absolutely fine. For medium or large Lustre installations, however, it is better to use a separate target (and RAID) for the MGT.&lt;/p&gt;

&lt;p&gt;Here is the list of the RAIDs we need to create.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;RAID Name&lt;/th&gt;
&lt;th&gt;RAID Level&lt;/th&gt;
&lt;th&gt;Number of devices&lt;/th&gt;
&lt;th&gt;Strip size&lt;/th&gt;
&lt;th&gt;Drive list&lt;/th&gt;
&lt;th&gt;Lustre target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;r_mdt0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;/dev/nvme0n1 /dev/nvme1n1&lt;/td&gt;
&lt;td&gt;MGT + MDT index=0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;/dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1&lt;/td&gt;
&lt;td&gt;OST index=0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;/dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2&lt;/td&gt;
&lt;td&gt;OST index=1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost2&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;/dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1&lt;/td&gt;
&lt;td&gt;OST index=2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;/dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2&lt;/td&gt;
&lt;td&gt;OST index=3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Creating all the RAIDs at the first node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli raid create -n r_mdt0 -l 1 -d /dev/nvme0n1 /dev/nvme1n1
node26# xicli raid create -n r_ost0 -l 6 -ss 128 -d /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme13n1 /dev/nvme14n1
node26# xicli raid create -n r_ost1 -l 6 -ss 128 -d /dev/nvme4n2 /dev/nvme5n2 /dev/nvme6n2 /dev/nvme7n2 /dev/nvme8n2 /dev/nvme9n2 /dev/nvme10n2 /dev/nvme11n2 /dev/nvme13n2 /dev/nvme14n2
node26# xicli raid create -n r_ost2 -l 6 -ss 128 -d /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1 /dev/nvme24n1 /dev/nvme25n1
node26# xicli raid create -n r_ost3 -l 6 -ss 128 -d /dev/nvme15n2 /dev/nvme16n2 /dev/nvme17n2 /dev/nvme18n2 /dev/nvme20n2 /dev/nvme21n2 /dev/nvme22n2 /dev/nvme23n2 /dev/nvme24n2 /dev/nvme25n2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, there is no need to wait for the RAID initialization to finish: it can safely be left running in the background.&lt;/p&gt;

&lt;p&gt;Checking the RAID statuses at the first node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗
║ name   ║ static           ║ state       ║ devices                ║ info              ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║                   ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║                   ║
║        ║ strip_size: 16   ║             ║                        ║                   ║
║        ║ block_size: 4096 ║             ║                        ║                   ║
║        ║ sparepool: -     ║             ║                        ║                   ║
║        ║ active: True     ║             ║                        ║                   ║
║        ║ config: True     ║             ║                        ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║ init_progress: 11 ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme5n1 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║ init_progress: 7  ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme5n2 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║ init_progress: 5  ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme16n1 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║ init_progress: 2  ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme16n2 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║                   ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking that the RAID configs were successfully replicated to the second node (please note that on the second node, the RAID status is None, which is expected in this case):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# xicli raid show
╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗
║ name   ║ static           ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ None  ║         ║      ║
║        ║ level: 1         ║       ║         ║      ║
║        ║ strip_size: 16   ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╚════════╩══════════════════╩═══════╩═════════╩══════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RAIDs are available for use immediately after creation, albeit with slightly reduced performance until initialization completes.&lt;/p&gt;

&lt;p&gt;For optimal performance, it is better to dedicate disjoint CPU core sets to each RAID. Currently, all RAIDs are active on node26, so the core sets overlap; once the RAIDs are spread between node26 and node27, they will no longer overlap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli raid modify -n r_mdt0 -ca 0-7 -se 1
node26# xicli raid modify -n r_ost0 -ca 8-67 -se 1
node26# xicli raid modify -n r_ost1 -ca 8-67 -se 1 # will be running at node27
node26# xicli raid modify -n r_ost2 -ca 68-127 -se 1
node26# xicli raid modify -n r_ost3 -ca 68-127 -se 1 # will be running at node27
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lustre setup
&lt;/h2&gt;

&lt;h2&gt;
  
  
  LNET configuration
&lt;/h2&gt;

&lt;p&gt;To make Lustre work, we need to configure the Lustre networking stack (LNET).&lt;/p&gt;

&lt;p&gt;Run on both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemctl start lnet
# systemctl enable lnet
# lnetctl net add --net o2ib0 --if ib0
# lnetctl net add --net o2ib0 --if ib3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 289478
              recv_count: 289474
              drop_count: 4
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          dev cpt: 0
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
    - net type: o2ib
      local NI(s):
        - nid: 100.100.100.26@o2ib
          status: down
          interfaces:
              0: ib0
          statistics:
              send_count: 213607
              recv_count: 213604
              drop_count: 7
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 1
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: -1
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
        - nid: 100.100.100.126@o2ib
          status: up
          interfaces:
              0: ib3
          statistics:
              send_count: 4
              recv_count: 4
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 1
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: -1
          CPT: "[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please pay attention to the LNET NIDs of the hosts. We will use 100.100.100.26@o2ib for node26 and 100.100.100.27@o2ib for node27 as the primary NIDs.&lt;/p&gt;

&lt;p&gt;Save the LNET configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# lnetctl export -b &amp;gt; /etc/lnet.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
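
&lt;p&gt;To verify LNET connectivity between the nodes, you can ping the peer's NID from each node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# lnetctl ping 100.100.100.27@o2ib
node27# lnetctl ping 100.100.100.26@o2ib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;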



&lt;h2&gt;
  
  
  LDISKFS filesystems creation
&lt;/h2&gt;

&lt;p&gt;At this step, we format the RAIDs with the LDISKFS filesystem. During formatting, we specify the target type (--mgs/--mdt/--ost), a unique index within that target type (--index), the Lustre filesystem name (--fsname), the NIDs where each target filesystem can be mounted and where the corresponding servers will be started automatically (--servicenode), and the NIDs where the MGS can be found (--mgsnode).&lt;/p&gt;

&lt;p&gt;Since our RAIDs will work within a cluster, for each target filesystem we specify the NIDs of both server nodes as the locations where the target can be mounted and where the corresponding servers will be started automatically. For the same reason, we specify two NIDs where other servers should look for the MGS service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# mkfs.lustre --mgs --mdt --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_mdt0
node26# mkfs.lustre --ost --fsname=lustre0 --index=0 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost0
node26# mkfs.lustre --ost --fsname=lustre0 --index=1 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost1
node26# mkfs.lustre --ost --fsname=lustre0 --index=2 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost2
node26# mkfs.lustre --ost --fsname=lustre0 --index=3 --servicenode=100.100.100.26@o2ib --servicenode=100.100.100.27@o2ib --mgsnode=100.100.100.26@o2ib --mgsnode=100.100.100.27@o2ib /dev/xi_r_ost3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More details can be found in the Lustre documentation.&lt;/p&gt;
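
&lt;p&gt;To double-check the parameters written to a target without modifying it, tunefs.lustre can be run in dry-run mode, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# tunefs.lustre --dryrun /dev/xi_r_mdt0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;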

&lt;h2&gt;
  
  
  Cluster resources creation
&lt;/h2&gt;

&lt;p&gt;Please check the table below. It describes the cluster resource configuration we are going to create.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;RAID name&lt;/th&gt;
&lt;th&gt;HA cluster RAID resource name&lt;/th&gt;
&lt;th&gt;Lustre target&lt;/th&gt;
&lt;th&gt;Mountpoint&lt;/th&gt;
&lt;th&gt;HA cluster filesystem resource name&lt;/th&gt;
&lt;th&gt;Preferred cluster node&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;r_mdt0&lt;/td&gt;
&lt;td&gt;rr_mdt0&lt;/td&gt;
&lt;td&gt;MGT + MDT index=0&lt;/td&gt;
&lt;td&gt;/lustre_t/mdt0&lt;/td&gt;
&lt;td&gt;fsr_mdt0&lt;/td&gt;
&lt;td&gt;node26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost0&lt;/td&gt;
&lt;td&gt;rr_ost0&lt;/td&gt;
&lt;td&gt;OST index=0&lt;/td&gt;
&lt;td&gt;/lustre_t/ost0&lt;/td&gt;
&lt;td&gt;fsr_ost0&lt;/td&gt;
&lt;td&gt;node26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost1&lt;/td&gt;
&lt;td&gt;rr_ost1&lt;/td&gt;
&lt;td&gt;OST index=1&lt;/td&gt;
&lt;td&gt;/lustre_t/ost1&lt;/td&gt;
&lt;td&gt;fsr_ost1&lt;/td&gt;
&lt;td&gt;node27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost2&lt;/td&gt;
&lt;td&gt;rr_ost2&lt;/td&gt;
&lt;td&gt;OST index=2&lt;/td&gt;
&lt;td&gt;/lustre_t/ost2&lt;/td&gt;
&lt;td&gt;fsr_ost2&lt;/td&gt;
&lt;td&gt;node26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r_ost3&lt;/td&gt;
&lt;td&gt;rr_ost3&lt;/td&gt;
&lt;td&gt;OST index=3&lt;/td&gt;
&lt;td&gt;/lustre_t/ost3&lt;/td&gt;
&lt;td&gt;fsr_ost3&lt;/td&gt;
&lt;td&gt;node27&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To create Pacemaker resources for xiRAID Classic RAIDs, we will use the xiRAID resource agent, which was installed with xiRAID Classic and made available to Pacemaker in one of the previous steps.&lt;/p&gt;

&lt;p&gt;To cluster Lustre services, there are two options, as currently two resource agents are capable of managing Lustre OSDs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ocf:heartbeat:Filesystem: Distributed by ClusterLabs in the resource-agents package, the Filesystem RA is a very mature and stable application and has been part of the Pacemaker project for many years. Filesystem provides generic support for mounting and unmounting storage devices, which indirectly includes Lustre.&lt;/li&gt;
&lt;li&gt;ocf:lustre:Lustre: Developed specifically for Lustre OSDs, this RA is distributed by the Lustre project and is available in Lustre releases from version 2.10.0 onwards. As a result of its narrower scope, it is less complex than ocf:heartbeat:Filesystem and better suited for managing Lustre storage resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For simplicity, we will use ocf:heartbeat:Filesystem in our case. However, ocf:lustre:Lustre can also be easily used in conjunction with xiRAID Classic in a Pacemaker cluster configuration. For more details on Lustre clustering, please check the Lustre documentation.&lt;/p&gt;
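
&lt;p&gt;As an illustration of the ocf:heartbeat:Filesystem approach, a resource for the MDT target from the table above might be created roughly like this (a sketch only: the resource, device, and mountpoint names follow our naming scheme, and the options should be verified with pcs resource describe ocf:heartbeat:Filesystem before use):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only - verify agent options for your environment
node26# pcs resource create fsr_mdt0 ocf:heartbeat:Filesystem device="/dev/xi_r_mdt0" directory="/lustre_t/mdt0" fstype="lustre" op monitor interval=30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;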

&lt;p&gt;First of all, create mountpoints for all the RAIDs formatted in LDISKFS at both nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# mkdir -p /lustre_t/ost3
# mkdir -p /lustre_t/ost2
# mkdir -p /lustre_t/ost1
# mkdir -p /lustre_t/ost0
# mkdir -p /lustre_t/mdt0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unload all the RAIDs on the node where they are active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗
║ name   ║ static           ║ state       ║ devices                ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║      ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║      ║
║        ║ strip_size: 16   ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: True     ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n1 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n2 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n1 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n2 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║      ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝

node26# xicli raid unload -n r_mdt0
node26# xicli raid unload -n r_ost0
node26# xicli raid unload -n r_ost1
node26# xicli raid unload -n r_ost2
node26# xicli raid unload -n r_ost3

node26# xicli raid show
╔RAIDs═══╦══════════════════╦═══════╦═════════╦══════╗
║ name   ║ static           ║ state ║ devices ║ info ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ None  ║         ║      ║
║        ║ level: 1         ║       ║         ║      ║
║        ║ strip_size: 16   ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╠════════╬══════════════════╬═══════╬═════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ None  ║         ║      ║
║        ║ level: 6         ║       ║         ║      ║
║        ║ strip_size: 128  ║       ║         ║      ║
║        ║ block_size: 4096 ║       ║         ║      ║
║        ║ sparepool: -     ║       ║         ║      ║
║        ║ active: False    ║       ║         ║      ║
║        ║ config: True     ║       ║         ║      ║
╚════════╩══════════════════╩═══════╩═════════╩══════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the first node, create a copy of the cluster information base (CIB) so that all changes can be made to the copy and applied in one batch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs cluster cib fs_cfg
node26# ls -l fs_cfg
-rw-r--r--. 1 root root 8614 Jul 20 02:04 fs_cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Getting the RAID UUIDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# grep uuid /etc/xiraid/raids/*.conf
/etc/xiraid/raids/r_mdt0.conf:    "uuid": "75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542",
/etc/xiraid/raids/r_ost0.conf:    "uuid": "AB341442-20AC-43B1-8FE6-F9ED99D1D6C0",
/etc/xiraid/raids/r_ost1.conf:    "uuid": "1441D09C-0073-4555-A398-71984E847F9E",
/etc/xiraid/raids/r_ost2.conf:    "uuid": "0E225812-6877-4344-A552-B6A408EC7351",
/etc/xiraid/raids/r_ost3.conf:    "uuid": "F749B8A7-3CC4-45A9-A61E-E75EDBB3A53E",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
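&lt;p&gt;Rather than copy-pasting each UUID into the pcs commands below, the value can be extracted from the conf file programmatically. A minimal sketch, assuming the xiRAID conf files keep the JSON-style &lt;code&gt;"uuid": "..."&lt;/code&gt; line shown above (the helper name is hypothetical):&lt;/p&gt;

```shell
# Hypothetical helper: extract the "uuid" field from a xiRAID conf file,
# so the pcs resource-creation commands can be scripted.
get_raid_uuid() {
    # $1 = path to a conf file such as /etc/xiraid/raids/r_mdt0.conf
    sed -n 's/.*"uuid": "\([^"]*\)".*/\1/p' "$1"
}

# Demonstration against a throwaway file with the same layout:
tmpconf=$(mktemp)
printf '    "uuid": "75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542",\n' > "$tmpconf"
get_raid_uuid "$tmpconf"   # prints 75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542
rm -f "$tmpconf"
```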



&lt;p&gt;Creating resource rr_mdt0 for the r_mdt0 RAID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg resource create rr_mdt0 ocf:xraid:raid name=r_mdt0 uuid=75E2CAA5-3E5B-4ED0-89E9-4BF3850FD542 op monitor interval=5s meta migration-threshold=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting a constraint to make the first node preferred for the rr_mdt0 resource (a score of 50 expresses a preference, not a mandatory placement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg constraint location rr_mdt0 prefers node26-ic=50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating a resource for the r_mdt0 RAID mountpoint at /lustre_t/mdt0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg resource create fsr_mdt0 Filesystem device="/dev/xi_r_mdt0" directory="/lustre_t/mdt0" fstype="lustre"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure the cluster to run rr_mdt0 and fsr_mdt0 on the same node only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg constraint colocation add rr_mdt0 with fsr_mdt0 INFINITY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure the cluster to start fsr_mdt0 only after rr_mdt0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg constraint order rr_mdt0 then fsr_mdt0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure other resources in the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg resource create rr_ost0 ocf:xraid:raid name=r_ost0 uuid=AB341442-20AC-43B1-8FE6-F9ED99D1D6C0 op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost0 prefers node26-ic=50
node26# pcs -f fs_cfg resource create fsr_ost0 Filesystem device="/dev/xi_r_ost0" directory="/lustre_t/ost0" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost0 with fsr_ost0 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost0 then fsr_ost0

node26# pcs -f fs_cfg resource create rr_ost1 ocf:xraid:raid name=r_ost1 uuid=1441D09C-0073-4555-A398-71984E847F9E op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost1 prefers node27-ic=50
node26# pcs -f fs_cfg resource create fsr_ost1 Filesystem device="/dev/xi_r_ost1" directory="/lustre_t/ost1" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost1 with fsr_ost1 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost1 then fsr_ost1

node26# pcs -f fs_cfg resource create rr_ost2 ocf:xraid:raid name=r_ost2 uuid=0E225812-6877-4344-A552-B6A408EC7351 op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost2 prefers node26-ic=50
node26# pcs -f fs_cfg resource create fsr_ost2 Filesystem device="/dev/xi_r_ost2" directory="/lustre_t/ost2" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost2 with fsr_ost2 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost2 then fsr_ost2

node26# pcs -f fs_cfg resource create rr_ost3 ocf:xraid:raid name=r_ost3 uuid=F749B8A7-3CC4-45A9-A61E-E75EDBB3A53E op monitor interval=5s meta migration-threshold=1
node26# pcs -f fs_cfg constraint location rr_ost3 prefers node27-ic=50
node26# pcs -f fs_cfg resource create fsr_ost3 Filesystem device="/dev/xi_r_ost3" directory="/lustre_t/ost3" fstype="lustre"
node26# pcs -f fs_cfg constraint colocation add rr_ost3 with fsr_ost3 INFINITY
node26# pcs -f fs_cfg constraint order rr_ost3 then fsr_ost3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
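&lt;p&gt;The five-command pattern repeated above can also be generated by a small function. This is a sketch under the assumption that the naming scheme stays exactly as shown (rr_*/fsr_* resources, /dev/xi_* devices, /lustre_t/* mountpoints); the function only prints the pcs commands rather than executing them:&lt;/p&gt;

```shell
# Print (not run) the five pcs commands for one Lustre target.
# Arguments: target name (e.g. ost0), RAID UUID, preferred node.
make_target_cmds() {
    local tgt=$1 uuid=$2 node=$3
    echo "pcs -f fs_cfg resource create rr_${tgt} ocf:xraid:raid name=r_${tgt} uuid=${uuid} op monitor interval=5s meta migration-threshold=1"
    echo "pcs -f fs_cfg constraint location rr_${tgt} prefers ${node}=50"
    echo "pcs -f fs_cfg resource create fsr_${tgt} Filesystem device=\"/dev/xi_r_${tgt}\" directory=\"/lustre_t/${tgt}\" fstype=\"lustre\""
    echo "pcs -f fs_cfg constraint colocation add rr_${tgt} with fsr_${tgt} INFINITY"
    echo "pcs -f fs_cfg constraint order rr_${tgt} then fsr_${tgt}"
}

# Example: reproduce the r_ost0 commands from above.
make_target_cmds ost0 AB341442-20AC-43B1-8FE6-F9ED99D1D6C0 node26-ic
```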



&lt;p&gt;In xiRAID Classic 4.1, it must be guaranteed that only one RAID is starting at any given time. To do so, we define the following pairwise Serialize constraints. This limitation is planned for removal in xiRAID Classic 4.2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost0 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost1 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_mdt0 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost1 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost0 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost1 then start rr_ost2 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost1 then start rr_ost3 kind=Serialize
node26# pcs -f fs_cfg constraint order start rr_ost2 then start rr_ost3 kind=Serialize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
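&lt;p&gt;The ten constraints above cover every pair drawn from the five RAID resources (5 choose 2 = 10). A sketch that prints them from a loop instead of typing each one by hand (it only echoes the commands; remove the echo to apply them with pcs):&lt;/p&gt;

```shell
# Emit a Serialize ordering constraint for every pair of RAID resources,
# so that no two RAIDs are ever starting at the same time.
raids=(rr_mdt0 rr_ost0 rr_ost1 rr_ost2 rr_ost3)
for ((i = 0; i < ${#raids[@]}; i++)); do
    for ((j = i + 1; j < ${#raids[@]}; j++)); do
        echo "pcs -f fs_cfg constraint order start ${raids[i]} then start ${raids[j]} kind=Serialize"
    done
done
```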



&lt;p&gt;To ensure the Lustre servers start in the proper order, we configure the cluster to start the MDS before all of the OSSes. Since the Linux kernel starts the MDS and OSS services automatically when the LDISKFS filesystems are mounted, we only need to set the proper start order for the fsr_* resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost0
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost1
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost2
node26# pcs -f fs_cfg constraint order fsr_mdt0 then fsr_ost3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying the batch cluster information base changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs cluster cib-push fs_cfg --config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking the resulting cluster configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node26-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Tue Jul 23 02:14:54 2024 on node26-ic
  * Last change:  Tue Jul 23 02:14:50 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Double-check on both nodes that the RAIDs are active and the filesystems are mounted properly. Please note that we have all the OST RAIDs based on /dev/nvme*n1 active on the first node (node26) and all the OST RAIDs based on /dev/nvme*n2 on the second one (node27), which will help us utilize the full NVMe throughput as planned.&lt;/p&gt;

&lt;p&gt;node26:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗
║ name   ║ static           ║ state       ║ devices                ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║      ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║      ║
║        ║ strip_size: 16   ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: True     ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n1 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n1 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝

node26# df -h|grep xi
/dev/xi_r_mdt0       2.1T  5.7M  2.0T   1% /lustre_t/mdt0
/dev/xi_r_ost0        14T  1.3M   14T   1% /lustre_t/ost0
/dev/xi_r_ost2        14T  1.3M   14T   1% /lustre_t/ost2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;node27:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦══════╗
║ name   ║ static           ║ state       ║ devices                ║ info ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_mdt0 ║ size: 3576 GiB   ║ None        ║                        ║      ║
║        ║ level: 1         ║             ║                        ║      ║
║        ║ strip_size: 16   ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost0 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n2 online  ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost2 ║ size: 14302 GiB  ║ None        ║                        ║      ║
║        ║ level: 6         ║             ║                        ║      ║
║        ║ strip_size: 128  ║             ║                        ║      ║
║        ║ block_size: 4096 ║             ║                        ║      ║
║        ║ sparepool: -     ║             ║                        ║      ║
║        ║ active: False    ║             ║                        ║      ║
║        ║ config: True     ║             ║                        ║      ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬══════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║      ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n2 online ║      ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║      ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║      ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║      ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║      ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║      ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║      ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║      ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║      ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩══════╝

node27# df -h|grep xi
/dev/xi_r_ost1        14T  1.3M   14T   1% /lustre_t/ost1
/dev/xi_r_ost3        14T  1.3M   14T   1% /lustre_t/ost3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lustre performance tuning
&lt;/h2&gt;

&lt;p&gt;Here we set some parameters for performance optimisation. All the commands must be run on the host where the MDS is running.&lt;/p&gt;

&lt;p&gt;Server-side parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lustre performance tuning
Here we set some parameters for the performance optimisation. All the commands have to be run at the host, where MDS server is running.

Server-side parameters:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
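&lt;p&gt;The concrete parameter values used in this setup are not preserved in this listing. For illustration only, server- and client-side Lustre tunables of the kind usually adjusted for such benchmarks look like the following; the values below are assumptions, not the ones used here:&lt;/p&gt;

```shell
# Illustrative Lustre tunables (assumed values, NOT the ones from this setup).
# Run on the MGS/MDS host; "lctl set_param -P" makes the setting persistent
# and propagates it to all servers and clients.
lctl set_param -P osc.*.max_rpcs_in_flight=32      # more concurrent RPCs per OSC
lctl set_param -P osc.*.max_dirty_mb=512           # larger per-OSC dirty cache
lctl set_param -P llite.*.max_read_ahead_mb=1024   # client read-ahead window
```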



&lt;p&gt;These parameters are tuned for the best performance in this configuration. They are not universal and may be suboptimal in other cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Testbed description
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Lustre client systems:
&lt;/h3&gt;

&lt;p&gt;The Lustre client systems are 4 servers in identical configurations connected to the same InfiniBand switch, a Mellanox Quantum HDR edge switch (QM8700). The SBB system nodes (the cluster nodes) are connected to the same switch. Lustre client parameters are modified to get the best performance; these parameter changes are commonly used by the Lustre community in modern high-performance benchmarks. More details are provided in the table below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hostname&lt;/th&gt;
&lt;th&gt;lclient00&lt;/th&gt;
&lt;th&gt;lclient01&lt;/th&gt;
&lt;th&gt;lclient02&lt;/th&gt;
&lt;th&gt;lclient03&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AMD EPYC 7502 32-Core&lt;/td&gt;
&lt;td&gt;AMD EPYC 7502 32-Core&lt;/td&gt;
&lt;td&gt;AMD EPYC 7502 32-Core&lt;/td&gt;
&lt;td&gt;AMD EPYC 7502 32-Core&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS drives&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;INTEL SSDPEKKW256G8&lt;/td&gt;
&lt;td&gt;INTEL SSDPEKKW256G8&lt;/td&gt;
&lt;td&gt;INTEL SSDPEKKW256G8&lt;/td&gt;
&lt;td&gt;INTEL SSDPEKKW256G8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rocky Linux 8.7&lt;/td&gt;
&lt;td&gt;Rocky Linux 8.7&lt;/td&gt;
&lt;td&gt;Rocky Linux 8.7&lt;/td&gt;
&lt;td&gt;Rocky Linux 8.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Management NIC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;192.168.65.50&lt;/td&gt;
&lt;td&gt;192.168.65.52&lt;/td&gt;
&lt;td&gt;192.168.65.54&lt;/td&gt;
&lt;td&gt;192.168.65.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infiniband LNET HDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100.100.100.50&lt;/td&gt;
&lt;td&gt;100.100.100.52&lt;/td&gt;
&lt;td&gt;100.100.100.54&lt;/td&gt;
&lt;td&gt;100.100.100.56&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Lustre clients are combined into a simple OpenMPI cluster, and the standard parallel filesystem benchmark, IOR, is used to run the tests. The test files are created in the /stripe4M subfolder, which was created on the Lustre filesystem with the following striping parameters (a stripe count of -1 stripes each file across all OSTs; the stripe size is 4 MiB):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01# mount -t lustre 100.100.100.26@o2ib:100.100.100.27@o2ib:/lustre0 /mnt.l
lclient01# mkdir /mnt.l/stripe4M
lclient01# lfs setstripe -c -1 -S 4M /mnt.l/stripe4M/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
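&lt;p&gt;To verify that the layout took effect, the directory's default striping can be read back with &lt;code&gt;lfs getstripe&lt;/code&gt; (a sketch; it requires the Lustre client tools and a mounted filesystem, so it is shown for reference only):&lt;/p&gt;

```shell
# Read back the default striping set on the directory
# (-d reports the directory default instead of per-file layouts).
lfs getstripe -d /mnt.l/stripe4M
```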



&lt;h2&gt;
  
  
  Test results
&lt;/h2&gt;

&lt;p&gt;We used the standard parallel filesystem benchmark, IOR, to measure the performance of the installation. In total, we ran 4 tests, each started with 128 threads spread across the 4 clients. The tests differ in transfer size (1M and 128M) and in the use of directIO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal state cluster performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tests with directIO enabled&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following listing shows the test command and results for the directIO-enabled test with a transfer size of 1 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . . 

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     19005      19005      0.006691    8388608    1024.00    0.008597   55.17      3.92       55.17      0
read      82075      82077      0.001545    8388608    1024.00    0.002592   12.78      0.213460   12.78      0
Max Write: 19005.04 MiB/sec (19928.23 MB/sec)
Max Read:  82075.33 MiB/sec (86062.22 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       19005.04   19005.04   19005.04       0.00   19005.04   19005.04   19005.04       0.00   55.17357         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        82075.33   82075.33   82075.33       0.00   82075.33   82075.33   82075.33       0.00   12.77578         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
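&lt;p&gt;IOR reports bandwidth in both MiB/s and MB/s; the two differ by a factor of 2^20/10^6 = 1.048576. A quick sketch of the conversion (using awk for the floating-point arithmetic):&lt;/p&gt;

```shell
# Convert IOR's MiB/s figure to MB/s (1 MiB = 1.048576 MB).
mib_to_mb() {
    awk -v v="$1" 'BEGIN { printf "%.2f\n", v * 1.048576 }'
}

mib_to_mb 19005.04   # the Max Write above: 19928.23 MB/sec
```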



&lt;p&gt;The following listing shows the test command and results for the directIO-enabled test with a transfer size of 128 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     52892      413.23     0.306686    8388608    131072     0.096920   19.82      0.521081   19.82      0
read      70588      551.50     0.229853    8388608    131072     0.002983   14.85      0.723477   14.85      0
Max Write: 52892.27 MiB/sec (55461.56 MB/sec)
Max Read:  70588.32 MiB/sec (74017.22 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       52892.27   52892.27   52892.27       0.00     413.22     413.22     413.22       0.00   19.82475         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        70588.32   70588.32   70588.32       0.00     551.47     551.47     551.47       0.00   14.85481         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tests with directIO disabled&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 1 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     48202      48204      0.002587    8388608    1024.00    0.008528   21.75      1.75       21.75      0
read      40960      40960      0.002901    8388608    1024.00    0.002573   25.60      2.39       25.60      0
Max Write: 48202.43 MiB/sec (50543.91 MB/sec)
Max Read:  40959.57 MiB/sec (42949.22 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       48202.43   48202.43   48202.43       0.00   48202.43   48202.43   48202.43       0.00   21.75359         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        40959.57   40959.57   40959.57       0.00   40959.57   40959.57   40959.57       0.00   25.60027         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 128 MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     46315      361.84     0.349582    8388608    131072     0.009255   22.64      2.70       22.64      0
read      39435      308.09     0.368192    8388608    131072     0.002689   26.59      7.65       26.59      0
Max Write: 46314.67 MiB/sec (48564.45 MB/sec)
Max Read:  39434.54 MiB/sec (41350.12 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       46314.67   46314.67   46314.67       0.00     361.83     361.83     361.83       0.00   22.64026         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        39434.54   39434.54   39434.54       0.00     308.08     308.08     308.08       0.00   26.59029         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failover behavior
&lt;/h2&gt;

&lt;p&gt;To check the cluster's behavior in case of a node failure, we will deliberately crash one of the nodes. Before the failure simulation, let's check the normal cluster state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Tue Aug 13 19:13:23 2024 on node26-ic
  * Last change:  Tue Aug 13 19:13:18 2024 by hacluster via hacluster on node27-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s trigger a crash on node26:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# echo c &amp;gt; /proc/sysrq-trigger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here node27 identifies that node26 is not responding and prepares to fence it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Aug 30 00:55:04 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Node node26-ic: UNCLEAN (offline)
  * Online: [ node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic (UNCLEAN)
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic (UNCLEAN)
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic (UNCLEAN)
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic (UNCLEAN)
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Stopped
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic (UNCLEAN)
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic (UNCLEAN)
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Stopping node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic (UNCLEAN)
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Pending Fencing Actions:
  * reboot of node26-ic pending: client=pacemaker-controld.286449, origin=node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, all the cluster resources come online on node27 after node26 has been successfully fenced.&lt;/p&gt;

&lt;p&gt;During the experiment, the cluster required about 1 minute 50 seconds to detect node26's absence, fence it, and start all the services in the required sequence on the surviving node27.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Fri Aug 30 00:56:30 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node27-ic ]
  * OFFLINE: [ node26-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node27-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost0             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Starting node27-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Starting node27-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Stopped
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since node26 was not shut down properly, the RAID arrays migrated to node27 are being initialized to close the write hole. This is the expected behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# xicli raid show
╔RAIDs═══╦══════════════════╦═════════════╦════════════════════════╦═══════════════════╗
║ name   ║ static           ║ state       ║ devices                ║ info              ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_mdt0 ║ size: 3576 GiB   ║ online      ║ 0 /dev/nvme0n1 online  ║                   ║
║        ║ level: 1         ║ initialized ║ 1 /dev/nvme1n1 online  ║                   ║
║        ║ strip_size: 16   ║             ║                        ║                   ║
║        ║ block_size: 4096 ║             ║                        ║                   ║
║        ║ sparepool: -     ║             ║                        ║                   ║
║        ║ active: True     ║             ║                        ║                   ║
║        ║ config: True     ║             ║                        ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost0 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n1 online  ║ init_progress: 31 ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme5n1 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n1 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n1 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n1 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n1 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost1 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme4n2 online  ║                   ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme5n2 online  ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme6n2 online  ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme7n2 online  ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme8n2 online  ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme9n2 online  ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme10n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme11n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme13n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme14n2 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost2 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n1 online ║ init_progress: 29 ║
║        ║ level: 6         ║ initing     ║ 1 /dev/nvme16n1 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n1 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n1 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n1 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n1 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n1 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n1 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n1 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n1 online ║                   ║
╠════════╬══════════════════╬═════════════╬════════════════════════╬═══════════════════╣
║ r_ost3 ║ size: 14302 GiB  ║ online      ║ 0 /dev/nvme15n2 online ║                   ║
║        ║ level: 6         ║ initialized ║ 1 /dev/nvme16n2 online ║                   ║
║        ║ strip_size: 128  ║             ║ 2 /dev/nvme17n2 online ║                   ║
║        ║ block_size: 4096 ║             ║ 3 /dev/nvme18n2 online ║                   ║
║        ║ sparepool: -     ║             ║ 4 /dev/nvme20n2 online ║                   ║
║        ║ active: True     ║             ║ 5 /dev/nvme21n2 online ║                   ║
║        ║ config: True     ║             ║ 6 /dev/nvme22n2 online ║                   ║
║        ║                  ║             ║ 7 /dev/nvme23n2 online ║                   ║
║        ║                  ║             ║ 8 /dev/nvme24n2 online ║                   ║
║        ║                  ║             ║ 9 /dev/nvme25n2 online ║                   ║
╚════════╩══════════════════╩═════════════╩════════════════════════╩═══════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failover state cluster performance
&lt;/h2&gt;

&lt;p&gt;Now all the Lustre filesystem servers are running on the surviving node. In this configuration, we expect the performance to be roughly halved, because all communication now goes through a single server. Other bottlenecks in this situation are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decreased NVMe performance: with only one server running, all the workload reaches the dual-ported NVMe drives via only 2 PCIe lanes per drive;&lt;/li&gt;
&lt;li&gt;Limited CPU resources;&lt;/li&gt;
&lt;li&gt;Limited RAM.&lt;/li&gt;
&lt;/ul&gt;
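&lt;p&gt;The reduced per-drive PCIe connectivity can be verified on the surviving node by checking the negotiated link width of the NVMe controllers with lspci. The snippet below parses a sample LnkSta line; both the line and the device address in the comment are illustrative, not taken from this system:&lt;/p&gt;

```shell
# On real hardware you would run something like:
#   lspci -vv -s 41:00.0 | grep LnkSta     (41:00.0 is a hypothetical address)
# Here we parse a sample LnkSta line to extract the negotiated link width
printf '%s\n' 'LnkSta: Speed 16GT/s (ok), Width x2 (downgraded)' |
grep -o 'Width x[0-9]*'
```

&lt;p&gt;A Width x2 reading would indicate a 2-lane path to the drive, i.e. the drive is reachable through only one of its two PCIe ports.&lt;/p&gt;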

&lt;h2&gt;
  
  
  Tests with directIO enabled
&lt;/h2&gt;

&lt;p&gt;The following listing shows the test command and results for the directIO-enabled test with a transfer size of 1MB on the system with just one node working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     17185      17185      0.007389    8388608    1024.00    0.012074   61.02      2.86       61.02      0
read      45619      45620      0.002803    8388608    1024.00    0.003000   22.99      0.590771   22.99      0
Max Write: 17185.06 MiB/sec (18019.84 MB/sec)
Max Read:  45619.10 MiB/sec (47835.10 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       17185.06   17185.06   17185.06       0.00   17185.06   17185.06   17185.06       0.00   61.01671         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        45619.10   45619.10   45619.10       0.00   45619.10   45619.10   45619.10       0.00   22.98546         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following listing shows the test command and results for the directIO-enabled test with a transfer size of 128MB on the system with just one node working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile --posix.odirect

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     30129      235.39     0.524655    8388608    131072     0.798392   34.80      1.64       34.80      0
read      35731      279.15     0.455215    8388608    131072     0.002234   29.35      2.37       29.35      0
Max Write: 30129.26 MiB/sec (31592.82 MB/sec)
Max Read:  35730.91 MiB/sec (37466.57 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       30129.26   30129.26   30129.26       0.00     235.38     235.38     235.38       0.00   34.80258         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        35730.91   35730.91   35730.91       0.00     279.15     279.15     279.15       0.00   29.34647         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tests with directIO disabled
&lt;/h2&gt;

&lt;p&gt;The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 1MB on the system with just one node working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 1M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     30967      31042      0.004072    8388608    1024.00    0.008509   33.78      7.55       33.86      0
read      38440      38441      0.003291    8388608    1024.00    0.282087   27.28      8.22       27.28      0
Max Write: 30966.96 MiB/sec (32471.21 MB/sec)
Max Read:  38440.06 MiB/sec (40307.32 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       30966.96   30966.96   30966.96       0.00   30966.96   30966.96   30966.96       0.00   33.86112         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
read        38440.06   38440.06   38440.06       0.00   38440.06   38440.06   38440.06       0.00   27.27821         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592  1048576 1048576.0 POSIX      0
Finished            : Thu Sep 12 03:18:41 2024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following listing shows the test command and results for the buffered I/O (directIO disabled) test with a transfer size of 128MB on the system with just one node working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lclient01#  /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --hostfile ./hfile -np 128 --map-by node /usr/bin/ior -F -t 128M -b 8G  -k -r -w -o /mnt.l/stripe4M/testfile

. . .

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     30728      240.72     0.515679    8388608    131072     0.010178   34.03      8.70       34.12      0
read      35974      281.05     0.386365    8388608    131072     0.067996   29.15      10.73      29.15      0
Max Write: 30727.85 MiB/sec (32220.49 MB/sec)
Max Read:  35974.24 MiB/sec (37721.72 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write       30727.85   30727.85   30727.85       0.00     240.06     240.06     240.06       0.00   34.12461         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
read        35974.24   35974.24   35974.24       0.00     281.05     281.05     281.05       0.00   29.14797         NA            NA     0    128  32    1   1     0        1         0    0      1 8589934592 134217728 1048576.0 POSIX      0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
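&lt;p&gt;Comparing the buffered 1MB write results against the normal-state figures earlier (30966.96 vs 48202.43 MiB/s), the single-node cluster delivers roughly 64% of the dual-node write throughput, so the degradation is noticeable but somewhat less severe than a strict halving. A quick check of that ratio:&lt;/p&gt;

```shell
# Ratio of failover-state to normal-state buffered 1MB write throughput,
# using the Max Write values from the two IOR listings above
awk 'BEGIN { printf "%.0f%%\n", 100 * 30966.96 / 48202.43 }'
```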



&lt;h2&gt;
  
  
  Failback
&lt;/h2&gt;

&lt;p&gt;Meanwhile, node26 has booted after the crash. In our configuration, the cluster software does not start automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs status
Error: error running crm_mon, is pacemaker running?
crm_mon: Connection to cluster failed: Connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This behavior is useful in real life: before returning a node to the cluster, the administrator can identify, localize, and fix the underlying problem to prevent it from recurring.&lt;/p&gt;
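&lt;p&gt;For reference, the active/disabled pairs in the Daemon Status sections above mean that a daemon is currently running (active) but is not enabled to start at boot (disabled), which is exactly why the cluster stack did not come back on node26 automatically. A small sketch interpreting those lines (sample input; on a live node you would pipe the corresponding pcs status lines instead):&lt;/p&gt;

```shell
# Split "daemon: current-state/boot-enablement" lines on ":", "/" and spaces
printf '%s\n' 'corosync: active/disabled' 'pacemaker: active/disabled' |
awk -F'[:/ ]+' '{ printf "%s: running=%s, autostart=%s\n", $1, $2, $3 }'
```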

&lt;p&gt;The cluster software works properly on node27:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Aug 31 01:13:57 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node27-ic ]
  * OFFLINE: [ node26-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node27-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost0             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Stopped
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we know the reason for the node26 crash, we start the cluster software there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs cluster start
Starting Cluster...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After some time, the cluster software starts, and the resources that should run on node26 are properly moved back from node27. The failback process took about 30 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node26# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Aug 31 01:15:03 2024 on node26-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
node27# pcs status
Cluster name: lustrebox0
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node27-ic (version 2.1.7-5.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Aug 31 01:15:40 2024 on node27-ic
  * Last change:  Thu Aug 29 01:26:09 2024 by root via root on node26-ic
  * 2 nodes configured
  * 12 resource instances configured

Node List:
  * Online: [ node26-ic node27-ic ]

Full List of Resources:
  * rr_mdt0             (ocf::xraid:raid):               Started node26-ic
  * fsr_mdt0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost0             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost0            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost1             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost1            (ocf::heartbeat:Filesystem):     Started node27-ic
  * rr_ost2             (ocf::xraid:raid):               Started node26-ic
  * fsr_ost2            (ocf::heartbeat:Filesystem):     Started node26-ic
  * rr_ost3             (ocf::xraid:raid):               Started node27-ic
  * fsr_ost3            (ocf::heartbeat:Filesystem):     Started node27-ic
  * node27.stonith      (stonith:fence_ipmilan):         Started node26-ic
  * node26.stonith      (stonith:fence_ipmilan):         Started node27-ic

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
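&lt;p&gt;A quick way to confirm that the resources are balanced again after failback is to count how many pcs status reports as started on each node. A minimal sketch, fed with sample lines from the listing above (in practice, pipe the pcs status output into the filter):&lt;/p&gt;

```shell
# Count resources reported as Started on each node
# (the input is a sample excerpt of the pcs status output above)
printf '%s\n' \
  '  * rr_mdt0   (ocf::xraid:raid):            Started node26-ic' \
  '  * fsr_mdt0  (ocf::heartbeat:Filesystem):  Started node26-ic' \
  '  * rr_ost1   (ocf::xraid:raid):            Started node27-ic' \
  '  * fsr_ost1  (ocf::heartbeat:Filesystem):  Started node27-ic' |
awk '/Started node26-ic/ { a++ } /Started node27-ic/ { b++ }
     END { printf "node26=%d node27=%d\n", a, b }'
```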



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article demonstrates how to build a small, highly available, high-performance Lustre installation based on an SBB system with dual-ported NVMe drives and the xiRAID Classic 4.1 RAID engine. It also demonstrates how easily xiRAID Classic integrates with Pacemaker clusters and its compatibility with the classical approach to Lustre clustering.&lt;/p&gt;

&lt;p&gt;The configuration is straightforward and requires the following software components to be installed and properly configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;xiRAID Classic 4.1 and Csync2&lt;/li&gt;
&lt;li&gt;Lustre software&lt;/li&gt;
&lt;li&gt;Pacemaker software&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting system, based on the Viking VDS2249R SBB system, equipped with two single-CPU servers and 24 PCIe 4.0 NVMe drives, showed performance up to &lt;strong&gt;55GB/s&lt;/strong&gt; on writing and up to &lt;strong&gt;86GB/s&lt;/strong&gt; on reading from Lustre clients, using the standard parallel filesystem test program IOR.&lt;/p&gt;

&lt;p&gt;This article, with minimal changes, can also be used to set up additional systems to expand existing Lustre installations.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>performance</category>
      <category>installguide</category>
      <category>lustre</category>
    </item>
    <item>
      <title>Asynchronous I/O: A Practical Guide for Optimizing HPC Workflows with xiRAID in Lustre Environments</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Wed, 19 Mar 2025 14:15:20 +0000</pubDate>
      <link>https://dev.to/pltnvs/asynchronous-io-a-practical-guide-for-optimizing-hpc-workflows-with-xiraid-in-lustre-environments-3921</link>
      <guid>https://dev.to/pltnvs/asynchronous-io-a-practical-guide-for-optimizing-hpc-workflows-with-xiraid-in-lustre-environments-3921</guid>
      <description>&lt;p&gt;In today's AI world, powerful storage is key for many GPUs working together on large datasets. These workflows involve an initial data load at high speeds, followed by data loading during training and periodic checkpoints, all at tens of gigabytes per second. Storage must handle many GPUs accessing massive datasets (hundreds of terabytes to petabytes) while delivering high throughput (tens of gigabytes per second or more) and efficient performance for small operations. This versatility is critical to prevent delays in computational clusters. Our goal is to minimize downtime and maximize efficiency.&lt;/p&gt;

&lt;p&gt;Research groups often use limited infrastructures or cloud services. The growth of AI cloud services demands a data storage system that is fast, high-capacity, and integrates seamlessly into cloud infrastructure. Ideal solutions should be software-defined, deployable on any hardware, and easy to integrate.&lt;/p&gt;

&lt;p&gt;At Xinnor, we provide high-performance storage solutions for diverse clients. Recently, we've observed a growing demand for storage tailored to HPC and AI, especially for shared file systems in smaller setups (with 1-2 DGXs or HGXs) and among cloud providers. These setups require several key features.&lt;/p&gt;

&lt;p&gt;For small standalone installations, we need to create solutions consisting of only one or two storage controllers. These solutions should be capable of delivering performance levels corresponding to a 400-gigabit network, with the potential to scale up to 800 gigabits in the future.&lt;/p&gt;

&lt;p&gt;For cloud environments, we need to create a solution that delivers approximately 20 gigabytes per second per client virtual machine. Additionally, it must ensure consistent read and write access simultaneously from multiple virtual machines, and support essential requirements such as multi-tenancy and Quality of Service (QoS).&lt;/p&gt;

&lt;p&gt;Our developments enable us to achieve the necessary performance levels while consuming minimal resources. &lt;a href="https://xinnor.io/news/324-at-supercomputer-2023-in-denver-xinnor-will-demonstrate-xiraid-breaking-the-world-record-performance-for-pcie-gen5-drives/" rel="noopener noreferrer"&gt;Recent tests with KIOXIA&lt;/a&gt; showcased xiRAID's capabilities in both RAID5 and RAID6 configurations using 24 PCIe Gen5 drives. The results were impressive, achieving near-theoretical performance with minimal CPU load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ypyoqrp1w9rtsbu2yda.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ypyoqrp1w9rtsbu2yda.PNG" alt="IMG1" width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
However, to meet all the requirements for our solution, we certainly need a parallel or clustered file system. For this, we have chosen Lustre.&lt;/p&gt;

&lt;p&gt;This blog highlights how xiRAID, combined with our Lustre tuning expertise, delivers outstanding results in Lustre environments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Why We Rely on Lustre&lt;/li&gt;
&lt;li&gt;Our Objectives&lt;/li&gt;
&lt;li&gt;Tested Architectures&lt;/li&gt;
&lt;li&gt;Test Stand Configuration&lt;/li&gt;
&lt;li&gt;Block Device Performance&lt;/li&gt;
&lt;li&gt;IOR Single Client Test Results&lt;/li&gt;
&lt;li&gt;Testing Synchronous and Asynchronous I/O Operations&lt;/li&gt;
&lt;li&gt;Lustre vs. NFSoRDMA Testing&lt;/li&gt;
&lt;li&gt;Testing Lustre in the Cloud Environment&lt;/li&gt;
&lt;li&gt;Final Thoughts&lt;/li&gt;
&lt;li&gt;Appendix&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Why We Rely on Lustre
&lt;/h2&gt;

&lt;p&gt;But first, why Lustre? At Xinnor, we rely on Lustre for our high-performance storage solutions because it offers several key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shared storage for parallel workflows. Lustre enables the creation of shared storage, allowing multiple nodes to access the same file system concurrently. This is crucial for modern AI and HPC workloads, where numerous processing units need efficient, simultaneous access to data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability for growing demands. Lustre's architecture supports linear scalability. We can seamlessly add new storage nodes to the cluster without performance degradation. This ensures our storage solutions can grow alongside our clients' increasing workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shared-disk architecture. Lustre's shared-disk architecture perfectly aligns with our approach of providing high-performance block storage solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility for Diverse Deployments. As a software solution, Lustre is deployable on any hardware and within virtualized infrastructures. This flexibility allows us to meet various performance targets across different environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond these core benefits, Lustre boasts a proven track record in handling HPC tasks with high streaming performance. However, our goal extends beyond optimizing for large block I/O. We also aim to extract maximum performance from small block I/O operations, critical for many AI applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  Our Objectives
&lt;/h2&gt;

&lt;p&gt;At Xinnor, we have several installations with a combined capacity exceeding 100 petabytes. Each Lustre project is unique, and to ensure the highest performance levels, we fine-tune each solution through a multi-stage process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hardware and software configuration: we configure drives, storage services, and OS settings. This includes software installation, testing, and RAID configuration adjustments.&lt;/li&gt;
&lt;li&gt;Lustre OSS and MDS testing: we test Lustre using I/O utilities like OBDFilter-survey and MDtest.&lt;/li&gt;
&lt;li&gt;Client-perspective testing: we employ standard HPC industry I/O tools (IOR) and FIO for testing from the client's perspective. Using FIO with asynchronous engines helps us achieve optimal efficiency at each storage stack level.&lt;/li&gt;
&lt;/ol&gt;
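
&lt;p&gt;As an illustration of stage 2, the OST backend can be surveyed directly on the OSS, bypassing clients and the network. The sketch below follows the standard obdfilter-survey invocation; the target name and thread counts are placeholders to adapt to your own setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Measure raw OST backend throughput on the OSS itself
# size: MB written per OST; nobjhi/thrhi: max objects and threads per OST
size=8192 nobjhi=2 thrhi=32 targets="lustre-OST0000" obdfilter-survey
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because this load is generated server-side, it gives an upper bound on OST performance that later client-side IOR and FIO results can be compared against.&lt;/p&gt;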

&lt;p&gt;Our performance objectives include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Achieving tens of GBps throughput from a few Lustre clients using a couple of Object Storage Servers (OSS).&lt;/li&gt;
&lt;li&gt;Attaining several million IOPS in the same configuration.&lt;/li&gt;
&lt;li&gt;Maintaining a simple hardware and software configuration for easy deployment.&lt;/li&gt;
&lt;li&gt;Developing an easily reproducible test approach for consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Benchmarking with IOR presents challenges. Buffered I/O can be CPU-intensive, while Direct I/O may create uneven storage loads. Performance scaling with additional I/O threads can be limited on HDDs and read-intensive SSDs. Increasing I/O size isn't always effective either. Therefore, one of our objectives is to demonstrate how Asynchronous I/O (AIO) helps us achieve optimal performance.&lt;/p&gt;

&lt;p&gt;We conducted a comprehensive analysis to identify the most effective configurations for HPC workflows. This included comparisons of Lustre 2.15.4 over ldiskfs, Lustre 2.15.4 over ZFS and NFSoRDMA (v3 and v4.2).&lt;/p&gt;

&lt;p&gt;These comparisons will be further explored to showcase the best performing configurations. We will demonstrate the performance benefits of asynchronous engines compared to classic Buffered I/O and Direct I/O approaches. Additionally, we will benchmark ldiskfs vs. ZFS to determine the better backend file system for different scenarios. Finally, we will compare our solutions against NFS, a competitor for smaller research groups and cloud-deployed systems.&lt;/p&gt;

&lt;p&gt;Through these comparisons and configurations, we aim to showcase the superior performance and versatility of our Lustre-based storage solutions in various HPC and AI environments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tested Architectures
&lt;/h2&gt;

&lt;p&gt;Our testing involved two primary architectures: a "Cluster-in-the-box" solution and a virtualized solution, both designed to assess the performance of Lustre in various scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f7one2rebtkvqkyg8ou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f7one2rebtkvqkyg8ou.png" alt="img2" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Cluster-in-the-box architecture&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first architecture we are testing is the "Cluster-in-the-box" solution. This high-performance system is designed for small installations, offering a fully integrated and fault-tolerant setup within a single enclosure. Despite its compact form factor, it supports multiple clients, allowing them to read and write data consistently. Furthermore, leveraging Lustre's capabilities, it can easily scale if the need arises. This solution is ideal for small research groups, combining the convenience of a plug-and-play setup with the high performance typically associated with larger Lustre deployments. Unlike traditional Lustre clusters, which require separate OST/OSS/MDS/MGS components that need to be interconnected, our "Cluster in the Box" offers a streamlined, all-in-one package that significantly simplifies deployment and management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxshjl89uq9q49cshyh30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxshjl89uq9q49cshyh30.png" alt="Img3" width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Virtualized solution architecture&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second architecture is a virtualized solution, suitable for deployment in cloud environments, whether private or public. This approach caters to the growing demand for flexible, cloud-based storage solutions. Unlike most clients who use traditional bare-metal distributed systems, our virtualized architecture stands out by providing a robust and scalable solution within a virtual environment. This setup not only supports the performance needs of HPC and AI workloads but also ensures seamless integration into existing cloud infrastructures, offering a modern alternative to conventional bare-metal installations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Test Stand Configuration
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Hardware configuration:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPU: 64-Core Processor per node (AMD 7702P)&lt;/li&gt;
&lt;li&gt;Memory: 256 GB RAM per Node&lt;/li&gt;
&lt;li&gt;Networking: 1 x MT28908 Family [ConnectX-6] per node&lt;/li&gt;
&lt;li&gt;Drives: 24x KIOXIA CM6-R 3.84TB: 1.6TB namespace per node&lt;/li&gt;
&lt;li&gt;The clients are based on the same hardware and Rocky Linux 9&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Software configuration:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rocky Linux 8 with Lustre 2.15.4.&lt;/li&gt;
&lt;li&gt;RAID: 4 × RAID 6 (10 drives each, 8d+2p), strip size (ss) = 64k for OSS&lt;/li&gt;
&lt;li&gt;2x RAID1 for MGS and MDS&lt;/li&gt;
&lt;/ul&gt;
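
&lt;p&gt;The RAID layout above can be created with the xicli utility along these lines. This is a sketch only: the device paths are placeholders, and the exact flag names may differ between xiRAID versions, so consult the xiRAID documentation for your release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# One of four RAID 6 data volumes: 10 drives (8 data + 2 parity), 64k strip size
xicli raid create -n xi_data1 -l 6 -ss 64 \
  -d /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
     /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1

# RAID 1 mirror for MGS/MDS metadata
xicli raid create -n xi_meta1 -l 1 -d /dev/nvme20n1 /dev/nvme21n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;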
&lt;h2&gt;
  
  
  Block Device Performance
&lt;/h2&gt;

&lt;p&gt;To start, we created a storage array consisting of four RAID6 volumes for data, with two arrays on each node, and two RAID1 volumes for the MGS/MDS. Our initial focus was to test the block device performance of these RAID6 arrays to establish a baseline for the potential performance of the file system within this compact unit. The results were impressive: millions of IOPS for 4K random reads and writes, and up to 93.4 GBps for sequential reads with multiple jobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FIO configuration 1&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
bs=4k
rw=randread/randwrite
norandommap
direct=1
group_reporting
random_generator=lfsr
time_based=1
runtime=60
iodepth=128
ioengine=libaio
[file1]
filename=/dev/xi_data1
[file2]
filename=/dev/xi_data1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmut4fhgejm9a9qgxajc.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmut4fhgejm9a9qgxajc.PNG" alt="IMG4" width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FIO configuration 2&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
bs=1024k
rw=read/write
direct=1
group_reporting
time_based=1
runtime=60
iodepth=32
ioengine=libaio
numjobs=1
offset_increment=3%
[file1]
filename=/dev/xi_data1
[file2]
filename=/dev/xi_data1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevvtyava1skitojrjvf1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevvtyava1skitojrjvf1.PNG" alt="IMG5" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When running multiple jobs, which is naturally our target workload, we observe that for random reads, we approach 12.5 million IOPS, and for random writes, we reach 2 million IOPS. Under streaming workloads, we can achieve up to 93 gigabytes per second for reads and 67 gigabytes per second for writes.&lt;/p&gt;

&lt;h2&gt;
  
  
  IOR Single Client Test Results
&lt;/h2&gt;

&lt;p&gt;Next, we deployed the file system with default parameters and conducted the following round of tests. We conducted an IOR test with a single client to assess potential performance. For this test, the client was connected via a 200 Gbit port. The theoretical maximum throughput for streaming read/write operations is approximately 22-24 GBps, and for random operations, around 4.2M IOPS, which represents the maximum capacity of the 200 Gbit connection.&lt;/p&gt;

&lt;p&gt;Our IOR tests provided valuable insights into the performance differences between Direct I/O (DIO) and Buffered I/O in various scenarios. Here's a detailed breakdown of the results.&lt;/p&gt;
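
&lt;p&gt;The IOR runs in this round followed the pattern sketched below; the process count, file path, and block sizes are illustrative rather than the exact values from our test scripts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Large sequential test: 64 MiB transfers, file per process, O_DIRECT (-B)
mpirun -np 32 ior -w -r -t 64m -b 4g -F -B -o /mnt/lustre/ior_testfile

# Small random test: 4k transfers to random offsets (-z)
mpirun -np 32 ior -w -r -t 4k -b 1g -F -B -z -o /mnt/lustre/ior_testfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dropping the -B flag switches the same runs from Direct I/O to Buffered I/O, which is how the two sets of results below were produced.&lt;/p&gt;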

&lt;h2&gt;
  
  
  Direct I/O (DIO) Performance
&lt;/h2&gt;

&lt;p&gt;Large Sequential I/O Operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;64M Write Operations: 13,053 MiB/s.&lt;/li&gt;
&lt;li&gt;64M Read Operations: 12,288 MiB/s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small Random I/O Operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4k Write Operations: 6542 IOPS.&lt;/li&gt;
&lt;li&gt;4k Read Operations: 6742 IOPS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These results indicate that DIO provides stable and consistent performance for both large sequential and small random I/O operations. This stability is essential for applications requiring predictable I/O behavior, making DIO a reliable choice for demanding environments.&lt;/p&gt;

&lt;p&gt;However, it is worth noting that the performance for small blocks is extremely low.&lt;/p&gt;

&lt;p&gt;The load on the CPU is noticeable but not critical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2kdak0b9hrr2leizzz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2kdak0b9hrr2leizzz8.png" alt="IMG5" width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Buffered I/O Performance
&lt;/h2&gt;

&lt;p&gt;Large Sequential I/O Operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;64M Write Operations: 3,874 MiB/s.&lt;/li&gt;
&lt;li&gt;64M Read Operations: 12,757 MiB/s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small Random I/O Operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4k Write Operations: 7359 IOPS.&lt;/li&gt;
&lt;li&gt;4k Read Operations: 556629 IOPS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vsi3moqzplqewjjshi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vsi3moqzplqewjjshi4.png" alt="img6" width="800" height="47"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One notable observation was the variability in buffered I/O write results, ranging from 2,000 to 28,000 IOPS. Additionally, CPU load fluctuated significantly during these operations, spanning from 6% to 100%. This variability and high CPU load highlight the challenges of using buffered I/O in environments requiring consistent performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions from Single Client Tests
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;With the existing approach we can meet the required performance for large sequential I/Os, achieving a single-client performance peak of approximately 13 GBps (about half of the potential 24 GBps).&lt;/li&gt;
&lt;li&gt;DIO proves to be more stable compared to buffered I/O in these scenarios, making it a preferred choice for large sequential workloads.&lt;/li&gt;
&lt;li&gt;The performance for small random I/Os remains far from the potential performance of the block device.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Testing Synchronous (SYNC) and Asynchronous I/O (AIO) operations
&lt;/h2&gt;

&lt;p&gt;We often use the FIO utility because it offers extensive capabilities for regulating load and selecting different io engines for data access. This allows us to evaluate the system’s behavior under various types of workloads.&lt;/p&gt;

&lt;p&gt;We conducted tests with various parameters to understand their impact on performance:&lt;/p&gt;

&lt;p&gt;4k random reads and writes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tested with a fixed numjobs=1 and variable iodepth.&lt;/li&gt;
&lt;li&gt;Also tested with numjobs=32 and variable iodepth.&lt;/li&gt;
&lt;li&gt;Tested with a fixed iodepth=1 and variable numjobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1M sequential reads and writes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conducted with fixed numjobs=1 and variable iodepth.&lt;/li&gt;
&lt;li&gt;Also explored numjobs=32 and variable iodepth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I/O engines used: libaio, io_uring, and sync. We also used variable Lustre client settings and Lustre OSS settings.&lt;/p&gt;
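
&lt;p&gt;A representative client-side fio invocation for the 4k random read case looks like this; the mount point is a placeholder, and the same command with ioengine=sync (and iodepth=1) produces the SYNC series:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ASYNC series: io_uring (or libaio) with a deep queue per job
fio --name=lustre-randread --directory=/mnt/lustre \
    --rw=randread --bs=4k --ioengine=io_uring --direct=1 \
    --iodepth=32 --numjobs=32 --size=10g \
    --time_based --runtime=60 --group_reporting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;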

&lt;p&gt;Configurations description for testing Fio in distributed environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OPT_OSS – optimized OSS/OST settings, with no changing Clients settings&lt;/li&gt;
&lt;li&gt;OPT_CL1 – optimized Client with max_rpcs_in_flight = 1 (lctl set_param osc.*.max_pages_per_rpc=4096 osc.*.checksums=0 osc.*.max_rpcs_in_flight=1)&lt;/li&gt;
&lt;li&gt;OPT_CL128 – optimized Client with max_rpcs_in_flight = 128 (lctl set_param osc.*.max_pages_per_rpc=4096 osc.*.checksums=0 osc.*.max_rpcs_in_flight=128)&lt;/li&gt;
&lt;li&gt;ASYNC – ioengine=libaio/io_uring&lt;/li&gt;
&lt;li&gt;SYNC – ioengine=sync&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Important! In this section, we do not differentiate between io_uring and libaio because we did not observe any difference between them in our tests. Therefore, we labeled both engines as “async” on the graphs. Also, we do not address sequential workloads here; we will address them later when we compare with NFS.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Testing Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;4k Random Reads, numjobs=1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ji8ovg6mihcz68sysu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ji8ovg6mihcz68sysu.png" alt="graph1" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When running a single job, increasing the number of in-flight requests (I/O depth) significantly improved performance for the async engine. The sync engine, which always operates at a queue depth of 1, only matched the async engine once it ran 16 jobs in parallel. In other words, with the sync engine, multiple threads can substitute for queue depth, but only up to about 16 jobs; beyond that point, performance stops improving and latency increases significantly.&lt;/p&gt;

&lt;p&gt;Performance varies depending on the type of workload and the settings of both the client and server parts of the Lustre file system. This is clearly illustrated in the graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4k Random Writes, numjobs=1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ddyse3b3e5g6jet9r35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ddyse3b3e5g6jet9r35.png" alt="graph2" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar results were seen with random writes. The Lustre client performed best when max_rpcs_in_flight was set to a moderate value (between 8 and 24). Again, the sync engine matched the async engine's performance at 16 jobs, but it didn't scale well beyond that.&lt;/p&gt;

&lt;p&gt;At the same time, the impact of different settings on random write performance is more straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4k Random Reads, numjobs=32&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj1gwfeab5bxu1vd2kxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj1gwfeab5bxu1vd2kxf.png" alt="graph3" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most impressive results came from tests with 32 jobs per client. In this scenario, the async engine achieved nearly 4M IOPS, demonstrating Lustre's excellent scalability for high workloads.&lt;/p&gt;

&lt;p&gt;On the graph, we can see that the performance of Lustre under async IO workloads scales quite well with increasing load, gradually approaching half of the network interface’s capacity in terms of throughput. And, of course, the performance is significantly higher than what we observed when using DIO and Buffered IO in the IOR utility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4k Random Writes, numjobs=32&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc7njecj8b2ijsmwq797.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc7njecj8b2ijsmwq797.png" alt="graph4" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Random writes also showed near-linear improvement with increasing queue depth, reaching close to 1M IOPS. More detailed results, including latency graphs, can be found in our presentation at the LUG24 conference, &lt;a href="https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day2/LUG_2024_Talk_11-Asynchronous_IO_A_Practical_Guide_for_Optimizing_HPC_Workflows.pdf" rel="noopener noreferrer"&gt;“Asynchronous I/O: A Practical Guide for Optimizing HPC Workflows”&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Insights from Testing SYNC and AIO
&lt;/h2&gt;

&lt;p&gt;Our tests provided several key insights into the performance of synchronous (SYNC) and asynchronous I/O (AIO) operations, particularly for large and small block I/Os.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The performance difference between synchronous (SYNC) and asynchronous I/O (AIO) is minimal for large I/O operations. This observation is consistent, though not explicitly shown in the charts.&lt;/li&gt;
&lt;li&gt;For small block I/Os, the difference can be severalfold. SYNC operations scale effectively up to 16 jobs, indicating good performance within this range.&lt;/li&gt;
&lt;li&gt;We achieved 49% of the maximum potential performance for random writes. This performance level is primarily constrained by the drives' capabilities.&lt;/li&gt;
&lt;li&gt;For random reads, we reached 46% of the maximum possible performance, limited by the dual 200Gbit Host Channel Adapters (HCAs).&lt;/li&gt;
&lt;li&gt;Achieving around 50% of the storage system’s capabilities on a parallel file system is quite a good result, since a parallel file system must do significantly more work than a plain block device. These results greatly exceeded our expectations: we had repeatedly heard that Lustre does not suit this type of workload, and we have proved that wrong. We also see substantial headroom for future growth, and believe that over time we can approach 70-80% efficiency.&lt;/li&gt;
&lt;li&gt;The parameter max_rpc_in_flight has a significant impact on performance. Optimal results were observed with values ranging from 8 to 24.&lt;/li&gt;
&lt;/ul&gt;
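
&lt;p&gt;The max_rpcs_in_flight tuning mentioned above is applied on the Lustre client with lctl, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Inspect the current value for all OSC devices
lctl get_param osc.*.max_rpcs_in_flight

# Set a moderate value within the 8-24 range that performed best in our tests
lctl set_param osc.*.max_rpcs_in_flight=16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;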

&lt;h2&gt;
  
  
  Lustre vs NFSoRDMA Testing
&lt;/h2&gt;

&lt;p&gt;When building small-scale systems, there are various approaches to consider. For instance, with Lustre, you can create a fully open-source solution using the ZFS file system as the backend. ZFS offers integrated RAID management and a volume manager, and it integrates well with Lustre. This makes it a robust option for those who prefer open-source solutions.&lt;/p&gt;

&lt;p&gt;Alternatively, you can build a system based on NFS instead of Lustre. NFS has its own advantages, such as having a built-in client in the Linux kernel, which means you are not dependent on specific versions compatible with the Lustre client. This plug-and-play nature is a significant benefit. However, the primary downside is the lack of an open-source NFS implementation that matches the comprehensive functionality of Lustre. While there is an open-source NFS server, it is mainly designed for testing and does not offer the same level of scalability as Lustre. Nonetheless, for small systems, particularly those with one or two controllers, NFS can still be a viable option worth testing.&lt;/p&gt;

&lt;p&gt;Here we'll present our test results for ZFS and NFS. We tested with optimal configurations; the best settings are described in the appendix. Now, let's delve into the performance graphs and detailed findings.&lt;/p&gt;

&lt;p&gt;We compared NFSoRDMA and Lustre 2.15.4 over ldiskfs and ZFS using the same testing approach. We mounted NFS with Sync and Async options as well as changed the NFS server and client settings. Our key findings include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No difference between sync and async mount options for reads.&lt;/li&gt;
&lt;li&gt;No difference between NFS3 and NFS4.1 in most cases.&lt;/li&gt;
&lt;/ul&gt;
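
&lt;p&gt;For reference, NFSoRDMA mounts of the kind used in these tests can be sketched as follows; the server name and export path are placeholders (20049 is the conventional NFS/RDMA port):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# NFSv3 over RDMA
mount -t nfs -o vers=3,rdma,port=20049 oss-server:/export /mnt/nfs3

# NFSv4.2 over RDMA
mount -t nfs -o vers=4.2,rdma,port=20049 oss-server:/export /mnt/nfs42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;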

&lt;p&gt;The graphs below show the outcome of our testing:&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre vs NFS3 vs NFS 4.2, 4k READ IOs, numjobs=1
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq99k4uuuhciqq89si4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq99k4uuuhciqq89si4v.png" alt="graph5" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benchmark results showed an interesting trend in performance between NFS and Lustre. For single-threaded workloads (number of jobs = 1), NFS exhibited better performance, but this advantage plateaued quickly at around 250K IOPS (regardless of NFS version). In contrast, Lustre's performance scaled linearly as the number of jobs increased.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre vs NFS3 vs NFS 4.2, 4k READ IOs, numjobs=32
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumbpoi17yl1qlln74f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumbpoi17yl1qlln74f3.png" alt="graph6" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After increasing the number of jobs, Lustre performance continues to scale linearly while NFS plateaus at 250K IOPS. Lustre over ZFS hits a bottleneck at around 100K IOPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre vs NFS3 vs NFS 4.2, 4k WRITE IOs, numjobs=1
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94ubnct6mk694qh2rsz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94ubnct6mk694qh2rsz5.png" alt="graph7" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar trends emerge in 4K random write tests. While NFS with the async mount option boasts the highest initial speeds, it again encounters a ceiling at 250K IOPS. Lustre maintains its linear scaling, but at a lower overall performance level. As expected, ZFS significantly lags behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre vs NFS3 vs NFS 4.2, 4k WRITE IOs, numjobs=32
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq6426kbdl02ohj83nc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq6426kbdl02ohj83nc6.png" alt="graph8" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the number of jobs increases, NFS remains capped, while Lustre continues to scale linearly. Lustre over ZFS performance also does not scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre vs NFS, 1M sequential reads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg0rbnbtchkyqy26v03b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg0rbnbtchkyqy26v03b.png" alt="graph9" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c9596qw7apo1w37m1py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c9596qw7apo1w37m1py.png" alt="graph10" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tests involving sequential workloads, including reads at both single and 32 jobs, show minimal performance differences between NFS and Lustre over ldiskfs.&lt;/p&gt;

&lt;p&gt;Note that the upper performance limit is around 45 GB/s, which is related to the performance of the network connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lustre vs NFS, 1M sequential writes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwg6oycz2dd57lj2oguo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwg6oycz2dd57lj2oguo.png" alt="graph11" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx47o8hzv73gf7eb3976.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx47o8hzv73gf7eb3976.png" alt="graph12" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The situation for sequential writes is similar, but with a single job only Lustre over ldiskfs demonstrates scalability. The same network-imposed ceiling of around 45 GB/s applies here as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions for Lustre vs NFSoRDMA Testing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Lustre easily reaches the maximum network connection performance with a total throughput of 400 Gbps when using a small number of clients, which is exactly what we need.&lt;/li&gt;
&lt;li&gt;ZFS does not allow us to achieve the required performance numbers on small block IOs.&lt;/li&gt;
&lt;li&gt;NFS over RDMA performs better at NJ=1 (a single job) and iodepth=1 but does not scale well on small random IOs.&lt;/li&gt;
&lt;li&gt;Lustre performs significantly better as the workload increases.&lt;/li&gt;
&lt;li&gt;Overall, NFS over RDMA can be considered a decent solution for sequential workloads, but it slightly lags behind Lustre in terms of write performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Testing Lustre in the Cloud Environment
&lt;/h2&gt;

&lt;p&gt;In this study, we investigated the virtualization of Lustre, specifically focusing on its OSS layer. We compared two data paths: a user-space implementation (xiRAID Opus on SPDK) and a kernel-space implementation (virtio-blk within the Linux kernel). Both configurations ran Lustre OSS with asynchronous I/O, with virtio-blk using aio=io_uring.&lt;/p&gt;
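
&lt;p&gt;As an illustration (not our exact deployment; device paths, socket paths, memory sizes and queue counts below are placeholders), the two data paths correspond to QEMU configurations along these lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Kernel-space path: virtio-blk backed by a host block device, aio=io_uring
qemu-system-x86_64 ... \
  -drive file=/dev/xi_raid6_vol,if=none,id=ost0,format=raw,cache=none,aio=io_uring \
  -device virtio-blk-pci,drive=ost0

# User-space path: vhost-user-blk served over a UNIX socket by an
# SPDK-based target such as xiRAID Opus (requires shared guest memory)
qemu-system-x86_64 ... \
  -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=vhost0,path=/var/tmp/vhost_user_blk.0 \
  -device vhost-user-blk-pci,chardev=vhost0,num-queues=4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;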

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoberkoj01onwpotrwpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoberkoj01onwpotrwpw.png" alt="img58" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this image, you can see two deployment options. On the left is the option using Opus, and on the right is the option using xiRAID Classic. As you can see, Opus uses only one core, while the virtual machines operating as Lustre OSS servers use three virtual cores each. Meanwhile, xiRAID Classic uses eight cores.&lt;/p&gt;

&lt;p&gt;For each deployment option, we had two OSS virtual machines and two RAID Protected Volumes.&lt;/p&gt;

&lt;p&gt;Interestingly, both approaches delivered similar results for sequential reads and writes, reaching around 45 GB/s (the network capacity limit). The results were as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frad0pjqrnrcb3st2hrcs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frad0pjqrnrcb3st2hrcs.PNG" alt="img75" width="732" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the Opus-based solution, however, we can achieve the same level of performance as with the xiRAID Classic-based solution while using 8 times fewer CPU resources. Overall, a single CPU core is enough to saturate a 200 Gbps interface with Opus.&lt;/p&gt;

&lt;p&gt;A significant difference emerged for random operations, though: the user-space approach scaled, while the kernel-space approach did not. This can be attributed to limitations of the kernel virtio-blk implementation.&lt;/p&gt;

&lt;p&gt;The user-space approach offers promising results for achieving scalability in random operations. We observed performance of 4M IOPS for random reads and 1M IOPS for random writes.&lt;/p&gt;

&lt;p&gt;Additionally, we fully replicated the performance results of the bare-metal solution while consuming significantly fewer CPU and memory resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3v87hzblxg58ov3s5kps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3v87hzblxg58ov3s5kps.png" alt="graph13" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43fq3og90txfovs4d3nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43fq3og90txfovs4d3nj.png" alt="graph14" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions for Lustre in Cloud Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kernel-based virtio-blk achieves good performance on large IOs.&lt;/li&gt;
&lt;li&gt;Kernel block devices exposed to VMs do not provide good performance on small block IOs.&lt;/li&gt;
&lt;li&gt;The solution is to run the block device and the vhost controller in user space, which is exactly what xiRAID Opus does.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Lustre stands out as a powerful option for data storage, even against established solutions like SAN and NFS. Lustre easily achieves performance matching the capabilities of network interfaces, 400 Gbps and above, using only two clients.&lt;/p&gt;

&lt;p&gt;Its asynchronous I/O function significantly boosts performance. For random workloads, it is essential to use asynchronous I/O. Additionally, it is crucial to use the correct backend for Lustre, as ZFS does not handle the load well at all. However, for the best results, further development of io_uring features is needed.&lt;/p&gt;
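
&lt;p&gt;To illustrate the difference asynchronous I/O makes for random workloads, a pair of hypothetical fio runs against a Lustre mount point might look like this (the mount point, file sizes, job counts and queue depths below are arbitrary, not our benchmark settings):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Synchronous 4k random reads: one outstanding IO per job
fio --name=sync-randread --directory=/mnt/lustre --rw=randread --bs=4k \
    --size=10G --direct=1 --ioengine=psync --numjobs=1

# Asynchronous 4k random reads: libaio with a deep queue and many jobs
fio --name=async-randread --directory=/mnt/lustre --rw=randread --bs=4k \
    --size=10G --direct=1 --ioengine=libaio --iodepth=64 --numjobs=32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;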

&lt;p&gt;Overall, after proper configuration, the performance for synchronous small block IOs turned out to be quite good and even exceeded our expectations: it approached half the capacity of a very high-performance block device, and we still see room for improvement in the future.&lt;/p&gt;

&lt;p&gt;When operating in a virtualized environment, using only a few CPUs, we can achieve performance levels comparable to a bare-metal solution while gaining the flexibility to deploy resources on-demand where we need them – on any hypervisor, within GPU environments, or on external storage arrays.&lt;/p&gt;

&lt;p&gt;However, this level of CPU efficiency, together with high-performance access for small random IO, is possible only with specialized storage solutions operating in Linux user space, such as xiRAID Opus.&lt;/p&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!&lt;/p&gt;

&lt;p&gt;Original article can be found &lt;a href="https://xinnor.io/blog/asynchronous-i-o-a-practical-guide-for-optimizing-hpc-workflows-with-xiraid-in-lustre-environments/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>raid</category>
      <category>lustre</category>
      <category>hpc</category>
      <category>storage</category>
    </item>
    <item>
      <title>High-Performance Block Volumes in Virtual Cloud Environments: Parallel File Systems Comparison</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Mon, 17 Mar 2025 19:32:46 +0000</pubDate>
      <link>https://dev.to/pltnvs/high-performance-block-volumes-in-virtual-cloud-environments-parallel-file-systems-comparison-11kn</link>
      <guid>https://dev.to/pltnvs/high-performance-block-volumes-in-virtual-cloud-environments-parallel-file-systems-comparison-11kn</guid>
      <description>&lt;p&gt;In virtualized cloud environments, supporting the intensive data demands of AI workloads requires a robust and scalable storage solution. Parallel file systems, such as Lustre and pNFS, provide the distributed data handling needed for these environments, allowing data to scale seamlessly across multiple nodes with minimal performance degradation. By integrating xiRAID Opus with these parallel file systems, Xinnor delivers enhanced storage performance, ensuring both random and sequential workloads achieve low-latency, high-throughput access. This blog explores how Lustre and pNFS, optimized with xiRAID Opus, create a flexible and high-performing storage architecture for AI-focused cloud environments.&lt;/p&gt;

&lt;p&gt;To address the scalability challenges posed by AI workloads in virtualized cloud environments, integrating parallel file systems like Lustre and pNFS becomes essential. These systems enable distributed data handling, ensuring that workloads can scale across numerous compute and storage nodes without a significant performance hit. By leveraging the underlying block device performance delivered by xiRAID Opus, parallel file systems further optimize both random and sequential workloads, ensuring low-latency, high-throughput access to shared storage resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Xinnor Lustre Solution for Cloud Environments
&lt;/h2&gt;

&lt;p&gt;Lustre is a well-known parallel file system used primarily in HPC environments, but it can also be leveraged for AI workloads, thanks to its scalability and high throughput. Lustre provides high availability over shared storage, making it ideal for cloud environments where reliability and performance are paramount.&lt;/p&gt;

&lt;p&gt;At Xinnor, we have extensive experience with Lustre, having successfully deployed it in numerous production environments. Our expertise extends into virtualized environments, where we enable the deployment of Lustre to provide high-performance storage solutions.&lt;/p&gt;

&lt;p&gt;In these setups, both the OSS and MDS components of Lustre are tuned by us for optimal performance. The architecture is built around disaggregated storage resources, which we transform into high-performance volumes using xiRAID Opus. These volumes are then passed through to virtual machines (VMs), forming the foundation for a highly scalable and efficient storage solution suited for AI workloads.&lt;/p&gt;
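
&lt;p&gt;For reference, turning the pass-through volumes into Lustre targets inside the VMs follows the standard mkfs.lustre flow; the commands below are a sketch only (device names, fsname, mount points and the MGS NID are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# On the MDS VM: combined MGS/MDT on the mirrored volume
mkfs.lustre --mgs --mdt --fsname=testfs --index=0 /dev/vdb
mount -t lustre /dev/vdb /mnt/mdt

# On each OSS VM: OST on the RAID 6 (16+2) volume
mkfs.lustre --ost --fsname=testfs --index=0 --mgsnode=mds@o2ib /dev/vdb
mount -t lustre /dev/vdb /mnt/ost0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;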

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80vcsvq6r1n96ktrwyfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80vcsvq6r1n96ktrwyfg.png" alt="Image1" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To validate our solution, we implemented a virtualized Lustre environment and conducted performance tests to demonstrate its scalability and efficiency for AI workloads in cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst79kf5wmubwtubqr2t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst79kf5wmubwtubqr2t3.png" alt="Image2" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Environment Details:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CPU: 64-Core Processor per node (AMD 7702P)&lt;/li&gt;
&lt;li&gt;Memory: 256 GB RAM per node&lt;/li&gt;
&lt;li&gt;Networking: 1 x MT28908 Family [ConnectX-6] per node&lt;/li&gt;
&lt;li&gt;Drives: 24x KIOXIA CM6-R 3.84TB (Gen 4)&lt;/li&gt;
&lt;li&gt;Aggregated drive performance per node:

&lt;ul&gt;
&lt;li&gt;9M IOPS (4k random read)&lt;/li&gt;
&lt;li&gt;3M IOPS (4k random write)&lt;/li&gt;
&lt;li&gt;70 GBps (128k sequential write/read)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Overview:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Host Configuration&lt;/strong&gt;:&lt;br&gt;
We deployed three virtual machines (VMs) across two hosts—two OSS and one Lustre MDS. Each VM was configured with dedicated RAID setups:&lt;br&gt;
a. OSS VMs use RAID 6 (16+2).&lt;br&gt;
b. The MDS VM uses RAID 1+1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Allocation&lt;/strong&gt;:&lt;br&gt;
Each storage controller within the VMs is assigned a single CPU core. In total, only three CPU cores are utilized for managing the block storage system, maximizing efficiency without compromising performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VM Configuration&lt;/strong&gt;:&lt;br&gt;
Each OSS and MDS VM is assigned three virtual cores for processing. The Lustre Client VMs are deployed on an external host, with each client VM provisioned with 32 cores, ensuring sufficient computational power for handling intensive workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lustre Solution Performance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9avmkfbcet3blnes60r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9avmkfbcet3blnes60r.png" alt="Image3.1" width="663" height="780"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s2ooe5dwofmd1x15ent.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s2ooe5dwofmd1x15ent.png" alt="Image3.2" width="732" height="780"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When testing &lt;strong&gt;sequential workloads&lt;/strong&gt; (1M block size, 32 jobs), we achieved the following performance metrics with xiRAID Opus: 44 GB/s for reads and 43 GB/s for writes.&lt;/p&gt;

&lt;p&gt;In addition to sequential workloads, we also tested random workloads, where xiRAID Opus demonstrated significantly better scaling at higher I/O depths compared to Lustre without it. This test used MDRAID (RAID 0) and Opus (RAID 6), showcasing the significant boost in both read and write performance when xiRAID Opus is incorporated into the solution. As seen in the graph above, Lustre with xiRAID Opus achieves remarkable performance growth, especially as the I/O depth increases. This scaling can be attributed to the efficiency of the multithreaded vhost-user-blk architecture, which distributes I/O tasks more effectively, leading to substantial improvements in throughput.&lt;/p&gt;
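
&lt;p&gt;For context, the MDRAID (RAID 0) baseline in such a comparison can be created with a plain mdadm stripe over the NVMe drives; the command below is illustrative (the device list and chunk size are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Striped (RAID 0) array over the NVMe namespaces, 128k chunk
mdadm --create /dev/md0 --level=0 --raid-devices=24 --chunk=128 \
      /dev/nvme0n1 /dev/nvme1n1 ... /dev/nvme23n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;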

&lt;p&gt;However, one of the primary limitations in maximizing streaming throughput lies in the network interface capacity, which often acts as a bottleneck. Despite this constraint, xiRAID Opus ensures high performance by maximizing network utilization, effectively mitigating the impact of network limitations.&lt;/p&gt;

&lt;p&gt;Moreover, while Lustre has traditionally been considered unsuitable for small block I/O operations, recent advancements have significantly enhanced its capabilities. With improved asynchronous I/O support and the integration of high-performance interfaces, low-latency devices can now be passed directly into the MDS. This innovation, in combination with xiRAID Opus, delivers strong small block I/O performance, addressing a critical pain point for AI and cloud workloads that demand efficient data handling at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reducing the Complexity of Lustre Administration with VirtioFS
&lt;/h2&gt;

&lt;p&gt;When managing file systems in virtualized environments, one of the key challenges is reducing administrative complexity while maintaining performance. To address this, we implemented &lt;strong&gt;VirtioFS&lt;/strong&gt;, a solution for sharing file systems directly between hosts and VMs. VirtioFS eliminates the need for installing client software within the VMs by sharing a mounted file system from the host. This simplification makes it an ideal solution for cloud service providers, reducing administrative burden without sacrificing performance.&lt;/p&gt;
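
&lt;p&gt;Conceptually, the setup boils down to exporting the host’s Lustre mount through a virtiofsd daemon and attaching it to the VM as a vhost-user-fs device. A minimal sketch follows (socket path, tag, memory size and exact virtiofsd option syntax vary by version and are placeholders here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# On the host: serve the mounted Lustre file system over a vhost-user socket
/usr/libexec/virtiofsd --socket-path=/var/run/vfsd.sock -o source=/mnt/lustre

# QEMU: attach the share as a vhost-user-fs device (shared memory is required)
qemu-system-x86_64 ... \
  -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=vfs0,path=/var/run/vfsd.sock \
  -device vhost-user-fs-pci,chardev=vfs0,tag=lustrefs

# Inside the guest: mount the share by its tag
mount -t virtiofs lustrefs /mnt/lustre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;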

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frky0jvjs3ldqwidgjdrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frky0jvjs3ldqwidgjdrh.png" alt="Image4" width="792" height="817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Xinnor-tuned VirtioFS: Performance Results
&lt;/h2&gt;

&lt;p&gt;To fully optimize file system performance in virtualized environments, we’ve applied tuning to VirtioFS. This tuning allows VirtioFS to deliver performance on par with native Lustre clients, even in heavily virtualized environments. The performance improvements are especially significant in high-throughput workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c6zuafuxym6ss10ujlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c6zuafuxym6ss10ujlb.png" alt="Image5" width="764" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sequential operations results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xr8luftpl4eaf05qtnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xr8luftpl4eaf05qtnl.png" alt="Image61" width="750" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fehv8k3hlpdf5oc2d7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fehv8k3hlpdf5oc2d7y.png" alt="Image6.2" width="688" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These results show that with the right optimizations, VirtioFS can match the performance of native Lustre clients in sequential workloads while still providing the simplicity of a virtualized file system environment. However, in random operations VirtioFS is not able to demonstrate the same level of scalability as the native Lustre client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Xinnor Lustre Solution Outcomes
&lt;/h2&gt;

&lt;p&gt;The Xinnor Lustre solution demonstrates powerful performance capabilities, even with a virtualized setup. By pairing xiRAID Opus with virtualized Lustre OSS and MDS components, our solution is capable of handling both sequential and random I/O operations with minimal overhead. Key outcomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;:&lt;br&gt;
a. With only two virtualized OSS, Lustre delivers impressive sequential and random I/O performance.&lt;br&gt;
b. Critical to this performance is the high-performance block device provided by xiRAID Opus, which is passed directly to the OSS and MDS virtual machines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skill Requirements&lt;/strong&gt;:&lt;br&gt;
a. While Lustre configuration requires advanced expertise to set up the system and client VMs, VirtioFS offers a simplified alternative for workloads with primarily sequential patterns, reducing complexity without sacrificing throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solution for Cloud Environments&lt;/strong&gt;:&lt;br&gt;
a. Xinnor can deliver this high-performance Lustre solution for cloud-based environments, tailored to AI workloads as well as HPC.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While Lustre has a legacy in HPC environments, it’s also highly effective for AI-centric workloads. However, Lustre can be complex to administer, particularly in cloud environments, where configurations like LNET and client setups add layers of complexity. Additionally, Lustre supports a limited number of operating systems, making expert configuration essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Vision for the Future: pNFS Block Layout
&lt;/h2&gt;

&lt;p&gt;pNFS (Parallel NFS) Block Layout is a part of the pNFS extension in NFSv4.1, designed to enable parallel access to storage devices, improving scalability and performance.&lt;/p&gt;

&lt;p&gt;The block layout specifically focuses on enabling clients to access storage blocks directly, bypassing the NFS server for data transfers. This layout is ideal for environments where block storage devices (like SANs) are used, providing high-performance parallel access to large datasets.&lt;/p&gt;

&lt;p&gt;This approach allows VMs to directly interact with xiRAID Opus block volumes, while a pNFS MDS server manages scalability. This flexible design minimizes the complexity of shared storage setups in cloud environments, ensuring both scalability and high performance.&lt;/p&gt;
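
&lt;p&gt;On the client side, pNFS block layout needs no special tooling beyond a recent NFSv4.1+ mount; a hedged sketch (server name and export path are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mount with NFSv4.2 so the client can negotiate a pNFS layout
mount -t nfs -o vers=4.2 mds-server:/export /mnt/pnfs

# Check that the pNFS block layout driver is loaded and layouts are granted
lsmod | grep blocklayoutdriver
grep -i layoutget /proc/self/mountstats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;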

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rz20wwngvbsimcb02mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rz20wwngvbsimcb02mm.png" alt="Image7" width="800" height="789"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Features of pNFS Block Layout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct Data Access&lt;/strong&gt;: Clients can bypass the NFS server and read/write directly to storage volumes using block-level protocols (e.g., iSCSI, Fibre Channel), reducing bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate Data and Metadata Paths&lt;/strong&gt;: The NFS server manages metadata, but the data itself flows directly between clients and storage, streamlining performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Access&lt;/strong&gt;: pNFS allows multiple clients to read/write to different sections of a file simultaneously, improving throughput for large datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: By offloading data transfers to the storage devices themselves, pNFS supports high-scale operations, making it a perfect fit for cloud environments handling AI workloads or massive data sets.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  pNFS Architecture in Cloud Environments
&lt;/h2&gt;

&lt;p&gt;The beauty of pNFS is its simplicity, offering high-performance shared storage while requiring minimal system resources. It doesn’t need third-party client software or the direct passing of a high-performance network to VMs, making it incredibly versatile.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared Storage Support: pNFS can efficiently manage high-performance storage with low CPU overhead.&lt;/li&gt;
&lt;li&gt;No Third-Party Software: Data volumes can be shared across compute nodes without needing additional software, simplifying the overall architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes this architecture especially appealing is that it leverages the same hardware we used in our Lustre testing earlier, showing just how adaptable and powerful pNFS can be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun9ut6wkq3ebuxvuc1md.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun9ut6wkq3ebuxvuc1md.png" alt="Image66" width="800" height="844"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  pNFS Performance Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pphxl7frxq1dsf6s3oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pphxl7frxq1dsf6s3oh.png" alt="Image86" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sequential operations (1M, 32 jobs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential Read:

&lt;ul&gt;
&lt;li&gt;Without xiRAID Opus: 34.8 GB/s&lt;/li&gt;
&lt;li&gt;With xiRAID Opus: 47 GB/s&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Sequential Write:

&lt;ul&gt;
&lt;li&gt;Without xiRAID Opus: 32.7 GB/s&lt;/li&gt;
&lt;li&gt;With xiRAID Opus: 46 GB/s&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;By integrating xiRAID Opus, we further optimized the performance of pNFS block layouts. When we compare pNFS with and without xiRAID Opus, the results clearly demonstrate its value in high-performance environments. This test used MDRAID (RAID 0) and Opus (RAID 6), showcasing the significant boost in both read and write performance when xiRAID Opus is incorporated into the solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  pNFS vs Lustre: Accelerated by Xinnor Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flna4o6a1sxwjnbhrdsvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flna4o6a1sxwjnbhrdsvz.png" alt="Image090" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When comparing the pNFS block layout to Lustre, our solutions provide significant acceleration in both setups. Both Lustre and pNFS, when paired with xiRAID Opus, are capable of delivering strong, near-equal performance in high-throughput environments:&lt;/p&gt;

&lt;p&gt;Sequential Performance Comparison (1M, 32 jobs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential Read:

&lt;ul&gt;
&lt;li&gt;Lustre: 44 GB/s&lt;/li&gt;
&lt;li&gt;pNFS: 47 GB/s&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Sequential Write:

&lt;ul&gt;
&lt;li&gt;Lustre: 43 GB/s&lt;/li&gt;
&lt;li&gt;pNFS: 46 GB/s&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These results demonstrate that both Lustre and pNFS, when optimized by xiRAID Opus, are powerful solutions, capable of delivering outstanding performance in high-performance cloud environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  pNFS in Cloud Environments: Conclusions
&lt;/h2&gt;

&lt;p&gt;We believe pNFS represents the future of scalable, high-performance storage in cloud environments. With proper configuration, pNFS block layout can achieve tens or even hundreds of gigabytes per second in throughput, with minimal resource consumption.&lt;/p&gt;

&lt;p&gt;Key Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Supports large-scale environments, offering massive throughput potential with low system overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Performance&lt;/strong&gt;: Delivers exceptional performance for both sequential and random small block operations, with minimal latency due to direct interaction with storage devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Third-Party Client Software&lt;/strong&gt;: Simplifies setup and management by removing the need for additional software on client machines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Challenges:&lt;/p&gt;

&lt;p&gt;While pNFS is highly promising, the current open-source MDS implementation is not production-ready, making it suitable for POCs but not yet for full production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Xinnor offers two robust solutions tailored for AI workloads in cloud environments: xiRAID Opus and the Xinnor Lustre Solution. These high-performance tools are engineered to handle the demanding nature of AI applications. Our comparison of Lustre and pNFS, accelerated by xiRAID Opus, demonstrates that both parallel file systems provide exceptional scalability and performance for AI workloads in virtualized cloud settings. Lustre offers high throughput and reliability, making it suitable for complex cloud environments. On the other hand, pNFS presents a simpler, versatile alternative that minimizes setup complexity without sacrificing performance. While each solution has unique strengths, xiRAID Opus consistently enhances both, supporting fast, efficient data access across multiple cloud-based nodes. Together, these parallel file systems and xiRAID Opus form a powerful foundation for AI workloads.&lt;/p&gt;

&lt;p&gt;You can read the original blogpost &lt;a href="https://xinnor.io/blog/high-performance-block-volumes-in-virtual-cloud-environments-parallel-file-systems-comparison/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>highperformance</category>
      <category>performance</category>
      <category>ai</category>
      <category>virtualmachine</category>
    </item>
    <item>
      <title>How to Build High-Performance NFS Storage with xiRAID Backend and RDMA Access</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Mon, 17 Mar 2025 15:54:54 +0000</pubDate>
      <link>https://dev.to/pltnvs/how-to-build-high-performance-nfs-storage-with-xiraid-backend-and-rdma-access-1ok4</link>
      <guid>https://dev.to/pltnvs/how-to-build-high-performance-nfs-storage-with-xiraid-backend-and-rdma-access-1ok4</guid>
      <description>&lt;p&gt;This paper outlines the process of configuring a high-performance Network File System (NFS) storage solution using the xiRAID RAID engine, Remote Direct Memory Access (RDMA), and the XFS file system. Modern data-intensive workloads (such as those in AI &amp;amp; machine learning, high-performance computing, scientific research, media &amp;amp; entertainment (e.g., 4K/8K video rendering, real-time asset streaming), virtualized environments requiring rapid storage access, etc.) demand storage subsystems capable of delivering extreme throughput with minimal latency.&lt;/p&gt;

&lt;p&gt;By leveraging xiRAID’s optimized RAID engine alongside RDMA’s low-latency data transfers and XFS’s scalability, this approach achieves unprecedented sequential access performance, critical for large-scale datasets, while offering actionable insights for improving random read/write efficiency.&lt;/p&gt;

&lt;p&gt;The document focuses on maximizing throughput and reducing latency, particularly for sequential access patterns common in scenarios like AI/ML model training, large-scale HPC simulations, real-time media rendering pipelines, virtualized infrastructure requiring consistent I/O performance, etc. Though it does not cover full production configurations (e.g., security settings), the procedures outlined enable organizations to deploy a high-performance NFS storage foundation that balances simplicity, scalability, and raw speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  InfiniBand Network Setup
&lt;/h3&gt;

&lt;p&gt;For RDMA access, both the NFS server and the clients must have the NVIDIA MLNX_OFED driver installed with NFS-over-RDMA support (--with-nfsrdma). Download the appropriate driver for your operating system from &lt;a href="https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/" rel="noopener noreferrer"&gt;NVIDIA’s website&lt;/a&gt; and install it using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./mlnxofedinstall --with-nfsrdma
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure that similar configurations are applied on the NFS clients. Configure the InfiniBand adapters and verify the network settings. The following utilities can be used to check InfiniBand network performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ib_send_bw for bandwidth testing.&lt;/li&gt;
&lt;li&gt;ib_send_lat for latency testing.&lt;/li&gt;
&lt;li&gt;ib_read_bw and ib_read_lat for RDMA read bandwidth/latency.&lt;/li&gt;
&lt;li&gt;ib_write_bw and ib_write_lat for RDMA write bandwidth/latency.&lt;/li&gt;
&lt;/ul&gt;
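&lt;p&gt;As an illustration, a basic bandwidth check between two hosts might look like the following (the HCA device name and address are placeholders for your environment):&lt;/p&gt;

```shell
# On the server (placeholder HCA device mlx5_0):
ib_send_bw -d mlx5_0 --report_gbits

# On the client, pointing at the server's IPoIB address (placeholder):
ib_send_bw -d mlx5_0 --report_gbits 10.239.239.100
```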

&lt;h3&gt;
  
  
  Disk Subsystem Performance Check
&lt;/h3&gt;

&lt;p&gt;Before setting up RAID, the file system, and the NFS server, determine the desired RAID levels for the data and for the file system journal.&lt;/p&gt;

&lt;p&gt;Test raw drive performance to ensure no bottlenecks exist at the server or disk subsystem level. Refer to Xinnor's performance guide for detailed recommendations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xinnor.io/blog/performance-guide-pt-1-performance-characteristics-and-how-it-can-be-measured/" rel="noopener noreferrer"&gt;https://xinnor.io/blog/performance-guide-pt-1-performance-characteristics-and-how-it-can-be-measured/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xinnor.io/blog/performance-guide-pt-2-hardware-and-software-configuration/" rel="noopener noreferrer"&gt;https://xinnor.io/blog/performance-guide-pt-2-hardware-and-software-configuration/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After confirming that drive performance meets the requirements, proceed to RAID setup and file system configuration.&lt;/p&gt;
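&lt;p&gt;As a quick sanity check, the raw sequential read throughput of a single drive can be measured with fio before any RAID is created (the device name below is a placeholder; a read-only test does not modify data):&lt;/p&gt;

```shell
# Sequential 1 MiB reads against a raw NVMe device (placeholder name)
fio --name=rawread --filename=/dev/nvme16n2 --rw=read --bs=1M \
    --iodepth=32 --direct=1 --ioengine=libaio \
    --runtime=60 --time_based --group_reporting
```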

&lt;h3&gt;
  
  
  xiRAID Setup
&lt;/h3&gt;

&lt;p&gt;This document uses RAID 6 with 10 drives and a strip size of 128k for data, and RAID 0 with a strip size of 16k for the file system log. (In a production environment, RAID 1 or RAID 10 is a better choice for the log device.)&lt;/p&gt;

&lt;p&gt;Install the latest xiRAID version using &lt;a href="https://xinnor.io/resources/xiraid-classic/" rel="noopener noreferrer"&gt;the documentation&lt;/a&gt; and create the RAID arrays as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xicli raid create -n media6 -l 6 -d /dev/nvme16n2 /dev/nvme9n2 /dev/nvme20n2 /dev/nvme18n2 /dev/nvme8n2 /dev/nvme12n2 /dev/nvme13n2 /dev/nvme19n2 /dev/nvme23n2 /dev/nvme24n2 -ss 128
xicli raid create -n media0 -l 0 -d /dev/nvme7n1 /dev/nvme6n1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the RAID status after the initialization process is complete using the xicli raid show command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30bef90367okcgtxnfz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30bef90367okcgtxnfz4.png" alt="Image1" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the RAID array, verify that the performance at the RAID layer meets expectations before proceeding to file system creation. For RAID performance checks and additional tuning, refer to the materials in Xinnor's performance guide.&lt;/p&gt;

&lt;p&gt;Based on our testing experience, the sequential write speed to the RAID should be approximately 90-95% of the write speed to raw data drives.&lt;/p&gt;

&lt;h3&gt;
  
  
  XFS Setup and Mount
&lt;/h3&gt;

&lt;p&gt;Create the XFS file system with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkfs.xfs -d su=128k,sw=8 -l logdev=/dev/xi_media0,size=1G /dev/xi_media6 -f -ssize=4k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the geometry of your RAID, the file system creation options may vary. Pay special attention to parameters such as su=128k and sw=8, as these are important for aligning the file system geometry with the RAID configuration.&lt;/p&gt;
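&lt;p&gt;The relationship between RAID geometry and the mkfs.xfs parameters can be sketched with generic RAID arithmetic (this is plain RAID math, not a xiRAID API): su equals the RAID strip size, and sw equals the number of data drives, i.e. total drives minus parity drives. For the 10-drive RAID 6 above, that gives su=128k and sw=8.&lt;/p&gt;

```python
# Sketch: derive XFS stripe-alignment options from RAID geometry.
def xfs_stripe_params(total_drives: int, parity_drives: int, strip_kib: int):
    """Return (su, sw) for mkfs.xfs: su = RAID strip size, sw = data drives."""
    data_drives = total_drives - parity_drives
    return f"{strip_kib}k", data_drives

su, sw = xfs_stripe_params(total_drives=10, parity_drives=2, strip_kib=128)
print(f"mkfs.xfs -d su={su},sw={sw} ...")  # prints: mkfs.xfs -d su=128k,sw=8 ...
```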

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4azjq1ebsgt5svyvgzl9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4azjq1ebsgt5svyvgzl9.PNG" alt="Image2" width="800" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mount the file system using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mount -t xfs /dev/xi_media6 /mnt/data -o logdev=/dev/xi_media0,noatime,nodiratime,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoy74huh5kj6nso74228.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoy74huh5kj6nso74228.PNG" alt="Image3" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to the RAID setup, you should also test the performance of the file system by writing several large files to a directory. The performance should be approximately 70-80% of the RAID performance for writes and 90-100% for reads.&lt;/p&gt;

&lt;p&gt;For permanent mounting of the file system, follow the recommendations in the &lt;a href="https://xinnor.io/docs/xiRAID-4.2.0/E/en/AG/2/file_system_mounting_aspects.html" rel="noopener noreferrer"&gt;xiRAID documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  NFS Server Setup
&lt;/h3&gt;

&lt;p&gt;With the disk subsystem, RAID, and file system properly configured, the next step is to install and configure the NFS server. The installation is straightforward, but several optimizations are necessary to achieve better performance.&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;Install the nfs-utils package&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yum install nfs-utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. &lt;strong&gt;Firewall setup&lt;/strong&gt;. Below is an example of a simple firewall configuration for a test environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. &lt;strong&gt;NFS share directory creation&lt;/strong&gt;. The following commands create a directory for an NFS file share. These settings are suitable for testing purposes only, as they do not include access restrictions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p /mnt/data
chown nfsnobody:nfsnobody /mnt/data
chmod 777 /mnt/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4. &lt;strong&gt;NFS export configuration&lt;/strong&gt;. Exports are defined in /etc/exports. If the file does not exist, create it manually, then add the following line for the /mnt/data directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mnt/data *(rw,sync,insecure,no_root_squash,no_subtree_check,no_wdelay)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi49gctl6vo70ognftt9t.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi49gctl6vo70ognftt9t.PNG" alt="Image4" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5. &lt;strong&gt;NFS server configuration tuning&lt;/strong&gt;. Edit the NFS server configuration file /etc/nfs.conf to adjust the number of request-handling threads and to enable RDMA connections. The configuration should resemble the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[exportd]
# debug="all|auth|call|general|parse"
# manage-gids=n
# state-directory-path=/var/lib/nfs
threads=64

[nfsd]
# debug=0
threads=64
# host=
# port=0
# grace-time=90
# lease-time=90
# udp=n
# tcp=y
vers3=y
vers4=y
vers4.0=y
vers4.1=y
vers4.2=y
rdma=y
rdma-port=20049
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbamc53qlixh0icwwolx9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbamc53qlixh0icwwolx9.PNG" alt="Image5" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6. &lt;strong&gt;Enable, restart, and check the NFS server&lt;/strong&gt;&lt;br&gt;
After applying all settings, enable and restart the NFS server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl enable nfs-server
systemctl restart nfs-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the status of the NFS server to ensure there are no errors, particularly those related to the RDMA module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl status nfs-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
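&lt;p&gt;To confirm that the RDMA transport was actually registered by nfsd, the kernel's port list can be inspected (output is environment-specific):&lt;/p&gt;

```shell
# Lists the transports nfsd is listening on; with RDMA enabled,
# an "rdma 20049" entry should appear alongside "tcp 2049".
cat /proc/fs/nfsd/portlist
```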



&lt;h3&gt;
  
  
  NFS Client Setup
&lt;/h3&gt;

&lt;p&gt;You can now proceed to configure the NFS client.&lt;/p&gt;

&lt;p&gt;7. &lt;strong&gt;Install the nfs-utils package&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yum install nfs-utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;8. &lt;strong&gt;NFS client kernel module options&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add the following line to /etc/modprobe.d/nfsclient.conf:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;options nfs max_session_slots=180
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Increasing the max_session_slots value allows more simultaneous in-flight requests, improving performance for workloads with many small or parallel I/O operations.&lt;/p&gt;
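&lt;p&gt;After the reboot, you can verify that the module option took effect by reading the parameter back from sysfs (the value should match the configuration above):&lt;/p&gt;

```shell
# Should print 180 once the nfs module has loaded with the new option
cat /sys/module/nfs/parameters/max_session_slots
```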

&lt;p&gt;9. &lt;strong&gt;Reboot the system&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;10. &lt;strong&gt;Mount the NFS share&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a directory for the NFS file share (in this example, /mnt/nfs) and mount it using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mount -o rdma,port=20049,nconnect=16,vers=4.2 10.239.239.100:/mnt/data /mnt/nfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp9oajwp436br7z4xosw.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp9oajwp436br7z4xosw.PNG" alt="Image6" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After completing the setup, you can run performance tests from the NFS client. If everything is configured correctly, the performance should meet expectations. Use fio with the following configuration for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
rw=read
#rw=write
bs=1024K
iodepth=32
direct=1
ioengine=libaio
runtime=4000
size=32G
numjobs=4
group_reporting
exitall

[job3]
directory=/mnt/nfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the configuration, you can expect NFS performance to be 50-70% of the XFS file system's performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;By following the outlined steps, users can set up a high-performance NFS storage system leveraging xiRAID and RDMA. The configuration ensures optimal performance for sequential data access patterns and provides flexibility for tuning based on specific workload requirements. For production environments, additional configurations such as security settings should be applied as needed.&lt;/p&gt;

&lt;p&gt;Read more &lt;a href="https://xinnor.io/blog/how-to-build-high-performance-nfs-storage-with-xiraid-backend-and-rdma-access/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>rdma</category>
      <category>nfs</category>
      <category>storage</category>
    </item>
    <item>
      <title>High-Performance Storage Solution for PostgreSQL in Virtual Environments with XIRAID Engine and Kioxia PCIe5 Drives</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Wed, 12 Mar 2025 14:30:18 +0000</pubDate>
      <link>https://dev.to/pltnvs/high-performance-storage-solution-for-postgresql-in-virtual-environments-with-xiraid-engine-and-4lkb</link>
      <guid>https://dev.to/pltnvs/high-performance-storage-solution-for-postgresql-in-virtual-environments-with-xiraid-engine-and-4lkb</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Objectives&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is a highly popular open-source database due to its rich feature set, robust performance, and flexible data handling. It is used everywhere from small websites to large-scale enterprise applications, attracting users with its object-relational capabilities, advanced indexing, and strong security. However, to truly unleash its potential, PostgreSQL demands fast storage. Its transactional nature and ability to handle large datasets require low latency and high throughput. This is why pairing PostgreSQL with fast storage solutions is crucial for optimizing performance, minimizing downtime, and ensuring seamless data access for demanding workloads.&lt;/p&gt;

&lt;p&gt;For flexibility, scalability, and cost optimization, it is preferable to run PostgreSQL on virtual machines, especially in development and testing environments. However, virtualization introduces an abstraction layer that can cause performance overhead compared to running directly on bare metal. On the other hand, running on bare metal alone leads to suboptimal use of CPU and storage resources, because a single application typically does not fully utilize a bare-metal server's performance.&lt;/p&gt;

&lt;p&gt;In this document, we’ll look at the optimal way to provide high performance to PostgreSQL in a virtualized environment.&lt;/p&gt;

&lt;p&gt;With this goal, we compare the performance of the kernel vhost target backed by mdadm against the SPDK vhost-blk target protected by Xinnor’s xiRAID Opus.&lt;/p&gt;

&lt;p&gt;Mdadm, which stands for "Multiple Devices Administration", is a software tool used in Linux systems to manage software RAID (Redundant Array of Independent Disks) configurations. Unlike hardware RAID controllers, mdadm relies on the computer's CPU and software to achieve data redundancy and performance improvements across multiple physical disks.&lt;/p&gt;

&lt;p&gt;XiRAID Opus (Optimized Performance in User Space) is a high-performance software RAID engine based on the SPDK libraries, designed specifically for NVMe storage devices.&lt;/p&gt;

&lt;p&gt;We focus the benchmark on software RAID because a hardware RAID controller has only 16 PCIe lanes, so by design its performance is limited to that of at most 4 NVMe drives (4 lanes each) per controller, which is not sufficient for PostgreSQL applications.&lt;/p&gt;

&lt;p&gt;As the testing tool, we used the pgbench utility and ran all three built-in scripts: tpcb-like, simple-update, and select-only. The script details are provided in Appendix 2.&lt;/p&gt;
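&lt;p&gt;The built-in scripts can be listed directly from pgbench (assuming the PostgreSQL client tools are installed):&lt;/p&gt;

```shell
# Prints the available built-in scripts: tpcb-like, simple-update, select-only
pgbench -b list
```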

&lt;h3&gt;
  
  
  &lt;strong&gt;Test Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Hardware Configuration:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Motherboard&lt;/em&gt;&lt;/strong&gt;: Supermicro H13DSH&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: Dual AMD EPYC 9534 64-Core Processors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: 773,672 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drives&lt;/strong&gt;: 10xKIOXIA KCMYXVUG3T20&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Software Configuration:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: Ubuntu 22.04.3 LTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel&lt;/strong&gt;: Version 5.15.0-91-generic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;xiRAID Opus&lt;/strong&gt;: Version xnr-1077&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QEMU Emulator&lt;/strong&gt;: Version 6.2.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RAID Configuration:
&lt;/h4&gt;

&lt;p&gt;Two RAID groups (4+1 configuration) were created utilizing drives on 2 independent NUMA nodes. The stripe size was set to 64K. A full RAID initialization was conducted prior to benchmarking.&lt;/p&gt;

&lt;p&gt;Each RAID group was divided into 7 segments, with each segment being allocated to a virtual machine via a dedicated vhost controller.&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary of Resources Allocated:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAID Groups&lt;/strong&gt;: 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volumes&lt;/strong&gt;: 14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vhost Controllers&lt;/strong&gt;: 14&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VMs&lt;/strong&gt;: 14, with each using segmented RAID volumes as storage devices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3989lhq55uqc0zdbncvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3989lhq55uqc0zdbncvf.png" alt="Distibution of Virtual machines" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Distribution of virtual machines, vhost controllers, RAID groups and NVMe drives&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;During the creation of mdraid, volumes, and vhost targets, assignment to specific CPU cores was not performed because it is not supported. Nevertheless, the virtual machines still ran on specific cores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k16igfxed69lvp0556i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k16igfxed69lvp0556i.png" alt="XiRAID" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;xiRAID. Placement of the array and VMs on cores&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With xiRAID it is possible to pin the RAID engine to specific cores. In this example we use 8 cores on each NUMA node. This placement separates the infrastructure workload from the database workload, and isolates the VM loads from each other.&lt;/p&gt;

&lt;p&gt;This feature is not available in mdraid, so the application must share core resources with the RAID engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn83mjv4bs5032ew70kkd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn83mjv4bs5032ew70kkd.png" alt="mdraid" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;mdraid. Placement of the array and VMs on cores&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Virtual Machine Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CPU Allocation&lt;/strong&gt;: 8&lt;br&gt;
-cpu host -smp 8&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QEMU Memory Configuration&lt;/strong&gt;:&lt;br&gt;
&lt;strong&gt;Memory Allocation&lt;/strong&gt;: Each VM is provisioned with 32 GB of RAM via Hugepages. Memory is pre-allocated and bound to the same NUMA node as the allocated vCPUs to ensure efficient CPU-memory interaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-m 32G -object memory-backend-file,id=mem,size=32G,mem-path=/dev/hugepages,
share=on,prealloc=yes,host-nodes=0,policy=bind
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
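&lt;p&gt;For the hugepage-backed memory above to be available, enough hugepages must be reserved on the host beforehand. A minimal sketch, assuming the default 2 MiB hugepage size (32 GiB per VM = 16384 pages on NUMA node 0):&lt;/p&gt;

```shell
# Reserve 16384 x 2 MiB hugepages (= 32 GiB) on NUMA node 0
echo 16384 | tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# Verify the reservation
grep HugePages_Total /proc/meminfo
```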



&lt;p&gt;&lt;strong&gt;Operating System&lt;/strong&gt;: VMs run Debian GNU/Linux 12 (Bookworm)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL Version&lt;/strong&gt;: 15&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get install postgresql-15 // installing PostgreSQL 15
cd /etc/postgresql/15/main/
sed -i 's|/var/lib/postgresql/15/main|/test/postgresql/15/main|g' postgresql.conf //
configuring the folder for the data
sed -i -e "s/^#\?\s*listen_addresses\s*[=]\s*[^\t#]*/listen_addresses = '127.0.0.1'/" postgresql.conf
sed -i -e "/^max_connections/s/[= ][^\t#]*/ = '300'/" postgresql.conf // increasing the number of connections up to 300

apt-get install xfsprogs
mkdir /test
mkfs.xfs /dev/vda -f
mount /dev/vda /test -o discard,noatime,largeio,inode64,swalloc,allocsize=64M -t xfs
cp -rp /var/lib/postgresql /test/
service postgresql restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating and initializing the test database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -u postgres createdb test
sudo -u postgres pgbench -i -s 50000 test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We created and initialized the database for testing. It is important to choose the scale factor so that the dataset does not fit entirely into RAM.&lt;/p&gt;
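&lt;p&gt;A rough sizing sketch illustrates why a scale factor of 50000 comfortably exceeds the VM memory. Assuming the commonly cited rule of thumb of roughly 16 MiB of data per pgbench scale unit (an approximation, not an exact figure):&lt;/p&gt;

```python
# Rough estimate of pgbench database size from the scale factor.
# Assumption: ~16 MiB per scale unit (rule of thumb, not exact).
def pgbench_db_size_gib(scale: int, mib_per_unit: float = 16.0) -> float:
    return scale * mib_per_unit / 1024

print(f"~{pgbench_db_size_gib(50000):.0f} GiB")  # ~781 GiB, far more than the 32 GB of RAM per VM
```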

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;We conducted tests while varying the number of clients and reported in this document only those where we achieved the maximum stable results. To adjust the number of clients, we selected the following values for the parameter -c (number of clients simulated, equal to the number of concurrent database sessions): 10, 20, 50, 100, 200, 500, 1000. For all script types, we reached a plateau at 100 clients.&lt;/p&gt;

&lt;p&gt;As a best practice, we fixed the parameter -j (the number of worker threads within pgbench*) to the number of VM cores.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using more than one thread can be helpful on multi-CPU machines. Clients are distributed as evenly as possible among available threads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tests appear as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo -u postgres pgbench -j 8 -c 100 -b select-only -T 200 test
sudo -u postgres pgbench -j 8 -c 100 -b simple-update -T 200 test
sudo -u postgres pgbench -j 8 -c 100 -T 200 test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We conducted the test three times and recorded the average results across all virtual machines. Additionally, we performed select-only tests in degraded mode, as this script generates the maximum load on reading, enabling an assessment of the maximum impact on the database performance.&lt;/p&gt;

&lt;p&gt;During the test, we monitored the array performance using the iostat utility. The total server performance comprises the sum of the performance of all machines (14 for xiRAID Opus and 16 for mdraid).&lt;/p&gt;

&lt;h4&gt;
  
  
  Select-only Test Results
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faggzxu0qd0m4i8585d0e.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faggzxu0qd0m4i8585d0e.PNG" alt="Select-only" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Select-only Test Results, Degraded Mode
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb9yyhucap2drstpfnp3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb9yyhucap2drstpfnp3.PNG" alt="Degraded mode" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Simple-update Test Results
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi63oj2l8flbz7khou3qm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi63oj2l8flbz7khou3qm.PNG" alt="Simple-update" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  TPC-B-like Test Results
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne8nhewhlzgzchpv55nt.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne8nhewhlzgzchpv55nt.PNG" alt="TPC" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In select-only, with all drives in the RAID operating properly, xiRAID Opus delivers 30-40% more transactions per second than mdraid. mdraid is nearing its maximum capability, and further scaling (by increasing the number of cores per virtual machine) would be challenging; this is not the case for xiRAID. The main reason for the difference is that xiRAID Opus enables the vhost target to run on a separate CCD.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When comparing different protection schemes, we cannot stop at measuring performance in normal operation. RAID protection exists to prevent data loss when one or more drives fail. In that situation (degraded mode), maintaining high performance is critical to avoid downtime for database users.&lt;/p&gt;

&lt;p&gt;When comparing performance in degraded mode, mdraid suffers a significant drop, ending up more than 20x slower than xiRAID. In other terms, with mdraid users will be left waiting for their data, and this situation can lead to business losses (think of an online travel agency or a trading company).&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;When it comes to writing data to the database, each write of small blocks generates RAID calculations. In this situation, mdraid's performance is six times worse than xiRAID Opus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The TPC-B Like script is more complex than the simple update and consumes more CPU resources, which again slows down mdraid on write operations. In this case, xiRAID outpaces mdraid by five times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In conclusion, xiRAID provides great and stable performance to multiple VMs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means that applications will be able to get access to their data without any delay, even in case of drive failures or extensive write operations.&lt;/p&gt;

&lt;p&gt;Furthermore, the scalability of xiRAID on VMs allows the system administrator to consolidate the number of servers needed for large or multiple database deployments. This simplifies the storage infrastructure while providing significant cost savings.&lt;/p&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts about this high-performance storage solution for PostgreSQL, please leave them in the comments below. I'd love to hear your feedback and discuss how this setup could benefit your projects!&lt;/p&gt;

&lt;h3&gt;
  
  
  Appendix 1. mdraid Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;md0 : active raid5 nvme40n2[5] nvme45n2[3] nvme36n2[2] nvme46n2[1] nvme35n2[0]
12501939456 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bitmaps disabled
cat /sys/block/md0/md/group_thread_cnt
16
Vhost target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figtwu1wjtbsfp272hpl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figtwu1wjtbsfp272hpl1.png" alt="Example code launching" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Example Code for Launching VMs
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;taskset -a -c $CPU qemu-system-x86_64 -enable-kvm -smp 8 -cpu host -m 32G -drive file=$DISK_FILE,format=qcow2 --nographic \
-device vhost-scsi-pci,wwpn=naa.5001405dc22c8c4e,bus=pci.0,addr=0x5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original article can be found &lt;a href="https://xinnor.io/blog/high-performance-storage-solution-for-postgresql-database-in-virtual-environment-boosted-by-xiraid-engine-and-kioxia-pcie5-drives/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>database</category>
      <category>datastorage</category>
    </item>
    <item>
      <title>Understanding RAID Levels: A Comprehensive Guide to RAID 0, 1, 5, 6, 10, and Beyond</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Wed, 12 Mar 2025 14:27:43 +0000</pubDate>
      <link>https://dev.to/pltnvs/understanding-raid-levels-a-comprehensive-guide-to-raid-0-1-5-6-10-and-beyond-35f6</link>
      <guid>https://dev.to/pltnvs/understanding-raid-levels-a-comprehensive-guide-to-raid-0-1-5-6-10-and-beyond-35f6</guid>
      <description>&lt;p&gt;In today’s fast-paced digital landscape, data storage is crucial for safeguarding critical information. &lt;a href="https://xinnor.io/what-is-xiraid/" rel="noopener noreferrer"&gt;RAID &lt;/a&gt;technology has revolutionized data storage, offering improved performance, increased data redundancy, and optimized capacity. However, with various RAID levels available, selecting the ideal configuration can be challenging.&lt;/p&gt;

&lt;p&gt;In this comprehensive article, we demystify RAID technology, guiding you through the intricacies of RAID 0, RAID 1, RAID 5, RAID 6, RAID 10, and more. By exploring their characteristics, benefits, and drawbacks, we empower you to make informed decisions that align with your specific storage demands. Whether you’re a tech enthusiast, system administrator, or business owner, this guide equips you with the expertise to fortify your data infrastructure effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 0
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeyztuggwgqasmatz2v4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeyztuggwgqasmatz2v4.PNG" alt="RAID 0 diagram" width="581" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAID 0 encompasses a configuration wherein all drives are merged into a single logical one. This level delivers exceptional performance at a reduced cost. However, it lacks data protection mechanisms, rendering it highly susceptible to data loss in the event of a drive failure. Consequently, the adoption of RAID 0 is not recommended for mission-critical data.&lt;/p&gt;
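&lt;p&gt;The chunk layout can be sketched in a few lines of Python. This is a simulation of the round-robin striping described above, not real device code, and the 4-byte chunk size is purely illustrative (real arrays use chunks such as 64 KiB):&lt;/p&gt;

```python
# Minimal sketch: RAID 0 deals consecutive chunks out round-robin across drives.
CHUNK = 4  # bytes per chunk; illustrative only (real arrays use e.g. 64 KiB)

def stripe(data: bytes, n_drives: int):
    """Split data into chunks and distribute them round-robin."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), CHUNK):
        drives[(i // CHUNK) % n_drives].extend(data[i:i + CHUNK])
    return drives

def unstripe(drives, total_len: int) -> bytes:
    """Reassemble the original byte stream from the striped drives."""
    out = bytearray()
    offsets = [0] * len(drives)
    d = 0
    while len(out) < total_len:
        out.extend(drives[d][offsets[d]:offsets[d] + CHUNK])
        offsets[d] += CHUNK
        d = (d + 1) % len(drives)
    return bytes(out)

data = b"ABCDEFGHIJKLMNOP"            # 4 chunks of 4 bytes
drives = stripe(data, 2)
print(drives)                          # chunks alternate between the two drives
assert unstripe(drives, len(data)) == data
```

&lt;p&gt;Note that losing either simulated drive loses half the chunks with no way to rebuild them, which is exactly the RAID 0 trade-off.&lt;/p&gt;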

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Offers high-speed performance and availability while maintaining a cost-effective approach.&lt;/li&gt;
&lt;li&gt;Utilizes the entire capacity of each individual drive.&lt;/li&gt;
&lt;li&gt;Configuration is straightforward and user-friendly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RAID 0 lacks any form of data protection.&lt;/li&gt;
&lt;li&gt;In the event of a single drive failure, all data becomes irreversibly lost, with no possibility of recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Areas of application
&lt;/h3&gt;

&lt;p&gt;This RAID level is advisable for implementation in non-mission-critical scenarios. RAID 0 is suitable for purposes where the primary concern is maximizing performance and data read/write speeds. It is commonly used in scenarios where data redundancy (fault tolerance) is not a critical requirement, and the main focus is on improving the system’s overall data processing capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35jbdk958n9i461ptj5d.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35jbdk958n9i461ptj5d.PNG" alt="Raid 1" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 1
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwn8jzvv359veev5uzrh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwn8jzvv359veev5uzrh.PNG" alt="Raid1" width="415" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAID 1, also known as “mirroring,” is a method in which all data is duplicated on two separate drives, with one set of data appearing as a logical drive. RAID 1 is primarily focused on providing data protection rather than improving performance or increasing storage capacity. Because data is replicated over 2 drives, the usable capacity is 50% of the total available drives in the RAID array.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High levels of redundancy — each drive is an exact copy of another.&lt;/li&gt;
&lt;li&gt;If one drive fails, the system continues to function normally with no data loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Usable capacity is limited to 50% due to the need to store complete duplicates of data.&lt;/li&gt;
&lt;li&gt;RAID 1 performance does not significantly exceed that of a single drive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Areas of application
&lt;/h3&gt;

&lt;p&gt;This RAID level finds frequent utilization in scenarios where storage capacity and cost are not a concern, yet the imperative requirement lies in the ability to fully recover data in the event of a drive failure. It’s commonly used for boot drives, small business applications, and personal data storage, ensuring continuous access to information even if one drive fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyrx7opebqahagcr0n7.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyrx7opebqahagcr0n7.PNG" alt="raID5" width="800" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 5
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvj382vulo64vkl6nync.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvj382vulo64vkl6nync.PNG" alt="rAID5." width="735" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAID 5, widely regarded as the most prevalent and versatile RAID level, employs a technique known as data block striping across the entirety of drives within the array (comprising 3 to N drives). It further distributes parity information evenly across all drives. In the event of a single drive failure, the system utilizes the parity information from the functioning drives to recover lost data blocks.&lt;/p&gt;
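&lt;p&gt;The parity recovery described above boils down to XOR: the parity chunk is the XOR of the data chunks in a stripe, so a single lost chunk equals the XOR of everything that survives. A minimal Python simulation (the chunk values are arbitrary illustrative bytes):&lt;/p&gt;

```python
from functools import reduce

# Minimal sketch: RAID 5 keeps an XOR parity chunk per stripe.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

chunks = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]   # data chunks on 3 drives
parity = reduce(xor, chunks)                        # stored on a 4th drive

# Drive 1 fails: rebuild its chunk from parity plus the surviving chunks.
rebuilt = reduce(xor, [chunks[0], chunks[2], parity])
assert rebuilt == chunks[1]
```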

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Strikes a favorable balance between cost and performance considerations.&lt;/li&gt;
&lt;li&gt;Capability to recover data in the event of a single drive failure.&lt;/li&gt;
&lt;li&gt;Enhanced data read performance.&lt;/li&gt;
&lt;li&gt;Scalability: RAID 5 facilitates effortless expansion of storage capacity by incorporating additional drives without system interruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Parity storage reduces usable capacity by one drive’s worth.&lt;/li&gt;
&lt;li&gt;Data is lost if two drives fail in the array.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Areas of application
&lt;/h3&gt;

&lt;p&gt;This RAID level enjoys widespread adoption across diverse environments, including file servers, general-purpose storage servers, backup servers, and streaming data applications, among others. It offers superior performance while maintaining an optimal price-performance ratio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82wfvtkk0ho1vurjcoon.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82wfvtkk0ho1vurjcoon.PNG" alt="raid66" width="800" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 6
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs92gcbp84v40xdgw12wo.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs92gcbp84v40xdgw12wo.PNG" alt="Raid6" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAID 6, also known as “double-parity interleaving,” is a data storage and recovery technique that distributes data across multiple drives while utilizing double-parity for enhanced fault tolerance. While RAID 6 performs similarly to RAID 5 in terms of performance and capacity, it offers an advantage by distributing the second parity scheme across different drives, allowing it to withstand the simultaneous failure of two drives within the array.&lt;/p&gt;
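&lt;p&gt;A sketch of how double parity makes two-failure recovery possible: the first parity (P) is plain XOR, while the second (Q) weights each data chunk by a distinct power of a GF(2^8) generator, so two lost chunks yield two solvable equations. This is a toy simulation in the spirit of the Linux md P+Q scheme, not any vendor's actual implementation; the generator and byte values are illustrative:&lt;/p&gt;

```python
# Toy RAID 6 double-parity demo over GF(2^8), one byte per drive for brevity.
def gmul(a, b, poly=0x11d):            # multiply in GF(2^8)
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gpow(a, n):                         # a**n in GF(2^8)
    r = 1
    for _ in range(n):
        r = gmul(r, a)
    return r

def ginv(a):                            # brute-force inverse; fine for a demo
    return next(x for x in range(1, 256) if gmul(a, x) == 1)

data = [0x11, 0x22, 0x33, 0x44]         # one byte per data drive
g = 2                                    # generator used to weight Q
P = Q = 0
for k, d in enumerate(data):            # compute both parity "drives"
    P ^= d
    Q ^= gmul(gpow(g, k), d)

i, j = 1, 3                              # simulate losing data drives 1 and 3
A, B = P, Q
for k, d in enumerate(data):
    if k not in (i, j):                  # subtract the surviving contributions
        A ^= d
        B ^= gmul(gpow(g, k), d)
# Now A = d_i ^ d_j and B = g^i*d_i ^ g^j*d_j; solve the 2x2 system.
d_i = gmul(B ^ gmul(gpow(g, j), A), ginv(gpow(g, i) ^ gpow(g, j)))
d_j = A ^ d_i
assert (d_i, d_j) == (data[i], data[j])  # both lost chunks recovered
```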

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RAID 6 provides a reasonable price-quality ratio with good overall performance.&lt;/li&gt;
&lt;li&gt;The array can endure the simultaneous failure of two drives or the failure of one drive, followed by the subsequent failure of a second drive during data recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RAID 6 incurs higher costs compared to RAID 5, as it sacrifices the capacity of two drives for parity data.&lt;/li&gt;
&lt;li&gt;In most scenarios, RAID 6 performs slightly slower than RAID 5.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Areas of application
&lt;/h3&gt;

&lt;p&gt;RAID 6 is highly recommended for applications such as file servers, shared storage servers, and backup servers. It strikes a favorable balance between cost and performance, offering reliable and versatile operation. The key advantage of RAID 6 lies in its ability to tolerate the failure of two drives simultaneously or the failure of one drive followed by a second drive during the data recovery process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklwzpf9io5kvuy61qlv4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklwzpf9io5kvuy61qlv4.PNG" alt="RAID" width="800" height="672"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 7.3
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7g9lxbo3wixcezaobpm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7g9lxbo3wixcezaobpm.PNG" alt="Raid73" width="731" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To increase the reliability of data storage, XINNOR engineers have developed and introduced to the market a new triple-parity RAID level, known as RAID 7.3. This level was designed with a unique erasure coding technology that performs checksum calculations at high speed. As a result, RAID 7.3 achieves performance comparable to RAID 6.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;p&gt;RAID 7.3, with triple parity, is ideal for use with high-capacity drives, where the recovery process can take a long time. This is especially true under intense workloads, where a long rebuild increases the risk of a subsequent drive failure and potentially threatens data safety.&lt;/p&gt;

&lt;p&gt;The use of RAID 7.3 in combination with hard drives or hybrid solutions significantly reduces storage costs by reducing the number of drives used, meeting customer requirements for reliability and performance.&lt;/p&gt;

&lt;p&gt;In addition, RAID 7.3 provides extensive capabilities for managing the infrastructure of your data centers. It offers a convenient and reliable technology for organizing a storage array.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj46hegidbum7quo6a5m8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj46hegidbum7quo6a5m8.PNG" alt="raid 10" width="800" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 10
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzru31md95czk1iwj5615.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzru31md95czk1iwj5615.PNG" alt="RAID10" width="686" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAID 10, also known as “striping and mirroring”, combines the benefits of RAID 1 and RAID 0 by creating multiple mirrored sets that are interleaved. RAID 10 provides high performance, good data protection, and does not require parity calculations.&lt;/p&gt;

&lt;p&gt;RAID 10 requires at least four drives, and the usable capacity is 50% of the total drive capacity. However, it is worth noting that RAID 10 can use more than four drives, which must be a multiple of two. For example, a RAID 10 array of eight drives provides high performance on both spinning and SSD drives because data reads and writes are split into smaller chunks on each drive.&lt;/p&gt;
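&lt;p&gt;The mirrored-then-striped layout can be sketched as follows (a simulation with illustrative chunk labels, not real array code): chunks are striped round-robin across mirrored pairs, and each pair stores two copies:&lt;/p&gt;

```python
# Minimal sketch: RAID 10 stripes across mirrored pairs, so usable capacity is
# 50% and each pair tolerates the loss of one of its two members.
def raid10_layout(chunks, n_pairs):
    """Distribute chunks round-robin over pairs; each pair holds two copies."""
    pairs = [[[], []] for _ in range(n_pairs)]
    for idx, chunk in enumerate(chunks):
        pair = pairs[idx % n_pairs]      # RAID 0 step: round-robin over pairs
        pair[0].append(chunk)            # RAID 1 step: copy to both members
        pair[1].append(chunk)
    return pairs

pairs = raid10_layout(["c0", "c1", "c2", "c3"], 2)
print(pairs)
# Losing one drive in each pair still leaves a full copy of every chunk.
survivors = [pair[0] for pair in pairs]
assert sorted(sum(survivors, [])) == ["c0", "c1", "c2", "c3"]
```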

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High speed and reliability through a combination of striping and mirroring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Expensive configuration, since only half of the raw drive capacity is usable.&lt;/li&gt;
&lt;li&gt;Not recommended for large capacities due to cost constraints.&lt;/li&gt;
&lt;li&gt;Slightly slower than RAID 5 in some streaming scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Areas of application
&lt;/h3&gt;

&lt;p&gt;This RAID level is well-suited for databases, as it offers elevated read and write performance, and for virtualization, providing servers with both high performance and reliability. It is particularly relevant in domains such as video editing and multimedia applications, where RAID 10 can efficiently manage substantial data volumes. Additionally, it is recommended for mission-critical applications due to its robust data protection and recovery capabilities in the event of drive failure. Moreover, in the context of high-traffic file servers, RAID 10 adeptly handles heavy network traffic while delivering remarkable file system responsiveness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlwwtui4vvmxa6l8zygt.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlwwtui4vvmxa6l8zygt.PNG" alt="raid10" width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID 50 &amp;amp; 60
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jg3qyzbphuxrb9gvga8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jg3qyzbphuxrb9gvga8.PNG" alt="raid50" width="482" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6l6b2e1v5knv31i6uc.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6l6b2e1v5knv31i6uc.PNG" alt="raid60" width="574" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are also RAID 5+0 (RAID 50) and RAID 6+0 (RAID 60), hybrid RAID configurations that combine the features of multiple RAID levels for improved performance and fault tolerance. RAID 5+0 stripes multiple RAID 5 arrays with RAID 0, providing faster data access and the ability to tolerate a single drive failure per RAID 5 array. RAID 6+0 stripes multiple RAID 6 arrays with RAID 0, providing even better fault tolerance by tolerating two drive failures per RAID 6 array. These configurations are suitable for situations requiring both high performance and enhanced data protection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf6unfde0yi68n0tqzk2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf6unfde0yi68n0tqzk2.PNG" alt="Raid MN" width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAID N+M
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flv3pivr6lrbg3601zq7t.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flv3pivr6lrbg3601zq7t.PNG" alt="mn raid" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAID level N+M is a data block allocation scheme that uses M drives for parity. This level allows the end user to independently choose how many drives will be used to store checksums. RAID N+M is supported by xiRAID. Thanks to this technology, it is possible to restore information after the failure of up to 32 drives (depending on how many drives are used to store checksums).&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Choose the RAID Level
&lt;/h2&gt;

&lt;p&gt;Choosing the right RAID level depends on your specific storage needs, performance requirements, data redundancy preferences, and budget constraints. Here are the key factors to consider when making this decision:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance Requirements&lt;/strong&gt;: Different RAID levels offer varying levels of performance. RAID 0, for example, provides excellent performance by striping data across multiple drives, but it lacks data redundancy. On the other hand, RAID 5 and RAID 6 offer both performance and redundancy but are not as fast as RAID 0. Consider the speed at which you need to access and transfer data, as well as the workload demands of your system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Redundancy and Fault Tolerance&lt;/strong&gt;: If data protection is a top priority, RAID levels with redundancy are essential. RAID 1 mirrors data across drives, providing a high level of fault tolerance, while RAID 5 and RAID 6 use distributed parity to protect against drive failures. RAID 10 combines mirroring and striping, offering both speed and redundancy. Assess the criticality of your data and how much protection you need against potential drive failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Drive Utilization&lt;/strong&gt;: Different RAID levels use drives in various ways, impacting overall storage capacity. RAID 0 utilizes all drives for data storage, providing maximum capacity but no redundancy. In contrast, RAID 1 uses half the capacity for mirroring, reducing usable storage but ensuring complete redundancy. Evaluate how important drive utilization is for your setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Number of Drives Available&lt;/strong&gt;: Some RAID levels require a minimum number of drives to function effectively. RAID 5, for instance, needs a minimum of three drives, while RAID 6 typically requires at least four drives. If you have limited drive slots or a specific number of available drives, this will influence your RAID level choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Considerations&lt;/strong&gt;: RAID configurations come with varying costs based on the number of drives needed and the drive types used (HDDs or SSDs). RAID 0 and RAID 5 might be more cost-effective due to their lower drive requirements, while RAID 1 and RAID 10 could be more expensive due to the need for mirroring. Balance your budget constraints with the level of performance and redundancy required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complexity and Manageability&lt;/strong&gt;: Some RAID levels, like RAID 0 and RAID 1, are relatively simple to set up and manage, making them suitable for less experienced users. In contrast, RAID 5 and RAID 6 configurations involve distributed parity, which adds complexity but provides more redundancy. Consider the level of expertise and effort required for configuring and maintaining your chosen RAID level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specific Use Cases&lt;/strong&gt;: Certain RAID levels excel in particular scenarios. For instance, RAID 0 is ideal for temporary data storage or high-performance applications where redundancy is not a concern. RAID 5 and RAID 6 are well-suited for data-centric environments that require both performance and fault tolerance. Identify your specific use case to align it with the RAID level that best suits your requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
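&lt;p&gt;To make the capacity and fault-tolerance trade-offs above concrete, here is a hypothetical Python helper (the drive counts and the 3.84 TB size are illustrative, not from any real configuration; the RAID 10 figure is the guaranteed minimum, since in the best case it survives one failure per mirrored pair):&lt;/p&gt;

```python
# Hypothetical helper summarizing usable capacity and guaranteed fault
# tolerance for the common levels discussed above.
def raid_summary(level: str, n_drives: int, drive_tb: float):
    usable = {"0": n_drives, "1": n_drives / 2, "5": n_drives - 1,
              "6": n_drives - 2, "10": n_drives / 2}[level] * drive_tb
    # Guaranteed number of drive failures survived (RAID 10 may survive more).
    tolerated = {"0": 0, "1": n_drives - 1, "5": 1, "6": 2, "10": 1}[level]
    return usable, tolerated

for level, n in [("0", 4), ("1", 2), ("5", 4), ("6", 5), ("10", 4)]:
    usable, faults = raid_summary(level, n, 3.84)
    print(f"RAID {level:>2} on {n} x 3.84 TB: "
          f"{usable:.2f} TB usable, survives {faults} failure(s)")
```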

&lt;p&gt;By carefully evaluating these factors and understanding the strengths and weaknesses of each RAID level, you can confidently select the right RAID configuration that aligns with your storage needs and ensures the optimal balance between performance, data protection, and cost-effectiveness.&lt;/p&gt;

&lt;p&gt;Several software solutions are available to optimize RAID configurations and achieve peak performance. A notable example is xiRAID, a &lt;a href="https://xinnor.io/what-is-xiraid/" rel="noopener noreferrer"&gt;software RAID&lt;/a&gt; engine compatible with all RAID levels. We can help you choose the best solution for your business needs.&lt;/p&gt;

&lt;p&gt;Thank you for reading! If you have any questions or thoughts about these RAID levels, please leave them in the comments below. I’d love to hear your feedback and discuss how this setup could benefit your projects!&lt;/p&gt;

&lt;p&gt;The original article can be found &lt;a href="https://xinnor.io/blog/a-guide-to-raid-pt-2-raid-levels-explained/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>raid</category>
      <category>storage</category>
      <category>raidlevels</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Sergey Platonov</dc:creator>
      <pubDate>Mon, 10 Mar 2025 16:39:12 +0000</pubDate>
      <link>https://dev.to/pltnvs/-1n1i</link>
      <guid>https://dev.to/pltnvs/-1n1i</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/pltnvs" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1730592%2Fa940c3cd-3341-4a82-9c79-7d9d29871176.jpg" alt="pltnvs"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/pltnvs/how-to-build-high-performance-nfs-storage-with-xiraid-backend-and-rdma-access-5cio" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Build High-Performance NFS Storage with xiRAID Backend and RDMA Access&lt;/h2&gt;
      &lt;h3&gt;Sergey Platonov ・ Feb 18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#performance&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#rdma&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#nfs&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#storage&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>performance</category>
      <category>rdma</category>
      <category>nfs</category>
      <category>storage</category>
    </item>
  </channel>
</rss>
