ZFS and Lustre

(Updated: Sep 2010)

The Lustre™ node file system ldiskfs (based on ext3/ext4) is limited to an 8 TB maximum file system size and offers no guarantee of data integrity. To improve the reliability and resilience of the underlying file system on the OSS and MDS components, Lustre will add ZFS support.

ZFS support in Lustre will offer a number of advantages, such as improved data integrity through transaction-based, copy-on-write operations and an end-to-end checksum on every block.

Copy-on-write means that ZFS never overwrites existing data. Changed information is written to a new block, and the block pointer to the in-use data is moved only after the write transaction has completed. This mechanism applies all the way up the file system block structure to the top block (the uberblock in ZFS terminology).
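The copy-on-write behavior described above can be illustrated with a small sketch. This is not ZFS code: the Block class and the cow_update function are hypothetical names invented for this example, and a real file system works with on-disk blocks and transaction groups rather than Python objects. The sketch only shows the principle that a change creates new blocks along the path to the root while the old tree stays intact until the top pointer is switched.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical immutable block: leaf data or pointers to child blocks."""
    data: bytes = b""
    children: Tuple["Block", ...] = ()


def cow_update(root: Block, path: Tuple[int, ...], new_data: bytes) -> Block:
    """Return a new root that reflects the change; never modify an existing block."""
    if not path:
        # Leaf reached: the changed data goes into a brand-new block.
        return Block(data=new_data)
    idx = path[0]
    new_child = cow_update(root.children[idx], path[1:], new_data)
    # Rebuild only the blocks on the path; untouched siblings are shared.
    new_children = root.children[:idx] + (new_child,) + root.children[idx + 1:]
    return Block(data=root.data, children=new_children)


# The "top block" keeps pointing at old_root until the new tree is complete.
old_root = Block(children=(Block(data=b"A"), Block(data=b"B")))
new_root = cow_update(old_root, (1,), b"B2")

print(old_root.children[1].data)  # the old version is still readable
print(new_root.children[1].data)  # the new version is visible via the new root
```

Because the old root is never modified, an interrupted update in this model simply leaves the previous, consistent tree in place.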

To guard against data corruption, ZFS performs end-to-end checksumming. The checksum is not stored with the data block, but rather in the pointer to the block. All checksums are computed in server memory, so ZFS detects errors that other file systems do not catch, such as:
 * Phantom writes, where the write is silently dropped.
 * Misdirected reads or writes, where the disk accesses the wrong block.
 * DMA parity errors between the array and server memory or from the driver, since the checksum validates data inside the array.
 * Driver errors, where data winds up in the wrong buffer inside the kernel.
 * Accidental overwrites, such as swapping to a live file system.
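A minimal sketch of why keeping the checksum in the block pointer, rather than next to the data, catches errors such as phantom writes. Everything here is an assumption made for illustration: the Disk and BlockPointer classes and the write_block and read_block helpers are hypothetical, and SHA-256 is used only because it is available in the Python standard library; the text above does not say which checksum algorithm is used.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> bytes:
    # Illustrative choice of algorithm, not taken from the text above.
    return hashlib.sha256(data).digest()


@dataclass(frozen=True)
class BlockPointer:
    """Hypothetical parent-side pointer: address plus checksum of the child."""
    address: int
    checksum: bytes


class Disk:
    """Toy disk: a mapping from block address to block contents."""

    def __init__(self) -> None:
        self.blocks = {}

    def write(self, address: int, data: bytes) -> None:
        self.blocks[address] = data

    def read(self, address: int) -> bytes:
        return self.blocks.get(address, b"")


def write_block(disk: Disk, address: int, data: bytes) -> BlockPointer:
    disk.write(address, data)
    # The checksum travels with the pointer, not with the data block.
    return BlockPointer(address, checksum(data))


def read_block(disk: Disk, ptr: BlockPointer) -> bytes:
    data = disk.read(ptr.address)
    if checksum(data) != ptr.checksum:
        raise IOError(f"checksum mismatch at block {ptr.address}")
    return data


disk = Disk()
ptr = write_block(disk, 7, b"new contents")

# Simulate a phantom write: the disk still holds the stale block.
disk.blocks[7] = b"old contents"
try:
    read_block(disk, ptr)
except IOError as err:
    print("detected:", err)
```

If the checksum were stored inside the block itself, the stale block would still carry a checksum that matches its own stale contents, and the error would go unnoticed.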

In Lustre, ZFS checksumming will be done by the Lustre client on the application node. This will detect any data corruption introduced anywhere along the path, including the network, between the application node and the disk drive in the Lustre storage system.

Previous testing of Lustre with network checksums detected previously unknown corruption in network cards. These cards silently introduced data corruption that would have gone undetected without checksums. Note that checksum computation does consume processor cycles: approximately 1 GHz of CPU time to process 500 MB/sec of I/O.
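The CPU cost quoted above works out to about 2 cycles per byte. The short calculation below makes that explicit and extrapolates to another I/O rate. It assumes 1 MB = 10^6 bytes and that the cost scales linearly with throughput; both are assumptions of this sketch, not claims from the testing described above, and the function names are made up for the example.

```python
def cycles_per_byte(cpu_hz: float, io_bytes_per_sec: float) -> float:
    """CPU cycles spent on checksumming for each byte of I/O."""
    return cpu_hz / io_bytes_per_sec


def cpu_hz_needed(io_bytes_per_sec: float, cost_cycles_per_byte: float) -> float:
    """CPU rate needed at a given I/O rate, assuming linear scaling."""
    return io_bytes_per_sec * cost_cycles_per_byte


# Figures from the text: 1 GHz of CPU time per 500 MB/sec of I/O.
cost = cycles_per_byte(1e9, 500e6)
print(cost)  # 2.0 cycles per byte

# Hypothetical extrapolation to 2 GB/sec of I/O.
print(cpu_hz_needed(2e9, cost) / 1e9, "GHz")
```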

An implementation note: ZFS support was previously developed and tested with a user space implementation of the ZFS DMU. The DMU now runs in kernel space. In addition, the Lustre DMU code is almost entirely common with the Solaris version of ZFS, so Lustre support for ZFS will closely parallel the Solaris release of ZFS.

Lustre support of ZFS will offer several specific advantages:
 * Self-healing capability - In a mirrored or RAID configuration, ZFS not only detects data corruption but also automatically corrects the bad data.
 * Improved administration - Because ZFS detects and reports data corruption at the block level on every read and write, it is easier for system administrators to quickly identify which hardware components are corrupting data. ZFS also has very easy-to-use command-line administration utilities.
 * SSD support - ZFS supports the addition of high-speed I/O devices, such as SSDs, to the storage pool. The Read Cache Pool or L2ARC acts as a cache layer between memory and the disk. This support can substantially improve the performance of random read operations. SSDs can also be used to improve synchronous write performance, by adding them to the pool as log devices. You can add as many SSDs to your storage pool as you need to increase your read cache size and IOPS, your synchronous write IOPS, or both.
 * Scalability - ZFS is a 128-bit file system. This means that current restrictions on the maximum file system size for a single MDT or OST, the maximum stripe size, and the maximum size of a single file will be removed. ZFS support will also remove the current 16 TB limitation on LUNs.
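The self-healing capability listed above can be sketched as follows for a simple mirror. This is a toy model under stated assumptions, not the ZFS implementation: each mirror side is represented as a Python dict, self_healing_read is a hypothetical helper, and SHA-256 stands in for whatever checksum the pool actually uses. The expected checksum is assumed to come from the parent block pointer.

```python
import hashlib
from typing import Dict, List, Tuple


def checksum(data: bytes) -> bytes:
    # Illustrative algorithm choice for this sketch.
    return hashlib.sha256(data).digest()


def self_healing_read(
    mirrors: List[Dict[int, bytes]], address: int, expected: bytes
) -> Tuple[bytes, List[int]]:
    """Return verified data and the indexes of mirror sides that were repaired."""
    good = None
    for side in mirrors:
        data = side.get(address)
        if data is not None and checksum(data) == expected:
            good = data
            break
    if good is None:
        raise IOError(f"no valid copy of block {address}")

    repaired = []
    for index, side in enumerate(mirrors):
        data = side.get(address)
        if data is None or checksum(data) != expected:
            side[address] = good  # rewrite the bad copy from the good one
            repaired.append(index)
    return good, repaired


payload = b"file data"
mirrors = [{3: b"corrupted"}, {3: payload}]
data, repaired = self_healing_read(mirrors, 3, checksum(payload))
print(data, repaired)
```

After the call, both sides of the toy mirror hold the verified data, so a later read from either side succeeds without repair.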

For more general information about ZFS, see ZFS Resources.