FAQ - Installation

(Updated: Dec 2009)

Which operating systems are supported as clients and servers?

Please see OS Support.

Can you use NFS or CIFS to reach a Lustre volume?

Yes. Any native Lustre client (running Linux today, by definition) can export a volume using NFS or Samba. Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.

Although NFS export works today, it is slower than native Lustre access, and does not have cache coherent access that some applications depend on.

CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.

What is the typical MDS node configuration?

10,000-node clusters with moderate metadata loads are commonly supported with a 4-socket quad-core node with 32GB of RAM, providing sustained throughput of over 5,000 creates or deletes per second, and up to 20,000 getattr per second. It is common for these systems to have roughly 200 million files. Even in 10,000-client clusters, the single MDS has been shown not to be a significant bottleneck under typical HPC loads. Having a low-latency network like InfiniBand plays a significant role in improved MDS performance.

High throughput with very large directories is possible with extra RAM and, optionally, solid-state disks. Typically, write I/O is low, but seek latency is very important, hence RAID-0+1 mirrored storage (RAID-0 striping of multiple RAID-1 mirrored disks) is strongly recommended for the MDS. Access to metadata from RAM can be 10x faster, so RAM more RAM is usually beneficial.

MDS storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.

What is the typical OSS node configuration?

Multi-core 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual- or quad- socket and support up to 4 fibrechannel interfaces. RAM is used for locks and metadata caching, and in 1.8 and later extra server RAM will be used for file data caching, though is not strictly required. Aproximately 2 GB/OST is recommended for OSS nodes in 1.8, and twice that if OST failover will also run backup OSTs on the node.

Which architectures are interoperable?

Lustre requires the page size on server nodes (MDS and OSS) to be smaller or the same size as client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.

Which storage devices are supported, on MDS and OSS nodes?

Servers support all block storage: SCSI, SATA, SAS, FC and exotic storage (SSD, NVRAM) are supported.

Which storage interconnects are supported?

Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.

'''Are fibrechannel switches necessary? How does HA shared storage work?'''

Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and SAS devices will also work.

Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and can be configured to take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.

Can you put the file system journal on a separate device?

Yes. This can be configured when the backend ext3 file systems are created. Can you run Lustre on LVM volumes, software RAID, etc?

Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices. Can you describe the installation process?

The current installation process is straightforward, but manual:

1. Install the provided kernel and Lustre RPMs.

2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.

3. Format and mount the OST and MDT filesystems. The command is usually identical on all nodes, so it's easy to use a utility like pdsh/prun to execute it.

4. Start the clients with mount, similar to how NFS is mounted.

What is the estimated installation time per compute node?

Assuming that node doesn't require special drivers or kernel configuration, 5 minutes. Install the RPMs, add a line for the file system to /etc/fstab. Compute nodes can be installed and started in parallel.

What is the estimated installation time per I/O node?

5-30 minutes, plus formatting time, which can also be done in parallel. The per-node time depends on how easily the commands can be run in parallel on the I/O nodes. Install the RPMs, format the devices, add a line to /etc/fstab for the file system, mount.