FAQ - Object Servers and I/O Throughput

What levels of throughput should I expect?

This depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application's I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound and not dominated by extremely small I/O requests), Lustre has demonstrated up to 90% of the system's raw I/O bandwidth capability.

With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:


 * TCP/IP
   * Single-connected GigE: 115 MB/s
   * Dual-NIC GigE on a 32-bit OSS: 180 MB/s
   * Dual-NIC GigE on a 64-bit OSS: 220 MB/s
   * Single-connected 10GigE on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)
 * Unoptimized InfiniBand
   * Single-port SDR InfiniBand on a 64-bit OSS: 700-900 MB/s
   * DDR InfiniBand on a 64-bit OSS: 1500 MB/s

How fast can a single OSS be?

Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS with DataDirect Networks storage and 3 rails of Quadrics Elan 4 achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.

Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.

How well does Lustre scale as OSSs are added?

Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.
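As a rough sanity check on the figures above, the per-server share of the aggregate can be computed directly. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB), uses the 115 MB/s single-connected GigE figure quoted earlier as the reference point, and the function name is purely illustrative.

```python
def per_server_bandwidth_mb(aggregate_gb_s, num_servers):
    """Average bandwidth per OSS in MB/s, assuming 1 GB = 1000 MB."""
    return aggregate_gb_s * 1000.0 / num_servers


# Figures quoted in this FAQ: 11.1 GB/s aggregate across 104 OSSs,
# and 115 MB/s for a single OSS on single-connected GigE.
per_oss = per_server_bandwidth_mb(11.1, 104)
efficiency = per_oss / 115.0

print(f"{per_oss:.1f} MB/s per OSS, {efficiency:.0%} of the single-server GigE figure")
```

Under these assumptions each OSS delivers roughly 107 MB/s, which is close to what one server achieves on its own and is consistent with near-linear scaling.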

How many clients can each OSS support?

The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see File Sizing.

What is a typical OSS node configuration?

Please see Installation.

How do I automate failover of my OSSs?

Please see Recovery.

Do you plan to support OSS failover without shared storage?

Yes. Server Network Striping will allow RAID-1 and RAID-5 file I/O features. These will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage. In the meantime, some users have been trying DRBD to replicate the OST block device to a backup OSS instead of using shared storage, though this is not an officially supported configuration.

How is file data allocated on disk?

Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like "934151", which are object numbers. Inside each object is a file's data, or a portion of that file's data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.
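To illustrate how "a portion of that file's data" can end up in each object, here is a minimal sketch of a round-robin (RAID-0 style) striping calculation. The FAQ does not spell out the mapping, so the layout, the function name, and the example stripe size and count are assumptions chosen for illustration, not a description of Lustre's internal code.

```python
MIB = 1024 * 1024


def locate(offset, stripe_size, stripe_count):
    """Map a file offset to (object index, offset inside that object).

    Assumes simple round-robin striping: stripe 0 goes to object 0,
    stripe 1 to object 1, and so on, wrapping around after stripe_count.
    """
    stripe_number = offset // stripe_size
    obj_index = stripe_number % stripe_count
    # Each full round of stripe_count stripes adds one stripe to every object.
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset


# Hypothetical file striped over 4 objects with a 1 MiB stripe size.
print(locate(0, MIB, 4))             # first byte of the file
print(locate(5 * MIB + 10, MIB, 4))  # byte 10 of the sixth stripe
```

With these example values, the sixth stripe wraps around to the second object and begins 1 MiB into it, since that object already holds the second stripe.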

The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre's ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.
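The metadata saving from extents can be seen with a small sketch. It collapses a per-block list into (start, length) pairs. This is a conceptual illustration only, not the on-disk format; the function name and block numbers are made up for the example.

```python
def blocks_to_extents(blocks):
    """Collapse an ordered list of block numbers into (start, length) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # Block directly follows the previous extent: grow it by one.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap or first block: start a new extent.
            extents.append((block, 1))
    return extents


# A hypothetical file laid out in two contiguous runs on disk.
block_map = list(range(1000, 1256)) + list(range(5000, 5128))
extent_map = blocks_to_extents(block_map)

print(len(block_map), "block entries vs", len(extent_map), "extents")
print(extent_map)
```

Here 384 individual block entries reduce to two extents. The better the allocator is at handing out large contiguous runs, the fewer extents a file needs.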

How does the object locking protocol work?

Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:

First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.

Second, it removes the so-called "split-brain" problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.

In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.

If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.
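The callback and eviction sequence described above can be sketched as a toy model. Everything here is a simplification made for illustration: the class and method names are hypothetical, only exclusive locks are modelled, and an unresponsive client is treated as having already exceeded the timeout rather than being timed with a real clock.

```python
class Client:
    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive  # False models a powered-off or failed node
        self.dirty = []               # (object id, data) pairs not yet written back
        self.cache = set()            # object ids cached under a lock
        self.evicted = False

    def on_blocking_callback(self, obj, storage):
        """Asked to drop the lock on obj; return True once it is released."""
        if not self.responsive:
            return False
        # 1. Write back cached modifications for this object.
        for o, data in [d for d in self.dirty if d[0] == obj]:
            storage[o] = data
        self.dirty = [d for d in self.dirty if d[0] != obj]
        # 2. Remove data that will no longer be covered by a lock.
        self.cache.discard(obj)
        # 3. Only now may the lock be dropped.
        return True


class LockServer:
    """One lock server per OST, covering the objects stored on that OST."""

    def __init__(self):
        self.holders = {}  # object id -> Client holding the lock
        self.storage = {}  # object id -> data on disk

    def acquire(self, client, obj):
        if client.evicted:
            raise PermissionError(f"{client.name} must reconnect first")
        holder = self.holders.get(obj)
        if holder is not None and holder is not client:
            if not holder.on_blocking_callback(obj, self.storage):
                holder.evicted = True  # timeout expired: evict the holder
        self.holders[obj] = client
        client.cache.add(obj)


server = LockServer()
a, b, c = Client("a"), Client("b", responsive=False), Client("c")
server.acquire(a, 934151)
a.dirty.append((934151, b"new data"))
server.acquire(b, 934151)  # a writes back, clears its cache, releases
server.acquire(c, 934151)  # b never answers, so it is evicted
print(server.storage[934151], b.evicted)
```

In this run client a's modification reaches storage before the lock changes hands, and client b is refused any further operation until it reconnects, which this toy model does not implement.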

Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object's attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing ("ls -l") in the output directory while the job is writing its data.

Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don't actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.
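The decision the server makes for such an attribute request can be sketched as follows. This is a deliberately simplified model of the behaviour described above, not the actual protocol: the names are hypothetical, and "actively modified" is reduced to a single writer field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectState:
    size: int = 0
    writer: Optional[str] = None  # client currently modifying the object, if any


def glimpse(obj, client):
    """Return (size, lock_granted) for an attribute request from client."""
    if obj.writer is not None and obj.writer != client:
        # Object is being modified elsewhere: hand back the current
        # attributes only, so the client must ask again next time.
        return obj.size, False
    # Nobody else is writing: grant a lock so the attributes can be cached.
    return obj.size, True


busy = ObjectState(size=4096, writer="job-node")
idle = ObjectState(size=8192)

print(glimpse(busy, "login-node"))
print(glimpse(idle, "login-node"))
```

In the first case the client gets an up-to-date size without disturbing the writer's lock; in the second it receives a lock and can answer repeated "ls -l" calls from its own cache.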

Does Lustre support Direct I/O?

Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O, but does not cache the data on the client or server. With an RDMA-capable network (anything other than TCP) there is only a single data copy directly from the client RAM to the server RAM and straight to the disk.

Can these locks be disabled?

Yes, but:


 * It's only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.
 * In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.

Do you plan to support T-10 object devices?

We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.

Does Lustre support/require special parallel I/O libraries?

Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI-IO libraries, provided the I/O size is sufficient.

A Lustre-specific MPI-IO ADIO driver has been developed to allow an application to provide hints about how it would like its output files to be striped, and to optimize the I/O pattern when many clients are doing small read or write operations to a single file.

(Update 12/09)