FAQ - Metadata Servers
(Updated: Dec 2009)
How many metadata servers does Lustre support?
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.
You can configure multiple active metadata servers today, but they must each serve separate file systems.
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers to work in parallel for a single file system.
How will clustered metadata work?
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to look up or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.
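As a rough illustration of the name-to-server mapping (the hash function, the modulo placement, and all names below are assumptions for this sketch, not Lustre's actual clustered-metadata algorithm):

```c
#include <stdint.h>

/*
 * Illustrative only: map a name within a striped directory to one of the
 * metadata servers holding a share of that directory's namespace.  The
 * hash function and the modulo placement are assumptions for this sketch,
 * not Lustre's actual clustered-metadata algorithm.
 */
uint32_t name_hash(const char *name)
{
        uint32_t h = 5381;                      /* djb2-style string hash */

        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h;
}

/* Index of the MDS that handles lookups/creates for "name" in a
 * directory striped over mds_count metadata servers. */
uint32_t mds_for_name(const char *name, uint32_t mds_count)
{
        return name_hash(name) % mds_count;
}
```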
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.
Isn't the single metadata server a bottleneck?
Not so far. We regularly perform tests with single directories containing millions of files, and we have several customers with 10,000-node clusters (or larger) and a single metadata server.
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the client has obtained the striping information, the metadata server is no longer involved in file I/O.
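The division of labour can be sketched as follows. All of the types and function names here are hypothetical, invented only for illustration; the point is that the MDS is contacted once at open time to obtain the layout, and every subsequent read or write is computed against that cached layout and sent directly to the OSSs:

```c
/*
 * Hypothetical client-side view of the division of labour described
 * above; none of these names are real Lustre APIs.
 */
struct layout {
        int  stripe_count;      /* how many OSTs hold the file's data    */
        long stripe_size;       /* bytes on one OST before moving on     */
        int  ost_index[8];      /* which OSTs hold the stripes           */
};

/* Stand-in for the single open RPC to the MDS: returns the file's layout. */
struct layout mds_open(const char *path)
{
        struct layout lo = { .stripe_count = 2, .stripe_size = 1L << 20,
                             .ost_index = { 3, 7 } };
        (void)path;             /* a real client would send this to the MDS */
        return lo;
}

/* Choose the OST for a file offset purely from the cached layout; no
 * further MDS traffic is needed on the I/O path. */
int ost_for_offset(const struct layout *lo, long offset)
{
        long stripe_nr = offset / lo->stripe_size;

        return lo->ost_index[stripe_nr % lo->stripe_count];
}
```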
The Lustre metadata server software is extremely multithreaded, and we have made substantial efforts to reduce the lock contention on the MDS so that many clients can work concurrently. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second. Future versions of Lustre will continue to improve the metadata performance.
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SunFire x4600 or Bull NovaScale) with large memories and solid-state disks, which can use large caches to speed up MDS operations.
What is the typical MDS node configuration?
Please see Installation.
How do I automate failover of my MDSs?
Please see Recovery.
Do you plan to support MDS failover without shared storage?
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.
How is metadata allocated on disk?
The standard way that Lustre formats the MDS file system is with 512-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store inside the inode, and it will then be stored in separate blocks. By storing the EA in the inode whenever possible, we avoid an additional, very expensive disk seek.
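A back-of-the-envelope check shows why wide striping pushes the EA out of the inode. All of the sizes below are illustrative assumptions, not the exact on-disk format:

```c
#include <stdio.h>

/*
 * Back-of-the-envelope check of when a file's striping EA still fits in
 * the enlarged MDS inode.  All sizes are illustrative assumptions, not
 * the exact on-disk format.
 */
#define INODE_SIZE      512     /* MDS inodes formatted at 512 bytes       */
#define CORE_INODE_SIZE 128     /* classic ext3 inode fields               */
#define EA_HEADER_SIZE   32     /* assumed fixed striping-EA header        */
#define PER_STRIPE_SIZE  24     /* assumed bytes of object info per stripe */

static int ea_fits_in_inode(int stripe_count)
{
        int ea_bytes   = EA_HEADER_SIZE + stripe_count * PER_STRIPE_SIZE;
        int spare_room = INODE_SIZE - CORE_INODE_SIZE;

        return ea_bytes <= spare_room;
}

int main(void)
{
        for (int stripes = 1; stripes <= 64; stripes *= 2)
                printf("%2d stripes: EA %s the inode\n", stripes,
                       ea_fits_in_inode(stripes) ? "fits in" : "spills out of");
        return 0;
}
```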
What is intent locking?
Most file systems operate in one of two modes: a mode in which the server performs the metadata modifications, or a mode in which the client can cache metadata updates itself. Each approach has advantages and disadvantages in certain situations.
Consider the case of 1,000 clients all chdir'ing to /tmp and creating their own output files. If each client locks the directory, adds its file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.
Consider another very common case, such as a user's home directory being used by only one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the choice of either executing the operation immediately (and returning only a result code) or returning a lock that allows the client to perform writeback caching.
Lustre 1.x does not include a metadata writeback cache on the client, so today's metadata server always executes the operation on the client's behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.
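The protocol can be sketched roughly as follows; the structures and names are hypothetical rather than Lustre's real data types, and the policy shown is only meant to illustrate the choice the server is given:

```c
#include <stdbool.h>

/*
 * Hypothetical shapes for an intent-based request; the real Lustre
 * structures and policy differ.  The point is that the operation travels
 * with the initial lock request, and the server chooses how to answer.
 */
enum intent_op { INTENT_LOOKUP, INTENT_CREATE, INTENT_OPEN, INTENT_UNLINK };

struct md_intent {
        enum intent_op op;
        const char    *parent;  /* directory the operation applies to */
        const char    *name;    /* name being looked up or created    */
        unsigned int   mode;    /* permissions for creates            */
};

struct md_reply {
        bool granted_lock;      /* true: client may writeback-cache     */
        int  result;            /* otherwise: result of the executed op */
};

struct md_reply mds_handle_intent(const struct md_intent *it, bool contended)
{
        struct md_reply rep = { .granted_lock = false, .result = 0 };

        if (contended) {
                /* Execute it->op on the server and return only the result
                 * code -- the branch Lustre 1.x always takes today. */
                rep.result = 0;                 /* stand-in for execution */
        } else {
                /* Grant a lock so the client could cache updates locally
                 * (the future writeback-cache path). */
                rep.granted_lock = true;
        }
        (void)it;
        return rep;
}
```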
How does the metadata locking protocol work?
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file dir/file, as would happen during the unpacking of a tar archive. The client would first look up dir, and in doing so take a lock to cache the result. It would then ask the metadata server to create "file" inside of it, which would lock dir to modify it, thus yanking the lock back from the client. This "ping-pong" back and forth is unnecessary and very inefficient, so Lustre 1.4.x introduced separate locking of different parts of the inode (simple existence, directory pages, and attributes).
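The idea can be sketched with a small set of lock bits; the names, values, and conflict rule below are illustrative, not Lustre's actual definitions:

```c
/*
 * Sketch of splitting a single inode lock into independently grantable
 * parts, so that inserting a new name in "dir" on the MDS does not revoke
 * a client's cached lookup of "dir".  The bit names and conflict rule are
 * illustrative; Lustre's actual definitions differ (and real conflicts
 * also depend on the lock mode, which is ignored here).
 */
#define LCK_LOOKUP      0x1     /* existence of a name                */
#define LCK_UPDATE      0x2     /* directory contents / pages         */
#define LCK_ATTR        0x4     /* attributes (ownership, times, ...) */

/*
 * tar-style create of dir/file:
 *   - the client keeps LCK_LOOKUP on "dir" to cache its lookup result;
 *   - the server takes only LCK_UPDATE on "dir" to insert "file",
 *     so the client's lock is not yanked back ("ping-pong" avoided).
 */
int lock_bits_conflict(unsigned int held_by_client, unsigned int wanted_by_server)
{
        return (held_by_client & wanted_by_server) != 0;
}
```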
Does the MDS do any pre-allocation?
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.
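A simplified sketch of such a preallocation pool is shown below; the names, sizes, and thresholds are invented for illustration and do not reflect the actual MDS implementation:

```c
#include <pthread.h>

/*
 * Simplified sketch of an MDS-side pool of pre-created OST objects.
 * Names, sizes, and thresholds are invented for illustration; the real
 * implementation differs.
 */
#define POOL_MAX        32      /* objects pre-created per OST            */
#define POOL_LOW         8      /* ask for more when the pool drops below */

struct ost_pool {
        pthread_mutex_t lock;
        unsigned long   objects[POOL_MAX];      /* pre-created object IDs */
        int             count;
};

/* Stand-in for the asynchronous request that asks the OST to pre-create
 * more objects; the sketch leaves it empty. */
static void ost_precreate_async(struct ost_pool *pool, int how_many)
{
        (void)pool;
        (void)how_many;
}

/* Hand out one pre-created object for a new file stripe: no OST RPC is
 * needed on the create path unless the pool has run completely dry. */
long pool_get_object(struct ost_pool *pool)
{
        long obj = -1;
        int  need_refill;

        pthread_mutex_lock(&pool->lock);
        if (pool->count > 0)
                obj = (long)pool->objects[--pool->count];
        need_refill = (pool->count < POOL_LOW);
        pthread_mutex_unlock(&pool->lock);

        if (need_refill)
                ost_precreate_async(pool, POOL_MAX);

        return obj;             /* -1: caller must wait for precreation */
}
```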