Architecture - Clustered Metadata

Clustering Metadata
In order to provide enhanced scalability and performance, Lustre offers clustered metadata servers. This section will give an outline of the architecture. The main challenge we face is to provide a substantial gain in scalability of the metadata performance of Lustre through great parallelism of common operations. This involves finding mechanisms which distribute operations evenly over the metadata cluster, while avoiding a more complex protocol involving further RPC’s. The current trend in distributed file system design is to do such clustering by allowing clients to pre-compute the location of the correct services.

A second challenge is to provide good load balancing and resource allocation properties both for large installations where the metadata cluster acts in effect as a metadata server and in the case of small clusters in which the metadata cluster itself will access metadata on other nodes in the cluster.

Our architecture accomplishes this by heavily leveraging existing building bricks, primarily existing file systems and their metadata interfaces. Finally the key challenge is to provide good scalability and simple recovery within the metadata cluster itself.

Summary of metadata clustering configurations.
Overall the clustered metadata handling is structured as follows.


 * A cluster of metadata servers manage a collection of inode groups. Each inode group is a Lustre device exporting the usual metadata api, augmented with a few operations specifically crafted for metadata clustering. We call these collections of inodes inode groups.
 * Directory formats for file systems used on the MDS devices are changed to introduce a allow directory entries to contain an inode group and identifier of the inode.
 * A logical metadata volume (LMV) driver is introduced below the client Lustre file system write back cache driver that maintains connections with the MDS servers.
 * There is a single metadata protocol that is used by the client file system to make updates on the MDS’s and by the MDS’s to make updates involving other MDS’s.
 * There is a single recovery protocol that is used by the clients -MDS and MDS-MDS service.
 * Directories can be split across multiple MDS nodes. In that case a primary MDS directory inode contains an extended attribute that points at other MDS inodes which we call directory objects.

Modular design.
Client systems will have the write back client (WBD) or client file system directly communicate with the LMV driver: it offers themetadata api to the file system and uses the metadata api offered by a collection of MDC drivers. Each MDC driver managed the metadata traffic to one. The function of the LMV is very simple: it figures out from the command issued what MDC to use. This is based on: (1) the inode groups in the request (2) a hash value of names used in the request, combined with the EA of a primary inode involved in the request. (3) for readdir the directory offset combined with the EA of the primary inode (4) the clustering descriptor In any case every command is dispatched to a single metadata server, the clients will not engage more than one metadata server for a single request. The api changes here are minimal and the client part of the implementation is very trivial.

Basics of the operations.
For the most part, operations are extremely similar or identical to what they were before. In some cases multiple mds servers are involved in updates. Getattr, open, readdir, setattr and lookup methods are unaffected. Methods adding entries to directories are modi.ed in some cases: (1) ”’mkdir”’ always create the new directory on another MDS (2) ”’unlink, rmdir, rename”’: may involve more than one MDS (3)	”’large directories”’ all operations making updates to directories can cause a directory split. The directory split is discussed below. (4) ”’other operations”’ If no splits large directories are encountered all other operations proceed as they are executed on one MDS.

Directory Split.
A directory that is growing larger will be split. There is a fairly heavy penalty associated with splitting the directory and also with renames in within split directories. Moreover, at the point of splitting, inodes become remote and will incur a penalty upon unlink.

Probably it is best to delay the split until the directory is fairly large, and then to split over several nodes, to avoid further splits being necessary soon afterwards.

Locking.
Locking can be done in fid order as it is currently done on the MDS. In order to obtain cluster wide ordering of resources, clients must chose the correct coordinating MDS, so that locks taken there initiate the lock ordering sequence to be followed. This is particuarly important for rename, which has to be started at the target or source directory, depending on which the highest order resource occurs.

Resources.
The MDS handles the persistent storage of metadata objects and directory data. Internal to the metadata service is a large amount of allocation management.

The use of resources is easily summarized as follows: (1) Look up the name in a directory (2) insert / remove names in a directory
 * Names: :

(1) get attributes for a fid (2) create, remove the corresponding object
 * FID::

The ownership of resources varies among file systems. In local file systems a single node owns all resources. No parallelism can be achieved with this. In traditional clustering file systems, nodes own individual inodes or disk blocks. This leads to fine grained ownership of resources, but involves frequent collisions and poor locality of reference.

For Lustre we propose that each node owns a moderately large group of objects. There would be a large shared storage pool, which would be subdivided into relatively small file systems, this is shown in figure 6.7.1. We call the small file systems an inode group. Each inode group has its own journal for recovery, is formatted as a file system and can fail-over to another node for availability or adjustment of resources. We will make the load on the inode groups evenly distributed through randomness.

Clients will get a logical clustered metadata driver which exploits multiple MDC clients (see figure 6.7.2). Just like the logical object volume, the file system itself does not need to know the details of the object distribution, that can be left to a small logical metadata volume driver, invoked by the file system through the same API. The MDS system will get clustering and policy adaptations. The key to this is to add an inode group identifier to the fid, this marks the inode group to which an inode belongs. The resource database for the cluster will provide every client with a load balancing map which indicates on which MDS server a particular inode group is currently mounted.

The resource location will be managed as follows: File inodes: 

Directory inodes:
 * Create the file inode in the inode group of the directory inode holding the name


 * Create in a new inode group
 * The policy on which group to pick could be round robin, random, most space available etc. Probably every MDS reply packet should contain some status information to give clients policy information.

Directory data:


 * While the directory is small, keep it with the inode
 * When it grows fan it out.

Clustered directories.
When directories grow we will split them up into directory data objects which are placed on multiple MDS servers, the figure 6.7.3shows this transition from a single directory to multiple directory objects. This is quite analogous to striped files, which are placed in data objects on multiple servers.



Directory entries will hold a inode group identifier and inode number, compared to traditional entries holding merely a name and inode number. So once a name is found in directory data the inode group and inode number in this group is known. getattr_lock(parent_fid,: name) To find the directory entry itself, the algorithm is similar to that of finding a file stripe. When a directory inode is located, the inode will either contain directory data in which case it is treated as a traditional directory. It can also contain an extended attribute describing what inode buckets exist, by specifying a fid for each bucket, each fid specifying its inode group, inode number and generation. A hash will then map the name to a particular bucket based on this metadata. A normal name lookup in the bucket will proceed to find the entry. The worst case here is that this requires 3 RPC’s. The first one to do a getattr on the directory inode which would give the extended attribute, the second to find the directory entry on the server holding the bucket, and the 3rd to find the inode attributes in the inode group associated with the entry. However, the common case is that a single RPC is sufficient, since normally the directory inode will be cached already, so the first RPC will go to the server containing the bucket. Furthermore, usually the inode is located on that server and will be fetched in the same RPC. The number of disk reads is identical or one higher than that for large non-clustered directories. The process of creating a clustered directory is triggered by the directory growing beyond a certain size. The splitting of a directory occurs quite as early as possible, there might be a small effect to performance in the beginning when a directory is split. But the aggregate performance would be good since parallel operations can be done.

Directory inodes and clustered metadata.
Directory inodes come in two variants:

small directories: An ordinary directory inode in a single inode group. large directories: master directory inode: with an EA pointing to the buckets in other inode groups bucket inodes:in other inode groups. The buckets are associated with an inode that manages the space allocation for the bucket directory data. The bucket directory data describes the directory data covering a range of hash values. It provides a map from name to (group, inode number) to identify the fid up to the generation number. The fanout operation, triggered by a directory growing beyond a certain size creates the buckets. This involves a new RPC in the MDS service that allows the creation of a remote bucket, and to populate it with directory entries.

This is a simple RPC that brings no complications to recovery since the buckets are exclusively visible to the the inode group of the master. It is possible that buckets are orphaned, and this requires cleanup.

Removal of a fanned out directory is similar in complexity. Here it is important to use an MDS to MDS reconnect handshake, identical to the client -mds handshake, between the master inode server and mds’s holding the inode groups holding the buckets to handle the failure of MDS servers that have buckets that need to be removed.

The security of such MDS-MDS interaction is probably most easily managed with a capability model similar to that found between the clients and OST.

The attributes of clustered directories are most easily managed in a distributed fashion as we do for the file data objects. size: sum of all the bucket sizes link count: sum of all the bucket link counts mtime: latest of all mtimes

Clustered MDS protocol.
The clustered MDS protocol involves a few changes to the API implementation found above. Most of the changes involve some new API calls between MDS servers. The goal is to use a single recovery infrastructure among the MDS servers and the clients, as described earlier in this chapter. Some detailed works remains to be done for the design to avoid cyclic lock dependencies or acknowledgment graphs (refer to section 11.3.6). As described previously in section 11.3.6, we now enforce ACKs for replies. The MDS takes locks on the resources it modi.es, these locks are canceled once ACKs are received. In the clustered MDS scenario, it is important to ensure that a deadlock is not caused as a result of the various systems waiting for ACKs from each other.
 * mds_create: :This call needs modifications when creating a new directory, because the new directory inode and new directory data will be created on another MDS server than the parent. The node holding the parent directory data will do a lookup, find it’s negative and hold a lock. Now it will make an MDC RPC to create a remote inode. When that call returns, the directory data can be filled in. The key issue here is recovery of the remote inode creation, which either requires writing the fid of the created inode in the commit log or using preallocated inodes. It is easy to see that in the normal case of file creations the code path is equally efficient for a clustered metadata service and a single node one.


 * mds_rename/mds_link:: These calls are probably the most interesting of all. It will involve three nodes. The source and target nodes holding the directory data and the node holding the inode which has a link that is to be renamed. An important invariant is that bucket in-odes and directory inodes are always on the same node as the node holding the associated data. This call pattern involves the mds making a remote link RPC to another MDS and a remote setattr RPC to the MDS holding the inode to be renamed. The calls appear to be easily recovered in case of failures.


 * mds_unlink:: This is also a two stage call. Both for creation and unlinking the management of orphans is important. This orphan management is entirely analogous between the MDS and OST data objects. The orphaned objects can be created during the object creation/removal, objects might be created on the OSTs, but the MDS could fail before recording these in the extended attributes on a persistent store. Similarly, during deletion, its possible that the record of the objects is deleted on the MDS but the corresponding objects are not deleted on the OSTs before some failure occurs. These first situation can only be prevent by requiring the OSTs to log every object creation, the MDS would send an asynchronous message to the OSTs once the objects information has been stored on persistent store. The OSTs can then delete the corresponding logs. Similarly, in the second case, the MDS can keep logs of object deletion, if an OST fails before removing the corresponding objects, it could check with the MDS upon recovery and delete the required objects.

Client -MDS replay protocol.
The clustered MDS -client recovery protocol is very similar to the single MDS -client protocol. In this case also, the MDS servers need to track whether a client request was executed, replied or committed. The MDS also regards other MDS systems that make requests as part of clustered metadata updates as a client for recovery purposes. If a request is committed, a replay is not required, the metadata server can simply forget the state associated with that request, except that it needs to be capable to reproduce the reply until the client has ack’d that. For a request was not executed, the client can simply retransmit it upon recovery; Lustre uses the word resending for this part of recovery. For requests that were executed and saw replies but lost on persistent storage the retransmission mechanism is called replay.

Replay.
To order transaction sequences Lustre uses reply ack’s: the acks server only one purpose to release a lock that enforces ordering of the transaction sequence. In the case where MDS operations involve more than server, the reply "ack" from the primary to secondary servers should only be sent after the client has sent the ack to the first server. This MDS-MDS reply ack is now not really an ack anymore but a simple lock cancelation review. Clients will replay lost transactions to the mds which they originally engaged for the request. Orphaned children will be cleaned up only after replay completes to allow orphaned objects to be re-used during replay.

Failures of multiple MDS nodes.
The handling of recovery of orphan objects between clustered metadata servers is identical to that of the single MDS case.

A new problem arises from multiple metadata server failures, such as present in the case of power-off. In this case the MDS should be rolled back to a consistent state.

Example: In transaction one, a node X creates directories a. Then in transaction 2 a cross MDS node rename moves a file with a directory entry on node Y into this directory. It is now possible for this file to lose its directory entry on Y and for the transaction on X not to commit. More complex examples exist.

We do this with a standard algorithm known as a consistent cut in causal time or snapshot (see Birman [] or other books on distributed algorithms). A consistent snapshot is a state of the MDS that could have been reached through full execution of requests coming from cilents, in other words, a consistent snapshot is a state of the MDS file systems that represents a valid file system. After multiple simultaneous MDS failures the state of the MDS’s must be rolled back to a consistent snapshot. We say that a transaction on an MDS1 depends on a transactions on MDS2 when the completion of a request to MDS1 has the transaction on MDS2 as a component.

Each MDS retains logs of transactions, sufficiently detailed that they can be undone. Each log record contains a transaction number corresponding to the transaction on this node and the transaction numbers of transactions that were started on other MDS to complete this transaction. The log records can be used for two operations. Log records can be canceled when the MDS cluster as a whole has committed the transactions that relate a particular log record. Also records can be used to undo operations that were already performed.

Every few seconds, the cluster computes a snapshot by first electing a leader. First leader asks all MDS’s to give their last committed transaction numbers. The MDS’s respond and also provide the transaction numbers for other MDS’s they depend on for this transaction. If an MDS provided a dependency higher than what was committed, that MDS should be asked to resend its transactions and dependencies to account for this. This algorithm then repeats and it converges because it produces a strictly decreasing set of transaction numbers. When the transaction numbers have reached a consistent snapshot, all MDS’s are told what their current last committed transaction for the snapshot is. Clients can be told to discard all requests held for replay that are older than those found in the snapshot.

The coordinating MDS of a client initiated transaction will first establish that the transaction can commit on all nodes, by acquiring locks on directories and checking for available space existing entries with the same name etc. It may also first perform a directory split if the size is becoming too large, and more MDS nodes are still available.

All nodes involved in the transaction need to have a transaction sequence number to place the transaction into their sequence and allow correctly replay. At this point the coordinator will:
 * start a transaction locally.
 * It will then report the transaction sequence number to all other nodes involved in the transaction.
 * These nodes will commit (in memory as usual), write a journal record for replay and reply to the coordinator.
 * The coordinator will then commit its own transaction.
 * The MDS created metadata undo log records, which are subject to normal log commit cancelation messages, but on the coordinator commit messages must be received from the leader before the record will be canceled.

Failover rings.
The configuration data can designate a standby MDS that will take over from a failed MDS. By organizing the servers in one or more rings, the nearest working left neighbor MDS can be the failover node. This leads to a simple scheme with multiple failover nodes, avoiding quorum and other complications beyond what is needed for two node clusters.