
= Clustered Metadata Design (in progress) =
In order to provide enhanced scalability and performance, Lustre offers clustered metadata servers. This section will give an outline of the architecture.
 
 
The main challenge we face is to provide a substantial gain in scalability of the metadata performance of Lustre through great parallelism of common operations. This involves finding mechanisms which distribute operations evenly over the metadata cluster, while avoiding a more complex protocol involving further RPC’s. The current trend in distributed file system design is to do such clustering by allowing clients to pre-compute the location of the correct services.
 
  
A second challenge is to provide good load balancing and resource allocation properties both for large installations where the metadata cluster acts in effect as a metadata server and in the case of small clusters in which the metadata cluster itself will access metadata on other nodes in the cluster.
This document describes the design of the clustered metadata handling for Lustre. This material depends on other parts of the Lustre design, such as:

* General recovery
* Orphan Recovery
* Metadata Write Back caching
  
Finally, a further key challenge is to provide good scalability and simple recovery within the metadata cluster itself.

Our architecture accomplishes these goals by heavily leveraging existing building blocks, primarily existing file systems and their metadata interfaces.

= Introduction =

Overall, the clustered metadata handling is structured as follows:
 
  
* A cluster of metadata servers manages a collection of inode groups. Each inode group is a Lustre device exporting the usual metadata API, augmented with a few operations specifically crafted for metadata clustering. We call these collections of inodes ''inode groups''.
* Directory formats for the file systems used on the MDS devices are changed to allow directory entries to contain an inode group and an identifier of the inode.
* A logical clustered metadata driver is introduced below the client Lustre file system write back cache driver; it maintains connections with the MDS servers.
* There is a single metadata protocol that is used by the client file system to make updates on the MDSs and by the MDSs to make updates involving other MDSs.
* There is a single recovery protocol that is used by the client-MDS and the MDS-MDS services.
* Directories can be split across multiple MDS nodes. In that case a primary MDS directory inode contains an extended attribute that points at other MDS inodes, which we call directory objects.
  
= Configuration management and Startup =

The configuration will name an MDS server, and optionally a failover node, which holds the root inode for a fileset. Clients will contact that MDS for the root inode during mount, as they do already.

They will also fetch from it a clustering descriptor. The clustering descriptor contains a header and an array listing which inode groups are served by which server.

Through normal mechanisms, clients will wait and probe for available metadata servers during startup and cluster transitions. When new servers are found or configurations have changed, they can update their clustering descriptor, just as they update the LOV striping descriptor for OSTs.
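As a purely illustrative sketch (not part of the protocol definition), the fragment below shows one possible in-memory form of such a clustering descriptor: a header plus an array that maps each inode group to the server currently serving it. The structure, field names and lookup helper are assumptions made for the example, not Lustre's actual data structures.

<pre>
/* Illustrative sketch of a clustering descriptor: header plus an array
 * mapping each inode group to the MDS that serves it.  All names here
 * are invented for the example. */
#include <stdint.h>
#include <stdio.h>

struct cmd_target {
        uint32_t ct_inode_group;   /* inode group identifier */
        uint32_t ct_server;        /* index of the MDS serving this group */
};

struct cmd_desc {
        uint32_t           cd_magic;    /* descriptor header */
        uint32_t           cd_count;    /* number of array entries */
        struct cmd_target *cd_targets;  /* one entry per inode group */
};

/* Which server currently serves a given inode group? (-1 if unknown) */
static int cmd_server_of_group(const struct cmd_desc *d, uint32_t group)
{
        for (uint32_t i = 0; i < d->cd_count; i++)
                if (d->cd_targets[i].ct_inode_group == group)
                        return (int)d->cd_targets[i].ct_server;
        return -1;
}

int main(void)
{
        struct cmd_target map[] = { {0, 0}, {1, 0}, {2, 1}, {3, 2} };
        struct cmd_desc desc = { 0xC1D0, 4, map };

        printf("inode group 2 is served by MDS %d\n",
               cmd_server_of_group(&desc, 2));
        return 0;
}
</pre>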
= Resources =

The MDS handles the persistent storage of metadata objects and directory data. Internal to the metadata service is a large amount of allocation management.

The use of resources is easily summarized as follows:

;'''Names''':
:(1) look up the name in a directory
:(2) insert or remove names in a directory

;'''FID''':
:(1) get attributes for a fid
:(2) create or remove the corresponding object

The ownership of resources varies among file systems. In local file systems a single node owns all resources; no parallelism can be achieved with this. In traditional clustering file systems, nodes own individual inodes or disk blocks. This leads to fine grained ownership of resources, but involves frequent collisions and poor locality of reference.

For Lustre we propose that each node owns a moderately large group of objects. There would be a large shared storage pool, subdivided into relatively small file systems, as shown in figure 6.7.1. We call each of these small file systems an inode group. Each inode group has its own journal for recovery, is formatted as a file system, and can fail over to another node for availability or adjustment of resources. We will make the load on the inode groups evenly distributed through randomness.

Clients will get a logical clustered metadata driver which exploits multiple MDC clients (see figure 6.7.2). Just like the logical object volume, the file system itself does not need to know the details of the object distribution; that can be left to a small logical metadata volume driver, invoked by the file system through the same API. The MDS system will get clustering and policy adaptations. The key to this is to add an '''inode group''' identifier to the fid; this marks the inode group to which an inode belongs. The resource database for the cluster will provide every client with a load balancing map which indicates on which MDS server a particular inode group is currently mounted.

[[Image:Pages_from_logical_metadata_volume_driver.jpg]]

The resource location will be managed as follows:

'''File inodes:'''
* Create the file inode in the inode group of the directory inode holding the name.

'''Directory inodes:'''
* Create in a new inode group.
* The policy on which group to pick could be round robin, random, most space available, etc. Probably every MDS reply packet should contain some status information to give clients policy information.

'''Directory data:'''
* While the directory is small, keep it with the inode.
* When it grows, fan it out.
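The choice of placement policy is deliberately left open above. The following hypothetical sketch merely illustrates what round-robin or random selection of an inode group for a new directory inode could look like; all names (mds_cluster_desc, pick_inode_group) are invented for the illustration.

<pre>
/* Hypothetical sketch of an inode-group selection policy for new
 * directory inodes (round robin vs. random); illustrative only. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

struct mds_cluster_desc {
        uint32_t mcd_group_count;   /* number of inode groups */
        uint32_t mcd_next_rr;       /* round-robin cursor */
};

/* Pick an inode group for a new directory inode. */
static uint32_t pick_inode_group(struct mds_cluster_desc *desc, int use_random)
{
        if (use_random)
                return (uint32_t)(rand() % desc->mcd_group_count);
        return desc->mcd_next_rr++ % desc->mcd_group_count;
}

int main(void)
{
        struct mds_cluster_desc desc = { 4, 0 };

        for (int i = 0; i < 6; i++)
                printf("new dir -> inode group %u\n",
                       (unsigned)pick_inode_group(&desc, 0));
        return 0;
}
</pre>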
= Data Structures =

The fid will contain a new 32 bit integer to name the inode group.

Directory entries will contain a new 32 bit integer to name the inode group.

Directory inodes on the MDS, when large, contain a new EA which is a descriptor of how the directory is split over directory objects residing on other MDSs. This EA is subject to ordinary concurrency control by the MDS holding the inode. The EA is virtually identical to the LOV EA.
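A minimal sketch of what these additions could look like follows; the field names and the fixed name bound are illustrative assumptions, not the actual on-disk format.

<pre>
/* Sketch of the data-structure change described above: both the fid and
 * the directory entry carry a 32-bit inode group in addition to the
 * inode number.  Field names are illustrative. */
#include <stdint.h>

struct cmd_fid {
        uint32_t f_group;       /* NEW: inode group holding the inode */
        uint32_t f_ino;         /* inode number within the group */
        uint32_t f_generation;  /* generation number */
};

struct cmd_dirent {
        uint32_t d_group;       /* NEW: inode group of the target inode */
        uint32_t d_ino;         /* inode number within that group */
        uint16_t d_namelen;
        char     d_name[256];   /* name (illustrative fixed bound) */
};
</pre>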
  
= The clustered metadata client (CMC) =

Client systems will have the write back client (WBD) or the client file system communicate directly with the CMC driver: it offers the metadata API to the file system and uses the metadata API offered by a collection of MDC drivers. Each MDC driver manages the metadata traffic to one MDS.

The function of the CMC is very simple: it figures out from the command issued which MDC to use. This is based on:

* the inode groups in the request;
* a hash value of names used in the request, combined with the EA of a primary inode involved in the request;
* for readdir, the directory offset combined with the EA of the primary inode;
* the clustering descriptor.

In any case, every command is dispatched to a single metadata server; the clients will not engage more than one metadata server for a single request.

The API changes here are minimal and the client part of the implementation is very trivial.
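The following fragment sketches the first of these rules, dispatch by the inode group named in a fid, under the assumption of a per-group server table taken from the clustering descriptor. The types and names (cmc_device, cmc_select_mdc) are invented for the illustration and are not the real client code.

<pre>
/* Sketch of CMC dispatch by inode group: route a request carrying a fid
 * to the MDC connected to the server serving that fid's inode group. */
#include <stdint.h>
#include <stddef.h>

struct cmd_fid { uint32_t f_group, f_ino, f_generation; };

struct mdc_conn;                     /* one connection per MDS (opaque here) */

struct cmc_device {
        uint32_t          cd_group_count;
        uint32_t         *cd_group_to_server;  /* from the clustering descriptor */
        struct mdc_conn **cd_mdc;              /* one MDC per MDS server */
};

/* Pick the MDC that should receive a request naming this fid. */
struct mdc_conn *cmc_select_mdc(const struct cmc_device *cmc,
                                const struct cmd_fid *fid)
{
        if (fid->f_group >= cmc->cd_group_count)
                return NULL;
        return cmc->cd_mdc[cmc->cd_group_to_server[fid->f_group]];
}
</pre>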
  
= MDS implementation =

For the most part, operations are extremely similar or identical to what they were before. In some cases multiple MDS servers are involved in updates.
  
Getattr, open, readdir, setattr and lookup methods are unaffected.

Methods adding entries to directories are modified in some cases:

* '''mkdir''' always creates the new directory on another MDS.
* '''unlink, rmdir, rename''' may involve more than one MDS.
* '''large directories''': all operations making updates to directories can cause a directory split. The directory split is discussed below.
* '''other operations''': if no split large directories are encountered, all other operations proceed as they are executed on one MDS.

= Clustered directories =

When directories grow we will split them up into '''directory data objects''' which are placed on multiple MDS servers; figure 6.7.3 shows this transition from a single directory to multiple directory objects. This is quite analogous to striped files, which are placed in data objects on multiple servers.

[[Image:Transition_from.jpg]]

Directory entries will hold an inode group identifier and an inode number, compared to traditional entries holding merely a name and an inode number. So once a name is found in directory data, the inode group and the inode number in this group are known.

'''getattr_lock(parent_fid, name):''' To find the directory entry itself, the algorithm is similar to that of finding a file stripe. When a directory inode is located, the inode will either contain directory data, in which case it is treated as a traditional directory, or it will contain an extended attribute describing what inode buckets exist, specifying a fid for each bucket; each fid gives its inode group, inode number and generation. A hash will then map the name to a particular bucket based on this metadata. A normal name lookup in the bucket will proceed to find the entry.

The worst case here is that this requires 3 RPCs: the first to do a getattr on the directory inode, which gives the extended attribute; the second to find the directory entry on the server holding the bucket; and the third to find the inode attributes in the inode group associated with the entry. However, the common case is that a single RPC is sufficient, since normally the directory inode will already be cached, so the first RPC will go to the server containing the bucket. Furthermore, usually the inode is located on that server and will be fetched in the same RPC. The number of disk reads is identical to, or one higher than, that for large non-clustered directories.

The process of creating a clustered directory is triggered by the directory growing beyond a certain size. The split is performed as early as possible; there might be a small effect on performance at the moment a directory is split, but the aggregate performance will be good since operations can then be performed in parallel.
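The name-to-bucket step described above can be pictured with the following sketch; the hash function, the EA layout and the bucket bound are assumptions for illustration only.

<pre>
/* Sketch of the name-to-bucket step: the master directory inode's EA
 * lists one fid per bucket, and a hash of the name chooses the bucket
 * whose server is asked for the entry.  Layout is illustrative. */
#include <stdint.h>

struct cmd_fid { uint32_t f_group, f_ino, f_generation; };

struct dir_split_ea {
        uint32_t       de_bucket_count;   /* number of directory objects */
        struct cmd_fid de_buckets[8];     /* fid of each bucket (example bound) */
};

static uint32_t name_hash(const char *name)
{
        uint32_t h = 5381;                /* djb2-style hash, for illustration */
        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h;
}

/* Return the fid of the bucket that should hold 'name'. */
struct cmd_fid bucket_for_name(const struct dir_split_ea *ea, const char *name)
{
        return ea->de_buckets[name_hash(name) % ea->de_bucket_count];
}
</pre>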
 
  
== Directory Split ==

A directory that is growing larger will be split. There is a fairly heavy penalty associated with splitting the directory, and also with renames within split directories. Moreover, at the point of splitting, inodes become remote and will incur a penalty upon unlink.

Probably it is best to delay the split until the directory is fairly large, and then to split over several nodes, to avoid further splits being necessary soon afterwards.

= Directory inodes and clustered metadata =

Directory inodes come in two variants:

'''small directories:''' an ordinary directory inode in a single inode group.

'''large directories:'''
* '''master directory inode:''' with an EA pointing to the buckets in other inode groups.
* '''bucket inodes:''' in other inode groups. The buckets are associated with an inode that manages the space allocation for the bucket directory data. The bucket directory data describes the directory data covering a range of hash values. It provides a map from name to (group, inode number) to identify the fid up to the generation number.

The fanout operation, triggered by a directory growing beyond a certain size, creates the buckets. This involves a new RPC in the MDS service that allows the creation of a remote bucket and populates it with directory entries.

This is a simple RPC that brings no complications to recovery, since the buckets are exclusively visible to the inode group of the master. It is possible that buckets are orphaned, and this requires cleanup.

Removal of a fanned out directory is similar in complexity. Here it is important to use an MDS-to-MDS reconnect handshake, identical to the client-MDS handshake, between the master inode server and the MDSs holding the inode groups that hold the buckets, to handle the failure of MDS servers with buckets that need to be removed.

The security of such MDS-MDS interaction is probably most easily managed with a capability model similar to that found between the clients and OSTs.

The attributes of clustered directories are most easily managed in a distributed fashion, as we do for the file data objects:

* '''size:''' sum of all the bucket sizes
* '''link count:''' sum of all the bucket link counts
* '''mtime:''' latest of all mtimes
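A minimal sketch of these aggregation rules, assuming a hypothetical bucket_attr structure, follows.

<pre>
/* Sketch of deriving a split directory's logical attributes from its
 * buckets: size and link count are sums, mtime is the latest. */
#include <stdint.h>
#include <time.h>

struct bucket_attr {
        uint64_t size;     /* directory data size in this bucket */
        uint32_t nlink;    /* link count contributed by this bucket */
        time_t   mtime;    /* last modification of this bucket */
};

struct bucket_attr dir_attr_merge(const struct bucket_attr *b, int n)
{
        struct bucket_attr out = { 0, 0, 0 };

        for (int i = 0; i < n; i++) {
                out.size  += b[i].size;           /* size: sum of buckets */
                out.nlink += b[i].nlink;          /* link count: sum */
                if (b[i].mtime > out.mtime)
                        out.mtime = b[i].mtime;   /* mtime: latest */
        }
        return out;
}
</pre>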
 
  
= Clustered MDS protocol =

The clustered MDS protocol involves a few changes to the API implementation found above. Most of the changes involve some new API calls between MDS servers. The goal is to use a single recovery infrastructure among the MDS servers and the clients, as described earlier in this chapter. Some detailed work remains to be done in the design to avoid cyclic lock dependencies or acknowledgment graphs (refer to section 11.3.6). As described previously in section 11.3.6, we now enforce ACKs for replies. The MDS takes locks on the resources it modifies; these locks are canceled once ACKs are received. In the clustered MDS scenario, it is important to ensure that a deadlock is not caused by the various systems waiting for ACKs from each other.

;'''mds_create''': This call needs modifications when creating a new directory, because the new directory inode and new directory data will be created on another MDS server than the parent. The node holding the parent directory data will do a lookup, find it is negative and hold a lock. It will then make an MDC RPC to create a remote inode. When that call returns, the directory data can be filled in. The key issue here is recovery of the remote inode creation, which either requires writing the fid of the created inode in the commit log or using preallocated inodes. It is easy to see that in the normal case of file creation the code path is equally efficient for a clustered metadata service and a single node one.

;'''mds_rename/mds_link''': These calls are probably the most interesting of all. They involve three nodes: the source and target nodes holding the directory data, and the node holding the inode which has a link that is to be renamed. An important invariant is that bucket inodes and directory inodes are always on the same node as the node holding the associated data. This call pattern involves the MDS making a remote link RPC to another MDS and a remote setattr RPC to the MDS holding the inode to be renamed. The calls appear to be easily recovered in case of failures.

;'''mds_unlink''': This is also a two stage call. Both for creation and unlinking, the management of orphans is important. This orphan management is entirely analogous between the MDS and OST data objects. ''Orphaned objects'' can be created during object creation or removal: objects might be created on the OSTs, but the MDS could fail before recording these in the extended attributes on persistent store. Similarly, during deletion, it is possible that the record of the objects is deleted on the MDS but the corresponding objects are not deleted on the OSTs before some failure occurs. The first situation can only be prevented by requiring the OSTs to log every object creation; the MDS would send an asynchronous message to the OSTs once the object information has been stored persistently, and the OSTs can then delete the corresponding logs. Similarly, in the second case, the MDS can keep logs of object deletion; if an OST fails before removing the corresponding objects, it can check with the MDS upon recovery and delete the required objects.
 
  
= Recovery =

== Transaction Replay ==

The MDS-MDS interaction is managed as follows. The node approached with the request for a change is made the coordinator of the transaction.

=== Mechanisms ===

The coordinator will first establish that the transaction can commit on all nodes, by acquiring locks on directories and checking for available space, existing entries with the same name, and so on. It may also first perform a directory split if the size is becoming too large and more MDS nodes are still available.

All nodes involved in the transaction need to have a transaction sequence number to place the transaction into their sequence and allow correct replay.

At this point the coordinator will:

* start a transaction locally;
* report the transaction sequence number to all other nodes involved in the transaction;
* these nodes will commit (in memory as usual), write a journal record for replay and reply to the coordinator;
* the coordinator will then commit its own transaction;
* the replay log records are subject to normal log commit cancelation messages, but on the coordinator, commit messages must be received from all other nodes before the record is canceled.

In this way, if the results of the transaction survive on any of the nodes, they can be replayed on all.
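The bookkeeping implied by the last bullet could look roughly like the sketch below: the coordinator keeps one record per distributed transaction and cancels it only when commit confirmations have arrived from every participating node. The bitmask representation and all names are assumptions made for the example.

<pre>
/* Sketch of the coordinator's replay-record bookkeeping. */
#include <stdint.h>
#include <stdbool.h>

struct replay_rec {
        uint64_t transno;          /* transaction sequence number */
        uint32_t nodes_involved;   /* bitmask of participating MDS nodes */
        uint32_t nodes_committed;  /* bitmask of nodes that reported commit */
};

/* Record a commit confirmation from one participant. */
void replay_rec_commit(struct replay_rec *r, unsigned node)
{
        r->nodes_committed |= 1u << node;
}

/* The coordinator may cancel the record only when everyone has committed. */
bool replay_rec_cancelable(const struct replay_rec *r)
{
        return (r->nodes_committed & r->nodes_involved) == r->nodes_involved;
}
</pre>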
=== Cluster crashes and the transaction sequence ===

If the cluster crashes abruptly, transactions affecting multiple nodes may be in progress. Dependencies between the transactions must be managed to ensure serializability of the protocol.

'''Example:''' In transaction 1, a node X creates directory ''a''. Then in transaction 2 a cross-MDS rename moves a file with a directory entry on node Y into this directory. It is now possible for this file to lose its directory entry on Y while the transaction on X does not commit. More complex examples exist.

We see four solutions:

* Disk commit delay locks: dependent transactions may not commit before the parent transaction commits.
* Commit acks: transactions may not proceed until previous prerequisite transactions have committed.
* A synchronous NVRAM journal on all MDS nodes.
* A shared journal among all MDS nodes.

The first and last methods offer the most opportunity to proceed without synchronous disk writes. The last method, however, involves contention on a shared resource.
  
Although an exhaustive analysis remains to be done, it is clear that rename and directory splits are the primary culprits. Hence we consider the following policy-related measures:

* Only split really large directories, say after 1M entries.
* Do not needlessly create subdirectories on other nodes. A much better policy is likely to keep directories with one owner, or possibly one client system generating them, together.

The handling of recovery of orphan objects between clustered metadata servers is identical to that of the single MDS case. A new problem arises from multiple metadata server failures, such as in the case of a power-off. In this case the MDSs should be rolled back to a consistent state.
This can be done with a standard algorithm known as a consistent cut in causal time, or snapshot (see Birman or other books on distributed algorithms). A consistent snapshot is a state of the MDSs that could have been reached through full execution of requests coming from clients; in other words, a consistent snapshot is a state of the MDS file systems that represents a valid file system. After multiple simultaneous MDS failures, the state of the MDSs must be rolled back to a consistent snapshot. We say that a transaction on MDS1 depends on a transaction on MDS2 when the completion of a request to MDS1 has the transaction on MDS2 as a component.

Each MDS retains logs of transactions, sufficiently detailed that they can be undone. Each log record contains a transaction number corresponding to the transaction on this node and the transaction numbers of transactions that were started on other MDSs to complete this transaction. The log records can be used for two operations: records can be canceled when the MDS cluster as a whole has committed the transactions to which a particular log record relates, and records can be used to undo operations that were already performed.

Every few seconds the cluster computes a snapshot by first electing a leader. First the leader asks all MDSs to give their last committed transaction numbers. The MDSs respond and also provide the transaction numbers for other MDSs that they depend on for these transactions. If an MDS provided a dependency higher than what was committed, that MDS is asked to resend its transactions and dependencies to account for this. This algorithm then repeats; it converges because it produces a strictly decreasing set of transaction numbers. When the transaction numbers have reached a consistent snapshot, all MDSs are told what their last committed transaction for the snapshot is. Clients can be told to discard all requests held for replay that are older than those found in the snapshot.
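As an illustration of how such a computation converges, the toy simulation below lowers a candidate cut until every retained transaction's dependencies are themselves inside the cut. The data layout is invented for the example and is not the actual recovery code.

<pre>
/* Toy simulation of the consistent-cut computation: each MDS reports its
 * last committed transno; transactions may depend on a peer's transno.
 * The cut is lowered until all retained dependencies are satisfied. */
#include <stdio.h>
#include <stdbool.h>

#define NMDS 3
#define NTX  4

struct dep { int peer; int peer_transno; };   /* peer == -1: no dependency */

/* deps[m][t]: dependency of transaction number t+1 on MDS m. */
static const struct dep deps[NMDS][NTX] = {
        /* MDS 0 */ { {-1, 0}, {1, 2}, {-1, 0}, {2, 4} },
        /* MDS 1 */ { {-1, 0}, {-1, 0}, {0, 2}, {-1, 0} },
        /* MDS 2 */ { {-1, 0}, {-1, 0}, {-1, 0}, {1, 4} },
};

int main(void)
{
        int cut[NMDS] = { 4, 3, 3 };   /* last committed transno per MDS */
        bool changed = true;

        while (changed) {
                changed = false;
                for (int m = 0; m < NMDS; m++)
                        for (int t = cut[m]; t >= 1; t--) {
                                const struct dep *d = &deps[m][t - 1];
                                if (d->peer >= 0 && d->peer_transno > cut[d->peer]) {
                                        cut[m] = t - 1;   /* drop this transaction */
                                        changed = true;
                                }
                        }
        }

        for (int m = 0; m < NMDS; m++)
                printf("MDS %d rolls back to transno %d\n", m, cut[m]);
        return 0;
}
</pre>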
=== Client-MDS replay protocol ===

The clustered MDS-client recovery protocol is very similar to the single MDS-client protocol. In this case also, the MDS servers need to track whether a client request was executed, replied to, or committed. The MDS also regards other MDS systems that make requests as part of clustered metadata updates as clients for recovery purposes. If a request is committed, replay is not required; the metadata server can simply forget the state associated with that request, except that it needs to be able to reproduce the reply until the client has acknowledged it. For a request that was not executed, the client can simply retransmit it upon recovery; Lustre uses the word ''resending'' for this part of recovery. For requests that were executed and saw replies, but whose effects were lost on persistent storage, the retransmission mechanism is called ''replay''.

=== Replay ===

To order transaction sequences Lustre uses reply acks: the acks serve only one purpose, to release a lock that enforces ordering of the transaction sequence. In the case where MDS operations involve more than one server, the reply "ack" from the primary to the secondary servers should only be sent after the client has sent the ack to the first server. This MDS-MDS reply ack is then not really an ack anymore but simply a lock cancelation.

Clients will replay lost transactions to the MDS which they originally engaged for the request. Orphaned children will be cleaned up only after replay completes, to allow orphaned objects to be re-used during replay.
== Failover ==

The configuration data can designate a standby MDS that will take over from a failed MDS. By organizing the servers in one or more rings, the nearest working left neighbor MDS can be the failover node. This leads to a simple scheme with multiple failover nodes, avoiding quorum and other complications beyond what is needed for two-node clusters.
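A sketch of the "nearest working left neighbor" rule, under the assumption of a fixed-size ring and a liveness bitmap, is shown below; it is illustrative only, not the configuration code.

<pre>
/* Sketch of ring failover: the failover node for a failed MDS is its
 * nearest alive neighbor scanning leftwards around the ring. */
#include <stdio.h>
#include <stdbool.h>

#define NSERVERS 5

/* Return the index of the MDS that takes over for 'failed', or -1 if
 * every server is down. */
static int failover_target(const bool alive[NSERVERS], int failed)
{
        for (int step = 1; step < NSERVERS; step++) {
                int idx = (failed - step + NSERVERS) % NSERVERS;
                if (alive[idx])
                        return idx;
        }
        return -1;
}

int main(void)
{
        bool alive[NSERVERS] = { true, false, true, true, false };

        printf("MDS 1 fails over to MDS %d\n", failover_target(alive, 1));
        printf("MDS 4 fails over to MDS %d\n", failover_target(alive, 4));
        return 0;
}
</pre>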
 
  
= Locking =

We believe locking can be done in fid order, as it is currently done on the MDS. In order to obtain cluster-wide ordering of resources, clients must choose the correct coordinating MDS, so that locks taken there initiate the lock ordering sequence to be followed. This is particularly important for rename, which has to be started at the target or the source directory, depending on which holds the highest-order resource.

[[Category:Architecture|Clustered Metadata]]
