= Clustered Metadata Design (in progress) =


This document describes the design of the clustered metadata handling
for Lustre.  This material depends on other Lustre designs, such as:
* General recovery
* Orphan Recovery
* Metadata Write Back caching


The draft Clustered Metadata Design is available as a PDF: [[Media:HPCS_CMD_06_15_09.pdf|''Clustered Metadata Design'']]


== Introduction ==


Overall, the clustered metadata handling is structured as follows:


* A cluster of metadata servers manages a collection of inode groups. Each inode group is a Lustre device exporting the usual metadata API, augmented with a few operations specifically crafted for metadata clustering. We call these collections of inodes ''inode groups''.
* Directory formats for file systems used on the MDS devices are changed to allow directory entries to contain an inode group and an identifier of the inode.
* A logical clustered metadata driver is introduced below the client Lustre file system write back cache driver; this new driver maintains connections with the MDS servers.
* There is a single metadata protocol that is used by the client file system to make updates on the MDSs and by the MDSs to make updates involving other MDSs.
* There is a single recovery protocol that is used by the client-MDS and MDS-MDS services.
* Directories can be split across multiple MDS nodes.  In that case, a primary MDS directory inode contains an extended attribute that points at other MDS inodes, which we call directory objects.


== Configuration management and startup ==


The configuration will name an MDS server, and optionally a failover
node, which holds the root inode for a fileset.  Clients will contact
that MDS for the root inode during mount, as they do already.


They will also fetch from it a clustering descriptor.  The clustering
descriptor contains a header and an array listing which inode groups
are served by which server.
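
As a rough illustration of this structure, the clustering descriptor could be a header followed by a table mapping inode groups to servers.  The C sketch below is hypothetical: the struct names, field names, and sizes are invented for illustration and do not describe an actual on-wire format.

<pre>
/* Hypothetical sketch of a clustering descriptor: a header followed by
 * an array recording which server exports each inode group. */
#include <stdint.h>

struct cmd_cluster_entry {
        uint32_t ce_inode_group;   /* inode group number */
        uint32_t ce_server_idx;    /* index of the MDS serving this group */
};

struct cmd_cluster_desc {
        uint32_t cd_magic;         /* identifies the descriptor format */
        uint32_t cd_version;       /* bumped when the configuration changes */
        uint32_t cd_entry_count;   /* number of entries that follow */
        struct cmd_cluster_entry cd_entries[0];  /* group -> server map */
};
</pre>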


Through normal mechanisms, clients will wait and probe for available
metadata servers during startup and cluster transitions.  When new
servers are found or configurations have changed, they can update
their clustering descriptor, just as they update the LOV striping
descriptor for OSTs.


== Data Structures ==


The ''fid'' contains a new 32-bit integer to name the inode group.


Directory entries contain a new 32-bit integer to name the inode
group.
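
A minimal sketch of the widened identifiers, with invented names and field sizes (not the actual Lustre definitions), might look like this:

<pre>
/* Hypothetical layout of the widened identifiers; field names and
 * sizes are illustrative only. */
#include <stdint.h>

struct cmd_fid {
        uint64_t f_ino;            /* inode number within its group */
        uint32_t f_generation;     /* generation, to catch inode reuse */
        uint32_t f_inode_group;    /* new: names the inode group */
};

struct cmd_dirent {
        uint32_t d_inode_group;    /* new: inode group of the target inode */
        uint64_t d_ino;            /* inode identifier within that group */
        uint16_t d_namelen;        /* length of the entry name */
        char     d_name[0];        /* entry name follows */
};
</pre>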


Directory inodes on the MDS, when large, contain a new EA which is a
descriptor of how the directory is split over directory objects
residing on other MDSs.  This EA is subject to ordinary concurrency
control by the MDS holding the inode.  The EA is virtually identical
to the LOV EA.

== The clustered metadata client (CMC) ==

Client systems will have the write back client (WBD) or client file
system directly communicate with the CMC driver: it offers the
metadata API to the file system and uses the metadata API offered by a
collection of MDC drivers.  Each MDC driver manages the metadata
traffic to one MDS.


The function of the CMC is to figure out, from the command issued,
which MDC to use.  This is based on:
* The inode groups in the request
* A hash value of names used in the request, combined with the EA of a primary inode involved in the request
* For ''readdir'', the directory offset combined with the EA of the primary inode
* The clustering descriptor


In any case, every command is dispatched to a single metadata server, and
the clients will not engage more than one metadata server for a single
request.
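
For illustration only, the dispatch decision could be sketched as below.  The helper functions, the hash, and the parameter layout are all invented; a real CMC would consult the clustering descriptor and the split EA described earlier.

<pre>
/* Hypothetical sketch of how the CMC could pick the MDC for a request.
 * mdc_by_group() and name_hash() are invented for illustration. */
#include <stdint.h>

struct mdc_device;                         /* one per MDS connection */

extern struct mdc_device *mdc_by_group(uint32_t inode_group);

/* Simple name hash; a real implementation must match the hash used
 * when the directory was split. */
static uint32_t name_hash(const char *name)
{
        uint32_t hash = 5381;

        while (*name != '\0')
                hash = hash * 33 + (unsigned char)*name++;
        return hash;
}

/* Pick the MDC for an operation on 'name' within a (possibly split)
 * parent directory whose split EA lists 'ea_count' directory objects. */
static struct mdc_device *cmc_select_mdc(uint32_t parent_group,
                                         const uint32_t *ea_groups,
                                         unsigned int ea_count,
                                         const char *name)
{
        if (ea_count == 0)                 /* directory not split */
                return mdc_by_group(parent_group);

        /* Split directory: the name hash chooses the directory object,
         * whose inode group names the serving MDS. */
        return mdc_by_group(ea_groups[name_hash(name) % ea_count]);
}
</pre>

Hashing on the name means the client can pick the serving MDS directly, without first contacting the primary, as long as its copy of the split EA is current.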


The API changes here are minimal, and the client part of the
implementation is trivial.


== MDS implementation ==


For the most part, operations are extremely similar or identical to
what they were before.  In some cases, multiple MDS servers are
involved in updates.


''getattr'', ''open'', ''readdir'', ''setattr'' and ''lookup'' methods are unaffected.


Methods adding entries to directories are modified in some cases:


* ''mkdir'' always creates the new directory on another MDS.
* ''unlink'', ''rmdir'', and ''rename'' may involve more than one MDS.
* For ''large directories'', all operations making updates to directories can cause a directory split. The directory split is discussed below.
* For ''other operations'', if no splits of large directories are encountered, the operations proceed as before, executed on a single MDS.


=== Directory Split ===


A directory that is growing larger will be split.  There is a fairly heavy penalty associated with splitting the directory and also with renames within split directories.  Moreover, at the point of splitting, inodes become remote and will incur a penalty upon unlink.


It is probably best to delay the split until the directory is fairly large, and then to split over several nodes, to avoid further splits being necessary soon afterwards.
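
A hedged sketch of such a policy, with an invented entry-count threshold and fan-out, might look like this:

<pre>
/* Hypothetical split policy: defer splitting until the directory is
 * large, then spread it over several MDS nodes at once.  The threshold
 * and fan-out are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define CMD_SPLIT_THRESHOLD   (1024 * 1024)  /* e.g. about 1M entries */
#define CMD_SPLIT_FANOUT      4              /* directory objects to create */

static bool cmd_should_split(uint64_t entry_count, unsigned int other_mds)
{
        /* Split only when the directory is really large and other MDS
         * nodes are available to receive directory objects. */
        return entry_count >= CMD_SPLIT_THRESHOLD && other_mds > 0;
}

static unsigned int cmd_split_width(unsigned int available_mds)
{
        /* Split over several nodes at once so that another split is
         * not needed again soon afterwards. */
        return available_mds < CMD_SPLIT_FANOUT ? available_mds
                                                : CMD_SPLIT_FANOUT;
}
</pre>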

== Recovery ==

=== Transaction Replay ===

The MDS-MDS interaction is managed as follows.  The node approached
with a change request is made the coordinator of the transaction.

==== Mechanisms ====


The coordinator will first establish that the transaction can commit on all nodes, by acquiring locks on directories and checking for available space, existing entries with the same name, etc.  It may also first perform a directory split if the size is becoming too large and more MDS nodes are still available.


All nodes involved in the transaction need to have a transaction sequence number to place the transaction into their sequence and allow correct replay.


At this point, the coordinator will:
* Start a transaction locally.
* Report the transaction sequence number to all other nodes involved in the transaction.
* These nodes will commit (in memory as usual), write a journal record for replay, and reply to the coordinator.
* The coordinator will then commit its own transaction.
* The replay log records are subject to normal log commit cancellation messages, but on the coordinator, commit messages must be received from all other nodes before the record will be canceled.


In this way, if the results of the transaction survive on any of the nodes, they can be replayed on all.
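
The coordinator's side of this exchange could be sketched roughly as follows; all helper names are hypothetical, and error handling, locking and log cancellation are omitted.

<pre>
/* Hypothetical sketch of the coordinator's commit path for a
 * distributed metadata update.  All helpers are invented. */
#include <stdint.h>

struct mds_txn;                     /* one distributed transaction */

extern uint64_t mds_txn_start(struct mds_txn *txn);   /* local transaction */
extern void mds_txn_commit(struct mds_txn *txn);      /* local (in memory) */
extern int  mds_send_seqno(struct mds_txn *txn, int node, uint64_t seqno);
extern int  mds_wait_reply(struct mds_txn *txn, int node);

static int cmd_coordinator_commit(struct mds_txn *txn,
                                  const int *nodes, int nr_nodes)
{
        uint64_t seqno;
        int i;

        /* 1. Start the transaction locally and obtain its sequence number. */
        seqno = mds_txn_start(txn);

        /* 2. Report the sequence number to every other node involved. */
        for (i = 0; i < nr_nodes; i++)
                mds_send_seqno(txn, nodes[i], seqno);

        /* 3. Each node commits in memory, writes a replay journal record
         *    and replies; wait for those replies. */
        for (i = 0; i < nr_nodes; i++)
                mds_wait_reply(txn, nodes[i]);

        /* 4. Finally commit the coordinator's own transaction.  Its replay
         *    log record is canceled only after commit messages have been
         *    received from all other nodes. */
        mds_txn_commit(txn);
        return 0;
}
</pre>
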
==== Cluster crashes and the transaction sequence ====


If the cluster crashes abruptly, transactions affecting multiple nodes may still be in progress.  Dependencies between the transactions must be managed to ensure serializability of the protocol.


''Example:'' In transaction one, node X creates directory ''a''.  Then, in transaction two, a cross-MDS-node rename moves a file with a directory entry on node Y into this directory.  It is now possible for this file to lose its directory entry on Y and for the transaction on X not to commit.  More complex examples exist.

Four possible solutions are:


* Disk commit delay locks: dependent transactions may not commit before the parent transaction commits.
* Commit ''ack''s: transactions may not proceed until prerequisite transactions have committed.
* Synchronous NVRAM journal on all MDS nodes.
* Shared journal among all MDS nodes.


The first and last methods offer the most opportunity to proceed without synchronous disk writes. The last method involves contention on a shared resource.


Although an exhaustive analysis remains to be done, it is clear that renames and directory splits are the primary culprits. Hence, we consider the following policy-related issues:
* Only split really large directories, say after 1M entries.
* Do not needlessly create subdirectories on other nodes.  A much better policy is likely to keep directories with one owner, or possibly with one client system generating them, together.


==== Replay ====


 
To order transaction sequences, Lustre uses reply ''ack''s: the ''ack''s
serve only one purpose, namely to release a lock that enforces ordering
of the transaction sequence.  In the case where MDS operations involve
more than one server, the reply ''ack'' from the primary to the secondary
servers should only be sent after the client has sent its ''ack'' to the
first server.  This MDS-MDS reply ''ack'' is then not really an ''ack''
anymore, but a simple lock cancellation review.
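
To make the ordering concrete, here is a hedged sketch of the primary MDS behaviour; the helper names are invented.

<pre>
/* Hypothetical sketch of the ordering rule: the primary MDS forwards
 * the lock-releasing "ack" to the secondary MDSs only once the client's
 * own reply ack has arrived. */
struct mds_request;

extern void mds_wait_client_ack(struct mds_request *req);
extern void mds_send_reply_ack(struct mds_request *req, int secondary_node);

static void cmd_release_secondaries(struct mds_request *req,
                                    const int *secondaries, int nr)
{
        int i;

        /* The client's ack releases the ordering lock on the primary... */
        mds_wait_client_ack(req);

        /* ...and only then may the secondaries drop theirs: this message
         * is effectively a lock cancellation, not a true ack. */
        for (i = 0; i < nr; i++)
                mds_send_reply_ack(req, secondaries[i]);
}
</pre>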


Clients will replay lost transactions to the MDS that they originally
engaged for the request.


 
Orphaned children will be cleaned up only after replay completes to allow orphaned objects to be re-used during replay.


=== Failover ===

The configuration data can designate a standby MDS that will take over
from a failed MDS.  By organizing the servers in one or more rings, the
nearest working left neighbor MDS can be the failover node.  This
leads to a simple scheme with multiple failover nodes, avoiding quorum
and other complications beyond what is needed for two-node clusters.


== Locking ==


We believe locking can be done in ''fid'' order, as it is currently done on the MDS.
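
For example, sorting the fids involved in an operation and taking their locks in that order (a minimal sketch with invented types and lock call) avoids deadlock between concurrent multi-inode operations:

<pre>
/* Hypothetical sketch of fid-ordered locking: sort the fids involved
 * in an operation and acquire their locks in that order so concurrent
 * operations cannot deadlock.  cmd_fid and mds_lock_fid() are invented. */
#include <stdint.h>
#include <stdlib.h>

struct cmd_fid {
        uint32_t f_inode_group;
        uint64_t f_ino;
};

extern void mds_lock_fid(const struct cmd_fid *fid);

static int fid_cmp(const void *a, const void *b)
{
        const struct cmd_fid *x = a, *y = b;

        if (x->f_inode_group != y->f_inode_group)
                return x->f_inode_group < y->f_inode_group ? -1 : 1;
        if (x->f_ino != y->f_ino)
                return x->f_ino < y->f_ino ? -1 : 1;
        return 0;
}

static void cmd_lock_fids(struct cmd_fid *fids, size_t count)
{
        size_t i;

        /* Always lock in ascending fid order. */
        qsort(fids, count, sizeof(*fids), fid_cmp);
        for (i = 0; i < count; i++)
                mds_lock_fid(&fids[i]);
}
</pre>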
