Clustered Metadata

This document describes the design of the clustered metadata handling for Lustre™. This material depends on other Lustre designs, such as:


 * General recovery
 * Orphan Recovery
 * Metadata Write Back caching

For a draft of the design document, see [[Media:HPCS_CMD_06_15_09.pdf|Clustered Metadata Design]].

Introduction
Overall, the clustered metadata handling is structured as follows:


 * A cluster of metadata servers manages a collection of inode groups. Each inode group is a Lustre device that exports the usual metadata API, augmented with a few operations specifically crafted for metadata clustering.
 * Directory formats for file systems used on the MDS devices are changed to allow directory entries to contain an inode group and identifier of the inode.
 * A logical clustered metadata driver is introduced below the client Lustre file system write back cache driver that maintains connections with the MDS servers.
 * A single metadata protocol is used by the client file system to make updates on the MDSs and by the MDSs to make updates involving other MDSs.
 * A single recovery protocol is used for both the client-MDS and MDS-MDS services.
 * Directories can be split across multiple MDS nodes. In this case, a primary MDS directory inode contains an extended attribute that points at other MDS inodes, which we call directory objects.

Configuration management and startup
The configuration will name an MDS server, and optionally a failover node, that holds the root inode for a fileset. Clients will contact that MDS for the root inode during mount, as they do already.

They will also fetch a clustering descriptor from it. The clustering descriptor contains a header and an array that lists which inode groups are served by which server.

Through normal mechanisms, clients will wait and probe for available metadata servers during startup and cluster transitions. When new servers are found or configurations have changed, clients can update their clustering descriptor in the same way that they update the LOV striping descriptor for OSTs.
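As a rough illustration of the clustering descriptor, the following Python sketch models it as a header plus an array indexed by inode group. The `version` header field, the class and method names, and the rule "adopt only a strictly newer version" are all assumptions made for the example; the design itself does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusteringDescriptor:
    """Hypothetical in-memory form of the clustering descriptor."""
    version: int  # assumed header field used to detect configuration changes
    # Array indexed by inode group; each slot names the MDS serving that group.
    group_to_server: List[str] = field(default_factory=list)

    def server_for(self, inode_group: int) -> str:
        """Return the server that serves the given inode group."""
        return self.group_to_server[inode_group]

    def update(self, newer: "ClusteringDescriptor") -> bool:
        """Adopt a newer descriptor; return True if the local copy changed."""
        if newer.version <= self.version:
            return False
        self.version = newer.version
        self.group_to_server = list(newer.group_to_server)
        return True


desc = ClusteringDescriptor(version=1, group_to_server=["mds0", "mds1", "mds0"])
print(desc.server_for(1))
```

A client would call `update` whenever it learns of a changed configuration, for example after a new server is found.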

Data Structures
The fid contains a new 32-bit integer to name the inode group.
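A minimal sketch of such a fid follows. Only the 32-bit inode group comes from the text above; the other fields (`ino`, `generation`), their widths, and the little-endian packed layout are assumptions chosen for illustration.

```python
import struct
from dataclasses import dataclass

# Assumed wire layout: 32-bit group, 64-bit inode id, 32-bit generation.
FID_FORMAT = "<IQI"


@dataclass(frozen=True, order=True)
class Fid:
    group: int       # 32-bit inode group (the new field described above)
    ino: int         # inode identifier within the group (assumed 64-bit)
    generation: int  # inode generation (assumed 32-bit)

    def pack(self) -> bytes:
        """Serialize the fid; struct.error is raised if a field is out of range."""
        return struct.pack(FID_FORMAT, self.group, self.ino, self.generation)

    @classmethod
    def unpack(cls, raw: bytes) -> "Fid":
        """Rebuild a fid from its packed form."""
        return cls(*struct.unpack(FID_FORMAT, raw))


fid = Fid(group=3, ino=12345, generation=1)
assert Fid.unpack(fid.pack()) == fid
```

Because the group is the first field, comparing two `Fid` values orders them by inode group first, then by inode identifier.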

Directory inodes on the MDS, when large, contain a new EA which is a descriptor of how the directory is split over directory objects, residing on other MDSs. This EA is subject to ordinary concurrency control by the MDS holding the inode. The EA is virtually identical to the LOV EA.

The clustered metadata client (CMC)
The function of the CMC is to figure out from the command issued which MDC to use. This is based on:
 * The inode groups in the request
 * A hash value of names used in the request, combined with the EA of a primary inode involved in the request
 * For readdir, the directory offset combined with the EA of the primary inode
 * The clustering descriptor

In any case, every command is dispatched to a single metadata server and the clients will not engage more than one metadata server for a single request.
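The selection logic for name-based requests can be sketched as below. This is an illustration under assumptions, not the actual CMC code: the `DirEA` type, the function names, and the use of CRC32 as the name hash are hypothetical, since the design does not name a hash function. Readdir dispatch by directory offset is left out because the text does not describe how offsets map to directory objects.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirEA:
    """Hypothetical split-directory EA: the inode group of each directory object."""
    object_groups: List[int]


def select_group(parent_group: int, name: Optional[str], ea: Optional[DirEA]) -> int:
    """Pick the single inode group that should receive the request."""
    if ea is None or name is None:
        # Unsplit directory, or a request without a name: use the inode's own group.
        return parent_group
    # Assumed hash; any stable hash of the name would serve the same purpose.
    h = zlib.crc32(name.encode("utf-8"))
    return ea.object_groups[h % len(ea.object_groups)]


def select_mdc(group_to_server: List[str], parent_group: int,
               name: Optional[str] = None, ea: Optional[DirEA] = None) -> str:
    """Map the chosen inode group to a server through the clustering descriptor array."""
    return group_to_server[select_group(parent_group, name, ea)]


servers = ["mds0", "mds1", "mds2"]
print(select_mdc(servers, parent_group=1))
print(select_mdc(servers, 1, "file.txt", DirEA(object_groups=[0, 2])))
```

Each call returns exactly one server, which matches the rule that a client never engages more than one metadata server for a single request.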

The API changes here are minimal and the client part of the implementation is trivial.

MDS implementation
For the most part, operations are similar or identical to what they were before. In some cases, multiple MDS servers are involved in updates.

getattr, open, readdir, setattr and lookup methods are unaffected.

Methods adding entries to directories are modified in some cases:


 * mkdir always creates the new directory on another MDS.
 * unlink, rmdir, and rename may involve more than one MDS.
 * For large directories, all operations making updates to directories can cause a directory split.
 * If no split of a large directory is encountered, all other operations proceed as they would when executed on a single MDS.

Directory Split
A directory can be striped over several MDTs, just as a file can be striped over several OSTs. The directory is then split into several objects, each located on a different MDT. The layout information (stripe EA) is stored in the extended attributes of all split objects.
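To make the split concrete, here is a small sketch that distributes the entries of one directory over one object per MDT and stores a copy of the layout on every object. The `DirObject` type, the MDT names, and the CRC32-based placement are assumptions for the example; the design does not specify the hash or the on-disk EA format.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DirObject:
    """Hypothetical directory object living on one MDT."""
    mdt: str
    stripe_ea: List[str]  # copy of the layout, kept on every split object
    entries: Dict[str, int] = field(default_factory=dict)  # name -> inode id


def split_directory(entries: Dict[str, int], mdts: List[str]) -> List[DirObject]:
    """Split a directory's entries over one object per MDT."""
    objects = [DirObject(mdt=m, stripe_ea=list(mdts)) for m in mdts]
    for name, ino in entries.items():
        # Assumed placement rule: hash of the entry name modulo the stripe count.
        idx = zlib.crc32(name.encode("utf-8")) % len(objects)
        objects[idx].entries[name] = ino
    return objects


entries = {"file%d" % i: 1000 + i for i in range(8)}
for obj in split_directory(entries, ["MDT0000", "MDT0001"]):
    print(obj.mdt, sorted(obj.entries))
```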

Recovery
In the long term, CMD recovery will rely on global epochs, which will not be implemented in the initial version. Instead, metadata operations that span multiple MDSs (MDTs) will be synchronous, to simplify recovery from a system crash. This may reduce the performance of operations involving several MDTs. Also, a small memory leak may occur after MDS recovery.

Locking
We believe locking can be done in fid order as it is currently done on the MDS.

(Updated 1/10)