Subsystem Map

==libcfs==
Code: lustre/lnet/libcfs/**/*.[ch]

===Summary===
Libcfs provides an API comprising fundamental primitives and subsystems (e.g. process management and debugging support) that is used throughout LNET, Lustre, and the associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets.
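As a rough illustration of the kind of primitive such a portability layer exposes, here is a minimal, self-contained sketch of a mask-filtered debug routine. All names here (`my_cdebug`, `MY_D_*`) are hypothetical, not the real libcfs API:

```c
#include <stdio.h>
#include <stdarg.h>

/* Hypothetical sketch of a portable debug primitive: one entry point,
 * filtered by a mask, usable the same way on every build target.
 * None of these names are the real libcfs API. */
#define MY_D_TRACE   0x01u
#define MY_D_WARNING 0x02u

static unsigned int my_debug_mask = MY_D_WARNING;

/* Returns the number of characters emitted, or 0 if filtered out. */
static int my_cdebug(unsigned int mask, const char *fmt, ...)
{
    va_list ap;
    int rc;

    if (!(mask & my_debug_mask))
        return 0;                   /* message filtered by the current mask */
    va_start(ap, fmt);
    rc = vfprintf(stderr, fmt, ap); /* a kernel target would log elsewhere */
    va_end(ap);
    return rc;
}
```

On a real system the same call site would compile against a per-target backend (kernel log, syslog, stderr), which is the portability property the summary describes.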

==CMM==

===Overview===

The CMM is a new layer in the MDS that handles all clustered metadata issues and relationships. The CMM does the following:


 * Acts as layer between the MDT and MDD.
 * Provides MDS-MDS interaction.
 * Queries and updates FLD.
 * Performs the local or remote operation as needed.
 * Will do rollback (epoch control, undo logging).

===CMM functionality===
The CMM chooses all the servers involved in an operation and sends dependent requests if needed. Calling a remote MDS is a new feature related to CMD. The CMM maintains a list of MDCs used to connect to all the other MDS nodes.

===Objects===
The CMM can allocate two types of object: local and remote. A remote object can occur during metadata operations that involve more than one object; such an operation is called a cross-ref operation.

Code: lustre/cmm
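The local-versus-remote dispatch described above can be sketched roughly as follows. This is an illustrative sketch, not the real lustre/cmm API: a CMM-like layer consults an FLD-style table mapping an object's sequence to an MDS index, then runs the operation locally or forwards it to the remote MDS through the corresponding MDC. All names and the table contents are hypothetical:

```c
#include <stdio.h>

/* Hypothetical FLD-style table: which MDS owns which sequence. */
struct fld_entry { unsigned long seq; int mds_idx; };

static const struct fld_entry fld_table[] = {
    { 0x100, 0 },   /* sequence 0x100 lives on MDS0 */
    { 0x200, 1 },   /* sequence 0x200 lives on MDS1 */
};

static int fld_lookup(unsigned long seq)
{
    int n = (int)(sizeof(fld_table) / sizeof(fld_table[0]));

    for (int i = 0; i < n; i++)
        if (fld_table[i].seq == seq)
            return fld_table[i].mds_idx;
    return -1;      /* unknown sequence */
}

/* Returns 1 for a local operation, 2 for a cross-ref (remote) one,
 * -1 if the sequence cannot be resolved. */
static int cmm_dispatch(unsigned long seq, int local_mds)
{
    int target = fld_lookup(seq);

    if (target < 0)
        return -1;
    return target == local_mds ? 1 : 2;
}
```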

==Recovery==

===Overview===
Client recovery starts when no server reply is received within a given timeout, or when the server tells the client that it is not connected (the client was evicted by the server earlier for whatever reason).

Recovery consists of trying to connect to the server and then stepping through several recovery states during which various client-server data is synchronized, namely DLM locks and all the requests that were already sent to the server but not yet confirmed as received. Should any problem arise during the recovery process (be it a timeout or the server's refusal to recognise the client again), recovery is restarted from the very beginning.

During recovery, new requests are not sent to the server; they are added to a special delayed-requests queue that is sent once recovery completes successfully.
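The delayed-queue behaviour just described can be sketched as follows. This is a hedged illustration with hypothetical names, not the real ptlrpc code: while recovery is in progress, new requests are parked instead of sent, and the queue is flushed once recovery succeeds:

```c
#include <string.h>

/* Illustrative sketch of the delayed-requests queue; all names are
 * hypothetical, not the real ptlrpc API. */
enum { MAX_DELAYED = 16 };

struct client_state {
    int in_recovery;            /* 1 while recovery is in progress */
    int n_delayed;
    int delayed[MAX_DELAYED];   /* parked request ids */
    int n_sent;                 /* requests actually sent to the server */
};

/* Send a request, or park it if recovery is in progress. Returns 1 if sent. */
static int send_or_delay(struct client_state *cs, int req_id)
{
    if (cs->in_recovery) {
        if (cs->n_delayed < MAX_DELAYED)
            cs->delayed[cs->n_delayed++] = req_id;
        return 0;               /* parked, not sent */
    }
    cs->n_sent++;
    return 1;
}

/* Called when recovery completes successfully: flush the parked queue.
 * Returns the number of requests flushed. */
static int recovery_done(struct client_state *cs)
{
    int flushed = cs->n_delayed;

    cs->in_recovery = 0;
    cs->n_sent += cs->n_delayed;
    cs->n_delayed = 0;
    return flushed;
}
```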

===Replay and Resend===
Recovery code is scattered throughout almost all of the tree. The most important pieces are ldlm/ldlm_lib.c (generic server recovery code) and ptlrpc/ (client recovery code).
 * Clients go through all the requests in the sending and replay lists and determine the recovery action needed: replay the request, resend the request, or clean up the associated state for committed requests.
 * The client replays requests that were not committed on the server, but for which the client saw a reply from the server before it failed. This allows the server to replay the changes to the persistent store.
 * The client resends requests that were committed on the server but for which the client did not see a reply, perhaps because a server or network failure caused the reply to be lost. This allows the server to reconstruct the reply and send it to the client.
 * The client resends requests that the server has not seen at all; these are all the requests with a transid higher than the server's last_rcvd value and the last_committed transno, and for which the reply-seen flag is not set.
 * The client gets the last_committed transno from the server and cleans up the state associated with requests that were committed on the server.
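The per-request decision in the list above can be condensed into a small function. This is a hedged sketch with hypothetical names, not the real ptlrpc code: given a request's transno, the server's last_committed transno, and whether a reply was seen, it picks the recovery action:

```c
/* Illustrative decision function; names are hypothetical. */
enum rec_action { REC_CLEANUP, REC_REPLAY, REC_RESEND };

static enum rec_action recovery_action(unsigned long transno,
                                       unsigned long last_committed,
                                       int reply_seen)
{
    if (transno <= last_committed)
        /* Committed on the server: clean up if we saw the reply,
         * otherwise resend so the server can reconstruct the reply
         * that was lost on the wire. */
        return reply_seen ? REC_CLEANUP : REC_RESEND;
    /* Not committed: replay if we saw a reply before the failure,
     * otherwise the server never saw it, so plain resend. */
    return reply_seen ? REC_REPLAY : REC_RESEND;
}
```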

==Version Based Recovery==
This recovery technique uses versions of objects (inodes) to allow clients to recover outside the ordinary server recovery timeframe.

 * 1) The server changes the version of an object during any change and returns that data to the client. The version may be checked during replay to be sure that the object is in the same state during replay as it was originally.
 * 2) After a failure, the server starts recovery as usual, but if some client is missed, the version check will be used for its replays.
 * 3) A missed client can connect later and try to recover. This is 'delayed recovery', and the version check is always used during it.
 * 4) A client that missed the main recovery window will not be evicted and can connect later to initiate recovery. In that case the versions will be checked to determine whether the object was changed by someone else.
 * 5) When finished with replay, the client and server check whether any replay failed because of a version mismatch. If not, the client gets a successful reintegration message. If a version mismatch was encountered, the client must be evicted.
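The version check in the steps above can be sketched as follows. This is a hedged illustration with hypothetical names, not the real server code: every change bumps an object's version, a delayed replay carries the version the client originally saw, and the replay fails on a mismatch:

```c
/* Illustrative version-based replay check; names are hypothetical. */
struct vobj { unsigned long version; };

static void vobj_change(struct vobj *o)
{
    o->version++;   /* step 1: the version changes with every modification */
}

/* Returns 0 if the replay may proceed, -1 on a version mismatch
 * (in which case, per step 5, the client must be evicted). */
static int replay_version_check(const struct vobj *o, unsigned long pre_version)
{
    return o->version == pre_version ? 0 : -1;
}
```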

==pCIFS, CTDB==

===pCIFS Overview===
The Lustre pCIFS client provides parallel I/O support for Lustre servers shared by Samba via the CIFS protocol. All data I/O is dispatched smartly to the Lustre OST nodes, while metadata operations are left untouched and go directly to the Lustre MDS server.

The pCIFS client is actually a Samba client using the CIFS protocol, not a native Lustre client using the Lustre LNET protocol. It is implemented as a Windows file filter driver on top of the Windows network file system (LanmanRedirector) to detect user I/O requests and redirect them to the corresponding OST nodes rather than the MDS node, eliminating the bottleneck at the MDS node.

A beta version is currently out for public testing. The coming version will support failover.

pCIFS Architecture:


===CTDB/Samba===

CTDB is a database implementation that provides TDB-like APIs to Samba and other applications for temporary context-data management. It relies on the underlying clustered filesystem to manage the TDB database files, since TDB uses FLOCK to protect database access.

As the TDB database is shared among all the nodes in the cluster, CTDB provides failover for all CTDB clients (such as Samba). When a Samba/CTDB node hangs, another node takes over the dead node's IP address and restores all TCP connections. The failover process is completely transparent, so the Samba client will not notice it and stays alive with the new cluster configuration.
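The IP-takeover step of that failover can be sketched roughly as follows. This is an illustrative sketch, not the real CTDB code or policy: each public address has an owning node, and when a node dies its addresses are reassigned here to the lowest-numbered surviving node (real CTDB uses its own rebalancing rules):

```c
/* Illustrative public-IP takeover; node counts and policy are hypothetical. */
enum { N_NODES = 3, N_IPS = 4 };

static int pick_healthy(const int healthy[N_NODES])
{
    for (int n = 0; n < N_NODES; n++)
        if (healthy[n])
            return n;
    return -1;                      /* no node left to take over */
}

/* Mark dead_node unhealthy and reassign every address it owned.
 * Returns how many addresses moved. */
static int takeover(int owner[N_IPS], int healthy[N_NODES], int dead_node)
{
    int moved = 0;

    healthy[dead_node] = 0;
    int target = pick_healthy(healthy);
    for (int ip = 0; ip < N_IPS; ip++) {
        if (owner[ip] == dead_node && target >= 0) {
            owner[ip] = target;     /* clients reconnect to the new owner */
            moved++;
        }
    }
    return moved;
}
```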


CTDB Architecture:

Correction: "CTDB Private Network" and "CTDB Public Network" should be exchanged since they are marked at wrong places in the following 3 pictures.

CTDB Failover:

pCIFS Failover