Recovering from a Node or Network Failure

See also the Recovery Overview page (http://wiki.lustre.org/index.php/Recovery_Overview) and the Lustre Recovery chapter in the Lustre Operations Manual.

Lustre's recovery support is responsible for dealing with node or network failure, and returning the cluster to a consistent, performant state. Because Lustre allows servers to perform asynchronous update operations to the on-disk file system (i.e. the server can reply without waiting for the update to synchronously commit to disk) the clients may have state in memory that is newer than what the server can recover from disk after a crash.

A handful of different types of failure can cause recovery to occur:


 * Client (compute node) failure
 * MDS failure (and failover)
 * OST failure (and failover)
 * Transient network partition

At present, all failure and recovery operations are based on the notion of connection failure; all imports or exports associated with a given connection are considered to have failed if any of them do.

Client Failure
Lustre's support for recovery from client failure is based on the revocation of locks and other resources, so that surviving clients can continue their work uninterrupted. If a client fails to respond in a timely manner to a blocking lock callback from the Distributed Lock Manager (DLM), or has failed to communicate with the server for a long period of time (i.e. no pings), the client is forcibly removed from the cluster. This permits other clients to acquire locks blocked by the dead client's locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition as well as by an actual client node system failure. The section below on network partitions describes this case in more detail.
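The eviction decision described above can be sketched as follows. This is an illustrative model, not Lustre source code; the timeout names and values here (`LOCK_CALLBACK_TIMEOUT`, `PING_INTERVAL`, `PING_MISSES_ALLOWED`) are hypothetical stand-ins for Lustre's tunable timeouts.

```python
# Hypothetical sketch of the server-side eviction decision for an
# unresponsive client.  Names and values are illustrative only.

LOCK_CALLBACK_TIMEOUT = 100.0   # seconds to wait for a blocking-callback reply
PING_INTERVAL = 25.0            # expected client ping interval
PING_MISSES_ALLOWED = 4         # evict after this many missed pings

def should_evict(now, last_ping, callback_sent=None):
    """Return True if the server should forcibly remove this client."""
    # Case 1: a blocking lock callback was sent and never answered.
    if callback_sent is not None and now - callback_sent > LOCK_CALLBACK_TIMEOUT:
        return True
    # Case 2: the client has stopped pinging entirely.
    if now - last_ping > PING_INTERVAL * PING_MISSES_ALLOWED:
        return True
    return False
```

Note that the same predicate fires for a partitioned client and a crashed one; the server cannot distinguish the two, which is why the network-partition case below matters.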

MDS Failure (Failover)
Highly-available Lustre operation requires that the metadata server have a peer configured for failover, including the use of a shared storage device for the MDT backing file system. It is also possible to perform MDS recovery with a single MDS node; in that case, recovery takes as long as is needed for the single MDS to be restarted.

When clients detect an MDS failure, they will connect to the new MDS and begin Metadata Replay. Metadata Replay is responsible for ensuring that the replacement MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.

Transaction Numbers are used to ensure that operations are replayed in the order in which they were originally performed. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. Furthermore, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid introducing state changes that might conflict with the state being replayed by the recovering clients.
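Transaction-number-ordered replay can be illustrated with a minimal sketch. This is a conceptual model, not Lustre code: each saved request carries the transaction number (transno) the server originally assigned, and on reconnect the client replays its saved requests in transno order so the new MDS rebuilds state deterministically.

```python
# Conceptual sketch of transaction-number-ordered replay (not Lustre code).

def replay_queue(uncommitted_requests):
    """Yield saved requests in original execution order."""
    # Replies may have arrived out of order, so sort by transno first.
    for req in sorted(uncommitted_requests, key=lambda r: r["transno"]):
        yield req

# Example: three saved metadata operations, replayed oldest-first.
saved = [
    {"transno": 1042, "op": "setattr"},
    {"transno": 1040, "op": "mkdir"},
    {"transno": 1041, "op": "create"},
]
order = [r["op"] for r in replay_queue(saved)]
# order == ["mkdir", "create", "setattr"]
```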

The reconnection to a new (or rebooted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured, the client will try to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point the client will begin recovery.
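The reconnect loop described above can be sketched as follows, under the assumption (stated here, not taken from the source) that the client simply alternates between the NIDs recorded at mount time until one answers. The function and NID names are hypothetical; `probe` stands in for the real connect RPC.

```python
# Hedged sketch of a client's failover reconnect loop (names hypothetical).
import itertools

def reconnect(nids, probe):
    """Cycle through configured NIDs until probe(nid) reports the MDT is back.

    `probe` stands in for the real connect RPC; it returns True when the
    target responds that the failed MDT is again available.
    """
    for nid in itertools.cycle(nids):
        if probe(nid):
            return nid

# Example: the primary does not answer, the backup answers on the second try.
answers = iter([False, True])
chosen = reconnect(["mds1@tcp", "mds2@tcp"], lambda nid: next(answers))
# chosen == "mds2@tcp"
```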

OST Failure (Failover)
When an OST fails, or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and IO requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as inactive at this point, in which case file operations that involve the failed OST will return an IO error (-EIO).

The MDS (via the LOV) will detect that an OST is unavailable and skip it when assigning objects to new files. When the OST is restarted, or re-establishes communication with the MDS, the MDS and OST will perform Orphan Recovery to destroy any objects that belong to files that were deleted while the OST was unavailable.
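The object-matching step of Orphan Recovery can be sketched as a set difference. This is a deliberate simplification for illustration (the real protocol exchanges object ID ranges between MDS and OST, not full sets): any OST object no longer referenced by a surviving file is an orphan and is destroyed.

```python
# Simplified sketch of the Orphan Recovery matching step (illustrative only).

def find_orphans(ost_objects, mds_referenced):
    """Return the OST object IDs whose owning files no longer exist."""
    return sorted(set(ost_objects) - set(mds_referenced))

# Files using objects 7 and 9 were deleted while the OST was unavailable:
orphans = find_orphans([5, 6, 7, 8, 9], [5, 6, 8])
# orphans == [7, 9]
```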

The OSC-to-OST operation recovery protocol is the same Metadata Replay protocol used between the MDC and MDT. However, the OST typically commits IO operations to disk synchronously, so each reply indicates that the request is already committed and does not need to be saved for replay. In some cases, the OST will reply to the client before the operation is committed to disk (e.g. truncate, destroy, setattr, and IO operations in newer versions of Lustre), and normal replay and resend handling is done.
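The interaction between synchronous commit and the replay list can be sketched as follows. This is not Lustre code; it assumes (as the text describes) that each reply carries the highest transno the server has committed to disk, so requests at or below that number can be dropped from the client's replay list.

```python
# Sketch of client-side replay-list pruning (illustrative, not Lustre code).

def prune_replay_list(replay_list, last_committed):
    """Drop saved requests the server has durably committed to disk."""
    return [r for r in replay_list if r["transno"] > last_committed]

pending = [{"transno": t} for t in (100, 101, 102, 103)]
# Because the OST usually commits synchronously, last_committed typically
# covers the request just acknowledged, keeping this list short.
remaining = prune_replay_list(pending, last_committed=101)
# remaining transnos: [102, 103]
```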

Network Partition
Network failures may be transient, so to avoid invoking recovery the client will initially try to resend any timed-out request to the server. If the resend also fails, the client will try to re-establish a connection to the server. Clients can detect a harmless partition upon reconnect; if a reply was dropped, the server must perform Reply Reconstruction to regenerate it for the resent request.
 * Servers will evict clients if they notice the partition first (e.g. because a lock cancellation does not arrive in time)
 * The client upcall may try other routers; an arbitrary configuration change is possible
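Reply Reconstruction can be illustrated with a toy model. This is not the wire protocol: it only shows the core idea that the server remembers the last reply per client request, so a resent request whose original reply was lost gets the saved reply back instead of being executed twice.

```python
# Toy sketch of Reply Reconstruction (illustrative, not the wire protocol).

class Server:
    def __init__(self):
        self.counter = 0
        self.last_reply = {}   # (client, xid) -> saved reply

    def handle(self, client, xid):
        key = (client, xid)
        if key in self.last_reply:          # duplicate: the reply was dropped
            return self.last_reply[key]     # reconstruct it, do not re-execute
        self.counter += 1                   # the "real" state change
        reply = {"result": self.counter}
        self.last_reply[key] = reply
        return reply

srv = Server()
first = srv.handle("client-a", xid=1)
resent = srv.handle("client-a", xid=1)      # network dropped the first reply
# first == resent, and srv.counter == 1: the operation executed only once
```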

Failed Recovery
In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its state related to that server. A client may be evicted for a number of reasons, including:
 * Failure to respond in a timely manner to a server request or callback (blocking lock callback, lock completion callback, lock glimpse callback, or server shutdown notification with Simplified Interoperability)
 * Failure of recovery itself, for example:
   - any previously-connected clients fail to participate in recovery (with 1.6)
   - replay of operations depends upon other missing clients (with 1.8 and VBR)
   - a client fails to participate in recovery in a timely manner (until Delayed Recovery is implemented)
   - manual abort of recovery
 * Manual eviction by the administrator

If a client is evicted from the server for any reason, it must invalidate all of its locks, which in turn causes all cached inodes to be invalidated and all cached data to be flushed.

Recovery Test Infrastructure

 * lctl> notranso
 * lctl> readonly XXX broken right now
 * lctl> failconn