FAQ - Recovery

(Updated: Dec 2009)

How do I configure failover services?

Typical failover configurations connect a pair of Lustre MDS or OSS nodes directly to a multi-port disk array. Object servers are usually active/active, with each serving half of the array, while metadata servers must be active/passive. The arrays generally have internal redundancy, so that they are not themselves single points of failure. Such a configuration does not normally require a Fibre Channel switch.
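As an illustration of the active/active layout described above, the following Python sketch models one OSS pair sharing an array. It is only a toy model: the class `FailoverPair`, the node names `oss1`/`oss2`, and the target names are assumptions made for the example, not Lustre configuration syntax.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverPair:
    """Toy model of two servers attached to one multi-port array (hypothetical)."""
    nodes: tuple                      # e.g. ("oss1", "oss2")
    targets: dict                     # target name -> node currently serving it
    failed: set = field(default_factory=set)

    def fail(self, node):
        """Mark a node as failed and move its targets to the surviving partner."""
        self.failed.add(node)
        survivors = [n for n in self.nodes if n not in self.failed]
        if not survivors:
            raise RuntimeError("both nodes in the pair have failed")
        for target, owner in self.targets.items():
            if owner == node:
                self.targets[target] = survivors[0]


# Active/active: each OSS normally serves half of the array.
oss_pair = FailoverPair(
    nodes=("oss1", "oss2"),
    targets={"OST0000": "oss1", "OST0001": "oss1",
             "OST0002": "oss2", "OST0003": "oss2"},
)

oss_pair.fail("oss1")   # oss2 now serves the whole array
print(oss_pair.targets)
```

An active/passive MDS pair fits the same model, with every target initially assigned to one node and none to its partner.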

How do I automate failover of my MDSs/OSSs?

The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat's Cluster Manager and SuSE's Heartbeat).

Completely automated failover also requires some kind of programmatically controllable power switch, because the new "active" MDS must be able to completely power off the failed node. Otherwise, there is a chance that the "dead" node could wake up, start using the disk at the same time, and cause massive corruption.
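The ordering requirement above (power off the failed node first, take over its storage second) can be sketched as follows. This is a minimal illustration, not the logic of any particular cluster manager: `take_over`, `power_off`, and `mount_targets` are hypothetical names, and the two callables stand in for whatever power-switch control and mount procedure a site actually uses.

```python
def take_over(failed_node, power_off, mount_targets):
    """Take over a failed partner's storage only after it has been fenced.

    power_off:     hypothetical callable(node) -> bool; must return True only
                   when the power switch confirms the node is off.
    mount_targets: hypothetical callable(node) that starts serving the
                   failed node's storage on the surviving node.
    """
    if not power_off(failed_node):
        # Fencing not confirmed: the "dead" node might still write to the
        # shared disk, so touching the storage now risks corruption.
        return False
    mount_targets(failed_node)
    return True


# Example with stand-in callables that just record what happened.
events = []
ok = take_over(
    "mds1",
    power_off=lambda node: events.append(("off", node)) or True,
    mount_targets=lambda node: events.append(("mount", node)),
)
print(ok, events)
```

The point of the design is that a failed or unconfirmed power-off leaves the storage untouched, which is the safe outcome when the state of the other node is unknown.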

How necessary is failover, really?

The answer depends on how close to 100% uptime you need to achieve. Failover doesn't protect against the failure of individual disks; that is handled by software or hardware RAID at the OST and MDT level. Lustre failover handles the failure of an MDS or OSS node as a whole, which, in our experience, is not very common.

We would suggest that simple RAID-5 or RAID-6 storage is sufficient for most users, with manual restart of failed OSS and MDS nodes, but that the most important production systems should consider failover.

I don't need failover, and don't want shared storage. How will this work?

If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.

When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.
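The behavior described above can be sketched as a small simulation in abstract time ticks. It is a simplified model rather than Lustre client code: the function `blocked_request` and its parameters are hypothetical, and it assumes that a signal sent before the timeout has elapsed takes effect only once the timeout is reached.

```python
def blocked_request(server_up_at, abort_at, timeout):
    """Simulate a request issued at tick 0 while the server is down.

    server_up_at: tick at which the failed server returns to service
    abort_at:     tick at which the user sends a signal, or None for never
    timeout:      ticks that must pass before a signal can abort the request
    Returns a tuple (outcome, tick).
    """
    tick = 0
    while tick < server_up_at:
        signalled = abort_at is not None and tick >= abort_at
        if signalled and tick >= timeout:
            return ("aborted", tick)
        tick += 1          # the application simply pauses
    return ("completed", tick)


print(blocked_request(server_up_at=10, abort_at=None, timeout=3))
print(blocked_request(server_up_at=10, abort_at=5, timeout=3))
```

In the first call the application is never signalled, so it pauses until the server returns and then completes normally; in the second it is aborted while the server is still down.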

If a node suffers a connection failure, will the node select an alternate route for recovery?

Yes. If a node has multiple network paths, and one fails, it can continue to use the others.

What are the supported hardware methods for HBA, switch, and controller failover?

These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. Applications will see a delay while failover and recovery are in progress, but system calls complete without errors.

Can you describe an example failure scenario, and its resolution?

Although failures are becoming rarer, a node is more likely to hang or time out than to crash. If a client node hangs or crashes, the other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, which merely causes a short delay for applications that try to use that node. Other server nodes and clients are not usually affected.

How are power failures, disk or RAID controller failures, etc. addressed?

If I/O to the storage is interrupted and the storage device guarantees strict ordering of transactions, then ext3 journal recovery will restore the file system in a few seconds.

If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device's caches, Lustre requires a file system repair. Lustre's tools will reliably repair any damage they can. They run in parallel on all nodes, but the repair can still be very time consuming for large file systems.