Fsck Support

Obtaining e2fsck Support for Lustre
To support advanced Lustre features such as FIDs and multi-mount protection (MMP), you need a Lustre-specific version of e2fsprogs, which can be found at http://downloads.lustre.org/public/tools/e2fsprogs/.

A quilt patchset of all changes to the vanilla e2fsprogs is available in e2fsprogs-{version}-patches.tgz.

Using e2fsck on a Backing File System
When an OSS, MDS, or MGS server crashes, it is not necessary to run e2fsck on the file system; ext3 journaling ensures that the file system remains coherent. Clients never access the backing file systems directly, so client crashes are not relevant.

e2fsck is REQUIRED on a device only when an event causes problems that ext3 journaling cannot handle, such as a hardware device failure or I/O error. If the ext3 kernel code detects corruption on the disk, it remounts the file system read-only to prevent further corruption while still allowing read access to the device. This appears as error "-30" (EROFS) in the syslogs on the server. In such a situation, it is normally sufficient to run e2fsck on the bad device before placing it back into service.
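As a rough way to spot this condition, a small shell function can scan a saved log file for the error code. This is a minimal sketch: the function name and the search pattern are assumptions, because the exact wording of the kernel message varies between versions. Only the "-30" error code comes from the description above, so matches should be read by hand.

```shell
# Hypothetical helper: list log lines that mention the error code -30 (EROFS).
# It matches "-30" only when it is not part of a longer number or a date,
# so it can still report unrelated lines that happen to contain that code.
find_erofs_errors() {
    logfile=$1
    # -e is used because the interesting part of the pattern contains a dash.
    grep -n -E -e '(^|[^0-9A-Za-z])-30([^0-9]|$)' "$logfile"
}
```

For example, find_erofs_errors /var/log/messages prints the matching lines with their line numbers, assuming the server writes its syslog to that path.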

In the vast majority of cases, Lustre will be able to cope with any inconsistencies it finds on the disk and between other devices in the file system.

Note: lfsck is rarely required for Lustre operation.

For problem analysis, it is strongly recommended that e2fsck be run under a logger, like script, to record all of the output and changes that are made to the file system in case this information is needed later.

If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the file system. The drawback is that in this mode, e2fsck doesn't recover the file system journal, so there may appear to be file system corruption when none really exists.

To determine whether corruption is real or only appears because the journal has not been replayed, you can briefly mount and unmount the backing file system directly on the node with Lustre stopped (NOT via Lustre), using a command similar to:

mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost

This will cause the journal to be recovered.
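Because the direct mount must only be done with Lustre stopped, it can help to confirm first that the device is not already mounted. The following is a minimal sketch; the function name is hypothetical, and it only compares device names literally, so a device mounted under a different name (for example, a device-mapper alias) would not be detected.

```shell
# Return success if the device appears in the mount table.
# The optional second argument (mount table path) defaults to /proc/mounts
# and exists mainly so the function can be tried against a sample file.
is_mounted() {
    dev=$1
    table=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Example (not executed here):
#   if is_mounted /dev/sda; then
#       echo "/dev/sda is mounted; stop Lustre services for it first" >&2
#   fi
```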

e2fsck works well when fixing file system corruption (better than similar file system recovery tools, and a primary reason why ext3 was chosen over other file systems for Lustre). However, it is often useful to identify the type of damage that has occurred so that an ext3 expert can make intelligent decisions about what needs fixing, instead of relying on e2fsck alone. Sun support is available for such situations.

root# {stop lustre services for this device, if running}
root# script /tmp/e2fsck.sda
Script started, file is /tmp/e2fsck.sda
root# mount -t ldiskfs /dev/sda /mnt/ost
root# umount /mnt/ost
root# e2fsck -fn /dev/sda  # don't fix file system, just check for corruption
[e2fsck output]
root# e2fsck -fp /dev/sda  # fix filesystem using "prudent" answers (usually 'y')
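When this sequence is scripted, the exit status of e2fsck gives a quick summary of what it found. The sketch below decodes that status using the bit values listed in the e2fsck man page; the function name and the short labels are illustrative choices, not part of e2fsprogs.

```shell
# Decode an e2fsck exit status into short labels.
# The status is a bitwise OR of the values documented in the e2fsck man page.
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    [ $((status & 1)) -ne 0 ]   && out="$out errors-corrected"
    [ $((status & 2)) -ne 0 ]   && out="$out reboot-recommended"
    [ $((status & 4)) -ne 0 ]   && out="$out errors-left-uncorrected"
    [ $((status & 8)) -ne 0 ]   && out="$out operational-error"
    [ $((status & 16)) -ne 0 ]  && out="$out usage-error"
    [ $((status & 32)) -ne 0 ]  && out="$out cancelled"
    [ $((status & 128)) -ne 0 ] && out="$out shared-library-error"
    # Strip the leading space left by the first appended label.
    echo "${out# }"
}
```

After a non-fixing run such as e2fsck -fn, calling describe_e2fsck_status $? would typically print "errors-left-uncorrected" if damage was found, since nothing is repaired in that mode.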

In addition, the e2fsprogs package contains the lfsck tool, which does distributed coherency checking for the Lustre file system after e2fsck has been run. In the large majority of cases, running lfsck is NOT required; skipping it carries only a small risk of leaving some leaked space in the file system. To avoid a lengthy downtime, lfsck can be run (with care) after Lustre is started.

Running e2fsck+lfsck on a Corrupted Lustre File System
In cases where the MDS or an OST becomes corrupted for some reason, it is possible to run a distributed check on the file system to determine what sort of problems exist.

The first step is to stop Lustre and run e2fsck -f on the individual MDS/OST that had problems to fix any local file system damage. It is a good idea to run this e2fsck under script to create a log of the changes made to the file system in case this is needed later. After this is complete, it is then possible to bring the file system up if necessary to reduce the outage window.

Next, run a full e2fsck of the MDS to create a database for lfsck. The -n option is CRITICAL for a mounted file system (i.e., if Lustre is running); without it, e2fsck will corrupt the file system. The mdsdb file can grow fairly large, depending on the number of files in the file system (10 GB or more of disk space for millions of files; the apparent file size is larger still because the file is sparse). Writing the file to a local file system is quicker because building the database involves many seeks and small writes. Depending on the number of files, this step can take several hours.

e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}
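Because the mdsdb file is sparse, the size reported by ls and the space actually allocated on disk differ, which matters when planning where to write or copy it. The following sketch prints both figures; the function name is hypothetical, and it relies only on ls, du, and awk.

```shell
# Print the apparent size and the allocated disk space of a file, in KiB.
# For a sparse file such as the mdsdb database, the apparent size is larger.
show_sparse_usage() {
    file=$1
    # Field 5 of "ls -ln" is the apparent size in bytes.
    apparent_bytes=$(ls -ln "$file" | awk '{print $5}')
    # "du -k" reports the space actually allocated, in KiB.
    allocated_kb=$(du -k "$file" | awk '{print $1}')
    echo "apparent_kb=$((apparent_bytes / 1024)) allocated_kb=$allocated_kb"
}
```

Running show_sparse_usage /tmp/mdsdb after the e2fsck step gives a quick idea of how much data would really be transferred by a sparse-aware copy.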

Next, this file must be made accessible on all of the OSTs, either using a shared file system or by copying the file to the OSTs (pdcp is useful here).

A similar e2fsck step needs to be completed on the OSTs. If the OSTs do not have shared file system access to the MDS file system, a stub mdsdb file called {mdsdb}.mdshdr is generated that can be used instead of the full mdsdb file. The mdsdb is only used for reading, so it does not need to be in shared storage for all of the OSTs. The OST e2fsck --ostdb step can be run in parallel on all of the OSTs.

e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} /dev/{ostNdev}
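On a server that hosts several OST devices, the command above has to be repeated once per device. The sketch below only prints the command lines so they can be reviewed before anything is run. The function name and the choice to name each database file after its device (/tmp/NAME.ostdb) are assumptions made for illustration.

```shell
# Hypothetical helper: print the read-only e2fsck command for each OST device.
# Usage: ostdb_commands MDSDB_PATH DEVICE...
ostdb_commands() {
    mdsdb=$1
    shift
    for dev in "$@"; do
        # Name the database after the device, e.g. /dev/sdb -> /tmp/sdb.ostdb
        # (an illustrative naming convention, not required by e2fsck).
        name=$(basename "$dev")
        echo "e2fsck -n -v --mdsdb $mdsdb --ostdb /tmp/$name.ostdb $dev"
    done
}

# Example (not executed here): run the printed commands two at a time,
# assuming an xargs that supports -P (as GNU findutils does):
#   ostdb_commands /tmp/mdsdb /dev/sdb /dev/sdc | xargs -P 2 -I CMD sh -c CMD
```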

Finally, the mdsdb and all of the ostdb files need to be made available on a mounted client so that lfsck can be run to examine the file system and optionally correct defects it finds.

script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lustre/mount/point
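Since lfsck needs every database file to be present on the client, a small pre-flight check can catch a missing copy before a long run starts. This is a sketch under stated assumptions: the function name is hypothetical, and it prints the report-only command rather than executing it.

```shell
# Hypothetical helper: verify that the database files are readable, then
# print the report-only lfsck command for them.
# Usage: lfsck_report_command MDSDB_PATH MOUNT_POINT OSTDB_PATH...
lfsck_report_command() {
    mdsdb=$1
    mountpoint=$2
    shift 2
    for db in "$mdsdb" "$@"; do
        if [ ! -r "$db" ]; then
            echo "missing or unreadable database file: $db" >&2
            return 1
        fi
    done
    # "$*" joins the remaining ostdb paths with single spaces.
    echo "lfsck -n -v --mdsdb $mdsdb --ostdb $* $mountpoint"
}
```

The printed line can be pasted into a session started under script, so that the lfsck output is logged as recommended.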

By default, lfsck does not repair any inconsistencies found, but only reports the errors. It checks for three kinds of inconsistencies:


 1) Inode exists but has missing objects (dangling inode). This normally happens if there was a problem with an OST.
 2) Inode is missing but OST has unreferenced objects (orphan object). This normally happens if there was a problem with the MDS.
 3) Multiple inodes reference the same objects. This can happen if there was corruption on the MDS, or if the MDS storage is cached and loses some but not all writes.

If the file system is in use and being modified while the --mdsdb and --ostdb steps are running, lfsck may report inconsistencies where none exist due to files and objects being created/removed after the database files were collected. Therefore, the results should be examined closely. You may want to re-run the test and/or contact Sun support for guidance.

The easiest problem to resolve is that of orphaned objects. When the -l option for lfsck is used, these objects are linked to new files and put into lost+found in the Lustre file system, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, lfsck can be run with the -d option to delete orphaned objects and free up any space they are using.

To fix dangling inodes, run lfsck with the -c option, which creates new zero-length objects on the OSTs. The repaired files will read back with binary zeros for the stripes that had objects recreated. Such files can also be read without lfsck repair by entering:

dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror

Because it is rarely useful to have files with large holes in them, most users delete these files after reading them (if useful) and/or restoring them from backup. Note that it is not possible to write to the holes of such a file without having lfsck recreate the objects, so it is generally easier to delete these files and restore them from backup.

To fix inodes with duplicate objects, run lfsck with the -c option, which copies the duplicate object to a new object and assigns it to one of the files. One of the files will be OK, and one will likely contain garbage, but lfsck cannot tell by itself which one is correct.
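The repair options described in this section can be summarized as a small lookup. The sketch below maps each kind of inconsistency to the corresponding lfsck option; the function name and the category labels are hypothetical and exist only for this illustration.

```shell
# Map a kind of inconsistency to the lfsck option described in this section.
# The category names are labels invented for this sketch.
lfsck_repair_option() {
    case $1 in
        orphan-keep)        echo "-l" ;;  # link orphan objects into lost+found
        orphan-delete)      echo "-d" ;;  # delete orphan objects, free space
        dangling|duplicate) echo "-c" ;;  # create missing or copy duplicate objects
        *)
            echo "unknown inconsistency type: $1" >&2
            return 1
            ;;
    esac
}
```

For instance, lfsck_repair_option orphan-keep prints -l, the option that preserves orphaned objects for later inspection.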