Fsck Support

Where to get e2fsck support for Lustre
To support more advanced features of Lustre, such as extents or large inode/EA support, you will need a Lustre-specific version of e2fsprogs, which can be found at http://downloads.lustre.org/public/tools/e2fsprogs/.

Code for e2fsprogs-lustre
A quilt patchset of all upstream changes to e2fsprogs is available in e2fsprogs/patches/.src.rpm.

Using e2fsck on a backing filesystem
After a node crash, it is not necessary to run e2fsck on the file system; ext3 journaling ensures that the file system remains coherent. Running e2fsck on a device is REQUIRED only when an event causes problems that ext3 journaling cannot handle, such as a hardware device failure or I/O error. If the ext3 kernel code detects corruption on the disk, it mounts the file system read-only to prevent further corruption while still allowing read access to the device. This appears as error "-30" (EROFS) in the syslogs on the server.
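As a rough aid for spotting the read-only remount described above, the following sketch searches a saved syslog file for the token "EROFS" or a standalone "-30". The function name find_erofs_hints is hypothetical, and the exact wording of the kernel message is not specified here, so treat matches as hints to investigate rather than proof of corruption.

```shell
#!/bin/sh
# Coarse filter for read-only remount errors in a saved syslog file.
# Matches the literal token "EROFS" or a standalone "-30" (not part of a
# longer number or a date such as 2024-01-30). The real message wording
# varies, so every hit still needs to be read by a person.
find_erofs_hints() {
    logfile=$1
    grep -n -E -e 'EROFS' -e '(^|[^0-9])-30([^0-9]|$)' "$logfile"
}
```

A typical call would be find_erofs_hints /var/log/messages, where the log path depends on how syslog is configured on the server.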

In such a situation, it is normally required only that e2fsck be run on the bad device before placing the device back into service. In the vast majority of cases, Lustre will be able to cope with inconsistencies it finds on the disk and between other devices in the file system.

Note: lfsck is rarely required for Lustre operation.

For good problem analysis, it is strongly recommended that e2fsck be run under a logger such as script, to record all of the output and the changes made to the file system in case this information is needed later. If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the file system. The drawback is that in this mode e2fsck does not recover the file system journal, so there may appear to be file system corruption when none really exists. If there is concern or confusion about whether corruption is real or only due to the journal not being replayed, briefly mount and unmount the ext3 filesystem directly on the node, with Lustre stopped and NOT via Lustre, using a command similar to mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost. This causes the journal to be recovered.

While e2fsck is very good at fixing file system corruption (better than any other similar file system recovery tool, and a primary reason why ext3 was chosen over other file systems for Lustre), it is often useful to know what type of damage there is, so that an ext3 expert can make more intelligent decisions than e2fsck about what needs fixing. Sun support is available for such situations.

root# {stop lustre services for this device, if running}
root# script /tmp/e2fsck.sda
Script started, file is /tmp/e2fsck.sda
root# mount -t ldiskfs /dev/sda /mnt/ost
root# umount /mnt/ost
root# e2fsck -fn /dev/sda  # don't fix filesystem, just check for corruption
[e2fsck output]
root# e2fsck -fp /dev/sda  # fix filesystem using "prudent" answers (usually 'y')
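The session above can be turned into a per-device checklist. The sketch below only prints the commands for a given device and mount point; it does not run them, and the function name print_e2fsck_plan is hypothetical. It assumes Lustre services for the device have already been stopped by hand.

```shell
#!/bin/sh
# Print (do not execute) the check sequence for one backing device.
# Usage: print_e2fsck_plan DEVICE MOUNTPOINT
# Note: "script" starts a logged subshell, so the remaining commands are
# meant to be typed inside that session, which is then closed with "exit".
print_e2fsck_plan() {
    dev=$1
    mnt=$2
    echo "script /tmp/e2fsck.$(basename "$dev")"
    echo "mount -t ldiskfs $dev $mnt"   # replay the journal...
    echo "umount $mnt"                  # ...then unmount again
    echo "e2fsck -fn $dev"              # check only, no changes
    echo "e2fsck -fp $dev"              # repair pass
}

print_e2fsck_plan /dev/sda /mnt/ost
```

Printing the plan first makes it easy to review the device name before anything touches the disk.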

In addition, the e2fsprogs package contains an lfsck tool which does distributed coherency checking for the Lustre file system, after e2fsck has been run. Running lfsck is NOT required in the large majority of cases; skipping it carries only a small risk of having some leaked space in the file system. It can also be run once Lustre is already started (with care) to avoid a lengthy downtime.

How to run e2fsck+lfsck on a corrupted Lustre filesystem
In cases where the MDS or an OST become corrupted for some reason, it is possible to run a distributed check on the file system to determine what sort of problems exist.

The first step is to run 'e2fsck -f' on each individual MDS/OST that had problems, with Lustre stopped, in order to fix any local filesystem damage. It is a very good idea to run this e2fsck under "script" as shown above, so that you have a log of whatever changes it made to the filesystem in case this is needed later. After this is complete, it is possible to bring the filesystem up if necessary to reduce the outage window.

Next, a full e2fsck of the MDS is necessary in order to create a database for lfsck. The use of the '-n' option is critical for a mounted filesystem (i.e. if Lustre is running); otherwise you will corrupt your filesystem. The mdsdb file can grow fairly large, depending on the number of files in the filesystem (10GB or more for millions of files, though the apparent file size is larger because the file is sparse). It is fastest if this is written to a local filesystem because of the seeking and small writes. Depending on the number of files, this step can take several hours.

e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}

Next this file must be made accessible on all of the OSTs (either via a shared filesystem, or by copying it to the OSTs - pdcp is very useful here), and the OSTs need to run a similar e2fsck step. There is a stub mdsdb file generated called {mdsdb}.mdshdr that can be used instead of the full mdsdb file, if the OSTs do not have shared filesystem access to the MDS filesystem. The mdsdb is only used for reading so it does not need to be in shared storage for all of the OSTs. The OST e2fsck --ostdb step can be run in parallel on all of the OSTs.

e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} /dev/{ostNdev}
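Because the --ostdb step is the same on every OST apart from the device and the output file, the command line can be generated rather than typed. The following is a minimal sketch that prints one command per OST. The function name ost_db_cmd and the device names /dev/sdb and /dev/sdc are placeholders, and in practice each command would be run on the node that owns that OST.

```shell
#!/bin/sh
# Print the read-only e2fsck command that builds the ostdb for one OST.
# Usage: ost_db_cmd MDSDB OSTDEV OSTDB
ost_db_cmd() {
    mdsdb=$1
    ostdev=$2
    ostdb=$3
    echo "e2fsck -n -v --mdsdb $mdsdb --ostdb $ostdb $ostdev"
}

# Placeholder device list; replace with the real OST devices.
i=1
for dev in /dev/sdb /dev/sdc; do
    ost_db_cmd /tmp/mdsdb "$dev" "/tmp/ost${i}db"
    i=$((i + 1))
done
```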

Finally, the mdsdb and all of the ostdb files need to be made available on a mounted client so that lfsck can be run to examine the filesystem and optionally correct defects it finds.

script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lustre/mount/point
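Before starting lfsck on the client, it can help to confirm that every database file has actually arrived and to assemble the command from the list of files. This is a sketch under the assumption that all database files are plain readable files on the client; the helper names check_db_files and lfsck_cmd are hypothetical, and lfsck_cmd only prints the command.

```shell
#!/bin/sh
# Return non-zero if any of the given database files is missing or unreadable.
check_db_files() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            return 1
        fi
    done
}

# Print the read-only lfsck command for one mdsdb and any number of ostdb files.
# Usage: lfsck_cmd MDSDB MOUNTPOINT OSTDB...
lfsck_cmd() {
    mdsdb=$1
    mountpoint=$2
    shift 2
    echo "lfsck -n -v --mdsdb $mdsdb --ostdb $* $mountpoint"
}

lfsck_cmd /tmp/mdsdb /lustre/mount/point /tmp/ost1db /tmp/ost2db
```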

By default lfsck does not repair any inconsistencies it finds, only reporting the errors. It checks for 3 kinds of inconsistencies:


 * Inode exists but has missing objects (dangling inode). This normally happens if there was a problem with an OST.
 * Inode is missing but OST has unreferenced objects (orphan object). This normally happens if there was a problem with the MDS.
 * Multiple inodes reference the same objects. This can happen if there was corruption on the MDS, or if the MDS storage is cached and loses some but not all writes.

If the filesystem is in use and being modified while the --mdsdb and --ostdb steps are running, lfsck may report inconsistencies where none exist, because files and objects were created or removed after the database files were collected. Examine the results closely; you may want to re-run the test and/or contact CFS support for guidance.

The easiest problem to resolve is that of orphaned objects. Using the '-l' option to lfsck it can link these objects to new files and put them into lost+found in the Lustre filesystem, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, lfsck can run with the '-d' option to delete orphaned objects and free up any space they are using.

To fix dangling inodes, lfsck will create new zero-length objects on the OSTs if the '-c' option is given. These files will read back with binary zeros for the stripes that had objects recreated. Such files can also be read even without lfsck repair by using dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror. Because files with large holes in them are rarely useful, most users delete these files after reading them (if useful) and/or restoring them from backup. Note that it is not possible to write to the holes of such a file without having lfsck recreate the objects, so it is generally easier to delete these files and restore them from backup.
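The dd invocation above can be wrapped in a small helper. This sketch uses the same options as the text; the function name salvage_copy is hypothetical. With conv=sync,noerror, dd continues past read errors and pads every short or failed 4k block with zeros, so the copy is always a multiple of 4096 bytes and may be longer than the original file.

```shell
#!/bin/sh
# Copy SRC to DST, continuing past read errors and zero-padding short or
# failed blocks (same dd options as in the text above).
# Usage: salvage_copy SRC DST
salvage_copy() {
    dd if="$1" of="$2" bs=4k conv=sync,noerror 2>/dev/null
}
```

For example, salvage_copy /lustre/bad/file /new/file reads what it can from the damaged file. If the original length matters, record it beforehand so the trailing padding can be trimmed later.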

To fix inodes with duplicate objects, lfsck will copy the duplicate object to a new object and assign that to one of the files if the '-c' option is given. One of the files will be OK, and one will likely contain garbage, but lfsck cannot tell by itself which one is correct.
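The repair options described above can be summarized as a lookup from the intended action to the lfsck flag. The mode names link-orphans, delete-orphans, and create-objects, and the function name lfsck_repair_flag, are made up for this sketch; only the flags themselves come from the text.

```shell
#!/bin/sh
# Map a repair action to the lfsck option described in the text.
# Usage: lfsck_repair_flag MODE
lfsck_repair_flag() {
    case $1 in
        link-orphans)   echo "-l" ;;  # link orphan objects into lost+found
        delete-orphans) echo "-d" ;;  # delete orphan objects, freeing space
        create-objects) echo "-c" ;;  # recreate missing or duplicated objects
        *) echo "unknown repair mode: $1" >&2; return 1 ;;
    esac
}
```

Keeping the destructive '-d' choice behind an explicit name makes it harder to select by accident.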