Fsck Support

Where to get e2fsck support for Lustre
In order to support some of the more advanced features of Lustre (e.g. extents and large inode/EA support) you need a patched e2fsprogs. This can be downloaded (RPM, SRPM) from the lfsck directory of the customer download site. See http://clusterfs.com/download.html

Code for e2fsprogs-lustre
We maintain a quilt patchset of all the CFS changes to the upstream e2fsprogs. The patches are available in the .src.rpm in the e2fsprogs/patches directory.

Using e2fsck on a backing filesystem
In a node crash case it is not necessary to run e2fsck on the filesystem; the ext3 journaling will ensure that the filesystem remains coherent. The only time e2fsck really needs to be run on a device is after an event that may cause problems outside of what ext3 journaling can handle, such as hardware device failure or IO errors. If the ext3 kernel code detects corruption on the disk, it will remount the filesystem read-only to prevent further corruption while still allowing read access to the device. This will show up as error "-30" (EROFS) in the syslogs on the server.
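Besides the syslog message, the read-only state can be checked from the mounts table. The following is a minimal sketch, assuming a Linux server where the backing device is listed in /proc/mounts with the usual "ro"/"rw" mount option. The function name is_mounted_readonly is hypothetical and not part of Lustre or e2fsprogs.

```shell
# Report whether a device is currently mounted read-only.
# $1: device path as it appears in the mounts table (e.g. /dev/sda)
# $2: mounts table to read (assumption: defaults to /proc/mounts on Linux)
# Exit status: 0 = mounted read-only, 1 = mounted read-write, 2 = not mounted
is_mounted_readonly() {
    awk -v dev="$1" '
        $1 == dev {
            found = 1
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit (found ? (ro ? 0 : 1) : 2) }
    ' "${2:-/proc/mounts}"
}
```

For example, "is_mounted_readonly /dev/sda && echo 'needs e2fsck'" would flag a device that has gone read-only.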

In such a situation, it is normally only required that e2fsck be run on the bad device before placing it back into service. In the vast majority of cases Lustre will be able to cope with inconsistencies it finds on the disk and between other devices in the filesystem. It should be made clear that lfsck is very rarely required for Lustre operation.

For good problem analysis it is strongly recommended that e2fsck be run under a logger like script, to record all of the output and the changes made to the filesystem in case this information is needed later. If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the filesystem. The drawback is that in this mode e2fsck doesn't recover the filesystem journal, so there may appear to be filesystem corruption when none really exists. If there is concern or confusion about whether corruption is real or only due to the journal not being replayed, briefly mounting and unmounting the ext3 filesystem directly on the node (with Lustre stopped, NOT via Lustre), e.g. "mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost", will cause the journal to be recovered.

While e2fsck is very good at fixing filesystem corruption (better than any other similar filesystem recovery tool, and a primary reason why ext3 was chosen over other filesystems for Lustre), it is often useful to know what type of damage there is, and an ext3 expert may be able to make more intelligent decisions about what needs fixing compared to e2fsck. CFS support is available for such situations.

root# {stop lustre services for this device, if running}
root# script /tmp/e2fsck.sda
Script started, file is /tmp/e2fsck.sda
root# mount -t ldiskfs /dev/sda /mnt/ost
root# umount /mnt/ost
root# script /root/e2fsck.sda.log
root# e2fsck -fn /dev/sda  # don't fix filesystem, just check for corruption
root# e2fsck -fp /dev/sda  # fix filesystem using "prudent" answers (usually 'y')
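e2fsck also reports its result through its exit status, which is a bit mask documented in the e2fsck manual page. The helper below is a small sketch for turning that status into readable text when scripting the steps above. The function name describe_e2fsck_status is hypothetical, and the messages paraphrase the manual page rather than quote any e2fsck output.

```shell
# Decode an e2fsck exit status (a bit mask) into one message per set bit.
# $1: the exit status, e.g. the value of $? right after e2fsck returns
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Each entry is "<bit value>:<meaning>", as listed in the e2fsck man page.
    for entry in \
        "1:file system errors corrected" \
        "2:file system errors corrected, system should be rebooted" \
        "4:file system errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:e2fsck canceled by user request" \
        "128:shared library error"
    do
        bit=${entry%%:*}
        text=${entry#*:}
        if [ $((status & bit)) -ne 0 ]; then
            echo "$text"
        fi
    done
}
```

A typical use would be "e2fsck -fn /dev/sda; describe_e2fsck_status $?". A check with -n that finds damage is expected to report errors left uncorrected, since nothing is fixed in that mode.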

In addition, the e2fsprogs package contains an lfsck tool which does distributed coherency checking for the Lustre filesystem after e2fsck has been run. Running lfsck is not required in the large majority of cases, at the cost of a small chance of having some leaked space in the filesystem. It can also be run (with care) once Lustre is already started, to avoid a lengthy downtime.

How to run e2fsck+lfsck on a corrupted Lustre filesystem
In cases where the MDS or an OST become corrupted for some reason, it is possible to run a distributed check on the filesystem to determine what sort of problems exist.

The first step is to run 'e2fsck -f' on the individual MDS/OST that had problems, with Lustre stopped, in order to fix any local filesystem damage. It is a very good idea to run this e2fsck under "script" as shown above so that you have a log of whatever changes it made to the filesystem in case this is needed later. After this is complete, it is possible to bring the filesystem up if necessary to reduce the outage window.

Next, a full e2fsck of the MDS is necessary in order to create a database for lfsck. The use of the '-n' option is critical for a mounted filesystem (i.e. if Lustre is running), otherwise you will corrupt your filesystem. The mdsdb file can grow fairly large, depending on the number of files in the filesystem (10GB or more for millions of files, though the apparent file size is larger because the file is sparse). It is fastest if this is written to a local filesystem because of the seeking and small writes. Depending on the number of files, this step can take several hours.
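Because the database file is sparse, the size shown by a directory listing and the space actually allocated on disk can differ considerably. The sketch below shows one way to compare the two when planning where to write or copy the file. It assumes GNU du (the --apparent-size option is a GNU extension), and the function name file_sizes_kib is hypothetical.

```shell
# Print "<apparent KiB> <allocated KiB>" for a file.
# Assumes GNU du; --apparent-size reports the length of the file rather
# than the disk space it occupies.
file_sizes_kib() {
    apparent=$(du -k --apparent-size "$1" | cut -f1)
    allocated=$(du -k "$1" | cut -f1)
    echo "$apparent $allocated"
}
```

Running "file_sizes_kib /tmp/mdsdb" after the e2fsck step gives both figures. Note that a copy tool which does not preserve holes may write out the full apparent size on the destination.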

e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}

Next this file must be made accessible on all of the OSTs (either via a shared filesystem, or by copying it to the OSTs - pdcp is very useful here), and the OSTs need to run a similar e2fsck step. There is a stub mdsdb file generated called {mdsdb}.mdshdr that can be used instead of the full mdsdb file, if the OSTs do not have shared filesystem access to the MDS filesystem. The mdsdb is only used for reading so it does not need to be in shared storage for all of the OSTs. The OST e2fsck --ostdb step can be run in parallel on all of the OSTs.

e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} /dev/{ostNdev}

Finally, the mdsdb and all of the ostdb files need to be made available on a mounted client so that lfsck can be run to examine the filesystem and optionally correct defects it finds.

script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lustre/mount/point
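With many OSTs it is easy to leave out a database file or point at one that was never copied to the client. The following sketch checks that every database is readable before assembling the read-only lfsck command line shown above. It only prints the command and does not run it. The function name lfsck_command is hypothetical, and paths containing spaces are not handled.

```shell
# Print the non-fixing lfsck command line for the given databases, after
# checking that every database file is readable on this client.
# $1: mdsdb file   $2: Lustre mount point   remaining args: ostdb files
lfsck_command() {
    if [ $# -lt 3 ]; then
        echo "usage: lfsck_command MDSDB MOUNTPOINT OSTDB..." >&2
        return 2
    fi
    mdsdb=$1
    mountpoint=$2
    shift 2
    for db in "$mdsdb" "$@"; do
        if [ ! -r "$db" ]; then
            echo "missing or unreadable database: $db" >&2
            return 1
        fi
    done
    # "$*" joins the remaining ostdb paths with single spaces.
    echo "lfsck -n -v --mdsdb $mdsdb --ostdb $* $mountpoint"
}
```

The printed line can be reviewed and then pasted into a session started with script, so that the output is logged.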

By default lfsck does not repair any inconsistencies it finds, only reporting the errors. It checks for 3 kinds of inconsistencies:


 1) Inode exists but has missing objects = dangling inode. This normally happens if there was a problem with an OST.
 2) Inode is missing but OST has unreferenced objects = orphan object. This normally happens if there was a problem with the MDS.
 3) Multiple inodes reference the same objects. This can happen if there was corruption on the MDS, or if the MDS storage is cached and loses some but not all writes.
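The three cases can be illustrated with a toy model. This is not how lfsck works internally, and the two input formats below are invented purely for illustration: one text file lists "<inode> <object>" pairs as the MDS sees them, and another lists the objects that actually exist on the OSTs. The function name classify_objects is likewise hypothetical.

```shell
# classify_objects REFS OBJS
# REFS: lines of "<inode> <object>" (what the MDS believes each file uses)
# OBJS: lines of "<object>"         (what actually exists on the OSTs)
# Both formats are hypothetical and exist only for this illustration.
classify_objects() {
    awk '
        FILENAME == ARGV[1] { exists[$1] = 1; next }   # first file: OST objects
        {
            refs[$2]++
            # Case 1: the inode points at an object no OST has.
            if (!($2 in exists))
                print "dangling inode " $1 " (missing object " $2 ")"
        }
        END {
            # Case 2: an object exists but no inode refers to it.
            for (o in exists) if (!(o in refs)) print "orphan object " o
            # Case 3: more than one inode refers to the same object.
            for (o in refs) if (refs[o] > 1) print "duplicate reference to object " o
        }
    ' "$2" "$1" | sort
}
```

The output is sorted only to make it stable, since awk does not guarantee an iteration order for arrays.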

If the filesystem is in use and being modified while the --mdsdb and --ostdb steps are running, lfsck may report inconsistencies where none exist, because files and objects were created or removed after the database files were collected. The results should therefore be examined closely, and you may want to re-run the test and/or contact CFS support for guidance.

The easiest problem to resolve is that of orphaned objects. With the '-l' option, lfsck can link these objects to new files and put them into lost+found in the Lustre filesystem, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, lfsck can be run with the '-d' option to delete orphaned objects and free up any space they are using.

To fix dangling inodes, lfsck will create new zero-length objects on the OSTs if the '-c' option is given. These files will read back with binary zeros for the stripes that had objects recreated. Such files can also be read even without lfsck repair by using dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror. Because it is rarely useful to have files with large holes in them, most users delete these files after reading them (if useful) and/or restoring them from backup. Note that it is not possible to write to the holes of such a file without having lfsck recreate the objects, so it is generally easier to delete these files and restore them from backup.
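One side effect of the dd options above is worth knowing: conv=sync pads every short or unreadable input block with zeros up to the block size, so the copy is rounded up to a multiple of 4096 bytes and may be longer than the original. A minimal sketch wrapping the same command follows. The function name salvage_copy is hypothetical.

```shell
# Copy a possibly damaged file using the dd options given above.
# noerror: keep going after read errors
# sync:    pad each short or failed input block with zeros to the block size
# $1: source file   $2: destination file
salvage_copy() {
    dd if="$1" of="$2" bs=4k conv=sync,noerror
}
```

For instance, copying an intact 5000-byte file this way produces an 8192-byte copy whose first 5000 bytes match the source and whose remainder is zero padding.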

To fix inodes with duplicate objects, lfsck will copy the duplicate object to a new object and assign that to one of the files if the '-c' option is given. One of the files will be OK, and one will likely contain garbage, but lfsck cannot tell by itself which one is correct.