[edit] WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Fsck Support

From Obsolete Lustre Wiki

Latest revision as of 19:50, 27 March 2010

(This page now redirects to Handling_File_System_Errors.)

(Updated: Oct 2009)


To support more advanced features of Lustre™, such as FIDs or multi-mount protection (MMP) support, you will need a Lustre-specific version of e2fsprogs, which can be found at http://downloads.lustre.org/public/tools/e2fsprogs/.

A quilt patchset of all changes to the vanilla e2fsprogs is available in e2fsprogs-{version}-patches.tgz.

Using e2fsck on a Backing File System

When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on the file system. The ext3 journaling will ensure that the file system remains coherent. The backing file systems are never accessed directly from the client, so client crashes are not relevant.

The only time it is REQUIRED that e2fsck be run on a device is when an event causes problems that ext3 journaling is unable to handle, such as a hardware device failure or I/O error. If the ext3 kernel code detects corruption on the disk, it will remount the file system read-only to prevent further corruption while still allowing read access to the device. This will appear as error "-30" (EROFS) in the syslogs on the server. In such a situation, it is normally required only that e2fsck be run on the bad device before placing the device back into service.

In the vast majority of cases, Lustre will be able to cope with any inconsistencies it finds on the disk and between other devices in the file system.

Note: lfsck is rarely required for Lustre operation.

For problem analysis, it is strongly recommended that e2fsck be run under a logger, like script, to record all of the output and changes that are made to the file system in case this information is needed later.

If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the file system. The drawback is that in this mode, e2fsck doesn't recover the file system journal, so there may appear to be file system corruption when none really exists.

To determine whether apparent corruption is real or only an artifact of the journal not having been replayed, you can briefly mount and unmount the file system directly on the node with Lustre stopped (NOT via Lustre), using a command similar to:

mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost

This will cause the journal to be recovered.

e2fsck works well at fixing file system corruption (better than similar file system recovery tools, and a primary reason why ext3 was chosen over other file systems for Lustre). However, it is often useful to identify the type of damage that has occurred, so an ext3 expert can make intelligent decisions about what needs fixing rather than leaving the repair entirely to e2fsck.

root# {stop lustre services for this device, if running}
root# script /tmp/e2fsck.sda
Script started, file is /tmp/e2fsck.sda
root# mount -t ldiskfs /dev/sda /mnt/ost
root# umount /mnt/ost
root# e2fsck -fn /dev/sda   # don't fix file system, just check for corruption
:
[e2fsck output]
:
root# e2fsck -fp /dev/sda   # fix filesystem using "prudent" answers (usually 'y')

In addition, the e2fsprogs package contains the lfsck tool, which does distributed coherency checking for the Lustre file system after e2fsck has been run. In the large majority of cases, running lfsck is NOT required; skipping it carries only a small risk of leaving some leaked space in the file system. To avoid a lengthy downtime, it can be run (with care) after Lustre is started.

Running e2fsck+lfsck on a Corrupted Lustre File System

In cases where the MDS or an OST becomes corrupted for some reason, it is possible to run a distributed check on the file system to determine what sort of problems exist.

The first step is to stop Lustre and run e2fsck -f on the individual MDS/OST that had problems to fix any local file system damage. It is a good idea to run this e2fsck under script to create a log of the changes made to the file system in case this is needed later. After this is complete, it is then possible to bring the file system up if necessary to reduce the outage window.
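As a concrete sketch of this first step (the device name and log path are hypothetical; substitute your actual MDS/OST device), the whole repair pass can be captured to a log non-interactively with script's -c option:

```shell
# Record the full e2fsck transcript for later analysis.
# /dev/sdb1 is a hypothetical OST device -- adjust for your system.
# -f forces a full check; -p fixes problems using "prudent" answers.
script -c "e2fsck -f -p /dev/sdb1" /tmp/e2fsck.ost0001.log
```

This is equivalent to the interactive script session shown earlier, but leaves a single log file per device, which is convenient when checking many OSTs.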

Next, run a full e2fsck of the MDS to create a database for lfsck. The use of the -n option is CRITICAL for a mounted file system (i.e. if Lustre is running); otherwise, you will corrupt your file system. The mdsdb file can grow fairly large, depending on the number of files in the file system (10GB or more for millions of files, though the actual file size is larger because the file is sparse). Because the database workload consists of many seeks and small writes, it is fastest to write this file to a local file system. Depending on the number of files, this step can take several hours.

e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}

Next, this file must be made accessible on all of the OSTs, either using a shared file system or by copying the file to the OSTs (pdcp is useful here).
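For example, assuming the pdsh/pdcp tools are installed and the OSS nodes are reachable as oss01 through oss04 (hypothetical host names), the database can be pushed to all OST servers in one step:

```shell
# Copy the MDS database to /tmp on every OSS node in parallel.
# The host list is hypothetical -- adjust -w to match your cluster.
pdcp -w oss[01-04] /tmp/mdsdb /tmp/mdsdb
```

Without pdcp, a plain scp loop over the OSS host names accomplishes the same thing, just more slowly.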

A similar e2fsck step needs to be completed on the OSTs. If the OSTs do not have shared file system access to the MDS file system, a stub mdsdb file called {mdsdb}.mdshdr is generated that can be used instead of the full mdsdb file. The mdsdb is only used for reading, so it does not need to be in shared storage for all of the OSTs. The OST e2fsck --ostdb step can be run in parallel on all of the OSTs.

e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} /dev/{ostNdev}

Finally, the mdsdb and all of the ostdb files need to be made available on a mounted client so that lfsck can be run to examine the file system and optionally correct defects it finds.

script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lustre/mount/point

By default, lfsck does not repair any inconsistencies found, but only reports the errors. It checks for three kinds of inconsistencies:

  1. Inode exists but has missing objects (dangling inode). This normally happens if there was a problem with an OST.
  2. Inode is missing but OST has unreferenced objects (orphan object). This normally happens if there was a problem with the MDS.
  3. Multiple inodes reference the same objects. This can happen if there was corruption on the MDS, or if the MDS storage is cached and loses some but not all writes.

If the file system is in use and being modified while the --mdsdb and --ostdb steps are running, lfsck may report inconsistencies where none exist due to files and objects being created/removed after the database files were collected. Therefore, the results should be examined closely. You may want to re-run the test.

The easiest problem to resolve is that of orphaned objects. When the -l option for lfsck is used, these objects are linked to new files and put into lost+found in the Lustre file system, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, lfsck can be run with the -d option to delete orphaned objects and free up any space they are using.
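A sketch of the repair invocation for orphans, following the lfsck syntax shown above (the database paths and mount point are hypothetical):

```shell
# Link orphaned objects into lost+found on the Lustre file system
# so they can be inspected, then saved or deleted as appropriate.
# Database files and mount point are hypothetical examples.
lfsck -l -v --mdsdb /tmp/mdsdb --ostdb /tmp/ost1db /tmp/ost2db /mnt/lustre
```

Substituting -d for -l instead deletes the orphaned objects outright, reclaiming their space.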

To fix dangling inodes, using lfsck with the -c option will create new zero-length objects on the OSTs. These files will read back with binary zeros for the stripes that had objects recreated. Such files can also be read even without lfsck repair by entering:

dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror

Because it is rarely useful to have files with large holes in them, most users delete these files after reading them (if useful) and/or restoring them from backup. Note that it is not possible to write to the holes of such a file without having lfsck recreate the objects, so it is generally easier to delete these files and restore them from backup.

To fix inodes with duplicate objects, using lfsck with the -c option copies the duplicate object to a new object and assigns it to one of the files. One of the files will be OK, and one will likely contain garbage, but lfsck cannot tell by itself which one is correct.
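A sketch of the -c repair pass, which handles both dangling inodes and duplicate objects (database paths and mount point are hypothetical, following the invocation shown earlier):

```shell
# Create new objects for dangling inodes and copies for duplicates.
# Afterwards, inspect the affected files to decide which data is valid.
# Database files and mount point are hypothetical examples.
lfsck -c -v --mdsdb /tmp/mdsdb --ostdb /tmp/ost1db /tmp/ost2db /mnt/lustre
```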
