WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - MDS striping format


Striping Description

In a Lustre file system, the metadata that describes where a file's data is stored on the object storage targets (OSTs) is kept in extended attributes (EAs) on the metadata server (MDS). This information, called the “striping EA”, is described in detail below, together with the set of APIs that Lustre provides for modules and applications to manipulate the striping EA.

This architecture page is available as a .pdf file at this link: Managing Lustre Data Striping

Striping Extended Attributes

In a Lustre file system, metadata and data are stored separately: metadata on the metadata server (MDS) and data on the object storage targets (OSTs). When accessing a file, the client obtains data location information from the MDS. The location information indicates how the file is striped across the OSTs. Because this information is stored in the extended attributes of each inode on the MDS, it is called the “striping EA.” The status of a striping EA may be in-disk, in-memory (kernel mode inside Lustre), or in-application (a striping EA held by a user-level application). Each status corresponds to a different format.

Lustre provides a set of APIs for other modules or applications to use to manipulate a striping EA. Below are a few examples showing how the striping EA is used by other Lustre modules.

Use Case

id | quality attribute | summary
create-file | usability | The client creates a file.
unlink-file | usability | The client unlinks a file.
lfs-setstripe | usability | The client creates a file with a specified stripe EA.
MPI-LIB | usability | The MPI opens or creates a file with a specified stripe EA.
copy-file | usability | Copy files from Lustre to another filesystem (QFS, pNFS or GPFS), while retaining the same striping information.

Quality Attribute Scenarios

create-file
Scenario: Client creates a new file.
Business Goals: Ensure that the basic POSIX function works.
Relevant QAs: Usability
Details Stimulus: Create a file
Stimulus source: Client application
Environment: Lustre-mounted client
Striping API usages: The client sends a “create” request to the MDS. The MDS calls the striping API to distribute the “create” request to the OSTs to create the data objects. The striping information is then returned to the MDS. The MDS calls the striping API again to convert the striping information to the appropriate disk format and places it into the EA of the metadata object.
unlink-file
Scenario: Client unlinks a file.
Business Goals: Ensure that the basic POSIX function works.
Relevant QAs: Usability
Details Stimulus: Unlink a file
Stimulus source: Client application
Environment: Lustre-mounted client
Striping API usages: A client sends an unlink request to the MDS. The MDS unlinks the metadata object and logs the action in the unlink log. The client then calls the striping API to locate the object on the OST and sends the unlink request to the OST. After the data objects on the OST are removed, the callback mechanism tells the MDS to remove the unlink log.
lfs-setstripe
Scenario: Client opens/creates a file with a specified striping EA.
Business Goals: Tune striping to meet user requirements.
Relevant QAs: Usability
Details Stimulus: Execute lfs setstripe.
Stimulus source: lfs setstripe and lfs getstripe utilities. Lustre also provides several lfs utilities to end users to set or get the striping information for a regular file or directory.
Environment: Lustre-mounted client
Striping API usages: In the current Lustre release, the striping EA of a regular file can only be set when the file is first opened or written. Executing lfs setstripe therefore implies opening or creating the file with a specific striping EA.
In the stripe-setting process, lfs first passes the requested striping EA to the file system (Lustre client), and the Lustre client then sends the open/create request with the striping EA to the MDS. The MDS calls the striping API to select the OSTs according to the striping EA specification and creates the objects on these OSTs. The MDS then calls the striping API again to store the striping EA on the metadata object.

Note: Limits for stripe settings are:

  • Maximum striping count for a single file is 160.
  • Maximum striping count for the system is 65532.
  • Minimum striping size is 65536.
  • The result of stripe_size * stripe_count must be less than 0xffffffff.
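
The limits in this note can be expressed as a small validation routine. The following is a minimal sketch, not Lustre code: the function and macro names are hypothetical, and it checks only the per-file limits listed above.

```c
#include <stdint.h>

/* Limits quoted in the note above; the macro names are made up for this sketch. */
#define SKETCH_MAX_STRIPES_PER_FILE 160u
#define SKETCH_MIN_STRIPE_SIZE      65536u
#define SKETCH_MAX_STRIPE_WIDTH     0xffffffffULL

/* Returns 0 if the settings satisfy the listed limits, -1 otherwise. */
static int sketch_check_stripe_settings(uint64_t stripe_size,
                                        uint32_t stripe_count)
{
    if (stripe_count > SKETCH_MAX_STRIPES_PER_FILE)
        return -1;
    if (stripe_size < SKETCH_MIN_STRIPE_SIZE)
        return -1;
    /* Reject oversized stripe sizes first so the product below cannot wrap. */
    if (stripe_size >= SKETCH_MAX_STRIPE_WIDTH)
        return -1;
    /* stripe_size * stripe_count must stay below 0xffffffff. */
    if (stripe_size * stripe_count >= SKETCH_MAX_STRIPE_WIDTH)
        return -1;
    return 0;
}
```

For example, a 1 MB stripe size with 4 stripes passes, while a 4 KB stripe size or a stripe count of 161 is rejected.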
MPI-LIB
Scenario: Client opens/creates a file with a specified striping EA in MPI-LIB
Business Goals: Enable MPI-LIB (Lustre ADIO driver) to execute lfs-setstripe directly.
Relevant QAs: Usability
Details Stimulus: Use MPI_File_open with striping hints to open or create a file
Stimulus source: MPI-LIB + Lustre ADIO driver
Environment: Lustre-mounted client and MPI environment
Striping API usages: The MPI uses the striping API only in MPI_File_open (in the Lustre ADIO driver), where it may be necessary to open/create a file with a certain striping EA. The MPI programmer can set the striping EA using hints. Below is an example showing how IOR is used to set a striping EA.
IOR_HINT__MPI__striping_unit=1048576 #striping size is 1M
IOR_HINT__MPI__striping_factor=2 #striping count is 2
IOR_HINT__MPI__striping_iodevice=0 #striping offset(index) is 0

The setting process is almost the same as for lfs-setstripe, but with one difference. In MPI, the ioctl system call is used directly to set the striping EA, instead of using an API from the Lustre user API lib, to avoid linking the unnecessary lib when building the MPI + Lustre ADIO driver.

copy-file
Scenario: Copy files from Lustre to another filesystem (QFS, pNFS or GPFS).
Business Goals: Copying files between Lustre and other filesystems (QFS, pNFS and GPFS), while retaining striping information without manual user intervention.
Relevant QAs: Usability
Details Stimulus: Copy files from the Lustre file system to another file system but keep the same striping pattern.
Stimulus source: Copy filesystem tool, GNU tar (gtar) is used to specify user-level Lustre striping.
Environment: Lustre filesystem. Striping information for the Lustre and QFS filesystems is similar enough that the user-level tool, GNU tar (gtar) can convert one to the other.
Striping API usages: Lustre provides an updated version of the GNU tar (gtar) backup tool that enables a complete Lustre file system to be restored with the same striping pattern as before. Gtar can also be used in the copy process.

For example, when file A is copied, gtar first calls the Lustre user-level striping API to extract the striping EA of file A from the MDS (in-application format). Then gtar starts to copy file A to the other file system (e.g. QFS).

Gtar creates a file on the target file system (possibly by using mknod) and sets the striping EA on that file. Since the striping format for these two file systems is very similar, gtar should not change the striping EA or should make only minor modifications.

Finally, gtar copies file A to the target file system according to the defined striping EA format.

Striping Format

The striping EA status designates three striping EA formats:

  • In-disk format (lov_mds_md) – Used when the striping EA is stored on disk.
  • In-memory format (lov_stripe_md) – Used when the striping EA is being read out from the disk and unpacked.
  • User format (lov_user_md) – Used when the striping EA is retrieved by the application and ready to output to the end user.

Independent of the format, all striping EAs consist primarily of two parts:

  • Public – Applies to all the OSTs on which the file is located. Indicates how the file is striped over the OSTs.
  • Private – An array in which each array item corresponds to one OST. Each array item specifies the OST index and data object ID within it.

When mapping a file offset to the corresponding offset within an OST object, Lustre computes the OST array index from the file offset, stripe size, and stripe count. It then looks up the OST index and object ID in the private OST array.
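
The mapping can be illustrated with plain round-robin (RAID-0) arithmetic. This is a minimal sketch under that assumption, not the Lustre implementation: the type and function names are hypothetical, and stripe_size and stripe_count are assumed to be nonzero.

```c
#include <stdint.h>

/* Result of mapping a file offset; a hypothetical type for this sketch. */
struct sketch_stripe_loc {
    uint32_t array_index;   /* index into the private OST array */
    uint64_t object_offset; /* byte offset inside that OST object */
};

/* Round-robin mapping: consecutive stripe_size chunks of the file go to
 * consecutive entries of the OST array, wrapping after stripe_count. */
static struct sketch_stripe_loc
sketch_map_offset(uint64_t file_off, uint32_t stripe_size,
                  uint32_t stripe_count)
{
    struct sketch_stripe_loc loc;
    uint64_t chunk = file_off / stripe_size;   /* chunk number in the file */

    loc.array_index = (uint32_t)(chunk % stripe_count);
    /* Full rounds already stored on this object, plus offset in the chunk. */
    loc.object_offset = (chunk / stripe_count) * stripe_size
                        + file_off % stripe_size;
    return loc;
}
```

With a 1 MB stripe size and a stripe count of 4, file offset 5 MB + 100 falls in chunk 5, which maps to array index 1 at object offset 1 MB + 100.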

Striping Disk format

Two striping disk formats are available: normal striping format for a normal file and joined striping format for a joined file.

Normal Striping EA formats

The two parts of the normal striping EA, lov_mds_md (public) and lov_ost_data (OST private) are described below.

struct lov_mds_md {

       /* LOV_MDS_MD */  
       __u32 lmm_magic;          
       __u32 lmm_pattern;        
       __u64 lmm_object_id;       
       __u64 lmm_object_gr;      
       __u32 lmm_stripe_size;    
       __u32 lmm_stripe_count;
       /* LOV_OST_DATA */
       struct lov_ost_data lmm_objects[0];

};

ID | Description
LOV_MDS_MD | Striping information
LOV_OST_DATA[ ] | Location information for the objects. Each OST for this object corresponds to an entry in the array.

LOV_MDS_MD

name | size | description
lmm_magic | 32 bits | Normal file (0x0BD10BD0)
lmm_pattern | 32 bits | Stripe pattern: RAID-0, RAID-1 or other network striping pattern. Only the RAID-0 pattern is currently supported.
lmm_object_id | 64 bits | Object ID on MDS, which is ino of the object (inode) in MDS.
lmm_object_gr | 64 bits | For a directory, the object group number is used to determine if the striping EA for the directory is the default striping EA or a striping EA specified by lfs setstripe. For a file, the object group number is currently unused, but, in future releases, it will be used to identify groups of objects in a cluster metadata (CMD) environment.
lmm_stripe_size | 32 bits | Stripe size: Number of bytes stored on each OST before moving to next OST.
lmm_stripe_count | 32 bits | Stripe count: Number of stripes in the file.

LOV_OST_DATA

struct lov_ost_data_v1 {

       __u64 l_object_id;        
       __u64 l_object_gr;        
       __u32 l_ost_gen; 
       __u32 l_ost_idx; 

};

Name | Size | Description
l_object_id | 64 bits | The object ID on the OST.
l_object_gr | 64 bits | The object group number (same as lmm_object_gr in LOV_MDS_MD).
l_ost_gen | 32 bits | Generation of l_ost_idx.
l_ost_idx | 32 bits | OST index in the logical object volume (LOV) on the MDS, which is handled by the management server (MGS) in the current version of Lustre.
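
Because the normal striping EA is a fixed header followed by one lov_ost_data entry per stripe, its size follows directly from the stripe count. The sketch below mirrors the two structures above using standard fixed-width types in place of __u32 and __u64; the sketch_ names are hypothetical, and the byte counts assume a common ABI with no padding between these fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of lov_ost_data_v1 for illustration only. */
struct sketch_lov_ost_data_v1 {
    uint64_t l_object_id;
    uint64_t l_object_gr;
    uint32_t l_ost_gen;
    uint32_t l_ost_idx;
};

/* Mirror of lov_mds_md for illustration only. */
struct sketch_lov_mds_md {
    uint32_t lmm_magic;
    uint32_t lmm_pattern;
    uint64_t lmm_object_id;
    uint64_t lmm_object_gr;
    uint32_t lmm_stripe_size;
    uint32_t lmm_stripe_count;
    struct sketch_lov_ost_data_v1 lmm_objects[]; /* one entry per stripe */
};

/* Size in bytes of a normal striping EA with stripe_count objects. */
static size_t sketch_lov_mds_md_size(uint32_t stripe_count)
{
    return sizeof(struct sketch_lov_mds_md)
           + (size_t)stripe_count * sizeof(struct sketch_lov_ost_data_v1);
}
```

Under these assumptions the header is 32 bytes and each object entry is 24 bytes, so a file with 160 stripes needs 32 + 160 * 24 = 3872 bytes.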

Joined Striping EA format

A joined file is made up of several normal files, each with its own extent and corresponding striping EA.

Joined File Stripe Format

For a joined file, the striping disk formats include:

  • Joined striping information (LOV_MDS_JOINED_MD).
  • Striping extent information (MDS_EXTENT_DESCRIPTION). This information is stored in the log file for which the llog_log_id is defined in the joined striping EA.
struct lov_mds_md_join {
       /* LOV_MDS_JOINED_MD */
       struct lov_mds_md lmmj_md;
       /* MDS_EXTENT_DESCRIPTION*/
       struct llog_logid lmmj_array_id; 
       __u32  lmmj_extent_count; 

};

ID | Field | Description
LOV_MDS_JOINED_MD | lmmj_md | Striping information. The format is the same as LOV_MDS_MD.
LOV_MDS_JOINED_MD | lmmj_extent_count | The number of normal files in the joined file.
JOINED_LOG_ID | lmmj_array_id | ID for the log file containing the striping extent information.

LOV_MDS_JOINED_MD

name | size | description
lmm_magic | 32 bits | Joined file (0x0BD20BD0)
lmm_pattern | 32 bits | Stripe pattern. For a joined file, each file should have the same pattern in the current version of Lustre.
lmm_object_id | 64 bits | Object ID on the MDS, which is ino of the object (inode) in the MDS.
lmm_object_gr | 64 bits | For a directory, the object group number is used to determine if the striping EA for the directory is the default striping EA or a striping EA specified by lfs setstripe. For a file, the object group number is currently unused, but, in future releases, it will be used to identify groups of objects in a cluster metadata (CMD) environment.
lmm_stripe_count | 32 bits | Total stripe count of each normal file in the joined file.
lmm_stripe_size | 32 bits | Not used currently.
lmmj_extent_count | 32 bits | The number of normal files in the joined file.

MDS_EXTENT_DESCRIPTION

For each joined file, extent striping information is stored in a log file, which is referred to by llog_logid.

struct llog_logid {

       __u64                   lgl_oid;
       __u64                   lgl_ogr;
       __u32                   lgl_ogen;

};

JOINED_LOG_ID

Name | Size | Description
lgl_oid | 64 bits | Log ID of the object.
lgl_ogr | 64 bits | Log group of the object.
lgl_ogen | 32 bits | Log generation of the object.

Joined File Log Formats

The joined log file is composed of joined log records. Each joined record includes a log header, a joined_record and a log tail.

struct mds_extent_desc {
       __u64                   med_start; 
       __u64                   med_len;   
       struct lov_mds_md       med_lmm;

};

struct llog_rec_hdr {
       __u32                   lrh_len;
       __u32                   lrh_index;
       __u32                   lrh_type;
       __u32                   padding;

};

struct llog_rec_tail {
       __u32 lrt_len;
       __u32 lrt_index;     

};

struct llog_array_rec {
       struct llog_rec_hdr     lmr_hdr;
       struct mds_extent_desc  lmr_med;
       struct llog_rec_tail    lmr_tail;

};

Group | Name | Size | Description
log_header | lrh_len | 32 bits | Log record length.
log_header | lrh_index | 32 bits | Log record index.
log_header | lrh_type | 32 bits | Log record type.
log_header | padding | 32 bits | Record padding for 4-byte alignment.
joined record | med_start | 64 bits | Offset of the extent for the normal file in the joined file.
joined record | med_len | 64 bits | Length of the extent for the normal file in the joined file.
joined record | med_lmm | size of LOV_MDS_MD | Striping information for each normal file (same as LOV_MDS_MD).
log_tail | lrt_len | 32 bits | Log record length. The value is the same as for lrh_len.
log_tail | lrt_index | 32 bits | Log record index. The value is the same as for lrh_index.
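
Since the tail repeats the length and index stored in the header, a reader can sanity-check a record by comparing the two. The following is a minimal sketch of such a check, not Lustre code. It assumes that lrh_len counts the whole record (header, payload and tail) and that fields are in host byte order; the sketch_ names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors of llog_rec_hdr and llog_rec_tail for illustration only. */
struct sketch_llog_rec_hdr {
    uint32_t lrh_len;
    uint32_t lrh_index;
    uint32_t lrh_type;
    uint32_t padding;
};

struct sketch_llog_rec_tail {
    uint32_t lrt_len;
    uint32_t lrt_index;
};

/* Returns 1 if the record at buf has a tail that mirrors its header,
 * 0 if the record is truncated or inconsistent. */
static int sketch_llog_rec_is_consistent(const unsigned char *buf,
                                         size_t buf_len)
{
    struct sketch_llog_rec_hdr hdr;
    struct sketch_llog_rec_tail tail;

    if (buf_len < sizeof(hdr) + sizeof(tail))
        return 0;
    memcpy(&hdr, buf, sizeof(hdr));
    /* Assumption: lrh_len covers header, payload and tail. */
    if (hdr.lrh_len < sizeof(hdr) + sizeof(tail) || hdr.lrh_len > buf_len)
        return 0;
    memcpy(&tail, buf + hdr.lrh_len - sizeof(tail), sizeof(tail));
    return tail.lrt_len == hdr.lrh_len && tail.lrt_index == hdr.lrh_index;
}
```

The memcpy calls avoid unaligned reads when the record sits at an arbitrary position inside a larger log buffer.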

Striping memory format

In-memory striping MD also includes general striping information and private information for each OST.

struct lov_oinfo {

       __u64 loi_id;              
       __u64 loi_gr;              
       int loi_ost_idx;           
       int loi_ost_gen;           
       /* used by the osc to keep track of what objects to build into rpcs */
       struct loi_oap_pages loi_read_lop;
       struct loi_oap_pages loi_write_lop;
       /* _cli_ is poorly named, it should be _ready_ */
       struct list_head loi_cli_item;
       struct list_head loi_write_item;
       struct list_head loi_read_item;
       unsigned loi_kms_valid:1;
       __u64 loi_kms;             
       struct ost_lvb loi_lvb;
       struct osc_async_rc     loi_ar;

};

struct lov_stripe_md {

       /* General striping information */
       spinlock_t       lsm_lock;
       void            *lsm_lock_owner; 
       struct {
               __u64 lw_object_id;        
               __u64 lw_object_gr;
               __u64 lw_maxbytes;
               __u32 lw_magic;
               __u32 lw_stripe_size; 
               __u32 lw_pattern;     
               unsigned lw_stripe_count;
       } lsm_wire;
       /* Private OST array */
       struct lov_array_info *lsm_array; 
       struct lov_oinfo *lsm_oinfo[0];

};

Name | Size | Description
lsm_lock | size of spinlock_t | Lock that protects each item of the striping EA.
lsm_lock_owner | size of void * | Owner of lsm_lock, for debugging purposes.

lsm striping information

Name | Size | Description
lw_object_id | 64 bits | LOV object ID (same as lmm_object_id).
lw_object_gr | 64 bits | LOV object group number (same as lmm_object_gr).
lw_maxbytes | 64 bits | Maximum possible file size.
lw_magic | 32 bits | lsm magic number (same as lmm_magic).
lw_stripe_size | 32 bits | Size of the stripe (same as lmm_stripe_size).
lw_pattern | 32 bits | Pattern of the stripe (same as lmm_pattern).
lw_stripe_count | size of unsigned | Stripe count (same as lmm_stripe_count).

OST array information

Name | Size | Description
lsm_array | size of pointer | Pointer to an lsm array; used only for joined files.
loi_id | 64 bits | Data object ID (same as l_object_id).
loi_gr | 64 bits | Data object group (same as l_object_gr).
loi_ost_idx | size of int | OST index of the data object.
loi_ost_gen | size of int | OST generation of the data object.
loi_read_lop | size of struct loi_oap_pages | List of pending read pages for the file for this object storage client (OSC).
loi_write_lop | size of struct loi_oap_pages | List of pending write pages for the file for this OSC.
loi_cli_item | size of struct list_head | List of objects ready to read/write for this OSC.
loi_read_item | size of struct list_head | List of objects to be read for this OSC.
loi_write_item | size of struct list_head | List of objects to be written for this OSC.
loi_kms | 64 bits | Known minimum size of the data object.
loi_kms_valid | 1 bit | Valid flag for the known minimum size.
loi_lvb | size of struct ost_lvb | Lock value block. Used to capture data object status information (size, time, etc.) communicated between the filter and OSC. The Lustre client system (llite) and LOV (llite/lov) merge the acquired information into a complete set of information about the file.
loi_ar | size of struct osc_async_rc | Used to propagate asynchronous writeback errors back up to the application. If an asynchronous write fails, an error code is recorded and used later when an application executes an fsync operation.

Striping user format

The striping user format is used when the striping EA is retrieved by a user-level application (for example, with lfs getstripe/setstripe).

struct lov_user_ost_data_v1 {

       __u64 l_object_id;        
       __u64 l_object_gr;        
       __u32 l_ost_gen;          
       __u32 l_ost_idx;          

};

struct lov_user_md {

       __u32 lmm_magic;          
       __u32 lmm_pattern;        
       __u64 lmm_object_id;      
       __u64 lmm_object_gr;      
       __u32 lmm_stripe_size;    
       __u16 lmm_stripe_count;   
       __u16 lmm_stripe_offset;  
       struct lov_user_ost_data_v1 lmm_objects[0]; 

};

The user format differs in the following ways from the in-disk format:

  • The user format has an lmm_stripe_offset field, which the in-disk format does not have. lmm_stripe_offset is used by setstripe to pass the striping_index parameter to Lustre when setting a stripe.
  • In the user format, lmm_stripe_count has only 16 bits, while in the in-disk format lmm_stripe_count has 32 bits. So in the current Lustre release, the system-wide maximum stripe count is 65532.
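
The width difference matters when a 32-bit in-disk stripe count is copied into the 16-bit user-format field. The helper below is a minimal sketch of a range-checked copy; its name is hypothetical and it is not part of the Lustre API.

```c
#include <stdint.h>

/* Copies a 32-bit in-disk stripe count into a 16-bit user-format field.
 * Returns 0 on success, or -1 if the value does not fit in 16 bits. */
static int sketch_narrow_stripe_count(uint32_t disk_count,
                                      uint16_t *user_count)
{
    if (disk_count > UINT16_MAX)
        return -1;          /* would be silently truncated otherwise */
    *user_count = (uint16_t)disk_count;
    return 0;
}
```

Without such a check, a plain assignment would keep only the low 16 bits of the count.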

Striping API

Lustre provides a set of APIs to handle the striping EAs. The five types of APIs are listed below according to their functionality:

  • Set/get APIs. Used to set or get a striping EA to or from storage.
  • Pack/unpack APIs. Because striping EAs are stored in packed format on disk, pack/unpack APIs are provided to pack and unpack striping EAs when a set or get striping EA API is used.
  • Allocate/free APIs. Used to allocate and free striping EAs in memory.
  • Striping location APIs. Since location information for data objects is stored in striping EAs, APIs are provided to access the striping EAs and return data object location information. These APIs are also used to select the OST where the data object is to be created.
  • lfs APIs. User-level APIs used by applications (lfs utilities) to handle striping EAs.

The set/get APIs operate on striping EAs in in-disk format. The pack/unpack APIs operate on striping EAs in both in-disk and in-memory formats. The other APIs operate on striping EAs in in-memory format.

Get/Set striping EA API

fsfilt_set/get_md

int fsfilt_set_md(struct obd_device *obd, struct inode *inode, void *handle, void *md, int size, const char *name)
int fsfilt_get_md(struct obd_device *obd, struct inode *inode, void *md, int size, const char *name)
Parameters
obd
Device of the object.
inode
MDS object.
handle
Journal handle for setting striping EA.
md
Buffer of the striping EA.
size
Size of the striping EA.
name
Name (LOV) of the striping EA
Return
fsfilt_set_md
0 means success. A negative error number means an error.
fsfilt_get_md
0 means success. A positive return value is the number of bytes that need to be added to the buffer to make it large enough to contain the striping EA. A negative error number means an error. Note: If the striping EA does not exist, get_md still returns 0.
Description

These two APIs are used by MDS to get/set a striping EA.

Pack/Unpack Striping EA API

obd_packmd

int obd_packmd(struct obd_export *exp, struct lov_mds_md **disk_tgt,struct lov_stripe_md *mem_src)
Parameters
exp
Export of the device.
disk_tgt
Disk structure for the striping EA.
mem_src
In-memory structure for the striping EA.
Return
If disk_tgt is NULL, the size of the striping EA described by the in-memory structure mem_src is returned.
If both disk_tgt and mem_src are NULL, the maximum possible striping EA size is returned.
If disk_tgt is not NULL and mem_src is NULL, *disk_tgt is freed.
If *disk_tgt is NULL, an in-disk structure is allocated.
Description
This API packs the striping EA from in-memory format to an in-disk description.

obd_unpackmd

int obd_unpackmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt,struct lov_mds_md *disk_src, int disk_len)
Parameters
exp
Export of the device
mem_tgt
In-memory structure for the striping EA
disk_src
Disk structure for the striping EA
disk_len
Length of disk_src.
Return
Positive value indicates the size of the unpacked striping EA.
0 is returned when the API tries to free the disk_src.
Negative value indicates an error.
Description
This API unpacks the striping EA from an in-disk format (disk_src) to an in-memory description (mem_tgt). When mem_tgt is NULL, the API will free disk_src.

Allocation/Free

obd_size_diskmd

int obd_size_diskmd(struct obd_export *exp, struct lov_stripe_md *mem_src)
Parameters
exp
Export of the device.
mem_src
In-memory structure for the striping EA.
Return
If mem_src is not NULL, the size of the striping EA described by mem_src is returned.
If mem_src is NULL, the maximum striping EA size is returned.
Description
This API returns the real size of the striping EA.

obd_alloc_diskmd

int obd_alloc_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)
Parameters
exp
Export of the device.
disk_tgt
Allocated in-disk-formatted striping EA.
Return
0 means success. A negative number means an error.
Description
This API allocates an in-disk-formatted striping EA and returns it through disk_tgt. It allocates the maximum striping EA size, which typically equals the maximum data object count of the file multiplied by the size of struct lov_ost_data.

obd_free_diskmd

int obd_free_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)
Parameters
exp
Export of the device.
disk_tgt
Allocated in-disk-formatted striping EA.
Return
0 means success. A negative number means an error.
Description
This API frees the in-disk-formatted striping EA referenced by *disk_tgt.

obd_alloc_memmd

int obd_alloc_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)
Parameters
exp
Export of the device.
mem_tgt
Allocated in-memory-formatted striping EA
Return
0 means success. A negative number means an error.
Description
This API returns the in-memory-striping EA pointed to by mem_tgt. It allocates the maximum striping EA size.

obd_free_memmd

int obd_free_memmd(struct obd_export *exp,struct lov_stripe_md **mem_tgt)
Parameters
exp
Export of the device.
mem_tgt
In-memory-formatted striping EA memory to be freed.
Return
0 means success. A negative number means an error.
Description
This API frees the in-memory-formatted striping EA referenced by *mem_tgt.

Striping Location APIs

lov_stripe_size

obd_size lov_stripe_size(struct lov_stripe_md *lsm, obd_size ost_size,int stripeno)

Parameters
lsm
In-memory striping EA.
ost_size
Size of a single data object in an OST.
stripeno
Stripe number of the data object.
Return
The computed file size.
Description
This API computes the file size given stripeno and the OST size, where stripeno and the OST size are associated with the OST where the end of the file is located.
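
The calculation can be sketched for round-robin (RAID-0) striping as follows. This is an illustration under that assumption rather than the Lustre source: the function name is hypothetical, the stripe size and count are passed explicitly instead of through a struct lov_stripe_md, and both are assumed to be nonzero.

```c
#include <stdint.h>

/* Given the size of the object that holds the end of the file (ost_size)
 * and that object's stripe number, return the implied file size. */
static uint64_t sketch_file_size(uint64_t ost_size, uint32_t stripe_size,
                                 uint32_t stripe_count, uint32_t stripeno)
{
    uint64_t full_chunks, remainder, stripe_width;

    if (ost_size == 0)
        return 0;

    full_chunks = ost_size / stripe_size;   /* complete chunks in the object */
    remainder = ost_size % stripe_size;     /* bytes in a partial last chunk */
    stripe_width = (uint64_t)stripe_size * stripe_count;

    if (remainder != 0)
        /* Full rounds, then the chunks before this stripe, then the tail. */
        return full_chunks * stripe_width
               + (uint64_t)stripeno * stripe_size + remainder;

    /* The object ends exactly on a chunk boundary. */
    return (full_chunks - 1) * stripe_width
           + ((uint64_t)stripeno + 1) * stripe_size;
}
```

With a 1 MB stripe size and a stripe count of 4, an object of size 1 MB + 100 at stripe number 1 implies a file size of 5 MB + 100, and an object of size 2 MB at stripe number 3 implies a file size of 8 MB.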

lov_stripe_offset

int lov_stripe_offset(struct lov_stripe_md *lsm, obd_off lov_off, int stripeno, obd_off *obd_off)

Parameters
lsm
In-memory striping EA.
lov_off
Logical file offset.
stripeno
Stripe number of the data object.
obd_off
Offset of the OST indicated by stripeno, which is nearest to the logical file offset (lov_off).
Return
0 means the OST indicated by stripeno is exactly the same OST as the offset (lov_off) indicated.
-1 means the index of the OST indicated by stripeno is less than the index of the OST indicated by the offset (lov_off).
1 means the index of the OST indicated by stripeno is larger than the index of the OST indicated by the offset (lov_off).
Description

This API is used to check whether an extent intersects with an OST.

lov_stripe_number

int lov_stripe_number(struct lov_stripe_md *lsm, obd_off lov_off)

Parameters
lsm
In-memory striping EA
lov_off
Logical file offset
Return
The stripe number that contains lov_off.
Description

This API computes which stripe number lov_off belongs to.

lfs API

llapi_file_get_stripe

int llapi_file_get_stripe(const char *path, struct lov_user_md *lum)

Parameters
path
Path of the file.
lum
Striping information returned to the caller.
Return
0 means success. A negative number means an error.
Description

This API returns striping information to the caller to be used by the application.

llapi_file_open

int llapi_file_open(const char *name, int flags, int mode, unsigned long stripe_size, int stripe_offset, int stripe_count, int stripe_pattern)

Parameters
name
File name
flags
Open flags
mode
Open mode
stripe_size
Stripe size of the file
stripe_offset
Stripe offset(stripe_index) of the file
stripe_count
Stripe count of the file
stripe_pattern
Stripe pattern of the file
Return
0 means success. A negative number means an error.
Description

This API opens/creates a file with specified striping parameters.

Future developments

With the currently implemented striping disk format, ->obd_unpackmd() must have an end-to-end understanding of all possible combinations of layouts, i.e., the format is basically flat rather than hierarchical.

To facilitate development of new layouts, the striping disk format will be adjusted so that higher layers (e.g., struct lov_mds_md) can be parsed without knowing the details of the lower layer (in this case, struct lov_ost_data) representation.

A straightforward way to do this is to precede each layout descriptor with the standard header:

      struct md_layout_descriptor_header {
              __u16 mldh_magic;
              __u16 mldh_length;
      };

where ->mldh_magic identifies the layout type and is used to determine the ->obd_unpackmd() method to be called to parse the descriptor; and ->mldh_length is the total descriptor length, which is used by the upper layer to pass over lower layer descriptors without understanding details of their representation.

Care must be taken, however, to avoid introducing too much redundant information to the on-disk EA for the most common uses.
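
The role of ->mldh_length can be illustrated with a parser that steps over descriptors it does not understand. This is a minimal sketch of the idea, not a proposed implementation. It assumes that descriptors are laid out back to back in a buffer, that mldh_length includes the header itself, and that fields are in host byte order; the sketch_ names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Mirror of the proposed md_layout_descriptor_header. */
struct sketch_md_layout_descriptor_header {
    uint16_t mldh_magic;
    uint16_t mldh_length;
};

/* Counts the descriptors in buf whose magic equals wanted_magic.
 * Other descriptors are skipped using only mldh_length.
 * Returns -1 if the buffer is malformed. */
static int sketch_count_descriptors(const unsigned char *buf, size_t len,
                                    uint16_t wanted_magic)
{
    size_t pos = 0;
    int found = 0;

    while (pos < len) {
        struct sketch_md_layout_descriptor_header hdr;

        if (len - pos < sizeof(hdr))
            return -1;                  /* truncated header */
        memcpy(&hdr, buf + pos, sizeof(hdr));
        if (hdr.mldh_length < sizeof(hdr) || hdr.mldh_length > len - pos)
            return -1;                  /* length out of range */
        if (hdr.mldh_magic == wanted_magic)
            found++;
        pos += hdr.mldh_length;         /* skip without parsing the body */
    }
    return found;
}
```

The loop never inspects a descriptor body, which is the property the header is meant to provide: a layer can walk past layouts it has no parser for.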

Glossary

ADIO
Abstract-device interface for I/O. The ADIO driver implements an abstract-device interface for parallel I/O that is used by the MPI to implement its I/O library.
CMD
Cluster metadata
EA
Extended attribute
llite
Lustre client system
LOV
Logical object volume
MDS
Metadata server
MGS
Management server
MPI
Message Passing Interface
OSC
Object storage client
OST
Object storage target
