Architecture - MDS striping format

Striping Description
In a Lustre file system, metadata describing where data is stored on object storage targets (OSTs) is kept in extended attributes (EAs) on the metadata server (MDS). This information, called the “striping EA”, is described in detail below, along with a set of APIs provided with Lustre that allow modules and applications to manipulate the striping EA.

This architecture page is available as a .pdf file at this link: [[Media:ManagingLustreDataStriping_v2.pdf|Managing Lustre Data Striping]]

Striping Extended Attributes
In a Lustre file system, metadata and data are stored separately: metadata on the metadata server (MDS) and data on the object storage targets (OSTs). When accessing a file, the client obtains data location information from the MDS. The location information indicates how the file is striped across the OSTs. Since this information is stored in the extended attributes of each inode on the MDS, it is called the “striping EA.” The status of the striping EA may be in-disk, in-memory (kernel mode inside Lustre), or in-application (striping EA in a user-level application). Each status corresponds to a different format.

Lustre provides a set of APIs for other modules or applications to use to manipulate a striping EA. Below are a few examples showing how the striping EA is used by other Lustre modules.

Quality Attribute Scenarios

 * create-file
 * unlink-file
 * lfs-setstripe
 * MPI-LIB
 * copy-file

Striping Format
The striping EA status designates three striping EA formats:


 * In-disk format (lov_mds_md) – Used when the striping EA is stored on disk.
 * In-memory format (lov_stripe_md) – Used when the striping EA has been read from disk and unpacked.
 * User format (lov_user_md) – Used when the striping EA is retrieved by an application and is ready to be presented to the end user.

Independent of the format, all striping EAs consist primarily of two parts:


 * Public – Applies to all the OSTs on which the file is located. Indicates how the file is striped over the OSTs.
 * Private – An array in which each array item corresponds to one OST. Each array item specifies the OST index and data object ID within it.

When mapping a file offset to the corresponding offset within an OST object, Lustre computes the OST array index from the file offset, stripe size and stripe count. It then looks up that entry in the private OST array to obtain the OST index and object ID.
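The mapping described above can be illustrated with a minimal sketch. It assumes a plain round-robin (RAID0-style) layout; struct stripe_layout and both helper functions are hypothetical names introduced here for illustration and are not part of Lustre.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the public part of a striping EA. */
struct stripe_layout {
        uint32_t stripe_size;   /* bytes per stripe */
        uint32_t stripe_count;  /* number of OST objects backing the file */
};

/* Index into the private OST array for a given file offset. */
static uint32_t stripe_index(const struct stripe_layout *l, uint64_t file_off)
{
        return (uint32_t)((file_off / l->stripe_size) % l->stripe_count);
}

/* Offset inside the OST object that holds file_off. */
static uint64_t object_offset(const struct stripe_layout *l, uint64_t file_off)
{
        /* One "width" is a full pass over all stripes of the file. */
        uint64_t width = (uint64_t)l->stripe_size * l->stripe_count;

        return (file_off / width) * l->stripe_size + file_off % l->stripe_size;
}
```

The value returned by stripe_index() would then select the entry in the private OST array that supplies the OST index and object ID.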

Striping Disk format
Two striping disk formats are available: normal striping format for a normal file and joined striping format for a joined file.

Normal Striping EA formats
The two parts of the normal striping EA, lov_mds_md (public) and lov_ost_data (OST private), are described below.

struct lov_mds_md {
        /* LOV_MDS_MD */
        __u32 lmm_magic;
        __u32 lmm_pattern;
        __u64 lmm_object_id;
        __u64 lmm_object_gr;
        __u32 lmm_stripe_size;
        __u32 lmm_stripe_count;
        /* LOV_OST_DATA */
        struct lov_ost_data lmm_objects[0];
};

LOV_OST_DATA
struct lov_ost_data_v1 {
        __u64 l_object_id;
        __u64 l_object_gr;
        __u32 l_ost_gen;
        __u32 l_ost_idx;
};
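Because the private part is a trailing array, the size of an in-disk striping EA grows with the stripe count. The sketch below shows one way to compute that size. The struct names mds_md and ost_data are local mirrors of the structures above, defined here with standard fixed-width types so the example compiles outside the kernel, and mds_md_size() is a hypothetical helper, not a Lustre function.

```c
#include <stddef.h>
#include <stdint.h>

struct ost_data {               /* mirrors lov_ost_data_v1 */
        uint64_t l_object_id;
        uint64_t l_object_gr;
        uint32_t l_ost_gen;
        uint32_t l_ost_idx;
};

struct mds_md {                 /* mirrors lov_mds_md */
        uint32_t lmm_magic;
        uint32_t lmm_pattern;
        uint64_t lmm_object_id;
        uint64_t lmm_object_gr;
        uint32_t lmm_stripe_size;
        uint32_t lmm_stripe_count;
        struct ost_data lmm_objects[];  /* C99 flexible array member */
};

/* Bytes needed for the public header plus one private entry per stripe. */
static size_t mds_md_size(uint32_t stripe_count)
{
        return sizeof(struct mds_md) +
               (size_t)stripe_count * sizeof(struct ost_data);
}
```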

Joined Striping EA format
A joined file is made up of several normal files, each with its own extent and corresponding striping EA.

Joined File Stripe Format
For a joined file, the striping disk formats include:
 * Joined striping information (LOV_MDS_JOINED_MD).
 * Striping extent information (MDS_EXTENT_DESCRIPTION). This information is stored in the log file for which the llog_logid is defined in the joined striping EA.

struct lov_mds_md_join {
        /* LOV_MDS_JOINED_MD */
        struct lov_mds_md lmmj_md;
        /* MDS_EXTENT_DESCRIPTION */
        struct llog_logid lmmj_array_id;
        __u32 lmmj_extent_count;
};

MDS_EXTENT_DESCRIPTION
For each joined file, extent striping information is stored in a log file, which is referred to by llog_logid.

struct llog_logid {
        __u64 lgl_oid;
        __u64 lgl_ogr;
        __u32 lgl_ogen;
};

JOINED_LOG_ID

JOINED File LOG Formats
The joined log file is composed of joined log records. Each joined record includes a log header, a joined_record and a log tail.

struct mds_extent_desc {
        __u64 med_start;
        __u64 med_len;
        struct lov_mds_md med_lmm;
};

struct llog_rec_hdr {
        __u32 lrh_len;
        __u32 lrh_index;
        __u32 lrh_type;
        __u32 padding;
};

struct llog_rec_tail {
        __u32 lrt_len;
        __u32 lrt_index;
};

struct llog_array_rec {
        struct llog_rec_hdr lmr_hdr;
        struct mds_extent_desc lmr_med;
        struct llog_rec_tail lmr_tail;
};
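To show how the extent descriptions might be used, the following sketch looks up the extent that covers a given file offset. It assumes that med_start is the starting file offset of an extent and med_len its length in bytes, which the structure names suggest but the text does not state. The struct extent_desc and find_extent() names are hypothetical, and the embedded striping EA (med_lmm) is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

struct extent_desc {            /* mirrors med_start/med_len of mds_extent_desc */
        uint64_t med_start;     /* assumed: first file offset covered */
        uint64_t med_len;       /* assumed: number of bytes covered */
};

/* Return the index of the extent containing file_off, or -1 if none does. */
static int find_extent(const struct extent_desc *ext, size_t count,
                       uint64_t file_off)
{
        for (size_t i = 0; i < count; i++) {
                if (file_off >= ext[i].med_start &&
                    file_off - ext[i].med_start < ext[i].med_len)
                        return (int)i;
        }
        return -1;
}
```

Once the extent is found, the striping EA stored with it would be used to locate the data objects in the same way as for a normal file.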

Striping memory format
In-memory striping MD also includes general striping information and private information for each OST.

struct lov_oinfo {
        __u64 loi_id;
        __u64 loi_gr;
        int loi_ost_idx;
        int loi_ost_gen;

        /* used by the osc to keep track of what objects to build into rpcs */
        struct loi_oap_pages loi_read_lop;
        struct loi_oap_pages loi_write_lop;
        /* _cli_ is poorly named, it should be _ready_ */
        struct list_head loi_cli_item;
        struct list_head loi_write_item;
        struct list_head loi_read_item;

        unsigned loi_kms_valid:1;
        __u64 loi_kms;
        struct ost_lvb loi_lvb;
        struct osc_async_rc loi_ar;
};

struct lov_stripe_md {
        /* General striping information */
        spinlock_t lsm_lock;
        void *lsm_lock_owner;

        struct {
                __u64 lw_object_id;
                __u64 lw_object_gr;
                __u64 lw_maxbytes;

                __u32 lw_magic;
                __u32 lw_stripe_size;
                __u32 lw_pattern;
                unsigned lw_stripe_count;
        } lsm_wire;

        /* Private OST array */
        struct lov_array_info *lsm_array;
        struct lov_oinfo *lsm_oinfo[0];
};

Striping user format
The striping user format is used when the striping EA is retrieved by a user-level application (for example, with lfs getstripe/setstripe).

struct lov_user_ost_data_v1 {
        __u64 l_object_id;
        __u64 l_object_gr;
        __u32 l_ost_gen;
        __u32 l_ost_idx;
};

struct lov_user_md {
        __u32 lmm_magic;
        __u32 lmm_pattern;
        __u64 lmm_object_id;
        __u64 lmm_object_gr;
        __u32 lmm_stripe_size;
        __u16 lmm_stripe_count;
        __u16 lmm_stripe_offset;
        struct lov_user_ost_data_v1 lmm_objects[0];
};

The user format differs in the following ways from the in-disk format:
 * The user format has an lmm_stripe_offset field, which the in-disk format does not have. lmm_stripe_offset is used by setstripe to pass the striping_index parameter to Lustre when setting a stripe.
 * In the user format, lmm_stripe_count has only 16 bits, while in the in-disk format lmm_stripe_count has 32 bits. So in the current Lustre release, the maximum stripe count is 65532.
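The width difference means a 32-bit in-disk stripe count has to be checked before it is copied into the 16-bit user field. A minimal sketch of such a check follows. Both narrow_stripe_count() and SKETCH_MAX_STRIPE_COUNT are hypothetical names, and the limit simply reuses the figure quoted above.

```c
#include <stdint.h>

#define SKETCH_MAX_STRIPE_COUNT 65532u  /* limit quoted in the text above */

/*
 * Copy a 32-bit in-disk stripe count into a 16-bit user-format field.
 * Returns 0 on success, or -1 if the count cannot be represented.
 */
static int narrow_stripe_count(uint32_t disk_count, uint16_t *user_count)
{
        if (disk_count > SKETCH_MAX_STRIPE_COUNT)
                return -1;
        *user_count = (uint16_t)disk_count;
        return 0;
}
```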

Striping API
Lustre provides a set of APIs to handle striping EAs. The five types of APIs are listed below according to their functionality:
 * Set/get APIs. Used to set or get a striping EA to or from storage.
 * Pack/unpack APIs. Because striping EAs are stored in packed format on disk, pack/unpack APIs are provided to pack a striping EA before it is set and to unpack it after it is retrieved.
 * Allocate/free APIs. Used to allocate and free striping EAs in memory.
 * Striping location APIs. Since location information for data objects is stored in striping EAs, APIs are provided to access the striping EAs and return data object location information. These APIs are also used to select the OST where the data object is to be created.
 * lfs APIs. User-level APIs used by applications (lfs utilities) to handle striping EAs.

The set/get APIs operate on striping EAs in in-disk format. The pack/unpack APIs operate on striping EAs in both in-disk and in-memory formats. The other APIs operate on striping EAs in in-memory format.

fsfilt_set/get_md

 * int fsfilt_set_md(struct obd_device *obd, struct inode *inode, void *handle, void *md, int size, const char *name)
 * int fsfilt_get_md(struct obd_device *obd, struct inode *inode, void *md, int size, const char *name)


 * Parameters
 * obd: Device of the object.
 * inode: MDS object.
 * handle: Journal handle for setting striping EA.
 * md: Buffer of the striping EA.
 * size: Size of the striping EA.
 * name: Name (LOV) of the striping EA


 * Return
 * fsfilt_set_md: 0 means success. A negative error number means an error.
 * fsfilt_get_md: 0 means success. A positive return value is the number of bytes that need to be added to the buffer to make it large enough to contain the striping EA. A negative error number means an error. Note: If the striping EA does not exist, get_md still returns 0.

 * Description
 * These two APIs are used by the MDS to get/set a striping EA.

obd_packmd

 * int obd_packmd(struct obd_export *exp, struct lov_mds_md **disk_tgt, struct lov_stripe_md *mem_src)


 * Parameters
 * exp: Export of the device.
 * disk_tgt: Disk structure for the striping EA.
 * mem_src: In-memory structure for the striping EA.


 * Return
 * If disk_tgt is NULL, the size of the striping EA for the in-memory structure mem_src is returned.
 * If both disk_tgt and mem_src are NULL, the maximum possible striping EA size is returned.
 * If disk_tgt is not NULL and mem_src is NULL, *disk_tgt is freed.
 * If *disk_tgt is NULL, an in-disk structure is allocated.


 * Description
 * This API packs the striping EA from in-memory format to an in-disk description.
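The calling convention listed under Return can be illustrated with a small user-space sketch. Everything in it is hypothetical: struct mem_md and struct disk_md are heavily simplified stand-ins for lov_stripe_md and lov_mds_md, sketch_packmd() is not the Lustre function, SKETCH_MAX_STRIPES is an arbitrary limit, and returning the packed size on success is an assumption. Byte-order conversion, which an on-disk format would normally need, is left out.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_STRIPES 160u /* arbitrary limit for this sketch */

struct mem_md  { uint32_t stripe_size; uint32_t stripe_count; uint32_t ost_idx[]; };
struct disk_md { uint32_t stripe_size; uint32_t stripe_count; uint32_t ost_idx[]; };

static int disk_md_size(uint32_t stripes)
{
        return (int)(sizeof(struct disk_md) + stripes * sizeof(uint32_t));
}

static int sketch_packmd(struct disk_md **disk_tgt, const struct mem_md *mem_src)
{
        uint32_t stripes = mem_src ? mem_src->stripe_count : SKETCH_MAX_STRIPES;
        int size = disk_md_size(stripes);

        if (disk_tgt == NULL)           /* size query only */
                return size;

        if (mem_src == NULL) {          /* free request */
                free(*disk_tgt);
                *disk_tgt = NULL;
                return 0;
        }

        if (*disk_tgt == NULL) {        /* allocate on demand */
                *disk_tgt = malloc((size_t)size);
                if (*disk_tgt == NULL)
                        return -1;
        }

        /* Copy the public part, then one private entry per stripe. */
        (*disk_tgt)->stripe_size = mem_src->stripe_size;
        (*disk_tgt)->stripe_count = mem_src->stripe_count;
        for (uint32_t i = 0; i < stripes; i++)
                (*disk_tgt)->ost_idx[i] = mem_src->ost_idx[i];
        return size;
}
```

A caller would typically pass a pointer to a NULL disk_md pointer to have the buffer allocated, and later pass a NULL source to release it.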

obd_unpackmd

 * int obd_unpackmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt, struct lov_mds_md *disk_src, int disk_len)


 * Parameters
 * exp: Export of the device.
 * mem_tgt: In-memory structure for the striping EA.
 * disk_src: Disk structure for the striping EA.
 * disk_len: Length of disk_src.


 * Return
 * Positive value indicates the size of the unpacked striping EA.
 * 0 is returned when the API tries to free the disk_src.
 * Negative value indicates an error.


 * Description
 * This API unpacks the striping EA from an in-disk format (disk_src) to an in-memory description (mem_tgt). When mem_tgt is NULL, the API will free disk_src.

obd_size_diskmd

 * int obd_size_diskmd(struct obd_export *exp, struct lov_stripe_md *mem_src)


 * Parameters
 * exp: Export of the device.
 * mem_src: In-memory structure for the striping EA.


 * Return
 * If mem_src is not NULL, the size of the striping EA described by mem_src is returned.
 * If mem_src is NULL, the maximum striping EA size is returned.


 * Description
 * This API returns the real size of the striping EA.

obd_alloc_diskmd

 * int obd_alloc_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)


 * Parameters
 * exp: Export of the device.
 * disk_tgt: Allocated in-disk-formatted striping EA.


 * Return
 * 0 means success. A negative number means an error.


 * Description
 * This API allocates an in-disk-formatted striping EA and returns it through disk_tgt. It allocates the maximum striping EA size, which typically equals the maximum data object count of the file * size of struct lov_ost_data.

obd_free_diskmd

 * int obd_free_diskmd(struct obd_export *exp, struct lov_mds_md **disk_tgt)


 * Parameters
 * exp: Export of the device.
 * disk_tgt: Allocated in-disk-formatted striping EA.


 * Return
 * 0 means success. A negative number means an error.


 * Description
 * This API frees the in-disk-formatted striping EA referenced by *disk_tgt.

obd_alloc_memmd

 * int obd_alloc_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)


 * Parameters
 * exp: Export of the device.
 * mem_tgt: Allocated in-memory-formatted striping EA.


 * Return
 * 0 means success. A negative number means an error.


 * Description
 * This API allocates an in-memory-formatted striping EA and returns it through mem_tgt. It allocates the maximum striping EA size.

obd_free_memmd

 * int obd_free_memmd(struct obd_export *exp, struct lov_stripe_md **mem_tgt)


 * Parameters
 * exp: Export of the device.
 * mem_tgt: In-memory-formatted striping EA memory to be freed.


 * Return
 * 0 means success. A negative number means an error.


 * Description
 * This API frees the in-memory-formatted striping EA referenced by *mem_tgt.

lov_stripe_size
obd_size lov_stripe_size(struct lov_stripe_md *lsm, obd_size ost_size, int stripeno)


 * Parameters
 * lsm: In-memory striping EA.
 * ost_size: Size of a single data object in an OST.
 * stripeno: Stripe number of the data object.


 * Return
 * The file size computed from ost_size and stripeno.


 * Description
 * This API computes the file size given stripeno and the OST size, where stripeno and the OST size are associated with the OST where the end of the file is located.
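As an illustration of this computation, the sketch below derives a file size from the size of the OST object that holds the end of the file. It assumes a plain round-robin (RAID0-style) layout and takes the stripe size and stripe count as explicit parameters instead of reading them from an in-memory striping EA. The function name file_size_from_object() is hypothetical.

```c
#include <stdint.h>

/*
 * ost_size: size of the object on the stripe that holds the end of the file.
 * stripeno: index of that stripe within the file's layout.
 */
static uint64_t file_size_from_object(uint32_t stripe_size, uint32_t stripe_count,
                                      uint64_t ost_size, uint32_t stripeno)
{
        uint64_t width, full, rem;

        if (ost_size == 0)
                return 0;

        width = (uint64_t)stripe_size * stripe_count;  /* one pass over all stripes */
        full = ost_size / stripe_size;  /* complete stripe units in the object */
        rem = ost_size % stripe_size;   /* bytes in a trailing partial unit */

        if (rem != 0)
                return full * width + (uint64_t)stripeno * stripe_size + rem;

        /* The object ends exactly on a stripe-unit boundary. */
        return (full - 1) * width + ((uint64_t)stripeno + 1) * stripe_size;
}
```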

lov_stripe_offset
int lov_stripe_offset(struct lov_stripe_md *lsm, obd_off lov_off, int stripeno, obd_off *obd_off)


 * Parameters
 * lsm: In-memory striping EA.
 * lov_off: Logical file offset.
 * stripeno: Stripe number of the data object.
 * obd_off: Offset within the OST object indicated by stripeno that is nearest to the logical file offset (lov_off).


 * Return
 * 0 means the OST indicated by stripeno is exactly the same OST as the offset (lov_off) indicated.
 * -1 means the index of the OST indicated by stripeno is less than the index of the OST indicated by the offset (lov_off).
 * 1 means the index of the OST indicated by stripeno is larger than the index of the OST indicated by the offset (lov_off).

 * Description
 * This API is used to check whether an extent intersects with an OST.

lov_stripe_number
int lov_stripe_number(struct lov_stripe_md *lsm, obd_off lov_off)


 * Parameters
 * lsm: In-memory striping EA.
 * lov_off: Logical file offset.


 * Return
 * The stripe number to which lov_off belongs.


 * Description
 * This API computes which stripe number lov_off belongs to.

llapi_file_get_stripe
int llapi_file_get_stripe(const char *path, struct lov_user_md *lum)


 * Parameters
 * path: Path of the file.
 * lum: Striping information returned to the caller.


 * Return
 * 0 means success. A negative number means an error.

 * Description
 * This API returns striping information to the caller to be used by the application.

llapi_file_open
int llapi_file_open(const char *name, int flags, int mode, unsigned long stripe_size, int stripe_offset, int stripe_count, int stripe_pattern)


 * Parameters
 * name: File name
 * flags: Open flags
 * mode: Open mode
 * stripe_size: Stripe size of the file
 * stripe_offset: Stripe offset(stripe_index) of the file
 * stripe_count: Stripe count of the file
 * stripe_pattern: Stripe pattern of the file


 * Return
 * A non-negative file descriptor means success. A negative number means an error.


 * Description
 * This API opens/creates a file with specified striping parameters.

Future developments
With the currently implemented striping disk format, ->obd_unpackmd must have an end-to-end understanding of all possible combinations of layouts, i.e., the format is basically flat rather than hierarchical.

To facilitate development of new layouts, the striping disk format will be adjusted so that higher layers (e.g., struct lov_mds_md) can be parsed without knowing the details of the lower layer (in this case, struct lov_ost_data) representation.

A straightforward way to do this is to precede each layout descriptor with the standard header:

struct md_layout_descriptor_header {
        __u16 mldh_magic;
        __u16 mldh_length;
};

where ->mldh_magic identifies the layout type and is used to determine the ->obd_unpackmd method to be called to parse the descriptor; and ->mldh_length is the total descriptor length, which is used by the upper layer to pass over lower layer descriptors without understanding details of their representation.
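The benefit of the length field can be shown with a short sketch that steps over descriptors without parsing their bodies. It assumes that descriptors are stored back to back in one buffer, that mldh_length counts the header itself (consistent with "total descriptor length"), and that fields are in host byte order. None of this is specified by the proposal. The count_descriptors() function is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct md_layout_descriptor_header {
        uint16_t mldh_magic;
        uint16_t mldh_length;   /* assumed to include this header */
};

/* Count the descriptors packed in buf; return -1 if the buffer is malformed. */
static int count_descriptors(const unsigned char *buf, size_t len)
{
        int n = 0;
        size_t pos = 0;

        while (pos < len) {
                struct md_layout_descriptor_header h;

                if (len - pos < sizeof(h))
                        return -1;
                memcpy(&h, buf + pos, sizeof(h));   /* avoid unaligned access */
                if (h.mldh_length < sizeof(h) || h.mldh_length > len - pos)
                        return -1;
                pos += h.mldh_length;   /* skip the body without parsing it */
                n++;
        }
        return n;
}
```

A real parser would dispatch on mldh_magic at each step instead of only counting, but the skipping logic would stay the same.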

Care must be taken, however, to avoid introducing too much redundant information to the on-disk EA for the most common uses.

Glossary

 * ADIO: Abstract-device interface for I/O. The ADIO layer is an abstract-device interface for parallel I/O that is used by MPI to implement its I/O library.
 * CMD: Clustered metadata
 * EA: Extended attribute
 * llite: Lustre client system
 * LOV: Logical object volume
 * MDS: Metadata server
 * MGS: Management server
 * MPI: Message Passing Interface
 * OSC: Object storage client
 * OST: Object storage target