
Architecture - Wire Level Protocol


Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Introduction

This chapter gives an overview of the wire formats used by Lustre. Lustre embeds its messages in Portals packets. Lustre employs a Lustre message structure to encapsulate requests and replies; it is very efficient to pack and unpack and shares many properties of the DAFS wire format. The Lustre systems embed their requests and replies in the Lustre messages as buffers. A message frequently contains multiple buffers, such as a request header and a pathname, or includes multiple requests, as in the intent locking case. Every Lustre message is encapsulated within a Portals message header, as illustrated below; a Lustre message body may contain multiple message buffers.

Portals message header
Lustre message header
Lustre message body (message buffers)

Message structure

In this chapter we describe the structure of the various message headers and the formats of the message buffers that are sent and received in Lustre for the different types of operations.

Portals message headers

As previously discussed, all Lustre network traffic is handled by Portals. Portals introduces its own headers. The vast majority of Lustre packets are put on the wire with PtlPut and PtlGet, and are simple in structure. The Portals packet types used are PtlPut, PtlGet, ACK and reply packets. Lustre sends requests, replies, and bulk reads as PtlPut packets; these translate into PTL_MSG_PUT and PTL_MSG_ACK packets. Bulk write messages are sent as PtlGet packets; these translate into PTL_MSG_GET and PTL_MSG_REPLY packets.

Each Lustre packet is wrapped in a Portals header. The Portals header can be visualized as consisting of two parts: one portion common to all packets and a second part that depends on the type of packet (PtlGet, PtlPut, PtlAck, PtlReply). The first part has a fixed size; the size of the second part depends on the type of packet and is stored within the structure for the various packet types.

The fields in the common Portals header have the Lustre semantics illustrated in table 1.

Some Lustre clusters use routing, but this shall not affect the content of any Portals header packet that is routed.

Bytes | Description (ptl_hdr_t) | Lustre semantics, incoming packets | Outgoing packet semantics
8 | Destination nid | This equals the Lustre nid of the receiver. The nid can be an IP address or an Elan node-id. | This field is set to the final destination Lustre node id: IP address or Elan id. When a reply packet is sent, this field is set to the nid of the request packet that was received in connection with this reply.
8 | Source nid | Source Lustre network id; when the packet is routed, this does not necessarily equal the network address of the node sending the packet. | This field is set to the Lustre node-id from which the packet originates.
4 | Destination pid | 0 for Lustre | 0 for Lustre
4 | Source pid | 0 for Lustre | 0 for Lustre
4 | Message type | PTL_MSG_PUT, PTL_MSG_ACK, PTL_MSG_GET or PTL_MSG_REPLY | PTL_MSG_PUT, PTL_MSG_ACK, PTL_MSG_GET or PTL_MSG_REPLY
size depends on the message type | msg | Depends on the message type. | Depends on the message type.
size depends on the message type | msg_filler | The ptl_hdr_t structure size is 72 bytes; after the initial common header and the message-specific structure, the rest of the structure is padded out to 72 bytes. |

TABLE 1. Portals Header Structure


Notes:

  1. A Portals version field needs to be included in this header.
  2. All 32 bit integers will be little endian. 16 byte integers have the most significant bytes preceding the least significant.
  3. The nid will be the Lustre network id. In a routed network this may not coincide with the origin of the packet.
  4. ACK packets, which are generated when a PtlPut arrives with a nonzero ACK MD, are similar. But the fields following the ACK MD token are not used and the header is padded out to 88 bytes, and any padding must be ignored by the receiver.


The msg field is determined by the message type; each message type has a different structure. Table 2 illustrates the header for a PtlPut message.

Bytes | Description (ptl_put) | Lustre semantics, incoming packets | Outgoing packet semantics
4 | Portal index | See section 4.1 | See section 4.1
16 | ACK MD index | The sending node of a PtlPut packet for which an ACK is requested includes this as a cookie for the packet for which an ACK is to be sent. On incoming ACK packets, this field is used to generate an event for the corresponding packet (in Portals lingo, for its memory descriptor). When set on incoming PtlPut packets this field is copied into outgoing ACK packets, but except for its presence this field is not interpreted by the receiver of a PtlPut packet. | Memory descriptor used for the ACK event. This field is set for PtlPut packets for Lustre bulk messages to indicate that an ACK packet is requested. On outgoing ACK packets this equals the MD handle on the associated incoming PtlPut packet. Unless the first 64 bits of this field are set to ~0 an ACK will be generated.
8 | Match bits | In an incoming Lustre request PtlPut message, this is set to an integer that the recipient must include in (i) Lustre reply PtlPut packets for that request, and (ii) incoming or outgoing Lustre bulk PtlPut packets. On incoming replies this field must equal the xid of the request to which the packet belongs. For incoming, i.e. sink side, bulk packets, this field must equal the xids the sink sent to the source in the bulk handshake. | On outgoing request packets this field must be set to a unique integer. On outgoing reply packets this field must be set to the xid of the transaction associated with the request. For bulk packets, the source must set this field to match the xids sent by the sink during the preparatory bulk handshake.
4 | Length | Packet body length, not including the Portals header. | Packet body length, not including the Portals header.
4 | Offset | Sender managed offset: 0 in Lustre. | Sender managed offset: 0 in Lustre.
8 | Header data | Reserved for Lustre use. | Reserved for Lustre use.

TABLE 2. Portals PTL_MSG_PUT message Structure

Here, the xid is a request identifier that in normal use of the system increases by 1 for every request sent from a particular node to a service. During recovery, requests may be re-transmitted, hence the xid is not totally unique.
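As an illustration of how the common header (table 1) and the PtlPut-specific part (table 2) fit together, the following C sketch lays out the fields with the sizes given above. The names are illustrative and the real ptl_hdr_t is defined by Portals; on the wire the fields are packed little endian, whereas a compiler would normally insert alignment padding into a struct written this way.

#include <stdint.h>

/* Sketch only: field names are illustrative, not the Portals definitions. */
struct ptl_handle_wire_sketch {     /* 16-byte memory-descriptor cookie */
    uint64_t wh_interface_cookie;
    uint64_t wh_object_cookie;
};

struct ptl_put_hdr_sketch {         /* message-type specific part (Table 2), 44 bytes on the wire */
    uint32_t ptl_index;             /* destination portal index                   */
    struct ptl_handle_wire_sketch ack_md;  /* ~0 in the first 64 bits => no ACK   */
    uint64_t match_bits;            /* xid of the request / bulk handshake        */
    uint32_t length;                /* body length, excluding the Portals header  */
    uint32_t offset;                /* sender-managed offset, 0 in Lustre         */
    uint64_t hdr_data;              /* reserved for Lustre use                    */
};

struct ptl_hdr_sketch {             /* common part (Table 1), 28 bytes on the wire */
    uint64_t dest_nid;              /* Lustre nid of the receiver                 */
    uint64_t src_nid;               /* Lustre nid of the originator               */
    uint32_t dest_pid;              /* 0 for Lustre                               */
    uint32_t src_pid;               /* 0 for Lustre                               */
    uint32_t type;                  /* PTL_MSG_PUT/GET/ACK/REPLY                  */
    struct ptl_put_hdr_sketch put;  /* one arm of the per-type union              */
    /* common 28 bytes + PtlPut-specific 44 bytes = the fixed 72-byte header      */
};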

Table 3 shows the structure of a PtlGet message, table 4 the structure of an ACK packet, and table 5 the structure of a PtlReply packet.

Bytes | Description (ptl_get) | Lustre semantics, incoming packets | Outgoing packet semantics
4 | Portal index | The Portals index on the OSC under which the write source buffers have been posted. |
16 | Return MD index | The OST provides a handle to the write target buffers to which the PTL_MSG_REPLY should be sent. | The handle the OSC will use to send the PTL_MSG_REPLY.
8 | Match bits | In an incoming Lustre request PtlGet message, this is set to an integer that the recipient must include in (i) Lustre reply PtlGet packets for that request, and (ii) incoming or outgoing Lustre bulk PtlGet packets. On incoming replies this field must equal the xid of the request to which the packet belongs. For incoming, i.e. sink side, bulk packets, this field must equal the xids the sink sent to the source in the bulk handshake. | On outgoing request packets this field must be set to a unique integer. On outgoing reply packets this field must be set to the xid of the transaction associated with the request. For bulk packets, the source must set this field to match the xids sent by the sink during the preparatory bulk handshake.
4 | Length | The Portals agent that receives the get message uses the length to find a local match entry (list of memory regions) that will provide the source of the bulk flow. The length of the source buffer is specified here. | The length of the data region from which the get originated.
4 | Source offset | Offset within the memory region obtained from the match bits. | Sender managed offset: 0 in Lustre.
4 | Return offset | This offset is set by the OST and simply copied by the OSC. | The offset within the return MD to which the data should be sent by the source.
4 | Sink length | Length of the server buffer where the writes will go; set by the server. | Sink length, used to verify that the matching packet is of the correct length and not larger.

TABLE 3. Portals PTL_MSG_GET message Structure

For more information about the precise meaning of these headers, see the Portals specification [1].

Lustre Messages: RPCs

Lustre uses Portals to send Lustre messages across the net. All request and reply messages are packaged as Lustre messages and embedded within the Portals header. Lustre messages have fields in the body that describe the contents of the message and assist in recovery; bulk packets are sent as raw Portals packets. Tables 4 and 5 below show the Portals ACK and REPLY headers.

Bytes Description (ptl_ack) Lustre Semantics incoming packets Outgoing packet semantics
4 m_length
16 DST MD index This is the receiver's MD as copied from the PtlPut packet This is the destination MD copied from the corresponding PtlPut packet
8 Match bits
4 Length Packet body length, not including the Portals header. Packet body length, not including the Portals header.

TABLE 4. Portals PTL_MSG_ACK message Structure

Bytes Description (ptl_reply) Lustre Semantics incoming packets Outgoing packet semantics
4 unused
16 DST MD index This is the receiver's MD as copied from the PtlPut packet This is the destination MD copied from the corresponding PtlPut packet
4 dst_offset
4 unused
4 Length Packet body length, not including the Portals header. Packet body length, not including the Portals header.

TABLE 5. Portals PTL_MSG_REPLY message Structure


Lustre requests and replies (but not bulk data) fill the “packet data” field with at least a lustre_msg structure, explained in table 6. Some requests/replies may have additional buffers; this is indicated by the buffer count field in the lustre_msg structure. Every Lustre message requires a corresponding reply.

Each buffer and the message itself is padded to an 8-byte boundary, but this padding is not included in the buffer lengths.

The structure of the data in the buffers of the lustre_msg structure depends on the type of operation. In the following sections we describe the buffer data formats for all the possible Lustre request/reply types.

The structure described above has a field to contain the last received transaction number; this might not be required in every reply. The last committed transaction number is required for recovery purposes. We might also need to add the generation number of the connection to every outgoing request, to enable the target to IO-fence old requests. It might also be useful to add protocol version information to every outgoing request packet. In the future, we might include a subsystem field similar to that in SUN RPC, and an authenticator.

Bytes | Name (struct lustre_msg) | Use on incoming packets | Use on outgoing packets
8 | Export/import handle cookie | The first 8 bytes provide a Lustre handle to the export or import data associated with the message. It is used by the receiver to locate the object. The export data handle is used by services for incoming request packets; the import handle is used by clients for incoming ASTs. The handle is a copy of the handle exchanged with the peer during the subsystem connection handshake. | On outgoing request packets, the handle of the target service is included; on outgoing ASTs, the import handle of the client is included.
4 | Magic | Magic constant 0x0BD00BD0. | Magic constant 0x0BD00BD0.
4 | Type | PTL_RPC_MSG_REQUEST, PTL_RPC_MSG_REPLY or PTL_RPC_MSG_ERR | PTL_RPC_MSG_REQUEST, PTL_RPC_MSG_REPLY or PTL_RPC_MSG_ERR
4 | Lustre msg version and protocol version: version | Current value 0x00040001 in little endian. Most significant 16 bits: Lustre msg protocol version; least significant 16 bits: subsystem protocol version. This is checked by the service for request packets against the available protocols offered by the receiver. Not used by clients. | This field is set by the client to indicate the versions used.
4 | Protocol and opcode: opc | Most significant 16 bits: Lustre subsystem protocol number; least significant 16 bits: opcode for the request in that protocol. Used by the service to locate the request handler. | Set by the client to indicate which request is being sent.
8 | Last received counter | In replies: last transaction no for MDS/OST. |
8 | Last committed counter | In replies: last committed transaction. |
8 | Transaction number | In replies: transaction no for the request. |
4 | Status | Return value of handler. |
4 | Buffer count: bufcount | How many buffers are included. |
4 | flag | Operation-specific flags use the top 16 bits (e.g. MSG_CONNECT_RECONNECT); common flags use the bottom 16 bits (e.g. MSG_LAST_REPLAY). |
"buffer count" * 4 | Buffer lengths: buflens[] | The length of each of the buffers. |
total of "buffer lengths" | Message data | |

TABLE 6. Lustre Message Structure
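To make the layout concrete, the following is a rough C sketch of the message header described in table 6, together with the 8-byte rounding used to locate each buffer in the message body. It is an illustration only, not the actual Lustre source; the field and helper names are invented here.

#include <stdint.h>
#include <stddef.h>

/* Sketch of the Lustre message header from Table 6 (names illustrative).
 * Buffers follow the header, each padded to an 8-byte boundary; the
 * padding is not counted in buflens. */
struct lustre_msg_sketch {
    uint64_t handle;          /* export/import handle cookie           */
    uint32_t magic;           /* 0x0BD00BD0                            */
    uint32_t type;            /* PTL_RPC_MSG_REQUEST/REPLY/ERR         */
    uint32_t version;         /* msg protocol + subsystem version      */
    uint32_t opc;             /* subsystem protocol + opcode           */
    uint64_t last_xid;        /* "last received" counter in replies    */
    uint64_t last_committed;  /* last committed transno in replies     */
    uint64_t transno;         /* transaction number of this request    */
    uint32_t status;          /* handler return value in replies       */
    uint32_t bufcount;        /* number of buffers that follow         */
    uint32_t flags;           /* op-specific (hi 16) / common (lo 16)  */
    uint32_t buflens[];       /* one length per buffer                 */
};

/* Round a buffer length up to the 8-byte boundary used on the wire. */
static inline size_t size_round(size_t len)
{
    return (len + 7) & ~(size_t)7;
}

/* Offset of buffer n inside the message (assumed layout). */
static size_t lustre_msg_buf_offset(const struct lustre_msg_sketch *m, uint32_t n)
{
    size_t off = size_round(offsetof(struct lustre_msg_sketch, buflens) +
                            m->bufcount * sizeof(uint32_t));
    for (uint32_t i = 0; i < n; i++)
        off += size_round(m->buflens[i]);
    return off;
}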

OSC - OST Network Protocol

In this section we will describe the buffer data formats for all the requests/replies exchanged between the object storage clients (OSC) and the targets (OSTs). There are eleven OST opcodes and each of them has different information in the buffers:

OST_CONNECT: This is sent as a Lustre message with three buffers. The first buffer contains only the UUID of the target to be connected to. On the server side this UUID is translated into the target device. The second buffer is a UUID identifying the connecting client. The target instantiates a unique export structure for the client, which is passed in on every further request. The third buffer holds the import handle, which is sent back to the client for lock callbacks.

OST_DISCONNECT: This is a Lustre message without any additional buffers. The server tears down the export specified in the lustre_msg structure.

OST_GETATTR, OST_SETATTR, OST_OPEN, OST_CLOSE, OST_CREATE, OST_DESTROY, OST_PUNCH: These OST requests have the structure shown in table 7 in a single message buffer (hereafter referred to as the OST body):

Bytes Description (struct ost_body)
OBD Object (OBDO)

TABLE 7. OST request structure


An OBD object, as illustrated in table 8, is similar to a cross-platform inode. It describes the attributes of a given object. The valid flag is used to indicate the fields that are relevant for a request. As an example, the OST_SETATTR will use the valid flag to indicate which attributes need to be written to the device.

OST_READ, OST_WRITE: These requests have a different structure. The first buffer in the Lustre message is a network IO object. This is followed by an array of remote niobufs in buffer 2. There is one IO object (see table 9) and one niobuf (see table 11) per file extent.

When an OST_READ request is received, the data is sent to Portals match entries equal to the xid given in the niobuf_remote structure. In case of reads, each niobuf includes a return code that indicates success/failure/errors on a per-page basis, as shown in table 10.

For an OST_WRITE request, buffers with such match bits are prepared by the client so that the server can get data from the buffers. The bulk data described in those structures is sent as a standard Portals packet, without any Lustre RPC header.

OST_STATFS: This function has one reply message buffer which contains a struct obd_statfs. The contents are shown in table 12.

At a minimum, the server should fill in the critical fields relating to the number of free/total file objects and blocks, and zero-fill the unused fields.

Bytes Description (struct obdo)
8 id
8 Group
8 atime
8 mtime
8 ctime
8 Size
8 Blocks
8 rdev
4 Block size
4 Mode
4 uid
4 gid
4 Flags
4 Link count
4 Generation
4 Valid
4 OBDflags
4 o_easize
60 o_inline

TABLE 8. OBD object
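The following C sketch shows the obdo layout from table 8. The field names are guesses in the style of the Lustre sources; the struct is an illustration, not the actual definition.

#include <stdint.h>

/* Sketch of the OBD object (obdo) from Table 8: a cross-platform,
 * inode-like attribute block. The o_valid bitmap says which fields
 * are meaningful for a given request. Names are illustrative. */
struct obdo_sketch {
    uint64_t o_id;          /* object id                    */
    uint64_t o_gr;          /* object group                 */
    uint64_t o_atime;
    uint64_t o_mtime;
    uint64_t o_ctime;
    uint64_t o_size;
    uint64_t o_blocks;
    uint64_t o_rdev;
    uint32_t o_blksize;
    uint32_t o_mode;
    uint32_t o_uid;
    uint32_t o_gid;
    uint32_t o_flags;
    uint32_t o_nlink;
    uint32_t o_generation;
    uint32_t o_valid;       /* which of the above are valid */
    uint32_t o_obdflags;
    uint32_t o_easize;      /* size of extended attributes  */
    char     o_inline[60];  /* small inline data area       */
};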

Bytes Description (struct obd_ioobj)
8 id
8 Group
4 Type
4 Buffer count

TABLE 9. IO object

Lustre DLM Network Protocol

The Lustre lock manager has 3 regular calls and 2 callbacks. The regular calls are sent on the same portal as the affected service; for example, meta-data lock requests are sent to the MDS portal. The callbacks are sent to the portal reserved for DLM RPCs. Every message to the lock manager contains at least a single buffer holding the ldlm_request structure (see table 13) or the ldlm_reply structure (see table 17).

Lustre lock request structure.

Any lock request in Lustre consists of at least the ldlm_request structure (see table 13).
Bytes Description (struct niobuf_local)
8 Offset
8 xid
4 Length
4 Flags
4 return code
addr
sizeof page struct Flags
target_private
sizeof dentry struct dentry

TABLE 10. Niobuf_local


Bytes Description (struct niobuf_remote)
8 Offset
8 xid
4 Length
4 Flags

TABLE 11. Niobuf_remote
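A C sketch of the two wire structures used by OST_READ/OST_WRITE, based on tables 9 and 11 (names illustrative):

#include <stdint.h>

/* One obd_ioobj per file object, followed by one niobuf_remote per
 * extent. The xid in each niobuf is used as the Portals match bits
 * for the corresponding bulk transfer. Sketch only. */
struct obd_ioobj_sketch {
    uint64_t ioo_id;        /* object id                          */
    uint64_t ioo_gr;        /* object group                       */
    uint32_t ioo_type;
    uint32_t ioo_bufcnt;    /* niobufs belonging to this object   */
};

struct niobuf_remote_sketch {
    uint64_t offset;        /* byte offset of the extent          */
    uint64_t xid;           /* match bits for the bulk transfer   */
    uint32_t len;           /* extent length                      */
    uint32_t flags;
};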


Bytes Field name Meaning
8 os_type Magic constant describing the type of OBD (not defined yet).
8 os_blocks Total number of blocks on OST.
8 os_bfree Free blocks.
8 os_bavail Available blocks (free minus reserved).
8 os_files Total number of objects.
8 os_ffree Number of unallocated objects.
40 os_fsid UUID of OST.
4 os_bsize Block size.
4 os_namelen Length of OST name.
48 os_spare Reserved.

TABLE 12. OBD Status structure


Bytes Name Description
4 lock_flags Flag filled by the server to indicate the status of the lock.
92 lock_desc Lock descriptor is filled with requested type, name, and extent.
8 lock_handle
8 lock_handle2

TABLE 13. The lock request structure


As shown in table 13, every lock request contains a lock descriptor structure, shown in table 16. This structure has several sub-components. It contains a struct ldlm_extent (see table 14) that describes the file extent covered by the lock.

Bytes Name Description
8 start Start of extent.
8 end End of the extent.

TABLE 14. Lock extent descriptor


Secondly, we have the resource descriptor, struct ldlm_resource_desc (see table 15), which is used to describe the resource for which a lock is requested. This is an unaligned structure; that is all right as long as it is used only in the ldlm_request structure.

Bytes Name Description
4 lr_type Resource type: one of LDLM_PLAIN, LDLM_INTENT, LDLM_EXTENT.
8*3 lr_name Resource name.
4*4 lr_version Version of the resource (not yet used).

TABLE 15. Lock resource descriptor


Bytes Name Description
44 l_resource Description of the resource for the lock (see table 15).
4 l_req_mode Requested lock mode, one of LCK_EX (=1), LCK_PW, LCK_PR, LCK_CW, LCK_CR, LCK_NL (=6). File I/O uses PR and PW locks.
4 l_granted_mode Lock mode that is granted on this lock.
16 l_extent Extent required for this lock (see table 14).
4*4 l_version Version of this lock.

TABLE 16. Lock descriptor
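Putting tables 13 through 16 together, a rough C sketch of the lock request follows. The names are illustrative; note that the component sizes listed in tables 14-16 add up to slightly less than the 92 bytes given for the descriptor in table 13, and a compiler would add alignment padding, so this shows the logical layout rather than the exact wire format.

#include <stdint.h>

struct ldlm_extent_sketch {             /* Table 14 */
    uint64_t start;
    uint64_t end;
};

struct ldlm_resource_desc_sketch {      /* Table 15 (unaligned on the wire) */
    uint32_t lr_type;                   /* LDLM_PLAIN / LDLM_INTENT / LDLM_EXTENT */
    uint64_t lr_name[3];                /* resource name                          */
    uint32_t lr_version[4];             /* not yet used                           */
};

struct ldlm_lock_desc_sketch {          /* Table 16 */
    struct ldlm_resource_desc_sketch l_resource;
    uint32_t l_req_mode;                /* LCK_EX=1 ... LCK_NL=6                  */
    uint32_t l_granted_mode;
    struct ldlm_extent_sketch l_extent;
    uint32_t l_version[4];
};

struct ldlm_request_sketch {            /* Table 13 */
    uint32_t lock_flags;                /* filled by the server                   */
    struct ldlm_lock_desc_sketch lock_desc;
    uint64_t lock_handle;
    uint64_t lock_handle2;
};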

Lustre lock reply structure.

The reply message contains a reply (see table 17).

Message structures for the various locking operations.

In the following sections we will describe the message structures for the various locking operations supported in Lustre.

LDLM_ENQUEUE.

This message is used to obtain a new lock. The Lustre message contains a single buffer with a struct ldlm_request.

LDLM_CANCEL.

This message cancels an existing lock. It places a struct ldlm_request in the Lustre message, but only uses the lock_handle1 part of the request (we will shrink this in the future). The reply contains just a lustre_msg.

Bytes Name Description
4 lock_flags Flags set during enqueue.
4 lock_mode The server may change the lock mode; if this quantity is non-zero, the client should update its lock structure accordingly.
8 * 3 lock_resource_name Resource actually given to the requester.
8 lock_handle Handle for the lock that was granted.
16 lock_extent Extent that was granted (will move to policy results).
8 lock_policy_res1 Field one for policy results.
8 lock_policy_res2 Field two for policy results.

TABLE 17. Reply for a lock request
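A C sketch of the lock reply from table 17 (names illustrative, not the actual source):

#include <stdint.h>

struct ldlm_reply_sketch {
    uint32_t lock_flags;             /* flags set during enqueue                  */
    uint32_t lock_mode;              /* non-zero: server changed the mode         */
    uint64_t lock_resource_name[3];  /* resource actually granted                 */
    uint64_t lock_handle;            /* handle for the granted lock               */
    uint64_t lock_extent[2];         /* granted extent: start, end                */
    uint64_t lock_policy_res1;       /* policy result field one                   */
    uint64_t lock_policy_res2;       /* policy result field two                   */
};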

LDLM_CONVERT

This message converts the lock type of an existing lock. The request contains an ldlm request structure, as in enqueue. The requested mode field contains the mode requested after conversion. An ldlm_reply message is returned to the client.

LDLM_BL_CALLBACK.

This message is sent by the lock server to the client to indicate that a lock held by the client is blocking another lock request. This sends a struct ldlm_request with the attributes of the blocked lock in lock_desc.

LDLM_CP_CALLBACK.

This message is sent by the lock server to the client to indicate that a prior unfulfilled lock request is now being granted. This too sends a struct ldlm_request with the attributes of the granted lock in lock_desc. Note that these attributes may differ from those that the client originally requested, in particular the resource name and lock mode.

Client / Meta-data Server

The client meta-data network protocol consists of just a few calls. Again, we first explain the components that make up the Lustre messages and then turn to the network structure of the individual requests. The MDC-MDS protocol has significant similarity with the OSC-OST protocol.

Messages have the following Portals related attributes:

  1. Destination portal for requests: MDS_REQUEST_PORTAL
  2. Reply packets go to: MDC_REPLY_PORTAL
  3. Readdir bulk packets travel to: MDC_BULK_PORTAL

A few other constants are important. We have a sequence of call numbers:

#define MDS_GETATTR 1 
#define MDS_OPEN 2 
#define MDS_CLOSE 3 
#define MDS_REINT 4 
#define MDS_READPAGE 6 
#define MDS_CONNECT 7 
#define MDS_DISCONNECT 8 
#define MDS_GETSTATUS 9
#define MDS_STATFS 10
#define MDS_GETLOVINFO 11

The update records are numbered too, to indicate their type:

#define REINT_SETATTR 1 
#define REINT_CREATE 2 
#define REINT_LINK 3 
#define REINT_UNLINK 4 
#define REINT_RENAME 5 
#define REINT_RECREATE 6 

Meta-data Related Wire Structures.

As indicated earlier, many messages to the MDS contain an mds_body (see table 18).

Bytes Name Description
16 fid1 First fid.
16 fid2 Second fid.
16 handle Lustre handle
8 size File size.
8 blocks Number of blocks.
4 ino Inode number.
4 valid Bitmap of valid fields sent in / returned.
4 fsuid effective user id for file access.
4 fsgid effective group id for file access.
4 capability Not currently used
4 mode Mode of file
4 uid real user id.
4 gid real group id.
4 mtime Last modification time.
4 ctime Last inode change time.
4 atime Last access time.
4 flags Flags.
4 rdev device
4 nlink Linkcount.
4 generation Generation.
4 suppgid
4 eadatasize Size of extended attributes.

TABLE 18. MDS Body
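A C sketch of the mds_body from table 18; the 16-byte fids and handle are shown as opaque byte arrays (see table 19 for the fid layout) and all names are illustrative:

#include <stdint.h>

struct mds_body_sketch {
    uint8_t  fid1[16];        /* first fid                          */
    uint8_t  fid2[16];        /* second fid                         */
    uint8_t  handle[16];      /* Lustre handle                      */
    uint64_t size;            /* file size                          */
    uint64_t blocks;
    uint32_t ino;             /* inode number                       */
    uint32_t valid;           /* bitmap of valid fields             */
    uint32_t fsuid, fsgid;    /* effective ids for file access      */
    uint32_t capability;      /* not currently used                 */
    uint32_t mode;
    uint32_t uid, gid;        /* real ids                           */
    uint32_t mtime, ctime, atime;
    uint32_t flags;
    uint32_t rdev;
    uint32_t nlink;
    uint32_t generation;
    uint32_t suppgid;
    uint32_t eadatasize;      /* size of extended attributes        */
};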


In the mds_body structure a file identifier is used to identify a file (see table 19). The file type is a platform independent enumeration:

Bytes Name Description
8 id Inode id
4 generation Inode generation
4 f_type Inode type

TABLE 19. File Identifier Structure


#define S_IFSOCK 0140000
#define S_IFLNK 0120000
#define S_IFREG 0100000
#define S_IFBLK 0060000
#define S_IFDIR 0040000
#define S_IFCHR 0020000
#define S_IFIFO 0010000
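As a small illustration of how the platform-independent f_type field of the file identifier can be interpreted against the values above, consider the following sketch; the helper is hypothetical, not part of Lustre:

#include <stdint.h>

/* Map the f_type of a file identifier (Table 19) to an ls-style type
 * character, using the S_IF* values listed above. */
static char fid_type_char(uint32_t f_type)
{
    switch (f_type & 0170000) {      /* format mask for the mode bits */
    case 0140000: return 's';        /* socket        */
    case 0120000: return 'l';        /* symlink       */
    case 0100000: return '-';        /* regular file  */
    case 0060000: return 'b';        /* block device  */
    case 0040000: return 'd';        /* directory     */
    case 0020000: return 'c';        /* char device   */
    case 0010000: return 'p';        /* fifo          */
    default:      return '?';
    }
}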

The MDS stores the file striping information, which includes the object information, as extended attributes. It might be required to send this information across the wire for certain operations. This can be done using the variable length data structure described in table 20.

Bytes Name Description
4 lmm_magic 0x0BD00BD0, the striping magic (read, obdo-obdo).
8 lmm_object_id The id of the object as seen by the LOV.
4 lmm_stripe_size Stripe size.
4 lmm_stripe_offset Stripe offset.
2 lmm_stripe_count How many stripes are used for the file.
2 lmm_ost_count Total number of OSTs in the cluster (determines the maximum stripe count)
8*n lmm_objects An array of object id, in the order that they appear in the LOV descriptor.

TABLE 20. Variable Length Structure
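A C sketch of the variable-length striping descriptor from table 20; the trailing array is sized by the stripe count and the names are illustrative. As with the other sketches, the on-wire layout is packed and a compiler may insert padding here.

#include <stdint.h>

struct lov_mds_md_sketch {
    uint32_t lmm_magic;          /* 0x0BD00BD0, the striping magic        */
    uint64_t lmm_object_id;      /* object id as seen by the LOV          */
    uint32_t lmm_stripe_size;
    uint32_t lmm_stripe_offset;
    uint16_t lmm_stripe_count;   /* stripes used by this file             */
    uint16_t lmm_ost_count;      /* OSTs in the cluster                   */
    uint64_t lmm_objects[];      /* object ids, in LOV descriptor order   */
};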

MDS Update Record Packet Structure.

In this section we describe the message structures for all the metadata operations that result in an update of the file metadata on the MDS. The structure of the update record depends on the operation type; all update records contain a 32-bit opcode at the beginning for identification.

REINT_SETATTR.

The setattr message contains a structure containing the attributes that will be set, in a format commonly used across Unix systems as shown in table 21.

REINT_CREATE.

The create record is used to build files, symbolic links, directories, and special files. In all cases the record shown in table 22 is included, and a second buffer in the Lustre message contains the name to be created. For files this is followed by a further buffer containing striping meta-data. For symbolic links a third buffer is also present, containing the null terminated name of the link. The reply contains only an mds_body structure along with the lustre_msg structure.

Bytes Name Description
4 sa_opcode opcode of the update record that follows.
4 sa_fsuid effective user id for file access.
4 sa_fsgid effective group id for file access.
4 sa_cap Not currently used
4 sa_reserved Not currently used
4 sa_valid Bitmap of valid fields.
16 sa_fid fid of object to update.
4 sa_mode Mode
4 sa_uid uid
4 sa_gid gid
4 sa_attr_flags Flags
8 sa_size Inode size.
8 sa_atime atime
8 sa_mtime mtime
8 sa_ctime ctime
4 sa_suppgid Not currently used

TABLE 21. setattr Message Structure


Bytes Name Description
4 cr_opcode opcode
4 cr_fsuid effective user id for file access.
4 cr_fsgid effective group id for file access.
4 cr_cap Not currently used
4 cr_flags For use with open.
4 cr_mode Mode
16 cr_fid fid of parent.
16 cr_replayfid fid of parent used to replay request.
4 cr_uid uid
4 cr_gid gid
8 cr_time Time
8 cr_rdev Raw device.
4 cr_suppgid Not currently used

TABLE 22. Create record
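A C sketch of the create record from table 22; the fids are shown as opaque 16-byte arrays and the names are illustrative:

#include <stdint.h>

struct mds_rec_create_sketch {
    uint32_t cr_opcode;         /* REINT_CREATE (or REINT_RECREATE)  */
    uint32_t cr_fsuid, cr_fsgid;/* effective ids for file access     */
    uint32_t cr_cap;            /* not currently used                */
    uint32_t cr_flags;          /* for use with open                 */
    uint32_t cr_mode;
    uint8_t  cr_fid[16];        /* fid of the parent directory       */
    uint8_t  cr_replayfid[16];  /* fid used when replaying           */
    uint32_t cr_uid, cr_gid;
    uint64_t cr_time;
    uint64_t cr_rdev;           /* raw device                        */
    uint32_t cr_suppgid;        /* not currently used                */
};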

REINT_LINK.

The link Lustre message contains 2 fields: an mds_rec_link record described in table 23 followed by a null terminated name to be used in the target directory. The reply consists of an mds_body.

Bytes Name Description
4 lk_opcode
4 lk_fsuid effective user id for file access.
4 lk_fsgid effective group id for file access.
4 lk_cap Not currently used
4 lk_suppgid Not currently used
16 lk_fid1 fid of source.
16 lk_fid2 fid of target parent.

TABLE 23. File link Records

REINT_UNLINK.

The unlink Lustre message contains 2 fields: an mds_rec_unlink record described in table 24 followed by a null terminated name to be used in the target directory.

Bytes Name Description
4 ul_opcode
4 ul_fsuid effective user id for file access.
4 ul_fsgid effective group id for file access.
4 ul_cap Not currently used
4 ul_reserved Not currently used
4 ul_mode Mode
4 ul_suppgid Not currently used
16 ul_fid1 fid of source.
16 ul_fid2 fid of file to be removed.

TABLE 24. File unlink Records

The reply consists of an mds_body. Notice that ul_fid2 is superfluous, but useful for checking the correctness of the protocol.

REINT_RENAME.

The rename lustre message contains 2 fields: an mds_rec_rename record (see table 25) followed by two null terminated names, indicating the source and destination name. The reply consists of an mds_body.

REINT_RECREATE.

This request is present for recovery purposes and identically formatted to that of REINT_CREATE, except for the value of the opcode.

MDS Request record packet structure.

MDS_GETATTR.

The getattr request contains an mds_body. The parameters that are relevant in the request are the fid and valid fields. In WB caching mode, the attributes are retrieved using the fid in the mds_body; in CS mode the fid is that of the parent directory and the attributes are retrieved by a name included as a buffer in the Lustre message following the mds_body.

Bytes Name Description
4 rn_opcode
4 rn_fsuid effective user id for file access.
4 rn_fsgid effective group id for file access.
4 rn_cap Not currently used
4 rn_suppgid1 Not currently used
4 rn_suppgid2 Not currently used
16 rn_lk_fid1 fid of source.
16 rn_lk_fid2 fid of target parent.

TABLE 25. File rename Records


The reply may be followed by MDS striping data in the case of a fid of type S_IFREG, or by a link name; the latter happens when the OBD_MD_LINKNAME bit is set in the valid field.

MDS_OPEN.

The open request contains an mds_fileh_body (see table 26), followed by an optional lov_stripe_md. The stripe meta-data is used to store the object identities on the MDS, in case the objects were created only at open time on the OSTs. The fid indicates which object is opened. The handle in the request is a local file handle, to deal with re-opening files during cluster recovery.

Bytes Name Description
16 fid File id of object to open / close.
16 file handle File handle passed or returned.

TABLE 26. File handler structure for open/close requests


The reply contains the same structure. The Lustre handle contains the remote file handle and a security token; the body has the attributes of the inode.

MDS_CLOSE.

The structure of this call is equal to that of open, although lov_stripe_md can currently not be passed.

MDS_REINT.

The message consists of a Lustre message with an update record appended as the first buffer. The update records are described above. The reply is an mds_body.

MDS_READPAGE.

The request structure contains an mds_body. The fid1 field contains the identifier of the file; the size field gives the offset of the page to be read. The reply is simply a Lustre message.

MDS_CONNECT.

See OST_CONNECT.

MDS_DISCONNECT.

See OST_DISCONNECT.

MDS_GETSTATUS.

This call will retrieve the root fid from the MDS. The request message is a lustre_msg; the reply contains a lustre_fid, the root fid of the file system.

MDS_STATFS.

MDS_GETLOVINFO.

Client - MDS/OST recovery protocol

We have introduced a new operation in which clients ping all the servers periodically. When a server (MDS/OST) fails, all the connected clients need to participate in recovery within a given time; if they miss the recovery window, they are removed from the cluster and lose all their cached updates. The ping operation can be used by the clients to continuously check whether the servers are up. If a failover server is available, the clients need to find and connect to it. A new opcode, OBD_PING, has been introduced for this purpose; it is understood by both OST and MDS nodes. This new opcode has a value of 400 and no request or reply body (both have length 0); table 27 illustrates this message.

Bytes Name (struct lustre_msg) Value
8 Export/import handle cookie Contains the import/export handle for the request
4 Magic constant 0x0BD00BD0.
4 Type PTL_RPC_MSG_REQUEST / PTL_RPC_MSG_REPLY
4 Lustre-msg version and protocol version Current value 0x00040001 in little endian
4 Opcode 400
8 Last received counter In replies: last transaction no for MDS/OST.
8 Last committed counter In replies: last committed transaction.
8 Transaction number In replies: transaction no for request.
4 Status Return value of handler.
4 Buffer count: bufcount 0
“buffer count“ * 4 Buffer lengths buflens[] No buffers

TABLE 27. Lustre message for the new OBD_PING operation


On the server side, zero-to-minimal processing should be done for this new type of Lustre message. In addition, OBD_PING can be sent with a request message containing an addr and cookie of zero (no export), and should be processed without any change in that case. Specifically, it should not return -ENOTCONN for a mis-matched export handle if the addr and cookie are both zero. Another scenario in which the pinger plays an important role is cleanup: if clients in a cluster are shut down while they hold locks, the OSTs would otherwise have to wait a long time for timeouts to occur for all those clients before assuming that they died and cleaning up the locks they held. In the presence of the ping operation, the OST keeps a time_last_heard parameter for every client. The server can use this variable to track when it last heard from a client; if the elapsed time exceeds a certain threshold, the OST can mark the client as dead.
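To make the ping message concrete, a small C sketch follows, filling in the lustre_msg fields as given in table 27. The struct and function are illustrative only; the numeric value of PTL_RPC_MSG_REQUEST is not given on this page, so it is passed in as a parameter.

#include <stdint.h>
#include <string.h>

/* Minimal lustre_msg layout for an OBD_PING request (Table 27); names
 * are illustrative and mirror the lustre_msg sketch earlier on this page. */
struct ping_msg_sketch {
    uint64_t handle;          /* may legitimately be 0: no export       */
    uint32_t magic;           /* 0x0BD00BD0                             */
    uint32_t type;            /* PTL_RPC_MSG_REQUEST                    */
    uint32_t version;         /* 0x00040001                             */
    uint32_t opc;             /* OBD_PING = 400                         */
    uint64_t last_xid, last_committed, transno;
    uint32_t status;
    uint32_t bufcount;        /* 0: no request body                     */
    uint32_t flags;
};

static void build_ping(struct ping_msg_sketch *m, uint32_t ptl_rpc_msg_request)
{
    memset(m, 0, sizeof(*m));           /* zero handle, counters, flags  */
    m->magic    = 0x0BD00BD0;
    m->type     = ptl_rpc_msg_request;  /* request message type constant */
    m->version  = 0x00040001;
    m->opc      = 400;                  /* OBD_PING                      */
    m->bufcount = 0;                    /* zero-length request body      */
}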

Changelog

Version 2.2 (Apr. 2003): Radhika Vullikanti (28 Apr 2003) - Updated the structures to reflect the current protocol.
Version 2.1 (Apr. 2003): Phil Schwan (26 Apr 2003) - Updated wire protocols to reflect changes made between Lustre versions 0.6.0.3 and 0.6.0.4 (bugs 593, 1154, 1175, and 1178). All sizes are now in bytes.
Version 2.0 (Apr. 2003): Radhika Vullikanti (04/01/2003) - Added a new section describing the wire protocol changes made for recovery purposes.
Version 1.5 (Jan. 2003): Radhika Vullikanti (01/31/2003) - Updated section 13.2 to reflect the changes that were made to the wire protocol for using PtlGet for bulk writes.