WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Architecture - Wire Level Protocol
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Introduction
This chapter gives an overview of the wire formats used by Lustre. Lustre embeds its messages in Portals packets. Lustre employs a Lustre message structure to encapsulate requests and replies; this structure is very efficient to pack and unpack and shares many properties of the DAFS wire format. The Lustre subsystems embed their requests and replies in the Lustre messages as buffers. A message frequently contains multiple buffers, such as a request header and a pathname, or includes multiple requests, as in the intent locking case. Every Lustre message is encapsulated within a Portals message header, as the figure below illustrates. A Lustre message body may contain multiple message buffers.
Portals message header |
Lustre message header |
Lustre message body (message buffers) |
Figure: Message structure
In this chapter we describe the structure of the various message headers and the formats of the message buffers that are sent and received in Lustre for the different types of operations.
Portals message headers
As previously discussed, all Lustre network traffic is handled by Portals, and Portals introduces its own headers. The vast majority of Lustre packets are put on the wire with PtlPut and PtlGet, and are simple in structure. The Portals packet types in use are PtlPut, PtlGet, ACK, and reply packets. Lustre sends requests, replies, and bulk reads as PtlPut packets; these are translated into PTL_MSG_PUT and PTL_MSG_ACK packets. Bulk write messages are sent as PtlGet packets, which are translated into PTL_MSG_GET and PTL_MSG_REPLY packets.
Each Lustre packet is wrapped in a Portals header. The Portals header can be visualized as consisting of two parts: one portion that is common to all packets and a second part that depends on the type of packet (PtlGet, PtlPut, PtlAck, PtlReply). The first part has a fixed size; the size of the second part depends on the type of packet and is stored within the structure for the various packet types.
The fields in the common Portals header have the Lustre semantics illustrated in table 1.
Some Lustre clusters use routing, but this does not affect the content of any Portals header packet that is routed.
Bytes | Description (ptl_hdr_t) | Lustre semantics, incoming packets | Lustre semantics, outgoing packets |
8 | Destination nid | This equals the Lustre nid of the receiver. The nid can be an IP address or an Elan node-id. | This field is set to the final destination Lustre network id: IP address or Elan id. When a reply packet is sent, this field is set to the nid of the request packet that was received in connection with this reply. |
8 | Source nid | Source Lustre network id; this may not equal the network address of the node sending the packet when the packet is routed. | This field is set to the Lustre node-id from which the packet originates. |
4 | Destination pid | 0 for Lustre | 0 for Lustre |
4 | Source pid | 0 for Lustre | 0 for Lustre |
4 | Message type | PTL_MSG_PUT, PTL_MSG_ACK, PTL_MSG_GET, or PTL_MSG_REPLY | PTL_MSG_PUT, PTL_MSG_ACK, PTL_MSG_GET, or PTL_MSG_REPLY |
size depends on the message type | msg | Depends on the message type. | Depends on the message type. |
size depends on the message type | msg_filler | The ptl_hdr_t structure size is 72 bytes; after the initial common header and the message-specific structure, the rest of the structure is padded out to 72 bytes. | The ptl_hdr_t structure size is 72 bytes; after the common header and the message-specific structure, the rest of the structure is padded out to 72 bytes. |
TABLE 1. Portals Header Structure
Notes:
- A Portals version field needs to be included in this header.
- All 32 bit integers will be little endian. 16 byte integers have the most significant bytes preceding the least significant.
- The nid will be the Lustre network id. In a routed network this may not coincide with the origin of the packet.
- ACK packets, which are generated when a PtlPut arrives with a nonzero ACK MD, are similar. But the fields following the ACK MD token are not used and the header is padded out to 88 bytes, and any padding must be ignored by the receiver.
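The common part of the header can be summarized in C. The following is a minimal sketch reconstructed from table 1; the type and field names are assumptions for illustration and may not match the actual Portals or Lustre declarations.

```c
#include <linux/types.h>

/* Sketch of the common Portals header of Table 1 (illustrative names,
 * not the verbatim Portals declaration). */
typedef __u64 ptl_nid_t;        /* Lustre network id: IP address or Elan id */
typedef __u32 ptl_pid_t;        /* always 0 for Lustre                      */

typedef struct {
        ptl_nid_t dest_nid;     /* 8 bytes: destination nid                 */
        ptl_nid_t src_nid;      /* 8 bytes: source nid                      */
        ptl_pid_t dest_pid;     /* 4 bytes: 0 for Lustre                    */
        ptl_pid_t src_pid;      /* 4 bytes: 0 for Lustre                    */
        __u32     type;         /* PTL_MSG_PUT/GET/ACK/REPLY                */
        union {
                /* message-specific part, see tables 2 to 5 */
                char filler[44];/* pads the whole header to 72 bytes        */
        } msg;
} ptl_hdr_t;
```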
The msg field is determined by the message type; each of the message types has a different structure. Table 2 illustrates the header for a PtlPut message.
Bytes | Description (ptl_put) | Lustre Semantics incoming packets | Outgoing packet semantics |
4 | Portal index | See section 4.1 | See section 4.1 |
16 | ACK MD index | The sending node of a PtlPut packet for which an ACK is requested includes this as a cookie for the packet for which an ACK is to be sent. On incoming ACK packets, this field is used to generate an event for the corresponding packet (in Portals lingo, for its memory descriptor). When set on incoming PtlPut packets this field is copied into outgoing ACK packets, but except for its presence this field is not interpreted by the receiver of a PtlPut packet. | Memory descriptor used for the ACK event. This field is set for PtlPut packets for Lustre bulk messages to indicate that an ACK packet is requested. On outgoing ACK packets this will equal the MD handle on the associated incoming PtlPut packet. Unless the first 64 bits of this field are set to ~0 an ACK will be generated. |
8 | Match bits | In an incoming Lustre request PtlPut message, this is set to an integer that the recipient must include in (i) Lustre reply PtlPut packets for that request, and (ii) incoming or outgoing Lustre bulk PtlPut packets. On incoming replies this field must equal the xid of the request to which the packet belongs. For incoming, i.e. sink side, bulk packets, this field must equal the xid’s the sink sent to the source in the bulk handshake. | On outgoing request packets this field must be set to a unique integer. On outgoing reply packets this field must be set to the xid of the transaction associated with the request. For bulk packets, the source must set this field to match the xid’s sent by the sink during the preparatory bulk handshake. |
4 | Length | Packet body length, not including the Portals header. | Packet body length, not including the Portals header. |
4 | Offset | Sender managed offset: 0 in Lustre. | Sender managed offset: 0 in Lustre. |
8 | Header data | Reserved for Lustre use. | Reserved for Lustre use. |
TABLE 2. Portals PTL_MSG_PUT message Structure
Here, the xid is a request identifier that, in normal use of the system, increases by 1 for every request sent from a particular node to a service. During recovery, requests may be re-transmitted; hence the xid is not totally unique.
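The xid doubles as the Portals match bits. The fragment below is a hypothetical illustration of the rules in table 2; next_xid and the function names are assumptions for this sketch, not code from the Lustre tree.

```c
#include <linux/types.h>

static __u64 next_xid;                        /* per-client request counter */

/* Outgoing request PtlPut: a (normally) unique, increasing xid.  During
 * recovery the same xid may be re-sent, so it is not totally unique. */
static __u64 request_match_bits(void)
{
        return ++next_xid;
}

/* Outgoing reply PtlPut: the reply reuses the xid of the request it
 * answers, so the requester can match it to the waiting request. */
static __u64 reply_match_bits(__u64 request_xid)
{
        return request_xid;
}
```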
Table 3 shows the structure for a PtlGet message. Table 4 shows the structure for an ACK packet. Finally, table 5 shows the PtlReply packet:
Bytes | Description (ptl_get) | Lustre Semantics incoming packets | Outgoing packet semantics |
4 | Portal index | The Portals index on the OSC under which the write source buffers have been posted. | |
16 | return MD index | The OST provides a handle to the write target buffers to which the PTL_MSG_REPLY should be sent. | The MD handle the OSC will use to send the PTL_MSG_REPLY. |
8 | Match bits | In an incoming Lustre request PtlGet message, this is set to an integer that the recipient must include in (i) Lustre reply PtlGet packets for that request, and (ii) incoming or outgoing Lustre bulk PtlGet packets. On incoming replies this field must equal the xid of the request to which the packet belongs. For incoming, i.e. sink side, bulk packets, this field must equal the xid's the sink sent to the source in the bulk handshake. | On outgoing request packets this field must be set to a unique integer. On outgoing reply packets this field must be set to the xid of the transaction associated with the request. For bulk packets, the source must set this field to match the xid's sent by the sink during the preparatory bulk handshake. |
4 | Length | The Portals agent that receives the get message uses the length to find a local match entry (list of memory regions) that will provide the source of the bulk flow. The length of the source buffer is specified here. | The length of the data region from which the get originated. |
4 | source offset | Offset within the memory region obtained from the match bits. | Sender managed offset: 0 in Lustre. |
4 | return offset | This offset is set by the OST and simply copied by the OSC. | The offset within the return MD to which the data should be sent by the source. |
4 | sink length | Length of the server buffer where the writes will go; set by the server. | Sink length, used to verify that the matching packet is of the correct length and not larger. |
TABLE 3. Portals PTL_MSG_GET message Structure
For more information about the precise meaning of these headers, see the Portals specification [1].
Bytes | Description (ptl_ack) | Lustre Semantics incoming packets | Outgoing packet semantics |
4 | m_length | | |
16 | DST MD index | This is the receiver's MD as copied from the PtlPut packet. | This is the destination MD copied from the corresponding PtlPut packet. |
8 | Match bits | | |
4 | Length | Packet body length, not including the Portals header. | Packet body length, not including the Portals header. |
TABLE 4. Portals PTL_MSG_ACK message Structure
Bytes | Description (ptl_reply) | Lustre Semantics incoming packets | Outgoing packet semantics |
4 | unused | | |
16 | DST MD index | This is the receiver's MD as copied from the PtlPut packet. | This is the destination MD copied from the corresponding PtlPut packet. |
4 | dst_offset | | |
4 | unused | | |
4 | Length | Packet body length, not including the Portals header. | Packet body length, not including the Portals header. |
TABLE 5. Portals PTL_MSG_REPLY message Structure
Lustre Messages: RPC's
Lustre uses Portals to send Lustre messages across the net. All request and reply messages are packaged as Lustre messages and embedded within the Portals header. Lustre messages have fields in the body that describe the contents of the message and assist in recovery. Bulk packets are sent as raw Portals packets.
Lustre requests and replies (but not bulk data) fill the "packet data" field with at least a lustre_msg structure, explained in table 6. Some requests/replies have additional buffers; this is indicated by the buffer count field in the lustre_msg structure. Every Lustre message requires a corresponding reply.
Each buffer and the message itself is padded to an 8-byte boundary, but this padding is not included in the buffer lengths.
The structure of the data in the buffers of the lustre_msg structure depends on the type of operation. In the following sections we describe the buffer data formats for all the possible Lustre request/reply types.
The structure described above has a field to contain the last received transaction number; this might not be required in every reply. The last committed transaction number is required for recovery purposes. We might also need to add the generation number of the connection on every outgoing request to enable the target to IO-fence old requests. It might also be useful to add protocol version information in every outgoing request packet. In the future, we might include a subsystem field similar to that in SUN RPC and an authenticator.
Bytes | Name (struct lustre_msg) | Use on incoming packets | Use on outgoing packets |
8 | Export/import handle cookie | The first 8 bytes provide a Lustre handle to the export or import data associated with the message. It is used by the receiver to locate the object. The export data handle is used by services for incoming request packets; the import handle is used by clients for incoming ASTs. | The handle is a copy of the handle exchanged with the peer during the subsystem connection handshake. On outgoing request packets, the handle of the target service is included; on outgoing ASTs, the import handle of the client is included. |
4 | Magic | Magic constant 0x0BD00BD0. | Magic constant 0x0BD00BD0. |
4 | Type | PTL_RPC_MSG_REQUEST | PTL_RPC_MSG_REPLY or PTL_RPC_MSG_ERR |
4 | Lustre msg version and protocol version: version. Current value 0x00040001 in little endian | Most significant 16 bits: Lustre msg protocol version; least significant 16 bits: subsystem protocol version. This is checked by the service for request packets against the available protocols offered by the receiver. Not used by clients. | This field is set by the client to indicate the versions used. |
4 | Protocol and opcode: opc | Most significant 16 bits: Lustre subsystem protocol number; least significant 16 bits: opcode for the request in that protocol. Used by the service to locate the request handler. | Set by the client to indicate which request is being sent. |
8 | Last received counter | In replies: last transaction no for MDS/OST. | |
8 | Last committed counter | In replies: last committed transaction. | |
8 | Transaction number | In replies: transaction no for request. | |
4 | Status | Return value of handler. | |
4 | Buffer count: bufcount | How many buffers are included. | |
4 | flags | Operation-specific flags use the top 16 bits (e.g. MSG_CONNECT_RECONNECT) and common flags use the bottom 16 bits (e.g. MSG_LAST_REPLAY). | |
"buffer count" * 4 | Buffer lengths: buflens[] | The length of each of the included buffers. | |
total of "buffer lengths" | Message data | | |
TABLE 6. Lustre Message Structure
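Summarizing table 6 in C, a sketch of the message header might look as follows. The field names follow the table; the exact declaration in the Lustre sources may differ, and the helper at the end only restates the 8-byte padding rule mentioned above.

```c
#include <linux/types.h>

struct lustre_handle {
        __u64 cookie;                  /* export/import handle cookie        */
};

struct lustre_msg {
        struct lustre_handle handle;   /*  8: export/import handle           */
        __u32 magic;                   /*  4: 0x0BD00BD0                     */
        __u32 type;                    /*  4: PTL_RPC_MSG_REQUEST/REPLY/ERR  */
        __u32 version;                 /*  4: msg and subsystem version      */
        __u32 opc;                     /*  4: subsystem protocol and opcode  */
        __u64 last_xid;                /*  8: last received counter          */
        __u64 last_committed;          /*  8: last committed counter         */
        __u64 transno;                 /*  8: transaction number             */
        __u32 status;                  /*  4: handler return value           */
        __u32 bufcount;                /*  4: number of buffers that follow  */
        __u32 flags;                   /*  4: operation and common flags     */
        __u32 buflens[0];              /* bufcount buffer lengths            */
        /* the buffers themselves follow, each padded to an 8-byte boundary */
};

/* Buffers and the message are padded to 8 bytes; the padding is not
 * counted in buflens[]. */
static inline __u32 size_round(__u32 len)
{
        return (len + 7) & ~(__u32)7;
}
```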
OSC-OST Network Protocol
In this section we will describe the buffer data formats for all the requests/replies exchanged between the object storage clients (OSC) and the targets (OSTs). There are eleven OST opcodes and each of them has different information in the buffers:
OST_CONNECT: This is sent as a Lustre message with three buffers. The first buffer contains only the UUID of the target to be connected to; on the server side this UUID is translated into the target device. The second buffer is a UUID identifying the connecting client. The target instantiates a unique export structure for the client, which is passed in on every further request. The third buffer holds the import handle, which is sent back to the client for lock callbacks.
OST_DISCONNECT: This is a Lustre message without any additional buffers. The server tears down the export specified in the lustre_msg structure.
OST_GETATTR, OST_SETATTR, OST_OPEN, OST_CLOSE, OST_CREATE, OST_DESTROY, OST_PUNCH: These OST requests have the structure shown in table 7 in a single message buffer (hereafter referred to as the OST body):
Bytes | Description (struct ost_body) |
| OBD Object (OBDO) |
TABLE 7. OST request structure
An OBD object, as illustrated in table 8, is similar to a cross-platform inode. It describes the attributes of a given object. The valid flag is used to indicate the fields that are relevant for a request. As an example, the OST_SETATTR will use the valid flag to indicate which attributes need to be written to the device.
OST_READ, OST_WRITE: These requests have a different structure. The first buffer in the Lustre message is a network IO object. This is followed by an array of remote niobufs in buffer 2. There is one IO object (see table 9) and one niobuf (see table 11) per file extent.
When an OST_READ request is received, the data is sent to Portals match entries equal to the xid given in the niobuf_remote structure. In the case of reads, each niobuf includes a return code that indicates success/failure/errors on a per-page basis, as shown in table 10.
For an OST_WRITE request, buffers with such match bits are prepared by the client so that the server can get data from the buffers. The bulk data described in those structures is sent as a standard Portals packet, without any Lustre RPC header.
OST_STATFS: This function has one reply message buffer which contains a struct obd_statfs. The contents are shown in table 12.
The server should fill in at least the critical fields, relating to the number of free/total file objects and blocks, and zero-fill the unused fields.
Bytes | Description (struct obdo) |
8 | id |
8 | Group |
8 | atime |
8 | mtime |
8 | ctime |
8 | Size |
8 | Blocks |
8 | rdev |
4 | Block size |
4 | Mode |
4 | uid |
4 | gid |
4 | Flags |
4 | Link count |
4 | Generation |
4 | Valid |
4 | OBDflags |
4 | o_easize |
60 | o_inline |
TABLE 8. OBD object
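Table 8 maps naturally onto a C structure. The sketch below is reconstructed from the table; the o_ prefixes are assumed by convention and the declaration is illustrative rather than the exact Lustre source. For example, an OST_SETATTR request would set in o_valid only the bits for the attributes that should be written.

```c
#include <linux/types.h>

/* Illustrative layout of the OBD object (obdo) from Table 8. */
struct obdo {
        __u64 o_id;               /* object id                        */
        __u64 o_gr;               /* group                            */
        __u64 o_atime;
        __u64 o_mtime;
        __u64 o_ctime;
        __u64 o_size;
        __u64 o_blocks;
        __u64 o_rdev;
        __u32 o_blksize;          /* block size                       */
        __u32 o_mode;
        __u32 o_uid;
        __u32 o_gid;
        __u32 o_flags;
        __u32 o_nlink;            /* link count                       */
        __u32 o_generation;
        __u32 o_valid;            /* bitmap of valid fields           */
        __u32 o_obdflags;
        __u32 o_easize;           /* extended attribute size          */
        char  o_inline[60];
};
```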
Bytes | Description (struct obd_ioobj) |
8 | id |
8 | Group |
4 | Type |
4 | Buffer count |
TABLE 9. IO object
Bytes | Description (struct niobuf_local) |
8 | Offset |
8 | xid |
4 | Length |
4 | Flags |
4 | return code |
addr | |
sizeof page struct | Flags |
target_private | |
sizeof dentry struct | dentry |
TABLE 10. Niobuf_local
Bytes | Description (struct niobuf_remote) |
8 | Offset |
8 | xid |
4 | Length |
4 | Flags |
TABLE 11. Niobuf
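Tables 9 and 11 correspond to the two wire descriptors that make up a read or write request: one IO object per file and one remote niobuf per extent. A sketch, with field names assumed from the tables:

```c
#include <linux/types.h>

/* Per-file IO object (Table 9), illustrative names. */
struct obd_ioobj {
        __u64 ioo_id;             /* object id                          */
        __u64 ioo_gr;             /* group                              */
        __u32 ioo_type;
        __u32 ioo_bufcnt;         /* number of niobufs for this object  */
};

/* Per-extent remote buffer descriptor (Table 11), illustrative names. */
struct niobuf_remote {
        __u64 offset;             /* offset of the extent in the object */
        __u64 xid;                /* match bits for the bulk transfer   */
        __u32 len;
        __u32 flags;
};
```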
Bytes | Field name | Meaning |
8 | os_type | Magic constant describing the type of OBD (not defined yet). |
8 | os_blocks | Total number of blocks on OST. |
8 | os_bfree | Free blocks. |
8 | os_bavail | Available blocks (free minus reserved). |
8 | os_files | Total number of objects. |
8 | os_ffree | Number of unallocated objects. |
40 | os_fsid | UUID of OST. |
4 | os_bsize | Block size. |
4 | os_namelen | Length of OST name. |
48 | os_spare | Reserved. |
TABLE 12. OBD Status structure
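A sketch of the statfs reply buffer reconstructed from table 12; the declaration is an illustration, not the verbatim source.

```c
#include <linux/types.h>

struct obd_statfs {
        __u64 os_type;            /* magic describing the OBD type     */
        __u64 os_blocks;          /* total blocks on the OST           */
        __u64 os_bfree;           /* free blocks                       */
        __u64 os_bavail;          /* free minus reserved               */
        __u64 os_files;           /* total objects                     */
        __u64 os_ffree;           /* unallocated objects               */
        __u8  os_fsid[40];        /* UUID of the OST                   */
        __u32 os_bsize;           /* block size                        */
        __u32 os_namelen;         /* length of the OST name            */
        __u8  os_spare[48];       /* reserved, zero-filled             */
};
```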
Lustre DLM Network Protocol
The Lustre lock manager has 3 regular calls and 2 callbacks. The regular calls are sent on the same portal as the affected service; for example, meta-data lock requests are sent to the MDS portal. The callbacks are sent to the portal reserved for DLM RPCs. Every request to the lock manager has at least a single buffer containing either the ldlm_request structure (see table 13) or the ldlm_reply structure (see table 17).
Lustre lock request structure.
Any lock request in Lustre consists of at least the ldlm_request structure (see table 13).
Bytes | Name | Description |
4 | lock_flags | Flags filled in by the server to indicate the status of the lock. |
92 | lock_desc | Lock descriptor is filled with requested type, name, and extent. |
8 | lock_handle | |
8 | lock_handle2 |
TABLE 13. The lock request structure
As shown in table 13, every lock request contains a lock description structure, shown in table 16. This structure has several sub-components. It contains a struct ldlm_extent (see table 14) that describes the file extent covered by the lock.
Bytes | Name | Description |
8 | start | Start of extent. |
8 | end | End of the extent. |
TABLE 14. Lock extent descriptor
Secondly, we have the resource descriptor, struct ldlm_resource_desc (see table 15), which is used to describe the resource for which a lock is requested. This is an unaligned structure; that is all right as long as it is used only in the ldlm_request structure.
Bytes | Name | Description |
4 | lr_type | Resource type: one of LDLM_PLAIN, LDLM_INTENT, LDLM_EXTENT. |
8*3 | lr_name | Resource name. |
4*4 | lr_version | Version of the resource (not yet used). |
TABLE 15. Lock resource descriptor
Bytes | Name | Description |
44 | l_resource | Description of the resource for the lock (see table 15). |
4 | l_req_mode | Requested lock mode, one of LCK_EX (=1), LCK_PW, LCK_PR, LCK_CW, LCK_CR, LCK_NL (=6). File I/O uses PR and PW locks. |
4 | l_granted_mode | Lock mode that is granted on this lock. |
16 | l_extent | Extent required for this lock (see table 14). |
4*4 | l_version | Version of this lock. |
TABLE 16. Lock descriptor
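Putting tables 13 through 16 together, the request carries a nested set of structures. The sketch below is an illustration with names assumed from the tables; as noted above, the resource descriptor is unaligned on the wire, so a real declaration would need explicit packing.

```c
#include <linux/types.h>

struct ldlm_extent {
        __u64 start;
        __u64 end;
};

struct ldlm_res_desc {                    /* unaligned on the wire          */
        __u32 lr_type;                    /* LDLM_PLAIN/INTENT/EXTENT       */
        __u64 lr_name[3];                 /* resource name                  */
        __u32 lr_version[4];              /* not yet used                   */
};

struct ldlm_lock_desc {
        struct ldlm_res_desc l_resource;  /* resource for the lock          */
        __u32 l_req_mode;                 /* LCK_EX .. LCK_NL               */
        __u32 l_granted_mode;
        struct ldlm_extent l_extent;
        __u32 l_version[4];
};

struct ldlm_request {
        __u32 lock_flags;                 /* status flags from the server   */
        struct ldlm_lock_desc lock_desc;
        __u64 lock_handle;
        __u64 lock_handle2;
};
```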
Lustre lock reply structure.
The reply message contains an ldlm_reply structure (see table 17).
Message structures for the various locking operations.
In the following sections we will describe the message structures for the various locking operations supported in Lustre.
LDLM_ENQUEUE.
This message is used to obtain a new lock. The Lustre message contains a single buffer with a struct ldlm_request.
LDLM_CANCEL.
This message cancels an existing lock. It places a struct ldlm_request in the Lustre message, but only uses the lock_handle1 part of the request (we will shrink this in the future). The reply contains just a lustre_msg.
Bytes | Name | Description |
4 | lock_flags | Flags set during enqueue. |
4 | lock_mode | The server may change the lock mode; if this quantity is non-zero, the client should update its lock structure accordingly. |
8 * 3 | lock_resource_name | Resource actually given to the requester. |
8 | lock_handle | Handle for the lock that was granted. |
16 | lock_extent | Extent that was granted (will move to policy results). |
8 | lock_policy_res1 | Field one for policy results. |
8 | lock_policy_res2 | Field two for policy results. |
TABLE 17. Reply for a lock request
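A corresponding sketch of the reply from table 17, with names assumed from the table:

```c
#include <linux/types.h>

struct ldlm_reply {
        __u32 lock_flags;                 /* flags set during enqueue         */
        __u32 lock_mode;                  /* non-zero if the server changed it */
        __u64 lock_resource_name[3];      /* resource actually granted        */
        __u64 lock_handle;                /* handle for the granted lock      */
        struct {
                __u64 start;
                __u64 end;
        } lock_extent;                    /* extent granted                   */
        __u64 lock_policy_res1;
        __u64 lock_policy_res2;
};
```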
LDLM_CONVERT
This message converts the lock type of an existing lock. The request contains an ldlm request structure, as in enqueue. The requested mode field contains the mode requested after conversion. An ldlm_reply message is returned to the client.
LDLM_BL_CALLBACK.
This message is sent by the lock server to the client to indicate that a lock held by the client is blocking another lock request. This sends a struct ldlm_request with the attributes of the blocked lock in lock_desc.
LDLM_CP_CALLBACK.
This message is sent by the lock server to the client to indicate that a prior unfulfilled lock request is now being granted. This too sends a struct ldlm_request with the attributes of the granted lock in lock_desc. Note that these attributes may differ from those that the client originally requested, in particular the resource name and lock mode.
Client / Meta-data Server
The client meta-data network protocol consists of just a few calls. Again, we first explain the components that make up the Lustre messages and then turn to the network structure of the individual requests. The MDC-MDS protocol has significant similarity with the OSC-OST protocol.
Messages have the following Portals related attributes:
- Destination portal for requests: MDS_REQUEST_PORTAL
- Reply packets go to: MDC_REPLY_PORTAL
- Readdir bulk packets travel to: MDC_BULK_PORTAL
A few other constants are important. We have a sequence of call numbers:
#define MDS_GETATTR 1
#define MDS_OPEN 2
#define MDS_CLOSE 3
#define MDS_REINT 4
#define MDS_READPAGE 6
#define MDS_CONNECT 7
#define MDS_DISCONNECT 8
#define MDS_GETSTATUS 9
#define MDS_STATFS 10
#define MDS_GETLOVINFO 11
The update records are numbered too, to indicate their type:
#define REINT_SETATTR 1
#define REINT_CREATE 2
#define REINT_LINK 3
#define REINT_UNLINK 4
#define REINT_RENAME 5
#define REINT_RECREATE 6
Meta-data Related Wire Structures.
Many messages to the MDS contain an mds_body (see table 18).
Bytes | Name | Description |
16 | fid1 | First fid. |
16 | fid2 | Second fid. |
16 | handle | Lustre handle |
8 | size | File size. |
8 | blocks | blocks |
4 | ino | inode number |
4 | valid | Bitmap of valid fields sent in / returned. |
4 | fsuid | effective user id for file access. |
4 | fsgid | effective group id for file access. |
4 | capability | Not currently used |
4 | mode | Mode of file |
4 | uid | real user id. |
4 | gid | real group id. |
4 | mtime | Last modification time. |
4 | ctime | Last inode change time. |
4 | atime | Last access time. |
4 | flags | Flags. |
4 | rdev | device |
4 | nlink | Linkcount. |
4 | generation | Generation. |
4 | suppgid | |
4 | eadatasize | Size of extended attributes. |
TABLE 18. MDS Body
In the mds_body structure a file identifier (fid) is used to identify a file (see table 19).
Bytes | Name | Description |
8 | id | Inode id |
4 | generation | Inode generation |
4 | f_type | Inode type |
TABLE 19. File Identifier Structure
The file type is a platform-independent enumeration:
#define S_IFSOCK 0140000
#define S_IFLNK 0120000
#define S_IFREG 0100000
#define S_IFBLK 0060000
#define S_IFDIR 0040000
#define S_IFCHR 0020000
#define S_IFIFO 0010000
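Tables 18 and 19 translate into the following illustrative declarations; the names ll_fid and mds_body follow the text, but the exact layout in the sources may differ.

```c
#include <linux/types.h>

/* File identifier (Table 19). */
struct ll_fid {
        __u64 id;                 /* inode id          */
        __u32 generation;         /* inode generation  */
        __u32 f_type;             /* inode type        */
};

/* MDS body (Table 18). */
struct mds_body {
        struct ll_fid fid1;       /* 16: first fid                  */
        struct ll_fid fid2;       /* 16: second fid                 */
        __u64 handle[2];          /* 16: Lustre handle              */
        __u64 size;
        __u64 blocks;
        __u32 ino;                /* inode number                   */
        __u32 valid;              /* bitmap of valid fields         */
        __u32 fsuid, fsgid;       /* effective ids for file access  */
        __u32 capability;         /* not currently used             */
        __u32 mode;
        __u32 uid, gid;           /* real ids                       */
        __u32 mtime, ctime, atime;
        __u32 flags;
        __u32 rdev;
        __u32 nlink;
        __u32 generation;
        __u32 suppgid;
        __u32 eadatasize;         /* size of extended attributes    */
};
```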
The MDS stores the file striping information, which includes the object information, as extended attributes. It might be required to send this information across the wire for certain operations. This can be done using the variable length data structure described in table 20.
Bytes | Name | Description |
4 | lmm_magic | 0x0BD00BD0, the striping magic (reads as obdo-obdo). |
8 | lmm_object_id | The id of the object as seen by the LOV. |
4 | lmm_stripe_size | Stripe size. |
4 | lmm_stripe_offset | Stripe offset. |
2 | lmm_stripe_count | How many stripes are used for the file. |
2 | lmm_ost_count | Total number of OSTs in the cluster (determines the maximum stripe count) |
8*n | lmm_objects | An array of object id, in the order that they appear in the LOV descriptor. |
TABLE 20. Variable Length Structure
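The variable-length striping descriptor of table 20, sketched in C; the names approximate the table and should be treated as an illustration.

```c
#include <linux/types.h>

struct lov_mds_md {
        __u32 lmm_magic;          /* 0x0BD00BD0                             */
        __u64 lmm_object_id;      /* object id as seen by the LOV           */
        __u32 lmm_stripe_size;
        __u32 lmm_stripe_offset;
        __u16 lmm_stripe_count;   /* stripes used for this file             */
        __u16 lmm_ost_count;      /* OSTs in the cluster (max stripe count) */
        __u64 lmm_objects[0];     /* one object id per stripe, in LOV order */
};
```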
MDS Update Record Packet Structure.
In this section we describe the message structures for all the metadata operations that result in an update of the file metadata on the MDS. The structure of the update record depends on the operation type; all update records contain a 32-bit opcode at the beginning for identification.
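Because every update record begins with a 32-bit opcode, the server can dispatch on that field before interpreting the rest of the buffer. The following is a hypothetical sketch using the REINT_* values defined earlier; the handler functions are placeholders, not Lustre functions.

```c
#include <linux/types.h>
#include <errno.h>

#define REINT_SETATTR 1                 /* values as defined earlier        */
#define REINT_CREATE  2

/* Placeholder handlers for this sketch. */
static int handle_setattr(void *rec) { (void)rec; return 0; }
static int handle_create(void *rec)  { (void)rec; return 0; }

static int mds_reint_dispatch(void *rec)
{
        __u32 opcode = *(__u32 *)rec;   /* leading opcode of every record   */

        switch (opcode) {
        case REINT_SETATTR:
                return handle_setattr(rec);
        case REINT_CREATE:
                return handle_create(rec);
        /* REINT_LINK, REINT_UNLINK, REINT_RENAME, REINT_RECREATE ...       */
        default:
                return -EINVAL;
        }
}
```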
REINT_SETATTR.
The setattr message contains a structure containing the attributes that will be set, in a format commonly used across Unix systems as shown in table 21.
REINT_CREATE.
The create record is used to build files, symbolic links, directories, and special files. In all cases the record shown in table 22 is included, and a second buffer in the Lustre message contains the name to be created. For files this is followed by a further buffer containing striping meta-data. For symbolic links a third buffer is also present, containing the null-terminated name of the link. The reply contains only an mds_body structure along with the lustre_msg structure.
Bytes | Name | Description |
4 | sa_opcode | opcode of the update record that follows. |
4 | sa_fsuid | effective user id for file access. |
4 | sa_fsgid | effective group id for file access. |
4 | sa_cap | Not currently used |
4 | sa_reserved | Not currently used |
4 | sa_valid | Bitmap of valid fields. |
16 | sa_fid | fid of object to update. |
4 | sa_mode | Mode |
4 | sa_uid | uid |
4 | sa_gid | gid |
4 | sa_attr_flags | Flags |
8 | sa_size | Inode size. |
8 | sa_atime | atime |
8 | sa_mtime | mtime |
8 | sa_ctime | ctime |
4 | sa_suppgid | Not currently used |
TABLE 21. setattr Message Structure
Bytes | Name | Description |
4 | cr_opcode | opcode |
4 | cr_fsuid | effective user id for file access. |
4 | cr_fsgid | effective group id for file access. |
4 | cr_cap | Not currently used |
4 | sa_flags | for use with open |
4 | cr_mode | Mode |
16 | cr_fid | fid of parent. |
16 | cr_replayfid | fid of parent used to replay request. |
4 | cr_uid | uid |
4 | cr_gid | gid |
8 | cr_time | Time |
8 | cr_rdev | Raw device. |
4 | cr_suppgid | Not currently used |
TABLE 22. Create record
REINT_LINK.
The link Lustre message contains 2 fields: an mds_rec_link record described in table 23 followed by a null terminated name to be used in the target directory. The reply consists of an mds_body.
Bytes | Name | Description |
4 | lk_opcode | |
4 | lk_fsuid | effective user id for file access. |
4 | lk_fsgid | effective group id for file access. |
4 | lk_cap | Not currently used |
4 | lk_suppgid | Not currently used |
16 | lk_fid1 | fid of source. |
16 | lk_fid2 | fid of target parent. |
TABLE 23. File link Records
REINT_UNLINK.
The unlink Lustre message contains 2 fields: an mds_rec_unlink record described in table 24 followed by a null terminated name to be used in the target directory.
Bytes | Name | Description |
4 | ul_opcode | |
4 | ul_fsuid | effective user id for file access. |
4 | ul_fsgid | effective group id for file access. |
4 | ul_cap | Not currently used |
4 | ul_reserved | Not currently used |
4 | ul_mode | Mode |
4 | ul_suppgid | Not currently used |
16 | ul_fid1 | fid of source. |
16 | ul_fid2 | fid of file to be removed. |
TABLE 24. File unlink Records
The reply consists of an mds_body. Notice that ul_fid2 is superfluous, but useful for checking the correctness of the protocol.
REINT_RENAME.
The rename Lustre message contains 2 fields: an mds_rec_rename record (see table 25) followed by two null-terminated names, indicating the source and destination names. The reply consists of an mds_body.
Bytes | Name | Description |
4 | rn_opcode | |
4 | rn_fsuid | effective user id for file access. |
4 | rn_fsgid | effective group id for file access. |
4 | rn_cap | Not currently used |
4 | rn_suppgid1 | Not currently used |
4 | rn_suppgid2 | Not currently used |
16 | rn_fid1 | fid of source. |
16 | rn_fid2 | fid of target parent. |
TABLE 25. File rename Records
REINT_RECREATE.
This request is present for recovery purposes and is formatted identically to REINT_CREATE, except for the value of the opcode.
MDS Request record packet structure.
MDS_GETATTR.
The getattr request contains an mds_body. The parameters that are relevant in the request are the fid and valid fields. In WB caching mode, the attributes are retrieved by using the fid in the mds_body, but in CS mode the fid is that of the parent directory and the attributes are retrieved by a name included as a buffer in the Lustre message following the mds_body.
The reply may be followed by MDS striping data when the fid is of type S_IFREG, or it can be followed by a link name; the latter happens when the OBD_MD_LINKNAME bit is set in the valid field.
MDS_OPEN.
The open request contains an mds_fileh_body (see table 26), followed by an optional lov_stripe_md. The stripe meta-data is used to store the object identities on the MDS, in case the objects were created only at open time on the OSTs. The fid indicates what object is opened. The handle in the request is a local file handle, used to deal with re-opening files during cluster recovery.
Bytes | Name | Description |
16 | fid | File id of object to open / close. |
16 | file handle | File handle passed or returned. |
TABLE 26. File handler structure for open/close requests
The reply contains the same structure. The Lustre handle contains the remote file handle and a security token; the body has the attributes of the inode.
MDS_CLOSE.
The structure of this call is equal to that of open, although lov_stripe_md cannot currently be passed.
MDS_REINT.
The request contains a Lustre message with an update record. The update records are described above and are appended as the first buffer in the Lustre message. The reply is an mds_body.
MDS_READPAGE.
The request structure contains an mds_body. The fid1 field contains the identifier of the file; the size field gives the offset of the page to be read. The reply is simply a Lustre message.
MDS_CONNECT.
See OST_CONNECT.
MDS_DISCONNECT.
See OST_DISCONNECT.
MDS_GETSTATUS.
This call will retrieve the root fid from the MDS. The request message is a lustre_msg; the reply contains a lustre_fid, the root fid of the file system.
MDS_STATFS.
MDS_GETLOVINFO.
Client - MDS/OST recovery protocol
We have introduced a new operation in which clients ping all the servers periodically. When a server (MDS/OST) fails, all the connected clients need to participate in recovery within a given time; if they miss the recovery window, they are removed from the cluster and lose all their cached updates. The ping operation can be used by the clients to continuously check whether the servers are up. If a failover server is available, the clients need to find and connect to it. A new opcode, OBD_PING, has been introduced for this purpose; it is understood by both OST and MDS nodes. This new opcode has a value of 400 and no request or reply body (both have length 0); table 27 illustrates this message.
Bytes | Name (struct lustre_msg) | Value |
8 | Export/import handle cookie | Contains the import/export handle for the request |
4 | Magic constant | 0x0BD00BD0. |
4 | Type | PTL_RPC_MSG_REQUEST / PTL_RPC_MSG_REPLY |
4 | Lustre-msg version and protocol version | Current value 0x00040001 in little endian |
4 | Opcode | 400 |
8 | Last received counter | In replies: last transaction no for MDS/OST. |
8 | Last committed counter | In replies: last committed transaction. |
8 | Transaction number | In replies: transaction no for request. |
4 | Status | Return value of handler. |
4 | Buffer count: bufcount | 0 |
“buffer count“ * 4 | Buffer lengths buflens[] | No buffers |
TABLE 27. Lustre message for the new OBD_PING operation
On the server side, zero-to-minimal processing should be done for this new type of Lustre message. In addition, OBD_PING can be sent with a request message containing an addr and cookie of zero (no export), and should be processed without any change in that case. Specifically, it should not return -ENOTCONN for a mismatched export handle if the addr and cookie are both zero.
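A minimal sketch of how a client might fill in the ping request of table 27. The struct repeats the lustre_msg sketch from earlier so the example is self-contained; the numeric value of PTL_RPC_MSG_REQUEST is not given in this document, so the type is passed in rather than assumed.

```c
#include <linux/types.h>
#include <string.h>

#define OBD_PING         400            /* opcode from table 27            */
#define LUSTRE_MSG_MAGIC 0x0BD00BD0

/* Illustrative lustre_msg layout (see the sketch after table 6). */
struct lustre_msg {
        __u64 handle;                   /* export cookie; 0 = no export    */
        __u32 magic, type, version, opc;
        __u64 last_xid, last_committed, transno;
        __u32 status, bufcount, flags;
        /* no buflens[] and no data follow when bufcount is 0              */
};

/* Fill a ping request: no buffers, and a zero handle is acceptable;
 * the server must not answer -ENOTCONN in that case. */
static void obd_ping_pack(struct lustre_msg *msg,
                          __u32 msg_type /* PTL_RPC_MSG_REQUEST */)
{
        memset(msg, 0, sizeof(*msg));
        msg->magic    = LUSTRE_MSG_MAGIC;
        msg->type     = msg_type;
        msg->version  = 0x00040001;     /* current version, table 27       */
        msg->opc      = OBD_PING;
        msg->bufcount = 0;              /* zero-length request body        */
}
```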
Another scenario in which the pinger plays an important role is during cleanup: if clients in a cluster are shut down while they hold locks, the OSTs would otherwise have to wait a long time for timeouts to occur for all the clients before the server can assume that a client died and clean up the locks it held. In the presence of the ping operation, the OST keeps a time_last_heard parameter for every client. The server can use this variable to track when it last heard from a client; if the time exceeds a certain threshold value, the OST can mark the client as dead.
Changelog
- Version 2.2 (Apr. 2003): Radhika Vullikanti (28 Apr 2003) - Updated the structures to reflect the current protocol.
- Version 2.1 (Apr. 2003): Phil Schwan (26 Apr 2003) - Updated wire protocols to reflect changes made between Lustre versions 0.6.0.3 and 0.6.0.4 (bugs 593, 1154, 1175, and 1178). All sizes are now in bytes.
- Version 2.0 (Apr. 2003): Radhika Vullikanti (04/01/2003) - Added a new section describing the wire protocol changes made for recovery purposes.
- Version 1.5 (Jan. 2003): Radhika Vullikanti (01/31/2003) - Updated section 13.2 to reflect the changes that were made to the wire protocol for using PtlGet for bulk writes.