Architecture - Wire Level Protocol

Wire Level Protocol Description

Introduction
This chapter gives an overview of the wire formats used by Lustre. Lustre embeds its messages in Portals packets. Requests and replies are encapsulated in a Lustre message structure that is efficient to pack and unpack and shares many properties with the DAFS wire format. Lustre subsystems embed their requests and replies in Lustre messages as buffers. A message frequently contains multiple buffers, such as a request header and a pathname, or multiple requests, as in the intent locking case. Every Lustre message is encapsulated within a Portals message header, as figure 23.1.1 illustrates. The body of a Lustre message can contain multiple message buffers.

Message structure

In this chapter we describe the structure of the various message headers and the formats of the message buffers that are sent and received in Lustre for the different types of operations.

Portals message headers
As previously discussed, all Lustre network traffic is handled by Portals, which introduces its own headers. The vast majority of Lustre packets are put on the wire with PtlPut and PtlGet and are simple in structure. The Portals packet types we use are PtlPut, PtlGet, ACK, and reply packets. Lustre sends requests, replies, and bulk reads as PtlPut packets, which translate into PTL_MSG_PUT and PTL_MSG_ACK packets. Bulk writes are sent as PtlGet packets, which translate into PTL_MSG_GET and PTL_MSG_REPLY packets.

Each Lustre packet is wrapped in a Portals header. The Portals header can be viewed as consisting of two parts: one that is common to all packets and a second that depends on the type of packet (PtlGet, PtlPut, PtlAck, PtlReply). The first part has a fixed size; the size of the second part depends on the packet type and is stored within the structure for that packet type. The fields in the common Portals header have the semantic meaning in Lustre illustrated in table 1.

Some Lustre clusters use routing, but this shall not affect the content of the Portals header of any packet that is routed.

Notes:


 * 1) A Portals version field needs to be included in this header.
 * 2) All 32 bit integers will be little endian. 16 byte integers have the most significant bytes preceding the least significant.
 * 3) The nid will be the Lustre network id. In a routed network this may not coincide with the origin of the packet.
 * 4) ACK packets, which are generated when a PtlPut arrives with a nonzero ACK MD, are similar, but the fields following the ACK MD token are not used and the header is padded out to 88 bytes. Any padding must be ignored by the receiver.
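Notes 2 and 4 can be illustrated with a minimal sketch. The helper names below are hypothetical, and zero-filled padding is an assumption: the notes only require that the receiver ignore whatever the padding contains.

```python
import struct

# From note 4: an ACK header is padded out to 88 bytes.
ACK_HEADER_SIZE = 88


def pack_u32_le(value):
    """Encode a 32-bit unsigned integer in little-endian order (note 2)."""
    return struct.pack("<I", value)


def pad_header(used, total=ACK_HEADER_SIZE):
    """Pad the used portion of a header to its fixed size.

    Zero fill is an assumption made for this sketch; receivers must
    ignore the padding whatever it contains.
    """
    if len(used) > total:
        raise ValueError("header fields exceed the fixed header size")
    return used + b"\x00" * (total - len(used))
```

A receiver following note 4 would read only the fields it needs from the front of the header and skip the rest.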

The msg field is determined by the message type; each message type has a different structure. Table 2 illustrates the header for a PtlPut message.

TABLE 2. Portals PTL_MSG_PUT message Structure

Here, the xid is a request identifier that, in normal use of the system, increases by 1 for every request sent from a particular node to a service. During recovery, requests may be retransmitted, so the xid is not guaranteed to be unique. Table 3 shows the structure for a PtlGet message.
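The xid behaviour described above can be sketched as follows. This is an illustration only: the class and function names are hypothetical, and the starting value of 1 is an assumption.

```python
import itertools


class XidAllocator:
    """Hands out request identifiers that increase by 1 per new request."""

    def __init__(self, start=1):
        # The starting value is an assumption made for this sketch.
        self._counter = itertools.count(start)

    def next_xid(self):
        return next(self._counter)


def make_request(allocator, payload):
    """Build a new request, consuming a fresh xid."""
    return {"xid": allocator.next_xid(), "payload": payload}


def retransmit(request):
    """During recovery a request is sent again under its original xid,
    which is why an xid can appear on the wire more than once."""
    return dict(request)
```

Because a retransmission reuses the xid, a receiver cannot treat the xid alone as proof that a request is new.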

Table 4 shows the structure for an ACK packet.

Finally, table 5 shows the PtlReply packet:

TABLE 3. Portals PTL_MSG_GET message Structure

For more information about the precise meaning of these headers, see the Portals specification.

Lustre Messages: RPCs
Lustre uses Portals to send Lustre messages across the net. All request and reply messages are packaged as Lustre messages and embedded within the Portals header. Lustre messages have fields in the body that describe the contents of the message and assist in recovery. Bulk packets are sent as raw Portals packets.

TABLE 4. Portals PTL_MSG_ACK message Structure

TABLE 5. Portals PTL_MSG_REPLY message Structure

Lustre requests and replies (but not bulk data) fill the “packet data” field with at least a lustre_msg structure, explained in table 6. Some requests and replies have additional buffers; this is indicated by the buffer count field in the lustre_msg structure. Every Lustre message requires a corresponding reply. Each buffer, and the message itself, is padded to an 8-byte boundary, but this padding is not included in the buffer lengths.
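The padding rule can be made concrete with a short sketch. It covers only the buffers, not the lustre_msg header fields; the function names are hypothetical and zero-filled padding is an assumption.

```python
def round_up8(length):
    """Round a length up to the next multiple of 8."""
    return (length + 7) & ~7


def pack_buffers(buffers):
    """Return (lengths, body) for a list of byte buffers.

    The lengths are the unpadded sizes, as they would be recorded in
    the message; the body holds each buffer padded to an 8-byte
    boundary. Zero fill for the padding is an assumption.
    """
    lengths = [len(buf) for buf in buffers]
    body = b"".join(
        buf + b"\x00" * (round_up8(len(buf)) - len(buf)) for buf in buffers
    )
    return lengths, body
```

A receiver walks the body by advancing `round_up8(length)` bytes per buffer while exposing only the first `length` bytes of each.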

The structure of the data in the buffers of the lustre_msg structure depends on the type of operation. The following sections describe the buffer data formats for all the possible Lustre request and reply types.

The structure described above has a field to contain the last received transaction number; this might not be required in every reply. The last committed transaction number is required for recovery purposes. We might also need to add the generation number of the connection to every outgoing request to enable the target to IO fence old requests. It might also be useful to add protocol version information to every outgoing request packet. In the future, we might include a subsystem field similar to that in Sun RPC, and an authenticator.

TABLE 6. Lustre Message Structure

OSC -OST Network Protocol
In this section we describe the buffer data formats for all the requests and replies exchanged between the object storage clients (OSCs) and the targets (OSTs). There are eleven OST opcodes, and each of them has different information in the buffers:

OST_CONNECT: This is sent as a Lustre message with three buffers. The first buffer contains only the UUID of the target to be connected to. On the server side this UUID is translated into the target device. The second buffer is a UUID identifying the connecting client. The target instantiates a unique export structure for the client, which is passed in on every further request. The third buffer holds the import handle, which is sent back to the client for lock callbacks.

OST_DISCONNECT: This is a Lustre message without any additional buffers. The server tears down the export specified in the lustre_msg structure.

OST_GETATTR, OST_SETATTR, OST_OPEN, OST_CLOSE, OST_CREATE, OST_DESTROY, OST_PUNCH: These OST requests have the structure shown in table 7 in a single message buffer (hereafter referred to as the OST body):

TABLE 7. OST request structure

An OBD object, as illustrated in table 8, is similar to a cross-platform inode. It describes the attributes of a given object. The valid flag is used to indicate the fields that are relevant for a request. As an example, the OST_SETATTR will use the valid flag to indicate which attributes need to be written to the device.
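How a valid flag selects the attributes that matter can be sketched as a bitmask check. The flag names and bit values below are hypothetical and chosen only for illustration; the real bits are not listed in this chapter.

```python
from enum import IntFlag


class Valid(IntFlag):
    """Hypothetical valid bits; names and values are illustrative only."""
    SIZE = 0x1
    MODE = 0x2
    MTIME = 0x4


def attributes_to_write(valid, attrs):
    """Return only the attributes whose valid bit is set in the request."""
    selected = {}
    for flag in Valid:
        key = flag.name.lower()
        if valid & flag and key in attrs:
            selected[key] = attrs[key]
    return selected
```

With this shape, a setattr-style request that sets only the SIZE and MTIME bits leaves the stored mode untouched even if the object in the request carries a mode value.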

OST_READ, OST_WRITE: These requests have a different structure. The first buffer in the Lustre message is a network IO object. This is followed by an array of remote niobufs in buffer 2. There is one IO object (see table 9) and one niobuf (see table 11) per file extent.

When an OST_READ request is received, the data is sent to Portals match entries equal to the xid given in the niobuf_remote structure. In case of reads, each niobuf includes a return code that indicates success, failure, or errors on a per-page basis, as shown in table 10.

For an OST_WRITE request, buffers with such match bits are prepared by the client so that the server can get data from the buffers. The bulk data described in those structures is sent as a standard Portals packet, without any Lustre RPC header.

OST_STATFS: This function has one reply message buffer which contains a struct obd_statfs. The contents are shown in table 12.

The server should fill in at least the critical fields, relating to the number of free and total file objects and blocks, and zero-fill the unused fields.

TABLE 8. OBD object

TABLE 9. IO object

Lustre DLM Network Protocol
The Lustre lock manager has 3 regular calls and 2 callbacks. The regular calls are sent on the same portal as the affected service; for example, meta-data lock requests are sent to the MDS portal. The callbacks are sent to the portal reserved for DLM RPCs. Every request to the lock manager has at least a single buffer containing the ldlm_request structure shown in table 13, or the ldlm_reply structure (see table 17).

Lustre lock request structure.
Any lock request in Lustre consists of at least the ldlm_request structure (see table 13).

TABLE 10. Niobuf_local

TABLE 11. Niobuf

TABLE 12. OBD Status structure

TABLE 13. The lock request structure

As shown in table 13, every lock request contains a lock description structure, shown in table 16. This structure has several sub-components. It contains a struct ldlm_extent (see table 14) that describes the file extent covered by the lock.

TABLE 14. Lock extent descriptor

Secondly, we have resource descriptors, struct ldlm_resource_desc (see table 15), which describe the resource for which a lock is requested. This is an unaligned structure, which is all right as long as it is used only in the ldlm_request structure.

TABLE 15. Lock resource descriptor

TABLE 16. Lock descriptor

Lustre lock reply structure.
The reply message contains a reply (see table 17).

Message structures for the various locking operations.
In the following sections we will describe the message structures for the various locking operations supported in Lustre.

LDLM_ENQUEUE.
This message is used to obtain a new lock. The Lustre message contains a single buffer with a struct ldlm_request.

LDLM_CANCEL.
This message cancels an existing lock. It places a struct ldlm_request in the Lustre message, but only uses the lock_handle1 part of the request (we will shrink this in the future). The reply contains just a lustre_msg.

TABLE 17. Reply for a lock request

LDLM_CONVERT
This message converts the lock type of an existing lock. The request contains an ldlm request structure, as in enqueue. The requested mode field contains the mode requested after conversion. An ldlm_reply message is returned to the client.

LDLM_BL_CALLBACK.
This message is sent by the lock server to the client to indicate that a lock held by the client is blocking another lock request. This sends a struct ldlm_request with the attributes of the blocked lock in lock_desc.

LDLM_CP_CALLBACK.
This message is sent by the lock server to the client to indicate that a prior unfulfilled lock request is now being granted. This too sends a struct ldlm_request with the attributes of the granted lock in lock_desc. Note that these attributes may differ from those that the client originally requested, in particular the resource name and lock mode.

Client / Meta-data Server
The client meta-data network protocol consists of just a few calls. Again, we first explain the components that make up the Lustre messages and then turn to the network structure of the individual requests. The MDC-MDS protocol has significant similarity with the OSC-OST protocol.

Messages have the following Portals related attributes:
 * 1) Destination portal for requests: MDS_REQUEST_PORTAL
 * 2) Reply packets go to: MDC_REPLY_PORTAL
 * 3) Readdir bulk packets travel to: MDC_BULK_PORTAL

A few other constants are important. We have a sequence of call numbers:
 * #define MDS_GETATTR 1
 * #define MDS_OPEN 2
 * #define MDS_CLOSE 3
 * #define MDS_REINT 4
 * #define MDS_READPAGE 6
 * #define MDS_CONNECT 7
 * #define MDS_DISCONNECT 8
 * #define MDS_GETSTATUS 9
 * #define MDS_STATFS 10
 * #define MDS_GETLOVINFO 11
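For tooling that decodes traces, the call numbers listed above can be mirrored in an enumeration. The values are copied from the list, including the gap at 5; the `opcode_name` helper is a hypothetical convenience, not part of the protocol.

```python
from enum import IntEnum


class MdsOpcode(IntEnum):
    """MDS call numbers as listed above (5 is not assigned in the list)."""
    MDS_GETATTR = 1
    MDS_OPEN = 2
    MDS_CLOSE = 3
    MDS_REINT = 4
    MDS_READPAGE = 6
    MDS_CONNECT = 7
    MDS_DISCONNECT = 8
    MDS_GETSTATUS = 9
    MDS_STATFS = 10
    MDS_GETLOVINFO = 11


def opcode_name(value):
    """Map a numeric opcode to its name, tolerating unknown values."""
    try:
        return MdsOpcode(value).name
    except ValueError:
        return "UNKNOWN({})".format(value)
```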

The update records are numbered too, to indicate their type:
 * #define REINT_SETATTR 1
 * #define REINT_CREATE 2
 * #define REINT_LINK 3
 * #define REINT_UNLINK 4
 * #define REINT_RENAME 5
 * #define REINT_RECREATE 6

Meta-data Related Wire Structures.
As indicated in table 1, many messages to MDS contain an mds_body (see table 18).

TABLE 18. MDS Body

In the mds_body structure a file identifier is used to identify a file (see table 19).

TABLE 19. File Identifier Structure

The file type is a platform-independent enumeration:
 * #define S_IFSOCK 0140000
 * #define S_IFLNK 0120000
 * #define S_IFREG 0100000
 * #define S_IFBLK 0060000
 * #define S_IFDIR 0040000
 * #define S_IFCHR 0020000
 * #define S_IFIFO 0010000
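The file type values above are the familiar Unix type bits, so a type can be recovered from a mode word by masking. In this sketch the constants are copied from the list; the mask 0170000 is the conventional S_IFMT value rather than something stated in this chapter, and the descriptive strings are illustrative.

```python
# File type values as listed above (octal).
S_IFSOCK = 0o140000
S_IFLNK = 0o120000
S_IFREG = 0o100000
S_IFBLK = 0o060000
S_IFDIR = 0o040000
S_IFCHR = 0o020000
S_IFIFO = 0o010000

# Conventional Unix file type mask; not listed in this chapter.
S_IFMT = 0o170000

FILE_TYPE_NAMES = {
    S_IFSOCK: "socket",
    S_IFLNK: "symbolic link",
    S_IFREG: "regular file",
    S_IFBLK: "block device",
    S_IFDIR: "directory",
    S_IFCHR: "character device",
    S_IFIFO: "fifo",
}


def file_type(mode):
    """Return a description of the file type encoded in a mode word."""
    return FILE_TYPE_NAMES.get(mode & S_IFMT, "unknown")
```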

The MDS stores the file striping information, which includes the object information, as extended attributes. It might be necessary to send this information across the wire for certain operations. This can be done using the variable-length data structure described in table 20.

TABLE 20. Variable Length Structure

MDS Update Record Packet Structure.
In this section we describe the message structures for all the metadata operations that update file metadata on the MDS. The structure of the update record depends on the operation type; all update records contain a 32-bit opcode at the beginning for identification.
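Because every update record starts with a 32-bit opcode, a receiver can identify the record type before interpreting the rest. The sketch below assumes the opcode is little endian, following the note on 32-bit integers in the Portals header section; the function name is hypothetical, and the opcode values are the REINT numbers given earlier.

```python
import struct
from enum import IntEnum


class ReintOpcode(IntEnum):
    """Update record types as numbered earlier in this chapter."""
    REINT_SETATTR = 1
    REINT_CREATE = 2
    REINT_LINK = 3
    REINT_UNLINK = 4
    REINT_RENAME = 5
    REINT_RECREATE = 6


def reint_opcode(record):
    """Read the leading 32-bit opcode of an update record.

    Little-endian byte order is an assumption carried over from the
    Portals header notes. Raises ValueError for short or unknown records.
    """
    if len(record) < 4:
        raise ValueError("update record shorter than its opcode")
    (value,) = struct.unpack_from("<I", record, 0)
    return ReintOpcode(value)
```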

REINT_SETATTR.
The setattr message contains a structure containing the attributes that will be set, in a format commonly used across Unix systems as shown in table 21.

REINT_CREATE.
The create record is used to build files, symbolic links, directories, and special files. In all cases the record shown in table 22 is included, and a second buffer in the Lustre message contains the name to be created. For files this is followed by a further buffer containing striping meta-data. For symbolic links a third buffer is also present, containing the null-terminated name of the link. The reply contains only an mds_body structure along with the lustre_msg structure.

TABLE 21. setattr Message Structure

TABLE 22. Create record

REINT_LINK.
The link Lustre message contains 2 fields: an mds_rec_link record, described in table 23, followed by a null-terminated name to be used in the target directory. The reply consists of an mds_body.

TABLE 23. File link Records

REINT_UNLINK.
The unlink Lustre message contains 2 fields: an mds_rec_unlink record, described in table 24, followed by a null-terminated name to be used in the target directory. The reply consists of an mds_body. Note that lk_fid2 is superfluous, but useful for checking the correctness of the protocol.

TABLE 24. File unlink Records

REINT_RENAME.
The rename Lustre message contains 2 fields: an mds_rec_rename record (see table 25) followed by two null-terminated names, indicating the source and destination name. The reply consists of an mds_body.
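The name handling for a rename can be sketched as below. The record is treated as opaque bytes because its layout is given in table 25. Placing each name in its own buffer and encoding names as UTF-8 are assumptions of this sketch, and the function names are hypothetical.

```python
def nul_terminated(name):
    """Encode a name as bytes with a single trailing NUL.

    UTF-8 is an assumption; the protocol text only requires that the
    name be null terminated.
    """
    data = name.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("name must not contain an embedded NUL")
    return data + b"\x00"


def rename_buffers(record, source, target):
    """Return the buffers for a rename: the opaque record, then the
    source and destination names. One buffer per name is an assumption."""
    return [record, nul_terminated(source), nul_terminated(target)]
```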

REINT_RECREATE.
This request is present for recovery purposes and is formatted identically to REINT_CREATE, except for the value of the opcode.

MDS_GETATTR.
The getattr request contains an mds_body. The parameters that are relevant in the request are the fid and valid fields. In WB caching mode, the attributes are retrieved by using the fid in the mds_body, but in CS mode the fid is that of the parent directory and the attributes are retrieved by a name included as a buffer in the Lustre message following the mds_body. The reply may be followed by MDS striping data in the case of a fid of type S_IFREG, or by a link name; the latter happens when the OBD_MD_LINKNAME bit is set in the valid field.

TABLE 25. File rename Records

MDS_OPEN.
The open request contains an mds_fileh_body (see table 26), followed by an optional lov_stripe_md. The stripe meta-data is used to store the object identities on the MDS, in case the objects were created only at open time on the OSTs. The fid indicates which object is opened. The handle in the request is a local file handle, used to deal with re-opening files during cluster recovery.

TABLE 26. File handler structure for open/close requests

The reply contains the same structure. The Lustre handle contains the remote file handle and a security token; the body has the attributes of the inode.

MDS_CLOSE.
The structure of this call is the same as that of open, although lov_stripe_md cannot currently be passed.

MDS_REINT.
The message structure contains an empty Lustre message with an update record. The update records are described above and appended as the first buffer in the Lustre message. The reply is an mds_body.

MDS_READPAGE.
The request structure contains an mds_body. The fid1 field contains the identifier of the file; the size field gives the offset of the page to be read. The reply is simply a Lustre message.

MDS_CONNECT.
See OST_CONNECT.

MDS_DISCONNECT.
See OST_DISCONNECT.

MDS_GETSTATUS.
This call will retrieve the root fid from the MDS. The request message is a lustre_msg; the reply contains a lustre_fid, the root fid of the file system.

Client -MDS/OST recovery protocol
We have introduced a new operation in which clients ping all the servers periodically. When a server (MDS/OST) fails, all the connected clients need to participate in recovery within a given time; if they miss the recovery window, they are removed from the cluster and lose all their cached updates. Clients can use the ping operation to continuously check whether the servers are up. If a failover server is available, the clients need to find and connect to it. A new opcode, OBD_PING, has been introduced for this purpose and is understood by both OST and MDS nodes. This opcode has a value of 400 and no request or reply body (both have length 0), as table 27 illustrates.

TABLE 27. Lustre message for the new OBD_PING operation

On the server side, zero-to-minimal processing should be done for this new type of Lustre message. In addition, OBD_PING can be sent with a request message containing an addr and cookie of zero (no export), and should be processed without any change in that case. Specifically, it should not return -ENOTCONN for a mismatched export handle if the addr and cookie are both zero.

Another scenario in which the pinger plays an important role is during cleanup. If clients in a cluster are shut down while they hold locks, the OSTs have to wait a long time for timeouts to occur for all the clients; only at that point can the server assume that a client died and clean up the locks held by it. With the ping operation, on the other hand, the OST keeps track of a time_last_heard parameter for every client. The server can use this variable to track when it last heard from a client; if the elapsed time exceeds a certain threshold, the OSTs can mark the client as dead.
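The server-side ping rules above can be sketched as a small tracker. This is a simplified model, not the server implementation: the class, its methods, and the representation of exports as (addr, cookie) pairs are hypothetical, and recording time_last_heard for pings that carry no export is an assumption.

```python
import errno

OBD_PING = 400  # opcode value given in the text


class PingTracker:
    """Tracks time_last_heard per client and flags silent clients."""

    def __init__(self, threshold):
        self.threshold = threshold      # seconds of silence tolerated
        self.exports = set()            # known (addr, cookie) handles
        self.time_last_heard = {}       # client id -> last ping time

    def handle_ping(self, client, addr, cookie, now):
        """Process an OBD_PING; the request and reply carry no body."""
        zero_handle = addr == 0 and cookie == 0
        if not zero_handle and (addr, cookie) not in self.exports:
            # A mismatched export handle is only an error when the
            # handle is non-zero.
            return -errno.ENOTCONN
        self.time_last_heard[client] = now
        return 0

    def dead_clients(self, now):
        """Clients not heard from within the threshold."""
        return sorted(
            client
            for client, seen in self.time_last_heard.items()
            if now - seen > self.threshold
        )
```

A server using this model would call `dead_clients` periodically and clean up the locks of whatever clients it returns, rather than waiting for individual lock timeouts.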

Changelog
Version 2.2 (Apr. 2003)
 * 1) Radhika Vullikanti (28 Apr 2003) - Updated the structures to reflect the current protocol.

Version 2.1 (Apr. 2003)
 * 1) Phil Schwan (26 Apr 2003) - Updated wire protocols to reflect changes made between Lustre versions 0.6.0.3 and 0.6.0.4 (bugs 593, 1154, 1175, and 1178). All sizes are now in bytes.

Version 2.0 (Apr. 2003)
 * 1) Radhika Vullikanti (04/01/2003) - Added a new section describing the wire protocol changes made for recovery purposes.

Version 1.5 (Jan. 2003)
 * 1) Radhika Vullikanti (01/31/2003) - Updated section 13.2 to reflect the changes that were made to the wire protocol for using ptlget for bulk writes.