Architecture - Interoperability fids zfs

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain both outdated information and unimplemented functionality.

Summary
This document describes an architecture for client, server, network, and storage interoperability during migration from 1.6-based, fidless Lustre clusters that use ldiskfs as a back-end file system to clusters based on fids and the zfs file system.

Definitions
As release numbers and numbering schemas are in flux, the description below uses symbolic names for various important points in Lustre development.


 * OLD : any major release in b1_6 line of development. This might end up being 1.6.something, or 1.7.
 * OLD.x : a release in b1_6 line containing a client that is able to interact with a NEW.0 md server. (Tentatively 1.8.)
 * NEW.0 : first release based on HEAD. It features a kernel server, and uses ldiskfs as a back-end. This is (tentatively) 2.0. It is important to note that NEW.0 is a temporary intermediate release whose purpose is to effect the transition from ldiskfs-based to DMU-based clusters.
 * NEW.1 : next release based on HEAD. This release introduces support for fids on OST, and DMU as a back-end, in addition to continued support for ldiskfs. This is (tentatively) 2.x.
 * OLD protocol : b1_6 wire network protocol.
 * NEW protocol : wire protocol using fids for object identification.
 * OLD storage, OLD file system : back-end file system of type ldiskfs.
 * DMU storage : back-end file system implemented through DMU.
 * fill-in-fid : a special, otherwise unused fid value, reserved to indicate in a CREATE RPC that the client requests the server to generate a fid for the newly created object on the client's behalf. This fid is taken from one of the system-reserved fid sequences.

Requirements

 * +-1 rule : adhere to the Lustre promise of maintaining interoperability one release back and forth.
 * downgrade : users are able to abandon the upgrade and return to the old cluster configuration up to a well-defined point of no-return, when a decision is made to proceed forward. After that point downgrade is still possible, on the condition that (potentially) all file system modifications made after the point of no-return are lost.
 * rolling upgrade : an upgrade (and downgrade) is performed in a piecemeal fashion, one node after another.
 * continuity : where possible, upgrade and downgrade do not disrupt ongoing operations. Client upgrade or downgrade obviously requires a client remount. Server upgrade and downgrade look like a server fail-over, with client operations continuing.
 * no stop-the-world : the migration path cannot require the whole cluster to be stopped for a prolonged amount of time (e.g., to migrate all data to the new format).
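
The +-1 rule can be illustrated with a minimal sketch. It assumes that the four symbolic releases form a simple linear order (OLD, OLD.x, NEW.0, NEW.1) and that "one release back and forth" means a distance of at most one step in that order. The enum and function names are hypothetical and exist only for this illustration; the actual compatibility matrix may be more detailed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Symbolic releases, in the order they appear on the migration path.
 * The numeric values are an assumption made for this sketch only. */
enum sketch_release {
    SKETCH_REL_OLD = 0,
    SKETCH_REL_OLD_X,
    SKETCH_REL_NEW_0,
    SKETCH_REL_NEW_1
};

/* +-1 rule: two nodes are expected to interoperate when their
 * releases are at most one step apart. */
bool sketch_interoperable(enum sketch_release a, enum sketch_release b)
{
    return abs((int)a - (int)b) <= 1;
}
```

Under this reading, an OLD.x client can talk to a NEW.0 MDT, while a plain OLD client cannot, which matches the requirement that all clients reach OLD.x before the NEW.0 md server is installed.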

Compatibility matrix
Legend


 * C : client
 * O : OSS
 * M : MDT
 * X : given version supports given format or protocol
 * - : given version does not support given format or protocol
 * gray area : impossible combination

Migration path
The following upgrade path is envisaged:


 * starting with OLD version installed on the cluster...
 * OLD.x release is installed, making clients upward compatible with the NEW.0 MDT server. This step can be undone without loss of functionality or availability.
 * all clients are upgraded to OLD.x.
 * NEW.0 md server is installed, and the original (OLD.x) md server is failed over to it. Clients can continue without evictions. This step can be undone with a minor loss of availability (e.g., evictions during downgrade).
 * NEW.0 release is installed on client and OSS nodes. A client has to unmount and remount the file system to continue with the new release. This step can be undone with a minor loss of availability (again, unmount followed by remount to revert to the old release).
 * clients and OSTs are upgraded to NEW.1 release. At that moment, no OLD code is running in the cluster, but all data and meta-data are still stored in the OLD format, except for redundant information (such as the object index and fids in EA) that is not used by the OLD server.
 * MDT fails over to NEW.1. On reconnect, OSTs switch to NEW protocol. At this moment, all networking traffic is in NEW protocol.
 * NEW.1 DMU-based OSTs are formatted and added to the cluster.
 * online migration of data starts. This step can be undone without loss of functionality or availability.
 * NEW.1 DMU MDT is formatted. Magic meta-data migration tool is invoked. Q: not clear yet. Downgrade?
 * once meta-data are migrated to the NEW.1, upgrade is complete.

Use Cases
NEW.0 MDT handles...

NEW.0 OST handles ...

Quality Attribute Scenarios

 * old.x-client


 * new.1-ost


 * mdt.upgrade.0


 * mdt.upgrade.0.client


 * mdt.upgrade.1.ost


 * mdt.downgrade.0


 * mdt.downgrade.0.client


 * mdt.downgrade.1


 * mdt.downgrade.1.ost


 * mdt.lookup.old


 * mdt.lookup.new


 * mdt.create


 * mdt.readdir

Technical Details [not part of architecture, should go into HLD/DLD]
Brief outline of features relevant to interoperability and not mentioned above, supported and expected from the releases above:

OLD.x

 * OLD.x: client and OST support both OLD and NEW networking protocols. Protocol version is selected at the time of connection to MDT: if MDT supports the OBD_CONNECT_FID connect flag, NEW protocol is used, otherwise OLD.
 * once an OLD.x node (client or OST) has connected to MDT in NEW mode, it ensures that all its other connections are in this mode too. OST adds the OBD_CONNECT_FID flag to its connection mask.
 * when connected in NEW mode, an OLD.x client
   * uses fids to identify inodes in the cache (for uniformity, it can internally use igifs, generated from ino/gen pairs, in the OLD mode too). Inode numbers for stat(2) are generated from fids [done for HEAD, being ported to b1_6_cli_reqs];
   * expects cmd3-style directory pages in readdir with fids in directory entries [done];
   * takes dlm locks in fid name-space [done];
   * participates in cmd3 recovery protocol, more on this below [being implemented by Amit];
   * uses seq and fld services [done];
 * when, on a re-connect, an OLD.x client detects that the connection has lost the OBD_CONNECT_FID flag that it used to have, it evicts itself to get rid of all extra fid-related state.
 * No interoperability changes to the MD server code are made in OLD.x release.
 * OLD.x OST servers also support both OLD and NEW networking protocol, and depending on the MDS connection flags either use fids or not. In fid-enabled mode, they act much like clients (see above) in their interaction with MDT. To support NEW protocol OST has to generate fids for objects already existing on the storage. Resulting surrogate fids are called idifs (igifs for data, see igif description below). [not started yet]
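
The connect-time behaviour of an OLD.x node described in the list above can be sketched as follows. This is an illustration only: the flag value SKETCH_CONNECT_FID stands in for OBD_CONNECT_FID (whose real value lives in the Lustre headers and is not reproduced here), and all function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for OBD_CONNECT_FID; the value is an assumption. */
#define SKETCH_CONNECT_FID (1ULL << 0)

enum sketch_proto  { SKETCH_PROTO_OLD, SKETCH_PROTO_NEW };
enum sketch_action { SKETCH_CONTINUE, SKETCH_SELF_EVICT };

/* Protocol is chosen from the connect flags granted by the MDT. */
enum sketch_proto sketch_select_proto(uint64_t mdt_flags)
{
    return (mdt_flags & SKETCH_CONNECT_FID) ? SKETCH_PROTO_NEW
                                            : SKETCH_PROTO_OLD;
}

/* An OST that talks NEW protocol to the MDT advertises the FID flag
 * in its own connection mask. */
uint64_t sketch_ost_connect_mask(uint64_t base_mask,
                                 enum sketch_proto mdt_proto)
{
    if (mdt_proto == SKETCH_PROTO_NEW)
        return base_mask | SKETCH_CONNECT_FID;
    return base_mask;
}

/* On re-connect, losing a previously granted FID flag forces the
 * client to evict itself and drop its fid-related state. */
enum sketch_action sketch_on_reconnect(uint64_t old_flags,
                                       uint64_t new_flags)
{
    bool had = (old_flags & SKETCH_CONNECT_FID) != 0;
    bool has = (new_flags & SKETCH_CONNECT_FID) != 0;

    return (had && !has) ? SKETCH_SELF_EVICT : SKETCH_CONTINUE;
}
```

Gaining the flag on re-connect (MDT upgraded) is treated as a normal continuation here; the replay conversion that case requires is discussed under Recovery.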

NEW.0
This release introduces MDT server speaking NEW protocol only, and running over OLD-format storage. OST server speaking NEW protocol was introduced in the previous OLD.x release. Support for old protocol is completely eliminated in this release.

To talk in the NEW protocol, a server has to use FIDs to identify objects, so the NEW.0 MDT generates surrogate FIDs for existing inodes. Such a surrogate FID is referred to as an IGIF (inode-generation FID), because it is built from inode number and inode generation. Similarly, the NEW.0 OST generates surrogate FIDs for existing id/group objects. The format of IGIF and IDIF is described below:

Legend:
 * FID : File IDentifier generated by client from range allocated by the seq service. First 0x400 sequences [2^33, 2^33 + 0x400] are reserved for system use. Note that on ldiskfs MDTs IGIF FIDs can use inode numbers starting at 12, but this is in the IGIF SEQ range and does not conflict with assigned FIDs.


 * IGIF : Inode and Generation In FID, a surrogate FID used to globally identify an existing object on OLD formatted MDT file system. This would only be used on MDT0 in a DNE filesystem, because there are not expected to be any OLD formatted DNE filesystems. Belongs to a sequence in [12, 2^32 - 1] range, where sequence number is inode number, and inode generation is used as OID. NOTE: This assumes no more than 2^32 - 1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend. NOTE: This assumes that the reserved ext3/ext4/ldiskfs inode numbers [0-11] are never visible to clients, which has always been true.

 * IDIF : object ID in FID, a surrogate FID used to globally identify an existing object on OLD formatted OST file system. Belongs to a sequence in [2^32, 2^33 - 1]. Sequence number is calculated as:
     SEQ = 1 << 32 | (ost_index << 16) | ((objid >> 32) & 0xffff)
   that is, SEQ consists of 16-bit OST index, and higher 16 bits of object ID. The generation of unique SEQ values per OST allows the IDIF FIDs to be identified in the FLD correctly. The OID field is calculated as:
     OID = objid & 0xffffffff
   that is, it consists of lower 32 bits of object ID. NOTE: This assumes that no more than 2^48 - 1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (can uniquely map all objects on an OST that created 1M objects per second for 9 years, or combinations thereof).


 * OST_MDT0 : Surrogate FID used to identify an existing object on OLD formatted OST filesystem. Belongs to the reserved sequence 0, and is used internally prior to the introduction of FID-on-OST, at which point IDIF will be used to identify objects as residing on a specific OST.


 * LLOG : for Lustre Log objects the object sequence 1 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.


 * ECHO : for testing OST IO performance the object sequence 2 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.


 * OST_MDT1 .. OST_MAX : for testing with multiple MDTs, object sequences 3 through 9 are used, allowing direct mapping of MDTs 1 through 7 respectively, for a total of 8 MDTs including OST_MDT0. This matches the legacy CMD project "group" mappings. However, this SEQ range is only for testing prior to any production DNE release, because the objects in this range conflict across all OSTs, as the OST index is not part of the FID.
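
The IGIF and IDIF layouts from the legend above can be written out as a minimal sketch. The struct and function names are hypothetical; only the sequence ranges and the SEQ/OID calculations are taken from the legend. A 64-bit constant is used for the shift, since shifting a plain int by 32 bits is not valid C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory FID layout used only for this sketch. */
struct sketch_fid {
    uint64_t f_seq;   /* sequence number */
    uint32_t f_oid;   /* object id within the sequence */
    uint32_t f_ver;   /* version, unused here */
};

enum sketch_fid_kind {
    SKETCH_KIND_RESERVED,   /* SEQ 0..11: OST_MDT0, LLOG, ECHO, ... */
    SKETCH_KIND_IGIF,       /* SEQ 12 .. 2^32 - 1 */
    SKETCH_KIND_IDIF,       /* SEQ 2^32 .. 2^33 - 1 */
    SKETCH_KIND_FID         /* SEQ >= 2^33: system-reserved or allocated */
};

/* Classify a FID by the range its sequence number falls into. */
enum sketch_fid_kind sketch_fid_classify(uint64_t seq)
{
    if (seq < 12)
        return SKETCH_KIND_RESERVED;
    if (seq < (1ULL << 32))
        return SKETCH_KIND_IGIF;
    if (seq < (1ULL << 33))
        return SKETCH_KIND_IDIF;
    return SKETCH_KIND_FID;
}

/* IGIF: SEQ is the inode number, OID is the inode generation. */
struct sketch_fid sketch_igif_build(uint32_t ino, uint32_t gen)
{
    assert(ino >= 12);   /* reserved ldiskfs inodes are never exposed */
    struct sketch_fid fid = { .f_seq = ino, .f_oid = gen, .f_ver = 0 };
    return fid;
}

/* IDIF: SEQ = 1 << 32 | ost_index << 16 | objid bits 32..47,
 *       OID = objid bits 0..31. */
struct sketch_fid sketch_idif_build(uint16_t ost_index, uint64_t objid)
{
    assert(objid < (1ULL << 48));   /* the 2^48 - 1 assumption */
    struct sketch_fid fid = {
        .f_seq = (1ULL << 32) | ((uint64_t)ost_index << 16) |
                 ((objid >> 32) & 0xffff),
        .f_oid = (uint32_t)(objid & 0xffffffff),
        .f_ver = 0,
    };
    return fid;
}
```

For example, object 0x123456789abc on OST index 3 would get SEQ 0x100031234 and OID 0x56789abc under this layout.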

For compatibility with existing OLD OST network protocol structures, the FID must map onto the o_id and o_gr in a manner that ensures existing objects are identified consistently for IO, as well as onto the lock namespace to ensure both IDIFs map onto the same objects for IO as well as resources in the DLM.

DLM OLD OBIF/IDIF: resource[] = {o_id, o_seq, 0, 0}; /* o_seq == 0 for production releases */

DLM NEW.1 FID (this is the same for both the MDT and OST): resource[] = {SEQ, OID, VER, HASH};

Note that for mapping IDIF values to DLM resource names the o_id may be larger than the 2^33 reserved sequence numbers for IDIF, so it is possible for the o_id numbers to overlap FID SEQ numbers in the resource. However, in all production releases the OLD o_seq field is always zero, and all valid FID OID values are non-zero, so the lock resources will not collide.
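
The non-collision argument can be checked with a small sketch that builds both kinds of resource names and compares them. The struct and function names are hypothetical; the four-element layout and the "o_seq is zero, OID is non-zero" invariants are the ones stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a DLM resource name with four slots. */
struct sketch_res_name {
    uint64_t name[4];
};

/* OLD mapping: resource[] = {o_id, o_seq, 0, 0}. */
struct sketch_res_name sketch_res_from_old(uint64_t o_id, uint64_t o_seq)
{
    struct sketch_res_name res = { { o_id, o_seq, 0, 0 } };
    return res;
}

/* NEW.1 mapping: resource[] = {SEQ, OID, VER, HASH}. */
struct sketch_res_name sketch_res_from_fid(uint64_t seq, uint32_t oid,
                                           uint32_t ver, uint64_t hash)
{
    assert(oid != 0);   /* valid FID OIDs are non-zero */
    struct sketch_res_name res = { { seq, oid, ver, hash } };
    return res;
}

/* Two resource names collide only if all four slots match. */
bool sketch_res_equal(const struct sketch_res_name *a,
                      const struct sketch_res_name *b)
{
    return memcmp(a->name, b->name, sizeof(a->name)) == 0;
}
```

Even when an OLD o_id happens to equal a FID SEQ, the second slot differs (zero versus a non-zero OID), so the two names never compare equal.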

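For objects within the IDIF range, group extraction (non-CMD) inverts the IDIF SEQ and OID calculation given above, taking the low 16 bits of SEQ as the higher 16 bits of the object ID: o_id = ((fid->f_seq & 0xffff) << 32) | fid->f_oid; o_seq = 0; /* formerly group number */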
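
A minimal sketch of this extraction follows. It assumes that the object ID is recovered by inverting the IDIF encoding from the legend (low 16 bits of SEQ become bits 32..47 of the object ID, OID supplies bits 0..31). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of OLD-protocol identifiers. */
struct sketch_ost_id {
    uint64_t o_id;    /* object id */
    uint64_t o_seq;   /* formerly group number, 0 in production */
};

/* Recover o_id/o_seq from an IDIF given as (f_seq, f_oid). */
struct sketch_ost_id sketch_idif_to_ostid(uint64_t f_seq, uint32_t f_oid)
{
    /* Only sequences in [2^32, 2^33 - 1] are IDIFs. */
    assert(f_seq >= (1ULL << 32) && f_seq < (1ULL << 33));

    struct sketch_ost_id id = {
        .o_id  = ((f_seq & 0xffff) << 32) | f_oid,
        .o_seq = 0,
    };
    return id;
}

/* The OST index occupies bits 16..31 of an IDIF sequence. */
uint16_t sketch_idif_ost_index(uint64_t f_seq)
{
    return (uint16_t)((f_seq >> 16) & 0xffff);
}
```

With SEQ 0x100031234 and OID 0x56789abc this yields o_id 0x123456789abc on OST index 3, so the same object is addressed for IO whether it is named by IDIF or by the OLD identifiers.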

Recovery
There are 2 important recovery scenarios related to interoperability:


 * OLD.x client reconnects to MDT after a fail-over and learns that it has to switch back to the OLD protocol, because the server was downgraded. The client has to replay requests, but before that they have to be converted into OLD protocol format. This requires changing the message format and going from client-assigned FIDs to inode/generation numbers (storage cookies). If a FID is in IGIF format, it can be converted to an inode number by reversing the IGIF generation algorithm. If a FID is client-generated, then *KABOOM*! The client has to evict itself, because it doesn't know the old-format inode number. Q: Is there a better solution? What to do with RPCs that the old server cannot handle at all, such as SEQ_QUERY? Again, eviction seems to be the only option.


 * OLD.x client reconnects to MDT and determines that it has to switch to the NEW protocol, because MDT was upgraded to NEW.0. To replay RPCs, the client has to convert them to the NEW format. This includes message format conversion and going from inode/generation numbers to FIDs. For RPCs that already include an inode number as an argument, an IGIF FID can be used. For a CREATE RPC, which requires a fid in the NEW protocol, there are two options:
   * client supplies fill-in-FID. NEW.0 server recognizes this as a request to generate a FID on the server, and allocates the FID from a special sequence range reserved for this purpose. Note that this sequence cannot be exhausted, as there is a single MDT in the cluster at that point, which means it has full control over the complete FID space.
   * client supplies the inode number as in usual OLD protocol replay. Server detects this and creates an inode with the given inode number. This has certain drawbacks:
     * a dependency on the ext3-wantedi patch is re-introduced, and
     * backward-compatibility code is introduced in the NEW.0 release, which we are trying to avoid.
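
The identifier conversions in the two recovery scenarios can be sketched as follows. This is an illustration under stated assumptions: the struct, enum, and function names are hypothetical, and the fill-in-FID value is a placeholder chosen from the system-reserved range, since this document does not specify the actual value. Message format conversion is not covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* PLACEHOLDER fill-in-FID: some value inside the system-reserved
 * sequences starting at 2^33. The real value is not given here. */
#define SKETCH_FILL_IN_SEQ ((1ULL << 33) + 1)
#define SKETCH_FILL_IN_OID 1U

enum sketch_replay { SKETCH_REPLAY_CONVERTED, SKETCH_REPLAY_EVICT };

/* IGIF sequences are inode numbers in [12, 2^32 - 1]. */
bool sketch_fid_is_igif(const struct sketch_fid *fid)
{
    return fid->f_seq >= 12 && fid->f_seq < (1ULL << 32);
}

/* Downgrade (NEW -> OLD): an IGIF maps back to inode/generation;
 * any other FID has no OLD equivalent, so the client must evict. */
enum sketch_replay sketch_replay_to_old(const struct sketch_fid *fid,
                                        uint32_t *ino, uint32_t *gen)
{
    if (!sketch_fid_is_igif(fid))
        return SKETCH_REPLAY_EVICT;
    *ino = (uint32_t)fid->f_seq;
    *gen = fid->f_oid;
    return SKETCH_REPLAY_CONVERTED;
}

/* Upgrade (OLD -> NEW): an existing inode is named by its IGIF. */
struct sketch_fid sketch_replay_to_new(uint32_t ino, uint32_t gen)
{
    struct sketch_fid fid = { .f_seq = ino, .f_oid = gen, .f_ver = 0 };
    return fid;
}

/* Upgrade, CREATE replay, first option: send the fill-in-FID and
 * let the server allocate the real FID. */
struct sketch_fid sketch_replay_create_fid(void)
{
    struct sketch_fid fid = {
        .f_seq = SKETCH_FILL_IN_SEQ,
        .f_oid = SKETCH_FILL_IN_OID,
        .f_ver = 0,
    };
    return fid;
}
```

The sketch covers only the first CREATE option; the second option (replaying with the original inode number) needs no client-side conversion but carries the drawbacks listed above.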