Architecture - Interoperability fids zfs

Summary
This document describes an architecture for client, server, network, and storage interoperability during migration from 1.6-based, fidless Lustre clusters, which use ldiskfs as a back-end file system, to clusters based on fids and the zfs file system.

This document originates from an internal wiki page.

Definitions
As release numbers and numbering schemas are in flux, the description below uses symbolic names for various important points in Lustre development.


 * OLD : any major release in b1_6 line of development. This might end up being 1.6.something, or 1.7.
 * OLD.x : a release in b1_6 line containing client that is able to interact with a NEW.0 md server. (Tentatively 1.8.)
 * NEW.0 : first release based on HEAD. This features kernel server, and uses ldiskfs as a back-end. This is (tentatively) 2.0. It is important to note that NEW.0 is a temporary intermediate release whose purpose is to effect transition from ldiskfs-based to DMU-based clusters.
 * NEW.1 : next release based on HEAD. This release introduces support for fids on OST, and DMU as a back-end, in addition to continued support for ldiskfs. This is (tentatively) 2.x.
 * OLD protocol : b1_6 wire network protocol.
 * NEW protocol : wire protocol using fids for object identification.
 * OLD storage, OLD file system : back-end file system of type ldiskfs.
 * DMU storage : back-end file system implemented through DMU.
 * fill-in-fid : a special not otherwise used fid value, reserved to indicate in a CREATE RPC that client requests server to generate fid for newly created object on client's behalf. This fid is taken from one of the system-reserved fid sequences.

Requirements

 * +-1 rule : adhere to the Lustre promise of maintaining interoperability one release back and forth.
 * downgrade : users are able to abandon upgrade and return back to the old cluster configuration up to a well-defined point of no-return when a decision is made to proceed forward. After that point downgrade is possible, on a condition that (potentially) all file system modifications made after no-return are lost.
 * rolling upgrade : an upgrade (and downgrade) is performed in a piecemeal fashion, a node after a node.
 * continuity : where possible upgrade and downgrade do not disrupt ongoing operations. Client upgrade or downgrade obviously requires client remount. Server upgrade and downgrade look like a server fail-over, with client operations continuing.
 * no stop-the-world : migration path cannot require whole cluster to be stopped for a prolonged amount of time (e.g., to migrate all data to the new format).

Compatibility matrix
Legend


 * C : client
 * O : OSS
 * M : MDT
 * X : given version supports given format or protocol
 * - : given version does not support given format or protocol
 * gray area : impossible combination

Migration path
The following upgrade path is envisaged:


 * starting with OLD version installed on the cluster...
 * OLD.x release is installed, making clients upward compatible with NEW.0 MDT server. This step can be undone without loss of functionality or availability.
 * all clients are upgraded to OLD.x.
 * NEW.0 md server is installed, and original (OLD.x md server) is failed over to the former. Clients can continue without evictions. This step can be undone with the minor loss of availability (e.g., evictions during downgrade).
 * NEW.0 release is installed on client and OSS nodes. Client has to unmount and remount file system to continue with the new release. This step can be undone with the minor loss of availability (again, unmount followed by remount to revert back to the old release).
 * clients and OST's are upgraded to NEW.1 release. At that moment, no OLD code is running in the cluster, but all data and meta-data are still stored in the OLD format, except for the redundant information, like object index, and fids in EA, not used by the OLD server.
 * MDT fails over to NEW.1. On a reconnect, OST's switch to NEW protocol. At this moment, all networking traffic is in NEW protocol.
 * NEW.1 dmu based ost's are formatted and added to the cluster.
 * online migration of data starts. This step can be undone without loss of functionality or availability.
 * NEW.1 DMU mdt is formatted. Magic meta-data migration tool is invoked. Q? Not clear yet. Downgrade?
 * once meta-data are migrated to the NEW.1, upgrade is complete.

Use Cases
NEW.0 MDT handles...

NEW.0 OST handles ...

Quality Attribute Scenarios

 * old.x-client


 * new.1-ost


 * mdt.upgrade.0


 * mdt.upgrade.0.client


 * mdt.upgrade.1.ost


 * mdt.downgrade.0


 * mdt.downgrade.0.client


 * mdt.downgrade.1


 * mdt.downgrade.1.ost


 * mdt.lookup.old


 * mdt.lookup.new


 * mdt.create


 * mdt.readdir

Technical Details [not part of architecture, should go into HLD/DLD]
Brief outline of features relevant to interoperability and not mentioned above, supported and expected from the releases above:

OLD.x

 * OLD.x: client and OST support both OLD and NEW networking protocol. Protocol version is selected at the time of connection to MDT: if MDT supports OBD_CONNECT_FID connect flag, NEW protocol is used, otherwise OLD.
 * once an OLD.x node (client or OST) has connected to MDT in NEW mode, it ensures that all its other connections are in this mode too. OST adds OBD_CONNECT_FID flag to its connection mask.
 * when connected in NEW mode, OLD.x client
   * uses fids to identify inodes in the cache (for uniformity, it can internally use igifs, generated from ino/gen pairs in the OLD mode too). Inode numbers for stat(2) are generated from fids [done for HEAD, being ported to b1_6_cli_reqs];
   * expects cmd3-style directory pages in readdir with fids in directory entries [done];
   * takes dlm locks in fid name-space [done];
   * participates in cmd3 recovery protocol, more on this below [being implemented by Amit];
   * uses seq and fld services [done];
 * when on a re-connect OLD.x client detects that connection lost OBD_CONNECT_FID flag that it used to have, it evicts itself to get rid of all extra fid-related state.
 * No interoperability changes to the MD server code are made in OLD.x release.
 * OLD.x OST servers also support both OLD and NEW networking protocol, and depending on the MDS connection flags either use fids or not. In fid-enabled mode, they act much like clients (see above) in their interaction with MDT. To support NEW protocol OST has to generate fids for objects already existing on the storage. Resulting surrogate fids are called idifs (igifs for data, see igif description below). [not started yet]
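The connect-time behaviour described above for OLD.x nodes can be sketched roughly as follows. This is only an illustration: `SKETCH_CONNECT_FID` is a placeholder bit standing in for OBD_CONNECT_FID (its real value is not given here), and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration only; NOT the real OBD_CONNECT_FID value. */
#define SKETCH_CONNECT_FID (1ULL << 0)

enum sketch_proto  { SKETCH_PROTO_OLD, SKETCH_PROTO_NEW };
enum sketch_action { SKETCH_CONTINUE, SKETCH_SELF_EVICT };

/* Protocol is chosen from the flags the MDT granted at connect time. */
static inline enum sketch_proto sketch_select_proto(uint64_t mdt_flags)
{
    return (mdt_flags & SKETCH_CONNECT_FID) ? SKETCH_PROTO_NEW
                                            : SKETCH_PROTO_OLD;
}

/*
 * On reconnect, a node that used to have the fid flag and no longer has it
 * evicts itself to drop all fid-related state.
 */
static inline enum sketch_action sketch_on_reconnect(uint64_t prev_flags,
                                                     uint64_t new_flags)
{
    bool had = (prev_flags & SKETCH_CONNECT_FID) != 0;
    bool has = (new_flags & SKETCH_CONNECT_FID) != 0;

    return (had && !has) ? SKETCH_SELF_EVICT : SKETCH_CONTINUE;
}

/* An OST connected to the MDT in NEW mode adds the flag to its own mask. */
static inline uint64_t sketch_ost_connect_mask(uint64_t base_mask,
                                               enum sketch_proto mdt_proto)
{
    assert(mdt_proto == SKETCH_PROTO_OLD || mdt_proto == SKETCH_PROTO_NEW);
    return mdt_proto == SKETCH_PROTO_NEW ? (base_mask | SKETCH_CONNECT_FID)
                                         : base_mask;
}
```

A reconnect that gains the flag (MDT upgraded) is treated as `SKETCH_CONTINUE` here; the request conversion that case needs is covered in the Recovery section.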

NEW.0
This release introduces MDT server speaking NEW protocol only, and running over OLD-format storage. OST server speaking NEW protocol was introduced in the previous OLD.x release. Support for old protocol is completely eliminated in this release.

To talk in new protocol server has to use fids to identify objects, so NEW.0 MDT generates surrogate fids for existing objects. Such a surrogate fid is referred to as an igif (inode-generation fid), because it is built from inode number and inode generation. Format of igif and idif is described in the table below:

Legend:
 * FID : File IDentifier generated by client from range allocated by the seq service. First 0x400 sequences [2^32, 2^32 + 0x400] are reserved for system use. Note that on ldiskfs MDTs IGIF FIDs can use inode numbers starting at 12, so the 0x400 reserved limit only strictly applies to DMU-based MDTs.


 * IGIF : Inode and Generation In FID, a surrogate FID used to identify an existing object on OLD formatted MDT file system. Belongs to a sequence in [2, 2^32 - 1] range, where sequence number is inode number, and inode generation is used as an oid. NOTE: This assumes no more than 2^32 - 1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend.

 * IDIF : object ID in FID, a surrogate FID used to identify an existing object on OLD formatted OST file system. Belongs to a sequence in [2^32, 2^33 - 1]. Sequence number is calculated as:
      1 << 32 | ost_index << 16 | ((objid >> 32) & 0xffff)
   that is, it consists of 16bit ost index, and higher 16 bits of object id. oid field is calculated as:
      objid & 0xffffffff
   that is, it consists of remaining 32 bits of object id. NOTE: This assumes that no more than 2^48 - 1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (1M objects per second for 9 years, or combinations thereof).
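The IGIF and IDIF constructions in the legend above can be sketched as follows. The `sketch_fid` structure and helper names are hypothetical; the field names follow the f_seq, f_oid and f_ver used later in this document, and setting f_ver to 0 for surrogate fids is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative FID layout; not the actual Lustre definition. */
struct sketch_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* IGIF: sequence is the inode number, oid is the inode generation. */
static inline struct sketch_fid sketch_igif(uint32_t ino, uint32_t gen)
{
    assert(ino >= 2); /* IGIF sequences live in [2, 2^32 - 1] */
    struct sketch_fid fid = { .f_seq = ino, .f_oid = gen, .f_ver = 0 };
    return fid;
}

/* IDIF: sequence packs the OST index and the upper 16 bits of the object id. */
static inline struct sketch_fid sketch_idif(uint16_t ost_index, uint64_t objid)
{
    assert(objid < (1ULL << 48)); /* at most 2^48 - 1 objects per OST */
    struct sketch_fid fid = {
        .f_seq = (1ULL << 32) | ((uint64_t)ost_index << 16) |
                 ((objid >> 32) & 0xffff),
        .f_oid = (uint32_t)(objid & 0xffffffff),
        .f_ver = 0,
    };
    return fid;
}

/* Range checks matching the sequence intervals given in the legend. */
static inline int sketch_fid_is_igif(const struct sketch_fid *fid)
{
    return fid->f_seq >= 2 && fid->f_seq < (1ULL << 32);
}

static inline int sketch_fid_is_idif(const struct sketch_fid *fid)
{
    return fid->f_seq >= (1ULL << 32) && fid->f_seq < (1ULL << 33);
}
```

With the largest inputs (OST index 0xffff, object id 2^48 - 1) the IDIF sequence comes out as 0x1ffffffff, which is exactly 2^33 - 1, the upper end of the stated range.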

For compatibility with existing OLD OST network protocol structures, the FID must map onto the o_id and o_gr in a manner that ensures existing objects are identified consistently for IO, as well as onto the lock namespace to ensure both IDIFs map onto the same objects for IO as well as resources in the DLM.

DLM IDIF: resource[] = {o_id, o_gr, 0, 0}; // o_gr == 0 for all production releases

DLM non-IDIF FID (this is the same as on the MDT): resource[] = {seq, oid, ver, hash};

Note that while the o_id may be larger than the 2^33 reserved sequence numbers for IDIF, in all production releases the OLD o_gr field is always 0, so although it is possible for more than 8B objects to have been created on a single OST the non-zero oid field of the FID-based resource will distinguish the lock resources from each other.
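A minimal sketch of the two DLM resource layouts, and of why a FID-based resource cannot collide with an OLD one, might look like this. The structure and function names are hypothetical, the `hash` value is simply passed in because its derivation is not described here, and the IDIF branch assumes the object id is recovered as the exact inverse of the IDIF sequence and oid formulas.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* Four-word resource name, as in resource[] above. */
struct sketch_res_name {
    uint64_t name[4];
};

/* OLD-style resource: {o_id, o_gr, 0, 0}. */
static inline struct sketch_res_name sketch_res_from_objid(uint64_t o_id,
                                                           uint64_t o_gr)
{
    struct sketch_res_name res = { { o_id, o_gr, 0, 0 } };
    return res;
}

/* Resource for a fid: IDIFs fall back to the OLD layout, others use the fid. */
static inline struct sketch_res_name
sketch_res_for_fid(const struct sketch_fid *fid, uint64_t hash)
{
    if (fid->f_seq >= (1ULL << 32) && fid->f_seq < (1ULL << 33)) {
        /* IDIF: rebuild the 48-bit object id; o_gr is 0 in production. */
        uint64_t o_id = ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
        return sketch_res_from_objid(o_id, 0);
    }
    struct sketch_res_name res = {
        { fid->f_seq, fid->f_oid, fid->f_ver, hash }
    };
    return res;
}

static inline int sketch_res_equal(const struct sketch_res_name *a,
                                   const struct sketch_res_name *b)
{
    for (int i = 0; i < 4; i++)
        if (a->name[i] != b->name[i])
            return 0;
    return 1;
}
```

An OLD object whose o_id happens to equal some fid sequence yields {o_id, 0, 0, 0}, while the fid-based resource has a non-zero oid in the second word, so the two names differ.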

For objects within the IDIF range, group extraction (non-CMD) will be: o_id = (fid->f_seq & 0xffff) << 32 | fid->f_oid; o_gr = 0;

For objects outside the IDIF range the reverse mapping will be as follows, until a new lov_ost_info_v2 is defined that contains only the lu_fid structure:

   o_id_lo = fid->f_oid;
   o_id_hi = fid->f_ver << 32;
   o_gr    = fid->f_seq;
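Both reverse mappings can be put together in one hedged sketch. The names are hypothetical, and two assumptions are made: o_id is treated as a single 64-bit value formed by OR-ing o_id_hi and o_id_lo, and the IDIF branch is written as the exact inverse of the sequence and oid formulas in the legend (the low 16 bits of the sequence return to bits 32..47 of the object id).

```c
#include <assert.h>
#include <stdint.h>

struct sketch_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* OLD-protocol object identification: object id plus group. */
struct sketch_ostid {
    uint64_t o_id;
    uint64_t o_gr;
};

static inline int sketch_fid_is_idif(const struct sketch_fid *fid)
{
    return fid->f_seq >= (1ULL << 32) && fid->f_seq < (1ULL << 33);
}

static inline struct sketch_ostid sketch_fid_to_ostid(const struct sketch_fid *fid)
{
    struct sketch_ostid id;

    if (sketch_fid_is_idif(fid)) {
        /* Upper 16 bits of the object id come from the low bits of f_seq. */
        id.o_id = ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
        id.o_gr = 0;
    } else {
        /* Widen f_ver before shifting: shifting a 32-bit value by 32 is undefined. */
        id.o_id = ((uint64_t)fid->f_ver << 32) | fid->f_oid;
        id.o_gr = fid->f_seq;
    }
    return id;
}
```

The OST index carried in an IDIF sequence is dropped by this mapping, on the assumption that o_id is only ever interpreted on the OST that owns the object.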

Recovery
There are 2 important recovery scenarios related to interoperability:


 * OLD.x client reconnects to MDT after a fail-over and learns that it has to switch back to the OLD protocol, because server was downgraded. Client has to replay requests, but before that they have to be converted into OLD protocol format. This requires changing message format and going from fids to inode numbers (storage cookies). If a fid is in igif format, it can be converted to an inode number by reversing the igif generation algorithm. If a fid is client-generated, then *KABOOM*! Client has to evict itself, because it doesn't know the old-format inode number. Q? Is there a better solution? What to do with RPCs that the old server cannot handle at all, such as SEQ_QUERY? Again, eviction seems to be the only option.


 * OLD.x client reconnects to MDT and determines that it has to switch to the new protocol, because MDT was upgraded to NEW.0. To replay RPCs, client has to convert them to the NEW format. This includes message format conversion and going from inode numbers to fids. For RPCs that already include inode number as an argument, igif fid is used. For CREATE RPC that requires fid in NEW protocol there are two options:
   * client supplies fill-in-fid. NEW.0 server recognizes this as a request to generate fid on the server, and uses special sequence range reserved for this purpose to allocate a fid from. Note that this sequence cannot be exhausted, as there is a single MDT in the cluster at that point, which means it has full control over complete fid space.
   * client supplies inode number as in usual OLD protocol replay. Server detects this and creates inode with given inode number. This has certain drawbacks:
     * a dependency on ext3-wantedi patch is re-introduced, and
     * backward-compatibility code is introduced in NEW.0 release, which we are trying to avoid.
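The identifier conversions needed for replay in the two scenarios above can be sketched as follows. Everything here is illustrative: the names are hypothetical, and `SKETCH_FILL_IN_SEQ` is an arbitrary placeholder, since the actual fill-in-fid value is not specified in this document.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

struct sketch_inode_ref {
    uint32_t ino;
    uint32_t gen;
};

enum sketch_replay { SKETCH_REPLAY_OK, SKETCH_REPLAY_EVICT };

/* Arbitrary placeholder; the real fill-in-fid value is not given here. */
#define SKETCH_FILL_IN_SEQ ((1ULL << 33) + 1)

/* Downgrade (NEW to OLD): only igifs can be turned back into inode numbers. */
static inline enum sketch_replay
sketch_fid_to_inode(const struct sketch_fid *fid, struct sketch_inode_ref *out)
{
    if (fid->f_seq < 2 || fid->f_seq >= (1ULL << 32))
        return SKETCH_REPLAY_EVICT; /* client-generated fid: no inode known */
    out->ino = (uint32_t)fid->f_seq;
    out->gen = fid->f_oid;
    return SKETCH_REPLAY_OK;
}

/* Upgrade (OLD to NEW): an existing inode is named by its igif. */
static inline struct sketch_fid sketch_inode_to_fid(struct sketch_inode_ref ref)
{
    assert(ref.ino >= 2);
    struct sketch_fid fid = { .f_seq = ref.ino, .f_oid = ref.gen, .f_ver = 0 };
    return fid;
}

/* Upgrade, CREATE replay, first option: ask the server to allocate the fid. */
static inline struct sketch_fid sketch_fill_in_fid(void)
{
    struct sketch_fid fid = { .f_seq = SKETCH_FILL_IN_SEQ, .f_oid = 0, .f_ver = 0 };
    return fid;
}

static inline int sketch_is_fill_in_fid(const struct sketch_fid *fid)
{
    return fid->f_seq == SKETCH_FILL_IN_SEQ && fid->f_oid == 0 && fid->f_ver == 0;
}
```

The sketch covers only object identifiers. Message format conversion, and RPCs such as SEQ_QUERY that an OLD server cannot handle at all, are left out; the text above suggests eviction for those.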

--Nikita 12:16, 15 May 2008 (PDT)