Architecture - Interoperability fids zfs

Summary

This document describes an architecture for client, server, network, and storage interoperability during migration from 1.6-based, fid-less Lustre clusters using ldiskfs as a back-end file system to clusters based on fids and the ZFS file system.

This document originates from an internal wiki page.

Definitions

As release numbers and numbering schemas are in flux, the description below uses symbolic names for various important points in Lustre development.

OLD
any major release in b1_6 line of development. This might end up being 1.6.something, or 1.7.
OLD.x
a release in b1_6 line containing client that is able to interact with a NEW.0 md server. (Tentatively 1.8.)
NEW.0
the first release based on HEAD. It features a kernel server and uses ldiskfs as a back-end. This is (tentatively) 2.0. It is important to note that NEW.0 is a temporary intermediate release whose purpose is to effect the transition from ldiskfs-based to DMU-based clusters.
NEW.1
next release based on HEAD. This release introduces support for fids on OST, and DMU as a back-end, in addition to continued support for ldiskfs. This is (tentatively) 2.x.
OLD protocol
b1_6 wire network protocol.
NEW protocol
wire protocol using fids for object identification.
OLD storage, OLD file system
back-end file system of type ldiskfs.
DMU storage
back-end file system implemented through DMU.
fill-in-fid
a special, not otherwise used fid value, reserved to indicate in a CREATE RPC that the client requests the server to generate a fid for the newly created object on the client's behalf. This fid is taken from one of the system-reserved fid sequences.

Requirements

+-1 rule
adhere to the Lustre promise of maintaining interoperability one release back and forth.
downgrade
users are able to abandon the upgrade and return to the old cluster configuration up to a well-defined point of no-return, when a decision is made to proceed forward. After that point, downgrade is possible on the condition that (potentially) all file system modifications made after the point of no-return are lost.
rolling upgrade
an upgrade (or downgrade) is performed in a piecemeal fashion, node by node.
continuity
where possible, upgrade and downgrade do not disrupt ongoing operations. Client upgrade or downgrade obviously requires a client remount. Server upgrade and downgrade look like a server fail-over, with client operations continuing.
no stop-the-world
the migration path cannot require the whole cluster to be stopped for a prolonged amount of time (e.g., to migrate all data to the new format).

Compatibility matrix

               OLD        OLD.x      NEW.0      NEW.1
               C  O  M    C  O  M    C  O  M    C  O  M
OLD protocol   X  X  X    X  X  X    -  X  -    -  -  -
NEW protocol   -  -  -    X  -  -    X  -  X    X  X  X
OLD storage       X  X       X  X       X  X       -  -
DMU storage       -  -       -  -       -  -       X  X

Legend

C
client
O
OSS
M
MDT
X
given version supports given format or protocol
-
given version does not support given format or protocol
(blank)
impossible combination

Migration path

The following upgrade path is envisaged:

  • starting with OLD version installed on the cluster...
  • OLD.x release is installed, making clients upward compatible with NEW.0 MDT server. This step can be undone without loss of functionality or availability.
  • all clients are upgraded to OLD.x.
  • NEW.0 MD server is installed, and the original OLD.x MD server is failed over to it. Clients can continue without evictions. This step can be undone with a minor loss of availability (e.g., evictions during downgrade).
  • NEW.0 release is installed on client and OSS nodes. Clients have to unmount and remount the file system to continue with the new release. This step can be undone with a minor loss of availability (again, an unmount followed by a remount to revert back to the old release).
  • clients and OST's are upgraded to the NEW.1 release. At that moment, no OLD code is running in the cluster, but all data and meta-data are still stored in the OLD format, except for redundant information (like the object index and fids in EAs) not used by the OLD server.
  • MDT fails over to NEW.1. On reconnect, OST's switch to the NEW protocol. At this moment, all network traffic is in the NEW protocol.
  • NEW.1 DMU-based OST's are formatted and added to the cluster.
  • online migration of data starts. This step can be undone without loss of functionality or availability.
  • NEW.1 DMU MDT is formatted. A magic meta-data migration tool is invoked. Q: not clear yet. Downgrade?
  • once meta-data are migrated to NEW.1, the upgrade is complete.
Each state below gives the Client/OSS/MDT versions, the upgrade comment (read top-to-bottom), and the downgrade comment (read bottom-to-top).

all-old (Client: OLD, OSS: OLD, MDT: OLD)
  Upgrade: original configuration.
  Downgrade: downgrade of clients, OSS and MDT to OLD can be performed in any order.
client-old.x (Client: OLD.x, OSS: OLD, MDT: OLD)
  Upgrade: upgrade of clients, OSS and MDT to OLD.x can be performed in any order.
oss-old.x (Client: OLD.x, OSS: OLD.x, MDT: OLD)
all-old.x (Client: OLD.x, OSS: OLD.x, MDT: OLD.x)
  Downgrade: MDT is failed over to the OLD.x version. On reconnect, clients and OSS servers recognize the downgrade and switch to the OLD protocol.
mdt-new.0 (Client: OLD.x, OSS: OLD.x, MDT: NEW.0)
  Upgrade: as the new server is failed over to, OLD.x clients recognize this and start using the NEW protocol to talk to the MDT. OST's still use the OLD protocol to talk to the MDT.
  Downgrade: clients are downgraded to the OLD.x version in any order. They continue to speak the NEW protocol. If SOM was activated during upgrade, no further downgrade is possible.
client-new.0 (Client: NEW.0, OSS: OLD.x, MDT: NEW.0)
  Upgrade: clients and OSSes are upgraded to the NEW-protocol-only version in any order.
all-new.0 (Client: NEW.0, OSS: NEW.0, MDT: NEW.0)
  Downgrade: SOM is de-activated on the MDT, if it was enabled.
new.0-som (Client: NEW.0, OSS: NEW.0, MDT: NEW.0)
  Upgrade: (optional) SOM is activated on the MDT.
  Downgrade: all data are in OLD format.
client-new.1 (Client: NEW.1, OSS: NEW.1, MDT: NEW.0)
  Upgrade: clients and OST's are upgraded to NEW.1 in any order. OST's continue to talk to the MDT using the OLD protocol.
  Downgrade: OST's migrate back to NEW.0.
mdt.1 (Client: NEW.1, OSS: NEW.1, MDT: NEW.1)
  Upgrade: MDT fails over to the NEW.1 version and announces to OST's that it talks the NEW protocol. OST's switch to the NEW protocol on reconnect.
  Downgrade: MDT fails over to the NEW.0 version. OST's switch to the OLD protocol on reconnect.
data.dmu (Client: NEW.1, OSS: NEW.1, MDT: NEW.1)
  Upgrade: new DMU-based OST's are formatted and added to the cluster. Data migration starts.
  Downgrade: ldiskfs-based NEW.1 OST's are added into the cluster and data are migrated back to them.
all-data.dmu (Client: NEW.1, OSS: NEW.1, MDT: NEW.1)
  Upgrade: all data are on DMU OSS servers.
  Downgrade: original configuration.
point-of-no-return.
all-dmu (Client: NEW.0, OSS: NEW.1, MDT: NEW.1)
  Upgrade: meta-data is converted (offline?) to the new DMU-based MDT.
  Downgrade: downgrade is not possible from here.

Use Cases

id quality attribute summary
old.x-client usability OLD.x client is introduced into otherwise OLD cluster.
mdt.upgrade.0 usability, availability OLD.x MDT fails over to NEW.0 MDT
mdt.upgrade.0.client availability "...": client reconnection and recovery
new.1-ost usability NEW.1 OST is added to a cluster containing NEW.1 clients.
mdt.upgrade usability, availability NEW.0 MDT fails over to NEW.1 MDT
mdt.upgrade.1.ost availability "...": OST reconnection and recovery
mdt.downgrade.0 usability, availability NEW.0 MDT fails over to OLD.x MDT.
mdt.downgrade.0.client availability "...": client reconnection and recovery
mdt.downgrade.1 usability, availability NEW.1 MDT fails over to NEW.0 MDT.
mdt.downgrade.1.ost availability "...": OST reconnection and recovery

NEW.0 MDT handles...

id quality attribute summary
mdt.lookup.old correctness LOOKUP for a file created by OLD MDT
mdt.lookup.new.0 correctness LOOKUP for a file created by NEW.0 MDT
mdt.create correctness CREATE with a fid supplied by a client
mdt.readdir correctness READDIR

NEW.0 OST handles ...

id quality attribute summary
ost.lookup.old correctness LOOKUP for a file created by OLD OST
ost.lookup.new.0 correctness LOOKUP for a file created by NEW.0 OST
ost.create correctness CREATE with a fid supplied by a client
ost.unlink correctness UNLINK

Quality Attribute Scenarios

old.x-client
Scenario: OLD.x client is introduced into otherwise OLD cluster.
Business Goals: permit rolling upgrade
Relevant QA's: usability
Stimulus source: cluster administrator
Stimulus: upgrade schedule
Environment: cluster with OLD release of lustre installed
Artifact: lustre client
Response: OLD client unmounts, OLD.x release is installed on a cluster node. Client connects to the MDT, requesting OBD_CONNECT_FID, which is not granted. Client detects that it has connected to an OLD MDT (see the sketch after this scenario).
Response measure: client should be able to talk to the OLD MDT.
Questions:
Issues:
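A minimal sketch of the connect-flag check driving this detection, assuming a 64-bit connect-flags word; the OBD_CONNECT_FID value below is a placeholder, the real bit is defined by the Lustre headers:

 #include <stdbool.h>
 #include <stdint.h>

 /* Placeholder value; the real OBD_CONNECT_FID bit comes from the Lustre headers. */
 #define OBD_CONNECT_FID 0x40000000ULL

 /* Use the NEW (fid-based) protocol only if the MDT granted OBD_CONNECT_FID
  * in its connect reply; otherwise fall back to the OLD protocol. */
 static bool client_use_new_protocol(uint64_t granted_flags)
 {
         return (granted_flags & OBD_CONNECT_FID) != 0;
 }
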
new.1-ost
Scenario: NEW.1 OST is added to a cluster containing NEW.1 clients
Business Goals: permit rolling server upgrade
Relevant QA's: usability, availability
Stimulus source: cluster administrator
Stimulus: upgrade schedule
Environment: cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed
Artifact: OST
Response: NEW.0 OST fails over to the NEW.1 version. The OST reconnects to the MDT, requesting OBD_CONNECT_FID, which is not granted. The OST detects that it has connected to a NEW.0 MDT, and clears the OBD_CONNECT_FID bit in its supported connection flags mask, forcing all reconnecting clients into OLD mode (see the sketch after this scenario).
Response measure: OST should be able to talk to the NEW.0 MDT and NEW.0 clients.
Questions:
Issues:
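The same connect reply also drives what the OST advertises to its own clients. A sketch of that propagation, reusing the placeholder OBD_CONNECT_FID value from the previous sketch:

 #include <stdint.h>

 #define OBD_CONNECT_FID 0x40000000ULL   /* placeholder, see previous sketch */

 /* If the MDT did not grant fid support, the OST stops advertising it,
  * which forces reconnecting clients back into OLD mode. */
 static uint64_t ost_client_connect_mask(uint64_t ost_supported,
                                         uint64_t mdt_granted)
 {
         if (!(mdt_granted & OBD_CONNECT_FID))
                 ost_supported &= ~OBD_CONNECT_FID;
         return ost_supported;
 }
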
mdt.upgrade.0
Scenario: OLD.x MDT fails over to NEW.0 MDT
Business Goals: upgrade to NEW.0 without downtime
Relevant QA's: usability, availability
Stimulus source: cluster administrator
Stimulus: upgrade schedule
Environment: cluster with OLD.x release of lustre installed
Artifact: MDT
Response: After a fail-over, the MDT creates the missing NEW.0 files (/oi, /fld, /seq, etc.) and starts recovery, accepting NEW-protocol connections from clients and OLD-protocol connections from OSS servers. When receiving a replay of a CREATE RPC with a fill-in-fid, the MDT generates a fid internally (using the seq service) and returns it to the client.
Response measure: Fail-over and recovery have to complete successfully
Questions:
Issues: recovery, see following scenarios
mdt.upgrade.0.client
Scenario: OLD.x MDT fails over to NEW.0 MDT, client reconnects and replays.
Business Goals: successful recovery
Relevant QA's: availability
Stimulus source: cluster administrator
Stimulus: upgrade schedule
Environment: cluster with a mixture of OLD.x and NEW.0 release of lustre installed
Artifact: client
Response: After a fail-over, the client gets the OBD_CONNECT_FID bit from the MDT and detects that it now talks to a NEW.0 MDT. It continues to use the OLD protocol to talk to OST's. The client proceeds with recovery, converting requests into the new format and converting inode numbers in RPCs into fids. For CREATE RPCs, an otherwise impossible fill-in-fid (from a system-reserved fid sequence) is used to indicate that the server has to generate the fid (see the sketch after this scenario). The client should be prepared for the server to over-write the client-supplied fid in any CREATE RPC. There should be no need to rebuild any internal data structures (locks, inode table, pages, etc.), as all objects are identified by fids internally in OLD.x mode.
Response measure: successful recovery
Questions:
Issues:
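A sketch of the client half of the CREATE replay conversion described above. The fid layout follows this document and FID_SEQ_FILL_IN is a hypothetical stand-in for one of the system-reserved sequences, not an actual Lustre constant:

 #include <stdint.h>

 /* Illustrative fid layout (field names follow this document). */
 struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

 /* Hypothetical reserved sequence marking the fill-in-fid. */
 #define FID_SEQ_FILL_IN 0x1ULL

 /* An OLD-format CREATE being replayed carries no fid, so the client sends
  * the fill-in-fid and accepts whatever fid the NEW.0 MDT returns. */
 static void create_replay_fid(struct lu_fid *fid)
 {
         fid->f_seq = FID_SEQ_FILL_IN;
         fid->f_oid = 0;
         fid->f_ver = 0;
 }
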
mdt.upgrade.1.ost
Scenario: NEW.0 MDT fails over to NEW.1 MDT, OST reconnects and replays.
Business Goals: successful recovery
Relevant QA's: availability
Stimulus source: cluster administrator
Stimulus: upgrade schedule
Environment: cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed
Artifact: OST
Response: After a fail-over OST gets OBD_CONNECT_FID bit from MDT and detects that it now talks to NEW.1 MDT. OST sets OBD_CONNECT_FID in its own supported connect bits mask. OST proceeds with MDT-OST recovery, converting requests into new format, and converting inode numbers in RPCs into fids.
Response measure: successful recovery
Questions:
Issues:
mdt.downgrade.0
Scenario: NEW.0 MDT fails over to OLD.x MDT
Business Goals: downgrade with a minimal loss of availability
Relevant QA's: availability
Stimulus source: cluster administrator
Stimulus: downgrade schedule
Environment: cluster with a mixture of OLD.x and NEW.0 releases of lustre installed
Artifact: MDT
Response: After a fail-over, MDT starts OLD-protocol recovery, accepting connections in OLD protocol.
Response measure: successful recovery
Questions:
Issues:
mdt.downgrade.0.client
Scenario: NEW.0 MDT fails over to OLD.x MDT: client reconnection and recovery
Business Goals: downgrade with a minimal loss of availability
Relevant QA's: availability
Stimulus source: cluster administrator
Stimulus: downgrade schedule
Environment: cluster with a mixture of OLD.x and NEW.0 releases of lustre installed
Artifact: client
Response: After a fail-over, client reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, client switches to OLD.x mode, and starts replay, converting RPCs to the OLD protocol. If client is unable to convert an RPC, because it doesn't know inode number corresponding to the fid, it evicts itself.
Response measure: successful recovery
Questions: Search for "KABOOM" on this page.
Issues:
mdt.downgrade.1
Scenario: NEW.1 MDT fails over to NEW.0 MDT
Business Goals: downgrade with a minimal loss of availability
Relevant QA's: availability
Stimulus source: cluster administrator
Stimulus: downgrade schedule
Environment: cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed
Artifact: MDT
Response: After a fail-over, MDT starts recovery, accepting connections in OLD protocol from OST's and in NEW protocol from clients.
Response measure: successful recovery
Questions:
Issues:
mdt.downgrade.1.ost
Scenario: NEW.1 MDT fails over to NEW.0 MDT: ost reconnection and recovery
Business Goals: downgrade with a minimal loss of availability
Relevant QA's: availability
Stimulus source: cluster administrator
Stimulus: downgrade schedule
Environment: cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed
Artifact: OST
Response: After a fail-over, OST reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, OST switches to NEW.0 mode, clears OBD_CONNECT_FID bit in its supported connect flags mask, and starts replay, converting RPCs to the OLD protocol.
Response measure: successful recovery
Questions: Search for "KABOOM" on this page.
Issues:
mdt.lookup.old
Scenario: NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by OLD.x server.
Business Goals: access to existing data and meta-data
Relevant QA's: usability
Stimulus source: client application
Stimulus: RPC
Environment: cluster with NEW.0 release of lustre installed
Artifact: MDT
Response: Given the fid of the parent directory, the server translates it into an inode number (either by doing the igif->ino computation, or by using the /oi index), loads the directory inode and looks the given name up. If the name is found (-ENOENT otherwise), the MDT loads the inode and checks for the "FID" EA. Assuming the EA doesn't exist (see the next QAS otherwise), the server learns that the inode was created by an OLD.x server, generates an igif fid from the (inode number, inode generation) pair, and sends this fid to the client as the lookup result (see the sketch after this scenario).
Response measure: consistent lookup result that can later be used to access file
Questions:
Issues:
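A sketch of the EA-or-igif decision described in the Response above. The fid layout is illustrative, and the helper takes the already-loaded "FID" EA (or NULL) instead of the real MDT inode interfaces:

 #include <stddef.h>
 #include <stdint.h>

 struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

 /* Pick the fid returned by LOOKUP: the stored "FID" EA when present
  * (object created by NEW.0), otherwise an igif synthesized from the
  * (inode number, generation) pair (object created by OLD.x). */
 static void lookup_result_fid(const struct lu_fid *stored_ea,
                               uint32_t ino, uint32_t gen,
                               struct lu_fid *out)
 {
         if (stored_ea != NULL) {
                 *out = *stored_ea;
         } else {
                 out->f_seq = ino;   /* igif sequence is the inode number */
                 out->f_oid = gen;   /* igif oid is the inode generation  */
                 out->f_ver = 0;
         }
 }
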
mdt.lookup.new
Scenario: NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by NEW.0 server.
Business Goals: access to newly created data and meta-data
Relevant QA's: usability
Stimulus source: client application
Stimulus: RPC
Environment: cluster with NEW.0 release of lustre installed
Artifact: MDT
Response: Given the fid of the parent directory, the server translates it into an inode number (either by doing the igif->ino computation, or by using the /oi index), loads the directory inode and looks the given name up. If the name is found (-ENOENT otherwise), the MDT loads the inode and checks for the "FID" EA. Assuming the EA exists (see the previous QAS otherwise), the server learns that the inode was created by a NEW.0 server, interprets the EA contents as a fid, and sends this fid to the client as the lookup result.
Response measure: consistent lookup result that can later be used to access file
Questions:
Issues: Possible sanity check: once fid was determined, check that /oi maps this fid to the inode number that was found in the directory.
mdt.create
Scenario: NEW.0 MDT handles CREATE(fid) RPC, with fid supplied by a client
Business Goals: create object that can later be accessed through client supplied fid.
Relevant QA's: usability
Stimulus source: client application
Stimulus: RPC
Environment: cluster with NEW.0 release of lustre installed
Artifact: MDT
Response: If the fid equals the special fill-in-fid constant, the MDT generates a new fid from an internal fid sequence. A new inode is created. A "FID" EA is allocated for this inode and filled with the fid. A new (inode-number, inode-generation) record is inserted into the /oi index with the fid as a key (see the sketch after this scenario).
Response measure: new object created, and can be accessed by fid later.
Questions:
Issues:
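The server half of the fill-in-fid handshake might look roughly like the sketch below; the fid layout, the FID_SEQ_FILL_IN value and the sequence bookkeeping are illustrative placeholders, not the actual MDT implementation:

 #include <stdbool.h>
 #include <stdint.h>

 struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

 /* Hypothetical reserved sequence marking the fill-in-fid. */
 #define FID_SEQ_FILL_IN 0x1ULL

 static bool fid_is_fill_in(const struct lu_fid *fid)
 {
         return fid->f_seq == FID_SEQ_FILL_IN;
 }

 /* If the client sent the fill-in-fid, allocate a real fid from an
  * MDT-owned sequence before creating the inode; the chosen fid is then
  * stored in the "FID" EA and inserted into the /oi index. */
 static void create_choose_fid(struct lu_fid *fid,
                               uint64_t server_seq, uint32_t *next_oid)
 {
         if (fid_is_fill_in(fid)) {
                 fid->f_seq = server_seq;     /* MDT-internal sequence */
                 fid->f_oid = (*next_oid)++;  /* next unused oid       */
                 fid->f_ver = 0;
         }
         /* otherwise the client-supplied fid is used as-is */
 }
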
mdt.readdir
Scenario: NEW.0 MDT handles READPAGE(parent-fid, offset) RPC
Business Goals: return a page filled with NEW protocol directory entries, provide access to both new and old objects through readdir.
Relevant QA's: usability
Stimulus source: client application
Stimulus: RPC
Environment: cluster with NEW.0 release of lustre installed
Artifact: MDT
Response: Using the dt-index iterator interface (internally based on ldiskfs_readdir()), the MDT iterates over directory entries, and places file names and their hashes into directory entries. For every entry the corresponding inode is loaded into memory. If the inode contains a "FID" EA, its contents are used as the fid and placed into the readdir page. Otherwise, an igif fid is generated and placed into the readdir page.
Response measure: pre-existing objects, created by the OLD.x server, are visible through readdir.
Questions:
Issues:

Technical Details [not part of architecture, should go into HLD/DLD]

A brief outline of features relevant to interoperability and not mentioned above, as supported and expected in the releases described above:

OLD.x

  • OLD.x: client and OST support both the OLD and the NEW networking protocol. The protocol version is selected at the time of connection to the MDT: if the MDT supports the OBD_CONNECT_FID connect flag, the NEW protocol is used, otherwise the OLD one.
  • once an OLD.x node (client or OST) has connected to the MDT in NEW mode, it ensures that all other connections are in this mode too. An OST adds the OBD_CONNECT_FID flag to its connection mask.
  • when connected in NEW mode, an OLD.x client
    • uses fids to identify inodes in the cache (for uniformity, it can internally use igifs, generated from ino/gen pairs, in OLD mode too). Inode numbers for stat(2) are generated from fids [done for HEAD, being ported to b1_6_cli_reqs];
    • expects cmd3-style directory pages in readdir with fids in directory entries [done];
    • takes dlm locks in the fid name-space [done];
    • participates in the cmd3 recovery protocol, more on this below [being implemented by Amit];
    • uses the seq and fld services [done];
  • when, on a re-connect, an OLD.x client detects that the connection has lost the OBD_CONNECT_FID flag that it used to have, it evicts itself to get rid of all extra fid-related state.
    • No interoperability changes to the MD server code are made in the OLD.x release.
  • OLD.x OST servers also support both the OLD and the NEW networking protocol, and depending on the MDS connection flags they either use fids or not. In fid-enabled mode, they act much like clients (see above) in their interaction with the MDT. To support the NEW protocol, an OST has to generate fids for objects already existing on the storage. The resulting surrogate fids are called idifs (igifs for data, see the igif description below). [not started yet]

NEW.0

This release introduces an MDT server speaking the NEW protocol only, running over OLD-format storage. An OST server speaking the NEW protocol was introduced in the previous OLD.x release. Support for the OLD protocol is completely eliminated in this release.

To talk the NEW protocol, a server has to use fids to identify objects, so the NEW.0 MDT generates surrogate fids for existing objects. Such a surrogate fid is referred to as an igif (inode-generation fid), because it is built from the inode number and inode generation. The format of igif and idif is described in the table below:

fields     SEQ                                    OID         VER
FID:       seq:64 [2^33, 2^64 - 1]                oid:32      ver:32
IGIF:      0:32, ino:32                           gen:32      0:32
IDIF:      0:31, 1:1, ost-index:16, o_id_hi:16    o_id_lo:32  0:32
obdo(FID)  o_gr:64                                o_id_lo:32  o_id_hi:32
lov_oinfo  o_gr:64                                o_id_lo:32  o_id_hi:32

Legend:

FID
File IDentifier, generated by a client from a range allocated by the seq service. The first 0x400 sequences [2^32, 2^32 + 0x400] are reserved for system use. Note that on ldiskfs MDTs, IGIF FIDs can use inode numbers starting at 12, so the 0x400 reserved limit only strictly applies to DMU-based MDTs.
IGIF
Inode and Generation In FID, a surrogate FID used to identify an existing object on an OLD-formatted MDT file system. It belongs to a sequence in the [2, 2^32 - 1] range, where the sequence number is the inode number, and the inode generation is used as the oid. NOTE: This assumes no more than 2^32 - 1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend.
IDIF
object ID In FID, a surrogate FID used to identify an existing object on an OLD-formatted OST file system. It belongs to a sequence in the [2^32, 2^33 - 1] range. The sequence number is calculated as
 1 << 32 | ost_index << 16 | ((objid >> 32) & 0xffff)
that is, it consists of the 16-bit OST index and the higher 16 bits of the object id. The oid field is calculated as
 objid & 0xffffffff
that is, it consists of the remaining 32 bits of the object id (a consolidated sketch follows below). NOTE: This assumes that no more than 2^48 - 1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (1M objects per second for 9 years, or combinations thereof).
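A consolidated sketch of the IDIF construction just described, under the same assumptions on ost_index and objid as the NOTE above; field names follow this document rather than the actual Lustre headers:

 #include <stdint.h>

 struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

 /* Build an IDIF from an OST index and an OLD object id, following the
  * layout in the table above (assumes ost_index < 2^16, objid < 2^48). */
 static void idif_from_objid(struct lu_fid *fid,
                             uint32_t ost_index, uint64_t objid)
 {
         fid->f_seq = (1ULL << 32) |
                      ((uint64_t)ost_index << 16) |
                      ((objid >> 32) & 0xffff);
         fid->f_oid = (uint32_t)(objid & 0xffffffff);
         fid->f_ver = 0;
 }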

For compatibility with existing OLD OST network protocol structures, the FID must map onto o_id and o_gr in a manner that ensures existing objects are identified consistently for IO, and onto the lock namespace so that IDIFs map onto the same objects for IO as well as onto the same resources in the DLM.

DLM IDIF:

 resource[] = {o_id, o_gr, 0, 0};  // o_gr == 0 for all production releases

DLM non-IDIF FID (this is the same as on the MDT):

 resource[] = {seq, oid, ver, hash};

Note that while the o_id may be larger than the 2^33 reserved sequence numbers for IDIF, in all production releases the OLD o_gr field is always 0, so although it is possible for more than 8B objects to have been created on a single OST, the non-zero oid field of the FID-based resource will distinguish the lock resources from each other.


For objects within the IDIF range, group extraction (non-CMD) will be:

 o_id = ((fid->f_seq & 0xffff) << 32) | fid->f_oid;  // reverses the IDIF packing above
 o_gr = 0;

For objects outside the IDIF range the reverse mapping will be as follows, until a new lov_ost_info_v2 is defined that contains only the lu_fid structure:

 o_id_lo = fid->f_oid;
 o_id_hi = fid->f_ver;   // the fid version occupies the high 32 bits of the 64-bit o_id
 o_gr    = fid->f_seq;
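Putting the two cases together, a hedged sketch of the fid-to-obdo mapping; it packs o_id_hi/o_id_lo into a single 64-bit o_id for brevity, and the fid layout is illustrative:

 #include <stdbool.h>
 #include <stdint.h>

 struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

 /* An IDIF sequence lies in [2^32, 2^33 - 1]. */
 static bool fid_is_idif(const struct lu_fid *fid)
 {
         return fid->f_seq >= (1ULL << 32) && fid->f_seq < (1ULL << 33);
 }

 /* IDIF fids unpack back to the OLD (o_id, o_gr == 0) pair; other fids are
  * packed with the sequence in o_gr and the version in the high half of o_id. */
 static void fid_to_ostid(const struct lu_fid *fid,
                          uint64_t *o_id, uint64_t *o_gr)
 {
         if (fid_is_idif(fid)) {
                 *o_id = ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
                 *o_gr = 0;
         } else {
                 *o_id = ((uint64_t)fid->f_ver << 32) | fid->f_oid;
                 *o_gr = fid->f_seq;
         }
 }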

Recovery

There are 2 important recovery scenarios related to interoperability:

  • OLD.x client reconnects to the MDT after a fail-over and learns that it has to switch back to the OLD protocol, because the server was downgraded. The client has to replay requests, but before that they have to be converted into the OLD protocol format. This requires changing the message format and going from fids to inode numbers (storage cookies). If a fid is in igif format, it can be converted to an inode number by reversing the igif generation algorithm (see the sketch after this list). If a fid is client-generated, then *KABOOM*! The client has to evict itself, because it doesn't know the old-format inode number. Q? Is there a better solution? What to do with RPCs that the old server cannot handle at all: SEQ_QUERY? Again, eviction seems to be the only option.
  • OLD.x client reconnects to the MDT and determines that it has to switch to the NEW protocol, because the MDT was upgraded to NEW.0. To replay RPCs, the client has to convert them to the NEW format. This includes message format conversion and going from inode numbers to fids. For RPCs that already include an inode number as an argument, an igif fid is used. For a CREATE RPC, which requires a fid in the NEW protocol, there are two options:
    • the client supplies the fill-in-fid. The NEW.0 server recognizes this as a request to generate the fid on the server, and allocates a fid from a special sequence range reserved for this purpose. Note that this sequence cannot be exhausted, as there is a single MDT in the cluster at that point, which means it has full control over the complete fid space.
    • the client supplies an inode number as in a usual OLD protocol replay. The server detects this and creates an inode with the given inode number. This has certain drawbacks:
      • a dependency on the ext3-wantedi patch is re-introduced, and
      • backward-compatibility code is introduced in the NEW.0 release, which we are trying to avoid.
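A sketch of the igif-to-inode-number conversion used during downgrade replay, including the eviction ("KABOOM") case for client-generated fids; the fid layout is illustrative:

 #include <stdbool.h>
 #include <stdint.h>

 struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

 /* An igif sequence lies in [2, 2^32 - 1]. */
 static bool fid_is_igif(const struct lu_fid *fid)
 {
         return fid->f_seq >= 2 && fid->f_seq < (1ULL << 32);
 }

 /* Reverse the igif construction to recover (ino, gen) for an OLD-format
  * replay; returns false for client-generated fids, where no OLD inode
  * number is known and the client has to evict itself. */
 static bool fid_to_inode(const struct lu_fid *fid,
                          uint32_t *ino, uint32_t *gen)
 {
         if (!fid_is_igif(fid))
                 return false;
         *ino = (uint32_t)fid->f_seq;
         *gen = fid->f_oid;
         return true;
 }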

--Nikita 12:16, 15 May 2008 (PDT)