WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Architecture - Interoperability fids zfs
The NEW.0 release introduces an MDT server that speaks the NEW protocol only and runs over OLD-format storage. An OST server speaking the NEW protocol was introduced in the previous OLD.x release. Support for the OLD protocol is completely eliminated in this release.
To speak the NEW protocol, a server has to use FIDs to identify objects, so the NEW.0 MDT generates ''surrogate'' FIDs for existing inodes. Such a surrogate FID is referred to as an ''IGIF'' (inode-generation FID), because it is built from the inode number and inode generation. Similarly, the NEW.0 OST generates surrogate FIDs (''IDIFs'') for existing id/group objects. The formats of IGIF and IDIF are described in the table below:
{| border=1 cellspacing=0 cellpadding="5"
|fields ||SEQ ||OID ||VER
|-
|FID_SEQ_OST_MDT0 ||= 0 || ||
|-
|FID_SEQ_LLOG ||= 1 || ||
|-
|FID_SEQ_ECHO ||= 2 || ||
|-
|FID_SEQ_OST_MDT1 ||= 3 || ||
|-
|FID_SEQ_OST_MAX ||= 9 (=FID_SEQ_OST_MDT7) || ||
|-
|FID_SEQ_IGIF ||= 12 || ||
|-
|FID_SEQ_IGIF_MAX ||= 0xffffffff || ||
|-
|FID_SEQ_IDIF ||= 0x100000000 || ||
|-
|FID_SEQ_IDIF_MAX ||= 0x1ffffffff || ||
|-
|FID_SEQ_LOCAL_FILE ||= 0x200000001 || ||
|-
|FID_SEQ_DOT_LUSTRE ||= 0x200000002 || ||
|-
|FID_SEQ_NORMAL ||= 0x200000400 || ||
|-
|obdo/lmm/oinfo(OLD) ||o_seq:64 [FID_SEQ_OST_MDT0] ||o_id_lo:48 ||o_id_hi:16
|-
|obdo/lmm/oinfo(NEW.1) ||o_seq:64 [FID_SEQ_{IDIF,NORMAL}] ||o_id_lo:32 ||o_id_hi:32
|-
|lu_fid ||f_seq:64 ||f_oid:32 ||f_ver:32
|-
|IGIF ||0:32, ino:32 [12,FID_SEQ_IGIF_MAX] ||gen:32 ||0:32
|-
|IDIF ||0:31, 1:1, ost_idx:16,o_id_hi:16 ||o_id_lo:32 ||o_id_hi_hi:16
|-
|reserved ||[FID_SEQ_START,FID_SEQ_START+0x3ff] ||f_oid:32 ||f_ver:32
|-
|FID ||[FID_SEQ_NORMAL,2<sup>64</sup>-1] ||f_oid:32 ||f_ver:32
|}
Legend:
; '''FID''' : File IDentifier generated by the client from a range allocated by the seq service. The first 0x400 sequences, [2<sup>33</sup>, 2<sup>33</sup> + 0x3ff], are reserved for system use. Note that on ldiskfs MDTs IGIF FIDs can use inode numbers starting at 12, but these fall in the IGIF SEQ range and do not conflict with assigned FIDs.
; '''IGIF''' : Inode and Generation In FID, a surrogate FID used to globally identify an existing object on an OLD-formatted MDT file system. This would only be used on MDT0 in a DNE filesystem, because no OLD-formatted DNE filesystems are expected to exist. Belongs to a sequence in the [12, 2<sup>32</sup> - 1] range, where the sequence number is the inode number and the inode generation is used as the OID. '''NOTE''': This assumes no more than 2<sup>32</sup>-1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend. '''NOTE''': This assumes that the reserved ext3/ext4/ldiskfs inode numbers [0-11] are never visible to clients, which has always been true.
; '''IDIF''' : object ID in FID, a surrogate FID used to globally identify an existing object on an OLD-formatted OST file system. Belongs to a sequence in [2<sup>32</sup>, 2<sup>33</sup> - 1]. The sequence number is calculated as:
<pre>
(1ULL << 32) | (ost_index << 16) | ((objid >> 32) & 0xffff)
</pre>
; ''' ''' : that is, SEQ consists of the 16-bit OST index and the higher 16 bits of the object ID. Generating unique SEQ values per OST allows IDIF FIDs to be identified correctly in the FLD. The OID field is calculated as:
<pre>
objid & 0xffffffff
</pre>
; ''' ''' : that is, it consists of the lower 32 bits of the object ID. '''NOTE''': This assumes that no more than 2<sup>48</sup>-1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (this can uniquely map all objects on an OST that created 1M objects per second for 9 years, or combinations thereof).
; '''OST_MDT0''' : Surrogate FID used to identify an existing object on an OLD-formatted OST filesystem. Belongs to the reserved sequence 0, and is used internally prior to the introduction of FID-on-OST, at which point IDIF will be used to identify objects as residing on a specific OST.
; '''LLOG''' : for Lustre Log objects the object sequence 1 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.
; '''ECHO''' : for testing OST IO performance the object sequence 2 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.
; '''OST_MDT1''' .. '''OST_MAX''' : for testing with multiple MDTs the object sequences 3 through 9 are used, allowing direct mapping of MDTs 1 through 7 respectively, for a total of 8 MDTs including '''OST_MDT0'''. This matches the legacy CMD project "group" mappings. However, this SEQ range is only for testing prior to any production DNE release: because the OST index is not part of the FID, objects in this range conflict across all OSTs.
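The IGIF and IDIF layouts in the legend can be sketched in C as follows. This is only an illustration of the bit layout described above, not the actual Lustre code: the names <code>demo_fid</code>, <code>demo_igif_build</code> and <code>demo_idif_build</code> are hypothetical, and the version field is simply left at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the lu_fid row of the table: f_seq:64, f_oid:32, f_ver:32. */
struct demo_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* IGIF: the sequence is the inode number, the OID is the inode generation. */
static struct demo_fid demo_igif_build(uint32_t ino, uint32_t gen)
{
    struct demo_fid fid = { .f_seq = ino, .f_oid = gen, .f_ver = 0 };

    /* Inode numbers 0..11 are reserved and never become IGIFs. */
    assert(ino >= 12);
    return fid;
}

/* IDIF: bit 32 set, then the 16-bit OST index, then bits 32..47 of objid. */
static struct demo_fid demo_idif_build(uint16_t ost_idx, uint64_t objid)
{
    struct demo_fid fid;

    /* The legend assumes fewer than 2^48 objects per OST. */
    assert((objid >> 48) == 0);
    fid.f_seq = (1ULL << 32) | ((uint64_t)ost_idx << 16) |
                ((objid >> 32) & 0xffff);
    fid.f_oid = (uint32_t)(objid & 0xffffffffULL);
    fid.f_ver = 0;
    return fid;
}
```

For example, OST index 3 and object ID 0x500000009 give SEQ 0x100030005 and OID 9 under this layout.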
For compatibility with existing OLD OST network protocol structures, the FID must map onto o_id and o_gr in a manner that ensures existing objects are identified consistently for IO. It must also map onto the lock namespace, so that both IDIFs map onto the same objects for IO as well as onto the same resources in the DLM.
DLM OLD OBIF/IDIF:
<pre>
resource[] = {o_id, o_seq, 0, 0}; /* o_seq == 0 for production releases */
</pre>
DLM NEW.1 FID (this is the same for both the MDT and OST):
<pre>
resource[] = {SEQ, OID, VER, HASH};
</pre>
Note that when mapping IDIF values to DLM resource names the o_id may be larger than the 2<sup>33</sup> sequence numbers reserved for IDIF, so it is possible for o_id numbers to overlap FID SEQ numbers in the resource. However, in all production releases the OLD o_seq field is always zero, and all valid FID OID values are non-zero, so the lock resources will not collide.
For objects within the IDIF range, group extraction (non-CMD) will be:
<pre>
o_id  = ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
o_seq = 0; /* formerly group number */
</pre>
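The mapping from an IDIF back to the OLD object ID, and from there to the two DLM resource name layouts, can be sketched as below. This is a minimal sketch that assumes the object ID is recovered by inverting the IDIF SEQ and OID formulas from the legend (the low 16 bits of SEQ become bits 32..47 of the object ID). All <code>demo_</code> names are hypothetical, and the HASH slot is simply passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

struct demo_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

#define DEMO_FID_SEQ_IDIF     0x100000000ULL
#define DEMO_FID_SEQ_IDIF_MAX 0x1ffffffffULL

static int demo_fid_is_idif(const struct demo_fid *fid)
{
    return fid->f_seq >= DEMO_FID_SEQ_IDIF &&
           fid->f_seq <= DEMO_FID_SEQ_IDIF_MAX;
}

/* Inverse of the IDIF formulas: bits 32..47 come from SEQ, 0..31 from OID. */
static uint64_t demo_idif_objid(const struct demo_fid *fid)
{
    assert(demo_fid_is_idif(fid));
    return ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
}

/* The OST index sits in bits 16..31 of the IDIF sequence. */
static uint16_t demo_idif_ost_idx(const struct demo_fid *fid)
{
    assert(demo_fid_is_idif(fid));
    return (uint16_t)((fid->f_seq >> 16) & 0xffff);
}

/* OLD layout: {o_id, o_seq, 0, 0}, with o_seq == 0 in production. */
static void demo_res_name_old(const struct demo_fid *fid, uint64_t res[4])
{
    res[0] = demo_idif_objid(fid);
    res[1] = 0;
    res[2] = 0;
    res[3] = 0;
}

/* NEW.1 layout: {SEQ, OID, VER, HASH}. */
static void demo_res_name_new(const struct demo_fid *fid, uint64_t hash,
                              uint64_t res[4])
{
    res[0] = fid->f_seq;
    res[1] = fid->f_oid;
    res[2] = fid->f_ver;
    res[3] = hash;
}
```

The two layouts cannot collide for valid objects because the OLD form always has zero in the second slot, while the NEW.1 form carries a non-zero OID there.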
=== Recovery ===
There are 2 important recovery scenarios related to interoperability:
* An OLD.x client reconnects to the MDT after a fail-over and learns that it has to switch back to the OLD protocol, because the server was downgraded. The client has to replay requests, but first they have to be converted into OLD protocol format. This requires changing the message format and going from client-assigned FIDs to inode/generation numbers (storage cookies). If a FID is in IGIF format, it can be converted to an inode number by reversing the IGIF generation algorithm. If a FID is client-generated, then '''*KABOOM*'''! The client has to evict itself, because it doesn't know the old-format inode number. '''Q? Is there a better solution?''' What to do with RPCs that the old server cannot handle at all, such as SEQ_QUERY? Again, eviction seems to be the only option.
* An OLD.x client reconnects to the MDT and determines that it has to switch to the NEW protocol, because the MDT was upgraded to NEW.0. To replay RPCs, the client has to convert them to the NEW format. This includes message format conversion and going from inode/generation numbers to FIDs. For RPCs that already include an inode number as an argument, an IGIF FID can be used. For a CREATE RPC, which requires a FID in the NEW protocol, there are two options:
** the client supplies the fill-in-FID. The NEW.0 server recognizes this as a request to generate the FID on the server, and allocates a FID from a special sequence range reserved for this purpose. Note that this sequence cannot be exhausted, as there is a single MDT in the cluster at that point, which means it has full control over the complete FID space.
** the client supplies an inode number, as in the usual OLD protocol replay. The server detects this and creates an inode with the given inode number. This has certain drawbacks:
*** a dependency on the ext3-wantedi patch is re-introduced, and
*** backward-compatibility code is introduced in the NEW.0 release, which we are trying to avoid.
Latest revision as of 23:19, 3 October 2012
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain both outdated information and unimplemented functionality.
Summary
This document describes an architecture for client, server, network, and storage interoperability during migration from 1.6-based, fidless Lustre clusters, using ldiskfs as a back-end file system, to clusters based on fids and zfs file system.
Definitions
As release numbers and numbering schemas are in flux, the description below uses symbolic names for various important points in Lustre development.
- OLD
- any major release in b1_6 line of development. This might end up being 1.6.something, or 1.7.
- OLD.x
- a release in b1_6 line containing client that is able to interact with a NEW.0 md server. (Tentatively 1.8.)
- NEW.0
- first release based on HEAD. This features kernel server, and uses ldiskfs as a back-end. This is (tentatively) 2.0. It is important to note that NEW.0 is a temporary intermediate release whose purpose is to effect transition from ldiskfs-based to DMU-based clusters.
- NEW.1
- next release based on HEAD. This release introduces support for fids on OST, and DMU as a back-end, in addition to continued support for ldiskfs. This is (tentatively) 2.x.
- OLD protocol
- b1_6 wire network protocol.
- NEW protocol
- wire protocol using fids for object identification.
- OLD storage, OLD file system
- back-end file system of type ldiskfs.
- DMU storage
- back-end file system implemented through DMU.
- fill-in-fid
- a special not otherwise used fid value, reserved to indicate in a CREATE RPC that client requests server to generate fid for newly created object on client's behalf. This fid is taken from one of the system-reserved fid sequences.
Requirements
- +-1 rule
- adhere to the Lustre promise of maintaining interoperability one release back and forth.
- downgrade
- users are able to abandon upgrade and return back to the old cluster configuration up to a well-defined point of no-return when a decision is made to proceed forward. After that point downgrade is possible, on a condition that (potentially) all file system modifications made after no-return are lost.
- rolling upgrade
- an upgrade (and downgrade) is performed in a piecemeal fashion, a node after a node.
- continuity
- where possible upgrade and downgrade do not disrupt ongoing operations. Client upgrade or downgrade obviously requires client remount. Server upgrade and downgrade looks like a server fail-over, with clients operations continuing.
- no stop-the-world
- migration path cannot require whole cluster to be stopped for a prolonged amount of time (e.g., to migrate all data to the new format).
Compatibility matrix
format / protocol | OLD C | OLD O | OLD M | OLD.x C | OLD.x O | OLD.x M | NEW.0 C | NEW.0 O | NEW.0 M | NEW.1 C | NEW.1 O | NEW.1 M |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OLD protocol | X | X | X | X | X | X | - | X | - | - | - | - |
NEW protocol | - | - | - | X | - | - | X | - | X | X | X | X |
OLD storage | gray | X | X | gray | X | X | gray | X | X | gray | - | - |
DMU storage | gray | - | - | gray | - | - | gray | - | - | gray | X | X |
Legend
- C
- client
- O
- OSS
- M
- MDT
- X
- given version supports given format or protocol
- -
- given version does not support given format or protocol
- gray area
- impossible combination
Migration path
The following upgrade path is envisaged:
- starting with OLD version installed on the cluster...
- OLD.x release is installed, making clients upward compatible with NEW.0 MDT server. This step can be undone without loss of functionality or availability.
- all clients are upgraded to OLD.x.
- NEW.0 md server is installed, and original (OLD.x md server) is failed over to the former. Clients can continue without evictions. This step can be undone with the minor loss of availability (e.g., evictions during downgrade).
- NEW.0 release is installed on client and OSS nodes. Client has to unmount and remount file system to continue with the new release. This step can be undone with the minor loss of availability (again, unmount followed by remount to revert back to the old release).
- clients and OST's are upgraded to NEW.1 release. At that moment, no OLD code is running in the cluster, but all data and meta-data are still stored in the OLD format, except for the redundant information, like object index, and fids in EA, not used by the OLD server.
- MDT fails over to NEW.1. On a reconnect, OST's switch to NEW protocol. At this moment, all networking traffic is in NEW protocol.
- NEW.1 dmu based ost's are formatted and added to the cluster.
- online migration of data starts. This step can be undone without loss of functionality or availability.
- NEW.1 DMU mdt is formatted. Magic meta-data migration tool is invoked. ?Q not clear yet. Downgrade?
- once meta-data are migrated to the NEW.1, upgrade is complete.
Label | Client | OSS | MDT | Upgrade comment (read top-to-bottom) | Downgrade comments (read bottom-to-top) |
---|---|---|---|---|---|
all-old | OLD | OLD | OLD | original configuration | downgrade of clients, OSS and MDT to OLD can be performed in any order |
client-old.x | OLD.x | OLD | OLD | upgrade of clients, OSS and MDT to OLD.x can be performed in any order | |
oss-old.x | OLD.x | OLD.x | OLD | ||
all-old.x | OLD.x | OLD.x | OLD.x | MDT is failed over to OLD.x version. On reconnect clients and OSS servers recognize downgrade and switch to the OLD protocol. | |
mdt-new.0 | OLD.x | OLD.x | NEW.0 | as new server is failed over to, OLD.x clients recognize this and start using NEW protocol to talk to MDT. OST still uses OLD protocol to talk to the MDT. | clients are downgraded to OLD.x version in any order. They continue to speak NEW protocol. If SOM was activated during upgrade, no further downgrade is possible. |
client-new.0 | NEW.0 | OLD.x | NEW.0 | clients and OSSes are upgraded to NEW-protocol-only version in any order. | |
all-new.0 | NEW.0 | NEW.0 | NEW.0 | SOM is de-activated on the MDT, if it was enabled. | |
new.0-som | NEW.0 | NEW.0 | NEW.0 | (Optional) SOM is activated on the MDT. | all data are in OLD format. |
client-new.1 | NEW.1 | NEW.1 | NEW.0 | Clients and OST's are upgraded to NEW.1 in any order. OST's continue to talk to the MDT using old protocol. | OST's migrate back to NEW.0 |
mdt.1 | NEW.1 | NEW.1 | NEW.1 | MDT fails over to NEW.1 version, and announced to OST's that it talks NEW protocol. OST's switch to NEW protocol on reconnect | MDT fails over to the NEW.0 version. OST's switch to the OLD protocol on reconnect. |
data.dmu | NEW.1 | NEW.1 | NEW.1 | New DMU-based OST's are formatted and added to the cluster. Data migration starts. | ldiskfs-based NEW.1 OST's are added into cluster and data are migrated back to them. |
all-data.dmu | NEW.1 | NEW.1 | NEW.1 | all data are on DMU OSS servers. | original configuration |
point-of-no-return. | |||||
all-dmu | NEW.0 | NEW.1 | NEW.1 | meta-data is converted (offline?) to new DMU based MDT. | downgrade is not possible from here. |
Use Cases
id | quality attribute | summary |
---|---|---|
old.x-client | usability | OLD.x client is introduced into otherwise OLD cluster. |
mdt.upgrade.0 | usability, availability | OLD.x MDT fails over to NEW.0 MDT |
mdt.upgrade.0.client | availability | "...": client reconnection and recovery |
new.1-ost | usability | NEW.1 OST is added to a cluster containing NEW.1 clients. |
mdt.upgrade | usability, availability | NEW.0 MDT fails over to NEW.1 MDT |
mdt.upgrade.1.ost | availability | "...": OST reconnection and recovery |
mdt.downgrade.0 | usability, availability | NEW.0 MDT fails over to OLD.x MDT. |
mdt.downgrade.0.client | availability | "...": client reconnection and recovery |
mdt.downgrade.1 | usability, availability | NEW.1 MDT fails over to NEW.0 MDT. |
mdt.downgrade.1.ost | availability | "...": OST reconnection and recovery |
NEW.0 MDT handles...
id | quality attribute | summary |
---|---|---|
mdt.lookup.old | correctness | LOOKUP for a file created by OLD MDT |
mdt.lookup.new.0 | correctness | LOOKUP for a file created by NEW.0 MDT |
mdt.create | correctness | CREATE with a fid supplied by a client |
mdt.readdir | correctness | READDIR |
NEW.0 OST handles ...
id | quality attribute | summary |
---|---|---|
ost.lookup.old | correctness | LOOKUP for a file created by OLD OST |
ost.lookup.new.0 | correctness | LOOKUP for a file created by NEW.0 OST |
ost.create | correctness | CREATE with a fid supplied by a client |
ost.unlink | correctness | UNLINK |
Quality Attribute Scenarios
- old.x-client
Scenario: | OLD.x client is introduced into otherwise OLD cluster. | |
Business Goals: | permit rolling upgrade | |
Relevant QA's: | usability | |
details | Stimulus source: | cluster administrator |
Stimulus: | upgrade schedule | |
Environment: | cluster with OLD release of lustre installed | |
Artifact: | lustre client | |
Response: | OLD client unmounts, OLD.x release is installed on a cluster node. Client connects to the MDT, requesting OBD_CONNECT_FID, which is not granted. Client detects that it connected to the OLD MDT. | |
Response measure: | client should be able to talk to the OLD MDT. | |
Questions: | ||
Issues: |
- new.1-ost
Scenario: | NEW.1 OST is added to a cluster containing NEW.1 clients | |
Business Goals: | permit rolling server upgrade | |
Relevant QA's: | usability, availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | upgrade schedule | |
Environment: | cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed | |
Artifact: | OST | |
Response: | NEW.0 OST fails over to NEW.1 version. OST reconnects to MDT, requesting OBD_CONNECT_FID, which is not granted. OST detects that it connected to NEW.0 MDT, and clears OBD_CONNECT_FID bit in its supported connection flags mask, forcing all reconnecting clients into OLD mode. | |
Response measure: | OST should be able to talk to the NEW.0 MDT and NEW.0 clients. | |
Questions: | ||
Issues: |
- mdt.upgrade.0
Scenario: | OLD.x MDT fails over to NEW.0 MDT | |
Business Goals: | upgrade to NEW.0 without downtime | |
Relevant QA's: | usability, availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | upgrade schedule | |
Environment: | cluster with OLD.x release of lustre installed | |
Artifact: | MDT | |
Response: | After a fail-over MDT creates missing NEW.0 files (/oi, /fld, /seq, etc.), and starts recovery, accepting NEW-protocol connections from the clients, and OLD protocol connections from OSS servers. When receiving replay of a CREATE rpc with a fill-in-fid, MDT generates fid internally (using seq service), and returns it to client. | |
Response measure: | Fail-over and recovery have to complete successfully | |
Questions: | ||
Issues: | recovery, see following scenarios |
- mdt.upgrade.0.client
Scenario: | OLD.x MDT fails over to NEW.0 MDT, client reconnects and replays. | |
Business Goals: | successful recovery | |
Relevant QA's: | availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | upgrade schedule | |
Environment: | cluster with a mixture of OLD.x and NEW.0 release of lustre installed | |
Artifact: | client | |
Response: | After a fail-over client gets OBD_CONNECT_FID bit from MDT and detects that it now talks to NEW.0 MDT. It continues to use OLD protocol to talk to OST's. Client proceeds with recovery, converting requests into new format, and converting inode numbers in RPCs into fids. For CREATE RPCs, some otherwise impossible fill-in-fid (from system-reserved fid sequence) is used, to indicate that server has to generate fid. Client should be ready that server can over-write client supplied fid in any CREATE rpc. There should be no need to rebuild any internal data structures (locks, inode table, pages, etc.) as all objects are identified by fids internally in OLD.x mode. | |
Response measure: | successful recovery | |
Questions: | ||
Issues: |
- mdt.upgrade.1.ost
Scenario: | NEW.0 MDT fails over to NEW.1 MDT, OST reconnects and replays. | |
Business Goals: | successful recovery | |
Relevant QA's: | availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | upgrade schedule | |
Environment: | cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed | |
Artifact: | OST | |
Response: | After a fail-over OST gets OBD_CONNECT_FID bit from MDT and detects that it now talks to NEW.1 MDT. OST sets OBD_CONNECT_FID in its own supported connect bits mask. OST proceeds with MDT-OST recovery, converting requests into new format, and converting inode numbers in RPCs into fids. | |
Response measure: | successful recovery | |
Questions: | ||
Issues: |
- mdt.downgrade.0
Scenario: | NEW.0 MDT fails over to OLD.x MDT | |
Business Goals: | downgrade with a minimal loss of availability | |
Relevant QA's: | availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | downgrade schedule | |
Environment: | cluster with a mixture of OLD.x and NEW.0 releases of lustre installed | |
Artifact: | MDT | |
Response: | After a fail-over, MDT starts OLD-protocol recovery, accepting connections in OLD protocol. | |
Response measure: | successful recovery | |
Questions: | ||
Issues: |
- mdt.downgrade.0.client
Scenario: | NEW.0 MDT fails over to OLD.x MDT: client reconnection and recovery | |
Business Goals: | downgrade with a minimal loss of availability | |
Relevant QA's: | availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | downgrade schedule | |
Environment: | cluster with a mixture of OLD.x and NEW.0 releases of lustre installed | |
Artifact: | client | |
Response: | After a fail-over, client reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, client switches to OLD.x mode, and starts replay, converting RPCs to the OLD protocol. If client is unable to convert an RPC, because it doesn't know inode number corresponding to the fid, it evicts itself. | |
Response measure: | successful recovery | |
Questions: | Search for "KABOOM" on this page. | |
Issues: |
- mdt.downgrade.1
Scenario: | NEW.1 MDT fails over to NEW.0 MDT | |
Business Goals: | downgrade with a minimal loss of availability | |
Relevant QA's: | availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | downgrade schedule | |
Environment: | cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed | |
Artifact: | MDT | |
Response: | After a fail-over, MDT starts recovery, accepting connections in OLD protocol from OST's and in NEW protocol from clients. | |
Response measure: | successful recovery | |
Questions: | ||
Issues: |
- mdt.downgrade.1.ost
Scenario: | NEW.1 MDT fails over to NEW.0 MDT: ost reconnection and recovery | |
Business Goals: | downgrade with a minimal loss of availability | |
Relevant QA's: | availability | |
details | Stimulus source: | cluster administrator |
Stimulus: | downgrade schedule | |
Environment: | cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed | |
Artifact: | OST | |
Response: | After a fail-over, OST reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, OST switches to NEW.0 mode, clears OBD_CONNECT_FID bit in its supported connect flags mask, and starts replay, converting RPCs to the OLD protocol. | |
Response measure: | successful recovery | |
Questions: | Search for "KABOOM" on this page. | |
Issues: |
- mdt.lookup.old
Scenario: | NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by OLD.x server. | |
Business Goals: | access to existing data and meta-data | |
Relevant QA's: | usability | |
details | Stimulus source: | client application |
Stimulus: | RPC | |
Environment: | cluster with NEW.0 release of lustre installed | |
Artifact: | MDT | |
Response: | Given a fid of parent directory, server translates it into inode number (either by doing igif->ino computation, or using /oi index), loads directory inode and looks given name up. If name is found (-ENOENT otherwise), MDT loads inode and checks for "FID" EA. Assuming EA doesn't exist (see next QAS otherwise), server learns that inode was created by OLD.x server, generates igif fid from (inode number, inode generation) pair, and sends this fid to client as lookup result. | |
Response measure: | consistent lookup result that can later be used to access file | |
Questions: | ||
Issues: |
- mdt.lookup.new.0
Scenario: | NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by NEW.0 server. | |
Business Goals: | access to newly created data and meta-data | |
Relevant QA's: | usability | |
details | Stimulus source: | client application |
Stimulus: | RPC | |
Environment: | cluster with NEW.0 release of lustre installed | |
Artifact: | MDT | |
Response: | Given a fid of parent directory, server translates it into inode number (either by doing igif->ino computation, or using /oi index), loads directory inode and looks given name up. If name is found (-ENOENT otherwise), MDT loads inode and checks for "FID" EA. Assuming EA exists (see previous QAS otherwise), server learns that inode was created by NEW.0 server, interprets EA contents as a fid, and sends this fid to client as lookup result. | |
Response measure: | consistent lookup result that can later be used to access file | |
Questions: | ||
Issues: | Possible sanity check: once fid was determined, check that /oi maps this fid to the inode number that was found in the directory. |
- mdt.create
Scenario: | NEW.0 MDT handles CREATE(fid) RPC, with fid supplied by a client | |
Business Goals: | create object that can later be accessed through client supplied fid. | |
Relevant QA's: | usability | |
details | Stimulus source: | client application |
Stimulus: | RPC | |
Environment: | cluster with NEW.0 release of lustre installed | |
Artifact: | MDT | |
Response: | If fid equals the special fill-in-fid constant, MDT generates new fid from an internal fid sequence. New inode is created. "FID" EA is allocated for this inode and filled with the fid. New (inode-number, inode-generation) record is inserted into /oi index with the fid as a key. | |
Response measure: | new object created, and can be accessed by fid later. | |
Questions: | ||
Issues: |
- mdt.readdir
Scenario: | NEW.0 MDT handles READPAGE(parent-fid, offset) RPC | |
Business Goals: | return a page filled with NEW protocol directory entries, provide access to both new and old objects through readdir. | |
Relevant QA's: | usability | |
details | Stimulus source: | client application |
Stimulus: | RPC | |
Environment: | cluster with NEW.0 release of lustre installed | |
Artifact: | MDT | |
Response: | Using dt-index iterators interface (internally based on ldiskfs_readdir()), MDT iterates over directory entries, and places file names and their hashes into directory entries. For every entry the corresponding inode is loaded into memory. If inode contains "FID" EA, its contents are used as a fid, and placed into readdir page. Otherwise, igif fid is generated, and placed into readdir page. | |
Response measure: | pre-existing objects, created by OLD.x server, are visible through readdir. | |
Questions: | ||
Issues: |
Technical Details [not part of architecture, should go into HLD/DLD]
Brief outline of features relevant to interoperability and not mentioned above, supported and expected from the releases above:
OLD.x
- OLD.x: client and OST support both OLD and NEW networking protocol. Protocol version is selected at the time of connection to MDT: if MDT supports OBD_CONNECT_FID connect flag, NEW protocol is used, otherwise OLD.
- once OLD.x node (client or OST) connected to MDT in NEW mode it ensures that all other connections are in this mode too. OST adds OBD_CONNECT_FID flag to its connection mask.
- when connected in NEW mode, OLD.x client
- uses fids to identify inodes in the cache (for uniformity, it can internally use igifs, generated from ino/gen pairs in the OLD mode too). Inode numbers for stat(2), are generated from fids [done for HEAD, being ported to b1_6_cli_reqs];
- expects cmd3-style directory pages in readdir with fids in directory entries [done];
- takes dlm locks in fid name-space [done];
- participates in cmd3 recovery protocol, more on this below [being implemented by Amit];
- uses seq and fld services [done];
- when on a re-connect OLD.x client detects that connection lost OBD_CONNECT_FID flag that it used to have, it evicts itself to get rid of all extra fid-related state.
- No interoperability changes to the MD server code are made in OLD.x release.
- OLD.x OST servers also support both OLD and NEW networking protocols and, depending on the MDS connection flags, either use fids or not. In fid-enabled mode, they act much like clients (see above) in their interaction with the MDT. To support the NEW protocol, the OST has to generate fids for objects already existing on the storage. The resulting surrogate fids are called idifs (igifs for data, see igif description below). [not started yet]
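The connect-time protocol selection and the self-eviction rule described for OLD.x can be sketched as below. This is only an illustration under stated assumptions: the `SKETCH_*` names and both functions are hypothetical, and the bit used for `SKETCH_OBD_CONNECT_FID` is a placeholder, not the real OBD_CONNECT_FID value from the Lustre headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit for illustration only; NOT the real OBD_CONNECT_FID value. */
#define SKETCH_OBD_CONNECT_FID (1ULL << 0)

enum sketch_proto { SKETCH_PROTO_OLD, SKETCH_PROTO_NEW };

enum sketch_action {
    SKETCH_KEEP,        /* same protocol as before, nothing to do */
    SKETCH_SWITCH_NEW,  /* MDT now offers fids: switch to the NEW protocol */
    SKETCH_EVICT_SELF   /* fid flag was lost: drop all fid-related state */
};

/* The protocol is chosen from the flags the MDT granted at connect time. */
static enum sketch_proto sketch_select_proto(uint64_t mdt_flags)
{
    return (mdt_flags & SKETCH_OBD_CONNECT_FID) ? SKETCH_PROTO_NEW
                                                : SKETCH_PROTO_OLD;
}

/* Decide what an OLD.x client does when a re-connect returns new flags. */
static enum sketch_action sketch_on_reconnect(enum sketch_proto cur,
                                              uint64_t mdt_flags)
{
    enum sketch_proto next = sketch_select_proto(mdt_flags);

    assert(cur == SKETCH_PROTO_OLD || cur == SKETCH_PROTO_NEW);
    if (next == cur)
        return SKETCH_KEEP;
    return next == SKETCH_PROTO_NEW ? SKETCH_SWITCH_NEW : SKETCH_EVICT_SELF;
}
```

The asymmetry is deliberate: gaining the flag only adds state, while losing it leaves the client holding fids the server no longer understands, so the sketch returns the eviction action in that case.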
NEW.0
This release introduces MDT server speaking NEW protocol only, and running over OLD-format storage. OST server speaking NEW protocol was introduced in the previous OLD.x release. Support for old protocol is completely eliminated in this release.
To talk in the new protocol, a server has to use FIDs to identify objects, so the NEW.0 MDT generates surrogate FIDs for existing inodes. Such a surrogate FID is referred to as an IGIF (inode-generation FID), because it is built from the inode number and inode generation. Similarly, the NEW.0 OST generates surrogate FIDs (IDIFs) for existing id/group objects. The format of IGIF and IDIF is described in the table below:
fields | SEQ | OID | VER |
FID_SEQ_OST_MDT0 | = 0 | ||
FID_SEQ_LLOG | = 1 | ||
FID_SEQ_ECHO | = 2 | ||
FID_SEQ_OST_MDT1 | = 3 | ||
FID_SEQ_OST_MAX | = 9 (=FID_SEQ_OST_MDT7) | ||
FID_SEQ_IGIF | = 12 | ||
FID_SEQ_IGIF_MAX | = 0xffffffff | ||
FID_SEQ_IDIF | = 0x100000000 | ||
FID_SEQ_IDIF_MAX | = 0x1ffffffff | ||
FID_SEQ_LOCAL_FILE | = 0x200000001 | ||
FID_SEQ_DOT_LUSTRE | = 0x200000002 | ||
FID_SEQ_NORMAL | = 0x200000400 | ||
obdo/lmm/oinfo(OLD) | o_seq:64 [FID_SEQ_OST_MDT0] | o_id_lo:48 | o_id_hi:16 |
obdo/lmm/oinfo(NEW.1) | o_seq:64 [FID_SEQ_{IDIF,NORMAL}] | o_id_lo:32 | o_id_hi:32 |
lu_fid | f_seq:64 | f_oid:32 | f_ver:32 |
IGIF | 0:32, ino:32 [12,FID_SEQ_IGIF_MAX] | gen:32 | 0:32 |
IDIF | 0:31, 1:1, ost_idx:16,o_id_hi:16 | o_id_lo:32 | o_id_hi_hi:16 |
reserved | [FID_SEQ_START,FID_SEQ_START+0x3ff] | f_oid:32 | f_ver:32 |
FID | [FID_SEQ_NORMAL, 2^64-1] | f_oid:32 | f_ver:32 |
Legend:
- FID
- File IDentifier generated by the client from a range allocated by the seq service. The first 0x400 sequences [2^33, 2^33 + 0x3ff] are reserved for system use. Note that on ldiskfs MDTs, IGIF FIDs can use inode numbers starting at 12, but this is in the IGIF SEQ range and does not conflict with assigned FIDs.
- IGIF
- Inode and Generation In FID, a surrogate FID used to globally identify an existing object on an OLD-formatted MDT file system. This would only be used on MDT0 in a DNE filesystem, because there are not expected to be any OLD-formatted DNE filesystems. Belongs to a sequence in the [12, 2^32 - 1] range, where the sequence number is the inode number and the inode generation is used as the OID. NOTE: This assumes no more than 2^32-1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend. NOTE: This assumes that the reserved ext3/ext4/ldiskfs inode numbers [0-11] are never visible to clients, which has always been true.
- IDIF
- object ID in FID, a surrogate FID used to globally identify an existing object on an OLD-formatted OST file system. Belongs to a sequence in [2^32, 2^33 - 1]. The sequence number is calculated as:
(1ULL << 32) | (ost_index << 16) | ((objid >> 32) & 0xffff)
- that is, SEQ consists of 16-bit OST index, and higher 16 bits of object ID. The generation of unique SEQ values per OST allows the IDIF FIDs to be identified in the FLD correctly. The OID field is calculated as:
objid & 0xffffffff
- that is, it consists of the lower 32 bits of the object ID. NOTE: This assumes that no more than 2^48-1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (can uniquely map all objects on an OST that created 1M objects per second for 9 years, or combinations thereof).
- OST_MDT0
- Surrogate FID used to identify an existing object on OLD formatted OST filesystem. Belongs to the reserved sequence 0, and is used internally prior to the introduction of FID-on-OST, at which point IDIF will be used to identify objects as residing on a specific OST.
- LLOG
- for Lustre Log objects the object sequence 1 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.
- ECHO
- for testing OST IO performance the object sequence 2 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.
- OST_MDT1 .. OST_MAX
- for testing with multiple MDTs, object sequences 3 through 9 are used, allowing direct mapping of MDTs 1 through 7 respectively, for a total of 8 MDTs including OST_MDT0. This matches the legacy CMD project "group" mappings. However, this SEQ range is only for testing prior to any production DNE release: because the OST index is not part of the FID, the objects in this range conflict across all OSTs.
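The table and legend above can be restated as a small C sketch. The sequence constants are copied from the table; the `sketch_*` helper names are hypothetical and chosen for this example only. The IDIF builder leaves VER at 0, which relies on the legend's assumption that object IDs stay below 2^48.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct lu_fid {
    uint64_t f_seq;   /* f_seq:64 */
    uint32_t f_oid;   /* f_oid:32 */
    uint32_t f_ver;   /* f_ver:32 */
};

/* Sequence boundaries copied from the table above. */
#define FID_SEQ_IGIF      12ULL
#define FID_SEQ_IGIF_MAX  0xffffffffULL
#define FID_SEQ_IDIF      0x100000000ULL
#define FID_SEQ_IDIF_MAX  0x1ffffffffULL
#define FID_SEQ_NORMAL    0x200000400ULL

static bool sketch_seq_is_igif(uint64_t seq)
{
    return seq >= FID_SEQ_IGIF && seq <= FID_SEQ_IGIF_MAX;
}

static bool sketch_seq_is_idif(uint64_t seq)
{
    return seq >= FID_SEQ_IDIF && seq <= FID_SEQ_IDIF_MAX;
}

static bool sketch_seq_is_normal(uint64_t seq)
{
    return seq >= FID_SEQ_NORMAL;
}

/* IGIF: SEQ = inode number, OID = inode generation, VER = 0. */
static struct lu_fid sketch_igif_build(uint32_t ino, uint32_t gen)
{
    struct lu_fid fid = { .f_seq = ino, .f_oid = gen, .f_ver = 0 };

    assert(sketch_seq_is_igif(fid.f_seq));   /* rejects reserved inodes 0-11 */
    return fid;
}

/* IDIF: SEQ = 1 << 32 | ost_index << 16 | objid bits 32..47,
 *       OID = objid bits 0..31. */
static struct lu_fid sketch_idif_build(uint16_t ost_index, uint64_t objid)
{
    struct lu_fid fid;

    assert(objid < (1ULL << 48));            /* assumption from the legend */
    fid.f_seq = FID_SEQ_IDIF | ((uint64_t)ost_index << 16) |
                ((objid >> 32) & 0xffff);
    fid.f_oid = (uint32_t)(objid & 0xffffffff);
    fid.f_ver = 0;
    return fid;
}
```

For example, object 0x10000002a on OST index 3 gets SEQ 0x100030001 and OID 0x2a, which falls inside the IDIF range and cannot be mistaken for an IGIF or a client-allocated FID.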
For compatibility with existing OLD OST network protocol structures, the FID must map onto the o_id and o_gr fields in a manner that ensures existing objects are identified consistently for IO. It must also map onto the lock namespace, so that IDIFs map onto the same objects for IO and onto the same resources in the DLM.
DLM OLD OBIF/IDIF:
resource[] = {o_id, o_seq, 0, 0}; /* o_seq == 0 for production releases */
DLM NEW.1 FID (this is the same for both the MDT and OST):
resource[] = {SEQ, OID, VER, HASH};
Note that when mapping IDIF values to DLM resource names, the o_id may be larger than the 2^33 reserved sequence numbers for IDIF, so it is possible for the o_id numbers to overlap FID SEQ numbers in the resource. However, in all production releases the OLD o_seq field is always zero, and all valid FID OID values are non-zero, so the lock resources will not collide.
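The non-collision argument can be checked with a short sketch. The two resource layouts are the ones listed above; `struct sketch_res_id` and the helper names are hypothetical stand-ins for the real DLM resource type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct lu_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* Hypothetical stand-in for a four-word DLM resource name. */
struct sketch_res_id {
    uint64_t name[4];
};

/* OLD layout: {o_id, o_seq, 0, 0}; o_seq is 0 in production releases. */
static struct sketch_res_id sketch_res_from_objid(uint64_t o_id, uint64_t o_seq)
{
    struct sketch_res_id res = { { o_id, o_seq, 0, 0 } };

    return res;
}

/* NEW.1 layout: {SEQ, OID, VER, HASH}. */
static struct sketch_res_id sketch_res_from_fid(const struct lu_fid *fid,
                                                uint64_t hash)
{
    struct sketch_res_id res = { { fid->f_seq, fid->f_oid, fid->f_ver, hash } };

    assert(fid->f_oid != 0);   /* all valid FID OID values are non-zero */
    return res;
}

static bool sketch_res_equal(const struct sketch_res_id *a,
                             const struct sketch_res_id *b)
{
    return memcmp(a->name, b->name, sizeof(a->name)) == 0;
}
```

Even when an o_id happens to equal a FID SEQ numerically, the second word differs (0 for the OLD name, a non-zero OID for the FID name), so the two resource names never compare equal.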
For objects within the IDIF range, group extraction (non-CMD) will be:
o_id = ((fid->f_seq & 0xffff) << 32) | fid->f_oid; o_seq = 0; /* formerly group number */
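As a cross-check, the extraction can be derived by inverting the IDIF SEQ and OID formulas from the legend: the low 16 bits of SEQ hold object ID bits 32..47, and OID holds bits 0..31. The sketch below does exactly that; the helper names are hypothetical, and object IDs are assumed to fit in 48 bits as stated in the legend.

```c
#include <assert.h>
#include <stdint.h>

struct lu_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

#define FID_SEQ_IDIF      0x100000000ULL
#define FID_SEQ_IDIF_MAX  0x1ffffffffULL

/* Recover the legacy object ID from an IDIF (inverse of the SEQ/OID formulas). */
static uint64_t sketch_idif_objid(const struct lu_fid *fid)
{
    assert(fid->f_seq >= FID_SEQ_IDIF && fid->f_seq <= FID_SEQ_IDIF_MAX);
    return ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
}

/* Recover the 16-bit OST index stored in SEQ bits 16..31. */
static uint16_t sketch_idif_ost_index(const struct lu_fid *fid)
{
    assert(fid->f_seq >= FID_SEQ_IDIF && fid->f_seq <= FID_SEQ_IDIF_MAX);
    return (uint16_t)((fid->f_seq >> 16) & 0xffff);
}
```

With the IDIF {SEQ 0x100030001, OID 0x2a}, this returns object ID 0x10000002a on OST index 3; o_seq is then set to 0 as in the line above.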
Recovery
There are 2 important recovery scenarios related to interoperability:
- An OLD.x client reconnects to the MDT after a fail-over and learns that it has to switch back to the OLD protocol, because the server was downgraded. The client has to replay requests, but first they have to be converted into the OLD protocol format. This requires changing the message format and going from client-assigned FIDs to inode/generation numbers (storage cookies). If a FID is in IGIF format, it can be converted to an inode number by reversing the IGIF generation algorithm. If a FID is client-generated, then *KABOOM*! The client has to evict itself, because it doesn't know the old-format inode number. Q: Is there a better solution? What to do with RPCs that the old server cannot handle at all, such as SEQ_QUERY? Again, eviction seems to be the only option.
- OLD.x client reconnects to MDT and determines that it has to switch to the new protocol, because MDT was upgraded to NEW.0. To replay RPCs, client has to convert them to the NEW format. This includes message format conversion and going from inode/generation numbers to FIDs. For RPCs that already include inode number as an argument, IGIF FID can be used. For CREATE RPC that requires fid in NEW protocol there are two options:
- client supplies a fill-in-FID. The NEW.0 server recognizes this as a request to generate the FID on the server, and allocates it from a special sequence range reserved for this purpose. Note that this sequence cannot be exhausted, as there is a single MDT in the cluster at that point, which means it has full control over the complete FID space.
- client supplies inode number as in usual OLD protocol replay. Server detects this and creates inode with given inode number. This has certain drawbacks:
- a dependency on ext3-wantedi patch is re-introduced, and
- backward-compatibility code is introduced in NEW.0 release, which we are trying to avoid.
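The identifier conversions needed for replay in the two scenarios can be sketched as below. This covers only the FID/cookie translation, not message format conversion or the CREATE options; `struct sketch_old_cookie`, the `SKETCH_REPLAY_*` values, and both functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct lu_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

#define FID_SEQ_IGIF      12ULL
#define FID_SEQ_IGIF_MAX  0xffffffffULL

/* Hypothetical OLD-protocol storage cookie: inode number and generation. */
struct sketch_old_cookie {
    uint32_t ino;
    uint32_t gen;
};

enum sketch_replay {
    SKETCH_REPLAY_OK,     /* request can be converted and replayed */
    SKETCH_REPLAY_EVICT   /* no OLD equivalent: client evicts itself */
};

/* Server was downgraded: convert a FID back to an ino/gen cookie. */
static enum sketch_replay sketch_fid_to_cookie(const struct lu_fid *fid,
                                               struct sketch_old_cookie *out)
{
    /* Only IGIFs can be reversed; client-generated FIDs have no inode number. */
    if (fid->f_seq < FID_SEQ_IGIF || fid->f_seq > FID_SEQ_IGIF_MAX)
        return SKETCH_REPLAY_EVICT;

    out->ino = (uint32_t)fid->f_seq;
    out->gen = fid->f_oid;
    return SKETCH_REPLAY_OK;
}

/* MDT was upgraded to NEW.0: an existing ino/gen pair becomes an IGIF. */
static struct lu_fid sketch_cookie_to_fid(const struct sketch_old_cookie *c)
{
    struct lu_fid fid = { .f_seq = c->ino, .f_oid = c->gen, .f_ver = 0 };

    assert(fid.f_seq >= FID_SEQ_IGIF);   /* reserved inodes are never replayed */
    return fid;
}
```

The upgrade direction always succeeds for existing inodes, while the downgrade direction fails for any FID outside the IGIF range, which is the case where the text above concludes that eviction is the only option.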