WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - Server Network Striping

From Obsolete Lustre Wiki
Revision as of 13:23, 22 January 2010 by Docadmin (talk | contribs)

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.


Server Network Striping (SNS) is Lustre-level striping of file data over multiple object servers with redundancy, in a style similar to RAID disk arrays. At the cost of a small amount of redundant data, it improves the reliability and availability of the system by tolerating the failure of one or more nodes. SNS will use the CMU RAIDframe subsystem for rapid prototyping of the distributed networked RAID system.


SNS Server Network Striping

Striped Object The collection of objects on multiple OSTs forming a striped object.

Chunk Object A logical or persistent object containing data and/or redundant data for a striped object. A striped object consists of its chunk objects.

Stripe Unit Minimal extent in chunk objects.

Full Stripe A collection of stripe units in chunk objects that maps 1-1 to an extent in the striped object. (In RAIDframe, stripe is short for full stripe.)

Partial truncate A truncate in a parity RAID pool after which the file size is not aligned to the full-stripe size.

Pool A pool is a name associated with a set of OSTs in a Lustre cluster.

DAG Directed Acyclic Graph. RAIDframe uses DAGs to model RAID operations.

ROD Read Old Data.

ROP Read Old Parity.

WND Write New Data.

WNP Write New Parity.

IOV Page (brw_page) vector.
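The relationship between striped object, chunk objects, stripe units and full stripes can be illustrated with a small offset-mapping sketch. The layout below is an assumption made only for illustration: one parity stripe unit per full stripe, with the parity position rotating by one chunk object per stripe. The function and field names are hypothetical and are not SNS or RAIDframe interfaces.

```python
from typing import NamedTuple


class UnitLocation(NamedTuple):
    stripe: int        # index of the full stripe
    chunk: int         # index of the chunk object holding the data
    parity_chunk: int  # index of the chunk object holding this stripe's parity
    chunk_offset: int  # byte offset inside the chunk object


def locate(offset: int, unit_size: int, width: int) -> UnitLocation:
    """Map a byte offset in the striped object to a chunk object.

    Assumes `width` chunk objects, one parity stripe unit per full stripe,
    and parity rotating by one chunk object per stripe (illustrative only).
    """
    data_units = width - 1                  # data stripe units per full stripe
    full_stripe = data_units * unit_size    # bytes of file data per full stripe
    stripe, within = divmod(offset, full_stripe)
    data_index, unit_offset = divmod(within, unit_size)
    parity_chunk = stripe % width
    chunk = (parity_chunk + 1 + data_index) % width
    return UnitLocation(stripe, chunk, parity_chunk,
                        stripe * unit_size + unit_offset)
```

With four chunk objects and 4096-byte stripe units, each full stripe carries 12288 bytes of file data, so offset 16394 falls into the second data unit of stripe 1.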


SNS Fundamentals

  • SNS should be based on RAIDframe and support different kinds of RAID geometries.
  • SNS should provide high availability while retaining reasonably high performance.
  • SNS should be implemented as a separate module; SNS should be an optional layer: Lustre can work with or without it; SNS can support lov-less configuration and local mount.
  • New client I/O interfaces should match the I/O interfaces of the RAIDframe engine.
  • The new file layout format should accommodate various RAID patterns.
  • Support using mountconf to set up and configure the SNS OBD.
  • Extent locks that cover a full stripe across several OSTs should not result in cascading evictions.
  • Handle grants/quota correctly.
  • Handle file metadata information (e.g. file size) with redundancy correctly in both the normal and degraded cases.
  • Support retry of failed I/O during DAG execution.

SNS Detailed Recovery

A number of possible failures have to be addressed:

  • A client crashes in the middle of a stripe write.
  • One or more OSTs crash or power off.
  • One or more OSTs experience an I/O error or disk failure.
  • A network failure occurs in communication between client(s) and one or more OSTs containing stripes.

To handle these failures gracefully, failure detection and the recovery procedure must meet the following requirements:

  • The CUT preserves striped object consistency; it cannot contain inconsistent stripe writes.
  • An OST is able to roll back incomplete stripe writes if a client crashes.
  • An I/O error may be handled by switching the object to degraded mode.
  • Striped object recovery may continue even if some stripe sites are unavailable, by switching the object to degraded mode.
  • Object consistency constraints apply to the process of replaying the client redo-log.
  • RAID on-line recovery (degraded -> normal).

SNS Architecture Overview


SNS is implemented as a separate module. Similar to LOV, each SNS is a new kind of OBD that talks to the various OSTs in the SNS pool through the underlying OSCs, but appears as a single target to llite.

SNS can be stacked directly under llite, or directly on top of local disks/OBDs to support local mount.

Moreover, file objects in an SNS pool can have various RAID patterns such as RAID0, RAID1, RAID5, declustered parity, etc.

SNS write as a distributed transaction

An SNS write operation is a distributed transaction whenever stripes on several different servers are updated.

A simple approach is to let the client issue all write requests to the data servers. The client starts the transaction and communicates directly with the servers. In terms of Lustre transactions, the client is the originator and the OSTs are replicators.

Unfortunately, clients have no persistent storage (at least we assume so), which makes it impossible to reuse Lustre transactions with redo-logging on the originator side.

Undo-logging on the OST side is a solution to this problem. An undo operation can be triggered by a timeout if the client does not close the transaction correctly within a specified period of time.

Undo logging at OST side

Llogs with undo records are updated transactionally together with each OST write operation. A trick keeps the undo records small: a modifying write puts its data into a new block and adds an undo record holding the location of the old data, while an append is recorded in the llog as a file size increase.
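The following is a minimal in-memory sketch of this undo scheme, not the llog implementation. All names (`ChunkObject`, `write`, `close`, `expire`) and the 30-second default timeout are hypothetical, and the sketch assumes a single open transaction per object, with appends always landing on the next logical block.

```python
class ChunkObject:
    """Toy OST-side object: a block map plus an undo log per open transaction."""

    def __init__(self):
        self.store = {}      # physical block id -> data
        self.block_map = []  # logical block index -> physical block id
        self.undo = {}       # transaction id -> list of undo records
        self.deadline = {}   # transaction id -> time by which it must be closed
        self._next_id = 0

    def write(self, txn, index, data, now, timeout=30.0):
        self.deadline.setdefault(txn, now + timeout)
        log = self.undo.setdefault(txn, [])
        new_id, self._next_id = self._next_id, self._next_id + 1
        self.store[new_id] = data  # data always goes to a new block
        if index < len(self.block_map):
            # Modifying write: record only where the old data lives.
            log.append(("remap", index, self.block_map[index]))
            self.block_map[index] = new_id
        elif index == len(self.block_map):
            # Append: record only the old size.
            log.append(("size", len(self.block_map)))
            self.block_map.append(new_id)
        else:
            raise ValueError("sparse writes are not modelled in this sketch")

    def close(self, txn):
        """Client closed the transaction correctly: drop its undo records."""
        self.undo.pop(txn, None)
        self.deadline.pop(txn, None)

    def expire(self, now):
        """Roll back every transaction whose client did not close it in time."""
        for txn in [t for t, d in self.deadline.items() if d <= now]:
            for record in reversed(self.undo.pop(txn)):
                if record[0] == "remap":
                    self.block_map[record[1]] = record[2]
                else:
                    del self.block_map[record[1]:]
            del self.deadline[txn]

    def read(self):
        return [self.store[block] for block in self.block_map]
```

Because an undo record holds only a block location or an old size, its length does not depend on the amount of data written, which is the point of the trick described above.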

RAIDframe limitations

  • All needed stripe locks must be acquired before DAG execution and released afterwards.
  • There is no built-in support for metadata consistency (for example, file size) in RAIDframe.
  • I/O must be page-aligned.
  • RAIDframe (RF) needs to know the stripe state (number and location of failed chunks) to construct a correct DAG, and keeping that state up to date across the network is a problem.

Metadata redundancy

  • The striped object size is stored with each chunk object.
  • For each stripe, information about failed chunks is stored with each chunk object.

DAG creation

Consistency must be achieved in the DAG creation procedure.

A client creates a DAG based on cached information about the striped file. If the DAG does not match the current stripe state, the OST replies to the client with an error asking it to re-create a correct DAG.
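The create-check-recreate loop described above can be sketched as follows. This is a toy model: the DAG is reduced to a label plus the stripe state it was built against, and the assumption that the OST's error reply carries the current stripe state is ours, not something the design states. All names are hypothetical.

```python
class StaleStripeState(Exception):
    """Error reply from the OST: the DAG was built against stale stripe state."""

    def __init__(self, current):
        super().__init__("DAG does not match current stripe state")
        self.current = current  # assumption: the reply carries the current state


class Ost:
    def __init__(self, failed_chunks=()):
        self.failed_chunks = frozenset(failed_chunks)

    def execute(self, dag):
        if dag["failed_chunks"] != self.failed_chunks:
            raise StaleStripeState(self.failed_chunks)
        return "ok:" + dag["kind"]


def build_dag(failed_chunks):
    """Pick a DAG shape from the (possibly cached) stripe state."""
    kind = "degraded-write" if failed_chunks else "normal-write"
    return {"kind": kind, "failed_chunks": frozenset(failed_chunks)}


def write_with_retry(ost, cached_failed_chunks, max_attempts=3):
    state = frozenset(cached_failed_chunks)
    for _ in range(max_attempts):
        try:
            return ost.execute(build_dag(state))
        except StaleStripeState as err:
            state = err.current  # refresh the cache and re-create the DAG
    raise RuntimeError("stripe state kept changing")
```

A client whose cache says "no failed chunks" while chunk 2 has failed gets one error reply, rebuilds a degraded DAG, and succeeds on the second attempt.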

I/O components

RAIDframe requires that I/O submitted to the RAIDframe engine be contiguous and sector-size (page-size) aligned. The new I/O layer and interfaces should match the I/O characteristics of the RAIDframe engine and simplify its integration into SNS.

The I/O path and components of SNS are shown in the figure below:


Common I/O path and components with/without SNS

  • read: Initialize the generic client page (cl_page) of the various layers, form an IOV in the llite layer, and submit it to the layer below via cl_submit_iov.
  • readahead: Add readahead pages to the IOV via the cl_add_page interface, which is similar to bio_add_page in the Linux kernel and is used to check whether the IOV is already optimal, whether it crosses a stripe boundary, etc.; then submit the IOV to the layer below via cl_submit_iov.
  • directIO: Almost the same as the original path: form an IOV and submit it to the layer below.
  • writepage: Check whether the client is out of grants/quota or has hit the dirty maximum of the object/OBD. If so, start writing out pages to update quota/grants; otherwise, the page manager queues the dirty page and caches it on the client.
  • Page manager: The page manager caches and manages the generic cl_page of the various layers. In the new client I/O layering, the page manager can in theory be implemented in various layers. An important function of the page manager is forming IOVs: once enough dirty pages have accumulated in its queue to build a good IOV, it forms one and submits it to the next I/O component to achieve maximum throughput; once the client is out of grants/quota or hits the dirty maximum of the object/OBD, it starts batching dirty pages into IOVs and submits them to the next I/O component.
  • OSC I/O scheduler: This is an optional component. To use network bandwidth efficiently, the OSC should build RPCs of the maximum RPC size. This requires an I/O scheduler, together with a timer and an IOV queue in the OSC, to delay and merge I/O requests, especially when good IOVs are divided by RAIDframe into small sub-IOVs for various chunk objects. As with the Linux I/O scheduler, IOVs from various objects are queued in the scheduler, which can merge small IOVs from different objects into one large I/O RPC. Moreover, by adjusting the unplug expiry period, it can implement local traffic control in combination with the Network Request Scheduler (NRS), according to feedback from the server.
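The readahead item describes an add-page check in the style of bio_add_page. A minimal sketch of such a check, assuming that a page is refused when it is not contiguous with the IOV, when it would cross a full-stripe boundary, or when the IOV is already full, could look like this. The class and function names are hypothetical and do not represent the cl_add_page or cl_submit_iov interfaces.

```python
class Iov:
    """Toy page vector that only accepts contiguous pages of one full stripe."""

    def __init__(self, pages_per_stripe, max_pages):
        self.pages_per_stripe = pages_per_stripe
        self.max_pages = max_pages
        self.pages = []  # page indices within the file

    def add_page(self, index):
        """Return True if the page was added, False if the IOV must be submitted first."""
        if self.pages:
            last = self.pages[-1]
            if index != last + 1:
                return False  # not contiguous
            if index // self.pages_per_stripe != last // self.pages_per_stripe:
                return False  # would cross a full-stripe boundary
            if len(self.pages) >= self.max_pages:
                return False  # IOV is already as large as allowed
        self.pages.append(index)
        return True


def build_iovs(page_indices, pages_per_stripe, max_pages):
    """Split a sequence of page indices into IOVs, submitting on each refusal."""
    iovs = []
    iov = Iov(pages_per_stripe, max_pages)
    for index in page_indices:
        if not iov.add_page(index):
            iovs.append(iov.pages)  # stands in for submitting to the layer below
            iov = Iov(pages_per_stripe, max_pages)
            iov.add_page(index)
    if iov.pages:
        iovs.append(iov.pages)
    return iovs
```

With four pages per full stripe, pages 0 to 5 followed by pages 8 and 9 are split at the stripe boundary after page 3 and at the gap after page 5.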

OSC I/O scheduler

The OSC I/O scheduler can be implemented in the OSC or as a separate module. It has the following features:

  • Each OSC OBD has an optional I/O scheduler together with two separate IOV queues, one for read and one for write.
  • The scheduler's timer expiration period for read/write can be tuned.
  • The IOVs are queued in the IOV queues and preferentially merged according to their chunk objects.
  • The OSC I/O scheduler processes the IOVs in the list to build I/O RPCs when:
      • there are enough IOVs in the queue to make up a max-size RPC;
      • the unplug timer expires;
      • an upper layer triggers it for fsync/sync or sync_page operations.
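The three unplug triggers listed above can be sketched with a small queueing model. This is an illustration under stated assumptions, not the OSC design: time is passed in explicitly, an RPC is represented as a list of (chunk object, page) pairs, and the class and method names are hypothetical.

```python
from collections import defaultdict


class OscIoScheduler:
    """Toy scheduler with separate read and write IOV queues and an unplug timer."""

    def __init__(self, max_rpc_pages, unplug_delay):
        self.max_rpc_pages = max_rpc_pages
        self.unplug_delay = unplug_delay
        self.queues = {"read": [], "write": []}       # lists of (chunk_object, pages)
        self.deadline = {"read": None, "write": None}

    def submit(self, direction, chunk_object, pages, now):
        queue = self.queues[direction]
        queue.append((chunk_object, list(pages)))
        if self.deadline[direction] is None:
            self.deadline[direction] = now + self.unplug_delay
        # Trigger 1: enough pages queued to fill a max-size RPC.
        if sum(len(p) for _, p in queue) >= self.max_rpc_pages:
            return self._unplug(direction)
        return []

    def tick(self, now):
        # Trigger 2: the unplug timer expired.
        rpcs = []
        for direction, deadline in list(self.deadline.items()):
            if deadline is not None and now >= deadline:
                rpcs += self._unplug(direction)
        return rpcs

    def sync(self):
        # Trigger 3: an upper layer asks for everything to be written out.
        return self._unplug("read") + self._unplug("write")

    def _unplug(self, direction):
        # Merge preferentially by chunk object, then cut into max-size RPCs.
        by_object = defaultdict(list)
        for chunk_object, pages in self.queues[direction]:
            by_object[chunk_object].extend(pages)
        merged = [(obj, page)
                  for obj in sorted(by_object)
                  for page in sorted(by_object[obj])]
        self.queues[direction] = []
        self.deadline[direction] = None
        step = self.max_rpc_pages
        return [(direction, merged[i:i + step]) for i in range(0, len(merged), step)]
```

Making the unplug delay a plain parameter reflects the tunable expiration period; feedback-driven adjustment in combination with NRS is not modelled here.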

SNS I/O components

SNS contains three I/O components: RAIDframe, the page manager, and the RF-OSC I/O wrapper.

  • RAIDframe uses DAGs to model RAID operations. After receiving a contiguous IOV issued by the upper layer or the page manager, the RAIDframe engine builds DAGs containing various kinds of DAG nodes such as ROD, ROP, WND, WNP, XOR, C, etc., and then executes the DAGs. (See RAIDframe: Rapid Prototyping for Disk Arrays for details.)
  • Page manager: In the configuration without SNS, the page manager is implemented in the OSC layer and simply chains the dirty pages into the list of the corresponding chunk object. In SNS it is much more complex: RAIDframe configured with a parity pattern prefers full-stripe writes for good performance, so dirty pages are managed according to the index of the stripe they reside in, which makes it easy to batch full-stripe IOVs.
  • The function of the RF-OSC I/O wrapper is to issue the sub-IOVs (ROD, ROP, WND or WNP, generated by the RAIDframe engine) to the layer below.
  • By reusing the layout interfaces in RAIDframe, SNS implements functions such as offset conversion and size calculation.
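To make the node names concrete, the sketch below contrasts a full-stripe write with a small (read-modify-write) update under a single XOR parity. It illustrates the standard parity arithmetic that the ROD, ROP, WND, WNP and XOR nodes stand for; it is not RAIDframe code, and the function names are hypothetical.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks (the XOR node)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def full_stripe_parity(data_units):
    """Full-stripe write: parity is computed from the new data alone, with no reads."""
    return xor_blocks(*data_units)


def small_write(data_units, parity, index, new_data):
    """Update one stripe unit without touching the others."""
    old_data = data_units[index]                              # ROD
    old_parity = parity                                       # ROP
    new_parity = xor_blocks(old_parity, old_data, new_data)   # XOR
    new_units = list(data_units)
    new_units[index] = new_data                               # WND
    return new_units, new_parity                              # WNP
```

The small write needs two reads and two writes for one stripe unit, while the full-stripe write needs no reads at all, which is why the page manager tries to batch full-stripe IOVs.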

Lock protocols

To maintain consistency while parity is updated, locking of the full-stripe extent should be serialized. But full-stripe locks span multiple OSTs: serialized locking would lead to cascading evictions, while locking each OST in parallel (lock, I/O, and unlock operations issued to each OST in parallel) cannot maintain the atomicity of the parity update.

To meet this requirement, a master lock is introduced. The protocol is as follows:

  1. In the client SNS layer, extend the extent of the lock so that it is aligned to the full-stripe size.
  2. The full-stripe lock is divided into G stripe-unit sub-locks on the various chunk objects. There are two candidates for the master lock: the sub-lock L for the parity stripe unit P, which resides on OST A, and the sub-lock L' for the stripe unit with index (P + 1) % G, which resides on OST B.
  3. The client sends the stripe-unit lock requests in parallel.
  4. The client waits to acquire the master lock from OST A.
  5. If acquisition of lock L fails, the client retries by requesting the remastered lock L' from OST B.
  6. When it receives the remastering lock request, OST B checks the status of OST A. If it finds that OST A is still active, OST B replies to the client indicating that OST A is still alive, and the client either retries acquiring the master lock from OST A or reports an error immediately (a network partition may have occurred: the link between the client and OST A has failed, but the link between OST A and OST B works). If OST A has indeed failed (for example, powered off), OST B enqueues the lock, records that it was remastered from OST A, and replies to the client to notify it that the lock has been remastered to OST B.
  7. If it succeeds in acquiring the master lock, the client starts the subsequent I/O operations.
  8. When OST A powers back on, the other OSTs in the pool remaster the locks back to OST A.

(Complex failures such as network partitions are not considered.)

There was once a proposal to acquire locks for data as normal and to acquire locks for parity only while updating parity during sync (a short period). But for reads in the degraded case, we would still face the problem of locking the full stripe.
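The choice of the two master-lock candidates and the fallback to remastering in the protocol above can be sketched as follows. The model is deliberately small and rests on assumptions of ours: stripe unit i resides on the i-th OST of the list, liveness and reachability are plain flags, and remastering back to OST A after it returns is left out. All names are hypothetical.

```python
class Ost:
    def __init__(self, name, alive=True, reachable=True):
        self.name = name
        self.alive = alive            # the server itself is up
        self.reachable = reachable    # the client can talk to it
        self.remastered_from = None

    def enqueue_master(self):
        """Master lock request from the client; fails if the client cannot reach a live OST."""
        return self.alive and self.reachable

    def enqueue_remaster(self, primary):
        """Remastering request: the backup OST first checks the primary's status."""
        if primary.alive:
            return "primary-alive"
        self.remastered_from = primary.name
        return "remastered"


def master_candidates(parity_unit, width):
    """Indices of the stripe units whose sub-locks can act as the master lock."""
    return parity_unit, (parity_unit + 1) % width


def acquire_master_lock(osts, parity_unit):
    a, b = master_candidates(parity_unit, len(osts))
    primary, backup = osts[a], osts[b]
    if primary.enqueue_master():
        return primary.name
    if not (backup.alive and backup.reachable):
        raise ConnectionError("neither master lock candidate is available")
    if backup.enqueue_remaster(primary) == "remastered":
        return backup.name
    # The primary is up but the client cannot reach it: report the error.
    raise ConnectionError("primary OST is alive but unreachable from the client")
```

In this sketch the client reports an error immediately when OST B says that OST A is still alive; retrying against OST A, the other option named in the protocol, would be a loop around the first call.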

Use cases

SNS fundamentals cases


id                   quality attribute         summary
Extent_lock          Usability & Availability  Avoid cascading evictions when locks span multiple OSTs.
parity_grants_quota  Usability                 Grants/quota for parity.
Partial_truncate     Consistency & recovery    Parity consistency and recovery for partial truncate.
Metadata_recovery    recovery                  OSD metadata recovery in the degraded case.
IO_retry             Usability & recovery      Failed read I/O.
Obj_alloc            Usability                 Object allocation in the SNS pool.
SNS_configure        Usability                 Configure the SNS device via mountconf.

Extent lock

Scenario: Client tries to acquire an extent lock for I/O.
Business Goals Maintain consistency during parity updates and degraded-case reads; avoid cascading evictions.
Relevant QA's Usability & Availability
details Stimulus Extending the lock to full-stripe granularity leads to locking across multiple OSTs.
Stimulus source Other conflicting clients crash or reboot; network failure
Environment Conflict with other clients
Artifact Lock requests
Response Choose an OST on which one of the chunk objects resides as the master lock server. Consistency of concurrent access is maintained by the chosen lock server, and I/O can start as soon as the lock has been acquired from the master lock server; lock requests to the other OSTs are used to update the LVB and metadata information such as file size, access time, etc. When a lock blocking AST occurs on a stripe unit of a chunk object, the lock manager on the client writes out the pages in the corresponding full stripe.
Response measure
Questions How should failure of the master lock server be handled? Choose another master lock server via election?
Issues For declustered parity, we can always choose the server on which the parity stripe unit of the full stripe resides as the master lock server, but this makes DLM remastering more complex.

Grants/quota for parity

Scenario: Client tries to cache a dirty page on the client.
Business Goals Provide space grants/quota for client cache (both data and parity).
Relevant QA's Usability
details Stimulus Client writes to a file.
Stimulus source Client
Environment Parity is redundant and not part of the file data.
Artifact Grants/quota
Response The client checks whether the page exceeds the grants/quota of the corresponding OSC; SNS checks grants/quota for the parity at the same time, since every data page has a corresponding parity page.
Response measure If one of them fails, the client starts to write out pages to update grants/quota.
Questions Parity RAID can tolerate a single point of failure, including an OST powering off, being destroyed, or running out of space. Furthermore, parity management on the client is complex. Can we therefore use a smarter strategy that acquires grants/quota only for the data and not for the parity?

Partial truncate

Scenario: Client truncates a file.
Business Goals Maintain parity consistency after truncate operations
Relevant QA's Usability
details Stimulus Client truncates a file.
Stimulus source Client.
Environment The file is striped with a parity pattern, and the truncate is partial.
Artifact Parity consistency.
Response SNS sends punch requests to the chunk objects affected by the truncate operation, then updates the parity stripe unit of the last full stripe after the truncate (two phases: punch, parity update).
Response measure
Issues Inconsistency recovery: if the client updates the parity of the last full stripe but fails to send the punch requests, the stripe becomes inconsistent. The same problem exists if the punch requests are sent first and the parity is updated afterwards. The proposed solution is that the client logs the truncate operations in the first phase; when it receives the parity update request, the OST cancels the punch logs.
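The two phases of a partial truncate can be illustrated on a single stripe. The sketch assumes a single XOR parity and that punched ranges read back as zeros; the logging used for inconsistency recovery is not modelled. Function names are hypothetical.

```python
def is_partial_truncate(new_size, unit_size, width):
    """True if the new size is not aligned to the full-stripe size.

    `width` counts all chunk objects, one of which holds parity per stripe.
    """
    return new_size % ((width - 1) * unit_size) != 0


def truncate_last_stripe(data_units, keep):
    """Apply a truncate that keeps `keep` bytes of the last full stripe.

    Phase 1 (punch): drop everything past `keep` in each data stripe unit.
    Phase 2 (parity update): recompute the parity stripe unit.
    """
    unit_size = len(data_units[0])
    punched = []
    for i, unit in enumerate(data_units):
        kept = min(max(keep - i * unit_size, 0), unit_size)
        punched.append(unit[:kept] + bytes(unit_size - kept))  # punched bytes read as zero
    parity = bytearray(unit_size)
    for unit in punched:
        for j, byte in enumerate(unit):
            parity[j] ^= byte
    return punched, bytes(parity)
```

If only one of the two phases reaches the OSTs, the parity no longer matches the punched data, which is the inconsistency described under Issues.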

Metadata recovery in degraded case

Scenario: Client tries to get file attributes.
Business Goals Get the correct file metadata information even in the degraded case
Relevant QA's Usability
details Stimulus Client gets file attributes such as file size, access time, etc.
Stimulus source Client
Environment Pool in the degraded case.
Artifact Metadata recovery
Response In the normal case, the client stores the metadata of the chunk objects touched by its access in the EA of the corresponding parity objects for redundancy.
Response measure In the degraded case, recover the metadata of the invalid chunk object from the redundant information in the other valid chunk objects, then obtain the file's metadata (file size, access time, etc.) by merging the information from the various chunk objects.
Issues A write updates at least two chunk objects (data and the corresponding parity), but a read may update only the one chunk object touched by the portion of data read, so it is an open issue how to store the atime in at least two chunk objects for redundancy.

Failed I/O retry

Scenario: DAG node for read fails.
Business Goals Retry the failed read I/O.
Relevant QA's Usability.
details Stimulus The execution of some I/O DAG nodes (such as ROD, ROP, WND, WNP, etc.) fails.
Stimulus source Failure of an OST or the network.
Environment An OST or the network fails during DAG execution.
Artifact Retry the failed I/O.
Response The RAIDframe engine reconfigures the OBD arrays and their state.
Response measure RAIDframe in SNS creates a new DAG that executes in the degraded case to retry the failed I/O.
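For a read, the retry described in this use case can be sketched as below, assuming a single XOR parity per stripe. An unreadable chunk is modelled as None, the set of failed chunks stands in for the reconfigured array state, and the degraded DAG is reduced to reading every other stripe unit and XOR-ing them. The names are hypothetical and not RAIDframe interfaces.

```python
class ChunkReadError(Exception):
    """Stands in for a failed read DAG node."""


def read_unit(chunks, index):
    data = chunks[index]
    if data is None:  # the OST or the network failed
        raise ChunkReadError(index)
    return data


def read_with_retry(chunks, index, failed):
    """Read one stripe unit; `chunks` holds all units of one stripe, parity included."""
    if index not in failed:
        try:
            return read_unit(chunks, index)  # normal DAG: a single read
        except ChunkReadError:
            failed.add(index)                # reconfigure the array state
    # Degraded DAG: rebuild the unit from all surviving units of the stripe.
    survivors = [read_unit(chunks, i) for i in range(len(chunks)) if i != index]
    out = bytearray(len(survivors[0]))
    for unit in survivors:
        for j, byte in enumerate(unit):
            out[j] ^= byte
    return bytes(out)
```

A second failure in the same stripe raises ChunkReadError from the degraded path, since a single parity cannot cover it.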

Object allocation

Scenario: Client creates a new file.
Business Goals Allocate the objects for the newly created file in the dedicated pool.
Relevant QA's Usability.
details Stimulus Client creates a new file or sets the stripe metadata of the file via ioctl.
Stimulus source Client.
Environment Client, MDS.
Artifact Object allocation.
Response The MDS allocates the chunk objects in the pool, stores the file layout metadata in the EA of the file, and replies to the client with it.
Response measure
  • How can object allocation be kept within the indicated pool without an SNS device on the MDS side, especially when there are multiple pools?
  • What is the effect of CROW on object allocation?
  • The common format of file layout information.

Configure the SNS device via mountconf

Scenario: Client configures the SNS OBD via mountconf
Business Goals Simplify the process by which a user configures and sets up SNS via mountconf.
Relevant QA's Usability.
details Stimulus Client mounts the Lustre file system.
Stimulus source User.
Environment Client, MDS.
Artifact Set up the SNS device via mountconf.
Response The client gets the configuration log via llog from the MGS and configures the SNS device and the RAIDframe layout from the SNS log entries.
Response measure
Questions Is the pool configuration for mountconf still done via llog?

Failure and recovery cases


id                   quality attribute  summary
client_goes_offline  recovery           client crashes doing I/O
ost_goes_down        recovery           OST goes down
ost_delay_online     recovery           OST goes online with a delay
RAID_rebuild         recovery           an agent does striped object rebuild online
RAID_rebuild_crash   recovery           agent or OSTs crashed during striped object rebuild

client goes offline

Scenario: client crashes in the middle of a stripe write
Business Goals Resilience to client crashes
Relevant QA's Availability
details Stimulus OST detects that the client went offline
Stimulus source client crash or reboot, network failure
Environment client or network failures
Artifact striped object consistency
Response OSTs restore the object consistency after a failed write
Response measure
Questions How can an incomplete write be detected?

OST goes down

Scenario: OST crashes or reboots
Business Goals Resilience to server crashes
Relevant QA's Availability
details Stimulus Client tries to reconnect to the OST after it reboots
Stimulus source server crash or reboot, network failure
Environment server or network failures
Artifact cluster fs state
Response OST rolls its state back to the CUT and applies client logs
Response measure
Questions How does the CUT preserve striped object consistency?

OST goes online with a delay

Scenario: OST goes online after a delay
Business Goals Efficient OST re-integration
Relevant QA's Availability & Stability
details Stimulus degraded object state check, attempt to start RAID recovery agent
Stimulus source server crash or reboot, network failure
Environment server or network failures
Artifact recovery agent
Response striped object cleanup or synchronization
Response measure striped object state

RAID online recovery (rebuild)

Scenario: OST goes online after a delay
Business Goals restore object's availability
Relevant QA's Availability
details Stimulus sync or async object state check, attempt to start a recovery agent
Stimulus source server crash or reboot, network failure
Environment Lustre fs with an object in degraded mode
Artifact recovery agent
Response striped object online recovery
Response measure striped object state
Issues recovery of the recovery process's own state

OST crash during striped object rebuild

Scenario: recovery of the online RAID recovery process
Business Goals restore object's availability
Relevant QA's Availability
details Stimulus recovery agent finds an incomplete object recovery
Stimulus source server crash or reboot, network failure
Environment server or network failures
Artifact recovery agent
Response completing striped object online recovery
Response measure striped object state