Architecture - Server Network Striping

Summary
Server Network Striping (SNS) is Lustre-level striping of file data over multiple object servers with redundancy, in a style similar to RAID disk arrays. It enhances the reliability and availability of the system at the cost of a modest amount of redundant data, while tolerating the failure of one or more nodes. SNS will use the CMU RAIDframe subsystem to rapidly prototype the distributed, networked RAID system.

Definitions
SNS Server Network Striping

Striped Object The collection of chunk objects on multiple OSTs that together form a single logical striped object.

Chunk Object A logical or persistent object containing data and/or redundant data for a striped object. A striped object consists of its chunk objects.

Stripe Unit The minimal extent in a chunk object.

Full Stripe A collection of stripe units in chunk objects that maps 1-1 to an extent in the striped object. (In RAIDframe, "stripe" is short for full stripe.)

Partial truncate A truncate whose resulting size is not full-stripe-size aligned in a parity RAID pool.

Pool A pool is a name associated with a set of OSTs in a Lustre cluster.

DAG Directed Acyclic Graph. RAIDframe uses DAGs to model RAID operations.

ROD Read Old Data.

ROP Read Old Parity.

WND Write New Data.

WNP Write New Parity.

IOV Page (brw_page) vector.
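
To make the stripe unit / full stripe terminology concrete, below is a minimal sketch of the address arithmetic for a striped object. The function and field names, and the simple RAID0-style rotation (parity placement ignored), are illustrative assumptions rather than the actual SNS layout code.

    /*
     * Illustrative mapping from a logical offset in the striped object to
     * (chunk object index, offset within that chunk object), assuming a
     * plain RAID0-style rotation over `nchunks` chunk objects with a fixed
     * stripe unit size; parity placement is ignored.
     */
    #include <stdint.h>
    #include <stdio.h>

    struct sns_target {
        unsigned chunk_index;   /* which chunk object */
        uint64_t chunk_offset;  /* offset inside that chunk object */
    };

    static struct sns_target sns_map_offset(uint64_t offset,
                                            uint64_t stripe_unit,
                                            unsigned nchunks)
    {
        uint64_t unit_nr = offset / stripe_unit;  /* global stripe unit number */
        struct sns_target t;

        t.chunk_index  = unit_nr % nchunks;
        t.chunk_offset = (unit_nr / nchunks) * stripe_unit +
                         offset % stripe_unit;
        return t;
    }

    int main(void)
    {
        /* 1 MiB stripe unit over 4 chunk objects: offset 5 MiB falls into
         * chunk object 1, at offset 1 MiB (its second stripe unit). */
        struct sns_target t = sns_map_offset(5ULL << 20, 1ULL << 20, 4);

        printf("chunk %u, offset %llu\n", t.chunk_index,
               (unsigned long long)t.chunk_offset);
        return 0;
    }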

SNS Fundamentals

 * SNS should be based on RAIDframe and support different kinds of RAID geometries.
 * SNS should provide high availability while maintaining reasonably high performance.
 * SNS should be implemented as a separate module; SNS should be an optional layer: Lustre can work with or without it; SNS can support a lov-less configuration and local mount.
 * New client I/O interfaces should match the I/O interfaces of the RAIDframe engine.
 * The new file layout format should accommodate various RAID patterns.
 * Support using mountconf to set up and configure the SNS OBD.
 * Extent locks covering a full stripe across several OSTs should not result in cascading evictions.
 * Handle grants/quota correctly.
 * Handle file metadata (e.g. file size) with redundancy correctly in both the normal and the degraded case.
 * Support retry of failed I/O during DAG execution.

SNS Detailed Recovery
A number of possible failures have to be addressed:
 * Client crashes in the middle of a stripe write.
 * OST(s) crash or power off.
 * OST(s) experience an I/O error or disk failure.
 * A network failure occurs in the communication between client(s) and one or more OSTs containing stripes.

To handle these failures gracefully, the failure detection and recovery procedures have the following requirements:
 * CUT preserves striped object consistency; a CUT cannot contain inconsistent stripe writes.
 * An OST is able to roll back incomplete stripe writes in case of a client crash.
 * An I/O error may be handled by switching the object to degraded mode (a reconstruction sketch follows this list).
 * Striped object recovery may continue even if some stripe sites are not available, by switching the object to degraded mode.
 * Object consistency constraints apply to the process of client redo-log replay.
 * RAID on-line recovery (degraded -> normal).
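
As noted in the list above, a minimal sketch of degraded-mode reconstruction is shown below. It only illustrates the parity-RAID idea (XOR of the surviving stripe units with the parity unit); in SNS the actual reconstruction would be driven by a RAIDframe DAG, and the names used here are assumptions.

    /*
     * Degraded-mode read sketch: when one chunk object of a parity stripe
     * is unavailable, the missing stripe unit is rebuilt by XORing the
     * surviving data units with the parity unit.
     */
    #include <stddef.h>
    #include <string.h>

    static void sns_reconstruct_unit(char *out, size_t unit_size,
                                     char *const units[], unsigned nunits,
                                     unsigned failed)
    {
        unsigned i;
        size_t b;

        memset(out, 0, unit_size);
        for (i = 0; i < nunits; i++) {
            if (i == failed)                /* skip the lost unit */
                continue;
            for (b = 0; b < unit_size; b++)
                out[b] ^= units[i][b];      /* accumulate the XOR */
        }
    }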

SNS Architecture Overview


SNS is implemented as a separate module. Similar to LOV, SNS is a new kind of OBD: it fans I/O out to the various OSTs in the SNS pool via the underlying OSCs, but appears as a single target to llite.

SNS can stack directly under llite, or directly on top of local disks/OBDs to support a local mount.

Moreover, file objects in an SNS pool can have various RAID patterns such as RAID0, RAID1, RAID5, parity declustering, etc.

SNS write as a distributed transaction
An SNS write operation is a distributed transaction when several stripes on different servers get updated.

A simple approach is to let the client issue all write requests to the data servers. The client starts the transaction and communicates directly with the servers. In terms of Lustre transactions the client is the originator and the OSTs are replicators.

Unfortunately, clients have no persistent storage (at least we assume so), which makes it impossible to reuse Lustre transactions with redo-logging on the originator side.

Undo-logging on the OST side is a solution to this problem. The undo operation can be triggered by a timeout if the client does not close the transaction correctly within a specified period of time.

Undo logging at OST side
Llogs with undo records are updated transactionally together with each OST write operation. A trick keeps the undo records small: a modifying write puts the new data in a new block and adds an undo record describing the old data's location, while an append is recorded in the llog simply as a file size increase.
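
A minimal sketch of what such undo records might contain is shown below; the record layout and names are assumptions for illustration, not the actual llog record format.

    /*
     * Illustrative undo record for the OST-side undo llog.  A modifying
     * write only needs to remember where the old data block lives (the new
     * data goes to a fresh block); an append only needs the previous file
     * size.
     */
    #include <stdint.h>

    enum sns_undo_type {
        SNS_UNDO_MODIFY = 1,        /* overwrite of existing data */
        SNS_UNDO_APPEND = 2,        /* growth of the chunk object */
    };

    struct sns_undo_rec {
        uint32_t ur_type;           /* enum sns_undo_type */
        uint64_t ur_transno;        /* transaction this record can undo */
        union {
            struct {
                uint64_t old_block;   /* where the old data is kept */
                uint64_t file_offset; /* extent it belonged to in the object */
                uint32_t length;
            } modify;
            struct {
                uint64_t old_size;    /* object size before the append */
            } append;
        } u;
    };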

RAIDframe limitations

 * All needed stripe locks should be acquired before DAG execution and released afterwards.
 * There is no built-in support for metadata consistency (for example, file size) in RAIDframe.
 * Page-aligned I/O only.
 * RAIDframe needs to know the stripe state (number and location of failed chunks) to construct a correct DAG; keeping this state up-to-date across the network is a problem.

Metadata redundancy

 * The striped object size is stored with each chunk object.
 * For each stripe, information about failed chunks is stored with each chunk object.
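
A minimal sketch of such per-chunk redundant metadata is shown below; the field names and the bitmap encoding are illustrative assumptions.

    /*
     * Redundant metadata kept with every chunk object: the size of the
     * whole striped object and, for each stripe, a bitmap of failed
     * chunks.
     */
    #include <stdint.h>

    struct sns_chunk_md {
        uint64_t cm_striped_size;   /* size of the whole striped object */
        uint32_t cm_nr_stripes;     /* number of entries in cm_failed[] */
        uint32_t cm_failed[];       /* one bitmap per stripe: bit i set
                                     * means chunk object i has failed */
    };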

DAG creation
Achieving consistency in the DAG creation procedure.

A client creates a DAG based on cached information about the striped file. If the DAG doesn't match the current stripe state, the OST replies with an error, asking the client to re-create a correct DAG.
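
A sketch of the resulting client-side retry loop is shown below. All helper names are hypothetical placeholders, and -ESTALE merely stands in for whatever error the OST would return for a DAG built from stale stripe state.

    /*
     * Client-side retry loop for DAG creation: build a DAG from the cached
     * stripe state, and if the OST rejects it as stale, refresh the state
     * and build a new one.
     */
    #include <errno.h>

    struct sns_stripe_state;
    struct rf_dag;

    /* Hypothetical helpers assumed to exist in the SNS/RAIDframe glue. */
    extern struct rf_dag *sns_dag_create(const struct sns_stripe_state *state);
    extern int sns_dag_execute(struct rf_dag *dag);
    extern int sns_refresh_stripe_state(struct sns_stripe_state *state);

    static int sns_do_io(struct sns_stripe_state *state, int max_retries)
    {
        int rc;

        do {
            struct rf_dag *dag = sns_dag_create(state);

            rc = sns_dag_execute(dag);
            if (rc != -ESTALE)          /* success, or a hard error */
                return rc;
            /* stale stripe state: re-fetch it and try again */
            rc = sns_refresh_stripe_state(state);
        } while (rc == 0 && max_retries-- > 0);

        return rc ? rc : -ESTALE;
    }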

I/O components
RAIDframe requires that I/O submitted to the RAIDframe engine be contiguous and sector-size (page-size) aligned. The new I/O layer and interfaces should match the I/O characteristics of the RAIDframe engine and simplify integration with SNS.
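
A minimal sketch of the resulting pre-submission check is given below; the fragment representation is an illustrative assumption, not the real brw_page vector.

    /*
     * Check that an IOV is acceptable to the RAIDframe engine: every
     * fragment page aligned, and the fragments together forming one
     * contiguous extent.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SNS_PAGE_SIZE 4096ULL

    struct sns_frag {
        uint64_t off;   /* byte offset in the striped object */
        uint64_t len;   /* fragment length in bytes */
    };

    static bool sns_iov_is_rf_ready(const struct sns_frag *frags, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++) {
            if (frags[i].off % SNS_PAGE_SIZE || frags[i].len % SNS_PAGE_SIZE)
                return false;           /* not page aligned */
            if (i > 0 && frags[i].off != frags[i - 1].off + frags[i - 1].len)
                return false;           /* hole between fragments */
        }
        return true;
    }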

The I/O path and components of SNS are shown in the graph below:



Common I/O path and components with/without SNS

 * read: Initialize the generic client pages (cl_page) of the various layers; form an IOV in the llite layer and submit it to the lower layer via cl_submit_iov.


 * readahead: Add readahead pages into an IOV via the interface cl_add_page, which is similar to 'bio_add_page' in the Linux kernel and is used to check whether the IOV is already optimal or would cross a stripe boundary, etc. (a sketch of this accumulation loop follows this list); then submit the IOV to the lower layer via cl_submit_iov.


 * directIO: Almost the same as the original path: form an IOV and submit it to the lower layer.


 * writepage: Check whether the client is out of grants/quota or hitting the dirty maximum of the object/OBD. If so, start writing out pages to update quota/grants; otherwise, the page manager queues the dirty page and caches it on the client.


 * Page manager: The page manager is used to cache and manage the generic cl_page of the various layers. In the new client I/O layering, the page manager could in theory be implemented in various layers. An important function of the page manager is forming IOVs: once enough dirty pages have been queued to build a good IOV, it forms one and submits it to the next I/O component to achieve maximum throughput; once the client is out of grants/quota or hits the dirty maximum of the object/OBD, it starts batching dirty pages into IOVs and submitting them to the next I/O component.


 * OSC I/O scheduler: This is an optional component. To utilize network bandwidth efficiently, the OSC should build RPCs of maximum RPC size, which requires an I/O scheduler together with a timer and an IOV queue in the OSC to delay and merge I/O requests, especially when good IOVs are divided by RAIDframe into small sub-IOVs for the various chunk objects. Similar to the Linux I/O scheduler, IOVs from various objects are queued into the scheduler, which can merge small IOVs from different objects into one big I/O RPC; moreover, by adjusting the unplug expiry period, it can implement local traffic control combined with the Network Request Scheduler (NRS), according to feedback from the server.
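
The readahead item above refers to accumulating pages into an IOV until it is optimal or would cross a stripe boundary; a minimal sketch of such an accumulation loop follows. The iov structure and the return convention are assumptions, not the actual cl_add_page / cl_submit_iov API.

    /*
     * Accumulate pages into an IOV, stopping when the IOV is already
     * "optimal" (big enough for a full RPC), the page is not contiguous
     * with it, or the page would cross a full-stripe boundary.  A false
     * return means "submit the current IOV first".
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SNS_PAGE_SIZE     4096ULL
    #define IOV_OPTIMAL_PAGES 256           /* e.g. a 1 MiB RPC */

    struct sns_iov {
        uint64_t start;      /* object offset of the first page */
        size_t   nr_pages;   /* pages accumulated so far */
        uint64_t stripe_sz;  /* full stripe size, for boundary checks */
    };

    static bool sns_iov_add_page(struct sns_iov *iov, uint64_t page_off)
    {
        uint64_t end = iov->start + iov->nr_pages * SNS_PAGE_SIZE;

        if (iov->nr_pages >= IOV_OPTIMAL_PAGES)
            return false;                   /* already an optimal IOV */
        if (iov->nr_pages > 0 && page_off != end)
            return false;                   /* not contiguous */
        if (iov->nr_pages > 0 &&
            page_off / iov->stripe_sz != iov->start / iov->stripe_sz)
            return false;                   /* crosses a stripe boundary */

        if (iov->nr_pages == 0)
            iov->start = page_off;
        iov->nr_pages++;
        return true;
    }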

OSC I/O scheduler
The OSC I/O scheduler can be implemented in the OSC or as a separate module. It has the following features:
 * Each OSC OBD has an optional I/O scheduler together with two separate IOV queues, for read and write respectively.
 * The timer expiration period of the scheduler for read/write can be tuned.
 * The IOVs are queued in the IOV queues and preferentially merged according to their chunk objects.
 * The OSC I/O scheduler processes the IOVs in the queue to build an I/O RPC when:
 * There are enough IOVs in the queue to make up a maximum-size RPC.
 * The unplug timer expires.
 * It is triggered by the upper layer for fsync/sync or sync_page operations.
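
A minimal sketch of the flush decision expressed by the three conditions above is shown below; the structure and field names are illustrative assumptions.

    /*
     * Decide whether the OSC I/O scheduler should turn its queued IOVs
     * into an RPC now: enough bytes for a max-size RPC, the unplug timer
     * expired, or a sync forced from above.
     */
    #include <stdbool.h>
    #include <stdint.h>

    struct osc_iov_queue {
        uint64_t q_bytes;        /* bytes currently queued */
        uint64_t q_max_rpc;      /* maximum RPC payload size */
        bool     q_timer_fired;  /* unplug timer expired */
        bool     q_sync_forced;  /* fsync/sync/sync_page from the upper layer */
    };

    static bool osc_sched_should_flush(const struct osc_iov_queue *q)
    {
        return q->q_bytes >= q->q_max_rpc ||
               q->q_timer_fired ||
               q->q_sync_forced;
    }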

SNS I/O components
SNS contains three I/O components: RAIDframe, the page manager, and the RF-OSC I/O wrapper.


 * RAIDframe uses DAGs to model RAID operations. After receiving a contiguous IOV issued from the upper layer or the page manager, the RAIDframe engine builds DAGs which contain various kinds of DAG nodes such as ROD, ROP, WND, WNP, XOR, C, etc., and then executes the DAGs. (See "RAIDframe: Rapid Prototyping for Disk Arrays" for details.)


 * Page manager: In the configuration without SNS, the page manager is implemented in the OSC layer and simply chains dirty pages into the list of the corresponding chunk object. In SNS it is much more complex: RAIDframe configured with a parity pattern prefers full-stripe writes for good performance, so dirty pages are managed according to the stripe index they reside in, making it easy to batch full-stripe IOVs.


 * The function of the RF-OSC I/O wrapper is to submit the sub-IOVs (ROD, ROP, WND or WNP generated by the RAIDframe engine) to the lower layer.


 * By reusing the layout interfaces in RAIDframe, SNS implements functionality such as offset conversion, size calculation, etc.

Lock protocols
To maintain consistency while updating parity, locking of the full-stripe extent should be serialized. But a full-stripe lock spans multiple OSTs: serialized locking would lead to cascading evictions, while locking each OST in parallel (lock, I/O, unlock operations to each OST in parallel) cannot maintain the atomicity of the parity update.

To meet this requirement, a master lock is introduced. The protocol is shown below (complex failures such as network partition are not considered):
 * 1) In the client SNS layer, extend the extent of the lock to be full-stripe-size aligned.
 * 2) The full-stripe lock is divided into G sub stripe-unit locks for the various chunk objects. There are two candidates for the master lock: the sub-lock L for the parity stripe unit P, which resides on OST A, and the sub-lock L' of the stripe unit with index (P + 1) % G, which resides on OST B.
 * 3) The client sends the stripe-unit lock requests in parallel.
 * 4) The client waits to acquire the master lock from OST A.
 * 5) If acquiring lock L fails, the client retries by acquiring the remastered lock L' from OST B.
 * 6) On receiving the remastering lock request, OST B checks the status of OST A. If it finds OST A is still active, OST B replies to the client with an indication that OST A is still alive, and the client will retry acquiring the master lock from OST A or report an error immediately (a network partition may have occurred: the link between the client and OST A fails while the link between OST A and OST B works). If OST A has indeed failed (powered off, etc.), OST B enqueues the lock, records that it was remastered from OST A, and then replies to the client that the lock has been remastered to OST B.
 * 7) If the master lock is acquired successfully, the client starts the subsequent I/O operations.
 * 8) When OST A powers back on, the other OSTs in the pool remaster the locks back to OST A.
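
A client-side sketch of the master-lock acquisition with remastering fallback (steps 4-7 above) is given below; the helper names and the error codes used to signal "OST A unreachable" and "OST A still alive" are hypothetical placeholders.

    /*
     * Acquire the master lock from OST A; if A appears dead, ask OST B to
     * remaster the lock, and if B reports that A is in fact alive, retry
     * against A.
     */
    #include <errno.h>

    struct sns_lock;

    /* Hypothetical helpers assumed to exist in the SNS lock layer. */
    extern int sns_enqueue_master(struct sns_lock *lck, unsigned ost);    /* step 4 */
    extern int sns_enqueue_remaster(struct sns_lock *lck, unsigned ost);  /* step 5 */

    static int sns_acquire_master(struct sns_lock *lck,
                                  unsigned ost_a, unsigned ost_b)
    {
        int rc = sns_enqueue_master(lck, ost_a);

        if (rc == 0 || rc != -ENOTCONN)
            return rc;                  /* acquired, or a hard error */

        /* OST A looks dead: ask OST B to remaster the lock (steps 5-6).
         * OST B may report that A is actually alive; retry A once then. */
        rc = sns_enqueue_remaster(lck, ost_b);
        if (rc == -EALREADY)
            rc = sns_enqueue_master(lck, ost_a);
        return rc;
    }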

There was once a proposal to acquire the locks for data as normal, and to acquire locks for parity only when the parity is updated during sync (a short period). But in the degraded read case, we still face the problem of locking the full stripe.