Architecture - Server Network Striping

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

Server Network Striping (SNS) is Lustre-level striping of file data over multiple object servers with redundancy in a style similar to RAID disk arrays. It enhances the reliability and availability of the system with slight increase of redundant data but with the capability of tolerating the failure of single or multiple nodes. SNS will utilize the CMU RAIDframe subsystem to implement the rapid basic prototyping of the distributed networked RAID system.

Definitions

SNS Server Network Striping

Striped Object The collection of objects on multiple OSTs forming a striped object.

Chunk Object An logical or persistent object containing data and/or redundant data for a striped object. A striped object consists of its chunk objects.

Stripe Unit Minimal extent in chunk objects.

Full Stripe A collection of stripe units in chunk objects that maps 1-1 to an extent in striped object. (In RAIDframe, stripe is short for full stripe)

Partial truncate If the size after truncate is not full-stripe-size aligned in parity RAID pool, it is called partial truncate.

Pool A pool is a name associated with a set of OSTs in a Lustre clueter.

DAG Directed Acyclic Graph. RAIDframe uses DAGs to model RAID operations.

ROD Read Old Data.

ROP Read old Parity.

WND Write New Data.

WNP Write New Parity.

IOV Page (brw_page) vector.

Requirements

SNS Fundamentals

SNS should base on RAIDframe and support different kinds of RAID geometries.
SNS should provide the high availability with reasonable high performance meanwhile.
SNS should be implemented as a separated module; SNS should be an optional layer: Lustre can work with or without it; SNS can support lov-less configuration and local mount.
New client I/O interfaces should match the I/O interfaces of RAIDframe engine.
New file layout format should adapt to various RAID patterns.
Support to use mountconf to setup and configure SNS OBD.
Extent locks covered the full stripe across several OSTs should not result cascading evictions.
Handle grants/quota correctly.
Handle file metadata information (e.g. file size) with redundancy correctly in both normal and degraded case.
Support retry of failed I/O in DAG executing.

SNS Detailed Recovery

Number of possible failures have to be addressed:

Client crashes in a middle of stripe write.
OST(s) crash or power off
OST(s) experience I/O error, disk failure
Network failure occures in communication between Client(s) and one or more OSTs containing stripes.

To handle the failures gracefully, there are following requirements to the process of failure detection and the recovery procedure:

CUT preserves striped object consistency. CUT can't contain inconsistent stripe writes
OST is able to rollback incomplete stripe writes in case of client crash
I/O error may be handled by switching the object to degraded mode
striped object recovery may continue even if some stripe sites are not available by switching the object to degraded mode
object consistency constraints apply to the process of client redo-log replaying
RAID on-line recovery (degraded -> normal)

SNS Architecture Overview

SNS is implemented as a separated module. Similar with LOV, each SNS is a new kind OBD and functions to various OSTs in the SNS pool via the underneath OSCs but can appears as a single target to llite.

SNS can stack under llite directly or upon local disks/OBDs directly to support local mount.

Moreover, file objects in SNS pool can have various RAID patterns such as RAID0, RAID1, RAID5, declustering parity, etc.

SNS write as a distributed transaction

SNS write operation is a distributed transaction when several stripes on different servers get updated.

A simple approach is to let the client issue all write requests to the data servers. The client starts the transaction and communicates directly with the servers. In terms of Lustre transactions the client is the originator and the OSTs are replicators.

Unfortunately clients have no persistent storage (at least we assume so), it makes impossible resusing of lustre transactions with redo-logging at originator side.

Undo-logging at OSTs side is a solution for the problem. Undo operation can be triggered by a timeout, if client doesn't close the transaction correctly in specified period of time.

Undo logging at OST side

Llogs with undo records are transactionally updated together with each OST write op. There is a trick which allows undo records to be small. Modify write puts data to a new block and adds undo record about old data location. Append is recorded in llog as a file size increase.

RAIDFrame limitations

all needed stripe locks should be acquired before DAG executing and released after
there is no built-in support for metadata consistency (for example, file size) in RAIDFrame
page-aligned i/o
RF wants to know stripe state (number and location of failed chunks) for constructing correct DAG, there is a problem to keep the state up-to-date across network

metadata redundancy

striped object size is stored with each chunk object
for each stripe, failed chunks info is stored with each chunk objects

DAG creation

Achieving consistency in DAG creation procedure.

A client creates a DAG based on cached information about striped file. If a DAG doesn't match current stripe state, OST replies with an error to the client asking to re-create correct DAG.

I/O components

RAIDframe requires that I/O submitting to RAIDframe engine must be contiguous and sector-size (page-size) aligned. New I/O layer and interfaces should match I/O characteristic of RAIDframe engine and simplify the integration into SNS.

The I/O path and components of SNS is shown in the graph blew:

Common I/O path and components with/without SNS

read: Initialize the generic client page cl_page of various layers; Form an IOV in llite layer and submit to underlayer via cl_submit_iov.

readahead: Add readahead pages into IOV via the interface cl_add_page which is similar with 'bio_add_page' in linux kernel used to check whether it already forms an optimal IOV or is accross the stripe buoundry, etc; And then submit the IOV to underlayer via cl_submit_iov.

directIO: Almost same as the original path, form an IOV and submit to underlayer.

writepage: Check whehter it's out of grants/quota or hitting the dirty max of the object/OBD. if so, starts to write out pages to update quota/grants; Otherwise, page manager queues the dirty page and cache it on client.

Page manager: Page manager is used to cache and manage generic cl_page of various layers. In the new client I/O layering, page manager can be implemented in various layers theoretically. There is an import functionality of page manager - Form IOVs: once find a good IOV can be built after accumulate enough dirty pages queued by page manager, it will form one and submit to next I/O component to achieve max. throughput; Once out of grants/quota or hit the dirty max of the object/OBD, it starts to batch dirty pages to build IOVs and submit to next I/O component.

OSC I/O scheduler: It is an optional component. To utilize the network bandwidth efficiently, OSC should build RPC with max. RPC size, which needs an I/O scheduler together with the timer and IOV queue in OSC to delay and merge I/O requests especially when good IOVs are divided by RAIDframe into sub (small) IOVs to various chunk objects. Similar with linux I/O scheduler, IOVs from various objects are queued into the scheduler. With its help, it can merge small IOVs from different objects into a big I/O RPC; Moreover, via adjustment of unplug expire period it can implement local traffic control combined with Network Request Scheduler (NRS) according to the feedback from the server.

OSC I/O scheduler

OSC I/O scheduler can be implemented in OSC or as a separate module. It has the following features:

Each OSC OBD has an optional I/O scheduler together with two separate IOV queues for read and write respectively.
The peroid of timer expiration of schduler for read/write can be tunned.
The IOVs are queued in IOV queues and preferentially merged according to the chunk objects.
The OSC I/O scheduler processes IOVs in the list to build I/O RPC when,

Once there are enough IOVs in the queue to consist a max-size RPC.
On the unplug timer expireation.
Triggered by upper layer for fsync/sync or sync_page operations.

SNS I/O components

SNS contains three I/O components: RAIDframe, page manager, RF-OSC I/O wrapper.

RAIDframe uses DAGs to model RAID operations. After received the contiguous IOV issued from upper layer or page manager, RAIDframe engine builds DAGs which contains various kinds of DAG nodes such as ROD, ROP, WND, WNP, XOR, C, etc, and then execute the DAGs. (See RAIDframe: Rapid Prototyping for Disk Arrays in detail)

Page manager: In the configuration without SNS, the page manager is implemented in OSC layer, and it just simply chains the dirty pages into the list of corresponding chunk object; While in SNS it is much complext. RAIDframe configured with parity pattenr perfers full stripe write to reach good perfromance, the dirty page is managed according to the stripe index it resides to batch full-stripe IOV eadily.

The functions of RF-OSC I/O wrapper is to make sub IOVs (ROD, ROP, WND or WNP generated by RAIDframe engine) to the underlayer.

By reusing the layout interfaces in RAIDframe, SNS implements the functionalities such as offset converting, size calculation, etc.

Lock protocols

To maintain consistency during updating parity, it should serialize locking for full-stripe extent. But full-stripe locks are accross multiple OSTs. serialized locking will lead ascading evictions and locking per OST in parallel (Lock, I/O unlock operations to each OST parallel) can not maintain the atomicity of parity update.

To meet the requirement, it introduces master lock. The protocols is shown as follow:

In client SNS layer, extend the exent of lock to full-stripe-size aligned.
Full-stripe lock is divided into G sub stripe-unit locks to various chunk objects. There are two candicates for master lock: the sub Lock L for parity stripe unit P which resides on OST A and sub Lock L' of stripe unit with index (P + 1) % G which resides on OST B.
Client sends stripe-unit locks parallelly.
Client waits acquiring master lock from OST A.
If acquirement of Lock L fails, client retries to acquire remastering Lock L' from OST B.
When receive the remastering lock request, OST B checks the status of OST A. If find OST A is still active, OST B replies client with the indication that OST A is still alive, client will retry to acquire master lock from OST A or report error immediately (maybe occur network partition failure: net link between client<->OST A fails but net link between OSTA<->OSTB works well.); If OST A indeed occurs failure such as power off, etc, OST B enqueues the lock and records that it's remastered from OST A, and then replies to client to notify that the lock is remastered to OST B.
If succeed to acquire the master lock, client starts the following I/O operations.
When OST A powers back on, other OSTs in the pool remaster the locks back to OST A.

(Not consider the complex failure such as network partition.)

There is once a proposal that acquire the locks for data as normal, acquire locks for parity only when update parity during sync (short period). But in degraded case of read, we still face the problem of locking of full stripe.

Use cases

SNS fundamentals cases

QASs

id	quality attribute	summary
Extent_lock	Usability & Availability	Avoid cascading evictions as locks accorss multiple OSTs.
parity_grants_quota	Usability	grants/quota for parity.
Partial_truncate	Consistency & recovery	Parity consistency and recovery for partial truncate
Metadata_recovery	recovery	OSD metadata recovery in degraded case.
IO_retry	Usability & recovery	Filed read I/O .
Obj_alloc	Usability	Object allocation in SNS pool.
SNS_configure	Usability	Configure the SNS device via mountconf

Extent lock

Scenario:		Client tries to acquire extent lock for I/O.
Business Goals		Maintain the consistency during parity updating and read in degraded case, avoid cascading evictions
Relevant QA's		Usability & Availability
details	Stimulus	extend the lock with full-stripe granularity, lead locking accorss multiple OSTs.
	Stimulus source	Other conflict clients crash or reboot, network failure
	Environment	Conflict with other clients
	Artifact	Lock requests
	Response	Choose a OST one of chunk objects resides as the master lock server. Consistency of concurrent access is maintained by the choosen lock server, I/O can start just afer acquire lock from the master lock server; lock requests to other OSTs are used to update the lvb and metadata information such as file size, access time, etc; When lock blocking AST takes places on a stripe unit of a chunk object, lock manager on client writes out pages in the corresponding full stripe.
	Response measure
Questions		How to handle the case of failure of Master lock server? Choose another master lock server via election?
Issues		For declustering parity, we can always choose the server parity stripe unit of full stripe resides as master lock server; But it will make the DLM remaster more complex.

grants/quota for parity

Scenario:		Client tries to cache a dirty page on client.
Business Goals		Provide space grants/quota for client cache (both data and parity).
Relevant QA's		Usability
details	Stimulus	Client writes to a file.
	Stimulus source	Client
	Environment	Parity is redundant and not part of file data;
	Artifact	Grants/quota
	Response	Client checks whether the page is out of grants/quota of the corresponding OSC; SNS checks grants/quota for the parity at the same time as every data page has a corresponding parity.
	Response measure	If one of them fails, client starts to write out pages to update grants/quota.
Questions		In parity RAID, it can tolerate one point failure included poweroff, destory, using up of space of OST, etc, furthermore parity management on client is complex, thus can we use some smart strategies that only acquiring grants/quota for the data but not for parity.
Issues

Partial truncate

Scenario:		Client truncates a file.
Business Goals		Maintain parity consistency after truncate operations
Relevant QA's		Usability
details	Stimulus	Client truncates a file.
	Stimulus source	Client.
	Environment	File is striping with parity pattern. It's partial truncate;
	Artifact	Parity consistency.
	Response	SNS sends the punch requests to various chunk objects touched the truncate operations; SNS updates the parity stripe unit of the last full stripe after truncate. (two phase: punch, parity update)
	Response measure
Questions
Issues		inconsistency recovery: client updates the parity of last full stripe, but fails to sends the punch requests, leading consistent problem. The same problem exists If first sends the punch requests and then update the parity. The proposed solution is: clients logs the truncate operations in the frist phase; When receive the parity update request, OST cancels the punch logs.

Metadata recovery in degraded case

Scenario:		Client tries to get file attributes.
Business Goals		Get the correct file metadata information even in degraded case
Relevant QA's		Usability
details	Stimulus	Client gets file attributes such as file size, access time, etc.
	Stimulus source	Client
	Environment	Pool in degraded case;
	Artifact	Metadata recovery
	Response	In normal case, client stores metadata information of chunk objects touched client access into the EA of corresponding parity objects for redundancy.
	Response measure	In degraded case, recovery the metadata information of invalid chunk object via the redundant information in other valid chunk objects. And then get metadata information (file size, access time, etc) of file by merging the information in various chunk objects.
Questions
Issues		For write it will update at least two chunk objects (data and corresponding parity); But for read it may update only one chunk object touched the protion of read data, thus there's an issue that how to store the atime into at least two chunk objects for redundancy.

Failed I/O retry

Scenario:		DAG node for read fails.
Business Goals		Retry the failed read I/O.
Relevant QA's		Usability.
details	Stimulus	the execution of some I/O DAG nodes (such as ROD, ROP, WND, WNP, etc) failed.
	Stimulus source	failure of OST or network.
	Environment	OSTs or network occur failure during DAG executing.
	Artifact	Retry the failed I/O.
	Response	the RAIDframe engine reconfigures the OBD arrays and thier state.
	Response measure	RAIDframe in SNS creates a new DAG to execute in degraded case to retry the failed I/O.
Questions
Issues

Object allocation

Scenario:		Client creates a new file.
Business Goals		Allocate the object for the new created file in the dedicated pool.
Relevant QA's		Usability.
details	Stimulus	Client creates a new file or set the stripe meatadata of the file via IOCTL.
	Stimulus source	Client.
	Environment	Client, MDS.
	Artifact	Object allocation.
	Response	MDS alocates the chunk objects in the pool, stores the file layout metadata into the EA of the file and replies to client with it.
	Response measure
Questions		How to make sure the object allocation within the indicated pool without SNS device on MDS site especially there are multiple pools? what's the effect of CROW on object allocation? The common format of file layout information.
Issues

Configure the SNS device via mountconf

Scenario:		Client configures the SNS OBD via mountconf
Business Goals		Simplify the processes that user configures and setups the SNS via mountconf.
Relevant QA's		Usability.
details	Stimulus	Client mounts the lustre file system.
	Stimulus source	User.
	Environment	Client, MDS.
	Artifact	Setup SNS device via mountconf.
	Response	Client gets configuration log via llog from MGS and Configure the SNS device and layout of RAIDframe via SNS log enties
	Response measure
Questions		Is the configuration of pool for mountconf still by llog?
Issues

Failure and recovery cases

QASs

id	quality attribute	summary
client_goes_offline	recovery	client crashes doing i/o
ost_goes_down	recovery	OST goes down
ost_delay_online	recovery	OST goes online with a delay
RAID_rebuild	recovery	an agent does striped object rebuild online
RAID_rebuild_crash	recovery	agent or OSTs crahed during striped object rebuild

client goes offline

Scenario:		client crashes in the middle of stripe write
Business Goals		resistibility to client crashes
Relevant QA's		Availability
details	Stimulus	OST detects that client went offline
	Stimulus source	сlient crash or reboot, network failure
	Environment	client or network failures
	Artifact	striped object consistency
	Response	OSTs restore the object consistency after a failed write
	Response measure
Questions		a way to detect incompleted write
Issues

OST goes down

Scenario:		OST crashes or reboots
Business Goals		resistibility to server crashes
Relevant QA's		Availability
details	Stimulus	Client tries to reconnect to the OST after it reboots
	Stimulus source	server crash or reboot, network
	Environment	server or network failures
	Artifact	cluster fs state
	Response	OST rolls its state to the CUT and applies client logs
	Response measure
Questions		how does CUT preserve striped object consistency?
Issues

OST goes online with a delay

Scenario:		OST goes online after a delay
Business Goals		Efficient OST re-integration
Relevant QA's		Availability & Stability
details	Stimulus	degraded object state check, attempt to start RAID recovery agent
	Stimulus source	server crash or reboot, network failure
	Environment	server or network failures
	Artifact	recovery agent
	Response	striped object cleanup or synchronization
	Response measure	striped object state
Questions
Issues

RAID online recovery (rebuild)

Scenario:		OST goes online after a delay
Business Goals		restore object's availability
Relevant QA's		Availability
details	Stimulus	sync or async object state check, attempt to start a recovery agent
	Stimulus source	server crash or reboot, network failure
	Environment	Lustre fs with obect in degraded mode
	Artifact	recovery agent
	Response	striped object online recovery
	Response measure	striped object state
Questions
Issues		recovery of recovering process state

OST crash during striped object rebuild

Scenario:		recovering of online raid recovery process
Business Goals		restore object's availability
Relevant QA's		Availability
details	Stimulus	recovery agent finds incomleted object recovery
	Stimulus source	server crash or reboot, network failure
	Environment	server or network failures
	Artifact	recovery agent
	Response	completing striped object online recovery
	Response measure	striped object state
Questions
Issues

WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - Server Network Striping

Summary

Definitions

Requirements

SNS Fundamentals

SNS Detailed Recovery

SNS Architecture Overview

SNS write as a distributed transaction

Undo logging at OST side

RAIDFrame limitations

metadata redundancy

DAG creation

I/O components

Common I/O path and components with/without SNS

OSC I/O scheduler

SNS I/O components

Lock protocols

Use cases

SNS fundamentals cases

QASs

Extent lock

grants/quota for parity

Partial truncate

Metadata recovery in degraded case

Failed I/O retry

Object allocation

Configure the SNS device via mountconf

Failure and recovery cases

QASs

client goes offline

OST goes down

OST goes online with a delay

RAID online recovery (rebuild)

OST crash during striped object rebuild

Navigation menu

Search