WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - CTDB with Lustre

From Obsolete Lustre Wiki

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

CTDB with Lustre provides a failover solution for Windows pCIFS.

Definitions

keyword | explanation
TDB | TDB ("trivial database") is used by Samba to store only the metadata of CIFS semantics, such as client (TCP) connections, open handles, locks (byte-range locks, oplocks), CIFS operations, and client requests. It is a database shared among the Samba processes.
CTDB | CTDB is a clustered implementation of TDB. Samba servers running CTDB can then exchange this metadata among all nodes of a cluster.
pCIFS | parallel CIFS, an implementation of a Windows CIFS client that provides parallel I/O to and from the Lustre OSS nodes exported by Samba.

Requirements

id | scope | quality | trigger | description
pCIFS working over CTDB | feature | usability | - | parallel CIFS support over CTDB/Samba.
large scale network support | scalability | usability, performance | - | pCIFS should stand up to a large network of 100+ nodes and provide reasonably consistent performance (80% of Lustre ?).
server node failover | failover | availability | cluster failure | failover support so that the failure of a Lustre or Samba node does not stop the whole cluster.
cleanup resources of a dead client | failover | availability | cluster failure | all resources (such as open handles or locks) held by a dead client node should be gracefully released; otherwise other clients will be forbidden to access these resources.
user access control | security | availability | access request from client | POSIX ACL support.
data encryption over CIFS | security | availability | reading or writing | it is not secure to leave all data over CIFS in plain text.
ease of installation | UI | easiness, friendliness | installation or configuration | good documentation, neat configuration scripts or tools.
various platforms support | environment | availability | - | 1. which OSes, versions, and architectures to support for Windows, Linux, and Samba; 2. which types of networking to support: IB or IP, non-IP.
enhancement of test utilities | test | testability | unit test | enhancement of the pCIFS unit-test utilities; a torture utility to simulate a huge-scale network.
high i/o throughput | i/o | performance | reading or writing | 80% of Lustre throughput ?
efficient metadata operations | metadata | performance | client's query or modification | operations such as file creation, attribute setting, byte-range locks, oplocks.
DFS support | feature | usability | Samba hosts Lustre as part of a DFS cluster | integration with pCIFS and Microsoft DFS.

pCIFS working over CTDB

Scenario: pCIFS uses CTDB/Samba as CIFS servers exporting Lustre
Business Goals: a graceful cooperation among Lustre, CTDB and pCIFS
Relevant QA's: usability
Stimulus: use CTDB/Samba to share Lustre volumes
Stimulus source: CTDB/Samba
Environment:
Artifact:
Response: eliminate all conflicts among Lustre flock, Samba share-mode management, and oplock handling
Response measure: smbtorture pass on Lustre, stress test pass on pCIFS
Questions: none
Issues: Samba share modes and oplock management will also be involved in pCIFS/CTDB failover; care should be taken when designing the new CTDB failover policy.

large scale network support

Scenario: deploy pCIFS on a large-scale cluster with 100+ nodes
Business Goals: pCIFS should work on such a large cluster
Relevant QA's: Availability
Stimulus: use pCIFS on a 100+ node cluster
Stimulus source:
Environment:
Artifact: an enhanced test utility to simulate large-scale network traffic
Response: perform system-level tests and review all components to ensure support for large-scale networks
Response measure: emulation test passes, system level test passes
Questions: None.
Issues: None.

server node failover

Scenario: a Lustre MDS/OST server or Samba server hangs
Business Goals: failover should happen quickly enough to keep serving pCIFS clients' requests
Relevant QA's: availability
Stimulus: a server hang, or a network or hardware failure
Stimulus source:
Environment:
Artifact: shut down an active server or unplug it from the network
Response: set up failover on the Lustre nodes and harmonize CTDB's failover with Lustre's
Response measure: a single node's failure won't stop the whole cluster
Questions: could we accelerate Lustre failover? Could we immediately power off the "dead" node instead of waiting 250s to confirm its death?
Issues: possible conflicts between two different failover processes

cleanup resources of a dead client

Scenario: a dead client still holds resources granted by CTDB/Samba
Business Goals: CTDB/Samba releases the resources in time and cleans up all client-specific data
Relevant QA's: Availability
Stimulus: a pCIFS client hangs
Stimulus source:
Environment:
Artifact: unplug a working client from the network
Response: CTDB/Samba detects the client's death in time and performs cleanup
Response measure: other clients are not blocked from accessing the same resources
Questions: None.
Issues: study is needed to see how the current CTDB/Samba responds to a dead CIFS client

user access control

Scenario: accounts with different privileges try to access pCIFS servers
Business Goals: grant access to qualified accounts and deny invalid ones
Relevant QA's: usability
Stimulus: access to pCIFS servers
Stimulus source: pCIFS or other CIFS clients
Environment:
Artifact:
Response: pCIFS servers validate the user's token against POSIX or NFSv4 ACLs.
Response measure: ACLs grant access to qualified users and deny all others
Questions: none
Issues: none

data encryption over CIFS

Scenario: all data over CIFS is sent as clear text
Business Goals: encrypt all RPCs
Relevant QA's: availability
Stimulus:
Stimulus source:
Environment:
Artifact:
Response: encrypt data on the sender and decrypt it on the receiver
Response measure: data is secured
Questions: none.
Issues: none.

ease of installation

Scenario: it is complex to set up pCIFS and get it working, since there are three major components: Lustre and Samba on the Linux nodes, and pCIFS on the Windows nodes.
Business Goals: ideally, a solution that installs with just a few clicks.
Relevant QA's: availability
Stimulus: installing and setting up pCIFS.
Stimulus source:
Environment:
Artifact:
Response: at the moment, we should provide automatic configuration of CTDB/Samba and srvmap.
Response measure: the fewer manual operations, the better.
Questions: Should LRE take pCIFS into account ?
Issues: none

various platforms to support

Scenario: customers' demands for specific software and hardware platforms.
Business Goals: full testing and support for all common platforms
Relevant QA's: availability
Stimulus:
Stimulus source: customer
Environment:
Artifact:
Response: only provide several fully tested options as candidates, such as Windows XP, RHEL4/SLES10, and Samba-3.0.XX
Response measure:
Questions: none
Issues: 1. For Windows Vista and later OSes, we need to buy a certificate from a commercial CA, such as VeriSign, to sign our drivers. 2. CTDB needs two separate subnets, i.e. every server node must have two network cards installed.
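The two-subnet requirement maps onto CTDB's standard split between a private cluster-internal network and public client-facing addresses. A minimal sketch of the two configuration files follows; all addresses and interface names are placeholders, not a tested setup:

```
# /etc/ctdb/nodes -- one private (cluster-internal) IP per server node,
# on the first network card
10.0.0.1
10.0.0.2

# /etc/ctdb/public_addresses -- client-facing IPs that CTDB fails over,
# on the second network card
192.168.1.1/24 eth1
192.168.1.2/24 eth1
```

CTDB uses the private addresses for its own cluster traffic and migrates only the public addresses between nodes on failover, which is why each server needs a card on both networks.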

enhancement of test utilities

Scenario: lack of test utilities for pCIFS, especially for metadata operations
Business Goals: good test utilities addressing all pCIFS features
Relevant QA's: availability
Stimulus:
Stimulus source: unit-test
Environment:
Artifact:
Response: 1. collect third-party utilities; 2. enhance the current I/O programs; 3. create new tools to torture-test metadata operations
Response measure:
Questions: none
Issues: none

high i/o throughput

Scenario: lack of test utilities for pCIFS, especially for metadata operations
Business Goals: good test utilities addressing all pCIFS features
Relevant QA's: availability
Stimulus:
Stimulus source: unit-test
Environment:
Artifact:
Response: 1. collect third-party utilities; 2. enhance the current I/O utilities; 3. create new utilities to torture-test metadata operations
Response measure:
Questions: none
Issues: none

efficient metadata operations

Scenario: OPEN benchmark results are not good enough
Business Goals: good bench score on metadata operations
Relevant QA's: availability
Stimulus:
Stimulus source: smbtorture
Environment: CTDB/Samba + Lustre 1.4 or 1.6
Artifact:
Response: 1. collect performance data from other file systems; 2. tune CTDB/Samba and Lustre
Response measure: performance
Questions: none
Issues: none

DFS support

Scenario: it is unknown whether pCIFS can work with the Microsoft Distributed File System (DFS)
Business Goals: a definite support with DFS
Relevant QA's: availability
Stimulus: sharing Lustre as part of a DFS cluster
Stimulus source:
Environment: DFS + pCIFS
Artifact:
Response: experiments are needed before making any further response.
Response measure:
Questions: none
Issues: it is also unknown whether CTDB supports DFS hosting.

Implementation constraints

  1. Use CIFS for interconnection between CTDB/Samba and Windows clients.
  2. The pCIFS drivers filter the Windows CIFS client, i.e. the LanmanRedirector.

CTDB with Lustre Architecture

SHARING-VIOLATION issue

With CTDB, all Samba servers in a CTDB cluster share the same database to manage all session state, such as connections, file handles, and locks. So when pCIFS tries to open a file on an OST while the file is already open on the MDS server, Samba checks the shared oplock and share-mode database and reports a sharing-violation conflict.

We need the Samba servers on the OSTs to ignore share modes and oplocks and simply pass the OPEN request through to Lustre, which can handle this case with ease.
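On the OST nodes this could be expressed as a Samba share section like the following sketch; the share name and path are placeholders, and the settings shown are the Samba-3.0-era smb.conf parameters for disabling oplock and share-mode checking (an untested illustration, not the actual pCIFS configuration):

```
[lustre-ost]
    path = /mnt/lustre
    ; let Lustre, not Samba, arbitrate concurrent opens
    oplocks = no
    level2 oplocks = no
    kernel oplocks = no
    share modes = no
```

With these set, Samba no longer consults its shared oplock/share-mode databases for this export, so the OPEN request reaches Lustre directly.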

pCIFS failover model

The pCIFS failover model depends on the failover support of Lustre and CTDB. Here are the main issues to be addressed during implementation:

  1. CTDB cannot dynamically add or remove a node to/from a working CTDB cluster. Tridge said they plan this for the future but will not start on it for the moment. Currently there is no way to add a new node to CTDB, and removing one causes a CTDB failover. We need to implement this functionality so that Heartbeat can renew the CTDB cluster while a Lustre failover occurs.
  2. We must enhance srvmap to collect the Lustre servers' public IP addresses; pCIFS clients will access Lustre volumes through these IPs. Lustre itself cannot provide this information, since the cluster may be running on non-IP networks. Fortunately, Heartbeat can do the job instead, following a scheme we have prepared in advance. Another enhancement is socket communication with pCIFS clients, whose purpose is to send Lustre failover events, triggered by Heartbeat, to the pCIFS clients.
  3. CTDB selects an arbitrary node inside the CTDB cluster as the failover node to substitute for the dead node, and requires that the two nodes be in the same subnet. We need CTDB adapted to our election policy, which decides which node (inside or outside the CTDB cluster) takes over the dead node's IP; the top-priority candidate should be the standby node. We could also put two nodes in a different subnet to make them fail over to each other within a CTDB cluster. This ensures that MDS and OST servers will not take over each other's IPs, which would bring back the SHARING-VIOLATION issue.
  4. There is a timing issue between the two different failover processes: CTDB confirms a node's death quickly, while Lustre does so slowly.
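The election policy in point 3 can be sketched as follows. This is a hypothetical illustration in Python, not CTDB code: the node records, the field names ("role", "standby"), and the subnet comparison are all assumptions made for the sketch.

```python
# Illustrative sketch of the takeover-election policy described above.
# Node records, field names, and the subnet check are hypothetical.
import ipaddress

def elect_takeover_node(dead_node, nodes):
    """Pick a node to take over a dead node's public IP.

    Preference order, per the policy above:
      1. a live standby node of the same role,
      2. any live same-role node in the same subnet as the dead node.
    Restricting candidates to the same role keeps an MDS from taking
    over an OST IP (and vice versa), avoiding the SHARING-VIOLATION issue.
    """
    dead_net = ipaddress.ip_interface(dead_node["ip"]).network
    live = [n for n in nodes
            if n["alive"] and n["role"] == dead_node["role"]]
    for n in live:                      # top-priority: the standby node
        if n.get("standby"):
            return n
    for n in live:                      # fallback: a same-subnet neighbour
        if ipaddress.ip_interface(n["ip"]).network == dead_net:
            return n
    return None                         # no eligible candidate
```
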

Questions and Issues

References
