Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Summary
CTDB_with_Lustre provides a failsafe solution for Windows pCIFS.
Definitions
keyword | explanation
TDB | TDB ("trivial database") is the database Samba uses to store only the metadata of CIFS semantics, such as client (TCP) connections, open handles, locks (byte-range locks, oplocks), CIFS operations and client requests. It is a database shared among Samba processes (a minimal usage sketch follows this table).
CTDB | CTDB is a clustered implementation of TDB. Samba servers running CTDB can exchange this metadata among all nodes of a cluster.
pCIFS | Parallel CIFS: an implementation of the Windows CIFS client that provides parallel I/O to and from the Lustre OSS nodes exported by Samba.
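TDB is essentially a small shared key-value store. For illustration only, here is a minimal sketch of the libtdb API; the file name, key and value are hypothetical and do not reflect how Samba names or lays out its own databases:

```c
/* tdb_sketch.c - minimal libtdb usage sketch (hypothetical file and key).
 * Build: cc -o tdb_sketch tdb_sketch.c -ltdb
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <tdb.h>

int main(void)
{
    /* Open (or create) a TDB file; Samba keeps one such database per
     * kind of state, e.g. open handles or byte-range locks. */
    struct tdb_context *tdb =
        tdb_open("example.tdb", 0, TDB_DEFAULT, O_RDWR | O_CREAT, 0600);
    if (tdb == NULL) {
        perror("tdb_open");
        return 1;
    }

    TDB_DATA key = { .dptr = (unsigned char *)"handle:42", .dsize = 9 };
    TDB_DATA val = { .dptr = (unsigned char *)"open",      .dsize = 4 };

    /* Store the record, replacing any existing value for this key. */
    if (tdb_store(tdb, key, val, TDB_REPLACE) != 0)
        fprintf(stderr, "tdb_store failed\n");

    /* Fetch it back; the returned buffer must be freed by the caller. */
    TDB_DATA out = tdb_fetch(tdb, key);
    if (out.dptr != NULL) {
        printf("fetched %zu bytes\n", out.dsize);
        free(out.dptr);
    }

    tdb_close(tdb);
    return 0;
}
```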
Requirements
id | scope | quality | trigger | description
pCIFS working over CTDB | feature | usability | | parallel CIFS support over CTDB/Samba
large scale network support | scalability | usability, performance | | pCIFS should scale to networks of 100+ nodes and provide reasonably consistent performance (80% of Lustre?).
server node failover | failover | availability | cluster failure | failover support so that the failure of a single Lustre or Samba node does not stop the whole cluster.
cleanup resources of a dead client | failover | availability | cluster failure | all resources (such as open handles or locks) held by a dead client node should be gracefully released; otherwise other clients will be unable to access those resources.
user access control | security | availability | access request from a client | POSIX ACL support.
data encryption over CIFS | security | availability | reading or writing | leaving all data over CIFS in plain text is insecure.
ease of installation | UI | ease of use, friendliness | installation or configuration | good documentation and clean configuration scripts or tools.
various platforms support | environment | availability | | 1) which OSes, versions and architectures to support for Windows, Linux and Samba; 2) which network types to support: IP, IB or other non-IP networks.
enhancement of test utilities | test | testability | unit test | enhance the pCIFS unit-test utilities and the torture utility to simulate a large-scale network.
high i/o throughput | i/o | performance | reading or writing | 80% of Lustre throughput?
efficient metadata operations | metadata | performance | client's query or modification | operations such as file creation, attribute setting, byte-range locks and oplocks.
DFS support | feature | usability | Samba hosts Lustre as part of a DFS cluster | integration with pCIFS and Microsoft DFS.
pCIFS working over CTDB
Scenario: pCIFS uses CTDB/Samba as the CIFS servers exporting Lustre
Business Goals: graceful cooperation among Lustre, CTDB and pCIFS
Relevant QAs: usability
details
Stimulus: use CTDB/Samba to share Lustre volumes
Stimulus source: CTDB/Samba
Environment:
Artifact:
Response: eliminate all conflicts among Lustre flock handling, Samba share-mode management and oplock handling
Response measure: smbtorture passes on Lustre; stress tests pass on pCIFS
Questions: none
Issues: Samba share-mode and oplock management is also involved in pCIFS/CTDB failover; this must be taken into account when designing the new CTDB failover policy.
large scale network support
Scenario: deploy pCIFS on a large-scale cluster with 100+ nodes
Business Goals: pCIFS should work on such a large cluster
Relevant QAs: availability
details
Stimulus: use pCIFS on a cluster of 100+ nodes
Stimulus source:
Environment:
Artifact: an enhanced test utility to simulate large-scale network traffic
Response: perform system-level tests and review all components to ensure large-scale network support
Response measure: emulation tests pass; system-level tests pass
Questions: none
Issues: none
server node failover
Scenario: a Lustre MDS/OST server or a Samba server hangs
Business Goals: failover should happen quickly enough to keep serving pCIFS clients' requests
Relevant QAs: availability
details
Stimulus: a server hang, network failure or hardware failure
Stimulus source:
Environment:
Artifact: shut down an active server or unplug it from the network
Response: set up failover on the Lustre nodes and harmonize CTDB failover with Lustre failover
Response measure: a single node's failure does not stop the whole cluster
Questions: Can we accelerate Lustre failover? Can we immediately power off the "dead" node instead of waiting 250 s for confirmation?
Issues: possible conflicts between the two different failover processes
cleanup resources of a dead client
Scenario: a dead client still holds resources on CTDB/Samba
Business Goals: CTDB/Samba releases the resources in time and cleans up all client-specific data
Relevant QAs: availability
details
Stimulus: a pCIFS client hangs
Stimulus source:
Environment:
Artifact: unplug a working client from the network
Response: CTDB/Samba becomes aware of the client's death in time and performs cleanup
Response measure: other clients are not prevented from accessing the same resources
Questions: none
Issues: study is needed to see how current CTDB/Samba responds to a dead CIFS client
user access control
Scenario: accounts with different privileges try to access pCIFS servers
Business Goals: grant access to qualified users and deny invalid ones
Relevant QAs: usability
details
Stimulus: access to pCIFS servers
Stimulus source: pCIFS or other CIFS clients
Environment:
Artifact:
Response: pCIFS servers validate the user's token against POSIX or NFSv4 ACLs
Response measure: ACLs grant access to qualified users and deny others
Questions: none
Issues: none
data encryption over CIFS
Scenario: all data over CIFS is left as clear text
Business Goals: encrypt all RPCs
Relevant QAs: availability
details
Stimulus:
Stimulus source:
Environment:
Artifact:
Response: encrypt data on the sender and decrypt it on the receiver
Response measure: data is secured
Questions: none
Issues: none
ease of installation
Scenario: setting up pCIFS and getting it to work is complex, since there are three major components: Lustre and Samba on the Linux nodes, and pCIFS on the Windows nodes
Business Goals: a point-and-click solution is the ideal
Relevant QAs: availability
details
Stimulus: installing and setting up pCIFS
Stimulus source:
Environment:
Artifact:
Response: for the moment, we should provide automatic configuration of CTDB/Samba and srvmap
Response measure: the fewer manual operations, the better
Questions: Should LRE take pCIFS into account?
Issues: none
various platforms to support
Scenario: customers' demands for specific software and hardware platforms
Business Goals: full testing and support for all common platforms
Relevant QAs: availability
details
Stimulus:
Stimulus source: customer
Environment:
Artifact:
Response: provide only a few fully tested options as candidates, such as Windows XP, RHEL4/SLES10 and Samba 3.0.XX
Response measure:
Questions: none
Issues: 1) for Windows Vista and later, we need to buy a certificate from a commercial CA such as VeriSign to sign our drivers; 2) CTDB needs two separate networks, i.e. every server node should have two network cards.
enhancement of test utilities
Scenario: lack of test utilities for pCIFS, especially for metadata operations
Business Goals: good test utilities addressing all pCIFS features
Relevant QAs: availability
details
Stimulus:
Stimulus source: unit test
Environment:
Artifact:
Response: 1) collect third-party utilities; 2) enhance the current I/O programs; 3) create new tools to torture metadata operations
Response measure:
Questions: none
Issues: none
high i/o throughput
Scenario: pCIFS read/write throughput must reach a large fraction of native Lustre throughput
Business Goals: high I/O throughput through pCIFS (80% of Lustre throughput?)
Relevant QAs: availability
details
Stimulus:
Stimulus source: unit test
Environment:
Artifact:
Response: 1) collect third-party utilities; 2) enhance the current I/O utilities; 3) create new utilities to exercise read/write throughput
Response measure:
Questions: none
Issues: none
efficient metadata operations
Scenario: OPEN benchmark results are not good enough
Business Goals: good benchmark scores on metadata operations
Relevant QAs: availability
details
Stimulus:
Stimulus source: smbtorture
Environment: CTDB/Samba + Lustre 1.4 or 1.6
Artifact:
Response: 1) collect performance results on other file systems; 2) tune CTDB/Samba and Lustre
Response measure: performance
Questions: none
Issues: none
DFS support
Scenario: it is unknown whether pCIFS can work with Microsoft Distributed File System (DFS)
Business Goals: definite support for DFS
Relevant QAs: availability
details
Stimulus: sharing Lustre as part of a DFS cluster
Stimulus source:
Environment: DFS + pCIFS
Artifact:
Response: experiments are needed before making any further response
Response measure:
Questions: none
Issues: it is also unknown whether CTDB supports DFS hosting
Implementation constraints
- Use CIFS for interconnection between CTDB/Samba and Windows clients.
- The pCIFS drivers filter the Windows CIFS client, i.e. LanmanRedirector.
CTDB_with_Lustre Architecture
SHARING-VIOLATION issue
With CTDB, all Samba servers in a CTDB cluster share the same database to manage session state, such as connections, file handles and locks. So when pCIFS tries to open a file on an OST while that file is already open on the MDS server, Samba checks the shared oplock and share-mode database and reports a SHARING-VIOLATION conflict.
We need the Samba servers on the OSTs to ignore share modes and oplocks and simply pass the OPEN request through to Lustre; Lustre can handle this case with ease. One possible configuration sketch is shown below.
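One way to approximate this behavior at the configuration level, assuming it is acceptable to disable Samba's own oplock and locking checks on the OST-exported shares, is a share definition along these lines (the share name and path are hypothetical, and a complete solution may still require changes in Samba's open path rather than configuration alone):

```
# Hypothetical share section on an OST-side Samba server exporting Lustre.
# Turning off Samba's own oplock and locking checks lets OPEN requests
# fall through to Lustre, which arbitrates access itself.
[lustre-ost]
    path = /mnt/lustre
    read only = no
    oplocks = no
    level2 oplocks = no
    kernel oplocks = no
    locking = no
    posix locking = no
```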
pCIFS failover model
The pCIFS failover model depends on the failover support of Lustre and CTDB. Here are the main issues to be addressed during implementation:
- CTDB does not support dynamically adding or removing a node to/from a running CTDB cluster. Tridge said this is planned for the future, but work will not start for the moment. Currently there is no way to add a new node to CTDB, and removing one triggers CTDB failover. We need to implement this functionality so that Heartbeat can renew the CTDB cluster when Lustre failover occurs.
- We must enhance srvmap to collect the Lustre servers' public IP addresses; pCIFS clients will access Lustre volumes through these IPs. Lustre itself cannot provide this information, since the cluster may run on non-IP networks, but Heartbeat can do the job instead, following a scheme we have prepared in advance. Another enhancement is socket communication with pCIFS clients, whose purpose is to send Lustre failover events to pCIFS clients; this can be triggered by Heartbeat (see the sketch after this list).
- CTDB selects an arbitrary node inside the CTDB cluster as the failover node to substitute for the dead node, and requires that the two nodes be in the same subnet. We need to adapt CTDB to our election policy to decide which node (inside or outside the CTDB cluster) takes over the dead node's IP; the top-priority candidate should be the standby node. We could also place paired nodes in a separate subnet so that they fail over to each other within a CTDB cluster. This is to ensure that MDS and OST servers do not take over each other's IPs, which would bring back the SHARING-VIOLATION issue.
- There is a timing issue between the two failover processes: CTDB confirms a node's death quickly, while Lustre is slow to confirm it.
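The socket notification mentioned in the list above is not specified further in this document. The following is only a minimal sketch, assuming a Heartbeat-triggered helper pushes a one-line text event to each pCIFS client over a plain TCP connection; the program name, port number, message format and host list are all hypothetical.

```c
/* notify_pcifs.c - hypothetical helper, run by Heartbeat on Lustre failover.
 * Sends a one-line event such as "FAILOVER OST0003 192.168.10.21" to every
 * pCIFS client listed on the command line, on an assumed TCP port.
 * Build: cc -o notify_pcifs notify_pcifs.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PCIFS_EVENT_PORT 9876   /* assumed port the pCIFS client listens on */

/* Send one event line to a single client; returns 0 on success. */
static int send_event(const char *client_ip, const char *event)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PCIFS_EVENT_PORT);
    if (inet_pton(AF_INET, client_ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    ssize_t n = write(fd, event, strlen(event));
    close(fd);
    return (n == (ssize_t)strlen(event)) ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s \"EVENT TEXT\" client-ip [client-ip ...]\n",
                argv[0]);
        return 1;
    }

    char line[256];
    snprintf(line, sizeof(line), "%s\n", argv[1]);

    int failures = 0;
    for (int i = 2; i < argc; i++) {
        if (send_event(argv[i], line) != 0) {
            fprintf(stderr, "failed to notify %s\n", argv[i]);
            failures++;
        }
    }
    return failures ? 1 : 0;
}
```

Heartbeat could invoke such a helper from its failover hooks, for example "notify_pcifs "FAILOVER OST0003 192.168.10.21" 10.0.0.11 10.0.0.12", leaving message parsing to the pCIFS client driver; the actual event format would be defined together with the srvmap enhancement.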
Questions and Issues
References