Architecture - CTDB with Lustre

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

CTDB_with_Lustre provides a failsafe solution for Windows pCIFS.

Definitions

keyword | explanation
TDB | The "trivial database", used by Samba to store only the metadata of CIFS semantics, such as client connections (TCP), open handles, locks (byte-range locks, oplocks), CIFS operations and client requests. It is a database shared among Samba processes.
CTDB | A cluster-based implementation of TDB. Samba servers running CTDB can exchange this metadata among all nodes in a cluster.
pCIFS | Parallel CIFS, an implementation of the Windows CIFS client that provides parallel I/O to and from Lustre OSS nodes exported by Samba.

Requirements

id | scope | quality | trigger | description
pCIFS working over CTDB | feature | usability | | parallel CIFS support over CTDB/Samba
large scale network support | scalability | usability, performance | | pCIFS should cope with a large network of 100+ nodes and deliver reasonably consistent performance (80% of Lustre?).
server node failover | failover | availability | cluster failure | failover support so that the failure of a single Lustre or Samba node does not stop the whole cluster.
cleanup resources of a dead client | failover | availability | cluster failure | all resources (such as open handles or locks) held by a dead client node should be released gracefully; otherwise other clients will be denied access to these resources.
user access control | security | availability | access request from client | POSIX ACLs support.
data encryption over CIFS | security | availability | reading or writing | data is not secure while everything sent over CIFS is left in plain text.
ease of installation | UI | ease of use, friendliness | installation or configuration | good documentation, neat configuration scripts or tools.
various platforms support | environment | availability | | 1. Which OSes, versions and architectures to support for Windows, Linux and Samba. 2. Which types of networking to support: IB or IP, non-IP.
enhancement of test utilities | test | testability | unit test | enhancement of the pCIFS unit test utilities; a torture utility to simulate a very large network
high i/o throughput | i/o | performance | reading or writing | 80% of Lustre throughput?
efficient metadata operations | metadata | performance | client's query or modification | operations such as file creation, attribute setting, byte-range locks, oplocks.
DFS support | feature | usability | Samba hosts Lustre as part of DFS cluster | integration with pCIFS and Microsoft DFS

pCIFS working over CTDB

Scenario: pCIFS uses CTDB/Samba as the CIFS servers exporting Lustre
Business Goals: graceful cooperation among Lustre, CTDB and pCIFS
Relevant QA's: usability
Stimulus: use CTDB/Samba to share Lustre volumes
Stimulus source: CTDB/Samba
Environment:
Artifact:
Response: eliminate all conflicts among Lustre flock, Samba share-mode management and oplock handling
Response measure: smbtorture passes on Lustre; stress test passes on pCIFS
Questions: none
Issues: Samba share modes and oplock management are also involved in pCIFS/CTDB failover, so they must be taken into account when designing the new CTDB failover policy.

large scale network support

Scenario: deploy pCIFS on a large-scale cluster with 100+ nodes
Business Goals: pCIFS should work with a very large cluster
Relevant QA's: Availability
Stimulus: use pCIFS on a cluster of 100+ nodes
Stimulus source:
Environment:
Artifact: enhanced test utility to simulate heavy network traffic
Response: perform system-level tests and review all components to ensure support for large-scale networks
Response measure: emulation test passes, system-level test passes
Questions: None.
Issues: None.

server node failover

Scenario: a Lustre MDS/OST server or a Samba server hangs
Business Goals: failover should happen quickly so that pCIFS clients' requests continue to be served
Relevant QA's: availability
Stimulus: a server hang, or a network or hardware failure
Stimulus source:
Environment:
Artifact: shut down an active server or unplug it from the network
Response: set up failover on the Lustre nodes and harmonize CTDB failover with Lustre's
Response measure: a single node's failure does not stop the whole cluster
Questions: Could we accelerate Lustre failover? Could we power off the "dead" node immediately instead of waiting 250s to confirm?
Issues: possible conflicts between the two different failover processes

cleanup resources of a dead client

Scenario: a dead client still holds resources from CTDB/Samba
Business Goals: CTDB/Samba releases the resources in time and cleans up all client-specific data
Relevant QA's: Availability
Stimulus: a pCIFS client hangs
Stimulus source:
Environment:
Artifact: unplug the working client from the network
Response: CTDB/Samba becomes aware of the client's death in time and performs cleanup
Response measure: other clients are not denied access to the same resources
Questions: None.
Issues: A study is needed to see how the current CTDB/Samba responds to a dead CIFS client
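
One way to picture the required behaviour is a table of held resources with a per-client liveness timestamp. The sketch below is a hypothetical model, not a description of how CTDB/Samba detects dead clients (that remains the open study item above); `ClientResources`, the timeout value and the resource names are assumptions made for illustration.

```python
class ClientResources:
    """Toy table of resources held by CIFS clients (hypothetical model)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a client counts as dead
        self.last_seen = {}      # client -> time of last activity
        self.held = {}           # client -> set of resource names

    def touch(self, client, now):
        """Record activity from a client."""
        self.last_seen[client] = now

    def acquire(self, client, resource, now):
        """Grant the resource unless another client still holds it."""
        self.touch(client, now)
        for other, resources in self.held.items():
            if other != client and resource in resources:
                return False
        self.held.setdefault(client, set()).add(resource)
        return True

    def reap(self, now):
        """Release everything held by clients that have been silent too long."""
        dead = sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)
        for client in dead:
            self.held.pop(client, None)
            del self.last_seen[client]
        return dead


table = ClientResources(timeout=30)
first = table.acquire("client-a", "lock:/lustre/file:0-100", now=0)
blocked = table.acquire("client-b", "lock:/lustre/file:0-100", now=10)
table.touch("client-b", now=40)   # client-b stays active, client-a is silent
reaped = table.reap(now=45)
retry = table.acquire("client-b", "lock:/lustre/file:0-100", now=46)
```

In this model the second client is refused while the dead client's lock is still recorded, and succeeds once the silent client has been reaped, which is the response measure stated above.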

user access control

Scenario: accounts with different privileges try to access pCIFS servers
Business Goals: grant access to qualified accounts and deny it to invalid ones
Relevant QA's: usability
Stimulus: access to pCIFS servers
Stimulus source: pCIFS or other CIFS clients
Environment:
Artifact:
Response: pCIFS servers validate the user's token against POSIX or NFSv4 ACLs.
Response measure: ACLs grant access to qualified users and deny others
Questions: none
Issues: none

data encryption over CIFS

Scenario: all data sent over CIFS is left as clear text
Business Goals: encrypt all RPCs
Relevant QA's: availability
Stimulus:
Stimulus source:
Environment:
Artifact:
Response: encrypt data on the sender and decrypt it on the receiver
Response measure: data is secured
Questions: none.
Issues: none.

ease of installation

Scenario: pCIFS is complex to set up and get working, since there are 3 big components: Lustre and Samba on the Linux nodes, and pCIFS on the Windows node.
Business Goals: ideally, a solution that needs only a few clicks.
Relevant QA's: availability
Stimulus: installing and setting up pCIFS.
Stimulus source:
Environment:
Artifact:
Response: at the moment, we should provide automatic configuration of CTDB/Samba and srvmap.
Response measure: the fewer manual operations the better.
Questions: Should LRE take pCIFS into account?
Issues: none

various platforms to support

Scenario: customers' demands for specific software and hardware platforms.
Business Goals: full testing and support on all common platforms
Relevant QA's: availability
Stimulus:
Stimulus source: customer
Environment:
Artifact:
Response: provide only a few fully tested options as candidates, such as Windows XP, RHEL4/SLES10, Samba-3.0.XX
Response measure:
Questions: none
Issues: 1. For Windows Vista and later OSes, we need to buy a certificate from a commercial CA, such as Verisign, to sign our drivers. 2. CTDB needs two separate networks, i.e. every server node must have two network cards installed.

enhancement of test utilities

Scenario: lack of test utilities for pCIFS, especially for metadata operations
Business Goals: good test utilities addressing all pCIFS features
Relevant QA's: availability
Stimulus:
Stimulus source: unit-test
Environment:
Artifact:
Response: 1. collect third-party utilities 2. enhance the current i/o programs 3. create new tools to torture metadata operations
Response measure:
Questions: none
Issues: none

high i/o throughput

Scenario: lack of test utilities for pCIFS, especially for metadata operations
Business Goals: good test utilities addressing all pCIFS features
Relevant QA's: availability
Stimulus:
Stimulus source: unit-test
Environment:
Artifact:
Response: 1. collect third-party utilities 2. enhance the current i/o utilities 3. create new utilities to torture metadata operations
Response measure:
Questions: none
Issues: none

efficient metadata operations

Scenario: OPEN benchmark results are not good enough
Business Goals: good benchmark scores on metadata operations
Relevant QA's: availability
Stimulus:
Stimulus source: smbtorture
Environment: CTDB/Samba + Lustre 1.4 or 1.6
Artifact:
Response: 1. collect performance data on other file systems 2. tune CTDB/Samba and Lustre
Response measure: performance
Questions: none
Issues: none

DFS support

Scenario: it is unknown whether pCIFS can work with the Microsoft Distributed File System
Business Goals: definite support for DFS
Relevant QA's: availability
Stimulus: sharing Lustre as part of a DFS cluster
Stimulus source:
Environment: DFS + pCIFS
Artifact:
Response: experiments are needed before making any further response.
Response measure:
Questions: none
Issues: it is also unknown whether CTDB supports DFS hosting.

Implementation constraints

  1. Use CIFS for interconnection between CTDB/Samba and Windows clients.
  2. pCIFS drivers filter the Windows CIFS client, i.e. LanmanRedirector.

CTDB_with_Lustre Architecture

SHARING-VIOLATION issue

With CTDB, all Samba servers in a CTDB cluster share the same database to manage all session state, such as connections, file handles and locks. So when pCIFS tries to open a file on the OST servers while that file is already open on the MDS server, Samba checks the shared oplock and share-mode database and reports a sharing-violation conflict.

We need the Samba servers on the OSTs to ignore share modes and oplocks and simply pass the OPEN request to Lustre, which can handle this case with ease.
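
The conflict and the proposed bypass can be illustrated with a toy model. The sketch below is not Samba or CTDB code: the class `SharedOpenTable`, the `check_share_modes` flag and the node names are hypothetical, and the share-mode check is reduced to a single rule. It assumes one cluster-wide table of opens, as CTDB provides, and shows an OST-side open failing when the check is enforced and succeeding when it is skipped.

```python
class SharedOpenTable:
    """Toy stand-in for the cluster-wide share-mode database (hypothetical)."""

    def __init__(self):
        # path -> list of (node, set of access types the opener allows to others)
        self._opens = {}

    def open(self, node, path, access, share, check_share_modes=True):
        holders = self._opens.setdefault(path, [])
        if not check_share_modes:
            # Proposed behaviour for Samba on OST nodes: skip the share-mode
            # check and leave arbitration to Lustre.
            return "OK"
        for _holder, allowed in holders:
            if access not in allowed:
                return "SHARING_VIOLATION"
        holders.append((node, frozenset(share)))
        return "OK"


table = SharedOpenTable()
# pCIFS first opens the file through the Samba server on the MDS node.
mds_result = table.open("mds-samba", "/lustre/file", "write", share=())
# The same client then opens the file on an OST node for parallel I/O.
ost_checked = table.open("ost-samba", "/lustre/file", "write", share=())
ost_bypassed = table.open("ost-samba", "/lustre/file", "write", share=(),
                          check_share_modes=False)
print(mds_result, ost_checked, ost_bypassed)
```

The model only captures why a shared database turns the second open into a conflict; how the bypass would actually be configured or implemented in Samba is outside what this page specifies.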

pCIFS failover model

The pCIFS failover model depends on the failover support of Lustre and CTDB. Here are the main issues to be addressed during implementation:

  1. CTDB does not support dynamically adding a node to, or removing a node from, a working CTDB cluster. Tridge said this is planned for the future, but work will not start for the moment. Currently there is no way to add a new node to CTDB, and removing one triggers CTDB failover. We need to implement this functionality so that Heartbeat can renew the CTDB cluster when Lustre failover occurs.
  2. We must enhance srvmap to collect the Lustre servers' public IP addresses, which pCIFS clients use to access Lustre volumes. Lustre itself cannot provide this information, since the cluster could be running on non-IP networks. Fortunately Heartbeat can do the job instead, following a scheme we have prepared in advance. Another enhancement is socket communication with pCIFS clients, whose purpose is to send Lustre failover events, which could be triggered by Heartbeat, to the pCIFS clients.
  3. CTDB selects any node inside the CTDB cluster as the failover node to substitute for the dead node, and requires the two nodes to be in the same subnet. We need CTDB adapted to our election policy to decide which node (inside or outside the CTDB cluster) takes over the dead node's IP. The top-priority candidate should be the standby node. We could also put a pair of nodes in a separate subnet so that they fail over to each other within a CTDB cluster. This feature ensures that MDS and OST servers do not take over each other's IPs, which would otherwise cause the SHARING-VIOLATION issue.
  4. The timing mismatch between the two failover processes: CTDB is fast and Lustre is slow to confirm a node's death.
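
The election policy in item 3 can be sketched as follows. This is a minimal illustration, not CTDB's recovery logic: the `Node` fields, the role labels and `pick_takeover` are hypothetical names, and the rule "standby first, otherwise a live node of the same role, never across MDS and OST" is one reading of the policy described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    role: str                          # "MDS" or "OST" (hypothetical labels)
    alive: bool = True
    standby_for: Optional[str] = None  # name of the node this one backs up


def pick_takeover(dead: Node, candidates: List[Node]) -> Optional[Node]:
    """Choose the node that takes over the dead node's public IP."""
    live = [n for n in candidates if n.alive and n.name != dead.name]
    # 1. The dedicated standby node has top priority.
    for node in live:
        if node.standby_for == dead.name:
            return node
    # 2. Otherwise fall back to a live node with the same role.
    for node in live:
        if node.role == dead.role:
            return node
    # 3. Never let an MDS take over an OST address or vice versa.
    return None


mds1 = Node("mds1", "MDS")
ost1 = Node("ost1", "OST", alive=False)
ost2 = Node("ost2", "OST")
ost1_standby = Node("ost1-standby", "OST", standby_for="ost1")

with_standby = pick_takeover(ost1, [mds1, ost1, ost2, ost1_standby])
without_standby = pick_takeover(ost1, [mds1, ost1, ost2])
only_mds_left = pick_takeover(ost1, [mds1, ost1])
```

Returning no candidate when only nodes of the other role remain reflects the constraint that a cross-role takeover would reintroduce the SHARING-VIOLATION issue.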

Questions and Issues

References