
Using Pacemaker with Lustre


DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT

This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.


This page describes how to configure and use Pacemaker with Lustre Failover.

Setting Up Cluster Communications

Communication between the nodes of the cluster allows all nodes to “see” each other. In modern clusters, OpenAIS, or more specifically, its communication stack corosync, is used for this task. All communication paths in the cluster should be redundant so that a failure of a single path is not fatal for the cluster.

An introduction to the setup, configuration, and operation of a Pacemaker cluster can be found in the Pacemaker project documentation.

Setting Up the corosync Communication Stack

The corosync communication stack, developed as part of the OpenAIS project, supports all the communication needs of the cluster. The package is included in all recent Linux distributions. If it is not included in your distribution, you can find precompiled binaries at www.clusterlabs.org/rpm. It is also possible to compile OpenAIS from source and install it on all HA nodes by running ./configure, make, and make install.
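For a source build, the steps just mentioned are run in sequence on each HA node (this assumes the usual autotools layout of the OpenAIS sources):

# from the unpacked OpenAIS source directory, on each HA node
./configure
make
make install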

Note: If corosync is not included in your distribution, your distribution may include the complete OpenAIS package. From the cluster point of view, the only difference is that all files and commands start with openais rather than corosync. The configuration file is located in /etc/ais/openais.conf.

Once installed, the software looks for a configuration in the file /etc/corosync/corosync.conf.

Complete the following steps to set up the corosync communication stack:

1. Edit the totem section of the corosync.conf (or openais.conf) configuration file to designate the IP address and netmask of the interface(s) to be used. The totem section of the configuration file describes the way corosync communicates between nodes.
totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}
Corosync uses the bindnetaddr option to determine which interface is to be used for cluster communication. The example above assumes that one of the node's interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask of the interface (IP & MASK), so the final bits of the address are cleared. Because the result is a network address rather than a host address, the configuration file is independent of any particular node and can be copied to all nodes unchanged.
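For example, assuming a hypothetical node address of 10.0.0.15 with netmask 255.255.255.0, the bitwise AND yields the value used above:

  10.0.0.15        (node IP address, illustrative)
& 255.255.255.0    (netmask)
= 10.0.0.0         (value for bindnetaddr)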
2. Edit the aisexec section of the configuration file to designate which user can start the service. The user must be root:
aisexec {
        user: root
        group: root
}
3. In the service section of the configuration file, add the services that corosync is to administer. In this example, only Pacemaker is included:


service {
        name: pacemaker
        version: 0
}

4. (Optional) To use the Pacemaker GUI, add the mgmt daemon to the service section:

service {
        name: pacemaker
        version: 0
        use_mgmtd: yes
}

The corosync service starts as part of the normal init process. It can also be started manually by entering:

/etc/init.d/corosync start
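To have corosync started automatically at boot on Red Hat-style distributions, the service can be enabled with chkconfig (adjust this to your distribution's init system if it differs):

# chkconfig corosync on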

After corosync has started, the following lines should be visible in the system log file:

(...) [MAIN ] Corosync (...) started and ready to provide service. 
(...) [TOTEM ] The network interface [...] is now up.

You can also check for correct functioning of the network stack by entering:

# corosync-cfgtool -s

The following should be displayed:

Printing ring status.
Local node ID (...)
RING ID 0
	id	= (...)
	status	= ring 0 active with no faults

Setting up Redundant Communication Using Bonding

It is recommended that you set up the cluster communication via two or more redundant paths. One way to achieve this is to use bonding interfaces. Please consult the documentation for your distribution for information about how to configure bonding interfaces.
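As a rough sketch only (file locations and option names vary by distribution, so consult your distribution's documentation), a Red Hat-style active-backup bond with a hypothetical address of 10.0.0.15 might be configured as follows:

/etc/sysconfig/network-scripts/ifcfg-bond0:
        DEVICE=bond0
        IPADDR=10.0.0.15
        NETMASK=255.255.255.0
        BOOTPROTO=none
        ONBOOT=yes
        BONDING_OPTS="mode=active-backup miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for the second slave interface):
        DEVICE=eth0
        MASTER=bond0
        SLAVE=yes
        BOOTPROTO=none
        ONBOOT=yes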

Setting up Redundant Communication within corosync

The corosync package itself provides a means for redundant communication. If two or more interfaces are available for cluster communication, an administrator can configure multiple interface{} sections in the configuration file, each with a different ringnumber. The rrp_mode option tells the cluster how to use these interfaces. If the value is set to active, corosync uses all interfaces actively. If the value is set to passive, corosync uses the second ring only if the first ring fails. An example configuration with two network interfaces is shown below.

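A minimal sketch of such a totem configuration, assuming two interfaces on the (illustrative) networks 10.0.0.0 and 192.168.1.0:

totem {
        version: 2
        secauth: off
        threads: 0
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastaddr: 226.94.1.2
                mcastport: 5405
        }
}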

Setting Up Resource Management

All services that the cluster takes care of are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop, or monitor resources.

Note: The simplest way to configure the cluster is by using the crm subshell. All examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.
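For instance, the interactive subshell is entered by running crm without arguments; its configure submenu accepts the same commands that are shown as one-liners on this page:

# crm
crm(live)# configure
crm(live)configure# show
crm(live)configure# quit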

Completing a Basic Setup of the Cluster

To test that your cluster manager is running and set global options, complete the steps below.

1. Display the cluster status by entering:
# crm_mon -1
The output should look similar to:
============
Last updated: Fri Dec 25 17:31:54 2009
Stack: openais
Current DC: node1 - partition with quorum
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ node1 node2 ]
This output indicates that corosync started the cluster resource manager and it is ready to manage resources.

Several global options must be set in the cluster. The two described in the next two steps are especially important to consider.

2. If your cluster consists of just two nodes, switch the quorum feature off. On the command line, enter:
# crm configure property no-quorum-policy=ignore
If your Lustre setup comprises more than two nodes, you can leave the no-quorum-policy option at its default.
3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:
# crm configure property stonith-enabled=false
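To verify the settings, the current cluster configuration can be displayed; any properties set in the steps above appear at the end of the output:

# crm configure show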

After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.

Configuring Resources

OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.

By default, the start, stop, and monitor operations of a Filesystem resource time out after 20 seconds. Since some mounts in Lustre can take 5 minutes or more, the default timeouts for these operations must be increased. Also, a monitor operation must be added to the resource so that Pacemaker can check whether the resource is still alive and react to any problems.

1. Create a definition of the Filesystem resource (one such resource is needed for each OST to be managed) and save it in a file such as MyOST.res.
The example below shows a complete definition of the Filesystem resource. You will need to change the device and directory parameters to correspond to your setup.
primitive resMyOST ocf:heartbeat:Filesystem \
        meta target-role="stopped" \
        operations $id="resMyOST-operations" \
        op monitor interval="120" timeout="60" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params device="device" directory="directory" fstype="lustre"

In this example, the resource is initially stopped (target-role="stopped") because the constraints specifying where the resource is to run have not yet been defined.

The start and stop operations have each been given a timeout of 300 seconds. The resource is monitored at intervals of 120 seconds. The device, directory, and fstype parameters are passed to the mount command.
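As a minimal sketch of how the resource might later be activated (the file name is taken from step 1, while the constraint identifiers and the node names node1/node2 are illustrative assumptions):

# load the resource definition into the cluster configuration
crm configure load update MyOST.res

# illustrative location constraints: prefer node1, allow failover to node2
crm configure location locMyOST-node1 resMyOST 100: node1
crm configure location locMyOST-node2 resMyOST 50: node2

# start the resource once the constraints are in place
crm resource start resMyOST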