<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.old.lustre.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sven</id>
	<title>Obsolete Lustre Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.old.lustre.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sven"/>
	<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Special:Contributions/Sven"/>
	<updated>2026-04-12T04:11:28Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.7</generator>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12050</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12050"/>
		<updated>2010-12-14T14:37:57Z</updated>

		<summary type="html">&lt;p&gt;Sven: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
In comparison with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is an old HA solution. It is recommended to use another HA solution, such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI. For other methods of fencing, refer to the RedHat Cluster documentation.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
Setting up RedHat Cluster consists of three steps: &lt;br /&gt;
* set up &#039;&#039;openais&#039;&#039;,&lt;br /&gt;
* configure the cluster, and&lt;br /&gt;
* start the RedHat Cluster services.&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58; 226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58; 5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. In the example shown above, it is assumed that one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
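The clearing of the host bits described above can be reproduced with a short script. The following is a minimal illustrative sketch (not part of &#039;&#039;openais&#039;&#039;) using Python&#039;s ipaddress module; the address 10.0.0.17 and the netmask are example values:

```python
import ipaddress

def bindnetaddr(ip: str, netmask: str) -> str:
    """Return the openais bindnetaddr: the interface IP ANDed with its netmask."""
    # strict=False allows host bits to be set in the input; they are cleared here.
    network = ipaddress.IPv4Network(ip + "/" + netmask, strict=False)
    return str(network.network_address)

# An interface at 10.0.0.17 with netmask 255.0.0.0 yields bindnetaddr 10.0.0.0.
print(bindnetaddr("10.0.0.17", "255.0.0.0"))  # -> 10.0.0.0
```

Because the host bits are cleared, the same configuration file works unchanged on every node in the 10.0.0.0 network.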
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD/ISO image in the &#039;&#039;Cluster&#039;&#039; sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustrefs resource script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039;.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure your Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. If the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
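Since the config_version attribute must be increased on every configuration change, the bump can be automated before the file is copied to the other node. The following is a minimal illustrative sketch in Python (it is not a RedHat Cluster tool; the path is the standard location mentioned above):

```python
import xml.etree.ElementTree as ET

def bump_config_version(path: str) -> int:
    """Increment the config_version attribute of the cluster root element in place."""
    tree = ET.parse(path)
    cluster = tree.getroot()  # the document root is the cluster element
    version = int(cluster.get("config_version", "0")) + 1
    cluster.set("config_version", str(version))
    tree.write(path, encoding="utf-8", xml_declaration=True)
    return version

# Example: new_version = bump_config_version("/etc/cluster/cluster.conf")
```

After bumping, the updated file still has to be distributed to all cluster nodes.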
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing can make sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot; post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device. This name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use. In this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes: &#039;&#039;name&#039;&#039; defines how the resource can be addressed, while &#039;&#039;mountpoint&#039;&#039; and &#039;&#039;device&#039;&#039; specify the mount point and the backing device of the Lustre target.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets is provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each failover domain, but the nodes are assigned different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain to it. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually, &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is when the device paths differ between the nodes. &lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it is time to start the &#039;&#039;cman&#039;&#039; service.&lt;br /&gt;
This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, use &#039;&#039;clustat&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running, it is time to start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;rgmanager&#039;&#039; will then start to bring up the Lustre targets assigned to each of the Lustre nodes. &lt;br /&gt;
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster, run &#039;&#039;clustat&#039;&#039; again. With the above configuration, the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Relocate services ===&lt;br /&gt;
It may be necessary to relocate running Lustre services manually. This can be done using &lt;br /&gt;
&#039;&#039;clusvcadm&#039;&#039; as shown in the example below. First, the service &#039;&#039;lustre_2&#039;&#039; is running on node &#039;&#039;lustre2&#039;&#039;. After calling &#039;&#039;clusvcadm -r lustre_2&#039;&#039;, this service is relocated to node &#039;&#039;lustre1&#039;&#039;, as shown in the last &#039;&#039;clustat&#039;&#039; output.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 15:00:00 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                            ID   Status&lt;br /&gt;
 ------ ----                            ---- ------&lt;br /&gt;
 lustre1                                   1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                   2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                   Owner (Last)                  State         &lt;br /&gt;
 ------- ----                   ----- ------                  -----         &lt;br /&gt;
 service:lustre_1               lustre1                       started       &lt;br /&gt;
 service:lustre_2               lustre2                       started       &lt;br /&gt;
bash-3.2# clusvcadm -r lustre_2  &lt;br /&gt;
Trying to relocate service:lustre_2...Success&lt;br /&gt;
service:lustre_2 is now running on lustre1&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 15:01:00 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                            ID   Status&lt;br /&gt;
 ------ ----                            ---- ------&lt;br /&gt;
 lustre1                                   1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                   2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                   Owner (Last)                  State         &lt;br /&gt;
 ------- ----                   ----- ------                  -----         &lt;br /&gt;
 service:lustre_1               lustre1                       started       &lt;br /&gt;
 service:lustre_2               lustre1                       started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Other tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster is a complex system of programs and services. A number of tools are available to interact with RedHat Cluster and make working with it easier. In this section, some of these tools are presented. For more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
; cman_tool : can be used to manage the cman subsystem, for example to add nodes to or remove nodes from a cluster configuration&lt;br /&gt;
; ccs_tool : may be used to update the configuration of the running cluster&lt;br /&gt;
; clustat : shows the status of the cluster and whether and where services are currently running&lt;br /&gt;
; clusvcadm : can be used to enable, disable or relocate services in a cluster &lt;br /&gt;
; system-config-cluster : a graphical user interface for cluster configuration&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12049</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12049"/>
		<updated>2010-12-14T14:13:16Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Start RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
In comparison with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is a rather old HA solution. It is recommended to use another HA solution, such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
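The download and permission steps can be combined; a sketch (verify the downloaded file before installing it):&lt;br /&gt;

```shell
# Fetch the resource script from the mailing-list attachment linked above
# and make it executable so rgmanager can run it.
wget -O /usr/share/cluster/lustrefs.sh \
  http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin
chmod 755 /usr/share/cluster/lustrefs.sh
```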
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format; the complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat cluster is usually used with more than two nodes providing resources. To tell RedHat cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node needs to be specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple-Mount Protection (MMP) enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
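Before relying on fencing, it is worth checking that each fence device is reachable. The fence agent can be invoked by hand; a sketch using the values from the example above (the exact flags may differ between fence_ipmilan versions):&lt;br /&gt;

```shell
# Query the power status of lustre1's service processor via IPMI.
# -P requests lanplus, matching lanplus="1" in cluster.conf.
fence_ipmilan -P -a 10.0.1.1 -l root -p supersecretpassword -o status
```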
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes; &#039;&#039;name&#039;&#039; defines the name by which the resource can be referenced.&lt;br /&gt;
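The full set of attributes the script understands is documented in its own metadata, which can be printed directly:&lt;br /&gt;

```shell
# Print the resource agent's metadata; it describes mountpoint, device,
# force_fsck, force_unmount, self_fence and any other supported attributes.
/usr/share/cluster/lustrefs.sh meta-data
```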
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets are provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each failover domain but assigning the nodes different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domain. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain to it. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually, &#039;&#039;cluster.conf&#039;&#039; should be the same on both nodes; the only exception is if the device paths differ between the nodes. &lt;br /&gt;
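If the file is identical on both nodes, it can simply be copied from the node where it was edited; a sketch (hostname as in the examples above):&lt;br /&gt;

```shell
# Propagate the edited configuration to the second cluster node.
# "lustre2" is the example hostname used throughout this page.
scp /etc/cluster/cluster.conf lustre2:/etc/cluster/cluster.conf
```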
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it is time to start the &#039;&#039;cman&#039;&#039; service. This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running, start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
rgmanager will then start to bring up the Lustre targets assigned to each of the Lustre nodes. &lt;br /&gt;
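As with &#039;&#039;cman&#039;&#039;, the service can be enabled permanently so it starts at boot:&lt;br /&gt;

```shell
# Enable rgmanager at boot on both nodes, mirroring the cman setup above.
chkconfig rgmanager on
```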
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster run &#039;&#039;clustat&#039;&#039; again. With the above configuration the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Relocate services ===&lt;br /&gt;
It may be necessary to relocate running Lustre services manually. This can be done using &lt;br /&gt;
&#039;&#039;clusvcadm&#039;&#039; as shown in the example below. Initially, the service &#039;&#039;lustre_2&#039;&#039; is assigned to node &#039;&#039;lustre2&#039;&#039;. After calling &#039;&#039;clusvcadm -r lustre_2&#039;&#039;, this service is relocated to node &#039;&#039;lustre1&#039;&#039;, as shown in the last &#039;&#039;clustat&#039;&#039; output.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 15:00:00 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                            ID   Status&lt;br /&gt;
 ------ ----                            ---- ------&lt;br /&gt;
 lustre1                                   1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                   2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                   Owner (Last)                  State         &lt;br /&gt;
 ------- ----                   ----- ------                  -----         &lt;br /&gt;
 service:lustre_1               lustre1                       started       &lt;br /&gt;
 service:lustre_2               lustre2                       started       &lt;br /&gt;
bash-3.2# clusvcadm -r lustre_2  &lt;br /&gt;
Trying to relocate service:lustre_2...Success&lt;br /&gt;
service:lustre_2 is now running on ldk-2-2-eth2&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 15:01:00 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                            ID   Status&lt;br /&gt;
 ------ ----                            ---- ------&lt;br /&gt;
 lustre1                                   1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                   2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                   Owner (Last)                  State         &lt;br /&gt;
 ------- ----                   ----- ------                  -----         &lt;br /&gt;
 service:lustre_1               lustre1                       started       &lt;br /&gt;
 service:lustre_2               lustre1                       started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
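Besides relocating, &#039;&#039;clusvcadm&#039;&#039; can also stop and restart a service, which is useful for maintenance on a Lustre target; a sketch:&lt;br /&gt;

```shell
# Disable (stop) the service, perform maintenance, then enable it again.
clusvcadm -d lustre_2
clusvcadm -e lustre_2
```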
&lt;br /&gt;
== Other tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat cluster is a complex system of programs and services. A number of tools are available that make interacting with RedHat Cluster easier. This section presents some of these tools; for more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
; cman_tool : can be used to manage the cman subsystem, for example to add nodes to or remove nodes from a cluster configuration&lt;br /&gt;
; ccs_tool : may be used to update the configuration of the running cluster&lt;br /&gt;
; clustat : shows the status of the cluster, including whether and where services are currently running&lt;br /&gt;
; clusvcadm : can be used to enable, disable or relocate services in a cluster &lt;br /&gt;
; system-config-cluster : a graphical user interface for cluster configuration&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12048</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12048"/>
		<updated>2010-12-14T14:01:32Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Useful tools to use with RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped in RHEL 5.5 is fairly old. It is recommended to use another HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, STONITH fencing is implemented. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format; the complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat cluster is usually used with more than two nodes providing resources. To tell RedHat cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node needs to be specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple-Mount Protection (MMP) enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, the lustrefs.sh script must be installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually RedHat cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning them different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning it to a domain. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is when the device paths differ between the nodes. &lt;br /&gt;
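A simple way to keep the file identical on both nodes is to copy it after editing. This sketch assumes the file was edited on lustre1 and that the device paths match on both nodes:

```shell
# push the edited configuration from lustre1 to the second node
scp /etc/cluster/cluster.conf lustre2:/etc/cluster/cluster.conf
```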
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it&#039;s time to start the &#039;&#039;cman&#039;&#039; service.&lt;br /&gt;
This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running it&#039;s time to start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;rgmanager&#039;&#039; will then start to bring up the Lustre targets assigned to each of the Lustre nodes. &lt;br /&gt;
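Like &#039;&#039;cman&#039;&#039;, the &#039;&#039;rgmanager&#039;&#039; service can be enabled permanently so it starts at boot:

```shell
# start rgmanager automatically on subsequent boots (run on both nodes)
chkconfig rgmanager on
```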
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster run &#039;&#039;clustat&#039;&#039; again. With the above configuration the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Other tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat cluster is a complex system of programs and services. A number of tools are available to interact with RedHat Cluster and to make working with it easier. This section presents some of these tools; for more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
; cman_tool : can be used to manage the cman subsystem, for example to add nodes to or remove nodes from a cluster configuration&lt;br /&gt;
; ccs_tool : may be used to update the configuration of the running cluster&lt;br /&gt;
; clustat : shows the status of the cluster and whether and where services are currently running&lt;br /&gt;
; clusvcadm : can be used to enable, disable or relocate services in a cluster &lt;br /&gt;
; system-config-cluster : a graphical user interface for cluster configuration&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12047</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12047"/>
		<updated>2010-12-14T13:58:32Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Useful tools to use with RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
In comparison with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is a fairly old one. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets, and in case of a failure the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows shutting down a failed node using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
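The calculation above can be checked on any node with plain shell arithmetic. A minimal sketch, assuming a hypothetical host address of 10.0.0.42 with netmask 255.255.255.0:

```shell
# compute bindnetaddr = IP AND netmask, one octet at a time
IP=10.0.0.42          # assumed host address on the 10.0.0.0 network
MASK=255.255.255.0    # assumed netmask
IFS=. read -r i1 i2 i3 i4 <<EOF
$IP
EOF
IFS=. read -r m1 m2 m3 m4 <<EOF
$MASK
EOF
BINDNETADDR="$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
echo "$BINDNETADDR"   # prints 10.0.0.0
```

Because the host bits are cleared, every node on the 10.0.0.0 network computes the same value, which is why the file can be copied unchanged to all nodes.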
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh&lt;br /&gt;
and make sure the script is executable.&lt;br /&gt;
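These steps might look like the following; &#039;&#039;downloaded-attachment.bin&#039;&#039; is a placeholder for whatever name the file was saved under:

```shell
# install the Lustre resource script where rgmanager expects its agents
cp downloaded-attachment.bin /usr/share/cluster/lustrefs.sh
chmod 755 /usr/share/cluster/lustrefs.sh   # make it executable
```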
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat cluster is usually used with more than two nodes providing resources. To tell RedHat cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells &#039;&#039;cman&#039;&#039; that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node needs to be specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple Mount Protection enabled, fencing can make sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot; post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, the lustrefs.sh script must be installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually RedHat cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning them different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning it to a domain. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
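With this configuration in place, a failover can later be exercised manually using &#039;&#039;clusvcadm&#039;&#039;; the service and node names below are the ones defined in the example configuration:

```shell
# relocate service lustre_1 to lustre2 to test failover, then move it back
clusvcadm -r lustre_1 -m lustre2
clusvcadm -r lustre_1 -m lustre1
```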
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is when the device paths differ between the nodes. &lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it&#039;s time to start the &#039;&#039;cman&#039;&#039; service.&lt;br /&gt;
This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running it&#039;s time to start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;rgmanager&#039;&#039; will then start to bring up the Lustre targets assigned to each of the Lustre nodes. &lt;br /&gt;
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster run &#039;&#039;clustat&#039;&#039; again. With the above configuration the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Useful tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat cluster is a complex system of programs and services. A number of tools are available to interact with RedHat Cluster and to make working with it easier. This section presents some of these tools; for more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
; cman_tool : can be used to manage the cman subsystem, for example to add nodes to or remove nodes from a cluster configuration&lt;br /&gt;
; ccs_tool : may be used to update the configuration of the running cluster&lt;br /&gt;
; clustat : shows the status of the cluster and whether and where services are currently running&lt;br /&gt;
; clusvcadm : can be used to enable, disable or relocate services in a cluster &lt;br /&gt;
; system-config-cluster : a graphical user interface for cluster configuration&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12046</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12046"/>
		<updated>2010-12-14T13:32:53Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Useful tools to use with RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is a fairly old HA solution. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
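Since bindnetaddr is simply the interface address ANDed with its netmask, the value can be computed octet by octet. The following sketch assumes a hypothetical host address 10.0.0.12 on the example's 255.255.255.0 network:

```shell
# Compute bindnetaddr = IP AND netmask, octet by octet.
# 10.0.0.12 is a hypothetical host address on the example network.
ip=10.0.0.12
mask=255.255.255.0
IFS=. read -r i1 i2 i3 i4 <<< "$ip"
IFS=. read -r m1 m2 m3 m4 <<< "$mask"
echo "$(( i1 & m1 )).$(( i2 & m2 )).$(( i3 & m3 )).$(( i4 & m4 ))"
# prints 10.0.0.0
```

The same value results for every host on the network, which is why the file can be copied to all nodes unchanged.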
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and&lt;br /&gt;
make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; it is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes; &#039;&#039;name&#039;&#039; defines how the resource can be addressed, while &#039;&#039;mountpoint&#039;&#039; and &#039;&#039;device&#039;&#039; specify where and what to mount.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets are provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning the nodes different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domain. This is done by defining a service for each of the Lustre nodes in the cluster and assigning it a domain. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; to the second node by default.&lt;br /&gt;
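Assembled from the fragments above, the overall cluster.conf has the following shape (a sketch; the comments stand for the sections defined earlier on this page):

```xml
<?xml version="1.0" ?>
<cluster config_version="1" name="Lustre">
  <cman expected_votes="1" two_node="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <!-- the two clusternode entries with their fence methods -->
  </clusternodes>
  <fencedevices>
    <!-- one fence_ipmilan device per node -->
  </fencedevices>
  <rm>
    <resources>
      <!-- the four lustrefs resource definitions -->
    </resources>
    <failoverdomains>
      <!-- the first_first and second_first domains -->
    </failoverdomains>
    <!-- the two service definitions referencing the resources -->
  </rm>
</cluster>
```

Remember to increase config_version whenever any of these sections change.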
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is if the device paths differ between the nodes. &lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it is time to start the &#039;&#039;cman&#039;&#039; service.&lt;br /&gt;
This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running, it is time to start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;rgmanager&#039;&#039; will then start to bring up the Lustre targets assigned to each of the Lustre nodes. &lt;br /&gt;
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster run &#039;&#039;clustat&#039;&#039; again. With the above configuration the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Useful tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster is a complex system of programs and services. A number of tools are available to make interacting and working with RedHat Cluster easier. In this section some of these tools are presented; for more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
; cman_tool : manages the cman subsystem; it can be used, for example, to add or remove nodes in a cluster configuration&lt;br /&gt;
; ccs_tool : updates the configuration of a running cluster&lt;br /&gt;
; clustat : shows the status of the cluster, including whether and where services are currently running&lt;br /&gt;
; clusvcadm : enables, disables, or relocates services in a cluster&lt;br /&gt;
; system-config-cluster : a graphical user interface for cluster configuration&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12045</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12045"/>
		<updated>2010-12-14T13:29:36Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* cman_tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is a fairly old HA solution. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and&lt;br /&gt;
make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; it is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
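Before trusting the cluster to fence, it can be worth exercising the agent by hand. The following sketch reuses the address and credentials from the example above; the flag names (-a, -l, -p, -o, and -P for lanplus) are those commonly supported by RHEL 5's fence_ipmilan, but verify them against the agent's help output on your system. The guard keeps the command from failing where the agent is not installed.

```shell
# Manually query a fence device before relying on the cluster to use it.
# Address and credentials mirror the fencedevices example above; -P
# requests IPMI lanplus, matching the lanplus device attribute.
if command -v fence_ipmilan >/dev/null; then
    fence_ipmilan -a 10.0.1.1 -l root -p supersecretpassword -P -o status
else
    echo "fence_ipmilan not installed; skipping manual check"
fi
```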
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, the lustrefs.sh script must be installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each Lustre server node provides a number of Lustre targets. To allow such a configuration, two failover domains must be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning them different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain to it. For the resources and failover domains defined earlier, this may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is when the device paths differ between the nodes.&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it&#039;s time to start the &#039;&#039;cman&#039;&#039; service.&lt;br /&gt;
This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running it&#039;s time to start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
rgmanager will then start to bring up the Lustre targets assigned to each of the Lustre nodes.&lt;br /&gt;
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster run &#039;&#039;clustat&#039;&#039; again. With the above configuration the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Useful tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster is a complex system of programs and services. A number of tools are available to interact with it and make working with RedHat Cluster easier. This section presents some of these tools; for more details, read the man pages.&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
cman_tool can be used to manage the cman subsystem, for example to add nodes to or remove nodes from a cluster configuration.&lt;br /&gt;
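As a sketch, membership and quorum information can be inspected with cman_tool's status and nodes subcommands; the guard below keeps the commands from failing on machines without the cman stack installed.

```shell
# Inspect cluster membership and quorum with cman_tool.
# These subcommands only work on a node where cman is installed
# and running, hence the guard.
if command -v cman_tool >/dev/null; then
    cman_tool status    # quorum and vote information
    cman_tool nodes     # one line per cluster member
else
    echo "cman_tool not installed; run this on a cluster node"
fi
```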
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12044</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12044"/>
		<updated>2010-12-14T12:14:55Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Start RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster shipped with RHEL 5.5 is a fairly old one. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides some of these targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is actually shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by bitwise-ANDing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
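The IP AND MASK calculation can be sketched in shell. For a contiguous netmask, clearing the host bits of each octet with i - (i mod (256 - m)) gives the same result as the bitwise AND described above; the example addresses below are illustrative, not taken from a real node.

```shell
# Derive the openais bindnetaddr value from an interface address and
# netmask. Each octet of the result is the octet with its host bits
# cleared, which for a contiguous mask equals the bitwise AND.
netaddr() {
    oldifs=$IFS
    IFS=.
    set -- $1 $2          # split both dotted quads into eight octets
    IFS=$oldifs
    echo "$(( $1 - $1 % (256 - $5) )).$(( $2 - $2 % (256 - $6) )).$(( $3 - $3 % (256 - $7) )).$(( $4 - $4 % (256 - $8) ))"
}

netaddr 10.0.0.17 255.255.255.0   # prints 10.0.0.0
```

With a /24 mask this reproduces the article's bindnetaddr of 10.0.0.0 for any host address on that network.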
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039;.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
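The copy-and-make-executable step can be sketched as below. The destination directory and the stub script are stand-ins for illustration only: on a real node the destination is /usr/share/cluster, and the script body is the mailing-list attachment linked above, not the stub generated here.

```shell
# Install a resource script with the correct mode and verify it is
# executable. DEST is a scratch directory; the printf line stands in
# for the real lustrefs.sh download.
DEST=./cluster-scripts
mkdir -p "$DEST"
printf '#!/bin/sh\necho lustrefs stub\n' > lustrefs.sh
install -m 0755 lustrefs.sh "$DEST/lustrefs.sh"
if test -x "$DEST/lustrefs.sh"; then echo "lustrefs.sh is executable"; fi
```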
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format; the complete schema can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized to &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
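The version bump can be automated with a small sed one-liner. The file name below is a scratch fragment containing only the attribute line; on a real node you would operate on /etc/cluster/cluster.conf.

```shell
# Increment the config_version attribute in place. CONF points at a
# scratch fragment here, reproducing only the attribute line of the
# real /etc/cluster/cluster.conf.
CONF=./cluster.conf.fragment
printf 'cluster config_version="1" name="Lustre"\n' > "$CONF"

old=$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$CONF")
new=$((old + 1))
sed -i "s/config_version=\"$old\"/config_version=\"$new\"/" "$CONF"
grep -o 'config_version="[0-9]*"' "$CONF"   # prints config_version="2"
```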
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection (MMP) enabled, fencing ensures that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options must first be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot; post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; it is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, the lustrefs.sh script must be installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each Lustre server node provides a number of Lustre targets. To allow such a configuration, two failover domains must be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning them different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain to it. For the resources and failover domains defined earlier, this may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is when the device paths differ between the nodes.&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it&#039;s time to start the &#039;&#039;cman&#039;&#039; service.&lt;br /&gt;
This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
With &#039;&#039;cman&#039;&#039; up and running it&#039;s time to start the resource group manager &#039;&#039;rgmanager&#039;&#039; by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
rgmanager will then start to bring up the Lustre targets assigned to each of the Lustre nodes.&lt;br /&gt;
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
To verify the state of the cluster run &#039;&#039;clustat&#039;&#039; again. With the above configuration the output should look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat&lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 13:12:07 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                    ID   Status&lt;br /&gt;
 ------ ----                                    ---- ------&lt;br /&gt;
 lustre1                                           1 Online, Local, rgmanager&lt;br /&gt;
 lustre2                                           2 Online, rgmanager&lt;br /&gt;
&lt;br /&gt;
 Service Name                       Owner (Last)                       State         &lt;br /&gt;
 ------- ----                       ----- ------                       -----         &lt;br /&gt;
 service:lustre_1                   lustre1                            started       &lt;br /&gt;
 service:lustre_2                   lustre2                            started       &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Useful tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster is a complex system of programs and services. A number of tools are available to interact with it and make working with RedHat Cluster easier. This section presents some of these tools; for more details, read the man pages.&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12043</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12043"/>
		<updated>2010-12-14T10:30:18Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Start RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
In comparison with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly old. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated as the bitwise AND of the IP address and the network mask of the interface (IP &amp;amp; MASK), so that the host bits of the address are cleared. The configuration file is thus independent of any particular node and can be copied to all nodes.&lt;br /&gt;
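The IP AND MASK calculation can be illustrated with plain shell arithmetic. The interface address 10.0.0.12 and netmask 255.0.0.0 below are assumed example values, chosen so the result lands on the 10.0.0.0 network used above:

```shell
# Sketch: derive the bindnetaddr value as (IP AND MASK), octet by octet.
# 10.0.0.12 / 255.0.0.0 are assumed example values, not taken from a real node.
ip=10.0.0.12
mask=255.0.0.0

# Split both dotted quads into octets using IFS.
oldIFS=$IFS; IFS=.
set -- $ip;   i1=$1 i2=$2 i3=$3 i4=$4
set -- $mask; m1=$1 m2=$2 m3=$3 m4=$4
IFS=$oldIFS

# Bitwise AND each octet pair to clear the host bits.
net="$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
echo "bindnetaddr: $net"   # bindnetaddr: 10.0.0.0
```

Running this for each node's interface should yield the same network address, which is why the same configuration file can be deployed unchanged on every node.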
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
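Because the config_version attribute must be increased on every update, a small helper makes the bump less error-prone. The following is a sketch working on a temporary copy so it can run anywhere; on a real node you would edit /etc/cluster/cluster.conf itself, and the ccs_tool step shown as a comment is an assumption based on the standard RHEL 5 cluster tools:

```shell
# Sketch: increment the config_version attribute of a cluster.conf file.
# Operates on a temporary copy rather than the live /etc/cluster/cluster.conf.
conf=$(mktemp)
printf '%s\n' \
  '<?xml version="1.0" ?>' \
  '<cluster config_version="1" name="Lustre">' \
  '</cluster>' > "$conf"

# Extract the current version, add one, and rewrite the attribute in place.
old=$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$conf")
new=$((old + 1))
sed -i "s/config_version=\"$old\"/config_version=\"$new\"/" "$conf"

result=$(grep -o 'config_version="[0-9]*"' "$conf")
echo "$result"   # config_version="2"
rm -f "$conf"

# On the live cluster you would then distribute the updated file, e.g.:
#   ccs_tool update /etc/cluster/cluster.conf
```

Remember that the bumped file has to end up on both nodes, as noted in the "Start RedHat Cluster" section below.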
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount-Protection enabled, fencing can make sure that a node in an unclear state is brought down for more analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; device and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file, all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets is provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain, but with different priorities assigned to the nodes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is if the device paths differ between the nodes. &lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
With &#039;&#039;cluster.conf&#039;&#039; in place on both nodes, it is time to start the &#039;&#039;cman&#039;&#039; service. This is done by running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
on both clustered nodes. To verify that &#039;&#039;cman&#039;&#039; is running, &#039;&#039;clustat&#039;&#039; can be used:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-3.2# clustat &lt;br /&gt;
Cluster Status for Lustre @ Tue Dec 14 11:27:36 2010&lt;br /&gt;
Member Status: Quorate&lt;br /&gt;
&lt;br /&gt;
 Member Name                                      ID   Status&lt;br /&gt;
 ------ ----                                      ---- ------&lt;br /&gt;
 lustre1                                             1 Online, Local&lt;br /&gt;
 lustre2                                             2 Online&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To enable the &#039;&#039;cman&#039;&#039; service permanently, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
== Useful tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster is a complex system of programs and services. A number of tools are available to interact with RedHat Cluster and to make working with it easier. This section presents several of these tools; for more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12042</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12042"/>
		<updated>2010-12-14T10:05:02Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Tools to use with RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
In comparison with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly old. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated as the bitwise AND of the IP address and the network mask of the interface (IP &amp;amp; MASK), so that the host bits of the address are cleared. The configuration file is thus independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount-Protection enabled, fencing can make sure that a node in an unclear state is brought down for more analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example, an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains must be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each failover domain but assigning the nodes different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a failover domain to it. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually, &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is if the device paths differ between the nodes. &lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;cman&#039;&#039; &lt;br /&gt;
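A sketch of starting the &#039;&#039;cman&#039;&#039; service on both nodes and enabling it at boot time, assuming the standard RHEL 5 init scripts:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;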
&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
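A sketch of starting the &#039;&#039;rgmanager&#039;&#039; service and enabling it at boot time, assuming the standard RHEL 5 init scripts:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
chkconfig rgmanager on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;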
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
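To verify that the cluster came up correctly, the status tools described below can be used. A minimal check, assuming both services are running on the nodes, might be:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cman_tool status&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
clustat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;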
== Useful tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster is a complex system of programs and services. A number of tools are available to interact with RedHat Cluster and to make working with it easier. This section presents some of these tools; for more details, read the man pages. &lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12041</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12041"/>
		<updated>2010-12-14T10:00:08Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated as the bitwise AND of the IP address and the network mask for the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
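As a worked example of this calculation (the addresses are illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
IP address of interface:  10.0.0.5&lt;br /&gt;
Network mask:             255.255.255.0&lt;br /&gt;
IP &amp;amp; MASK (bindnetaddr):  10.0.0.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;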
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script to the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
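These steps can be sketched as follows, assuming &#039;&#039;wget&#039;&#039; is available:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget -O /usr/share/cluster/lustrefs.sh \&lt;br /&gt;
  http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin&lt;br /&gt;
chmod +x /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;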
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
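When the cluster is already running, an updated configuration can be propagated to the other nodes with &#039;&#039;ccs_tool&#039;&#039; (a sketch; see the ccs_tool man page for details):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ccs_tool update /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;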
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells &#039;&#039;cman&#039;&#039; that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes that form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID must be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple-Mount Protection (MMP) enabled, fencing ensures that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example, an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains must be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each failover domain but assigning the nodes different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a failover domain to it. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Start RedHat Cluster ===&lt;br /&gt;
Before bringing up RedHat Cluster, make sure &#039;&#039;cluster.conf&#039;&#039; is updated on both Lustre server nodes. Usually, &#039;&#039;cluster.conf&#039;&#039; should be identical on both nodes; the only exception is if the device paths differ between the nodes. &lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;cman&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;cman&#039;&#039; &lt;br /&gt;
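A sketch of starting the &#039;&#039;cman&#039;&#039; service on both nodes and enabling it at boot time, assuming the standard RHEL 5 init scripts:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service cman start&lt;br /&gt;
chkconfig cman on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;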
&lt;br /&gt;
&lt;br /&gt;
==== &#039;&#039;rgmanager&#039;&#039; service ====&lt;br /&gt;
&lt;br /&gt;
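A sketch of starting the &#039;&#039;rgmanager&#039;&#039; service and enabling it at boot time, assuming the standard RHEL 5 init scripts:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
service rgmanager start&lt;br /&gt;
chkconfig rgmanager on&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;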
&lt;br /&gt;
==== Verifying RedHat Cluster ====&lt;br /&gt;
&lt;br /&gt;
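To verify that the cluster came up correctly, the status tools described below can be used. A minimal check, assuming both services are running on the nodes, might be:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cman_tool status&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
clustat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;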
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12040</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12040"/>
		<updated>2010-12-14T09:33:04Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Services */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated as the bitwise AND of the IP address and the network mask for the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
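As a worked example of this calculation (the addresses are illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
IP address of interface:  10.0.0.5&lt;br /&gt;
Network mask:             255.255.255.0&lt;br /&gt;
IP &amp;amp; MASK (bindnetaddr):  10.0.0.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;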
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
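The configuration file (and, if &#039;&#039;secauth&#039;&#039; is enabled, the generated key) must be identical on both nodes. It can be copied over with, for example (the host name is an assumption):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
scp /etc/ais/openais.conf /etc/ais/authkey lustre2:/etc/ais/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;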
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
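If the Cluster directory has to be added to the yum configuration manually, a repository definition similar to the following can be placed in &#039;&#039;/etc/yum.repos.d/rhel-cluster.repo&#039;&#039; (the mount path is an assumption):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[rhel-cluster]&lt;br /&gt;
name=RHEL 5 Cluster&lt;br /&gt;
baseurl=file:///path/to/RHEL-DVD/Cluster&lt;br /&gt;
enabled=1&lt;br /&gt;
gpgcheck=0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;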
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
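For example (the downloaded file name may differ):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cp attachment-0001.bin /usr/share/cluster/lustrefs.sh&lt;br /&gt;
chmod 755 /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;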
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make it work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount-Protection enabled, fencing can make sure that a node in an unclear state is brought down for more analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example IPMI based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; is used to define a name for this fencing device. This name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use. In this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; device and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets is provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning them different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain to it. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example &#039;&#039;target1&#039;&#039; and &#039;&#039;target2&#039;&#039; are assigned to the first node and &#039;&#039;target3&#039;&#039; and &#039;&#039;target4&#039;&#039; are assigned to the second node by default.&lt;br /&gt;
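Putting the pieces together, the overall nesting of &#039;&#039;cluster.conf&#039;&#039; looks like this (contents abbreviated; see the sections above for the full fragments):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt; ... &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt; ... &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt; ... &amp;lt;/resources&amp;gt;&lt;br /&gt;
    &amp;lt;failoverdomains&amp;gt; ... &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
    &amp;lt;service&amp;gt; ... &amp;lt;/service&amp;gt;&lt;br /&gt;
    &amp;lt;service&amp;gt; ... &amp;lt;/service&amp;gt;&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;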
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
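&#039;&#039;cman_tool&#039;&#039; can be used to inspect and manage cluster membership. For example, to show the cluster status and the list of known nodes:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cman_tool status&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;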
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
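After increasing the config_version attribute, &#039;&#039;ccs_tool&#039;&#039; can propagate the updated configuration to all cluster nodes:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ccs_tool update /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;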
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
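&#039;&#039;clustat&#039;&#039; displays the current state of the cluster and its services; it can be run without arguments:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
clustat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;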
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
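&#039;&#039;clusvcadm&#039;&#039; controls the services managed by &#039;&#039;rgmanager&#039;&#039;. For example, to manually relocate the service &#039;&#039;lustre_1&#039;&#039; defined above to the node &#039;&#039;lustre2&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
clusvcadm -r lustre_1 -m lustre2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;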
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12039</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12039"/>
		<updated>2010-12-14T09:12:44Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Services */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is rather dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets, and in case of a failure the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is actually shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows shutting down a failed node via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make it work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount-Protection enabled, fencing can make sure that a node in an unclear state is brought down for more analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example IPMI based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; is used to define a name for this fencing device. This name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use. In this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; device and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets is provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning them different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains. This is done by defining a service for each of the Lustre nodes in the cluster and assigning a domain to it. For the resources and failover domains defined earlier, this may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12038</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12038"/>
		<updated>2010-12-14T08:11:33Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Fencing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped with RHEL 5.5 is a fairly old HA solution. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a reliable way to shut down a failed node. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is the bitwise AND of the interface&#039;s IP address and network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
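The IP-AND-MASK calculation can be illustrated with a short sketch; the address 10.0.0.17 and the mask 255.255.255.0 are hypothetical examples, not values from the configuration above.&lt;br /&gt;

```python
# Sketch of how a bindnetaddr value is derived: clear the host bits of
# the interface address using the netmask (IP & MASK).
import ipaddress

def bindnetaddr(ip, netmask):
    """Return the network address for ip/netmask as a string."""
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    return str(iface.network.network_address)

print(bindnetaddr("10.0.0.17", "255.255.255.0"))  # 10.0.0.0
```

Because the result is the same on every node of the subnet, the same &#039;&#039;openais.conf&#039;&#039; works unchanged on all of them.&lt;br /&gt;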
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make RedHat Cluster work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
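As noted above, the config_version attribute must be increased whenever &#039;&#039;cluster.conf&#039;&#039; changes. A minimal sketch of automating that bump before copying the file to all nodes (an assumed helper workflow, not part of the RedHat Cluster tools):&lt;br /&gt;

```python
# Increment the config_version attribute of a cluster.conf document.
import xml.etree.ElementTree as ET

def bump_config_version(xml_text):
    """Return cluster.conf text with config_version incremented by one."""
    root = ET.fromstring(xml_text)              # root is the <cluster> element
    version = int(root.get("config_version"))
    root.set("config_version", str(version + 1))
    return ET.tostring(root, encoding="unicode")

conf = ('<cluster config_version="1" name="Lustre">'
        '<cman expected_votes="1" two_node="1"/></cluster>')
print(bump_config_version(conf))  # config_version is now "2"
```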
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing ensures that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot; option=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; is used to define a name for this fencing device. This name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; device and are self-explanatory.&lt;br /&gt;
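Since every fence &#039;&#039;device&#039;&#039; referenced by a &#039;&#039;clusternode&#039;&#039; must match a defined &#039;&#039;fencedevice&#039;&#039; name, a small consistency check can catch typos before the configuration is distributed. This is an illustrative sketch, not a RedHat Cluster tool:&lt;br /&gt;

```python
# Report fence device names referenced by cluster nodes but never defined.
import xml.etree.ElementTree as ET

def undefined_fence_refs(cluster_conf_text):
    """Return the set of referenced-but-undefined fence device names."""
    root = ET.fromstring(cluster_conf_text)
    defined = {fd.get("name") for fd in root.iter("fencedevice")}
    referenced = {d.get("name")
                  for node in root.iter("clusternode")
                  for d in node.iter("device")}
    return referenced - defined

# Minimal sample mirroring the structure of the examples above.
conf = """<cluster config_version="1" name="Lustre">
  <clusternodes>
    <clusternode name="lustre1" nodeid="1">
      <fence><method name="single"><device lanplus="1" name="lustre1-sp"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="lustre1-sp" agent="fence_ipmilan"/>
  </fencedevices>
</cluster>"""
print(undefined_fence_refs(conf))  # set() -> all references resolve
```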
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; file is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file, all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, it is essential that the lustrefs.sh script is installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain but assigning the nodes different priorities.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
As a final configuration step, the resources defined earlier are assigned to their failover domains by defining a service for each Lustre node and assigning a domain to it: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12037</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12037"/>
		<updated>2010-12-13T16:11:45Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped with RHEL 5.5 is a fairly old HA solution. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a reliable way to shut down a failed node. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is the bitwise AND of the interface&#039;s IP address and network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make RedHat Cluster work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing ensures that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; is used to define a name for this fencing device. This name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; device and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; file is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file, all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, the lustrefs.sh script must be installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain, but with different priorities: each service prefers the node with the lower priority value and fails over to the other.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
Each service ties a set of &#039;&#039;lustrefs&#039;&#039; resources to a failover domain. In this example, two services are defined, one per failover domain:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12036</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12036"/>
		<updated>2010-12-13T16:11:21Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is rather old. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides some of these targets and, in case of a failure, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure that each Lustre target is mounted on only one of the server nodes at a time, STONITH fencing is implemented. This requires a way to ensure that a failed node is actually shut down. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address of the interface with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
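For the common case of a /24 netmask (255.255.255.0), the bindnetaddr value can be derived in plain POSIX shell; the IP address below is an illustrative example, not taken from a real node:

```shell
# Derive the openais bindnetaddr value for a /24 netmask
# (255.255.255.0): keep the network part of the node's IP
# and clear the host octet. The IP here is an example value.
ip="10.0.0.42"
bindnetaddr="${ip%.*}.0"   # strip the last octet, append 0
echo "$bindnetaddr"        # prints 10.0.0.0
```

For other netmasks the same idea applies: clear exactly the host bits given by the mask.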
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, it needs to be copied to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
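Since config_version must be increased on every update, the bump can be scripted before distributing a new cluster.conf to the nodes. The sketch below works on an illustrative attribute string (angle brackets omitted); only the attribute name comes from the cluster.conf schema, everything else is a stand-in:

```shell
# Sketch: read the current config_version from a cluster.conf-style
# attribute string and emit the incremented value. The string is an
# illustrative stand-in for the real file content.
conf='cluster config_version="1" name="Lustre"'
old=$(expr "$conf" : '.*config_version="\([0-9]*\)"')
new=$((old + 1))
conf=$(printf '%s\n' "$conf" | sed "s/config_version=\"$old\"/config_version=\"$new\"/")
echo "$conf"   # prints: cluster config_version="2" name="Lustre"
```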
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing can make sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; it is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; device and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file all Lustre targets of both clustered nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
To use the &#039;&#039;lustrefs&#039;&#039; resource definition, the lustrefs.sh script must be installed in &#039;&#039;/usr/share/cluster&#039;&#039; as described above. To verify that the script is installed correctly and has the correct permissions, run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/share/cluster/lustrefs.sh --help&lt;br /&gt;
usage: /usr/share/cluster/lustrefs.sh {start|stop|status|monitor|restart|meta-data|verify-all}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each &#039;&#039;lustrefs&#039;&#039; resource has a number of attributes. &#039;&#039;name&#039;&#039; defines how the resource can be addressed.&lt;br /&gt;
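The actions in the usage line above follow the rgmanager resource-agent convention. The following is a minimal sketch of such an action dispatcher; the function name and the echoed descriptions are hypothetical, only the action names come from the output shown above:

```shell
# Minimal sketch of the action dispatch an rgmanager resource script
# such as lustrefs.sh implements. The real script mounts or unmounts
# the Lustre target; this sketch only reports what it would do.
lustrefs_action() {
    case "$1" in
        start)          echo "would mount the Lustre target" ;;
        stop)           echo "would unmount the Lustre target" ;;
        status|monitor) echo "would check whether the target is mounted" ;;
        restart)        lustrefs_action stop; lustrefs_action start ;;
        meta-data)      echo "would print the resource-agent XML description" ;;
        verify-all)     echo "would validate the resource attributes" ;;
        *)              echo "usage: lustrefs_action {start|stop|status|monitor|restart|meta-data|verify-all}"
                        return 1 ;;
    esac
}
lustrefs_action start   # prints: would mount the Lustre target
```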
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain, but with different priorities: each service prefers the node with the lower priority value and fails over to the other.&lt;br /&gt;
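In an ordered, restricted failover domain, rgmanager runs a service on the available member with the lowest priority value. The selection rule can be sketched as follows; the node names mirror the example above, while the helper function itself is hypothetical:

```shell
# Pick the preferred node from "name:priority" pairs of the members
# that are currently alive: the lowest priority number wins, as in
# an ordered failover domain.
preferred_node() {
    printf '%s\n' "$@" | sort -t: -k2 -n | head -n 1 | cut -d: -f1
}
preferred_node lustre1:1 lustre2:2   # both alive: prints lustre1
preferred_node lustre2:2             # lustre1 failed: prints lustre2
```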
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
Each service ties a set of &#039;&#039;lustrefs&#039;&#039; resources to a failover domain. In this example, two services are defined, one per failover domain:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12035</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12035"/>
		<updated>2010-12-13T15:55:39Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Resource Manager */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is rather old. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides some of these targets and, in case of a failure, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure that each Lustre target is mounted on only one of the server nodes at a time, STONITH fencing is implemented. This requires a way to ensure that a failed node is actually shut down. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address of the interface with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, it needs to be copied to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing can make sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
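For reference, the manual equivalent of such a fencing action can be sketched as follows. This is an illustration only: the option letters used here for &#039;&#039;fence_ipmilan&#039;&#039; are assumptions and should be checked against the agent&#039;s man page for the installed version.&lt;br /&gt;

```python
# Sketch: turn the fencedevice attributes from cluster.conf into the
# equivalent manual fence_ipmilan invocation, e.g. to test a service
# processor by hand. The flag names (-a, -l, -p, -o, -P) are assumed;
# verify them against the fence_ipmilan man page for your version.
def ipmilan_command(device, action="status"):
    cmd = ["fence_ipmilan",
           "-a", device["ipaddr"],   # service processor address
           "-l", device["login"],    # IPMI login
           "-p", device["passwd"],   # IPMI password
           "-o", action]             # operation: status, off, reboot, ...
    if device.get("lanplus") == "1":
        cmd.append("-P")             # use the IPMI v2.0 "lanplus" protocol
    return cmd

# Attribute values taken from the lustre1-sp fencedevice example above.
lustre1_sp = {"ipaddr": "10.0.1.1", "login": "root",
              "passwd": "supersecretpassword", "lanplus": "1"}
print(" ".join(ipmilan_command(lustre1_sp)))
```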
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of resources, failover domains, and services.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
In the &#039;&#039;resources&#039;&#039; block of the &#039;&#039;cluster.conf&#039;&#039; file, all Lustre targets of both cluster nodes are specified. In this example, four Lustre object storage targets are defined:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets is provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain, but with different priorities assigned to the nodes.&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
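The two blocks are tied together by the &#039;&#039;ref&#039;&#039; attribute: each &#039;&#039;lustrefs&#039;&#039; reference inside a &#039;&#039;service&#039;&#039; must match the &#039;&#039;name&#039;&#039; of a resource defined in the &#039;&#039;resources&#039;&#039; block. A minimal consistency check of this pairing, using the names from this example as plain Python data:&lt;br /&gt;

```python
# Sanity-check sketch: every target referenced by a service must be a
# defined resource, and no target should be claimed by two services.
# The names mirror the resources/services example above.
resources = {"target1", "target2", "target3", "target4"}
services = {
    "lustre_1": ["target3", "target4"],  # runs in domain first_first
    "lustre_2": ["target1", "target2"],  # runs in domain second_first
}

claimed = [ref for refs in services.values() for ref in refs]
unknown = [ref for ref in claimed if ref not in resources]
duplicates = [ref for ref in set(claimed) if claimed.count(ref) != 1]

assert not unknown, f"services reference undefined resources: {unknown}"
assert not duplicates, f"targets claimed by two services: {duplicates}"
print("service/resource mapping is consistent")
```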
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12034</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12034"/>
		<updated>2010-12-13T15:37:50Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Failover Domains */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or the methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly old. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure that a failed node is really shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
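The calculation of &#039;&#039;bindnetaddr&#039;&#039; can be sketched with Python&#039;s standard &#039;&#039;ipaddress&#039;&#039; module; the interface address used below is a made-up example on the 10.0.0.0 network:&lt;br /&gt;

```python
# Sketch: bindnetaddr is the interface IP address ANDed with its netmask,
# i.e. the network address with the host bits cleared. The address and
# mask below are illustrative values, not from a real node.
import ipaddress

iface = ipaddress.ip_interface("10.0.0.12/255.255.255.0")
bindnetaddr = str(iface.network.network_address)
print(bindnetaddr)  # 10.0.0.0 on every node of this subnet
```

Because the host bits are cleared, every node on the subnet computes the same value, which is why the file can be copied unchanged to all nodes.&lt;br /&gt;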
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included.&lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
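The required version bump can be sketched as follows; the &#039;&#039;cluster&#039;&#039; element is built in code here rather than parsed from &#039;&#039;/etc/cluster/cluster.conf&#039;&#039;, and copying the updated file to all nodes remains a manual step:&lt;br /&gt;

```python
# Sketch: increment the config_version attribute of the cluster element.
# In practice the element would be parsed from /etc/cluster/cluster.conf;
# it is constructed in code here only to keep the example self-contained.
import xml.etree.ElementTree as ET

cluster = ET.Element("cluster", config_version="1", name="Lustre")

def bump_config_version(elem):
    new_version = int(elem.get("config_version")) + 1
    elem.set("config_version", str(new_version))
    return new_version

print(bump_config_version(cluster))  # 2
```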
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep the data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing makes sure that a node in an unclear state is shut down so that the administrator can analyze it further.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes:&lt;br /&gt;
&#039;&#039;name&#039;&#039; defines a name for this fencing device; this name is referred to in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of the &#039;&#039;cluster.conf&#039;&#039; is wrapped in a &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of failover domains, resources and services.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, a number of Lustre targets is provided by each of the Lustre server nodes. To allow such a configuration, two failover domains need to be defined. The definition of &#039;&#039;failoverdomains&#039;&#039; may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, two failover domains are created by adding the same nodes to each domain, but with different priorities assigned to the nodes.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12033</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12033"/>
		<updated>2010-12-13T15:34:18Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Resource Manager */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or the methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly old. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure that a failed node is really shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included.&lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
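After changing the configuration later on a running cluster, increase the config_version attribute and propagate the file to all nodes, for example with &#039;&#039;ccs_tool&#039;&#039; (a sketch; this assumes the cluster services are already running):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ccs_tool update /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;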
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes that form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies the fence device used to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple Mount Protection (MMP) enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot; post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes. The &#039;&#039;name&#039;&#039; attribute defines a name for the fencing device; this name is referenced in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
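Before relying on fencing, it is a good idea to check that each fence device is reachable by invoking the fence agent manually, for example (a sketch; address and credentials as configured above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
fence_ipmilan -a 10.0.1.1 -l root -p supersecretpassword -o status&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;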
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
The resource manager block of &#039;&#039;cluster.conf&#039;&#039; is wrapped in an &#039;&#039;rm&#039;&#039; tag:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;rm&amp;gt;&lt;br /&gt;
    ..&lt;br /&gt;
  &amp;lt;/rm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
It contains definitions of failover domains, resources and services.&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Usually, RedHat Cluster is used to provide a service on a number of nodes, where one node takes over the service of a failed node. In this example, each of the Lustre server nodes provides a number of Lustre targets. To allow such a configuration, two failover domains need to be defined. The &#039;&#039;failoverdomains&#039;&#039; definition may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Two failover domains are defined, with opposite node priorities: each node has the highest priority in one domain, so each node preferentially runs one group of Lustre targets and takes over the other group only if its partner fails. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
The &#039;&#039;resources&#039;&#039; section defines each Lustre target as a &#039;&#039;lustrefs&#039;&#039; resource, handled by the resource script installed above. Each resource is given a name, a mount point and the device to mount:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
Each service groups the &#039;&#039;lustrefs&#039;&#039; resources that fail over together and binds them to one of the failover domains defined above:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&#039;&#039;cman_tool&#039;&#039; joins a node to or removes it from the cluster and shows membership information, for example via &#039;&#039;cman_tool status&#039;&#039; or &#039;&#039;cman_tool nodes&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&#039;&#039;ccs_tool&#039;&#039; is used to update the cluster configuration online, for example &#039;&#039;ccs_tool update /etc/cluster/cluster.conf&#039;&#039; after increasing the config_version attribute.&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&#039;&#039;clustat&#039;&#039; displays the current status of the cluster, its member nodes and the configured services.&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&#039;&#039;clusvcadm&#039;&#039; enables, disables or relocates services, for example &#039;&#039;clusvcadm -r lustre_1 -m lustre2&#039;&#039; to move the &#039;&#039;lustre_1&#039;&#039; service to the node &#039;&#039;lustre2&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;br /&gt;
&#039;&#039;system-config-cluster&#039;&#039; is a graphical front end for creating and editing &#039;&#039;cluster.conf&#039;&#039;.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12032</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12032"/>
		<updated>2010-12-13T15:04:26Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Fencing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is rather old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node serves some of the targets, and in case of a failure the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure each Lustre target is mounted on only one of the server nodes at a time, STONITH fencing is implemented. This requires a reliable way to shut down a failed node. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by combining the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading, copy the file to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. If the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes that form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies the fence device used to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple Mount Protection (MMP) enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot; post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat cluster supports a number of devices. More details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Every fence device has a number of attributes. The &#039;&#039;name&#039;&#039; attribute defines a name for the fencing device; this name is referenced in the &#039;&#039;fence&#039;&#039; part of the &#039;&#039;clusternode&#039;&#039; definition (see above). The &#039;&#039;agent&#039;&#039; attribute defines the kind of fencing device to use; in this example an IPMI-over-LAN device is used. The remaining attributes are specific to the &#039;&#039;ipmilan&#039;&#039; agent and are self-explanatory.&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12031</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12031"/>
		<updated>2010-12-13T14:53:50Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is rather old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node serves some of the targets, and in case of a failure the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure each Lustre target is mounted on only one of the server nodes at a time, STONITH fencing is implemented. This requires a reliable way to shut down a failed node. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by combining the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading, copy the file to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
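&lt;br /&gt;
Once &#039;&#039;cman&#039;&#039; is running, the effect of these settings can be verified with the &#039;&#039;cman_tool&#039;&#039; utility, which reports the expected votes and the quorum state:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cman_tool status&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;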
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple Mount Protection enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
In this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
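Before relying on fencing during a real failure, it is a good idea to test each fence device manually. With the IPMI agent used here, querying the power state of the first node might look like this (using the address and credentials configured above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
fence_ipmilan -a 10.0.1.1 -l root -p supersecretpassword -o status&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;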
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
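The &#039;&#039;clustat&#039;&#039; command displays the current state of the cluster members and services, for example:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
clustat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;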
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
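&#039;&#039;clusvcadm&#039;&#039; can be used to enable, disable, or relocate a service manually. For example, to move the service &#039;&#039;lustre_1&#039;&#039; back to node &#039;&#039;lustre1&#039;&#039; after the node has been repaired:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
clusvcadm -r lustre_1 -m lustre1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;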
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12030</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12030"/>
		<updated>2010-12-13T14:53:27Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is a rather old HA solution. It is recommended to use another HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. The configuration file is therefore independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If yum is not set up correctly, the rpm packages and their dependencies need to be installed manually.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
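&lt;br /&gt;
For example, assuming &#039;&#039;wget&#039;&#039; is available, the script can be downloaded and installed as follows:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget -O /usr/share/cluster/lustrefs.sh \&lt;br /&gt;
  http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin&lt;br /&gt;
chmod 755 /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;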
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep data on the Lustre file system consistent. Even with Multiple Mount Protection enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, some fence daemon options need to be specified first. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
In this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12029</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12029"/>
		<updated>2010-12-13T14:48:58Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Fencing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is a rather old HA solution. It is recommended to use another HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. The configuration file is therefore independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh.&lt;br /&gt;
Make sure the script is executable.&lt;br /&gt;
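&lt;br /&gt;
For example, assuming &#039;&#039;wget&#039;&#039; is available, the script can be downloaded and installed as follows:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget -O /usr/share/cluster/lustrefs.sh \&lt;br /&gt;
  http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin&lt;br /&gt;
chmod 755 /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;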
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes that form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
Fencing is essential to keep the data on the Lustre file system consistent. Even with Multi-Mount Protection enabled, fencing makes sure that a node in an unclear state is brought down for further analysis by the administrator.&lt;br /&gt;
&lt;br /&gt;
To configure fencing, first some fence daemon options need to be specified. The &#039;&#039;fence_daemon&#039;&#039; tag is a direct child of the &#039;&#039;cluster&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon clean_start=&amp;quot;0&amp;quot; post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Depending on the hardware configuration, these values may differ for different installations. Please see the notes in the cluster_schema_rhel5 document (linked above) for details.&lt;br /&gt;
&lt;br /&gt;
Each Lustre node in a cluster should be equipped with a fencing device. RedHat Cluster supports a number of devices; more details on which devices are supported and how to configure them can be found in the cluster schema document.&lt;br /&gt;
For this example, IPMI-based fencing devices are used.&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section may look like this (note that the IPMI passwords are stored in clear text, so access to &#039;&#039;cluster.conf&#039;&#039; should be restricted):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
Failover domains define on which nodes a service may run and in which order the nodes are preferred. With &#039;&#039;ordered=&amp;quot;1&amp;quot;&#039;&#039; the node with the lowest priority value is preferred, and with &#039;&#039;restricted=&amp;quot;1&amp;quot;&#039;&#039; the service may only run on the listed nodes. Two mirrored domains let each node be the preferred owner of its own targets:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
Each Lustre target is defined as a &#039;&#039;lustrefs&#039;&#039; resource, which uses the &#039;&#039;lustrefs.sh&#039;&#039; script installed above and specifies the device and mount point of the target:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
Finally, the resources are grouped into services. Each service is bound to one of the failover domains, so that under normal conditions each node runs its own targets and takes over the other node&#039;s targets on failure:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
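&#039;&#039;cman_tool&#039;&#039; manages and inspects cluster membership. Typical read-only uses are:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show cluster status, including quorum information&lt;br /&gt;
cman_tool status&lt;br /&gt;
# list all cluster nodes and their current state&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;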
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
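&#039;&#039;ccs_tool&#039;&#039; propagates configuration changes to the other cluster nodes. After editing &#039;&#039;cluster.conf&#039;&#039; and increasing &#039;&#039;config_version&#039;&#039;, the update can be distributed with:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ccs_tool update /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;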
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
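&#039;&#039;clustat&#039;&#039; shows a snapshot of the cluster state, i.e. which nodes are online and which node currently owns each service:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
clustat&lt;br /&gt;
# refresh the view every 5 seconds&lt;br /&gt;
clustat -i 5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;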
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
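&#039;&#039;clusvcadm&#039;&#039; enables, disables and relocates services. With the configuration above, the service &#039;&#039;lustre_1&#039;&#039; could, for example, be moved to node &#039;&#039;lustre2&#039;&#039; for maintenance:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# relocate service lustre_1 to node lustre2&lt;br /&gt;
clusvcadm -r lustre_1 -m lustre2&lt;br /&gt;
# disable and re-enable a service&lt;br /&gt;
clusvcadm -d lustre_1&lt;br /&gt;
clusvcadm -e lustre_1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;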
&lt;br /&gt;
=== system-config-cluster ===&lt;br /&gt;
&#039;&#039;system-config-cluster&#039;&#039; is a graphical front-end shipped with RHEL that can be used to create and edit &#039;&#039;cluster.conf&#039;&#039; instead of writing the XML by hand.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12028</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12028"/>
		<updated>2010-12-13T14:19:44Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Tools to use with RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly dated. It is recommended to use another HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides a subset of the targets; in case of a failure, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included.&lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run a cluster with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes that form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== clustat ===&lt;br /&gt;
&lt;br /&gt;
=== clusvcadm ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12027</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12027"/>
		<updated>2010-12-13T14:17:13Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* system-config-cluster = */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly dated. It is recommended to use another HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides a subset of the targets; in case of a failure, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included.&lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run a cluster with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes that form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The &#039;&#039;fence_daemon&#039;&#039; settings control fencing timing: &#039;&#039;post_fail_delay&#039;&#039; is the number of seconds the daemon waits before fencing a node after it fails, and &#039;&#039;post_join_delay&#039;&#039; is the number of seconds it waits before fencing after a node joins the fence domain.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
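Each &#039;&#039;fencedevice&#039;&#039; entry tells the fence daemon how to reach a node&#039;s service processor: the &#039;&#039;fence_ipmilan&#039;&#039; agent logs in via IPMI using the given address and credentials. Before relying on fencing, the agent can be tested by hand; as a sketch, using the example address and credentials from above (replace them with your own):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
fence_ipmilan -a 10.0.1.2 -l root -p supersecretpassword -P -o status&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;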
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
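Each &#039;&#039;lustrefs&#039;&#039; resource describes one Lustre target by its block device and mount point. When &#039;&#039;rgmanager&#039;&#039; starts such a resource, the effect is roughly equivalent to mounting the target manually, e.g. for &#039;&#039;target1&#039;&#039; above:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mount -t lustre /path/to/ost1/device /mnt/ost1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;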
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
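Each service groups two of the Lustre targets defined above and starts them on the first node of its failover domain. Services can also be controlled manually with &#039;&#039;clusvcadm&#039;&#039;; for example, to relocate the service &#039;&#039;lustre_1&#039;&#039; to node &#039;&#039;lustre2&#039;&#039; for maintenance:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
clusvcadm -r lustre_1 -m lustre2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;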
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
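&#039;&#039;cman_tool&#039;&#039; can be used to inspect and manage cluster membership. For example, to show the quorum state and the list of known nodes:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cman_tool status&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;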
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
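After editing &#039;&#039;cluster.conf&#039;&#039; (and increasing its &#039;&#039;config_version&#039;&#039;), &#039;&#039;ccs_tool&#039;&#039; can propagate the new configuration to all running cluster nodes:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ccs_tool update /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;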
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12026</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12026"/>
		<updated>2010-12-13T14:16:59Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Tools to use with RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides a number of Lustre targets; if one node fails, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is really shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
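For example, a node with interface address 10.0.0.17 and netmask 255.255.255.0 yields a &#039;&#039;bindnetaddr&#039;&#039; of 10.0.0.0. On RHEL the network address can be computed with the &#039;&#039;ipcalc&#039;&#039; utility (the address here is only an illustration):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# ipcalc -n 10.0.0.17 255.255.255.0&lt;br /&gt;
NETWORK=10.0.0.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;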
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository, which can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading, copy this file to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format; the complete schema can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run it with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that a single vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== cman_tool ===&lt;br /&gt;
&lt;br /&gt;
=== ccs_tool ===&lt;br /&gt;
&lt;br /&gt;
=== system-config-cluster ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12025</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12025"/>
		<updated>2010-12-13T14:05:27Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides a number of Lustre targets; if one node fails, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is really shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository, which can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading, copy this file to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format; the complete schema can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run it with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that a single vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12024</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12024"/>
		<updated>2010-12-13T14:04:58Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Services */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by AND-ing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
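The address computation described above can be sketched in Python (an illustrative helper, not part of openais; the interface address and netmask below are example values):

```python
import ipaddress

def bindnetaddr(ip: str, netmask: str) -> str:
    # Clear the host bits of the interface address (bitwise AND
    # of address and mask), giving the network address.
    iface = ipaddress.IPv4Interface(f"{ip}/{netmask}")
    return str(iface.network.network_address)

# A node whose interface is 10.0.0.12 with netmask 255.255.255.0
# yields the bindnetaddr value used in the totem section above:
print(bindnetaddr("10.0.0.12", "255.255.255.0"))  # 10.0.0.0
```

Because the host bits are cleared, every node on the same network computes the same value, which is why the file can be shared unchanged.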
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. The file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make RedHat Cluster work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells &#039;&#039;cman&#039;&#039; that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
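As a minimal sketch of the version-bump rule above (the helper function is hypothetical, not a RedHat Cluster tool), the config_version attribute can be incremented programmatically before distributing the file to all nodes:

```python
import xml.etree.ElementTree as ET

def bump_config_version(conf_xml: str) -> str:
    # Parse cluster.conf, increment the config_version attribute
    # on the root cluster element, and return the updated XML text.
    root = ET.fromstring(conf_xml)
    version = int(root.get("config_version", "1"))
    root.set("config_version", str(version + 1))
    return ET.tostring(root, encoding="unicode")

# Build a minimal cluster.conf in memory for the demonstration.
cluster = ET.Element("cluster", config_version="1", name="Lustre")
ET.SubElement(cluster, "cman", expected_votes="1", two_node="1")
conf = ET.tostring(cluster, encoding="unicode")

print(bump_config_version(conf))
```

The printed document carries config_version 2; the same updated file must then be installed on both nodes.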
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;fence_daemon&#039;&#039; tag configures the fencing daemon: &#039;&#039;post_fail_delay&#039;&#039; is the number of seconds the daemon waits before fencing a node after it fails, and &#039;&#039;post_join_delay&#039;&#039; is the number of seconds it waits before fencing any node after a node joins the domain.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;fencedevices&#039;&#039; section defines the fence devices referenced by name in the node definitions above. Here the &#039;&#039;fence_ipmilan&#039;&#039; agent shuts down a failed node through its service processor, using the given IP address and credentials.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;target4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;second_first&amp;quot; name=&amp;quot;lustre_2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target2&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;service autostart=&amp;quot;1&amp;quot; exclusive=&amp;quot;0&amp;quot; recovery=&amp;quot;relocate&amp;quot; domain=&amp;quot;first_first&amp;quot; name=&amp;quot;lustre_1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target3&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs ref=&amp;quot;target4&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/service&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12023</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12023"/>
		<updated>2010-12-13T14:03:09Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by AND-ing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list:&lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. The file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make RedHat Cluster work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells &#039;&#039;cman&#039;&#039; that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name which must be its hostname or IP address. Additionally, a unique node ID needs to be specified.&lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device to use to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &amp;lt;resources&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;lun1&amp;quot; mountpoint=&amp;quot;/mnt/ost1&amp;quot; device=&amp;quot;/path/to/ost1/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;lun2&amp;quot; mountpoint=&amp;quot;/mnt/ost2&amp;quot; device=&amp;quot;/path/to/ost2/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;lun3&amp;quot; mountpoint=&amp;quot;/mnt/ost3&amp;quot; device=&amp;quot;/path/to/ost3/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;lustrefs name=&amp;quot;lun4&amp;quot; mountpoint=&amp;quot;/mnt/ost4&amp;quot; device=&amp;quot;/path/to/ost4/device&amp;quot; force_fsck=&amp;quot;0&amp;quot; force_unmount=&amp;quot;0&amp;quot; self_fence=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;/resources&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12022</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12022"/>
		<updated>2010-12-13T13:48:21Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down. In the examples shown in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
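&lt;br /&gt;
For example, the network part of an interface address can be verified with the &#039;&#039;ipcalc&#039;&#039; tool shipped with RHEL (the address and prefix below are only placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# ipcalc -n 10.0.0.12/8&lt;br /&gt;
NETWORK=10.0.0.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The resulting network address is the value to use for &#039;&#039;bindnetaddr&#039;&#039;.&lt;br /&gt;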
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
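&lt;br /&gt;
If &#039;&#039;secauth&#039;&#039; is enabled, the authentication key must be identical on all cluster nodes. Assuming the node names used later in this article (&#039;&#039;lustre1&#039;&#039; and &#039;&#039;lustre2&#039;&#039;), the key can be copied from the node it was generated on, for example:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
scp /etc/ais/authkey root@lustre2&amp;amp;#58;/etc/ais/authkey&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;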
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
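&lt;br /&gt;
For example (assuming the file was saved as &#039;&#039;lustrefs.sh&#039;&#039; in the current directory):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cp lustrefs.sh /usr/share/cluster/lustrefs.sh&lt;br /&gt;
chmod +x /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;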
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node needs to be specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device used to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;fence_daemon&#039;&#039; tag controls the fencing behaviour: with &#039;&#039;post_fail_delay&#039;&#039; set to 0, a failed node is fenced immediately, while &#039;&#039;post_join_delay&#039;&#039; makes the fence daemon wait 3 seconds before fencing after a node has joined the fence domain.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each &#039;&#039;fencedevice&#039;&#039; entry describes how to reach the service processor of one node: the &#039;&#039;fence_ipmilan&#039;&#039; agent uses IPMI to log in to the given IP address and power off the node. The &#039;&#039;name&#039;&#039; attribute is referenced by the &#039;&#039;device&#039;&#039; tags in the node definitions above.&lt;br /&gt;
&lt;br /&gt;
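Before relying on fencing, it is worth testing that the service processors can actually be reached. The exact options of &#039;&#039;fence_ipmilan&#039;&#039; may differ between versions; a typical status check looks like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# fence_ipmilan -a 10.0.1.2 -l root -p supersecretpassword -P -o status&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Once the cluster is running, &#039;&#039;fence_node lustre2&#039;&#039; can be used to verify that fencing works through the cluster configuration.&lt;br /&gt;
&lt;br /&gt;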
=== Resource Manager ===&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Both failover domains are &#039;&#039;ordered&#039;&#039; (the node with the lowest priority value is preferred) and &#039;&#039;restricted&#039;&#039; (services in the domain may only run on the listed nodes). With one domain preferring each node, the Lustre targets can be split between the two servers during normal operation.&lt;br /&gt;
&lt;br /&gt;
==== Resources ====&lt;br /&gt;
&lt;br /&gt;
==== Services ====&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
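To inspect and control the cluster from the command line, the &#039;&#039;cman&#039;&#039; and &#039;&#039;rgmanager&#039;&#039; packages ship a few tools. For example (the service name is only illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# clustat                              # show cluster, node and service status&lt;br /&gt;
# cman_tool status                     # show cluster membership and quorum information&lt;br /&gt;
# clusvcadm -r &amp;lt;service&amp;gt; -m lustre2    # relocate a service to node lustre2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;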
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12021</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12021"/>
		<updated>2010-12-13T13:47:05Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Failover Domains */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is a fairly old HA solution. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node needs to be specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device used to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Failover Domains ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;failoverdomains&amp;gt;&lt;br /&gt;
     &amp;lt;failoverdomain name=&amp;quot;second_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
      &amp;lt;failoverdomain name=&amp;quot;first_first&amp;quot; ordered=&amp;quot;1&amp;quot; restricted=&amp;quot;1&amp;quot;&amp;gt;         &lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre1&amp;quot; priority=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;failoverdomainnode name=&amp;quot;lustre2&amp;quot; priority=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/failoverdomain&amp;gt;&lt;br /&gt;
    &amp;lt;/failoverdomains&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12020</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12020"/>
		<updated>2010-12-13T13:44:05Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Fencing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is a fairly old HA solution. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each of the Lustre nodes provides a number of Lustre targets; in case of a failure, the surviving node takes over the Lustre targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure the failed node is shut down in case of a failure. In the examples shown in this article it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the IP address with the network mask of the interface (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To tell RedHat Cluster to work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node needs to be specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device used to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fencedevices&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre1-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.1&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
    &amp;lt;fencedevice name=&amp;quot;lustre2-sp&amp;quot; agent=&amp;quot;fence_ipmilan&amp;quot; auth=&amp;quot;password&amp;quot; ipaddr=&amp;quot;10.0.1.2&amp;quot; login=&amp;quot;root&amp;quot; passwd=&amp;quot;supersecretpassword&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12019</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12019"/>
		<updated>2010-12-13T13:08:17Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Nodes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or other RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; if one node fails, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by combining the interface&#039;s IP address with its network mask using a bitwise AND (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
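As a quick sanity check (a sketch, not part of the original setup; the address and mask are example values), the &#039;&#039;bindnetaddr&#039;&#039; value can be computed from an interface&#039;s IP address and netmask in a bash shell:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
IP=10.0.1.1; MASK=255.0.0.0&lt;br /&gt;
IFS=. read -r i1 i2 i3 i4 &amp;lt;&amp;lt;&amp;lt; &amp;quot;$IP&amp;quot;&lt;br /&gt;
IFS=. read -r m1 m2 m3 m4 &amp;lt;&amp;lt;&amp;lt; &amp;quot;$MASK&amp;quot;&lt;br /&gt;
# Bitwise-AND each octet of the IP with the mask to clear the host bits&lt;br /&gt;
echo &amp;quot;$((i1 &amp;amp; m1)).$((i2 &amp;amp; m2)).$((i3 &amp;amp; m3)).$((i4 &amp;amp; m4))&amp;quot;   # prints 10.0.0.0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;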
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
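If &#039;&#039;secauth&#039;&#039; is enabled in &#039;&#039;openais.conf&#039;&#039;, the generated key must be identical on all cluster nodes. Assuming the second node is reachable as &#039;&#039;lustre2&#039;&#039; (the hostname used in the examples below), it can be copied with:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
scp /etc/ais/authkey lustre2:/etc/ais/authkey&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;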
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
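For example (a sketch; the URL is the mailing-list attachment above), the download and permission steps can be done with:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget -O /usr/share/cluster/lustrefs.sh \&lt;br /&gt;
  http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin&lt;br /&gt;
chmod +x /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;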
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. If the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run it with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
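After any later change to &#039;&#039;cluster.conf&#039;&#039;, remember to increase &#039;&#039;config_version&#039;&#039;. On a running cluster the updated file can then be propagated to the other nodes with (a sketch; the path shown is the default location):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ccs_tool update /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;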
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Each cluster node is given a name, which must be its hostname or IP address. Additionally, a unique node ID needs to be specified. &lt;br /&gt;
The &#039;&#039;fence&#039;&#039; tag assigned to each node specifies a fence device used to shut down this cluster node. The fence devices are defined elsewhere in &#039;&#039;cluster.conf&#039;&#039; (see below for details).&lt;br /&gt;
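Before relying on fencing, it may be worth testing the IPMI fence agent manually; a hedged example, where the address and credentials are placeholders that must match the corresponding fence device definition:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Query the power status of the lustre1 service processor via IPMI lanplus&lt;br /&gt;
fence_ipmilan -P -a 10.0.1.1 -l root -p supersecretpassword -o status&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;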
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12018</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12018"/>
		<updated>2010-12-13T12:46:57Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Nodes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or other RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; if one node fails, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by combining the interface&#039;s IP address with its network mask using a bitwise AND (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. If the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run it with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
Next, the nodes which form the cluster need to be specified. Each cluster node is specified separately, wrapped in a surrounding &#039;&#039;clusternodes&#039;&#039; tag.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;clusternodes&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre1&amp;quot; nodeid=&amp;quot;1&amp;quot; votes=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre1-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
    &amp;lt;clusternode name=&amp;quot;lustre2&amp;quot; nodeid=&amp;quot;2&amp;quot; votes=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;fence&amp;gt;&lt;br /&gt;
        &amp;lt;method name=&amp;quot;single&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;device lanplus=&amp;quot;1&amp;quot; name=&amp;quot;lustre2-sp&amp;quot;/&amp;gt;&lt;br /&gt;
        &amp;lt;/method&amp;gt;&lt;br /&gt;
      &amp;lt;/fence&amp;gt;&lt;br /&gt;
    &amp;lt;/clusternode&amp;gt;&lt;br /&gt;
  &amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12017</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12017"/>
		<updated>2010-12-13T12:28:54Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or other RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; if one node fails, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by combining the interface&#039;s IP address with its network mask using a bitwise AND (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a cluster.conf file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the version is initialized as &#039;&#039;1&#039;&#039;. If the cluster configuration is updated, the config_version attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To run it with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells cman that there are only two nodes in the cluster and that one vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12016</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12016"/>
		<updated>2010-12-13T12:28:21Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or other RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly old. It is recommended to use a more modern HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each Lustre node provides a number of Lustre targets; if one node fails, the surviving node takes over the targets of the failed node and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to make sure a failed node is shut down in case of a failure. The examples in this article assume that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down using IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by combining the interface&#039;s IP address with its network mask using a bitwise AND (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
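The calculation above can be sketched with hypothetical addresses. Assuming a node IP of 10.0.0.12 on a 255.0.0.0 (/8) netmask, ANDing IP and mask zeroes the three host octets, which for a whole-octet mask amounts to simple truncation:&lt;br /&gt;

```shell
# Sketch with hypothetical values: node IP 10.0.0.12, netmask 255.0.0.0 (/8).
# For a whole-octet netmask, IP AND MASK simply zeroes the host octets.
IP=10.0.0.12
BINDNETADDR="$(echo "$IP" | cut -d. -f1).0.0.0"
echo "$BINDNETADDR"    # prints 10.0.0.0, the value used in the totem example
```

Every node on that network derives the same value, which is why the file can be copied to all nodes unchanged.&lt;br /&gt;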
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a suitable resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
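The download and installation steps above can be sketched as a short shell session (assuming root privileges; the URL is the mailing-list attachment linked above):&lt;br /&gt;

```shell
# Sketch: fetch the posted lustrefs.sh resource script and install it where
# rgmanager looks for resource scripts; then make it executable.
wget -O /usr/share/cluster/lustrefs.sh \
  http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin
chmod 755 /usr/share/cluster/lustrefs.sh
```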
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized to &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
RedHat Cluster is usually used with more than two nodes providing resources. To make RedHat Cluster work with only two nodes, the following &#039;&#039;cman&#039;&#039; attributes need to be set:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This tells &#039;&#039;cman&#039;&#039; that there are only two nodes in the cluster and that a single vote is enough to declare a node failed.&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12015</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12015"/>
		<updated>2010-12-13T11:59:10Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides some of the targets, and if one node fails, the surviving node takes over the failed node&#039;s targets and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is shut down. In the examples in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a suitable resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
The basic structure of a &#039;&#039;cluster.conf&#039;&#039; file may look like this: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this example, the name of the cluster is set to &#039;&#039;Lustre&#039;&#039; and the configuration version is initialized to &#039;&#039;1&#039;&#039;. Whenever the cluster configuration is updated, the &#039;&#039;config_version&#039;&#039; attribute must be increased on all nodes in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12014</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12014"/>
		<updated>2010-12-13T11:53:11Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides some of the targets, and if one node fails, the surviving node takes over the failed node&#039;s targets and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is shut down. In the examples in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a suitable resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot; ?&amp;gt;&lt;br /&gt;
&amp;lt;cluster config_version=&amp;quot;1&amp;quot; name=&amp;quot;Lustre&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;fence_daemon post_fail_delay=&amp;quot;0&amp;quot; post_join_delay=&amp;quot;3&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  &amp;lt;cman expected_votes=&amp;quot;1&amp;quot; two_node=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
== Tools to use with RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12013</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12013"/>
		<updated>2010-12-13T11:48:58Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Configure RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides some of the targets, and if one node fails, the surviving node takes over the failed node&#039;s targets and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is shut down. In the examples in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in &#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources such as network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a suitable resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
RedHat Cluster uses &#039;&#039;/etc/cluster/cluster.conf&#039;&#039; as its central configuration file. This file is in XML format. The complete schema of the XML file can be found at http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12012</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12012"/>
		<updated>2010-12-13T11:29:46Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Preliminary Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
It is assumed that two Lustre server nodes share a number of Lustre targets. Each node provides some of the targets, and if one node fails, the surviving node takes over the failed node&#039;s targets and makes them available to the Lustre clients.&lt;br /&gt;
&lt;br /&gt;
Furthermore, to make sure the Lustre targets are mounted on only one of the Lustre server nodes at a time, we implement STONITH fencing. This requires a way to ensure that a failed node is shut down. In the examples in this article, it is assumed that the Lustre server nodes are equipped with a service processor that allows a failed node to be shut down via IPMI.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface is to be used for cluster communication. The example above assumes one of the node&#039;s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the interface&#039;s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the &#039;&#039;Cluster&#039;&#039; sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script to the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12011</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12011"/>
		<updated>2010-12-10T15:40:53Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Services = */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface to use for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by bitwise-ANDing the IP address of the interface with its network mask (IP &amp;amp; MASK), which clears the host bits of the address. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the &#039;&#039;Cluster&#039;&#039; sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script to the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12010</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12010"/>
		<updated>2010-12-10T15:40:37Z</updated>

		<summary type="html">&lt;p&gt;Sven: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface to use for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by bitwise-ANDing the IP address of the interface with its network mask (IP &amp;amp; MASK), which clears the host bits of the address. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the &#039;&#039;Cluster&#039;&#039; sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script to the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
&lt;br /&gt;
=== Fencing ===&lt;br /&gt;
&lt;br /&gt;
=== Failover Domains ===&lt;br /&gt;
&lt;br /&gt;
=== Resources ===&lt;br /&gt;
&lt;br /&gt;
=== Services ===&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12009</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12009"/>
		<updated>2010-12-10T15:26:09Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing the Lustre Resource Skript */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release included in RHEL 5.5 is fairly dated. It is recommended to use a more recent HA solution such as Pacemaker, if possible.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the &#039;&#039;bindnetaddr&#039;&#039; option to determine which interface to use for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by bitwise-ANDing the IP address of the interface with its network mask (IP &amp;amp; MASK), which clears the host bits of the address. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the &#039;&#039;Cluster&#039;&#039; sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, there is no resource script for Lustre included. &lt;br /&gt;
&lt;br /&gt;
Fortunately, Giacomo Montagner posted a resource script to the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
All services managed by the Pacemaker cluster resource manager are called resources. Pacemaker uses resource agents to start, stop, and monitor these resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using the &#039;&#039;crm&#039;&#039; subshell, and all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that the cluster messaging layer started the cluster resource manager and it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your  cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the no-quorum policy at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations of a Filesystem resource time out after 20 seconds. Since some Lustre mounts can take 5 minutes or more, the default timeouts for these operations must be increased. Also, a monitor operation must be added to the resource so that Pacemaker can check whether the resource is still alive and react to any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&amp;quot;stopped&amp;quot;&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been set to a timeout of 300 sec. The resource is monitored at intervals of 120 seconds. The parameters &amp;quot;&#039;&#039;device&#039;&#039;&amp;quot;, &amp;quot;&#039;&#039;directory&#039;&#039;&amp;quot; and &amp;quot;lustre&amp;quot; are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails, or monitoring detects a failure on an OST, the cluster first tries to restart the resource on the same node. If that restart fails, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure-management policies (such as trying to restart a resource three times before migrating it to another node) are possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run it on the same node as resOST1. If the resulting total score for resOST2 on that node is negative, resOST2 cannot run there. &lt;br /&gt;
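The point arithmetic behind this decision can be sketched as follows; the +20 attraction score is an assumed example value (for instance from a ping rule), not something defined above:&lt;br /&gt;

```shell
# Score bookkeeping for resOST2 on a node already running resOST1
# (illustrative numbers only):
attraction=20      # assumed points resOST2 has on that node from other rules
colocation=-100    # penalty from the co-location constraint
total=$((attraction + colocation))
echo "$total"      # prints -80
[ "$total" -lt 0 ] && echo "resOST2 may not run on this node"
```

Because the total is negative, the cluster resource manager keeps resOST2 off that node.&lt;br /&gt;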
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring the resources themselves, the nodes of the cluster must also be monitored. One important aspect is whether a node is connected to the network. Each node pings one or more hosts and counts the answers it receives; the number of responses determines how “good” its connection to the network is.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
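The node score produced by the ping clone is simply the number of reachable hosts multiplied by the &#039;&#039;multiplier&#039;&#039; value; a quick sketch (the host count is an assumed example):&lt;br /&gt;

```shell
# With two hosts from host_list answering pings and multiplier="10",
# each node scores 2 * 10 = 20 points.
reachable_hosts=2   # assumed number of hosts answering pings
multiplier=10       # from the resPing definition above
pingd_score=$((reachable_hosts * multiplier))
echo "$pingd_score" # prints 20
```

A node that loses connectivity to one of the two hosts drops to 10 points, making it less attractive for resources constrained on the ping score.&lt;br /&gt;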
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== 	Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating the output each time a cluster event occurs. To generate one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of the command crm_mon -1f looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase or how to set up STONITH resources for your cluster and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To cause an artificial “split-brain” situation, you could use a host-based firewall to prohibit communication from other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
Once the other nodes can no longer see the node isolated by the firewall, the cluster should fence the isolated node by shutting it down or rebooting it.&lt;br /&gt;
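&lt;br /&gt;
As an illustration only, a STONITH resource for nodes with IPMI-based power control might look like the following sketch (the agent, address and credentials shown here are assumptions; substitute the values for your hardware):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resFenceNode2 stonith:external/ipmi \&lt;br /&gt;
  params hostname=&quot;node2&quot; ipaddr=&quot;10.0.1.2&quot; userid=&quot;admin&quot; passwd=&quot;secret&quot; \&lt;br /&gt;
  op monitor interval=&quot;60&quot;&lt;br /&gt;
# crm configure location locFenceNode2 resFenceNode2 -inf: node2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The location constraint keeps a node from running the resource that is responsible for fencing itself.&lt;br /&gt;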
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario demonstrating the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send out an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon, check that emails sent from the command line are delivered correctly. Then start the daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
The mail alerting service can itself be configured as a cluster resource, so that the cluster ensures it is always running, as shown below: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&quot;--snmp-traps nms.example.com&quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple check for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter /bin/sh -c &quot;crm_mon -1f | grep -q fail-count&quot;&lt;br /&gt;
&lt;br /&gt;
The exit code returned by the command can then be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; &#039;NET-SNMP-EXTEND-MIB::nsExtendResult.&quot;failcounter&quot;&#039;&lt;br /&gt;
&lt;br /&gt;
Because &#039;&#039;grep -q&#039;&#039; exits with status &#039;&#039;0&#039;&#039; when a match is found, a result of &#039;&#039;0&#039;&#039; indicates that a failure was detected.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12008</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12008"/>
		<updated>2010-12-10T15:25:26Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing the Lustre Resource Skript */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux (RHEL) version 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly dated. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
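&lt;br /&gt;
For example, with a node address of 10.0.0.5 and netmask 255.0.0.0:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
IP         10.0.0.5  = 00001010.00000000.00000000.00000101&lt;br /&gt;
MASK       255.0.0.0 = 11111111.00000000.00000000.00000000&lt;br /&gt;
IP &amp;amp; MASK  10.0.0.0  = 00001010.00000000.00000000.00000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;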
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
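&lt;br /&gt;
The key must be identical on all cluster nodes, so copy &#039;&#039;/etc/ais/authkey&#039;&#039; to every other node, for example:&lt;br /&gt;
&lt;br /&gt;
 # scp /etc/ais/authkey node2:/etc/ais/authkey&lt;br /&gt;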
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (&#039;&#039;/usr/share/cluster&#039;&#039;) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. Fortunately, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: &lt;br /&gt;
&lt;br /&gt;
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
&lt;br /&gt;
After downloading this file, copy it to &#039;&#039;/usr/share/cluster/lustrefs.sh&#039;&#039; and make sure the script is executable.&lt;br /&gt;
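&lt;br /&gt;
For example, using the URL above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# wget -O /usr/share/cluster/lustrefs.sh \&lt;br /&gt;
    http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin&lt;br /&gt;
# chmod +x /usr/share/cluster/lustrefs.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;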
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using the &#039;&#039;crm&#039;&#039; subshell, and all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;corosync&#039;&#039; started the cluster resource manager and it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your  cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the no-quorum policy at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some mounts in Lustre require up to 5 minutes or more, the default timeouts for these operations must be modified. Also, a monitor operation must be added to the resource so that Pacemaker can check if the resource is still alive and react in case of any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&quot;stopped&quot;&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been set to a timeout of 300 sec. The resource is monitored at intervals of 120 seconds. The parameters &#039;&#039;device&#039;&#039;, &#039;&#039;directory&#039;&#039; and &#039;&#039;fstype&#039;&#039; (here &quot;lustre&quot;) are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails, or monitoring detects a failure on an OST, the cluster first tries to restart the resource on the same node. If that restart fails, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure management (such as trying to restart a resource three times on a node before migrating it to another node) is possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
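&lt;br /&gt;
For instance, restarting a resource up to three times on the same node before migrating it can be expressed with the &#039;&#039;migration-threshold&#039;&#039; meta attribute; as a sketch for the resource defined above:&lt;br /&gt;
&lt;br /&gt;
 # crm resource meta resMyOST set migration-threshold 3&lt;br /&gt;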
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
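&lt;br /&gt;
As a sketch, assuming the multipath driver is started by the &#039;&#039;multipathd&#039;&#039; init script (adapt the names to your setup), such a dependency can be expressed with an additional primitive and an order constraint:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resMultipath lsb:multipathd op monitor interval=&quot;60&quot;&lt;br /&gt;
# crm configure order ordMultipathBeforeOST inf: resMultipath resMyOST&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;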
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total number of points assigned to resOST2 is negative, it will not be able to run on that node. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring of the resource itself, the nodes of the cluster must also be monitored. An important parameter to monitor is whether the node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” its connection is to the network.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&quot;host1 ...&quot; multiplier=&quot;10&quot; dampen=&quot;5s&quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every host that can be reached, the node&#039;s &#039;&#039;pingd&#039;&#039; score increases by 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating the output each time a cluster event occurs. To generate one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of the command crm_mon -1f looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase and how to set up STONITH resources for your cluster, and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To cause an artificial “split-brain” situation, you could use a host-based firewall to prohibit communication from other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
When the other nodes are not able to see the node isolated by the firewall, the isolated node should be shut down or rebooted.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario demonstrating the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send out an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon, check that emails sent from the command line are delivered correctly. Then start the daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;ClusterMon&#039;&#039; resource can be configured so that the cluster itself keeps the mail alerting daemon running, as shown below: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMK-MIB.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple check for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter crm_mon -1f | grep -q fail-count&lt;br /&gt;
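As a sketch of what the grep-based check matches (the sample line below stands in for live &#039;&#039;crm_mon -1f&#039;&#039; output; the resource name is illustrative):&lt;br /&gt;

```shell
# Sketch: extract fail-count values from crm_mon -1f style output.
# 'sample' stands in for live 'crm_mon -1f' output on a node;
# the resource name resMyOST is only an example.
sample='   resMyOST: migration-threshold=1000000 fail-count=1'
echo "$sample" | grep -o 'fail-count=[0-9]*' | cut -d= -f2
```

Any output at all means that at least one resource has a failcounter set on that node.&lt;br /&gt;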
&lt;br /&gt;
The code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtendResult.\"failcounter\"&lt;br /&gt;
&lt;br /&gt;
A result of &#039;&#039;0&#039;&#039; indicates a failure.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12007</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12007"/>
		<updated>2010-12-10T15:24:47Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing the Lustre Resource Skript */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly dated. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
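The octet arithmetic can be sketched as follows (a hedged example assuming an interface address of 10.0.0.12 with the common octet-aligned netmask 255.255.255.0; both values are placeholders for your own network):&lt;br /&gt;

```shell
# Sketch: clear the host bits of an interface address to obtain the
# openais bindnetaddr. Works for octet-aligned netmasks (octets 255 or 0);
# the address and mask below are example values only.
ip=10.0.0.12
mask=255.255.255.0
old_ifs=$IFS; IFS=.
set -- $ip;   i1=$1 i2=$2 i3=$3 i4=$4
set -- $mask; m1=$1 m2=$2 m3=$3 m4=$4
IFS=$old_ifs
echo "$(( i1 * (m1 / 255) )).$(( i2 * (m2 / 255) )).$(( i3 * (m3 / 255) )).$(( i4 * (m4 / 255) ))"
```

With these example values the result is 10.0.0.0, matching the &#039;&#039;bindnetaddr&#039;&#039; in the totem section above.&lt;br /&gt;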
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039; Enter:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in /usr/share/cluster) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included. Luckily, Giacomo Montagner posted a resource script on the lustre-discuss mailing list: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090623/7799de37/attachment-0001.bin &lt;br /&gt;
After downloading this file, copy it to /usr/share/cluster/lustrefs.sh and make sure the script is executable.&lt;br /&gt;
&lt;br /&gt;
== Configuring RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using the crm subshell; all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;corosync&#039;&#039; started the cluster resource manager and it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the no-quorum-policy option at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some Lustre mounts can take 5 minutes or more, the default timeouts for these operations must be increased. Also, a monitor operation must be added to the resource so that Pacemaker can check if the resource is still alive and react in case of any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&amp;quot;stopped&amp;quot;&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been given a timeout of 300 sec. The resource is monitored at intervals of 120 seconds. The parameters &#039;&#039;device&#039;&#039;, &#039;&#039;directory&#039;&#039; and &#039;&#039;fstype&#039;&#039; are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails or the monitoring of an OST detects a failure, the cluster first tries to restart the resource on the failed node. If the node fails to restart it, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated ways of failure management (such as trying to restart a resource three times before migrating it to another node) are possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total number of points assigned to resOST2 is negative, it will not be able to run on that node. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
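The arithmetic in this two-node example can be sketched as follows (the scores are illustrative; the real allocator computes them from the cluster configuration):&lt;br /&gt;

```shell
# Sketch: score bookkeeping for resOST2 on node1 after node2 fails.
# The base score and penalty are illustrative values, not taken from a
# live cluster.
base_score=0            # no location preference defined for resOST2
colocation_penalty=-100 # from the colocation constraint with resOST1
total=$(( base_score + colocation_penalty ))
if [ "$total" -lt 0 ]; then
  echo "resOST2 blocked on node1 (score $total)"
fi
```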
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to the resources themselves, the nodes of the cluster must also be monitored. An important parameter to monitor is whether a node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” its connection is to the network.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
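The score arithmetic can be sketched as follows (the host count and multiplier are example values):&lt;br /&gt;

```shell
# Sketch: a node's pingd score is the number of reachable ping hosts
# times the multiplier. Two reachable hosts with multiplier=10 give 20,
# as in the 'pingd=20' values shown by crm_mon -1f in this document.
reachable_hosts=2
multiplier=10
echo $(( reachable_hosts * multiplier ))
```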
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== 	Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating the output each time a cluster event occurs. To generate one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of the command &#039;&#039;crm_mon -1f&#039;&#039; looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:filesys): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby, for example, to perform maintenance on it. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, use the &#039;&#039;unmigrate&#039;&#039; command shown above.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase and how to set up STONITH resources for your cluster, and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To cause an artificial “split-brain” situation, you could use a host-based firewall to prohibit communication from other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
When the other nodes are not able to see the node isolated by the firewall, the isolated node should be shut down or rebooted.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario, which demonstrates the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send out an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
After verifying that emails sent from the command line are delivered correctly, start the &#039;&#039;crm_mon&#039;&#039; daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
The mail alerting service can itself be configured as a cluster resource, so that the cluster ensures it is always running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple script that checks for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter crm_mon -1f | grep -q fail-count&lt;br /&gt;
&lt;br /&gt;
The code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtendResult.\"failcounter\"&lt;br /&gt;
&lt;br /&gt;
A result of &#039;&#039;0&#039;&#039; indicates a failure, since &#039;&#039;grep -q&#039;&#039; exits with &#039;&#039;0&#039;&#039; when a failcounter is found.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12006</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12006"/>
		<updated>2010-12-10T15:06:00Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing the Lustre Resource Skript */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax or methods to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster shipped with RHEL 5.5 is fairly old. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by ANDing the interface’s IP address with its network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Because of this, the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
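That calculation can be sketched for the common case of whole-octet netmasks (the helper function name is illustrative, not part of &#039;&#039;openais&#039;&#039;):&lt;br /&gt;

```shell
# Illustrative sketch: derive the bindnetaddr value (IP AND mask) for
# whole-octet netmasks such as 255.0.0.0 or 255.255.255.0.
ip_to_net() {
    old_ifs=$IFS
    IFS='. '
    set -- $1 $2          # split both dotted quads into 8 octets
    IFS=$old_ifs
    # For octet masks of 255 or 0, mask/255 is 1 or 0, so multiplying
    # keeps or clears the octet (equivalent to a bitwise AND here).
    echo "$(( $1 * ($5 / 255) )).$(( $2 * ($6 / 255) )).$(( $3 * ($7 / 255) )).$(( $4 * ($8 / 255) ))"
}

ip_to_net 10.0.0.15 255.0.0.0          # prints 10.0.0.0
ip_to_net 192.168.1.42 255.255.255.0   # prints 192.168.1.0
```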
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository; it can be found on the RHEL DVD in the Cluster sub-directory and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;rgmanager&#039;&#039; package includes a number of resource scripts (in /usr/share/cluster) which are used to integrate resources like network interfaces or file systems with &#039;&#039;rgmanager&#039;&#039;. Unfortunately, no resource script for Lustre is included.&lt;br /&gt;
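A minimal skeleton for such a script might look like the sketch below. The device and mount point are placeholders, and a real &#039;&#039;rgmanager&#039;&#039; agent must also implement further actions (such as metadata output) per its conventions; this is only an illustration of the start/stop/status shape.&lt;br /&gt;

```shell
#!/bin/sh
# Hypothetical sketch of a Lustre resource script; not a tested agent.
# DEVICE and MOUNTPOINT defaults are placeholders for illustration.
DEVICE=${OCF_RESKEY_device:-/dev/sdb}
MOUNTPOINT=${OCF_RESKEY_mountpoint:-/mnt/lustre}

lustre_rs() {
    case $1 in
        start)  mount -t lustre "$DEVICE" "$MOUNTPOINT" ;;
        stop)   umount "$MOUNTPOINT" ;;
        # status: succeed only if the mount point appears in /proc/mounts
        status) grep -q " $MOUNTPOINT lustre " /proc/mounts ;;
        *)      echo "usage: lustre_rs start|stop|status"; return 2 ;;
    esac
}
```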
&lt;br /&gt;
== Configure RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using the &#039;&#039;crm&#039;&#039; subshell, and all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that the &#039;&#039;openais&#039;&#039; stack started the cluster resource manager and that it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your  cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the &#039;&#039;no-quorum-policy&#039;&#039; option at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some Lustre mounts can take 5 minutes or more, the default timeouts for these operations must be increased. Also, a monitor operation must be added to the resource so that Pacemaker can check whether the resource is still alive and react to any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&quot;stopped&quot;&#039;&#039;) because the constraints specifying where the resource is to run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been given a timeout of 300 sec, and the resource is monitored at intervals of 120 seconds. The parameters &#039;&#039;device&#039;&#039;, &#039;&#039;directory&#039;&#039; and &#039;&#039;fstype&#039;&#039; are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
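For example, a second OST resource might be defined as follows (the device and directory values are placeholders for your setup):&lt;br /&gt;

```
primitive resOST2 ocf:heartbeat:Filesystem \
        meta target-role="stopped" \
        operations $id="resOST2-operations" \
        op monitor interval="120" timeout="60" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params device="/dev/sdc" directory="/mnt/ost2" fstype="lustre"
```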
&lt;br /&gt;
If a server fails or the monitoring of an OST detects a failure, the cluster first tries to restart the resource on the failed node. If the node fails to restart it, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure management (such as trying to restart a resource three times before migrating it to another node) is possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total number of points assigned to resOST2 is negative, it will not be able to run on that node. &lt;br /&gt;
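The bookkeeping behind this rule is plain addition, as the hypothetical helper below illustrates (it is not a Pacemaker interface): a node's total score for a resource is the sum of the applicable constraint scores, and a negative total forbids placement.&lt;br /&gt;

```shell
# Illustrative score arithmetic: sum a base score and a constraint
# score, and forbid the node when the total is negative.
placement() {
    total=$(( $1 + $2 ))
    if [ $total -lt 0 ]; then echo forbidden; else echo allowed; fi
}

placement 0 -100     # prints: forbidden
placement 200 -100   # prints: allowed
```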
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring of the resource itself, the nodes of the cluster must also be monitored. An important parameter to monitor is whether the node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” its connection is to the network.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
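The resulting attribute value is simply the number of reachable hosts times the multiplier, as this hypothetical helper shows:&lt;br /&gt;

```shell
# Sketch of the pingd attribute arithmetic: reachable hosts * multiplier.
pingd_score() {
    echo $(( $1 * $2 ))
}

pingd_score 2 10   # two reachable hosts, multiplier 10: prints 20
```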
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command by updating the output each time a cluster event occurs. To generate a one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of the command &#039;&#039;crm_mon -1f&#039;&#039; looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase or how to set up STONITH resources for your cluster and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To cause an artificial “split-brain” situation, you could use a host-based firewall to prohibit communication from other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
When the other nodes are not able to see the node isolated by the firewall, the isolated node should be shut down or rebooted.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario, which demonstrates the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send out an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
After verifying that emails sent from the command line are delivered correctly, start the &#039;&#039;crm_mon&#039;&#039; daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
The mail alerting service can itself be configured as a cluster resource, so that the cluster ensures it is always running: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple script that checks for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter crm_mon -1f | grep -q fail-count&lt;br /&gt;
&lt;br /&gt;
The code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtend.\&amp;quot;failcounter\&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A return value of &#039;&#039;0&#039;&#039; (the exit code of &#039;&#039;grep -q&#039;&#039;) indicates that a failcounter is present, that is, a failure has occurred.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12005</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12005"/>
		<updated>2010-12-10T15:00:04Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and the methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated by AND-ing the IP address with the network mask of the interface (IP &amp;amp; MASK), which clears the host bits of the address. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
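&lt;br /&gt;
For example, for a node with the (hypothetical) interface address 192.168.10.17 and netmask 255.255.255.0, the value is calculated as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
   192.168.10.17   (IP address of the interface)&lt;br /&gt;
&amp;amp; 255.255.255.0   (netmask)&lt;br /&gt;
 = 192.168.10.0    (bindnetaddr)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;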
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository, which is located in the Cluster sub-directory of the RHEL DVD and may need to be added to the yum configuration manually. With yum configured correctly, RedHat Cluster can be installed using:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install cman rgmanager&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
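&lt;br /&gt;
To make the Cluster sub-directory of the RHEL DVD available to yum, a repository file can be created in &#039;&#039;/etc/yum.repos.d/&#039;&#039;. The sketch below assumes the DVD is available at a hypothetical path:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /etc/yum.repos.d/rhel-cluster.repo&lt;br /&gt;
[rhel-cluster]&lt;br /&gt;
name=RHEL 5.5 Cluster&lt;br /&gt;
baseurl=file:///path/to/RHEL-DVD/Cluster&lt;br /&gt;
enabled=1&lt;br /&gt;
gpgcheck=0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;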
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuring RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using a crm subshell, so all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or the XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;openais&#039;&#039; started the cluster resource manager and that it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your  cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the &#039;&#039;no-quorum-policy&#039;&#039; option at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some mounts in Lustre require up to 5 minutes or more, the default timeouts for these operations must be modified. Also, a monitor operation must be added to the resource so that Pacemaker can check if the resource is still alive and react in case of any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&amp;quot;stopped&amp;quot;&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been set to a timeout of 300 sec. The resource is monitored at intervals of 120 seconds. The parameters &amp;quot;&#039;&#039;device&#039;&#039;&amp;quot;, &amp;quot;&#039;&#039;directory&#039;&#039;&amp;quot; and &amp;quot;lustre&amp;quot; are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails, or the monitoring of an OST detects a failure, the cluster first tries to restart the resource on the same node. If the restart fails, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure management (such as trying to restart a resource three times on a node before migrating it to another node) is possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
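&lt;br /&gt;
As a sketch of such a policy (the values are examples only), a restart limit and a failure timeout can be set as resource defaults in the crm shell:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# allow three failures on a node before migrating the resource away;&lt;br /&gt;
# forget recorded failures after 10 minutes&lt;br /&gt;
crm configure rsc_defaults migration-threshold=3 failure-timeout=600&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;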
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
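&lt;br /&gt;
Assuming, for example, a multipath driver started through an LSB init script (the resource name and the script name below are hypothetical), such a dependency can be expressed with an order and a colocation constraint:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMultipath lsb:multipathd \&lt;br /&gt;
	op monitor interval=&amp;quot;60&amp;quot; timeout=&amp;quot;30&amp;quot;&lt;br /&gt;
# start the multipath driver before mounting the OST, and keep them together&lt;br /&gt;
order ordMyOST inf: resMultipath resMyOST&lt;br /&gt;
colocation colMyOST inf: resMyOST resMultipath&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;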
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total number of points assigned to resOST2 is negative, it will not be able to run on that node. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring of the resource itself, the nodes of the cluster must also be monitored. An important parameter to monitor is whether the node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” its connection is to the network.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
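&lt;br /&gt;
To go a step further and ban resources completely from nodes without connectivity (a sketch; adapt the threshold to your &#039;&#039;multiplier&#039;&#039; setting), a rule with a negative infinite score can be added:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# never run resMyOST on a node that reaches no ping hosts&lt;br /&gt;
location locMyOST-conn resMyOST \&lt;br /&gt;
	rule -inf: not_defined pingd or pingd lte 0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;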
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== 	Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating its output each time a cluster event occurs. To generate a one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of the command &#039;&#039;crm_mon -1f&#039;&#039; looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase or how to set up STONITH resources for your cluster and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
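&lt;br /&gt;
Once manual testing succeeds, the fencing device can be configured as a STONITH resource. The sketch below assumes an IPMI-capable management board; the plugin name and all parameter values are examples that must be adapted to your hardware:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resFence-node2 stonith:external/ipmi \&lt;br /&gt;
	params hostname=&amp;quot;node2&amp;quot; ipaddr=&amp;quot;10.0.0.102&amp;quot; userid=&amp;quot;admin&amp;quot; passwd=&amp;quot;secret&amp;quot; interface=&amp;quot;lan&amp;quot; \&lt;br /&gt;
	op monitor interval=&amp;quot;3600&amp;quot; timeout=&amp;quot;60&amp;quot;&lt;br /&gt;
# a node must never fence itself&lt;br /&gt;
location locFence-node2 resFence-node2 -inf: node2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;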
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To create an artificial “split-brain” situation, you can use a host-based firewall to block communication from the other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
When the other nodes are not able to see the node isolated by the firewall, the isolated node should be shut down or rebooted.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario demonstrating the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send an email each time the status of the cluster changes. This approach requires a fully working mail environment and a working &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon as a cluster resource, check that the emails it sends are delivered correctly by starting it from the command line:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
Alternatively, &#039;&#039;crm_mon&#039;&#039; can be configured as a cluster resource, so that the cluster itself ensures the mail alerting service is running, as shown below: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
The daemon can also be configured as a cluster resource, as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple check for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter crm_mon -1f | grep -q fail-count&lt;br /&gt;
&lt;br /&gt;
The code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtend.\&amp;quot;failcounter\&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A return value of &#039;&#039;0&#039;&#039; (the exit code of &#039;&#039;grep -q&#039;&#039;) indicates that a failcounter is present, that is, a failure has occurred.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12004</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12004"/>
		<updated>2010-12-10T14:55:58Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Setting Up the openais Communication Stack */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and the methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&lt;br /&gt;
&amp;lt;/pre&amp;gt; or &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
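The network address can be computed octet by octet. The following sketch (example address 10.0.1.42 with netmask 255.255.255.0; substitute the values of your interface) shows the calculation:&lt;br /&gt;

```shell
# bindnetaddr is the interface IP ANDed with its netmask, octet by octet (bash)
ip=10.0.1.42
mask=255.255.255.0
IFS=. read -r i1 i2 i3 i4 <<< "$ip"
IFS=. read -r m1 m2 m3 m4 <<< "$mask"
bindnetaddr="$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
echo "$bindnetaddr"   # 10.0.1.0
```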
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory.&lt;br /&gt;
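&lt;br /&gt;
If &#039;&#039;yum&#039;&#039; has access to both repositories, the two packages can be installed in one step (a sketch; repository setup varies by site):&lt;br /&gt;
&lt;br /&gt;
 yum install cman rgmanager&lt;br /&gt;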
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuring RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using a crm subshell; all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or the XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;corosync&#039;&#039; started the cluster resource manager and it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the &#039;&#039;no-quorum-policy&#039;&#039; option at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some Lustre mounts can take 5 minutes or more, the default timeouts for these operations must be increased. Also, a monitor operation must be added to the resource so that Pacemaker can check whether the resource is still alive and react to any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&amp;quot;stopped&amp;quot;&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations each have a timeout of 300 sec. The resource is monitored at intervals of 120 sec. The parameters &#039;&#039;device&#039;&#039;, &#039;&#039;directory&#039;&#039; and &#039;&#039;fstype&#039;&#039; are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails or monitoring of an OST detects a failure, the cluster first tries to restart the resource on the failed node. If the node fails to restart it, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated ways of failure management (such as trying to restart a resource three times before migrating it to another node) are possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
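&lt;br /&gt;
For example, the number of restart attempts before a resource is migrated can be set cluster-wide via the &#039;&#039;migration-threshold&#039;&#039; option (a sketch; it can also be set per resource as a meta attribute):&lt;br /&gt;
&lt;br /&gt;
 # crm configure rsc_defaults migration-threshold=3&lt;br /&gt;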
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total number of points assigned to resOST2 is negative, it will not be able to run on that node. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring of the resource itself, the nodes of the cluster must also be monitored. An important parameter to monitor is whether the node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” its connection is to the network.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== 	Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command by updating the output each time a cluster event occurs. To generate one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option. The output of &#039;&#039;crm_mon -1f&#039;&#039; looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, use the &#039;&#039;unmigrate&#039;&#039; command shown above.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
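&lt;br /&gt;
For example, to reset the failcounter of &#039;&#039;resMyOST&#039;&#039; on node &#039;&#039;node1&#039;&#039; (names as in the earlier examples):&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount resMyOST delete node1&lt;br /&gt;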
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase or how to set up STONITH resources for your cluster and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To cause an artificial “split-brain” situation, you could use a host-based firewall to block communication from the other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
Once the other nodes can no longer see the node isolated by the firewall, the fencing mechanism should shut it down or reboot it.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario, which demonstrates the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon, check that emails sent from the command line are delivered correctly. Then start the daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
Alternatively, &#039;&#039;crm_mon&#039;&#039; can itself be configured as a cluster resource, so that the cluster ensures the mail alerting service is running, as shown below: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple check for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
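&lt;br /&gt;
A slightly more elaborate variant counts the failcounter entries instead of only detecting them. The helper below is a sketch; it reads the &#039;&#039;crm_mon&#039;&#039; output from stdin, so it can be exercised with captured output:&lt;br /&gt;

```shell
# Count fail-count entries in crm_mon output supplied on stdin
count_failures() {
  grep -c 'fail-count'
}

# Example with a line captured from the crm_mon -1f output shown earlier:
failures=$(printf 'resMyOST: migration-threshold=1000000 fail-count=1\n' | count_failures)
echo "$failures"   # 1
```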
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter /bin/sh -c &amp;quot;crm_mon -1f | grep -q fail-count&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The exit code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtend.\&amp;quot;failcounter\&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A result of &#039;&#039;0&#039;&#039; indicates a failure.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12003</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12003"/>
		<updated>2010-12-10T14:51:33Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Installing RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux (RHEL) version 5.5. For other versions or RHEL-based distributions, the syntax and methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster release shipped with RHEL 5.5 is fairly old. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &#039;&#039;rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&#039;&#039; or &#039;&#039;yum install openais&#039;&#039; if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
The minimum installation of RedHat Cluster consists of the Cluster Manager package &#039;&#039;cman&#039;&#039; and the Resource Group Manager package &#039;&#039;rgmanager&#039;&#039;. The &#039;&#039;cman&#039;&#039; package can be found in the RHEL repository. The &#039;&#039;rgmanager&#039;&#039; package is part of the Cluster repository. It can be found on the RHEL DVD in the Cluster sub-directory.&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuring RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is by using a crm subshell; all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or the XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;corosync&#039;&#039; started the cluster resource manager and it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the &#039;&#039;no-quorum-policy&#039;&#039; option at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some mounts in Lustre require up to 5 minutes or more, the default timeouts for these operations must be modified. Also, a monitor operation must be added to the resource so that Pacemaker can check if the resource is still alive and react in case of any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=”stopped”&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been given a timeout of 300 sec, and the resource is monitored at intervals of 120 sec. The values of the parameters &amp;quot;&#039;&#039;device&#039;&#039;&amp;quot;, &amp;quot;&#039;&#039;directory&#039;&#039;&amp;quot; and &amp;quot;&#039;&#039;fstype&#039;&#039;&amp;quot; (here &amp;quot;lustre&amp;quot;) are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
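Internally, the Filesystem resource agent performs an ordinary mount when the resource is started; the definition above corresponds roughly to the following command (a sketch using the same placeholder values):&lt;br /&gt;

```
# mount -t lustre device directory
```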
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails, or monitoring of an OST detects a failure, the cluster first tries to restart the resource on the node where it failed. If that restart fails, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure-management policies (such as trying to restart a resource three times before migrating it to another node) are possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
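If, for instance, the OST sits on a software RAID device, the RAID can be defined as its own resource and ordered before the mount. The sketch below uses the &#039;&#039;ocf:heartbeat:Raid1&#039;&#039; agent; the resource names and parameter values are illustrative and must match your setup:&lt;br /&gt;

```
primitive resMyRaid ocf:heartbeat:Raid1 \
	params raidconf="/etc/mdadm.conf" raiddev="/dev/md0"
colocation colRaidOST inf: resMyOST resMyRaid
order ordRaidOST inf: resMyRaid resMyOST
```

The colocation keeps the mount on the node running the RAID, and the order ensures the RAID is started before the file system is mounted.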
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
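The arithmetic behind this can be illustrated with a small Python sketch (an illustration only, not Pacemaker code; node and resource names follow the examples on this page):&lt;br /&gt;

```python
# Illustrative model of Pacemaker's placement scoring (not Pacemaker code):
# every candidate node starts at 0 for a resource, constraints add or
# subtract points, and nodes with a negative total are excluded.

def best_node(scores):
    """Return the allowed node with the highest score, or None if none."""
    allowed = {node: s for node, s in scores.items() if s >= 0}
    return max(allowed, key=allowed.get) if allowed else None

# resOST1 already runs on node1, and a -100 colocation constraint against
# resOST1 penalizes resOST2 there:
print(best_node({"node1": 0 - 100, "node2": 0}))  # -> node2

# With node2 gone, node1 is the only candidate but scores below zero,
# so resOST2 cannot be placed at all:
print(best_node({"node1": -100}))                 # -> None
```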
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total number of points assigned to resOST2 is negative, it will not be able to run on that node. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring of the resource itself, the nodes of the cluster must also be monitored. An important parameter to monitor is whether the node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” its connection is to the network.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
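The rule can also be made stricter, so that a resource is banned outright from nodes with no connectivity; a sketch (the rule id is illustrative):&lt;br /&gt;

```
# crm configure location locMyOST-conn resMyOST \
    rule $id="locMyOST-conn" -inf: not_defined pingd or pingd lte 0
```

With a score of -inf, a node whose pingd attribute is missing or zero can never run the resource.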
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating the output each time a cluster event occurs. To generate one-time output, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of the command crm_mon -1f looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
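Such output can also be checked programmatically; the following Python sketch (illustrative, not part of Pacemaker) extracts the failcounters from the &#039;&#039;Migration summary&#039;&#039; section:&lt;br /&gt;

```python
import re

# Extract per-resource fail-counts from "crm_mon -1f" output.
# The sample text mirrors the Migration summary shown above.

def parse_failcounts(text):
    """Return {(node, resource): fail_count} parsed from crm_mon -1f output."""
    counts = {}
    node = None
    for line in text.splitlines():
        m = re.match(r"\* Node (\S+):", line.strip())
        if m:
            node = m.group(1)       # remember which node's summary follows
            continue
        m = re.match(r"(\S+): .*fail-count=(\d+)", line.strip())
        if m and node:
            counts[(node, m.group(1))] = int(m.group(2))
    return counts

sample = """\
Migration summary:
* Node node1:  pingd=20
   resMyOST: migration-threshold=1000000 fail-count=1
* Node node2:  pingd=20
"""
print(parse_failcounts(sample))  # -> {('node1', 'resMyOST'): 1}
```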
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase or how to set up STONITH resources for your cluster and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To provoke an artificial “split-brain” situation, you can use a host-based firewall to block communication from the other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
Once the other nodes can no longer see the node isolated by the firewall, the fencing mechanism should shut down or reboot the isolated node.&lt;br /&gt;
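When the test is finished, remove the firewall rule again so the node can rejoin the cluster (same placeholders as in the command above):&lt;br /&gt;

```
# iptables -D INPUT -i <heartbeat-IF> -p udp --dport 5405 -s <other node> -j DROP
```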
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. The following scenario demonstrates the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon, check that emails sent from the command line are delivered correctly. Then start the daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
The mail alerting service can itself be configured as a cluster resource, so that the cluster monitors it and ensures it keeps running, as shown below: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can  also be configured as a cluster resource as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB for the traps is defined in the &#039;&#039;PCMK-MIB.txt&#039;&#039; file shipped with Pacemaker.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple check for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter crm_mon -1f | grep -q fail-count&lt;br /&gt;
&lt;br /&gt;
The code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtendResult.\&amp;quot;failcounter\&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A result of &#039;&#039;0&#039;&#039; indicates a failure.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12002</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12002"/>
		<updated>2010-12-10T13:56:55Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Setting Up Resource Management */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or RHEL-based distributions, the syntax and the methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, the RedHat Cluster version shipped with RHEL 5.5 is fairly dated. If possible, it is recommended to use a more recent HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &#039;&#039;rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&#039;&#039;, or with &#039;&#039;yum install openais&#039;&#039; if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for its configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated from the IP address AND the network mask for the interface (IP &amp;amp; MASK) so the final bits of the address are cleared. Thus the configuration file is independent of any node and can be copied to all nodes.&lt;br /&gt;
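The AND operation can be illustrated with a short Python sketch (the address and netmask are example values):&lt;br /&gt;

```python
import ipaddress

# Derive an openais bindnetaddr: AND the interface address with its netmask,
# clearing the host bits, so the same value works on every node.
def bindnetaddr(ip, netmask):
    net = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    return str(net.network_address)

print(bindnetaddr("10.0.0.17", "255.255.255.0"))  # -> 10.0.0.0
```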
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS authentication key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Installing RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Installing the Lustre Resource Script ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuring RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; The simplest way to configure the cluster is with the crm subshell, so all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or the XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;corosync&#039;&#039; started the cluster resource manager and it is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your  cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the quorum policy at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the start, stop, and monitor operations in a Filesystem resource time out after 20 sec. Since some mounts in Lustre require up to 5 minutes or more, the default timeouts for these operations must be modified. Also, a monitor operation must be added to the resource so that Pacemaker can check if the resource is still alive and react in case of any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=”stopped”&#039;&#039;) because the constraints specifying where the resource is to be run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been given a timeout of 300 sec, and the resource is monitored at intervals of 120 sec. The values of the parameters &amp;quot;&#039;&#039;device&#039;&#039;&amp;quot;, &amp;quot;&#039;&#039;directory&#039;&#039;&amp;quot; and &amp;quot;&#039;&#039;fstype&#039;&#039;&amp;quot; (here &amp;quot;lustre&amp;quot;) are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails, or monitoring of an OST detects a failure, the cluster first tries to restart the resource on the node where it failed. If that restart fails, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure-management policies (such as trying to restart a resource three times before migrating it to another node) are possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total score assigned to resOST2 on that node is negative, resOST2 will not be able to run there. &lt;br /&gt;
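A simplified model of this additive scoring can be sketched as plain arithmetic. This is a toy illustration only, not a crm command; the base score of 0 for resOST2 on node1 is an assumption (no other constraint contributes points):

```shell
# Toy model of Pacemaker's additive scoring: sum all constraint scores
# for the resource on a node; a negative total forbids placement there.
base=0                   # assumed: no other location preference
colocation_penalty=-100  # from the colocation constraint above
score=$((base + colocation_penalty))
if [ "$score" -lt 0 ]; then
  verdict="resOST2 cannot run on node1 (score $score)"
else
  verdict="resOST2 may run on node1 (score $score)"
fi
echo "$verdict"
```

If another constraint contributed, say, +200 points on that node, the total would be positive and placement would be allowed.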
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring the resources themselves, the nodes of the cluster must also be monitored. An important parameter to monitor is whether a node is connected to the network. Each node pings one or more hosts and counts the answers it receives; the number of responses determines how “good” its connection to the network is.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
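The effect of the &#039;&#039;multiplier&#039;&#039; parameter can be shown with a toy calculation. The reachability numbers below are assumptions for illustration only (node1 reaches two of the listed hosts, node2 only one):

```shell
# Toy model: the pingd attribute on each node is roughly
# (number of reachable hosts from host_list) * multiplier.
multiplier=10
node1_pingd=$((2 * multiplier))  # node1 reaches 2 hosts
node2_pingd=$((1 * multiplier))  # node2 reaches 1 host
if [ "$node1_pingd" -gt "$node2_pingd" ]; then
  best=node1
else
  best=node2
fi
echo "resMyOST prefers $best (pingd: node1=$node1_pingd node2=$node2_pingd)"
```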
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating its output each time a cluster event occurs. To generate one-time output instead, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option. The output of the command &#039;&#039;crm_mon -1f&#039;&#039; looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:Filesystem): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
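For example, to reset the failcounter of the resource resMyOST on node1 (the example names used elsewhere on this page), after the root cause of the failure has been fixed:

```shell
# Reset the failcounter of resMyOST on node1 once the problem is solved
crm resource failcount resMyOST delete node1
```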
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion about fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. This article provides information useful for deciding which devices to purchase or how to set up STONITH resources for your cluster and also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To create an artificial “split-brain” situation, you can use a host-based firewall to block communication from the other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
Once the other nodes can no longer see the node isolated by the firewall, the fencing mechanism should shut down or reboot the isolated node.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario, which demonstrates the importance of monitoring: &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send an email each time the status of the cluster changes. This approach requires a fully working mail environment and a working &#039;&#039;mail&#039;&#039; command. &lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon, check that emails sent from the command line are delivered correctly. Then start the daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
Alternatively, &#039;&#039;crm_mon&#039;&#039; itself can be configured as a cluster resource, so that the cluster ensures the mail alerting service is always running, as shown below: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If a node fails, which could prevent the email from being sent, the resource is started on another node and an email about the successful start of the resource is sent out from the new node. The administrator&#039;s task is to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource, as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple check for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
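The exit-code behavior of this check can be demonstrated offline. Since &#039;&#039;crm_mon&#039;&#039; cannot run outside a cluster node, its output is mocked here with a sample &#039;&#039;Migration summary&#039;&#039; line; on a real node the pipeline is simply &#039;&#039;crm_mon -1f | grep -q fail-count&#039;&#039;:

```shell
# Offline sketch: mock crm_mon -1f output and test the grep exit code.
sample_output='* Node node1:  pingd=20
   resMyOST: migration-threshold=1000000 fail-count=1'
printf '%s\n' "$sample_output" | grep -q fail-count
rc=$?
# rc=0 means "fail-count" was found, i.e. some resource has failed
echo "grep exit code: $rc"
```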
This check can be run by the NMS via SSH, or by the SNMP agent on the nodes by adding the following line to the Net-SNMP configuration in &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter crm_mon -1f | grep -q fail-count&lt;br /&gt;
&lt;br /&gt;
The exit code returned by the check can be queried by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 snmpget &amp;lt;node&amp;gt; nsExtend.\&amp;quot;failcounter\&amp;quot;&lt;br /&gt;
&lt;br /&gt;
A result of &#039;&#039;0&#039;&#039; indicates that a failcounter is present, i.e. a failure has occurred.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12001</id>
		<title>Using Red Hat Cluster Manager with Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Using_Red_Hat_Cluster_Manager_with_Lustre&amp;diff=12001"/>
		<updated>2010-12-10T13:48:06Z</updated>

		<summary type="html">&lt;p&gt;Sven: /* Setting Up RedHat Cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Dec 2010)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;small&amp;gt;&#039;&#039;This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
----&lt;br /&gt;
__TOC__&lt;br /&gt;
This page describes how to configure and use Red Hat Cluster Manager with Lustre failover. Sven Trautmann has contributed this content. &lt;br /&gt;
&lt;br /&gt;
For more about Lustre failover, see [[Configuring Lustre for Failover]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Preliminary Notes ==&lt;br /&gt;
&lt;br /&gt;
This document is based on RedHat Cluster version 2.0, which is part of RedHat Enterprise Linux 5.5. For other versions or other RHEL-based distributions, the syntax or methods used to set up and run RedHat Cluster may differ.&lt;br /&gt;
&lt;br /&gt;
Compared with other HA solutions, RedHat Cluster as shipped in RHEL 5.5 is fairly old. If possible, it is recommended to use a more modern HA solution such as Pacemaker.&lt;br /&gt;
&lt;br /&gt;
== Setting Up RedHat Cluster ==&lt;br /&gt;
&lt;br /&gt;
==== Setting Up the &#039;&#039;openais&#039;&#039; Communication Stack ====&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;openais&#039;&#039; package is distributed with RHEL and can be installed using &#039;&#039;rpm -i /path/to/RHEL-DVD/Server/openais-0.80.6-16.el5.x86_64.rpm&#039;&#039; or &#039;&#039;yum install openais&#039;&#039; if yum is configured to access the RHEL repository.&lt;br /&gt;
&lt;br /&gt;
Once installed, the software looks for a configuration in the file &#039;&#039;/etc/ais/openais.conf&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Complete the following steps to set up the &#039;&#039;openais&#039;&#039; communication stack:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Edit the totem section of the &#039;&#039;openais.conf&#039;&#039; configuration file to designate the IP address and netmask of the interface(s) to be used.&#039;&#039;&#039; The totem section of the configuration file describes the way &#039;&#039;openais&#039;&#039; communicates between nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
totem {&lt;br /&gt;
	version&amp;amp;#58; 2&lt;br /&gt;
	secauth&amp;amp;#58; off&lt;br /&gt;
	threads&amp;amp;#58; 0&lt;br /&gt;
	interface {&lt;br /&gt;
		ringnumber&amp;amp;#58; 0&lt;br /&gt;
		bindnetaddr&amp;amp;#58; 10.0.0.0&lt;br /&gt;
		mcastaddr&amp;amp;#58;		226.94.1.1&lt;br /&gt;
		mcastport&amp;amp;#58;		5405&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Openais&#039;&#039; uses the option &#039;&#039;bindnetaddr&#039;&#039; to determine which interface is to be used for cluster communication. The example above assumes one of the node’s interfaces is configured on the network 10.0.0.0. The value of the option is calculated as the bitwise AND of the interface’s IP address and network mask (IP &amp;amp; MASK), so the host bits of the address are cleared. Thus the configuration file is independent of any particular node and can be copied to all nodes.&lt;br /&gt;
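The (IP &amp; MASK) calculation can be sketched octet by octet. The address and netmask below are assumed example values for a node on the 10.0.0.0 network:

```shell
# Compute bindnetaddr = (IP AND netmask), one octet at a time.
ip="10.0.0.12"       # assumed example node address
mask="255.0.0.0"     # assumed example netmask
oldIFS=$IFS
IFS=.
set -- $ip;   i1=$1 i2=$2 i3=$3 i4=$4
set -- $mask; m1=$1 m2=$2 m3=$3 m4=$4
IFS=$oldIFS
bindnetaddr="$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
echo "bindnetaddr: $bindnetaddr"
```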
&lt;br /&gt;
&#039;&#039;&#039;2. Create an AIS key.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# /usr/sbin/ais-keygen&lt;br /&gt;
OpenAIS Authentication key generator.&lt;br /&gt;
Gathering 1024 bits for key from /dev/random.&lt;br /&gt;
Writing openais key to /etc/ais/authkey.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Setting Up Resource Management ==&lt;br /&gt;
&lt;br /&gt;
All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The simplest way to configure the cluster is by using the crm subshell, so all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.&lt;br /&gt;
&lt;br /&gt;
==== Completing a Basic Setup of the Cluster ====&lt;br /&gt;
&lt;br /&gt;
To test that your cluster manager is running and set global options, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Display the cluster status.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 # crm_mon -1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output should look similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack&amp;amp;#58; openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version&amp;amp;#58; 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
0 Resources configured. &lt;br /&gt;
============ &lt;br /&gt;
&lt;br /&gt;
Online&amp;amp;#58; [ node1 node2 ] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This output indicates that &#039;&#039;openais&#039;&#039; has started the cluster resource manager, which is ready to manage resources.&lt;br /&gt;
&lt;br /&gt;
Several global options must be set in the cluster. The two described in the next two steps are especially important to consider. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. If your cluster consists of just two nodes, switch the quorum feature off.&#039;&#039;&#039; On the command line, enter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;amp;#35; crm configure property no-quorum-policy=ignore &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If your Lustre setup comprises more than two nodes, you can leave the &#039;&#039;no-quorum-policy&#039;&#039; option at its default.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. In a Lustre setup, fencing is normally used and is enabled by default. If you have a good reason not to use it, disable it by entering:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;amp;#35; crm configure property stonith-enabled=false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Resources ====&lt;br /&gt;
&lt;br /&gt;
OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.&lt;br /&gt;
&lt;br /&gt;
By default, the &#039;&#039;start&#039;&#039;, &#039;&#039;stop&#039;&#039;, and &#039;&#039;monitor&#039;&#039; operations of a Filesystem resource time out after 20 seconds. Since mounting a Lustre target can take five minutes or more, these default timeouts must be increased. Also, a &#039;&#039;monitor&#039;&#039; operation must be added to the resource so that Pacemaker can check whether the resource is still alive and react to any problems. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Create a definition of the Filesystem resource and save it in a file such as &#039;&#039;MyOST.res&#039;&#039;.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you have multiple OSTs, you will need to define additional resources.&lt;br /&gt;
&lt;br /&gt;
The example below shows a complete definition of the Filesystem resource. You will need to change the &#039;&#039;device&#039;&#039; and &#039;&#039;directory&#039;&#039; to correspond to your setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMyOST ocf&amp;amp;#58;heartbeat&amp;amp;#58;Filesystem \ &lt;br /&gt;
	meta target-role=&amp;quot;stopped&amp;quot; \ &lt;br /&gt;
	operations $id=&amp;quot;resMyOST-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;120&amp;quot; timeout=&amp;quot;60&amp;quot; \ &lt;br /&gt;
	op start interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	op stop interval=&amp;quot;0&amp;quot; timeout=&amp;quot;300&amp;quot; \ &lt;br /&gt;
	params device=&amp;quot;device&amp;quot; directory=&amp;quot;directory&amp;quot; fstype=&amp;quot;lustre&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the resource is initially stopped (&#039;&#039;target-role=&amp;quot;stopped&amp;quot;&#039;&#039;) because the constraints specifying where the resource is to run have not yet been defined. &lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;start&#039;&#039; and &#039;&#039;stop&#039;&#039; operations have each been set to a timeout of 300 seconds, and the resource is monitored at intervals of 120 seconds. The parameters &amp;quot;&#039;&#039;device&#039;&#039;&amp;quot; and &amp;quot;&#039;&#039;directory&#039;&#039;&amp;quot;, together with the file system type &amp;quot;&#039;&#039;lustre&#039;&#039;&amp;quot;, are passed to the &#039;&#039;mount&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Read the definition into your cluster configuration&#039;&#039;&#039; by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm configure &amp;lt; MyOST.res&lt;br /&gt;
&lt;br /&gt;
You can define as many OST resources as you want. &lt;br /&gt;
&lt;br /&gt;
If a server fails, or if monitoring of an OST detects a failure, the cluster first tries to restart the resource on the same node. If the restart fails, the resource is migrated to another node.&lt;br /&gt;
&lt;br /&gt;
More sophisticated failure-management policies (such as trying to restart a resource three times on the same node before migrating it to another node) are possible using the cluster resource manager. See the Pacemaker documentation for details.&lt;br /&gt;
&lt;br /&gt;
If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.&lt;br /&gt;
&lt;br /&gt;
==== Configuring Constraints ====&lt;br /&gt;
In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.&lt;br /&gt;
&lt;br /&gt;
Constraints on resources are established by Pacemaker through a point system. Resources accumulate or lose points according to the constraints you define. If a resource has negative points with respect to a certain node, it cannot run on that node.&lt;br /&gt;
&lt;br /&gt;
For example, to constrain the co-location of two resources, complete the steps below.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Add co-location constraints between resources.&#039;&#039;&#039; Enter commands similar to the following:&lt;br /&gt;
&lt;br /&gt;
 # crm configure colocation colresOST1resOST2 -100: resOST1 resOST2&lt;br /&gt;
&lt;br /&gt;
This constraint assigns -100 points to resOST2 if an attempt is made to run resOST2 on the same node as resOST1. If the resulting total score assigned to resOST2 on that node is negative, resOST2 will not be able to run there. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After defining all necessary constraints, start the resources.&#039;&#039;&#039; Enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource start resMyOST&lt;br /&gt;
&lt;br /&gt;
Execute this command for each OST (Filesystem resource) in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Use care when setting up your point system. You can use the point system if your cluster has at least three nodes or if the resource can acquire points from other constraints. However, in a system with only two nodes and no way to acquire points, the constraint in the example above will result in an inability to migrate a resource from a failed node. &lt;br /&gt;
&lt;br /&gt;
For example, if resOST1 is running on &#039;&#039;node1&#039;&#039; and resOST2 on &#039;&#039;node2&#039;&#039; and &#039;&#039;node2&#039;&#039; fails, an attempt will be made to run resOST2 on &#039;&#039;node1&#039;&#039;. However, the constraint will assign resOST2 -100 points since resOST1 is already running on &#039;&#039;node1&#039;&#039;. Consequently resOST2 will be unable to run on &#039;&#039;node1&#039;&#039; and, since it is a two-node system, no other node is available.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
==== Internal Monitoring of the System ====&lt;br /&gt;
&lt;br /&gt;
In addition to monitoring the resources themselves, the nodes of the cluster must also be monitored. An important parameter to monitor is whether a node is connected to the network. Each node pings one or more hosts and counts the answers it receives; the number of responses determines how “good” its connection to the network is.&lt;br /&gt;
&lt;br /&gt;
Pacemaker provides a simple way to configure this task.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Define a ping resource.&#039;&#039;&#039; In the command below, the &#039;&#039;host_list&#039;&#039; contains a list of hosts that the nodes should ping.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# crm configure primitive resPing ocf:pacemaker:pingd \&lt;br /&gt;
  params host_list=&amp;quot;host1 ...&amp;quot; multiplier=&amp;quot;10&amp;quot; dampen=&amp;quot;5s&amp;quot;&lt;br /&gt;
# crm configure clone clonePing resPing&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For every accessible host detected, any resource on that node gets 10 points (set by the &#039;&#039;multiplier=&#039;&#039; parameter). The clone configuration makes the ping resource run on every available node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Set up constraints to run a resource on the node with the best connectivity.&#039;&#039;&#039; The score from the &#039;&#039;ping&#039;&#039; resource can be used in other constraints to allow a resource to run only on those nodes that have a sufficient ping score. For example, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm configure location locMyOST resMyOST rule $id=&amp;quot;locMyOST&amp;quot; pingd: defined pingd&lt;br /&gt;
&lt;br /&gt;
This location constraint adds the &#039;&#039;ping&#039;&#039; score to the total score assigned to a resource for a particular node. The resource will tend to run on the node with the best connectivity.&lt;br /&gt;
&lt;br /&gt;
Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see:&lt;br /&gt;
[http://www.clusterlabs.org/wiki/SystemHealth www.clusterlabs.org/wiki/SystemHealth]&lt;br /&gt;
&lt;br /&gt;
==== Administering the Cluster ====&lt;br /&gt;
&lt;br /&gt;
Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter.&lt;br /&gt;
This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Displaying a Status Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The command &#039;&#039;crm_mon&#039;&#039; displays an overview of the status of the cluster. It functions similarly to the Linux &#039;&#039;top&#039;&#039; command, updating its output each time a cluster event occurs. To generate one-time output instead, add the option &#039;&#039;-1&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
To include a display of all failcounters for all resources on the nodes, add the &#039;&#039;-f&#039;&#039; option to the command. The output of &#039;&#039;crm_mon -1f&#039;&#039; looks similar to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
============ &lt;br /&gt;
Last updated: Fri Dec 25 17:31:54 2009 &lt;br /&gt;
Stack: openais &lt;br /&gt;
Current DC: node1 - partition with quorum &lt;br /&gt;
Version: 1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe &lt;br /&gt;
2 Nodes configured, 2 expected votes &lt;br /&gt;
2 Resources configured.&lt;br /&gt;
============&lt;br /&gt;
&lt;br /&gt;
Online: [ node1 node2 ]&lt;br /&gt;
&lt;br /&gt;
Clone Set: clonePing&lt;br /&gt;
     Started: [ node1 node2 ]&lt;br /&gt;
resMyOST       (ocf::heartbeat:filesys): Started node1&lt;br /&gt;
&lt;br /&gt;
Migration summary:&lt;br /&gt;
* Node node1:  pingd=20&lt;br /&gt;
   resMyOST: migration-threshold=1000000 fail-count=1&lt;br /&gt;
* Node node2:  pingd=20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
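The plain-text status output is easy to process in scripts. As an illustrative sketch (not part of the original procedure), the node a resource is currently started on can be extracted from a status line like the one above:&lt;br /&gt;

```shell
# Extract the node a resource runs on from a crm_mon status line.
# The sample line mirrors the example output above; in practice the
# input would come from "crm_mon -1".
line='resMyOST       (ocf::heartbeat:filesys): Started node1'
node=$(echo "$line" | awk '/Started/ {print $NF}')
echo "$node"
```

Here &#039;&#039;awk&#039;&#039; simply prints the last field of any line containing &#039;&#039;Started&#039;&#039;; for the sample line this yields &#039;&#039;node1&#039;&#039;.&lt;br /&gt;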
&lt;br /&gt;
&#039;&#039;&#039;Switching a Node to Standby&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
You can switch a node to standby to, for example, perform maintenance on the node. In standby, the node is still a full member of the cluster but cannot run any resources. All resources that were running on that node are forced away. &lt;br /&gt;
&lt;br /&gt;
To switch the node called &#039;&#039;node01&#039;&#039; to standby, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node standby node01&lt;br /&gt;
&lt;br /&gt;
To switch the node online again enter:&lt;br /&gt;
&lt;br /&gt;
 # crm node online node01&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Migrating a Resource to Another Node&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cluster resource manager can migrate a resource from one node to another while the resource is running. To migrate a resource away from the node it is running on, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST&lt;br /&gt;
&lt;br /&gt;
This command adds a location constraint to the configuration that specifies that the resource &#039;&#039;resMyOST&#039;&#039; can no longer run on the original node. &lt;br /&gt;
&lt;br /&gt;
To delete this constraint, enter:&lt;br /&gt;
&lt;br /&gt;
 # crm resource unmigrate resMyOST&lt;br /&gt;
&lt;br /&gt;
A target node can be specified in the migration command as follows:&lt;br /&gt;
&lt;br /&gt;
 # crm resource migrate resMyOST node02&lt;br /&gt;
&lt;br /&gt;
This command causes the resource &#039;&#039;resMyOST&#039;&#039; to move to node &#039;&#039;node02&#039;&#039;, while adding a location constraint to the configuration. To remove the location constraint, enter the &#039;&#039;unmigrate&#039;&#039; command again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Resetting the failcounter&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If Pacemaker monitors a resource and finds that it isn’t running, by default it restarts the resource on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. &lt;br /&gt;
&lt;br /&gt;
It is the administrator’s task to find out the cause of the error and to reset the failcounter of the resource. This can be achieved by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm resource failcount &amp;lt;resource&amp;gt; delete &amp;lt;node&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This command deletes (resets) the failcounter for the resource on the specified node.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;“Cleaning up” a Resource&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:&lt;br /&gt;
&lt;br /&gt;
 # crm resource cleanup resMyOST&lt;br /&gt;
&lt;br /&gt;
This command removes information about the resource called &#039;&#039;resMyOST&#039;&#039; on all nodes.&lt;br /&gt;
&lt;br /&gt;
== Setting up Fencing ==&lt;br /&gt;
&lt;br /&gt;
Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)&lt;br /&gt;
&lt;br /&gt;
Pacemaker uses the STONITH (Shoot The Other Node In The Head) approach to fencing malfunctioning nodes, in which a malfunctioning node is simply switched off. A good discussion of fencing can be found [http://www.clusterlabs.org/doc/crm_fencing.html here]. That article helps in deciding which fencing devices to purchase and how to set up STONITH resources for your cluster, and it also provides a detailed setup procedure.&lt;br /&gt;
&lt;br /&gt;
A basic setup includes the following steps:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Test your fencing system manually before configuring the corresponding resources in the cluster.&#039;&#039;&#039; Manual testing is done by calling the STONITH command directly from each node. If this works in all tests, it will work in the cluster.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. After configuring the corresponding resources, check that the system works as expected.&#039;&#039;&#039; To create an artificial “split-brain” situation, you could use a host-based firewall to block communication from the other nodes on the heartbeat interface(s) by entering:&lt;br /&gt;
&lt;br /&gt;
 # iptables -I INPUT -i &amp;lt;heartbeat-IF&amp;gt; -p udp --dport 5405 -s &amp;lt;other node&amp;gt; -j DROP&lt;br /&gt;
&lt;br /&gt;
When the other nodes can no longer see the node isolated by the firewall, they should fence it, so that the isolated node is shut down or rebooted. Afterwards, remove the firewall rule again by repeating the command with &#039;&#039;-D INPUT&#039;&#039; instead of &#039;&#039;-I INPUT&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Setting Up Monitoring ==&lt;br /&gt;
&lt;br /&gt;
Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario, which demonstrates the importance of monitoring:&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;A node fails and all resources migrate to its backup node. Since the failover was smooth, nobody notices the problem. After some time, the second node fails and service stops. This is a serious problem since neither of the nodes is now able to provide service. The administrator must recover data from backups and possibly even install it on new hardware. A significant delay may result for users.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Pacemaker offers several options for making information available to a monitoring system. These include:&lt;br /&gt;
*Utilizing the &#039;&#039;crm_mon&#039;&#039; program to send out information about changes in cluster status. &lt;br /&gt;
*Using scripts to check resource failcounters.&lt;br /&gt;
&lt;br /&gt;
These options are described in the following sections.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send Email Messages ====&lt;br /&gt;
In the simplest setup, the &#039;&#039;crm_mon&#039;&#039; program can be used to send out an email each time the status of the cluster changes. This approach requires a fully working mail environment and the &#039;&#039;mail&#039;&#039; command.&lt;br /&gt;
&lt;br /&gt;
Before configuring the &#039;&#039;crm_mon&#039;&#039; daemon, check that emails sent from the command line (for example, with the &#039;&#039;mail&#039;&#039; command) are delivered correctly. Then start the daemon by entering:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --mail-to &amp;lt;user@example.com&amp;gt; [--mail-host mail.example.com]&lt;br /&gt;
&lt;br /&gt;
The mail alerting service can itself be configured as a cluster resource, so that the cluster ensures it is always running, as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--mail-to &amp;lt;your@mail.address&amp;gt;&amp;quot; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the node running this resource fails (which could prevent an email from being sent), the resource is started on another node, and an email reporting its successful start is sent from the new node. The administrator&#039;s task is then to search for the cause of the failover.&lt;br /&gt;
&lt;br /&gt;
==== Using &#039;&#039;crm_mon&#039;&#039; to Send SNMP Traps ====&lt;br /&gt;
The &#039;&#039;crm_mon&#039;&#039; daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon --daemonize --snmp-traps nms.example.com&lt;br /&gt;
&lt;br /&gt;
This daemon can also be configured as a cluster resource, as shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
primitive resMON ocf&amp;amp;#58;pacemaker&amp;amp;#58;ClusterMon \ &lt;br /&gt;
	operations $id=&amp;quot;resMON-operations&amp;quot; \ &lt;br /&gt;
	op monitor interval=&amp;quot;180&amp;quot; timeout=&amp;quot;20&amp;quot; \ &lt;br /&gt;
	params extra_options=&amp;quot;--snmp-traps nms.example.com&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The MIB of the traps is defined in the &#039;&#039;PCMKR.txt&#039;&#039; file.&lt;br /&gt;
&lt;br /&gt;
==== Polling the Failcounters ====&lt;br /&gt;
If all the nodes of a cluster have problems, pushing out information about events may not be sufficient. An alternative is to check the failcounters of all resources periodically from the network management station (NMS). A simple script that checks for the presence of any failcounters in the output of &#039;&#039;crm_mon -1f&#039;&#039; is shown below:&lt;br /&gt;
&lt;br /&gt;
 # crm_mon -1f | grep fail-count&lt;br /&gt;
&lt;br /&gt;
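For further processing on the NMS, resource names and failcounter values can be pulled out of the migration summary. The following sketch uses sample input mirroring the &#039;&#039;crm_mon&#039;&#039; output shown earlier on this page:&lt;br /&gt;

```shell
# Print a "resource failcount" pair for every resource that has a
# failcounter set. In practice the input would come from "crm_mon -1f";
# here a sample migration summary is embedded for illustration.
summary='* Node node1:  pingd=20
   resMyOST: migration-threshold=1000000 fail-count=1
* Node node2:  pingd=20'

echo "$summary" | awk '/fail-count=/ {
    gsub(":", "", $1)      # strip the trailing colon from the name
    split($NF, kv, "=")    # fail-count=N -> kv[2] = N
    print $1, kv[2]
}'
```

For the sample input, this prints &#039;&#039;resMyOST 1&#039;&#039;.&lt;br /&gt;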
This script can be called by the NMS via SSH, or by the SNMP agent on the nodes. Because the Net-SNMP &#039;&#039;extend&#039;&#039; directive does not invoke a shell, the pipe must be wrapped in a shell call when adding the following line to &#039;&#039;snmpd.conf&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 extend failcounter /bin/sh -c &amp;quot;crm_mon -1f | grep -q fail-count&amp;quot;&lt;br /&gt;
&lt;br /&gt;
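The exit-code convention this relies on can be verified locally: &#039;&#039;grep -q&#039;&#039; exits with &#039;&#039;0&#039;&#039; when a failcounter line is present in its input and with &#039;&#039;1&#039;&#039; when it is not. A small sketch:&lt;br /&gt;

```shell
# grep -q exits 0 (success) when "fail-count" occurs in its input, so a
# reported result of 0 means a failure has been recorded on the node.
echo 'resMyOST: migration-threshold=1000000 fail-count=1' | grep -q fail-count
echo $?   # prints 0: a failcounter is set

echo 'resMyOST: migration-threshold=1000000' | grep -q fail-count
echo $?   # prints 1: no failcounter present
```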
The exit code returned by the script can be checked by the NMS using:&lt;br /&gt;
&lt;br /&gt;
 # snmpget &amp;lt;node&amp;gt; &#039;NET-SNMP-EXTEND-MIB::nsExtendResult.&amp;quot;failcounter&amp;quot;&#039;&lt;br /&gt;
&lt;br /&gt;
A result of &#039;&#039;0&#039;&#039; (the exit code of a successful &#039;&#039;grep&#039;&#039; match) indicates that at least one failcounter is set, that is, that a failure has occurred.&lt;/div&gt;</summary>
		<author><name>Sven</name></author>
	</entry>
</feed>