Guidelines for Setting Up a Cluster

The following tips make debugging easier when working on clusters.


 * Set up shared home directories. A shared namespace is useful for bringing up Lustre builds, collecting logs, using a blat command-line utility to email configuration files, etc. The most commonly shared namespace is /home.
 * Use pdsh. Using pdsh is an absolute requirement, and it is even better if you can pdsh to all nodes from any node.
 * Use a regular node naming scheme. A node naming scheme consisting of a short prefix combined with regularly incremented decimal node numbers (e.g., n0001, n0002) works well with an automated tool like pdsh. Also, machines tend to be used for different roles in a cluster over time, so hostnames based on roles in the Lustre file system (mds, ost, etc.) are not always practical. However, documenting how hostnames map to Lustre functions is useful.
 * Use serial consoles. As in any data center, serial consoles are essential. They enable output to be logged for later retrieval in case a problem occurs. They can be provided with a useful front end like conman or conserver. You'll want to use a front end that can send breaks to the kernel's sysrq facility over the serial console.
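
The regular naming scheme and pdsh tips above can be illustrated with a minimal Python sketch. It generates hostnames from a prefix and a number range, then compresses them into the bracketed host range form that pdsh accepts with its -w option. The function names node_names and to_hostlist are hypothetical helpers introduced only for this example, and the sketch assumes every hostname is a non-numeric prefix followed by zero-padded digits.

```python
import re
from itertools import groupby


def node_names(prefix, first, last, width=4):
    """Return hostnames such as n0001..n0004 for an inclusive number range."""
    return [f"{prefix}{i:0{width}d}" for i in range(first, last + 1)]


def to_hostlist(names):
    """Compress hostnames into a bracketed range such as n[0001-0004].

    Assumption: all names share one non-numeric prefix and one digit width.
    """
    parsed = [re.fullmatch(r"(\D+)(\d+)", name) for name in names]
    if not names or any(m is None for m in parsed):
        raise ValueError("names must look like <prefix><digits>")
    prefixes = {m.group(1) for m in parsed}
    widths = {len(m.group(2)) for m in parsed}
    if len(prefixes) != 1 or len(widths) != 1:
        raise ValueError("names must share one prefix and one digit width")
    prefix, width = prefixes.pop(), widths.pop()

    numbers = sorted({int(m.group(2)) for m in parsed})
    ranges = []
    # Consecutive numbers share the same (value - position) key.
    for _, group in groupby(enumerate(numbers), key=lambda p: p[1] - p[0]):
        run = [n for _, n in group]
        lo, hi = run[0], run[-1]
        if lo == hi:
            ranges.append(f"{lo:0{width}d}")
        else:
            ranges.append(f"{lo:0{width}d}-{hi:0{width}d}")
    return f"{prefix}[{','.join(ranges)}]"
```

The resulting string can be passed to pdsh on the command line, for example pdsh -w 'n[0001-0004]' uptime. Role-based names such as mds1 or ost3 would need one host range per role, which is part of why a single regular scheme is easier to automate.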


 * In 2.6 kernels, reliable network-based consoles allow sending (nearly) all kernel messages to a remote system, even oops messages. In 2.6.5, netconsole is provided. In 2.6.9 and later, netdump supersedes netconsole. The netdump code also supports kernel crash dumps over the network to another host, which can be invaluable for debugging node-crashing problems.


 * Collect syslogs in one place. It's convenient to be able to watch a single log for errors reported to syslog from across the cluster.
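
As a rough illustration of why a single collected log is convenient, the following Python sketch counts suspicious messages per host in a merged syslog stream. It assumes the traditional syslog line layout (month, day, time, hostname, message). The function name errors_by_host and the default search pattern are hypothetical choices for this example, not part of any syslog tool.

```python
import re
from collections import Counter

# Traditional syslog layout: "Mar  3 12:00:02 n0002 kernel: message text"
SYSLOG_LINE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)


def errors_by_host(lines, pattern=r"error|oops|panic"):
    """Count lines per hostname whose message matches the pattern.

    Lines that do not follow the assumed syslog layout are skipped.
    """
    matcher = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for line in lines:
        m = SYSLOG_LINE.match(line)
        if m and matcher.search(m.group(5)):
            counts[m.group(4)] += 1
    return counts
```

Run against the central log file (for example, by passing an open file object as lines), this shows at a glance which nodes are reporting problems. It works best when hostnames follow a regular scheme.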


 * Set up remote power management. If a machine wedges, it must be possible to reboot it without physically flipping a switch. Various vendors offer serial-controlled power widgets. Power widgets that work with powerman are the most useful. Remote power management is a requirement for doing automated failover (STONITH).


 * Automate disaster recovery. Although it is needed infrequently, the ability to reimage a node via netbooting and network software installs is convenient.


 * Boot quickly. To be able to boot quickly, do the following:
    * Disable non-essential services from starting at boot-time.
    * Minimize hardware checks made by the BIOS.
    * Avoid utilities like Red Hat's Kudzu that ask for user input before proceeding.