WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Guidelines for Setting Up a Cluster: Difference between revisions

From Obsolete Lustre Wiki

Revision as of 07:17, 10 May 2007

Some tips we've collected while working on clusters that can lead to a more useful debugging experience.

  1. Shared home directories
  Having a shared namespace comes in handy all the time. It's useful for bringing up Lustre builds, collecting logs, blatting configuration files, etc. Sharing /home is the least surprising choice.
  2. PDSH
  pdsh is an absolute requirement. Bonus points for being able to pdsh to all nodes from any node.
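
  As one illustration of why pdsh helps with debugging: pdsh prefixes each output line with the name of the host it came from, so output can be grouped to spot the odd node out. The sketch below is not part of pdsh; the function names and the sample kernel versions are made up for the example.

```python
from collections import defaultdict


def group_pdsh_output(text):
    """Group 'host: line' output by host, keeping line order per host."""
    grouped = defaultdict(list)
    for raw in text.splitlines():
        host, sep, rest = raw.partition(": ")
        # Skip anything that does not look like a host-prefixed line.
        if sep and host and " " not in host:
            grouped[host].append(rest)
    return dict(grouped)


def hosts_by_output(grouped):
    """Invert the mapping: identical output -> sorted list of hosts."""
    inverted = defaultdict(list)
    for host in sorted(grouped):
        inverted["\n".join(grouped[host])].append(host)
    return dict(inverted)


# Made-up sample, shaped like the output of `pdsh -w node[01-03] uname -r`.
SAMPLE = (
    "node01: 2.6.9-42\n"
    "node03: 2.6.18-8\n"
    "node02: 2.6.9-42\n"
)

if __name__ == "__main__":
    for output, hosts in hosts_by_output(group_pdsh_output(SAMPLE)).items():
        print(",".join(hosts), "->", output)
```

  Grouping by identical output makes a node running a different kernel or Lustre build stand out immediately.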
  3. Regular naming
  A node naming scheme that combines a short prefix with regularly incrementing decimal node numbers works very well with automation like pdsh. As machines tend to take on different roles as different people use the cluster, it doesn't make a lot of sense to assign hostnames based on roles in the Lustre universe (mds, ost, etc.).
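
  A minimal sketch of how such a scheme pays off: with a prefix and a numeric range, both the individual hostnames and a bracketed host range for pdsh's -w option can be generated mechanically. The prefix "node", the range, and the helper names are assumptions for illustration only.

```python
def node_names(prefix, first, last, width=0):
    """Expand a prefix and an inclusive number range into hostnames."""
    return [prefix + str(n).zfill(width) for n in range(first, last + 1)]


def hostlist(prefix, first, last, width=0):
    """Build a bracketed host range such as node[01-16] for pdsh -w."""
    return "{}[{}-{}]".format(
        prefix, str(first).zfill(width), str(last).zfill(width)
    )


def pdsh_argv(prefix, first, last, command, width=0):
    """Argument vector that runs `command` on every node in the range."""
    return ["pdsh", "-w", hostlist(prefix, first, last, width), command]


if __name__ == "__main__":
    print(node_names("node", 1, 3, width=2))
    print(" ".join(pdsh_argv("node", 1, 16, "uptime", width=2)))
```

  Role-based names such as mds1 or ost3 would need a lookup table instead, which is one more thing to keep current as machines change roles.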
  4. Serial Consoles
  As in any data center, they're essential. Log their output for later retrieval should the kernel go wrong. Provide a useful front end like 'conman' or 'conserver'. Make sure the front end can send breaks to the kernel's sysrq facility over the serial console.
  5. Collect syslogs in one place
  It's nice to be able to watch one log for errors that are reported to syslog across the cluster.
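
  Once everything lands in one file, a small filter can summarize which nodes are complaining. The sketch below assumes the traditional syslog line layout (month, day, time, host, message); the pattern list and the sample lines are illustrative assumptions, not a complete catalogue of what Lustre or the kernel may log.

```python
import re
from collections import Counter

# Illustrative markers only; extend to whatever matters on your cluster.
ERROR_PATTERN = re.compile(r"LustreError|LBUG|Oops|Kernel panic")


def errors_by_host(lines):
    """Count matching error lines per host in traditional syslog format."""
    counts = Counter()
    for line in lines:
        # Fields: month, day, time, host, then the rest of the message.
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue
        host, message = fields[3], fields[4]
        if ERROR_PATTERN.search(message):
            counts[host] += 1
    return counts


# Made-up sample lines for demonstration.
SAMPLE_LOG = [
    "May 10 07:17:01 node03 kernel: LustreError: sample error message",
    "May 10 07:17:02 node01 sshd[411]: Accepted publickey for root",
    "May 10 07:17:05 node03 kernel: Kernel panic - not syncing",
    "May  9 23:59:59 node02 kernel: LBUG",
]

if __name__ == "__main__":
    for host, count in errors_by_host(SAMPLE_LOG).most_common():
        print(host, count)
```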
  6. Remote Power Management
  If a machine wedges, one needs to be able to reboot it without physically flipping a switch. Any number of vendors offer serial-controlled power widgets.
  7. Automated Disaster Recovery
  It's nice to be able to reimage a node via netbooting and network software installs. It's a low-frequency endeavour, though.
  8. Boot Quickly
    1. Disable non-essential services so they do not start at boot time
    2. Minimize the hardware checks the BIOS performs
    3. Especially avoid things like Red Hat's Kudzu, which can ask for user input before proceeding
