Lustre 1.4.8 Operations Manual
Version 1.4.8.1-man-v38 (02/09/2007)
This publication is intended to help Cluster File Systems, Inc. (CFS) Customers and Partners who are involved in installing, configuring, and administering Lustre.
The information contained in this document has not been submitted to any formal CFS test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by CFS for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.
CFS™ and Cluster File Systems, Inc.™ are trademarks of Cluster File Systems, Inc.
Lustre® is a registered trademark of Cluster File Systems, Inc.
The Lustre logo is a trademark of Cluster File Systems, Inc.
Other product names are the trademarks of their respective owners.
Comments may be addressed to:
Cluster File Systems, Inc.
Suite E104 - 288
4800 Baseline Road
Boulder CO 80303
Chapter 1. A Cluster with Lustre
Chapter 2. Understanding Lustre Networking
PART II. LUSTRE ADMINISTRATION
Chapter 1. Prerequisites
Lustre Version Selection
How to get Lustre
Using a Pre-packaged Lustre Release
Choosing a Pre-packaged Kernel
Lustre Tools
Other Required Software
Core Requirements
High Availability Software
Debugging Tools
Environment Requirements
Consistent Clocks
Universal UID/GID
Chapter 2. Lustre Installation
Installing Lustre
Quick Configuration of Lustre
Single System Test Using the llmount.sh Script
Using Supplied Configuration Tools
Single Node Lustre
Multiple Node Lustre
Starting Lustre
Patch Series Selection
Installing Quilt
Preparing the Kernel Tree Using Quilt
Configuration Options
Liblustre
Compiler Choice
Chapter 3. Configuring the Lustre Network
Identify nodes which will route between networks
Identify network interfaces that should be included/excluded from Lustre networking
Determine cluster-wide module configuration
Chapter 4. Configuring Lustre - Examples
Example One: Simple Lustre Network
Example Three: Exporting Lustre with Samba
Example Four: Heterogeneous Network with Failover Support
Example Five: OSS with Multiple OSTs
Example Six: Client with Sub-clustering Support
Chapter 5. More Complicated Configurations
Chapter 6. Failover
Starting / Stopping a Resource
Active/Active Failover Configuration
Instructions for Failover Setup with Heartbeat Version 1
Software Installations
Hardware Configurations
Lustre Configuration
Heartbeat Configuration
Mon
Mon Setup and Configuration
Instructions for Failover Setup with Heartbeat Version 2
Software Installations
Hardware Configurations
Hardware Preconditions
Lustre Configuration
Heartbeat Configuration
Chapter 7. Configuring Quotas
Creating Quota Files
Quota Allocation
Considerations for Backend Storage
Reliability
Selecting Storage for the MDS and OSS
Understanding Double Failures with Hardware and Software RAID5
Performance considerations
Formatting
Part III. Lustre Tuning, Monitoring and Troubleshooting
Client I/O RPC Stream Tunables
Watching the Client RPC Stream
Watching the OST Block I/O Stream
RPC Information For Other OBD Devices
OST Threads
MDS Threads
LNET Tunables
Large-Scale Tuning for Cray XT and Equivalents
Network Tunables
Chapter III-4. Lustre Troubleshooting and Tips
Fixing a Changed Host Name
Setting SCSI IO Sizes
Chapter 1. Free Space and Quotas
Chapter 2. Striping and Other I/O Options
Displaying Striping Information with lfs getstripe
lfs setstripe – setting striping patterns
Changing Striping for a Subdirectory
Using a Specific Striping Pattern for a Single File
Making File System Objects Immutable
Disabling Locking for Direct I/O Accesses
Chapter 4. Other Lustre Operating Tips
Expanding the File System by Adding OSTs
Chapter 2. Lustre Programming Interfaces
Chapter 3. Config Files and Module Parameters
Chapter 4. System Configuration Utilities
Chapter 5. Lustre Proc Elements
Top Level Nodes
Children
Statistics
Statistics Data Types
Statistics Provisioning Methods
Implementation
Introduction
Maximum Stripe Count
Maximum Stripe Size
Minimum Stripe Size
Maximum Number of OSTs and MDSs
Maximum Number of Clients
Maximum Size of a File System
Maximum File Size
Maximum Number of Files or Subdirectories in a Single Directory
MDS Space Consumption
Maximum Length of a Filename and Pathname
Chapter 1. A Cluster with Lustre
A Lustre cluster consists of three major types of systems:
the Meta Data Server (MDS)
the Object Storage Server (OSS)
the Lustre clients
Each of these systems is internally very modular in layout. For most of the modules, the request processing and message passing layers are shared between all of the systems, forming an integral part of the framework. The remaining modules are unique to a system type, for example, the Lustre Lite client module on client systems.
Figure 1.1.1: A Lustre Cluster shows the expected interactions between the servers and clients of the Lustre file system.

Figure 1.1.1: A Lustre Cluster
Lustre clients run the Lustre file system. They interact with the Object Storage Servers (OSSs) for file data I/O, and with the Meta Data Server (MDS) for name space operations.
When the Client, OSS, and MDS systems are separate, Lustre appears similar to a cluster file system with a file manager. But you can also have all these subsystems running on the same system, leading to a symmetrical layout. For the main protocols, see Figure 1.1.2: 'Interactions between the systems'.

Figure 1.1.2: Interactions between the systems
1.1.1 The Meta Data Server
The Meta Data Server (MDS) is perhaps the most complex of the Lustre subsystems. It provides back-end storage for the meta data service and updates this service with every transaction over a network interface. The MDS presently uses a journal file system; however, other options, such as shared object storage, are also being considered. Figure 1.1.3: MDS Software Module illustrates how the MDS functions.
The MDS uses the locking modules and existing features of a journal file system, such as Ext3 or XFS. In Lustre, the complexity is limited due to the presence of a single meta data server. The system avoids single points of failure by offering failover meta data services based on existing solutions such as Linux-HA. In a Lustre file system with clustered meta data, meta data processing is load balanced, which introduces significant complexity in accessing persistent meta data concurrently.

Figure 1.1.3: MDS Software Module
1.1.2 The Object Storage Servers
The core concept of Lustre is object storage. Objects can be thought of as i-nodes that are used to store file data. An Object Storage Server (OSS) is a server node that runs the Lustre software stack. It has one or more network interfaces and usually one or more disks. Every OSS exports one or more Object Storage Targets (OSTs).
An Object Storage Target is a software interface to a single exported back-end volume. It is conceptually similar to an NFS export except that an OST contains file system objects instead of the whole name space.
OSTs provide the file input/output service in a Lustre cluster by facilitating access to these objects. The name space is managed by a meta data service, which manages the Lustre i-nodes. Such i-nodes can be directories, symbolic links, or special devices where the associated data and meta data is stored on the meta data server. When a Lustre i-node represents a file, the meta data merely holds references to the file data objects stored on the OSTs.
The OSTs perform the block allocation for data objects, leading to distributed and scalable allocation of data. The OSTs also enforce security on access to objects from the clients. An interesting point to note is that the client-OST protocol bears some similarity to systems like DAFS in that it combines request processing with remote DMA.
The software modules in the OSTs are indicated in Figure 1.1.4: OST Software Module. This graphic shows how object storage targets provide a networked interface to other object storage. The second layer of object storage, direct object storage drivers, consists of drivers that manage objects. These objects are the files on persistent storage devices. There are many choices for direct drivers, which are often interchangeable. Objects can be stored as raw ext2 i-nodes or as files in many journal file systems by the filtering driver, which is now the standard driver for Lustre Lite. More exotic compositions of subsystems are possible, for example, in some situations an OBD filter direct driver can run on an NFS file system (where a single NFS client is all that is supported).
In Figure 1.1.4: OST Software Module, networking is expanded into its subcomponents. Lustre request processing is built on a thin API, called the LNET API. LNET inter-operates with a variety of network transports through Network Abstraction Layers (NAL).
This API provides the delivery and event generation in connection with network messages. It also provides advanced capabilities such as using Remote DMA (RDMA) if the underlying network transport layer supports this.

Figure 1.1.4: OST Software Module

Figure 1.1.5: Client Software Module
Lustre now contains a new networking model known as LNET. LNET is designed for more complex topologies, better routing capabilities and simplified configuration. However, until a new configuration scheme is implemented for Lustre in 1.6.0, configuring LNET requires a hybrid of the old lconf/XML configuration and the new module-based configuration. Your patience through this transition period is appreciated. Lustre 1.6 will be configured entirely via standard Unix/Linux commands ('mount', 'mkfs'), and networking will be configured with kernel module options.
Up until Lustre 1.4.5, networking configuration under Portals was defined in the configuration XML file. Every node had a single network ID (NID) that Portals used to identify the communication end points, along with a specified network type (nettype). Different networks of the same type were specified with "cluster_id" attributes; similarly, a single NID could be bound to multiple network interfaces through "hostaddr" attributes. Routing capabilities were limited to Elan/TCP, and the routing function itself was over-complex in the way it was defined in the XML.
Under the new schema, the network ID (NID) concept expands to include the network where all communication takes place. This means there is a NID for every network a node uses, and each NID now includes the network type. A network number, linked to the NID, distinguishes different networks of the same type. By default, LNET uses all the available interfaces for a specified network type, but all networking hardware specifics, including routing, are now defined as module options.
An important point to note is that due to the NID changes, old portals and new LNET networking are not wire compatible. Upgrading a live system can be considered but it must be done in a particular order. See Appendix 1. Upgrading from 1.4.5.
Important terms:
Network: A group of nodes that communicate directly with each other. It is how LNET represents a single cluster. Multiple networks can be used to connect clusters together. Each network has a unique type and number (e.g. tcp3, elan1).
Network types include tcp (Ethernet), openib (Mellanox-Gold Infiniband), iib (Infinicon Infiniband), vib (Voltaire Infiniband), ra (RapidArray), elan (Quadrics Elan), and gm (Myrinet).
NID: A Lustre networking address. Every node has one NID for each network it is on.
LND: Lustre networking device layer, a modular subcomponent of LNET that implements one of the network types.
PART II. LUSTRE ADMINISTRATION
The current, stable version of Lustre is available for download from the website of Cluster File Systems:
http://www.clusterfs.com/download.html
The software available for download on this website is released under the GNU General Public License. It is strongly recommended to read the complete license and release notes for this software before downloading it, if you have not done so already. The license and the release notes can also be found at the aforementioned website.
1.1.2 Supported Configurations
Cluster File Systems, Inc. supports Lustre on the configurations listed in Table 1.1.1: 'Supported Configurations'.
| ASPECT | SUPPORT TYPE |
|---|---|
| Operating Systems | Red Hat Enterprise Linux 3+, SuSE Linux Enterprise Server 9, Linux 2.4 and 2.6 |
| Platforms | IA-32, IA-64, x86-64, PowerPC architectures, and mixed-endian clusters |
| Interconnect | TCP/IP; Quadrics Elan 3 and 4; Myrinet; Mellanox; Infiniband (Voltaire, OpenIB and SilverStorm) |
Table 1.1.1: Supported Configurations
1.2 Using a Pre-packaged Lustre Release
Due to the complexity involved in building and installing Lustre, Cluster File Systems has made available several different pre-packaged releases that cover some of the most common configurations.
The pre-packaged release consists of five different RPM packages given below. Install them in the following order:
kernel-smp-<release-ver>.rpm – This is the Lustre patched Linux kernel RPM. Use it with matching Lustre Utilities and Lustre Modules package.
kernel-source-<release-ver>.rpm – This is the Lustre patched Linux kernel source RPM. This comes with the kernel package, but is not required to build or use Lustre.
lustre-modules-<release-ver>.rpm – The Lustre kernel modules for the above kernel.
lustre-<release-ver>.rpm – These are the Lustre Utilities or userspace utilities for configuring and running Lustre. Use them only with the matching kernel RPM as mentioned above.
lustre-source-<release-ver>.rpm – This contains the Lustre source code (including the kernel patches). It is not required to build or use Lustre.
The source package is required only if you need to build your own modules (for networking, and so on) against the kernel source.
NOTE: Lustre contains kernel modifications, which interact with your storage devices and may introduce security issues and data loss if not installed, configured, or administered properly. Please exercise caution and back up all data before using this software.
1.2.1 Choosing a Pre-packaged Kernel
Determining the most suitable pre-packaged kernel depends largely on the combination of hardware and software being run. CFS provides pre-packaged releases on our download Web site.
The lustre-* package is required for proper Lustre setup and monitoring. The package contains many tools, the most important ones being:
lconf: High-level configuration tool that acts on XML files;
lctl: A low-level configuration utility that can also be used for troubleshooting and debugging;
lfs: A tool for reading/setting striping information for the cluster, as well as performing other actions specific to Lustre File Systems;
mount.lustre: A mounting script required by Lustre clients.
Besides the tools provided with Lustre, some separate software tools are also required.
Table 1.2.1: Software URLs contains Hyperlinks to the software tools required by Lustre. Depending on your operating system, pre-packaged versions of these tools may be available, either from the sources listed below, or from your operating system vendor.
| SOFTWARE | VERSION | LUSTRE FUNCTION |
|---|---|---|
| perl | >= 5.6 | Scripting language; used by monitoring and test scripts |
| python | >= 2 | Scripting language; required by core Lustre tools |
| PyXML | >= 0.8 | XML processor for Python |
Table 1.2.1: Software URLs
1.2.3.2 High Availability Software
If you plan to enable failover server functionality with Lustre (either on the OSS or on the MDS), high-availability software will be a necessary addition to your cluster software. One of the better known high-availability packages is Heartbeat.
Linux-HA (Heartbeat) supports redundant systems with access to shared (common) storage over dedicated connectivity, and can determine the general state of the system. (For details, see Part II - Chapter 6: 'Failover'.)
Things inevitably go wrong – disks fail, packets get dropped, software has bugs – and when they do, it is always useful to have debugging tools on hand to help figure out how and why.
The most useful tool in this regard is GDB, coupled with crash. Together, these tools can be used to investigate both live systems and kernel core dumps. There are also useful kernel patches/modules, such as netconsole and netdump, that allow core dumps to be made across the network.
More information about these tools can be found at the following locations:
GDB: http://www.gnu.org/software/gdb/gdb.html
crash: http://oss.missioncriticallinux.com/projects/crash/
netconsole: http://lwn.net/2001/0927/a/netconsole.php3
netdump: http://www.redhat.com/support/wpapers/redhat/netdump/
Lustre always uses the client clock for timestamps. If the machine clocks across the cluster are not in sync, Lustre should not break; however, unsynchronized clocks in a cluster are always a source of headaches, since it becomes very difficult to debug multi-node issues or otherwise correlate the logs. For this reason, CFS recommends keeping the machine clocks in sync as much as possible. The standard way to accomplish this is by using the Network Time Protocol (NTP). All the machines in your cluster should synchronize their time from a local time server (or servers) at a suitable time interval.
More information about NTP can be found at: http://www.ntp.org/
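As a minimal sketch of such a setup (timeserver.example.com is a placeholder for your site's time server), each node's /etc/ntp.conf could contain:
server timeserver.example.com
On Red Hat style systems, the NTP daemon can then be started immediately and enabled at boot:
$ service ntpd start
$ chkconfig ntpd on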
In order to maintain uniform file access permissions on all the nodes of your cluster, the same user (UID) and group (GID) IDs should be used on all the clients. As with most clusters, Lustre expects a common UID/GID space across all the cluster nodes.
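As a quick sanity check (node1, node2 and someuser are placeholders), verify that a given account resolves to the same UID and GID on every node; the id output should be identical:
$ ssh node1 id someuser
$ ssh node2 id someuser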
1.3.3 Proper Kernel I/O Elevator
One of the many functions of the Linux kernel (indeed, of any OS kernel), is to provide access to disk storage. The algorithm which decides how the kernel provides disk access is known as the "I/O Scheduler," or "Elevator." In the 2.6 kernel series, there are four interchangeable schedulers, as follows:
cfq - "Completely Fair Queuing" makes a good default for most workloads on general-purpose servers. It is not a good choice for Lustre OSS nodes, however, as it introduces overhead and I/O latency.
as - "Anticipatory Scheduler" is best for workstations and other systems with slow, single-spindle storage. It is not at all good for OSS nodes, as it attempts to aggregate or batch requests in order to improve performance for slow disks.
deadline - "Deadline" is a relatively simple scheduler which tries to minimize I/O latency by re-ordering requests to improve performance. It is best for OSS nodes with "simple" storage, that is, software RAID, JBOD, LVM, and so on.
noop - "NOOP" is the simplest scheduler of all, really just a single FIFO queue. It does not attempt to optimize I/O at all, and is best for OSS nodes with high-performance storage, that is, DDN, Engenio, and so on. This scheduler may yield the best I/O performance if the storage controller has been carefully tuned for the I/O patterns of Lustre.
Please note that the above is just our best advice; we strongly suggest local testing as the best way to ensure high performance with Lustre. Also note that most distributions ship with either "cfq" or "as" configured as the default scheduler, so choosing an alternate scheduler is an absolutely necessary step in configuring Lustre for the best performance. The "cfq" and "as" schedulers should never be used on server platforms.
Please see the following resources for more in-depth discussion on choosing an I/O scheduler algorithm for Linux:
There are two ways to change the I/O scheduler: at boot time, or, with newer kernels, at runtime. For all Linux kernels, appending 'elevator={noop|deadline}' to the kernel boot string sets the I/O elevator.
With LILO, you can use the 'append' keyword:
image=/boot/vmlinuz-2.6.14.2
label=14.2
append="elevator=deadline"
read-only
optional
With GRUB, append the string to the end of the kernel command:
title Fedora Core (2.6.9-5.0.3.EL_lustre.1.4.2custom)
root (hd0,0)
kernel /vmlinuz-2.6.9-5.0.3.EL_lustre.1.4.2custom ro
root=/dev/VolGroup00/LogVol00 rhgb noapic quiet elevator=deadline
With newer Linux kernels (Red Hat Enterprise Linux v3 Update 3 does not have this feature; it is present in the main Linux tree as of 2.6.15), one can change the scheduler while running. If the file /sys/block/<DEVICE>/queue/scheduler exists (where DEVICE is the block device you wish to affect), it will contain a list of available schedulers and can be used to switch the schedulers. For example (where hda is the disk):
[root@cfs2]# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq
[root@cfs2 ~]# echo deadline > /sys/block/hda/queue/scheduler
[root@cfs2 ~]# cat /sys/block/hda/queue/scheduler
noop anticipatory [deadline] cfq
The other schedulers (anticipatory and cfq) are better suited for desktop use.
Follow the steps outlined below to install Lustre:
1. Install the Linux base OS as per your requirements along with the prerequisites like GCC, Python and Perl (as mentioned in Part II – Chapter 1. Prerequisites).
2. Install the RPMs as described in section 1.2 Using a Pre-packaged Lustre Release, in Part II – Chapter 1. Prerequisites. The preferred installation order is:
the Lustre-patched version of the Linux kernel (kernel-*)
the Lustre modules for that kernel (lustre-modules-*)
the Lustre user space programs (lustre-*). Other packages (optional).
3. Verify that all cluster networking is correct. This may include /etc/hosts, or DNS. Set the correct networking options for Lustre in /etc/modprobe.conf. (See 5.1.1 and 5.2.2 Modprobe.conf in Part II – Chapter 5. More Complicated Configurations.)
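For example, for a simple cluster with a single TCP network, a single line in /etc/modprobe.conf is sufficient (a minimal sketch; see Chapter 3 and Chapter 5 for the full range of options):
options lnet networks=tcp0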
TIP:
While installing Lustre with InfiniBand, you need to keep the ibhost, the kernel and Lustre all on the same revision. Follow the steps below to achieve this:
1. Install the kernel source (Lustre patched)
2. Install the Lustre source and the ibhost source
3. Compile the ibhost against your kernel
4. Compile the Linux kernel
5. Compile Lustre against the ibhost source, using --with-vib=<path to ibhost>
Now you can use the RPMs created by the above steps.
2.2 Quick Configuration of Lustre
As we have already discussed, Lustre consists of three types of subsystems – a metadata server (MDS), object storage targets (OSTs) and clients. All of these can co-exist on a single system or can run on different systems. The object storage servers and metadata server together present a Logical Object Volume (LOV) which is an abstraction that appears in the configuration.
It is possible to set up the Lustre system with many different configurations by using the administrative utilities provided with Lustre. Lustre includes some sample scripts in the /usr/lib/lustre/examples directory on a system where Lustre is installed (or the lustre/tests subdirectory of a source code installation). These scripts enable quick setup of some simple, standard configurations.
The next section describes how to install a simple Lustre setup using these scripts.
2.2.1 Single System Test Using the llmount.sh Script
The simplest Lustre installation is a configuration where all three subsystems execute on a single node. You can execute the script llmount.sh, located in the /usr/lib/lustre/examples directory, to set up, initialize and start the Lustre file system on a single node, using loopback devices in place of physical partitions. This script first executes a configuration script identified by a NAME variable. This configuration script then uses the LMC utility to generate an XML configuration file, which is in turn used by the lconf utility to perform the actual system configuration. The llmount.sh script then loads all the modules required by the specified configuration.
Next, the script creates small loopback file systems in /tmp for the server nodes. You can change the size and location of these files by modifying the configuration script.
Finally, the script mounts the Lustre file system at the mount point specified in the initial configuration script. The default used is /mnt/lustre.
Outlined below are the steps needed to configure and test Lustre for a single system. (You can use the llmount.sh script for initial testing to hide many of the background steps needed to configure Lustre. It is not intended to be used as a configuration tool for production installations.)
1. Starting the system: Two initial configuration scripts are provided for a single system test. Before starting the system, update these scripts to reflect any changes to the loopback file system locations or sizes, or to the Lustre file system mount point.
Execute the local.sh script with the LMC commands to generate an XML (local.xml) configuration file for a single system.
Execute the lov.sh script with the LMC commands to generate a configuration file with an MDS, LOV, two OSTs and a client.
2. Executing the llmount.sh script as shown below, specifying the setup based on either local.sh or lov.sh:
$ NAME={local|lov} sh llmount.sh
Below is a sample output when this command is executed:
$ NAME=local sh llmount.sh
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: ost srcdir None devdir ost
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
loading module: obdfilter srcdir None devdir obdfilter
loading module: mdc srcdir None devdir mdc
loading module: osc srcdir None devdir osc
loading module: lov srcdir None devdir lov
loading module: mds srcdir None devdir mds
loading module: llite srcdir None devdir llite
NETWORK: NET_mds.clusterfs.com_tcp NET_mds.clusterfs.com_tcp_UUID tcp mds.clusterfs.com
OSD: OST_mds.clusterfs.com OST_mds.clusterfs.com_UUID obdfilter /tmp/ost1-mds.clusterfs.com 400000 \ ldiskfs no 0 0
OST mount options: errors=remount-ro
MDSDEV: mds1 mds1_UUID /tmp/mds1-mds.clusterfs.com ldiskfs no
recording clients for filesystem: FS_fsname_UUID
Recording log mds1 on mds1
LOV: lov_mds1 110aa_lov_mds1_3af7a3d69c mds1_UUID 1 1048576 0 0 [u'OST_mds.clusterfs.com_UUID'] mds1
OSC: OSC_mds.clusterfs.com_OST_mds.clusterfs.com_mds1 110aa_lov_mds1_3af7a3d69c \ OST_mds.clusterfs.com_UUID
End recording log mds1 on mds1
MDSDEV: mds1 mds1_UUID /tmp/mds1-mds.clusterfs.com ldiskfs 400000 no
MDS mount options: errors=remount-ro,user_xattr,acl,
LOV: lov1 4987a_lov1_765ed779f4 mds1_UUID 1 1048576 0 0 [u'OST_mds.clusterfs.com_UUID'] mds1
OSC: OSC_mds.clusterfs.com_OST_mds.clusterfs.com_MNT_mds.clusterfs.com 4987a_lov1_765ed779f4 \ OST_mds.clusterfs.com_UUID
MDC: MDC_mds.clusterfs.com_mds1_MNT_mds.clusterfs.com f9c37_MNT_mds.clusterfs._30aaf9b569 mds1_UUID \
MTPT: MNT_mds.clusterfs.com MNT_mds.clusterfs.com_UUID \
/mnt/lustre mds1_UUID lov1_UUID
Now you can verify that the file system is mounted from the output of df:
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ubd/0 1011928 362012 598512 38% /
/dev/ubd/1 6048320 3953304 1787776 69% /r
none 193712 16592 167120 10% /mnt/lustre
NOTE: The output of the df command following the output of the script shows that the Lustre file system is mounted on the mount point /mnt/lustre. Although the actual output of the script included with your Lustre installation may differ due to enhancements or additional messages, it should still resemble the example above.
You can also verify that the Lustre stack has been set up correctly by observing the output of find /proc/fs/lustre.
NOTE: The actual output may depend on the inserted modules and on the instantiated OBD devices. Also note that the file system statistics presented from /proc/fs/lustre are expected to be the same as those obtained from df.
3. Bringing down a cluster and cleaning it up by using the llmountcleanup.sh script:
Execute the command below to cleanup and unmount the file system:
$ NAME={local|lov} sh llmountcleanup.sh
4. Remounting the file system by using the llrmount.sh script: llmount.sh reformats the devices, so use llrmount.sh if you want to retain the data in the file system.
$ NAME={local|lov} sh llrmount.sh
2.3 Using Supplied Configuration Tools
It is possible to set up Lustre on a single system or on multiple systems. The Lustre distribution comes with utilities that can be used to create configuration files easily and to set up Lustre for various configurations. Lustre uses three administrative utilities – lmc, lconf and lctl – to configure nodes. The lmc utility creates XML configuration files describing the configuration. The lconf utility uses the information in such a configuration file to invoke the low-level configuration utility lctl. Lastly, lctl actually configures the systems. For further details on these utilities, please refer to the man pages. You must keep the complete configuration for the whole cluster in a single XML file, and use that same file on all the cluster nodes to configure the individual nodes.
The next few sections describe the process of setting up a variety of configurations.
TIP: You can use "lconf -v" to show more verbose messages when running other lconf commands.
NOTE: You must use fstype = ext3 for Linux 2.4 kernels, and fstype = ldiskfs for 2.6 kernels. (In 2.4, Lustre patches the ext3 driver while in 2.6, it provides its own driver.)
Let us consider a simple configuration script where the MDS, the OSTs and the client are running on a single system. You can use the LMC utility to generate a configuration file for this as shown below. All the devices in the script are shown to be loopback devices, but you can specify any device here. The size option is required only for the loopback devices; for others the utility will extract the size from the device parameters. (See the usage of real disks below.)
#!/bin/sh
# local.sh
# Create a node:
rm -f local.xml
lmc -m local.xml --add node --node localhost
lmc -m local.xml --add net --node localhost --nid localhost --nettype tcp
# Add the MDS:
lmc -m local.xml --format --add mds --node localhost --mds mds-test --fstype ext3 --dev \
/tmp/mds-test --size 50000
# Add the logical object volume (note the relationship to the MDS):
lmc -m local.xml --add lov --lov lov-test --mds mds-test --stripe_sz 4194304 --stripe_cnt \
-1 --stripe_pattern 0
# Add the OSTs (note the linkage to the LOV):
lmc -m local.xml --add ost --node localhost --lov lov-test --ost ost1-test --fstype ldiskfs \
--dev /tmp/ost1-test --size 100000
lmc -m local.xml --add ost --node localhost --lov lov-test --ost ost2-test --fstype ldiskfs \
--dev /tmp/ost2-test --size 100000
# Define the mount point:
lmc -m local.xml --add mtpt --node localhost --path /mnt/lustre --mds mds-test --lov lov-test
On running the script, these commands create a local.xml file describing the specified configuration. Now you can execute the actual configuration by using the command below.
To configure using LCONF:
$ sh local.sh
$ lconf --reformat local.xml
This command loads all the required Lustre and LNET modules and also does the low-level configuration of every device using lctl. The --reformat option is essential the first time, to initialize the file systems on the MDS and OSTs. If it is used on any subsequent attempt to bring up the Lustre system, it will re-initialize the file systems, destroying their contents.
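On subsequent startups, simply omit --reformat so that the existing file systems, and the data in them, are preserved:
$ lconf local.xml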
Now let us consider an example of setting up Lustre on multiple systems – with the MDS on one node, the OSTs on other nodes and the client on one or more nodes.
You can use the following configuration script to create this setup by replacing node-* in the example with the hostnames of real systems. The servers, the clients and the node running the configuration script all need to resolve those hostnames into IP addresses via DNS or /etc/hosts. One common problem with some Linux setups is that the hostname is mapped in /etc/hosts to 127.0.0.1, which causes the clients to fail to communicate with the servers. For this example we will use real disks, and we add some mount options. (For real disks, these mount options should be considered mandatory.)
1. Define the nodes, with a generic client node:
#!/bin/bash
# A few handy definitions:
dt=`date +%m%d_%H%M`
config="test_${dt}.xml"
LMC="lmc -m $config"
#
# first, clean up
rm -f $config
${LMC} --add node --node ft2
${LMC} --add node --node d1_q_0
${LMC} --add node --node d2_q_0
${LMC} --add node --node client
2. Configure the networking:
${LMC} --add net --node client --nid '*' --nettype lnet
${LMC} --add net --node d1_q_0 --nid 10.67.73.160@tcp --nettype lnet
${LMC} --add net --node d2_q_0 --nid 10.67.73.150@tcp --nettype lnet
${LMC} --add net --node ft2 --nettype lnet --nid 10.67.73.181@tcp
3. Define the MDS:
${LMC} --add mds --node ft2 --mds mds-l --fstype ldiskfs --dev $MDSDEV --failover --quota \
quotaon=ug,iunit=200,bunit=10 --mountfsoptions=acl || exit 7
4. Add the LOV:
${LMC} --add lov --lov lov-l --mds mds-l --stripe_sz 4194304 --stripe_cnt -1 --stripe_pattern 0
5. Add the OSTs:
${LMC} --add ost --node d1_q_0 --lov lov-l --ost ost-alpha --fstype ldiskfs --dev /dev/sdb1 \
--failover --mountfsoptions extents,mballoc
${LMC} --add ost --node d2_q_0 --lov lov-l --ost ost-beta --fstype ldiskfs --dev /dev/sdb2 \
--failover --mountfsoptions extents,mballoc
6. Define the client mountpoint:
${LMC} --add mtpt --node client --path /mnt/lustre --mds mds-l --lov lov-l --mountfsoptions extents,mballoc
7. Run the script to generate the config.xml file (only once). Put the file at a location where all the nodes can access it, for example, an NFS share. The XML is not especially human-readable, so we do not include an example here.
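For example, assuming the script above was saved as config.sh and /nfs/share is an NFS directory visible to all nodes (both names are illustrative):
$ sh ./config.sh
$ cp test_*.xml /nfs/share/config.xml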
Follow the steps below to start Lustre for the first time.
1. Reformat and start the OSTs:
$ lconf --reformat --node node-ost1 config.xml
$ lconf --reformat --node node-ost2 config.xml
$ lconf --reformat --node node-ost3 config.xml
Sample output:
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: ost srcdir None devdir ost
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
loading module: obdfilter srcdir None devdir obdfilter
NETWORK: NET_oss_tcp NET_oss_tcp_UUID tcp oss
OSD: oss-test oss-test_UUID obdfilter /dev/hdc 0 ldiskfs no 0 256
OST mount options: errors=remount-ro
File not found or readable: oss
File not found or readable: oss
configuring for host: ['oss']
setting /proc/sys/net/core/rmem_max to at least 16777216
setting /proc/sys/net/core/wmem_max to at least 16777216
Service: network NET_oss_tcp NET_oss_tcp_UUID
loading module: libcfs srcdir None devdir libcfs
+ /sbin/modprobe libcfs
loading module: lnet srcdir None devdir lnet
+ /sbin/modprobe lnet
+ /sbin/modprobe lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
+ /sbin/modprobe ksocklnd
Service: ldlm ldlm ldlm_UUID
loading module: lvfs srcdir None devdir lvfs
+ /sbin/modprobe lvfs
loading module: obdclass srcdir None devdir obdclass
+ /sbin/modprobe obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
+ /sbin/modprobe ptlrpc
Service: osd OSD_oss-test_oss OSD_oss-test_oss_UUID
loading module: ost srcdir None devdir ost
+ /sbin/modprobe ost
loading module: ldiskfs srcdir None devdir ldiskfs
+ /sbin/modprobe ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
+ /sbin/modprobe fsfilt_ldiskfs
loading module: obdfilter srcdir None devdir obdfilter
+ /sbin/modprobe obdfilter
+ sysctl lnet/debug_path /tmp/lustre-log-oss
+ /usr/sbin/lctl modules > /tmp/ogdb-oss
Service: network NET_oss_tcp NET_oss_tcp_UUID
NETWORK: NET_oss_tcp NET_oss_tcp_UUID tcp oss
Service: ldlm ldlm ldlm_UUID
Service: osd OSD_oss-test_oss OSD_oss-test_oss_UUID
OSD: oss-test oss-test_UUID obdfilter /dev/hdc 0 ldiskfs no 0 256
+ sfdisk -s /dev/hdc
+ mkfs.ext2 -j -b 4096 -F -J size=388 -I 256 /dev/hdc
+ tune2fs -O dir_index /dev/hdc
+ dumpe2fs -f -h /dev/hdc
no external journal found for /dev/hdc
OST mount options: errors=remount-ro
+ /usr/sbin/lctl
attach obdfilter oss-test oss-test_UUID
quit
+ /usr/sbin/lctl
cfg_device oss-test
setup /dev/hdc ldiskfs f errors=remount-ro
quit
+ /usr/sbin/lctl
attach ost OSS OSS_UUID
quit
+ /usr/sbin/lctl
cfg_device OSS
setup
quit
2. Reformat and start the MDS:
$ lconf --reformat --node node-mds config.xml
3. Mount the file system on the clients:
$ mount -t lustre node-mds:/mds-test/client /mnt/lustre
2.4.1 Building Your Own Kernel
If the hardware is not standard, or CFS support has asked you to apply a patch, Lustre will require some changes to the core Linux kernel. These changes are organized as a set of patches in the kernel_patches directory of the Lustre CVS repository. If you are building your kernel from source, you will need to apply the appropriate patches.
Managing patches for the kernels is a very involved process, given that most patches are intended to work with several kernels. To facilitate support, CFS maintains tested versions on its FTP site, as some kernel versions may not work properly with the patches from CFS. We recommend the Quilt package, developed by Andreas Gruenbacher, as it simplifies the process considerably. Patch management with Quilt works as follows:
a series file lists a collection of patches
the patches in a series form a stack
using Quilt you then push and pop the patches
you then edit and refresh (update) the patches in the stack that is being managed with Quilt
you can then revert inadvertent changes, fork or clone the patches, and conveniently show the differences in your work, before and after.
2.4.1.1 Patch Series Selection
Depending on the kernel being used, a different series of patches needs to be applied. Cluster File Systems maintains a collection of different patch series files for the various supported kernels in lustre/kernel_patches/series/.
For instance, the file lustre/kernel_patches/series/rh-2.4.20 lists all the patches that should be applied to a Red Hat 2.4.20 kernel to build a Lustre compatible kernel.
The current set of all the supported kernels and their corresponding patch series can always be found in the file lustre/kernel_patches/which_patch.
A variety of Quilt packages (RPMs, SRPMs, and tarballs) are available from various Linux distributions. As Quilt changes from time to time, we advise you to download the appropriate package from the Cluster File Systems FTP site:
ftp://ftp.clusterfs.com/pub/quilt/
The Quilt RPMs have some installation dependencies on other utilities, e.g. the core-utils RPM that is available only in Red Hat 9. You will also need a recent version of the diffstat package. If you cannot fulfill the Quilt RPM dependencies for the packages made available by Cluster File Systems, we suggest building Quilt from the tarball.
After you have acquired the Lustre source (CVS or tarball) and chosen a series file to match your kernel sources, you must also choose a kernel config file. The supported kernel ".config" files are in the folder lustre/kernel_patches/kernel_configs, and are named in such a way as to indicate which kernel and architecture they are meant for. For example, vanilla-2.4.20.uml.config is a UML config file for the vanilla 2.4.20 kernel series.
Next, unpack the appropriate kernel source tree. For the purposes of illustration, this documentation assumes that the resulting source tree is in /tmp/kernels/linux-2.4.20 and we call this the destination tree.
You are now ready to use Quilt to manage the patching process for your kernel. The following set of commands will setup the necessary symlinks between the Lustre kernel patches and your kernel sources.
$ cd /tmp/kernels/linux-2.4.20
$ quilt setup -l ../lustre/kernel_patches/series/rh-2.4.20 -d ../lustre/kernel_patches/patches
You can now have Quilt apply all the patches in the chosen series to your kernel sources by using the set of commands given below.
$ cd /tmp/kernels/linux-2.4.20
$ quilt push -av
If the right series files are chosen and the patches and the kernel sources are up-to-date, the patched destination Linux tree should now be able to act as a base Linux source tree for Lustre.
The patched Linux source need not be compiled to build Lustre from it. However, the same Lustre-patched kernel must be compiled and then booted on any node on which you intend to run a version of Lustre built using this patched kernel source.
2.4.1.3 Preparing the Kernel Tree Using Quilt
After acquiring the Lustre source (CVS or tarball) and choosing a series file to match your kernel sources you must also choose a kernel config file. The supported kernel ".config" files are in the folder lustre/kernel_patches/kernel_configs, and are named in such a way as to indicate which kernel and architecture they are meant for. For example, kernel-2.6.9-2.6-rhel4-x86_64-smp.config is a config file for the 2.6.9 kernel shipped with RHEL 4 suitable for x86_64 SMP systems.
Next unpack the appropriate kernel source tree. For the purposes of illustration, this documentation assumes that the resulting source tree is in /tmp/kernels/linux-2.6.9, we will refer to this as the destination tree.
You are now ready to use Quilt to manage the patching process for your kernel. The following set of commands will setup the necessary symlinks between the Lustre kernel patches and your kernel sources, assuming the Lustre sources are unpacked under /tmp/lustre-1.4.8.x and you have chosen the 2.6-rhel4 series:
$ cd /tmp/kernels/linux-2.6.9
$ rm -f patches series
$ ln -s /tmp/lustre-1.4.8.x/lustre/kernel_patches/series/2.6-rhel4.series ./series
$ ln -s /tmp/lustre-1.4.8.x/lustre/kernel_patches/patches .
You can now have quilt apply all the patches in the chosen series to your kernel sources by using the set of commands given below.
$ cd /tmp/kernels/linux-2.6.9
$ quilt push -av
If the right series files are chosen, and the patches and the kernel sources are up-to-date, the patched destination Linux tree should now be able to act as a base Linux source tree for Lustre.
You do not need to compile the patched Linux source in order to build Lustre from it. However, you must compile the same Lustre-patched kernel and then boot it on any node on which you intend to run the version of Lustre being built using this patched kernel source.
The Lustre source can be obtained by registering on the site:
http://www.clusterfs.com/download.html
Once you register you will receive an email with the link for download.
The following packages are available for each supported Linux distribution and architecture. The files employ the naming convention:
kernel-smp-<kernel version>_lustre.<lustre version>.<arch>.rpm
Example of binary packages for 1.4.8:
kernel-smp-2.6.9-42.0.3.EL_lustre.1.4.8.i686.rpm contains the patched kernel
lustre-1.4.8-2.6.9_42.0.3.EL_lustre.1.4.8smp.i686.rpm contains the Lustre user space files and utilities
lustre-modules-1.4.8-2.6.9_42.0.3.EL_lustre.1.4.8smp.i686.rpm contains the Lustre modules (kernel/fs/lustre and kernel/net/lustre).
You can install the binary packages by issuing the standard RPM commands:
$ rpm -ivh kernel-smp-2.6.9-42.0.3.EL_lustre.1.4.8.i686.rpm
$ rpm -ivh lustre-1.4.8-2.6.9_42.0.3.EL_lustre.1.4.8smp.i686.rpm
$ rpm -ivh lustre-modules-1.4.8-2.6.9_42.0.3.EL_lustre.1.4.8smp.i686.rpm
Example of Source packages:
kernel-source-2.6.9-42.0.3.EL_lustre.1.4.8.i686.rpm contains the source for the patched kernel
lustre-source-1.4.8-2.6.9_42.0.3.EL_lustre.1.4.8smp.i686.rpm contains the source for the Lustre modules and user space utilities.
The kernel-source and lustre-source packages are provided in case you need to build external kernel modules or use additional network types. They are not required to run Lustre.
Once you have your Lustre source tree you can build Lustre by running the sequence of commands given below.
$ cd <path to kernel tree>
$ cp /boot/config-`uname -r` .config
$ make oldconfig || make menuconfig
# For 2.6 kernels
$ make include/asm
$ make include/linux/version.h
$ make SUBDIRS=scripts
# For 2.4 kernels
$ make dep
To configure Lustre and to build Lustre RPMs, go into the Lustre source directory and run:
$ ./configure --with-linux=<path to kernel tree>
$ make rpms
This will create a set of .rpms in /usr/src/redhat/RPMS/<arch> with a date-stamp appended (the SUSE path is /usr/src/packages).
Example:
lustre-1.4.8-2.6.9_42.xx.xx.EL_lustre.1.4.8.custom_200609072009.i686.rpm
lustre-debuginfo-1.4.8-2.6.9_42.xx.xx.EL_lustre.1.4.8.custom_200609072009.i686.rpm
lustre-modules-1.4.8-2.6.9_42.xx.xx.EL_lustre.1.4.8.custom_200609072009.i686.rpm
lustre-source-1.4.8-2.6.9_42.xx.xx.EL_lustre.1.4.8.custom_200609072009.i686.rpm
Then cd into the kernel source directory and run:
$ make rpm
This will create a kernel RPM suitable for the installation.
Example: kernel-2.6.9-5.0.3.EL_lustre.1.4.2custom-1.i386.rpm
Lustre supports several different features and packages that extend the core functionality of Lustre. These features/packages can be enabled at build time by issuing appropriate arguments to the configure command. A complete listing of the supported features and packages can always be obtained by issuing the command "./configure --help" in your Lustre source directory. The config files matching the kernel version are in the configs/ directory of the kernel source. Copy one to .config at the root of the kernel tree.
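For example, a build with Voltaire Infiniband support might be configured as follows (the paths are illustrative; check "./configure --help" for the exact options supported by your release):
$ ./configure --with-linux=/tmp/kernels/linux-2.6.9 --with-vib=/path/to/ibhost
$ make rpms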
The Lustre library client, liblustre, relies on libsysio, which is a library that provides POSIX-like file and name space support for remote file systems from the application program address space. Libsysio can be obtained from:
http://sourceforge.net/projects/libsysio/
NOTE: Liblustre is not for general use. It was created to work with specific hardware (Cray) and should never be used with other hardware.
Development of libsysio has continued since it was first targeted for use with Lustre, so first check out the b_lustre branch from the libsysio CVS repository; this gives the version of libsysio compatible with Lustre. Once it is checked out, the steps listed below will build libsysio.
$ sh autogen.sh
$ ./configure --with-sockets
$ make
Once libsysio is built, you can build liblustre using the following commands.
$ ./configure --with-lib --with-sysio=/path/to/libsysio/source
$ make
The compiler must be GCC version 3.3.4 or newer; GCC 4.0 is not currently supported. GCC 3.3.4 has been used to successfully compile all of the pre-packaged releases made available by Cluster File Systems, and as such is the only officially supported compiler. Your mileage may vary with other compilers, or even other versions of GCC.
NOTE: GCC v3.3.4 was used to build 2.6 series kernels.
Chapter II-3. Configuring the Lustre Network
Before configuration can take place, a clear understanding of your Lustre network topologies is essential.
3.1.1 Identify all Lustre Networks
A network is a group of nodes that communicate directly with each other. As mentioned previously, Lustre supports a variety of network types and hardware, including TCP/IP, Elan, varieties of Infiniband, and others. The normal rules for specifying networks apply; for example, two TCP networks on two different subnets would be considered two different Lustre networks, such as tcp0 and tcp1.
3.1.2 Identify nodes which will route between networks
Any node with appropriate interfaces can route LNET between different networks – the node may be a server, a client, or a standalone router. LNET can route across different network types (for example, TCP to Elan) or across different topologies (for example, bridging two Infiniband or TCP/IP networks).
3.1.3 Identify any network interfaces that should be included/excluded from Lustre networking
LNET by default uses all interfaces for a given network type. If there are interfaces that LNET should not use (for example, administrative networks, IP over IB, and so on), then the interfaces that it should use must be explicitly listed.
3.1.4 Determine cluster-wide module configuration
The LNET configuration is managed via module options, typically specified in /etc/modprobe.conf or /etc/modprobe.conf.local (depending on distro). To help ease the maintenance of large clusters, it is possible to configure the networking setup for all nodes through a single unified set of options in the modprobe.conf file on each node. See the ip2nets option below for more information.
Liblustre users should set the accept=all parameter; see the appendix for details.
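For example (a sketch; combine accept=all with your networks or ip2nets settings as appropriate):
options lnet accept=all networks=tcp0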
3.1.5 Determine appropriate zconf-mount parameters for clients
In their mount commands, clients use the NID of the MDS host to retrieve their configuration information. Since an MDS may have more than one NID, a client should use the NID appropriate for its local network. If you are unsure which NID to use, there is an lctl command that can help. On the MDS,
lctl list_nids
will display the server's NIDs. On a client,
lctl which_nid <NID list>
will display the closest NID for that client. So from a client with SSH access to the MDS,
mds_nids=`ssh the_mds lctl list_nids`
lctl which_nid $mds_nids
will in general be the correct NID to use for the MDS in the mount command.
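For example, if lctl which_nid reports 192.168.0.10@tcp0, the client mount command would look like this (the NID and the file system names are illustrative):
$ mount -t lustre 192.168.0.10@tcp0:/mds-test/client /mnt/lustre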
LNET before mountconf (LNET was introduced in Lustre 1.4.6)
LNET and portals use different network addressing schemes; that is, their NIDs are different. Lmc/lconf allow NIDs to be specified in either format so that old configuration (lmc) scripts and old XML configuration files continue to work and the NIDs are converted to LNET format as required.
LNET NIDs take the form: nid = <address>[@<network>], where <address> is the network address within the network and <network> is the identifier for the network itself (network type + instance number). For example, 192.73.220.107@tcp0 would be a typical NID on a TCP network. '3@elan0' would be a typical Elan NID.
The network number can be used to distinguish between instances of the same network type, e.g. tcp0 and tcp1. An unspecified network number defaults to 0, and an unspecified network type defaults to tcp; for example, a NID written as 192.73.220.107 is equivalent to 192.73.220.107@tcp0.
NOTE: If a machine has multiple network interfaces, Lustre networking must be specified by modprobe.conf options (networks or ip2nets) as the default configuration will almost certainly not work for a multi-homed host.
The LNET NID is generated from old (lmc) configuration scripts by using the network type (specified by --nettype <type>) as the LNET network identifier. New configuration scripts should use the network type lnet and specify the LNET NID directly.
Example lmc line specifying a server’s NID:
$ LMC --add net --node srv1 --nettype lnet --nid 192.168.2.1@tcp1
(The lmc tool will be obsolete with mountconf in Lustre 1.6.0)
Example lmc line for clients on all networks:
$ LMC --add net --node client --nettype lnet --nid '*'
A client’s actual NIDs are determined from its local networks at client startup time.
These changes affect lmc and the XML it produces, as well as zeroconf mount commands.
Example zeroconf client mount command pointing to an MDS on an elan network:
mount -t lustre 3@elan:/mdsA/client /mnt/lustre
NOTE: We recommend using dotted-quad IP addressing rather than host names. We have found this aids in reading debug logs, and helps greatly when debugging configurations with multiple interfaces.
LNET network hardware and routing are now configured via module parameters of the LNET and LND-specific modules. Parameters should be specified in the /etc/modprobe.conf or /etc/modules.conf file, for instance:
options lnet networks=tcp0,elan0
specifies that this node should use all available TCP and Elan interfaces.
Under Linux 2.6, the LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under lnet and LND-specific parameters under the corresponding LND's name.
Under Linux 2.4, sysfs is not available, but the LND-specific parameters are accessible via equivalent paths under /proc.
Notes about quotes: Depending on the Linux distribution, options with included commas may need to be escaped by using single and/or double quotes. Worst-case quotes look like this:
options lnet 'networks="tcp0,elan0"' 'routes="tcp [2,10]@elan0"'
But the additional quotes may confuse some distributions. Check for messages such as:
lnet: Unknown parameter `'networks'
If you see this message after loading LNET with modprobe, remove the additional single quotes from modprobe.conf.
Additionally, the message "refusing connection - no matching NID" generally points to an error in the LNET module configuration.
Note: By default, Lustre ignores the loopback (lo0) interface, but it does not ignore IP addresses aliased to the loopback. In this case, specify all Lustre networks explicitly.
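For example, to bind Lustre explicitly to a single physical interface, so that loopback aliases are never considered, the interface can be named in the networks option (eth0 is illustrative):
options lnet networks="tcp0(eth0)"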
Liblustre network parameters may be set by exporting the environment variables LNET_NETWORKS, LNET_IP2NETS and LNET_ROUTES. Each of these variables uses the same parameters as the corresponding modprobe option.
Please note that it is very important that a liblustre client includes ALL the routers in its setting of LNET_ROUTES. A liblustre client cannot accept connections; it can only create them. If a server sends RPC replies via a router to which the liblustre client has not already connected, those replies will be lost.
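A sketch of the corresponding environment setup for a liblustre client on an Elan network that reaches its servers over TCP (the router NIDs are illustrative and must cover ALL routers at your site):
$ export LNET_NETWORKS="elan0"
$ export LNET_ROUTES="tcp [2,10]@elan0"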
Note: Liblustre is not for general use. It was created to work with specific hardware (Cray) and should never be used with other hardware.
SilverStorm InfiniBand Options -
For the SilverStorm/Infinicon Infiniband LND (iiblnd), the network and HCA may be specified, as in the example below:
options lnet networks="iib3(2)"
This says that this node is on iib network number 3, using HCA[2] == ib3.
3.2.3 Module Parameters – Routing
routes=<net type> <router NID(s)>
This parameter specifies a semicolon-separated list of router definitions. Each route is defined as a network type, followed by a list of routers.
Examples:
options lnet 'networks="tcp0, elan0"' 'routes="tcp[2,10]@elan0"'
This identifies the Elan NIDs 2@elan0 and 10@elan0 as routers for the TCP network.
A more complicated example:
options lnet 'ip2nets="tcp0 192.168.0.*; elan0 132.6.1.*"' 'routes="tcp [2,10]@elan0; elan 192.168.0.[2,10]@tcp0"'
This specifies bi-directional routing - Elan clients can reach Lustre resources on the TCP networks and TCP clients can access the Elan networks. (For more information on ip2nets, see section 5.1.1)
And here is a very complex routed configuration with Voltaire Infiniband and Myrinet (GM) systems, with four systems configured as routers:
options lnet \
ip2nets="gm 10.10.3.* # aa*-i0; vib 10.10.131.[11-18] # aa[11-18]-ipoib0;\
vib 10.10.132.* # cc*-ipoib0;" routes="gm 10.10.131.[11-18]@vib # vib->gm via aa[11-13];\
vib 0xdd7f813b@gm # gm->vib via aa11; vib 0xdd7f81c7@gm # gm->vib via aa12;\
vib 0xdd7f81c2@gm # gm->vib via aa13"
live_router_check_interval, dead_router_check_interval, auto_down, check_routers_before_use and router_ping_timeout
In a routed Lustre setup with nodes on different networks, such as TCP/IP and Elan, the router checker checks the status of the routers. Currently, only clients using the sock LND and the Elan LND avoid failed routers; CFS is working on extending this behavior to all LND types. The auto_down parameter enables (1) or disables (0) the automatic marking of router state. The live_router_check_interval parameter specifies the interval, in seconds, at which the router checker pings the live routers; dead_router_check_interval does the same for dead routers. The router_ping_timeout parameter sets how long the router checker waits for a reply: the router pinger sends a ping message to each dead or live router once every dead/live_router_check_interval seconds, and if it does not get a reply within router_ping_timeout seconds, it considers the router down. The last parameter is check_routers_before_use, which is off by default; if it is turned on, dead_router_check_interval must also be given a positive integer value.
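As an illustration only (the parameter names come from the list above; the interval and timeout values are arbitrary examples, not recommendations), a routed Elan client might enable the router checker with a modprobe.conf line such as:
options lnet networks=elan0 'routes="tcp [2,10]@elan0"' auto_down=1 check_routers_before_use=1 live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=50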
The router checker records the following variables for each router:
the last time it was disabled
the duration for which it is disabled
The initial disable time for a router is 1 minute (usually enough time to replug a cable after removing it). If the router is administratively marked as "up", the router checker clears the timeout. When a route is disabled, its "sent packets" counter is reset to 0. When the route is first re-used (that is, an elapsed disable time is found), the sent-packets counter is set to 1, and it is incremented for each further use of the route. Once the route has been used successfully for 100 packets, so that the sent-packets counter has reached 100, the timeout should be reset to 0 so that future errors no longer double the timeout.
NOTE: The router_ping_timeout is consistent with the default LND timeouts. You may have to increase it on very large clusters if the LND timeout is also increased. For larger clusters, we suggest increasing the check interval.
3.2.4 Downed Routers
There are two mechanisms to update health status of a peer or a router:
LNET can actively check health status of all routers and mark them as dead or alive automatically. This is off by default. To enable it set auto_down and if desired check_routers_before_use. This initial check may cause a pause equal to router_ping_timeout at system startup, if there are dead routers in the system.
When there is a communication error, all LNDs will notify LNET that the peer (not necessarily a router) is down. This mechanism is always on, and there is no parameter to turn it off. However if you set the LNET module parameter auto_down to 0, LNET will ignore all such peer-down notifications.
Some key differences between the two mechanisms:
1. The router pinger only checks routers for their health, while LNDs can notice all dead peers irrespective of whether they are a router or not.
2. The router pinger checks the router health actively by sending pings, but LNDs can only notice a dead peer when there is network traffic going on.
3. The router pinger can bring a router from alive to dead or vice versa, but LNDs can only bring a peer down.
3.3 Starting and Stopping LNET
LNET is started and stopped automatically by Lustre, but can also be started manually in a standalone manner. This is particularly useful to verify that your networking setup is working correctly before you attempt to start Lustre.
A few of the advantages of using LNET are explained below -
A. Performance
LNET routing can fully exploit all the performance optimizations available to the native interconnect between any pair of nodes. In other words, it can achieve full bandwidth of data transfer with minimum latency and CPU overhead.
It is not possible to have RDMA with TCP/IP; thus, the endpoints of a Lustre connection (i.e. the client and the server) are forced to copy data. The zero-copy TCP patch from CFS allows this copy to be avoided on the transmit side. TCP/IP may also suffer further unavoidable performance degradation when it is layered over another cluster interconnect. For example, IP over IB forces the use of UD (Unreliable Datagram) queue pairs, which imposes a reduced MTU (Maximum Transmission Unit) and simply cannot achieve the same bandwidth as RC (Reliable Connected) queue pairs.
B. Load sharing
LNET automatically load balances over equivalent routers. This is achieved by monitoring message queue depth and queuing messages to the router with the shortest queue.
Load balancing over TCP/IP, by contrast, takes place outside the LNET configuration. Managing it may therefore vary depending on the available IP networks, hardware and management interfaces.
C. Resilience
Lustre monitors all its communications and initiates retry and recovery algorithms when any of them fails. This allows LNET to abandon communications in case of failure, but at a cost: LNET delivers messages reliably in the absence of failures in the underlying fabric, but when such failures occur, it abandons the messages affected by the failure and does not retry. This is true irrespective of the network driver(s) in the message path.
LNET routing supports the marking of routers up or down on individual nodes. System administrators can fail out a router by marking it down on all the nodes, and bring it back into service again by marking it up on all the nodes. LNET automatically marks a router down locally, if communications with it complete with an error.
NOTE: Currently this is implemented only for Elan and TCP LNET networks. CFS plans to implement it for the remaining network types as early as possible.
TCP achieves resilience by retransmitting packets until safe receipt has been confirmed. The availability of alternative routes allows the effect of a failing IP router to be masked until the IP route tables have been updated to route around it. If a site uses static IP routes, a failed IP router will reduce performance until it recovers; to prevent this, the site must run an automatic IP routing protocol (e.g. RIP) over all IP networks, including the cluster "interior" network. Successful delivery must occur before the Lustre timeout expires to avoid sending Lustre into retry/recovery mode. TCP does not recognize the failed router; it just infers that its peer has not yet received some data it sent, so the retransmitted copy of the packet can arrive over a surviving route without hitting the same problem. In short, TCP requires redundant, load-balanced IP routes, plus an IP routing protocol to limit the performance impact in the presence of failures.
The performance advantage of LNET routing is that it is built in: it exploits the best drivers available for the network at all points, and this advantage is unlikely to diminish with time.
The commands to start LNET are -
$ modprobe lnet
$ lctl network up
To see the list of local nids -
$ lctl list_nids
This will tell you if your local node's networks are set up correctly.
If not, check the "networks=" line in modules.conf and ensure that the network layer modules are correctly installed and configured.
To get the best remote nid -
$ lctl which_nid <nid list>
This will take the "best" nid from a list of the nids of a remote host. The "best" nid is the one the local node will use when trying to communicate with the remote node.
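For example (the NIDs shown are illustrative), passing the candidate NIDs of a remote host prints the one the local node would use:
$ lctl which_nid 192.168.0.2@tcp0 2@elan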
TCP client:
mount -t lustre mdsnode:/mdsA/client /mnt/lustre/
Elan client:
mount -t lustre 2@elan0:/mdsA/client /mnt/lustre
Before the LNET modules can be removed, LNET references must be removed. In general, these references are removed automatically during Lustre shutdown, but for standalone routers an explicit step is needed: stop the LNET network by using the following command:
lctl network unconfigure
NOTE: Attempting to remove the Lustre modules prior to stopping the network may result in a crash or an LNET hang; if this occurs, the node must in most cases be rebooted. Be certain that the Lustre network and Lustre itself are stopped prior to module unloading, and be extremely careful when using rmmod -f.
To unload the modules once the network is unconfigured, the following commands can be used:
modprobe -r <any lnd and the lnet modules>
lconf --cleanup
The lconf --cleanup command does the Lustre and LNET cleanup automatically in cases where lconf was used to start the services.
TIP:
$ lctl modules | awk '{print $2}' | xargs rmmod
Because the default network is tcp0, we always omit the “@tcp0” suffix. We also use actual host names rather than numerical IP addresses within the lmc and mount commands. (NB: host names, being symbolic IP addresses, cannot be used in LNET module parameters.)
${LMC} --add net --node megan --nettype lnet --nid megan
${LMC} --add net --node oscar --nettype lnet --nid oscar
${LMC} --add net --node client --nettype lnet --nid '*'
Modprobe.conf is the same on all nodes (we can omit this step as well, since this is the default):
options lnet networks=tcp0
The servers megan and oscar are started with lconf, while clients are started by using the mount command:
mount -t lustre megan:/mdsA/client /mnt/lustre
4.2 Example One: Simple Lustre Network
Eight OSS
One MDS
40 Lustre clients
All dual processor Intel machines
Storage provided via DDN nodes
All gigabit network
Typically 17 clients and two OSS for continuous Lustre testing
Testing new kernels and Lustre versions
Testing various configurations and software for integration into production clusters
4.2.3 Configuration Generation and Application
Shell script runs “lmc” to generate XML
Lustre XML stored in shared NFS
Zeroconf mounting used
Configurations also generated using the ltest framework
Uses same structure of shell script and shared NFS, but the hostnames and devices are determined by the test names.
4.3 Example Two: Lustre with NFS
Five OSS
One MDS
One Lustre client and NFS Server
One AIX NFS client
All dual processor Intel machines
Storage provided from serial ATA drives connected to 3ware RAID cards with 32TB total available
Two gigabit Ethernet networks, one each for Lustre and NFS
Streaming input/output performance over NFS typically 50-75% of Lustre
No failover enabled or configured
Initially three OSS configured, two more added when data from GFS was migrated in
Near-line storage for ESMF computing facility
Replaced GFS – very poor performance and reliability, could not go beyond 2TB/fs limit
Scientific application data that is pushed off local AIX storage
4.3.3 Configuration Generation and Application
Shell script runs “lmc” to generate XML
Scp used to distribute XML to Lustre nodes
Custom script used from AIX client to start Lustre, NFS server and mount NFS client
Zeroconf not used, could be added without issue
Need to verify it does not reorder LOV when additional OSTs are added
4.4 Example Three: Exporting Lustre with Samba
Two OSS
One MDS
Two Lustre clients; one exports Lustre via Samba
One Windows Samba client, two Mac samba clients
One flat gigabit Ethernet network
Failover is enabled, no failover pairs configured
/home filespace for Linux
My Documents stored from Windows
Both Mac and Windows access streaming music stored in Lustre
Streaming input/output of 70-80 MB/s, ~18 MB/s over Samba
Typical Webfarm Home Network
4.4.4 Configuration Generation and Application
Shell script runs “lmc” to generate XML
Lconf used to start MDS/OST by hand
Zeroconf mounting used on Lustre clients
4.5 Example Four: Heterogeneous Network with Failover Support
64 OSS, 96 OSTs; each OSS has two gigE connections
Two MDS, one primary but not configured for failover
16 Portals routers; each has four gigE and one Elan connection
~1000 Lustre clients (NB: once federated gigE network is installed we hope to mount another 2000 clients (MCR clients and BGL IONs))
Heterogeneous:
Ia64 clients, MDS, Portals routers
Ia32 OSS
Storage is provided with 12 tiers of disk over eight DDN nodes
Failover is enabled, no failover pairs configured (the hardware is there, but not yet configured due to perceived and real lack of reliable failover software and its ease of integration with Lustre)
Scientific computation
~180 TB file system
Theoretical best performance: 6.4GB/s; best observed: 5.5GB/s; typical observed: 2.0GB/s
4.5.3 Model of Storage
Typical LARGE HPC installation with mixed networking
4.5.4 Configuration Generation and Application
Home-grown bash scripts provide batch input to LMC
XML still edited by hand when new configurations are tested
Rare problems with the python XML parser
Configuration files distributed by custom configuration mgmt
Lctl used in home-grown scripts to check certain values against known correct values in order to verify health of servers or clients
/etc/init.d/lustre used to start OSTs, routers and MDS
zconf used whenever possible – much faster and friendlier
4.6 Example Five: OSS with Multiple OSTs
4.6.1 Installation Summary (*target)
224 OSS, 448 OSTs; each OSS has two gigE connections
Two MDS, one primary but not configured for failover
~1024 Lustre clients ― input/output nodes (ION)
64 Compute nodes (CN) per ION
Total of 65,536 CN
CN do not see Lustre, input/output forwarded to IONs
All gigabit Ethernet networking for Lustre
Compute nodes communicate with ION through the tree network
Storage is provided with 16 tiers on eight DDN nodes
Currently in initial stages of Lustre testing
Scientific computation
~900 TB file system
Theoretical best performance: 40GB/s
Next generation ultra-large HPC installation
4.6.4 Configuration Generation and Application
Generating and applying the configuration in Example Five: OSS with Multiple OSTs is similar to Example Four: Heterogeneous Network with Failover Support. Please refer to 4.5.4 Configuration Generation and Application.
4.7 Example Six: Client with Sub-clustering Support
104 OSS
One MDS
1280 Lustre clients
Clients are arranged in sub-clusters of 256 nodes
All dual processor Intel machines
Storage provided via DDN nodes
All gigabit network
Peak performance of 11.1GB/s
NSF grants allocations to researchers all over the world
General scientific load includes chemistry, cosmology, weather
4.7.3 Configuration Generation and Application
Shell script runs “lmc” to generate XML
Lustre XML stored in shared NFS
Zeroconf mounting used
Servers megan and oscar each have three tcp NICs (eth0, eth1, and eth2) and an elan NIC. eth2 is used for management purposes and should not be used by LNET. TCP clients have a single TCP interface and Elan clients have a single Elan interface.
Options under modprobe.conf are used to specify the networks available to a node. You have the choice of two different options – the networks option, which explicitly lists the networks available and the ip2nets option, which provides a list-matching lookup. Only one of these options can be used at any one time. The order of LNET lines in modprobe.conf is important when configuring multi-homed servers. If a server node can be reached using more than one network, the first network specified in modprobe.conf will be used.
Networks
On the servers:
options lnet 'networks="tcp0(eth0,eth1),elan0"'
Elan-only clients:
options lnet networks=elan0
TCP-only clients:
options lnet networks=tcp0
NOTE: In the case of TCP-only clients, all available IP interfaces are used for tcp0 since no interfaces are specified. If there is more than one, the IP of the first one found is used to construct the tcp0 NID.
ip2nets
The ip2nets option is typically used to provide a single, universal modprobe.conf file that can be used on all servers and clients. An individual node identifies the locally available networks based on the listed IP address patterns that match the node's local IP addresses. Note that the IP address patterns listed in this option (ip2nets) are used only to identify the networks that an individual node should instantiate; they are not used by LNET for any other communication purpose. The servers megan and oscar have eth0 IP addresses 192.168.0.2 and .4. They also have IP over Elan (eip) addresses of 132.6.1.2 and .4. TCP clients have IP addresses 192.168.0.5-255. Elan clients have eip addresses of 132.6.[2-3].2, .4, .6, .8.
Modprobe.conf is identical on all nodes:
options lnet 'ip2nets=" \ tcp0(eth0,eth1)192.168.0.[2,4]; tcp0 192.168.0.*; elan0 132.6.[1-3].[2-8/2]"'
NOTE: LNET lines in modprobe.conf are used by the local node only to determine what to call its interfaces. They are not used for routing decisions.
Because megan and oscar match the first rule, LNET uses eth0 and eth1 for tcp0 on those machines. Although they also match the second rule, it is the first matching rule for a particular network that is used. The servers also match the (only) elan rule. The [2-8/2] format matches the range 2-8 stepping by 2; that is 2,4,6,8. For example, clients at 132.6.3.5 would not find a matching Elan network.
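To make the matching concrete, using the addresses above: a TCP-only client at 192.168.0.37 matches only the second rule, so it instantiates tcp0 on its interfaces, and (per the NOTE earlier) the first IP found, 192.168.0.37, is used to construct its tcp0 NID. Megan, at 192.168.0.2, matches the first rule, so tcp0 is instantiated over eth0 and eth1, and megan's tcp0 NID is 192.168.0.2@tcp0, as used in the LMC commands below.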
5.1.2 LMC configuration preparation
The tcp NIDs specified should use the address of the first TCP interface listed in the networks or ip2nets options line above (eth0).
${LMC} --add net --node megan --nettype lnet --nid 192.168.0.2@tcp0
${LMC} --add net --node megan --nettype lnet --nid 2@elan
${LMC} --add net --node oscar --nettype lnet --nid 192.168.0.4@tcp0
${LMC} --add net --node oscar --nettype lnet --nid 4@elan
A single client profile will work for both tcp and elan clients:
${LMC} --add net --node client --nettype lnet --nid '*'
NOTE: In the example above, the --add net option for each interface uses the same --node parameter, while the --nid parameter changes to specify the NID of that interface.
Start servers with lconf. The recommended order is the OSSs then the MDS. (Remember to start Lustre on servers with: lconf config.xml.)
TCP clients can use the host name or IP address of the MDS:
mount -t lustre megan@tcp0:/mdsA/client /mnt/lustre
Elan clients will be started with:
mount -t lustre 2@elan0:/mdsA/client /mnt/lustre
Servers megan and oscar are on the elan network with eip addresses 132.6.1.2 and .4. Megan is also on the TCP network at 192.168.0.2 and routes between TCP and elan. There is also a standalone router, router1, at elan 132.6.1.10 and tcp 192.168.0.10. Clients are on either elan or tcp.
Modprobe.conf is identical on all nodes:
options lnet 'ip2nets="tcp0 192.168.0.*; elan0 132.6.1.*"' 'routes="tcp [2,10]@elan0; elan 192.168.0.[2,10]@tcp0"
5.2.2 LMC configuration preparation
${LMC} --add net --node megan --nettype lnet --nid 192.168.0.2@tcp0
${LMC} --add net --node megan --nettype lnet --nid 2@elan
${LMC} --add net --node oscar --nettype lnet --nid 4@elan
${LMC} --add net --node client --nettype lnet --nid '*'
router1:
modprobe lnet
lctl network configure
megan and oscar:
lconf route.xml
tcp client:
mount -t lustre megan:/mdsA/client /mnt/lustre/
elan client:
mount -t lustre 2@elan0:/mdsA/client /mnt/lustre
We say a computer system is highly available when the services it provides are available with minimum downtime: even under failure conditions such as the loss of a server, or a network or software fault, the services being provided remain unaffected from the user's point of view. Availability is generally measured as the percentage of time that the system is required to be, and actually is, available.
Availability is accomplished by providing replicated hardware and/or software, so that the failure of any one system is covered by its paired system. “Failover” is a method of automatically switching an application and its supporting resources to a standby server when the primary system fails, or when the service is temporarily shut down for maintenance. Failover should be automatic and, in most cases, completely application-transparent.
Lustre failover requires two nodes (a failover pair), which must be connected to a shared storage device. Lustre supports failover for both metadata and object storage servers.
Lustre provides a file system resource. The Lustre file system supports failover at the server level. Lustre does not provide the tool set for the system-level components necessary for a complete failover solution (node failure detection, power control, and so on), as this functionality has been available for some time from third party tools. CFS does provide the necessary scripts to interact with these packages, and exposes health information for system monitoring. The recommended choice is the Heartbeat package from linux-ha.org. Lustre will work with any HA software that supports resource (I/O) fencing. The Heartbeat software is responsible for detecting failure of the primary server node and controlling the failover.
The hardware setup requires a pair of servers with a shared connection to physical storage (such as SAN, NAS, hardware RAID, SCSI or Fibre Channel). The method of sharing the storage should be essentially transparent at the device level; the same physical LUN should be visible from both nodes. To ensure high availability at the level of physical storage, we encourage the use of RAID arrays to protect against drive-level failures.
To have a fully automated, highly available Lustre system, you need power management software and HA software, which must provide the following -
A) -- Resource fencing - Physical storage must be protected from simultaneous access by two nodes
B) -- Resource control - Starting and stopping the Lustre processes as a part of failover, maintaining the cluster state, and so on
C) -- Health monitoring - Verifying the availability of hardware and network resources, responding to health indications given by Lustre.
For proper resource fencing, the Heartbeat software must be able to completely power off the server or disconnect it from the shared storage device. It is absolutely vital that no two active nodes access the same partition; the risk is severe data corruption. When Heartbeat detects a server failure, it calls a process (STONITH) to power off the failed node, and then starts Lustre on the secondary node. The HA software controls the Lustre resources with a service script; CFS provides /etc/init.d/lustre for this purpose.
Servers providing Lustre resources are configured in primary/secondary pairs for the purpose of failover. A system administrator can fail over manually with lconf. When an “lconf --cleanup --failover” command is issued, the disk device is set read-only, which allows the second node to start the service using that same disk after the command completes. This is known as a soft failover, in which case both servers can be running and connected to the net. Powering the node off is known as a hard failover.
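For example, a manual soft failover of the services on the current node might look like the following (a sketch only; the XML path follows the /etc/lustre/config.xml convention used later in this chapter):
lconf --cleanup --failover /etc/lustre/config.xml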
To automate failover with Lustre, one needs a power management software, remote control power equipment, and HA software.
6.1.1 The Power Management Software
The linux-ha package includes a set of power management tools, known as STONITH (Shoot The Other Node In The Head). STONITH has native support for many power control devices, and is extensible. It uses expect scripts to automate control. PowerMan, by the Lawrence Livermore National Laboratory, is a tool for manipulating remote power control (RPC) devices from a central location. Several RPC varieties are supported natively by PowerMan.
The latest version is available on
http://www.llnl.gov/linux/powerman/
A multi-port, Ethernet addressable Remote Power Control is relatively inexpensive. Consult the list of supported hardware on the PowerMan site for recommended products. Linux Network Iceboxes are also very good tools. They combine both the remote power control and the remote serial console into a single unit.
The heartbeat program is one of the core components of the Linux-HA (High-Availability Linux) project. Heartbeat is highly portable, and runs on every known Linux platform, and also on FreeBSD and Solaris.
For more information, see:
http://linux-ha.org/heartbeat/
For download, go to:
CFS supports both Heartbeat V1 and Heartbeat V2. V1 has a simpler configuration and works very well. V2 adds monitoring and supports more complex cluster topologies. The linux-ha web site contains a great deal of information. We recommend it as a resource.
6.1.3.1 Roles of Nodes in a Failover
A failover pair of nodes can be configured in two ways – active/active and active/passive. An active node actively serves data and a passive node is idle, standing by to take over in the event of a failure. In the example case of using two OSTs (both of which are attached to the same shared disk device), the following failover configurations are possible:
active/passive - This configuration has two nodes, of which only one actively serves data at any time. In case of a failure, the other node takes over.
If the active node fails, the OST in use by it is taken over by the passive node, which now becomes active. This node will serve most of the services that were on the failed node.
active/active - This configuration has two nodes actively serving data all the time. In case of a failure, one node takes over for the other.
To configure this with respect to the shared disk, the shared disk would need to provide multiple partitions, and each of the OSTs would be the primary server for one partition and the secondary server for the other partition. The active/passive configuration doubles the hardware cost without improving performance, and is seldom used for OST servers.
The OST has two operating modes: failover and failout. The default mode is failover. In this mode, clients reconnect after a failure and the transactions that were in progress are completed. Data on the OST is written synchronously, and the client replays uncommitted transactions after the failure.
In failout mode, when any communication error occurs, the client attempts to reconnect but cannot continue the transactions that were in progress during the failure. Also, if the OST actually fails, data that has not been written to disk (still cached on the client) is lost. Applications usually see an -EIO for operations done on that OST until the connection is re-established. However, the LOV layer on the client avoids using that OST, so operations such as file creates and fsstat still succeed. The failover mode is the current default; the failout mode is seldom used.
The MDS has only one failover mode: active/passive, as only one MDS may be active at a given time.
6.4 Configuring MDS and OSTs for Failover
The failover MDS and OSTs are configured in the same way – multiple objects are added to the configuration with the same service name. The --failover option is specified on at least one of the objects to enable the failover mode. (This is required to enable failover on the OSTs.) For example, to create a failover OST named ost1 on nodes nodeA and nodeB with a shared disk device referenced on both nodes as /dev/sdb1; you can follow the steps below:
lmc --add ost --ost ost1 --failover --node nodeA --lov lov1 --dev /dev/sdb1
lmc --add ost --ost ost1 --failover --node nodeB --lov lov1 --dev /dev/sdb1
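The MDS failover objects are added analogously. A sketch only (the device path and names are illustrative, and we assume the --add mds target accepts the same --node, --dev and --failover options, as the text above indicates):
lmc --add mds --mds mds1 --failover --node nodeA --dev /dev/sdc1
lmc --add mds --mds mds1 --failover --node nodeB --dev /dev/sdc1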
In addition, CFS recommends setting the mount option errors=panic (the default is “errors=ro”) to further protect data in the event of a disk issue.
6.4.1 Starting / Stopping a Resource
You can use the lconf --service option to override the current active node for a particular service, or to start services individually. To start ost1 on nodeB in the above example:
lconf --service=ost1 <path to XML>
6.4.2 Active/Active Failover Configuration
With OST servers, it is possible to have a load-balanced active/active configuration: each node is the primary node for a group of OSTs and the failover node for other groups. To expand the simple two-node example, we add ost2, which is primary on nodeB and lives on the LUNs nodeB:/dev/sdc1 and nodeA:/dev/sdd1. This demonstrates that the /dev/ identity can differ between nodes; both devices must nevertheless map to the same physical LUN.
lmc --add ost --ost ost1 --failover --node nodeA --group nodeA --lov lov1 --dev /dev/sda1
lmc --add ost --ost ost1 --failover --node nodeB --lov lov1 --dev /dev/sdb1
lmc --add ost --ost ost2 --failover --node nodeB --group nodeB --lov lov1 --dev /dev/sdc1
lmc --add ost --ost ost2 --failover --node nodeA --lov lov1 --dev /dev/sdd1
If the --group nodeB option is used, then only the active services in group nodeB will be started. This is generally used on a node that is already running the services. If --group is not used, then all the services active on nodeA will be started, which is generally what is needed at boot time.
To return to the original load-balanced configuration, first stop the service in the failover mode. Restart it on the original node, and then update the active node for the affected services. The clients will treat this as a failover, and recover normally.
# On nodeA, limit the scope to only group nodeB
lconf --cleanup --force --service=ost2 <config.xml>
# On nodeB
lconf --service=ost2 <config.xml>
6.4.3 Hardware Configurations
6.4.3.1 Hardware Preconditions
The setup must consist of a failover pair where each node of the pair has access to shared storage. If possible, the storage paths should be identical (nodeA:/dev/sda == nodeB:/dev/sda).
Shared storage can be arranged in an active/passive (MDS,OSS) or active/active (OSS only) configuration. Each shared resource will have a primary (default) node. Heartbeat will assume that the non-primary node is secondary for that resource.
The two nodes must have one or more communication paths for heartbeat traffic. A communication path can be:
dedicated Ethernet
serial link (serial crossover cable)
Failure of all heartbeat communication paths must be avoided; this condition is called “split-brain”, and the heartbeat software will resolve it by powering down one node.
The two nodes must have a method to control each other's state; Remote Power Control hardware is best. There must be a script to start and stop a given node from the other node. STONITH provides soft power control methods (ssh, meatware), but these cannot be used in a production situation.
Heartbeat provides a remote ping service that is used to monitor the health of the external network. If you wish to use the ipfail service, you must have a very reliable external address to use as the ping target. Typically, this would be a firewall router, or another very reliable network endpoint external to the cluster.
6.5 Instructions for Failover Setup with Heartbeat Version1
1. Install Lustre as described in Chapter II – 2. Lustre Installation.
2. Install RPMs required for configuring Heartbeat
The following packages are needed for Heartbeat (v1). We used the 1.2.3-1 version. Red Hat supplies v1.2.3-2. Heartbeat is available as an RPM or source.
Heartbeat packages, in order:
heartbeat-stonith -> heartbeat-stonith-1.2.3-1.i586.rpm
heartbeat-pils -> heartbeat-pils-1.2.3-1.i586.rpm
heartbeat itself -> heartbeat-1.2.3-1.i586.rpm
You can find the above RPMs at the location given below -
http://linux-ha.org/download/index.html#1.2.3
3. Install Prerequisites
Heartbeat 1.2.3 installation requires the following:
python
openssl
libnet-> libnet-1.1.2.1-19.i586.rpm
libpopt -> popt-1.7-274.i586.rpm
librpm -> rpm-4.1.1-222.i586.rpm
glib -> glib-2.6.1-2.i586.rpm
glib-devel -> glib-devel-2.6.1-2.i586.rpm
6.5.2.1 Lustre Configuration
Add the secondary servers to your configuration and re-create your XML configuration if necessary
Create the directory /etc/lustre
Copy your XML file to /etc/lustre/config.xml
Verify that /etc/init.d/lustre exists
Note the names of your OST and MDS resources
Decide which node owns each resource
6.5.2.2 Heartbeat Configuration
A. Basic Configuration - no STONITH
The linux-ha web site has several guides covering basic setup and initial testing of Heartbeat; we advise reading them.
1. It is good to configure and test the Heartbeat setup before adding STONITH.
Let us assume two nodes, nodeA and nodeB, where nodeA owns ost1 and nodeB owns ost2. Both nodes have a dedicated Ethernet link (eth0) and a serial crossover link (/dev/ttyS0), and both ping a remote host (192.168.0.3) for health.
a. Create /etc/ha.d/ha.cf
This file must be identical on both the nodes
Follow the order of the directives as it matters
See the sample ha.cf file in the section 6.5.4.3 ha.cf of this chapter
b. Create /etc/ha.d/haresources
This file must be identical on both the nodes
It specifies a virtual IP address, and a service
See the sample in the section 6.5.4.4 haresources of this chapter
The virtual IP address should be on a subnet matching a physical Ethernet. Failure to do so will result in error messages, but these errors will not be fatal.
c. Create /etc/ha.d/authkeys
Copy example from /usr/share/doc/heartbeat-<version>
chmod the file '0600' – heartbeat will not start if the permissions on this file are incorrect.
d. Execute the following commands to create symlinks between /etc/init.d/lustre and /etc/ha.d/resource.d/<lustre service name>
$ ln -s /etc/init.d/lustre /etc/ha.d/resource.d/ost1
$ ln -s /etc/init.d/lustre /etc/ha.d/resource.d/ost2
e. Restart heartbeat
Monitor the syslog on both nodes. After the initial deadtime interval, you should see the nodes discovering each other's state, and then they will start the Lustre resources they own. You should see the startup command in the log:
Sep 7 10:42:40 d1_q_0 heartbeat: info: Running \ /etc/ha.d/resource.d/ost1 start
In this example, 'ost1' is our shared resource. Common things to watch out for:
If you configure two nodes as primary for one resource, you will see both nodes attempt to start it. This is very bad. Shut down immediately and correct your haresources files.
If the communication between nodes is not correct, both nodes may attempt to mount the same resource, or may attempt to STONITH each other. There should be many error messages in syslog indicating a communication fault.
When in doubt, you can set a Heartbeat debug level in ha.cf – levels above 5 will produce huge volumes of data.
f. Try some manual failover/failback. Heartbeat provides two tools for this purpose (by default they are installed in /usr/lib/heartbeat) –
hb_standby [local|foreign] – Causes a node to yield resources to another node – if a resource is running on its primary node it is local, otherwise it is foreign.
hb_takeover [local|foreign] – Causes a node to grab resources from another node.
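For example, run on the node currently holding its own (local) resources, the following hands them over to the peer:
$ /usr/lib/heartbeat/hb_standby local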
B. Basic Configuration - Adding STONITH
STONITH automates the process of power control with the expect package. Expect scripts are very dependent on the exact set of commands provided by each hardware vendor, and as a result any change made in the power control hardware/ firmware will require tweaking STONITH.
Much must be deduced by running the STONITH package by hand. STONITH has some supplied packages, but can also run with an external script. There are two STONITH modes:
a. Single STONITH command for all nodes found in ha.cf:
--------/etc/ha.d/ha.cf-------------------
stonith <type> <config file>
b. STONITH command per-node:
-------/etc/ha.d/ha.cf--------------------
stonith_host <hostfrom> <stonith_type> <params...>
You can use an external script to kill each node:
stonith_host nodeA external foo /etc/ha.d/reset-nodeB
stonith_host nodeB external foo /etc/ha.d/reset-nodeA
Here, foo is a placeholder for an unused parameter.
To get the proper syntax:
$ stonith -L
The above command lists supported models.
$ stonith -l -t <model>
The above command lists required parameters, and specifies config file name.
You should attempt a test with:
$ stonith -l -t <model> <fake host name>
This will also give data on what is required. You can then test using a real host name. The external STONITH scripts should take the parameters {start|stop|status} and return 0 or 1.
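As an illustration of that contract, here is a minimal external script skeleton; this is a sketch only, and the command that actually power-cycles the peer node via your remote power control hardware must be filled in:
#!/bin/sh
# Minimal external STONITH script sketch.
# Heartbeat invokes it with start|stop|status; return 0 on success, 1 on failure.
case "$1" in
start|stop)
# Insert the vendor-specific command that powers the peer node on/off here.
exit 0
;;
status)
# Report whether the power control device itself is reachable.
exit 0
;;
*)
exit 1
;;
esac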
STONITH only happens when the cluster cannot do things in an orderly manner. If two cluster nodes can communicate, they usually shut down properly. This means many tests will not produce a STONITH, for example:
Calling init 0 or shutdown or reboot on a node, orderly halt, no STONITH
Stopping the heartbeat service on a node, again, orderly halt, no STONITH
You really have to do something drastic (for example, killall -9 heartbeat, or pulling cables) before you trigger STONITH.
Also, the alert script does a software failover, which halts Lustre but does not halt or STONITH the system. To use STONITH, edit the fail_lustre.alert script (section 6.5.4.2 fail_lustre.alert) and add your preferred shutdown command after the line -
`/usr/lib/heartbeat/hb_standby local &`;
A simple method to halt the system is the sysrq method:
#!/bin/bash
# This script will force a reboot
# 'echo s' = sync
# 'echo u' = remount read-only
# 'echo b' = reboot
SYST="/proc/sysrq-trigger"
if [ ! -f $SYST ]; then
echo "$SYST not found!"
exit 1
fi
# sync, remount read-only, sync, reboot
echo s > $SYST
echo u > $SYST
echo s > $SYST
echo b > $SYST
exit 0
6.5.3 Mon (Status Monitor)
Mon requires two scripts:
A monitor script, which checks a resource for health
An alert script, which is triggered by failure of the monitor
Mon requires one configuration file:
/etc/mon/mon.cf
We use a trap-based monitor. The trap is set with a time interval. The trap is cleared by checking Lustre health. If the trap is not cleared, mon will trigger a failover.
All monitors are configured in one file. Mon is started as a service at boot prior to heartbeat startup. All monitors are disabled at startup and enabled by Heartbeat in conjunction with resource startup/shutdown.
6.5.3.1 Mon Setup and Configuration
A. Install Prerequisites for Mon
Mon is not required for a basic failover setup, and it is not required for Heartbeat V2, as monitoring is included in V2.
Heartbeat monitors the health of the node. Adding Mon to the setup allows us to monitor application health, the application in this case being Lustre.
The base package is available from
ftp://ftp.kernel.org/pub/software/admin/
Mon requires the following Perl packages:
Time::Period
Time::HiRes
Convert::BER
Mon::SNMP
As always when installing Perl packages, we recommend using CPAN. The packages are also available as tarballs (see cpan.org).
B. Install Mon
After installing the Perl packages, get the Mon tarball from:
ftp://ftp.kernel.org/pub/software/admin/mon/
Untar the tarball
Copy the Mon program to a location on the root path
(/usr/lib/mon/mon is default)
Install the moncmd program
For this setup, CFS has altered the Mon startup a bit (see the section 6.5.4.10 S99mon.patch). You must patch the S99mon script, and install the result as /etc/init.d/mon – set this routine to start at boot, prior to heartbeat startup
$ chkconfig --add mon
Verify that the path for moncmd in the init script matches where you installed moncmd (/usr/local/bin/moncmd is the default).
Create a set of Mon directories as specified in /etc/mon/mon.cf
cfbasedir = /etc/mon
alertdir = /usr/local/lib/mon/alert.d
mondir = /usr/local/lib/mon/mon.d
statedir = /usr/local/lib/mon/state.d
logdir = /usr/local/lib/mon/log.d
dtlogfile = /usr/local/lib/mon/log.d/downtime.log
Create the /etc/mon/auth.cf file. To allow everything in the command section, change AUTH_ANY to all.
Create the /etc/mon/mon.cf file
Starting with the provided example,
a. Verify that the correct paths are set
b. For each Lustre object, create two watches
The first watch runs the trap monitor
The second watch receives the trap
Both monitors will attempt to fail over Lustre if they fail
The monitor currently hard kills heartbeat to guarantee failover
A CFS user has provided a shell script that will generate a mon.cf file. It is provided in the section 6.5.4.7 mon.cf.
Copy the supplied trap generator script (mon.trap) to a proper location (/usr/local/lib/mon/)
This Perl script is based on a script found on the Mon mailing list. Other scripts are also available there
Copy the provided Lustre monitor script (lustre.mon.trap) to the mon monitor directory (/usr/local/lib/mon/mon.d)
a. Verify that the location of TRAPPER points at the trap generation script (6.5.4.9 mon.trap)
b. Verify that the name matches the script specified in /etc/mon/mon.cf
c. This script is based on /etc/init.d/lustre
Copy the provided Lustre alert script to the mon alert directory
(/usr/local/lib/mon/alert.d)
a. Verify the name matches script specified in /etc/mon/mon.cf
b. This is a stock script from the mon package
c. For the Lustre failover sequence, you are free to choose another method of triggering the transition
The script will not STONITH the node
You should edit the script to provide hard node power off or reboot if needed
Add Mon to the heartbeat configuration.
Copy the lustre-resource-monitor script to the Heartbeat resource directory (/etc/ha.d/resource.d)
Give the script a unique name (alpha-mon, beta-mon)
Edit the script, and set MONLIST to the service names to be monitored (two services per object as defined in /etc/mon/mon.cf)
Edit /etc/ha.d/haresources to add the mon scripts – the mon script will appear on the same line as the Lustre resource
Restart heartbeat
The trap should appear in syslog:
Apr 26 13:45:38 d2_q_0 mon[3000]: trap trap 1 from 192.168.0.150 for alpha-ost lustre_a, status 255
6.5.4 Scripts
In this section, all the scripts necessary for Failover setup with Heartbeat Version 1 are given. The scripts are listed below in the order they appear in the manual.
auth.cf
fail_lustre.alert
ha.cf
haresources
lustre.mon.trap
lustre-resource-monitor
mon.cf
mon.init
mon.trap
S99mon.patch
simple.health_check.monitor
6.5.4.1 auth.cf
#
# authentication file
#
# entries look like this:
# command: {user|all}[,user...]
#
# THE DEFAULT IS TO DENY ACCESS TO ALL IF THIS FILE
# DOES NOT EXIST, OR IF A COMMAND IS NOT DEFINED HERE
# command section
command section
ack: all
checkauth: all
clear: all
disable: all
dump: all
enable: all
get: all
list: all
loadstate: all
protid: all
quit: all
reload: all
reset: all
savestate: all
servertime: all
set: all
start: all
stop: all
term: all
test: all
version: all
# trap section
#
# if no source hosts or users are defined, then do not
# accept traps
#
trap section
#source_host user password
#
# allow from user "mon" from any host
#
# * mon monpassword
#
# allow from host 127.0.0.1 without requiring
# a valid username and password
#
# localhost * *
# d1_q_0 * *
# d2_q_0 * *
# 127.0.0.1 * *
* * *
#
6.5.4.2 fail_lustre.alert
#!/usr/bin/perl
#
# template for an alert
#
# Jim Trocki, trockij@transmeta.com
#
# $Id: alert.template 1.1 Sat, 26 Aug 2000 15:22:34 -0400 trockij $
#
# Copyright (C) 1998, Jim Trocki
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
use Getopt::Std;
getopts ("s:g:h:t:l:u");
#
# the first line is summary information, adequate to send to a pager
# or email subject line
#
#
# the following lines normally contain more detailed information,
# but this is monitor-dependent
#
# see the "Alert Programs" section in mon(1) for an explanation
# of the options that are passed to the monitor script.
#
$summary=<STDIN>;
chomp $summary;
$t = localtime($opt_t);
($wday,$mon,$day,$tm) = split (/\s+/, $t);
print <<EOF;
Alert for group $opt_g, service $opt_s
EOF
print "This alert was sent because service was restored\n"
if ($opt_u);
print <<EOF;
This happened on $wday $mon $day $tm
Summary information: $summary
Arguments passed to this script: @ARGV
Detailed information follows:
EOF
# We will do a very simple setup here
# We will attempt to release all resources
`/usr/lib/heartbeat/hb_standby local`;
`/usr/lib/heartbeat/hb_standby foreign`;
while (<STDIN>) {
print;
}
6.5.4.3 ha.cf
# Suggested fields - logging
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
# Required fields - Timing
keepalive 2
deadtime 30
initdead 120
# if using serial heartbeat
baud 19200
serial /dev/ttyS0
# for ethernet broadcast
udpport 694
bcast eth0
# use manual fail back
auto_failback off
# Cluster members - name must match `hostname`
node d1_q_0
node d2_q_0
# remote health ping
ping 192.168.0.3
respawn hacluster /usr/lib/heartbeat/ipfail
# Uncomment for STONITH
# a. Single command for both nodes:
# stonith <type> <config file>
# b. per-node STONITH command
# stonith_host <hostfrom> <stonith_type> <params...>
# # Using an external script to kill each node
# stonith_host d1_q_0 external foo /etc/ha.d/reset-d2_q_0
# stonith_host d2_q_0 external foo /etc/ha.d/reset-d1_q_0
# 'foo' is a placeholder for an un-used parameter
6.5.4.4 haresources
d1_q_0 192.168.0.191 alpha-ost alpha-mon
d2_q_0 192.168.0.192 beta-ost beta-mon
6.5.4.5 lustre.mon.trap
#!/bin/sh
# This script monitors a Lustre object
# The object name is passed with the '-o' parameter
set -x
TRAPPER="/usr/local/lib/mon/mon.trap"
HOST=`hostname`
# Mon group
GROUP=$1
# Mon service
SERVICE=$2
# Lustre object or 'lustre'
SRV=$3
STATE="unknown"
GOOD_ARGS="-o ok -r 0 -s $STATE $HOST ${GROUP}:${SERVICE}"
BAD_ARGS="-o fail -r 1 -s $STATE $HOST ${GROUP}:${SERVICE}"
LOGFILE=/var/log/health.log
DT=`date`
echo "trap check $DT $HOST" >> $LOGFILE
if [ ! $SRV ]; then
echo "service not specified"
exit 1
fi
# This is the inverse of /etc/init.d/lustre
# We will look for failed states first.
# Missing conditions will trigger a failure.
# We exit by calling the trap routine.
# NOTE - if this node is a router-only node, this script will NOT WORK
# First, modules must be loaded
egrep -q "libcfs|lvfs|lnet" /proc/modules
[ $? -ne 0 ] && $TRAPPER $BAD_ARGS && exit 1
# Second the kernel dir must exist
[ ! -d /proc/fs/lustre ] && $TRAPPER $BAD_ARGS && exit 1
# Third, the health check must pass
HEALTH="/proc/fs/lustre/health_check"
[ -f "$HEALTH" ] && grep -q "NOT HEALTHY" $HEALTH && $TRAPPER $BAD_ARGS && exit 1
[ -f "$HEALTH" ] && grep -q "LBUG" $HEALTH && $TRAPPER $BAD_ARGS && exit 1
# Finally, if we are checking a specific service, it must be found
DUMMY=`lctl dl | grep -q $SRV`
[ $? -ne 0 ] && $TRAPPER $BAD_ARGS && exit 1
$TRAPPER $GOOD_ARGS
exit 0
6.5.4.6 lustre-resource-monitor
#!/bin/sh
#
# start/stop the mon server
#
# You probably want to set the path to include
# nothing but local filesystems.
#
# chkconfig: 2345 99 10
# description: mon system monitoring daemon
# processname: mon
# config: /etc/mon/mon.cf
# pidfile: /var/run/mon.pid
#
PATH=/bin:/usr/bin:/sbin:/usr/sbin
export PATH
# Source function library.
# . /etc/init.d/functions
# These next two items should be customized for your config
# Each mon instance must have a unique port
MYNAME=${0##*/}
SERVICE=`echo $MYNAME | sed 's/-mon//g'`
MONCONFIG="/etc/mon/mon.cf"
MONCMD="/usr/local/bin/moncmd"
HAPATH="/etc/ha.d/resource.d"
LUSTRE=$HAPATH/$SERVICE
sleep_lustre () {
# A function to delay until lustre startup.
STOP=5
SLEEP=30
$LUSTRE status
RC=$?
while [ $STOP -gt 0 ] && [ $RC -ne 0 ];
do
STOP=$(( STOP - 1 ))
sleep $SLEEP
$LUSTRE status
RC=$?
done
if [ $RC -ne 0 ];then
echo "Failed to start Lustre!"
exit 1
fi
}
# Here we are defining a naming convention. We need to start two
# watches. For a base name, the mon watches will use the name
# of the Lustre object being monitored (OBJECT).
# The mon watch that receives the traps will be called
# OBJECT-obj. The mon watch that sends the trap will be called
# OBJECT-mon. The OBJECT-mon watch is responsible for checking
# Lustre health and generating the trap.
# If the OBJECT-mon health check fails to run, it will
# also trigger a heartbeat takeover
MONLIST="${SERVICE}-mon ${SERVICE}-obj"
# See how we were called.
case "$1" in
start)
sleep_lustre
echo -n "Starting monitors : "
$MONCMD stop
for i in $MONLIST
do
echo "Enabling $i"
$MONCMD enable watch $i
done
$MONCMD start
$MONCMD list watch
exit 0
;;
stop)
echo -n "Stopping monitors : "
$MONCMD stop
for i in $MONLIST
do
echo "Disabling $i"
$MONCMD disable watch $i
done
$MONCMD start
$MONCMD list disabled
exit 0
;;
status)
$MONCMD list watch
exit 0
;;
restart)
killall -HUP mon
;;
*)
echo "Usage: mon {start|stop|status|restart}"
exit 1
esac
exit 0
6.5.4.7 mon.cf
#
# Example "mon.cf" configuration for "mon".
#
# $Id: example.cf 1.1 Sat, 26 Aug 2000 15:22:34 -0400 trockij $
#
#
# This works with 0.38pre8
#
#
# global options
#
cfbasedir = /etc/mon
alertdir = /usr/local/lib/mon/alert.d
mondir = /usr/local/lib/mon/mon.d
statedir = /usr/local/lib/mon/state.d
logdir = /usr/local/lib/mon/log.d
dtlogfile = /usr/local/lib/mon/log.d/downtime.log
maxprocs = 20
histlength = 100
randstart = 60s
#
# authentication types:
# getpwnam standard Unix passwd, NOT for shadow passwords
# shadow Unix shadow passwords (not implemented)
# userfile "mon" user file
#
authtype = getpwnam
# NB: hostgroup and watch entries are terminated with a blank line (or
# end of file). Don't forget the blank lines between them or you lose.
#
#
# group definitions (hostnames or IP addresses)
# EXAMPLE
# Two servers: d1_q_0, d2_q_0
# Two OSTs: ost-alpha, ost-beta
#
#
hostgroup beta-ost d1_q_0 d2_q_0
hostgroup beta-mon d1_q_0 d2_q_0
#
hostgroup alpha-ost d1_q_0 d2_q_0
hostgroup alpha-mon d1_q_0 d2_q_0
#
# Lustre failover - the trap script needs
# group service object
# The lustre.mon.trap script is based on one found on the
# mon mailing list; there are others available that will work
# equally well
# the 'group' and 'service' tags are used by mon only,
# They must be unique
#
watch beta-mon
service lustre_mon_b
description sends traps for the lustre service
interval 3m
monitor lustre.mon.trap beta-ost lustre_b ost-beta
period
alert fail_lustre.alert
watch beta-ost
service lustre_b
description will fail unless trap received
traptimeout 6m
period wd {Sat-Sun}
alert fail_lustre.alert
# Second OST
#
watch alpha-mon
service lustre_mon_a
description sends traps for the lustre service
interval 3m
monitor lustre.mon.trap alpha-ost lustre_a ost-alpha
period
alert fail_lustre.alert
watch alpha-ost
service lustre_a
description will fail unless trap received
traptimeout 6m
period
alert fail_lustre.alert
6.5.4.8 mon.init
#!/bin/sh
#
# start/stop the mon server
#
# You probably want to set the path to include
# nothing but local filesystems.
#
# chkconfig: 2345 99 10
# description: mon system monitoring daemon
# processname: mon
# config: /etc/mon/mon.cf
# pidfile: /var/run/mon.pid
#
PATH=/bin:/usr/bin:/sbin:/usr/sbin
export PATH
# Source function library.
. /etc/rc.d/init.d/functions
dismon() {
MONCMD="/usr/local/bin/moncmd"
for i in `$MONCMD list watch | awk '{print $1}'`
do
echo "Disabling watch $i"
$MONCMD disable watch $i
done
$MONCMD list watch
}
# See how we were called.
case "$1" in
start)
echo -n "Starting mon daemon: "
daemon /usr/lib/mon/mon -S -f -c /etc/mon/mon.cf
echo
dismon
echo
touch /var/lock/subsys/mon
;;
stop)
echo -n "Stopping mon daemon: "
killproc mon
echo
rm -f /var/lock/subsys/mon
;;
status)
status mon
;;
restart)
killall -HUP mon
;;
*)
echo "Usage: mon {start|stop|status|restart}"
exit 1
esac
exit 0
6.5.4.9 mon.trap
#!/usr/bin/perl
use strict;
use Getopt::Std;
use Mon::Client;
my @opstrings= (
"fail", "ok", "coldstart", "warmstart", "linkdown",
"unknown", "timeout", "untested",
);
my $usage= "montrap [-p port] [-r retval] -o opstatus -s summary [-d detail]
host group:service\n";
use vars qw($opt_p $opt_r $opt_o $opt_s $opt_d);
getopts("p:r:o:s:d:");
die $usage unless @ARGV == 2 and $ARGV[1] =~ /[^:]+:[^:]+/;
my $host= $ARGV[0];
my ($group, $service)= $ARGV[1] =~ /^([^:]+):([^:]+)/;
my $port= $opt_p || 2583;
my $retval= $opt_r || 255;
my $opstatus= $opt_o || die "montrap: '-o opstatus' required\n";
die "montrap: unrecognized opstatus: $opstatus\n" unless
grep $opstatus, @opstrings;
my $summary= $opt_s || die "montrap: '-s summary' required\n";
my $detail= $opt_d || "";
my $mon;
my $res;
my $failure;
if (!defined ($mon = Mon::Client->new( host => $host, port => $port, ))) {
die "$0: could not create client object: $@";
}
$mon->host($host);
$res = $mon->send_trap(
group=> $group,
service=> $service,
retval=> $retval,
opstatus=> $opstatus,
summary=> $summary,
detail => $detail,
);
if ( !$res ) {
$failure = $mon->error();
print $failure . "\n";
}
6.5.4.10 S99mon.patch
--- /home/cliffw/failover/mon/mon-0.99.2/etc/S99mon 2000-08-26 12:22:34.000000000 -0700
+++ mon.init 2006-04-26 10:26:14.000000000 -0700
@@ -17,13 +17,26 @@ export PATH
# Source function library.
. /etc/rc.d/init.d/functions
+dismon() {
+
+MONCMD="/usr/local/bin/moncmd"
+for i in `$MONCMD list watch | awk '{print $1}'`
+do
+ echo "Disabling watch $i"
+ $MONCMD disable watch $i
+done
+$MONCMD list watch
+}
# See how we were called.
case "$1" in
start)
echo -n "Starting mon daemon: "
- daemon /usr/lib/mon/mon -c /etc/mon/mon.cf
+ daemon /usr/lib/mon/mon -S -f -c /etc/mon/mon.cf
+ echo
+ dismon
echo
touch /var/lock/subsys/mon
+
;;
stop)
echo -n "Stopping mon daemon: "
6.5.4.11 simple.health_check.monitor
#!/bin/sh
# This script monitors a Lustre object
# The object name is passed with the '-o' parameter
# touch /etc/ha.d/nohb to stop on boot
KILLFILE="/etc/ha.d/nohb"
if [ -f $KILLFILE ]; then
echo "NO HEARTBEAT - remove $KILLFILE to start"
exit 0
fi
SRV='foo'
while getopts "o:" opt; do
case $opt in
o) SRV=$OPTARG
;;
\?) echo "Usage: health_check.monitor -o <service>"
echo "use -o lustre for all services"
esac
done
SERV=${!OPTIND}
ME=`hostname`
LOGFILE=/var/log/health.log
DT=`date`
echo "simple health $DT $SERV" >> $LOGFILE
if [ $ME != $SERV ];then
echo "$SERV Not local host" >> $LOGFILE
exit 0
fi
if [ $SRV == "foo" ]; then
SRV="lustre"
fi
if [ $SRV == "lustre" ];then
echo "all status"
unset SRV
fi
STATE="stopped"
# check for error in health_check
HEALTH="/proc/fs/lustre/health_check"
[ -f "$HEALTH" ] && grep -q "NOT HEALTHY" $HEALTH && STATE="unhealthy"
# check for LBUG
[ -f "$HEALTH" ] && grep -q "LBUG" $HEALTH && STATE="LBUG"
if [ "$SRV" ]; then
if [ "$DISCON" -o $STATE == "unhealthy" -o $STATE == "LBUG" ];then
exit 1
else
exit 0
fi
else
echo $STATE
fi
exit 0
6.6 Instructions for Failover Setup with Heartbeat Version2
6.6.1 Software Installations
1. Install Lustre as described in Chapter II – 2. Lustre Installation.
2. Install RPMs required for configuring Heartbeat.
The following packages are needed for Heartbeat (v2). We used the 2.0.4 version of Heartbeat.
Heartbeat packages, in order:
heartbeat-stonith -> heartbeat-stonith-2.0.4-1.i586.rpm
heartbeat-pils -> heartbeat-pils-2.0.4-1.i586.rpm
heartbeat itself -> heartbeat-2.0.4-1.i586.rpm
You can find all the RPMs at the location given below:
http://linux-ha.org/download/index.html#2.0.4
3. Install Prerequisites.
To install Heartbeat 2.0.4-1, you require:
Python
openssl
libnet-> libnet-1.1.2.1-19.i586.rpm
libpopt -> popt-1.7-274.i586.rpm
librpm -> rpm-4.1.1-222.i586.rpm
libltdl -> libtool-ltdl-1.5.16.multilib2-3.i386.rpm
libgnutls -> gnutls-1.2.10-1.i386.rpm
liblzo -> lzo2-2.02-1.1.fc3.rf.i386.rpm
glib -> glib-2.6.1-2.i586.rpm
glib-devel -> glib-devel-2.6.1-2.i586.rpm
6.6.2 Hardware Configurations
Heartbeat v2 runs well with an unaltered v1 configuration, which makes upgrading simple: you can test the basic function and quickly roll back if issues appear. Heartbeat v2 does not require a virtual IP address to be associated with a resource; this suits us, since we do not use virtual IPs.
Heartbeat v2 supports clusters of more than two nodes, though CFS has not tested a multi-node cluster; this section describes only the two-node case. A multi-node setup adds a score value to the resource configuration, which is used to decide the proper node for a resource when failover occurs.
Heartbeat v2 adds a cluster resource manager (crm). The resource configuration is maintained as an XML file, which is re-written by the cluster frequently. Any alterations to the configuration should be made with the HA tools or while the cluster is stopped.
6.6.2.1 Hardware Preconditions
The basic cluster assumptions are the same as those for Heartbeat v1. We are re-iterating the preconditions for the sake of clarity.
The setup must consist of a failover pair where each node of the pair has access to shared storage. If possible, the storage paths should be identical (d1_q_0:/dev/sda == d2_q_0:/dev/sda).
Shared storage can be arranged in an active/passive (MDS,OSS) or active/active (OSS only) configuration. Each shared resource will have a primary (default) node. The secondary node is assumed.
The two nodes must have one or more communication paths for heartbeat traffic. A communication path can be:
dedicated Ethernet
a serial link (serial crossover cable)
Failure of all heartbeat communication paths is a serious condition known as “split-brain”; the Heartbeat software resolves it by powering down one node.
The two nodes must have a method to control each other's state. Remote power control hardware is best; there must be a script to start and stop a given node from the other node. STONITH provides soft power control methods (ssh, meatware), but these cannot be used in a production situation.
Heartbeat provides a remote ping service that is used to monitor the health of the external network. If you wish to use the ipfail service, you must have a very reliable external address to use as the ping target.
6.6.2.2 Lustre Configuration
Lustre configuration is identical to the V1 case.
6.6.2.3 Heartbeat Configuration
As mentioned earlier, you can run Heartbeat v2 with a v1 configuration. To convert a v1 configuration to v2, use the haresources2cib.py script, typically found in /usr/lib/heartbeat. If you are starting with v2, we recommend creating a v1-style configuration and converting it, as the v1 style is human-readable. The Heartbeat XML configuration is located at /var/lib/heartbeat/crm/cib.xml, and the new resource manager is enabled with the crm yes directive in /etc/ha.d/ha.cf. Thorough details on all the configuration options, including the Cluster Information Base (CIB), can be found at:
http://linux-ha.org/ClusterInformationBase/UserGuide
A. Heartbeat log daemon
Heartbeat v2 adds a logging daemon, which manages logging on behalf of cluster clients. Calls into the UNIX syslog API can block, and Heartbeat requires log writes to complete as a sign of health; the logging daemon prevents a busy syslog from triggering a false failover. The logging configuration has moved to /etc/logd.cf, but the directives are essentially unchanged.
B. Basic configuration (No STONITH or monitor)
- Assuming two nodes, d1_q_0 and d2_q_0
- d1_q_0 owns ost-alpha
- d2_q_0 owns ost-beta
- dedicated Ethernet - eth0
- serial crossover link - /dev/ttyS0
- remote host for health ping - 192.168.0.3
a. Create symlinks from /etc/init.d/lustre to /etc/init.d/<resource_name>
- These links must exist before running the conversion script.
- Placing these scripts in /etc/init.d/ causes the conversion script to identify the script as type lsb. This gives us more flexibility for script parameters. Scripts found in /etc/ha.d/resource.d are considered to be of type heartbeat and have more restrictions.
b. Create the basic ha.cf and haresources files
- haresources no longer requires the dummy virtual IP address.
Example of /etc/ha.d/haresources
d1_q_0 ost-alpha
d2_q_0 ost-beta
Once you have these files created, you can run the conversion tool:
$ /usr/lib/heartbeat/haresources2cib.py -c basic.ha.cf basic.haresources > basic.cib.xml
c. Examine the cib.xml file
The first section in the XML file is <attributes>. The default values should be fine for most installations.
The actual resources are defined in the <primitive> section. The default behavior of Heartbeat is to automatically fail resources back when a server is restored. To avoid this, you must add a resource_stickiness parameter to the <primitive> definition. You may also want to reduce the timeouts. In addition, the current version of the conversion script does not name the parameters correctly, so check them against the modified example in 6.6.4.4, excerpted below.
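For example, the modified resource definition in 6.6.4.4 pins each resource to its current node and tightens the monitor interval and timeout:
<primitive class="lsb" id="ost-alpha_1" type="ost-alpha" resource_stickiness="INFINITY">
<operations>
<op id="1" interval="10s" name="monitor" timeout="30s" />
</operations>
</primitive>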
- Copy the modified resource file to /var/lib/heartbeat/crm/cib.xml
- Start Heartbeat
- After startup, Heartbeat will re-write the cib.xml, adding a <node> section and status information. Do not alter those fields.
C. Basic Configuration – Adding STONITH
Proceed as in B. Basic Configuration – Adding STONITH in section 6.5.2.3 Heartbeat Configuration. The best way to do this is to add the STONITH options to ha.cf and run the conversion script. A sample is given in section 6.6.4.5 HA with STONITH. See http://linux-ha.org/ExternalStonithPlugins for more information.
6.6.3 Operation
In normal operation, Lustre should be controlled by Heartbeat. Start Heartbeat at boot time; it will start Lustre after the initial dead time.
A. Initial startup
Stop heartbeat if running
If this is a new Lustre file system:
lconf --reformat /etc/lustre/config.xml (both nodes)
lconf --cleanup /etc/lustre/config.xml (both nodes)
If this is a new Lustre configuration, remember to run lconf --write_conf on the MDS
/etc/init.d/heartbeat start on one node
tail -f /var/log/ha-log to see progress
After initdead, this node should start all Lustre objects
/etc/init.d/heartbeat start on second node
After heartbeat is up on both nodes, fail the resources back to the second node. On the second node, run:
$ /usr/lib/heartbeat/hb_takeover local
You should see the resources stop on the first node and start on the second node.
B. Testing
Pull power from one node
Pull networking from one node
After Mon is set up, pull the connection between the OST and the backend storage
C. Failback
In the normal case, do the failback manually after determining that the failed node is healthy again. Lustre clients can continue working during a failback, but will block momentarily.
6.6.4 Scripts
In this section, all the scripts necessary for Failover setup with Heartbeat Version 2 are given. The scripts are listed below in the order they appear in the manual.
ha.cf
haresources
basic.cib.xml
Modified basic.cib.xml
HA with STONITH
Heartbeat CIB with basic STONITH
6.6.4.1 ha.cf
use_logd on
keepalive 1
deadtime 10
initdead 60
udpport 694
bcast eth1
baud 19200
serial /dev/ttyS1
auto_failback off
crm yes
node d1_q_0 d2_q_0
ping 192.168.0.60
respawn hacluster /usr/lib/heartbeat/ipfail
6.6.4.2 haresources
d1_q_0 ost-alpha
d2_q_0 ost-beta
6.6.4.3 basic.cib.xml
<cib>
<configuration>
<crm_config>
<cluster_property_set id="default">
<attributes>
<nvpair id="symmetric_cluster" name="symmetric_cluster" value="true" />
<nvpair id="no_quorum_policy" name="no_quorum_policy" value="stop" />
<nvpair id="default_resource_stickiness" name="default_resource_stickiness" value="0" />
<nvpair id="stonith_enabled" name="stonith_enabled" value="false" />
<nvpair id="stop_orphan_resources" name="stop_orphan_resources" value="false" />
<nvpair id="stop_orphan_actions" name="stop_orphan_actions" value="true" />
<nvpair id="remove_after_stop" name="remove_after_stop" value="false" />
<nvpair id="short_resource_names" name="short_resource_names" value="true" />
<nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="5min" />
<nvpair id="is_managed_default" name="is_managed_default" value="true" />
</attributes>
</cluster_property_set>
</crm_config>
<nodes />
<resources>
<primitive class="lsb" id="ost-alpha_1" provider="heartbeat" type="ost-alpha">
<operations>
<op id="ost-alpha_1_mon" interval="120s" name="monitor" timeout="60s" />
</operations>
</primitive>
<primitive class="lsb" id="ost-beta_2" provider="heartbeat" type="ost- beta">
<operations>
<op id="ost-beta_2_mon" interval="120s" name="monitor" timeout="60s" />
</operations>
</primitive>
</resources>
<constraints>
<rsc_location id="rsc_location_ost-alpha_1" rsc="ost-alpha_1">
<rule id="prefered_location_ost-alpha_1" score="100">
<expression attribute="#uname" id="prefered_location_ost-alpha_1_expr" operation="eq" value="d1_q_0" />
</rule>
</rsc_location>
<rsc_location id="rsc_location_ost-beta_2" rsc="ost-beta_2">
<rule id="prefered_location_ost-beta_2" score="100">
<expression attribute="#uname" id="prefered_location_ost-beta_2_expr" operation="eq" value="d2_q_0" />
</rule>
</rsc_location>
</constraints>
</configuration>
<status />
</cib>
6.6.4.4 Modified basic.cib.xml
<cib>
<configuration>
<crm_config>
<cluster_property_set id="default">
<attributes>
<nvpair id="symmetric_cluster" name="symmetric_cluster" value="true" />
<nvpair id="no_quorum_policy" name="no_quorum_policy" value="ignore" />
<nvpair id="default_resource_stickiness" name="default_resource_stickiness" value="0" />
<nvpair id="stonith_enabled" name="stonith_enabled" value="false" />
<nvpair id="stop_orphan_resources" name="stop_orphan_resources" value="false" />
<nvpair id="stop_orphan_actions" name="stop_orphan_actions" value="true" />
<nvpair id="remove_after_stop" name="remove_after_stop" value="false" />
<nvpair id="short_resource_names" name="short_resource_names" value="true" />
<nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="5min" />
<nvpair id="is_managed_default" name="is_managed_default" value="true" />
</attributes>
</cluster_property_set>
</crm_config>
<nodes />
<resources>
<primitive class="lsb" id="ost-alpha_1" type="ost-alpha" resource_stickiness="INFINITY">
<operations>
<op id="1" interval="10s" name="monitor" timeout="30s" />
</operations>
</primitive>
<primitive class="lsb" id="ost-beta_2" type="ost-beta" resource_stickiness="INFINITY">
<operations>
<op id="2" interval="10s" name="monitor" timeout="30s" />
</operations>
</primitive>
</resources>
<constraints>
<rsc_location id="rsc_location_ost-alpha_1" rsc="ost-alpha_1">
<rule id="prefered_location_ost-alpha_1" score="100">
<expression attribute="#uname" id="prefered_location_ost-alpha_1_expr" operation="eq" value="d1_q_0" />
</rule>
</rsc_location>
<rsc_location id="rsc_location_ost-beta_2" rsc="ost-beta_2">
<rule id="prefered_location_ost-beta_2" score="100">
<expression attribute="#uname" id="prefered_location_ost-beta_2_expr" operation="eq" value="d2_q_0" />
</rule>
</rsc_location>
</constraints>
</configuration>
<status />
</cib>
6.6.4.5 HA with STONITH
apiauth stonithd uid=root
respawn root /usr/lib/heartbeat/stonithd
use_logd on
keepalive 1
deadtime 10
initdead 60
udpport 694
bcast eth1
baud 19200
serial /dev/ttyS1
auto_failback off
crm yes
node d1_q_0 d2_q_0
ping 192.168.0.60
respawn hacluster /usr/lib/heartbeat/ipfail
stonith_host d1_q_0 ssh d1_q_0
stonith_host d2_q_0 ssh d2_q_0
6.6.4.6 Heartbeat CIB with basic STONITH
<cib>
<configuration>
<crm_config>
<cluster_property_set id="default">
<attributes>
<nvpair id="symmetric_cluster" name="symmetric_cluster" value="true" />
<nvpair id="no_quorum_policy" name="no_quorum_policy" value="stop" />
<nvpair id="default_resource_stickiness" name="default_resource_stickiness" value="0" />
<nvpair id="stonith_enabled" name="stonith_enabled" value="true" />
<nvpair id="stop_orphan_resources" name="stop_orphan_resources" value="false" />
<nvpair id="stop_orphan_actions" name="stop_orphan_actions" value="true" />
<nvpair id="remove_after_stop" name="remove_after_stop" value="false" />
<nvpair id="short_resource_names" name="short_resource_names" value="true" />
<nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="5min" />
<nvpair id="is_managed_default" name="is_managed_default" value="true" />
</attributes>
</cluster_property_set>
</crm_config>
<nodes />
<resources>
<primitive class="lsb" id="ost-alpha_1" provider="heartbeat" type="ost-alpha">
<operations>
<op id="1" interval="10s" name="monitor" timeout="20s" />
</operations>
</primitive>
<primitive class="lsb" id="ost-beta_2" provider="heartbeat" type="ost-beta">
<operations>
<op id="2" interval="10s" name="monitor" timeout="20s" />
</operations>
</primitive>
<primitive class="stonith" id="stonith_3" provider="heartbeat" type="ssh">
<operations>
<op id="stonith_3_mon" interval="5s" name="monitor" prereq="nothing" timeout="20s" />
<op id="stonith_3_start" name="start" prereq="nothing" timeout="20s" />
</operations>
<instance_attributes>
<attributes>
<nvpair id="stonith_3_attr_2" name="hostlist" value="d1_q_0" />
</attributes>
</instance_attributes>
</primitive>
<primitive class="stonith" id="stonith_4" provider="heartbeat" type="ssh">
<operations>
<op id="stonith_4_mon" interval="5s" name="monitor" prereq="nothing" timeout="20s" />
<op id="stonith_4_start" name="start" prereq="nothing" timeout="20s" />
</operations>
<instance_attributes>
<attributes>
<nvpair id="stonith_4_attr_2" name="hostlist" value="d2_q_0" />
</attributes>
</instance_attributes>
</primitive>
</resources>
<constraints>
<rsc_location id="rsc_location_ost-alpha_1" rsc="ost-alpha_1">
<rule id="prefered_location_ost-alpha_1" score="100">
<expression attribute="#uname" id="prefered_location_ost-alpha_1_expr" operation="eq" value="d1_q_0" />
</rule>
</rsc_location>
<rsc_location id="rsc_location_ost-beta_2" rsc="ost-beta_2">
<rule id="prefered_location_ost-beta_2" score="100">
<expression attribute="#uname" id="prefered_location_ost-beta_2_expr" operation="eq" value="d2_q_0" />
</rule>
</rsc_location>
<rsc_location id="rsc_location_stonith_3" rsc="stonith_3">
<rule id="prefered_location_stonith_3" score="100">
<expression attribute="#uname" id="prefered_location_stonith_3_expr" operation="eq" value="d1_q_0" />
</rule>
</rsc_location>
<rsc_location id="rsc_location_stonith_4" rsc="stonith_4">
<rule id="prefered_location_stonith_4" score="100">
<expression attribute="#uname" id="prefered_location_stonith_4_expr" operation="eq" value="d2_q_0" />
</rule>
</rsc_location>
</constraints>
</configuration>
<status />
</cib>
6.7 Considerations With Failover Software and Solutions
The failover mechanisms used by Lustre and tools such as Heartbeat are soft failover mechanisms. They check system and/or application health at a regular interval, typically measured in seconds. This, combined with the data protection mechanisms of Lustre, is sufficient for most user applications.
However, these soft mechanisms are not perfect. The Heartbeat poll interval is typically 30 seconds, and to avoid a false failover, Heartbeat waits for a deadtime interval before triggering one. In the normal case, a user I/O request blocks and recovers after the failover completes, but given the delay imposed by Heartbeat, this may not always happen.
Likewise, the Lustre health_check mechanism cannot be a perfect protection against all failures. It is a sample taken at a time interval, not something that brackets each and every I/O request. This is true of every HA monitor, not just the Lustre health_check.
There will indeed be cases where a user job dies before the HA software triggers a failover. You can certainly shorten timeouts, add monitoring, and take other steps to decrease this probability, but there is a serious trade-off: shortening timeouts increases the probability of falsely triggering a failover on a busy system, and additional monitoring consumes system resources and can likewise cause a false trigger.
Unfortunately, hard failover solutions capable of catching failures in the sub-second range generally require special hardware. As a result, they are beyond the scope of this manual.
Chapter II – 7. Configuring Quotas
Quotas allow a system administrator to limit the maximum amount of disk space a user or group can consume in a directory. Quotas are set by root and can be set for individual users and/or groups. Before a file is written to a partition where quotas have been set, the quota of the creator's group is checked first. If a quota exists for that group, the size of the file is counted towards that group's quota. If no quota exists for the group, the owner's user quota is checked before the file is written.
Lustre quota enforcement differs from standard Linux quota support in several ways:
it is administered via the lfs command
the quota is distributed (as Lustre is a distributed file system), which has several ramifications
the quota is allocated and consumed in a quantized fashion
clients do not pass the usrquota or grpquota options to mount; when quota is enabled, it is enabled for all clients of the file system and turned on automatically at mount
Enabling Quotas
If you have recompiled your Linux kernel, be certain that CONFIG_QUOTA and CONFIG_QUOTACTL are enabled (quota is enabled in all Linux 2.6 kernels supplied by CFS)
Rebuild your XML configuration by adding the --quota quotaon=[u|g] option to the MDS and OST setup:
${LMC} --add mds --node ft1 --mds mds-l --fstype ldiskfs --dev $MDSDEV --failover --quota quotaon=ug
${LMC} --add ost --node oss-0 --lov lov-l --ost ost-0 \
--fstype ldiskfs --dev /dev/sda1 --failover \
--mountfsoptions "extents,mballoc" --quota quotaon=ug
Re-write the configuration log on the MDS with lconf --write_conf <XML>
Load the lquota module in the proper order (prior to the MDC, LOV or OSC modules). Add the following lines to /etc/modprobe.conf on both server and client nodes:
install mdc /sbin/modprobe lquota; /sbin/modprobe --ignore-install mdc
install lov /sbin/modprobe lquota; /sbin/modprobe --ignore-install lov
install osc /sbin/modprobe lquota; /sbin/modprobe --ignore-install osc
Restart Lustre (remember, you should always stop Lustre prior to --write_conf)
Mount the Lustre file system on the client and verify that the lquota module has loaded properly by using the lsmod command
The mount command for Lustre no longer recognizes the usrquota and grpquota options; remove them from /etc/fstab if they were specified previously
When quota is enabled on the file system, it is automatically enabled for all clients of the file system
NOTE: Lustre with Linux kernel 2.4 does not support quotas.
Once each quota-enabled file system is remounted, it will be capable of working with disk quotas. However, the file system itself is not yet ready to support quotas. The next step is to run the lfs command with the quotacheck option:
# lfs quotacheck -ug /mnt/lustre
The quota will be turned on by default after quotacheck completes. The options that can be used are as follows:
u — to check the user disk quota information
g — to check the group disk quota information
The lfs command now includes these other command options for working with quotas:
quotaon ― announces to the system that disk quotas should be enabled on one or more file systems. The file system quota files must be present in the root directory of the specified file system
quotaoff ― announces to the system that the specified file systems should have all the disk quotas turned off
setquota ― used to specify the quota limits and tune the grace period. By default the grace period is one week.
Usage: setquota [ -u | -g ] <name> <block-softlimit> <block-hardlimit> <inode-softlimit> <inode-hardlimit> <filesystem>
setquota -t [ -u | -g ] <block-grace> <inode-grace> <filesystem>
lfs > setquota -u bob 307200 309200 1000 1100 /mnt/lustre
Description: sets limits for user "bob". The block hard limit is approximately 300 MB (309200 KB) and the inode hard limit is 1100. Please note that this example uses very small limits.
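The grace period is tuned with the -t form shown in the usage above. As a sketch, assuming the grace values are given in seconds (confirm against your release), the following would set both the block and inode grace periods for users to one week:
lfs > setquota -t -u 604800 604800 /mnt/lustre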
The quota command displays the quota allocated and consumed for each Lustre device. This example shows the result of the previous setquota:
lfs > quota -u bob /mnt/lustre
Disk quotas for user bob (uid 502):
Filesystem blocks quota limit grace files quota limit grace
/mnt/lustre 0 307200 309200 0 1000 1100
mds-l_UUID 0 0 10240 0 0 200
ost-alpha_UUID 0 0 10240
ost-beta_UUID 0 0 10240
ost-gam_UUID 0 0 10240
The quotachown command sets or changes the file owner and group on OSTs of the specified file system:
$ lfs quotachown -i /mnt/lustre
The Linux kernel sets a default quota size of 1 MB. Lustre handles quota allocation differently, and a quota must be set properly or users may experience unnecessary failures. The file system block quota is divided up among the OSTs within the file system. Each OST requests an allocation, which is increased up to the quota limit; the allocation is quantized to reduce quota-related request traffic. By default, Lustre allocates 100 MB per OST, which means the minimum quota that can be assigned is 100 MB multiplied by the number of OSTs in your file system. If you attempt to assign a smaller quota, users may be unable to create files. The default is established at file system creation time, but can be tuned via /proc values (detailed below). The inode quota is allocated in a similarly quantized manner on the MDS.
The setquota example above was run on a file system created with the following lmc quota options:
--quota quotaon=ug,bunit=10,iunit=200
This sets a much smaller granularity: we have specified that new quota will be requested in units of 10 MB and 200 inodes, respectively. Looking at the example again:
lfs > quota -u bob /mnt/lustre
Disk quotas for user bob (uid 502):
Filesystem blocks quota limit grace files quota limit grace
/mnt/lustre 0 307200 309200 0 1000 1100
mds-l_UUID 0 0 10240 0 0 200
ost-alpha_UUID 0 0 10240
ost-beta_UUID 0 0 10240
ost-gam_UUID 0 0 10240
We see that the 300 MB quota requested is divided across the OSTs, with each OST holding an initial allocation of 10 MB (10240 KB). The MDS line shows the initial 200-inode allocation.
It is very important to note that the block quota is consumed per OST. Much like free space, when the quota is consumed on one OST, clients may be unable to create files regardless of the quota available on other OSTs.
More details:
Lustre quota allocation is controlled by two values, quota_bunit_sz and quota_iunit_sz, referring to kilobytes and inodes respectively. These values can be accessed on the MDS as /proc/fs/lustre/mds/*/quota_* and on the OSTs as /proc/fs/lustre/obdfilter/*/quota_*. They can also be set as an option to lmc --quota; changing them requires re-writing the configuration log with lconf --write_conf on the MDS. The /proc values are bounded by two other variables, quota_btune_sz and quota_itune_sz. By default, the *tune_sz variables are set at 1/2 the *unit_sz variables, and you cannot set *tune_sz larger than *unit_sz. You must set bunit_sz first if it is increasing by more than 2x, and btune_sz first if it is decreasing by more than 2x. The values set for the MDS must match the values set on the OSTs.
The parameter quota_bunit_sz displays bytes; however, lfs setquota uses kilobytes, and quota_bunit_sz must be a multiple of 1024. A proper minimum block quota (in KB) for lfs setquota can be calculated as:
Size in KB = (quota_bunit_sz * (number of OSTs + 1)) / 1024
We add one to the number of OSTs because the MDS also consumes blocks from the quota. As inodes are only consumed on the MDS, the minimum inode quota for lfs setquota is equal to quota_iunit_sz.
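For example, with the default quota_bunit_sz of 100 MB (104857600 bytes) and 10 OSTs, the minimum block quota for lfs setquota is (104857600 * (10 + 1)) / 1024 = 1126400 KB, or roughly 1.1 GB.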
Note: Setting a quota below this limit may prevent the user from creating any files.
8.1 Considerations for Backend Storage
Lustre's architecture allows it to use any kind of block device as backend storage. The characteristics of such devices, particularly their behavior under failure, vary significantly and have an impact on configuration choices.
This section gives a survey of the issues and recommendations.
Given below is a quick calculation leading to the conclusion that, without any further redundancy, RAID5 is not acceptable for large clusters and RAID6 is a must.
Take a 1 PB file system, that is, 2000 disks of 500 GB capacity. The MTTF of a disk is likely about 1000 days, and repair time at 10% of disk bandwidth is close to 1 day (500 GB at 5 MB/sec = 100,000 sec ~= 1 day). This means the expected failure rate is 2000 / 1000 = 2 disks per day.
If we have a RAID5 stripe that is ~10 wide, then during the 1 day of rebuilding, the chance that a second disk in the same array fails is about 9 / 1000 ~= 1/100. This means that in an expected period of 50 days, a double failure in a RAID5 stripe will lead to data loss.
So RAID6 or another double parity algorithm is really necessary for OST storage. For the MDS we recommend RAID0+1 storage.
8.1.2 Selecting Storage for the MDS and OSS
The MDS performs a large number of small writes. For this reason, we recommend RAID1 storage. Building RAID1 Linux MD devices and striping over these devices with LVM makes it easy to create an MDS file system of 1-2 TB, for example, with 4 or 8 500 GB disks.
Having disk monitoring software in place so that rebuilds happen without any delay should be regarded as mandatory. We recommend backups of the meta-data file systems. This can be done with LVM snapshots or using raw partition backups.
We also recommend using a kernel version of 2.6.15 or later with bitmap RAID rebuild features. These reduce RAID recovery time from a rebuild to a quick resynchronization.
8.1.3 Understanding Double Failures with Hardware and Software RAID5
Software RAID does not offer the hard consistency guarantees of top-end enterprise RAID arrays. Those guarantees state that the value of any block is exactly the before or after value and that ordering of writes is preserved. With software RAID, an interrupted write operation that spans multiple blocks can frequently leave a stripe in an inconsistent state that is not restored to either the old or the new value. Such interruptions are normally caused by an abrupt shutdown of the system.
If the array is functioning without disk failures but experiences sudden power-down events, such interrupted writes on journaling file systems can affect file data and data in the journal. Metadata itself is re-written from the journal during recovery and will be correct. Because the journal uses a single block, written after all other journal writes have completed, to indicate that a transaction has committed, the journal remains valid. File data can be corrupted while it is being overwritten, but this is a known problem with incomplete writes and caches anyway. Hence, recovery of disk file systems with software RAID is similar to recovery without software RAID, and using Lustre servers on top of these disk file systems does not change these guarantees.
Problems can arise if after an abrupt shutdown a disk fails on restart. In this case even single block writes provide no guarantee that, for example, the journal will not be corrupted.
Hence:
1. IF A POWERDOWN IS FOLLOWED BY A DISK FAILURE, THE DISK FILE SYSTEM NEEDS A FILE SYSTEM CHECK.
2. IF A RAID ARRAY DOES NOT GUARANTEE before/after SEMANTICS, the same requirement holds.
We believe this requirement is present for most arrays that are used with Lustre, including the successful and popular DDN arrays.
CFS will release a modification to the disk file system that eliminates this requirement for a check with a feature called "journal checksums". With RAID6 this check is not required with a single disk failure, but is required with a double failure upon reboot after an abrupt interruption of the system.
8.1.4 Performance considerations
CFS is currently improving the Linux software RAID code to preserve large I/Os, which the disk subsystems can handle very efficiently. With the existing RAID code, software RAID performs equally with all stride sizes, but we expect fairly large stride sizes to prove advantageous once these fixes are implemented.
To format a software RAID file system, use the stride_size option while formatting.
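As a sketch: for the 6-disk RAID5 OST example below (64 KB chunks, so 5 data disks) with 4 KB file system blocks, the stride is 64 KB / 4 KB = 16 blocks. Depending on your e2fsprogs version, the mke2fs option is -R stride=16 or -E stride=16 (check mke2fs(8)); via lmc it can be passed as:
--mkfsoptions "-R stride=16"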
8.2 Disk Performance Measurement
Below are some tips and insights for disk performance measurement. Some of this information is specific to RAID arrays and/or the Linux RAID implementation.
1. Performance is limited by the slowest disk.
Benchmark all disks individually. We have frequently encountered situations where drive performance was not consistent for all devices in the array.
2. Verify drive ordering and identification.
For example, on a test system with a Marvell driver, the disk ordering is not preserved between boots but the controller ordering is. Therefore, we had to perform the sgp_dd survey and create arrays without rebooting.
3. Disks and arrays are very sensitive to request size.
To identify the ideal request size for a given disk, benchmark the disk with different record sizes, ranging from 4 KB to 1-2 MB.
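A sketch of such a sweep using sgp_dd (device name assumed): the request size is bs * bpt, so with 512-byte blocks, bpt=8 gives 4 KB requests and bpt=2048 gives 1 MB requests. Each run below reads 1 GB:
$ sgp_dd if=/dev/sg0 of=/dev/null bs=512 bpt=8 count=2097152 thr=4 time=1
$ sgp_dd if=/dev/sg0 of=/dev/null bs=512 bpt=2048 count=2097152 thr=4 time=1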
4. By default, the maximum size of a request is quite small.
To properly handle IO request sizes greater than 256 KB, the current Linux kernel either needs a driver patch or some changes in the block layer defaults, namely MAX_SECTORS, MAX_PHYS_SEGMENTS and MAX_HW_SEGMENTS. CFS kernels contain this patch. See blkdev_tunables-2.6-suse.patch in the CFS source.
5. I/O scheduler
Try different I/O schedulers because their behavior varies with storage and load. CFS recommends the deadline or noop schedulers. Benchmark them all and choose the best one for your setup. For further information on I/O schedulers, visit the following URLs:
http://www.linuxjournal.com/article/6931
http://www.redhat.com/magazine/008jun05/features/schedulers/
6. Use the proper block device with sgp_dd (sgX versus sdX)
size 1048576K rsz 128 crg 8 thr 32 read 20.02 MB/s
size 1048576K rsz 128 crg 8 thr 32 read 56.72 MB/s
Both outputs above were produced on the same disk with the same sgp_dd parameters; the only difference is that the first used /dev/sda while the second used /dev/sg0. sgX is a special interface that bypasses the block layer and the I/O scheduler and sends SCSI commands directly to the drive. sdX is a regular block device, and requests go through the block layer and the I/O scheduler. The numbers do not change when testing with different I/O schedulers.
Note: The sg device cannot be used by Lustre as it is not a block device – the sg device is used for performance measurement only.
7. Partial-stripe writes impair RAID5 performance.
Remember that RAID5 will in many cases perform a read-modify-write cycle, which is slow.
Try to avoid synchronized writes; if writes are allowed to accumulate, subsequent writes often fill the stripe so that no reads are needed. Try to configure RAID5 and the application so that most writes are full-stripe and stripe-aligned.
8. NR_STRIPES in RAID5 (Linux kernel parameter)
This is the size of the internal cache that RAID5 uses for all operations. If many processes are doing I/O, we suggest increasing this number. In newer kernels, it can be tuned via a module parameter.
9. Do not put an ext3 journal onto RAID5.
As the journal is written linearly and synchronously, in most cases journal writes will not fill whole stripes, and RAID5 will have to read parity blocks.
10. Suggested MD device setups for maximum performance:
MDT
RAID1 with internal journal and 2 disks from different controllers
If you require larger MDTs, create 2 equal-sized RAID0 arrays from multiple disks. Create a RAID1 array from these 2 arrays. Using RAID10 directly requires a newer mdadm (the tool that administers software RAID on Linux) than the one shipped with RHEL 4. You can also use LVM instead of RAID0, but this has not been tested.
OST
File system: RAID5 with 6 disks, each from a different controller.
External journal: RAID1 with 2 partitions of 400MB (or more), each from disks on different controllers.
--mkfsoptions "-j -J device=/dev/mdX"
To enable an external journal, use the above options in the lmc script used to create your XML; mdX is the external journal device.
Before running --reformat, set up the journal device (/dev/mdX) by running:
$ mke2fs -O journal_dev -b 4096 /dev/mdX
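The journal array itself must exist before it can be formatted; a sketch using mdadm, assuming hypothetical partitions /dev/sdc1 and /dev/sdd1 on different controllers:
$ mdadm --create /dev/mdX --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1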
You can create a root file system, swap, and other system partitions on a RAID1 array with partitions on any 2 remaining disks. The remaining space on the OST journal disk could be used for this.
CFS has not tested RAID1 of swap.
11. rsz in sgp_dd:
It must be equal to the product of <chunksize> and (disks-1).
You should also pass stripe=N, and extents or mballoc, as mount file system options for the OSS, where N = <chunksize> * (disks-1) / pagesize.
12. Run fsck on power failure or disk failure (RAID arrays).
You must run fsck on an array after a power failure followed by the failure of a disk in the array, due to potential write consistency issues.
You can automate this in rc.sysinit by detecting degraded arrays.
8.2.1.1 Graphs for Write Performance
Figure 2.9.1: Write - RAID0, 64K chunks, 6 spindles
Figure 2.9.2: Write - RAID5, 64K chunks, 6 spindles
8.2.1.2 Graphs for Read Performance
Figure 2.9.3: Read - RAID0, 64K chunks, 6 spindles
Figure 2.9.4: Read - RAID5, 64K chunks, 6 spindles
Chapter II – 9. Bonding
Bonding is a method of aggregating multiple physical links into a single logical link. This technology is also known as trunking, port trunking and link aggregation. We will use the term bonding.
Several different types of bonding are supported in Linux. All these types are referred to as “modes,” and use the bonding kernel module.
Modes 0 to 3 provide support for load balancing and fault tolerance by using multiple interfaces. Mode 4 aggregates a group of interfaces into a single virtual interface where all members of the group share the same speed and duplex settings. This mode is described under IEEE spec 802.3ad, and it is referred to as either “mode 4” or “802.3ad.”
(802.3ad refers to mode 4 only. The detail is contained in Clause 43 of IEEE 802.3, the larger specification. Consult the IEEE for more information.)
The most basic requirement for successful bonding is that both endpoints of the connection must support bonding. In a normal case, the non-server endpoint is a switch. (Two systems connected via crossover cables can also use bonding.) Any switch used must explicitly support 802.3ad Dynamic Link Aggregation.
The kernel must also support bonding; all supported Lustre kernels do. The network driver for the interfaces to be bonded must have ethtool support, which is necessary to determine the slave speed and duplex settings. All recent network drivers implement it.
To verify that your interface supports ethtool:
$ which ethtool
$ ethtool eth0
Settings for eth0:
Supported ports: [ MII ]
Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
(ethtool will return an error if your card is not supported.)
To quickly check whether your kernel supports bonding:
$ grep ifenslave /sbin/ifup
$ which ifenslave
Note: Bonding and ethtool have been available since 2000. All Lustre-supported kernels include this functionality.
Bonding Module Parameters control various aspects of bonding.
Outgoing traffic is mapped across the slave interfaces according to the transmit hash policy. For Lustre, we recommend setting the xmit_hash_policy option to layer3+4. This policy uses upper-layer protocol information, if available, to generate the hash. It allows traffic to a particular network peer to span multiple slaves, although a single connection does not span multiple slaves:
xmit_hash_policy=layer3+4
The 'miimon' option enables link status monitoring (the parameter is a polling interval in milliseconds), letting the bond detect a failed interface and fail over transparently, avoiding serious network degradation during link failures. 100 milliseconds is a reasonable default; increase it for a busy network.
miimon=100
Follow the process below to setup bonding:
Create a virtual 'bond' interface.
Assign an IP address to the 'bond' interface.
Attach one or more slave interfaces to the bond interface. Typically the MAC address of the first slave interface will become the MAC address of the bond.
Setup the bond interface and its options in /etc/modprobe.conf. Start the slave interfaces by your normal network method.
Note: You must modprobe the bonding module for each bonded interface. If you wish to create bond0 and bond1, two entries in modprobe.conf are required.
Our examples are from Red Hat systems, and use /etc/sysconfig/networking-scripts/ifcfg-* for setup. The OSDL reference site given below includes detailed instructions for other configuration methods, instructions for using DHCP with bonding, and other setup details. We strongly recommend using this site.
http://linux-net.osdl.org/index.php/Bonding
Check /proc/net/bonding to determine status on bonding. There should be a file there for each bond interface. Check the interface state with ethtool or ifconfig. ifconfig lists the first bonded interface as 'bond0'.
Here is an example of modprobe.conf for bonding Ethernet interfaces eth1 and eth2 to bond0:
install bond0 /sbin/modprobe -a eth1 eth2 && /sbin/modprobe bonding miimon=100 mode=802.3ad xmit_hash_policy=layer3+4
alias bond0 bonding
ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
IPADDR=###.###.##.## (Assign here the IP of the bonded interface.)
NETMASK=255.255.255.0
ONBOOT=yes
ifcfg-eth1 (eth2 is a duplicate)
DEVICE=eth1 # Change to match device
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
TYPE=Ethernet
From linux-net.osdl.org:
For example, the content of /proc/net/bonding/bond0 after the driver is loaded with parameters of mode=0 and miimon=1000 is generally as follows:
Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)
Bonding Mode: load balancing (round-robin)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Slave Interface: eth0
MII Status: up
Link Failure Count: 1
In the example below, the bond0 interface is the master (MASTER) while eth0 and eth1 are slaves (SLAVE).
Note: All the slaves of bond0 have the same MAC address (Hwaddr) as bond0. All modes except TLB and ALB share this MAC address; TLB and ALB require a unique MAC address for each slave.
$ /sbin/ifconfig
bond0 Link encap:Ethernet Hwaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:0
eth0 Link encap:Ethernet Hwaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:100
Interrupt:10 Base address:0x1080
eth1 Link encap:Ethernet Hwaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:9 Base address:0x1400
Lustre uses the IP address of the bonded interface and requires no special configuration; it treats the bonded interface as a regular TCP/IP interface. If necessary, specify bond0 using the Lustre networks parameter:
options lnet networks=tcp(bond0)
Below are some references that we recommend -
In the Linux kernel source tree, see
Documentation/networking/bonding.txt
http://www.sourceforge.net/projects/bonding
This is the bonding sourceforge site.
http://linux-net.osdl.org/index.php/Bonding
This is the most exhaustive reference and is highly recommended. It includes explanations of more complicated setups, including the use of DHCP with bonding.
Chapter III – 1. Lustre I/O Kit
The Lustre I/O kit is a collection of benchmark tools for a Lustre cluster. You can download the I/O kits from:
https://downloads.clusterfs.com/customer/lustre-iokit/
In this directory, you will find two packages. The 'scali-lustre-iokit' is a Python tool maintained by the kind team at Scali, and is not discussed in this version of the manual. The 'lustre-iokit' package consists of a set of scripts developed and supported by CFS.
Prerequisites for the CFS I/O kit:
password-free remote access to nodes in the system (normally obtained via ssh or rsh)
Lustre file system software
sg3_utils for the sgp_dd utility
The kit can be used to validate the performance of the various hardware and software layers in the cluster and also as a way of finding and troubleshooting input/output issues.
It is very important to establish performance from the bottom up. First, verify the performance of a single raw device. Once this is complete, verify that performance is stable with a larger number of devices; while troubleshooting performance issues, we frequently find that array performance with all LUNs loaded does not match the performance of a single LUN tested in isolation. After raw performance has been established, the other software layers can be added and tested incrementally.
The kit contains three tests. The first surveys basic performance of the device and bypasses the kernel block device layers, buffer cache and file system. The subsequent tests survey progressively higher layers of the Lustre stack. Typically with these tests, Lustre should deliver 85-90% of the raw device performance.
As mentioned above, the I/O kit bundle contains three testing tools:
sgpdd_survey
obdfilter_survey
ost_survey
1.2.1 sgpdd_survey
This is the tool for testing bare-metal performance, while bypassing as much of the kernel as possible. It does not require Lustre software, but does require the sgp_dd package. This survey may be used to characterize the performance of a SCSI device by simulating an OST serving multiple stripe files. The data gathered by this survey can help set expectations for the performance of a Lustre OST exporting the device.
The script uses sgp_dd to carry out raw sequential disk input/output. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths.
The script spawns variable numbers of sgp_dd instances, each reading or writing a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files.
The device(s) used must meet one of the two tests mentioned below:
SCSI device:
Must appear in the output of 'sg_map'
(make sure the kernel module "sg" is loaded)
Raw device:
Must appear in the output of 'raw -qa'
If you need to create raw devices in order to use this tool, note that raw device 0 can not be used due to a bug in certain versions of the "raw" utility (including that shipped with RHEL4U4.)
You may not mix raw and SCSI devices in the test specification.
The script must be customized according to the particular device being tested and also according to the location where it should keep its working files. Customization variables are described explicitly at the start of the script.
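As an illustration only (the authoritative variable names and meanings are documented at the top of the script itself; the names and values below are assumptions), the customization block might be edited along these lines:
scsidevs="/dev/sg0 /dev/sg1" # devices to survey (must appear in sg_map)
size=8192                    # amount of data to transfer per device
crghi=8                      # highest number of concurrent regions
thrhi=16                     # highest number of threads per region
rslt=/tmp/sgpdd_survey       # prefix for working and result files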
When the script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by the script variable ${rslt}:
${rslt}_<date/time>.summary same as stdout
${rslt}_<date/time>_* tmp files
${rslt}_<date/time>.detail collected tmp files for post-mortem
The summary file and stdout contain lines like:
total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 = 180.50 MB/s
The number immediately before the first MB/s is the bandwidth computed by measuring total data and elapsed time. The remaining numbers are a check on the bandwidths reported by the individual sgp_dd instances.
If there are so many threads that sgp_dd is unlikely to be able to allocate input/output buffers, "ENOMEM" is printed.
If all the sgp_dd instances do not successfully report a bandwidth number, "failed" is printed.
Note: This test overwrites the device being tested and will result in the LOSS OF ALL DATA on that device. Exercise caution when selecting the device to be tested.
1.2.2 obdfilter_survey
This survey script processes sequential input/output with varying numbers of threads and objects (files) by using lctl::test_brw to drive the echo_client connected to local or remote obdfilter instances, or remote obdecho instances. It can be used to characterize the performance of the Lustre components below.
1. The stripe F/S
Here the script directly exercises one or more instances of obdfilter. The script may be running on one or more nodes, for example, when the nodes are all attached to the same multi-ported disk subsystem.
You need to tell the script all the names of the obdfilter instances, which should already be up and running. If some are on different nodes, you also need to specify their host names, for example, node1:ost1. All the obdfilter instances are driven directly. The script automatically loads the obdecho module if required and creates one instance of echo_client for each obdfilter instance.
2. The network
Here the script drives one or more instances of obdecho via instances of echo_client running on one or more nodes. You need to tell the script all the names of the echo_client instances, which should already be up and running. If some are on different nodes, you also need to specify their host names, for example, node1:ECHO_node1.
3. The stripe F/S over the network
Here the script drives one or more instances of obdfilter via instances of echo_client running on one or more nodes. As with above, you need to tell the script all the names of the echo_client instances, which should already be up and running. Note that the script is not scalable to hundreds of nodes since it is only intended to measure individual servers, not the scalability of the system as a whole.
Running the script
The script must be customized according to the components being tested and also according to the location where it should keep its working files. Customization variables are described clearly at the start of the script.
Running the script against a local disk
Create a Lustre configuration shell script and XML using your normal methods. You do not need to specify an MDS or LOV, but you do need to list all OSTs that you wish to test.
On all OSS machines, use:
$ lconf --reformat <XML file>
Remember, write tests are destructive. This test should be run prior to startup of your actual Lustre file system. If you do this, you will not need to reformat to restart Lustre. However, if the test is terminated before completion, you may have to remove objects from the disk.
Determine the obdfilter instance names on all the OSS nodes. They appear in the 4th column of lctl dl output. For example:
$ pdsh -w oss[01-02] lctl dl |grep obdfilter |sort
oss01: 0 UP obdfilter oss01-sdb oss01-sdb_UUID 3
oss01: 2 UP obdfilter oss01-sdd oss01-sdd_UUID 3
oss02: 0 UP obdfilter oss02-sdi oss02-sdi_UUID 3
Here the obdfilter instance names are oss01-sdb, oss01-sdd, oss02-sdi. Since you are driving obdfilter instances directly, set the shell array variable ost_names to the names of the obdfilter instances and leave client_names undefined.
For example:
ost_names_str='oss01:oss01-sdb oss01:oss01-sdd oss02:oss02-sdi' ./obdfilter-survey
Running the script against a network
If you are driving obdfilter or obdecho instances over the network, you must instantiate the echo_clients yourself using lmc/lconf. Set the shell array variable client_names to the names of the echo_client instances and leave ost_names undefined.
You can optionally prefix any name in ost_names or client_names with the host name that it is running on, for example, remote_node:ost4. If you are running remote nodes, you need to ensure the following:
custom_remote_shell() works on your cluster
all pathnames you specify in the script are mounted on the node you start the survey from and on all the remote nodes
obdfilter-survey must be installed on the clients at the same location as on the master node
First, bring up obdecho instances on the servers and echo_client instances on the clients by running the included echo.sh on a node that has Lustre installed. Shell variables:
SERVERS: set this to a list of server host names; otherwise the hostname of the current node is used. This may be the wrong interface, so be sure to check it. (Note: echo.sh could probably be smarter about this.)
NETS: set this if you are using a network type other than TCP.
For example:
SERVERS=oss01-eth2 sh echo.sh
On the servers, start the obdecho server and verify that it is up:
$ lconf --node (hostname) /(path)/echo.xml
$ lctl dl
0 UP obdecho ost_oss01.local ost_oss01.local_UUID 3
1 UP ost OSS OSS_UUID 3
On the clients, start the other side of the echo connection:
$ lconf --node client /(path)/echo.xml
$ lctl dl
0 UP osc OSC_xfer01.local_ost_oss01.local_ECHO_client \ 6bc9b_ECHO_client_2a8a2cb3dd 5
1 UP echo_client ECHO_client 6bc9b_ECHO_client_2a8a2cb3dd 3
Verify connectivity from a client:
$ lctl ping SERVER_NID
Run the script on the master node, specifying the client names in an environment variable.
For example:
$ client_names_str='xfer01:ECHO_client xfer02:ECHO_client
xfer03:ECHO_client xfer04:ECHO_client xfer05:ECHO_client
xfer06:ECHO_client xfer07:ECHO_client xfer08:ECHO_client
xfer09:ECHO_client xfer10:ECHO_client xfer11:ECHO_client
xfer12:ECHO_client' ./obdfilter-survey
When done, cleanup echo_client/obdecho instances
on clients:
$ lconf --cleanup --node client /(path)/echo.xml
on server(s):
$ lconf --cleanup --node (hostname) /(path)/echo.xml
When aborting, run killall vmstat on clients:
pdsh -w (clients) killall vmstat
Use lctl device_list to verify the obdfilter/echo_client instance names. When the script runs, it creates a number of working files and a pair of result files. All files start with the prefix given by ${rslt}:
${rslt}.summary same as stdout
${rslt}.script_* per-host test script files
${rslt}.detail_tmp* per-ost result files
${rslt}.detail collected result files for post-mortem
The script iterates over the given numbers of threads and objects performing all the specified tests and checking that all test processes completed successfully.
Note that the script does not clean up properly if it is aborted or if it encounters an unrecoverable error. In this case, manual cleanup may be required, possibly including killing any running instances of lctl (local or remote), removing echo_client instances created by the script and unloading obdecho.
Script output
The summary file and stdout contain lines like:
ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 \ [ 64.00, 82.00]
Where:
ost 8 is the total number of OSTs under test
sz 67108864K is the total amount of data read or written (in KB)
rsz 1024 is the record size (size of each echo_client input/output)
obj 8 is the total number of objects over all OSTs
thr 8 is the total number of threads over all OSTs and objects
write is the test name. If more tests have been specified they all appear on the same line
613.54 is the aggregate bandwidth over all OSTs measured by dividing the total number of MB by the elapsed time
[64.00, 82.00] are the minimum and maximum instantaneous bandwidths seen on any individual OST.
Note that although the numbers of threads and objects are specified per-OST in the customization section of the script, results are reported aggregated over all OSTs.
Visualizing results
It is useful to import the summary data (it is fixed-width) into Excel (or any graphing package) and graph the bandwidth against the number of threads for varying numbers of concurrent regions. This shows how the OSS performs for a given number of concurrently-accessed objects (files) with varying numbers of I/Os in flight.
It is also useful to record average disk input/output sizes during each test. These numbers help find pathologies in the system when the file system block allocator or the block device elevator fragment I/O requests.
The included obparse.pl script is an example of processing the output files to a .csv format.
1.2.3 ost_survey
This is a shell script that uses lfs setstripe to perform I/O against a single OST. It writes a file (currently using dd) to each OST in the Lustre file system and compares read and write speeds. It is used to detect misbehaving disk subsystems; we have frequently discovered wide performance variations across the LUNs in a cluster.
To run the script, supply a file size in KB and the Lustre mount point.
Example:
$ ./ost-survey.sh 10 /mnt/lustre
Average read Speed: 6.73
Average write Speed: 5.41
read - Worst OST indx 0 5.84 MB/s
write - Worst OST indx 0 3.77 MB/s
read - Best OST indx 1 7.38 MB/s
write - Best OST indx 1 6.31 MB/s
3 OST devices found
Ost index 0 Read speed 5.84 Write speed 3.77
Ost index 0 Read time 0.17 Write time 0.27
Ost index 1 Read speed 7.38 Write speed 6.31
Ost index 1 Read time 0.14 Write time 0.16
Ost index 2 Read speed 6.98 Write speed 6.16
Ost index 2 Read time 0.14 Write time 0.16
The proc file system acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change certain kernel parameters at runtime (sysctl).
The Lustre file system provides several proc file system variables that control aspects of Lustre performance and provide information.
The proc variables are classified based on the subsystem they affect.
2.1.1 /proc Entries for Lustre
/proc/sys/lustre/upcall
This contains the path of the recovery upcall, or DEFAULT for the normal case where there is no upcall. The upcall is invoked with arguments describing certain states, including:
FAILED_IMPORT tgt_uuid obd_uuid net_uuid – invoked when an import (a connection to a target) fails. The UUID information identifies the target, the obd name, and the network.
RECOVERY_OVER tgt_uuid – the upcall called on the server when the recovery period has ended. The UUID is the target that was in recovery mode. For example, syslog message:
“May 25 13:35:46 d2_q_0 kernel: Lustre: \ 12162:0:(recover.c:77:ptlrpc_run_recovery_over_upcall()) \ Invoked upcall DEFAULT RECOVERY_OVER ost-alpha_UUID”
/proc/sys/lnet/upcall
LBUG src_file line_number function – which is called when an LBUG occurs.
The script paths can be configured with lmc and/or lconf or by modifying the corresponding proc entries. Setting an upcall to "DEFAULT" means that the recovery will be handled within the kernel by reconnecting to the same device.
2.1.1.2 Lustre Timeouts / Debugging
/proc/sys/lustre/timeout
This is the time period for which a client will wait on a server to complete an RPC (default 100s). Servers will wait half of this time for a normal client RPC to complete and a quarter of this time for a single bulk request (read or write of up to 1MB) to complete. The client will ping recoverable targets (MDS and OSTs) at one quarter of the timeout and the server will wait one and a half times the timeout before evicting a client for being "stale."
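For example, on a large or congested cluster the timeout can be raised either through /proc or with sysctl:
# echo 300 > /proc/sys/lustre/timeout
# sysctl -w lustre.timeout=300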
/proc/sys/lustre/ldlm_timeout
This is the time period for which a server will wait for a client to reply to an initial AST (lock cancellation request); the default is 20s for an OST and 6s for an MDS. If the client replies to the AST, the server will give it a normal timeout (half of the client timeout) to flush any dirty data and release the lock.
/proc/sys/lustre/fail_loc
This is the internal debugging failure hook.
See lustre/include/linux/obd_support.h for the definitions of individual failure locations. The default value is zero.
sysctl -w lustre.fail_loc=0x80000122 # drop a single reply
/proc/sys/lustre/dump_on_timeout
This triggers dumps of the Lustre debug log when timeouts occur.
/proc/sys/lnet/peers
Shows all NIDs known to this node and also gives information on the queue state.
# cat /proc/sys/lnet/peers
nid refs state max rtr min tx min queue
10.67.73.181@tcp 1 up 8 8 8 8 7 0
Fields are explained below:
refs – A reference count, used for debugging primarily
state – Up or down
max – Maximum number of concurrent sends from this peer
rtr – Routing buffer credits
min – Minimum routing buffer credits seen
tx – Send credits
min – Minimum send credits seen
queue – Total bytes in active/queued sends.
Credits work like a semaphore. At start, they are initialized to allow a certain number of operations (8 in this example). LNET keeps track of the minimum value seen, so that you can see how congested a resource was.
If rtr/tx is less than max, there are operations in progress. The number of operations is equal to rtr or tx subtracted from max.
If rtr/tx is greater than max, there are operations blocking.
LNET also limits concurrent sends and router buffers allocated to a single peer so that no peer can occupy all of these resources.
/proc/sys/lnet/nis
Shows current queue health on this node.
Fields are explained below:
nid – The network interface
refs – Internal reference counter
peer – The number of peer-to-peer send credits on this NID. Credits are used to size various buffer pools.
max – Total number of send credits on this NID
tx – Current number of send credits available on this NID
min – Lowest number of send credits available on this NID
Subtracting tx from max yields the number of sends currently active. A large or increasing number of active sends may indicate a problem.
# cat /proc/sys/lnet/nis
nid refs peer max tx min
0@lo 2 0 0 0 0
10.67.73.173@tcp 4 8 256 256 253
/proc/fs/lustre/llite/fs0/max_read_ahead_mb
This file contains the size of the client per-file read-ahead (default 40 MB). Setting this to zero will disable readahead.
/proc/fs/lustre/llite/fs0/max_cache_mb
This is the maximum amount of inactive data cached by the client (default 3/4 of RAM).
2.2.1 Client Input/output RPC Stream Tunables
The Lustre engine always attempts to pack an optimal amount of data into each input/output RPC and to keep a consistent number of RPCs in progress at a time. Lustre exposes several tuning variables to adjust behavior according to network conditions and cluster size. Each OSC has its own tree of these tunables. For example:
$ ls -d /proc/fs/lustre/osc/OSC_*MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost3_MNT_localhost
$ ls /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
blocksize filesfree max_dirty_mb ost_server_uuid stats
... and so on
The files related to tuning the RPC stream are as follows:
/proc/fs/lustre/osc/<object name>/max_dirty_mb
This controls how many megabytes of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes will stall until previously cached writes are written to the server. This may be changed by writing a single ASCII integer to the file. Only values between zero and 512 are allowed. If zero is given, no writes will be cached, but unless you use large writes (1MB or more) performance will suffer noticeably.
/proc/fs/lustre/osc/<object name>/cur_dirty_bytes
This is a read-only value that returns the number of bytes currently written and cached on this OSC.
/proc/fs/lustre/osc/<object name>/max_pages_per_rpc
This value represents the maximum number of pages that will undergo input/output in a single RPC to the OST. The minimum is a single page and the maximum for this setting is platform dependent (256 for i386/x86_64, possibly less for ia64/PPC with larger PAGE_SIZE), though it generally amounts to a total of one megabyte per RPC.
/proc/fs/lustre/osc/<object name>/max_rpcs_in_flight
This value represents the maximum number of concurrent RPCs that the OSC will issue at a time to its OST. If the OSC tries to initiate an RPC but finds that it already has the same number of RPCs outstanding, it will wait to issue further RPCs until some complete. The minimum setting is one and maximum 32.
To maximize performance, it is recommended that max_dirty_mb be set to 4 * max_pages_per_rpc * max_rpcs_in_flight.
NOTE: The <object name> will vary depending on the specific Lustre configuration. See the sample output from the commands for examples of <object name>.
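As a sketch, the concurrency on one OSC can be raised and the dirty cache sized to match (the object name comes from the sample listing above; reading the recommendation with 256-page, 1MB RPCs, 4 * 1 * 32 suggests 128MB):
$ echo 32 > /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost/max_rpcs_in_flight
$ echo 128 > /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost/max_dirty_mb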
2.2.2 Watching the Client RPC Stream
In the same directory is a file that gives a histogram of the make-up of previous RPCs.
# cat /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost/rpc_stats
snapshot_time: 1067551484:37103 (secs:usecs)
RPCs in flight: 0
pending write pages: 0
pending read pages: 0
other RPCs in flight when a new RPC is sent:
0: 0
1: 0
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
pages in each RPC:
0: 0
1: 0
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
RPCs in flight
This represents the number of RPCs that are issued by the OSC but are not complete at the time of the snapshot. It should always be less than or equal to max_rpcs_in_flight.
pending {read,write} pages
These fields show the number of pages that have been queued for input/output in the OSC.
other RPCs in flight when a new RPC is sent
When an RPC is sent, the number of other RPCs still pending is recorded in this table. When the first RPC is sent, the 0: row is incremented. If an RPC is sent while another is still pending, the 1: row is incremented, and so on. The number of RPCs pending as each RPC *completes* is not tabulated. This table is a good way to visualize the concurrency of the RPC stream. Ideally, you will see a large clump around the max_rpcs_in_flight value, which shows that the network is being kept busy.
pages in each RPC
As each RPC is sent, the number of pages it consists of is recorded in order in this table. A single-page RPC increments the 0: row, a 128-page RPC the 7: row, and so on.
These histograms can be cleared by writing any value into the rpc_stats file.
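For example (the object name is from the sample above):
$ echo 0 > /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost/rpc_stats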
2.2.3 Watching the OST Block Input/output Stream
Similarly, there is a "brw_stats" histogram in the obdfilter directory which shows you the statistics for number of input/output requests sent to the disk, their size and whether they are contiguous on the disk or not.
# cat /proc/fs/lustre/obdfilter/OST_localhost/brw_stats
snapshot_time: 1089922302:248138 (secs:usecs)
read write
pages per brw brws % cum % | rpcs % cum %
1: 0 0 0 | 1 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 140 99 100
read write
discont pages rpcs % cum % | rpcs % cum %
0: 0 0 0 | 141 100 100
read write
discont blocks rpcs % cum % | rpcs % cum %
0: 0 0 0 | 123 87 87
1: 0 0 0 | 18 12 100
pages per brw – Number of pages per RPC request (should match aggregate client rpc_stats)
discont pages – Number of discontinuities in the logical file offset of each page in a single RPC
discont blocks – Number of discontinuities in the physical block allocation in the filesystem for a single RPC
/proc/fs/ldiskfs/loop0/mb_history
Each mballoc-enabled partition will have this file.
Sample output:
pid inode goal result found grps cr merge tail broken
1593 25052 1/12289/255 1/12289/255 1 0 0 M 0 0
1591 25052 1/12544/256 1/12544/256 1 0 0 M 0 0
1592 25052 1/12800/256 1/12800/256 1 0 0 M 256 512
1590 25052 1/13056/256 1/13056/256 1 0 0 M 0 0
1593 25052 1/13312/256 1/13312/256 1 0 0 M 256 1024
1591 25052 1/13568/256 1/13568/256 1 0 0 M 0 0
1592 25052 1/13824/256 1/13824/256 1 0 0 M 256 512
1590 25052 1/14080/256 1/14080/256 1 0 0 M 0 0
1593 25052 1/14336/256 1/14336/256 1 0 0 M 256 2048
1592 25052 1/14592/256 1/14592/256 1 0 0 M 0
Fields:
pid = The process that made the allocation
inode = The inode to which the blocks were allocated
goal = The initial request that came to mballoc (group/block-in-group/number-of-blocks)
result = What mballoc actually found for the request
found = Number of free chunks mballoc found and measured before the final decision
grps = Number of groups mballoc scanned to satisfy the request
cr = stage at which mballoc found the result:
0 – the best in terms of resource allocation. The request was 1MB or larger and was satisfied directly via the kernel buddy allocator
1 – regular stage (good at resource consumption)
2 – fs is quite fragmented (not that bad at resource consumption)
3 – fs is very fragmented (worst at resource consumption)
merge = whether the request hit the goal. This is good, as the extents code can then merge the new blocks into an existing extent, eliminating the need for extent tree growth
tail = number of blocks left free after the allocation breaks a large free chunk
broken = how large the broken chunk was
Most customers are probably interested in found/cr. If cr is zero or one and found is less than 100, then mballoc is doing quite well.
Also, the number-of-blocks-in-request (the third number in the goal triple) shows the size of the requests coming from the obdfilter. If the obdfilter is issuing many small requests (just a few blocks each), then either the client is doing input/output to a lot of small files, or something may be wrong with the client (it is better if clients send large input/output requests). This can be investigated with the OSC rpc_stats or OST brw_stats mentioned above.
The number of groups scanned (the grps column) should be small. If it often reaches a few dozen, either your disk file system is badly fragmented or mballoc is doing something wrong in the group selection.
/proc/fs/lustre/ldlm/ldlm/namespaces/<OSC name|MDC name>/lru_size
This variable determines how many locks can be queued up on the client in an LRU queue. The default value of LRU size is 100. Increasing this on a large number of client nodes is not recommended, though servers have been tested with up to 150,000 total locks (num_clients * lru_size). Increasing it for a small number of clients (for example, login nodes with a large working set of files due to interactive use) can speed up Lustre dramatically. Recommended values are in the neighbourhood of 2500 MDC locks and 1000 locks per OSC.
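For example, on an interactive login node, the recommended values above could be applied like this (the namespace names are placeholders; list the namespaces directory to find the real ones):
$ echo 2500 > /proc/fs/lustre/ldlm/ldlm/namespaces/<MDC name>/lru_size
$ echo 1000 > /proc/fs/lustre/ldlm/ldlm/namespaces/<OSC name>/lru_size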
The following command can be used to clear the LRU on a single client, and as a result flush client cache, without changing the LRU size value:
$ echo clear > /proc/fs/lustre/ldlm/ldlm/namespaces/<OSC name|MDC name>/lru_size
If you shrink the LRU size below the number of existing unused locks, the excess locks are canceled immediately.
/proc/sys/lnet/debug
Setting this to zero completely turns off debug logs for all debug types, while setting it to -1 turns on full debugging (see the D_* definitions in lnet/include/linux/libcfs.h).
/proc/sys/lnet/subsystem_debug
This controls the debug logs for subsystems (see S_* definitions).
/proc/sys/lnet/debug_path
This indicates the location where debugging symbols should be stored for gdb. The default is set to /r/tmp/lustre-log-localhost.localdomain.
These values can also be set via sysctl -w lnet.debug={value}.
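For example, either of the following turns off all debug logging:
$ sysctl -w lnet.debug=0
$ echo 0 > /proc/sys/lnet/debug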
NOTE: The above entries exist only once the Lustre modules have been loaded.
Lustre applies the configured debug level when it is loaded on a particular node. You can set the debug level by adding the following to the node entry in the configuration shell script:
--ptldebug <level>
2.4.1 RPC Information for Other OBD Devices
Some OBD devices maintain counts of the RPC events that they process. For some devices, such as llite, these events are specific to the operations of the device rather than raw RPC counts.
$ find /proc/fs/lustre/ -name stats
/proc/fs/lustre/llite/fs0/stats
/proc/fs/lustre/mdt/MDT/mds_readpage/stats
/proc/fs/lustre/mdt/MDT/mds_setattr/stats
/proc/fs/lustre/mdt/MDT/mds/stats
/proc/fs/lustre/osc/OSC_uml0_ost3_MNT_localhost/stats
/proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost/stats
/proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost/stats
/proc/fs/lustre/osc/OSC_uml0_ost3_mds1/stats
/proc/fs/lustre/osc/OSC_uml0_ost2_mds1/stats
/proc/fs/lustre/osc/OSC_uml0_ost1_mds1/stats
/proc/fs/lustre/obdfilter/ost2/stats
/proc/fs/lustre/obdfilter/ost3/stats
/proc/fs/lustre/obdfilter/ost1/stats
/proc/fs/lustre/ost/OSS/ost_create/stats
/proc/fs/lustre/ost/OSS/ost/stats
/proc/fs/lustre/ldlm/ldlm/ldlm_canceld/stats
/proc/fs/lustre/ldlm/ldlm/ldlm_cbd/stats
The OST .../stats files can be used to track the performance of RPCs that the OST receives from all clients. It is possible to get a periodic dump of values from these files (for instance, every 10s), showing the RPC rates (similar to iostat), by using the llstat.pl tool:
$ llstat.pl /proc/fs/lustre/ost/OSS/ost/stats 10
/proc/fs/lustre/ost/OSS/ost/stats @ 1126198063.790389
Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
req_waittime 12 0 1522 [usec] 19800.50 68 1135.52 242393 10297.09
req_qdepth 12 0 1522 [reqs] 0.58 0 0.15 3 0.45
req_active 12 0 1522 [reqs] 1.08 1 1.01 2 0.09
reqbuf_avail 12 0 1522 [bufs] 63.67 63 63.93 64 0.26
ost_setattr 0 0 2 [usec] 0.00 240 257.50 275 24.75
ost_read 0 0 220 [usec] 0.00 530 1262.77 74463 4972.71
ost_write 0 0 230 [usec] 0.00 1438 2200.02 28189 2342.42
ost_create 2 0 24 [usec] 274.00 72 7322.46 35521 12654.60
ost_destroy 400 18 1047 [usec] 736.09 626 1134.41 30260 1560.68
ost_get_info 0 0 2 [usec] 0.00 71 101.50 132 43.13
ost_connect 2 0 26 [usec] 1395.50 1170 5037.04 27153 7231.62
ost_set_info 2 0 24 [usec] 297.50 108 300.38 1162 208.49
ldlm_enqueue 0 0 474 [usec] 0.00 194 351.57 1911 154.21
obd_ping 4 0 294 [usec] 151.50 62 175.97 600 49.36
Where:
Cur.Count = the number of events of each type sent in the last interval (10s in this case)
Cur.Rate = the number of events per second in the last interval
#Events = the total number of such events since the system was started
Unit = the unit of measurement for that statistic (microseconds, requests, buffers)
last = the average rate of these events (in units/event) for the last interval in which they arrived. For instance, in the ost_destroy case above, the 400 object destroys in the previous 10s took an average of 736 microseconds each
min = the minimum rate (in units/event) since the service started
avg = the average rate
max = the maximum rate
stddev = the standard deviation (not measured in all the cases)
The events common to all services are:
req_waittime ― the amount of time a request waited in the queue before being handled by an available server thread
req_qdepth ― the number of requests waiting to be handled in the queue for this service
req_active ― the number of requests currently being handled
reqbuf_avail ― the number of unsolicited LNET request buffers for this service
Some service specific events of interest are:
ldlm_enqueue ― the time it takes to enqueue a lock (this includes file open on the MDS)
mds_reint ― the time it takes to process an MDS modification record (includes create, mkdir, unlink, rename, setattr)
Many options in Lustre are set by means of kernel module parameters. These parameters are contained in the “modprobe.conf” file (On SuSE, this may be “modprobe.conf.local”).
The ost_num_threads option allows the number of OST service threads to be specified at module load time on the OSS nodes:
options ost ost_num_threads={N}
Prior to Lustre 1.4.5, an OSS had a maximum of 36 service threads, with the default depending on the server size. Since Lustre 1.4.6, the number of OST threads is a function of the server capacity (RAM and CPUs); for a 2GB, 2-CPU system this works out to 64 OST service threads, and for larger servers it might be as high as 512 threads. Giving a specific thread count via the ost_num_threads= module parameter overrides the default calculation.
Increasing the size of the thread pool may help when:
several OSTs are exported from a single OSS
the back-end storage is running synchronously
input/output completions are taking excessive time.
In such cases, a larger number of input/output threads allows the kernel and storage to aggregate many writes together for more efficient disk input/output. The OST thread pool is shared ― each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB) for internal input/output buffers. Note that this memory consumption should be considered when increasing the thread pool size.
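For example, to pin the OST thread count on a well-provisioned OSS (128 is purely illustrative; allow roughly 1.5 MB of buffer memory per thread), add to /etc/modprobe.conf:
options ost ost_num_threads=128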
There is a similar parameter for the number of MDS service threads:
options mds mds_num_threads={N}
At this time, no testing has been done to determine the optimal number of MDS threads. The default number varies based on the server size, up to a maximum of 32. The maximum number of threads (MDS_MAX_THREADS) is 512.
Transmit and receive buffer size:
With Lustre releases 1.4.7 and greater, ksocklnd has separate parameters for the transmit and receive buffers.
options ksocklnd tx_buffer_size=0 rx_buffer_size=0
If these parameters are left at the default (0) the system will automatically tune the transmit and receive buffer size. In almost every case, the defaults will produce the best performance. Do not attempt to tune this unless you are a network expert!
irq_affinity
This parameter is on by default. Normally, on an SMP system, we would like network traffic to remain local to a single CPU. This helps keep the processor cache warm and minimizes the impact of context switches. This is especially helpful when an SMP system has more than one network interface, and ideal when the number of interfaces equals the number of CPUs.
If you have an SMP platform with a single fast interface, such as 10Gb Ethernet, and more than two CPUs, you may see performance improve by turning this parameter off. As always, test to compare the impact.
This section provides a guideline to configure DDN storage arrays for use with Lustre.
The cache segment size noticeably affects input/output performance. You should set the cache segment size differently on the MDT (which does small, random input/output) and on the OST (which does large, contiguous input/output). The optimum values we have found in customer testing are 64KB for the MDT and 1MB for the OST.
The necessary DDN client commands are given below.
For MDT LUN:
$ cache size=64
The size is in KB; valid values are 64, 128, 256, 512, 1024, and 2048 (default 128).
For OST LUN:
$ cache size=1024
In one particular case, changing the disk maxcmds value from the default of two to four improved write performance by as much as 30%. This works only with SATA-based disks, and only when one controller of the pair is actually accessing the shared LUNs.
However, this recommendation comes with a warning: DDN support does not recommend changing this setting from the default. Increasing the value to five caused serious problems for the same setup.
The necessary DDN client command is given below, where the default value is two.
$ disk maxcmds=3
Some customers run with the write-back cache turned on, because it significantly improves performance. They are willing to take the risk that when there is a DDN controller crash and they need to run e2fsck, it will take them less time than the performance hit from running with the write-back cache turned off.
Other customers run with the write-back cache off for increased data security. However, some of these customers experience performance problems with the small writes during journal flush. In this mode it is highly beneficial to also increase the number of OST service threads "ost_num_threads=512" in /etc/modprobe.conf, if the OST has enough RAM (about 1.5MB/thread is preallocated for I/O buffers). More input/output threads allow more input/output requests to be in flight waiting for the disk to complete the synchronous write.
This is a decision that you need to make yourself ― there is a trade-off between improved performance and the slight risk of data loss and downtime in the case of a hardware/software problem on the DDN. Note that there is no risk from an OSS/MDS node crashing; the risk arises only if the DDN itself fails.
Some tips we have drawn from testing at a large installation include:
Use the full device instead of a partition (sda vs sda1). When using the full device, Lustre writes nicely aligned 1MB chunks to disk. Partitioning the disk can destroy this alignment and will noticeably impact performance.
Separate the EXT3 OST into two LUNs ― a small LUN for the EXT3 journal and a large one for the data.
Since Lustre 1.0.4, EXT3 mkfs options such as -j and -J can be supplied when creating the OST, in the following manner (where /dev/sdj has previously been formatted as a journal):
$ {LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi \
--mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds
Very important: We have found that one OST must be created per tier, especially in write-through mode (see the illustration below). This matters if you have 16 tiers: you should create 16 OSTs of one tier each, instead of eight OSTs of two tiers each.
You are not obliged to lock the small LUNs in cache.
For example ― one OST per tier
LUN Label Owner Status Capacity Block Tiers Tier list
(Mbytes) Size
-----------------------------------------------------------------
0 1 Ready 102400 512 1 1
1 1 Ready 102400 512 1 2
2 1 Ready 102400 512 1 3
3 1 Ready 102400 512 1 4
4 2 Ready [GHS] 102400 4096 1 5
5 2 Ready [GHS] 102400 4096 1 6
6 2 Critical 102400 512 1 7
7 2 Critical 102400 4096 1 8
10 1 Cache Locked 64 512 1 1
11 1 Cache Locked 64 512 1 2
12 1 Cache Locked 64 512 1 3
13 1 Cache Locked 64 512 1 4
14 2 Ready [GHS] 64 512 1 5
15 2 Ready [GHS] 64 512 1 6
16 2 Critical 64 512 1 7
17 2 Critical 64 512 1 8
System verify extent: 16MB
System verify delay: 30
3.3 Large-Scale Tuning for Cray XT and Equivalents
This information applies only to the Cray XT3 Catamount nodes; skip this section if it is not relevant to your systems. It assumes familiarity with the operation of this type of system, and explains the parameters used with the kptllnd module.
Given the large number of clients and servers possible on these systems, tuning the various request pools becomes quite important. CFS is in the process of making changes to the ptllnd module.
One tunable parameter for enhancing performance is max_nodes. It is the maximum number of queue pairs, and therefore the maximum number of peers with which the instance of the LND may communicate. Set max_nodes to a value higher than the product of the total number of nodes and the maximum processes per node:
max_nodes > (total # of nodes) * (max_procs_per_node)
If you set max_nodes below this product, Lustre throws an error. If you set it too high, excess memory is consumed.
There are a few other tunables in the code that may impact performance.
max_procs_per_node – The maximum number of cores (CPUs) on a single Catamount node. Portals must know this value in order to properly clean up various queues. LNET is not notified directly when a Catamount process aborts; the first news it gets is when a new Catamount process with the same Cray Portals NID starts up and sends a connection request. If the number of processes with that Cray Portals NID would then exceed max_procs_per_node, LNET removes the oldest one to make space for the new one.
These two tunables combine to set the size of the ptllnd request buffer pool. The buffer pool must never drop an incoming message, so proper sizing is quite important.
Two other parameters for ptllnd are ntx and credits.
ntx helps to size the transmit (tx) descriptor pool. A tx descriptor is used for each send and each passive RDMA; the maximum number of concurrent sends equals credits. A passive RDMA is a response to a PUT or GET of a payload too big to fit in a small message buffer. For servers, this only happens on large RPCs (for instance, where a long file name is included), so the MDS could be under pressure in a large cluster. For routers, this is bounded by the number of servers. A console error message appears if the tx pool is exhausted.
credits determines how many sends can be in flight at once on ptllnd. The optimum is 8 requests in flight per server. The default is 128, which should be suitable for most applications.
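A sketch of a kptllnd entry in modprobe.conf combining these tunables (all values are illustrative; size max_nodes above nodes * max_procs_per_node as described):
options kptllnd max_nodes=4200 max_procs_per_node=4 credits=128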
Chapter III-4. Lustre Troubleshooting Tips
4.1.1 Fixing a Changed Host Name
If you change the host name of a server, Lustre may not be able to start properly. Here is how to fix the Lustre configuration to resolve the issue.
Update the UUID in last_rcvd to match a changed host name.
Note: The following procedure should only be performed when Lustre is NOT running. Ensure that Lustre has been completely stopped (by executing the command lconf -d) before attempting the following procedure.
For each device to be reconfigured, perform the following steps:
In this example, we assume that <device> is the name of the device to be reconfigured (for instance, sdb) and <NEW_NAME> is the new name of the OST or MDT.
1. Mount the OST or MDT device:
mount -t ldiskfs /dev/<device> /mnt/tmp
2. Make a backup of last_rcvd:
cp -a /mnt/tmp/last_rcvd /tmp/last_rcvd.<device>.backup
3. Check the contents of the current last_rcvd:
cat /mnt/tmp/last_rcvd
4. Update last_rcvd with the correct UUID (Note the "-n" and “conv=notrunc” options):
echo -n "<NEW_NAME>_UUID" | dd of=/mnt/tmp/last_rcvd conv=notrunc
5. Verify that the new last_rcvd is correct:
cat /mnt/tmp/last_rcvd
6. Unmount the device:
umount /mnt/tmp
4.1.2 Setting SCSI IO Sizes
Some SCSI drivers default to a maximum IO size that is too small for good Lustre performance, and some drivers simply have a wrong default set. CFS has fixed quite a few drivers, but you may still find some that give unsatisfactory performance with Lustre. Because the default is hard-coded, you need to recompile the driver to change it.
If you suspect bad IO performance, and analyzing Lustre stats indicates that the IO is not 1 MB, check /sys/block/<device>/queue/max_sectors_kb. If it is less than 1024, set it to 1024 in order to improve the performance. If changing this value does not change the IO size as reported by Lustre, you may want to examine the SCSI driver code.
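For example, to raise the limit on a suspect device (sdb is illustrative):
$ echo 1024 > /sys/block/sdb/queue/max_sectors_kb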
1.1 Querying File System Space
The command lfs df is used to determine the disk space available on a file system. It displays the amount of available disk space on the mounted Lustre file system and shows space consumption per OST. If multiple Lustre file systems are mounted, a PATH may be specified, but it is not required.
Options:
-h, --human-readable – Prints sizes in human-readable format (for instance, 1K, 234M, 2G)
-i, --inodes – Lists inodes instead of block usage
Examples
fc3:~$ lfs df
UUID 1K-blocks Used Available Use% Mounted on
mds-p_UUID 4399856 528200 3871656 12 /mnt/lustre[MDT:0]
ost-a_UUID 153834852 55804744 98030108 36 /mnt/lustre[OST:0]
ost-b_UUID 153834852 55927804 97907048 36 /mnt/lustre[OST:1]
filesystem summary: 307669704 111732548 195937156 36 /mnt/lustre
fc3:~$ lfs df -h
UUID 1K-blocks Used Available Use% Mounted on
mds-p_UUID 4.2M 515.8K 3.7M 12 /mnt/lustre[MDT:0]
ost-a_UUID 146.7M 53.2M 93.5M 36 /mnt/lustre[OST:0]
ost-b_UUID 146.7M 53.3M 93.4M 36 /mnt/lustre[OST:1]
filesystem summary: 293.4M 106.6M 186.9M 36 /mnt/lustre
fc3:~$ lfs df -i
UUID Inodes IUsed Ifree IUse% Mounted on
mds-p_UUID 1257360 272869 984491 21 /mnt/lustre[MDT:0]
ost-a_UUID 19546112 257430 19288682 1 /mnt/lustre[OST:0]
ost-b_UUID 19546112 257430 19288682 1 /mnt/lustre[OST:1]
filesystem summary: 1257360 272869 984491 21 /mnt/lustre
1.2 Using Quota
The lfs quota command displays disk usage and quotas. Only user quotas are displayed by default or with the -u flag.
A root user may use the -u flag with the optional user parameter to view the limits of other users. Users without the root user authority can view the limits of groups (of which they are members) by using the -g flag with the optional group parameter.
NOTE: If a particular user has no files in a file system on which they have a quota, the command will show quota: none for that user. The user's actual quota is displayed when the user has files in the file system.
Examples
To display your quotas as a user “bob”, enter:
$ lfs quota -u /mnt/lustre
The above example will display the disk usage and limits for the user "bob."
To display quotas as the root user for user “bob,” enter:
$ lfs quota -u bob /mnt/lustre
This displays the disk usage and limits for user “bob.”
To display the quota of the group “tom” (either as a member of that group, or as root), enter:
$ lfs quota -g tom /mnt/lustre
Lustre stores a file as one or more objects on object storage targets (OSTs). When a file is composed of more than one object, Lustre stripes the file data across them in a round-robin fashion. The number of stripes, the size of each stripe, and the servers used are all configurable.
One of the most frequently asked Lustre questions is “How should I stripe my files, and what is a good default?” The short answer is that it depends on your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and no more.
There are two reasons to create files of multiple stripes: bandwidth and size.
There are many applications which require high-bandwidth access to a single file – more bandwidth than can be provided by a single OSS – for example, scientific applications which write to a single file from hundreds of nodes or a binary executable which is loaded by many nodes when an application starts.
In cases such as these you want to stripe your file over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. In our experience, the requirement is “as quickly as possible,” which usually means all OSSs.
NOTE: This assumes that your application is using enough client nodes, and can read/write data fast enough, to take advantage of that much OSS bandwidth. The largest useful stripe count is bounded by the input/output rate of your clients/jobs divided by the performance per OSS.
The second reason to stripe is when a single object storage target (OST) does
not have enough free space to hold the entire file.
2.1.2 Disadvantages of Striping
There are two disadvantages to striping which should deter you from choosing a default policy which stripes over all OSTs unless you really need it: increased overhead and increased risk.
Increased overhead comes in the form of extra network operations during common operations such as stat and unlink, and more locks. Even when these operations can be performed in parallel, there is a big difference between doing one network operation and doing one hundred.
Increased overhead also comes in the form of server contention. Consider a cluster with 100 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load is distributed evenly, there is no contention and the disks on each server can manage sequential input/output. If each file has 100 objects, then the clients will all compete with each other for the attention of the servers and the disks on each node will be seeking in 100 different directions. In this case, there is needless contention.
Increased risk is evident when you consider again the example of striping each file across all servers. In this case, if any one OSS catches on fire, a small part of every file will be lost. By comparison, if every file has exactly one stripe, you will lose fewer files, but you will lose them in their entirety. Most users would rather lose some of their files entirely than all of their files partially.
Choosing a stripe size is a small balancing act, but there are reasonable defaults. The stripe size must be a multiple of the page size. For safety, Lustre tools enforce a multiple of 64 KB (the maximum page size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not accidentally create files which might cause problems for ia64 clients.
Although you could create files with a stripe size of 64 KB, this would be a poor choice. Practically, the smallest recommended stripe size is 512 KB because Lustre tries to batch input/output into 512 KB chunks over the network. This is a good amount of data to transfer at once. Choosing a smaller stripe size may hinder the batching.
Generally, a good stripe size for sequential input/output using high-speed networks is between 1 MB and 4 MB. Stripe sizes larger than 4 MB will not parallelize as effectively because Lustre tries to keep the amount of dirty cached data below 32 MB per server with the default configuration.
Writes which cross an object boundary are slightly less efficient than writes which go entirely to one server. Depending on your application's write patterns, you can assist it by choosing the stripe size with that in mind. If the file is written in a very consistent and aligned way, you can do it a favor by making the stripe size a multiple of the write() size.
The choice of stripe size has no effect on a single-stripe file.
2.2 Displaying Striping Information with lfs getstripe
Individual files and directories can be examined with lfs getstripe:
lfs getstripe <filename>
lfs will print the index and UUID for each OST in the file system along with the OST index and object ID for each stripe in the file. For directories, the default settings for files created in that directory will be printed.
A whole tree of files can also be inspected with lfs find:
lfs find [--recursive | -r] <file or directory> ...
2.3 lfs setstripe – Setting Striping Patterns
New files with a specific stripe configuration can be created with lfs setstripe:
lfs setstripe <filename> <stripe-size> <starting-ost> <stripe-count>
If you pass a stripe-size of 0, the file system default stripe size will be used. Otherwise, the stripe-size must be a multiple of 16 KB.
If you pass a starting-ost of -1, a random first OST will be chosen. Otherwise the file will start on the specified OST index (starting at zero).
If you pass a stripe-count of 0, the file system default number of OSTs will be used. A stripe-count of -1 means that all available OSTs should be used.
Note: If you pass a starting-ost of 0 and a stripe-count of 1, all files will be written to OST #0 until space is exhausted. This is probably not your intention. If you wish to adjust only the stripe-count and keep the other parameters at their defaults, use this syntax:
lfs setstripe <filename> 0 -1 <stripe_count>
2.3.1 Changing Striping for a Subdirectory
lfs setstripe works on directories to set a default striping configuration for files created within that directory. The usage is the same as for lfs setstripe for a regular file, except that the directory must exist prior to setting the default striping configuration. If a file is created in a directory with a default stripe configuration (without otherwise specifying the striping) Lustre will use those striping parameters instead of the file system default for the new file.
To change the striping pattern for a subdirectory, create a directory with the desired striping pattern as described above. Subdirectories inherit the striping pattern of the parent directory.
NOTE: Striping on directories only affects NEW files and NEW subdirectories created therein.
2.3.2 Using a Specific Striping Pattern for a Single File
lfs setstripe will create a file with a given stripe pattern.
lfs setstripe will fail if the file already exists.
2.4 Performing Direct Input/output
Starting with 1.4.7, Lustre supports the O_DIRECT flag to open.
Applications using the read() and write() calls must supply buffers aligned on a page boundary (usually 4 KB). If the alignment is not correct, the call returns -EINVAL. Direct input/output may help performance in cases where the client is doing a large amount of input/output and is CPU-bound (CPU utilization 100%).
2.4.1 Making File System Objects Immutable
An immutable file or directory is one that cannot be modified, renamed or removed. To set the flag:
chattr +i <file>
To remove the flag:
chattr -i <file>
2.4.2 Disabling Locking for Direct I/O Accesses
2.5 Other Input/output Options
2.5.1 MDS Space Utilization
Lustre uses large inodes; each MDS inode is at least 512 bytes by default. Lustre also needs sufficient space for other metadata such as journals (up to 400MB), bitmaps and directories, plus a few regular files that Lustre uses to maintain cluster consistency. To be on the safe side, we recommend planning for 4KB per inode on the MDS.
If you use the -i option to mke2fs, or specify an absolute number of inodes using -N {num inodes}, newer e2fsprogs will reduce the group size, allowing more than one inode per 1024 bytes. Every time you create a file on a Lustre file system, one inode is used on the corresponding MDS (as well as one inode on the OST itself). The minimum bytes-per-inode for ext3 is 1024 and the maximum block size is 4096; thus, the maximum ratio is four inodes per block.
The file system on an MDS and that on an OST are independent of each other, so the formatting parameters for the two need not be the same. The size of the MDS file system depends solely on how many inodes you want in the total Lustre file system, not on the size of the aggregate OST space. Conversely, you can use a much higher maximum number of bytes per inode, up to 128MB per eight inodes, which is useful on OSTs if you have a very large average file size.
As a result, the only important factor when calculating the MDS size is the average size of the files to be stored in the file system. If the average file size is, for instance, 5MB and you have 100TB of usable OST space, then you need at least (100 * 1024 * 1024 / 5) = 20 million inodes (though having twice the minimum, that is, 40 million inodes, is recommended). At the default of 4KB per inode, the minimum works out to only 80GB of space for the MDS.
On the other hand, with a very small average file size, for example 4KB, Lustre is not very efficient, because you consume as much space on the MDS as on the OSTs. This is not a very common configuration for Lustre. With a 2TB MDS you could potentially have one inode per 1KB (an inode cannot be smaller than 512 bytes), so 2 billion inodes would need 2B * 4KB = 8TB of usable OST space. Depending on your needs, you could instead serve such a workload with a single ext3 file system rather than Lustre.
NOTE: On the Lustre MDS, it is inodes that are consumed, not space.
2.5.2 End to End Client Checksums
To guard against data corruption, a Lustre client can perform end-to-end data checksums. This must be enabled on the individual client nodes. A bad checksum does not produce an IO error on the client; instead, it is reported immediately in a syslog message. Both the client and the OST log messages at intervals showing that checksums are being validated. A /proc file controls the checksum behavior:
/proc/fs/lustre/llite/fs0/checksum_pages
To enable checksums on a client:
echo 1 > /proc/fs/lustre/llite/fs0/checksum_pages
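Presumably, writing zero turns checksums back off (an assumption, following the convention of the other boolean /proc flags here):
$ echo 0 > /proc/fs/lustre/llite/fs0/checksum_pages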
3.1 Using Access Control Lists
An ACL, or access control list, is a set of data that informs an operating system about the permissions, or access rights, that each user or group has to a specific system object, such as a directory or file. Each object has a unique security attribute that identifies which users have access to it. The ACL is a list of each object and user access privileges such as read, write or execute.
Implementing ACLs varies between operating systems. Systems that support the POSIX (Portable Operating System Interface) family of standards share a simple yet powerful file system permission model, which should be well-known to the Linux/Unix administrator. ACLs add finer-grained permissions to this model, allowing for more complicated permission schemes. For a detailed explanation of ACLs on Linux, we recommend the SuSE Labs article, “Posix Access Control Lists on Linux” found on-line here:
http://www.suse.de/~agruen/acl/linux-acls/online/
CFS has implemented ACLs according to this model. Lustre supports the standard Linux ACL tools, setfacl, getfacl, and the historical chacl, normally installed with the acl package.
Lustre versions 1.4.6 and above support POSIX ACLs. When using a Lustre client of version 1.4.5 or below with an MDS of version 1.4.6, or vice versa, the user space program generates an error “Operation not supported” during ACL operations.
The MDS needs to be configured to enable ACLs. This can be done when creating your configuration with --mountfsoptions:
lmc -m --add mds --node ft2 --mds mds-l --fstype ldiskfs --dev /dev/sdc --mountfsoptions=acl
Or, you can enable at run time by using the --acl option with lconf:
lconf --acl config.xml
ACLs on the client are enabled at mount time when ACLs are enabled on the MDS. You do not need to change the client configuration, and the “acl” string will not appear in the client /etc/mtab. The client acl mount option is no longer needed. If a client is mounted with that option, this message will appear in the MDS syslog:
...MDS requires ACL support but client does not
The message is harmless but indicates a configuration issue, which should be corrected.
If ACLs are not enabled on the MDS, any attempts to reference an ACL on a client will return an “Operation not supported” error.
These examples are taken directly from the POSIX paper referenced above. ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner. Here we create a directory and allow a specific user access.
[cliffw@q_3 lustre]$ umask 027
[cliffw@q_3 lustre]$ mkdir baz
[cliffw@q_3 lustre]$ ls -ld baz
drwxr-x--- 2 cliffw cliffw 4096 Sep 26 13:39 baz
[cliffw@q_3 lustre]$ getfacl baz
# file: baz
# owner: cliffw
# group: cliffw
user::rwx
group::r-x
other::---
[cliffw@q_3 lustre]$ setfacl -m user:monkey:rwx baz
[cliffw@q_3 lustre]$ ls -ld baz
drwxrwx---+ 2 cliffw cliffw 4096 Sep 26 13:39 baz
[cliffw@q_3 lustre]$ getfacl --omit-header baz
user::rwx
user:monkey:rwx
group::r-x
mask::rwx
other::---
Chapter IV – 4. Other Lustre Operating Tips
4.1 Expanding the File System by Adding OSTs
Lustre's behavior when OSTs are full is not ideal; however, several roadmap items address this issue.
The current behavior can be described as follows:
Once an OST has filled up, an application trying to write new data to it will receive an -ENOSPC error. This produces the unfortunate, confusing behavior of allowing writes to some files (or even some parts of a striped file) while denying writes to others. This is not an issue for newly created files, as Lustre avoids placing new files on full OSTs.
If empty OSTs are then added to the system, the system will be quite out of balance. Although Lustre does not yet have an on-line data migration function, you can re-balance your data manually.
Instructions for adding OSTs to an existing Lustre file system:
Step 1: Stop any existing Lustre clients and servers
Step 2: Modify or re-create the Lustre XML configuration (using lmc, as usual). You can do this before you stop your servers, as long as you save a copy of the old XML.
Add "lmc --add ost" commands to your lmc script.
Step 3: Run "lconf --write_conf" on the MDS.
This compiles the XML into a binary configuration log that is stored in the MDS. This log is processed by clients at mount-time, to allow them to mount without needing lconf or the XML.
Just add "--write_conf" to your normal MDS lconf startup command. It will start the MDS, compile the XML, and stop the MDS. If you skip this step, your clients will use the old configuration, with unpredictable results.
Step 4: Start your servers and clients
It is very important that these additional "lmc --add ost" commands come *after* all of the current OSTs in the script. You will also need to reformat the new OST device manually after --write_conf, at least in Lustre 1.4.6 and above:
# mke2fs -j -J size=400 -I 256 -i 16384 /dev/DEVICE
The basic method of rebalancing is to copy an existing file to a temporary file, then 'mv' the temporary file over the old one. Naturally, this should not be attempted with files that are currently being written to by users or applications. The operation redistributes the stripes over the entire set of OSTs. A sample script for this migration is attached.
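A minimal sketch of the approach for a single file (the names are illustrative, and the file must not be in use while it is copied):
$ cp -p /mnt/lustre/dir/file /mnt/lustre/dir/file.tmp
$ mv /mnt/lustre/dir/file.tmp /mnt/lustre/dir/file
The copy allocates new objects under the current OST selection (which avoids full OSTs), and the rename then replaces the original in place.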
A very clever migration script would:
- Examine the current distribution of data
- Calculate how much data should move from each full OST to the empty ones
- Search for files on a given full OST (using "lfs getstripe")
- Force the new destination OST (using lfs setstripe)
- Copy only enough files to address the imbalance
If an enterprising Lustre administrator wants to explore this approach, per-OST disk-usage statistics can be found under /proc/fs/lustre/osc/*
Future Development:
In the short term (4Q05-1Q06) Lustre will include a runtime option that will create proportionally more new files on OSTs with more room available. Although this won't help if you need to write new data to an existing file on a completely full OST, it will help to keep a system from getting too far out of balance in the first place, and help bring it back into balance more quickly.
The problem is best solved with a proper on-line data migrator, which can safely migrate even those files being actively modified. This is a very involved task, one which will likely not be completed in the coming year.
lfs is a Lustre client file system utility that is used to display striping information for files and directories, set striping policies for files and directories, search for files with specific attributes (in the style of the Unix find command), and create or set quotas.
lfs
lfs df [-i] [-h] [path]
lfs find [--quiet|-q] [--verbose|-v] [--recursive|-r] <dir/file>
lfs find [--atime|-A N] [--mtime|-M N] [--ctime|-C N] [--maxdepth|-D N] [--print0|-P]
lfs getstripe [--obd|-O <uuid>] [--quiet|-q] [--verbose|-v] \
[--recursive|-r] <dir/file>
lfs setstripe <filename|dirname> <stripe_size> <start_ost> <stripe_count>
lfs setstripe -d <dirname>
lfs quotachown [-i] <filesystem>
lfs quotacheck [-ugf] <filesystem>
lfs quotaon [-ugf] <filesystem>
lfs quotaoff [-ug] <filesystem>
lfs setquota [-u|-g] <name> <block-softlimit> <block-hardlimit> <inode-softlimit> \
<inode-hardlimit> <filesystem>
lfs quota [-o obd_uuid] [-u|-g] <name> <filesystem>
lfs check <mds|osts|servers> [--print|-p] [--obd|-O <uuid>] <dir/file>
lfs help
NOTE: In the above synopsis, <filesystem> refers to the mount point of the Lustre file system (default: /mnt/lustre).
This utility is used to create a new file with a specific striping pattern, determine the default striping pattern, gather the extended attributes (object numbers and locations) of a specific file, and set Lustre quotas. It can be invoked interactively without any arguments or in a non-interactive mode.
You can issue the following commands to invoke lfs in an interactive mode.
$ lfs
lfs> help
To get a complete listing of available commands, type “help” on the lfs prompt. To get basic help on meaning and syntax of a command, type “help command.” The tab key activates command completion. Command history is available via the “UP” and “DOWN” arrow keys.
Here are the sub-commands available:
setstripe:
creates a new file with a specific striping pattern
sets the default striping pattern on an existing directory
deletes the default striping pattern from an existing directory.
getstripe:
lists the striping pattern for a given file name or files in a given directory
lists the striping pattern recursively for all files in a directory tree
lists the files that have objects on a specific OST.
find: (old usage)
lists the extended attributes for a given filename or files in a directory
lists the extended attributes recursively for all files in a directory tree
lists the files that have objects on a specific OST.
Please note, this use of the lfs command has been replaced by “lfs getstripe.” “lfs find” now matches the traditional UNIX find: it searches the directory tree rooted at the given dir/file name for files that match the given parameters.
find: (new usage)
--atime (the file was last accessed N*24 hours ago). The time options check whether the file was last accessed, changed, or modified N days ago, that is, within the interval [N, N+1) days. The number can be specified as +N or -N, for more than or less than N days ago, respectively
--ctime (the status of the file was last changed N*24 hours ago)
--mtime (the data in the file was last modified N*24 hours ago)
--obd (the file has an object on a specific OST)
--maxdepth allows the find command to descend at most N levels of the directory tree
[--print0|-P] [--print|-p] prints the full file name on the standard output, followed by a null character or a newline, respectively
If one of the options below is specified, lfind works in the so-called “old” mode. This mode is obsolete; use “lfs getstripe” instead. Both “lfs getstripe” and “lfs find” in the “old” mode have the following options:
[--quiet|-q] [--verbose|-v] [--recursive|-r]
NOTE: lfind in the “new” mode can run on a non-Lustre file system and correctly crosses mount points between Lustre and non-Lustre file systems in either direction.
df: reports file system disk space usage or inode usage for each MDS / OST.
quotachown: changes the owner or group of a file on the specified file system.
quotacheck: scans the specified file system for disk usage and creates or updates quota files.
quotaon: turns file system quotas on.
quotaoff: turns file system quotas off.
setquota: sets file system quotas.
quota: displays the disk usage and limits.
check: displays the status of MDS or OSTs (as specified in the command), or all the servers (MDS and OSTs).
osts: lists all the OSTs for the file system.
help: provides brief help on various arguments.
exit/quit: quits the interactive lfs session.
To create a file striped on one OST:
$ lfs setstripe /mnt/lustre/file1 131072 0 1
To create a default striping pattern on an existing directory for all the new files created therein:
$ lfs setstripe /mnt/lustre/dir 131072 0 1
To delete the default striping pattern on a directory:
$ lfs setstripe -d /mnt/lustre/dir
(New files created therein will use the file system default striping pattern.)
stripe size: if you pass a stripe-size of 0, the file system default stripe size will be used. Otherwise the stripe-size must be a multiple of 16 KB.
stripe start: if you pass a starting-ost of -1, a random first OST will be chosen. Otherwise the file will start on the specified OST index (starting at 0).
stripe count: if you pass a stripe-count of 0, the file system default number of OSTs will be used. A stripe-count of -1 means that all available OSTs should be used.
Note on defaults: The default stripe_size is 0 and the default stripe start is -1 ― do not confuse them! If you set the stripe start to 0, all new files will be created on OST 0, which is seldom a good idea.
Below is an example of setting and getting stripes:
lfs > setstripe lustre.iso 0 -1 0
lfs > getstripe lustre.iso
OBDS:
0: ost1_UUID ACTIVE
1: ost2_UUID_2 ACTIVE
./lustre
obdidx objid objid group
1 4 0x4 0
To list the extended attributes of a given file:
$ lfs find /mnt/lustre/foo1
OBDS:
0: OST_localhost_UUID
/mnt/lustre/foo1
obdidx objid objid group
0 1 0x1 0
To list the extended attributes of all files in a given directory:
$ lfs find /mnt/lustre/
To list them recursively for all files in a directory tree:
$ lfs find -r /mnt/lustre/
To list all the files that have objects on a specific OST:
$ lfs find -r --obd OST2-UUID /mnt/lustre/
To change the file owner and group:
$ lfs quotachown -i /mnt/lustre
To check the quota for a user and a group:
$ lfs quotacheck -ug /mnt/lustre
To turn on the quotas for a user and a group:
$ lfs quotaon -ug /mnt/lustre
To turn off the quotas for a user and a group:
$ lfs quotaoff -ug /mnt/lustre
To set the quotas for a user to a 1GB block quota and a 10,000 file quota:
$ lfs setquota -u {username} 0 1000000 0 10000 /mnt/lustre
To change the owner or group while ignoring errors (for instance, if a file does not exist), use the -i option:
$ lfs quotachown -i /mnt/lustre
To check the disk space in inodes available on individual MDS and OST:
$ lfs df -i /mnt/lustre
UUID        Inodes     Used   Free       Use%  Mounted on
mds-1_UUID  53265600   28266  53237334   0     /mnt/lustre[MDT:0]
ost-1_UUID  244056064  1349   244054715  0     /mnt/lustre[OST:0]
ost-2_UUID  244056064  884    244055180  0     /mnt/lustre[OST:1]
To check the disk space available on individual MDS and OSTs, by size:
$ lfs df -h /mnt/lustre
UUID        1K-blocks  Used    Available  Use%  Mounted on
mds-1_UUID  203.5M     12.1M   191.5M     5     /mnt/lustre[MDT:0]
ost-1_UUID  1.8G       384.7M  1.4G       20    /mnt/lustre[OST:0]
ost-2_UUID  1.8G       343.0M  1.5G       18    /mnt/lustre[OST:1]
ost-3_UUID  1.8G       332.2M  1.5G       18    /mnt/lustre[OST:2]
To list the quotas of a user:
$ lfs quota -u {username} /mnt/lustre
To check the status of all the servers – MDS and OSTs:
$ lfs check servers
OSC_localhost.localdomain_OST_localhost_mds1 active.
OSC_localhost.localdomain_OST_localhost_MNT_localhost active.
MDC_localhost.localdomain_mds1_MNT_localhost active.
To check the status of all the servers – MDSs:
$ lfs check mds
To check the status of all the servers – OSTs:
$ lfs check osts
To list all the OSTs:
$ lfs osts
OBDS:
0: OST_localhost_UUID
To list the logs of particular types:
$ lfs catinfo {keyword} [node name]
The keyword is one of the following: config, deletions.
A node name must be provided when using the keyword config.
For instance,
$ lfs catinfo {config|deletions} {mdsnode|ostnode}
To join two files:
$ lfs join <filename_A> <filename_B>
This chapter describes the public programming interfaces for controlling various aspects of Lustre from userspace. These interfaces are generally not guaranteed to remain unchanged over time, although we will make an effort to notify the user community well in advance of major changes.
Use /proc/fs/lustre/mds/mds-service/group_upcall to look up a given user’s group membership.
The group upcall file contains the path to an executable that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (below) and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file.
See lustre/utils/l_getgroups.c in the Lustre source distribution for an example upcall program.
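As a sketch, the upcall could be installed like this (the service name mds1 and the installed path are illustrative, and this assumes the pseudo-file accepts writes at run time):
$ echo /usr/sbin/l_getgroups > /proc/fs/lustre/mds/mds1/group_upcall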
The upcall is passed two arguments: the name of the MDS service, and the numeric UID to resolve.
#include <lustre/lustre_user.h>
#define MDS_GRP_DOWNCALL_MAGIC 0x6d6dd620
struct mds_grp_downcall_data {
__u32 mgd_magic;
__u32 mgd_err;
__u32 mgd_uid;
__u32 mgd_gid;
__u32 mgd_ngroups;
__u32 mgd_groups[0];
};
Chapter V-3. Config Files and Module Parameters (man5)
LNET network hardware and routing are now configured via module parameters. Parameters should be specified in the /etc/modprobe.conf file, for example:
alias lustre llite
options lnet networks=tcp0,elan0
The above option specifies that this node should use all the available tcp and elan interfaces.
Module parameters are read when the module is first loaded. Type-specific LND (Lustre Network Device) modules (for instance, ksocklnd) are loaded automatically by the lnet module when LNET starts (typically upon modprobe ptlrpc).
Under Linux 2.6, the LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under lnet and LND-specific parameters under the name of the corresponding LND.
Under Linux 2.4, sysfs is not available, but the LND-specific parameters are accessible via equivalent paths under /proc.
Important: All old (pre-v1.4.6) Lustre configuration lines should be removed from the module configuration files and replaced with the following. Make sure that CONFIG_KMOD is set in your kernel configuration so that LNET can load the modules it needs. The basic module files are:
modprobe.conf (Linux 2.6)
alias lustre llite
options lnet networks=tcp0,elan0
modules.conf (Linux 2.4)
alias lustre llite
options lnet networks=tcp0,elan0
For the following parameters, default option settings are shown in parentheses. Changes to parameters marked with W affect running systems. (Unmarked parameters can only be set when LNET loads for the first time.) Changes to parameters marked with Wc take effect only when connections are established (existing connections are not affected).
With routed or other multi-network configurations, use ip2nets rather than networks so that all nodes can use the same configuration.
For a routed network, use the same “routes” configuration everywhere. Nodes specified as routers automatically enable forwarding and any routes that are not relevant to a particular node are ignored. Keeping a common configuration guarantees that all nodes will have consistent routing tables.
A separate modprobe.conf.lnet included from modprobe.conf makes distributing the configuration much easier.
If you set config_on_load=1, LNET starts at modprobe time rather than waiting for Lustre to start. This ensures that routers start working at module load time. However, in this case lconf --cleanup will not stop LNET; you must run lctl --net stop on these nodes.
Remember lctl ping – it is a very handy way to check your LNET configuration.
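For example (the NID is illustrative):
$ lctl ping 192.168.0.2@tcp0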
The network topology module parameters determine which networks a node should join, whether it should route between these networks and how it communicates with non-local networks.
Here is a list of the various networks and their supported software stacks:
Network    Software Stack
openib     OpenIB gen1 / Mellanox Gold
iib        Silverstorm (Infinicon)
vib        Voltaire
o2ib       OpenIB gen2
cib        Cisco
Note: Lustre ignores the loopback interface (lo0), but it will use any IP addresses aliased to the loopback by default. When in doubt, specify networks explicitly.
ip2nets ("") is a string that lists globally available networks, each with a set of IP address ranges. LNET determines the locally available networks from this list by matching the IP address ranges with the local IP’s of a node. The purpose of this option is to be able to use the same modules.conf file across a variety of nodes on different networks. The string has the following syntax...
<ip2nets> :== <net-match> [ <comment> ] { <net-sep> <net-match> }
<net-match> :== [ <w> ] <net-spec> <w> <ip-range> { <w> <ip-range> } [ <w> ]
<net-spec> :== <network> [ "(" <iface-list> ")" ]
<network> :== <nettype> [ <number> ]
<nettype> :== "tcp" | "elan" | "openib" | ...
<iface-list> :== <interface> [ "," <iface-list> ]
<ip-range> :== <r-expr> "." <r-expr> "." <r-expr> "." <r-expr>
<r-expr> :== <number> | "*" | "[" <r-list> "]"
<r-list> :== <range> [ "," <r-list> ]
<range> :== <number> [ "-" <number> [ "/" <number> ] ]
<comment> :== "#" { <non-net-sep-chars> }
<net-sep> :== ";" | "\n"
<w> :== <whitespace-chars> { <whitespace-chars> }
The <net-spec> contains enough information to identify the network uniquely and load an appropriate LND. The LND determines the missing "address-within-network" part of the NID based on the interfaces it can use.
The optional <iface-list> specifies which hardware interface the network can use. If omitted, all the interfaces are used. LNDs that do not support the <iface-list> syntax cannot be configured to use particular interfaces and just use what is there. Only a single instance of these LNDs can exist on a node at any time, and the <iface-list> must be omitted.
The <net-match> entries are scanned in the order declared to see if one of the node's IP addresses matches one of the <ip-range> expressions. If there is a match, the <net-spec> specifies the network to instantiate. Note that it is the first match for a particular network that counts. This can be used to simplify the match expression for the general case by placing it after the special cases. For example..
ip2nets="tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*"
4 nodes on the 134.32.1.* network have 2 interfaces (134.32.1.{4,6,8,10}) but all the rest have 1.
ip2nets="vib 192.168.0.*; tcp(eth2) 192.168.0.[1,7,4,12]"
This describes an IB cluster on 192.168.0.*. 4 of these nodes also have IP interfaces; these 4 could be used as routers.
Note that match-all expressions (For instance, *.*.*.*) effectively mask all other <net-match> entries specified after them. Hence, they should be used with caution.
Here is a more complicated situation, see below for an explanation of the route parameter. We have:
Two TCP subnets
One Elan subnet
One machine set up as a router, with both TCP and Elan interfaces
We have IP over Elan configured, but IP will only be used to label the nodes.
options lnet ip2nets=”tcp 198.129.135.* 192.128.88.98; \
elan 198.128.88.98 198.129.135.3;” \
routes=”tcp 1022@elan # Elan NID of router;\
elan 198.128.88.98@tcp # TCP NID of router “
This is an alternative to "ip2nets" which can be used to specify the networks to be instantiated explicitly. The syntax is a simple comma separated list of <net-spec>s (see above). The default is only used if neither “ip2nets” nor “networks” is specified.
This is a string that lists networks and the NIDs of routers that forward to them.
It has the following syntax (<w> is one or more whitespace characters):
<routes> :== <route>{ ; <route> }
<route> :== [<net>[<w><hopcount>]<w><nid>{<w><nid>}
So a node on the network tcp1 that needs to go through a router to get to the elan network
options lnet networks=tcp1 routes="elan 1 192.168.2.2@tcp1”
The hopcount is used to help choose the best path between multiply-routed configurations.
A simple but powerful expansion syntax is provided, both for target networks and router NIDs as follows...
<expansion> :== "[" <entry> { "," <entry> } "]"
<entry> :== <numeric range> | <non-numeric item>
<numeric range> :== <number> [ "-" <number> [ "/" <number> ] ]
The expansion is a list enclosed in square brackets. Numeric items in the list may be a single number, a contiguous range of numbers, or a strided range of numbers. For example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent (hopcount defaults to 1); and is accessible via 3 routers on the tcp0 network (192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp).
routes="[tcp,vib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and vib0) are accessible through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means that traffic to both these networks will be traversed 2 routers - first one of the routers specified in this entry, then one more.
Duplicate entries, entries that route to a local network, and entries that specify routers on a non-local network are ignored.
Equivalent entries are resolved in favor of the route with the shorter hopcount. The hopcount, if omitted, defaults to 1 (that is, the remote network is adjacent).
It is an error to specify routes to the same destination with routers on different local networks.
If the target network string contains no expansions, the hopcount defaults to 1 and may be omitted (that is, the remote network is adjacent). In practice, this is true for most multi-network configurations. It is an error to specify an inconsistent hop count for a given target network. This is why an explicit hopcount is required if the target network string specifies more than one network.
This is a string that can be set either to "enabled" or "disabled" for explicit control of whether this node should act as a router, forwarding communications between all local networks.
A standalone router can be started by simply starting LNET (“modprobe ptlrpc”) with appropriate network topology options
Acceptor
The acceptor is a TCP/IP service that some LNDs use to establish communications. If a local network requires it and it has not been disabled, the acceptor listens on a single port for connection requests that it redirects to the appropriate local network. The acceptor is part of the LNET module and configured by the following
accept
accept ("secure") is a string that can be set to any of the following values.
secure - accept connections only from reserved TCP ports (< 1023).
all - accept connections from any TCP port. Note: this is required for libLustre clients to allow connections on non-privledged ports.
none - do not run the acceptor
accept_port
accept_port (988) is the port number on which the acceptor should listen for connection requests. All nodes in a site configuration that require an acceptor must use the same port.
accept_backlog
accept_backlog (127) is the maximum length that the queue of pending connections may grow to (see listen(2)).
accept_timeout
accept_timeout (5,W) is the maximum time in seconds the acceptor is allowed to block while communicating with a peer.
accept_proto_version
accept_proto_version is the version of the acceptor protocol that should be used by outgoing connection requests. It defaults to the most recent acceptor protocol version, but it may be set to the previous version to allows the node to initiate connections with nodes that only understand that version of the acceptor protocol. The acceptor can, with some restrictions, handle either version (i.e. it can accept connections from both 'old' and 'new' peers). For the current version of the acceptor protocol (version 1), the acceptor is compatible with old peers if it is only required by a single local network.
3.2.2 SOCKLND Kernel TCP/IP LND
The socklnd is connection-based and uses the acceptor to establish communications via sockets with its peers.
It supports multiple instances and load balances dynamically over multiple interfaces. If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters.
Consider a node on the “edge” of an Infiniband network, with a low bandwidth management ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1,eth2) providing off-cluster connectivity. This node should be configured with "networks=vib,tcp(eth1,eth2)” to ensure that the socklnd ignores the management ethernet and IPoIB.
timeout (50,W) is the time in seconds that communications may be stalled before the LND will complete them with failure.
nconnds (4) sets the number of connection daemons.
min_reconnectms (1000,W) is the minimum connection retry interval in milliseconds. This sets the time that must elapse before the first retry after a failed connection attempt. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'.
max_reconnectms (60000,W) is the maximum connection retry interval in milliseconds.
eager_ack (0 on linux, 1 on darwin,W) is a boolean that determines whether the socklnd should attempt to flush sends on message boundaries.
typed_conns (1,Wc) is a boolean that determines whether the socklnd should use different sockets for different types of message. When clear, all communication with a particular peer takes place on the same socket. Otherwise separate sockets are used for bulk sends, bulk receives and everything else.
min_bulk (1024,W) determines when a message is considered "bulk".
tx_buffer_size, rx_buffer_size (8388608,Wc) sets the socket buffer sizes. Set this to '0' to allow the system to auto-tune buffer sizes. Be very careful if altering this value as improper sizing can harm the performance.
nagle (0,Wc) is a boolean that determines if nagle should be enabled. It should never be set in production systems.
keepalive_idle (30,Wc) is the time in seconds that a socket can remain idle before a keepalive probe is sent. 0 disables keepalives
keepalive_intvl (2,Wc) is the time in seconds to repeat unanswered keepalive probes. 0 disables keepalives.
keepalive_count (10,Wc) is the number of unanswered keepalive probes before pronouncing socket (hence peer) death.
enable_irq_affinity (1,Wc) is a boolean that determines whether to enable IRQ affinity. When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system to exist and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see increase in the performance with this parameter disabled.
zc_min_frag (2048,W) determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE will disable all zero copy sends. This option is not available on all platforms.
The qswlnd is connectionless, therefore it does not need the acceptor.
It is limited to a single instance, which uses all Elan "rails" that are present and load balances dynamically over them.
The address-with-network is the node's Elan ID. A specific interface cannot be selected in the "networks" module parameter.
tx_maxcontig (1024) is a integer that specifies the maximum message payload in bytes to copy into a pre-mapped transmit buffer.
ntxmsgs (8) is the number of "normal" message descriptors for locally initiated communications that may block for memory (callers block when this pool is exhaused).
nnblk_txmsg (512 with a 4K page size, 256 otherwise) is the number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so that it is never exhausted.
nrxmsg_small (256) is the number of "small" receive buffers to post (typically everything apart from bulk data).
ep_envelopes_small (2048) is the number of message envelopes to reserve for the "small" receive buffer queue. This determines a breakpoint in the number of concurrent senders. Below this number, communication attempts are queued, but above this number, the pre-allocated envelope queue will fill, causing senders to back off and retry. This can have the unfortunate side effect of starving arbitrary senders, who continually find the envelope queue is full when they retry. This parameter should therefore be increased if envelope queue overflow is suspected.
nrxmsg_large (64) is the number of "large" receive buffers to post (typically for routed bulk data).
ep_envelopes_large (256) is the number of message envelopes to reserve for the "large" receive buffer queue. See "ep_envelopes_small" above for a further description of message envelopes.
optimized_puts (32768,W) is the smallest non-routed PUT that will be RDMA-ed.
optimized_gets (1,W) is the smallest non-routed GET that will be RDMA-ed.
The ralnd is connection-based and uses the acceptor to establish connections with its peers.
It is limited to a single instance, which uses all (both) RapidArray devices present. It load balances over them using the XOR of the source and destination NIDs to determine which device to use for any communication.
The address-within-network is determined by the address of the single IP interface that may be specified by the "networks" module parameter. If this is omitted, the first non-loopback IP interface that is up is used instead.
n_connd (4) sets the number of connection daemons.
min_reconnect_interval (1,W) is the minimum connection retry interval in seconds. This sets the time that must elapse before the first retry after a failed connection attempt. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnect_interval'.
max_reconnect_interval (60,W) is the maximum connection retry interval in seconds.
timeout (30,W) is the time in seconds that communications may be stalled before the LND will complete them with failure
ntx (64) is the number of "normal" message descriptors for locally initiated communications that may block for memory (callers block when this pool is exhaused).
ntx_nblk (256) is the number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so that it is never exhausted.
fma_cq_size (8192) is the number of entries in the RapidArray FMA completion queue to allocate. It should be increased if the ralnd starts to issue warnings that the FMA CQ has overflowed. This is only a performance issue.
max_immediate (2048,W) is the size in bytes of the smallest message that will be RDMA-ed, rather than being included as immediate data in an FMA. All messages over 6912 bytes must be RDMA-ed (FMA limit).
The vib lnd is connection based, establishing reliable queue-pairs over Infiniband with its peers. It does not use the acceptor for this.
It is limited to a single instance, which uses a single HCA that can be specified via the "networks" module parameter. It this is omitted, it uses the first HCA in numerical order it can open.
The address-within-network is determined by the IPoIB interface corresponding to the HCA used.
service_number (0x11b9a2) is the fixed IB service number on which the LND listens for incoming connection requests. Note that all instances of the viblnd on the same network must have the same setting for this parameter.
arp_retries (3,W) is the number of times the LND will retry ARP while it establishes communications with a peer.
min_reconnect_interval (1,W) is the minimum connection retry interval in seconds. This sets the time that must elapse before the first retry after a failed connection attempt. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnect_interval'.
max_reconnect_interval (60,W) is the maximum connection retry interval in seconds.
timeout (50,W) is the time in seconds that communications may be stalled before the LND will complete them with failure.
ntx (32) is the number of "normal" message descriptors for locally initiated communications that may block for memory (callers block when this pool is exhaused).
ntx_nblk (256) is the number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so that it is never exhausted.
concurrent_peers (1152) is the maximum number of queue pairs, and therefore the maximum number of peers that the instance of the LND may communicate with.
hca_basename ("InfiniHost") is used to construct HCA device names by appending the device number.
ipif_basename ("ipoib") is used to construct IPoIB interface names by appending the same device number as is used to generate the HCA device name.
local_ack_timeout (0x12,Wc) is a low-level QP parameter. It should not be changed from the default unless advised.
retry_cnt (7,Wc) is a low-level QP parameter. It should not be changed from the default unless advised.
rnr_cnt (6,Wc) is a low-level QP parameter. It should not be changed from the default unless advised.
rnr_nak_timer (0x10,Wc) is a low-level QP parameter. It should not be changed from the default unless advised.
fmr_remaps (1000) controls how often FMR mappings may be reused before they must be unmapped. It should not be changed from the default unless advised.
cksum (0,W) is a boolean that determines whether messages (NB not RDMAs) should be checksummed. This is a diagnostic feature that should not be enabled normally.
The openib lnd is connection based and uses the acceptor to establish reliable queue-pairs over infiniband with its peers.
It is limited to a single instance that uses only IB device '0'.
The address-within-network is determined by the address of the single IP interface that may be specified by the "networks" module parameter. If this is omitted, the first non-loopback IP interface that is up, is used instead. It uses the acceptor to establish connections with its peers.
n_connd (4) sets the number of connection daemons. The default is 4.
min_reconnect_interval (1,W) is the minimum connection retry interval in seconds. This sets the time that must elapse before the first retry after a failed connection attempt. As connections attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnect_interval'.
max_reconnect_interval (60,W) is the maximum connection retry interval in seconds.
timeout (50,W) is the time in seconds that communications may be stalled before the LND will complete them with failure.
ntx (64) is the number of "normal" message descriptors for locally initiated communications that may block for memory (callers block when this pool is exhausted).
ntx_nblk (256) is the number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so that it is never exhausted.
concurrent_peers (1024) is the maximum number of queue pairs, and therefore the maximum number of peers that the instance of the LND may communicate with.
cksum (0,W) is a boolean that determines whether messages (NB not RDMAs) should be checksummed. This is a diagnostic feature that should not be enabled normally.
The ptllnd can be used as a interface layer to communicate with Sandia Portals networking devices. This version is intended to work on the Cray XT3 Linux nodes using Cray Portals as a network transport.
Message Buffers - When ptllnd starts up, it allocates and posts sufficient message buffers to allow all expected peers (set by 'concurrent_peers') to send 1 message unsolicited. The first message a peer actually sends is a (so-called) "HELLO" message, which is used to negotiate how much additional buffering to set up; typically 8 messages. So if 10000 peers actually exist, we will post enough buffers for 80000 messages.
The maximum message size is set by the max_msg_size module parameter (default 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint, payload data is sent in the message itself, and above this breakpoint, a buffer descriptor is sent and the receiver gets the actual payload.
The buffer size is set by the rxb_npages module parameter (default 1). The default conservatively avoids allocation problems due to kernel memory fragmentation. However increasing this to 2 is probably not risky.
The ptllnd also keeps an additional rxb_nspare buffers (default 8) posted to account for full buffers being handled.
Assuming a 4K page size, with 10000 peers, 1258 buffers can be expected to be posted at startup, rising to a max of 10008 as peers actually connected. This could be reduced by a factor of 4 by doubling rxb_npages halving max_msg_size.
ME/MD queue length - The ptllnd uses a single portal set by the portal module parameter (default 9) for both message and bulk buffers. Message buffers are always attached with PTL_INS_AFTER and match anything sent with "message" matchbits. Bulk buffers are always attached with PTL_INS_BEFORE and match only specific matchbits for that particular bulk transfer.
This scheme assumes that the majority of ME/MDs posted are for "message" buffers, and that the overhead of searching through the preceding "bulk" buffers is acceptable. Since the number of "bulk" buffers posted at any time is also dependent on the bulk transfer breakpoint set by max_msg_size, this seems like an issue worth measuring at scale.
TX descriptors - The ptllnd has a pool of so-called "tx descriptors", which it uses not only for outgoing messages, but also to hold state for bulk transfers requested by incoming messages. This pool should therefore scale with the total number of peers.
To enable the building of the Portals LND (ptllnd.ko) configure with the following option:
./configure --with-portals=<path-to-portals-headers>
ntx (256) The total number of message descriptors
concurrent_peers (1152) The maximum number of concurrent peers. Peers attempting to connect beyond the maximum will not be allowd.
peer_hash_table_size (101) The number of hash table slots for the peers. This number should scale with concurrent_peers. The size of the peer hash table is set by the module parameter peer_hash_table_size which defaults 101. This number should be prime to ensure the peer hash table is populated evenly. Increasing this to 1001 for~10000 peers is advisable.
cksum (0) Set to non-zero to enable message (not RDMA) checksums for outgoing packets. Incoming packets will always be checksumed if necssary, independnt of this value.
timeout (50) The amount of time a request can linger in a peers active queue, before the peer is considered dead. Units: seconds.
portal (9) The portal ID to use for the ptllnd traffic.
rxb_npages (64 * #cpus) The number of pages in a RX Buffer.
credits (128) The maximum total number of concurrent sends that are outstanding at any given instant.
peercredits (8) The maximum number of concurrent sends that are outstanding to a single peer at any given instant.
max_msg_size (512) The maximum immedate message size. This MUST be the same on all nodes in a cluster. A peer connecting with a diffrent max_msg_size will be rejected.
Portals LND (Catamount)
The ptllnd can be used as a interface layer to communicate with Sandia Portals networking devices. This version is intended to work on the Cray XT3 Catamount nodes using Cray Portals as a network transport.
To enable the building of the Portals LND configure with the following option:
./configure --with-portals=<path-to-portals-headers>
The following environment variables can be set to configure the PTLLND’s behavior.
PTLLND_PORTAL (9) The portal ID to use for the ptllnd traffic.
PTLLND_PID (9) The virtual pid on which to contact servers.
PTLLND_PEERCREDITS (8) The maximum number of concurrent sends that are outstanding to a single peer at any given instant.
PTLLND_MAX_MESSAGE_SIZE (512) The maximum messages size. This MUST be the same on all nodes in a cluster.
PTLLND_MAX_MSGS_PER_BUFFER (64) The number of messages in a receive buffer. Receive buffer will be allocated of size PTLLND_MAX_MSGS_PER_BUFFER times PTLLND_MAX_MESSAGE_SIZE.
PTLND_MSG_SPARE (256) Additional receive buffers posted to portals.
PTLLND_PEER_HASH_SIZE (101) The number of hash table slots for the peers.
PTLLND_EQ_SIZE (1024) The size of the Portals event queue (that is, maximum number of events in the queue).
lmc (Lustre configuration maker) is invoked for generating configuration data files (.xml).
lmc [options] --add <objectype> [args]
When invoked, lmc adds configuration data to config files (.xml). This data is regarding all the components of a Lustre cluster, like MDSs, mount-points, OSTs, LOVs and others. A single configuration file would be generated for the entire cluster. In the lmc command line interface, each of these components is associated with the objecttype.
The objecttype can be anyone of the following - net, MDS, LOV, OST, mtpt, route, echo_client, or cobd. Every objecttype refers to a collection of related configuration entities.
Following options can be used to generate the configuration data associated with all the objecttypes in a Lustre cluster:
-v, --verbose Prints system commands as they run
-m, --merge Appends to the specified config file
-o, --output Writes xml configuration into given output file (Overwrites the existing one)
-i, --input Takes input from a specified file
--batch Used to execute the LMC command in batch mode
Node related options:
--add node Adds a new node in the cluster configuration
--node {nodename} Creates a new node with the given name if not already present
--timeout <num> Timeout before going into recovery
--lustre_upcall <path> Sets the location of both the upcall scripts (Lustre and lnet) used by the client for recovery
--portals_upcall <path> Specifies the location of the Portals upcall scripts used by the client for recovery (deprecated)
--upcall <path> Specifies the location of both the upcall scripts (Lustre and LNET) used by the client for recovery (deprecated)
--group_upcall <path> Specifies the location of the group upcall scripts used by the MDS for determining supplementary group membership
--ptldebug <debug_level> Sets the lnet debug level
--subsystem <subsystem_name> Specifies the Lustre subsystems, which have debug output recorded in the log
Network related options:
--add net Adds a network device descriptor for the given node
--node {node name} Creates a new node with the given name if not already present. This is also used to specify a specific node for other elements.
--nettype <type> This can be tcp, elan, or gm
--nid <nid> The network id, e.g. ElanID or IP address as used by lnet. If NID is “*”, then the local address of the interface with specified nettype will be substituted when the node is configured with lconf. An NID with “*” should be used only for generic client configurations.
--cluster_id <id> Specifies the cluster ID
--hostaddr <addr> Specifies the host address, which will be transfered to the real host address by lconf
--router Optional flag to mark this node as a router
--port [port] Optional argument to indicate the tcp port. The default is 988.
--tcpbuf <size> Optional argument. The default TCP buffer size is 1MB.
--irq_affinity 0 or 1 Optional argument. Default is 0.
--nid_exchange 0 or 1 Optional argument as some of the OSTs might not have the required support. This is turned off by default, value of 1 will turn it ON.
MDS related options:
--add mds Specifies the MDS configuration
--node <node name> Name of the node on which the MDS resides
--mds <mds_name> Host name of the MDS
--mdsuuid <uuid> Specifies MDS UUID
--dev <pathname> Path of the device on the local system. If the device is a file, then a loop device is created and used as a block device.
--backdev <pathname> Path of the device for backing storage on the local system
--size <size> Optional argument indicating the size (in KB) of the device to be created (used typically for loop devices)
--node {nodename} Adds an MDS to the specified node. This requires a --node argument, and it should not be a profile node.
--fstype ldiskfs|ext3 Optional argument used to specify the file system type. Default is ext3. For 2.6 kernels, the ldiskfs file system must be used.
--inode_size <size> Specifies new inode size for underlying ext3 file system. Must be a power of 2 between 128 and 4096. The default inode size is selected based on the default number of stripes specified for the file system.
--group_upcall <pathname> The group upcall program used to resolve a user as a secondary group. The default value is NONE, which means that the MDS will use whatever supplementary group is passed from the client. The supplied upcall is /usr/sbin/l_getgroups, which gets groups from the MDS as /etc/group file based on the client-supplied UID.
--mkfsoptions <options> Optional argument to mkfs
--mountfsoptions <options>
Optional argument to mountfs. Mount options will be passed by this
argument. For example, extents are to be enabled by adding ",extents"
to the option
--mountfsoptions. Also, the option "asyncdel" can be added to it.
--journal_size <size> Optional argument to specify the journal size for the ext3 file system. The size should be in the units expected by mkfs. For ext3, it should be in MB. If this option is not used, the ext3 file system will be configured with the default journal size.
LOV related options:
--add lov Creates an LOV with the specified parameters. The mds_name must already exist in the descriptor.
--lov <name> Common name for the LOV
--mds <name> Host name for the MDS
--stripe_sz <size> Specifies the stripe size in bytes. Data exactly equal to this size is written in each stripe before starting to write in the next stripe. Default is 1048576.
--stripe_cnt <count> A value of 0 for this means Lustre is using the currently optimal number of stripes. Default is 1 stripe per file.
--stripe_pattern <pattern> Only Pattern 0 (RAID 0) is supported currently
OST related options:
--add os Creates an OBD, OST, and OSC. The OST and OBD are created on the specified node.
--ost <name> Assigns a name to the OST
--node <nodename> Node on which the OST service is running. It must not be a profile node.
--failout Disables failover support on the OST
--failover Enables failover support on the OST
--dev <pathname> Path of the device on the local system. If the device is a file, then a loop device is created and used as a block device.
--size [size] Optional argument indicating the size (in KB) of the device to be created (used typically for loop devices)
--obdtype <Obdfilter | obdecho> Specifies the type of the OSD
--lov <name> Name of the LOV to which this OSC will be attached. This is an optional argument.
--ostuuid UUID Specifies the UUID of the OST device
--fstype <extN | ext3> Optional argument used to specify the file system type. Default is ext3.
--inode_size <size> Specifies the new inode size for underlying ext3 file system
--mkfsoptions <options> Specifies additional options for the mkfs command line.
--mountfsoptions <options> Specifies additional options for mountfs command line. Mount options will be passed by this argument. For example, extents are to be enabled by adding ",extents" to the option --mountfsoptions. Also, the option "asyncdel" can be added to it.
--journal_size <size> Optional argument to specify the journal size for the ext3 file system. The size should be in the units expected by mkfs. For ext3, it should be in MB. If this option is not used, the ext3 file system will be configured with the default journal size.
Mountpoint related options:
--add mtpt Creates a mount-point on the specified node. Either an LOV name or an OSC name can be used.
--node {nodename} Node that will use the mountpoint
--path /mnt/path The mountpoint that can be used to mount Lustre file system. Default is /mnt/lustre.
--ost ost_name or --lov lov_name The OST or LOV name as specified earlier in the configuration
Route related options:
--add route Creates a static route through a gateway to a specific NID or a range of NIDs
--node {nodename} The node on which the route can be added
--router Optional flag to mark a node as the router
--gw nid The NID of the gateway. It must be a local interface or a peer.
--gateway_cluster_id id Specifies the id of the cluster, to which the gateway belongs
--target_cluster_id id Specifies the id of the cluster, to which the target of the route belongs
--lo nid The low value NID for a range route
--hi nid The high value NID for a range route
--add echo The client used exclusively for testing purpose
$ lmc --node adev3 --add net --nid adev3 --cluster_id 0x1000 --nettype tcp --hostaddr adev3-eth0 --port 988
$ lmc --node adev3 --add net --nid adev3 --cluster_id 0x2000 --nettype tcp --hostaddr adev3-eth1 --port 989
These commands are used to add a Lustre node to the specified Lustre cluster through a network interface. In this example, Lustre node adev3 has been added to 2 Lustre clusters separately through 2 network interface cards: adev3-eth0 and adev3-eth1. The cluster_ids of these 2 Lustres are 0x1000 and 0x2000. adev3 would listen to the respective specified port(s) to prepare for possible connection requests from nodes in these two clusters.
$ lmc --node adev3 --add route --nettype tcp --gw 5 --gateway_cluster_id 0x1000 –target_cluster_id \ 0x1000 --lo 4 --hi 7
This command is used to add a route entry for a Lustre node. Here Lustre node adev3 is added with a new route entry. This enables the Lustre node to send packets to Lustre nodes having the NIDs from 4 to 7 with the help of Lustre gateway node having the NID 5. Besides, Lustre gateway node is in the cluster of id 0x1000 and target of the route belongs to the cluster of the same id 0x1000. The network in this route path is a tcp network.
NOTE:- When using --mountfsoptions {extents|mballoc|asyncdel}, please remember
the following:
-extents and mballoc are recommended only for 2.6 kernel and are used only for
OSTs.
-asyncdel is recommended for 2.4 kernel and is not supported on 2.6 kernel.
One can use --mountfsoptions {extents|mballoc} on existing file systems. The
Lustre servers need to be restarted before using this command so that the new
options become effective.
asyncdel is used on 2.4 kernel to delete files in a separate thread. Using this
option quickly releases inode semaphore of the parent directory in order to
perform the other operations. Otherwise, deleting large files may take more
time. For 2.6 kernel, this is not such a big issue because the parameter extents
can increase the speed of deletion.
lconf is a Lustre utility that is used for configuring, starting and stopping a Lustre file system.
lconf [OPTIONS] <XML-config file>
This utility configures a node by using the configuration data given in the <XML-config-file>. For all the nodes in a cluster, there is one single configuration file. Thus, this file should be distributed to all the nodes in the cluster or kept at a location accessible to all the nodes. The options that can be used with lconf are explained below. The XML file must be specified. When invoked with no options, lconf will attempt to configure the resources owned by the node it is invoked on.
--abort_recovery Used to start Lustre when recovery is certain to fail (for example when an OST is disabled). Can also be used to skip the recovery timeout period.
--acl Enables Access Control List support on the client.
--allow_unprivileged_port Allows connections from unprivileged ports.
--clientoptions Additional options for mounting a Lustre client – This is obsolete now and is replaced by zeroconfig mounts.
--client_uuid The failed client (required for recovery).
--clumanager Generates a Red Hat Clumanager configuration file for this node.
--config The cluster configuration name used for LDAP query (depreciated).
--conn_uuid The failed connection (required for recovery).
-d --clenaup Stops Lustre and unconfigures a node. The same config and --node argument used for configuration needs to be used for cleanup as well. This will attempt to undo all of the configuration steps done by lconf, including unloading the kernel modules.
--debug_path <path> Specifies the path for saving debug dumps. (Default is /tmp/lustre-log.)
--dump <file> Dumps the kernel debug log to the specified file before lnet is unloaded during clean up.
--failover Shuts down without saving state. This allows a node to give up service to another node for failover purposes. This is not a clean shutdown.
-f --force Forces unmounting and/or obd detach during cleanup.
--gdb Creates a gdb module script before doing any Lustre configuration.
--gdb_script Full name of the gdb debug script. Default is /tmp/ogdb.
--group The group of devices to cleanup/ configure.
--group_upcall The group upcall program used to resolve a user as a secondary group. The default value is NONE, which means that the MDS will use whatever supplementary group is passed from the client. But this is limited to a single supplementary group.
-h, --help Displays help
--inactive The UUID of the service to be ignored by the client which is mounting Lustre. It allows the client to mount in presence of some inactive services (currently OST only). Multiple UUIDs can be specified by repeating the option.
--lctl-dump Dumps all the ioctls to the specified file.
--ldapurl LDAP server URL (depreciated).
--lustre=<dir> The base directory of Lustre sources. This parameter causes lconf to load modules from a source tree.
--lustre_upcall Sets the location of the Lustre upcall scripts used by the client for recovery.
--make_service_scripts Creates per-service symlinks for use with clumanager HA software.
--maxlevel Performs configuration of devices and services up to the given level. When used in conjunction with cleanup, services are torn down up to a certain level.
Levels are approximately like:
10 – network
20 – device, ldlm
30 – OSD, MDD
40 – MDS, OST
70 – mountpoint, echo_client, OSC, MDC, LOV
--minlevel Specifies the minimum level of services to configure/ cleanup. Default is 0.
--mkfsoptions Specifies additional options for the mk*fs command line.
--mountfsoptions Specifies additional options for mountfs command line. Mount options will be passed by this argument. For example, extents are to be enabled by adding ",extents" to the option --mountfsoptions. Also, the option "asyncdel" can be added to it.
--node Specifies a specific node to be configured. By default, lconf searches for nodes with the local hostname and localhost. When --node is used, only node_name is searched for. If a matching node is not found in the config, then lconf exits with an error.
--noexec, -n Displays, but does not execute the steps to be performed by lconf. This is useful for debugging a configuration, and when used with --node, it can be run on any host.
--nomod Configures devices and services only. Does not load modules.
--nosetup Loads modules only. Does not configure devices or services.
--old_conf Starts up service even if config logs appear outdated.
--portals Specifies portals source directory. If this is a relative path, then it is assumed to be relative to Lustre. (Depreciated.)
--portals_upcall Specifies the location of the Portals upcall scripts used by the client for recovery. (Depreciated.)
--ptldebug Sets the portals debug level.
--quota Enables quota support for client file system.
--rawprimary <arg> For clumanager, device of the primary quorum. Default is /dev/raw/raw1.
--rawsecondary <arg> For clumanager, device of the secondary quorum. Default is /dev/raw/raw2.
--record Writes config information on the MDS.
--record_device Recovers a device.
--record_log Specifies the name of a config record log.
--recover Specifies the MDS device name that records the config commands.
--reformat Reformats all the devices. This is essential when the file system is brought up for first time.
--select Selects a particular node for a service
--service Shorthand for --group <arg> --select <arg>=<hostname>
--service_scripts <arg> For clumanager, directory containing per-service scripts. Default is /etc/lustre/services.
--single_socket The socknal option. Uses only one socket instead of a bundle.
--subsystem Sets the lnet debug subsystem.
--tgt_uuid Specifies the failed target (required for recovery).
--timeout Sets the recovery timeout period.
--upcall Sets the location of both the upcall scripts (Lustre and lnet) used by the client for recovery.
--user_xattr Enables user_xattr support on the MDS.
--verbose, -v Becomes verbose and shows actions performed during the execution of a command
--write_conf Saves the whole client configuration information on the MDS
To invoke lconf on the OST node:
$ lconf -v --reformat --node ost config.xml
configuring for host: ['ost']
setting /proc/sys/net/core/rmem_max to at least 16777216
setting /proc/sys/net/core/wmem_max to at least 16777216
Service: network NET_ost_tcp NET_ost_tcp_UUID
loading module: libcfs srcdir None devdir libcfs
+ /sbin/modprobe libcfs
loading module: lnet srcdir None devdir lnet
+ /sbin/modprobe lnet
+ /sbin/modprobe lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
+ /sbin/modprobe ksocklnd
Service: ldlm ldlm ldlm_UUID
loading module: lvfs srcdir None devdir lvfs
+ /sbin/modprobe lvfs
loading module: obdclass srcdir None devdir obdclass
+ /sbin/modprobe obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
+ /sbin/modprobe ptlrpc
Service: osd OSD_ost1_ost OSD_ost1_ost_UUID
loading module: ost srcdir None devdir ost
+ /sbin/modprobe ost
loading module: ldiskfs srcdir None devdir ldiskfs
+ /sbin/modprobe ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
+ /sbin/modprobe fsfilt_ldiskfs
loading module: obdfilter srcdir None devdir obdfilter
+ /sbin/modprobe obdfilter
+ sysctl lnet/debug_path /tmp/lustre-log-ost
+ /usr/sbin/lctl modules > /tmp/ogdb-ost
Service: network NET_ost_tcp NET_ost_tcp_UUID
NETWORK: NET_ost_tcp NET_ost_tcp_UUID tcp ost
Service: ldlm ldlm ldlm_UUID
Service: osd OSD_ost1_ost OSD_ost1_ost_UUID
OSD: ost1 ost1_UUID obdfilter /lustre/filedevice/ost 100000 ldiskfs no 0 0
+ losetup /dev/loop0
+ losetup /dev/loop1
+ losetup /dev/loop2
+ losetup /dev/loop3
+ losetup /dev/loop4
+ losetup /dev/loop5
+ losetup /dev/loop6
+ losetup /dev/loop7
+ dd if=/dev/zero bs=1k count=0 seek=100000 of=/lustre/filedevice/ost
+ mkfs.ext2 -j -b 4096 -F /lustre/filedevice/ost 25000
+ tune2fs -O dir_index /lustre/filedevice/ost
+ losetup /dev/loop0
+ losetup /dev/loop0 /lustre/filedevice/ost
+ dumpe2fs -f -h /dev/loop0
no external journal found for /dev/loop0
OST mount options: errors=remount-ro
+ /usr/sbin/lctl
attach obdfilter ost1 ost1_UUID
quit
+ /usr/sbin/lctl
cfg_device ost1
setup /dev/loop0 ldiskfs f errors=remount-ro
quit
+ /usr/sbin/lctl
attach ost OSS OSS_UUID
quit
+ /usr/sbin/lctl
cfg_device OSS
setup
quit
To invoke lconf on the client node:
$ lconf --node client config.xml
To set the required debug levels:
$ lconf --ptldebug ~(lnet | mballoc | trace)
To turn-on specific debug types:
$ conf --ptldebug ldlm|ha
A subset of failed OSTs can be ignored during Lustre mount on the clients by using the option below. Here OST1 and OST2 have failed and need to be ignored.
$ lconf --inactive OST_ost1_UUID --inactive OST_ost2_UUID config.xml
To configure a node (the options in the square brackets are optional):
$ lconf --node {nodename}
[--service name]] [--reformat [--force
[--failover]] [--reformat] \
[--mountfsoptions={options}]
config.xml
NOTE:- When using --mountfsoptions
{extents|mballoc|asyncdel}, please remember the following:
lctl is a Lustre
utility used for low level
configurations of Lustre file system. It also provides low-level testing and
manages Lustre network (LNET) information.
lctl
lctl --device <devno> <command [args]>
lctl --threads <numthreads> <verbose> <devno> <command [args]> lctl can be invoked in interactive mode by
issuing the commands given below.
$ lctl
lctl> help The most common commands in lctl are in matching
pairs - like device and attach, detach and setup, cleanup and connect,
disconnect and help and quit. To get a complete listing of available
commands, type “help” on the lctl prompt. To get basic help on meaning and
syntax of a command, type “help command.” Command completion is activated
with the TAB key, and command history is available via the “UP” and “DOWN”
arrow keys. For non-interactive single threaded use, one uses
the second invocation, which runs command after connecting to the device.
Some commands are used only when testing specific functionality inside
Lustre and are not normally invoked by users, theses commands are identified
by the string (CFS Dev).
Several commands are old and will be removed in the next major release of
Lustre. These commands are identified with the string (Old). Network related
options: --net <tcp/elan/myrinet>
The network type to be used for the operation network <tcp/elans/myrinet>
Indicates what kind of network is applicable
for the configuration commands that follow interface_list
Displays the interface entries and requires the 'network' command list_nids
Displays network identifiers (NIDs) defined on this node which_nid <remote
host> - Identifies path to a specific host by NID. Can be used to
verify network setup and connectivity add_interface
Adds an interface entry (Old) del_interface [ip]
Deletes an interface entry (Old) peer_list
Displays the peer entries add_peer <nid>
<host> <port> Adds a peer entry
(CFS Dev) del_peer [<nid>]
[<host>] [ks] Removes a peer entry
(CFS Dev)
conn_list Displays all the connected remote NIDs
disconnect <nid>
Disconnects from a remote NID (CFS
Dev) active_tx
Displays active transmits, and is used only for elan network type mynid [nid]
Informs the socknal of the local NID. It defaults to host name for tcp
networks, and is automatically setup for elan/ myranet networks
(CFS Dev) add_uuid <uuid> <nid>
Associates a given UUID with an NID (CFS
Dev) close_uuid <uuid>
Disconnects a UUID del_uuid <uuid>
Deletes a UUID association (CFS
Dev) add_route <gateway>
<target> [target] Adds an entry to the routing table for the
given target (Old) del_route <target>
Deletes an entry for a target from the routing table
(Old) set_route <gateway>
<up/down> [<time>] Enables/ disables routes via the given gateway
in the protals routing table. <time> is used to specify when a gateway
should come back online (Old) route_list
Displays the complete routing table fail nid|_all_
[count] Fails/ restores communications. Omitting the count
implies an indefinite fail. A count of zero indicates that communication
should be restored. A non-zero count indicates the number of LNET messages
to be dropped after which the communication is restored. The argument "nid"
is used to specify the gateway, which is one peer of the communication
(CFS Dev) show_route
Displays the complete routing table, same output as
route_list ping nid [timeout] [pid]
Checks LNET connectivity, outputs a list of NIDs on the target machine newdev
Creates a new device device
Selects the specified OBD device. All other commands depend on the device
being set cfg_device <$name>
Sets the current device being configured to <$name>
(Old) device_list
Shows all the devices lustre_build_version
Displays the Lustre build version Device
Configuration: attach type [name [uuid]]
Attaches a type to the current device (which is set using the device
command), and gives that device a name and a UUID. This allows us to
identify the device for later use, and to know the type of that device setup <args...>
Types specific device setup commands. For obdfilter, a setup command tells
the driver which block device it should use for storage and what type of
file system is on that device cleanup
Cleans up a previously setup device detach
Removes a driver (and its name and UUID) from the current device lov_getconfig
lov-uuid Reads LOV configuration from an MDS device. Returns
default-stripe-count, default-stripe-size, offset, pattern, and a list of
OST UUIDs (Old) record cfg-uuid-name
Records the commands that follow in the log endrecord
Stops recording parse
config-uuid-name Parses the log of recorded commands for a config
dump_log
config-uuid-name Displays the log of recorded commands for a
config to kernel debug log clear_log config-name
Deletes the current config log of recorded commands probe [timeout]
Builds a connection handle to a device. This command is used to suspend
configuration until the lctl command ensures the availability of the MDS and
OSC services. This avoids mount failures in a rebooting cluster close
Closes the connection handle getattr <objid>
Gets the attributes for an OST object <objid>
(CFS Dev) setattr <objid>
<mode> Sets the mode attribute for an OST object <objid>
(CFS Dev) create [num [mode
[verbose]]] Creates the specified number <num> of OST objects
with the given <mode> (CFS Dev) destroy <num>
Starting at <objid>, destroys <num> number of objects starting from the
object with object id <objid> (CFS
Dev) test_getattr <num>
[verbose [[t]objid]] Does <num> getattrs on an OST object <objid>
(objectid+1 on each thread) (CFS Dev) test_brw [t]<num>
[write [verbose [npages [[t]objid]]]] Does <num> bulk read/
writes on an OST object <objid> (<npages> per I/O)
(CFS Dev) dump_ldlm
Dumps all the lock manager states. This is very useful for debugging activate
Activates an import deacttivate
De-activates an import recover
<connection UUID> lookup <directory>
<file> Displays the information of the given file notransno
Disables the sending of committed transnumber updates readonly
Disables writes to the underlying device abort_recovery
Aborts recovery on the MDS device mount_option
Dumps mount options to a file get_stripe
Shows stripe information for an echo client object set_stripe <objid>[
width!count[@offset] [:id:id....] Sets stripe information for an
echo client unset_stripe <objid>
Unsets stripe information for an echo client object del_mount_option
profile Deletes a specified profile set_timeout <secs>
Sets the timeout (obd_timeout) for a server to wait before failing recovery set_lustre_upcall
</full/path/to/upcall> Sets the lustre upcall (obd_lustre_upcall)
via the lustre.upcall sysctl llog_catlist
Lists all the catalog logs on current device llog_info <$logname|#oid#ogr#ogen>
Displays the log header information llog_print <$logname|#oid#ogr#ogen>
[from] [to] Displays the log content information. It displays all
the records from index 1 by default llog_check <$logname|#oid#ogr#ogen>
[from] [to] Checks the log content information. It checks all the
records from index 1 by default llog_cancel <catalog
id|catalog name> <log id> <index> Cancels a record in the log llog_remove <catalog
id|catalog name> <log id> Removes a log from the catalog, erases
it from the disk debug_daemon
Debugs the daemon control and dumps to a file debug_kernel [file]
[raw] Gets the debug buffer and dumps to a file debug_file <input>
[output] Converts the kernel-dumped debug log from binary to
plain text format clear
Clears the kernel debug buffer mark <text>
Inserts marker text in the kernel debug buffer filter <subsystem
id/debug mask> Filters message type from the kernel debug buffer show <subsystem
id/debug mask> Shows the specific type of
messages debug_list
<subs/types> Lists all the subsystem and debug types modules <path>
Provides gdb-friendly module information panic Forces the kernel to panic lwt start/stop
[file] Light-weight tracing memhog <page count>
[<gfp flags>] Memory pressure testing help
Shows a complete list of commands. help <command name> can be used to get
help on a specific command exit
Closes the lctl session quit
Closes the lctl session --device
The device number to be used for the operation. The value of devno is an
integer, normally found by calling lctl name2dev on a device name --threads
The numthreads variable is a strictly positive integer indicating the number
of threads to be started. The devno option is used as above --ignore_errors |
ignore_errors Ignores errors during the script processing dump
Saves ioctls to a file
attach
$ lctl
lctl > newdev
lctl > attach obdfilter OBDDEV OBDUUID
lctl > dl
4 AT obdfilter OBDDEV OBDUUID 1
getattr
lctl > getattr 12
id: 12
grp: 0
atime: 1002663714
mtime: 1002663535
ctime: 1002663535
size: 10
blocks: 8
blksize: 4096
mode: 100644
uid: 0
gid: 0
flags: 0
obdflags: 0
nlink: 1
valid: ffffffff
inline:
obdmd:
lctl > disconnect
Finished (success)
setup
lctl > setup /dev/loop0 extN
lctl > quit Network Commands The example below shows how to use lctl for
identifying interface information and peers that are up. In this case, we
have one MDS (ft2) and two OSS nodes (d1_q_0, d2_q_0). First we display the
interface information on the MDS, and then list MDS peers:
$ lctl > network tcp up
$ lctl > interface_list
ft2: (10.67.73.181/255.255.255.0) npeer 0
nroute 2
$ lctl > peer_list
12345-10.67.73.150@tcp [1]ft2->d2_q_0:988
#6
12345-10.67.73.160@tcp [1]ft2->d1_q_0:988
#6 To identify routes and check connectivity to
another node:
# lctl list_nids
# lctl which_nid d1_q_0
10.67.73.160@tcp
lctl ping d1_q_0
12345-0@lo 'Which_nid' does a lookup of the NID, and
attempts to expand it. 'which_nid' does not care about the node state. In
the example below, the machine 'dellap' is real, the machine 'bogus' and the
IP '10.67.73.212' are fake.
# lctl which_nid bogus@tcp
Can't parse NID bogus@tcp
# lctl which_nid dellap@tcp
10.67.73.89@tcp
# lctl which_nid 10.67.73.212@tcp
# lctl which_nid 10.67.758.54@tcp
Can't parse NID 10.67.758.54@tcp This section describes various limits on the size
of files and file systems. These limits are imposed either by the Lustre
architecture or by the Linux VFS and VM subsystems. In a few cases, the
limit is defined within the code and could be changed by re-compiling Lustre.
In those cases, the limit chosen is supported by CFS testing and may change
in future releases. The maximum number of stripe count is 160. This
limit is a hard coded option and reflects current tested performance limits.
It may be increased in future releases. Under normal circumstances, the
stripe count is not affected by ACLs. For a 32-bit machine, the product of stripe size
and stripe count (stripe_size * stripe_count) must be less than 2^32.
The ext3 limit of 2TB for a single file applies for a 64-bit machine. (Lustre
can support 160 stripes of 2TB each on a 64-bit system.) Due to the 64KB PAGE_SIZE on some 64-bit
machines, the minimum stripe size is set to 64 KB. You can set the maximum number of OSTs by a
compile option. The limit of 512 OSTs in Lustre 1.4.6 is raised to 1020
OSTs in Lustre releases 1.4.7 and greater. Rigorous testing is in
progress to move the limit to 4000 OSTs. The maximum number of MDSs will be determined
after accomplishing MDS clustering. The number of clients is currently limited to
65536 as defined in the code. In 2.4 kernels, the Linux block layer limits the
block devices like hard disks or RAID arrays to 2TB. For i386 systems in 2.6
kernels, the block devices are limited to 16TB. Each OST or MDS can have a
file system up to 2TB (The 2TB limit is imposed by ext3 for 2.6 kernels).
You can have multiple OST file systems on a single node. The largest Lustre
file system currently has 448 OSTs in a single file system (running the
1.4.3 Lustre version). There is a compile-time limit of 512 OSTs in a single
file system, giving a single file system limit of 1PB. Several production Lustre file systems have
around 100 object storage servers in a single file system. One production
file system is in excess of 900TB (448 OSTs). All these facts indicate that
Lustre would scale just fine if more hardware were made available. The 2TB
limit on a file system will be soon removed to allow larger file systems
with fewer OST devices. Individual files have a hard limit of nearly 16TB
on 32-bit systems imposed by the kernel memory subsystem. On 64-bit systems
this limit does not exist. Hence, files can be 64-bits in size. Lustre
imposes an additional size limit of up to the number of stripes, where each
stripe is of 2TB. A single file can have a maximum of 160 stripes, which
gives an upper single file limit of 320TB for 64-bit systems. The actual
amount of data that can be stored in a file depends upon the amount of free
space in each OST on which the file is striped. Lustre uses the ext3 hashed directory code, which
has a limit of about 25 million files. On reaching this limit, the directory
grows to more than 2GB depending on the length of the filenames. The maximum
number of subdirectories in the versions before Lustre 1.2.6 is 32,000. You
can have unlimited subdirectories in all the later versions of Lustre due to
a small ext3 format change. In fact, Lustre is tested with ten million files
in a single directory. On a properly-configured dual-CPU MDS with 4 GB RAM,
random lookups in such a directory are possible at a rate of 5,000 files
/second. A single MDS imposes an upper limit of 4 billion
inodes. The default limit is slightly less than the device size of 4KB. That
means about 512MB inodes for a file system with MDS of 2TB. This can be
increased initially at the time of MDS file system creation by specifying
the "--mkfsoptions='-i 2048'" option on the "--add mds" config line for the
MDS. For newer releases of e2fsprogs, you can specify
'-i 1024' to create 1 inode for every 1KB disk space. You can also specify
'-N {num inodes}' to set a specific number of inodes. Note that the inode
size (-I) should not be larger than half the inode ratio (-i). Otherwise
mke2fs will spin trying to write more number of inodes than the inodes that
can fit into the device. This limit is 255 bytes for a single filename,
the same as in an ext3 file system. The Linux VFS imposes a full pathname
length of 4096 bytes.
Portals and LNET Interoperability
LNET uses the same wire protocols as Portals but has a different network
addressing scheme; that is, Portals NIDs and LNET NIDs are different. In
single-network configurations, LNET can be configured to work with Portals
NIDs so that it can interoperate with Portals, allowing a live cluster to be
upgraded piecemeal. This is controlled by the portals_compatibility module
parameter, which is described below. With Portals compatibility configured,
old XML configuration files remain compatible with LNET; the lconf
configuration utility recognizes Portals NIDs and converts them to LNET NIDs.
Old client configuration logs are also compatible with LNET; Lustre running
over LNET recognizes Portals NIDs and converts them to LNET NIDs, but issues
a warning. Once all the nodes have been upgraded to LNET, these configuration
logs can be rewritten to update them to LNET NIDs.
Portals Compatibility Parameter
The following module parameter controls interoperability with Portals:

portals_compatibility="strong"|"weak"|"none"

"strong" is compatible with Portals, and with LNET running in either "strong"
or "weak" compatibility mode. As this is the only mode compatible with
Portals, all the LNET nodes in the cluster must set this value until the last
Portals node has been upgraded.
"weak" is not compatible with Portals, but is compatible with LNET running in
any mode.
"none" is not compatible with Portals, or with LNET running in "strong"
compatibility mode.
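For example, a node running in "strong" compatibility mode might carry a
modprobe.conf entry like the one below; the networks value shown (tcp) is
only a placeholder for your actual network topology:

options lnet networks=tcp portals_compatibility="strong"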
Upgrade a Cluster Using Shutdown
Upgrading a system that can be completely shut down is fairly simple: shut
down all the clients and servers, install an LNET release of Lustre
everywhere, run --write-conf on the MDS, and restart everything. No
portals_compatibility option is needed (the default value is "none").

When upgrading a cluster, you should install (rather than upgrade) the kernel
and lustre-modules RPMs. This allows you to keep the older, tried and tested
kernels installed in case the new kernel fails to boot. First, install the
new kernel using the following command:

$ rpm -ivh --oldpackage kernel-smp-2.6.9-22.0.2.EL_lustre.1.4.6.i686.rpm

Then install the new Lustre modules with the command:

$ rpm -ivh --oldpackage lustre-modules-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp.i686.rpm

The lustre-modules RPM is kernel-specific, so if you have multiple kernels
installed you will need a lustre-modules RPM for each kernel. We recommend
using --oldpackage because RPM will sometimes report that an
already-installed RPM is newer, even though it may not be. You can only have
one Lustre RPM installed at a time (the userspace tools, not the
lustre-modules RPM mentioned above), so you should upgrade this RPM with the
command:

$ rpm -Uvh lustre-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp.i686.rpm

Before rebooting into the new Lustre kernel, double-check your bootloader
(grub, lilo) to make sure it will boot into the new kernel.
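Before rebooting, it may also help to confirm which kernel and Lustre
packages are now present; a simple query such as the following lists them
(output will vary by system):

$ rpm -qa | egrep 'kernel|lustre'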
After installing packages
After you install certain updates, you may need to take additional steps. For
example, if you are upgrading to Lustre 1.4.6, you need to update your
configuration logs. Occasionally it is necessary to update the configuration
logs on the MDS. Some examples of when this is needed include:

- enabling recovery on OSTs
- changing the default striping pattern
- changing the network address of a server
- adding or removing servers
- upgrading to a newer version

After installing the packages, follow these steps:

1. Shut down all the client and MDS nodes. This operation does not affect OSS
nodes, so they do not need to be shut down at this time.
2. On the MDS node, run the following command:
$ lconf --write_conf /path/to/lustre.xml
3. Start the OSS nodes if they were not already running.
4. Start the metadata server as usual.
5. Mount Lustre on the clients.
A Portals installation may be upgraded to LNET "live" in three phases, as
described below. To maximize service availability, servers (MDS and OSS)
should be failed over to their backups while they are being upgraded and/or
rebooted.

Phase 1. Shut down the Lustre services on any node (servers or clients),
using failover backups if desired for uninterrupted file system service.
Remove old Lustre releases from the node and upgrade to an LNET release of
Lustre by installing the appropriate release RPMs. Configure the LNET options
in modprobe.conf as appropriate for your network topology (note that only
basic network topologies are supported through the live upgrade process). Set
'portals_compatibility="strong"' in the LNET modprobe.conf options. The
Lustre services or client may now be restarted on this node. At this point
the node will be speaking "old" Portals, but will understand new LNET. This
phase is only complete when all the nodes have been upgraded to LNET.

Phase 2. Ensure phase one is complete (that is, all the nodes have been
upgraded to LNET). Now set 'portals_compatibility="weak"' in the LNET
modprobe.conf options on all the nodes. The nodes may now be rebooted (and
servers failed over) in any order. As they are rebooted, the nodes will be
speaking LNET but will understand old Portals (which is still being spoken by
the "strong" nodes). This phase is only complete when all the nodes are
either down or running LNET in "weak" compatibility mode.

Phase 3. Ensure phase two is complete (that is, all the nodes are either down
or running LNET in "weak" compatibility mode). Now remove
'portals_compatibility' from the LNET modprobe.conf options on all the nodes
(it defaults to "none"). The nodes may now be rebooted (and servers failed
over) in any order. These nodes will now reject the old Portals protocol.
This phase is only complete when all the nodes are either down or running
LNET in the new mode.

Note that phase three must be complete before future upgrades are possible.
Similarly, phase three must be complete before the site configuration can be
updated to include multiple networks. You may rewrite the client
configuration logs after phase one has been completed, to avoid warnings
about converting Portals NIDs to LNET NIDs. As this requires an MDS service
outage, you may instead choose to complete the upgrade in one step at that
time by removing 'portals_compatibility' from the LNET modprobe.conf options
on all the nodes and rebooting everywhere.
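To summarize the three phases, the portals_compatibility setting in each
node's modprobe.conf moves through the following values; the networks value
(tcp) is again only a placeholder for your actual topology:

# Phase 1 (until every node has been upgraded to LNET):
options lnet networks=tcp portals_compatibility="strong"
# Phase 2 (once every node is running LNET):
options lnet networks=tcp portals_compatibility="weak"
# Phase 3 (parameter removed; it defaults to "none"):
options lnet networks=tcp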
To upgrade from Lustre 1.4.5 to 1.4.6, you need to download the latest Lustre
1.4.6 RPMs. If you want to upgrade Lustre live, you need to fail over to
another server. Because the Lustre 1.4.5 modules have to be unloaded and the
Lustre 1.4.6 modules loaded in their place, a single node (Lustre without
failover) cannot continue to run Lustre during the upgrade.

The steps for upgrading Lustre live are as follows (given that the node has a
failover setup):

1. Download the latest RPMs.
2. Unload the Lustre modules from node2 (the failover node) using:
$ lconf --cleanup --node node2 config.xml
3. Upgrade node2 to Lustre 1.4.6; as it is the backup node, it is NOT
currently running Lustre.
4. Configure node2 for Lustre 1.4.6 and fail over node1. Once failed over,
Lustre will be running from node2.
5. Unload the Lustre modules from node1 (the failed node) using:
$ lconf --cleanup --node node1 config.xml
6. Upgrade node1 to Lustre 1.4.6; because of the failover, it is NOT
currently running Lustre.
7. Once node1 is configured, you can fail back to it.
- extents and mballoc are recommended only for 2.6 kernels, and are used only
on OSTs.
- asyncdel is recommended for 2.4 kernels, and is not supported on 2.6
kernels.
You can use --mountfsoptions {extents|mballoc} on existing file systems; the
Lustre servers must be restarted before the new options take effect.
asyncdel is used on 2.4 kernels to delete files in a separate thread. Using
this option quickly releases the inode semaphore of the parent directory so
that other operations can proceed; otherwise, deleting large files may take
more time. On 2.6 kernels this is less of an issue, because the extents
option increases the speed of deletion.
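As a sketch of how such options might be supplied, the following hypothetical
lmc line adds an OST with the extents and mballoc mount options; the node
name, OST name, and device are placeholders:

$ lmc -m config.xml --add ost --node oss1 --ost ost1 \
      --fstype ldiskfs --dev /dev/sdc --mountfsoptions='extents,mballoc'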
The lctl utility groups its commands into Device Selection, Device
Operations, Debug, and Control categories, and also accepts a set of options
that can be used to invoke lctl; see the lctl reference for details.
NOTE: This information
is for upgrading from Lustre 1.4.5 only. If you are running a newer version
of Lustre, you should ignore this chapter.