Lustre Test System - Vision and Scope Document

1. Business Requirements

1.1 Background

Lustre testing is the cornerstone of our organization.  In early 2004 we rolled out the ltest deployment, and it was very successful.  Ltest operates on packages built by lbuild and reports the results to the buffalo system.  The buffalo system was extended to also allow the queueing of test requests.  CFS has successfully run hundreds of tests daily with this system, and the results published by buffalo have often been sufficient to make debugging problems much simpler.  Few regressions have entered our software, and on the whole testing has been successful.

Recently, more and more requirements have been placed on testing, in particular on the ltest system.  Key new requirements are to handle the new mountconf configuration system of Lustre; to test interoperability of multiple versions of Lustre; to easily test more realistic configurations, such as real failover, software RAID, and multiple networks; and to test Lustre in conjunction with CIFS, NFS, and other export software.  More flexible choices of operating system, Lustre software, hardware, and test configurations are desired.  It is also necessary to prioritize items in the test queues for release purposes, and to temporarily stop testing and resume it later.  In addition, the ltest system has not found adoption by developers and has been difficult to deploy elsewhere.

CFS attempted to address these issues by modifying ltest.  Progress in making modifications to accommodate them has been insufficient to warrant continuing this effort.  We will replace ltest with a new system.

1.2 Business Opportunity

The operation of the test system involves many processes which our company and our customers need to repeat routinely, e.g. 

  1. building software, 
  2. downloading and installing the software, 
  3. defining configurations, 
  4. configuring Lustre according to a defined configuration, 
  5. running the test programs and 
  6. gathering and reporting diagnostic and debugging information. 

Several of these are CFS-specific.  Building Lustre is something controlled by engineering.  An initiative is under way to define Lustre configurations flexibly with spreadsheets and to configure them with CFS-supplied tools.  Other items are of a commodity nature: distributing and updating software is a mainstream Linux activity using systems like apt, yum, etc.  Running programs on clusters is something our customers do all the time under job scheduler control.  Gathering and reporting results can be a combination of CFS-supplied infrastructure and standard reporting tools.

CFS has traditionally been weak at providing good tools for customers to handle these tasks.  There is a tremendous opportunity to leverage the new test system to provide first-rate tools to tackle these tasks.
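
The six routine processes listed in this section can be pictured as independent stages that run in a fixed order, where any stage may be automated or left to manual effort.  The following is a minimal sketch under stated assumptions, not a design for the system: the stage names, the Pipeline class, and the placeholder package name are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stage names mirroring the six routine processes listed above.
STAGES = ["build", "install", "define_config", "configure", "run_tests", "report"]

Handler = Callable[[dict], None]


class Pipeline:
    """Runs whichever stages have a registered tool, in the fixed order above."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Handler] = {}

    def register(self, stage: str, handler: Handler) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.handlers[stage] = handler

    def run(self, context: dict, only: Optional[List[str]] = None) -> List[str]:
        """Run registered stages in order and return the names of those executed.

        Stages without a handler are skipped, which models a step that is
        still done by hand or by an external tool.
        """
        executed = []
        for stage in STAGES:
            if only is not None and stage not in only:
                continue
            handler = self.handlers.get(stage)
            if handler is None:
                continue
            handler(context)
            executed.append(stage)
        return executed


# Example: only two stages are automated so far.
pipeline = Pipeline()
# "lustre.rpm" is a placeholder value, not a real package name.
pipeline.register("build", lambda ctx: ctx.update(package="lustre.rpm"))
pipeline.register("run_tests", lambda ctx: ctx.update(result="pass"))

context: dict = {}
print(pipeline.run(context))                  # ['build', 'run_tests']
print(pipeline.run(context, only=["build"]))  # ['build']
```

Keeping each stage behind the same small interface is one way to let a customer use only the install and configure stages while CFS engineering uses all six.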

1.3 Business Objectives and Success Criteria

BO-1:

Design a modular system which is usable for Lustre testing and customers alike.

BO-2:

Meet all requirements as soon as feasible

BO-3:

Build something that is loved, not hated, and completely intuitive

BO-4:

CFS retains some competitive advantage, for example by not distributing its reporting infrastructure while distributing all other components.

SC-1:

First components of the system are in use for testing at CFS by July 1.

SC-2:

Components of the system are used by customers to maintain Lustre installations by Jan 2007

SC-3:

One engineer can maintain the system; all engineers use and fully understand the system

SC-4:

5 sites outside CFS use the system by Jan 2007.

1.4 Market & Customer Needs 

In order for Lustre to see the next level of success, deployment, configuration, and monitoring must be brought under control.  There are huge untapped opportunities for us to supply Lustre on flash installations for storage servers, leading to appliance solutions.  The overlap with the testing requirements is compelling.

For CFS to effectively work with partners, a testing system needs to be developed that is universally accepted in the Lustre community and meets all demands.

1.5 Business Risks

RI-1:

Testing has become a critical bottleneck in our engineering organization, one which has already interfered, or could seriously interfere, with our release capabilities.

RI-2:

If the system is improperly designed, too few engineers and partners may use it.

RI-3:

If the system is too complex, its delayed rollout will pose a major risk.

RI-4:

If the system is too simplistic, it cannot meet the complex feature requirements.

2. Vision of the Solution

2.1 Vision Statement

For CFS engineering, who need to test and integrate the Lustre software, the Lustre Testing System (LTS) is a modular suite of tools that will provide means of easily performing any or all of the elements associated with testing software.  The system will run unattended or can be manually controlled.  It will be used by all engineers and be easily maintained and improved.  Unlike the current ltest system, it is modular, intuitive, and used beyond the core QA group.  Our product will provide all interfaces required to define configurations and tests, and the reports required for diagnostics and debugging.

For CFS partners who integrate the Lustre software, the Lustre Testing System (LTS) is a modular suite of tools that will provide the same means as those available to CFS engineering.  Unlike the current ltest system, it will see adoption in the community.  Our product will be designed to be easily integrated into partners' engineering environments.

For CFS customers who run Lustre servers on appliances or systems, the Lustre Testing System (LTS) is a modular suite of tools that will provide means of easily installing, configuring, and upgrading software, and of performing monitoring, diagnostics, and information gathering.  Unlike previous attempts to provide this to customers, our product will incorporate requirements posed by customers to ensure easy adoption.

2.2 Major Features

FE-1:

Use the lbuild system to provide packages

FE-2:

Use standard Linux mechanisms, including proxies, to update or install systems with new software

FE-3:

Create, view, modify, and delete Lustre configurations using a web version of the mountconf tools

FE-4:

The system is very modular and leverages existing components where possible

FE-5:

The system is usable with disk installs, flash installs, and PXE booting

FE-6:

The queueing system can manage priorities, choose from multiple available resources, and suspend processing (similar to PBS (Pro?), Maui, or Moab)

FE-7:

The system uses secure internet infrastructure and offers privacy for reports

FE-8:

Complex test definitions involving Lustre with NFS exports, CIFS exports, and Windows are possible

FE-9:

Tests can report to buffalo as before or in a similar way, with privacy and quality statistics

FE-10:

Diagnostic information gathering is extended and available in the reports and through monitoring tools
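
As a rough illustration of the behaviour FE-6 asks for (priorities, a choice among available resources, and suspending processing), the following sketch uses a heap-based queue.  Everything in it is an assumption made for illustration: the RequestQueue class, the convention that a lower number means higher priority, the node names, and the policy of picking the alphabetically first free resource.  It is not modelled on PBS, Maui, or Moab.

```python
import heapq
import itertools
from typing import Iterable, List, Optional, Set, Tuple


class RequestQueue:
    """Toy test-request queue: priorities, a pool of resources, suspend/resume."""

    def __init__(self, resources: Iterable[str]) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker: first in, first out
        self._free: Set[str] = set(resources)
        self._suspended = False

    def submit(self, request: str, priority: int = 10) -> None:
        """Queue a request; a lower number means more urgent (an assumption)."""
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def suspend(self) -> None:
        """Stop handing out work; queued requests are kept."""
        self._suspended = True

    def resume(self) -> None:
        self._suspended = False

    def release(self, resource: str) -> None:
        """Return a resource to the free pool once its test has finished."""
        self._free.add(resource)

    def dispatch(self) -> Optional[Tuple[str, str]]:
        """Pair the most urgent request with a free resource, if possible."""
        if self._suspended or not self._heap or not self._free:
            return None
        _, _, request = heapq.heappop(self._heap)
        # Simplistic policy: a real scheduler would match resource requirements.
        resource = min(self._free)
        self._free.remove(resource)
        return request, resource


queue = RequestQueue(["node-a", "node-b"])
queue.submit("nightly-regression")
queue.submit("release-candidate", priority=1)

print(queue.dispatch())  # ('release-candidate', 'node-a')
queue.suspend()
print(queue.dispatch())  # None, because the queue is suspended
queue.resume()
print(queue.dispatch())  # ('nightly-regression', 'node-b')
print(queue.dispatch())  # None, nothing queued and no free resources
```

The counter in each heap entry keeps requests of equal priority in submission order, so raising the priority of release work does not reorder everything else.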

2.3 Assumptions and Dependencies

AS-1:

The system is designed to handle testing from Lustre 1.6 onward, not necessarily earlier versions.

AS-2:

While only parts of the system are available, manual effort will supplement the automated parts.

DE-1:


3. Scope and Limitations

3.1 Scope of Initial and Subsequent Releases

Feature  Description              Release 1                      Release 2                        Release 3
-------  -----------------------  -----------------------------  -------------------------------  -----------------------------------------
FE-1     lbuild usage             lbuild for current             include Windows, Debian,
                                  configurations                 Ubuntu and others into lbuild
FE-2     software update          included for rpms              Fully implemented                Fully implemented
FE-3     mountconf config tools   Upload of config CSV,          GUI implemented
                                  multiple Lustre versions
FE-4     modular                  yes                            yes                              yes
FE-5     disk/pxe/flash installs  pxe booting & disk install                                      flash
FE-6     job scheduler            run by hand                    initial trial                    fully implemented
FE-7     secure                   fully implemented
FE-8     complex tests            Not implemented                Fully implemented
FE-9     buffalo reports          tests report, privacy          Quality metrics
FE-10    diagnostics              into buffalo                   extended                         with performance monitoring tool included

3.2 Limitations and Exclusions

LI-1:

Automating Windows support may be out of reach

LI-2:

?

4. Business Context

4.1 Stakeholder Profiles

Stakeholder           Major Value                             Attitudes                     Major Interests                         Constraints
--------------------  --------------------------------------  ----------------------------  --------------------------------------  ---------------------------
Corporate Management  improved product quality and            strong commitment; top        considerable QA improvements; adoption  none identified
                      opportunities                           corporate priority            immediate; no runaway projects
QA department         more efficient use of staff time;       eager to overcome inability   job satisfaction                        need leadership from more
                      higher customer satisfaction; more      to address corporate                                                  experienced developers
                      tests, less tinkering                   requirements
Lustre engineers      tools they can use and love             strong enthusiasm, but might  simplicity of use; reliability of       flexible and lightweight
                                                              not use it as much as         delivery
                                                              expected
Partners              share testing effort with others        receptive but cautious        cost savings                            no resources yet committed
Customers             tools for deployment and monitoring     receptive but cautious        minimal new technology needed; concern  can only use proven systems
                                                                                            about CFS capabilities in this area

4.2 Project Priorities

Dimension  Driver                          Constraint                          Degree of Freedom
---------  ------------------------------  ----------------------------------  ---------------------------------------------
Schedule   Prove modularity through                                            Little tolerance for slips
           incremental improvements
Features   Original incentive to redesign  All features scheduled for release  Little
                                           1.0 must be fully operational
Quality                                    Only extreme quality & usability
                                           will lead to adoption.
Staff      Payoff will offset any amount   Sufficient to get into operation    Flexible
           of resources applied            very fast
Cost                                                                           Within normal operating procedures, resources
                                                                               will be made available as needed


4.3 Operating Environment

  1. On flexibly defined subsets of nodes in the Boulder cluster and the China cluster, for automatic and manual testing
  2. On VMware simulated environments, for developer use and for use in training sessions
  3. At various partner and customer sites
  4. For updating disk-installed server and client systems at customer sites
  5. For updating the forthcoming flash-installed, appliance-style systems